bnxt: card intermittently hanging and dropping link

2018-08-15 Thread Daniel Axtens
Hi Michael,

I have some user reports of a Broadcom 57412 card intermittently
hanging and dropping the link.

The problem has been observed on a Dell server with an Ubuntu 4.13
kernel (bnxt_en version 1.7.0) and with an Ubuntu 4.15 kernel (bnxt_en
version 1.8.0). It seems to occur while mounting an XFS volume over
iSCSI, although running blkid on the partition succeeds, and it seems to
be a different volume each time.

I've included an excerpt from the kernel log below.

It seems that other people have reported this with a RHEL7 kernel - I
have no idea what driver version that would be running, or what workload
they were running, but the warnings and messages look the same:
https://www.dell.com/community/PowerEdge-Hardware-General/Critical-network-bnxt-en-module-crashes-on-14G-servers/td-p/6031769/highlight/true
The forum poster reported that disconnecting the card from power for 5
minutes was sufficient to get things working again and I have asked our
user to test that.

Is this a known issue?

Regards,
Daniel

[ 2662.526151] scsi host14: iSCSI Initiator over TCP/IP
[ 2662.538110] scsi host15: iSCSI Initiator over TCP/IP
[ 2662.547350] scsi host16: iSCSI Initiator over TCP/IP
[ 2662.554660] scsi host17: iSCSI Initiator over TCP/IP
[ 2662.813860] scsi 15:0:0:1: Direct-Access PURE FlashArray    
PQ: 0 ANSI: 6
[ 2662.813888] scsi 14:0:0:1: Direct-Access PURE FlashArray    
PQ: 0 ANSI: 6
[ 2662.813972] scsi 16:0:0:1: Direct-Access PURE FlashArray    
PQ: 0 ANSI: 6
[ 2662.814322] sd 15:0:0:1: Attached scsi generic sg1 type 0
[ 2662.814553] sd 14:0:0:1: Attached scsi generic sg2 type 0
[ 2662.814554] scsi 17:0:0:1: Direct-Access PURE FlashArray    
PQ: 0 ANSI: 6
[ 2662.814612] sd 16:0:0:1: Attached scsi generic sg3 type 0
[ 2662.815081] sd 15:0:0:1: [sdb] 10737418240 512-byte logical blocks: (5.50 
TB/5.00 TiB)
[ 2662.815195] sd 15:0:0:1: [sdb] Write Protect is off
[ 2662.815197] sd 15:0:0:1: [sdb] Mode Sense: 43 00 00 08
[ 2662.815229] sd 14:0:0:1: [sdc] 10737418240 512-byte logical blocks: (5.50 
TB/5.00 TiB)
[ 2662.815292] sd 17:0:0:1: Attached scsi generic sg4 type 0
[ 2662.815342] sd 14:0:0:1: [sdc] Write Protect is off
[ 2662.815343] sd 14:0:0:1: [sdc] Mode Sense: 43 00 00 08
[ 2662.815419] sd 15:0:0:1: [sdb] Write cache: disabled, read cache: enabled, 
doesn't support DPO or FUA
[ 2662.815447] sd 16:0:0:1: [sdd] 10737418240 512-byte logical blocks: (5.50 
TB/5.00 TiB)
[ 2662.815544] sd 14:0:0:1: [sdc] Write cache: disabled, read cache: enabled, 
doesn't support DPO or FUA
[ 2662.815614] sd 16:0:0:1: [sdd] Write Protect is off
[ 2662.815615] sd 16:0:0:1: [sdd] Mode Sense: 43 00 00 08
[ 2662.815882] sd 16:0:0:1: [sdd] Write cache: disabled, read cache: enabled, 
doesn't support DPO or FUA
[ 2662.816188] sd 17:0:0:1: [sde] 10737418240 512-byte logical blocks: (5.50 
TB/5.00 TiB)
[ 2662.816298] sd 17:0:0:1: [sde] Write Protect is off
[ 2662.816300] sd 17:0:0:1: [sde] Mode Sense: 43 00 00 08
[ 2662.816502] sd 17:0:0:1: [sde] Write cache: disabled, read cache: enabled, 
doesn't support DPO or FUA
[ 2662.820080] sd 15:0:0:1: [sdb] Attached SCSI disk
[ 2662.820594] sd 14:0:0:1: [sdc] Attached SCSI disk
[ 2662.820995] sd 17:0:0:1: [sde] Attached SCSI disk
[ 2662.821176] sd 16:0:0:1: [sdd] Attached SCSI disk
[ 2662.913642] device-mapper: multipath round-robin: version 1.2.0 loaded
[ 2663.954001] XFS (dm-2): Mounting V5 Filesystem
[ 2673.186083]  connection3:0: ping timeout of 5 secs expired, recv timeout 5, 
last rx 4295558209, last ping 4295559460, now 4295560768
[ 2673.186135]  connection3:0: detected conn error (1022)
[ 2673.186137]  connection2:0: ping timeout of 5 secs expired, recv timeout 5, 
last rx 4295558209, last ping 4295559460, now 4295560768
[ 2673.186168]  connection2:0: detected conn error (1022)
[ 2673.186170]  connection1:0: ping timeout of 5 secs expired, recv timeout 5, 
last rx 4295558209, last ping 4295559460, now 4295560768
[ 2673.186211]  connection1:0: detected conn error (1022)
[ 2674.209870]  connection4:0: ping timeout of 5 secs expired, recv timeout 5, 
last rx 4295558463, last ping 4295559720, now 4295561024
[ 2674.209924]  connection4:0: detected conn error (1022)
[ 2678.560630]  session1: session recovery timed out after 5 secs
[ 2678.560641]  session2: session recovery timed out after 5 secs
[ 2678.560647]  session3: session recovery timed out after 5 secs
[ 2678.951453] device-mapper: multipath: Failing path 8:32.
[ 2678.951509] device-mapper: multipath: Failing path 8:48.
[ 2678.951548] device-mapper: multipath: Failing path 8:16.
[ 2679.584302]  session4: session recovery timed out after 5 secs
[ 2679.584313] sd 17:0:0:1: rejecting I/O to offline device
[ 2679.584356] sd 17:0:0:1: [sde] killing request
[ 2679.584362] sd 17:0:0:1: rejecting I/O to offline device
[ 2679.584392] sd 17:0:0:1: [sde] killing request
[ 2679.584401] sd 17:0:0:1: [sde] FAILED Result: hostbyte=DID_NO_CONNECT 

[PATCH stable 4.4 5/9] tcp: free batches of packets in tcp_prune_ofo_queue()

2018-08-15 Thread Mao Wenan
From: Eric Dumazet 

[ Upstream commit 72cd43ba64fc172a443410ce01645895850844c8 ]

Juha-Matti Tilli reported that malicious peers could inject tiny
packets in out_of_order_queue, forcing very expensive calls
to tcp_collapse_ofo_queue() and tcp_prune_ofo_queue() for
every incoming packet. out_of_order_queue rb-tree can contain
thousands of nodes, iterating over all of them is not nice.

Before linux-4.9, we would have pruned all packets in ofo_queue
in one go, every XXXX packets. XXXX depends on sk_rcvbuf and skbs
truesize, but is about 7000 packets with tcp_rmem[2] default of 6 MB.

Since we plan to increase tcp_rmem[2] in the future to cope with
modern BDP, we can not revert to the old behavior without great pain.

Strategy taken in this patch is to purge ~12.5 % of the queue capacity.
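In practice the purge loop works against a byte budget of sk->sk_rcvbuf >> 3
and only reclaims and re-checks memory once per batch. A simplified sketch
of that logic, condensed from the hunk below (tp, sk and the rbtree helpers
come from the surrounding function; this is not the literal stable-4.4 code):

	int goal = sk->sk_rcvbuf >> 3;	/* ~12.5 % of the receive buffer */
	struct rb_node *node = &tp->ooo_last_skb->rbnode, *prev;

	do {
		prev = rb_prev(node);
		rb_erase(node, &tp->out_of_order_queue);
		goal -= rb_to_skb(node)->truesize;
		__kfree_skb(rb_to_skb(node));
		if (!prev || goal <= 0) {
			/* reclaim and re-check only once per ~12.5 % batch */
			sk_mem_reclaim(sk);
			if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf &&
			    !tcp_under_memory_pressure(sk))
				break;
			goal = sk->sk_rcvbuf >> 3;
		}
		node = prev;
	} while (node);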

Fixes: 36a6503fedda ("tcp: refine tcp_prune_ofo_queue() to not drop all 
packets")
Signed-off-by: Eric Dumazet 
Reported-by: Juha-Matti Tilli 
Acked-by: Yuchung Cheng 
Acked-by: Soheil Hassas Yeganeh 
Signed-off-by: David S. Miller 
Signed-off-by: Mao Wenan 
---
 net/ipv4/tcp_input.c | 16 ++--
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 12edc4f..32225dc 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4893,22 +4893,26 @@ static bool tcp_prune_ofo_queue(struct sock *sk)
 {
struct tcp_sock *tp = tcp_sk(sk);
struct rb_node *node, *prev;
+   int goal;
 
	if (RB_EMPTY_ROOT(&tp->out_of_order_queue))
		return false;
 
	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_OFOPRUNED);
-
+	goal = sk->sk_rcvbuf >> 3;
	node = &tp->ooo_last_skb->rbnode;
	do {
		prev = rb_prev(node);
		rb_erase(node, &tp->out_of_order_queue);
+		goal -= rb_to_skb(node)->truesize;
		__kfree_skb(rb_to_skb(node));
-		sk_mem_reclaim(sk);
-		if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf &&
-		    !tcp_under_memory_pressure(sk))
-			break;
-
+		if (!prev || goal <= 0) {
+			sk_mem_reclaim(sk);
+			if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf &&
+			    !tcp_under_memory_pressure(sk))
+				break;
+			goal = sk->sk_rcvbuf >> 3;
+		}
		node = prev;
	} while (node);
	tp->ooo_last_skb = rb_entry(prev, struct sk_buff, rbnode);
-- 
1.8.3.1



[PATCH stable 4.4 1/9] Revert "tcp: detect malicious patterns in tcp_collapse_ofo_queue()"

2018-08-15 Thread Mao Wenan
This reverts commit dc6ae4dffd656811dee7151b19545e4cd839d378.

We need to change the simple queue to an RB tree to fully fix
CVE-2018-5390. Revert this patch first because it conflicts heavily with
the earlier patch 9f5afeae ("tcp: use an RB tree for ooo receive queue")
that we want to apply; after that we will reapply the patch series from
Eric.

Signed-off-by: Mao Wenan 
---
 net/ipv4/tcp_input.c | 16 ++--
 1 file changed, 2 insertions(+), 14 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 4a261e0..995b2bc 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4791,7 +4791,6 @@ restart:
 static void tcp_collapse_ofo_queue(struct sock *sk)
 {
struct tcp_sock *tp = tcp_sk(sk);
-   u32 range_truesize, sum_tiny = 0;
	struct sk_buff *skb = skb_peek(&tp->out_of_order_queue);
struct sk_buff *head;
u32 start, end;
@@ -4801,7 +4800,6 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
 
start = TCP_SKB_CB(skb)->seq;
end = TCP_SKB_CB(skb)->end_seq;
-   range_truesize = skb->truesize;
head = skb;
 
for (;;) {
@@ -4816,24 +4814,14 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
if (!skb ||
after(TCP_SKB_CB(skb)->seq, end) ||
before(TCP_SKB_CB(skb)->end_seq, start)) {
-			/* Do not attempt collapsing tiny skbs */
-			if (range_truesize != head->truesize ||
-			    end - start >= SKB_WITH_OVERHEAD(SK_MEM_QUANTUM)) {
-				tcp_collapse(sk, &tp->out_of_order_queue,
-					     head, skb, start, end);
-			} else {
-				sum_tiny += range_truesize;
-				if (sum_tiny > sk->sk_rcvbuf >> 3)
-					return;
-			}
-
+			tcp_collapse(sk, &tp->out_of_order_queue,
+				     head, skb, start, end);
head = skb;
if (!skb)
break;
/* Start new segment */
start = TCP_SKB_CB(skb)->seq;
end = TCP_SKB_CB(skb)->end_seq;
-   range_truesize = skb->truesize;
} else {
if (before(TCP_SKB_CB(skb)->seq, start))
start = TCP_SKB_CB(skb)->seq;
-- 
1.8.3.1



[PATCH stable 4.4 2/9] Revert "tcp: avoid collapses in tcp_prune_queue() if possible"

2018-08-15 Thread Mao Wenan
This reverts commit 5fbec4801264cb3279ef6ac9c70bcbe2aaef89d5.

We need to change the simple queue to an RB tree to fully fix
CVE-2018-5390. Revert this patch first because it conflicts heavily with
the earlier patch 9f5afeae ("tcp: use an RB tree for ooo receive queue")
that we want to apply; after that we will reapply the patch series from
Eric.

Signed-off-by: Mao Wenan 
---
 net/ipv4/tcp_input.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 995b2bc..df2f342 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4877,9 +4877,6 @@ static int tcp_prune_queue(struct sock *sk)
else if (tcp_under_memory_pressure(sk))
tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U * tp->advmss);
 
-	if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf)
-		return 0;
-
	tcp_collapse_ofo_queue(sk);
	if (!skb_queue_empty(&sk->sk_receive_queue))
		tcp_collapse(sk, &sk->sk_receive_queue,
-- 
1.8.3.1



[PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)

2018-08-15 Thread Mao Wenan
There are five patches that fix CVE-2018-5390 in the latest mainline
branch, but only two of them exist in stable 4.4 and 3.18:
dc6ae4d tcp: detect malicious patterns in tcp_collapse_ofo_queue()
5fbec48 tcp: avoid collapses in tcp_prune_queue() if possible
I have tested a stable 4.4 kernel with them and found the CPU usage was
still very high, so I think these two patches alone cannot fix
CVE-2018-5390.
Test results:
with the fix patches:    78.2%  ksoftirqd
without the fix patches: 90%    ksoftirqd

I then tried to imitate 72cd43ba ("tcp: free batches of packets in
tcp_prune_ofo_queue()") and drop at least 12.5 % of sk_rcvbuf to fend off
malicious attacks while keeping the simple queue instead of an RB tree.
The result was not very good.

After analysing the stable 4.4 code and debugging the system, it turned
out that searching the ofo queue (TCP ofo uses a simple queue there)
costs most of the cycles.

So I backported "tcp: use an RB tree for ooo receive queue" to replace
the simple queue with an RB tree, then backported Eric Dumazet's five fix
patches from mainline. The good news is that ksoftirqd drops to about
20%, which now matches mainline.

Stable 4.4 has already backported two patches:
f4a3313d (tcp: avoid collapses in tcp_prune_queue() if possible)
3d4bf93a (tcp: detect malicious patterns in tcp_collapse_ofo_queue())
To switch the simple queue to an RB tree and resolve this properly, we
should first apply the earlier patch 9f5afeae ("tcp: use an RB tree for
ooo receive queue"), but 9f5afeae conflicts heavily with 3d4bf93a and
f4a3313d, which are part of Eric's mainline series fixing CVE-2018-5390.
So I first revert those patches in stable 4.4, then apply 9f5afeae, and
finally reapply the five patches from Eric.

Eric Dumazet (6):
  tcp: increment sk_drops for dropped rx packets
  tcp: free batches of packets in tcp_prune_ofo_queue()
  tcp: avoid collapses in tcp_prune_queue() if possible
  tcp: detect malicious patterns in tcp_collapse_ofo_queue()
  tcp: call tcp_drop() from tcp_data_queue_ofo()
  tcp: add tcp_ooo_try_coalesce() helper

Mao Wenan (2):
  Revert "tcp: detect malicious patterns in tcp_collapse_ofo_queue()"
  Revert "tcp: avoid collapses in tcp_prune_queue() if possible"

Yaogong Wang (1):
  tcp: use an RB tree for ooo receive queue

 include/linux/skbuff.h   |   8 +
 include/linux/tcp.h  |   7 +-
 include/net/sock.h   |   7 +
 include/net/tcp.h|   2 +-
 net/core/skbuff.c|  19 +++
 net/ipv4/tcp.c   |   4 +-
 net/ipv4/tcp_input.c | 412 +--
 net/ipv4/tcp_ipv4.c  |   3 +-
 net/ipv4/tcp_minisocks.c |   1 -
 net/ipv6/tcp_ipv6.c  |   1 +
 10 files changed, 294 insertions(+), 170 deletions(-)

-- 
1.8.3.1



[PATCH stable 4.4 9/9] tcp: add tcp_ooo_try_coalesce() helper

2018-08-15 Thread Mao Wenan
From: Eric Dumazet 

[ Upstream commit 58152ecbbcc6a0ce7fddd5bf5f6ee535834ece0c ]

In case skb in out_of_order_queue is the result of
multiple skbs coalescing, we would like to get a proper gso_segs
counter tracking, so that future tcp_drop() can report an accurate
number.

I chose to not implement this tracking for skbs in receive queue,
since they are not dropped, unless socket is disconnected.

Signed-off-by: Eric Dumazet 
Acked-by: Soheil Hassas Yeganeh 
Acked-by: Yuchung Cheng 
Signed-off-by: David S. Miller 
Signed-off-by: Mao Wenan 
---
 net/ipv4/tcp_input.c | 23 +--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 96a1e0d..fdb5509 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4296,6 +4296,23 @@ static bool tcp_try_coalesce(struct sock *sk,
return true;
 }
 
+static bool tcp_ooo_try_coalesce(struct sock *sk,
+struct sk_buff *to,
+struct sk_buff *from,
+bool *fragstolen)
+{
+   bool res = tcp_try_coalesce(sk, to, from, fragstolen);
+
+   /* In case tcp_drop() is called later, update to->gso_segs */
+   if (res) {
+   u32 gso_segs = max_t(u16, 1, skb_shinfo(to)->gso_segs) +
+  max_t(u16, 1, skb_shinfo(from)->gso_segs);
+
+		skb_shinfo(to)->gso_segs = min_t(u32, gso_segs, 0xFFFF);
+   }
+   return res;
+}
+
 static void tcp_drop(struct sock *sk, struct sk_buff *skb)
 {
sk_drops_add(sk, skb);
@@ -4422,7 +4439,8 @@ static void tcp_data_queue_ofo(struct sock *sk, struct 
sk_buff *skb)
/* In the typical case, we are adding an skb to the end of the list.
 * Use of ooo_last_skb avoids the O(Log(N)) rbtree lookup.
 */
-	if (tcp_try_coalesce(sk, tp->ooo_last_skb, skb, &fragstolen)) {
+	if (tcp_ooo_try_coalesce(sk, tp->ooo_last_skb,
+				 skb, &fragstolen)) {
 coalesce_done:
tcp_grow_window(sk, skb);
kfree_skb_partial(skb, fragstolen);
@@ -4467,7 +4485,8 @@ coalesce_done:
tcp_drop(sk, skb1);
goto add_sack;
}
-		} else if (tcp_try_coalesce(sk, skb1, skb, &fragstolen)) {
+		} else if (tcp_ooo_try_coalesce(sk, skb1,
+						skb, &fragstolen)) {
goto coalesce_done;
}
p = >rb_right;
-- 
1.8.3.1



[PATCH stable 4.4 3/9] tcp: increment sk_drops for dropped rx packets

2018-08-15 Thread Mao Wenan
From: Eric Dumazet 

[ Upstream commit 532182cd610782db8c18230c2747626562032205 ]

Now ss can report sk_drops, we can instruct TCP to increment
this per socket counter when it drops an incoming frame, to refine
monitoring and debugging.

Following patch takes care of listeners drops.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
Signed-off-by: Mao Wenan 
---
 include/net/sock.h   |  7 +++
 net/ipv4/tcp_input.c | 33 -
 net/ipv4/tcp_ipv4.c  |  1 +
 net/ipv6/tcp_ipv6.c  |  1 +
 4 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 3d5ff74..5770757 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2139,6 +2139,13 @@ sock_skb_set_dropcount(const struct sock *sk, struct 
sk_buff *skb)
	SOCK_SKB_CB(skb)->dropcount = atomic_read(&sk->sk_drops);
 }
 
+static inline void sk_drops_add(struct sock *sk, const struct sk_buff *skb)
+{
+   int segs = max_t(u16, 1, skb_shinfo(skb)->gso_segs);
+
+	atomic_add(segs, &sk->sk_drops);
+}
+
 void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk,
   struct sk_buff *skb);
 void __sock_recv_wifi_status(struct msghdr *msg, struct sock *sk,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index df2f342..5fb4e80 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4296,6 +4296,12 @@ static bool tcp_try_coalesce(struct sock *sk,
return true;
 }
 
+static void tcp_drop(struct sock *sk, struct sk_buff *skb)
+{
+   sk_drops_add(sk, skb);
+   __kfree_skb(skb);
+}
+
 /* This one checks to see if we can put data from the
  * out_of_order queue into the receive_queue.
  */
@@ -4320,7 +4326,7 @@ static void tcp_ofo_queue(struct sock *sk)
		__skb_unlink(skb, &tp->out_of_order_queue);
if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt)) {
SOCK_DEBUG(sk, "ofo packet was already received\n");
-   __kfree_skb(skb);
+   tcp_drop(sk, skb);
continue;
}
SOCK_DEBUG(sk, "ofo requeuing : rcv_next %X seq %X - %X\n",
@@ -4372,7 +4378,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct 
sk_buff *skb)
 
if (unlikely(tcp_try_rmem_schedule(sk, skb, skb->truesize))) {
NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFODROP);
-   __kfree_skb(skb);
+   tcp_drop(sk, skb);
return;
}
 
@@ -4436,7 +4442,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct 
sk_buff *skb)
if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
/* All the bits are present. Drop. */
NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOMERGE);
-   __kfree_skb(skb);
+   tcp_drop(sk, skb);
skb = NULL;
tcp_dsack_set(sk, seq, end_seq);
goto add_sack;
@@ -4475,7 +4481,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct 
sk_buff *skb)
tcp_dsack_extend(sk, TCP_SKB_CB(skb1)->seq,
 TCP_SKB_CB(skb1)->end_seq);
NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOMERGE);
-   __kfree_skb(skb1);
+   tcp_drop(sk, skb1);
}
 
 add_sack:
@@ -4558,12 +4564,13 @@ err:
 static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 {
struct tcp_sock *tp = tcp_sk(sk);
-   int eaten = -1;
bool fragstolen = false;
+   int eaten = -1;
 
-   if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq)
-   goto drop;
-
+   if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq) {
+   __kfree_skb(skb);
+   return;
+   }
skb_dst_drop(skb);
__skb_pull(skb, tcp_hdr(skb)->doff * 4);
 
@@ -4645,7 +4652,7 @@ out_of_window:
tcp_enter_quickack_mode(sk, TCP_MAX_QUICKACKS);
inet_csk_schedule_ack(sk);
 drop:
-   __kfree_skb(skb);
+   tcp_drop(sk, skb);
return;
}
 
@@ -5220,7 +5227,7 @@ syn_challenge:
return true;
 
 discard:
-   __kfree_skb(skb);
+   tcp_drop(sk, skb);
return false;
 }
 
@@ -5438,7 +5445,7 @@ csum_error:
TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_INERRS);
 
 discard:
-   __kfree_skb(skb);
+   tcp_drop(sk, skb);
 }
 EXPORT_SYMBOL(tcp_rcv_established);
 
@@ -5668,7 +5675,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, 
struct sk_buff *skb,
  TCP_DELACK_MAX, TCP_RTO_MAX);
 
 discard:
-   __kfree_skb(skb);
+   tcp_drop(sk, skb);
return 0;
} else {
tcp_send_ack(sk);
@@ -6025,7 +6032,7 @@ int tcp_rcv_state_process(struct sock *sk, 

[PATCH stable 4.4 6/9] tcp: avoid collapses in tcp_prune_queue() if possible

2018-08-15 Thread Mao Wenan
From: Eric Dumazet 

[ Upstream commit f4a3313d8e2ca9fd8d8f45e40a2903ba782607e7 ]

Right after a TCP flow is created, receiving tiny out of order
packets always hits the condition:

	if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
tcp_clamp_window(sk);

tcp_clamp_window() increases sk_rcvbuf to match sk_rmem_alloc
(guarded by tcp_rmem[2])

Calling tcp_collapse_ofo_queue() in this case is not useful,
and offers a O(N^2) surface attack to malicious peers.

Better not attempt anything before full queue capacity is reached,
forcing attacker to spend lots of resource and allow us to more
easily detect the abuse.

Signed-off-by: Eric Dumazet 
Acked-by: Soheil Hassas Yeganeh 
Acked-by: Yuchung Cheng 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman 
Signed-off-by: Mao Wenan 
---
 net/ipv4/tcp_input.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 32225dc..77130ae 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4948,6 +4948,9 @@ static int tcp_prune_queue(struct sock *sk)
else if (tcp_under_memory_pressure(sk))
tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U * tp->advmss);
 
+	if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf)
+   return 0;
+
tcp_collapse_ofo_queue(sk);
	if (!skb_queue_empty(&sk->sk_receive_queue))
		tcp_collapse(sk, &sk->sk_receive_queue, NULL,
-- 
1.8.3.1



[PATCH stable 4.4 4/9] tcp: use an RB tree for ooo receive queue

2018-08-15 Thread Mao Wenan
From: Yaogong Wang 

[ Upstream commit 9f5afeae51526b3ad7b7cb21ee8b145ce6ea7a7a ]

Over the years, TCP BDP has increased by several orders of magnitude,
and some people are considering to reach the 2 Gbytes limit.

Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000
MSS.
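
(A rough sanity check of that figure, assuming a typical 1448-byte MSS,
which the commit message does not state:

    max window ~= 65535 * 2^14 ~= 1.07e9 bytes ~= 1 GB
    1.07e9 / 1448 ~= 740,000 segments)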

In presence of packet losses (or reorders), TCP stores incoming packets
into an out of order queue, and number of skbs sitting there waiting for
the missing packets to be received can be in the 10^5 range.

Most packets are appended to the tail of this queue, and when
packets can finally be transferred to receive queue, we scan the queue
from its head.

However, in presence of heavy losses, we might have to find an arbitrary
point in this queue, involving a linear scan for every incoming packet,
throwing away cpu caches.

This patch converts it to a RB tree, to get bounded latencies.

Yaogong wrote a preliminary patch about 2 years ago.
Eric did the rebase, added ofo_last_skb cache, polishing and tests.

Tested with network dropping between 1 and 10 % packets, with good
success (about 30 % increase of throughput in stress tests)

Next step would be to also use an RB tree for the write queue at sender
side ;)

Signed-off-by: Yaogong Wang 
Signed-off-by: Eric Dumazet 
Cc: Yuchung Cheng 
Cc: Neal Cardwell 
Cc: Ilpo Järvinen 
Acked-By: Ilpo Järvinen 
Signed-off-by: David S. Miller 
Signed-off-by: Mao Wenan 
---
 include/linux/skbuff.h   |   8 ++
 include/linux/tcp.h  |   7 +-
 include/net/tcp.h|   2 +-
 net/core/skbuff.c|  19 +++
 net/ipv4/tcp.c   |   4 +-
 net/ipv4/tcp_input.c | 354 +++
 net/ipv4/tcp_ipv4.c  |   2 +-
 net/ipv4/tcp_minisocks.c |   1 -
 8 files changed, 241 insertions(+), 156 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c28bd8b..a490dd7 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2273,6 +2273,8 @@ static inline void __skb_queue_purge(struct sk_buff_head 
*list)
kfree_skb(skb);
 }
 
+void skb_rbtree_purge(struct rb_root *root);
+
 void *netdev_alloc_frag(unsigned int fragsz);
 
 struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int length,
@@ -2807,6 +2809,12 @@ static inline int pskb_trim_rcsum(struct sk_buff *skb, 
unsigned int len)
return __pskb_trim(skb, len);
 }
 
+#define rb_to_skb(rb) rb_entry_safe(rb, struct sk_buff, rbnode)
+#define skb_rb_first(root) rb_to_skb(rb_first(root))
+#define skb_rb_last(root)  rb_to_skb(rb_last(root))
+#define skb_rb_next(skb)   rb_to_skb(rb_next(&(skb)->rbnode))
+#define skb_rb_prev(skb)   rb_to_skb(rb_prev(&(skb)->rbnode))
+
 #define skb_queue_walk(queue, skb) \
		for (skb = (queue)->next;					\
		     skb != (struct sk_buff *)(queue);				\
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 5b6df1a..747404d 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -279,10 +279,9 @@ struct tcp_sock {
struct sk_buff* lost_skb_hint;
struct sk_buff *retransmit_skb_hint;
 
-   /* OOO segments go in this list. Note that socket lock must be held,
-* as we do not use sk_buff_head lock.
-*/
-   struct sk_buff_head out_of_order_queue;
+   /* OOO segments go in this rbtree. Socket lock must be held. */
+   struct rb_root  out_of_order_queue;
+   struct sk_buff  *ooo_last_skb; /* cache rb_last(out_of_order_queue) */
 
/* SACKs data, these 2 need to be together (see tcp_options_write) */
struct tcp_sack_block duplicate_sack[1]; /* D-SACK block */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index cac4a6ad..8bc259d 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -649,7 +649,7 @@ static inline void tcp_fast_path_check(struct sock *sk)
 {
struct tcp_sock *tp = tcp_sk(sk);
 
-	if (skb_queue_empty(&tp->out_of_order_queue) &&
+	if (RB_EMPTY_ROOT(&tp->out_of_order_queue) &&
	    tp->rcv_wnd &&
	    atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf &&
!tp->urg_data)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 55be076..9703924 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2378,6 +2378,25 @@ void skb_queue_purge(struct sk_buff_head *list)
 EXPORT_SYMBOL(skb_queue_purge);
 
 /**
+ * skb_rbtree_purge - empty a skb rbtree
+ * @root: root of the rbtree to empty
+ *
+ * Delete all buffers on an &sk_buff rbtree. Each buffer is removed from
+ * the list and one reference dropped. This function does not take
+ * any lock. Synchronization should be handled by the caller (e.g., TCP
+ * out-of-order queue is protected by the socket lock).
+ */
+void skb_rbtree_purge(struct rb_root *root)
+{
+   struct sk_buff *skb, *next;
+
+   rbtree_postorder_for_each_entry_safe(skb, next, root, rbnode)
+   kfree_skb(skb);
+
+   *root = RB_ROOT;
+}
+
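
For context, elsewhere in this patch (in hunks not shown in this excerpt)
the TCP code switches its purge call sites from __skb_queue_purge() on
the old sk_buff_head over to the new helper; a call site then looks
roughly like:

	skb_rbtree_purge(&tp->out_of_order_queue);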

[PATCH stable 4.4 8/9] tcp: call tcp_drop() from tcp_data_queue_ofo()

2018-08-15 Thread Mao Wenan
From: Eric Dumazet 

[ Upstream commit 8541b21e781a22dce52a74fef0b9bed00404a1cd ] 

In order to be able to give better diagnostics and detect
malicious traffic, we need to have better sk->sk_drops tracking.

Fixes: 9f5afeae5152 ("tcp: use an RB tree for ooo receive queue")
Signed-off-by: Eric Dumazet 
Acked-by: Soheil Hassas Yeganeh 
Acked-by: Yuchung Cheng 
Signed-off-by: David S. Miller 
Signed-off-by: Mao Wenan 
---
 net/ipv4/tcp_input.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c48924f..96a1e0d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4445,7 +4445,7 @@ coalesce_done:
/* All the bits are present. Drop. */
NET_INC_STATS(sock_net(sk),
  LINUX_MIB_TCPOFOMERGE);
-   __kfree_skb(skb);
+   tcp_drop(sk, skb);
skb = NULL;
tcp_dsack_set(sk, seq, end_seq);
goto add_sack;
@@ -4464,7 +4464,7 @@ coalesce_done:
 TCP_SKB_CB(skb1)->end_seq);
NET_INC_STATS(sock_net(sk),
  LINUX_MIB_TCPOFOMERGE);
-   __kfree_skb(skb1);
+   tcp_drop(sk, skb1);
goto add_sack;
}
		} else if (tcp_try_coalesce(sk, skb1, skb, &fragstolen)) {
-- 
1.8.3.1



[PATCH stable 4.4 7/9] tcp: detect malicious patterns in tcp_collapse_ofo_queue()

2018-08-15 Thread Mao Wenan
From: Eric Dumazet 

[ Upstream commit 3d4bf93ac12003f9b8e1e2de37fe27983deebdcf ]

In case an attacker feeds tiny packets completely out of order,
tcp_collapse_ofo_queue() might scan the whole rb-tree, performing
expensive copies, but not changing socket memory usage at all.

1) Do not attempt to collapse tiny skbs.
2) Add logic to exit early when too many tiny skbs are detected.

We prefer not doing aggressive collapsing (which copies packets)
for pathological flows, and revert to tcp_prune_ofo_queue() which
will be less expensive.

In the future, we might add the possibility of terminating flows
that are proven to be malicious.

Signed-off-by: Eric Dumazet 
Acked-by: Soheil Hassas Yeganeh 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman 
Signed-off-by: Mao Wenan 
---
 net/ipv4/tcp_input.c | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 77130ae..c48924f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4845,6 +4845,7 @@ end:
 static void tcp_collapse_ofo_queue(struct sock *sk)
 {
struct tcp_sock *tp = tcp_sk(sk);
+   u32 range_truesize, sum_tiny = 0;
struct sk_buff *skb, *head;
struct rb_node *p;
u32 start, end;
@@ -4863,6 +4864,7 @@ new_range:
}
start = TCP_SKB_CB(skb)->seq;
end = TCP_SKB_CB(skb)->end_seq;
+   range_truesize = skb->truesize;
 
for (head = skb;;) {
skb = tcp_skb_next(skb, NULL);
@@ -4873,11 +4875,21 @@ new_range:
if (!skb ||
after(TCP_SKB_CB(skb)->seq, end) ||
before(TCP_SKB_CB(skb)->end_seq, start)) {
-			tcp_collapse(sk, NULL, &tp->out_of_order_queue,
-				     head, skb, start, end);
+   /* Do not attempt collapsing tiny skbs */
+   if (range_truesize != head->truesize ||
+   end - start >= SKB_WITH_OVERHEAD(SK_MEM_QUANTUM)) {
+				tcp_collapse(sk, NULL, &tp->out_of_order_queue,
+					     head, skb, start, end);
+   } else {
+   sum_tiny += range_truesize;
+   if (sum_tiny > sk->sk_rcvbuf >> 3)
+   return;
+   }
+
goto new_range;
}
 
+   range_truesize += skb->truesize;
if (unlikely(before(TCP_SKB_CB(skb)->seq, start)))
start = TCP_SKB_CB(skb)->seq;
if (after(TCP_SKB_CB(skb)->end_seq, end))
-- 
1.8.3.1



Re: [PATCH stable 4.4 1/9] Revert "tcp: detect malicious patterns in tcp_collapse_ofo_queue()"

2018-08-15 Thread maowenan



On 2018/8/15 21:18, Greg KH wrote:
> On Wed, Aug 15, 2018 at 09:21:00PM +0800, Mao Wenan wrote:
>> This reverts commit dc6ae4dffd656811dee7151b19545e4cd839d378.
> 
> I need a reason why, and a signed-off-by line :(

Stable 4.4 only backported two patches to fix CVE-2018-5390. I have
tested them and they cannot fully fix it, because this older version
still uses a simple queue, so we need to change the simple queue to an
RB tree to resolve the issue properly. However, 9f5afeae ("tcp: use an
RB tree for ooo receive queue") conflicts heavily with "tcp: detect
malicious patterns in tcp_collapse_ofo_queue()" and "tcp: avoid
collapses in tcp_prune_queue() if possible", which are part of Eric's
mainline patch series fixing CVE-2018-5390. So I need to revert those
patches in stable 4.4 first, then apply 9f5afeae, and reapply the five
patches from Eric.

> 
> thanks,
> 
> greg k-h
> 
> 



Re: [PATCH stable 4.4 0/9] fix SegmentSmack (CVE-2018-5390)

2018-08-15 Thread maowenan



On 2018/8/15 23:41, Greg KH wrote:
> On Wed, Aug 15, 2018 at 03:24:32PM +0200, Greg KH wrote:
>> On Wed, Aug 15, 2018 at 09:20:59PM +0800, Mao Wenan wrote:
>>> There are five patches to fix CVE-2018-5390 in latest mainline 
>>> branch, but only two patches exist in stable 4.4 and 3.18: 
>>> dc6ae4d tcp: detect malicious patterns in tcp_collapse_ofo_queue()
>>> 5fbec48 tcp: avoid collapses in tcp_prune_queue() if possible
>>> but I have tested with these patches, and found the cpu usage was very high.
>>> test results:
>>> with fix patch: 78.2%   ksoftirqd
>>> no fix patch:   90% ksoftirqd
>>>
>>> After analysing the codes of stable 4.4, and debuging the 
>>> system, the search of ofo_queue(tcp ofo using a simple queue) cost more 
>>> cycles.
>>> So I think only two patches can't fix the CVE-2018-5390.
>>> So I try to backport "tcp: use an RB tree for ooo receive queue" using RB 
>>> tree 
>>> instead of simple queue, then backport Eric Dumazet 5 fixed patches in 
>>> mainline,
>>> good news is that ksoftirqd is turn to about 20%, which is the same with 
>>> mainline now.
>>
>> Thanks for doing this work, I had some questions on the individual
>> patches.  Can you address them and resend?
> 
> Also, always cc: the stable@vger list when sending stable patches so
> that others can review and comment on them.

OK, I will resend the patches later after refining them.

> 
> thanks,
> 
> greg k-h
> 
> .
> 



Re: [PATCH net] veth: Free queues on link delete

2018-08-15 Thread kbuild test robot
Hi David,

I love your patch! Yet something to improve:

[auto build test ERROR on net/master]

url:
https://github.com/0day-ci/linux/commits/dsahern-kernel-org/veth-Free-queues-on-link-delete/20180816-073955
config: i386-randconfig-x016-201832 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   drivers//net/veth.c: In function 'veth_dellink':
>> drivers//net/veth.c:476:2: error: implicit declaration of function 
>> 'veth_free_queues'; did you mean 'veth_dev_free'? 
>> [-Werror=implicit-function-declaration]
 veth_free_queues(dev);
 ^~~~
 veth_dev_free
   cc1: some warnings being treated as errors

vim +476 drivers//net/veth.c

   470  
   471  static void veth_dellink(struct net_device *dev, struct list_head *head)
   472  {
   473  struct veth_priv *priv;
   474  struct net_device *peer;
   475  
 > 476  veth_free_queues(dev);
   477  priv = netdev_priv(dev);
   478  peer = rtnl_dereference(priv->peer);
   479  
   480  /* Note : dellink() is called from default_device_exit_batch(),
   481   * before a rcu_synchronize() point. The devices are guaranteed
   482   * not being freed before one RCU grace period.
   483   */
   484  RCU_INIT_POINTER(priv->peer, NULL);
   485  unregister_netdevice_queue(dev, head);
   486  
   487  if (peer) {
   488  priv = netdev_priv(peer);
   489  RCU_INIT_POINTER(priv->peer, NULL);
   490  unregister_netdevice_queue(peer, head);
   491  }
   492  }
   493  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




Re: [PATCH v2 net-next 0/8] net: dsa: microchip: Modify KSZ9477 DSA driver in preparation to add other KSZ switch drivers

2018-08-15 Thread Florian Fainelli
On 12/05/2017 05:46 PM, tristram...@microchip.com wrote:
> From: Tristram Ha 
> 
> This series of patches is to modify the original KSZ9477 DSA driver so
> that other KSZ switch drivers can be added and use the common code.
> 
> There are several steps to accomplish this achievement.  First is to
> rename some function names with a prefix to indicate chip specific
> function.  Second is to move common code into header that can be shared.
> Last is to modify tag_ksz.c so that it can handle many tail tag formats
> used by different KSZ switch drivers.
> 
> ksz_common.c will contain the common code used by all KSZ switch drivers.
> ksz9477.c will contain KSZ9477 code from the original ksz_common.c.
> ksz9477_spi.c is renamed from ksz_spi.c.
> ksz9477_reg.h is renamed from ksz_9477_reg.h.
> ksz_common.h is added to provide common code access to KSZ switch
> drivers.
> ksz_spi.h is added to provide common SPI access functions to KSZ SPI
> drivers.

Is something gating this series from getting included? It's been nearly
8 months now and it has not been included nor resubmitted; any plans to
rebase the patch series and work towards inclusion in net-next when it
opens back up again?

Thank you!

> 
> v2
> - Initialize reg_mutex before use
> - The alu_mutex is only used inside chip specific functions
> 
> v1
> - Each patch in the set is self-contained
> - Use ksz9477 prefix to indicate KSZ9477 specific code
> 
> Tristram Ha (8):
>   Replace license with GPL.
>   Clean up code according to patch check suggestions.
>   Initialize mutex before use.
>   Rename some functions with ksz9477 prefix to separate chip specific
> code from common code.
>   Rename ksz_spi.c to ksz9477_spi.c and update Kconfig in preparation to
> add more KSZ switch drivers.
>   Break KSZ9477 DSA driver into two files in preparation to add more KSZ
> switch drivers.  Add common functions in ksz_common.h so that other
> KSZ switch drivers can access code in ksz_common.c.  Add ksz_spi.h
> for common functions used by KSZ switch SPI drivers.
>   Prepare PHY for proper advertisement and get link status for the port.
>   Rename ksz_9477_reg.h to ksz9477_reg.h for consistency as the product
> name is always KSZ.
> 
>  drivers/net/dsa/microchip/Kconfig  |   12 +-
>  drivers/net/dsa/microchip/Makefile |4 +-
>  drivers/net/dsa/microchip/ksz9477.c| 1331 
> 
>  .../microchip/{ksz_9477_reg.h => ksz9477_reg.h}|   23 +-
>  drivers/net/dsa/microchip/ksz9477_spi.c|  188 +++
>  drivers/net/dsa/microchip/ksz_common.c | 1176 +++--
>  drivers/net/dsa/microchip/ksz_common.h |  229 
>  drivers/net/dsa/microchip/ksz_priv.h   |  256 ++--
>  drivers/net/dsa/microchip/ksz_spi.c|  216 
>  drivers/net/dsa/microchip/ksz_spi.h|   82 ++
>  10 files changed, 2122 insertions(+), 1395 deletions(-)
>  create mode 100644 drivers/net/dsa/microchip/ksz9477.c
>  rename drivers/net/dsa/microchip/{ksz_9477_reg.h => ksz9477_reg.h} (98%)
>  create mode 100644 drivers/net/dsa/microchip/ksz9477_spi.c
>  create mode 100644 drivers/net/dsa/microchip/ksz_common.h
>  delete mode 100644 drivers/net/dsa/microchip/ksz_spi.c
>  create mode 100644 drivers/net/dsa/microchip/ksz_spi.h
> 


-- 
Florian


Re: [PATCH] net: dsa: add support for ksz9897 ethernet switch

2018-08-15 Thread Florian Fainelli
On 08/15/2018 08:51 AM, Lad Prabhakar wrote:
> From: "Lad, Prabhakar" 
> 
> ksz9477 is a superset of the ksz9xx series; the driver just works
> out of the box for the ksz9897 chip with this patch.

net-next is currently closed, but other than that:

Reviewed-by: Florian Fainelli 

> 
> Signed-off-by: Lad, Prabhakar 
> ---
>  Documentation/devicetree/bindings/net/dsa/ksz.txt | 4 +++-
>  drivers/net/dsa/microchip/ksz_common.c| 9 +
>  drivers/net/dsa/microchip/ksz_spi.c   | 1 +
>  3 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/devicetree/bindings/net/dsa/ksz.txt 
> b/Documentation/devicetree/bindings/net/dsa/ksz.txt
> index a700943..ac145b8 100644
> --- a/Documentation/devicetree/bindings/net/dsa/ksz.txt
> +++ b/Documentation/devicetree/bindings/net/dsa/ksz.txt
> @@ -4,7 +4,9 @@ Microchip KSZ Series Ethernet switches
>  Required properties:
>  
>  - compatible: For external switch chips, compatible string must be exactly 
> one
> -  of: "microchip,ksz9477"
> +  of the following:
> +  - "microchip,ksz9477"
> +  - "microchip,ksz9897"
>  
>  See Documentation/devicetree/bindings/net/dsa/dsa.txt for a list of 
> additional
>  required and optional properties.
> diff --git a/drivers/net/dsa/microchip/ksz_common.c 
> b/drivers/net/dsa/microchip/ksz_common.c
> index 7210c49..54e0ca6 100644
> --- a/drivers/net/dsa/microchip/ksz_common.c
> +++ b/drivers/net/dsa/microchip/ksz_common.c
> @@ -1102,6 +1102,15 @@ static const struct ksz_chip_data ksz_switch_chips[] = 
> {
>   .cpu_ports = 0x7F,  /* can be configured as cpu port */
>   .port_cnt = 7,  /* total physical port count */
>   },
> + {
> + .chip_id = 0x00989700,
> + .dev_name = "KSZ9897",
> + .num_vlans = 4096,
> + .num_alus = 4096,
> + .num_statics = 16,
> + .cpu_ports = 0x7F,  /* can be configured as cpu port */
> + .port_cnt = 7,  /* total physical port count */
> + },
>  };
>  
>  static int ksz_switch_init(struct ksz_device *dev)
> diff --git a/drivers/net/dsa/microchip/ksz_spi.c 
> b/drivers/net/dsa/microchip/ksz_spi.c
> index c519469..8c1778b 100644
> --- a/drivers/net/dsa/microchip/ksz_spi.c
> +++ b/drivers/net/dsa/microchip/ksz_spi.c
> @@ -195,6 +195,7 @@ static int ksz_spi_remove(struct spi_device *spi)
>  
>  static const struct of_device_id ksz_dt_ids[] = {
>   { .compatible = "microchip,ksz9477" },
> + { .compatible = "microchip,ksz9897" },
>   {},
>  };
>  MODULE_DEVICE_TABLE(of, ksz_dt_ids);
> 


-- 
Florian


Re: [PATCH bpf] bpf: fix a rcu usage warning in bpf_prog_array_copy_core()

2018-08-15 Thread Daniel Borkmann
On 08/15/2018 02:08 AM, Roman Gushchin wrote:
> On Tue, Aug 14, 2018 at 04:59:45PM -0700, Alexei Starovoitov wrote:
>> On Tue, Aug 14, 2018 at 11:01:12AM -0700, Yonghong Song wrote:
>>> Commit 394e40a29788 ("bpf: extend bpf_prog_array to store pointers
>>> to the cgroup storage") refactored the bpf_prog_array_copy_core()
>>> to accommodate new structure bpf_prog_array_item which contains
>>> bpf_prog array itself.
>>>
>>> In the old code, we had
>>>perf_event_query_prog_array():
>>>  mutex_lock(...)
>>>  bpf_prog_array_copy_call():
>>>prog = rcu_dereference_check(array, 1)->progs
>>>bpf_prog_array_copy_core(prog, ...)
>>>  mutex_unlock(...)
>>>
>>> With the above commit, we had
>>>perf_event_query_prog_array():
>>>  mutex_lock(...)
>>>  bpf_prog_array_copy_call():
>>>bpf_prog_array_copy_core(array, ...):
>>>  item = rcu_dereference(array)->items;
>>>  ...
>>>  mutex_unlock(...)
>>>
>>> The new code will trigger a lockdep rcu checking warning.
>>> The fix is to change rcu_dereference() to rcu_dereference_check()
>>> to prevent such a warning.
>>>
>>> Reported-by: syzbot+6e72317008eef84a2...@syzkaller.appspotmail.com
>>> Fixes: 394e40a29788 ("bpf: extend bpf_prog_array to store pointers to the 
>>> cgroup storage")
>>> Cc: Roman Gushchin 
>>> Signed-off-by: Yonghong Song 

Applied to bpf, thanks Yonghong!


Re: [bpf PATCH] samples/bpf: all XDP samples should unload xdp/bpf prog on SIGTERM

2018-08-15 Thread Song Liu
On Wed, Aug 15, 2018 at 8:47 AM, Y Song  wrote:
> On Wed, Aug 15, 2018 at 7:57 AM, Jesper Dangaard Brouer
>  wrote:
>> It is common XDP practice to unload/deattach the XDP bpf program,
>> when the XDP sample program is Ctrl-C interrupted (SIGINT) or
>> killed (SIGTERM).
>>
>> The samples/bpf programs xdp_redirect_cpu and xdp_rxq_info,
>> forgot to trap signal SIGTERM (which is the default signal used
>> by the kill command).
>>
>> This was discovered by Red Hat QA, which automated scripts depend
>> on killing the XDP sample program after a timeout period.
>>
>> Fixes: fad3917e361b ("samples/bpf: add cpumap sample program 
>> xdp_redirect_cpu")
>> Fixes: 0fca931a6f21 ("samples/bpf: program demonstrating access to 
>> xdp_rxq_info")
>> Reported-by: Jean-Tsung Hsiao 
>> Signed-off-by: Jesper Dangaard Brouer 
>
> Acked-by: Yonghong Song 

Reviewed-by: Song Liu 

>
>> ---
>>  samples/bpf/xdp_redirect_cpu_user.c |3 ++-
>>  samples/bpf/xdp_rxq_info_user.c |3 ++-
>>  2 files changed, 4 insertions(+), 2 deletions(-)


Re: [bpf PATCH] samples/bpf: all XDP samples should unload xdp/bpf prog on SIGTERM

2018-08-15 Thread Daniel Borkmann
On 08/15/2018 05:47 PM, Y Song wrote:
> On Wed, Aug 15, 2018 at 7:57 AM, Jesper Dangaard Brouer
>  wrote:
>> It is common XDP practice to unload/deattach the XDP bpf program,
>> when the XDP sample program is Ctrl-C interrupted (SIGINT) or
>> killed (SIGTERM).
>>
>> The samples/bpf programs xdp_redirect_cpu and xdp_rxq_info,
>> forgot to trap signal SIGTERM (which is the default signal used
>> by the kill command).
>>
>> This was discovered by Red Hat QA, which automated scripts depend
>> on killing the XDP sample program after a timeout period.
>>
>> Fixes: fad3917e361b ("samples/bpf: add cpumap sample program 
>> xdp_redirect_cpu")
>> Fixes: 0fca931a6f21 ("samples/bpf: program demonstrating access to 
>> xdp_rxq_info")
>> Reported-by: Jean-Tsung Hsiao 
>> Signed-off-by: Jesper Dangaard Brouer 
> 
> Acked-by: Yonghong Song 

Applied to bpf, thanks Jesper!


Re: [PATCH bpf-next V3] net/xdp: Fix suspicious RCU usage warning

2018-08-15 Thread Daniel Borkmann
On 08/13/2018 02:22 PM, Daniel Borkmann wrote:
[...]
> I'll get the patch in once it has been pulled.

Applied to bpf, thanks Tariq!


[PATCH iproute2] ipmaddr: use preferred_family when given

2018-08-15 Thread Mahesh Bandewar
From: Mahesh Bandewar 

When creating the socket, AF_INET is used irrespective of the family
given on the command line (with -4, -6, or -0). This change opens the
socket with the preferred family.

Signed-off-by: Mahesh Bandewar 
---
 ip/ipmaddr.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/ip/ipmaddr.c b/ip/ipmaddr.c
index a48499029e17..abf83784d0df 100644
--- a/ip/ipmaddr.c
+++ b/ip/ipmaddr.c
@@ -289,6 +289,7 @@ static int multiaddr_list(int argc, char **argv)
 static int multiaddr_modify(int cmd, int argc, char **argv)
 {
struct ifreq ifr = {};
+   int family;
int fd;
 
if (cmd == RTM_NEWADDR)
@@ -324,7 +325,17 @@ static int multiaddr_modify(int cmd, int argc, char **argv)
exit(-1);
}
 
-   fd = socket(AF_INET, SOCK_DGRAM, 0);
+   switch (preferred_family) {
+   case AF_INET6:
+   case AF_PACKET:
+   case AF_INET:
+   family = preferred_family;
+   break;
+   default:
+   family = AF_INET;
+   }
+
+   fd = socket(family, SOCK_DGRAM, 0);
if (fd < 0) {
perror("Cannot create socket");
exit(1);
-- 
2.18.0.865.gffc8e1a3cd6-goog



Re: virtio_net failover and initramfs (was: Re: [PATCH net-next v11 2/5] netvsc: refactor notifier/event handling code to use the failover framework)

2018-08-15 Thread Siwei Liu
On Wed, Aug 15, 2018 at 12:05 PM, Samudrala, Sridhar
 wrote:
> On 8/14/2018 5:03 PM, Siwei Liu wrote:
>>
>> Are we sure all userspace apps skip and ignore slave interfaces by
>> just looking at "IFLA_MASTER" attribute?
>>
>> When STANDBY is enabled on virtio-net, a failover master interface
>> will appear, which automatically enslaves the virtio device. But it is
>> found out that iSCSI (or any network boot) cannot boot strap over the
>> new failover interface together with a standby virtio (without any VF
>> or PT device in place).
>>
>> Dracut (initramfs) ends up with timeout and dropping into emergency shell:
>>
>> [  228.170425] dracut-initqueue[377]: Warning: dracut-initqueue
>> timeout - starting timeout scripts
>> [  228.171788] dracut-initqueue[377]: Warning: Could not boot.
>>   Starting Dracut Emergency Shell...
>> Generating "/run/initramfs/rdsosreport.txt"
>> Entering emergency mode. Exit the shell to continue.
>> Type "journalctl" to view system logs.
>> You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or
>> /boot
>> after mounting them and attach it to a bug report.
>> dracut:/# ip l sh
>> 1: lo:  mtu 65536 qdisc noqueue state UNKNOWN
>> mode DEFAULT group default qlen 1000
>>  link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>> 2: eth0:  mtu 1500 qdisc noqueue
>> state UP mode DEFAULT group default qlen 1000
>>  link/ether 9a:46:22:ae:33:54 brd ff:ff:ff:ff:ff:ff\
>> 3: eth1:  mtu 1500 qdisc pfifo_fast
>> master eth0 state UP mode DEFAULT group default qlen 1000
>>  link/ether 9a:46:22:ae:33:54 brd ff:ff:ff:ff:ff:ff
>> dracut:/#
>>
>> If changing dracut code to ignore eth1 (with IFLA_MASTER attr),
>> network boot starts to work.
>
>
> Does dracut by default tries to use all the interfaces that are UP?
>
Yes. The specific dracut cmdline in our case is "ip=dhcp
netroot=iscsi:... ", but it's not specific to iSCSI boot. And because
the failover and standby interfaces share the same MAC address, when
dracut runs DHCP on all interfaces that are up it eventually gets the
same route for each interface. Those conflicting route entries kill off
the network connection.

>
>>
>> The reason is that dracut has its own means to differentiate virtual
>> interfaces for network boot: it does not look at IFLA_MASTER and
>> ignores slave interfaces. Instead, users have to provide explicit
>> option e.g. bond=eth0,eth1 in the boot line, then dracut would know
>> the config and ignore the slave interfaces.
>
>
> Isn't it possible to specify the interface that should be used for network
> boot?
As I understand it, one can only specify the interface name for running
DHCP, not select the interface for network boot. We want DHCP to run on
every NIC that is up (excluding the enslaved interfaces), and only one
of them can get a route entry to the network boot server (e.g. the
iSCSI target).

>
>
>>
>> However, with automatic creation of failover interface that assumption
>> is no longer true. Can we change dracut to ignore all slave interface
>> by checking  IFLA_MASTER? I don't think so. It has a large impact to
>> existing configs.
>
>
> What is the issue with checking for IFLA_MASTER? I guess this is used with
> team/bonding setups.
That should be discussed within and determined by the dracut
community. But the current dracut code doesn't check IFLA_MASTER for
team or bonding specifically. I guess this change might have a broader
impact on existing userspace that may already be relying on the current
behaviour.

Thanks,
-Siwei

>
>
>>
>> What's a feasible solution then? Check the driver name for failover as
>> well?
>>
>> Thanks,
>> -Siwei
>>
>>
>>
>> On Tue, May 22, 2018 at 10:38 AM, Jiri Pirko  wrote:
>>>
>>> Tue, May 22, 2018 at 06:52:21PM CEST, m...@redhat.com wrote:

 On Tue, May 22, 2018 at 05:45:01PM +0200, Jiri Pirko wrote:
>
> Tue, May 22, 2018 at 05:32:30PM CEST, m...@redhat.com wrote:
>>
>> On Tue, May 22, 2018 at 05:13:43PM +0200, Jiri Pirko wrote:
>>>
>>> Tue, May 22, 2018 at 03:39:33PM CEST, m...@redhat.com wrote:

 On Tue, May 22, 2018 at 03:26:26PM +0200, Jiri Pirko wrote:
>
> Tue, May 22, 2018 at 03:17:37PM CEST, m...@redhat.com wrote:
>>
>> On Tue, May 22, 2018 at 03:14:22PM +0200, Jiri Pirko wrote:
>>>
>>> Tue, May 22, 2018 at 03:12:40PM CEST, m...@redhat.com wrote:

 On Tue, May 22, 2018 at 11:08:53AM +0200, Jiri Pirko wrote:
>
> Tue, May 22, 2018 at 11:06:37AM CEST, j...@resnulli.us wrote:
>>
>> Tue, May 22, 2018 at 04:06:18AM CEST,
>> sridhar.samudr...@intel.com wrote:
>>>
>>> Use the registration/notification framework supported by the
>>> generic
>>> failover infrastructure.
>>>
>>> Signed-off-by: Sridhar Samudrala
>>> 
>>
>> In previous patchset versions, the common code did

Re: [PATCH bpf] bpf: fix a rcu usage warning in bpf_prog_array_copy_core()

2018-08-15 Thread Roman Gushchin
On Wed, Aug 15, 2018 at 02:30:11PM -0700, Alexei Starovoitov wrote:
> On Tue, Aug 14, 2018 at 05:08:44PM -0700, Roman Gushchin wrote:
> > On Tue, Aug 14, 2018 at 04:59:45PM -0700, Alexei Starovoitov wrote:
> > > On Tue, Aug 14, 2018 at 11:01:12AM -0700, Yonghong Song wrote:
> > > > Commit 394e40a29788 ("bpf: extend bpf_prog_array to store pointers
> > > > to the cgroup storage") refactored the bpf_prog_array_copy_core()
> > > > to accommodate new structure bpf_prog_array_item which contains
> > > > bpf_prog array itself.
> > > > 
> > > > In the old code, we had
> > > >perf_event_query_prog_array():
> > > >  mutex_lock(...)
> > > >  bpf_prog_array_copy_call():
> > > >prog = rcu_dereference_check(array, 1)->progs
> > > >bpf_prog_array_copy_core(prog, ...)
> > > >  mutex_unlock(...)
> > > > 
> > > > With the above commit, we had
> > > >perf_event_query_prog_array():
> > > >  mutex_lock(...)
> > > >  bpf_prog_array_copy_call():
> > > >bpf_prog_array_copy_core(array, ...):
> > > >  item = rcu_dereference(array)->items;
> > > >  ...
> > > >  mutex_unlock(...)
> > > > 
> > > > The new code will trigger a lockdep rcu checking warning.
> > > > The fix is to change rcu_dereference() to rcu_dereference_check()
> > > > to prevent such a warning.
> > > > 
> > > > Reported-by: syzbot+6e72317008eef84a2...@syzkaller.appspotmail.com
> > > > Fixes: 394e40a29788 ("bpf: extend bpf_prog_array to store pointers to 
> > > > the cgroup storage")
> > > > Cc: Roman Gushchin 
> > > > Signed-off-by: Yonghong Song 
> > > 
> > > makes sense to me
> > > Acked-by: Alexei Starovoitov 
> > > 
> > > Roman, would you agree?
> > > 
> > 
> > rcu_dereference_check(<>, 1) always looks a bit strange to me,
> > but if it's the only reasonable way to silence the warning,
> > of course I'm fine with it.
> 
> do you have better suggestion?
> This patch is a fix for the regression introduced in your earlier patch,
> so I think the only fair path forward is either to Ack it or
> to send an alternative patch asap.
> 

As I said, I've nothing against.

Acked-by: Roman Gushchin 

Thanks!


Re: [iproute PATCH 0/3] Fix and test ssfilter

2018-08-15 Thread Stephen Hemminger
On Tue, 14 Aug 2018 14:18:05 +0200
Phil Sutter  wrote:

> This series contains a fix for ssfilter and introduces a testscript to
> verify correct functionality.
> 
> Phil Sutter (3):
>   ss: Review ssfilter
>   testsuite: Prepare for ss tests
>   testsuite: Add a first ss test validating ssfilter
> 
>  misc/ssfilter.y   |  36 ++---
>  testsuite/Makefile|   2 +-
>  testsuite/lib/generic.sh  |  37 ++
>  testsuite/tests/ss/ss1.dump   | Bin 0 -> 720 bytes
>  testsuite/tests/ss/ssfilter.t |  48 ++
>  5 files changed, 84 insertions(+), 39 deletions(-)
>  create mode 100644 testsuite/tests/ss/ss1.dump
>  create mode 100755 testsuite/tests/ss/ssfilter.t
> 

Applied


Re: [iproute PATCH] man: ip-route: Clarify referenced versions are Linux ones

2018-08-15 Thread Stephen Hemminger
On Wed, 15 Aug 2018 11:18:26 +0200
Phil Sutter  wrote:

> Versioning scheme of Linux and iproute2 is similar, therefore the
> referenced kernel versions are likely to confuse readers. Clarify this
> by prefixing each kernel version by 'Linux' prefix.
> 

Sure, makes sense applied.



Re: [PATCH bpf] bpf: fix a rcu usage warning in bpf_prog_array_copy_core()

2018-08-15 Thread Alexei Starovoitov
On Tue, Aug 14, 2018 at 05:08:44PM -0700, Roman Gushchin wrote:
> On Tue, Aug 14, 2018 at 04:59:45PM -0700, Alexei Starovoitov wrote:
> > On Tue, Aug 14, 2018 at 11:01:12AM -0700, Yonghong Song wrote:
> > > Commit 394e40a29788 ("bpf: extend bpf_prog_array to store pointers
> > > to the cgroup storage") refactored the bpf_prog_array_copy_core()
> > > to accommodate new structure bpf_prog_array_item which contains
> > > bpf_prog array itself.
> > > 
> > > In the old code, we had
> > >perf_event_query_prog_array():
> > >  mutex_lock(...)
> > >  bpf_prog_array_copy_call():
> > >prog = rcu_dereference_check(array, 1)->progs
> > >bpf_prog_array_copy_core(prog, ...)
> > >  mutex_unlock(...)
> > > 
> > > With the above commit, we had
> > >perf_event_query_prog_array():
> > >  mutex_lock(...)
> > >  bpf_prog_array_copy_call():
> > >bpf_prog_array_copy_core(array, ...):
> > >  item = rcu_dereference(array)->items;
> > >  ...
> > >  mutex_unlock(...)
> > > 
> > > The new code will trigger a lockdep rcu checking warning.
> > > The fix is to change rcu_dereference() to rcu_dereference_check()
> > > to prevent such a warning.
> > > 
> > > Reported-by: syzbot+6e72317008eef84a2...@syzkaller.appspotmail.com
> > > Fixes: 394e40a29788 ("bpf: extend bpf_prog_array to store pointers to the 
> > > cgroup storage")
> > > Cc: Roman Gushchin 
> > > Signed-off-by: Yonghong Song 
> > 
> > makes sense to me
> > Acked-by: Alexei Starovoitov 
> > 
> > Roman, would you agree?
> > 
> 
> rcu_dereference_check(<>, 1) always looks a bit strange to me,
> but if it's the only reasonable way to silence the warning,
> of course I'm fine with it.

do you have better suggestion?
This patch is a fix for the regression introduced in your earlier patch,
so I think the only fair path forward is either to Ack it or
to send an alternative patch asap.



[PATCH 1/2] ip: convert monitor to switch

2018-08-15 Thread Stephen Hemminger
From: Stephen Hemminger 

The decoding of netlink message types is natural for a C
switch statement.

Signed-off-by: Stephen Hemminger 
---
 ip/ipmonitor.c | 63 ++
 1 file changed, 38 insertions(+), 25 deletions(-)

diff --git a/ip/ipmonitor.c b/ip/ipmonitor.c
index 8b017341da6b..5552d98ee9e5 100644
--- a/ip/ipmonitor.c
+++ b/ip/ipmonitor.c
@@ -58,7 +58,9 @@ static int accept_msg(const struct sockaddr_nl *who,
 {
FILE *fp = (FILE *)arg;
 
-   if (n->nlmsg_type == RTM_NEWROUTE || n->nlmsg_type == RTM_DELROUTE) {
+   switch (n->nlmsg_type) {
+   case RTM_NEWROUTE:
+   case RTM_DELROUTE: {
struct rtmsg *r = NLMSG_DATA(n);
int len = n->nlmsg_len - NLMSG_LENGTH(sizeof(*r));
 
@@ -82,24 +84,28 @@ static int accept_msg(const struct sockaddr_nl *who,
}
}
 
-   if (n->nlmsg_type == RTM_NEWLINK || n->nlmsg_type == RTM_DELLINK) {
+   case RTM_NEWLINK:
+   case RTM_DELLINK:
ll_remember_index(who, n, NULL);
print_headers(fp, "[LINK]", ctrl);
print_linkinfo(who, n, arg);
return 0;
-   }
-   if (n->nlmsg_type == RTM_NEWADDR || n->nlmsg_type == RTM_DELADDR) {
+
+   case RTM_NEWADDR:
+   case RTM_DELADDR:
print_headers(fp, "[ADDR]", ctrl);
print_addrinfo(who, n, arg);
return 0;
-   }
-   if (n->nlmsg_type == RTM_NEWADDRLABEL || n->nlmsg_type == 
RTM_DELADDRLABEL) {
+
+   case RTM_NEWADDRLABEL:
+   case RTM_DELADDRLABEL:
print_headers(fp, "[ADDRLABEL]", ctrl);
print_addrlabel(who, n, arg);
return 0;
-   }
-   if (n->nlmsg_type == RTM_NEWNEIGH || n->nlmsg_type == RTM_DELNEIGH ||
-   n->nlmsg_type == RTM_GETNEIGH) {
+
+   case RTM_NEWNEIGH:
+   case RTM_DELNEIGH:
+   case RTM_GETNEIGH:
if (preferred_family) {
struct ndmsg *r = NLMSG_DATA(n);
 
@@ -110,34 +116,41 @@ static int accept_msg(const struct sockaddr_nl *who,
print_headers(fp, "[NEIGH]", ctrl);
print_neigh(who, n, arg);
return 0;
-   }
-   if (n->nlmsg_type == RTM_NEWPREFIX) {
+
+   case RTM_NEWPREFIX:
print_headers(fp, "[PREFIX]", ctrl);
print_prefix(who, n, arg);
return 0;
-   }
-   if (n->nlmsg_type == RTM_NEWRULE || n->nlmsg_type == RTM_DELRULE) {
+
+   case RTM_NEWRULE:
+   case RTM_DELRULE:
print_headers(fp, "[RULE]", ctrl);
print_rule(who, n, arg);
return 0;
-   }
-   if (n->nlmsg_type == RTM_NEWNETCONF) {
+
+   case NLMSG_TSTAMP:
+   print_nlmsg_timestamp(fp, n);
+   return 0;
+
+   case RTM_NEWNETCONF:
print_headers(fp, "[NETCONF]", ctrl);
print_netconf(who, ctrl, n, arg);
return 0;
-   }
-   if (n->nlmsg_type == NLMSG_TSTAMP) {
-   print_nlmsg_timestamp(fp, n);
-   return 0;
-   }
-   if (n->nlmsg_type == RTM_NEWNSID || n->nlmsg_type == RTM_DELNSID) {
+
+   case RTM_DELNSID:
+   case RTM_NEWNSID:
print_headers(fp, "[NSID]", ctrl);
print_nsid(who, n, arg);
return 0;
-   }
-   if (n->nlmsg_type != NLMSG_ERROR && n->nlmsg_type != NLMSG_NOOP &&
-   n->nlmsg_type != NLMSG_DONE) {
-   fprintf(fp, "Unknown message: type=0x%08x(%d) flags=0x%08x(%d)len=0x%08x(%d)\n",
+
+   case NLMSG_ERROR:
+   case NLMSG_NOOP:
+   case NLMSG_DONE:
+   break;  /* ignore */
+
+   default:
+   fprintf(stderr,
+   "Unknown message: type=0x%08x(%d) flags=0x%08x(%d) 
len=0x%08x(%d)\n",
n->nlmsg_type, n->nlmsg_type,
n->nlmsg_flags, n->nlmsg_flags, n->nlmsg_len,
n->nlmsg_len);
-- 
2.18.0



[PATCH 2/2] ipmonitor: decode DELNETCONF message

2018-08-15 Thread Stephen Hemminger
From: Stephen Hemminger 

When a device is deleted, DELNETCONF is sent, but ipmonitor
was unable to decode it.

Signed-off-by: Stephen Hemminger 
---
 ip/ipmonitor.c | 1 +
 ip/ipnetconf.c | 6 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/ip/ipmonitor.c b/ip/ipmonitor.c
index 5552d98ee9e5..a93b62cd6624 100644
--- a/ip/ipmonitor.c
+++ b/ip/ipmonitor.c
@@ -133,6 +133,7 @@ static int accept_msg(const struct sockaddr_nl *who,
return 0;
 
case RTM_NEWNETCONF:
+   case RTM_DELNETCONF:
print_headers(fp, "[NETCONF]", ctrl);
print_netconf(who, ctrl, n, arg);
return 0;
diff --git a/ip/ipnetconf.c b/ip/ipnetconf.c
index 03f98ace9145..afce982ced37 100644
--- a/ip/ipnetconf.c
+++ b/ip/ipnetconf.c
@@ -66,7 +66,8 @@ int print_netconf(const struct sockaddr_nl *who, struct rtnl_ctrl_data *ctrl,
 
if (n->nlmsg_type == NLMSG_ERROR)
return -1;
-   if (n->nlmsg_type != RTM_NEWNETCONF) {
+
+   if (n->nlmsg_type != RTM_NEWNETCONF && n->nlmsg_type != RTM_DELNETCONF) {
fprintf(stderr, "Not RTM_NEWNETCONF: %08x %08x %08x\n",
n->nlmsg_len, n->nlmsg_type, n->nlmsg_flags);
 
@@ -91,6 +92,9 @@ int print_netconf(const struct sockaddr_nl *who, struct rtnl_ctrl_data *ctrl,
return 0;
 
open_json_object(NULL);
+   if (n->nlmsg_type == RTM_DELNETCONF)
+   print_bool(PRINT_ANY, "deleted", "Deleted ", true);
+
print_string(PRINT_ANY, "family",
 "%s ", family_name(ncm->ncm_family));
 
-- 
2.18.0



[PATCH iproute2 0/2] ip monitor fixes

2018-08-15 Thread Stephen Hemminger
While debugging another problem, I noticed that ip monitor does
not decode all the netconf messages.

Stephen Hemminger (2):
  ip: convert monitor to switch
  ipmonitor: decode DELNETCONF message

 ip/ipmonitor.c | 64 ++
 ip/ipnetconf.c |  6 -
 2 files changed, 44 insertions(+), 26 deletions(-)

-- 
2.18.0



Re: 98DX3336 net driver

2018-08-15 Thread Chris Packham
Hi Joacim,

On 16/08/18 01:46, Thomas Petazzoni wrote:
> Hello Joacim,
> 
> On Wed, 15 Aug 2018 11:48:17 +, Joacim Zetterling wrote:
> 
>> Sorry for bothering you! But I have a short question that you may have an 
>> answer to.
>>
>> Just wondering if you know where to find a "marvell,prestera-98dx3336" 
>> packet processor driver?
>>
>> It is stated in the dts files for the xCat family (armada-xp-3233/3336/4251).
>>
>> I searched a lot without any success. Is it delivered by Marvell?
>>
>> We have a PonCat3 platform built on native Linux instead of using Marvell's 
>> CPSS suite (beast).
>>
>> But we have some problems regarding the network.
>>
>> We also have the DB-XC3-24G4XG development board from Marvell.
> 
> I'm adding Chris Packham in Cc, who added the DT bindings for the
> Prestera switch and the DT description for it in the mainline Linux
> kernel.
> 
> I believe there is no driver in the upstream Linux kernel for this
> switch, and you need to use some out of tree Marvell-specific driver
> for this. For sure Chris can give you more details about this.
> 
> Do not hesitate to get back to us if you need to support the Prestera
> switch in the mainline Linux kernel.

Unfortunately there isn't a proper driver available. The CPSS from 
Marvell is needed.

Their reference code has some partially DT-aware support (mvMbusDrv), but 
it wouldn't pick up this binding.

We have code that does, but it is really little more than providing raw 
HW access via a file handle that can be mmapped by a userspace 
application linked against the CPSS. I haven't put any effort into 
upstreaming this because I figured no one would be interested, as it's 
pretty useless without the userspace portion.

If there is enough interest I'd be keen to participate in creating an 
in-kernel switchdev driver for these chips (you're not the first to 
ask). I'd need to be careful w.r.t. the various NDAs that are in place.

Regards,
Chris


Re: [iproute PATCH v2 4/4] lib: Enable colored output only for TTYs

2018-08-15 Thread Stephen Hemminger
On Wed, 15 Aug 2018 18:51:15 +0200
Phil Sutter  wrote:

> On Wed, Aug 15, 2018 at 10:43:25AM -0600, David Ahern wrote:
> > On 8/15/18 10:39 AM, Phil Sutter wrote:  
> > > On Wed, Aug 15, 2018 at 10:24:31AM -0600, David Ahern wrote:  
> > >> On 8/15/18 10:21 AM, Phil Sutter wrote:  
> > >>> Add an additional prerequisite to check_enable_color() to make sure
> > >>> stdout actually points to an open TTY device. Otherwise calls like
> > >>>
> > >>> | ip -color a s >/tmp/foo
> > >>>
> > >>> will print color escape sequences into that file. Allow to override this
> > >>> check by specifying '-color' flag more than once.
> > >>>
> > >>> Signed-off-by: Phil Sutter 
> > >>> ---
> > >>> Changes since v1:
> > >>> - Allow to override isatty() check by specifying '-color' flag more than
> > >>>   once.  
> > >>
> > >> That adds overhead to my workflow where I almost always have to pipe the
> > >> output of ip to a pager.  
> > > 
> > > alias ip='ip -color -color'  
> > 
> > no. Don't impact existing users.  
> 
> That's a possible fix for *your* workflow. If applied to the shell
> handling that workflow, it won't impact existing users.
> 
> > > Another alternative may be to introduce -autocolor flag. Establishing
> > > the same syntax as used by 'ls' is not as trivial due to the simple
> > > commandline parsing used in 'ip'.  
> > 
> > I disagree with ignoring or overriding an argument a user passes in. You
> > are guessing what is the correct output and you are guessing wrong.
> > There is nothing wrong with piping output to a file and the viewing that
> > file through 'less -R'.
> > 
> > If a user does not want the color codes in the file, then that user can
> > drop the -color arg. iproute2 commands should not be guessing.  
> 
> OK, I got it. Should I respin the fixes or will you apply the series
> partially?
> 
> Thanks, Phil

Please follow the color conventions of grep and ls to have a
consistent user experience.
-c  == --color=always

and add never and auto.



Re: virtio_net failover and initramfs (was: Re: [PATCH net-next v11 2/5] netvsc: refactor notifier/event handling code to use the failover framework)

2018-08-15 Thread Samudrala, Sridhar

On 8/14/2018 5:03 PM, Siwei Liu wrote:

Are we sure all userspace apps skip and ignore slave interfaces by
just looking at the "IFLA_MASTER" attribute?

When STANDBY is enabled on virtio-net, a failover master interface
will appear, which automatically enslaves the virtio device. But it
turns out that iSCSI (or any network boot) cannot bootstrap over the
new failover interface together with a standby virtio device (without
any VF or PT device in place).

Dracut (initramfs) ends up with a timeout and drops into the emergency shell:

[  228.170425] dracut-initqueue[377]: Warning: dracut-initqueue
timeout - starting timeout scripts
[  228.171788] dracut-initqueue[377]: Warning: Could not boot.
  Starting Dracut Emergency Shell...
Generating "/run/initramfs/rdsosreport.txt"
Entering emergency mode. Exit the shell to continue.
Type "journalctl" to view system logs.
You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or /boot
after mounting them and attach it to a bug report.
dracut:/# ip l sh
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN
mode DEFAULT group default qlen 1000
 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0:  mtu 1500 qdisc noqueue
state UP mode DEFAULT group default qlen 1000
 link/ether 9a:46:22:ae:33:54 brd ff:ff:ff:ff:ff:ff\
3: eth1:  mtu 1500 qdisc pfifo_fast
master eth0 state UP mode DEFAULT group default qlen 1000
 link/ether 9a:46:22:ae:33:54 brd ff:ff:ff:ff:ff:ff
dracut:/#

If the dracut code is changed to ignore eth1 (which has the IFLA_MASTER
attr), network boot starts to work.


Does dracut by default try to use all the interfaces that are UP?




The reason is that dracut has its own means of differentiating virtual
interfaces for network boot: it does not look at IFLA_MASTER to ignore
slave interfaces. Instead, users have to provide an explicit option,
e.g. bond=eth0,eth1, on the boot line; dracut then knows the config and
ignores the slave interfaces.


Isn't it possible to specify the interface that should be used for network boot?




However, with the automatic creation of the failover interface, that
assumption is no longer true. Can we change dracut to ignore all slave
interfaces by checking IFLA_MASTER? I don't think so. It would have a
large impact on existing configs.
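
For reference, this is roughly what the IFLA_MASTER check under discussion
amounts to; a minimal sketch of spotting an enslaved interface in a single
RTM_NEWLINK message (illustrative only, not dracut's actual code):

#include <linux/rtnetlink.h>
#include <stdbool.h>

/* Return true if this RTM_NEWLINK message carries IFLA_MASTER,
 * i.e. the link is enslaved to some master device. */
static bool link_has_master(struct nlmsghdr *nlh)
{
	struct ifinfomsg *ifi = NLMSG_DATA(nlh);
	struct rtattr *rta = IFLA_RTA(ifi);
	int len = IFLA_PAYLOAD(nlh);

	for (; RTA_OK(rta, len); rta = RTA_NEXT(rta, len))
		if (rta->rta_type == IFLA_MASTER)
			return true;	/* candidate to skip for network boot */
	return false;
}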


What is the issue with checking for IFLA_MASTER? I guess this is used with
team/bonding setups.



What's a feasible solution then? Check the driver name for failover as well?

Thanks,
-Siwei



On Tue, May 22, 2018 at 10:38 AM, Jiri Pirko  wrote:

Tue, May 22, 2018 at 06:52:21PM CEST, m...@redhat.com wrote:

On Tue, May 22, 2018 at 05:45:01PM +0200, Jiri Pirko wrote:

Tue, May 22, 2018 at 05:32:30PM CEST, m...@redhat.com wrote:

On Tue, May 22, 2018 at 05:13:43PM +0200, Jiri Pirko wrote:

Tue, May 22, 2018 at 03:39:33PM CEST, m...@redhat.com wrote:

On Tue, May 22, 2018 at 03:26:26PM +0200, Jiri Pirko wrote:

Tue, May 22, 2018 at 03:17:37PM CEST, m...@redhat.com wrote:

On Tue, May 22, 2018 at 03:14:22PM +0200, Jiri Pirko wrote:

Tue, May 22, 2018 at 03:12:40PM CEST, m...@redhat.com wrote:

On Tue, May 22, 2018 at 11:08:53AM +0200, Jiri Pirko wrote:

Tue, May 22, 2018 at 11:06:37AM CEST, j...@resnulli.us wrote:

Tue, May 22, 2018 at 04:06:18AM CEST, sridhar.samudr...@intel.com wrote:

Use the registration/notification framework supported by the generic
failover infrastructure.

Signed-off-by: Sridhar Samudrala 

In previous patchset versions, the common code did
netdev_rx_handler_register() and netdev_upper_dev_link() etc
(netvsc_vf_join()). Now, this is still done in netvsc. Why?

This should be part of the common "failover" code.


Also note that in the current patchset you use IFF_FAILOVER flag for
master, yet for the slave you use IFF_SLAVE. That is wrong.
IFF_FAILOVER_SLAVE should be used.

Or drop IFF_FAILOVER_SLAVE and set both IFF_FAILOVER and IFF_SLAVE?

No. IFF_SLAVE is for bonding.

What breaks if we reuse it for failover?

This is exposed to userspace. IFF_SLAVE is expected for bonding slaves.
And failover slave is not a bonding slave.

That does not really answer the question.  I'd claim it's sufficiently
like a bond slave for IFF_SLAVE to make sense.

In fact you will find that netvsc already sets IFF_SLAVE, and so

netvsc does the whole failover thing in a wrong way. This patchset is
trying to fix it.

Maybe, but we don't need gratuitous changes either, especially if they
break userspace.

What do you mean by "break"? It was a mistake to reuse IFF_SLAVE in
the first place, let's fix it. If some userspace depends on that flag, it
is broken anyway.



does e.g. the eql driver.

The advantage of using IFF_SLAVE is that userspace knows to skip it.  If

The userspace should know how to skip other types of slaves - team,
bridge, ovs, etc.
The "master link" should be the one to look at.


How should existing userspace know which ones to skip and which one is
the master?  Right now userspace seems to assume whatever does not have
IFF_SLAVE should be looked at. Are you saying that's 

Re: [iproute PATCH v2 4/4] lib: Enable colored output only for TTYs

2018-08-15 Thread Phil Sutter
On Wed, Aug 15, 2018 at 10:57:13AM -0600, David Ahern wrote:
> On 8/15/18 10:51 AM, Phil Sutter wrote:
> > Should I respin the fixes or will you apply the series
> > partially?
> 
> Stephen has released 4.18 but not merged -next to master yet, so I
> applied the first 3 to -next.

OK, thanks!

Cheers, Phil


RE: [PATCH net-next][RFC] net/tls: Add support for async decryption of tls records

2018-08-15 Thread Vakul Garg


> -Original Message-
> From: Dave Watson [mailto:davejwat...@fb.com]
> Sent: Wednesday, August 15, 2018 10:26 PM
> To: Vakul Garg 
> Cc: netdev@vger.kernel.org; bor...@mellanox.com;
> avia...@mellanox.com; da...@davemloft.net
> Subject: Re: [PATCH net-next][RFC] net/tls: Add support for async decryption
> of tls records
> 
> On 08/14/18 07:47 PM, Vakul Garg wrote:
> > Incoming TLS records which are directly decrypted into user space
> > application buffer i.e. records which are decrypted in zero-copy mode
> > are submitted for async decryption. When the decryption cryptoapi
> > returns -EINPROGRESS, the next tls record is parsed and then submitted
> > for decryption. The references to records which have been sent for
> > async decryption are dropped. This happens in a loop for all the
> > records that can be decrypted in zero-copy mode. For records for which
> > decryption is not possible in zero-copy mode, asynchronous decryption
> > is not used and we wait for decryption crypto api to complete.
> >
> > For crypto requests executing in async fashion, the memory for
> > aead_request, sglists and skb etc is freed from the decryption
> > completion handler. The decryption completion handler wakesup the
> > sleeping user context. This happens when the user context is done
> > enqueueing all the crypto requests and is waiting for all the async
> > operations to finish. Since the splice() operation does not use
> > zero-copy decryption, async remains disabled for splice().
> 
> I found it a little hard to understand what you are trying to do based on the
> commit message.  
 
Ok, I will rewrite it. 

> Reading the code, it looks like if the recv() spans multiple
> TLS records, we will start decryption on all of them, and only wait right
> before recv() returns, yes?  This approach sounds great to me.
> 

Yes, that's the idea. I am firing as many decryption jobs as possible till I run
out of user application provided buffer space.

> Three of the selftests are failing for me:
> 
> [ FAIL ] tls.recv_partial
> [ FAIL ] tls.recv_peek
> [ FAIL ] tls.recv_peek_multiple
 
Will look into it.
Thanks for spending time reviewing my patch.
The patch is showing good performance benefits.
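
To sketch the flow for anyone following along (the struct, field and helper
names below are invented for illustration and are not the actual net/tls
ones): each zero-copy record is submitted with an async completion callback
and a pending counter is bumped; recvmsg() then waits once at the end for
the counter to drain.

#include <crypto/aead.h>
#include <linux/atomic.h>
#include <linux/completion.h>
#include <linux/errno.h>

struct demo_ctx {			/* illustrative only */
	atomic_t decrypt_pending;
	struct completion async_done;
};

static void demo_decrypt_done(struct crypto_async_request *req, int err)
{
	struct demo_ctx *ctx = req->data;

	/* free the per-request aead_request, sg lists, skb, ... here */
	if (atomic_dec_and_test(&ctx->decrypt_pending))
		complete(&ctx->async_done);	/* wake the sleeping recvmsg() */
}

static int demo_queue_record(struct demo_ctx *ctx, struct aead_request *req)
{
	int err;

	aead_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
				  demo_decrypt_done, ctx);
	atomic_inc(&ctx->decrypt_pending);
	err = crypto_aead_decrypt(req);
	if (err == -EINPROGRESS)
		return 0;	/* completion handler will run later */
	/* synchronous completion or hard error: drop our reference now */
	if (atomic_dec_and_test(&ctx->decrypt_pending))
		complete(&ctx->async_done);
	return err;
}

/* recvmsg() side: call demo_queue_record() for as many zero-copy records
 * as the user buffer allows, then wait_for_completion(&ctx->async_done)
 * once before returning. */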


Re: [iproute PATCH v2 4/4] lib: Enable colored output only for TTYs

2018-08-15 Thread David Ahern
On 8/15/18 10:51 AM, Phil Sutter wrote:
> Should I respin the fixes or will you apply the series
> partially?

Stephen has released 4.18 but not merged -next to master yet, so I
applied the first 3 to -next.


Re: [PATCH net-next][RFC] net/tls: Add support for async decryption of tls records

2018-08-15 Thread Dave Watson
On 08/14/18 07:47 PM, Vakul Garg wrote:
> Incoming TLS records which are directly decrypted into user space
> application buffer i.e. records which are decrypted in zero-copy mode
> are submitted for async decryption. When the decryption cryptoapi
> returns -EINPROGRESS, the next tls record is parsed and then submitted
> for decryption. The references to records which has been sent for async
> decryption are dropped. This happens in a loop for all the records that
> can be decrypted in zero-copy mode. For records for which decryption is
> not possible in zero-copy mode, asynchronous decryption is not used and
> we wait for decryption crypto api to complete.
> 
> For crypto requests executing in async fashion, the memory for
> aead_request, sglists and skb etc is freed from the decryption
> completion handler. The decryption completion handler wakes up the
> sleeping user context. This happens when the user context is done
> enqueueing all the crypto requests and is waiting for all the async
> operations to finish. Since the splice() operation does not use
> zero-copy decryption, async remains disabled for splice().

I found it a little hard to understand what you are trying to do based
on the commit message.  Reading the code, it looks like if the recv()
spans multiple TLS records, we will start decryption on all of them,
and only wait right before recv() returns, yes?  This approach sounds
great to me.

Three of the selftests are failing for me:

[ FAIL ] tls.recv_partial
[ FAIL ] tls.recv_peek
[ FAIL ] tls.recv_peek_multiple


Re: [iproute PATCH v2 4/4] lib: Enable colored output only for TTYs

2018-08-15 Thread Phil Sutter
On Wed, Aug 15, 2018 at 10:43:25AM -0600, David Ahern wrote:
> On 8/15/18 10:39 AM, Phil Sutter wrote:
> > On Wed, Aug 15, 2018 at 10:24:31AM -0600, David Ahern wrote:
> >> On 8/15/18 10:21 AM, Phil Sutter wrote:
> >>> Add an additional prerequisite to check_enable_color() to make sure
> >>> stdout actually points to an open TTY device. Otherwise calls like
> >>>
> >>> | ip -color a s >/tmp/foo
> >>>
> >>> will print color escape sequences into that file. Allow to override this
> >>> check by specifying '-color' flag more than once.
> >>>
> >>> Signed-off-by: Phil Sutter 
> >>> ---
> >>> Changes since v1:
> >>> - Allow to override isatty() check by specifying '-color' flag more than
> >>>   once.
> >>
> >> That adds overhead to my workflow where I almost always have to pipe the
> >> output of ip to a pager.
> > 
> > alias ip='ip -color -color'
> 
> no. Don't impact existing users.

That's a possible fix for *your* workflow. If applied to the shell
handling that workflow, it won't impact existing users.

> > Another alternative may be to introduce -autocolor flag. Establishing
> > the same syntax as used by 'ls' is not as trivial due to the simple
> > commandline parsing used in 'ip'.
> 
> I disagree with ignoring or overriding an argument a user passes in. You
> are guessing what is the correct output and you are guessing wrong.
> There is nothing wrong with piping output to a file and the viewing that
> file through 'less -R'.
> 
> If a user does not want the color codes in the file, then that user can
> drop the -color arg. iproute2 commands should not be guessing.

OK, I got it. Should I respin the fixes or will you apply the series
partially?

Thanks, Phil


Re: [iproute PATCH v2 4/4] lib: Enable colored output only for TTYs

2018-08-15 Thread David Ahern
On 8/15/18 10:39 AM, Phil Sutter wrote:
> On Wed, Aug 15, 2018 at 10:24:31AM -0600, David Ahern wrote:
>> On 8/15/18 10:21 AM, Phil Sutter wrote:
>>> Add an additional prerequisite to check_enable_color() to make sure
>>> stdout actually points to an open TTY device. Otherwise calls like
>>>
>>> | ip -color a s >/tmp/foo
>>>
>>> will print color escape sequences into that file. Allow to override this
>>> check by specifying '-color' flag more than once.
>>>
>>> Signed-off-by: Phil Sutter 
>>> ---
>>> Changes since v1:
>>> - Allow to override isatty() check by specifying '-color' flag more than
>>>   once.
>>
>> That adds overhead to my workflow where I almost always have to pipe the
>> output of ip to a pager.
> 
> alias ip='ip -color -color'

no. Don't impact existing users.

> 
> Another alternative may be to introduce -autocolor flag. Establishing
> the same syntax as used by 'ls' is not as trivial due to the simple
> commandline parsing used in 'ip'.

I disagree with ignoring or overriding an argument a user passes in. You
are guessing what is the correct output and you are guessing wrong.
There is nothing wrong with piping output to a file and the viewing that
file through 'less -R'.

If a user does not want the color codes in the file, then that user can
drop the -color arg. iproute2 commands should not be guessing.


Re: [iproute PATCH v2 4/4] lib: Enable colored output only for TTYs

2018-08-15 Thread Phil Sutter
On Wed, Aug 15, 2018 at 10:24:31AM -0600, David Ahern wrote:
> On 8/15/18 10:21 AM, Phil Sutter wrote:
> > Add an additional prerequisite to check_enable_color() to make sure
> > stdout actually points to an open TTY device. Otherwise calls like
> > 
> > | ip -color a s >/tmp/foo
> > 
> > will print color escape sequences into that file. Allow to override this
> > check by specifying '-color' flag more than once.
> > 
> > Signed-off-by: Phil Sutter 
> > ---
> > Changes since v1:
> > - Allow to override isatty() check by specifying '-color' flag more than
> >   once.
> 
> That adds overhead to my workflow where I almost always have to pipe the
> output of ip to a pager.

alias ip='ip -color -color'

Another alternative may be to introduce -autocolor flag. Establishing
the same syntax as used by 'ls' is not as trivial due to the simple
commandline parsing used in 'ip'.

Cheers, Phil


Re: [iproute PATCH v2 4/4] lib: Enable colored output only for TTYs

2018-08-15 Thread David Ahern
On 8/15/18 10:21 AM, Phil Sutter wrote:
> Add an additional prerequisite to check_enable_color() to make sure
> stdout actually points to an open TTY device. Otherwise calls like
> 
> | ip -color a s >/tmp/foo
> 
> will print color escape sequences into that file. Allow to override this
> check by specifying '-color' flag more than once.
> 
> Signed-off-by: Phil Sutter 
> ---
> Changes since v1:
> - Allow to override isatty() check by specifying '-color' flag more than
>   once.

That adds overhead to my workflow where I almost always have to pipe the
output of ip to a pager.


[iproute PATCH v2 0/4] A bunch of fixes regarding colored output

2018-08-15 Thread Phil Sutter
This series contains fixes for conditionally colored output in patches 1
and 2. Patch 3 merges the common conditionals from ip, tc and bridge
tools. Patch 4 then adds a further restriction to colored output to
prevent garbled output when redirecting into a file.

Changes since v1:
- Adjusted last patch according to feedback. Details given in changelog
  of that patch.

Phil Sutter (4):
  tc: Fix typo in check for colored output
  bridge: Fix check for colored output
  Merge common code for conditionally colored output
  lib: Enable colored output only for TTYs

 bridge/bridge.c   |  5 ++---
 include/color.h   |  1 +
 ip/ip.c   |  3 +--
 lib/color.c   | 13 +
 man/man8/bridge.8 |  5 -
 man/man8/ip.8 |  5 -
 man/man8/tc.8 |  5 -
 tc/tc.c   |  3 +--
 8 files changed, 30 insertions(+), 10 deletions(-)

-- 
2.18.0



[iproute PATCH v2 4/4] lib: Enable colored output only for TTYs

2018-08-15 Thread Phil Sutter
Add an additional prerequisite to check_enable_color() to make sure
stdout actually points to an open TTY device. Otherwise calls like

| ip -color a s >/tmp/foo

will print color escape sequences into that file. Allow to override this
check by specifying '-color' flag more than once.

Signed-off-by: Phil Sutter 
---
Changes since v1:
- Allow to override isatty() check by specifying '-color' flag more than
  once.
- Document new behaviour in man pages.
---
 lib/color.c   | 6 +-
 man/man8/bridge.8 | 5 -
 man/man8/ip.8 | 5 -
 man/man8/tc.8 | 5 -
 4 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/lib/color.c b/lib/color.c
index edf96e5c6ecd7..3a66d8ccb4b00 100644
--- a/lib/color.c
+++ b/lib/color.c
@@ -3,6 +3,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -79,7 +80,10 @@ void enable_color(void)
 
 int check_enable_color(int color, int json)
 {
-   if (color && !json) {
+   if (json || !color)
+   return 1;
+
+   if (color > 1 || isatty(fileno(stdout))) {
enable_color();
return 0;
}
diff --git a/man/man8/bridge.8 b/man/man8/bridge.8
index e7f7148315e19..b865e8b6cd771 100644
--- a/man/man8/bridge.8
+++ b/man/man8/bridge.8
@@ -171,7 +171,10 @@ return code will be non zero.
 
 .TP
 .BR "\-c" , " -color"
-Use color output.
+Use color output if stdout is a terminal. Specify twice to enable color output
+irrespective of stdout state. This flag is ignored if
+.B \-json
+is also given.
 
 .TP
 .BR "\-j", " \-json"
diff --git a/man/man8/ip.8 b/man/man8/ip.8
index 0087d18b74706..c5484bbc11483 100644
--- a/man/man8/ip.8
+++ b/man/man8/ip.8
@@ -188,7 +188,10 @@ executes specified command over all objects, it depends if command supports this
 
 .TP
 .BR "\-c" , " -color"
-Use color output.
+Use color output if stdout is a terminal. Specify twice to enable color output
+irrespective of stdout state. This flag is ignored if
+.B \-json
+is also given.
 
 .TP
 .BR "\-t" , " \-timestamp"
diff --git a/man/man8/tc.8 b/man/man8/tc.8
index 840880fbdba63..cbe7ac1847bbb 100644
--- a/man/man8/tc.8
+++ b/man/man8/tc.8
@@ -730,7 +730,10 @@ option.
 
 .TP
 .BR "\ -color"
-Use color output.
+Use color output if stdout is a terminal. Specify twice to enable color output
+irrespective of stdout state. This flag is ignored if
+.B \-json
+is also given.
 
 .TP
 .BR "\-j", " \-json"
-- 
2.18.0



[iproute PATCH 3/4] Merge common code for conditionally colored output

2018-08-15 Thread Phil Sutter
Instead of calling enable_color() conditionally with identical check in
three places, introduce check_enable_color() which does it in one place.

Signed-off-by: Phil Sutter 
---
 bridge/bridge.c | 3 +--
 include/color.h | 1 +
 ip/ip.c | 3 +--
 lib/color.c | 9 +
 tc/tc.c | 3 +--
 5 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/bridge/bridge.c b/bridge/bridge.c
index 289a157d37f03..451d684e0bcfd 100644
--- a/bridge/bridge.c
+++ b/bridge/bridge.c
@@ -200,8 +200,7 @@ main(int argc, char **argv)
 
_SL_ = oneline ? "\\" : "\n";
 
-   if (color && !json)
-   enable_color();
+   check_enable_color(color, json);
 
if (batch_file)
return batch(batch_file);
diff --git a/include/color.h b/include/color.h
index c80359d3e2e95..4f2c918db7e43 100644
--- a/include/color.h
+++ b/include/color.h
@@ -13,6 +13,7 @@ enum color_attr {
 };
 
 void enable_color(void);
+int check_enable_color(int color, int json);
 void set_color_palette(void);
 int color_fprintf(FILE *fp, enum color_attr attr, const char *fmt, ...);
 enum color_attr ifa_family_color(__u8 ifa_family);
diff --git a/ip/ip.c b/ip/ip.c
index 71d5170c0cc23..38eac5ec1e17d 100644
--- a/ip/ip.c
+++ b/ip/ip.c
@@ -304,8 +304,7 @@ int main(int argc, char **argv)
 
_SL_ = oneline ? "\\" : "\n";
 
-   if (color && !json)
-   enable_color();
+   check_enable_color(color, json);
 
if (batch_file)
return batch(batch_file);
diff --git a/lib/color.c b/lib/color.c
index da1f516cb2492..edf96e5c6ecd7 100644
--- a/lib/color.c
+++ b/lib/color.c
@@ -77,6 +77,15 @@ void enable_color(void)
set_color_palette();
 }
 
+int check_enable_color(int color, int json)
+{
+   if (color && !json) {
+   enable_color();
+   return 0;
+   }
+   return 1;
+}
+
 void set_color_palette(void)
 {
char *p = getenv("COLORFGBG");
diff --git a/tc/tc.c b/tc/tc.c
index 3bb893756f40e..e775550174473 100644
--- a/tc/tc.c
+++ b/tc/tc.c
@@ -515,8 +515,7 @@ int main(int argc, char **argv)
 
_SL_ = oneline ? "\\" : "\n";
 
-   if (color && !json)
-   enable_color();
+   check_enable_color(color, json);
 
if (batch_file)
return batch(batch_file);
-- 
2.18.0



[iproute PATCH 2/4] bridge: Fix check for colored output

2018-08-15 Thread Phil Sutter
There is no point in calling enable_color() conditionally if it was
already called for each time '-color' flag was parsed. Align the
algorithm with that in ip and tc by actually making use of 'color'
variable.

Fixes: e9625d6aead11 ("Merge branch 'iproute2-master' into iproute2-next")
Signed-off-by: Phil Sutter 
---
 bridge/bridge.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/bridge/bridge.c b/bridge/bridge.c
index 7fcfe1116f6e5..289a157d37f03 100644
--- a/bridge/bridge.c
+++ b/bridge/bridge.c
@@ -174,7 +174,7 @@ main(int argc, char **argv)
if (netns_switch(argv[1]))
exit(-1);
} else if (matches(opt, "-color") == 0) {
-   enable_color();
+   ++color;
} else if (matches(opt, "-compressvlans") == 0) {
++compress_vlans;
} else if (matches(opt, "-force") == 0) {
-- 
2.18.0



[iproute PATCH 1/4] tc: Fix typo in check for colored output

2018-08-15 Thread Phil Sutter
The check used binary instead of boolean AND, which means colored output
was enabled only if the number of specified '-color' flags was odd.

Fixes: 2d165c0811058 ("tc: implement color output")
Signed-off-by: Phil Sutter 
---
 tc/tc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tc/tc.c b/tc/tc.c
index 3bb5910ffac52..3bb893756f40e 100644
--- a/tc/tc.c
+++ b/tc/tc.c
@@ -515,7 +515,7 @@ int main(int argc, char **argv)
 
_SL_ = oneline ? "\\" : "\n";
 
-   if (color & !json)
+   if (color && !json)
enable_color();
 
if (batch_file)
-- 
2.18.0



[PATCH] net: dsa: add support for ksz9897 ethernet switch

2018-08-15 Thread Lad Prabhakar
From: "Lad, Prabhakar" 

The ksz9477 is a superset of the ksz9xx series; the driver just works
out of the box for the ksz9897 chip with this patch.

Signed-off-by: Lad, Prabhakar 
---
 Documentation/devicetree/bindings/net/dsa/ksz.txt | 4 +++-
 drivers/net/dsa/microchip/ksz_common.c| 9 +
 drivers/net/dsa/microchip/ksz_spi.c   | 1 +
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/net/dsa/ksz.txt b/Documentation/devicetree/bindings/net/dsa/ksz.txt
index a700943..ac145b8 100644
--- a/Documentation/devicetree/bindings/net/dsa/ksz.txt
+++ b/Documentation/devicetree/bindings/net/dsa/ksz.txt
@@ -4,7 +4,9 @@ Microchip KSZ Series Ethernet switches
 Required properties:
 
 - compatible: For external switch chips, compatible string must be exactly one
-  of: "microchip,ksz9477"
+  of the following:
+  - "microchip,ksz9477"
+  - "microchip,ksz9897"
 
 See Documentation/devicetree/bindings/net/dsa/dsa.txt for a list of additional
 required and optional properties.
diff --git a/drivers/net/dsa/microchip/ksz_common.c b/drivers/net/dsa/microchip/ksz_common.c
index 7210c49..54e0ca6 100644
--- a/drivers/net/dsa/microchip/ksz_common.c
+++ b/drivers/net/dsa/microchip/ksz_common.c
@@ -1102,6 +1102,15 @@ static const struct ksz_chip_data ksz_switch_chips[] = {
.cpu_ports = 0x7F,  /* can be configured as cpu port */
.port_cnt = 7,  /* total physical port count */
},
+   {
+   .chip_id = 0x00989700,
+   .dev_name = "KSZ9897",
+   .num_vlans = 4096,
+   .num_alus = 4096,
+   .num_statics = 16,
+   .cpu_ports = 0x7F,  /* can be configured as cpu port */
+   .port_cnt = 7,  /* total physical port count */
+   },
 };
 
 static int ksz_switch_init(struct ksz_device *dev)
diff --git a/drivers/net/dsa/microchip/ksz_spi.c b/drivers/net/dsa/microchip/ksz_spi.c
index c519469..8c1778b 100644
--- a/drivers/net/dsa/microchip/ksz_spi.c
+++ b/drivers/net/dsa/microchip/ksz_spi.c
@@ -195,6 +195,7 @@ static int ksz_spi_remove(struct spi_device *spi)
 
 static const struct of_device_id ksz_dt_ids[] = {
{ .compatible = "microchip,ksz9477" },
+   { .compatible = "microchip,ksz9897" },
{},
 };
 MODULE_DEVICE_TABLE(of, ksz_dt_ids);
-- 
2.7.4



Re: [bpf PATCH] samples/bpf: all XDP samples should unload xdp/bpf prog on SIGTERM

2018-08-15 Thread Y Song
On Wed, Aug 15, 2018 at 7:57 AM, Jesper Dangaard Brouer
 wrote:
> It is common XDP practice to unload/detach the XDP bpf program,
> when the XDP sample program is Ctrl-C interrupted (SIGINT) or
> killed (SIGTERM).
>
> The samples/bpf programs xdp_redirect_cpu and xdp_rxq_info,
> forgot to trap signal SIGTERM (which is the default signal used
> by the kill command).
>
> This was discovered by Red Hat QA, whose automated scripts depend
> on killing the XDP sample program after a timeout period.
>
> Fixes: fad3917e361b ("samples/bpf: add cpumap sample program 
> xdp_redirect_cpu")
> Fixes: 0fca931a6f21 ("samples/bpf: program demonstrating access to 
> xdp_rxq_info")
> Reported-by: Jean-Tsung Hsiao 
> Signed-off-by: Jesper Dangaard Brouer 

Acked-by: Yonghong Song 

> ---
>  samples/bpf/xdp_redirect_cpu_user.c |3 ++-
>  samples/bpf/xdp_rxq_info_user.c |3 ++-
>  2 files changed, 4 insertions(+), 2 deletions(-)


Re: [PATCH stable 4.4 0/9] fix SegmentSmack (CVE-2018-5390)

2018-08-15 Thread Greg KH
On Wed, Aug 15, 2018 at 03:24:32PM +0200, Greg KH wrote:
> On Wed, Aug 15, 2018 at 09:20:59PM +0800, Mao Wenan wrote:
> > There are five patches to fix CVE-2018-5390 in latest mainline 
> > branch, but only two patches exist in stable 4.4 and 3.18: 
> > dc6ae4d tcp: detect malicious patterns in tcp_collapse_ofo_queue()
> > 5fbec48 tcp: avoid collapses in tcp_prune_queue() if possible
> > but I have tested with these patches, and found the cpu usage was very high.
> > test results:
> > with fix patch: 78.2%   ksoftirqd
> > no fix patch:   90% ksoftirqd
> > 
> > After analysing the code of stable 4.4 and debugging the system, the
> > search of the ofo_queue (TCP ofo uses a simple queue) costs more cycles.
> > So I think the two patches alone can't fix CVE-2018-5390.
> > So I tried to backport "tcp: use an RB tree for ooo receive queue", using an
> > RB tree instead of a simple queue, and then backported Eric Dumazet's 5 fix
> > patches from mainline; the good news is that ksoftirqd drops to about 20%,
> > which is the same as mainline now.
> 
> Thanks for doing this work, I had some questions on the individual
> patches.  Can you address them and resend?

Also, always cc: the stable@vger list when sending stable patches so
that others can review and comment on them.

thanks,

greg k-h


RE: [iproute PATCH 4/4] lib: Enable colored output only for TTYs

2018-08-15 Thread David Laight
From: Stephen Hemminger
> Sent: 15 August 2018 16:04
...
> > This also disables color sequence when the output is piped to a pager
> > such as less which with the -R argument can handle it just fine.
> >
> > ie., the user needs to remove the color arg when that output is not wanted.
> 
> If you are going to change the color enabling, why not make it compatible
> with what ls does?

Indeed - otherwise it is very hard to debug the colour escape sequences.

David



[PATCH v2] net/mlx5: Delete unneeded function argument

2018-08-15 Thread Yuval Shaia
priv argument is not used by the function, delete it.

Fixes: a89842811ea98 ("net/mlx5e: Merge per priority stats groups")
Signed-off-by: Yuval Shaia 
---
v1 -> v2:
* Remove blank line as pointed by Leon.
---
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
index 1646859974ce..4c4d779dafa8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
@@ -813,7 +813,7 @@ static const struct counter_desc pport_per_prio_traffic_stats_desc[] = {
 
 #define NUM_PPORT_PER_PRIO_TRAFFIC_COUNTERS
ARRAY_SIZE(pport_per_prio_traffic_stats_desc)
 
-static int mlx5e_grp_per_prio_traffic_get_num_stats(struct mlx5e_priv *priv)
+static int mlx5e_grp_per_prio_traffic_get_num_stats(void)
 {
return NUM_PPORT_PER_PRIO_TRAFFIC_COUNTERS * NUM_PPORT_PRIO;
 }
@@ -971,7 +971,7 @@ static int mlx5e_grp_per_prio_pfc_fill_stats(struct mlx5e_priv *priv,
 
 static int mlx5e_grp_per_prio_get_num_stats(struct mlx5e_priv *priv)
 {
-   return mlx5e_grp_per_prio_traffic_get_num_stats(priv) +
+   return mlx5e_grp_per_prio_traffic_get_num_stats() +
mlx5e_grp_per_prio_pfc_get_num_stats(priv);
 }
 
-- 
2.17.1



Re: [iproute PATCH 4/4] lib: Enable colored output only for TTYs

2018-08-15 Thread Stephen Hemminger
On Wed, 15 Aug 2018 08:40:20 -0600
David Ahern  wrote:

> On 8/15/18 3:06 AM, Phil Sutter wrote:
> > Add an additional prerequisite to check_enable_color() to make sure
> > stdout actually points to an open TTY device. Otherwise calls like
> > 
> > | ip -color a s >/tmp/foo
> > 
> > will print color escape sequences into that file.
> > 
> > Signed-off-by: Phil Sutter 
> > ---
> >  lib/color.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/lib/color.c b/lib/color.c
> > index edf96e5c6ecd7..500ba09682697 100644
> > --- a/lib/color.c
> > +++ b/lib/color.c
> > @@ -3,6 +3,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -79,7 +80,7 @@ void enable_color(void)
> >  
> >  int check_enable_color(int color, int json)
> >  {
> > -   if (color && !json) {
> > +   if (color && !json && isatty(fileno(stdout))) {
> > enable_color();
> > return 0;
> > }
> >   
> 
> This also disables color sequence when the output is piped to a pager
> such as less which with the -R argument can handle it just fine.
> 
> ie., the user needs to remove the color arg when that output is not wanted.

If you are going to change the color enabling, why not make it compatible
with what ls does?

From man ls(1) (and  grep)

   --color[=WHEN]
  colorize  the output; WHEN can be 'always' (default if omitted),
  'auto', or 'never'; more info below
...

   Using  color  to distinguish file types is disabled both by default and
   with --color=never.  With --color=auto, ls emits color codes only  when
   standard  output is connected to a terminal.  The LS_COLORS environment
   variable can change the settings.  Use the dircolors command to set it.


Make -c be the same as --color=always to keep compatibility.
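
Roughly, the ls/grep-style semantics being asked for would look like this
(a sketch only, not iproute2 code; the command-line parsing itself is elided):

#include <stdio.h>
#include <string.h>
#include <unistd.h>

enum color_when { COLOR_NEVER, COLOR_AUTO, COLOR_ALWAYS };

/* Map a --color[=WHEN] style argument; a bare flag means "always",
 * matching ls(1) and grep(1). */
static enum color_when parse_color_when(const char *when)
{
	if (!when || !strcmp(when, "always"))
		return COLOR_ALWAYS;
	if (!strcmp(when, "auto"))
		return COLOR_AUTO;
	return COLOR_NEVER;
}

static int should_color(enum color_when when)
{
	switch (when) {
	case COLOR_ALWAYS:
		return 1;
	case COLOR_AUTO:
		return isatty(fileno(stdout));	/* color only real terminals */
	case COLOR_NEVER:
	default:
		return 0;
	}
}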


[bpf PATCH] samples/bpf: all XDP samples should unload xdp/bpf prog on SIGTERM

2018-08-15 Thread Jesper Dangaard Brouer
It is common XDP practice to unload/detach the XDP bpf program,
when the XDP sample program is Ctrl-C interrupted (SIGINT) or
killed (SIGTERM).

The samples/bpf programs xdp_redirect_cpu and xdp_rxq_info,
forgot to trap signal SIGTERM (which is the default signal used
by the kill command).

This was discovered by Red Hat QA, whose automated scripts depend
on killing the XDP sample program after a timeout period.

Fixes: fad3917e361b ("samples/bpf: add cpumap sample program xdp_redirect_cpu")
Fixes: 0fca931a6f21 ("samples/bpf: program demonstrating access to xdp_rxq_info")
Reported-by: Jean-Tsung Hsiao 
Signed-off-by: Jesper Dangaard Brouer 
---
 samples/bpf/xdp_redirect_cpu_user.c |3 ++-
 samples/bpf/xdp_rxq_info_user.c |3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/samples/bpf/xdp_redirect_cpu_user.c b/samples/bpf/xdp_redirect_cpu_user.c
index 9a6c7e0a6dd1..2d23054aaccf 100644
--- a/samples/bpf/xdp_redirect_cpu_user.c
+++ b/samples/bpf/xdp_redirect_cpu_user.c
@@ -679,8 +679,9 @@ int main(int argc, char **argv)
return EXIT_FAIL_OPTION;
}
 
-   /* Remove XDP program when program is interrupted */
+   /* Remove XDP program when program is interrupted or killed */
signal(SIGINT, int_exit);
+   signal(SIGTERM, int_exit);
 
if (bpf_set_link_xdp_fd(ifindex, prog_fd[prog_num], xdp_flags) < 0) {
fprintf(stderr, "link set xdp fd failed\n");
diff --git a/samples/bpf/xdp_rxq_info_user.c b/samples/bpf/xdp_rxq_info_user.c
index 248a7eab9531..ef26f882f92f 100644
--- a/samples/bpf/xdp_rxq_info_user.c
+++ b/samples/bpf/xdp_rxq_info_user.c
@@ -567,8 +567,9 @@ int main(int argc, char **argv)
exit(EXIT_FAIL_BPF);
}
 
-   /* Remove XDP program when program is interrupted */
+   /* Remove XDP program when program is interrupted or killed */
signal(SIGINT, int_exit);
+   signal(SIGTERM, int_exit);
 
if (bpf_set_link_xdp_fd(ifindex, prog_fd, xdp_flags) < 0) {
fprintf(stderr, "link set xdp fd failed\n");



Re: [PATCH v3 net-next] veth: Free queues on link delete

2018-08-15 Thread David Ahern
On 8/15/18 2:07 AM, Toshiaki Makita wrote:
> David Ahern reported memory leak in veth.
> 

...

> veth_rq allocated in veth_newlink() was not freed on dellink.
> 
> We need to free them up after veth_close() so that any packets will not
> reference the queues afterwards. Thus free them in veth_dev_free() in
> the same way as freeing stats structure (vstats).
> 
> Also move queues allocation to veth_dev_init() to be in line with stats
> allocation.
> 
> Fixes: 638264dc90227 ("veth: Support per queue XDP ring")
> Reported-by: David Ahern 
> Signed-off-by: Toshiaki Makita 
> ---
> This is a fix for a bug which exists only in net-next.
> Let me know if I should wait for -next merging into net or reopen of -next.
> 
>  drivers/net/veth.c | 70 
> +-
>  1 file changed, 33 insertions(+), 37 deletions(-)
> 

Reviewed-by: David Ahern 
Tested-by: David Ahern 


Re: [iproute PATCH 4/4] lib: Enable colored output only for TTYs

2018-08-15 Thread David Ahern
On 8/15/18 3:06 AM, Phil Sutter wrote:
> Add an additional prerequisite to check_enable_color() to make sure
> stdout actually points to an open TTY device. Otherwise calls like
> 
> | ip -color a s >/tmp/foo
> 
> will print color escape sequences into that file.
> 
> Signed-off-by: Phil Sutter 
> ---
>  lib/color.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/color.c b/lib/color.c
> index edf96e5c6ecd7..500ba09682697 100644
> --- a/lib/color.c
> +++ b/lib/color.c
> @@ -3,6 +3,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -79,7 +80,7 @@ void enable_color(void)
>  
>  int check_enable_color(int color, int json)
>  {
> - if (color && !json) {
> + if (color && !json && isatty(fileno(stdout))) {
>   enable_color();
>   return 0;
>   }
> 

This also disables color sequence when the output is piped to a pager
such as less which with the -R argument can handle it just fine.

ie., the user needs to remove the color arg when that output is not wanted.


Re: [PATCH] net/mlx5: Delete unneeded function argument

2018-08-15 Thread Leon Romanovsky
On Wed, Aug 15, 2018 at 04:54:33PM +0300, Yuval Shaia wrote:
> priv argument is not used by the function, delete it.
>
> Fixes: a89842811ea98 ("net/mlx5e: Merge per priority stats groups")
>

No extra space here.

Thanks




Re: [PATCH] net: macb: Fix regression breaking non-MDIO fixed-link PHYs

2018-08-15 Thread Lad, Prabhakar
Hi,

On Wed, Aug 15, 2018 at 3:35 AM Andrew Lunn  wrote:
>
> On Tue, Aug 14, 2018 at 05:58:12PM +0200, Uwe Kleine-König wrote:
> > Hello Ahmad,
> >
> >
> > On Tue, Aug 14, 2018 at 04:12:40PM +0200, Ahmad Fatoum wrote:
> > > The referenced commit broke initializing macb on the EVB-KSZ9477 eval 
> > > board.
> > > There, of_mdiobus_register was called even for the fixed-link representing
> > > the SPI-connected switch PHY, with the result that the driver attempts to
> > > enumerate PHYs on a non-existent MDIO bus:
> > >
I ran into a similar problem on v4.14 for davinci_mdio; I had to patch
it with [1]. The cpsw has 2 PHYs: one PHY is connected to a KSZ9031 and
the other to a KSZ9897 Ethernet switch, which is treated as a fixed PHY
with no MDIO lines, because of which mdio_read/write failed.
This didn't happen in v4.9.x; has something in the core changed?

[1]

diff --git a/drivers/net/ethernet/ti/davinci_mdio.c b/drivers/net/ethernet/ti/davinci_mdio.c
index 3e84107..197baa6 100644
--- a/drivers/net/ethernet/ti/davinci_mdio.c
+++ b/drivers/net/ethernet/ti/davinci_mdio.c
@@ -245,6 +245,13 @@ static int davinci_mdio_read(struct mii_bus *bus, int phy_id, int phy_reg)
u32 reg;
int ret;

+   if (phy_id == 2)
+   return 0;
+
if (phy_reg & ~PHY_REG_MASK || phy_id & ~PHY_ID_MASK)
return -EINVAL;

@@ -289,6 +296,13 @@ static int davinci_mdio_write(struct mii_bus *bus, int phy_id,
u32 reg;
int ret;

+   if (phy_id == 2)
+   return 0;
+
if (phy_reg & ~PHY_REG_MASK || phy_id & ~PHY_ID_MASK)
return -EINVAL;
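
For comparison, the usual fixed-link handling on the MAC driver side looks
roughly like the sketch below, using the generic of_mdio helpers; none of
this is the actual cpsw or macb code, and demo_attach_phy() plus the RGMII
mode are made up for illustration.

#include <linux/errno.h>
#include <linux/netdevice.h>
#include <linux/of.h>
#include <linux/of_mdio.h>
#include <linux/phy.h>

static int demo_attach_phy(struct device_node *np, struct net_device *ndev,
			   void (*adjust_link)(struct net_device *))
{
	struct device_node *phy_np;
	struct phy_device *phydev;

	if (of_phy_is_fixed_link(np)) {
		/* Register the emulated PHY; no MDIO bus traffic at all. */
		if (of_phy_register_fixed_link(np))
			return -ENODEV;
		phy_np = of_node_get(np);
	} else {
		phy_np = of_parse_phandle(np, "phy-handle", 0);
		if (!phy_np)
			return -ENODEV;
	}

	phydev = of_phy_connect(ndev, phy_np, adjust_link, 0,
				PHY_INTERFACE_MODE_RGMII);
	of_node_put(phy_np);
	return phydev ? 0 : -ENODEV;
}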


Cheers,
--Prabhakar Lad


[PATCH] net/mlx5: Delete unneeded function argument

2018-08-15 Thread Yuval Shaia
priv argument is not used by the function, delete it.

Fixes: a89842811ea98 ("net/mlx5e: Merge per priority stats groups")

Signed-off-by: Yuval Shaia 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
index 1646859974ce..4c4d779dafa8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
@@ -813,7 +813,7 @@ static const struct counter_desc pport_per_prio_traffic_stats_desc[] = {
 
 #define NUM_PPORT_PER_PRIO_TRAFFIC_COUNTERS
ARRAY_SIZE(pport_per_prio_traffic_stats_desc)
 
-static int mlx5e_grp_per_prio_traffic_get_num_stats(struct mlx5e_priv *priv)
+static int mlx5e_grp_per_prio_traffic_get_num_stats(void)
 {
return NUM_PPORT_PER_PRIO_TRAFFIC_COUNTERS * NUM_PPORT_PRIO;
 }
@@ -971,7 +971,7 @@ static int mlx5e_grp_per_prio_pfc_fill_stats(struct mlx5e_priv *priv,
 
 static int mlx5e_grp_per_prio_get_num_stats(struct mlx5e_priv *priv)
 {
-   return mlx5e_grp_per_prio_traffic_get_num_stats(priv) +
+   return mlx5e_grp_per_prio_traffic_get_num_stats() +
mlx5e_grp_per_prio_pfc_get_num_stats(priv);
 }
 
-- 
2.17.1



net/ipv4/tcp.c:2278 WARN_ON(sock_owned_by_user(sk))

2018-08-15 Thread Adam Mitchell
After moving a busy mysql database to AWS (and changing distro,
kernel, etc), we see the following kernel warning every few days (on
several hosts):

 WARNING: CPU: 33 PID: 75361 at net/ipv4/tcp.c:2278 tcp_close+0x40f/0x430

This is coming from the following section of tcp_close():

 /* Now socket is owned by kernel and we acquire BH lock
 *  to finish close. No need to check for user refs.
 */
 local_bh_disable();
 bh_lock_sock(sk);
 WARN_ON(sock_owned_by_user(sk));

TL;DR -> Any ideas what could lead to this warning?  I'm thinking that
the fput() (and related) code above this in the stack trace should
have left this socket in a state that could be closed without
warnings.  Could some other kernel thread be setting sk->sk_lock.owned
to 1 during this tcp_close()?
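
For context on what that WARN_ON actually tests: sock_owned_by_user() only
reports whether some context currently owns the socket in the lock_sock()
sense. Simplified from include/net/sock.h (the real helper also contains a
lockdep assertion):

static inline bool sock_owned_by_user(const struct sock *sk)
{
	return sk->sk_lock.owned;
}

lock_sock() sets sk->sk_lock.owned to 1 and release_sock() clears it, so the
warning fires when tcp_close() runs while the socket is still held that way.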

It appears that the socket is NOT owned by kernel, causing the
warning.  This warning does not appear to break anything - however we
will eventually receive this next error (can take over a week to
happen):

 IPv4: Attempt to release TCP socket in state 10 39ed388a

Once this happens, mysqld will no longer accept() new clients.
Existing open sockets work fine.  We assume the above warning about
"state 10" (TCP_LISTEN) is related, but we don't know why/how that
LISTEN socket is being released or how/if it's related to the above
warnings from tcp_close().

This error comes from net/ipv4/af_inet.c:144:

 if (sk->sk_type == SOCK_STREAM && sk->sk_state != TCP_CLOSE) {
 pr_err("Attempt to release TCP socket in state %d %p\n",
sk->sk_state, sk);
 return;
 }


For what it's worth, mysqld seems completely unaware that it cannot
take new clients.  GDB attached to the user-mode process (pstack)
shows that it's still listening for new connections (but they will
never come up the stack).  For this reason, I don't think mysqld was
part of the attempt to close that LISTEN socket.

 Thread 1 (Thread 0x7fd82b8bd8c0 (LWP 5633)):
 #0  0x7fd82942ff0d in poll () from /lib64/libc.so.6
 #1  0x00d9c5fe in
Mysqld_socket_listener::listen_for_connection_event() ()
 #2  0x00798461 in mysqld_main(int, char**) ()
 #3  0x7fd82935e445 in __libc_start_main () from /lib64/libc.so.6
 #4  0x0078c875 in _start ()


Extra details below:

Kernel is 4.16.13-1.el7.elrepo.x86_64 #1 running in AWS.

https://elixir.bootlin.com/linux/v4.16.13/source/net/ipv4/tcp.c#L2278
https://elixir.bootlin.com/linux/v4.16.13/source/net/ipv4/af_inet.c#L144



[726780.788201] WARNING: CPU: 15 PID: 52245 at net/ipv4/tcp.c:2278
tcp_close+0x40f/0x430
[726780.794947] Modules linked in: binfmt_misc nf_conntrack_netlink
nfnetlink_queue tcp_diag inet_diag isofs ip6table_mangle ip6table_raw
ip6table_nat nf_nat_ipv6 iptable_security xt_CT iptable_raw
iptable_nat nf_nat_ipv4 nf_nat iptable_mangle xt_pkttype xt_NFLOG
nfnetlink_log xt_u32 xt_multiport xt_set xt_conntrack
ip_set_hash_netport ip_set_hash_ipport ip_set_hash_net ip_set_hash_ip
ip_set nfnetlink nf_conntrack_proto_gre nf_conntrack_ipv6
nf_defrag_ipv6 ip6table_filter ip6_tables xt_LOG nf_conntrack_tftp
nf_conntrack_ftp nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack
iptable_filter zfs(PO) zunicode(PO) zavl(PO) icp(PO) sb_edac
intel_powerclamp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
pcbc aesni_intel crypto_simd glue_helper cryptd cirrus snd_seq
zcommon(PO) ttm intel_rapl_perf snd_seq_device
[726780.848696]  drm_kms_helper znvpair(PO) snd_pcm snd_timer spl(O)
drm snd soundcore syscopyarea sysfillrect pcspkr sysimgblt input_leds
fb_sys_fops i2c_piix4 ip_tables xfs libcrc32c ata_generic pata_acpi
ata_piix xen_blkfront crc32c_intel libata ena(O) serio_raw floppy
sunrpc
[726780.863916] CPU: 15 PID: 52245 Comm: mysqld Tainted: PW  O
4.16.13-1.el7.elrepo.x86_64 #1
[726780.869486] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[726780.873756] RIP: 0010:tcp_close+0x40f/0x430
[726780.877049] RSP: 0018:c90063a87dd0 EFLAGS: 00010202
[726780.880913] RAX: 0001 RBX: 8839cbb4b300 RCX:
0001
[726780.885708] RDX: 0041 RSI: 00023540 RDI:
002b
[726780.890412] RBP: c90063a87df0 R08:  R09:
0101
[726780.895192] R10: 03ff R11: 2057 R12:
8839cbb4b388
[726780.900029] R13: 0009 R14: 8839cbb4b3c8 R15:
883c5ed14900
[726780.904777] FS:  7f6acd467700() GS:883c8b1c()
knlGS:
[726780.909964] CS:  0010 DS:  ES:  CR0: 80050033
[726780.914078] CR2: 7f6cd5d430c8 CR3: 003c7e6b6005 CR4:
001606e0
[726780.918837] DR0:  DR1:  DR2:

[726780.923433] DR3:  DR6: fffe0ff0 DR7:
0400
[726780.927882] Call Trace:
[726780.930410]  inet_release+0x42/0x70
[726780.933423]  inet6_release+0x30/0x40
[726780.936439]  sock_release+0x25/0x80
[726780.939421]  

Re: [PATCH stable 4.4 4/9] tcp: use an RB tree for ooo receive queue

2018-08-15 Thread Greg KH
On Wed, Aug 15, 2018 at 09:21:03PM +0800, Mao Wenan wrote:
> From: Yaogong Wang 
> 
> Over the years, TCP BDP has increased by several orders of magnitude,
> and some people are considering to reach the 2 Gbytes limit.
> 
> Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000
> MSS.
> 
> In presence of packet losses (or reorders), TCP stores incoming packets
> into an out of order queue, and number of skbs sitting there waiting for
> the missing packets to be received can be in the 10^5 range.
> 
> Most packets are appended to the tail of this queue, and when
> packets can finally be transferred to receive queue, we scan the queue
> from its head.
> 
> However, in presence of heavy losses, we might have to find an arbitrary
> point in this queue, involving a linear scan for every incoming packet,
> throwing away cpu caches.
> 
> This patch converts it to a RB tree, to get bounded latencies.
> 
> Yaogong wrote a preliminary patch about 2 years ago.
> Eric did the rebase, added ofo_last_skb cache, polishing and tests.
> 
> Tested with network dropping between 1 and 10 % packets, with good
> success (about 30 % increase of throughput in stress tests)
> 
> Next step would be to also use an RB tree for the write queue at sender
> side ;)
> 
> Signed-off-by: Yaogong Wang 
> Signed-off-by: Eric Dumazet 
> Cc: Yuchung Cheng 
> Cc: Neal Cardwell 
> Cc: Ilpo Järvinen 
> Acked-By: Ilpo Järvinen 
> Signed-off-by: David S. Miller 
> Signed-off-by: root 

root and commit id?



Re: [PATCH stable 4.4 6/9] tcp: avoid collapses in tcp_prune_queue() if possible

2018-08-15 Thread Greg KH
On Wed, Aug 15, 2018 at 09:21:05PM +0800, Mao Wenan wrote:
> From: Eric Dumazet 
> 
> [ Upstream commit f4a3313d8e2ca9fd8d8f45e40a2903ba782607e7 ]
> 
> Right after a TCP flow is created, receiving tiny out of order
> packets allways hit the condition :
> 
> if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
>   tcp_clamp_window(sk);
> 
> tcp_clamp_window() increases sk_rcvbuf to match sk_rmem_alloc
> (guarded by tcp_rmem[2])
> 
> Calling tcp_collapse_ofo_queue() in this case is not useful,
> and offers a O(N^2) surface attack to malicious peers.
> 
> Better not attempt anything before full queue capacity is reached,
> forcing attacker to spend lots of resource and allow us to more
> easily detect the abuse.
> 
> Signed-off-by: Eric Dumazet 
> Acked-by: Soheil Hassas Yeganeh 
> Acked-by: Yuchung Cheng 
> Signed-off-by: David S. Miller 
> Signed-off-by: Greg Kroah-Hartman 
> Signed-off-by: root 

root?


Re: [PATCH stable 4.4 5/9] tcp: free batches of packets in tcp_prune_ofo_queue()

2018-08-15 Thread Greg KH
On Wed, Aug 15, 2018 at 09:21:04PM +0800, Mao Wenan wrote:
> From: Eric Dumazet 
> 
> Juha-Matti Tilli reported that malicious peers could inject tiny
> packets in out_of_order_queue, forcing very expensive calls
> to tcp_collapse_ofo_queue() and tcp_prune_ofo_queue() for
> every incoming packet. out_of_order_queue rb-tree can contain
> thousands of nodes, iterating over all of them is not nice.
> 
> Before linux-4.9, we would have pruned all packets in ofo_queue
> in one go, every  packets.  depends on sk_rcvbuf and skbs
> truesize, but is about 7000 packets with tcp_rmem[2] default of 6 MB.
> 
> Since we plan to increase tcp_rmem[2] in the future to cope with
> modern BDP, can not revert to the old behavior, without great pain.
> 
> Strategy taken in this patch is to purge ~12.5 % of the queue capacity.
> 
> Fixes: 36a6503fedda ("tcp: refine tcp_prune_ofo_queue() to not drop all 
> packets")
> Signed-off-by: Eric Dumazet 
> Reported-by: Juha-Matti Tilli 
> Acked-by: Yuchung Cheng 
> Acked-by: Soheil Hassas Yeganeh 
> Signed-off-by: David S. Miller 
> Signed-off-by: root 

root?

And commit id?

thanks,

greg k-h


Re: [PATCH stable 4.4 9/9] tcp: add tcp_ooo_try_coalesce() helper

2018-08-15 Thread Greg KH
On Wed, Aug 15, 2018 at 09:21:08PM +0800, Mao Wenan wrote:
> From: Eric Dumazet 
> 
> In case skb in out_or_order_queue is the result of
> multiple skbs coalescing, we would like to get a proper gso_segs
> counter tracking, so that future tcp_drop() can report an accurate
> number.
> 
> I chose to not implement this tracking for skbs in receive queue,
> since they are not dropped, unless socket is disconnected.
> 
> Signed-off-by: Eric Dumazet 
> Acked-by: Soheil Hassas Yeganeh 
> Acked-by: Yuchung Cheng 
> Signed-off-by: David S. Miller 
> Signed-off-by: Mao Wenan 

Upstream commit id?


Re: [PATCH stable 4.4 0/9] fix SegmentSmack (CVE-2018-5390)

2018-08-15 Thread Greg KH
On Wed, Aug 15, 2018 at 09:20:59PM +0800, Mao Wenan wrote:
> There are five patches to fix CVE-2018-5390 in latest mainline 
> branch, but only two patches exist in stable 4.4 and 3.18: 
> dc6ae4d tcp: detect malicious patterns in tcp_collapse_ofo_queue()
> 5fbec48 tcp: avoid collapses in tcp_prune_queue() if possible
> but I have tested with these patches, and found the cpu usage was very high.
> test results:
> with fix patch: 78.2%   ksoftirqd
> no fix patch:   90% ksoftirqd
> 
> After analysing the code of stable 4.4 and debugging the system, the
> search of the ofo_queue (TCP ofo uses a simple queue) costs more cycles.
> So I think the two patches alone can't fix CVE-2018-5390.
> So I tried to backport "tcp: use an RB tree for ooo receive queue", using an
> RB tree instead of a simple queue, and then backported Eric Dumazet's 5 fix
> patches from mainline; the good news is that ksoftirqd drops to about 20%,
> which is the same as mainline now.

Thanks for doing this work, I had some questions on the individual
patches.  Can you address them and resend?

thanks,

greg k-h


Re: [PATCH stable 4.4 8/9] tcp: call tcp_drop() from tcp_data_queue_ofo()

2018-08-15 Thread Greg KH
On Wed, Aug 15, 2018 at 09:21:07PM +0800, Mao Wenan wrote:
> From: Eric Dumazet 
> 
> In order to be able to give better diagnostics and detect
> malicious traffic, we need to have better sk->sk_drops tracking.
> 
> Fixes: 9f5afeae5152 ("tcp: use an RB tree for ooo receive queue")
> Signed-off-by: Eric Dumazet 
> Acked-by: Soheil Hassas Yeganeh 
> Acked-by: Yuchung Cheng 
> Signed-off-by: David S. Miller 
> Signed-off-by: Mao Wenan 
> ---
>  net/ipv4/tcp_input.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Upstream commit id?


Re: [PATCH stable 4.4 3/9] tcp: increment sk_drops for dropped rx packets

2018-08-15 Thread Greg KH
On Wed, Aug 15, 2018 at 09:21:02PM +0800, Mao Wenan wrote:
> From: Eric Dumazet 
> 
> Now ss can report sk_drops, we can instruct TCP to increment
> this per socket counter when it drops an incoming frame, to refine
> monitoring and debugging.
> 
> Following patch takes care of listeners drops.
> 
> Signed-off-by: Eric Dumazet 
> Signed-off-by: David S. Miller 
> Signed-off-by: Mao Wenan 

What is the upstream git commit id for this?

thanks,

greg k-h


Re: [PATCH stable 4.4 7/9] tcp: detect malicious patterns in tcp_collapse_ofo_queue()

2018-08-15 Thread Greg KH
On Wed, Aug 15, 2018 at 09:21:06PM +0800, Mao Wenan wrote:
> From: Eric Dumazet 
> 
> [ Upstream commit 3d4bf93ac12003f9b8e1e2de37fe27983deebdcf ]
> 
> In case an attacker feeds tiny packets completely out of order,
> tcp_collapse_ofo_queue() might scan the whole rb-tree, performing
> expensive copies, but not changing socket memory usage at all.
> 
> 1) Do not attempt to collapse tiny skbs.
> 2) Add logic to exit early when too many tiny skbs are detected.
> 
> We prefer not doing aggressive collapsing (which copies packets)
> for pathological flows, and revert to tcp_prune_ofo_queue() which
> will be less expensive.
> 
> In the future, we might add the possibility of terminating flows
> that are proven to be malicious.
> 
> Signed-off-by: Eric Dumazet 
> Acked-by: Soheil Hassas Yeganeh 
> Signed-off-by: David S. Miller 
> Signed-off-by: Greg Kroah-Hartman 
> Signed-off-by: root 

A signed-off-by from "root"?  :)

And why are you adding the patch back after removing it?

thanks,

greg k-h


Re: [PATCH stable 4.4 2/9] Revert "tcp: avoid collapses in tcp_prune_queue() if possible"

2018-08-15 Thread Greg KH
On Wed, Aug 15, 2018 at 09:21:01PM +0800, Mao Wenan wrote:
> This reverts commit 5fbec4801264cb3279ef6ac9c70bcbe2aaef89d5.
> ---

Same here for description and signed off by.

thanks,

greg k-h


Re: [PATCH stable 4.4 1/9] Revert "tcp: detect malicious patterns in tcp_collapse_ofo_queue()"

2018-08-15 Thread Greg KH
On Wed, Aug 15, 2018 at 09:21:00PM +0800, Mao Wenan wrote:
> This reverts commit dc6ae4dffd656811dee7151b19545e4cd839d378.

I need a reason why, and a signed-off-by line :(

thanks,

greg k-h


Re: I found a strange place while reading “net/ipv6/reassembly.c”

2018-08-15 Thread Sabrina Dubroca
2018-08-15, 04:38:29 +, Ttttabcd wrote:
> Hello everyone who develops the kernel.
> 
> At the beginning I was looking for the source author, but his email
> address has expired, so I can only come here to ask questions.
> 
> The problem is in the /net/ipv6/reassembly.c file, the author is
> Pedro Roque.
> 
> I found some strange places when I read the code for this file
> (Linux Kernel version 4.18).
> 
> In the "/net/ipv6/reassembly.c"
> 
> In the function "ip6_frag_queue"
> 
>   offset = ntohs(fhdr->frag_off) & ~0x7;
>   end = offset + (ntohs(ipv6_hdr(skb)->payload_len) -
>   ((u8 *)(fhdr + 1) - (u8 *)(ipv6_hdr(skb) + 1)));
> 
>   if ((unsigned int)end > IPV6_MAXPLEN) {
>   *prob_offset = (u8 *)&fhdr->frag_off - skb_network_header(skb);
>   return -1;
>   }
> 
> Here the length of the payload is checked.

This check is based on the fragment currently being processed, and
only considers the reassembled payload.

> And in the function "ip6_frag_reasm"
> 

Re-adding the comment that comes just above this:

/* Unfragmented part is taken from the first segment. */
>   payload_len = ((head->data - skb_network_header(head)) -
>  sizeof(struct ipv6hdr) + fq->q.len -
>  sizeof(struct frag_hdr));
>   if (payload_len > IPV6_MAXPLEN)
>   goto out_oversize;

This considers the reassembled payload (ie, as above) *and* any
extension headers that might come before it.


>   ..
>   out_oversize:
>   net_dbg_ratelimited("ip6_frag_reasm: payload len = %d\n", 
> payload_len);
>   goto out_fail;
> 
> Here the payload length is checked as well.
> 
> So the payload length is checked twice.
> 
> In my testing, the code under the "out_oversize:" label does not execute
> at all, because the oversize case has already returned in "ip6_frag_queue".

If you try again with a routing header, I think you'll see it trigger.
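
To make this concrete, here is a rough worked example with made-up sizes
(not taken from the report): suppose the first fragment carries a 24-byte
routing header between the IPv6 header and the fragment header, and the
fragments reassemble to fq->q.len = 65528 bytes. Every individual fragment
passes the ip6_frag_queue() check, because offset plus the fragment's data
length never exceeds IPV6_MAXPLEN (65535). But ip6_frag_reasm() adds the
unfragmented part back in:

    payload_len = 24 (routing header) + 65528 (reassembled data)
                = 65552 > IPV6_MAXPLEN

so the out_oversize path is reachable even though every per-fragment check
passed.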

> Only if I comment out the code that checks the payload length in the
> function "ip6_frag_queue" can the code labeled "out_oversize:" be
> executed.
> 
> So, is this check redundant?

-- 
Sabrina


[PATCH stable 4.4 4/9] tcp: use an RB tree for ooo receive queue

2018-08-15 Thread Mao Wenan
From: Yaogong Wang 

Over the years, TCP BDP has increased by several orders of magnitude,
and some people are considering to reach the 2 Gbytes limit.

Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000
MSS.

In presence of packet losses (or reorders), TCP stores incoming packets
into an out of order queue, and number of skbs sitting there waiting for
the missing packets to be received can be in the 10^5 range.

Most packets are appended to the tail of this queue, and when
packets can finally be transferred to receive queue, we scan the queue
from its head.

However, in presence of heavy losses, we might have to find an arbitrary
point in this queue, involving a linear scan for every incoming packet,
throwing away cpu caches.

This patch converts it to a RB tree, to get bounded latencies.

Yaogong wrote a preliminary patch about 2 years ago.
Eric did the rebase, added ofo_last_skb cache, polishing and tests.

Tested with network dropping between 1 and 10 % packets, with good
success (about 30 % increase of throughput in stress tests)

Next step would be to also use an RB tree for the write queue at sender
side ;)

Signed-off-by: Yaogong Wang 
Signed-off-by: Eric Dumazet 
Cc: Yuchung Cheng 
Cc: Neal Cardwell 
Cc: Ilpo Järvinen 
Acked-By: Ilpo Järvinen 
Signed-off-by: David S. Miller 
Signed-off-by: root 
Signed-off-by: Mao Wenan 
---
 include/linux/skbuff.h   |   8 ++
 include/linux/tcp.h  |   7 +-
 include/net/tcp.h|   2 +-
 net/core/skbuff.c|  19 +++
 net/ipv4/tcp.c   |   4 +-
 net/ipv4/tcp_input.c | 354 +++
 net/ipv4/tcp_ipv4.c  |   2 +-
 net/ipv4/tcp_minisocks.c |   1 -
 8 files changed, 241 insertions(+), 156 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c28bd8b..a490dd7 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2273,6 +2273,8 @@ static inline void __skb_queue_purge(struct sk_buff_head 
*list)
kfree_skb(skb);
 }
 
+void skb_rbtree_purge(struct rb_root *root);
+
 void *netdev_alloc_frag(unsigned int fragsz);
 
 struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int length,
@@ -2807,6 +2809,12 @@ static inline int pskb_trim_rcsum(struct sk_buff *skb, 
unsigned int len)
return __pskb_trim(skb, len);
 }
 
+#define rb_to_skb(rb) rb_entry_safe(rb, struct sk_buff, rbnode)
+#define skb_rb_first(root) rb_to_skb(rb_first(root))
+#define skb_rb_last(root)  rb_to_skb(rb_last(root))
+#define skb_rb_next(skb)   rb_to_skb(rb_next(&(skb)->rbnode))
+#define skb_rb_prev(skb)   rb_to_skb(rb_prev(&(skb)->rbnode))
+
 #define skb_queue_walk(queue, skb) \
for (skb = (queue)->next;   
\
 skb != (struct sk_buff *)(queue);  
\
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 5b6df1a..747404d 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -279,10 +279,9 @@ struct tcp_sock {
struct sk_buff* lost_skb_hint;
struct sk_buff *retransmit_skb_hint;
 
-   /* OOO segments go in this list. Note that socket lock must be held,
-* as we do not use sk_buff_head lock.
-*/
-   struct sk_buff_head out_of_order_queue;
+   /* OOO segments go in this rbtree. Socket lock must be held. */
+   struct rb_root  out_of_order_queue;
+   struct sk_buff  *ooo_last_skb; /* cache rb_last(out_of_order_queue) */
 
/* SACKs data, these 2 need to be together (see tcp_options_write) */
struct tcp_sack_block duplicate_sack[1]; /* D-SACK block */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index cac4a6ad..8bc259d 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -649,7 +649,7 @@ static inline void tcp_fast_path_check(struct sock *sk)
 {
struct tcp_sock *tp = tcp_sk(sk);
 
-   if (skb_queue_empty(&tp->out_of_order_queue) &&
+   if (RB_EMPTY_ROOT(&tp->out_of_order_queue) &&
tp->rcv_wnd &&
atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf &&
!tp->urg_data)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 55be076..9703924 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2378,6 +2378,25 @@ void skb_queue_purge(struct sk_buff_head *list)
 EXPORT_SYMBOL(skb_queue_purge);
 
 /**
+ * skb_rbtree_purge - empty a skb rbtree
+ * @root: root of the rbtree to empty
+ *
+ * Delete all buffers on an &sk_buff rbtree. Each buffer is removed from
+ * the list and one reference dropped. This function does not take
+ * any lock. Synchronization should be handled by the caller (e.g., TCP
+ * out-of-order queue is protected by the socket lock).
+ */
+void skb_rbtree_purge(struct rb_root *root)
+{
+   struct sk_buff *skb, *next;
+
+   rbtree_postorder_for_each_entry_safe(skb, next, root, rbnode)
+   kfree_skb(skb);
+
+   *root = RB_ROOT;
+}
+
+/**
  * skb_queue_head - queue a 

[PATCH stable 4.4 8/9] tcp: call tcp_drop() from tcp_data_queue_ofo()

2018-08-15 Thread Mao Wenan
From: Eric Dumazet 

In order to be able to give better diagnostics and detect
malicious traffic, we need to have better sk->sk_drops tracking.

Fixes: 9f5afeae5152 ("tcp: use an RB tree for ooo receive queue")
Signed-off-by: Eric Dumazet 
Acked-by: Soheil Hassas Yeganeh 
Acked-by: Yuchung Cheng 
Signed-off-by: David S. Miller 
Signed-off-by: Mao Wenan 
---
 net/ipv4/tcp_input.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c48924f..96a1e0d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4445,7 +4445,7 @@ coalesce_done:
/* All the bits are present. Drop. */
NET_INC_STATS(sock_net(sk),
  LINUX_MIB_TCPOFOMERGE);
-   __kfree_skb(skb);
+   tcp_drop(sk, skb);
skb = NULL;
tcp_dsack_set(sk, seq, end_seq);
goto add_sack;
@@ -4464,7 +4464,7 @@ coalesce_done:
 TCP_SKB_CB(skb1)->end_seq);
NET_INC_STATS(sock_net(sk),
  LINUX_MIB_TCPOFOMERGE);
-   __kfree_skb(skb1);
+   tcp_drop(sk, skb1);
goto add_sack;
}
} else if (tcp_try_coalesce(sk, skb1, skb, &fragstolen)) {
-- 
1.8.3.1



[PATCH stable 4.4 9/9] tcp: add tcp_ooo_try_coalesce() helper

2018-08-15 Thread Mao Wenan
From: Eric Dumazet 

In case skb in out_or_order_queue is the result of
multiple skbs coalescing, we would like to get a proper gso_segs
counter tracking, so that future tcp_drop() can report an accurate
number.

I chose to not implement this tracking for skbs in receive queue,
since they are not dropped, unless socket is disconnected.

Signed-off-by: Eric Dumazet 
Acked-by: Soheil Hassas Yeganeh 
Acked-by: Yuchung Cheng 
Signed-off-by: David S. Miller 
Signed-off-by: Mao Wenan 
---
 net/ipv4/tcp_input.c | 23 +--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 96a1e0d..fdb5509 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4296,6 +4296,23 @@ static bool tcp_try_coalesce(struct sock *sk,
return true;
 }
 
+static bool tcp_ooo_try_coalesce(struct sock *sk,
+struct sk_buff *to,
+struct sk_buff *from,
+bool *fragstolen)
+{
+   bool res = tcp_try_coalesce(sk, to, from, fragstolen);
+
+   /* In case tcp_drop() is called later, update to->gso_segs */
+   if (res) {
+   u32 gso_segs = max_t(u16, 1, skb_shinfo(to)->gso_segs) +
+  max_t(u16, 1, skb_shinfo(from)->gso_segs);
+
+   skb_shinfo(to)->gso_segs = min_t(u32, gso_segs, 0xFFFF);
+   }
+   return res;
+}
+
 static void tcp_drop(struct sock *sk, struct sk_buff *skb)
 {
sk_drops_add(sk, skb);
@@ -4422,7 +4439,8 @@ static void tcp_data_queue_ofo(struct sock *sk, struct 
sk_buff *skb)
/* In the typical case, we are adding an skb to the end of the list.
 * Use of ooo_last_skb avoids the O(Log(N)) rbtree lookup.
 */
-   if (tcp_try_coalesce(sk, tp->ooo_last_skb, skb, &fragstolen)) {
+   if (tcp_ooo_try_coalesce(sk, tp->ooo_last_skb,
+skb, &fragstolen)) {
 coalesce_done:
tcp_grow_window(sk, skb);
kfree_skb_partial(skb, fragstolen);
@@ -4467,7 +4485,8 @@ coalesce_done:
tcp_drop(sk, skb1);
goto add_sack;
}
-   } else if (tcp_try_coalesce(sk, skb1, skb, &fragstolen)) {
+   } else if (tcp_ooo_try_coalesce(sk, skb1,
+   skb, &fragstolen)) {
goto coalesce_done;
}
p = >rb_right;
-- 
1.8.3.1



[PATCH stable 4.4 3/9] tcp: increment sk_drops for dropped rx packets

2018-08-15 Thread Mao Wenan
From: Eric Dumazet 

Now ss can report sk_drops, we can instruct TCP to increment
this per socket counter when it drops an incoming frame, to refine
monitoring and debugging.

Following patch takes care of listeners drops.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
Signed-off-by: Mao Wenan 
---
 include/net/sock.h   |  7 +++
 net/ipv4/tcp_input.c | 33 -
 net/ipv4/tcp_ipv4.c  |  1 +
 net/ipv6/tcp_ipv6.c  |  1 +
 4 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 3d5ff74..5770757 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2139,6 +2139,13 @@ sock_skb_set_dropcount(const struct sock *sk, struct 
sk_buff *skb)
SOCK_SKB_CB(skb)->dropcount = atomic_read(&sk->sk_drops);
 }
 
+static inline void sk_drops_add(struct sock *sk, const struct sk_buff *skb)
+{
+   int segs = max_t(u16, 1, skb_shinfo(skb)->gso_segs);
+
+   atomic_add(segs, &sk->sk_drops);
+}
+
 void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk,
   struct sk_buff *skb);
 void __sock_recv_wifi_status(struct msghdr *msg, struct sock *sk,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index df2f342..5fb4e80 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4296,6 +4296,12 @@ static bool tcp_try_coalesce(struct sock *sk,
return true;
 }
 
+static void tcp_drop(struct sock *sk, struct sk_buff *skb)
+{
+   sk_drops_add(sk, skb);
+   __kfree_skb(skb);
+}
+
 /* This one checks to see if we can put data from the
  * out_of_order queue into the receive_queue.
  */
@@ -4320,7 +4326,7 @@ static void tcp_ofo_queue(struct sock *sk)
__skb_unlink(skb, &tp->out_of_order_queue);
if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt)) {
SOCK_DEBUG(sk, "ofo packet was already received\n");
-   __kfree_skb(skb);
+   tcp_drop(sk, skb);
continue;
}
SOCK_DEBUG(sk, "ofo requeuing : rcv_next %X seq %X - %X\n",
@@ -4372,7 +4378,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct 
sk_buff *skb)
 
if (unlikely(tcp_try_rmem_schedule(sk, skb, skb->truesize))) {
NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFODROP);
-   __kfree_skb(skb);
+   tcp_drop(sk, skb);
return;
}
 
@@ -4436,7 +4442,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct 
sk_buff *skb)
if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
/* All the bits are present. Drop. */
NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOMERGE);
-   __kfree_skb(skb);
+   tcp_drop(sk, skb);
skb = NULL;
tcp_dsack_set(sk, seq, end_seq);
goto add_sack;
@@ -4475,7 +4481,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct 
sk_buff *skb)
tcp_dsack_extend(sk, TCP_SKB_CB(skb1)->seq,
 TCP_SKB_CB(skb1)->end_seq);
NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOMERGE);
-   __kfree_skb(skb1);
+   tcp_drop(sk, skb1);
}
 
 add_sack:
@@ -4558,12 +4564,13 @@ err:
 static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 {
struct tcp_sock *tp = tcp_sk(sk);
-   int eaten = -1;
bool fragstolen = false;
+   int eaten = -1;
 
-   if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq)
-   goto drop;
-
+   if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq) {
+   __kfree_skb(skb);
+   return;
+   }
skb_dst_drop(skb);
__skb_pull(skb, tcp_hdr(skb)->doff * 4);
 
@@ -4645,7 +4652,7 @@ out_of_window:
tcp_enter_quickack_mode(sk, TCP_MAX_QUICKACKS);
inet_csk_schedule_ack(sk);
 drop:
-   __kfree_skb(skb);
+   tcp_drop(sk, skb);
return;
}
 
@@ -5220,7 +5227,7 @@ syn_challenge:
return true;
 
 discard:
-   __kfree_skb(skb);
+   tcp_drop(sk, skb);
return false;
 }
 
@@ -5438,7 +5445,7 @@ csum_error:
TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_INERRS);
 
 discard:
-   __kfree_skb(skb);
+   tcp_drop(sk, skb);
 }
 EXPORT_SYMBOL(tcp_rcv_established);
 
@@ -5668,7 +5675,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, 
struct sk_buff *skb,
  TCP_DELACK_MAX, TCP_RTO_MAX);
 
 discard:
-   __kfree_skb(skb);
+   tcp_drop(sk, skb);
return 0;
} else {
tcp_send_ack(sk);
@@ -6025,7 +6032,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff 
*skb)
 
if (!queued) {
 discard:
-

[PATCH stable 4.4 0/9] fix SegmentSmack (CVE-2018-5390)

2018-08-15 Thread Mao Wenan
There are five patches fixing CVE-2018-5390 in the latest mainline
branch, but only two of them exist in stable 4.4 and 3.18:
dc6ae4d tcp: detect malicious patterns in tcp_collapse_ofo_queue()
5fbec48 tcp: avoid collapses in tcp_prune_queue() if possible
I have tested with these two patches and found that CPU usage was still
very high.
Test results:
with the fix patches: 78.2% ksoftirqd
without the fix patches: 90% ksoftirqd

After analysing the stable 4.4 code and debugging the system, I found
that searching the ofo queue (TCP out-of-order handling uses a simple
queue there) costs a lot of cycles.
So I think these two patches alone cannot fix CVE-2018-5390.
I therefore backported "tcp: use an RB tree for ooo receive queue" to
replace the simple queue with an RB tree, and then backported Eric
Dumazet's five mainline fixes; the good news is that ksoftirqd drops to
about 20%, which now matches mainline.
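
To illustrate the data-structure change, here is a simplified sketch of
the rbtree insertion pattern the backport relies on (generic
<linux/rbtree.h> usage with the rb_to_skb() helper from patch 4; the real
tcp_data_queue_ofo() additionally handles overlaps, coalescing and SACK
bookkeeping):

/* Simplified: insert an skb into an rbtree keyed by TCP sequence number. */
static void ooo_rb_insert(struct rb_root *root, struct sk_buff *skb)
{
        struct rb_node **p = &root->rb_node, *parent = NULL;
        u32 seq = TCP_SKB_CB(skb)->seq;

        while (*p) {
                struct sk_buff *skb1;

                parent = *p;
                skb1 = rb_to_skb(parent);
                if (before(seq, TCP_SKB_CB(skb1)->seq))
                        p = &parent->rb_left;
                else
                        p = &parent->rb_right;
        }
        rb_link_node(&skb->rbnode, parent, p);
        rb_insert_color(&skb->rbnode, root);
}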

Eric Dumazet (6):
  tcp: increment sk_drops for dropped rx packets
  tcp: free batches of packets in tcp_prune_ofo_queue()
  tcp: avoid collapses in tcp_prune_queue() if possible
  tcp: detect malicious patterns in tcp_collapse_ofo_queue()
  tcp: call tcp_drop() from tcp_data_queue_ofo()
  tcp: add tcp_ooo_try_coalesce() helper

Mao Wenan (2):
  Revert "tcp: detect malicious patterns in tcp_collapse_ofo_queue()"
  Revert "tcp: avoid collapses in tcp_prune_queue() if possible"

Yaogong Wang (1):
  tcp: use an RB tree for ooo receive queue

 include/linux/skbuff.h   |   8 +
 include/linux/tcp.h  |   7 +-
 include/net/sock.h   |   7 +
 include/net/tcp.h|   2 +-
 net/core/skbuff.c|  19 +++
 net/ipv4/tcp.c   |   4 +-
 net/ipv4/tcp_input.c | 412 +--
 net/ipv4/tcp_ipv4.c  |   3 +-
 net/ipv4/tcp_minisocks.c |   1 -
 net/ipv6/tcp_ipv6.c  |   1 +
 10 files changed, 294 insertions(+), 170 deletions(-)

-- 
1.8.3.1



[PATCH stable 4.4 1/9] Revert "tcp: detect malicious patterns in tcp_collapse_ofo_queue()"

2018-08-15 Thread Mao Wenan
This reverts commit dc6ae4dffd656811dee7151b19545e4cd839d378.
---
 net/ipv4/tcp_input.c | 16 ++--
 1 file changed, 2 insertions(+), 14 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 4a261e0..995b2bc 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4791,7 +4791,6 @@ restart:
 static void tcp_collapse_ofo_queue(struct sock *sk)
 {
struct tcp_sock *tp = tcp_sk(sk);
-   u32 range_truesize, sum_tiny = 0;
struct sk_buff *skb = skb_peek(&tp->out_of_order_queue);
struct sk_buff *head;
u32 start, end;
@@ -4801,7 +4800,6 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
 
start = TCP_SKB_CB(skb)->seq;
end = TCP_SKB_CB(skb)->end_seq;
-   range_truesize = skb->truesize;
head = skb;
 
for (;;) {
@@ -4816,24 +4814,14 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
if (!skb ||
after(TCP_SKB_CB(skb)->seq, end) ||
before(TCP_SKB_CB(skb)->end_seq, start)) {
-   /* Do not attempt collapsing tiny skbs */
-   if (range_truesize != head->truesize ||
-   end - start >= SKB_WITH_OVERHEAD(SK_MEM_QUANTUM)) {
-   tcp_collapse(sk, &tp->out_of_order_queue,
-head, skb, start, end);
-   } else {
-   sum_tiny += range_truesize;
-   if (sum_tiny > sk->sk_rcvbuf >> 3)
-   return;
-   }
-
+   tcp_collapse(sk, &tp->out_of_order_queue,
+head, skb, start, end);
head = skb;
if (!skb)
break;
/* Start new segment */
start = TCP_SKB_CB(skb)->seq;
end = TCP_SKB_CB(skb)->end_seq;
-   range_truesize = skb->truesize;
} else {
if (before(TCP_SKB_CB(skb)->seq, start))
start = TCP_SKB_CB(skb)->seq;
-- 
1.8.3.1



[PATCH stable 4.4 2/9] Revert "tcp: avoid collapses in tcp_prune_queue() if possible"

2018-08-15 Thread Mao Wenan
This reverts commit 5fbec4801264cb3279ef6ac9c70bcbe2aaef89d5.
---
 net/ipv4/tcp_input.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 995b2bc..df2f342 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4877,9 +4877,6 @@ static int tcp_prune_queue(struct sock *sk)
else if (tcp_under_memory_pressure(sk))
tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U * tp->advmss);
 
-   if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf)
-   return 0;
-
tcp_collapse_ofo_queue(sk);
if (!skb_queue_empty(&sk->sk_receive_queue))
tcp_collapse(sk, &sk->sk_receive_queue,
-- 
1.8.3.1



[PATCH stable 4.4 5/9] tcp: free batches of packets in tcp_prune_ofo_queue()

2018-08-15 Thread Mao Wenan
From: Eric Dumazet 

Juha-Matti Tilli reported that malicious peers could inject tiny
packets in out_of_order_queue, forcing very expensive calls
to tcp_collapse_ofo_queue() and tcp_prune_ofo_queue() for
every incoming packet. out_of_order_queue rb-tree can contain
thousands of nodes, iterating over all of them is not nice.

Before linux-4.9, we would have pruned all packets in ofo_queue
in one go, every  packets.  depends on sk_rcvbuf and skbs
truesize, but is about 7000 packets with tcp_rmem[2] default of 6 MB.

Since we plan to increase tcp_rmem[2] in the future to cope with
modern BDP, can not revert to the old behavior, without great pain.

Strategy taken in this patch is to purge ~12.5 % of the queue capacity.
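
(For scale: with sk_rcvbuf at the 6 MB tcp_rmem[2] default mentioned
above, the goal of sk->sk_rcvbuf >> 3 used below comes to roughly 768 KB
of truesize freed per pruning pass before sk_rmem_alloc is checked again.)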

Fixes: 36a6503fedda ("tcp: refine tcp_prune_ofo_queue() to not drop all 
packets")
Signed-off-by: Eric Dumazet 
Reported-by: Juha-Matti Tilli 
Acked-by: Yuchung Cheng 
Acked-by: Soheil Hassas Yeganeh 
Signed-off-by: David S. Miller 
Signed-off-by: root 
Signed-off-by: Mao Wenan 
---
 net/ipv4/tcp_input.c | 16 ++--
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 12edc4f..32225dc 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4893,22 +4893,26 @@ static bool tcp_prune_ofo_queue(struct sock *sk)
 {
struct tcp_sock *tp = tcp_sk(sk);
struct rb_node *node, *prev;
+   int goal;
 
if (RB_EMPTY_ROOT(&tp->out_of_order_queue))
return false;
 
NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_OFOPRUNED);
-
+   goal = sk->sk_rcvbuf >> 3;
node = &tp->ooo_last_skb->rbnode;
do {
prev = rb_prev(node);
rb_erase(node, &tp->out_of_order_queue);
+   goal -= rb_to_skb(node)->truesize;
__kfree_skb(rb_to_skb(node));
-   sk_mem_reclaim(sk);
-if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf &&
-!tcp_under_memory_pressure(sk))
-break;
-
+   if (!prev || goal <= 0) {
+   sk_mem_reclaim(sk);
+   if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf &&
+   !tcp_under_memory_pressure(sk))
+   break;
+   goal = sk->sk_rcvbuf >> 3;
+   }
node = prev;
} while (node);
tp->ooo_last_skb = rb_entry(prev, struct sk_buff, rbnode);
-- 
1.8.3.1



[PATCH stable 4.4 7/9] tcp: detect malicious patterns in tcp_collapse_ofo_queue()

2018-08-15 Thread Mao Wenan
From: Eric Dumazet 

[ Upstream commit 3d4bf93ac12003f9b8e1e2de37fe27983deebdcf ]

In case an attacker feeds tiny packets completely out of order,
tcp_collapse_ofo_queue() might scan the whole rb-tree, performing
expensive copies, but not changing socket memory usage at all.

1) Do not attempt to collapse tiny skbs.
2) Add logic to exit early when too many tiny skbs are detected.

We prefer not doing aggressive collapsing (which copies packets)
for pathological flows, and revert to tcp_prune_ofo_queue() which
will be less expensive.

In the future, we might add the possibility of terminating flows
that are proven to be malicious.
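
(Concretely, in the code below a range is treated as "tiny" when it is a
single skb, i.e. range_truesize == head->truesize, covering less than
SKB_WITH_OVERHEAD(SK_MEM_QUANTUM) bytes of sequence space, and collapsing
gives up once such ranges account for more than sk_rcvbuf / 8 of truesize,
via the sum_tiny accumulator.)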

Signed-off-by: Eric Dumazet 
Acked-by: Soheil Hassas Yeganeh 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman 
Signed-off-by: root 
Signed-off-by: Mao Wenan 
---
 net/ipv4/tcp_input.c | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 77130ae..c48924f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4845,6 +4845,7 @@ end:
 static void tcp_collapse_ofo_queue(struct sock *sk)
 {
struct tcp_sock *tp = tcp_sk(sk);
+   u32 range_truesize, sum_tiny = 0;
struct sk_buff *skb, *head;
struct rb_node *p;
u32 start, end;
@@ -4863,6 +4864,7 @@ new_range:
}
start = TCP_SKB_CB(skb)->seq;
end = TCP_SKB_CB(skb)->end_seq;
+   range_truesize = skb->truesize;
 
for (head = skb;;) {
skb = tcp_skb_next(skb, NULL);
@@ -4873,11 +4875,21 @@ new_range:
if (!skb ||
after(TCP_SKB_CB(skb)->seq, end) ||
before(TCP_SKB_CB(skb)->end_seq, start)) {
-   tcp_collapse(sk, NULL, &tp->out_of_order_queue,
-head, skb, start, end);
+   /* Do not attempt collapsing tiny skbs */
+   if (range_truesize != head->truesize ||
+   end - start >= SKB_WITH_OVERHEAD(SK_MEM_QUANTUM)) {
+   tcp_collapse(sk, NULL, &tp->out_of_order_queue,
+head, skb, start, end);
+   } else {
+   sum_tiny += range_truesize;
+   if (sum_tiny > sk->sk_rcvbuf >> 3)
+   return;
+   }
+
goto new_range;
}
 
+   range_truesize += skb->truesize;
if (unlikely(before(TCP_SKB_CB(skb)->seq, start)))
start = TCP_SKB_CB(skb)->seq;
if (after(TCP_SKB_CB(skb)->end_seq, end))
-- 
1.8.3.1



[PATCH stable 4.4 6/9] tcp: avoid collapses in tcp_prune_queue() if possible

2018-08-15 Thread Mao Wenan
From: Eric Dumazet 

[ Upstream commit f4a3313d8e2ca9fd8d8f45e40a2903ba782607e7 ]

Right after a TCP flow is created, receiving tiny out of order
packets always hit the condition:

if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
tcp_clamp_window(sk);

tcp_clamp_window() increases sk_rcvbuf to match sk_rmem_alloc
(guarded by tcp_rmem[2])

Calling tcp_collapse_ofo_queue() in this case is not useful,
and offers a O(N^2) surface attack to malicious peers.

Better not attempt anything before full queue capacity is reached,
forcing attacker to spend lots of resource and allow us to more
easily detect the abuse.
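
(In practice this means the new check below, which returns early from
tcp_prune_queue() while sk_rmem_alloc is still within sk_rcvbuf, keeps the
expensive tcp_collapse_ofo_queue()/tcp_collapse() paths from running until
the receive buffer is genuinely full.)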

Signed-off-by: Eric Dumazet 
Acked-by: Soheil Hassas Yeganeh 
Acked-by: Yuchung Cheng 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman 
Signed-off-by: root 
Signed-off-by: Mao Wenan 
---
 net/ipv4/tcp_input.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 32225dc..77130ae 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4948,6 +4948,9 @@ static int tcp_prune_queue(struct sock *sk)
else if (tcp_under_memory_pressure(sk))
tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U * tp->advmss);
 
+   if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf)
+   return 0;
+
tcp_collapse_ofo_queue(sk);
if (!skb_queue_empty(&sk->sk_receive_queue))
tcp_collapse(sk, &sk->sk_receive_queue, NULL,
-- 
1.8.3.1



Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask

2018-08-15 Thread Leon Romanovsky
On Mon, Aug 06, 2018 at 02:20:37PM -0500, Steve Wise wrote:
>
>
> On 8/1/2018 9:27 AM, Max Gurtovoy wrote:
> >
> >
> > On 8/1/2018 8:12 AM, Sagi Grimberg wrote:
> >> Hi Max,
> >
> > Hi,
> >
> >>
> >>> Yes, since nvmf is the only user of this function.
> >>> Still waiting for comments on the suggested patch :)
> >>>
> >>
> >> Sorry for the late response (but I'm on vacation so I have
> >> an excuse ;))
> >
> > NP :) currently the code works..
> >
> >>
> >> I'm thinking that we should avoid trying to find an assignment
> >> when stuff like irqbalance daemon is running and changing
> >> the affinitization.
> >
> > but this is exactly what Steve complained and Leon try to fix (and
> > break the connection establishment).
> > If this is the case and we all agree then we're good without Leon's
> > patch and without our suggestions.
> >
>
> I don't agree.  Currently setting certain affinity mappings breaks nvme
> connectivity.  I don't think that is desirable.  And mlx5 is broken in
> that it doesn't allow changing the affinity but silently ignores the
> change, which misleads the admin or irqbalance...

Exactly, I completely agree with Steve and don't understand the
rationale in the comments above. To summarize from my side:
NVMeOF is broken, but we are not going to fix it, and we forbid
one specific driver from changing affinity on the fly.

Nice.

Thanks



[iproute PATCH] man: ip-route: Clarify referenced versions are Linux ones

2018-08-15 Thread Phil Sutter
The versioning schemes of Linux and iproute2 are similar, so the
referenced kernel versions are likely to confuse readers. Clarify this
by prefixing each kernel version with 'Linux'.

Signed-off-by: Phil Sutter 
---
 man/man8/ip-route.8.in | 30 +++---
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/man/man8/ip-route.8.in b/man/man8/ip-route.8.in
index b21a847241427..a33ce1f0f4006 100644
--- a/man/man8/ip-route.8.in
+++ b/man/man8/ip-route.8.in
@@ -483,43 +483,43 @@ seconds and ms, msec or msecs to specify milliseconds.
 
 
 .TP
-.BI rttvar " TIME " "(2.3.15+ only)"
+.BI rttvar " TIME " "(Linux 2.3.15+ only)"
 the initial RTT variance estimate. Values are specified as with
 .BI rtt
 above.
 
 .TP
-.BI rto_min " TIME " "(2.6.23+ only)"
+.BI rto_min " TIME " "(Linux 2.6.23+ only)"
 the minimum TCP Retransmission TimeOut to use when communicating with this
 destination. Values are specified as with
 .BI rtt
 above.
 
 .TP
-.BI ssthresh " NUMBER " "(2.3.15+ only)"
+.BI ssthresh " NUMBER " "(Linux 2.3.15+ only)"
 an estimate for the initial slow start threshold.
 
 .TP
-.BI cwnd " NUMBER " "(2.3.15+ only)"
+.BI cwnd " NUMBER " "(Linux 2.3.15+ only)"
 the clamp for congestion window. It is ignored if the
 .B lock
 flag is not used.
 
 .TP
-.BI initcwnd " NUMBER " "(2.5.70+ only)"
+.BI initcwnd " NUMBER " "(Linux 2.5.70+ only)"
 the initial congestion window size for connections to this destination.
 Actual window size is this value multiplied by the MSS
 (``Maximal Segment Size'') for same connection. The default is
 zero, meaning to use the values specified in RFC2414.
 
 .TP
-.BI initrwnd " NUMBER " "(2.6.33+ only)"
+.BI initrwnd " NUMBER " "(Linux 2.6.33+ only)"
 the initial receive window size for connections to this destination.
 Actual window size is this value multiplied by the MSS of the connection.
 The default value is zero, meaning to use Slow Start value.
 
 .TP
-.BI features " FEATURES " (3.18+ only)
+.BI features " FEATURES " (Linux 3.18+ only)
 Enable or disable per-route features. Only available feature at this
 time is
 .B ecn
@@ -531,17 +531,17 @@ also be used even if the
 sysctl is set to 0.
 
 .TP
-.BI quickack " BOOL " "(3.11+ only)"
+.BI quickack " BOOL " "(Linux 3.11+ only)"
 Enable or disable quick ack for connections to this destination.
 
 .TP
-.BI fastopen_no_cookie " BOOL " "(4.15+ only)"
+.BI fastopen_no_cookie " BOOL " "(Linux 4.15+ only)"
 Enable TCP Fastopen without a cookie for connections to this destination.
 
 .TP
-.BI congctl " NAME " "(3.20+ only)"
+.BI congctl " NAME " "(Linux 3.20+ only)"
 .TP
-.BI "congctl lock" " NAME " "(3.20+ only)"
+.BI "congctl lock" " NAME " "(Linux 3.20+ only)"
 Sets a specific TCP congestion control algorithm only for a given destination.
 If not specified, Linux keeps the current global default TCP congestion control
 algorithm, or the one set from the application. If the modifier
@@ -554,14 +554,14 @@ control algorithm for that destination, thus it will be 
enforced/guaranteed to
 use the proposed algorithm.
 
 .TP
-.BI advmss " NUMBER " "(2.3.15+ only)"
+.BI advmss " NUMBER " "(Linux 2.3.15+ only)"
 the MSS ('Maximal Segment Size') to advertise to these
 destinations when establishing TCP connections. If it is not given,
 Linux uses a default value calculated from the first hop device MTU.
 (If the path to these destination is asymmetric, this guess may be wrong.)
 
 .TP
-.BI reordering " NUMBER " "(2.3.15+ only)"
+.BI reordering " NUMBER " "(Linux 2.3.15+ only)"
 Maximal reordering on the path to this destination.
 If it is not given, Linux uses the value selected with
 .B sysctl
@@ -782,7 +782,7 @@ is a set of encapsulation attributes specific to the
 .IR SEG6_ACTION " [ "
 .IR SEG6_ACTION_PARAM " ] "
 - Operation to perform on matching packets.
-The following actions are currently supported (\fB4.14+ only\fR).
+The following actions are currently supported (\fBLinux 4.14+ only\fR).
 .in +2
 
 .B End
@@ -830,7 +830,7 @@ address is set as described in \fBip-sr\fR(8).
 .in -8
 
 .TP
-.BI expires " TIME " "(4.4+ only)"
+.BI expires " TIME " "(Linux 4.4+ only)"
 the route will be deleted after the expires time.
 .B Only
 support IPv6 at present.
-- 
2.18.0



[iproute PATCH 1/4] tc: Fix typo in check for colored output

2018-08-15 Thread Phil Sutter
The check used bitwise instead of boolean AND, which means colored output
was enabled only if the number of specified '-color' flags was odd.
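
For reference, a tiny standalone illustration of the difference (this
snippet is not part of the patch):

#include <stdio.h>

int main(void)
{
        int color = 2, json = 0;        /* e.g. "tc -color -color ..." */

        /* Old test: bitwise AND.  !json is 1, so this reduces to
         * (color & 1), which is 0 for any even count of -color flags. */
        printf("color & !json  = %d\n", color & !json);

        /* Fixed test: boolean AND, true whenever -color was given at
         * least once and -json was not. */
        printf("color && !json = %d\n", color && !json);
        return 0;
}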

Fixes: 2d165c0811058 ("tc: implement color output")
Signed-off-by: Phil Sutter 
---
 tc/tc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tc/tc.c b/tc/tc.c
index 3bb5910ffac52..3bb893756f40e 100644
--- a/tc/tc.c
+++ b/tc/tc.c
@@ -515,7 +515,7 @@ int main(int argc, char **argv)
 
_SL_ = oneline ? "\\" : "\n";
 
-   if (color & !json)
+   if (color && !json)
enable_color();
 
if (batch_file)
-- 
2.18.0



[iproute PATCH 4/4] lib: Enable colored output only for TTYs

2018-08-15 Thread Phil Sutter
Add an additional prerequisite to check_enable_color() to make sure
stdout actually points to an open TTY device. Otherwise calls like

| ip -color a s >/tmp/foo

will print color escape sequences into that file.

Signed-off-by: Phil Sutter 
---
 lib/color.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/lib/color.c b/lib/color.c
index edf96e5c6ecd7..500ba09682697 100644
--- a/lib/color.c
+++ b/lib/color.c
@@ -3,6 +3,7 @@
 #include 
 #include 
 #include 
+#include <unistd.h>
 #include 
 #include 
 #include 
@@ -79,7 +80,7 @@ void enable_color(void)
 
 int check_enable_color(int color, int json)
 {
-   if (color && !json) {
+   if (color && !json && isatty(fileno(stdout))) {
enable_color();
return 0;
}
-- 
2.18.0



[iproute PATCH 2/4] bridge: Fix check for colored output

2018-08-15 Thread Phil Sutter
There is no point in calling enable_color() conditionally later if it was
already called each time the '-color' flag was parsed. Align the logic
with that in ip and tc by actually making use of the 'color' variable.

Fixes: e9625d6aead11 ("Merge branch 'iproute2-master' into iproute2-next")
Signed-off-by: Phil Sutter 
---
 bridge/bridge.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/bridge/bridge.c b/bridge/bridge.c
index 7fcfe1116f6e5..289a157d37f03 100644
--- a/bridge/bridge.c
+++ b/bridge/bridge.c
@@ -174,7 +174,7 @@ main(int argc, char **argv)
if (netns_switch(argv[1]))
exit(-1);
} else if (matches(opt, "-color") == 0) {
-   enable_color();
+   ++color;
} else if (matches(opt, "-compressvlans") == 0) {
++compress_vlans;
} else if (matches(opt, "-force") == 0) {
-- 
2.18.0



[iproute PATCH 0/4] A bunch of fixes regarding colored output

2018-08-15 Thread Phil Sutter
This series contains fixes for conditionally colored output in patches 1
and 2. Patch 3 merges the common conditionals from ip, tc and bridge
tools. Patch 4 then adds a further restriction to colored output to
prevent garbled output when redirecting into a file.

Phil Sutter (4):
  tc: Fix typo in check for colored output
  bridge: Fix check for colored output
  Merge common code for conditionally colored output
  lib: Enable colored output only for TTYs

 bridge/bridge.c |  5 ++---
 include/color.h |  1 +
 ip/ip.c |  3 +--
 lib/color.c | 10 ++
 tc/tc.c |  3 +--
 5 files changed, 15 insertions(+), 7 deletions(-)

-- 
2.18.0



[iproute PATCH 3/4] Merge common code for conditionally colored output

2018-08-15 Thread Phil Sutter
Instead of calling enable_color() conditionally with identical check in
three places, introduce check_enable_color() which does it in one place.

Signed-off-by: Phil Sutter 
---
 bridge/bridge.c | 3 +--
 include/color.h | 1 +
 ip/ip.c | 3 +--
 lib/color.c | 9 +
 tc/tc.c | 3 +--
 5 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/bridge/bridge.c b/bridge/bridge.c
index 289a157d37f03..451d684e0bcfd 100644
--- a/bridge/bridge.c
+++ b/bridge/bridge.c
@@ -200,8 +200,7 @@ main(int argc, char **argv)
 
_SL_ = oneline ? "\\" : "\n";
 
-   if (color && !json)
-   enable_color();
+   check_enable_color(color, json);
 
if (batch_file)
return batch(batch_file);
diff --git a/include/color.h b/include/color.h
index c80359d3e2e95..4f2c918db7e43 100644
--- a/include/color.h
+++ b/include/color.h
@@ -13,6 +13,7 @@ enum color_attr {
 };
 
 void enable_color(void);
+int check_enable_color(int color, int json);
 void set_color_palette(void);
 int color_fprintf(FILE *fp, enum color_attr attr, const char *fmt, ...);
 enum color_attr ifa_family_color(__u8 ifa_family);
diff --git a/ip/ip.c b/ip/ip.c
index 71d5170c0cc23..38eac5ec1e17d 100644
--- a/ip/ip.c
+++ b/ip/ip.c
@@ -304,8 +304,7 @@ int main(int argc, char **argv)
 
_SL_ = oneline ? "\\" : "\n";
 
-   if (color && !json)
-   enable_color();
+   check_enable_color(color, json);
 
if (batch_file)
return batch(batch_file);
diff --git a/lib/color.c b/lib/color.c
index da1f516cb2492..edf96e5c6ecd7 100644
--- a/lib/color.c
+++ b/lib/color.c
@@ -77,6 +77,15 @@ void enable_color(void)
set_color_palette();
 }
 
+int check_enable_color(int color, int json)
+{
+   if (color && !json) {
+   enable_color();
+   return 0;
+   }
+   return 1;
+}
+
 void set_color_palette(void)
 {
char *p = getenv("COLORFGBG");
diff --git a/tc/tc.c b/tc/tc.c
index 3bb893756f40e..e775550174473 100644
--- a/tc/tc.c
+++ b/tc/tc.c
@@ -515,8 +515,7 @@ int main(int argc, char **argv)
 
_SL_ = oneline ? "\\" : "\n";
 
-   if (color && !json)
-   enable_color();
+   check_enable_color(color, json);
 
if (batch_file)
return batch(batch_file);
-- 
2.18.0



[PATCH v3 net-next] veth: Free queues on link delete

2018-08-15 Thread Toshiaki Makita
David Ahern reported a memory leak in veth.

===
$ cat /sys/kernel/debug/kmemleak
unreferenced object 0x8800354d5c00 (size 1024):
  comm "ip", pid 836, jiffies 4294722952 (age 25.904s)
  hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
  backtrace:
[<(ptrval)>] kmemleak_alloc+0x70/0x94
[<(ptrval)>] slab_post_alloc_hook+0x42/0x52
[<(ptrval)>] __kmalloc+0x101/0x142
[<(ptrval)>] kmalloc_array.constprop.20+0x1e/0x26 [veth]
[<(ptrval)>] veth_newlink+0x147/0x3ac [veth]
...
unreferenced object 0x88002e009c00 (size 1024):
  comm "ip", pid 836, jiffies 4294722958 (age 25.898s)
  hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
  backtrace:
[<(ptrval)>] kmemleak_alloc+0x70/0x94
[<(ptrval)>] slab_post_alloc_hook+0x42/0x52
[<(ptrval)>] __kmalloc+0x101/0x142
[<(ptrval)>] kmalloc_array.constprop.20+0x1e/0x26 [veth]
[<(ptrval)>] veth_newlink+0x219/0x3ac [veth]
===

The veth_rq array allocated in veth_newlink() was not freed on dellink.

We need to free the queues after veth_close() so that no packets
reference them afterwards. Thus free them in veth_dev_free(), the same
way the stats structure (vstats) is freed.

Also move queues allocation to veth_dev_init() to be in line with stats
allocation.
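
(veth_dev_init() and veth_dev_free() already pair up for the vstats
allocation, so, assuming that same init/destructor pairing, allocating the
queues there gives them the same lifetime as vstats and lets the separate
error-unwind paths in veth_newlink() go away, as the diff below shows.)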

Fixes: 638264dc90227 ("veth: Support per queue XDP ring")
Reported-by: David Ahern 
Signed-off-by: Toshiaki Makita 
---
This is a fix for a bug which exists only in net-next.
Let me know if I should wait for -next to be merged into net, or for
-next to reopen.

 drivers/net/veth.c | 70 +-
 1 file changed, 33 insertions(+), 37 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index e3202af..8d679c8 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -789,16 +789,48 @@ static int is_valid_veth_mtu(int mtu)
return mtu >= ETH_MIN_MTU && mtu <= ETH_MAX_MTU;
 }
 
+static int veth_alloc_queues(struct net_device *dev)
+{
+   struct veth_priv *priv = netdev_priv(dev);
+   int i;
+
+   priv->rq = kcalloc(dev->num_rx_queues, sizeof(*priv->rq), GFP_KERNEL);
+   if (!priv->rq)
+   return -ENOMEM;
+
+   for (i = 0; i < dev->num_rx_queues; i++)
+   priv->rq[i].dev = dev;
+
+   return 0;
+}
+
+static void veth_free_queues(struct net_device *dev)
+{
+   struct veth_priv *priv = netdev_priv(dev);
+
+   kfree(priv->rq);
+}
+
 static int veth_dev_init(struct net_device *dev)
 {
+   int err;
+
dev->vstats = netdev_alloc_pcpu_stats(struct pcpu_vstats);
if (!dev->vstats)
return -ENOMEM;
+
+   err = veth_alloc_queues(dev);
+   if (err) {
+   free_percpu(dev->vstats);
+   return err;
+   }
+
return 0;
 }
 
 static void veth_dev_free(struct net_device *dev)
 {
+   veth_free_queues(dev);
free_percpu(dev->vstats);
 }
 
@@ -1040,31 +1072,13 @@ static int veth_validate(struct nlattr *tb[], struct 
nlattr *data[],
return 0;
 }
 
-static int veth_alloc_queues(struct net_device *dev)
-{
-   struct veth_priv *priv = netdev_priv(dev);
-
-   priv->rq = kcalloc(dev->num_rx_queues, sizeof(*priv->rq), GFP_KERNEL);
-   if (!priv->rq)
-   return -ENOMEM;
-
-   return 0;
-}
-
-static void veth_free_queues(struct net_device *dev)
-{
-   struct veth_priv *priv = netdev_priv(dev);
-
-   kfree(priv->rq);
-}
-
 static struct rtnl_link_ops veth_link_ops;
 
 static int veth_newlink(struct net *src_net, struct net_device *dev,
struct nlattr *tb[], struct nlattr *data[],
struct netlink_ext_ack *extack)
 {
-   int err, i;
+   int err;
struct net_device *peer;
struct veth_priv *priv;
char ifname[IFNAMSIZ];
@@ -1117,12 +1131,6 @@ static int veth_newlink(struct net *src_net, struct 
net_device *dev,
return PTR_ERR(peer);
}
 
-   err = veth_alloc_queues(peer);
-   if (err) {
-   put_net(net);
-   goto err_peer_alloc_queues;
-   }
-
if (!ifmp || !tbp[IFLA_ADDRESS])
eth_hw_addr_random(peer);
 
@@ -1151,10 +1159,6 @@ static int veth_newlink(struct net *src_net, struct 
net_device *dev,
 * should be re-allocated
 */
 
-   err = veth_alloc_queues(dev);
-   if (err)
-   goto err_alloc_queues;
-
if (tb[IFLA_ADDRESS] == NULL)
eth_hw_addr_random(dev);
 
@@ -1174,28 +1178,20 @@ static int veth_newlink(struct net 

Re: [PATCH RFC net-next] openvswitch: Queue upcalls to userspace in per-port round-robin order

2018-08-15 Thread Pravin Shelar
Hi Stefano

On Tue, Aug 7, 2018 at 6:31 AM, Stefano Brivio  wrote:
> Hi Pravin,
>
> On Tue, 31 Jul 2018 16:12:03 -0700
> Pravin Shelar  wrote:
>
>> Rather than reducing number of thread down to 1, we could find better
>> number of FDs per port.
>> How about this simple solution:
>> 1. Allocate (N * P) FDs as long as it is under FD limit.
>> 2. If FD limit (-EMFILE) is hit reduce N value by half and repeat step 1.
>
> I still see a few quantitative issues with this approach, other than
> Ben's observation about design (which, by the way, looks entirely
> reasonable to me).
>
> We're talking about a disproportionate amount of sockets in any case.
> We can have up to 2^16 vports here, with 5k vports being rather common,
> and for any reasonable value of N that manages somehow to perturbate the
> distribution of upcalls per thread, we are talking about something well
> in excess of 100k sockets. I think this doesn't really scale.
>
My argument is not about the proposed fairness algorithm. It is about the
cost of that fairness, and I do not see it addressed in any of the
follow-ups. You seem to be worried about the memory cost and fairness
aspects; I am worried about the CPU cost of the solution.
I think the proposed solution solves the fairness issue, but it also
creates a bottleneck in upcall processing. OVS is already known for slow
upcall processing, and this patch adds even more cost to the upcall
handling. The latency of first-packet handling also goes up with this
approach.

I revisited the original patch; here is what I see in terms of added
cost to the existing upcall processing:
1. one "kzalloc(sizeof(*upcall), GFP_ATOMIC);", which involves allocating
and initializing memory.
2. copying the flow key, which is more than 1 KB (upcall->key = *key).
3. acquiring the dp->upcalls.lock spinlock with spin_lock_bh(), which
disables bottom-half processing on the CPU while waiting for the global
lock.
4. iterating the list of queued upcalls; one objective of this is to
avoid out-of-order packets, but I do not see the point of ordering
packets from different streams.
5. signalling the upcall thread after an ovs_dp_upcall_delay() delay,
which adds further latency.
6. the upcall is then handed over to a different thread (a context
switch), likely on a different CPU.
7. the upcall object is freed on the remote CPU.
8. the single lock essentially means OVS kernel datapath upcall
processing is single threaded, no matter how many cores the system has.

I would be interested in how we are going to address these issues.

In the example you gave about the netlink FD issue on a server with 48
cores, how does this solution work when there are 5K ports, each
triggering upcalls? Can you benchmark your patch? Do you have performance
numbers for TCP_CRR with and without this patch? Please also publish
latency numbers for this patch, and turn off megaflows to exercise the
upcall handling.

I understand fairness has a cost, but we need to find the right balance
between performance and fairness. The current fairness scheme is a
lockless algorithm without much computational overhead; did you try to
improve the current algorithm so that it uses a smaller number of ports?


> With the current value for N (3/4 * number of threads) this can even get
> close to /proc/sys/fs/file-max on some systems, and there raising the
> number of allowed file descriptors for ovs-vswitchd isn't a solution
> anymore.
>
> I would instead try to address the concerns that you had about the
> original patch adding fairness in the kernel, rather than trying to
> make the issue appear less severe in ovs-vswitchd.
>
> --
> Stefano