bnxt: card intermittently hanging and dropping link
Hi Michael,

I have user reports of a Broadcom 57412 card intermittently hanging and dropping its link. The problem has been observed on a Dell server with an Ubuntu 4.13 kernel (bnxt_en version 1.7.0) and with an Ubuntu 4.15 kernel (bnxt_en version 1.8.0). It seems to occur while mounting an XFS volume over iSCSI, although running blkid on the partition succeeds, and it seems to be a different volume each time. I've included an excerpt from the kernel log below.

It seems that other people have reported this with a RHEL7 kernel - I have no idea what driver version that would be running, or what workload they were operating, but the warnings and messages look the same:

https://www.dell.com/community/PowerEdge-Hardware-General/Critical-network-bnxt-en-module-crashes-on-14G-servers/td-p/6031769/highlight/true

The forum poster reported that disconnecting the card from power for 5 minutes was sufficient to get things working again, and I have asked our user to test that.

Is this a known issue?
Regards,
Daniel

[ 2662.526151] scsi host14: iSCSI Initiator over TCP/IP
[ 2662.538110] scsi host15: iSCSI Initiator over TCP/IP
[ 2662.547350] scsi host16: iSCSI Initiator over TCP/IP
[ 2662.554660] scsi host17: iSCSI Initiator over TCP/IP
[ 2662.813860] scsi 15:0:0:1: Direct-Access PURE FlashArray PQ: 0 ANSI: 6
[ 2662.813888] scsi 14:0:0:1: Direct-Access PURE FlashArray PQ: 0 ANSI: 6
[ 2662.813972] scsi 16:0:0:1: Direct-Access PURE FlashArray PQ: 0 ANSI: 6
[ 2662.814322] sd 15:0:0:1: Attached scsi generic sg1 type 0
[ 2662.814553] sd 14:0:0:1: Attached scsi generic sg2 type 0
[ 2662.814554] scsi 17:0:0:1: Direct-Access PURE FlashArray PQ: 0 ANSI: 6
[ 2662.814612] sd 16:0:0:1: Attached scsi generic sg3 type 0
[ 2662.815081] sd 15:0:0:1: [sdb] 10737418240 512-byte logical blocks: (5.50 TB/5.00 TiB)
[ 2662.815195] sd 15:0:0:1: [sdb] Write Protect is off
[ 2662.815197] sd 15:0:0:1: [sdb] Mode Sense: 43 00 00 08
[ 2662.815229] sd 14:0:0:1: [sdc] 10737418240 512-byte logical blocks: (5.50 TB/5.00 TiB)
[ 2662.815292] sd 17:0:0:1: Attached scsi generic sg4 type 0
[ 2662.815342] sd 14:0:0:1: [sdc] Write Protect is off
[ 2662.815343] sd 14:0:0:1: [sdc] Mode Sense: 43 00 00 08
[ 2662.815419] sd 15:0:0:1: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[ 2662.815447] sd 16:0:0:1: [sdd] 10737418240 512-byte logical blocks: (5.50 TB/5.00 TiB)
[ 2662.815544] sd 14:0:0:1: [sdc] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[ 2662.815614] sd 16:0:0:1: [sdd] Write Protect is off
[ 2662.815615] sd 16:0:0:1: [sdd] Mode Sense: 43 00 00 08
[ 2662.815882] sd 16:0:0:1: [sdd] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[ 2662.816188] sd 17:0:0:1: [sde] 10737418240 512-byte logical blocks: (5.50 TB/5.00 TiB)
[ 2662.816298] sd 17:0:0:1: [sde] Write Protect is off
[ 2662.816300] sd 17:0:0:1: [sde] Mode Sense: 43 00 00 08
[ 2662.816502] sd 17:0:0:1: [sde] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[ 2662.820080] sd 15:0:0:1: [sdb] Attached SCSI disk
[ 2662.820594] sd 14:0:0:1: [sdc] Attached SCSI disk
[ 2662.820995] sd 17:0:0:1: [sde] Attached SCSI disk
[ 2662.821176] sd 16:0:0:1: [sdd] Attached SCSI disk
[ 2662.913642] device-mapper: multipath round-robin: version 1.2.0 loaded
[ 2663.954001] XFS (dm-2): Mounting V5 Filesystem
[ 2673.186083] connection3:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4295558209, last ping 4295559460, now 4295560768
[ 2673.186135] connection3:0: detected conn error (1022)
[ 2673.186137] connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4295558209, last ping 4295559460, now 4295560768
[ 2673.186168] connection2:0: detected conn error (1022)
[ 2673.186170] connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4295558209, last ping 4295559460, now 4295560768
[ 2673.186211] connection1:0: detected conn error (1022)
[ 2674.209870] connection4:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4295558463, last ping 4295559720, now 4295561024
[ 2674.209924] connection4:0: detected conn error (1022)
[ 2678.560630] session1: session recovery timed out after 5 secs
[ 2678.560641] session2: session recovery timed out after 5 secs
[ 2678.560647] session3: session recovery timed out after 5 secs
[ 2678.951453] device-mapper: multipath: Failing path 8:32.
[ 2678.951509] device-mapper: multipath: Failing path 8:48.
[ 2678.951548] device-mapper: multipath: Failing path 8:16.
[ 2679.584302] session4: session recovery timed out after 5 secs
[ 2679.584313] sd 17:0:0:1: rejecting I/O to offline device
[ 2679.584356] sd 17:0:0:1: [sde] killing request
[ 2679.584362] sd 17:0:0:1: rejecting I/O to offline device
[ 2679.584392] sd 17:0:0:1: [sde] killing request
[ 2679.584401] sd 17:0:0:1: [sde] FAILED Result: hostbyte=DID_NO_CONNECT
[PATCH stable 4.4 5/9] tcp: free batches of packets in tcp_prune_ofo_queue()
From: Eric Dumazet

[ Upstream commit 72cd43ba64fc172a443410ce01645895850844c8 ]

Juha-Matti Tilli reported that malicious peers could inject tiny packets in out_of_order_queue, forcing very expensive calls to tcp_collapse_ofo_queue() and tcp_prune_ofo_queue() for every incoming packet. The out_of_order_queue rb-tree can contain thousands of nodes, and iterating over all of them is not nice.

Before linux-4.9, we would have pruned all packets in the ofo queue in one go, every N packets; N depends on sk_rcvbuf and skb truesize, but is about 7000 packets with the tcp_rmem[2] default of 6 MB.

Since we plan to increase tcp_rmem[2] in the future to cope with modern BDP, we cannot revert to the old behavior without great pain.

The strategy taken in this patch is to purge ~12.5 % of the queue capacity.

Fixes: 36a6503fedda ("tcp: refine tcp_prune_ofo_queue() to not drop all packets")
Signed-off-by: Eric Dumazet
Reported-by: Juha-Matti Tilli
Acked-by: Yuchung Cheng
Acked-by: Soheil Hassas Yeganeh
Signed-off-by: David S. Miller
Signed-off-by: Mao Wenan
---
 net/ipv4/tcp_input.c | 16 ++--
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 12edc4f..32225dc 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4893,22 +4893,26 @@ static bool tcp_prune_ofo_queue(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct rb_node *node, *prev;
+	int goal;
 
 	if (RB_EMPTY_ROOT(&tp->out_of_order_queue))
 		return false;
 
 	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_OFOPRUNED);
-
+	goal = sk->sk_rcvbuf >> 3;
 	node = &tp->ooo_last_skb->rbnode;
 	do {
 		prev = rb_prev(node);
 		rb_erase(node, &tp->out_of_order_queue);
+		goal -= rb_to_skb(node)->truesize;
 		__kfree_skb(rb_to_skb(node));
-		sk_mem_reclaim(sk);
-		if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf &&
-		    !tcp_under_memory_pressure(sk))
-			break;
-
+		if (!prev || goal <= 0) {
+			sk_mem_reclaim(sk);
+			if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf &&
+			    !tcp_under_memory_pressure(sk))
+				break;
+			goal = sk->sk_rcvbuf >> 3;
+		}
 		node = prev;
 	} while (node);
 	tp->ooo_last_skb = rb_entry(prev, struct sk_buff, rbnode);
-- 
1.8.3.1
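The batching idea in the patch above - free skbs until a byte goal of sk_rcvbuf/8 is consumed, and only then pay for the reclaim step - can be modeled outside the kernel. The sketch below is illustrative, not kernel code: `count_reclaim_calls`, the plain array standing in for the rb-tree, and the omission of the early `sk_rmem_alloc` exit are all assumptions made to keep the model small and deterministic.

```c
#include <assert.h>

/* Model of the patch's batching: walk the queue from the tail, freeing
 * skbs while decrementing a byte goal of rcvbuf/8; only when the goal
 * is exhausted (or the queue ends) is the expensive reclaim step run.
 * Returns how many reclaim calls the walk performs.  The real loop can
 * also stop early once sk_rmem_alloc drops below sk_rcvbuf; that exit
 * is omitted here. */
static int count_reclaim_calls(const int *truesize, int n, int rcvbuf)
{
    int goal = rcvbuf >> 3;     /* ~12.5 % of queue capacity */
    int calls = 0;

    for (int i = n - 1; i >= 0; i--) {
        goal -= truesize[i];    /* this skb is erased and freed */
        if (i == 0 || goal <= 0) {
            calls++;            /* sk_mem_reclaim() would run here */
            goal = rcvbuf >> 3; /* re-arm the batch goal */
        }
    }
    return calls;
}
```

With the pre-patch behavior every freed skb paid for a reclaim and a recheck; with the goal, that cost is amortized over roughly rcvbuf/8 bytes of freed truesize per reclaim.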
[PATCH stable 4.4 1/9] Revert "tcp: detect malicious patterns in tcp_collapse_ofo_queue()"
This reverts commit dc6ae4dffd656811dee7151b19545e4cd839d378.

We need to change the simple queue to an RB tree to finally fix CVE-2018-5390, so revert this patch first because of the many conflicts that arise when applying the earlier patch 9f5afeae ("tcp: use an RB tree for ooo receive queue"); after this we will reapply the patch series from Eric.

Signed-off-by: Mao Wenan
---
 net/ipv4/tcp_input.c | 16 ++--
 1 file changed, 2 insertions(+), 14 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 4a261e0..995b2bc 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4791,7 +4791,6 @@ restart:
 static void tcp_collapse_ofo_queue(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	u32 range_truesize, sum_tiny = 0;
 	struct sk_buff *skb = skb_peek(&tp->out_of_order_queue);
 	struct sk_buff *head;
 	u32 start, end;
@@ -4801,7 +4800,6 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
 	start = TCP_SKB_CB(skb)->seq;
 	end = TCP_SKB_CB(skb)->end_seq;
-	range_truesize = skb->truesize;
 	head = skb;
 
 	for (;;) {
@@ -4816,24 +4814,14 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
 		if (!skb ||
 		    after(TCP_SKB_CB(skb)->seq, end) ||
 		    before(TCP_SKB_CB(skb)->end_seq, start)) {
-			/* Do not attempt collapsing tiny skbs */
-			if (range_truesize != head->truesize ||
-			    end - start >= SKB_WITH_OVERHEAD(SK_MEM_QUANTUM)) {
-				tcp_collapse(sk, &tp->out_of_order_queue,
-					     head, skb, start, end);
-			} else {
-				sum_tiny += range_truesize;
-				if (sum_tiny > sk->sk_rcvbuf >> 3)
-					return;
-			}
-
+			tcp_collapse(sk, &tp->out_of_order_queue,
+				     head, skb, start, end);
 			head = skb;
 			if (!skb)
 				break;
 			/* Start new segment */
 			start = TCP_SKB_CB(skb)->seq;
 			end = TCP_SKB_CB(skb)->end_seq;
-			range_truesize = skb->truesize;
 		} else {
 			if (before(TCP_SKB_CB(skb)->seq, start))
 				start = TCP_SKB_CB(skb)->seq;
-- 
1.8.3.1
[PATCH stable 4.4 2/9] Revert "tcp: avoid collapses in tcp_prune_queue() if possible"
This reverts commit 5fbec4801264cb3279ef6ac9c70bcbe2aaef89d5.

We need to change the simple queue to an RB tree to finally fix CVE-2018-5390, so revert this patch first because of the many conflicts that arise when applying the earlier patch 9f5afeae ("tcp: use an RB tree for ooo receive queue"); after this we will reapply the patch series from Eric.

Signed-off-by: Mao Wenan
---
 net/ipv4/tcp_input.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 995b2bc..df2f342 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4877,9 +4877,6 @@ static int tcp_prune_queue(struct sock *sk)
 	else if (tcp_under_memory_pressure(sk))
 		tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U * tp->advmss);
 
-	if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf)
-		return 0;
-
 	tcp_collapse_ofo_queue(sk);
 	if (!skb_queue_empty(&sk->sk_receive_queue))
 		tcp_collapse(sk, &sk->sk_receive_queue,
-- 
1.8.3.1
[PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)
There are five patches to fix CVE-2018-5390 in the latest mainline branch, but only two of them exist in stable 4.4 and 3.18:

dc6ae4d tcp: detect malicious patterns in tcp_collapse_ofo_queue()
5fbec48 tcp: avoid collapses in tcp_prune_queue() if possible

I have tested with the stable 4.4 kernel and found the CPU usage was still very high, so I think those two patches alone can't fix CVE-2018-5390.

Test results:
with fix patches: 78.2% ksoftirqd
without fix patches: 90% ksoftirqd

I then tried to imitate 72cd43ba (tcp: free batches of packets in tcp_prune_ofo_queue()) and drop at least 12.5 % of sk_rcvbuf to mitigate malicious attacks while keeping the simple queue instead of an RB tree, but the result was not good. Analysing the stable 4.4 code and debugging the system showed that searching the ofo queue (TCP ofo uses a simple queue here) costs the most cycles.

So I backported "tcp: use an RB tree for ooo receive queue" to use an RB tree instead of the simple queue, and then backported Eric Dumazet's five fix patches from mainline. The good news is that ksoftirqd usage drops to about 20%, which now matches mainline.

Stable 4.4 has already backported two patches:
f4a3313d (tcp: avoid collapses in tcp_prune_queue() if possible)
3d4bf93a (tcp: detect malicious patterns in tcp_collapse_ofo_queue())

To change the simple queue to an RB tree and resolve this for good, we should first apply 9f5afeae (tcp: use an RB tree for ooo receive queue), but 9f5afeae conflicts heavily with 3d4bf93a and f4a3313d, which are part of Eric's mainline series fixing CVE-2018-5390. So I first revert those patches in stable 4.4, then apply 9f5afeae, and then reapply the five patches from Eric.
Eric Dumazet (6):
  tcp: increment sk_drops for dropped rx packets
  tcp: free batches of packets in tcp_prune_ofo_queue()
  tcp: avoid collapses in tcp_prune_queue() if possible
  tcp: detect malicious patterns in tcp_collapse_ofo_queue()
  tcp: call tcp_drop() from tcp_data_queue_ofo()
  tcp: add tcp_ooo_try_coalesce() helper

Mao Wenan (2):
  Revert "tcp: detect malicious patterns in tcp_collapse_ofo_queue()"
  Revert "tcp: avoid collapses in tcp_prune_queue() if possible"

Yaogong Wang (1):
  tcp: use an RB tree for ooo receive queue

 include/linux/skbuff.h   |   8 +
 include/linux/tcp.h      |   7 +-
 include/net/sock.h       |   7 +
 include/net/tcp.h        |   2 +-
 net/core/skbuff.c        |  19 +++
 net/ipv4/tcp.c           |   4 +-
 net/ipv4/tcp_input.c     | 412 +--
 net/ipv4/tcp_ipv4.c      |   3 +-
 net/ipv4/tcp_minisocks.c |   1 -
 net/ipv6/tcp_ipv6.c      |   1 +
 10 files changed, 294 insertions(+), 170 deletions(-)
-- 
1.8.3.1
[PATCH stable 4.4 9/9] tcp: add tcp_ooo_try_coalesce() helper
From: Eric Dumazet

[ Upstream commit 58152ecbbcc6a0ce7fddd5bf5f6ee535834ece0c ]

In case an skb in out_of_order_queue is the result of multiple skbs coalescing, we would like to get a proper gso_segs counter tracking, so that a future tcp_drop() can report an accurate number.

I chose not to implement this tracking for skbs in the receive queue, since they are not dropped unless the socket is disconnected.

Signed-off-by: Eric Dumazet
Acked-by: Soheil Hassas Yeganeh
Acked-by: Yuchung Cheng
Signed-off-by: David S. Miller
Signed-off-by: Mao Wenan
---
 net/ipv4/tcp_input.c | 23 +-
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 96a1e0d..fdb5509 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4296,6 +4296,23 @@ static bool tcp_try_coalesce(struct sock *sk,
 	return true;
 }
 
+static bool tcp_ooo_try_coalesce(struct sock *sk,
+				 struct sk_buff *to,
+				 struct sk_buff *from,
+				 bool *fragstolen)
+{
+	bool res = tcp_try_coalesce(sk, to, from, fragstolen);
+
+	/* In case tcp_drop() is called later, update to->gso_segs */
+	if (res) {
+		u32 gso_segs = max_t(u16, 1, skb_shinfo(to)->gso_segs) +
+			       max_t(u16, 1, skb_shinfo(from)->gso_segs);
+
+		skb_shinfo(to)->gso_segs = min_t(u32, gso_segs, 0xFFFF);
+	}
+	return res;
+}
+
 static void tcp_drop(struct sock *sk, struct sk_buff *skb)
 {
 	sk_drops_add(sk, skb);
@@ -4422,7 +4439,8 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 	/* In the typical case, we are adding an skb to the end of the list.
 	 * Use of ooo_last_skb avoids the O(Log(N)) rbtree lookup.
 	 */
-	if (tcp_try_coalesce(sk, tp->ooo_last_skb, skb, &fragstolen)) {
+	if (tcp_ooo_try_coalesce(sk, tp->ooo_last_skb,
+				 skb, &fragstolen)) {
 coalesce_done:
 		tcp_grow_window(sk, skb);
 		kfree_skb_partial(skb, fragstolen);
@@ -4467,7 +4485,8 @@ coalesce_done:
 			tcp_drop(sk, skb1);
 			goto add_sack;
 		}
-	} else if (tcp_try_coalesce(sk, skb1, skb, &fragstolen)) {
+	} else if (tcp_ooo_try_coalesce(sk, skb1,
+					skb, &fragstolen)) {
 		goto coalesce_done;
 	}
 	p = &parent->rb_right;
-- 
1.8.3.1
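The bookkeeping done by the helper above reduces to a saturating sum of the two gso_segs counters, clamped to what a 16-bit field can hold. A standalone sketch (the function name and plain `unsigned int` types are illustrative, not the kernel's):

```c
#include <assert.h>

/* Each skb contributes max(1, gso_segs) segments: a non-GSO skb has
 * gso_segs == 0 but still carries one wire segment.  The combined
 * count is clamped to 0xFFFF because shinfo->gso_segs is 16 bits. */
static unsigned int coalesced_gso_segs(unsigned int to_segs,
                                       unsigned int from_segs)
{
    unsigned int a = to_segs > 1 ? to_segs : 1;
    unsigned int b = from_segs > 1 ? from_segs : 1;
    unsigned int sum = a + b;

    return sum > 0xFFFF ? 0xFFFF : sum;
}
```

The clamp matters because a later tcp_drop() reports this counter; letting it wrap in 16 bits would under-report drops for heavily coalesced ofo skbs.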
[PATCH stable 4.4 3/9] tcp: increment sk_drops for dropped rx packets
From: Eric Dumazet

[ Upstream commit 532182cd610782db8c18230c2747626562032205 ]

Now that ss can report sk_drops, we can instruct TCP to increment this per-socket counter when it drops an incoming frame, to refine monitoring and debugging. A following patch takes care of listener drops.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
Signed-off-by: Mao Wenan
---
 include/net/sock.h   |  7 +++
 net/ipv4/tcp_input.c | 33 -
 net/ipv4/tcp_ipv4.c  |  1 +
 net/ipv6/tcp_ipv6.c  |  1 +
 4 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 3d5ff74..5770757 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2139,6 +2139,13 @@ sock_skb_set_dropcount(const struct sock *sk, struct sk_buff *skb)
 	SOCK_SKB_CB(skb)->dropcount = atomic_read(&sk->sk_drops);
 }
 
+static inline void sk_drops_add(struct sock *sk, const struct sk_buff *skb)
+{
+	int segs = max_t(u16, 1, skb_shinfo(skb)->gso_segs);
+
+	atomic_add(segs, &sk->sk_drops);
+}
+
 void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk,
			   struct sk_buff *skb);
 void __sock_recv_wifi_status(struct msghdr *msg, struct sock *sk,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index df2f342..5fb4e80 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4296,6 +4296,12 @@ static bool tcp_try_coalesce(struct sock *sk,
 	return true;
 }
 
+static void tcp_drop(struct sock *sk, struct sk_buff *skb)
+{
+	sk_drops_add(sk, skb);
+	__kfree_skb(skb);
+}
+
 /* This one checks to see if we can put data from the
  * out_of_order queue into the receive_queue.
  */
@@ -4320,7 +4326,7 @@ static void tcp_ofo_queue(struct sock *sk)
 		__skb_unlink(skb, &tp->out_of_order_queue);
 		if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt)) {
 			SOCK_DEBUG(sk, "ofo packet was already received\n");
-			__kfree_skb(skb);
+			tcp_drop(sk, skb);
 			continue;
 		}
 		SOCK_DEBUG(sk, "ofo requeuing : rcv_next %X seq %X - %X\n",
@@ -4372,7 +4378,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 
 	if (unlikely(tcp_try_rmem_schedule(sk, skb, skb->truesize))) {
 		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFODROP);
-		__kfree_skb(skb);
+		tcp_drop(sk, skb);
 		return;
 	}
 
@@ -4436,7 +4442,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 		if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
 			/* All the bits are present. Drop. */
 			NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOMERGE);
-			__kfree_skb(skb);
+			tcp_drop(sk, skb);
 			skb = NULL;
 			tcp_dsack_set(sk, seq, end_seq);
 			goto add_sack;
@@ -4475,7 +4481,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 		tcp_dsack_extend(sk, TCP_SKB_CB(skb1)->seq,
				 TCP_SKB_CB(skb1)->end_seq);
 		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOMERGE);
-		__kfree_skb(skb1);
+		tcp_drop(sk, skb1);
 	}
 
 add_sack:
@@ -4558,12 +4564,13 @@ err:
 static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	int eaten = -1;
 	bool fragstolen = false;
+	int eaten = -1;
 
-	if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq)
-		goto drop;
-
+	if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq) {
+		__kfree_skb(skb);
+		return;
+	}
 	skb_dst_drop(skb);
 	__skb_pull(skb, tcp_hdr(skb)->doff * 4);
 
@@ -4645,7 +4652,7 @@ out_of_window:
 		tcp_enter_quickack_mode(sk, TCP_MAX_QUICKACKS);
 		inet_csk_schedule_ack(sk);
 drop:
-		__kfree_skb(skb);
+		tcp_drop(sk, skb);
 		return;
 	}
 
@@ -5220,7 +5227,7 @@ syn_challenge:
 	return true;
 
 discard:
-	__kfree_skb(skb);
+	tcp_drop(sk, skb);
 	return false;
 }
 
@@ -5438,7 +5445,7 @@ csum_error:
 	TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_INERRS);
 
 discard:
-	__kfree_skb(skb);
+	tcp_drop(sk, skb);
 }
 EXPORT_SYMBOL(tcp_rcv_established);
 
@@ -5668,7 +5675,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
					  TCP_DELACK_MAX, TCP_RTO_MAX);
 
 discard:
-			__kfree_skb(skb);
+			tcp_drop(sk, skb);
 			return 0;
 		} else {
 			tcp_send_ack(sk);
@@ -6025,7 +6032,7 @@ int tcp_rcv_state_process(struct sock *sk,
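The new sk_drops_add() above counts drops in segments rather than skbs, so a dropped GSO aggregate accounts for every wire segment it carried. A minimal userspace model of that accounting (names and the plain `long` counter are illustrative assumptions, not the kernel's atomic type):

```c
#include <assert.h>

/* Mirror of the sk_drops_add() accounting: a dropped skb adds
 * max(1, gso_segs) to the socket's drop counter, so tools like `ss`
 * report wire segments, not aggregated buffers. */
static void drops_add(long *sk_drops, unsigned int gso_segs)
{
    unsigned int segs = gso_segs > 1 ? gso_segs : 1;

    *sk_drops += segs;
}
```

Counting per segment keeps the metric comparable whether or not GRO has coalesced the incoming frames before TCP sees them.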
[PATCH stable 4.4 6/9] tcp: avoid collapses in tcp_prune_queue() if possible
From: Eric Dumazet

[ Upstream commit f4a3313d8e2ca9fd8d8f45e40a2903ba782607e7 ]

Right after a TCP flow is created, receiving tiny out-of-order packets always hits the condition:

	if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
		tcp_clamp_window(sk);

tcp_clamp_window() increases sk_rcvbuf to match sk_rmem_alloc (guarded by tcp_rmem[2]).

Calling tcp_collapse_ofo_queue() in this case is not useful, and offers a O(N^2) attack surface to malicious peers.

Better not attempt anything before full queue capacity is reached, forcing the attacker to spend lots of resources and allowing us to more easily detect the abuse.

Signed-off-by: Eric Dumazet
Acked-by: Soheil Hassas Yeganeh
Acked-by: Yuchung Cheng
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
Signed-off-by: Mao Wenan
---
 net/ipv4/tcp_input.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 32225dc..77130ae 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4948,6 +4948,9 @@ static int tcp_prune_queue(struct sock *sk)
 	else if (tcp_under_memory_pressure(sk))
 		tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U * tp->advmss);
 
+	if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf)
+		return 0;
+
 	tcp_collapse_ofo_queue(sk);
 	if (!skb_queue_empty(&sk->sk_receive_queue))
 		tcp_collapse(sk, &sk->sk_receive_queue, NULL,
-- 
1.8.3.1
[PATCH stable 4.4 4/9] tcp: use an RB tree for ooo receive queue
From: Yaogong Wang

[ Upstream commit 9f5afeae51526b3ad7b7cb21ee8b145ce6ea7a7a ]

Over the years, TCP BDP has increased by several orders of magnitude, and some people are considering reaching the 2 Gbytes limit. Even with the current window scale limit of 14, ~1 Gbyte maps to ~740,000 MSS.

In the presence of packet losses (or reorders), TCP stores incoming packets into an out-of-order queue, and the number of skbs sitting there waiting for the missing packets to be received can be in the 10^5 range.

Most packets are appended to the tail of this queue, and when packets can finally be transferred to the receive queue, we scan the queue from its head. However, in the presence of heavy losses, we might have to find an arbitrary point in this queue, involving a linear scan for every incoming packet, throwing away cpu caches.

This patch converts it to an RB tree, to get bounded latencies.

Yaogong wrote a preliminary patch about 2 years ago. Eric did the rebase, added the ofo_last_skb cache, polishing and tests.

Tested with the network dropping between 1 and 10% of packets, with good success (about 30% increase of throughput in stress tests).

The next step would be to also use an RB tree for the write queue at the sender side ;)

Signed-off-by: Yaogong Wang
Signed-off-by: Eric Dumazet
Cc: Yuchung Cheng
Cc: Neal Cardwell
Cc: Ilpo Järvinen
Acked-By: Ilpo Järvinen
Signed-off-by: David S. Miller
Signed-off-by: Mao Wenan
---
 include/linux/skbuff.h   |   8 ++
 include/linux/tcp.h      |   7 +-
 include/net/tcp.h        |   2 +-
 net/core/skbuff.c        |  19 +++
 net/ipv4/tcp.c           |   4 +-
 net/ipv4/tcp_input.c     | 354 +++
 net/ipv4/tcp_ipv4.c      |   2 +-
 net/ipv4/tcp_minisocks.c |   1 -
 8 files changed, 241 insertions(+), 156 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c28bd8b..a490dd7 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2273,6 +2273,8 @@ static inline void __skb_queue_purge(struct sk_buff_head *list)
 		kfree_skb(skb);
 }
 
+void skb_rbtree_purge(struct rb_root *root);
+
 void *netdev_alloc_frag(unsigned int fragsz);
 
 struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int length,
@@ -2807,6 +2809,12 @@ static inline int pskb_trim_rcsum(struct sk_buff *skb, unsigned int len)
 	return __pskb_trim(skb, len);
 }
 
+#define rb_to_skb(rb)      rb_entry_safe(rb, struct sk_buff, rbnode)
+#define skb_rb_first(root) rb_to_skb(rb_first(root))
+#define skb_rb_last(root)  rb_to_skb(rb_last(root))
+#define skb_rb_next(skb)   rb_to_skb(rb_next(&(skb)->rbnode))
+#define skb_rb_prev(skb)   rb_to_skb(rb_prev(&(skb)->rbnode))
+
 #define skb_queue_walk(queue, skb) \
		for (skb = (queue)->next;					\
		     skb != (struct sk_buff *)(queue);				\
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 5b6df1a..747404d 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -279,10 +279,9 @@ struct tcp_sock {
 	struct sk_buff* lost_skb_hint;
 	struct sk_buff *retransmit_skb_hint;
 
-	/* OOO segments go in this list. Note that socket lock must be held,
-	 * as we do not use sk_buff_head lock.
-	 */
-	struct sk_buff_head	out_of_order_queue;
+	/* OOO segments go in this rbtree. Socket lock must be held. */
+	struct rb_root	out_of_order_queue;
+	struct sk_buff	*ooo_last_skb; /* cache rb_last(out_of_order_queue) */
 
 	/* SACKs data, these 2 need to be together (see tcp_options_write) */
 	struct tcp_sack_block duplicate_sack[1]; /* D-SACK block */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index cac4a6ad..8bc259d 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -649,7 +649,7 @@ static inline void tcp_fast_path_check(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	if (skb_queue_empty(&tp->out_of_order_queue) &&
+	if (RB_EMPTY_ROOT(&tp->out_of_order_queue) &&
	    tp->rcv_wnd &&
	    atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf &&
	    !tp->urg_data)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 55be076..9703924 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2378,6 +2378,25 @@ void skb_queue_purge(struct sk_buff_head *list)
 EXPORT_SYMBOL(skb_queue_purge);
 
 /**
+ * skb_rbtree_purge - empty a skb rbtree
+ * @root: root of the rbtree to empty
+ *
+ * Delete all buffers on an &sk_buff rbtree. Each buffer is removed from
+ * the list and one reference dropped. This function does not take
+ * any lock. Synchronization should be handled by the caller (e.g., TCP
+ * out-of-order queue is protected by the socket lock).
+ */
+void skb_rbtree_purge(struct rb_root *root)
+{
+	struct sk_buff *skb, *next;
+
+	rbtree_postorder_for_each_entry_safe(skb, next, root, rbnode)
+		kfree_skb(skb);
+
+	*root = RB_ROOT;
+}
+
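The motivation for the rbtree conversion above is the cost of locating an arbitrary insertion point: a list walk is linear in queue length, while a balanced tree is logarithmic. The contrast can be sketched with a sorted array of sequence numbers; this is a hedged stand-in for the queue (comparison counts are the point, not the data structure), and both function names are invented for the example.

```c
#include <assert.h>

/* Index of the first element >= key, counting comparisons - the
 * linked-list analogue of finding where an ofo segment belongs. */
static int linear_find(const unsigned int *seq, int n, unsigned int key,
                       int *cmps)
{
    int i;

    for (i = 0; i < n; i++) {
        ++*cmps;
        if (seq[i] >= key)
            break;
    }
    return i;
}

/* Same result via binary search - the rbtree analogue, with the
 * comparison count bounded by log2(n) instead of n. */
static int binary_find(const unsigned int *seq, int n, unsigned int key,
                       int *cmps)
{
    int lo = 0, hi = n;

    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;

        ++*cmps;
        if (seq[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}
```

With ~10^5 skbs waiting in the ofo queue, as the commit message describes, this is the difference between ~10^5 and ~17 comparisons per incoming packet.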
[PATCH stable 4.4 8/9] tcp: call tcp_drop() from tcp_data_queue_ofo()
From: Eric Dumazet

[ Upstream commit 8541b21e781a22dce52a74fef0b9bed00404a1cd ]

In order to be able to give better diagnostics and detect malicious traffic, we need to have better sk->sk_drops tracking.

Fixes: 9f5afeae5152 ("tcp: use an RB tree for ooo receive queue")
Signed-off-by: Eric Dumazet
Acked-by: Soheil Hassas Yeganeh
Acked-by: Yuchung Cheng
Signed-off-by: David S. Miller
Signed-off-by: Mao Wenan
---
 net/ipv4/tcp_input.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c48924f..96a1e0d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4445,7 +4445,7 @@ coalesce_done:
 			/* All the bits are present. Drop. */
 			NET_INC_STATS(sock_net(sk),
				      LINUX_MIB_TCPOFOMERGE);
-			__kfree_skb(skb);
+			tcp_drop(sk, skb);
 			skb = NULL;
 			tcp_dsack_set(sk, seq, end_seq);
 			goto add_sack;
@@ -4464,7 +4464,7 @@ coalesce_done:
					 TCP_SKB_CB(skb1)->end_seq);
 			NET_INC_STATS(sock_net(sk),
				      LINUX_MIB_TCPOFOMERGE);
-			__kfree_skb(skb1);
+			tcp_drop(sk, skb1);
 			goto add_sack;
 		}
 	} else if (tcp_try_coalesce(sk, skb1, skb, &fragstolen)) {
-- 
1.8.3.1
[PATCH stable 4.4 7/9] tcp: detect malicious patterns in tcp_collapse_ofo_queue()
From: Eric Dumazet

[ Upstream commit 3d4bf93ac12003f9b8e1e2de37fe27983deebdcf ]

In case an attacker feeds tiny packets completely out of order, tcp_collapse_ofo_queue() might scan the whole rb-tree, performing expensive copies, but not changing socket memory usage at all.

1) Do not attempt to collapse tiny skbs.
2) Add logic to exit early when too many tiny skbs are detected.

We prefer not doing aggressive collapsing (which copies packets) for pathological flows, and revert to tcp_prune_ofo_queue(), which will be less expensive.

In the future, we might add the possibility of terminating flows that are proven to be malicious.

Signed-off-by: Eric Dumazet
Acked-by: Soheil Hassas Yeganeh
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
Signed-off-by: Mao Wenan
---
 net/ipv4/tcp_input.c | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 77130ae..c48924f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4845,6 +4845,7 @@ end:
 static void tcp_collapse_ofo_queue(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	u32 range_truesize, sum_tiny = 0;
 	struct sk_buff *skb, *head;
 	struct rb_node *p;
 	u32 start, end;
@@ -4863,6 +4864,7 @@ new_range:
 	}
 	start = TCP_SKB_CB(skb)->seq;
 	end = TCP_SKB_CB(skb)->end_seq;
+	range_truesize = skb->truesize;
 
 	for (head = skb;;) {
 		skb = tcp_skb_next(skb, NULL);
@@ -4873,11 +4875,21 @@ new_range:
 		if (!skb ||
 		    after(TCP_SKB_CB(skb)->seq, end) ||
 		    before(TCP_SKB_CB(skb)->end_seq, start)) {
-			tcp_collapse(sk, NULL, &tp->out_of_order_queue,
-				     head, skb, start, end);
+			/* Do not attempt collapsing tiny skbs */
+			if (range_truesize != head->truesize ||
+			    end - start >= SKB_WITH_OVERHEAD(SK_MEM_QUANTUM)) {
+				tcp_collapse(sk, NULL, &tp->out_of_order_queue,
+					     head, skb, start, end);
+			} else {
+				sum_tiny += range_truesize;
+				if (sum_tiny > sk->sk_rcvbuf >> 3)
+					return;
+			}
+
 			goto new_range;
 		}
 
+		range_truesize += skb->truesize;
 		if (unlikely(before(TCP_SKB_CB(skb)->seq, start)))
 			start = TCP_SKB_CB(skb)->seq;
 		if (after(TCP_SKB_CB(skb)->end_seq, end))
-- 
1.8.3.1
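The tiny-range filter above can be modeled as a pure decision function: a range is collapsed only if it holds more than one skb (its summed truesize grew past the head's) or spans enough sequence space; otherwise its truesize accrues toward a bail-out budget of sk_rcvbuf/8. This is a hedged standalone sketch - the struct, function name, and the `quantum` parameter standing in for SKB_WITH_OVERHEAD(SK_MEM_QUANTUM) are all invented for illustration.

```c
#include <assert.h>

struct ofo_range {
    unsigned int len;            /* end_seq - start_seq of the range */
    unsigned int truesize;       /* summed truesize of skbs in range */
    unsigned int head_truesize;  /* truesize of the first skb alone  */
};

/* Returns how many ranges would be collapsed before the walk either
 * finishes or gives up because too much tiny-skb truesize accrued. */
static int collapse_ranges(const struct ofo_range *r, int n,
                           unsigned int rcvbuf, unsigned int quantum)
{
    unsigned int sum_tiny = 0;
    int collapsed = 0;

    for (int i = 0; i < n; i++) {
        if (r[i].truesize != r[i].head_truesize || r[i].len >= quantum) {
            collapsed++;          /* worth the copy: tcp_collapse() */
        } else {
            sum_tiny += r[i].truesize;
            if (sum_tiny > rcvbuf >> 3)
                return collapsed; /* pathological flow: bail out */
        }
    }
    return collapsed;
}
```

The bail-out is what caps the attacker's leverage: a flow made entirely of single tiny skbs stops being scanned once an eighth of the receive buffer's worth of truesize has been examined without any collapsible range.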
Re: [PATCH stable 4.4 1/9] Revert "tcp: detect malicious patterns in tcp_collapse_ofo_queue()"
On 2018/8/15 21:18, Greg KH wrote:
> On Wed, Aug 15, 2018 at 09:21:00PM +0800, Mao Wenan wrote:
>> This reverts commit dc6ae4dffd656811dee7151b19545e4cd839d378.
>
> I need a reason why, and a signed-off-by line :(

Stable 4.4 only backports two patches to fix CVE-2018-5390. I have tested them and they can't fix it fully, because of the simple queue used in this older version, so we need to change the simple queue to an RB tree to finally resolve it. But 9f5afeae ("tcp: use an RB tree for ooo receive queue") has many conflicts with "tcp: detect malicious patterns in tcp_collapse_ofo_queue()" and "tcp: avoid collapses in tcp_prune_queue() if possible", which are part of the patch series from Eric in mainline fixing CVE-2018-5390. So I need to revert part of the patches in stable 4.4 first, then apply 9f5afeae, and then reapply the five patches from Eric.

>
> thanks,
>
> greg k-h
Re: [PATCH stable 4.4 0/9] fix SegmentSmack (CVE-2018-5390)
On 2018/8/15 23:41, Greg KH wrote:
> On Wed, Aug 15, 2018 at 03:24:32PM +0200, Greg KH wrote:
>> On Wed, Aug 15, 2018 at 09:20:59PM +0800, Mao Wenan wrote:
>>> There are five patches to fix CVE-2018-5390 in latest mainline
>>> branch, but only two patches exist in stable 4.4 and 3.18:
>>> dc6ae4d tcp: detect malicious patterns in tcp_collapse_ofo_queue()
>>> 5fbec48 tcp: avoid collapses in tcp_prune_queue() if possible
>>> but I have tested with these patches, and found the cpu usage was very high.
>>> test results:
>>> with fix patch: 78.2% ksoftirqd
>>> no fix patch: 90% ksoftirqd
>>>
>>> After analysing the codes of stable 4.4, and debuging the
>>> system, the search of ofo_queue(tcp ofo using a simple queue) cost more
>>> cycles.
>>> So I think only two patches can't fix the CVE-2018-5390.
>>> So I try to backport "tcp: use an RB tree for ooo receive queue" using RB tree
>>> instead of simple queue, then backport Eric Dumazet 5 fixed patches in mainline,
>>> good news is that ksoftirqd is turn to about 20%, which is the same with
>>> mainline now.
>>
>> Thanks for doing this work, I had some questions on the individual
>> patches. Can you address them and resend?
>
> Also, always cc: the stable@vger list when sending stable patches so
> that others can review and comment on them.

OK, I will resend the patches later after refining them.

>
> thanks,
>
> greg k-h
Re: [PATCH net] veth: Free queues on link delete
Hi David,

I love your patch! Yet something to improve:

[auto build test ERROR on net/master]

url: https://github.com/0day-ci/linux/commits/dsahern-kernel-org/veth-Free-queues-on-link-delete/20180816-073955
config: i386-randconfig-x016-201832 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386

All errors (new ones prefixed by >>):

   drivers//net/veth.c: In function 'veth_dellink':
>> drivers//net/veth.c:476:2: error: implicit declaration of function 'veth_free_queues'; did you mean 'veth_dev_free'? [-Werror=implicit-function-declaration]
     veth_free_queues(dev);
     ^~~~~~~~~~~~~~~~
     veth_dev_free
   cc1: some warnings being treated as errors

vim +476 drivers//net/veth.c

   470	
   471	static void veth_dellink(struct net_device *dev, struct list_head *head)
   472	{
   473		struct veth_priv *priv;
   474		struct net_device *peer;
   475	
 > 476		veth_free_queues(dev);
   477		priv = netdev_priv(dev);
   478		peer = rtnl_dereference(priv->peer);
   479	
   480		/* Note : dellink() is called from default_device_exit_batch(),
   481		 * before a rcu_synchronize() point. The devices are guaranteed
   482		 * not being freed before one RCU grace period.
   483		 */
   484		RCU_INIT_POINTER(priv->peer, NULL);
   485		unregister_netdevice_queue(dev, head);
   486	
   487		if (peer) {
   488			priv = netdev_priv(peer);
   489			RCU_INIT_POINTER(priv->peer, NULL);
   490			unregister_netdevice_queue(peer, head);
   491		}
   492	}
   493	

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

.config.gz (application/gzip attachment)
Re: [PATCH v2 net-next 0/8] net: dsa: microchip: Modify KSZ9477 DSA driver in preparation to add other KSZ switch drivers
On 12/05/2017 05:46 PM, tristram...@microchip.com wrote:
> From: Tristram Ha
>
> This series of patches is to modify the original KSZ9477 DSA driver so
> that other KSZ switch drivers can be added and use the common code.
>
> There are several steps to accomplish this. First is to rename some
> function names with a prefix to indicate chip-specific functions.
> Second is to move common code into a header that can be shared. Last is
> to modify tag_ksz.c so that it can handle the many tail tag formats
> used by different KSZ switch drivers.
>
> ksz_common.c will contain the common code used by all KSZ switch drivers.
> ksz9477.c will contain KSZ9477 code from the original ksz_common.c.
> ksz9477_spi.c is renamed from ksz_spi.c.
> ksz9477_reg.h is renamed from ksz_9477_reg.h.
> ksz_common.h is added to provide common code access to KSZ switch
> drivers.
> ksz_spi.h is added to provide common SPI access functions to KSZ SPI
> drivers.

Is something gating this series from getting included? It's been nearly 8
months now and it has not been included nor resubmitted. Any plans to rebase
the patch series and work towards inclusion in net-next when it opens back
up again?

Thank you!

> v2
> - Initialize reg_mutex before use
> - The alu_mutex is only used inside chip specific functions
>
> v1
> - Each patch in the set is self-contained
> - Use ksz9477 prefix to indicate KSZ9477 specific code
>
> Tristram Ha (8):
>   Replace license with GPL.
>   Clean up code according to patch check suggestions.
>   Initialize mutex before use.
>   Rename some functions with ksz9477 prefix to separate chip specific
>     code from common code.
>   Rename ksz_spi.c to ksz9477_spi.c and update Kconfig in preparation to
>     add more KSZ switch drivers.
>   Break KSZ9477 DSA driver into two files in preparation to add more KSZ
>     switch drivers.
>   Add common functions in ksz_common.h so that other KSZ switch drivers
>     can access code in ksz_common.c. Add ksz_spi.h for common functions
>     used by KSZ switch SPI drivers.
>   Prepare PHY for proper advertisement and get link status for the port.
>   Rename ksz_9477_reg.h to ksz9477_reg.h for consistency as the product
>     name is always KSZ.
>
> drivers/net/dsa/microchip/Kconfig                  |   12 +-
> drivers/net/dsa/microchip/Makefile                 |    4 +-
> drivers/net/dsa/microchip/ksz9477.c                | 1331 ++++++++++++++
> .../microchip/{ksz_9477_reg.h => ksz9477_reg.h}    |   23 +-
> drivers/net/dsa/microchip/ksz9477_spi.c            |  188 +++
> drivers/net/dsa/microchip/ksz_common.c             | 1176 +++--
> drivers/net/dsa/microchip/ksz_common.h             |  229 +++
> drivers/net/dsa/microchip/ksz_priv.h               |  256 ++--
> drivers/net/dsa/microchip/ksz_spi.c                |  216 ---
> drivers/net/dsa/microchip/ksz_spi.h                |   82 ++
> 10 files changed, 2122 insertions(+), 1395 deletions(-)
> create mode 100644 drivers/net/dsa/microchip/ksz9477.c
> rename drivers/net/dsa/microchip/{ksz_9477_reg.h => ksz9477_reg.h} (98%)
> create mode 100644 drivers/net/dsa/microchip/ksz9477_spi.c
> create mode 100644 drivers/net/dsa/microchip/ksz_common.h
> delete mode 100644 drivers/net/dsa/microchip/ksz_spi.c
> create mode 100644 drivers/net/dsa/microchip/ksz_spi.h

-- Florian
Re: [PATCH] net: dsa: add support for ksz9897 ethernet switch
On 08/15/2018 08:51 AM, Lad Prabhakar wrote: > From: "Lad, Prabhakar" > > ksz9477 is superset of ksz9xx series, driver just works > out of the box for ksz9897 chip with this patch. net-next is currently closed, but other than that: Reviewed-by: Florian Fainelli > > Signed-off-by: Lad, Prabhakar > --- > Documentation/devicetree/bindings/net/dsa/ksz.txt | 4 +++- > drivers/net/dsa/microchip/ksz_common.c| 9 + > drivers/net/dsa/microchip/ksz_spi.c | 1 + > 3 files changed, 13 insertions(+), 1 deletion(-) > > diff --git a/Documentation/devicetree/bindings/net/dsa/ksz.txt > b/Documentation/devicetree/bindings/net/dsa/ksz.txt > index a700943..ac145b8 100644 > --- a/Documentation/devicetree/bindings/net/dsa/ksz.txt > +++ b/Documentation/devicetree/bindings/net/dsa/ksz.txt > @@ -4,7 +4,9 @@ Microchip KSZ Series Ethernet switches > Required properties: > > - compatible: For external switch chips, compatible string must be exactly > one > - of: "microchip,ksz9477" > + of the following: > + - "microchip,ksz9477" > + - "microchip,ksz9897" > > See Documentation/devicetree/bindings/net/dsa/dsa.txt for a list of > additional > required and optional properties. 
> diff --git a/drivers/net/dsa/microchip/ksz_common.c > b/drivers/net/dsa/microchip/ksz_common.c > index 7210c49..54e0ca6 100644 > --- a/drivers/net/dsa/microchip/ksz_common.c > +++ b/drivers/net/dsa/microchip/ksz_common.c > @@ -1102,6 +1102,15 @@ static const struct ksz_chip_data ksz_switch_chips[] = > { > .cpu_ports = 0x7F, /* can be configured as cpu port */ > .port_cnt = 7, /* total physical port count */ > }, > + { > + .chip_id = 0x00989700, > + .dev_name = "KSZ9897", > + .num_vlans = 4096, > + .num_alus = 4096, > + .num_statics = 16, > + .cpu_ports = 0x7F, /* can be configured as cpu port */ > + .port_cnt = 7, /* total physical port count */ > + }, > }; > > static int ksz_switch_init(struct ksz_device *dev) > diff --git a/drivers/net/dsa/microchip/ksz_spi.c > b/drivers/net/dsa/microchip/ksz_spi.c > index c519469..8c1778b 100644 > --- a/drivers/net/dsa/microchip/ksz_spi.c > +++ b/drivers/net/dsa/microchip/ksz_spi.c > @@ -195,6 +195,7 @@ static int ksz_spi_remove(struct spi_device *spi) > > static const struct of_device_id ksz_dt_ids[] = { > { .compatible = "microchip,ksz9477" }, > + { .compatible = "microchip,ksz9897" }, > {}, > }; > MODULE_DEVICE_TABLE(of, ksz_dt_ids); > -- Florian
Re: [PATCH bpf] bpf: fix a rcu usage warning in bpf_prog_array_copy_core()
On 08/15/2018 02:08 AM, Roman Gushchin wrote: > On Tue, Aug 14, 2018 at 04:59:45PM -0700, Alexei Starovoitov wrote: >> On Tue, Aug 14, 2018 at 11:01:12AM -0700, Yonghong Song wrote: >>> Commit 394e40a29788 ("bpf: extend bpf_prog_array to store pointers >>> to the cgroup storage") refactored the bpf_prog_array_copy_core() >>> to accommodate new structure bpf_prog_array_item which contains >>> bpf_prog array itself. >>> >>> In the old code, we had >>>perf_event_query_prog_array(): >>> mutex_lock(...) >>> bpf_prog_array_copy_call(): >>>prog = rcu_dereference_check(array, 1)->progs >>>bpf_prog_array_copy_core(prog, ...) >>> mutex_unlock(...) >>> >>> With the above commit, we had >>>perf_event_query_prog_array(): >>> mutex_lock(...) >>> bpf_prog_array_copy_call(): >>>bpf_prog_array_copy_core(array, ...): >>> item = rcu_dereference(array)->items; >>> ... >>> mutex_unlock(...) >>> >>> The new code will trigger a lockdep rcu checking warning. >>> The fix is to change rcu_dereference() to rcu_dereference_check() >>> to prevent such a warning. >>> >>> Reported-by: syzbot+6e72317008eef84a2...@syzkaller.appspotmail.com >>> Fixes: 394e40a29788 ("bpf: extend bpf_prog_array to store pointers to the >>> cgroup storage") >>> Cc: Roman Gushchin >>> Signed-off-by: Yonghong Song Applied to bpf, thanks Yonghong!
Re: [bpf PATCH] samples/bpf: all XDP samples should unload xdp/bpf prog on SIGTERM
On Wed, Aug 15, 2018 at 8:47 AM, Y Song wrote: > On Wed, Aug 15, 2018 at 7:57 AM, Jesper Dangaard Brouer > wrote: >> It is common XDP practice to unload/deattach the XDP bpf program, >> when the XDP sample program is Ctrl-C interrupted (SIGINT) or >> killed (SIGTERM). >> >> The samples/bpf programs xdp_redirect_cpu and xdp_rxq_info, >> forgot to trap signal SIGTERM (which is the default signal used >> by the kill command). >> >> This was discovered by Red Hat QA, which automated scripts depend >> on killing the XDP sample program after a timeout period. >> >> Fixes: fad3917e361b ("samples/bpf: add cpumap sample program >> xdp_redirect_cpu") >> Fixes: 0fca931a6f21 ("samples/bpf: program demonstrating access to >> xdp_rxq_info") >> Reported-by: Jean-Tsung Hsiao >> Signed-off-by: Jesper Dangaard Brouer > > Acked-by: Yonghong Song Reviewed-by: Song Liu > >> --- >> samples/bpf/xdp_redirect_cpu_user.c |3 ++- >> samples/bpf/xdp_rxq_info_user.c |3 ++- >> 2 files changed, 4 insertions(+), 2 deletions(-)
Re: [bpf PATCH] samples/bpf: all XDP samples should unload xdp/bpf prog on SIGTERM
On 08/15/2018 05:47 PM, Y Song wrote: > On Wed, Aug 15, 2018 at 7:57 AM, Jesper Dangaard Brouer > wrote: >> It is common XDP practice to unload/deattach the XDP bpf program, >> when the XDP sample program is Ctrl-C interrupted (SIGINT) or >> killed (SIGTERM). >> >> The samples/bpf programs xdp_redirect_cpu and xdp_rxq_info, >> forgot to trap signal SIGTERM (which is the default signal used >> by the kill command). >> >> This was discovered by Red Hat QA, which automated scripts depend >> on killing the XDP sample program after a timeout period. >> >> Fixes: fad3917e361b ("samples/bpf: add cpumap sample program >> xdp_redirect_cpu") >> Fixes: 0fca931a6f21 ("samples/bpf: program demonstrating access to >> xdp_rxq_info") >> Reported-by: Jean-Tsung Hsiao >> Signed-off-by: Jesper Dangaard Brouer > > Acked-by: Yonghong Song Applied to bpf, thanks Jesper!
Re: [PATCH bpf-next V3] net/xdp: Fix suspicious RCU usage warning
On 08/13/2018 02:22 PM, Daniel Borkmann wrote: [...] > I'll get the patch in once it has been pulled. Applied to bpf, thanks Tariq!
[PATCH iproute2] ipmaddr: use preferred_family when given
From: Mahesh Bandewar

When creating socket() AF_INET is used irrespective of the family that is
given at the command-line (with -4, -6, or -0). This change will open the
socket with the preferred family.

Signed-off-by: Mahesh Bandewar
---
 ip/ipmaddr.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/ip/ipmaddr.c b/ip/ipmaddr.c
index a48499029e17..abf83784d0df 100644
--- a/ip/ipmaddr.c
+++ b/ip/ipmaddr.c
@@ -289,6 +289,7 @@ static int multiaddr_list(int argc, char **argv)
 static int multiaddr_modify(int cmd, int argc, char **argv)
 {
 	struct ifreq ifr = {};
+	int family;
 	int fd;
 
 	if (cmd == RTM_NEWADDR)
@@ -324,7 +325,17 @@ static int multiaddr_modify(int cmd, int argc, char **argv)
 		exit(-1);
 	}
 
-	fd = socket(AF_INET, SOCK_DGRAM, 0);
+	switch (preferred_family) {
+	case AF_INET6:
+	case AF_PACKET:
+	case AF_INET:
+		family = preferred_family;
+		break;
+	default:
+		family = AF_INET;
+	}
+
+	fd = socket(family, SOCK_DGRAM, 0);
 	if (fd < 0) {
 		perror("Cannot create socket");
 		exit(1);
-- 
2.18.0.865.gffc8e1a3cd6-goog
Re: virtio_net failover and initramfs (was: Re: [PATCH net-next v11 2/5] netvsc: refactor notifier/event handling code to use the failover framework)
On Wed, Aug 15, 2018 at 12:05 PM, Samudrala, Sridhar wrote:
> On 8/14/2018 5:03 PM, Siwei Liu wrote:
>> Are we sure all userspace apps skip and ignore slave interfaces by
>> just looking at the "IFLA_MASTER" attribute?
>>
>> When STANDBY is enabled on virtio-net, a failover master interface
>> will appear, which automatically enslaves the virtio device. But it
>> turns out that iSCSI (or any network boot) cannot bootstrap over the
>> new failover interface together with a standby virtio (without any VF
>> or PT device in place).
>>
>> Dracut (initramfs) ends up with a timeout and drops into the emergency shell:
>>
>> [  228.170425] dracut-initqueue[377]: Warning: dracut-initqueue
>> timeout - starting timeout scripts
>> [  228.171788] dracut-initqueue[377]: Warning: Could not boot.
>> Starting Dracut Emergency Shell...
>> Generating "/run/initramfs/rdsosreport.txt"
>> Entering emergency mode. Exit the shell to continue.
>> Type "journalctl" to view system logs.
>> You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick
>> or /boot after mounting them and attach it to a bug report.
>> dracut:/# ip l sh
>> 1: lo: mtu 65536 qdisc noqueue state UNKNOWN
>> mode DEFAULT group default qlen 1000
>> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>> 2: eth0: mtu 1500 qdisc noqueue
>> state UP mode DEFAULT group default qlen 1000
>> link/ether 9a:46:22:ae:33:54 brd ff:ff:ff:ff:ff:ff
>> 3: eth1: mtu 1500 qdisc pfifo_fast
>> master eth0 state UP mode DEFAULT group default qlen 1000
>> link/ether 9a:46:22:ae:33:54 brd ff:ff:ff:ff:ff:ff
>> dracut:/#
>>
>> If changing dracut code to ignore eth1 (with the IFLA_MASTER attr),
>> network boot starts to work.
>
> Does dracut by default try to use all the interfaces that are UP?

Yes. The specific dracut cmdline in our case is "ip=dhcp netroot=iscsi:... ",
but it's not specific to iSCSI boot. And because the failover and standby
interfaces share the same MAC address, while dracut runs DHCP on all
interfaces that are up it eventually gets the same route for each interface.
Those conflicting route entries kill off the network connection.

>> The reason is that dracut has its own means to differentiate virtual
>> interfaces for network boot: it does not look at IFLA_MASTER and
>> ignores slave interfaces. Instead, users have to provide an explicit
>> option, e.g. bond=eth0,eth1, on the boot line; then dracut knows the
>> config and ignores the slave interfaces.
>
> Isn't it possible to specify the interface that should be used for
> network boot?

As I understand it, one can only specify an interface name for running DHCP,
not select the interface for network boot. We want DHCP to run on every NIC
that is up (excluding the enslaved interfaces), and only one of them can get
a route entry to the network boot server (e.g. the iSCSI target).

>> However, with automatic creation of the failover interface that
>> assumption is no longer true. Can we change dracut to ignore all slave
>> interfaces by checking IFLA_MASTER? I don't think so. It has a large
>> impact on existing configs.
>
> What is the issue with checking for IFLA_MASTER? I guess this is used
> with team/bonding setups.

That should be discussed within and determined by the dracut community. But
the current dracut code doesn't check IFLA_MASTER for team or bonding
specifically. I guess this change might have a broader impact on existing
userspace that may already be relying on the current behaviour.

Thanks,
-Siwei

>> What's a feasible solution then? Check the driver name for failover as
>> well?
>> >> Thanks, >> -Siwei >> >> >> >> On Tue, May 22, 2018 at 10:38 AM, Jiri Pirko wrote: >>> >>> Tue, May 22, 2018 at 06:52:21PM CEST, m...@redhat.com wrote: On Tue, May 22, 2018 at 05:45:01PM +0200, Jiri Pirko wrote: > > Tue, May 22, 2018 at 05:32:30PM CEST, m...@redhat.com wrote: >> >> On Tue, May 22, 2018 at 05:13:43PM +0200, Jiri Pirko wrote: >>> >>> Tue, May 22, 2018 at 03:39:33PM CEST, m...@redhat.com wrote: On Tue, May 22, 2018 at 03:26:26PM +0200, Jiri Pirko wrote: > > Tue, May 22, 2018 at 03:17:37PM CEST, m...@redhat.com wrote: >> >> On Tue, May 22, 2018 at 03:14:22PM +0200, Jiri Pirko wrote: >>> >>> Tue, May 22, 2018 at 03:12:40PM CEST, m...@redhat.com wrote: On Tue, May 22, 2018 at 11:08:53AM +0200, Jiri Pirko wrote: > > Tue, May 22, 2018 at 11:06:37AM CEST, j...@resnulli.us wrote: >> >> Tue, May 22, 2018 at 04:06:18AM CEST, >> sridhar.samudr...@intel.com wrote: >>> >>> Use the registration/notification framework supported by the >>> generic >>> failover infrastructure. >>> >>> Signed-off-by: Sridhar Samudrala >>> >> >> In previous patchset versions, the common code did
Re: [PATCH bpf] bpf: fix a rcu usage warning in bpf_prog_array_copy_core()
On Wed, Aug 15, 2018 at 02:30:11PM -0700, Alexei Starovoitov wrote: > On Tue, Aug 14, 2018 at 05:08:44PM -0700, Roman Gushchin wrote: > > On Tue, Aug 14, 2018 at 04:59:45PM -0700, Alexei Starovoitov wrote: > > > On Tue, Aug 14, 2018 at 11:01:12AM -0700, Yonghong Song wrote: > > > > Commit 394e40a29788 ("bpf: extend bpf_prog_array to store pointers > > > > to the cgroup storage") refactored the bpf_prog_array_copy_core() > > > > to accommodate new structure bpf_prog_array_item which contains > > > > bpf_prog array itself. > > > > > > > > In the old code, we had > > > >perf_event_query_prog_array(): > > > > mutex_lock(...) > > > > bpf_prog_array_copy_call(): > > > >prog = rcu_dereference_check(array, 1)->progs > > > >bpf_prog_array_copy_core(prog, ...) > > > > mutex_unlock(...) > > > > > > > > With the above commit, we had > > > >perf_event_query_prog_array(): > > > > mutex_lock(...) > > > > bpf_prog_array_copy_call(): > > > >bpf_prog_array_copy_core(array, ...): > > > > item = rcu_dereference(array)->items; > > > > ... > > > > mutex_unlock(...) > > > > > > > > The new code will trigger a lockdep rcu checking warning. > > > > The fix is to change rcu_dereference() to rcu_dereference_check() > > > > to prevent such a warning. > > > > > > > > Reported-by: syzbot+6e72317008eef84a2...@syzkaller.appspotmail.com > > > > Fixes: 394e40a29788 ("bpf: extend bpf_prog_array to store pointers to > > > > the cgroup storage") > > > > Cc: Roman Gushchin > > > > Signed-off-by: Yonghong Song > > > > > > makes sense to me > > > Acked-by: Alexei Starovoitov > > > > > > Roman, would you agree? > > > > > > > rcu_dereference_check(<>, 1) always looks a bit strange to me, > > but if it's the only reasonable way to silence the warning, > > of course I'm fine with it. > > do you have better suggestion? > This patch is a fix for the regression introduced in your earlier patch, > so I think the only fair path forward is either to Ack it or > to send an alternative patch asap. 
> As I said, I've nothing against. Acked-by: Roman Gushchin Thanks!
Re: [iproute PATCH 0/3] Fix and test ssfilter
On Tue, 14 Aug 2018 14:18:05 +0200 Phil Sutter wrote: > This series contains a fix for ssfilter and introduces a testscript to > verify correct functionality. > > Phil Sutter (3): > ss: Review ssfilter > testsuite: Prepare for ss tests > testsuite: Add a first ss test validating ssfilter > > misc/ssfilter.y | 36 ++--- > testsuite/Makefile| 2 +- > testsuite/lib/generic.sh | 37 ++ > testsuite/tests/ss/ss1.dump | Bin 0 -> 720 bytes > testsuite/tests/ss/ssfilter.t | 48 ++ > 5 files changed, 84 insertions(+), 39 deletions(-) > create mode 100644 testsuite/tests/ss/ss1.dump > create mode 100755 testsuite/tests/ss/ssfilter.t > Applied
Re: [iproute PATCH] man: ip-route: Clarify referenced versions are Linux ones
On Wed, 15 Aug 2018 11:18:26 +0200 Phil Sutter wrote: > Versioning scheme of Linux and iproute2 is similar, therefore the > referenced kernel versions are likely to confuse readers. Clarify this > by prefixing each kernel version by 'Linux' prefix. > Sure, makes sense applied.
Re: [PATCH bpf] bpf: fix a rcu usage warning in bpf_prog_array_copy_core()
On Tue, Aug 14, 2018 at 05:08:44PM -0700, Roman Gushchin wrote: > On Tue, Aug 14, 2018 at 04:59:45PM -0700, Alexei Starovoitov wrote: > > On Tue, Aug 14, 2018 at 11:01:12AM -0700, Yonghong Song wrote: > > > Commit 394e40a29788 ("bpf: extend bpf_prog_array to store pointers > > > to the cgroup storage") refactored the bpf_prog_array_copy_core() > > > to accommodate new structure bpf_prog_array_item which contains > > > bpf_prog array itself. > > > > > > In the old code, we had > > >perf_event_query_prog_array(): > > > mutex_lock(...) > > > bpf_prog_array_copy_call(): > > >prog = rcu_dereference_check(array, 1)->progs > > >bpf_prog_array_copy_core(prog, ...) > > > mutex_unlock(...) > > > > > > With the above commit, we had > > >perf_event_query_prog_array(): > > > mutex_lock(...) > > > bpf_prog_array_copy_call(): > > >bpf_prog_array_copy_core(array, ...): > > > item = rcu_dereference(array)->items; > > > ... > > > mutex_unlock(...) > > > > > > The new code will trigger a lockdep rcu checking warning. > > > The fix is to change rcu_dereference() to rcu_dereference_check() > > > to prevent such a warning. > > > > > > Reported-by: syzbot+6e72317008eef84a2...@syzkaller.appspotmail.com > > > Fixes: 394e40a29788 ("bpf: extend bpf_prog_array to store pointers to the > > > cgroup storage") > > > Cc: Roman Gushchin > > > Signed-off-by: Yonghong Song > > > > makes sense to me > > Acked-by: Alexei Starovoitov > > > > Roman, would you agree? > > > > rcu_dereference_check(<>, 1) always looks a bit strange to me, > but if it's the only reasonable way to silence the warning, > of course I'm fine with it. do you have better suggestion? This patch is a fix for the regression introduced in your earlier patch, so I think the only fair path forward is either to Ack it or to send an alternative patch asap.
[PATCH 1/2] ip: convert monitor to switch
From: Stephen Hemminger The decoding of netlink message types is natural for a C switch statement. Signed-off-by: Stephen Hemminger --- ip/ipmonitor.c | 63 ++ 1 file changed, 38 insertions(+), 25 deletions(-) diff --git a/ip/ipmonitor.c b/ip/ipmonitor.c index 8b017341da6b..5552d98ee9e5 100644 --- a/ip/ipmonitor.c +++ b/ip/ipmonitor.c @@ -58,7 +58,9 @@ static int accept_msg(const struct sockaddr_nl *who, { FILE *fp = (FILE *)arg; - if (n->nlmsg_type == RTM_NEWROUTE || n->nlmsg_type == RTM_DELROUTE) { + switch (n->nlmsg_type) { + case RTM_NEWROUTE: + case RTM_DELROUTE: { struct rtmsg *r = NLMSG_DATA(n); int len = n->nlmsg_len - NLMSG_LENGTH(sizeof(*r)); @@ -82,24 +84,28 @@ static int accept_msg(const struct sockaddr_nl *who, } } - if (n->nlmsg_type == RTM_NEWLINK || n->nlmsg_type == RTM_DELLINK) { + case RTM_NEWLINK: + case RTM_DELLINK: ll_remember_index(who, n, NULL); print_headers(fp, "[LINK]", ctrl); print_linkinfo(who, n, arg); return 0; - } - if (n->nlmsg_type == RTM_NEWADDR || n->nlmsg_type == RTM_DELADDR) { + + case RTM_NEWADDR: + case RTM_DELADDR: print_headers(fp, "[ADDR]", ctrl); print_addrinfo(who, n, arg); return 0; - } - if (n->nlmsg_type == RTM_NEWADDRLABEL || n->nlmsg_type == RTM_DELADDRLABEL) { + + case RTM_NEWADDRLABEL: + case RTM_DELADDRLABEL: print_headers(fp, "[ADDRLABEL]", ctrl); print_addrlabel(who, n, arg); return 0; - } - if (n->nlmsg_type == RTM_NEWNEIGH || n->nlmsg_type == RTM_DELNEIGH || - n->nlmsg_type == RTM_GETNEIGH) { + + case RTM_NEWNEIGH: + case RTM_DELNEIGH: + case RTM_GETNEIGH: if (preferred_family) { struct ndmsg *r = NLMSG_DATA(n); @@ -110,34 +116,41 @@ static int accept_msg(const struct sockaddr_nl *who, print_headers(fp, "[NEIGH]", ctrl); print_neigh(who, n, arg); return 0; - } - if (n->nlmsg_type == RTM_NEWPREFIX) { + + case RTM_NEWPREFIX: print_headers(fp, "[PREFIX]", ctrl); print_prefix(who, n, arg); return 0; - } - if (n->nlmsg_type == RTM_NEWRULE || n->nlmsg_type == RTM_DELRULE) { + + case RTM_NEWRULE: + case RTM_DELRULE: 
print_headers(fp, "[RULE]", ctrl); print_rule(who, n, arg); return 0; - } - if (n->nlmsg_type == RTM_NEWNETCONF) { + + case NLMSG_TSTAMP: + print_nlmsg_timestamp(fp, n); + return 0; + + case RTM_NEWNETCONF: print_headers(fp, "[NETCONF]", ctrl); print_netconf(who, ctrl, n, arg); return 0; - } - if (n->nlmsg_type == NLMSG_TSTAMP) { - print_nlmsg_timestamp(fp, n); - return 0; - } - if (n->nlmsg_type == RTM_NEWNSID || n->nlmsg_type == RTM_DELNSID) { + + case RTM_DELNSID: + case RTM_NEWNSID: print_headers(fp, "[NSID]", ctrl); print_nsid(who, n, arg); return 0; - } - if (n->nlmsg_type != NLMSG_ERROR && n->nlmsg_type != NLMSG_NOOP && - n->nlmsg_type != NLMSG_DONE) { - fprintf(fp, "Unknown message: type=0x%08x(%d) flags=0x%08x(%d)len=0x%08x(%d)\n", + + case NLMSG_ERROR: + case NLMSG_NOOP: + case NLMSG_DONE: + break; /* ignore */ + + default: + fprintf(stderr, + "Unknown message: type=0x%08x(%d) flags=0x%08x(%d) len=0x%08x(%d)\n", n->nlmsg_type, n->nlmsg_type, n->nlmsg_flags, n->nlmsg_flags, n->nlmsg_len, n->nlmsg_len); -- 2.18.0
[PATCH 2/2] ipmonitor: decode DELNETCONF message
From: Stephen Hemminger When device is deleted DELNETCONF is sent, but ipmonitor was unable to decode it. Signed-off-by: Stephen Hemminger --- ip/ipmonitor.c | 1 + ip/ipnetconf.c | 6 +- 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/ip/ipmonitor.c b/ip/ipmonitor.c index 5552d98ee9e5..a93b62cd6624 100644 --- a/ip/ipmonitor.c +++ b/ip/ipmonitor.c @@ -133,6 +133,7 @@ static int accept_msg(const struct sockaddr_nl *who, return 0; case RTM_NEWNETCONF: + case RTM_DELNETCONF: print_headers(fp, "[NETCONF]", ctrl); print_netconf(who, ctrl, n, arg); return 0; diff --git a/ip/ipnetconf.c b/ip/ipnetconf.c index 03f98ace9145..afce982ced37 100644 --- a/ip/ipnetconf.c +++ b/ip/ipnetconf.c @@ -66,7 +66,8 @@ int print_netconf(const struct sockaddr_nl *who, struct rtnl_ctrl_data *ctrl, if (n->nlmsg_type == NLMSG_ERROR) return -1; - if (n->nlmsg_type != RTM_NEWNETCONF) { + + if (n->nlmsg_type != RTM_NEWNETCONF && n->nlmsg_type != RTM_DELNETCONF) { fprintf(stderr, "Not RTM_NEWNETCONF: %08x %08x %08x\n", n->nlmsg_len, n->nlmsg_type, n->nlmsg_flags); @@ -91,6 +92,9 @@ int print_netconf(const struct sockaddr_nl *who, struct rtnl_ctrl_data *ctrl, return 0; open_json_object(NULL); + if (n->nlmsg_type == RTM_DELNETCONF) + print_bool(PRINT_ANY, "deleted", "Deleted ", true); + print_string(PRINT_ANY, "family", "%s ", family_name(ncm->ncm_family)); -- 2.18.0
[PATCH iproute2 0/2] ip monitor fixes
While debugging another problem noticed that ip monitor does not decode all the netconf messages. Stephen Hemminger (2): ip: convert monitor to switch ipmonitor: decode DELNETCONF message ip/ipmonitor.c | 64 ++ ip/ipnetconf.c | 6 - 2 files changed, 44 insertions(+), 26 deletions(-) -- 2.18.0
Re: 98DX3336 net driver
Hi Joacim,

On 16/08/18 01:46, Thomas Petazzoni wrote:
> Hello Joacim,
>
> On Wed, 15 Aug 2018 11:48:17 +, Joacim Zetterling wrote:
>
>> Sorry for bothering You! But I have a short question that You may have
>> an answer to.
>>
>> Just wondering if You know where to find a "marvell,prestera-98dx3336"
>> packet processor driver?
>>
>> It is stated in the dts files for the xCat family
>> (armada-xp-3233/3336/4251).
>>
>> I have searched a lot without any success. Is it delivered by Marvell?
>>
>> We have a PonCat3 platform built on native Linux instead of using
>> Marvell's CPSS suite (beast), but we have some problems regarding the
>> network.
>>
>> We also have the DB-XC3-24G4XG development board from Marvell.
>
> I'm adding Chris Packham in Cc, who added the DT bindings for the
> Prestera switch and the DT description for it in the mainline Linux
> kernel.
>
> I believe there is no driver in the upstream Linux kernel for this
> switch, and you need to use some out-of-tree Marvell-specific driver
> for this. For sure Chris can give you more details about this.
>
> Do not hesitate to get back to us if you need to support the Prestera
> switch in the mainline Linux kernel.

Unfortunately there isn't a proper driver available. The CPSS from Marvell is
needed. Their reference code has some partially DT-aware support (mvMbusDrv)
but it wouldn't pick up this binding. We have code that does, but it is really
little more than providing raw HW access via a file handle that can be mmapped
by a userspace application linked against the CPSS.

I haven't put any effort into upstreaming this because I figured no one would
be interested, as it's pretty useless without the userspace portion. If there
is enough interest I'd be keen to participate in creating an in-kernel
switchdev driver for these chips (you're not the first to ask). I'd need to
be careful w.r.t. various NDAs that are in place.

Regards,
Chris
Re: [iproute PATCH v2 4/4] lib: Enable colored output only for TTYs
On Wed, 15 Aug 2018 18:51:15 +0200 Phil Sutter wrote: > On Wed, Aug 15, 2018 at 10:43:25AM -0600, David Ahern wrote: > > On 8/15/18 10:39 AM, Phil Sutter wrote: > > > On Wed, Aug 15, 2018 at 10:24:31AM -0600, David Ahern wrote: > > >> On 8/15/18 10:21 AM, Phil Sutter wrote: > > >>> Add an additional prerequisite to check_enable_color() to make sure > > >>> stdout actually points to an open TTY device. Otherwise calls like > > >>> > > >>> | ip -color a s >/tmp/foo > > >>> > > >>> will print color escape sequences into that file. Allow to override this > > >>> check by specifying '-color' flag more than once. > > >>> > > >>> Signed-off-by: Phil Sutter > > >>> --- > > >>> Changes since v1: > > >>> - Allow to override isatty() check by specifying '-color' flag more than > > >>> once. > > >> > > >> That adds overhead to my workflow where I almost always have to pipe the > > >> output of ip to a pager. > > > > > > alias ip='ip -color -color' > > > > no. Don't impact existing users. > > That's a possible fix for *your* workflow. If applied to the shell > handling that workflow, it won't impact existing users. > > > > Another alternative may be to introduce -autocolor flag. Establishing > > > the same syntax as used by 'ls' is not as trivial due to the simple > > > commandline parsing used in 'ip'. > > > > I disagree with ignoring or overriding an argument a user passes in. You > > are guessing what is the correct output and you are guessing wrong. > > There is nothing wrong with piping output to a file and the viewing that > > file through 'less -R'. > > > > If a user does not want the color codes in the file, then that user can > > drop the -color arg. iproute2 commands should not be guessing. > > OK, I got it. Should I respin the fixes or will you apply the series > partially? > > Thanks, Phil Please follow the color conventions of grep and ls to have consistent user experience. -c == --color=always and add never and auto.
Re: virtio_net failover and initramfs (was: Re: [PATCH net-next v11 2/5] netvsc: refactor notifier/event handling code to use the failover framework)
On 8/14/2018 5:03 PM, Siwei Liu wrote: Are we sure all userspace apps skip and ignore slave interfaces by just looking at "IFLA_MASTER" attribute? When STANDBY is enabled on virtio-net, a failover master interface will appear, which automatically enslaves the virtio device. But it is found out that iSCSI (or any network boot) cannot boot strap over the new failover interface together with a standby virtio (without any VF or PT device in place). Dracut (initramfs) ends up with timeout and dropping into emergency shell: [ 228.170425] dracut-initqueue[377]: Warning: dracut-initqueue timeout - starting timeout scripts [ 228.171788] dracut-initqueue[377]: Warning: Could not boot. Starting Dracut Emergency Shell... Generating "/run/initramfs/rdsosreport.txt" Entering emergency mode. Exit the shell to continue. Type "journalctl" to view system logs. You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or /boot after mounting them and attach it to a bug report. dracut:/# ip l sh 1: lo: mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 2: eth0: mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether 9a:46:22:ae:33:54 brd ff:ff:ff:ff:ff:ff\ 3: eth1: mtu 1500 qdisc pfifo_fast master eth0 state UP mode DEFAULT group default qlen 1000 link/ether 9a:46:22:ae:33:54 brd ff:ff:ff:ff:ff:ff dracut:/# If changing dracut code to ignore eth1 (with IFLA_MASTER attr), network boot starts to work. Does dracut by default tries to use all the interfaces that are UP? The reason is that dracut has its own means to differentiate virtual interfaces for network boot: it does not look at IFLA_MASTER and ignores slave interfaces. Instead, users have to provide explicit option e.g. bond=eth0,eth1 in the boot line, then dracut would know the config and ignore the slave interfaces. Isn't it possible to specify the interface that should be used for network boot? 
However, with automatic creation of failover interface that assumption is no longer true. Can we change dracut to ignore all slave interface by checking IFLA_MASTER? I don't think so. It has a large impact to existing configs. What is the issue with checking for IFLA_MASTER? I guess this is used with team/bonding setups. What's a feasible solution then? Check the driver name for failover as well? Thanks, -Siwei On Tue, May 22, 2018 at 10:38 AM, Jiri Pirko wrote: Tue, May 22, 2018 at 06:52:21PM CEST, m...@redhat.com wrote: On Tue, May 22, 2018 at 05:45:01PM +0200, Jiri Pirko wrote: Tue, May 22, 2018 at 05:32:30PM CEST, m...@redhat.com wrote: On Tue, May 22, 2018 at 05:13:43PM +0200, Jiri Pirko wrote: Tue, May 22, 2018 at 03:39:33PM CEST, m...@redhat.com wrote: On Tue, May 22, 2018 at 03:26:26PM +0200, Jiri Pirko wrote: Tue, May 22, 2018 at 03:17:37PM CEST, m...@redhat.com wrote: On Tue, May 22, 2018 at 03:14:22PM +0200, Jiri Pirko wrote: Tue, May 22, 2018 at 03:12:40PM CEST, m...@redhat.com wrote: On Tue, May 22, 2018 at 11:08:53AM +0200, Jiri Pirko wrote: Tue, May 22, 2018 at 11:06:37AM CEST, j...@resnulli.us wrote: Tue, May 22, 2018 at 04:06:18AM CEST, sridhar.samudr...@intel.com wrote: Use the registration/notification framework supported by the generic failover infrastructure. Signed-off-by: Sridhar Samudrala In previous patchset versions, the common code did netdev_rx_handler_register() and netdev_upper_dev_link() etc (netvsc_vf_join()). Now, this is still done in netvsc. Why? This should be part of the common "failover" code. Also note that in the current patchset you use IFF_FAILOVER flag for master, yet for the slave you use IFF_SLAVE. That is wrong. IFF_FAILOVER_SLAVE should be used. Or drop IFF_FAILOVER_SLAVE and set both IFF_FAILOVER and IFF_SLAVE? No. IFF_SLAVE is for bonding. What breaks if we reuse it for failover? This is exposed to userspace. IFF_SLAVE is expected for bonding slaves. And failover slave is not a bonding slave. 
That does not really answer the question. I'd claim it's sufficiently like a bond slave for IFF_SLAVE to make sense. In fact you will find that netvsc already sets IFF_SLAVE, and so netvsc does the whole failover thing in a wrong way. This patchset is trying to fix it. Maybe, but we don't need gratuitous changes either, especially if they break userspace. What do you mean by the "break"? It was a mistake to reuse IFF_SLAVE in the first place, let's fix it. If some userspace depends on that flag, it is broken anyway. So does e.g. the eql driver. The advantage of using IFF_SLAVE is that userspace knows to skip it. If The userspace should know how to skip other types of slaves - team, bridge, ovs, etc. The "master link" should be the one to look at. How should existing userspace know which ones to skip and which one is the master? Right now userspace seems to assume whatever does not have IFF_SLAVE should be looked at. Are you saying that's
Re: [iproute PATCH v2 4/4] lib: Enable colored output only for TTYs
On Wed, Aug 15, 2018 at 10:57:13AM -0600, David Ahern wrote: > On 8/15/18 10:51 AM, Phil Sutter wrote: > > Should I respin the fixes or will you apply the series > > partially? > > Stephen has released 4.18 but not merged -next to master yet, so I > applied the first 3 to -next. OK, thanks! Cheers, Phil
RE: [PATCH net-next][RFC] net/tls: Add support for async decryption of tls records
> -----Original Message----- > From: Dave Watson [mailto:davejwat...@fb.com] > Sent: Wednesday, August 15, 2018 10:26 PM > To: Vakul Garg > Cc: netdev@vger.kernel.org; bor...@mellanox.com; > avia...@mellanox.com; da...@davemloft.net > Subject: Re: [PATCH net-next][RFC] net/tls: Add support for async decryption > of tls records > > On 08/14/18 07:47 PM, Vakul Garg wrote: > > Incoming TLS records which are directly decrypted into user space > > application buffer i.e. records which are decrypted in zero-copy mode > > are submitted for async decryption. When the decryption cryptoapi > > returns -EINPROGRESS, the next tls record is parsed and then submitted > > for decryption. The references to records which have been sent for > > async decryption are dropped. This happens in a loop for all the > > records that can be decrypted in zero-copy mode. For records for which > > decryption is not possible in zero-copy mode, asynchronous decryption > > is not used and we wait for the decryption crypto api to complete. > > > > For crypto requests executing in async fashion, the memory for > > aead_request, sglists and skb etc is freed from the decryption > > completion handler. The decryption completion handler wakes up the > > sleeping user context. This happens when the user context is done > > enqueueing all the crypto requests and is waiting for all the async > > operations to finish. Since the splice() operation does not use > > zero-copy decryption, async remains disabled for splice(). > > I found it a little hard to understand what you are trying to do based on the > commit message. Ok, I will rewrite it. > Reading the code, it looks like if the recv() spans multiple > TLS records, we will start decryption on all of them, and only wait right > before recv() returns, yes? This approach sounds great to me. > Yes, that's the idea. I am firing as many decryption jobs as possible till I run out of user application provided buffer space. 
> Three of the selftests are failing for me: > > [ FAIL ] tls.recv_partial > [ FAIL ] tls.recv_peek > [ FAIL ] tls.recv_peek_multiple Will look into it. Thanks for spending time reviewing my patch. The patch is showing good performance benefits.
Re: [iproute PATCH v2 4/4] lib: Enable colored output only for TTYs
On 8/15/18 10:51 AM, Phil Sutter wrote: > Should I respin the fixes or will you apply the series > partially? Stephen has released 4.18 but not merged -next to master yet, so I applied the first 3 to -next.
Re: [PATCH net-next][RFC] net/tls: Add support for async decryption of tls records
On 08/14/18 07:47 PM, Vakul Garg wrote: > Incoming TLS records which are directly decrypted into user space > application buffer i.e. records which are decrypted in zero-copy mode > are submitted for async decryption. When the decryption cryptoapi > returns -EINPROGRESS, the next tls record is parsed and then submitted > for decryption. The references to records which have been sent for async > decryption are dropped. This happens in a loop for all the records that > can be decrypted in zero-copy mode. For records for which decryption is > not possible in zero-copy mode, asynchronous decryption is not used and > we wait for the decryption crypto api to complete. > > For crypto requests executing in async fashion, the memory for > aead_request, sglists and skb etc is freed from the decryption > completion handler. The decryption completion handler wakes up the > sleeping user context. This happens when the user context is done > enqueueing all the crypto requests and is waiting for all the async > operations to finish. Since the splice() operation does not use > zero-copy decryption, async remains disabled for splice(). I found it a little hard to understand what you are trying to do based on the commit message. Reading the code, it looks like if the recv() spans multiple TLS records, we will start decryption on all of them, and only wait right before recv() returns, yes? This approach sounds great to me. Three of the selftests are failing for me: [ FAIL ] tls.recv_partial [ FAIL ] tls.recv_peek [ FAIL ] tls.recv_peek_multiple
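The control flow Dave describes, submitting a decryption job per record without blocking and then waiting once just before recv() returns, can be modeled in userspace. The sketch below uses pthreads to stand in for the crypto completion callbacks; none of this is the actual net/tls code, only an illustration of the pending-counter and wake-up pattern the commit message outlines.

```c
#include <pthread.h>

#define MAX_RECORDS 16

/* Toy model of the recv() flow: one "decryption" job per record is
 * fired off without blocking (submit phase), and the caller sleeps
 * exactly once, right before returning, until the last completion
 * handler signals it. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t all_done = PTHREAD_COND_INITIALIZER;
static int pending;
static int decrypted;

static void *decrypt_record(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	decrypted++;			/* stand-in for the real crypto work */
	if (--pending == 0)		/* completion handler wakes the user context */
		pthread_cond_signal(&all_done);
	pthread_mutex_unlock(&lock);
	return NULL;
}

int recv_records(int nrecords)
{
	pthread_t tids[MAX_RECORDS];
	int i;

	if (nrecords < 1 || nrecords > MAX_RECORDS)
		return -1;

	pending = nrecords;
	decrypted = 0;
	for (i = 0; i < nrecords; i++)	/* submit phase: never blocks */
		if (pthread_create(&tids[i], NULL, decrypt_record, NULL))
			return -1;

	pthread_mutex_lock(&lock);
	while (pending > 0)		/* the single wait at the end */
		pthread_cond_wait(&all_done, &lock);
	pthread_mutex_unlock(&lock);

	for (i = 0; i < nrecords; i++)
		pthread_join(tids[i], NULL);
	return decrypted;
}
```

The key property, matching the patch's intent, is that the submit loop never sleeps; only the final drain does.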
Re: [iproute PATCH v2 4/4] lib: Enable colored output only for TTYs
On Wed, Aug 15, 2018 at 10:43:25AM -0600, David Ahern wrote: > On 8/15/18 10:39 AM, Phil Sutter wrote: > > On Wed, Aug 15, 2018 at 10:24:31AM -0600, David Ahern wrote: > >> On 8/15/18 10:21 AM, Phil Sutter wrote: > >>> Add an additional prerequisite to check_enable_color() to make sure > >>> stdout actually points to an open TTY device. Otherwise calls like > >>> > >>> | ip -color a s >/tmp/foo > >>> > >>> will print color escape sequences into that file. Allow to override this > >>> check by specifying '-color' flag more than once. > >>> > >>> Signed-off-by: Phil Sutter > >>> --- > >>> Changes since v1: > >>> - Allow to override isatty() check by specifying '-color' flag more than > >>> once. > >> > >> That adds overhead to my workflow where I almost always have to pipe the > >> output of ip to a pager. > > > > alias ip='ip -color -color' > > no. Don't impact existing users. That's a possible fix for *your* workflow. If applied to the shell handling that workflow, it won't impact existing users. > > Another alternative may be to introduce -autocolor flag. Establishing > > the same syntax as used by 'ls' is not as trivial due to the simple > > commandline parsing used in 'ip'. > > I disagree with ignoring or overriding an argument a user passes in. You > are guessing what is the correct output and you are guessing wrong. > There is nothing wrong with piping output to a file and then viewing that > file through 'less -R'. > > If a user does not want the color codes in the file, then that user can > drop the -color arg. iproute2 commands should not be guessing. OK, I got it. Should I respin the fixes or will you apply the series partially? Thanks, Phil
Re: [iproute PATCH v2 4/4] lib: Enable colored output only for TTYs
On 8/15/18 10:39 AM, Phil Sutter wrote: > On Wed, Aug 15, 2018 at 10:24:31AM -0600, David Ahern wrote: >> On 8/15/18 10:21 AM, Phil Sutter wrote: >>> Add an additional prerequisite to check_enable_color() to make sure >>> stdout actually points to an open TTY device. Otherwise calls like >>> >>> | ip -color a s >/tmp/foo >>> >>> will print color escape sequences into that file. Allow to override this >>> check by specifying '-color' flag more than once. >>> >>> Signed-off-by: Phil Sutter >>> --- >>> Changes since v1: >>> - Allow to override isatty() check by specifying '-color' flag more than >>> once. >> >> That adds overhead to my workflow where I almost always have to pipe the >> output of ip to a pager. > > alias ip='ip -color -color' no. Don't impact existing users. > > Another alternative may be to introduce -autocolor flag. Establishing > the same syntax as used by 'ls' is not as trivial due to the simple > commandline parsing used in 'ip'. I disagree with ignoring or overriding an argument a user passes in. You are guessing what is the correct output and you are guessing wrong. There is nothing wrong with piping output to a file and then viewing that file through 'less -R'. If a user does not want the color codes in the file, then that user can drop the -color arg. iproute2 commands should not be guessing.
Re: [iproute PATCH v2 4/4] lib: Enable colored output only for TTYs
On Wed, Aug 15, 2018 at 10:24:31AM -0600, David Ahern wrote: > On 8/15/18 10:21 AM, Phil Sutter wrote: > > Add an additional prerequisite to check_enable_color() to make sure > > stdout actually points to an open TTY device. Otherwise calls like > > > > | ip -color a s >/tmp/foo > > > > will print color escape sequences into that file. Allow to override this > > check by specifying '-color' flag more than once. > > > > Signed-off-by: Phil Sutter > > --- > > Changes since v1: > > - Allow to override isatty() check by specifying '-color' flag more than > > once. > > That adds overhead to my workflow where I almost always have to pipe the > output of ip to a pager. alias ip='ip -color -color' Another alternative may be to introduce -autocolor flag. Establishing the same syntax as used by 'ls' is not as trivial due to the simple commandline parsing used in 'ip'. Cheers, Phil
Re: [iproute PATCH v2 4/4] lib: Enable colored output only for TTYs
On 8/15/18 10:21 AM, Phil Sutter wrote: > Add an additional prerequisite to check_enable_color() to make sure > stdout actually points to an open TTY device. Otherwise calls like > > | ip -color a s >/tmp/foo > > will print color escape sequences into that file. Allow to override this > check by specifying '-color' flag more than once. > > Signed-off-by: Phil Sutter > --- > Changes since v1: > - Allow to override isatty() check by specifying '-color' flag more than > once. That adds overhead to my workflow where I almost always have to pipe the output of ip to a pager.
[iproute PATCH v2 0/4] A bunch of fixes regarding colored output
This series contains fixes for conditionally colored output in patches 1 and 2. Patch 3 merges the common conditionals from ip, tc and bridge tools. Patch 4 then adds a further restriction to colored output to prevent garbled output when redirecting into a file. Changes since v1: - Adjusted last patch according to feedback. Details given in changelog of that patch. Phil Sutter (4): tc: Fix typo in check for colored output bridge: Fix check for colored output Merge common code for conditionally colored output lib: Enable colored output only for TTYs bridge/bridge.c | 5 ++--- include/color.h | 1 + ip/ip.c | 3 +-- lib/color.c | 13 + man/man8/bridge.8 | 5 - man/man8/ip.8 | 5 - man/man8/tc.8 | 5 - tc/tc.c | 3 +-- 8 files changed, 30 insertions(+), 10 deletions(-) -- 2.18.0
[iproute PATCH v2 4/4] lib: Enable colored output only for TTYs
Add an additional prerequisite to check_enable_color() to make sure stdout actually points to an open TTY device. Otherwise calls like | ip -color a s >/tmp/foo will print color escape sequences into that file. Allow to override this check by specifying '-color' flag more than once. Signed-off-by: Phil Sutter --- Changes since v1: - Allow to override isatty() check by specifying '-color' flag more than once. - Document new behaviour in man pages. --- lib/color.c | 6 +- man/man8/bridge.8 | 5 - man/man8/ip.8 | 5 - man/man8/tc.8 | 5 - 4 files changed, 17 insertions(+), 4 deletions(-) diff --git a/lib/color.c b/lib/color.c index edf96e5c6ecd7..3a66d8ccb4b00 100644 --- a/lib/color.c +++ b/lib/color.c @@ -3,6 +3,7 @@ #include #include #include +#include #include #include #include @@ -79,7 +80,10 @@ void enable_color(void) int check_enable_color(int color, int json) { - if (color && !json) { + if (json || !color) + return 1; + + if (color > 1 || isatty(fileno(stdout))) { enable_color(); return 0; } diff --git a/man/man8/bridge.8 b/man/man8/bridge.8 index e7f7148315e19..b865e8b6cd771 100644 --- a/man/man8/bridge.8 +++ b/man/man8/bridge.8 @@ -171,7 +171,10 @@ return code will be non zero. .TP .BR "\-c" , " -color" -Use color output. +Use color output if stdout is a terminal. Specify twice to enable color output +irrespective of stdout state. This flag is ignored if +.B \-json +is also given. .TP .BR "\-j", " \-json" diff --git a/man/man8/ip.8 b/man/man8/ip.8 index 0087d18b74706..c5484bbc11483 100644 --- a/man/man8/ip.8 +++ b/man/man8/ip.8 @@ -188,7 +188,10 @@ executes specified command over all objects, it depends if command supports this .TP .BR "\-c" , " -color" -Use color output. +Use color output if stdout is a terminal. Specify twice to enable color output +irrespective of stdout state. This flag is ignored if +.B \-json +is also given. 
.TP .BR "\-t" , " \-timestamp" diff --git a/man/man8/tc.8 b/man/man8/tc.8 index 840880fbdba63..cbe7ac1847bbb 100644 --- a/man/man8/tc.8 +++ b/man/man8/tc.8 @@ -730,7 +730,10 @@ option. .TP .BR "\ -color" -Use color output. +Use color output if stdout is a terminal. Specify twice to enable color output +irrespective of stdout state. This flag is ignored if +.B \-json +is also given. .TP .BR "\-j", " \-json" -- 2.18.0
[iproute PATCH 3/4] Merge common code for conditionally colored output
Instead of calling enable_color() conditionally with identical check in three places, introduce check_enable_color() which does it in one place. Signed-off-by: Phil Sutter --- bridge/bridge.c | 3 +-- include/color.h | 1 + ip/ip.c | 3 +-- lib/color.c | 9 + tc/tc.c | 3 +-- 5 files changed, 13 insertions(+), 6 deletions(-) diff --git a/bridge/bridge.c b/bridge/bridge.c index 289a157d37f03..451d684e0bcfd 100644 --- a/bridge/bridge.c +++ b/bridge/bridge.c @@ -200,8 +200,7 @@ main(int argc, char **argv) _SL_ = oneline ? "\\" : "\n"; - if (color && !json) - enable_color(); + check_enable_color(color, json); if (batch_file) return batch(batch_file); diff --git a/include/color.h b/include/color.h index c80359d3e2e95..4f2c918db7e43 100644 --- a/include/color.h +++ b/include/color.h @@ -13,6 +13,7 @@ enum color_attr { }; void enable_color(void); +int check_enable_color(int color, int json); void set_color_palette(void); int color_fprintf(FILE *fp, enum color_attr attr, const char *fmt, ...); enum color_attr ifa_family_color(__u8 ifa_family); diff --git a/ip/ip.c b/ip/ip.c index 71d5170c0cc23..38eac5ec1e17d 100644 --- a/ip/ip.c +++ b/ip/ip.c @@ -304,8 +304,7 @@ int main(int argc, char **argv) _SL_ = oneline ? "\\" : "\n"; - if (color && !json) - enable_color(); + check_enable_color(color, json); if (batch_file) return batch(batch_file); diff --git a/lib/color.c b/lib/color.c index da1f516cb2492..edf96e5c6ecd7 100644 --- a/lib/color.c +++ b/lib/color.c @@ -77,6 +77,15 @@ void enable_color(void) set_color_palette(); } +int check_enable_color(int color, int json) +{ + if (color && !json) { + enable_color(); + return 0; + } + return 1; +} + void set_color_palette(void) { char *p = getenv("COLORFGBG"); diff --git a/tc/tc.c b/tc/tc.c index 3bb893756f40e..e775550174473 100644 --- a/tc/tc.c +++ b/tc/tc.c @@ -515,8 +515,7 @@ int main(int argc, char **argv) _SL_ = oneline ? 
"\\" : "\n"; - if (color && !json) - enable_color(); + check_enable_color(color, json); if (batch_file) return batch(batch_file); -- 2.18.0
[iproute PATCH 2/4] bridge: Fix check for colored output
There is no point in calling enable_color() conditionally if it was already called for each time '-color' flag was parsed. Align the algorithm with that in ip and tc by actually making use of 'color' variable. Fixes: e9625d6aead11 ("Merge branch 'iproute2-master' into iproute2-next") Signed-off-by: Phil Sutter --- bridge/bridge.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/bridge/bridge.c b/bridge/bridge.c index 7fcfe1116f6e5..289a157d37f03 100644 --- a/bridge/bridge.c +++ b/bridge/bridge.c @@ -174,7 +174,7 @@ main(int argc, char **argv) if (netns_switch(argv[1])) exit(-1); } else if (matches(opt, "-color") == 0) { - enable_color(); + ++color; } else if (matches(opt, "-compressvlans") == 0) { ++compress_vlans; } else if (matches(opt, "-force") == 0) { -- 2.18.0
[iproute PATCH 1/4] tc: Fix typo in check for colored output
The check used binary instead of boolean AND, which means colored output was enabled only if the number of specified '-color' flags was odd. Fixes: 2d165c0811058 ("tc: implement color output") Signed-off-by: Phil Sutter --- tc/tc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tc/tc.c b/tc/tc.c index 3bb5910ffac52..3bb893756f40e 100644 --- a/tc/tc.c +++ b/tc/tc.c @@ -515,7 +515,7 @@ int main(int argc, char **argv) _SL_ = oneline ? "\\" : "\n"; - if (color & !json) + if (color && !json) enable_color(); if (batch_file) -- 2.18.0
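The bug is subtle because 'color' is a counter, not a bool: each '-color' flag increments it. A tiny illustration (hypothetical helper names, not iproute2 code) of why bitwise `&` misbehaves here while logical `&&` is correct:

```c
/* 'color' counts how many '-color' flags were given; 'json' is 0 or 1.
 * Bitwise AND tests only the lowest bit of 'color', so an even flag
 * count (e.g. one flag from an alias plus one on the command line)
 * evaluates to 0 and silently disables color output.  Logical AND
 * treats any non-zero count as true. */
int want_color_buggy(int color, int json) { return color & !json; }
int want_color_fixed(int color, int json) { return color && !json; }
```

With color == 2 and json == 0, the buggy form computes 2 & 1 == 0, which is exactly the odd-count-only behavior the commit message describes.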
[PATCH] net: dsa: add support for ksz9897 ethernet switch
From: "Lad, Prabhakar" The ksz9477 is a superset of the ksz9xx series; the driver just works out of the box for the ksz9897 chip with this patch. Signed-off-by: Lad, Prabhakar --- Documentation/devicetree/bindings/net/dsa/ksz.txt | 4 +++- drivers/net/dsa/microchip/ksz_common.c| 9 + drivers/net/dsa/microchip/ksz_spi.c | 1 + 3 files changed, 13 insertions(+), 1 deletion(-) diff --git a/Documentation/devicetree/bindings/net/dsa/ksz.txt b/Documentation/devicetree/bindings/net/dsa/ksz.txt index a700943..ac145b8 100644 --- a/Documentation/devicetree/bindings/net/dsa/ksz.txt +++ b/Documentation/devicetree/bindings/net/dsa/ksz.txt @@ -4,7 +4,9 @@ Microchip KSZ Series Ethernet switches Required properties: - compatible: For external switch chips, compatible string must be exactly one - of: "microchip,ksz9477" + of the following: + - "microchip,ksz9477" + - "microchip,ksz9897" See Documentation/devicetree/bindings/net/dsa/dsa.txt for a list of additional required and optional properties. diff --git a/drivers/net/dsa/microchip/ksz_common.c b/drivers/net/dsa/microchip/ksz_common.c index 7210c49..54e0ca6 100644 --- a/drivers/net/dsa/microchip/ksz_common.c +++ b/drivers/net/dsa/microchip/ksz_common.c @@ -1102,6 +1102,15 @@ static const struct ksz_chip_data ksz_switch_chips[] = { .cpu_ports = 0x7F, /* can be configured as cpu port */ .port_cnt = 7, /* total physical port count */ }, + { + .chip_id = 0x00989700, + .dev_name = "KSZ9897", + .num_vlans = 4096, + .num_alus = 4096, + .num_statics = 16, + .cpu_ports = 0x7F, /* can be configured as cpu port */ + .port_cnt = 7, /* total physical port count */ + }, }; static int ksz_switch_init(struct ksz_device *dev) diff --git a/drivers/net/dsa/microchip/ksz_spi.c b/drivers/net/dsa/microchip/ksz_spi.c index c519469..8c1778b 100644 --- a/drivers/net/dsa/microchip/ksz_spi.c +++ b/drivers/net/dsa/microchip/ksz_spi.c @@ -195,6 +195,7 @@ static int ksz_spi_remove(struct spi_device *spi) static const struct of_device_id ksz_dt_ids[] = { { .compatible = 
"microchip,ksz9477" }, + { .compatible = "microchip,ksz9897" }, {}, }; MODULE_DEVICE_TABLE(of, ksz_dt_ids); -- 2.7.4
Re: [bpf PATCH] samples/bpf: all XDP samples should unload xdp/bpf prog on SIGTERM
On Wed, Aug 15, 2018 at 7:57 AM, Jesper Dangaard Brouer wrote: > It is common XDP practice to unload/detach the XDP bpf program > when the XDP sample program is Ctrl-C interrupted (SIGINT) or > killed (SIGTERM). > > The samples/bpf programs xdp_redirect_cpu and xdp_rxq_info > forgot to trap signal SIGTERM (which is the default signal used > by the kill command). > > This was discovered by Red Hat QA, whose automated scripts depend > on killing the XDP sample program after a timeout period. > > Fixes: fad3917e361b ("samples/bpf: add cpumap sample program > xdp_redirect_cpu") > Fixes: 0fca931a6f21 ("samples/bpf: program demonstrating access to > xdp_rxq_info") > Reported-by: Jean-Tsung Hsiao > Signed-off-by: Jesper Dangaard Brouer Acked-by: Yonghong Song > --- > samples/bpf/xdp_redirect_cpu_user.c |3 ++- > samples/bpf/xdp_rxq_info_user.c |3 ++- > 2 files changed, 4 insertions(+), 2 deletions(-)
Re: [PATCH stable 4.4 0/9] fix SegmentSmack (CVE-2018-5390)
On Wed, Aug 15, 2018 at 03:24:32PM +0200, Greg KH wrote: > On Wed, Aug 15, 2018 at 09:20:59PM +0800, Mao Wenan wrote: > > There are five patches to fix CVE-2018-5390 in the latest mainline > > branch, but only two patches exist in stable 4.4 and 3.18: > > dc6ae4d tcp: detect malicious patterns in tcp_collapse_ofo_queue() > > 5fbec48 tcp: avoid collapses in tcp_prune_queue() if possible > > but I have tested with these patches, and found the CPU usage was very high. > > test results: > > with fix patch: 78.2% ksoftirqd > > no fix patch: 90% ksoftirqd > > > > After analysing the code of stable 4.4 and debugging the > > system, the search of the ofo_queue (TCP ofo uses a simple queue) costs more > > cycles. > > So I think those two patches alone can't fix CVE-2018-5390. > > So I tried backporting "tcp: use an RB tree for ooo receive queue", using an RB > > tree > > instead of a simple queue, then backported Eric Dumazet's 5 fix patches from > > mainline; > > the good news is that ksoftirqd drops to about 20%, which is the same as > > mainline now. > > Thanks for doing this work, I had some questions on the individual > patches. Can you address them and resend? Also, always cc: the stable@vger list when sending stable patches so that others can review and comment on them. thanks, greg k-h
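The rationale for backporting the RB-tree conversion is purely algorithmic: locating a segment's slot in the out-of-order queue is a linear walk with the simple queue but logarithmic with a balanced tree, which is what tames ksoftirqd under a SegmentSmack-style flood of tiny out-of-order segments. A toy userspace comparison-count illustration (not kernel code; a sorted array's binary search stands in for the RB-tree lookup):

```c
#include <stddef.h>

/* Comparisons needed to find the insert position for 'key' in a sorted
 * array of n elements by scanning linearly, like walking the old
 * sk_buff out-of-order list. */
size_t find_pos_linear(const int *a, size_t n, int key, size_t *cmps)
{
	size_t i;

	for (i = 0; i < n; i++) {
		(*cmps)++;
		if (a[i] >= key)
			break;
	}
	return i;
}

/* Same lookup via binary search: O(log n) comparisons, which is the
 * cost profile the RB tree gives the ofo queue. */
size_t find_pos_binary(const int *a, size_t n, int key, size_t *cmps)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		(*cmps)++;
		if (a[mid] < key)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}
```

With 1024 queued segments and a worst-case (highest-sequence) arrival, the linear walk needs 1024 comparisons per insert versus about 11 for the tree, and an attacker repeats that insert for every tiny segment.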
RE: [iproute PATCH 4/4] lib: Enable colored output only for TTYs
From: Stephen Hemminger > Sent: 15 August 2018 16:04 ... > > This also disables color sequence when the output is piped to a pager > > such as less which with the -R argument can handle it just fine. > > > > ie., the user needs to remove the color arg when that output is not wanted. > > If you are going to change the color enabling, why not make it compatible > with what ls does? Indeed - otherwise it is very hard to debug the colour escape sequences. David
[PATCH v2] net/mlx5: Delete unneeded function argument
priv argument is not used by the function, delete it. Fixes: a89842811ea98 ("net/mlx5e: Merge per priority stats groups") Signed-off-by: Yuval Shaia --- v1 -> v2: * Remove blank line as pointed out by Leon. --- drivers/net/ethernet/mellanox/mlx5/core/en_stats.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c index 1646859974ce..4c4d779dafa8 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c @@ -813,7 +813,7 @@ static const struct counter_desc pport_per_prio_traffic_stats_desc[] = { #define NUM_PPORT_PER_PRIO_TRAFFIC_COUNTERS ARRAY_SIZE(pport_per_prio_traffic_stats_desc) -static int mlx5e_grp_per_prio_traffic_get_num_stats(struct mlx5e_priv *priv) +static int mlx5e_grp_per_prio_traffic_get_num_stats(void) { return NUM_PPORT_PER_PRIO_TRAFFIC_COUNTERS * NUM_PPORT_PRIO; } @@ -971,7 +971,7 @@ static int mlx5e_grp_per_prio_pfc_fill_stats(struct mlx5e_priv *priv, static int mlx5e_grp_per_prio_get_num_stats(struct mlx5e_priv *priv) { - return mlx5e_grp_per_prio_traffic_get_num_stats(priv) + + return mlx5e_grp_per_prio_traffic_get_num_stats() + mlx5e_grp_per_prio_pfc_get_num_stats(priv); } -- 2.17.1
Re: [iproute PATCH 4/4] lib: Enable colored output only for TTYs
On Wed, 15 Aug 2018 08:40:20 -0600 David Ahern wrote: > On 8/15/18 3:06 AM, Phil Sutter wrote: > > Add an additional prerequisite to check_enable_color() to make sure > > stdout actually points to an open TTY device. Otherwise calls like > > > > | ip -color a s >/tmp/foo > > > > will print color escape sequences into that file. > > > > Signed-off-by: Phil Sutter > > --- > > lib/color.c | 3 ++- > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > diff --git a/lib/color.c b/lib/color.c > > index edf96e5c6ecd7..500ba09682697 100644 > > --- a/lib/color.c > > +++ b/lib/color.c > > @@ -3,6 +3,7 @@ > > #include > > #include > > #include > > +#include > > #include > > #include > > #include > > @@ -79,7 +80,7 @@ void enable_color(void) > > > > int check_enable_color(int color, int json) > > { > > - if (color && !json) { > > + if (color && !json && isatty(fileno(stdout))) { > > enable_color(); > > return 0; > > } > > > > This also disables color sequence when the output is piped to a pager > such as less which with the -R argument can handle it just fine. > > ie., the user needs to remove the color arg when that output is not wanted. If you are going to change the color enabling, why not make it compatible with what ls does? From man ls(1) (and grep) --color[=WHEN] colorize the output; WHEN can be 'always' (default if omitted), 'auto', or 'never'; more info below ... Using color to distinguish file types is disabled both by default and with --color=never. With --color=auto, ls emits color codes only when standard output is connected to a terminal. The LS_COLORS environment variable can change the settings. Use the dircolors command to set it. Make -c the same as --color=always to keep compatibility
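The ls-style semantics Stephen points at reduce to a small decision function. The sketch below (hypothetical names, not iproute2 code) shows 'always' and 'never' as unconditional, with only 'auto' deferring to isatty():

```c
#include <unistd.h>

enum color_when { COLOR_NEVER, COLOR_AUTO, COLOR_ALWAYS };

/* ls --color=WHEN policy: 'always' and 'never' ignore the output
 * destination; 'auto' emits color only when fd is a terminal.
 * Mapping bare '-c'/'-color' to COLOR_ALWAYS keeps today's iproute2
 * behavior, including David's pipe-to-'less -R' workflow. */
int use_color(enum color_when when, int fd)
{
	switch (when) {
	case COLOR_ALWAYS:
		return 1;
	case COLOR_NEVER:
		return 0;
	default:
		return isatty(fd) ? 1 : 0;
	}
}
```

This sidesteps the respin debate: redirection to a file only strips colors when the user explicitly opted into 'auto'.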
[bpf PATCH] samples/bpf: all XDP samples should unload xdp/bpf prog on SIGTERM
It is common XDP practice to unload/detach the XDP bpf program when the XDP sample program is Ctrl-C interrupted (SIGINT) or killed (SIGTERM). The samples/bpf programs xdp_redirect_cpu and xdp_rxq_info forgot to trap signal SIGTERM (which is the default signal used by the kill command). This was discovered by Red Hat QA, whose automated scripts depend on killing the XDP sample program after a timeout period. Fixes: fad3917e361b ("samples/bpf: add cpumap sample program xdp_redirect_cpu") Fixes: 0fca931a6f21 ("samples/bpf: program demonstrating access to xdp_rxq_info") Reported-by: Jean-Tsung Hsiao Signed-off-by: Jesper Dangaard Brouer --- samples/bpf/xdp_redirect_cpu_user.c |3 ++- samples/bpf/xdp_rxq_info_user.c |3 ++- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/samples/bpf/xdp_redirect_cpu_user.c b/samples/bpf/xdp_redirect_cpu_user.c index 9a6c7e0a6dd1..2d23054aaccf 100644 --- a/samples/bpf/xdp_redirect_cpu_user.c +++ b/samples/bpf/xdp_redirect_cpu_user.c @@ -679,8 +679,9 @@ int main(int argc, char **argv) return EXIT_FAIL_OPTION; } - /* Remove XDP program when program is interrupted */ + /* Remove XDP program when program is interrupted or killed */ signal(SIGINT, int_exit); + signal(SIGTERM, int_exit); if (bpf_set_link_xdp_fd(ifindex, prog_fd[prog_num], xdp_flags) < 0) { fprintf(stderr, "link set xdp fd failed\n"); diff --git a/samples/bpf/xdp_rxq_info_user.c b/samples/bpf/xdp_rxq_info_user.c index 248a7eab9531..ef26f882f92f 100644 --- a/samples/bpf/xdp_rxq_info_user.c +++ b/samples/bpf/xdp_rxq_info_user.c @@ -567,8 +567,9 @@ int main(int argc, char **argv) exit(EXIT_FAIL_BPF); } - /* Remove XDP program when program is interrupted */ + /* Remove XDP program when program is interrupted or killed */ signal(SIGINT, int_exit); + signal(SIGTERM, int_exit); if (bpf_set_link_xdp_fd(ifindex, prog_fd, xdp_flags) < 0) { fprintf(stderr, "link set xdp fd failed\n");
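The fix pattern generalizes to any userspace tool that must clean up on both Ctrl-C and kill: install the same handler for SIGINT and SIGTERM. A minimal self-contained sketch (the real samples detach the XDP program inside the handler; here a flag stands in for that cleanup):

```c
#include <signal.h>

volatile sig_atomic_t exiting;

/* Stand-in for the samples' int_exit(): in xdp_redirect_cpu and
 * xdp_rxq_info this is where the XDP program gets detached from the
 * interface before the process exits. */
static void int_exit(int sig)
{
	(void)sig;
	exiting = 1;
}

/* Trap both Ctrl-C (SIGINT) and the kill command's default (SIGTERM),
 * mirroring the two signal() calls the patch adds. */
void install_exit_handlers(void)
{
	signal(SIGINT, int_exit);
	signal(SIGTERM, int_exit);
}
```

Without the SIGTERM line, a plain `kill <pid>` terminates the process with the handler never running, which is exactly how the QA scripts left stale XDP programs attached.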
Re: [PATCH v3 net-next] veth: Free queues on link delete
On 8/15/18 2:07 AM, Toshiaki Makita wrote: > David Ahern reported memory leak in veth. > ... > veth_rq allocated in veth_newlink() was not freed on dellink. > > We need to free up them after veth_close() so that any packets will not > reference the queues afterwards. Thus free them in veth_dev_free() in > the same way as freeing stats structure (vstats). > > Also move queues allocation to veth_dev_init() to be in line with stats > allocation. > > Fixes: 638264dc90227 ("veth: Support per queue XDP ring") > Reported-by: David Ahern > Signed-off-by: Toshiaki Makita > --- > This is a fix for a bug which exists only in net-next. > Let me know if I should wait for -next merging into net or reopen of -next. > > drivers/net/veth.c | 70 > +- > 1 file changed, 33 insertions(+), 37 deletions(-) > Reviewed-by: David Ahern Tested-by: David Ahern
Re: [iproute PATCH 4/4] lib: Enable colored output only for TTYs
On 8/15/18 3:06 AM, Phil Sutter wrote: > Add an additional prerequisite to check_enable_color() to make sure > stdout actually points to an open TTY device. Otherwise calls like > > | ip -color a s >/tmp/foo > > will print color escape sequences into that file. > > Signed-off-by: Phil Sutter > --- > lib/color.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/lib/color.c b/lib/color.c > index edf96e5c6ecd7..500ba09682697 100644 > --- a/lib/color.c > +++ b/lib/color.c > @@ -3,6 +3,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -79,7 +80,7 @@ void enable_color(void) > > int check_enable_color(int color, int json) > { > - if (color && !json) { > + if (color && !json && isatty(fileno(stdout))) { > enable_color(); > return 0; > } > This also disables color sequence when the output is piped to a pager such as less which with the -R argument can handle it just fine. ie., the user needs to remove the color arg when that output is not wanted.
Re: [PATCH] net/mlx5: Delete unneeded function argument
On Wed, Aug 15, 2018 at 04:54:33PM +0300, Yuval Shaia wrote: > priv argument is not used by the function, delete it. > > Fixes: a89842811ea98 ("net/mlx5e: Merge per priority stats groups") > No extra space here. Thanks
Re: [PATCH] net: macb: Fix regression breaking non-MDIO fixed-link PHYs
Hi, On Wed, Aug 15, 2018 at 3:35 AM Andrew Lunn wrote: > > On Tue, Aug 14, 2018 at 05:58:12PM +0200, Uwe Kleine-König wrote: > > Hello Ahmad, > > > > > > On Tue, Aug 14, 2018 at 04:12:40PM +0200, Ahmad Fatoum wrote: > > > The referenced commit broke initializing macb on the EVB-KSZ9477 eval > > > board. > > > There, of_mdiobus_register was called even for the fixed-link representing > > > the SPI-connected switch PHY, with the result that the driver attempts to > > > enumerate PHYs on a non-existent MDIO bus: > > > I ran into a similar problem on v4.14 for davinci_mdio; I had to patch it with [1]. The cpsw has 2 PHYs: one is connected to a KSZ9031 and the other to a KSZ9897 Ethernet switch, which is treated as a fixed PHY with no MDIO lines, because of which mdio_read/write failed. This didn't happen in v4.9.x; has something in the core changed? [1] diff --git a/drivers/net/ethernet/ti/davinci_mdio.c b/drivers/net/ethernet/ti/davinci_mdio.c index 3e84107..197baa6 100644 --- a/drivers/net/ethernet/ti/davinci_mdio.c +++ b/drivers/net/ethernet/ti/davinci_mdio.c @@ -245,6 +245,13 @@ static int davinci_mdio_read(struct mii_bus *bus, int phy_id, int phy_reg) u32 reg; int ret; + if (phy_id == 2) + return 0; + if (phy_reg & ~PHY_REG_MASK || phy_id & ~PHY_ID_MASK) return -EINVAL; @@ -289,6 +296,13 @@ static int davinci_mdio_write(struct mii_bus *bus, int phy_id, u32 reg; int ret; + if (phy_id == 2) + return 0; + if (phy_reg & ~PHY_REG_MASK || phy_id & ~PHY_ID_MASK) return -EINVAL; Cheers, --Prabhakar Lad
[PATCH] net/mlx5: Delete unneeded function argument
priv argument is not used by the function, delete it. Fixes: a89842811ea98 ("net/mlx5e: Merge per priority stats groups") Signed-off-by: Yuval Shaia --- drivers/net/ethernet/mellanox/mlx5/core/en_stats.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c index 1646859974ce..4c4d779dafa8 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c @@ -813,7 +813,7 @@ static const struct counter_desc pport_per_prio_traffic_stats_desc[] = { #define NUM_PPORT_PER_PRIO_TRAFFIC_COUNTERS ARRAY_SIZE(pport_per_prio_traffic_stats_desc) -static int mlx5e_grp_per_prio_traffic_get_num_stats(struct mlx5e_priv *priv) +static int mlx5e_grp_per_prio_traffic_get_num_stats(void) { return NUM_PPORT_PER_PRIO_TRAFFIC_COUNTERS * NUM_PPORT_PRIO; } @@ -971,7 +971,7 @@ static int mlx5e_grp_per_prio_pfc_fill_stats(struct mlx5e_priv *priv, static int mlx5e_grp_per_prio_get_num_stats(struct mlx5e_priv *priv) { - return mlx5e_grp_per_prio_traffic_get_num_stats(priv) + + return mlx5e_grp_per_prio_traffic_get_num_stats() + mlx5e_grp_per_prio_pfc_get_num_stats(priv); } -- 2.17.1
net/ipv4/tcp.c:2278 WARN_ON(sock_owned_by_user(sk))
After moving a busy mysql database to AWS (and changing distro, kernel, etc), we see the following kernel warning every few days (on several hosts):

  WARNING: CPU: 33 PID: 75361 at net/ipv4/tcp.c:2278 tcp_close+0x40f/0x430

This is coming from the following section of tcp_close():

	/* Now socket is owned by kernel and we acquire BH lock
	 * to finish close. No need to check for user refs.
	 */
	local_bh_disable();
	bh_lock_sock(sk);
	WARN_ON(sock_owned_by_user(sk));

TL;DR -> Any ideas what could lead to this warning? I'm thinking that the fput() (and related) code above this in the stack trace should have left this socket in a state that could be closed without warnings. Could some other kernel thread be setting sk->sk_lock.owned to 1 during this tcp_close()? It appears that the socket is NOT owned by kernel, causing the warning.

This warning does not appear to break anything - however we will eventually receive this next error (can take over a week to happen):

  IPv4: Attempt to release TCP socket in state 10 39ed388a

Once this happens, mysqld will no longer accept() new clients. Existing open sockets work fine. We assume the above warning about "state 10" (TCP_LISTEN) is related, but we don't know why/how that LISTEN socket is being released or how/if it's related to the above warnings from tcp_close(). This error comes from net/ipv4/af_inet.c:144:

	if (sk->sk_type == SOCK_STREAM && sk->sk_state != TCP_CLOSE) {
		pr_err("Attempt to release TCP socket in state %d %p\n",
		       sk->sk_state, sk);
		return;
	}

For what it's worth, mysqld seems completely unaware that it cannot take new clients. GDB attached to the user-mode process (pstack) shows that it's still listening for new connections (but they will never come up the stack). For this reason, I don't think mysqld was part of the attempt to close that LISTEN socket.
Thread 1 (Thread 0x7fd82b8bd8c0 (LWP 5633)):
#0  0x7fd82942ff0d in poll () from /lib64/libc.so.6
#1  0x00d9c5fe in Mysqld_socket_listener::listen_for_connection_event() ()
#2  0x00798461 in mysqld_main(int, char**) ()
#3  0x7fd82935e445 in __libc_start_main () from /lib64/libc.so.6
#4  0x0078c875 in _start ()

Extra details below:

Kernel is 4.16.13-1.el7.elrepo.x86_64 #1 running in AWS.

https://elixir.bootlin.com/linux/v4.16.13/source/net/ipv4/tcp.c#L2278
https://elixir.bootlin.com/linux/v4.16.13/source/net/ipv4/af_inet.c#L144

[726780.788201] WARNING: CPU: 15 PID: 52245 at net/ipv4/tcp.c:2278 tcp_close+0x40f/0x430
[726780.794947] Modules linked in: binfmt_misc nf_conntrack_netlink nfnetlink_queue tcp_diag inet_diag isofs ip6table_mangle ip6table_raw ip6table_nat nf_nat_ipv6 iptable_security xt_CT iptable_raw iptable_nat nf_nat_ipv4 nf_nat iptable_mangle xt_pkttype xt_NFLOG nfnetlink_log xt_u32 xt_multiport xt_set xt_conntrack ip_set_hash_netport ip_set_hash_ipport ip_set_hash_net ip_set_hash_ip ip_set nfnetlink nf_conntrack_proto_gre nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables xt_LOG nf_conntrack_tftp nf_conntrack_ftp nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_filter zfs(PO) zunicode(PO) zavl(PO) icp(PO) sb_edac intel_powerclamp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel crypto_simd glue_helper cryptd cirrus snd_seq zcommon(PO) ttm intel_rapl_perf snd_seq_device
[726780.848696]  drm_kms_helper znvpair(PO) snd_pcm snd_timer spl(O) drm snd soundcore syscopyarea sysfillrect pcspkr sysimgblt input_leds fb_sys_fops i2c_piix4 ip_tables xfs libcrc32c ata_generic pata_acpi ata_piix xen_blkfront crc32c_intel libata ena(O) serio_raw floppy sunrpc
[726780.863916] CPU: 15 PID: 52245 Comm: mysqld Tainted: P        W  O     4.16.13-1.el7.elrepo.x86_64 #1
[726780.869486] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[726780.873756] RIP: 0010:tcp_close+0x40f/0x430
[726780.877049] RSP: 0018:c90063a87dd0 EFLAGS: 00010202
[726780.880913] RAX: 0001 RBX: 8839cbb4b300 RCX: 0001
[726780.885708] RDX: 0041 RSI: 00023540 RDI: 002b
[726780.890412] RBP: c90063a87df0 R08: R09: 0101
[726780.895192] R10: 03ff R11: 2057 R12: 8839cbb4b388
[726780.900029] R13: 0009 R14: 8839cbb4b3c8 R15: 883c5ed14900
[726780.904777] FS: 7f6acd467700() GS:883c8b1c() knlGS:
[726780.909964] CS: 0010 DS: ES: CR0: 80050033
[726780.914078] CR2: 7f6cd5d430c8 CR3: 003c7e6b6005 CR4: 001606e0
[726780.918837] DR0: DR1: DR2:
[726780.923433] DR3: DR6: fffe0ff0 DR7: 0400
[726780.927882] Call Trace:
[726780.930410]  inet_release+0x42/0x70
[726780.933423]  inet6_release+0x30/0x40
[726780.936439]  sock_release+0x25/0x80
[726780.939421]
Re: [PATCH stable 4.4 4/9] tcp: use an RB tree for ooo receive queue
On Wed, Aug 15, 2018 at 09:21:03PM +0800, Mao Wenan wrote:
> From: Yaogong Wang
>
> Over the years, TCP BDP has increased by several orders of magnitude,
> and some people are considering to reach the 2 Gbytes limit.
>
> Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000
> MSS.
>
> In presence of packet losses (or reorders), TCP stores incoming packets
> into an out of order queue, and number of skbs sitting there waiting for
> the missing packets to be received can be in the 10^5 range.
>
> Most packets are appended to the tail of this queue, and when
> packets can finally be transferred to receive queue, we scan the queue
> from its head.
>
> However, in presence of heavy losses, we might have to find an arbitrary
> point in this queue, involving a linear scan for every incoming packet,
> throwing away cpu caches.
>
> This patch converts it to a RB tree, to get bounded latencies.
>
> Yaogong wrote a preliminary patch about 2 years ago.
> Eric did the rebase, added ofo_last_skb cache, polishing and tests.
>
> Tested with network dropping between 1 and 10 % packets, with good
> success (about 30 % increase of throughput in stress tests)
>
> Next step would be to also use an RB tree for the write queue at sender
> side ;)
>
> Signed-off-by: Yaogong Wang
> Signed-off-by: Eric Dumazet
> Cc: Yuchung Cheng
> Cc: Neal Cardwell
> Cc: Ilpo Järvinen
> Acked-By: Ilpo Järvinen
> Signed-off-by: David S. Miller
> Signed-off-by: root

root and commit id?
Re: [PATCH stable 4.4 6/9] tcp: avoid collapses in tcp_prune_queue() if possible
On Wed, Aug 15, 2018 at 09:21:05PM +0800, Mao Wenan wrote:
> From: Eric Dumazet
>
> [ Upstream commit f4a3313d8e2ca9fd8d8f45e40a2903ba782607e7 ]
>
> Right after a TCP flow is created, receiving tiny out of order
> packets always hit the condition :
>
> 	if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
> 		tcp_clamp_window(sk);
>
> tcp_clamp_window() increases sk_rcvbuf to match sk_rmem_alloc
> (guarded by tcp_rmem[2])
>
> Calling tcp_collapse_ofo_queue() in this case is not useful,
> and offers a O(N^2) surface attack to malicious peers.
>
> Better not attempt anything before full queue capacity is reached,
> forcing attacker to spend lots of resource and allow us to more
> easily detect the abuse.
>
> Signed-off-by: Eric Dumazet
> Acked-by: Soheil Hassas Yeganeh
> Acked-by: Yuchung Cheng
> Signed-off-by: David S. Miller
> Signed-off-by: Greg Kroah-Hartman
> Signed-off-by: root

root?
Re: [PATCH stable 4.4 5/9] tcp: free batches of packets in tcp_prune_ofo_queue()
On Wed, Aug 15, 2018 at 09:21:04PM +0800, Mao Wenan wrote:
> From: Eric Dumazet
>
> Juha-Matti Tilli reported that malicious peers could inject tiny
> packets in out_of_order_queue, forcing very expensive calls
> to tcp_collapse_ofo_queue() and tcp_prune_ofo_queue() for
> every incoming packet. out_of_order_queue rb-tree can contain
> thousands of nodes, iterating over all of them is not nice.
>
> Before linux-4.9, we would have pruned all packets in ofo_queue
> in one go, every XXXX packets. XXXX depends on sk_rcvbuf and skbs
> truesize, but is about 7000 packets with tcp_rmem[2] default of 6 MB.
>
> Since we plan to increase tcp_rmem[2] in the future to cope with
> modern BDP, can not revert to the old behavior, without great pain.
>
> Strategy taken in this patch is to purge ~12.5 % of the queue capacity.
>
> Fixes: 36a6503fedda ("tcp: refine tcp_prune_ofo_queue() to not drop all packets")
> Signed-off-by: Eric Dumazet
> Reported-by: Juha-Matti Tilli
> Acked-by: Yuchung Cheng
> Acked-by: Soheil Hassas Yeganeh
> Signed-off-by: David S. Miller
> Signed-off-by: root

root? And commit id?

thanks,

greg k-h
Re: [PATCH stable 4.4 9/9] tcp: add tcp_ooo_try_coalesce() helper
On Wed, Aug 15, 2018 at 09:21:08PM +0800, Mao Wenan wrote:
> From: Eric Dumazet
>
> In case skb in out_or_order_queue is the result of
> multiple skbs coalescing, we would like to get a proper gso_segs
> counter tracking, so that future tcp_drop() can report an accurate
> number.
>
> I chose to not implement this tracking for skbs in receive queue,
> since they are not dropped, unless socket is disconnected.
>
> Signed-off-by: Eric Dumazet
> Acked-by: Soheil Hassas Yeganeh
> Acked-by: Yuchung Cheng
> Signed-off-by: David S. Miller
> Signed-off-by: Mao Wenan

Upstream commit id?
Re: [PATCH stable 4.4 0/9] fix SegmentSmack (CVE-2018-5390)
On Wed, Aug 15, 2018 at 09:20:59PM +0800, Mao Wenan wrote:
> There are five patches to fix CVE-2018-5390 in latest mainline
> branch, but only two patches exist in stable 4.4 and 3.18:
> dc6ae4d tcp: detect malicious patterns in tcp_collapse_ofo_queue()
> 5fbec48 tcp: avoid collapses in tcp_prune_queue() if possible
> but I have tested with these patches, and found the cpu usage was very high.
> test results:
> with fix patch: 78.2% ksoftirqd
> no fix patch: 90% ksoftirqd
>
> After analysing the codes of stable 4.4, and debuging the
> system, the search of ofo_queue(tcp ofo using a simple queue) cost more
> cycles.
> So I think only two patches can't fix the CVE-2018-5390.
> So I try to backport "tcp: use an RB tree for ooo receive queue" using RB tree
> instead of simple queue, then backport Eric Dumazet 5 fixed patches in mainline,
> good news is that ksoftirqd is turn to about 20%, which is the same with
> mainline now.

Thanks for doing this work, I had some questions on the individual
patches. Can you address them and resend?

thanks,

greg k-h
Re: [PATCH stable 4.4 8/9] tcp: call tcp_drop() from tcp_data_queue_ofo()
On Wed, Aug 15, 2018 at 09:21:07PM +0800, Mao Wenan wrote:
> From: Eric Dumazet
>
> In order to be able to give better diagnostics and detect
> malicious traffic, we need to have better sk->sk_drops tracking.
>
> Fixes: 9f5afeae5152 ("tcp: use an RB tree for ooo receive queue")
> Signed-off-by: Eric Dumazet
> Acked-by: Soheil Hassas Yeganeh
> Acked-by: Yuchung Cheng
> Signed-off-by: David S. Miller
> Signed-off-by: Mao Wenan
> ---
>  net/ipv4/tcp_input.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Upstream commit id?
Re: [PATCH stable 4.4 3/9] tcp: increment sk_drops for dropped rx packets
On Wed, Aug 15, 2018 at 09:21:02PM +0800, Mao Wenan wrote:
> From: Eric Dumazet
>
> Now ss can report sk_drops, we can instruct TCP to increment
> this per socket counter when it drops an incoming frame, to refine
> monitoring and debugging.
>
> Following patch takes care of listeners drops.
>
> Signed-off-by: Eric Dumazet
> Signed-off-by: David S. Miller
> Signed-off-by: Mao Wenan

What is the upstream git commit id for this?

thanks,

greg k-h
Re: [PATCH stable 4.4 7/9] tcp: detect malicious patterns in tcp_collapse_ofo_queue()
On Wed, Aug 15, 2018 at 09:21:06PM +0800, Mao Wenan wrote:
> From: Eric Dumazet
>
> [ Upstream commit 3d4bf93ac12003f9b8e1e2de37fe27983deebdcf ]
>
> In case an attacker feeds tiny packets completely out of order,
> tcp_collapse_ofo_queue() might scan the whole rb-tree, performing
> expensive copies, but not changing socket memory usage at all.
>
> 1) Do not attempt to collapse tiny skbs.
> 2) Add logic to exit early when too many tiny skbs are detected.
>
> We prefer not doing aggressive collapsing (which copies packets)
> for pathological flows, and revert to tcp_prune_ofo_queue() which
> will be less expensive.
>
> In the future, we might add the possibility of terminating flows
> that are proven to be malicious.
>
> Signed-off-by: Eric Dumazet
> Acked-by: Soheil Hassas Yeganeh
> Signed-off-by: David S. Miller
> Signed-off-by: Greg Kroah-Hartman
> Signed-off-by: root

signed off by from "root"? :)

And why are you adding the patch back after removing it?

thanks,

greg k-h
Re: [PATCH stable 4.4 2/9] Revert "tcp: avoid collapses in tcp_prune_queue() if possible"
On Wed, Aug 15, 2018 at 09:21:01PM +0800, Mao Wenan wrote:
> This reverts commit 5fbec4801264cb3279ef6ac9c70bcbe2aaef89d5.
> ---

Same here for description and signed off by.

thanks,

greg k-h
Re: [PATCH stable 4.4 1/9] Revert "tcp: detect malicious patterns in tcp_collapse_ofo_queue()"
On Wed, Aug 15, 2018 at 09:21:00PM +0800, Mao Wenan wrote:
> This reverts commit dc6ae4dffd656811dee7151b19545e4cd839d378.

I need a reason why, and a signed-off-by line :(

thanks,

greg k-h
Re: I found a strange place while reading “net/ipv6/reassembly.c”
2018-08-15, 04:38:29 +0000, Ttttabcd wrote:
> Hello everyone who develops the kernel.
>
> At the beginning I was looking for the source author, but his email
> address has expired, so I can only come here to ask questions.
>
> The problem is in the /net/ipv6/reassembly.c file, the author is
> Pedro Roque.
>
> I found some strange places when I read the code for this file
> (Linux Kernel version 4.18).
>
> In the "/net/ipv6/reassembly.c", in the function "ip6_frag_queue":
>
> 	offset = ntohs(fhdr->frag_off) & ~0x7;
> 	end = offset + (ntohs(ipv6_hdr(skb)->payload_len) -
> 			((u8 *)(fhdr + 1) - (u8 *)(ipv6_hdr(skb) + 1)));
>
> 	if ((unsigned int)end > IPV6_MAXPLEN) {
> 		*prob_offset = (u8 *)&fhdr->frag_off - skb_network_header(skb);
> 		return -1;
> 	}
>
> Here the length of the payload is judged.

This check is based on the fragment currently being processed, and only
considers the reassembled payload.

> And in the function "ip6_frag_reasm"

Re-adding the comment that comes just above this:

	/* Unfragmented part is taken from the first segment. */

> 	payload_len = ((head->data - skb_network_header(head)) -
> 		       sizeof(struct ipv6hdr) + fq->q.len -
> 		       sizeof(struct frag_hdr));
> 	if (payload_len > IPV6_MAXPLEN)
> 		goto out_oversize;

This considers the reassembled payload (ie, as above) *and* any extension
headers that might come before it.

> ..
> out_oversize:
> 	net_dbg_ratelimited("ip6_frag_reasm: payload len = %d\n",
> 			    payload_len);
> 	goto out_fail;
>
> Here also judges the length of the payload.
>
> Judged the payload length twice.
>
> I tested that the code in the label "out_oversize:" does not execute
> at all, because it has been returned in "ip6_frag_queue".

If you try again with a routing header, I think you'll see it trigger.

> Unless I comment out the code that judge the payload length in the
> function "ip6_frag_queue", the code labeled "out_oversize:" can be
> executed.
>
> So, is this repeated?

-- 
Sabrina
[PATCH stable 4.4 4/9] tcp: use an RB tree for ooo receive queue
From: Yaogong Wang

Over the years, TCP BDP has increased by several orders of magnitude, and some people are considering to reach the 2 Gbytes limit.

Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000 MSS.

In presence of packet losses (or reorders), TCP stores incoming packets into an out of order queue, and number of skbs sitting there waiting for the missing packets to be received can be in the 10^5 range.

Most packets are appended to the tail of this queue, and when packets can finally be transferred to receive queue, we scan the queue from its head.

However, in presence of heavy losses, we might have to find an arbitrary point in this queue, involving a linear scan for every incoming packet, throwing away cpu caches.

This patch converts it to a RB tree, to get bounded latencies.

Yaogong wrote a preliminary patch about 2 years ago. Eric did the rebase, added ofo_last_skb cache, polishing and tests.

Tested with network dropping between 1 and 10 % packets, with good success (about 30 % increase of throughput in stress tests)

Next step would be to also use an RB tree for the write queue at sender side ;)

Signed-off-by: Yaogong Wang
Signed-off-by: Eric Dumazet
Cc: Yuchung Cheng
Cc: Neal Cardwell
Cc: Ilpo Järvinen
Acked-By: Ilpo Järvinen
Signed-off-by: David S. Miller
Signed-off-by: root
Signed-off-by: Mao Wenan
---
 include/linux/skbuff.h   |   8 ++
 include/linux/tcp.h      |   7 +-
 include/net/tcp.h        |   2 +-
 net/core/skbuff.c        |  19 +++
 net/ipv4/tcp.c           |   4 +-
 net/ipv4/tcp_input.c     | 354 +++
 net/ipv4/tcp_ipv4.c      |   2 +-
 net/ipv4/tcp_minisocks.c |   1 -
 8 files changed, 241 insertions(+), 156 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c28bd8b..a490dd7 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2273,6 +2273,8 @@ static inline void __skb_queue_purge(struct sk_buff_head *list)
 		kfree_skb(skb);
 }
 
+void skb_rbtree_purge(struct rb_root *root);
+
 void *netdev_alloc_frag(unsigned int fragsz);
 
 struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 				   unsigned int length,
@@ -2807,6 +2809,12 @@ static inline int pskb_trim_rcsum(struct sk_buff *skb, unsigned int len)
 	return __pskb_trim(skb, len);
 }
 
+#define rb_to_skb(rb)	rb_entry_safe(rb, struct sk_buff, rbnode)
+#define skb_rb_first(root) rb_to_skb(rb_first(root))
+#define skb_rb_last(root)  rb_to_skb(rb_last(root))
+#define skb_rb_next(skb)   rb_to_skb(rb_next(&(skb)->rbnode))
+#define skb_rb_prev(skb)   rb_to_skb(rb_prev(&(skb)->rbnode))
+
 #define skb_queue_walk(queue, skb) \
 	for (skb = (queue)->next;					\
 	     skb != (struct sk_buff *)(queue);				\
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 5b6df1a..747404d 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -279,10 +279,9 @@ struct tcp_sock {
 	struct sk_buff* lost_skb_hint;
 	struct sk_buff *retransmit_skb_hint;
 
-	/* OOO segments go in this list. Note that socket lock must be held,
-	 * as we do not use sk_buff_head lock.
-	 */
-	struct sk_buff_head out_of_order_queue;
+	/* OOO segments go in this rbtree. Socket lock must be held. */
+	struct rb_root out_of_order_queue;
+	struct sk_buff *ooo_last_skb; /* cache rb_last(out_of_order_queue) */
 
 	/* SACKs data, these 2 need to be together (see tcp_options_write) */
 	struct tcp_sack_block duplicate_sack[1]; /* D-SACK block */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index cac4a6ad..8bc259d 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -649,7 +649,7 @@ static inline void tcp_fast_path_check(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	if (skb_queue_empty(&tp->out_of_order_queue) &&
+	if (RB_EMPTY_ROOT(&tp->out_of_order_queue) &&
 	    tp->rcv_wnd &&
 	    atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf &&
 	    !tp->urg_data)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 55be076..9703924 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2378,6 +2378,25 @@ void skb_queue_purge(struct sk_buff_head *list)
 EXPORT_SYMBOL(skb_queue_purge);
 
 /**
+ * skb_rbtree_purge - empty a skb rbtree
+ * @root: root of the rbtree to empty
+ *
+ * Delete all buffers on an &sk_buff rbtree. Each buffer is removed from
+ * the list and one reference dropped. This function does not take
+ * any lock. Synchronization should be handled by the caller (e.g., TCP
+ * out-of-order queue is protected by the socket lock).
+ */
+void skb_rbtree_purge(struct rb_root *root)
+{
+	struct sk_buff *skb, *next;
+
+	rbtree_postorder_for_each_entry_safe(skb, next, root, rbnode)
+		kfree_skb(skb);
+
+	*root = RB_ROOT;
+}
+
+/**
  * skb_queue_head - queue a
[PATCH stable 4.4 8/9] tcp: call tcp_drop() from tcp_data_queue_ofo()
From: Eric Dumazet

In order to be able to give better diagnostics and detect malicious traffic, we need to have better sk->sk_drops tracking.

Fixes: 9f5afeae5152 ("tcp: use an RB tree for ooo receive queue")
Signed-off-by: Eric Dumazet
Acked-by: Soheil Hassas Yeganeh
Acked-by: Yuchung Cheng
Signed-off-by: David S. Miller
Signed-off-by: Mao Wenan
---
 net/ipv4/tcp_input.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c48924f..96a1e0d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4445,7 +4445,7 @@ coalesce_done:
 			/* All the bits are present. Drop. */
 			NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPOFOMERGE);
-			__kfree_skb(skb);
+			tcp_drop(sk, skb);
 			skb = NULL;
 			tcp_dsack_set(sk, seq, end_seq);
 			goto add_sack;
@@ -4464,7 +4464,7 @@ coalesce_done:
 					 TCP_SKB_CB(skb1)->end_seq);
 			NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPOFOMERGE);
-			__kfree_skb(skb1);
+			tcp_drop(sk, skb1);
 			goto add_sack;
 		}
 	} else if (tcp_try_coalesce(sk, skb1, skb, &fragstolen)) {
-- 
1.8.3.1
[PATCH stable 4.4 9/9] tcp: add tcp_ooo_try_coalesce() helper
From: Eric Dumazet

In case skb in out_or_order_queue is the result of multiple skbs coalescing, we would like to get a proper gso_segs counter tracking, so that future tcp_drop() can report an accurate number.

I chose to not implement this tracking for skbs in receive queue, since they are not dropped, unless socket is disconnected.

Signed-off-by: Eric Dumazet
Acked-by: Soheil Hassas Yeganeh
Acked-by: Yuchung Cheng
Signed-off-by: David S. Miller
Signed-off-by: Mao Wenan
---
 net/ipv4/tcp_input.c | 23 +-
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 96a1e0d..fdb5509 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4296,6 +4296,23 @@ static bool tcp_try_coalesce(struct sock *sk,
 	return true;
 }
 
+static bool tcp_ooo_try_coalesce(struct sock *sk,
+				 struct sk_buff *to,
+				 struct sk_buff *from,
+				 bool *fragstolen)
+{
+	bool res = tcp_try_coalesce(sk, to, from, fragstolen);
+
+	/* In case tcp_drop() is called later, update to->gso_segs */
+	if (res) {
+		u32 gso_segs = max_t(u16, 1, skb_shinfo(to)->gso_segs) +
+			       max_t(u16, 1, skb_shinfo(from)->gso_segs);
+
+		skb_shinfo(to)->gso_segs = min_t(u32, gso_segs, 0xFFFF);
+	}
+	return res;
+}
+
 static void tcp_drop(struct sock *sk, struct sk_buff *skb)
 {
 	sk_drops_add(sk, skb);
@@ -4422,7 +4439,8 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 	/* In the typical case, we are adding an skb to the end of the list.
 	 * Use of ooo_last_skb avoids the O(Log(N)) rbtree lookup.
 	 */
-	if (tcp_try_coalesce(sk, tp->ooo_last_skb, skb, &fragstolen)) {
+	if (tcp_ooo_try_coalesce(sk, tp->ooo_last_skb,
+				 skb, &fragstolen)) {
 coalesce_done:
 		tcp_grow_window(sk, skb);
 		kfree_skb_partial(skb, fragstolen);
@@ -4467,7 +4485,8 @@ coalesce_done:
 			tcp_drop(sk, skb1);
 			goto add_sack;
 		}
-	} else if (tcp_try_coalesce(sk, skb1, skb, &fragstolen)) {
+	} else if (tcp_ooo_try_coalesce(sk, skb1,
+					skb, &fragstolen)) {
 		goto coalesce_done;
 	}
 	p = &parent->rb_right;
-- 
1.8.3.1
[PATCH stable 4.4 3/9] tcp: increment sk_drops for dropped rx packets
From: Eric Dumazet

Now ss can report sk_drops, we can instruct TCP to increment this per socket counter when it drops an incoming frame, to refine monitoring and debugging.

Following patch takes care of listeners drops.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
Signed-off-by: Mao Wenan
---
 include/net/sock.h   |  7 +++
 net/ipv4/tcp_input.c | 33 -
 net/ipv4/tcp_ipv4.c  |  1 +
 net/ipv6/tcp_ipv6.c  |  1 +
 4 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 3d5ff74..5770757 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2139,6 +2139,13 @@ sock_skb_set_dropcount(const struct sock *sk, struct sk_buff *skb)
 	SOCK_SKB_CB(skb)->dropcount = atomic_read(&sk->sk_drops);
 }
 
+static inline void sk_drops_add(struct sock *sk, const struct sk_buff *skb)
+{
+	int segs = max_t(u16, 1, skb_shinfo(skb)->gso_segs);
+
+	atomic_add(segs, &sk->sk_drops);
+}
+
 void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk,
 			   struct sk_buff *skb);
 void __sock_recv_wifi_status(struct msghdr *msg, struct sock *sk,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index df2f342..5fb4e80 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4296,6 +4296,12 @@ static bool tcp_try_coalesce(struct sock *sk,
 	return true;
 }
 
+static void tcp_drop(struct sock *sk, struct sk_buff *skb)
+{
+	sk_drops_add(sk, skb);
+	__kfree_skb(skb);
+}
+
 /* This one checks to see if we can put data from the
  * out_of_order queue into the receive_queue.
  */
@@ -4320,7 +4326,7 @@ static void tcp_ofo_queue(struct sock *sk)
 		__skb_unlink(skb, &tp->out_of_order_queue);
 		if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt)) {
 			SOCK_DEBUG(sk, "ofo packet was already received\n");
-			__kfree_skb(skb);
+			tcp_drop(sk, skb);
 			continue;
 		}
 		SOCK_DEBUG(sk, "ofo requeuing : rcv_next %X seq %X - %X\n",
@@ -4372,7 +4378,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 
 	if (unlikely(tcp_try_rmem_schedule(sk, skb, skb->truesize))) {
 		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFODROP);
-		__kfree_skb(skb);
+		tcp_drop(sk, skb);
 		return;
 	}
@@ -4436,7 +4442,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 		if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
 			/* All the bits are present. Drop. */
 			NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOMERGE);
-			__kfree_skb(skb);
+			tcp_drop(sk, skb);
 			skb = NULL;
 			tcp_dsack_set(sk, seq, end_seq);
 			goto add_sack;
@@ -4475,7 +4481,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 		tcp_dsack_extend(sk, TCP_SKB_CB(skb1)->seq,
 				 TCP_SKB_CB(skb1)->end_seq);
 		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOMERGE);
-		__kfree_skb(skb1);
+		tcp_drop(sk, skb1);
 	}
 
 add_sack:
@@ -4558,12 +4564,13 @@ err:
 static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	int eaten = -1;
 	bool fragstolen = false;
+	int eaten = -1;
 
-	if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq)
-		goto drop;
-
+	if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq) {
+		__kfree_skb(skb);
+		return;
+	}
 	skb_dst_drop(skb);
 	__skb_pull(skb, tcp_hdr(skb)->doff * 4);
@@ -4645,7 +4652,7 @@ out_of_window:
 	tcp_enter_quickack_mode(sk, TCP_MAX_QUICKACKS);
 	inet_csk_schedule_ack(sk);
 drop:
-	__kfree_skb(skb);
+	tcp_drop(sk, skb);
 	return;
 }
@@ -5220,7 +5227,7 @@ syn_challenge:
 	return true;
 
 discard:
-	__kfree_skb(skb);
+	tcp_drop(sk, skb);
 	return false;
 }
@@ -5438,7 +5445,7 @@ csum_error:
 	TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_INERRS);
 
 discard:
-	__kfree_skb(skb);
+	tcp_drop(sk, skb);
 }
 EXPORT_SYMBOL(tcp_rcv_established);
@@ -5668,7 +5675,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 					  TCP_DELACK_MAX, TCP_RTO_MAX);
 
 discard:
-		__kfree_skb(skb);
+		tcp_drop(sk, skb);
 		return 0;
 	} else {
 		tcp_send_ack(sk);
@@ -6025,7 +6032,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 
 	if (!queued) {
 discard:
-
[PATCH stable 4.4 0/9] fix SegmentSmack (CVE-2018-5390)
There are five patches to fix CVE-2018-5390 in the latest mainline branch, but only two patches exist in stable 4.4 and 3.18:

dc6ae4d tcp: detect malicious patterns in tcp_collapse_ofo_queue()
5fbec48 tcp: avoid collapses in tcp_prune_queue() if possible

but I have tested with these patches, and found the cpu usage was very high. Test results:

with fix patch: 78.2% ksoftirqd
no fix patch:   90%   ksoftirqd

After analysing the code of stable 4.4 and debugging the system, the search of ofo_queue (tcp ofo using a simple queue) costs more cycles. So I think the two patches alone can't fix CVE-2018-5390.

So I tried to backport "tcp: use an RB tree for ooo receive queue", using an RB tree instead of a simple queue, then backported Eric Dumazet's 5 fix patches from mainline; the good news is that ksoftirqd drops to about 20%, which is the same as mainline now.

Eric Dumazet (6):
  tcp: increment sk_drops for dropped rx packets
  tcp: free batches of packets in tcp_prune_ofo_queue()
  tcp: avoid collapses in tcp_prune_queue() if possible
  tcp: detect malicious patterns in tcp_collapse_ofo_queue()
  tcp: call tcp_drop() from tcp_data_queue_ofo()
  tcp: add tcp_ooo_try_coalesce() helper

Mao Wenan (2):
  Revert "tcp: detect malicious patterns in tcp_collapse_ofo_queue()"
  Revert "tcp: avoid collapses in tcp_prune_queue() if possible"

Yaogong Wang (1):
  tcp: use an RB tree for ooo receive queue

 include/linux/skbuff.h   |   8 +
 include/linux/tcp.h      |   7 +-
 include/net/sock.h       |   7 +
 include/net/tcp.h        |   2 +-
 net/core/skbuff.c        |  19 +++
 net/ipv4/tcp.c           |   4 +-
 net/ipv4/tcp_input.c     | 412 +--
 net/ipv4/tcp_ipv4.c      |   3 +-
 net/ipv4/tcp_minisocks.c |   1 -
 net/ipv6/tcp_ipv6.c      |   1 +
 10 files changed, 294 insertions(+), 170 deletions(-)
-- 
1.8.3.1
[PATCH stable 4.4 1/9] Revert "tcp: detect malicious patterns in tcp_collapse_ofo_queue()"
This reverts commit dc6ae4dffd656811dee7151b19545e4cd839d378.
---
 net/ipv4/tcp_input.c | 16 ++--
 1 file changed, 2 insertions(+), 14 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 4a261e0..995b2bc 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4791,7 +4791,6 @@ restart:
 static void tcp_collapse_ofo_queue(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	u32 range_truesize, sum_tiny = 0;
 	struct sk_buff *skb = skb_peek(&tp->out_of_order_queue);
 	struct sk_buff *head;
 	u32 start, end;
@@ -4801,7 +4800,6 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
 	start = TCP_SKB_CB(skb)->seq;
 	end = TCP_SKB_CB(skb)->end_seq;
-	range_truesize = skb->truesize;
 	head = skb;
 
 	for (;;) {
@@ -4816,24 +4814,14 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
 		if (!skb ||
 		    after(TCP_SKB_CB(skb)->seq, end) ||
 		    before(TCP_SKB_CB(skb)->end_seq, start)) {
-			/* Do not attempt collapsing tiny skbs */
-			if (range_truesize != head->truesize ||
-			    end - start >= SKB_WITH_OVERHEAD(SK_MEM_QUANTUM)) {
-				tcp_collapse(sk, &tp->out_of_order_queue,
-					     head, skb, start, end);
-			} else {
-				sum_tiny += range_truesize;
-				if (sum_tiny > sk->sk_rcvbuf >> 3)
-					return;
-			}
-
+			tcp_collapse(sk, &tp->out_of_order_queue,
+				     head, skb, start, end);
 			head = skb;
 			if (!skb)
 				break;
 			/* Start new segment */
 			start = TCP_SKB_CB(skb)->seq;
 			end = TCP_SKB_CB(skb)->end_seq;
-			range_truesize = skb->truesize;
 		} else {
 			if (before(TCP_SKB_CB(skb)->seq, start))
 				start = TCP_SKB_CB(skb)->seq;
-- 
1.8.3.1
[PATCH stable 4.4 2/9] Revert "tcp: avoid collapses in tcp_prune_queue() if possible"
This reverts commit 5fbec4801264cb3279ef6ac9c70bcbe2aaef89d5.
---
 net/ipv4/tcp_input.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 995b2bc..df2f342 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4877,9 +4877,6 @@ static int tcp_prune_queue(struct sock *sk)
 	else if (tcp_under_memory_pressure(sk))
 		tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U * tp->advmss);
 
-	if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf)
-		return 0;
-
 	tcp_collapse_ofo_queue(sk);
 	if (!skb_queue_empty(&sk->sk_receive_queue))
 		tcp_collapse(sk, &sk->sk_receive_queue,
-- 
1.8.3.1
[PATCH stable 4.4 5/9] tcp: free batches of packets in tcp_prune_ofo_queue()
From: Eric Dumazet

Juha-Matti Tilli reported that malicious peers could inject tiny packets in out_of_order_queue, forcing very expensive calls to tcp_collapse_ofo_queue() and tcp_prune_ofo_queue() for every incoming packet. out_of_order_queue rb-tree can contain thousands of nodes, iterating over all of them is not nice.

Before linux-4.9, we would have pruned all packets in ofo_queue in one go, every XXXX packets. XXXX depends on sk_rcvbuf and skbs truesize, but is about 7000 packets with tcp_rmem[2] default of 6 MB.

Since we plan to increase tcp_rmem[2] in the future to cope with modern BDP, can not revert to the old behavior, without great pain.

Strategy taken in this patch is to purge ~12.5 % of the queue capacity.

Fixes: 36a6503fedda ("tcp: refine tcp_prune_ofo_queue() to not drop all packets")
Signed-off-by: Eric Dumazet
Reported-by: Juha-Matti Tilli
Acked-by: Yuchung Cheng
Acked-by: Soheil Hassas Yeganeh
Signed-off-by: David S. Miller
Signed-off-by: root
Signed-off-by: Mao Wenan
---
 net/ipv4/tcp_input.c | 16 ++--
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 12edc4f..32225dc 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4893,22 +4893,26 @@ static bool tcp_prune_ofo_queue(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct rb_node *node, *prev;
+	int goal;
 
 	if (RB_EMPTY_ROOT(&tp->out_of_order_queue))
 		return false;
 
 	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_OFOPRUNED);
-
+	goal = sk->sk_rcvbuf >> 3;
 	node = &tp->ooo_last_skb->rbnode;
 	do {
 		prev = rb_prev(node);
 		rb_erase(node, &tp->out_of_order_queue);
+		goal -= rb_to_skb(node)->truesize;
 		__kfree_skb(rb_to_skb(node));
-		sk_mem_reclaim(sk);
-		if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf &&
-		    !tcp_under_memory_pressure(sk))
-			break;
-
+		if (!prev || goal <= 0) {
+			sk_mem_reclaim(sk);
+			if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf &&
+			    !tcp_under_memory_pressure(sk))
+				break;
+			goal = sk->sk_rcvbuf >> 3;
+		}
 		node = prev;
 	} while (node);
 	tp->ooo_last_skb = rb_entry(prev, struct sk_buff, rbnode);
-- 
1.8.3.1
[PATCH stable 4.4 7/9] tcp: detect malicious patterns in tcp_collapse_ofo_queue()
From: Eric Dumazet

[ Upstream commit 3d4bf93ac12003f9b8e1e2de37fe27983deebdcf ]

In case an attacker feeds tiny packets completely out of order,
tcp_collapse_ofo_queue() might scan the whole rb-tree, performing
expensive copies, but not changing socket memory usage at all.

1) Do not attempt to collapse tiny skbs.
2) Add logic to exit early when too many tiny skbs are detected.

We prefer not doing aggressive collapsing (which copies packets)
for pathological flows, and revert to tcp_prune_ofo_queue() which
will be less expensive.

In the future, we might add the possibility of terminating flows
that are proven to be malicious.

Signed-off-by: Eric Dumazet
Acked-by: Soheil Hassas Yeganeh
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
Signed-off-by: root
Signed-off-by: Mao Wenan
---
 net/ipv4/tcp_input.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 77130ae..c48924f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4845,6 +4845,7 @@ end:
 static void tcp_collapse_ofo_queue(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	u32 range_truesize, sum_tiny = 0;
 	struct sk_buff *skb, *head;
 	struct rb_node *p;
 	u32 start, end;
@@ -4863,6 +4864,7 @@ new_range:
 	}
 	start = TCP_SKB_CB(skb)->seq;
 	end = TCP_SKB_CB(skb)->end_seq;
+	range_truesize = skb->truesize;
 
 	for (head = skb;;) {
 		skb = tcp_skb_next(skb, NULL);
@@ -4873,11 +4875,21 @@ new_range:
 		if (!skb ||
 		    after(TCP_SKB_CB(skb)->seq, end) ||
 		    before(TCP_SKB_CB(skb)->end_seq, start)) {
-			tcp_collapse(sk, NULL, &tp->out_of_order_queue,
-				     head, skb, start, end);
+			/* Do not attempt collapsing tiny skbs */
+			if (range_truesize != head->truesize ||
+			    end - start >= SKB_WITH_OVERHEAD(SK_MEM_QUANTUM)) {
+				tcp_collapse(sk, NULL, &tp->out_of_order_queue,
+					     head, skb, start, end);
+			} else {
+				sum_tiny += range_truesize;
+				if (sum_tiny > sk->sk_rcvbuf >> 3)
+					return;
+			}
+			goto new_range;
 		}
+		range_truesize += skb->truesize;
 		if (unlikely(before(TCP_SKB_CB(skb)->seq, start)))
 			start = TCP_SKB_CB(skb)->seq;
 		if (after(TCP_SKB_CB(skb)->end_seq, end))
-- 
1.8.3.1
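The two exit conditions can be modelled in isolation. The sketch below
is user-space code with invented names (SK_MEM_QUANTUM's value here is
illustrative, not the kernel's): a contiguous range is only worth the
copying collapse if it holds more than its head skb or spans enough
sequence space, and the scan gives up once tiny ranges alone exceed
1/8 of the receive buffer, as the patch does with sum_tiny.

```c
#include <assert.h>

#define SK_MEM_QUANTUM 4096	/* illustrative; not the kernel's value */

struct range {
	unsigned int seq_len;		/* end_seq - seq covered by the range */
	unsigned int truesize;		/* total truesize of skbs in the range */
	unsigned int head_truesize;	/* truesize of the first skb */
};

/* Mirror of the patch's test: collapsing copies data, so only do it
 * when the range holds more than its head skb or spans a useful amount
 * of sequence space. */
static int worth_collapsing(const struct range *r,
			    unsigned int payload_per_quantum)
{
	return r->truesize != r->head_truesize ||
	       r->seq_len >= payload_per_quantum;
}

/* Scan ranges; return how many would be collapsed, stopping early once
 * the tiny ranges alone account for more than 1/8 of the receive
 * buffer, i.e. the flow looks pathological. */
static int collapse_queue(const struct range *r, int n, unsigned int rcvbuf)
{
	unsigned int sum_tiny = 0;
	int collapsed = 0;

	for (int i = 0; i < n; i++) {
		if (worth_collapsing(&r[i], SK_MEM_QUANTUM)) {
			collapsed++;	/* would call tcp_collapse() here */
		} else {
			sum_tiny += r[i].truesize;
			if (sum_tiny > rcvbuf >> 3)
				break;	/* malicious pattern: give up early */
		}
	}
	return collapsed;
}
```

A stream of one-byte, high-truesize segments never satisfies
worth_collapsing(), so the scan bails out after a bounded amount of
work instead of walking the whole tree.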
[PATCH stable 4.4 6/9] tcp: avoid collapses in tcp_prune_queue() if possible
From: Eric Dumazet

[ Upstream commit f4a3313d8e2ca9fd8d8f45e40a2903ba782607e7 ]

Right after a TCP flow is created, receiving tiny out of order packets
always hit the condition :

	if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
		tcp_clamp_window(sk);

tcp_clamp_window() increases sk_rcvbuf to match sk_rmem_alloc
(guarded by tcp_rmem[2])

Calling tcp_collapse_ofo_queue() in this case is not useful,
and offers a O(N^2) surface attack to malicious peers.

Better not attempt anything before full queue capacity is reached,
forcing attacker to spend lots of resource and allow us to more
easily detect the abuse.

Signed-off-by: Eric Dumazet
Acked-by: Soheil Hassas Yeganeh
Acked-by: Yuchung Cheng
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
Signed-off-by: root
Signed-off-by: Mao Wenan
---
 net/ipv4/tcp_input.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 32225dc..77130ae 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4948,6 +4948,9 @@ static int tcp_prune_queue(struct sock *sk)
 	else if (tcp_under_memory_pressure(sk))
 		tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U * tp->advmss);
 
+	if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf)
+		return 0;
+
 	tcp_collapse_ofo_queue(sk);
 	if (!skb_queue_empty(&sk->sk_receive_queue))
 		tcp_collapse(sk, &sk->sk_receive_queue, NULL,
-- 
1.8.3.1
Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
On Mon, Aug 06, 2018 at 02:20:37PM -0500, Steve Wise wrote:
> On 8/1/2018 9:27 AM, Max Gurtovoy wrote:
> > On 8/1/2018 8:12 AM, Sagi Grimberg wrote:
> >> Hi Max,
> >
> > Hi,
> >
> >>> Yes, since nvmf is the only user of this function.
> >>> Still waiting for comments on the suggested patch :)
> >>
> >> Sorry for the late response (but I'm on vacation so I have
> >> an excuse ;))
> >
> > NP :) currently the code works..
> >
> >> I'm thinking that we should avoid trying to find an assignment
> >> when stuff like irqbalance daemon is running and changing
> >> the affinitization.
> >
> > but this is exactly what Steve complained and Leon try to fix (and
> > break the connection establishment).
> > If this is the case and we all agree then we're good without Leon's
> > patch and without our suggestions.
>
> I don't agree. Currently setting certain affinity mappings breaks nvme
> connectivity. I don't think that is desirable. And mlx5 is broken in
> that it doesn't allow changing the affinity but silently ignores the
> change, which misleads the admin or irqbalance...

Exactly, I completely agree with Steve and don't understand any
rationale in the comments above.

As a summary from my side: NVMeOF is broken, but we are not going to
fix it, and we prohibit one specific driver from changing affinity on
the fly. Nice.

Thanks
[iproute PATCH] man: ip-route: Clarify referenced versions are Linux ones
Versioning scheme of Linux and iproute2 is similar, therefore the
referenced kernel versions are likely to confuse readers. Clarify this
by prefixing each kernel version with a 'Linux' prefix.

Signed-off-by: Phil Sutter
---
 man/man8/ip-route.8.in | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/man/man8/ip-route.8.in b/man/man8/ip-route.8.in
index b21a847241427..a33ce1f0f4006 100644
--- a/man/man8/ip-route.8.in
+++ b/man/man8/ip-route.8.in
@@ -483,43 +483,43 @@ seconds and ms, msec or msecs to specify milliseconds.
 .TP
-.BI rttvar " TIME " "(2.3.15+ only)"
+.BI rttvar " TIME " "(Linux 2.3.15+ only)"
 the initial RTT variance estimate. Values are specified as with
 .BI rtt
 above.
 .TP
-.BI rto_min " TIME " "(2.6.23+ only)"
+.BI rto_min " TIME " "(Linux 2.6.23+ only)"
 the minimum TCP Retransmission TimeOut to use when communicating with this
 destination. Values are specified as with
 .BI rtt
 above.
 .TP
-.BI ssthresh " NUMBER " "(2.3.15+ only)"
+.BI ssthresh " NUMBER " "(Linux 2.3.15+ only)"
 an estimate for the initial slow start threshold.
 .TP
-.BI cwnd " NUMBER " "(2.3.15+ only)"
+.BI cwnd " NUMBER " "(Linux 2.3.15+ only)"
 the clamp for congestion window. It is ignored if the
 .B lock
 flag is not used.
 .TP
-.BI initcwnd " NUMBER " "(2.5.70+ only)"
+.BI initcwnd " NUMBER " "(Linux 2.5.70+ only)"
 the initial congestion window size for connections to this destination.
 Actual window size is this value multiplied by the MSS
 (``Maximal Segment Size'') for same connection. The default is
 zero, meaning to use the values specified in RFC2414.
 .TP
-.BI initrwnd " NUMBER " "(2.6.33+ only)"
+.BI initrwnd " NUMBER " "(Linux 2.6.33+ only)"
 the initial receive window size for connections to this destination.
 Actual window size is this value multiplied by the MSS of the connection.
 The default value is zero, meaning to use Slow Start value.
 .TP
-.BI features " FEATURES " (3.18+ only)
+.BI features " FEATURES " (Linux 3.18+ only)
 Enable or disable per-route features. Only available feature at this
 time is
 .B ecn
@@ -531,17 +531,17 @@ also be used even if the sysctl is set to 0.
 .TP
-.BI quickack " BOOL " "(3.11+ only)"
+.BI quickack " BOOL " "(Linux 3.11+ only)"
 Enable or disable quick ack for connections to this destination.
 .TP
-.BI fastopen_no_cookie " BOOL " "(4.15+ only)"
+.BI fastopen_no_cookie " BOOL " "(Linux 4.15+ only)"
 Enable TCP Fastopen without a cookie for connections to this destination.
 .TP
-.BI congctl " NAME " "(3.20+ only)"
+.BI congctl " NAME " "(Linux 3.20+ only)"
 .TP
-.BI "congctl lock" " NAME " "(3.20+ only)"
+.BI "congctl lock" " NAME " "(Linux 3.20+ only)"
 Sets a specific TCP congestion control algorithm only for a given
 destination. If not specified, Linux keeps the current global default
 TCP congestion control algorithm, or the one set from the application.
 If the modifier
@@ -554,14 +554,14 @@ control algorithm for that destination, thus it will be
 enforced/guaranteed to use the proposed algorithm.
 .TP
-.BI advmss " NUMBER " "(2.3.15+ only)"
+.BI advmss " NUMBER " "(Linux 2.3.15+ only)"
 the MSS ('Maximal Segment Size') to advertise to these
 destinations when establishing TCP connections. If it is not given,
 Linux uses a default value calculated from the first hop device MTU.
 (If the path to these destination is asymmetric, this guess may be wrong.)
 .TP
-.BI reordering " NUMBER " "(2.3.15+ only)"
+.BI reordering " NUMBER " "(Linux 2.3.15+ only)"
 Maximal reordering on the path to this destination.
 If it is not given, Linux uses the value selected with
 .B sysctl
@@ -782,7 +782,7 @@ is a set of encapsulation attributes specific to the
 .IR SEG6_ACTION " [ "
 .IR SEG6_ACTION_PARAM " ] "
 - Operation to perform on matching packets.
-The following actions are currently supported (\fB4.14+ only\fR).
+The following actions are currently supported (\fBLinux 4.14+ only\fR).
 .in +2
 .B End
@@ -830,7 +830,7 @@ address is set as described in \fBip-sr\fR(8).
 .in -8
 .TP
-.BI expires " TIME " "(4.4+ only)"
+.BI expires " TIME " "(Linux 4.4+ only)"
 the route will be deleted after the expires time.
 .B Only support IPv6 at present.
-- 
2.18.0
[iproute PATCH 1/4] tc: Fix typo in check for colored output
The check used binary instead of boolean AND, which means colored
output was enabled only if the number of specified '-color' flags was
odd.

Fixes: 2d165c0811058 ("tc: implement color output")
Signed-off-by: Phil Sutter
---
 tc/tc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tc/tc.c b/tc/tc.c
index 3bb5910ffac52..3bb893756f40e 100644
--- a/tc/tc.c
+++ b/tc/tc.c
@@ -515,7 +515,7 @@ int main(int argc, char **argv)
 	_SL_ = oneline ? "\\" : "\n";
 
-	if (color & !json)
+	if (color && !json)
 		enable_color();
 
 	if (batch_file)
-- 
2.18.0
[iproute PATCH 4/4] lib: Enable colored output only for TTYs
Add an additional prerequisite to check_enable_color() to make sure
stdout actually points to an open TTY device. Otherwise calls like

| ip -color a s >/tmp/foo

will print color escape sequences into that file.

Signed-off-by: Phil Sutter
---
 lib/color.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/lib/color.c b/lib/color.c
index edf96e5c6ecd7..500ba09682697 100644
--- a/lib/color.c
+++ b/lib/color.c
@@ -3,6 +3,7 @@
 #include 
 #include 
 #include 
+#include <unistd.h>
 #include 
 #include 
 #include 
@@ -79,7 +80,7 @@ void enable_color(void)
 int check_enable_color(int color, int json)
 {
-	if (color && !json) {
+	if (color && !json && isatty(fileno(stdout))) {
 		enable_color();
 		return 0;
 	}
-- 
2.18.0
[iproute PATCH 2/4] bridge: Fix check for colored output
There is no point in conditionally calling enable_color() if it has
already been called each time a '-color' flag was parsed. Align the
algorithm with that in ip and tc by actually making use of the 'color'
variable.

Fixes: e9625d6aead11 ("Merge branch 'iproute2-master' into iproute2-next")
Signed-off-by: Phil Sutter
---
 bridge/bridge.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/bridge/bridge.c b/bridge/bridge.c
index 7fcfe1116f6e5..289a157d37f03 100644
--- a/bridge/bridge.c
+++ b/bridge/bridge.c
@@ -174,7 +174,7 @@ main(int argc, char **argv)
 			if (netns_switch(argv[1]))
 				exit(-1);
 		} else if (matches(opt, "-color") == 0) {
-			enable_color();
+			++color;
 		} else if (matches(opt, "-compressvlans") == 0) {
 			++compress_vlans;
 		} else if (matches(opt, "-force") == 0) {
-- 
2.18.0
[iproute PATCH 0/4] A bunch of fixes regarding colored output
This series contains fixes for conditionally colored output in patches
1 and 2. Patch 3 merges the common conditionals from ip, tc and bridge
tools. Patch 4 then adds a further restriction to colored output to
prevent garbled output when redirecting into a file.

Phil Sutter (4):
  tc: Fix typo in check for colored output
  bridge: Fix check for colored output
  Merge common code for conditionally colored output
  lib: Enable colored output only for TTYs

 bridge/bridge.c |  5 ++---
 include/color.h |  1 +
 ip/ip.c         |  3 +--
 lib/color.c     | 10 ++++++++++
 tc/tc.c         |  3 +--
 5 files changed, 15 insertions(+), 7 deletions(-)

-- 
2.18.0
[iproute PATCH 3/4] Merge common code for conditionally colored output
Instead of calling enable_color() conditionally with an identical check
in three places, introduce check_enable_color() which does it in one
place.

Signed-off-by: Phil Sutter
---
 bridge/bridge.c | 3 +--
 include/color.h | 1 +
 ip/ip.c         | 3 +--
 lib/color.c     | 9 +++++++++
 tc/tc.c         | 3 +--
 5 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/bridge/bridge.c b/bridge/bridge.c
index 289a157d37f03..451d684e0bcfd 100644
--- a/bridge/bridge.c
+++ b/bridge/bridge.c
@@ -200,8 +200,7 @@ main(int argc, char **argv)
 	_SL_ = oneline ? "\\" : "\n";
 
-	if (color && !json)
-		enable_color();
+	check_enable_color(color, json);
 
 	if (batch_file)
 		return batch(batch_file);
diff --git a/include/color.h b/include/color.h
index c80359d3e2e95..4f2c918db7e43 100644
--- a/include/color.h
+++ b/include/color.h
@@ -13,6 +13,7 @@ enum color_attr {
 };
 
 void enable_color(void);
+int check_enable_color(int color, int json);
 void set_color_palette(void);
 int color_fprintf(FILE *fp, enum color_attr attr, const char *fmt, ...);
 enum color_attr ifa_family_color(__u8 ifa_family);
diff --git a/ip/ip.c b/ip/ip.c
index 71d5170c0cc23..38eac5ec1e17d 100644
--- a/ip/ip.c
+++ b/ip/ip.c
@@ -304,8 +304,7 @@ int main(int argc, char **argv)
 	_SL_ = oneline ? "\\" : "\n";
 
-	if (color && !json)
-		enable_color();
+	check_enable_color(color, json);
 
 	if (batch_file)
 		return batch(batch_file);
diff --git a/lib/color.c b/lib/color.c
index da1f516cb2492..edf96e5c6ecd7 100644
--- a/lib/color.c
+++ b/lib/color.c
@@ -77,6 +77,15 @@ void enable_color(void)
 	set_color_palette();
 }
 
+int check_enable_color(int color, int json)
+{
+	if (color && !json) {
+		enable_color();
+		return 0;
+	}
+	return 1;
+}
+
 void set_color_palette(void)
 {
 	char *p = getenv("COLORFGBG");
diff --git a/tc/tc.c b/tc/tc.c
index 3bb893756f40e..e775550174473 100644
--- a/tc/tc.c
+++ b/tc/tc.c
@@ -515,8 +515,7 @@ int main(int argc, char **argv)
 	_SL_ = oneline ? "\\" : "\n";
 
-	if (color && !json)
-		enable_color();
+	check_enable_color(color, json);
 
 	if (batch_file)
 		return batch(batch_file);
-- 
2.18.0
[PATCH v3 net-next] veth: Free queues on link delete
David Ahern reported memory leak in veth.

===
$ cat /sys/kernel/debug/kmemleak
unreferenced object 0x8800354d5c00 (size 1024):
  comm "ip", pid 836, jiffies 4294722952 (age 25.904s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  backtrace:
    [<(ptrval)>] kmemleak_alloc+0x70/0x94
    [<(ptrval)>] slab_post_alloc_hook+0x42/0x52
    [<(ptrval)>] __kmalloc+0x101/0x142
    [<(ptrval)>] kmalloc_array.constprop.20+0x1e/0x26 [veth]
    [<(ptrval)>] veth_newlink+0x147/0x3ac [veth]
    ...
unreferenced object 0x88002e009c00 (size 1024):
  comm "ip", pid 836, jiffies 4294722958 (age 25.898s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  backtrace:
    [<(ptrval)>] kmemleak_alloc+0x70/0x94
    [<(ptrval)>] slab_post_alloc_hook+0x42/0x52
    [<(ptrval)>] __kmalloc+0x101/0x142
    [<(ptrval)>] kmalloc_array.constprop.20+0x1e/0x26 [veth]
    [<(ptrval)>] veth_newlink+0x219/0x3ac [veth]
===

veth_rq allocated in veth_newlink() was not freed on dellink.

We need to free them up after veth_close() so that any packets will
not reference the queues afterwards. Thus free them in veth_dev_free()
in the same way as freeing stats structure (vstats).

Also move queues allocation to veth_dev_init() to be in line with
stats allocation.

Fixes: 638264dc90227 ("veth: Support per queue XDP ring")
Reported-by: David Ahern
Signed-off-by: Toshiaki Makita
---
This is a fix for a bug which exists only in net-next.
Let me know if I should wait for -next merging into net or reopen of
-next.
 drivers/net/veth.c | 70 +++++++++++++++++++++++++++++++---------------------------------------
 1 file changed, 33 insertions(+), 37 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index e3202af..8d679c8 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -789,16 +789,48 @@ static int is_valid_veth_mtu(int mtu)
 	return mtu >= ETH_MIN_MTU && mtu <= ETH_MAX_MTU;
 }
 
+static int veth_alloc_queues(struct net_device *dev)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+	int i;
+
+	priv->rq = kcalloc(dev->num_rx_queues, sizeof(*priv->rq), GFP_KERNEL);
+	if (!priv->rq)
+		return -ENOMEM;
+
+	for (i = 0; i < dev->num_rx_queues; i++)
+		priv->rq[i].dev = dev;
+
+	return 0;
+}
+
+static void veth_free_queues(struct net_device *dev)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+
+	kfree(priv->rq);
+}
+
 static int veth_dev_init(struct net_device *dev)
 {
+	int err;
+
 	dev->vstats = netdev_alloc_pcpu_stats(struct pcpu_vstats);
 	if (!dev->vstats)
 		return -ENOMEM;
+
+	err = veth_alloc_queues(dev);
+	if (err) {
+		free_percpu(dev->vstats);
+		return err;
+	}
+
 	return 0;
 }
 
 static void veth_dev_free(struct net_device *dev)
 {
+	veth_free_queues(dev);
 	free_percpu(dev->vstats);
 }
 
@@ -1040,31 +1072,13 @@ static int veth_validate(struct nlattr *tb[], struct nlattr *data[],
 	return 0;
 }
 
-static int veth_alloc_queues(struct net_device *dev)
-{
-	struct veth_priv *priv = netdev_priv(dev);
-
-	priv->rq = kcalloc(dev->num_rx_queues, sizeof(*priv->rq), GFP_KERNEL);
-	if (!priv->rq)
-		return -ENOMEM;
-
-	return 0;
-}
-
-static void veth_free_queues(struct net_device *dev)
-{
-	struct veth_priv *priv = netdev_priv(dev);
-
-	kfree(priv->rq);
-}
-
 static struct rtnl_link_ops veth_link_ops;
 
 static int veth_newlink(struct net *src_net, struct net_device *dev,
 			struct nlattr *tb[], struct nlattr *data[],
 			struct netlink_ext_ack *extack)
 {
-	int err, i;
+	int err;
 	struct net_device *peer;
 	struct veth_priv *priv;
 	char ifname[IFNAMSIZ];
@@ -1117,12 +1131,6 @@ static int veth_newlink(struct net *src_net, struct net_device *dev,
 		return PTR_ERR(peer);
 	}
 
-	err = veth_alloc_queues(peer);
-	if (err) {
-		put_net(net);
-		goto err_peer_alloc_queues;
-	}
-
 	if (!ifmp || !tbp[IFLA_ADDRESS])
 		eth_hw_addr_random(peer);
 
@@ -1151,10 +1159,6 @@ static int veth_newlink(struct net *src_net, struct net_device *dev,
 	 * should be re-allocated
 	 */
 
-	err = veth_alloc_queues(dev);
-	if (err)
-		goto err_alloc_queues;
-
 	if (tb[IFLA_ADDRESS] == NULL)
 		eth_hw_addr_random(dev);
 
@@ -1174,28 +1178,20 @@ static int veth_newlink(struct net
Re: [PATCH RFC net-next] openvswitch: Queue upcalls to userspace in per-port round-robin order
Hi Stefano

On Tue, Aug 7, 2018 at 6:31 AM, Stefano Brivio wrote:
> Hi Pravin,
>
> On Tue, 31 Jul 2018 16:12:03 -0700
> Pravin Shelar wrote:
>
>> Rather than reducing number of thread down to 1, we could find better
>> number of FDs per port.
>> How about this simple solution:
>> 1. Allocate (N * P) FDs as long as it is under FD limit.
>> 2. If FD limit (-EMFILE) is hit reduce N value by half and repeat step 1.
>
> I still see a few quantitative issues with this approach, other than
> Ben's observation about design (which, by the way, looks entirely
> reasonable to me).
>
> We're talking about a disproportionate amount of sockets in any case.
> We can have up to 2^16 vports here, with 5k vports being rather common,
> and for any reasonable value of N that manages somehow to perturbate the
> distribution of upcalls per thread, we are talking about something well
> in excess of 100k sockets. I think this doesn't really scale.

My argument is not about the proposed fairness algorithm. It is about
the cost of the fairness, and I do not see that addressed in any of the
follow-ups. You seem to be worried about memory cost and fairness
aspects; I am worried about the CPU cost of the solution.

I think the proposed solution solves the fairness issue, but it also
creates a bottleneck in upcall processing. OVS is known to have slower
upcall processing. This patch adds even more cost to the upcall
handling, and the latency of first-packet handling also goes up with
this approach.

I revisited the original patch; here is what I see in terms of added
cost to existing upcall processing:
1. One "kzalloc(sizeof(*upcall), GFP_ATOMIC);". This involves
   allocating and initializing memory.
2. Copy of the flow key, which is more than 1 KB (upcall->key = *key).
3. Acquiring the spin_lock_bh dp->upcalls.lock, which disables bottom
   half processing on the CPU while waiting for the global lock.
4. Iterating the list of queued upcalls; one objective is to avoid
   out-of-order packets. But I do not see the point of ordering packets
   from different streams.
5. Signaling the upcall thread after a delay, ovs_dp_upcall_delay().
   This adds further to the latency.
6. The upcall is then handed over to a different thread (context
   switch), likely on a different CPU.
8. The upcall object is freed on the remote CPU.
9. The single lock essentially means OVS kernel datapath upcall
   processing is single threaded, no matter the number of cores in the
   system.

I would be interested in how we are going to address these issues. In
the example you were talking about, the netlink fd issue on a server
with 48 cores, how does this solution work when there are 5K ports
each triggering upcalls?

Can you benchmark your patch? Do you have performance numbers for
TCP_CRR with and without this patch? Also publish latency numbers for
this patch. Please turn off megaflow to exercise upcall handling.

I understand fairness has a cost, but we need to find the right
balance between performance and fairness. The current fairness scheme
is a lockless algorithm without much computational overhead; did you
try to improve the current algorithm so that it uses a smaller number
of ports?

> With the current value for N (3/4 * number of threads) this can even get
> close to /proc/sys/fs/file-max on some systems, and there raising the
> number of allowed file descriptors for ovs-vswitchd isn't a solution
> anymore.
>
> I would instead try to address the concerns that you had about the
> original patch adding fairness in the kernel, rather than trying to
> make the issue appear less severe in ovs-vswitchd.
>
> --
> Stefano