Re: KASAN: null-ptr-deref Read in rds_ib_get_mr
On 2018/5/12 0:58, Santosh Shilimkar wrote:
> On 5/11/2018 12:48 AM, Yanjun Zhu wrote:
>> On 2018/5/11 13:20, DaeRyong Jeong wrote:
>>> We report the crash: KASAN: null-ptr-deref Read in rds_ib_get_mr
>>>
>>> Note that this bug was previously reported by syzkaller.
>>> https://syzkaller.appspot.com/bug?id=0bb56a5a48b000b52aa2b0d8dd20b1f545214d91
>>> Nonetheless, this bug has not been fixed yet, and we hope that this
>>> report and our analysis, which was helped by RaceFuzzer's features,
>>> will be helpful in fixing the crash.
>>>
>>> This crash was found in v4.17-rc1 using RaceFuzzer (a modified
>>> version of Syzkaller), which we describe more at the end of this
>>> report. Our analysis shows that the race occurs when invoking two
>>> syscalls concurrently, bind$rds and setsockopt$RDS_GET_MR.
>>>
>>> Analysis:
>>> We think the concurrent execution of __rds_rdma_map() and rds_bind()
>>> causes the problem. __rds_rdma_map() checks whether rs->rs_bound_addr
>>> is 0 or not, but concurrent execution with rds_bind() can bypass this
>>> check. __rds_rdma_map() then calls rs->rs_transport->get_mr(), and
>>> rds_ib_get_mr() hits the NULL dereference at ib_rdma.c:544
>>> (v4.17-rc1) when dereferencing rs_conn.
>>>
>>> Thread interleaving:
>>> CPU0 (__rds_rdma_map)                  CPU1 (rds_bind)
>>>                                        /* rds_add_bound() sets rs->rs_bound_addr to non-zero */
>>>                                        ret = rds_add_bound(rs, sin->sin_addr.s_addr,
>>>                                                            &sin->sin_port);
>>> if (rs->rs_bound_addr == 0 ||
>>>     !rs->rs_transport) {
>>>         ret = -ENOTCONN; /* XXX not a great errno */
>>>         goto out;
>>> }
>>>                                        if (rs->rs_transport) { /* previously bound */
>>>                                                trans = rs->rs_transport;
>>>                                                if (trans->laddr_check(sock_net(sock->sk),
>>>                                                                       sin->sin_addr.s_addr) != 0) {
>>>                                                        ret = -ENOPROTOOPT;
>>>                                                        /* rds_remove_bound() sets rs->rs_bound_addr to 0 */
>>>                                                        rds_remove_bound(rs);
>>>                                        ...
>>> trans_private = rs->rs_transport->get_mr(sg, nents, rs, &mr->r_key);
>>> /* in rds_ib_get_mr() */
>>> struct rds_ib_connection *ic = rs->rs_conn->c_transport_data;
>>>
>>> Call sequence (v4.17-rc1):
>>> CPU0
>>>   rds_setsockopt
>>>     rds_get_mr
>>>       __rds_rdma_map
>>>         rds_ib_get_mr
>>>
>>> CPU1
>>>   rds_bind
>>>     rds_add_bound
>>>     ...
>>>     rds_remove_bound
>>>
>>> Crash log:
>>> ==================================================================
>>> BUG: KASAN: null-ptr-deref in rds_ib_get_mr+0x3a/0x150 net/rds/ib_rdma.c:544
>>> Read of size 8 at addr 0068 by task syz-executor0/32067
>>> CPU: 0 PID: 32067 Comm: syz-executor0 Not tainted 4.17.0-rc1 #1
>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
>>> Call Trace:
>>>  __dump_stack lib/dump_stack.c:77 [inline]
>>>  dump_stack+0x166/0x21c lib/dump_stack.c:113
>>>  kasan_report_error mm/kasan/report.c:352 [inline]
>>>  kasan_report+0x140/0x360 mm/kasan/report.c:412
>>>  check_memory_region_inline mm/kasan/kasan.c:260 [inline]
>>>  __asan_load8+0x54/0x90 mm/kasan/kasan.c:699
>>>  rds_ib_get_mr+0x3a/0x150 net/rds/ib_rdma.c:544
>>>  __rds_rdma_map+0x521/0x9d0 net/rds/rdma.c:271
>>>  rds_get_mr+0xad/0xf0 net/rds/rdma.c:333
>>>  rds_setsockopt+0x57f/0x720 net/rds/af_rds.c:347
>>>  __sys_setsockopt+0x147/0x230 net/socket.c:1903
>>>  __do_sys_setsockopt net/socket.c:1914 [inline]
>>>  __se_sys_setsockopt net/socket.c:1911 [inline]
>>>  __x64_sys_setsockopt+0x67/0x80 net/socket.c:1911
>>>  do_syscall_64+0x15f/0x4a0 arch/x86/entry/common.c:287
>>>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>> RIP: 0033:0x4563f9
>>> RSP: 002b:7f6a2b3c2b28 EFLAGS: 0246 ORIG_RAX: 0036
>>> RAX: ffda RBX: 0072bee0 RCX: 004563f9
>>> RDX: 0002 RSI: 0114 RDI: 0015
>>> RBP: 0575 R08: 0020 R09:
>>> R10: 2140 R11: 0246 R12: 7f6a2b3c36d4
>>> R13: R14: 006fd398 R15:
>>> ==================================================================
>>
>> diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
>> index e678699..2228b50 100644
>> --- a/net/rds/ib_rdma.c
>> +++ b/net/rds/ib_rdma.c
>> @@ -539,11 +539,17 @@ void rds_ib_flush_mrs(void)
>>  void *rds_ib_get_mr(struct scatterlist *sg, unsigned long nents,
>>                      struct rds_sock *rs, u32 *key_ret)
>>  {
>> -	struct rds_ib_device *rds_ibdev;
>> +	struct rds_ib_device *rds_ibdev = NULL;
>>  	struct rds_ib_mr *ibmr = NULL;
>> -	struct rds_ib_connection *ic = rs->rs_conn->c_transport_data;
>> +	struct rds_ib_connection *ic = NULL;
>>  	int ret;
>>
>> +	if (rs->rs_bound_addr == 0) {
>> +		ret = -EPERM;
>> +		goto out;
>> +	}
>> +
> No, you can't return such an error for this API, and the socket-related
> checks need to be done at the core layer. I remember fixing this race
> but probably never pushed the fix upstream.

OK. Wait for your patch. :-)

> The MR code is due for update with optim
Re: [rds-devel] KASAN: null-ptr-deref Read in rds_ib_get_mr
On 2018/5/11 18:46, Sowmini Varadhan wrote:
> On (05/11/18 15:48), Yanjun Zhu wrote:
>> diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
>> index e678699..2228b50 100644
>> --- a/net/rds/ib_rdma.c
>> +++ b/net/rds/ib_rdma.c
>> @@ -539,11 +539,17 @@ void rds_ib_flush_mrs(void)
>>  void *rds_ib_get_mr(struct scatterlist *sg, unsigned long nents,
>>                      struct rds_sock *rs, u32 *key_ret)
>>  {
>> -	struct rds_ib_device *rds_ibdev;
>> +	struct rds_ib_device *rds_ibdev = NULL;
>>  	struct rds_ib_mr *ibmr = NULL;
>> -	struct rds_ib_connection *ic = rs->rs_conn->c_transport_data;
>> +	struct rds_ib_connection *ic = NULL;
>>  	int ret;
>>
>> +	if (rs->rs_bound_addr == 0) {
>> +		ret = -EPERM;
>> +		goto out;
>> +	}
>> +
>> +	ic = rs->rs_conn->c_transport_data;
>>  	rds_ibdev = rds_ib_get_device(rs->rs_bound_addr);
>>  	if (!rds_ibdev) {
>>  		ret = -ENODEV;
>>
>> I made this raw patch. If you can reproduce this bug, please make
>> tests with it.
>
> I don't think this solves the problem; I think it just changes the
> timing under which it can still happen. What if the rds_remove_bound()
> in rds_bind() happens after the check for (rs->rs_bound_addr == 0)
> added above by the patch? I believe you need some type of
> synchronization (either through a mutex, or some atomic flag in the rs,
> or similar) to make sure rds_bind() and rds_ib_get_mr() are mutually
> exclusive.
>
> --Sowmini

Sure, I agree with you. Maybe a mutex is a good choice.

Zhu Yanjun
Re: KASAN: null-ptr-deref Read in rds_ib_get_mr
On 2018/5/11 13:20, DaeRyong Jeong wrote:
> We report the crash: KASAN: null-ptr-deref Read in rds_ib_get_mr
>
> Note that this bug was previously reported by syzkaller.
> https://syzkaller.appspot.com/bug?id=0bb56a5a48b000b52aa2b0d8dd20b1f545214d91
> Nonetheless, this bug has not been fixed yet, and we hope that this
> report and our analysis, which was helped by RaceFuzzer's features,
> will be helpful in fixing the crash.
>
> This crash was found in v4.17-rc1 using RaceFuzzer (a modified version
> of Syzkaller), which we describe more at the end of this report. Our
> analysis shows that the race occurs when invoking two syscalls
> concurrently, bind$rds and setsockopt$RDS_GET_MR.
>
> Analysis:
> We think the concurrent execution of __rds_rdma_map() and rds_bind()
> causes the problem. __rds_rdma_map() checks whether rs->rs_bound_addr
> is 0 or not, but concurrent execution with rds_bind() can bypass this
> check. __rds_rdma_map() then calls rs->rs_transport->get_mr(), and
> rds_ib_get_mr() hits the NULL dereference at ib_rdma.c:544 (v4.17-rc1)
> when dereferencing rs_conn.
>
> Thread interleaving:
> CPU0 (__rds_rdma_map)                  CPU1 (rds_bind)
>                                        /* rds_add_bound() sets rs->rs_bound_addr to non-zero */
>                                        ret = rds_add_bound(rs, sin->sin_addr.s_addr,
>                                                            &sin->sin_port);
> if (rs->rs_bound_addr == 0 ||
>     !rs->rs_transport) {
>         ret = -ENOTCONN; /* XXX not a great errno */
>         goto out;
> }
>                                        if (rs->rs_transport) { /* previously bound */
>                                                trans = rs->rs_transport;
>                                                if (trans->laddr_check(sock_net(sock->sk),
>                                                                       sin->sin_addr.s_addr) != 0) {
>                                                        ret = -ENOPROTOOPT;
>                                                        /* rds_remove_bound() sets rs->rs_bound_addr to 0 */
>                                                        rds_remove_bound(rs);
>                                        ...
> trans_private = rs->rs_transport->get_mr(sg, nents, rs, &mr->r_key);
> /* in rds_ib_get_mr() */
> struct rds_ib_connection *ic = rs->rs_conn->c_transport_data;
>
> Call sequence (v4.17-rc1):
> CPU0
>   rds_setsockopt
>     rds_get_mr
>       __rds_rdma_map
>         rds_ib_get_mr
>
> CPU1
>   rds_bind
>     rds_add_bound
>     ...
>     rds_remove_bound
>
> Crash log:
> ==================================================================
> BUG: KASAN: null-ptr-deref in rds_ib_get_mr+0x3a/0x150 net/rds/ib_rdma.c:544
> Read of size 8 at addr 0068 by task syz-executor0/32067
> CPU: 0 PID: 32067 Comm: syz-executor0 Not tainted 4.17.0-rc1 #1
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x166/0x21c lib/dump_stack.c:113
>  kasan_report_error mm/kasan/report.c:352 [inline]
>  kasan_report+0x140/0x360 mm/kasan/report.c:412
>  check_memory_region_inline mm/kasan/kasan.c:260 [inline]
>  __asan_load8+0x54/0x90 mm/kasan/kasan.c:699
>  rds_ib_get_mr+0x3a/0x150 net/rds/ib_rdma.c:544
>  __rds_rdma_map+0x521/0x9d0 net/rds/rdma.c:271
>  rds_get_mr+0xad/0xf0 net/rds/rdma.c:333
>  rds_setsockopt+0x57f/0x720 net/rds/af_rds.c:347
>  __sys_setsockopt+0x147/0x230 net/socket.c:1903
>  __do_sys_setsockopt net/socket.c:1914 [inline]
>  __se_sys_setsockopt net/socket.c:1911 [inline]
>  __x64_sys_setsockopt+0x67/0x80 net/socket.c:1911
>  do_syscall_64+0x15f/0x4a0 arch/x86/entry/common.c:287
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x4563f9
> RSP: 002b:7f6a2b3c2b28 EFLAGS: 0246 ORIG_RAX: 0036
> RAX: ffda RBX: 0072bee0 RCX: 004563f9
> RDX: 0002 RSI: 0114 RDI: 0015
> RBP: 0575 R08: 0020 R09:
> R10: 2140 R11: 0246 R12: 7f6a2b3c36d4
> R13: R14: 006fd398 R15:
> ==================================================================

diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index e678699..2228b50 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -539,11 +539,17 @@ void rds_ib_flush_mrs(void)
 void *rds_ib_get_mr(struct scatterlist *sg, unsigned long nents,
                     struct rds_sock *rs, u32 *key_ret)
 {
-	struct rds_ib_device *rds_ibdev;
+	struct rds_ib_device *rds_ibdev = NULL;
 	struct rds_ib_mr *ibmr = NULL;
-	struct rds_ib_connection *ic = rs->rs_conn->c_transport_data;
+	struct rds_ib_connection *ic = NULL;
 	int ret;

+	if (rs->rs_bound_addr == 0) {
+		ret = -EPERM;
+		goto out;
+	}
+
+
Re: [PATCH] mlx4_core: allocate 4KB ICM chunks
On 2018/5/11 7:31, Qing Huang wrote:
> When a system is under memory pressure (high usage with fragmentation),
> the original 256KB ICM chunk allocations will likely trigger kernel
> memory management to enter the slow path, doing memory
> compaction/migration ops in order to complete high-order memory
> allocations. When that happens, user processes calling uverbs APIs may
> easily get stuck for more than 120s, even though there are a lot of
> free pages in smaller chunks available in the system.
>
> Syslog:
> ...
> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
> oracle_205573_e:205573 blocked for more than 120 seconds.
> ...
>
> With a 4KB ICM chunk size, the above issue is fixed.
>
> However, in order to support a 4KB ICM chunk size, we need to fix
> another issue in large kcalloc allocations. E.g. setting log_num_mtt=30
> requires 1G MTT entries. With the 4KB ICM chunk size, each ICM chunk
> can only hold 512 MTT entries (8 bytes for each MTT entry). So we need
> a 16MB allocation for the table->icm pointer array to hold 2M pointers,
> which can easily cause kcalloc to fail.
>
> The solution is to use vzalloc to replace kcalloc. There is no need for
> contiguous memory pages for a driver metadata structure (no need for
> DMA ops).

Hi,

Replacing contiguous memory pages with virtual memory: is there any performance loss?

Zhu Yanjun

> Signed-off-by: Qing Huang
> Acked-by: Daniel Jurgens
> ---
>  drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
> index a822f7a..2b17a4b 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/icm.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
> @@ -43,12 +43,12 @@
>  #include "fw.h"
>
>  /*
> - * We allocate in as big chunks as we can, up to a maximum of 256 KB
> - * per chunk.
> + * We allocate in 4KB page size chunks to avoid high order memory
> + * allocations in fragmented/high usage memory situation.
>   */
>  enum {
> -	MLX4_ICM_ALLOC_SIZE	= 1 << 18,
> -	MLX4_TABLE_CHUNK_SIZE	= 1 << 18
> +	MLX4_ICM_ALLOC_SIZE	= 1 << 12,
> +	MLX4_TABLE_CHUNK_SIZE	= 1 << 12
>  };
>
>  static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
> @@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
>  	obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
>  	num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;
>
> -	table->icm = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
> +	table->icm = vzalloc(num_icm * sizeof(*table->icm));
>  	if (!table->icm)
>  		return -ENOMEM;
>  	table->virt = virt;
> @@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
>  		mlx4_free_icm(dev, table->icm[i], use_coherent);
>  	}
>
> -	kfree(table->icm);
> +	vfree(table->icm);
>
>  	return -ENOMEM;
>  }
>
> @@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table)
>  		mlx4_free_icm(dev, table->icm[i], table->coherent);
>  	}
>
> -	kfree(table->icm);
> +	vfree(table->icm);
>  }
Re: [PATCHv2 1/1] IB/rxe: avoid double kfree_skb
Add netdev@vger.kernel.org

On 2018/4/26 12:41, Zhu Yanjun wrote:
> When a skb is sent, it passes through the following functions in soft
> RoCE:
>
>   rxe_send [rdma_rxe]
>     ip_local_out
>       __ip_local_out
>       ip_output
>         ip_finish_output
>           ip_finish_output2
>             dev_queue_xmit
>               __dev_queue_xmit
>                 dev_hard_start_xmit
>
> If an error occurs in the above functions, or iptables rules drop the
> skb after ip_local_out, kfree_skb is called there. So it is not
> necessary to call kfree_skb in the soft RoCE module again; otherwise a
> crash will occur.
>
> The steps to reproduce:
>
>      server                       client
>   ---------                    ---------
>   |1.1.1.1|<----rxe-channel--->|1.1.1.2|
>   ---------                    ---------
>
> On server: rping -s -a 1.1.1.1 -v -C 1 -S 512
> On client: rping -c -a 1.1.1.1 -v -C 1 -S 512
>
> The kernel configs CONFIG_DEBUG_KMEMLEAK and CONFIG_DEBUG_OBJECTS are
> enabled on both server and client. When rping runs, run the following
> command on the server:
>
>   iptables -I OUTPUT -p udp --dport 4791 -j DROP
>
> Without this patch, a crash will occur.
>
> CC: Srinivas Eeda
> CC: Junxiao Bi
> Signed-off-by: Zhu Yanjun
> Reviewed-by: Yuval Shaia
> ---
> V1->V2: Not only the dropped skb but also the error skb is freed by
> the lower layers, so in soft RoCE it is not necessary to call
> kfree_skb again.
> ---
>  drivers/infiniband/sw/rxe/rxe_req.c  | 1 -
>  drivers/infiniband/sw/rxe/rxe_resp.c | 6 +-
>  2 files changed, 1 insertion(+), 6 deletions(-)
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_req.c b/drivers/infiniband/sw/rxe/rxe_req.c
> index 7bdaf71..7851999 100644
> --- a/drivers/infiniband/sw/rxe/rxe_req.c
> +++ b/drivers/infiniband/sw/rxe/rxe_req.c
> @@ -728,7 +728,6 @@ int rxe_requester(void *arg)
>  		rollback_state(wqe, qp, &rollback_wqe, rollback_psn);
>
>  		if (ret == -EAGAIN) {
> -			kfree_skb(skb);
>  			rxe_run_task(&qp->req.task, 1);
>  			goto exit;
>  		}
> diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
> index a65c996..955ff3b 100644
> --- a/drivers/infiniband/sw/rxe/rxe_resp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_resp.c
> @@ -742,7 +742,6 @@ static enum resp_states read_reply(struct rxe_qp *qp,
>  	err = rxe_xmit_packet(rxe, qp, &ack_pkt, skb);
>  	if (err) {
>  		pr_err("Failed sending RDMA reply.\n");
> -		kfree_skb(skb);
>  		return RESPST_ERR_RNR;
>  	}
>
> @@ -954,10 +953,8 @@ static int send_ack(struct rxe_qp *qp, struct rxe_pkt_info *pkt,
>  	}
>
>  	err = rxe_xmit_packet(rxe, qp, &ack_pkt, skb);
> -	if (err) {
> +	if (err)
>  		pr_err_ratelimited("Failed sending ack\n");
> -		kfree_skb(skb);
> -	}
>
>  err1:
>  	return err;
> @@ -1141,7 +1138,6 @@ static enum resp_states duplicate_request(struct rxe_qp *qp,
>  		if (rc) {
>  			pr_err("Failed resending result. This flow is not handled - skb ignored\n");
>  			rxe_drop_ref(qp);
> -			kfree_skb(skb_copy);
>  			rc = RESPST_CLEANUP;
>  			goto out;
>  		}
Re: [PATCH 1/1] IB/rxe: avoid double kfree_skb
Hi, all

  rxe_send [rdma_rxe]
    ip_local_out
      __ip_local_out
      ip_output
        ip_finish_output
          ip_finish_output2
            dev_queue_xmit
              __dev_queue_xmit
                dev_hard_start_xmit
                  e1000_xmit_frame [e1000]

When a skb is sent, it passes through the above functions. I checked all of them: if an error occurs in any of them after ip_local_out, kfree_skb is called. So when ip_local_out returns an error, the skb has already been freed, and it is not necessary to call kfree_skb in the soft RoCE module again. If I am wrong, please correct me.

Zhu Yanjun

On 2018/4/24 16:34, Yanjun Zhu wrote:
> Hi, all
>
>   rxe_send
>     ip_local_out
>       __ip_local_out
>         nf_hook_slow
>
> In the above call chain, nf_hook_slow drops and frees the skb, and
> -EPERM is returned, when the iptables rule "iptables -I OUTPUT -p udp
> --dport 4791 -j DROP" is set. Since skb->users is not changed in soft
> RoCE, kfree_skb should not be called in this module. I will make
> further investigations about the other error handlers after
> ip_local_out. If I am wrong, please correct me. Any reply is
> appreciated.
>
> Zhu Yanjun
>
> On 2018/4/20 13:46, Yanjun Zhu wrote:
>> On 2018/4/20 10:19, Doug Ledford wrote:
>>> On Thu, 2018-04-19 at 10:01 -0400, Zhu Yanjun wrote:
>>>> When the skb is dropped by iptables rules, the skb is freed and at
>>>> the same time -EPERM is returned. So in soft RoCE it is not
>>>> necessary to free the skb again; otherwise a crash will occur.
>>>>
>>>> The steps to reproduce:
>>>>
>>>>      server                       client
>>>>   ---------                    ---------
>>>>   |1.1.1.1|<----rxe-channel--->|1.1.1.2|
>>>>   ---------                    ---------
>>>>
>>>> On server: rping -s -a 1.1.1.1 -v -C 1 -S 512
>>>> On client: rping -c -a 1.1.1.1 -v -C 1 -S 512
>>>>
>>>> The kernel configs CONFIG_DEBUG_KMEMLEAK and CONFIG_DEBUG_OBJECTS
>>>> are enabled on both server and client. When rping runs, run the
>>>> following command on the server:
>>>>
>>>>   iptables -I OUTPUT -p udp --dport 4791 -j DROP
>>>>
>>>> Without this patch, a crash will occur.
>>>>
>>>> CC: Srinivas Eeda <srinivas.e...@oracle.com>
>>>> CC: Junxiao Bi <junxiao...@oracle.com>
>>>> Signed-off-by: Zhu Yanjun <yanjun@oracle.com>
>>>> Reviewed-by: Yuval Shaia <yuval.sh...@oracle.com>
>>> I have no reason to doubt your analysis, but if there are a bunch of
>>> error paths for net_xmit and they all return with your skb still
>>> being valid and holding a reference, and then one oddball that
>>> returns with your skb already gone, that just sounds like a mistake
>>> waiting to happen (not to mention a bajillion special cases
>>> sprinkled everywhere to deal with this apparent inconsistency). Can
>>> we get a netdev@ confirmation on this being the right solution?
>> Yes, I agree with you.
>>
>> After the iptables rule "iptables -I OUTPUT -p udp --dport 4791 -j
>> DROP" is added, the skb is freed in this function:
>>
>> /* Returns 1 if okfn() needs to be executed by the caller,
>>  * -EPERM for NF_DROP, 0 otherwise.  Caller must hold rcu_read_lock.
>>  */
>> int nf_hook_slow(struct sk_buff *skb, struct nf_hook_state *state,
>>                  const struct nf_hook_entries *e, unsigned int s)
>> {
>>         unsigned int verdict;
>>         int ret;
>>
>>         for (; s < e->num_hook_entries; s++) {
>>                 verdict = nf_hook_entry_hookfn(&e->hooks[s], skb, state);
>>                 switch (verdict & NF_VERDICT_MASK) {
>>                 case NF_ACCEPT:
>>                         break;
>>                 case NF_DROP:
>>                         kfree_skb(skb);   /* <-- here the skb is freed */
>>                         ret = NF_DROP_GETERR(verdict);
>>                         if (ret == 0)
>>                                 ret = -EPERM;
>>                         return ret;
>>                 case NF_QUEUE:
>>                         ret = nf_queue(skb, state, e, s, verdict);
>>                         if (ret == 1)
>>                                 continue;
>>                         return ret;
>>                 default:
>>                         /* Implicit handling for NF_STOLEN, as well as
>>                          * any other non conventional verdicts.
>>                          */
>>                         return 0;
>>                 }
>>         }
>>         return 1;
>> }
>> EXPORT_SYMBOL(nf_hook_slow);
>>
>> If I am wrong, please correct me. My test environment is still
>> available, and any solution can be verified in it.
>>
>> Zhu Yanjun
>>>> ---
>>>>  drivers/infiniband/sw/rxe/rxe_net.c  | 3 +++
>>>>  drivers/infiniband/sw/rxe/rxe_req.c  | 5 +++--
>>>>  drivers/infiniband/sw/rxe/rxe_resp.c | 9 ++---
>>>>  3 files changed, 12 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
>>>> index 9da6e37..2094434 100644
>>>> --- a/drivers/infiniband/sw/rxe/rxe_net.c
>>>> +++ b/drivers/infiniband/sw/rxe/rxe_net.c
>>>> @@ -511,6 +511,9 @@ int rxe_send(struct rxe_pkt_info *pkt, struct sk_buff *skb)
>>>>
>>>>  	if (unlikely(net_xmit_eval(err))) {
>>>>  		pr_debug("error sending packet: %d\n"
Re: [PATCH 1/1] IB/rxe: avoid double kfree_skb
Hi, all

  rxe_send
    ip_local_out
      __ip_local_out
        nf_hook_slow

In the above call chain, nf_hook_slow drops and frees the skb, and -EPERM is returned, when the iptables rule "iptables -I OUTPUT -p udp --dport 4791 -j DROP" is set. Since skb->users is not changed in soft RoCE, kfree_skb should not be called in this module. I will make further investigations about the other error handlers after ip_local_out. If I am wrong, please correct me. Any reply is appreciated.

Zhu Yanjun

On 2018/4/20 13:46, Yanjun Zhu wrote:
> On 2018/4/20 10:19, Doug Ledford wrote:
>> On Thu, 2018-04-19 at 10:01 -0400, Zhu Yanjun wrote:
>>> When the skb is dropped by iptables rules, the skb is freed and at
>>> the same time -EPERM is returned. So in soft RoCE it is not
>>> necessary to free the skb again; otherwise a crash will occur.
>>>
>>> The steps to reproduce:
>>>
>>>      server                       client
>>>   ---------                    ---------
>>>   |1.1.1.1|<----rxe-channel--->|1.1.1.2|
>>>   ---------                    ---------
>>>
>>> On server: rping -s -a 1.1.1.1 -v -C 1 -S 512
>>> On client: rping -c -a 1.1.1.1 -v -C 1 -S 512
>>>
>>> The kernel configs CONFIG_DEBUG_KMEMLEAK and CONFIG_DEBUG_OBJECTS
>>> are enabled on both server and client. When rping runs, run the
>>> following command on the server:
>>>
>>>   iptables -I OUTPUT -p udp --dport 4791 -j DROP
>>>
>>> Without this patch, a crash will occur.
>>>
>>> CC: Srinivas Eeda <srinivas.e...@oracle.com>
>>> CC: Junxiao Bi <junxiao...@oracle.com>
>>> Signed-off-by: Zhu Yanjun <yanjun@oracle.com>
>>> Reviewed-by: Yuval Shaia <yuval.sh...@oracle.com>
>> I have no reason to doubt your analysis, but if there are a bunch of
>> error paths for net_xmit and they all return with your skb still
>> being valid and holding a reference, and then one oddball that
>> returns with your skb already gone, that just sounds like a mistake
>> waiting to happen (not to mention a bajillion special cases sprinkled
>> everywhere to deal with this apparent inconsistency). Can we get a
>> netdev@ confirmation on this being the right solution?
> Yes, I agree with you.
>
> After the iptables rule "iptables -I OUTPUT -p udp --dport 4791 -j
> DROP" is added, the skb is freed in this function:
>
> /* Returns 1 if okfn() needs to be executed by the caller,
>  * -EPERM for NF_DROP, 0 otherwise.  Caller must hold rcu_read_lock.
>  */
> int nf_hook_slow(struct sk_buff *skb, struct nf_hook_state *state,
>                  const struct nf_hook_entries *e, unsigned int s)
> {
>         unsigned int verdict;
>         int ret;
>
>         for (; s < e->num_hook_entries; s++) {
>                 verdict = nf_hook_entry_hookfn(&e->hooks[s], skb, state);
>                 switch (verdict & NF_VERDICT_MASK) {
>                 case NF_ACCEPT:
>                         break;
>                 case NF_DROP:
>                         kfree_skb(skb);   /* <-- here the skb is freed */
>                         ret = NF_DROP_GETERR(verdict);
>                         if (ret == 0)
>                                 ret = -EPERM;
>                         return ret;
>                 case NF_QUEUE:
>                         ret = nf_queue(skb, state, e, s, verdict);
>                         if (ret == 1)
>                                 continue;
>                         return ret;
>                 default:
>                         /* Implicit handling for NF_STOLEN, as well as
>                          * any other non conventional verdicts.
>                          */
>                         return 0;
>                 }
>         }
>         return 1;
> }
> EXPORT_SYMBOL(nf_hook_slow);
>
> If I am wrong, please correct me. My test environment is still
> available, and any solution can be verified in it.
>
> Zhu Yanjun
>>> ---
>>>  drivers/infiniband/sw/rxe/rxe_net.c  | 3 +++
>>>  drivers/infiniband/sw/rxe/rxe_req.c  | 5 +++--
>>>  drivers/infiniband/sw/rxe/rxe_resp.c | 9 ++---
>>>  3 files changed, 12 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
>>> index 9da6e37..2094434 100644
>>> --- a/drivers/infiniband/sw/rxe/rxe_net.c
>>> +++ b/drivers/infiniband/sw/rxe/rxe_net.c
>>> @@ -511,6 +511,9 @@ int rxe_send(struct rxe_pkt_info *pkt, struct sk_buff *skb)
>>>
>>>  	if (unlikely(net_xmit_eval(err))) {
>>>  		pr_debug("error sending packet: %d\n", err);
>>> +		/* -EPERM means the skb is dropped and freed. */
>>> +		if (err == -EPERM)
>>> +			return -EPERM;
>>>  		return -EAGAIN;
>>>  	}
>>>
>>> diff --git a/drivers/infiniband/sw/rxe/rxe_req.c b/drivers/infiniband/sw/rxe/rxe_req.c
>>> index 7bdaf71..9d2efec 100644
>>> --- a/drivers/infiniband/sw/rxe/rxe_req.c
>>> +++ b/drivers/infiniband/sw/rxe/rxe_req.c
>>> @@ -727,8 +727,9 @@ int rxe_requester(void *arg)
>>>
>>>  		rollback_state(wqe, qp, &rollback_wqe, rollback_psn);
>>>
>>> -		if (ret == -EAGAIN) {
>>> -			kfree_skb(skb);
>>> +		if ((ret == -EAGAIN) || (ret == -EPERM)) {
>>> +			if (ret == -EAGAIN)
>>> +
Re: [PATCH 1/1] IB/rxe: avoid double kfree_skb
On 2018/4/20 10:19, Doug Ledford wrote:
> On Thu, 2018-04-19 at 10:01 -0400, Zhu Yanjun wrote:
>> When the skb is dropped by iptables rules, the skb is freed and at
>> the same time -EPERM is returned. So in soft RoCE it is not necessary
>> to free the skb again; otherwise a crash will occur.
>>
>> The steps to reproduce:
>>
>>      server                       client
>>   ---------                    ---------
>>   |1.1.1.1|<----rxe-channel--->|1.1.1.2|
>>   ---------                    ---------
>>
>> On server: rping -s -a 1.1.1.1 -v -C 1 -S 512
>> On client: rping -c -a 1.1.1.1 -v -C 1 -S 512
>>
>> The kernel configs CONFIG_DEBUG_KMEMLEAK and CONFIG_DEBUG_OBJECTS are
>> enabled on both server and client. When rping runs, run the following
>> command on the server:
>>
>>   iptables -I OUTPUT -p udp --dport 4791 -j DROP
>>
>> Without this patch, a crash will occur.
>>
>> CC: Srinivas Eeda
>> CC: Junxiao Bi
>> Signed-off-by: Zhu Yanjun
>> Reviewed-by: Yuval Shaia
> I have no reason to doubt your analysis, but if there are a bunch of
> error paths for net_xmit and they all return with your skb still being
> valid and holding a reference, and then one oddball that returns with
> your skb already gone, that just sounds like a mistake waiting to
> happen (not to mention a bajillion special cases sprinkled everywhere
> to deal with this apparent inconsistency). Can we get a netdev@
> confirmation on this being the right solution?

Yes, I agree with you.

After the iptables rule "iptables -I OUTPUT -p udp --dport 4791 -j DROP" is added, the skb is freed in this function:

/* Returns 1 if okfn() needs to be executed by the caller,
 * -EPERM for NF_DROP, 0 otherwise.  Caller must hold rcu_read_lock.
 */
int nf_hook_slow(struct sk_buff *skb, struct nf_hook_state *state,
                 const struct nf_hook_entries *e, unsigned int s)
{
        unsigned int verdict;
        int ret;

        for (; s < e->num_hook_entries; s++) {
                verdict = nf_hook_entry_hookfn(&e->hooks[s], skb, state);
                switch (verdict & NF_VERDICT_MASK) {
                case NF_ACCEPT:
                        break;
                case NF_DROP:
                        kfree_skb(skb);
[...]
 			rxe_run_task(&qp->req.task, 1);
 			goto exit;
 		}
 diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
 index a65c996..6bdf9b2 100644
 --- a/drivers/infiniband/sw/rxe/rxe_resp.c
 +++ b/drivers/infiniband/sw/rxe/rxe_resp.c
 @@ -742,7 +742,8 @@ static enum resp_states read_reply(struct rxe_qp *qp,
  	err = rxe_xmit_packet(rxe, qp, &ack_pkt, skb);
  	if (err) {
  		pr_err("Failed sending RDMA reply.\n");
 -		kfree_skb(skb);
 +		if (err != -EPERM)
 +
Re: [PATCH 1/1] net/mlx4_core: avoid resetting HCA when accessing an offline device
On 2018/4/17 23:37, Tariq Toukan wrote:
> On 16/04/2018 4:02 AM, Zhu Yanjun wrote:
>> When a faulty cable is used or an HCA firmware error occurs, the HCA
>> device goes offline. When the driver accesses this offline device,
>> the following call trace pops out:
>>
>> " ...
>> [] dump_stack+0x63/0x81
>> [] panic+0xcc/0x21b
>> [] mlx4_enter_error_state+0xba/0xf0 [mlx4_core]
>> [] mlx4_cmd_reset_flow+0x38/0x60 [mlx4_core]
>> [] mlx4_cmd_poll+0xc1/0x2e0 [mlx4_core]
>> [] __mlx4_cmd+0xb0/0x160 [mlx4_core]
>> [] mlx4_SENSE_PORT+0x54/0xd0 [mlx4_core]
>> [] mlx4_dev_cap+0x4a4/0xb50 [mlx4_core]
>> ... "
>>
>> In the above call trace, mlx4_cmd_poll calls mlx4_cmd_post to access
>> the HCA while the HCA is offline, and mlx4_cmd_post returns -EIO. On
>> -EIO, mlx4_cmd_poll calls mlx4_cmd_reset_flow to reset the HCA, and
>> the above call trace pops out. This is not reasonable: since the HCA
>> device is offline while it is being accessed, it should not be reset
>> again.
>>
>> In this patch, since the HCA is offline, mlx4_cmd_post returns
>> -EINVAL. On -EINVAL, mlx4_cmd_poll returns directly instead of
>> resetting the HCA.
>>
>> CC: Srinivas Eeda
>> CC: Junxiao Bi
>> Suggested-by: Håkon Bugge
>> Signed-off-by: Zhu Yanjun
>> ---
>>  drivers/net/ethernet/mellanox/mlx4/cmd.c | 8 ++++++++
>>  1 file changed, 8 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c b/drivers/net/ethernet/mellanox/mlx4/cmd.c
>> index 6a9086d..f1c8c42 100644
>> --- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
>> +++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
>> @@ -451,6 +451,8 @@ static int mlx4_cmd_post(struct mlx4_dev *dev, u64 in_param, u64 out_param,
>>  		 * Device is going through error recovery
>>  		 * and cannot accept commands.
>>  		 */
>> +		mlx4_err(dev, "%s : Device is in error recovery.\n", __func__);
>> +		ret = -EINVAL;
>>  		goto out;
>>  	}
>>
>> @@ -657,6 +659,9 @@ static int mlx4_cmd_poll(struct mlx4_dev *dev, u64 in_param, u64 *out_param,
>>  	}
>>
>>  out_reset:
>> +	if (err == -EINVAL)
>> +		goto out;
>> +
>
> See below.
>
>>  	if (err)
>>  		err = mlx4_cmd_reset_flow(dev, op, op_modifier, err);
>>  out:
>> @@ -766,6 +771,9 @@ static int mlx4_cmd_wait(struct mlx4_dev *dev, u64 in_param, u64 *out_param,
>>  		*out_param = context->out_param;
>>
>>  out_reset:
>> +	if (err == -EINVAL)
>> +		goto out;
>> +
>>  	if (err)
>
> Instead, just do here:
>
>	if (err && err != -EINVAL)
>		err = mlx4_cmd_reset_flow(dev, op, op_modifier, err);
>  out:
>
> I am not sure this does not mistakenly cover other cases that already
> exist and have (err == -EINVAL). For example, this line is hard to
> predict:
>	err = mlx4_status_to_errno
> and later on we might get into
>	if (mlx4_closing_cmd_fatal_error(op, stat))
> which leads to out_reset.

Thanks a lot. Sure, I agree with you that "err = mlx4_status_to_errno" and "if (mlx4_closing_cmd_fatal_error(op, stat))" can also produce err == -EINVAL, which would mistakenly go to out instead of resetting the HCA device. I will make a new patch to avoid this error.

Zhu Yanjun

> We must have a deeper look at this. But a better option is to change
> the error indication to uniquely indicate "already in error recovery".
Re: [PATCH next-queue 2/2] ixgbe: add unlikely notes to tx fastpath expressions
On 2018/1/9 6:47, Shannon Nelson wrote:
> Add unlikely() to a few error checking expressions in the Tx offload
> handling.
>
> Suggested-by: Yanjun Zhu <yanjun@oracle.com>

Hi,

I am fine with this patch. I have a question: is the ipsec feature supported in ixgbevf?

Thanks a lot.
Zhu Yanjun

> Signed-off-by: Shannon Nelson <shannon.nel...@oracle.com>
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
> index 57c10e6..3d069a2 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
> @@ -749,28 +749,28 @@ int ixgbe_ipsec_tx(struct ixgbe_ring *tx_ring,
>  	struct xfrm_state *xs;
>  	struct tx_sa *tsa;
>
> -	if (!first->skb->sp->len) {
> +	if (unlikely(!first->skb->sp->len)) {
>  		netdev_err(tx_ring->netdev, "%s: no xfrm state len = %d\n",
>  			   __func__, first->skb->sp->len);
>  		return 0;
>  	}
>
>  	xs = xfrm_input_state(first->skb);
> -	if (!xs) {
> +	if (unlikely(!xs)) {
>  		netdev_err(tx_ring->netdev, "%s: no xfrm_input_state() xs = %p\n",
>  			   __func__, xs);
>  		return 0;
>  	}
>
>  	itd->sa_idx = xs->xso.offload_handle - IXGBE_IPSEC_BASE_TX_INDEX;
> -	if (itd->sa_idx > IXGBE_IPSEC_MAX_SA_COUNT) {
> +	if (unlikely(itd->sa_idx > IXGBE_IPSEC_MAX_SA_COUNT)) {
>  		netdev_err(tx_ring->netdev, "%s: bad sa_idx=%d handle=%lu\n",
>  			   __func__, itd->sa_idx, xs->xso.offload_handle);
>  		return 0;
>  	}
>
>  	tsa = &ipsec->tx_tbl[itd->sa_idx];
> -	if (!tsa->used) {
> +	if (unlikely(!tsa->used)) {
>  		netdev_err(tx_ring->netdev, "%s: unused sa_idx=%d\n",
>  			   __func__, itd->sa_idx);
>  		return 0;
Re: [PATCH v3 next-queue 08/10] ixgbe: process the Tx ipsec offload
On 2017/12/20 8:00, Shannon Nelson wrote:
> If the skb has a security association referenced in the skb, then set
> up the Tx descriptor with the ipsec offload bits. While we're here, we
> fix an oddly named field in the context descriptor struct.
>
> v3: added ifdef CONFIG_XFRM_OFFLOAD check around call to ixgbe_ipsec_tx()
> v2: use ihl != 5
>     move the ixgbe_ipsec_tx() call to near the call to ixgbe_tso()
>     drop the ipsec packet if the tx offload setup fails
>     simplify the ixgbe_ipsec_tx() parameters by using 'first'
>     leave out the ixgbe_tso() changes since we don't support TSO with
>     ipsec yet.
>
> Signed-off-by: Shannon Nelson
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe.h       | 10 +++-
>  drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c | 79 ++
>  drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c   |  4 +-
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  | 26 ++--
>  drivers/net/ethernet/intel/ixgbe/ixgbe_type.h  |  2 +-
>  5 files changed, 112 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> index a094b23..3d2b7bf 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> @@ -171,10 +171,11 @@ enum ixgbe_tx_flags {
>  	IXGBE_TX_FLAGS_CC	= 0x08,
>  	IXGBE_TX_FLAGS_IPV4	= 0x10,
>  	IXGBE_TX_FLAGS_CSUM	= 0x20,
> +	IXGBE_TX_FLAGS_IPSEC	= 0x40,
>
>  	/* software defined flags */
> -	IXGBE_TX_FLAGS_SW_VLAN	= 0x40,
> -	IXGBE_TX_FLAGS_FCOE	= 0x80,
> +	IXGBE_TX_FLAGS_SW_VLAN	= 0x80,
> +	IXGBE_TX_FLAGS_FCOE	= 0x100,
>  };
>
>  /* VLAN info */
> @@ -1014,6 +1015,8 @@ void ixgbe_ipsec_restore(struct ixgbe_adapter *adapter);
>  void ixgbe_ipsec_rx(struct ixgbe_ring *rx_ring,
>  		    union ixgbe_adv_rx_desc *rx_desc,
>  		    struct sk_buff *skb);
> +int ixgbe_ipsec_tx(struct ixgbe_ring *tx_ring, struct ixgbe_tx_buffer *first,
> +		   struct ixgbe_ipsec_tx_data *itd);
>  #else
>  static inline void ixgbe_init_ipsec_offload(struct ixgbe_adapter *adapter) { };
>  static inline void ixgbe_stop_ipsec_offload(struct ixgbe_adapter *adapter) { };
> @@ -1021,5 +1024,8 @@ static inline void ixgbe_ipsec_restore(struct ixgbe_adapter *adapter) { };
>  static inline void ixgbe_ipsec_rx(struct ixgbe_ring *rx_ring,
>  				  union ixgbe_adv_rx_desc *rx_desc,
>  				  struct sk_buff *skb) { };
> +static inline int ixgbe_ipsec_tx(struct ixgbe_ring *tx_ring,
> +				 struct ixgbe_tx_buffer *first,
> +				 struct ixgbe_ipsec_tx_data *itd) { return 0; };
>  #endif /* CONFIG_XFRM_OFFLOAD */
>  #endif /* _IXGBE_H_ */
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
> index a9b8f5c..c2fd2ac 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
> @@ -695,12 +695,91 @@ static void ixgbe_ipsec_del_sa(struct xfrm_state *xs)
>  	}
>  }
>
> +/**
> + * ixgbe_ipsec_offload_ok - can this packet use the xfrm hw offload
> + * @skb: current data packet
> + * @xs: pointer to transformer state struct
> + **/
> +static bool ixgbe_ipsec_offload_ok(struct sk_buff *skb, struct xfrm_state *xs)
> +{
> +	if (xs->props.family == AF_INET) {
> +		/* Offload with IPv4 options is not supported yet */
> +		if (ip_hdr(skb)->ihl != 5)
> +			return false;
> +	} else {
> +		/* Offload with IPv6 extension headers is not supported yet */
> +		if (ipv6_ext_hdr(ipv6_hdr(skb)->nexthdr))
> +			return false;
> +	}
> +
> +	return true;
> +}
> +
>  static const struct xfrmdev_ops ixgbe_xfrmdev_ops = {
>  	.xdo_dev_state_add = ixgbe_ipsec_add_sa,
>  	.xdo_dev_state_delete = ixgbe_ipsec_del_sa,
> +	.xdo_dev_offload_ok = ixgbe_ipsec_offload_ok,
>  };
>
>  /**
> + * ixgbe_ipsec_tx - setup Tx flags for ipsec offload
> + * @tx_ring: outgoing context
> + * @first: current data packet
> + * @itd: ipsec Tx data for later use in building context descriptor
> + **/
> +int ixgbe_ipsec_tx(struct ixgbe_ring *tx_ring,
> +		   struct ixgbe_tx_buffer *first,
> +		   struct ixgbe_ipsec_tx_data *itd)
> +{
> +	struct ixgbe_adapter *adapter = netdev_priv(tx_ring->netdev);
> +	struct ixgbe_ipsec *ipsec = adapter->ipsec;
> +	struct xfrm_state *xs;
> +	struct tx_sa *tsa;
> +
> +	if (!first->skb->sp->len) {

Hi, Nelson

The function ixgbe_ipsec_tx() is called in the tx fastpath. Can we add unlikely as below:

	if (unlikely(!first->skb->sp->len))

? If I am wrong, please correct me. Thanks a lot.

Zhu Yanjun

> +		netdev_err(tx_ring->netdev, "%s: no xfrm state len = %d\n",
> +			   __func__, first->skb->sp->len);
> +		return 0;
> +	}
> +
> +	xs = xfrm_input_state(first->skb);
> +
Re: [PATCH v3 next-queue 00/10] ixgbe: Add ipsec offload
On 2017/12/21 14:39, Yanjun Zhu wrote: On 2017/12/20 7:59, Shannon Nelson wrote: This is an implementation of the ipsec hardware offload feature for the ixgbe driver and Intel's 10Gbe series NICs: x540, x550, 82599. Hi, Nelson I notice that the ipsec feature is based on x540, x550, 82599. But this ixgbe driver will also work with 82598. Does this ipsec feature also work with 82598? Sorry. I mean, after these ipsec patches are applied, whether ipsec offload enabled or not, can this ixgbe driver still work well with 82598? Zhu Yanjun Thanks a lot. Zhu Yanjun These patches apply to net-next v4.14 as well as Jeff Kirsher's next-queue v4.15-rc1-206-ge47375b. The ixgbe NICs support ipsec offload for 1024 Rx and 1024 Tx Security Associations (SAs), using up to 128 inbound IP addresses, and using the rfc4106(gcm(aes)) encryption. This code does not yet support IPv6, checksum offload, or TSO in conjunction with the ipsec offload - those will be added in the future. This code shows improvements in both packet throughput and CPU utilization. For example, here are some quicky numbers that show the magnitude of the performance gain on a single run of "iperf -c " with the ipsec offload on both ends of a point-to-point connection: 9.4 Gbps - normal case 7.6 Gbps - ipsec with offload 343 Mbps - ipsec no offload To set up a similar test case, you first need to be sure you have a recent version of iproute2 that supports the ipsec offload tag, probably something from ip 4.12 or newer would be best. 
I have a shell script that builds up the appropriate commands for me, but here are the resulting commands for all tcp traffic between 14.0.0.52 and 14.0.0.70: For the left side (14.0.0.52): ip x p add dir out src 14.0.0.52/24 dst 14.0.0.70/24 proto tcp tmpl \ proto esp src 14.0.0.52 dst 14.0.0.70 spi 0x07 mode transport reqid 0x07 ip x p add dir in src 14.0.0.70/24 dst 14.0.0.52/24 proto tcp tmpl \ proto esp dst 14.0.0.52 src 14.0.0.70 spi 0x07 mode transport reqid 0x07 ip x s add proto esp src 14.0.0.52 dst 14.0.0.70 spi 0x07 mode transport \ reqid 0x07 replay-window 32 \ aead 'rfc4106(gcm(aes))' 0x44434241343332312423222114131211f4f3f2f1 128 \ sel src 14.0.0.52/24 dst 14.0.0.70/24 proto tcp offload dev eth4 dir out ip x s add proto esp dst 14.0.0.52 src 14.0.0.70 spi 0x07 mode transport \ reqid 0x07 replay-window 32 \ aead 'rfc4106(gcm(aes))' 0x44434241343332312423222114131211f4f3f2f1 128 \ sel src 14.0.0.70/24 dst 14.0.0.52/24 proto tcp offload dev eth4 dir in For the right side (14.0.0.70): ip x p add dir out src 14.0.0.70/24 dst 14.0.0.52/24 proto tcp tmpl \ proto esp src 14.0.0.70 dst 14.0.0.52 spi 0x07 mode transport reqid 0x07 ip x p add dir in src 14.0.0.52/24 dst 14.0.0.70/24 proto tcp tmpl \ proto esp dst 14.0.0.70 src 14.0.0.52 spi 0x07 mode transport reqid 0x07 ip x s add proto esp src 14.0.0.70 dst 14.0.0.52 spi 0x07 mode transport \ reqid 0x07 replay-window 32 \ aead 'rfc4106(gcm(aes))' 0x44434241343332312423222114131211f4f3f2f1 128 \ sel src 14.0.0.70/24 dst 14.0.0.52/24 proto tcp offload dev eth4 dir out ip x s add proto esp dst 14.0.0.70 src 14.0.0.52 spi 0x07 mode transport \ reqid 0x07 replay-window 32 \ aead 'rfc4106(gcm(aes))' 0x44434241343332312423222114131211f4f3f2f1 128 \ sel src 14.0.0.52/24 dst 14.0.0.70/24 proto tcp offload dev eth4 dir in In both cases, the command "ip x s flush ; ip x p flush" will clean it all out and remove the offloads. Lastly, thanks to Alex Duyck for his early comments. 
Please see the individual patches for specific update info. v3: fixes after comments from those wonderfully pesky kbuild robots v2: fixes after comments from Alex Shannon Nelson (10): ixgbe: clean up ipsec defines ixgbe: add ipsec register access routines ixgbe: add ipsec engine start and stop routines ixgbe: add ipsec data structures ixgbe: add ipsec offload add and remove SA ixgbe: restore offloaded SAs after a reset ixgbe: process the Rx ipsec offload ixgbe: process the Tx ipsec offload ixgbe: ipsec offload stats ixgbe: register ipsec offload with the xfrm subsystem drivers/net/ethernet/intel/ixgbe/Makefile | 1 + drivers/net/ethernet/intel/ixgbe/ixgbe.h | 33 +- drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c | 2 + drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c | 923 +++ drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.h | 92 +++ drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c | 4 +- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 39 +- drivers/net/ethernet/intel/ixgbe/ixgbe_type.h | 22 +- 8 files changed, 1093 insertions(+), 23 deletions(-) create mode 100644 drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c create mode 100644 drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.h
Re: [PATCH net-next 1/1] forcedeth: remove unnecessary variable
On 2017/12/8 3:07, David Miller wrote: From: Zhu Yanjun Date: Wed, 6 Dec 2017 23:15:15 -0500 Since both tx_ring and first_tx are the head of the tx ring, it is not necessary to use two variables. So first_tx is removed. These are not variables, they are structure members. Sure. These 2 structure members are the head of the tx ring, and they are not changed in the whole driver, so they are duplicates. I will change the wording to "structure members" and send v2. Zhu Yanjun
Re: [PATCH 1/1] bnx2x: fix slowpath null crash
On 2017/11/8 11:27, Elior, Ariel wrote: When "NETDEV WATCHDOG: em4 (bnx2x): transmit queue 2 timed out" occurs, BNX2X_SP_RTNL_TX_TIMEOUT is set. In the function bnx2x_sp_rtnl_task, bnx2x_nic_unload and bnx2x_nic_load are executed to shut down and open the NIC. In the function bnx2x_nic_load, bnx2x_alloc_mem fails to allocate DMA memory. The message "bnx2x: [bnx2x_alloc_mem:8399(em4)]Can't allocate memory" pops out. The variable slowpath is set to NULL. When the NIC is shut down, the function bnx2x_nic_unload is called. In the function bnx2x_nic_unload, the following functions are executed. bnx2x_chip_cleanup bnx2x_set_storm_rx_mode bnx2x_set_q_rx_mode bnx2x_set_q_rx_mode bnx2x_config_rx_mode bnx2x_set_rx_mode_e2 In the function bnx2x_set_rx_mode_e2, the variable slowpath is dereferenced. Then the crash occurs. To fix this crash, the variable slowpath is checked. And in the function bnx2x_sp_rtnl_task, after the DMA memory allocation fails, another shutdown and open of the NIC is executed. CC: Joe Jin CC: Junxiao Bi Signed-off-by: Zhu Yanjun Acked-by: Ariel Elior Thanks a lot. Zhu Yanjun Thanks Zhu - you did a thorough job. Ariel
Re: [PATCH 1/1] bnx2x: fix slowpath null crash
Please ignore this mail. Zhu Yanjun On 2017/11/8 9:58, root wrote: From: Zhu Yanjun When "NETDEV WATCHDOG: em4 (bnx2x): transmit queue 2 timed out" occurs, BNX2X_SP_RTNL_TX_TIMEOUT is set. In the function bnx2x_sp_rtnl_task, bnx2x_nic_unload and bnx2x_nic_load are executed to shut down and open the NIC. In the function bnx2x_nic_load, bnx2x_alloc_mem fails to allocate DMA memory. The message "bnx2x: [bnx2x_alloc_mem:8399(em4)]Can't allocate memory" pops out. The variable slowpath is set to NULL. When the NIC is shut down, the function bnx2x_nic_unload is called. In the function bnx2x_nic_unload, the following functions are executed. bnx2x_chip_cleanup bnx2x_set_storm_rx_mode bnx2x_set_q_rx_mode bnx2x_set_q_rx_mode bnx2x_config_rx_mode bnx2x_set_rx_mode_e2 In the function bnx2x_set_rx_mode_e2, the variable slowpath is dereferenced. Then the crash occurs. To fix this crash, the variable slowpath is checked. And in the function bnx2x_sp_rtnl_task, after the DMA memory allocation fails, another shutdown and open of the NIC is executed. 
CC: Joe Jin CC: Junxiao Bi Signed-off-by: Zhu Yanjun --- drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 14 +++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c index c12b4d3..5929324 100644 --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c @@ -9332,7 +9332,7 @@ void bnx2x_chip_cleanup(struct bnx2x *bp, int unload_mode, bool keep_link) /* Schedule the rx_mode command */ if (test_bit(BNX2X_FILTER_RX_MODE_PENDING, &bp->sp_state)) set_bit(BNX2X_FILTER_RX_MODE_SCHED, &bp->sp_state); - else + else if (bp->slowpath) bnx2x_set_storm_rx_mode(bp); /* Cleanup multicast configuration */ @@ -10271,8 +10271,16 @@ static void bnx2x_sp_rtnl_task(struct work_struct *work) smp_mb(); bnx2x_nic_unload(bp, UNLOAD_NORMAL, true); - bnx2x_nic_load(bp, LOAD_NORMAL); - + /* When the load fails due to a memory allocation failure, + * the nic is rebooted again. If the open still fails, an error + * message notifies the user. + */ + if (bnx2x_nic_load(bp, LOAD_NORMAL) == -ENOMEM) { + bnx2x_nic_unload(bp, UNLOAD_NORMAL, true); + if (bnx2x_nic_load(bp, LOAD_NORMAL)) { + BNX2X_ERR("Open the NIC fails again!\n"); + } + } rtnl_unlock(); return; }
Re: [PATCH 1/1] forcedeth: remove tx_stop variable
Hi, all After this patch is applied, the TCP && UDP tests are made. The TCP bandwidth is 939 Mbits/sec. The UDP bandwidth is 806 Mbits/sec. So I think this patch can work well. host1 <-> host2 host1: forcedeth NIC IP: 1.1.1.107 iperf -s host2: forcedeth NIC IP:1.1.1.105 iperf -c 1.1.1.107 The TCP Bandwidth is as below: Client connecting to 1.1.1.107, TCP port 5001 TCP window size: 85.0 KByte (default) [ 3] local 1.1.1.105 port 46092 connected with 1.1.1.107 port 5001 [ ID] Interval Transfer Bandwidth [ 3] 0.0-10.0 sec 1.09 GBytes 939 Mbits/sec The UDP is as below: iperf -c 1.1.1.107 -u -b 1000m Client connecting to 1.1.1.107, UDP port 5001 Sending 1470 byte datagrams UDP buffer size: 208 KByte (default) [ 3] local 1.1.1.105 port 47265 connected with 1.1.1.107 port 5001 [ ID] Interval Transfer Bandwidth [ 3] 0.0-10.0 sec 964 MBytes 809 Mbits/sec [ 3] Sent 687990 datagrams [ 3] Server Report: [ 3] 0.0-10.0 sec 960 MBytes 806 Mbits/sec 0.019 ms 2942/687989 (0.43%) [ 3] 0.0-10.0 sec 1 datagrams received out-of-order Zhu Yanjun On 2017/9/8 20:28, Zhu Yanjun wrote: The variable tx_stop is used to indicate the tx queue state: started or stopped. In fact, the inline function netif_queue_stopped can do the same work. So replace the variable tx_stop with the function netif_queue_stopped. 
Signed-off-by: Zhu Yanjun --- drivers/net/ethernet/nvidia/forcedeth.c | 13 - 1 file changed, 4 insertions(+), 9 deletions(-) diff --git a/drivers/net/ethernet/nvidia/forcedeth.c b/drivers/net/ethernet/nvidia/forcedeth.c index 994a83a..e6e0de4 100644 --- a/drivers/net/ethernet/nvidia/forcedeth.c +++ b/drivers/net/ethernet/nvidia/forcedeth.c @@ -834,7 +834,6 @@ struct fe_priv { u32 tx_pkts_in_progress; struct nv_skb_map *tx_change_owner; struct nv_skb_map *tx_end_flip; - int tx_stop; /* TX software stats */ struct u64_stats_sync swstats_tx_syncp; @@ -1939,7 +1938,6 @@ static void nv_init_tx(struct net_device *dev) np->tx_pkts_in_progress = 0; np->tx_change_owner = NULL; np->tx_end_flip = NULL; - np->tx_stop = 0; for (i = 0; i < np->tx_ring_size; i++) { if (!nv_optimized(np)) { @@ -2211,7 +2209,6 @@ static netdev_tx_t nv_start_xmit(struct sk_buff *skb, struct net_device *dev) empty_slots = nv_get_empty_tx_slots(np); if (unlikely(empty_slots <= entries)) { netif_stop_queue(dev); - np->tx_stop = 1; spin_unlock_irqrestore(&np->lock, flags); return NETDEV_TX_BUSY; } @@ -2359,7 +2356,6 @@ static netdev_tx_t nv_start_xmit_optimized(struct sk_buff *skb, empty_slots = nv_get_empty_tx_slots(np); if (unlikely(empty_slots <= entries)) { netif_stop_queue(dev); - np->tx_stop = 1; spin_unlock_irqrestore(&np->lock, flags); return NETDEV_TX_BUSY; } @@ -2583,8 +2579,8 @@ static int nv_tx_done(struct net_device *dev, int limit) netdev_completed_queue(np->dev, tx_work, bytes_compl); - if (unlikely((np->tx_stop == 1) && (np->get_tx.orig != orig_get_tx))) { - np->tx_stop = 0; + if (unlikely(netif_queue_stopped(dev) && +(np->get_tx.orig != orig_get_tx))) { netif_wake_queue(dev); } return tx_work; @@ -2637,8 +2633,8 @@ static int nv_tx_done_optimized(struct net_device *dev, int limit) netdev_completed_queue(np->dev, tx_work, bytes_cleaned); - if (unlikely((np->tx_stop == 1) && (np->get_tx.ex != orig_get_tx))) { - np->tx_stop = 0; + if (unlikely(netif_queue_stopped(dev) && +(np->get_tx.ex != 
orig_get_tx))) { netif_wake_queue(dev); } return tx_work; @@ -2724,7 +2720,6 @@ static void nv_tx_timeout(struct net_device *dev) /* 2) complete any outstanding tx and do not give HW any limited tx pkts */ saved_tx_limit = np->tx_limit; np->tx_limit = 0; /* prevent giving HW any limited pkts */ - np->tx_stop = 0; /* prevent waking tx queue */ if (!nv_optimized(np)) nv_tx_done(dev, np->tx_ring_size); else
Re: [PATCH 1/5] rds: tcp: release the created connection
On 2017/3/27 15:37, Sowmini Varadhan wrote: On (03/27/17 03:06), Zhu Yanjun wrote: Date: Mon, 27 Mar 2017 03:06:26 -0400 From: Zhu YanjunTo: yanjun@oracle.com, santosh.shilim...@oracle.com, netdev@vger.kernel.org, linux-r...@vger.kernel.org, rds-de...@oss.oracle.com, junxiao...@oracle.com, joe@oracle.com Subject: [PATCH 1/5] rds: tcp: release the created connection X-Mailer: git-send-email 2.7.4 When some error occurs, the created connection should be destroyed. No please dont do this. This is the case when there are duelling connections. We want to reset the new (accept sock) and leave the old socket in place. How did you test this? Did you test it with network namespaces? Sorry. I just made simple test. It seems that it worked well. Would you like to show me some test about this patch? Thanks a lot. Zhu Yanjun --Sowmini net/rds/tcp_listen.c | 1 + diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c index 5076788..58aa5bc 100644 --- a/net/rds/tcp_listen.c +++ b/net/rds/tcp_listen.c @@ -196,6 +196,7 @@ int rds_tcp_accept_one(struct socket *sock) rst_nsk: /* reset the newly returned accept sock and bail */ kernel_sock_shutdown(new_sock, SHUT_RDWR); + rds_conn_destroy(conn); ret = 0; out: if (rs_tcp) -- 2.7.4 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv3 1/4] rds: ib: drop unnecessary rdma_reject
On 2017/3/13 14:32, Leon Romanovsky wrote: On Mon, Mar 13, 2017 at 01:43:45AM -0400, Zhu Yanjun wrote: When rdma_accept fails, rdma_reject is called in it. As such, it is not necessary to execute rdma_reject again. It is not always correct; according to the code, rdma_accept can fail and return EINVAL, skipping the call to rdma_reject. Sure. When -EINVAL is returned by the check cma_comp(id_priv, RDMA_CM_CONNECT), the connection is already lost. As such, it does not matter whether rdma_reject is sent or not. Zhu Yanjun 3725 int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) 3726 { 3727 struct rdma_id_private *id_priv; 3728 int ret; 3729 3730 id_priv = container_of(id, struct rdma_id_private, id); 3731 3732 id_priv->owner = task_pid_nr(current); 3733 3734 if (!cma_comp(id_priv, RDMA_CM_CONNECT)) 3735 return -EINVAL; 3736 Cc: Joe Jin Cc: Junxiao Bi Acked-by: Santosh Shilimkar Signed-off-by: Zhu Yanjun --- net/rds/ib_cm.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c index ce3775a..4b9405c 100644 --- a/net/rds/ib_cm.c +++ b/net/rds/ib_cm.c @@ -677,9 +677,8 @@ int rds_ib_cm_handle_connect(struct rdma_cm_id *cm_id, event->param.conn.initiator_depth); /* rdma_accept() calls rdma_reject() internally if it fails */ - err = rdma_accept(cm_id, &conn_param); - if (err) - rds_ib_conn_error(conn, "rdma_accept failed (%d)\n", err); + if (rdma_accept(cm_id, &conn_param)) + rds_ib_conn_error(conn, "rdma_accept failed\n"); out: if (conn) -- 2.7.4
Re: [PATCHv2 1/4] rds: ib: drop unnecessary rdma_reject
On 2017/3/13 3:43, santosh.shilim...@oracle.com wrote: On 3/12/17 12:33 PM, Leon Romanovsky wrote: On Sun, Mar 12, 2017 at 04:07:55AM -0400, Zhu Yanjun wrote: When rdma_accept fails, rdma_reject is called in it. As such, it is not necessary to execute rdma_reject again. Cc: Joe Jin Cc: Junxiao Bi Acked-by: Santosh Shilimkar Signed-off-by: Zhu Yanjun --- Change from v1 to v2: Add the acker. net/rds/ib_cm.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c index ce3775a..eca3d5f 100644 --- a/net/rds/ib_cm.c +++ b/net/rds/ib_cm.c @@ -677,8 +677,7 @@ int rds_ib_cm_handle_connect(struct rdma_cm_id *cm_id, event->param.conn.initiator_depth); /* rdma_accept() calls rdma_reject() internally if it fails */ -err = rdma_accept(cm_id, &conn_param); -if (err) +if (rdma_accept(cm_id, &conn_param)) rds_ib_conn_error(conn, "rdma_accept failed (%d)\n", err); You omitted initialization of the "err" variable which you print here ^. It's inited by rds_ib_setup_qp() but you are right. It will print failed with error = 0. :-) Zhu, please drop that 'err' from the message. OK. I will do. Zhu Yanjun
Re: [PATCH 2/5] rds: ib: replace spin_lock_irq with spin_lock_irqsave
Sorry. I have no test case that shows an issue. But according to Linux Kernel Development, Second Edition, by Robert Love, using spin_lock_irq is dangerous since spin_unlock_irq unconditionally enables interrupts. We can assume the following scenario: ---> interrupts are already disabled; spin_lock_irq(lock_ptr); <--- this disables interrupts again; list_del(&ibmr->ib_node); spin_unlock_irq(lock_ptr); <--- this enables interrupts; ---> now interrupts are enabled, even though the caller had them disabled. Our code changes the interrupt state, which is a potential risk. spin_lock_irqsave/spin_unlock_irqrestore do not carry that risk. Zhu Yanjun On 2017/3/10 0:50, Santosh Shilimkar wrote: On 3/8/2017 11:26 PM, Zhu Yanjun wrote: It is difficult to make sure of the interrupt state when this function is called. As such, it is safer to use spin_lock_irqsave than spin_lock_irq. There is no reason to hold irqs and as such the code path is safe from irq context. I don't see the need for this change unless you have a test case which showed some issue. Regards, Santosh
Re: [PATCH 1/1] ixgbe: add the external ixgbe fiber transceiver status
On 2017/2/10 3:08, Tantilov, Emil S wrote: -Original Message- From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On Behalf Of Zhu Yanjun Sent: Wednesday, February 08, 2017 7:03 PM To: Kirsher, Jeffrey T; broo...@kernel.org; da...@davemloft.net; intel-wired-...@lists.osuosl.org; netdev@vger.kernel.org Subject: [PATCH 1/1] ixgbe: add the external ixgbe fiber transceiver status When the ixgbe fiber transceiver is external, it is necessary to get the present/absent status of this external ixgbe fiber transceiver. The transceiver field was deprecated in the old ethtool API and is being removed in the new. This patch will not apply at all once those changes are made: http://patchwork.ozlabs.org/patch/725081/ Thanks for your kind reply. I will change this patch based on the above changes. Zhu Yanjun Thanks, Emil
[PATCHv2 1/1] r8169: fix the typo in the comment
From: Zhu Yanjun. From the Realtek data sheet, PID0 should be bit 0. Signed-off-by: Zhu Yanjun --- Change from v1 to v2: change the commit header. drivers/net/ethernet/realtek/r8169.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c index 44389c9..8f1623b 100644 --- a/drivers/net/ethernet/realtek/r8169.c +++ b/drivers/net/ethernet/realtek/r8169.c @@ -696,7 +696,7 @@ enum rtl_tx_desc_bit_1 { enum rtl_rx_desc_bit { /* Rx private */ PID1= (1 << 18), /* Protocol ID bit 1/2 */ - PID0= (1 << 17), /* Protocol ID bit 2/2 */ + PID0= (1 << 17), /* Protocol ID bit 0/2 */ #define RxProtoUDP (PID1) #define RxProtoTCP (PID0) -- 2.7.4
Re: [PATCH 1/1] r8169: fix the typo
Hi, Please comment on this patch. Zhu Yanjun On 2016/12/29 11:11, Zhu Yanjun wrote: From the Realtek data sheet, PID0 should be bit 0. Signed-off-by: Zhu Yanjun --- drivers/net/ethernet/realtek/r8169.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c index 44389c9..8f1623b 100644 --- a/drivers/net/ethernet/realtek/r8169.c +++ b/drivers/net/ethernet/realtek/r8169.c @@ -696,7 +696,7 @@ enum rtl_tx_desc_bit_1 { enum rtl_rx_desc_bit { /* Rx private */ PID1= (1 << 18), /* Protocol ID bit 1/2 */ - PID0= (1 << 17), /* Protocol ID bit 2/2 */ + PID0= (1 << 17), /* Protocol ID bit 0/2 */ #define RxProtoUDP (PID1) #define RxProtoTCP (PID0)