Re: KASAN: null-ptr-deref Read in rds_ib_get_mr

2018-05-11 Thread Yanjun Zhu



On 2018/5/12 0:58, Santosh Shilimkar wrote:

On 5/11/2018 12:48 AM, Yanjun Zhu wrote:



On 2018/5/11 13:20, DaeRyong Jeong wrote:

We report the crash: KASAN: null-ptr-deref Read in rds_ib_get_mr

Note that this bug was previously reported by syzkaller.
https://syzkaller.appspot.com/bug?id=0bb56a5a48b000b52aa2b0d8dd20b1f545214d91

Nonetheless, this bug has not been fixed yet, and we hope that this report
and our analysis, which was aided by RaceFuzzer's features, will be helpful
in fixing the crash.

This crash has been found in v4.17-rc1 using RaceFuzzer (a modified
version of Syzkaller), which we describe more at the end of this
report. Our analysis shows that the race occurs when invoking two
syscalls concurrently, bind$rds and setsockopt$RDS_GET_MR.


Analysis:
We think the concurrent execution of __rds_rdma_map() and rds_bind()
causes the problem. __rds_rdma_map() checks whether rs->rs_bound_addr is 0
or not. But concurrent execution with rds_bind() can bypass this
check. Therefore, __rds_rdma_map() calls rs->rs_transport->get_mr(), and
rds_ib_get_mr() causes the null dereference at ib_rdma.c:544 in v4.17-rc1,
when dereferencing rs->rs_conn.


Thread interleaving:
CPU0 (__rds_rdma_map)                              CPU1 (rds_bind)
                                                   // rds_add_bound() sets rs->rs_bound_addr to non-zero
                                                   ret = rds_add_bound(rs, sin->sin_addr.s_addr,
                                                                       &sin->sin_port);
if (rs->rs_bound_addr == 0 || !rs->rs_transport) {
        ret = -ENOTCONN; /* XXX not a great errno */
        goto out;
}
                                                   if (rs->rs_transport) { /* previously bound */
                                                           trans = rs->rs_transport;
                                                           if (trans->laddr_check(sock_net(sock->sk),
                                                                   sin->sin_addr.s_addr) != 0) {
                                                                   ret = -ENOPROTOOPT;
                                                                   // rds_remove_bound() sets rs->rs_bound_addr to 0
                                                                   rds_remove_bound(rs);
...
trans_private = rs->rs_transport->get_mr(sg, nents, rs, &mr->r_key);
(in rds_ib_get_mr())
struct rds_ib_connection *ic = rs->rs_conn->c_transport_data;


Call sequence (v4.17-rc1):
CPU0
rds_setsockopt
rds_get_mr
    __rds_rdma_map
    rds_ib_get_mr


CPU1
rds_bind
rds_add_bound
...
rds_remove_bound


Crash log:
==
BUG: KASAN: null-ptr-deref in rds_ib_get_mr+0x3a/0x150 
net/rds/ib_rdma.c:544

Read of size 8 at addr 0068 by task syz-executor0/32067

CPU: 0 PID: 32067 Comm: syz-executor0 Not tainted 4.17.0-rc1 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014

Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x166/0x21c lib/dump_stack.c:113
  kasan_report_error mm/kasan/report.c:352 [inline]
  kasan_report+0x140/0x360 mm/kasan/report.c:412
  check_memory_region_inline mm/kasan/kasan.c:260 [inline]
  __asan_load8+0x54/0x90 mm/kasan/kasan.c:699
  rds_ib_get_mr+0x3a/0x150 net/rds/ib_rdma.c:544
  __rds_rdma_map+0x521/0x9d0 net/rds/rdma.c:271
  rds_get_mr+0xad/0xf0 net/rds/rdma.c:333
  rds_setsockopt+0x57f/0x720 net/rds/af_rds.c:347
  __sys_setsockopt+0x147/0x230 net/socket.c:1903
  __do_sys_setsockopt net/socket.c:1914 [inline]
  __se_sys_setsockopt net/socket.c:1911 [inline]
  __x64_sys_setsockopt+0x67/0x80 net/socket.c:1911
  do_syscall_64+0x15f/0x4a0 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4563f9
RSP: 002b:7f6a2b3c2b28 EFLAGS: 0246 ORIG_RAX: 0036
RAX: ffda RBX: 0072bee0 RCX: 004563f9
RDX: 0002 RSI: 0114 RDI: 0015
RBP: 0575 R08: 0020 R09: 
R10: 2140 R11: 0246 R12: 7f6a2b3c36d4
R13:  R14: 006fd398 R15: 
==

diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index e678699..2228b50 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -539,11 +539,17 @@ void rds_ib_flush_mrs(void)
  void *rds_ib_get_mr(struct scatterlist *sg, unsigned long nents,
 struct rds_sock *rs, u32 *key_ret)
  {
-   struct rds_ib_device *rds_ibdev;
+   struct rds_ib_device *rds_ibdev = NULL;
 struct rds_ib_mr *ibmr = NULL;
-   struct rds_ib_connection *ic = rs->rs_conn->c_transport_data;
+   struct rds_ib_connection *ic = NULL;
 int ret;

+   if (rs->rs_bound_addr == 0) {
+   ret = -EPERM;
+   goto out;
+   }
+

No, you can't return such an error for this API, and the
socket-related checks need to be done at the core layer.
I remember fixing this race but probably never pushed the
fix upstream.

OK. I will wait for your patch. :-)


The MR code is due for update with optim

Re: [rds-devel] KASAN: null-ptr-deref Read in rds_ib_get_mr

2018-05-11 Thread Yanjun Zhu



On 2018/5/11 18:46, Sowmini Varadhan wrote:

On (05/11/18 15:48), Yanjun Zhu wrote:

diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index e678699..2228b50 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -539,11 +539,17 @@ void rds_ib_flush_mrs(void)
void *rds_ib_get_mr(struct scatterlist *sg, unsigned long nents,
 struct rds_sock *rs, u32 *key_ret)
{
- struct rds_ib_device *rds_ibdev;
+ struct rds_ib_device *rds_ibdev = NULL;
 struct rds_ib_mr *ibmr = NULL;
- struct rds_ib_connection *ic = rs->rs_conn->c_transport_data;
+ struct rds_ib_connection *ic = NULL;
 int ret;

+ if (rs->rs_bound_addr == 0) {
+ ret = -EPERM;
+ goto out;
+ }
+
+ ic = rs->rs_conn->c_transport_data;
 rds_ibdev = rds_ib_get_device(rs->rs_bound_addr);
 if (!rds_ibdev) {
 ret = -ENODEV;

I made this raw patch. If you can reproduce this bug, please test with it.

I don't think this solves the problem; I think it
just changes the timing under which it can still happen.

What if the rds_remove_bound() in rds_bind() happens after the check
for if (rs->rs_bound_addr == 0) added above by the patch?

I believe you need some type of synchronization (either
through a mutex, or some atomic flag in the rs, or similar) to make
sure rds_bind() and rds_ib_get_mr() are mutually exclusive.

Sure, I agree with you. Maybe a mutex is a good choice; a minimal sketch of
that idea is below.
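For illustration only, here is a rough, untested sketch of that idea (this is
not a real patch: the rs_get_mr_lock member and the helper names are made up,
and a complete fix would also have to take the same lock in rds_bind() around
rds_add_bound()/rds_remove_bound()):

#include <linux/types.h>
#include <linux/mutex.h>

/* Hypothetical stand-in for struct rds_sock, with an added lock. */
struct rds_sock_sketch {
	__be32		rs_bound_addr;
	struct mutex	rs_get_mr_lock;	/* serializes bind vs. get_mr */
};

static int rds_get_mr_sketch(struct rds_sock_sketch *rs)
{
	int ret = 0;

	mutex_lock(&rs->rs_get_mr_lock);
	if (rs->rs_bound_addr == 0) {	/* re-check under the lock */
		ret = -ENOTCONN;
		goto out;
	}
	/* ... rs->rs_transport->get_mr() would be called here ... */
out:
	mutex_unlock(&rs->rs_get_mr_lock);
	return ret;
}

A mutex is workable here because both paths run in process context (bind and
setsockopt are syscall paths); if any of this ever ran in atomic context, a
spinlock or an atomic flag would be needed instead.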

Zhu Yanjun


--Sowmini
  







Re: KASAN: null-ptr-deref Read in rds_ib_get_mr

2018-05-11 Thread Yanjun Zhu



On 2018/5/11 13:20, DaeRyong Jeong wrote:

We report the crash: KASAN: null-ptr-deref Read in rds_ib_get_mr

Note that this bug was previously reported by syzkaller.
https://syzkaller.appspot.com/bug?id=0bb56a5a48b000b52aa2b0d8dd20b1f545214d91
Nonetheless, this bug has not been fixed yet, and we hope that this report and our
analysis, which was aided by RaceFuzzer's features, will be helpful in fixing the
crash.

This crash has been found in v4.17-rc1 using RaceFuzzer (a modified
version of Syzkaller), which we describe more at the end of this
report. Our analysis shows that the race occurs when invoking two
syscalls concurrently, bind$rds and setsockopt$RDS_GET_MR.


Analysis:
We think the concurrent execution of __rds_rdma_map() and rds_bind()
causes the problem. __rds_rdma_map() checks whether rs->rs_bound_addr is 0
or not. But concurrent execution with rds_bind() can bypass this
check. Therefore, __rds_rdma_map() calls rs->rs_transport->get_mr(), and
rds_ib_get_mr() causes the null dereference at ib_rdma.c:544 in v4.17-rc1, when
dereferencing rs->rs_conn.


Thread interleaving:
CPU0 (__rds_rdma_map)                              CPU1 (rds_bind)
                                                   // rds_add_bound() sets rs->rs_bound_addr to non-zero
                                                   ret = rds_add_bound(rs, sin->sin_addr.s_addr,
                                                                       &sin->sin_port);
if (rs->rs_bound_addr == 0 || !rs->rs_transport) {
        ret = -ENOTCONN; /* XXX not a great errno */
        goto out;
}
                                                   if (rs->rs_transport) { /* previously bound */
                                                           trans = rs->rs_transport;
                                                           if (trans->laddr_check(sock_net(sock->sk),
                                                                   sin->sin_addr.s_addr) != 0) {
                                                                   ret = -ENOPROTOOPT;
                                                                   // rds_remove_bound() sets rs->rs_bound_addr to 0
                                                                   rds_remove_bound(rs);
...
trans_private = rs->rs_transport->get_mr(sg, nents, rs, &mr->r_key);
(in rds_ib_get_mr())
struct rds_ib_connection *ic = rs->rs_conn->c_transport_data;


Call sequence (v4.17-rc1):
CPU0
rds_setsockopt
rds_get_mr
__rds_rdma_map
rds_ib_get_mr


CPU1
rds_bind
rds_add_bound
...
rds_remove_bound


Crash log:
==
BUG: KASAN: null-ptr-deref in rds_ib_get_mr+0x3a/0x150 net/rds/ib_rdma.c:544
Read of size 8 at addr 0068 by task syz-executor0/32067

CPU: 0 PID: 32067 Comm: syz-executor0 Not tainted 4.17.0-rc1 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x166/0x21c lib/dump_stack.c:113
  kasan_report_error mm/kasan/report.c:352 [inline]
  kasan_report+0x140/0x360 mm/kasan/report.c:412
  check_memory_region_inline mm/kasan/kasan.c:260 [inline]
  __asan_load8+0x54/0x90 mm/kasan/kasan.c:699
  rds_ib_get_mr+0x3a/0x150 net/rds/ib_rdma.c:544
  __rds_rdma_map+0x521/0x9d0 net/rds/rdma.c:271
  rds_get_mr+0xad/0xf0 net/rds/rdma.c:333
  rds_setsockopt+0x57f/0x720 net/rds/af_rds.c:347
  __sys_setsockopt+0x147/0x230 net/socket.c:1903
  __do_sys_setsockopt net/socket.c:1914 [inline]
  __se_sys_setsockopt net/socket.c:1911 [inline]
  __x64_sys_setsockopt+0x67/0x80 net/socket.c:1911
  do_syscall_64+0x15f/0x4a0 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4563f9
RSP: 002b:7f6a2b3c2b28 EFLAGS: 0246 ORIG_RAX: 0036
RAX: ffda RBX: 0072bee0 RCX: 004563f9
RDX: 0002 RSI: 0114 RDI: 0015
RBP: 0575 R08: 0020 R09: 
R10: 2140 R11: 0246 R12: 7f6a2b3c36d4
R13:  R14: 006fd398 R15: 
==

diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index e678699..2228b50 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -539,11 +539,17 @@ void rds_ib_flush_mrs(void)
 void *rds_ib_get_mr(struct scatterlist *sg, unsigned long nents,
    struct rds_sock *rs, u32 *key_ret)
 {
-   struct rds_ib_device *rds_ibdev;
+   struct rds_ib_device *rds_ibdev = NULL;
    struct rds_ib_mr *ibmr = NULL;
-   struct rds_ib_connection *ic = rs->rs_conn->c_transport_data;
+   struct rds_ib_connection *ic = NULL;
    int ret;

+   if (rs->rs_bound_addr == 0) {
+   ret = -EPERM;
+   goto out;
+   }
+
+

Re: [PATCH] mlx4_core: allocate 4KB ICM chunks

2018-05-10 Thread Yanjun Zhu



On 2018/5/11 7:31, Qing Huang wrote:

When a system is under memory pressure (high usage with fragmentation),
the original 256KB ICM chunk allocations will likely trigger the kernel
memory management to enter its slow path, doing memory compaction/migration
in order to complete high-order memory allocations.

When that happens, user processes calling uverb APIs may get stuck
for more than 120s easily even though there are a lot of free pages
in smaller chunks available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With 4KB ICM chunk size, the above issue is fixed.

However in order to support 4KB ICM chunk size, we need to fix another
issue in large size kcalloc allocations.

E.g.
Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
entry). So we need a 16MB allocation for a table->icm pointer array to
hold 2M pointers which can easily cause kcalloc to fail.
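As a quick, illustrative cross-check of the numbers above (this is not driver
code; it just redoes the arithmetic assuming 8-byte pointers):

#include <stdio.h>

int main(void)
{
	unsigned long long num_mtt   = 1ULL << 30;          /* log_num_mtt = 30 -> 1G MTT entries */
	unsigned long long chunk_sz  = 1ULL << 12;          /* 4KB ICM chunk                      */
	unsigned long long per_chunk = chunk_sz / 8;        /* 512 MTT entries per chunk          */
	unsigned long long num_icm   = num_mtt / per_chunk; /* chunk pointers needed              */
	unsigned long long tbl_bytes = num_icm * 8;         /* size of the table->icm array       */

	/* prints: chunks=2097152 table=16 MB */
	printf("chunks=%llu table=%llu MB\n", num_icm, tbl_bytes >> 20);
	return 0;
}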

The solution is to use vzalloc to replace kcalloc. There is no need
for contiguous memory pages for a driver metadata structure (no need
for DMA ops).

Hi,

Replacing contiguous memory pages with vmalloc'ed virtual memory -- is
there any performance loss?

Zhu Yanjun

Signed-off-by: Qing Huang 
Acked-by: Daniel Jurgens 
---
  drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++---
  1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c 
b/drivers/net/ethernet/mellanox/mlx4/icm.c
index a822f7a..2b17a4b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -43,12 +43,12 @@
  #include "fw.h"
  
  /*

- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in 4KB page size chunks to avoid high order memory
+ * allocations in fragmented/high usage memory situation.
   */
  enum {
-   MLX4_ICM_ALLOC_SIZE = 1 << 18,
-   MLX4_TABLE_CHUNK_SIZE   = 1 << 18
+   MLX4_ICM_ALLOC_SIZE = 1 << 12,
+   MLX4_TABLE_CHUNK_SIZE   = 1 << 12
  };
  
  static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)

@@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct 
mlx4_icm_table *table,
obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;
  
-	table->icm  = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);

+   table->icm  = vzalloc(num_icm * sizeof(*table->icm));
if (!table->icm)
return -ENOMEM;
table->virt = virt;
@@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct 
mlx4_icm_table *table,
mlx4_free_icm(dev, table->icm[i], use_coherent);
}
  
-	kfree(table->icm);

+   vfree(table->icm);
  
  	return -ENOMEM;

  }
@@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct 
mlx4_icm_table *table)
mlx4_free_icm(dev, table->icm[i], table->coherent);
}
  
-	kfree(table->icm);

+   vfree(table->icm);
  }




Re: [PATCHv2 1/1] IB/rxe: avoid double kfree_skb

2018-04-25 Thread Yanjun Zhu

Add netdev@vger.kernel.org


On 2018/4/26 12:41, Zhu Yanjun wrote:

When skb is sent, it will pass the following functions in soft roce.

rxe_send [rdma_rxe]
 ip_local_out
 __ip_local_out
 ip_output
 ip_finish_output
 ip_finish_output2
 dev_queue_xmit
 __dev_queue_xmit
 dev_hard_start_xmit

If an error occurs in the above functions, or iptables rules drop the
skb after ip_local_out, kfree_skb will be called by the lower layers.
So it is not necessary to call kfree_skb in the soft RoCE module again;
otherwise, a crash will occur.

The steps to reproduce:

  server                        client
 ---------                     ---------
 |1.1.1.1| <---rxe channel---> |1.1.1.2|
 ---------                     ---------

On server: rping -s -a 1.1.1.1 -v -C 1 -S 512
On client: rping -c -a 1.1.1.1 -v -C 1 -S 512

The kernel configs CONFIG_DEBUG_KMEMLEAK and
CONFIG_DEBUG_OBJECTS are enabled on both server and client.

When rping runs, run the following command in server:

iptables -I OUTPUT -p udp  --dport 4791 -j DROP

Without this patch, crash will occur.

CC: Srinivas Eeda 
CC: Junxiao Bi 
Signed-off-by: Zhu Yanjun 
Reviewed-by: Yuval Shaia 
---
V1->V2: Not only the dropped skb but also the error skb is freed by the
 lower layers. So in soft RoCE, it is not necessary to call
 kfree_skb again.
---
  drivers/infiniband/sw/rxe/rxe_req.c  | 1 -
  drivers/infiniband/sw/rxe/rxe_resp.c | 6 +-
  2 files changed, 1 insertion(+), 6 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_req.c 
b/drivers/infiniband/sw/rxe/rxe_req.c
index 7bdaf71..7851999 100644
--- a/drivers/infiniband/sw/rxe/rxe_req.c
+++ b/drivers/infiniband/sw/rxe/rxe_req.c
@@ -728,7 +728,6 @@ int rxe_requester(void *arg)
rollback_state(wqe, qp, _wqe, rollback_psn);
  
  		if (ret == -EAGAIN) {

-   kfree_skb(skb);
rxe_run_task(>req.task, 1);
goto exit;
}
diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c 
b/drivers/infiniband/sw/rxe/rxe_resp.c
index a65c996..955ff3b 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -742,7 +742,6 @@ static enum resp_states read_reply(struct rxe_qp *qp,
err = rxe_xmit_packet(rxe, qp, _pkt, skb);
if (err) {
pr_err("Failed sending RDMA reply.\n");
-   kfree_skb(skb);
return RESPST_ERR_RNR;
}
  
@@ -954,10 +953,8 @@ static int send_ack(struct rxe_qp *qp, struct rxe_pkt_info *pkt,

}
  
  	err = rxe_xmit_packet(rxe, qp, _pkt, skb);

-   if (err) {
+   if (err)
pr_err_ratelimited("Failed sending ack\n");
-   kfree_skb(skb);
-   }
  
  err1:

return err;
@@ -1141,7 +1138,6 @@ static enum resp_states duplicate_request(struct rxe_qp 
*qp,
if (rc) {
pr_err("Failed resending result. This flow is not 
handled - skb ignored\n");
rxe_drop_ref(qp);
-   kfree_skb(skb_copy);
rc = RESPST_CLEANUP;
goto out;
}




Re: [PATCH 1/1] IB/rxe: avoid double kfree_skb

2018-04-25 Thread Yanjun Zhu

Hi, all

rxe_send [rdma_rxe]
    ip_local_out
        __ip_local_out
        ip_output
            ip_finish_output
                ip_finish_output2
                    dev_queue_xmit
                        __dev_queue_xmit
                            dev_hard_start_xmit
                                e1000_xmit_frame [e1000]

When an skb is sent, it passes through the above functions. I checked all of
them: if an error occurs after ip_local_out, kfree_skb is called by the
lower layers.
So when ip_local_out returns an error, the skb has already been freed, and it
is not necessary to call kfree_skb in the soft RoCE module again.


If I am wrong, please correct me.

Zhu Yanjun
On 2018/4/24 16:34, Yanjun Zhu wrote:

Hi, all

rxe_send
    ip_local_out
        __ip_local_out
            nf_hook_slow

In the above call path, nf_hook_slow drops and frees the skb, and -EPERM
is returned, when the iptables rule (iptables -I OUTPUT -p udp
--dport 4791 -j DROP) is set.


If skb->users is not changed in soft RoCE, kfree_skb should not be
called in this module.


I will investigate the other error handlers after ip_local_out further.

If I am wrong, please correct me.

Any reply is appreciated.

Zhu Yanjun
On 2018/4/20 13:46, Yanjun Zhu wrote:



On 2018/4/20 10:19, Doug Ledford wrote:

On Thu, 2018-04-19 at 10:01 -0400, Zhu Yanjun wrote:
When an skb is dropped by iptables rules, the skb is freed and -EPERM is
returned at the same time. So in soft RoCE, it is not necessary to free the
skb again; otherwise, a crash will occur.

The steps to reproduce:

  server                        client
 ---------                     ---------
 |1.1.1.1| <---rxe channel---> |1.1.1.2|
 ---------                     ---------

On server: rping -s -a 1.1.1.1 -v -C 1 -S 512
On client: rping -c -a 1.1.1.1 -v -C 1 -S 512

The kernel configs CONFIG_DEBUG_KMEMLEAK and
CONFIG_DEBUG_OBJECTS are enabled on both server and client.

When rping runs, run the following command in server:

iptables -I OUTPUT -p udp  --dport 4791 -j DROP

Without this patch, crash will occur.

CC: Srinivas Eeda <srinivas.e...@oracle.com>
CC: Junxiao Bi <junxiao...@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun@oracle.com>
Reviewed-by: Yuval Shaia <yuval.sh...@oracle.com>

I have no reason to doubt your analysis, but if there are a bunch of
error paths for net_xmit and they all return with your skb still being
valid and holding a reference, and then one oddball that returns with
your skb already gone, that just sounds like a mistake waiting to 
happen

(not to mention a bajillion special cases sprinkled everywhere to deal
with this apparent inconsistency).

Can we get a netdev@ confirmation on this being the right solution?

Yes. I agree with you.
After iptables rule "iptables -I OUTPUT -p udp  --dport 4791 -j 
DROP", the skb is freed in this function


/* Returns 1 if okfn() needs to be executed by the caller,
 * -EPERM for NF_DROP, 0 otherwise.  Caller must hold rcu_read_lock. */
int nf_hook_slow(struct sk_buff *skb, struct nf_hook_state *state,
                 const struct nf_hook_entries *e, unsigned int s)
{
        unsigned int verdict;
        int ret;

        for (; s < e->num_hook_entries; s++) {
                verdict = nf_hook_entry_hookfn(&e->hooks[s], skb, state);
                switch (verdict & NF_VERDICT_MASK) {
                case NF_ACCEPT:
                        break;
                case NF_DROP:
                        kfree_skb(skb);   /* <-- here, the skb is freed */
                        ret = NF_DROP_GETERR(verdict);
                        if (ret == 0)
                                ret = -EPERM;
                        return ret;
                case NF_QUEUE:
                        ret = nf_queue(skb, state, e, s, verdict);
                        if (ret == 1)
                                continue;
                        return ret;
                default:
                        /* Implicit handling for NF_STOLEN, as well as any other
                         * non conventional verdicts.
                         */
                        return 0;
                }
        }

        return 1;
}
EXPORT_SYMBOL(nf_hook_slow);

If I am wrong, please correct me.

And my test environment is still available; any solution can be verified
in it.


Zhu Yanjun



---
  drivers/infiniband/sw/rxe/rxe_net.c  | 3 +++
  drivers/infiniband/sw/rxe/rxe_req.c  | 5 +++--
  drivers/infiniband/sw/rxe/rxe_resp.c | 9 ++---
  3 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_net.c 
b/drivers/infiniband/sw/rxe/rxe_net.c

index 9da6e37..2094434 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -511,6 +511,9 @@ int rxe_send(struct rxe_pkt_info *pkt, struct 
sk_buff *skb)

   if (unlikely(net_xmit_eval(err))) {
 pr_debug("error sending packet: %d\n"

Re: [PATCH 1/1] IB/rxe: avoid double kfree_skb

2018-04-24 Thread Yanjun Zhu

Hi, all

rxe_send
    ip_local_out
        __ip_local_out
            nf_hook_slow

In the above call path, nf_hook_slow drops and frees the skb, and -EPERM
is returned, when the iptables rule (iptables -I OUTPUT -p udp --dport 4791
-j DROP) is set.


If skb->users is not changed in soft RoCE, kfree_skb should not be
called in this module.


I will investigate the other error handlers after ip_local_out further.

If I am wrong, please correct me.

Any reply is appreciated.

Zhu Yanjun
On 2018/4/20 13:46, Yanjun Zhu wrote:



On 2018/4/20 10:19, Doug Ledford wrote:

On Thu, 2018-04-19 at 10:01 -0400, Zhu Yanjun wrote:
When an skb is dropped by iptables rules, the skb is freed and -EPERM is
returned at the same time. So in soft RoCE, it is not necessary to free the
skb again; otherwise, a crash will occur.

The steps to reproduce:

  server                        client
 ---------                     ---------
 |1.1.1.1| <---rxe channel---> |1.1.1.2|
 ---------                     ---------

On server: rping -s -a 1.1.1.1 -v -C 1 -S 512
On client: rping -c -a 1.1.1.1 -v -C 1 -S 512

The kernel configs CONFIG_DEBUG_KMEMLEAK and
CONFIG_DEBUG_OBJECTS are enabled on both server and client.

When rping runs, run the following command in server:

iptables -I OUTPUT -p udp  --dport 4791 -j DROP

Without this patch, crash will occur.

CC: Srinivas Eeda <srinivas.e...@oracle.com>
CC: Junxiao Bi <junxiao...@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun@oracle.com>
Reviewed-by: Yuval Shaia <yuval.sh...@oracle.com>

I have no reason to doubt your analysis, but if there are a bunch of
error paths for net_xmit and they all return with your skb still being
valid and holding a reference, and then one oddball that returns with
your skb already gone, that just sounds like a mistake waiting to happen
(not to mention a bajillion special cases sprinkled everywhere to deal
with this apparent inconsistency).

Can we get a netdev@ confirmation on this being the right solution?

Yes. I agree with you.
After iptables rule "iptables -I OUTPUT -p udp  --dport 4791 -j DROP", 
the skb is freed in this function


/* Returns 1 if okfn() needs to be executed by the caller,
 * -EPERM for NF_DROP, 0 otherwise.  Caller must hold rcu_read_lock. */
int nf_hook_slow(struct sk_buff *skb, struct nf_hook_state *state,
                 const struct nf_hook_entries *e, unsigned int s)
{
        unsigned int verdict;
        int ret;

        for (; s < e->num_hook_entries; s++) {
                verdict = nf_hook_entry_hookfn(&e->hooks[s], skb, state);
                switch (verdict & NF_VERDICT_MASK) {
                case NF_ACCEPT:
                        break;
                case NF_DROP:
                        kfree_skb(skb);   /* <-- here, the skb is freed */
                        ret = NF_DROP_GETERR(verdict);
                        if (ret == 0)
                                ret = -EPERM;
                        return ret;
                case NF_QUEUE:
                        ret = nf_queue(skb, state, e, s, verdict);
                        if (ret == 1)
                                continue;
                        return ret;
                default:
                        /* Implicit handling for NF_STOLEN, as well as any other
                         * non conventional verdicts.
                         */
                        return 0;
                }
        }

        return 1;
}
EXPORT_SYMBOL(nf_hook_slow);

If I am wrong, please correct me.

And my test environment is still available; any solution can be verified
in it.


Zhu Yanjun



---
  drivers/infiniband/sw/rxe/rxe_net.c  | 3 +++
  drivers/infiniband/sw/rxe/rxe_req.c  | 5 +++--
  drivers/infiniband/sw/rxe/rxe_resp.c | 9 ++---
  3 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_net.c 
b/drivers/infiniband/sw/rxe/rxe_net.c

index 9da6e37..2094434 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -511,6 +511,9 @@ int rxe_send(struct rxe_pkt_info *pkt, struct 
sk_buff *skb)

   if (unlikely(net_xmit_eval(err))) {
 pr_debug("error sending packet: %d\n", err);
+   /* -EPERM means the skb is dropped and freed. */
+   if (err == -EPERM)
+   return -EPERM;
 return -EAGAIN;
 }
  diff --git a/drivers/infiniband/sw/rxe/rxe_req.c 
b/drivers/infiniband/sw/rxe/rxe_req.c

index 7bdaf71..9d2efec 100644
--- a/drivers/infiniband/sw/rxe/rxe_req.c
+++ b/drivers/infiniband/sw/rxe/rxe_req.c
@@ -727,8 +727,9 @@ int rxe_requester(void *arg)
   rollback_state(wqe, qp, _wqe, 
rollback_psn);

  -   if (ret == -EAGAIN) {
-   kfree_skb(skb);
+   if ((ret == -EAGAIN) || (ret == -EPERM)) {
+   if (ret == -EAGAIN)
+

Re: [PATCH 1/1] IB/rxe: avoid double kfree_skb

2018-04-19 Thread Yanjun Zhu



On 2018/4/20 10:19, Doug Ledford wrote:

On Thu, 2018-04-19 at 10:01 -0400, Zhu Yanjun wrote:

When an skb is dropped by iptables rules, the skb is freed and -EPERM is
returned at the same time. So in soft RoCE, it is not necessary to free the
skb again; otherwise, a crash will occur.

The steps to reproduce:

  server                        client
 ---------                     ---------
 |1.1.1.1| <---rxe channel---> |1.1.1.2|
 ---------                     ---------

On server: rping -s -a 1.1.1.1 -v -C 1 -S 512
On client: rping -c -a 1.1.1.1 -v -C 1 -S 512

The kernel configs CONFIG_DEBUG_KMEMLEAK and
CONFIG_DEBUG_OBJECTS are enabled on both server and client.

When rping runs, run the following command in server:

iptables -I OUTPUT -p udp  --dport 4791 -j DROP

Without this patch, crash will occur.

CC: Srinivas Eeda 
CC: Junxiao Bi 
Signed-off-by: Zhu Yanjun 
Reviewed-by: Yuval Shaia 

I have no reason to doubt your analysis, but if there are a bunch of
error paths for net_xmit and they all return with your skb still being
valid and holding a reference, and then one oddball that returns with
your skb already gone, that just sounds like a mistake waiting to happen
(not to mention a bajillion special cases sprinkled everywhere to deal
with this apparent inconsistency).

Can we get a netdev@ confirmation on this being the right solution?

Yes. I agree with you.
After iptables rule "iptables -I OUTPUT -p udp  --dport 4791 -j DROP", 
the skb is freed in this function


/* Returns 1 if okfn() needs to be executed by the caller,
 * -EPERM for NF_DROP, 0 otherwise.  Caller must hold rcu_read_lock. */
int nf_hook_slow(struct sk_buff *skb, struct nf_hook_state *state,
 const struct nf_hook_entries *e, unsigned int s)
{
    unsigned int verdict;
    int ret;

    for (; s < e->num_hook_entries; s++) {
    verdict = nf_hook_entry_hookfn(>hooks[s], skb, state);
    switch (verdict & NF_VERDICT_MASK) {
    case NF_ACCEPT:
    break;
    case NF_DROP:
kfree_skb(skb);   req.task, 1);
 goto exit;
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c 
b/drivers/infiniband/sw/rxe/rxe_resp.c
index a65c996..6bdf9b2 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -742,7 +742,8 @@ static enum resp_states read_reply(struct rxe_qp *qp,
 err = rxe_xmit_packet(rxe, qp, _pkt, skb);
 if (err) {
 pr_err("Failed sending RDMA reply.\n");
-   kfree_skb(skb);
+   if (err != -EPERM)
+   

Re: [PATCH 1/1] net/mlx4_core: avoid resetting HCA when accessing an offline device

2018-04-17 Thread Yanjun Zhu



On 2018/4/17 23:37, Tariq Toukan wrote:



On 16/04/2018 4:02 AM, Zhu Yanjun wrote:

When a faulty cable is used or an HCA firmware error occurs, the HCA
device will go offline. When the driver accesses this offline device, the
following call trace pops out.

"
...
   [] dump_stack+0x63/0x81
   [] panic+0xcc/0x21b
   [] mlx4_enter_error_state+0xba/0xf0 [mlx4_core]
   [] mlx4_cmd_reset_flow+0x38/0x60 [mlx4_core]
   [] mlx4_cmd_poll+0xc1/0x2e0 [mlx4_core]
   [] __mlx4_cmd+0xb0/0x160 [mlx4_core]
   [] mlx4_SENSE_PORT+0x54/0xd0 [mlx4_core]
   [] mlx4_dev_cap+0x4a4/0xb50 [mlx4_core]
...
"
In the above call trace, the function mlx4_cmd_poll calls the function
mlx4_cmd_post to access the HCA while the HCA is offline. Then mlx4_cmd_post
returns the error -EIO. On -EIO, the function mlx4_cmd_poll calls
mlx4_cmd_reset_flow to reset the HCA, and the above call trace pops out.

This is not reasonable. Since the HCA device is already offline when it is
being accessed, it should not be reset again.

In this patch, since the HCA is offline, the function mlx4_cmd_post returns
the error -EINVAL. On -EINVAL, the function mlx4_cmd_poll returns directly
instead of resetting the HCA.

CC: Srinivas Eeda 
CC: Junxiao Bi 
Suggested-by: Håkon Bugge 
Signed-off-by: Zhu Yanjun 
---
  drivers/net/ethernet/mellanox/mlx4/cmd.c | 8 
  1 file changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c 
b/drivers/net/ethernet/mellanox/mlx4/cmd.c

index 6a9086d..f1c8c42 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
@@ -451,6 +451,8 @@ static int mlx4_cmd_post(struct mlx4_dev *dev, 
u64 in_param, u64 out_param,

   * Device is going through error recovery
   * and cannot accept commands.
   */
+    mlx4_err(dev, "%s : Device is in error recovery.\n", __func__);
+    ret = -EINVAL;
  goto out;
  }
  @@ -657,6 +659,9 @@ static int mlx4_cmd_poll(struct mlx4_dev *dev, 
u64 in_param, u64 *out_param,

  }
    out_reset:
+    if (err == -EINVAL)
+    goto out;
+


See below.


  if (err)
  err = mlx4_cmd_reset_flow(dev, op, op_modifier, err);
  out:
@@ -766,6 +771,9 @@ static int mlx4_cmd_wait(struct mlx4_dev *dev, 
u64 in_param, u64 *out_param,

  *out_param = context->out_param;
    out_reset:
+    if (err == -EINVAL)
+    goto out;
+
  if (err)


Instead, just do here: if (err && err != -EINVAL)


  err = mlx4_cmd_reset_flow(dev, op, op_modifier, err);
  out:



I am not sure this does not mistakenly cover other cases that already 
exist and have (err == -EINVAL).


For example, this line is hard to predict:
err = mlx4_status_to_errno
and later on, we might get into
if (mlx4_closing_cmd_fatal_error(op, stat))
which leads to out_reset.

Thanks a lot.
Sure. I agree with you that "err = mlx4_status_to_errno" and
"if (mlx4_closing_cmd_fatal_error(op, stat))" can also set err = -EINVAL.

That case would mistakenly go to out instead of resetting the HCA device.

I will make a new patch to avoid this error.

Zhu Yanjun


We must have a deeper look at this.
But a better option is to change the error indication so that it uniquely
indicates "already in error recovery"; a sketch of that idea follows.






Re: [PATCH next-queue 2/2] ixgbe: add unlikely notes to tx fastpath expressions

2018-01-18 Thread Yanjun Zhu



On 2018/1/9 6:47, Shannon Nelson wrote:

Add unlikely() to a few error checking expressions in the Tx
offload handling.

Suggested-by: Yanjun Zhu <yanjun@oracle.com>

Hi,

I am fine with this patch. I have a question: is the ipsec feature
supported in ixgbevf?


Thanks a lot.
Zhu Yanjun

Signed-off-by: Shannon Nelson <shannon.nel...@oracle.com>
---
  drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c | 8 
  1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
index 57c10e6..3d069a2 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
@@ -749,28 +749,28 @@ int ixgbe_ipsec_tx(struct ixgbe_ring *tx_ring,
struct xfrm_state *xs;
struct tx_sa *tsa;
  
-	if (!first->skb->sp->len) {

+   if (unlikely(!first->skb->sp->len)) {
netdev_err(tx_ring->netdev, "%s: no xfrm state len = %d\n",
   __func__, first->skb->sp->len);
return 0;
}
  
  	xs = xfrm_input_state(first->skb);

-   if (!xs) {
+   if (unlikely(!xs)) {
netdev_err(tx_ring->netdev, "%s: no xfrm_input_state() xs = 
%p\n",
   __func__, xs);
return 0;
}
  
  	itd->sa_idx = xs->xso.offload_handle - IXGBE_IPSEC_BASE_TX_INDEX;

-   if (itd->sa_idx > IXGBE_IPSEC_MAX_SA_COUNT) {
+   if (unlikely(itd->sa_idx > IXGBE_IPSEC_MAX_SA_COUNT)) {
netdev_err(tx_ring->netdev, "%s: bad sa_idx=%d handle=%lu\n",
   __func__, itd->sa_idx, xs->xso.offload_handle);
return 0;
}
  
  	tsa = >tx_tbl[itd->sa_idx];

-   if (!tsa->used) {
+   if (unlikely(!tsa->used)) {
netdev_err(tx_ring->netdev, "%s: unused sa_idx=%d\n",
   __func__, itd->sa_idx);
return 0;




Re: [PATCH v3 next-queue 08/10] ixgbe: process the Tx ipsec offload

2017-12-22 Thread Yanjun Zhu



On 2017/12/20 8:00, Shannon Nelson wrote:

If the skb has a security association referenced in the skb, then
set up the Tx descriptor with the ipsec offload bits.  While we're
here, we fix an oddly named field in the context descriptor struct.

v3: added ifdef CONFIG_XFRM_OFFLOAD check around call to ixgbe_ipsec_tx()

v2: use ihl != 5
 move the ixgbe_ipsec_tx() call to near the call to ixgbe_tso()
 drop the ipsec packet if the tx offload setup fails
 simplify the ixgbe_ipsec_tx() parameters by using 'first'
 leave out the ixgbe_tso() changes since we don't support TSO
with ipsec yet.

Signed-off-by: Shannon Nelson 
---
  drivers/net/ethernet/intel/ixgbe/ixgbe.h   | 10 +++-
  drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c | 79 ++
  drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c   |  4 +-
  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  | 26 +++--
  drivers/net/ethernet/intel/ixgbe/ixgbe_type.h  |  2 +-
  5 files changed, 112 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index a094b23..3d2b7bf 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -171,10 +171,11 @@ enum ixgbe_tx_flags {
IXGBE_TX_FLAGS_CC   = 0x08,
IXGBE_TX_FLAGS_IPV4 = 0x10,
IXGBE_TX_FLAGS_CSUM = 0x20,
+   IXGBE_TX_FLAGS_IPSEC= 0x40,
  
  	/* software defined flags */

-   IXGBE_TX_FLAGS_SW_VLAN  = 0x40,
-   IXGBE_TX_FLAGS_FCOE = 0x80,
+   IXGBE_TX_FLAGS_SW_VLAN  = 0x80,
+   IXGBE_TX_FLAGS_FCOE = 0x100,
  };
  
  /* VLAN info */

@@ -1014,6 +1015,8 @@ void ixgbe_ipsec_restore(struct ixgbe_adapter *adapter);
  void ixgbe_ipsec_rx(struct ixgbe_ring *rx_ring,
union ixgbe_adv_rx_desc *rx_desc,
struct sk_buff *skb);
+int ixgbe_ipsec_tx(struct ixgbe_ring *tx_ring, struct ixgbe_tx_buffer *first,
+  struct ixgbe_ipsec_tx_data *itd);
  #else
  static inline void ixgbe_init_ipsec_offload(struct ixgbe_adapter *adapter) { 
};
  static inline void ixgbe_stop_ipsec_offload(struct ixgbe_adapter *adapter) { 
};
@@ -1021,5 +1024,8 @@ static inline void ixgbe_ipsec_restore(struct 
ixgbe_adapter *adapter) { };
  static inline void ixgbe_ipsec_rx(struct ixgbe_ring *rx_ring,
  union ixgbe_adv_rx_desc *rx_desc,
  struct sk_buff *skb) { };
+static inline int ixgbe_ipsec_tx(struct ixgbe_ring *tx_ring,
+struct ixgbe_tx_buffer *first,
+struct ixgbe_ipsec_tx_data *itd) { return 0; };
  #endif /* CONFIG_XFRM_OFFLOAD */
  #endif /* _IXGBE_H_ */
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
index a9b8f5c..c2fd2ac 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
@@ -695,12 +695,91 @@ static void ixgbe_ipsec_del_sa(struct xfrm_state *xs)
}
  }
  
+/**

+ * ixgbe_ipsec_offload_ok - can this packet use the xfrm hw offload
+ * @skb: current data packet
+ * @xs: pointer to transformer state struct
+ **/
+static bool ixgbe_ipsec_offload_ok(struct sk_buff *skb, struct xfrm_state *xs)
+{
+   if (xs->props.family == AF_INET) {
+   /* Offload with IPv4 options is not supported yet */
+   if (ip_hdr(skb)->ihl != 5)
+   return false;
+   } else {
+   /* Offload with IPv6 extension headers is not support yet */
+   if (ipv6_ext_hdr(ipv6_hdr(skb)->nexthdr))
+   return false;
+   }
+
+   return true;
+}
+
  static const struct xfrmdev_ops ixgbe_xfrmdev_ops = {
.xdo_dev_state_add = ixgbe_ipsec_add_sa,
.xdo_dev_state_delete = ixgbe_ipsec_del_sa,
+   .xdo_dev_offload_ok = ixgbe_ipsec_offload_ok,
  };
  
  /**

+ * ixgbe_ipsec_tx - setup Tx flags for ipsec offload
+ * @tx_ring: outgoing context
+ * @first: current data packet
+ * @itd: ipsec Tx data for later use in building context descriptor
+ **/
+int ixgbe_ipsec_tx(struct ixgbe_ring *tx_ring,
+  struct ixgbe_tx_buffer *first,
+  struct ixgbe_ipsec_tx_data *itd)
+{
+   struct ixgbe_adapter *adapter = netdev_priv(tx_ring->netdev);
+   struct ixgbe_ipsec *ipsec = adapter->ipsec;
+   struct xfrm_state *xs;
+   struct tx_sa *tsa;
+
+   if (!first->skb->sp->len) {

Hi, Nelson

The function ixgbe_ipsec_tx is called in tx fastpath. Can we add 
unlikely as below:

if (unlikely(!first->skb->sp->len)) ?

If I am wrong, please correct me.

Thanks a lot.
Zhu Yanjun

+   netdev_err(tx_ring->netdev, "%s: no xfrm state len = %d\n",
+  __func__, first->skb->sp->len);
+   return 0;
+   }
+
+   xs = xfrm_input_state(first->skb);
+

Re: [PATCH v3 next-queue 00/10] ixgbe: Add ipsec offload

2017-12-20 Thread Yanjun Zhu



On 2017/12/21 14:39, Yanjun Zhu wrote:



On 2017/12/20 7:59, Shannon Nelson wrote:

This is an implementation of the ipsec hardware offload feature for
the ixgbe driver and Intel's 10Gbe series NICs: x540, x550, 82599.

Hi, Nelson

I notice that the ipsec feature is based on the x540, x550, and 82599, but
this ixgbe driver also works with the 82598.

Does this ipsec feature also work with the 82598?
Sorry, I mean: after these ipsec patches are applied, whether ipsec offload
is enabled or not, can this ixgbe driver still work well with the 82598?

Zhu Yanjun


Thanks a lot.
Zhu Yanjun
These patches apply to net-next v4.14 as well as Jeff Kirsher's 
next-queue

v4.15-rc1-206-ge47375b.

The ixgbe NICs support ipsec offload for 1024 Rx and 1024 Tx Security
Associations (SAs), using up to 128 inbound IP addresses, and using the
rfc4106(gcm(aes)) encryption.  This code does not yet support IPv6,
checksum offload, or TSO in conjunction with the ipsec offload - those
will be added in the future.

This code shows improvements in both packet throughput and CPU 
utilization.

For example, here are some quicky numbers that show the magnitude of the
performance gain on a single run of "iperf -c " with the ipsec
offload on both ends of a point-to-point connection:

9.4 Gbps - normal case
7.6 Gbps - ipsec with offload
343 Mbps - ipsec no offload

To set up a similar test case, you first need to be sure you have a 
recent
version of iproute2 that supports the ipsec offload tag, probably 
something

from ip 4.12 or newer would be best.  I have a shell script that builds
up the appropriate commands for me, but here are the resulting commands
for all tcp traffic between 14.0.0.52 and 14.0.0.70:

For the left side (14.0.0.52):
   ip x p add dir out src 14.0.0.52/24 dst 14.0.0.70/24 proto tcp tmpl \
  proto esp src 14.0.0.52 dst 14.0.0.70 spi 0x07 mode transport 
reqid 0x07

   ip x p add dir in src 14.0.0.70/24 dst 14.0.0.52/24 proto tcp tmpl \
  proto esp dst 14.0.0.52 src 14.0.0.70 spi 0x07 mode transport 
reqid 0x07
   ip x s add proto esp src 14.0.0.52 dst 14.0.0.70 spi 0x07 mode 
transport \

  reqid 0x07 replay-window 32 \
  aead 'rfc4106(gcm(aes))' 
0x44434241343332312423222114131211f4f3f2f1 128 \
  sel src 14.0.0.52/24 dst 14.0.0.70/24 proto tcp offload dev 
eth4 dir out
   ip x s add proto esp dst 14.0.0.52 src 14.0.0.70 spi 0x07 mode 
transport \

  reqid 0x07 replay-window 32 \
  aead 'rfc4106(gcm(aes))' 
0x44434241343332312423222114131211f4f3f2f1 128 \
  sel src 14.0.0.70/24 dst 14.0.0.52/24 proto tcp offload dev 
eth4 dir in

  For the right side (14.0.0.70):
   ip x p add dir out src 14.0.0.70/24 dst 14.0.0.52/24 proto tcp tmpl \
  proto esp src 14.0.0.70 dst 14.0.0.52 spi 0x07 mode transport 
reqid 0x07

   ip x p add dir in src 14.0.0.52/24 dst 14.0.0.70/24 proto tcp tmpl \
  proto esp dst 14.0.0.70 src 14.0.0.52 spi 0x07 mode transport 
reqid 0x07
   ip x s add proto esp src 14.0.0.70 dst 14.0.0.52 spi 0x07 mode 
transport \

  reqid 0x07 replay-window 32 \
  aead 'rfc4106(gcm(aes))' 
0x44434241343332312423222114131211f4f3f2f1 128 \
  sel src 14.0.0.70/24 dst 14.0.0.52/24 proto tcp offload dev 
eth4 dir out
   ip x s add proto esp dst 14.0.0.70 src 14.0.0.52 spi 0x07 mode 
transport \

  reqid 0x07 replay-window 32 \
  aead 'rfc4106(gcm(aes))' 
0x44434241343332312423222114131211f4f3f2f1 128 \
  sel src 14.0.0.52/24 dst 14.0.0.70/24 proto tcp offload dev 
eth4 dir in


In both cases, the command "ip x s flush ; ip x p flush" will clean
it all out and remove the offloads.

Lastly, thanks to Alex Duyck for his early comments.

Please see the individual patches for specific update info.

v3: fixes after comments from those wonderfully pesky kbuild robots
v2: fixes after comments from Alex

Shannon Nelson (10):
   ixgbe: clean up ipsec defines
   ixgbe: add ipsec register access routines
   ixgbe: add ipsec engine start and stop routines
   ixgbe: add ipsec data structures
   ixgbe: add ipsec offload add and remove SA
   ixgbe: restore offloaded SAs after a reset
   ixgbe: process the Rx ipsec offload
   ixgbe: process the Tx ipsec offload
   ixgbe: ipsec offload stats
   ixgbe: register ipsec offload with the xfrm subsystem

  drivers/net/ethernet/intel/ixgbe/Makefile    |   1 +
  drivers/net/ethernet/intel/ixgbe/ixgbe.h |  33 +-
  drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c |   2 +
  drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c   | 923 
+++

  drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.h   |  92 +++
  drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c |   4 +-
  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c    |  39 +-
  drivers/net/ethernet/intel/ixgbe/ixgbe_type.h    |  22 +-
  8 files changed, 1093 insertions(+), 23 deletions(-)
  create mode 100644 drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
  create mode 100644 drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.h








Re: [PATCH v3 next-queue 00/10] ixgbe: Add ipsec offload

2017-12-20 Thread Yanjun Zhu



On 2017/12/20 7:59, Shannon Nelson wrote:

This is an implementation of the ipsec hardware offload feature for
the ixgbe driver and Intel's 10Gbe series NICs: x540, x550, 82599.

Hi, Nelson

I notice that the ipsec feature is based on the x540, x550, and 82599, but
this ixgbe driver also works with the 82598.

Does this ipsec feature also work with 82598?

Thanks a lot.
Zhu Yanjun

These patches apply to net-next v4.14 as well as Jeff Kirsher's next-queue
v4.15-rc1-206-ge47375b.

The ixgbe NICs support ipsec offload for 1024 Rx and 1024 Tx Security
Associations (SAs), using up to 128 inbound IP addresses, and using the
rfc4106(gcm(aes)) encryption.  This code does not yet support IPv6,
checksum offload, or TSO in conjunction with the ipsec offload - those
will be added in the future.

This code shows improvements in both packet throughput and CPU utilization.
For example, here are some quicky numbers that show the magnitude of the
performance gain on a single run of "iperf -c " with the ipsec
offload on both ends of a point-to-point connection:

9.4 Gbps - normal case
7.6 Gbps - ipsec with offload
343 Mbps - ipsec no offload

To set up a similar test case, you first need to be sure you have a recent
version of iproute2 that supports the ipsec offload tag, probably something
from ip 4.12 or newer would be best.  I have a shell script that builds
up the appropriate commands for me, but here are the resulting commands
for all tcp traffic between 14.0.0.52 and 14.0.0.70:

For the left side (14.0.0.52):
   ip x p add dir out src 14.0.0.52/24 dst 14.0.0.70/24 proto tcp tmpl \
  proto esp src 14.0.0.52 dst 14.0.0.70 spi 0x07 mode transport reqid 0x07
   ip x p add dir in src 14.0.0.70/24 dst 14.0.0.52/24 proto tcp tmpl \
  proto esp dst 14.0.0.52 src 14.0.0.70 spi 0x07 mode transport reqid 0x07
   ip x s add proto esp src 14.0.0.52 dst 14.0.0.70 spi 0x07 mode transport \
  reqid 0x07 replay-window 32 \
  aead 'rfc4106(gcm(aes))' 0x44434241343332312423222114131211f4f3f2f1 128 \
  sel src 14.0.0.52/24 dst 14.0.0.70/24 proto tcp offload dev eth4 dir out
   ip x s add proto esp dst 14.0.0.52 src 14.0.0.70 spi 0x07 mode transport \
  reqid 0x07 replay-window 32 \
  aead 'rfc4106(gcm(aes))' 0x44434241343332312423222114131211f4f3f2f1 128 \
  sel src 14.0.0.70/24 dst 14.0.0.52/24 proto tcp offload dev eth4 dir in
  
For the right side (14.0.0.70):

   ip x p add dir out src 14.0.0.70/24 dst 14.0.0.52/24 proto tcp tmpl \
  proto esp src 14.0.0.70 dst 14.0.0.52 spi 0x07 mode transport reqid 0x07
   ip x p add dir in src 14.0.0.52/24 dst 14.0.0.70/24 proto tcp tmpl \
  proto esp dst 14.0.0.70 src 14.0.0.52 spi 0x07 mode transport reqid 0x07
   ip x s add proto esp src 14.0.0.70 dst 14.0.0.52 spi 0x07 mode transport \
  reqid 0x07 replay-window 32 \
  aead 'rfc4106(gcm(aes))' 0x44434241343332312423222114131211f4f3f2f1 128 \
  sel src 14.0.0.70/24 dst 14.0.0.52/24 proto tcp offload dev eth4 dir out
   ip x s add proto esp dst 14.0.0.70 src 14.0.0.52 spi 0x07 mode transport \
  reqid 0x07 replay-window 32 \
  aead 'rfc4106(gcm(aes))' 0x44434241343332312423222114131211f4f3f2f1 128 \
  sel src 14.0.0.52/24 dst 14.0.0.70/24 proto tcp offload dev eth4 dir in

In both cases, the command "ip x s flush ; ip x p flush" will clean
it all out and remove the offloads.

Lastly, thanks to Alex Duyck for his early comments.

Please see the individual patches for specific update info.

v3: fixes after comments from those wonderfully pesky kbuild robots
v2: fixes after comments from Alex

Shannon Nelson (10):
   ixgbe: clean up ipsec defines
   ixgbe: add ipsec register access routines
   ixgbe: add ipsec engine start and stop routines
   ixgbe: add ipsec data structures
   ixgbe: add ipsec offload add and remove SA
   ixgbe: restore offloaded SAs after a reset
   ixgbe: process the Rx ipsec offload
   ixgbe: process the Tx ipsec offload
   ixgbe: ipsec offload stats
   ixgbe: register ipsec offload with the xfrm subsystem

  drivers/net/ethernet/intel/ixgbe/Makefile|   1 +
  drivers/net/ethernet/intel/ixgbe/ixgbe.h |  33 +-
  drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c |   2 +
  drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c   | 923 +++
  drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.h   |  92 +++
  drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c |   4 +-
  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c|  39 +-
  drivers/net/ethernet/intel/ixgbe/ixgbe_type.h|  22 +-
  8 files changed, 1093 insertions(+), 23 deletions(-)
  create mode 100644 drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
  create mode 100644 drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.h





Re: [PATCH net-next 1/1] forcedeth: remove unnecessary variable

2017-12-07 Thread Yanjun Zhu



On 2017/12/8 3:07, David Miller wrote:

From: Zhu Yanjun 
Date: Wed,  6 Dec 2017 23:15:15 -0500


Since both tx_ring and first_tx are the head of the tx ring, it is not
necessary to use two variables. So first_tx is removed.

These are not variables, they are structure members.

Sure. These two structure members are the head of the tx ring, and they are
not changed in the whole driver. So they are duplicates.

I will reword this to "structure members" and send a v2.

Zhu Yanjun






Re: [PATCH 1/1] bnx2x: fix slowpath null crash

2017-11-07 Thread Yanjun Zhu



On 2017/11/8 11:27, Elior, Ariel wrote:

When "NETDEV WATCHDOG: em4 (bnx2x): transmit queue 2 timed out" occurs,
BNX2X_SP_RTNL_TX_TIMEOUT is set. In the function bnx2x_sp_rtnl_task,
bnx2x_nic_unload and bnx2x_nic_load are executed to shutdown and open
NIC. In the function bnx2x_nic_load, bnx2x_alloc_mem allocates dma
failure. The message "bnx2x: [bnx2x_alloc_mem:8399(em4)]Can't
allocate memory" pops out. The variable slowpath is set to NULL.
When shutdown the NIC, the function bnx2x_nic_unload is called. In
the function bnx2x_nic_unload, the following functions are executed.
bnx2x_chip_cleanup
 bnx2x_set_storm_rx_mode
 bnx2x_set_q_rx_mode
 bnx2x_set_q_rx_mode
 bnx2x_config_rx_mode
 bnx2x_set_rx_mode_e2
In the function bnx2x_set_rx_mode_e2, the variable slowpath is operated.
Then the crash occurs.
To fix this crash, the variable slowpath is checked. And in the function
bnx2x_sp_rtnl_task, after dma memory allocation fails, another shutdown
and open NIC is executed.

CC: Joe Jin 
CC: Junxiao Bi 
Signed-off-by: Zhu Yanjun 

Acked-by: Ariel Elior 

Thanks a lot.

Zhu Yanjun


Thanks Zhu - you did a thorough job.
Ariel




Re: [PATCH 1/1] bnx2x: fix slowpath null crash

2017-11-07 Thread Yanjun Zhu

Please ignore this mail.

Zhu Yanjun


On 2017/11/8 9:58, root wrote:

From: Zhu Yanjun 

When "NETDEV WATCHDOG: em4 (bnx2x): transmit queue 2 timed out" occurs,
BNX2X_SP_RTNL_TX_TIMEOUT is set. In the function bnx2x_sp_rtnl_task,
bnx2x_nic_unload and bnx2x_nic_load are executed to shutdown and open
NIC. In the function bnx2x_nic_load, bnx2x_alloc_mem allocates dma
failure. The message "bnx2x: [bnx2x_alloc_mem:8399(em4)]Can't
allocate memory" pops out. The variable slowpath is set to NULL.
When shutdown the NIC, the function bnx2x_nic_unload is called. In
the function bnx2x_nic_unload, the following functions are executed.
bnx2x_chip_cleanup
 bnx2x_set_storm_rx_mode
 bnx2x_set_q_rx_mode
 bnx2x_set_q_rx_mode
 bnx2x_config_rx_mode
 bnx2x_set_rx_mode_e2
In the function bnx2x_set_rx_mode_e2, the variable slowpath is operated.
Then the crash occurs.
To fix this crash, the variable slowpath is checked. And in the function
bnx2x_sp_rtnl_task, after dma memory allocation fails, another shutdown
and open NIC is executed.

CC: Joe Jin 
CC: Junxiao Bi 
Signed-off-by: Zhu Yanjun 
---
  drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 14 +++---
  1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index c12b4d3..5929324 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -9332,7 +9332,7 @@ void bnx2x_chip_cleanup(struct bnx2x *bp, int 
unload_mode, bool keep_link)
/* Schedule the rx_mode command */
if (test_bit(BNX2X_FILTER_RX_MODE_PENDING, >sp_state))
set_bit(BNX2X_FILTER_RX_MODE_SCHED, >sp_state);
-   else
+   else if (!bp->slowpath)
bnx2x_set_storm_rx_mode(bp);
  
  	/* Cleanup multicast configuration */

@@ -10271,8 +10271,16 @@ static void bnx2x_sp_rtnl_task(struct work_struct 
*work)
smp_mb();
  
  		bnx2x_nic_unload(bp, UNLOAD_NORMAL, true);

-   bnx2x_nic_load(bp, LOAD_NORMAL);
-
+   /* When the return value indicates an allocation failure,
+    * the NIC is rebooted again. If the open still fails, an error
+    * message notifies the user.
+    */
+   if (bnx2x_nic_load(bp, LOAD_NORMAL) == -ENOMEM) {
+   bnx2x_nic_unload(bp, UNLOAD_NORMAL, true);
+   if (bnx2x_nic_load(bp, LOAD_NORMAL)) {
+   BNX2X_ERR("Open the NIC fails again!\n");
+   }
+   }
rtnl_unlock();
return;
}




Re: [PATCH 1/1] forcedeth: remove tx_stop variable

2017-09-14 Thread Yanjun Zhu

Hi, all

After this patch is applied, the TCP and UDP tests were run.

The TCP bandwidth is 939 Mbits/sec. The UDP bandwidth is 806 Mbits/sec.

So I think this patch works well.

host1 <-> host2

host1: forcedeth NIC
IP: 1.1.1.107
iperf -s

host2: forcedeth NIC
IP:1.1.1.105
iperf -c 1.1.1.107

The TCP Bandwidth is as below:

Client connecting to 1.1.1.107, TCP port 5001
TCP window size: 85.0 KByte (default)

[  3] local 1.1.1.105 port 46092 connected with 1.1.1.107 port 5001
[ ID] Interval   Transfer Bandwidth
[  3]  0.0-10.0 sec  1.09 GBytes   939 Mbits/sec

The UDP is as below:

iperf -c 1.1.1.107 -u -b 1000m

Client connecting to 1.1.1.107, UDP port 5001
Sending 1470 byte datagrams
UDP buffer size:  208 KByte (default)

[  3] local 1.1.1.105 port 47265 connected with 1.1.1.107 port 5001
[ ID] Interval   Transfer Bandwidth
[  3]  0.0-10.0 sec   964 MBytes   809 Mbits/sec
[  3] Sent 687990 datagrams
[  3] Server Report:
[  3]  0.0-10.0 sec   960 MBytes   806 Mbits/sec   0.019 ms 2942/687989 
(0.43%)

[  3]  0.0-10.0 sec  1 datagrams received out-of-order

Zhu Yanjun

On 2017/9/8 20:28, Zhu Yanjun wrote:

The variable tx_stop is used to indicate the tx queue state: started
or stopped. In fact, the inline function netif_queue_stopped can do
the same work. So replace the variable tx_stop with the
function netif_queue_stopped.

Signed-off-by: Zhu Yanjun 
---
  drivers/net/ethernet/nvidia/forcedeth.c | 13 -
  1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c 
b/drivers/net/ethernet/nvidia/forcedeth.c
index 994a83a..e6e0de4 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -834,7 +834,6 @@ struct fe_priv {
u32 tx_pkts_in_progress;
struct nv_skb_map *tx_change_owner;
struct nv_skb_map *tx_end_flip;
-   int tx_stop;
  
  	/* TX software stats */

struct u64_stats_sync swstats_tx_syncp;
@@ -1939,7 +1938,6 @@ static void nv_init_tx(struct net_device *dev)
np->tx_pkts_in_progress = 0;
np->tx_change_owner = NULL;
np->tx_end_flip = NULL;
-   np->tx_stop = 0;
  
  	for (i = 0; i < np->tx_ring_size; i++) {

if (!nv_optimized(np)) {
@@ -2211,7 +2209,6 @@ static netdev_tx_t nv_start_xmit(struct sk_buff *skb, 
struct net_device *dev)
empty_slots = nv_get_empty_tx_slots(np);
if (unlikely(empty_slots <= entries)) {
netif_stop_queue(dev);
-   np->tx_stop = 1;
spin_unlock_irqrestore(>lock, flags);
return NETDEV_TX_BUSY;
}
@@ -2359,7 +2356,6 @@ static netdev_tx_t nv_start_xmit_optimized(struct sk_buff 
*skb,
empty_slots = nv_get_empty_tx_slots(np);
if (unlikely(empty_slots <= entries)) {
netif_stop_queue(dev);
-   np->tx_stop = 1;
spin_unlock_irqrestore(>lock, flags);
return NETDEV_TX_BUSY;
}
@@ -2583,8 +2579,8 @@ static int nv_tx_done(struct net_device *dev, int limit)
  
  	netdev_completed_queue(np->dev, tx_work, bytes_compl);
  
-	if (unlikely((np->tx_stop == 1) && (np->get_tx.orig != orig_get_tx))) {

-   np->tx_stop = 0;
+   if (unlikely(netif_queue_stopped(dev) &&
+(np->get_tx.orig != orig_get_tx))) {
netif_wake_queue(dev);
}
return tx_work;
@@ -2637,8 +2633,8 @@ static int nv_tx_done_optimized(struct net_device *dev, 
int limit)
  
  	netdev_completed_queue(np->dev, tx_work, bytes_cleaned);
  
-	if (unlikely((np->tx_stop == 1) && (np->get_tx.ex != orig_get_tx))) {

-   np->tx_stop = 0;
+   if (unlikely(netif_queue_stopped(dev) &&
+(np->get_tx.ex != orig_get_tx))) {
netif_wake_queue(dev);
}
return tx_work;
@@ -2724,7 +2720,6 @@ static void nv_tx_timeout(struct net_device *dev)
/* 2) complete any outstanding tx and do not give HW any limited tx 
pkts */
saved_tx_limit = np->tx_limit;
np->tx_limit = 0; /* prevent giving HW any limited pkts */
-   np->tx_stop = 0;  /* prevent waking tx queue */
if (!nv_optimized(np))
nv_tx_done(dev, np->tx_ring_size);
else




Re: [PATCH 1/5] rds: tcp: release the created connection

2017-03-27 Thread Yanjun Zhu



On 2017/3/27 15:37, Sowmini Varadhan wrote:

On (03/27/17 03:06), Zhu Yanjun wrote:

Date: Mon, 27 Mar 2017 03:06:26 -0400
From: Zhu Yanjun 
To: yanjun@oracle.com, santosh.shilim...@oracle.com,
  netdev@vger.kernel.org, linux-r...@vger.kernel.org,
  rds-de...@oss.oracle.com, junxiao...@oracle.com, joe@oracle.com
Subject: [PATCH 1/5] rds: tcp: release the created connection
X-Mailer: git-send-email 2.7.4

When some error occurs, the created connection should be destroyed.

No please dont do this.

This is the case when there are duelling connections. We want
to reset the new (accept sock) and leave the old socket in place.

How did you test this? Did you test it with network namespaces?
Sorry, I only ran a simple test, and it seemed to work well. Could you point
me to a test for this patch?


Thanks a lot.
Zhu Yanjun


--Sowmini



  net/rds/tcp_listen.c | 1 +

diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
index 5076788..58aa5bc 100644
--- a/net/rds/tcp_listen.c
+++ b/net/rds/tcp_listen.c
@@ -196,6 +196,7 @@ int rds_tcp_accept_one(struct socket *sock)
  rst_nsk:
/* reset the newly returned accept sock and bail */
kernel_sock_shutdown(new_sock, SHUT_RDWR);
+   rds_conn_destroy(conn);
ret = 0;
  out:
if (rs_tcp)
--
2.7.4


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html




Re: [PATCHv3 1/4] rds: ib: drop unnecessary rdma_reject

2017-03-13 Thread Yanjun Zhu



On 2017/3/13 14:32, Leon Romanovsky wrote:

On Mon, Mar 13, 2017 at 01:43:45AM -0400, Zhu Yanjun wrote:

When rdma_accept fails, rdma_reject is called in it. As such, it is
not necessary to execute rdma_reject again.

It is not always correct, according to the code, rdma_accept can fail
and will return EINVAL and skip calling to rdma_reject.
Sure. When -EINVAL is returned by the check cma_comp(id_priv,
RDMA_CM_CONNECT), the connection is already lost. As such, it does not
matter whether rdma_reject is sent or not.

Zhu Yanjun


3725 int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)
3726 {
3727 struct rdma_id_private *id_priv;
3728 int ret;
3729
3730 id_priv = container_of(id, struct rdma_id_private, id);
3731
3732 id_priv->owner = task_pid_nr(current);
3733
3734 if (!cma_comp(id_priv, RDMA_CM_CONNECT))
3735 return -EINVAL;
3736



Cc: Joe Jin 
Cc: Junxiao Bi 
Acked-by: Santosh Shilimkar 
Signed-off-by: Zhu Yanjun 
---
  net/rds/ib_cm.c | 5 ++---
  1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index ce3775a..4b9405c 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -677,9 +677,8 @@ int rds_ib_cm_handle_connect(struct rdma_cm_id *cm_id,
event->param.conn.initiator_depth);

/* rdma_accept() calls rdma_reject() internally if it fails */
-   err = rdma_accept(cm_id, _param);
-   if (err)
-   rds_ib_conn_error(conn, "rdma_accept failed (%d)\n", err);
+   if (rdma_accept(cm_id, _param))
+   rds_ib_conn_error(conn, "rdma_accept failed\n");

  out:
if (conn)
--
2.7.4





Re: [PATCHv2 1/4] rds: ib: drop unnecessary rdma_reject

2017-03-12 Thread Yanjun Zhu



On 2017/3/13 3:43, santosh.shilim...@oracle.com wrote:

On 3/12/17 12:33 PM, Leon Romanovsky wrote:

On Sun, Mar 12, 2017 at 04:07:55AM -0400, Zhu Yanjun wrote:

When rdma_accept fails, rdma_reject is called in it. As such, it is
not necessary to execute rdma_reject again.

Cc: Joe Jin 
Cc: Junxiao Bi 
Acked-by: Santosh Shilimkar 
Signed-off-by: Zhu Yanjun 
---
Change from v1 to v2:
  Add the acker.

 net/rds/ib_cm.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index ce3775a..eca3d5f 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -677,8 +677,7 @@ int rds_ib_cm_handle_connect(struct rdma_cm_id 
*cm_id,

 event->param.conn.initiator_depth);

 /* rdma_accept() calls rdma_reject() internally if it fails */
-err = rdma_accept(cm_id, _param);
-if (err)
+if (rdma_accept(cm_id, _param))
 rds_ib_conn_error(conn, "rdma_accept failed (%d)\n", err);


You omitted initialization of "err" variable which you print here ^.


It's initialized by rds_ib_setup_qp(), but you are right. It will print
"failed" with error = 0. :-)

Zhu, please drop that 'err' from the message.

OK. I will do.

Zhu Yanjun



Re: [PATCH 2/5] rds: ib: replace spin_lock_irq with spin_lock_irqsave

2017-03-11 Thread Yanjun Zhu

Sorry, I have no test case that shows an actual issue.
But according to Linux Kernel Development, Second Edition, by Robert Love,
using spin_lock_irq is dangerous, since spin_unlock_irq unconditionally
enables interrupts.


We can assume the following scenario:

---> interrupts are disabled

spin_lock_irq(lock_ptr);    <--- this disables interrupts again
list_del(&ic->ib_node);
spin_unlock_irq(lock_ptr);  <--- this enables interrupts

---> interrupts are now enabled

Our code changes the interrupt state, which is a potential risk.
But spin_lock_irqsave/spin_unlock_irqrestore carries no such risk; a
minimal sketch is below.
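For illustration only (this is not the actual patch, and the helper name is
made up), a minimal sketch of the irqsave pattern:

#include <linux/list.h>
#include <linux/spinlock.h>

/*
 * spin_lock_irqsave() records the current interrupt state in `flags` and
 * spin_unlock_irqrestore() puts back exactly that state, so the caller's
 * interrupt context is preserved whether or not interrupts were already
 * disabled on entry.
 */
static void remove_node_safely(spinlock_t *lock_ptr, struct list_head *node)
{
	unsigned long flags;

	spin_lock_irqsave(lock_ptr, flags);       /* save state, then disable IRQs */
	list_del(node);
	spin_unlock_irqrestore(lock_ptr, flags);  /* restore the saved state */
}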

Zhu Yanjun
On 2017/3/10 0:50, Santosh Shilimkar wrote:

On 3/8/2017 11:26 PM, Zhu Yanjun wrote:

It is difficult to make sure the state of the interrupt when this
function is called. As such, it is safer to use spin_lock_irqsave
than spin_lock_irq.


There is no reason to hold irqs  and as such the code path is
safe from irq context. I don't see need of this change unless
you have test case which showed some issue.

Regards,
Santosh





Re: [PATCH 1/1] ixgbe: add the external ixgbe fiber transceiver status

2017-02-09 Thread Yanjun Zhu



On 2017/2/10 3:08, Tantilov, Emil S wrote:

-Original Message-
From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On
Behalf Of Zhu Yanjun
Sent: Wednesday, February 08, 2017 7:03 PM
To: Kirsher, Jeffrey T ; broo...@kernel.org;
da...@davemloft.net; intel-wired-...@lists.osuosl.org;
netdev@vger.kernel.org
Subject: [PATCH 1/1] ixgbe: add the external ixgbe fiber transceiver status

When the ixgbe fiber transceiver is external, it is necessary to get
the present/absent status of this external ixgbe fiber transceiver.

The transceiver field was deprecated in the old ethtool API and is being
removed in the new. This patch will not apply at all once those changes are 
made:

http://patchwork.ozlabs.org/patch/725081/
Thanks for your kind reply. I will change this patch based on the above 
changes.


Zhu Yanjun


Thanks,
Emil
  







[PATCHv2 1/1] r8169: fix the typo in the comment

2017-01-05 Thread yanjun . zhu
From: Zhu Yanjun 

From the realtek data sheet, the PID0 should be bit 0.

Signed-off-by: Zhu Yanjun 
---
Change from v1 to v2:
 change the commit header. 

 drivers/net/ethernet/realtek/r8169.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c 
b/drivers/net/ethernet/realtek/r8169.c
index 44389c9..8f1623b 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -696,7 +696,7 @@ enum rtl_tx_desc_bit_1 {
 enum rtl_rx_desc_bit {
/* Rx private */
PID1= (1 << 18), /* Protocol ID bit 1/2 */
-   PID0= (1 << 17), /* Protocol ID bit 2/2 */
+   PID0= (1 << 17), /* Protocol ID bit 0/2 */
 
 #define RxProtoUDP (PID1)
 #define RxProtoTCP (PID0)
-- 
2.7.4



Re: [PATCH 1/1] r8169: fix the typo

2016-12-29 Thread Yanjun Zhu

Hi,

Please comment on this patch.

Zhu Yanjun

On 2016/12/29 11:11, Zhu Yanjun wrote:

From the realtek data sheet, the PID0 should be bit 0.

Signed-off-by: Zhu Yanjun 
---
  drivers/net/ethernet/realtek/r8169.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c 
b/drivers/net/ethernet/realtek/r8169.c
index 44389c9..8f1623b 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -696,7 +696,7 @@ enum rtl_tx_desc_bit_1 {
  enum rtl_rx_desc_bit {
/* Rx private */
PID1= (1 << 18), /* Protocol ID bit 1/2 */
-   PID0= (1 << 17), /* Protocol ID bit 2/2 */
+   PID0= (1 << 17), /* Protocol ID bit 0/2 */
  
  #define RxProtoUDP	(PID1)

  #define RxProtoTCP(PID0)