Re: [PATCH 10/10] Replace tasklets with workqueues

2019-10-11 Thread Maksym Planeta

Hi,

This is a kind reminder regarding the patchset. I added a description of 
the races to the original email.


On 30/07/2019 21:20, Maksym Planeta wrote:



On 25/07/2019 20:50, Jason Gunthorpe wrote:

On Thu, Jul 25, 2019 at 04:36:20PM +0200, Maksym Planeta wrote:

Is this one better?

Replace tasklets with workqueues in the rxe driver. The reason for this
replacement is that tasklets are supposed to run atomically, although the
actual code may block.

Modify the SKB destructor for outgoing SKBs to schedule QP tasks only if
the QP is not itself being destroyed.

Add a variable "pending_skb_down" to ensure that reference counting for a QP
is decremented only when QP access related to this skb is over.

Separate out part of the pool element cleanup code so that it can be called
at the very end of cleanup, even if some of the cleanup is scheduled for
asynchronous execution. An example where this happens is the destructor for
a QP.

Disallow calling task functions "directly". This allows the logic inside
rxe_task.c to be simplified.

Schedule rxe_qp_do_cleanup onto the high-priority system workqueue, because
this function can be scheduled from the normal system workqueue.

Before destroying a QP, wait until all references to this QP are gone.
Previously the problem was that outgoing SKBs could be freed after the QP
these SKBs refer to was destroyed.

Add a blocking rxe_run_task to replace __rxe_do_task, which called the task
function directly.


Mostly, but it would also be good to describe the use-after-free and
races more specifically



These situations are described in the cover letter (PATCH 00/10). Do you 
need a more detailed description than that?



Jason





--
Regards,
Maksym Planeta


Re: [PATCH 10/10] Replace tasklets with workqueues

2019-07-30 Thread Maksym Planeta




On 25/07/2019 20:50, Jason Gunthorpe wrote:

On Thu, Jul 25, 2019 at 04:36:20PM +0200, Maksym Planeta wrote:

Is this one better?

Replace tasklets with workqueues in the rxe driver. The reason for this
replacement is that tasklets are supposed to run atomically, although the
actual code may block.

Modify the SKB destructor for outgoing SKBs to schedule QP tasks only if
the QP is not itself being destroyed.

Add a variable "pending_skb_down" to ensure that reference counting for a QP
is decremented only when QP access related to this skb is over.

Separate out part of the pool element cleanup code so that it can be called
at the very end of cleanup, even if some of the cleanup is scheduled for
asynchronous execution. An example where this happens is the destructor for
a QP.

Disallow calling task functions "directly". This allows the logic inside
rxe_task.c to be simplified.

Schedule rxe_qp_do_cleanup onto the high-priority system workqueue, because
this function can be scheduled from the normal system workqueue.

Before destroying a QP, wait until all references to this QP are gone.
Previously the problem was that outgoing SKBs could be freed after the QP
these SKBs refer to was destroyed.

Add a blocking rxe_run_task to replace __rxe_do_task, which called the task
function directly.


Mostly, but it would also be good to describe the use-after-free and
races more specifically



These situations are described in the cover letter (PATCH 00/10). Do you 
need a more detailed description than that?



Jason



--
Regards,
Maksym Planeta


Re: [PATCH 10/10] Replace tasklets with workqueues

2019-07-25 Thread Maksym Planeta

Is this one better?

Replace tasklets with workqueues in the rxe driver. The reason for this
replacement is that tasklets are supposed to run atomically, although the
actual code may block.

Modify the SKB destructor for outgoing SKBs to schedule QP tasks only if
the QP is not itself being destroyed.

Add a variable "pending_skb_down" to ensure that reference counting for a QP
is decremented only when QP access related to this skb is over.

Separate out part of the pool element cleanup code so that it can be called
at the very end of cleanup, even if some of the cleanup is scheduled for
asynchronous execution. An example where this happens is the destructor for
a QP.

Disallow calling task functions "directly". This allows the logic inside
rxe_task.c to be simplified.

Schedule rxe_qp_do_cleanup onto the high-priority system workqueue, because
this function can be scheduled from the normal system workqueue.

Before destroying a QP, wait until all references to this QP are gone.
Previously the problem was that outgoing SKBs could be freed after the QP
these SKBs refer to was destroyed.

Add a blocking rxe_run_task to replace __rxe_do_task, which called the task
function directly.
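
For reference, the reworked skb destructor ends up with roughly the
following shape (abridged from patch 10 below, with the counter shuffling
trimmed; a sketch, not the final code):

static void rxe_skb_tx_dtor(struct sk_buff *skb)
{
	struct rxe_qp *qp = skb->sk->sk_user_data;

	atomic_inc(&qp->pending_skb_down);
	/* drop the reference this skb held; nonzero means final put */
	if (!rxe_drop_ref(&qp->pelem)) {
		/* the QP is still alive, so it is safe to touch it */
		atomic_dec(&qp->pending_skb_down);
		atomic_dec(&qp->skb_out);

		if (unlikely(qp->need_req_skb &&
			     atomic_read(&qp->skb_out) <
			     RXE_INFLIGHT_SKBS_PER_QP_LOW))
			rxe_run_task(&qp->req.task);
	}
}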


On 22/07/2019 17:32, Jason Gunthorpe wrote:

On Mon, Jul 22, 2019 at 05:14:26PM +0200, Maksym Planeta wrote:

Replace tasklets with workqueues in the rxe driver.

Ensure that tasks are called only through a workqueue. This allows the
task logic to be simplified.

Add additional dependencies to make sure that cleanup tasks do not
happen after the object's memory has already been reclaimed.

Improve the overall stability of the driver by removing multiple race
conditions and use-after-free situations.


This should be described more precisely

Jason



--
Regards,
Maksym Planeta


[PATCH 04/10] Protect kref_put with the lock

2019-07-22 Thread Maksym Planeta
Need to ensure that kref_put does not run concurrently with the loop
inside rxe_pool_get_key.

Signed-off-by: Maksym Planeta 
---
 drivers/infiniband/sw/rxe/rxe_pool.c | 18 ++
 drivers/infiniband/sw/rxe/rxe_pool.h |  4 +---
 2 files changed, 19 insertions(+), 3 deletions(-)
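
To illustrate the race this closes (a minimal sketch; the RB-tree walk is
abridged into a hypothetical rb_tree_search() helper, the real loop lives
in rxe_pool_get_key):

void *rxe_pool_get_key(struct rxe_pool *pool, void *key)
{
	struct rxe_pool_entry *elem;
	unsigned long flags;

	read_lock_irqsave(&pool->pool_lock, flags);
	elem = rb_tree_search(pool, key);	/* hypothetical helper */
	if (elem)
		kref_get(&elem->ref_cnt);	/* may resurrect an object whose
						 * final kref_put() is running
						 * concurrently on another CPU */
	read_unlock_irqrestore(&pool->pool_lock, flags);
	return elem;
}

Taking pool_lock around the final kref_put, as below, makes the drop to zero
atomic with respect to this lookup; kref_get_unless_zero() would be the other
common way to close the hole.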

diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
index efa9bab01e02..30a887cf9200 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.c
+++ b/drivers/infiniband/sw/rxe/rxe_pool.c
@@ -536,3 +536,21 @@ void *rxe_pool_get_key(struct rxe_pool *pool, void *key)
	read_unlock_irqrestore(&pool->pool_lock, flags);
return node ? elem : NULL;
 }
+
+static void rxe_dummy_release(struct kref *kref)
+{
+}
+
+void rxe_drop_ref(struct rxe_pool_entry *pelem)
+{
+   int res;
+   struct rxe_pool *pool = pelem->pool;
+   unsigned long flags;
+
+   write_lock_irqsave(&pool->pool_lock, flags);
+   res = kref_put(&pelem->ref_cnt, rxe_dummy_release);
+   write_unlock_irqrestore(&pool->pool_lock, flags);
+   if (res) {
+   rxe_elem_release(&pelem->ref_cnt);
+   }
+}
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.h b/drivers/infiniband/sw/rxe/rxe_pool.h
index 5c6a9429f541..b90cc84c5511 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.h
+++ b/drivers/infiniband/sw/rxe/rxe_pool.h
@@ -166,8 +166,6 @@ static inline void rxe_add_ref(struct rxe_pool_entry *pelem) {
 }
 
 /* drop a reference on an object */
-static inline void rxe_drop_ref(struct rxe_pool_entry *pelem) {
-   kref_put(&pelem->ref_cnt, rxe_elem_release);
-}
+void rxe_drop_ref(struct rxe_pool_entry *pelem);
 
 #endif /* RXE_POOL_H */
-- 
2.20.1



[PATCH 07/10] Pass the return value of kref_put further

2019-07-22 Thread Maksym Planeta
Used in a later patch.

Signed-off-by: Maksym Planeta 
---
 drivers/infiniband/sw/rxe/rxe_pool.c | 3 ++-
 drivers/infiniband/sw/rxe/rxe_pool.h | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
index 30a887cf9200..711d7d7f3692 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.c
+++ b/drivers/infiniband/sw/rxe/rxe_pool.c
@@ -541,7 +541,7 @@ static void rxe_dummy_release(struct kref *kref)
 {
 }
 
-void rxe_drop_ref(struct rxe_pool_entry *pelem)
+int rxe_drop_ref(struct rxe_pool_entry *pelem)
 {
int res;
struct rxe_pool *pool = pelem->pool;
@@ -553,4 +553,5 @@ void rxe_drop_ref(struct rxe_pool_entry *pelem)
if (res) {
	rxe_elem_release(&pelem->ref_cnt);
}
+   return res;
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.h b/drivers/infiniband/sw/rxe/rxe_pool.h
index b90cc84c5511..1e3508c74dc0 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.h
+++ b/drivers/infiniband/sw/rxe/rxe_pool.h
@@ -166,6 +166,6 @@ static inline void rxe_add_ref(struct rxe_pool_entry *pelem) {
 }
 
 /* drop a reference on an object */
-void rxe_drop_ref(struct rxe_pool_entry *pelem);
+int rxe_drop_ref(struct rxe_pool_entry *pelem);
 
 #endif /* RXE_POOL_H */
-- 
2.20.1



[PATCH 01/10] Simplify rxe_run_task interface

2019-07-22 Thread Maksym Planeta
Make rxe_run_task only schedule tasks and never run them directly.

This simplification will be used in further patches.

Signed-off-by: Maksym Planeta 
---
 drivers/infiniband/sw/rxe/rxe_comp.c  | 16 
 drivers/infiniband/sw/rxe/rxe_loc.h   |  2 +-
 drivers/infiniband/sw/rxe/rxe_net.c   |  2 +-
 drivers/infiniband/sw/rxe/rxe_qp.c| 10 +-
 drivers/infiniband/sw/rxe/rxe_req.c   |  6 +++---
 drivers/infiniband/sw/rxe/rxe_resp.c  |  8 +---
 drivers/infiniband/sw/rxe/rxe_task.c  |  7 ++-
 drivers/infiniband/sw/rxe/rxe_task.h  |  6 +++---
 drivers/infiniband/sw/rxe/rxe_verbs.c |  8 
 9 files changed, 28 insertions(+), 37 deletions(-)
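
After this patch the entry point reduces to roughly the following (a sketch
of the intent, still on the tasklet backend that patch 10 later replaces):

void rxe_run_task(struct rxe_task *task)
{
	if (task->destroyed)
		return;

	tasklet_schedule(&task->tasklet);
}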

diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
index 116cafc9afcf..ad09bd9d0a82 100644
--- a/drivers/infiniband/sw/rxe/rxe_comp.c
+++ b/drivers/infiniband/sw/rxe/rxe_comp.c
@@ -142,7 +142,7 @@ void retransmit_timer(struct timer_list *t)
 
if (qp->valid) {
qp->comp.timeout = 1;
-   rxe_run_task(&qp->comp.task, 1);
+   rxe_run_task(&qp->comp.task);
}
 }
 
@@ -156,7 +156,7 @@ void rxe_comp_queue_pkt(struct rxe_qp *qp, struct sk_buff *skb)
if (must_sched != 0)
rxe_counter_inc(SKB_TO_PKT(skb)->rxe, RXE_CNT_COMPLETER_SCHED);
 
-   rxe_run_task(&qp->comp.task, must_sched);
+   rxe_run_task(&qp->comp.task);
 }
 
 static inline enum comp_state get_wqe(struct rxe_qp *qp,
@@ -329,7 +329,7 @@ static inline enum comp_state check_ack(struct rxe_qp *qp,
qp->comp.psn = pkt->psn;
if (qp->req.wait_psn) {
qp->req.wait_psn = 0;
-   rxe_run_task(&qp->req.task, 1);
+   rxe_run_task(&qp->req.task);
}
}
return COMPST_ERROR_RETRY;
@@ -463,7 +463,7 @@ static void do_complete(struct rxe_qp *qp, struct rxe_send_wqe *wqe)
 */
if (qp->req.wait_fence) {
qp->req.wait_fence = 0;
-   rxe_run_task(&qp->req.task, 1);
+   rxe_run_task(&qp->req.task);
}
 }
 
@@ -479,7 +479,7 @@ static inline enum comp_state complete_ack(struct rxe_qp *qp,
if (qp->req.need_rd_atomic) {
qp->comp.timeout_retry = 0;
qp->req.need_rd_atomic = 0;
-   rxe_run_task(&qp->req.task, 1);
+   rxe_run_task(&qp->req.task);
}
}
 
@@ -525,7 +525,7 @@ static inline enum comp_state complete_wqe(struct rxe_qp *qp,
 
if (qp->req.wait_psn) {
qp->req.wait_psn = 0;
-   rxe_run_task(&qp->req.task, 1);
+   rxe_run_task(&qp->req.task);
}
}
 
@@ -644,7 +644,7 @@ int rxe_completer(void *arg)
 
if (qp->req.wait_psn) {
qp->req.wait_psn = 0;
-   rxe_run_task(&qp->req.task, 1);
+   rxe_run_task(&qp->req.task);
}
 
state = COMPST_DONE;
@@ -725,7 +725,7 @@ int rxe_completer(void *arg)
RXE_CNT_COMP_RETRY);
qp->req.need_retry = 1;
qp->comp.started_retry = 1;
-   rxe_run_task(&qp->req.task, 1);
+   rxe_run_task(&qp->req.task);
}
 
if (pkt) {
diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index 775c23becaec..b7159d9d5107 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -277,7 +277,7 @@ static inline int rxe_xmit_packet(struct rxe_qp *qp, struct rxe_pkt_info *pkt,
if ((qp_type(qp) != IB_QPT_RC) &&
(pkt->mask & RXE_END_MASK)) {
pkt->wqe->state = wqe_state_done;
-   rxe_run_task(&qp->comp.task, 1);
+   rxe_run_task(&qp->comp.task);
}
 
rxe_counter_inc(rxe, RXE_CNT_SENT_PKTS);
diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index 5a3474f9351b..e50f19fadcf9 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -414,7 +414,7 @@ static void rxe_skb_tx_dtor(struct sk_buff *skb)
 
if (unlikely(qp->need_req_skb &&
 skb_out < RXE_INFLIGHT_SKBS_PER_QP_LOW))
-   rxe_run_task(&qp->req.task, 1);
+   rxe_run_task(&qp->req.task);

[PATCH 06/10] Remove pd from rxe_ah

2019-07-22 Thread Maksym Planeta
The pointer to the PD is not used in rxe_ah anymore.

Signed-off-by: Maksym Planeta 
---
 drivers/infiniband/sw/rxe/rxe_verbs.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index 5c4b2239129c..8dd65c2a7c72 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -73,7 +73,6 @@ struct rxe_pd {
 struct rxe_ah {
struct ib_ahibah;
struct rxe_pool_entry   pelem;
-   struct rxe_pd   *pd;
struct rxe_av   av;
 };
 
-- 
2.20.1



[PATCH 09/10] Consolidate resetting of QP's tasks into one place

2019-07-22 Thread Maksym Planeta
Used in a later patch.

Signed-off-by: Maksym Planeta 
---
 drivers/infiniband/sw/rxe/rxe_loc.h  |  1 +
 drivers/infiniband/sw/rxe/rxe_qp.c   | 17 +
 drivers/infiniband/sw/rxe/rxe_req.c  |  1 +
 drivers/infiniband/sw/rxe/rxe_resp.c | 25 ++---
 4 files changed, 17 insertions(+), 27 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index b7159d9d5107..e37adde2bced 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -205,6 +205,7 @@ static inline int rcv_wqe_size(int max_sge)
 }
 
 void free_rd_atomic_resource(struct rxe_qp *qp, struct resp_res *res);
+void cleanup_rd_atomic_resources(struct rxe_qp *qp);
 
 static inline void rxe_advance_resp_resource(struct rxe_qp *qp)
 {
diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
index 7cd929185581..2fccdede8869 100644
--- a/drivers/infiniband/sw/rxe/rxe_qp.c
+++ b/drivers/infiniband/sw/rxe/rxe_qp.c
@@ -161,7 +161,7 @@ void free_rd_atomic_resource(struct rxe_qp *qp, struct resp_res *res)
res->type = 0;
 }
 
-static void cleanup_rd_atomic_resources(struct rxe_qp *qp)
+void cleanup_rd_atomic_resources(struct rxe_qp *qp)
 {
int i;
struct resp_res *res;
@@ -526,21 +526,6 @@ static void rxe_qp_reset(struct rxe_qp *qp)
 
/* cleanup attributes */
	atomic_set(&qp->ssn, 0);
-   qp->req.opcode = -1;
-   qp->req.need_retry = 0;
-   qp->req.noack_pkts = 0;
-   qp->resp.msn = 0;
-   qp->resp.opcode = -1;
-   qp->resp.drop_msg = 0;
-   qp->resp.goto_error = 0;
-   qp->resp.sent_psn_nak = 0;
-
-   if (qp->resp.mr) {
-   rxe_drop_ref(&qp->resp.mr->pelem);
-   qp->resp.mr = NULL;
-   }
-
-   cleanup_rd_atomic_resources(qp);
 
/* reenable tasks */
rxe_enable_task(>resp.task);
diff --git a/drivers/infiniband/sw/rxe/rxe_req.c b/drivers/infiniband/sw/rxe/rxe_req.c
index c63e66873757..3bb9afdeaee1 100644
--- a/drivers/infiniband/sw/rxe/rxe_req.c
+++ b/drivers/infiniband/sw/rxe/rxe_req.c
@@ -599,6 +599,7 @@ int rxe_requester(void *arg)
goto exit;
 
if (unlikely(qp->req.state == QP_STATE_RESET)) {
+   qp->req.noack_pkts = 0;
qp->req.wqe_index = consumer_index(qp->sq.queue);
qp->req.opcode = -1;
qp->req.need_rd_atomic = 0;
diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
index 7be8a362d2ef..eaee68d718ce 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -1214,19 +1214,10 @@ int rxe_responder(void *arg)
 
qp->resp.aeth_syndrome = AETH_ACK_UNLIMITED;
 
-   if (!qp->valid) {
-   ret = -EINVAL;
-   goto done;
-   }
-
-   switch (qp->resp.state) {
-   case QP_STATE_RESET:
+   if (!qp->valid || qp->resp.state == QP_STATE_RESET) {
state = RESPST_RESET;
-   break;
-
-   default:
+   } else {
state = RESPST_GET_REQ;
-   break;
}
 
while (1) {
@@ -1374,6 +1365,18 @@ int rxe_responder(void *arg)
case RESPST_RESET:
rxe_drain_req_pkts(qp, false);
qp->resp.wqe = NULL;
+   qp->resp.msn = 0;
+   qp->resp.opcode = -1;
+   qp->resp.drop_msg = 0;
+   qp->resp.goto_error = 0;
+   qp->resp.sent_psn_nak = 0;
+
+   if (qp->resp.mr) {
+   rxe_drop_ref(&qp->resp.mr->pelem);
+   qp->resp.mr = NULL;
+   }
+
+   cleanup_rd_atomic_resources(qp);
goto exit;
 
case RESPST_ERROR:
-- 
2.20.1



[PATCH 00/10] Refactor rxe driver to remove multiple race conditions

2019-07-22 Thread Maksym Planeta
This patchset helps to get rid of the following race conditions:
 
  1. Tasklet functions were incrementing the reference count only after the
  tasklet had already started running.
  2. Getting a pointer to a reference-counted object (kref) was done without
  protecting the kref_put with a lock.
  3. QP cleanup sometimes scheduled part of the cleanup for later execution in
  rxe_qp_do_cleanup, although this QP's memory could be freed immediately after
  returning from rxe_qp_cleanup.
  4. Non-atomic cleanup functions could be called in SoftIRQ context.
  5. Manipulating the reference counter inside a critical section could be done
  both inside and outside of a SoftIRQ region. Such behavior may end up in a
  deadlock.

The easiest way to observe these problems is to compile the kernel with KASAN
and lockdep and abruptly stop an application using SoftRoCE during the
communication phase. On my system this often resulted in a kernel crash or a
deadlock inside the kernel.
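
As an illustration of race 1, the pre-patch tasklet bodies (rxe_completer,
rxe_requester, rxe_responder) all followed this pattern (abridged):

int rxe_completer(void *arg)
{
	struct rxe_qp *qp = arg;

	rxe_add_ref(qp);	/* too late: the QP may already have been
				 * freed between tasklet_schedule() and
				 * this point */
	/* ... */
	rxe_drop_ref(qp);
	return 0;
}

Patch 5 moves the reference into rxe_init_task()/rxe_cleanup_task(), so the
task holds the QP for its whole lifetime.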

To fix the above-mentioned problems, this patchset does the following:

  1. Replaces tasklets with workqueues
  2. Adds locking around kref_put
  3. Acquires references at the appropriate point

As a shortcoming, performance is slightly reduced, because instead of trying
to execute the task function directly, the new version always puts it onto a
queue.

TBH, I'm not sure that I removed all of the problems, but the driver
definitely behaves much more stably now. I would be glad to get some
help with additional testing.

 drivers/infiniband/sw/rxe/rxe_comp.c|  38 ++
 drivers/infiniband/sw/rxe/rxe_cq.c  |  17 ++-
 drivers/infiniband/sw/rxe/rxe_hw_counters.c |   1 -
 drivers/infiniband/sw/rxe/rxe_hw_counters.h |   1 -
 drivers/infiniband/sw/rxe/rxe_loc.h |   3 +-
 drivers/infiniband/sw/rxe/rxe_mcast.c   |  22 ++--
 drivers/infiniband/sw/rxe/rxe_mr.c  |  10 +-
 drivers/infiniband/sw/rxe/rxe_net.c |  21 ++-
 drivers/infiniband/sw/rxe/rxe_pool.c|  40 --
 drivers/infiniband/sw/rxe/rxe_pool.h|  16 ++-
 drivers/infiniband/sw/rxe/rxe_qp.c  | 130 +-
 drivers/infiniband/sw/rxe/rxe_recv.c|   8 +-
 drivers/infiniband/sw/rxe/rxe_req.c |  17 +--
 drivers/infiniband/sw/rxe/rxe_resp.c|  54 
 drivers/infiniband/sw/rxe/rxe_task.c| 139 +++-
 drivers/infiniband/sw/rxe/rxe_task.h|  40 ++
 drivers/infiniband/sw/rxe/rxe_verbs.c   |  81 ++--
 drivers/infiniband/sw/rxe/rxe_verbs.h   |   8 +-
 18 files changed, 302 insertions(+), 344 deletions(-)

-- 
2.20.1



[PATCH 02/10] Remove counter that does not have a meaning anymore

2019-07-22 Thread Maksym Planeta
Now that all tasks are always scheduled instead of being run directly,
this counter no longer has a meaning.

Signed-off-by: Maksym Planeta 
---
 drivers/infiniband/sw/rxe/rxe_comp.c| 6 --
 drivers/infiniband/sw/rxe/rxe_hw_counters.c | 1 -
 drivers/infiniband/sw/rxe/rxe_hw_counters.h | 1 -
 3 files changed, 8 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
index ad09bd9d0a82..5f3a43cfeb63 100644
--- a/drivers/infiniband/sw/rxe/rxe_comp.c
+++ b/drivers/infiniband/sw/rxe/rxe_comp.c
@@ -148,14 +148,8 @@ void retransmit_timer(struct timer_list *t)
 
 void rxe_comp_queue_pkt(struct rxe_qp *qp, struct sk_buff *skb)
 {
-   int must_sched;
-
	skb_queue_tail(&qp->resp_pkts, skb);
 
-   must_sched = skb_queue_len(&qp->resp_pkts) > 1;
-   if (must_sched != 0)
-   rxe_counter_inc(SKB_TO_PKT(skb)->rxe, RXE_CNT_COMPLETER_SCHED);
-
	rxe_run_task(&qp->comp.task);
 }
 
diff --git a/drivers/infiniband/sw/rxe/rxe_hw_counters.c b/drivers/infiniband/sw/rxe/rxe_hw_counters.c
index 636edb5f4cf4..9bc762b0e66a 100644
--- a/drivers/infiniband/sw/rxe/rxe_hw_counters.c
+++ b/drivers/infiniband/sw/rxe/rxe_hw_counters.c
@@ -41,7 +41,6 @@ static const char * const rxe_counter_name[] = {
[RXE_CNT_RCV_RNR] =  "rcvd_rnr_err",
[RXE_CNT_SND_RNR] =  "send_rnr_err",
[RXE_CNT_RCV_SEQ_ERR] =  "rcvd_seq_err",
-   [RXE_CNT_COMPLETER_SCHED] =  "ack_deferred",
[RXE_CNT_RETRY_EXCEEDED]  =  "retry_exceeded_err",
[RXE_CNT_RNR_RETRY_EXCEEDED]  =  "retry_rnr_exceeded_err",
[RXE_CNT_COMP_RETRY]  =  "completer_retry_err",
diff --git a/drivers/infiniband/sw/rxe/rxe_hw_counters.h b/drivers/infiniband/sw/rxe/rxe_hw_counters.h
index 72c0d63c79e0..cfe19159a77f 100644
--- a/drivers/infiniband/sw/rxe/rxe_hw_counters.h
+++ b/drivers/infiniband/sw/rxe/rxe_hw_counters.h
@@ -45,7 +45,6 @@ enum rxe_counters {
RXE_CNT_RCV_RNR,
RXE_CNT_SND_RNR,
RXE_CNT_RCV_SEQ_ERR,
-   RXE_CNT_COMPLETER_SCHED,
RXE_CNT_RETRY_EXCEEDED,
RXE_CNT_RNR_RETRY_EXCEEDED,
RXE_CNT_COMP_RETRY,
-- 
2.20.1



[PATCH 10/10] Replace tasklets with workqueues

2019-07-22 Thread Maksym Planeta
Replace tasklets with workqueues in the rxe driver.

Ensure that tasks are called only through a workqueue. This allows the
task logic to be simplified.

Add additional dependencies to make sure that cleanup tasks do not
happen after the object's memory has already been reclaimed.

Improve the overall stability of the driver by removing multiple race
conditions and use-after-free situations.

Signed-off-by: Maksym Planeta 
---
 drivers/infiniband/sw/rxe/rxe_cq.c|  15 ++-
 drivers/infiniband/sw/rxe/rxe_net.c   |  17 +++-
 drivers/infiniband/sw/rxe/rxe_qp.c|  73 ---
 drivers/infiniband/sw/rxe/rxe_req.c   |   4 +-
 drivers/infiniband/sw/rxe/rxe_task.c  | 126 --
 drivers/infiniband/sw/rxe/rxe_task.h  |  38 ++--
 drivers/infiniband/sw/rxe/rxe_verbs.c |   7 ++
 drivers/infiniband/sw/rxe/rxe_verbs.h |   7 +-
 8 files changed, 124 insertions(+), 163 deletions(-)
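
Note: the archive truncates this patch before the rxe_task.c hunk. The
workqueue-backed task it introduces boils down to roughly the following
(a sketch with illustrative field names; per rxe_task.h the handler is
called until it returns != 0):

struct rxe_task {
	struct work_struct	work;
	int			(*func)(void *arg);
	void			*arg;
	bool			destroyed;
};

static void rxe_do_task(struct work_struct *w)
{
	struct rxe_task *task = container_of(w, struct rxe_task, work);

	/* keep calling the handler until it reports no more work */
	while (task->func(task->arg) == 0)
		;
}

void rxe_run_task(struct rxe_task *task)
{
	if (!task->destroyed)
		queue_work(system_wq, &task->work);	/* never run func directly */
}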

diff --git a/drivers/infiniband/sw/rxe/rxe_cq.c b/drivers/infiniband/sw/rxe/rxe_cq.c
index 5693986c8c1b..6c3cf5ba2911 100644
--- a/drivers/infiniband/sw/rxe/rxe_cq.c
+++ b/drivers/infiniband/sw/rxe/rxe_cq.c
@@ -66,9 +66,9 @@ int rxe_cq_chk_attr(struct rxe_dev *rxe, struct rxe_cq *cq,
return -EINVAL;
 }
 
-static void rxe_send_complete(unsigned long data)
+static void rxe_send_complete(struct work_struct *work)
 {
-   struct rxe_cq *cq = (struct rxe_cq *)data;
+   struct rxe_cq *cq = container_of(work, typeof(*cq), work);
unsigned long flags;
 
	spin_lock_irqsave(&cq->cq_lock, flags);
@@ -106,8 +106,12 @@ int rxe_cq_from_init(struct rxe_dev *rxe, struct rxe_cq *cq, int cqe,
cq->is_user = 1;
 
cq->is_dying = false;
+   INIT_WORK(&cq->work, rxe_send_complete);
 
-   tasklet_init(&cq->comp_task, rxe_send_complete, (unsigned long)cq);
+   cq->wq = alloc_ordered_workqueue("cq:send_complete", 0);
+   if (!cq->wq) {
+   return -ENOMEM;
+   }
 
	spin_lock_init(&cq->cq_lock);
cq->ibcq.cqe = cqe;
@@ -161,7 +165,7 @@ int rxe_cq_post(struct rxe_cq *cq, struct rxe_cqe *cqe, int solicited)
if ((cq->notify == IB_CQ_NEXT_COMP) ||
(cq->notify == IB_CQ_SOLICITED && solicited)) {
cq->notify = 0;
-   tasklet_schedule(&cq->comp_task);
+   queue_work(cq->wq, &cq->work);
}
 
return 0;
@@ -183,5 +187,8 @@ void rxe_cq_cleanup(struct rxe_pool_entry *arg)
if (cq->queue)
rxe_queue_cleanup(cq->queue);
 
+   if (cq->wq)
+   destroy_workqueue(cq->wq);
+
rxe_elem_cleanup(arg);
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index f8c3604bc5ad..2997edc27592 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -410,13 +410,20 @@ static void rxe_skb_tx_dtor(struct sk_buff *skb)
 {
struct sock *sk = skb->sk;
struct rxe_qp *qp = sk->sk_user_data;
-   int skb_out = atomic_dec_return(&qp->skb_out);
+   int skb_out = atomic_read(&qp->skb_out);
 
-   if (unlikely(qp->need_req_skb &&
-skb_out < RXE_INFLIGHT_SKBS_PER_QP_LOW))
-   rxe_run_task(&qp->req.task);
+   atomic_inc(&qp->pending_skb_down);
+   skb_out = atomic_read(&qp->pending_skb_down);
+   if (!rxe_drop_ref(&qp->pelem)) {
+   atomic_dec(&qp->pending_skb_down);
+   atomic_dec(&qp->skb_out);
 
-   rxe_drop_ref(&qp->pelem);
+   if (unlikely(qp->need_req_skb &&
+		skb_out < RXE_INFLIGHT_SKBS_PER_QP_LOW))
+   rxe_run_task(&qp->req.task);
+
+   skb_out = atomic_read(&qp->pending_skb_down);
+   }
 }
 
 int rxe_send(struct rxe_pkt_info *pkt, struct sk_buff *skb)
diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
index 2fccdede8869..2bb25360319e 100644
--- a/drivers/infiniband/sw/rxe/rxe_qp.c
+++ b/drivers/infiniband/sw/rxe/rxe_qp.c
@@ -214,6 +214,7 @@ static void rxe_qp_init_misc(struct rxe_dev *rxe, struct rxe_qp *qp,
 
	atomic_set(&qp->ssn, 0);
	atomic_set(&qp->skb_out, 0);
+   atomic_set(&qp->pending_skb_down, 0);
 }
 
 static int rxe_qp_init_req(struct rxe_dev *rxe, struct rxe_qp *qp,
@@ -332,6 +333,8 @@ static int rxe_qp_init_resp(struct rxe_dev *rxe, struct rxe_qp *qp,
return 0;
 }
 
+void rxe_qp_do_cleanup(struct work_struct *work);
+
 /* called by the create qp verb */
 int rxe_qp_from_init(struct rxe_dev *rxe, struct rxe_qp *qp, struct rxe_pd *pd,
 struct ib_qp_init_attr *init,
@@ -368,6 +371,9 @@ int rxe_qp_from_init(struct rxe_dev *rxe, struct rxe_qp *qp, struct rxe_pd *pd,
qp->attr.qp_state = IB_QPS_RESET;
qp->valid = 1;
 
+   INIT_WORK(&qp->cleanup_work, rxe_qp_do_cleanup);
+   qp->cleanup_completion = NULL;
+
return 0;
 
 err2:

[PATCH 05/10] Fix reference counting for rxe tasklets

2019-07-22 Thread Maksym Planeta
A reference should be acquired *before* the tasklet is run. The best time
to increment the reference counter is at initialisation. Otherwise, the
object may no longer exist by the time the tasklet runs.

Signed-off-by: Maksym Planeta 
---
 drivers/infiniband/sw/rxe/rxe_comp.c | 4 
 drivers/infiniband/sw/rxe/rxe_req.c  | 4 
 drivers/infiniband/sw/rxe/rxe_resp.c | 3 ---
 drivers/infiniband/sw/rxe/rxe_task.c | 8 ++--
 drivers/infiniband/sw/rxe/rxe_task.h | 2 +-
 5 files changed, 7 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
index bdeb288673f3..3a1a33286f87 100644
--- a/drivers/infiniband/sw/rxe/rxe_comp.c
+++ b/drivers/infiniband/sw/rxe/rxe_comp.c
@@ -557,8 +557,6 @@ int rxe_completer(void *arg)
struct rxe_pkt_info *pkt = NULL;
enum comp_state state;
 
-   rxe_add_ref(&qp->pelem);
-
if (!qp->valid || qp->req.state == QP_STATE_ERROR ||
qp->req.state == QP_STATE_RESET) {
rxe_drain_resp_pkts(qp, qp->valid &&
@@ -780,7 +778,6 @@ int rxe_completer(void *arg)
 * exit from the loop calling us
 */
WARN_ON_ONCE(skb);
-   rxe_drop_ref(&qp->pelem);
return -EAGAIN;
 
 done:
@@ -788,6 +785,5 @@ int rxe_completer(void *arg)
 * us again to see if there is anything else to do
 */
WARN_ON_ONCE(skb);
-   rxe_drop_ref(&qp->pelem);
return 0;
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_req.c b/drivers/infiniband/sw/rxe/rxe_req.c
index 0d018adbe512..c63e66873757 100644
--- a/drivers/infiniband/sw/rxe/rxe_req.c
+++ b/drivers/infiniband/sw/rxe/rxe_req.c
@@ -594,8 +594,6 @@ int rxe_requester(void *arg)
struct rxe_send_wqe rollback_wqe;
u32 rollback_psn;
 
-   rxe_add_ref(&qp->pelem);
-
 next_wqe:
if (unlikely(!qp->valid || qp->req.state == QP_STATE_ERROR))
goto exit;
@@ -702,7 +700,6 @@ int rxe_requester(void *arg)
wqe->state = wqe_state_done;
wqe->status = IB_WC_SUCCESS;
	__rxe_do_task(&qp->comp.task);
-   rxe_drop_ref(&qp->pelem);
return 0;
}
payload = mtu;
@@ -753,6 +750,5 @@ int rxe_requester(void *arg)
	__rxe_do_task(&qp->comp.task);
 
 exit:
-   rxe_drop_ref(&qp->pelem);
return -EAGAIN;
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
index f038bae1bd70..7be8a362d2ef 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -1212,8 +1212,6 @@ int rxe_responder(void *arg)
struct rxe_pkt_info *pkt = NULL;
int ret = 0;
 
-   rxe_add_ref(&qp->pelem);
-
qp->resp.aeth_syndrome = AETH_ACK_UNLIMITED;
 
if (!qp->valid) {
@@ -1392,6 +1390,5 @@ int rxe_responder(void *arg)
 exit:
ret = -EAGAIN;
 done:
-   rxe_drop_ref(&qp->pelem);
return ret;
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c
index 7c2b6e4595f5..9d6b8ad08a3a 100644
--- a/drivers/infiniband/sw/rxe/rxe_task.c
+++ b/drivers/infiniband/sw/rxe/rxe_task.c
@@ -35,6 +35,7 @@
 #include <linux/interrupt.h>
 #include <linux/hardirq.h>
 
+#include "rxe.h"
 #include "rxe_task.h"
 
 int __rxe_do_task(struct rxe_task *task)
@@ -115,14 +116,15 @@ void rxe_do_task(unsigned long data)
 }
 
 int rxe_init_task(void *obj, struct rxe_task *task,
- void *arg, int (*func)(void *), char *name)
+ struct rxe_qp *qp, int (*func)(void *), char *name)
 {
task->obj   = obj;
-   task->arg   = arg;
+   task->arg   = qp;
task->func  = func;
snprintf(task->name, sizeof(task->name), "%s", name);
task->destroyed = false;
 
+   rxe_add_ref(&qp->pelem);
	tasklet_init(&task->tasklet, rxe_do_task, (unsigned long)task);
 
task->state = TASK_STATE_START;
@@ -135,6 +137,7 @@ void rxe_cleanup_task(struct rxe_task *task)
 {
unsigned long flags;
bool idle;
+   struct rxe_qp *qp = (struct rxe_qp *)task->arg;
 
/*
 * Mark the task, then wait for it to finish. It might be
@@ -149,6 +152,7 @@ void rxe_cleanup_task(struct rxe_task *task)
} while (!idle);
 
	tasklet_kill(&task->tasklet);
+   rxe_drop_ref(&qp->pelem);
 }
 
 void rxe_run_task(struct rxe_task *task)
diff --git a/drivers/infiniband/sw/rxe/rxe_task.h b/drivers/infiniband/sw/rxe/rxe_task.h
index 5c1fc7d5b953..671b1774b577 100644
--- a/drivers/infiniband/sw/rxe/rxe_task.h
+++ b/drivers/infiniband/sw/rxe/rxe_task.h
@@ -63,7 +63,7 @@ struct rxe_task {
  * fcn  => function to call until it returns != 0
  */
 int rxe_init_task(void *obj, struct rxe_task *task,
- void *arg, int (*func)(void *), char *name);
+ struct rxe_qp *qp, int (*func)(void *), char *name);
 
 /* cleanup task */
 void rxe_cleanup_task(struct rxe_task *task);
-- 
2.20.1



[PATCH 03/10] Make pool interface more type safe

2019-07-22 Thread Maksym Planeta
Replace void * with struct rxe_pool_entry * for some functions.

Change the macros into inline functions.

Signed-off-by: Maksym Planeta 
---
 drivers/infiniband/sw/rxe/rxe_comp.c  | 18 
 drivers/infiniband/sw/rxe/rxe_mcast.c | 20 
 drivers/infiniband/sw/rxe/rxe_mr.c|  8 ++--
 drivers/infiniband/sw/rxe/rxe_net.c   |  6 +--
 drivers/infiniband/sw/rxe/rxe_pool.c  | 12 ++---
 drivers/infiniband/sw/rxe/rxe_pool.h  | 16 ---
 drivers/infiniband/sw/rxe/rxe_qp.c| 32 ++---
 drivers/infiniband/sw/rxe/rxe_recv.c  |  8 ++--
 drivers/infiniband/sw/rxe/rxe_req.c   |  8 ++--
 drivers/infiniband/sw/rxe/rxe_resp.c  | 22 -
 drivers/infiniband/sw/rxe/rxe_verbs.c | 66 +--
 11 files changed, 108 insertions(+), 108 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
index 5f3a43cfeb63..bdeb288673f3 100644
--- a/drivers/infiniband/sw/rxe/rxe_comp.c
+++ b/drivers/infiniband/sw/rxe/rxe_comp.c
@@ -534,7 +534,7 @@ static void rxe_drain_resp_pkts(struct rxe_qp *qp, bool notify)
struct rxe_send_wqe *wqe;
 
	while ((skb = skb_dequeue(&qp->resp_pkts))) {
-   rxe_drop_ref(qp);
+   rxe_drop_ref(&qp->pelem);
kfree_skb(skb);
}
 
@@ -557,7 +557,7 @@ int rxe_completer(void *arg)
struct rxe_pkt_info *pkt = NULL;
enum comp_state state;
 
-   rxe_add_ref(qp);
+   rxe_add_ref(&qp->pelem);
 
if (!qp->valid || qp->req.state == QP_STATE_ERROR ||
qp->req.state == QP_STATE_RESET) {
@@ -646,7 +646,7 @@ int rxe_completer(void *arg)
 
case COMPST_DONE:
if (pkt) {
-   rxe_drop_ref(pkt->qp);
+   rxe_drop_ref(&pkt->qp->pelem);
kfree_skb(skb);
skb = NULL;
}
@@ -694,7 +694,7 @@ int rxe_completer(void *arg)
if (qp->comp.started_retry &&
!qp->comp.timeout_retry) {
if (pkt) {
-   rxe_drop_ref(pkt->qp);
+   rxe_drop_ref(&pkt->qp->pelem);
kfree_skb(skb);
skb = NULL;
}
@@ -723,7 +723,7 @@ int rxe_completer(void *arg)
}
 
if (pkt) {
-   rxe_drop_ref(pkt->qp);
+   rxe_drop_ref(&pkt->qp->pelem);
kfree_skb(skb);
skb = NULL;
}
@@ -748,7 +748,7 @@ int rxe_completer(void *arg)
mod_timer(>rnr_nak_timer,
  jiffies + rnrnak_jiffies(aeth_syn(pkt)
& ~AETH_TYPE_MASK));
-   rxe_drop_ref(pkt->qp);
+   rxe_drop_ref(&pkt->qp->pelem);
kfree_skb(skb);
skb = NULL;
goto exit;
@@ -766,7 +766,7 @@ int rxe_completer(void *arg)
rxe_qp_error(qp);
 
if (pkt) {
-   rxe_drop_ref(pkt->qp);
+   rxe_drop_ref(&pkt->qp->pelem);
kfree_skb(skb);
skb = NULL;
}
@@ -780,7 +780,7 @@ int rxe_completer(void *arg)
 * exit from the loop calling us
 */
WARN_ON_ONCE(skb);
-   rxe_drop_ref(qp);
+   rxe_drop_ref(&qp->pelem);
return -EAGAIN;
 
 done:
@@ -788,6 +788,6 @@ int rxe_completer(void *arg)
 * us again to see if there is anything else to do
 */
WARN_ON_ONCE(skb);
-   rxe_drop_ref(qp);
+   rxe_drop_ref(&qp->pelem);
return 0;
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_mcast.c b/drivers/infiniband/sw/rxe/rxe_mcast.c
index 522a7942c56c..24ebc4ca1b99 100644
--- a/drivers/infiniband/sw/rxe/rxe_mcast.c
+++ b/drivers/infiniband/sw/rxe/rxe_mcast.c
@@ -59,7 +59,7 @@ int rxe_mcast_get_grp(struct rxe_dev *rxe, union ib_gid *mgid,
spin_lock_init(>mcg_lock);
grp->rxe = rxe;
 
-   rxe_add_key(grp, mgid);
+   rxe_add_key(&grp->pelem, mgid);
 
err = rxe_mcast_add(rxe, mgid);
if (err)
@@ -70,7 +70,7 @@ int rxe_mcast_get_grp(struct rxe_dev *rxe, union ib_gid *mgid,
return 0;
 
 err2:
-   rxe_drop_ref(grp);
+   rxe_drop_ref(&grp->pelem);
 err1:
return err;
 }
@@ -103,7 +103,7 @@ int rxe_mcast_add_grp_elem(struct rxe_dev *rxe, struct rxe_qp *qp,
}
 
/* each qp 

[PATCH 08/10] Move responsibility of cleaning up pool elements

2019-07-22 Thread Maksym Planeta
Specific implementations must finish off the cleanup of pool elements at the
right moment. The reason is that a concrete cleanup function may want to
postpone part of the object destruction and schedule it for a later time.

Signed-off-by: Maksym Planeta 
---
 drivers/infiniband/sw/rxe/rxe_cq.c| 2 ++
 drivers/infiniband/sw/rxe/rxe_mcast.c | 2 ++
 drivers/infiniband/sw/rxe/rxe_mr.c| 2 ++
 drivers/infiniband/sw/rxe/rxe_pool.c  | 9 -
 drivers/infiniband/sw/rxe/rxe_pool.h  | 2 ++
 5 files changed, 16 insertions(+), 1 deletion(-)
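
The resulting release path, as a sketch of the call flow after this patch:

	kref_put(&pelem->ref_cnt, rxe_elem_release)
	  -> rxe_elem_release()
	       -> pool->cleanup(pelem)       /* e.g. rxe_cq_cleanup() */
	            ... may defer parts of the teardown ...
	            -> rxe_elem_cleanup()    /* the concrete cleanup calls
	                                      * this once it is really done */
	  (rxe_elem_release() falls back to calling rxe_elem_cleanup()
	   directly when pool->cleanup is NULL)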

diff --git a/drivers/infiniband/sw/rxe/rxe_cq.c b/drivers/infiniband/sw/rxe/rxe_cq.c
index ad3090131126..5693986c8c1b 100644
--- a/drivers/infiniband/sw/rxe/rxe_cq.c
+++ b/drivers/infiniband/sw/rxe/rxe_cq.c
@@ -182,4 +182,6 @@ void rxe_cq_cleanup(struct rxe_pool_entry *arg)
 
if (cq->queue)
rxe_queue_cleanup(cq->queue);
+
+   rxe_elem_cleanup(arg);
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_mcast.c b/drivers/infiniband/sw/rxe/rxe_mcast.c
index 24ebc4ca1b99..6453ae97653f 100644
--- a/drivers/infiniband/sw/rxe/rxe_mcast.c
+++ b/drivers/infiniband/sw/rxe/rxe_mcast.c
@@ -187,4 +187,6 @@ void rxe_mc_cleanup(struct rxe_pool_entry *arg)
 
	rxe_drop_key(&grp->pelem);
	rxe_mcast_delete(rxe, &grp->mgid);
+
+   rxe_elem_cleanup(arg);
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 441b01e30afa..1c0940655df1 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -104,6 +104,8 @@ void rxe_mem_cleanup(struct rxe_pool_entry *arg)
 
kfree(mem->map);
}
+
+   rxe_elem_cleanup(arg);
 }
 
 static int rxe_mem_alloc(struct rxe_mem *mem, int num_buf)
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
index 711d7d7f3692..3b14ab599662 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.c
+++ b/drivers/infiniband/sw/rxe/rxe_pool.c
@@ -462,9 +462,16 @@ void rxe_elem_release(struct kref *kref)
 
if (pool->cleanup)
pool->cleanup(elem);
+   else
+   rxe_elem_cleanup(elem);
+}
+
+void rxe_elem_cleanup(struct rxe_pool_entry *pelem)
+{
+   struct rxe_pool *pool = pelem->pool;
 
if (!(pool->flags & RXE_POOL_NO_ALLOC))
-   kmem_cache_free(pool_cache(pool), elem);
+   kmem_cache_free(pool_cache(pool), pelem);
	atomic_dec(&pool->num_elem);
	ib_device_put(&pool->rxe->ib_dev);
rxe_pool_put(pool);
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.h b/drivers/infiniband/sw/rxe/rxe_pool.h
index 1e3508c74dc0..84cfbc749a5c 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.h
+++ b/drivers/infiniband/sw/rxe/rxe_pool.h
@@ -160,6 +160,8 @@ void *rxe_pool_get_key(struct rxe_pool *pool, void *key);
 /* cleanup an object when all references are dropped */
 void rxe_elem_release(struct kref *kref);
 
+void rxe_elem_cleanup(struct rxe_pool_entry *pelem);
+
 /* take a reference on an object */
 static inline void rxe_add_ref(struct rxe_pool_entry *pelem) {
	kref_get(&pelem->ref_cnt);
-- 
2.20.1



Re: [PATCH 07/10] Pass the return value of kref_put further

2019-07-22 Thread Maksym Planeta
In a later patch I need to know whether the dependency on the QP object 
still exists or not.


On 22/07/2019 17:29, Jason Gunthorpe wrote:

On Mon, Jul 22, 2019 at 05:14:23PM +0200, Maksym Planeta wrote:

Used in a later patch.

Signed-off-by: Maksym Planeta 
  drivers/infiniband/sw/rxe/rxe_pool.c | 3 ++-
  drivers/infiniband/sw/rxe/rxe_pool.h | 2 +-
  2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
index 30a887cf9200..711d7d7f3692 100644
+++ b/drivers/infiniband/sw/rxe/rxe_pool.c
@@ -541,7 +541,7 @@ static void rxe_dummy_release(struct kref *kref)
  {
  }
  
-void rxe_drop_ref(struct rxe_pool_entry *pelem)

+int rxe_drop_ref(struct rxe_pool_entry *pelem)
  {
int res;
struct rxe_pool *pool = pelem->pool;
@@ -553,4 +553,5 @@ void rxe_drop_ref(struct rxe_pool_entry *pelem)
if (res) {
	rxe_elem_release(&pelem->ref_cnt);
}
+   return res;
  }


Using the return value of kref_put at all is super sketchy. Are you
sure this is actually a kref usage pattern?

Why would this be needed?

Jason



--
Regards,
Maksym Planeta


Re: [PATCH 04/10] Protect kref_put with the lock

2019-07-22 Thread Maksym Planeta




On 22/07/2019 17:25, Jason Gunthorpe wrote:

On Mon, Jul 22, 2019 at 05:14:20PM +0200, Maksym Planeta wrote:

Need to ensure that kref_put does not run concurrently with the loop
inside rxe_pool_get_key.

Signed-off-by: Maksym Planeta 
  drivers/infiniband/sw/rxe/rxe_pool.c | 18 ++
  drivers/infiniband/sw/rxe/rxe_pool.h |  4 +---
  2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
index efa9bab01e02..30a887cf9200 100644
+++ b/drivers/infiniband/sw/rxe/rxe_pool.c
@@ -536,3 +536,21 @@ void *rxe_pool_get_key(struct rxe_pool *pool, void *key)
	read_unlock_irqrestore(&pool->pool_lock, flags);
return node ? elem : NULL;
  }
+
+static void rxe_dummy_release(struct kref *kref)
+{
+}
+
+void rxe_drop_ref(struct rxe_pool_entry *pelem)
+{
+   int res;
+   struct rxe_pool *pool = pelem->pool;
+   unsigned long flags;
+
+   write_lock_irqsave(&pool->pool_lock, flags);
+   res = kref_put(&pelem->ref_cnt, rxe_dummy_release);
+   write_unlock_irqrestore(&pool->pool_lock, flags);


This doesn't make sense..

If something is making the kref go to 0 while the node is still in the
RB tree then that is a bug.

You should never need to add locking around a kref_put.



From https://www.kernel.org/doc/Documentation/kref.txt

| The last rule (rule 3) is the nastiest one to handle.  Say, for
| instance, you have a list of items that are each kref-ed, and you wish
| to get the first one.  You can't just pull the first item off the list
| and kref_get() it.  That violates rule 3 because you are not already
| holding a valid pointer.  You must add a mutex (or some other lock).



Jason



--
Regards,
Maksym Planeta


Re: [PATCH] sysctl: Add a feature to drop caches selectively

2014-06-26 Thread Maksym Planeta
> With a binary interface like an ioctl I can see how you could have extra
> unused fields which you can ignore now and let people start adding extra
> options like the range in the future.

Yes, ioctl is another possibility. But I would argue that sysctl is a
more convenient interface, because the idea of sdrop_caches is similar to
that of drop_caches, and it is convenient to have these interfaces in the
same place. But if sdrop_caches uses procfs, there seems to be no easy
way to pass parameters of different types in one write operation.

> Other questions I'd ask would be - how about the access control model?
> Will only root be able to drop caches? Why can't I drop caches for my
> own file?

The access control model is the same as for drop_caches. This means that
only root can write to this file. But it would be easy to add a feature
that allows any user to clean the page cache of inodes that this user owns.
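
For comparison, the kind of extensible binary interface suggested in the
quoted mail below might look like this (purely illustrative; neither this
ioctl number nor the struct exists anywhere):

struct sdrop_caches_arg {
	__u64 start;		/* byte range to drop */
	__u64 len;		/* 0 means to end of file */
	__u64 flags;		/* must be 0 for now */
	__u64 reserved[4];	/* room for future options */
};

#define SDROP_CACHES _IOW('f', 0xbf, struct sdrop_caches_arg)

The kernel can reject nonzero flags/reserved fields today and give them
meaning later without breaking the ABI.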

2014-06-25 15:42 GMT+02:00 Artem Bityutskiy :
> On Wed, 2014-06-25 at 15:23 +0200, Thomas Knauth wrote:
>> On Wed, Jun 25, 2014 at 11:56 AM, Artem Bityutskiy  
>> wrote:
>> > Thanks for the answer, although you forgot to comment on the question
>> > about possibly extending the new interface to work with file ranges in
>> > the future. For example, I have a 2 TiB file, and I am only interested
>> > in dropping caches for the first couple of gigabytes. Would I extend
>> > your interface, or would I come up with another one?
>>
>> Ah, didn't quite understand what was meant with file ranges. Again, we
>> had not considered this so far. I guess you could make a distinction
>> between directories and files here. If the path points to a file, you
>> can have an optional argument indicating the range of bytes you would
>> like to drop. Something like
>>
>> echo "my-file 0-1000,8000-1000" > /proc/sys/vm/sdrop_cache
>>
>> If this is desirable, we can add it to the patch.
>
> With a binary interface like an ioctl I can see how you could have extra
> unused fields which you can ignore now and let people start adding extra
> options like the range in the future.
>
> With this kind of interface I am not sure how to do this.
>
> Other questions I'd ask would be - how about the access control model?
> Will only root be able to drop caches? Why can't I drop caches for my
> own file?
>
> I did not put much thinking into this, but it looks like ioctl could be
> a better interface for the task you are trying to solve...
>
> Sorry if I am a bit vague, I am mostly trying to make you guys give this
> more thoughts, and come up with a deeper analysis. Interfaces are very
> important to get right, or as right as possible...
>
> --
> Best Regards,
> Artem Bityutskiy
>



-- 
Regards,
Maksym Planeta.


[PATCH] sysctl: Add a feature to drop caches selectively

2014-06-24 Thread Maksym Planeta
To clean the page cache one can use /proc/sys/vm/drop_caches. But this
drops the whole page cache. In contrast to that, sdrop_caches adds the
ability to drop the page cache selectively, by path.

Suggested-by: Thomas Knauth 
Signed-off-by: Maksym Planeta 
---
 Documentation/sysctl/vm.txt |  15 ++
 fs/Makefile |   2 +-
 fs/sdrop_caches.c   | 124 
 3 files changed, 140 insertions(+), 1 deletion(-)
 create mode 100644 fs/sdrop_caches.c

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index bd4b34c..faad01d 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -28,6 +28,7 @@ Currently, these files are in /proc/sys/vm:
 - dirty_ratio
 - dirty_writeback_centisecs
 - drop_caches
+- sdrop_caches
 - extfrag_threshold
 - hugepages_treat_as_movable
 - hugetlb_shm_group
@@ -211,6 +212,20 @@ with your system.  To disable them, echo 4 (bit 3) into 
drop_caches.
 
 ==
 
+sdrop_caches
+
+Writing to this will cause the kernel to drop clean caches starting from
+specified path.
+
+To free pagecache of a file:
+   echo /home/user/file > /proc/sys/vm/sdrop_caches
+To free pagecache of a directory and all files in it:
+   echo /home/user/directory > /proc/sys/vm/sdrop_caches
+
+Restrictions are the same as for drop_caches.
+
+==
+
 extfrag_threshold
 
 This parameter affects whether the kernel will compact memory or direct
diff --git a/fs/Makefile b/fs/Makefile
index 4030cbf..366c7b9 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -44,7 +44,7 @@ obj-$(CONFIG_FS_MBCACHE)  += mbcache.o
 obj-$(CONFIG_FS_POSIX_ACL) += posix_acl.o
 obj-$(CONFIG_NFS_COMMON)   += nfs_common/
 obj-$(CONFIG_COREDUMP) += coredump.o
-obj-$(CONFIG_SYSCTL)   += drop_caches.o
+obj-$(CONFIG_SYSCTL)   += drop_caches.o sdrop_caches.o
 
 obj-$(CONFIG_FHANDLE)  += fhandle.o
 
diff --git a/fs/sdrop_caches.c b/fs/sdrop_caches.c
new file mode 100644
index 000..c193655
--- /dev/null
+++ b/fs/sdrop_caches.c
@@ -0,0 +1,124 @@
+/*
+ * Implement the manual selective drop pagecache function
+ */
+
+#include <linux/module.h>
+
+
+#include <linux/kernel.h>
+#include <linux/proc_fs.h>
+#include <linux/string.h>
+#include <linux/vmalloc.h>
+#include <linux/uaccess.h>
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/writeback.h>
+#include <linux/sysctl.h>
+#include <linux/gfp.h>
+#include <linux/limits.h>
+#include <linux/namei.h>
+
+static void clean_mapping(struct dentry *dentry)
+{
+   struct inode *inode = dentry->d_inode;
+
+   if (!inode)
+   return;
+
+   if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+   (inode->i_mapping->nrpages == 0)) {
+   return;
+   }
+
+   invalidate_mapping_pages(inode->i_mapping, 0, -1);
+}
+
+static void clean_all_dentries_locked(struct dentry *dentry)
+{
+   struct dentry *child;
+
+   list_for_each_entry(child, &dentry->d_subdirs, d_u.d_child) {
+   clean_all_dentries_locked(child);
+   }
+
+   clean_mapping(dentry);
+}
+
+static void clean_all_dentries(struct dentry *dentry)
+{
+   spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
+   clean_all_dentries_locked(dentry);
+   spin_unlock(&dentry->d_lock);
+}
+
+static int drop_pagecache(const char * __user filename)
+{
+   unsigned int lookup_flags = LOOKUP_FOLLOW;
+   struct path path;
+   int error;
+
+retry:
+   error = user_path_at(AT_FDCWD, filename, lookup_flags, &path);
+   if (!error) {
+   /* clean */
+   clean_all_dentries(path.dentry);
+   }
+   if (retry_estale(error, lookup_flags)) {
+   lookup_flags |= LOOKUP_REVAL;
+   goto retry;
+   }
+   return error;
+}
+
+static int sdrop_ctl_handler(struct ctl_table *table, int write,
+void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+   char __user *pathname = buffer + *lenp - 1;
+
+   put_user('\0', pathname);
+
+   if (!write)
+   return 0;
+
+   return drop_pagecache(buffer);
+}
+
+static struct ctl_path vm_path[] = { { .procname = "vm", }, { } };
+static struct ctl_table sdrop_ctl_table[] = {
+   {
+   .procname = "sdrop_caches",
+   .mode = 0644,
+   .proc_handler = sdrop_ctl_handler,
+   },
+   { }
+};
+
+static struct ctl_table_header *sdrop_proc_entry;
+
+/* Init function called on module entry */
+int sdrop_init(void)
+{
+   int ret = 0;
+
+   sdrop_proc_entry = register_sysctl_paths(vm_path, sdrop_ctl_table);
+
+   if (sdrop_proc_entry == NULL) {
+   ret = -ENOMEM;
+   pr_err("sdrop_caches: Couldn't create proc entry\n");
+   }
+
+   return ret;
+}
+
+/* Cleanup function called on module exit */
+void sdrop_cleanup(void)
+{
+   unregister_sysctl_table(sdrop_proc_entry);
+}
+
+module_init(sdrop_init);
+module_exit(sdrop_cleanup);
+
+MODULE_LICE
