Re: [PATCH 10/10] Replace tasklets with workqueues
Hi, this is a kind reminder regarding the patchset. I added a description of the races to the original email.

On 30/07/2019 21:20, Maksym Planeta wrote:
> On 25/07/2019 20:50, Jason Gunthorpe wrote:
>> On Thu, Jul 25, 2019 at 04:36:20PM +0200, Maksym Planeta wrote:
>>> Is this one better?
>>>
>>> Replace tasklets with workqueues in the rxe driver. The reason for this
>>> replacement is that tasklets are supposed to run atomically, although
>>> the actual code may block.
>>>
>>> Modify the SKB destructor for outgoing SKBs to schedule QP tasks only
>>> if the QP is not destroyed itself. Add a variable "pending_skb_down"
>>> to ensure that the reference count of a QP is decremented only when
>>> all QP accesses related to this skb are over.
>>>
>>> Separate out part of the pool element cleanup code so that it can be
>>> called at the very end of cleanup, even if some of the cleanup is
>>> scheduled for asynchronous execution. An example where this happened
>>> is the destructor of a QP.
>>>
>>> Disallow calling task functions "directly". This allows the logic
>>> inside rxe_task.c to be simplified.
>>>
>>> Schedule rxe_qp_do_cleanup onto the high-priority system workqueue,
>>> because this function can be scheduled from the normal system
>>> workqueue.
>>>
>>> Before destroying a QP, wait until all references to it are gone.
>>> Previously the problem was that outgoing SKBs could be freed after the
>>> QP these SKBs refer to was destroyed.
>>>
>>> Add a blocking rxe_run_task to replace __rxe_do_task, which called the
>>> task function directly.
>>
>> Mostly, but it would also be good to describe the use-after-free and
>> races more specifically.
>
> These situations are described in the cover letter (PATCH 00/10). Do you
> need a more detailed description than that?
>
>> Jason

--
Regards,
Maksym Planeta
Re: [PATCH 10/10] Replace tasklets with workqueues
Is this one better?

Replace tasklets with workqueues in the rxe driver. The reason for this replacement is that tasklets are supposed to run atomically, although the actual code may block.

Modify the SKB destructor for outgoing SKBs to schedule QP tasks only if the QP is not destroyed itself. Add a variable "pending_skb_down" to ensure that the reference count of a QP is decremented only when all QP accesses related to this skb are over.

Separate out part of the pool element cleanup code so that it can be called at the very end of cleanup, even if some of the cleanup is scheduled for asynchronous execution. An example where this happened is the destructor of a QP.

Disallow calling task functions "directly". This allows the logic inside rxe_task.c to be simplified.

Schedule rxe_qp_do_cleanup onto the high-priority system workqueue, because this function can be scheduled from the normal system workqueue.

Before destroying a QP, wait until all references to it are gone. Previously the problem was that outgoing SKBs could be freed after the QP these SKBs refer to was destroyed.

Add a blocking rxe_run_task to replace __rxe_do_task, which called the task function directly.

On 22/07/2019 17:32, Jason Gunthorpe wrote:
> On Mon, Jul 22, 2019 at 05:14:26PM +0200, Maksym Planeta wrote:
>> Replace tasklets with workqueues in rxe driver. Ensure that task is
>> called only through a workqueue. This allows to simplify task logic.
>>
>> Add additional dependencies to make sure that cleanup tasks do not
>> happen after object's memory is already reclaimed.
>>
>> Improve overall stability of the driver by removing multiple race
>> conditions and use-after-free situations.
>
> This should be described more precisely
>
> Jason

--
Regards,
Maksym Planeta
[PATCH 04/10] Protect kref_put with the lock
Need to ensure that kref_put does not run concurrently with the loop inside rxe_pool_get_key.

Signed-off-by: Maksym Planeta
---
 drivers/infiniband/sw/rxe/rxe_pool.c | 18 ++
 drivers/infiniband/sw/rxe/rxe_pool.h |  4 +---
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
index efa9bab01e02..30a887cf9200 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.c
+++ b/drivers/infiniband/sw/rxe/rxe_pool.c
@@ -536,3 +536,21 @@ void *rxe_pool_get_key(struct rxe_pool *pool, void *key)
 	read_unlock_irqrestore(&pool->pool_lock, flags);
 	return node ? elem : NULL;
 }
+
+static void rxe_dummy_release(struct kref *kref)
+{
+}
+
+void rxe_drop_ref(struct rxe_pool_entry *pelem)
+{
+	int res;
+	struct rxe_pool *pool = pelem->pool;
+	unsigned long flags;
+
+	write_lock_irqsave(&pool->pool_lock, flags);
+	res = kref_put(&pelem->ref_cnt, rxe_dummy_release);
+	write_unlock_irqrestore(&pool->pool_lock, flags);
+	if (res) {
+		rxe_elem_release(&pelem->ref_cnt);
+	}
+}
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.h b/drivers/infiniband/sw/rxe/rxe_pool.h
index 5c6a9429f541..b90cc84c5511 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.h
+++ b/drivers/infiniband/sw/rxe/rxe_pool.h
@@ -166,8 +166,6 @@ static inline void rxe_add_ref(struct rxe_pool_entry *pelem) {
 }
 
 /* drop a reference on an object */
-static inline void rxe_drop_ref(struct rxe_pool_entry *pelem) {
-	kref_put(&pelem->ref_cnt, rxe_elem_release);
-}
+void rxe_drop_ref(struct rxe_pool_entry *pelem);
 
 #endif /* RXE_POOL_H */
-- 
2.20.1
[PATCH 07/10] Pass the return value of kref_put further
Used in a later patch.

Signed-off-by: Maksym Planeta
---
 drivers/infiniband/sw/rxe/rxe_pool.c | 3 ++-
 drivers/infiniband/sw/rxe/rxe_pool.h | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
index 30a887cf9200..711d7d7f3692 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.c
+++ b/drivers/infiniband/sw/rxe/rxe_pool.c
@@ -541,7 +541,7 @@ static void rxe_dummy_release(struct kref *kref)
 {
 }
 
-void rxe_drop_ref(struct rxe_pool_entry *pelem)
+int rxe_drop_ref(struct rxe_pool_entry *pelem)
 {
 	int res;
 	struct rxe_pool *pool = pelem->pool;
@@ -553,4 +553,5 @@ void rxe_drop_ref(struct rxe_pool_entry *pelem)
 	if (res) {
 		rxe_elem_release(&pelem->ref_cnt);
 	}
+	return res;
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.h b/drivers/infiniband/sw/rxe/rxe_pool.h
index b90cc84c5511..1e3508c74dc0 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.h
+++ b/drivers/infiniband/sw/rxe/rxe_pool.h
@@ -166,6 +166,6 @@ static inline void rxe_add_ref(struct rxe_pool_entry *pelem) {
 }
 
 /* drop a reference on an object */
-void rxe_drop_ref(struct rxe_pool_entry *pelem);
+int rxe_drop_ref(struct rxe_pool_entry *pelem);
 
 #endif /* RXE_POOL_H */
-- 
2.20.1
[PATCH 01/10] Simplify rxe_run_task interface
Make rxe_run_task only schedule tasks and never run them directly. This simplification will be used in further patches.

Signed-off-by: Maksym Planeta
---
 drivers/infiniband/sw/rxe/rxe_comp.c  | 16
 drivers/infiniband/sw/rxe/rxe_loc.h   |  2 +-
 drivers/infiniband/sw/rxe/rxe_net.c   |  2 +-
 drivers/infiniband/sw/rxe/rxe_qp.c    | 10 +-
 drivers/infiniband/sw/rxe/rxe_req.c   |  6 +++---
 drivers/infiniband/sw/rxe/rxe_resp.c  |  8 +---
 drivers/infiniband/sw/rxe/rxe_task.c  |  7 ++-
 drivers/infiniband/sw/rxe/rxe_task.h  |  6 +++---
 drivers/infiniband/sw/rxe/rxe_verbs.c |  8
 9 files changed, 28 insertions(+), 37 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
index 116cafc9afcf..ad09bd9d0a82 100644
--- a/drivers/infiniband/sw/rxe/rxe_comp.c
+++ b/drivers/infiniband/sw/rxe/rxe_comp.c
@@ -142,7 +142,7 @@ void retransmit_timer(struct timer_list *t)
 
 	if (qp->valid) {
 		qp->comp.timeout = 1;
-		rxe_run_task(&qp->comp.task, 1);
+		rxe_run_task(&qp->comp.task);
 	}
 }
@@ -156,7 +156,7 @@ void rxe_comp_queue_pkt(struct rxe_qp *qp, struct sk_buff *skb)
 	if (must_sched != 0)
 		rxe_counter_inc(SKB_TO_PKT(skb)->rxe, RXE_CNT_COMPLETER_SCHED);
 
-	rxe_run_task(&qp->comp.task, must_sched);
+	rxe_run_task(&qp->comp.task);
 }
 
 static inline enum comp_state get_wqe(struct rxe_qp *qp,
@@ -329,7 +329,7 @@ static inline enum comp_state check_ack(struct rxe_qp *qp,
 			qp->comp.psn = pkt->psn;
 			if (qp->req.wait_psn) {
 				qp->req.wait_psn = 0;
-				rxe_run_task(&qp->req.task, 1);
+				rxe_run_task(&qp->req.task);
 			}
 		}
 		return COMPST_ERROR_RETRY;
@@ -463,7 +463,7 @@ static void do_complete(struct rxe_qp *qp, struct rxe_send_wqe *wqe)
 	 */
 	if (qp->req.wait_fence) {
 		qp->req.wait_fence = 0;
-		rxe_run_task(&qp->req.task, 1);
+		rxe_run_task(&qp->req.task);
 	}
 }
@@ -479,7 +479,7 @@ static inline enum comp_state complete_ack(struct rxe_qp *qp,
 		if (qp->req.need_rd_atomic) {
 			qp->comp.timeout_retry = 0;
 			qp->req.need_rd_atomic = 0;
-			rxe_run_task(&qp->req.task, 1);
+			rxe_run_task(&qp->req.task);
 		}
 	}
@@ -525,7 +525,7 @@ static inline enum comp_state complete_wqe(struct rxe_qp *qp,
 		if (qp->req.wait_psn) {
 			qp->req.wait_psn = 0;
-			rxe_run_task(&qp->req.task, 1);
+			rxe_run_task(&qp->req.task);
 		}
 	}
@@ -644,7 +644,7 @@ int rxe_completer(void *arg)
 			if (qp->req.wait_psn) {
 				qp->req.wait_psn = 0;
-				rxe_run_task(&qp->req.task, 1);
+				rxe_run_task(&qp->req.task);
 			}
 
 			state = COMPST_DONE;
@@ -725,7 +725,7 @@ int rxe_completer(void *arg)
 					   RXE_CNT_COMP_RETRY);
 				qp->req.need_retry = 1;
 				qp->comp.started_retry = 1;
-				rxe_run_task(&qp->req.task, 1);
+				rxe_run_task(&qp->req.task);
 			}
 
 			if (pkt) {
diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index 775c23becaec..b7159d9d5107 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -277,7 +277,7 @@ static inline int rxe_xmit_packet(struct rxe_qp *qp, struct rxe_pkt_info *pkt,
 	if ((qp_type(qp) != IB_QPT_RC) &&
 	    (pkt->mask & RXE_END_MASK)) {
 		pkt->wqe->state = wqe_state_done;
-		rxe_run_task(&qp->comp.task, 1);
+		rxe_run_task(&qp->comp.task);
 	}
 
 	rxe_counter_inc(rxe, RXE_CNT_SENT_PKTS);
diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index 5a3474f9351b..e50f19fadcf9 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -414,7 +414,7 @@ static void rxe_skb_tx_dtor(struct sk_buff *skb)
 
 	if (unlikely(qp->need_req_skb &&
 		     skb_out < RXE_INFLIGHT_SKBS_PER_QP_LOW))
-		rxe_run_task(&qp->req.task, 1);
+		rxe_run_task(&qp->req.task);
[PATCH 06/10] Remove pd from rxe_ah
Pointer to PD is not used in rxe_ah anymore.

Signed-off-by: Maksym Planeta
---
 drivers/infiniband/sw/rxe/rxe_verbs.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index 5c4b2239129c..8dd65c2a7c72 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -73,7 +73,6 @@ struct rxe_pd {
 struct rxe_ah {
 	struct ib_ah		ibah;
 	struct rxe_pool_entry	pelem;
-	struct rxe_pd		*pd;
 	struct rxe_av		av;
 };
-- 
2.20.1
[PATCH 09/10] Consolidate resetting of QP's tasks into one place
Used in a later patch.

Signed-off-by: Maksym Planeta
---
 drivers/infiniband/sw/rxe/rxe_loc.h  |  1 +
 drivers/infiniband/sw/rxe/rxe_qp.c   | 17 +
 drivers/infiniband/sw/rxe/rxe_req.c  |  1 +
 drivers/infiniband/sw/rxe/rxe_resp.c | 25 ++---
 4 files changed, 17 insertions(+), 27 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index b7159d9d5107..e37adde2bced 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -205,6 +205,7 @@ static inline int rcv_wqe_size(int max_sge)
 }
 
 void free_rd_atomic_resource(struct rxe_qp *qp, struct resp_res *res);
+void cleanup_rd_atomic_resources(struct rxe_qp *qp);
 
 static inline void rxe_advance_resp_resource(struct rxe_qp *qp)
 {
diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
index 7cd929185581..2fccdede8869 100644
--- a/drivers/infiniband/sw/rxe/rxe_qp.c
+++ b/drivers/infiniband/sw/rxe/rxe_qp.c
@@ -161,7 +161,7 @@ void free_rd_atomic_resource(struct rxe_qp *qp, struct resp_res *res)
 	res->type = 0;
 }
 
-static void cleanup_rd_atomic_resources(struct rxe_qp *qp)
+void cleanup_rd_atomic_resources(struct rxe_qp *qp)
 {
 	int i;
 	struct resp_res *res;
@@ -526,21 +526,6 @@ static void rxe_qp_reset(struct rxe_qp *qp)
 
 	/* cleanup attributes */
 	atomic_set(&qp->ssn, 0);
-	qp->req.opcode = -1;
-	qp->req.need_retry = 0;
-	qp->req.noack_pkts = 0;
-	qp->resp.msn = 0;
-	qp->resp.opcode = -1;
-	qp->resp.drop_msg = 0;
-	qp->resp.goto_error = 0;
-	qp->resp.sent_psn_nak = 0;
-
-	if (qp->resp.mr) {
-		rxe_drop_ref(&qp->resp.mr->pelem);
-		qp->resp.mr = NULL;
-	}
-
-	cleanup_rd_atomic_resources(qp);
 
 	/* reenable tasks */
 	rxe_enable_task(&qp->resp.task);
diff --git a/drivers/infiniband/sw/rxe/rxe_req.c b/drivers/infiniband/sw/rxe/rxe_req.c
index c63e66873757..3bb9afdeaee1 100644
--- a/drivers/infiniband/sw/rxe/rxe_req.c
+++ b/drivers/infiniband/sw/rxe/rxe_req.c
@@ -599,6 +599,7 @@ int rxe_requester(void *arg)
 		goto exit;
 
 	if (unlikely(qp->req.state == QP_STATE_RESET)) {
+		qp->req.noack_pkts = 0;
 		qp->req.wqe_index = consumer_index(qp->sq.queue);
 		qp->req.opcode = -1;
 		qp->req.need_rd_atomic = 0;
diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
index 7be8a362d2ef..eaee68d718ce 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -1214,19 +1214,10 @@ int rxe_responder(void *arg)
 
 	qp->resp.aeth_syndrome = AETH_ACK_UNLIMITED;
 
-	if (!qp->valid) {
-		ret = -EINVAL;
-		goto done;
-	}
-
-	switch (qp->resp.state) {
-	case QP_STATE_RESET:
+	if (!qp->valid || qp->resp.state == QP_STATE_RESET) {
 		state = RESPST_RESET;
-		break;
-
-	default:
+	} else {
 		state = RESPST_GET_REQ;
-		break;
 	}
 
 	while (1) {
@@ -1374,6 +1365,18 @@ int rxe_responder(void *arg)
 		case RESPST_RESET:
 			rxe_drain_req_pkts(qp, false);
 			qp->resp.wqe = NULL;
+			qp->resp.msn = 0;
+			qp->resp.opcode = -1;
+			qp->resp.drop_msg = 0;
+			qp->resp.goto_error = 0;
+			qp->resp.sent_psn_nak = 0;
+
+			if (qp->resp.mr) {
+				rxe_drop_ref(&qp->resp.mr->pelem);
+				qp->resp.mr = NULL;
+			}
+
+			cleanup_rd_atomic_resources(qp);
 			goto exit;
 
 		case RESPST_ERROR:
-- 
2.20.1
[PATCH 00/10] Refactor rxe driver to remove multiple race conditions
This patchset helps to get rid of the following race conditions:

1. Tasklet functions incremented the reference count only after the tasklet had already started running.

2. Getting a pointer to a reference-counted object (kref) was done without protecting kref_put with a lock.

3. QP cleanup sometimes scheduled part of the cleanup in rxe_qp_do_cleanup for later execution, although this QP's memory could be freed immediately after returning from rxe_qp_cleanup.

4. Non-atomic cleanup functions could be called in SoftIRQ context.

5. A reference counter inside a critical section could be manipulated both inside and outside of a SoftIRQ region. Such behavior may end up in a deadlock.

The easiest way to observe these problems is to compile the kernel with KASAN and lockdep and abruptly stop an application using SoftRoCE during the communication phase. On my system this often resulted in a kernel crash or a deadlock inside the kernel.

To fix the above-mentioned problems, this patchset does the following things:

1. Replaces tasklets with workqueues.
2. Adds locks around kref_put.
3. Acquires the reference count in an appropriate place.

As a shortcoming, performance is slightly reduced, because instead of trying to execute the task function directly, the new version always puts it onto the queue.

To be honest, I'm not sure that I removed all of the problems, but the driver definitely behaves much more stably now. I would be glad to get some help with additional testing.
 drivers/infiniband/sw/rxe/rxe_comp.c        |  38 ++
 drivers/infiniband/sw/rxe/rxe_cq.c          |  17 ++-
 drivers/infiniband/sw/rxe/rxe_hw_counters.c |   1 -
 drivers/infiniband/sw/rxe/rxe_hw_counters.h |   1 -
 drivers/infiniband/sw/rxe/rxe_loc.h         |   3 +-
 drivers/infiniband/sw/rxe/rxe_mcast.c       |  22 ++--
 drivers/infiniband/sw/rxe/rxe_mr.c          |  10 +-
 drivers/infiniband/sw/rxe/rxe_net.c         |  21 ++-
 drivers/infiniband/sw/rxe/rxe_pool.c        |  40 --
 drivers/infiniband/sw/rxe/rxe_pool.h        |  16 ++-
 drivers/infiniband/sw/rxe/rxe_qp.c          | 130 +-
 drivers/infiniband/sw/rxe/rxe_recv.c        |   8 +-
 drivers/infiniband/sw/rxe/rxe_req.c         |  17 +--
 drivers/infiniband/sw/rxe/rxe_resp.c        |  54
 drivers/infiniband/sw/rxe/rxe_task.c        | 139 +++-
 drivers/infiniband/sw/rxe/rxe_task.h        |  40 ++
 drivers/infiniband/sw/rxe/rxe_verbs.c       |  81 ++--
 drivers/infiniband/sw/rxe/rxe_verbs.h       |   8 +-
 18 files changed, 302 insertions(+), 344 deletions(-)
-- 
2.20.1
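The "lock around kref_put" scheme from point 2 can be sketched with a userspace analog (assumed names; a pthread mutex stands in for pool_lock and a plain counter for the kref — this is an illustration, not driver code). The decrement happens under the lock, so it cannot interleave with a lookup loop, while the actual release runs after the lock is dropped:

```c
#include <pthread.h>
#include <stdbool.h>

/* Userspace analog of a pool element: a plain counter stands in for
 * the kref; "released" records that the release function ran. */
struct elem {
	int refcount;
	bool released;
};

/* Stands in for pool->pool_lock. */
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

static void elem_release(struct elem *e)
{
	e->released = true;	/* real code would free the object here */
}

/* Analog of the patched rxe_drop_ref(): decrement under the lock so a
 * concurrent key lookup never observes a half-dead element, but run
 * the release outside the lock. Returns nonzero when the last
 * reference was dropped. */
static int elem_drop_ref(struct elem *e)
{
	int last;

	pthread_mutex_lock(&pool_lock);
	last = (--e->refcount == 0);	/* kref_put with a dummy release */
	pthread_mutex_unlock(&pool_lock);

	if (last)
		elem_release(e);
	return last;
}
```

A lookup path would take the same pool_lock around its search-and-get, so it either finds the element before the final decrement or not at all.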
[PATCH 02/10] Remove counter that does not have a meaning anymore
Now that all tasks are always scheduled, this counter does not have a meaning anymore.

Signed-off-by: Maksym Planeta
---
 drivers/infiniband/sw/rxe/rxe_comp.c        | 6 --
 drivers/infiniband/sw/rxe/rxe_hw_counters.c | 1 -
 drivers/infiniband/sw/rxe/rxe_hw_counters.h | 1 -
 3 files changed, 8 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
index ad09bd9d0a82..5f3a43cfeb63 100644
--- a/drivers/infiniband/sw/rxe/rxe_comp.c
+++ b/drivers/infiniband/sw/rxe/rxe_comp.c
@@ -148,14 +148,8 @@ void retransmit_timer(struct timer_list *t)
 
 void rxe_comp_queue_pkt(struct rxe_qp *qp, struct sk_buff *skb)
 {
-	int must_sched;
-
 	skb_queue_tail(&qp->resp_pkts, skb);
 
-	must_sched = skb_queue_len(&qp->resp_pkts) > 1;
-	if (must_sched != 0)
-		rxe_counter_inc(SKB_TO_PKT(skb)->rxe, RXE_CNT_COMPLETER_SCHED);
-
 	rxe_run_task(&qp->comp.task);
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_hw_counters.c b/drivers/infiniband/sw/rxe/rxe_hw_counters.c
index 636edb5f4cf4..9bc762b0e66a 100644
--- a/drivers/infiniband/sw/rxe/rxe_hw_counters.c
+++ b/drivers/infiniband/sw/rxe/rxe_hw_counters.c
@@ -41,7 +41,6 @@ static const char * const rxe_counter_name[] = {
 	[RXE_CNT_RCV_RNR]            = "rcvd_rnr_err",
 	[RXE_CNT_SND_RNR]            = "send_rnr_err",
 	[RXE_CNT_RCV_SEQ_ERR]        = "rcvd_seq_err",
-	[RXE_CNT_COMPLETER_SCHED]    = "ack_deferred",
 	[RXE_CNT_RETRY_EXCEEDED]     = "retry_exceeded_err",
 	[RXE_CNT_RNR_RETRY_EXCEEDED] = "retry_rnr_exceeded_err",
 	[RXE_CNT_COMP_RETRY]         = "completer_retry_err",
diff --git a/drivers/infiniband/sw/rxe/rxe_hw_counters.h b/drivers/infiniband/sw/rxe/rxe_hw_counters.h
index 72c0d63c79e0..cfe19159a77f 100644
--- a/drivers/infiniband/sw/rxe/rxe_hw_counters.h
+++ b/drivers/infiniband/sw/rxe/rxe_hw_counters.h
@@ -45,7 +45,6 @@ enum rxe_counters {
 	RXE_CNT_RCV_RNR,
 	RXE_CNT_SND_RNR,
 	RXE_CNT_RCV_SEQ_ERR,
-	RXE_CNT_COMPLETER_SCHED,
 	RXE_CNT_RETRY_EXCEEDED,
 	RXE_CNT_RNR_RETRY_EXCEEDED,
 	RXE_CNT_COMP_RETRY,
-- 
2.20.1
[PATCH 10/10] Replace tasklets with workqueues
Replace tasklets with workqueues in the rxe driver. Ensure that a task is called only through a workqueue. This allows the task logic to be simplified.

Add additional dependencies to make sure that cleanup tasks do not happen after the object's memory is already reclaimed.

Improve the overall stability of the driver by removing multiple race conditions and use-after-free situations.

Signed-off-by: Maksym Planeta
---
 drivers/infiniband/sw/rxe/rxe_cq.c    |  15 ++-
 drivers/infiniband/sw/rxe/rxe_net.c   |  17 +++-
 drivers/infiniband/sw/rxe/rxe_qp.c    |  73 ---
 drivers/infiniband/sw/rxe/rxe_req.c   |   4 +-
 drivers/infiniband/sw/rxe/rxe_task.c  | 126 --
 drivers/infiniband/sw/rxe/rxe_task.h  |  38 ++--
 drivers/infiniband/sw/rxe/rxe_verbs.c |   7 ++
 drivers/infiniband/sw/rxe/rxe_verbs.h |   7 +-
 8 files changed, 124 insertions(+), 163 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_cq.c b/drivers/infiniband/sw/rxe/rxe_cq.c
index 5693986c8c1b..6c3cf5ba2911 100644
--- a/drivers/infiniband/sw/rxe/rxe_cq.c
+++ b/drivers/infiniband/sw/rxe/rxe_cq.c
@@ -66,9 +66,9 @@ int rxe_cq_chk_attr(struct rxe_dev *rxe, struct rxe_cq *cq,
 	return -EINVAL;
 }
 
-static void rxe_send_complete(unsigned long data)
+static void rxe_send_complete(struct work_struct *work)
 {
-	struct rxe_cq *cq = (struct rxe_cq *)data;
+	struct rxe_cq *cq = container_of(work, typeof(*cq), work);
 	unsigned long flags;
 
 	spin_lock_irqsave(&cq->cq_lock, flags);
@@ -106,8 +106,12 @@ int rxe_cq_from_init(struct rxe_dev *rxe, struct rxe_cq *cq, int cqe,
 		cq->is_user = 1;
 
 	cq->is_dying = false;
+	INIT_WORK(&cq->work, rxe_send_complete);
 
-	tasklet_init(&cq->comp_task, rxe_send_complete, (unsigned long)cq);
+	cq->wq = alloc_ordered_workqueue("cq:send_complete", 0);
+	if (!cq->wq) {
+		return -ENOMEM;
+	}
 
 	spin_lock_init(&cq->cq_lock);
 	cq->ibcq.cqe = cqe;
@@ -161,7 +165,7 @@ int rxe_cq_post(struct rxe_cq *cq, struct rxe_cqe *cqe, int solicited)
 	if ((cq->notify == IB_CQ_NEXT_COMP) ||
 	    (cq->notify == IB_CQ_SOLICITED && solicited)) {
 		cq->notify = 0;
-		tasklet_schedule(&cq->comp_task);
+		queue_work(cq->wq, &cq->work);
 	}
 
 	return 0;
@@ -183,5 +187,8 @@ void rxe_cq_cleanup(struct rxe_pool_entry *arg)
 	if (cq->queue)
 		rxe_queue_cleanup(cq->queue);
 
+	if (cq->wq)
+		destroy_workqueue(cq->wq);
+
 	rxe_elem_cleanup(arg);
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index f8c3604bc5ad..2997edc27592 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -410,13 +410,20 @@ static void rxe_skb_tx_dtor(struct sk_buff *skb)
 {
 	struct sock *sk = skb->sk;
 	struct rxe_qp *qp = sk->sk_user_data;
-	int skb_out = atomic_dec_return(&qp->skb_out);
+	int skb_out = atomic_read(&qp->skb_out);
 
-	if (unlikely(qp->need_req_skb &&
-		     skb_out < RXE_INFLIGHT_SKBS_PER_QP_LOW))
-		rxe_run_task(&qp->req.task);
+	atomic_inc(&qp->pending_skb_down);
+	skb_out = atomic_read(&qp->pending_skb_down);
+	if (!rxe_drop_ref(&qp->pelem)) {
+		atomic_dec(&qp->pending_skb_down);
+		atomic_dec(&qp->skb_out);
 
-	rxe_drop_ref(&qp->pelem);
+		if (unlikely(qp->need_req_skb &&
+			     skb_out < RXE_INFLIGHT_SKBS_PER_QP_LOW))
+			rxe_run_task(&qp->req.task);
+
+		skb_out = atomic_read(&qp->pending_skb_down);
+	}
 }
 
 int rxe_send(struct rxe_pkt_info *pkt, struct sk_buff *skb)
diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
index 2fccdede8869..2bb25360319e 100644
--- a/drivers/infiniband/sw/rxe/rxe_qp.c
+++ b/drivers/infiniband/sw/rxe/rxe_qp.c
@@ -214,6 +214,7 @@ static void rxe_qp_init_misc(struct rxe_dev *rxe, struct rxe_qp *qp,
 
 	atomic_set(&qp->ssn, 0);
 	atomic_set(&qp->skb_out, 0);
+	atomic_set(&qp->pending_skb_down, 0);
 }
 
 static int rxe_qp_init_req(struct rxe_dev *rxe, struct rxe_qp *qp,
@@ -332,6 +333,8 @@ static int rxe_qp_init_resp(struct rxe_dev *rxe, struct rxe_qp *qp,
 	return 0;
 }
 
+void rxe_qp_do_cleanup(struct work_struct *work);
+
 /* called by the create qp verb */
 int rxe_qp_from_init(struct rxe_dev *rxe, struct rxe_qp *qp, struct rxe_pd *pd,
 		     struct ib_qp_init_attr *init,
@@ -368,6 +371,9 @@ int rxe_qp_from_init(struct rxe_dev *rxe, struct rxe_qp *qp, struct rxe_pd *pd,
 	qp->attr.qp_state = IB_QPS_RESET;
 	qp->valid = 1;
 
+	INIT_WORK(&qp->cleanup_work, rxe_qp_do_cleanup);
+	qp->cleanup_completion = NULL;
+
 	return 0;
 
 err2:
[PATCH 05/10] Fix reference counting for rxe tasklets
The reference should be acquired *before* the tasklet is run. The best time to increment the reference counter is at initialisation. Otherwise, the object may not exist anymore by the time the tasklet is run.

Signed-off-by: Maksym Planeta
---
 drivers/infiniband/sw/rxe/rxe_comp.c | 4
 drivers/infiniband/sw/rxe/rxe_req.c  | 4
 drivers/infiniband/sw/rxe/rxe_resp.c | 3 ---
 drivers/infiniband/sw/rxe/rxe_task.c | 8 ++--
 drivers/infiniband/sw/rxe/rxe_task.h | 2 +-
 5 files changed, 7 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
index bdeb288673f3..3a1a33286f87 100644
--- a/drivers/infiniband/sw/rxe/rxe_comp.c
+++ b/drivers/infiniband/sw/rxe/rxe_comp.c
@@ -557,8 +557,6 @@ int rxe_completer(void *arg)
 	struct rxe_pkt_info *pkt = NULL;
 	enum comp_state state;
 
-	rxe_add_ref(&qp->pelem);
-
 	if (!qp->valid || qp->req.state == QP_STATE_ERROR ||
 	    qp->req.state == QP_STATE_RESET) {
 		rxe_drain_resp_pkts(qp, qp->valid &&
@@ -780,7 +778,6 @@ int rxe_completer(void *arg)
 	 * exit from the loop calling us
 	 */
 	WARN_ON_ONCE(skb);
-	rxe_drop_ref(&qp->pelem);
 	return -EAGAIN;
 
 done:
@@ -788,6 +785,5 @@ int rxe_completer(void *arg)
 	 * us again to see if there is anything else to do
 	 */
 	WARN_ON_ONCE(skb);
-	rxe_drop_ref(&qp->pelem);
 	return 0;
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_req.c b/drivers/infiniband/sw/rxe/rxe_req.c
index 0d018adbe512..c63e66873757 100644
--- a/drivers/infiniband/sw/rxe/rxe_req.c
+++ b/drivers/infiniband/sw/rxe/rxe_req.c
@@ -594,8 +594,6 @@ int rxe_requester(void *arg)
 	struct rxe_send_wqe rollback_wqe;
 	u32 rollback_psn;
 
-	rxe_add_ref(&qp->pelem);
-
 next_wqe:
 	if (unlikely(!qp->valid || qp->req.state == QP_STATE_ERROR))
 		goto exit;
@@ -702,7 +700,6 @@ int rxe_requester(void *arg)
 			wqe->state = wqe_state_done;
 			wqe->status = IB_WC_SUCCESS;
 			__rxe_do_task(&qp->comp.task);
-			rxe_drop_ref(&qp->pelem);
 			return 0;
 		}
 		payload = mtu;
@@ -753,6 +750,5 @@ int rxe_requester(void *arg)
 	__rxe_do_task(&qp->comp.task);
 
 exit:
-	rxe_drop_ref(&qp->pelem);
 	return -EAGAIN;
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
index f038bae1bd70..7be8a362d2ef 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -1212,8 +1212,6 @@ int rxe_responder(void *arg)
 	struct rxe_pkt_info *pkt = NULL;
 	int ret = 0;
 
-	rxe_add_ref(&qp->pelem);
-
 	qp->resp.aeth_syndrome = AETH_ACK_UNLIMITED;
 
 	if (!qp->valid) {
@@ -1392,6 +1390,5 @@ int rxe_responder(void *arg)
 exit:
 	ret = -EAGAIN;
 done:
-	rxe_drop_ref(&qp->pelem);
 	return ret;
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c
index 7c2b6e4595f5..9d6b8ad08a3a 100644
--- a/drivers/infiniband/sw/rxe/rxe_task.c
+++ b/drivers/infiniband/sw/rxe/rxe_task.c
@@ -35,6 +35,7 @@
 #include
 #include
+#include "rxe.h"
 #include "rxe_task.h"
 
 int __rxe_do_task(struct rxe_task *task)
@@ -115,14 +116,15 @@ void rxe_do_task(unsigned long data)
 }
 
 int rxe_init_task(void *obj, struct rxe_task *task,
-		  void *arg, int (*func)(void *), char *name)
+		  struct rxe_qp *qp, int (*func)(void *), char *name)
 {
 	task->obj	= obj;
-	task->arg	= arg;
+	task->arg	= qp;
 	task->func	= func;
 	snprintf(task->name, sizeof(task->name), "%s", name);
 	task->destroyed	= false;
 
+	rxe_add_ref(&qp->pelem);
 	tasklet_init(&task->tasklet, rxe_do_task, (unsigned long)task);
 
 	task->state = TASK_STATE_START;
@@ -135,6 +137,7 @@ void rxe_cleanup_task(struct rxe_task *task)
 {
 	unsigned long flags;
 	bool idle;
+	struct rxe_qp *qp = (struct rxe_qp *)task->arg;
 
 	/*
 	 * Mark the task, then wait for it to finish. It might be
@@ -149,6 +152,7 @@ void rxe_cleanup_task(struct rxe_task *task)
 	} while (!idle);
 
 	tasklet_kill(&task->tasklet);
+	rxe_drop_ref(&qp->pelem);
 }
 
 void rxe_run_task(struct rxe_task *task)
diff --git a/drivers/infiniband/sw/rxe/rxe_task.h b/drivers/infiniband/sw/rxe/rxe_task.h
index 5c1fc7d5b953..671b1774b577 100644
--- a/drivers/infiniband/sw/rxe/rxe_task.h
+++ b/drivers/infiniband/sw/rxe/rxe_task.h
@@ -63,7 +63,7 @@ struct rxe_task {
  *	fcn  => function to call until it returns != 0
  */
 int rxe_init_task(void *obj, struct rxe_task *task,
-		  void *arg, int (*func)(void *), char *name);
+		  struct rxe_qp *qp, int (*func)(void *), char *name);
 
 /* cleanup task */
 void rxe_cleanup_task(struct rxe_task *task);
-- 
2.20.1
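The rule this patch enforces — take the QP reference once at task init and drop it once at task cleanup, rather than around every run — can be sketched with a userspace analog (assumed names, not driver code):

```c
/* Userspace analog of the PATCH 05 lifetime rule: the task holds one
 * reference on its QP for its whole lifetime, taken at init and
 * dropped at cleanup, instead of taking/dropping one around each run. */
struct qp { int refcount; };
struct task { struct qp *qp; int runs; };

static void qp_add_ref(struct qp *qp)  { qp->refcount++; }
static void qp_drop_ref(struct qp *qp) { qp->refcount--; }

static void task_init(struct task *t, struct qp *qp)
{
	t->qp = qp;
	t->runs = 0;
	qp_add_ref(qp);		/* pin the QP before the task can ever run */
}

static void task_run(struct task *t)
{
	/* No refcount traffic here: the init-time reference keeps the QP
	 * alive for every run, closing the window where the object could
	 * vanish between scheduling and execution. */
	t->runs++;
}

static void task_cleanup(struct task *t)
{
	qp_drop_ref(t->qp);	/* the task's reference ends with the task */
}
```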
[PATCH 03/10] Make pool interface more type safe
Replace void* with rxe_pool_entry* for some functions. Change macro to inline function.

Signed-off-by: Maksym Planeta
---
 drivers/infiniband/sw/rxe/rxe_comp.c  | 18
 drivers/infiniband/sw/rxe/rxe_mcast.c | 20
 drivers/infiniband/sw/rxe/rxe_mr.c    |  8 ++--
 drivers/infiniband/sw/rxe/rxe_net.c   |  6 +--
 drivers/infiniband/sw/rxe/rxe_pool.c  | 12 ++---
 drivers/infiniband/sw/rxe/rxe_pool.h  | 16 ---
 drivers/infiniband/sw/rxe/rxe_qp.c    | 32 ++---
 drivers/infiniband/sw/rxe/rxe_recv.c  |  8 ++--
 drivers/infiniband/sw/rxe/rxe_req.c   |  8 ++--
 drivers/infiniband/sw/rxe/rxe_resp.c  | 22 -
 drivers/infiniband/sw/rxe/rxe_verbs.c | 66 +--
 11 files changed, 108 insertions(+), 108 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
index 5f3a43cfeb63..bdeb288673f3 100644
--- a/drivers/infiniband/sw/rxe/rxe_comp.c
+++ b/drivers/infiniband/sw/rxe/rxe_comp.c
@@ -534,7 +534,7 @@ static void rxe_drain_resp_pkts(struct rxe_qp *qp, bool notify)
 	struct rxe_send_wqe *wqe;
 
 	while ((skb = skb_dequeue(&qp->resp_pkts))) {
-		rxe_drop_ref(qp);
+		rxe_drop_ref(&qp->pelem);
 		kfree_skb(skb);
 	}
 
@@ -557,7 +557,7 @@ int rxe_completer(void *arg)
 	struct rxe_pkt_info *pkt = NULL;
 	enum comp_state state;
 
-	rxe_add_ref(qp);
+	rxe_add_ref(&qp->pelem);
 
 	if (!qp->valid || qp->req.state == QP_STATE_ERROR ||
 	    qp->req.state == QP_STATE_RESET) {
@@ -646,7 +646,7 @@ int rxe_completer(void *arg)
 
 		case COMPST_DONE:
 			if (pkt) {
-				rxe_drop_ref(pkt->qp);
+				rxe_drop_ref(&pkt->qp->pelem);
 				kfree_skb(skb);
 				skb = NULL;
 			}
@@ -694,7 +694,7 @@ int rxe_completer(void *arg)
 			if (qp->comp.started_retry &&
 			    !qp->comp.timeout_retry) {
 				if (pkt) {
-					rxe_drop_ref(pkt->qp);
+					rxe_drop_ref(&pkt->qp->pelem);
 					kfree_skb(skb);
 					skb = NULL;
 				}
@@ -723,7 +723,7 @@ int rxe_completer(void *arg)
 			}
 
 			if (pkt) {
-				rxe_drop_ref(pkt->qp);
+				rxe_drop_ref(&pkt->qp->pelem);
 				kfree_skb(skb);
 				skb = NULL;
 			}
@@ -748,7 +748,7 @@ int rxe_completer(void *arg)
 				mod_timer(&qp->rnr_nak_timer,
 					  jiffies + rnrnak_jiffies(aeth_syn(pkt)
 						& ~AETH_TYPE_MASK));
-				rxe_drop_ref(pkt->qp);
+				rxe_drop_ref(&pkt->qp->pelem);
 				kfree_skb(skb);
 				skb = NULL;
 				goto exit;
@@ -766,7 +766,7 @@ int rxe_completer(void *arg)
 			rxe_qp_error(qp);
 
 			if (pkt) {
-				rxe_drop_ref(pkt->qp);
+				rxe_drop_ref(&pkt->qp->pelem);
 				kfree_skb(skb);
 				skb = NULL;
 			}
@@ -780,7 +780,7 @@ int rxe_completer(void *arg)
 	 * exit from the loop calling us
 	 */
 	WARN_ON_ONCE(skb);
-	rxe_drop_ref(qp);
+	rxe_drop_ref(&qp->pelem);
 	return -EAGAIN;
 
 done:
@@ -788,6 +788,6 @@ int rxe_completer(void *arg)
 	 * us again to see if there is anything else to do
 	 */
 	WARN_ON_ONCE(skb);
-	rxe_drop_ref(qp);
+	rxe_drop_ref(&qp->pelem);
 	return 0;
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_mcast.c b/drivers/infiniband/sw/rxe/rxe_mcast.c
index 522a7942c56c..24ebc4ca1b99 100644
--- a/drivers/infiniband/sw/rxe/rxe_mcast.c
+++ b/drivers/infiniband/sw/rxe/rxe_mcast.c
@@ -59,7 +59,7 @@ int rxe_mcast_get_grp(struct rxe_dev *rxe, union ib_gid *mgid,
 	spin_lock_init(&grp->mcg_lock);
 	grp->rxe = rxe;
 
-	rxe_add_key(grp, mgid);
+	rxe_add_key(&grp->pelem, mgid);
 
 	err = rxe_mcast_add(rxe, mgid);
 	if (err)
@@ -70,7 +70,7 @@ int rxe_mcast_get_grp(struct rxe_dev *rxe, union ib_gid *mgid,
 	return 0;
 
 err2:
-	rxe_drop_ref(grp);
+	rxe_drop_ref(&grp->pelem);
 err1:
 	return err;
 }
@@ -103,7 +103,7 @@ int rxe_mcast_add_grp_elem(struct rxe_dev *rxe, struct rxe_qp *qp,
 	}
 
 	/* each qp
[PATCH 08/10] Move responsibility of cleaning up pool elements
Specific implementations must finish the cleanup of pool elements at the right moment. The reason is that a concrete cleanup function may want to postpone part of the object destruction and schedule it for a later time.

Signed-off-by: Maksym Planeta
---
 drivers/infiniband/sw/rxe/rxe_cq.c    | 2 ++
 drivers/infiniband/sw/rxe/rxe_mcast.c | 2 ++
 drivers/infiniband/sw/rxe/rxe_mr.c    | 2 ++
 drivers/infiniband/sw/rxe/rxe_pool.c  | 9 -
 drivers/infiniband/sw/rxe/rxe_pool.h  | 2 ++
 5 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_cq.c b/drivers/infiniband/sw/rxe/rxe_cq.c
index ad3090131126..5693986c8c1b 100644
--- a/drivers/infiniband/sw/rxe/rxe_cq.c
+++ b/drivers/infiniband/sw/rxe/rxe_cq.c
@@ -182,4 +182,6 @@ void rxe_cq_cleanup(struct rxe_pool_entry *arg)
 
 	if (cq->queue)
 		rxe_queue_cleanup(cq->queue);
+
+	rxe_elem_cleanup(arg);
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_mcast.c b/drivers/infiniband/sw/rxe/rxe_mcast.c
index 24ebc4ca1b99..6453ae97653f 100644
--- a/drivers/infiniband/sw/rxe/rxe_mcast.c
+++ b/drivers/infiniband/sw/rxe/rxe_mcast.c
@@ -187,4 +187,6 @@ void rxe_mc_cleanup(struct rxe_pool_entry *arg)
 
 	rxe_drop_key(&grp->pelem);
 	rxe_mcast_delete(rxe, &grp->mgid);
+
+	rxe_elem_cleanup(arg);
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 441b01e30afa..1c0940655df1 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -104,6 +104,8 @@ void rxe_mem_cleanup(struct rxe_pool_entry *arg)
 
 		kfree(mem->map);
 	}
+
+	rxe_elem_cleanup(arg);
 }
 
 static int rxe_mem_alloc(struct rxe_mem *mem, int num_buf)
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
index 711d7d7f3692..3b14ab599662 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.c
+++ b/drivers/infiniband/sw/rxe/rxe_pool.c
@@ -462,9 +462,16 @@ void rxe_elem_release(struct kref *kref)
 
 	if (pool->cleanup)
 		pool->cleanup(elem);
+	else
+		rxe_elem_cleanup(elem);
+}
+
+void rxe_elem_cleanup(struct rxe_pool_entry *pelem)
+{
+	struct rxe_pool *pool = pelem->pool;
 
 	if (!(pool->flags & RXE_POOL_NO_ALLOC))
-		kmem_cache_free(pool_cache(pool), elem);
+		kmem_cache_free(pool_cache(pool), pelem);
 	atomic_dec(&pool->num_elem);
 	ib_device_put(&pool->rxe->ib_dev);
 	rxe_pool_put(pool);
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.h b/drivers/infiniband/sw/rxe/rxe_pool.h
index 1e3508c74dc0..84cfbc749a5c 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.h
+++ b/drivers/infiniband/sw/rxe/rxe_pool.h
@@ -160,6 +160,8 @@ void *rxe_pool_get_key(struct rxe_pool *pool, void *key);
 /* cleanup an object when all references are dropped */
 void rxe_elem_release(struct kref *kref);
 
+void rxe_elem_cleanup(struct rxe_pool_entry *pelem);
+
 /* take a reference on an object */
 static inline void rxe_add_ref(struct rxe_pool_entry *pelem) {
 	kref_get(&pelem->ref_cnt);
 }
-- 
2.20.1
Re: [PATCH 07/10] Pass the return value of kref_put further
In a later patch I need to know whether a dependency on the QP object still exists or not.

On 22/07/2019 17:29, Jason Gunthorpe wrote:
> On Mon, Jul 22, 2019 at 05:14:23PM +0200, Maksym Planeta wrote:
>> Used in a later patch.
>>
>> Signed-off-by: Maksym Planeta
>> ---
>>  drivers/infiniband/sw/rxe/rxe_pool.c | 3 ++-
>>  drivers/infiniband/sw/rxe/rxe_pool.h | 2 +-
>>  2 files changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
>> index 30a887cf9200..711d7d7f3692 100644
>> --- a/drivers/infiniband/sw/rxe/rxe_pool.c
>> +++ b/drivers/infiniband/sw/rxe/rxe_pool.c
>> @@ -541,7 +541,7 @@ static void rxe_dummy_release(struct kref *kref)
>>  {
>>  }
>>
>> -void rxe_drop_ref(struct rxe_pool_entry *pelem)
>> +int rxe_drop_ref(struct rxe_pool_entry *pelem)
>>  {
>>  	int res;
>>  	struct rxe_pool *pool = pelem->pool;
>> @@ -553,4 +553,5 @@ void rxe_drop_ref(struct rxe_pool_entry *pelem)
>>  	if (res) {
>>  		rxe_elem_release(&pelem->ref_cnt);
>>  	}
>> +	return res;
>>  }
>
> Using the return value of kref_put at all is super sketchy. Are you
> sure this is actually a kref usage pattern? Why would this be needed?
>
> Jason

-- 
Regards,
Maksym Planeta
Re: [PATCH 04/10] Protect kref_put with the lock
On 22/07/2019 17:25, Jason Gunthorpe wrote:
> On Mon, Jul 22, 2019 at 05:14:20PM +0200, Maksym Planeta wrote:
>> Need to ensure that kref_put does not run concurrently with the loop
>> inside rxe_pool_get_key.
>>
>> Signed-off-by: Maksym Planeta
>> ---
>>  drivers/infiniband/sw/rxe/rxe_pool.c | 18 ++++++++++++++++++
>>  drivers/infiniband/sw/rxe/rxe_pool.h | 4 +---
>>  2 files changed, 19 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
>> index efa9bab01e02..30a887cf9200 100644
>> --- a/drivers/infiniband/sw/rxe/rxe_pool.c
>> +++ b/drivers/infiniband/sw/rxe/rxe_pool.c
>> @@ -536,3 +536,21 @@ void *rxe_pool_get_key(struct rxe_pool *pool, void *key)
>>  	read_unlock_irqrestore(&pool->pool_lock, flags);
>>  	return node ? elem : NULL;
>>  }
>> +
>> +static void rxe_dummy_release(struct kref *kref)
>> +{
>> +}
>> +
>> +void rxe_drop_ref(struct rxe_pool_entry *pelem)
>> +{
>> +	int res;
>> +	struct rxe_pool *pool = pelem->pool;
>> +	unsigned long flags;
>> +
>> +	write_lock_irqsave(&pool->pool_lock, flags);
>> +	res = kref_put(&pelem->ref_cnt, rxe_dummy_release);
>> +	write_unlock_irqrestore(&pool->pool_lock, flags);
>
> This doesn't make sense.. If something is making the kref go to 0
> while the node is still in the RB tree then that is a bug.
>
> You should never need to add locking around a kref_put.

From https://www.kernel.org/doc/Documentation/kref.txt

| The last rule (rule 3) is the nastiest one to handle. Say, for
| instance, you have a list of items that are each kref-ed, and you wish
| to get the first one. You can't just pull the first item off the list
| and kref_get() it. That violates rule 3 because you are not already
| holding a valid pointer. You must add a mutex (or some other lock).

> Jason

-- 
Regards,
Maksym Planeta
Re: [PATCH] sysctl: Add a feature to drop caches selectively
> With a binary interface like an ioctl I can see how you could have extra
> unused fields which you can ignore now and let people start adding extra
> options like the range in the future.

Yes, an ioctl is another possibility. But I would argue that a sysctl is
the more convenient interface, because the idea behind sdrop_caches is
similar to that of drop_caches, and it is convenient to have the two
interfaces in the same place. On the other hand, if sdrop_caches uses
procfs, there seems to be no easy way to pass parameters of different
types in one write operation.

> Other questions I'd ask would be - how about the access control model?
> Will only root be able to drop caches? Why can't I drop caches for my
> own file?

The access control model is the same as for drop_caches. This means that
only root can write to this file. But it would be easy to add a feature
that allows any user to clean the page cache of inodes that this user
owns.

2014-06-25 15:42 GMT+02:00 Artem Bityutskiy:
> On Wed, 2014-06-25 at 15:23 +0200, Thomas Knauth wrote:
>> On Wed, Jun 25, 2014 at 11:56 AM, Artem Bityutskiy wrote:
>> > Thanks for the answer, although you forgot to comment on the question
>> > about possibly extending the new interface to work with file ranges in
>> > the future. For example, I have a 2 TiB file, and I am only interested
>> > in dropping caches for the first couple of gigabytes. Would I extend
>> > your interface, or would I come up with another one?
>>
>> Ah, didn't quite understand what was meant with file ranges. Again, we
>> had not considered this so far. I guess you could make a distinction
>> between directories and files here. If the path points to a file, you
>> can have an optional argument indicating the range of bytes you would
>> like to drop. Something like
>>
>> echo "my-file 0-1000,8000-10000" > /proc/sys/vm/sdrop_caches
>>
>> If this is desirable, we can add it to the patch.
> With a binary interface like an ioctl I can see how you could have extra
> unused fields which you can ignore now and let people start adding extra
> options like the range in the future.
>
> With this kind of interface I am not sure how to do this.
>
> Other questions I'd ask would be - how about the access control model?
> Will only root be able to drop caches? Why can't I drop caches for my
> own file?
>
> I did not put much thinking into this, but it looks like ioctl could be
> a better interface for the task you are trying to solve...
>
> Sorry if I am a bit vague, I am mostly trying to make you guys give this
> more thoughts, and come up with a deeper analysis. Interfaces are very
> important to get right, or as right as possible...
>
> --
> Best Regards,
> Artem Bityutskiy

-- 
Regards,
Maksym Planeta.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[PATCH] sysctl: Add a feature to drop caches selectively
To clean the page cache one can use /proc/sys/vm/drop_caches. But this
drops the whole page cache. In contrast to that, sdrop_caches adds the
ability to drop the page cache selectively, by path string.

Suggested-by: Thomas Knauth
Signed-off-by: Maksym Planeta
---
 Documentation/sysctl/vm.txt |  15 +++++
 fs/Makefile                 |   2 +-
 fs/sdrop_caches.c           | 124 ++++++++++++++++++++++++++++++++++++
 3 files changed, 140 insertions(+), 1 deletion(-)
 create mode 100644 fs/sdrop_caches.c

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index bd4b34c..faad01d 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -28,6 +28,7 @@ Currently, these files are in /proc/sys/vm:
 - dirty_ratio
 - dirty_writeback_centisecs
 - drop_caches
+- sdrop_caches
 - extfrag_threshold
 - hugepages_treat_as_movable
 - hugetlb_shm_group
@@ -211,6 +212,20 @@ with your system.

 To disable them, echo 4 (bit 3) into drop_caches.

 ==============================================================

+sdrop_caches
+
+Writing to this will cause the kernel to drop clean caches starting from
+the specified path.
+
+To free the pagecache of a file:
+	echo /home/user/file > /proc/sys/vm/sdrop_caches
+To free the pagecache of a directory and all files in it:
+	echo /home/user/directory > /proc/sys/vm/sdrop_caches
+
+Restrictions are the same as for drop_caches.
+
+==============================================================
+
 extfrag_threshold

 This parameter affects whether the kernel will compact memory or direct

diff --git a/fs/Makefile b/fs/Makefile
index 4030cbf..366c7b9 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -44,7 +44,7 @@ obj-$(CONFIG_FS_MBCACHE)	+= mbcache.o
 obj-$(CONFIG_FS_POSIX_ACL)	+= posix_acl.o
 obj-$(CONFIG_NFS_COMMON)	+= nfs_common/
 obj-$(CONFIG_COREDUMP)		+= coredump.o
-obj-$(CONFIG_SYSCTL)		+= drop_caches.o
+obj-$(CONFIG_SYSCTL)		+= drop_caches.o sdrop_caches.o
 obj-$(CONFIG_FHANDLE)		+= fhandle.o

diff --git a/fs/sdrop_caches.c b/fs/sdrop_caches.c
new file mode 100644
index 000..c193655
--- /dev/null
+++ b/fs/sdrop_caches.c
@@ -0,0 +1,124 @@
+/*
+ * Implement the manual selective drop pagecache function
+ */
+
+#include <linux/module.h>
+
+#include <linux/kernel.h>
+#include <linux/proc_fs.h>
+#include <linux/string.h>
+#include <linux/vmalloc.h>
+#include <linux/uaccess.h>
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/writeback.h>
+#include <linux/sysctl.h>
+#include <linux/gfp.h>
+#include <linux/limits.h>
+#include <linux/namei.h>
+
+static void clean_mapping(struct dentry *dentry)
+{
+	struct inode *inode = dentry->d_inode;
+
+	if (!inode)
+		return;
+
+	if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+	    (inode->i_mapping->nrpages == 0)) {
+		return;
+	}
+
+	invalidate_mapping_pages(inode->i_mapping, 0, -1);
+}
+
+static void clean_all_dentries_locked(struct dentry *dentry)
+{
+	struct dentry *child;
+
+	list_for_each_entry(child, &dentry->d_subdirs, d_u.d_child) {
+		clean_all_dentries_locked(child);
+	}
+
+	clean_mapping(dentry);
+}
+
+static void clean_all_dentries(struct dentry *dentry)
+{
+	spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
+	clean_all_dentries_locked(dentry);
+	spin_unlock(&dentry->d_lock);
+}
+
+static int drop_pagecache(const char __user *filename)
+{
+	unsigned int lookup_flags = LOOKUP_FOLLOW;
+	struct path path;
+	int error;
+
+retry:
+	error = user_path_at(AT_FDCWD, filename, lookup_flags, &path);
+	if (!error) {
+		/* clean */
+		clean_all_dentries(path.dentry);
+	}
+	if (retry_estale(error, lookup_flags)) {
+		lookup_flags |= LOOKUP_REVAL;
+		goto retry;
+	}
+	return error;
+}
+
+static int sdrop_ctl_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	char __user *pathname = buffer + *lenp - 1;
+
+	put_user('\0', pathname);
+
+	if (!write)
+		return 0;
+
+	return drop_pagecache(buffer);
+}
+
+static struct ctl_path vm_path[] = { { .procname = "vm", }, { } };
+static struct ctl_table sdrop_ctl_table[] = {
+	{
+		.procname	= "sdrop_caches",
+		.mode		= 0644,
+		.proc_handler	= sdrop_ctl_handler,
+	},
+	{ }
+};
+
+static struct ctl_table_header *sdrop_proc_entry;
+
+/* Init function called on module entry */
+int sdrop_init(void)
+{
+	int ret = 0;
+
+	sdrop_proc_entry = register_sysctl_paths(vm_path, sdrop_ctl_table);
+
+	if (sdrop_proc_entry == NULL) {
+		ret = -ENOMEM;
+		pr_err("sdrop_caches: Couldn't create proc entry\n");
+	}
+
+	return ret;
+}
+
+/* Cleanup function called on module exit */
+void sdrop_cleanup(void)
+{
+	unregister_sysctl_table(sdrop_proc_entry);
+}
+
+module_init(sdrop_init);
+module_exit(sdrop_cleanup);
+
+MODULE_LICE