Re: [PATCH v4 7/7] block: Make SCSI device suspend and resume work reliably

2017-09-26 Thread Bart Van Assche
On Tue, 2017-09-26 at 16:32 +0800, Ming Lei wrote:
> On Mon, Sep 25, 2017 at 11:13:47PM +, Bart Van Assche wrote:
> > Sorry but I disagree. I'm using RCU to achieve the same effect as a barrier
> > and to move the cost of the barrier from the reader to the updater. See also
> > Paul E. McKenney, Mathieu Desnoyers, Lai Jiangshan, and Josh Triplett,
> > The RCU-barrier menagerie, LWN.net, November 12, 2013
> > (https://lwn.net/Articles/573497/).
> 
> Let me explain in a bit more detail: [ ... ]

The approach I used in patch 7/7 is identical to one of the approaches explained
in the article I referred to. If you do not agree that that approach is safe,
that means that you do not agree with Paul McKenney's view on how RCU works
either. Do you really think that you know more about RCU than Paul McKenney?

Bart.

Re: [PATCH v4 7/7] block: Make SCSI device suspend and resume work reliably

2017-09-26 Thread Ming Lei
On Mon, Sep 25, 2017 at 11:13:47PM +, Bart Van Assche wrote:
> On Tue, 2017-09-26 at 06:59 +0800, Ming Lei wrote:
> > On Mon, Sep 25, 2017 at 01:29:24PM -0700, Bart Van Assche wrote:
> > > +int blk_queue_enter(struct request_queue *q, bool nowait, bool preempt)
> > >  {
> > >   while (true) {
> > >   int ret;
> > >  
> > > - if (percpu_ref_tryget_live(&q->q_usage_counter))
> > > - return 0;
> > > + if (percpu_ref_tryget_live(&q->q_usage_counter)) {
> > > + /*
> > > +  * Since setting the PREEMPT_ONLY flag is followed
> > > +  * by a switch of q_usage_counter from per-cpu to
> > > +  * atomic mode and back to per-cpu and since the
> > > +  * switch to atomic mode uses call_rcu_sched(), it
> > > +  * is not necessary to call smp_rmb() here.
> > > +  */
> > 
> > rcu_read_lock is held only inside percpu_ref_tryget_live().
> > 
> > Without an explicit barrier (smp_mb()) between getting the refcounter
> > and reading the preempt-only flag, the two operations (writing to the
> > refcounter and reading the flag) can be reordered, so
> > freeze/unfreeze may complete before this I/O has completed.
> 
> Sorry but I disagree. I'm using RCU to achieve the same effect as a barrier
> and to move the cost of the barrier from the reader to the updater. See also
> Paul E. McKenney, Mathieu Desnoyers, Lai Jiangshan, and Josh Triplett,
> The RCU-barrier menagerie, LWN.net, November 12, 2013
> (https://lwn.net/Articles/573497/).

Let me explain in a bit more detail:

1) in the SCSI quiesce path:

We can assume there is one synchronize_rcu() in freeze/unfreeze.

Let's see the RCU document:

Documentation/RCU/whatisRCU.txt:

void synchronize_rcu(void);

Marks the end of updater code and the beginning of reclaimer
code.  It does this by blocking until all pre-existing RCU
read-side critical sections on all CPUs have completed.
Note that synchronize_rcu() will -not- necessarily wait for
any subsequent RCU read-side critical sections to complete.
For example, consider the following sequence of events:

So the synchronize_rcu() in the SCSI quiesce path only waits for pre-existing
read-side critical sections to complete; it does not wait for subsequent
RCU read-side critical sections.

2) in the normal I/O path of blk_queue_enter()

- the RCU read lock is held only inside percpu_ref_tryget_live(), and it
is released when this helper returns.

- there is no explicit barrier (smp_mb()) between percpu_ref_tryget_live()
and checking the preempt-only flag, so writing to the percpu_ref counter
and reading the preempt-only flag can be reordered as follows:

-- check the preempt-only flag

the current process is preempted now, and at exactly that
time SCSI quiesce runs in another process; freeze/unfreeze
then completes because there are no pre-existing read-side
critical sections and the percpu_ref is not held either.

finally this process is scheduled back in, grabs a ref,
and submits I/O after the SCSI device has been quiesced (which
shouldn't be allowed)

-- percpu_ref_tryget_live()
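
The ordering requirement described above can be sketched in user space. This is only a model under stated assumptions: struct model_queue and the model_* functions are hypothetical names, a plain C11 atomic counter stands in for the percpu_ref, and atomic_thread_fence(memory_order_seq_cst) stands in for smp_mb(). It is not the patch's code and it does not model RCU grace periods. With a full fence on both sides, the reader cannot see the flag as clear while the updater sees the counter as zero.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for struct request_queue. */
struct model_queue {
	atomic_long usage;        /* stands in for q_usage_counter */
	atomic_bool preempt_only; /* stands in for QUEUE_FLAG_PREEMPT_ONLY */
};

/* Reader side: take a reference first, then look at the flag. */
static int model_queue_enter(struct model_queue *q, bool preempt)
{
	atomic_fetch_add_explicit(&q->usage, 1, memory_order_relaxed);
	/* Full fence: order the counter update before the flag read. */
	atomic_thread_fence(memory_order_seq_cst);
	if (preempt ||
	    !atomic_load_explicit(&q->preempt_only, memory_order_relaxed))
		return 0;
	/* Flag is set and the caller is not a preempt request: back out. */
	atomic_fetch_sub_explicit(&q->usage, 1, memory_order_release);
	return -EBUSY;
}

/* Reader side: drop the reference taken by model_queue_enter(). */
static void model_queue_exit(struct model_queue *q)
{
	long old = atomic_fetch_sub_explicit(&q->usage, 1,
					     memory_order_release);
	assert(old > 0);
}

/*
 * Updater side: publish the flag, fence, then sample the counter.
 * Returns true if no reference was held at the time of the check.
 */
static bool model_quiesce_is_idle(struct model_queue *q)
{
	atomic_store_explicit(&q->preempt_only, true, memory_order_relaxed);
	/* Full fence: order the flag write before the counter read. */
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&q->usage, memory_order_relaxed) == 0;
}
```

In this model the reader pays for the fence on every call. The disagreement in this thread is about whether that cost can be moved entirely to the updater.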


-- 
Ming


Re: [PATCH v4 7/7] block: Make SCSI device suspend and resume work reliably

2017-09-25 Thread Bart Van Assche
On Tue, 2017-09-26 at 06:59 +0800, Ming Lei wrote:
> On Mon, Sep 25, 2017 at 01:29:24PM -0700, Bart Van Assche wrote:
> > +int blk_queue_enter(struct request_queue *q, bool nowait, bool preempt)
> >  {
> > while (true) {
> > int ret;
> >  
> > -   if (percpu_ref_tryget_live(&q->q_usage_counter))
> > -   return 0;
> > +   if (percpu_ref_tryget_live(&q->q_usage_counter)) {
> > +   /*
> > +* Since setting the PREEMPT_ONLY flag is followed
> > +* by a switch of q_usage_counter from per-cpu to
> > +* atomic mode and back to per-cpu and since the
> > +* switch to atomic mode uses call_rcu_sched(), it
> > +* is not necessary to call smp_rmb() here.
> > +*/
> 
> rcu_read_lock is held only inside percpu_ref_tryget_live().
> 
> Without an explicit barrier (smp_mb()) between getting the refcounter
> and reading the preempt-only flag, the two operations (writing to the
> refcounter and reading the flag) can be reordered, so
> freeze/unfreeze may complete before this I/O has completed.

Sorry but I disagree. I'm using RCU to achieve the same effect as a barrier
and to move the cost of the barrier from the reader to the updater. See also
Paul E. McKenney, Mathieu Desnoyers, Lai Jiangshan, and Josh Triplett,
The RCU-barrier menagerie, LWN.net, November 12, 2013
(https://lwn.net/Articles/573497/).

Bart.

Re: [PATCH v4 7/7] block: Make SCSI device suspend and resume work reliably

2017-09-25 Thread Ming Lei
On Mon, Sep 25, 2017 at 01:29:24PM -0700, Bart Van Assche wrote:
> It is essential during suspend and resume that neither the filesystem
> state nor the filesystem metadata in RAM changes. This is why SCSI
> devices are quiesced while the hibernation image is being written or
> restored. The SCSI core quiesces devices through scsi_device_quiesce()
> and resumes them through scsi_device_resume(). In the SDEV_QUIESCE state
> execution of non-preempt requests is deferred. This is realized by
> returning BLKPREP_DEFER from inside scsi_prep_state_check() for quiesced
> SCSI devices. Prevent a full queue from blocking the submission of power
> management requests by deferring allocation of non-preempt requests for
> devices in the quiesced state. This patch has been tested by running
> the following commands and by verifying that after resume the fio job
> is still running:
> 
> for d in /sys/class/block/sd*[a-z]; do
>   hcil=$(readlink "$d/device")
>   hcil=${hcil#../../../}
>   echo 4 > "$d/queue/nr_requests"
>   echo 1 > "/sys/class/scsi_device/$hcil/device/queue_depth"
> done
> bdev=$(readlink /dev/disk/by-uuid/5217d83f-213e-4b42-b86e-20013325ba6c)
> bdev=${bdev#../../}
> hcil=$(readlink "/sys/block/$bdev/device")
> hcil=${hcil#../../../}
> fio --name="$bdev" --filename="/dev/$bdev" --buffered=0 --bs=512 --rw=randread \
>   --ioengine=libaio --numjobs=4 --iodepth=16 --iodepth_batch=1 --thread \
>   --loops=$((2**31)) &
> pid=$!
> sleep 1
> systemctl hibernate
> sleep 10
> kill $pid
> 
> Reported-by: Oleksandr Natalenko 
> References: "I/O hangs after resuming from suspend-to-ram" 
> (https://marc.info/?l=linux-block&m=150340235201348).
> Signed-off-by: Bart Van Assche 
> Cc: Martin K. Petersen 
> Cc: Ming Lei 
> Cc: Christoph Hellwig 
> Cc: Hannes Reinecke 
> Cc: Johannes Thumshirn 
> ---
>  block/blk-core.c   | 41 +++--
>  block/blk-mq.c |  4 ++--
>  block/blk-timeout.c|  2 +-
>  fs/block_dev.c |  4 ++--
>  include/linux/blkdev.h |  2 +-
>  5 files changed, 37 insertions(+), 16 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 9111a8f9c7a1..01b7afee58f0 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -351,10 +351,12 @@ void blk_set_preempt_only(struct request_queue *q, bool preempt_only)
>   unsigned long flags;
>  
>   spin_lock_irqsave(q->queue_lock, flags);
> - if (preempt_only)
> + if (preempt_only) {
>   queue_flag_set(QUEUE_FLAG_PREEMPT_ONLY, q);
> - else
> + } else {
>   queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
> + wake_up_all(&q->mq_freeze_wq);
> + }
>   spin_unlock_irqrestore(q->queue_lock, flags);
>  }
>  EXPORT_SYMBOL(blk_set_preempt_only);
> @@ -776,13 +778,31 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
>  }
>  EXPORT_SYMBOL(blk_alloc_queue);
>  
> -int blk_queue_enter(struct request_queue *q, bool nowait)
> +/**
> + * blk_queue_enter() - try to increase q->q_usage_counter
> + * @q: request queue pointer
> + * @nowait: if the queue is frozen, do not wait until it is unfrozen
> + * @preempt: if QUEUE_FLAG_PREEMPT_ONLY has been set, do not wait until that
> + *   flag has been cleared
> + */
> +int blk_queue_enter(struct request_queue *q, bool nowait, bool preempt)
>  {
>   while (true) {
>   int ret;
>  
> - if (percpu_ref_tryget_live(&q->q_usage_counter))
> - return 0;
> + if (percpu_ref_tryget_live(&q->q_usage_counter)) {
> + /*
> +  * Since setting the PREEMPT_ONLY flag is followed
> +  * by a switch of q_usage_counter from per-cpu to
> +  * atomic mode and back to per-cpu and since the
> +  * switch to atomic mode uses call_rcu_sched(), it
> +  * is not necessary to call smp_rmb() here.
> +  */

rcu_read_lock is held only inside percpu_ref_tryget_live().

Without an explicit barrier (smp_mb()) between getting the refcounter
and reading the preempt-only flag, the two operations (writing to the
refcounter and reading the flag) can be reordered, so
freeze/unfreeze may complete before this I/O has completed.

> + if (preempt || !blk_queue_preempt_only(q))
> + return 0;
> + else
> + percpu_ref_put(&q->q_usage_counter);
> + }
>  
>   if (nowait)
>   return -EBUSY;
> @@ -797,7 +817,8 @@ int blk_queue_enter(struct request_queue *q, bool nowait)
>   smp_rmb();
>  
>   ret = wait_event_interruptible(q->mq_freeze_wq,
> - !atomic_read(&q->mq_freeze_depth) ||
> + (atomic_read(&q->mq_freeze_depth) == 0 

[PATCH v4 7/7] block: Make SCSI device suspend and resume work reliably

2017-09-25 Thread Bart Van Assche
It is essential during suspend and resume that neither the filesystem
state nor the filesystem metadata in RAM changes. This is why SCSI
devices are quiesced while the hibernation image is being written or
restored. The SCSI core quiesces devices through scsi_device_quiesce()
and resumes them through scsi_device_resume(). In the SDEV_QUIESCE state
execution of non-preempt requests is deferred. This is realized by
returning BLKPREP_DEFER from inside scsi_prep_state_check() for quiesced
SCSI devices. Prevent a full queue from blocking the submission of power
management requests by deferring allocation of non-preempt requests for
devices in the quiesced state. This patch has been tested by running
the following commands and by verifying that after resume the fio job
is still running:

for d in /sys/class/block/sd*[a-z]; do
  hcil=$(readlink "$d/device")
  hcil=${hcil#../../../}
  echo 4 > "$d/queue/nr_requests"
  echo 1 > "/sys/class/scsi_device/$hcil/device/queue_depth"
done
bdev=$(readlink /dev/disk/by-uuid/5217d83f-213e-4b42-b86e-20013325ba6c)
bdev=${bdev#../../}
hcil=$(readlink "/sys/block/$bdev/device")
hcil=${hcil#../../../}
fio --name="$bdev" --filename="/dev/$bdev" --buffered=0 --bs=512 --rw=randread \
  --ioengine=libaio --numjobs=4 --iodepth=16 --iodepth_batch=1 --thread \
  --loops=$((2**31)) &
pid=$!
sleep 1
systemctl hibernate
sleep 10
kill $pid
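
For readers following the description above, the admission rule that the patch adds to blk_queue_enter() can be summarized as a small decision function. This is a hedged user-space sketch, not kernel code: struct model_state, MODEL_WOULD_WAIT and model_enter_decision() are hypothetical names, the queue state is a plain snapshot, and the dying-queue case is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the queue state; not a kernel structure. */
struct model_state {
	bool frozen;       /* stands in for mq_freeze_depth != 0 */
	bool preempt_only; /* stands in for QUEUE_FLAG_PREEMPT_ONLY */
};

/* Returned where the real code would sleep on mq_freeze_wq. */
#define MODEL_WOULD_WAIT 1

/*
 * Returns 0 if the caller is admitted, -EBUSY if it asked not to wait
 * and cannot be admitted, MODEL_WOULD_WAIT otherwise.
 */
static int model_enter_decision(const struct model_state *s,
				bool nowait, bool preempt)
{
	assert(s != NULL);

	/*
	 * A reference is only available while the queue is not frozen.
	 * In preempt-only mode only preempt requests are admitted.
	 */
	if (!s->frozen && (preempt || !s->preempt_only))
		return 0;

	if (nowait)
		return -EBUSY;

	return MODEL_WOULD_WAIT;
}
```

For a quiesced device (preempt_only set, queue not frozen) a preempt request is admitted immediately, while a normal request either gets -EBUSY or waits until the flag has been cleared.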

Reported-by: Oleksandr Natalenko 
References: "I/O hangs after resuming from suspend-to-ram" 
(https://marc.info/?l=linux-block&m=150340235201348).
Signed-off-by: Bart Van Assche 
Cc: Martin K. Petersen 
Cc: Ming Lei 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 block/blk-core.c   | 41 +++--
 block/blk-mq.c |  4 ++--
 block/blk-timeout.c|  2 +-
 fs/block_dev.c |  4 ++--
 include/linux/blkdev.h |  2 +-
 5 files changed, 37 insertions(+), 16 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 9111a8f9c7a1..01b7afee58f0 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -351,10 +351,12 @@ void blk_set_preempt_only(struct request_queue *q, bool preempt_only)
 	unsigned long flags;
 
 	spin_lock_irqsave(q->queue_lock, flags);
-	if (preempt_only)
+	if (preempt_only) {
 		queue_flag_set(QUEUE_FLAG_PREEMPT_ONLY, q);
-	else
+	} else {
 		queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
+		wake_up_all(&q->mq_freeze_wq);
+	}
 	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 EXPORT_SYMBOL(blk_set_preempt_only);
@@ -776,13 +778,31 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(blk_alloc_queue);
 
-int blk_queue_enter(struct request_queue *q, bool nowait)
+/**
+ * blk_queue_enter() - try to increase q->q_usage_counter
+ * @q: request queue pointer
+ * @nowait: if the queue is frozen, do not wait until it is unfrozen
+ * @preempt: if QUEUE_FLAG_PREEMPT_ONLY has been set, do not wait until that
+ * flag has been cleared
+ */
+int blk_queue_enter(struct request_queue *q, bool nowait, bool preempt)
 {
 	while (true) {
 		int ret;
 
-		if (percpu_ref_tryget_live(&q->q_usage_counter))
-			return 0;
+		if (percpu_ref_tryget_live(&q->q_usage_counter)) {
+			/*
+			 * Since setting the PREEMPT_ONLY flag is followed
+			 * by a switch of q_usage_counter from per-cpu to
+			 * atomic mode and back to per-cpu and since the
+			 * switch to atomic mode uses call_rcu_sched(), it
+			 * is not necessary to call smp_rmb() here.
+			 */
+			if (preempt || !blk_queue_preempt_only(q))
+				return 0;
+			else
+				percpu_ref_put(&q->q_usage_counter);
+		}
 
 		if (nowait)
 			return -EBUSY;
@@ -797,7 +817,8 @@ int blk_queue_enter(struct request_queue *q, bool nowait)
 		smp_rmb();
 
 		ret = wait_event_interruptible(q->mq_freeze_wq,
-				!atomic_read(&q->mq_freeze_depth) ||
+				(atomic_read(&q->mq_freeze_depth) == 0 &&
+				 (preempt || !blk_queue_preempt_only(q))) ||
 				blk_queue_dying(q));
 		if (blk_queue_dying(q))
 			return -ENODEV;
@@ -1416,7 +1437,7 @@ static struct request *blk_old_get_request(struct request_queue *q,
 	create_io_context(gfp_mask, q->node);
 
 	ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM) ||
-			      (op & REQ_NOWAIT));
+			      (op & REQ_NOWAIT), op &