Re: [PATCH v3 5/6] block: Make SCSI device suspend and resume work reliably

2017-09-25 Thread Bart Van Assche
On Tue, 2017-09-26 at 06:51 +0800, Ming Lei wrote:
> On Mon, Sep 25, 2017 at 04:20:20PM +, Bart Van Assche wrote:
> > So that means that you have not noticed that it is safe to leave out that
> > smp_rmb() call because blk-mq queue freezing and unfreezing waits for a grace
> > period and hence waits until all CPUs have executed a full memory barrier?
> 
> No, memory barriers have to come in pairs, so this barrier has to be there.
> The other one, before freeze/unfreeze, can be removed because it is implied
> inside freeze/unfreeze.
> 
> Actually there should have been an smp_mb() between writing the percpu-refcount
> and reading the preempt_only flag; otherwise, if the two operations are
> reordered, queue freeze/unfreeze may see the counter as zero, and normal I/O
> can still be observed even after the queue is frozen and the SCSI device is
> put into the quiesced state.

That's wrong. Both memory barriers can be left out. Freezing + unfreezing a
block layer queue causes call_rcu() to be invoked and also makes the code
that performs the freezing + unfreezing wait until a grace period has
occurred on all CPUs. In other words, until all CPUs have performed a full
memory barrier. This is why both memory barriers can be left out. A quote
from "The RCU-barrier menagerie":
"There are many issues with excessive use of memory barriers, one of which
is their overhead. It turns out that RCU can be substituted for memory
barriers in a great many cases, and this substitution allows one of the
memory barriers of the pair to be removed. Unfortunately, in most cases, the
other memory barrier must be replaced by an expensive grace-period-wait
primitive such as synchronize_rcu(), but there are a few special cases where
synchronize_rcu() is not needed."
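
To make that pairing concrete, here is a rough sketch of the two sides
(illustrative only, not the literal code from this series; it assumes the
quiesce path sets the PREEMPT_ONLY flag and then freezes and unfreezes the
queue):

	/*
	 * Updater side (sketch): set PREEMPT_ONLY, then freeze and unfreeze
	 * the queue. The freeze/unfreeze cycle waits for an RCU grace period,
	 * so every CPU has executed a full memory barrier before it returns.
	 */
	blk_set_preempt_only(q, true);
	blk_mq_freeze_queue(q);
	blk_mq_unfreeze_queue(q);

	/*
	 * Reader side (sketch): a task that took its q_usage_counter
	 * reference before the grace period is waited for by the freeze;
	 * a task that takes it afterwards is guaranteed to observe the flag
	 * store, so no smp_rmb() is needed between the two reads.
	 */
	if (percpu_ref_tryget_live(&q->q_usage_counter)) {
		if (preempt || !blk_queue_preempt_only(q))
			return 0;
		percpu_ref_put(&q->q_usage_counter);
	}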

Bart.

Re: [PATCH v3 5/6] block: Make SCSI device suspend and resume work reliably

2017-09-25 Thread Ming Lei
On Mon, Sep 25, 2017 at 04:20:20PM +, Bart Van Assche wrote:
> On Mon, 2017-09-25 at 10:26 +0800, Ming Lei wrote:
> > On Fri, Sep 22, 2017 at 03:14:04PM -0700, Bart Van Assche wrote:
> > > +int blk_queue_enter(struct request_queue *q, bool nowait, bool preempt)
> > >  {
> > >   while (true) {
> > >   int ret;
> > >  
> > > - if (percpu_ref_tryget_live(&q->q_usage_counter))
> > > - return 0;
> > > + if (percpu_ref_tryget_live(&q->q_usage_counter)) {
> > > + /*
> > > +  * Ensure read order of q_usage_counter and the
> > > +  * PREEMPT_ONLY queue flag.
> > > +  */
> > > + smp_rmb();
> > > + if (preempt || !blk_queue_preempt_only(q))
> > > + return 0;
> > > + else
> > > + percpu_ref_put(&q->q_usage_counter);
> > > + }
> > 
> > Now you introduce an smp_rmb() and a test of the preempt flag in
> > blk-mq's fast path, which should have been avoided, so I
> > think this approach is worse than my patchset.
> 
> So that means that you have not noticed that it is safe to leave out that
smp_rmb() call because blk-mq queue freezing and unfreezing waits for a grace
> period and hence waits until all CPUs have executed a full memory barrier?

No, memory barriers have to come in pairs, so this barrier has to be there.
The other one, before freeze/unfreeze, can be removed because it is implied
inside freeze/unfreeze.

Actually there should have been an smp_mb() between writing the percpu-refcount
and reading the preempt_only flag; otherwise, if the two operations are
reordered, queue freeze/unfreeze may see the counter as zero, and normal I/O
can still be observed even after the queue is frozen and the SCSI device is
put into the quiesced state.
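
To illustrate, here is a minimal userspace sketch of the reordering I am
worried about (C11 atomics, not kernel code; the two variables are made-up
stand-ins for q->q_usage_counter and the PREEMPT_ONLY queue flag):

#include <stdatomic.h>

static atomic_int usage_counter;	/* stand-in for q->q_usage_counter */
static atomic_int preempt_only;		/* stand-in for QUEUE_FLAG_PREEMPT_ONLY */

/* Submission side, stand-in for blk_queue_enter(): take a reference,
 * then check the flag. */
static int enter_queue(void)
{
	atomic_fetch_add_explicit(&usage_counter, 1, memory_order_relaxed);
	/* Full barrier needed here; otherwise the flag load below may be
	 * reordered before the counter increment becomes visible. */
	atomic_thread_fence(memory_order_seq_cst);
	if (atomic_load_explicit(&preempt_only, memory_order_relaxed)) {
		atomic_fetch_sub_explicit(&usage_counter, 1,
					  memory_order_relaxed);
		return -1;	/* queue is preempt-only: back off */
	}
	return 0;
}

/* Quiesce side, stand-in for blk_set_preempt_only() plus the drain:
 * set the flag, then look at the counter. */
static int quiesce(void)
{
	atomic_store_explicit(&preempt_only, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	/* If both fences are dropped, both sides can read the stale value:
	 * enter_queue() misses the flag and quiesce() sees the counter as
	 * zero, so a normal request slips in after the device looks idle. */
	return atomic_load_explicit(&usage_counter, memory_order_relaxed);
}

This is the classic store-buffering pattern: with the fences in place, at
least one side is guaranteed to see the other's store, which is what the
smp_mb()/implied-freeze-barrier pairing provides in the kernel.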

-- 
Ming


Re: [PATCH v3 5/6] block: Make SCSI device suspend and resume work reliably

2017-09-25 Thread Bart Van Assche
On Mon, 2017-09-25 at 10:26 +0800, Ming Lei wrote:
> On Fri, Sep 22, 2017 at 03:14:04PM -0700, Bart Van Assche wrote:
> > +int blk_queue_enter(struct request_queue *q, bool nowait, bool preempt)
> >  {
> > while (true) {
> > int ret;
> >  
> > -   if (percpu_ref_tryget_live(&q->q_usage_counter))
> > -   return 0;
> > +   if (percpu_ref_tryget_live(&q->q_usage_counter)) {
> > +   /*
> > +* Ensure read order of q_usage_counter and the
> > +* PREEMPT_ONLY queue flag.
> > +*/
> > +   smp_rmb();
> > +   if (preempt || !blk_queue_preempt_only(q))
> > +   return 0;
> > +   else
> > +   percpu_ref_put(&q->q_usage_counter);
> > +   }
> 
> Now you introduce an smp_rmb() and a test of the preempt flag in
> blk-mq's fast path, which should have been avoided, so I
> think this approach is worse than my patchset.

So that means that you have not noticed that it is safe to leave out that
smp_rmb() call because blk-mq queue freezing and unfreezing waits for a grace
period and hence waits until all CPUs have executed a full memory barrier?

Bart.

Re: [PATCH v3 5/6] block: Make SCSI device suspend and resume work reliably

2017-09-24 Thread Ming Lei
On Fri, Sep 22, 2017 at 03:14:04PM -0700, Bart Van Assche wrote:
> It is essential during suspend and resume that neither the filesystem
> state nor the filesystem metadata in RAM changes. This is why SCSI
> devices are quiesced while the hibernation image is being written or
> restored. The SCSI core quiesces devices through scsi_device_quiesce()
> and scsi_device_resume(). In the SDEV_QUIESCE state, execution of
> non-preempt requests is deferred. This is realized by returning
> BLKPREP_DEFER from inside scsi_prep_state_check() for quiesced SCSI
> devices. Avoid that a full queue prevents power management requests
> from being submitted by deferring allocation of non-preempt requests
> for devices in the quiesced state. This patch has been tested by running
> the following commands and by verifying that after resume the fio job
> is still running:
> 
> for d in /sys/class/block/sd*[a-z]; do
>   hcil=$(readlink "$d/device")
>   hcil=${hcil#../../../}
>   echo 4 > "$d/queue/nr_requests"
>   echo 1 > "/sys/class/scsi_device/$hcil/device/queue_depth"
> done
> bdev=$(readlink /dev/disk/by-uuid/5217d83f-213e-4b42-b86e-20013325ba6c)
> bdev=${bdev#../../}
> hcil=$(readlink "/sys/block/$bdev/device")
> hcil=${hcil#../../../}
> fio --name="$bdev" --filename="/dev/$bdev" --buffered=0 --bs=512 --rw=randread \
>   --ioengine=libaio --numjobs=4 --iodepth=16 --iodepth_batch=1 --thread \
>   --loops=$((2**31)) &
> pid=$!
> sleep 1
> systemctl hibernate
> sleep 10
> kill $pid
> 
> Reported-by: Oleksandr Natalenko 
> References: "I/O hangs after resuming from suspend-to-ram"
> (https://marc.info/?l=linux-block&m=150340235201348).
> Signed-off-by: Bart Van Assche 
> Cc: Martin K. Petersen 
> Cc: Ming Lei 
> Cc: Christoph Hellwig 
> Cc: Hannes Reinecke 
> Cc: Johannes Thumshirn 
> ---
>  block/blk-core.c   | 37 -
>  block/blk-mq.c |  4 ++--
>  block/blk-timeout.c|  2 +-
>  fs/block_dev.c |  4 ++--
>  include/linux/blkdev.h |  2 +-
>  5 files changed, 34 insertions(+), 15 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 45cf3f56a730..971825bd4462 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -351,10 +351,12 @@ void blk_set_preempt_only(struct request_queue *q, bool preempt_only)
>   unsigned long flags;
>  
>   spin_lock_irqsave(q->queue_lock, flags);
> - if (preempt_only)
> + if (preempt_only) {
>   queue_flag_set(QUEUE_FLAG_PREEMPT_ONLY, q);
> - else
> + } else {
>   queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
> + wake_up_all(&q->mq_freeze_wq);
> + }
>   spin_unlock_irqrestore(q->queue_lock, flags);
>  }
>  EXPORT_SYMBOL(blk_set_preempt_only);
> @@ -773,13 +775,29 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
>  }
>  EXPORT_SYMBOL(blk_alloc_queue);
>  
> -int blk_queue_enter(struct request_queue *q, bool nowait)
> +/**
> + * blk_queue_enter() - try to increase q->q_usage_counter
> + * @q: request queue pointer
> + * @nowait: if the queue is frozen, do not wait until it is unfrozen
> + * @preempt: if QUEUE_FLAG_PREEMPT_ONLY has been set, do not wait until that
> + *   flag has been cleared
> + */
> +int blk_queue_enter(struct request_queue *q, bool nowait, bool preempt)
>  {
>   while (true) {
>   int ret;
>  
> - if (percpu_ref_tryget_live(&q->q_usage_counter))
> - return 0;
> + if (percpu_ref_tryget_live(&q->q_usage_counter)) {
> + /*
> +  * Ensure read order of q_usage_counter and the
> +  * PREEMPT_ONLY queue flag.
> +  */
> + smp_rmb();
> + if (preempt || !blk_queue_preempt_only(q))
> + return 0;
> + else
> + percpu_ref_put(&q->q_usage_counter);
> + }

Now you introduce an smp_rmb() and a test of the preempt flag in
blk-mq's fast path, which should have been avoided, so I
think this approach is worse than my patchset.

On some systems (with or without SCSI), SCSI quiesce may never be used
at all, so it is unfair to introduce this cost in the fast path for such
systems.

We can avoid that, so why not do so?

-- 
Ming


[PATCH v3 5/6] block: Make SCSI device suspend and resume work reliably

2017-09-22 Thread Bart Van Assche
It is essential during suspend and resume that neither the filesystem
state nor the filesystem metadata in RAM changes. This is why SCSI
devices are quiesced while the hibernation image is being written or
restored. The SCSI core quiesces devices through scsi_device_quiesce()
and scsi_device_resume(). In the SDEV_QUIESCE state, execution of
non-preempt requests is deferred. This is realized by returning
BLKPREP_DEFER from inside scsi_prep_state_check() for quiesced SCSI
devices. Avoid that a full queue prevents power management requests
from being submitted by deferring allocation of non-preempt requests
for devices in the quiesced state. This patch has been tested by running
the following commands and by verifying that after resume the fio job
is still running:

for d in /sys/class/block/sd*[a-z]; do
  hcil=$(readlink "$d/device")
  hcil=${hcil#../../../}
  echo 4 > "$d/queue/nr_requests"
  echo 1 > "/sys/class/scsi_device/$hcil/device/queue_depth"
done
bdev=$(readlink /dev/disk/by-uuid/5217d83f-213e-4b42-b86e-20013325ba6c)
bdev=${bdev#../../}
hcil=$(readlink "/sys/block/$bdev/device")
hcil=${hcil#../../../}
fio --name="$bdev" --filename="/dev/$bdev" --buffered=0 --bs=512 --rw=randread \
  --ioengine=libaio --numjobs=4 --iodepth=16 --iodepth_batch=1 --thread \
  --loops=$((2**31)) &
pid=$!
sleep 1
systemctl hibernate
sleep 10
kill $pid

Reported-by: Oleksandr Natalenko 
References: "I/O hangs after resuming from suspend-to-ram"
(https://marc.info/?l=linux-block&m=150340235201348).
Signed-off-by: Bart Van Assche 
Cc: Martin K. Petersen 
Cc: Ming Lei 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 block/blk-core.c   | 37 -
 block/blk-mq.c |  4 ++--
 block/blk-timeout.c|  2 +-
 fs/block_dev.c |  4 ++--
 include/linux/blkdev.h |  2 +-
 5 files changed, 34 insertions(+), 15 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 45cf3f56a730..971825bd4462 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -351,10 +351,12 @@ void blk_set_preempt_only(struct request_queue *q, bool preempt_only)
unsigned long flags;
 
spin_lock_irqsave(q->queue_lock, flags);
-   if (preempt_only)
+   if (preempt_only) {
queue_flag_set(QUEUE_FLAG_PREEMPT_ONLY, q);
-   else
+   } else {
queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
+   wake_up_all(&q->mq_freeze_wq);
+   }
spin_unlock_irqrestore(q->queue_lock, flags);
 }
 EXPORT_SYMBOL(blk_set_preempt_only);
@@ -773,13 +775,29 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(blk_alloc_queue);
 
-int blk_queue_enter(struct request_queue *q, bool nowait)
+/**
+ * blk_queue_enter() - try to increase q->q_usage_counter
+ * @q: request queue pointer
+ * @nowait: if the queue is frozen, do not wait until it is unfrozen
+ * @preempt: if QUEUE_FLAG_PREEMPT_ONLY has been set, do not wait until that
+ * flag has been cleared
+ */
+int blk_queue_enter(struct request_queue *q, bool nowait, bool preempt)
 {
while (true) {
int ret;
 
-   if (percpu_ref_tryget_live(&q->q_usage_counter))
-   return 0;
+   if (percpu_ref_tryget_live(&q->q_usage_counter)) {
+   /*
+* Ensure read order of q_usage_counter and the
+* PREEMPT_ONLY queue flag.
+*/
+   smp_rmb();
+   if (preempt || !blk_queue_preempt_only(q))
+   return 0;
+   else
+   percpu_ref_put(&q->q_usage_counter);
+   }
 
if (nowait)
return -EBUSY;
@@ -794,7 +812,8 @@ int blk_queue_enter(struct request_queue *q, bool nowait)
smp_rmb();
 
ret = wait_event_interruptible(q->mq_freeze_wq,
-   !atomic_read(&q->mq_freeze_depth) ||
+   (atomic_read(&q->mq_freeze_depth) == 0 &&
+(preempt || !blk_queue_preempt_only(q))) ||
blk_queue_dying(q));
if (blk_queue_dying(q))
return -ENODEV;
@@ -2172,6 +2191,7 @@ blk_qc_t generic_make_request(struct bio *bio)
 */
struct bio_list bio_list_on_stack[2];
blk_qc_t ret = BLK_QC_T_NONE;
+   const bool nowait = bio->bi_opf & REQ_NOWAIT;
 
if (!generic_make_request_checks(bio))
goto out;
@@ -2211,7 +2231,7 @@ blk_qc_t generic_make_request(struct bio *bio)
do {
struct request_queue *q = bio->bi_disk->queue;
 
-   if (likely(blk_queue_enter(q, bio->bi_opf & REQ_NOWAIT) == 0)) {
+