Re: [PATCH 1/4] bcache: finish incremental GC

2018-04-12 Thread Coly Li
On 2018/4/13 11:12 AM, tang.jun...@zte.com.cn wrote:
> Hi Coly,
> 
> Hello Coly,
> 
>> On 2018/4/12 2:38 PM, tang.jun...@zte.com.cn wrote:
>>> From: Tang Junhui 
>>>
>>> In the GC thread, we record the latest GC key in gc_done, which is
>>> expected to be used for incremental GC, but the current code does not
>>> implement it. When GC runs, front-side I/O is blocked until GC
>>> finishes, which can take a long time if there are a lot of btree nodes.
>>>
>>> This patch implements incremental GC. The main idea is that, when there
>>> are front-side I/Os, after GC has processed some nodes (100), we stop GC,
>>> release the lock on the btree node, and process the front-side I/Os for
>>> some time (100 ms), then go back to GC again.
>>>
>>> With this patch, I/Os are no longer blocked for the whole duration of GC,
>>> and the obvious drop of I/O to zero is gone.
>>>
>>> Signed-off-by: Tang Junhui 
>>
>> Hi Junhui,
>>
>> I reply my comments in line,
> Thanks for your comments in advance.
>>
>>> ---
>>>  drivers/md/bcache/bcache.h  |  5 +
>>>  drivers/md/bcache/btree.c   | 15 ++-
>>>  drivers/md/bcache/request.c |  3 +++
>>>  3 files changed, 22 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
>>> index 843877e..ab4c9ca 100644
>>> --- a/drivers/md/bcache/bcache.h
>>> +++ b/drivers/md/bcache/bcache.h
>>> @@ -445,6 +445,7 @@ struct cache {
>>>  
>>>  struct gc_stat {
>>>  size_tnodes;
>>> +size_t nodes_pre;
>>>  size_tkey_bytes;
>>>  
>>>  size_tnkeys;
>>> @@ -568,6 +569,10 @@ struct cache_set {
>>>   */
>>>  atomic_t rescale;
>>>  /*
>>> + * used for GC, to identify whether any front-side I/O is in flight
>>> + */
>>> +atomic_t io_inflight;
>>
>> Could you please rename the above variable to something like
>> search_inflight ? Just to be more explicit, considering we have many
>> types of io requests.
> OK, It looks better.
>>
>>> +/*
>>>   * When we invalidate buckets, we use both the priority and the amount
>>>   * of good data to determine which buckets to reuse first - to weight
>>>   * those together consistently we keep track of the smallest nonzero
>>> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
>>> index 81e8dc3..b36d276 100644
>>> --- a/drivers/md/bcache/btree.c
>>> +++ b/drivers/md/bcache/btree.c
>>> @@ -90,6 +90,9 @@
>>>  
>>>  #define MAX_NEED_GC64
>>>  #define MAX_SAVE_PRIO72
>>> +#define MIN_GC_NODES 100
>>> +#define GC_SLEEP_TIME 100
>>
>> How about naming the above macro as GC_SLEEP_MS ? So people may
>> understand the sleep time is 100ms without checking more context.
> OK, It looks better.
>>> +
>>>  
>>>  #define PTR_DIRTY_BIT(((uint64_t) 1 << 36))
>>>  
>>> @@ -1581,6 +1584,13 @@ static int btree_gc_recurse(struct btree *b, struct 
>>> btree_op *op,
>>>  memmove(r + 1, r, sizeof(r[0]) * (GC_MERGE_NODES - 1));
>>>  r->b = NULL;
>>>  
>>> +if (atomic_read(&b->c->io_inflight) &&
>>> +gc->nodes >= gc->nodes_pre + MIN_GC_NODES) {
>>
>> On 32-bit machines, gc->nodes is a 32-bit count; if it overflows, the
>> above check will be false for a very long time, and in some conditions
>> it might stay false forever after gc->nodes overflows.
>>
>> How about adding the following check before the if() statement ?
> 
>> /* in case gc->nodes is overflowed */
>> if (gc->nodes_pre < gc->nodes)
>> gc->nodes_pre = gc->nodes;
>>
> I think 32 bits is big enough: it can express a value in the billions,
> but the number of nodes is usually only around tens of thousands.
> Also, gc->nodes and gc->nodes_pre are reset to zero each time GC begins.
> 
> What do you think?

Oh, I see. Then I don't need to worry here. The patch is OK to me.

Thanks :-)

Coly Li


[code snipped]



Re: [PATCH 2/4] bcache: calculate the number of incremental GC nodes according to the total of btree nodes

2018-04-12 Thread tang . junhui
Hi Coly

>Hi Junhui,
>
>>  btree_node_prefetch(b, k);
>> +/*
>> + * initialize c->gc_stats.nodes
>> + * for incremental GC
>> + */
>> +b->c->gc_stats.nodes++;
>   
>I don't get the point why increase gc_stats.nodes here. Could you please
>explain a bit more ? 
 
We count gc_stats.nodes here because we need to know the total number of
btree nodes in order to calculate how many btree nodes we can process per
incremental GC pass the first time GC runs. If we don't collect it here, it
would be zero during the first GC, and we could not calculate the
incremental GC batch size correctly for that first run.

Though the count is not accurate (after we collect gc_stats.nodes, journal
replay and I/O writes may happen, so the real node count may change), we
only use it to estimate how many btree nodes we can GC each time, so that
doesn't matter.
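
To illustrate the idea (only a sketch with assumed names such as
GC_BATCH_DIVISOR and incremental_gc_batch(), not the code from this
series): once the total node count is known, each incremental pass can be
bounded roughly like this:

#define GC_BATCH_NODES_MIN	100	/* never do less than this per pass */
#define GC_BATCH_DIVISOR	10	/* aim to finish in about 10 passes */

static size_t incremental_gc_batch(size_t total_nodes)
{
	size_t batch = total_nodes / GC_BATCH_DIVISOR;

	/* total_nodes is 0 on the very first GC if it was never counted */
	if (batch < GC_BATCH_NODES_MIN)
		batch = GC_BATCH_NODES_MIN;

	return batch;
}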
 
Thanks.  
Tang Junhui 
 


Re: [PATCH 1/4] bcache: finish incremental GC

2018-04-12 Thread tang . junhui
Hi Coly,

Hello Coly,

> On 2018/4/12 2:38 PM, tang.jun...@zte.com.cn wrote:
> > From: Tang Junhui 
> > 
> > In the GC thread, we record the latest GC key in gc_done, which is
> > expected to be used for incremental GC, but the current code does not
> > implement it. When GC runs, front-side I/O is blocked until GC
> > finishes, which can take a long time if there are a lot of btree nodes.
> > 
> > This patch implements incremental GC. The main idea is that, when there
> > are front-side I/Os, after GC has processed some nodes (100), we stop GC,
> > release the lock on the btree node, and process the front-side I/Os for
> > some time (100 ms), then go back to GC again.
> > 
> > With this patch, I/Os are no longer blocked for the whole duration of GC,
> > and the obvious drop of I/O to zero is gone.
> > 
> > Signed-off-by: Tang Junhui 
> 
> Hi Junhui,
> 
> I reply my comments in line,
Thanks for your comments in advance.
> 
> > ---
> >  drivers/md/bcache/bcache.h  |  5 +
> >  drivers/md/bcache/btree.c   | 15 ++-
> >  drivers/md/bcache/request.c |  3 +++
> >  3 files changed, 22 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> > index 843877e..ab4c9ca 100644
> > --- a/drivers/md/bcache/bcache.h
> > +++ b/drivers/md/bcache/bcache.h
> > @@ -445,6 +445,7 @@ struct cache {
> >  
> >  struct gc_stat {
> >  size_tnodes;
> > +size_t nodes_pre;
> >  size_tkey_bytes;
> >  
> >  size_tnkeys;
> > @@ -568,6 +569,10 @@ struct cache_set {
> >   */
> >  atomic_t rescale;
> >  /*
> > + * used for GC, to identify whether any front-side I/O is in flight
> > + */
> > +atomic_t io_inflight;
> 
> Could you please rename the above variable to something like
> search_inflight ? Just to be more explicit, considering we have many
> types of io requests.
OK, It looks better.
> 
> > +/*
> >   * When we invalidate buckets, we use both the priority and the amount
> >   * of good data to determine which buckets to reuse first - to weight
> >   * those together consistently we keep track of the smallest nonzero
> > diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> > index 81e8dc3..b36d276 100644
> > --- a/drivers/md/bcache/btree.c
> > +++ b/drivers/md/bcache/btree.c
> > @@ -90,6 +90,9 @@
> >  
> >  #define MAX_NEED_GC64
> >  #define MAX_SAVE_PRIO72
> > +#define MIN_GC_NODES 100
> > +#define GC_SLEEP_TIME 100
> 
> How about naming the above macro as GC_SLEEP_MS ? So people may
> understand the sleep time is 100ms without checking more context.
OK, It looks better.
> > +
> >  
> >  #define PTR_DIRTY_BIT(((uint64_t) 1 << 36))
> >  
> > @@ -1581,6 +1584,13 @@ static int btree_gc_recurse(struct btree *b, struct 
> > btree_op *op,
> >  memmove(r + 1, r, sizeof(r[0]) * (GC_MERGE_NODES - 1));
> >  r->b = NULL;
> >  
> > +if (atomic_read(&b->c->io_inflight) &&
> > +gc->nodes >= gc->nodes_pre + MIN_GC_NODES) {
> 
> On 32-bit machines, gc->nodes is a 32-bit count; if it overflows, the
> above check will be false for a very long time, and in some conditions
> it might stay false forever after gc->nodes overflows.
> 
> How about adding the following check before the if() statement ?

> /* in case gc->nodes is overflowed */
> if (gc->nodes_pre < gc->nodes)
> gc->nodes_pre = gc->nodes;
>
I think 32 bits is big enough: it can express a value in the billions,
but the number of nodes is usually only around tens of thousands.
Also, gc->nodes and gc->nodes_pre are reset to zero each time GC begins.

What do you think?

> > +gc->nodes_pre =  gc->nodes;
> > +ret = -EAGAIN;
> > +break;
> > +}
> > +
> >  if (need_resched()) {
> >  ret = -EAGAIN;
> >  break;
> > @@ -1748,7 +1758,10 @@ static void bch_btree_gc(struct cache_set *c)
> >  closure_sync();
> >  cond_resched();
> >  
> > -if (ret && ret != -EAGAIN)
> > +if (ret == -EAGAIN)
> > +schedule_timeout_interruptible(msecs_to_jiffies
> > +   (GC_SLEEP_TIME));
> > +else if (ret)
> >  pr_warn("gc failed!");
> >  } while (ret);
> >  
> > diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> > index 643c3021..26750a9 100644
> > --- a/drivers/md/bcache/request.c
> > +++ b/drivers/md/bcache/request.c
> > @@ -637,7 +637,9 @@ static void do_bio_hook(struct search *s, struct bio 
> > *orig_bio)
> >  static void search_free(struct closure *cl)
> >  {
> >  struct search *s = container_of(cl, struct search, cl);
> > +
> >  bio_complete(s);
> > +atomic_dec(&s->d->c->io_inflight);
> >  
> >  if (s->iop.bio)
> >  bio_put(s->iop.bio);
> > @@ -655,6 +657,7 @@ static inline struct 

Re: usercopy whitelist woe in scsi_sense_cache

2018-04-12 Thread Kees Cook
On Thu, Apr 12, 2018 at 3:47 PM, Kees Cook  wrote:
> After fixing up some build issues in the middle of the 4.16 cycle, I
> get an unhelpful bisect result of commit 0a4b6e2f80aa ("Merge branch
> 'for-4.16/block'"). Instead of letting the test run longer, I'm going
> to switch to doing several shorter test boots per kernel and see if
> that helps. One more bisect coming...

Okay, so I can confirm the bisect points at the _merge_ itself, not a
specific patch. I'm not sure how to proceed here. It looks like some
kind of interaction between separate trees? Jens, do you have
suggestions on how to track this down?

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH v3] blk-mq: Avoid that submitting a bio concurrently with device removal triggers a crash

2018-04-12 Thread Joseph Qi


On 18/4/11 07:02, Bart Van Assche wrote:
> Because blkcg_exit_queue() is now called from inside blk_cleanup_queue()
> it is no longer safe to access cgroup information during or after the
> blk_cleanup_queue() call. Hence protect the generic_make_request_checks()
> call with blk_queue_enter() / blk_queue_exit().
> 
> Reported-by: Ming Lei 
> Fixes: a063057d7c73 ("block: Fix a race between request queue removal and the 
> block cgroup controller")
> Signed-off-by: Bart Van Assche 
> Cc: Ming Lei 
> Cc: Joseph Qi 

I've tested using the following steps:
1) start a fio job with buffered write;
2) then remove the scsi device that fio write to:
echo "scsi remove-single-device ${dev}" > /proc/scsi/scsi

After applying this patch, the reported oops has gone.

Tested-by: Joseph Qi 

> ---
> 
> Changes compared to v2: converted two ternary expressions into if-statements.
> 
> Changes compared to v1: guarded the blk_queue_exit() inside the loop with "if 
> (q)".
> 
>  block/blk-core.c | 35 +--
>  1 file changed, 29 insertions(+), 6 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 34e2f2227fd9..39308e874ffa 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2386,8 +2386,20 @@ blk_qc_t generic_make_request(struct bio *bio)
>* yet.
>*/
>   struct bio_list bio_list_on_stack[2];
> + blk_mq_req_flags_t flags = 0;
> + struct request_queue *q = bio->bi_disk->queue;
>   blk_qc_t ret = BLK_QC_T_NONE;
>  
> + if (bio->bi_opf & REQ_NOWAIT)
> + flags = BLK_MQ_REQ_NOWAIT;
> + if (blk_queue_enter(q, flags) < 0) {
> + if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
> + bio_wouldblock_error(bio);
> + else
> + bio_io_error(bio);
> + return ret;
> + }
> +
>   if (!generic_make_request_checks(bio))
>   goto out;
>  
> @@ -2424,11 +2436,22 @@ blk_qc_t generic_make_request(struct bio *bio)
>   bio_list_init(&bio_list_on_stack[0]);
>   current->bio_list = bio_list_on_stack;
>   do {
> - struct request_queue *q = bio->bi_disk->queue;
> - blk_mq_req_flags_t flags = bio->bi_opf & REQ_NOWAIT ?
> - BLK_MQ_REQ_NOWAIT : 0;
> + bool enter_succeeded = true;
> +
> + if (unlikely(q != bio->bi_disk->queue)) {
> + if (q)
> + blk_queue_exit(q);
> + q = bio->bi_disk->queue;
> + flags = 0;
> + if (bio->bi_opf & REQ_NOWAIT)
> + flags = BLK_MQ_REQ_NOWAIT;
> + if (blk_queue_enter(q, flags) < 0) {
> + enter_succeeded = false;
> + q = NULL;
> + }
> + }
>  
> - if (likely(blk_queue_enter(q, flags) == 0)) {
> + if (enter_succeeded) {
>   struct bio_list lower, same;
>  
>   /* Create a fresh bio_list for all subordinate requests 
> */
> @@ -2436,8 +2459,6 @@ blk_qc_t generic_make_request(struct bio *bio)
>   bio_list_init(&bio_list_on_stack[0]);
>   ret = q->make_request_fn(q, bio);
>  
> - blk_queue_exit(q);
> -
>   /* sort new bios into those for a lower level
>* and those for the same level
>*/
> @@ -2464,6 +2485,8 @@ blk_qc_t generic_make_request(struct bio *bio)
>   current->bio_list = NULL; /* deactivate */
>  
>  out:
> + if (q)
> + blk_queue_exit(q);
>   return ret;
>  }
>  EXPORT_SYMBOL(generic_make_request);
> 


Re: sr: get/drop reference to device in revalidate and check_events

2018-04-12 Thread Martin K. Petersen

Jens,

> We can't just use scsi_cd() to get the scsi_cd structure, we have
> to grab a live reference to the device. For both callbacks, we're
> not inside an open where we already hold a reference to the device.

Applied to 4.17/scsi-fixes, thanks!

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: System hung in I/O when booting with sd card

2018-04-12 Thread Bart Van Assche

On 04/12/18 18:38, Shawn Lin wrote:

I think your patch solves this. Thanks.

Tested-by: Shawn Lin 


Thanks for the testing!

Bart.



Re: [PATCH 4/4] bcache: fix I/O significant decline while backend devices registering

2018-04-12 Thread tang . junhui
Hi Coly,

> On 2018/4/12 2:38 PM, tang.jun...@zte.com.cn wrote:
> > From: Tang Junhui 
> > 
> > I attached several backend devices to the same cache set and produced lots
> > of dirty data by running small random I/O writes for a long time. Then I
> > continued running I/O on the other cached devices and stopped one cached
> > device. After a while, I registered the stopped device again, and I saw the
> > running I/O on the other cached devices drop significantly, sometimes even
> > to zero.
> > 
> > In the current code, bcache traverses every key and btree node to count
> > the dirty data under the read lock, so the write threads cannot take the
> > btree write lock. When the registering device has a lot of keys and btree
> > nodes, this can last several seconds, so the write I/Os on the other
> > cached devices are blocked and decline significantly.
> > 
> > With this patch, when a device registers to a cache set in which other
> > cached devices have running I/Os, we count the device's dirty data
> > incrementally, and do not block the other cached devices all the time.
> > 
> > Signed-off-by: Tang Junhui 
> > ---
> >  drivers/md/bcache/writeback.c | 26 +-
> >  1 file changed, 25 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> > index 56a3788..2467d3a 100644
> > --- a/drivers/md/bcache/writeback.c
> > +++ b/drivers/md/bcache/writeback.c
> > @@ -488,10 +488,14 @@ static int bch_writeback_thread(void *arg)
> >  }
> >  
> 
> Hi Junhui,
> 
> >  /* Init */
> > +#define INIT_KEYS_MIN 50
> > +#define INIT_KEYS_SLEEP_TIME 100
> >  
> 
> How about renaming the above macros to something like
> INIT_KEYS_WAIT_COUNTER and INIT_KEYS_SLEEP_MS ?
> 
> Also I suggested to rename io_inflight in previous patch review comments.
> 
> For rested part of this patch looks good to me.
> 
> Reviewed-by: Coly Li 
> 

Thanks for your review, I will fix the issues you mentioned
and send a V2 patch later.

> 
> >  struct sectors_dirty_init {
> >  struct btree_opop;
> >  unsignedinode;
> > +size_t count;
> > +struct bkey start;
> >  };
> >  
> >  static int sectors_dirty_init_fn(struct btree_op *_op, struct btree *b,
> > @@ -506,18 +510,38 @@ static int sectors_dirty_init_fn(struct btree_op 
> > *_op, struct btree *b,
> >  bcache_dev_sectors_dirty_add(b->c, KEY_INODE(k),
> >   KEY_START(k), KEY_SIZE(k));
> >  
> > +op->count++;
> > +if (atomic_read(&b->c->io_inflight) &&
> > +   !(op->count % INIT_KEYS_MIN)) {
> > +bkey_copy_key(&op->start, k);
> > +return -EAGAIN;
> > +}
> > +
> >  return MAP_CONTINUE;
> >  }
> >  
> >  void bch_sectors_dirty_init(struct bcache_device *d)
> >  {
> >  struct sectors_dirty_init op;
> > +int ret;
> >  
> >  bch_btree_op_init(&op.op, -1);
> >  op.inode = d->id;
> > +op.count = 0;
> > +op.start = KEY(op.inode, 0, 0);
> >  
> > -bch_btree_map_keys(&op.op, d->c, &KEY(op.inode, 0, 0),
> > +do {
> > +ret = bch_btree_map_keys(&op.op, d->c, &op.start,
> > sectors_dirty_init_fn, 0);
> > +if (ret == -EAGAIN)
> > +schedule_timeout_interruptible(
> > +msecs_to_jiffies(INIT_KEYS_SLEEP_TIME));
> > +else if (ret < 0) {
> > +pr_warn("sectors dirty init failed, ret=%d!", ret);
> > +break;
> > +}
> > +} while (ret == -EAGAIN);
> > +
> >  }
> >  
> >  void bch_cached_dev_writeback_init(struct cached_dev *dc)
> > 

Thanks.
Tang Junhui


Re: 4.15.14 crash with iscsi target and dvd

2018-04-12 Thread Ming Lei
On Thu, Apr 12, 2018 at 09:43:02PM -0400, Wakko Warner wrote:
> Ming Lei wrote:
> > On Tue, Apr 10, 2018 at 08:45:25PM -0400, Wakko Warner wrote:
> > > Sorry for the delay.  I reverted my change, added this one.  I didn't
> > > reboot, I just unloaded and loaded this one.
> > > Note: /dev/sr1 as seen from the initiator is /dev/sr0 (physical disc) on 
> > > the
> > > target.
> > > 
> > > Doesn't crash, however on the initiator I see this:
> > > [9273849.70] ISO 9660 Extensions: RRIP_1991A
> > > [9273863.359718] scsi_io_completion: 13 callbacks suppressed
> > > [9273863.359788] sr 26:0:0:0: [sr1] tag#1 UNKNOWN(0x2003) Result: 
> > > hostbyte=0x00 driverbyte=0x08
> > > [9273863.359909] sr 26:0:0:0: [sr1] tag#1 Sense Key : 0x2 [current] 
> > > [9273863.359974] sr 26:0:0:0: [sr1] tag#1 ASC=0x8 ASCQ=0x0 
> > > [9273863.360036] sr 26:0:0:0: [sr1] tag#1 CDB: opcode=0x28 28 00 00 22 f6 
> > > 96 00 00 80 00
> > > [9273863.360116] blk_update_request: 13 callbacks suppressed
> > > [9273863.360177] blk_update_request: I/O error, dev sr1, sector 9165400
> > > [9273875.864648] sr 26:0:0:0: [sr1] tag#1 UNKNOWN(0x2003) Result: 
> > > hostbyte=0x00 driverbyte=0x08
> > > [9273875.864738] sr 26:0:0:0: [sr1] tag#1 Sense Key : 0x2 [current] 
> > > [9273875.864801] sr 26:0:0:0: [sr1] tag#1 ASC=0x8 ASCQ=0x0 
> > > [9273875.864890] sr 26:0:0:0: [sr1] tag#1 CDB: opcode=0x28 28 00 00 22 f7 
> > > 16 00 00 80 00
> > > [9273875.864971] blk_update_request: I/O error, dev sr1, sector 9165912
> > > 
> > > To cause this, I mounted the dvd as seen in the first line and ran this
> > > command: find /cdrom2 -type f | xargs -tn1 cat > /dev/null
> > > I did some various tests.  Each test was done after umount and mount to
> > > clear the cache.
> > > cat  > /dev/null causes the message.
> > > dd if= of=/dev/null bs=2048 doesn't
> > >   using bs=4096 doesn't
> > >   using bs=64k doesn't
> > >   using bs=128k does
> > > cat uses a blocksize of 128k.
> > > 
> > > The following was done without being mounted.
> > > ddrescue -f -f /dev/sr1 /dev/null 
> > >   doesn't cause the message
> > > dd if=/dev/sr1 of=/dev/null bs=128k
> > >   doesn't cause the message
> > > using bs=256k causes the message once:
> > > [9275916.857409] sr 27:0:0:0: [sr1] tag#0 UNKNOWN(0x2003) Result: 
> > > hostbyte=0x00 driverbyte=0x08
> > > [9275916.857482] sr 27:0:0:0: [sr1] tag#0 Sense Key : 0x2 [current] 
> > > [9275916.857520] sr 27:0:0:0: [sr1] tag#0 ASC=0x8 ASCQ=0x0 
> > > [9275916.857556] sr 27:0:0:0: [sr1] tag#0 CDB: opcode=0x28 28 00 00 00 00 
> > > 00 00 00 80 00
> > > [9275916.857614] blk_update_request: I/O error, dev sr1, sector 0
> > > 
> > > If I access the disc from the target natively either by mounting and
> > > accessing files or working with the device directly (ie dd) no errors are
> > > logged on the target.
> > 
> > OK, thanks for your test.
> > 
> > Could you test the following patch and see if there is still the failure
> > message?
> > 
> > diff --git a/drivers/target/target_core_pscsi.c 
> > b/drivers/target/target_core_pscsi.c
> > index 0d99b242e82e..6137287b52fb 100644
> > --- a/drivers/target/target_core_pscsi.c
> > +++ b/drivers/target/target_core_pscsi.c
> > @@ -913,9 +913,11 @@ pscsi_map_sg(struct se_cmd *cmd, struct scatterlist 
> > *sgl, u32 sgl_nents,
> >  
> > rc = bio_add_pc_page(pdv->pdv_sd->request_queue,
> > bio, page, bytes, off);
> > +   if (rc != bytes)
> > +   goto fail;
> > pr_debug("PSCSI: bio->bi_vcnt: %d nr_vecs: %d\n",
> > bio_segments(bio), nr_vecs);
> > -   if (rc != bytes) {
> > +   if (/*rc != bytes*/0) {
> > pr_debug("PSCSI: Reached bio->bi_vcnt max:"
> > " %d i: %d bio: %p, allocating another"
> > " bio\n", bio->bi_vcnt, i, bio);
> 
> Target doesn't crash but the errors on the initiator are still there.

OK, then this error log isn't related to my commit, because the patch
I sent you in the last email simply reverts my commit.

But the following patch is the correct fix for your crash.

https://marc.info/?l=linux-kernel&m=152331690727052&w=2


Thanks,
Ming


Re: [PATCH] bcache: simplify the calculation of the total amount of flash dirty data

2018-04-12 Thread tang . junhui
Hi Coly,

> On 2018/4/12 3:21 PM, tang.jun...@zte.com.cn wrote:
> > From: Tang Junhui 
> > 
> > Currently we calculate the total amount of dirty data on flash-only
> > devices by adding up the dirty data of each flash-only device under the
> > register lock. It is very inefficient.
> > 
> > In this patch, we add a member flash_dev_dirty_sectors to struct cache_set
> > to record the total amount of flash-only device dirty data in real time,
> > so we no longer need to recalculate the total amount of dirty data.
> > 
> > Signed-off-by: Tang Junhui 
> 
> Hi Junhui,
> 
> The patch looks good to me, you have my
> Reviewed-by: Coly Li 
> 

Thanks for your review.

> Thanks.
> 
> Coly Li
> 
> > ---
> >  drivers/md/bcache/bcache.h|  1 +
> >  drivers/md/bcache/super.c |  2 ++
> >  drivers/md/bcache/writeback.c |  5 -
> >  drivers/md/bcache/writeback.h | 19 ---
> >  4 files changed, 7 insertions(+), 20 deletions(-)
> > 
> > diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> > index 843877e..cdbd596 100644
> > --- a/drivers/md/bcache/bcache.h
> > +++ b/drivers/md/bcache/bcache.h
> > @@ -490,6 +490,7 @@ struct cache_set {
> >  struct bcache_device**devices;
> >  struct list_headcached_devs;
> >  uint64_tcached_dev_sectors;
> > +atomic_long_t flash_dev_dirty_sectors;
> >  struct closurecaching;
> >  
> >  struct closuresb_write;
> > diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> > index b4d2892..772520e 100644
> > --- a/drivers/md/bcache/super.c
> > +++ b/drivers/md/bcache/super.c
> > @@ -1208,6 +1208,8 @@ static void flash_dev_free(struct closure *cl)
> >  {
> >  struct bcache_device *d = container_of(cl, struct bcache_device, cl);
> >  mutex_lock(&bch_register_lock);
> > +atomic_long_sub(bcache_dev_sectors_dirty(d),
> > +&d->c->flash_dev_dirty_sectors);
> >  bcache_device_free(d);
> >  mutex_unlock(&bch_register_lock);
> >  kobject_put(&d->kobj);
> > diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> > index 56a3788..ec50d76 100644
> > --- a/drivers/md/bcache/writeback.c
> > +++ b/drivers/md/bcache/writeback.c
> > @@ -23,7 +23,7 @@ static void __update_writeback_rate(struct cached_dev *dc)
> >  {
> >  struct cache_set *c = dc->disk.c;
> >  uint64_t cache_sectors = c->nbuckets * c->sb.bucket_size -
> > -bcache_flash_devs_sectors_dirty(c);
> > +atomic_long_read(&c->flash_dev_dirty_sectors);
> >  uint64_t cache_dirty_target =
> >  div_u64(cache_sectors * dc->writeback_percent, 100);
> >  int64_t target = div64_u64(cache_dirty_target * bdev_sectors(dc->bdev),
> > @@ -314,6 +314,9 @@ void bcache_dev_sectors_dirty_add(struct cache_set *c, 
> > unsigned inode,
> >  if (!d)
> >  return;
> >  
> > +if (UUID_FLASH_ONLY(&c->uuids[inode]))
> > +atomic_long_add(nr_sectors, &c->flash_dev_dirty_sectors);
> > +
> >  stripe = offset_to_stripe(d, offset);
> >  stripe_offset = offset & (d->stripe_size - 1);
> >  
> > diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
> > index a9e3ffb..1a7eacf 100644
> > --- a/drivers/md/bcache/writeback.h
> > +++ b/drivers/md/bcache/writeback.h
> > @@ -15,25 +15,6 @@ static inline uint64_t bcache_dev_sectors_dirty(struct 
> > bcache_device *d)
> >  return ret;
> >  }
> >  
> > -static inline uint64_t  bcache_flash_devs_sectors_dirty(struct cache_set 
> > *c)
> > -{
> > -uint64_t i, ret = 0;
> > -
> > -mutex_lock(&bch_register_lock);
> > -
> > -for (i = 0; i < c->nr_uuids; i++) {
> > -struct bcache_device *d = c->devices[i];
> > -
> > -if (!d || !UUID_FLASH_ONLY(&c->uuids[i]))
> > -continue;
> > -   ret += bcache_dev_sectors_dirty(d);
> > -}
> > -
> > -mutex_unlock(&bch_register_lock);
> > -
> > -return ret;
> > -}
> > -
> >  static inline unsigned offset_to_stripe(struct bcache_device *d,
> >  uint64_t offset)
> >  {
> > 

Thanks.
Tang Junhui


Re: System hung in I/O when booting with sd card

2018-04-12 Thread Shawn Lin

Hi Bart,

On 2018/4/12 17:05, Shawn Lin wrote:

Hi Bart,

On 2018/4/12 9:54, Bart Van Assche wrote:

On Thu, 2018-04-12 at 09:48 +0800, Shawn Lin wrote:

I ran into 2 times that my system hung here when booting with a ext4 sd
card. No sure how to reproduce it but it seems doesn't matter with the
ext4 as I see it again with a vfat sd card, this morning with Linus'
master branch. Is it a known issue or any idea how to debug this?


Please verify whether the following patch resolves this:

https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/commit/?h=for-linus&id=37f9579f4c31a6d698dbf3016d7bf132f9288d30



Thanks for your patch! I set up some devices to test reboot this morning
against Linus' master branch. Two devices out of 5 already hang without
your patch, but the other 5 still work with your patch.

Will update the result tomorrow morning.


I think your patch solves this. Thanks.

Tested-by: Shawn Lin 





Thanks,

Bart.


Re: usercopy whitelist woe in scsi_sense_cache

2018-04-12 Thread Kees Cook
On Thu, Apr 12, 2018 at 3:01 PM, Kees Cook  wrote:
> On Thu, Apr 12, 2018 at 12:04 PM, Oleksandr Natalenko
>  wrote:
>> Hi.
>>
>> On čtvrtek 12. dubna 2018 20:44:37 CEST Kees Cook wrote:
>>> My first bisect attempt gave me commit 5448aca41cd5 ("null_blk: wire
>>> up timeouts"), which seems insane given that null_blk isn't even built
>>> in the .config. I managed to get the testing automated now for a "git
>>> bisect run ...", so I'm restarting again to hopefully get a better
>>> answer. Normally the Oops happens in the first minute of running. I've
>>> set the timeout to 10 minutes for a "good" run...

After fixing up some build issues in the middle of the 4.16 cycle, I
get an unhelpful bisect result of commit 0a4b6e2f80aa ("Merge branch
'for-4.16/block'"). Instead of letting the test run longer, I'm going
to switch to doing several shorter test boots per kernel and see if
that helps. One more bisect coming...

>> Apparently, you do this on Linus' tree, right? If so, I won't compile it here
>> then.
>
> Actually, I didn't test Linus's tree, but I can do that after the most
> recent bisect finishes... I'm just trying to find where it went wrong
> in 4.16.

FWIW, I see an Oops under Linus's latest tree.

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH] block: Ensure that a request queue is dissociated from the cgroup controller

2018-04-12 Thread Bart Van Assche
On Thu, 2018-04-12 at 12:09 -0700, t...@kernel.org wrote:
> On Thu, Apr 12, 2018 at 06:56:26PM +, Bart Van Assche wrote:
> > On Thu, 2018-04-12 at 11:11 -0700, t...@kernel.org wrote:
> > > * Right now, blk_queue_enter/exit() doesn't have any annotations.
> > >   IOW, there's no way for paths which need enter locked to actually
> > >   assert that.  If we're gonna protect more things with queue
> > >   enter/exit, it gotta be annotated.
> > 
> > Hello Tejun,
> > 
> > One possibility is to check the count for the local CPU of 
> > q->q_usage_counter
> > at runtime, e.g. using WARN_ON_ONCE(). However, that might have a runtime
> > overhead. Another possibility is to follow the example of kernfs and to use
> > a lockdep map and lockdep_assert_held(). There might be other alternatives I
> > have not thought of.
> 
> Oh, I'd just do straight-forward lockdep annotation on
> queue_enter/exit functions and provide the necessary assert helpers.

Hello Tejun,

Is something like the patch below perhaps what you had in mind?

Thanks,

Bart.

Subject: [PATCH] block: Use lockdep to verify whether blk_queue_enter() is
 used when necessary

Suggested-by: Tejun Heo 
Signed-off-by: Bart Van Assche 
Cc: Tejun Heo 
---
 block/blk-cgroup.c |  2 ++
 block/blk-core.c   | 13 -
 block/blk.h|  1 +
 include/linux/blk-cgroup.h |  2 ++
 include/linux/blkdev.h |  1 +
 5 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 1c16694ae145..c34e13e76f90 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -145,6 +145,8 @@ struct blkcg_gq *blkg_lookup_slowpath(struct blkcg *blkcg,
 {
struct blkcg_gq *blkg;
 
+   lock_is_held(&q->q_enter_map);
+
/*
 * Hint didn't match.  Look up from the radix tree.  Note that the
 * hint can only be updated under queue_lock as otherwise @blkg
diff --git a/block/blk-core.c b/block/blk-core.c
index 39308e874ffa..a61dbe7f24a5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -931,8 +931,13 @@ int blk_queue_enter(struct request_queue *q, 
blk_mq_req_flags_t flags)
}
rcu_read_unlock();
 
-   if (success)
+   if (success) {
+   mutex_acquire_nest(&q->q_enter_map, 0, 0,
+  lock_is_held(&q->q_enter_map) ?
+  &q->q_enter_map : NULL,
+  _RET_IP_);
return 0;
+   }
 
if (flags & BLK_MQ_REQ_NOWAIT)
return -EBUSY;
@@ -959,6 +964,7 @@ int blk_queue_enter(struct request_queue *q, 
blk_mq_req_flags_t flags)
 
 void blk_queue_exit(struct request_queue *q)
 {
+   mutex_release(&q->q_enter_map, 0, _RET_IP_);
percpu_ref_put(&q->q_usage_counter);
 }
 
@@ -994,12 +1000,15 @@ struct request_queue *blk_alloc_queue_node(gfp_t 
gfp_mask, int node_id,
   spinlock_t *lock)
 {
struct request_queue *q;
+   static struct lock_class_key __key;
 
q = kmem_cache_alloc_node(blk_requestq_cachep,
gfp_mask | __GFP_ZERO, node_id);
if (!q)
return NULL;
 
+   lockdep_init_map(&q->q_enter_map, "q_enter_map", &__key, 0);
+
q->id = ida_simple_get(&blk_queue_ida, 0, 0, gfp_mask);
if (q->id < 0)
goto fail_q;
@@ -2264,6 +2273,8 @@ generic_make_request_checks(struct bio *bio)
goto end_io;
}
 
+   lock_is_held(&q->q_enter_map);
+
/*
 * For a REQ_NOWAIT based request, return -EOPNOTSUPP
 * if queue is not a request based queue.
diff --git a/block/blk.h b/block/blk.h
index 7cd64f533a46..26f313331b13 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -145,6 +145,7 @@ static inline void blk_queue_enter_live(struct 
request_queue *q)
 * been established, further references under that same context
 * need not check that the queue has been frozen (marked dead).
 */
+   mutex_acquire_nest(&q->q_enter_map, 0, 0, &q->q_enter_map, _RET_IP_);
percpu_ref_get(&q->q_usage_counter);
 }
 
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 6c666fd7de3c..52e2e2d9971e 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -266,6 +266,8 @@ static inline struct blkcg_gq *__blkg_lookup(struct blkcg 
*blkcg,
 {
struct blkcg_gq *blkg;
 
+   lock_is_held(&q->q_enter_map);
+
if (blkcg == &blkcg_root)
return q->root_blkg;
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 0d22f351a74b..a0b1adbd4406 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -631,6 +631,7 @@ struct request_queue {
struct rcu_head rcu_head;
wait_queue_head_t   mq_freeze_wq;
struct 

Re: usercopy whitelist woe in scsi_sense_cache

2018-04-12 Thread Kees Cook
On Thu, Apr 12, 2018 at 12:04 PM, Oleksandr Natalenko
 wrote:
> Hi.
>
> On čtvrtek 12. dubna 2018 20:44:37 CEST Kees Cook wrote:
>> My first bisect attempt gave me commit 5448aca41cd5 ("null_blk: wire
>> up timeouts"), which seems insane given that null_blk isn't even built
>> in the .config. I managed to get the testing automated now for a "git
>> bisect run ...", so I'm restarting again to hopefully get a better
>> answer. Normally the Oops happens in the first minute of running. I've
>> set the timeout to 10 minutes for a "good" run...
>
> Apparently, you do this on Linus' tree, right? If so, I won't compile it here
> then.

Actually, I didn't test Linus's tree, but I can do that after the most
recent bisect finishes... I'm just trying to find where it went wrong
in 4.16.

> Thanks for taking care of this.

Thanks for building the reproducer! :)

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH v2] block: ratelimite pr_err on IO path

2018-04-12 Thread Martin K. Petersen

Jack,

> + pr_err_ratelimited("%s: ref tag error at 
> location %llu (rcvd %u)\n",

I'm a bit concerned about dropping records of potential data loss.

Also, what are you doing that compels all these to be logged? This
should be a very rare occurrence.

-- 
Martin K. Petersen  Oracle Linux Engineering


[PATCH] target: Fix Fortify_panic kernel exception

2018-04-12 Thread Bryant G. Ly
[  496.212783] [ cut here ]
[  496.212784] kernel BUG at 
/build/linux-hwe-edge-ojNirv/linux-hwe-edge-4.15.0/lib/string.c:1052!
[  496.212789] Oops: Exception in kernel mode, sig: 5 [#1]
[  496.212791] LE SMP NR_CPUS=2048 NUMA pSeries
[  496.212795] Modules linked in: hvcs(OE) hvcserver dm_snapshot dm_bufio 
rpadlpar_io rpaphp ip6table_raw xt_CT xt_mac xt_tcpudp xt_comment xt_physdev 
xt_set ip_set_hash_net ip_set iptable_raw dccp_diag dccp tcp_diag udp_diag 
inet_diag unix_diag af_packet_diag netlink_diag target_core_pscsi(OE) 
target_core_file(OE) target_core_iblock(OE) iscsi_target_mod(OE) vxlan 
ip6_udp_tunnel udp_tunnel openvswitch nsh nf_nat_ipv6 target_core_user(OE) uio 
binfmt_misc xt_conntrack nf_conntrack_netlink nfnetlink nf_conntrack_netbios_ns 
nf_conntrack_broadcast nf_conntrack_ipv6 nf_defrag_ipv6 nbd ipt_REJECT 
nf_reject_ipv4 ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat 
nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 pseries_rng nf_nat ibmvmc(OE) 
nf_conntrack libcrc32c vmx_crypto crct10dif_vpmsum iptable_mangle iptable_filter
[  496.212854]  ip_tables ip6table_filter ip6_tables ebtables x_tables 
br_netfilter bridge stp llc ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp 
libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 mlx4_en ses enclosure 
scsi_transport_sas uas usb_storage ibmvscsis(OE) target_core_mod(OE) 
ibmveth(OE) mlx5_core mlx4_core mlxfw crc32c_vpmsum be2net tg3 ipr devlink
[  496.212888] CPU: 1 PID: 2587 Comm: kworker/1:2 Tainted: G   OE
4.15.0-15-generic #16~16.04.1-Ubuntu
[  496.212897] Workqueue: ibmvscsis300f ibmvscsis_scheduler [ibmvscsis]
[  496.212900] NIP:  c0cbbf00 LR: c0cbbefc CTR: 00655170
[  496.212903] REGS: c007e58e3580 TRAP: 0700   Tainted: G   OE 
(4.15.0-15-generic)
[  496.212906] MSR:  8282b033   CR: 
286c  XER: 2003
[  496.212915] CFAR: c018d634 SOFTE: 1
   GPR00: c0cbbefc c007e58e3800 c16bae00 
0022
   GPR04: c007fe94ce18 c007fe964368 0003 

   GPR08: 0007 c1193a74 0007fd7c 
3986
   GPR12: 2200 cfa80b00 c013a308 
c007f48adb00
   GPR16:    

   GPR20:   fef7 
0402
   GPR24:  f1a8cb40 03f0 
00648010
   GPR28: c005a360a570 c007f4095880 c000fc9e7e00 
c007f1f56000
[  496.212952] NIP [c0cbbf00] fortify_panic+0x28/0x38
[  496.212956] LR [c0cbbefc] fortify_panic+0x24/0x38
[  496.212958] Call Trace:
[  496.212960] [c007e58e3800] [c0cbbefc] fortify_panic+0x24/0x38 
(unreliable)
[  496.212965] [c007e58e3860] [df150c28] 
iblock_execute_write_same+0x3b8/0x3c0 [target_core_iblock]
[  496.212976] [c007e58e3910] [d6c737d4] 
__target_execute_cmd+0x54/0x150 [target_core_mod]
[  496.212982] [c007e58e3940] [d6d32ce4] 
ibmvscsis_write_pending+0x74/0xe0 [ibmvscsis]
[  496.212991] [c007e58e39b0] [d6c74fc8] 
transport_generic_new_cmd+0x318/0x370 [target_core_mod]
[  496.213001] [c007e58e3a30] [d6c75084] 
transport_handle_cdb_direct+0x64/0xd0 [target_core_mod]
[  496.213011] [c007e58e3aa0] [d6c75298] 
target_submit_cmd_map_sgls+0x1a8/0x320 [target_core_mod]
[  496.213021] [c007e58e3b30] [d6c75458] 
target_submit_cmd+0x48/0x60 [target_core_mod]
[  496.213026] [c007e58e3bd0] [d6d34c20] 
ibmvscsis_scheduler+0x370/0x600 [ibmvscsis]
[  496.213031] [c007e58e3c90] [c013135c] 
process_one_work+0x1ec/0x580
[  496.213035] [c007e58e3d20] [c0131798] worker_thread+0xa8/0x600
[  496.213039] [c007e58e3dc0] [c013a468] kthread+0x168/0x1b0
[  496.213044] [c007e58e3e30] [c000b528] 
ret_from_kernel_thread+0x5c/0xb4
[  496.213047] Instruction dump:
[  496.213049] 7c0803a6 4e800020 3c4c00a0 3842ef28 7c0802a6 f8010010 f821ffa1 
7c641b78
[  496.213055] 3c62ff94 3863dc00 4b4d16f1 6000 <0fe0>   

[  496.213062] ---[ end trace 4c7e8c92043f3868 ]---
[  654.577815] ibmvscsis 300f: connection lost with outstanding work

The patch fixes the above trace: when the size passed into
memcmp() is greater than the size of the data passed in via
ptr1 or ptr2, a fortify_panic() is triggered.
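
For context, the fortified memcmp() that fires here looks roughly like the
sketch below (simplified from memory, not the exact code in
include/linux/string.h): it checks the requested size against the
compile-time-known object sizes and calls fortify_panic() when the request
is larger.

__FORTIFY_INLINE int memcmp(const void *p, const void *q, __kernel_size_t size)
{
	size_t p_size = __builtin_object_size(p, 0);
	size_t q_size = __builtin_object_size(q, 0);

	/* is either buffer smaller than the requested compare length? */
	if (p_size < size || q_size < size)
		fortify_panic(__func__);	/* the BUG hit in the trace above */

	return __builtin_memcmp(p, q, size);
}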

Fixes: 2237498f0b5c ("target/iblock: Convert WRITE_SAME to 
blkdev_issue_zeroout")
Signed-off-by: Bryant G. Ly 
Reviewed-by: Steven Royer 
Tested-by: Taylor Jakobson 
Cc: Christoph Hellwig 
Cc: Nicholas Bellinger 
Cc: 
---
 

Re: [PATCH v2] block: Ensure that a request queue is dissociated from the cgroup controller

2018-04-12 Thread Alexandru Moise
On Thu, Apr 12, 2018 at 01:11:11PM -0600, Bart Van Assche wrote:
> Several block drivers call alloc_disk() followed by put_disk() if
> something fails before device_add_disk() is called without calling
> blk_cleanup_queue(). Make sure that also for this scenario a request
> queue is dissociated from the cgroup controller. This patch avoids
> that loading the parport_pc, paride and pf drivers triggers the
> following kernel crash:
> 
> BUG: KASAN: null-ptr-deref in pi_init+0x42e/0x580 [paride]
> Read of size 4 at addr 0008 by task modprobe/744
> Call Trace:
> dump_stack+0x9a/0xeb
> kasan_report+0x139/0x350
> pi_init+0x42e/0x580 [paride]
> pf_init+0x2bb/0x1000 [pf]
> do_one_initcall+0x8e/0x405
> do_init_module+0xd9/0x2f2
> load_module+0x3ab4/0x4700
> SYSC_finit_module+0x176/0x1a0
> do_syscall_64+0xee/0x2b0
> entry_SYSCALL_64_after_hwframe+0x42/0xb7
> 
> Reported-by: Alexandru Moise <00moses.alexande...@gmail.com>
> Fixes: a063057d7c73 ("block: Fix a race between request queue removal and the 
> block cgroup controller")
> Signed-off-by: Bart Van Assche 
> Tested-by: Alexandru Moise <00moses.alexande...@gmail.com>
> Cc: Tejun Heo 
> Cc: Alexandru Moise <00moses.alexande...@gmail.com>
> Cc: Joseph Qi 
> ---
> 
> Changes compared to v1:
> - Surrounded use of blkcg_root with #ifdef CONFIG_BLK_CGROUP / #endif.
> - Added Alex' Tested-by.
> 
>  block/blk-sysfs.c | 31 +++
>  1 file changed, 31 insertions(+)
> 
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index f8457d6f0190..46bb932721dc 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -793,6 +793,37 @@ static void __blk_release_queue(struct work_struct *work)
>   blk_stat_remove_callback(q, q->poll_cb);
>   blk_stat_free_callback(q->poll_cb);
>  
> + if (!blk_queue_dead(q)) {
> + /*
> +  * Last reference was dropped without having called
> +  * blk_cleanup_queue().
> +  */
> + WARN_ONCE(blk_queue_init_done(q),
> +   "request queue %p has been registered but 
> blk_cleanup_queue() has not been called for that queue\n",
> +   q);
> + bdi_put(q->backing_dev_info);
> + blkcg_exit_queue(q);
> +
> + if (q->elevator) {
> + ioc_clear_queue(q);
> + elevator_exit(q, q->elevator);
> + }
> + }
> +
> +#ifdef CONFIG_BLK_CGROUP
> + {
> + struct blkcg_gq *blkg;
> +
> + rcu_read_lock();
> + blkg = blkg_lookup(&blkcg_root, q);
> + rcu_read_unlock();
> +
> + WARN(blkg,
> +  "request queue %p is being released but it has not yet 
> been removed from the blkcg controller\n",
> +  q);
> + }
> +#endif
> +
>   blk_free_queue_stats(q->stats);
>  
>   blk_exit_rl(q, &q->root_rl);
I'm not qualified enough to review it but build and boot are good.

Tested-by: Alexandru Moise <00moses.alexande...@gmail.com>

Thanks,
../Alex
> -- 
> 2.16.2
> 


[PATCH v2] block: Ensure that a request queue is dissociated from the cgroup controller

2018-04-12 Thread Bart Van Assche
Several block drivers call alloc_disk() followed by put_disk() if
something fails before device_add_disk() is called without calling
blk_cleanup_queue(). Make sure that also for this scenario a request
queue is dissociated from the cgroup controller. This patch avoids
that loading the parport_pc, paride and pf drivers triggers the
following kernel crash:

BUG: KASAN: null-ptr-deref in pi_init+0x42e/0x580 [paride]
Read of size 4 at addr 0008 by task modprobe/744
Call Trace:
dump_stack+0x9a/0xeb
kasan_report+0x139/0x350
pi_init+0x42e/0x580 [paride]
pf_init+0x2bb/0x1000 [pf]
do_one_initcall+0x8e/0x405
do_init_module+0xd9/0x2f2
load_module+0x3ab4/0x4700
SYSC_finit_module+0x176/0x1a0
do_syscall_64+0xee/0x2b0
entry_SYSCALL_64_after_hwframe+0x42/0xb7

Reported-by: Alexandru Moise <00moses.alexande...@gmail.com>
Fixes: a063057d7c73 ("block: Fix a race between request queue removal and the 
block cgroup controller")
Signed-off-by: Bart Van Assche 
Tested-by: Alexandru Moise <00moses.alexande...@gmail.com>
Cc: Tejun Heo 
Cc: Alexandru Moise <00moses.alexande...@gmail.com>
Cc: Joseph Qi 
---

Changes compared to v1:
- Surrounded use of blkcg_root with #ifdef CONFIG_BLK_CGROUP / #endif.
- Added Alex' Tested-by.

 block/blk-sysfs.c | 31 +++
 1 file changed, 31 insertions(+)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index f8457d6f0190..46bb932721dc 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -793,6 +793,37 @@ static void __blk_release_queue(struct work_struct *work)
blk_stat_remove_callback(q, q->poll_cb);
blk_stat_free_callback(q->poll_cb);
 
+   if (!blk_queue_dead(q)) {
+   /*
+* Last reference was dropped without having called
+* blk_cleanup_queue().
+*/
+   WARN_ONCE(blk_queue_init_done(q),
+ "request queue %p has been registered but 
blk_cleanup_queue() has not been called for that queue\n",
+ q);
+   bdi_put(q->backing_dev_info);
+   blkcg_exit_queue(q);
+
+   if (q->elevator) {
+   ioc_clear_queue(q);
+   elevator_exit(q, q->elevator);
+   }
+   }
+
+#ifdef CONFIG_BLK_CGROUP
+   {
+   struct blkcg_gq *blkg;
+
+   rcu_read_lock();
+   blkg = blkg_lookup(&blkcg_root, q);
+   rcu_read_unlock();
+
+   WARN(blkg,
+"request queue %p is being released but it has not yet 
been removed from the blkcg controller\n",
+q);
+   }
+#endif
+
blk_free_queue_stats(q->stats);
 
blk_exit_rl(q, &q->root_rl);
-- 
2.16.2



Re: [PATCH] block: Ensure that a request queue is dissociated from the cgroup controller

2018-04-12 Thread t...@kernel.org
Hello,

On Thu, Apr 12, 2018 at 06:56:26PM +, Bart Van Assche wrote:
> On Thu, 2018-04-12 at 11:11 -0700, t...@kernel.org wrote:
> > * Right now, blk_queue_enter/exit() doesn't have any annotations.
> >   IOW, there's no way for paths which need enter locked to actually
> >   assert that.  If we're gonna protect more things with queue
> >   enter/exit, it gotta be annotated.
> 
> Hello Tejun,
> 
> One possibility is to check the count for the local CPU of q->q_usage_counter
> at runtime, e.g. using WARN_ON_ONCE(). However, that might have a runtime
> overhead. Another possibility is to follow the example of kernfs and to use
> a lockdep map and lockdep_assert_held(). There might be other alternatives I
> have not thought of.

Oh, I'd just do straight-forward lockdep annotation on
queue_enter/exit functions and provide the necessary assert helpers.
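
Something along these lines, as a minimal sketch (the field name enter_map
and the helper names are assumptions for illustration, not an actual
patch): hang a lockdep_map off the request queue, acquire it in
blk_queue_enter(), release it in blk_queue_exit(), and give callers an
assert helper.

/* assumed field in struct request_queue: struct lockdep_map enter_map; */

static inline void blk_queue_enter_annotate(struct request_queue *q)
{
	/* this context now holds a queue enter reference */
	mutex_acquire(&q->enter_map, 0, 0, _RET_IP_);
}

static inline void blk_queue_exit_annotate(struct request_queue *q)
{
	mutex_release(&q->enter_map, 0, _RET_IP_);
}

/* assert helper for paths that must run under blk_queue_enter() */
static inline void blk_queue_assert_entered(struct request_queue *q)
{
	WARN_ON_ONCE(debug_locks && !lock_is_held(&q->enter_map));
}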

Thanks.

-- 
tejun


Re: usercopy whitelist woe in scsi_sense_cache

2018-04-12 Thread Oleksandr Natalenko
Hi.

On čtvrtek 12. dubna 2018 20:44:37 CEST Kees Cook wrote:
> My first bisect attempt gave me commit 5448aca41cd5 ("null_blk: wire
> up timeouts"), which seems insane given that null_blk isn't even built
> in the .config. I managed to get the testing automated now for a "git
> bisect run ...", so I'm restarting again to hopefully get a better
> answer. Normally the Oops happens in the first minute of running. I've
> set the timeout to 10 minutes for a "good" run...

Apparently, you do this on Linus' tree, right? If so, I won't compile it here 
then.

Thanks for taking care of this.

Regards,
  Oleksandr




Re: [PATCH] block: Ensure that a request queue is dissociated from the cgroup controller

2018-04-12 Thread Bart Van Assche
On Thu, 2018-04-12 at 11:11 -0700, t...@kernel.org wrote:
> * Right now, blk_queue_enter/exit() doesn't have any annotations.
>   IOW, there's no way for paths which need enter locked to actually
>   assert that.  If we're gonna protect more things with queue
>   enter/exit, it gotta be annotated.

Hello Tejun,

One possibility is to check the count for the local CPU of q->q_usage_counter
at runtime, e.g. using WARN_ON_ONCE(). However, that might have a runtime
overhead. Another possibility is to follow the example of kernfs and to use
a lockdep map and lockdep_assert_held(). There might be other alternatives I
have not thought of.

Bart.





[PATCH] loop: fix aio/dio end of request clearing

2018-04-12 Thread Jens Axboe
If we read more than the user asked for, we zero fill the last
part. But the current code assumes that the request has just
one bio, and since that's not guaranteed to be true, we can
run into a situation where we attempt to advance a bio by
a bigger amount than its size.

Handle the zero filling appropriately, regardless of what bio
ends up needing to be cleared.

Signed-off-by: Jens Axboe 

---

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index c9d04497a415..d3aa96a2f369 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -455,9 +455,22 @@ static void lo_complete_rq(struct request *rq)
if (unlikely(req_op(cmd->rq) == REQ_OP_READ && cmd->use_aio &&
 cmd->ret >= 0 && cmd->ret < blk_rq_bytes(cmd->rq))) {
struct bio *bio = cmd->rq->bio;
+   long left = cmd->ret;
 
-   bio_advance(bio, cmd->ret);
-   zero_fill_bio(bio);
+   while (left >= bio->bi_iter.bi_size) {
+   left -= bio->bi_iter.bi_size;
+   bio = bio->bi_next;
+   if (WARN_ON_ONCE(!bio))
+   break;
+   }
+   while (bio) {
+   if (left) {
+   bio_advance(bio, left);
+   left = 0;
+   }
+   zero_fill_bio(bio);
+   bio = bio->bi_next;
+   }
}
 
blk_mq_end_request(rq, cmd->ret < 0 ? BLK_STS_IOERR : BLK_STS_OK);

-- 
Jens Axboe



Re: usercopy whitelist woe in scsi_sense_cache

2018-04-12 Thread Kees Cook
On Wed, Apr 11, 2018 at 5:03 PM, Kees Cook  wrote:
> On Wed, Apr 11, 2018 at 3:47 PM, Kees Cook  wrote:
>> On Tue, Apr 10, 2018 at 8:13 PM, Kees Cook  wrote:
>>> I'll see about booting with my own kernels, etc, and try to narrow this 
>>> down. :)
>>
>> If I boot kernels I've built, I no longer hit the bug in this VM
>> (though I'll keep trying). What compiler are you using?
>
> Ignore that: I've reproduced it with my kernels now. I think I messed
> up the initramfs initially. But with an exact copy of your .config,
> booting under Arch grub with initramfs, I see it. I'll start removing
> variables now... :P

My first bisect attempt gave me commit 5448aca41cd5 ("null_blk: wire
up timeouts"), which seems insane given that null_blk isn't even built
in the .config. I managed to get the testing automated now for a "git
bisect run ...", so I'm restarting again to hopefully get a better
answer. Normally the Oops happens in the first minute of running. I've
set the timeout to 10 minutes for a "good" run...

-Kees

-- 
Kees Cook
Pixel Security


[PATCH v2] block: do not use interruptible wait anywhere

2018-04-12 Thread Alan Jenkins
When blk_queue_enter() waits for a queue to unfreeze, or unset the
PREEMPT_ONLY flag, do not allow it to be interrupted by a signal.

The PREEMPT_ONLY flag was introduced later in commit 3a0a529971ec
("block, scsi: Make SCSI quiesce and resume work reliably").  Note the SCSI
device is resumed asynchronously, i.e. after un-freezing userspace tasks.

So that commit exposed the bug as a regression in v4.15.  A mysterious
SIGBUS (or -EIO) sometimes happened during the time the device was being
resumed.  Most frequently, there was no kernel log message, and we saw Xorg
or Xwayland killed by SIGBUS.[1]

[1] E.g. https://bugzilla.redhat.com/show_bug.cgi?id=1553979

Without this fix, I get an IO error in this test:

# dd if=/dev/sda of=/dev/null iflag=direct & \
  while killall -SIGUSR1 dd; do sleep 0.1; done & \
  echo mem > /sys/power/state ; \
  sleep 5; killall dd  # stop after 5 seconds

The interruptible wait was added to blk_queue_enter in
commit 3ef28e83ab15 ("block: generic request_queue reference counting").
Before then, the interruptible wait was only in blk-mq, but I don't think
it could ever have been correct.

Reviewed-by: Bart Van Assche 
Cc: sta...@vger.kernel.org
Signed-off-by: Alan Jenkins 
---
v2: fix indentation

 block/blk-core.c | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index abcb8684ba67..1a762f3980f2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -915,7 +915,6 @@ int blk_queue_enter(struct request_queue *q, 
blk_mq_req_flags_t flags)
 
while (true) {
bool success = false;
-   int ret;
 
rcu_read_lock();
if (percpu_ref_tryget_live(&q->q_usage_counter)) {
@@ -947,14 +946,12 @@ int blk_queue_enter(struct request_queue *q, 
blk_mq_req_flags_t flags)
 */
smp_rmb();
 
-   ret = wait_event_interruptible(q->mq_freeze_wq,
-   (atomic_read(&q->mq_freeze_depth) == 0 &&
-(preempt || !blk_queue_preempt_only(q))) ||
-   blk_queue_dying(q));
+   wait_event(q->mq_freeze_wq,
+  (atomic_read(&q->mq_freeze_depth) == 0 &&
+   (preempt || !blk_queue_preempt_only(q))) ||
+  blk_queue_dying(q));
if (blk_queue_dying(q))
return -ENODEV;
-   if (ret)
-   return ret;
}
 }
 
-- 
2.14.3



Re: [PATCH] block: Ensure that a request queue is dissociated from the cgroup controller

2018-04-12 Thread t...@kernel.org
Hello,

On Thu, Apr 12, 2018 at 04:29:09PM +, Bart Van Assche wrote:
> Any code that submits a bio or request needs blk_queue_enter() /
> blk_queue_exit() anyway. Please have a look at the following commit - you will
> see that that commit reduces the number of blk_queue_enter() / 
> blk_queue_exit()
> calls in the hot path:
> 
https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/commit/?h=for-linus&id=37f9579f4c31a6d698dbf3016d7bf132f9288d30

So, this can work, but it's still pretty fragile.

* Lock switching is fragile and we really should get rid of it.  This
  is very likely to come back and bite us.

* Right now, blk_queue_enter/exit() doesn't have any annotations.
  IOW, there's no way for paths which need enter locked to actually
  assert that.  If we're gonna protect more things with queue
  enter/exit, it gotta be annotated.

Thanks.

-- 
tejun


Re: [PATCH] block: do not use interruptible wait anywhere

2018-04-12 Thread Bart Van Assche
On Thu, 2018-04-12 at 17:23 +0100, Alan Jenkins wrote:
> @@ -947,14 +946,12 @@ int blk_queue_enter(struct request_queue *q, 
> blk_mq_req_flags_t flags)
>*/
>   smp_rmb();
>  
> - ret = wait_event_interruptible(q->mq_freeze_wq,
> + wait_event(q->mq_freeze_wq,
>   (atomic_read(&q->mq_freeze_depth) == 0 &&
>(preempt || !blk_queue_preempt_only(q))) ||
>   blk_queue_dying(q));
>   if (blk_queue_dying(q))
>   return -ENODEV;
> - if (ret)
> - return ret;
>   }
>  }

Hello Alan,

Please reindent the wait_event() arguments such that these remain aligned.

Anyway:

Reviewed-by: Bart Van Assche 




Re: [PATCH] block: Ensure that a request queue is dissociated from the cgroup controller

2018-04-12 Thread Bart Van Assche
On Thu, 2018-04-12 at 09:12 -0700, t...@kernel.org wrote:
> > Did you perhaps mean blkg_lookup_create()? That function has one caller,
> > namely blkcg_bio_issue_check(). The only caller of that function is
> > generic_make_request_checks(). A patch was posted on the linux-block mailing
> > list recently that surrounds that call with blk_queue_enter() / 
> > blk_queue_exit().
> > I think that prevents that blkcg_exit_queue() is called concurrently with
> > blkg_lookup_create().
> 
> Yeah, that'd solve the problem for that particular path, but what
> that's doing is adding another layer of refcnting which is used to
> implement the revoke (or sever) semantics.  This is a fragile approach
> tho because it isn't clear who's protecting what and there's nothing
> in blkg's usage which suggests it'd be protected that way and we're
> basically working around a different underlying problem (lock
> switching) by expanding the coverage of a different lock.

Hello Tejun,

Any code that submits a bio or request needs blk_queue_enter() /
blk_queue_exit() anyway. Please have a look at the following commit - you will
see that that commit reduces the number of blk_queue_enter() / blk_queue_exit()
calls in the hot path:

https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/commit/?h=for-linus&id=37f9579f4c31a6d698dbf3016d7bf132f9288d30

Thanks,

Bart.




[PATCH] block: do not use interruptible wait anywhere

2018-04-12 Thread Alan Jenkins
When blk_queue_enter() waits for a queue to unfreeze, or unset the
PREEMPT_ONLY flag, do not allow it to be interrupted by a signal.

The PREEMPT_ONLY flag was introduced later in commit 3a0a529971ec
("block, scsi: Make SCSI quiesce and resume work reliably").  Note the SCSI
device is resumed asynchronously, i.e. after un-freezing userspace tasks.

So that commit exposed the bug as a regression in v4.15.  A mysterious
SIGBUS (or -EIO) sometimes happened during the time the device was being
resumed.  Most frequently, there was no kernel log message, and we saw Xorg
or Xwayland killed by SIGBUS.[1]

[1] E.g. https://bugzilla.redhat.com/show_bug.cgi?id=1553979

Without this fix, I get an IO error in this test:

# dd if=/dev/sda of=/dev/null iflag=direct & \
  while killall -SIGUSR1 dd; do sleep 0.1; done & \
  echo mem > /sys/power/state ; \
  sleep 5; killall dd  # stop after 5 seconds

The interruptible wait was added to blk_queue_enter in
commit 3ef28e83ab15 ("block: generic request_queue reference counting").
Before then, the interruptible wait was only in blk-mq, but I don't think
it could ever have been correct.

Cc: sta...@vger.kernel.org
Signed-off-by: Alan Jenkins 
---
 block/blk-core.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index abcb8684ba67..5a6d20069364 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -915,7 +915,6 @@ int blk_queue_enter(struct request_queue *q, 
blk_mq_req_flags_t flags)
 
while (true) {
bool success = false;
-   int ret;
 
rcu_read_lock();
if (percpu_ref_tryget_live(&q->q_usage_counter)) {
@@ -947,14 +946,12 @@ int blk_queue_enter(struct request_queue *q, 
blk_mq_req_flags_t flags)
 */
smp_rmb();
 
-   ret = wait_event_interruptible(q->mq_freeze_wq,
+   wait_event(q->mq_freeze_wq,
(atomic_read(&q->mq_freeze_depth) == 0 &&
 (preempt || !blk_queue_preempt_only(q))) ||
blk_queue_dying(q));
if (blk_queue_dying(q))
return -ENODEV;
-   if (ret)
-   return ret;
}
 }
 
-- 
2.14.3



Re: [PATCH] block: Ensure that a request queue is dissociated from the cgroup controller

2018-04-12 Thread t...@kernel.org
Hello,

On Thu, Apr 12, 2018 at 04:03:52PM +, Bart Van Assche wrote:
> On Thu, 2018-04-12 at 08:37 -0700, Tejun Heo wrote:
> > On Thu, Apr 12, 2018 at 08:09:17AM -0600, Bart Van Assche wrote:
> > > I have retested hotunplugging by rerunning the srp-test software. It
> > > seems like you overlooked that this patch does not remove the
> > > blkcg_exit_queue() call from blk_cleanup_queue()? If a device is
> > > hotunplugged it is up to the block driver to call
> > > blk_cleanup_queue(). And blk_cleanup_queue() will call
> > > blkcg_exit_queue().
> > 
> > Hmm... what'd prevent blg_lookup_and_create() racing against that?
> 
> Hello Tejun,
> 
> Did you perhaps mean blkg_lookup_create()? That function has one caller,

Ah yeah, sorry about the sloppiness.

> namely blkcg_bio_issue_check(). The only caller of that function is
> generic_make_request_checks(). A patch was posted on the linux-block mailing
> list recently that surrounds that call with blk_queue_enter() / 
> blk_queue_exit().
> I think that prevents that blkcg_exit_queue() is called concurrently with
> blkg_lookup_create().

Yeah, that'd solve the problem for that particular path, but what
that's doing is adding another layer of refcnting which is used to
implement the revoke (or sever) semantics.  This is a fragile approach,
though, because it isn't clear who's protecting what, there's nothing
in blkg's usage which suggests it'd be protected that way, and we're
basically working around a different underlying problem (lock
switching) by expanding the coverage of a different lock.

I'd much prefer fixing the lock switching problem properly than
working around that shortcoming this way.

Thanks.

-- 
tejun


Re: [PATCH] block: Ensure that a request queue is dissociated from the cgroup controller

2018-04-12 Thread Bart Van Assche
On Thu, 2018-04-12 at 08:37 -0700, Tejun Heo wrote:
> On Thu, Apr 12, 2018 at 08:09:17AM -0600, Bart Van Assche wrote:
> > I have retested hotunplugging by rerunning the srp-test software. It
> > seems like you overlooked that this patch does not remove the
> > blkcg_exit_queue() call from blk_cleanup_queue()? If a device is
> > hotunplugged it is up to the block driver to call
> > blk_cleanup_queue(). And blk_cleanup_queue() will call
> > blkcg_exit_queue().
> 
> Hmm... what'd prevent blg_lookup_and_create() racing against that?

Hello Tejun,

Did you perhaps mean blkg_lookup_create()? That function has one caller,
namely blkcg_bio_issue_check(). The only caller of that function is
generic_make_request_checks(). A patch was posted on the linux-block mailing
list recently that surrounds that call with blk_queue_enter() / 
blk_queue_exit().
I think that prevents that blkcg_exit_queue() is called concurrently with
blkg_lookup_create().

Bart.





Re: [PATCH] block: Ensure that a request queue is dissociated from the cgroup controller

2018-04-12 Thread Tejun Heo
Hello, Bart.

On Thu, Apr 12, 2018 at 08:09:17AM -0600, Bart Van Assche wrote:
> I have retested hotunplugging by rerunning the srp-test software. It
> seems like you overlooked that this patch does not remove the
> blkcg_exit_queue() call from blk_cleanup_queue()? If a device is
> hotunplugged it is up to the block driver to call
> blk_cleanup_queue(). And blk_cleanup_queue() will call
> blkcg_exit_queue().

Hmm... what'd prevent blg_lookup_and_create() racing against that?

Thanks.

-- 
tejun


Re: [PATCH 4/4] bcache: fix I/O significant decline while backend devices registering

2018-04-12 Thread Coly Li
On 2018/4/12 2:38 PM, tang.jun...@zte.com.cn wrote:
> From: Tang Junhui 
> 
> I attached several backing devices to the same cache set and produced lots
> of dirty data by running small random write I/O for a long time. I then kept
> running I/O on the other cached devices, stopped one cached device, and
> after a while registered the stopped device again. The running I/O on the
> other cached devices dropped significantly, sometimes even jumping to zero.
> 
> In the current code, bcache traverses every key and btree node to count the
> dirty data while holding the read lock, so the write threads cannot take the
> btree write lock. When there are a lot of keys and btree nodes in the
> registering device, this can last several seconds, so the write I/Os on the
> other cached devices are blocked and decline significantly.
> 
> In this patch, when a device registers to a cache set that already has other
> cached devices with running I/Os, we count the amount of dirty data of the
> device in an incremental way, so we do not block the other cached devices
> the whole time.
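The pattern being described -- scan a batch, remember the position, yield, resume -- looks like this as a standalone userspace sketch (illustrative names and sizes only, not bcache code):

#include <stdio.h>

#define BATCH	50

static size_t scan_batch(const int *keys, size_t nkeys, size_t start, long *dirty)
{
	size_t i;

	for (i = start; i < nkeys; i++) {
		*dirty += keys[i];
		if ((i - start + 1) % BATCH == 0)
			return i + 1;		/* stop here, resume later */
	}
	return nkeys;				/* finished the whole keyspace */
}

int main(void)
{
	int keys[237];
	long dirty = 0;
	size_t pos = 0, n = sizeof(keys) / sizeof(keys[0]);
	size_t i;

	for (i = 0; i < n; i++)
		keys[i] = 1;

	while (pos < n) {
		pos = scan_batch(keys, n, pos, &dirty);
		/* the real code sleeps ~100ms here so foreground I/O
		 * can grab the btree lock before the next batch */
	}
	printf("dirty = %ld\n", dirty);		/* dirty = 237 */
	return 0;
}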
> 
> Signed-off-by: Tang Junhui 
> ---
>  drivers/md/bcache/writeback.c | 26 +-
>  1 file changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> index 56a3788..2467d3a 100644
> --- a/drivers/md/bcache/writeback.c
> +++ b/drivers/md/bcache/writeback.c
> @@ -488,10 +488,14 @@ static int bch_writeback_thread(void *arg)
>  }
>  

Hi Junhui,

>  /* Init */
> +#define INIT_KEYS_MIN50
> +#define INIT_KEYS_SLEEP_TIME 100
>  

How about renaming the above macros to something like
INIT_KEYS_WAIT_COUNTER and INIT_KEYS_SLEEP_MS ?

I also suggested renaming io_inflight in my review comments on the previous patch.

The rest of this patch looks good to me.

Reviewed-by: Coly Li 

Thanks.

Coly Li

>  struct sectors_dirty_init {
>   struct btree_op op;
>   unsignedinode;
> + size_t  count;
> + struct bkey start;
>  };
>  
>  static int sectors_dirty_init_fn(struct btree_op *_op, struct btree *b,
> @@ -506,18 +510,38 @@ static int sectors_dirty_init_fn(struct btree_op *_op, 
> struct btree *b,
>   bcache_dev_sectors_dirty_add(b->c, KEY_INODE(k),
>KEY_START(k), KEY_SIZE(k));
>  
> + op->count++;
> + if (atomic_read(>c->io_inflight) &&
> +!(op->count % INIT_KEYS_MIN)) {
> + bkey_copy_key(>start, k);
> + return -EAGAIN;
> + }
> +
>   return MAP_CONTINUE;
>  }
>  
>  void bch_sectors_dirty_init(struct bcache_device *d)
>  {
>   struct sectors_dirty_init op;
> + int ret;
>  
>   bch_btree_op_init(, -1);
>   op.inode = d->id;
> + op.count = 0;
> + op.start = KEY(op.inode, 0, 0);
>  
> - bch_btree_map_keys(, d->c, (op.inode, 0, 0),
> + do {
> + ret = bch_btree_map_keys(, d->c, ,
>  sectors_dirty_init_fn, 0);
> + if (ret == -EAGAIN)
> + schedule_timeout_interruptible(
> + msecs_to_jiffies(INIT_KEYS_SLEEP_TIME));
> + else if (ret < 0) {
> + pr_warn("sectors dirty init failed, ret=%d!", ret);
> + break;
> + }
> + } while (ret == -EAGAIN);
> +
>  }
>  
>  void bch_cached_dev_writeback_init(struct cached_dev *dc)
> 



Re: [PATCH 3/4] bcache: notify allocator to prepare for GC

2018-04-12 Thread Coly Li
On 2018/4/12 2:38 PM, tang.jun...@zte.com.cn wrote:
> From: Tang Junhui 
> 
> Since no new bucket can be allocated during GC, front side I/Os would run
> out of all the buckets. So notify the allocator to pack the free_inc queue
> full of buckets before GC starts; then we have enough buckets for front
> side I/Os during the GC period.
> 
> Signed-off-by: Tang Junhui 

Hi Junhui,

This method is complicated IMHO. Why not simply call wake_up_gc(ca->set)
in bch_allocator_thread() after invalidate_buckets()?
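I mean something along these lines (a completely untested sketch, just to show where the wakeup would go, mirroring the context quoted below):

 		invalidate_buckets(ca);
 
+		/*
+		 * free_inc is as full as this pass will make it; let GC
+		 * run now, instead of a separate prepare_gc handshake.
+		 */
+		wake_up_gc(ca->set);
+
 		/*
 		 * Now, we write their new gens to disk so we can start writing
 		 * new stuff to them: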

Thanks.

Coly Li

> ---
>  drivers/md/bcache/alloc.c  | 11 +++-
>  drivers/md/bcache/bcache.h |  2 ++
>  drivers/md/bcache/btree.c  | 69 
> --
>  drivers/md/bcache/btree.h  |  4 +++
>  4 files changed, 83 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
> index a0cc1bc..85020cc 100644
> --- a/drivers/md/bcache/alloc.c
> +++ b/drivers/md/bcache/alloc.c
> @@ -323,7 +323,8 @@ static int bch_allocator_thread(void *arg)
>* possibly issue discards to them, then we add the bucket to
>* the free list:
>*/
> - while (!fifo_empty(>free_inc)) {
> + while (!fifo_empty(>free_inc) &&
> +ca->prepare_gc != GC_PREPARING) {
>   long bucket;
>  
>   fifo_pop(>free_inc, bucket);
> @@ -353,6 +354,14 @@ static int bch_allocator_thread(void *arg)
>   invalidate_buckets(ca);
>  
>   /*
> +  * Let GC continue
> +  */
> + if (ca->prepare_gc == GC_PREPARING) {
> + ca->prepare_gc = GC_PREPARED;
> + wake_up_gc(ca->set);
> + }
> +
> + /*
>* Now, we write their new gens to disk so we can start writing
>* new stuff to them:
>*/
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index ab4c9ca..61a6ff2 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -428,6 +428,8 @@ struct cache {
>* cpu
>*/
>   unsignedinvalidate_needs_gc;
> + /* used to notify allocator to prepare GC*/
> + unsigned intprepare_gc;
>  
>   booldiscard; /* Get rid of? */
>  
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> index 10f967e..aba0700 100644
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -1805,6 +1805,46 @@ static void bch_btree_gc(struct cache_set *c)
>   bch_moving_gc(c);
>  }
>  
> +static bool cache_gc_prepare_none(struct cache_set *c)
> +{
> + struct cache *ca;
> + unsigned int i;
> +
> + for_each_cache(ca, c, i)
> + if (ca->prepare_gc != GC_PREPARE_NONE)
> + return false;
> + return true;
> +}
> +
> +static bool cache_gc_prepare_ok(struct cache_set *c)
> +{
> + struct cache *ca;
> + unsigned int i;
> +
> + for_each_cache(ca, c, i)
> + if (ca->prepare_gc != GC_PREPARED)
> + return false;
> + return true;
> +}
> +
> +static void set_cache_gc_prepare_none(struct cache_set *c)
> +{
> + struct cache *ca;
> + unsigned int i;
> +
> + for_each_cache(ca, c, i)
> + ca->prepare_gc = GC_PREPARE_NONE;
> +}
> +
> +static void set_cache_gc_preparing(struct cache_set *c)
> +{
> + struct cache *ca;
> + unsigned int i;
> +
> + for_each_cache(ca, c, i)
> + ca->prepare_gc = GC_PREPARING;
> +}
> +
>  static bool gc_should_run(struct cache_set *c)
>  {
>   struct cache *ca;
> @@ -1814,8 +1854,33 @@ static bool gc_should_run(struct cache_set *c)
>   if (ca->invalidate_needs_gc)
>   return true;
>  
> - if (atomic_read(>sectors_to_gc) < 0)
> - return true;
> + if (atomic_read(>sectors_to_gc) < 0) {
> + mutex_lock(>bucket_lock);
> + if (cache_gc_prepare_none(c)) {
> + /*
> +  * notify allocator thread to prepare for GC
> +  */
> + set_cache_gc_preparing(c);
> + mutex_unlock(>bucket_lock);
> + pr_info("gc preparing");
> + return false;
> + } else if (cache_gc_prepare_ok(c)) {
> + /*
> +  * alloc thread finished preparing,
> +  * and continue to GC
> +  */
> + set_cache_gc_prepare_none(c);
> + mutex_unlock(>bucket_lock);
> + pr_info("gc prepared ok");
> + return true;
> + } else {
> + /*
> +  * waiting for the allocator to finish preparing
> +  */
> + 

Re: [PATCH 2/4] bcache: calculate the number of incremental GC nodes according to the total of btree nodes

2018-04-12 Thread Coly Li
On 2018/4/12 2:38 PM, tang.jun...@zte.com.cn wrote:
> From: Tang Junhui 
> 
> This patch is based on "[PATCH] bcache: finish incremental GC".
> 
> Since incremental GC stops for 100ms whenever front side I/O comes in, when
> there are many btree nodes and GC only processes a constant number (100) of
> nodes each time, GC would last a long time, the front side I/Os would run
> out of buckets (since no new bucket can be allocated during GC), and the
> I/Os would be blocked again.
> 
> So GC should not process a constant number of nodes, but a number that
> varies with the total number of btree nodes. In this patch, GC is divided
> into a constant number (100) of rounds, so when there are many btree nodes,
> GC processes more nodes each time; otherwise it processes fewer nodes each
> time (but no fewer than MIN_GC_NODES).
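To give the scaling rule above a concrete feel, here is a standalone userspace sketch; it borrows the MAX_GC_TIMES/MIN_GC_NODES names from the patch, everything else is made up for illustration:

#include <stdio.h>

#define MAX_GC_TIMES	100
#define MIN_GC_NODES	100

static size_t gc_batch_nodes(size_t total_nodes)
{
	size_t min_nodes = total_nodes / MAX_GC_TIMES;

	return min_nodes < MIN_GC_NODES ? MIN_GC_NODES : min_nodes;
}

int main(void)
{
	printf("%zu\n", gc_batch_nodes(5000));		/* 100: small tree, keep the floor */
	printf("%zu\n", gc_batch_nodes(1000000));	/* 10000: big tree, bigger batches */
	return 0;
}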
> 
> Signed-off-by: Tang Junhui 
> ---
>  drivers/md/bcache/btree.c | 37 ++---
>  1 file changed, 34 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> index b36d276..10f967e 100644
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -90,10 +90,10 @@
>  
>  #define MAX_NEED_GC  64
>  #define MAX_SAVE_PRIO72
> +#define MAX_GC_TIMES 100
>  #define MIN_GC_NODES 100
>  #define GC_SLEEP_TIME100
>  
> -
>  #define PTR_DIRTY_BIT(((uint64_t) 1 << 36))
>  
>  #define PTR_HASH(c, k)   
> \
> @@ -1519,6 +1519,31 @@ static unsigned btree_gc_count_keys(struct btree *b)
>   return ret;
>  }
>  
> +static size_t btree_gc_min_nodes(struct cache_set *c)
> +{
> + size_t  min_nodes;
> +
> + /*
> +  * Since incremental GC would stop 100ms when front
> +  * side I/O comes, so when there are many btree nodes,
> +  * if GC only processes constant (100) nodes each time,
> +  * GC would last a long time, and the front side I/Os
> +  * would run out of the buckets (since no new bucket
> +  * can be allocated during GC), and be blocked again.
> +  * So GC should not process constant nodes, but varied
> +  * nodes according to the number of btree nodes, which
> +  * realized by dividing GC into constant(100) times,
> +  * so when there are many btree nodes, GC can process
> +  * more nodes each time, otherwise, GC will process less
> +  * nodes each time (but no less than MIN_GC_NODES)
> +  */
> + min_nodes = c->gc_stats.nodes / MAX_GC_TIMES;
> + if (min_nodes < MIN_GC_NODES)
> + min_nodes = MIN_GC_NODES;
> +
> + return min_nodes;
> +}
> +
>  static int btree_gc_recurse(struct btree *b, struct btree_op *op,
>   struct closure *writes, struct gc_stat *gc)
>  {
> @@ -1585,7 +1610,7 @@ static int btree_gc_recurse(struct btree *b, struct 
> btree_op *op,
>   r->b = NULL;
>  
>   if (atomic_read(>c->io_inflight) &&
> - gc->nodes >= gc->nodes_pre + MIN_GC_NODES) {
> + gc->nodes >= gc->nodes_pre + btree_gc_min_nodes(b->c)) {
>   gc->nodes_pre =  gc->nodes;
>   ret = -EAGAIN;
>   break;
> @@ -1841,8 +1866,14 @@ static int bch_btree_check_recurse(struct btree *b, 
> struct btree_op *op)
>   do {
>   k = bch_btree_iter_next_filter(, >keys,
>  bch_ptr_bad);
> - if (k)
> + if (k) {

Hi Junhui,

>   btree_node_prefetch(b, k);
> + /*
> +	 * initialize c->gc_stats.nodes
> +  * for incremental GC
> +  */
> + b->c->gc_stats.nodes++;

I don't get why gc_stats.nodes is increased here. Could you please
explain a bit more?

Thanks.

Coly Li


> + }
>  
>   if (p)
>   ret = btree(check_recurse, p, b, op);
> 



Re: [PATCH] block: Ensure that a request queue is dissociated from the cgroup controller

2018-04-12 Thread Bart Van Assche

On 04/12/18 07:51, Tejun Heo wrote:

On Wed, Apr 11, 2018 at 07:58:52PM -0600, Bart Van Assche wrote:

Several block drivers call alloc_disk() followed by put_disk() if
something fails before device_add_disk() is called without calling
blk_cleanup_queue(). Make sure that also for this scenario a request
queue is dissociated from the cgroup controller. This patch avoids
that loading the parport_pc, paride and pf drivers triggers the
following kernel crash:

BUG: KASAN: null-ptr-deref in pi_init+0x42e/0x580 [paride]
Read of size 4 at addr 0008 by task modprobe/744
Call Trace:
dump_stack+0x9a/0xeb
kasan_report+0x139/0x350
pi_init+0x42e/0x580 [paride]
pf_init+0x2bb/0x1000 [pf]
do_one_initcall+0x8e/0x405
do_init_module+0xd9/0x2f2
load_module+0x3ab4/0x4700
SYSC_finit_module+0x176/0x1a0
do_syscall_64+0xee/0x2b0
entry_SYSCALL_64_after_hwframe+0x42/0xb7

Reported-by: Alexandru Moise <00moses.alexande...@gmail.com>
Fixes: a063057d7c73 ("block: Fix a race between request queue removal and the block cgroup controller")
Signed-off-by: Bart Van Assche 
Cc: Alexandru Moise <00moses.alexande...@gmail.com>
Cc: Tejun Heo 


So, this might be correct for this reported case but I don't think
it's correct in general.  There's no synchronization between blkcg
q->lock usages and blk_cleanup_queue().  e.g. hotunplugging and
shutting down a device while file operations are still in progress can
easily blow this up.


Hello Tejun,

I have retested hotunplugging by rerunning the srp-test software. It 
seems like you overlooked that this patch does not remove the 
blkcg_exit_queue() call from blk_cleanup_queue()? If a device is 
hotunplugged it is up to the block driver to call blk_cleanup_queue(). 
And blk_cleanup_queue() will call blkcg_exit_queue().


Bart.



Re: [PATCH 1/4] bcache: finish incremental GC

2018-04-12 Thread Coly Li
On 2018/4/12 2:38 PM, tang.jun...@zte.com.cn wrote:
> From: Tang Junhui 
> 
> In GC thread, we record the latest GC key in gc_done, which is expected
> to be used for incremental GC, but in currently code, we didn't realize
> it. When GC runs, front side IO would be blocked until the GC over, it
> would be a long time if there is a lot of btree nodes.
> 
> This patch realizes incremental GC, the main ideal is that, when there
> are front side I/Os, after GC some nodes (100), we stop GC, release locker
> of the btree node, and go to process the front side I/Os for some times
> (100 ms), then go back to GC again.
> 
> By this patch, when we doing GC, I/Os are not blocked all the time, and
> there is no obvious I/Os zero jump problem any more.
> 
> Signed-off-by: Tang Junhui 

Hi Junhui,

I reply my comments in line,

> ---
>  drivers/md/bcache/bcache.h  |  5 +
>  drivers/md/bcache/btree.c   | 15 ++-
>  drivers/md/bcache/request.c |  3 +++
>  3 files changed, 22 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 843877e..ab4c9ca 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -445,6 +445,7 @@ struct cache {
>  
>  struct gc_stat {
>   size_t  nodes;
> + size_t  nodes_pre;
>   size_t  key_bytes;
>  
>   size_t  nkeys;
> @@ -568,6 +569,10 @@ struct cache_set {
>*/
>   atomic_trescale;
>   /*
> +  * used for GC, identify if any front side I/Os is inflight
> +  */
> + atomic_tio_inflight;

Could you please to rename the above variable to something like
search_inflight ? Just to be more explicit, considering we have many
types of io requests.

> + /*
>* When we invalidate buckets, we use both the priority and the amount
>* of good data to determine which buckets to reuse first - to weight
>* those together consistently we keep track of the smallest nonzero
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> index 81e8dc3..b36d276 100644
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -90,6 +90,9 @@
>  
>  #define MAX_NEED_GC  64
>  #define MAX_SAVE_PRIO72
> +#define MIN_GC_NODES 100
> +#define GC_SLEEP_TIME100

How about naming the above macro as GC_SLEEP_MS ? So people may
understand the sleep time is 100ms without checking more context.

> +
>  
>  #define PTR_DIRTY_BIT(((uint64_t) 1 << 36))
>  
> @@ -1581,6 +1584,13 @@ static int btree_gc_recurse(struct btree *b, struct 
> btree_op *op,
>   memmove(r + 1, r, sizeof(r[0]) * (GC_MERGE_NODES - 1));
>   r->b = NULL;
>  
> + if (atomic_read(>c->io_inflight) &&
> + gc->nodes >= gc->nodes_pre + MIN_GC_NODES) {

On 32bit machines, gc->nodes is a 32bit count, if it is overflowed the
above check will be false for a very long time, and in some condition it
might be always false after gc->nodes overflowed.

How about adding the following check before the if() statement ?
/* in case gc->nodes is overflowed */
if (gc->nodes_pre < gc->nodes)
gc->noeds_pre = gc->nodes;

> + gc->nodes_pre =  gc->nodes;
> + ret = -EAGAIN;
> + break;
> + }
> +
>   if (need_resched()) {
>   ret = -EAGAIN;
>   break;
> @@ -1748,7 +1758,10 @@ static void bch_btree_gc(struct cache_set *c)
>   closure_sync();
>   cond_resched();
>  
> - if (ret && ret != -EAGAIN)
> + if (ret == -EAGAIN)
> + schedule_timeout_interruptible(msecs_to_jiffies
> +(GC_SLEEP_TIME));
> + else if (ret)
>   pr_warn("gc failed!");
>   } while (ret);
>  
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 643c3021..26750a9 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -637,7 +637,9 @@ static void do_bio_hook(struct search *s, struct bio 
> *orig_bio)
>  static void search_free(struct closure *cl)
>  {
>   struct search *s = container_of(cl, struct search, cl);
> +
>   bio_complete(s);
> + atomic_dec(>d->c->io_inflight);
>  
>   if (s->iop.bio)
>   bio_put(s->iop.bio);
> @@ -655,6 +657,7 @@ static inline struct search *search_alloc(struct bio *bio,
>  
>   closure_init(>cl, NULL);
>   do_bio_hook(s, bio);
> + atomic_inc(>c->io_inflight);
>  

Similar variable naming.

>   s->orig_bio = bio;
>   s->cache_miss   = NULL;
> 

Thanks for your effort.

Coly Li



Re: [PATCH] block: Ensure that a request queue is dissociated from the cgroup controller

2018-04-12 Thread t...@kernel.org
On Thu, Apr 12, 2018 at 06:58:21AM -0700, t...@kernel.org wrote:
> Cool, can we just factor out the queue lock from those drivers?  It's
> just really messy to be switching locks like we do in the cleanup
> path.

So, looking at a couple drivers, it looks like all we'd need is a
struct which contains a spinlock and a refcnt.  Switching the drivers
to use that is a bit tedious but straight-forward and the block layer
can simply put that struct on queue release.
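To sketch the shape of that idea (a hypothetical example only, not code from any posted patch): a small box that carries the lock plus a refcount, so the spinlock storage outlives every request_queue that points at it.

struct queue_lock_box {
	spinlock_t	lock;
	struct kref	ref;
};

static void queue_lock_box_release(struct kref *kref)
{
	kfree(container_of(kref, struct queue_lock_box, ref));
}

/*
 * Driver side: allocate the box, spin_lock_init(&box->lock),
 * kref_init(&box->ref), hand &box->lock to the queue and take a
 * kref_get() for each queue sharing it.
 * Block layer side: kref_put(&box->ref, queue_lock_box_release) from
 * the queue release path, so nobody frees the lock early.
 */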

Thanks.

-- 
tejun


Re: [PATCH] block: Ensure that a request queue is dissociated from the cgroup controller

2018-04-12 Thread t...@kernel.org
Hello,

On Thu, Apr 12, 2018 at 03:56:51PM +0200, h...@lst.de wrote:
> On Thu, Apr 12, 2018 at 06:48:12AM -0700, t...@kernel.org wrote:
> > > Which sounds like a very good reason not to use a driver controller
> > > lock for internals like blkcq.
> > > 
> > > In fact splitting the lock used for synchronizing access to queue
> > > fields from the driver controller lock used to synchronize I/O
> > > in the legacy path in long overdue.
> > 
> > It'd be probably a lot easier to make sure the shared lock doesn't go
> > away till all the request_queues using it go away.  The choice is
> > between refcnting something which carries the lock and double locking
> > in hot paths.  Can't think of many downsides of the former approach.
> 
> We've stopped sharing request_queues between different devices a while
> ago.  The problem is just that for legacy devices the driver still controls
> the lock life time, and it might be shorter than the queue lifetime.
> Note that for blk-mq we basically don't use the queue_lock at all,
> and certainly not in the hot path.

Cool, can we just factor out the queue lock from those drivers?  It's
just really messy to be switching locks like we do in the cleanup
path.

Thanks.

-- 
tejun


Re: [PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER

2018-04-12 Thread Tejun Heo
On Thu, Apr 12, 2018 at 07:05:13AM +0800, Ming Lei wrote:
> > Not really because aborted_gstate right now doesn't have any memory
> > barrier around it, so nothing ensures blk_add_timer() actually appears
> > before.  We can either add the matching barriers in aborted_gstate
> > update and when it's read in the normal completion path, or we can
> > wait for the update to be visible everywhere by waiting for rcu grace
> > period (because the reader is rcu protected).
> 
> Seems not necessary.
> 
> Suppose it is out of order, the only side-effect is that the new
> recycled request is timed out as a bit late, I think that is what
> we can survive, right?

It at least can mess up the timeout duration for the next recycle
instance because there can be two competing blk_add_timer() instances.
I'm not sure whether there can be other consequences.  When ownership
isn't clear, it becomes really difficult to reason about these things
and can lead to subtle failures.  I think it'd be best to always
establish who owns what.

Thanks.

-- 
tejun


Re: [PATCH] block: Ensure that a request queue is dissociated from the cgroup controller

2018-04-12 Thread h...@lst.de
On Thu, Apr 12, 2018 at 06:48:12AM -0700, t...@kernel.org wrote:
> > Which sounds like a very good reason not to use a driver controller
> > lock for internals like blkcq.
> > 
> > In fact splitting the lock used for synchronizing access to queue
> > fields from the driver controller lock used to synchronize I/O
> > in the legacy path in long overdue.
> 
> It'd be probably a lot easier to make sure the shared lock doesn't go
> away till all the request_queues using it go away.  The choice is
> between refcnting something which carries the lock and double locking
> in hot paths.  Can't think of many downsides of the former approach.

We've stopped sharing request_queues between different devices a while
ago.  The problem is just that for legacy devices the driver still controls
the lock life time, and it might be shorter than the queue lifetime.
Note that for blk-mq we basically don't use the queue_lock at all,
and certainly not in the hot path.


Re: [PATCH] block: Ensure that a request queue is dissociated from the cgroup controller

2018-04-12 Thread Tejun Heo
On Wed, Apr 11, 2018 at 07:58:52PM -0600, Bart Van Assche wrote:
> Several block drivers call alloc_disk() followed by put_disk() if
> something fails before device_add_disk() is called without calling
> blk_cleanup_queue(). Make sure that also for this scenario a request
> queue is dissociated from the cgroup controller. This patch avoids
> that loading the parport_pc, paride and pf drivers triggers the
> following kernel crash:
> 
> BUG: KASAN: null-ptr-deref in pi_init+0x42e/0x580 [paride]
> Read of size 4 at addr 0008 by task modprobe/744
> Call Trace:
> dump_stack+0x9a/0xeb
> kasan_report+0x139/0x350
> pi_init+0x42e/0x580 [paride]
> pf_init+0x2bb/0x1000 [pf]
> do_one_initcall+0x8e/0x405
> do_init_module+0xd9/0x2f2
> load_module+0x3ab4/0x4700
> SYSC_finit_module+0x176/0x1a0
> do_syscall_64+0xee/0x2b0
> entry_SYSCALL_64_after_hwframe+0x42/0xb7
> 
> Reported-by: Alexandru Moise <00moses.alexande...@gmail.com>
> Fixes: a063057d7c73 ("block: Fix a race between request queue removal and the block cgroup controller")
> Signed-off-by: Bart Van Assche 
> Cc: Alexandru Moise <00moses.alexande...@gmail.com>
> Cc: Tejun Heo 

So, this might be correct for this reported case but I don't think
it's correct in general.  There's no synchronization between blkcg
q->lock usages and blk_cleanup_queue().  e.g. hotunplugging and
shutting down a device while file operations are still in progress can
easily blow this up.

Thanks.

-- 
tejun


Re: [PATCH] block: Ensure that a request queue is dissociated from the cgroup controller

2018-04-12 Thread t...@kernel.org
Hello,

On Thu, Apr 12, 2018 at 03:14:40PM +0200, h...@lst.de wrote:
> > At least the SCSI ULP drivers drop the last reference to a disk after
> > the blk_cleanup_queue() call. As explained in the description of commit
> > a063057d7c73, removing a request queue from blkcg must happen before
> > blk_cleanup_queue() finishes because a block driver may free the
> > request queue spinlock immediately after blk_cleanup_queue() returns.
> > So I don't think that we can move the code that removes a request
> > queue from blkcg into put_disk(). Another challenge is that some block
> > drivers (e.g. skd) clear the disk->queue pointer if device_add_disk()
> > has not been called to avoid that put_disk() causes a request queue
> > reference count imbalance.
> 
> Which sounds like a very good reason not to use a driver controller
> lock for internals like blkcq.
> 
> In fact splitting the lock used for synchronizing access to queue
> fields from the driver controller lock used to synchronize I/O
> in the legacy path in long overdue.

It'd be probably a lot easier to make sure the shared lock doesn't go
away till all the request_queues using it go away.  The choice is
between refcnting something which carries the lock and double locking
in hot paths.  Can't think of many downsides of the former approach.

Thanks.

-- 
tejun


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-12 Thread t...@kernel.org
Hello, Israel.

On Thu, Apr 12, 2018 at 11:59:11AM +0300, Israel Rukshin wrote:
> On 4/12/2018 12:31 AM, t...@kernel.org wrote:
> >Hey, again.
> >
> >On Wed, Apr 11, 2018 at 10:07:33AM -0700, t...@kernel.org wrote:
> >>Hello, Israel.
> >>
> >>On Wed, Apr 11, 2018 at 07:16:14PM +0300, Israel Rukshin wrote:
> Just noticed this one, this looks interesting to me as well. Israel,
> can you run your test with this patch?
> >>>Yes, I just did and it looks good.
> >>Awesome.
> >Just to be sure, you tested the combined patch and saw the XXX debug
> >messages, right?
> 
> I am not sure I understand.
> What is the combined patch?
> What are the debug messages that I need to see?

There are total of three patches and I posted the combined patch + a
debug message in another post.  Can you please test the following
patch?  It'll print out something like the following (possibly many of
them).

 XXX blk_mq_timeout_reset_cleanup(): executed %d missed completions

Thanks.

Index: work/block/blk-mq.c
===
--- work.orig/block/blk-mq.c
+++ work/block/blk-mq.c
@@ -642,6 +642,8 @@ void blk_mq_complete_request(struct requ
hctx_lock(hctx, _idx);
if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
__blk_mq_complete_request(rq);
+   else
+   rq->missed_completion = true;
hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
@@ -684,6 +686,7 @@ void blk_mq_start_request(struct request
 
blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
blk_add_timer(rq);
+   rq->missed_completion = false;
 
write_seqcount_end(>gstate_seq);
preempt_enable();
@@ -818,7 +821,8 @@ struct blk_mq_timeout_data {
unsigned int nr_expired;
 };
 
-static void blk_mq_rq_timed_out(struct request *req, bool reserved)
+static void blk_mq_rq_timed_out(struct blk_mq_hw_ctx *hctx, struct request 
*req,
+   int *nr_resets, bool reserved)
 {
const struct blk_mq_ops *ops = req->q->mq_ops;
enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
@@ -833,13 +837,10 @@ static void blk_mq_rq_timed_out(struct r
__blk_mq_complete_request(req);
break;
case BLK_EH_RESET_TIMER:
-   /*
-* As nothing prevents from completion happening while
-* ->aborted_gstate is set, this may lead to ignored
-* completions and further spurious timeouts.
-*/
-   blk_mq_rq_update_aborted_gstate(req, 0);
blk_add_timer(req);
+   req->rq_flags |= RQF_MQ_TIMEOUT_RESET;
+   (*nr_resets)++;
+   hctx->need_sync_rcu = true;
break;
case BLK_EH_NOT_HANDLED:
break;
@@ -876,13 +877,35 @@ static void blk_mq_check_expired(struct
time_after_eq(jiffies, deadline)) {
blk_mq_rq_update_aborted_gstate(rq, gstate);
data->nr_expired++;
-   hctx->nr_expired++;
+   hctx->need_sync_rcu = true;
} else if (!data->next_set || time_after(data->next, deadline)) {
data->next = deadline;
data->next_set = 1;
}
 }
 
+static void blk_mq_timeout_sync_rcu(struct request_queue *q, bool clear)
+{
+   struct blk_mq_hw_ctx *hctx;
+   bool has_rcu = false;
+   int i;
+
+   queue_for_each_hw_ctx(q, hctx, i) {
+   if (!hctx->need_sync_rcu)
+   continue;
+
+   if (!(hctx->flags & BLK_MQ_F_BLOCKING))
+   has_rcu = true;
+   else
+   synchronize_srcu(hctx->srcu);
+
+   if (clear)
+   hctx->need_sync_rcu = false;
+   }
+   if (has_rcu)
+   synchronize_rcu();
+}
+
 static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
struct request *rq, void *priv, bool reserved)
 {
@@ -895,7 +918,45 @@ static void blk_mq_terminate_expired(str
 */
if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
READ_ONCE(rq->gstate) == rq->aborted_gstate)
-   blk_mq_rq_timed_out(rq, reserved);
+   blk_mq_rq_timed_out(hctx, rq, priv, reserved);
+}
+
+static void blk_mq_timeout_reset_return(struct blk_mq_hw_ctx *hctx,
+   struct request *rq, void *priv, bool reserved)
+{
+   /*
+* @rq's timer reset has gone through rcu synchronization and is
+* visible now.  Allow normal completions again by resetting
+* ->aborted_gstate.  Don't clear RQF_MQ_TIMEOUT_RESET here as
+* blk_mq_timeout_reset_cleanup() needs it again and there's no
+* memory ordering around ->aborted_gstate making it the only field
+* safe to update.  Let blk_add_timer() clear it later when the
+* request is recycled or times out again.
+*/
+   if 

Re: [PATCH v5] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-12 Thread Christoph Hellwig
Looks good:

Reviewed-by: Christoph Hellwig 

In addition to all the arguments in the changelog the diffstat is
a pretty clear indicator that a straight forward state machine is
exactly what we want.


Re: [PATCH v5] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-12 Thread Christoph Hellwig
On Wed, Apr 11, 2018 at 10:11:05AM +0800, Ming Lei wrote:
> On Tue, Apr 10, 2018 at 03:01:57PM -0600, Bart Van Assche wrote:
> > The blk-mq timeout handling code ignores completions that occur after
> > blk_mq_check_expired() has been called and before blk_mq_rq_timed_out()
> > has reset rq->aborted_gstate. If a block driver timeout handler always
> > returns BLK_EH_RESET_TIMER then the result will be that the request
> > never terminates.
> 
> Under this situation:
> 
> IMO, if this request has been handled by driver's irq handler, and if
> driver's .timeout still returns BLK_EH_RESET_TIMER, it is driver's bug,
> and the correct return value should be BLK_EH_HANDLED.

We have plenty of drivers that do that, so we'll need to audit all the
drivers first.  I guess a start would be to find a way to disable
timeouts entirely.


Re: [PATCH] block: Ensure that a request queue is dissociated from the cgroup controller

2018-04-12 Thread h...@lst.de
On Thu, Apr 12, 2018 at 11:52:11AM +, Bart Van Assche wrote:
> On Thu, 2018-04-12 at 07:34 +0200, Christoph Hellwig wrote:
> > On Wed, Apr 11, 2018 at 07:58:52PM -0600, Bart Van Assche wrote:
> > > Several block drivers call alloc_disk() followed by put_disk() if
> > > something fails before device_add_disk() is called without calling
> > > blk_cleanup_queue(). Make sure that also for this scenario a request
> > > queue is dissociated from the cgroup controller. This patch avoids
> > > that loading the parport_pc, paride and pf drivers triggers the
> > > following kernel crash:
> > 
> > Can we move the cleanup to put_disk in general and not just for
> > this case?  Having alloc/free routines pair up generally avoids
> > a lot of confusion.
> 
> Hello Christoph,
> 
> At least the SCSI ULP drivers drop the last reference to a disk after
> the blk_cleanup_queue() call. As explained in the description of commit
> a063057d7c73, removing a request queue from blkcg must happen before
> blk_cleanup_queue() finishes because a block driver may free the
> request queue spinlock immediately after blk_cleanup_queue() returns.
> So I don't think that we can move the code that removes a request
> queue from blkcg into put_disk(). Another challenge is that some block
> drivers (e.g. skd) clear the disk->queue pointer if device_add_disk()
> has not been called to avoid that put_disk() causes a request queue
> reference count imbalance.

Which sounds like a very good reason not to use a driver controller
lock for internals like blkcq.

In fact splitting the lock used for synchronizing access to queue
fields from the driver controller lock used to synchronize I/O
in the legacy path in long overdue.


Re: [PATCH] bcache: simplify the calculation of the total amount of flash dirty data

2018-04-12 Thread Coly Li
On 2018/4/12 3:21 PM, tang.jun...@zte.com.cn wrote:
> From: Tang Junhui 
> 
> Currently we calculate the total amount of dirty data on flash-only devices
> by adding up the dirty data of each flash-only device while holding the
> register lock. It is very inefficient.
> 
> In this patch, we add a member flash_dev_dirty_sectors to struct cache_set
> to record the total amount of flash-only device dirty data in real time,
> so we no longer need to recalculate the total amount of dirty data each time.
> 
> Signed-off-by: Tang Junhui 

Hi Junhui,

The patch looks good to me, you have my
Reviewed-by: Coly Li 

Thanks.

Coly Li

> ---
>  drivers/md/bcache/bcache.h|  1 +
>  drivers/md/bcache/super.c |  2 ++
>  drivers/md/bcache/writeback.c |  5 -
>  drivers/md/bcache/writeback.h | 19 ---
>  4 files changed, 7 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 843877e..cdbd596 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -490,6 +490,7 @@ struct cache_set {
>   struct bcache_device**devices;
>   struct list_headcached_devs;
>   uint64_tcached_dev_sectors;
> + atomic_long_t   flash_dev_dirty_sectors;
>   struct closure  caching;
>  
>   struct closure  sb_write;
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index b4d2892..772520e 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1208,6 +1208,8 @@ static void flash_dev_free(struct closure *cl)
>  {
>   struct bcache_device *d = container_of(cl, struct bcache_device, cl);
>   mutex_lock(_register_lock);
> + atomic_long_sub(bcache_dev_sectors_dirty(d),
> + >c->flash_dev_dirty_sectors);
>   bcache_device_free(d);
>   mutex_unlock(_register_lock);
>   kobject_put(>kobj);
> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> index 56a3788..ec50d76 100644
> --- a/drivers/md/bcache/writeback.c
> +++ b/drivers/md/bcache/writeback.c
> @@ -23,7 +23,7 @@ static void __update_writeback_rate(struct cached_dev *dc)
>  {
>   struct cache_set *c = dc->disk.c;
>   uint64_t cache_sectors = c->nbuckets * c->sb.bucket_size -
> - bcache_flash_devs_sectors_dirty(c);
> + atomic_long_read(>flash_dev_dirty_sectors);
>   uint64_t cache_dirty_target =
>   div_u64(cache_sectors * dc->writeback_percent, 100);
>   int64_t target = div64_u64(cache_dirty_target * bdev_sectors(dc->bdev),
> @@ -314,6 +314,9 @@ void bcache_dev_sectors_dirty_add(struct cache_set *c, 
> unsigned inode,
>   if (!d)
>   return;
>  
> + if (UUID_FLASH_ONLY(>uuids[inode]))
> + atomic_long_add(nr_sectors, >flash_dev_dirty_sectors);
> +
>   stripe = offset_to_stripe(d, offset);
>   stripe_offset = offset & (d->stripe_size - 1);
>  
> diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
> index a9e3ffb..1a7eacf 100644
> --- a/drivers/md/bcache/writeback.h
> +++ b/drivers/md/bcache/writeback.h
> @@ -15,25 +15,6 @@ static inline uint64_t bcache_dev_sectors_dirty(struct 
> bcache_device *d)
>   return ret;
>  }
>  
> -static inline uint64_t  bcache_flash_devs_sectors_dirty(struct cache_set *c)
> -{
> - uint64_t i, ret = 0;
> -
> - mutex_lock(_register_lock);
> -
> - for (i = 0; i < c->nr_uuids; i++) {
> - struct bcache_device *d = c->devices[i];
> -
> - if (!d || !UUID_FLASH_ONLY(>uuids[i]))
> - continue;
> -ret += bcache_dev_sectors_dirty(d);
> - }
> -
> - mutex_unlock(_register_lock);
> -
> - return ret;
> -}
> -
>  static inline unsigned offset_to_stripe(struct bcache_device *d,
>   uint64_t offset)
>  {
> 



Re: [PATCH v3] blk-mq: Avoid that submitting a bio concurrently with device removal triggers a crash

2018-04-12 Thread Bart Van Assche

On 04/12/18 00:27, Christoph Hellwig wrote:

On Tue, Apr 10, 2018 at 05:02:40PM -0600, Bart Van Assche wrote:

Because blkcg_exit_queue() is now called from inside blk_cleanup_queue()
it is no longer safe to access cgroup information during or after the
blk_cleanup_queue() call. Hence protect the generic_make_request_checks()
call with blk_queue_enter() / blk_queue_exit().


I think the problem is that blkcg does weird things from
blk_cleanup_queue.  I'd rather fix that root cause than working around it.


Hello Christoph,

Can you clarify your comment? generic_make_request_checks() calls
blkcg_bio_issue_check() and that function in turn calls blkg_lookup()
and other blkcg functions. Hence this patch, which avoids blkcg code
being called concurrently with the removal of a request queue from blkcg.


Thanks,

Bart.






[PATCH V3] blk-mq: fix race between complete and BLK_EH_RESET_TIMER

2018-04-12 Thread Ming Lei
The normal request completion can happen before or while
BLK_EH_RESET_TIMER is being handled, and this race may cause the request
to never be completed, since the driver's .timeout() may always return
BLK_EH_RESET_TIMER.

This issue can't be fixed completely by the driver, since the normal
completion can happen between .timeout() returning and BLK_EH_RESET_TIMER
being handled.

This patch fixes the race by introducing the rq state MQ_RQ_COMPLETE_IN_RESET
and reading/writing the rq's state while holding the queue lock. The lock
could actually be per-request, but it is not necessary to introduce a
dedicated lock for such an unusual event.

Also, when .timeout() returns BLK_EH_HANDLED, sync with the normal completion
path before finally completing this timed-out rq, to avoid the rq's state
being touched by the normal completion.
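The arbitration this relies on -- exactly one of the two racing paths gets to complete the request, decided under a shared lock -- can be shown with a generic userspace sketch (pthreads; this is the general pattern only, not the blk-mq code):

#include <assert.h>
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int claimed;		/* 0 = request still in flight */
static int completions;	/* must end up exactly 1 */

static void *try_complete(void *arg)
{
	int mine = 0;

	(void)arg;
	pthread_mutex_lock(&lock);
	if (!claimed) {		/* first path to get here owns the completion */
		claimed = 1;
		mine = 1;
	}
	pthread_mutex_unlock(&lock);

	if (mine)
		completions++;	/* only the winner reaches this */
	return NULL;
}

int main(void)
{
	pthread_t normal, timeout;

	pthread_create(&normal, NULL, try_complete, NULL);
	pthread_create(&timeout, NULL, try_complete, NULL);
	pthread_join(normal, NULL);
	pthread_join(timeout, NULL);

	assert(completions == 1);	/* exactly one side completed it */
	return 0;
}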

Cc: "jianchao.wang" 
Cc: Bart Van Assche 
Cc: Tejun Heo 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Sagi Grimberg 
Cc: Israel Rukshin ,
Cc: Max Gurtovoy 
Cc: sta...@vger.kernel.org
Signed-off-by: Ming Lei 
---
V3:
- before completing rq for BLK_EH_HANDLED, sync with normal
  completion path
- make sure rq's state updated as MQ_RQ_IN_FLIGHT before completing
V2:
- rename the new flag as MQ_RQ_COMPLETE_IN_TIMEOUT
- fix lock uses in blk_mq_rq_timed_out
- document re-order between blk_add_timer() and
blk_mq_rq_update_aborted_gstate(req, 0)

 block/blk-mq.c | 85 +++---
 block/blk-mq.h |  1 +
 2 files changed, 70 insertions(+), 16 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0dc9e341c2a7..d70f69a32226 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -198,23 +198,12 @@ void blk_mq_quiesce_queue_nowait(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_nowait);
 
-/**
- * blk_mq_quiesce_queue() - wait until all ongoing dispatches have finished
- * @q: request queue.
- *
- * Note: this function does not prevent that the struct request end_io()
- * callback function is invoked. Once this function is returned, we make
- * sure no dispatch can happen until the queue is unquiesced via
- * blk_mq_unquiesce_queue().
- */
-void blk_mq_quiesce_queue(struct request_queue *q)
+static void blk_mq_queue_synchronize_rcu(struct request_queue *q)
 {
struct blk_mq_hw_ctx *hctx;
unsigned int i;
bool rcu = false;
 
-   blk_mq_quiesce_queue_nowait(q);
-
queue_for_each_hw_ctx(q, hctx, i) {
if (hctx->flags & BLK_MQ_F_BLOCKING)
synchronize_srcu(hctx->srcu);
@@ -224,6 +213,21 @@ void blk_mq_quiesce_queue(struct request_queue *q)
if (rcu)
synchronize_rcu();
 }
+
+/**
+ * blk_mq_quiesce_queue() - wait until all ongoing dispatches have finished
+ * @q: request queue.
+ *
+ * Note: this function does not prevent that the struct request end_io()
+ * callback function is invoked. Once this function is returned, we make
+ * sure no dispatch can happen until the queue is unquiesced via
+ * blk_mq_unquiesce_queue().
+ */
+void blk_mq_quiesce_queue(struct request_queue *q)
+{
+   blk_mq_quiesce_queue_nowait(q);
+   blk_mq_queue_synchronize_rcu(q);
+}
 EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue);
 
 /*
@@ -630,10 +634,27 @@ void blk_mq_complete_request(struct request *rq)
 * However, that would complicate paths which want to synchronize
 * against us.  Let stay in sync with the issue path so that
 * hctx_lock() covers both issue and completion paths.
+*
+* Cover complete vs BLK_EH_RESET_TIMER race in slow path with
+* holding queue lock.
 */
hctx_lock(hctx, _idx);
if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
__blk_mq_complete_request(rq);
+   else {
+   unsigned long flags;
+   bool need_complete = false;
+
+   spin_lock_irqsave(q->queue_lock, flags);
+   if (!blk_mq_rq_aborted_gstate(rq))
+   need_complete = true;
+   else
+   blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE_IN_TIMEOUT);
+   spin_unlock_irqrestore(q->queue_lock, flags);
+
+   if (need_complete)
+   __blk_mq_complete_request(rq);
+   }
hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
@@ -814,6 +835,7 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 {
const struct blk_mq_ops *ops = req->q->mq_ops;
enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
+   unsigned long flags;
 
req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED;
 
@@ -822,16 +844,47 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 
switch (ret) {
case BLK_EH_HANDLED:
+   /*
+  

Re: [PATCH] block: Ensure that a request queue is dissociated from the cgroup controller

2018-04-12 Thread Bart Van Assche
On Thu, 2018-04-12 at 07:34 +0200, Christoph Hellwig wrote:
> On Wed, Apr 11, 2018 at 07:58:52PM -0600, Bart Van Assche wrote:
> > Several block drivers call alloc_disk() followed by put_disk() if
> > something fails before device_add_disk() is called without calling
> > blk_cleanup_queue(). Make sure that also for this scenario a request
> > queue is dissociated from the cgroup controller. This patch avoids
> > that loading the parport_pc, paride and pf drivers triggers the
> > following kernel crash:
> 
> Can we move the cleanup to put_disk in general and not just for
> this case?  Having alloc/free routines pair up generally avoids
> a lot of confusion.

Hello Christoph,

At least the SCSI ULP drivers drop the last reference to a disk after
the blk_cleanup_queue() call. As explained in the description of commit
a063057d7c73, removing a request queue from blkcg must happen before
blk_cleanup_queue() finishes because a block driver may free the
request queue spinlock immediately after blk_cleanup_queue() returns.
So I don't think that we can move the code that removes a request
queue from blkcg into put_disk(). Another challenge is that some block
drivers (e.g. skd) clear the disk->queue pointer if device_add_disk()
has not been called to avoid that put_disk() causes a request queue
reference count imbalance.

Bart.





[PATCH 2/2] bcache: add code comments for bset.c

2018-04-12 Thread Coly Li
This patch adds code comments in bset.c to make some tricky code and
design decisions more comprehensible. Most of the information in this
patch comes from a discussion between Kent and me, in which he offered
very informative details. If there is any mistake in describing the
ideas behind the code, it is no doubt my misrepresentation.

Signed-off-by: Coly Li 
---
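As a side note, the linear-array tree layout these new comments describe (root at index 1, children of node j at 2*j and 2*j+1) can be visualised with a few lines of standalone C; the recursive walk below is only an illustration, the kernel helpers perform the same in-order traversal iteratively:

#include <stdio.h>

static void inorder(unsigned j, unsigned size)
{
	if (j >= size)
		return;
	inorder(j * 2, size);		/* left subtree  */
	printf("%u ", j);		/* visit node j  */
	inorder(j * 2 + 1, size);	/* right subtree */
}

int main(void)
{
	inorder(1, 8);			/* prints: 4 2 5 1 6 3 7 */
	printf("\n");
	return 0;
}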
 drivers/md/bcache/bset.c | 63 
 1 file changed, 63 insertions(+)

diff --git a/drivers/md/bcache/bset.c b/drivers/md/bcache/bset.c
index 343f4e9428e0..057071f59d20 100644
--- a/drivers/md/bcache/bset.c
+++ b/drivers/md/bcache/bset.c
@@ -365,6 +365,10 @@ EXPORT_SYMBOL(bch_btree_keys_init);
 
 /* Binary tree stuff for auxiliary search trees */
 
+/*
+ * return array index next to j when does in-order traverse
+ * of a binary tree which is stored in a linear array
+ */
 static unsigned inorder_next(unsigned j, unsigned size)
 {
if (j * 2 + 1 < size) {
@@ -378,6 +382,10 @@ static unsigned inorder_next(unsigned j, unsigned size)
return j;
 }
 
+/*
+ * return array index previous to j when does in-order traverse
+ * of a binary tree which is stored in a linear array
+ */
 static unsigned inorder_prev(unsigned j, unsigned size)
 {
if (j * 2 < size) {
@@ -420,6 +428,10 @@ static unsigned __to_inorder(unsigned j, unsigned size, 
unsigned extra)
return j;
 }
 
+/*
+ * Return the cacheline index in bset_tree->data, where j is index
+ * from a linear array which stores the auxiliar binary tree
+ */
 static unsigned to_inorder(unsigned j, struct bset_tree *t)
 {
return __to_inorder(j, t->size, t->extra);
@@ -440,6 +452,10 @@ static unsigned __inorder_to_tree(unsigned j, unsigned 
size, unsigned extra)
return j;
 }
 
+/*
+ * Return an index from a linear array which stores the auxiliar binary
+ * tree, j is the cacheline index of t->data.
+ */
 static unsigned inorder_to_tree(unsigned j, struct bset_tree *t)
 {
return __inorder_to_tree(j, t->size, t->extra);
@@ -545,6 +561,20 @@ static inline uint64_t shrd128(uint64_t high, uint64_t 
low, uint8_t shift)
return low;
 }
 
+/*
+ * Calculate mantissa value for struct bkey_float.
+ * If most significant bit of f->exponent is not set, then
+ *  - f->exponent >> 6 is 0
+ *  - p[0] points to bkey->low
+ *  - p[-1] borrows bits from KEY_INODE() of bkey->high
+ * If the most significant bit of f->exponent is set, then
+ *  - f->exponent >> 6 is 1
+ *  - p[0] points to bits from KEY_INODE() of bkey->high
+ *  - p[-1] points to other bits from KEY_INODE() of
+ *bkey->high too.
+ * See make_bfloat() to check when most significant bit of f->exponent
+ * is set or not.
+ */
 static inline unsigned bfloat_mantissa(const struct bkey *k,
   struct bkey_float *f)
 {
@@ -569,6 +599,16 @@ static void make_bfloat(struct bset_tree *t, unsigned j)
BUG_ON(m < l || m > r);
BUG_ON(bkey_next(p) != m);
 
+   /*
+* If l and r have different KEY_INODE values (different backing
+* device), f->exponent records how many least significant bits
+* are different in KEY_INODE values and sets most significant
+* bits to 1 (by +64).
+* If l and r have same KEY_INODE value, f->exponent records
+* how many different bits in least significant bits of bkey->low.
+* See bfloat_mantissa() for how the most significant bit of
+* f->exponent is used to calculate bfloat mantissa value.
+*/
if (KEY_INODE(l) != KEY_INODE(r))
f->exponent = fls64(KEY_INODE(r) ^ KEY_INODE(l)) + 64;
else
@@ -632,6 +672,15 @@ void bch_bset_init_next(struct btree_keys *b, struct bset 
*i, uint64_t magic)
 }
 EXPORT_SYMBOL(bch_bset_init_next);
 
+/*
+ * Build auxiliary binary tree 'struct bset_tree *t', this tree is used to
+ * accelerate bkey search in a btree node (pointed by bset_tree->data in
+ * memory). After search in the auxiliar tree by calling bset_search_tree(),
+ * a struct bset_search_iter is returned which indicates range [l, r] from
+ * bset_tree->data where the searching bkey might be inside. Then a followed
+ * linear comparison does the exact search, see __bch_bset_search() for how
+ * the auxiliary tree is used.
+ */
 void bch_bset_build_written_tree(struct btree_keys *b)
 {
struct bset_tree *t = bset_tree_last(b);
@@ -897,6 +946,17 @@ static struct bset_search_iter bset_search_tree(struct 
bset_tree *t,
unsigned inorder, j, n = 1;
 
do {
+   /*
+* A bit trick here.
+* If p < t->size, (int)(p - t->size) is a minus value and
+* the most significant bit is set, right shifting 31 bits
+* gets 1. If p >= t->size, the most significant bit is
+* not set, right shifting 31 bits gets 0.
+* So the following 2 lines equals to
+*  if (p >= t->size)
+   
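The masking trick in the truncated comment above can be checked with a standalone userspace snippet (illustrative only; it depends on arithmetic right shift of negative ints, which ISO C leaves implementation-defined but gcc/clang and the kernel assume):

#include <assert.h>

static unsigned clamp_to_zero(unsigned p, unsigned size)
{
	/* all-bits-set mask when p < size, 0 otherwise */
	p &= (unsigned)(((int)(p - size)) >> 31);
	return p;
}

int main(void)
{
	assert(clamp_to_zero(3, 10) == 3);	/* p < size: kept    */
	assert(clamp_to_zero(10, 10) == 0);	/* p >= size: zeroed */
	assert(clamp_to_zero(42, 10) == 0);
	return 0;
}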

[PATCH 1/2] bcache: remove unnecessary code in bch_btree_keys_init()

2018-04-12 Thread Coly Li
Function bch_btree_keys_init() initializes b->set[].size and
b->set[].data to zero. As the code comment indicates, this code is indeed
unnecessary: both struct btree_keys and struct bset_tree are nested and
embedded in struct btree, so when struct btree is filled with 0 bits by
kzalloc() in mca_bucket_alloc(), b->set[].size and b->set[].data are
already initialized to 0 (a.k.a. NULL).

This patch removes the redundant code and adds comments in
bch_btree_keys_init() and mca_bucket_alloc() to explain why it's safe.

Signed-off-by: Coly Li 
---
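The kzalloc() argument can be checked with a trivial userspace analogue; the structs below are stand-ins with made-up fields, not the real bcache definitions:

#include <assert.h>
#include <stdlib.h>

struct bset_tree_like { unsigned size; void *data; };
struct btree_keys_like { struct bset_tree_like set[4]; };
struct btree_like { struct btree_keys_like keys; };

int main(void)
{
	struct btree_like *b = calloc(1, sizeof(*b));	/* stands in for kzalloc() */
	int i;

	assert(b);
	for (i = 0; i < 4; i++) {
		/* zero-allocating the outer struct already zeroed the embedded members */
		assert(b->keys.set[i].size == 0);
		assert(b->keys.set[i].data == NULL);
	}
	free(b);
	return 0;
}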
 drivers/md/bcache/bset.c  | 13 ++---
 drivers/md/bcache/btree.c |  4 
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/drivers/md/bcache/bset.c b/drivers/md/bcache/bset.c
index 579c696a5fe0..343f4e9428e0 100644
--- a/drivers/md/bcache/bset.c
+++ b/drivers/md/bcache/bset.c
@@ -352,15 +352,14 @@ void bch_btree_keys_init(struct btree_keys *b, const 
struct btree_keys_ops *ops,
b->nsets = 0;
b->last_set_unwritten = 0;
 
-   /* XXX: shouldn't be needed */
-   for (i = 0; i < MAX_BSETS; i++)
-   b->set[i].size = 0;
/*
-* Second loop starts at 1 because b->keys[0]->data is the memory we
-* allocated
+* struct btree_keys in embedded in struct btree, and struct
+* bset_tree is embedded into struct btree_keys. They are all
+* initialized as 0 by kzalloc() in mca_bucket_alloc(), and
+* b->set[0].data is allocated in bch_btree_keys_alloc(), so we
+* don't have to initiate b->set[].size and b->set[].data here
+* any more.
 */
-   for (i = 1; i < MAX_BSETS; i++)
-   b->set[i].data = NULL;
 }
 EXPORT_SYMBOL(bch_btree_keys_init);
 
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 17936b2dc7d6..344641e23415 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -600,6 +600,10 @@ static void mca_data_alloc(struct btree *b, struct bkey 
*k, gfp_t gfp)
 static struct btree *mca_bucket_alloc(struct cache_set *c,
  struct bkey *k, gfp_t gfp)
 {
+   /*
+* kzalloc() is necessary here for initialization,
+* see code comments in bch_btree_keys_init().
+*/
struct btree *b = kzalloc(sizeof(struct btree), gfp);
if (!b)
return NULL;
-- 
2.16.2



Re: 4.15.14 crash with iscsi target and dvd

2018-04-12 Thread Ming Lei
On Tue, Apr 10, 2018 at 08:45:25PM -0400, Wakko Warner wrote:
> Ming Lei wrote:
> > Sure, thanks for your sharing.
> > 
> > Wakko, could you test the following patch and see if there is any
> > difference?
> > 
> > --
> > diff --git a/drivers/target/target_core_pscsi.c 
> > b/drivers/target/target_core_pscsi.c
> > index 0d99b242e82e..6147178f1f37 100644
> > --- a/drivers/target/target_core_pscsi.c
> > +++ b/drivers/target/target_core_pscsi.c
> > @@ -888,7 +888,7 @@ pscsi_map_sg(struct se_cmd *cmd, struct scatterlist 
> > *sgl, u32 sgl_nents,
> > if (len > 0 && data_len > 0) {
> > bytes = min_t(unsigned int, len, PAGE_SIZE - off);
> > bytes = min(bytes, data_len);
> > -
> > + new_bio:
> > if (!bio) {
> > nr_vecs = min_t(int, BIO_MAX_PAGES, nr_pages);
> > nr_pages -= nr_vecs;
> > @@ -931,6 +931,7 @@ pscsi_map_sg(struct se_cmd *cmd, struct scatterlist 
> > *sgl, u32 sgl_nents,
> >  * be allocated with pscsi_get_bio() above.
> >  */
> > bio = NULL;
> > +   goto new_bio;
> > }
> >  
> > data_len -= bytes;
> 
> Sorry for the delay.  I reverted my change, added this one.  I didn't
> reboot, I just unloaded and loaded this one.
> Note: /dev/sr1 as seen from the initiator is /dev/sr0 (physical disc) on the
> target.
> 
> Doesn't crash, however on the initiator I see this:
> [9273849.70] ISO 9660 Extensions: RRIP_1991A
> [9273863.359718] scsi_io_completion: 13 callbacks suppressed
> [9273863.359788] sr 26:0:0:0: [sr1] tag#1 UNKNOWN(0x2003) Result: 
> hostbyte=0x00 driverbyte=0x08
> [9273863.359909] sr 26:0:0:0: [sr1] tag#1 Sense Key : 0x2 [current] 
> [9273863.359974] sr 26:0:0:0: [sr1] tag#1 ASC=0x8 ASCQ=0x0 
> [9273863.360036] sr 26:0:0:0: [sr1] tag#1 CDB: opcode=0x28 28 00 00 22 f6 96 
> 00 00 80 00
> [9273863.360116] blk_update_request: 13 callbacks suppressed
> [9273863.360177] blk_update_request: I/O error, dev sr1, sector 9165400
> [9273875.864648] sr 26:0:0:0: [sr1] tag#1 UNKNOWN(0x2003) Result: 
> hostbyte=0x00 driverbyte=0x08
> [9273875.864738] sr 26:0:0:0: [sr1] tag#1 Sense Key : 0x2 [current] 
> [9273875.864801] sr 26:0:0:0: [sr1] tag#1 ASC=0x8 ASCQ=0x0 
> [9273875.864890] sr 26:0:0:0: [sr1] tag#1 CDB: opcode=0x28 28 00 00 22 f7 16 
> 00 00 80 00
> [9273875.864971] blk_update_request: I/O error, dev sr1, sector 9165912
> 
> To cause this, I mounted the dvd as seen in the first line and ran this
> command: find /cdrom2 -type f | xargs -tn1 cat > /dev/null
> I did some various tests.  Each test was done after umount and mount to
> clear the cache.
> cat  > /dev/null causes the message.
> dd if= of=/dev/null bs=2048 doesn't
>   using bs=4096 doesn't
>   using bs=64k doesn't
>   using bs=128k does
> cat uses a blocksize of 128k.
> 
> The following was done without being mounted.
> ddrescue -f -f /dev/sr1 /dev/null 
>   doesn't cause the message
> dd if=/dev/sr1 of=/dev/null bs=128k
>   doesn't cause the message
> using bs=256k causes the message once:
> [9275916.857409] sr 27:0:0:0: [sr1] tag#0 UNKNOWN(0x2003) Result: 
> hostbyte=0x00 driverbyte=0x08
> [9275916.857482] sr 27:0:0:0: [sr1] tag#0 Sense Key : 0x2 [current] 
> [9275916.857520] sr 27:0:0:0: [sr1] tag#0 ASC=0x8 ASCQ=0x0 
> [9275916.857556] sr 27:0:0:0: [sr1] tag#0 CDB: opcode=0x28 28 00 00 00 00 00 
> 00 00 80 00
> [9275916.857614] blk_update_request: I/O error, dev sr1, sector 0
> 
> If I access the disc from the target natively either by mounting and
> accessing files or working with the device directly (ie dd) no errors are
> logged on the target.

OK, thanks for your test.

Could you test the following patch and see if there is still the failure
message?

diff --git a/drivers/target/target_core_pscsi.c 
b/drivers/target/target_core_pscsi.c
index 0d99b242e82e..6137287b52fb 100644
--- a/drivers/target/target_core_pscsi.c
+++ b/drivers/target/target_core_pscsi.c
@@ -913,9 +913,11 @@ pscsi_map_sg(struct se_cmd *cmd, struct scatterlist *sgl, 
u32 sgl_nents,
 
rc = bio_add_pc_page(pdv->pdv_sd->request_queue,
bio, page, bytes, off);
+   if (rc != bytes)
+   goto fail;
pr_debug("PSCSI: bio->bi_vcnt: %d nr_vecs: %d\n",
bio_segments(bio), nr_vecs);
-   if (rc != bytes) {
+   if (/*rc != bytes*/0) {
pr_debug("PSCSI: Reached bio->bi_vcnt max:"
" %d i: %d bio: %p, allocating another"
" bio\n", bio->bi_vcnt, i, bio);

-- 
Ming


Re: [PATCH V2] blk-mq: fix race between complete and BLK_EH_RESET_TIMER

2018-04-12 Thread Ming Lei
On Thu, Apr 12, 2018 at 10:38:56AM +0800, jianchao.wang wrote:
> Hi Ming
> 
> On 04/12/2018 07:38 AM, Ming Lei wrote:
> > +*
> > +* Cover complete vs BLK_EH_RESET_TIMER race in slow path with
> > +* helding queue lock.
> >  */
> > hctx_lock(hctx, _idx);
> > if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
> > __blk_mq_complete_request(rq);
> > +   else {
> > +   unsigned long flags;
> > +   bool need_complete = false;
> > +
> > +   spin_lock_irqsave(q->queue_lock, flags);
> > +   if (!blk_mq_rq_aborted_gstate(rq))
> > +   need_complete = true;
> > +   else
> > +   blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE_IN_TIMEOUT);
> > +   spin_unlock_irqrestore(q->queue_lock, flags);
> 
> What if the .timeout return BLK_EH_HANDLED during this ?
> timeout context irq context
>   .timeout()
>  blk_mq_complete_request
>set state to MQ_RQ_COMPLETE_IN_TIMEOUT
>   __blk_mq_complete_request
> WARN_ON_ONCE(blk_mq_rq_state(rq) 
>  != MQ_RQ_IN_FLIGHT);

Thinking about it further: if the normal completion from irq context can
happen after BLK_EH_HANDLED is returned from .timeout, this request may
be recycled immediately after __blk_mq_complete_request(). Then the old
blk_mq_complete_request() can complete the new instance early, which
would be another race, between the old normal completion and the new
dispatch.

I guess .timeout should make sure this can't happen.
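For illustration, this is the kind of protection a generation check gives against a late completion hitting a recycled request (a standalone toy, not the actual gstate machinery):

#include <assert.h>

struct fake_rq {
	unsigned long gen;	/* bumped every time the tag/slot is reused */
	int completed;
};

static int complete_if_current(struct fake_rq *rq, unsigned long observed_gen)
{
	if (rq->gen != observed_gen)
		return 0;	/* stale completion for a recycled slot: ignore */
	rq->completed = 1;
	return 1;
}

int main(void)
{
	struct fake_rq rq = { .gen = 1, .completed = 0 };
	unsigned long seen = rq.gen;	/* the late completion sampled gen == 1 */

	/* the slot gets recycled for a new request before that completion lands */
	rq.gen++;
	rq.completed = 0;

	assert(!complete_if_current(&rq, seen));	/* stale: rejected */
	assert(complete_if_current(&rq, rq.gen));	/* current owner completes */
	return 0;
}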

--
Ming


[PATCH v2] block: ratelimite pr_err on IO path

2018-04-12 Thread Jack Wang
From: Jack Wang 

This avoids the soft lockup below:
[ 2328.328429] Call Trace:
[ 2328.328433]  vprintk_emit+0x229/0x2e0
[ 2328.328436]  ? t10_pi_type3_verify_ip+0x20/0x20
[ 2328.328437]  printk+0x52/0x6e
[ 2328.328439]  t10_pi_verify+0x9e/0xf0
[ 2328.328441]  bio_integrity_process+0x12e/0x220
[ 2328.328442]  ? t10_pi_type1_verify_crc+0x20/0x20
[ 2328.328443]  bio_integrity_verify_fn+0xde/0x140
[ 2328.328447]  process_one_work+0x13f/0x370
[ 2328.328449]  worker_thread+0x62/0x3d0
[ 2328.328450]  ? rescuer_thread+0x2f0/0x2f0
[ 2328.328452]  kthread+0x116/0x150
[ 2328.328454]  ? __kthread_parkme+0x70/0x70
[ 2328.328457]  ret_from_fork+0x35/0x40

Signed-off-by: Jack Wang 
---
v2: keep the message on one line, as Robert and the coding style doc suggested

 block/t10-pi.c | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)
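
For reference, a minimal sketch of what the _ratelimited variant boils down
to per call site (simplified from the printk/ratelimit helpers; the helper
function and its name below are only illustrative):

#include <linux/printk.h>
#include <linux/ratelimit.h>

static void report_ref_tag_error(const char *disk, u64 seed, u32 ref_tag)
{
	static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL,
				      DEFAULT_RATELIMIT_BURST);

	/* messages beyond the burst within the interval are silently dropped */
	if (__ratelimit(&rs))
		pr_err("%s: ref tag error at location %llu (rcvd %u)\n",
		       disk, (unsigned long long)seed, ref_tag);
}

Under a storm of protection errors the printk path is then entered only a few
times per interval, which is what avoids the soft lockup above.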

diff --git a/block/t10-pi.c b/block/t10-pi.c
index a98db38..6faf8c1 100644
--- a/block/t10-pi.c
+++ b/block/t10-pi.c
@@ -84,10 +84,11 @@ static blk_status_t t10_pi_verify(struct blk_integrity_iter *iter,
 
if (be32_to_cpu(pi->ref_tag) !=
lower_32_bits(iter->seed)) {
-   pr_err("%s: ref tag error at location %llu " \
-  "(rcvd %u)\n", iter->disk_name,
-  (unsigned long long)
-  iter->seed, be32_to_cpu(pi->ref_tag));
+   pr_err_ratelimited("%s: ref tag error at 
location %llu (rcvd %u)\n",
+  iter->disk_name,
+  (unsigned long long)
+  iter->seed,
+  be32_to_cpu(pi->ref_tag));
return BLK_STS_PROTECTION;
}
break;
@@ -101,10 +102,11 @@ static blk_status_t t10_pi_verify(struct blk_integrity_iter *iter,
csum = fn(iter->data_buf, iter->interval);
 
if (pi->guard_tag != csum) {
-   pr_err("%s: guard tag error at sector %llu " \
-  "(rcvd %04x, want %04x)\n", iter->disk_name,
-  (unsigned long long)iter->seed,
-  be16_to_cpu(pi->guard_tag), be16_to_cpu(csum));
+   pr_err_ratelimited("%s: guard tag error at sector %llu 
(rcvd %04x, want %04x)\n",
+  iter->disk_name,
+  (unsigned long long)iter->seed,
+  be16_to_cpu(pi->guard_tag),
+  be16_to_cpu(csum));
return BLK_STS_PROTECTION;
}
 
-- 
2.7.4



Re: System hung in I/O when booting with sd card

2018-04-12 Thread Shawn Lin

Hi Bart,

On 2018/4/12 9:54, Bart Van Assche wrote:

On Thu, 2018-04-12 at 09:48 +0800, Shawn Lin wrote:

I have twice run into my system hanging here when booting with an ext4 SD
card. Not sure how to reproduce it, but it does not seem related to ext4, as
I saw it again with a vfat SD card this morning on Linus' master branch. Is
this a known issue, or any idea how to debug it?


Please verify whether the following patch resolves this:

https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/commit/?h=for-linus=37f9579f4c31a6d698dbf3016d7bf132f9288d30


Thanks for your patch! I set up some devices to test rebooting this morning
against Linus' master branch. Two devices out of 5 without your patch have
already hung, but the other 5 with your patch are still working.

Will update the result tomorrow morning.



Thanks,

Bart.


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-12 Thread Israel Rukshin

On 4/12/2018 12:31 AM, t...@kernel.org wrote:

Hey, again.

On Wed, Apr 11, 2018 at 10:07:33AM -0700, t...@kernel.org wrote:

Hello, Israel.

On Wed, Apr 11, 2018 at 07:16:14PM +0300, Israel Rukshin wrote:

Just noticed this one, this looks interesting to me as well. Israel,
can you run your test with this patch?

Yes, I just did and it looks good.

Awesome.

Just to be sure, you tested the combined patch and saw the XXX debug
messages, right?


I am not sure I understand.
What is the combined patch?
What are the debug messages that I need to see?


Thanks.


Re: [PATCH] block: ratelimit pr_err on IO path

2018-04-12 Thread Jinpu Wang
On Wed, Apr 11, 2018 at 7:07 PM, Elliott, Robert (Persistent Memory)
 wrote:
>> -Original Message-
>> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
>> ow...@vger.kernel.org] On Behalf Of Jack Wang
>> Sent: Wednesday, April 11, 2018 8:21 AM
>> Subject: [PATCH] block: ratelimite pr_err on IO path
>>
>> From: Jack Wang 
> ...
>> - pr_err("%s: ref tag error at location %llu " \
>> -"(rcvd %u)\n", iter->disk_name,
>> -(unsigned long long)
>> -iter->seed, be32_to_cpu(pi->ref_tag));
>> + pr_err_ratelimited("%s: ref tag error at "
>> +"location %llu (rcvd %u)\n",
>
> Per process/coding-style.rst, you should keep a string like that on
> one line even if that exceeds 80 columns:
>
>   Statements longer than 80 columns will be broken into sensible chunks, 
> unless
>   exceeding 80 columns significantly increases readability and does not hide
>   information. ... However, never break user-visible strings such as
>   printk messages, because that breaks the ability to grep for them.
>
>
Thanks Robert. As the original code kept to the 80-column limit I just
followed it; I will fix this in v2.
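
For illustration, the v2 call then looks roughly like this, with the
user-visible format string kept greppable on a single line and only the
arguments wrapped:

	pr_err_ratelimited("%s: ref tag error at location %llu (rcvd %u)\n",
			   iter->disk_name,
			   (unsigned long long)iter->seed,
			   be32_to_cpu(pi->ref_tag));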


-- 
Jack Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin


Re: [PATCH] block/amiflop: Don't log an error message for an invalid ioctl

2018-04-12 Thread Geert Uytterhoeven
On Thu, Apr 12, 2018 at 3:23 AM, Finn Thain  wrote:
> Do as the swim3 driver does and just return -ENOTTY.
>
> Cc: Geert Uytterhoeven 
> Cc: linux-m...@lists.linux-m68k.org
> Signed-off-by: Finn Thain 

Reviewed-by:  Geert Uytterhoeven 

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [PATCH 0/4] bcache: incremental GC and dirty data init

2018-04-12 Thread tang . junhui
Hi Coly,

> On 2018/4/12 2:38 PM, tang.jun...@zte.com.cn wrote:
> > Hi maintainers and folks,
> > 
> > Some patches of this patch set have been sent before but are not merged
> > yet, and I have added two new patches to solve some issues I found while
> > testing. Since they are interdependent, I make them a patch set and resend them.
> > 
> > [PATCH 1/4] bcache: finish incremental GC
> > [PATCH 2/4] bcache: calculate the number of incremental GC nodes according to
> > the total of btree nodes
> > [PATCH 3/4] bcache: notify allocator to prepare for GC
> > [PATCH 4/4] bcache: fix I/O significant decline while backend devices 
> > registering
> > 
> > These patches are useful to prevent I/O fluctuations, or even drops to zero,
> > while GC runs or cached devices are registering. I have tested them for some
> > time; I hope somebody could review these patches, any comment would be
> > welcome.
> > 
> 
> Hi Junhui,
> 
> Copied, these patches are in my to-review list :-)

Thanks for your quick response, you are always so warm to me.
I am looking forward to your comments.

Thanks.

Tang Junhui


[PATCH] bcache: simplify the calculation of the total amount of flash dirty data

2018-04-12 Thread tang . junhui
From: Tang Junhui 

Currently we calculate the total amount of dirty data on flash-only devices
by adding up the dirty data of each flash-only device under the register
lock. It is very inefficient.

In this patch, we add a member flash_dev_dirty_sectors to struct cache_set
to record the total amount of flash-only device dirty data in real time,
so we no longer need to walk every device to calculate it.

Signed-off-by: Tang Junhui 
---
 drivers/md/bcache/bcache.h|  1 +
 drivers/md/bcache/super.c |  2 ++
 drivers/md/bcache/writeback.c |  5 -
 drivers/md/bcache/writeback.h | 19 ---
 4 files changed, 7 insertions(+), 20 deletions(-)
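
A minimal model of the bookkeeping this patch introduces (the struct below is
a stand-in for illustration, not the real struct cache_set):

#include <linux/atomic.h>

struct dirty_accounting {
	atomic_long_t flash_dev_dirty_sectors;
};

static void flash_dirty_add(struct dirty_accounting *acc, long nr_sectors)
{
	/* called whenever dirty sectors are added for a flash-only volume */
	atomic_long_add(nr_sectors, &acc->flash_dev_dirty_sectors);
}

static long flash_dirty_total(struct dirty_accounting *acc)
{
	/* O(1) read; no need to walk every device under bch_register_lock */
	return atomic_long_read(&acc->flash_dev_dirty_sectors);
}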

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 843877e..cdbd596 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -490,6 +490,7 @@ struct cache_set {
struct bcache_device**devices;
struct list_headcached_devs;
uint64_tcached_dev_sectors;
+   atomic_long_t   flash_dev_dirty_sectors;
struct closure  caching;
 
struct closure  sb_write;
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index b4d2892..772520e 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1208,6 +1208,8 @@ static void flash_dev_free(struct closure *cl)
 {
struct bcache_device *d = container_of(cl, struct bcache_device, cl);
mutex_lock(&bch_register_lock);
+   atomic_long_sub(bcache_dev_sectors_dirty(d),
+   &d->c->flash_dev_dirty_sectors);
bcache_device_free(d);
mutex_unlock(&bch_register_lock);
kobject_put(&d->kobj);
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 56a3788..ec50d76 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -23,7 +23,7 @@ static void __update_writeback_rate(struct cached_dev *dc)
 {
struct cache_set *c = dc->disk.c;
uint64_t cache_sectors = c->nbuckets * c->sb.bucket_size -
-   bcache_flash_devs_sectors_dirty(c);
+   atomic_long_read(&c->flash_dev_dirty_sectors);
uint64_t cache_dirty_target =
div_u64(cache_sectors * dc->writeback_percent, 100);
int64_t target = div64_u64(cache_dirty_target * bdev_sectors(dc->bdev),
@@ -314,6 +314,9 @@ void bcache_dev_sectors_dirty_add(struct cache_set *c, unsigned inode,
if (!d)
return;
 
+   if (UUID_FLASH_ONLY(&c->uuids[inode]))
+   atomic_long_add(nr_sectors, &c->flash_dev_dirty_sectors);
+
stripe = offset_to_stripe(d, offset);
stripe_offset = offset & (d->stripe_size - 1);
 
diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
index a9e3ffb..1a7eacf 100644
--- a/drivers/md/bcache/writeback.h
+++ b/drivers/md/bcache/writeback.h
@@ -15,25 +15,6 @@ static inline uint64_t bcache_dev_sectors_dirty(struct bcache_device *d)
return ret;
 }
 
-static inline uint64_t  bcache_flash_devs_sectors_dirty(struct cache_set *c)
-{
-   uint64_t i, ret = 0;
-
-   mutex_lock(&bch_register_lock);
-
-   for (i = 0; i < c->nr_uuids; i++) {
-   struct bcache_device *d = c->devices[i];
-
-   if (!d || !UUID_FLASH_ONLY(&c->uuids[i]))
-   continue;
-  ret += bcache_dev_sectors_dirty(d);
-   }
-
-   mutex_unlock(&bch_register_lock);
-
-   return ret;
-}
-
 static inline unsigned offset_to_stripe(struct bcache_device *d,
uint64_t offset)
 {
-- 
1.8.3.1



Re: [PATCH V2] blk-mq: fix race between complete and BLK_EH_RESET_TIMER

2018-04-12 Thread Ming Lei
Hi Jianchao,

On Thu, Apr 12, 2018 at 10:38:56AM +0800, jianchao.wang wrote:
> Hi Ming
> 
> On 04/12/2018 07:38 AM, Ming Lei wrote:
> > +*
> > +* Cover complete vs BLK_EH_RESET_TIMER race in slow path with
> > +* holding the queue lock.
> >  */
> > hctx_lock(hctx, &srcu_idx);
> > if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
> > __blk_mq_complete_request(rq);
> > +   else {
> > +   unsigned long flags;
> > +   bool need_complete = false;
> > +
> > +   spin_lock_irqsave(q->queue_lock, flags);
> > +   if (!blk_mq_rq_aborted_gstate(rq))
> > +   need_complete = true;
> > +   else
> > +   blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE_IN_TIMEOUT);
> > +   spin_unlock_irqrestore(q->queue_lock, flags);
> 
> What if .timeout returns BLK_EH_HANDLED during this?
> 
>   timeout context                      irq context
>   ---------------                      -----------
>   .timeout()
>                                        blk_mq_complete_request
>                                          set state to MQ_RQ_COMPLETE_IN_TIMEOUT
>   __blk_mq_complete_request
>     WARN_ON_ONCE(blk_mq_rq_state(rq)
>                  != MQ_RQ_IN_FLIGHT);
> 
> Then later, upon blk_mq_free_request(), the finally freed request may be
> left in MQ_RQ_COMPLETE_IN_TIMEOUT instead of MQ_RQ_IDLE.

Good catch, so this patch has to deal with the race between complete and
BLK_EH_HANDLED too.

Given that both of the above are handled in the slow path, the queue lock can
be used to cover them all atomically, just as is done for EH_RESET_TIMER.

Will do it in V3.

Thanks,
Ming


Re: [PATCH 0/4] bcache: incremental GC and dirty data init

2018-04-12 Thread Coly Li
On 2018/4/12 2:38 PM, tang.jun...@zte.com.cn wrote:
> Hi maintainers and folks,
> 
> Some patches of this patch set have been sent before but are not merged
> yet, and I have added two new patches to solve some issues I found while
> testing. Since they are interdependent, I make them a patch set and resend them.
> 
> [PATCH 1/4] bcache: finish incremental GC
> [PATCH 2/4] bcache: calculate the number of incremental GC nodes according to
>   the total of btree nodes
> [PATCH 3/4] bcache: notify allocator to prepare for GC
> [PATCH 4/4] bcache: fix I/O significant decline while backend devices 
> registering
> 
> These patches are useful to prevent I/O fluctuations, or even drops to zero,
> while GC runs or cached devices are registering. I have tested them for some
> time; I hope somebody could review these patches, any comment would be
> welcome.
> 

Hi Junhui,

Copied, these patches are in my to-review list :-)

Thanks.

Coly Li


[PATCH 0/4] bcache: incremental GC and dirty data init

2018-04-12 Thread tang . junhui
Hi maintainers and folks,

Some patches of this patch set have been sent before but are not merged
yet, and I have added two new patches to solve some issues I found while
testing. Since they are interdependent, I make them a patch set and resend them.

[PATCH 1/4] bcache: finish incremental GC
[PATCH 2/4] bcache: calculate the number of incremental GC nodes according to
the total of btree nodes
[PATCH 3/4] bcache: notify allocator to prepare for GC
[PATCH 4/4] bcache: fix I/O significant decline while backend devices 
registering

These patches are useful to prevent I/O fluctuations, or even drops to zero,
while GC runs or cached devices are registering. I have tested them for some
time; I hope somebody could review these patches, any comment would be
welcome.


[PATCH 3/4] bcache: notify allocator to prepare for GC

2018-04-12 Thread tang . junhui
From: Tang Junhui 

Since no new buckets can be allocated during GC, front side I/Os could use
up all the available buckets. So notify the allocator to fill the free_inc
queue with buckets before GC starts; then we have enough buckets for front
side I/Os during the GC period.

Signed-off-by: Tang Junhui 
---
 drivers/md/bcache/alloc.c  | 11 +++-
 drivers/md/bcache/bcache.h |  2 ++
 drivers/md/bcache/btree.c  | 69 --
 drivers/md/bcache/btree.h  |  4 +++
 4 files changed, 83 insertions(+), 3 deletions(-)
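
A condensed sketch of the GC/allocator handshake described above (the three
states are written here as an enum for illustration; the transitions mirror
gc_should_run() and bch_allocator_thread() in the diff below):

enum gc_prepare_state { GC_PREPARE_NONE, GC_PREPARING, GC_PREPARED };

/* GC side: instead of starting right away, ask the allocator to stock up. */
static bool gc_may_start(enum gc_prepare_state *prepare_gc)
{
	if (*prepare_gc == GC_PREPARE_NONE) {
		*prepare_gc = GC_PREPARING;	/* allocator: fill free_inc first */
		return false;
	}
	if (*prepare_gc == GC_PREPARED) {
		*prepare_gc = GC_PREPARE_NONE;	/* handshake done, run GC now */
		return true;
	}
	return false;				/* still preparing, keep waiting */
}

/* Allocator side: once free_inc is full, flip GC_PREPARING to GC_PREPARED
 * and wake the GC thread up again. */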

diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
index a0cc1bc..85020cc 100644
--- a/drivers/md/bcache/alloc.c
+++ b/drivers/md/bcache/alloc.c
@@ -323,7 +323,8 @@ static int bch_allocator_thread(void *arg)
 * possibly issue discards to them, then we add the bucket to
 * the free list:
 */
-   while (!fifo_empty(&ca->free_inc)) {
+   while (!fifo_empty(&ca->free_inc) &&
+  ca->prepare_gc != GC_PREPARING) {
long bucket;
 
fifo_pop(&ca->free_inc, bucket);
@@ -353,6 +354,14 @@ static int bch_allocator_thread(void *arg)
invalidate_buckets(ca);
 
/*
+* Let GC continue
+*/
+   if (ca->prepare_gc == GC_PREPARING) {
+   ca->prepare_gc = GC_PREPARED;
+   wake_up_gc(ca->set);
+   }
+
+   /*
 * Now, we write their new gens to disk so we can start writing
 * new stuff to them:
 */
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index ab4c9ca..61a6ff2 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -428,6 +428,8 @@ struct cache {
 * cpu
 */
unsignedinvalidate_needs_gc;
+   /* used to notify the allocator to prepare for GC */
+   unsigned intprepare_gc;
 
booldiscard; /* Get rid of? */
 
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 10f967e..aba0700 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1805,6 +1805,46 @@ static void bch_btree_gc(struct cache_set *c)
bch_moving_gc(c);
 }
 
+static bool cache_gc_prepare_none(struct cache_set *c)
+{
+   struct cache *ca;
+   unsigned int i;
+
+   for_each_cache(ca, c, i)
+   if (ca->prepare_gc != GC_PREPARE_NONE)
+   return false;
+   return true;
+}
+
+static bool cache_gc_prepare_ok(struct cache_set *c)
+{
+   struct cache *ca;
+   unsigned int i;
+
+   for_each_cache(ca, c, i)
+   if (ca->prepare_gc != GC_PREPARED)
+   return false;
+   return true;
+}
+
+static void set_cache_gc_prepare_none(struct cache_set *c)
+{
+   struct cache *ca;
+   unsigned int i;
+
+   for_each_cache(ca, c, i)
+   ca->prepare_gc = GC_PREPARE_NONE;
+}
+
+static void set_cache_gc_preparing(struct cache_set *c)
+{
+   struct cache *ca;
+   unsigned int i;
+
+   for_each_cache(ca, c, i)
+   ca->prepare_gc = GC_PREPARING;
+}
+
 static bool gc_should_run(struct cache_set *c)
 {
struct cache *ca;
@@ -1814,8 +1854,33 @@ static bool gc_should_run(struct cache_set *c)
if (ca->invalidate_needs_gc)
return true;
 
-   if (atomic_read(&c->sectors_to_gc) < 0)
-   return true;
+   if (atomic_read(&c->sectors_to_gc) < 0) {
+   mutex_lock(&c->bucket_lock);
+   if (cache_gc_prepare_none(c)) {
+   /*
+* notify allocator thread to prepare for GC
+*/
+   set_cache_gc_preparing(c);
+   mutex_unlock(&c->bucket_lock);
+   pr_info("gc preparing");
+   return false;
+   } else if (cache_gc_prepare_ok(c)) {
+   /*
+* alloc thread finished preparing,
+* and continue to GC
+*/
+   set_cache_gc_prepare_none(c);
+   mutex_unlock(&c->bucket_lock);
+   pr_info("gc prepared ok");
+   return true;
+   } else {
+   /*
+* waiting for the allocator to finish preparing
+*/
+   mutex_unlock(&c->bucket_lock);
+   return false;
+   }
+   }
 
return false;
 }
diff --git a/drivers/md/bcache/btree.h b/drivers/md/bcache/btree.h
index d211e2c..e60bd7a 100644
--- a/drivers/md/bcache/btree.h
+++ b/drivers/md/bcache/btree.h
@@ -102,6 +102,10 @@
 #include "bset.h"
 #include 

[PATCH 1/4] bcache: finish incremental GC

2018-04-12 Thread tang . junhui
From: Tang Junhui 

In the GC thread, we record the latest GC key in gc_done, which is expected
to be used for incremental GC, but in the current code we do not make use of
it. When GC runs, front side I/O is blocked until the GC is over, which can
be a long time if there are a lot of btree nodes.

This patch implements incremental GC. The main idea is that, when there are
front side I/Os, after GC'ing some nodes (100) we stop GC, release the lock
of the btree node, and go process the front side I/Os for some time (100 ms),
then go back to GC again.

With this patch, I/Os are no longer blocked for the whole time GC is running,
and there is no obvious problem of I/O dropping to zero any more.

Signed-off-by: Tang Junhui 
---
 drivers/md/bcache/bcache.h  |  5 +
 drivers/md/bcache/btree.c   | 15 ++-
 drivers/md/bcache/request.c |  3 +++
 3 files changed, 22 insertions(+), 1 deletion(-)
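
The control flow of the change, condensed into a sketch (the constants and
the -EAGAIN convention are the patch's; the surrounding code is abbreviated):

	/* in btree_gc_recurse(): bail out after a batch if front side I/O waits */
	if (atomic_read(&b->c->io_inflight) &&
	    gc->nodes >= gc->nodes_pre + MIN_GC_NODES) {
		gc->nodes_pre = gc->nodes;	/* remember where this batch ended */
		return -EAGAIN;
	}

	/* in bch_btree_gc(): sleep briefly and retry instead of giving up */
	do {
		ret = do_one_gc_pass(c);	/* placeholder for the btree_root() call */
		if (ret == -EAGAIN)
			schedule_timeout_interruptible(
				msecs_to_jiffies(GC_SLEEP_TIME));
	} while (ret == -EAGAIN);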

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 843877e..ab4c9ca 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -445,6 +445,7 @@ struct cache {
 
 struct gc_stat {
size_t  nodes;
+   size_t  nodes_pre;
size_t  key_bytes;
 
size_t  nkeys;
@@ -568,6 +569,10 @@ struct cache_set {
 */
atomic_t        rescale;
/*
+* used for GC, identify if any front side I/Os is inflight
+*/
+   atomic_t        io_inflight;
+   /*
 * When we invalidate buckets, we use both the priority and the amount
 * of good data to determine which buckets to reuse first - to weight
 * those together consistently we keep track of the smallest nonzero
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 81e8dc3..b36d276 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -90,6 +90,9 @@
 
#define MAX_NEED_GC    64
 #define MAX_SAVE_PRIO  72
+#define MIN_GC_NODES   100
+#define GC_SLEEP_TIME  100
+
 
 #define PTR_DIRTY_BIT  (((uint64_t) 1 << 36))
 
@@ -1581,6 +1584,13 @@ static int btree_gc_recurse(struct btree *b, struct btree_op *op,
memmove(r + 1, r, sizeof(r[0]) * (GC_MERGE_NODES - 1));
r->b = NULL;
 
+   if (atomic_read(&b->c->io_inflight) &&
+   gc->nodes >= gc->nodes_pre + MIN_GC_NODES) {
+   gc->nodes_pre = gc->nodes;
+   ret = -EAGAIN;
+   break;
+   }
+
if (need_resched()) {
ret = -EAGAIN;
break;
@@ -1748,7 +1758,10 @@ static void bch_btree_gc(struct cache_set *c)
closure_sync();
cond_resched();
 
-   if (ret && ret != -EAGAIN)
+   if (ret == -EAGAIN)
+   schedule_timeout_interruptible(msecs_to_jiffies
+  (GC_SLEEP_TIME));
+   else if (ret)
pr_warn("gc failed!");
} while (ret);
 
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 643c3021..26750a9 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -637,7 +637,9 @@ static void do_bio_hook(struct search *s, struct bio *orig_bio)
 static void search_free(struct closure *cl)
 {
struct search *s = container_of(cl, struct search, cl);
+
bio_complete(s);
+   atomic_dec(&s->d->c->io_inflight);
 
if (s->iop.bio)
bio_put(s->iop.bio);
@@ -655,6 +657,7 @@ static inline struct search *search_alloc(struct bio *bio,
 
closure_init(>cl, NULL);
do_bio_hook(s, bio);
+   atomic_inc(&d->c->io_inflight);
 
s->orig_bio = bio;
s->cache_miss   = NULL;
-- 
1.8.3.1



[PATCH 4/4] bcache: fix I/O significant decline while backend devices registering

2018-04-12 Thread tang . junhui
From: Tang Junhui 

I attached several backing devices to the same cache set and produced lots
of dirty data by running small random writes for a long time. Then I kept
I/O running on the other cached devices, stopped one cached device, and
after a while registered the stopped device again. The running I/O on the
other cached devices dropped significantly, sometimes even to zero.

In the current code, bcache traverses every key and btree node to count the
dirty data under the read lock, so the write threads cannot take the btree
write lock. When the registering device has a lot of keys and btree nodes,
this can last several seconds, so the write I/Os on the other cached devices
are blocked and decline significantly.

In this patch, when a device registers to a cache set that has other cached
devices with running I/Os, we count the dirty data of the registering device
incrementally, so the other cached devices are not blocked the whole time.

Signed-off-by: Tang Junhui 
---
 drivers/md/bcache/writeback.c | 26 +-
 1 file changed, 25 insertions(+), 1 deletion(-)
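
Condensed, the resume logic looks like this (the names are the patch's; the
bodies are abbreviated, not a drop-in replacement):

	/* callback: stop after INIT_KEYS_MIN keys if searches are in flight,
	 * remembering the last key so the next pass can resume from it */
	op->count++;
	if (atomic_read(&b->c->io_inflight) && !(op->count % INIT_KEYS_MIN)) {
		bkey_copy_key(&op->start, k);
		return -EAGAIN;
	}

	/* caller: sleep a little and continue from op.start instead of the top */
	do {
		ret = bch_btree_map_keys(&op.op, d->c, &op.start,
					 sectors_dirty_init_fn, 0);
		if (ret == -EAGAIN)
			schedule_timeout_interruptible(
				msecs_to_jiffies(INIT_KEYS_SLEEP_TIME));
	} while (ret == -EAGAIN);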

diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 56a3788..2467d3a 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -488,10 +488,14 @@ static int bch_writeback_thread(void *arg)
 }
 
 /* Init */
+#define INIT_KEYS_MIN  50
+#define INIT_KEYS_SLEEP_TIME   100
 
 struct sectors_dirty_init {
struct btree_op op;
unsignedinode;
+   size_t  count;
+   struct bkey start;
 };
 
 static int sectors_dirty_init_fn(struct btree_op *_op, struct btree *b,
@@ -506,18 +510,38 @@ static int sectors_dirty_init_fn(struct btree_op *_op, struct btree *b,
bcache_dev_sectors_dirty_add(b->c, KEY_INODE(k),
 KEY_START(k), KEY_SIZE(k));
 
+   op->count++;
+   if (atomic_read(&b->c->io_inflight) &&
+  !(op->count % INIT_KEYS_MIN)) {
+   bkey_copy_key(&op->start, k);
+   return -EAGAIN;
+   }
+
return MAP_CONTINUE;
 }
 
 void bch_sectors_dirty_init(struct bcache_device *d)
 {
struct sectors_dirty_init op;
+   int ret;
 
bch_btree_op_init(&op.op, -1);
op.inode = d->id;
+   op.count = 0;
+   op.start = KEY(op.inode, 0, 0);
 
-   bch_btree_map_keys(&op.op, d->c, &KEY(op.inode, 0, 0),
+   do {
+   ret = bch_btree_map_keys(&op.op, d->c, &op.start,
   sectors_dirty_init_fn, 0);
+   if (ret == -EAGAIN)
+   schedule_timeout_interruptible(
+   msecs_to_jiffies(INIT_KEYS_SLEEP_TIME));
+   else if (ret < 0) {
+   pr_warn("sectors dirty init failed, ret=%d!", ret);
+   break;
+   }
+   } while (ret == -EAGAIN);
+
 }
 
 void bch_cached_dev_writeback_init(struct cached_dev *dc)
-- 
1.8.3.1



Re: [PATCH v3] blk-mq: Avoid that submitting a bio concurrently with device removal triggers a crash

2018-04-12 Thread Christoph Hellwig
On Tue, Apr 10, 2018 at 05:02:40PM -0600, Bart Van Assche wrote:
> Because blkcg_exit_queue() is now called from inside blk_cleanup_queue()
> it is no longer safe to access cgroup information during or after the
> blk_cleanup_queue() call. Hence protect the generic_make_request_checks()
> call with blk_queue_enter() / blk_queue_exit().

I think the problem is that blkcg does weird things from
blk_cleanup_queue.  I'd rather fix that root cause than work around it.
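
For context, the guard described in the quoted changelog has roughly this
shape (a sketch of the pattern inside generic_make_request(), not the actual
diff; the error handling here is illustrative):

	blk_qc_t generic_make_request(struct bio *bio)
	{
		struct request_queue *q = bio->bi_disk->queue;
		blk_mq_req_flags_t flags =
			bio->bi_opf & REQ_NOWAIT ? BLK_MQ_REQ_NOWAIT : 0;

		if (blk_queue_enter(q, flags) < 0) {
			/* queue is (being) cleaned up: fail the bio instead of
			 * touching cgroup state that may already be gone */
			bio_io_error(bio);
			return BLK_QC_T_NONE;
		}

		if (!generic_make_request_checks(bio)) {
			blk_queue_exit(q);
			return BLK_QC_T_NONE;
		}
		/* ... normal submission path, with a matching blk_queue_exit() ... */
	}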