Re: [PATCH v3 RESEND 1/2] blk-throttle: track read and write request individually

2017-11-23 Thread Jens Axboe
On 11/23/2017 06:31 PM, Joseph Qi wrote:
> Hi Jens,
> Could you please give your advice for the two patches or pick them up if
> you think they are good?

It looks OK to me, but my preference would be to push this until
4.16.

-- 
Jens Axboe



Re: [PATCH 1/3] lockdep: Apply crossrelease to PG_locked locks

2017-11-23 Thread Byungchul Park
On Thu, Nov 16, 2017 at 02:07:46PM +0100, Michal Hocko wrote:
> On Thu 16-11-17 21:48:05, Byungchul Park wrote:
> > On 11/16/2017 9:02 PM, Michal Hocko wrote:
> > > for each struct page. So you are doubling the size. Who is going to
> > > enable this config option? You are moving this to page_ext in a later
> > > patch which is a good step but it doesn't go far enough because this
> > > still consumes those resources. Is there any problem to make this
> > > kernel command line controllable? Something we do for page_owner for
> > > example?
> > 
> > Sure. I will add it.
> > 
> > > Also it would be really great if you could give us some measures about
> > > the runtime overhead. I do not expect it to be very large but this is
> > 
> > The major overhead would come from the amount of additional memory
> > consumption for 'lockdep_map's.
> 
> yes
> 
> > Do you want me to measure the overhead by the additional memory
> > consumption?
> > 
> > Or do you expect another overhead?
> 
> I would be also interested how much impact this has on performance. I do
> not expect it would be too large but having some numbers for cache cold
> parallel kbuild or other heavy page lock workloads.

Hello Michal,

I measured 'cache cold parallel kbuild' on my qemu machine. The results
vary quite a bit, so I cannot be certain, but I think there's no meaningful
difference before and after applying crossrelease to page locks.

Actually, I expect little overhead in lock_page() and unlock_page() even
after applying crossrelease to page locks; I only expect a bit of overhead
from the additional memory consumption for the 'lockdep_map's per page.
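
As a back-of-envelope estimate (assuming sizeof(struct lockdep_map) is
around 48 bytes, which is config dependent), the memory cost on the 4 GB
guest used below would be roughly:

   4 GB / 4 KB pages    = 1,048,576 pages
   1,048,576 * 48 bytes = ~48 MB, i.e. ~1.2% of RAM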

I ran the following commands within "QEMU x86_64 4GB memory 4 cpus":

   make clean
   echo 3 > /proc/sys/vm/drop_caches
   time make -j4

The results are:

   # w/o page lock tracking

   At the 1st try,
   real 5m28.105s
   user 17m52.716s
   sys  3m8.871s

   At the 2nd try,
   real 5m27.023s
   user 17m50.134s
   sys  3m9.289s

   At the 3rd try,
   real 5m22.837s
   user 17m34.514s
   sys  3m8.097s

   # w/ page lock tracking

   At the 1st try,
   real 5m18.158s
   user 17m18.200s
   sys  3m8.639s

   At the 2nd try,
   real 5m19.329s
   user 17m19.982s
   sys  3m8.345s

   At the 3rd try,
   real 5m19.626s
   user 17m21.363s
   sys  3m9.869s

I think there's no meaningful difference on my small machine.

--
Thanks,
Byungchul


Re: [GIT PULL] nvme fixes for Linux 4.15

2017-11-23 Thread Jens Axboe
On 11/23/2017 09:54 PM, Jens Axboe wrote:
> On 11/23/2017 07:44 AM, Christoph Hellwig wrote:
>> Sagi Grimberg (3):
>>   nvme-fc: check if queue is ready in queue_rq
>>   nvme-loop: check if queue is ready in queue_rq
> 
> The nvme-loop part looks fine, but why is the nvme-fc part using:
> 
> enum nvme_fc_queue_flags {
>         NVME_FC_Q_CONNECTED = (1 << 0),
> +       NVME_FC_Q_LIVE = (1 << 1),
> };
> 
> for flags that are used with set_bit() and friends? That's just
> misleading, should be 0, 1, etc, not a shift.
> 
> The rest looks pretty straight forward, but the above is an eye sore.

diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index e0577bf33f45..0a8af4daef89 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -31,8 +31,8 @@
 
 
 enum nvme_fc_queue_flags {
-   NVME_FC_Q_CONNECTED = (1 << 0),
-   NVME_FC_Q_LIVE = (1 << 1),
+   NVME_FC_Q_CONNECTED = 0,
+   NVME_FC_Q_LIVE,
 };
 
 #define NVMEFC_QUEUE_DELAY 3   /* ms units */
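
A quick userspace illustration of why the shifted values are misleading
with the bitops API (a sketch, not part of the patch): set_bit() and
friends take a bit *number*, not a mask, so the old
NVME_FC_Q_LIVE == (1 << 1) == 2 would have addressed bit 2, i.e. the
value 4.

	/* sketch: bitops take a bit number, not a mask */
	#include <stdio.h>

	int main(void)
	{
		unsigned long flags = 0;
		int live_as_mask = (1 << 1);	/* old-style definition: 2 */

		/* what set_bit(live_as_mask, &flags) would compute */
		flags |= 1UL << live_as_mask;

		printf("expected 0x2, got 0x%lx\n", flags);	/* prints 0x4 */
		return 0;
	}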

-- 
Jens Axboe



Re: [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown

2017-11-23 Thread Hou Tao
Hi Jan,

On 2017/11/21 0:43, Jan Kara wrote:
> Hi Tao!
> 
> On Fri 17-11-17 14:51:18, Hou Tao wrote:
>> On 2017/3/13 23:14, Jan Kara wrote:
>>> blkdev_open() may race with gendisk shutdown in two different ways.
>>> Either del_gendisk() has already unhashed block device inode (and thus
>>> bd_acquire() will end up creating new block device inode) however
>>> get_gendisk() will still return the gendisk that is being destroyed.
>>> Or bdev returned by bd_acquire() will get unhashed and gendisk destroyed
>>> before we get to get_gendisk() and get_gendisk() will return new gendisk
>>> that got allocated for device that reused the device number.
>>>
>>> In both cases this will result in possible inconsistencies between
>>> bdev->bd_disk and bdev->bd_bdi (in the first case after gendisk gets
>>> destroyed and device number reused, in the second case immediately).
>>
>>> Fix the problem by checking whether the gendisk is still alive and inode
>>> hashed when associating bdev inode with it and its bdi. That way we are
>>> sure that we will not associate bdev inode with disk that got past
>>> blk_unregister_region() in del_gendisk() (and thus device number can get
>>> reused). Similarly, we will not associate bdev that was once associated
>>> with gendisk that is going away (and thus the corresponding bdev inode
>>> will get unhashed in del_gendisk()) with a new gendisk that is just
>>> reusing the device number.
>>
>> I have run some extended tests based on Omar Sandoval's script [1] these
>> days, and found two new problems related to the race between bdev open
>> and gendisk shutdown.
>>
>> The first problem lies in __blkdev_get(), and will cause a use-after-free,
>> or worse, an oops. It happens because there may be two instances of gendisk
>> related to one bdev. When the process which owns the newer gendisk completes
>> the invocation of __blkdev_get() first, the other process which owns the
>> older gendisk will put the last reference of the older gendisk and cause
>> the use-after-free.
>>
>> The following call sequences illustrate the problem:
>>
>> unhash the bdev_inode of /dev/sda
>>
>> process A (blkdev_open()):
>>  bd_acquire() returns new bdev_inode of /dev/sda
>>  get_gendisk() returns gendisk (v1 gendisk)
>>
>> remove gendisk from bdev_map
>> device number is reused
> 
> ^^ this is through blk_unregister_region() in del_gendisk(), isn't it?

Yes, both the unhash of inode and the removal of gendisk from bdev_map
happen in del_gendisk().

>> process B (blkdev_open()):
>>  bd_acquire() returns new bdev_inode of /dev/sda
>>  get_gendisk() returns a new gendisk (v2 gendisk)
>>  increase bdev->bd_openers
> 
> So this needs to be a bit more complex - bd_acquire() must happen before
> bdev_unhash_inode() in del_gendisk() above happened. And get_gendisk() must
> happen after blk_unregister_region() from del_gendisk() happened. But yes,
> this seems possible.
> 
>> process A (blkdev_open()):
>>  find bdev->bd_openers != 0
>>  put_disk(disk) // this is the last reference to v1 gendisk
>>  disk_unblock_events(disk) // use-after-free occurs
>>
>> The problem can be fixed by an extra check showed in the following snippet:
>>
>> static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
>> {
>> 	if (!bdev->bd_openers) {
>> 		...
>> 	} else {
>> 		/* the bdev may meanwhile have been re-associated with a
>> 		 * newer gendisk, so check that the two still match */
>> 		if (bdev->bd_disk != disk) {
>> 			ret = -ENXIO;
>> 			goto out_unlock_bdev;
>> 		}
>> 	}
>> }
>>
>> And we can also simplify the check in your patch by moving
>> blk_unregister_region() before bdev_unhash_inode(), so whenever we get a
>> new bdev_inode, the gendisk returned by get_gendisk() will either be NULL
>> or a new gendisk.
> 
> I don't think we can do that. If we call blk_unregister_region() before
> bdev_unhash_inode(), the device number can get reused while bdev inode is
> still visible. So you can get that bdev inode v1 associated with gendisk
> v2. And open following unhashing of bdev inode v1 will get bdev inode v2
> associated with gendisk v2. So you have two different bdev inodes
> associated with the same gendisk and that will cause data integrity issues.

Even if we do not move blk_unregister_region(), the data integrity issue
still exists: bd_acquire() is invoked before bdev_unhash_inode() and
get_gendisk() is invoked after blk_unregister_region(). The move is meant to
simplify the consistency check, or the version check, between bdev_inode and
gendisk, so the added check above would become unnecessary, because whenever
bd_acquire() returns a new bdev_inode, the gendisk returned by get_gendisk()
will also be a new gendisk or NULL. Anyway, it's just another workaround, so
it's OK if we do not do that.

> But what you suggest above is probably a reasonable fix. Will you submit a
> proper patch please?

Uh, it's also an incomplete fix for the race problem. Anyhow, I will do it.

>> And the only check we need to do in __blkdev_get() is
>> checking the 

Re: [PATCH 0/5] cleanup for blk-wbt

2017-11-23 Thread Jens Axboe
On 11/23/2017 06:37 AM, weiping zhang wrote:
> Hi Jens,
> 
> several cleanup for blk-wbt, no function change, thanks
> 
> weiping zhang (5):
>   blk-wbt: remove duplicated setting in wbt_init
>   blk-wbt: cleanup comments to one line
>   blk-sysfs: remove NULL pointer checking in queue_wb_lat_store
>   blk-wbt: move wbt_clear_stat to common place in wbt_done
>   blk-wbt: fix comments typo

Thanks, looks good - I applied 4 of them and dropped the comment fixup. It's
pointless churn, and there are lots of other comments using that style for
one line, even in that same file.

-- 
Jens Axboe



Re: [PATCH v3 RESEND 1/2] blk-throttle: track read and write request individually

2017-11-23 Thread Joseph Qi
Hi Jens,
Could you please give your advice for the two patches or pick them up if
you think they are good?

Thanks,
Joseph

On 17/11/21 09:38, Joseph Qi wrote:
> From: Joseph Qi 
> 
> In a mixed read/write workload on SSD, write latency is much lower than
> read latency. But now we only track and record read latency and then use
> it as the threshold base for both read and write io latency accounting. As
> a result, write io latency will always be considered good and bad_bio_cnt
> is much smaller than 20% of bio_cnt. That is to say, the tg being checked
> will be treated as idle most of the time and will still let others dispatch
> more ios, even when it is truly running under its low limit and wants that
> low limit to be guaranteed, which is not what we expect. So track read and
> write requests individually, which can bring more precise latency control
> for low limit idle detection.
> 
> Signed-off-by: Joseph Qi 
> Reviewed-by: Shaohua Li 
> ---
>  block/blk-throttle.c | 134 +++++++++++++++++++++++++++++++-----------------
>  1 file changed, 79 insertions(+), 55 deletions(-)
> 
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index 96ad326..bf52035 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -216,9 +216,9 @@ struct throtl_data
>  
>   unsigned int scale;
>  
> - struct latency_bucket tmp_buckets[LATENCY_BUCKET_SIZE];
> - struct avg_latency_bucket avg_buckets[LATENCY_BUCKET_SIZE];
> - struct latency_bucket __percpu *latency_buckets;
> + struct latency_bucket tmp_buckets[2][LATENCY_BUCKET_SIZE];
> + struct avg_latency_bucket avg_buckets[2][LATENCY_BUCKET_SIZE];
> + struct latency_bucket __percpu *latency_buckets[2];
>   unsigned long last_calculate_time;
>   unsigned long filtered_latency;
>  
> @@ -2041,10 +2041,10 @@ static void blk_throtl_update_idletime(struct throtl_grp *tg)
>  #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
>  static void throtl_update_latency_buckets(struct throtl_data *td)
>  {
> - struct avg_latency_bucket avg_latency[LATENCY_BUCKET_SIZE];
> - int i, cpu;
> - unsigned long last_latency = 0;
> - unsigned long latency;
> + struct avg_latency_bucket avg_latency[2][LATENCY_BUCKET_SIZE];
> + int i, cpu, rw;
> + unsigned long last_latency[2] = { 0 };
> + unsigned long latency[2];
>  
>   if (!blk_queue_nonrot(td->queue))
>   return;
> @@ -2053,56 +2053,67 @@ static void throtl_update_latency_buckets(struct throtl_data *td)
>   td->last_calculate_time = jiffies;
>  
>   memset(avg_latency, 0, sizeof(avg_latency));
> - for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
> - struct latency_bucket *tmp = &td->tmp_buckets[i];
> -
> - for_each_possible_cpu(cpu) {
> - struct latency_bucket *bucket;
> -
> - /* this isn't race free, but ok in practice */
> - bucket = per_cpu_ptr(td->latency_buckets, cpu);
> - tmp->total_latency += bucket[i].total_latency;
> - tmp->samples += bucket[i].samples;
> - bucket[i].total_latency = 0;
> - bucket[i].samples = 0;
> - }
> + for (rw = READ; rw <= WRITE; rw++) {
> + for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
> + struct latency_bucket *tmp = &td->tmp_buckets[rw][i];
> +
> + for_each_possible_cpu(cpu) {
> + struct latency_bucket *bucket;
> +
> + /* this isn't race free, but ok in practice */
> + bucket = per_cpu_ptr(td->latency_buckets[rw],
> + cpu);
> + tmp->total_latency += bucket[i].total_latency;
> + tmp->samples += bucket[i].samples;
> + bucket[i].total_latency = 0;
> + bucket[i].samples = 0;
> + }
>  
> - if (tmp->samples >= 32) {
> - int samples = tmp->samples;
> + if (tmp->samples >= 32) {
> + int samples = tmp->samples;
>  
> - latency = tmp->total_latency;
> + latency[rw] = tmp->total_latency;
>  
> - tmp->total_latency = 0;
> - tmp->samples = 0;
> - latency /= samples;
> - if (latency == 0)
> - continue;
> - avg_latency[i].latency = latency;
> + tmp->total_latency = 0;
> + tmp->samples = 0;
> + latency[rw] /= samples;
> + if (latency[rw] == 0)
> + continue;
> + 

Re: [GIT PULL] nvme fixes for Linux 4.15

2017-11-23 Thread Jens Axboe
On 11/23/2017 07:44 AM, Christoph Hellwig wrote:
> Sagi Grimberg (3):
>   nvme-fc: check if queue is ready in queue_rq
>   nvme-loop: check if queue is ready in queue_rq

The nvme-loop part looks fine, but why is the nvme-fc part using:

enum nvme_fc_queue_flags {
        NVME_FC_Q_CONNECTED = (1 << 0),
+       NVME_FC_Q_LIVE = (1 << 1),
};

for flags that are used with set_bit() and friends? That's just
misleading, should be 0, 1, etc, not a shift.

The rest looks pretty straight forward, but the above is an eye sore.

-- 
Jens Axboe



Re: [PATCH V14 02/24] mmc: block: Check return value of blk_get_request()

2017-11-23 Thread Linus Walleij
On Tue, Nov 21, 2017 at 2:42 PM, Adrian Hunter  wrote:

> blk_get_request() can fail, always check the return value.
>
> Fixes: 0493f6fe5bde ("mmc: block: Move boot partition locking into a driver op")
> Fixes: 3ecd8cf23f88 ("mmc: block: move multi-ioctl() to use block layer")
> Fixes: 614f0388f580 ("mmc: block: move single ioctl() commands to block requests")
> Fixes: 627c3ccfb46a ("mmc: debugfs: Move block debugfs into block module")
> Signed-off-by: Adrian Hunter 

Reviewed-by: Linus Walleij 
This should also go into fixes I think.

I guess this particular sloppiness was due to modeling the code on
drivers/ide/* which doesn't check the return value either.

I should go and fix that :/

Yours,
Linus Walleij


Re: [PATCH V14 07/24] mmc: block: Use data timeout in card_busy_detect()

2017-11-23 Thread Adrian Hunter
On 22/11/17 16:43, Ulf Hansson wrote:
> On 22 November 2017 at 08:40, Adrian Hunter  wrote:
>> On 21/11/17 17:39, Ulf Hansson wrote:
>>> On 21 November 2017 at 14:42, Adrian Hunter  wrote:
 card_busy_detect() has a 10 minute timeout. However the correct timeout is
 the data timeout. Change card_busy_detect() to use the data timeout.
>>>
>>> Unfortunately, I don't think there is a "correct" timeout for this case.
>>>
>>> The data->timeout_ns indicates to the host the maximum time it's
>>> allowed to take between blocks that are written to the data lines.
>>>
>>> I haven't found a definition of the busy timeout, after the data write
>>> has completed. The spec only mentions that the device moves to
>>> programming state and pulls DAT0 to indicate busy.
>>
>> To me it reads more like the timeout is for each block, including the last,
>> i.e. the same timeout for "busy".  Note the card is also busy between blocks.
> 
> I don't think that is the same timeout. Or maybe it is.
> 
> In the eMMC 5.1 spec, there are mentions about "buffer busy signal"
> and "programming busy signal", see section 6.15.3 (Timings - Data
> Write).
> 
> Anyway, whether any of them is specified, is to me unclear.
> 
>>
>> Equally it is the timeout we give the host controller.  So either the host
>> controller does not have a timeout for "busy" - which begs the question why
>> it has a timeout at all - or it invents its own "busy" timeout - which begs
>> the question why it isn't in the spec.
> 
> Well, there are some vague hints in section 6.8.2 (Time-out
> conditions), but I don't find these timeouts values being referred to
> in 6.15 (Timings). And that puzzles me.
> 
> Moreover, the below is quoted from section 6.6.8.1 (Block write):
> --
> Some Devices may require long and unpredictable times to write a block
> of data. After receiving a block of data and completing the CRC check,
> the Device will begin writing and hold the DAT0 line low. The host may
> poll the status of the Device with a SEND_STATUS command (CMD13) at
> any time, and the Device will respond with its status (except in Sleep
> state). The status bit READY_FOR_DATA indicates whether the Device can
> accept new data or not. The host may deselect the Device by issuing
> CMD7 that will then displace the Device into the Disconnect State and
> release the DAT0 line without interrupting the write operation. When
> reselecting the Device, it will reactivate busy indication by pulling
> DAT0 to low. See 6.15 for details of busy indication.
> --
> 
>>
>>>
>>> Sure, 10 min seems crazy, perhaps something along the lines of 10-20 s
>>> is more reasonable. What do you think?
>>
>> We give SD cards a generous 3 seconds for writes.  SDHCI has long had a 10
>> second software timer for the whole request, which strongly suggests that
>> requests have always completed within 10 seconds.  So that puts the range of
>> an arbitrary timeout 3-10 s.
> 
> From the reasoning above, I guess we could try out 10 s. That is at
> least a lot better than 10 minutes.
> 
> I also see that we have three different places (switch, erase, block
> data transfers) implementing busy signal detection. It would be nice to
> try to align those pieces of code, because they are quite similar. Of
> course, this deserves its own separate task to try to fix up.
> 
> BTW, perhaps we should move this to a separate change on top of the
> series? Or is there a specific need for this in regard to blkmq and
> CQE?

It is related to the recovery changes, so can be moved later in the patch set.
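
As a rough illustration of the direction discussed above, a bounded busy
poll could look like the sketch below. This is illustrative only, not the
actual card_busy_detect() code; it assumes the mmc core helpers
mmc_send_status(), R1_READY_FOR_DATA, R1_CURRENT_STATE() and R1_STATE_PRG.

	/* Sketch: poll CMD13 until the card leaves the programming state,
	 * bounded by a caller-supplied timeout instead of 10 minutes. */
	static int mmc_busy_poll_sketch(struct mmc_card *card,
					unsigned int timeout_ms)
	{
		unsigned long deadline = jiffies + msecs_to_jiffies(timeout_ms);
		u32 status;
		int err;

		do {
			err = mmc_send_status(card, &status);	/* CMD13 */
			if (err)
				return err;
			if ((status & R1_READY_FOR_DATA) &&
			    R1_CURRENT_STATE(status) != R1_STATE_PRG)
				return 0;	/* no longer busy */
		} while (!time_after(jiffies, deadline));

		return -ETIMEDOUT;
	}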


Re: [PATCH V14 03/24] mmc: core: Do not leave the block driver in a suspended state

2017-11-23 Thread Linus Walleij
On Tue, Nov 21, 2017 at 2:42 PM, Adrian Hunter  wrote:

> The block driver must be resumed if the mmc bus fails to suspend the card.
>
> Signed-off-by: Adrian Hunter 

Reviewed-by: Linus Walleij 

Also looks like a clear candidate for fixes.

Yours,
Linus Walleij


Re: 4.14: WARNING: CPU: 4 PID: 2895 at block/blk-mq.c:1144 with virtio-blk (also 4.12 stable)

2017-11-23 Thread Hannes Reinecke
On 11/23/2017 03:34 PM, Christoph Hellwig wrote:
> FYI, the patch below changes both the irq and block mappings to
> always use the cpu possible map (should be split in two in due time).
> 
> I think this is the right way forward.  For every normal machine
> those two are the same, but for VMs with maxcpus above their normal
> count or some big iron that can grow more cpus it means we waste
> a few more resources for the not present but reserved cpus.  It
> fixes the reported issue for me:
> 
> diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
> index 9f8cffc8a701..3eb169f15842 100644
> --- a/block/blk-mq-cpumap.c
> +++ b/block/blk-mq-cpumap.c
> @@ -16,11 +16,6 @@
>  
>  static int cpu_to_queue_index(unsigned int nr_queues, const int cpu)
>  {
> - /*
> -  * Non present CPU will be mapped to queue index 0.
> -  */
> - if (!cpu_present(cpu))
> - return 0;
>   return cpu % nr_queues;
>  }
>  
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 11097477eeab..612ce1fb7c4e 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2114,16 +2114,11 @@ static void blk_mq_init_cpu_queues(struct 
> request_queue *q,
>   INIT_LIST_HEAD(&__ctx->rq_list);
>   __ctx->queue = q;
>  
> - /* If the cpu isn't present, the cpu is mapped to first hctx */
> - if (!cpu_present(i))
> - continue;
> -
> - hctx = blk_mq_map_queue(q, i);
> -
>   /*
>* Set local node, IFF we have more than one hw queue. If
>* not, we remain on the home node of the device
>*/
> + hctx = blk_mq_map_queue(q, i);
>   if (nr_hw_queues > 1 && hctx->numa_node == NUMA_NO_NODE)
>   hctx->numa_node = local_memory_node(cpu_to_node(i));
>   }
> @@ -2180,7 +2175,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
>*
>* If the cpu isn't present, the cpu is mapped to first hctx.
>*/
> - for_each_present_cpu(i) {
> + for_each_possible_cpu(i) {
>   hctx_idx = q->mq_map[i];
>   /* unmapped hw queue can be remapped after CPU topo changed */
>   if (!set->tags[hctx_idx] &&
> diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
> index e12d35108225..a37a3b4b6342 100644
> --- a/kernel/irq/affinity.c
> +++ b/kernel/irq/affinity.c
> @@ -39,7 +39,7 @@ static void irq_spread_init_one(struct cpumask *irqmsk, 
> struct cpumask *nmsk,
>   }
>  }
>  
> -static cpumask_var_t *alloc_node_to_present_cpumask(void)
> +static cpumask_var_t *alloc_node_to_possible_cpumask(void)
>  {
>   cpumask_var_t *masks;
>   int node;
> @@ -62,7 +62,7 @@ static cpumask_var_t *alloc_node_to_present_cpumask(void)
>   return NULL;
>  }
>  
> -static void free_node_to_present_cpumask(cpumask_var_t *masks)
> +static void free_node_to_possible_cpumask(cpumask_var_t *masks)
>  {
>   int node;
>  
> @@ -71,22 +71,22 @@ static void free_node_to_present_cpumask(cpumask_var_t 
> *masks)
>   kfree(masks);
>  }
>  
> -static void build_node_to_present_cpumask(cpumask_var_t *masks)
> +static void build_node_to_possible_cpumask(cpumask_var_t *masks)
>  {
>   int cpu;
>  
> - for_each_present_cpu(cpu)
> + for_each_possible_cpu(cpu)
>   cpumask_set_cpu(cpu, masks[cpu_to_node(cpu)]);
>  }
>  
> -static int get_nodes_in_cpumask(cpumask_var_t *node_to_present_cpumask,
> +static int get_nodes_in_cpumask(cpumask_var_t *node_to_possible_cpumask,
>   const struct cpumask *mask, nodemask_t *nodemsk)
>  {
>   int n, nodes = 0;
>  
>   /* Calculate the number of nodes in the supplied affinity mask */
>   for_each_node(n) {
> - if (cpumask_intersects(mask, node_to_present_cpumask[n])) {
> + if (cpumask_intersects(mask, node_to_possible_cpumask[n])) {
>   node_set(n, *nodemsk);
>   nodes++;
>   }
> @@ -109,7 +109,7 @@ irq_create_affinity_masks(int nvecs, const struct 
> irq_affinity *affd)
>   int last_affv = affv + affd->pre_vectors;
>   nodemask_t nodemsk = NODE_MASK_NONE;
>   struct cpumask *masks;
> - cpumask_var_t nmsk, *node_to_present_cpumask;
> + cpumask_var_t nmsk, *node_to_possible_cpumask;
>  
>   /*
>* If there aren't any vectors left after applying the pre/post
> @@ -125,8 +125,8 @@ irq_create_affinity_masks(int nvecs, const struct 
> irq_affinity *affd)
>   if (!masks)
>   goto out;
>  
> - node_to_present_cpumask = alloc_node_to_present_cpumask();
> - if (!node_to_present_cpumask)
> + node_to_possible_cpumask = alloc_node_to_possible_cpumask();
> + if (!node_to_possible_cpumask)
>   goto out;
>  
>   /* Fill out vectors at the beginning that don't need affinity */
> @@ -135,8 +135,8 @@ irq_create_affinity_masks(int nvecs, const struct 
> irq_affinity *affd)
>  
> 

Re: 4.14: WARNING: CPU: 4 PID: 2895 at block/blk-mq.c:1144 with virtio-blk (also 4.12 stable)

2017-11-23 Thread Christian Borntraeger
Yes it seems to fix the bug.

On 11/23/2017 03:34 PM, Christoph Hellwig wrote:
> FYI, the patch below changes both the irq and block mappings to
> always use the cpu possible map (should be split in two in due time).
> 
> I think this is the right way forward.  For every normal machine
> those two are the same, but for VMs with maxcpus above their normal
> count or some big iron that can grow more cpus it means we waste
> a few more resources for the not present but reserved cpus.  It
> fixes the reported issue for me:
> 
> diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
> index 9f8cffc8a701..3eb169f15842 100644
> --- a/block/blk-mq-cpumap.c
> +++ b/block/blk-mq-cpumap.c
> @@ -16,11 +16,6 @@
> 
>  static int cpu_to_queue_index(unsigned int nr_queues, const int cpu)
>  {
> - /*
> -  * Non present CPU will be mapped to queue index 0.
> -  */
> - if (!cpu_present(cpu))
> - return 0;
>   return cpu % nr_queues;
>  }
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 11097477eeab..612ce1fb7c4e 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2114,16 +2114,11 @@ static void blk_mq_init_cpu_queues(struct 
> request_queue *q,
>   INIT_LIST_HEAD(&__ctx->rq_list);
>   __ctx->queue = q;
> 
> - /* If the cpu isn't present, the cpu is mapped to first hctx */
> - if (!cpu_present(i))
> - continue;
> -
> - hctx = blk_mq_map_queue(q, i);
> -
>   /*
>* Set local node, IFF we have more than one hw queue. If
>* not, we remain on the home node of the device
>*/
> + hctx = blk_mq_map_queue(q, i);
>   if (nr_hw_queues > 1 && hctx->numa_node == NUMA_NO_NODE)
>   hctx->numa_node = local_memory_node(cpu_to_node(i));
>   }
> @@ -2180,7 +2175,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
>*
>* If the cpu isn't present, the cpu is mapped to first hctx.
>*/
> - for_each_present_cpu(i) {
> + for_each_possible_cpu(i) {
>   hctx_idx = q->mq_map[i];
>   /* unmapped hw queue can be remapped after CPU topo changed */
>   if (!set->tags[hctx_idx] &&
> diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
> index e12d35108225..a37a3b4b6342 100644
> --- a/kernel/irq/affinity.c
> +++ b/kernel/irq/affinity.c
> @@ -39,7 +39,7 @@ static void irq_spread_init_one(struct cpumask *irqmsk, 
> struct cpumask *nmsk,
>   }
>  }
> 
> -static cpumask_var_t *alloc_node_to_present_cpumask(void)
> +static cpumask_var_t *alloc_node_to_possible_cpumask(void)
>  {
>   cpumask_var_t *masks;
>   int node;
> @@ -62,7 +62,7 @@ static cpumask_var_t *alloc_node_to_present_cpumask(void)
>   return NULL;
>  }
> 
> -static void free_node_to_present_cpumask(cpumask_var_t *masks)
> +static void free_node_to_possible_cpumask(cpumask_var_t *masks)
>  {
>   int node;
> 
> @@ -71,22 +71,22 @@ static void free_node_to_present_cpumask(cpumask_var_t 
> *masks)
>   kfree(masks);
>  }
> 
> -static void build_node_to_present_cpumask(cpumask_var_t *masks)
> +static void build_node_to_possible_cpumask(cpumask_var_t *masks)
>  {
>   int cpu;
> 
> - for_each_present_cpu(cpu)
> + for_each_possible_cpu(cpu)
>   cpumask_set_cpu(cpu, masks[cpu_to_node(cpu)]);
>  }
> 
> -static int get_nodes_in_cpumask(cpumask_var_t *node_to_present_cpumask,
> +static int get_nodes_in_cpumask(cpumask_var_t *node_to_possible_cpumask,
>   const struct cpumask *mask, nodemask_t *nodemsk)
>  {
>   int n, nodes = 0;
> 
>   /* Calculate the number of nodes in the supplied affinity mask */
>   for_each_node(n) {
> - if (cpumask_intersects(mask, node_to_present_cpumask[n])) {
> + if (cpumask_intersects(mask, node_to_possible_cpumask[n])) {
>   node_set(n, *nodemsk);
>   nodes++;
>   }
> @@ -109,7 +109,7 @@ irq_create_affinity_masks(int nvecs, const struct 
> irq_affinity *affd)
>   int last_affv = affv + affd->pre_vectors;
>   nodemask_t nodemsk = NODE_MASK_NONE;
>   struct cpumask *masks;
> - cpumask_var_t nmsk, *node_to_present_cpumask;
> + cpumask_var_t nmsk, *node_to_possible_cpumask;
> 
>   /*
>* If there aren't any vectors left after applying the pre/post
> @@ -125,8 +125,8 @@ irq_create_affinity_masks(int nvecs, const struct 
> irq_affinity *affd)
>   if (!masks)
>   goto out;
> 
> - node_to_present_cpumask = alloc_node_to_present_cpumask();
> - if (!node_to_present_cpumask)
> + node_to_possible_cpumask = alloc_node_to_possible_cpumask();
> + if (!node_to_possible_cpumask)
>   goto out;
> 
>   /* Fill out vectors at the beginning that don't need affinity */
> @@ -135,8 +135,8 @@ irq_create_affinity_masks(int nvecs, const struct 
> 

[PATCH 0/5] cleanup for blk-wbt

2017-11-23 Thread weiping zhang
Hi Jens,

several cleanup for blk-wbt, no function change, thanks

weiping zhang (5):
  blk-wbt: remove duplicated setting in wbt_init
  blk-wbt: cleanup comments to one line
  blk-sysfs: remove NULL pointer checking in queue_wb_lat_store
  blk-wbt: move wbt_clear_stat to common place in wbt_done
  blk-wbt: fix comments typo

 block/blk-sysfs.c |  5 +
 block/blk-wbt.c   | 11 +++
 2 files changed, 4 insertions(+), 12 deletions(-)

-- 
2.9.4



Re: 4.14: WARNING: CPU: 4 PID: 2895 at block/blk-mq.c:1144 with virtio-blk (also 4.12 stable)

2017-11-23 Thread Christoph Hellwig
I can't reproduce it in my VM when adding a new CPU.  Do you have
any interesting blk-mq setup, like actually using multiple queues?  I'll
give that a spin next.


Re: 4.14: WARNING: CPU: 4 PID: 2895 at block/blk-mq.c:1144 with virtio-blk (also 4.12 stable)

2017-11-23 Thread Christoph Hellwig
[fullquote deleted]

> What will happen for the CPU hotplug case?
> Wouldn't we route I/O to a disabled CPU with this patch?

Why would we route I/O to a disabled CPU? (We generally route
I/O to devices to start with.)  How would including possible
but not present cpus change anything?


[GIT PULL] nvme fixes for Linux 4.15

2017-11-23 Thread Christoph Hellwig
Hi Jens,

a couple nvme fixes for 4.15:

 - expand the queue ready fix that we only had for RDMA to also cover FC and
   loop by moving it to common code (Sagi)
 - fix an out-of-bounds array access in the PCIe HMB code (Minwoo Im)
 - two new device quirks (Jeff Lien and Kai-Heng Feng)
 - static checker fixes (Keith Busch)
 - FC target refcount fix (James Smart)
 - A trivial spelling fix in new code (Colin Ian King)


The following changes since commit 1690102de5651bb85b23d5eeaff682a6b96d705b:

  blktrace: Use blk_trace_bio_get_cgid inside blk_add_trace_bio (2017-11-19 12:37:26 -0700)

are available in the git repository at:

  git://git.infradead.org/nvme.git nvme-4.15

for you to fetch changes up to 8c97eeccf0ad8783c057830119467b877bdfced7:

  nvme-pci: add quirk for delay before CHK RDY for WDC SN200 (2017-11-23 09:12:08 +0100)


Colin Ian King (1):
  nvme: fix spelling mistake: "requeing" -> "requeuing"

James Smart (1):
  nvmet-fc: correct ref counting error when deferred rcv used

Jeff Lien (1):
  nvme-pci: add quirk for delay before CHK RDY for WDC SN200

Kai-Heng Feng (1):
  nvme-pci: disable APST on Samsung SSD 960 EVO + ASUS PRIME B350M-A

Keith Busch (2):
  nvme: Fix NULL dereference on reservation request
  nvme: Suppress static analyis warning

Minwoo Im (1):
  nvme-pci: avoid hmb desc array idx out-of-bound when hmmaxd set.

Sagi Grimberg (3):
  nvme-fabrics: introduce init command check for a queue that is not alive
  nvme-fc: check if queue is ready in queue_rq
  nvme-loop: check if queue is ready in queue_rq

 drivers/nvme/host/core.c  | 19 ---
 drivers/nvme/host/fabrics.h   | 30 ++
 drivers/nvme/host/fc.c| 19 ++-
 drivers/nvme/host/multipath.c |  2 +-
 drivers/nvme/host/nvme.h  |  2 +-
 drivers/nvme/host/pci.c   | 16 +---
 drivers/nvme/host/rdma.c  | 32 ++--
 drivers/nvme/target/fc.c  |  9 ++---
 drivers/nvme/target/loop.c| 25 -
 9 files changed, 107 insertions(+), 47 deletions(-)
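
For context, the three "check if queue is ready in queue_rq" fixes above
share one idea: until a fabrics queue reaches the LIVE state, only the
Connect command may be issued on it. A hedged sketch of that shape is
below; the helper's real name, location (drivers/nvme/host/fabrics.h) and
exact conditions in the series may differ.

	static inline bool nvmf_queue_is_ready_sketch(bool queue_live,
						      struct request *rq)
	{
		struct nvme_command *cmd = nvme_req(rq)->cmd;

		if (likely(queue_live))
			return true;

		/* only Connect may pass before the queue goes live */
		return blk_rq_is_passthrough(rq) &&
		       cmd->common.opcode == nvme_fabrics_command &&
		       cmd->fabrics.fctype == nvme_fabrics_type_connect;
	}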


Re: [PATCH V14 03/24] mmc: core: Do not leave the block driver in a suspended state

2017-11-23 Thread Ulf Hansson
On 21 November 2017 at 14:42, Adrian Hunter  wrote:
> The block driver must be resumed if the mmc bus fails to suspend the card.
>
> Signed-off-by: Adrian Hunter 

Thanks, applied for fixes and added a stable tag (I think v3.19+ is
the first one we can pick, else some other manual backporting is
needed).

Kind regards
Uffe

> ---
>  drivers/mmc/core/bus.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/mmc/core/bus.c b/drivers/mmc/core/bus.c
> index a4b49e25fe96..7586ff2ad1f1 100644
> --- a/drivers/mmc/core/bus.c
> +++ b/drivers/mmc/core/bus.c
> @@ -157,6 +157,9 @@ static int mmc_bus_suspend(struct device *dev)
> return ret;
>
> ret = host->bus_ops->suspend(host);
> +   if (ret)
> +   pm_generic_resume(dev);
> +
> return ret;
>  }
>
> --
> 1.9.1
>


Re: [PATCH V14 04/24] mmc: block: Ensure that debugfs files are removed

2017-11-23 Thread Ulf Hansson
On 21 November 2017 at 14:42, Adrian Hunter  wrote:
> The card is not necessarily being removed, but the debugfs files must be
> removed when the driver is removed, otherwise they will continue to exist
> after unbinding the card from the driver. e.g.
>
>   # echo "mmc1:0001" > /sys/bus/mmc/drivers/mmcblk/unbind
>   # cat /sys/kernel/debug/mmc1/mmc1\:0001/ext_csd
>   [  173.634584] BUG: unable to handle kernel NULL pointer dereference at 
> 0050
>   [  173.643356] IP: mmc_ext_csd_open+0x5e/0x170
>
> A complication is that the debugfs_root may have already been removed, so
> check for that too.
>
> Fixes: 627c3ccfb46a ("mmc: debugfs: Move block debugfs into block module")
> Signed-off-by: Adrian Hunter 

Thanks, applied for fixes and added a stable tag!

Kind regards
Uffe

> ---
>  drivers/mmc/core/block.c   | 44 +---
>  drivers/mmc/core/debugfs.c |  1 +
>  2 files changed, 38 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
> index 4a319ddbd956..ccfa98af1dd3 100644
> --- a/drivers/mmc/core/block.c
> +++ b/drivers/mmc/core/block.c
> @@ -122,6 +122,10 @@ struct mmc_blk_data {
> struct device_attribute force_ro;
> struct device_attribute power_ro_lock;
> int area_type;
> +
> +   /* debugfs files (only in main mmc_blk_data) */
> +   struct dentry *status_dentry;
> +   struct dentry *ext_csd_dentry;
>  };
>
>  /* Device type for RPMB character devices */
> @@ -2653,7 +2657,7 @@ static int mmc_ext_csd_release(struct inode *inode, 
> struct file *file)
> .llseek = default_llseek,
>  };
>
> -static int mmc_blk_add_debugfs(struct mmc_card *card)
> +static int mmc_blk_add_debugfs(struct mmc_card *card, struct mmc_blk_data 
> *md)
>  {
> struct dentry *root;
>
> @@ -2663,28 +2667,53 @@ static int mmc_blk_add_debugfs(struct mmc_card *card)
> root = card->debugfs_root;
>
> if (mmc_card_mmc(card) || mmc_card_sd(card)) {
> -   if (!debugfs_create_file("status", S_IRUSR, root, card,
> -                              &mmc_dbg_card_status_fops))
> +   md->status_dentry =
> +   debugfs_create_file("status", S_IRUSR, root, card,
> +                           &mmc_dbg_card_status_fops);
> +   if (!md->status_dentry)
> return -EIO;
> }
>
> if (mmc_card_mmc(card)) {
> -   if (!debugfs_create_file("ext_csd", S_IRUSR, root, card,
> -                              &mmc_dbg_ext_csd_fops))
> +   md->ext_csd_dentry =
> +   debugfs_create_file("ext_csd", S_IRUSR, root, card,
> +                           &mmc_dbg_ext_csd_fops);
> +   if (!md->ext_csd_dentry)
> return -EIO;
> }
>
> return 0;
>  }
>
> +static void mmc_blk_remove_debugfs(struct mmc_card *card,
> +  struct mmc_blk_data *md)
> +{
> +   if (!card->debugfs_root)
> +   return;
> +
> +   if (!IS_ERR_OR_NULL(md->status_dentry)) {
> +   debugfs_remove(md->status_dentry);
> +   md->status_dentry = NULL;
> +   }
> +
> +   if (!IS_ERR_OR_NULL(md->ext_csd_dentry)) {
> +   debugfs_remove(md->ext_csd_dentry);
> +   md->ext_csd_dentry = NULL;
> +   }
> +}
>
>  #else
>
> -static int mmc_blk_add_debugfs(struct mmc_card *card)
> +static int mmc_blk_add_debugfs(struct mmc_card *card, struct mmc_blk_data 
> *md)
>  {
> return 0;
>  }
>
> +static void mmc_blk_remove_debugfs(struct mmc_card *card,
> +  struct mmc_blk_data *md)
> +{
> +}
> +
>  #endif /* CONFIG_DEBUG_FS */
>
>  static int mmc_blk_probe(struct mmc_card *card)
> @@ -2724,7 +2753,7 @@ static int mmc_blk_probe(struct mmc_card *card)
> }
>
> /* Add two debugfs entries */
> -   mmc_blk_add_debugfs(card);
> +   mmc_blk_add_debugfs(card, md);
>
> pm_runtime_set_autosuspend_delay(&card->dev, 3000);
> pm_runtime_use_autosuspend(&card->dev);
> @@ -2750,6 +2779,7 @@ static void mmc_blk_remove(struct mmc_card *card)
>  {
> struct mmc_blk_data *md = dev_get_drvdata(&card->dev);
>
> +   mmc_blk_remove_debugfs(card, md);
> mmc_blk_remove_parts(card, md);
> pm_runtime_get_sync(&card->dev);
> mmc_claim_host(card->host);
> diff --git a/drivers/mmc/core/debugfs.c b/drivers/mmc/core/debugfs.c
> index 01e459a34f33..0f4a7d7b2626 100644
> --- a/drivers/mmc/core/debugfs.c
> +++ b/drivers/mmc/core/debugfs.c
> @@ -314,4 +314,5 @@ void mmc_add_card_debugfs(struct mmc_card *card)
>  void mmc_remove_card_debugfs(struct mmc_card *card)
>  {
> debugfs_remove_recursive(card->debugfs_root);
> +   card->debugfs_root = NULL;
>  }
> --
> 1.9.1
>


Re: [PATCH V14 02/24] mmc: block: Check return value of blk_get_request()

2017-11-23 Thread Ulf Hansson
On 21 November 2017 at 14:42, Adrian Hunter  wrote:
> blk_get_request() can fail, always check the return value.
>
> Fixes: 0493f6fe5bde ("mmc: block: Move boot partition locking into a driver op")
> Fixes: 3ecd8cf23f88 ("mmc: block: move multi-ioctl() to use block layer")
> Fixes: 614f0388f580 ("mmc: block: move single ioctl() commands to block requests")
> Fixes: 627c3ccfb46a ("mmc: debugfs: Move block debugfs into block module")
> Signed-off-by: Adrian Hunter 

Thanks, applied for fixes and added a stable tag!

Kind regards
Uffe

> ---
>  drivers/mmc/core/block.c | 20 +++-
>  1 file changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
> index f60939858586..4a319ddbd956 100644
> --- a/drivers/mmc/core/block.c
> +++ b/drivers/mmc/core/block.c
> @@ -233,6 +233,10 @@ static ssize_t power_ro_lock_store(struct device *dev,
>
> /* Dispatch locking to the block layer */
> req = blk_get_request(mq->queue, REQ_OP_DRV_OUT, __GFP_RECLAIM);
> +   if (IS_ERR(req)) {
> +   count = PTR_ERR(req);
> +   goto out_put;
> +   }
> req_to_mmc_queue_req(req)->drv_op = MMC_DRV_OP_BOOT_WP;
> blk_execute_rq(mq->queue, NULL, req, 0);
> ret = req_to_mmc_queue_req(req)->drv_op_result;
> @@ -249,7 +253,7 @@ static ssize_t power_ro_lock_store(struct device *dev,
> set_disk_ro(part_md->disk, 1);
> }
> }
> -
> +out_put:
> mmc_blk_put(md);
> return count;
>  }
> @@ -625,6 +629,10 @@ static int mmc_blk_ioctl_cmd(struct mmc_blk_data *md,
> req = blk_get_request(mq->queue,
> idata->ic.write_flag ? REQ_OP_DRV_OUT : REQ_OP_DRV_IN,
> __GFP_RECLAIM);
> +   if (IS_ERR(req)) {
> +   err = PTR_ERR(req);
> +   goto cmd_done;
> +   }
> idatas[0] = idata;
> req_to_mmc_queue_req(req)->drv_op =
> rpmb ? MMC_DRV_OP_IOCTL_RPMB : MMC_DRV_OP_IOCTL;
> @@ -692,6 +700,10 @@ static int mmc_blk_ioctl_multi_cmd(struct mmc_blk_data 
> *md,
> req = blk_get_request(mq->queue,
> idata[0]->ic.write_flag ? REQ_OP_DRV_OUT : REQ_OP_DRV_IN,
> __GFP_RECLAIM);
> +   if (IS_ERR(req)) {
> +   err = PTR_ERR(req);
> +   goto cmd_err;
> +   }
> req_to_mmc_queue_req(req)->drv_op =
> rpmb ? MMC_DRV_OP_IOCTL_RPMB : MMC_DRV_OP_IOCTL;
> req_to_mmc_queue_req(req)->drv_op_data = idata;
> @@ -2551,6 +2563,8 @@ static int mmc_dbg_card_status_get(void *data, u64 *val)
>
> /* Ask the block layer about the card status */
> req = blk_get_request(mq->queue, REQ_OP_DRV_IN, __GFP_RECLAIM);
> +   if (IS_ERR(req))
> +   return PTR_ERR(req);
> req_to_mmc_queue_req(req)->drv_op = MMC_DRV_OP_GET_CARD_STATUS;
> blk_execute_rq(mq->queue, NULL, req, 0);
> ret = req_to_mmc_queue_req(req)->drv_op_result;
> @@ -2585,6 +2599,10 @@ static int mmc_ext_csd_open(struct inode *inode, 
> struct file *filp)
>
> /* Ask the block layer for the EXT CSD */
> req = blk_get_request(mq->queue, REQ_OP_DRV_IN, __GFP_RECLAIM);
> +   if (IS_ERR(req)) {
> +   err = PTR_ERR(req);
> +   goto out_free;
> +   }
> req_to_mmc_queue_req(req)->drv_op = MMC_DRV_OP_GET_EXT_CSD;
> req_to_mmc_queue_req(req)->drv_op_data = &ext_csd;
> blk_execute_rq(mq->queue, NULL, req, 0);
> --
> 1.9.1
>


Re: [PATCH V14 01/24] mmc: block: Fix missing blk_put_request()

2017-11-23 Thread Ulf Hansson
On 21 November 2017 at 14:42, Adrian Hunter  wrote:
> Ensure blk_get_request() is paired with blk_put_request().
>
> Fixes: 0493f6fe5bde ("mmc: block: Move boot partition locking into a driver op")
> Fixes: 627c3ccfb46a ("mmc: debugfs: Move block debugfs into block module")
> Signed-off-by: Adrian Hunter 

Thanks, applied for fixes and added a stable tag!

Kind regards
Uffe

> ---
>  drivers/mmc/core/block.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
> index ea80ff4cd7f9..f60939858586 100644
> --- a/drivers/mmc/core/block.c
> +++ b/drivers/mmc/core/block.c
> @@ -236,6 +236,7 @@ static ssize_t power_ro_lock_store(struct device *dev,
> req_to_mmc_queue_req(req)->drv_op = MMC_DRV_OP_BOOT_WP;
> blk_execute_rq(mq->queue, NULL, req, 0);
> ret = req_to_mmc_queue_req(req)->drv_op_result;
> +   blk_put_request(req);
>
> if (!ret) {
> pr_info("%s: Locking boot partition ro until next power on\n",
> @@ -2557,6 +2558,7 @@ static int mmc_dbg_card_status_get(void *data, u64 *val)
> *val = ret;
> ret = 0;
> }
> +   blk_put_request(req);
>
> return ret;
>  }
> @@ -2587,6 +2589,7 @@ static int mmc_ext_csd_open(struct inode *inode, struct 
> file *filp)
> req_to_mmc_queue_req(req)->drv_op_data = &ext_csd;
> blk_execute_rq(mq->queue, NULL, req, 0);
> err = req_to_mmc_queue_req(req)->drv_op_result;
> +   blk_put_request(req);
> if (err) {
> pr_err("FAILED %d\n", err);
> goto out_free;
> --
> 1.9.1
>


Re: [PATCH V14 06/24] mmc: block: Simplify cleaning up the queue

2017-11-23 Thread Linus Walleij
On Tue, Nov 21, 2017 at 2:42 PM, Adrian Hunter  wrote:

> Use blk_cleanup_queue() to shutdown the queue when the driver is removed,
> and instead get an extra reference to the queue to prevent the queue being
> freed before the final mmc_blk_put().
>
> Signed-off-by: Adrian Hunter 

This is way more elegant.
Reviewed-by: Linus Walleij 

Yours,
Linus Walleij


[PATCH 4/5] blk-wbt: move wbt_clear_stat to common place in wbt_done

2017-11-23 Thread weiping zhang
wbt_done() calls wbt_clear_state() whether or not the current stat was
tracked, so move the call to a common place.

Signed-off-by: weiping zhang 
---
 block/blk-wbt.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 0fb65f0..cd9a20a 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -178,12 +178,11 @@ void wbt_done(struct rq_wb *rwb, struct blk_issue_stat 
*stat)
 
if (wbt_is_read(stat))
wb_timestamp(rwb, >last_comp);
-   wbt_clear_state(stat);
} else {
WARN_ON_ONCE(stat == rwb->sync_cookie);
__wbt_done(rwb, wbt_stat_to_mask(stat));
-   wbt_clear_state(stat);
}
+   wbt_clear_state(stat);
 }
 
 /*
-- 
2.9.4



[PATCH 5/5] blk-wbt: fix comments typo

2017-11-23 Thread weiping zhang
Signed-off-by: weiping zhang 
---
 block/blk-wbt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index cd9a20a..9f4ef9c 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -481,7 +481,7 @@ static inline unsigned int get_limit(struct rq_wb *rwb, 
unsigned long rw)
 
/*
 * At this point we know it's a buffered write. If this is
-* kswapd trying to free memory, or REQ_SYNC is set, set, then
+* kswapd trying to free memory, or REQ_SYNC is set, then
 * it's WB_SYNC_ALL writeback, and we'll use the max limit for
 * that. If the write is marked as a background write, then use
 * the idle limit, or go to normal if we haven't had competing
-- 
2.9.4



[PATCH 1/5] blk-wbt: remove duplicated setting in wbt_init

2017-11-23 Thread weiping zhang
rwb->wc and rwb->queue_depth are overwritten by wbt_set_write_cache() and
wbt_set_queue_depth(), so remove the default settings.

Signed-off-by: weiping zhang 
---
 block/blk-wbt.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index e59d59c..edb09e93 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -723,8 +723,6 @@ int wbt_init(struct request_queue *q)
init_waitqueue_head(&rwb->rq_wait[i].wait);
}
 
-   rwb->wc = 1;
-   rwb->queue_depth = RWB_DEF_DEPTH;
rwb->last_comp = rwb->last_issue = jiffies;
rwb->queue = q;
rwb->win_nsec = RWB_WINDOW_NSEC;
-- 
2.9.4



[PATCH 3/5] blk-sysfs: remove NULL pointer checking in queue_wb_lat_store

2017-11-23 Thread weiping zhang
wbt_init() doesn't leave q->rq_wb as NULL when it returns 0, so checking
the return value is enough; remove the NULL check.

Signed-off-by: weiping zhang 
---
 block/blk-sysfs.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index b8362c0..9b1093a 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -449,12 +449,9 @@ static ssize_t queue_wb_lat_store(struct request_queue *q, 
const char *page,
ret = wbt_init(q);
if (ret)
return ret;
-
-   rwb = q->rq_wb;
-   if (!rwb)
-   return -EINVAL;
}
 
+   rwb = q->rq_wb;
if (val == -1)
rwb->min_lat_nsec = wbt_default_latency_nsec(q);
else if (val >= 0)
-- 
2.9.4



[PATCH 2/5] blk-wbt: cleanup comments to one line

2017-11-23 Thread weiping zhang
Signed-off-by: weiping zhang 
---
 block/blk-wbt.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index edb09e93..0fb65f0 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -729,9 +729,7 @@ int wbt_init(struct request_queue *q)
rwb->enable_state = WBT_STATE_ON_DEFAULT;
wbt_update_limits(rwb);
 
-   /*
-* Assign rwb and add the stats callback.
-*/
+   /* Assign rwb and add the stats callback. */
q->rq_wb = rwb;
blk_stat_add_callback(q, rwb->cb);
 
-- 
2.9.4



Re: 4.14: WARNING: CPU: 4 PID: 2895 at block/blk-mq.c:1144 with virtio-blk (also 4.12 stable)

2017-11-23 Thread Christoph Hellwig
FYI, the patch below changes both the irq and block mappings to
always use the cpu possible map (should be split in two in due time).

I think this is the right way forward.  For every normal machine
those two are the same, but for VMs with maxcpus above their normal
count or some big iron that can grow more cpus it means we waste
a few more resources for the not present but reserved cpus.  It
fixes the reported issue for me:

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 9f8cffc8a701..3eb169f15842 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -16,11 +16,6 @@
 
 static int cpu_to_queue_index(unsigned int nr_queues, const int cpu)
 {
-   /*
-* Non present CPU will be mapped to queue index 0.
-*/
-   if (!cpu_present(cpu))
-   return 0;
return cpu % nr_queues;
 }
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 11097477eeab..612ce1fb7c4e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2114,16 +2114,11 @@ static void blk_mq_init_cpu_queues(struct request_queue 
*q,
INIT_LIST_HEAD(&__ctx->rq_list);
__ctx->queue = q;
 
-   /* If the cpu isn't present, the cpu is mapped to first hctx */
-   if (!cpu_present(i))
-   continue;
-
-   hctx = blk_mq_map_queue(q, i);
-
/*
 * Set local node, IFF we have more than one hw queue. If
 * not, we remain on the home node of the device
 */
+   hctx = blk_mq_map_queue(q, i);
if (nr_hw_queues > 1 && hctx->numa_node == NUMA_NO_NODE)
hctx->numa_node = local_memory_node(cpu_to_node(i));
}
@@ -2180,7 +2175,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 *
 * If the cpu isn't present, the cpu is mapped to first hctx.
 */
-   for_each_present_cpu(i) {
+   for_each_possible_cpu(i) {
hctx_idx = q->mq_map[i];
/* unmapped hw queue can be remapped after CPU topo changed */
if (!set->tags[hctx_idx] &&
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index e12d35108225..a37a3b4b6342 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -39,7 +39,7 @@ static void irq_spread_init_one(struct cpumask *irqmsk, 
struct cpumask *nmsk,
}
 }
 
-static cpumask_var_t *alloc_node_to_present_cpumask(void)
+static cpumask_var_t *alloc_node_to_possible_cpumask(void)
 {
cpumask_var_t *masks;
int node;
@@ -62,7 +62,7 @@ static cpumask_var_t *alloc_node_to_present_cpumask(void)
return NULL;
 }
 
-static void free_node_to_present_cpumask(cpumask_var_t *masks)
+static void free_node_to_possible_cpumask(cpumask_var_t *masks)
 {
int node;
 
@@ -71,22 +71,22 @@ static void free_node_to_present_cpumask(cpumask_var_t 
*masks)
kfree(masks);
 }
 
-static void build_node_to_present_cpumask(cpumask_var_t *masks)
+static void build_node_to_possible_cpumask(cpumask_var_t *masks)
 {
int cpu;
 
-   for_each_present_cpu(cpu)
+   for_each_possible_cpu(cpu)
cpumask_set_cpu(cpu, masks[cpu_to_node(cpu)]);
 }
 
-static int get_nodes_in_cpumask(cpumask_var_t *node_to_present_cpumask,
+static int get_nodes_in_cpumask(cpumask_var_t *node_to_possible_cpumask,
const struct cpumask *mask, nodemask_t *nodemsk)
 {
int n, nodes = 0;
 
/* Calculate the number of nodes in the supplied affinity mask */
for_each_node(n) {
-   if (cpumask_intersects(mask, node_to_present_cpumask[n])) {
+   if (cpumask_intersects(mask, node_to_possible_cpumask[n])) {
node_set(n, *nodemsk);
nodes++;
}
@@ -109,7 +109,7 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
int last_affv = affv + affd->pre_vectors;
nodemask_t nodemsk = NODE_MASK_NONE;
struct cpumask *masks;
-   cpumask_var_t nmsk, *node_to_present_cpumask;
+   cpumask_var_t nmsk, *node_to_possible_cpumask;
 
/*
 * If there aren't any vectors left after applying the pre/post
@@ -125,8 +125,8 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
if (!masks)
goto out;
 
-   node_to_present_cpumask = alloc_node_to_present_cpumask();
-   if (!node_to_present_cpumask)
+   node_to_possible_cpumask = alloc_node_to_possible_cpumask();
+   if (!node_to_possible_cpumask)
goto out;
 
/* Fill out vectors at the beginning that don't need affinity */
@@ -135,8 +135,8 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
 
/* Stabilize the cpumasks */
get_online_cpus();
-   build_node_to_present_cpumask(node_to_present_cpumask);
-   nodes = get_nodes_in_cpumask(node_to_present_cpumask, 

Re: [PATCH V14 04/24] mmc: block: Ensure that debugfs files are removed

2017-11-23 Thread Linus Walleij
On Tue, Nov 21, 2017 at 2:42 PM, Adrian Hunter  wrote:

> The card is not necessarily being removed, but the debugfs files must be
> removed when the driver is removed, otherwise they will continue to exist
> after unbinding the card from the driver. e.g.
>
>   # echo "mmc1:0001" > /sys/bus/mmc/drivers/mmcblk/unbind
>   # cat /sys/kernel/debug/mmc1/mmc1\:0001/ext_csd
>   [  173.634584] BUG: unable to handle kernel NULL pointer dereference at 
> 0050
>   [  173.643356] IP: mmc_ext_csd_open+0x5e/0x170
>
> A complication is that the debugfs_root may have already been removed, so
> check for that too.
>
> Fixes: 627c3ccfb46a ("mmc: debugfs: Move block debugfs into block module")
> Signed-off-by: Adrian Hunter 

Reviewed-by: Linus Walleij 

I was assuming debugfs would always be removed from debugfs.c
using debugfs_remove_recursive(card->debugfs_root) but that
doesn't work for this case where we bind/unbind the block layer
interactively, sorry for missing it :(

Ulf: I think this can go in as an *early* fix as well, say after -rc1.

Yours,
Linus Walleij


Re: [PATCH V14 05/24] mmc: block: No need to export mmc_cleanup_queue()

2017-11-23 Thread Linus Walleij
On Tue, Nov 21, 2017 at 2:42 PM, Adrian Hunter  wrote:

> mmc_cleanup_queue() is not used by a different module. Do not export it.
>
> Signed-off-by: Adrian Hunter 

Reviewed-by: Linus Walleij 

Yours,
Linus Walleij


Re: 4.14: WARNING: CPU: 4 PID: 2895 at block/blk-mq.c:1144 with virtio-blk (also 4.12 stable)

2017-11-23 Thread Christoph Hellwig
Ok, it helps to make sure we're actually doing I/O from the CPU;
I've reproduced it now.


Re: 4.14: WARNING: CPU: 4 PID: 2895 at block/blk-mq.c:1144 with virtio-blk (also 4.12 stable)

2017-11-23 Thread Christoph Hellwig
On Thu, Nov 23, 2017 at 07:28:31PM +0100, Christian Borntraeger wrote:
> zfcp on s390.

Ok, so it can't be the interrupt code, but probably is the blk-mq-cpumap.c
changes.  Can you try to revert just those for a quick test?


Re: 4.14: WARNING: CPU: 4 PID: 2895 at block/blk-mq.c:1144 with virtio-blk (also 4.12 stable)

2017-11-23 Thread Christian Borntraeger


On 11/23/2017 07:32 PM, Christoph Hellwig wrote:
> On Thu, Nov 23, 2017 at 07:28:31PM +0100, Christian Borntraeger wrote:
>> zfcp on s390.
> 
> Ok, so it can't be the interrupt code, but probably is the blk-mq-cpumap.c
> changes.  Can you try to revert just those for a quick test?


Hmm, I get further in boot, but the system seems very sluggish and it does
not seem to be able to access the scsi disks (get data from them).



Re: 4.14: WARNING: CPU: 4 PID: 2895 at block/blk-mq.c:1144 with virtio-blk (also 4.12 stable)

2017-11-23 Thread Christian Borntraeger
zfcp on s390.

On 11/23/2017 07:25 PM, Christoph Hellwig wrote:
> What HBA driver do you use in the host?
> 



Re: 4.14: WARNING: CPU: 4 PID: 2895 at block/blk-mq.c:1144 with virtio-blk (also 4.12 stable)

2017-11-23 Thread Christian Borntraeger


On 11/23/2017 03:34 PM, Christoph Hellwig wrote:
> FYI, the patch below changes both the irq and block mappings to
> always use the cpu possible map (should be split in two in due time).
> 
> I think this is the right way forward.  For every normal machine
> those two are the same, but for VMs with maxcpus above their normal
> count or some big iron that can grow more cpus it means we waste
> a few more resources for the not present but reserved cpus.  It
> fixes the reported issue for me:


While it fixes the hotplug issue under KVM, the same kernel no longer boots
in the host; it seems stuck early at boot just before detecting the SCSI
disks. I have not yet looked into that.

Christian
> 
> diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
> index 9f8cffc8a701..3eb169f15842 100644
> --- a/block/blk-mq-cpumap.c
> +++ b/block/blk-mq-cpumap.c
> @@ -16,11 +16,6 @@
> 
>  static int cpu_to_queue_index(unsigned int nr_queues, const int cpu)
>  {
> - /*
> -  * Non present CPU will be mapped to queue index 0.
> -  */
> - if (!cpu_present(cpu))
> - return 0;
>   return cpu % nr_queues;
>  }
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 11097477eeab..612ce1fb7c4e 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2114,16 +2114,11 @@ static void blk_mq_init_cpu_queues(struct 
> request_queue *q,
>   INIT_LIST_HEAD(&__ctx->rq_list);
>   __ctx->queue = q;
> 
> - /* If the cpu isn't present, the cpu is mapped to first hctx */
> - if (!cpu_present(i))
> - continue;
> -
> - hctx = blk_mq_map_queue(q, i);
> -
>   /*
>* Set local node, IFF we have more than one hw queue. If
>* not, we remain on the home node of the device
>*/
> + hctx = blk_mq_map_queue(q, i);
>   if (nr_hw_queues > 1 && hctx->numa_node == NUMA_NO_NODE)
>   hctx->numa_node = local_memory_node(cpu_to_node(i));
>   }
> @@ -2180,7 +2175,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
>*
>* If the cpu isn't present, the cpu is mapped to first hctx.
>*/
> - for_each_present_cpu(i) {
> + for_each_possible_cpu(i) {
>   hctx_idx = q->mq_map[i];
>   /* unmapped hw queue can be remapped after CPU topo changed */
>   if (!set->tags[hctx_idx] &&
> diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
> index e12d35108225..a37a3b4b6342 100644
> --- a/kernel/irq/affinity.c
> +++ b/kernel/irq/affinity.c
> @@ -39,7 +39,7 @@ static void irq_spread_init_one(struct cpumask *irqmsk, 
> struct cpumask *nmsk,
>   }
>  }
> 
> -static cpumask_var_t *alloc_node_to_present_cpumask(void)
> +static cpumask_var_t *alloc_node_to_possible_cpumask(void)
>  {
>   cpumask_var_t *masks;
>   int node;
> @@ -62,7 +62,7 @@ static cpumask_var_t *alloc_node_to_present_cpumask(void)
>   return NULL;
>  }
> 
> -static void free_node_to_present_cpumask(cpumask_var_t *masks)
> +static void free_node_to_possible_cpumask(cpumask_var_t *masks)
>  {
>   int node;
> 
> @@ -71,22 +71,22 @@ static void free_node_to_present_cpumask(cpumask_var_t 
> *masks)
>   kfree(masks);
>  }
> 
> -static void build_node_to_present_cpumask(cpumask_var_t *masks)
> +static void build_node_to_possible_cpumask(cpumask_var_t *masks)
>  {
>   int cpu;
> 
> - for_each_present_cpu(cpu)
> + for_each_possible_cpu(cpu)
>   cpumask_set_cpu(cpu, masks[cpu_to_node(cpu)]);
>  }
> 
> -static int get_nodes_in_cpumask(cpumask_var_t *node_to_present_cpumask,
> +static int get_nodes_in_cpumask(cpumask_var_t *node_to_possible_cpumask,
>   const struct cpumask *mask, nodemask_t *nodemsk)
>  {
>   int n, nodes = 0;
> 
>   /* Calculate the number of nodes in the supplied affinity mask */
>   for_each_node(n) {
> - if (cpumask_intersects(mask, node_to_present_cpumask[n])) {
> + if (cpumask_intersects(mask, node_to_possible_cpumask[n])) {
>   node_set(n, *nodemsk);
>   nodes++;
>   }
> @@ -109,7 +109,7 @@ irq_create_affinity_masks(int nvecs, const struct 
> irq_affinity *affd)
>   int last_affv = affv + affd->pre_vectors;
>   nodemask_t nodemsk = NODE_MASK_NONE;
>   struct cpumask *masks;
> - cpumask_var_t nmsk, *node_to_present_cpumask;
> + cpumask_var_t nmsk, *node_to_possible_cpumask;
> 
>   /*
>* If there aren't any vectors left after applying the pre/post
> @@ -125,8 +125,8 @@ irq_create_affinity_masks(int nvecs, const struct 
> irq_affinity *affd)
>   if (!masks)
>   goto out;
> 
> - node_to_present_cpumask = alloc_node_to_present_cpumask();
> - if (!node_to_present_cpumask)
> + node_to_possible_cpumask = alloc_node_to_possible_cpumask();
> + if (!node_to_possible_cpumask)
>   

Re: 4.14: WARNING: CPU: 4 PID: 2895 at block/blk-mq.c:1144 with virtio-blk (also 4.12 stable)

2017-11-23 Thread Christoph Hellwig
What HBA driver do you use in the host?


[PATCH] dm: use cloned bio as head, not remainder, in __split_and_process_bio()

2017-11-23 Thread NeilBrown

When we use bio_clone_bioset() to split off the front part of a bio
and chain the two together and submit the remainder to
generic_make_request(), it is important that the newly allocated
bio is used as the head to be processed immediately, and the original
bio gets "bio_advance()"d and sent to generic_make_request() as the
remainder.

If the newly allocated bio is used as the remainder, and if it then
needs to be split again, then the next bio_clone_bioset() call will
be made while holding a reference to a bio (the result of the first clone)
from the same bioset.  This can potentially exhaust the bioset mempool
and result in a memory allocation deadlock.

So the result of the bio_clone_bioset() must be attached to the new
dm_io struct, and the original must be resubmitted.  The current code
is backwards.

Note that there is no race caused by reassigning ci.io->bio after already
calling __map_bio().  This bio will only be dereferenced again after
dec_pending() has found io->io_count to be zero, and this cannot happen
before the dec_pending() call at the end of __split_and_process_bio().

Reported-by: Mikulas Patocka 
Signed-off-by: NeilBrown 
---

Hi,
 I think this should resolve the problem Mikulas noticed, where the
 bios form a deep chain instead of a wide tree.

Thanks,
NeilBrown

 drivers/md/dm.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 99ec215f7dcb..2e0e10a1c030 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1482,12 +1482,19 @@ static void __split_and_process_bio(struct mapped_device *md,
 				 * Remainder must be passed to generic_make_request()
 				 * so that it gets handled *after* bios already submitted
 				 * have been completely processed.
+				 * We take a clone of the original to store in
+				 * ci.io->bio to be used by end_io_acct() and
+				 * for dec_pending to use for completion handling.
+				 * As this path is not used for REQ_OP_ZONE_REPORT,
+				 * the usage of io->bio in dm_remap_zone_report()
+				 * won't be affected by this reassignment.
 				 */
 				struct bio *b = bio_clone_bioset(bio, GFP_NOIO,
 								 md->queue->bio_split);
-				bio_advance(b, (bio_sectors(b) - ci.sector_count) << 9);
+				ci.io->bio = b;
+				bio_advance(bio, (bio_sectors(bio) - ci.sector_count) << 9);
 				bio_chain(b, bio);
-				generic_make_request(b);
+				generic_make_request(bio);
 				break;
 			}
 		}
-- 
2.14.0.rc0.dirty


