Re: [PATCH] queue stall with blk-mq-sched

2017-01-24 Thread Hannes Reinecke
On 01/24/2017 11:06 PM, Jens Axboe wrote:
> On 01/24/2017 12:55 PM, Jens Axboe wrote:
>> Try this patch. We only want to bump it for the driver tags, not the
>> scheduler side.
> 
> More complete version, this one actually tested. I think this should fix
> your issue, let me know.
> 
Nearly there.
The initial stall is gone, but the test got hung at the 'stonewall'
sequence again:

[global]
bs=4k
ioengine=libaio
iodepth=256
size=4g
direct=1
runtime=60
# directory=/mnt
numjobs=32
group_reporting
cpus_allowed_policy=split
filename=/dev/md127

[seq-read]
rw=read
-> stonewall

[rand-read]
rw=randread
stonewall

Restarting all queues made the fio job continue.
There were 4 queues with state 'restart', and one queue with state 'active'.
So we're missing a queue run somewhere.

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] blkcg: fix double free of new_blkg in blkcg_init_queue

2017-01-24 Thread Hou Tao
if blkg_create fails, new_blkg passed as an argument will
be freed by blkg_create, so there is no need to free it again.

Signed-off-by: Hou Tao 
---
 block/blk-cgroup.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 8ba0af7..58fb0dd 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1080,7 +1080,6 @@ int blkcg_init_queue(struct request_queue *q)
radix_tree_preload_end();
 
if (IS_ERR(blkg)) {
-   blkg_free(new_blkg);
return PTR_ERR(blkg);
}
 
-- 
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [LSF/MM TOPIC] block level event logging for storage media management

2017-01-24 Thread Song Liu

> On Jan 24, 2017, at 12:18 PM, Oleg Drokin  wrote:
> 
> 
> On Jan 23, 2017, at 2:27 AM, Dan Williams wrote:
> 
>> [ adding Oleg ]
>> 
>> On Sun, Jan 22, 2017 at 10:00 PM, Song Liu  wrote:
>>> Hi Dan,
>>> 
>>> I think the the block level event log is more like log only system. When en 
>>> event
>>> happens,  it is not necessary to take immediate action. (I guess this is 
>>> different
>>> to bad block list?).
>>> 
>>> I would hope the event log to track more information. Some of these 
>>> individual
>>> event may not be very interesting, for example, soft error or latency 
>>> outliers.
>>> However, when we gather event log for a fleet of devices, these "soft event"
>>> may become valuable for health monitoring.
>> 
>> I'd be interested in this. It sounds like you're trying to fill a gap
>> between tracing and console log messages which I believe others have
>> encountered as well.
> 
> We have a somewhat similar problem problem in Lustre and I guess it's not 
> just Lustre.
> Currently there are all sorts of conditional debug code all over the place 
> that goes
> to the console and when you enable it for anything verbose, you quickly 
> overflow
> your dmesg buffer no matter the size, that might be mostly ok for local
> "block level" stuff, but once you become distributed, it start to be a mess
> and once you get to be super large it worsens even more since you need to
> somehow coordinate data from multiple nodes, ensure all of it is not lost and 
> still
> you don't end up using a lot of it since only a few nodes end up being useful.
> (I don't know how NFS people manage to debug complicated issues using just 
> this,
> could not be super easy).
> 
> Having some sort of a buffer of a (potentially very) large size that could be
> storing the data until it's needed, or eagerly polled by some daemon for 
> storage
> (helpful when you expect a lot of data that definitely won't fit in RAM).
> 
> Tracepoints have the buffer and the daemon, but creating new messages is
> very cumbersome, so converting every debug message into one does not look 
> very feasible.
> Also it's convenient to have "event masks" one want logged that I don't think 
> you could
> do with tracepoints.
> 
> I know you were talking about reporting events to the block layer, but other 
> than plain
> errors what would block layer do with them? just a convenient way to map 
> messages
> to a particular device? You don't plan to store it on some block device as 
> part
> of the block layer, right?
> 
> Implementing such a buffer all sorts of additional generic data might be
> collected automatically for all events as part of the buffer format like
> what cpu did emit it, time, stack usage information, current pid,
> backtrace (tracepoint-alike could be optional), actual source code location of
> the message, …
> 
> Having something like that being standard part of {dev,pr}_{dbg,warn,...} and 
> friends
> would be super awesome too, I imagine (adding Greg to CC for that).
> 


Hi Oleg, 

Thanks for sharing these insights. 

We built an event logger that parses dmesg to get events. For similar reasons 
as you described 
above, it doesn't work well. And one of the biggest issue is poor "event mask" 
support. I am 
hoping get better event mask in newer implementation, for example, with kernel 
tracing filter, or 
implement customized logic in BPF. 

With a relatively mature infrastructure, we don't have much problem storing 
logs from the event
logger. Specifically, we use a daemon that collects events and send them to 
distributed storage
(HDFS+HIVE). It might be an overkill for smaller deployment. 

We do use information from similar (not exactly the one above) logs to make 
decision about 
device handling. For example, if a drive throws too much medium error in short 
period of time, 
we will kick it out of production. I think it is not necessary to include this 
in the block layer. 

Overall, I am hoping the kernel can generate accurate events, with flexible 
filter/mask support. 
There are different ways to store and consume these data. I guess most of these 
will be 
implemented in user space. Let's discuss potential use cases and requirements. 
These 
discussions should help us build the kernel part of the event log. 

Thanks,
Song










Re: [PATCH] queue stall with blk-mq-sched

2017-01-24 Thread Jens Axboe
On 01/24/2017 12:55 PM, Jens Axboe wrote:
> Try this patch. We only want to bump it for the driver tags, not the
> scheduler side.

More complete version, this one actually tested. I think this should fix
your issue, let me know.

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index a49ec77..1b156ca 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -90,9 +90,11 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
return atomic_read(>nr_active) < depth;
 }
 
-static int __blk_mq_get_tag(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue 
*bt)
+static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
+   struct sbitmap_queue *bt)
 {
-   if (!hctx_may_queue(hctx, bt))
+   if (!(data->flags & BLK_MQ_REQ_INTERNAL) &&
+   !hctx_may_queue(data->hctx, bt))
return -1;
return __sbitmap_queue_get(bt);
 }
@@ -118,7 +120,7 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
tag_offset = tags->nr_reserved_tags;
}
 
-   tag = __blk_mq_get_tag(data->hctx, bt);
+   tag = __blk_mq_get_tag(data, bt);
if (tag != -1)
goto found_tag;
 
@@ -129,7 +131,7 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
do {
prepare_to_wait(>wait, , TASK_UNINTERRUPTIBLE);
 
-   tag = __blk_mq_get_tag(data->hctx, bt);
+   tag = __blk_mq_get_tag(data, bt);
if (tag != -1)
break;
 
@@ -144,7 +146,7 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 * Retry tag allocation after running the hardware queue,
 * as running the queue may also have found completions.
 */
-   tag = __blk_mq_get_tag(data->hctx, bt);
+   tag = __blk_mq_get_tag(data, bt);
if (tag != -1)
break;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index ee69e5e..dcb5676 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -230,15 +230,14 @@ struct request *__blk_mq_alloc_request(struct 
blk_mq_alloc_data *data,
 
rq = tags->static_rqs[tag];
 
-   if (blk_mq_tag_busy(data->hctx)) {
-   rq->rq_flags = RQF_MQ_INFLIGHT;
-   atomic_inc(>hctx->nr_active);
-   }
-
if (data->flags & BLK_MQ_REQ_INTERNAL) {
rq->tag = -1;
rq->internal_tag = tag;
} else {
+   if (blk_mq_tag_busy(data->hctx)) {
+   rq->rq_flags = RQF_MQ_INFLIGHT;
+   atomic_inc(>hctx->nr_active);
+   }
rq->tag = tag;
rq->internal_tag = -1;
}
@@ -869,6 +868,10 @@ static bool blk_mq_get_driver_tag(struct request *rq,
 
rq->tag = blk_mq_get_tag();
if (rq->tag >= 0) {
+   if (blk_mq_tag_busy(data.hctx)) {
+   rq->rq_flags |= RQF_MQ_INFLIGHT;
+   atomic_inc(>nr_active);
+   }
data.hctx->tags->rqs[rq->tag] = rq;
goto done;
}

-- 
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [LSF/MM TOPIC] block level event logging for storage media management

2017-01-24 Thread Oleg Drokin

On Jan 23, 2017, at 2:27 AM, Dan Williams wrote:

> [ adding Oleg ]
> 
> On Sun, Jan 22, 2017 at 10:00 PM, Song Liu  wrote:
>> Hi Dan,
>> 
>> I think the the block level event log is more like log only system. When en 
>> event
>> happens,  it is not necessary to take immediate action. (I guess this is 
>> different
>> to bad block list?).
>> 
>> I would hope the event log to track more information. Some of these 
>> individual
>> event may not be very interesting, for example, soft error or latency 
>> outliers.
>> However, when we gather event log for a fleet of devices, these "soft event"
>> may become valuable for health monitoring.
> 
> I'd be interested in this. It sounds like you're trying to fill a gap
> between tracing and console log messages which I believe others have
> encountered as well.

We have a somewhat similar problem problem in Lustre and I guess it's not just 
Lustre.
Currently there are all sorts of conditional debug code all over the place that 
goes
to the console and when you enable it for anything verbose, you quickly overflow
your dmesg buffer no matter the size, that might be mostly ok for local
"block level" stuff, but once you become distributed, it start to be a mess
and once you get to be super large it worsens even more since you need to
somehow coordinate data from multiple nodes, ensure all of it is not lost and 
still
you don't end up using a lot of it since only a few nodes end up being useful.
(I don't know how NFS people manage to debug complicated issues using just this,
could not be super easy).

Having some sort of a buffer of a (potentially very) large size that could be
storing the data until it's needed, or eagerly polled by some daemon for storage
(helpful when you expect a lot of data that definitely won't fit in RAM).

Tracepoints have the buffer and the daemon, but creating new messages is
very cumbersome, so converting every debug message into one does not look very 
feasible.
Also it's convenient to have "event masks" one want logged that I don't think 
you could
do with tracepoints.

I know you were talking about reporting events to the block layer, but other 
than plain
errors what would block layer do with them? just a convenient way to map 
messages
to a particular device? You don't plan to store it on some block device as part
of the block layer, right?

Implementing such a buffer all sorts of additional generic data might be
collected automatically for all events as part of the buffer format like
what cpu did emit it, time, stack usage information, current pid,
backtrace (tracepoint-alike could be optional), actual source code location of
the message, …

Having something like that being standard part of {dev,pr}_{dbg,warn,...} and 
friends
would be super awesome too, I imagine (adding Greg to CC for that).

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems

2017-01-24 Thread Vishal Verma
On 01/24, Jan Kara wrote:
> On Fri 20-01-17 07:42:09, Dan Williams wrote:
> > On Fri, Jan 20, 2017 at 1:47 AM, Jan Kara  wrote:
> > > On Thu 19-01-17 14:17:19, Vishal Verma wrote:
> > >> On 01/18, Jan Kara wrote:
> > >> > On Tue 17-01-17 15:37:05, Vishal Verma wrote:
> > >> > 2) PMEM is exposed for DAX aware filesystem. This seems to be what you 
> > >> > are
> > >> > mostly interested in. We could possibly do something more efficient 
> > >> > than
> > >> > what NVDIMM driver does however the complexity would be relatively 
> > >> > high and
> > >> > frankly I'm far from convinced this is really worth it. If there are so
> > >> > many badblocks this would matter, the HW has IMHO bigger problems than
> > >> > performance.
> > >>
> > >> Correct, and Dave was of the opinion that once at least XFS has reverse
> > >> mapping support (which it does now), adding badblocks information to
> > >> that should not be a hard lift, and should be a better solution. I
> > >> suppose should try to benchmark how much of a penalty the current 
> > >> badblock
> > >> checking in the NVVDIMM driver imposes. The penalty is not because there
> > >> may be a large number of badblocks, but just due to the fact that we
> > >> have to do this check for every IO, in fact, every 'bvec' in a bio.
> > >
> > > Well, letting filesystem know is certainly good from error reporting 
> > > quality
> > > POV. I guess I'll leave it upto XFS guys to tell whether they can be more
> > > efficient in checking whether current IO overlaps with any of given bad
> > > blocks.
> > >
> > >> > Now my question: Why do we bother with badblocks at all? In cases 1) 
> > >> > and 2)
> > >> > if the platform can recover from MCE, we can just always access 
> > >> > persistent
> > >> > memory using memcpy_mcsafe(), if that fails, return -EIO. Actually that
> > >> > seems to already happen so we just need to make sure all places handle
> > >> > returned errors properly (e.g. fs/dax.c does not seem to) and we are 
> > >> > done.
> > >> > No need for bad blocks list at all, no slow down unless we hit a bad 
> > >> > cell
> > >> > and in that case who cares about performance when the data is gone...
> > >>
> > >> Even when we have MCE recovery, we cannot do away with the badblocks
> > >> list:
> > >> 1. My understanding is that the hardware's ability to do MCE recovery is
> > >> limited/best-effort, and is not guaranteed. There can be circumstances
> > >> that cause a "Processor Context Corrupt" state, which is unrecoverable.
> > >
> > > Well, then they have to work on improving the hardware. Because having HW
> > > that just sometimes gets stuck instead of reporting bad storage is simply
> > > not acceptable. And no matter how hard you try you cannot avoid MCEs from
> > > OS when accessing persistent memory so OS just has no way to avoid that
> > > risk.
> > >
> > >> 2. We still need to maintain a badblocks list so that we know what
> > >> blocks need to be cleared (via the ACPI method) on writes.
> > >
> > > Well, why cannot we just do the write, see whether we got CMCI and if yes,
> > > clear the error via the ACPI method?
> > 
> > I would need to check if you get the address reported in the CMCI, but
> > it would only fire if the write triggered a read-modify-write cycle. I
> > suspect most copies to pmem, through something like
> > arch_memcpy_to_pmem(), are not triggering any reads. It also triggers
> > asynchronously, so what data do you write after clearing the error?
> > There may have been more writes while the CMCI was being delivered.
> 
> OK, I see. And if we just write new data but don't clear error on write
> through the ACPI method, will we still get MCE on following read of that
> data? But regardless whether we get MCE or not, I suppose that the memory
> location will be still marked as bad in some ACPI table, won't it?

Correct, the location will continue to result in MCEs on reads if it isn't
marked as clear explicitly. I'm not sure that there is an ACPI table
that keeps a list of bad locations, it is just a poison bit in the cache
line, and presumable DIMMs will have some internal data structures that
also mark bad locations.

> 
>   Honza
> -- 
> Jan Kara 
> SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] queue stall with blk-mq-sched

2017-01-24 Thread Jens Axboe
On 01/24/2017 11:49 AM, Hannes Reinecke wrote:
> On 01/24/2017 05:09 PM, Jens Axboe wrote:
>> On 01/24/2017 08:54 AM, Hannes Reinecke wrote:
>>> Hi Jens,
>>>
>>> I'm trying to debug a queue stall with your blk-mq-sched branch; with my
>>> latest mpt3sas patches fio stops basically directly after starting a
>>> sequential read :-(
>>>
>>> I've debugged things and came up with the attached patch; we need to
>>> restart waiters with blk_mq_tag_idle() after completing a tag.
>>> We're already calling blk_mq_tag_busy() when fetching a tag, so I think
>>> calling blk_mq_tag_idle() is required when retiring a tag.
>>
>> The patch isn't correct, the whole point of the un-idling is that it
>> ISN'T happening for every request completion. Otherwise you throw
>> away scalability. So a queue will go into active mode on the first
>> request, and idle when it's been idle for a bit. The active count
>> is used to divide up the tags.
>>
>> So I'm assuming we're missing a queue run somewhere when we fail
>> getting a driver tag. The latter should only happen if the target
>> has IO in flight already, and the restart marking should take care
>> of it. Obviously there's a case where that is not true, since you
>> are seeing stalls.
>>
> But what is the point in the 'blk_mq_tag_busy()' thingie then?
> When will it be reset?
> The only instances I've seen is that it'll be getting reset during 
> resize and teardown ... hence my patch.

The point is to have some count of how many queues are busy "lately",
which helps in dividing up the tags fairly. Hence we bump it as soon as
the queue goes active, and drop it after some delay. That's working as
expected.

>>> However, even with the attached patch I'm seeing some queue stalls;
>>> looks like they're related to the 'stonewall' statement in fio.
>>
>> I think you are heading down the wrong path. Your patch will cause
>> the symptoms to be a bit different, but you'll still run into cases
>> where we fail giving out the tag and then stall.
>>
> Hehe.
> How did you know that?

My crystal ball :-)

> That's indeed what I'm seeing.
> 
> Oh well, back to the drawing board...

Try this patch. We only want to bump it for the driver tags, not the
scheduler side.


diff --git a/block/blk-mq.c b/block/blk-mq.c
index ee69e5e..c905aa1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -230,15 +230,15 @@ struct request *__blk_mq_alloc_request(struct 
blk_mq_alloc_data *data,
 
rq = tags->static_rqs[tag];
 
-   if (blk_mq_tag_busy(data->hctx)) {
-   rq->rq_flags = RQF_MQ_INFLIGHT;
-   atomic_inc(>hctx->nr_active);
-   }
-
if (data->flags & BLK_MQ_REQ_INTERNAL) {
rq->tag = -1;
rq->internal_tag = tag;
} else {
+   if (blk_mq_tag_busy(data->hctx)) {
+   rq->rq_flags = RQF_MQ_INFLIGHT;
+   atomic_inc(>hctx->nr_active);
+   }
+
rq->tag = tag;
rq->internal_tag = -1;
}
@@ -870,6 +870,10 @@ static bool blk_mq_get_driver_tag(struct request *rq,
rq->tag = blk_mq_get_tag();
if (rq->tag >= 0) {
data.hctx->tags->rqs[rq->tag] = rq;
+   if (blk_mq_tag_busy(data.hctx)) {
+   rq->rq_flags |= RQF_MQ_INFLIGHT;
+   atomic_inc(>nr_active);
+   }
goto done;
}
 

-- 
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 05/16] dm: always defer request allocation to the owner of the request_queue

2017-01-24 Thread Christoph Hellwig
On Tue, Jan 24, 2017 at 11:39:51AM -0500, Mike Snitzer wrote:
> Fair enough.  Cc'ing Junichi just in case he sees anything we're
> missing.

Thanks, I'll wait for his repsonse before reposting with the few accumulated
changes (including the dm cleanup).
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] queue stall with blk-mq-sched

2017-01-24 Thread Hannes Reinecke

On 01/24/2017 05:09 PM, Jens Axboe wrote:

On 01/24/2017 08:54 AM, Hannes Reinecke wrote:

Hi Jens,

I'm trying to debug a queue stall with your blk-mq-sched branch; with my
latest mpt3sas patches fio stops basically directly after starting a
sequential read :-(

I've debugged things and came up with the attached patch; we need to
restart waiters with blk_mq_tag_idle() after completing a tag.
We're already calling blk_mq_tag_busy() when fetching a tag, so I think
calling blk_mq_tag_idle() is required when retiring a tag.


The patch isn't correct, the whole point of the un-idling is that it
ISN'T happening for every request completion. Otherwise you throw
away scalability. So a queue will go into active mode on the first
request, and idle when it's been idle for a bit. The active count
is used to divide up the tags.

So I'm assuming we're missing a queue run somewhere when we fail
getting a driver tag. The latter should only happen if the target
has IO in flight already, and the restart marking should take care
of it. Obviously there's a case where that is not true, since you
are seeing stalls.


But what is the point in the 'blk_mq_tag_busy()' thingie then?
When will it be reset?
The only instances I've seen is that it'll be getting reset during 
resize and teardown ... hence my patch.



However, even with the attached patch I'm seeing some queue stalls;
looks like they're related to the 'stonewall' statement in fio.


I think you are heading down the wrong path. Your patch will cause
the symptoms to be a bit different, but you'll still run into cases
where we fail giving out the tag and then stall.


Hehe.
How did you know that?

That's indeed what I'm seeing.

Oh well, back to the drawing board...

Cheers,

Hannes
--
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] queue stall with blk-mq-sched

2017-01-24 Thread Hannes Reinecke

On 01/24/2017 05:03 PM, Jens Axboe wrote:

On 01/24/2017 08:54 AM, Hannes Reinecke wrote:

Hi Jens,

I'm trying to debug a queue stall with your blk-mq-sched branch; with my
latest mpt3sas patches fio stops basically directly after starting a
sequential read :-(

I've debugged things and came up with the attached patch; we need to
restart waiters with blk_mq_tag_idle() after completing a tag.
We're already calling blk_mq_tag_busy() when fetching a tag, so I think
calling blk_mq_tag_idle() is required when retiring a tag.


I'll take a look at this. It sounds like all your grief is related to
shared tag maps, which I don't have anything that uses here. I'll see
if we are leaking it, you should be able to check that by reading the
'active' file in the sysfs directory.


Ah. That'll explain it.

Basically _all_ my HBAs I'm testing with have shared tag maps :-(

Cheers,

Hannes
--
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 05/16] dm: always defer request allocation to the owner of the request_queue

2017-01-24 Thread Mike Snitzer
On Tue, Jan 24 2017 at  9:20am -0500,
Christoph Hellwig  wrote:

> On Tue, Jan 24, 2017 at 05:05:39AM -0500, Mike Snitzer wrote:
> > possible and is welcomed cleanup.  The only concern I have is that using
> > get_request() for the old request_fn request_queue eliminates the
> > guaranteed availability of requests to allow for forward progress (on
> > path failure or for the purposes of swap over mpath, etc).  This isn't a
> > concern for blk-mq because as you know we have a fixed set of tags (and
> > associated preallocated requests).
> > 
> > So I'm left unconvinced old request_fn request-based DM multipath isn't
> > regressing in how request resubmission can be assured a request will be
> > available when needed on retry in the face of path failure.
> 
> Mempool only need a size where we can make guaranteed requests, so for
> get_request based drivers under dm the theoretical minimum size would be
> one as we never rely on a second request to finish the first one,
> and each request_queue has it's own mempool(s) to start with.

Fair enough.  Cc'ing Junichi just in case he sees anything we're
missing.
 
> > dm_mod's 'reserved_rq_based_ios' module_param governs the minimum number
> > of requests in the md->rq_pool (and defaults to 256 requests per
> > request-based DM request_queue).  Whereas blk_init_rl()'s
> > mempool_create_node() uses BLKDEV_MIN_RQ (4) yet q->nr_requests =
> > BLKDEV_MAX_RQ (128).  Also, this patch eliminates the utility of
> > 'reserved_rq_based_ios' module_param without actually removing it.
> 
> It's still used for the bio pool, so I couldn't simply remove it.

Ah, yeah, you're talking about:
pools->bs = bioset_create_nobvec(pool_size, front_pad);

Makes sense.

> > Anyway, should blk-core evolve to allow drivers to specify a custom
> > min_nr of requests in the old request_fn request_queue's mempool?  Or is
> > my concern overblown?
> 
> Thanks to mempool_resize we could add that functionality.  I just don't
> see an actual need for it.
> 
> > p.s. dm.c:dm_alloc_md_mempools() could be cleaned up a bit more since
> > only bio-based DM will have a pools->io_pool moving forward; but I can
> > circle back to that cleanup after.
> 
> If you're fine with doing more than the necessary changes in a patch
> that needs to got into the block tree I'd be happy to apply my usual
> cleanup magic to it.

I think this one example illustrates cleanup that makes sense even if
going through block.  While in there it is best to not leave extra
branching that just doesn't make sense to have anymore.  So if you do
another version of this patchset feel free to move this direct into
dm_alloc_md_mempools()'s BIO_BASED case:
pools->io_pool = mempool_create_slab_pool(pool_size, _io_cache);
And eliminate the cachep pointer.

But if not that is fine too.

Also:

Reviewed-by: Mike Snitzer 
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 15/16] block: split scsi_request out of struct request

2017-01-24 Thread Bart Van Assche
On Tue, 2017-01-24 at 09:09 +0100, h...@lst.de wrote:
> On Tue, Jan 24, 2017 at 12:33:13AM +, Bart Van Assche wrote:
> > Do we perhaps need a check before the above memcpy() call whether or not
> > sense == NULL?
> 
> Yes, probably.  I didn't think of the case where the caller wouldn't
> pass a sense buffer.  Just curious, what's the callstack that caused
> this?

Hello Christoph,

The call stack was as follows:
* __scsi_execute()
* scsi_execute_req_flags()
* scsi_vpd_inquiry()
* scsi_attach_vpd()
* scsi_probe_and_add_lun()
* __scsi_add_device()
* ata_scsi_scan_host()
* async_port_probe()
* async_run_entry_fn()
* process_one_work()
* worker_thread()
* kthread()

Bart.--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] queue stall with blk-mq-sched

2017-01-24 Thread Jens Axboe
On 01/24/2017 08:54 AM, Hannes Reinecke wrote:
> Hi Jens,
> 
> I'm trying to debug a queue stall with your blk-mq-sched branch; with my
> latest mpt3sas patches fio stops basically directly after starting a
> sequential read :-(
> 
> I've debugged things and came up with the attached patch; we need to
> restart waiters with blk_mq_tag_idle() after completing a tag.
> We're already calling blk_mq_tag_busy() when fetching a tag, so I think
> calling blk_mq_tag_idle() is required when retiring a tag.

The patch isn't correct, the whole point of the un-idling is that it
ISN'T happening for every request completion. Otherwise you throw
away scalability. So a queue will go into active mode on the first
request, and idle when it's been idle for a bit. The active count
is used to divide up the tags.

So I'm assuming we're missing a queue run somewhere when we fail
getting a driver tag. The latter should only happen if the target
has IO in flight already, and the restart marking should take care
of it. Obviously there's a case where that is not true, since you
are seeing stalls.

> However, even with the attached patch I'm seeing some queue stalls;
> looks like they're related to the 'stonewall' statement in fio.

I think you are heading down the wrong path. Your patch will cause
the symptoms to be a bit different, but you'll still run into cases
where we fail giving out the tag and then stall.

-- 
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] queue stall with blk-mq-sched

2017-01-24 Thread Jens Axboe
On 01/24/2017 08:54 AM, Hannes Reinecke wrote:
> Hi Jens,
> 
> I'm trying to debug a queue stall with your blk-mq-sched branch; with my
> latest mpt3sas patches fio stops basically directly after starting a
> sequential read :-(
> 
> I've debugged things and came up with the attached patch; we need to
> restart waiters with blk_mq_tag_idle() after completing a tag.
> We're already calling blk_mq_tag_busy() when fetching a tag, so I think
> calling blk_mq_tag_idle() is required when retiring a tag.

I'll take a look at this. It sounds like all your grief is related to
shared tag maps, which I don't have anything that uses here. I'll see
if we are leaking it, you should be able to check that by reading the
'active' file in the sysfs directory.

-- 
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] queue stall with blk-mq-sched

2017-01-24 Thread Hannes Reinecke
Hi Jens,

I'm trying to debug a queue stall with your blk-mq-sched branch; with my
latest mpt3sas patches fio stops basically directly after starting a
sequential read :-(

I've debugged things and came up with the attached patch; we need to
restart waiters with blk_mq_tag_idle() after completing a tag.
We're already calling blk_mq_tag_busy() when fetching a tag, so I think
calling blk_mq_tag_idle() is required when retiring a tag.

However, even with the attached patch I'm seeing some queue stalls;
looks like they're related to the 'stonewall' statement in fio.

Debugging continues.

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
From 82b15ff40d71aed318f9946881825f9f03ef8f48 Mon Sep 17 00:00:00 2001
From: Hannes Reinecke 
Date: Tue, 24 Jan 2017 14:43:09 +0100
Subject: [PATCH] block-mq: fixup queue stall

__blk_mq_alloc_request() calls blk_mq_tag_busy(), which might result
in the queue to become blocked. So we need to call blk_mq_tag_idle()
once the tag is finished to wakeup all waiters on the queue.

Patch is relative to the blk-mq-sched branch
 
Signed-off-by: Hannes Reinecke 
---
 block/blk-mq.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 739a292..d52bcb1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -333,10 +333,12 @@ void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 {
 	const int sched_tag = rq->internal_tag;
 	struct request_queue *q = rq->q;
+	bool unbusy = false;
 
-	if (rq->rq_flags & RQF_MQ_INFLIGHT)
+	if (rq->rq_flags & RQF_MQ_INFLIGHT) {
 		atomic_dec(>nr_active);
-
+		unbusy = true;
+	}
 	wbt_done(q->rq_wb, >issue_stat);
 	rq->rq_flags = 0;
 
@@ -346,6 +348,9 @@ void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
 	if (sched_tag != -1)
 		blk_mq_sched_completed_request(hctx, rq);
+	if (unbusy)
+		blk_mq_tag_idle(hctx);
+
 	blk_queue_exit(q);
 }
 
-- 
1.8.5.6



Re: [PATCH 01/16] block: fix elevator init check

2017-01-24 Thread Christoph Hellwig
On Tue, Jan 24, 2017 at 08:06:39AM -0700, Jens Axboe wrote:
> We check for REQ_PREFLUSH | REQ_FUA in so many places though, might not
> be a bad idea to keep the helper but just make it take the opf and fix
> up the other locations too. The fact that PREFLUSH|FUA is the magic to
> check for is somewhat tribal knowledge.

I'll see if I can come up with something sensible.  The current
helper using the bio and the magic 0 value is not exactly helpful.
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 01/16] block: fix elevator init check

2017-01-24 Thread Jens Axboe
On Mon, Jan 23 2017, Christoph Hellwig wrote:
> We can't initalize the elevator fields for flushes as flush share space
> in struct request with the elevator data.  But currently we can't
> commnicate that a request is a flush through blk_get_request as we
> can only pass READ or WRITE, and the low-level code looks at the
> possible NULL bio to check for a flush.
> 
> Fix this by allowing to pass any block op and flags, and by checking for
> the flush flags in __get_request.

We check for REQ_PREFLUSH | REQ_FUA in so many places though, might not
be a bad idea to keep the helper but just make it take the opf and fix
up the other locations too. The fact that PREFLUSH|FUA is the magic to
check for is somewhat tribal knowledge.

-- 
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] blk-mq-sched: fix possible crash if changing scheduler fails

2017-01-24 Thread Jens Axboe
On 01/23/2017 04:47 PM, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> In elevator_switch(), we free the old scheduler's tags and then
> initialize the new scheduler. If initializing the new scheduler fails,
> we fall back to the old scheduler, but our tags have already been freed.
> There's no reason to free the sched_tags only to reallocate them, so
> let's just reuse them, which makes it possible to safely fall back to
> the old scheduler.

Do we need a failure case on both ends? We never free the driver
tags, so it's always perfectly safe to fall back to not running
with a scheduler.

Since we were talking about making the scheduler depth configurable
or dependent on the scheduler, reusing tags seems like something that
would potentially be a short lived "feature".

So I think I'd prefer if we teardown/setup like we do now, and just
have the failure case be falling back to "none".


-- 
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] block: fix use after free in __blkdev_direct_IO

2017-01-24 Thread Jens Axboe
On 01/24/2017 06:50 AM, Christoph Hellwig wrote:
> We can't dereference the dio structure after submitting the last bio for
> this request, as I/O completion might have happened before the code is
> run. Introduce a local is_sync variable instead.

Applied.

-- 
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 05/16] dm: always defer request allocation to the owner of the request_queue

2017-01-24 Thread Christoph Hellwig
On Tue, Jan 24, 2017 at 05:05:39AM -0500, Mike Snitzer wrote:
> possible and is welcomed cleanup.  The only concern I have is that using
> get_request() for the old request_fn request_queue eliminates the
> guaranteed availability of requests to allow for forward progress (on
> path failure or for the purposes of swap over mpath, etc).  This isn't a
> concern for blk-mq because as you know we have a fixed set of tags (and
> associated preallocated requests).
> 
> So I'm left unconvinced old request_fn request-based DM multipath isn't
> regressing in how request resubmission can be assured a request will be
> available when needed on retry in the face of path failure.

Mempool only need a size where we can make guaranteed requests, so for
get_request based drivers under dm the theoretical minimum size would be
one as we never rely on a second request to finish the first one,
and each request_queue has it's own mempool(s) to start with.

> dm_mod's 'reserved_rq_based_ios' module_param governs the minimum number
> of requests in the md->rq_pool (and defaults to 256 requests per
> request-based DM request_queue).  Whereas blk_init_rl()'s
> mempool_create_node() uses BLKDEV_MIN_RQ (4) yet q->nr_requests =
> BLKDEV_MAX_RQ (128).  Also, this patch eliminates the utility of
> 'reserved_rq_based_ios' module_param without actually removing it.

It's still used for the bio pool, so I couldn't simply remove it.

> Anyway, should blk-core evolve to allow drivers to specify a custom
> min_nr of requests in the old request_fn request_queue's mempool?  Or is
> my concern overblown?

Thanks to mempool_resize we could add that functionality.  I just don't
see an actual need for it.

> p.s. dm.c:dm_alloc_md_mempools() could be cleaned up a bit more since
> only bio-based DM will have a pools->io_pool moving forward; but I can
> circle back to that cleanup after.

If you're fine with doing more than the necessary changes in a patch
that needs to got into the block tree I'd be happy to apply my usual
cleanup magic to it.
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] block: fix use after free in __blkdev_direct_IO

2017-01-24 Thread Christoph Hellwig
We can't dereference the dio structure after submitting the last bio for
this request, as I/O completion might have happened before the code is
run. Introduce a local is_sync variable instead.

Fixes: 542ff7bf ("block: new direct I/O implementation")
Signed-off-by: Christoph Hellwig 
Reported-by: Matias Bjørling 
Tested-by: Matias Bjørling 
---
 fs/block_dev.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 5db5d13..3c47614 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -331,7 +331,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter 
*iter, int nr_pages)
struct blk_plug plug;
struct blkdev_dio *dio;
struct bio *bio;
-   bool is_read = (iov_iter_rw(iter) == READ);
+   bool is_read = (iov_iter_rw(iter) == READ), is_sync;
loff_t pos = iocb->ki_pos;
blk_qc_t qc = BLK_QC_T_NONE;
int ret;
@@ -344,7 +344,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter 
*iter, int nr_pages)
bio_get(bio); /* extra ref for the completion handler */
 
dio = container_of(bio, struct blkdev_dio, bio);
-   dio->is_sync = is_sync_kiocb(iocb);
+   dio->is_sync = is_sync = is_sync_kiocb(iocb);
if (dio->is_sync)
dio->waiter = current;
else
@@ -398,7 +398,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter 
*iter, int nr_pages)
}
blk_finish_plug();
 
-   if (!dio->is_sync)
+   if (!is_sync)
return -EIOCBQUEUED;
 
for (;;) {
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUG: KASAN: Use-after-free

2017-01-24 Thread Matias Bjørling
On 01/24/2017 02:34 PM, Christoph Hellwig wrote:
> On Tue, Jan 24, 2017 at 11:01:42AM +0100, Matias Bjørling wrote:
>> Yup. That fixes it. Should I put together the patch, or will you take
>> care of it?
> 
> I'll send it out.  Of course with proper reporting credits for you.
> 

Awesome. Thanks
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUG: KASAN: Use-after-free

2017-01-24 Thread Christoph Hellwig
On Tue, Jan 24, 2017 at 11:01:42AM +0100, Matias Bjørling wrote:
> Yup. That fixes it. Should I put together the patch, or will you take
> care of it?

I'll send it out.  Of course with proper reporting credits for you.
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 10/10] blk-mq: move hctx and ctx counters from sysfs to debugfs

2017-01-24 Thread Hannes Reinecke
On 01/23/2017 07:59 PM, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> These counters aren't as out-of-place in sysfs as the other stuff, but
> debugfs is a slightly better home for them.
> 
> Signed-off-by: Omar Sandoval 
> ---
>  block/blk-mq-debugfs.c | 181 
> +
>  block/blk-mq-sysfs.c   |  64 -
>  2 files changed, 181 insertions(+), 64 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 09/10] blk-mq: move hctx io_poll, stats, and dispatched from sysfs to debugfs

2017-01-24 Thread Hannes Reinecke
On 01/23/2017 07:59 PM, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> These statistics _might_ be useful to userspace, but it's better not to
> commit to an ABI for these yet. Also, the dispatched file in sysfs
> couldn't be cleared, so make it clearable like the others in debugfs.
> 
> Signed-off-by: Omar Sandoval 
> ---
>  block/blk-mq-debugfs.c | 132 
> +
>  block/blk-mq-sysfs.c   |  92 --
>  2 files changed, 132 insertions(+), 92 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 08/10] blk-mq: add tags and sched_tags bitmaps to debugfs

2017-01-24 Thread Hannes Reinecke
On 01/23/2017 07:59 PM, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> These can be used to debug issues like tag leaks and stuck requests.
> 
> Signed-off-by: Omar Sandoval 
> ---
>  block/blk-mq-debugfs.c | 50 
> ++
>  1 file changed, 50 insertions(+)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 07/10] blk-mq: move tags and sched_tags info from sysfs to debugfs

2017-01-24 Thread Hannes Reinecke
On 01/23/2017 07:59 PM, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> These are very tied to the blk-mq tag implementation, so exposing them
> to sysfs isn't a great idea. Move the debugging information to debugfs
> and add basic entries for the number of tags and the number of reserved
> tags to sysfs.
> 
> Signed-off-by: Omar Sandoval 
> ---
>  block/blk-mq-debugfs.c | 70 
> ++
>  block/blk-mq-sysfs.c   | 33 
>  block/blk-mq-tag.c | 27 ---
>  block/blk-mq-tag.h |  1 -
>  4 files changed, 86 insertions(+), 45 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 05/10] sbitmap: add helpers for dumping to a seq_file

2017-01-24 Thread Hannes Reinecke
On 01/23/2017 07:59 PM, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> This is useful debugging information that will be used in the blk-mq
> debugfs directory.
> 
> Signed-off-by: Omar Sandoval 
> ---
>  include/linux/sbitmap.h | 34 
>  lib/sbitmap.c   | 83 
> +
>  2 files changed, 117 insertions(+)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 06/10] blk-mq: export software queue pending map to debugfs

2017-01-24 Thread Hannes Reinecke
On 01/23/2017 07:59 PM, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> This is useful for debugging problems where we've gotten stuck with
> requests in the software queues.
> 
> Signed-off-by: Omar Sandoval 
> ---
>  block/blk-mq-debugfs.c | 15 +++
>  1 file changed, 15 insertions(+)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 04/10] blk-mq: add extra request information to debugfs

2017-01-24 Thread Hannes Reinecke
On 01/23/2017 07:59 PM, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> The request pointers by themselves aren't super useful.
> 
> Signed-off-by: Omar Sandoval 
> ---
>  block/blk-mq-debugfs.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 1967c06d80e0..ee947f2f605b 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -87,7 +87,9 @@ static int blk_mq_debugfs_rq_show(struct seq_file *m, void 
> *v)
>  {
>   struct request *rq = list_entry_rq(v);
>  
> - seq_printf(m, "%p\n", rq);
> + seq_printf(m, "%p {.cmd_type=%u, .cmd_flags=0x%x, .rq_flags=0x%x, 
> .tag=%d, .internal_tag=%d}\n",
> +rq, rq->cmd_type, rq->cmd_flags, (unsigned int)rq->rq_flags,
> +rq->tag, rq->internal_tag);
>   return 0;
>  }
>  
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/10] blk-mq: move hctx->dispatch and ctx->rq_list from sysfs to debugfs

2017-01-24 Thread Hannes Reinecke
On 01/23/2017 07:59 PM, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> These lists are only useful for debugging; they definitely don't belong
> in sysfs. Putting them in debugfs also removes the limitation of a
> single page of output.
> 
> Signed-off-by: Omar Sandoval 
> ---
>  block/blk-mq-debugfs.c | 106 
> +
>  block/blk-mq-sysfs.c   |  57 --
>  2 files changed, 106 insertions(+), 57 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/10] blk-mq: add hctx->{state,flags} to debugfs

2017-01-24 Thread Hannes Reinecke
On 01/23/2017 07:59 PM, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> hctx->state could come in handy for bugs where the hardware queue gets
> stuck in the stopped state, and hctx->flags is just useful to know.
> 
> Signed-off-by: Omar Sandoval 
> ---
>  block/blk-mq-debugfs.c | 42 ++
>  1 file changed, 42 insertions(+)
> 
Hehe. I've found myself adding exactly the same attributes when
debugging mq-deadline :-)

> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 01711bbf5ade..0c14511fa9e0 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -29,7 +29,49 @@ struct blk_mq_debugfs_attr {
>  
>  static struct dentry *block_debugfs_root;
>  
> +static int hctx_state_show(struct seq_file *m, void *v)
> +{
> + struct blk_mq_hw_ctx *hctx = m->private;
> +
> + seq_printf(m, "0x%lx\n", hctx->state);
> + return 0;
> +}
> +
What about decoding it?
Would make life so much easier, and one doesn't have to have a kernel
source tree available when looking things up...

> +static int hctx_state_open(struct inode *inode, struct file *file)
> +{
> + return single_open(file, hctx_state_show, inode->i_private);
> +}
> +
> +static const struct file_operations hctx_state_fops = {
> + .open   = hctx_state_open,
> + .read   = seq_read,
> + .llseek = seq_lseek,
> + .release= single_release,
> +};
> +
> +static int hctx_flags_show(struct seq_file *m, void *v)
> +{
> + struct blk_mq_hw_ctx *hctx = m->private;
> +
> + seq_printf(m, "0x%lx\n", hctx->flags);
> + return 0;
> +}
> +
Same argument here.

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 01/10] blk-mq: create debugfs directory tree

2017-01-24 Thread Hannes Reinecke
On 01/23/2017 07:59 PM, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> In preparation for putting blk-mq debugging information in debugfs,
> create a directory tree mirroring the one in sysfs:
> 
> # tree -d /sys/kernel/debug/block
> /sys/kernel/debug/block
> |-- nvme0n1
> |   `-- mq
> |   |-- 0
> |   |   `-- cpu0
> |   |-- 1
> |   |   `-- cpu1
> |   |-- 2
> |   |   `-- cpu2
> |   `-- 3
> |   `-- cpu3
> `-- vda
> `-- mq
> `-- 0
> |-- cpu0
> |-- cpu1
> |-- cpu2
> `-- cpu3
> 
> Also add the scaffolding for the actual files that will go in here,
> either under the hardware queue or software queue directories.
> 
> Signed-off-by: Omar Sandoval 
> ---
>  block/Makefile |   1 +
>  block/blk-mq-debugfs.c | 152 
> +
>  block/blk-mq-sysfs.c   |   8 +++
>  block/blk-mq.c |   2 +
>  block/blk-mq.h |  33 +++
>  include/linux/blkdev.h |   5 ++
>  6 files changed, 201 insertions(+)
>  create mode 100644 block/blk-mq-debugfs.c
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 16/16] block: don't assign cmd_flags in __blk_rq_prep_clone

2017-01-24 Thread Hannes Reinecke
On 01/23/2017 04:29 PM, Christoph Hellwig wrote:
> These days we have the proper flags set since request allocation time.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  block/blk-core.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index fe5cc98..93a9e0b 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2983,7 +2983,6 @@ EXPORT_SYMBOL_GPL(blk_rq_unprep_clone);
>  static void __blk_rq_prep_clone(struct request *dst, struct request *src)
>  {
>   dst->cpu = src->cpu;
> - dst->cmd_flags = src->cmd_flags | REQ_NOMERGE;
>   dst->cmd_type = src->cmd_type;
>   dst->__sector = blk_rq_pos(src);
>   dst->__data_len = blk_rq_bytes(src);
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 15/16] block: split scsi_request out of struct request

2017-01-24 Thread Hannes Reinecke
On 01/23/2017 04:29 PM, Christoph Hellwig wrote:
> And require all drivers that want to support BLOCK_PC to allocate it
> as the first thing of their private data.  To support this the legacy
> IDE and BSG code is switched to set cmd_size on their queues to let
> the block layer allocate the additional space.
> 
> Signed-off-by: Christoph Hellwig 
> ---
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 13/16] scsi: allocate scsi_cmnd structures as part of struct request

2017-01-24 Thread Hannes Reinecke
On 01/23/2017 04:29 PM, Christoph Hellwig wrote:
> Rely on the new block layer functionality to allocate additional driver
> specific data behind struct request instead of implementing it in SCSI
> itѕelf.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/scsi/hosts.c  |  20 +--
>  drivers/scsi/scsi.c   | 319 
> --
>  drivers/scsi/scsi_error.c |  17 ++-
>  drivers/scsi/scsi_lib.c   | 122 --
>  drivers/scsi/scsi_priv.h  |   8 +-
>  include/scsi/scsi_host.h  |   3 -
>  6 files changed, 95 insertions(+), 394 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 09/16] scsi: remove gfp_flags member in scsi_host_cmd_pool

2017-01-24 Thread Johannes Thumshirn
On Mon, Jan 23, 2017 at 04:29:14PM +0100, Christoph Hellwig wrote:
> When using the slab allocator we already decide at cache creation time if
> an allocation comes from a GFP_DMA pool using the SLAB_CACHE_DMA flag,
> and there is no point passing the kmalloc-family only GFP_DMA flag to
> kmem_cache_alloc.  Drop all the infrastructure for doing so.
> 
> Signed-off-by: Christoph Hellwig 
> ---

Reviewed-by: Johannes Thumshirn 

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] lightnvm: use end_io callback instead of instance

2017-01-24 Thread Matias Bjørling
When the lightnvm core had the "gennvm" layer between the device and the
target, there was a need for the core to be able to figure out which
target it should send an end_io callback to. Leading to a "double"
end_io, first for the media manager instance, and then for the target
instance. Now that core and gennvm is merged, there is no longer a need
for this, and a single end_io callback will do.

Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/core.c  |  5 ++---
 drivers/lightnvm/rrpc.c  | 11 +--
 drivers/lightnvm/rrpc.h  |  3 ---
 include/linux/lightnvm.h | 11 +++
 4 files changed, 10 insertions(+), 20 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 80cd767..4abd334 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -776,14 +776,13 @@ EXPORT_SYMBOL(nvm_free_rqd_ppalist);
 void nvm_end_io(struct nvm_rq *rqd, int error)
 {
struct nvm_tgt_dev *tgt_dev = rqd->dev;
-   struct nvm_tgt_instance *ins = rqd->ins;
 
/* Convert address space */
if (tgt_dev)
nvm_rq_dev_to_tgt(tgt_dev, rqd);
 
-   rqd->error = error;
-   ins->tt->end_io(rqd);
+   if (rqd->end_io)
+   rqd->end_io(rqd, error);
 }
 EXPORT_SYMBOL(nvm_end_io);
 
diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c
index 9fb7de3..c399f55 100644
--- a/drivers/lightnvm/rrpc.c
+++ b/drivers/lightnvm/rrpc.c
@@ -777,16 +777,16 @@ static void rrpc_end_io_write(struct rrpc *rrpc, struct 
rrpc_rq *rrqd,
}
 }
 
-static void rrpc_end_io(struct nvm_rq *rqd)
+static void rrpc_end_io(struct nvm_rq *rqd, int error)
 {
-   struct rrpc *rrpc = container_of(rqd->ins, struct rrpc, instance);
+   struct rrpc *rrpc = rqd->private;
struct nvm_tgt_dev *dev = rrpc->dev;
struct rrpc_rq *rrqd = nvm_rq_to_pdu(rqd);
uint8_t npages = rqd->nr_ppas;
sector_t laddr = rrpc_get_laddr(rqd->bio) - npages;
 
if (bio_data_dir(rqd->bio) == WRITE) {
-   if (rqd->error == NVM_RSP_ERR_FAILWRITE)
+   if (error == NVM_RSP_ERR_FAILWRITE)
rrpc_mark_bad_block(rrpc, rqd);
 
rrpc_end_io_write(rrpc, rrqd, laddr, npages);
@@ -972,8 +972,9 @@ static int rrpc_submit_io(struct rrpc *rrpc, struct bio 
*bio,
 
bio_get(bio);
rqd->bio = bio;
-   rqd->ins = >instance;
+   rqd->private = rrpc;
rqd->nr_ppas = nr_pages;
+   rqd->end_io = rrpc_end_io;
rrq->flags = flags;
 
err = nvm_submit_io(dev, rqd);
@@ -1532,7 +1533,6 @@ static void *rrpc_init(struct nvm_tgt_dev *dev, struct 
gendisk *tdisk)
if (!rrpc)
return ERR_PTR(-ENOMEM);
 
-   rrpc->instance.tt = _rrpc;
rrpc->dev = dev;
rrpc->disk = tdisk;
 
@@ -1611,7 +1611,6 @@ static struct nvm_tgt_type tt_rrpc = {
 
.make_rq= rrpc_make_rq,
.capacity   = rrpc_capacity,
-   .end_io = rrpc_end_io,
 
.init   = rrpc_init,
.exit   = rrpc_exit,
diff --git a/drivers/lightnvm/rrpc.h b/drivers/lightnvm/rrpc.h
index 94e4d73..fdb6ff9 100644
--- a/drivers/lightnvm/rrpc.h
+++ b/drivers/lightnvm/rrpc.h
@@ -102,9 +102,6 @@ struct rrpc_lun {
 };
 
 struct rrpc {
-   /* instance must be kept in top to resolve rrpc in unprep */
-   struct nvm_tgt_instance instance;
-
struct nvm_tgt_dev *dev;
struct gendisk *disk;
 
diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index ce0b2da..f6e2376 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -213,10 +213,6 @@ struct nvm_target {
struct gendisk *disk;
 };
 
-struct nvm_tgt_instance {
-   struct nvm_tgt_type *tt;
-};
-
 #define ADDR_EMPTY (~0ULL)
 
 #define NVM_VERSION_MAJOR 1
@@ -224,10 +220,9 @@ struct nvm_tgt_instance {
 #define NVM_VERSION_PATCH 0
 
 struct nvm_rq;
-typedef void (nvm_end_io_fn)(struct nvm_rq *);
+typedef void (nvm_end_io_fn)(struct nvm_rq *, int);
 
 struct nvm_rq {
-   struct nvm_tgt_instance *ins;
struct nvm_tgt_dev *dev;
 
struct bio *bio;
@@ -250,7 +245,8 @@ struct nvm_rq {
uint16_t flags;
 
u64 ppa_status; /* ppa media status */
-   int error;
+
+   void *private;
 };
 
 static inline struct nvm_rq *nvm_rq_from_pdu(void *pdu)
@@ -450,7 +446,6 @@ struct nvm_tgt_type {
/* target entry points */
nvm_tgt_make_rq_fn *make_rq;
nvm_tgt_capacity_fn *capacity;
-   nvm_end_io_fn *end_io;
 
/* module-specific init/teardown */
nvm_tgt_init_fn *init;
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 12/16] scsi: remove __scsi_alloc_queue

2017-01-24 Thread Johannes Thumshirn
On Mon, Jan 23, 2017 at 04:29:17PM +0100, Christoph Hellwig wrote:
> Instead do an internal export of __scsi_init_queue for the transport
> classes that export BSG nodes.
> 
> Signed-off-by: Christoph Hellwig 
> ---

Looks good,
Reviewed-by: Johannes Thumshirn 

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 12/16] scsi: remove __scsi_alloc_queue

2017-01-24 Thread Hannes Reinecke
On 01/23/2017 04:29 PM, Christoph Hellwig wrote:
> Instead do an internal export of __scsi_init_queue for the transport
> classes that export BSG nodes.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/scsi/scsi_lib.c | 19 ---
>  drivers/scsi/scsi_transport_fc.c|  6 --
>  drivers/scsi/scsi_transport_iscsi.c |  3 ++-
>  include/scsi/scsi_host.h|  2 --
>  include/scsi/scsi_transport.h   |  2 ++
>  5 files changed, 12 insertions(+), 20 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 11/16] scsi: remove scsi_cmd_dma_pool

2017-01-24 Thread Hannes Reinecke
On 01/23/2017 04:29 PM, Christoph Hellwig wrote:
> There is no need for GFP_DMA allocations of the scsi_cmnd structures
> themselves, all that might be DMAed to or from is the actual payload,
> or the sense buffers.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/scsi/scsi.c | 15 +--
>  1 file changed, 1 insertion(+), 14 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 10/16] scsi: respect unchecked_isa_dma for blk-mq

2017-01-24 Thread Hannes Reinecke
On 01/23/2017 04:29 PM, Christoph Hellwig wrote:
> Currently blk-mq always allocates the sense buffer using normal GFP_KERNEL
> allocation.  Refactor the cmd pool code to split the cmd and sense allocation
> and share the code to allocate the sense buffers as well as the sense buffer
> slab caches between the legacy and blk-mq path.
> 
> Note that this switches to lazy allocation of the sense slab caches - the
> slab caches (not the actual allocations) won't be destroy until the scsi
> module is unloaded instead of keeping track of hosts using them.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/scsi/hosts.c |  4 
>  drivers/scsi/scsi.c  | 24 ---
>  drivers/scsi/scsi_lib.c  | 62 
> +---
>  drivers/scsi/scsi_priv.h |  5 
>  4 files changed, 73 insertions(+), 22 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 6/6] mmc: block: stop passing around pointless return values

2017-01-24 Thread Linus Walleij
The mmc_blk_issue_rq() function is called in exactly one place
in queue.c and there the return value is ignored. So the
functions called from that function that also meticulously
return 0/1 do so for no good reason.

Error reporting on the asynchronous requests are done upward to
the block layer when the requests are eventually completed or
fail, which may happen during the flow of the mmc_blk_issue_*
functions directly (for "special commands") or later, when an
asynchronous read/write request is completed.

The issuing functions do not give rise to errors on their own,
and there is nothing to return back to the caller in queue.c.
Drop all return values and make the function return void.

Signed-off-by: Linus Walleij 
---
 drivers/mmc/core/block.c | 38 ++
 drivers/mmc/core/block.h |  2 +-
 2 files changed, 15 insertions(+), 25 deletions(-)

diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index f3e0c778cdbd..ede759dda395 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -1149,7 +1149,7 @@ int mmc_access_rpmb(struct mmc_queue *mq)
return false;
 }
 
-static int mmc_blk_issue_discard_rq(struct mmc_queue *mq, struct request *req)
+static void mmc_blk_issue_discard_rq(struct mmc_queue *mq, struct request *req)
 {
struct mmc_blk_data *md = mq->blkdata;
struct mmc_card *card = md->queue.card;
@@ -1187,11 +1187,9 @@ static int mmc_blk_issue_discard_rq(struct mmc_queue 
*mq, struct request *req)
mmc_blk_reset_success(md, type);
 fail:
blk_end_request(req, err, blk_rq_bytes(req));
-
-   return err ? 0 : 1;
 }
 
-static int mmc_blk_issue_secdiscard_rq(struct mmc_queue *mq,
+static void mmc_blk_issue_secdiscard_rq(struct mmc_queue *mq,
   struct request *req)
 {
struct mmc_blk_data *md = mq->blkdata;
@@ -1254,11 +1252,9 @@ static int mmc_blk_issue_secdiscard_rq(struct mmc_queue 
*mq,
mmc_blk_reset_success(md, type);
 out:
blk_end_request(req, err, blk_rq_bytes(req));
-
-   return err ? 0 : 1;
 }
 
-static int mmc_blk_issue_flush(struct mmc_queue *mq, struct request *req)
+static void mmc_blk_issue_flush(struct mmc_queue *mq, struct request *req)
 {
struct mmc_blk_data *md = mq->blkdata;
struct mmc_card *card = md->queue.card;
@@ -1269,8 +1265,6 @@ static int mmc_blk_issue_flush(struct mmc_queue *mq, 
struct request *req)
ret = -EIO;
 
blk_end_request_all(req, ret);
-
-   return ret ? 0 : 1;
 }
 
 /*
@@ -1622,7 +1616,7 @@ static void mmc_blk_rw_start_new(struct mmc_queue *mq, 
struct mmc_card *card,
}
 }
 
-static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *rqc)
+static void mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *rqc)
 {
struct mmc_blk_data *md = mq->blkdata;
struct mmc_card *card = md->queue.card;
@@ -1635,7 +1629,7 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, 
struct request *rqc)
struct mmc_async_req *old_areq;
 
if (!rqc && !mq->mqrq_prev->req)
-   return 0;
+   return;
 
do {
if (rqc) {
@@ -1648,7 +1642,7 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, 
struct request *rqc)
pr_err("%s: Transfer size is not 4KB sector 
size aligned\n",
rqc->rq_disk->disk_name);
mmc_blk_rw_cmd_abort(card, rqc);
-   return 0;
+   return;
}
 
mmc_blk_rw_rq_prep(mq->mqrq_cur, card, 0, mq);
@@ -1665,7 +1659,7 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, 
struct request *rqc)
 */
if (status == MMC_BLK_NEW_REQUEST)
mq->flags |= MMC_QUEUE_NEW_REQUEST;
-   return 0;
+   return;
}
 
/*
@@ -1699,7 +1693,7 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, 
struct request *rqc)
   __func__, blk_rq_bytes(req),
   brq->data.bytes_xfered);
mmc_blk_rw_cmd_abort(card, req);
-   return 0;
+   return;
}
break;
case MMC_BLK_CMD_ERR:
@@ -1767,18 +1761,16 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, 
struct request *rqc)
}
} while (ret);
 
-   return 1;
+   return;
 
  cmd_abort:
mmc_blk_rw_cmd_abort(card, req);
 
  start_new_req:
mmc_blk_rw_start_new(mq, card, rqc);
-
-   return 0;
 }
 
-int mmc_blk_issue_rq(struct mmc_queue *mq, struct request *req)
+void mmc_blk_issue_rq(struct mmc_queue *mq, 

[PATCH 4/6] mmc: block: inline command abortions

2017-01-24 Thread Linus Walleij
Setting rqc to NULL followed by a goto to cmd_abort is just a way
to do unconditional abort without starting any new command.
Inline the calls to mmc_blk_rw_cmd_abort() and return immediately
in those cases.

As a result, mmc_blk_rw_start_new() is not called with NULL
requests, and we can remove the NULL check in the beginning of
this function.

Add some comments to the code flow so it is clear that this is
where the asynchronous requests come back in and the result of
them gets handled.

Signed-off-by: Linus Walleij 
---
 drivers/mmc/core/block.c | 21 +
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 13e6fe060f26..4bbb3d16c09b 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -1612,9 +1612,6 @@ static void mmc_blk_rw_cmd_abort(struct mmc_card *card, 
struct request *req)
 static void mmc_blk_rw_start_new(struct mmc_queue *mq, struct mmc_card *card,
 struct request *req)
 {
-   if (!req)
-   return;
-
if (mmc_card_removed(card)) {
req->rq_flags |= RQF_QUIET;
blk_end_request_all(req, -EIO);
@@ -1649,9 +1646,8 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, 
struct request *rqc)
!IS_ALIGNED(blk_rq_sectors(rqc), 8)) {
pr_err("%s: Transfer size is not 4KB sector 
size aligned\n",
rqc->rq_disk->disk_name);
-   req = rqc;
-   rqc = NULL;
-   goto cmd_abort;
+   mmc_blk_rw_cmd_abort(card, rqc);
+   return 0;
}
 
mmc_blk_rw_rq_prep(mq->mqrq_cur, card, 0, mq);
@@ -1660,11 +1656,20 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, 
struct request *rqc)
areq = NULL;
areq = mmc_start_req(card->host, areq, );
if (!areq) {
+   /*
+* We have just put the first request into the pipeline
+* and there is nothing more to do until it is
+* complete.
+*/
if (status == MMC_BLK_NEW_REQUEST)
mq->flags |= MMC_QUEUE_NEW_REQUEST;
return 0;
}
 
+   /*
+* An asynchronous request has been completed and we proceed
+* to handle the result of it.
+*/
mq_rq = container_of(areq, struct mmc_queue_req, mmc_active);
brq = _rq->brq;
req = mq_rq->req;
@@ -1691,8 +1696,8 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, 
struct request *rqc)
pr_err("%s BUG rq_tot %d d_xfer %d\n",
   __func__, blk_rq_bytes(req),
   brq->data.bytes_xfered);
-   rqc = NULL;
-   goto cmd_abort;
+   mmc_blk_rw_cmd_abort(card, req);
+   return 0;
}
break;
case MMC_BLK_CMD_ERR:
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/6] mmc: block: do not assign mq_rq when aborting command

2017-01-24 Thread Linus Walleij
The code in mmc_blk_issue_rq_rq() aborts a command if the request
is not properly aligned on large sectors. As part of the path
jumping out, it assigns the local variable mq_rq reflecting
a MMC queue request to the current MMC queue request, which is
confusing since the variable is not used after this jump.

Signed-off-by: Linus Walleij 
---
 drivers/mmc/core/block.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index b60d1fb3a07a..13e6fe060f26 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -1649,7 +1649,6 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, 
struct request *rqc)
!IS_ALIGNED(blk_rq_sectors(rqc), 8)) {
pr_err("%s: Transfer size is not 4KB sector 
size aligned\n",
rqc->rq_disk->disk_name);
-   mq_rq = mq->mqrq_cur;
req = rqc;
rqc = NULL;
goto cmd_abort;
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/6] mmc: block: break out mmc_blk_rw_start_new()

2017-01-24 Thread Linus Walleij
As a step toward breaking apart the very complex function
mmc_blk_issue_rw_rq() we break out the code to start a new
request.

Signed-off-by: Linus Walleij 
---
 drivers/mmc/core/block.c | 27 +--
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 14efe92a14ef..b60d1fb3a07a 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -1609,6 +1609,22 @@ static void mmc_blk_rw_cmd_abort(struct mmc_card *card, 
struct request *req)
  blk_rq_cur_bytes(req));
 }
 
+static void mmc_blk_rw_start_new(struct mmc_queue *mq, struct mmc_card *card,
+struct request *req)
+{
+   if (!req)
+   return;
+
+   if (mmc_card_removed(card)) {
+   req->rq_flags |= RQF_QUIET;
+   blk_end_request_all(req, -EIO);
+   } else {
+   mmc_blk_rw_rq_prep(mq->mqrq_cur, card, 0, mq);
+   mmc_start_req(card->host,
+ >mqrq_cur->mmc_active, NULL);
+   }
+}
+
 static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *rqc)
 {
struct mmc_blk_data *md = mq->blkdata;
@@ -1751,16 +1767,7 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, 
struct request *rqc)
mmc_blk_rw_cmd_abort(card, req);
 
  start_new_req:
-   if (rqc) {
-   if (mmc_card_removed(card)) {
-   rqc->rq_flags |= RQF_QUIET;
-   blk_end_request_all(rqc, -EIO);
-   } else {
-   mmc_blk_rw_rq_prep(mq->mqrq_cur, card, 0, mq);
-   mmc_start_req(card->host,
- >mqrq_cur->mmc_active, NULL);
-   }
-   }
+   mmc_blk_rw_start_new(mq, card, rqc);
 
return 0;
 }
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/6] mmc: block: command issue cleanups

2017-01-24 Thread Linus Walleij
The function mmc_blk_issue_rw_rq() is hopelessly convoluted and
need to be refactored to it can be understood by humans.

In the process I found some weird magic return values passed
around for no good reason.

Things are more readable after this.

This work is done towards the goal of breaking the function in
two parts: one just submitting the requests and one checking the
result and possibly resubmitting the command on error, so we
can make the usual path (non-errorpath) smooth and quick, and
be called directly when the driver completes a request.

That in turn is a prerequisite for proper blk-mq integration
with the MMC/SD stack.

All that comes later.

Linus Walleij (6):
  mmc: block: break out mmc_blk_rw_cmd_abort()
  mmc: block: break out mmc_blk_rw_start_new()
  mmc: block: do not assign mq_rq when aborting command
  mmc: block: inline command abortions
  mmc: block: introduce new_areq and old_areq
  mmc: block: stop passing around pointless return values

 drivers/mmc/core/block.c | 108 ++-
 drivers/mmc/core/block.h |   2 +-
 2 files changed, 60 insertions(+), 50 deletions(-)

-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/6] mmc: block: break out mmc_blk_rw_cmd_abort()

2017-01-24 Thread Linus Walleij
As a first step toward breaking apart the very complex function
mmc_blk_issue_rw_rq() we break out the command abort code.
This code assumes "ret" is != 0 and then repeatedly hammers
blk_end_request() until the request to the block layer to end
the request succeeds.

Signed-off-by: Linus Walleij 
---
 drivers/mmc/core/block.c | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 7bd03381810d..14efe92a14ef 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -1598,6 +1598,17 @@ static int mmc_blk_cmd_err(struct mmc_blk_data *md, 
struct mmc_card *card,
return ret;
 }
 
+static void mmc_blk_rw_cmd_abort(struct mmc_card *card, struct request *req)
+{
+   int ret = 1;
+
+   if (mmc_card_removed(card))
+   req->rq_flags |= RQF_QUIET;
+   while (ret)
+   ret = blk_end_request(req, -EIO,
+ blk_rq_cur_bytes(req));
+}
+
 static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *rqc)
 {
struct mmc_blk_data *md = mq->blkdata;
@@ -1737,11 +1748,7 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, 
struct request *rqc)
return 1;
 
  cmd_abort:
-   if (mmc_card_removed(card))
-   req->rq_flags |= RQF_QUIET;
-   while (ret)
-   ret = blk_end_request(req, -EIO,
-   blk_rq_cur_bytes(req));
+   mmc_blk_rw_cmd_abort(card, req);
 
  start_new_req:
if (rqc) {
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 05/16] dm: always defer request allocation to the owner of the request_queue

2017-01-24 Thread Mike Snitzer
On Mon, Jan 23 2017 at 10:29am -0500,
Christoph Hellwig  wrote:

> DM already calls blk_mq_alloc_request on the request_queue of the
> underlying device if it is a blk-mq device.  But now that we allow drivers
> to allocate additional data and initialize it ahead of time we need to do
> the same for all drivers.   Doing so and using the new cmd_size
> infrastructure in the block layer greatly simplifies the dm-rq and mpath
> code, and should also make arbitrary combinations of SQ and MQ devices
> with SQ or MQ device mapper tables easily possible as a further step.

Thanks for working (and suffering) through all of this request-based DM
code.  Nice to have someone else be painfully aware of the complexity in
request-based DM's old request_fn support.

The queue->cmd_size (per request data) definitely makes this more
possible and is welcomed cleanup.  The only concern I have is that using
get_request() for the old request_fn request_queue eliminates the
guaranteed availability of requests to allow for forward progress (on
path failure or for the purposes of swap over mpath, etc).  This isn't a
concern for blk-mq because as you know we have a fixed set of tags (and
associated preallocated requests).

So I'm left unconvinced old request_fn request-based DM multipath isn't
regressing in how request resubmission can be assured a request will be
available when needed on retry in the face of path failure.

dm_mod's 'reserved_rq_based_ios' module_param governs the minimum number
of requests in the md->rq_pool (and defaults to 256 requests per
request-based DM request_queue).  Whereas blk_init_rl()'s
mempool_create_node() uses BLKDEV_MIN_RQ (4) yet q->nr_requests =
BLKDEV_MAX_RQ (128).  Also, this patch eliminates the utility of
'reserved_rq_based_ios' module_param without actually removing it.

Anyway, should blk-core evolve to allow drivers to specify a custom
min_nr of requests in the old request_fn request_queue's mempool?  Or is
my concern overblown?

Seems we're very close to making this request-based DM cleanup doable.
Just would like some extra eyes and care/thought/guidance from yourself
and likely Jens.

Thanks,
Mike

p.s. dm.c:dm_alloc_md_mempools() could be cleaned up a bit more since
only bio-based DM will have a pools->io_pool moving forward; but I can
circle back to that cleanup after.
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUG: KASAN: Use-after-free

2017-01-24 Thread Matias Bjørling
On 01/24/2017 10:52 AM, Christoph Hellwig wrote:
> On Tue, Jan 24, 2017 at 10:32:11AM +0100, Matias Bjørling wrote:
>> *(gdb) list *blkdev_direct_IO+0x50c
>> 0x8142ab8c is in blkdev_direct_IO (fs/block_dev.c:401).
>> 396  submit_bio(bio);
>> 397  bio = bio_alloc(GFP_KERNEL, nr_pages);
>> 398  }
>> 399  blk_finish_plug();
>> 400  
>> 401  if (!dio->is_sync)
>> 402  return -EIOCBQUEUED;
>>
>> It looks like the qc = submit_bio() completes the I/O before the
>> blkdev_direct_IO completes, which then leads to the use after free case,
>> when the private dio struct is accessed.
> 
> Ok, the is_sync check here is clearly a bug, we need a local variable
> for is_sync as well for the aio case:
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 5db5d13..3c47614 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -331,7 +331,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter 
> *iter, int nr_pages)
>   struct blk_plug plug;
>   struct blkdev_dio *dio;
>   struct bio *bio;
> - bool is_read = (iov_iter_rw(iter) == READ);
> + bool is_read = (iov_iter_rw(iter) == READ), is_sync;
>   loff_t pos = iocb->ki_pos;
>   blk_qc_t qc = BLK_QC_T_NONE;
>   int ret;
> @@ -344,7 +344,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter 
> *iter, int nr_pages)
>   bio_get(bio); /* extra ref for the completion handler */
>  
>   dio = container_of(bio, struct blkdev_dio, bio);
> - dio->is_sync = is_sync_kiocb(iocb);
> + dio->is_sync = is_sync = is_sync_kiocb(iocb);
>   if (dio->is_sync)
>   dio->waiter = current;
>   else
> @@ -398,7 +398,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter 
> *iter, int nr_pages)
>   }
>   blk_finish_plug();
>  
> - if (!dio->is_sync)
> + if (!is_sync)
>   return -EIOCBQUEUED;
>  
>   for (;;) {
> 

Yup. That fixes it. Should I put together the patch, or will you take
care of it?
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUG: KASAN: Use-after-free

2017-01-24 Thread Christoph Hellwig
On Tue, Jan 24, 2017 at 10:32:11AM +0100, Matias Bjørling wrote:
> *(gdb) list *blkdev_direct_IO+0x50c
> 0x8142ab8c is in blkdev_direct_IO (fs/block_dev.c:401).
> 396   submit_bio(bio);
> 397   bio = bio_alloc(GFP_KERNEL, nr_pages);
> 398   }
> 399   blk_finish_plug();
> 400   
> 401   if (!dio->is_sync)
> 402   return -EIOCBQUEUED;
> 
> It looks like the qc = submit_bio() completes the I/O before the
> blkdev_direct_IO completes, which then leads to the use after free case,
> when the private dio struct is accessed.

Ok, the is_sync check here is clearly a bug, we need a local variable
for is_sync as well for the aio case:

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 5db5d13..3c47614 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -331,7 +331,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter 
*iter, int nr_pages)
struct blk_plug plug;
struct blkdev_dio *dio;
struct bio *bio;
-   bool is_read = (iov_iter_rw(iter) == READ);
+   bool is_read = (iov_iter_rw(iter) == READ), is_sync;
loff_t pos = iocb->ki_pos;
blk_qc_t qc = BLK_QC_T_NONE;
int ret;
@@ -344,7 +344,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter 
*iter, int nr_pages)
bio_get(bio); /* extra ref for the completion handler */
 
dio = container_of(bio, struct blkdev_dio, bio);
-   dio->is_sync = is_sync_kiocb(iocb);
+   dio->is_sync = is_sync = is_sync_kiocb(iocb);
if (dio->is_sync)
dio->waiter = current;
else
@@ -398,7 +398,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter 
*iter, int nr_pages)
}
blk_finish_plug();
 
-   if (!dio->is_sync)
+   if (!is_sync)
return -EIOCBQUEUED;
 
for (;;) {
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUG: KASAN: Use-after-free

2017-01-24 Thread Matias Bjørling
On 01/23/2017 06:20 PM, Christoph Hellwig wrote:
> On Mon, Jan 23, 2017 at 05:07:52PM +0100, Matias Bjørling wrote:
>> Hi,
>>
>> I could use some help verifying an use-after-free bug that I am seeing
>> after the new direct I/O work went in.
>>
>> When issuing a direct write io using libaio, a bio is referenced in the
>> blkdev_direct_IO path, and then put again in the blkdev_bio_end_io path.
>> However, there is a case where the bio is put twice, which leads to
>> modules that rely on the bio after biodev_bio_end_io() to access it
>> prematurely.
> 
> Can you reproduce anything like this with a normal block device?

Looks like bcache has something similar with a get/put in it. I'll look
a bit more.

> 
>>
>> The KASAN error report:
>>
>> [   14.645916]
>> ==
>> [   14.648027] BUG: KASAN: use-after-free in
>> blkdev_direct_IO+0x50c/0x600 at addr 8801ef30ea14
> 
> Can you resolve that address for me, please?
> 

*(gdb) list *blkdev_direct_IO+0x50c
0x8142ab8c is in blkdev_direct_IO (fs/block_dev.c:401).
396 submit_bio(bio);
397 bio = bio_alloc(GFP_KERNEL, nr_pages);
398 }
399 blk_finish_plug();
400 
401 if (!dio->is_sync)
402 return -EIOCBQUEUED;

It looks like the qc = submit_bio() completes the I/O before the
blkdev_direct_IO completes, which then leads to the use after free case,
when the private dio struct is accessed.

>> Which boils down to the bio already being freed, when we hit the
>> module's endio function.
>>
>> The double bio_put() occurs when dio->should_dirty = 0, and dio->is_sync
>> = 0. The flow follows:
>>
>> Issuing:
>>   blkdev_direct_IO
>>  get_bio(bio)
> 
> bio refcounting in __blkdev_direct_IO is the following
> 
> first bio is allocated using bio_alloc_bioset to embedd the dio structure
> into it.  We then grab a second reference to that bio and only that one.
> Each other new bio just gets it's single reference from bio_alloc.
> 
> blkdev_bio_end_io then checks if we hit the last reference on the dio, in
> which case it then drops the additional reference on the first bio after
> calling the aio completion handler.  Once that is done it always drops the
> reference for the current bio - either explicitly or through
> bio_check_pages_dirty in the should_dirty case.
> 
>>  submit_io()
>>rrpc make_request(bio)
>>  get_bio(bio)
>> Completion:
>>   blkdev_bio_end_io
>> bio_put(bio)
>> bio_put(bio) - bio is freed
> 
> Yes, and that's how it's intended.
> 
>>   rrpc_end_io
>> bio_put(bio) (use-after-free access)
> 
> Can you check this is the same bio that got submitted and it didn't
> get bounced somewhere?
> 

Yup, the same:

[11.329950] blkdev_direct_io: get_bio (bio=8801f1e7a018) get ref
[11.331557] blkdev_direct_io: (!nr_pages) (bio=8801f1e7a018) submit
[11.333603] rrpc bio_get: (bio=8801f1e7a018) get ref
[11.335004] blkdev_bio_end_io:!dio->is_syn(bio=8801f1e7a018) put ref
[11.335009] rrpc bio_put: (bio=8801f1e7a018) put ref

It could look like the first get_bio() ref is decremented prematurely in
the blkdev_bio_end_io() path, where it should first have been
decremented at the end of blkdev_direct_IO() path.
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 05/16] dm: always defer request allocation to the owner of the request_queue

2017-01-24 Thread Hannes Reinecke
On 01/23/2017 04:29 PM, Christoph Hellwig wrote:
> DM already calls blk_mq_alloc_request on the request_queue of the
> underlying device if it is a blk-mq device.  But now that we allow drivers
> to allocate additional data and initialize it ahead of time we need to do
> the same for all drivers.   Doing so and using the new cmd_size
> infrastructure in the block layer greatly simplifies the dm-rq and mpath
> code, and should also make arbitrary combinations of SQ and MQ devices
> with SQ or MQ device mapper tables easily possible as a further step.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/md/dm-core.h  |   1 -
>  drivers/md/dm-mpath.c | 132 --
>  drivers/md/dm-rq.c| 251 
> ++
>  drivers/md/dm-rq.h|   2 +-
>  drivers/md/dm-target.c|   7 --
>  drivers/md/dm.c   |  18 +--
>  drivers/md/dm.h   |   3 +-
>  include/linux/device-mapper.h |   3 -
>  8 files changed, 81 insertions(+), 336 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 04/16] dm: remove incomple BLOCK_PC support

2017-01-24 Thread Hannes Reinecke
On 01/23/2017 04:29 PM, Christoph Hellwig wrote:
> DM tries to copy a few fields around for BLOCK_PC requests, but given
> that no dm-target ever wires up scsi_cmd_ioctl BLOCK_PC can't actually
> be sent to dm.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/md/dm-rq.c | 16 
>  1 file changed, 16 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/16] block: allow specifying size for extra command data

2017-01-24 Thread Hannes Reinecke
On 01/23/2017 04:29 PM, Christoph Hellwig wrote:
> This mirrors the blk-mq capabilities to allocate extra drivers-specific
> data behind struct request by setting a cmd_size field, as well as having
> a constructor / destructor for it.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  block/blk-core.c   | 59 
> --
>  block/blk-flush.c  |  5 ++---
>  block/blk-sysfs.c  |  7 --
>  include/linux/blkdev.h |  7 ++
>  4 files changed, 61 insertions(+), 17 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html