Re: [PATCH] queue stall with blk-mq-sched
On 01/24/2017 11:06 PM, Jens Axboe wrote:
> On 01/24/2017 12:55 PM, Jens Axboe wrote:
>> Try this patch. We only want to bump it for the driver tags, not the
>> scheduler side.
>
> More complete version, this one actually tested. I think this should fix
> your issue, let me know.
>
Nearly there. The initial stall is gone, but the test hung at the
'stonewall' sequence again:

[global]
bs=4k
ioengine=libaio
iodepth=256
size=4g
direct=1
runtime=60
# directory=/mnt
numjobs=32
group_reporting
cpus_allowed_policy=split
filename=/dev/md127

[seq-read]
rw=read
-> stonewall

[rand-read]
rw=randread
stonewall

Restarting all queues made the fio job continue. There were 4 queues with
state 'restart', and one queue with state 'active'. So we're missing a
queue run somewhere.

Cheers,

Hannes
--
Dr. Hannes Reinecke		   Teamlead Storage & Networking
h...@suse.de			   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] blkcg: fix double free of new_blkg in blkcg_init_queue
If blkg_create() fails, new_blkg passed as an argument will be freed by
blkg_create(), so there is no need to free it again.

Signed-off-by: Hou Tao
---
 block/blk-cgroup.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 8ba0af7..58fb0dd 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1080,7 +1080,6 @@ int blkcg_init_queue(struct request_queue *q)
 	radix_tree_preload_end();
 
 	if (IS_ERR(blkg)) {
-		blkg_free(new_blkg);
 		return PTR_ERR(blkg);
 	}
-- 
2.5.0
Re: [LSF/MM TOPIC] block level event logging for storage media management
> On Jan 24, 2017, at 12:18 PM, Oleg Drokin wrote:
>
> On Jan 23, 2017, at 2:27 AM, Dan Williams wrote:
>> [ adding Oleg ]
>>
>> On Sun, Jan 22, 2017 at 10:00 PM, Song Liu wrote:
>>> Hi Dan,
>>>
>>> I think the block level event log is more like a log-only system. When an
>>> event happens, it is not necessary to take immediate action. (I guess this
>>> is different to the bad block list?)
>>>
>>> I would hope the event log to track more information. Some of these
>>> individual events may not be very interesting, for example, soft errors
>>> or latency outliers. However, when we gather event logs for a fleet of
>>> devices, these "soft events" may become valuable for health monitoring.
>>
>> I'd be interested in this. It sounds like you're trying to fill a gap
>> between tracing and console log messages which I believe others have
>> encountered as well.
>
> We have a somewhat similar problem in Lustre and I guess it's not just
> Lustre. Currently there are all sorts of conditional debug code all over
> the place that goes to the console, and when you enable it for anything
> verbose, you quickly overflow your dmesg buffer no matter the size. That
> might be mostly ok for local "block level" stuff, but once you become
> distributed, it starts to be a mess, and once you get to be super large it
> worsens even more, since you need to somehow coordinate data from multiple
> nodes, ensure all of it is not lost, and still you don't end up using a lot
> of it since only a few nodes end up being useful. (I don't know how NFS
> people manage to debug complicated issues using just this, it could not be
> super easy.)
>
> Having some sort of a buffer of a (potentially very) large size that could
> be storing the data until it's needed, or eagerly polled by some daemon for
> storage, would help (especially when you expect a lot of data that
> definitely won't fit in RAM).
> Tracepoints have the buffer and the daemon, but creating new messages is
> very cumbersome, so converting every debug message into one does not look
> very feasible.
> Also it's convenient to have "event masks" of what one wants logged, which
> I don't think you could do with tracepoints.
>
> I know you were talking about reporting events to the block layer, but
> other than plain errors what would the block layer do with them? Just a
> convenient way to map messages to a particular device? You don't plan to
> store it on some block device as part of the block layer, right?
>
> Implementing such a buffer, all sorts of additional generic data might be
> collected automatically for all events as part of the buffer format, like
> what cpu did emit it, time, stack usage information, current pid,
> backtrace (tracepoint-alike could be optional), actual source code location
> of the message, …
>
> Having something like that being a standard part of {dev,pr}_{dbg,warn,...}
> and friends would be super awesome too, I imagine (adding Greg to CC for
> that).

Hi Oleg,

Thanks for sharing these insights. We built an event logger that parses
dmesg to get events. For similar reasons as you described above, it doesn't
work well. And one of the biggest issues is poor "event mask" support. I am
hoping to get better event masking in a newer implementation, for example,
with the kernel tracing filter, or by implementing customized logic in BPF.

With a relatively mature infrastructure, we don't have much problem storing
logs from the event logger. Specifically, we use a daemon that collects
events and sends them to distributed storage (HDFS+HIVE). It might be
overkill for smaller deployments.

We do use information from similar (not exactly the one above) logs to make
decisions about device handling. For example, if a drive throws too many
medium errors in a short period of time, we will kick it out of production.
I think it is not necessary to include this in the block layer.
Overall, I am hoping the kernel can generate accurate events, with flexible
filter/mask support. There are different ways to store and consume these
data. I guess most of these will be implemented in user space.

Let's discuss potential use cases and requirements. These discussions should
help us build the kernel part of the event log.

Thanks,
Song
Re: [PATCH] queue stall with blk-mq-sched
On 01/24/2017 12:55 PM, Jens Axboe wrote:
> Try this patch. We only want to bump it for the driver tags, not the
> scheduler side.

More complete version, this one actually tested. I think this should fix
your issue, let me know.

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index a49ec77..1b156ca 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -90,9 +90,11 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
 	return atomic_read(&hctx->nr_active) < depth;
 }
 
-static int __blk_mq_get_tag(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt)
+static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
+			    struct sbitmap_queue *bt)
 {
-	if (!hctx_may_queue(hctx, bt))
+	if (!(data->flags & BLK_MQ_REQ_INTERNAL) &&
+	    !hctx_may_queue(data->hctx, bt))
 		return -1;
 	return __sbitmap_queue_get(bt);
 }
@@ -118,7 +120,7 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		tag_offset = tags->nr_reserved_tags;
 	}
 
-	tag = __blk_mq_get_tag(data->hctx, bt);
+	tag = __blk_mq_get_tag(data, bt);
 	if (tag != -1)
 		goto found_tag;
 
@@ -129,7 +131,7 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 	do {
 		prepare_to_wait(&ws->wait, &wait, TASK_UNINTERRUPTIBLE);
 
-		tag = __blk_mq_get_tag(data->hctx, bt);
+		tag = __blk_mq_get_tag(data, bt);
 		if (tag != -1)
 			break;
 
@@ -144,7 +146,7 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		 * Retry tag allocation after running the hardware queue,
 		 * as running the queue may also have found completions.
 		 */
-		tag = __blk_mq_get_tag(data->hctx, bt);
+		tag = __blk_mq_get_tag(data, bt);
 		if (tag != -1)
 			break;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index ee69e5e..dcb5676 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -230,15 +230,14 @@ struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data,
 	rq = tags->static_rqs[tag];
 
-	if (blk_mq_tag_busy(data->hctx)) {
-		rq->rq_flags = RQF_MQ_INFLIGHT;
-		atomic_inc(&data->hctx->nr_active);
-	}
-
 	if (data->flags & BLK_MQ_REQ_INTERNAL) {
 		rq->tag = -1;
 		rq->internal_tag = tag;
 	} else {
+		if (blk_mq_tag_busy(data->hctx)) {
+			rq->rq_flags = RQF_MQ_INFLIGHT;
+			atomic_inc(&data->hctx->nr_active);
+		}
 		rq->tag = tag;
 		rq->internal_tag = -1;
 	}
@@ -869,6 +868,10 @@ static bool blk_mq_get_driver_tag(struct request *rq,
 	rq->tag = blk_mq_get_tag(&data);
 	if (rq->tag >= 0) {
+		if (blk_mq_tag_busy(data.hctx)) {
+			rq->rq_flags |= RQF_MQ_INFLIGHT;
+			atomic_inc(&data.hctx->nr_active);
+		}
 		data.hctx->tags->rqs[rq->tag] = rq;
 		goto done;
 	}

-- 
Jens Axboe
Re: [LSF/MM TOPIC] block level event logging for storage media management
On Jan 23, 2017, at 2:27 AM, Dan Williams wrote: > [ adding Oleg ] > > On Sun, Jan 22, 2017 at 10:00 PM, Song Liuwrote: >> Hi Dan, >> >> I think the the block level event log is more like log only system. When en >> event >> happens, it is not necessary to take immediate action. (I guess this is >> different >> to bad block list?). >> >> I would hope the event log to track more information. Some of these >> individual >> event may not be very interesting, for example, soft error or latency >> outliers. >> However, when we gather event log for a fleet of devices, these "soft event" >> may become valuable for health monitoring. > > I'd be interested in this. It sounds like you're trying to fill a gap > between tracing and console log messages which I believe others have > encountered as well. We have a somewhat similar problem problem in Lustre and I guess it's not just Lustre. Currently there are all sorts of conditional debug code all over the place that goes to the console and when you enable it for anything verbose, you quickly overflow your dmesg buffer no matter the size, that might be mostly ok for local "block level" stuff, but once you become distributed, it start to be a mess and once you get to be super large it worsens even more since you need to somehow coordinate data from multiple nodes, ensure all of it is not lost and still you don't end up using a lot of it since only a few nodes end up being useful. (I don't know how NFS people manage to debug complicated issues using just this, could not be super easy). Having some sort of a buffer of a (potentially very) large size that could be storing the data until it's needed, or eagerly polled by some daemon for storage (helpful when you expect a lot of data that definitely won't fit in RAM). Tracepoints have the buffer and the daemon, but creating new messages is very cumbersome, so converting every debug message into one does not look very feasible. 
Also it's convenient to have "event masks" of what one wants logged, which I
don't think you could do with tracepoints.

I know you were talking about reporting events to the block layer, but
other than plain errors what would the block layer do with them? Just a
convenient way to map messages to a particular device? You don't plan to
store it on some block device as part of the block layer, right?

Implementing such a buffer, all sorts of additional generic data might be
collected automatically for all events as part of the buffer format, like
what cpu did emit it, time, stack usage information, current pid, backtrace
(tracepoint-alike could be optional), actual source code location of the
message, …

Having something like that being a standard part of {dev,pr}_{dbg,warn,...}
and friends would be super awesome too, I imagine (adding Greg to CC for
that).
Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems
On 01/24, Jan Kara wrote:
> On Fri 20-01-17 07:42:09, Dan Williams wrote:
> > On Fri, Jan 20, 2017 at 1:47 AM, Jan Kara wrote:
> > > On Thu 19-01-17 14:17:19, Vishal Verma wrote:
> > >> On 01/18, Jan Kara wrote:
> > >> > On Tue 17-01-17 15:37:05, Vishal Verma wrote:
> > >> > 2) PMEM is exposed for DAX aware filesystem. This seems to be what
> > >> > you are mostly interested in. We could possibly do something more
> > >> > efficient than what the NVDIMM driver does, however the complexity
> > >> > would be relatively high and frankly I'm far from convinced this is
> > >> > really worth it. If there are so many badblocks that this would
> > >> > matter, the HW has IMHO bigger problems than performance.
> > >>
> > >> Correct, and Dave was of the opinion that once at least XFS has
> > >> reverse mapping support (which it does now), adding badblocks
> > >> information to that should not be a hard lift, and should be a better
> > >> solution. I suppose I should try to benchmark how much of a penalty
> > >> the current badblock checking in the NVDIMM driver imposes. The
> > >> penalty is not because there may be a large number of badblocks, but
> > >> just due to the fact that we have to do this check for every IO, in
> > >> fact, every 'bvec' in a bio.
> > >
> > > Well, letting the filesystem know is certainly good from an error
> > > reporting quality POV. I guess I'll leave it up to the XFS guys to
> > > tell whether they can be more efficient in checking whether the
> > > current IO overlaps with any of the given bad blocks.
> > >
> > >> > Now my question: Why do we bother with badblocks at all? In cases
> > >> > 1) and 2), if the platform can recover from MCE, we can just always
> > >> > access persistent memory using memcpy_mcsafe(); if that fails,
> > >> > return -EIO. Actually that seems to already happen, so we just need
> > >> > to make sure all places handle returned errors properly (e.g.
> > >> > fs/dax.c does not seem to) and we are done.
> > >> > No need for bad blocks list at all, no slow down unless we hit a
> > >> > bad cell, and in that case who cares about performance when the
> > >> > data is gone...
> > >>
> > >> Even when we have MCE recovery, we cannot do away with the badblocks
> > >> list:
> > >> 1. My understanding is that the hardware's ability to do MCE recovery
> > >> is limited/best-effort, and is not guaranteed. There can be
> > >> circumstances that cause a "Processor Context Corrupt" state, which
> > >> is unrecoverable.
> > >
> > > Well, then they have to work on improving the hardware. Because having
> > > HW that just sometimes gets stuck instead of reporting bad storage is
> > > simply not acceptable. And no matter how hard you try you cannot avoid
> > > MCEs from the OS when accessing persistent memory, so the OS just has
> > > no way to avoid that risk.
> > >
> > >> 2. We still need to maintain a badblocks list so that we know what
> > >> blocks need to be cleared (via the ACPI method) on writes.
> > >
> > > Well, why cannot we just do the write, see whether we got a CMCI and
> > > if yes, clear the error via the ACPI method?
> >
> > I would need to check if you get the address reported in the CMCI, but
> > it would only fire if the write triggered a read-modify-write cycle. I
> > suspect most copies to pmem, through something like
> > arch_memcpy_to_pmem(), are not triggering any reads. It also triggers
> > asynchronously, so what data do you write after clearing the error?
> > There may have been more writes while the CMCI was being delivered.
>
> OK, I see. And if we just write new data but don't clear the error on
> write through the ACPI method, will we still get an MCE on a following
> read of that data? But regardless of whether we get an MCE or not, I
> suppose that the memory location will still be marked as bad in some ACPI
> table, won't it?

Correct, the location will continue to result in MCEs on reads if it isn't
explicitly marked as clear.
I'm not sure that there is an ACPI table that keeps a list of bad locations;
it is just a poison bit in the cache line, and presumably the DIMMs will
have some internal data structures that also mark bad locations.

> 								Honza
> --
> Jan Kara
> SUSE Labs, CR
Re: [PATCH] queue stall with blk-mq-sched
On 01/24/2017 11:49 AM, Hannes Reinecke wrote:
> On 01/24/2017 05:09 PM, Jens Axboe wrote:
>> On 01/24/2017 08:54 AM, Hannes Reinecke wrote:
>>> Hi Jens,
>>>
>>> I'm trying to debug a queue stall with your blk-mq-sched branch; with my
>>> latest mpt3sas patches fio stops basically directly after starting a
>>> sequential read :-(
>>>
>>> I've debugged things and came up with the attached patch; we need to
>>> restart waiters with blk_mq_tag_idle() after completing a tag.
>>> We're already calling blk_mq_tag_busy() when fetching a tag, so I think
>>> calling blk_mq_tag_idle() is required when retiring a tag.
>>
>> The patch isn't correct, the whole point of the un-idling is that it
>> ISN'T happening for every request completion. Otherwise you throw
>> away scalability. So a queue will go into active mode on the first
>> request, and idle when it's been idle for a bit. The active count
>> is used to divide up the tags.
>>
>> So I'm assuming we're missing a queue run somewhere when we fail
>> getting a driver tag. The latter should only happen if the target
>> has IO in flight already, and the restart marking should take care
>> of it. Obviously there's a case where that is not true, since you
>> are seeing stalls.
>>
> But what is the point in the 'blk_mq_tag_busy()' thingie then?
> When will it be reset?
> The only instances I've seen is that it'll be getting reset during
> resize and teardown ... hence my patch.

The point is to have some count of how many queues are busy "lately", which
helps in dividing up the tags fairly. Hence we bump it as soon as the queue
goes active, and drop it after some delay. That's working as expected.

>>> However, even with the attached patch I'm seeing some queue stalls;
>>> looks like they're related to the 'stonewall' statement in fio.
>>
>> I think you are heading down the wrong path. Your patch will cause
>> the symptoms to be a bit different, but you'll still run into cases
>> where we fail giving out the tag and then stall.
>>
> Hehe.
> How did you know that?

My crystal ball :-)

> That's indeed what I'm seeing.
>
> Oh well, back to the drawing board...

Try this patch. We only want to bump it for the driver tags, not the
scheduler side.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ee69e5e..c905aa1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -230,15 +230,15 @@ struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data,
 	rq = tags->static_rqs[tag];
 
-	if (blk_mq_tag_busy(data->hctx)) {
-		rq->rq_flags = RQF_MQ_INFLIGHT;
-		atomic_inc(&data->hctx->nr_active);
-	}
-
 	if (data->flags & BLK_MQ_REQ_INTERNAL) {
 		rq->tag = -1;
 		rq->internal_tag = tag;
 	} else {
+		if (blk_mq_tag_busy(data->hctx)) {
+			rq->rq_flags = RQF_MQ_INFLIGHT;
+			atomic_inc(&data->hctx->nr_active);
+		}
+
 		rq->tag = tag;
 		rq->internal_tag = -1;
 	}
@@ -870,6 +870,10 @@ static bool blk_mq_get_driver_tag(struct request *rq,
 	rq->tag = blk_mq_get_tag(&data);
 	if (rq->tag >= 0) {
 		data.hctx->tags->rqs[rq->tag] = rq;
+		if (blk_mq_tag_busy(data.hctx)) {
+			rq->rq_flags |= RQF_MQ_INFLIGHT;
+			atomic_inc(&data.hctx->nr_active);
+		}
 		goto done;
 	}

-- 
Jens Axboe
Re: [PATCH 05/16] dm: always defer request allocation to the owner of the request_queue
On Tue, Jan 24, 2017 at 11:39:51AM -0500, Mike Snitzer wrote:
> Fair enough. Cc'ing Junichi just in case he sees anything we're
> missing.

Thanks, I'll wait for his response before reposting with the few
accumulated changes (including the dm cleanup).
Re: [PATCH] queue stall with blk-mq-sched
On 01/24/2017 05:09 PM, Jens Axboe wrote:
> On 01/24/2017 08:54 AM, Hannes Reinecke wrote:
>> Hi Jens,
>>
>> I'm trying to debug a queue stall with your blk-mq-sched branch; with my
>> latest mpt3sas patches fio stops basically directly after starting a
>> sequential read :-(
>>
>> I've debugged things and came up with the attached patch; we need to
>> restart waiters with blk_mq_tag_idle() after completing a tag.
>> We're already calling blk_mq_tag_busy() when fetching a tag, so I think
>> calling blk_mq_tag_idle() is required when retiring a tag.
>
> The patch isn't correct, the whole point of the un-idling is that it
> ISN'T happening for every request completion. Otherwise you throw
> away scalability. So a queue will go into active mode on the first
> request, and idle when it's been idle for a bit. The active count
> is used to divide up the tags.
>
> So I'm assuming we're missing a queue run somewhere when we fail
> getting a driver tag. The latter should only happen if the target
> has IO in flight already, and the restart marking should take care
> of it. Obviously there's a case where that is not true, since you
> are seeing stalls.
>
But what is the point in the 'blk_mq_tag_busy()' thingie then?
When will it be reset?
The only instances I've seen is that it'll be getting reset during
resize and teardown ... hence my patch.

>> However, even with the attached patch I'm seeing some queue stalls;
>> looks like they're related to the 'stonewall' statement in fio.
>
> I think you are heading down the wrong path. Your patch will cause
> the symptoms to be a bit different, but you'll still run into cases
> where we fail giving out the tag and then stall.
>
Hehe.
How did you know that?
That's indeed what I'm seeing.

Oh well, back to the drawing board...

Cheers,

Hannes
--
Dr. Hannes Reinecke		   zSeries & Storage
h...@suse.de			   +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
Re: [PATCH] queue stall with blk-mq-sched
On 01/24/2017 05:03 PM, Jens Axboe wrote:
> On 01/24/2017 08:54 AM, Hannes Reinecke wrote:
>> Hi Jens,
>>
>> I'm trying to debug a queue stall with your blk-mq-sched branch; with my
>> latest mpt3sas patches fio stops basically directly after starting a
>> sequential read :-(
>>
>> I've debugged things and came up with the attached patch; we need to
>> restart waiters with blk_mq_tag_idle() after completing a tag.
>> We're already calling blk_mq_tag_busy() when fetching a tag, so I think
>> calling blk_mq_tag_idle() is required when retiring a tag.
>
> I'll take a look at this. It sounds like all your grief is related to
> shared tag maps, which I don't have anything that uses here. I'll see
> if we are leaking it, you should be able to check that by reading the
> 'active' file in the sysfs directory.
>
Ah. That'll explain it.
Basically _all_ my HBAs I'm testing with have shared tag maps :-(

Cheers,

Hannes
--
Dr. Hannes Reinecke		   zSeries & Storage
h...@suse.de			   +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
Re: [PATCH 05/16] dm: always defer request allocation to the owner of the request_queue
On Tue, Jan 24 2017 at 9:20am -0500, Christoph Hellwig wrote:
> On Tue, Jan 24, 2017 at 05:05:39AM -0500, Mike Snitzer wrote:
>> possible and is welcomed cleanup. The only concern I have is that using
>> get_request() for the old request_fn request_queue eliminates the
>> guaranteed availability of requests to allow for forward progress (on
>> path failure or for the purposes of swap over mpath, etc). This isn't a
>> concern for blk-mq because as you know we have a fixed set of tags (and
>> associated preallocated requests).
>>
>> So I'm left unconvinced old request_fn request-based DM multipath isn't
>> regressing in how request resubmission can be assured a request will be
>> available when needed on retry in the face of path failure.
>
> A mempool only needs a size at which we can guarantee forward progress,
> so for get_request-based drivers under dm the theoretical minimum size
> would be one, as we never rely on a second request to finish the first
> one, and each request_queue has its own mempool(s) to start with.

Fair enough. Cc'ing Junichi just in case he sees anything we're missing.

>> dm_mod's 'reserved_rq_based_ios' module_param governs the minimum number
>> of requests in the md->rq_pool (and defaults to 256 requests per
>> request-based DM request_queue). Whereas blk_init_rl()'s
>> mempool_create_node() uses BLKDEV_MIN_RQ (4) yet q->nr_requests =
>> BLKDEV_MAX_RQ (128). Also, this patch eliminates the utility of the
>> 'reserved_rq_based_ios' module_param without actually removing it.
>
> It's still used for the bio pool, so I couldn't simply remove it.

Ah, yeah, you're talking about:

  pools->bs = bioset_create_nobvec(pool_size, front_pad);

Makes sense.

>> Anyway, should blk-core evolve to allow drivers to specify a custom
>> min_nr of requests in the old request_fn request_queue's mempool? Or is
>> my concern overblown?
>
> Thanks to mempool_resize we could add that functionality. I just don't
> see an actual need for it.
>
>> p.s. dm.c:dm_alloc_md_mempools() could be cleaned up a bit more since
>> only bio-based DM will have a pools->io_pool moving forward; but I can
>> circle back to that cleanup after.
>
> If you're fine with doing more than the necessary changes in a patch
> that needs to go into the block tree I'd be happy to apply my usual
> cleanup magic to it.

I think this one example illustrates cleanup that makes sense even if going
through block. While in there it is best to not leave extra branching that
just doesn't make sense to have anymore. So if you do another version of
this patchset feel free to move this directly into dm_alloc_md_mempools()'s
BIO_BASED case:

  pools->io_pool = mempool_create_slab_pool(pool_size, _io_cache);

And eliminate the cachep pointer. But if not that is fine too.

Also:

Reviewed-by: Mike Snitzer
Re: [PATCH 15/16] block: split scsi_request out of struct request
On Tue, 2017-01-24 at 09:09 +0100, h...@lst.de wrote:
> On Tue, Jan 24, 2017 at 12:33:13AM +, Bart Van Assche wrote:
>> Do we perhaps need a check before the above memcpy() call whether or not
>> sense == NULL?
>
> Yes, probably. I didn't think of the case where the caller wouldn't
> pass a sense buffer. Just curious, what's the callstack that caused
> this?

Hello Christoph,

The call stack was as follows:
* __scsi_execute()
* scsi_execute_req_flags()
* scsi_vpd_inquiry()
* scsi_attach_vpd()
* scsi_probe_and_add_lun()
* __scsi_add_device()
* ata_scsi_scan_host()
* async_port_probe()
* async_run_entry_fn()
* process_one_work()
* worker_thread()
* kthread()

Bart.
Re: [PATCH] queue stall with blk-mq-sched
On 01/24/2017 08:54 AM, Hannes Reinecke wrote:
> Hi Jens,
>
> I'm trying to debug a queue stall with your blk-mq-sched branch; with my
> latest mpt3sas patches fio stops basically directly after starting a
> sequential read :-(
>
> I've debugged things and came up with the attached patch; we need to
> restart waiters with blk_mq_tag_idle() after completing a tag.
> We're already calling blk_mq_tag_busy() when fetching a tag, so I think
> calling blk_mq_tag_idle() is required when retiring a tag.

The patch isn't correct, the whole point of the un-idling is that it
ISN'T happening for every request completion. Otherwise you throw
away scalability. So a queue will go into active mode on the first
request, and idle when it's been idle for a bit. The active count
is used to divide up the tags.

So I'm assuming we're missing a queue run somewhere when we fail
getting a driver tag. The latter should only happen if the target
has IO in flight already, and the restart marking should take care
of it. Obviously there's a case where that is not true, since you
are seeing stalls.

> However, even with the attached patch I'm seeing some queue stalls;
> looks like they're related to the 'stonewall' statement in fio.

I think you are heading down the wrong path. Your patch will cause
the symptoms to be a bit different, but you'll still run into cases
where we fail giving out the tag and then stall.

-- 
Jens Axboe
Re: [PATCH] queue stall with blk-mq-sched
On 01/24/2017 08:54 AM, Hannes Reinecke wrote:
> Hi Jens,
>
> I'm trying to debug a queue stall with your blk-mq-sched branch; with my
> latest mpt3sas patches fio stops basically directly after starting a
> sequential read :-(
>
> I've debugged things and came up with the attached patch; we need to
> restart waiters with blk_mq_tag_idle() after completing a tag.
> We're already calling blk_mq_tag_busy() when fetching a tag, so I think
> calling blk_mq_tag_idle() is required when retiring a tag.

I'll take a look at this. It sounds like all your grief is related to
shared tag maps, which I don't have anything that uses here. I'll see
if we are leaking it, you should be able to check that by reading the
'active' file in the sysfs directory.

-- 
Jens Axboe
[PATCH] queue stall with blk-mq-sched
Hi Jens,

I'm trying to debug a queue stall with your blk-mq-sched branch; with my
latest mpt3sas patches fio stops basically directly after starting a
sequential read :-(

I've debugged things and came up with the attached patch; we need to
restart waiters with blk_mq_tag_idle() after completing a tag.
We're already calling blk_mq_tag_busy() when fetching a tag, so I think
calling blk_mq_tag_idle() is required when retiring a tag.

However, even with the attached patch I'm seeing some queue stalls;
looks like they're related to the 'stonewall' statement in fio.
Debugging continues.

Cheers,

Hannes
--
Dr. Hannes Reinecke		   Teamlead Storage & Networking
h...@suse.de			   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

From 82b15ff40d71aed318f9946881825f9f03ef8f48 Mon Sep 17 00:00:00 2001
From: Hannes Reinecke
Date: Tue, 24 Jan 2017 14:43:09 +0100
Subject: [PATCH] block-mq: fixup queue stall

__blk_mq_alloc_request() calls blk_mq_tag_busy(), which might result in
the queue becoming blocked. So we need to call blk_mq_tag_idle() once
the tag is finished to wake up all waiters on the queue.

Patch is relative to the blk-mq-sched branch

Signed-off-by: Hannes Reinecke
---
 block/blk-mq.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 739a292..d52bcb1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -333,10 +333,12 @@ void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 {
 	const int sched_tag = rq->internal_tag;
 	struct request_queue *q = rq->q;
+	bool unbusy = false;
 
-	if (rq->rq_flags & RQF_MQ_INFLIGHT)
+	if (rq->rq_flags & RQF_MQ_INFLIGHT) {
 		atomic_dec(&hctx->nr_active);
-
+		unbusy = true;
+	}
 	wbt_done(q->rq_wb, &rq->issue_stat);
 	rq->rq_flags = 0;
@@ -346,6 +348,9 @@ void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 	blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
 	if (sched_tag != -1)
 		blk_mq_sched_completed_request(hctx, rq);
+	if (unbusy)
+		blk_mq_tag_idle(hctx);
+
 	blk_queue_exit(q);
 }
-- 
1.8.5.6
Re: [PATCH 01/16] block: fix elevator init check
On Tue, Jan 24, 2017 at 08:06:39AM -0700, Jens Axboe wrote:
> We check for REQ_PREFLUSH | REQ_FUA in so many places though, might not
> be a bad idea to keep the helper but just make it take the opf and fix
> up the other locations too. The fact that PREFLUSH|FUA is the magic to
> check for is somewhat tribal knowledge.

I'll see if I can come up with something sensible. The current helper using the bio and the magic 0 value is not exactly helpful.
Re: [PATCH 01/16] block: fix elevator init check
On Mon, Jan 23 2017, Christoph Hellwig wrote:
> We can't initialize the elevator fields for flushes as flushes share space
> in struct request with the elevator data. But currently we can't
> communicate that a request is a flush through blk_get_request as we
> can only pass READ or WRITE, and the low-level code looks at the
> possible NULL bio to check for a flush.
>
> Fix this by allowing to pass any block op and flags, and by checking for
> the flush flags in __get_request.

We check for REQ_PREFLUSH | REQ_FUA in so many places though, might not be a bad idea to keep the helper but just make it take the opf and fix up the other locations too. The fact that PREFLUSH|FUA is the magic to check for is somewhat tribal knowledge.

-- 
Jens Axboe
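The helper shape being discussed — taking the op/flags word directly instead of a bio, so that PREFLUSH|FUA stops being tribal knowledge — might look like the following sketch. The flag bit positions here are illustrative, not the real kernel values.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit positions; the real REQ_* values live in the kernel. */
#define REQ_PREFLUSH (1u << 0)
#define REQ_FUA      (1u << 1)

/* True if the request carries flush semantics, regardless of the
 * other op bits packed into the same word. */
static inline bool op_is_flush(unsigned int opf)
{
	return opf & (REQ_PREFLUSH | REQ_FUA);
}
```

A caller then writes `if (op_is_flush(rq_opf))` instead of open-coding the magic mask at every site.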
Re: [PATCH] blk-mq-sched: fix possible crash if changing scheduler fails
On 01/23/2017 04:47 PM, Omar Sandoval wrote:
> From: Omar Sandoval
>
> In elevator_switch(), we free the old scheduler's tags and then
> initialize the new scheduler. If initializing the new scheduler fails,
> we fall back to the old scheduler, but our tags have already been freed.
> There's no reason to free the sched_tags only to reallocate them, so
> let's just reuse them, which makes it possible to safely fall back to
> the old scheduler.

Do we need a failure case on both ends? We never free the driver tags, so it's always perfectly safe to fall back to not running with a scheduler. Since we were talking about making the scheduler depth configurable or dependent on the scheduler, reusing tags seems like something that would potentially be a short-lived "feature". So I think I'd prefer if we teardown/setup like we do now, and just have the failure case be falling back to "none".

-- 
Jens Axboe
Re: [PATCH] block: fix use after free in __blkdev_direct_IO
On 01/24/2017 06:50 AM, Christoph Hellwig wrote:
> We can't dereference the dio structure after submitting the last bio for
> this request, as I/O completion might have happened before the code is
> run. Introduce a local is_sync variable instead.

Applied.

-- 
Jens Axboe
Re: [PATCH 05/16] dm: always defer request allocation to the owner of the request_queue
On Tue, Jan 24, 2017 at 05:05:39AM -0500, Mike Snitzer wrote: > possible and is welcomed cleanup. The only concern I have is that using > get_request() for the old request_fn request_queue eliminates the > guaranteed availability of requests to allow for forward progress (on > path failure or for the purposes of swap over mpath, etc). This isn't a > concern for blk-mq because as you know we have a fixed set of tags (and > associated preallocated requests). > > So I'm left unconvinced old request_fn request-based DM multipath isn't > regressing in how request resubmission can be assured a request will be > available when needed on retry in the face of path failure. Mempool only need a size where we can make guaranteed requests, so for get_request based drivers under dm the theoretical minimum size would be one as we never rely on a second request to finish the first one, and each request_queue has it's own mempool(s) to start with. > dm_mod's 'reserved_rq_based_ios' module_param governs the minimum number > of requests in the md->rq_pool (and defaults to 256 requests per > request-based DM request_queue). Whereas blk_init_rl()'s > mempool_create_node() uses BLKDEV_MIN_RQ (4) yet q->nr_requests = > BLKDEV_MAX_RQ (128). Also, this patch eliminates the utility of > 'reserved_rq_based_ios' module_param without actually removing it. It's still used for the bio pool, so I couldn't simply remove it. > Anyway, should blk-core evolve to allow drivers to specify a custom > min_nr of requests in the old request_fn request_queue's mempool? Or is > my concern overblown? Thanks to mempool_resize we could add that functionality. I just don't see an actual need for it. > p.s. dm.c:dm_alloc_md_mempools() could be cleaned up a bit more since > only bio-based DM will have a pools->io_pool moving forward; but I can > circle back to that cleanup after. 
If you're fine with doing more than the necessary changes in a patch that needs to got into the block tree I'd be happy to apply my usual cleanup magic to it. -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
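Christoph's point about the theoretical minimum pool size can be made concrete with a toy reserve pool: as long as one element is held back, a single in-flight request can always make forward progress even when the general allocator fails, and freeing refills the reserve before returning memory to the system. This is a minimal model of the mempool guarantee, not the kernel mempool API.

```c
#include <assert.h>
#include <stdlib.h>

/* One reserved element is enough when no request ever depends on a
 * second allocation to complete the first. */
struct tiny_pool {
	void *reserve;  /* the guaranteed element */
};

static void *pool_alloc(struct tiny_pool *p, int allocator_ok)
{
	/* Try the general allocator first (allocator_ok models memory
	 * pressure: 0 means every malloc fails). */
	void *el = allocator_ok ? malloc(64) : NULL;

	if (!el && p->reserve) {
		el = p->reserve;   /* fall back to the reserve */
		p->reserve = NULL;
	}
	return el;
}

static void pool_free(struct tiny_pool *p, void *el)
{
	if (!p->reserve)
		p->reserve = el;   /* refill the reserve first */
	else
		free(el);
}
```

The same reasoning is why a `reserved_rq_based_ios` of 256 is far above what forward progress strictly requires for a get_request-based queue.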
[PATCH] block: fix use after free in __blkdev_direct_IO
We can't dereference the dio structure after submitting the last bio for this request, as I/O completion might have happened before the code is run. Introduce a local is_sync variable instead.

Fixes: 542ff7bf ("block: new direct I/O implementation")
Signed-off-by: Christoph Hellwig
Reported-by: Matias Bjørling
Tested-by: Matias Bjørling
---
 fs/block_dev.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 5db5d13..3c47614 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -331,7 +331,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 	struct blk_plug plug;
 	struct blkdev_dio *dio;
 	struct bio *bio;
-	bool is_read = (iov_iter_rw(iter) == READ);
+	bool is_read = (iov_iter_rw(iter) == READ), is_sync;
 	loff_t pos = iocb->ki_pos;
 	blk_qc_t qc = BLK_QC_T_NONE;
 	int ret;
@@ -344,7 +344,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 	bio_get(bio); /* extra ref for the completion handler */
 
 	dio = container_of(bio, struct blkdev_dio, bio);
-	dio->is_sync = is_sync_kiocb(iocb);
+	dio->is_sync = is_sync = is_sync_kiocb(iocb);
 	if (dio->is_sync)
 		dio->waiter = current;
 	else
@@ -398,7 +398,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 	}
 	blk_finish_plug(&plug);
 
-	if (!dio->is_sync)
+	if (!is_sync)
 		return -EIOCBQUEUED;
 
 	for (;;) {
-- 
2.1.4
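The shape of this fix is a general pattern: anything the submitter still needs after the handoff gets copied into a local first, because the completion side may free the shared structure at any point after submission. Here is a toy model of the bug and the fix (names invented; the real race involves asynchronous bio completion, which this models as an immediate free).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct toy_dio {
	bool is_sync;
};

/* Models submitting the last bio: completion can run at any point
 * afterwards, so treat it as freeing the dio immediately. */
static void submit_last_bio(struct toy_dio **diop)
{
	free(*diop);
	*diop = NULL;
}

static int do_direct_io(bool sync_kiocb)
{
	struct toy_dio *dio = malloc(sizeof(*dio));
	bool is_sync;

	/* The fix: snapshot the flag before the handoff. Reading
	 * dio->is_sync after submit_last_bio() would be use-after-free. */
	is_sync = dio->is_sync = sync_kiocb;

	submit_last_bio(&dio);

	/* Only locals may be used past this point. -1 stands in for
	 * the real -EIOCBQUEUED return. */
	return is_sync ? 0 : -1;
}
```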
Re: BUG: KASAN: Use-after-free
On 01/24/2017 02:34 PM, Christoph Hellwig wrote:
> On Tue, Jan 24, 2017 at 11:01:42AM +0100, Matias Bjørling wrote:
>> Yup. That fixes it. Should I put together the patch, or will you take
>> care of it?
>
> I'll send it out. Of course with proper reporting credits for you.

Awesome. Thanks
Re: BUG: KASAN: Use-after-free
On Tue, Jan 24, 2017 at 11:01:42AM +0100, Matias Bjørling wrote:
> Yup. That fixes it. Should I put together the patch, or will you take
> care of it?

I'll send it out. Of course with proper reporting credits for you.
Re: [PATCH 10/10] blk-mq: move hctx and ctx counters from sysfs to debugfs
On 01/23/2017 07:59 PM, Omar Sandoval wrote:
> From: Omar Sandoval
>
> These counters aren't as out-of-place in sysfs as the other stuff, but
> debugfs is a slightly better home for them.
>
> Signed-off-by: Omar Sandoval
> ---
>  block/blk-mq-debugfs.c | 181 +
>  block/blk-mq-sysfs.c   |  64 -
>  2 files changed, 181 insertions(+), 64 deletions(-)

Reviewed-by: Hannes Reinecke

Cheers,
Hannes
-- 
Dr. Hannes Reinecke, Teamlead Storage & Networking
h...@suse.de, +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
Re: [PATCH 09/10] blk-mq: move hctx io_poll, stats, and dispatched from sysfs to debugfs
On 01/23/2017 07:59 PM, Omar Sandoval wrote: > From: Omar Sandoval> > These statistics _might_ be useful to userspace, but it's better not to > commit to an ABI for these yet. Also, the dispatched file in sysfs > couldn't be cleared, so make it clearable like the others in debugfs. > > Signed-off-by: Omar Sandoval > --- > block/blk-mq-debugfs.c | 132 > + > block/blk-mq-sysfs.c | 92 -- > 2 files changed, 132 insertions(+), 92 deletions(-) > Reviewed-by: Hannes Reinecke Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 08/10] blk-mq: add tags and sched_tags bitmaps to debugfs
On 01/23/2017 07:59 PM, Omar Sandoval wrote: > From: Omar Sandoval> > These can be used to debug issues like tag leaks and stuck requests. > > Signed-off-by: Omar Sandoval > --- > block/blk-mq-debugfs.c | 50 > ++ > 1 file changed, 50 insertions(+) > Reviewed-by: Hannes Reinecke Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 07/10] blk-mq: move tags and sched_tags info from sysfs to debugfs
On 01/23/2017 07:59 PM, Omar Sandoval wrote: > From: Omar Sandoval> > These are very tied to the blk-mq tag implementation, so exposing them > to sysfs isn't a great idea. Move the debugging information to debugfs > and add basic entries for the number of tags and the number of reserved > tags to sysfs. > > Signed-off-by: Omar Sandoval > --- > block/blk-mq-debugfs.c | 70 > ++ > block/blk-mq-sysfs.c | 33 > block/blk-mq-tag.c | 27 --- > block/blk-mq-tag.h | 1 - > 4 files changed, 86 insertions(+), 45 deletions(-) > Reviewed-by: Hannes Reinecke Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 05/10] sbitmap: add helpers for dumping to a seq_file
On 01/23/2017 07:59 PM, Omar Sandoval wrote: > From: Omar Sandoval> > This is useful debugging information that will be used in the blk-mq > debugfs directory. > > Signed-off-by: Omar Sandoval > --- > include/linux/sbitmap.h | 34 > lib/sbitmap.c | 83 > + > 2 files changed, 117 insertions(+) > Reviewed-by: Hannes Reinecke Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 06/10] blk-mq: export software queue pending map to debugfs
On 01/23/2017 07:59 PM, Omar Sandoval wrote: > From: Omar Sandoval> > This is useful for debugging problems where we've gotten stuck with > requests in the software queues. > > Signed-off-by: Omar Sandoval > --- > block/blk-mq-debugfs.c | 15 +++ > 1 file changed, 15 insertions(+) > Reviewed-by: Hannes Reinecke Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 04/10] blk-mq: add extra request information to debugfs
On 01/23/2017 07:59 PM, Omar Sandoval wrote: > From: Omar Sandoval> > The request pointers by themselves aren't super useful. > > Signed-off-by: Omar Sandoval > --- > block/blk-mq-debugfs.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c > index 1967c06d80e0..ee947f2f605b 100644 > --- a/block/blk-mq-debugfs.c > +++ b/block/blk-mq-debugfs.c > @@ -87,7 +87,9 @@ static int blk_mq_debugfs_rq_show(struct seq_file *m, void > *v) > { > struct request *rq = list_entry_rq(v); > > - seq_printf(m, "%p\n", rq); > + seq_printf(m, "%p {.cmd_type=%u, .cmd_flags=0x%x, .rq_flags=0x%x, > .tag=%d, .internal_tag=%d}\n", > +rq, rq->cmd_type, rq->cmd_flags, (unsigned int)rq->rq_flags, > +rq->tag, rq->internal_tag); > return 0; > } > > Reviewed-by: Hannes Reinecke Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 03/10] blk-mq: move hctx->dispatch and ctx->rq_list from sysfs to debugfs
On 01/23/2017 07:59 PM, Omar Sandoval wrote: > From: Omar Sandoval> > These lists are only useful for debugging; they definitely don't belong > in sysfs. Putting them in debugfs also removes the limitation of a > single page of output. > > Signed-off-by: Omar Sandoval > --- > block/blk-mq-debugfs.c | 106 > + > block/blk-mq-sysfs.c | 57 -- > 2 files changed, 106 insertions(+), 57 deletions(-) > Reviewed-by: Hannes Reinecke Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 02/10] blk-mq: add hctx->{state,flags} to debugfs
On 01/23/2017 07:59 PM, Omar Sandoval wrote:
> From: Omar Sandoval
>
> hctx->state could come in handy for bugs where the hardware queue gets
> stuck in the stopped state, and hctx->flags is just useful to know.
>
> Signed-off-by: Omar Sandoval
> ---
>  block/blk-mq-debugfs.c | 42 ++
>  1 file changed, 42 insertions(+)

Hehe. I've found myself adding exactly the same attributes when debugging mq-deadline :-)

> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 01711bbf5ade..0c14511fa9e0 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -29,7 +29,49 @@ struct blk_mq_debugfs_attr {
>
>  static struct dentry *block_debugfs_root;
>
> +static int hctx_state_show(struct seq_file *m, void *v)
> +{
> +	struct blk_mq_hw_ctx *hctx = m->private;
> +
> +	seq_printf(m, "0x%lx\n", hctx->state);
> +	return 0;
> +}
> +

What about decoding it? Would make life so much easier, and one doesn't have to have a kernel source tree available when looking things up...

> +static int hctx_state_open(struct inode *inode, struct file *file)
> +{
> +	return single_open(file, hctx_state_show, inode->i_private);
> +}
> +
> +static const struct file_operations hctx_state_fops = {
> +	.open		= hctx_state_open,
> +	.read		= seq_read,
> +	.llseek		= seq_lseek,
> +	.release	= single_release,
> +};
> +
> +static int hctx_flags_show(struct seq_file *m, void *v)
> +{
> +	struct blk_mq_hw_ctx *hctx = m->private;
> +
> +	seq_printf(m, "0x%lx\n", hctx->flags);
> +	return 0;
> +}
> +

Same argument here.

Cheers,
Hannes
-- 
Dr. Hannes Reinecke, Teamlead Storage & Networking
h...@suse.de, +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
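The decoding Hannes asks for — printing flag names alongside the raw mask so no kernel tree is needed to read the debugfs file — could be sketched like this. The bit names and positions below are illustrative placeholders, not the real hctx state bits.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical names for the low state bits, index == bit position. */
static const char *const hctx_state_names[] = {
	"STOPPED", "TAG_ACTIVE", "SCHED_RESTART",
};

/* Render a state word as "NAME|NAME|..." into buf. */
static void decode_state(unsigned long state, char *buf, size_t len)
{
	buf[0] = '\0';
	for (size_t i = 0; i < 3; i++) {
		if (!(state & (1UL << i)))
			continue;
		if (buf[0])
			strncat(buf, "|", len - strlen(buf) - 1);
		strncat(buf, hctx_state_names[i], len - strlen(buf) - 1);
	}
}
```

In the kernel the same loop would feed `seq_printf()` instead of a stack buffer, next to the raw `0x%lx` value.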
Re: [PATCH 01/10] blk-mq: create debugfs directory tree
On 01/23/2017 07:59 PM, Omar Sandoval wrote: > From: Omar Sandoval> > In preparation for putting blk-mq debugging information in debugfs, > create a directory tree mirroring the one in sysfs: > > # tree -d /sys/kernel/debug/block > /sys/kernel/debug/block > |-- nvme0n1 > | `-- mq > | |-- 0 > | | `-- cpu0 > | |-- 1 > | | `-- cpu1 > | |-- 2 > | | `-- cpu2 > | `-- 3 > | `-- cpu3 > `-- vda > `-- mq > `-- 0 > |-- cpu0 > |-- cpu1 > |-- cpu2 > `-- cpu3 > > Also add the scaffolding for the actual files that will go in here, > either under the hardware queue or software queue directories. > > Signed-off-by: Omar Sandoval > --- > block/Makefile | 1 + > block/blk-mq-debugfs.c | 152 > + > block/blk-mq-sysfs.c | 8 +++ > block/blk-mq.c | 2 + > block/blk-mq.h | 33 +++ > include/linux/blkdev.h | 5 ++ > 6 files changed, 201 insertions(+) > create mode 100644 block/blk-mq-debugfs.c > Reviewed-by: Hannes Reinecke Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 16/16] block: don't assign cmd_flags in __blk_rq_prep_clone
On 01/23/2017 04:29 PM, Christoph Hellwig wrote:
> These days we have the proper flags set since request allocation time.
>
> Signed-off-by: Christoph Hellwig
> ---
>  block/blk-core.c | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index fe5cc98..93a9e0b 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2983,7 +2983,6 @@ EXPORT_SYMBOL_GPL(blk_rq_unprep_clone);
>  static void __blk_rq_prep_clone(struct request *dst, struct request *src)
>  {
>  	dst->cpu = src->cpu;
> -	dst->cmd_flags = src->cmd_flags | REQ_NOMERGE;
>  	dst->cmd_type = src->cmd_type;
>  	dst->__sector = blk_rq_pos(src);
>  	dst->__data_len = blk_rq_bytes(src);

Reviewed-by: Hannes Reinecke

Cheers,
Hannes
-- 
Dr. Hannes Reinecke, Teamlead Storage & Networking
h...@suse.de, +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
Re: [PATCH 15/16] block: split scsi_request out of struct request
On 01/23/2017 04:29 PM, Christoph Hellwig wrote: > And require all drivers that want to support BLOCK_PC to allocate it > as the first thing of their private data. To support this the legacy > IDE and BSG code is switched to set cmd_size on their queues to let > the block layer allocate the additional space. > > Signed-off-by: Christoph Hellwig> --- Reviewed-by: Hannes Reinecke Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 13/16] scsi: allocate scsi_cmnd structures as part of struct request
On 01/23/2017 04:29 PM, Christoph Hellwig wrote: > Rely on the new block layer functionality to allocate additional driver > specific data behind struct request instead of implementing it in SCSI > itѕelf. > > Signed-off-by: Christoph Hellwig> --- > drivers/scsi/hosts.c | 20 +-- > drivers/scsi/scsi.c | 319 > -- > drivers/scsi/scsi_error.c | 17 ++- > drivers/scsi/scsi_lib.c | 122 -- > drivers/scsi/scsi_priv.h | 8 +- > include/scsi/scsi_host.h | 3 - > 6 files changed, 95 insertions(+), 394 deletions(-) > Reviewed-by: Hannes Reinecke Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 09/16] scsi: remove gfp_flags member in scsi_host_cmd_pool
On Mon, Jan 23, 2017 at 04:29:14PM +0100, Christoph Hellwig wrote:
> When using the slab allocator we already decide at cache creation time if
> an allocation comes from a GFP_DMA pool using the SLAB_CACHE_DMA flag,
> and there is no point passing the kmalloc-family only GFP_DMA flag to
> kmem_cache_alloc. Drop all the infrastructure for doing so.
>
> Signed-off-by: Christoph Hellwig
> ---

Reviewed-by: Johannes Thumshirn

-- 
Johannes Thumshirn, Storage
jthumsh...@suse.de, +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
[PATCH] lightnvm: use end_io callback instead of instance
When the lightnvm core had the "gennvm" layer between the device and the target, there was a need for the core to be able to figure out which target it should send an end_io callback to. Leading to a "double" end_io, first for the media manager instance, and then for the target instance. Now that core and gennvm is merged, there is no longer a need for this, and a single end_io callback will do. Signed-off-by: Matias Bjørling--- drivers/lightnvm/core.c | 5 ++--- drivers/lightnvm/rrpc.c | 11 +-- drivers/lightnvm/rrpc.h | 3 --- include/linux/lightnvm.h | 11 +++ 4 files changed, 10 insertions(+), 20 deletions(-) diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c index 80cd767..4abd334 100644 --- a/drivers/lightnvm/core.c +++ b/drivers/lightnvm/core.c @@ -776,14 +776,13 @@ EXPORT_SYMBOL(nvm_free_rqd_ppalist); void nvm_end_io(struct nvm_rq *rqd, int error) { struct nvm_tgt_dev *tgt_dev = rqd->dev; - struct nvm_tgt_instance *ins = rqd->ins; /* Convert address space */ if (tgt_dev) nvm_rq_dev_to_tgt(tgt_dev, rqd); - rqd->error = error; - ins->tt->end_io(rqd); + if (rqd->end_io) + rqd->end_io(rqd, error); } EXPORT_SYMBOL(nvm_end_io); diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c index 9fb7de3..c399f55 100644 --- a/drivers/lightnvm/rrpc.c +++ b/drivers/lightnvm/rrpc.c @@ -777,16 +777,16 @@ static void rrpc_end_io_write(struct rrpc *rrpc, struct rrpc_rq *rrqd, } } -static void rrpc_end_io(struct nvm_rq *rqd) +static void rrpc_end_io(struct nvm_rq *rqd, int error) { - struct rrpc *rrpc = container_of(rqd->ins, struct rrpc, instance); + struct rrpc *rrpc = rqd->private; struct nvm_tgt_dev *dev = rrpc->dev; struct rrpc_rq *rrqd = nvm_rq_to_pdu(rqd); uint8_t npages = rqd->nr_ppas; sector_t laddr = rrpc_get_laddr(rqd->bio) - npages; if (bio_data_dir(rqd->bio) == WRITE) { - if (rqd->error == NVM_RSP_ERR_FAILWRITE) + if (error == NVM_RSP_ERR_FAILWRITE) rrpc_mark_bad_block(rrpc, rqd); rrpc_end_io_write(rrpc, rrqd, laddr, npages); @@ -972,8 +972,9 @@ 
static int rrpc_submit_io(struct rrpc *rrpc, struct bio *bio, bio_get(bio); rqd->bio = bio; - rqd->ins = >instance; + rqd->private = rrpc; rqd->nr_ppas = nr_pages; + rqd->end_io = rrpc_end_io; rrq->flags = flags; err = nvm_submit_io(dev, rqd); @@ -1532,7 +1533,6 @@ static void *rrpc_init(struct nvm_tgt_dev *dev, struct gendisk *tdisk) if (!rrpc) return ERR_PTR(-ENOMEM); - rrpc->instance.tt = _rrpc; rrpc->dev = dev; rrpc->disk = tdisk; @@ -1611,7 +1611,6 @@ static struct nvm_tgt_type tt_rrpc = { .make_rq= rrpc_make_rq, .capacity = rrpc_capacity, - .end_io = rrpc_end_io, .init = rrpc_init, .exit = rrpc_exit, diff --git a/drivers/lightnvm/rrpc.h b/drivers/lightnvm/rrpc.h index 94e4d73..fdb6ff9 100644 --- a/drivers/lightnvm/rrpc.h +++ b/drivers/lightnvm/rrpc.h @@ -102,9 +102,6 @@ struct rrpc_lun { }; struct rrpc { - /* instance must be kept in top to resolve rrpc in unprep */ - struct nvm_tgt_instance instance; - struct nvm_tgt_dev *dev; struct gendisk *disk; diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h index ce0b2da..f6e2376 100644 --- a/include/linux/lightnvm.h +++ b/include/linux/lightnvm.h @@ -213,10 +213,6 @@ struct nvm_target { struct gendisk *disk; }; -struct nvm_tgt_instance { - struct nvm_tgt_type *tt; -}; - #define ADDR_EMPTY (~0ULL) #define NVM_VERSION_MAJOR 1 @@ -224,10 +220,9 @@ struct nvm_tgt_instance { #define NVM_VERSION_PATCH 0 struct nvm_rq; -typedef void (nvm_end_io_fn)(struct nvm_rq *); +typedef void (nvm_end_io_fn)(struct nvm_rq *, int); struct nvm_rq { - struct nvm_tgt_instance *ins; struct nvm_tgt_dev *dev; struct bio *bio; @@ -250,7 +245,8 @@ struct nvm_rq { uint16_t flags; u64 ppa_status; /* ppa media status */ - int error; + + void *private; }; static inline struct nvm_rq *nvm_rq_from_pdu(void *pdu) @@ -450,7 +446,6 @@ struct nvm_tgt_type { /* target entry points */ nvm_tgt_make_rq_fn *make_rq; nvm_tgt_capacity_fn *capacity; - nvm_end_io_fn *end_io; /* module-specific init/teardown */ nvm_tgt_init_fn *init; -- 2.9.3 -- To 
unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
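The callback shape the lightnvm patch above moves to — a single end_io function pointer plus an opaque private pointer on the request, instead of resolving the target through an embedded instance struct — is a common C pattern. A minimal sketch with invented names:

```c
#include <assert.h>
#include <stddef.h>

struct toy_rq;
typedef void (toy_end_io_fn)(struct toy_rq *, int);

/* The request carries its completion hook and whatever context the
 * target wants handed back, mirroring rqd->end_io / rqd->private. */
struct toy_rq {
	toy_end_io_fn *end_io;
	void *private_data;
};

/* Models nvm_end_io() after the patch: one direct callback, no
 * double dispatch through a media-manager instance. */
static void toy_nvm_end_io(struct toy_rq *rqd, int error)
{
	if (rqd->end_io)
		rqd->end_io(rqd, error);
}

/* A target's completion handler recovers its context from the
 * private pointer, as rrpc_end_io() now does with rqd->private. */
static void target_end_io(struct toy_rq *rqd, int error)
{
	int *seen = rqd->private_data;
	*seen = error;
}
```

Passing `error` as an argument also lets the core drop the `rqd->error` field, as the patch does.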
Re: [PATCH 12/16] scsi: remove __scsi_alloc_queue
On Mon, Jan 23, 2017 at 04:29:17PM +0100, Christoph Hellwig wrote:
> Instead do an internal export of __scsi_init_queue for the transport
> classes that export BSG nodes.
>
> Signed-off-by: Christoph Hellwig
> ---

Looks good,
Reviewed-by: Johannes Thumshirn

-- 
Johannes Thumshirn, Storage
jthumsh...@suse.de, +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [PATCH 12/16] scsi: remove __scsi_alloc_queue
On 01/23/2017 04:29 PM, Christoph Hellwig wrote: > Instead do an internal export of __scsi_init_queue for the transport > classes that export BSG nodes. > > Signed-off-by: Christoph Hellwig> --- > drivers/scsi/scsi_lib.c | 19 --- > drivers/scsi/scsi_transport_fc.c| 6 -- > drivers/scsi/scsi_transport_iscsi.c | 3 ++- > include/scsi/scsi_host.h| 2 -- > include/scsi/scsi_transport.h | 2 ++ > 5 files changed, 12 insertions(+), 20 deletions(-) > Reviewed-by: Hannes Reinecke Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 11/16] scsi: remove scsi_cmd_dma_pool
On 01/23/2017 04:29 PM, Christoph Hellwig wrote: > There is no need for GFP_DMA allocations of the scsi_cmnd structures > themselves, all that might be DMAed to or from is the actual payload, > or the sense buffers. > > Signed-off-by: Christoph Hellwig> --- > drivers/scsi/scsi.c | 15 +-- > 1 file changed, 1 insertion(+), 14 deletions(-) > Reviewed-by: Hannes Reinecke Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 10/16] scsi: respect unchecked_isa_dma for blk-mq
On 01/23/2017 04:29 PM, Christoph Hellwig wrote: > Currently blk-mq always allocates the sense buffer using normal GFP_KERNEL > allocation. Refactor the cmd pool code to split the cmd and sense allocation > and share the code to allocate the sense buffers as well as the sense buffer > slab caches between the legacy and blk-mq path. > > Note that this switches to lazy allocation of the sense slab caches - the > slab caches (not the actual allocations) won't be destroy until the scsi > module is unloaded instead of keeping track of hosts using them. > > Signed-off-by: Christoph Hellwig> --- > drivers/scsi/hosts.c | 4 > drivers/scsi/scsi.c | 24 --- > drivers/scsi/scsi_lib.c | 62 > +--- > drivers/scsi/scsi_priv.h | 5 > 4 files changed, 73 insertions(+), 22 deletions(-) > Reviewed-by: Hannes Reinecke Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 6/6] mmc: block: stop passing around pointless return values
The mmc_blk_issue_rq() function is called in exactly one place in queue.c and there the return value is ignored. So the functions called from that function that also meticulously return 0/1 do so for no good reason. Error reporting on the asynchronous requests are done upward to the block layer when the requests are eventually completed or fail, which may happen during the flow of the mmc_blk_issue_* functions directly (for "special commands") or later, when an asynchronous read/write request is completed. The issuing functions do not give rise to errors on their own, and there is nothing to return back to the caller in queue.c. Drop all return values and make the function return void. Signed-off-by: Linus Walleij--- drivers/mmc/core/block.c | 38 ++ drivers/mmc/core/block.h | 2 +- 2 files changed, 15 insertions(+), 25 deletions(-) diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c index f3e0c778cdbd..ede759dda395 100644 --- a/drivers/mmc/core/block.c +++ b/drivers/mmc/core/block.c @@ -1149,7 +1149,7 @@ int mmc_access_rpmb(struct mmc_queue *mq) return false; } -static int mmc_blk_issue_discard_rq(struct mmc_queue *mq, struct request *req) +static void mmc_blk_issue_discard_rq(struct mmc_queue *mq, struct request *req) { struct mmc_blk_data *md = mq->blkdata; struct mmc_card *card = md->queue.card; @@ -1187,11 +1187,9 @@ static int mmc_blk_issue_discard_rq(struct mmc_queue *mq, struct request *req) mmc_blk_reset_success(md, type); fail: blk_end_request(req, err, blk_rq_bytes(req)); - - return err ? 0 : 1; } -static int mmc_blk_issue_secdiscard_rq(struct mmc_queue *mq, +static void mmc_blk_issue_secdiscard_rq(struct mmc_queue *mq, struct request *req) { struct mmc_blk_data *md = mq->blkdata; @@ -1254,11 +1252,9 @@ static int mmc_blk_issue_secdiscard_rq(struct mmc_queue *mq, mmc_blk_reset_success(md, type); out: blk_end_request(req, err, blk_rq_bytes(req)); - - return err ? 
0 : 1; } -static int mmc_blk_issue_flush(struct mmc_queue *mq, struct request *req) +static void mmc_blk_issue_flush(struct mmc_queue *mq, struct request *req) { struct mmc_blk_data *md = mq->blkdata; struct mmc_card *card = md->queue.card; @@ -1269,8 +1265,6 @@ static int mmc_blk_issue_flush(struct mmc_queue *mq, struct request *req) ret = -EIO; blk_end_request_all(req, ret); - - return ret ? 0 : 1; } /* @@ -1622,7 +1616,7 @@ static void mmc_blk_rw_start_new(struct mmc_queue *mq, struct mmc_card *card, } } -static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *rqc) +static void mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *rqc) { struct mmc_blk_data *md = mq->blkdata; struct mmc_card *card = md->queue.card; @@ -1635,7 +1629,7 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *rqc) struct mmc_async_req *old_areq; if (!rqc && !mq->mqrq_prev->req) - return 0; + return; do { if (rqc) { @@ -1648,7 +1642,7 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *rqc) pr_err("%s: Transfer size is not 4KB sector size aligned\n", rqc->rq_disk->disk_name); mmc_blk_rw_cmd_abort(card, rqc); - return 0; + return; } mmc_blk_rw_rq_prep(mq->mqrq_cur, card, 0, mq); @@ -1665,7 +1659,7 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *rqc) */ if (status == MMC_BLK_NEW_REQUEST) mq->flags |= MMC_QUEUE_NEW_REQUEST; - return 0; + return; } /* @@ -1699,7 +1693,7 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *rqc) __func__, blk_rq_bytes(req), brq->data.bytes_xfered); mmc_blk_rw_cmd_abort(card, req); - return 0; + return; } break; case MMC_BLK_CMD_ERR: @@ -1767,18 +1761,16 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *rqc) } } while (ret); - return 1; + return; cmd_abort: mmc_blk_rw_cmd_abort(card, req); start_new_req: mmc_blk_rw_start_new(mq, card, rqc); - - return 0; } -int mmc_blk_issue_rq(struct mmc_queue *mq, struct request *req) +void 
mmc_blk_issue_rq(struct mmc_queue *mq,
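The refactor above boils down to a simple rule: when every caller ignores a function's return value and the real outcome is already reported through the request itself (via blk_end_request() on completion), the int return is dead weight. A minimal standalone sketch of that pattern, using hypothetical names rather than the actual MMC code:

```c
/* Toy model of the pattern in this patch (hypothetical names, not the
 * real MMC code): handlers used to return 0/1, but the only caller
 * ignored the value, so they return void and record the outcome on the
 * request instead - just as blk_end_request() carries the error. */
struct fake_req {
    int bytes;
    int status;   /* completion status, set by the handler */
};

static void issue_discard(struct fake_req *req)
{
    /* the outcome lives on the request, not in a return value */
    req->status = (req->bytes > 0) ? 0 : -1;
}

static void issue_flush(struct fake_req *req)
{
    req->status = 0;
}

/* The single dispatching caller: it never looked at the old int
 * returns, so nothing is lost by making everything void. */
static void issue_rq(struct fake_req *req, int is_flush)
{
    if (is_flush)
        issue_flush(req);
    else
        issue_discard(req);
}
```

The design point is that error propagation to the block layer happens at completion time, so the synchronous return value had no consumer.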
[PATCH 4/6] mmc: block: inline command abortions
Setting rqc to NULL followed by a goto to cmd_abort is just a way to do unconditional abort without starting any new command. Inline the calls to mmc_blk_rw_cmd_abort() and return immediately in those cases. As a result, mmc_blk_rw_start_new() is not called with NULL requests, and we can remove the NULL check in the beginning of this function. Add some comments to the code flow so it is clear that this is where the asynchronous requests come back in and the result of them gets handled. Signed-off-by: Linus Walleij--- drivers/mmc/core/block.c | 21 + 1 file changed, 13 insertions(+), 8 deletions(-) diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c index 13e6fe060f26..4bbb3d16c09b 100644 --- a/drivers/mmc/core/block.c +++ b/drivers/mmc/core/block.c @@ -1612,9 +1612,6 @@ static void mmc_blk_rw_cmd_abort(struct mmc_card *card, struct request *req) static void mmc_blk_rw_start_new(struct mmc_queue *mq, struct mmc_card *card, struct request *req) { - if (!req) - return; - if (mmc_card_removed(card)) { req->rq_flags |= RQF_QUIET; blk_end_request_all(req, -EIO); @@ -1649,9 +1646,8 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *rqc) !IS_ALIGNED(blk_rq_sectors(rqc), 8)) { pr_err("%s: Transfer size is not 4KB sector size aligned\n", rqc->rq_disk->disk_name); - req = rqc; - rqc = NULL; - goto cmd_abort; + mmc_blk_rw_cmd_abort(card, rqc); + return 0; } mmc_blk_rw_rq_prep(mq->mqrq_cur, card, 0, mq); @@ -1660,11 +1656,20 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *rqc) areq = NULL; areq = mmc_start_req(card->host, areq, ); if (!areq) { + /* +* We have just put the first request into the pipeline +* and there is nothing more to do until it is +* complete. +*/ if (status == MMC_BLK_NEW_REQUEST) mq->flags |= MMC_QUEUE_NEW_REQUEST; return 0; } + /* +* An asynchronous request has been completed and we proceed +* to handle the result of it. 
+*/ mq_rq = container_of(areq, struct mmc_queue_req, mmc_active); brq = _rq->brq; req = mq_rq->req; @@ -1691,8 +1696,8 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *rqc) pr_err("%s BUG rq_tot %d d_xfer %d\n", __func__, blk_rq_bytes(req), brq->data.bytes_xfered); - rqc = NULL; - goto cmd_abort; + mmc_blk_rw_cmd_abort(card, req); + return 0; } break; case MMC_BLK_CMD_ERR: -- 2.9.3 -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
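The control-flow change in this patch - replacing "set rqc to NULL, then goto cmd_abort" with a direct abort call and an immediate return - can be sketched in isolation. Names here are illustrative only; the real code operates on struct request:

```c
/* Sketch of inlined abortion: instead of NULLing a pointer and jumping
 * to a shared error label, the error path aborts and returns on the
 * spot. Illustrative names, not the actual MMC functions. */
static int last_aborted;

static void cmd_abort(int req)
{
    last_aborted = req;       /* stands in for mmc_blk_rw_cmd_abort() */
}

static int issue(int req, int sectors)
{
    if (sectors % 8 != 0) {   /* misaligned transfer */
        cmd_abort(req);
        return 0;             /* immediate return - no goto, no rqc = NULL */
    }
    return 1;                 /* request started successfully */
}
```

With the abort inlined at each error site, the shared label no longer has to handle the "no new command" case, which is what lets the NULL check in mmc_blk_rw_start_new() go away.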
[PATCH 3/6] mmc: block: do not assign mq_rq when aborting command
The code in mmc_blk_issue_rw_rq() aborts a command if the request is not
properly aligned on large sectors. As part of the path jumping out, it
assigns the local variable mq_rq, reflecting an MMC queue request, to the
current MMC queue request, which is confusing since the variable is not
used after this jump.

Signed-off-by: Linus Walleij
---
 drivers/mmc/core/block.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index b60d1fb3a07a..13e6fe060f26 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -1649,7 +1649,6 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *rqc)
 	    !IS_ALIGNED(blk_rq_sectors(rqc), 8)) {
 		pr_err("%s: Transfer size is not 4KB sector size aligned\n",
 		       rqc->rq_disk->disk_name);
-		mq_rq = mq->mqrq_cur;
 		req = rqc;
 		rqc = NULL;
 		goto cmd_abort;
--
2.9.3
[PATCH 2/6] mmc: block: break out mmc_blk_rw_start_new()
As a step toward breaking apart the very complex function mmc_blk_issue_rw_rq() we break out the code to start a new request. Signed-off-by: Linus Walleij--- drivers/mmc/core/block.c | 27 +-- 1 file changed, 17 insertions(+), 10 deletions(-) diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c index 14efe92a14ef..b60d1fb3a07a 100644 --- a/drivers/mmc/core/block.c +++ b/drivers/mmc/core/block.c @@ -1609,6 +1609,22 @@ static void mmc_blk_rw_cmd_abort(struct mmc_card *card, struct request *req) blk_rq_cur_bytes(req)); } +static void mmc_blk_rw_start_new(struct mmc_queue *mq, struct mmc_card *card, +struct request *req) +{ + if (!req) + return; + + if (mmc_card_removed(card)) { + req->rq_flags |= RQF_QUIET; + blk_end_request_all(req, -EIO); + } else { + mmc_blk_rw_rq_prep(mq->mqrq_cur, card, 0, mq); + mmc_start_req(card->host, + >mqrq_cur->mmc_active, NULL); + } +} + static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *rqc) { struct mmc_blk_data *md = mq->blkdata; @@ -1751,16 +1767,7 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *rqc) mmc_blk_rw_cmd_abort(card, req); start_new_req: - if (rqc) { - if (mmc_card_removed(card)) { - rqc->rq_flags |= RQF_QUIET; - blk_end_request_all(rqc, -EIO); - } else { - mmc_blk_rw_rq_prep(mq->mqrq_cur, card, 0, mq); - mmc_start_req(card->host, - >mqrq_cur->mmc_active, NULL); - } - } + mmc_blk_rw_start_new(mq, card, rqc); return 0; } -- 2.9.3 -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
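The shape of the extraction - pulling the self-contained "card removed vs. start the next request" branch out into a named helper - can be shown with a standalone toy (illustrative names only, not the real mmc_blk_rw_start_new()):

```c
/* Toy version of the extracted helper: the "fail everything if the card
 * is gone, otherwise prep and start the next request" branch gets a
 * name of its own, so the caller shrinks to a single call. */
static int failed_eio, started;

static void rw_start_new(int card_removed, int req)
{
    if (!req)
        return;            /* no new request queued, nothing to start */
    if (card_removed)
        failed_eio = req;  /* like blk_end_request_all(req, -EIO) */
    else
        started = req;     /* like mmc_blk_rw_rq_prep() + mmc_start_req() */
}
```

Note how the `if (rqc)` guard from the caller becomes the `if (!req) return;` at the top of the helper, preserving behavior exactly; patch 4/6 later removes that guard once no caller passes NULL.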
[PATCH 0/6] mmc: block: command issue cleanups
The function mmc_blk_issue_rw_rq() is hopelessly convoluted and needs to be
refactored so it can be understood by humans. In the process I found some
weird magic return values passed around for no good reason. Things are more
readable after this.

This work is done towards the goal of breaking the function into two parts:
one just submitting the requests, and one checking the result and possibly
resubmitting the command on error, so we can make the usual (non-error)
path smooth and quick, and have it called directly when the driver
completes a request. That in turn is a prerequisite for proper blk-mq
integration with the MMC/SD stack. All that comes later.

Linus Walleij (6):
  mmc: block: break out mmc_blk_rw_cmd_abort()
  mmc: block: break out mmc_blk_rw_start_new()
  mmc: block: do not assign mq_rq when aborting command
  mmc: block: inline command abortions
  mmc: block: introduce new_areq and old_areq
  mmc: block: stop passing around pointless return values

 drivers/mmc/core/block.c | 108 ++-
 drivers/mmc/core/block.h |   2 +-
 2 files changed, 60 insertions(+), 50 deletions(-)

--
2.9.3
[PATCH 1/6] mmc: block: break out mmc_blk_rw_cmd_abort()
As a first step toward breaking apart the very complex function
mmc_blk_issue_rw_rq(), we break out the command abort code. This code
assumes "ret" is != 0 and then repeatedly calls blk_end_request() until
the whole request has been ended toward the block layer.

Signed-off-by: Linus Walleij
---
 drivers/mmc/core/block.c | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 7bd03381810d..14efe92a14ef 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -1598,6 +1598,17 @@ static int mmc_blk_cmd_err(struct mmc_blk_data *md, struct mmc_card *card,
 	return ret;
 }

+static void mmc_blk_rw_cmd_abort(struct mmc_card *card, struct request *req)
+{
+	int ret = 1;
+
+	if (mmc_card_removed(card))
+		req->rq_flags |= RQF_QUIET;
+	while (ret)
+		ret = blk_end_request(req, -EIO,
+				      blk_rq_cur_bytes(req));
+}
+
 static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *rqc)
 {
 	struct mmc_blk_data *md = mq->blkdata;
@@ -1737,11 +1748,7 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *rqc)
 	return 1;

 cmd_abort:
-	if (mmc_card_removed(card))
-		req->rq_flags |= RQF_QUIET;
-	while (ret)
-		ret = blk_end_request(req, -EIO,
-				      blk_rq_cur_bytes(req));
+	mmc_blk_rw_cmd_abort(card, req);

 start_new_req:
 	if (rqc) {
--
2.9.3
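The loop inside the new helper relies on blk_end_request() completing the request piecewise: it returns nonzero while bytes remain pending, so calling it in a loop eventually ends the whole request. A self-contained model of that contract (simplified; the real blk_end_request() takes a struct request and a byte count):

```c
/* Minimal model of the abort loop: end_request() completes one chunk,
 * returning nonzero while the request still has bytes pending, so the
 * abort helper loops until everything has been failed back with -EIO. */
struct toy_req {
    int remaining;   /* bytes not yet completed */
    int cur_bytes;   /* chunk size completed per call */
};

static int end_request(struct toy_req *r, int err)
{
    (void)err;       /* would be -EIO on the abort path */
    r->remaining -= r->cur_bytes;
    if (r->remaining < 0)
        r->remaining = 0;
    return r->remaining > 0;   /* "request still pending" */
}

static void rw_cmd_abort(struct toy_req *r)
{
    int ret = 1;

    while (ret)
        ret = end_request(r, -5 /* -EIO */);
}
```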
Re: [PATCH 05/16] dm: always defer request allocation to the owner of the request_queue
On Mon, Jan 23 2017 at 10:29am -0500, Christoph Hellwigwrote: > DM already calls blk_mq_alloc_request on the request_queue of the > underlying device if it is a blk-mq device. But now that we allow drivers > to allocate additional data and initialize it ahead of time we need to do > the same for all drivers. Doing so and using the new cmd_size > infrastructure in the block layer greatly simplifies the dm-rq and mpath > code, and should also make arbitrary combinations of SQ and MQ devices > with SQ or MQ device mapper tables easily possible as a further step. Thanks for working (and suffering) through all of this request-based DM code. Nice to have someone else be painfully aware of the complexity in request-based DM's old request_fn support. The queue->cmd_size (per request data) definitely makes this more possible and is welcomed cleanup. The only concern I have is that using get_request() for the old request_fn request_queue eliminates the guaranteed availability of requests to allow for forward progress (on path failure or for the purposes of swap over mpath, etc). This isn't a concern for blk-mq because as you know we have a fixed set of tags (and associated preallocated requests). So I'm left unconvinced old request_fn request-based DM multipath isn't regressing in how request resubmission can be assured a request will be available when needed on retry in the face of path failure. dm_mod's 'reserved_rq_based_ios' module_param governs the minimum number of requests in the md->rq_pool (and defaults to 256 requests per request-based DM request_queue). Whereas blk_init_rl()'s mempool_create_node() uses BLKDEV_MIN_RQ (4) yet q->nr_requests = BLKDEV_MAX_RQ (128). Also, this patch eliminates the utility of 'reserved_rq_based_ios' module_param without actually removing it. Anyway, should blk-core evolve to allow drivers to specify a custom min_nr of requests in the old request_fn request_queue's mempool? Or is my concern overblown? 
Seems we're very close to making this request-based DM cleanup doable. Just
would like some extra eyes and care/thought/guidance from yourself and
likely Jens.

Thanks,
Mike

p.s. dm.c:dm_alloc_md_mempools() could be cleaned up a bit more since only
bio-based DM will have a pools->io_pool moving forward; but I can circle
back to that cleanup after.
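The forward-progress concern above is about mempool semantics: a pool created with a minimum number of preallocated elements can still hand out an object when the normal allocator fails under memory pressure, which is what guarantees retries (e.g. after a path failure, or swap over mpath) can make progress. A toy reserve pool makes the property concrete (heavily simplified; the kernel's mempool_alloc can additionally sleep and retry):

```c
#include <stdlib.h>

/* Toy fixed-reserve pool illustrating why a guaranteed minimum matters:
 * even when malloc() fails, MIN_NR preallocated elements keep requests
 * flowing - the property the old md->rq_pool gave request-based DM. */
#define MIN_NR 4
static void *reserve[MIN_NR];
static int nfree;

static void pool_init(void)
{
    for (nfree = 0; nfree < MIN_NR; nfree++)
        reserve[nfree] = malloc(64);
}

static void *pool_alloc(int malloc_fails)
{
    if (!malloc_fails) {
        void *p = malloc(64);
        if (p)
            return p;
    }
    /* under pressure, fall back to the guaranteed reserve */
    return nfree ? reserve[--nfree] : NULL;
}

static void pool_free(void *p)
{
    if (nfree < MIN_NR)
        reserve[nfree++] = p;   /* refill the reserve first */
    else
        free(p);
}
```

Frees refill the reserve before returning memory to the system, so as long as in-flight requests complete, a waiting allocator is eventually satisfied - the deadlock-avoidance argument behind mempools and blk-mq's fixed tag sets alike.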
Re: BUG: KASAN: Use-after-free
On 01/24/2017 10:52 AM, Christoph Hellwig wrote: > On Tue, Jan 24, 2017 at 10:32:11AM +0100, Matias Bjørling wrote: >> *(gdb) list *blkdev_direct_IO+0x50c >> 0x8142ab8c is in blkdev_direct_IO (fs/block_dev.c:401). >> 396 submit_bio(bio); >> 397 bio = bio_alloc(GFP_KERNEL, nr_pages); >> 398 } >> 399 blk_finish_plug(); >> 400 >> 401 if (!dio->is_sync) >> 402 return -EIOCBQUEUED; >> >> It looks like the qc = submit_bio() completes the I/O before the >> blkdev_direct_IO completes, which then leads to the use after free case, >> when the private dio struct is accessed. > > Ok, the is_sync check here is clearly a bug, we need a local variable > for is_sync as well for the aio case: > > diff --git a/fs/block_dev.c b/fs/block_dev.c > index 5db5d13..3c47614 100644 > --- a/fs/block_dev.c > +++ b/fs/block_dev.c > @@ -331,7 +331,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter > *iter, int nr_pages) > struct blk_plug plug; > struct blkdev_dio *dio; > struct bio *bio; > - bool is_read = (iov_iter_rw(iter) == READ); > + bool is_read = (iov_iter_rw(iter) == READ), is_sync; > loff_t pos = iocb->ki_pos; > blk_qc_t qc = BLK_QC_T_NONE; > int ret; > @@ -344,7 +344,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter > *iter, int nr_pages) > bio_get(bio); /* extra ref for the completion handler */ > > dio = container_of(bio, struct blkdev_dio, bio); > - dio->is_sync = is_sync_kiocb(iocb); > + dio->is_sync = is_sync = is_sync_kiocb(iocb); > if (dio->is_sync) > dio->waiter = current; > else > @@ -398,7 +398,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter > *iter, int nr_pages) > } > blk_finish_plug(); > > - if (!dio->is_sync) > + if (!is_sync) > return -EIOCBQUEUED; > > for (;;) { > Yup. That fixes it. Should I put together the patch, or will you take care of it? -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BUG: KASAN: Use-after-free
On Tue, Jan 24, 2017 at 10:32:11AM +0100, Matias Bjørling wrote: > *(gdb) list *blkdev_direct_IO+0x50c > 0x8142ab8c is in blkdev_direct_IO (fs/block_dev.c:401). > 396 submit_bio(bio); > 397 bio = bio_alloc(GFP_KERNEL, nr_pages); > 398 } > 399 blk_finish_plug(); > 400 > 401 if (!dio->is_sync) > 402 return -EIOCBQUEUED; > > It looks like the qc = submit_bio() completes the I/O before the > blkdev_direct_IO completes, which then leads to the use after free case, > when the private dio struct is accessed. Ok, the is_sync check here is clearly a bug, we need a local variable for is_sync as well for the aio case: diff --git a/fs/block_dev.c b/fs/block_dev.c index 5db5d13..3c47614 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -331,7 +331,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages) struct blk_plug plug; struct blkdev_dio *dio; struct bio *bio; - bool is_read = (iov_iter_rw(iter) == READ); + bool is_read = (iov_iter_rw(iter) == READ), is_sync; loff_t pos = iocb->ki_pos; blk_qc_t qc = BLK_QC_T_NONE; int ret; @@ -344,7 +344,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages) bio_get(bio); /* extra ref for the completion handler */ dio = container_of(bio, struct blkdev_dio, bio); - dio->is_sync = is_sync_kiocb(iocb); + dio->is_sync = is_sync = is_sync_kiocb(iocb); if (dio->is_sync) dio->waiter = current; else @@ -398,7 +398,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages) } blk_finish_plug(); - if (!dio->is_sync) + if (!is_sync) return -EIOCBQUEUED; for (;;) { -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BUG: KASAN: Use-after-free
On 01/23/2017 06:20 PM, Christoph Hellwig wrote: > On Mon, Jan 23, 2017 at 05:07:52PM +0100, Matias Bjørling wrote: >> Hi, >> >> I could use some help verifying an use-after-free bug that I am seeing >> after the new direct I/O work went in. >> >> When issuing a direct write io using libaio, a bio is referenced in the >> blkdev_direct_IO path, and then put again in the blkdev_bio_end_io path. >> However, there is a case where the bio is put twice, which leads to >> modules that rely on the bio after biodev_bio_end_io() to access it >> prematurely. > > Can you reproduce anything like this with a normal block device? Looks like bcache has something similar with a get/put in it. I'll look a bit more. > >> >> The KASAN error report: >> >> [ 14.645916] >> == >> [ 14.648027] BUG: KASAN: use-after-free in >> blkdev_direct_IO+0x50c/0x600 at addr 8801ef30ea14 > > Can you resolve that address for me, please? > *(gdb) list *blkdev_direct_IO+0x50c 0x8142ab8c is in blkdev_direct_IO (fs/block_dev.c:401). 396 submit_bio(bio); 397 bio = bio_alloc(GFP_KERNEL, nr_pages); 398 } 399 blk_finish_plug(); 400 401 if (!dio->is_sync) 402 return -EIOCBQUEUED; It looks like the qc = submit_bio() completes the I/O before the blkdev_direct_IO completes, which then leads to the use after free case, when the private dio struct is accessed. >> Which boils down to the bio already being freed, when we hit the >> module's endio function. >> >> The double bio_put() occurs when dio->should_dirty = 0, and dio->is_sync >> = 0. The flow follows: >> >> Issuing: >> blkdev_direct_IO >> get_bio(bio) > > bio refcounting in __blkdev_direct_IO is the following > > first bio is allocated using bio_alloc_bioset to embedd the dio structure > into it. We then grab a second reference to that bio and only that one. > Each other new bio just gets it's single reference from bio_alloc. 
> > blkdev_bio_end_io then checks if we hit the last reference on the dio, in > which case it then drops the additional reference on the first bio after > calling the aio completion handler. Once that is done it always drops the > reference for the current bio - either explicitly or through > bio_check_pages_dirty in the should_dirty case. > >> submit_io() >>rrpc make_request(bio) >> get_bio(bio) >> Completion: >> blkdev_bio_end_io >> bio_put(bio) >> bio_put(bio) - bio is freed > > Yes, and that's how it's intended. > >> rrpc_end_io >> bio_put(bio) (use-after-free access) > > Can you check this is the same bio that got submitted and it didn't > get bounced somewhere? > Yup, the same: [11.329950] blkdev_direct_io: get_bio (bio=8801f1e7a018) get ref [11.331557] blkdev_direct_io: (!nr_pages) (bio=8801f1e7a018) submit [11.333603] rrpc bio_get: (bio=8801f1e7a018) get ref [11.335004] blkdev_bio_end_io:!dio->is_syn(bio=8801f1e7a018) put ref [11.335009] rrpc bio_put: (bio=8801f1e7a018) put ref It could look like the first get_bio() ref is decremented prematurely in the blkdev_bio_end_io() path, where it should first have been decremented at the end of blkdev_direct_IO() path. -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
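The bug pattern in this thread is generic: once the final submit can complete (and free) the dio asynchronously, the submitter must not dereference it afterwards, so the fix latches dio->is_sync into a local variable before the last submit. A userspace miniature of the hazard and the fix (simulated instant completion; not the real block-layer code):

```c
#include <stdlib.h>

struct dio { int is_sync; };

/* Simulates submit_bio() against a device that completes instantly:
 * for async I/O the completion handler frees the dio before this
 * returns - exactly the race KASAN caught. */
static void submit(struct dio *d)
{
    if (!d->is_sync)
        free(d);
}

/* The fix from this thread: copy is_sync into a local *before* the
 * final submit, so the struct is never read after it may be freed. */
static int direct_io(int sync)
{
    struct dio *d = malloc(sizeof(*d));

    d->is_sync = sync;
    int is_sync = d->is_sync;   /* local copy, safe to test later */

    submit(d);

    if (!is_sync)               /* NOT d->is_sync: d may be gone */
        return -529;            /* -EIOCBQUEUED */
    free(d);
    return 0;
}
```

Testing `d->is_sync` at the marked line instead would read freed memory whenever the I/O completed first, which is why the one-line local-variable change is sufficient.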
Re: [PATCH 05/16] dm: always defer request allocation to the owner of the request_queue
On 01/23/2017 04:29 PM, Christoph Hellwig wrote: > DM already calls blk_mq_alloc_request on the request_queue of the > underlying device if it is a blk-mq device. But now that we allow drivers > to allocate additional data and initialize it ahead of time we need to do > the same for all drivers. Doing so and using the new cmd_size > infrastructure in the block layer greatly simplifies the dm-rq and mpath > code, and should also make arbitrary combinations of SQ and MQ devices > with SQ or MQ device mapper tables easily possible as a further step. > > Signed-off-by: Christoph Hellwig> --- > drivers/md/dm-core.h | 1 - > drivers/md/dm-mpath.c | 132 -- > drivers/md/dm-rq.c| 251 > ++ > drivers/md/dm-rq.h| 2 +- > drivers/md/dm-target.c| 7 -- > drivers/md/dm.c | 18 +-- > drivers/md/dm.h | 3 +- > include/linux/device-mapper.h | 3 - > 8 files changed, 81 insertions(+), 336 deletions(-) > Reviewed-by: Hannes Reinecke Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 04/16] dm: remove incomplete BLOCK_PC support
On 01/23/2017 04:29 PM, Christoph Hellwig wrote:
> DM tries to copy a few fields around for BLOCK_PC requests, but given
> that no dm-target ever wires up scsi_cmd_ioctl, BLOCK_PC can't actually
> be sent to dm.
>
> Signed-off-by: Christoph Hellwig
> ---
>  drivers/md/dm-rq.c | 16
>  1 file changed, 16 deletions(-)
>
Reviewed-by: Hannes Reinecke

Cheers,

Hannes
--
Dr. Hannes Reinecke		   Teamlead Storage & Networking
h...@suse.de			   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
Re: [PATCH 03/16] block: allow specifying size for extra command data
On 01/23/2017 04:29 PM, Christoph Hellwig wrote:
> This mirrors the blk-mq capability to allocate extra driver-specific
> data behind struct request by setting a cmd_size field, as well as having
> a constructor / destructor for it.
>
> Signed-off-by: Christoph Hellwig
> ---
>  block/blk-core.c       | 59 --
>  block/blk-flush.c      |  5 ++---
>  block/blk-sysfs.c      |  7 --
>  include/linux/blkdev.h |  7 ++
>  4 files changed, 61 insertions(+), 17 deletions(-)
>
Reviewed-by: Hannes Reinecke

Cheers,

Hannes
--
Dr. Hannes Reinecke		   Teamlead Storage & Networking
h...@suse.de			   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)