Re: [PATCH V4 0/7] nvme: pci: fix & improve timeout handling
Hi Ming,

I did some tests on my local machine:

[  598.828578] nvme nvme0: I/O 51 QID 4 timeout, disable controller

This should be a timeout on nvme_reset_dev->nvme_wait_freeze.

[  598.828743] nvme nvme0: EH 1: before shutdown
[  599.013586] nvme nvme0: EH 1: after shutdown
[  599.137197] nvme nvme0: EH 1: after recovery

EH 1 has marked the controller state as LIVE.

[  599.137241] nvme nvme0: failed to mark controller state 1

So EH 0 failed to mark the state as LIVE, and the card was removed.
This should not be expected by nested EH.

[  599.137322] nvme nvme0: Removing after probe failure status: 0
[  599.326539] nvme nvme0: EH 0: after recovery
[  599.326760] nvme0n1: detected capacity change from 128035676160 to 0
[  599.457208] nvme nvme0: failed to set APST feature (-19)

nvme_reset_dev should identify whether it is nested.

Thanks,
Jianchao
Re: bug in tag handling in blk-mq?
On Tue, 2018-05-08 at 14:37 -0600, Jens Axboe wrote:
>
> - sdd has nothing pending, yet has 6 active waitqueues.

sdd is where ccache storage lives, so that should have been the only
activity on that drive: I built source on sdb, and was doing nothing
else that utilizes sdd.

	-Mike
Re: bug in tag handling in blk-mq?
On Tue, 2018-05-08 at 19:09 -0600, Jens Axboe wrote:
>
> Alright, I managed to reproduce it. What I think is happening is that
> BFQ is limiting the inflight case to something less than the wake
> batch for sbitmap, which can lead to stalls. I don't have time to test
> this tonight, but perhaps you can give it a go when you are back at it.
> If not, I'll try tomorrow morning.
>
> If this is the issue, I can turn it into a real patch. This is just to
> confirm that the issue goes away with the below.

Confirmed. Impressive high speed bug stomping.

> diff --git a/lib/sbitmap.c b/lib/sbitmap.c
> index e6a9c06ec70c..94ced15b6428 100644
> --- a/lib/sbitmap.c
> +++ b/lib/sbitmap.c
> @@ -272,6 +272,7 @@ EXPORT_SYMBOL_GPL(sbitmap_bitmap_show);
>  
>  static unsigned int sbq_calc_wake_batch(unsigned int depth)
>  {
> +#if 0
>  	unsigned int wake_batch;
>  
>  	/*
> @@ -284,6 +285,9 @@ static unsigned int sbq_calc_wake_batch(unsigned int depth)
>  	wake_batch = max(1U, depth / SBQ_WAIT_QUEUES);
>  
>  	return wake_batch;
> +#else
> +	return 1;
> +#endif
>  }
>  
>  int sbitmap_queue_init_node(struct sbitmap_queue *sbq, unsigned int depth,
Re: [PATCH] brd: Mark as non-rotational
I'm sending this mail for another chance of review.

Thanks,
SeongJae Park

On Thu, May 3, 2018 at 6:53 PM SeongJae Park wrote:
> This commit sets QUEUE_FLAG_NONROT and clears QUEUE_FLAG_ADD_RANDOM
> to mark the ramdisks as non-rotational devices.
>
> Signed-off-by: SeongJae Park
> ---
>  drivers/block/brd.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/drivers/block/brd.c b/drivers/block/brd.c
> index 66cb0f857f64..39c5b90cc187 100644
> --- a/drivers/block/brd.c
> +++ b/drivers/block/brd.c
> @@ -402,6 +402,10 @@ static struct brd_device *brd_alloc(int i)
>  	set_capacity(disk, rd_size * 2);
>  	disk->queue->backing_dev_info->capabilities |= BDI_CAP_SYNCHRONOUS_IO;
>  
> +	/* Tell the block layer that this is not a rotational device */
> +	blk_queue_flag_set(QUEUE_FLAG_NONROT, disk->queue);
> +	blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, disk->queue);
> +
>  	return brd;
>  
>  out_free_queue:
> --
> 2.13.0
[PATCH 04/10] block: Use bioset_init() for fs_bio_set
Minor optimization - remove a pointer indirection when using fs_bio_set.

Signed-off-by: Kent Overstreet
---
 block/bio.c                         | 7 +++----
 block/blk-core.c                    | 2 +-
 drivers/target/target_core_iblock.c | 2 +-
 include/linux/bio.h                 | 4 ++--
 4 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 980befd919..b7cdad6fc4 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -53,7 +53,7 @@ static struct biovec_slab bvec_slabs[BVEC_POOL_NR] __read_mostly = {
  * fs_bio_set is the bio_set containing bio and iovec memory pools used by
  * IO code that does not need private memory pools.
  */
-struct bio_set *fs_bio_set;
+struct bio_set fs_bio_set;
 EXPORT_SYMBOL(fs_bio_set);
 
 /*
@@ -2055,11 +2055,10 @@ static int __init init_bio(void)
 	bio_integrity_init();
 	biovec_init_slabs();
 
-	fs_bio_set = bioset_create(BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
-	if (!fs_bio_set)
+	if (bioset_init(&fs_bio_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS))
 		panic("bio: can't allocate bios\n");
 
-	if (bioset_integrity_create(fs_bio_set, BIO_POOL_SIZE))
+	if (bioset_integrity_create(&fs_bio_set, BIO_POOL_SIZE))
 		panic("bio: can't create integrity pool\n");
 
 	return 0;
diff --git a/block/blk-core.c b/block/blk-core.c
index 6d82c4f7fa..66f24798ef 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -3409,7 +3409,7 @@ int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
 	struct bio *bio, *bio_src;
 
 	if (!bs)
-		bs = fs_bio_set;
+		bs = &fs_bio_set;
 
 	__rq_for_each_bio(bio_src, rq_src) {
 		bio = bio_clone_fast(bio_src, gfp_mask, bs);
diff --git a/drivers/target/target_core_iblock.c b/drivers/target/target_core_iblock.c
index 07c814c426..c969c01c7c 100644
--- a/drivers/target/target_core_iblock.c
+++ b/drivers/target/target_core_iblock.c
@@ -164,7 +164,7 @@ static int iblock_configure_device(struct se_device *dev)
 			goto out_blkdev_put;
 		}
 		pr_debug("IBLOCK setup BIP bs->bio_integrity_pool: %p\n",
-			 bs->bio_integrity_pool);
+			 &bs->bio_integrity_pool);
 	}
 
 	dev->dev_attrib.hw_pi_prot_type = dev->dev_attrib.pi_prot_type;
 }
diff --git a/include/linux/bio.h b/include/linux/bio.h
index fa3cf94a50..91b02520e2 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -423,11 +423,11 @@ extern void __bio_clone_fast(struct bio *, struct bio *);
 extern struct bio *bio_clone_fast(struct bio *, gfp_t, struct bio_set *);
 extern struct bio *bio_clone_bioset(struct bio *, gfp_t, struct bio_set *bs);
 
-extern struct bio_set *fs_bio_set;
+extern struct bio_set fs_bio_set;
 
 static inline struct bio *bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
 {
-	return bio_alloc_bioset(gfp_mask, nr_iovecs, fs_bio_set);
+	return bio_alloc_bioset(gfp_mask, nr_iovecs, &fs_bio_set);
 }
 
 static inline struct bio *bio_kmalloc(gfp_t gfp_mask, unsigned int nr_iovecs)
-- 
2.17.0
[PATCH 01/10] mempool: Add mempool_init()/mempool_exit()
Allows mempools to be embedded in other structs, getting rid of a
pointer indirection from allocation fastpaths.

mempool_exit() is safe to call on an uninitialized but zeroed mempool.

Signed-off-by: Kent Overstreet
---
 include/linux/mempool.h |  34 ++++++++++
 mm/mempool.c            | 108 +++++++++++++++++++++++++++-------
 2 files changed, 115 insertions(+), 27 deletions(-)

diff --git a/include/linux/mempool.h b/include/linux/mempool.h
index b51f5c430c..0c964ac107 100644
--- a/include/linux/mempool.h
+++ b/include/linux/mempool.h
@@ -25,6 +25,18 @@ typedef struct mempool_s {
 	wait_queue_head_t wait;
 } mempool_t;
 
+static inline bool mempool_initialized(mempool_t *pool)
+{
+	return pool->elements != NULL;
+}
+
+void mempool_exit(mempool_t *pool);
+int mempool_init_node(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
+		      mempool_free_t *free_fn, void *pool_data,
+		      gfp_t gfp_mask, int node_id);
+int mempool_init(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
+		 mempool_free_t *free_fn, void *pool_data);
+
 extern mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
 			mempool_free_t *free_fn, void *pool_data);
 extern mempool_t *mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
@@ -43,6 +55,14 @@ extern void mempool_free(void *element, mempool_t *pool);
 */
 void *mempool_alloc_slab(gfp_t gfp_mask, void *pool_data);
 void mempool_free_slab(void *element, void *pool_data);
+
+static inline int
+mempool_init_slab_pool(mempool_t *pool, int min_nr, struct kmem_cache *kc)
+{
+	return mempool_init(pool, min_nr, mempool_alloc_slab,
+			    mempool_free_slab, (void *) kc);
+}
+
 static inline mempool_t *
 mempool_create_slab_pool(int min_nr, struct kmem_cache *kc)
 {
@@ -56,6 +76,13 @@ mempool_create_slab_pool(int min_nr, struct kmem_cache *kc)
 */
 void *mempool_kmalloc(gfp_t gfp_mask, void *pool_data);
 void mempool_kfree(void *element, void *pool_data);
+
+static inline int mempool_init_kmalloc_pool(mempool_t *pool, int min_nr, size_t size)
+{
+	return mempool_init(pool, min_nr, mempool_kmalloc,
+			    mempool_kfree, (void *) size);
+}
+
 static inline mempool_t *mempool_create_kmalloc_pool(int min_nr, size_t size)
 {
 	return mempool_create(min_nr, mempool_kmalloc, mempool_kfree,
@@ -68,6 +95,13 @@ static inline mempool_t *mempool_create_kmalloc_pool(int min_nr, size_t size)
 */
 void *mempool_alloc_pages(gfp_t gfp_mask, void *pool_data);
 void mempool_free_pages(void *element, void *pool_data);
+
+static inline int mempool_init_page_pool(mempool_t *pool, int min_nr, int order)
+{
+	return mempool_init(pool, min_nr, mempool_alloc_pages,
+			    mempool_free_pages, (void *)(long)order);
+}
+
 static inline mempool_t *mempool_create_page_pool(int min_nr, int order)
 {
 	return mempool_create(min_nr, mempool_alloc_pages, mempool_free_pages,
diff --git a/mm/mempool.c b/mm/mempool.c
index 5c9dce3471..df90ace400 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -137,6 +137,28 @@ static void *remove_element(mempool_t *pool, gfp_t flags)
 	return element;
 }
 
+/**
+ * mempool_exit - exit a mempool initialized with mempool_init()
+ * @pool:   pointer to the memory pool which was initialized with
+ *          mempool_init().
+ *
+ * Free all reserved elements in @pool. This function only sleeps if the
+ * free_fn() function sleeps.
+ *
+ * May be called on a zeroed but uninitialized mempool (i.e. allocated with
+ * kzalloc()).
+ */
+void mempool_exit(mempool_t *pool)
+{
+	while (pool->curr_nr) {
+		void *element = remove_element(pool, GFP_KERNEL);
+		pool->free(element, pool->pool_data);
+	}
+	kfree(pool->elements);
+	pool->elements = NULL;
+}
+EXPORT_SYMBOL(mempool_exit);
+
 /**
 * mempool_destroy - deallocate a memory pool
 * @pool:   pointer to the memory pool which was allocated via
@@ -150,15 +172,65 @@ void mempool_destroy(mempool_t *pool)
 	if (unlikely(!pool))
 		return;
 
-	while (pool->curr_nr) {
-		void *element = remove_element(pool, GFP_KERNEL);
-		pool->free(element, pool->pool_data);
-	}
-	kfree(pool->elements);
+	mempool_exit(pool);
 	kfree(pool);
 }
 EXPORT_SYMBOL(mempool_destroy);
 
+int mempool_init_node(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
+		      mempool_free_t *free_fn, void *pool_data,
+		      gfp_t gfp_mask, int node_id)
+{
+	spin_lock_init(&pool->lock);
+	pool->min_nr	= min_nr;
+	pool->pool_data = pool_data;
+	pool->alloc	= alloc_fn;
+	pool->free	= free_fn;
+	init_waitqueue_head(&pool->wait);
+
+	pool->elements = kmalloc_array_node(min_nr, sizeof(void *),
[PATCH 05/10] block: Add bio_copy_data_iter(), zero_fill_bio_iter()
Add versions that take bvec_iter args instead of using bio->bi_iter - to
be used by bcachefs.

Signed-off-by: Kent Overstreet
---
 block/bio.c         | 44 ++++++++++++++++++++------------------------
 include/linux/bio.h | 18 +++++++++++++++---
 2 files changed, 39 insertions(+), 23 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index b7cdad6fc4..d7bd765e9e 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -530,20 +530,20 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, unsigned int nr_iovecs,
 }
 EXPORT_SYMBOL(bio_alloc_bioset);
 
-void zero_fill_bio(struct bio *bio)
+void zero_fill_bio_iter(struct bio *bio, struct bvec_iter start)
 {
 	unsigned long flags;
 	struct bio_vec bv;
 	struct bvec_iter iter;
 
-	bio_for_each_segment(bv, bio, iter) {
+	__bio_for_each_segment(bv, bio, iter, start) {
 		char *data = bvec_kmap_irq(&bv, &flags);
 		memset(data, 0, bv.bv_len);
 		flush_dcache_page(bv.bv_page);
 		bvec_kunmap_irq(data, &flags);
 	}
 }
-EXPORT_SYMBOL(zero_fill_bio);
+EXPORT_SYMBOL(zero_fill_bio_iter);
 
 /**
 * bio_put - release a reference to a bio
@@ -971,28 +971,13 @@ void bio_advance(struct bio *bio, unsigned bytes)
 }
 EXPORT_SYMBOL(bio_advance);
 
-/**
- * bio_copy_data - copy contents of data buffers from one chain of bios to
- * another
- * @src: source bio list
- * @dst: destination bio list
- *
- * If @src and @dst are single bios, bi_next must be NULL - otherwise, treats
- * @src and @dst as linked lists of bios.
- *
- * Stops when it reaches the end of either @src or @dst - that is, copies
- * min(src->bi_size, dst->bi_size) bytes (or the equivalent for lists of bios).
- */
-void bio_copy_data(struct bio *dst, struct bio *src)
+void bio_copy_data_iter(struct bio *dst, struct bvec_iter dst_iter,
+			struct bio *src, struct bvec_iter src_iter)
 {
-	struct bvec_iter src_iter, dst_iter;
 	struct bio_vec src_bv, dst_bv;
 	void *src_p, *dst_p;
 	unsigned bytes;
 
-	src_iter = src->bi_iter;
-	dst_iter = dst->bi_iter;
-
 	while (1) {
 		if (!src_iter.bi_size) {
 			src = src->bi_next;
@@ -1029,6 +1014,25 @@ void bio_copy_data(struct bio *dst, struct bio *src)
 		bio_advance_iter(dst, &dst_iter, bytes);
 	}
 }
+EXPORT_SYMBOL(bio_copy_data_iter);
+
+/**
+ * bio_copy_data - copy contents of data buffers from one chain of bios to
+ * another
+ * @src: source bio list
+ * @dst: destination bio list
+ *
+ * If @src and @dst are single bios, bi_next must be NULL - otherwise, treats
+ * @src and @dst as linked lists of bios.
+ *
+ * Stops when it reaches the end of either @src or @dst - that is, copies
+ * min(src->bi_size, dst->bi_size) bytes (or the equivalent for lists of bios).
+ */
+void bio_copy_data(struct bio *dst, struct bio *src)
+{
+	bio_copy_data_iter(dst, dst->bi_iter,
+			   src, src->bi_iter);
+}
 EXPORT_SYMBOL(bio_copy_data);
 
 struct bio_map_data {
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 91b02520e2..5a6ee955a8 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -67,8 +67,12 @@
 #define bio_multiple_segments(bio)				\
 	((bio)->bi_iter.bi_size != bio_iovec(bio).bv_len)
-#define bio_sectors(bio)	((bio)->bi_iter.bi_size >> 9)
-#define bio_end_sector(bio)	((bio)->bi_iter.bi_sector + bio_sectors((bio)))
+
+#define bvec_iter_sectors(iter)	((iter).bi_size >> 9)
+#define bvec_iter_end_sector(iter) ((iter).bi_sector + bvec_iter_sectors((iter)))
+
+#define bio_sectors(bio)	bvec_iter_sectors((bio)->bi_iter)
+#define bio_end_sector(bio)	bvec_iter_end_sector((bio)->bi_iter)
 
 /*
 * Return the data direction, READ or WRITE.
@@ -501,6 +505,8 @@ static inline void bio_flush_dcache_pages(struct bio *bi)
 }
 #endif
 
+extern void bio_copy_data_iter(struct bio *dst, struct bvec_iter dst_iter,
+			       struct bio *src, struct bvec_iter src_iter);
 extern void bio_copy_data(struct bio *dst, struct bio *src);
 extern void bio_free_pages(struct bio *bio);
 
@@ -509,7 +515,13 @@ extern struct bio *bio_copy_user_iov(struct request_queue *,
 				     struct iov_iter *, gfp_t);
 extern int bio_uncopy_user(struct bio *);
-void zero_fill_bio(struct bio *bio);
+void zero_fill_bio_iter(struct bio *bio, struct bvec_iter iter);
+
+static inline void zero_fill_bio(struct bio *bio)
+{
+	zero_fill_bio_iter(bio, bio->bi_iter);
+}
+
 extern struct bio_vec *bvec_alloc(gfp_t, int, unsigned long *, mempool_t *);
 extern void bvec_free(mempool_t *, struct bio_vec *, unsigned int);
 extern unsigned int bvec_nr_vecs(unsigned short idx);
-- 
2.17.0
[PATCH 03/10] block: Add bioset_init()/bioset_exit()
Similarly to mempool_init()/mempool_exit(), take a pointer indirection
out of allocation/freeing by allowing biosets to be embedded in other
structs.

Signed-off-by: Kent Overstreet
---
 block/bio.c         | 93 ++++++++++++++++++++++++++++++-------------
 include/linux/bio.h |  2 ++
 2 files changed, 67 insertions(+), 28 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 360e9bcea5..980befd919 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1856,21 +1856,83 @@ int biovec_init_pool(mempool_t *pool, int pool_entries)
 	return mempool_init_slab_pool(pool, pool_entries, bp->slab);
 }
 
-void bioset_free(struct bio_set *bs)
+/*
+ * bioset_exit - exit a bioset initialized with bioset_init()
+ *
+ * May be called on a zeroed but uninitialized bioset (i.e. allocated with
+ * kzalloc()).
+ */
+void bioset_exit(struct bio_set *bs)
 {
 	if (bs->rescue_workqueue)
 		destroy_workqueue(bs->rescue_workqueue);
+	bs->rescue_workqueue = NULL;
 
 	mempool_exit(&bs->bio_pool);
 	mempool_exit(&bs->bvec_pool);
 
 	bioset_integrity_free(bs);
-	bio_put_slab(bs);
+	if (bs->bio_slab)
+		bio_put_slab(bs);
+	bs->bio_slab = NULL;
+}
+EXPORT_SYMBOL(bioset_exit);
 
+void bioset_free(struct bio_set *bs)
+{
+	bioset_exit(bs);
 	kfree(bs);
 }
 EXPORT_SYMBOL(bioset_free);
 
+/**
+ * bioset_init - Initialize a bio_set
+ * @pool_size:	Number of bio and bio_vecs to cache in the mempool
+ * @front_pad:	Number of bytes to allocate in front of the returned bio
+ * @flags:	Flags to modify behavior, currently %BIOSET_NEED_BVECS
+ *              and %BIOSET_NEED_RESCUER
+ *
+ * Similar to bioset_create(), but initializes a passed-in bioset instead of
+ * separately allocating it.
+ */
+int bioset_init(struct bio_set *bs,
+		unsigned int pool_size,
+		unsigned int front_pad,
+		int flags)
+{
+	unsigned int back_pad = BIO_INLINE_VECS * sizeof(struct bio_vec);
+
+	bs->front_pad = front_pad;
+
+	spin_lock_init(&bs->rescue_lock);
+	bio_list_init(&bs->rescue_list);
+	INIT_WORK(&bs->rescue_work, bio_alloc_rescue);
+
+	bs->bio_slab = bio_find_or_create_slab(front_pad + back_pad);
+	if (!bs->bio_slab)
+		return -ENOMEM;
+
+	if (mempool_init_slab_pool(&bs->bio_pool, pool_size, bs->bio_slab))
+		goto bad;
+
+	if ((flags & BIOSET_NEED_BVECS) &&
+	    biovec_init_pool(&bs->bvec_pool, pool_size))
+		goto bad;
+
+	if (!(flags & BIOSET_NEED_RESCUER))
+		return 0;
+
+	bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
+	if (!bs->rescue_workqueue)
+		goto bad;
+
+	return 0;
+bad:
+	bioset_exit(bs);
+	return -ENOMEM;
+}
+EXPORT_SYMBOL(bioset_init);
+
 /**
 * bioset_create  - Create a bio_set
 * @pool_size:	Number of bio and bio_vecs to cache in the mempool
@@ -1895,43 +1957,18 @@ struct bio_set *bioset_create(unsigned int pool_size,
 			      unsigned int front_pad,
 			      int flags)
 {
-	unsigned int back_pad = BIO_INLINE_VECS * sizeof(struct bio_vec);
 	struct bio_set *bs;
 
 	bs = kzalloc(sizeof(*bs), GFP_KERNEL);
 	if (!bs)
 		return NULL;
 
-	bs->front_pad = front_pad;
-
-	spin_lock_init(&bs->rescue_lock);
-	bio_list_init(&bs->rescue_list);
-	INIT_WORK(&bs->rescue_work, bio_alloc_rescue);
-
-	bs->bio_slab = bio_find_or_create_slab(front_pad + back_pad);
-	if (!bs->bio_slab) {
+	if (bioset_init(bs, pool_size, front_pad, flags)) {
 		kfree(bs);
 		return NULL;
 	}
 
-	if (mempool_init_slab_pool(&bs->bio_pool, pool_size, bs->bio_slab))
-		goto bad;
-
-	if ((flags & BIOSET_NEED_BVECS) &&
-	    biovec_init_pool(&bs->bvec_pool, pool_size))
-		goto bad;
-
-	if (!(flags & BIOSET_NEED_RESCUER))
-		return bs;
-
-	bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
-	if (!bs->rescue_workqueue)
-		goto bad;
-
 	return bs;
-bad:
-	bioset_free(bs);
-	return NULL;
 }
 EXPORT_SYMBOL(bioset_create);
 
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 720f7261d0..fa3cf94a50 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -406,6 +406,8 @@ static inline struct bio *bio_next_split(struct bio *bio, int sectors,
 	return bio_split(bio, sectors, gfp, bs);
 }
 
+extern int bioset_init(struct bio_set *, unsigned int, unsigned int, int flags);
+extern void bioset_exit(struct bio_set *);
 extern struct bio_set *bioset_create(unsigned int, unsigned int, int flags);
 
 enum {
 	BIOSET_NEED_BVECS	= BIT(0),
-- 
2.17.0
[PATCH 06/10] block: Split out bio_list_copy_data()
Found a bug (with ASAN) where we were passing a bio to bio_copy_data()
with bi_next not NULL, when it should have been - a driver had left
bi_next set to something after calling bio_endio().

Since the normal case is only copying single bios, split out
bio_list_copy_data() to avoid more bugs like this in the future.

Signed-off-by: Kent Overstreet
---
 block/bio.c             | 83 +++++++++++++++++++++++------------------
 drivers/block/pktcdvd.c |  2 +-
 include/linux/bio.h     |  5 ++-
 3 files changed, 55 insertions(+), 35 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index d7bd765e9e..c58544d4bc 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -971,32 +971,16 @@ void bio_advance(struct bio *bio, unsigned bytes)
 }
 EXPORT_SYMBOL(bio_advance);
 
-void bio_copy_data_iter(struct bio *dst, struct bvec_iter dst_iter,
-			struct bio *src, struct bvec_iter src_iter)
+void bio_copy_data_iter(struct bio *dst, struct bvec_iter *dst_iter,
+			struct bio *src, struct bvec_iter *src_iter)
 {
 	struct bio_vec src_bv, dst_bv;
 	void *src_p, *dst_p;
 	unsigned bytes;
 
-	while (1) {
-		if (!src_iter.bi_size) {
-			src = src->bi_next;
-			if (!src)
-				break;
-
-			src_iter = src->bi_iter;
-		}
-
-		if (!dst_iter.bi_size) {
-			dst = dst->bi_next;
-			if (!dst)
-				break;
-
-			dst_iter = dst->bi_iter;
-		}
-
-		src_bv = bio_iter_iovec(src, src_iter);
-		dst_bv = bio_iter_iovec(dst, dst_iter);
+	while (src_iter->bi_size && dst_iter->bi_size) {
+		src_bv = bio_iter_iovec(src, *src_iter);
+		dst_bv = bio_iter_iovec(dst, *dst_iter);
 
 		bytes = min(src_bv.bv_len, dst_bv.bv_len);
 
@@ -1010,31 +994,66 @@ void bio_copy_data_iter(struct bio *dst, struct bvec_iter dst_iter,
 		kunmap_atomic(dst_p);
 		kunmap_atomic(src_p);
 
-		bio_advance_iter(src, &src_iter, bytes);
-		bio_advance_iter(dst, &dst_iter, bytes);
+		bio_advance_iter(src, src_iter, bytes);
+		bio_advance_iter(dst, dst_iter, bytes);
 	}
 }
 EXPORT_SYMBOL(bio_copy_data_iter);
 
 /**
- * bio_copy_data - copy contents of data buffers from one chain of bios to
- * another
- * @src: source bio list
- * @dst: destination bio list
- *
- * If @src and @dst are single bios, bi_next must be NULL - otherwise, treats
- * @src and @dst as linked lists of bios.
+ * bio_copy_data - copy contents of data buffers from one bio to another
+ * @src: source bio
+ * @dst: destination bio
 *
 * Stops when it reaches the end of either @src or @dst - that is, copies
 * min(src->bi_size, dst->bi_size) bytes (or the equivalent for lists of bios).
 */
 void bio_copy_data(struct bio *dst, struct bio *src)
 {
-	bio_copy_data_iter(dst, dst->bi_iter,
-			   src, src->bi_iter);
+	struct bvec_iter src_iter = src->bi_iter;
+	struct bvec_iter dst_iter = dst->bi_iter;
+
+	bio_copy_data_iter(dst, &dst_iter, src, &src_iter);
 }
 EXPORT_SYMBOL(bio_copy_data);
 
+/**
+ * bio_list_copy_data - copy contents of data buffers from one chain of bios to
+ * another
+ * @src: source bio list
+ * @dst: destination bio list
+ *
+ * Stops when it reaches the end of either the @src list or @dst list - that is,
+ * copies min(src->bi_size, dst->bi_size) bytes (or the equivalent for lists of
+ * bios).
+ */
+void bio_list_copy_data(struct bio *dst, struct bio *src)
+{
+	struct bvec_iter src_iter = src->bi_iter;
+	struct bvec_iter dst_iter = dst->bi_iter;
+
+	while (1) {
+		if (!src_iter.bi_size) {
+			src = src->bi_next;
+			if (!src)
+				break;
+
+			src_iter = src->bi_iter;
+		}
+
+		if (!dst_iter.bi_size) {
+			dst = dst->bi_next;
+			if (!dst)
+				break;
+
+			dst_iter = dst->bi_iter;
+		}
+
+		bio_copy_data_iter(dst, &dst_iter, src, &src_iter);
+	}
+}
+EXPORT_SYMBOL(bio_list_copy_data);
+
 struct bio_map_data {
 	int is_our_pages;
 	struct iov_iter iter;
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index c61d20c9f3..00ea788b17 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -1285,7 +1285,7 @@ static void pkt_start_write(struct pktcdvd_device *pd, struct packet_data *pkt)
 	 * Fill-in bvec with data from orig_bios.
 	 */
 	spin_lock(&pkt->lock);
-	bio_copy_data(pkt->w_bio, pkt->orig_bios.head);
+	bio_list_copy_data(pkt->w_bio, pkt->orig_bios.head);
 
 	pkt_set_state(pkt,
[PATCH 10/10] block: Add sysfs entry for fua support
Signed-off-by: Kent Overstreet
---
 block/blk-sysfs.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index cbea895a55..d6dd7d8198 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -497,6 +497,11 @@ static ssize_t queue_wc_store(struct request_queue *q, const char *page,
 	return count;
 }
 
+static ssize_t queue_fua_show(struct request_queue *q, char *page)
+{
+	return sprintf(page, "%u\n", test_bit(QUEUE_FLAG_FUA, &q->queue_flags));
+}
+
 static ssize_t queue_dax_show(struct request_queue *q, char *page)
 {
 	return queue_var_show(blk_queue_dax(q), page);
@@ -665,6 +670,11 @@ static struct queue_sysfs_entry queue_wc_entry = {
 	.store = queue_wc_store,
 };
 
+static struct queue_sysfs_entry queue_fua_entry = {
+	.attr = {.name = "fua", .mode = S_IRUGO },
+	.show = queue_fua_show,
+};
+
 static struct queue_sysfs_entry queue_dax_entry = {
 	.attr = {.name = "dax", .mode = S_IRUGO },
 	.show = queue_dax_show,
@@ -714,6 +724,7 @@ static struct attribute *default_attrs[] = {
 	&queue_random_entry.attr,
 	&queue_poll_entry.attr,
 	&queue_wc_entry.attr,
+	&queue_fua_entry.attr,
 	&queue_dax_entry.attr,
 	&queue_wb_lat_entry.attr,
 	&queue_poll_delay_entry.attr,
-- 
2.17.0
[PATCH 09/10] block: Export bio check/set pages_dirty
Signed-off-by: Kent Overstreet
---
 block/bio.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index 5c81391100..6689102f5d 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1610,6 +1610,7 @@ void bio_set_pages_dirty(struct bio *bio)
 			set_page_dirty_lock(page);
 	}
 }
+EXPORT_SYMBOL_GPL(bio_set_pages_dirty);
 
 static void bio_release_pages(struct bio *bio)
 {
@@ -1693,6 +1694,7 @@ void bio_check_pages_dirty(struct bio *bio)
 		bio_put(bio);
 	}
 }
+EXPORT_SYMBOL_GPL(bio_check_pages_dirty);
 
 void generic_start_io_acct(struct request_queue *q, int rw,
 			   unsigned long sectors, struct hd_struct *part)
-- 
2.17.0
[PATCH 08/10] block: Add warning for bi_next not NULL in bio_endio()
Recently found a bug where a driver left bi_next not NULL and then
called bio_endio(), and then the submitter of the bio used
bio_copy_data() which was treating src and dst as lists of bios.

Fixed that bug by splitting out bio_list_copy_data(), but in case other
things are depending on bi_next in weird ways, add a warning to help
avoid more bugs like that in the future.

Signed-off-by: Kent Overstreet
---
 block/bio.c      | 3 +++
 block/blk-core.c | 8 +++++++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index ce8e259f9a..5c81391100 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1775,6 +1775,9 @@ void bio_endio(struct bio *bio)
 	if (!bio_integrity_endio(bio))
 		return;
 
+	if (WARN_ONCE(bio->bi_next, "driver left bi_next not NULL"))
+		bio->bi_next = NULL;
+
 	/*
 	 * Need to have a real endio function for chained bios, otherwise
 	 * various corner cases will break (like stacking block devices that
diff --git a/block/blk-core.c b/block/blk-core.c
index 66f24798ef..f3cf79198a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -204,6 +204,10 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
 	bio_advance(bio, nbytes);
 
 	/* don't actually finish bio if it's part of flush sequence */
+	/*
+	 * XXX this code looks suspicious - it's not consistent with advancing
+	 * req->bio in caller
+	 */
 	if (bio->bi_iter.bi_size == 0 && !(rq->rq_flags & RQF_FLUSH_SEQ))
 		bio_endio(bio);
 }
@@ -2982,8 +2986,10 @@ bool blk_update_request(struct request *req, blk_status_t error,
 		struct bio *bio = req->bio;
 		unsigned bio_bytes = min(bio->bi_iter.bi_size, nr_bytes);
 
-		if (bio_bytes == bio->bi_iter.bi_size)
+		if (bio_bytes == bio->bi_iter.bi_size) {
 			req->bio = bio->bi_next;
+			bio->bi_next = NULL;
+		}
 
 		/* Completion has already been traced */
 		bio_clear_flag(bio, BIO_TRACE_COMPLETION);
-- 
2.17.0
[PATCH 07/10] block: Add missing flush_dcache_page() call
Since a bio can point to userspace pages (e.g. direct IO), this is
generally necessary.

Signed-off-by: Kent Overstreet
---
 block/bio.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index c58544d4bc..ce8e259f9a 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -994,6 +994,8 @@ void bio_copy_data_iter(struct bio *dst, struct bvec_iter *dst_iter,
 		kunmap_atomic(dst_p);
 		kunmap_atomic(src_p);
 
+		flush_dcache_page(dst_bv.bv_page);
+
 		bio_advance_iter(src, src_iter, bytes);
 		bio_advance_iter(dst, dst_iter, bytes);
 	}
-- 
2.17.0
[PATCH 02/10] block: Convert bio_set to mempool_init()
Minor performance improvement by getting rid of pointer indirections
from allocation/freeing fastpaths.

Signed-off-by: Kent Overstreet
---
 block/bio-integrity.c | 29 ++++++++++++++---------------
 block/bio.c           | 36 +++++++++++++++++-------------------
 include/linux/bio.h   | 10 +++++-----
 3 files changed, 36 insertions(+), 39 deletions(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 9cfdd6c83b..add7c7c853 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -56,12 +56,12 @@ struct bio_integrity_payload *bio_integrity_alloc(struct bio *bio,
 	struct bio_set *bs = bio->bi_pool;
 	unsigned inline_vecs;
 
-	if (!bs || !bs->bio_integrity_pool) {
+	if (!bs || !mempool_initialized(&bs->bio_integrity_pool)) {
 		bip = kmalloc(sizeof(struct bio_integrity_payload) +
 			      sizeof(struct bio_vec) * nr_vecs, gfp_mask);
 		inline_vecs = nr_vecs;
 	} else {
-		bip = mempool_alloc(bs->bio_integrity_pool, gfp_mask);
+		bip = mempool_alloc(&bs->bio_integrity_pool, gfp_mask);
 		inline_vecs = BIP_INLINE_VECS;
 	}
 
@@ -74,7 +74,7 @@ struct bio_integrity_payload *bio_integrity_alloc(struct bio *bio,
 		unsigned long idx = 0;
 
 		bip->bip_vec = bvec_alloc(gfp_mask, nr_vecs, &idx,
-					  bs->bvec_integrity_pool);
+					  &bs->bvec_integrity_pool);
 		if (!bip->bip_vec)
 			goto err;
 		bip->bip_max_vcnt = bvec_nr_vecs(idx);
@@ -90,7 +90,7 @@ struct bio_integrity_payload *bio_integrity_alloc(struct bio *bio,
 
 	return bip;
 err:
-	mempool_free(bip, bs->bio_integrity_pool);
+	mempool_free(bip, &bs->bio_integrity_pool);
 	return ERR_PTR(-ENOMEM);
 }
 EXPORT_SYMBOL(bio_integrity_alloc);
@@ -111,10 +111,10 @@ static void bio_integrity_free(struct bio *bio)
 		kfree(page_address(bip->bip_vec->bv_page) +
 		      bip->bip_vec->bv_offset);
 
-	if (bs && bs->bio_integrity_pool) {
-		bvec_free(bs->bvec_integrity_pool, bip->bip_vec, bip->bip_slab);
+	if (bs && mempool_initialized(&bs->bio_integrity_pool)) {
+		bvec_free(&bs->bvec_integrity_pool, bip->bip_vec, bip->bip_slab);
 
-		mempool_free(bip, bs->bio_integrity_pool);
+		mempool_free(bip, &bs->bio_integrity_pool);
 	} else {
 		kfree(bip);
 	}
@@ -465,16 +465,15 @@ EXPORT_SYMBOL(bio_integrity_clone);
 
 int bioset_integrity_create(struct bio_set *bs, int pool_size)
 {
-	if (bs->bio_integrity_pool)
+	if (mempool_initialized(&bs->bio_integrity_pool))
 		return 0;
 
-	bs->bio_integrity_pool = mempool_create_slab_pool(pool_size, bip_slab);
-	if (!bs->bio_integrity_pool)
+	if (mempool_init_slab_pool(&bs->bio_integrity_pool,
+				   pool_size, bip_slab))
 		return -1;
 
-	bs->bvec_integrity_pool = biovec_create_pool(pool_size);
-	if (!bs->bvec_integrity_pool) {
-		mempool_destroy(bs->bio_integrity_pool);
+	if (biovec_init_pool(&bs->bvec_integrity_pool, pool_size)) {
+		mempool_exit(&bs->bio_integrity_pool);
 		return -1;
 	}
 
@@ -484,8 +483,8 @@ EXPORT_SYMBOL(bioset_integrity_create);
 
 void bioset_integrity_free(struct bio_set *bs)
 {
-	mempool_destroy(bs->bio_integrity_pool);
-	mempool_destroy(bs->bvec_integrity_pool);
+	mempool_exit(&bs->bio_integrity_pool);
+	mempool_exit(&bs->bvec_integrity_pool);
 }
 EXPORT_SYMBOL(bioset_integrity_free);
 
diff --git a/block/bio.c b/block/bio.c
index e1708db482..360e9bcea5 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -254,7 +254,7 @@ static void bio_free(struct bio *bio)
 	bio_uninit(bio);
 
 	if (bs) {
-		bvec_free(bs->bvec_pool, bio->bi_io_vec, BVEC_POOL_IDX(bio));
+		bvec_free(&bs->bvec_pool, bio->bi_io_vec, BVEC_POOL_IDX(bio));
 
 		/*
 		 * If we have front padding, adjust the bio pointer before freeing
@@ -262,7 +262,7 @@ static void bio_free(struct bio *bio)
 		p = bio;
 		p -= bs->front_pad;
 
-		mempool_free(p, bs->bio_pool);
+		mempool_free(p, &bs->bio_pool);
 	} else {
 		/* Bio was allocated by bio_kmalloc() */
 		kfree(bio);
@@ -454,7 +454,8 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, unsigned int nr_iovecs,
 		inline_vecs = nr_iovecs;
 	} else {
 		/* should not use nobvec bioset for nr_iovecs > 0 */
-		if (WARN_ON_ONCE(!bs->bvec_pool && nr_iovecs > 0))
+		if (WARN_ON_ONCE(!mempool_initialized(&bs->bvec_pool) &&
+				 nr_iovecs > 0))
 			return NULL;
 		/*
[PATCH 00/10] Misc block layer patches for bcachefs
- Add separately allocatable mempools, biosets: bcachefs uses both all
  over the place

- Bit of utility code - bio_copy_data_iter(), zero_fill_bio_iter()

- bio_list_copy_data(), the bi_next check - defensiveness because of a
  bug I had fun chasing down at one point

- add some exports, because bcachefs does dio its own way

- show whether fua is supported in sysfs, because I don't know of
  anything that exports whether the _block layer_ specifically thinks
  fua is supported.

Kent Overstreet (10):
  mempool: Add mempool_init()/mempool_exit()
  block: Convert bio_set to mempool_init()
  block: Add bioset_init()/bioset_exit()
  block: Use bioset_init() for fs_bio_set
  block: Add bio_copy_data_iter(), zero_fill_bio_iter()
  block: Split out bio_list_copy_data()
  block: Add missing flush_dcache_page() call
  block: Add warning for bi_next not NULL in bio_endio()
  block: Export bio check/set pages_dirty
  block: Add sysfs entry for fua support

 block/bio-integrity.c               |  29 ++--
 block/bio.c                         | 226 ++++++++++++++++++++--------
 block/blk-core.c                    |  10 +-
 block/blk-sysfs.c                   |  11 ++
 drivers/block/pktcdvd.c             |   2 +-
 drivers/target/target_core_iblock.c |   2 +-
 include/linux/bio.h                 |  35 +++--
 include/linux/mempool.h             |  34 +++++
 mm/mempool.c                        | 108 +++++++++++---
 9 files changed, 320 insertions(+), 137 deletions(-)

-- 
2.17.0
Re: bug in tag handling in blk-mq?
On 5/8/18 3:19 PM, Jens Axboe wrote:
> On 5/8/18 2:37 PM, Jens Axboe wrote:
>> On 5/8/18 10:42 AM, Mike Galbraith wrote:
>>> On Tue, 2018-05-08 at 08:55 -0600, Jens Axboe wrote:
>>>> All the block debug files are empty...
>>>
>>> Sigh. Take 2, this time cat debug files, having turned block tracing
>>> off before doing anything else (so trace bits in dmesg.txt should end
>>> AT the stall).
>>
>> OK, that's better. What I see from the traces:
>>
>> - You have regular IO and some non-fs IO (from scsi_execute()). This
>>   mix may be key.
>>
>> - sdd has nothing pending, yet has 6 active waitqueues.
>>
>> I'm going to see if I can reproduce this. Paolo, what kind of attempts
>> to reproduce this have you done?
>
> No luck so far. Out of the patches you referenced, I can only find the
> shallow depth change, since that's in the parent of this email. Can
> you send those as well?
>
> Perhaps also expand a bit on exactly what you are running. File system,
> mount options, etc.

Alright, I managed to reproduce it. What I think is happening is that
BFQ is limiting the inflight case to something less than the wake
batch for sbitmap, which can lead to stalls. I don't have time to test
this tonight, but perhaps you can give it a go when you are back at it.
If not, I'll try tomorrow morning.

If this is the issue, I can turn it into a real patch. This is just to
confirm that the issue goes away with the below.

diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index e6a9c06ec70c..94ced15b6428 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -272,6 +272,7 @@ EXPORT_SYMBOL_GPL(sbitmap_bitmap_show);
 
 static unsigned int sbq_calc_wake_batch(unsigned int depth)
 {
+#if 0
 	unsigned int wake_batch;
 
 	/*
@@ -284,6 +285,9 @@ static unsigned int sbq_calc_wake_batch(unsigned int depth)
 	wake_batch = max(1U, depth / SBQ_WAIT_QUEUES);
 
 	return wake_batch;
+#else
+	return 1;
+#endif
 }
 
 int sbitmap_queue_init_node(struct sbitmap_queue *sbq, unsigned int depth,

-- 
Jens Axboe
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
On Tue, 8 May 2018 17:31:48 -0600
Logan Gunthorpe wrote:

> On 08/05/18 05:11 PM, Alex Williamson wrote:
> > A runtime, sysfs approach has some benefits here, especially in
> > identifying the device, assuming we're ok with leaving the
> > persistence problem to userspace tools. I'm still a little fond of
> > the idea of exposing an acs_flags attribute for devices in sysfs
> > where a write would do a soft unplug and re-add of all affected
> > devices to automatically recreate the proper grouping. Any dynamic
> > change in routing and grouping would require all DMA be
> > re-established anyway and a soft hotplug seems like an elegant way
> > of handling it. Thanks,
>
> This approach sounds like it has a lot more issues to contend with:
>
> For starters, a soft unplug/re-add of all the devices behind a switch
> is going to be difficult if a lot of those devices have had drivers
> installed and their respective resources are now mounted or otherwise
> in use.
>
> Then, do we have to redo the soft-replace every time we change the ACS
> bit for every downstream port? That could mean you have to do dozens
> of soft-replaces before you have all the ACS bits set, which means you
> have a storm of drivers being added and removed.

True, anything requiring tweaking multiple downstream ports would
induce a hot-unplug/replug for each. A better sysfs interface would
allow multiple downstream ports to be updated in a single shot.

> This would require some kind of fancy custom setup software that runs
> at just the right time in the boot sequence, or a lot of work on the
> user's part to unbind all the resources, set up the ACS bits and then
> rebind everything (assuming the soft re-add doesn't rebind it every
> time you adjust one ACS bit). Ugly.
>
> IMO, if we need to do the sysfs approach then we need to be able to
> adjust the groups dynamically in a sensible way and not through the
> large hammer that is soft-replaces. I think this would be great but I
> don't think we will be tackling that with this patch set.

OTOH, I think the only sensible way to dynamically adjust groups is
through hotplug; we cannot have running drivers attached to downstream
endpoints as we're adjusting the routing. Thanks,

Alex
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
On Tue, 8 May 2018 19:06:17 -0400
Don Dutile wrote:

> On 05/08/2018 05:27 PM, Stephen Bates wrote:
> > As I understand it VMs need to know because VFIO passes IOMMU
> > grouping up into the VMs. So if an IOMMU grouping changes, the VM's
> > view of its PCIe topology changes. I think we even have to be
> > cognizant of the fact the OS running on the VM may not even support
> > hot-plug of PCI devices.
>
> Alex:
> Really? IOMMU groups are created by the kernel, so don't know how
> they would be passed into the VMs, unless indirectly via PCI(e)
> layout. At best, twiddling w/ACS enablement (emulation) would cause
> VMs to see different IOMMU groups, but again, VMs are not the
> security point/level, the host/HVs are.

Correct, the VM has no concept of the host's IOMMU groups, only the
hypervisor knows about the groups, but really only to the extent of
which device belongs to which group and whether the group is viable.
Any runtime change to grouping, though, would require DMA mapping
updates, which I don't see how we can reasonably do with drivers,
vfio-pci or native host drivers, bound to the affected devices.
Thanks,

Alex
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
On 08/05/18 05:11 PM, Alex Williamson wrote:
> On to the implementation details... I already mentioned the BDF issue
> in my other reply. If we had a way to persistently identify a device,
> would we specify the downstream points at which we want to disable ACS
> or the endpoints that we want to connect? The latter has a problem
> that the grouping upstream of an endpoint is already set by the time
> we discover the endpoint, so we might need to unwind to get the
> grouping correct. The former might be more difficult for users to find
> the necessary nodes, but easier for the kernel to deal with during
> discovery.

I was envisioning the former, with the kernel helping by printing a
dmesg in certain circumstances to help with figuring out which devices
need to be specified. Specifying a list of endpoints on the command
line and having the kernel try to figure out which downstream ports
need to be adjusted while we are in the middle of enumerating the bus
is, like you said, a nightmare.

> A runtime, sysfs approach has some benefits here, especially in
> identifying the device, assuming we're ok with leaving the persistence
> problem to userspace tools. I'm still a little fond of the idea of
> exposing an acs_flags attribute for devices in sysfs where a write
> would do a soft unplug and re-add of all affected devices to
> automatically recreate the proper grouping. Any dynamic change in
> routing and grouping would require all DMA be re-established anyway
> and a soft hotplug seems like an elegant way of handling it. Thanks,

This approach sounds like it has a lot more issues to contend with:

For starters, a soft unplug/re-add of all the devices behind a switch
is going to be difficult if a lot of those devices have had drivers
installed and their respective resources are now mounted or otherwise
in use.

Then, do we have to redo the soft-replace every time we change the ACS
bit for every downstream port? That could mean you have to do dozens of
soft-replaces before you have all the ACS bits set, which means you
have a storm of drivers being added and removed.

This would require some kind of fancy custom setup software that runs
at just the right time in the boot sequence, or a lot of work on the
user's part to unbind all the resources, set up the ACS bits and then
rebind everything (assuming the soft re-add doesn't rebind it every
time you adjust one ACS bit). Ugly.

IMO, if we need to do the sysfs approach then we need to be able to
adjust the groups dynamically in a sensible way and not through the
large hammer that is soft-replaces. I think this would be great but I
don't think we will be tackling that with this patch set.

Logan
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
On 08/05/18 05:00 PM, Dan Williams wrote:
>> I'd advise caution with a user supplied BDF approach, we have no
>> guaranteed persistence for a device's PCI address. Adding a device
>> might renumber the buses, replacing a device with one that consumes
>> more/less bus numbers can renumber the buses, motherboard firmware
>> updates could renumber the buses, pci=assign-buses can renumber the
>> buses, etc. This is why the VT-d spec makes use of device paths when
>> describing PCI hierarchies, firmware can't know what bus number will
>> be assigned to a device, but it does know the base bus number and the
>> path of devfns needed to get to it. I don't know how we come up with
>> an option that's easy enough for a user to understand, but reasonably
>> robust against hardware changes. Thanks,
>
> True, but at the same time this feature is for "users with custom
> hardware designed for purpose", I assume they would be willing to take
> on the bus renumbering risk. It's already the case that
> /sys/bus/pci/drivers//bind takes BDF, which is why it seemed to
> make a similar interface for the command line. Ideally we could later
> get something into ACPI or other platform firmware to arrange for
> bridges to disable ACS by default if we see p2p becoming a
> common-off-the-shelf feature. I.e. a BIOS switch to enable p2p in a
> given PCI-E sub-domain.

Yeah, I'm having a hard time coming up with an easy enough solution
for the user. I agree with Dan though, the bus renumbering risk would
be fairly low in the custom hardware, seeing the switches are likely
going to be directly soldered to the same board with the CPU.

That being said, I suppose we could allow the command line to take
both a BDF or a BaseBus/DF/DF/DF path. Though, implementing this
sounds like a bit of a challenge.

Logan
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
On Tue, 8 May 2018 22:25:06 +
"Stephen Bates" wrote:

> > Yeah, so based on the discussion I'm leaning toward just having a
> > command line option that takes a list of BDFs and disables ACS
> > for them. (Essentially as Dan has suggested.) This avoids the
> > shotgun.
>
> I concur that this seems to be where the conversation is taking us.
>
> @Alex - Before we go do this can you provide input on the approach? I
> don't want to re-spin only to find we are still not converging on the
> ACS issue

I can envision numerous implementation details that make this less
trivial than it sounds, but it seems like the thing we need to decide
first is if intentionally leaving windows between devices, with the
intention of exploiting them for direct P2P DMA in an otherwise IOMMU
managed address space, is something we want to do. From a security
perspective, we already handle this with IOMMU groups because many
devices do not support ACS; the new thing is embracing this rather
than working around it. It makes me a little twitchy, but so long as
the IOMMU groups match the expected worst case routing between
devices, it's really no different than if we could wipe the ACS
capability from the device.

On to the implementation details... I already mentioned the BDF issue
in my other reply. If we had a way to persistently identify a device,
would we specify the downstream points at which we want to disable ACS
or the endpoints that we want to connect? The latter has a problem
that the grouping upstream of an endpoint is already set by the time
we discover the endpoint, so we might need to unwind to get the
grouping correct. The former might be more difficult for users to find
the necessary nodes, but easier for the kernel to deal with during
discovery.

A runtime, sysfs approach has some benefits here, especially in
identifying the device, assuming we're ok with leaving the persistence
problem to userspace tools. I'm still a little fond of the idea of
exposing an acs_flags attribute for devices in sysfs where a write
would do a soft unplug and re-add of all affected devices to
automatically recreate the proper grouping. Any dynamic change in
routing and grouping would require all DMA be re-established anyway,
and a soft hotplug seems like an elegant way of handling it. Thanks,

Alex
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
On 05/08/2018 05:27 PM, Stephen Bates wrote:
> Hi Don
>
> > Well, p2p DMA is a function of a cooperating 'agent' somewhere above
> > the two devices. That agent should 'request' to the kernel that ACS
> > be removed/circumvented (p2p enabled) btwn two endpoints. I
> > recommend doing so via a sysfs method.
>
> Yes we looked at something like this in the past but it does hit the
> IOMMU grouping issue I discussed earlier today, which is not
> acceptable right now. In the long term, once we get IOMMU grouping
> callbacks to VMs we can look at extending p2pdma in this way. But I
> don't think this is viable for the initial series.
>
> > So I don't understand the comments why VMs should need to know.
>
> As I understand it VMs need to know because VFIO passes IOMMU
> grouping up into the VMs. So if an IOMMU grouping changes, the VM's
> view of its PCIe topology changes. I think we even have to be
> cognizant of the fact the OS running on the VM may not even support
> hot-plug of PCI devices.

Alex:
Really? IOMMU groups are created by the kernel, so don't know how they
would be passed into the VMs, unless indirectly via PCI(e) layout.
At best, twiddling w/ACS enablement (emulation) would cause VMs to see
different IOMMU groups, but again, VMs are not the security
point/level, the host/HVs are.

> > Is there a thread I need to read up to explain/clear up the
> > thoughts above?
>
> If you search for p2pdma you should find the previous discussions.
>
> Thanks for the input!
>
> Stephen
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
On Tue, May 8, 2018 at 3:32 PM, Alex Williamson wrote:
> On Tue, 8 May 2018 16:10:19 -0600
> Logan Gunthorpe wrote:
>
>> On 08/05/18 04:03 PM, Alex Williamson wrote:
>> > If IOMMU grouping implies device assignment (because nobody else
>> > uses it to the same extent as device assignment) then the
>> > build-time option falls to pieces, we need a single kernel that
>> > can do both. I think we need to get more clever about allowing the
>> > user to specify exactly at which points in the topology they want
>> > to disable isolation. Thanks,
>>
>> Yeah, so based on the discussion I'm leaning toward just having a
>> command line option that takes a list of BDFs and disables ACS for
>> them. (Essentially as Dan has suggested.) This avoids the shotgun.
>>
>> Then, the pci_p2pdma_distance command needs to check that ACS is
>> disabled for all bridges between the two devices. If this is not the
>> case, it returns -1. Future work can check if the EP has ATS
>> support, in which case it has to check for the ACS direct translated
>> bit.
>>
>> A user then needs to either disable the IOMMU and/or add the command
>> line option to disable ACS for the specific downstream ports in the
>> PCI hierarchy. This means the IOMMU groups will be less granular but
>> presumably the person adding the command line argument understands
>> this.
>>
>> We may also want to do some work so that there's informative dmesgs
>> on which BDFs need to be specified on the command line so it's not
>> so difficult for the user to figure out.
>
> I'd advise caution with a user supplied BDF approach, we have no
> guaranteed persistence for a device's PCI address. Adding a device
> might renumber the buses, replacing a device with one that consumes
> more/less bus numbers can renumber the buses, motherboard firmware
> updates could renumber the buses, pci=assign-buses can renumber the
> buses, etc. This is why the VT-d spec makes use of device paths when
> describing PCI hierarchies, firmware can't know what bus number will
> be assigned to a device, but it does know the base bus number and the
> path of devfns needed to get to it. I don't know how we come up with
> an option that's easy enough for a user to understand, but reasonably
> robust against hardware changes. Thanks,

True, but at the same time this feature is for "users with custom
hardware designed for purpose", I assume they would be willing to take
on the bus renumbering risk. It's already the case that
/sys/bus/pci/drivers//bind takes BDF, which is why it seemed to make a
similar interface for the command line. Ideally we could later get
something into ACPI or other platform firmware to arrange for bridges
to disable ACS by default if we see p2p becoming a
common-off-the-shelf feature. I.e. a BIOS switch to enable p2p in a
given PCI-E sub-domain.
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
> Yeah, so based on the discussion I'm leaning toward just having a
> command line option that takes a list of BDFs and disables ACS for
> them. (Essentially as Dan has suggested.) This avoids the shotgun.

I concur that this seems to be where the conversation is taking us.

@Alex - Before we go do this can you provide input on the approach? I
don't want to re-spin only to find we are still not converging on the
ACS issue.

Thanks

Stephen
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
On 05/08/2018 06:03 PM, Alex Williamson wrote:
> On Tue, 8 May 2018 21:42:27 +
> "Stephen Bates" wrote:
>
> > Hi Alex
> >
> > > But it would be a much easier proposal to disable ACS when the
> > > IOMMU is not enabled, ACS has no real purpose in that case.
> >
> > I guess one issue I have with this is that it disables IOMMU groups
> > for all Root Ports and not just the one(s) we wish to do p2pdma on.
>
> But as I understand this series, we're not really targeting specific
> sets of devices either. It's more of a shotgun approach that we
> disable ACS on downstream switch ports and hope that we get the right
> set of devices, but with the indecisiveness that we might later
> white-list select root ports to further increase the blast radius.
>
> > > The IOMMU and P2P are already not exclusive, we can bounce off
> > > the IOMMU or make use of ATS as we've previously discussed. We
> > > were previously talking about a build time config option that you
> > > didn't expect distros to use, so I don't think intervention for
> > > the user to disable the IOMMU if it's enabled by default is a
> > > serious concern either.
> >
> > ATS definitely makes things more interesting for the cases where
> > the EPs support it. However I don't really have a handle on how
> > common ATS support is going to be in the kinds of devices we have
> > been focused on (NVMe SSDs and RDMA NICs mostly).
> >
> > > What you're trying to do is enable direct peer-to-peer for
> > > endpoints which do not support ATS when the IOMMU is enabled,
> > > which is not something that necessarily makes sense to me.
> >
> > As above, the advantage of leaving the IOMMU on is that it allows
> > for both p2pdma PCI domains and IOMMU grouping PCI domains in the
> > same system. It is just that these domains will be separate from
> > each other.
>
> That argument makes sense if we had the ability to select specific
> sets of devices, but that's not the case here, right? With the
> shotgun approach, we're clearly favoring one at the expense of the
> other and it's not clear why we don't simply force the needle all the
> way in that direction such that the results are at least predictable.
>
> > > So that leaves avoiding bounce buffers as the remaining IOMMU
> > > feature
> >
> > I agree with you here that the devices we will want to use for p2p
> > will probably not require a bounce buffer and will support 64 bit
> > DMA addressing.
> >
> > > I'm still not seeing why it's terribly undesirable to require
> > > devices to support ATS if they want to do direct P2P with an
> > > IOMMU enabled.
> >
> > I think the one reason is for the use-case above. Allowing IOMMU
> > groupings on one domain and p2pdma on another domain.
>
> If IOMMU grouping implies device assignment (because nobody else uses
> it to the same extent as device assignment) then the build-time
> option falls to pieces, we need a single kernel that can do both. I
> think we need to get more clever about allowing the user to specify
> exactly at which points in the topology they want to disable
> isolation. Thanks,
>
> Alex

+1/ack

RDMA VFs lend themselves to NVMEoF w/device-assignment; need a way to
put NVME 'resources' into an assignable/manageable object for
'IOMMU-grouping', which is really a 'DMA security domain' and less an
'IOMMU grouping domain'.
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
On 08/05/18 04:03 PM, Alex Williamson wrote:
> If IOMMU grouping implies device assignment (because nobody else uses
> it to the same extent as device assignment) then the build-time
> option falls to pieces, we need a single kernel that can do both. I
> think we need to get more clever about allowing the user to specify
> exactly at which points in the topology they want to disable
> isolation. Thanks,

Yeah, so based on the discussion I'm leaning toward just having a
command line option that takes a list of BDFs and disables ACS for
them. (Essentially as Dan has suggested.) This avoids the shotgun.

Then, the pci_p2pdma_distance command needs to check that ACS is
disabled for all bridges between the two devices. If this is not the
case, it returns -1. Future work can check if the EP has ATS support,
in which case it has to check for the ACS direct translated bit.

A user then needs to either disable the IOMMU and/or add the command
line option to disable ACS for the specific downstream ports in the
PCI hierarchy. This means the IOMMU groups will be less granular but
presumably the person adding the command line argument understands
this.

We may also want to do some work so that there's informative dmesgs on
which BDFs need to be specified on the command line so it's not so
difficult for the user to figure out.

Logan
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
On Tue, 8 May 2018 21:42:27 +
"Stephen Bates" wrote:

> Hi Alex
>
> > But it would be a much easier proposal to disable ACS when the
> > IOMMU is not enabled, ACS has no real purpose in that case.
>
> I guess one issue I have with this is that it disables IOMMU groups
> for all Root Ports and not just the one(s) we wish to do p2pdma on.

But as I understand this series, we're not really targeting specific
sets of devices either. It's more of a shotgun approach that we
disable ACS on downstream switch ports and hope that we get the right
set of devices, but with the indecisiveness that we might later
white-list select root ports to further increase the blast radius.

> > The IOMMU and P2P are already not exclusive, we can bounce off
> > the IOMMU or make use of ATS as we've previously discussed. We were
> > previously talking about a build time config option that you
> > didn't expect distros to use, so I don't think intervention for the
> > user to disable the IOMMU if it's enabled by default is a serious
> > concern either.
>
> ATS definitely makes things more interesting for the cases where the
> EPs support it. However I don't really have a handle on how common
> ATS support is going to be in the kinds of devices we have been
> focused on (NVMe SSDs and RDMA NICs mostly).
>
> > What you're trying to do is enable direct peer-to-peer for
> > endpoints which do not support ATS when the IOMMU is enabled, which
> > is not something that necessarily makes sense to me.
>
> As above, the advantage of leaving the IOMMU on is that it allows for
> both p2pdma PCI domains and IOMMU grouping PCI domains in the same
> system. It is just that these domains will be separate from each
> other.

That argument makes sense if we had the ability to select specific
sets of devices, but that's not the case here, right? With the shotgun
approach, we're clearly favoring one at the expense of the other and
it's not clear why we don't simply force the needle all the way in
that direction such that the results are at least predictable.

> > So that leaves avoiding bounce buffers as the remaining IOMMU
> > feature
>
> I agree with you here that the devices we will want to use for p2p
> will probably not require a bounce buffer and will support 64 bit DMA
> addressing.
>
> > I'm still not seeing why it's terribly undesirable to require
> > devices to support ATS if they want to do direct P2P with an IOMMU
> > enabled.
>
> I think the one reason is for the use-case above. Allowing IOMMU
> groupings on one domain and p2pdma on another domain.

If IOMMU grouping implies device assignment (because nobody else uses
it to the same extent as device assignment) then the build-time option
falls to pieces, we need a single kernel that can do both. I think we
need to get more clever about allowing the user to specify exactly at
which points in the topology they want to disable isolation. Thanks,

Alex
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
Hi Alex

> But it would be a much easier proposal to disable ACS when the IOMMU
> is not enabled, ACS has no real purpose in that case.

I guess one issue I have with this is that it disables IOMMU groups
for all Root Ports and not just the one(s) we wish to do p2pdma on.

> The IOMMU and P2P are already not exclusive, we can bounce off the
> IOMMU or make use of ATS as we've previously discussed. We were
> previously talking about a build time config option that you didn't
> expect distros to use, so I don't think intervention for the user to
> disable the IOMMU if it's enabled by default is a serious concern
> either.

ATS definitely makes things more interesting for the cases where the
EPs support it. However I don't really have a handle on how common ATS
support is going to be in the kinds of devices we have been focused on
(NVMe SSDs and RDMA NICs mostly).

> What you're trying to do is enable direct peer-to-peer for endpoints
> which do not support ATS when the IOMMU is enabled, which is not
> something that necessarily makes sense to me.

As above, the advantage of leaving the IOMMU on is that it allows for
both p2pdma PCI domains and IOMMU grouping PCI domains in the same
system. It is just that these domains will be separate from each
other.

> So that leaves avoiding bounce buffers as the remaining IOMMU feature

I agree with you here that the devices we will want to use for p2p
will probably not require a bounce buffer and will support 64 bit DMA
addressing.

> I'm still not seeing why it's terribly undesirable to require devices
> to support ATS if they want to do direct P2P with an IOMMU enabled.

I think the one reason is for the use-case above. Allowing IOMMU
groupings on one domain and p2pdma on another domain.

Stephen
Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
On Tue, 8 May 2018 17:25:24 -0400
Don Dutile wrote:

> On 05/08/2018 12:57 PM, Alex Williamson wrote:
> > On Mon, 7 May 2018 18:23:46 -0500
> > Bjorn Helgaas wrote:
> >
> >> On Mon, Apr 23, 2018 at 05:30:32PM -0600, Logan Gunthorpe wrote:
> >>> Hi Everyone,
> >>>
> >>> Here's v4 of our series to introduce P2P based copy offload to
> >>> NVMe fabrics. This version has been rebased onto v4.17-rc2. A git
> >>> repo is here:
> >>>
> >>> https://github.com/sbates130272/linux-p2pmem pci-p2p-v4
> >>> ...
> >>
> >>> Logan Gunthorpe (14):
> >>>   PCI/P2PDMA: Support peer-to-peer memory
> >>>   PCI/P2PDMA: Add sysfs group to display p2pmem stats
> >>>   PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
> >>>   PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
> >>>   docs-rst: Add a new directory for PCI documentation
> >>>   PCI/P2PDMA: Add P2P DMA driver writer's documentation
> >>>   block: Introduce PCI P2P flags for request and request queue
> >>>   IB/core: Ensure we map P2P memory correctly in
> >>>     rdma_rw_ctx_[init|destroy]()
> >>>   nvme-pci: Use PCI p2pmem subsystem to manage the CMB
> >>>   nvme-pci: Add support for P2P memory in requests
> >>>   nvme-pci: Add a quirk for a pseudo CMB
> >>>   nvmet: Introduce helper functions to allocate and free request
> >>>     SGLs
> >>>   nvmet-rdma: Use new SGL alloc/free helper for requests
> >>>   nvmet: Optionally use PCI P2P memory
> >>>
> >>>  Documentation/ABI/testing/sysfs-bus-pci    |  25 +
> >>>  Documentation/PCI/index.rst                |  14 +
> >>>  Documentation/driver-api/index.rst         |   2 +-
> >>>  Documentation/driver-api/pci/index.rst     |  20 +
> >>>  Documentation/driver-api/pci/p2pdma.rst    | 166 ++
> >>>  Documentation/driver-api/{ => pci}/pci.rst |   0
> >>>  Documentation/index.rst                    |   3 +-
> >>>  block/blk-core.c                           |   3 +
> >>>  drivers/infiniband/core/rw.c               |  13 +-
> >>>  drivers/nvme/host/core.c                   |   4 +
> >>>  drivers/nvme/host/nvme.h                   |   8 +
> >>>  drivers/nvme/host/pci.c                    | 118 +++--
> >>>  drivers/nvme/target/configfs.c             |  67 +++
> >>>  drivers/nvme/target/core.c                 | 143 -
> >>>  drivers/nvme/target/io-cmd.c               |   3 +
> >>>  drivers/nvme/target/nvmet.h                |  15 +
> >>>  drivers/nvme/target/rdma.c                 |  22 +-
> >>>  drivers/pci/Kconfig                        |  26 +
> >>>  drivers/pci/Makefile                       |   1 +
> >>>  drivers/pci/p2pdma.c                       | 814 +
> >>>  drivers/pci/pci.c                          |   6 +
> >>>  include/linux/blk_types.h                  |  18 +-
> >>>  include/linux/blkdev.h                     |   3 +
> >>>  include/linux/memremap.h                   |  19 +
> >>>  include/linux/pci-p2pdma.h                 | 118 +
> >>>  include/linux/pci.h                        |   4 +
> >>>  26 files changed, 1579 insertions(+), 56 deletions(-)
> >>>  create mode 100644 Documentation/PCI/index.rst
> >>>  create mode 100644 Documentation/driver-api/pci/index.rst
> >>>  create mode 100644 Documentation/driver-api/pci/p2pdma.rst
> >>>  rename Documentation/driver-api/{ => pci}/pci.rst (100%)
> >>>  create mode 100644 drivers/pci/p2pdma.c
> >>>  create mode 100644 include/linux/pci-p2pdma.h
> >>
> >> How do you envision merging this? There's a big chunk in
> >> drivers/pci, but really no opportunity for conflicts there, and
> >> there's significant stuff in block and nvme that I don't really
> >> want to merge.
> >>
> >> If Alex is OK with the ACS situation, I can ack the PCI parts and
> >> you could merge it elsewhere?
> >
> > AIUI from previously questioning this, the change is hidden behind a
> > build-time config option and only custom kernels or distros
> > optimized for this sort of support would enable that build option.
> > I'm more than a little dubious though that we're not going to have a
> > wave of distros enabling this only to get user complaints that they
> > can no longer make effective use of their devices for assignment due
> > to the resulting span of the IOMMU groups, nor is there any sort of
> > compromise, configure the kernel for p2p or device assignment, not
> > both. Is this really such a unique feature that distro users aren't
> > going to be asking for both features? Thanks,
> >
> > Alex
>
> At least 1/2 the cases presented to me by existing customers want it
> in a tunable kernel, and tunable btwn two points, if the hw allows it
> to be 'contained' in that manner, which a (layer of) switch(ing)
> provides. To me, that means a kernel cmdline parameter to _enable_,
> and another sysfs (configfs? ... i'm not a configfs afficionado to
> say which is best) method to make two points p2p dma capable. That's
> not what's done
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
Hi Jerome

> I think there is confusion here, Alex properly explained the scheme.
> PCIE-device do a ATS request to the IOMMU which returns a valid
> translation for a virtual address. Device can then use that address
> directly without going through IOMMU for translation.

This makes sense and to be honest I now understand ATS and its
interaction with ACS a lot better than I did 24 hours ago ;-).

> ATS is implemented by the IOMMU not by the device (well device
> implement the client side of it). Also ATS is meaningless without
> something like PASID as far as i know.

I think it's the client side that is what is important to us. Not many
EPs support ATS today and it's not clear if many will in the future.
So assuming we want to do p2pdma between devices (some of) which do
NOT support ATS, how best do we handle the ACS issue? Disabling the
IOMMU seems a bit strong to me given this impacts all the PCI domains
in the system and not just the domain we wish to do P2P on.

Stephen
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
Hi Don

> Well, p2p DMA is a function of a cooperating 'agent' somewhere above the two
> devices. That agent should 'request' to the kernel that ACS be
> removed/circumvented (p2p enabled) btwn two endpoints.
> I recommend doing so via a sysfs method.

Yes, we looked at something like this in the past but it does hit the IOMMU grouping issue I discussed earlier today, which is not acceptable right now. In the long term, once we get IOMMU grouping callbacks to VMs, we can look at extending p2pdma in this way. But I don't think this is viable for the initial series.

> So I don't understand the comments why VMs should need to know.

As I understand it, VMs need to know because VFIO passes IOMMU grouping up into the VMs. So if an IOMMU grouping changes, the VM's view of its PCIe topology changes. I think we even have to be cognizant of the fact the OS running on the VM may not even support hot-plug of PCI devices.

> Is there a thread I need to read up to explain/clear-up the thoughts above?

If you search for p2pdma you should find the previous discussions. Thanks for the input!

Stephen
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
On Tue, 8 May 2018 14:49:23 -0600, Logan Gunthorpe wrote:
> On 08/05/18 02:43 PM, Alex Williamson wrote:
> > Yes, GPUs seem to be leading the pack in implementing ATS. So now the
> > dumb question, why not simply turn off the IOMMU and thus ACS? The
> > argument of using the IOMMU for security is rather diminished if we're
> > specifically enabling devices to poke one another directly and clearly
> > this isn't favorable for device assignment either. Are there target
> > systems where this is not a simple kernel commandline option? Thanks,
>
> Well, turning off the IOMMU doesn't necessarily turn off ACS. We've run
> into some bios's that set the bits on boot (which is annoying).

But it would be a much easier proposal to disable ACS when the IOMMU is not enabled; ACS has no real purpose in that case.

> I also don't expect people will respond well to making the IOMMU and P2P
> exclusive. The IOMMU is often used for more than just security and on
> many platforms it's enabled by default. I'd much rather allow IOMMU use
> but have fewer isolation groups in much the same way as if you had PCI
> bridges that didn't support ACS.

The IOMMU and P2P are already not exclusive, we can bounce off the IOMMU or make use of ATS as we've previously discussed. We were previously talking about a build-time config option that you didn't expect distros to use, so I don't think intervention for the user to disable the IOMMU if it's enabled by default is a serious concern either.

What you're trying to do is enable direct peer-to-peer for endpoints which do not support ATS when the IOMMU is enabled, which is not something that necessarily makes sense to me. As I mentioned in a previous reply, the IOMMU provides us with an I/O virtual address space for devices; ACS is meant to fill the topology-based gaps in that virtual address space, making transactions follow IOMMU-compliant routing rules to avoid aliases between the IOVA and physical address spaces.
But this series specifically wants to leave those gaps open for direct P2P access. So we compromise the P2P aspect of security, still protecting RAM, but potentially only to the extent that a device cannot hop through or interfere with other devices to do its bidding. Device assignment is mostly tossed out the window because not only are bigger groups more difficult to deal with, the IOVA space is riddled with gaps, which is not really a solved problem. So that leaves avoiding bounce buffers as the remaining IOMMU feature, but we're dealing with native express devices and relatively high end devices that are probably installed in modern systems, so that seems like a non-issue. Are there other uses I'm forgetting? We can enable interrupt remapping separate from DMA translation, so we can exclude that one. I'm still not seeing why it's terribly undesirable to require devices to support ATS if they want to do direct P2P with an IOMMU enabled. Thanks, Alex
Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
On 05/08/2018 12:57 PM, Alex Williamson wrote:
> On Mon, 7 May 2018 18:23:46 -0500, Bjorn Helgaas wrote:
>> On Mon, Apr 23, 2018 at 05:30:32PM -0600, Logan Gunthorpe wrote:
>>> Hi Everyone,
>>>
>>> Here's v4 of our series to introduce P2P based copy offload to NVMe
>>> fabrics. This version has been rebased onto v4.17-rc2. A git repo
>>> is here:
>>>
>>> https://github.com/sbates130272/linux-p2pmem pci-p2p-v4
>>> ...
>>>
>>> Logan Gunthorpe (14):
>>>   PCI/P2PDMA: Support peer-to-peer memory
>>>   PCI/P2PDMA: Add sysfs group to display p2pmem stats
>>>   PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
>>>   PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
>>>   docs-rst: Add a new directory for PCI documentation
>>>   PCI/P2PDMA: Add P2P DMA driver writer's documentation
>>>   block: Introduce PCI P2P flags for request and request queue
>>>   IB/core: Ensure we map P2P memory correctly in rdma_rw_ctx_[init|destroy]()
>>>   nvme-pci: Use PCI p2pmem subsystem to manage the CMB
>>>   nvme-pci: Add support for P2P memory in requests
>>>   nvme-pci: Add a quirk for a pseudo CMB
>>>   nvmet: Introduce helper functions to allocate and free request SGLs
>>>   nvmet-rdma: Use new SGL alloc/free helper for requests
>>>   nvmet: Optionally use PCI P2P memory
>>>
>>>  Documentation/ABI/testing/sysfs-bus-pci    |  25 +
>>>  Documentation/PCI/index.rst                |  14 +
>>>  Documentation/driver-api/index.rst         |   2 +-
>>>  Documentation/driver-api/pci/index.rst     |  20 +
>>>  Documentation/driver-api/pci/p2pdma.rst    | 166 ++
>>>  Documentation/driver-api/{ => pci}/pci.rst |   0
>>>  Documentation/index.rst                    |   3 +-
>>>  block/blk-core.c                           |   3 +
>>>  drivers/infiniband/core/rw.c               |  13 +-
>>>  drivers/nvme/host/core.c                   |   4 +
>>>  drivers/nvme/host/nvme.h                   |   8 +
>>>  drivers/nvme/host/pci.c                    | 118 +++--
>>>  drivers/nvme/target/configfs.c             |  67 +++
>>>  drivers/nvme/target/core.c                 | 143 -
>>>  drivers/nvme/target/io-cmd.c               |   3 +
>>>  drivers/nvme/target/nvmet.h                |  15 +
>>>  drivers/nvme/target/rdma.c                 |  22 +-
>>>  drivers/pci/Kconfig                        |  26 +
>>>  drivers/pci/Makefile                       |   1 +
>>>  drivers/pci/p2pdma.c                       | 814 +
>>>  drivers/pci/pci.c                          |   6 +
>>>  include/linux/blk_types.h                  |  18 +-
>>>  include/linux/blkdev.h                     |   3 +
>>>  include/linux/memremap.h                   |  19 +
>>>  include/linux/pci-p2pdma.h                 | 118 +
>>>  include/linux/pci.h                        |   4 +
>>>  26 files changed, 1579 insertions(+), 56 deletions(-)
>>>  create mode 100644 Documentation/PCI/index.rst
>>>  create mode 100644 Documentation/driver-api/pci/index.rst
>>>  create mode 100644 Documentation/driver-api/pci/p2pdma.rst
>>>  rename Documentation/driver-api/{ => pci}/pci.rst (100%)
>>>  create mode 100644 drivers/pci/p2pdma.c
>>>  create mode 100644 include/linux/pci-p2pdma.h
>>
>> How do you envision merging this? There's a big chunk in drivers/pci, but
>> really no opportunity for conflicts there, and there's significant stuff in
>> block and nvme that I don't really want to merge.
>>
>> If Alex is OK with the ACS situation, I can ack the PCI parts and you could
>> merge it elsewhere?
>
> AIUI from previously questioning this, the change is hidden behind a
> build-time config option and only custom kernels or distros optimized
> for this sort of support would enable that build option. I'm more than
> a little dubious though that we're not going to have a wave of distros
> enabling this only to get user complaints that they can no longer make
> effective use of their devices for assignment due to the resulting span
> of the IOMMU groups, nor is there any sort of compromise, configure
> the kernel for p2p or device assignment, not both. Is this really such
> a unique feature that distro users aren't going to be asking for both
> features? Thanks,
>
> Alex

At least 1/2 the cases presented to me by existing customers want it in a tunable kernel, and tunable btwn two points, if the hw allows it to be 'contained' in that manner, which a (layer of) switch(ing) provides.

To me, that means a kernel cmdline parameter to _enable_, and another sysfs (configfs? ... i'm not a configfs aficionado to say which is best) method to make two points p2p dma capable.
Worse case, the whole system is one large IOMMU group (current mindset of this static or run-time config option), or best case (over time, more hw), a secure set of the primary system with p2p-enabled sections, that are deemed 'safe' or 'self-inflicting-unsecure', the latter the case of today's VM with an assigned device -- can scribble all over the VM, but no other VM and not the host/HV.
Re: bug in tag handling in blk-mq?
On 5/8/18 2:37 PM, Jens Axboe wrote:
> On 5/8/18 10:42 AM, Mike Galbraith wrote:
>> On Tue, 2018-05-08 at 08:55 -0600, Jens Axboe wrote:
>>>
>>> All the block debug files are empty...
>>
>> Sigh. Take 2, this time cat debug files, having turned block tracing
>> off before doing anything else (so trace bits in dmesg.txt should end
>> AT the stall).
>
> OK, that's better. What I see from the traces:
>
> - You have regular IO and some non-fs IO (from scsi_execute()). This mix
>   may be key.
>
> - sdd has nothing pending, yet has 6 active waitqueues.
>
> I'm going to see if I can reproduce this. Paolo, what kind of attempts
> to reproduce this have you done?

No luck so far. Out of the patches you referenced, I can only find the shallow depth change, since that's in the parent of this email. Can you send those as well? Perhaps also expand a bit on exactly what you are running. File system, mount options, etc.

-- 
Jens Axboe
Re: general protection fault in lo_ioctl (2)
Theodore Y. Ts'o wrote: > On Tue, May 08, 2018 at 08:05:12PM +0900, Tetsuo Handa wrote: > > > > So, it is time to think how to solve this race condition, as well as how to > > solve > > lockdep's deadlock warning (and I guess that syzbot is actually hitting > > deadlocks). > > An approach which serializes loop operations using global lock was proposed > > at > > https://groups.google.com/d/msg/syzkaller-bugs/2Rw8-OM6IbM/PzdobV8kAgAJ . > > Please respond... > > I'm looking at your patch which you proposed on this, and the locking > architecture still looks way too complex. Things like > loop_mutex_owner, and all of the infrastructure around > lo->ioctl_in_progress should be removed, if at all possible. The patch in the above link no longer uses "lo->ioctl_in_progress". You looked at previous version rather than current version. > > I believe it should be possible to do things with a single global > mutex, some code refactoring, and some unlocked versions of some of > the functions. The patch in the above link uses single global mutex "loop_mutex". > > Again, this requires root, and it requires someone deliberately trying > to induce a race. So "it's time" is not necessarily the priority I > would set for this item. But if we are going to fix it, let's fix it > right, and not make the code more complex and less maintainable, all > in the name of trying to make a rare, not-likely-to-happen-in-real-life > syzbot reported problem to go away. While NULL pointer dereference would be rare, deadlocks might not be rare enough to postpone the patch. Deadlocks can cause pile of false-positive hung task reports and can prevent syzbot from finding other bugs. That's why I say "it is time to think".
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
On 05/08/2018 10:44 AM, Stephen Bates wrote:
> Hi Dan
>
>> It seems unwieldy that this is a compile time option and not a runtime
>> option. Can't we have a kernel command line option to opt-in to this
>> behavior rather than require a wholly separate kernel image?
>
> I think because of the security implications associated with p2pdma and
> ACS we wanted to make it very clear people were choosing one (p2pdma) or
> the other (IOMMU groupings and isolation). However personally I would
> prefer including the option of a run-time kernel parameter too. In fact
> a few months ago I proposed a small patch that did just that [1]. It
> never really went anywhere but if people were open to the idea we could
> look at adding it to the series.

It is clear if it is a kernel command-line option or a CONFIG option. One does not have access to the kernel command-line w/o a few privs. A CONFIG option prevents a distribution from having a default, locked-down kernel _and_ the ability to be 'unlocked' if the customer/site is 'secure' via other means. A run/boot-time option is more flexible and achieves the best of both.

>> Why is this text added in a follow on patch and not the patch that
>> introduced the config option?
>
> Because the ACS section was added later in the series and this
> information is associated with that additional functionality.
>
>> I'm also wondering if that command line option can take a 'bus device
>> function' address of a switch to limit the scope of where ACS is
>> disabled.

Well, p2p DMA is a function of a cooperating 'agent' somewhere above the two devices. That agent should 'request' to the kernel that ACS be removed/circumvented (p2p enabled) btwn two endpoints. I recommend doing so via a sysfs method. That way, the system can limit the 'unsecure' space btwn two devices, likely configured on a separate switch, from the rest of the still-secured/ACS-enabled PCIe tree.

PCIe is pt-to-pt, effectively; maybe one would have multiple nics/fabrics p2p to/from NVME, but one could look at it as a list of pairs (nic1<->nvme1; nic2<->nvme2; ...). A pair-listing would be optimal, allowing the kernel to figure out the ACS path, and not making it endpoint-switch-switch-...-switch-endpt error-entry prone. Additionally, systems that can/prefer to do so via a RP's IOMMU, albeit not optimal, but better than all the way to/from memory, and a security/iova-check possible, can modify the pt-to-pt ACS algorithm to accommodate over time (e.g., cap bits be they hw or device-driver/extension/quirk defined for each bridge/RP in a PCI domain).

Kernels that never want to support P2P could build w/o it enabled -- the cmdline option is moot. Kernels built with it on *still* need the cmdline option, to be blunt that the kernel is enabling a feature that could render the entire (IO sub)system unsecure.

> By this you mean the address for either a RP, DSP, USP or MF EP below
> which we disable ACS? We could do that but I don't think it avoids the
> issue of changes in IOMMU groupings as devices are added/removed. It
> simply changes the problem from affecting an entire PCI domain to a
> sub-set of the domain. We can already handle this by doing p2pdma on one
> RP and normal IOMMU isolation on the other RPs in the system.

As devices are added, they start in ACS-enabled, secured mode. As a sysfs entry modifies p2p ability, the IOMMU group is modified as well.

btw -- IOMMU grouping is a host/HV control issue, not a VM control/knowledge issue. So I don't understand the comments why VMs should need to know.
-- configure p2p _before_ assigning devices to VMs. ... iommu groups are checked at assignment time.
-- so even if hot-add, separate iommu group, then enable p2p, becomes same IOMMU group, then can only assign to same VM.
-- VMs don't know IOMMU's & ACS are involved now, and won't later, even if devices are dynamically added/removed.

Is there a thread I need to read up to explain/clear-up the thoughts above?

> Stephen
>
> [1] https://marc.info/?l=linux-doc=150907188310838=2
Re: [PATCH 1/4] block: break discard submissions into the user defined size
On 5/8/18 2:43 PM, Omar Sandoval wrote:
> On Mon, May 07, 2018 at 10:13:32AM -0600, Jens Axboe wrote:
>> Don't build discards bigger than what the user asked for, if the
>> user decided to limit the size by writing to 'discard_max_bytes'.
>>
>> Signed-off-by: Jens Axboe
>> ---
>>  block/blk-lib.c | 7 ++++---
>>  1 file changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/block/blk-lib.c b/block/blk-lib.c
>> index a676084d4740..7417d617091b 100644
>> --- a/block/blk-lib.c
>> +++ b/block/blk-lib.c
>> @@ -62,10 +62,11 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>>  	unsigned int req_sects;
>>  	sector_t end_sect, tmp;
>>
>> -	/* Make sure bi_size doesn't overflow */
>> -	req_sects = min_t(sector_t, nr_sects, UINT_MAX >> 9);
>> +	/* Issue in chunks of the user defined max discard setting */
>> +	req_sects = min_t(sector_t, nr_sects,
>> +			  q->limits.max_discard_sectors);
>
> Some drivers, including nvme, do
>
> 	blk_queue_max_discard_sectors(q, UINT_MAX)
>
> That means q->limits.max_discard_sectors can be UINT_MAX, and
>
> 	bio->bi_iter.bi_size = req_sects << 9;
>
> will overflow. We should probably cap max_discard_sectors at
> UINT_MAX >> 9 in a prep patch.

Probably better to just keep the local overflow check, I'll do that. Thanks for spotting it!

-- 
Jens Axboe
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
On Tue, May 08, 2018 at 02:19:05PM -0600, Logan Gunthorpe wrote: > > > On 08/05/18 02:13 PM, Alex Williamson wrote: > > Well, I'm a bit confused, this patch series is specifically disabling > > ACS on switches, but per the spec downstream switch ports implementing > > ACS MUST implement direct translated P2P. So it seems the only > > potential gap here is the endpoint, which must support ATS or else > > there's nothing for direct translated P2P to do. The switch port plays > > no part in the actual translation of the request, ATS on the endpoint > > has already cached the translation and is now attempting to use it. > > For the switch port, this only becomes a routing decision, the request > > is already translated, therefore ACS RR and EC can be ignored to > > perform "normal" (direct) routing, as if ACS were not present. It would > > be a shame to go to all the trouble of creating this no-ACS mode to find > > out the target hardware supports ATS and should have simply used it, or > > we should have disabled the IOMMU altogether, which leaves ACS disabled. > > Ah, ok, I didn't think it was the endpoint that had to implement ATS. > But in that case, for our application, we need NVMe cards and RDMA NICs > to all have ATS support and I expect that is just as unlikely. At least > none of the endpoints on my system support it. Maybe only certain GPUs > have this support. I think there is confusion here, Alex properly explained the scheme PCIE-device do a ATS request to the IOMMU which returns a valid translation for a virtual address. Device can then use that address directly without going through IOMMU for translation. ATS is implemented by the IOMMU not by the device (well device implement the client side of it). Also ATS is meaningless without something like PASID as far as i know. Cheers, Jérôme
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
On 08/05/18 02:43 PM, Alex Williamson wrote: > Yes, GPUs seem to be leading the pack in implementing ATS. So now the > dumb question, why not simply turn off the IOMMU and thus ACS? The > argument of using the IOMMU for security is rather diminished if we're > specifically enabling devices to poke one another directly and clearly > this isn't favorable for device assignment either. Are there target > systems where this is not a simple kernel commandline option? Thanks, Well, turning off the IOMMU doesn't necessarily turn off ACS. We've run into some bios's that set the bits on boot (which is annoying). I also don't expect people will respond well to making the IOMMU and P2P exclusive. The IOMMU is often used for more than just security and on many platforms it's enabled by default. I'd much rather allow IOMMU use but have fewer isolation groups in much the same way as if you had PCI bridges that didn't support ACS. Logan
Re: [PATCH 4/4] blk-wbt: throttle discards like background writes
On Mon, May 07, 2018 at 10:13:35AM -0600, Jens Axboe wrote:
> Throttle discards like we would any background write. Discards should
> be background activity, so if they are impacting foreground IO, then
> we will throttle them down.

Seems reasonable.

Reviewed-by: Omar Sandoval

> Signed-off-by: Jens Axboe
> ---
>  block/blk-stat.h | 6 +++---
>  block/blk-wbt.c  | 43 ++-
>  block/blk-wbt.h  | 4 +++-
>  3 files changed, 32 insertions(+), 21 deletions(-)
Re: [PATCH 3/4] blk-wbt: pass in enum wbt_flags to get_rq_wait()
On Mon, May 07, 2018 at 10:13:34AM -0600, Jens Axboe wrote:
> This is in preparation for having more write queues, in which
> case we would have needed to pass in more information than just
> a simple 'is_kswapd' boolean.

Reviewed-by: Omar Sandoval

> Signed-off-by: Jens Axboe
> ---
>  block/blk-wbt.c | 25 +++--
>  block/blk-wbt.h | 4 +++-
>  2 files changed, 18 insertions(+), 11 deletions(-)
Re: [PATCH 2/4] blk-wbt: account any writing command as a write
On Mon, May 07, 2018 at 10:13:33AM -0600, Jens Axboe wrote:
> We currently special case WRITE and FLUSH, but we should really
> just include any command with the write bit set. This ensures
> that we account DISCARD.
>
> Reviewed-by: Christoph Hellwig

Reviewed-by: Omar Sandoval

> Signed-off-by: Jens Axboe
> ---
>  block/blk-wbt.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/block/blk-wbt.c b/block/blk-wbt.c
> index f92fc84b5e2c..3e34b41bcefc 100644
> --- a/block/blk-wbt.c
> +++ b/block/blk-wbt.c
> @@ -701,7 +701,7 @@ static int wbt_data_dir(const struct request *rq)
>
>  	if (op == REQ_OP_READ)
>  		return READ;
> -	else if (op == REQ_OP_WRITE || op == REQ_OP_FLUSH)
> +	else if (op_is_write(op))
>  		return WRITE;
>
>  	/* don't account */
> --
> 2.7.4
Re: [PATCH 1/4] block: break discard submissions into the user defined size
On Mon, May 07, 2018 at 10:13:32AM -0600, Jens Axboe wrote:
> Don't build discards bigger than what the user asked for, if the
> user decided to limit the size by writing to 'discard_max_bytes'.
>
> Signed-off-by: Jens Axboe
> ---
>  block/blk-lib.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index a676084d4740..7417d617091b 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -62,10 +62,11 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>  	unsigned int req_sects;
>  	sector_t end_sect, tmp;
>
> -	/* Make sure bi_size doesn't overflow */
> -	req_sects = min_t(sector_t, nr_sects, UINT_MAX >> 9);
> +	/* Issue in chunks of the user defined max discard setting */
> +	req_sects = min_t(sector_t, nr_sects,
> +			  q->limits.max_discard_sectors);

Some drivers, including nvme, do

	blk_queue_max_discard_sectors(q, UINT_MAX)

That means q->limits.max_discard_sectors can be UINT_MAX, and

	bio->bi_iter.bi_size = req_sects << 9;

will overflow. We should probably cap max_discard_sectors at UINT_MAX >> 9 in a prep patch.
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
On Tue, 8 May 2018 14:19:05 -0600, Logan Gunthorpe wrote:
> On 08/05/18 02:13 PM, Alex Williamson wrote:
> > Well, I'm a bit confused, this patch series is specifically disabling
> > ACS on switches, but per the spec downstream switch ports implementing
> > ACS MUST implement direct translated P2P. So it seems the only
> > potential gap here is the endpoint, which must support ATS or else
> > there's nothing for direct translated P2P to do. The switch port plays
> > no part in the actual translation of the request, ATS on the endpoint
> > has already cached the translation and is now attempting to use it.
> > For the switch port, this only becomes a routing decision, the request
> > is already translated, therefore ACS RR and EC can be ignored to
> > perform "normal" (direct) routing, as if ACS were not present. It would
> > be a shame to go to all the trouble of creating this no-ACS mode to find
> > out the target hardware supports ATS and should have simply used it, or
> > we should have disabled the IOMMU altogether, which leaves ACS disabled.
>
> Ah, ok, I didn't think it was the endpoint that had to implement ATS.
> But in that case, for our application, we need NVMe cards and RDMA NICs
> to all have ATS support and I expect that is just as unlikely. At least
> none of the endpoints on my system support it. Maybe only certain GPUs
> have this support.

Yes, GPUs seem to be leading the pack in implementing ATS. So now the dumb question, why not simply turn off the IOMMU and thus ACS? The argument of using the IOMMU for security is rather diminished if we're specifically enabling devices to poke one another directly and clearly this isn't favorable for device assignment either. Are there target systems where this is not a simple kernel commandline option? Thanks,

Alex
Re: bug in tag handling in blk-mq?
On 5/8/18 10:42 AM, Mike Galbraith wrote:
> On Tue, 2018-05-08 at 08:55 -0600, Jens Axboe wrote:
>>
>> All the block debug files are empty...
>
> Sigh. Take 2, this time cat debug files, having turned block tracing
> off before doing anything else (so trace bits in dmesg.txt should end
> AT the stall).

OK, that's better. What I see from the traces:

- You have regular IO and some non-fs IO (from scsi_execute()). This mix may be key.

- sdd has nothing pending, yet has 6 active waitqueues.

I'm going to see if I can reproduce this. Paolo, what kind of attempts to reproduce this have you done?

-- 
Jens Axboe
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
On 08/05/18 02:13 PM, Alex Williamson wrote: > Well, I'm a bit confused, this patch series is specifically disabling > ACS on switches, but per the spec downstream switch ports implementing > ACS MUST implement direct translated P2P. So it seems the only > potential gap here is the endpoint, which must support ATS or else > there's nothing for direct translated P2P to do. The switch port plays > no part in the actual translation of the request, ATS on the endpoint > has already cached the translation and is now attempting to use it. > For the switch port, this only becomes a routing decision, the request > is already translated, therefore ACS RR and EC can be ignored to > perform "normal" (direct) routing, as if ACS were not present. It would > be a shame to go to all the trouble of creating this no-ACS mode to find > out the target hardware supports ATS and should have simply used it, or > we should have disabled the IOMMU altogether, which leaves ACS disabled. Ah, ok, I didn't think it was the endpoint that had to implement ATS. But in that case, for our application, we need NVMe cards and RDMA NICs to all have ATS support and I expect that is just as unlikely. At least none of the endpoints on my system support it. Maybe only certain GPUs have this support. Logan
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
On Tue, 8 May 2018 13:45:50 -0600, Logan Gunthorpe wrote:
> On 08/05/18 01:34 PM, Alex Williamson wrote:
> > They are not so unrelated, see the ACS Direct Translated P2P
> > capability, which in fact must be implemented by switch downstream
> > ports implementing ACS and works specifically with ATS. This appears to
> > be the way the PCI SIG would intend for P2P to occur within an IOMMU
> > managed topology, routing pre-translated DMA directly between peer
> > devices while requiring non-translated requests to bounce through the
> > IOMMU. Really, what's the value of having an I/O virtual address space
> > provided by an IOMMU if we're going to allow physical DMA between
> > downstream devices, couldn't we just turn off the IOMMU altogether? Of
> > course ATS is not without holes itself, basically that we trust the
> > endpoint's implementation of ATS implicitly. Thanks,
>
> I agree that this is what the SIG intends, but I don't think hardware
> fully supports this methodology yet. The Direct Translated capability
> just requires switches to forward packets that have the AT request type
> set. It does not require them to do the translation or to support ATS
> such that P2P requests can be translated by the IOMMU. I expect this is
> so that a downstream device can implement ATS and not get messed up by
> an upstream switch that doesn't support it.

Well, I'm a bit confused, this patch series is specifically disabling ACS on switches, but per the spec downstream switch ports implementing ACS MUST implement direct translated P2P. So it seems the only potential gap here is the endpoint, which must support ATS or else there's nothing for direct translated P2P to do. The switch port plays no part in the actual translation of the request, ATS on the endpoint has already cached the translation and is now attempting to use it.
For the switch port, this only becomes a routing decision, the request is already translated, therefore ACS RR and EC can be ignored to perform "normal" (direct) routing, as if ACS were not present. It would be a shame to go to all the trouble of creating this no-ACS mode to find out the target hardware supports ATS and should have simply used it, or we should have disabled the IOMMU altogether, which leaves ACS disabled. Thanks, Alex
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
On 08/05/18 01:34 PM, Alex Williamson wrote:
> They are not so unrelated, see the ACS Direct Translated P2P
> capability, which in fact must be implemented by switch downstream
> ports implementing ACS and works specifically with ATS. This appears to
> be the way the PCI SIG would intend for P2P to occur within an IOMMU
> managed topology, routing pre-translated DMA directly between peer
> devices while requiring non-translated requests to bounce through the
> IOMMU. Really, what's the value of having an I/O virtual address space
> provided by an IOMMU if we're going to allow physical DMA between
> downstream devices, couldn't we just turn off the IOMMU altogether? Of
> course ATS is not without holes itself, basically that we trust the
> endpoint's implementation of ATS implicitly. Thanks,

I agree that this is what the SIG intends, but I don't think hardware fully supports this methodology yet. The Direct Translated capability just requires switches to forward packets that have the AT request type set. It does not require them to do the translation or to support ATS such that P2P requests can be translated by the IOMMU. I expect this is so that a downstream device can implement ATS and not get messed up by an upstream switch that doesn't support it.

Logan
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
On Tue, 8 May 2018 13:13:40 -0600, Logan Gunthorpe wrote:
> On 08/05/18 10:50 AM, Christian König wrote:
> > E.g. transactions are initially send to the root complex for
> > translation, that's for sure. But at least for AMD GPUs the root complex
> > answers with the translated address which is then cached in the device.
> >
> > So further transactions for the same address range then go directly to
> > the destination.
>
> Sounds like you are referring to Address Translation Services (ATS).
> This is quite separate from ACS and, to my knowledge, isn't widely
> supported by switch hardware.

They are not so unrelated, see the ACS Direct Translated P2P capability, which in fact must be implemented by switch downstream ports implementing ACS and works specifically with ATS. This appears to be the way the PCI SIG would intend for P2P to occur within an IOMMU managed topology, routing pre-translated DMA directly between peer devices while requiring non-translated requests to bounce through the IOMMU. Really, what's the value of having an I/O virtual address space provided by an IOMMU if we're going to allow physical DMA between downstream devices, couldn't we just turn off the IOMMU altogether? Of course ATS is not without holes itself, basically that we trust the endpoint's implementation of ATS implicitly. Thanks,

Alex
Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
On 08/05/18 10:57 AM, Alex Williamson wrote: > AIUI from previously questioning this, the change is hidden behind a > build-time config option and only custom kernels or distros optimized > for this sort of support would enable that build option. I'm more than > a little dubious though that we're not going to have a wave of distros > enabling this only to get user complaints that they can no longer make > effective use of their devices for assignment due to the resulting span > of the IOMMU groups, nor is there any sort of compromise, configure > the kernel for p2p or device assignment, not both. Is this really such > a unique feature that distro users aren't going to be asking for both > features? Thanks, I think it is. But it sounds like the majority want this to be a command line option. So we will look at doing that for v5. Logan
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
On 08/05/18 10:50 AM, Christian König wrote: > E.g. transactions are initially sent to the root complex for > translation, that's for sure. But at least for AMD GPUs the root complex > answers with the translated address which is then cached in the device. > > So further transactions for the same address range then go directly to > the destination.

Sounds like you are referring to Address Translation Services (ATS). This is quite separate from ACS and, to my knowledge, isn't widely supported by switch hardware.

Logan
[PATCH v3 0/3] AIO add per-command iopriority
From: Adam Manzanares

This is the per-I/O equivalent of the ioprio_set system call. See the following link for performance implications on a SATA HDD: https://lkml.org/lkml/2016/12/6/495

First patch factors the ioprio_check_cap function out of the ioprio_set system call to also be used by the aio ioprio interface. Second patch converts the kiocb ki_hint field to a u16 to avoid kiocb bloat. Third patch passes the ioprio hint from the aio iocb to the kiocb and enables block_dev usage of the per-I/O ioprio feature.

v2:
 - merge patches
 - use IOCB_FLAG_IOPRIO
 - validate intended use with IOCB_IOPRIO
 - add linux-api and linux-block to cc

v3:
 - add ioprio_check_cap function
 - convert kiocb ki_hint to u16
 - use ioprio_check_cap when adding ioprio to kiocb in aio.c

Adam Manzanares (3):
  block: add ioprio_check_cap function
  fs: Convert kiocb rw_hint from enum to u16
  fs: Add aio iopriority support for block_dev

 block/ioprio.c               | 22 ++++++++++++++++------
 fs/aio.c                     | 16 ++++++++++++++++
 fs/block_dev.c               |  2 ++
 include/linux/fs.h           | 17 +++++++++++++---
 include/linux/ioprio.h       |  2 ++
 include/uapi/linux/aio_abi.h |  1 +
 6 files changed, 52 insertions(+), 8 deletions(-)

-- 
2.15.1
[PATCH v3 3/3] fs: Add aio iopriority support for block_dev
From: Adam Manzanares

When IOCB_FLAG_IOPRIO is set on the iocb aio_flags field, then we set the newly added kiocb ki_ioprio field to the value in the iocb aio_reqprio field. When a bio is created for an aio request by the block dev we set the priority value of the bio to the user supplied value.

Signed-off-by: Adam Manzanares
---
 fs/aio.c                     | 16 ++++++++++++++++
 fs/block_dev.c               |  2 ++
 include/linux/fs.h           |  2 ++
 include/uapi/linux/aio_abi.h |  1 +
 4 files changed, 21 insertions(+)

diff --git a/fs/aio.c b/fs/aio.c
index 88d7927ffbc6..f43d1d3a2e39 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1597,6 +1597,22 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 		req->common.ki_flags |= IOCB_EVENTFD;
 	}
 
+	if (iocb->aio_flags & IOCB_FLAG_IOPRIO) {
+		/*
+		 * If the IOCB_FLAG_IOPRIO flag of aio_flags is set, then
+		 * aio_reqprio is interpreted as an I/O scheduling
+		 * class and priority.
+		 */
+		ret = ioprio_check_cap(iocb->aio_reqprio);
+		if (ret) {
+			pr_debug("aio ioprio check cap error\n");
+			goto out_put_req;
+		}
+
+		req->common.ki_ioprio = iocb->aio_reqprio;
+		req->common.ki_flags |= IOCB_IOPRIO;
+	}
+
 	ret = kiocb_set_rw_flags(&req->common, iocb->aio_rw_flags);
 	if (unlikely(ret)) {
 		pr_debug("EINVAL: aio_rw_flags\n");
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 7ec920e27065..970bef79caa6 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -355,6 +355,8 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 		bio->bi_write_hint = iocb->ki_hint;
 		bio->bi_private = dio;
 		bio->bi_end_io = blkdev_bio_end_io;
+		if (iocb->ki_flags & IOCB_IOPRIO)
+			bio->bi_ioprio = iocb->ki_ioprio;
 
 		ret = bio_iov_iter_get_pages(bio, iter);
 		if (unlikely(ret)) {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7a90ce387e00..3415e83f6350 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -294,6 +294,7 @@ enum rw_hint {
 #define IOCB_SYNC		(1 << 5)
 #define IOCB_WRITE		(1 << 6)
 #define IOCB_NOWAIT		(1 << 7)
+#define IOCB_IOPRIO		(1 << 8)
 
 struct kiocb {
 	struct file		*ki_filp;
@@ -302,6 +303,7 @@ struct kiocb {
 	void			*private;
 	int			ki_flags;
 	u16			ki_hint;
+	u16			ki_ioprio; /* See linux/ioprio.h */
 } __randomize_layout;
 
 static inline bool is_sync_kiocb(struct kiocb *kiocb)
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index a04adbc70ddf..d4593a6062ef 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -54,6 +54,7 @@ enum {
  * is valid.
  */
 #define IOCB_FLAG_RESFD		(1 << 0)
+#define IOCB_FLAG_IOPRIO	(1 << 1)
 
 /* read() from /dev/aio returns these structures. */
 struct io_event {
-- 
2.15.1
[PATCH v3 2/3] fs: Convert kiocb rw_hint from enum to u16
From: Adam Manzanares

In order to avoid kiocb bloat for per command iopriority support, rw_hint is converted from enum to a u16. Added a guard around ki_hint assignment.

Signed-off-by: Adam Manzanares
---
 include/linux/fs.h | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 760d8da1b6c7..7a90ce387e00 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -284,6 +284,8 @@ enum rw_hint {
 	WRITE_LIFE_EXTREME	= RWH_WRITE_LIFE_EXTREME,
 };
 
+#define MAX_KI_HINT		((1 << 16) - 1) /* ki_hint type is u16 */
+
 #define IOCB_EVENTFD		(1 << 0)
 #define IOCB_APPEND		(1 << 1)
 #define IOCB_DIRECT		(1 << 2)
@@ -299,7 +301,7 @@ struct kiocb {
 	void (*ki_complete)(struct kiocb *iocb, long ret, long ret2);
 	void			*private;
 	int			ki_flags;
-	enum rw_hint		ki_hint;
+	u16			ki_hint;
 } __randomize_layout;
 
 static inline bool is_sync_kiocb(struct kiocb *kiocb)
@@ -1927,12 +1929,21 @@ static inline enum rw_hint file_write_hint(struct file *file)
 
 static inline int iocb_flags(struct file *file);
 
+/* ki_hint changed from enum to u16, make sure rw_hint fits into u16 */
+static inline u16 ki_hint_valid(enum rw_hint hint)
+{
+	if (hint > MAX_KI_HINT)
+		return 0;
+
+	return hint;
+}
+
 static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
 {
 	*kiocb = (struct kiocb) {
 		.ki_filp = filp,
 		.ki_flags = iocb_flags(filp),
-		.ki_hint = file_write_hint(filp),
+		.ki_hint = ki_hint_valid(file_write_hint(filp)),
 	};
}
-- 
2.15.1
[PATCH v3 1/3] block: add ioprio_check_cap function
From: Adam Manzanares

Aio per command iopriority support introduces a second interface between userland and the kernel capable of passing iopriority. The aio interface also needs the ability to verify that the submitting context has sufficient privileges to submit IOPRIO_RT commands. This patch creates the ioprio_check_cap function to be used by the ioprio_set system call and also by the aio interface.

Signed-off-by: Adam Manzanares
---
 block/ioprio.c         | 22 ++++++++++++++++------
 include/linux/ioprio.h |  2 ++
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/block/ioprio.c b/block/ioprio.c
index 6f5d0b6625e3..f9821080c92c 100644
--- a/block/ioprio.c
+++ b/block/ioprio.c
@@ -61,15 +61,10 @@ int set_task_ioprio(struct task_struct *task, int ioprio)
 }
 EXPORT_SYMBOL_GPL(set_task_ioprio);
 
-SYSCALL_DEFINE3(ioprio_set, int, which, int, who, int, ioprio)
+int ioprio_check_cap(int ioprio)
 {
 	int class = IOPRIO_PRIO_CLASS(ioprio);
 	int data = IOPRIO_PRIO_DATA(ioprio);
-	struct task_struct *p, *g;
-	struct user_struct *user;
-	struct pid *pgrp;
-	kuid_t uid;
-	int ret;
 
 	switch (class) {
 	case IOPRIO_CLASS_RT:
@@ -92,6 +87,21 @@ SYSCALL_DEFINE3(ioprio_set, int, which, int, who, int, ioprio)
 		return -EINVAL;
 	}
 
+	return 0;
+}
+
+SYSCALL_DEFINE3(ioprio_set, int, which, int, who, int, ioprio)
+{
+	struct task_struct *p, *g;
+	struct user_struct *user;
+	struct pid *pgrp;
+	kuid_t uid;
+	int ret;
+
+	ret = ioprio_check_cap(ioprio);
+	if (ret)
+		return ret;
+
 	ret = -ESRCH;
 	rcu_read_lock();
 	switch (which) {
diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
index 627efac73e6d..4a28cec49ec3 100644
--- a/include/linux/ioprio.h
+++ b/include/linux/ioprio.h
@@ -77,4 +77,6 @@ extern int ioprio_best(unsigned short aprio, unsigned short bprio);
 
 extern int set_task_ioprio(struct task_struct *task, int ioprio);
 
+extern int ioprio_check_cap(int ioprio);
+
 #endif
-- 
2.15.1
Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
On Mon, 7 May 2018 18:23:46 -0500 Bjorn Helgaaswrote: > On Mon, Apr 23, 2018 at 05:30:32PM -0600, Logan Gunthorpe wrote: > > Hi Everyone, > > > > Here's v4 of our series to introduce P2P based copy offload to NVMe > > fabrics. This version has been rebased onto v4.17-rc2. A git repo > > is here: > > > > https://github.com/sbates130272/linux-p2pmem pci-p2p-v4 > > ... > > > Logan Gunthorpe (14): > > PCI/P2PDMA: Support peer-to-peer memory > > PCI/P2PDMA: Add sysfs group to display p2pmem stats > > PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset > > PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches > > docs-rst: Add a new directory for PCI documentation > > PCI/P2PDMA: Add P2P DMA driver writer's documentation > > block: Introduce PCI P2P flags for request and request queue > > IB/core: Ensure we map P2P memory correctly in > > rdma_rw_ctx_[init|destroy]() > > nvme-pci: Use PCI p2pmem subsystem to manage the CMB > > nvme-pci: Add support for P2P memory in requests > > nvme-pci: Add a quirk for a pseudo CMB > > nvmet: Introduce helper functions to allocate and free request SGLs > > nvmet-rdma: Use new SGL alloc/free helper for requests > > nvmet: Optionally use PCI P2P memory > > > > Documentation/ABI/testing/sysfs-bus-pci| 25 + > > Documentation/PCI/index.rst| 14 + > > Documentation/driver-api/index.rst | 2 +- > > Documentation/driver-api/pci/index.rst | 20 + > > Documentation/driver-api/pci/p2pdma.rst| 166 ++ > > Documentation/driver-api/{ => pci}/pci.rst | 0 > > Documentation/index.rst| 3 +- > > block/blk-core.c | 3 + > > drivers/infiniband/core/rw.c | 13 +- > > drivers/nvme/host/core.c | 4 + > > drivers/nvme/host/nvme.h | 8 + > > drivers/nvme/host/pci.c| 118 +++-- > > drivers/nvme/target/configfs.c | 67 +++ > > drivers/nvme/target/core.c | 143 - > > drivers/nvme/target/io-cmd.c | 3 + > > drivers/nvme/target/nvmet.h| 15 + > > drivers/nvme/target/rdma.c | 22 +- > > drivers/pci/Kconfig| 26 + > > drivers/pci/Makefile | 1 + > > 
drivers/pci/p2pdma.c | 814 > > + > > drivers/pci/pci.c | 6 + > > include/linux/blk_types.h | 18 +- > > include/linux/blkdev.h | 3 + > > include/linux/memremap.h | 19 + > > include/linux/pci-p2pdma.h | 118 + > > include/linux/pci.h| 4 + > > 26 files changed, 1579 insertions(+), 56 deletions(-) > > create mode 100644 Documentation/PCI/index.rst > > create mode 100644 Documentation/driver-api/pci/index.rst > > create mode 100644 Documentation/driver-api/pci/p2pdma.rst > > rename Documentation/driver-api/{ => pci}/pci.rst (100%) > > create mode 100644 drivers/pci/p2pdma.c > > create mode 100644 include/linux/pci-p2pdma.h > > How do you envison merging this? There's a big chunk in drivers/pci, but > really no opportunity for conflicts there, and there's significant stuff in > block and nvme that I don't really want to merge. > > If Alex is OK with the ACS situation, I can ack the PCI parts and you could > merge it elsewhere? AIUI from previously questioning this, the change is hidden behind a build-time config option and only custom kernels or distros optimized for this sort of support would enable that build option. I'm more than a little dubious though that we're not going to have a wave of distros enabling this only to get user complaints that they can no longer make effective use of their devices for assignment due to the resulting span of the IOMMU groups, nor is there any sort of compromise, configure the kernel for p2p or device assignment, not both. Is this really such a unique feature that distro users aren't going to be asking for both features? Thanks, Alex
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
On 08.05.2018 at 18:27, Logan Gunthorpe wrote:
On 08/05/18 01:17 AM, Christian König wrote:
AMD APUs mandatory need the ACS flag set for the GPU integrated in the CPU when IOMMU is enabled or otherwise you will break SVM.

Well, given that the current set only disables ACS bits on bridges (previous versions were only on switches) this shouldn't be an issue for integrated devices. We do not disable ACS flags globally.

Ok, that is at least a step in the right direction. But I think we seriously need to test that for side effects.

And what exactly is the problem here? I'm currently testing P2P with GPUs in different IOMMU domains and at least with AMD IOMMUs that works perfectly fine.

In addition to Stephen's comments, seeing we've established a general need to avoid the root complex (until we have a whitelist at least) we must have ACS disabled along the path between the devices. Otherwise, all TLPs will go through the root complex and if there is no support it will fail.

Well I'm not an expert on this, but if I'm not completely mistaken that is not correct. E.g. transactions are initially sent to the root complex for translation, that's for sure. But at least for AMD GPUs the root complex answers with the translated address which is then cached in the device. So further transactions for the same address range then go directly to the destination.

What you don't want is device isolation, because in this case the root complex handles the transactions itself. IIRC there were also "force_isolation" and "nobypass" parameters for the IOMMU to control that behavior.

It's already late here, but going to dig up the documentation for that tomorrow and/or contact a hardware engineer involved in the ACS spec.

Regards,
Christian.

If the consensus is we want a command line option, then so be it.
But we'll have to deny pretty much all P2P transactions unless the user correctly disables ACS along the path using the command line option and this is really annoying for users of this functionality to understand how to do that correctly. Logan
Re: bug in tag handling in blk-mq?
On Tue, 2018-05-08 at 08:55 -0600, Jens Axboe wrote: > > All the block debug files are empty... Sigh. Take 2, this time cat debug files, having turned block tracing off before doing anything else (so trace bits in dmesg.txt should end AT the stall).

-Mike

Attachments: dmesg.xz, dmesg.txt.xz, block_debug.xz (all application/xz)
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
On 08.05.2018 at 16:25, Stephen Bates wrote:
Hi Christian

AMD APUs mandatory need the ACS flag set for the GPU integrated in the CPU when IOMMU is enabled or otherwise you will break SVM.

OK but in this case aren't you losing (many of) the benefits of P2P since all DMAs will now get routed up to the IOMMU before being passed down to the destination PCIe EP?

Well I'm not an expert on this, but I think that is an incorrect assumption you guys use here. At least in the default configuration, even with IOMMU enabled, P2P transactions do NOT necessarily travel up to the root complex for translation.

It's already late here, but if nobody beats me I'm going to dig up the necessary documentation tomorrow.

Regards,
Christian.

Similar problems arise when you do this for dedicated GPU, but we haven't upstreamed the support for this yet.

Hmm, as above. With ACS enabled on all downstream ports any P2P enabled DMA will be routed to the IOMMU which removes a lot of the benefit.

So that is a clear NAK from my side for the approach.

Do you have an alternative? This is the approach we arrived at after a reasonably lengthy discussion on the mailing lists. Alex, are you still comfortable with this approach?

And what exactly is the problem here?

We had a pretty lengthy discussion on this topic on one of the previous revisions. The issue is that currently there is no mechanism in the IOMMU code to inform VMs if IOMMU groupings change. Since p2pdma can dynamically change its topology (due to PCI hotplug) we had to be cognizant of the fact that ACS settings could change. Since there is no way to currently handle changing ACS settings and hence IOMMU groupings, the consensus was to simply disable ACS on all ports in a p2pdma domain. This effectively makes all the devices in the p2pdma domain part of the same IOMMU grouping. The plan will be to address this in time and add a mechanism for IOMMU grouping changes and notification to VMs, but that's not part of this series.
Note you are still allowed to have ACS functioning on other PCI domains, so if you do need a plurality of IOMMU groupings you can still achieve it (but you can't do p2pdma across IOMMU groupings, which is safe).

I'm currently testing P2P with GPUs in different IOMMU domains and at least with AMD IOMMUs that works perfectly fine.

Yup that should work, though again I have to ask: are you disabling ACS on the ports between the two peer devices to get the p2p benefit? If not, you are not getting all the performance benefit (due to IOMMU routing); if you are, then there are obviously security implications between those IOMMU domains if they are assigned to different VMs. And now the issue is if new devices are added and the p2p topology needed to change, there would be no way to inform the VMs of any IOMMU group change.

Cheers
Stephen
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
On 08/05/18 01:17 AM, Christian König wrote: > AMD APUs mandatory need the ACS flag set for the GPU integrated in the > CPU when IOMMU is enabled or otherwise you will break SVM. Well, given that the current set only disables ACS bits on bridges (previous versions were only on switches) this shouldn't be an issue for integrated devices. We do not disable ACS flags globally. > And what exactly is the problem here? I'm currently testing P2P with > GPUs in different IOMMU domains and at least with AMD IOMMUs that works > perfectly fine. In addition to Stephen's comments, seeing we've established a general need to avoid the root complex (until we have a whitelist at least) we must have ACS disabled along the path between the devices. Otherwise, all TLPs will go through the root complex and if there is no support it will fail. If the consensus is we want a command line option, then so be it. But we'll have to deny pretty much all P2P transactions unless the user correctly disables ACS along the path using the command line option and this is really annoying for users of this functionality to understand how to do that correctly. Logan
Re: [PATCH 1/2] nvme: pci: simplify timeout handling
On Sat, Apr 28, 2018 at 11:50:17AM +0800, Ming Lei wrote: > This sync may be raced with one timed-out request, which may be handled > as BLK_EH_HANDLED or BLK_EH_RESET_TIMER, so the above sync queues can't > work reliably. Ming, As proposed, that scenario is impossible to encounter. Resetting the controller inline with the timeout reaps all the commands, and then sets the controller state to RESETTING. While blk-mq may not allow the driver to complete those requests, having the driver sync with the queues will hold the controller in the reset state until blk-mq is done with its timeout work; therefore, it is impossible for the NVMe driver to return "BLK_EH_RESET_TIMER", and all commands will be completed through nvme_timeout's BLK_EH_HANDLED exactly as desired. Could you please recheck my suggestion? The alternatives proposed are far too risky for a 4.17 consideration, and I'm hoping we can stabilize this behavior in the current release if possible. Thanks, Keith
Re: [PATCH] Documentation: block: cmdline-partition.txt fixes and additions
On Sun, 6 May 2018 11:50:29 -0700 Randy Dunlap wrote: > Make the description of the kernel command line option "blkdevparts" > a bit more flowing and readable. > > Fix a few typos. > Add the optional and suffixes. > Note that size can be "-" to indicate all of the remaining space. > > Signed-off-by: Randy Dunlap

Applied, thanks.

jon
Re: [PATCH V4 0/7] nvme: pci: fix & improve timeout handling
On Sat, May 05, 2018 at 07:51:22PM -0400, Laurence Oberman wrote: > 3rd and 4th attempts slightly better, but clearly not dependable > > [root@segstorage1 blktests]# ./check block/011 > block/011 => nvme0n1 (disable PCI device while doing I/O)[failed] > runtime... 81.188s > --- tests/block/011.out 2018-05-05 18:01:14.268414752 -0400 > +++ results/nvme0n1/block/011.out.bad 2018-05-05 > 19:44:48.848568687 -0400 > @@ -1,2 +1,3 @@ > Running block/011 > +tests/block/011: line 47: echo: write error: Input/output error > Test complete > > This one passed > [root@segstorage1 blktests]# ./check block/011 > block/011 => nvme0n1 (disable PCI device while doing I/O)[passed] > runtime 81.188s ... 43.400s > > I will capture a vmcore next time it panics and give some information > after analyzing the core

We definitely should never panic, but I am not sure this blktest can be reliable on IO errors: the test is disabling memory space enabling and bus master without the driver's knowledge, and it does this repeatedly in a tight loop. If the test happens to disable the device while the driver is trying to recover from the previous iteration, the recovery will surely fail, so I think IO errors may possibly be expected. As far as I can tell, the only way you'll actually get it to succeed is if the test's subsequent "enable" happens to hit in conjunction with the driver's reset pci_enable_device_mem(), such that the pci_dev's enable_cnt is > 1, which prevents the disabling for the remainder of the test's looping. I still think this is a very good test, but we might be able to make it more deterministic on what actually happens to the pci device.
Re: bug in tag handling in blk-mq?
On 5/8/18 2:37 AM, Mike Galbraith wrote: > On Tue, 2018-05-08 at 06:51 +0200, Mike Galbraith wrote: >> >> I'm deadlined ATM, but will get to it. > > (Bah, even a zombie can type ccache -C; make -j8 and stare...) > > kbuild again hung on the first go (yay), and post hang data written to > sdd1 survived (kernel source lives in sdb3). Full ftrace buffer (echo > 1 > events/block/enable) available off list if desired. dmesg.txt.xz > is dmesg from post hang crashdump, attached because it contains the > tail of trace buffer, so _might_ be useful. > > homer:~ # df|grep sd > /dev/sdb3 959074776 785342824 172741072 82% / > /dev/sdc3 959074776 455464912 502618984 48% /backup > /dev/sdb1 159564 7980 151584 6% /boot/efi > /dev/sdd1 961301832 393334868 519112540 44% /abuild > > Kernel is virgin modulo these... > > patches/remove_irritating_plus.diff > patches/add-scm-version-to-EXTRAVERSION.patch > patches/block-bfq:-postpone-rq-preparation-to-insert-or-merge.patch > patches/block-bfq:-test.patch (hang provocation hack from Paolo) All the block debug files are empty... -- Jens Axboe
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
Hi Dan

>It seems unwieldy that this is a compile time option and not a runtime >option. Can't we have a kernel command line option to opt-in to this >behavior rather than require a wholly separate kernel image?

I think because of the security implications associated with p2pdma and ACS we wanted to make it very clear people were choosing one (p2pdma) or the other (IOMMU groupings and isolation). However personally I would prefer including the option of a run-time kernel parameter too. In fact a few months ago I proposed a small patch that did just that [1]. It never really went anywhere but if people were open to the idea we could look at adding it to the series.

> Why is this text added in a follow on patch and not the patch that > introduced the config option?

Because the ACS section was added later in the series and this information is associated with that additional functionality.

> I'm also wondering if that command line option can take a 'bus device > function' address of a switch to limit the scope of where ACS is > disabled.

By this you mean the address for either a RP, DSP, USP or MF EP below which we disable ACS? We could do that but I don't think it avoids the issue of changes in IOMMU groupings as devices are added/removed. It simply changes the problem from affecting an entire PCI domain to a sub-set of the domain. We can already handle this by doing p2pdma on one RP and normal IOMMU isolation on the other RPs in the system.

Stephen

[1] https://marc.info/?l=linux-doc=150907188310838=2
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
On Mon, Apr 23, 2018 at 4:30 PM, Logan Gunthorpe wrote: > For peer-to-peer transactions to work the downstream ports in each > switch must not have the ACS flags set. At this time there is no way > to dynamically change the flags and update the corresponding IOMMU > groups so this is done at enumeration time before the groups are > assigned. > > This effectively means that if CONFIG_PCI_P2PDMA is selected then > all devices behind any PCIe switch hierarchy will be in the same IOMMU > group. Which implies that individual devices behind any switch > hierarchy will not be able to be assigned to separate VMs because > there is no isolation between them. Additionally, any malicious PCIe > devices will be able to DMA to memory exposed by other EPs in the same > domain as TLPs will not be checked by the IOMMU. > > Given that the intended use case of P2P Memory is for users with > custom hardware designed for purpose, we do not expect distributors > to ever need to enable this option. Users that want to use P2P > must have compiled a custom kernel with this configuration option > and understand the implications regarding ACS. They will either > not require ACS or will have designed the system in such a way that > devices that require isolation will be separate from those using P2P > transactions. > > Signed-off-by: Logan Gunthorpe > --- > drivers/pci/Kconfig| 9 + > drivers/pci/p2pdma.c | 45 ++--- > drivers/pci/pci.c | 6 ++ > include/linux/pci-p2pdma.h | 5 + > 4 files changed, 50 insertions(+), 15 deletions(-) > > diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig > index b2396c22b53e..b6db41d4b708 100644 > --- a/drivers/pci/Kconfig > +++ b/drivers/pci/Kconfig > @@ -139,6 +139,15 @@ config PCI_P2PDMA > transactions must be between devices behind the same root port. > (Typically behind a network of PCIe switches). > > + Enabling this option will also disable ACS on all ports behind > + any PCIe switch. 
This effectively puts all devices behind any > + switch hierarchy into the same IOMMU group. Which implies that > + individual devices behind any switch will not be able to be > + assigned to separate VMs because there is no isolation between > + them. Additionally, any malicious PCIe devices will be able to > + DMA to memory exposed by other EPs in the same domain as TLPs > + will not be checked by the IOMMU. > + > If unsure, say N.

It seems unwieldy that this is a compile time option and not a runtime option. Can't we have a kernel command line option to opt-in to this behavior rather than require a wholly separate kernel image?

Why is this text added in a follow-on patch and not the patch that introduced the config option?

I'm also wondering if that command line option can take a 'bus device function' address of a switch to limit the scope of where ACS is disabled.
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
Hi Christian

> AMD APUs mandatory need the ACS flag set for the GPU integrated in the > CPU when IOMMU is enabled or otherwise you will break SVM.

OK but in this case aren't you losing (many of) the benefits of P2P since all DMAs will now get routed up to the IOMMU before being passed down to the destination PCIe EP?

> Similar problems arise when you do this for dedicated GPU, but we > haven't upstreamed the support for this yet.

Hmm, as above. With ACS enabled on all downstream ports any P2P enabled DMA will be routed to the IOMMU which removes a lot of the benefit.

> So that is a clear NAK from my side for the approach.

Do you have an alternative? This is the approach we arrived at after a reasonably lengthy discussion on the mailing lists. Alex, are you still comfortable with this approach?

> And what exactly is the problem here?

We had a pretty lengthy discussion on this topic on one of the previous revisions. The issue is that currently there is no mechanism in the IOMMU code to inform VMs if IOMMU groupings change. Since p2pdma can dynamically change its topology (due to PCI hotplug) we had to be cognizant of the fact that ACS settings could change. Since there is no way to currently handle changing ACS settings and hence IOMMU groupings, the consensus was to simply disable ACS on all ports in a p2pdma domain. This effectively makes all the devices in the p2pdma domain part of the same IOMMU grouping. The plan will be to address this in time and add a mechanism for IOMMU grouping changes and notification to VMs, but that's not part of this series.

Note you are still allowed to have ACS functioning on other PCI domains, so if you do need a plurality of IOMMU groupings you can still achieve it (but you can't do p2pdma across IOMMU groupings, which is safe).

> I'm currently testing P2P with GPUs in different IOMMU domains and at least > with AMD IOMMUs that works perfectly fine. 
Yup that should work, though again I have to ask: are you disabling ACS on the ports between the two peer devices to get the p2p benefit? If not, you are not getting all the performance benefit (due to IOMMU routing); if you are, then there are obviously security implications between those IOMMU domains if they are assigned to different VMs. And now the issue is if new devices are added and the p2p topology needed to change, there would be no way to inform the VMs of any IOMMU group change.

Cheers
Stephen
Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO
On Mon, May 07, 2018 at 05:42:36PM -0700, Randy Dunlap wrote: > On 05/07/2018 02:01 PM, Johannes Weiner wrote: > > + * The ratio is tracked in decaying time averages over 10s, 1m, 5m > > + * windows. Cumluative stall times are tracked and exported as well to > >Cumulative > > > +/** > > + * psi_memstall_leave - mark the end of an memory stall section > > end of a memory Thanks Randy, I'll get those fixed.
Re: [PATCH 6/7] psi: pressure stall information for CPU, memory, and IO
On Tue, May 08, 2018 at 11:04:09AM +0800, kbuild test robot wrote: >118#else /* CONFIG_PSI */ >119static inline void psi_enqueue(struct task_struct *p, u64 now) >120{ >121} >122static inline void psi_dequeue(struct task_struct *p, u64 now) >123{ >124} >125static inline void psi_ttwu_dequeue(struct task_struct *p) {} > > 126{ >127} Stupid last-minute cleanup reshuffling. v2 will have: diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h index cb4a68bcf37a..ff6256b3d216 100644 --- a/kernel/sched/stats.h +++ b/kernel/sched/stats.h @@ -122,7 +122,7 @@ static inline void psi_enqueue(struct task_struct *p, u64 now) static inline void psi_dequeue(struct task_struct *p, u64 now) { } -static inline void psi_ttwu_dequeue(struct task_struct *p) {} +static inline void psi_ttwu_dequeue(struct task_struct *p) { } #endif /* CONFIG_PSI */
Re: general protection fault in lo_ioctl (2)
On Tue, May 08, 2018 at 08:05:12PM +0900, Tetsuo Handa wrote: > > So, it is time to think how to solve this race condition, as well as how to > solve > lockdep's deadlock warning (and I guess that syzbot is actually hitting > deadlocks). > An approach which serializes loop operations using global lock was proposed at > https://groups.google.com/d/msg/syzkaller-bugs/2Rw8-OM6IbM/PzdobV8kAgAJ . > Please respond... I'm looking at your patch which you proposed on this, and the locking architecture still looks way too complex. Things like loop_mutex_owner, and all of the infrastructure around lo->ioctl_in_progress should be removed, if at all possible. I believe it should be possible to do things with a single global mutex, some code refactoring, and some unlocked versions of some of the functions. Again, this requires root, and it requires someone deliberately trying to induce a race. So "it's time" is not necessarily the priority I would set for this item. But if we are going to fix it, let's fix it right, and not make the code more complex and less maintainable, all in the name of trying to make a rare, not-likely-to-happen-in-real-life syzbot reported problem to go away. Cheers, - Ted
Re: general protection fault in lo_ioctl (2)
On 2018/05/08 5:56, Tetsuo Handa wrote:
> On 2018/05/02 20:23, Dmitry Vyukov wrote:
>> #syz dup: INFO: rcu detected stall in blkdev_ioctl
>
> The cause of the stall turned out to be ioctl(loop_fd, LOOP_CHANGE_FD,
> loop_fd).
>
> But we haven't explained the cause of the NULL pointer dereference which
> can occur when racing with ioctl(LOOP_CLR_FD). Therefore,
>
> #syz undup

Using the sleep injection patch and reproducer shown below, I can reproduce
the crashes. It is a race between ioctl(loop_fd, LOOP_CLR_FD, 0) and
ioctl(other_loop_fd, LOOP_SET_FD, loop_fd).

Unless we hold the corresponding lo->lo_ctl_mutex (or keep the corresponding
lo->lo_refcnt elevated) when traversing other loop devices, the
"/* Avoid recursion */" loop in loop_set_fd()/loop_change_fd() will suffer
from races with loop_clr_fd().

So, it is time to think about how to solve this race condition, as well as
how to solve lockdep's deadlock warning (and I guess that syzbot is actually
hitting deadlocks). An approach which serializes loop operations using a
global lock was proposed at
https://groups.google.com/d/msg/syzkaller-bugs/2Rw8-OM6IbM/PzdobV8kAgAJ .
Please respond...
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -909,6 +909,9 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
 			error = -EINVAL;
 			goto out_putf;
 		}
+		pr_err("Start sleeping\n");
+		schedule_timeout_killable(3 * HZ);
+		pr_err("End sleeping\n");
 		f = l->lo_backing_file;
 	}

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/loop.h>

int main(int argc, char *argv[])
{
	int fd0 = open("/dev/loop0", O_RDONLY);
	int fd1 = open("/dev/loop1", O_RDONLY);
	int fd2 = open("/tmp/file", O_RDWR | O_CREAT | O_TRUNC, 0600);

	ioctl(fd1, LOOP_SET_FD, fd2);
	if (fork() == 0) {
		sleep(1);
		ioctl(fd1, LOOP_CLR_FD, 0);
		_exit(0);
	}
	ioctl(fd0, LOOP_SET_FD, fd1);
	return 0;
}

[   14.119073] loop: module loaded
[   17.363610] Start sleeping
[   20.383442] End sleeping
[   20.386511] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[   20.394779] PGD 13377d067 P4D 13377d067 PUD 131509067 PMD 0
[   20.400847] Oops: 0000 [#1] SMP
[   20.403875] Modules linked in: loop
[   20.406188] CPU: 6 PID: 6470 Comm: a.out Tainted: G            T 4.17.0-rc4+ #540
[   20.411266] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
[   20.418169] RIP: 0010:lo_ioctl+0x7ef/0x840 [loop]
[   20.421272] RSP: 0018:ffffc90000bbbd88 EFLAGS: 00010282
[   20.424661] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff83679478
[   20.429271] RDX: ffff8801332e9c00 RSI: 0000000000000086 RDI: 0000000000000286
[   20.434517] RBP: ffffc90000bbbdd8 R08: 0000000000000638 R09: 0000000000000000
[   20.436879] R10: 0000000000000190 R11: 0720072007200720 R12: ffff8801314ab118
[   20.439076] R13: ffff880138deae40 R14: ffff8801311f7780 R15: ffff8801314ab000
[   20.441144] FS:  00007f0b57743740(0000) GS:ffff88013a780000(0000) knlGS:0000000000000000
[   20.443588] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   20.445284] CR2: 0000000000000008 CR3: 0000000138efb002 CR4: 00000000000606e0
[   20.447381] Call Trace:
[   20.448149]  blkdev_ioctl+0x88d/0x950
[   20.449237]  block_ioctl+0x38/0x40
[   20.450269]  do_vfs_ioctl+0xaa/0x650
[   20.451479]  ? handle_mm_fault+0x108/0x250
[   20.452704]  ksys_ioctl+0x70/0x80
[   20.453737]  __x64_sys_ioctl+0x15/0x20
[   20.454887]  do_syscall_64+0x5d/0x100
[   20.456014]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   20.457519] RIP: 0033:0x7f0b57267107
[   20.458644] RSP: 002b:00007fff8a0fd698 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[   20.460853] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f0b57267107
[   20.462952] RDX: 0000000000000004 RSI: 0000000000004c00 RDI: 0000000000000003
[   20.465023] RBP: 0000000000000003 R08: 00007f0b57743740 R09: 0000000000000000
[   20.467091] R10: 00007f0b57743a10 R11: 0000000000000246 R12: 00000000004005ef
[   20.469361] R13: 00007fff8a0fd790 R14: 0000000000000000 R15: 0000000000000000
[   20.471657] Code: a0 48 89 55 d0 e8 e0 5f 1d e1 bf b8 0b 00 00 e8 78 9e 7c e2 48 c7 c7 a9 40 00 a0 e8 ca 5f 1d e1 48 8b 55 d0 48 8b 82 f0 00 00 00 <48> 8b 40 08 48 8b 40 68 48 85 c0 0f 84 15 fd ff ff 0f b7 90 b8
[   20.477207] RIP: lo_ioctl+0x7ef/0x840 [loop] RSP: ffffc90000bbbd88
[   20.479027] CR2: 0000000000000008
[   20.480063] ---[ end trace 925bc1b992d96cb3 ]---
[   20.481441] Kernel panic - not syncing: Fatal exception [
Re: [PATCH] loop: remember whether sysfs_create_group() succeeded
On 05/05/2018 01:49 PM, Tetsuo Handa wrote:
> Milan Broz wrote:
>>> Do we want to abort the LOOP_SET_FD request if sysfs_create_group()
>>> failed?
>>
>> I would prefer failure - there are several utilities that expect
>> attributes in sysfs to be valid (for example I print info from here in
>> cryptsetup status if the backing image is an image), so ignoring the
>> failure puts the system in an inconsistent state.
>
> I see. But can we for now send the v1 patch for the 4.17 release (and
> postpone making the LOOP_SET_FD request fail if sysfs_create_group()
> failed)? This bug has so far crashed syzbot tests 6432 times in 190 days.

Jens already merged it in the block git. So syzbot should be more happy
now :)

> We have a lot of bugs in the loop module which prevent syzbot from finding
> other bugs. I want to immediately squash bugs in block/loop so that we can
> reduce false-positive hung task reports.

Sure, syzbot is definitely a very useful idea, thanks!

Milan
Re: bug in tag handling in blk-mq?
On Tue, 2018-05-08 at 06:51 +0200, Mike Galbraith wrote:
> I'm deadlined ATM, but will get to it.

(Bah, even a zombie can type ccache -C; make -j8 and stare...)

kbuild again hung on the first go (yay), and post-hang data written to sdd1
survived (kernel source lives in sdb3). Full ftrace buffer (echo 1 >
events/block/enable) available off list if desired. dmesg.txt.xz is dmesg
from the post-hang crashdump, attached because it contains the tail of the
trace buffer, so _might_ be useful.

homer:~ # df | grep sd
/dev/sdb3      959074776 785342824 172741072  82% /
/dev/sdc3      959074776 455464912 502618984  48% /backup
/dev/sdb1         159564      7980    151584   6% /boot/efi
/dev/sdd1      961301832 393334868 519112540  44% /abuild

Kernel is virgin modulo these...

patches/remove_irritating_plus.diff
patches/add-scm-version-to-EXTRAVERSION.patch
patches/block-bfq:-postpone-rq-preparation-to-insert-or-merge.patch
patches/block-bfq:-test.patch (hang provocation hack from Paolo)

	-Mike

Attachments:
block_debug.tar.xz (application/xz-compressed-tar)
dmesg.xz (application/xz)
dmesg.txt.xz (application/xz)
Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
Hi Bjorn,

On 08.05.2018 at 01:13, Bjorn Helgaas wrote:
> [+to Alex]
>
> Alex,
>
> Are you happy with this strategy of turning off ACS based on
> CONFIG_PCI_P2PDMA? We only check this at enumeration-time and I don't
> know if there are other places we would care?

thanks for pointing this out, I totally missed this hack.

AMD APUs mandatorily need the ACS flag set for the GPU integrated in the
CPU when the IOMMU is enabled, or otherwise you will break SVM. Similar
problems arise when you do this for a dedicated GPU, but we haven't
upstreamed the support for this yet.

So that is a clear NAK from my side for this approach.

And what exactly is the problem here? I'm currently testing P2P with GPUs
in different IOMMU domains, and at least with AMD IOMMUs that works
perfectly fine.

Regards,
Christian.

> On Mon, Apr 23, 2018 at 05:30:36PM -0600, Logan Gunthorpe wrote:
>> For peer-to-peer transactions to work the downstream ports in each
>> switch must not have the ACS flags set. At this time there is no way
>> to dynamically change the flags and update the corresponding IOMMU
>> groups so this is done at enumeration time before the groups are
>> assigned.
>>
>> This effectively means that if CONFIG_PCI_P2PDMA is selected then all
>> devices behind any PCIe switch heirarchy will be in the same IOMMU
>> group. Which implies that individual devices behind any switch
>> heirarchy will not be able to be assigned to separate VMs because
>> there is no isolation between them. Additionally, any malicious PCIe
>> devices will be able to DMA to memory exposed by other EPs in the
>> same domain as TLPs will not be checked by the IOMMU.
>>
>> Given that the intended use case of P2P Memory is for users with
>> custom hardware designed for purpose, we do not expect distributors
>> to ever need to enable this option. Users that want to use P2P must
>> have compiled a custom kernel with this configuration option and
>> understand the implications regarding ACS. They will either not
>> require ACS or will have design the system in such a way that devices
>> that require isolation will be separate from those using P2P
>> transactions.
>>
>> Signed-off-by: Logan Gunthorpe
>> ---
>>  drivers/pci/Kconfig        |  9 +
>>  drivers/pci/p2pdma.c       | 45 ++---
>>  drivers/pci/pci.c          |  6 ++
>>  include/linux/pci-p2pdma.h |  5 +
>>  4 files changed, 50 insertions(+), 15 deletions(-)
>>
>> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
>> index b2396c22b53e..b6db41d4b708 100644
>> --- a/drivers/pci/Kconfig
>> +++ b/drivers/pci/Kconfig
>> @@ -139,6 +139,15 @@ config PCI_P2PDMA
>>  	  transations must be between devices behind the same root port.
>>  	  (Typically behind a network of PCIe switches).
>>
>> +	  Enabling this option will also disable ACS on all ports behind
>> +	  any PCIe switch. This effectively puts all devices behind any
>> +	  switch heirarchy into the same IOMMU group. Which implies that
>
> s/heirarchy/hierarchy/ (also above in changelog)
>
>> +	  individual devices behind any switch will not be able to be
>> +	  assigned to separate VMs because there is no isolation between
>> +	  them. Additionally, any malicious PCIe devices will be able to
>> +	  DMA to memory exposed by other EPs in the same domain as TLPs
>> +	  will not be checked by the IOMMU.
>> +
>>  	  If unsure, say N.
>>
>>  config PCI_LABEL
>>
>> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
>> index ed9dce8552a2..e9f43b43acac 100644
>> --- a/drivers/pci/p2pdma.c
>> +++ b/drivers/pci/p2pdma.c
>> @@ -240,27 +240,42 @@ static struct pci_dev *find_parent_pci_dev(struct device *dev)
>>  }
>>
>>  /*
>> - * If a device is behind a switch, we try to find the upstream bridge
>> - * port of the switch. This requires two calls to pci_upstream_bridge():
>> - * one for the upstream port on the switch, one on the upstream port
>> - * for the next level in the hierarchy. Because of this, devices connected
>> - * to the root port will be rejected.
>> + * pci_p2pdma_disable_acs - disable ACS flags for all PCI bridges
>> + * @pdev: device to disable ACS flags for
>> + *
>> + * The ACS flags for P2P Request Redirect and P2P Completion Redirect need
>> + * to be disabled on any PCI bridge in order for the TLPS to not be forwarded
>> + * up to the RC which is not what we want for P2P.
>
> s/PCI bridge/PCIe switch/ (ACS doesn't apply to conventional PCI)
>
>> + *
>> + * This function is called when the devices are first enumerated and
>> + * will result in all devices behind any bridge to be in the same IOMMU
>> + * group. At this time, there is no way to "hotplug" IOMMU groups so we rely
>> + * on this largish hammer. If you need the devices to be in separate groups
>> + * don't enable CONFIG_PCI_P2PDMA.
>> + *
>> + * Returns 1 if the ACS bits for this device was cleared, otherwise 0.
>> + */
>> -static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
>> +int pci_p2pdma_disable_acs(struct pci_dev *pdev)
>>  {
>> -	struct pci_dev *up1, *up2;
>> +	int pos;
>> +	u16 ctrl;
>>
>> -	if (!pdev)
>> -		return NULL;
>> +