Re: [PATCH 05/24] ibtrs: client: main functionality
On Mon, Feb 5, 2018 at 3:14 PM, Sagi Grimberg wrote:
>
>> Indeed, seems sbitmap can be reused.
>>
>> But tags is a part of IBTRS, and is not related to the block device at
>> all. One IBTRS connection (session) handles many block devices
>
> we use host shared tag sets for the case of multiple block devices.

Unfortunately (or fortunately, depending on the intended mq design) tags
are not shared between hw_queues. So in our case (1 session queue, N
devices) you always have to specify tags->nr_hw_queues = 1, or the magic
will not happen and you will always have more tags than your session
supports. But nr_hw_queues = 1 kills performance dramatically. What
scales well is the following:

    nr_hw_queues == num_online_cpus() == number of QPs in one session

>> (or any IO producers).
>
> Let's wait until we actually have these theoretical non-block IO
> producers..

>> With a tag you get a free slot of a buffer where you can read/write,
>> so once you've allocated a tag you won't sleep on the IO path inside
>> the library.
>
> Same for block tags (given that you don't set the request queue
> otherwise)

>> Also a tag helps a lot on IO fail-over to another connection
>> (multipath implementation, which is also a part of the transport
>> library, not the block device), where you simply reuse the same buffer
>> slot (with a tag in your hands), forwarding IO to another RDMA
>> connection.
>
> What is the benefit of this detached architecture?

That gives us a separate RDMA IO library, where IBNBD is one of the
players.

> IMO, one reason why you ended up not reusing a lot of the
> infrastructure is yielded from the attempt to support a theoretical
> different consumer that is not ibnbd.

Well, not quite. Not using the rdma api helpers (we will use them) and
not using tags from the block layer (we need tags inside the transport):
that is not "a lot of the infrastructure" :) I would say that we are not
fast enough to follow all kernel trends. That is the major reason, not
another user of ibtrs.

> Did you actually have plans for any other consumers?

Yep, the major target is replicated block storage, that's why a separate
transport.

> Personally, I think you will be much better off with a unified approach
> for your block device implementation.

--
Roman
Re: [PATCH 05/24] ibtrs: client: main functionality
On Mon, 2018-02-05 at 15:19 +0100, Roman Penyaev wrote:
> On Mon, Feb 5, 2018 at 12:19 PM, Sagi Grimberg wrote:
> > Do you actually ever have remote write access in your protocol?
>
> We do not have reads; instead the client writes on write and the server
> writes on read. (A write-only storage solution :)

So there are no restrictions on which clients can log in, and any client
can send RDMA writes to the target system? I think this needs to be
improved ...

Thanks,

Bart.
Re: [PATCH 05/24] ibtrs: client: main functionality
Hi Sagi,

On Mon, Feb 5, 2018 at 12:19 PM, Sagi Grimberg wrote:
> Hi Roman,
>
>> +static inline void ibtrs_clt_state_lock(void)
>> +{
>> +        rcu_read_lock();
>> +}
>> +
>> +static inline void ibtrs_clt_state_unlock(void)
>> +{
>> +        rcu_read_unlock();
>> +}
>
> This looks rather pointless...

Yeah, old scraps. At some point those were more than just wrappers
around rcu. Now rcu can be called explicitly, that is true. Thanks.

>> +#define cmpxchg_min(var, new) ({                                   \
>> +        typeof(var) old;                                           \
>> +                                                                   \
>> +        do {                                                       \
>> +                old = var;                                         \
>> +                new = (!old ? new : min_t(typeof(var), old, new)); \
>> +        } while (cmpxchg(&var, old, new) != old);                  \
>> +})
>
> Why is this sort of thing local to your driver?

Good question :) Most likely because personally I do not know what a
good generic place for this kind of stuff would be. Probably I share the
same feeling as the author of these lines in nvme/host/rdma.c:
put_unaligned_le24() :)

>> +/**
>> + * struct ibtrs_fr_pool - pool of fast registration descriptors
>> + *
>> + * An entry is available for allocation if and only if it occurs in
>> + * @free_list.
>> + *
>> + * @size:              Number of descriptors in this pool.
>> + * @max_page_list_len: Maximum fast registration work request page
>> + *                     list length.
>> + * @lock:              Protects free_list.
>> + * @free_list:         List of free descriptors.
>> + * @desc:              Fast registration descriptor pool.
>> + */
>> +struct ibtrs_fr_pool {
>> +        int                     size;
>> +        int                     max_page_list_len;
>> +        spinlock_t              lock; /* protects free_list */
>> +        struct list_head        free_list;
>> +        struct ibtrs_fr_desc    desc[0];
>> +};
>
> We already have a per-qp fr list implementation, any specific reason to
> implement it again?

No, fr is a part of the code which we are not using; fmr is faster in
our setup. So we will need to reiterate on fr mode, thanks.

>> +static inline struct ibtrs_tag *
>> +__ibtrs_get_tag(struct ibtrs_clt *clt, enum ibtrs_clt_con_type con_type)
>> +{
>> +        size_t max_depth = clt->queue_depth;
>> +        struct ibtrs_tag *tag;
>> +        int cpu, bit;
>> +
>> +        cpu = get_cpu();
>> +        do {
>> +                bit = find_first_zero_bit(clt->tags_map, max_depth);
>> +                if (unlikely(bit >= max_depth)) {
>> +                        put_cpu();
>> +                        return NULL;
>> +                }
>> +        } while (unlikely(test_and_set_bit_lock(bit, clt->tags_map)));
>> +        put_cpu();
>> +
>> +        tag = GET_TAG(clt, bit);
>> +        WARN_ON(tag->mem_id != bit);
>> +        tag->cpu_id = cpu;
>> +        tag->con_type = con_type;
>> +
>> +        return tag;
>> +}
>> +
>> +static inline void __ibtrs_put_tag(struct ibtrs_clt *clt,
>> +                                   struct ibtrs_tag *tag)
>> +{
>> +        clear_bit_unlock(tag->mem_id, clt->tags_map);
>> +}
>> +
>> +struct ibtrs_tag *ibtrs_clt_get_tag(struct ibtrs_clt *clt,
>> +                                    enum ibtrs_clt_con_type con_type,
>> +                                    int can_wait)
>> +{
>> +        struct ibtrs_tag *tag;
>> +        DEFINE_WAIT(wait);
>> +
>> +        tag = __ibtrs_get_tag(clt, con_type);
>> +        if (likely(tag) || !can_wait)
>> +                return tag;
>> +
>> +        do {
>> +                prepare_to_wait(&clt->tags_wait, &wait,
>> +                                TASK_UNINTERRUPTIBLE);
>> +                tag = __ibtrs_get_tag(clt, con_type);
>> +                if (likely(tag))
>> +                        break;
>> +
>> +                io_schedule();
>> +        } while (1);
>> +
>> +        finish_wait(&clt->tags_wait, &wait);
>> +
>> +        return tag;
>> +}
>> +EXPORT_SYMBOL(ibtrs_clt_get_tag);
>> +
>> +void ibtrs_clt_put_tag(struct ibtrs_clt *clt, struct ibtrs_tag *tag)
>> +{
>> +        if (WARN_ON(!test_bit(tag->mem_id, clt->tags_map)))
>> +                return;
>> +
>> +        __ibtrs_put_tag(clt, tag);
>> +
>> +        /*
>> +         * Putting a tag is a barrier, so we will observe
>> +         * new entry in the wait list, no worries.
>> +         */
>> +        if (waitqueue_active(&clt->tags_wait))
>> +                wake_up(&clt->tags_wait);
>> +}
>> +EXPORT_SYMBOL(ibtrs_clt_put_tag);
>
> Again, it is not clear why the tags are needed...

We have two separate instances: a block device (IBNBD) and a transport
library (IBTRS). Many block devices share the same IBTRS session with a
fixed-size queue depth. Tags are a part of IBTRS, so with an allocated
tag you get a free slot of a buffer where you can read/write, so once
you've allocated a tag you won't sleep on the IO path inside the
library. Also a tag helps a lot on IO fail-over to another connection
(multipath implementation, which is also a part of the transport
library, not the block device), where you simply reuse the same buffer
slot (with a tag in your hands), forwarding IO to another RDMA
connection.
Re: [PATCH 05/24] ibtrs: client: main functionality
> Indeed, seems sbitmap can be reused.
>
> But tags is a part of IBTRS, and is not related to the block device at
> all. One IBTRS connection (session) handles many block devices

we use host shared tag sets for the case of multiple block devices.

> (or any IO producers).

Let's wait until we actually have these theoretical non-block IO
producers..

> With a tag you get a free slot of a buffer where you can read/write,
> so once you've allocated a tag you won't sleep on the IO path inside
> the library.

Same for block tags (given that you don't set the request queue
otherwise)

> Also a tag helps a lot on IO fail-over to another connection
> (multipath implementation, which is also a part of the transport
> library, not the block device), where you simply reuse the same buffer
> slot (with a tag in your hands), forwarding IO to another RDMA
> connection.

What is the benefit of this detached architecture? IMO, one reason why
you ended up not reusing a lot of the infrastructure is yielded from the
attempt to support a theoretical different consumer that is not ibnbd.
Did you actually have plans for any other consumers?

Personally, I think you will be much better off with a unified approach
for your block device implementation.
Re: [PATCH 05/24] ibtrs: client: main functionality
On Fri, Feb 2, 2018 at 5:54 PM, Bart Van Assche wrote:
> On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
>> +static inline struct ibtrs_tag *
>> +__ibtrs_get_tag(struct ibtrs_clt *clt, enum ibtrs_clt_con_type con_type)
>> +{
>> +        size_t max_depth = clt->queue_depth;
>> +        struct ibtrs_tag *tag;
>> +        int cpu, bit;
>> +
>> +        cpu = get_cpu();
>> +        do {
>> +                bit = find_first_zero_bit(clt->tags_map, max_depth);
>> +                if (unlikely(bit >= max_depth)) {
>> +                        put_cpu();
>> +                        return NULL;
>> +                }
>> +        } while (unlikely(test_and_set_bit_lock(bit, clt->tags_map)));
>> +        put_cpu();
>> +
>> +        tag = GET_TAG(clt, bit);
>> +        WARN_ON(tag->mem_id != bit);
>> +        tag->cpu_id = cpu;
>> +        tag->con_type = con_type;
>> +
>> +        return tag;
>> +}
>> +
>> +static inline void __ibtrs_put_tag(struct ibtrs_clt *clt,
>> +                                   struct ibtrs_tag *tag)
>> +{
>> +        clear_bit_unlock(tag->mem_id, clt->tags_map);
>> +}
>> +
>> +struct ibtrs_tag *ibtrs_clt_get_tag(struct ibtrs_clt *clt,
>> +                                    enum ibtrs_clt_con_type con_type,
>> +                                    int can_wait)
>> +{
>> +        struct ibtrs_tag *tag;
>> +        DEFINE_WAIT(wait);
>> +
>> +        tag = __ibtrs_get_tag(clt, con_type);
>> +        if (likely(tag) || !can_wait)
>> +                return tag;
>> +
>> +        do {
>> +                prepare_to_wait(&clt->tags_wait, &wait,
>> +                                TASK_UNINTERRUPTIBLE);
>> +                tag = __ibtrs_get_tag(clt, con_type);
>> +                if (likely(tag))
>> +                        break;
>> +
>> +                io_schedule();
>> +        } while (1);
>> +
>> +        finish_wait(&clt->tags_wait, &wait);
>> +
>> +        return tag;
>> +}
>> +EXPORT_SYMBOL(ibtrs_clt_get_tag);
>> +
>> +void ibtrs_clt_put_tag(struct ibtrs_clt *clt, struct ibtrs_tag *tag)
>> +{
>> +        if (WARN_ON(!test_bit(tag->mem_id, clt->tags_map)))
>> +                return;
>> +
>> +        __ibtrs_put_tag(clt, tag);
>> +
>> +        /*
>> +         * Putting a tag is a barrier, so we will observe
>> +         * new entry in the wait list, no worries.
>> +         */
>> +        if (waitqueue_active(&clt->tags_wait))
>> +                wake_up(&clt->tags_wait);
>> +}
>> +EXPORT_SYMBOL(ibtrs_clt_put_tag);
>
> Do these functions have any advantage over the code in lib/sbitmap.c?
> If not, please call the sbitmap functions instead of adding an
> additional tag allocator.

Indeed, seems sbitmap can be reused.

But tags is a part of IBTRS, and is not related to the block device at
all. One IBTRS connection (session) handles many block devices (or any
IO producers). With a tag you get a free slot of a buffer where you can
read/write, so once you've allocated a tag you won't sleep on the IO
path inside the library. Also a tag helps a lot on IO fail-over to
another connection (multipath implementation, which is also a part of
the transport library, not the block device), where you simply reuse the
same buffer slot (with a tag in your hands), forwarding IO to another
RDMA connection.

--
Roman
Re: [PATCH 05/24] ibtrs: client: main functionality
Hi Roman,

> +static inline void ibtrs_clt_state_lock(void)
> +{
> +        rcu_read_lock();
> +}
> +
> +static inline void ibtrs_clt_state_unlock(void)
> +{
> +        rcu_read_unlock();
> +}

This looks rather pointless...

> +#define cmpxchg_min(var, new) ({                                   \
> +        typeof(var) old;                                           \
> +                                                                   \
> +        do {                                                       \
> +                old = var;                                         \
> +                new = (!old ? new : min_t(typeof(var), old, new)); \
> +        } while (cmpxchg(&var, old, new) != old);                  \
> +})

Why is this sort of thing local to your driver?

> +/**
> + * struct ibtrs_fr_pool - pool of fast registration descriptors
> + *
> + * An entry is available for allocation if and only if it occurs in
> + * @free_list.
> + *
> + * @size:              Number of descriptors in this pool.
> + * @max_page_list_len: Maximum fast registration work request page
> + *                     list length.
> + * @lock:              Protects free_list.
> + * @free_list:         List of free descriptors.
> + * @desc:              Fast registration descriptor pool.
> + */
> +struct ibtrs_fr_pool {
> +        int                     size;
> +        int                     max_page_list_len;
> +        spinlock_t              lock; /* protects free_list */
> +        struct list_head        free_list;
> +        struct ibtrs_fr_desc    desc[0];
> +};

We already have a per-qp fr list implementation, any specific reason to
implement it again?

> +static inline struct ibtrs_tag *
> +__ibtrs_get_tag(struct ibtrs_clt *clt, enum ibtrs_clt_con_type con_type)
> +{
> +        size_t max_depth = clt->queue_depth;
> +        struct ibtrs_tag *tag;
> +        int cpu, bit;
> +
> +        cpu = get_cpu();
> +        do {
> +                bit = find_first_zero_bit(clt->tags_map, max_depth);
> +                if (unlikely(bit >= max_depth)) {
> +                        put_cpu();
> +                        return NULL;
> +                }
> +        } while (unlikely(test_and_set_bit_lock(bit, clt->tags_map)));
> +        put_cpu();
> +
> +        tag = GET_TAG(clt, bit);
> +        WARN_ON(tag->mem_id != bit);
> +        tag->cpu_id = cpu;
> +        tag->con_type = con_type;
> +
> +        return tag;
> +}
> +
> +static inline void __ibtrs_put_tag(struct ibtrs_clt *clt,
> +                                   struct ibtrs_tag *tag)
> +{
> +        clear_bit_unlock(tag->mem_id, clt->tags_map);
> +}
> +
> +struct ibtrs_tag *ibtrs_clt_get_tag(struct ibtrs_clt *clt,
> +                                    enum ibtrs_clt_con_type con_type,
> +                                    int can_wait)
> +{
> +        struct ibtrs_tag *tag;
> +        DEFINE_WAIT(wait);
> +
> +        tag = __ibtrs_get_tag(clt, con_type);
> +        if (likely(tag) || !can_wait)
> +                return tag;
> +
> +        do {
> +                prepare_to_wait(&clt->tags_wait, &wait,
> +                                TASK_UNINTERRUPTIBLE);
> +                tag = __ibtrs_get_tag(clt, con_type);
> +                if (likely(tag))
> +                        break;
> +
> +                io_schedule();
> +        } while (1);
> +
> +        finish_wait(&clt->tags_wait, &wait);
> +
> +        return tag;
> +}
> +EXPORT_SYMBOL(ibtrs_clt_get_tag);
> +
> +void ibtrs_clt_put_tag(struct ibtrs_clt *clt, struct ibtrs_tag *tag)
> +{
> +        if (WARN_ON(!test_bit(tag->mem_id, clt->tags_map)))
> +                return;
> +
> +        __ibtrs_put_tag(clt, tag);
> +
> +        /*
> +         * Putting a tag is a barrier, so we will observe
> +         * new entry in the wait list, no worries.
> +         */
> +        if (waitqueue_active(&clt->tags_wait))
> +                wake_up(&clt->tags_wait);
> +}
> +EXPORT_SYMBOL(ibtrs_clt_put_tag);

Again, it is not clear why the tags are needed...

> +/**
> + * ibtrs_destroy_fr_pool() - free the resources owned by a pool
> + * @pool: Fast registration pool to be destroyed.
> + */
> +static void ibtrs_destroy_fr_pool(struct ibtrs_fr_pool *pool)
> +{
> +        struct ibtrs_fr_desc *d;
> +        int i, err;
> +
> +        if (!pool)
> +                return;
> +
> +        for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) {
> +                if (d->mr) {
> +                        err = ib_dereg_mr(d->mr);
> +                        if (err)
> +                                pr_err("Failed to deregister memory region,"
> +                                       " err: %d\n", err);
> +                }
> +        }
> +        kfree(pool);
> +}
> +
> +/**
> + * ibtrs_create_fr_pool() - allocate and initialize a pool for fast
> + * registration
> + * @device:            IB device to allocate fast registration
> + *                     descriptors for.
> + * @pd:                Protection domain associated with the FR
> + *                     descriptors.
> + * @pool_size:         Number of descriptors to allocate.
> + * @max_page_list_len: Maximum fast registration work request page list
> + *                     length.
> + */
> +static struct ibtrs_fr_pool *ibtrs_create_fr_pool(struct ib_device *device,
> +                                                  struct ib_pd *pd,
> +                                                  int pool_size,
> +
Re: [PATCH 05/24] ibtrs: client: main functionality
On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
> +static inline struct ibtrs_tag *
> +__ibtrs_get_tag(struct ibtrs_clt *clt, enum ibtrs_clt_con_type con_type)
> +{
> +        size_t max_depth = clt->queue_depth;
> +        struct ibtrs_tag *tag;
> +        int cpu, bit;
> +
> +        cpu = get_cpu();
> +        do {
> +                bit = find_first_zero_bit(clt->tags_map, max_depth);
> +                if (unlikely(bit >= max_depth)) {
> +                        put_cpu();
> +                        return NULL;
> +                }
> +        } while (unlikely(test_and_set_bit_lock(bit, clt->tags_map)));
> +        put_cpu();
> +
> +        tag = GET_TAG(clt, bit);
> +        WARN_ON(tag->mem_id != bit);
> +        tag->cpu_id = cpu;
> +        tag->con_type = con_type;
> +
> +        return tag;
> +}
> +
> +static inline void __ibtrs_put_tag(struct ibtrs_clt *clt,
> +                                   struct ibtrs_tag *tag)
> +{
> +        clear_bit_unlock(tag->mem_id, clt->tags_map);
> +}
> +
> +struct ibtrs_tag *ibtrs_clt_get_tag(struct ibtrs_clt *clt,
> +                                    enum ibtrs_clt_con_type con_type,
> +                                    int can_wait)
> +{
> +        struct ibtrs_tag *tag;
> +        DEFINE_WAIT(wait);
> +
> +        tag = __ibtrs_get_tag(clt, con_type);
> +        if (likely(tag) || !can_wait)
> +                return tag;
> +
> +        do {
> +                prepare_to_wait(&clt->tags_wait, &wait,
> +                                TASK_UNINTERRUPTIBLE);
> +                tag = __ibtrs_get_tag(clt, con_type);
> +                if (likely(tag))
> +                        break;
> +
> +                io_schedule();
> +        } while (1);
> +
> +        finish_wait(&clt->tags_wait, &wait);
> +
> +        return tag;
> +}
> +EXPORT_SYMBOL(ibtrs_clt_get_tag);
> +
> +void ibtrs_clt_put_tag(struct ibtrs_clt *clt, struct ibtrs_tag *tag)
> +{
> +        if (WARN_ON(!test_bit(tag->mem_id, clt->tags_map)))
> +                return;
> +
> +        __ibtrs_put_tag(clt, tag);
> +
> +        /*
> +         * Putting a tag is a barrier, so we will observe
> +         * new entry in the wait list, no worries.
> +         */
> +        if (waitqueue_active(&clt->tags_wait))
> +                wake_up(&clt->tags_wait);
> +}
> +EXPORT_SYMBOL(ibtrs_clt_put_tag);

Do these functions have any advantage over the code in lib/sbitmap.c? If
not, please call the sbitmap functions instead of adding an additional
tag allocator.

Thanks,

Bart.