Re: [PATCH 05/24] ibtrs: client: main functionality

2018-02-05 Thread Roman Penyaev
On Mon, Feb 5, 2018 at 3:14 PM, Sagi Grimberg  wrote:
>
>> Indeed, seems sbitmap can be reused.
>>
>> But tags are a part of IBTRS, and are not related to the block device at all.
>> One IBTRS connection (session) handles many block devices
>
>
> we use host shared tag sets for the case of multiple block devices.

Unfortunately (or fortunately, depending on the intended mq design) tags are
not shared between hw_queues.  So in our case (1 session queue, N devices)
you always have to specify tags->nr_hw_queues = 1 or the magic will not happen,
and you will always have more tags than your session supports.  But
nr_hw_queues = 1 kills performance dramatically.  What scales well is
the following: nr_hw_queues == num_online_cpus() == number of QPs in
one session.

>> (or any IO producers).
>
>
> Let's wait until we actually have these theoretical non-block IO
> producers..
>
>> With a tag you get a free slot of a buffer where you can read/write, so once
>> you've allocated a tag you won't sleep on IO path inside a library.
>
>
> Same for block tags (given that you don't set the request queue
> otherwise)
>
>> Also tag
>> helps a lot on IO fail-over to another connection (multipath implementation,
>> which is also a part of the transport library, not a block device), where you
>> simply reuse the same buffer slot (with a tag in your hands) forwarding IO to
>> another RDMA connection.
>
>
> What is the benefit of this detached architecture?

That gives us a separate RDMA IO library, where ibnbd is one of the players.

> IMO, one reason why you ended up not reusing a lot of the infrastructure
> stems from the attempt to support a theoretical different consumer
> that is not ibnbd.

Well, not quite.  Not using the RDMA API helpers (we will use them) and not
using tags from the block layer (we need tags inside the transport): that is
not "a lot of the infrastructure" :)

I would say that we are not fast enough to follow all kernel trends.
That is the major reason, not some other user of ibtrs.

> Did you actually have plans for any other consumers?

Yep, the major target is replicated block storage; that is why the transport
is separate.

> Personally, I think you will be much better off with a unified approach
> for your block device implementation.

--
Roman


Re: [PATCH 05/24] ibtrs: client: main functionality

2018-02-05 Thread Bart Van Assche
On Mon, 2018-02-05 at 15:19 +0100, Roman Penyaev wrote:
> On Mon, Feb 5, 2018 at 12:19 PM, Sagi Grimberg  wrote:
> > Do you actually ever have remote write access in your protocol?
> 
> We do not have reads; instead the client writes on write and the server
> writes on read (a write-only storage solution :)

So there are no restrictions on which clients can log in and any client can
send RDMA writes to the target system? I think this needs to be improved ...

Thanks,

Bart.




Re: [PATCH 05/24] ibtrs: client: main functionality

2018-02-05 Thread Roman Penyaev
Hi Sagi,

On Mon, Feb 5, 2018 at 12:19 PM, Sagi Grimberg  wrote:
> Hi Roman,
>
>> +static inline void ibtrs_clt_state_lock(void)
>> +{
>> +   rcu_read_lock();
>> +}
>> +
>> +static inline void ibtrs_clt_state_unlock(void)
>> +{
>> +   rcu_read_unlock();
>> +}
>
>
> This looks rather pointless...

Yeah, old scraps.  At some point those were more than just wrappers
around RCU.  Now RCU can be called explicitly, that is true.
Thanks.

>
>> +
>> +#define cmpxchg_min(var, new) ({   \
>> +   typeof(var) old;\
>> +   \
>> +   do {\
>> +   old = var;  \
>> +   new = (!old ? new : min_t(typeof(var), old, new));  \
>> +   } while (cmpxchg(&var, old, new) != old);   \
>> +})
>
>
> Why is this sort of thing local to your driver?

Good question :)  Most likely because personally I do not know
what the right generic place for this kind of stuff would be.

Probably I share the same feeling as the author of these lines
in nvme/host/rdma.c: put_unaligned_le24() :)
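For what it's worth, the same retry-until-CAS-succeeds scheme can be sketched in portable C11 atomics (a userspace illustration only, with a hypothetical function name; the kernel macro above uses cmpxchg() and typeof() instead):

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Atomically lower *var to new_val if new_val is smaller, treating 0 as
 * "unset" (mirroring the !old special case of the kernel macro).  The loop
 * retries until the compare-and-swap succeeds against a stable snapshot.
 */
static void cmpxchg_min_u32(_Atomic uint32_t *var, uint32_t new_val)
{
	uint32_t old = atomic_load(var);

	for (;;) {
		uint32_t want = old ? (old < new_val ? old : new_val)
				    : new_val;

		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(var, &old, want))
			break;
	}
}
```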

>> +/**
>> + * struct ibtrs_fr_pool - pool of fast registration descriptors
>> + *
>> + * An entry is available for allocation if and only if it occurs in @free_list.
>> + *
>> + * @size:  Number of descriptors in this pool.
>> + * @max_page_list_len: Maximum fast registration work request page list length.
>> + * @lock:  Protects free_list.
>> + * @free_list: List of free descriptors.
>> + * @desc:  Fast registration descriptor pool.
>> + */
>> +struct ibtrs_fr_pool {
>> +   int size;
>> +   int max_page_list_len;
>> +   spinlock_t  lock; /* protects free_list */
>> +   struct list_headfree_list;
>> +   struct ibtrs_fr_descdesc[0];
>> +};
>
>
> We already have a per-qp fr list implementation, any specific reason to
> implement it again?

No, FR is the part of the code which we are not using; FMR is faster
in our setup.  So we will need to iterate on FR mode again, thanks.


>> +static inline struct ibtrs_tag *
>> +__ibtrs_get_tag(struct ibtrs_clt *clt, enum ibtrs_clt_con_type con_type)
>> +{
>> +   size_t max_depth = clt->queue_depth;
>> +   struct ibtrs_tag *tag;
>> +   int cpu, bit;
>> +
>> +   cpu = get_cpu();
>> +   do {
>> +   bit = find_first_zero_bit(clt->tags_map, max_depth);
>> +   if (unlikely(bit >= max_depth)) {
>> +   put_cpu();
>> +   return NULL;
>> +   }
>> +
>> +   } while (unlikely(test_and_set_bit_lock(bit, clt->tags_map)));
>> +   put_cpu();
>> +
>> +   tag = GET_TAG(clt, bit);
>> +   WARN_ON(tag->mem_id != bit);
>> +   tag->cpu_id = cpu;
>> +   tag->con_type = con_type;
>> +
>> +   return tag;
>> +}
>> +
>> +static inline void __ibtrs_put_tag(struct ibtrs_clt *clt,
>> +  struct ibtrs_tag *tag)
>> +{
>> +   clear_bit_unlock(tag->mem_id, clt->tags_map);
>> +}
>> +
>> +struct ibtrs_tag *ibtrs_clt_get_tag(struct ibtrs_clt *clt,
>> +   enum ibtrs_clt_con_type con_type,
>> +   int can_wait)
>> +{
>> +   struct ibtrs_tag *tag;
>> +   DEFINE_WAIT(wait);
>> +
>> +   tag = __ibtrs_get_tag(clt, con_type);
>> +   if (likely(tag) || !can_wait)
>> +   return tag;
>> +
>> +   do {
>> +   prepare_to_wait(&clt->tags_wait, &wait,
>> TASK_UNINTERRUPTIBLE);
>> +   tag = __ibtrs_get_tag(clt, con_type);
>> +   if (likely(tag))
>> +   break;
>> +
>> +   io_schedule();
>> +   } while (1);
>> +
>> +   finish_wait(&clt->tags_wait, &wait);
>> +
>> +   return tag;
>> +}
>> +EXPORT_SYMBOL(ibtrs_clt_get_tag);
>> +
>> +void ibtrs_clt_put_tag(struct ibtrs_clt *clt, struct ibtrs_tag *tag)
>> +{
>> +   if (WARN_ON(!test_bit(tag->mem_id, clt->tags_map)))
>> +   return;
>> +
>> +   __ibtrs_put_tag(clt, tag);
>> +
>> +   /*
>> +* Putting a tag is a barrier, so we will observe
>> +* new entry in the wait list, no worries.
>> +*/
>> +   if (waitqueue_active(&clt->tags_wait))
>> +   wake_up(&clt->tags_wait);
>> +}
>> +EXPORT_SYMBOL(ibtrs_clt_put_tag);
>
>
> Again, the tags are not clear why they are needed...

We have two separate instances: a block device (IBNBD) and a transport
library (IBTRS).  Many block devices share the same IBTRS session
with a fixed-size queue depth.  Tags are a part of IBTRS, so with an allocated
tag you get a free slot of a buffer where you can read/write, so once
you've allocated a tag you won't sleep on the IO path inside the library.
Also a tag helps a lot on IO fail-over to another connection (multipath
implementation, which is also a part of the transport library, not a block
device), where you simply reuse the same buffer slot (with a tag in your
hands), forwarding IO to another RDMA connection.

Re: [PATCH 05/24] ibtrs: client: main functionality

2018-02-05 Thread Sagi Grimberg



Indeed, seems sbitmap can be reused.

But tags are a part of IBTRS, and are not related to the block device at all.  One
IBTRS connection (session) handles many block devices


we use host shared tag sets for the case of multiple block devices.


(or any IO producers).


Let's wait until we actually have these theoretical non-block IO
producers..


With a tag you get a free slot of a buffer where you can read/write, so once
you've allocated a tag you won't sleep on IO path inside a library.


Same for block tags (given that you don't set the request queue
otherwise)


Also tag
helps a lot on IO fail-over to another connection (multipath implementation,
which is also a part of the transport library, not a block device), where you
simply reuse the same buffer slot (with a tag in your hands) forwarding IO to
another RDMA connection.


What is the benefit of this detached architecture? IMO, one reason why
you ended up not reusing a lot of the infrastructure stems from the
attempt to support a theoretical different consumer that is not ibnbd.
Did you actually have plans for any other consumers?

Personally, I think you will be much better off with a unified approach
for your block device implementation.


Re: [PATCH 05/24] ibtrs: client: main functionality

2018-02-05 Thread Roman Penyaev
On Fri, Feb 2, 2018 at 5:54 PM, Bart Van Assche  wrote:
> On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
>> +static inline struct ibtrs_tag *
>> +__ibtrs_get_tag(struct ibtrs_clt *clt, enum ibtrs_clt_con_type con_type)
>> +{
>> + size_t max_depth = clt->queue_depth;
>> + struct ibtrs_tag *tag;
>> + int cpu, bit;
>> +
>> + cpu = get_cpu();
>> + do {
>> + bit = find_first_zero_bit(clt->tags_map, max_depth);
>> + if (unlikely(bit >= max_depth)) {
>> + put_cpu();
>> + return NULL;
>> + }
>> +
>> + } while (unlikely(test_and_set_bit_lock(bit, clt->tags_map)));
>> + put_cpu();
>> +
>> + tag = GET_TAG(clt, bit);
>> + WARN_ON(tag->mem_id != bit);
>> + tag->cpu_id = cpu;
>> + tag->con_type = con_type;
>> +
>> + return tag;
>> +}
>> +
>> +static inline void __ibtrs_put_tag(struct ibtrs_clt *clt,
>> +struct ibtrs_tag *tag)
>> +{
>> + clear_bit_unlock(tag->mem_id, clt->tags_map);
>> +}
>> +
>> +struct ibtrs_tag *ibtrs_clt_get_tag(struct ibtrs_clt *clt,
>> + enum ibtrs_clt_con_type con_type,
>> + int can_wait)
>> +{
>> + struct ibtrs_tag *tag;
>> + DEFINE_WAIT(wait);
>> +
>> + tag = __ibtrs_get_tag(clt, con_type);
>> + if (likely(tag) || !can_wait)
>> + return tag;
>> +
>> + do {
>> + prepare_to_wait(&clt->tags_wait, &wait, TASK_UNINTERRUPTIBLE);
>> + tag = __ibtrs_get_tag(clt, con_type);
>> + if (likely(tag))
>> + break;
>> +
>> + io_schedule();
>> + } while (1);
>> +
>> + finish_wait(&clt->tags_wait, &wait);
>> +
>> + return tag;
>> +}
>> +EXPORT_SYMBOL(ibtrs_clt_get_tag);
>> +
>> +void ibtrs_clt_put_tag(struct ibtrs_clt *clt, struct ibtrs_tag *tag)
>> +{
>> + if (WARN_ON(!test_bit(tag->mem_id, clt->tags_map)))
>> + return;
>> +
>> + __ibtrs_put_tag(clt, tag);
>> +
>> + /*
>> +  * Putting a tag is a barrier, so we will observe
>> +  * new entry in the wait list, no worries.
>> +  */
>> + if (waitqueue_active(&clt->tags_wait))
>> + wake_up(&clt->tags_wait);
>> +}
>> +EXPORT_SYMBOL(ibtrs_clt_put_tag);
>
> Do these functions have any advantage over the code in lib/sbitmap.c? If not,
> please call the sbitmap functions instead of adding an additional tag allocator.

Indeed, seems sbitmap can be reused.

But tags are a part of IBTRS, and are not related to the block device at all.  One
IBTRS connection (session) handles many block devices (or any IO producers).
With a tag you get a free slot of a buffer where you can read/write, so once
you've allocated a tag you won't sleep on the IO path inside the library.  Also a
tag helps a lot on IO fail-over to another connection (multipath implementation,
which is also a part of the transport library, not a block device), where you
simply reuse the same buffer slot (with a tag in your hands), forwarding IO to
another RDMA connection.
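As a rough userspace sketch of that slot scheme (hypothetical names and a fixed depth; the kernel code in the patch uses find_first_zero_bit()/test_and_set_bit_lock() and sleeps in io_schedule() when the map is full), the tag is simply the index of the buffer slot that was won:

```c
#include <stdatomic.h>

#define QUEUE_DEPTH 64
#define ULONG_BITS  (8 * (int)sizeof(unsigned long))

/* One bit per buffer slot of the session; zeroed at startup. */
static _Atomic unsigned long tags_map[(QUEUE_DEPTH + ULONG_BITS - 1) /
				      ULONG_BITS];

/* Claim a free slot: returns its index (the "tag"), or -1 if all are busy. */
static int get_tag(void)
{
	for (int bit = 0; bit < QUEUE_DEPTH; bit++) {
		unsigned long mask = 1UL << (bit % ULONG_BITS);
		unsigned long old = atomic_fetch_or(&tags_map[bit / ULONG_BITS],
						    mask);

		/* The bit was clear and we set it: the slot is ours. */
		if (!(old & mask))
			return bit;
	}
	return -1; /* queue full; the kernel code would sleep here */
}

/* Release the slot so another producer (or a failed-over path) can reuse it. */
static void put_tag(int bit)
{
	atomic_fetch_and(&tags_map[bit / ULONG_BITS],
			 ~(1UL << (bit % ULONG_BITS)));
}
```

On fail-over, the holder of a tag keeps the same slot index and simply resubmits the IO on another connection, which is why the allocator lives in the transport rather than in the block layer.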

--
Roman


Re: [PATCH 05/24] ibtrs: client: main functionality

2018-02-05 Thread Sagi Grimberg

Hi Roman,


+static inline void ibtrs_clt_state_lock(void)
+{
+   rcu_read_lock();
+}
+
+static inline void ibtrs_clt_state_unlock(void)
+{
+   rcu_read_unlock();
+}


This looks rather pointless...


+
+#define cmpxchg_min(var, new) ({   \
+   typeof(var) old;\
+   \
+   do {\
+   old = var;  \
+   new = (!old ? new : min_t(typeof(var), old, new));  \
+   } while (cmpxchg(&var, old, new) != old);   \
+})


Why is this sort of thing local to your driver?


+/**
+ * struct ibtrs_fr_pool - pool of fast registration descriptors
+ *
+ * An entry is available for allocation if and only if it occurs in @free_list.
+ *
+ * @size:  Number of descriptors in this pool.
+ * @max_page_list_len: Maximum fast registration work request page list length.
+ * @lock:  Protects free_list.
+ * @free_list: List of free descriptors.
+ * @desc:  Fast registration descriptor pool.
+ */
+struct ibtrs_fr_pool {
+   int size;
+   int max_page_list_len;
+   spinlock_t  lock; /* protects free_list */
+   struct list_headfree_list;
+   struct ibtrs_fr_descdesc[0];
+};


We already have a per-qp fr list implementation, any specific reason to
implement it again?


+static inline struct ibtrs_tag *
+__ibtrs_get_tag(struct ibtrs_clt *clt, enum ibtrs_clt_con_type con_type)
+{
+   size_t max_depth = clt->queue_depth;
+   struct ibtrs_tag *tag;
+   int cpu, bit;
+
+   cpu = get_cpu();
+   do {
+   bit = find_first_zero_bit(clt->tags_map, max_depth);
+   if (unlikely(bit >= max_depth)) {
+   put_cpu();
+   return NULL;
+   }
+
+   } while (unlikely(test_and_set_bit_lock(bit, clt->tags_map)));
+   put_cpu();
+
+   tag = GET_TAG(clt, bit);
+   WARN_ON(tag->mem_id != bit);
+   tag->cpu_id = cpu;
+   tag->con_type = con_type;
+
+   return tag;
+}
+
+static inline void __ibtrs_put_tag(struct ibtrs_clt *clt,
+  struct ibtrs_tag *tag)
+{
+   clear_bit_unlock(tag->mem_id, clt->tags_map);
+}
+
+struct ibtrs_tag *ibtrs_clt_get_tag(struct ibtrs_clt *clt,
+   enum ibtrs_clt_con_type con_type,
+   int can_wait)
+{
+   struct ibtrs_tag *tag;
+   DEFINE_WAIT(wait);
+
+   tag = __ibtrs_get_tag(clt, con_type);
+   if (likely(tag) || !can_wait)
+   return tag;
+
+   do {
+   prepare_to_wait(&clt->tags_wait, &wait, TASK_UNINTERRUPTIBLE);
+   tag = __ibtrs_get_tag(clt, con_type);
+   if (likely(tag))
+   break;
+
+   io_schedule();
+   } while (1);
+
+   finish_wait(&clt->tags_wait, &wait);
+
+   return tag;
+}
+EXPORT_SYMBOL(ibtrs_clt_get_tag);
+
+void ibtrs_clt_put_tag(struct ibtrs_clt *clt, struct ibtrs_tag *tag)
+{
+   if (WARN_ON(!test_bit(tag->mem_id, clt->tags_map)))
+   return;
+
+   __ibtrs_put_tag(clt, tag);
+
+   /*
+* Putting a tag is a barrier, so we will observe
+* new entry in the wait list, no worries.
+*/
+   if (waitqueue_active(&clt->tags_wait))
+   wake_up(&clt->tags_wait);
+}
+EXPORT_SYMBOL(ibtrs_clt_put_tag);


Again, the tags are not clear why they are needed...


+/**
+ * ibtrs_destroy_fr_pool() - free the resources owned by a pool
+ * @pool: Fast registration pool to be destroyed.
+ */
+static void ibtrs_destroy_fr_pool(struct ibtrs_fr_pool *pool)
+{
+   struct ibtrs_fr_desc *d;
+   int i, err;
+
+   if (!pool)
+   return;
+
+   for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) {
+   if (d->mr) {
+   err = ib_dereg_mr(d->mr);
+   if (err)
+   pr_err("Failed to deregister memory region,"
+  " err: %d\n", err);
+   }
+   }
+   kfree(pool);
+}
+
+/**
+ * ibtrs_create_fr_pool() - allocate and initialize a pool for fast registration
+ * @device:IB device to allocate fast registration descriptors for.
+ * @pd:Protection domain associated with the FR descriptors.
+ * @pool_size: Number of descriptors to allocate.
+ * @max_page_list_len: Maximum fast registration work request page list length.
+ */
+static struct ibtrs_fr_pool *ibtrs_create_fr_pool(struct ib_device *device,
+ struct ib_pd *pd,
+ int pool_size,
+   

Re: [PATCH 05/24] ibtrs: client: main functionality

2018-02-02 Thread Bart Van Assche
On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
> +static inline struct ibtrs_tag *
> +__ibtrs_get_tag(struct ibtrs_clt *clt, enum ibtrs_clt_con_type con_type)
> +{
> + size_t max_depth = clt->queue_depth;
> + struct ibtrs_tag *tag;
> + int cpu, bit;
> +
> + cpu = get_cpu();
> + do {
> + bit = find_first_zero_bit(clt->tags_map, max_depth);
> + if (unlikely(bit >= max_depth)) {
> + put_cpu();
> + return NULL;
> + }
> +
> + } while (unlikely(test_and_set_bit_lock(bit, clt->tags_map)));
> + put_cpu();
> +
> + tag = GET_TAG(clt, bit);
> + WARN_ON(tag->mem_id != bit);
> + tag->cpu_id = cpu;
> + tag->con_type = con_type;
> +
> + return tag;
> +}
> +
> +static inline void __ibtrs_put_tag(struct ibtrs_clt *clt,
> +struct ibtrs_tag *tag)
> +{
> + clear_bit_unlock(tag->mem_id, clt->tags_map);
> +}
> +
> +struct ibtrs_tag *ibtrs_clt_get_tag(struct ibtrs_clt *clt,
> + enum ibtrs_clt_con_type con_type,
> + int can_wait)
> +{
> + struct ibtrs_tag *tag;
> + DEFINE_WAIT(wait);
> +
> + tag = __ibtrs_get_tag(clt, con_type);
> + if (likely(tag) || !can_wait)
> + return tag;
> +
> + do {
> + prepare_to_wait(&clt->tags_wait, &wait, TASK_UNINTERRUPTIBLE);
> + tag = __ibtrs_get_tag(clt, con_type);
> + if (likely(tag))
> + break;
> +
> + io_schedule();
> + } while (1);
> +
> + finish_wait(&clt->tags_wait, &wait);
> +
> + return tag;
> +}
> +EXPORT_SYMBOL(ibtrs_clt_get_tag);
> +
> +void ibtrs_clt_put_tag(struct ibtrs_clt *clt, struct ibtrs_tag *tag)
> +{
> + if (WARN_ON(!test_bit(tag->mem_id, clt->tags_map)))
> + return;
> +
> + __ibtrs_put_tag(clt, tag);
> +
> + /*
> +  * Putting a tag is a barrier, so we will observe
> +  * new entry in the wait list, no worries.
> +  */
> + if (waitqueue_active(&clt->tags_wait))
> + wake_up(&clt->tags_wait);
> +}
> +EXPORT_SYMBOL(ibtrs_clt_put_tag);

Do these functions have any advantage over the code in lib/sbitmap.c? If not,
please call the sbitmap functions instead of adding an additional tag allocator.

Thanks,

Bart.