RE: [RFC] ibv_post_send()/ibv_post_recv() kernel path optimizations

2010-08-10 Thread Walukiewicz, Miroslaw
I agree with you that changing the kernel ABI is not necessary.
I will follow your directions regarding a single allocation at the start.

Regards,

Mirek 

-Original Message-
From: Roland Dreier [mailto:rdre...@cisco.com] 
Sent: Friday, August 06, 2010 5:58 PM
To: Walukiewicz, Miroslaw
Cc: linux-rdma@vger.kernel.org
Subject: Re: [RFC] ibv_post_send()/ibv_post_recv() kernel path optimizations

  The proposed path optimization removes the dynamic allocations
  by redefining the structure passed to the kernel.

  To 
  
  struct ibv_post_send {
  __u32 command;
  __u16 in_words;
  __u16 out_words;
  __u64 response;
  __u32 qp_handle;
  __u32 wr_count;
  __u32 sge_count;
  __u32 wqe_size;
  struct ibv_kern_send_wr send_wr[512];
  };

I don't see how this can possibly work.  Where does the scatter/gather
list go if you make this have a fixed size array of send_wr?

Also I don't see why you need to change the user/kernel ABI at all to
get rid of dynamic allocations... can't you just have the kernel keep a
cached send_wr allocation (say, per user context) and reuse that?  (ie
allocate memory but don't free the first time into post_send, and only
reallocate if a bigger send request comes, and only free when destroying
the context)
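
As a rough illustration of that reuse scheme, a minimal kernel-side sketch; struct wr_buffer_cache and the helper names are hypothetical, not existing ib_uverbs code:

#include <linux/slab.h>

/* Hypothetical per-context cache for the kernel copy of the WR/SGE list:
 * allocate on first use, grow only when a larger request arrives, free only
 * when the context is destroyed.  Callers are assumed to serialize access,
 * e.g. under an existing per-context lock. */
struct wr_buffer_cache {
    void   *buf;
    size_t  size;
};

static void *get_wr_buffer(struct wr_buffer_cache *cache, size_t needed)
{
    if (needed > cache->size) {
        void *bigger = krealloc(cache->buf, needed, GFP_KERNEL);

        if (!bigger)
            return NULL;
        cache->buf  = bigger;
        cache->size = needed;
    }
    return cache->buf;
}

static void free_wr_buffer(struct wr_buffer_cache *cache)
{
    kfree(cache->buf);      /* called only at context teardown */
    cache->buf  = NULL;
    cache->size = 0;
}

The first post_send pays one allocation; later calls reuse the buffer unless a larger request arrives, which matches the reuse scheme described above.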

 - R.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html


RE: [RFC] ibv_post_send()/ibv_post_recv() kernel path optimizations

2010-08-10 Thread Walukiewicz, Miroslaw
Hello Jason, 

Do you have any benchmarks that show the alloca is a measurable
overhead?  

We changed the overall path (both kernel and user space) to an allocation-less
approach and achieved twice better latency when calling into the kernel driver.
I have no data on which part is dominant - kernel or user space. I expect to
have some measurements next week, and I will share my results.

Roland is right, all you
really need is a per-context (+per-cpu?) buffer you can grab, fill,
and put back.

I agree. I will go in this direction.

Regards,

Mirek

-Original Message-
From: linux-rdma-ow...@vger.kernel.org 
[mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Jason Gunthorpe
Sent: Friday, August 06, 2010 6:33 PM
To: Walukiewicz, Miroslaw
Cc: Roland Dreier; linux-rdma@vger.kernel.org
Subject: Re: [RFC] ibv_post_send()/ibv_post_recv() kernel path optimizations

On Fri, Aug 06, 2010 at 11:03:36AM +0100, Walukiewicz, Miroslaw wrote:

 Currently the transmit/receive path works the following way: the user calls
 ibv_post_send(), where a vendor-specific function is called.  When the
 path has to go through the kernel, ibv_cmd_post_send() is called.
 That function creates the POST_SEND message body that is passed to the
 kernel.  As the number of SGEs is unknown, the message body is
 allocated dynamically.  (see libibverbs/src/cmd.c)

Do you have any benchmarks that show the alloca is a measurable
overhead?  I'm pretty skeptical... alloca will generally boil down to
one or two assembly instructions adjusting the stack pointer, and not
even that if you are lucky and it can be merged into the function
prologue.
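
For illustration, a simplified, self-contained sketch of the alloca pattern being discussed; the struct layouts are abbreviated stand-ins, not the real kern-abi definitions, and the function is hypothetical:

#include <alloca.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Abbreviated stand-ins for the kernel-ABI structures. */
struct sketch_kern_send_wr { uint64_t wr_id; uint32_t num_sge; uint32_t opcode; };
struct sketch_kern_sge     { uint64_t addr;  uint32_t length;  uint32_t lkey; };
struct sketch_post_send    { uint32_t wr_count; uint32_t sge_count; uint32_t wqe_size; };

static size_t build_post_send_cmd(unsigned wr_count, unsigned sge_count)
{
    /* The command size depends on how many WRs and SGEs the caller passes
     * in, which is why the buffer is sized at run time. */
    size_t cmd_size = sizeof(struct sketch_post_send)
            + wr_count  * sizeof(struct sketch_kern_send_wr)
            + sge_count * sizeof(struct sketch_kern_sge);
    struct sketch_post_send *cmd = alloca(cmd_size);  /* one stack-pointer bump */

    memset(cmd, 0, cmd_size);
    cmd->wr_count  = wr_count;
    cmd->sge_count = sge_count;
    /* ... append the WR and SGE arrays after the header, then write() the
     * whole command to the uverbs device file ... */
    return cmd_size;
}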

 In the kernel the message body is parsed and a structure of WRs and
 SGEs is recreated using dynamic allocations.  The goal of this
 operation is to have a structure similar to the one in user space.

.. the kmalloc call(s) on the other hand definitely seems worth
looking at ..

 In the kernel, in ib_uverbs_post_send(), instead of dynamically
 allocating the ib_send_wr structures, a table of 512 ib_send_wr
 structures will be defined and all entries will be linked into a
 singly-linked list, so the qp->device->post_send(qp, wr, bad_wr) API
 will not change.

Isn't there a kernel API already for managing a pool of
pre-allocated fixed-size allocations?
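
Presumably this refers to something like a kmem_cache (slab cache), possibly wrapped in a mempool. A minimal sketch, where the cache name and the choice of struct ib_send_wr as the cached object are illustrative assumptions:

#include <linux/init.h>
#include <linux/errno.h>
#include <linux/slab.h>
#include <rdma/ib_verbs.h>

/* Sketch: a slab cache of fixed-size WR objects. */
static struct kmem_cache *uverbs_wr_cache;

static int __init wr_cache_init(void)
{
    uverbs_wr_cache = kmem_cache_create("uverbs_send_wr",
                                        sizeof(struct ib_send_wr),
                                        0, SLAB_HWCACHE_ALIGN, NULL);
    return uverbs_wr_cache ? 0 : -ENOMEM;
}

static struct ib_send_wr *wr_alloc(void)
{
    return kmem_cache_zalloc(uverbs_wr_cache, GFP_KERNEL);
}

static void wr_free(struct ib_send_wr *wr)
{
    kmem_cache_free(uverbs_wr_cache, wr);
}

Even with such a cache there is still one allocation and free per WR per call, which is why a reusable per-context buffer, as suggested below, avoids more work.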

It isn't clear to me that is even necessary, Roland is right, all you
really need is a per-context (+per-cpu?) buffer you can grab, fill,
and put back.

 As far as I know, no driver uses that kernel path for posting buffers,
 so the iWARP multicast acceleration implemented in the NES driver would
 be the first application to utilize the optimized path.

??

Jason


Re: [RFC] ibv_post_send()/ibv_post_recv() kernel path optimizations

2010-08-06 Thread Roland Dreier
  The proposed path optimization removes the dynamic allocations
  by redefining the structure passed to the kernel.

  To 
  
  struct ibv_post_send {
  __u32 command;
  __u16 in_words;
  __u16 out_words;
  __u64 response;
  __u32 qp_handle;
  __u32 wr_count;
  __u32 sge_count;
  __u32 wqe_size;
  struct ibv_kern_send_wr send_wr[512];
  };

I don't see how this can possibly work.  Where does the scatter/gather
list go if you make this have a fixed size array of send_wr?

Also I don't see why you need to change the user/kernel ABI at all to
get rid of dynamic allocations... can't you just have the kernel keep a
cached send_wr allocation (say, per user context) and reuse that?  (ie
allocate memory but don't free the first time into post_send, and only
reallocate if a bigger send request comes, and only free when destroying
the context)

 - R.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html


Re: [RFC] ibv_post_send()/ibv_post_recv() kernel path optimizations

2010-08-06 Thread Jason Gunthorpe
On Fri, Aug 06, 2010 at 11:03:36AM +0100, Walukiewicz, Miroslaw wrote:

 Currently the transmit/receive path works the following way: the user calls
 ibv_post_send(), where a vendor-specific function is called.  When the
 path has to go through the kernel, ibv_cmd_post_send() is called.
 That function creates the POST_SEND message body that is passed to the
 kernel.  As the number of SGEs is unknown, the message body is
 allocated dynamically.  (see libibverbs/src/cmd.c)

Do you have any benchmarks that show the alloca is a measurable
overhead?  I'm pretty skeptical... alloca will generally boil down to
one or two assembly instructions adjusting the stack pointer, and not
even that if you are lucky and it can be merged into the function
prologue.

 In the kernel the message body is parsed and a structure of WRs and
 SGEs is recreated using dynamic allocations.  The goal of this
 operation is to have a structure similar to the one in user space.

.. the kmalloc call(s) on the other hand definitely seems worth
looking at ..

 In the kernel, in ib_uverbs_post_send(), instead of dynamically
 allocating the ib_send_wr structures, a table of 512 ib_send_wr
 structures will be defined and all entries will be linked into a
 singly-linked list, so the qp->device->post_send(qp, wr, bad_wr) API
 will not change.

Isn't there a kernel API already for managing a pool of
pre-allocated fixed-size allocations?

It isn't clear to me that is even necessary, Roland is right, all you
really need is a per-context (+per-cpu?) buffer you can grab, fill,
and put back.

 As far as I know, no driver uses that kernel path for posting buffers,
 so the iWARP multicast acceleration implemented in the NES driver would
 be the first application to utilize the optimized path.

??

Jason


Re: [RFC] ibv_post_send()/ibv_post_recv() kernel path optimizations

2010-08-06 Thread Ralph Campbell
On Fri, 2010-08-06 at 03:03 -0700, Walukiewicz, Miroslaw wrote:
 Currently the ibv_post_send()/ibv_post_recv() path through the kernel
 (using /dev/infiniband/rdmacm) could be optimized by removing dynamic memory
 allocations on the path.
 
 Currently the transmit/receive path works the following way:
 The user calls ibv_post_send(), where a vendor-specific function is called.
 When the path has to go through the kernel, ibv_cmd_post_send() is called.
 That function creates the POST_SEND message body that is passed to the kernel.
 As the number of SGEs is unknown, the message body is allocated dynamically.
 (see libibverbs/src/cmd.c)
 
 In the kernel the message body is parsed and a structure of WRs and SGEs is
 recreated using dynamic allocations.
 The goal of this operation is to have a structure similar to the one in user space.
 
 The proposed path optimization removes the dynamic allocations
 by redefining the structure passed to the kernel.
 From 
 
 struct ibv_post_send {
 __u32 command;
 __u16 in_words;
 __u16 out_words;
 __u64 response;
 __u32 qp_handle;
 __u32 wr_count;
 __u32 sge_count;
 __u32 wqe_size;
 struct ibv_kern_send_wr send_wr[0];
 };
 To 
 
 struct ibv_post_send {
 __u32 command;
 __u16 in_words;
 __u16 out_words;
 __u64 response;
 __u32 qp_handle;
 __u32 wr_count;
 __u32 sge_count;
 __u32 wqe_size;
 struct ibv_kern_send_wr send_wr[512];
 };
 
 A similar change is required in the kernel struct ib_uverbs_post_send defined in
 /ofa_kernel/include/rdma/ib_uverbs.h
 
 This change limits the number of send_wr entries passed from unlimited (made
 possible by dynamic allocation) to a reasonable number of 512.
 I think this number should be the maximum number of QP entries available for sending.
 As all IB/iWARP applications are low-latency applications, the number
 of WRs passed is never unlimited.
 
 As a result, instead of dynamic allocation, ibv_cmd_post_send() fills the
 proposed structure directly and passes it to the kernel. Whenever the number
 of send_wr entries exceeds the limit, the ENOMEM error is returned.
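
 Roughly, the user-space side could look like the sketch below. It assumes the
 fixed-size send_wr[512] variant of struct ibv_post_send proposed above and
 omits SGE packing; the helper name and SKETCH_MAX_WR constant are illustrative:

 #include <errno.h>

 enum { SKETCH_MAX_WR = 512 };   /* mirrors the proposed send_wr[512] */

 /* Sketch only: fill the fixed-size command in place instead of allocating
  * a variable-size buffer.  'cmd' uses the struct ibv_post_send layout
  * proposed above. */
 static int post_send_fill_fixed(struct ibv_post_send *cmd,
                                 const struct ibv_kern_send_wr *wr_list,
                                 unsigned wr_count)
 {
     unsigned i;

     if (wr_count > SKETCH_MAX_WR)
         return ENOMEM;          /* WR list too long for the fixed table */

     cmd->wr_count = wr_count;
     for (i = 0; i < wr_count; i++)
         cmd->send_wr[i] = wr_list[i];   /* copy in place, no malloc/alloca */

     /* ... then write() the command to the uverbs device as before ... */
     return 0;
 }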
 
 In the kernel, in ib_uverbs_post_send(), instead of dynamically allocating the
 ib_send_wr structures, a table of 512 ib_send_wr structures will be defined and
 all entries will be linked into a singly-linked list, so the
 qp->device->post_send(qp, wr, bad_wr) API will not change.
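
 A sketch of how such a pre-allocated table could be chained so that the
 existing qp->device->post_send() interface still receives a linked list.
 Where the table lives (static, per context, per CPU) is left open here, and
 the helper name is illustrative:

 #include <rdma/ib_verbs.h>

 /* Sketch only: link 'count' consecutive entries of a pre-allocated
  * ib_send_wr table into the singly-linked list expected by
  * qp->device->post_send(). */
 static struct ib_send_wr *link_wr_table(struct ib_send_wr *table,
                                         unsigned int count)
 {
     unsigned int i;

     if (!count)
         return NULL;

     for (i = 0; i + 1 < count; i++)
         table[i].next = &table[i + 1];
     table[count - 1].next = NULL;

     return &table[0];               /* head of the WR chain */
 }

 /* Usage sketch, after the user WRs have been copied into 'table':
  *
  *     struct ib_send_wr *wr = link_wr_table(table, wr_count);
  *     struct ib_send_wr *bad_wr;
  *
  *     ret = qp->device->post_send(qp, wr, &bad_wr);
  */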
 
 As far as I know, no driver uses that kernel path for posting buffers, so the
 iWARP multicast acceleration implemented in the NES driver
 would be the first application to utilize the optimized path.
 
 Regards,
 
 Mirek
 
 Signed-off-by: Mirek Walukiewicz miroslaw.walukiew...@intel.com

The libipathverbs.so plug-in for libibverbs and
the ib_ipath and ib_qib kernel modules use this path for
ibv_post_send().
