[dpdk-dev] [PATCH v6 4/8] eal/linux: add per rx queue interrupt handling based on VFIO

2015-02-28 Thread Liang, Cunming
Thanks Thomas.
It's my fault that directly reply David's mail, haven't notice his mail isn't 
in a plain text mode.

> -Original Message-
> From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> Sent: Friday, February 27, 2015 10:13 PM
> To: Liang, Cunming
> Cc: David Marchand; dev at dpdk.org; Stephen Hemminger; Zhou, Danny
> Subject: Re: [PATCH v6 4/8] eal/linux: add per rx queue interrupt handling 
> based
> on VFIO
> 
> Hi Cunming,
> 
> First, sorry to have to say that, but it is not easy to read discussions
> where quote marks are not used. I re-insert them for clarity.
> 
> Comments below.
> 
> 2015-02-27 12:22, Liang, Cunming:
> > From: David Marchand [mailto:david.marchand at 6wind.com]
> > Sent: Friday, February 27, 2015 6:34 PM
> >
> > > I am not really comfortable with this api.
> > >
> > > This is just creating something on top of the standard epoll api with
> > > limitations. In the end, we could just use an external lib that does this
> > > already.
> >
> > [Liang, Cunming] Not really, I think. We try to protect the data inside
> > ?rte_intr_handle?, it doesn?t expect user to understand the things defined
> > inside ?rte_intr_handle?.
> > It?s better typedef ?rte_intr_handle? as a raw integer ID, having a function
> > to get it from a ethdev. Then all the interrupt api is around it.
> > It provides the common pci NIC devices rxtx interrupt processing approach.
> > For the limitations, we can fix it step by step.
> >
> > > So ok, this will work for your limited use case, but this will not be
> > > really useful for anything else.
> > > Not sure it has its place in eal, this is more an example to me.
> >
> > [Liang, Cunming] ?limited use case? do you means only for rxtx ?
> > It don?t expect to provide a generic event mechanism (like libev/libevent
> > does), but a simple way to allow PMD work with DMA interrupt. It mainly
> > abstract for rx interrupt purpose. I appreciate if you could help to list
> > more useful cases.
> 
> You don't expect to provide a generic event mechanism but application
> developpers could need to wait for many events at once, not only Rx ones.
> That's why it's better to provide only the needed parts to use something
> generic like libevent.
> And we should avoid reinventing the wheel.
[LCM] Ok, I get you. I have a simple proposal to allow either RX event or other 
events can be handled in rte_intr_wait().
For the input data 'epoll_data', instead of using 'u32', let's keep use 'int 
fd'.
If the most significant bit is 0, event[n] stands for a fd. If it's 1, 
event[0]&0x stands for a vector number.
So during 'rte_intr_set', it get 16bit vector number and encode it as a 32bit 
int with the most significant bit 1.
Then on 'rte_intr_wait', only process the data.fd with the most significant bit 
1. And bypass the user fd.
'rte_intr_wait(struct rte_intr_handle *intr_handle, int epfd, int *event, 
uint16_t num)'.
As user already can assign an epfd, so they can add any normal event fd into 
the epfd. Make sense ?
> 
> > > > +static void
> > > > +eal_intr_process_rxtx_interrupts(struct rte_intr_handle *intr_handle,
> > > > +struct epoll_event *events,
> > > > +uint32_t *vec, int nfds)
> > > > +{
> > > > +   int i, bytes_read;
> > > > +   union rte_intr_read_buffer buf;
> > > > +   int fd;
> > > > +
> > > > +   for (i = 0; i < nfds; i++) {
> > > > +   /* set the length to be read for different handle type 
> > > > */
> > > > +   switch (intr_handle->type) {
> > > > +   case RTE_INTR_HANDLE_UIO:
> > > > +   bytes_read = sizeof(buf.uio_intr_count);
> > > > +   break;
> > > > +   case RTE_INTR_HANDLE_ALARM:
> > > > +   bytes_read = sizeof(buf.timerfd_num);
> > > > +   break;
> > > > +#ifdef VFIO_PRESENT
> > > > +   case RTE_INTR_HANDLE_VFIO_MSIX:
> > > > +   case RTE_INTR_HANDLE_VFIO_MSI:
> > > > +   case RTE_INTR_HANDLE_VFIO_LEGACY:
> > > > +   bytes_read = sizeof(buf.vfio_intr_count);
> > > > +   break;
> > > > +#endif
> > > > +   default:
> > > > +   bytes_read = 1;
> > > > +   break;
> > > > +   }
> > > > +
> > > > +   /**
> > > > +   * read out to clear the ready-to-be-read flag
> > > > +   * for epoll_wait.
> > > > +   */
> > > > +   vec[i] = events[i].data.u32;
> > > > +   assert(vec[i] < VFIO_MAX_RXTX_INTR_ID);
> > > > +
> > > > +   fd = intr_handle->efds[vec[i]];
> > > > +   bytes_read = read(fd, , bytes_read);
> > > > +   if (bytes_read < 0)
> > > > +   RTE_LOG(ERR, EAL, "Error reading from file "
> > > > +   "descriptor %d: %s\n", fd, 
> > > > 

[dpdk-dev] [PATCH v6 4/8] eal/linux: add per rx queue interrupt handling based on VFIO

2015-02-27 Thread Thomas Monjalon
Hi Cunming,

First, sorry to have to say that, but it is not easy to read discussions
where quote marks are not used. I re-insert them for clarity.

Comments below.

2015-02-27 12:22, Liang, Cunming:
> From: David Marchand [mailto:david.marchand at 6wind.com]
> Sent: Friday, February 27, 2015 6:34 PM
> 
> > I am not really comfortable with this api.
> > 
> > This is just creating something on top of the standard epoll api with
> > limitations. In the end, we could just use an external lib that does this
> > already.
> 
> [Liang, Cunming] Not really, I think. We try to protect the data inside
> ?rte_intr_handle?, it doesn?t expect user to understand the things defined
> inside ?rte_intr_handle?.
> It?s better typedef ?rte_intr_handle? as a raw integer ID, having a function
> to get it from a ethdev. Then all the interrupt api is around it.
> It provides the common pci NIC devices rxtx interrupt processing approach.
> For the limitations, we can fix it step by step.
> 
> > So ok, this will work for your limited use case, but this will not be
> > really useful for anything else.
> > Not sure it has its place in eal, this is more an example to me.
> 
> [Liang, Cunming] ?limited use case? do you means only for rxtx ?
> It don?t expect to provide a generic event mechanism (like libev/libevent
> does), but a simple way to allow PMD work with DMA interrupt. It mainly
> abstract for rx interrupt purpose. I appreciate if you could help to list
> more useful cases.

You don't expect to provide a generic event mechanism but application
developpers could need to wait for many events at once, not only Rx ones.
That's why it's better to provide only the needed parts to use something
generic like libevent.
And we should avoid reinventing the wheel.

> > > +static void
> > > +eal_intr_process_rxtx_interrupts(struct rte_intr_handle *intr_handle,
> > > +struct epoll_event *events,
> > > +uint32_t *vec, int nfds)
> > > +{
> > > +   int i, bytes_read;
> > > +   union rte_intr_read_buffer buf;
> > > +   int fd;
> > > +
> > > +   for (i = 0; i < nfds; i++) {
> > > +   /* set the length to be read for different handle type */
> > > +   switch (intr_handle->type) {
> > > +   case RTE_INTR_HANDLE_UIO:
> > > +   bytes_read = sizeof(buf.uio_intr_count);
> > > +   break;
> > > +   case RTE_INTR_HANDLE_ALARM:
> > > +   bytes_read = sizeof(buf.timerfd_num);
> > > +   break;
> > > +#ifdef VFIO_PRESENT
> > > +   case RTE_INTR_HANDLE_VFIO_MSIX:
> > > +   case RTE_INTR_HANDLE_VFIO_MSI:
> > > +   case RTE_INTR_HANDLE_VFIO_LEGACY:
> > > +   bytes_read = sizeof(buf.vfio_intr_count);
> > > +   break;
> > > +#endif
> > > +   default:
> > > +   bytes_read = 1;
> > > +   break;
> > > +   }
> > > +
> > > +   /**
> > > +   * read out to clear the ready-to-be-read flag
> > > +   * for epoll_wait.
> > > +   */
> > > +   vec[i] = events[i].data.u32;
> > > +   assert(vec[i] < VFIO_MAX_RXTX_INTR_ID);
> > > +
> > > +   fd = intr_handle->efds[vec[i]];
> > > +   bytes_read = read(fd, , bytes_read);
> > > +   if (bytes_read < 0)
> > > +   RTE_LOG(ERR, EAL, "Error reading from file "
> > > +   "descriptor %d: %s\n", fd, 
> > > strerror(errno));
> > > +   else if (bytes_read == 0)
> > > +   RTE_LOG(ERR, EAL, "Read nothing from file "
> > > +   "descriptor %d\n", fd);
> > > +   }
> > > +}
> > 
> > Why unconditionnally read ?
> > You are absorbing events from the application if the application gave you
> > an external epfd and populated it with its own fds.
> 
> [Liang, Cunming] The vector number was checked. If an external epfd
> populated some event carry fd rather than a data.u32 but the value
> inside the valid range, it considers as a valid vector number. No matter
> the read success or not, it always notify the event. Do you have any
> suggestion used here to check the condition ?
> 
> > > +static int init_tls_epfd(void)
> > > +{
> > > +   int pfd = epoll_create(1);
> > > +   if (pfd < 0) {
> > > +   RTE_LOG(ERR, EAL,
> > > +   "Cannot create epoll instance\n");
> > > +   return -1;
> > > +   }
> > > +   return pfd;
> > > +}
> > > +
> > > +int
> > > +rte_intr_rx_wait(struct rte_intr_handle *intr_handle, int epfd,
> > > +uint32_t *vec, uint16_t num)
> > > +{
> > 
> > In the end, this "rx" does not mean anything to eal.
> 
> [Liang, Cunming] That?s a good point. I tried to remove ?rx? and use a
> generic word here. 

[dpdk-dev] [PATCH v6 4/8] eal/linux: add per rx queue interrupt handling based on VFIO

2015-02-27 Thread Cunming Liang
This patch does below:
 - Create multiple VFIO eventfd for rx queues.
 - Handle per rx queue interrupt.
 - Eliminate unnecessary suspended DPDK polling thread wakeup mechanism
   for rx interrupt by allowing polling thread epoll_wait rx queue
   interrupt notification.

Signed-off-by: Danny Zhou 
Signed-off-by: Cunming Liang 
---
v6 changes
 - split rte_intr_wait_rx_pkt into two function, wait and set.
 - rewrite rte_intr_rx_wait/rte_intr_rx_set to remove queue visibility on eal.
 - rte_intr_rx_wait to support multiplexing.
 - allow epfd as input to support flexible event fd combination.

v5 changes
 - Rebase the patchset onto the HEAD
 - Isolate ethdev from EAL for new-added wait-for-rx interrupt function
 - Export wait-for-rx interrupt function for shared libraries

v4 changes:
 - Adjust position of new-added structure fields

v3 changes:
 - Fix review comments

v2 changes:
 - Fix compilation issue for a missed header file
 - Bug fix: free unreleased resources on the exception path before return
 - Consolidate coding style related review comments

 lib/librte_eal/linuxapp/eal/eal_interrupts.c| 224 +++-
 lib/librte_eal/linuxapp/eal/eal_pci_vfio.c  |  23 ++-
 lib/librte_eal/linuxapp/eal/rte_eal_version.map |   2 +
 3 files changed, 201 insertions(+), 48 deletions(-)

diff --git a/lib/librte_eal/linuxapp/eal/eal_interrupts.c 
b/lib/librte_eal/linuxapp/eal/eal_interrupts.c
index 8c5b834..f90c2b4 100644
--- a/lib/librte_eal/linuxapp/eal/eal_interrupts.c
+++ b/lib/librte_eal/linuxapp/eal/eal_interrupts.c
@@ -44,6 +44,7 @@
 #include 
 #include 
 #include 
+#include 

 #include 
 #include 
@@ -70,6 +71,8 @@

 #define EAL_INTR_EPOLL_WAIT_FOREVER (-1)

+static RTE_DEFINE_PER_LCORE(int, _epfd) = -1; /**< epoll fd per thread */
+
 /**
  * union for pipe fds.
  */
@@ -127,6 +130,9 @@ static pthread_t intr_thread;
 #ifdef VFIO_PRESENT

 #define IRQ_SET_BUF_LEN  (sizeof(struct vfio_irq_set) + sizeof(int))
+/* irq set buffer length for queue interrupts and LSC interrupt */
+#define MSIX_IRQ_SET_BUF_LEN (sizeof(struct vfio_irq_set) + \
+ sizeof(int) * (VFIO_MAX_RXTX_INTR_ID + 1))

 /* enable legacy (INTx) interrupts */
 static int
@@ -218,10 +224,10 @@ vfio_disable_intx(struct rte_intr_handle *intr_handle) {
return 0;
 }

-/* enable MSI-X interrupts */
+/* enable MSI interrupts */
 static int
 vfio_enable_msi(struct rte_intr_handle *intr_handle) {
-   int len, ret;
+   int len, ret, max_intr;
char irq_set_buf[IRQ_SET_BUF_LEN];
struct vfio_irq_set *irq_set;
int *fd_ptr;
@@ -230,12 +236,19 @@ vfio_enable_msi(struct rte_intr_handle *intr_handle) {

irq_set = (struct vfio_irq_set *) irq_set_buf;
irq_set->argsz = len;
-   irq_set->count = 1;
+   if ((!intr_handle->max_intr) ||
+   (intr_handle->max_intr > VFIO_MAX_RXTX_INTR_ID))
+   max_intr = VFIO_MAX_RXTX_INTR_ID + 1;
+   else
+   max_intr = intr_handle->max_intr;
+
+   irq_set->count = max_intr;
irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | 
VFIO_IRQ_SET_ACTION_TRIGGER;
irq_set->index = VFIO_PCI_MSI_IRQ_INDEX;
irq_set->start = 0;
fd_ptr = (int *) _set->data;
-   *fd_ptr = intr_handle->fd;
+   memcpy(fd_ptr, intr_handle->efds, sizeof(intr_handle->efds));
+   fd_ptr[max_intr - 1] = intr_handle->fd;

ret = ioctl(intr_handle->vfio_dev_fd, VFIO_DEVICE_SET_IRQS, irq_set);

@@ -244,27 +257,10 @@ vfio_enable_msi(struct rte_intr_handle *intr_handle) {
intr_handle->fd);
return -1;
}
-
-   /* manually trigger interrupt to enable it */
-   memset(irq_set, 0, len);
-   len = sizeof(struct vfio_irq_set);
-   irq_set->argsz = len;
-   irq_set->count = 1;
-   irq_set->flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER;
-   irq_set->index = VFIO_PCI_MSI_IRQ_INDEX;
-   irq_set->start = 0;
-
-   ret = ioctl(intr_handle->vfio_dev_fd, VFIO_DEVICE_SET_IRQS, irq_set);
-
-   if (ret) {
-   RTE_LOG(ERR, EAL, "Error triggering MSI interrupts for fd %d\n",
-   intr_handle->fd);
-   return -1;
-   }
return 0;
 }

-/* disable MSI-X interrupts */
+/* disable MSI interrupts */
 static int
 vfio_disable_msi(struct rte_intr_handle *intr_handle) {
struct vfio_irq_set *irq_set;
@@ -292,8 +288,8 @@ vfio_disable_msi(struct rte_intr_handle *intr_handle) {
 /* enable MSI-X interrupts */
 static int
 vfio_enable_msix(struct rte_intr_handle *intr_handle) {
-   int len, ret;
-   char irq_set_buf[IRQ_SET_BUF_LEN];
+   int len, ret, max_intr;
+   char irq_set_buf[MSIX_IRQ_SET_BUF_LEN];
struct vfio_irq_set *irq_set;
int *fd_ptr;

@@ -301,12 +297,19 @@ vfio_enable_msix(struct rte_intr_handle *intr_handle) {

irq_set = (struct vfio_irq_set *) irq_set_buf;
   

[dpdk-dev] [PATCH v6 4/8] eal/linux: add per rx queue interrupt handling based on VFIO

2015-02-27 Thread David Marchand
I am not really comfortable with this api.

This is just creating something on top of the standard epoll api with
limitations.
In the end, we could just use an external lib that does this already.

So ok, this will work for your limited use case, but this will not be
really useful for anything else.
Not sure it has its place in eal, this is more an example to me.


On Fri, Feb 27, 2015 at 5:56 AM, Cunming Liang 
wrote:

> This patch does below:
>  - Create multiple VFIO eventfd for rx queues.
>  - Handle per rx queue interrupt.
>  - Eliminate unnecessary suspended DPDK polling thread wakeup mechanism
>for rx interrupt by allowing polling thread epoll_wait rx queue
>interrupt notification.
>
> Signed-off-by: Danny Zhou 
> Signed-off-by: Cunming Liang 
> ---
> v6 changes
>  - split rte_intr_wait_rx_pkt into two function, wait and set.
>  - rewrite rte_intr_rx_wait/rte_intr_rx_set to remove queue visibility on
> eal.
>  - rte_intr_rx_wait to support multiplexing.
>  - allow epfd as input to support flexible event fd combination.
>
>
>  lib/librte_eal/linuxapp/eal/eal_interrupts.c| 224
> +++-
>  lib/librte_eal/linuxapp/eal/eal_pci_vfio.c  |  23 ++-
>  lib/librte_eal/linuxapp/eal/rte_eal_version.map |   2 +
>  3 files changed, 201 insertions(+), 48 deletions(-)
>
> diff --git a/lib/librte_eal/linuxapp/eal/eal_interrupts.c
> b/lib/librte_eal/linuxapp/eal/eal_interrupts.c
> index 8c5b834..f90c2b4 100644
> --- a/lib/librte_eal/linuxapp/eal/eal_interrupts.c
> +++ b/lib/librte_eal/linuxapp/eal/eal_interrupts.c
>
>
[snip]


>
> +static void
> +eal_intr_process_rxtx_interrupts(struct rte_intr_handle *intr_handle,
> +struct epoll_event *events,
> +uint32_t *vec, int nfds)
> +{
> +   int i, bytes_read;
> +   union rte_intr_read_buffer buf;
> +   int fd;
> +
> +   for (i = 0; i < nfds; i++) {
> +   /* set the length to be read for different handle type */
> +   switch (intr_handle->type) {
> +   case RTE_INTR_HANDLE_UIO:
> +   bytes_read = sizeof(buf.uio_intr_count);
> +   break;
> +   case RTE_INTR_HANDLE_ALARM:
> +   bytes_read = sizeof(buf.timerfd_num);
> +   break;
> +#ifdef VFIO_PRESENT
> +   case RTE_INTR_HANDLE_VFIO_MSIX:
> +   case RTE_INTR_HANDLE_VFIO_MSI:
> +   case RTE_INTR_HANDLE_VFIO_LEGACY:
> +   bytes_read = sizeof(buf.vfio_intr_count);
> +   break;
> +#endif
> +   default:
> +   bytes_read = 1;
> +   break;
> +   }
> +
> +   /**
> +   * read out to clear the ready-to-be-read flag
> +   * for epoll_wait.
> +   */
> +   vec[i] = events[i].data.u32;
> +   assert(vec[i] < VFIO_MAX_RXTX_INTR_ID);
> +
> +   fd = intr_handle->efds[vec[i]];
> +   bytes_read = read(fd, , bytes_read);
> +   if (bytes_read < 0)
> +   RTE_LOG(ERR, EAL, "Error reading from file "
> +   "descriptor %d: %s\n", fd,
> strerror(errno));
> +   else if (bytes_read == 0)
> +   RTE_LOG(ERR, EAL, "Read nothing from file "
> +   "descriptor %d\n", fd);
> +   }
> +}
>

Why unconditionnally read ?
You are absorbing events from the application if the application gave you
an external epfd and populated it with its own fds.


> +
> +static int init_tls_epfd(void)
> +{
> +   int pfd = epoll_create(1);
> +   if (pfd < 0) {
> +   RTE_LOG(ERR, EAL,
> +   "Cannot create epoll instance\n");
> +   return -1;
> +   }
> +   return pfd;
> +}
> +
> +int
> +rte_intr_rx_wait(struct rte_intr_handle *intr_handle, int epfd,
> +uint32_t *vec, uint16_t num)
> +{
>

In the end, this "rx" does not mean anything to eal.


+#define MAX_EVENTS  8
> +   struct epoll_event events[MAX_EVENTS];
> +   int ret, nfds = 0;
> +
> +   if (!intr_handle || !vec) {
> +   RTE_LOG(ERR, EAL, "invalid input parameter\n");
> +   return -1;
> +   }
> +
> +   if (intr_handle->type != RTE_INTR_HANDLE_VFIO_MSIX) {
> +   RTE_LOG(ERR, EAL, "intr type should be VFIO_MSIX\n");
> +   return -1;
> +   }
> +
> +   if (epfd == RTE_EPOLL_FD_ANY) {
> +   /* using per thread epoll fd */
> +   if (unlikely(RTE_PER_LCORE(_epfd) == -1))
> +   RTE_PER_LCORE(_epfd) = init_tls_epfd();
> +   epfd = RTE_PER_LCORE(_epfd);
> +   }
>

Rather than testing every time, this should be set by the caller, i.e. epfd
is always valid.
If application does not want to create a epfd, then it calls
 rte_intr_rx_wait with