[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-20 Thread Wang, Zhihong


> -Original Message-
> From: Neil Horman [mailto:nhorman at tuxdriver.com]
> Sent: Monday, January 19, 2015 9:02 PM
> To: Wang, Zhihong
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> On Mon, Jan 19, 2015 at 09:53:30AM +0800, zhihong.wang at intel.com wrote:
> > This patch set optimizes memcpy for DPDK for both SSE and AVX platforms.
> > It also extends memcpy test coverage with unaligned cases and more test
> points.
> >
> > Optimization techniques are summarized below:
> >
> > 1. Utilize full cache bandwidth
> >
> > 2. Enforce aligned stores
> >
> > 3. Apply load address alignment based on architecture features
> >
> > 4. Make load/store address available as early as possible
> >
> > 5. General optimization techniques like inlining, branch reducing,
> > prefetch pattern access
> >
> > Zhihong Wang (4):
> >   Disabled VTA for memcpy test in app/test/Makefile
> >   Removed unnecessary test cases in test_memcpy.c
> >   Extended test coverage in test_memcpy_perf.c
> >   Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX
> > platforms
> >
> >  app/test/Makefile  |   6 +
> >  app/test/test_memcpy.c |  52 +-
> >  app/test/test_memcpy_perf.c| 238 +---
> >  .../common/include/arch/x86/rte_memcpy.h   | 664
> +++--
> >  4 files changed, 656 insertions(+), 304 deletions(-)
> >
> > --
> > 1.9.3
> >
> >
> Are you able to compile this with gcc 4.9.2?  The compilation of
> test_memcpy_perf is taking forever for me.  It appears hung.
> Neil


Neil,

Thanks for reporting this!
It should compile, but it will take quite some time if the CPU doesn't support 
AVX2. The reasons are:
1. The SSE & AVX memcpy implementation is more complicated than the AVX2 
version, so the compiler takes more time to compile and optimize it
2. The new test_memcpy_perf.c contains 126 constant memcpy calls for better 
test case coverage, which is quite a lot

I've just tested this patch on an Ivy Bridge machine with GCC 4.9.2:
1. The whole compile process takes 9'41" with the original test_memcpy_perf.c 
(63 + 63 = 126 constant memcpy calls)
2. It takes only 2'41" after I reduce the number of constant memcpy calls to 
12 + 12 = 24

I'll reduce the number of memcpy calls in the next version of the patch.
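
Roughly, the pattern that drives the compile time looks like this (an 
illustrative sketch only, not the actual test code):

#include <stdint.h>
#include <rte_memcpy.h>

/* Illustrative sketch, NOT the actual test_memcpy_perf.c code: every
 * constant size below is a separate call site that the inlined
 * rte_memcpy gets expanded and optimized for, which is where the
 * compile time goes. */
#define TEST_CONSTANT_COPY(dst, src, sz) rte_memcpy((dst), (src), (sz))

static void
run_constant_copies(uint8_t *dst, const uint8_t *src)
{
    TEST_CONSTANT_COPY(dst, src, 6);
    TEST_CONSTANT_COPY(dst, src, 64);
    TEST_CONSTANT_COPY(dst, src, 128);
    TEST_CONSTANT_COPY(dst, src, 1024);
    /* ... one invocation per tested size, 126 in total ... */
}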

Zhihong (John)


[dpdk-dev] [PATCH] Unlink existing unused sockets at start up

2015-12-21 Thread Wang, Zhihong


> -Original Message-
> From: Ilya Maximets [mailto:i.maximets at samsung.com]
> Sent: Friday, December 18, 2015 2:18 PM
> To: Wang, Zhihong ; dev at dpdk.org
> Cc: p.fedin at samsung.com; yuanhan.liu at linux.intel.com; s.dyasly at 
> samsung.com;
> Xie, Huawei 
> Subject: Re: [PATCH] Unlink existing unused sockets at start up
> 
> On 18.12.2015 05:39, Wang, Zhihong wrote:
> 
> > Yes ideally the underneath lib shouldn't meddle with the recovery logic.
> > But I do think we should at least put a warning in the lib function
> > said the app should make the path available. This is another topic though 
> > :-)
> Like we did in memcpy:
> > /**
> >  * Copy 16 bytes from one location to another,
> >  * locations should not overlap.
> >  */
> >
> 
> Isn't it enough to have an error in the log?

Function comments and function code are different things and are both necessary.
Also, why wait until an error occurs when a comment can warn the developer up front?

> 
> lib/librte_vhost/vhost_user/vhost-net-user.c:130:
> RTE_LOG(ERR, VHOST_CONFIG, "fail to bind fd:%d, remove file:%s and try
> again.\n",
> 
> Best regards, Ilya Maximets.


[dpdk-dev] [PATCH 1/3] app/test-pmd: Handle SIGINT and SIGTERM in testpmd

2015-12-24 Thread Wang, Zhihong
> > +/* When we receive a INT signal, close all ports */ static void
> > +sigint_handler(__rte_unused int signum) {
> > +   unsigned portid;
> > +
> > +   printf("Preparing to exit...\n");
> 
> Better to notice user "Signal xxx received, reparing to exit... "

Can do that.

> 
> > +   FOREACH_PORT(portid, ports) {
> > +   if (port_id_is_invalid(portid, ENABLED_WARN))
> > +   continue;
> > +   printf("Stopping port %d...", portid);
> > +   rte_eth_dev_stop(portid);
> > +   rte_eth_dev_close(portid);
> > +   printf(" Done\n");
> > +   }
> > +   printf("Bye...\n");
> 
> Here why don't call pmd_test_exit()? Any issue with that func?

Yes should just call this one :)

> 
> Thanks,
> Michael
> > +   exit(0);
> > +}
> > +
> >  int
> >  main(int argc, char** argv)
> >  {
> > int  diag;
> > uint8_t port_id;
> >
> > +   signal(SIGINT, sigint_handler);
> > +   signal(SIGTERM, sigint_handler);
> > +
> > diag = rte_eal_init(argc, argv);
> > if (diag < 0)
> > rte_panic("Cannot init EAL\n");



[dpdk-dev] [PATCH 3/3] examples/l3fwd: Handle SIGINT and SIGTERM in l3fwd

2015-12-25 Thread Wang, Zhihong
> > +/* When we receive a INT signal, close all ports */ static void
> > +sigint_handler(__rte_unused int signum) {
> > +   unsigned portid, nb_ports;
> > +
> > +   printf("Preparing to exit...\n");
> > +   nb_ports = rte_eth_dev_count();
> > +   for (portid = 0; portid < nb_ports; portid++) {
> > +   if ((enabled_port_mask & (1 << portid)) == 0) {
> > +   continue;
> > +   }
> > +   printf("Stopping port %d...", portid);
> > +   rte_eth_dev_stop(portid);
> > +   rte_eth_dev_close(portid);
> 
> Hmm, so your interrupt thread invokes dev_stop, while IO lcores keep calling
> rx_burst/tx_burst?
> For graceful shutdown on SIGINT, I suppose you first have to stop your IO 
> lcores
> first.
> Let say have a global var: 'stop' that every lcore has to check from time to 
> time (or
> something similar).

Thanks for the advice! This works once the program enters the forwarding phase.
We have to handle it differently if it's still in the initialization phase, 
which can take quite some time.

/Zhihong

> Konstantin
> 
> > +   printf(" Done\n");
> > +   }
> > +   printf("Bye...\n");
> > +   exit(0);
> > +}
> > +
> >  int
> >  main(int argc, char **argv)
> >  {
> > @@ -2572,6 +2594,9 @@ main(int argc, char **argv)
> > uint32_t n_tx_queue, nb_lcores;
> > uint8_t portid, nb_rx_queue, queue, socketid;
> >
> > +   signal(SIGINT, sigint_handler);
> > +   signal(SIGTERM, sigint_handler);
> > +
> > /* init EAL */
> > ret = rte_eal_init(argc, argv);
> > if (ret < 0)
> > --
> > 2.5.0



[dpdk-dev] [PATCH 3/3] examples/l3fwd: Handle SIGINT and SIGTERM in l3fwd

2015-12-25 Thread Wang, Zhihong
> On Wed, 23 Dec 2015 15:03:15 -0500
> Zhihong Wang  wrote:
> 
> > +/* When we receive a INT signal, close all ports */ static void
> > +sigint_handler(__rte_unused int signum) {
> > +   unsigned portid, nb_ports;
> > +
> > +   printf("Preparing to exit...\n");
> > +   nb_ports = rte_eth_dev_count();
> > +   for (portid = 0; portid < nb_ports; portid++) {
> > +   if ((enabled_port_mask & (1 << portid)) == 0) {
> > +   continue;
> > +   }
> > +   printf("Stopping port %d...", portid);
> > +   rte_eth_dev_stop(portid);
> > +   rte_eth_dev_close(portid);
> > +   printf(" Done\n");
> > +   }
> > +   printf("Bye...\n");
> > +   exit(0);
> > +}
> 
> Signal handlers should only set a flag, which is then checked by thread loops.
> Calling functions in DPDK from signal handlers is not safe.

I'll make changes in v2 to address this issue. Thanks for pointing out :)
In some cases the signal handler has to do the exit itself though, e.g. when the 
program is still doing memory initialization, which can take some time.
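
Roughly, the flag-based approach would look like this (names are illustrative, 
not the final v2 code):

#include <signal.h>
#include <stdbool.h>

/* Set by the signal handler, polled by the lcore main loops. */
static volatile bool force_quit;

static void
signal_handler(int signum)
{
    if (signum == SIGINT || signum == SIGTERM)
        force_quit = true;    /* no DPDK calls inside the handler */
}

/* Each IO lcore checks the flag between bursts and exits cleanly. */
static void
lcore_main_loop(void)
{
    while (!force_quit) {
        /* rx_burst / tx_burst work goes here */
    }
}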


[dpdk-dev] [PATCH v2 2/3] examples/l2fwd: Handle SIGINT and SIGTERM in l2fwd

2015-12-28 Thread Wang, Zhihong
Hi Stephen,

Really appreciate the detailed review!
Please see comments below.


> > +static int force_quit = -1;
> > +static int signo_quit = -1;
> 
> These need to be volatile otherwise you risk compiler optimizing away your
> checks.

Yes. Don't wanna take chances here.

> 
> Also, don't use -1/0 just use 0/1 for boolean or better yet the definition in
>  of bool and true/false.
> That way the code can read much nicer.

-1 means forwarding hasn't started yet.
I can add a "static bool fwd_started;" to represent this and make it clearer.

> 
> >  #define RTE_LOGTYPE_L2FWD RTE_LOGTYPE_USER1
> >
> >  #define NB_MBUF   8192
> > @@ -284,6 +289,8 @@ l2fwd_main_loop(void)
> > }
> >
> > while (1) {
> > +   if (unlikely(force_quit != 0))
> > +   break;
> 
> Please maske this a proper while loop instead.

Exactly.

> 
> while (!force_quit) {
> 
> >
> > cur_tsc = rte_rdtsc();
> >
> > @@ -534,6 +541,45 @@ check_all_ports_link_status(uint8_t port_num,
> uint32_t port_mask)
> > }
> >  }
> >
> > +static void
> > +stop_ports(void)
> > +{
> > +   unsigned portid, nb_ports;
> > +
> > +   nb_ports = rte_eth_dev_count();
> > +   for (portid = 0; portid < nb_ports; portid++) {
> > +   if ((l2fwd_enabled_port_mask & (1 << portid)) == 0) {
> > +   continue;
> > +   }
> 
> No need for {} here.
> 
> > +   printf("Stopping port %d...", portid);
> > +   rte_eth_dev_stop(portid);
> > +   rte_eth_dev_close(portid);
> > +   printf(" Done\n");
> > +   }
> > +}
> > +
> > +static void
> > +signal_handler(__rte_unused int signum) {
> > +   if (signum == SIGINT || signum == SIGTERM) {
> 
> signum is used, dont give __rte_unused attribute.
> 
> >
> > /* launch per-lcore init on every lcore */
> > +   force_quit = 0;
> 
> What is gained by having tri-value here. Just initialize it as false.

As stated above.

> 
> 
> > rte_eal_mp_remote_launch(l2fwd_launch_one_lcore, NULL,
> CALL_MASTER);
> > RTE_LCORE_FOREACH_SLAVE(lcore_id) {
> > if (rte_eal_wait_lcore(lcore_id) < 0)
> > return -1;
> > }
> >
> > +   printf("Stopping forwarding... Done\n");
> > +   /* stop ports */
> > +   stop_ports();
> > +   printf("Bye...\n");
> > +   /* inform if there's a caller */
> > +   if (force_quit != 0) {
> > +   signal(signo_quit, SIG_DFL);
> > +   kill(getpid(), signo_quit);
> 
> The kill should not be needed.

The purpose is to make the program exit with the proper killed-by-signal status.
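
For clarity, the re-raise idiom in isolation (a small sketch, standard C/POSIX, 
not the exact patch code):

#include <signal.h>
#include <unistd.h>

/* Restoring the default disposition and re-raising the signal makes the
 * process terminate "by signal", so a parent's waitpid() reports
 * WIFSIGNALED() instead of a normal exit(0). */
static void
exit_with_signal_status(int signum)
{
    signal(signum, SIG_DFL);    /* restore the default handler */
    kill(getpid(), signum);     /* deliver the signal to ourselves */
}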

> 
> It would be good if examples cleaned up allocations, that way they could be 
> used
> with valgrind for validation of drivers, etc.



[dpdk-dev] [PATCH v2 1/3] app/test-pmd: Handle SIGINT and SIGTERM in testpmd

2015-12-28 Thread Wang, Zhihong
> > -   cl = cmdline_stdin_new(main_ctx, "testpmd> ");
> > -   if (cl == NULL) {
> > +   testpmd_cl = cmdline_stdin_new(main_ctx, "testpmd> ");
> > +   if (testpmd_cl == NULL) {
> > return;
> > }
> 
> Style nit: don't need {} around single statement.
> 
> > +static void
> > +sigint_handler(__rte_unused int signum) {
> > +   if (signum == SIGINT || signum == SIGTERM) {
> 
> signmum is used, so don't want __rte_unused
> 

Thanks :) Will fix these in the next version.



[dpdk-dev] [PATCH v2 0/3] Handle SIGINT and SIGTERM in DPDK examples

2015-12-28 Thread Wang, Zhihong


> -Original Message-
> From: Qiu, Michael
> Sent: Monday, December 28, 2015 12:18 PM
> To: Wang, Zhihong ; dev at dpdk.org
> Cc: Ananyev, Konstantin ;
> stephen at networkplumber.org
> Subject: Re: [PATCH v2 0/3] Handle SIGINT and SIGTERM in DPDK examples
> 
> On 2015/12/25 17:40, Wang, Zhihong wrote:
> > This patch handles SIGINT and SIGTERM in testpmd, l2fwd, and l3fwd, make
> sure all ports are properly stopped and closed.
> > For virtual ports, the stop and close function may deal with resource 
> > cleanup,
> such as socket files unlinking.
> >
> > --
> > Changes in v2:
> >
> > 1. Make sure graceful exit for all running phases
> >
> > 2. Make sure program exits with the right status
> >
> > Zhihong Wang (3):
> >   app/test-pmd: Handle SIGINT and SIGTERM in testpmd
> >   examples/l2fwd: Handle SIGINT and SIGTERM in l2fwd
> >   examples/l3fwd: Handle SIGINT and SIGTERM in l3fwd
> >
> >  app/test-pmd/cmdline.c |  19 ++---
> >  app/test-pmd/testpmd.c |  38 ++---
> >  app/test-pmd/testpmd.h |   1 +
> >  examples/l2fwd/main.c  |  60 +++
> >  examples/l3fwd/main.c  | 110
> -
> >  5 files changed, 196 insertions(+), 32 deletions(-)
> >
> 
> Next time, you'd better not to top post for V2 :)

Gotcha :)

> 
> Acked-by: Michael Qiu 


[dpdk-dev] [PATCH v3 3/3] examples/l3fwd: Handle SIGINT and SIGTERM in l3fwd

2015-12-30 Thread Wang, Zhihong
> > +static uint8_t
> > +start_ports(void)
> > +{
> > +   unsigned portid, nb_ports, avail_ports;
> > +   int ret;
> > +
> > +   nb_ports = rte_eth_dev_count();
> > +   avail_ports = 0;
> > +   for (portid = 0; portid < nb_ports; portid++) {
> > +   if ((enabled_port_mask & (1 << portid)) == 0)
> > +   continue;
> > +   avail_ports++;
> > +   port_started = true;
> 
> Why do you need it at each iteration?

It only becomes true when the first enabled port is about to be started, in 
case there's no port enabled at all.
In my opinion there's no need to optimize this since it's not performance 
sensitive and the logic is correct :)


> 
> > +   printf("Starting port %d...", portid);
> > +   ret = rte_eth_dev_start(portid);
> > +   if (ret < 0)
> > +   rte_exit(EXIT_FAILURE,
> > +   "rte_eth_dev_start: err=%d, port=%d\n",
> > +   ret, portid);
> > +   /*
> > +* If enabled, put device in promiscuous mode.
> > +* This allows IO forwarding mode to forward packets
> > +* to itself through 2 cross-connected  ports of the
> > +* target machine.
> > +*/
> > +   if (promiscuous_on)
> > +   rte_eth_promiscuous_enable(portid);
> > +   printf(" Done\n");
> > +   }
> > +
> > +   return avail_ports;
> > +}

[...]

> > +static void
> > +signal_handler(int signum)
> > +{
> > +   if (signum == SIGINT || signum == SIGTERM) {
> > +   printf("\nSignal %d received, preparing to exit...\n",
> > +   signum);
> > +   if (port_started) {
> > +   printf("Ports started already...\n");
> > +   signo_quit = signum;
> > +   force_quit = true;
> > +   } else {
> 
> 
> Hmm, and what if signal_handler() would be executed not in the context of
> master lcore?
> Then there could be a raise condition, and you could end up here, while master
> lcore would be in the middle of start_ports()->rte_eth_dev_start().

Good point! Then we need rte_atomic16_cmpset() to avoid the race condition.
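
A rough sketch of what I have in mind (state names and layout are just 
illustrative, not the eventual patch):

#include <stdint.h>
#include <signal.h>
#include <unistd.h>
#include <rte_atomic.h>

/* Illustrative state values and names. */
#define STATE_IDLE    0
#define STATE_STARTED 1
#define STATE_QUIT    2

static volatile uint16_t port_state = STATE_IDLE;

/* The master lcore would claim IDLE -> STARTED with the same cmpset
 * right before calling rte_eth_dev_start(), so the handler and the
 * start path cannot both win the transition. */
static void
signal_handler(int signum)
{
    /* Ports started: just request a quit and let the lcores clean up. */
    if (rte_atomic16_cmpset(&port_state, STATE_STARTED, STATE_QUIT) == 0) {
        /* Ports not started: exit right away with the signal status. */
        signal(signum, SIG_DFL);
        kill(getpid(), signum);
    }
}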


> Probably not a big deal, but why do you need this  if (port_started) {...} 
> else {...}
> at all?
> Why not just:

If no port has been started, then the handler just kills the process itself.
This is for cases where you've just started the program and then want to shut 
it down: otherwise it would wait a long time for initialization (memory, etc.) 
to finish before the force_quit flag takes effect.


> 
> signal_handler(int signum)
> {
>   signo_quit = signum;
>   force_quit = true;
> }
> ?
> 
> Konstantin
> 
> > +   printf("Ports not started yet...\n");
> > +   printf("Bye...\n");
> > +   /* exit with the expected status */
> > +   signal(signum, SIG_DFL);
> > +   kill(getpid(), signum);
> > +   }
> > +   }
> > +}
> > +



[dpdk-dev] [PATCH v4 3/3] examples/l3fwd: Handle SIGINT and SIGTERM in l3fwd

2015-12-31 Thread Wang, Zhihong
> > +#define PORT_IDLE 0
> > +#define PORT_INIT 1
> > +#define PORT_WORK 2
> > +#define PORT_STOP 3
> > +#define PORT_QUIT 4
> 
> Seems ok, but over-complicated.
> I think all you need is just IDLE, INIT, QUIT.

Yes, for l2/l3fwd 3 states are enough.
I implemented a full state machine so it can also serve as an example of how to 
do this in other cases, e.g. where stop might be called before or during init.

> Konstantin




[dpdk-dev] [PATCH v3 3/3] examples/l3fwd: Handle SIGINT and SIGTERM in l3fwd

2015-12-31 Thread Wang, Zhihong


> -Original Message-
> From: Ananyev, Konstantin
> Sent: Wednesday, December 30, 2015 7:30 PM
> To: Wang, Zhihong ; dev at dpdk.org
> Cc: stephen at networkplumber.org; Qiu, Michael 
> Subject: RE: [PATCH v3 3/3] examples/l3fwd: Handle SIGINT and SIGTERM in
> l3fwd
> 
> 
> 
> > -Original Message-
> > From: Wang, Zhihong
> > Sent: Wednesday, December 30, 2015 3:15 AM
> > To: Ananyev, Konstantin; dev at dpdk.org
> > Cc: stephen at networkplumber.org; Qiu, Michael
> > Subject: RE: [PATCH v3 3/3] examples/l3fwd: Handle SIGINT and SIGTERM
> > in l3fwd
> >
> > > > +static uint8_t
> > > > +start_ports(void)
> > > > +{
> > > > +   unsigned portid, nb_ports, avail_ports;
> > > > +   int ret;
> > > > +
> > > > +   nb_ports = rte_eth_dev_count();
> > > > +   avail_ports = 0;
> > > > +   for (portid = 0; portid < nb_ports; portid++) {
> > > > +   if ((enabled_port_mask & (1 << portid)) == 0)
> > > > +   continue;
> > > > +   avail_ports++;
> > > > +   port_started = true;
> > >
> > > Why do you need it at each iteration?
> >
> > Only become true when the first enabled port about to started. In case 
> > there's
> no port enabled at all.
> > In my opinion no need to optimize since it's not performance sensitive
> > and the logic is correct :)
> >
> >
> > >
> > > > +   printf("Starting port %d...", portid);
> > > > +   ret = rte_eth_dev_start(portid);
> > > > +   if (ret < 0)
> > > > +   rte_exit(EXIT_FAILURE,
> > > > +   "rte_eth_dev_start: err=%d, 
> > > > port=%d\n",
> > > > +   ret, portid);
> > > > +   /*
> > > > +* If enabled, put device in promiscuous mode.
> > > > +* This allows IO forwarding mode to forward packets
> > > > +* to itself through 2 cross-connected  ports of the
> > > > +* target machine.
> > > > +*/
> > > > +   if (promiscuous_on)
> > > > +   rte_eth_promiscuous_enable(portid);
> > > > +   printf(" Done\n");
> > > > +   }
> > > > +
> > > > +   return avail_ports;
> > > > +}
> >
> > [...]
> >
> > > > +static void
> > > > +signal_handler(int signum)
> > > > +{
> > > > +   if (signum == SIGINT || signum == SIGTERM) {
> > > > +   printf("\nSignal %d received, preparing to exit...\n",
> > > > +   signum);
> > > > +   if (port_started) {
> > > > +   printf("Ports started already...\n");
> > > > +   signo_quit = signum;
> > > > +   force_quit = true;
> > > > +   } else {
> > >
> > >
> > > Hmm, and what if signal_handler() would be executed not in the
> > > context of master lcore?
> > > Then there could be a raise condition, and you could end up here,
> > > while master lcore would be in the middle of
> start_ports()->rte_eth_dev_start().
> >
> > Good point! Then we need rte_atomic16_cmpset() to avoid the race condition.
> >
> >
> > > Probably not a big deal, but why do you need this  if (port_started)
> > > {...} else {...} at all?
> > > Why not just:
> >
> > If no port has been started, then just kill itself.
> > This is for cases like when you just started it and then want to shut
> > it down, it'll wait a long time for initialization (memory, etc.) before the
> force_quit signal take effect.
> 
> Do you mean rte_eal_init()?
> Then why not to install non-default signal handlers after rte_eal_init()?
> Konstantin

Yes, that does sound better :)
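
For reference, the ordering would roughly be (sketch only, assuming the simple 
flag-setting handler discussed earlier in the thread):

#include <signal.h>
#include <stdbool.h>
#include <rte_eal.h>
#include <rte_debug.h>

static volatile bool force_quit;

static void
signal_handler(int signum)
{
    if (signum == SIGINT || signum == SIGTERM)
        force_quit = true;
}

int
main(int argc, char **argv)
{
    int ret;

    /* Keep the default handlers during the (possibly long) EAL
     * initialization; a SIGINT here simply terminates the process. */
    ret = rte_eal_init(argc, argv);
    if (ret < 0)
        rte_panic("Cannot init EAL\n");

    /* Only now install the graceful-shutdown handlers. */
    signal(SIGINT, signal_handler);
    signal(SIGTERM, signal_handler);

    /* ... port setup, lcore launch, cleanup once force_quit is set ... */
    return 0;
}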



> 
> >
> >
> > >
> > > signal_handler(int signum)
> > > {
> > >   signo_quit = signum;
> > >   force_quit = true;
> > > }
> > > ?
> > >
> > > Konstantin
> > >
> > > > +   printf("Ports not started yet...\n");
> > > > +   printf("Bye...\n");
> > > > +   /* exit with the expected status */
> > > > +   signal(signum, SIG_DFL);
> > > > +   kill(getpid(), signum);
> > > > +   }
> > > > +   }
> > > > +}
> > > > +



[dpdk-dev] [PATCH v4 3/3] examples/l3fwd: Handle SIGINT and SIGTERM in l3fwd

2015-12-31 Thread Wang, Zhihong


> -Original Message-
> From: Stephen Hemminger [mailto:stephen at networkplumber.org]
> Sent: Thursday, December 31, 2015 10:09 AM
> To: Wang, Zhihong 
> Cc: Ananyev, Konstantin ; dev at dpdk.org; 
> Qiu,
> Michael 
> Subject: Re: [PATCH v4 3/3] examples/l3fwd: Handle SIGINT and SIGTERM in
> l3fwd
> 
> On Thu, 31 Dec 2015 01:44:20 +
> "Wang, Zhihong"  wrote:
> 
> > > > +#define PORT_IDLE 0
> > > > +#define PORT_INIT 1
> > > > +#define PORT_WORK 2
> > > > +#define PORT_STOP 3
> > > > +#define PORT_QUIT 4
> > >
> > > Seems ok, but over-complicated.
> > > I think all you need is just IDLE, INIT, QUIT.
> >
> > Yes for l2/l3fwd 3 states are enough.
> > I implement a full state machine so it can also serve as an example on how 
> > to
> do this in other cases, like where stop might be called before or during init.
> 
> These are examples, it is better to have as little code as necessary to get 
> the job
> done. That makes the example clearer.  Adding extra unnecessary complexity
> just makes it harder to understand.


Thanks for the suggestions!
I'll send the v5 combining your comments and Konstantin's together to make it 
simpler.


[dpdk-dev] [PATCH] A fix to work around strict-aliasing rules breaking

2015-03-04 Thread Wang, Zhihong


> -Original Message-
> From: Richardson, Bruce
> Sent: Monday, March 02, 2015 6:32 PM
> To: Wang, Zhihong
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] A fix to work around strict-aliasing rules
> breaking
> 
> On Mon, Mar 02, 2015 at 05:03:50PM +0800, zhihong.wang at intel.com wrote:
> > Fixed strict-aliasing rules breaking errors for some GCC version.
> >
> 
> This looks messy. Also, I believe the definition of memcpy should include the
> "restrict" keyword to indicate that source and dest can't overlap. Might that
> help fix the issue?

It's actually caused by casting void * to multiple other pointer types.

> 
> /Bruce
> 
> > Signed-off-by: Zhihong Wang 
> > ---
> >  .../common/include/arch/x86/rte_memcpy.h   | 44 
> --
> >  1 file changed, 24 insertions(+), 20 deletions(-)
> >
> > diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> > b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> > index 69a5c6f..f412099 100644
> > --- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> > +++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> > @@ -195,6 +195,8 @@ rte_mov256blocks(uint8_t *dst, const uint8_t *src,
> > size_t n)  static inline void *  rte_memcpy(void *dst, const void
> > *src, size_t n)  {
> > +   uintptr_t dstu = (uintptr_t)dst;
> > +   uintptr_t srcu = (uintptr_t)src;
> > void *ret = dst;
> > int dstofss;
> > int bits;
> > @@ -204,22 +206,22 @@ rte_memcpy(void *dst, const void *src, size_t n)
> >  */
> > if (n < 16) {
> > if (n & 0x01) {
> > -   *(uint8_t *)dst = *(const uint8_t *)src;
> > -   src = (const uint8_t *)src + 1;
> > -   dst = (uint8_t *)dst + 1;
> > +   *(uint8_t *)dstu = *(const uint8_t *)srcu;
> > +   srcu = (uintptr_t)((const uint8_t *)srcu + 1);
> > +   dstu = (uintptr_t)((uint8_t *)dstu + 1);
> > }
> > if (n & 0x02) {
> > -   *(uint16_t *)dst = *(const uint16_t *)src;
> > -   src = (const uint16_t *)src + 1;
> > -   dst = (uint16_t *)dst + 1;
> > +   *(uint16_t *)dstu = *(const uint16_t *)srcu;
> > +   srcu = (uintptr_t)((const uint16_t *)srcu + 1);
> > +   dstu = (uintptr_t)((uint16_t *)dstu + 1);
> > }
> > if (n & 0x04) {
> > -   *(uint32_t *)dst = *(const uint32_t *)src;
> > -   src = (const uint32_t *)src + 1;
> > -   dst = (uint32_t *)dst + 1;
> > +   *(uint32_t *)dstu = *(const uint32_t *)srcu;
> > +   srcu = (uintptr_t)((const uint32_t *)srcu + 1);
> > +   dstu = (uintptr_t)((uint32_t *)dstu + 1);
> > }
> > if (n & 0x08) {
> > -   *(uint64_t *)dst = *(const uint64_t *)src;
> > +   *(uint64_t *)dstu = *(const uint64_t *)srcu;
> > }
> > return ret;
> > }
> > @@ -458,6 +460,8 @@ static inline void *  rte_memcpy(void *dst, const
> > void *src, size_t n)  {
> > __m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7,
> xmm8;
> > +   uintptr_t dstu = (uintptr_t)dst;
> > +   uintptr_t srcu = (uintptr_t)src;
> > void *ret = dst;
> > int dstofss;
> > int srcofs;
> > @@ -467,22 +471,22 @@ rte_memcpy(void *dst, const void *src, size_t n)
> >  */
> > if (n < 16) {
> > if (n & 0x01) {
> > -   *(uint8_t *)dst = *(const uint8_t *)src;
> > -   src = (const uint8_t *)src + 1;
> > -   dst = (uint8_t *)dst + 1;
> > +   *(uint8_t *)dstu = *(const uint8_t *)srcu;
> > +   srcu = (uintptr_t)((const uint8_t *)srcu + 1);
> > +   dstu = (uintptr_t)((uint8_t *)dstu + 1);
> > }
> > if (n & 0x02) {
> > -   *(uint16_t *)dst = *(const uint16_t *)src;
> > -   src = (const uint16_t *)src + 1;
> > -   dst = (uint16_t *)dst + 1;
> > +   *(uint16_t *)dstu = *(const uint16_t *)srcu;
> > +   srcu = (uintptr_t)((const uint16_t *)srcu + 1);
> > +   dstu = (uintptr_t)((uint16_t *)dstu + 1);
> > }
> > if (n & 0x04) {
> > - 

[dpdk-dev] [PATCH] A fix to work around strict-aliasing rules breaking

2015-03-04 Thread Wang, Zhihong


> -Original Message-
> From: Wodkowski, PawelX
> Sent: Monday, March 02, 2015 8:32 PM
> To: Richardson, Bruce; Wang, Zhihong
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] A fix to work around strict-aliasing rules
> breaking
> 
> On 2015-03-02 11:32, Bruce Richardson wrote:
> > On Mon, Mar 02, 2015 at 05:03:50PM +0800, zhihong.wang at intel.com
> wrote:
> >> Fixed strict-aliasing rules breaking errors for some GCC version.
> >>
> >
> > This looks messy. Also, I believe the definition of memcpy should
> > include the "restrict" keyword to indicate that source and dest can't
> > overlap. Might that help fix the issue?
> >
> 
> Is this error related with overlapping or casting 'void *' to 'uintXX_t *' 
> that
> make compiler report aliasing rule breaking?
> 
> >
> >> Signed-off-by: Zhihong Wang 
> >> ---
> >>   .../common/include/arch/x86/rte_memcpy.h   | 44 --
> 
> >>   1 file changed, 24 insertions(+), 20 deletions(-)
> >>
> >> diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> >> b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> >> index 69a5c6f..f412099 100644
> >> --- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> >> +++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> >> @@ -195,6 +195,8 @@ rte_mov256blocks(uint8_t *dst, const uint8_t *src,
> size_t n)
> >>   static inline void *
> >>   rte_memcpy(void *dst, const void *src, size_t n)
> >>   {
> >> +  uintptr_t dstu = (uintptr_t)dst;
> >> +  uintptr_t srcu = (uintptr_t)src;
> 
> If so maybe using union here would be good solution or 'char *'.

Pawel,

Thanks for the suggestion! But I don't think a union can work around this --- 
I already tried it on CentOS release 6.5.
Anyway this is only to keep the compiler happy; the generated assembly code 
will be the same no matter which method is used.

Zhihong (John)

> 
> --
> Pawel


[dpdk-dev] [PATCH] librte_eal/common: Fix cast from pointer to integer of different size

2015-03-09 Thread Wang, Zhihong


> -Original Message-
> From: Qiu, Michael
> Sent: Friday, March 06, 2015 11:13 AM
> To: dev at dpdk.org
> Cc: Qiu, Michael; Wang, Zhihong
> Subject: [PATCH] librte_eal/common: Fix cast from pointer to integer of
> different size
> 
> ./i686-native-linuxapp-gcc/include/rte_memcpy.h:592:23: error:
> cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
> 
>   dstofss = 16 - (int)((long long)(void *)dst & 0x0F) + 16;
> 
> Type 'long long' is 64-bit in i686 platform while 'void *'
> is 32-bit.
> 
> Signed-off-by: Michael Qiu 
> Signed-off-by: Zhihong Wang 
> ---
> v4 --> v3:
>   fix dstofss/bits to size_t in rte_memcpy()
> v3 --> v2:
> make dstofss and srcofs to be type size_t
> casting type use uintptr_t
> 
> v2 --> v1:
> Remove unnecessary casting (void *)
> 
>  lib/librte_eal/common/include/arch/x86/rte_memcpy.h | 14 +++---
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> index 7b2d382..6ec4434 100644
> --- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> +++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> @@ -196,8 +196,8 @@ static inline void *  rte_memcpy(void *dst, const void
> *src, size_t n)  {
>   void *ret = dst;
> - int dstofss;
> - int bits;
> + size_t dstofss;
> + size_t bits;
> 
>   /**
>* Copy less than 16 bytes
> @@ -271,7 +271,7 @@ COPY_BLOCK_64_BACK31:
>   /**
>* Make store aligned when copy size exceeds 512 bytes
>*/
> - dstofss = 32 - (int)((long long)(void *)dst & 0x1F);
> + dstofss = 32 - ((uintptr_t)dst & 0x1F);
>   n -= dstofss;
>   rte_mov32((uint8_t *)dst, (const uint8_t *)src);
>   src = (const uint8_t *)src + dstofss;
> @@ -493,8 +493,8 @@ rte_memcpy(void *dst, const void *src, size_t n)  {
>   __m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7,
> xmm8;
>   void *ret = dst;
> - int dstofss;
> - int srcofs;
> + size_t dstofss;
> + size_t srcofs;
> 
>   /**
>* Copy less than 16 bytes
> @@ -589,12 +589,12 @@ COPY_BLOCK_64_BACK15:
>* unaligned copy functions require up to 15 bytes
>* backwards access.
>*/
> - dstofss = 16 - (int)((long long)(void *)dst & 0x0F) + 16;
> + dstofss = 16 - ((uintptr_t)dst & 0x0F) + 16;
>   n -= dstofss;
>   rte_mov32((uint8_t *)dst, (const uint8_t *)src);
>   src = (const uint8_t *)src + dstofss;
>   dst = (uint8_t *)dst + dstofss;
> - srcofs = (int)((long long)(const void *)src & 0x0F);
> + srcofs = ((uintptr_t)src & 0x0F);
> 
>   /**
>* For aligned copy
> --
> 1.9.3

Acked-by:  Wang, Zhihong 


[dpdk-dev] rte_memcpy.h: additional cflags required with OVS

2015-03-11 Thread Wang, Zhihong

> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Kavanagh, Mark B  
> Sent: Tuesday, March 10, 2015 6:04 PM
> To: Mcnamara, John; Qiu, Michael; dev at dpdk.org; Panu Matilainen
> Subject: Re: [dpdk-dev] rte_memcpy.h: additional cflags required with OVS
> 
> 
> 
> >-Original Message-
> >From: Mcnamara, John
> >Sent: Tuesday, March 10, 2015 8:27 AM
> >To: Qiu, Michael; Kavanagh, Mark B; dev at dpdk.org; Panu Matilainen
> >Subject: RE: [dpdk-dev] rte_memcpy.h: additional cflags required with
> >OVS
> >
> >> -Original Message-
> >> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Qiu, Michael
> >> Sent: Tuesday, March 10, 2015 3:05 AM
> >> To: Kavanagh, Mark B; dev at dpdk.org
> >> Subject: Re: [dpdk-dev] rte_memcpy.h: additional cflags required with
> >> OVS
> >>
> >
> >> What's your gcc version? this should be an issue with old version
> >> gcc, and I'm working on this to solve this issue now.
> >
> >
> >Hi Michael,
> >
> >I see the issue with gcc 4.7.2 but not with 4.9.2.
> 
> I'm using gcc v4.8.3.
> 
> Just to clarify my initial post, there are two issues related to gcc intrinsic
> headers emmintrin.h, and tmmintrin.h:
>   - in former, a difference in parameter types for _mm_storeu_si128 is
> the issue. This is the primary issue observed.
>   - in tmmintrin.h, when __OPTIMIZE__ is not defined, function
> _mm_alignr_epi8 is also not defined, leading to an 'implicit definition of
> function' error.

Adding the "-mssse3" flag should solve the 'implicit definition of function' 
error.
BTW, current DPDK should compile with gcc 4.7.2; has anything changed there 
that makes this flag mandatory?

Zhihong (John)

> I've only noticed this intermittently (even though I compile OVS with
> -O2 CFLAGS)
> 
> >
> >John


[dpdk-dev] [PATCH] common/rte_memcpy: Fix x86intrin.h missed

2015-03-13 Thread Wang, Zhihong


> -Original Message-
> From: Qiu, Michael
> Sent: Friday, March 13, 2015 3:03 PM
> To: dev at dpdk.org
> Cc: Wang, Zhihong; Qiu, Michael
> Subject: [PATCH] common/rte_memcpy: Fix x86intrin.h missed
> 
> rte_memcpy.h(46): catastrophic error: cannot open source file "x86intrin.h"
> 
> For icc and old gcc, this header is not included.
> 
> Signed-off-by: Michael Qiu 
> ---
>  lib/librte_eal/common/include/arch/x86/rte_memcpy.h | 20
> 
>  1 file changed, 20 insertions(+)
> 
> diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> index ac72069..bd10d36 100644
> --- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> +++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> @@ -43,7 +43,27 @@
>  #include 
>  #include 
>  #include 
> +#if (defined(__ICC) || (__GNUC__ == 4 &&  __GNUC_MINOR__ < 4))
> +
> +#ifdef __SSE__
> +#include 
> +#endif
> +
> +#ifdef __SSE2__
> +#include 
> +#endif
> +
> +#if defined(__SSE4_2__) || defined(__SSE4_1__) #include 
> +#endif
> +
> +#if defined(__AVX__)
> +#include 
> +#endif
> +
> +#else
>  #include 
> +#endif
> 
>  #ifdef __cplusplus
>  extern "C" {
> --
> 1.9.3

Acked-by:  Wang, Zhihong 



[dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms

2015-01-21 Thread Wang, Zhihong


> -Original Message-
> From: Neil Horman [mailto:nhorman at tuxdriver.com]
> Sent: Wednesday, January 21, 2015 3:16 AM
> To: Stephen Hemminger
> Cc: Wang, Zhihong; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in
> arch/x86/rte_memcpy.h for both SSE and AVX platforms
> 
> On Tue, Jan 20, 2015 at 09:15:38AM -0800, Stephen Hemminger wrote:
> > On Mon, 19 Jan 2015 09:53:34 +0800
> > zhihong.wang at intel.com wrote:
> >
> > > Main code changes:
> > >
> > > 1. Differentiate architectural features based on CPU flags
> > >
> > > a. Implement separated move functions for SSE/AVX/AVX2 to make
> > > full utilization of cache bandwidth
> > >
> > > b. Implement separated copy flow specifically optimized for
> > > target architecture
> > >
> > > 2. Rewrite the memcpy function "rte_memcpy"
> > >
> > > a. Add store aligning
> > >
> > > b. Add load aligning based on architectural features
> > >
> > > c. Put block copy loop into inline move functions for better
> > > control of instruction order
> > >
> > > d. Eliminate unnecessary MOVs
> > >
> > > 3. Rewrite the inline move functions
> > >
> > > a. Add move functions for unaligned load cases
> > >
> > > b. Change instruction order in copy loops for better pipeline
> > > utilization
> > >
> > > c. Use intrinsics instead of assembly code
> > >
> > > 4. Remove slow glibc call for constant copies
> > >
> > > Signed-off-by: Zhihong Wang 
> >
> > Dumb question: why not fix glibc memcpy instead?
> > What is special about rte_memcpy?
> >
> >
> Fair point.  Though, does glibc implement optimized memcpys per arch?  Or
> do they just rely on the __builtin's from gcc to get optimized variants?
> 
> Neil

Neil, Stephen,

Glibc has per-arch implementations, but they target the general case, while 
rte_memcpy is aimed at small-size, in-cache memcpy, which is the DPDK case. 
This leads to different trade-offs and optimization techniques.
Also, glibc's updates from version to version are based on general judgments. 
Roughly speaking, glibc 2.18 targets Ivy Bridge and 2.20 targets Haswell, 
though that's not fully accurate. But we need an implementation that works well 
on both Sandy Bridge and Haswell.

For instance, glibc 2.18 has a load-aligning optimization for unaligned memcpy 
but doesn't support 256-bit mov, while glibc 2.20 adds support for 256-bit mov 
but removes the load-aligning optimization. This hurts unaligned memcpy 
performance a lot on architectures like Ivy Bridge. Glibc's rationale is that 
load aligning doesn't help when src/dst aren't in cache, which may be 
the general case, but not the DPDK case.
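
To illustrate what aligning buys here, a minimal plain-C sketch (not the actual 
rte_memcpy code, which does this with SSE/AVX intrinsics and also realigns the 
load side on Sandy Bridge):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy a small unaligned head first so that the bulk of the copy then
 * runs on 32-byte-aligned destination addresses. */
static void
copy_with_aligned_bulk(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t head = (32 - ((uintptr_t)dst & 0x1F)) & 0x1F;

    if (head > n)
        head = n;
    memcpy(dst, src, head);                   /* unaligned head */
    memcpy(dst + head, src + head, n - head); /* aligned bulk */
}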

Zhihong (John)


[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-23 Thread Wang, Zhihong


> -Original Message-
> From: Neil Horman [mailto:nhorman at tuxdriver.com]
> Sent: Wednesday, January 21, 2015 8:38 PM
> To: Ananyev, Konstantin
> Cc: Wang, Zhihong; Richardson, Bruce; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> On Wed, Jan 21, 2015 at 12:02:57PM +, Ananyev, Konstantin wrote:
> >
> >
> > > -Original Message-
> > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Wang, Zhihong
> > > Sent: Wednesday, January 21, 2015 3:44 AM
> > > To: Richardson, Bruce; Neil Horman
> > > Cc: dev at dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > >
> > >
> > >
> > > > -Original Message-----
> > > > From: Richardson, Bruce
> > > > Sent: Wednesday, January 21, 2015 12:15 AM
> > > > To: Neil Horman
> > > > Cc: Wang, Zhihong; dev at dpdk.org
> > > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > >
> > > > On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
> > > > > On Tue, Jan 20, 2015 at 03:01:44AM +0000, Wang, Zhihong wrote:
> > > > > >
> > > > > >
> > > > > > > -Original Message-
> > > > > > > From: Neil Horman [mailto:nhorman at tuxdriver.com]
> > > > > > > Sent: Monday, January 19, 2015 9:02 PM
> > > > > > > To: Wang, Zhihong
> > > > > > > Cc: dev at dpdk.org
> > > > > > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > > > > >
> > > > > > > On Mon, Jan 19, 2015 at 09:53:30AM +0800,
> > > > > > > zhihong.wang at intel.com
> > > > wrote:
> > > > > > > > This patch set optimizes memcpy for DPDK for both SSE and
> > > > > > > > AVX
> > > > platforms.
> > > > > > > > It also extends memcpy test coverage with unaligned cases
> > > > > > > > and more test
> > > > > > > points.
> > > > > > > >
> > > > > > > > Optimization techniques are summarized below:
> > > > > > > >
> > > > > > > > 1. Utilize full cache bandwidth
> > > > > > > >
> > > > > > > > 2. Enforce aligned stores
> > > > > > > >
> > > > > > > > 3. Apply load address alignment based on architecture
> > > > > > > > features
> > > > > > > >
> > > > > > > > 4. Make load/store address available as early as possible
> > > > > > > >
> > > > > > > > 5. General optimization techniques like inlining, branch
> > > > > > > > reducing, prefetch pattern access
> > > > > > > >
> > > > > > > > Zhihong Wang (4):
> > > > > > > >   Disabled VTA for memcpy test in app/test/Makefile
> > > > > > > >   Removed unnecessary test cases in test_memcpy.c
> > > > > > > >   Extended test coverage in test_memcpy_perf.c
> > > > > > > >   Optimized memcpy in arch/x86/rte_memcpy.h for both SSE
> and AVX
> > > > > > > > platforms
> > > > > > > >
> > > > > > > >  app/test/Makefile  |   6 +
> > > > > > > >  app/test/test_memcpy.c |  52 +-
> > > > > > > >  app/test/test_memcpy_perf.c| 238 
> > > > > > > > +---
> > > > > > > >  .../common/include/arch/x86/rte_memcpy.h   | 664
> > > > > > > +++--
> > > > > > > >  4 files changed, 656 insertions(+), 304 deletions(-)
> > > > > > > >
> > > > > > > > --
> > > > > > > > 1.9.3
> > > > > > > >
> > > > > > > >
> > > > > > > Are you able to compile this with gcc 4.9.2?  The
> > > > > > > compilation of test_memcpy_perf is taking forever for me.  It
> appears hung.
> > > > > > > Neil
> > > > > >
> > > > > >
> > > > > > Neil,
> > > > > >
> > > > >

[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-23 Thread Wang, Zhihong


> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce Richardson
> Sent: Wednesday, January 21, 2015 9:26 PM
> To: Marc Sune
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> On Wed, Jan 21, 2015 at 02:21:25PM +0100, Marc Sune wrote:
> >
> > On 21/01/15 14:02, Bruce Richardson wrote:
> > >On Wed, Jan 21, 2015 at 01:36:41PM +0100, Marc Sune wrote:
> > >>On 21/01/15 04:44, Wang, Zhihong wrote:
> > >>>>-Original Message-
> > >>>>From: Richardson, Bruce
> > >>>>Sent: Wednesday, January 21, 2015 12:15 AM
> > >>>>To: Neil Horman
> > >>>>Cc: Wang, Zhihong; dev at dpdk.org
> > >>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > >>>>
> > >>>>On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
> > >>>>>On Tue, Jan 20, 2015 at 03:01:44AM +, Wang, Zhihong wrote:
> > >>>>>>>-Original Message-
> > >>>>>>>From: Neil Horman [mailto:nhorman at tuxdriver.com]
> > >>>>>>>Sent: Monday, January 19, 2015 9:02 PM
> > >>>>>>>To: Wang, Zhihong
> > >>>>>>>Cc: dev at dpdk.org
> > >>>>>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > >>>>>>>
> > >>>>>>>On Mon, Jan 19, 2015 at 09:53:30AM +0800,
> > >>>>>>>zhihong.wang at intel.com
> > >>>>wrote:
> > >>>>>>>>This patch set optimizes memcpy for DPDK for both SSE and AVX
> > >>>>platforms.
> > >>>>>>>>It also extends memcpy test coverage with unaligned cases and
> > >>>>>>>>more test
> > >>>>>>>points.
> > >>>>>>>>Optimization techniques are summarized below:
> > >>>>>>>>
> > >>>>>>>>1. Utilize full cache bandwidth
> > >>>>>>>>
> > >>>>>>>>2. Enforce aligned stores
> > >>>>>>>>
> > >>>>>>>>3. Apply load address alignment based on architecture features
> > >>>>>>>>
> > >>>>>>>>4. Make load/store address available as early as possible
> > >>>>>>>>
> > >>>>>>>>5. General optimization techniques like inlining, branch
> > >>>>>>>>reducing, prefetch pattern access
> > >>>>>>>>
> > >>>>>>>>Zhihong Wang (4):
> > >>>>>>>>   Disabled VTA for memcpy test in app/test/Makefile
> > >>>>>>>>   Removed unnecessary test cases in test_memcpy.c
> > >>>>>>>>   Extended test coverage in test_memcpy_perf.c
> > >>>>>>>>   Optimized memcpy in arch/x86/rte_memcpy.h for both SSE
> and AVX
> > >>>>>>>> platforms
> > >>>>>>>>
> > >>>>>>>>  app/test/Makefile  |   6 +
> > >>>>>>>>  app/test/test_memcpy.c |  52 +-
> > >>>>>>>>  app/test/test_memcpy_perf.c| 238 +---
> > >>>>>>>>  .../common/include/arch/x86/rte_memcpy.h   | 664
> > >>>>>>>+++--
> > >>>>>>>>  4 files changed, 656 insertions(+), 304 deletions(-)
> > >>>>>>>>
> > >>>>>>>>--
> > >>>>>>>>1.9.3
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>Are you able to compile this with gcc 4.9.2?  The compilation
> > >>>>>>>of test_memcpy_perf is taking forever for me.  It appears hung.
> > >>>>>>>Neil
> > >>>>>>Neil,
> > >>>>>>
> > >>>>>>Thanks for reporting this!
> > >>>>>>It should compile but will take quite some time if the CPU
> > >>>>>>doesn't support
> > >>>>AVX2, the reason is that:
> > >>>>>>1. The SSE & AVX memcpy implementation is more complicat

[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-27 Thread Wang, Zhihong


> -Original Message-
> From: Ananyev, Konstantin
> Sent: Tuesday, January 27, 2015 2:29 AM
> To: Wang, Zhihong; Richardson, Bruce; Marc Sune
> Cc: dev at dpdk.org
> Subject: RE: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> Hi Zhihong,
> 
> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Wang, Zhihong
> > Sent: Friday, January 23, 2015 6:52 AM
> > To: Richardson, Bruce; Marc Sune
> > Cc: dev at dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> >
> >
> >
> > > -Original Message-
> > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce
> > > Richardson
> > > Sent: Wednesday, January 21, 2015 9:26 PM
> > > To: Marc Sune
> > > Cc: dev at dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > >
> > > On Wed, Jan 21, 2015 at 02:21:25PM +0100, Marc Sune wrote:
> > > >
> > > > On 21/01/15 14:02, Bruce Richardson wrote:
> > > > >On Wed, Jan 21, 2015 at 01:36:41PM +0100, Marc Sune wrote:
> > > > >>On 21/01/15 04:44, Wang, Zhihong wrote:
> > > > >>>>-Original Message-
> > > > >>>>From: Richardson, Bruce
> > > > >>>>Sent: Wednesday, January 21, 2015 12:15 AM
> > > > >>>>To: Neil Horman
> > > > >>>>Cc: Wang, Zhihong; dev at dpdk.org
> > > > >>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > > >>>>
> > > > >>>>On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
> > > > >>>>>On Tue, Jan 20, 2015 at 03:01:44AM +, Wang, Zhihong wrote:
> > > > >>>>>>>-Original Message-
> > > > >>>>>>>From: Neil Horman [mailto:nhorman at tuxdriver.com]
> > > > >>>>>>>Sent: Monday, January 19, 2015 9:02 PM
> > > > >>>>>>>To: Wang, Zhihong
> > > > >>>>>>>Cc: dev at dpdk.org
> > > > >>>>>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy
> > > > >>>>>>>optimization
> > > > >>>>>>>
> > > > >>>>>>>On Mon, Jan 19, 2015 at 09:53:30AM +0800,
> > > > >>>>>>>zhihong.wang at intel.com
> > > > >>>>wrote:
> > > > >>>>>>>>This patch set optimizes memcpy for DPDK for both SSE and
> > > > >>>>>>>>AVX
> > > > >>>>platforms.
> > > > >>>>>>>>It also extends memcpy test coverage with unaligned cases
> > > > >>>>>>>>and more test
> > > > >>>>>>>points.
> > > > >>>>>>>>Optimization techniques are summarized below:
> > > > >>>>>>>>
> > > > >>>>>>>>1. Utilize full cache bandwidth
> > > > >>>>>>>>
> > > > >>>>>>>>2. Enforce aligned stores
> > > > >>>>>>>>
> > > > >>>>>>>>3. Apply load address alignment based on architecture
> > > > >>>>>>>>features
> > > > >>>>>>>>
> > > > >>>>>>>>4. Make load/store address available as early as possible
> > > > >>>>>>>>
> > > > >>>>>>>>5. General optimization techniques like inlining, branch
> > > > >>>>>>>>reducing, prefetch pattern access
> > > > >>>>>>>>
> > > > >>>>>>>>Zhihong Wang (4):
> > > > >>>>>>>>   Disabled VTA for memcpy test in app/test/Makefile
> > > > >>>>>>>>   Removed unnecessary test cases in test_memcpy.c
> > > > >>>>>>>>   Extended test coverage in test_memcpy_perf.c
> > > > >>>>>>>>   Optimized memcpy in arch/x86/rte_memcpy.h for both SSE
> > > and AVX
> > > > >>>>>>>> platforms
> > > > >>>>>>>>
> > > > >>>>>>>>  app/test/Makefile  |   6 +

[dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms

2015-01-27 Thread Wang, Zhihong


> -Original Message-
> From: Wodkowski, PawelX
> Sent: Monday, January 26, 2015 10:43 PM
> To: Wang, Zhihong; dev at dpdk.org
> Subject: RE: [dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in
> arch/x86/rte_memcpy.h for both SSE and AVX platforms
> 
> Hi,
> 
> I must say: greate work.
> 
> I have some small comments:
> 
> > +/**
> > + * Macro for copying unaligned block from one location to another,
> > + * 47 bytes leftover maximum,
> > + * locations should not overlap.
> > + * Requirements:
> > + * - Store is aligned
> > + * - Load offset is , which must be immediate value within [1, 15]
> > + * - For , make sure  bit backwards & <16 - offset> bit
> forwards
> > are available for loading
> > + * - , ,  must be variables
> > + * - __m128i  ~  must be pre-defined
> > + */
> > +#define MOVEUNALIGNED_LEFT47(dst, src, len, offset)
> > \
> > +{  
> >  \
> ...
> > +}
> 
> Why not do { ... } while(0) or ({ ... }) ? This could have unpredictable side
> effects.
> 
> Second:
> Why you completely substitute
> #define rte_memcpy(dst, src, n)  \
>   ({ (__builtin_constant_p(n)) ?   \
>   memcpy((dst), (src), (n)) :  \
>   rte_memcpy_func((dst), (src), (n)); })
> 
> with inline rte_memcpy()? This construction  can help compiler to deduce
> which version to use (static?) inline implementation or call external
> function.
> 
> Did you try 'extern inline' type? It could help reducing compilation time.

Hi Pawel,

Good call on "MOVEUNALIGNED_LEFT47". Thanks!

I removed the conditional __builtin_constant_p(n) because it calls glibc memcpy 
when the parameter is constant, while rte_memcpy has better performance there.
The current long compile time is caused by too many function calls; I'll fix 
that in the next version.
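
For reference, the do { ... } while (0) form Pawel suggests, with a made-up 
macro body just for illustration:

#include <stdint.h>
#include <string.h>

/* Wrapping a multi-statement macro in do { ... } while (0) makes it
 * behave like a single statement, e.g. inside an un-braced if/else.
 * MY_COPY_STEP and its body are made up for this illustration. */
#define MY_COPY_STEP(dst, src, len)                 \
do {                                                \
    memcpy((dst), (src), (len));                    \
    (dst) = (uint8_t *)(dst) + (len);               \
    (src) = (const uint8_t *)(src) + (len);         \
} while (0)

This way "if (cond) MY_COPY_STEP(dst, src, 16); else ..." expands to exactly 
one statement, with no surprises from the trailing semicolon.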

Zhihong (John)



[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-27 Thread Wang, Zhihong
Hey Luke,

Thanks for the excellent questions!

The following script will launch the memcpy test in DPDK:
echo -e 'memcpy_autotest\nmemcpy_perf_autotest\nquit\n' | 
./x86_64-native-linuxapp-gcc/app/test -c 4 -n 4 -- -i

Thanks for sharing the object code; I think it's the Sandy Bridge version 
though.
The rte_memcpy for Haswell is quite simple too. This is a decision based on the 
arch difference: Haswell has significant improvements in the memory hierarchy.
The Sandy Bridge unaligned memcpy is large in size but has better performance, 
because converting unaligned loads into aligned ones is crucial for in-cache 
memcpy on Sandy Bridge.

The rep instruction is still not fast enough, but I can't say much about it 
since I haven't investigated it thoroughly.
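
For concreteness, the "rep" path under discussion boils down to something like 
this (x86-only GCC/Clang inline asm; an illustrative sketch, not part of the 
patch):

#include <stddef.h>

static inline void
copy_rep_movsb(void *dst, const void *src, size_t n)
{
    /* REP MOVSB copies RCX bytes from [RSI] to [RDI]. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
}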

To my understanding, memcpy optimization is all about trade-offs for specific 
use cases, and this one is for the DPDK scenario (small size, in cache: you may 
find quite a few copies of only 6 bytes or so); you can refer to the RFC for 
this patch.
It's not likely that one could make a version that's optimal for all scenarios.

But I agree with the author of glibc memcpy on this: a program with too many 
memcpys is a program with a design flaw.


Thanks
Zhihong (John)

From: lukego at gmail.com [mailto:luk...@gmail.com] On Behalf Of Luke Gorrie
Sent: Monday, January 26, 2015 4:03 PM
To: Wang, Zhihong
Cc: dev at dpdk.org; snabb-devel at googlegroups.com
Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

On 26 January 2015 at 02:30, Wang, Zhihong <zhihong.wang at intel.com> wrote:
Hi Luke,

I'm very glad that you're interested in this work. :)

Great :).

 I never published any performance data, and haven't run cachebench.
We mainly use test_memcpy_perf.c in DPDK to do the test, because it's the 
environment that DPDK runs in. You can also find the performance comparison 
with glibc there.
It can be launched in /app/test: memcpy_perf_autotest.

Could you give me a command-line example to run this please? (Sorry if this 
should be obvious.)

 Finally, inline can bring benefits based on practice, constant value unrolling 
for example, and for DPDK we need all possible optimization.

Do we need to think about code size and potential instruction cache thrashing?

For me one call to rte_memcpy compiles to 3520 instructions 
(https://gist.github.com/lukego/8b17a07246d999331b04) in 20KB of 
object code. That's more than half the size of the Haswell instruction cache 
(32KB) per call.

glibc 2.20's memcpy_avx_unaligned 
(https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S;h=9f033f54568c3e5b6d9de9b3ba75f5be41070b92;hb=HEAD)
is only 909 bytes shared/total and also seems to have basically excellent 
performance on Haswell.

So I am concerned about the code size of rte_memcpy, especially when inlined, 
and meta-concerned about the nonlinear impact of nested inlined functions on 
both compile time and object code size.


There is another issue that I am concerned about:

The Intel Optimization Guide suggests that rep movs is very efficient starting 
in Ivy Bridge. In practice though it seems to be much slower than using vector 
instructions, even though it is faster than it used to be in Sandy Bridge. Is 
that true?

This could have a substantial impact on off-the-shelf memcpy. glibc 2.20's 
memcpy uses movs for sizes >= 2048 and that is where performance takes a dive 
for me (in microbenchmarks). GCC will also emit inline string move instructions 
for certain constant-size memcpy calls at certain optimization levels.


So I feel like I haven't yet found the right memcpy for me, and we haven't even 
started to look at the interesting parts like cache-coherence behaviour when 
sharing data between cores (vhost) and whether streaming load/store can be used 
to defend the state of cache lines between cores.


Do I make any sense? What do I miss?


Cheers,
-Luke




[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-27 Thread Wang, Zhihong


> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of EDMISON, Kelvin
> (Kelvin)
> Sent: Friday, January 23, 2015 2:22 AM
> To: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> 
> 
> On 2015-01-21, 3:54 PM, "Neil Horman"  wrote:
> 
> >On Wed, Jan 21, 2015 at 11:49:47AM -0800, Stephen Hemminger wrote:
> >> On Wed, 21 Jan 2015 13:26:20 +
> >> Bruce Richardson  wrote:
> >>
> >> > On Wed, Jan 21, 2015 at 02:21:25PM +0100, Marc Sune wrote:
> >> > >
> >> > > On 21/01/15 14:02, Bruce Richardson wrote:
> >> > > >On Wed, Jan 21, 2015 at 01:36:41PM +0100, Marc Sune wrote:
> >> > > >>On 21/01/15 04:44, Wang, Zhihong wrote:
> >> > > >>>>-Original Message-
> >> > > >>>>From: Richardson, Bruce
> >> > > >>>>Sent: Wednesday, January 21, 2015 12:15 AM
> >> > > >>>>To: Neil Horman
> >> > > >>>>Cc: Wang, Zhihong; dev at dpdk.org
> >> > > >>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> >> > > >>>>
> >> > > >>>>On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
> >> > > >>>>>On Tue, Jan 20, 2015 at 03:01:44AM +, Wang, Zhihong
> wrote:
> >> > > >>>>>>>-Original Message-
> >> > > >>>>>>>From: Neil Horman [mailto:nhorman at tuxdriver.com]
> >> > > >>>>>>>Sent: Monday, January 19, 2015 9:02 PM
> >> > > >>>>>>>To: Wang, Zhihong
> >> > > >>>>>>>Cc: dev at dpdk.org
> >> > > >>>>>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy
> optimization
> >> > > >>>>>>>
> >> > > >>>>>>>On Mon, Jan 19, 2015 at 09:53:30AM +0800,
> >>zhihong.wang at intel.com
> >> > > >>>>wrote:
> >> > > >>>>>>>>This patch set optimizes memcpy for DPDK for both SSE and
> >>AVX
> >> > > >>>>platforms.
> >> > > >>>>>>>>It also extends memcpy test coverage with unaligned cases
> >>and
> >> > > >>>>>>>>more test
> >> > > >>>>>>>points.
> >> > > >>>>>>>>Optimization techniques are summarized below:
> >> > > >>>>>>>>
> >> > > >>>>>>>>1. Utilize full cache bandwidth
> >> > > >>>>>>>>
> >> > > >>>>>>>>2. Enforce aligned stores
> >> > > >>>>>>>>
> >> > > >>>>>>>>3. Apply load address alignment based on architecture
> >>features
> >> > > >>>>>>>>
> >> > > >>>>>>>>4. Make load/store address available as early as possible
> >> > > >>>>>>>>
> >> > > >>>>>>>>5. General optimization techniques like inlining, branch
> >> > > >>>>>>>>reducing, prefetch pattern access
> >> > > >>>>>>>>
> >> > > >>>>>>>>Zhihong Wang (4):
> >> > > >>>>>>>>   Disabled VTA for memcpy test in app/test/Makefile
> >> > > >>>>>>>>   Removed unnecessary test cases in test_memcpy.c
> >> > > >>>>>>>>   Extended test coverage in test_memcpy_perf.c
> >> > > >>>>>>>>   Optimized memcpy in arch/x86/rte_memcpy.h for both
> SSE
> >>and AVX
> >> > > >>>>>>>> platforms
> >> > > >>>>>>>>
> >> > > >>>>>>>>  app/test/Makefile  |   6 +
> >> > > >>>>>>>>  app/test/test_memcpy.c |  52
> >>+-
> >> > > >>>>>>>>  app/test/test_memcpy_perf.c| 238
> >>+---
> >> > > >>>>>>>>  .../common/include/arch/x86/rte_memcpy.h   | 664
> 

[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-28 Thread Wang, Zhihong


> -Original Message-
> From: Ananyev, Konstantin
> Sent: Tuesday, January 27, 2015 8:20 PM
> To: Wang, Zhihong; Richardson, Bruce; 'Marc Sune'
> Cc: 'dev at dpdk.org'
> Subject: RE: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> 
> 
> > -Original Message-
> > From: Ananyev, Konstantin
> > Sent: Tuesday, January 27, 2015 11:30 AM
> > To: Wang, Zhihong; Richardson, Bruce; Marc Sune
> > Cc: dev at dpdk.org
> > Subject: RE: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> >
> >
> >
> > > -Original Message-
> > > From: Wang, Zhihong
> > > Sent: Tuesday, January 27, 2015 1:42 AM
> > > To: Ananyev, Konstantin; Richardson, Bruce; Marc Sune
> > > Cc: dev at dpdk.org
> > > Subject: RE: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > >
> > >
> > >
> > > > -Original Message-
> > > > From: Ananyev, Konstantin
> > > > Sent: Tuesday, January 27, 2015 2:29 AM
> > > > To: Wang, Zhihong; Richardson, Bruce; Marc Sune
> > > > Cc: dev at dpdk.org
> > > > Subject: RE: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > >
> > > > Hi Zhihong,
> > > >
> > > > > -Original Message-
> > > > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Wang,
> > > > > Zhihong
> > > > > Sent: Friday, January 23, 2015 6:52 AM
> > > > > To: Richardson, Bruce; Marc Sune
> > > > > Cc: dev at dpdk.org
> > > > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > > >
> > > > >
> > > > >
> > > > > > -Original Message-
> > > > > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce
> > > > > > Richardson
> > > > > > Sent: Wednesday, January 21, 2015 9:26 PM
> > > > > > To: Marc Sune
> > > > > > Cc: dev at dpdk.org
> > > > > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > > > >
> > > > > > On Wed, Jan 21, 2015 at 02:21:25PM +0100, Marc Sune wrote:
> > > > > > >
> > > > > > > On 21/01/15 14:02, Bruce Richardson wrote:
> > > > > > > >On Wed, Jan 21, 2015 at 01:36:41PM +0100, Marc Sune wrote:
> > > > > > > >>On 21/01/15 04:44, Wang, Zhihong wrote:
> > > > > > > >>>>-Original Message-
> > > > > > > >>>>From: Richardson, Bruce
> > > > > > > >>>>Sent: Wednesday, January 21, 2015 12:15 AM
> > > > > > > >>>>To: Neil Horman
> > > > > > > >>>>Cc: Wang, Zhihong; dev at dpdk.org
> > > > > > > >>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy
> > > > > > > >>>>optimization
> > > > > > > >>>>
> > > > > > > >>>>On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
> > > > > > > >>>>>On Tue, Jan 20, 2015 at 03:01:44AM +, Wang, Zhihong
> wrote:
> > > > > > > >>>>>>>-Original Message-
> > > > > > > >>>>>>>From: Neil Horman [mailto:nhorman at tuxdriver.com]
> > > > > > > >>>>>>>Sent: Monday, January 19, 2015 9:02 PM
> > > > > > > >>>>>>>To: Wang, Zhihong
> > > > > > > >>>>>>>Cc: dev at dpdk.org
> > > > > > > >>>>>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy
> > > > > > > >>>>>>>optimization
> > > > > > > >>>>>>>
> > > > > > > >>>>>>>On Mon, Jan 19, 2015 at 09:53:30AM +0800,
> > > > > > > >>>>>>>zhihong.wang at intel.com
> > > > > > > >>>>wrote:
> > > > > > > >>>>>>>>This patch set optimizes memcpy for DPDK for both
> > > > > > > >>>>>>>>SSE and AVX
> > > > > > > >>>>platforms.
> > > > > > > >>>>>>>>It also extends memcpy test coverage with unaligned
> > > > > > > >>>>>>&

[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

2015-01-29 Thread Wang, Zhihong


> -Original Message-
> From: EDMISON, Kelvin (Kelvin) [mailto:kelvin.edmison at alcatel-lucent.com]
> Sent: Thursday, January 29, 2015 5:48 AM
> To: Wang, Zhihong; Stephen Hemminger; Neil Horman
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> 
> On 2015-01-27, 3:22 AM, "Wang, Zhihong"  wrote:
> 
> >
> >
> >> -Original Message-
> >> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of EDMISON,
> Kelvin
> >> (Kelvin)
> >> Sent: Friday, January 23, 2015 2:22 AM
> >> To: dev at dpdk.org
> >> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> >>
> >>
> >>
> >> On 2015-01-21, 3:54 PM, "Neil Horman" 
> wrote:
> >>
> >> >On Wed, Jan 21, 2015 at 11:49:47AM -0800, Stephen Hemminger wrote:
> >> >> On Wed, 21 Jan 2015 13:26:20 + Bruce Richardson
> >> >>  wrote:
> >> >>
> [..trim...]
> >> >> One issue I have is that as a vendor we need to ship on binary,
> >> >>not different distributions  for each Intel chip variant. There is
> >> >>some support for multi-chip version functions  but only in latest
> >> >>Gcc which isn't in Debian stable. And the
> >>multi-chip
> >> >>version
> >> >> of functions is going to be more expensive than inlining. For some
> >> >>cases, I have  seen that the overhead of fancy instructions looks
> >> >>good but have
> >>nasty
> >> >>side effects
> >> >> like CPU stall and/or increased power consumption which turns of
> >>turbo
> >> >>boost.
> >> >>
> >> >>
> >> >> Distro's in general have the same problem with special case
> >> >>optimizations.
> >> >>
> >> >What we really need is to do something like borrow the alternatives
> >> >mechanism from the kernel so that we can dynamically replace
> >> >instructions at run time based on cpu flags.  That way we could make
> >> >the choice at run time, and wouldn't have to do alot of special case
> >> >jumping about.
> >> >Neil
> >>
> >> +1.
> >>
> >> I think it should be an anti-requirement that the build machine be
> >> the exact same chip as the deployment platform.
> >>
> >> I like the cpu flag inspection approach.  It would help in the case
> >>where  DPDK is in a VM and an odd set of CPU flags have been exposed.
> >>
> >> If that approach doesn't work though, then perhaps DPDK memcpy could
> >>go  through a benchmarking at app startup time and select the most
> >>performant  option out of a set, like mdraid's raid6 implementation
> >>does.  To give an  example, this is what my systems print out at boot
> >>time re: raid6  algorithm selection.
> >> raid6: sse2x13171 MB/s
> >> raid6: sse2x23925 MB/s
> >> raid6: sse2x44523 MB/s
> >> raid6: using algorithm sse2x4 (4523 MB/s)
> >>
> >> Regards,
> >>Kelvin
> >>
> >
> >Thanks for the proposal!
> >
> >For DPDK, performance is always the most important concern. We need to
> >utilize new architecture features to achieve that, so solution per arch
> >is necessary.
> >Even a few extra cycles can lead to bad performance if they're in a hot
> >loop.
> >For instance, let's assume DPDK takes 60 cycles to process a packet on
> >average, then 3 more cycles here means 5% performance drop.
> >
> >The dynamic solution is doable but with performance penalties, even if
> >it could be small. Also it may bring extra complexity, which can lead
> >to unpredictable behaviors and side effects.
> >For example, the dynamic solution won't have inline unrolling, which
> >can bring significant performance benefit for small copies with
> >constant length, like eth_addr.
> >
> >We can investigate the VM scenario more.
> >
> >Zhihong (John)
> 
> John,
> 
>   Thanks for taking the time to answer my newbie question. I deeply
> appreciate the attention paid to performance in DPDK. I have a follow-up
> though.
> 
> I'm trying to figure out what requirements this approach creates for the
> software build environment.  If we want to build optimized versions for
> Haswell, Ivy Bridge, Sandy Bridge, etc, does this mean that we must have one
> of each micro-architecture available for running the builds, or is there a way
> of cross-compiling for all micro-architectures from just one build
> environment?
> 
> Thanks,
>   Kelvin
> 

I'm not an expert in this, just some facts based on my tests: the compile 
process depends on the compiler and the library versions.
So even on a machine that doesn't support the necessary ISA, the code should 
still compile as long as gcc, glibc, etc. support it; you'll only get "Illegal 
instruction" when trying to launch the compiled binary.

Therefore, if there's a way (worst case: change the flags manually) to make the 
DPDK build process think it's on a Haswell machine, it will produce Haswell 
binaries.
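
To make that concrete, here is a minimal, hypothetical sketch (not the actual
DPDK build logic; the function name and the fallback branch are mine, added
only for illustration) showing how the compile-time machine flag, rather than
the CPU running the compiler, decides which instructions end up in the binary.
It reuses the same RTE_MACHINE_CPUFLAG_AVX2 switch used in the rte_memcpy.h
patch:

/* Sketch only: the path that gets compiled depends on the flags the build
 * was configured with, not on the machine doing the build. */
#include <stdint.h>
#include <string.h>
#ifdef RTE_MACHINE_CPUFLAG_AVX2
#include <immintrin.h>
#endif

static inline void
mov32_sketch(uint8_t *dst, const uint8_t *src)
{
#ifdef RTE_MACHINE_CPUFLAG_AVX2
	/* Compiled only when the build targets an AVX2 machine (e.g. a
	 * Haswell-configured build); running that binary on an older CPU
	 * is what triggers "Illegal instruction". */
	__m256i ymm0 = _mm256_loadu_si256((const __m256i *)src);
	_mm256_storeu_si256((__m256i *)dst, ymm0);
#else
	/* Plain fallback for targets without AVX2 (illustrative only). */
	memcpy(dst, src, 32);
#endif
}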

Zhihong (John)


[dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms

2015-01-30 Thread Wang, Zhihong
Hey Konstantin,

This method does reduce code size but leads to a significant performance drop.
I think we need to keep the original code.


Thanks
Zhihong (John)


> -Original Message-
> From: Ananyev, Konstantin
> Sent: Thursday, January 29, 2015 11:18 PM
> To: Wang, Zhihong; dev at dpdk.org
> Subject: RE: [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in
> arch/x86/rte_memcpy.h for both SSE and AVX platforms
> 
> Hi Zhihong,
> 
> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Zhihong Wang
> > Sent: Thursday, January 29, 2015 2:39 AM
> > To: dev at dpdk.org
> > Subject: [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in
> > arch/x86/rte_memcpy.h for both SSE and AVX platforms
> >
> > Main code changes:
> >
> > 1. Differentiate architectural features based on CPU flags
> >
> > a. Implement separated move functions for SSE/AVX/AVX2 to make
> > full utilization of cache bandwidth
> >
> > b. Implement separated copy flow specifically optimized for target
> > architecture
> >
> > 2. Rewrite the memcpy function "rte_memcpy"
> >
> > a. Add store aligning
> >
> > b. Add load aligning based on architectural features
> >
> > c. Put block copy loop into inline move functions for better
> > control of instruction order
> >
> > d. Eliminate unnecessary MOVs
> >
> > 3. Rewrite the inline move functions
> >
> > a. Add move functions for unaligned load cases
> >
> > b. Change instruction order in copy loops for better pipeline
> > utilization
> >
> > c. Use intrinsics instead of assembly code
> >
> > 4. Remove slow glibc call for constant copies
> >
> > Signed-off-by: Zhihong Wang 
> > ---
> >  .../common/include/arch/x86/rte_memcpy.h   | 680
> +++--
> >  1 file changed, 509 insertions(+), 171 deletions(-)
> >
> > diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> > b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> > index fb9eba8..7b2d382 100644
> > --- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> > +++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> > @@ -34,166 +34,189 @@
> >  #ifndef _RTE_MEMCPY_X86_64_H_
> >  #define _RTE_MEMCPY_X86_64_H_
> >
> > +/**
> > + * @file
> > + *
> > + * Functions for SSE/AVX/AVX2 implementation of memcpy().
> > + */
> > +
> > +#include 
> >  #include 
> >  #include 
> > -#include 
> > +#include 
> >
> >  #ifdef __cplusplus
> >  extern "C" {
> >  #endif
> >
> > -#include "generic/rte_memcpy.h"
> > +/**
> > + * Copy bytes from one location to another. The locations must not
> overlap.
> > + *
> > + * @note This is implemented as a macro, so it's address should not
> > +be taken
> > + * and care is needed as parameter expressions may be evaluated
> multiple times.
> > + *
> > + * @param dst
> > + *   Pointer to the destination of the data.
> > + * @param src
> > + *   Pointer to the source data.
> > + * @param n
> > + *   Number of bytes to copy.
> > + * @return
> > + *   Pointer to the destination data.
> > + */
> > +static inline void *
> > +rte_memcpy(void *dst, const void *src, size_t n)
> > +__attribute__((always_inline));
> >
> > -#ifdef __INTEL_COMPILER
> > -#pragma warning(disable:593) /* Stop unused variable warning (reg_a
> > etc). */ -#endif
> > +#ifdef RTE_MACHINE_CPUFLAG_AVX2
> >
> > +/**
> > + * AVX2 implementation below
> > + */
> > +
> > +/**
> > + * Copy 16 bytes from one location to another,
> > + * locations should not overlap.
> > + */
> >  static inline void
> >  rte_mov16(uint8_t *dst, const uint8_t *src)  {
> > -   __m128i reg_a;
> > -   asm volatile (
> > -   "movdqu (%[src]), %[reg_a]\n\t"
> > -   "movdqu %[reg_a], (%[dst])\n\t"
> > -   : [reg_a] "=x" (reg_a)
> > -   : [src] "r" (src),
> > - [dst] "r"(dst)
> > -   : "memory"
> > -   );
> > +   __m128i xmm0;
> > +
> > +   xmm0 = _mm_loadu_si128((const __m128i *)src);
> > +   _mm_storeu_si128((__m128i *)dst, xmm0);
> >  }
> >
> > +/**
> > + * Copy 32 bytes from one location to another,
> > + * locations should not overlap.
> > + */
>

[dpdk-dev] [PATCH 4/6] testpmd: handle all rxqs in rss setup

2016-06-03 Thread Wang, Zhihong


> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Wang, Zhihong
> Sent: Thursday, May 26, 2016 10:55 AM
> To: Thomas Monjalon 
> Cc: dev at dpdk.org; Ananyev, Konstantin ;
> Richardson, Bruce ; De Lara Guarch, Pablo
> 
> Subject: Re: [dpdk-dev] [PATCH 4/6] testpmd: handle all rxqs in rss setup
> 
> 
> 
> > -Original Message-
> > From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> > Sent: Wednesday, May 25, 2016 5:42 PM
> > To: Wang, Zhihong 
> > Cc: dev at dpdk.org; Ananyev, Konstantin ;
> > Richardson, Bruce ; De Lara Guarch, Pablo
> > 
> > Subject: Re: [PATCH 4/6] testpmd: handle all rxqs in rss setup
> >
> > 2016-05-05 18:46, Zhihong Wang:
> > > This patch removes constraints in rxq handling when multiqueue is enabled
> > > to handle all the rxqs.
> > >
> > > Current testpmd forces a dedicated core for each rxq, some rxqs may be
> > > ignored when core number is less than rxq number, and that causes
> confusion
> > > and inconvenience.
> >
> > I have the feeling that "constraints", "confusion" and "inconvenience"
> > should be more explained.
> > Please give some examples with not enough and too much cores. Thanks
> 
> Sure, will add detailed description in v2  ;)

V2 has been sent.
We keep seeing people asking for help because of this "confusion";
one recent example:
http://openvswitch.org/pipermail/dev/2016-June/072110.html




[dpdk-dev] [PATCH v1 2/2] Test cases for rte_memcmp functions

2016-06-07 Thread Wang, Zhihong


> -Original Message-
> From: Ravi Kerur [mailto:rkerur at gmail.com]
> Sent: Tuesday, June 7, 2016 2:32 AM
> To: Wang, Zhihong ; Thomas Monjalon
> 
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v1 2/2] Test cases for rte_memcmp functions
> 
> Zhilong, Thomas,
> 
> If there is enough interest within DPDK community I can work on adding support
> for 'unaligned access' and 'test cases' for it. Please let me know either way.
> 


Hi Ravi,

This rte_memcmp has proven to perform better than glibc's memcmp in aligned
cases, so I think it has good value for the DPDK lib.

Though we don't have memcmp in the critical PMD data path, it offers a better
choice for applications that do.


Thanks
Zhihong


> Thanks,
> Ravi
> 
> 
> On Thu, May 26, 2016 at 2:05 AM, Wang, Zhihong 
> wrote:
> 
> 
> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Ravi Kerur
> > Sent: Tuesday, March 8, 2016 7:01 AM
> > To: dev at dpdk.org
> > Subject: [dpdk-dev] [PATCH v1 2/2] Test cases for rte_memcmp functions
> >
> > v1:
> >         This patch adds test cases for rte_memcmp functions.
> >         New rte_memcmp functions can be tested via 'make test'
> >         and 'testpmd' utility.
> >
> >         Compiled and tested on Ubuntu 14.04(non-NUMA) and
> >         15.10(NUMA) systems.
> [...]
> 
> > +/
> > ***
> > + * Memcmp function performance test configuration section. Each performance
> > test
> > + * will be performed MEMCMP_ITERATIONS times.
> > + *
> > + * The five arrays below control what tests are performed. Every 
> > combination
> > + * from the array entries is tested.
> > + */
> > +#define MEMCMP_ITERATIONS (500 * 500 * 500)
> 
> 
> Maybe fewer iterations would make the test faster without compromising precision?
> 
> 
> > +
> > +static size_t memcmp_sizes[] = {
> > +	2, 5, 8, 9, 15, 16, 17, 31, 32, 33, 63, 64, 65, 127, 128,
> > +	129, 191, 192, 193, 255, 256, 257, 319, 320, 321, 383, 384,
> > +	385, 447, 448, 449, 511, 512, 513, 767, 768, 769, 1023, 1024,
> > +	1025, 1522, 1536, 1600, 2048, 2560, 3072, 3584, 4096, 4608,
> > +	5632, 6144, 6656, 7168, 7680, 8192, 16834
> > +};
> > +
> [...]
> > +/*
> > + * Do all performance tests.
> > + */
> > +static int
> > +test_memcmp_perf(void)
> > +{
> > +	if (run_all_memcmp_eq_perf_tests() != 0)
> > +		return -1;
> > +
> > +	if (run_all_memcmp_gt_perf_tests() != 0)
> > +		return -1;
> > +
> > +	if (run_all_memcmp_lt_perf_tests() != 0)
> > +		return -1;
> > +
> 
> 
> Perhaps unaligned test cases are needed here.
> What do you think?
> 
> 
> > +
> > +	return 0;
> > +}
> > +
> > +static struct test_command memcmp_perf_cmd = {
> > +	.command = "memcmp_perf_autotest",
> > +	.callback = test_memcmp_perf,
> > +};
> > +REGISTER_TEST_COMMAND(memcmp_perf_cmd);
> > --
> > 1.9.1



[dpdk-dev] [PATCH v2 4/5] testpmd: handle all rxqs in rss setup

2016-06-08 Thread Wang, Zhihong


> -Original Message-
> From: De Lara Guarch, Pablo
> Sent: Tuesday, June 7, 2016 6:30 PM
> To: Wang, Zhihong ; dev at dpdk.org
> Cc: Ananyev, Konstantin ; Richardson, Bruce
> ; thomas.monjalon at 6wind.com
> Subject: RE: [PATCH v2 4/5] testpmd: handle all rxqs in rss setup
> 
> 
> 
> > -Original Message-
> > From: Wang, Zhihong
> > Sent: Wednesday, June 01, 2016 4:28 AM
> > To: dev at dpdk.org
> > Cc: Ananyev, Konstantin; Richardson, Bruce; De Lara Guarch, Pablo;
> > thomas.monjalon at 6wind.com; Wang, Zhihong
> > Subject: [PATCH v2 4/5] testpmd: handle all rxqs in rss setup
> >
> > This patch removes constraints in rxq handling when multiqueue is enabled
> > to handle all the rxqs.
> >
> > Current testpmd forces a dedicated core for each rxq, some rxqs may be
> > ignored when core number is less than rxq number, and that causes
> > confusion
> > and inconvenience.
> >
> >
> > Signed-off-by: Zhihong Wang 
> 
> Patch looks good, but you said that you were going to add a more detailed
> description in the commit message.

I added them in the cover letter.
Will add them here too.

> 
> Thanks,
> Pablo


[dpdk-dev] [PATCH v2 1/5] testpmd: add retry option

2016-06-08 Thread Wang, Zhihong


> -Original Message-
> From: De Lara Guarch, Pablo
> Sent: Tuesday, June 7, 2016 5:28 PM
> To: Wang, Zhihong ; dev at dpdk.org
> Cc: Ananyev, Konstantin ; Richardson, Bruce
> ; thomas.monjalon at 6wind.com
> Subject: RE: [PATCH v2 1/5] testpmd: add retry option
> 
> 
> 
> > -Original Message-
> > From: Wang, Zhihong
> > Sent: Wednesday, June 01, 2016 4:28 AM
> > To: dev at dpdk.org
> > Cc: Ananyev, Konstantin; Richardson, Bruce; De Lara Guarch, Pablo;
> > thomas.monjalon at 6wind.com; Wang, Zhihong
> > Subject: [PATCH v2 1/5] testpmd: add retry option
> >
> > This patch adds retry option in testpmd to prevent most packet losses.
> > It can be enabled by "set fwd  retry". All modes except rxonly
> > support this option.
> >
> > Adding retry mechanism expands test case coverage to support scenarios
> > where packet loss affects test results.
> >
> >
> > Signed-off-by: Zhihong Wang 
> 
> ...
> 
> > --- a/doc/guides/testpmd_app_ug/testpmd_funcs.rst
> > +++ b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
> > @@ -249,8 +249,10 @@ set fwd
> >
> >  Set the packet forwarding mode::
> >
> > -   testpmd> set fwd (io|mac|mac_retry|macswap|flowgen| \
> > - rxonly|txonly|csum|icmpecho)
> > +   testpmd> set fwd (io|mac|macswap|flowgen| \
> > + rxonly|txonly|csum|icmpecho) (""|retry)
> > +
> > +``retry`` can be specified for forwarding engines except ``rx_only``.
> >
> >  The available information categories are:
> >
> > @@ -260,8 +262,6 @@ The available information categories are:
> >
> >  * ``mac``: Changes the source and the destination Ethernet addresses of
> > packets before forwarding them.
> >
> > -* ``mac_retry``: Same as "mac" forwarding mode, but includes retries if the
> > destination queue is full.
> > -
> >  * ``macswap``: MAC swap forwarding mode.
> >Swaps the source and the destination Ethernet addresses of packets
> before
> > forwarding them.
> >
> > @@ -392,7 +392,7 @@ Set number of packets per burst::
> >
> >  This is equivalent to the ``--burst command-line`` option.
> >
> > -In ``mac_retry`` forwarding mode, the transmit delay time and number of
> > retries can also be set::
> > +When retry is enabled, the transmit delay time and number of retries can
> > also be set::
> >
> > testpmd> set burst tx delay (micrseconds) retry (num)
> 
> Could you fix the typo "micrseconds" in this patch?

Sure ;)

> 
> >
> > --
> > 2.5.0
> 
> Apart from this,
> 
> Acked-by: Pablo de Lara 



[dpdk-dev] [PATCH v2 3/5] testpmd: show throughput in port stats

2016-06-08 Thread Wang, Zhihong


> -Original Message-
> From: De Lara Guarch, Pablo
> Sent: Tuesday, June 7, 2016 6:03 PM
> To: Wang, Zhihong ; dev at dpdk.org
> Cc: Ananyev, Konstantin ; Richardson, Bruce
> ; thomas.monjalon at 6wind.com
> Subject: RE: [PATCH v2 3/5] testpmd: show throughput in port stats
> 
> 
> 
> > -Original Message-
> > From: Wang, Zhihong
> > Sent: Wednesday, June 01, 2016 4:28 AM
> > To: dev at dpdk.org
> > Cc: Ananyev, Konstantin; Richardson, Bruce; De Lara Guarch, Pablo;
> > thomas.monjalon at 6wind.com; Wang, Zhihong
> > Subject: [PATCH v2 3/5] testpmd: show throughput in port stats
> >
> > This patch adds throughput numbers (in the period since last use of this
> > command) in port statistics display for "show port stats (port_id|all)".
> >
> >
> > Signed-off-by: Zhihong Wang 
> > ---
> >  app/test-pmd/config.c | 20 
> >  1 file changed, 20 insertions(+)
> >
> > diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
> > index c611649..f487b87 100644
> > --- a/app/test-pmd/config.c
> > +++ b/app/test-pmd/config.c
> > @@ -92,6 +92,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >
> >  #include "testpmd.h"
> >
> > @@ -150,6 +151,10 @@ print_ethaddr(const char *name, struct ether_addr
> > *eth_addr)
> >  void
> >  nic_stats_display(portid_t port_id)
> >  {
> > +   static uint64_t sum_rx[RTE_MAX_ETHPORTS];
> > +   static uint64_t sum_tx[RTE_MAX_ETHPORTS];
> > +   static uint64_t cycles[RTE_MAX_ETHPORTS];
> > +   uint64_t pkt_rx, pkt_tx, cycle;
> 
> Could you rename some of these variables to something more specific?

Thanks for the suggestion! Will rename them.

> Like:
> pkt_rx -> diff_rx_pkts
> sum_rx -> prev_rx_pkts
> cycle -> diff_cycles
> cycles -> prev_cycles
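
Just for reference, a quick sketch with the suggested names (illustrative
only, not the final patch; the Tx counters follow the same pattern, and the
helper name and includes are mine):

/* Sketch only: per-port throughput since the previous call, with the
 * renamed counters; same logic as the hunk below. */
#include <inttypes.h>
#include <stdio.h>
#include <rte_cycles.h>
#include <rte_ethdev.h>

static void
show_port_pps(uint8_t port_id, const struct rte_eth_stats *stats)
{
	static uint64_t prev_rx_pkts[RTE_MAX_ETHPORTS];
	static uint64_t prev_tx_pkts[RTE_MAX_ETHPORTS];
	static uint64_t prev_cycles[RTE_MAX_ETHPORTS];
	uint64_t diff_rx_pkts, diff_tx_pkts, diff_cycles;

	diff_cycles = prev_cycles[port_id];
	prev_cycles[port_id] = rte_rdtsc();
	if (diff_cycles > 0)
		diff_cycles = prev_cycles[port_id] - diff_cycles;

	diff_rx_pkts = stats->ipackets - prev_rx_pkts[port_id];
	diff_tx_pkts = stats->opackets - prev_tx_pkts[port_id];
	prev_rx_pkts[port_id] = stats->ipackets;
	prev_tx_pkts[port_id] = stats->opackets;

	printf("  RX-pps: %12"PRIu64"\n  TX-pps: %12"PRIu64"\n",
	       diff_cycles > 0 ?
			diff_rx_pkts * rte_get_tsc_hz() / diff_cycles : 0,
	       diff_cycles > 0 ?
			diff_tx_pkts * rte_get_tsc_hz() / diff_cycles : 0);
}
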
> 
> 
> 
> > struct rte_eth_stats stats;
> > struct rte_port *port = &ports[port_id];
> > uint8_t i;
> > @@ -209,6 +214,21 @@ nic_stats_display(portid_t port_id)
> > }
> > }
> >
> > +   cycle = cycles[port_id];
> > +   cycles[port_id] = rte_rdtsc();
> > +   if (cycle > 0)
> > +   cycle = cycles[port_id] - cycle;
> > +
> > +   pkt_rx = stats.ipackets - sum_rx[port_id];
> > +   pkt_tx = stats.opackets - sum_tx[port_id];
> > +   sum_rx[port_id] = stats.ipackets;
> > +   sum_tx[port_id] = stats.opackets;
> > +   printf("\n  Throughput (since last show)\n");
> > +   printf("  RX-pps: %12"PRIu64"\n"
> > +   "  TX-pps: %12"PRIu64"\n",
> > +   cycle > 0 ? pkt_rx * rte_get_tsc_hz() / cycle : 0,
> > +   cycle > 0 ? pkt_tx * rte_get_tsc_hz() / cycle : 0);
> > +
> > printf("  %s%s\n",
> >nic_stats_border, nic_stats_border);
> >  }
> > --
> > 2.5.0



[dpdk-dev] [PATCH] doc: virtio pmd versions

2016-06-15 Thread Wang, Zhihong


> -Original Message-
> From: Mcnamara, John
> Sent: Thursday, June 9, 2016 8:56 PM
> To: Richardson, Bruce ; Wang, Zhihong
> ; dev at dpdk.org
> Cc: Wang, Zhihong 
> Subject: RE: [dpdk-dev] [PATCH] doc: virtio pmd versions
> 
> > -Original Message-
> > From: Richardson, Bruce
> > Sent: Thursday, June 9, 2016 1:53 PM
> > To: Mcnamara, John ; Wang, Zhihong
> > ; dev at dpdk.org
> > Cc: Wang, Zhihong 
> > Subject: RE: [dpdk-dev] [PATCH] doc: virtio pmd versions
> >
> > > -Original Message-
> > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Mcnamara, John
> > 
> > >
> > > > -Original Message-
> > > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Zhihong Wang
> > 
> > > > +
> > > > +Virtio PMD Versions
> > > > +---
> > > > +
> > > > +Virtio driver has 3 versions of rx functions and 2 versions of tx
> > > > functions.
> > >
> > > In some places RX/TX is used and in some rx/tx. I would suggest the
> > > uppercase versions throughout.
> > >
> >
> > In the commit logs, the only valid contractions allowed by the check-git-
> > log.sh script are Rx and Tx
> >
> > bad=$(echo "$headlines" | grep -E --color=always \
> > -e '\<(rx|tx|RX|TX)\>' \
> >  
> >
> > I would therefore suggest we follow the same rules for the docs for
> > consistency.
> 
> Hi,
> 
> I don't mind what it is once we have consistency, so Rx/Tx is fine. Zhihong,
> please note.

Thank you John and Bruce!
V2 has been sent, please take a look.

> 
> John
> 
> 



[dpdk-dev] [PATCH v2 5/5] testpmd: show topology at forwarding start

2016-06-15 Thread Wang, Zhihong


> -Original Message-
> From: De Lara Guarch, Pablo
> Sent: Tuesday, June 14, 2016 11:13 PM
> To: Wang, Zhihong ; dev at dpdk.org
> Cc: Ananyev, Konstantin ; Richardson, Bruce
> ; thomas.monjalon at 6wind.com
> Subject: RE: [PATCH v2 5/5] testpmd: show topology at forwarding start
> 
> 
> Hi Zhihong,
> 
> > -----Original Message-
> > From: Wang, Zhihong
> > Sent: Wednesday, June 01, 2016 4:28 AM
> > To: dev at dpdk.org
> > Cc: Ananyev, Konstantin; Richardson, Bruce; De Lara Guarch, Pablo;
> > thomas.monjalon at 6wind.com; Wang, Zhihong
> > Subject: [PATCH v2 5/5] testpmd: show topology at forwarding start
> >
> > This patch show topology at forwarding start.
> >
> > "show config fwd" also does this, but showing it directly can reduce the
> > possibility of misconfiguration.
> >
> >
> > Signed-off-by: Zhihong Wang 
> [...]
> 
> > diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
> > index 9b1d99c..b946034 100644
> > --- a/app/test-pmd/testpmd.c
> > +++ b/app/test-pmd/testpmd.c
> > @@ -1009,7 +1009,7 @@ start_packet_forwarding(int with_tx_first)
> > if(!no_flush_rx)
> > flush_fwd_rx_queues();
> >
> > -   fwd_config_setup();
> > +   fwd_config_setup_display();
> 
> Bernard has made a patch that separates the display and setup of the
> configuration,
> (http://dpdk.org/dev/patchwork/patch/13650/)
> so fwd_config_display() does not call fwd_config_setup() anymore.
> 
> Could you modify this patch, so you call fwd_config_setup() and
> fwd_config_display()?

Thanks for the info! I've updated this patch with a v3.
Could you please help review?


> 
> Sorry for the confusion,
> Pablo
> 
> > rxtx_config_display();
> >
> > for (i = 0; i < cur_fwd_config.nb_fwd_ports; i++) {



[dpdk-dev] [PATCH v3 4/5] testpmd: handle all rxqs in rss setup

2016-06-28 Thread Wang, Zhihong
Thanks Nelio and Pablo!

> -Original Message-
> From: Nélio Laranjeiro [mailto:nelio.laranjeiro at 6wind.com]
> Sent: Tuesday, June 28, 2016 4:34 PM
> To: De Lara Guarch, Pablo 
> Cc: Wang, Zhihong ; dev at dpdk.org; Ananyev,
> Konstantin ; Richardson, Bruce
> ; thomas.monjalon at 6wind.com
> Subject: Re: [dpdk-dev] [PATCH v3 4/5] testpmd: handle all rxqs in rss setup
> 
> Hi Pablo,
> 
> On Mon, Jun 27, 2016 at 10:36:38PM +, De Lara Guarch, Pablo wrote:
> > Hi Nelio,
> >
> > > -Original Message-
> > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Nélio Laranjeiro
> > > Sent: Monday, June 27, 2016 3:24 PM
> > > To: Wang, Zhihong
> > > Cc: dev at dpdk.org; Ananyev, Konstantin; Richardson, Bruce; De Lara 
> > > Guarch,
> > > Pablo; thomas.monjalon at 6wind.com
> > > Subject: Re: [dpdk-dev] [PATCH v3 4/5] testpmd: handle all rxqs in rss 
> > > setup
> > >
> > > On Tue, Jun 14, 2016 at 07:08:05PM -0400, Zhihong Wang wrote:
> > > > This patch removes constraints in rxq handling when multiqueue is 
> > > > enabled
> > > > to handle all the rxqs.
> > > >
> > > > Current testpmd forces a dedicated core for each rxq, some rxqs may be
> > > > ignored when core number is less than rxq number, and that causes
> > > confusion
> > > > and inconvenience.
> > > >
> > > > One example: One Red Hat engineer was doing multiqueue test, there're 2
> > > > ports in guest each with 4 queues, and testpmd was used as the 
> > > > forwarding
> > > > engine in guest, as usual he used 1 core for forwarding, as a results he
> > > > only saw traffic from port 0 queue 0 to port 1 queue 0, then a lot of
> > > > emails and quite some time are spent to root cause it, and of course 
> > > > it's
> > > > caused by this unreasonable testpmd behavior.
> > > >
> > > > Moreover, even if we understand this behavior, if we want to test the
> > > > above case, we still need 8 cores for a single guest to poll all the
> > > > rxqs, obviously this is too expensive.
> > > >
> > > > We met quite a lot cases like this, one recent example:
> > > > http://openvswitch.org/pipermail/dev/2016-June/072110.html
> > > >
> > > >
> > > > Signed-off-by: Zhihong Wang 
> > > > ---
> > > >  app/test-pmd/config.c | 8 +---
> > > >  1 file changed, 1 insertion(+), 7 deletions(-)
> > > >
> > > > diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
> > > > index ede7c78..4719a08 100644
> > > > --- a/app/test-pmd/config.c
> > > > +++ b/app/test-pmd/config.c
> > > > @@ -1199,19 +1199,13 @@ rss_fwd_config_setup(void)
> > > > cur_fwd_config.nb_fwd_ports = nb_fwd_ports;
> > > > cur_fwd_config.nb_fwd_streams =
> > > > (streamid_t) (nb_q * cur_fwd_config.nb_fwd_ports);
> > > > -   if (cur_fwd_config.nb_fwd_streams > 
> > > > cur_fwd_config.nb_fwd_lcores)
> > > > -   cur_fwd_config.nb_fwd_streams =
> > > > -   (streamid_t)cur_fwd_config.nb_fwd_lcores;
> > > > -   else
> > > > -   cur_fwd_config.nb_fwd_lcores =
> > > > -   (lcoreid_t)cur_fwd_config.nb_fwd_streams;
> > > >
> > > > /* reinitialize forwarding streams */
> > > > init_fwd_streams();
> > > >
> > > > setup_fwd_config_of_each_lcore(&cur_fwd_config);
> > > > rxp = 0; rxq = 0;
> > > > -   for (lc_id = 0; lc_id < cur_fwd_config.nb_fwd_lcores; lc_id++) {
> > > > +   for (lc_id = 0; lc_id < cur_fwd_config.nb_fwd_streams; lc_id++) 
> > > > {
> > > > struct fwd_stream *fs;
> > > >
> > > > fs = fwd_streams[lc_id];
> > > > --
> > > > 2.5.0
> > >
> > > Hi Zhihong,
> > >
> > > It seems this commits introduce a bug in pkt_burst_transmit(), this only
> > > occurs when the number of cores present in the coremask is greater than
> > > the number of queues i.e. coremask=0xffe --txq=4 --rxq=4.
> > >
> > >   Port 0 Link Up - speed 4 Mbps - full-duplex
> > >   Port 1 Link Up - speed 4 Mbps - full-duplex
> > >   Done
> > >   testpmd> start tx_first
> &g

[dpdk-dev] [PATCH v3 5/7] virtio: virtio vec rx

2015-10-22 Thread Wang, Zhihong


> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Huawei Xie
> Sent: Tuesday, October 20, 2015 11:30 PM
> To: dev at dpdk.org
> Subject: [dpdk-dev] [PATCH v3 5/7] virtio: virtio vec rx
> 
> With fixed avail ring, we don't need to get desc idx from avail ring.
> virtio driver only has to deal with desc ring.
> This patch uses vector instruction to accelerate processing desc ring.
> 
> Signed-off-by: Huawei Xie 
> ---
>  drivers/net/virtio/virtio_ethdev.h  |   2 +
>  drivers/net/virtio/virtio_rxtx.c|   3 +
>  drivers/net/virtio/virtio_rxtx.h|   2 +
>  drivers/net/virtio/virtio_rxtx_simple.c | 224
> 
>  drivers/net/virtio/virtqueue.h  |   1 +
>  5 files changed, 232 insertions(+)
> 
> diff --git a/drivers/net/virtio/virtio_ethdev.h 
> b/drivers/net/virtio/virtio_ethdev.h
> index 9026d42..d7797ab 100644
> --- a/drivers/net/virtio/virtio_ethdev.h
> +++ b/drivers/net/virtio/virtio_ethdev.h
> @@ -108,6 +108,8 @@ uint16_t virtio_recv_mergeable_pkts(void *rx_queue,
> struct rte_mbuf **rx_pkts,
>  uint16_t virtio_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
>   uint16_t nb_pkts);
> 
> +uint16_t virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
> + uint16_t nb_pkts);
> 
>  /*
>   * The VIRTIO_NET_F_GUEST_TSO[46] features permit the host to send us
> diff --git a/drivers/net/virtio/virtio_rxtx.c 
> b/drivers/net/virtio/virtio_rxtx.c
> index 5162ce6..947fc46 100644
> --- a/drivers/net/virtio/virtio_rxtx.c
> +++ b/drivers/net/virtio/virtio_rxtx.c
> @@ -432,6 +432,9 @@ virtio_dev_rx_queue_setup(struct rte_eth_dev *dev,
>   vq->mpool = mp;
> 
>   dev->data->rx_queues[queue_idx] = vq;
> +
> + virtio_rxq_vec_setup(vq);
> +
>   return 0;
>  }
> 
> diff --git a/drivers/net/virtio/virtio_rxtx.h 
> b/drivers/net/virtio/virtio_rxtx.h
> index 7d2d8fe..831e492 100644
> --- a/drivers/net/virtio/virtio_rxtx.h
> +++ b/drivers/net/virtio/virtio_rxtx.h
> @@ -33,5 +33,7 @@
> 
>  #define RTE_PMD_VIRTIO_RX_MAX_BURST 64
> 
> +int virtio_rxq_vec_setup(struct virtqueue *rxq);
> +
>  int virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
>   struct rte_mbuf *m);
> diff --git a/drivers/net/virtio/virtio_rxtx_simple.c
> b/drivers/net/virtio/virtio_rxtx_simple.c
> index cac5b9f..ef17562 100644
> --- a/drivers/net/virtio/virtio_rxtx_simple.c
> +++ b/drivers/net/virtio/virtio_rxtx_simple.c
> @@ -58,6 +58,10 @@
>  #include "virtqueue.h"
>  #include "virtio_rxtx.h"
> 
> +#define RTE_VIRTIO_VPMD_RX_BURST 32
> +#define RTE_VIRTIO_DESC_PER_LOOP 8
> +#define RTE_VIRTIO_VPMD_RX_REARM_THRESH
> RTE_VIRTIO_VPMD_RX_BURST
> +
>  int __attribute__((cold))
>  virtqueue_enqueue_recv_refill_simple(struct virtqueue *vq,
>   struct rte_mbuf *cookie)
> @@ -82,3 +86,223 @@ virtqueue_enqueue_recv_refill_simple(struct
> virtqueue *vq,
> 
>   return 0;
>  }
> +
> +static inline void
> +virtio_rxq_rearm_vec(struct virtqueue *rxvq)
> +{
> + int i;
> + uint16_t desc_idx;
> + struct rte_mbuf **sw_ring;
> + struct vring_desc *start_dp;
> + int ret;
> +
> + desc_idx = rxvq->vq_avail_idx & (rxvq->vq_nentries - 1);
> + sw_ring = &rxvq->sw_ring[desc_idx];
> + start_dp = &rxvq->vq_ring.desc[desc_idx];
> +
> + ret = rte_mempool_get_bulk(rxvq->mpool, (void **)sw_ring,
> + RTE_VIRTIO_VPMD_RX_REARM_THRESH);
> + if (unlikely(ret)) {
> + rte_eth_devices[rxvq->port_id].data->rx_mbuf_alloc_failed +=
> + RTE_VIRTIO_VPMD_RX_REARM_THRESH;
> + return;
> + }
> +
> + for (i = 0; i < RTE_VIRTIO_VPMD_RX_REARM_THRESH; i++) {
> + uintptr_t p;
> +
> + p = (uintptr_t)&sw_ring[i]->rearm_data;
> + *(uint64_t *)p = rxvq->mbuf_initializer;
> +
> + start_dp[i].addr =
> + (uint64_t)((uintptr_t)sw_ring[i]->buf_physaddr +
> + RTE_PKTMBUF_HEADROOM - sizeof(struct virtio_net_hdr));
> + start_dp[i].len = sw_ring[i]->buf_len -
> + RTE_PKTMBUF_HEADROOM + sizeof(struct virtio_net_hdr);
> + }
> +
> + rxvq->vq_avail_idx += RTE_VIRTIO_VPMD_RX_REARM_THRESH;
> + rxvq->vq_free_cnt -= RTE_VIRTIO_VPMD_RX_REARM_THRESH;
> + vq_update_avail_idx(rxvq);
> +}
> +
> +/* virtio vPMD receive routine, only accept(nb_pkts >=
> RTE_VIRTIO_DESC_PER_LOOP)
> + *
> + * This routine is for non-mergable RX, one desc for each guest buffer.
> + * This routine is based on the RX ring layout optimization. Each entry in 
> the
> + * avail ring points to the desc with the same index in the desc ring and 
> this
> + * will never be changed in the driver.
> + *
> + * - nb_pkts < RTE_VIRTIO_DESC_PER_LOOP, just return no packet
> + */
> +uint16_t
> +virtio_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
> + uint16_t nb_pkts)
> +{
> + struct virtqueue *rxvq = rx_queue;
> + uint16_t nb_used;
> + uint16_t desc_idx;
> + 

[dpdk-dev] [PATCH v5 5/7] virtio: virtio vec rx

2015-10-26 Thread Wang, Zhihong
> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Huawei Xie
> Sent: Sunday, October 25, 2015 11:35 PM
> To: dev at dpdk.org
> Subject: [dpdk-dev] [PATCH v5 5/7] virtio: virtio vec rx
> 
> With fixed avail ring, we don't need to get desc idx from avail ring.
> virtio driver only has to deal with desc ring.
> This patch uses vector instruction to accelerate processing desc ring.
> 
> Signed-off-by: Huawei Xie 

Acked-by: Wang, Zhihong 



[dpdk-dev] [PATCH v3 2/2] vhost: Add VHOST PMD

2015-11-12 Thread Wang, Zhihong
Hi Tetsuya,

In my test I created 2 vdevs using "--vdev
'eth_vhost0,iface=/tmp/sock0,queues=1' --vdev
'eth_vhost1,iface=/tmp/sock1,queues=1'", and the QEMU messages got handled in
the wrong order.
The reason is that 2 threads are created to handle messages from the 2 sockets,
but their fds are SHARED, so each thread is reading from both sockets.

This can lead to incorrect behavior; in my case VHOST_USER_SET_MEM_TABLE
sometimes got handled after VRING initialization, which led to
destroy_device().
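
To make the race easier to see, here is a much simplified sketch (plain POSIX
code, not the vhost-user implementation; the names are made up) of what happens
when two dispatcher threads poll the same set of connection fds: two
consecutive messages on one connection can be picked up by different threads
and processed concurrently, so their ordering is no longer guaranteed.

/* Simplified illustration only, not the vhost-user code. */
#include <poll.h>
#include <pthread.h>
#include <unistd.h>

#define NB_CONN 2

static struct pollfd conns[NB_CONN];	/* shared by BOTH threads: the problem */

static void *
dispatch_loop(void *arg)
{
	char msg[64];
	ssize_t len;
	int i;

	(void)arg;
	for (;;) {
		poll(conns, NB_CONN, -1);
		for (i = 0; i < NB_CONN; i++) {
			if (!(conns[i].revents & POLLIN))
				continue;
			/* Either thread may win this read, so message N and
			 * message N+1 of the same connection can be handled
			 * by different threads at the same time. */
			len = read(conns[i].fd, msg, sizeof(msg));
			if (len > 0) {
				/* handle_vhost_user_msg(msg, len); */
			}
		}
	}
	return NULL;
}

static void
start_dispatchers(void)
{
	pthread_t t1, t2;

	/* One thread per vdev, but both scan the shared conns[] array. */
	pthread_create(&t1, NULL, dispatch_loop, NULL);
	pthread_create(&t2, NULL, dispatch_loop, NULL);
}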

The detailed log is shown below: threads 69351 & 69352 are both reading fd 25.
Thanks to Yuanhan for helping with the debugging!


Thanks
Zhihong


-

>  debug: setting up new vq conn for fd: 23, tid: 69352
VHOST_CONFIG: new virtio connection is 25
VHOST_CONFIG: new device, handle is 0
>  debug: vserver_message_handler thread id: 69352, fd: 25
VHOST_CONFIG: read message VHOST_USER_SET_OWNER
>  debug: vserver_message_handler thread id: 69352, fd: 25
VHOST_CONFIG: read message VHOST_USER_GET_FEATURES
>  debug: vserver_message_handler thread id: 69352, fd: 25
VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
VHOST_CONFIG: vring call idx:0 file:26
>  debug: vserver_message_handler thread id: 69352, fd: 25
VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
VHOST_CONFIG: vring call idx:1 file:27
>  debug: vserver_message_handler thread id: 69351, fd: 25
VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
VHOST_CONFIG: vring call idx:0 file:28
>  debug: vserver_message_handler thread id: 69351, fd: 25
VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
VHOST_CONFIG: vring call idx:1 file:26
>  debug: vserver_message_handler thread id: 69351, fd: 25
VHOST_CONFIG: read message VHOST_USER_SET_FEATURES
>  debug: vserver_message_handler thread id: 69351, fd: 25
VHOST_CONFIG: read message VHOST_USER_SET_MEM_TABLE
>  debug: device_fh: 0: user_set_mem_table
VHOST_CONFIG: mapped region 0 fd:27 to 0x7ff6c000 sz:0xa off:0x0
VHOST_CONFIG: mapped region 1 fd:29 to 0x7ff68000 sz:0x4000 off:0xc
>  debug: vserver_message_handler thread id: 69351, fd: 25
VHOST_CONFIG: read message VHOST_USER_SET_VRING_NUM
>  debug: vserver_message_handler thread id: 69351, fd: 25
VHOST_CONFIG: read message VHOST_USER_SET_VRING_BASE
>  debug: vserver_message_handler thread id: 69351, fd: 25
VHOST_CONFIG: read message VHOST_USER_SET_VRING_ADDR
>  debug: vserver_message_handler thread id: 69351, fd: 25
VHOST_CONFIG: read message VHOST_USER_SET_VRING_KICK
VHOST_CONFIG: vring kick idx:0 file:30
>  debug: vserver_message_handler thread id: 69352, fd: 25
VHOST_CONFIG: virtio is not ready for processing.
>  debug: vserver_message_handler thread id: 69351, fd: 25
VHOST_CONFIG: read message VHOST_USER_SET_VRING_BASE
>  debug: vserver_message_handler thread id: 69351, fd: 25
VHOST_CONFIG: read message VHOST_USER_SET_VRING_ADDR
>  debug: vserver_message_handler thread id: 69351, fd: 25
VHOST_CONFIG: read message VHOST_USER_SET_VRING_KICK
VHOST_CONFIG: vring kick idx:1 file:31
VHOST_CONFIG: virtio is now ready for processing.
PMD: New connection established
VHOST_CONFIG: read message VHOST_USER_SET_VRING_NUM

-

> ...
> +
> +static void *vhost_driver_session(void *param __rte_unused)
> +{
> + static struct virtio_net_device_ops *vhost_ops;
> +
> + vhost_ops = rte_zmalloc(NULL, sizeof(*vhost_ops), 0);
> + if (vhost_ops == NULL)
> + rte_panic("Can't allocate memory\n");
> +
> + /* set vhost arguments */
> + vhost_ops->new_device = new_device;
> + vhost_ops->destroy_device = destroy_device;
> + if (rte_vhost_driver_pmd_callback_register(vhost_ops) < 0)
> + rte_panic("Can't register callbacks\n");
> +
> + /* start event handling */
> + rte_vhost_driver_session_start();
> +
> + rte_free(vhost_ops);
> + pthread_exit(0);
> +}
> +
> +static void vhost_driver_session_start(struct pmd_internal *internal)
> +{
> + int ret;
> +
> + ret = pthread_create(>session_th,
> + NULL, vhost_driver_session, NULL);
> + if (ret)
> + rte_panic("Can't create a thread\n");
> +}
> +
> ...



[dpdk-dev] [PATCH v3 2/2] vhost: Add VHOST PMD

2015-11-13 Thread Wang, Zhihong


> -Original Message-
> From: Tetsuya Mukawa [mailto:mukawa at igel.co.jp]
> Sent: Friday, November 13, 2015 11:10 AM
> To: Wang, Zhihong ; dev at dpdk.org; Liu, Yuanhan
> 
> Cc: ann.zhuangyanying at huawei.com
> Subject: Re: [dpdk-dev] [PATCH v3 2/2] vhost: Add VHOST PMD
> 
> On 2015/11/12 21:52, Wang, Zhihong wrote:
> > Hi Tetsuya,
> >
> > In my test I created 2 vdev using "--vdev
> 'eth_vhost0,iface=/tmp/sock0,queues=1' --vdev
> 'eth_vhost1,iface=/tmp/sock1,queues=1'", and the qemu message got handled
> in wrong order.
> > The reason is that: 2 threads are created to handle message from 2 sockets, 
> > but
> their fds are SHARED, so each thread are reading from both sockets.
> >
> > This can lead to incorrect behaviors, in my case sometimes the
> VHOST_USER_SET_MEM_TABLE got handled after VRING initialization and lead to
> destroy_device().
> >
> > Detailed log as shown below: thread 69351 & 69352 are both reading fd 25.
> Thanks Yuanhan for helping debugging!
> >
> 
> Hi Zhihong and Yuanhan,
> 
> Thank you so much for debugging the issue.
> I will fix vhost PMD not to create multiple message handling threads.
> 
> I am going to submit the PMD today.
> Could you please check it again using latest one?
> 

Looking forward to it!


> Tetsuya
> 
> 
> > Thanks
> > Zhihong
> >
> >
> > --
> > ---
> >
> > >  debug: setting up new vq conn for fd: 23, tid: 69352
> > VHOST_CONFIG: new virtio connection is 25
> > VHOST_CONFIG: new device, handle is 0
> > >  debug: vserver_message_handler thread id: 69352, fd: 25
> > VHOST_CONFIG: read message VHOST_USER_SET_OWNER
> > >  debug: vserver_message_handler thread id: 69352, fd: 25
> > VHOST_CONFIG: read message VHOST_USER_GET_FEATURES
> > >  debug: vserver_message_handler thread id: 69352, fd: 25
> > VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
> > VHOST_CONFIG: vring call idx:0 file:26
> > >  debug: vserver_message_handler thread id: 69352, fd: 25
> > VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
> > VHOST_CONFIG: vring call idx:1 file:27
> > >  debug: vserver_message_handler thread id: 69351, fd: 25
> > VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
> > VHOST_CONFIG: vring call idx:0 file:28
> > >  debug: vserver_message_handler thread id: 69351, fd: 25
> > VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
> > VHOST_CONFIG: vring call idx:1 file:26
> > >  debug: vserver_message_handler thread id: 69351, fd: 25
> > VHOST_CONFIG: read message VHOST_USER_SET_FEATURES
> > >  debug: vserver_message_handler thread id: 69351, fd: 25
> > VHOST_CONFIG: read message VHOST_USER_SET_MEM_TABLE
> > >  debug: device_fh: 0: user_set_mem_table
> > VHOST_CONFIG: mapped region 0 fd:27 to 0x7ff6c000 sz:0xa
> > off:0x0
> > VHOST_CONFIG: mapped region 1 fd:29 to 0x7ff68000 sz:0x4000
> > off:0xc
> > >  debug: vserver_message_handler thread id: 69351, fd: 25
> > VHOST_CONFIG: read message VHOST_USER_SET_VRING_NUM
> > >  debug: vserver_message_handler thread id: 69351, fd: 25
> > VHOST_CONFIG: read message VHOST_USER_SET_VRING_BASE
> > >  debug: vserver_message_handler thread id: 69351, fd: 25
> > VHOST_CONFIG: read message VHOST_USER_SET_VRING_ADDR
> > >  debug: vserver_message_handler thread id: 69351, fd: 25
> > VHOST_CONFIG: read message VHOST_USER_SET_VRING_KICK
> > VHOST_CONFIG: vring kick idx:0 file:30
> > >  debug: vserver_message_handler thread id: 69352, fd: 25
> > VHOST_CONFIG: virtio is not ready for processing.
> > >  debug: vserver_message_handler thread id: 69351, fd: 25
> > VHOST_CONFIG: read message VHOST_USER_SET_VRING_BASE
> > >  debug: vserver_message_handler thread id: 69351, fd: 25
> > VHOST_CONFIG: read message VHOST_USER_SET_VRING_ADDR
> > >  debug: vserver_message_handler thread id: 69351, fd: 25
> > VHOST_CONFIG: read message VHOST_USER_SET_VRING_KICK
> > VHOST_CONFIG: vring kick idx:1 file:31
> > VHOST_CONFIG: virtio is now ready for processing.
> > PMD: New connection established
> > VHOST_CONFIG: read message VHOST_USER_SET_VRING_NUM
> >
> > --
> > ---
> >
> >> ...
> >> +
> >> +static void *vhost_driver_session(void *param __rte

[dpdk-dev] [PATCH v4 2/2] vhost: Add VHOST PMD

2015-11-16 Thread Wang, Zhihong
A quick glimpse shows the bug is gone now :)
Will do more testing later on.

> -Original Message-
> From: Tetsuya Mukawa [mailto:mukawa at igel.co.jp]
> Sent: Friday, November 13, 2015 1:21 PM
> To: dev at dpdk.org; Wang, Zhihong ; Liu, Yuanhan
> 
> Cc: Loftus, Ciara ; pmatilai at redhat.com;
> ann.zhuangyanying at huawei.com; Richardson, Bruce
> ; Xie, Huawei ;
> thomas.monjalon at 6wind.com; stephen at networkplumber.org;
> rich.lane at bigswitch.com; Tetsuya Mukawa 
> Subject: [PATCH v4 2/2] vhost: Add VHOST PMD
> 
> The patch introduces a new PMD. This PMD is implemented as thin wrapper
> of librte_vhost. It means librte_vhost is also needed to compile the PMD.
> The vhost messages will be handled only when a port is started. So start
> a port first, then invoke QEMU.
> 
> The PMD has 2 parameters.
>  - iface:  The parameter is used to specify a path to connect to a
>virtio-net device.
>  - queues: The parameter is used to specify the number of the queues
>virtio-net device has.
>(Default: 1)
> 
> Here is an example.
> $ ./testpmd -c f -n 4 --vdev 'eth_vhost0,iface=/tmp/sock0,queues=1' -- -i
> 
> To connect above testpmd, here is qemu command example.
> 
> $ qemu-system-x86_64 \
> 
> -chardev socket,id=chr0,path=/tmp/sock0 \
> -netdev vhost-user,id=net0,chardev=chr0,vhostforce,queues=1 \
> -device virtio-net-pci,netdev=net0
> 
> Signed-off-by: Tetsuya Mukawa 
> ---



[dpdk-dev] [PATCH] vhost: avoid buffer overflow in update_secure_len

2015-11-18 Thread Wang, Zhihong


> -Original Message-
> From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> Sent: Wednesday, November 18, 2015 10:57 AM
> To: Rich Lane 
> Cc: dev at dpdk.org; Xie, Huawei ; Wang, Zhihong
> ; Richardson, Bruce 
> Subject: Re: [PATCH] vhost: avoid buffer overflow in update_secure_len
> 
> On Tue, Nov 17, 2015 at 08:39:30AM -0800, Rich Lane wrote:
> >
> > I don't think that adding a SIGINT handler is the right solution,
> > though. The guest app could be killed with another signal (SIGKILL).
> 
> Good point.
> 
> > Worse, a malicious or
> > buggy guest could write to just that field. vhost should not crash no
> > matter what the guest writes into the virtqueues.
> 
> Yeah, I agree with you: though we could fix this issue in the source side, we 
> also
> should do some defend here.
> 

Exactly, DPDK should be able to take care of both ends:
# Provide an interface for resource cleanup
# Be prepared in case the app doesn't shut down properly

> How about following patch then?
> 
> Note that the vec_id overflow check should be done before referencing it, but
> not after. Hence I moved it ahead.
> 
>   --yliu
> 
> ---
> diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c 
> index
> 9322ce6..08f5942 100644
> --- a/lib/librte_vhost/vhost_rxtx.c
> +++ b/lib/librte_vhost/vhost_rxtx.c
> @@ -132,6 +132,8 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
> 
>   /* Get descriptor from available ring */
>   desc = >desc[head[packet_success]];
> + if (desc->len == 0)
> + break;
> 
>   buff = pkts[packet_success];
> 
> @@ -153,6 +155,8 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
>   /* Buffer address translation. */
>   buff_addr = gpa_to_vva(dev, desc->addr);
>   } else {
> + if (desc->len < vq->vhost_hlen)
> + break;
>   vb_offset += vq->vhost_hlen;
>   hdr = 1;
>   }
> @@ -446,6 +450,9 @@ update_secure_len(struct vhost_virtqueue *vq, uint32_t
> id,
>   uint32_t vec_id = *vec_idx;
> 
>   do {
> + if (vec_id >= BUF_VECTOR_MAX)
> + break;
> +
>   next_desc = 0;
>   len += vq->desc[idx].len;
>   vq->buf_vec[vec_id].buf_addr = vq->desc[idx].addr; @@ -519,6
> +526,8 @@ virtio_dev_merge_rx(struct virtio_net *dev, uint16_t queue_id,
>   goto merge_rx_exit;
>   } else {
>   update_secure_len(vq, res_cur_idx, 
> _len,
> _idx);
> + if (secure_len == 0)
> + goto merge_rx_exit;
>   res_cur_idx++;
>   }
>   } while (pkt_len > secure_len);
> @@ -631,6 +640,8 @@ rte_vhost_dequeue_burst(struct virtio_net *dev,
> uint16_t queue_id,
>   uint8_t alloc_err = 0;
> 
>   desc = >desc[head[entry_success]];
> + if (desc->len == 0)
> + break;
> 
>   /* Discard first buffer as it is the virtio header */
>   if (desc->flags & VRING_DESC_F_NEXT) { @@ -638,6 +649,8 @@
> rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
>   vb_offset = 0;
>   vb_avail = desc->len;
>   } else {
> + if (desc->len < vq->vhost_hlen)
> + break;
>   vb_offset = vq->vhost_hlen;
>   vb_avail = desc->len - vb_offset;
>   }


[dpdk-dev] [RFC PATCH 2/2] lib/librte_eal: Remove unnecessary hugepage zero-filling

2015-11-18 Thread Wang, Zhihong

> -Original Message-
> From: Mcnamara, John
> Sent: Wednesday, November 18, 2015 6:40 PM
> To: Wang, Zhihong ; dev at dpdk.org
> Subject: RE: [dpdk-dev] [RFC PATCH 2/2] lib/librte_eal: Remove unnecessary
> hugepage zero-filling
> 
> 
> 
> > -Original Message-
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Zhihong Wang
> > Sent: Wednesday, November 18, 2015 3:27 AM
> > To: dev at dpdk.org
> > Subject: [dpdk-dev] [RFC PATCH 2/2] lib/librte_eal: Remove unnecessary
> > hugepage zero-filling
> >
> > The kernel fills new allocated (huge) pages with zeros.
> > DPDK just has to touch the pages to trigger the allocation.
> >
> > ...
> > if (orig) {
> > hugepg_tbl[i].orig_va = virtaddr;
> > -   memset(virtaddr, 0, hugepage_sz);
> > +   memset(virtaddr, 0, 8);
> > }
> 
> Probably worth adding a one or two line comment here to avoid someone
> thinking that it is a bug at some later stage. The text in the commit message
> above is suitable.
> 

Good suggestion! Will add it :)

> John.
> --



[dpdk-dev] [RFC PATCH 2/2] lib/librte_eal: Remove unnecessary hugepage zero-filling

2015-11-19 Thread Wang, Zhihong


> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Stephen Hemminger
> Sent: Thursday, November 19, 2015 3:09 AM
> To: Richardson, Bruce 
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [RFC PATCH 2/2] lib/librte_eal: Remove unnecessary
> hugepage zero-filling
> 
> On Wed, 18 Nov 2015 16:13:32 +
> "Richardson, Bruce"  wrote:
> 
> >
> >
> > > -Original Message-
> > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Stephen
> > > Hemminger
> > > Sent: Wednesday, November 18, 2015 4:00 PM
> > > To: Xie, Huawei 
> > > Cc: dev at dpdk.org
> > > Subject: Re: [dpdk-dev] [RFC PATCH 2/2] lib/librte_eal: Remove
> > > unnecessary hugepage zero-filling
> > >
> > > On Wed, 18 Nov 2015 12:07:54 +
> > > "Xie, Huawei"  wrote:
> > >
> > > > >>> The kernel fills new allocated (huge) pages with zeros.
> > > > >>> DPDK just has to touch the pages to trigger the allocation.
> > > > I think we shouldn't reply on the assumption that kernel has
> > > > zeroed the memory. Kernel zeroes the memory mostly to avoid
> > > > information leakage.It could also achieve this by setting each bit to 1.
> > > > What we indeed need to check is later DPDK initialization code
> > > > doesn't assume the memory has been zeroed. Otherwise zero only
> > > > that part of the memory. Does this makes sense?
> > >
> > > If all new pages are zero, why does DPDK have to pre-touch the pages
> > > at all?
> >
> > The pages won't actually be mapped into the processes address space until
> accessed.
> >
> > /Bruce
> 
> Isn't that what mmap MAP_POPULATE flag (not currently used) will do.

Yes, MAP_POPULATE does literally the same thing.
This flag has been available since Linux 2.5.46 according to the Linux man page;
I guess that's why DPDK faults in the page tables manually in the first place. :)

I think we can use this flag since it makes the code clearer.
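
For reference, a minimal standalone sketch (plain POSIX code; the size and the
helper name are illustrative) of the two equivalent ways to get the kernel to
back the mapping right away:

/* Sketch only: either let MAP_POPULATE pre-fault the hugepage, or touch it
 * once after mmap(); both end up with a resident, zero-filled page. */
#define _GNU_SOURCE		/* for MAP_POPULATE on older glibc */
#include <string.h>
#include <sys/mman.h>

#define HUGEPAGE_SZ (2UL * 1024 * 1024)

static void *
map_hugepage(int fd, int use_populate)
{
	int flags = MAP_SHARED | (use_populate ? MAP_POPULATE : 0);
	void *va = mmap(NULL, HUGEPAGE_SZ, PROT_READ | PROT_WRITE,
			flags, fd, 0);

	if (va == MAP_FAILED)
		return NULL;
	if (!use_populate)
		/* Touching the first bytes is enough to fault the page in;
		 * the kernel hands it back zero-filled either way. */
		memset(va, 0, 8);
	return va;
}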

/Zhihong



[dpdk-dev] [RFC PATCH 2/2] lib/librte_eal: Remove unnecessary hugepage zero-filling

2015-11-19 Thread Wang, Zhihong


> -Original Message-
> From: Xie, Huawei
> Sent: Wednesday, November 18, 2015 8:08 PM
> To: Wang, Zhihong ; Mcnamara, John
> ; dev at dpdk.org
> Subject: Re: [dpdk-dev] [RFC PATCH 2/2] lib/librte_eal: Remove unnecessary
> hugepage zero-filling
> 
> On 11/18/2015 6:45 PM, Wang, Zhihong wrote:
> >> -Original Message-
> >> From: Mcnamara, John
> >> Sent: Wednesday, November 18, 2015 6:40 PM
> >> To: Wang, Zhihong ; dev at dpdk.org
> >> Subject: RE: [dpdk-dev] [RFC PATCH 2/2] lib/librte_eal: Remove
> >> unnecessary hugepage zero-filling
> >>
> >>
> >>
> >>> -Original Message-
> >>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Zhihong Wang
> >>> Sent: Wednesday, November 18, 2015 3:27 AM
> >>> To: dev at dpdk.org
> >>> Subject: [dpdk-dev] [RFC PATCH 2/2] lib/librte_eal: Remove
> >>> unnecessary hugepage zero-filling
> >>>
> >>> The kernel fills new allocated (huge) pages with zeros.
> >>> DPDK just has to touch the pages to trigger the allocation.
> I think we shouldn't reply on the assumption that kernel has zeroed the 
> memory.

I understand the concern.
In my opinion an application shouldn't assume malloc'd memory is zero-filled,
so it should be okay for DPDK even if the kernel doesn't zero the pages at all.

I agree that we should check whether any code accidentally makes that assumption.
Currently there's rte_pktmbuf_init() for packet mbuf initialization.

/Zhihong


> Kernel zeroes the memory mostly to avoid information leakage.It could also
> achieve this by setting each bit to 1.
> What we indeed need to check is later DPDK initialization code doesn't assume
> the memory has been zeroed. Otherwise zero only that part of the memory.
> Does this makes sense?
> 
> >>> ...
> >>>   if (orig) {
> >>>   hugepg_tbl[i].orig_va = virtaddr;
> >>> - memset(virtaddr, 0, hugepage_sz);
> >>> + memset(virtaddr, 0, 8);
> >>>   }
> >> Probably worth adding a one or two line comment here to avoid someone
> >> thinking that it is a bug at some later stage. The text in the commit
> >> message above is suitable.
> >>
> > Good suggestion! Will add it :)
> >
> >> John.
> >> --
> >



[dpdk-dev] [RFC PATCH 2/2] lib/librte_eal: Remove unnecessary hugepage zero-filling

2015-11-19 Thread Wang, Zhihong

> -Original Message-
> From: Xie, Huawei
> Sent: Thursday, November 19, 2015 2:05 PM
> To: Wang, Zhihong ; Stephen Hemminger
> ; Richardson, Bruce
> 
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [RFC PATCH 2/2] lib/librte_eal: Remove unnecessary
> hugepage zero-filling
> 
> On 11/19/2015 10:16 AM, Wang, Zhihong wrote:
> >
> >> -Original Message-
> >> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Stephen
> >> Hemminger
> >> Sent: Thursday, November 19, 2015 3:09 AM
> >> To: Richardson, Bruce 
> >> Cc: dev at dpdk.org
> >> Subject: Re: [dpdk-dev] [RFC PATCH 2/2] lib/librte_eal: Remove
> >> unnecessary hugepage zero-filling
> >>
> >> On Wed, 18 Nov 2015 16:13:32 +
> >> "Richardson, Bruce"  wrote:
> >>
> >>>
> >>>> -Original Message-
> >>>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Stephen
> >>>> Hemminger
> >>>> Sent: Wednesday, November 18, 2015 4:00 PM
> >>>> To: Xie, Huawei 
> >>>> Cc: dev at dpdk.org
> >>>> Subject: Re: [dpdk-dev] [RFC PATCH 2/2] lib/librte_eal: Remove
> >>>> unnecessary hugepage zero-filling
> >>>>
> >>>> On Wed, 18 Nov 2015 12:07:54 +
> >>>> "Xie, Huawei"  wrote:
> >>>>
> >>>>>>>> The kernel fills new allocated (huge) pages with zeros.
> >>>>>>>> DPDK just has to touch the pages to trigger the allocation.
> >>>>> I think we shouldn't reply on the assumption that kernel has
> >>>>> zeroed the memory. Kernel zeroes the memory mostly to avoid
> >>>>> information leakage.It could also achieve this by setting each bit to 1.
> >>>>> What we indeed need to check is later DPDK initialization code
> >>>>> doesn't assume the memory has been zeroed. Otherwise zero only
> >>>>> that part of the memory. Does this makes sense?
> >>>> If all new pages are zero, why does DPDK have to pre-touch the
> >>>> pages at all?
> >>> The pages won't actually be mapped into the processes address space
> >>> until
> >> accessed.
> >>> /Bruce
> >> Isn't that what mmap MAP_POPULATE flag (not currently used) will do.
> > Yes, the MAP_POPULATE does literally the same thing.
> > This flag is implemented since Linux 2.5.46 according to Linux man
> > page, guess that's why DPDK fault the page tables manually in the
> > first place. :)
> >
> > I think we can use this flag since it makes the code clearer.
> The manual says MAP_POPULATE is only supported for private mappings since
> Linux 2.6.23.

I've checked this before and MAP_SHARED | MAP_POPULATE worked correctly together.
Is there any implicit complication here?


> >
> > /Zhihong
> >
> >



[dpdk-dev] [PATCH RFC v2 1/2] lib/librte_eal: Reduce timer initialization time

2015-11-23 Thread Wang, Zhihong


> -Original Message-
> From: Stephen Hemminger [mailto:stephen at networkplumber.org]
> Sent: Saturday, November 21, 2015 12:32 AM
> To: Wang, Zhihong 
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH RFC v2 1/2] lib/librte_eal: Reduce timer
> initialization time
> 
> On Thu, 19 Nov 2015 20:53:47 -0500
> Zhihong Wang  wrote:
> 
> > +   struct timespec sleeptime = {.tv_nsec = 1E8 }; /* 1/10 second */
> 
> Why not NSEC_PER_SEC / 10

Yes, it looks better using the macro :)


[dpdk-dev] [PATCH RFC v2 2/2] lib/librte_eal: Remove unnecessary hugepage zero-filling

2015-11-23 Thread Wang, Zhihong


> -Original Message-
> From: Stephen Hemminger [mailto:stephen at networkplumber.org]
> Sent: Saturday, November 21, 2015 12:33 AM
> To: Wang, Zhihong 
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH RFC v2 2/2] lib/librte_eal: Remove unnecessary
> hugepage zero-filling
> 
> On Thu, 19 Nov 2015 20:53:48 -0500
> Zhihong Wang  wrote:
> 
> > The kernel fills new allocated (huge) pages with zeros.
> > DPDK just has to touch the pages to trigger the allocation.
> >
> > Signed-off-by: Zhihong Wang 
> 
> The code is correct, and optimized.
> Don't you need to update the commit message since you no longer have to touch
> the pages.

Thanks for pointing this out!


[dpdk-dev] [PATCH 2/2] lib/librte_eal: Remove unnecessary hugepage zero-filling

2015-11-25 Thread Wang, Zhihong


> -Original Message-
> From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> Sent: Wednesday, November 25, 2015 7:04 AM
> To: Stephen Hemminger 
> Cc: Wang, Zhihong ; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 2/2] lib/librte_eal: Remove unnecessary
> hugepage zero-filling
> 
> 2015-11-24 14:44, Stephen Hemminger:
> > On Tue, 24 Nov 2015 22:13:28 +0100
> > Thomas Monjalon  wrote:
> >
> > > 2015-11-22 18:28, Stephen Hemminger:
> > > > On Sun, 22 Nov 2015 14:13:35 -0500 Zhihong Wang
> > > >  wrote:
> > > >
> > > > > The kernel fills new allocated (huge) pages with zeros.
> > > > > DPDK just has to populate page tables to trigger the allocation.
> > > > >
> > > > > Signed-off-by: Zhihong Wang 
> > > >
> > > > Nice, especially on slow machines or with large memory.
> > > >
> > > > Acked-by: Stephen Hemminger 
> > >
> > > Yes very nice.
> > > I think it's too late to integrate this change which can have some
> > > unpredictable side effects.
> > > Do you agree to wait for 2.3?
> >
> > What side effects? Either it is zero or it is not.
> > Only some broken architecture would have an issue.
> 
> I mean it changes the memory allocator behaviour. It's not something we want 
> to
> discover a new bug just before the release.
> This kind of important change must be integrated at the beginning of the 
> release
> cycle.
> I'm asking for opinions because it would be really nice to have.

Literally this patch doesn't change anything; it just keeps DPDK from
zero-filling pages that the kernel has already zero-filled.
It would be nice to have this patch in DPDK 2.2 since it can reduce the startup
time by nearly half for hugepage cases.
But I understand a longer merge/test window makes it safer for a release.
It makes sense either way.



[dpdk-dev] [dpdk-dev,v2] Clean up rte_memcpy.h file

2016-02-29 Thread Wang, Zhihong


> -Original Message-
> From: Ravi Kerur [mailto:rkerur at gmail.com]
> Sent: Saturday, February 27, 2016 10:06 PM
> To: Wang, Zhihong 
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev,v2] Clean up rte_memcpy.h file
> 
> 
> 
> On Wed, Jan 27, 2016 at 8:18 PM, Zhihong Wang 
> wrote:
> > Remove unnecessary type casting in functions.
> >
> > Tested on Ubuntu (14.04 x86_64) with "make test".
> > "make test" results match the results with baseline.
> > "Memcpy perf" results match the results with baseline.
> >
> > Signed-off-by: Ravi Kerur 
> > Acked-by: Stephen Hemminger 
> >
> > ---
> > .../common/include/arch/x86/rte_memcpy.h? ? ? ? ? ?| 340 +++---
> ---
> >? 1 file changed, 175 insertions(+), 165 deletions(-)
> >
> > diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> > index 6a57426..839d4ec 100644
> > --- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> > +++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> 
> [...]
> 
> >  /**
> > @@ -150,13 +150,16 @@ rte_mov64blocks(uint8_t *dst, const uint8_t *src,
> size_t n)
> >        __m256i ymm0, ymm1;
> >
> >        while (n >= 64) {
> > -              ymm0 = _mm256_loadu_si256((const __m256i *)((const uint8_t
> *)src + 0 * 32));
> > +
> > +              ymm0 = _mm256_loadu_si256((const __m256i *)(src + 0 * 32));
> > +              ymm1 = _mm256_loadu_si256((const __m256i *)(src + 1 * 32));
> > +
> > +              _mm256_storeu_si256((__m256i *)(dst + 0 * 32), ymm0);
> > +              _mm256_storeu_si256((__m256i *)(dst + 1 * 32), ymm1);
> > +
> 
> Any particular reason to change the order of the statements here? :)
> Overall this patch looks good.
> 
> I checked the code changes, initial code had moving addresses (src and dst) 
> and
> decrement counter scattered between store and load instructions. I changed it 
> to
> loads, followed by stores and handle address/counters increment/decrement
> without changing functionality.
> 

It's definitely okay to do this. Whether we change the order or not won't
affect the generated code at all, since gcc reorders these statements when
optimizing; it's C code we're writing, after all.

But personally I prefer to keep the original order, as a reminder that
whatever is needed later should be computed as early as possible, and that
instructions bound for different CPU ports should be interleaved. :)

Could you please rebase this patch, since there have been some changes
already?

> >                n -= 64;
> > -              ymm1 = _mm256_loadu_si256((const __m256i *)((const uint8_t
> *)src + 1 * 32));
> > -              src = (const uint8_t *)src + 64;
> > -              _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32),
> ymm0);
> > -              _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32),
> ymm1);
> > -              dst = (uint8_t *)dst + 64;
> > +              src = src + 64;
> > +              dst = dst + 64;
> >        }
> >  }
> >



[dpdk-dev] [PATCH v5 2/3] examples/l2fwd: Handle SIGINT and SIGTERM in l2fwd

2016-01-04 Thread Wang, Zhihong


> -Original Message-
> From: Stephen Hemminger [mailto:stephen at networkplumber.org]
> Sent: Friday, January 1, 2016 1:02 AM
> To: Wang, Zhihong 
> Cc: dev at dpdk.org; Ananyev, Konstantin ; 
> Qiu,
> Michael 
> Subject: Re: [PATCH v5 2/3] examples/l2fwd: Handle SIGINT and SIGTERM in
> l2fwd
> 
> On Wed, 30 Dec 2015 16:59:50 -0500
> Zhihong Wang  wrote:
> 
> > +static void
> > +signal_handler(int signum)
> > +{
> > +   if (signum == SIGINT || signum == SIGTERM) {
> > +   printf("\n\nSignal %d received, preparing to exit...\n",
> > +   signum);
> > +   force_quit = true;
> 
> Actually, the if () is redundant since you only registered SIGINT, and SIGTERM
> those are the only signals you could possibly receive.

Yes, it's kind of an obsession I guess; I just want to make the code crystal
clear :)

> 
> Acked-by: Stephen Hemminger 


[dpdk-dev] [PATCH 0/4] Optimize memcpy for AVX512 platforms

2016-01-15 Thread Wang, Zhihong


> -Original Message-
> From: Stephen Hemminger [mailto:stephen at networkplumber.org]
> Sent: Friday, January 15, 2016 12:49 AM
> To: Wang, Zhihong 
> Cc: dev at dpdk.org; Ananyev, Konstantin ;
> Richardson, Bruce ; Xie, Huawei
> 
> Subject: Re: [PATCH 0/4] Optimize memcpy for AVX512 platforms
> 
> On Thu, 14 Jan 2016 01:13:18 -0500
> Zhihong Wang  wrote:
> 
> > This patch set optimizes DPDK memcpy for AVX512 platforms, to make full
> > utilization of hardware resources and deliver high performance.
> >
> > In current DPDK, memcpy holds a large proportion of execution time in
> > libs like Vhost, especially for large packets, and this patch can bring
> > considerable benefits.
> >
> > The implementation is based on the current DPDK memcpy framework, some
> > background introduction can be found in these threads:
> > http://dpdk.org/ml/archives/dev/2014-November/008158.html
> > http://dpdk.org/ml/archives/dev/2015-January/011800.html
> >
> > Code changes are:
> >
> >   1. Read CPUID to check if AVX512 is supported by CPU
> >
> >   2. Predefine AVX512 macro if AVX512 is enabled by compiler
> >
> >   3. Implement AVX512 memcpy and choose the right implementation based
> on
> >  predefined macros
> >
> >   4. Decide alignment unit for memcpy perf test based on predefined macros
> >
> > Zhihong Wang (4):
> >   lib/librte_eal: Identify AVX512 CPU flag
> >   mk: Predefine AVX512 macro for compiler
> >   lib/librte_eal: Optimize memcpy for AVX512 platforms
> >   app/test: Adjust alignment unit for memcpy perf test
> >
> >  app/test/test_memcpy_perf.c|   6 +
> >  .../common/include/arch/x86/rte_cpuflags.h |   2 +
> >  .../common/include/arch/x86/rte_memcpy.h   | 247
> -
> >  mk/rte.cpuflags.mk |   4 +
> >  4 files changed, 255 insertions(+), 4 deletions(-)
> >
> 
> This really looks like code that could benefit from Gcc
> function multiversioning. The current cpuflags model is useless/flawed
> in real product deployment


I've tried gcc function multi-versioning with a simple add() function that
returns a + b, called in a loop millions of times. It turned out this
mechanism adds 17% extra execution time, which is a lot of overhead overall.
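
For reference, a minimal sketch of that kind of micro-test (hypothetical
code, not the original test; it uses the target_clones form of function
multi-versioning, which newer GCC also accepts for plain C):

#include <stdio.h>

/* Two clones are built and one is selected once at load time through an
 * ifunc resolver; the indirect call keeps this tiny function from being
 * inlined into the loop below. */
__attribute__((target_clones("avx2", "default")))
int add(int a, int b)
{
	return a + b;
}

int main(void)
{
	long long i, sum = 0;

	for (i = 0; i < 100000000LL; i++)
		sum += add((int)i, 1);

	printf("%lld\n", sum);
	return 0;
}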

To quote the gcc wiki: "GCC takes care of doing the dispatching to call
the right version at runtime". So it loses inlining and adds extra
dispatching overhead.

Also this mechanism works only for C++, right?

I think using predefined macros at compile time is more efficient and
suits DPDK better.

Could you please give an example of when the current CPU flags model
stops working, so I can fix it?



[dpdk-dev] [PATCH v2 0/5] Optimize memcpy for AVX512 platforms

2016-01-19 Thread Wang, Zhihong
> -Original Message-
> From: Stephen Hemminger [mailto:stephen at networkplumber.org]
> Sent: Tuesday, January 19, 2016 4:06 AM
> To: Wang, Zhihong 
> Cc: dev at dpdk.org; Ananyev, Konstantin ;
> Richardson, Bruce ; Xie, Huawei
> 
> Subject: Re: [PATCH v2 0/5] Optimize memcpy for AVX512 platforms
> 
> On Sun, 17 Jan 2016 22:05:09 -0500
> Zhihong Wang  wrote:
> 
> > This patch set optimizes DPDK memcpy for AVX512 platforms, to make full
> > utilization of hardware resources and deliver high performance.
> >
> > In current DPDK, memcpy holds a large proportion of execution time in
> > libs like Vhost, especially for large packets, and this patch can bring
> > considerable benefits.
> >
> > The implementation is based on the current DPDK memcpy framework, some
> > background introduction can be found in these threads:
> > http://dpdk.org/ml/archives/dev/2014-November/008158.html
> > http://dpdk.org/ml/archives/dev/2015-January/011800.html
> >
> > Code changes are:
> >
> >   1. Read CPUID to check if AVX512 is supported by CPU
> >
> >   2. Predefine AVX512 macro if AVX512 is enabled by compiler
> >
> >   3. Implement AVX512 memcpy and choose the right implementation based
> on
> >  predefined macros
> >
> >   4. Decide alignment unit for memcpy perf test based on predefined macros
> 
> Cool, I like it. How much impact does this have on VHOST?

The impact is significant, especially for enqueue, because vhost actually
spends a lot of time doing memcpy (detailed numbers might not be appropriate
here due to policy :-), so I can only describe how I test it). Simply measure
the 1024B RX/TX time cost, compare it with 64B's, and you'll get a rough
sense of it.

My test cases include NIC2VM2NIC and VM2VM scenarios, which are the main
use cases currently, and use both throughput and RX/TX cycles for evaluation.
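
For example, a hypothetical sketch of the kind of per-burst measurement I
mean (not testpmd code; names are made up): run it once with 64B traffic and
once with 1024B traffic, then compare cycles per packet.

#include <rte_cycles.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Average RX cost in cycles per packet on one queue. */
static double
measure_rx_cycles(uint8_t port_id, uint16_t queue_id, uint64_t iterations)
{
	struct rte_mbuf *pkts[BURST_SIZE];
	uint64_t cycles = 0, nb_pkts = 0, i;
	uint16_t j, nb_rx;

	for (i = 0; i < iterations; i++) {
		uint64_t start = rte_rdtsc();

		nb_rx = rte_eth_rx_burst(port_id, queue_id, pkts, BURST_SIZE);
		cycles += rte_rdtsc() - start;
		nb_pkts += nb_rx;

		for (j = 0; j < nb_rx; j++)
			rte_pktmbuf_free(pkts[j]);	/* drop; we only measure RX */
	}

	return nb_pkts ? (double)cycles / nb_pkts : 0.0;
}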



[dpdk-dev] [PATCH v2 0/5] Optimize memcpy for AVX512 platforms

2016-01-28 Thread Wang, Zhihong


> -Original Message-
> From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> Sent: Wednesday, January 27, 2016 11:24 PM
> To: Wang, Zhihong 
> Cc: dev at dpdk.org; Ravi Kerur 
> Subject: Re: [dpdk-dev] [PATCH v2 0/5] Optimize memcpy for AVX512 platforms
> 
> 2016-01-17 22:05, Zhihong Wang:
> > This patch set optimizes DPDK memcpy for AVX512 platforms, to make full
> > utilization of hardware resources and deliver high performance.
> 
> On a related note, your expertise would be very valuable to review
> these patches please:
> (memcpy) http://dpdk.org/dev/patchwork/patch/4396/
> (memcmp) http://dpdk.org/dev/patchwork/patch/4788/

Will do, thanks.

> 
> Thanks


[dpdk-dev] [PATCH] lib/librte_eal: Fix compile issue with gcc 5.3.1

2016-01-28 Thread Wang, Zhihong
> Subject: [PATCH] lib/librte_eal: Fix compile issue with gcc 5.3.1
> 
> In fedora 22 with GCC version 5.3.1, when compile,
> will result an error:
> 
> include/rte_memcpy.h:309:7: error: "RTE_MACHINE_CPUFLAG_AVX2"
> is not defined [-Werror=undef]
> #elif RTE_MACHINE_CPUFLAG_AVX2
> 
> Fixes: 9484092baad3 ("eal/x86: optimize memcpy for AVX512 platforms")
> 
> Signed-off-by: Michael Qiu 
> ---
>  app/test/test_memcpy_perf.c | 2 +-
>  lib/librte_eal/common/include/arch/x86/rte_memcpy.h | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)


There's an issue in the original code.

#elif expects an expression that evaluates to true or false:
#elif <expression>

But what was intended here is a test for whether the identifier is defined:
#elif defined <identifier>
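
To illustrate with a simplified sketch (not the exact rte_memcpy.h
contents):

/* RTE_MACHINE_CPUFLAG_AVX2 is either defined or not defined at all, so
 * under -Werror=undef it must be tested with defined(), not evaluated. */
#ifdef RTE_MACHINE_CPUFLAG_AVX512F
/* AVX512 implementation */
#elif defined RTE_MACHINE_CPUFLAG_AVX2	/* correct */
/* AVX2 implementation */
#else
/* SSE implementation */
#endif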

Thanks for correcting this!

Acked-by: Wang, Zhihong 


[dpdk-dev] [PATCH v4] vhost: Add indirect descriptors support to the TX path

2016-11-04 Thread Wang, Zhihong


> -Original Message-
> From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
> Sent: Thursday, November 3, 2016 4:11 PM
> To: Wang, Zhihong ; Yuanhan Liu
> 
> Cc: stephen at networkplumber.org; Pierre Pfister (ppfister)
> ; Xie, Huawei ; dev at 
> dpdk.org;
> vkaplans at redhat.com; mst at redhat.com
> Subject: Re: [dpdk-dev] [PATCH v4] vhost: Add indirect descriptors support
> to the TX path
> 
> 
> 
> On 11/02/2016 11:51 AM, Maxime Coquelin wrote:
> >
> >
> > On 10/31/2016 11:01 AM, Wang, Zhihong wrote:
> >>
> >>
> >>> -Original Message-
> >>> From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
> >>> Sent: Friday, October 28, 2016 3:42 PM
> >>> To: Wang, Zhihong ; Yuanhan Liu
> >>> 
> >>> Cc: stephen at networkplumber.org; Pierre Pfister (ppfister)
> >>> ; Xie, Huawei ;
> dev at dpdk.org;
> >>> vkaplans at redhat.com; mst at redhat.com
> >>> Subject: Re: [dpdk-dev] [PATCH v4] vhost: Add indirect descriptors
> >>> support
> >>> to the TX path
> >>>
> >>>
> >>>
> >>> On 10/28/2016 02:49 AM, Wang, Zhihong wrote:
> >>>>
> >>>>>> -Original Message-
> >>>>>> From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> >>>>>> Sent: Thursday, October 27, 2016 6:46 PM
> >>>>>> To: Maxime Coquelin 
> >>>>>> Cc: Wang, Zhihong ;
> >>>>>> stephen at networkplumber.org; Pierre Pfister (ppfister)
> >>>>>> ; Xie, Huawei ;
> >>> dev at dpdk.org;
> >>>>>> vkaplans at redhat.com; mst at redhat.com
> >>>>>> Subject: Re: [dpdk-dev] [PATCH v4] vhost: Add indirect descriptors
> >>> support
> >>>>>> to the TX path
> >>>>>>
> >>>>>> On Thu, Oct 27, 2016 at 12:35:11PM +0200, Maxime Coquelin wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 10/27/2016 12:33 PM, Yuanhan Liu wrote:
> >>>>>>>>>> On Thu, Oct 27, 2016 at 11:10:34AM +0200, Maxime Coquelin
> >>> wrote:
> >>>>>>>>>>>> Hi Zhihong,
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 10/27/2016 11:00 AM, Wang, Zhihong wrote:
> >>>>>>>>>>>>>> Hi Maxime,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Seems indirect desc feature is causing serious
> performance
> >>>>>>>>>>>>>> degradation on Haswell platform, about 20% drop for both
> >>>>>>>>>>>>>> mrg=on and mrg=off (--txqflags=0xf00, non-vector
> version),
> >>>>>>>>>>>>>> both iofwd and macfwd.
> >>>>>>>>>>>> I tested PVP (with macswap on guest) and Txonly/Rxonly on
> an
> >>> Ivy
> >>>>>> Bridge
> >>>>>>>>>>>> platform, and didn't faced such a drop.
> >>>>>>>>>>
> >>>>>>>>>> I was actually wondering that may be the cause. I tested it with
> >>>>>>>>>> my IvyBridge server as well, I saw no drop.
> >>>>>>>>>>
> >>>>>>>>>> Maybe you should find a similar platform (Haswell) and have a
> >>>>>>>>>> try?
> >>>>>>>> Yes, that's why I asked Zhihong whether he could test Txonly in
> >>>>>>>> guest
> >>> to
> >>>>>>>> see if issue is reproducible like this.
> >>>>>>
> >>>>>> I have no Haswell box, otherwise I could do a quick test for you.
> >>>>>> IIRC,
> >>>>>> he tried to disable the indirect_desc feature, then the performance
> >>>>>> recovered. So, it's likely the indirect_desc is the culprit here.
> >>>>>>
> >>>>>>>> I will be easier for me to find an Haswell machine if it has not
> >>>>>>>> to be
> >>>>>>>> connected back to back to and HW/SW packet generator.
> >>>> In fact simple loopback test will also do, without pktgen.
> >>>>

[dpdk-dev] [PATCH v4] vhost: Add indirect descriptors support to the TX path

2016-11-04 Thread Wang, Zhihong


> -Original Message-
> From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
> Sent: Friday, November 4, 2016 4:00 PM
> To: Wang, Zhihong ; Yuanhan Liu
> 
> Cc: stephen at networkplumber.org; Pierre Pfister (ppfister)
> ; Xie, Huawei ; dev at 
> dpdk.org;
> vkaplans at redhat.com; mst at redhat.com
> Subject: Re: [dpdk-dev] [PATCH v4] vhost: Add indirect descriptors support
> to the TX path
> 
> 
> 
> On 11/04/2016 08:57 AM, Maxime Coquelin wrote:
> > Hi Zhihong,
> >
> > On 11/04/2016 08:20 AM, Wang, Zhihong wrote:
> >>
> >>
> >>> -Original Message-
> >>> From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
> >>> Sent: Thursday, November 3, 2016 4:11 PM
> >>> To: Wang, Zhihong ; Yuanhan Liu
> >>> 
> >>> Cc: stephen at networkplumber.org; Pierre Pfister (ppfister)
> >>> ; Xie, Huawei ;
> dev at dpdk.org;
> >>> vkaplans at redhat.com; mst at redhat.com
> >>> Subject: Re: [dpdk-dev] [PATCH v4] vhost: Add indirect descriptors
> >>> support
> >>> to the TX path
> >>>
> >>>
> >>>
> >>> On 11/02/2016 11:51 AM, Maxime Coquelin wrote:
> >>>>
> >>>>
> >>>> On 10/31/2016 11:01 AM, Wang, Zhihong wrote:
> >>>>>
> >>>>>
> >>>>>> -Original Message-
> >>>>>> From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
> >>>>>> Sent: Friday, October 28, 2016 3:42 PM
> >>>>>> To: Wang, Zhihong ; Yuanhan Liu
> >>>>>> 
> >>>>>> Cc: stephen at networkplumber.org; Pierre Pfister (ppfister)
> >>>>>> ; Xie, Huawei ;
> >>> dev at dpdk.org;
> >>>>>> vkaplans at redhat.com; mst at redhat.com
> >>>>>> Subject: Re: [dpdk-dev] [PATCH v4] vhost: Add indirect descriptors
> >>>>>> support
> >>>>>> to the TX path
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 10/28/2016 02:49 AM, Wang, Zhihong wrote:
> >>>>>>>
> >>>>>>>>> -Original Message-
> >>>>>>>>> From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> >>>>>>>>> Sent: Thursday, October 27, 2016 6:46 PM
> >>>>>>>>> To: Maxime Coquelin 
> >>>>>>>>> Cc: Wang, Zhihong ;
> >>>>>>>>> stephen at networkplumber.org; Pierre Pfister (ppfister)
> >>>>>>>>> ; Xie, Huawei ;
> >>>>>> dev at dpdk.org;
> >>>>>>>>> vkaplans at redhat.com; mst at redhat.com
> >>>>>>>>> Subject: Re: [dpdk-dev] [PATCH v4] vhost: Add indirect
> descriptors
> >>>>>> support
> >>>>>>>>> to the TX path
> >>>>>>>>>
> >>>>>>>>> On Thu, Oct 27, 2016 at 12:35:11PM +0200, Maxime Coquelin
> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On 10/27/2016 12:33 PM, Yuanhan Liu wrote:
> >>>>>>>>>>>>> On Thu, Oct 27, 2016 at 11:10:34AM +0200, Maxime
> Coquelin
> >>>>>> wrote:
> >>>>>>>>>>>>>>> Hi Zhihong,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On 10/27/2016 11:00 AM, Wang, Zhihong wrote:
> >>>>>>>>>>>>>>>>> Hi Maxime,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Seems indirect desc feature is causing serious
> >>> performance
> >>>>>>>>>>>>>>>>> degradation on Haswell platform, about 20% drop for
> both
> >>>>>>>>>>>>>>>>> mrg=on and mrg=off (--txqflags=0xf00, non-vector
> >>> version),
> >>>>>>>>>>>>>>>>> both iofwd and macfwd.
> >>>>>>>>>>>>>>> I tested PVP (with macswap on guest) and Txonly/Rxonly
> on
> >>> an
> >>>>>> Ivy
> >>>>>>>>> Bridge

[dpdk-dev] [PATCH v4] vhost: Add indirect descriptors support to the TX path

2016-11-04 Thread Wang, Zhihong


> -Original Message-
> From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
> Sent: Friday, November 4, 2016 7:23 PM
> To: Wang, Zhihong ; Yuanhan Liu
> 
> Cc: stephen at networkplumber.org; Pierre Pfister (ppfister)
> ; Xie, Huawei ; dev at 
> dpdk.org;
> vkaplans at redhat.com; mst at redhat.com
> Subject: Re: [dpdk-dev] [PATCH v4] vhost: Add indirect descriptors support to 
> the
> TX path
> 
> 
> 
> >>>> Hi Maxime,
> >>>>
> >>>> I did a little more macswap test and found out more stuff here:
> >>> Thanks for doing more tests.
> >>>
> >>>>
> >>>>  1. I did loopback test on another HSW machine with the same H/W,
> >>>> and indirect_desc on and off seems have close perf
> >>>>
> >>>>  2. So I checked the gcc version:
> >>>>
> >>>>  *  Previous: gcc version 6.2.1 20160916 (Fedora 24)
> >>>>
> >>>>  *  New: gcc version 5.4.0 20160609 (Ubuntu 16.04.1 LTS)
> >>>
> >>> On my side, I tested with RHEL7.3:
> >>>  - gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11)
> >>>
> >>> It certainly contains some backports from newer GCC versions.
> >>>
> >>>>
> >>>> On previous one indirect_desc has 20% drop
> >>>>
> >>>>  3. Then I compiled binary on Ubuntu and scp to Fedora, and as
> >>>> expected I got the same perf as on Ubuntu, and the perf gap
> >>>> disappeared, so gcc is definitely one factor here
> >>>>
> >>>>  4. Then I use the Ubuntu binary on Fedora for PVP test, then the
> >>>> perf gap comes back again and the same with the Fedora binary
> >>>> results, indirect_desc causes about 20% drop
> >>>
> >>> Let me know if I understand correctly:
> >
> > Yes, and it's hard to breakdown further at this time.
> >
> > Also we may need to check whether it's caused by certain NIC
> > model. Unfortunately I don't have the right setup right now.
> >
> >>> Loopback test with macswap:
> >>>  - gcc version 6.2.1 : 20% perf drop
> >>>  - gcc version 5.4.0 : No drop
> >>>
> >>> PVP test with macswap:
> >>>  - gcc version 6.2.1 : 20% perf drop
> >>>  - gcc version 5.4.0 : 20% perf drop
> >>
> >> I forgot to ask, did you recompile only host, or both host and guest
> >> testmpd's in your test?
> 
> > Both.
> 
> I recompiled testpmd on a Fedora 24 machine using GCC6:
> gcc (GCC) 6.1.1 20160621 (Red Hat 6.1.1-3)
> Testing loopback with macswap on my Haswell RHEL7.3 machine gives me the
> following results:
>   - indirect on: 7.75Mpps
>   - indirect off: 7.35Mpps
> 
> Surprisingly, I get better results with indirect on my setup (I
> reproduced the tests multiple times).
> 
> Do you have a document explaining the tuning/config you apply to both
> the host and the guest (isolation, HT, hugepage size, ...) in your
> setup?


The setup where it goes wrong:
 1. Xeon E5-2699, HT on, turbo off, 1GB hugepage for both host and guest
 2. Fortville 40G
 3. Fedora 4.7.5-200.fc24.x86_64
 4. gcc version 6.2.1
 5. 16.11 RC2 for both host and guest
 6. PVP, testpmd macswap for both host and guest

BTW, I do see indirect_desc give slightly better performance for loopback
in tests on other platforms, but I don't know how PVP performs there yet.


> 
> Regards,
> Maxime


[dpdk-dev] [PATCH v4] vhost: Add indirect descriptors support to the TX path

2016-11-04 Thread Wang, Zhihong


> -Original Message-
> From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
> Sent: Friday, November 4, 2016 8:54 PM
> To: Wang, Zhihong ; Yuanhan Liu
> 
> Cc: stephen at networkplumber.org; Pierre Pfister (ppfister)
> ; Xie, Huawei ; dev at 
> dpdk.org;
> vkaplans at redhat.com; mst at redhat.com
> Subject: Re: [dpdk-dev] [PATCH v4] vhost: Add indirect descriptors support
> to the TX path
> 
> 
> 
> On 11/04/2016 01:30 PM, Wang, Zhihong wrote:
> >
> >
> >> -Original Message-
> >> From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
> >> Sent: Friday, November 4, 2016 7:23 PM
> >> To: Wang, Zhihong ; Yuanhan Liu
> >> 
> >> Cc: stephen at networkplumber.org; Pierre Pfister (ppfister)
> >> ; Xie, Huawei ;
> dev at dpdk.org;
> >> vkaplans at redhat.com; mst at redhat.com
> >> Subject: Re: [dpdk-dev] [PATCH v4] vhost: Add indirect descriptors
> support to the
> >> TX path
> >>
> >>
> >>
> >>>>>> Hi Maxime,
> >>>>>>
> >>>>>> I did a little more macswap test and found out more stuff here:
> >>>>> Thanks for doing more tests.
> >>>>>
> >>>>>>
> >>>>>>  1. I did loopback test on another HSW machine with the same H/W,
> >>>>>> and indirect_desc on and off seems have close perf
> >>>>>>
> >>>>>>  2. So I checked the gcc version:
> >>>>>>
> >>>>>>  *  Previous: gcc version 6.2.1 20160916 (Fedora 24)
> >>>>>>
> >>>>>>  *  New: gcc version 5.4.0 20160609 (Ubuntu 16.04.1 LTS)
> >>>>>
> >>>>> On my side, I tested with RHEL7.3:
> >>>>>  - gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11)
> >>>>>
> >>>>> It certainly contains some backports from newer GCC versions.
> >>>>>
> >>>>>>
> >>>>>> On previous one indirect_desc has 20% drop
> >>>>>>
> >>>>>>  3. Then I compiled binary on Ubuntu and scp to Fedora, and as
> >>>>>> expected I got the same perf as on Ubuntu, and the perf gap
> >>>>>> disappeared, so gcc is definitely one factor here
> >>>>>>
> >>>>>>  4. Then I use the Ubuntu binary on Fedora for PVP test, then the
> >>>>>> perf gap comes back again and the same with the Fedora binary
> >>>>>> results, indirect_desc causes about 20% drop
> >>>>>
> >>>>> Let me know if I understand correctly:
> >>>
> >>> Yes, and it's hard to breakdown further at this time.
> >>>
> >>> Also we may need to check whether it's caused by certain NIC
> >>> model. Unfortunately I don't have the right setup right now.
> >>>
> >>>>> Loopback test with macswap:
> >>>>>  - gcc version 6.2.1 : 20% perf drop
> >>>>>  - gcc version 5.4.0 : No drop
> >>>>>
> >>>>> PVP test with macswap:
> >>>>>  - gcc version 6.2.1 : 20% perf drop
> >>>>>  - gcc version 5.4.0 : 20% perf drop
> >>>>
> >>>> I forgot to ask, did you recompile only host, or both host and guest
> >>>> testmpd's in your test?
> >>
> >>> Both.
> >>
> >> I recompiled testpmd on a Fedora 24 machine using GCC6:
> >> gcc (GCC) 6.1.1 20160621 (Red Hat 6.1.1-3)
> >> Testing loopback with macswap on my Haswell RHEL7.3 machine gives me
> the
> >> following results:
> >>   - indirect on: 7.75Mpps
> >>   - indirect off: 7.35Mpps
> >>
> >> Surprisingly, I get better results with indirect on my setup (I
> >> reproduced the tests multiple times).
> >>
> >> Do you have a document explaining the tuning/config you apply to both
> >> the host and the guest (isolation, HT, hugepage size, ...) in your
> >> setup?
> >
> >
> > The setup where it goes wrong:
> >  1. Xeon E5-2699, HT on, turbo off, 1GB hugepage for both host and guest
> On the Haswell machine (on which I don't have BIOS access), HT is on,
> but I unplug siblings at runtime.
> I also have 1G pages on both sides, and I isolate the cores used by both
> testpmd and vCPUS.
> 
> >  2. Fortville 40G
> >  3. Fedora 4.7.5-200.fc24.x86_64
> >  4. gcc version 6.2.1
> >  5. 16.11 RC2 for both host and guest
> >  6. PVP, testpmd macswap for both host and guest
> >
> > BTW, I do see indirect_desc gives slightly better performance for loopback
> > in tests on other platforms, but don't know how PVP performs yet.
> Interesting, other platforms are also Haswell/Broadwell?

Yes, but with a different OS.

If you don't have the setup, I can do more detailed profiling for the
root cause next week, since my platform is the only one reporting the
drop right now.


> 
> For PVP benchmarks, are your figures with 0% pkt loss?

No, for testpmd perf analysis it's not necessary in my opinion.

I did try low rates though, and the result is the same.

> 
> Thanks,
> Maxime
> 
> >
> >
> >>
> >> Regards,
> >> Maxime


[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-10-09 Thread Wang, Zhihong


> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Wang, Zhihong
> Sent: Wednesday, September 28, 2016 12:45 AM
> To: Yuanhan Liu ; Jianbo Liu
> 
> Cc: Maxime Coquelin ; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> 
> 
> 
> > -Original Message-
> > From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> > Sent: Tuesday, September 27, 2016 6:21 PM
> > To: Jianbo Liu 
> > Cc: Wang, Zhihong ; Maxime Coquelin
> > ; dev at dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> >
> > On Thu, Sep 22, 2016 at 05:01:41PM +0800, Jianbo Liu wrote:
> > > On 22 September 2016 at 14:58, Wang, Zhihong
> 
> > wrote:
> > > >
> > > >
> > > >> -Original Message-
> > > >> From: Jianbo Liu [mailto:jianbo.liu at linaro.org]
> > > >> Sent: Thursday, September 22, 2016 1:48 PM
> > > >> To: Yuanhan Liu 
> > > >> Cc: Wang, Zhihong ; Maxime Coquelin
> > > >> ; dev at dpdk.org
> > > >> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> > > >>
> > > >> On 22 September 2016 at 10:29, Yuanhan Liu
> 
> > > >> wrote:
> > > >> > On Wed, Sep 21, 2016 at 08:54:11PM +0800, Jianbo Liu wrote:
> > > >> >> >> > My setup consists of one host running a guest.
> > > >> >> >> > The guest generates as much 64bytes packets as possible
> using
> > > >> >> >>
> > > >> >> >> Have you tested with other different packet size?
> > > >> >> >> My testing shows that performance is dropping when packet
> size is
> > > >> more
> > > >> >> >> than 256.
> > > >> >> >
> > > >> >> >
> > > >> >> > Hi Jianbo,
> > > >> >> >
> > > >> >> > Thanks for reporting this.
> > > >> >> >
> > > >> >> >  1. Are you running the vector frontend with mrg_rxbuf=off?
> > > >> >> >
> > > >> Yes, my testing is mrg_rxbuf=off, but not vector frontend PMD.
> > > >>
> > > >> >> >  2. Could you please specify what CPU you're running? Is it
> Haswell
> > > >> >> > or Ivy Bridge?
> > > >> >> >
> > > >> It's an ARM server.
> > > >>
> > > >> >> >  3. How many percentage of drop are you seeing?
> > > >> The testing result:
> > > >> size (bytes) improvement (%)
> > > >> 64   3.92
> > > >> 128 11.51
> > > >> 256  24.16
> > > >> 512  -13.79
> > > >> 1024 -22.51
> > > >> 1500 -12.22
> > > >> A correction is that performance is dropping if byte size is larger 
> > > >> than
> 512.
> > > >
> > > >
> > > > Jianbo,
> > > >
> > > > Could you please verify does this patch really cause enqueue perf to
> drop?
> > > >
> > > > You can test the enqueue path only by set guest to do rxonly, and
> compare
> > > > the mpps by show port stats all in the guest.
> > > >
> > > >
> > > Tested with testpmd, host: txonly, guest: rxonly
> > > size (bytes) improvement (%)
> > > 64    4.12
> > > 128   6
> > > 256   2.65
> > > 512   -1.12
> > > 1024 -7.02
> >
> > There is a difference between Zhihong's code and the old I spotted in
> > the first time: Zhihong removed the avail_idx prefetch. I understand
> > the prefetch becomes a bit tricky when mrg-rx code path is considered;
> > thus, I didn't comment on that.
> >
> > That's one of the difference that, IMO, could drop a regression. I then
> > finally got a chance to add it back.
> >
> > A rough test shows it improves the performance of 1400B packet size
> greatly
> > in the "txonly in host and rxonly in guest" case: +33% is the number I get
> > with my test server (Ivybridge).
> 
> Thanks Yuanhan! I'll validate this on x86.

Hi Yuanhan,

Seems your code doesn't perform correctly. I wrote a new version of the
avail idx prefetch but didn't see any perf benefit.

To be honest, I doubt the benefit of this idea. The previous mrg_off code
used this method but didn't show any benefit.

Even if it were useful, the benefit should be more significant for small
packets; it's unlikely this simple idx prefetch could bring over 30% perf
gain for large packets like 1400B ones.

But if you really do work it out like that, I'll be very glad to see it.

Thanks
Zhihong

> 
> >
> > I guess this might/would help your case as well. Mind to have a test
> > and tell me the results?
> >
> > BTW, I made it in rush; I haven't tested the mrg-rx code path yet.
> >
> > Thanks.
> >
> > --yliu


[dpdk-dev] [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-10 Thread Wang, Zhihong


> -Original Message-
> From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> Sent: Monday, October 10, 2016 11:59 AM
> To: Michael S. Tsirkin 
> Cc: Maxime Coquelin ; Stephen Hemminger
> ; dev at dpdk.org; qemu-
> devel at nongnu.org; Wang, Zhihong 
> Subject: Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature
> 
> On Mon, Oct 10, 2016 at 06:46:44AM +0300, Michael S. Tsirkin wrote:
> > On Mon, Oct 10, 2016 at 11:37:44AM +0800, Yuanhan Liu wrote:
> > > On Thu, Sep 29, 2016 at 11:21:48PM +0300, Michael S. Tsirkin wrote:
> > > > On Thu, Sep 29, 2016 at 10:05:22PM +0200, Maxime Coquelin wrote:
> > > > >
> > > > >
> > > > > On 09/29/2016 07:57 PM, Michael S. Tsirkin wrote:
> > > > Yes but two points.
> > > >
> > > > 1. why is this memset expensive?
> > >
> > > I don't have the exact answer, but just some rough thoughts:
> > >
> > > It's an external clib function: there is a call stack and the
> > > IP register will bounch back and forth.
> >
> > for memset 0?  gcc 5.3.1 on fedora happily inlines it.
> 
> Good to know!
> 
> > > overkill to use that for resetting 14 bytes structure.
> > >
> > > Some trick like
> > > *(struct virtio_net_hdr *)hdr = {0, };
> > >
> > > Or even
> > > hdr->xxx = 0;
> > > hdr->yyy = 0;
> > >
> > > should behaviour better.
> > >
> > > There was an example: the vhost enqueue optmization patchset from
> > > Zhihong [0] uses memset, and it introduces more than 15% drop (IIRC)
> > > on my Ivybridge server: it has no such issue on his server though.
> > >
> > > [0]: http://dpdk.org/ml/archives/dev/2016-August/045272.html
> > >
> > >   --yliu
> >
> > I'd say that's weird. what's your config? any chance you
> > are using an old compiler?
> 
> Not really, it's gcc 5.3.1. Maybe Zhihong could explain more. IIRC,
> he said the memset is not well optimized for Ivybridge server.

The dst is remote in that case. It's fine on Haswell, but on Ivy Bridge
there is a complication which (wasn't supposed to, but does) cause a serious
frontend issue.

I don't think gcc inlined it there. I'm using fc24 gcc 6.1.1.
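
For reference, the alternative suggested above would look roughly like this
(hypothetical sketch, not the actual vhost code; the header struct is
redeclared here, with fields as in the virtio spec, just to keep the sketch
self-contained):

#include <stdint.h>

struct virtio_net_hdr {
	uint8_t  flags;
	uint8_t  gso_type;
	uint16_t hdr_len;
	uint16_t gso_size;
	uint16_t csum_start;
	uint16_t csum_offset;
};

static inline void
clear_hdr(struct virtio_net_hdr *hdr)
{
	/* Struct assignment instead of memset(hdr, 0, sizeof(*hdr)):
	 * the stores stay inline regardless of how the compiler treats
	 * memset on the target. */
	*hdr = (struct virtio_net_hdr){ 0 };
}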

> 
>   --yliu


[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-10-10 Thread Wang, Zhihong


> -Original Message-
> From: Jianbo Liu [mailto:jianbo.liu at linaro.org]
> Sent: Monday, October 10, 2016 1:32 PM
> To: Yuanhan Liu 
> Cc: Wang, Zhihong ; Maxime Coquelin
> ; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> 
> On 10 October 2016 at 10:44, Yuanhan Liu 
> wrote:
> > On Sun, Oct 09, 2016 at 12:09:07PM +, Wang, Zhihong wrote:
> >> > > > Tested with testpmd, host: txonly, guest: rxonly
> >> > > > size (bytes) improvement (%)
> >> > > > 64    4.12
> >> > > > 128   6
> >> > > > 256   2.65
> >> > > > 512   -1.12
> >> > > > 1024 -7.02
> >> > >
> >> > > There is a difference between Zhihong's code and the old I spotted in
> >> > > the first time: Zhihong removed the avail_idx prefetch. I understand
> >> > > the prefetch becomes a bit tricky when mrg-rx code path is
> considered;
> >> > > thus, I didn't comment on that.
> >> > >
> >> > > That's one of the difference that, IMO, could drop a regression. I then
> >> > > finally got a chance to add it back.
> >> > >
> >> > > A rough test shows it improves the performance of 1400B packet size
> >> > greatly
> >> > > in the "txonly in host and rxonly in guest" case: +33% is the number I
> get
> >> > > with my test server (Ivybridge).
> >> >
> >> > Thanks Yuanhan! I'll validate this on x86.
> >>
> >> Hi Yuanhan,
> >>
> >> Seems your code doesn't perform correctly. I write a new version
> >> of avail idx prefetch but didn't see any perf benefit.
> >>
> >> To be honest I doubt the benefit of this idea. The previous mrg_off
> >> code has this method but doesn't give any benefits.
> >
> > Good point. I thought of that before, too. But you know that I made it
> > in rush, that I didn't think further and test more.
> >
> > I looked the code a bit closer this time, and spotted a bug: the prefetch
> > actually didn't happen, due to following code piece:
> >
> > if (vq->next_avail_idx >= NR_AVAIL_IDX_PREFETCH) {
> > prefetch_avail_idx(vq);
> > ...
> > }
> >
> > Since vq->next_avail_idx is set to 0 at the entrance of enqueue path,
> > prefetch_avail_idx() will be called. The fix is easy though: just put
> > prefetch_avail_idx before invoking enqueue_packet.
> >
> > In summary, Zhihong is right, I see no more gains with that fix :(
> >
> > However, as stated, that's kind of the only difference I found between
> > yours and the old code, that maybe it's still worthwhile to have a
> > test on ARM, Jianbo?
> >
> I haven't tested it, but I think it could be no improvement for ARM either.
> 
> A smalll suggestion for enqueue_packet:
> 
> .
> +   /* start copy from mbuf to desc */
> +   while (mbuf_avail || mbuf->next) {
> .
> 
> Considering pkt_len is in the first cache line (same as data_len),
> while next pointer is in the second cache line,
> is it better to check the total packet len, instead of the last mbuf's
> next pointer to jump out of while loop and avoid possible cache miss?

Jianbo,

Thanks for the reply!

This idea sounds good, but it won't help the general perf in my opinion,
since the 2nd cache line is accessed anyway, earlier, in
virtio_enqueue_offload.

Also, this would add a NULL check when actually accessing mbuf->next.

BTW, could you please publish the numbers for:

 1. mrg_rxbuf=on, comparison between original and original + this patch

 2. mrg_rxbuf=off, comparison between original and original + this patch

So we can have a whole picture of how this patch impacts the ARM platform.

Thanks
Zhihong



[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-10-10 Thread Wang, Zhihong


> -Original Message-
> From: Jianbo Liu [mailto:jianbo.liu at linaro.org]
> Sent: Monday, October 10, 2016 2:58 PM
> To: Wang, Zhihong 
> Cc: Yuanhan Liu ; Maxime Coquelin
> ; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> 
> On 10 October 2016 at 14:22, Wang, Zhihong 
> wrote:
> >
> >
> >> -Original Message-
> >> From: Jianbo Liu [mailto:jianbo.liu at linaro.org]
> >> Sent: Monday, October 10, 2016 1:32 PM
> >> To: Yuanhan Liu 
> >> Cc: Wang, Zhihong ; Maxime Coquelin
> >> ; dev at dpdk.org
> >> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> >>
> >> On 10 October 2016 at 10:44, Yuanhan Liu 
> >> wrote:
> >> > On Sun, Oct 09, 2016 at 12:09:07PM +, Wang, Zhihong wrote:
> >> >> > > > Tested with testpmd, host: txonly, guest: rxonly
> >> >> > > > size (bytes) improvement (%)
> >> >> > > > 64    4.12
> >> >> > > > 128   6
> >> >> > > > 256   2.65
> >> >> > > > 512   -1.12
> >> >> > > > 1024 -7.02
> >> >> > >
> >> >> > > There is a difference between Zhihong's code and the old I spotted
> in
> >> >> > > the first time: Zhihong removed the avail_idx prefetch. I
> understand
> >> >> > > the prefetch becomes a bit tricky when mrg-rx code path is
> >> considered;
> >> >> > > thus, I didn't comment on that.
> >> >> > >
> >> >> > > That's one of the difference that, IMO, could drop a regression. I
> then
> >> >> > > finally got a chance to add it back.
> >> >> > >
> >> >> > > A rough test shows it improves the performance of 1400B packet
> size
> >> >> > greatly
> >> >> > > in the "txonly in host and rxonly in guest" case: +33% is the number
> I
> >> get
> >> >> > > with my test server (Ivybridge).
> >> >> >
> >> >> > Thanks Yuanhan! I'll validate this on x86.
> >> >>
> >> >> Hi Yuanhan,
> >> >>
> >> >> Seems your code doesn't perform correctly. I write a new version
> >> >> of avail idx prefetch but didn't see any perf benefit.
> >> >>
> >> >> To be honest I doubt the benefit of this idea. The previous mrg_off
> >> >> code has this method but doesn't give any benefits.
> >> >
> >> > Good point. I thought of that before, too. But you know that I made it
> >> > in rush, that I didn't think further and test more.
> >> >
> >> > I looked the code a bit closer this time, and spotted a bug: the prefetch
> >> > actually didn't happen, due to following code piece:
> >> >
> >> > if (vq->next_avail_idx >= NR_AVAIL_IDX_PREFETCH) {
> >> > prefetch_avail_idx(vq);
> >> > ...
> >> > }
> >> >
> >> > Since vq->next_avail_idx is set to 0 at the entrance of enqueue path,
> >> > prefetch_avail_idx() will be called. The fix is easy though: just put
> >> > prefetch_avail_idx before invoking enqueue_packet.
> >> >
> >> > In summary, Zhihong is right, I see no more gains with that fix :(
> >> >
> >> > However, as stated, that's kind of the only difference I found between
> >> > yours and the old code, that maybe it's still worthwhile to have a
> >> > test on ARM, Jianbo?
> >> >
> >> I haven't tested it, but I think it could be no improvement for ARM either.
> >>
> >> A smalll suggestion for enqueue_packet:
> >>
> >> .
> >> +   /* start copy from mbuf to desc */
> >> +   while (mbuf_avail || mbuf->next) {
> >> .
> >>
> >> Considering pkt_len is in the first cache line (same as data_len),
> >> while next pointer is in the second cache line,
> >> is it better to check the total packet len, instead of the last mbuf's
> >> next pointer to jump out of while loop and avoid possible cache miss?
> >
> > Jianbo,
> >
> > Thanks for the reply!
> >
> > This idea sounds good, but it won't help the general perf in my
> > opinion, since the 2nd cache line is accessed anyway prior in
> > virtio_enqueue_offload.
> >
> Yes, you are right. I'm thinking of prefetching beforehand.
> And if it's a chained mbuf, virtio_enqueue_offload will not be called
> in next loop.
> 
> > Also this would bring a NULL check when actually access mbuf->next.
> >
> > BTW, could you please publish the number of:
> >
> >  1. mrg_rxbuf=on, comparison between original and original + this patch
> >
> >  2. mrg_rxbuf=off, comparison between original and original + this patch
> >
> > So we can have a whole picture of how this patch impact on ARM platform.
> >
> I think you already have got many results in my previous emails.
> Sorry I can't test right now and busy with other things.

We're still missing the mrg_rxbuf=on data.



[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-10-12 Thread Wang, Zhihong


> -Original Message-
> From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> Sent: Wednesday, October 12, 2016 10:53 AM
> To: Wang, Zhihong ; Jianbo Liu  linaro.org>
> Cc: Maxime Coquelin ; dev at dpdk.org; Thomas
> Monjalon 
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> 
> On Thu, Sep 22, 2016 at 01:47:45PM +0800, Jianbo Liu wrote:
> > On 22 September 2016 at 10:29, Yuanhan Liu 
> wrote:
> > > On Wed, Sep 21, 2016 at 08:54:11PM +0800, Jianbo Liu wrote:
> > >> >> > My setup consists of one host running a guest.
> > >> >> > The guest generates as much 64bytes packets as possible using
> > >> >>
> > >> >> Have you tested with other different packet size?
> > >> >> My testing shows that performance is dropping when packet size is more
> > >> >> than 256.
> > >> >
> > >> >
> > >> > Hi Jianbo,
> > >> >
> > >> > Thanks for reporting this.
> > >> >
> > >> >  1. Are you running the vector frontend with mrg_rxbuf=off?
> > >> >
> > Yes, my testing is mrg_rxbuf=off, but not vector frontend PMD.
> >
> > >> >  2. Could you please specify what CPU you're running? Is it Haswell
> > >> > or Ivy Bridge?
> > >> >
> > It's an ARM server.
> >
> > >> >  3. How many percentage of drop are you seeing?
> > The testing result:
> > size (bytes) improvement (%)
> > 64   3.92
> > 128 11.51
> > 256  24.16
> > 512  -13.79
> > 1024 -22.51
> > 1500 -12.22
> > A correction is that performance is dropping if byte size is larger than 
> > 512.
> 
> I have thought of this twice. Unfortunately, I think I may need NACK this
> series.
> 
> Merging two code path into one is really good: as you stated, it improves
> the maintainability. But only if we see no performance regression on both
> path after the refactor. Unfortunately, that's not the case here: it hurts
> the performance for one code path (non-mrg Rx).
> 
> That makes me think we may should not do the code path merge at all. I think
> that also aligns with what you have said before (internally): we could do the
> merge if it gives comparable performance before and after that.
> 
> Besides that, I don't quite like the way you did in patch 2 (rewrite enqueue):
> you made a lot of changes in one patch. That means if something wrong
> happened,
> it is hard to narrow down which change introduces that regression. Badly,
> that's exactly what we met here. Weeks have been passed, I see no progress.
> 
> That's the reason we like the idea of "one patch only does one thing, an
> atomic thing".


Yuanhan, folks,

Thanks for the analysis. I disagree here though.

I analyze, develop, and benchmark on x86 platforms, where this patch
works great.

I've been trying to analyze on ARM too, but it takes time and I have other
scheduled work. Also, since the ARM perf issue only surfaced at v6, I might
not be able to make it in time. However, that's what I have to do for this
patch to be merged in this or the next release.

In the meantime, may I suggest we consider the possibility of having
dedicated code for **perf-critical paths** on different kinds of
architecture?

It can be hard for one person to have both the knowledge and the
development environment for multiple archs at the same time.

Moreover, different optimization techniques might be required for
different archs, so it's hard and unnecessary to make one function work
for all archs; sometimes it's just not the right thing to do.
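
Just as a hypothetical illustration (the function names are invented),
perf-critical paths could specialize per architecture behind the existing
RTE_ARCH_* macros while sharing the rest of the code:

#if defined(RTE_ARCH_X86_64)
#define virtio_dev_rx_burst	virtio_dev_rx_burst_x86
#elif defined(RTE_ARCH_ARM64)
#define virtio_dev_rx_burst	virtio_dev_rx_burst_arm64
#else
#define virtio_dev_rx_burst	virtio_dev_rx_burst_generic
#endif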


Thanks
Zhihong


> 
> So I will apply the first patch (it's a bug fixing patch) and ask you to
> refactor the rest, without the code path merge.
> 
> I think we could still have a good maintainability code base if we introduce
> more common helper functions that can be used on both Rx path, or even on
> Tx path (such as update_used_ring, or shadow_used_ring).
> 
> It's a bit late for too many changes for v16.11. I think you could just
> grab patch 6 (vhost: optimize cache access) to the old mrg-Rx code path,
> if that also helps the performance? Let us handle the left in next release,
> such as shadow used ring.
> 
> Thanks.
> 
>   --yliu


[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-10-13 Thread Wang, Zhihong


> -Original Message-
> From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> Sent: Wednesday, October 12, 2016 11:31 PM
> To: Wang, Zhihong 
> Cc: Yuanhan Liu ; Jianbo Liu
> ; Maxime Coquelin ;
> dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> 
> Sorry guys, you lost me in the discussion.
> 
> Is there some regression only on ARM?

ARM is where we see it; no info on ppc yet.

> Does it need some work specifically on memcpy for ARM,
> or vhost for ARM?
> Who can work on ARM optimization?

These are still open questions. Jianbo, who reported this, doesn't
have time for more testing now according to his reply.

I'm trying to do some tests in the hope of identifying the root cause.

> 
> More comments below.
> 
> 2016-10-12 12:22, Wang, Zhihong:
> > From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> > > > It's an ARM server.
> > > >
> > > > >> >  3. How many percentage of drop are you seeing?
> > > > The testing result:
> > > > size (bytes) improvement (%)
> > > > 64   3.92
> > > > 128 11.51
> > > > 256  24.16
> > > > 512  -13.79
> > > > 1024 -22.51
> > > > 1500 -12.22
> > > > A correction is that performance is dropping if byte size is larger than
> 512.
> > >
> > > I have thought of this twice. Unfortunately, I think I may need NACK this
> > > series.
> > >
> > > Merging two code path into one is really good: as you stated, it improves
> > > the maintainability. But only if we see no performance regression on both
> > > path after the refactor. Unfortunately, that's not the case here: it hurts
> > > the performance for one code path (non-mrg Rx).
> 
> +1
> 
> > > That makes me think we may should not do the code path merge at all. I
> think
> > > that also aligns with what you have said before (internally): we could do
> the
> > > merge if it gives comparable performance before and after that.
> > >
> > > Besides that, I don't quite like the way you did in patch 2 (rewrite
> enqueue):
> > > you made a lot of changes in one patch. That means if something wrong
> > > happened,
> > > it is hard to narrow down which change introduces that regression. Badly,
> > > that's exactly what we met here. Weeks have been passed, I see no
> progress.
> 
> +1, it is important to have simple patches making changes step by step.
> 
> > > That's the reason we like the idea of "one patch only does one thing, an
> > > atomic thing".
> >
> >
> > Yuanhan, folks,
> >
> > Thanks for the analysis. I disagree here though.
> >
> > I analyze, develop, benchmark on x86 platforms, where this patch
> > works great.
> >
> > I've been trying to analyze on ARM too but it takes time and I've
> > had a schedule. Also since the ARM perf issue comes when it's
> > v6 already, I might not be able to make it in time. However
> > that's what I have to do for this patch to be merged in this
> > or the next release.
> >
> > In the meantime, may I suggest we consider the possibility to
> > have dedicated codes for **perf critical paths** for different
> > kinds of architecture?
> 
> Yes that's what we do in several parts of DPDK.
> 
> > It can be hard for a person to have both the knowledge and the
> > development environment for multiple archs at the same time.
> 
> Yes we do not expect you work on ARM.
> So if nobody work on the ARM issue, you could make 2 code paths
> in order to allow your optimization for x86 only.
> But that's not the preferred way.
> And you must split your rework to better identify which part is
> a regression on ARM.
> 
> > Moreover, different optimization techniques might be required for
> > different archs, so it's hard and unnecessary to make a function
> > works for all archs, sometimes it's just not the right thing to do.
> 
> Yes sometimes. Please help us to be convinced for this case.
> 
> > > So I will apply the first patch (it's a bug fixing patch) and ask you to
> > > refactor the rest, without the code path merge.
> > >
> > > I think we could still have a good maintainability code base if we 
> > > introduce
> > > more common helper functions that can be used on both Rx path, or
> even on
> > > Tx path (such as update_used_ring, or shadow_used_ring).
> 
> Yes it is a good step.
> And the code path merge could be reconsidered later.
> 
> > > It's a bit late for too many changes for v16.11. I think you could just
> > > grab patch 6 (vhost: optimize cache access) to the old mrg-Rx code path,
> > > if that also helps the performance? Let us handle the left in next 
> > > release,
> > > such as shadow used ring.
> 
> Thank you


[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-10-13 Thread Wang, Zhihong


> -Original Message-
> From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> Sent: Thursday, October 13, 2016 1:33 PM
> To: Wang, Zhihong 
> Cc: Jianbo Liu ; Thomas Monjalon
> ; Maxime Coquelin
> ; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> 
> On Wed, Oct 12, 2016 at 12:22:08PM +, Wang, Zhihong wrote:
> > > > >> >  3. How many percentage of drop are you seeing?
> > > > The testing result:
> > > > size (bytes) improvement (%)
> > > > 64   3.92
> > > > 128 11.51
> > > > 256  24.16
> > > > 512  -13.79
> > > > 1024 -22.51
> > > > 1500 -12.22
> > > > A correction is that performance is dropping if byte size is larger than
> 512.
> > >
> > > I have thought of this twice. Unfortunately, I think I may need NACK this
> > > series.
> > >
> > > Merging two code path into one is really good: as you stated, it improves
> > > the maintainability. But only if we see no performance regression on both
> > > path after the refactor. Unfortunately, that's not the case here: it hurts
> > > the performance for one code path (non-mrg Rx).
> > >
> > > That makes me think we may should not do the code path merge at all. I
> think
> > > that also aligns with what you have said before (internally): we could do
> the
> > > merge if it gives comparable performance before and after that.
> > >
> > > Besides that, I don't quite like the way you did in patch 2 (rewrite
> enqueue):
> > > you made a lot of changes in one patch. That means if something wrong
> > > happened,
> > > it is hard to narrow down which change introduces that regression. Badly,
> > > that's exactly what we met here. Weeks have been passed, I see no
> progress.
> > >
> > > That's the reason we like the idea of "one patch only does one thing, an
> > > atomic thing".
> >
> >
> > Yuanhan, folks,
> >
> > Thanks for the analysis. I disagree here though.
> >
> > I analyze, develop, benchmark on x86 platforms, where this patch
> > works great.
> 
> Yes, that's great effort! With your hardwork, we know what the bottleneck
> is and how it could be improved.
> 
> However, you don't have to do code refactor (merge two code path to one)
> to apply those improvements. From what I know, in this patchset, there
> are two factors could improve the performance:
> 
> - copy hdr together with packet data
> 
> - shadow used ring update and update at once
> 
> The overall performance boost I got with your v6 patchset with mrg-Rx
> code path is about 27% (in PVP case). And I have just applied the 1st
> optimization, it yields about 20% boosts. The left could be covered if
> we apply the 2nd optimization (I guess).
> 
> That would be a clean way to optimize vhost mergeable Rx path:
> 
> - you don't touch non-mrg Rx path (well, you may could apply the
>   shadow_used_ring trick to it as wel)
> 
>   This would at least make sure we will have no such performance
>   regression issue reported by ARM guys.
> 
> - you don't refactor the code
> 
>   The rewrite from scratch could introduce other issues, besides the
>   performance regression. We may just don't know it yet.
> 
> 
> Make sense to you? If you agree, I think we could still make it in
> this release: they would be some small changes after all. For example,
> below is the patch applies the 1st optimization tip on top of
> dpdk-next-virtio


Thanks for this great idea. I think it's a better way to do it.
I'll start to make the patch then.


> 
>   --yliu
> 
> ---
> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> index 8a151af..0ddb5af 100644
> --- a/lib/librte_vhost/virtio_net.c
> +++ b/lib/librte_vhost/virtio_net.c
> @@ -379,7 +379,7 @@ copy_mbuf_to_desc_mergeable(struct virtio_net
> *dev, struct vhost_virtqueue *vq,
>   uint16_t end_idx, struct rte_mbuf *m,
>   struct buf_vector *buf_vec)
>  {
> - struct virtio_net_hdr_mrg_rxbuf virtio_hdr = {{0, 0, 0, 0, 0, 0}, 0};
> + struct virtio_net_hdr_mrg_rxbuf *virtio_hdr;
>   uint32_t vec_idx = 0;
>   uint16_t start_idx = vq->last_used_idx;
>   uint16_t cur_idx = start_idx;
> @@ -388,6 +388,8 @@ copy_mbuf_to_desc_mergeable(struct virtio_net
> *dev, struct vhost_virtqueue *vq,
>   uint32

[dpdk-dev] [RFC PATCH 0/2] performance utility in testpmd

2016-04-22 Thread Wang, Zhihong


> -Original Message-
> From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> Sent: Thursday, April 21, 2016 5:54 PM
> To: Wang, Zhihong 
> Cc: dev at dpdk.org; De Lara Guarch, Pablo 
> Subject: Re: [dpdk-dev] [RFC PATCH 0/2] performance utility in testpmd
> 
> 2016-04-20 18:43, Zhihong Wang:
> > This RFC patch proposes a general purpose forwarding engine in testpmd
> > namely "portfwd", to enable performance analysis and tuning for poll mode
> > drivers in vSwitching scenarios.
> >
> >
> > Problem statement
> > -
> >
> > vSwitching is more I/O bound in a lot of cases since there are a lot of
> > LLC/cross-core memory accesses.
> >
> > In order to reveal memory/cache behavior in real usage scenarios and enable
> > efficient performance analysis and tuning for vSwitching, DPDK needs a
> > sample application that supports traffic flow close to real deployment,
> > e.g. multi-tenancy, service chaining.
> >
> > There is a vhost sample application currently to enable simple vSwitching
> > scenarios, it comes with several limitations:
> >
> >1) Traffic flow is too simple and not flexible
> >
> >2) Switching based on MAC/VLAN only
> >
> >3) Not enough performance metrics
> >
> >
> > Proposed solution
> > -
> >
> > The testpmd sample application is a good choice, it's a powerful poll mode
> > driver management framework hosts various forwarding engine.
> 
> Not sure it is a good choice.
> The goal of testpmd is to test every PMD features.
> How far can we go in adding some stack processing while keeping it
> easily maintainable?


Thanks for the quick response!


This utility is not for vSwitching in particular; it just adds more
forwarding setup capabilities to testpmd.

testpmd is composed of separate components:
1) pmd management framework
2) forwarding engines:
   a) traffic setup
   b) forwarding function

When adding a new fwd engine, only the new traffic setup function and
forwarding function (and maybe cmd handlers) are added; nothing existing is
touched, so it doesn't make testpmd harder to maintain. A rough sketch is
shown below.
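
For illustration, plugging in such an engine could look roughly like this
(field names assumed from testpmd's struct fwd_engine of that time; the
port_fwd names are hypothetical placeholders for the new engine's traffic
setup and per-burst forwarding routines):

/* One more entry alongside io_fwd_engine and friends. */
struct fwd_engine port_fwd_engine = {
	.fwd_mode_name  = "port",
	.port_fwd_begin = port_fwd_begin,	/* traffic setup */
	.port_fwd_end   = NULL,
	.packet_fwd     = port_fwd,		/* per-burst forwarding function */
};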

It also doesn't change the current behavior at all: by default it's still
iofwd, and the user switches to portfwd only when flexible forwarding rules
are needed.

Also, I believe that in both the DPDK and OVS-DPDK communities, testpmd has
already become a widely used tool for setting up performance and functional
tests, and there are some complaints about its usability and flexibility.


Just one of many examples of why we need a feature-rich fwd engine:

There was an OVS bug reported by Red Hat that took both OVS and DPDK a long
time to investigate, and it turned out to be a testpmd setup issue: they
used testpmd in the guest to do the forwarding, and when multiqueue is
enabled the current testpmd has to use a separate core for each rxq, so
having too few cores leaves some rxqs unattended, which is neither an
expected result nor a necessary limitation.

Also, when OVS-DPDK was integrating multiqueue, a lot of cores had to be
assigned to the VM to handle all the rxqs for the test, which limits both
performance and functional testing because a single NUMA node has a limited
number of cores.


Another thing is the learning curve of the DPDK sample applications: we
could actually use portfwd for all kinds of PMD tests (both host and guest;
NIC PMDs, vhost PMDs, virtio PMDs, etc.), and it's simple to use, instead of
using different apps, like the vhost sample in the host and testpmd in the
guest.



> 
> > Now with the vhost pmd feature, it can also handle vhost devices, only a
> > new forwarding engine is needed to make use of it.
> 
> Why a new forwarding engine is needed for vhost?


Appologize for my poor English, what I meant is with the vhost pmd feature,
testpmd has become a vSwitch already, we just need to add more forwarding
setup capability to make use of it.


> 
> > portfwd is implemented to this end.
> >
> > Features of portfwd:
> >
> >1) Build up traffic from simple rx/tx to complex scenarios easily
> >
> >2) Rich performance statistics for all ports
> 
> Have you checked CONFIG_RTE_TEST_PMD_RECORD_CORE_CYCLES and
> CONFIG_RTE_TEST_PMD_RECORD_BURST_STATS?


These stats are good; it would be even better to have per-port rx/tx cycle
and burst size info in portfwd, like:


cycle stat (since last show)

port   0, burst  32,
  rx, run, min, avg, max,
   0,   0,   0,   0,   0,
   1,  21, 596, 663, 752,
   2, 289, 580, 725,1056,
   3,   

[dpdk-dev] [RFC PATCH 0/2] performance utility in testpmd

2016-04-22 Thread Wang, Zhihong


> -Original Message-
> From: Richardson, Bruce
> Sent: Thursday, April 21, 2016 7:00 PM
> To: Thomas Monjalon 
> Cc: Wang, Zhihong ; dev at dpdk.org; De Lara 
> Guarch,
> Pablo 
> Subject: Re: [dpdk-dev] [RFC PATCH 0/2] performance utility in testpmd
> 
> On Thu, Apr 21, 2016 at 11:54:12AM +0200, Thomas Monjalon wrote:
> > 2016-04-20 18:43, Zhihong Wang:
> > > This RFC patch proposes a general purpose forwarding engine in testpmd
> > > namely "portfwd", to enable performance analysis and tuning for poll mode
> > > drivers in vSwitching scenarios.
> > >
> > >
> > > Problem statement
> > > -
> > >
> > > vSwitching is more I/O bound in a lot of cases since there are a lot of
> > > LLC/cross-core memory accesses.
> > >
> > > In order to reveal memory/cache behavior in real usage scenarios and 
> > > enable
> > > efficient performance analysis and tuning for vSwitching, DPDK needs a
> > > sample application that supports traffic flow close to real deployment,
> > > e.g. multi-tenancy, service chaining.
> > >
> > > There is a vhost sample application currently to enable simple vSwitching
> > > scenarios, it comes with several limitations:
> > >
> > >1) Traffic flow is too simple and not flexible
> > >
> > >2) Switching based on MAC/VLAN only
> > >
> > >3) Not enough performance metrics
> > >
> > >
> > > Proposed solution
> > > -
> > >
> > > The testpmd sample application is a good choice, it's a powerful poll mode
> > > driver management framework hosts various forwarding engine.
> >
> > Not sure it is a good choice.
> > The goal of testpmd is to test every PMD features.
> > How far can we go in adding some stack processing while keeping it
> > easily maintainable?
> 
> I was thinking the exact same thing. Would it not be better to enhance the
> existing vhost example application to remove the limitations you call out 
> above?
> I don't particularly like the idea of introducing protocol awareness into 
> testpmd
> for IP forwarding, for instance.


Hi Bruce,

I understand the concern.

Like I mentioned in the original thread, this utility is not for vSwitching
in particular; it just adds more forwarding setup capabilities to testpmd.

testpmd is composed of separate components:
1) pmd management framework
2) forwarding engines:
   a) traffic setup
   b) forwarding function

When adding a new fwd engine, only the new traffic setup function and
forwarding function (and maybe cmd handlers) are added; nothing existing
is touched, so it doesn't make the code harder to maintain.
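
To give a feel for the size of the change, hooking in a new engine is
roughly this much code (a sketch only; I'm assuming the struct fwd_engine /
fwd_stream layout from testpmd.h, and the portfwd_* names are placeholders):

    /* New forwarding function for the portfwd engine (placeholder body). */
    static void
    portfwd_engine(struct fwd_stream *fs)
    {
            /* rx burst, apply the flexible forwarding rule, tx burst */
            (void)fs;
    }

    struct fwd_engine port_fwd_engine = {
            .fwd_mode_name  = "port",
            .port_fwd_begin = NULL,   /* or a traffic setup hook */
            .port_fwd_end   = NULL,
            .packet_fwd     = portfwd_engine,
    };
    /* plus one new entry in the fwd_engines[] list -- nothing else changes */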

It doesn't change the current behavior at all: by default it's still iofwd,
and the user switches to portfwd only when flexible forwarding rules are
needed.


I understand that testpmd was positioned to provide a test framework
for pmds, and there's no harm in keeping it that way. But I think testpmd
has already become a widely used tool to set up performance and functional
tests in both the DPDK and OVS-DPDK communities. It can do more with simple
changes like this.

There are benefits to enhancing the forwarding capabilities of testpmd:
people can build up DPDK performance/functional tests more easily, in both
host and guest; it also eliminates the overhead of apps like OVS when
testing what DPDK can provide in real scenarios for analysis.


With the vhost pmd feature, testpmd has already become a vSwitch;
we just need to add more forwarding setup capability to make use of it.

If we modify the current vhost sample to do this, then we basically
re-implement what testpmd already provides all over again, which duplicates
effort and increases maintenance work, because changes in testpmd may need
to go into that sample too.


Thanks
Zhihong


> 
>   Regards,
>   /Bruce



[dpdk-dev] [PATCH 0/7] vhost/example cleanup/fix

2016-04-28 Thread Wang, Zhihong

> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Yuanhan Liu
> Sent: Tuesday, April 26, 2016 12:46 PM
> To: dev at dpdk.org
> Cc: Xie, Huawei ; Yuanhan Liu
> 
> Subject: [dpdk-dev] [PATCH 0/7] vhost/example cleanup/fix
> 
> I'm starting to work on the vhost ABI refactoring, that I also have to
> touch the vhost example code, to make it work. The vhost example code,
> however, is very messy, full of __very__ long lines. This would make
> a later diff to apply the new vhost API be very ugly, therefore, not
> friendly for review. This is how this cleanup comes.


I think this patch is a great effort to clean up the messy code and clarify
the logic. Only one suggestion: do you think a complete cleanup, in terms of
code style and function organization, would help more? There'll be further
work on it, and it's a small file anyway. Currently some parts still seem
messy to me, which compromises the effort of this patch.


> 
> Besides that, there is one enhancement patch, which handles the broadcast
> packets so that we could rely the ARP request packet, to let vhost-switch
> be more like a real switch. There is another patch that (hopefully) would
> fix the mbuf allocation failure ultimately. I also added some guidelines
> there as comments to show how to count how many mbuf entries is enough for
> our usage.
> 
> ---
> Yuanhan Liu (7):
>   examples/vhost: remove the non-working zero copy code
>   examples/vhost: remove unused macro and struct
>   examples/vhost: use tailq to link vhost devices
>   examples/vhost: use mac compare helper function directly
>   examples/vhost: handle broadcast packet
>   examples/vhost: fix mbuf allocation failures
>   examples/vhost: switch_worker cleanup
> 
>  doc/guides/sample_app_ug/vhost.rst |   36 +-
>  examples/vhost/main.c  | 2319 
> ++--
>  examples/vhost/main.h  |   49 +-
>  3 files changed, 375 insertions(+), 2029 deletions(-)
> 
> --
> 1.9.0



[dpdk-dev] [PATCH] optimize vhost enqueue

2016-08-17 Thread Wang, Zhihong


> -Original Message-
> From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
> Sent: Tuesday, August 16, 2016 10:00 PM
> To: Wang, Zhihong ; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] optimize vhost enqueue
> 
> Hi Zhihong,
> 
> On 08/16/2016 05:50 AM, Zhihong Wang wrote:
> > This patch optimizes the vhost enqueue function: rte_vhost_enqueue_burst.
> >
> > Currently there're 2 callbacks for vhost enqueue:
> >  *  virtio_dev_merge_rx for mrg_rxbuf turned on cases.
> >  *  virtio_dev_rx for mrg_rxbuf turned off cases.
> >
> > The virtio_dev_merge_rx doesn't provide optimal performance, also it is
> > reported having compatibility issue working with Windows VMs.
> Could you tell us more please about this compatibility issue?


For example, when you have testpmd in the host and a Windows VM as the guest,
with mrg_rxbuf turned on, the guest will hang once packets are enqueued
by virtio_dev_merge_rx.

Let me know if you see the same issue.


> 
> >
> > Besides, having 2 separated functions increases maintenance efforts.
> >
> > This patch uses a single function logic to replace the current 2 for
> > better maintainability, and provides better performance by optimizing
> > caching behavior especially for mrg_rxbuf turned on cases.
> Do you have some benchmark comparison before and after your change?
> 
> Also, for maintainability, I would suggest the that the enqueue
> function be split. Because vhost_enqueue_burst becomes very long (220
> LoC), and max level of indentation is too high (6).
> 
> It makes the code hard to understand, and prone to miss bugs during
> review and maintenance.


This is something I've thought about while writing the code; the reason I
keep it as one function body is that:

 1. This function is very performance sensitive, and we need full control of
code ordering (you can compare against the current code's performance with
the mrg_rxbuf feature turned on to see the difference).

 2. I somehow find that a single-function logic makes it easier to understand;
surely I can add comments to make it easier to read.

Please let me know if you still insist; we can discuss it more.


> 
> >
> > It also fixes the issue working with Windows VMs.
> Ideally, the fix should be sent separately, before the rework.
> Indeed, we might want to have the fix in the stable branch, without
> picking the optimization.
> 
> >
> > Signed-off-by: Zhihong Wang 
> > ---
> >  lib/librte_vhost/vhost-net.h  |   6 +-
> >  lib/librte_vhost/vhost_rxtx.c | 582 
> > ++
> >  lib/librte_vhost/virtio-net.c |  15 +-
> >  3 files changed, 208 insertions(+), 395 deletions(-)
> 582 lines changed is a huge patch.
> If possible, it would be better splitting it in incremental changes,
> making the review process easier.


It looks like a huge patch, but it simply deletes the current implementation
and adds the new code. Perhaps I can split it into 2: the 1st one to replace
just rte_vhost_enqueue_burst, and the 2nd one to delete all the obsolete
functions. That should make the patch clearer, what do you think?  :)


> 
> Also, for v2, please prefix the commit title with "vhost:".

Thanks for the hint! Will do.

> 
> Thanks for your contribution, I'm looking forward for the v2.
> - Maxime


[dpdk-dev] [PATCH] optimize vhost enqueue

2016-08-17 Thread Wang, Zhihong


> -Original Message-
> From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> Sent: Wednesday, August 17, 2016 10:38 AM
> To: Wang, Zhihong 
> Cc: Maxime Coquelin ; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] optimize vhost enqueue
> 
> On Wed, Aug 17, 2016 at 01:45:26AM +, Wang, Zhihong wrote:
> >
> >
> > > -Original Message-
> > > From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
> > > Sent: Tuesday, August 16, 2016 10:00 PM
> > > To: Wang, Zhihong ; dev at dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH] optimize vhost enqueue
> > >
> > > Hi Zhihong,
> > >
> > > On 08/16/2016 05:50 AM, Zhihong Wang wrote:
> > > > This patch optimizes the vhost enqueue function:
> rte_vhost_enqueue_burst.
> > > >
> > > > Currently there're 2 callbacks for vhost enqueue:
> > > >  *  virtio_dev_merge_rx for mrg_rxbuf turned on cases.
> > > >  *  virtio_dev_rx for mrg_rxbuf turned off cases.
> > > >
> > > > The virtio_dev_merge_rx doesn't provide optimal performance, also it is
> > > > reported having compatibility issue working with Windows VMs.
> > > Could you tell us more please about this compatibility issue?
> >
> >
> > For example, when you have testpmd in the host and Window VM as the
> guest,
> > with mrg_rxbuf turned on, the guest will hang once there's packets enqueued
> > by virtio_dev_merge_rx.
> 
> You should put it into commit log.


Okay.


> 
> > Let me know if you see the same issue.
> >
> >
> > >
> > > >
> > > > Besides, having 2 separated functions increases maintenance efforts.
> > > >
> > > > This patch uses a single function logic to replace the current 2 for
> > > > better maintainability, and provides better performance by optimizing
> > > > caching behavior especially for mrg_rxbuf turned on cases.
> 
> Here, here sounds two parts to me:
> 
> - one to unite mergeable and non-mergeable Rx
> 
> - another one to optimize the mergeable path
> 
> That means you should do it in two patches, with that we can have clear
> understanding what changes the performance boost. It also helps review.


Please see explanation below.


> 
> > > Do you have some benchmark comparison before and after your change?
> > >
> > > Also, for maintainability, I would suggest the that the enqueue
> > > function be split. Because vhost_enqueue_burst becomes very long (220
> > > LoC), and max level of indentation is too high (6).
> > >
> > > It makes the code hard to understand, and prone to miss bugs during
> > > review and maintenance.
> 
> Agreed.
> 
> >
> > This is something I've thought about while writing the code, the reason I
> > keep it as one function body is that:
> >
> >  1. This function is very performance sensitive, and we need full control of
> > code ordering (You can compare with the current performance with the
> > mrg_rxbuf feature turned on to see the difference).
> 
> Will inline functions help?


The optimization in this patch actually reorganizes the code at the logic
level, so it's not well suited to being split into separate functions.

I'll explain this in v2.


> 
> >  2. I somehow find that a single function logic makes it easier to 
> > understand,
> > surely I can add comments to make it easiler to read for .
> >
> > Please let me know if you still insist, we can discuss more on it.
> 
> I am personally not a fan of huge function; I would try hard to avoid
> too many levels of indentation as well.
> 
> >
> > >
> > > >
> > > > It also fixes the issue working with Windows VMs.
> > > Ideally, the fix should be sent separately, before the rework.
> > > Indeed, we might want to have the fix in the stable branch, without
> > > picking the optimization.
> 
> Agreed.


The fact is that I don't have much time to debug the current code, since
it's messy and I don't have the Windows virtio code or a debugging
environment.

This patch doesn't try to fix that issue; it rewrites the logic entirely,
and that happens to fix the issue.

Do you think integrating this whole patch into the stable branch would work?
Personally I think it makes more sense.


> 
> > >
> > > >
> > > > Signed-off-by: Zhihong Wang 
> > > > ---
> > > >  lib/librte_vhost/vhost-net.h  |   6 +-
> > > >  lib/librte_vhost/vhost_rxtx.c | 582
> ++
> > > >  lib

[dpdk-dev] [PATCH] optimize vhost enqueue

2016-08-17 Thread Wang, Zhihong


> -Original Message-
> From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
> Sent: Wednesday, August 17, 2016 5:18 PM
> To: Wang, Zhihong ; Yuanhan Liu
> 
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] optimize vhost enqueue
> 
> 
> 
> On 08/17/2016 08:41 AM, Wang, Zhihong wrote:
> >
> >
> >> -Original Message-
> >> From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> >> Sent: Wednesday, August 17, 2016 10:38 AM
> >> To: Wang, Zhihong 
> >> Cc: Maxime Coquelin ; dev at dpdk.org
> >> Subject: Re: [dpdk-dev] [PATCH] optimize vhost enqueue
> >>
> >> On Wed, Aug 17, 2016 at 01:45:26AM +, Wang, Zhihong wrote:
> >>>
> >>>
> >>>> -----Original Message-
> >>>> From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
> >>>> Sent: Tuesday, August 16, 2016 10:00 PM
> >>>> To: Wang, Zhihong ; dev at dpdk.org
> >>>> Subject: Re: [dpdk-dev] [PATCH] optimize vhost enqueue
> >>>>
> >>>> Hi Zhihong,
> >>>>
> >>>> On 08/16/2016 05:50 AM, Zhihong Wang wrote:
> >>>>> This patch optimizes the vhost enqueue function:
> >> rte_vhost_enqueue_burst.
> >>>>>
> >>>>> Currently there're 2 callbacks for vhost enqueue:
> >>>>>  *  virtio_dev_merge_rx for mrg_rxbuf turned on cases.
> >>>>>  *  virtio_dev_rx for mrg_rxbuf turned off cases.
> >>>>>
> >>>>> The virtio_dev_merge_rx doesn't provide optimal performance, also it is
> >>>>> reported having compatibility issue working with Windows VMs.
> >>>> Could you tell us more please about this compatibility issue?
> >>>
> >>>
> >>> For example, when you have testpmd in the host and Window VM as the
> >> guest,
> >>> with mrg_rxbuf turned on, the guest will hang once there's packets
> enqueued
> >>> by virtio_dev_merge_rx.
> >>
> >> You should put it into commit log.
> >
> >
> > Okay.
> >
> >
> >>
> >>> Let me know if you see the same issue.
> >>>
> >>>
> >>>>
> >>>>>
> >>>>> Besides, having 2 separated functions increases maintenance efforts.
> >>>>>
> >>>>> This patch uses a single function logic to replace the current 2 for
> >>>>> better maintainability, and provides better performance by optimizing
> >>>>> caching behavior especially for mrg_rxbuf turned on cases.
> >>
> >> Here, here sounds two parts to me:
> >>
> >> - one to unite mergeable and non-mergeable Rx
> >>
> >> - another one to optimize the mergeable path
> >>
> >> That means you should do it in two patches, with that we can have clear
> >> understanding what changes the performance boost. It also helps review.
> >
> >
> > Please see explanation below.
> >
> >
> >>
> >>>> Do you have some benchmark comparison before and after your change?
> >>>>
> >>>> Also, for maintainability, I would suggest the that the enqueue
> >>>> function be split. Because vhost_enqueue_burst becomes very long (220
> >>>> LoC), and max level of indentation is too high (6).
> >>>>
> >>>> It makes the code hard to understand, and prone to miss bugs during
> >>>> review and maintenance.
> >>
> >> Agreed.
> >>
> >>>
> >>> This is something I've thought about while writing the code, the reason I
> >>> keep it as one function body is that:
> >>>
> >>>  1. This function is very performance sensitive, and we need full control 
> >>> of
> >>> code ordering (You can compare with the current performance with the
> >>> mrg_rxbuf feature turned on to see the difference).
> >>
> >> Will inline functions help?
> >
> >
> > Optimization in this patch actually reorganizes the code from its logic,
> > so it's not suitable for making separated functions.
> >
> > I'll explain this in v2.
> 
> I agree with Yuanhan.
> Inline functions should not break the optimizations.
> IMHO, this is mandatory for the patch to be accepted.


Excellent!


> 
> >
> >
> >>
> >>>  2. I somehow find that a single function logic makes it easier to 
> 

[dpdk-dev] [PATCH] optimize vhost enqueue

2016-08-18 Thread Wang, Zhihong
Thanks Maxime and Yuanhan for your review and suggestions!
Please help review the v2 of this patch.


> -Original Message-
> From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> Sent: Wednesday, August 17, 2016 5:51 PM
> To: Maxime Coquelin 
> Cc: Wang, Zhihong ; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] optimize vhost enqueue
> 
> On Wed, Aug 17, 2016 at 11:17:46AM +0200, Maxime Coquelin wrote:
> > >>>This is something I've thought about while writing the code, the reason I
> > >>>keep it as one function body is that:
> > >>>
> > >>> 1. This function is very performance sensitive, and we need full 
> > >>> control of
> > >>>code ordering (You can compare with the current performance with
> the
> > >>>mrg_rxbuf feature turned on to see the difference).
> > >>
> > >>Will inline functions help?
> > >
> > >
> > >Optimization in this patch actually reorganizes the code from its logic,
> > >so it's not suitable for making separated functions.
> > >
> > >I'll explain this in v2.
> >
> > I agree with Yuanhan.
> > Inline functions should not break the optimizations.
> > IMHO, this is mandatory for the patch to be accepted.
> 
> Yes.
> 
> > It seems you are not the only one facing the issue:
> > https://github.com/YanVugenfirer/kvm-guest-drivers-windows/issues/70
> >
> > So a dedicated fix is really important.
> 
> Yes.
> 
> >
> > >This patch doesn't try to fix this issue, it rewrites the logic totally,
> > >and somehow fixes this issue.
> > >
> > >Do you think integrating this whole patch into the stable branch will work?
> > >Personally I think it makes more sense.
> >
> > No.
> > We don't even know why/how it fixes the Windows issue, which would be
> > the first thing to understand before integrating a fix in stable branch.
> 
> Yes.
> 
> >
> > And the stable branch is not meant for integrating such big reworks,
> > it is only meant to fix bugs.
> 
> Yes.
> 
> > The risk of regressions have to be avoided as much as possible.
> 
> Yes.
> 
>   --yliu


[dpdk-dev] [PATCH v2 1/6] vhost: rewrite enqueue

2016-08-19 Thread Wang, Zhihong


> -Original Message-
> From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> Sent: Friday, August 19, 2016 10:39 AM
> To: Wang, Zhihong 
> Cc: dev at dpdk.org; maxime.coquelin at redhat.com
> Subject: Re: [PATCH v2 1/6] vhost: rewrite enqueue
> 
> On Thu, Aug 18, 2016 at 02:33:06AM -0400, Zhihong Wang wrote:
> > This patch implements the vhost logic from scratch into a single function
> > designed for high performance and better maintainability.
> >
> > Signed-off-by: Zhihong Wang 
> > ---
> >  lib/librte_vhost/vhost_rxtx.c | 212
> --
> >  1 file changed, 205 insertions(+), 7 deletions(-)
> >
> > diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
> > index 08a73fd..8e6d782 100644
> > --- a/lib/librte_vhost/vhost_rxtx.c
> > +++ b/lib/librte_vhost/vhost_rxtx.c
> > @@ -91,7 +91,7 @@ is_valid_virt_queue_idx(uint32_t idx, int is_tx, uint32_t
> qp_nb)
> > return (is_tx ^ (idx & 1)) == 0 && idx < qp_nb * VIRTIO_QNUM;
> >  }
> >
> > -static void
> > +static inline void __attribute__((always_inline))
> >  virtio_enqueue_offload(struct rte_mbuf *m_buf, struct virtio_net_hdr
> *net_hdr)
> >  {
> > if (m_buf->ol_flags & PKT_TX_L4_MASK) {
> > @@ -533,19 +533,217 @@ virtio_dev_merge_rx(struct virtio_net *dev,
> uint16_t queue_id,
> > return pkt_idx;
> >  }
> >
> > +static inline uint32_t __attribute__((always_inline))
> > +loop_check(struct vhost_virtqueue *vq, uint16_t avail_idx, uint32_t 
> > pkt_left)
> > +{
> > +   if (pkt_left == 0 || avail_idx == vq->last_used_idx)
> > +   return 1;
> > +
> > +   return 0;
> > +}
> 
> Hmmm, I don't see any benifit from making such simple check into a
> function.

It's split out so that prefetch code can be merged into this function later.
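Roughly like this (a sketch of the intent, not the posted code):

    static inline uint32_t __attribute__((always_inline))
    loop_check(struct vhost_virtqueue *vq, uint16_t avail_idx, uint32_t pkt_left)
    {
            if (pkt_left == 0 || avail_idx == vq->last_used_idx)
                    return 1;
            /* prefetch the descriptor for the next packet */
            rte_prefetch0(&vq->desc[vq->avail->ring[vq->last_used_idx &
                                                    (vq->size - 1)]]);
            return 0;
    }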

> 
> > +static inline uint32_t __attribute__((always_inline))
> > +enqueue_packet(struct virtio_net *dev, struct vhost_virtqueue *vq,
> > +   uint16_t avail_idx, struct rte_mbuf *mbuf,
> > +   uint32_t is_mrg_rxbuf)
> > +{
> > +   struct virtio_net_hdr_mrg_rxbuf *virtio_hdr;
> > +   struct vring_desc *desc;
> > +   uint64_t desc_host_write_addr = 0;
> > +   uint32_t desc_chain_head = 0;
> > +   uint32_t desc_chain_len = 0;
> > +   uint32_t desc_current = 0;
> > +   uint32_t desc_write_offset = 0;
> > +   uint32_t mbuf_len = 0;
> > +   uint32_t mbuf_len_left = 0;
> > +   uint32_t copy_len = 0;
> 
> The dequeue function uses var like desc_addr, desc_avail, desc_offset,
> mbuf_avail, ..., I see no reason to use something different here. This
> breaks the code consistency. Besides that, var name like desc_host_write_addr
> looks redundant; desc_addr is much cleaner.

Okay.

> 
>   --yliu


[dpdk-dev] [PATCH v2 2/6] vhost: remove obsolete

2016-08-19 Thread Wang, Zhihong


> -Original Message-
> From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> Sent: Friday, August 19, 2016 10:33 AM
> To: Wang, Zhihong 
> Cc: dev at dpdk.org; maxime.coquelin at redhat.com
> Subject: Re: [PATCH v2 2/6] vhost: remove obsolete
> 
> On Thu, Aug 18, 2016 at 02:33:07AM -0400, Zhihong Wang wrote:
> > This patch removes obsolete functions.
> 
> Splitting patches doesn't work in this way: this should be in the first
> patch. Otherwise, build breaks in the first patch, as some functions are
> defined but not used.

Thanks. I'll send out v3 soon, also to fix a small glitch when running
on older platforms like SNB and IVB.

> 
>   --yliu


[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-08-23 Thread Wang, Zhihong
> Subject: Re: [PATCH v3 0/5] vhost: optimize enqueue
> 
> Hi Zhihong,
> 
[...]
> > The main optimization techniques are:
> >
> >  1. Reorder code to reduce CPU pipeline stall cycles.
> >
> >  2. Batch update the used ring for better efficiency.
> >
> >  3. Prefetch descriptor to hide cache latency.
> >
> >  4. Remove useless volatile attribute to allow compiler optimization.
> 
> Thanks for these details, this is helpful to understand where the perf
> gain comes from.
> I would suggest to add these information as comments in the code
> where/if it makes sense. If more a general comment, at least add it in
> the commit message of the patch introducing it.
> Indeed, adding it to the cover letter is fine, but the information is
> lost as soon as the series is applied.

Hi Maxime,

I did add this info in the later optimization patches to explain each
optimization technique. The v1 was indeed hard to read.


> 
> You don't mention any figures, so I set up a benchmark on my side to
> evaluate your series. It indeed shows an interesting performance gain.
> 
> My setup consists of one host running a guest.
> The guest generates as much 64bytes packets as possible using
> pktgen-dpdk. The hosts forwards received packets back to the guest
> using testpmd on vhost pmd interface. Guest's vCPUs are pinned to
> physical CPUs.
> 

Thanks for doing the test!

I didn't publish any numbers since the gain varies across platforms and
test setups.

In my phy-to-VM test on both IVB and HSW, where testpmd in the host receives
from the NIC and enqueues to the guest, the enqueue efficiency (cycles per
packet) of the v3 patch is 2.4x and 1.4x that of the current code, for
mergeable on and mergeable off respectively.


> I tested it with and without your v1 patch, with and without
> rx-mergeable feature turned ON.
> Results are the average of 8 runs of 60 seconds:
> 
> Rx-Mergeable ON : 7.72Mpps
> Rx-Mergeable ON + "vhost: optimize enqueue" v1: 9.19Mpps
> Rx-Mergeable OFF: 10.52Mpps
> Rx-Mergeable OFF + "vhost: optimize enqueue" v1: 10.60Mpps
> 
> Regards,
> Maxime


[dpdk-dev] [PATCH v3 1/5] vhost: rewrite enqueue

2016-08-23 Thread Wang, Zhihong
Hi Maxime,

Thanks very much for the detailed review.

> -Original Message-
> From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
> Sent: Monday, August 22, 2016 5:36 PM
> To: Wang, Zhihong ; dev at dpdk.org
> Cc: yuanhan.liu at linux.intel.com
> Subject: Re: [PATCH v3 1/5] vhost: rewrite enqueue
> 
> 
> 
> On 08/19/2016 07:43 AM, Zhihong Wang wrote:
> > This patch implements the vhost logic from scratch into a single function
> > designed for high performance and better maintainability.
> >
> > ---
> > Changes in v3:
> >
> >  1. Rewrite enqueue and delete the obsolete in the same patch.
> >
> > Signed-off-by: Zhihong Wang 
> > ---
> >  lib/librte_vhost/vhost_rxtx.c | 537 
> > +-
> >  1 file changed, 160 insertions(+), 377 deletions(-)
> >
> > diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
> > index 08a73fd..b09a9c3 100644
> > --- a/lib/librte_vhost/vhost_rxtx.c
> > +++ b/lib/librte_vhost/vhost_rxtx.c
> > @@ -91,7 +91,7 @@ is_valid_virt_queue_idx(uint32_t idx, int is_tx, uint32_t
> qp_nb)
> > return (is_tx ^ (idx & 1)) == 0 && idx < qp_nb * VIRTIO_QNUM;
> >  }
> >
> > -static void
> > +static inline void __attribute__((always_inline))
> >  virtio_enqueue_offload(struct rte_mbuf *m_buf, struct virtio_net_hdr
> *net_hdr)
> >  {
> > if (m_buf->ol_flags & PKT_TX_L4_MASK) {
> > @@ -125,427 +125,210 @@ virtio_enqueue_offload(struct rte_mbuf *m_buf,
> struct virtio_net_hdr *net_hdr)
> > }
> >  }
> >
> > -static inline void
> > -copy_virtio_net_hdr(struct virtio_net *dev, uint64_t desc_addr,
> > -   struct virtio_net_hdr_mrg_rxbuf hdr)
> > +static inline uint32_t __attribute__((always_inline))
> > +loop_check(struct vhost_virtqueue *vq, uint16_t avail_idx, uint32_t 
> > pkt_left)
> Creating a function just for doing this doesn't make much sense.
> And the function name doesn't help.
> I think you should just remove this function.

Okay.

> 
> >  {
> > -   if (dev->vhost_hlen == sizeof(struct virtio_net_hdr_mrg_rxbuf))
> > -   *(struct virtio_net_hdr_mrg_rxbuf *)(uintptr_t)desc_addr = hdr;
> > -   else
> > -   *(struct virtio_net_hdr *)(uintptr_t)desc_addr = hdr.hdr;
> > +   if (pkt_left == 0 || avail_idx == vq->last_used_idx)
> > +   return 1;
> > +
> > +   return 0;
> >  }
> >
> > -static inline int __attribute__((always_inline))
> > -copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
> > - struct rte_mbuf *m, uint16_t desc_idx)
> > +static inline uint32_t __attribute__((always_inline))
> > +enqueue_packet(struct virtio_net *dev, struct vhost_virtqueue *vq,
> > +   uint16_t avail_idx, struct rte_mbuf *mbuf,
> > +   uint32_t is_mrg_rxbuf)
> >  {
> > -   uint32_t desc_avail, desc_offset;
> > -   uint32_t mbuf_avail, mbuf_offset;
> > -   uint32_t cpy_len;
> > +   struct virtio_net_hdr_mrg_rxbuf *virtio_hdr;
> > struct vring_desc *desc;
> > -   uint64_t desc_addr;
> > -   struct virtio_net_hdr_mrg_rxbuf virtio_hdr = {{0, 0, 0, 0, 0, 0}, 0};
> > -
> > -   desc = &vq->desc[desc_idx];
> > +   uint64_t desc_addr = 0;
> > +   uint32_t desc_chain_head = 0;
> > +   uint32_t desc_chain_len = 0;
> > +   uint32_t desc_current = 0;
> > +   uint32_t desc_offset = 0;
> > +   uint32_t mbuf_len = 0;
> > +   uint32_t mbuf_avail = 0;
> > +   uint32_t copy_len = 0;
> > +   uint32_t extra_buffers = 0;
> > +   uint32_t used_idx_round = 0;
> Most of these variables don't need to be initialized.

Okay.

> 
> > +
> > +   /* start with the first mbuf of the packet */
> > +   mbuf_len = rte_pktmbuf_data_len(mbuf);
> > +   mbuf_avail = mbuf_len;
> > +
> > +   /* get the current desc */
> > +   desc_current = vq->avail->ring[(vq->last_used_idx) & (vq->size - 1)];
> > +   desc_chain_head = desc_current;
> > +   desc = &vq->desc[desc_current];
> > desc_addr = gpa_to_vva(dev, desc->addr);
> > -   /*
> > -* Checking of 'desc_addr' placed outside of 'unlikely' macro to avoid
> > -* performance issue with some versions of gcc (4.8.4 and 5.3.0) which
> > -* otherwise stores offset on the stack instead of in a register.
> > -*/
> > -   if (unlikely(desc->len < dev->vhost_hlen) || !desc_addr)
> > -   return -1;
> > -
> > -   rte_prefetch0((void *)(uintptr_t)des

[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-08-23 Thread Wang, Zhihong


> -Original Message-
> From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
> Sent: Monday, August 22, 2016 6:02 PM
> To: Wang, Zhihong ; dev at dpdk.org
> Cc: yuanhan.liu at linux.intel.com
> Subject: Re: [PATCH v3 0/5] vhost: optimize enqueue
> 
> 
> On 08/22/2016 10:11 AM, Maxime Coquelin wrote:
> > Hi Zhihong,
> >
> > On 08/19/2016 07:43 AM, Zhihong Wang wrote:
> > > This patch set optimizes the vhost enqueue function.
> > >
> > > It implements the vhost logic from scratch into a single function
> > > designed
> > > for high performance and good maintainability, and improves CPU
> > > efficiency
> > > significantly by optimizing cache access, which means:
> > >
> > >  *  For fast frontends (eg. DPDK virtio pmd), higher performance
> (maximum
> > > throughput) can be achieved.
> > >
> > >  *  For slow frontends (eg. kernel virtio-net), better scalability can be
> > > achieved, each vhost core can support more connections since it takes
> > > less cycles to handle each single frontend.
> > >
> > > The main optimization techniques are:
> > >
> > >  1. Reorder code to reduce CPU pipeline stall cycles.
> > >
> > >  2. Batch update the used ring for better efficiency.
> > >
> > >  3. Prefetch descriptor to hide cache latency.
> > >
> > >  4. Remove useless volatile attribute to allow compiler optimization.
> >
> > Thanks for these details, this is helpful to understand where the perf
> > gain comes from.
> > I would suggest to add these information as comments in the code
> > where/if it makes sense. If more a general comment, at least add it in
> > the commit message of the patch introducing it.
> > Indeed, adding it to the cover letter is fine, but the information is
> > lost as soon as the series is applied.
> >
> > You don't mention any figures, so I set up a benchmark on my side to
> > evaluate your series. It indeed shows an interesting performance gain.
> >
> > My setup consists of one host running a guest.
> > The guest generates as much 64bytes packets as possible using
> > pktgen-dpdk. The hosts forwards received packets back to the guest
> > using testpmd on vhost pmd interface. Guest's vCPUs are pinned to
> > physical CPUs.
> >
> > I tested it with and without your v1 patch, with and without
> > rx-mergeable feature turned ON.
> > Results are the average of 8 runs of 60 seconds:
> >
> > Rx-Mergeable ON : 7.72Mpps
> > Rx-Mergeable ON + "vhost: optimize enqueue" v1: 9.19Mpps
> > Rx-Mergeable OFF: 10.52Mpps
> > Rx-Mergeable OFF + "vhost: optimize enqueue" v1: 10.60Mpps
> >
> I forgot to add that before this series, I think we should first fix the 
> windows bug.
> Else we will need a dedicated fix for the stable branch.

Okay, I'll try to fix it, though I can't make any promises at present.

I have tried once but stopped, since we don't have enough debug info from the
frontend side, so basically I was debugging the backend based on guesses.


> 
> Regards,
> Maxime



[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-08-23 Thread Wang, Zhihong


> -Original Message-
> From: Wang, Zhihong
> Sent: Tuesday, August 23, 2016 10:31 AM
> To: Maxime Coquelin ; dev at dpdk.org
> Cc: yuanhan.liu at linux.intel.com
> Subject: RE: [PATCH v3 0/5] vhost: optimize enqueue
> 
> 
> 
> > -Original Message-
> > From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
> > Sent: Monday, August 22, 2016 6:02 PM
> > To: Wang, Zhihong ; dev at dpdk.org
> > Cc: yuanhan.liu at linux.intel.com
> > Subject: Re: [PATCH v3 0/5] vhost: optimize enqueue
> >
> >
> > On 08/22/2016 10:11 AM, Maxime Coquelin wrote:
> > > Hi Zhihong,
> > >
> > > On 08/19/2016 07:43 AM, Zhihong Wang wrote:
> > > > This patch set optimizes the vhost enqueue function.
> > > >
> > > > It implements the vhost logic from scratch into a single function
> > > > designed
> > > > for high performance and good maintainability, and improves CPU
> > > > efficiency
> > > > significantly by optimizing cache access, which means:
> > > >
> > > >  *  For fast frontends (eg. DPDK virtio pmd), higher performance
> > (maximum
> > > > throughput) can be achieved.
> > > >
> > > >  *  For slow frontends (eg. kernel virtio-net), better scalability can 
> > > > be
> > > > achieved, each vhost core can support more connections since it 
> > > > takes
> > > > less cycles to handle each single frontend.
> > > >
> > > > The main optimization techniques are:
> > > >
> > > >  1. Reorder code to reduce CPU pipeline stall cycles.
> > > >
> > > >  2. Batch update the used ring for better efficiency.
> > > >
> > > >  3. Prefetch descriptor to hide cache latency.
> > > >
> > > >  4. Remove useless volatile attribute to allow compiler optimization.
> > >
> > > Thanks for these details, this is helpful to understand where the perf
> > > gain comes from.
> > > I would suggest to add these information as comments in the code
> > > where/if it makes sense. If more a general comment, at least add it in
> > > the commit message of the patch introducing it.
> > > Indeed, adding it to the cover letter is fine, but the information is
> > > lost as soon as the series is applied.
> > >
> > > You don't mention any figures, so I set up a benchmark on my side to
> > > evaluate your series. It indeed shows an interesting performance gain.
> > >
> > > My setup consists of one host running a guest.
> > > The guest generates as much 64bytes packets as possible using
> > > pktgen-dpdk. The hosts forwards received packets back to the guest
> > > using testpmd on vhost pmd interface. Guest's vCPUs are pinned to
> > > physical CPUs.
> > >
> > > I tested it with and without your v1 patch, with and without
> > > rx-mergeable feature turned ON.
> > > Results are the average of 8 runs of 60 seconds:
> > >
> > > Rx-Mergeable ON : 7.72Mpps
> > > Rx-Mergeable ON + "vhost: optimize enqueue" v1: 9.19Mpps
> > > Rx-Mergeable OFF: 10.52Mpps
> > > Rx-Mergeable OFF + "vhost: optimize enqueue" v1: 10.60Mpps
> > >
> > I forgot to add that before this series, I think we should first fix the 
> > windows
> bug.
> > Else we will need a dedicated fix for the stable branch.
> 
> Okay I'll try to fix it, though I can't make any promises at present.
> 
> Have tried once but stopped since we don't have enough debug info from the
> frontend side so basically I was debugging the backend based on guesses.

Hi Maxime, Yuanhan,

I've identified the root cause. Do you think it makes sense to put the fix
in the same patch set, or send it as a separate patch?


Thanks
Zhihong

> 
> 
> >
> > Regards,
> > Maxime



[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

2016-08-24 Thread Wang, Zhihong


> -Original Message-
> From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> Sent: Monday, August 22, 2016 6:35 PM
> To: Maxime Coquelin ; Wang, Zhihong
> ; yuanhan.liu at linux.intel.com
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> 
> 2016-08-22 12:01, Maxime Coquelin:
> > I forgot to add that before this series, I think we should first fix the
> windows bug.
> > Else we will need a dedicated fix for the stable branch.
> 
> This is a funny situation :)
> If Zhihong had reworked the code without mentioning it is fixing a scenario
> with Windows guests, maybe that nobody would have notice ;) That's
> probably why it is not written in v2/v3. But thanks to the v1, we all know it:
>   "It also fixes the issue working with Windows VMs."

I thought it'd be more appropriate to send a dedicated fix for the stable
branch, so I removed this info.

> 
> So yes, it would be a lot better to find the root cause and try to have a
> minimal fix for 16.07, then rework the code for performance in 16.11.
> I think we must avoid silent fixes, and even more, avoid writing specific 
> fixes
> for stable branches without validating them in the master branch and its large
> users base.

Okay, that's also what Maxime and Yuanhan suggest.

BTW the root cause has been identified and the fix will be in v4.

> 
> Thanks for your good works guys, DPDK vhost is improving very well.


[dpdk-dev] [PATCH v3 4/5] vhost: batch update used ring

2016-08-25 Thread Wang, Zhihong


> -Original Message-
> From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> Sent: Thursday, August 25, 2016 11:48 AM
> To: Wang, Zhihong 
> Cc: dev at dpdk.org; maxime.coquelin at redhat.com
> Subject: Re: [PATCH v3 4/5] vhost: batch update used ring
> 
> On Fri, Aug 19, 2016 at 01:43:49AM -0400, Zhihong Wang wrote:
> > This patch enables batch update of the used ring for better efficiency.
> >
> > Signed-off-by: Zhihong Wang 
> ...
> > diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
> > index 1785695..87d09fa 100644
> > --- a/lib/librte_vhost/virtio-net.c
> > +++ b/lib/librte_vhost/virtio-net.c
> > @@ -152,10 +152,14 @@ cleanup_device(struct virtio_net *dev, int
> destroy)
> >  static void
> >  free_device(struct virtio_net *dev)
> >  {
> > +   struct vhost_virtqueue *vq;
> > uint32_t i;
> >
> > -   for (i = 0; i < dev->virt_qp_nb; i++)
> > -   rte_free(dev->virtqueue[i * VIRTIO_QNUM]);
> > +   for (i = 0; i < dev->virt_qp_nb; i++) {
> > +   vq = dev->virtqueue[i * VIRTIO_QNUM];
> > +   rte_free(vq->shadow_used_ring);
> > +   rte_free(vq);
> > +   }
> > rte_free(dev);
> >  }
> > @@ -418,13 +422,18 @@ int
> >  vhost_set_vring_num(int vid, struct vhost_vring_state *state)
> >  {
> > struct virtio_net *dev;
> > +   struct vhost_virtqueue *vq;
> >
> > dev = get_device(vid);
> > if (dev == NULL)
> > return -1;
> >
> > /* State->index refers to the queue index. The txq is 1, rxq is 0. */
> > -   dev->virtqueue[state->index]->size = state->num;
> > +   vq = dev->virtqueue[state->index];
> > +   vq->size = state->num;
> > +   vq->shadow_used_ring = rte_malloc("",
> > +   vq->size * sizeof(struct vring_used_elem),
> > +   RTE_CACHE_LINE_SIZE);
> 
> Few notes here:
> 
> - I think the typical way to not specific a string type is using NULL,
>   but not "".
> 
> - You should check the return value of rte_malloc: it could fail.
> 
> - Note that free_device() is invoked only when the vhost-user connection
>   is broken (say the guest is halt). However, vhost_set_vring_num() could
>   be invoked many times for a connection, say when you restart testpmd
>   many times. This would lead to memory leak.
> 
>   The right way is to free it on get_vring_base().

Good catch! Thanks.
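
Roughly what I have in mind for v4 (a sketch; exact error handling and
placement may differ in the actual patch):

	vq = dev->virtqueue[state->index];
	vq->size = state->num;
	if (!vq->shadow_used_ring) {
		vq->shadow_used_ring = rte_malloc(NULL,
				vq->size * sizeof(struct vring_used_elem),
				RTE_CACHE_LINE_SIZE);
		if (!vq->shadow_used_ring) {
			RTE_LOG(ERR, VHOST_CONFIG,
				"Failed to allocate shadow used ring.\n");
			return -1;
		}
	}

	/* ... and free it in get_vring_base() instead of free_device(): */
	rte_free(vq->shadow_used_ring);
	vq->shadow_used_ring = NULL;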

> 
>   --yliu


[dpdk-dev] [PATCH 1/3] eal/x86: fix build with clang for old AVX

2016-02-04 Thread Wang, Zhihong
> Subject: [PATCH 1/3] eal/x86: fix build with clang for old AVX
> 
> When configuring RTE_MACHINE to "default", rte_memcpy implementation
> is the default one (old AVX).
> In this code, clang raises a warning thanks to -Wsometimes-uninitialized:
> 
> rte_memcpy.h:838:6: error:
> variable 'srcofs' is used uninitialized whenever 'if' condition is false
> if (dstofss > 0) {
> ^~~
> rte_memcpy.h:849:6: note: uninitialized use occurs here
> if (srcofs == 0) {
> ^~
> 
> It is fixed by initializing srcofs to 0.
> 
> Fixes: 1ae817f9f887 ("eal/x86: tune memcpy for platforms without AVX512")
> 
> Signed-off-by: Thomas Monjalon 
> ---
>  lib/librte_eal/common/include/arch/x86/rte_memcpy.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)


Hi Thomas,

Thanks for pointing this out!
My last hasty modification on this is not correct.

The patch below will fix it. All modifications are tested.
Sorry for all the hassle! :'(

"srcofs" should be calculated based on source address anyway.


--- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
@@ -512,8 +512,9 @@ COPY_BLOCK_64_BACK31:
/**
 * Make store aligned when copy size exceeds 512 bytes
 */
-   dstofss = 32 - ((uintptr_t)dst & 0x1F);
+   dstofss = (uintptr_t)dst & 0x1F;
if (dstofss > 0) {
+   dstofss = 32 - dstofss;
n -= dstofss;
rte_mov32((uint8_t *)dst, (const uint8_t *)src);
src = (const uint8_t *)src + dstofss;
@@ -834,14 +835,15 @@ COPY_BLOCK_64_BACK15:
 * unaligned copy functions require up to 15 bytes
 * backwards access.
 */
-   dstofss = 16 - ((uintptr_t)dst & 0x0F) + 16;
+   dstofss = (uintptr_t)dst & 0x0F;
if (dstofss > 0) {
+   dstofss = 16 - dstofss + 16;
n -= dstofss;
rte_mov32((uint8_t *)dst, (const uint8_t *)src);
src = (const uint8_t *)src + dstofss;
dst = (uint8_t *)dst + dstofss;
-   srcofs = ((uintptr_t)src & 0x0F);
}
+   srcofs = ((uintptr_t)src & 0x0F);

/**
 * For aligned copy



[dpdk-dev] [dpdk-dev, v3] Implement memcmp using Intel SIMD instrinsics.

2016-02-23 Thread Wang, Zhihong
> > It'd be great if you could format this patch into a patch set with several
> > little ones. :-)
> > Also, the kernel checkpatch is very helpful.
> > Good coding style and patch organization make it easy for in-depth reviews.
> > 
> Combination of scalar and vector (32/64/128) was done to get optimal 
> performance numbers. If there is enough interest in this I can work on it and 
> provide an updated patch set.

That'll be very helpful! Looking forward to your patch :)
BTW, have you tested performance with your patch in a real example?


[dpdk-dev] [PATCH RFC] Memcpy optimization

2014-11-14 Thread Wang, Zhihong
assembly code
* Remove slow glibc call for constant copies

Current memcpy performance test is in "test_memcpy_perf.c", which will also be 
updated with unaligned test cases.

4. Glibc memcpy analysis

Glibc 2.16 (Fedora 20) and 2.20 (Currently the latest, released on Sep 07, 
2014) are analyzed.

Glibc 2.16 issues:
* No support for 256-bit load/store
* Significant slowdown for unaligned constant cases due to split loads and 4k 
aliasing

Glibc 2.20 issue:
* Removed load address alignment, which can lead to a significant slowdown for
unaligned cases on earlier architectures like Sandy Bridge

Also, calls to glibc can't be optimized by gcc at compile time.

Acknowledgements

Valuable suggestions from: Liang Cunming, Zhu Heqing, Bruce Richardson, and 
Chen Wenjun.

Author's Address

Wang Zhihong (John)
Email: zhihong.wang at intel.com



[dpdk-dev] [PATCH 0/6] vhost/virtio performance loopback utility

2016-05-20 Thread Wang, Zhihong

> -Original Message-
> From: Wang, Zhihong
> Sent: Friday, May 6, 2016 6:47 AM
> To: dev at dpdk.org
> Cc: Ananyev, Konstantin ; Richardson, Bruce
> ; thomas.monjalon at 6wind.com
> Subject: [PATCH 0/6] vhost/virtio performance loopback utility
> 

Hi Thomas, Bruce,

Do you have any comments on this patch?

Thanks
Zhihong

> This patch enables vhost/virtio pmd performance loopback test in testpmd.
> All the features are for general usage.



[dpdk-dev] [PATCH 1/6] testpmd: add io_retry forwarding

2016-05-26 Thread Wang, Zhihong


> -Original Message-
> From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> Sent: Wednesday, May 25, 2016 5:32 PM
> To: Wang, Zhihong 
> Cc: dev at dpdk.org; Ananyev, Konstantin ;
> Richardson, Bruce ; De Lara Guarch, Pablo
> 
> Subject: Re: [PATCH 1/6] testpmd: add io_retry forwarding
> 
> 2016-05-05 18:46, Zhihong Wang:
> > --- a/app/test-pmd/testpmd.h
> > +++ b/app/test-pmd/testpmd.h
> >  extern struct fwd_engine io_fwd_engine;
> > +extern struct fwd_engine io_retry_fwd_engine;
> >  extern struct fwd_engine mac_fwd_engine;
> >  extern struct fwd_engine mac_retry_fwd_engine;
> >  extern struct fwd_engine mac_swap_engine;
> 
> We now have 2 engines with "retry" behaviour.
> It is maybe the way to go, but I want to ask the question:
> Would it be possible to have "retry" as an engine parameter?
> 

If it's just about how the commands are written there isn't much difference,
e.g. "set fwd io_retry" vs. "set fwd io retry".

Do you mean to add "retry" for all engines, and implement it as a parameter
in each original engine? So, for example, no iofwd-retry.c, just adding this
feature inside iofwd.c?


[dpdk-dev] [PATCH 2/6] testpmd: configurable tx_first burst number

2016-05-26 Thread Wang, Zhihong


> -Original Message-
> From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> Sent: Wednesday, May 25, 2016 5:35 PM
> To: Wang, Zhihong 
> Cc: dev at dpdk.org; Ananyev, Konstantin ;
> Richardson, Bruce ; De Lara Guarch, Pablo
> 
> Subject: Re: [PATCH 2/6] testpmd: configurable tx_first burst number
> 
> 2016-05-05 18:46, Zhihong Wang:
> > This patch enables configurable tx_first burst number.
> >
> > Use "start tx_first (burst_num)" to specify how many bursts of packets to
> > be sent before forwarding start, or "start tx_first" like before for the
> > default 1 burst send.
> 
> The idea here is to fill the loopback latency gap with bursts.
> Would it be possible to make it automatic by detecting the first
> received packets to stop Tx generator?

The idea is great! The implementation might not be graceful though:
the current tx_first mode calls the txonly engine before calling the
actual engine, say iofwd, so iofwd is not established before tx_first
is done, and therefore there is nothing to detect against.

It's possible to do this, but we'd need to implement another forwarding
engine like "io_retry_fill_first" just for it, which complicates testpmd
for this loopback test alone.

It looks to me like it's better to use a combination of existing fwd engines
to do this; it's also more flexible thanks to the burst number parameter.
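
For example, with this patch set the loopback setup boils down to the two
new knobs (illustrative; exact command syntax may still change in v2):

    testpmd> set fwd io_retry      # retry variant of the io engine
    testpmd> start tx_first 16     # send 16 bursts before forwarding starts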


[dpdk-dev] [PATCH 6/6] testpmd: update documentation

2016-05-26 Thread Wang, Zhihong


> -Original Message-
> From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> Sent: Wednesday, May 25, 2016 5:48 PM
> To: Wang, Zhihong 
> Cc: dev at dpdk.org; Ananyev, Konstantin ;
> Richardson, Bruce ; De Lara Guarch, Pablo
> 
> Subject: Re: [PATCH 6/6] testpmd: update documentation
> 
> 2016-05-05 18:47, Zhihong Wang:
> > This patch updates documentation for testpmd.
> 
> Please avoid such doc update patch.
> It is preferred to have the doc changes in the patch changing the code.
> Thanks for the good work!


Thanks for the hint! Will update in v2.


[dpdk-dev] [PATCH 4/6] testpmd: handle all rxqs in rss setup

2016-05-26 Thread Wang, Zhihong


> -Original Message-
> From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> Sent: Wednesday, May 25, 2016 5:42 PM
> To: Wang, Zhihong 
> Cc: dev at dpdk.org; Ananyev, Konstantin ;
> Richardson, Bruce ; De Lara Guarch, Pablo
> 
> Subject: Re: [PATCH 4/6] testpmd: handle all rxqs in rss setup
> 
> 2016-05-05 18:46, Zhihong Wang:
> > This patch removes constraints in rxq handling when multiqueue is enabled
> > to handle all the rxqs.
> >
> > Current testpmd forces a dedicated core for each rxq, some rxqs may be
> > ignored when core number is less than rxq number, and that causes confusion
> > and inconvenience.
> 
> I have the feeling that "constraints", "confusion" and "inconvenience"
> should be more explained.
> Please give some examples with not enough and too much cores. Thanks

Sure, I will add a detailed description in v2  ;)


[dpdk-dev] [PATCH 5/6] testpmd: show topology at forwarding start

2016-05-26 Thread Wang, Zhihong


> -Original Message-
> From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> Sent: Wednesday, May 25, 2016 5:45 PM
> To: Wang, Zhihong 
> Cc: dev at dpdk.org; Ananyev, Konstantin ;
> Richardson, Bruce ; De Lara Guarch, Pablo
> 
> Subject: Re: [PATCH 5/6] testpmd: show topology at forwarding start
> 
> 2016-05-05 18:47, Zhihong Wang:
> > This patch show topology at forwarding start.
> >
> > "show config fwd" also does this, but showing it directly can reduce the
> > possibility of misconfiguration.
> [...]
> > -   fwd_config_setup();
> > +   fwd_config_display();
> > rxtx_config_display();
> 
> Having _display() calling _setup() is really strange.
> Maybe it is worth to be fixed in this patch.

It looks strange to me too. Will look for a fix.



[dpdk-dev] [PATCH v1 2/2] Test cases for rte_memcmp functions

2016-05-26 Thread Wang, Zhihong


> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Ravi Kerur
> Sent: Tuesday, March 8, 2016 7:01 AM
> To: dev at dpdk.org
> Subject: [dpdk-dev] [PATCH v1 2/2] Test cases for rte_memcmp functions
> 
> v1:
> This patch adds test cases for rte_memcmp functions.
> New rte_memcmp functions can be tested via 'make test'
> and 'testpmd' utility.
> 
> Compiled and tested on Ubuntu 14.04(non-NUMA) and
> 15.10(NUMA) systems.
[...]

> +/
> ***
> + * Memcmp function performance test configuration section. Each performance
> test
> + * will be performed MEMCMP_ITERATIONS times.
> + *
> + * The five arrays below control what tests are performed. Every combination
> + * from the array entries is tested.
> + */
> +#define MEMCMP_ITERATIONS (500 * 500 * 500)


Maybe fewer iterations would make the test faster without compromising precision?


> +
> +static size_t memcmp_sizes[] = {
> + 2, 5, 8, 9, 15, 16, 17, 31, 32, 33, 63, 64, 65, 127, 128,
> + 129, 191, 192, 193, 255, 256, 257, 319, 320, 321, 383, 384,
> + 385, 447, 448, 449, 511, 512, 513, 767, 768, 769, 1023, 1024,
> + 1025, 1522, 1536, 1600, 2048, 2560, 3072, 3584, 4096, 4608,
> + 5632, 6144, 6656, 7168, 7680, 8192, 16834
> +};
> +
[...]
> +/*
> + * Do all performance tests.
> + */
> +static int
> +test_memcmp_perf(void)
> +{
> + if (run_all_memcmp_eq_perf_tests() != 0)
> + return -1;
> +
> + if (run_all_memcmp_gt_perf_tests() != 0)
> + return -1;
> +
> + if (run_all_memcmp_lt_perf_tests() != 0)
> + return -1;
> +


Perhaps unaligned test cases are needed here.
What do you think?


> +
> + return 0;
> +}
> +
> +static struct test_command memcmp_perf_cmd = {
> + .command = "memcmp_perf_autotest",
> + .callback = test_memcmp_perf,
> +};
> +REGISTER_TEST_COMMAND(memcmp_perf_cmd);
> --
> 1.9.1



[dpdk-dev] [PATCH v1 1/2] rte_memcmp functions using Intel AVX and SSE intrinsics

2016-05-26 Thread Wang, Zhihong


> -Original Message-
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Ravi Kerur
> Sent: Tuesday, March 8, 2016 7:01 AM
> To: dev at dpdk.org
> Subject: [dpdk-dev] [PATCH v1 1/2] rte_memcmp functions using Intel AVX and
> SSE intrinsics
> 
> v1:
> This patch adds memcmp functionality using AVX and SSE
> intrinsics provided by Intel. For other architectures
> supported by DPDK regular memcmp function is used.
> 
> Compiled and tested on Ubuntu 14.04(non-NUMA) and 15.10(NUMA)
> systems.
> 
[...]

> + if (unlikely(!_mm_testz_si128(xmm2, xmm2))) {
> + __m128i idx =
> + _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 
> 3, 2, 1, 0);

line over 80 characters ;)

> +
> + /*
> +  * Reverse byte order
> +  */
> + xmm0 = _mm_shuffle_epi8(xmm0, idx);
> + xmm1 = _mm_shuffle_epi8(xmm1, idx);
> +
> + /*
> + * Compare unsigned bytes with instructions for signed bytes
> + */
> + xmm0 = _mm_xor_si128(xmm0, _mm_set1_epi8(0x80));
> + xmm1 = _mm_xor_si128(xmm1, _mm_set1_epi8(0x80));
> +
> + return _mm_movemask_epi8(xmm0 > xmm1) -
> _mm_movemask_epi8(xmm1 > xmm0);
> + }
> +
> + return 0;
> +}

[...]

> +static inline int
> +rte_memcmp(const void *_src_1, const void *_src_2, size_t n)
> +{
> + const uint8_t *src_1 = (const uint8_t *)_src_1;
> + const uint8_t *src_2 = (const uint8_t *)_src_2;
> + int ret = 0;
> +
> + if (n < 16)
> + return rte_memcmp_regular(src_1, src_2, n);
[...]
> +
> + while (n > 512) {
> + ret = rte_cmp256(src_1 + 0 * 256, src_2 + 0 * 256);

Thanks for the great work!

It seems to me there's a big improvement opportunity before going into
detailed instruction layout tuning: there is no unalignment handling here
for large-size memcmp.

So almost without a doubt the performance will be low on micro-architectures
like Sandy Bridge if the start address is unaligned, which might be a
common case.
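
For the large-size path, a head-alignment step like the sketch below would
help on those platforms (illustrative only, plain C, the function name is
made up): compare a short unaligned head first so that at least src_1 is
16-byte aligned for the main vector loop, and keep unaligned loads only
for src_2.

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /*
     * Sketch: compare the unaligned head so that *src_1 becomes 16-byte
     * aligned for the vector loop; *src_1, *src_2 and *n are advanced in
     * place. Returns the compare result of the head, 0 if equal or empty.
     */
    static inline int
    cmp_head_align(const uint8_t **src_1, const uint8_t **src_2, size_t *n)
    {
            size_t head = (uintptr_t)*src_1 & 0x0F;
            int ret;

            if (head == 0 || *n < 16)
                    return 0;
            head = 16 - head;
            ret = memcmp(*src_1, *src_2, head);
            if (ret != 0)
                    return ret;
            *src_1 += head;
            *src_2 += head;
            *n -= head;
            return 0;
    }

The caller returns immediately on a non-zero result, then runs the existing
256/512-byte blocks with aligned loads for src_1.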

> + if (unlikely(ret != 0))
> + return ret;
> +
> + ret = rte_cmp256(src_1 + 1 * 256, src_2 + 1 * 256);
> + if (unlikely(ret != 0))
> + return ret;
> +
> + src_1 = src_1 + 512;
> + src_2 = src_2 + 512;
> + n -= 512;
> + }
> + goto CMP_BLOCK_LESS_THAN_512;
> +}
> +
> +#else /* RTE_MACHINE_CPUFLAG_AVX2 */




[dpdk-dev] [PATCH 1/6] testpmd: add io_retry forwarding

2016-05-26 Thread Wang, Zhihong


> -Original Message-
> From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> Sent: Thursday, May 26, 2016 2:27 PM
> To: Wang, Zhihong 
> Cc: dev at dpdk.org; Ananyev, Konstantin ;
> Richardson, Bruce ; De Lara Guarch, Pablo
> 
> Subject: Re: [PATCH 1/6] testpmd: add io_retry forwarding
> 
> 2016-05-26 02:40, Wang, Zhihong:
> > From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> > > 2016-05-05 18:46, Zhihong Wang:
> > > > --- a/app/test-pmd/testpmd.h
> > > > +++ b/app/test-pmd/testpmd.h
> > > >  extern struct fwd_engine io_fwd_engine;
> > > > +extern struct fwd_engine io_retry_fwd_engine;
> > > >  extern struct fwd_engine mac_fwd_engine;
> > > >  extern struct fwd_engine mac_retry_fwd_engine;
> > > >  extern struct fwd_engine mac_swap_engine;
> > >
> > > We now have 2 engines with "retry" behaviour.
> > > It is maybe the way to go, but I want to ask the question:
> > > Would it be possible to have "retry" as an engine parameter?
> > >
> >
> > If it's just about the way to write commands there isn't much difference,
> > like "set fwd io_rety" and "set fwd io retry".
> >
> > Do you mean to add the "retry" for all engines, and also implement this
> > as a parameter in each original engine? So for example, no iofwd-retry.c,
> > just add this feature inside iofwd.c?
> 
> Yes, if it makes sense. For engines other than io_fwd and mac_fwd, the retry
> option can be unsupported (return an error) as a first step.

I think it makes sense in terms of making the code clearer and more manageable.


[dpdk-dev] [PATCH 2/6] testpmd: configurable tx_first burst number

2016-05-26 Thread Wang, Zhihong


> -Original Message-
> From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> Sent: Thursday, May 26, 2016 2:32 PM
> To: Wang, Zhihong 
> Cc: dev at dpdk.org; Ananyev, Konstantin ;
> Richardson, Bruce ; De Lara Guarch, Pablo
> 
> Subject: Re: [PATCH 2/6] testpmd: configurable tx_first burst number
> 
> 2016-05-26 02:53, Wang, Zhihong:
> > From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> > > 2016-05-05 18:46, Zhihong Wang:
> > > > This patch enables configurable tx_first burst number.
> > > >
> > > > Use "start tx_first (burst_num)" to specify how many bursts of packets 
> > > > to
> > > > be sent before forwarding start, or "start tx_first" like before for the
> > > > default 1 burst send.
> > >
> > > The idea here is to fill the loopback latency gap with bursts.
> > > Would it be possible to make it automatic by detecting the first
> > > received packets to stop Tx generator?
> >
> > The idea is great! The implementation might not be graceful though
> > -- current tx_first mode first calls txonly engine before calling the
> > actual engine, say iofwd, so iofwd is not established before tx_first
> > is done, therefore no detection.
> 
> And what about rewriting tx_first?
> No strong opinion. I let you and Pablo decide.
> 


I think the current way is better in terms of simplicity and flexibility.
Also, this "fill the ring" criterion doesn't fit other test scenarios; it's
just for this loopback test, while tx_first is for all scenarios.


> > It's possible to do this, but we need to implement another forward
> > engine like "io_retry_fill_first" alone, it complicates testpmd just for
> > this loop back test.
> >
> > Looks to me it's better to use combination of existing fwd engines to
> > do this, it's also more flexible with burst number parameters.



[dpdk-dev] [PATCH] eal: fix rte_memcpy perf in hsw/bdw

2016-05-26 Thread Wang, Zhihong

> -Original Message-
> From: Xu, Qian Q
> Sent: Thursday, May 26, 2016 1:19 PM
> To: Wang, Zhihong ; dev at dpdk.org
> Cc: Wang, Zhihong 
> Subject: RE: [dpdk-dev] [PATCH] eal: fix rte_memcpy perf in hsw/bdw
> 
> Tested-by: Qian Xu 
> 
> - Test Commit: 8f6f24342281f59de0df7bd976a32f714d39b9a9
> - OS/Kernel: Fedora 21/4.1.13
> - GCC: gcc (GCC) 4.9.2 20141101 (Red Hat 4.9.2-1)
> - CPU: Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10
> - Total 1 cases, 1 passed, 0 failed.
[...]
> 
> 8. Compare #1 with #2, can see ~5% performance increase on BDW-EP CPU
> server.

Thanks Qian! HSW should suffer even more from this issue.



[dpdk-dev] [dpdk-stable] [PATCH v4 1/6] vhost: fix windows vm hang

2016-09-05 Thread Wang, Zhihong


> -Original Message-
> From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> Sent: Monday, September 5, 2016 1:25 PM
> To: Wang, Zhihong 
> Cc: dev at dpdk.org; maxime.coquelin at redhat.com;
> yuanhan.liu at linux.intel.com; thomas.monjalon at 6wind.com;
> stable at dpdk.org
> Subject: Re: [dpdk-stable] [PATCH v4 1/6] vhost: fix windows vm hang
> 
> On Mon, Aug 29, 2016 at 11:35:59PM -0400, Zhihong Wang wrote:
> > This patch fixes a Windows VM compatibility issue in DPDK 16.07 vhost
> code,
> > which causes the guest to hang once any packets are enqueued when
> mrg_rxbuf
> > is turned on.
> 
> This commit log lacks two important pieces: why does the hang happen and
> how does your patch fix it.

Okay, I'll add it in v5.

> 
> > How to test?
> >
> >  1. Start testpmd in the host with a vhost port.
> >
> >  2. Start a Windows VM image with qemu and connect to the vhost port.
> >
> >  3. Start io forwarding with tx_first in host testpmd.
> >
> > For 16.07 code, the Windows VM will hang once any packets are enqueued.
> >
> > Cc: 
> > Signed-off-by: Zhihong Wang 
> > ---
> >  lib/librte_vhost/vhost_rxtx.c | 17 -
> >  1 file changed, 12 insertions(+), 5 deletions(-)
> >
> > diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
> > index 08a73fd..5806f99 100644
> > --- a/lib/librte_vhost/vhost_rxtx.c
> > +++ b/lib/librte_vhost/vhost_rxtx.c
> > @@ -384,6 +384,8 @@ copy_mbuf_to_desc_mergeable(struct virtio_net
> *dev, struct vhost_virtqueue *vq,
> > uint16_t start_idx = vq->last_used_idx;
> > uint16_t cur_idx = start_idx;
> > uint64_t desc_addr;
> > +   uint32_t desc_chain_head;
> > +   uint32_t desc_chain_len;
> 
> What's the point of introducing "desc_chain_len"? It has the same value
> as desc_offset.

No, it isn't: desc_offset is the offset within the current descriptor only.
That's exactly where the old code goes wrong.

If you take a look at the virtio spec:

/* le32 is used here for ids for padding reasons. */
struct vring_used_elem {
/* Index of start of used descriptor chain. */
le32 id;
/* Total length of the descriptor chain which was written to. */
le32 len;
};
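
To put it another way: the value that goes into used->ring[].len has to be
accumulated over the whole descriptor chain, while desc_offset restarts inside
every descriptor. A stripped-down sketch just to show the accounting (made-up
names, not the actual vhost code):

#include <stdint.h>

struct used_elem_sketch {
        uint32_t id;    /* index of the first desc in the chain */
        uint32_t len;   /* total bytes written to the whole chain */
};

static void
fill_used_elem_sketch(struct used_elem_sketch *elem, uint32_t chain_head,
                const uint32_t *bytes_per_desc, uint32_t nr_descs)
{
        uint32_t i;
        uint32_t chain_len = 0;

        /* bytes_per_desc[i] is what desc_offset ends up being for
         * descriptor i; only the sum of them is what the guest
         * expects to find in used->ring[].len
         */
        for (i = 0; i < nr_descs; i++)
                chain_len += bytes_per_desc[i];

        elem->id = chain_head;
        elem->len = chain_len;
}

The old code effectively reported the per-descriptor offset instead of the
chain total, which is presumably what the Windows virtio driver trips on once
a chain spans more than one descriptor.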

> 
>   --yliu


[dpdk-dev] [PATCH v4 2/6] vhost: rewrite enqueue

2016-09-07 Thread Wang, Zhihong


> -Original Message-
> From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> Sent: Wednesday, September 7, 2016 1:33 PM
> To: Wang, Zhihong 
> Cc: dev at dpdk.org; maxime.coquelin at redhat.com;
> thomas.monjalon at 6wind.com
> Subject: Re: [PATCH v4 2/6] vhost: rewrite enqueue
> 
> Hmmm, yet another email didn't send out successfully. Resend.
> 
> BTW, please work out v5 on top of the latest next-virtio tree.
> 
> Thanks.

Okay. Thanks.

> 
>   --yliu
> 
> On Mon, Sep 05, 2016 at 02:39:25PM +0800, Yuanhan Liu wrote:
> 
> On Mon, Aug 29, 2016 at 11:36:00PM -0400, Zhihong Wang wrote:
> > This patch implements the vhost logic from scratch into a single function
> > designed for high performance and better maintainability.
> >
> > This is the baseline version of the new code, more optimization will be
> > added in the following patches in this patch set.
> >
> > ---
> > Changes in v4:
> >
> >  1. Refactor the code for clearer logic.
> >
> >  2. Add PRINT_PACKET for debugging.
> >
> > ---
> > Changes in v3:
> >
> >  1. Rewrite enqueue and delete the obsolete in the same patch.
> 
> Change log should go ...
> 
> > Signed-off-by: Zhihong Wang 
> > ---
> 
> ... here, after the SoB.
> 
> >  lib/librte_vhost/vhost_rxtx.c | 525 
> > -
> -
> >  1 file changed, 145 insertions(+), 380 deletions(-)
> >
> > diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
> > index 5806f99..629e8ae 100644
> > --- a/lib/librte_vhost/vhost_rxtx.c
> > +++ b/lib/librte_vhost/vhost_rxtx.c
> > @@ -91,7 +91,7 @@ is_valid_virt_queue_idx(uint32_t idx, int is_tx,
> uint32_t qp_nb)
> > return (is_tx ^ (idx & 1)) == 0 && idx < qp_nb * VIRTIO_QNUM;
> >  }
> >
> > -static void
> > +static inline void __attribute__((always_inline))
> >  virtio_enqueue_offload(struct rte_mbuf *m_buf, struct virtio_net_hdr
> *net_hdr)
> >  {
> > if (m_buf->ol_flags & PKT_TX_L4_MASK) {
> > @@ -112,6 +112,10 @@ virtio_enqueue_offload(struct rte_mbuf *m_buf,
> struct virtio_net_hdr *net_hdr)
> > cksum));
> > break;
> > }
> > +   } else {
> > +   net_hdr->flags = 0;
> > +   net_hdr->csum_start = 0;
> > +   net_hdr->csum_offset = 0;
> > }
> >
> > if (m_buf->ol_flags & PKT_TX_TCP_SEG) {
> > @@ -122,437 +126,198 @@ virtio_enqueue_offload(struct rte_mbuf
> *m_buf, struct virtio_net_hdr *net_hdr)
> > net_hdr->gso_size = m_buf->tso_segsz;
> > net_hdr->hdr_len = m_buf->l2_len + m_buf->l3_len
> > + m_buf->l4_len;
> > +   } else {
> > +   net_hdr->gso_type = 0;
> > +   net_hdr->hdr_len = 0;
> > +   net_hdr->gso_size = 0;
> > }
> >  }
> >
> > -static inline void
> > -copy_virtio_net_hdr(struct virtio_net *dev, uint64_t desc_addr,
> > -   struct virtio_net_hdr_mrg_rxbuf hdr)
> > +static inline void __attribute__((always_inline))
> > +update_used_ring(struct virtio_net *dev, struct vhost_virtqueue *vq,
> > +   uint32_t desc_chain_head, uint32_t desc_chain_len)
> >  {
> > -   if (dev->vhost_hlen == sizeof(struct virtio_net_hdr_mrg_rxbuf))
> > -   *(struct virtio_net_hdr_mrg_rxbuf *)(uintptr_t)desc_addr =
> hdr;
> > -   else
> > -   *(struct virtio_net_hdr *)(uintptr_t)desc_addr = hdr.hdr;
> > +   uint32_t used_idx_round = vq->last_used_idx & (vq->size - 1);
> 
> I'd suggest to use "used_idx", instead of "used_idx_round".
> 
> > +
> > +   vq->used->ring[used_idx_round].id = desc_chain_head;
> > +   vq->used->ring[used_idx_round].len = desc_chain_len;
> > +   vhost_log_used_vring(dev, vq, offsetof(struct vring_used,
> > +   ring[used_idx_round]),
> > +   sizeof(vq->used->ring[used_idx_round]));
> >  }
> >
> > -static inline int __attribute__((always_inline))
> > -copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
> > - struct rte_mbuf *m, uint16_t desc_idx)
> > +static inline uint32_t __attribute__((always_inline))
> > +enqueue_packet(struct virtio_net *dev, struct vhost_virtqueue *vq,
> > +   uint16_t avail_idx, struct rte_mbuf *mbuf,
> > +   uint32_t is_mrg_

[dpdk-dev] [PATCH v5 2/6] vhost: rewrite enqueue

2016-09-14 Thread Wang, Zhihong
> > +   desc_current = vq->avail->ring[(vq->last_used_idx) &
> > +   (vq->size - 1)];
> > +   desc_chain_head = desc_current;
> > +   desc = &vq->desc[desc_current];
> > +   desc_addr = gpa_to_vva(dev, desc->addr);
> > +   if (unlikely(!desc_addr))
> > +   goto error;
> >
> > -   desc = &vq->desc[desc->next];
> > -   desc_addr = gpa_to_vva(dev, desc->addr);
> > -   if (unlikely(!desc_addr))
> > -   return -1;
> > -
> > -   desc_offset = 0;
> > -   desc_avail  = desc->len;
> > +   desc_chain_len = 0;
> > +   desc_offset = 0;
> As I commented on v3, there is code duplication between the next-flag and
> mrg-buf cases:
> desc_offset = 0;
> 
> and:
> 
> desc = &vq->desc[desc_current];
> desc_addr = gpa_to_vva(dev, desc->addr);
> if (unlikely(!desc_addr))
>  goto error;
> 

Do you mean to add something like:

static inline int __attribute__((always_inline))
get_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
                uint32_t desc_idx, struct vring_desc **desc,
                uint64_t *desc_addr)
{
        *desc = &vq->desc[desc_idx];
        *desc_addr = gpa_to_vva(dev, (*desc)->addr);
        if (unlikely(!(*desc_addr)))
                return -1;

        return 0;
}
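
so that both spots (the next-descriptor case and the mrg-buf case) reduce to
roughly the same few lines (still a sketch, names as above):

        if (unlikely(get_desc(dev, vq, desc_current, &desc, &desc_addr) < 0))
                goto error;

        desc_offset = 0;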


> Regards,
> Maxime


[dpdk-dev] [PATCH v5 2/6] vhost: rewrite enqueue

2016-09-14 Thread Wang, Zhihong


> -Original Message-
> From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
> Sent: Tuesday, September 13, 2016 12:27 AM
> To: Wang, Zhihong ; dev at dpdk.org
> Cc: yuanhan.liu at linux.intel.com; thomas.monjalon at 6wind.com
> Subject: Re: [PATCH v5 2/6] vhost: rewrite enqueue
> 
> 
> 
> On 09/09/2016 05:39 AM, Zhihong Wang wrote:
> >
> > +static inline void __attribute__((always_inline))
> > +notify_guest(struct virtio_net *dev, struct vhost_virtqueue *vq)
> > +{
> > rte_smp_wmb();
> > -
> > -   *(volatile uint16_t *)&vq->used->idx += count;
> > -   vq->last_used_idx += count;
> > -   vhost_log_used_vring(dev, vq,
> > -   offsetof(struct vring_used, idx),
> > -   sizeof(vq->used->idx));
> > -
> > -   /* flush used->idx update before we read avail->flags. */
> Please don't remove comments if not justified.
> Here the comment is important, as it explains why the barrier is needed.

Okay.

> > +   *(volatile uint16_t *)&vq->used->idx = vq->last_used_idx;
> > +   vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
> > +   sizeof(vq->used->idx));
> > rte_mb();
> > -
> > -   /* Kick the guest if necessary. */
> > if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT)
> > && (vq->callfd >= 0))
> > eventfd_write(vq->callfd, (eventfd_t)1);
> > -   return count;
> >  }
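
Sure, the next version will keep those comments. Roughly what I have in mind
for notify_guest() (just a sketch of the intended result, not the final patch):

static inline void __attribute__((always_inline))
notify_guest(struct virtio_net *dev, struct vhost_virtqueue *vq)
{
        /* flush used ring updates before exposing the new used idx */
        rte_smp_wmb();
        *(volatile uint16_t *)&vq->used->idx = vq->last_used_idx;
        vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
                        sizeof(vq->used->idx));

        /* flush used->idx update before we read avail->flags. */
        rte_mb();

        /* Kick the guest if necessary. */
        if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT)
                        && (vq->callfd >= 0))
                eventfd_write(vq->callfd, (eventfd_t)1);
}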


[dpdk-dev] [PATCH v5 5/6] vhost: batch update used ring

2016-09-14 Thread Wang, Zhihong


> -Original Message-
> From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
> Sent: Monday, September 12, 2016 11:46 PM
> To: Wang, Zhihong ; dev at dpdk.org
> Cc: yuanhan.liu at linux.intel.com; thomas.monjalon at 6wind.com
> Subject: Re: [PATCH v5 5/6] vhost: batch update used ring
> 
> 
> 
> On 09/09/2016 05:39 AM, Zhihong Wang wrote:
> > This patch enables batch update of the used ring for better efficiency.
> >
> > Signed-off-by: Zhihong Wang 
> > ---
> > Changes in v4:
> >
> >  1. Free shadow used ring in the right place.
> >
> >  2. Add failure check for shadow used ring malloc.
> >
> >  lib/librte_vhost/vhost.c  | 20 --
> >  lib/librte_vhost/vhost.h  |  4 +++
> >  lib/librte_vhost/vhost_user.c | 31 +
> >  lib/librte_vhost/virtio_net.c | 64
> +++
> >  4 files changed, 101 insertions(+), 18 deletions(-)
> >
> > diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
> > index 46095c3..cb31cdd 100644
> > --- a/lib/librte_vhost/vhost.c
> > +++ b/lib/librte_vhost/vhost.c
> > @@ -119,10 +119,26 @@ cleanup_device(struct virtio_net *dev, int
> destroy)
> >  static void
> >  free_device(struct virtio_net *dev)
> >  {
> > +   struct vhost_virtqueue *vq_0;
> > +   struct vhost_virtqueue *vq_1;
> > uint32_t i;
> >
> > -   for (i = 0; i < dev->virt_qp_nb; i++)
> > -   rte_free(dev->virtqueue[i * VIRTIO_QNUM]);
> > +   for (i = 0; i < dev->virt_qp_nb; i++) {
> > +   vq_0 = dev->virtqueue[i * VIRTIO_QNUM];
> > +   if (vq_0->shadow_used_ring) {
> > +   rte_free(vq_0->shadow_used_ring);
> > +   vq_0->shadow_used_ring = NULL;
> > +   }
> > +
> > +   vq_1 = dev->virtqueue[i * VIRTIO_QNUM + 1];
> > +   if (vq_1->shadow_used_ring) {
> > +   rte_free(vq_1->shadow_used_ring);
> > +   vq_1->shadow_used_ring = NULL;
> > +   }
> > +
> > +   /* malloc together, free together */
> > +   rte_free(vq_0);
> > +   }
> >
> > rte_free(dev);
> >  }
> > diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> > index 9707dfc..381dc27 100644
> > --- a/lib/librte_vhost/vhost.h
> > +++ b/lib/librte_vhost/vhost.h
> > @@ -85,6 +85,10 @@ struct vhost_virtqueue {
> >
> > /* Physical address of used ring, for logging */
> > uint64_tlog_guest_addr;
> > +
> > +   /* Shadow used ring for performance */
> > +   struct vring_used_elem  *shadow_used_ring;
> > +   uint32_tshadow_used_idx;
> >  } __rte_cache_aligned;
> >
> >  /* Old kernels have no such macro defined */
> > diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
> > index eee99e9..d7cf1ed 100644
> > --- a/lib/librte_vhost/vhost_user.c
> > +++ b/lib/librte_vhost/vhost_user.c
> > @@ -193,7 +193,21 @@ static int
> >  vhost_user_set_vring_num(struct virtio_net *dev,
> >  struct vhost_vring_state *state)
> >  {
> > -   dev->virtqueue[state->index]->size = state->num;
> > +   struct vhost_virtqueue *vq;
> > +
> > +   vq = dev->virtqueue[state->index];
> > +   vq->size = state->num;
> > +   if (!vq->shadow_used_ring) {
> > +   vq->shadow_used_ring = rte_malloc(NULL,
> > +   vq->size * sizeof(struct vring_used_elem),
> > +   RTE_CACHE_LINE_SIZE);
> > +   if (!vq->shadow_used_ring) {
> > +   RTE_LOG(ERR, VHOST_CONFIG,
> > +   "Failed to allocate memory"
> > +   " for shadow used ring.\n");
> > +   return -1;
> > +   }
> > +   }
> >
> > return 0;
> >  }
> > @@ -611,14 +625,21 @@ static int
> >  vhost_user_get_vring_base(struct virtio_net *dev,
> >   struct vhost_vring_state *state)
> >  {
> > +   struct vhost_virtqueue *vq;
> > +
> > /* We have to stop the queue (virtio) if it is running. */
> > if (dev->flags & VIRTIO_DEV_RUNNING) {
> > dev->flags &= ~VIRTIO_DEV_RUNNING;
> > notify_ops->destroy_device(dev->vid);
> > }
> >
> > +   vq = dev->virtqueue[state->index];
> > /* H

[dpdk-dev] [PATCH v5 5/6] vhost: batch update used ring

2016-09-18 Thread Wang, Zhihong


> -Original Message-
> From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> Sent: Sunday, September 18, 2016 10:56 AM
> To: Maxime Coquelin 
> Cc: Wang, Zhihong ; dev at dpdk.org;
> thomas.monjalon at 6wind.com
> Subject: Re: [PATCH v5 5/6] vhost: batch update used ring
> 
> On Thu, Sep 15, 2016 at 06:38:06PM +0200, Maxime Coquelin wrote:
> > >>>+static inline void __attribute__((always_inline))
> > >>>+flush_used_ring(struct virtio_net *dev, struct vhost_virtqueue *vq,
> > >>>+uint32_t used_idx_start)
> > >>>+{
> > >>>+if (used_idx_start + vq->shadow_used_idx < vq->size) {
> > >>>+rte_memcpy(&vq->used->ring[used_idx_start],
> > >>>+&vq->shadow_used_ring[0],
> > >>>+vq->shadow_used_idx *
> > >>>+sizeof(struct vring_used_elem));
> > >>>+vhost_log_used_vring(dev, vq,
> > >>>+offsetof(struct vring_used,
> > >>>+ring[used_idx_start]),
> > >>>+vq->shadow_used_idx *
> > >>>+sizeof(struct vring_used_elem));
> > >>>+} else {
> > >>>+uint32_t part_1 = vq->size - used_idx_start;
> > >>>+uint32_t part_2 = vq->shadow_used_idx - part_1;
> > >>>+
> > >>>+rte_memcpy(&vq->used->ring[used_idx_start],
> > >>>+&vq->shadow_used_ring[0],
> > >>>+part_1 *
> > >>>+sizeof(struct vring_used_elem));
> > >>>+vhost_log_used_vring(dev, vq,
> > >>>+offsetof(struct vring_used,
> > >>>+ring[used_idx_start]),
> > >>>+part_1 *
> > >>>+sizeof(struct vring_used_elem));
> > >>>+rte_memcpy(&vq->used->ring[0],
> > >>>+&vq->shadow_used_ring[part_1],
> > >>>+part_2 *
> > >>>+sizeof(struct vring_used_elem));
> > >>>+vhost_log_used_vring(dev, vq,
> > >>>+offsetof(struct vring_used,
> > >>>+ring[0]),
> > >>>+part_2 *
> > >>>+sizeof(struct vring_used_elem));
> > >>>+}
> > >>> }
> > >>Is expanding the code done for performance purposes?
> > >
> > >Hi Maxime,
> > >
> > >Yes, theoretically this has the fewest branches.
> > >And I think the logic is simpler this way.
> > Ok, in that case, maybe you could create a function to
> > do the rte_memcpy and the vhost_log_used on a given range.
> 
> Agreed, that will be better; it could avoid repeating similar code
> block 3 times.

Okay. Thanks for the suggestion, Maxime and Yuanhan.
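
Something along these lines should do it -- the name and signature below are
just placeholders, but it removes the three nearly identical blocks:

static inline void __attribute__((always_inline))
flush_used_ring_range(struct virtio_net *dev, struct vhost_virtqueue *vq,
                uint32_t used_idx, uint32_t shadow_idx, uint32_t count)
{
        if (count == 0)
                return;

        /* copy a contiguous run of shadow entries into the used ring */
        rte_memcpy(&vq->used->ring[used_idx],
                        &vq->shadow_used_ring[shadow_idx],
                        count * sizeof(struct vring_used_elem));
        vhost_log_used_vring(dev, vq,
                        offsetof(struct vring_used, ring[used_idx]),
                        count * sizeof(struct vring_used_elem));
}

Then flush_used_ring() becomes a single call when the used ring doesn't wrap,
and two calls when it does (the tail starting at used_idx_start, then the
wrapped part starting at ring[0]).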

> 
> > I don't have a strong opinion on this, if Yuanhan is fine
> > with current code, that's ok for me.
> 
> From what I know, that's kind of the DPDK preferred way, to expand code
> when necessary. For example, 9ec201f5d6e7 ("mbuf: provide bulk
> allocation").
> 
> So I'm fine with it.
> 
>   --yliu

