Re: [ovs-dev] [PATCH v4 4/7] netdev: Remove useless cutlen.

2017-10-13 Thread Bodireddy, Bhanuprakash
>Cutlen already applied while processing OVS_ACTION_ATTR_OUTPUT.
>
>Signed-off-by: Ilya Maximets 
>---
> lib/netdev-bsd.c   | 2 +-
> lib/netdev-dpdk.c  | 5 -
> lib/netdev-dummy.c | 2 +-
> lib/netdev-linux.c | 4 ++--
> 4 files changed, 4 insertions(+), 9 deletions(-)
>
>diff --git a/lib/netdev-bsd.c b/lib/netdev-bsd.c
>index 4f243b5..7454d03 100644
>--- a/lib/netdev-bsd.c
>+++ b/lib/netdev-bsd.c
>@@ -697,7 +697,7 @@ netdev_bsd_send(struct netdev *netdev_, int qid OVS_UNUSED,
>
> for (i = 0; i < batch->count; i++) {
> const void *data = dp_packet_data(batch->packets[i]);
>-size_t size = dp_packet_get_send_len(batch->packets[i]);
>+size_t size = dp_packet_size(batch->packets[i]);
>
> while (!error) {
> ssize_t retval;
>diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>index 011c6f7..300a0ae 100644
>--- a/lib/netdev-dpdk.c
>+++ b/lib/netdev-dpdk.c
>@@ -1851,8 +1851,6 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
> dropped += batch_cnt - cnt;
> }
>
>-dp_packet_batch_apply_cutlen(batch);
>-
> for (uint32_t i = 0; i < cnt; i++) {
> struct dp_packet *packet = batch->packets[i];
> uint32_t size = dp_packet_size(packet);
>@@ -1905,7 +1903,6 @@ netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
> dpdk_do_tx_copy(netdev, qid, batch);
> dp_packet_delete_batch(batch, true);
> } else {
>-dp_packet_batch_apply_cutlen(batch);
> __netdev_dpdk_vhost_send(netdev, qid, batch->packets, batch->count);
> }
> return 0;
>@@ -1936,8 +1933,6 @@ netdev_dpdk_send__(struct netdev_dpdk *dev, int qid,
> int batch_cnt = dp_packet_batch_size(batch);
> struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets;
>
>-dp_packet_batch_apply_cutlen(batch);
>-
> tx_cnt = netdev_dpdk_filter_packet_len(dev, pkts, batch_cnt);
> tx_cnt = netdev_dpdk_qos_run(dev, pkts, tx_cnt);
> dropped = batch_cnt - tx_cnt;
>diff --git a/lib/netdev-dummy.c b/lib/netdev-dummy.c
>index 57ef13f..1f846b5 100644
>--- a/lib/netdev-dummy.c
>+++ b/lib/netdev-dummy.c
>@@ -1071,7 +1071,7 @@ netdev_dummy_send(struct netdev *netdev, int qid OVS_UNUSED,
> struct dp_packet *packet;
> DP_PACKET_BATCH_FOR_EACH(packet, batch) {
> const void *buffer = dp_packet_data(packet);
>-size_t size = dp_packet_get_send_len(packet);
>+size_t size = dp_packet_size(packet);
>
> if (batch->packets[i]->packet_type != htonl(PT_ETH)) {
> error = EPFNOSUPPORT;
>diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
>index aaf4899..e70cef3 100644
>--- a/lib/netdev-linux.c
>+++ b/lib/netdev-linux.c
>@@ -1197,7 +1197,7 @@ netdev_linux_sock_batch_send(int sock, int ifindex,
> for (int i = 0; i < batch->count; i++) {
> struct dp_packet *packet = batch->packets[i];
> iov[i].iov_base = dp_packet_data(packet);
>-iov[i].iov_len = dp_packet_get_send_len(packet);
>+iov[i].iov_len = dp_packet_size(packet);
> mmsg[i].msg_hdr = (struct msghdr) { .msg_name = &sll,
> .msg_namelen = sizeof sll,
> .msg_iov = &iov[i],
>@@ -1234,7 +1234,7 @@ netdev_linux_tap_batch_send(struct netdev *netdev_,
> struct netdev_linux *netdev = netdev_linux_cast(netdev_);
> for (int i = 0; i < batch->count; i++) {
> struct dp_packet *packet = batch->packets[i];
>-size_t size = dp_packet_get_send_len(packet);
>+size_t size = dp_packet_size(packet);
> ssize_t retval;
> int error;

With the above change, I think we can get rid of the dp_packet_get_send_len() API 
altogether. The only place it is still called is dp_packet_batch_apply_cutlen(), 
and that call can be replaced.

dp_packet_batch_apply_cutlen(..) {
    ...
-   dp_packet_set_size(packet, dp_packet_get_send_len(packet));
+   dp_packet_set_size(packet,
+                      dp_packet_size(packet) - dp_packet_get_cutlen(packet));
}

- Bhanuprakash.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v4 5/7] timeval: Introduce time_usec().

2017-10-13 Thread Bodireddy, Bhanuprakash
>This fanction will provide monotonic time in microseconds.

[BHANU] Typo here: 'fanction' should be 'function'.

>
>Signed-off-by: Ilya Maximets 
>---
> lib/timeval.c | 22 ++
> lib/timeval.h |  2 ++
> 2 files changed, 24 insertions(+)
>
>diff --git a/lib/timeval.c b/lib/timeval.c index dd63f03..be2eddc 100644
>--- a/lib/timeval.c
>+++ b/lib/timeval.c
>@@ -233,6 +233,22 @@ time_wall_msec(void)
> return time_msec__(&wall_clock);
> }
>
>+static long long int
>+time_usec__(struct clock *c)
>+{
>+struct timespec ts;
>+
>+time_timespec__(c, &ts);
>+return timespec_to_usec(&ts);
>+}
>+
>+/* Returns a monotonic timer, in microseconds. */
>+long long int
>+time_usec(void)
>+{
>+return time_usec__(&monotonic_clock);
>+}
>+

[BHANU] As you are introducing support for microsecond granularity, can you 
also add time_wall_usec() and time_wall_usec__() here?
The ipfix code (ipfix_now()) can be the first user for now; maybe more in the 
future!

> /* Configures the program to die with SIGALRM 'secs' seconds from now, if
>  * 'secs' is nonzero, or disables the feature if 'secs' is zero. */
> void
>@@ -360,6 +376,12 @@ timeval_to_msec(const struct timeval *tv)
> return (long long int) tv->tv_sec * 1000 + tv->tv_usec / 1000;
> }
>
>+long long int
>+timespec_to_usec(const struct timespec *ts)
>+{
>+return (long long int) ts->tv_sec * 1000 * 1000 + ts->tv_nsec / 1000;
>+}
>+

[BHANU] How about adding timeval_to_usec()?
It would also be nice to have usec_to_timespec() and timeval_diff_usec() 
implemented to make this commit complete.

- Bhanuprakash. 



Re: [ovs-dev] [PATCH v4 6/7] dpif-netdev: Time based output batching.

2017-10-13 Thread Bodireddy, Bhanuprakash
>This allows to collect packets from more than one RX burst and send them
>together with a configurable intervals.
>
>'other_config:tx-flush-interval' can be used to configure time that a packet
>can wait in output batch for sending.
>
>dpif-netdev turned to microsecond resolution for time measuring to ensure
>desired resolution of 'tx-flush-interval'.
>
>Signed-off-by: Ilya Maximets 
>---
> lib/dpif-netdev.c| 141
>---
> vswitchd/vswitch.xml |  16 ++
> 2 files changed, 127 insertions(+), 30 deletions(-)
>
>diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>index 166b73a..3ddb711 100644
>--- a/lib/dpif-netdev.c
>+++ b/lib/dpif-netdev.c
>@@ -85,6 +85,9 @@ VLOG_DEFINE_THIS_MODULE(dpif_netdev);
> #define MAX_RECIRC_DEPTH 5
> DEFINE_STATIC_PER_THREAD_DATA(uint32_t, recirc_depth, 0)
>
>+/* Use instant packet send by default. */
>+#define DEFAULT_TX_FLUSH_INTERVAL 0
>+
>+
> /* Configuration parameters. */
> enum { MAX_FLOWS = 65536 }; /* Maximum number of flows in flow table. */
> enum { MAX_METERS = 65536 };/* Maximum number of meters. */
>@@ -178,12 +181,13 @@ struct emc_cache {
> 

> /* Simple non-wildcarding single-priority classifier. */
>
>-/* Time in ms between successive optimizations of the dpcls subtable vector */
>-#define DPCLS_OPTIMIZATION_INTERVAL 1000
>+/* Time in microseconds between successive optimizations of the dpcls
>+ * subtable vector */
>+#define DPCLS_OPTIMIZATION_INTERVAL 1000000LL
>
>-/* Time in ms of the interval in which rxq processing cycles used in
>- * rxq to pmd assignments is measured and stored. */
>-#define PMD_RXQ_INTERVAL_LEN 10000
>+/* Time in microseconds of the interval in which rxq processing cycles used
>+ * in rxq to pmd assignments is measured and stored. */
>+#define PMD_RXQ_INTERVAL_LEN 10000000LL
>
> /* Number of intervals for which cycles are stored
>  * and used during rxq to pmd assignment. */
>@@ -270,6 +274,9 @@ struct dp_netdev {
> struct hmap ports;
> struct seq *port_seq;   /* Incremented whenever a port changes. */
>
>+/* The time that a packet can wait in output batch for sending. */
>+atomic_uint32_t tx_flush_interval;
>+
> /* Meters. */
> struct ovs_mutex meter_locks[N_METER_LOCKS];
> struct dp_meter *meters[MAX_METERS]; /* Meter bands. */
>@@ -356,7 +363,7 @@ enum rxq_cycles_counter_type {
> RXQ_N_CYCLES
> };
>
>-#define XPS_TIMEOUT_MS 500LL
>+#define XPS_TIMEOUT 500000LL/* In microseconds. */
>
> /* Contained by struct dp_netdev_port's 'rxqs' member.  */
> struct dp_netdev_rxq {
>@@ -526,6 +533,7 @@ struct tx_port {
> int qid;
> long long last_used;
> struct hmap_node node;
>+long long flush_time;
> struct dp_packet_batch output_pkts;
> };
>
>@@ -614,6 +622,9 @@ struct dp_netdev_pmd_thread {
>  * than 'cmap_count(dp->poll_threads)'. */
> uint32_t static_tx_qid;
>
>+/* Number of filled output batches. */
>+int n_output_batches;
>+
> struct ovs_mutex port_mutex;/* Mutex for 'poll_list' and 'tx_ports'. */
> /* List of rx queues to poll. */
> struct hmap poll_list OVS_GUARDED;
>@@ -707,8 +718,9 @@ static void dp_netdev_add_rxq_to_pmd(struct dp_netdev_pmd_thread *pmd,
> static void dp_netdev_del_rxq_from_pmd(struct dp_netdev_pmd_thread *pmd,
>struct rxq_poll *poll)
> OVS_REQUIRES(pmd->port_mutex);
>-static void
>-dp_netdev_pmd_flush_output_packets(struct dp_netdev_pmd_thread *pmd);
>+static int
>+dp_netdev_pmd_flush_output_packets(struct dp_netdev_pmd_thread *pmd,
>+   bool force);
>
> static void reconfigure_datapath(struct dp_netdev *dp)
> OVS_REQUIRES(dp->port_mutex);
>@@ -783,7 +795,7 @@ emc_cache_slow_sweep(struct emc_cache *flow_cache)
> static inline void
> pmd_thread_ctx_time_update(struct dp_netdev_pmd_thread *pmd)
> {
>-pmd->ctx.now = time_msec();
>+pmd->ctx.now = time_usec();
> }
>
> /* Returns true if 'dpif' is a netdev or dummy dpif, false otherwise. */
>@@ -1283,6 +1295,7 @@ create_dp_netdev(const char *name, const struct dpif_class *class,
> conntrack_init(&dp->conntrack);
>
> atomic_init(&dp->emc_insert_min, DEFAULT_EM_FLOW_INSERT_MIN);
>+atomic_init(&dp->tx_flush_interval, DEFAULT_TX_FLUSH_INTERVAL);
>
> cmap_init(&dp->poll_threads);
>
>@@ -2950,7 +2963,7 @@ dpif_netdev_execute(struct dpif *dpif, struct dpif_execute *execute)
> dp_packet_batch_init_packet(&pp, execute->packet);
> dp_netdev_execute_actions(pmd, &pp, false, execute->flow,
>   execute->actions, execute->actions_len);
>-dp_netdev_pmd_flush_output_packets(pmd);
>+dp_netdev_pmd_flush_output_packets(pmd, true);
>
> if (pmd->core_id == NON_PMD_CORE_ID) {
> ovs_mutex_unlock(&dp->non_pmd_mutex);
>@@ -2999,6 +3012,16 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
> smap_get_ullong(other_config, "emc-insert-inv-prob",
>  

Re: [ovs-dev] [PATCH v4 0/7] Output packet batching.

2017-10-13 Thread Bodireddy, Bhanuprakash
>Hi Ilya,
>
>Sorry for the late response, as I was rather busy and did not find time to look
>at your revisions 1 till 3. Hopefully, I can make it up looking at v4...
>
>I did some tests in-line with the earlier tests I did with Bhanu's patch 
>series.
>Here is a comparison for a simple PVP test using a single physical port with 64
>bytes packets (wire speed 10G), single PMD thread:
>
>#Flows  master patched
>==  =  =
>     10  3,123,350  4,174,807  pps
>     32  2,090,440  3,625,314  pps
>     50  1,954,184  3,499,402  pps
>    100  1,705,794  3,264,955  pps
>    500  1,601,252  2,956,190  pps
>   1000  1,568,175  2,712,385  pps
>
>In addition, I did some latency statistics based on a PVP setup with two
>physical ports, and one virtual port, and two OpenFlow rules:
>
>ovs-ofctl add-flow br0 "in_port=dpdk0,action=vhost0"
>ovs-ofctl add-flow br0 "in_port=vhost0,action=dpdk1"
>
>Also, note that there is some deviation on latency numbers, so I did 4 runs and
>reported the min-max values.
>
>First the master results:
>
>Summary (flows = 30, 10G line rate = 95%, runtime = 60 seconds):
>
>   Pkt size min(ns) avg(ns)  max(ns)
>     -  ---  -
>    512  7,437 - 7,469  11,416 - 13,770   99,395 - 112,296
>   1024  7,197 - 7,221  11,277 - 12,379   42,876 -  47,230
>   1280  7,373 - 7,549  10,647 - 12,528   37,240 -  42,235
>   1518  8,046 - 8,135  11,808 - 12,931   36,534 -  46,388
>
>And the patched results:
>
>   Pkt size min(ns) avg(ns)  max(ns)
>     -  ---  -
>    512  7,605 - 7,662  11,711 - 14,053   56,603 - 121,059
>   1024  7,285 - 7,317  11,291 - 12,695   44,753 -  69,624
>   1280  7,605 - 7,702  10,842 - 12,685   37,047 -  45,747
>   1518  8,111 - 8,159  11,434 - 13,045   38,587 -  41,754
>
>As you can see above for the default configuration there is a minimal latency
>increase. I assume you did some latency tests yourself, and I hope these
>numbers match your findings...

Thanks for sharing the numbers, Eelco. This should be useful for future 
reference.

Also please note that with the batching patches applied there is a small 
performance drop in the P2P test case (the non-batching scenario). This 
shouldn't be a concern at this point, but VSPERF may raise a red flag with 
some of its test cases when this series is applied.

Other than that the series looks good to me. I have asked Ian to check QoS 
functionality with this series plus the incremental patch (that fixes the 
known issue with the policer) to check for any other corner cases.

- Bhanuprakash.

>
>I do have some small comments on your patchsets but will address them
>replying to the individual emails.
>
>Cheers,
>
>Eelco
>
>On 05/10/17 17:05, Ilya Maximets wrote:
>> This patch-set inspired by [1] from Bhanuprakash Bodireddy.
>> Implementation of [1] looks very complex and introduces many pitfalls
>> [2] for later code modifications like possible packet stucks.
>>
>> This version targeted to make simple and flexible output packet
>> batching on higher level without introducing and even simplifying netdev
>layer.
>>
>> Basic testing of 'PVP with OVS bonding on phy ports' scenario shows
>> significant performance improvement.
>>
>> Test results for time-based batching for v3:
>> https://mail.openvswitch.org/pipermail/ovs-dev/2017-September/338247.h
>> tml
>>
>> [1] [PATCH v4 0/5] netdev-dpdk: Use intermediate queue during packet
>transmission.
>>
>> https://mail.openvswitch.org/pipermail/ovs-dev/2017-August/337019.html
>>
>> [2] For example:
>>
>> https://mail.openvswitch.org/pipermail/ovs-dev/2017-August/337133.html
>>
>> Version 4:
>>  * Rebased on current master.
>>  * Rebased on top of "Keep latest measured time for PMD thread."
>>(Jan Scheurich)
>>  * Microsecond resolution related patches integrated.
>>  * Time-based batching without RFC tag.
>>  * 'output_time' renamed to 'flush_time'. (Jan Scheurich)
>>  * 'flush_time' update moved to
>'dp_netdev_pmd_flush_output_on_port'.
>>(Jan Scheurich)
>>  * 'output-max-latency' renamed to 'tx-flush-interval'.
>>  * Added patch for output batching statistics.
>>
>> Version 3:
>>
>>  * Rebased on current master.
>>  * Time based RFC: fixed assert on n_output_batches <= 0.
>>
>> Version 2:
>>
>>  * Rebased on current master.
>>  * Added time based batching RFC patch.
>>  * Fixed mixing packets with different sources in same batch.
>>
>>
>> Ilya Maximets (7):
>>dpif-netdev: Keep latest measured time for PMD thread.
>>dpif-netdev: Output packet batching.
>>netdev: Remove unused may_steal.
>>netdev: Remove useless cutlen.
>>timeval: Introduce time_usec().
>>dpif-netdev: Time based output batching.
>>dpif-netdev: Count sent packets and batches.
>>
>>   lib/dpif-netdev.c | 334 +--
>---
>>   lib/netdev-bsd.c   

Re: [ovs-dev] [PATCH v5 03/10] util: Add high resolution sleep support.

2017-11-06 Thread Bodireddy, Bhanuprakash
Hi Ben,
>
>On Fri, Sep 15, 2017 at 05:40:23PM +0100, Bhanuprakash Bodireddy wrote:
>> This commit introduces xnanosleep() for the threads needing high
>> resolution sleep timeouts.
>>
>> Signed-off-by: Bhanuprakash Bodireddy
>> 
>
>This is a little confusing.  The name xnanosleep() implies that its argument
>would be in nanoseconds, but it's in fact in milliseconds.
>Second, I don't understand why it's only implemented for Linux.

I tried reworking this API with a nanoseconds argument and implementing 
nsec_to_timespec() today.
The changes work fine on Linux; however, the Windows build breaks with the 
below error, as reported by AppVeyor:

error C4013: 'nanosleep' undefined; assuming extern returning int
(windows.h and time.h headers are included).

But it looks like nanosleep is supported on Windows. Any inputs on this would 
be helpful.

- Bhanuprakash.



Re: [ovs-dev] [PATCH v5 03/10] util: Add high resolution sleep support.

2017-11-08 Thread Bodireddy, Bhanuprakash
>
>> Ben Pfaff  writes:
>>
>> > On Mon, Nov 06, 2017 at 05:29:26PM +, Bodireddy, Bhanuprakash
>> wrote:
>> >> Hi Ben,
>> >> >
>> >> >On Fri, Sep 15, 2017 at 05:40:23PM +0100, Bhanuprakash Bodireddy
>> wrote:
>> >> >> This commit introduces xnanosleep() for the threads needing high
>> >> >> resolution sleep timeouts.
>> >> >>
>> >> >> Signed-off-by: Bhanuprakash Bodireddy
>> >> >> 
>> >> >
>> >> >This is a little confusing.  The name xnanosleep() implies that
>> >> >its argument would be in nanoseconds, but it's in fact in milliseconds.
>> >> >Second, I don't understand why it's only implemented for Linux.
>> >>
>> >> I tried reworking this API with nanoseconds argument and
>> >> implementing
>> >> nsec_to_timespec() today.
>> >> This changes works fine on Linux, however the windows build breaks
>> >> with below error as reported by appveyor.
>> >>
>> >> error C4013: 'nanosleep' undefined; assuming extern returning int
>> >> (windows.h and time.h headers are included).
>> >>
>> >> But looks like nanosleep is supported on windows. Any inputs on
>> >> this would be helpful.
>> >
>> > If nanosleep isn't available on Windows (it looks like it isn't),
>> > then I'd recommend using some other function that Windows does have.
>> > If its argument isn't in nanoseconds, you can convert it.
>> >
>> > If you don't really need nanosecond resolution (the fact that the
>> > argument was in milliseconds seems like a hint), then maybe you
>> > could just use some other function instead of nanosleep, even on Linux.
>> >
>> > This stackoverflow page has some information:
>> > https://stackoverflow.com/questions/7827062/is-there-a-windows-
>> equival
>> > ent-of-nanosleep
>>
>> So, there's really no good way in windows of doing this - for OvS, I
>> would suggest reading up on the windows Wait calls
>> (https://msdn.microsoft.com/en-
>> us/library/windows/desktop/ms687069(v=vs.85).aspx#waitfunctionsandtim
>> e-outintervals).
>>
>> Prefer those to Sleep(), as Sleep(MS) can stall or deadlock the
>> process
>(at
>> least from what I remember a lifetime ago).
>There is no direct equivalent unfortunately.
>I would use
>CreateWaitableTimer(https://msdn.microsoft.com/en-
>us/library/windows/desktop
>/ms682492(v=vs.85).aspx) with SetWaitableTimer
>(https://msdn.microsoft.com/en-
>us/library/windows/desktop/ms686289(v=vs.85).
>aspx) and then wait on the timer(WaitForSingleObject) although you have 100
>nanosecond intervals.
>To go lower you can use: QueryPerformanceCounter
>(https://msdn.microsoft.com/en-
>us/library/windows/desktop/ms644904(v=vs.85).
>aspx) .
>I can try to do some benchmarks if you need such a high resolution.

Thanks for your inputs; those were helpful.
I implemented the Windows equivalent of nanosleep and posted the patch here:
https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/340743.html

I verified that this builds on Windows with AppVeyor, but I couldn't verify 
the functionality here, and that's the reason I posted it as a separate patch 
instead of folding it into 2/7.

- Bhanuprakash.





Re: [ovs-dev] [PATCH 0/7] Introduce high resolution sleep support.

2017-11-08 Thread Bodireddy, Bhanuprakash
HI Ben,
>On Wed, Nov 08, 2017 at 04:35:52PM +, Bhanuprakash Bodireddy wrote:
>> This patchset introduces high resolution sleep support for linux and
>windows.
>> Also time macros are introduced to replace the numbers with meaningful
>> names.
>
>Thank you very much for the series.
>
>Did you test that the Windows version of the code compiles (e.g. via
>appveyor)?

I cross-checked with AppVeyor and the build was successful. I replied to 
another thread where we were discussing the Windows implementation.

>
>I would normally squash patch 3 (the Windows version of xnanosleep) into
>patch 2 (the Linux version). 

I couldn't verify the functionality of the Windows implementation and hence 
posted it as a separate patch for now.
I will squash it once I receive feedback from Alin.

>Also, I would normally squash the patches that
>just replace constants by xSEC_PER_ySEC macros into the patch that
>introduced those macros (if there are other changes then I would separate
>those).
Ok.

>
>I am concerned about types.  The xSEC_PER_ySEC macros all use type "long"
>for their constants, but in some cases the code needs to have type "long
>long", for example in many cases when multiplying by one of these macros.
>When the patches replace an LL-suffixed literal by one of the xSEC_PER_ySEC
>macros, this introduces a risk of overflow that was not present before.
>
>I am not certain that the xSEC_PER_ySEC macros clarify things, especially given
>the type issues.  I don't feel strongly about it though.

Yes, I understand your concern here; it is difficult to test for overflow 
with these changes. I will leave it the way it is now.

>
>In the xnanosleep() implementation for Windows, I think that the two calls to
>CloseHandle can be consolidated into one.

Sure.

- Bhanuprakash.


Re: [ovs-dev] ovs-tcpdump error

2017-11-10 Thread Bodireddy, Bhanuprakash
>Aaron Conole  writes:
>
>> Hi Bhanu,
>>
>> "Bodireddy, Bhanuprakash"  writes:
>>
>>> Hi,
>>>
>>>
>>>
>>> ovs-tcpdump throws the below error when trying to capture packets on
>>> one of the vhostuserports.
>>>
>>>
>>>
>>> $ ovs-tcpdump -i dpdkvhostuser0
>>>
>>>ERROR: Please create an interface called `midpdkvhostuser0`
>>>
>>> See your OS guide for how to do this.
>>>
>>> Ex: ip link add midpdkvhostuser0 type veth peer name
>>> midpdkvhostuser02
>>>
>>>
>>>
>>> $ ip link add midpdkvhostuser0 type veth peer name midpdkvhostuser02
>>>
>>>  Error: argument "midpdkvhostuser0" is wrong: "name" too long
>>>
>>>
>>>
>>> To get around this issue, I have to pass  ‘—mirror-to’ option as below.
>>>
>>>
>>>
>>> $ ovs-tcpdump -i dpdkvhostuser0 -XX --mirror-to vh0
>>>
>>>
>>>
>>> Is this due to the length of the port name?  Would be nice to fix this 
>>> issue.
>>
>> Thanks for the detailed write up.
>>
>> It is related to the mirror port name length.  The mirror port is
>> bound by IFNAMSIZ restriction, so it must be 15 characters + nul, and
>> midpdkvhostuser0 would be 16 + nul.  This is a linux specific
>> restriction, and it won't be changed because it is very much a well
>> established UAPI (and changing it will have implications on code not
>> able to deal with larger sized name buffers).
>>
>> I'm not sure how best to fix it.  My concession was the mirror-to
>> option.  Perhaps there's a better way?
>
>Hi Bhanu, I've been thinking about this a bit more.  How about something like
>the following patch?
>
>If you think it's acceptable, I'll submit it formally.

Hi Aaron,

I am on Fedora and applied the patch, but I couldn't verify the fix as I get 
the below error.

Traceback (most recent call last):
  File "./utilities/ovs-tcpdump", line 21, in 
import random.randint
ImportError: No module named randint

When I slightly change the code to

-import random.randint
+ from random import randint
...
-return "ovsmi%06d" % random.randint(1, 99)
+return "ovsmi%06d" % randint(1, 99)

I get below error
Traceback (most recent call last):
  File "./utilities/ovs-tcpdump", line 478, in 
main()
  File "./utilities/ovs-tcpdump", line 419, in main
mirror_interface = mirror_interface or _make_mirror_name(interface)
TypeError: 'dict' object is not callable

Why is this so?

- Bhanuprakash.

>
>---
>
>diff --git a/utilities/ovs-tcpdump.in b/utilities/ovs-tcpdump.in index
>6718c77..76e8a7b 100755
>--- a/utilities/ovs-tcpdump.in
>+++ b/utilities/ovs-tcpdump.in
>@@ -18,6 +18,7 @@ import fcntl
>
> import os
> import pwd
>+import random.randint
> import struct
> import subprocess
> import sys
>@@ -39,6 +40,7 @@ except Exception:
>
> tapdev_fd = None
> _make_taps = {}
>+_make_mirror_name = {}
>
>
> def _doexec(*args, **kwargs):
>@@ -76,8 +78,16 @@ def _install_tap_linux(tap_name, mtu_value=None):
> pipe.wait()
>
>
>+def _make_linux_mirror_name(interface_name):
>+if interface_name.length() > 13:
>+return "ovsmi%06d" % random.randint(1, 99)
>+return "mi%s" % interface_name
>+
>+
> _make_taps['linux'] = _install_tap_linux  _make_taps['linux2'] =
>_install_tap_linux
>+_make_mirror_name['linux'] = _make_linux_mirror_name
>+_make_mirror_name['linux2'] = _make_linux_mirror_name
>
>
> def username():
>@@ -406,7 +416,7 @@ def main():
> print("TCPDUMP Args: %s" % ' '.join(tcpdargs))
>
> ovsdb = OVSDB(db_sock)
>-mirror_interface = mirror_interface or "mi%s" % interface
>+mirror_interface = mirror_interface or _make_mirror_name(interface)
>
> if sys.platform in _make_taps and \
>mirror_interface not in netifaces.interfaces():
>---


Re: [ovs-dev] [PATCH 3/7] util: High resolution sleep support for windows.

2017-11-14 Thread Bodireddy, Bhanuprakash
Hi Alin,

>Thanks a lot for the patch.
>
>I have a few comments inlined.
>
>> -Original Message-
>> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
>> boun...@openvswitch.org] On Behalf Of Bhanuprakash Bodireddy
>> Sent: Wednesday, November 8, 2017 6:36 PM
>> To: d...@openvswitch.org
>> Cc: Alin Gabriel Serdean 
>> Subject: [ovs-dev] [PATCH 3/7] util: High resolution sleep support for
>> windows.
>>
>> This commit implements xnanosleep() for the threads needing high
>> resolution sleep timeouts in windows.
>>
>> CC: Alin Gabriel Serdean 
>> CC: Aaron Conole 
>> Signed-off-by: Bhanuprakash Bodireddy
>> 
>> ---
>>  lib/util.c | 17 +
>>  1 file changed, 17 insertions(+)
>>
>> diff --git a/lib/util.c b/lib/util.c
>> index a29e288..46b5691 100644
>> --- a/lib/util.c
>> +++ b/lib/util.c
>> @@ -2217,6 +2217,23 @@ xnanosleep(uint64_t nanoseconds)
>>  retval = nanosleep(&ts_sleep, NULL);
>>  error = retval < 0 ? errno : 0;
>>  } while (error == EINTR);
>> +#else
>> +HANDLE timer = CreateWaitableTimer(NULL, FALSE, "NSTIMER");
>[Alin Serdean] Small nit we don't need to name the timer because we don't
>reuse it.
>> +if (timer) {
>> +LARGE_INTEGER duetime;
>> +duetime.QuadPart = -nanoseconds;
>> +if (SetWaitableTimer(timer, &duetime, 0, NULL, NULL, FALSE)) {
>> +WaitForSingleObject(timer, INFINITE);
>> +CloseHandle(timer);
>> +} else {
>> +CloseHandle(timer);
>> +VLOG_ERR_ONCE("SetWaitableTimer Failed (%s)",
>> +   ovs_lasterror_to_string());
>> +}
>[Alin Serdean] Can you move the CloseHandle part here?
>> +} else {
>> +VLOG_ERR_ONCE("CreateWaitableTimer Failed (%s)",
>> +   ovs_lasterror_to_string());
>> +}
>>  #endif
>>  ovsrcu_quiesce_end();
>>  }

Thanks for your comments. I will send across a v2 with the above changes.

- Bhanuprakash. 



Re: [ovs-dev] [PATCH] packets: Prefetch the packet metadata in cacheline1.

2017-11-20 Thread Bodireddy, Bhanuprakash
>
>Bhanuprakash Bodireddy  writes:
>
>> pkt_metadata_prefetch_init() is used to prefetch the packet metadata
>> before initializing the metadata in pkt_metadata_init(). This is done
>> for every packet in userspace datapath and is performance critical.
>>
>> Commit 99fc16c0 prefetches only cachline0 and cacheline2 as the
>> metadata part of respective cachelines will be initialized by
>pkt_metadata_init().
>>
>> However in VXLAN case when popping the vxlan header,
>> netdev_vxlan_pop_header() invokes pkt_metadata_init_tnl() which zeroes
>> out metadata part of
>> cacheline1 that wasn't prefetched earlier and causes performance
>> degradation.
>>
>> By prefetching cacheline1, 9% performance improvement is observed.
>
>Do we see a degredation in the non-vxlan case?  If not, then I don't see any
>reason not to apply this patch.

This patch doesn't impact the performance of non-vxlan cases and only has a 
positive impact in the vxlan case.

- Bhanuprakash.



Re: [ovs-dev] [PATCH] Revert "dpif_netdev: Refactor dp_netdev_pmd_thread structure."

2017-11-22 Thread Bodireddy, Bhanuprakash
>This reverts commit a807c15796ddc43ba1ffb2a6b0bd2ad4e2b73941.
>
>Padding and aligning of dp_netdev_pmd_thread structure members is
>useless, broken in a several ways and only greatly degrades maintainability
>and extensibility of the structure.

The idea of my earlier patch was to mark the cache lines and reduce the holes 
while still maintaining the grouping of related members in this structure.
Cache-line marking is also a good practice, making someone extra cautious 
when extending or editing important structures.
Most importantly, I was experimenting with prefetching on this structure and 
needed the cache-line markers for it.

I see that you are on ARM (I don't have the hardware to test) and would like 
to know if this commit has any negative effect on it; any numbers would be 
appreciated.
More comments inline.

>
>Issues:
>
>1. It's not working because all the instances of struct
>   dp_netdev_pmd_thread allocated only by usual malloc. All the
>   memory is not aligned to cachelines -> structure almost never
>   starts at aligned memory address. This means that any further
>   paddings and alignments inside the structure are completely
>   useless. Fo example:
>
>   Breakpoint 1, pmd_thread_main
>   (gdb) p pmd
>   $49 = (struct dp_netdev_pmd_thread *) 0x1b1af20
>   (gdb) p &pmd->cacheline1
>   $51 = (OVS_CACHE_LINE_MARKER *) 0x1b1af60
>   (gdb) p &pmd->cacheline0
>   $52 = (OVS_CACHE_LINE_MARKER *) 0x1b1af20
>   (gdb) p &pmd->flow_cache
>   $53 = (struct emc_cache *) 0x1b1afe0
>
>   All of the above addresses shifted from cacheline start by 32B.

If you see below, all the addresses are 64-byte aligned.

(gdb) p pmd
$1 = (struct dp_netdev_pmd_thread *) 0x7fc1e9b1a040
(gdb) p &pmd->cacheline0
$2 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a040
(gdb) p &pmd->cacheline1
$3 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a080
(gdb) p &pmd->flow_cache
$4 = (struct emc_cache *) 0x7fc1e9b1a0c0
(gdb) p &pmd->flow_table
$5 = (struct cmap *) 0x7fc1e9fba100
(gdb) p &pmd->stats
$6 = (struct dp_netdev_pmd_stats *) 0x7fc1e9fba140
(gdb) p &pmd->port_mutex
$7 = (struct ovs_mutex *) 0x7fc1e9fba180
(gdb) p &pmd->poll_list
$8 = (struct hmap *) 0x7fc1e9fba1c0
(gdb) p &pmd->tnl_port_cache
$9 = (struct hmap *) 0x7fc1e9fba200
(gdb) p &pmd->stats_zero
$10 = (unsigned long long (*)[5]) 0x7fc1e9fba240

I tried using xzalloc_cacheline() instead of the default xzalloc() here. I 
tried tens of times and always found that the address is 64-byte aligned, so 
it should start at the beginning of a cache line on x86.
I am not sure why the comment "(The memory returned will not be at the start 
of a cache line, though, so don't assume such alignment.)" says otherwise.

>
>   Can we fix it properly? NO.
>   OVS currently doesn't have appropriate API to allocate aligned
>   memory. The best candidate is 'xmalloc_cacheline()' but it
>   clearly states that "The memory returned will not be at the
>   start of a cache line, though, so don't assume such alignment".
>   And also, this function will never return aligned memory on
>   Windows or MacOS.
>
>2. CACHE_LINE_SIZE is not constant. Different architectures have
>   different cache line sizes, but the code assumes that
>   CACHE_LINE_SIZE is always equal to 64 bytes. All the structure
>   members are grouped by 64 bytes and padded to CACHE_LINE_SIZE.
>   This leads to a huge holes in a structures if CACHE_LINE_SIZE
>   differs from 64. This is opposite to portability. If I want
>   good performance of cmap I need to have CACHE_LINE_SIZE equal
>   to the real cache line size, but I will have huge holes in the
>   structures. If you'll take a look to struct rte_mbuf from DPDK
>   you'll see that it uses 2 defines: RTE_CACHE_LINE_SIZE and
>   RTE_CACHE_LINE_MIN_SIZE to avoid holes in mbuf structure.

I understand that ARM and a few other processors (like OCTEON) have 128-byte
cache lines.
But again, I am curious about the performance impact in your case with this new
alignment.

>
>3. Sizes of system/libc defined types are not constant for all the
>   systems. For example, sizeof(pthread_mutex_t) == 48 on my
>   ARMv8 machine, but only 40 on x86. The difference could be
>   much bigger on Windows or MacOS systems. But the code assumes
>   that sizeof(struct ovs_mutex) is always 48 bytes. This may lead
>   to broken alignment/big holes in case of padding/wrong comments
>   about amount of free pad bytes.

This isn't an issue, as you already mentioned; it's more an issue with the
comment that documents the pad bytes.
On ARM it would be just 8 pad bytes instead of 16 on x86.

union {
    struct {
        struct ovs_mutex port_mutex;    /*    48 */
    };                                  /*    48 */
    uint8_t pad13[64];                  /*    64 */
};                                      /*    64 */

>
>

Re: [ovs-dev] [PATCH] Revert "dpif_netdev: Refactor dp_netdev_pmd_thread structure."

2017-11-24 Thread Bodireddy, Bhanuprakash
>On 22.11.2017 20:14, Bodireddy, Bhanuprakash wrote:
>>> This reverts commit a807c15796ddc43ba1ffb2a6b0bd2ad4e2b73941.
>>>
>>> Padding and aligning of dp_netdev_pmd_thread structure members is
>>> useless, broken in a several ways and only greatly degrades
>>> maintainability and extensibility of the structure.
>>
>> The idea of my earlier patch was to mark the cache lines and reduce the
>holes while still maintaining the grouping of related members in this 
>structure.
>
>Some of the grouping aspects looks strange. For example, it looks illogical 
>that
>'exit_latch' is grouped with 'flow_table' but not the 'reload_seq' and other
>reload related stuff. It looks strange that statistics and counters spread 
>across
>different groups. So, IMHO, it's not well grouped.

I had to strike a fine balance, and some members may be placed in a different
group due to their sizes and importance. Let me think about whether I can make it
better.

>
>> Also cache line marking is a good practice to make some one extra cautious
>when extending or editing important structures .
>> Most importantly I was experimenting with prefetching on this structure
>and needed cache line markers for it.
>>
>> I see that you are on ARM (I don't have HW to test) and want to know if this
>commit has any negative affect and any numbers would be appreciated.
>
>Basic VM-VM testing shows stable 0.5% perfromance improvement with
>revert applied.

I did P2P, PVP and PVVP with IXIA and haven't noticed any drop on X86.  

>Padding adds 560 additional bytes of holes.
As the cache line on ARM is 128 bytes, it created holes; I can find a workaround
to handle this.

>
>> More comments inline.
>>
>>>
>>> Issues:
>>>
>>>1. It's not working because all the instances of struct
>>>   dp_netdev_pmd_thread allocated only by usual malloc. All the
>>>   memory is not aligned to cachelines -> structure almost never
>>>   starts at aligned memory address. This means that any further
>>>   paddings and alignments inside the structure are completely
>>>   useless. Fo example:
>>>
>>>   Breakpoint 1, pmd_thread_main
>>>   (gdb) p pmd
>>>   $49 = (struct dp_netdev_pmd_thread *) 0x1b1af20
>>>   (gdb) p &pmd->cacheline1
>>>   $51 = (OVS_CACHE_LINE_MARKER *) 0x1b1af60
>>>   (gdb) p &pmd->cacheline0
>>>   $52 = (OVS_CACHE_LINE_MARKER *) 0x1b1af20
>>>   (gdb) p &pmd->flow_cache
>>>   $53 = (struct emc_cache *) 0x1b1afe0
>>>
>>>   All of the above addresses shifted from cacheline start by 32B.
>>
>> If you see below all the addresses are 64 byte aligned.
>>
>> (gdb) p pmd
>> $1 = (struct dp_netdev_pmd_thread *) 0x7fc1e9b1a040
>> (gdb) p &pmd->cacheline0
>> $2 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a040
>> (gdb) p &pmd->cacheline1
>> $3 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a080
>> (gdb) p &pmd->flow_cache
>> $4 = (struct emc_cache *) 0x7fc1e9b1a0c0
>> (gdb) p &pmd->flow_table
>> $5 = (struct cmap *) 0x7fc1e9fba100
>> (gdb) p &pmd->stats
>> $6 = (struct dp_netdev_pmd_stats *) 0x7fc1e9fba140
>> (gdb) p &pmd->port_mutex
>> $7 = (struct ovs_mutex *) 0x7fc1e9fba180
>> (gdb) p &pmd->poll_list
>> $8 = (struct hmap *) 0x7fc1e9fba1c0
>> (gdb) p &pmd->tnl_port_cache
>> $9 = (struct hmap *) 0x7fc1e9fba200
>> (gdb) p &pmd->stats_zero
>> $10 = (unsigned long long (*)[5]) 0x7fc1e9fba240
>>
>> I tried using xzalloc_cacheline instead of default xzalloc() here.  I
>> tried tens of times and always found that the address is
>> 64 byte aligned and it should start at the beginning of cache line on X86.
>> Not sure why the comment  " (The memory returned will not be at the start
>of  a cache line, though, so don't assume such alignment.)" says otherwise?
>
>Yes, you will always get aligned addressess on your x86 Linux system that
>supports
>posix_memalign() call. The comment says what it says because it will make
>some memory allocation tricks in case posix_memalign() is not available
>(Windows, some MacOS, maybe some Linux systems (not sure)) and the
>address will not be aligned it this case.

I also verified the other case, when posix_memalign() isn't available, and even
in that case it returns an address aligned on a CACHE_LINE_SIZE boundary. I will
send out a patch to use xzalloc_cacheline() for allocating this memory.

>
>>
>>>
>>>   Can we

Re: [ovs-dev] [PATCH] Revert "dpif_netdev: Refactor dp_netdev_pmd_thread structure."

2017-11-26 Thread Bodireddy, Bhanuprakash
[snip]
>>>
>>> I don't think it complicates development and instead I feel the
>>> commit gives a clear indication to the developer that the members are
>>> grouped and
>>aligned and marked with cacheline markers.
>>> This makes the developer extra cautious when adding new members so
>>> that
>>holes can be avoided.
>>
>>Starting rebase of the output batching patch-set I figured out that I
>>need to remove 'unsigned long long last_cycles' and add 'struct
>>dp_netdev_pmd_thread_ctx ctx'
>>which is 8 bytes larger. Could you, please, suggest me where should I
>>place that new structure member and what to do with a hole from
>'last_cycles'?
>>
>>This is not a trivial question, because already poor grouping will
>>become worse almost anyway.
>
>Aah, realized now that the batching series doesn't cleanly apply on master.
>Let me check this and will send across the changes that should fix this.
>

I see that 2 patches of the output batching series touch this structure, and I
modified the structure to factor in the below changes introduced by the batching
series:
 -  Include the dp_netdev_pmd_thread_ctx structure.
 -  Include the n_output_batches variable.
 -  Change in sizes of the dp_netdev_pmd_stats struct and stats_zero.
 -  ovs_mutex size (48 bytes on x86 vs 56 bytes on ARM).

I also carried out some testing and found no performance impact with the below
changes.

---
struct dp_netdev_pmd_thread {
PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline0,
struct dp_netdev *dp;
struct cmap_node node;  /* In 'dp->poll_threads'. */
pthread_cond_t cond;/* For synchronizing pmd thread reload. */
);

PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
struct ovs_mutex cond_mutex;/* Mutex for condition variable. */
pthread_t thread;
);

/* Per thread exact-match cache.  Note, the instance for cpu core
 * NON_PMD_CORE_ID can be accessed by multiple threads, and thusly
 * need to be protected by 'non_pmd_mutex'.  Every other instance
 * will only be accessed by its own pmd thread. */
PADDED_MEMBERS(CACHE_LINE_SIZE,
OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct emc_cache flow_cache;
);

/* Flow-Table and classifiers
 *
 * Writers of 'flow_table' must take the 'flow_mutex'.  Corresponding
 * changes to 'classifiers' must be made while still holding the
 * 'flow_mutex'.
 */
PADDED_MEMBERS(CACHE_LINE_SIZE,
struct ovs_mutex flow_mutex;
);
PADDED_MEMBERS(CACHE_LINE_SIZE,
struct cmap flow_table OVS_GUARDED; /* Flow table. */

/* One classifier per in_port polled by the pmd */
struct cmap classifiers;
/* Periodically sort subtable vectors according to hit frequencies */
long long int next_optimization;
/* End of the next time interval for which processing cycles
   are stored for each polled rxq. */
long long int rxq_next_cycle_store;

/* Cycles counters */
struct dp_netdev_pmd_cycles cycles;

/* Current context of the PMD thread. */
struct dp_netdev_pmd_thread_ctx ctx;
);
PADDED_MEMBERS(CACHE_LINE_SIZE,
/* Statistics. */
struct dp_netdev_pmd_stats stats;
/* 8 pad bytes. */
);

PADDED_MEMBERS(CACHE_LINE_SIZE,
struct latch exit_latch;/* For terminating the pmd thread. */
struct seq *reload_seq;
uint64_t last_reload_seq;
atomic_bool reload; /* Do we need to reload ports? */
/* Set to true if the pmd thread needs to be reloaded. */
bool need_reload;
bool isolated;

struct ovs_refcount ref_cnt;/* Every reference must be refcount'ed. */

/* Queue id used by this pmd thread to send packets on all netdevs if
 * XPS disabled for this netdev. All static_tx_qid's are unique and less
 * than 'cmap_count(dp->poll_threads)'. */
uint32_t static_tx_qid;

/* Number of filled output batches. */
int n_output_batches;
unsigned core_id;   /* CPU core id of this pmd thread. */
int numa_id;/* numa node id of this pmd thread. */

/* 16 pad bytes. */
);

PADDED_MEMBERS(CACHE_LINE_SIZE,
struct ovs_mutex port_mutex;/* Mutex for 'poll_list'
   and 'tx_ports'. */
/* 16 pad bytes. */
);
PADDED_MEMBERS(CACHE_LINE_SIZE,
/* List of rx queues to poll. */
struct hmap poll_list OVS_GUARDED;
/* Map of 'tx_port's used for transmission.  Written by the main
 * thread, read by the pmd thread. */
struct hmap tx_ports OVS_GUARDED;
);
PADDED_MEMBERS(CACHE_LINE_SIZE,
/* These are thread-local copies of 'tx_ports'.  One conta

Re: [ovs-dev] [PATCH] Revert "dpif_netdev: Refactor dp_netdev_pmd_thread structure."

2017-11-27 Thread Bodireddy, Bhanuprakash
[snip]
>>> Yes, you will always get aligned addressess on your x86 Linux system
>>> that supports
>>> posix_memalign() call. The comment says what it says because it will
>>> make some memory allocation tricks in case posix_memalign() is not
>>> available (Windows, some MacOS, maybe some Linux systems (not sure))
>>> and the address will not be aligned it this case.
>>
>> I also verified the other case when posix_memalign isn't available and
>> even in that case it returns the address aligned on CACHE_LINE_SIZE
>> boundary. I will send out a patch to use  xzalloc_cacheline for allocating 
>> the
>memory.
>
>I don't know how you tested this, because it is impossible:
>
>   1. OVS allocates some memory:
>   base = xmalloc(...);
>
>   2. Rounds it up to the cache line start:
>   payload = (void **) ROUND_UP((uintptr_t) base,
>CACHE_LINE_SIZE);
>
>   3. Returns the pointer increased by 8 bytes:
>   return (char *) payload + MEM_ALIGN;
>
>So, unless you redefined MEM_ALIGN to zero, you will never get aligned
>memory address while allocating by xmalloc_cacheline() on system without
>posix_memalign().
>

Hmmm, I didn't set MEM_ALIGN to zero; instead I used the below test code to get
aligned addresses when posix_memalign() isn't available.  We can't set MEM_ALIGN
to zero, so we have to do this hack to get an aligned address and store the
original address (the one returned by malloc) just before the aligned location,
so that it can be freed by a later call to free().  (I should have mentioned
this in my previous mail.)

-
void **payload;
void *base;

base = xmalloc(CACHE_LINE_SIZE + size + MEM_ALIGN);
/* Address aligned on CACHE_LINE_SIZE boundary. */
payload = (void **) (((uintptr_t) base + CACHE_LINE_SIZE + MEM_ALIGN)
                     & ~(CACHE_LINE_SIZE - 1));
/* Store the original address so it can be freed later. */
payload[-1] = base;
return (char *) payload;
-

- Bhanuprakash.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] packets: Prefetch the packet metadata in cacheline1.

2017-11-27 Thread Bodireddy, Bhanuprakash
>>Bhanuprakash Bodireddy  writes:
>>
>>> pkt_metadata_prefetch_init() is used to prefetch the packet metadata
>>> before initializing the metadata in pkt_metadata_init(). This is done
>>> for every packet in userspace datapath and is performance critical.
>>>
>>> Commit 99fc16c0 prefetches only cachline0 and cacheline2 as the
>>> metadata part of respective cachelines will be initialized by
>>pkt_metadata_init().
>>>
>>> However in VXLAN case when popping the vxlan header,
>>> netdev_vxlan_pop_header() invokes pkt_metadata_init_tnl() which
>>> zeroes out metadata part of
>>> cacheline1 that wasn't prefetched earlier and causes performance
>>> degradation.
>>>
>>> By prefetching cacheline1, 9% performance improvement is observed.
>>
>>Do we see a degredation in the non-vxlan case?  If not, then I don't
>>see any reason not to apply this patch.
>
>This patch doesn't impact the performance of non-vxlan cases and only have a
>positive impact in vxlan case.

The commit message claims a 9% performance improvement with this patch, but when
Sugesh checked, he wasn't getting that improvement on his Haswell.

I was chatting with Sugesh this afternoon about this patch, and we found some
interesting details; much of this boils down to how OvS is built (apart from HW
and BIOS settings - TB disabled).

The test case here measures the VXLAN decapsulation performance alone, for
packet sizes of 118 bytes.
The OvS CFLAGS and throughput numbers are as below.

CFLAGS="-O2"
Master  4.667 Mpps  
With Patch   5.045 Mpps

CFLAGS="-O2 -msse4.2"
Master  4.710 Mpps
With Patch   5.097 Mpps

CFLAGS="-O2 -march=native"
Master  5.072 Mpps
With Patch   5.193 Mpps

CFLAGS="-Ofast -march=native"
Master  5.349 Mpps
With Patch   5.378 Mpps

This means the performance measurements/claims are difficult to assess: as one
can see above, with "-Ofast -march=native" the improvement is insignificant, but
this is very platform dependent due to the "-march=native" flag. The
optimization flags also seem to make a significant difference.

- Bhanuprakash.


Re: [ovs-dev] [PATCH] Revert "dpif_netdev: Refactor dp_netdev_pmd_thread structure."

2017-11-27 Thread Bodireddy, Bhanuprakash
>I agree with Ilya here. Adding theses cache line markers and re-grouping
>variables to minimize gaps in cache lines is creating a maintenance burden
>without any tangible benefit. I have had to go through the pain of refactoring
>my PMD Performance Metrics patch to the new dp_netdev_pmd_thread
>struct and spent a lot of time to analyze the actual memory layout with GDB
>and play Tetris with the variables.

Analyzing the memory layout with gdb for large structures is time consuming and
not usually recommended.
I would suggest using Poke-a-hole (pahole), which helps to understand and fix
structures in no time.
With pahole it's going to be a lot easier to work with large structures
especially.

>
>There will never be more than a handful of PMDs, so minimizing the gaps does
>not matter from memory perspective. And whether the individual members
>occupy 4 or 5 cache lines does not matter either compared to the many
>hundred cache lines touched for EMC and DPCLS lookups of an Rx batch. And
>any optimization done for x86 is not necessarily optimal for other
>architectures.

I agree that optimization targeted for x86 doesn't necessarily suit ARM due to 
its different cache line size.

>
>Finally, even for x86 there is not even a performance improvement. I re-ran
>our standard L3VPN over VXLAN performance PVP test on master and with
>Ilya's revert patch:
>
>Flows   master  reverted
>8,  4.46    4.48
>100,    4.27    4.29
>1000,   4.07    4.07
>2000,   3.68    3.68
>5000,   3.03    3.03
>1,  2.76    2.77
>2,  2.64    2.65
>5,  2.60    2.61
>10, 2.60    2.61
>50, 2.60    2.61

What are the CFLAGS in this case? They seem to make a difference. I have added
my findings here for a different patch targeted at performance:
  https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341270.html

Patches to consider when testing your use case:
 - xzalloc_cacheline:
   https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341231.html
 - (If using output batching)
   https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341230.html

- Bhanuprakash.

>
>All in all, I support reverting this change.
>
>Regards, Jan
>
>Acked-by: Jan Scheurich 
>
>> -Original Message-
>> From: ovs-dev-boun...@openvswitch.org
>> [mailto:ovs-dev-boun...@openvswitch.org] On Behalf Of Bodireddy,
>> Bhanuprakash
>> Sent: Friday, 24 November, 2017 17:09
>> To: Ilya Maximets ; ovs-dev@openvswitch.org;
>> Ben Pfaff 
>> Cc: Heetae Ahn 
>> Subject: Re: [ovs-dev] [PATCH] Revert "dpif_netdev: Refactor
>dp_netdev_pmd_thread structure."
>>
>> >On 22.11.2017 20:14, Bodireddy, Bhanuprakash wrote:
>> >>> This reverts commit a807c15796ddc43ba1ffb2a6b0bd2ad4e2b73941.
>> >>>
>> >>> Padding and aligning of dp_netdev_pmd_thread structure members is
>> >>> useless, broken in a several ways and only greatly degrades
>> >>> maintainability and extensibility of the structure.
>> >>
>> >> The idea of my earlier patch was to mark the cache lines and reduce
>> >> the
>> >holes while still maintaining the grouping of related members in this
>structure.
>> >
>> >Some of the grouping aspects looks strange. For example, it looks
>> >illogical that 'exit_latch' is grouped with 'flow_table' but not the
>> >'reload_seq' and other reload related stuff. It looks strange that
>> >statistics and counters spread across different groups. So, IMHO, it's not
>well grouped.
>>
>> I had to strike a fine balance and some members may be placed in a
>> different group due to their sizes and importance. Let me think if I can make
>it better.
>>
>> >
>> >> Also cache line marking is a good practice to make some one extra
>> >> cautious
>> >when extending or editing important structures .
>> >> Most importantly I was experimenting with prefetching on this
>> >> structure
>> >and needed cache line markers for it.
>> >>
>> >> I see that you are on ARM (I don't have HW to test) and want to
>> >> know if this
>> >commit has any negative affect and any numbers would be appreciated.
>> >
>> >Basic VM-VM testing shows stable 0.5% perfromance improvement with
>> >revert applied.
>>
>> I did P2P, PVP and PVVP with IXIA and haven't noticed any drop on X86.
>>
>> >Padding adds 560 additional bytes of holes.
>> As the cache line in ARM is 128 , it created holes, I can find a workaround 
>> to
>handle this.
>

Re: [ovs-dev] [PATCH] packets: Prefetch the packet metadata in cacheline1.

2017-11-28 Thread Bodireddy, Bhanuprakash


>-Original Message-
>From: Chandran, Sugesh
>Sent: Monday, November 27, 2017 5:58 PM
>To: Bodireddy, Bhanuprakash ; 'Aaron
>Conole' 
>Cc: 'd...@openvswitch.org' ; Ben Pfaff
>
>Subject: RE: [ovs-dev] [PATCH] packets: Prefetch the packet metadata in
>cacheline1.
>
>Hi Bhanu,
>
>Regards
>_Sugesh
>
>> -Original Message-
>> From: Bodireddy, Bhanuprakash
>> Sent: Monday, November 27, 2017 4:35 PM
>> To: 'Aaron Conole' 
>> Cc: 'd...@openvswitch.org' ; Ben Pfaff
>> ; Chandran, Sugesh 
>> Subject: RE: [ovs-dev] [PATCH] packets: Prefetch the packet metadata
>> in cacheline1.
>>
>> >>Bhanuprakash Bodireddy  writes:
>> >>
>> >>> pkt_metadata_prefetch_init() is used to prefetch the packet
>> >>> metadata before initializing the metadata in pkt_metadata_init().
>> >>> This is done for every packet in userspace datapath and is performance
>critical.
>> >>>
>> >>> Commit 99fc16c0 prefetches only cachline0 and cacheline2 as the
>> >>> metadata part of respective cachelines will be initialized by
>> >>pkt_metadata_init().
>> >>>
>> >>> However in VXLAN case when popping the vxlan header,
>> >>> netdev_vxlan_pop_header() invokes pkt_metadata_init_tnl() which
>> >>> zeroes out metadata part of
>> >>> cacheline1 that wasn't prefetched earlier and causes performance
>> >>> degradation.
>> >>>
>> >>> By prefetching cacheline1, 9% performance improvement is observed.
>> >>
>> >>Do we see a degredation in the non-vxlan case?  If not, then I don't
>> >>see any reason not to apply this patch.
>> >
>> >This patch doesn't impact the performance of non-vxlan cases and only
>> >have a positive impact in vxlan case.
>>
>> The commit message claims that the performance improvement was 9%
>with
>> this patch but when Sugesh was checking he wasn't getting that
>> performance improvement on his Haswell.
>>
>> I was chatting to Sugesh this afternoon on this patch and we found
>> some interesting details and much of this boils down to how the OvS is
>> built .( Apart from HW, BIOS settings - TB disabled).
>>
>> The test case here measure the VXLAN de capsulation performance alone
>> for packet sizes of 118 bytes.
>> The OvS CFLAGS and throughput numbers are as below.
>>
>> CFLAGS="-O2"
>> Master  4.667 Mpps
>> With Patch   5.045 Mpps
>>
>> CFLAGS="-O2 -msse4.2"
>> Master  4.710 Mpps
>> With Patch   5.097 Mpps
>>
>> CFLAGS="-O2 -march=native"
>> Master  5.072 Mpps
>> With Patch   5.193 Mpps
>>
>> CFLAGS="-Ofast -march=native"
>> Master  5.349 Mpps
>> With Patch   5.378 Mpps
>>
>> This means the performance measurements/claims are difficult to assess
>> and as one can see above with "-Ofast, -march=native"
>> the improvement is insignificant but this is very platform dependent
>> due to "march=native" flag. Also the optimization flags seems to make
>> significant difference.
>[Sugesh] I also tested on my board with same set of configuration and getting
>the same result as yours.
>So this patch offers performance improvement based on the compiler option.
>I am not sure whats the most preferred/used compiler option out there.
>I always build OVS with CFLAGS="-Ofast -march=native" and the patch
>doesn't have a great improvement in it.
>
>I don't mind Acking the patch, if you could re-send the patch with these
>results and options in the commit message.
>Atleast it will offer performance improvement for other build options.

Thanks, Sugesh, for testing this out. I will send out a v2 of this with the
information I mentioned in the earlier mail included in the commit message.

Bhanuprakash.


Re: [ovs-dev] [PATCH] Revert "dpif_netdev: Refactor dp_netdev_pmd_thread structure."

2017-11-28 Thread Bodireddy, Bhanuprakash
>
>> Analyzing the memory layout with gdb for large structures is time consuming
>and not usually recommended.
>> I would suggest using Poke-a-hole(pahole) and that helps to understand
>and fix the structures in no time.
>> With pahole it's going to be lot easier to work with large structures
>especially.
>
>Thanks for the pointer. I'll have a look at pahole.
>It doesn't affect my reasoning against optimizing the compactification of 
>struct
>dp_netdev_pmd_thread, though.
>
>> >Finally, even for x86 there is not even a performance improvement. I
>> >re-ran our standard L3VPN over VXLAN performance PVP test on master
>> >and with Ilya's revert patch:
>> >
>> >Flows   master  reverted
>> >8,  4.46    4.48
>> >100,    4.27    4.29
>> >1000,   4.07    4.07
>> >2000,   3.68    3.68
>> >5000,   3.03    3.03
>> >1,  2.76    2.77
>> >2,  2.64    2.65
>> >5,  2.60    2.61
>> >10, 2.60    2.61
>> >50, 2.60    2.61
>>
>> What are the  CFLAGS in this case, as they seem to make difference. I
>> have added my finding here for a different patch targeted at performance
>>
>> https://mail.openvswitch.org/pipermail/ovs-dev/2017-
>November/341270.ht
>> ml
>
>I'm compiling with "-O3 -msse4.2" to be in line with production deployments
>of OVS-DPDK that need to run on a wider family of Xeon generations.

Thanks for this. AFAIK, by specifying '-msse4.2' alone you don't get to use
__builtin_popcount().
One way to enable it is to add '-mpopcnt' to CFLAGS, or to build with
-march=native.

(This is slightly out of context for this thread, JFYI. Ignore this if you only
want to use intrinsics and not the builtin popcount.)

>
>>
>> Patches to consider when testing your use case:
>>  Xzalloc_cachline:  https://mail.openvswitch.org/pipermail/ovs-dev/2017-
>November/341231.html
>>  (If using output batching)  
>> https://mail.openvswitch.org/pipermail/ovs-
>dev/2017-November/341230.html
>
>I didn't use these. Tx batching is not relevant here. And I understand the
>xzalloc_cacheline patch alone does not guarantee that the allocated memory
>is indeed cache line-aligned.

At least with posix_memalign(), the address will be aligned on 64 bytes and
start at a CACHE_LINE_SIZE boundary.
I am yet to check Ben's new patch and test it.

- Bhanuprakash.

>
>Thx, Jan


Re: [ovs-dev] [PATCH] util: Make xmalloc_cacheline() allocate full cachelines.

2017-11-28 Thread Bodireddy, Bhanuprakash
>Until now, xmalloc_cacheline() has provided its caller memory that does not
>share a cache line, but when posix_memalign() is not available it did not
>provide a full cache line; instead, it returned memory that was offset 8 bytes
>into a cache line.  This makes it hard for clients to design structures to be 
>cache
>line-aligned.  This commit changes
>xmalloc_cacheline() to always return a full cache line instead of memory
>offset into one.
>
>Signed-off-by: Ben Pfaff 
>---
> lib/util.c | 60 ---
>-
> 1 file changed, 32 insertions(+), 28 deletions(-)
>
>diff --git a/lib/util.c b/lib/util.c
>index 9e6edd27ae4c..137091a3cd4f 100644
>--- a/lib/util.c
>+++ b/lib/util.c
>@@ -196,15 +196,9 @@ x2nrealloc(void *p, size_t *n, size_t s)
> return xrealloc(p, *n * s);
> }
>
>-/* The desired minimum alignment for an allocated block of memory. */ -
>#define MEM_ALIGN MAX(sizeof(void *), 8) -
>BUILD_ASSERT_DECL(IS_POW2(MEM_ALIGN));
>-BUILD_ASSERT_DECL(CACHE_LINE_SIZE >= MEM_ALIGN);
>-
>-/* Allocates and returns 'size' bytes of memory in dedicated cache lines.  
>That
>- * is, the memory block returned will not share a cache line with other data,
>- * avoiding "false sharing".  (The memory returned will not be at the start of
>- * a cache line, though, so don't assume such alignment.)
>+/* Allocates and returns 'size' bytes of memory aligned to a cache line
>+and in
>+ * dedicated cache lines.  That is, the memory block returned will not
>+share a
>+ * cache line with other data, avoiding "false sharing".
>  *
>  * Use free_cacheline() to free the returned memory block. */  void * @@ -
>221,28 +215,37 @@ xmalloc_cacheline(size_t size)
> }
> return p;
> #else
>-void **payload;
>-void *base;
>-
> /* Allocate room for:
>  *
>- * - Up to CACHE_LINE_SIZE - 1 bytes before the payload, so that the
>- *   start of the payload doesn't potentially share a cache line.
>+ * - Header padding: Up to CACHE_LINE_SIZE - 1 bytes, to allow the
>+ *   pointer to be aligned exactly sizeof(void *) bytes before the
>+ *   beginning of a cache line.
>  *
>- * - A payload consisting of a void *, followed by padding out to
>- *   MEM_ALIGN bytes, followed by 'size' bytes of user data.
>+ * - Pointer: A pointer to the start of the header padding, to allow 
>us
>+ *   to free() the block later.
>  *
>- * - Space following the payload up to the end of the cache line, so
>- *   that the end of the payload doesn't potentially share a cache 
>line
>- *   with some following block. */
>-base = xmalloc((CACHE_LINE_SIZE - 1)
>-   + ROUND_UP(MEM_ALIGN + size, CACHE_LINE_SIZE));
>-
>-/* Locate the payload and store a pointer to the base at the beginning. */
>-payload = (void **) ROUND_UP((uintptr_t) base, CACHE_LINE_SIZE);
>-*payload = base;
>-
>-return (char *) payload + MEM_ALIGN;
>+ * - User data: 'size' bytes.
>+ *
>+ * - Trailer padding: Enough to bring the user data up to a cache line
>+ *   multiple.
>+ *
>+ * +--------+---------+-----------+---------+
>+ * | header | pointer | user data | trailer |
>+ * +--------+---------+-----------+---------+
>+ * ^        ^         ^
>+ * |        |         |
>+ * p        q         r
>+ *
>+ */
>+void *p = xmalloc((CACHE_LINE_SIZE - 1)
>+  + sizeof(void *)
>+  + ROUND_UP(size, CACHE_LINE_SIZE));
>+bool runt = PAD_SIZE((uintptr_t) p, CACHE_LINE_SIZE) < sizeof(void *);
>+void *r = (void *) ROUND_UP((uintptr_t) p + (runt ? CACHE_LINE_SIZE : 0),
>+CACHE_LINE_SIZE);
>+void **q = (void **) r - 1;
>+*q = p;
>+    return r;
> #endif
> }
>
>@@ -265,7 +268,8 @@ free_cacheline(void *p)
> free(p);
> #else
> if (p) {
>-free(*(void **) ((uintptr_t) p - MEM_ALIGN));
>+void **q = (void **) p - 1;
>+free(*q);
> }
> #endif
> }
>--

Thanks for the patch.
I reviewed and tested this, and it now returns a 64-byte-aligned address.

Acked-by: Bhanuprakash Bodireddy 

- Bhanuprakash.




Re: [ovs-dev] [PATCH] Revert "dpif_netdev: Refactor dp_netdev_pmd_thread structure."

2017-11-28 Thread Bodireddy, Bhanuprakash
>On 27.11.2017 20:02, Bodireddy, Bhanuprakash wrote:
>>> I agree with Ilya here. Adding theses cache line markers and
>>> re-grouping variables to minimize gaps in cache lines is creating a
>>> maintenance burden without any tangible benefit. I have had to go
>>> through the pain of refactoring my PMD Performance Metrics patch to
>>> the new dp_netdev_pmd_thread struct and spent a lot of time to
>>> analyze the actual memory layout with GDB and play Tetris with the
>variables.
>>
>> Analyzing the memory layout with gdb for large structures is time consuming
>and not usually recommended.
>> I would suggest using Poke-a-hole(pahole) and that helps to understand
>and fix the structures in no time.
>> With pahole it's going to be lot easier to work with large structures
>especially.
>
>Interesting tool, but it seems doesn't work perfectly. I see duplicated unions
>and zero length arrays in the output and I still need to check sizes by hands.
>And it fails trying to run on my ubuntu 16.04 LTS on x86.
>IMHO, the code should be simple enough to not use external utilities when
>you need to check the single structure.

Pahole has been around for a while, is available in most distributions, and
works reliably on RHEL distros.
I am on Fedora and built pahole from sources; it displays all the sizes and
cacheline boundaries.

>
>>>
>>> There will never be more than a handful of PMDs, so minimizing the
>>> gaps does not matter from memory perspective. And whether the
>>> individual members occupy 4 or 5 cache lines does not matter either
>>> compared to the many hundred cache lines touched for EMC and DPCLS
>>> lookups of an Rx batch. And any optimization done for x86 is not
>>> necessarily optimal for other architectures.
>>
>> I agree that optimization targeted for x86 doesn't necessarily suit ARM due
>to its different cache line size.
>>
>>>
>>> Finally, even for x86 there is not even a performance improvement. I
>>> re-ran our standard L3VPN over VXLAN performance PVP test on master
>>> and with Ilya's revert patch:
>>>
>>> Flows   master  reverted
>>> 8,  4.46    4.48
>>> 100,    4.27    4.29
>>> 1000,   4.07    4.07
>>> 2000,   3.68    3.68
>>> 5000,   3.03    3.03
>>> 1,  2.76    2.77
>>> 2,  2.64    2.65
>>> 5,  2.60    2.61
>>> 10, 2.60    2.61
>>> 50, 2.60    2.61
>>
>> What are the  CFLAGS in this case, as they seem to make difference. I have
>added my finding here for a different patch targeted at performance
>>
>> https://mail.openvswitch.org/pipermail/ovs-dev/2017-
>November/341270.ht
>> ml
>
>Do you have any performance results that shows significant performance
>difference between above cases? Please describe your test scenario and
>environment so we'll be able to see that padding/alignment really needed
>here. I saw no such results yet.
>
>BTW, at one place you're saying that patch was not about performance, at the
>same time you're trying to show that it has some positive performance
>impact. I'm a bit confused with that.

Giving a bit more context here: I have been experimenting with *prefetching*
in OVS, as prefetching is currently used in only two places (emc_processing
and cmaps). This work is aimed at checking the performance benefits of
prefetching not just on Haswell but also on the newer range of processors.

The best way to prefetch part of a structure is to mark the portion of
interest, which isn't possible unless we have some kind of cacheline
marking. This is what my patch initially did; with the markers in place we
can prefetch a portion of the structure based on them. You can find an
example in the pkt_metadata struct.

My point is that on x86, with cacheline marking and the xzalloc_cacheline()
API, one shouldn't see a drop in performance, if not an improvement. But the
real improvements will be seen when prefetching is done in the right places,
and that's WIP.

Bhanuprakash.

>
>>
>> Patches to consider when testing your use case:
>>  xzalloc_cacheline:
>>  https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341231.html
>>  (If using output batching)
>>  https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341230.html
>>
>> - Bhanuprakash.
>>
>>>
>>> All in all, I support reverting this change.
>>>
>>> Regards, Jan
>>>
>>> Acked-by: Jan Scheurich 
>>>

Re: [ovs-dev] [PATCH v2 1/2] timeval: Introduce macros to convert timespec and timeval.

2017-11-28 Thread Bodireddy, Bhanuprakash
Hi Ben,

>On Tue, Nov 14, 2017 at 08:42:30PM +, Bhanuprakash Bodireddy wrote:
>> This commit replaces the numbers with MSEC_PER_SEC, NSEC_PER_SEC and
>> USEC_PER_MSEC macros when dealing with timespec and timeval.
>>
>> This commit doesn't change functionality.
>>
>> Signed-off-by: Bhanuprakash Bodireddy
>> 
>
>This still seems careless and risky to me.
>
>For example:
>msecs = secs * MSEC_PER_SEC * 1LL;
>which expands to
>msecs = secs * 1000L * 1LL;
>still risks overflow on a 32-bit system (where 1000L is 32 bits long).
>
>The previous version of the code didn't have that problem:
>msecs = secs * 1000LL;
>
>Maybe it would be better to just leave these as-is.

I agree with you and take back my changes w.r.t. introducing the time
macros. I have posted a v3 version replacing the macros.
On an unrelated note, can you please also review the patch that extends
get_process_info():
https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/340762.html

My Keepalive patch series has a dependency on the high-resolution timer
patch and the above-mentioned API.
- Bhanuprakash.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] util: Make xmalloc_cacheline() allocate full cachelines.

2017-11-29 Thread Bodireddy, Bhanuprakash
>
>On Tue, Nov 28, 2017 at 09:06:09PM +, Bodireddy, Bhanuprakash wrote:
>> >Until now, xmalloc_cacheline() has provided its caller memory that
>> >does not share a cache line, but when posix_memalign() is not
>> >available it did not provide a full cache line; instead, it returned
>> >memory that was offset 8 bytes into a cache line.  This makes it hard
>> >for clients to design structures to be cache line-aligned.  This
>> >commit changes
>> >xmalloc_cacheline() to always return a full cache line instead of
>> >memory offset into one.
>> >
>> >Signed-off-by: Ben Pfaff 
>> >---
>> > lib/util.c | 60
>> >---
>> >-
>> > 1 file changed, 32 insertions(+), 28 deletions(-)
>> >
>> >diff --git a/lib/util.c b/lib/util.c
>> >index 9e6edd27ae4c..137091a3cd4f 100644
>> >--- a/lib/util.c
>> >+++ b/lib/util.c
>> >@@ -196,15 +196,9 @@ x2nrealloc(void *p, size_t *n, size_t s)
>> > return xrealloc(p, *n * s);
>> > }
>> >
>> >-/* The desired minimum alignment for an allocated block of memory.
>> >*/ - #define MEM_ALIGN MAX(sizeof(void *), 8) -
>> >BUILD_ASSERT_DECL(IS_POW2(MEM_ALIGN));
>> >-BUILD_ASSERT_DECL(CACHE_LINE_SIZE >= MEM_ALIGN);
>> >-
>> >-/* Allocates and returns 'size' bytes of memory in dedicated cache
>> >lines.  That
>> >- * is, the memory block returned will not share a cache line with
>> >other data,
>> >- * avoiding "false sharing".  (The memory returned will not be at
>> >the start of
>> >- * a cache line, though, so don't assume such alignment.)
>> >+/* Allocates and returns 'size' bytes of memory aligned to a cache
>> >+line and in
>> >+ * dedicated cache lines.  That is, the memory block returned will
>> >+not share a
>> >+ * cache line with other data, avoiding "false sharing".
>> >  *
>> >  * Use free_cacheline() to free the returned memory block. */  void
>> >* @@ -
>> >221,28 +215,37 @@ xmalloc_cacheline(size_t size)
>> > }
>> > return p;
>> > #else
>> >-void **payload;
>> >-void *base;
>> >-
>> > /* Allocate room for:
>> >  *
>> >- * - Up to CACHE_LINE_SIZE - 1 bytes before the payload, so that 
>> >the
>> >- *   start of the payload doesn't potentially share a cache line.
>> >+ * - Header padding: Up to CACHE_LINE_SIZE - 1 bytes, to allow the
>> >+ *   pointer to be aligned exactly sizeof(void *) bytes before the
>> >+ *   beginning of a cache line.
>> >  *
>> >- * - A payload consisting of a void *, followed by padding out to
>> >- *   MEM_ALIGN bytes, followed by 'size' bytes of user data.
>> >+ * - Pointer: A pointer to the start of the header padding, to 
>> >allow us
>> >+ *   to free() the block later.
>> >  *
>> >- * - Space following the payload up to the end of the cache line, 
>> >so
>> >- *   that the end of the payload doesn't potentially share a cache 
>> >line
>> >- *   with some following block. */
>> >-base = xmalloc((CACHE_LINE_SIZE - 1)
>> >-   + ROUND_UP(MEM_ALIGN + size, CACHE_LINE_SIZE));
>> >-
>> >-/* Locate the payload and store a pointer to the base at the beginning.
>*/
>> >-payload = (void **) ROUND_UP((uintptr_t) base, CACHE_LINE_SIZE);
>> >-*payload = base;
>> >-
>> >-return (char *) payload + MEM_ALIGN;
>> >+ * - User data: 'size' bytes.
>> >+ *
>> >+ * - Trailer padding: Enough to bring the user data up to a cache 
>> >line
>> >+ *   multiple.
>> >+ *
>> >+ * +---------------+---------+------------+---------+
>> >+ * | header        | pointer | user data  | trailer |
>> >+ * +---------------+---------+------------+---------+
>> >+ * ^               ^         ^
>> >+ * |               |         |
>> >+ * p               q         r
>> >+ *
>> >+ */
>> >+void *p = xmalloc((CACHE_LINE_SIZE - 1)
>> >+  + sizeof(void *)
>> >+  + ROUND_UP(size, CACHE_LINE_SIZE));
>> 

Re: [ovs-dev] [PATCH] util: Make xmalloc_cacheline() allocate full cachelines.

2017-11-30 Thread Bodireddy, Bhanuprakash
>On Wed, Nov 29, 2017 at 08:02:17AM +0000, Bodireddy, Bhanuprakash wrote:
>> >
>> >On Tue, Nov 28, 2017 at 09:06:09PM +, Bodireddy, Bhanuprakash
>wrote:
>> >> >Until now, xmalloc_cacheline() has provided its caller memory that
>> >> >does not share a cache line, but when posix_memalign() is not
>> >> >available it did not provide a full cache line; instead, it
>> >> >returned memory that was offset 8 bytes into a cache line.  This
>> >> >makes it hard for clients to design structures to be cache
>> >> >line-aligned.  This commit changes
>> >> >xmalloc_cacheline() to always return a full cache line instead of
>> >> >memory offset into one.
>> >> >
>> >> >Signed-off-by: Ben Pfaff 
>> >> >---
>> >> > lib/util.c | 60
>> >> >---
>> >> >-
>> >> > 1 file changed, 32 insertions(+), 28 deletions(-)
>> >> >
>> >> >diff --git a/lib/util.c b/lib/util.c index
>> >> >9e6edd27ae4c..137091a3cd4f 100644
>> >> >--- a/lib/util.c
>> >> >+++ b/lib/util.c
>> >> >@@ -196,15 +196,9 @@ x2nrealloc(void *p, size_t *n, size_t s)
>> >> > return xrealloc(p, *n * s);
>> >> > }
>> >> >
>> >> >-/* The desired minimum alignment for an allocated block of memory.
>> >> >*/ - #define MEM_ALIGN MAX(sizeof(void *), 8) -
>> >> >BUILD_ASSERT_DECL(IS_POW2(MEM_ALIGN));
>> >> >-BUILD_ASSERT_DECL(CACHE_LINE_SIZE >= MEM_ALIGN);
>> >> >-
>> >> >-/* Allocates and returns 'size' bytes of memory in dedicated
>> >> >cache lines.  That
>> >> >- * is, the memory block returned will not share a cache line with
>> >> >other data,
>> >> >- * avoiding "false sharing".  (The memory returned will not be at
>> >> >the start of
>> >> >- * a cache line, though, so don't assume such alignment.)
>> >> >+/* Allocates and returns 'size' bytes of memory aligned to a
>> >> >+cache line and in
>> >> >+ * dedicated cache lines.  That is, the memory block returned
>> >> >+will not share a
>> >> >+ * cache line with other data, avoiding "false sharing".
>> >> >  *
>> >> >  * Use free_cacheline() to free the returned memory block. */
>> >> >void
>> >> >* @@ -
>> >> >221,28 +215,37 @@ xmalloc_cacheline(size_t size)
>> >> > }
>> >> > return p;
>> >> > #else
>> >> >-void **payload;
>> >> >-void *base;
>> >> >-
>> >> > /* Allocate room for:
>> >> >  *
>> >> >- * - Up to CACHE_LINE_SIZE - 1 bytes before the payload, so that
>the
>> >> >- *   start of the payload doesn't potentially share a cache 
>> >> >line.
>> >> >+ * - Header padding: Up to CACHE_LINE_SIZE - 1 bytes, to allow 
>> >> >the
>> >> >+ *   pointer to be aligned exactly sizeof(void *) bytes before 
>> >> >the
>> >> >+ *   beginning of a cache line.
>> >> >  *
>> >> >- * - A payload consisting of a void *, followed by padding out 
>> >> >to
>> >> >- *   MEM_ALIGN bytes, followed by 'size' bytes of user data.
>> >> >+ * - Pointer: A pointer to the start of the header padding, to 
>> >> >allow
>us
>> >> >+ *   to free() the block later.
>> >> >  *
>> >> >- * - Space following the payload up to the end of the cache 
>> >> >line, so
>> >> >- *   that the end of the payload doesn't potentially share a 
>> >> >cache
>line
>> >> >- *   with some following block. */
>> >> >-base = xmalloc((CACHE_LINE_SIZE - 1)
>> >> >-   + ROUND_UP(MEM_ALIGN + size, CACHE_LINE_SIZE));
>> >> >-
>> >> >-/* Locate the payload and store a pointer to the base at the
>beginning.
>> >*/
>> >> >-payload = (void **) ROUND_UP((uintptr_t) base, CACHE_LINE_SIZE);
>> >> >-*payload = base;
>> >> >-
>> >> >-

Re: [ovs-dev] [PATCH RFC 1/5] compiler: Introduce OVS_PREFETCH variants.

2017-12-04 Thread Bodireddy, Bhanuprakash
Hi Ben,

>On Mon, Dec 04, 2017 at 08:16:46PM +, Bhanuprakash Bodireddy wrote:
>> This commit introduces prefetch variants by using the GCC built-in
>> prefetch function.
>>
>> The prefetch variants gives the user better control on designing data
>> caching strategy in order to increase cache efficiency and minimize
>> cache pollution. Data reference patterns here can be classified in to
>>
>>  - Non-temporal(NT) - Data that is referenced once and not reused in
>>   immediate future.
>>  - Temporal - Data will be used again soon.
>>
>> The Macro variants can be used where there are
>>  - Predictable memory access patterns.
>>  - Execution pipeline can stall if data isn't available.
>>  - Time consuming loops.
>>
>> For example:
>>
>>   OVS_PREFETCH_CACHE(addr, OPCH_LTR)
>> - OPCH_LTR : OVS PREFETCH CACHE HINT-LOW TEMPORAL READ.
>> - __builtin_prefetch(addr, 0, 1)
>> - Prefetch data in to L3 cache for readonly purpose.
>>
>>   OVS_PREFETCH_CACHE(addr, OPCH_HTW)
>> - OPCH_HTW : OVS PREFETCH CACHE HINT-HIGH TEMPORAL WRITE.
>> - __builtin_prefetch(addr, 1, 3)
>> - Prefetch data in to all caches in anticipation of write. In doing
>>   so it invalidates other cached copies so as to gain 'exclusive'
>>   access.
>>
>>   OVS_PREFETCH(addr)
>> - OPCH_HTR : OVS PREFETCH CACHE HINT-HIGH TEMPORAL READ.
>> - __builtin_prefetch(addr, 0, 3)
>> - Prefetch data in to all caches in anticipation of read and that
>>   data will be used again soon (HTR - High Temporal Read).
>>
>> Signed-off-by: Bhanuprakash Bodireddy
>> 
>
>The information in this commit message seems like it could also be useful as
>part of a code comment.

This makes sense and I can include this in the code comments with some examples 
of usage.

- Bhanuprakash.




Re: [ovs-dev] [PATCH RFC 2/5] configure: Include -mprefetchwt1 explicitly.

2017-12-04 Thread Bodireddy, Bhanuprakash
>On Mon, Dec 04, 2017 at 08:16:47PM +, Bhanuprakash Bodireddy wrote:
>> Processors support prefetch instruction in anticipation of write but
>> compilers(gcc) won't use them unless explicitly asked to do so even
>> with '-march=native' specified.
>>
>> [Problem]
>>   Case A:
>> OVS_PREFETCH_CACHE(addr, OPCH_HTW)
>>__builtin_prefetch(addr, 1, 3)
>>  leaq-112(%rbp), %rax[Assembly]
>>  prefetchw  (%rax)
>>
>>   Case B:
>> OVS_PREFETCH_CACHE(addr, OPCH_LTW)
>>__builtin_prefetch(addr, 1, 1)
>>  leaq-112(%rbp), %rax[Assembly]
>>  prefetchw  (%rax) <***problem***>
>>
>>   Inspite of specifying -march=native and using Low Temporal
>Write(OPCH_LTW),
>>   the compiler generates 'prefetchw' instruction instead of 'prefetchwt1'
>>   instruction available on processor.
>>
>> [Solution]
>>   Include -mprefetchwt1
>>
>>   Case B:
>> OVS_PREFETCH_CACHE(addr, OPCH_LTW)
>>__builtin_prefetch(addr, 1, 1)
>>  leaq-112(%rbp), %rax[Assembly]
>>  prefetchwt1  (%rax)
>>
>> [Testing]
>>   $ ./boot.sh
>>   $ ./configure
>>  checking target hint for cgcc... x86_64
>>  checking whether gcc accepts -mprefetchwt1... yes
>>   $ make -j
>>
>> Signed-off-by: Bhanuprakash Bodireddy
>> 
>
>Does this have any effect if the architecture or CPU configured for use does
>not support prefetchwt1?

That's a good question and I spent a reasonable amount of time today
figuring this out. I have Haswell, Broadwell and Skylake CPUs and they all
support this instruction. But I found that this instruction isn't emitted
by default even with -march=native, so it needs to be enabled explicitly.

Coming to your question, there won't be side effects from using OPCH_LTW.
On processors that support *neither* PREFETCHW nor PREFETCHWT1, the
compiler generates a 'prefetcht1' instruction.
On processors that support PREFETCHW, the compiler generates a 'prefetchw'
instruction.
On processors that support both PREFETCHW and PREFETCHWT1, the compiler
generates a 'prefetchwt1' instruction when -mprefetchwt1 is explicitly
enabled.

>If it could lead to that situation, then this does not
>seem like the right thing to do, and we might want to fall back to
>recommending use of the option when the person building knows that the
>software will run on a machine with prefetchwt1.

According to the above, on processors that don't support the instruction, a
'prefetcht1' instruction is generated, with no side effects.
I verified this using https://gcc.godbolt.org/, carefully checking the
instructions generated for different compiler versions and -march flags.

- Bhanuprakash.


Re: [ovs-dev] [PATCH RFC 3/5] util: Extend ovs_prefetch_range to include prefetch type.

2017-12-04 Thread Bodireddy, Bhanuprakash
>On Mon, Dec 04, 2017 at 08:16:48PM +, Bhanuprakash Bodireddy wrote:
>> With ovs_prefetch_range(), large amounts of data can be prefetched in
>> to caches. Prefetch type gives better control over data caching
>> strategy; Meaning where the data should be prefetched(L1/L2/L3) and if
>> the data reference is temporal or non-temporal.
>>
>> Signed-off-by: Bhanuprakash Bodireddy
>> 
>
>I'll leave review of patches 3-5 to others who better understand the specific
>issues.

No problem. I posted this as an RFC to get early feedback and am currently
looking at bottlenecks in other use cases (VXLANs, conntrack) with multiple
PMD threads where prefetching could be used.

- Bhanuprakash.


Re: [ovs-dev] [PATCH v6 0/7] Output packet batching.

2017-12-05 Thread Bodireddy, Bhanuprakash
>I have retested your "Output patches batching" v6 in our standard PVP L3-
>VPN/VXLAN benchmark setup [1]. The configuration is a single PMD serving a
>physical 10G port and a VM running DPDK testpmd as IP reflector with 4
>equally loaded vhostuser ports. The tests are run with 64 byte packets. Below
>are Mpps values averaged over four 10 second runs:
>
>                 master   patch                  patch
>Flows            Mpps     tx-flush-interval=0    tx-flush-interval=50
>8                4.419    4.342  -1.7%           4.749   7.5%
>100              4.026    3.956  -1.7%           4.281   6.3%
>1000             3.630    3.632   0.1%           3.760   3.6%
>2000             3.394    3.390  -0.1%           3.490   2.8%
>5000             2.989    2.938  -1.7%           2.994   0.2%
>10000            2.756    2.711  -1.6%           2.746  -0.4%
>20000            2.641    2.598  -1.6%           2.622  -0.7%
>50000            2.604    2.558  -1.8%           2.579  -1.0%
>100000           2.598    2.552  -1.8%           2.572  -1.0%
>500000           2.598    2.550  -1.8%           2.571  -1.0%
>
>As expected output batching within rx bursts (tx-flush-interval=0) provides
>little or no benefit in this scenario. The test results reflect roughly a 1.7%
>performance penalty due to the tx batching overhead. This overhead is
>measurable, but should in my eyes not be a blocker for merging this patch
>series.

I had a similar observation when I was testing for regressions in the
non-batching scenario:
https://mail.openvswitch.org/pipermail/ovs-dev/2017-October/339719.html

As tx-flush-interval defaults to 0 (instant send) and this causes a
performance degradation, I recommend documenting it in one of the commits
and linking to these performance numbers (adding a Tested-at tag) so that
users can tune tx-flush-interval accordingly.
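For example, assuming the output-batching series is applied and OVS is running, the tuning discussed here might look like the following (illustrative commands, mirroring the other_config knob named in the thread):

```shell
# Enable time-based tx batching with a 50 us minimum flush interval.
ovs-vsctl set Open_vSwitch . other_config:tx-flush-interval=50

# Revert to the default of 0 (instant send, no batching delay).
ovs-vsctl set Open_vSwitch . other_config:tx-flush-interval=0
```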

>
>Interestingly, tests with time-based tx batching and a minimum flush interval
>of 50 microseconds show a consistent and significant performance increase
>for small number of flows (in the regime where EMC is effective) and a
>reduced penalty of 1% for many flows. I don't have a good explanation yet for
>this phenomenon. I would be interested to see if other benchmark results
>support the general positive impact of time-based tx batching on throughput
>also for synthetic DPDK applications in the VM. The average Ping RTT increases
>by 20-30 us as expected.

I think this depends on the tx-flush-interval value and should also be
documented.

- Bhanuprakash.

>
>We will also retest the performance improvement of time-based tx batching
>on interrupt driven Linux kernel applications (such as iperf3).
>
>BR, Jan
>
>> -Original Message-
>> From: Ilya Maximets [mailto:i.maxim...@samsung.com]
>> Sent: Friday, 01 December, 2017 16:44
>> To: ovs-dev@openvswitch.org; Bhanuprakash Bodireddy
>
>> Cc: Heetae Ahn ; Antonio Fischetti
>; Eelco Chaudron
>> ; Ciara Loftus ; Kevin
>Traynor ; Jan Scheurich
>> ; Ian Stokes ; Ilya
>Maximets 
>> Subject: [PATCH v6 0/7] Output packet batching.
>>
>> This patch-set inspired by [1] from Bhanuprakash Bodireddy.
>> Implementation of [1] looks very complex and introduces many pitfalls [2]
>> for later code modifications like possible packet stucks.
>>
>> This version targeted to make simple and flexible output packet batching on
>> higher level without introducing and even simplifying netdev layer.
>>
>> Basic testing of 'PVP with OVS bonding on phy ports' scenario shows
>> significant performance improvement.
>>
>> Test results for time-based batching for v3:
>> https://mail.openvswitch.org/pipermail/ovs-dev/2017-
>September/338247.html
>>
>> Test results for v4:
>> https://mail.openvswitch.org/pipermail/ovs-dev/2017-
>October/339624.html
>>
>> [1] [PATCH v4 0/5] netdev-dpdk: Use intermediate queue during packet
>transmission.
>> https://mail.openvswitch.org/pipermail/ovs-dev/2017-
>August/337019.html
>>
>> [2] For example:
>> https://mail.openvswitch.org/pipermail/ovs-dev/2017-
>August/337133.html
>>
>> Version 6:
>>  * Rebased on current master:
>>- Added new patch to refactor dp_netdev_pmd_thread structure
>>  according to following suggestion:
>>  https://mail.openvswitch.org/pipermail/ovs-dev/2017-
>November/341230.html
>>
>>NOTE: I still prefer reverting of the padding related patch.
>>  Rebase done to not block acepting of this series.
>>  Revert patch and discussion here:
>>  https://mail.openvswitch.org/pipermail/ovs-dev/2017-
>November/341153.html
>>
>>  * Added comment about pmd_thread_ctx_time_update() usage.
>>
>> Version 5:
>>  * pmd_thread_ctx_time_update() calls moved to different places to
>>call them only from dp_netdev_process_rxq_port() and main
>>polling functions:
>>  pmd_thread_main, dpif_netdev_run and
>dpif_netdev_execute.
>>All other functions should use cached time from pmd->ctx.now.
>>It's guaranteed to be updated at least once per polling cycle.
>>  * 'may_steal' patch returned to version from v3 because
>>'may_steal' in

Re: [ovs-dev] [PATCH RFC 2/5] configure: Include -mprefetchwt1 explicitly.

2017-12-05 Thread Bodireddy, Bhanuprakash
>>>On Mon, Dec 04, 2017 at 08:16:47PM +, Bhanuprakash Bodireddy
>wrote:
 Processors support prefetch instruction in anticipation of write but
 compilers(gcc) won't use them unless explicitly asked to do so even
 with '-march=native' specified.

 [Problem]
   Case A:
 OVS_PREFETCH_CACHE(addr, OPCH_HTW)
__builtin_prefetch(addr, 1, 3)
  leaq-112(%rbp), %rax[Assembly]
  prefetchw  (%rax)

   Case B:
 OVS_PREFETCH_CACHE(addr, OPCH_LTW)
__builtin_prefetch(addr, 1, 1)
  leaq-112(%rbp), %rax[Assembly]
  prefetchw  (%rax) <***problem***>

   Inspite of specifying -march=native and using Low Temporal
>>>Write(OPCH_LTW),
   the compiler generates 'prefetchw' instruction instead of 'prefetchwt1'
   instruction available on processor.

 [Solution]
   Include -mprefetchwt1

   Case B:
 OVS_PREFETCH_CACHE(addr, OPCH_LTW)
__builtin_prefetch(addr, 1, 1)
  leaq-112(%rbp), %rax[Assembly]
  prefetchwt1  (%rax)

 [Testing]
   $ ./boot.sh
   $ ./configure
  checking target hint for cgcc... x86_64
  checking whether gcc accepts -mprefetchwt1... yes
   $ make -j

 Signed-off-by: Bhanuprakash Bodireddy >>> intel.com>
>>>
>>>Does this have any effect if the architecture or CPU configured for
>>>use does not support prefetchwt1?
>>
>> That's a good question and I spent reasonable time today to figure this out.
>> I have Haswell, Broadwell and Skylake CPUs and they all support this
>instruction.
>
>Hmm. I have 2 different Broadwell machines (Xeon E5 v4 and i7-6800K) and
>both of them doesn't have prefetchwt1 instruction according to cpuid:
>
>   PREFETCHWT1  = false

Xeon E5-26XX v4 is a Broadwell workstation/server part, but the i7-6800K is
a Skylake desktop variant, whereas the E3-12XX v5 is the equivalent Skylake
workstation/server variant. AFAIK prefetchwt1 should be available on the
above processors; I am not sure why cpuid reports otherwise.

pmd_thread_main()
-----------------
With OPCH_HTW, we see the prefetchw instruction:

    OVS_PREFETCH_CACHE(&pmd->cachelineC, OPCH_HTW);
    cycles_count_start(pmd);
    for (;;) {
        for (i = 0; i < poll_cnt; i++) {
            process_packets =
                dp_netdev_process_rxq_port(pmd, poll_list[i].rxq->rx,
                                           poll_list[i].port_no);
            cycles_count_intermediate(pmd, poll_list[i].rxq,

    Address    Source Line    Assembly
    0x6e29ef   4,086          movl  0x823ecb(%rip), %edi
    0x6e29f5   4,085          movq  0x50(%rsp), %rax
    0x6e29fa   4,086          test  %edi, %edi
    0x6e29fc   4,085          prefetchw  (%rax)

With OPCH_LTW, we can see the prefetchwt1 instruction being used (a local
change was made to show this):

    OVS_PREFETCH_CACHE(&pmd->cachelineC, OPCH_LTW);
    cycles_count_start(pmd);
    for (;;) {
        for (i = 0; i < poll_cnt; i++) {
        ..

    Address    Source Line    Assembly
    0x6e29ef   4,086          movl  0x823ecb(%rip), %edi
    0x6e29f5   4,085          movq  0x50(%rsp), %rax
    0x6e29fa   4,086          test  %edi, %edi
    0x6e29fc   4,085          prefetchwt1  (%rax)

-----------------

>
>This means that introducing of this change will break binary compatibility even
>between CPUs of the same generation, i.e. I will not be able to run on my
>system binaries compiled on yours.
>
>If it's true I prefer to not have this change.
>
>Anyway adding of this change will make compiling a generic binary for a
>different platforms impossible if your build server supports prefetchwt1.
>There should be way to disable this arch specific compiler flag even if it
>supported on my current platform.

I see your point: a build server can be a newer machine that supports the
prefetchwt1 instruction, and the question is how the precompiled binaries
behave when copied to and run on a server that doesn't support it.

I am not sure about this. Maybe the Red Hat/Canonical developers can
comment on how they handle this kind of case.

I will try to check this on my side.
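A quick way to check a given Linux host before deciding on CFLAGS is to look at the feature flags the kernel exposes. A sketch; `prefetchwt1` and `3dnowprefetch` are the /proc/cpuinfo flag names Linux uses for PREFETCHWT1 and PREFETCHW support.

```shell
# Report whether the running CPU advertises the write-prefetch features.
for flag in prefetchwt1 3dnowprefetch; do
    if grep -qw "$flag" /proc/cpuinfo; then
        echo "$flag: supported"
    else
        echo "$flag: not supported"
    fi
done
```

Build systems could use such a probe to decide whether adding -mprefetchwt1 to CFLAGS is safe for the target machine.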

- Bhanuprakash.

>
>Best regards, Ilya Maximets.
>
>> But I found that this instruction isn't enabled by default even with
>march=native and so need to explicitly enable this.
>>

Re: [ovs-dev] [PATCH RFC 5/5] dpif-netdev: Prefetch the cacheline having the cycle stats.

2017-12-05 Thread Bodireddy, Bhanuprakash
>
>> Prefetch the cacheline having the cycle stats so that we can speed up
>> the cycles_count_start() and cycles_count_intermediate().
>
>Do you have any performance results?

I don't have numbers for this patch alone. I was testing the overall
throughput along with other patches (that were *not* part of this RFC
series) to verify performance improvements. I will include numbers in the
commit log once I test the individual patches.

BTW, I usually look at the % of total instructions retired and at the
cycles spent in the front end and back end for the functions to see whether
prefetching improves or degrades performance.

- Bhanuprakash.

>
>>
>> Signed-off-by: Bhanuprakash Bodireddy > intel.com>
>> ---
>>  lib/dpif-netdev.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
>> b74b5d7..ab13d83 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -576,7 +576,7 @@ struct dp_netdev_pmd_thread {
>>  struct ovs_mutex flow_mutex;
>>  /* 8 pad bytes. */
>>  );
>> -PADDED_MEMBERS(CACHE_LINE_SIZE,
>> +PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE,
>cachelineC,
>>  struct cmap flow_table OVS_GUARDED; /* Flow table. */
>>
>>  /* One classifier per in_port polled by the pmd */ @@ -4082,6
>> +4082,7 @@ reload:
>>  lc = UINT_MAX;
>>  }
>>
>> +OVS_PREFETCH_CACHE(&pmd->cachelineC, OPCH_HTW);
>>  cycles_count_start(pmd);
>>  for (;;) {
>>  for (i = 0; i < poll_cnt; i++) {
>> --
>> 2.4.11


Re: [ovs-dev] [PATCH RFC 2/5] configure: Include -mprefetchwt1 explicitly.

2017-12-05 Thread Bodireddy, Bhanuprakash
[...]
>int main()
>{
>int c;
>
>__builtin_prefetch(&c, 1, 1);
>c = 8;
>
>return c;
>}
>
>on my old Ivy Bridge i7-3770 CPU. It does not support even 'prefetchw':
>
>  PREFETCHWT1  = false
>  3DNow! PREFETCH/PREFETCHW instructions = false
>
>Results:

[Bhanu] I found https://gcc.godbolt.org/ the other day; it's handy for
generating code for different targets and compilers.

>$ gcc 1.c
>$ objdump -S ./a.out | grep prefetch -A2 -B2
>  40055b:   31 c0   xor%eax,%eax
>  40055d:   48 8d 45 f4 lea-0xc(%rbp),%rax
>  400561:   0f 18 18prefetcht2 (%rax)
>  400564:   c7 45 f4 08 00 00 00movl   $0x8,-0xc(%rbp)
>  40056b:   8b 45 f4mov-0xc(%rbp),%eax

[Bhanu] Expected and compiler generates prefetcht2.

>
>$ gcc 1.c -march=native
>$ objdump -S ./a.out | grep prefetch -A2 -B2
>  40055b:   31 c0   xor%eax,%eax
>  40055d:   48 8d 45 f4 lea-0xc(%rbp),%rax
>  400561:   0f 18 18prefetcht2 (%rax)
>  400564:   c7 45 f4 08 00 00 00movl   $0x8,-0xc(%rbp)
>  40056b:   8b 45 f4mov-0xc(%rbp),%eax

[Bhanu] Though -march=native is specified, the processor doesn't have the
instruction, and the compiler still generates prefetcht2.

>$ gcc 1.c -march=native -mprefetchwt1
>$ objdump -S ./a.out | grep prefetch -A2 -B2
>  40055b:   31 c0   xor%eax,%eax
>  40055d:   48 8d 45 f4 lea-0xc(%rbp),%rax
>  400561:   0f 0d 10prefetchwt1 (%rax)
>  400564:   c7 45 f4 08 00 00 00movl   $0x8,-0xc(%rbp)
>  40056b:   8b 45 f4mov-0xc(%rbp),%eax

[Bhanu] The compiler inserts prefetchwt1 instruction as we asked it to do.

>
>So, it inserts this instruction even if I have no such instruction in CPU.

[Bhanu]
Though the compiler generates this, the instruction isn't available on the
processor, so it just becomes a multi-byte no-operation (NOP).
On Intel processors that don't have prefetchw, and AMD processors without
the 3DNow! feature, it decodes into a NOP.
http://ref.x86asm.net/coder64.html#x0F0D
- Click on '0D' in the two-byte opcode index (16. 0F0D NOP).
- More information can be found in the Intel SW developer's manual
  (combined volumes).

>More interesting is that program still works without any issues.
>I assume that CPU just skips that instruction or executes something else.

[Bhanu] This is what is mostly expected: on processors that support
prefetchwt1 it executes, and on others it just becomes a NOP.

>
>So, it's really strange and it's unclear what CPU really executes in case where
>we have 'prefetchwt1' in code but not supported by CPU.

[Bhanu] It's most likely decoded into a NOP by the pipeline decode units.

>
>If CPU just skips this instruction we will lost all the prefetching 
>optimizations
>because all the calls will be replaced by non-existent 'prefetchwt1'.

[Bhanu] I would be worried if the core generated an exception, treating it
as an illegal instruction. Instead, the pipeline units treat it as a NOP if
the processor doesn't support it.
So the micro-optimizations simply do nothing on processors that don't
support the instruction.

>
>How can we be sure that 'prefetchwt1' was really executed?

[Bhanu] I don't know how we can observe this unless we can peek into the
instruction queues and decoders of the pipeline :(.

- Bhanuprakash.


Re: [ovs-dev] [PATCH RFC 2/5] configure: Include -mprefetchwt1 explicitly.

2017-12-07 Thread Bodireddy, Bhanuprakash
>> >> If CPU just skips this instruction we will lost all the prefetching
>> >> optimizations because all the calls will be replaced by non-existent
>'prefetchwt1'.
>> >
>> > [Bhanu] I would be worried if core generates an exception treating
>> > it as illegal instruction. Instead pipeline units treat this as NOP
>> > if it
>> doesn't support it.
>> > So the micro optimizations doesn't really do any thing on the processors
>that doesn't support it.
>>
>> This could be an issue. If someday we'll have real performance
>> optimization based on OPCH_HTW prefetch, we will have prefetchwt1 on
>> system that supports it and NOP on others even if they have usual
>> prefetchw which could provide performance improvement too.

[Bhanu] Adding the below information only for future reference (going to
point to this thread in the commit log).

On systems that have *only* prefetchw and no prefetchwt1 instruction:
 OPCH_LTW  -  prefetchw
 OPCH_MTW  -  prefetchw
 OPCH_HTW  -  prefetchw
 OPCH_NTW  -  prefetchw

On systems that support both prefetchw and prefetchwt1:
 OPCH_LTW  -  prefetchwt1
 OPCH_MTW  -  prefetchwt1
 OPCH_HTW  -  prefetchw
 OPCH_NTW  -  prefetchwt1

So OPCH_HTW would always be prefetchw, and LTW/MTW/NTW might turn into NOPs
on processors that support prefetchw alone
(when compiled with CFLAGS = -march=native -mprefetchwt1).

>>
>> As I understand, checking of '-mprefetchwt1' is equal to checking
>> compiler version. It doesn't check anything about supporting of this
>instruction in CPU.
>> This could end up with non-working performance optimizations and even
>> degradation on systems that supports usual prefetches but not
>> prefetchwt1 (useless NOPs degrades performance if they are on a hot
>path).
>>
>> IMHO, This compiler option should be passed only if CPU really supports it.
>> I guess, the maximum that we can do is add a note into performance
>> optimization guide that '-mprefetchwt1' could be passed via CFLAGS if
>> user sure that it supported by target CPU.
>
>That is my thinking as well. The people/organizations building OVS packages
>for deployment have the responsibility to specify the minimum requirements
>on the target architecture and feed that into the compiler using CFLAGS. That
>may well be leaning towards the lower end of capabilities to maximize
>compatibility and sacrifice some performance on high-end CPUs.
>
>The specialized prefetch macros should be mapped to the best available
>target instructions by the compiler and/or conditional compile directives
>based on the CFLAGS architecture settings.
>
>We would gather all these target-specific compiler optimization guidelines in
>the advanced DPDK documentation of OVS.
>
>Of course developers or benchmark testers are free to use -march=native or
>similar at their discretion in their local test beds for best possible 
>performance.

If the general view is to get rid of this flag at compilation and only document
it, I am happy with that and can update the documentation.
But I still think we are being too defensive here; with a few NOPs the
performance impact isn't even noticeable.
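If it ends up as documentation only, the guidance could be as simple as the following sketch (a hypothetical doc snippet; the flags come from this thread, and the user must first verify CPU support, e.g. via the prefetchwt1 flag in /proc/cpuinfo):

```shell
# Only for users who have verified the target CPU supports the
# instruction (grep prefetchwt1 /proc/cpuinfo):
./configure CFLAGS="-O2 -march=native -mprefetchwt1"
make
```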

- Bhanuprakash.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v6 1/7] dpif-netdev: Refactor PMD thread structure for further extension.

2017-12-07 Thread Bodireddy, Bhanuprakash
>On 07/12/17 14:28, Ilya Maximets wrote:
>> Thanks for review, comments inline.
>>
>> On 07.12.2017 15:49, Eelco Chaudron wrote:
>>> On 01/12/17 16:44, Ilya Maximets wrote:
 This is preparation for 'struct dp_netdev_pmd_thread' modification
 in upcoming commits. Needed to avoid reordering and regrouping while
 replacing old and adding new members.

>>> Should this be part of the TX batching set? Anyway, I'm ok if it's
>>> not stalling the approval :)
>> Unfortunately yes, because members reordered and regrouped just to
>> include new members: pmd->ctx and pmd->n_output_batches. This could
>> not be a standalone change because adding of different members will
>> require different regrouping/ reordering. I moved this change to a
>> separate patch to not do this twice while adding each member in patches
>2/7 and 6/7.
>>
>> Anyway, as I mentioned in cover letter, I still prefer reverting of
>> the padding at all by this patch:
>>
>> https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341153.html

I understand that with the PADDED_MEMBERS macro it was slightly tricky to extend
or reorganize the structure, which is why I suggested 'pahole'.
But I see that the problem hasn't gone away and there are still some strong
opinions on reverting the earlier effort.

I don't mind reverting the patch, but it would be nice if changes to this
structure were made with alignment in mind.
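For reference, the padding style under discussion can be sketched with an anonymous union per cache-line group; the member names here are illustrative stand-ins, not the real dp_netdev_pmd_thread layout:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64

/* Sketch of the PADDED_MEMBERS idea: each group of members is padded
 * out to a full cache line, so groups never share a line.  Adding a
 * member to one group forces re-checking only that group's size. */
struct pmd_sketch {
    union {                       /* cacheline 0: identity */
        struct {
            unsigned core_id;
            int numa_id;
        };
        uint8_t pad0[CACHE_LINE_SIZE];
    };
    union {                       /* cacheline 1: stats */
        struct {
            uint64_t hits;
            uint64_t misses;
        };
        uint8_t pad1[CACHE_LINE_SIZE];
    };
};

/* Note: the padding only guarantees intra-struct grouping; if the
 * struct itself is not allocated on a cache-line boundary, groups can
 * still straddle lines -- the point made elsewhere in this thread. */
size_t pmd_sketch_stats_offset(void)
{
    return offsetof(struct pmd_sketch, hits);
}

size_t pmd_sketch_size(void)
{
    return sizeof(struct pmd_sketch);
}
```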

- Bhanuprakash.


Re: [ovs-dev] [PATCH] dpif-netdev: Optimize the exact match lookup.

2017-12-08 Thread Bodireddy, Bhanuprakash
Hi Tonghao,

>On Thu, Jul 27, 2017 at 11:38:00PM -0700, Tonghao Zhang wrote:
>> When inserting or updating (e.g. emc_insert) a flow to EMC, we compare
>> (e.g the hash and miniflow ) the netdev_flow_key.
>> If the key is matched, we will update it. If we didn’t find the
>> miniflow in the cache, the new flow will be stored.
>>
>> But when looking up the flow, we compare the hash and miniflow of key
>> and make sure it is alive. If a flow is not alive but the key is
>> matched, we still will go to next loop. More important, we can’t find
>> the flow in the next loop (the flow is not alive in the previous
>> loop). This patch simply compares the miniflows of the packets.
>>
>> The topo is shown as below. VM01 sends TCP packets to VM02, and OvS
>> forwards packtets.
>>
>>  VM01 -- OVS+DPDK VM02 -- VM03
>>
>> With this patch, the TCP throughput between VMs is 5.37, 5.45, 5.48,
>> 5.59, 5.65, 5.60 Gbs/sec avg: 5.52 Gbs/sec
>>
>> up to:
>> 5.64, 5.65, 5.66, 5.67, 5.62, 5.67 Gbs/sec avg: 5.65 Gbs/sec
>>
>> (maybe ~2.3% performance improve, but it is hard to tell exactly due
>> to variance in the test results).
>>
>> Signed-off-by: Tonghao Zhang 
>
>Thank you for the patch.  I haven't spotted any reviews for this on the mailing
>list.  I apologize for that--usually I expect to see a review much more quickly
>than this.  I hope that someone who understands the dpif-netdev code well
>will provide a review soon.

I reviewed and tested this patch; the performance improvement is marginal and
varies a lot depending on the traffic pattern.

In the original implementation, if the hashes match and the entry is alive in
the EMC, the miniflows are compared using memcmp(), which takes significant
cycles.
With the change proposed in this patch, if the hash matches we would do the
miniflow comparison (which takes significant cycles depending on key->len) and
only then check whether the entry is alive. In case the entry isn't alive (with
the EMC saturated and packets hitting the classifier) we would have wasted a lot
of cycles doing the expensive memcmp().

What do you think?
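To make the ordering concrete, here is a self-contained sketch of the two lookup orders; emc_entry_sketch and key_sketch are simplified stand-ins, not the actual OVS structures. Both return the same result for a live entry; the difference is where the cycles go when the entry is dead:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct key_sketch { uint32_t hash; uint8_t mf[16]; };  /* stand-in miniflow */
struct emc_entry_sketch { bool alive; struct key_sketch key; };

/* Master's order: cheap liveness + hash checks first, so the expensive
 * memcmp() runs only for live, hash-matching entries. */
bool emc_match_alive_first(const struct emc_entry_sketch *e,
                           const struct key_sketch *k)
{
    return e->alive && e->key.hash == k->hash
           && !memcmp(e->key.mf, k->mf, sizeof k->mf);
}

/* Proposed order: memcmp() before the liveness check, which wastes the
 * memcmp() cycles whenever the matching entry turns out to be dead. */
bool emc_match_cmp_first(const struct emc_entry_sketch *e,
                         const struct key_sketch *k)
{
    return e->key.hash == k->hash
           && !memcmp(e->key.mf, k->mf, sizeof k->mf)
           && e->alive;
}
```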

- Bhanuprakash.


Re: [ovs-dev] [PATCH] dpif-netdev: Allocate dp_netdev_pmd_thread struct by xzalloc_cacheline.

2017-12-08 Thread Bodireddy, Bhanuprakash
>
>On 08.12.2017 16:45, Stokes, Ian wrote:
>>> All instances of struct dp_netdev_pmd_thread are allocated by xzalloc
>>> and therefore doesn't guarantee memory allocation aligned on
>>> CACHE_LINE_SIZE boundary. Due to this any padding done inside the
>>> structure with this assumption might create holes.
>>>
>>> This commit replaces xzalloc, free with xzalloc_cacheline and
>>> free_cacheline. With the changes the memory is 64 byte aligned.
>>
>> Thanks for this Bhanu,
>>
>> I think this looks OK and I'm considering pushing to the DPDK_Merge branch
>but as there has been a fair bit of debate lately regarding memory and cache
>alignment I want to flag to others who have engaged to date to have their say
>before I apply it as there has been no input yet for the patch.
>>
>> @Jan/Ilya, are you ok with this change?
>
>OVS will likely crash on destroying non_pmd thread because it still allocated 
>by
>usual xzalloc, but freed with others by free_cacheline().

Are you sure OvS crashes in this case, and is it reproducible?
Firstly, I didn't see a crash. To double-check, I enabled a DBG in
dp_netdev_destroy_pmd() to see if free_cacheline() is called for the non-PMD
thread (whose core_id is NON_PMD_CORE_ID); that doesn't seem to be hit, and it
only gets hit for PMD threads with valid core_ids.

Also, AFAIK the non-PMD thread is nothing but the vswitchd thread, and I don't
see how that can be freed from the above function. I also started wondering
where the memory allocated for the non-PMD thread is getting freed now.

Let me know the steps if you can reproduce the crash as you mentioned.
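The hazard under discussion is that cacheline-aligned allocations carry their own bookkeeping, so plain free()/xzalloc() and free_cacheline()/xzalloc_cacheline() are not interchangeable. A self-contained sketch of the over-allocate-and-stash technique (these are my stand-in names, not the OVS functions, which also have a posix_memalign path):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_LINE 64

/* Over-allocate, round up to a 64-byte boundary, and stash the base
 * pointer just before the aligned block so it can be recovered later. */
void *xzalloc_cacheline_sketch(size_t n)
{
    void *base = calloc(1, n + CACHE_LINE + sizeof base);
    if (!base) {
        return NULL;
    }
    uintptr_t p = (uintptr_t) base + sizeof base;
    p = (p + CACHE_LINE - 1) & ~(uintptr_t) (CACHE_LINE - 1);
    memcpy((void **) p - 1, &base, sizeof base);
    return (void *) p;
}

/* Recover the stashed base pointer and free it.  Passing a pointer that
 * came from plain malloc()/calloc() here reads garbage bookkeeping and
 * corrupts the heap -- the crash scenario raised in this thread. */
void free_cacheline_sketch(void *p)
{
    if (p) {
        void *base;
        memcpy(&base, (void **) p - 1, sizeof base);
        free(base);
    }
}

int cacheline_alloc_demo(void)
{
    unsigned char *p = xzalloc_cacheline_sketch(100);
    assert(((uintptr_t) p % CACHE_LINE) == 0);  /* 64-byte aligned */
    assert(p[0] == 0 && p[99] == 0);            /* zero-filled */
    free_cacheline_sketch(p);
    return 1;
}
```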

- Bhanuprakash.

>
>>
>> Thanks
>> Ian
>>
>>>
>>> Before:
>>> With xzalloc, all the memory is 16 byte aligned.
>>>
>>> (gdb) p pmd
>>> $1 = (struct dp_netdev_pmd_thread *) 0x7eff8a813010
>>> (gdb) p &pmd->cacheline0
>>> $2 = (OVS_CACHE_LINE_MARKER *) 0x7eff8a813010
>>> (gdb) p &pmd->cacheline1
>>> $3 = (OVS_CACHE_LINE_MARKER *) 0x7eff8a813050
>>> (gdb) p &pmd->flow_cache
>>> $4 = (struct emc_cache *) 0x7eff8a813090
>>> (gdb) p &pmd->flow_table
>>> $5 = (struct cmap *) 0x7eff8acb30d0
>>> (gdb)  p &pmd->stats
>>> $6 = (struct dp_netdev_pmd_stats *) 0x7eff8acb3110
>>> (gdb) p &pmd->port_mutex
>>> $7 = (struct ovs_mutex *) 0x7eff8acb3150
>>> (gdb) p &pmd->poll_list
>>> $8 = (struct hmap *) 0x7eff8acb3190
>>> (gdb) p &pmd->tnl_port_cache
>>> $9 = (struct hmap *) 0x7eff8acb31d0
>>> (gdb) p &pmd->stats_zero
>>> $10 = (unsigned long long (*)[5]) 0x7eff8acb3210
>>>
>>> After:
>>> With xzalloc_cacheline, all the memory is 64 byte aligned.
>>>
>>> (gdb) p pmd
>>> $1 = (struct dp_netdev_pmd_thread *) 0x7f39e2365040
>>> (gdb) p &pmd->cacheline0
>>> $2 = (OVS_CACHE_LINE_MARKER *) 0x7f39e2365040
>>> (gdb) p &pmd->cacheline1
>>> $3 = (OVS_CACHE_LINE_MARKER *) 0x7f39e2365080
>>> (gdb) p &pmd->flow_cache
>>> $4 = (struct emc_cache *) 0x7f39e23650c0
>>> (gdb) p &pmd->flow_table
>>> $5 = (struct cmap *) 0x7f39e2805100
>>> (gdb) p &pmd->stats
>>> $6 = (struct dp_netdev_pmd_stats *) 0x7f39e2805140
>>> (gdb) p &pmd->port_mutex
>>> $7 = (struct ovs_mutex *) 0x7f39e2805180
>>> (gdb) p &pmd->poll_list
>>> $8 = (struct hmap *) 0x7f39e28051c0
>>> (gdb) p &pmd->tnl_port_cache
>>> $9 = (struct hmap *) 0x7f39e2805200
>>> (gdb) p &pmd->stats_zero
>>> $10 = (unsigned long long (*)[5]) 0x7f39e2805240
>>>
>>> Reported-by: Ilya Maximets 
>>> Signed-off-by: Bhanuprakash Bodireddy
>>> 
>>> ---
>>>  lib/dpif-netdev.c | 4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
>>> db78318..3e281ae
>>> 100644
>>> --- a/lib/dpif-netdev.c
>>> +++ b/lib/dpif-netdev.c
>>> @@ -3646,7 +3646,7 @@ reconfigure_pmd_threads(struct dp_netdev
>*dp)
>>>  FOR_EACH_CORE_ON_DUMP(core, pmd_cores) {
>>>  pmd = dp_netdev_get_pmd(dp, core->core_id);
>>>  if (!pmd) {
>>> -pmd = xzalloc(sizeof *pmd);
>>> +pmd = xzalloc_cacheline(sizeof *pmd);
>>>  dp_netdev_configure_pmd(pmd, dp, core->core_id, core-
 numa_id);
>>>  pmd->thread = ovs_thread_create("pmd", pmd_thread_main,
>pmd);
>>>  VLOG_INFO("PMD thread on numa_id: %d, core id: %2d
>>> created.", @@ -4574,7 +4574,7 @@ dp_netdev_destroy_pmd(struct
>>> dp_netdev_pmd_thread
>>> *pmd)
>>>  xpthread_cond_destroy(&pmd->cond);
>>>  ovs_mutex_destroy(&pmd->cond_mutex);
>>>  ovs_mutex_destroy(&pmd->port_mutex);
>>> -free(pmd);
>>> +free_cacheline(pmd);
>>>  }
>>>
>>>  /* Stops the pmd thread, removes it from the 'dp->poll_threads',
>>> --
>>> 2.4.11
>>>
>>
>>
>>

Re: [ovs-dev] [PATCH] dpif-netdev: Allocate dp_netdev_pmd_thread struct by xzalloc_cacheline.

2017-12-08 Thread Bodireddy, Bhanuprakash
>
>On 08.12.2017 18:44, Bodireddy, Bhanuprakash wrote:
>>>
>>> On 08.12.2017 16:45, Stokes, Ian wrote:
>>>>> All instances of struct dp_netdev_pmd_thread are allocated by
>>>>> xzalloc and therefore doesn't guarantee memory allocation aligned
>>>>> on CACHE_LINE_SIZE boundary. Due to this any padding done inside
>>>>> the structure with this assumption might create holes.
>>>>>
>>>>> This commit replaces xzalloc, free with xzalloc_cacheline and
>>>>> free_cacheline. With the changes the memory is 64 byte aligned.
>>>>
>>>> Thanks for this Bhanu,
>>>>
>>>> I think this looks OK and I'm considering pushing to the DPDK_Merge
>>>> branch
>>> but as there has been a fair bit of debate lately regarding memory
>>> and cache alignment I want to flag to others who have engaged to date
>>> to have their say before I apply it as there has been no input yet for the
>patch.
>>>>
>>>> @Jan/Ilya, are you ok with this change?
>>>
>>> OVS will likely crash on destroying non_pmd thread because it still
>>> allocated by usual xzalloc, but freed with others by free_cacheline().
>>
>> Are you sure OvS crashes in this case and reproducible?
>> Firstly I didn't see a crash and to double check this I enabled a DBG
>> in dp_netdev_destroy_pmd() to see if free_cacheline() is called for
>> the non pmd thread (whose core_id is NON_PMD_CORE_ID) and that
>doesn't seem to be hitting and gets hit only for pmd threads having valid
>core_ids.
>
>This should happen in dp_netdev_free() on ovs exit or deletion of the
>datapath.
>
>I guess, you need following patch to reproduce:
>https://mail.openvswitch.org/pipermail/ovs-dev/2017-
>December/341617.html
>
>Ian is going to include it to the closest pull request.
>
>Even if it's not reproducible you have to fix memory allocation for non_pmd
>anyway. Current code logically wrong.

Ok, that makes sense. I will use xzalloc_cacheline() for allocating memory for
non_pmd too.

Bhanuprakash.

>
>>
>> Also AFAIK, non pmd thread is nothing but vswitchd thread and I don’t
>> see how that can be freed from the above function.  Also I started
>wondering where the memory allocated for non_pmd thread is getting freed
>now.
>>
>> Let me know the steps if you can reproduce the crash as you mentioned.
>>
>> - Bhanuprakash.
>>
>>>
>>>>
>>>> Thanks
>>>> Ian
>>>>
>>>>>
>>>>> Before:
>>>>> With xzalloc, all the memory is 16 byte aligned.
>>>>>
>>>>> (gdb) p pmd
>>>>> $1 = (struct dp_netdev_pmd_thread *) 0x7eff8a813010
>>>>> (gdb) p &pmd->cacheline0
>>>>> $2 = (OVS_CACHE_LINE_MARKER *) 0x7eff8a813010
>>>>> (gdb) p &pmd->cacheline1
>>>>> $3 = (OVS_CACHE_LINE_MARKER *) 0x7eff8a813050
>>>>> (gdb) p &pmd->flow_cache
>>>>> $4 = (struct emc_cache *) 0x7eff8a813090
>>>>> (gdb) p &pmd->flow_table
>>>>> $5 = (struct cmap *) 0x7eff8acb30d0
>>>>> (gdb)  p &pmd->stats
>>>>> $6 = (struct dp_netdev_pmd_stats *) 0x7eff8acb3110
>>>>> (gdb) p &pmd->port_mutex
>>>>> $7 = (struct ovs_mutex *) 0x7eff8acb3150
>>>>> (gdb) p &pmd->poll_list
>>>>> $8 = (struct hmap *) 0x7eff8acb3190
>>>>> (gdb) p &pmd->tnl_port_cache
>>>>> $9 = (struct hmap *) 0x7eff8acb31d0
>>>>> (gdb) p &pmd->stats_zero
>>>>> $10 = (unsigned long long (*)[5]) 0x7eff8acb3210
>>>>>
>>>>> After:
>>>>> With xzalloc_cacheline, all the memory is 64 byte aligned.
>>>>>
>>>>> (gdb) p pmd
>>>>> $1 = (struct dp_netdev_pmd_thread *) 0x7f39e2365040
>>>>> (gdb) p &pmd->cacheline0
>>>>> $2 = (OVS_CACHE_LINE_MARKER *) 0x7f39e2365040
>>>>> (gdb) p &pmd->cacheline1
>>>>> $3 = (OVS_CACHE_LINE_MARKER *) 0x7f39e2365080
>>>>> (gdb) p &pmd->flow_cache
>>>>> $4 = (struct emc_cache *) 0x7f39e23650c0
>>>>> (gdb) p &pmd->flow_table
>>>>> $5 = (struct cmap *) 0x7f39e2805100
>>>>> (gdb) p &pmd->stats
>

Re: [ovs-dev] [PATCH] netdev-native-tnl: Add assertion in vxlan_pop_header.

2018-01-12 Thread Bodireddy, Bhanuprakash
Hi Ben,

>On Fri, Jan 12, 2018 at 05:43:13PM +, Bhanuprakash Bodireddy wrote:
>> During tunnel decapsulation the below steps are performed:
>>  [1] Tunnel information is populated in packet metadata i.e packet->md-
>>tunnel.
>>  [2] Outer header gets popped.
>>  [3] Packet is recirculated.
>>
>> For [1] to work, the dp_packet L3 and L4 header offsets should be valid.
>> The offsets in the dp_packet are set as part of miniflow extraction.
>>
>> If offsets are accidentally reset (or) the pop header operation is
>> performed prior to miniflow extraction, step [1] fails silently and
>> creates issues that are harder to debug. Add the assertion to check if
>> the offsets are valid.
>>
>> Signed-off-by: Bhanuprakash Bodireddy
>> 
>> ---
>>  lib/netdev-native-tnl.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/lib/netdev-native-tnl.c b/lib/netdev-native-tnl.c index
>> 9ce8567..fb5eab0 100644
>> --- a/lib/netdev-native-tnl.c
>> +++ b/lib/netdev-native-tnl.c
>> @@ -508,6 +508,9 @@ netdev_vxlan_pop_header(struct dp_packet
>*packet)
>>  ovs_be32 vx_flags;
>>  enum packet_type next_pt = PT_ETH;
>>
>> +ovs_assert(packet->l3_ofs > 0);
>> +ovs_assert(packet->l4_ofs > 0);
>> +
>>  pkt_metadata_init_tnl(md);
>>  if (VXLAN_HLEN > dp_packet_l4_size(packet)) {
>>  goto err;
>
>Thanks for working to make OVS more reliable.
>
>How much risk do you think there is of these assertions triggering?  Are you
>debugging an issue where they would trigger, and has that been fixed?  I'm
>trying to figure out whether it makes more sense to put assertions here or
>whether something closer to a log message plus a jump to "err" would be
>better.  It's not great for OVS to assert-fail, but on the other hand if it 
>indicates
>a genuine bug then sometimes it's the best thing to do.

I was working on an RFC patch to skip recirculation on the VXLAN decap side.
I posted it today at
https://mail.openvswitch.org/pipermail/ovs-dev/2018-January/343103.html

In that implementation the VXLAN header is popped before miniflow extraction,
and that's when I ran into the above-mentioned problem.

Also, I found that dp_packet_reset_packet() and dp_packet_reset_offsets(), when
accidentally called, will clear the offsets, and any later invocation of
*vxlan_pop_header(), or for that matter any code that uses the dp_packet L3/L4
offsets, will fail. So I added an assertion to make this explicit for VXLANs.

Please note that there isn't any bug in the master code; this was done as a
precautionary measure to improve debugging.
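The log-and-bail alternative Ben raises could look like the sketch below. The UINT16_MAX sentinel for never-set offsets is my assumption here, mirroring how dp_packet_reset_offsets() invalidates them; the struct and function names are simplified stand-ins:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

struct pkt_sketch {
    uint16_t l3_ofs;   /* UINT16_MAX when never set (assumed sentinel) */
    uint16_t l4_ofs;
};

/* Instead of ovs_assert(), log and return an error so a bad packet is
 * dropped rather than aborting the whole daemon. */
int vxlan_pop_header_checked(struct pkt_sketch *p)
{
    if (p->l3_ofs == UINT16_MAX || p->l4_ofs == UINT16_MAX) {
        fprintf(stderr, "vxlan decap: L3/L4 offsets not set, dropping\n");
        return -1;
    }
    /* ...tunnel metadata population and header pop would go here... */
    return 0;
}
```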

- Bhanuprakash.


Re: [ovs-dev] [PATCH 1/4] compiler: Introduce OVS_PREFETCH variants.

2018-01-12 Thread Bodireddy, Bhanuprakash
>-Original Message-
>From: Ben Pfaff [mailto:b...@ovn.org]
>Sent: Friday, January 12, 2018 6:20 PM
>To: Bodireddy, Bhanuprakash 
>Cc: d...@openvswitch.org
>Subject: Re: [ovs-dev] [PATCH 1/4] compiler: Introduce OVS_PREFETCH
>variants.
>
>Hi Bhanu, who do you think should review this series?  Is it something that Ian
>should pick up for dpdk_merge?

Hi Ben,

I will check with Ian if he has time to review this. As the patch series doesn't
change any functionality at this point, it shouldn't take much time.

-Bhanuprakash.


Re: [ovs-dev] [PATCH v6 0/8] Add OVS DPDK keep-alive functionality.

2018-01-16 Thread Bodireddy, Bhanuprakash
>Hi,
>
>Sorry to jump on this at v6 only, but I skimmed over the code and I am
>struggling to understand what problem you're trying to solve. Yes, I realize
>you want some sort of feedback about the PMD processing, but it's not clear
>to me what exactly you want from it.
>
>This last patchset uses a separate thread just to monitor the PMD threads
>which can update their status in the core busy loop.  I guess it tells you if 
>the
>PMD thread is stuck or not, but not really if it's processing packets.  That's
>again, my question above.
>
>If you need to know if the thread is running, I think any OVS can provide you
>the process stats which should be more reliable and doesn't depend on OVS
>at all.
>
>I appreciate if you could elaborate more on the use-case.

Intel's SA team has been working on an SA framework for NFV environments and has
defined interfaces for the base platform (aligned with ETSI GS NFV 002), which
includes compute, storage, network, virtual switch, OS and hypervisor.
The core idea here is to monitor and detect service-impacting faults on the base
platform.
Both reactive and proactive fault detection techniques are employed, and faults
are reported to higher layers for corrective actions. The corrective actions can
be, for example, migrating the workloads or marking the compute node offline,
and are based on the policies enforced at higher layers.

One aspect of the larger SA framework is monitoring virtual switch health. Some
of the events of interest here are link status, OvS DB connection status, packet
statistics (drops/errors), and PMD health.

This patch series has only implemented *PMD health* monitoring and reporting,
and the details are already in the patch. The other interesting virtual switch
events are already implemented as part of a collectd plugin.

On your questions:

> I guess it tells you if the PMD thread is stuck or not, but not really if 
> it's processing packets.  That's
>again, my question above.

The functionality to check whether the PMD is processing packets was implemented
back in v3:
https://mail.openvswitch.org/pipermail/ovs-dev/2017-August/336789.html

For easier review, the patch series was split up in v4 to get the basic
functionality in. This is mentioned in the version change log below:
https://mail.openvswitch.org/pipermail/ovs-dev/2017-August/337702.html

>If you need to know if the thread is running, I think any OVS can provide you
>the process stats which should be more reliable and doesn't depend on OVS
>at all.

There is a problem here, and I simulated a case in the thread below showing that
the stats reported by the OS aren't accurate:
https://mail.openvswitch.org/pipermail/ovs-dev/2017-September/338388.html

Check the details on /proc/[pid]/stat. Though the PMD thread is stalled, the OS
reports the thread in the *Running (R)* state.
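This is why the keepalive design uses heartbeats rather than scheduler state: a busy-looping but stalled PMD still shows as R, whereas a per-iteration sequence number either advances or it doesn't. A minimal sketch of the idea (names are mine, not the patch's API):

```c
#include <assert.h>
#include <stdint.h>

struct ka_sketch { uint64_t seq; };

/* Called from the PMD main loop on every iteration. */
void ka_heartbeat(struct ka_sketch *ka)
{
    ka->seq++;
}

/* Monitor side: if the sequence number did not advance between two
 * snapshots, the thread made no progress, regardless of what
 * /proc/[pid]/stat reports. */
int ka_is_stalled(uint64_t prev_seq, uint64_t cur_seq)
{
    return cur_seq == prev_seq;
}
```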

- Bhanuprakash.

>
>
>On Fri, Dec 08, 2017 at 12:04:19PM +, Bhanuprakash Bodireddy wrote:
>> Keepalive feature is aimed at achieving Fastpath Service Assurance in
>> OVS-DPDK deployments. It adds support for monitoring the packet
>> processing threads by dispatching heartbeats at regular intervals.
>>
>> keepalive feature can be enabled through below OVSDB settings.
>>
>> enable-keepalive=true
>>   - Keepalive feature is disabled by default and should be enabled
>> at startup before ovs-vswitchd daemon is started.
>>
>> keepalive-interval="5000"
>>   - Timer interval in milliseconds for monitoring the packet
>> processing cores.
>>
>> TESTING:
>> The testing of keepalive is done using stress cmd (simulating the 
>> stalls).
>>   - pmd-cpu-mask=0xf [MQ enabled on DPDK ports]
>>   - stress -c 1 &  [tid is usually the __tid + 1 of the output]
>>   - chrt -r -p 99 [set realtime priority for stress thread]
>>   - taskset -p 0x8[Pin the stress thread to the core PMD is 
>> running]
>>   - PMD thread will be descheduled due to its normal priority and yields
>> core to stress thread.
>>
>>   - ovs-appctl keepalive/pmd-health-show   [Display that the thread is
>GONE]
>>   - ./ovsdb/ovsdb-client monitor Open_vSwitch  [Should update the
>> status]
>>
>>   - taskset -p 0x10   [This brings back pmd thread to life as stress
>thread
>> is moved to idle core]
>>
>>   (watch out for stress threads, and carefully pin them to core not to 
>> hang
>your DUTs
>>during tesing).
>>
>> v5 -> v6
>>   * Remove 2 patches from series
>>  - xnanosleep was applied to master as part of high resolution timeout
>support.
>>  - Extend get_process_info() API was also applied to master earlier.
>>   * Remove KA_STATE_DOZING as it was initially meant to handle Core C
>states, not needed
>> for now.
>>   * Fixed ka_destroy(), to fix unit test cases 536, 537.
>>   * A minor performance degradation(0.5%) is observed with Keepalive
>enabled.
>> [Tested with loopback case using 1000 IXIA streams/64 byte u

Re: [ovs-dev] [PATCH] dpif-netdev: Refactor datapath flow cache

2018-02-16 Thread Bodireddy, Bhanuprakash
>
>>-Original Message-
>>>
>>> [Wang, Yipeng] In my test, I compared the proposed EMC with current
>EMC with same 16k entries.
>>> If I turned off THP, the current EMC will cause many TLB misses because of
>its larger entry size, which I profiled with vTunes.
>>> Once I turned on THP with no other changes, the current EMC's
>>> throughput increases a lot and is comparable with the newly proposed
>EMC. From vTunes, the EMC lookup TLB misses decreases from 100 million to
>0 during the 30sec profiling time.
>>> So if THP is enabled, reducing EMC entry size may not give too much
>benefit comparing to the current EMC.
>>> It is worth to mention that they both use similar amount of CPU cache
>>> since only the miniflow struct is accessed by CPU, thus the TLB should be
>the major concern.

[BHANU]
I found this thread on THP interesting and want to share my findings here.
I did some micro-benchmarks on this feature a long time ago and found there was
some performance improvement with THP enabled.
Some of this can be attributed to a faster emc_lookup() with THP enabled.

With a large number of flows, emc_lookup() is back-end bound, and further
analysis showed that there is significant DTLB overhead. One way to reduce that
overhead is to use larger pages; with THP, the overhead for this function drops
by 40%.

So THP has some positive effect on emc_lookup()!
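The DTLB effect can be sanity-checked with back-of-the-envelope arithmetic: the number of TLB entries needed to cover a working set shrinks by the page-size ratio once THP backs it with 2MB pages. A small sketch, using the ~4MB EMC figure discussed elsewhere in this thread:

```c
#include <assert.h>
#include <stddef.h>

/* Pages (and hence TLB entries, ignoring shared pages) needed to cover
 * `bytes` of memory at a given page size. */
size_t pages_needed(size_t bytes, size_t page_size)
{
    return (bytes + page_size - 1) / page_size;  /* round up */
}
```

A 4MB EMC needs 1024 distinct 4KiB-page TLB entries but only 2 entries with 2MB huge pages, which is consistent with the DTLB miss reduction observed in the profiles.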

- Bhanuprakash.

>>
>>I understand your point. But I can't seem to reproduce the effect of THP on
>my system.
>>I don't have vTunes available, but I guess "perf stat" should also
>>provide TLB miss statistics.
>>
>>How can you check if ovs-vswitchd is using transparent huge pages for
>>backing e.g. the EMC memory?
>>
>
>[Wang, Yipeng]
>I used the master OVS and change the EMC to be 16k entries. I feed 10k or
>more flows to stress EMC.  With perf, I tried this command:
>sudo perf stat -p PID -e dTLB-load-misses It shows the TLB misses changed a
>lot with THP on or off on my machine. vtunes shows the EMC_lookup
>function's data separately though.
>
>To check if THP is used by OvS, I found a Redhat suggested command handy:
>From: https://access.redhat.com/solutions/46111
>grep -e AnonHugePages  /proc/*/smaps | awk  '{ if($2>4) print $0} ' |  awk -F
>"/"  '{print $0; system("ps -fp " $3)} '
>I don't know how to check each individual function though.
>
>>>
>>> [Wang, Yipeng] Yes that there is no systematic collisions. However,
>>> in general, 1-hash table tends to cause many more misses than 2-hash.
>>> For code simplicity, I agree that 1-hash is simpler and much easier
>>> to understand. For performance, if the flows can fit in 1-hash table,
>>> they should also stay in the primary location of the 2-hash table, so
>>> basically they should have similar lookup speed. For large numbers of
>>> flows in general, traffic will have higher miss ratio in 1-hash than
>>> 2-hash table. From one of our tests that has 10k flows and 3 subtable (test
>cases described later), and EMC is sized for 16k entries, the 2-hash EMC
>causes about 14% miss ratio,  while the 1-hash EMC causes 47% miss ratio.
>>
>>I agree that a lower EMC hit rate is a concern with just DPCLS or CD+DPCLS as
>second stage.
>>But with DFC the extra cost for a miss on EMC is low as the DFC lookup
>>only slightly higher than EMC itself. The EMC miss is cheap as it will
>>typically already detected when comparing the full RSS hash.
>>
>>Furthermore, the EMC is now mainly meant to speed up the biggest
>>elephant flows, so it can be smaller and thrashing is avoided by very low
>insertion probability.
>>Simplistic benchmarks using a large number of "eternal" flows with
>>equidistantly spaced packets are really an unrealistic worst case for any
>cache-based architecture.
>>
>
>[Wang, Yipeng]
>If the realistic traffic patterns mostly hit EMC with elephant flows, I agree 
>that
>EMC could be simplified.
>
>>>
>>> [Wang, Yipeng] We agree that a DFC hit performs better than a CD hit,
>>> but CD usually has higher hit rate for large number of flows, as the data
>shows later.
>>
>>That is something I don't yet understand. Is this because of the fact
>>that CD stores up to 16 entries per hash bucket and handles collisions better?
>
>[Wang, Yipeng]
>Yes, with 2-hash function and 16 entries per bucket, CD has much less misses
>in general.
>
>As first step to combine both CD and DFC, I incorporated the signature and
>way-associative structure from CD into DFC. I just did simple prototype
>without Any performance tuning, preliminary results show good
>improvement over miss ratio and throughput. I will post the complete results
>soon.
>
>Since DFC/CD is much faster than megaflow, I believe higher hit rate is
>preferred. So A CD-like way-associative structure should be helpful. The
>signature per entry also helps on performance, similar effect with EMC.
>
>>>
>>> [Wang, Yipeng] We use the test/rules we posted with our CD patch.
>>> Basically we vary src_IP to hit different subtables, and then vary
>>> dst_IP to create 

Re: [ovs-dev] [RFC 1/4] dpif-netdev: Refactor datapath flow cache

2018-02-20 Thread Bodireddy, Bhanuprakash
Hi Yipeng,

Thanks for the RFC series. This patch series needs to be rebased.
I applied it on an older commit to do initial testing. Some comments below.

I see that the DFC cache is implemented along similar lines to the EMC cache,
except that it holds a million entries and uses more bits of the RSS hash to
index into the cache. I agree that DPCLS lookup is expensive, consuming 30% of
total cycles in some test cases, and the DFC cache will definitely reduce some
pain there.

On the memory footprint:

On master,
  EMC entry size = 592 bytes
  8k entries = ~4MB.

With this patch,
  EMC entry size = 256 bytes
  16k entries = ~4MB.

I like the above reduction in flow key size, keeping the entry size a multiple
of the cache line and still keeping the overall EMC size at ~4MB with more EMC
entries.

However, my concern is the DFC cache size. As the DFC cache has a million
entries and consumes ~12 MB per PMD thread, it might not fit into the L3 cache.
Also note that in newer platforms the L3 cache is shrinking and L2 is slightly
larger (e.g., Skylake has only a 1MB L2 and 19MB L3 cache).

In spite of the memory footprint, I still think the DFC cache improves switching
performance, as it is a lot less expensive than invoking dpcls_lookup(), which
involves a more expensive hash computation and subtable traversal. It would be
nice if more testing were done with real VNFs to confirm that this patch doesn't
cause cache thrashing or suffer from memory bottlenecks.
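The footprint figures quoted above work out as simple per-entry arithmetic; a quick check (the entry sizes are the ones stated in this thread):

```c
#include <assert.h>
#include <stddef.h>

/* Total bytes for a per-PMD cache with `entries` entries of
 * `entry_size` bytes each. */
size_t cache_bytes(size_t entry_size, size_t entries)
{
    return entry_size * entries;
}
```

Master's EMC is 592 B x 8k = 4,849,664 B (~4.6MB, the "~4MB" above); the patched EMC is 256 B x 16k = exactly 4MB; and ~12 bytes per DFC entry x 1M entries gives the ~12MB-per-PMD concern.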

Some more comments below.

>This is a rebase of Jan's previous patch [PATCH] dpif-netdev: Refactor
>datapath flow cache https://mail.openvswitch.org/pipermail/ovs-dev/2017-
>November/341066.html
>
>So far the netdev datapath uses an 8K EMC to speed up the lookup of
>frequently used flows by comparing the parsed packet headers against the
>miniflow of a cached flow, using 13 bits of the packet RSS hash as index. The
>EMC is too small for many applications with 100K or more parallel packet flows
>so that EMC threshing actually degrades performance.
>Furthermore, the size of struct miniflow and the flow copying cost prevents us
>from making it much larger.
>
>At the same time the lookup cost of the megaflow classifier (DPCLS) is
>increasing as the number of frequently hit subtables grows with the
>complexity of pipeline and the number of recirculations.
>
>To close the performance gap for many parallel flows, this patch introduces
>the datapath flow cache (DFC) with 1M entries as lookup stage between EMC
>and DPCLS. It directly maps 20 bits of the RSS hash to a pointer to the last 
>hit
>megaflow entry and performs a masked comparison of the packet flow with
>the megaflow key to confirm the hit. This avoids the costly DPCLS lookup even
>for very large number of parallel flows with a small memory overhead.
>
>Due the large size of the DFC and the low risk of DFC thrashing, any DPCLS hit
>immediately inserts an entry in the DFC so that subsequent packets get
>speeded up. The DFC, thus, accelerate also short-lived flows.
>
>To further accelerate the lookup of few elephant flows, every DFC hit triggers
>a probabilistic EMC insertion of the flow. As the DFC entry is already in place
>the default EMC insertion probability can be reduced to
>1/1000 to minimize EMC thrashing should there still be many fat flows.
>The inverse EMC insertion probability remains configurable.
>
>The EMC implementation is simplified by removing the possibility to store a
>flow in two slots, as there is no particular reason why two flows should
>systematically collide (the RSS hash is not symmetric).

[BHANU]
I am not sure it is a good idea to simplify the EMC by making it 1-way
associative instead of the current 2-way associative implementation.
I prefer to leave the current approach as-is unless we have strong data proving
otherwise.
This comment applies to the code changes below w.r.t. EMC lookup and insert.
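For context, the 2-way scheme being defended gives each flow two candidate slots derived from different bit fields of the RSS hash, in the spirit of master's EMC position macro; a simplified sketch (constants and names are illustrative, not the exact OVS definitions):

```c
#include <assert.h>
#include <stdint.h>

#define EMC_SHIFT 13
#define EMC_MASK  ((1u << EMC_SHIFT) - 1)   /* 8k entries, as on master */

/* Primary candidate slot: low bits of the RSS hash. */
uint32_t emc_pos_primary(uint32_t hash)
{
    return hash & EMC_MASK;
}

/* Secondary candidate slot: the next field of hash bits.  A flow
 * evicted from one slot by a collision can still survive in the other,
 * which is what 1-way associativity gives up. */
uint32_t emc_pos_secondary(uint32_t hash)
{
    return (hash >> EMC_SHIFT) & EMC_MASK;
}
```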

>The maximum size of the EMC flow key is limited to 256 bytes to reduce the
>memory footprint. This should be sufficient to hold most real life packet flow
>keys. Larger flows are not installed in the EMC.

+1 

>
>The pmd-stats-show command is enhanced to show both EMC and DFC hits
>separately.
>
>The sweep speed for cleaning up obsolete EMC and DFC flow entries and
>freeing dead megaflow entries is increased. With a typical PMD cycle duration
>of 100us under load and checking one DFC entry per cycle, the DFC sweep
>should normally complete within in 100s.
>
>In PVP performance tests with an L3 pipeline over VXLAN we determined the
>optimal EMC size to be 16K entries to obtain a uniform speedup compared to
>the master branch over the full range of parallel flows. The measurement
>below is for 64 byte packets and the average number of subtable lookups per
>DPCLS hit in this pipeline is 1.0, i.e. the acceleration already starts for a 
>single
>busy mask. Tests with many visited subtables should show a strong increase
>of the gain through DFC.
>
>Flows   master  DFC+EMC  Gain
>   

Re: [ovs-dev] [RFC 2/4] dpif-netdev: Fix EMC key length

2018-02-20 Thread Bodireddy, Bhanuprakash
This fix is needed and can be included in 1/4 in the next revision.

- Bhanuprakash.

>-Original Message-
>From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
>boun...@openvswitch.org] On Behalf Of Yipeng Wang
>Sent: Thursday, January 18, 2018 6:20 PM
>To: d...@openvswitch.org; jan.scheur...@ericsson.com
>Cc: Tai, Charlie 
>Subject: [ovs-dev] [RFC 2/4] dpif-netdev: Fix EMC key length
>
>EMC's key length is not initialized when insertion. Initialize the key length
>before insertion.
>
>The code might be put in another place, for now I just put it in dfc_lookup.
>
>Signed-off-by: Yipeng Wang 
>---
> lib/dpif-netdev.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
>diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index b9f4b6d..3e87992
>100644
>--- a/lib/dpif-netdev.c
>+++ b/lib/dpif-netdev.c
>@@ -2295,7 +2295,7 @@ dfc_insert(struct dp_netdev_pmd_thread *pmd,  }
>
> static inline struct dp_netdev_flow *
>-dfc_lookup(struct dfc_cache *cache, const struct netdev_flow_key *key,
>+dfc_lookup(struct dfc_cache *cache, struct netdev_flow_key *key,
>bool *exact_match)
> {
> struct dp_netdev_flow *flow;
>@@ -2317,6 +2317,7 @@ dfc_lookup(struct dfc_cache *cache, const struct
>netdev_flow_key *key,
> /* Found a match in DFC. Insert into EMC for subsequent lookups.
>  * We use probabilistic insertion here so that mainly elephant
>  * flows enter EMC. */
>+key->len = netdev_flow_key_size(miniflow_n_values(&key->mf));
> emc_probabilistic_insert(&cache->emc_cache, key, flow);
> *exact_match = false;
> return flow;
>--
>2.7.4
>
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [RFC 3/4] dpif-netdev: Use way-associative cache

2018-02-23 Thread Bodireddy, Bhanuprakash
Hi Yipeng,

Thanks for the patch. Some high level questions/comments.

(1)  Am I right in understanding that this patch *only* introduces a new cache 
approach into the DFC to reduce collisions?

(2)  Why is the number of entries per bucket set to '8'?  With this, each 
dfc_bucket is 80 bytes (16 + 64).
If the number of entries is set to '6', the dfc_bucket size will be 60 
bytes and can fit into a cache line.
I assume 'DFC_ENTRY_PER_BUCKET' isn't a randomly picked number. Was it 
chosen based on benchmarks?
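A quick sizeof check illustrates the arithmetic behind question (2). The first
layout mirrors the patch's struct dfc_bucket (8 entries); the 6-entry variant
is hypothetical, shown only for comparison. Sizes assume a 64-bit (LP64) build
where pointers are 8 bytes.

```c
#include <stdint.h>

#define CACHE_LINE_SIZE 64
#define ENTRIES_8 8
#define ENTRIES_6 6

/* Layout from the 3/4 patch: a 2-byte signature plus an 8-byte flow
 * pointer per entry, so 8 entries give 16 + 64 = 80 bytes. */
struct dfc_bucket_8 {
    uint16_t sig[ENTRIES_8];
    void *flow[ENTRIES_8];  /* struct dp_netdev_flow * in the real code */
};

/* Hypothetical 6-entry variant: 12 + 48 = 60 bytes of payload, which
 * (even with alignment padding) fits in a single 64-byte cache line. */
struct dfc_bucket_6 {
    uint16_t sig[ENTRIES_6];
    void *flow[ENTRIES_6];
};
```

An 8-entry bucket therefore straddles two cache lines, which is the cost the
question is probing.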

(3) A 2-byte signature is introduced in each bucket and is used to insert or 
retrieve flows in the bucket.
3a. Due to the introduction of the 2-byte signature, the size of dfc_cache 
increased by 2MB per PMD thread.
3b. Every time we insert or retrieve a flow, we have to match the 
packet signature (upper 16 bits of the RSS hash) against each entry of the bucket. 
I wonder if that slows down the operations?

(4)  The number of buckets depends on the number of entries per bucket.  Which 
of these plays the more important role in reducing collisions?
i.e. would a higher number of entries per bucket reduce collisions?

(5) What is the performance delta observed with this new cache implementation 
over the 1/4 approach?

Some more minor comments below.

>This commit uses a way-associative cache (CD) rather than a simple single
>entry hash table for DFC. Experiments show that this design generally has
>much higher hit rate.
>
>Since a miss is much more costly than a hit, a CD-like structure that improves hit rate
>should help in general.
>
>Signed-off-by: Yipeng Wang 
>---
> lib/dpif-netdev.c | 107 +++--
>-
> 1 file changed, 70 insertions(+), 37 deletions(-)
>
>diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 3e87992..50a1d25
>100644
>--- a/lib/dpif-netdev.c
>+++ b/lib/dpif-netdev.c
>@@ -150,8 +150,10 @@ struct netdev_flow_key {
>  */
>
> #define DFC_MASK_LEN 20
>+#define DFC_ENTRY_PER_BUCKET 8
> #define DFC_ENTRIES (1u << DFC_MASK_LEN) -#define DFC_MASK
>(DFC_ENTRIES - 1)
>+#define DFC_BUCKET_CNT (DFC_ENTRIES / DFC_ENTRY_PER_BUCKET)
>#define
>+DFC_MASK (DFC_BUCKET_CNT - 1)
> #define EMC_MASK_LEN 14
> #define EMC_ENTRIES (1u << EMC_MASK_LEN)  #define EMC_MASK
>(EMC_ENTRIES - 1) @@ -171,13 +173,14 @@ struct emc_cache {
> int sweep_idx;
> };
>
>-struct dfc_entry {
>-struct dp_netdev_flow *flow;
>+struct dfc_bucket {
>+uint16_t sig[DFC_ENTRY_PER_BUCKET];
>+struct dp_netdev_flow *flow[DFC_ENTRY_PER_BUCKET];
> };
>
> struct dfc_cache {
> struct emc_cache emc_cache;
>-struct dfc_entry entries[DFC_ENTRIES];
>+struct dfc_bucket buckets[DFC_BUCKET_CNT];
> int sweep_idx;
> };
>
>@@ -749,9 +752,9 @@ dpif_netdev_xps_revalidate_pmd(const struct
>dp_netdev_pmd_thread *pmd,  static int dpif_netdev_xps_get_tx_qid(const
>struct dp_netdev_pmd_thread *pmd,
>   struct tx_port *tx);
>
>-static inline bool dfc_entry_alive(struct dfc_entry *ce);
>+static inline bool dfc_entry_alive(struct dp_netdev_flow *flow);
> static void emc_clear_entry(struct emc_entry *ce); -static void
>dfc_clear_entry(struct dfc_entry *ce);
>+static void dfc_clear_entry(struct dfc_bucket *b, int idx);
>
> static void dp_netdev_request_reconfigure(struct dp_netdev *dp);
>
>@@ -774,11 +777,13 @@ emc_cache_init(struct emc_cache *emc)  static void
>dfc_cache_init(struct dfc_cache *flow_cache)  {
>-int i;
>+int i, j;
>
> emc_cache_init(&flow_cache->emc_cache);
>-for (i = 0; i < ARRAY_SIZE(flow_cache->entries); i++) {
>-flow_cache->entries[i].flow = NULL;
>+for (i = 0; i < DFC_BUCKET_CNT; i++) {
>+for (j = 0; j < DFC_ENTRY_PER_BUCKET; j++) {
>+flow_cache->buckets[i].flow[j] = NULL;

[BHANU] How about initializing the signature?

>+}
> }
> flow_cache->sweep_idx = 0;
> }
>@@ -796,10 +801,12 @@ emc_cache_uninit(struct emc_cache *emc)  static
>void  dfc_cache_uninit(struct dfc_cache *flow_cache)  {
>-int i;
>+int i, j;
>
>-for (i = 0; i < ARRAY_SIZE(flow_cache->entries); i++) {
>-dfc_clear_entry(&flow_cache->entries[i]);
>+for (i = 0; i < DFC_BUCKET_CNT; i++) {
>+for (j = 0; j < DFC_ENTRY_PER_BUCKET; j++) {
>+dfc_clear_entry(&(flow_cache->buckets[i]), j);
>+}
> }
> emc_cache_uninit(&flow_cache->emc_cache);
> }
>@@ -2245,39 +2252,46 @@ emc_lookup(struct emc_cache *emc, const struct
>netdev_flow_key *key)
> return NULL;
> }
>
>-static inline struct dfc_entry *
>+static inline struct dp_netdev_flow *
> dfc_entry_get(struct dfc_cache *cache, const uint32_t hash)  {
>-return &cache->entries[hash & DFC_MASK];
>+struct dfc_bucket *bucket = &cache->buckets[hash & DFC_MASK];
>+uint16_t sig = hash >> 16;
>+for (int i = 0; i < DFC_ENTRY_PER_BUCKET; i++) {
>+if(bucket->sig[i] == sig) {
>+return bucket->flow[i];
>+}
>+}
>+return NULL;
> }
>
> static inline bool
>-df

Re: [ovs-dev] [RFC 4/4] dpif-netdev.c: Add indirect table

2018-02-23 Thread Bodireddy, Bhanuprakash
Hi Yipeng,

>If we store pointers in DFC, then the memory requirement is large. When
>there are VMs or multiple PMDs running on the same platform, they will
>compete the shared cache. So we want DFC to be as memory efficient as
>possible.

>
>Indirect table is a simple hash table that map the DFC's result to the
>dp_netdev_flow's pointer. This is to reduce the memory size of the DFC
>cache, assuming that the megaflow count is much smaller than the exact
>match flow count. With this commit, we could reduce the 8-byte pointer to a
>2-byte index in DFC cache so that the memory/cache requirement is almost
>halved. Another approach we plan to try is to use the flow_table as the
>indirect table.

I assume this patch is only aimed at reducing the DFC cache memory footprint 
and doesn't introduce any new functionality?

With this, the dfc_bucket size drops to 32 bytes from the earlier 80 bytes in 3/4, 
and the buckets will now be aligned to cache lines.
Also, the dfc_cache size is reduced to ~8MB from ~12MB in the 1/4 and ~14MB in the 3/4 
patches.
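The saving can be sanity-checked with the constants from the patch. The sketch
below compares the 3/4 bucket layout (8-byte pointers) with the 4/4 layout
(2-byte indices plus the shared indirect table). It covers only the bucket
array and indirect table, not the embedded EMC cache, and assumes 8-byte
pointers.

```c
#include <stddef.h>
#include <stdint.h>

/* Constants taken from the patch. */
#define DFC_MASK_LEN 20
#define DFC_ENTRIES (1u << DFC_MASK_LEN)
#define DFC_ENTRY_PER_BUCKET 8
#define DFC_BUCKET_CNT (DFC_ENTRIES / DFC_ENTRY_PER_BUCKET)
#define INDIRECT_TABLE_SIZE (1u << 12)

/* 3/4 layout: signature + full flow pointer per entry (80 bytes). */
struct bucket_ptr {
    uint16_t sig[DFC_ENTRY_PER_BUCKET];
    void *flow[DFC_ENTRY_PER_BUCKET];
};

/* 4/4 layout: signature + 2-byte index into the indirect table (32 bytes). */
struct bucket_idx {
    uint16_t sig[DFC_ENTRY_PER_BUCKET];
    uint16_t index[DFC_ENTRY_PER_BUCKET];
};

static size_t
dfc_bytes_ptr(void)
{
    return (size_t) DFC_BUCKET_CNT * sizeof(struct bucket_ptr);
}

static size_t
dfc_bytes_idx(void)
{
    /* Bucket array plus the shared indirect pointer table. */
    return (size_t) DFC_BUCKET_CNT * sizeof(struct bucket_idx)
           + (size_t) INDIRECT_TABLE_SIZE * sizeof(void *);
}
```

So the pointer-based bucket array costs ~10 MB per PMD, while the index-based
one plus the 4K-entry indirect table costs ~4 MB — less than half, matching
the "almost halved" claim in the cover letter.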

I am guessing there might be some performance improvement with this patch due 
to the buckets aligning to cache lines, apart from the reduced memory footprint. 
Do you see any such advantage in your benchmarks?

Regards,
Bhanuprakash.

>
>The indirect table size is a fixed constant for now.
>
>Signed-off-by: Yipeng Wang 
>---
> lib/dpif-netdev.c | 69 +++
>
> 1 file changed, 44 insertions(+), 25 deletions(-)
>
>diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 50a1d25..35197d3
>100644
>--- a/lib/dpif-netdev.c
>+++ b/lib/dpif-netdev.c
>@@ -151,6 +151,12 @@ struct netdev_flow_key {
>
> #define DFC_MASK_LEN 20
> #define DFC_ENTRY_PER_BUCKET 8
>+
>+/* For now we fix the Indirect table size, ideally it should be sized
>+according
>+ * to max megaflow count but less than 2^16  */ #define
>+INDIRECT_TABLE_SIZE (1u << 12) #define INDIRECT_TABLE_MASK
>+(INDIRECT_TABLE_SIZE - 1)
> #define DFC_ENTRIES (1u << DFC_MASK_LEN)  #define DFC_BUCKET_CNT
>(DFC_ENTRIES / DFC_ENTRY_PER_BUCKET)  #define DFC_MASK
>(DFC_BUCKET_CNT - 1) @@ -175,13 +181,14 @@ struct emc_cache {
>
> struct dfc_bucket {
> uint16_t sig[DFC_ENTRY_PER_BUCKET];
>-struct dp_netdev_flow *flow[DFC_ENTRY_PER_BUCKET];
>+uint16_t index[DFC_ENTRY_PER_BUCKET];
> };
>
> struct dfc_cache {
> struct emc_cache emc_cache;
> struct dfc_bucket buckets[DFC_BUCKET_CNT];
> int sweep_idx;
>+struct dp_netdev_flow *indirect_table[INDIRECT_TABLE_SIZE];
> };
>
> 

>@@ -754,7 +761,7 @@ static int dpif_netdev_xps_get_tx_qid(const struct
>dp_netdev_pmd_thread *pmd,
>
> static inline bool dfc_entry_alive(struct dp_netdev_flow *flow);  static void
>emc_clear_entry(struct emc_entry *ce); -static void dfc_clear_entry(struct
>dfc_bucket *b, int idx);
>+static void dfc_clear_entry(struct dp_netdev_flow **flow, struct
>+dfc_bucket *b, int idx);
>
> static void dp_netdev_request_reconfigure(struct dp_netdev *dp);
>
>@@ -782,9 +789,12 @@ dfc_cache_init(struct dfc_cache *flow_cache)
> emc_cache_init(&flow_cache->emc_cache);
> for (i = 0; i < DFC_BUCKET_CNT; i++) {
> for (j = 0; j < DFC_ENTRY_PER_BUCKET; j++) {
>-flow_cache->buckets[i].flow[j] = NULL;
>+flow_cache->buckets[i].sig[j] = 0;
> }
> }
>+for (i = 0; i < INDIRECT_TABLE_SIZE; i++) {
>+flow_cache->indirect_table[i] = NULL;
>+}
> flow_cache->sweep_idx = 0;
> }
>
>@@ -805,7 +815,7 @@ dfc_cache_uninit(struct dfc_cache *flow_cache)
>
> for (i = 0; i < DFC_BUCKET_CNT; i++) {
> for (j = 0; j < DFC_ENTRY_PER_BUCKET; j++) {
>-dfc_clear_entry(&(flow_cache->buckets[i]), j);
>+dfc_clear_entry(flow_cache->indirect_table,
>+ &(flow_cache->buckets[i]), j);
> }
> }
> emc_cache_uninit(&flow_cache->emc_cache);
>@@ -2259,7 +2269,7 @@ dfc_entry_get(struct dfc_cache *cache, const
>uint32_t hash)
> uint16_t sig = hash >> 16;
> for (int i = 0; i < DFC_ENTRY_PER_BUCKET; i++) {
> if(bucket->sig[i] == sig) {
>-return bucket->flow[i];
>+return cache->indirect_table[bucket->index[i]];
> }
> }
> return NULL;
>@@ -2272,28 +2282,33 @@ dfc_entry_alive(struct dp_netdev_flow *flow)  }
>
> static void
>-dfc_clear_entry(struct dfc_bucket *b, int idx)
>+dfc_clear_entry(struct dp_netdev_flow **ind_table, struct dfc_bucket
>+*b, int idx)
> {
>-if (b->flow[idx]) {
>-dp_netdev_flow_unref(b->flow[idx]);
>-b->flow[idx] = NULL;
>+if (ind_table[b->index[idx]]) {
>+dp_netdev_flow_unref(ind_table[b->index[idx]]);
>+ind_table[b->index[idx]] = NULL;
> }
> }
>
>-static inline void
>-dfc_change_entry(struct dfc_bucket *b, int idx, struct dp_netdev_flow
>*flow)
>+
>+static inline uint16_t
>+indirect_table_insert(struct dp_netdev_flow **indirect_table,
>+struct dp_netdev_flow *flow)
> {
>-if (b->flow[idx] != flow) {
>-if (b->flo

Re: [ovs-dev] Ports down on OVS DPDK and traffic fails

2017-03-30 Thread Bodireddy, Bhanuprakash
>-Original Message-
>From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
>boun...@openvswitch.org] On Behalf Of Nadathur, Sundar
>Sent: Thursday, March 30, 2017 6:46 PM
>To: ovs-dev@openvswitch.org
>Subject: [ovs-dev] Ports down on OVS DPDK and traffic fails
>
>Summary: I am trying to bring up two tap devices on a ovs-dpdk switch, but
>both ports are down and traffic doesn't work, even after adding a VM. Would
>appreciate some help.
>
>Compiled OVS 2.7.0 (from the tar ball) with DPDK 17.02 as follows:
>./configure --with-dpdk=$DPDK_BUILD CFLAGS="-g -O2 -msse4.2"
>make
>make install
>
>During the bring up, I initialized as follows:
>ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true Created
>the bridge and ports with these commands:
>ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev ip tuntap add
>mode tap vi1 ip tuntap add mode tap vi2 ip addr set ... dev v1 # same for vi2
>ovs-vsctl add-port br0 vi1 -- set Interface vi1 type=dpdkvhostuser #same for
>vi2

The summary above mentions tap ports, but here the interface type is 
'dpdkvhostuser'. 
It's recommended to use vhost-user ports with OvS-DPDK; information on 
vhost-user ports can be found here: 
http://docs.openvswitch.org/en/latest/topics/dpdk/vhost-user/
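For reference, a minimal vhost-user setup along the lines of the linked
documentation would look roughly like this (port names are illustrative; the
guest side still needs a matching virtio device):

```shell
# Create a userspace-datapath bridge and two vhost-user ports.
ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
ovs-vsctl add-port br0 vhost-user1 -- set Interface vhost-user1 type=dpdkvhostuser
ovs-vsctl add-port br0 vhost-user2 -- set Interface vhost-user2 type=dpdkvhostuser
```

Note that kernel tap devices created with `ip tuntap` are not DPDK ports; a
dpdkvhostuser interface is backed by a socket that a VM (e.g. QEMU) connects
to, which is why link state stays down until a guest attaches.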

- Bhanuprakash. 

>
>But both ports show as NO-CARRIER even after 'ip addr .. up':
># ip addr show vi1
>22: vi1:  mtu 1500 qdisc noqueue
>state DOWN qlen 500
>link/ether 3a:19:09:52:14:50 brd ff:ff:ff:ff:ff:ff
>inet 200.1.1.1/24 scope global vi1
>   valid_lft forever preferred_lft forever
>
>I can ping the IP addresses of vi1 and vi2, but 'ping 200.1.1.1 -I vi2' fails. 
>ARP
>doesn't get resolved, wireshark (tshark) shows no traffic on the switch, and
>'ovs-ofctl dump-flows' shows that the stats are not going up.
>
>First, I would expect at least the ARP packets to go through and the stats to 
>go
>up (even though the ping won't succeed because there is nothing on the
>second switch port vi2 to respond.) Why is that not working?
>
>Second,  I then brought up a VM connecting to vi1. But, vi1 still shows as NO-
>CARRIER, and traffic from VM doesn't work. What could be going wrong here?
>(I can add more details as needed.)
>
>Thanks,
>Sundar
>
>
>Regards,
>Sundar
>


Re: [ovs-dev] [PATCH 0/7] Add OVS DPDK keep-alive functionality

2017-04-03 Thread Bodireddy, Bhanuprakash
Thanks Aaron for reviewing this patch series. My comments are inline.

>
>Hi Bhanu,
>
>Bhanuprakash Bodireddy  writes:
>
>> This patch is aimed at achieving Fastpath Service Assurance in
>> OVS-DPDK deployments. This commit adds support for monitoring the
>> packet processing cores(pmd thread cores) by dispatching heartbeats at
>> regular intervals. Incase of heartbeat miss the failure shall be
>> detected & reported to higher level fault management
>systems/frameworks.
>>
>> The implementation uses POSIX shared memory object for storing the
>> events that will be read by monitoring framework. keep-alive feature
>> can be enabled through below OVSDB settings.
>>
>> keepalive=true
>>- Keepalive feature is disabled by default
>>
>> keepalive-interval="50"
>>- Timer interval in milliseconds for monitoring the packet
>>  processing cores.
>>
>> keepalive-shm-name="/dpdk_keepalive_shm_name"
>>- Shared memory block name where the events shall be updated.
>>
>> When KA is enabled, 'ovs-keepalive' thread shall be spawned that wakes
>> up at regular intervals to update the timestamp and status of pmd
>> cores in shared memory region.
>>
>> An external monitoring framework like collectd(with dpdk plugin
>> support) can read the status updates from shared memory. On a missing
>> heartbeat, the collectd shall relay the status to ceilometer service
>> running in the controller. Below is the high level overview of deployment
>model.
>
>Given this runs in-line in the fastpath, can you tell me what kind of impact 
>it is
>to throughput (and possibly latency) that this series implies?

This feature has very minimal impact on the throughput.
As you know, there is a significant drop on the current master due to
commit daf4d3c18da4 ("odp: Support conntrack orig tuple key.").
I folded in the RFC patch by Daniele that fixes the memset, measured the
throughput with and without the KA feature, and see no significant difference
in throughput.

Agreed that latency is indeed important. I will collect the latency stats and
share the results in this thread.

- Bhanuprakash.



Re: [ovs-dev] [PATCH 1/7] dpdk: Add helper functions for DPDK keepalive.

2017-04-03 Thread Bodireddy, Bhanuprakash
>-Original Message-
>From: Aaron Conole [mailto:acon...@redhat.com]
>Sent: Monday, April 3, 2017 1:58 AM
>To: Bodireddy, Bhanuprakash 
>Cc: d...@openvswitch.org
>Subject: Re: [ovs-dev] [PATCH 1/7] dpdk: Add helper functions for DPDK
>keepalive.
>
>Bhanuprakash Bodireddy  writes:
>
>> Introduce helper functions in 'dpdk' module that are needed for
>> keepalive functionality. Also add dummy functions in 'dpdk-stub'
>> module that are needed when DPDK is not available.
>>
>> Signed-off-by: Bhanuprakash Bodireddy
>> 
>> ---
>
>I think it's better to add helpers at the time they are first called.
>That means that there's no dead code at any point in the build, and it
>becomes obvious why the function is added.

Completely agree. I split my earlier RFC patch into smaller patches
and rebased them with master, and later realized they were out of order.
I will handle this appropriately in the next version.

-Bhanuprakash.


Re: [ovs-dev] [PATCH 3/7] netdev-dpdk: Add support for keepalive functionality.

2017-04-03 Thread Bodireddy, Bhanuprakash
>>
>> +/*
>> + * OVS Shared Memory structure
>> + *
>> + * The information in the shared memory block will be read by collectd.
>> + * */
>> +struct dpdk_keepalive_shm {
>> +/* IPC semaphore. Posted when a core dies */
>> +sem_t core_died;
>> +
>> +/*
>> + * Relayed status of each core.
>> + * UNUSED[0], ALIVE[1], DEAD[2], GONE[3], MISSING[4], DOZING[5],
>SLEEP[6]
>> + **/
>> +enum rte_keepalive_state core_state[RTE_KEEPALIVE_MAXCORES];
>
>What is 'DOZING'?  What is 'MISSING'?  Where is a definition of these states
>and what they mean?  What is DEAD&GONE?

State 'DOZING' means the core is going idle,
   'MISSING' indicates the first heartbeat miss,
   'DEAD' indicates two heartbeat misses,
   'GONE' means the core missed three or more heartbeats and is completely 
'buried'.

Note that these states are defined in DPDK Keepalive library. [rte_keepalive.h]
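For reference, the state set described above can be sketched as follows. The
authoritative definition is DPDK's enum rte_keepalive_state in rte_keepalive.h;
this local copy is illustrative only, with numeric values following the
UNUSED[0] ... SLEEP[6] mapping quoted in the patch.

```c
/* Illustrative mirror of DPDK's rte_keepalive core states. */
enum ka_state {
    KA_STATE_UNUSED = 0,   /* core is not being monitored */
    KA_STATE_ALIVE = 1,    /* heartbeat seen within the interval */
    KA_STATE_DEAD = 2,     /* two consecutive heartbeats missed */
    KA_STATE_GONE = 3,     /* three or more misses; core presumed lost */
    KA_STATE_MISSING = 4,  /* first heartbeat miss */
    KA_STATE_DOZING = 5,   /* core going idle */
    KA_STATE_SLEEP = 6     /* core asleep */
};
```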

>
>> +/* Last seen timestamp of the core */
>> +uint64_t core_last_seen_times[RTE_KEEPALIVE_MAXCORES];
>> +
>> +/* Store pmd thread tid */
>> +pid_t thread_id[RTE_KEEPALIVE_MAXCORES];
>> +};
>> +
>> +static struct dpdk_keepalive_shm *ka_shm;
>>  static int netdev_dpdk_class_init(void);  static int
>> netdev_dpdk_vhost_class_init(void);
>>
>> @@ -586,6 +613,202 @@ netdev_dpdk_mempool_configure(struct
>netdev_dpdk *dev)
>>  return 0;
>>  }
>>
>> +void
>> +dpdk_ka_get_tid(unsigned core_idx)
>> +{
>> +uint32_t tid = rte_sys_gettid();
>> +
>> +if (dpdk_is_ka_enabled() && ka_shm) {
>> +ka_shm->thread_id[core_idx] = tid;
>> +}
>> +}
>> +
>> +/* Callback function invoked on heartbeat miss.  Verify if it is
>> +genuine
>> + * heartbeat miss or a false positive and log the message accordingly.
>> + */
>> +static void
>> +dpdk_failcore_cb(void *ptr_data, const int core_id) {
>> +struct dpdk_keepalive_shm *ka_shm = (struct dpdk_keepalive_shm
>> +*)ptr_data;
>> +
>> +if (ka_shm) {
>> +int pstate;
>> +uint32_t tid = ka_shm->thread_id[core_id];
>> +int err = get_process_status(tid, &pstate);
>> +
>> +if (!err) {
>> +switch (pstate) {
>> +
>> +case ACTIVE_STATE:
>> +VLOG_INFO_RL(&rl,"False positive, pmd tid[%"PRIu32"] 
>> alive\n",
>> +  tid);
>
>Can we get false positives?  Doesn't that diminish the usefulness?

You made a good point! Thanks for asking this.
False positives can happen in a few scenarios and depend on the load and port 
distribution among PMD threads. This was also observed when the timer intervals 
aren't appropriately tuned in the OvS and collectd services.

For example, if a single PMD thread is handling multiple PHY and vhostuser 
ports, it can have a long list of ports to poll and at times can miss 
successive heartbeats. This happens because the PMD thread is spending more 
time processing packets and hasn't marked itself alive for a while. If the 
timer intervals are aggressive, the chances of missing heartbeats are higher. 
This led me to add helper functions that check whether the PMD thread is 
active and log accordingly.

This doesn't diminish the usefulness of the feature, as it all boils down to 
setting sensible timeout values in the ovsdb and collectd services. I would 
guess millisecond granularity isn't needed by every user, and the case I 
explained above applies more to customers looking at <=5ms granularity.
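For reference, the tuning discussed above maps onto the ovsdb options listed in
the cover letter; a conservative configuration might look like this (the
interval value is illustrative, not a recommendation):

```shell
# Enable keepalive and use a conservative interval (in ms) to avoid
# false positives on heavily loaded PMD threads.
ovs-vsctl set Open_vSwitch . other_config:keepalive=true
ovs-vsctl set Open_vSwitch . other_config:keepalive-interval="250"
ovs-vsctl set Open_vSwitch . other_config:keepalive-shm-name="/dpdk_keepalive_shm_name"
```

The collectd-side polling interval would then be tuned to a multiple of this
value so that a busy PMD has several chances to post a heartbeat before being
reported.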

>
>> +break;
>> +case STOPPED_STATE:
>> +case TRACED_STATE:
>> +case DEFUNC_STATE:
>> +case UNINTERRUPTIBLE_SLEEP_STATE:
>> +VLOG_WARN_RL(&rl,
>> +"PMD tid[%"PRIu32"] on core[%d] is unresponsive\n",
>> +tid, core_id);
>> +break;
>> +default:
>> +VLOG_DBG("%s: The process state: %d\n", __FUNCTION__,
>pstate);
>> +OVS_NOT_REACHED();
>> +}
>> +}
>> +}
>> +}
>> +
>> +/* Notify the external monitoring application for change in core state.
>> + *
>> + * On a consecutive heartbeat miss the core is considered dead and
>> +the status
>> + * is relayed to monitoring framework by unlocking the semaphore.
>> + */
>> +static void
>> +dpdk_ka_relay_core_state(void *ptr_data, const int core_id,
>> +   const enum rte_keepalive_state core_state, uint64_t
>> +last_alive) {
>> +struct dpdk_keepalive_shm *ka_shm = (struct dpdk_keepalive_shm
>*)ptr_data;
>> +int count;
>> +
>> +if (!ka_shm) {
>> +VLOG_ERR("KeepAlive: Invalid shared memory block\n");
>> +return;
>> +}
>> +
>> +VLOG_DBG_RL(&rl,
>> +   "TS(%lu):CORE%d, old state:%d, current_state:%d\n",
>> +   (unsigned 
>> long)time(NULL),core_id,ka_shm->core_state[core_id],
>> +   core_state);
>> +
>> +switch (core_state) {
>> +case RTE_KA_STATE_ALIVE:
>> +case RTE_KA_STATE_MISSING:
>> +

Re: [ovs-dev] [PATCH 4/7] process: Retrieve process status.

2017-04-03 Thread Bodireddy, Bhanuprakash
>-Original Message-
>From: Aaron Conole [mailto:acon...@redhat.com]
>Sent: Monday, April 3, 2017 2:00 AM
>To: Bodireddy, Bhanuprakash 
>Cc: d...@openvswitch.org
>Subject: Re: [ovs-dev] [PATCH 4/7] process: Retrieve process status.
>
>Bhanuprakash Bodireddy  writes:
>
>> Implement function to retrieve the process status. This will be used
>> by Keepalive monitoring thread for detecting false alarms.
>>
>> Signed-off-by: Bhanuprakash Bodireddy
>> 
>> ---
>>  lib/process.c | 60
>>
>+++
>>  lib/process.h | 10 ++
>>  2 files changed, 70 insertions(+)
>>
>> diff --git a/lib/process.c b/lib/process.c index e9d0ba9..e0601dd
>> 100644
>> --- a/lib/process.c
>> +++ b/lib/process.c
>> @@ -50,6 +50,20 @@ struct process {
>>  int status;
>>  };
>>
>> +struct pstate2Num {
>> +char *tidState;
>> +int num;
>> +};
>> +
>> +const struct pstate2Num pstate_map[] = {
>> +{ "S", STOPPED_STATE },
>> +{ "R", ACTIVE_STATE },
>> +{ "t", TRACED_STATE },
>> +{ "Z", DEFUNC_STATE },
>> +{ "D", UNINTERRUPTIBLE_SLEEP_STATE },
>> +{ "NULL", UNUSED_STATE },
>> +};
>> +
>>  /* Pipe used to signal child termination. */  static int fds[2];
>>
>> @@ -390,6 +404,52 @@ process_run(void)  #endif  }
>>
>> +int
>> +get_process_status(int tid, int *pstate) { #ifndef _WIN32
>
>The following is Linux specific.  Please use an '#if LINUX' - there are 
>examples
>in the code for this.

That's right. I will do this in v2.

- Bhanuprakash


Re: [ovs-dev] [PATCH 5/7] utils: Introduce xusleep for subsecond granularity.

2017-04-03 Thread Bodireddy, Bhanuprakash
>Bhanuprakash Bodireddy  writes:
>
>> This will be used by KA framework that needs millisecond granularity.
>>
>> Signed-off-by: Bhanuprakash Bodireddy
>> 
>> ---
>
>Without this patch, builds starting at 3/7 will fail.  That's pretty bad.  
>Please see
>my earlier comment about including helpers when they are used.

As mentioned earlier, it was my mistake and I will fix this in v2.

>
>>  lib/util.c | 12 
>>  lib/util.h |  1 +
>>  2 files changed, 13 insertions(+)
>>
>> diff --git a/lib/util.c b/lib/util.c
>> index 1c06ce0..889ebd8 100644
>> --- a/lib/util.c
>> +++ b/lib/util.c
>> @@ -2125,6 +2125,18 @@ xsleep(unsigned int seconds)
>>  ovsrcu_quiesce_end();
>>  }
>>
>> +void
>> +xusleep(unsigned int microseconds)
>> +{
>> +ovsrcu_quiesce_start();
>> +#ifdef _WIN32
>> +Sleep(microseconds/1000);
>> +#else
>> +usleep(microseconds);
>> +#endif
>> +ovsrcu_quiesce_end();
>> +}
>> +
>
>Wow!  This is deceptive.  If I call this with microseconds argument as, say, 
>999
>there's a *strong* chance this will NOT sleep for at least that amount of time.
>This function needs a different implementation or just keep it non-windows.

This is indeed deceptive :).
I did spend some time investigating whether there is an equivalent usleep() 
implementation on Windows and came across QueryPerformanceCounter and user-mode 
scheduling. I would rather stick to the non-Windows implementation and let the 
Windows experts add their equivalent implementation here.
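To make the truncation concrete: Win32 Sleep() takes milliseconds, so
Sleep(microseconds / 1000) turns any request below 1000 us into Sleep(0). A
hypothetical fix (not part of the patch) is to round up, preserving the
"sleep at least this long" contract; ceil_ms() below is an illustrative helper.

```c
/* Integer division truncates: 999 / 1000 == 0, so Sleep(0) would be
 * issued for a 999 us request.  Rounding up keeps the sleep duration
 * at or above the request.  Callers are assumed to pass values well
 * below UINT_MAX so the addition cannot overflow. */
static unsigned int
ceil_ms(unsigned int microseconds)
{
    return (microseconds + 999) / 1000;
}
```

Even with rounding, Sleep() only guarantees millisecond granularity, so a
sub-millisecond KA interval still cannot be honored on Windows with this
approach.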

- Bhanuprakash.


Re: [ovs-dev] [PATCH 7/7] Documentation: Update DPDK doc with Keepalive feature.

2017-04-03 Thread Bodireddy, Bhanuprakash
>-Original Message-
>From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
>boun...@openvswitch.org] On Behalf Of Stephen Finucane
>Sent: Monday, April 3, 2017 11:15 AM
>To: ovs-dev@openvswitch.org
>Subject: Re: [ovs-dev] [PATCH 7/7] Documentation: Update DPDK doc with
>Keepalive feature.
>
>On Sat, 2017-04-01 at 20:02 +0100, Bhanuprakash Bodireddy wrote:
>> Signed-off-by: Bhanuprakash Bodireddy
>> 
>
>Couple of changes suggested below if there's a future v2.
>
Thanks for reviewing the documentation. I will make the edits as suggested 
here when I post v2.

- Bhanuprakash.


Re: [ovs-dev] [PATCH 3/7] netdev-dpdk: Add support for keepalive functionality.

2017-04-03 Thread Bodireddy, Bhanuprakash
>>>
>>>This whole mechanism seems very error prone.  Is it possible to hang a
>>>thread with the subsequent sem_post?
>>
>> The relay function is called by 'ovs_keepalive' thread. I didn't
>> completely understand your concerns here. I would be happy to verify
>> if you have any scenarios in mind that you think pose problems.
>
>My concern stems from this:
>   "is relayed to monitoring framework by unlocking the semaphore."
>
>Assume the OvS vswitchd crashes while another process is pended on this
>locked semaphore.  Since you do a shm_unlink, I think the pended process
>waiting on that semaphore will be stuck.  Even worse, since we do have a
>process left with a handle, the name will be unlinked, and the memory and
>reading process (which may be ceilometer, but may be something else) will be
>left orphaned waiting for an update so that they can destroy it.
>
>There's a lot of coordination that needs to be added here, and it's likely not
>portable.  I'm not even sure if there are examples where it has been shown to
>work 100% on specific systems.
>
>A much better design is to use a socket to share state.  Since it is tied to 
>the
>lifetime of a process, and cleanup is guaranteed by POSIX (and all sane OSes,
>anyway) you can rely on it.  It is a much better mechanism for signaling
>changes, and a message based interface means you can make this dump
>information to _any_ interested subscriber - even those who are off system if
>you choose.
>

Thanks Aaron for clarifying this, and I appreciate all your feedback on this 
patch series.
I shall try out the scenarios leading to unexpected crashes, see how well the 
applications recover on a restart, and check for any further shortcomings.

- Bhanuprakash.


Re: [ovs-dev] [PATCH 1/5] dpif-netdev: Skip EMC lookup when EMC is disabled.

2017-04-13 Thread Bodireddy, Bhanuprakash
>On 04/13/2017 07:11 PM, Kevin Traynor wrote:
>> On 03/12/2017 05:33 PM, Bhanuprakash Bodireddy wrote:
>>> Conditional EMC insert patch gives the flexibility to configure the
>>> probability of flow insertion in to EMC. This also allows an option
>>> to entirely disable EMC by setting 'emc-insert-inv-prob=0' which can
>>> be useful at large number of parallel flows.
>>>
>>> This patch skips EMC lookup when EMC is disabled. This is useful to
>>> avoid wasting CPU cycles and also improve performance considerably.
>>>
>>
>> LGTM. How much does this improve performance?

I found a significant performance improvement when testing with a few hundred 
streams. I remember the improvement was ~800kpps with smaller packets. This is 
because emc_lookup() invokes an expensive memcmp() to compare against the 
netdev_flow_key stored in the EMC, which takes up significant cycles. The 
longer the key, the worse the performance.
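The optimization under review can be sketched as below. This is a simplified,
self-contained toy (names like `toy_dfc_lookup` and the zero-valued
"insertion probability disabled" flag are stand-ins for the dpif-netdev code):
when emc-insert-inv-prob=0 the EMC can never be populated, so the
memcmp-heavy probe is skipped entirely.

```c
#include <stdint.h>

static int emc_lookups;          /* counts how often the expensive path runs */

static const void *
toy_emc_lookup(uint32_t hash)
{
    emc_lookups++;               /* stands in for the memcmp-heavy probe */
    (void) hash;
    return 0;                    /* always a miss in this toy */
}

/* emc_insert_min == 0 encodes emc-insert-inv-prob=0, i.e. EMC disabled. */
static const void *
toy_dfc_lookup(uint32_t emc_insert_min, uint32_t hash)
{
    const void *flow = 0;

    if (emc_insert_min) {        /* only probe the EMC when it is in use */
        flow = toy_emc_lookup(hash);
    }
    /* ... fall back to the dpcls lookup in the real code ... */
    return flow;
}
```

The per-packet saving is exactly the cost of the skipped key comparison, which
is why the gain grows with key length and flow count.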

>Ack for the series,
>Acked-by: Kevin Traynor 

Thanks kevin for the review and Acks.

Regards,
Bhanuprakash. 


Re: [ovs-dev] [PATCH v3] netdev-dpdk: Implement Tx intermediate queue for dpdk ports.

2017-04-14 Thread Bodireddy, Bhanuprakash
The latency stats published in
v3 (https://mail.openvswitch.org/pipermail/ovs-dev/2017-February/328363.html)
seem to be erroneous due to the way the RFC2544 test was configured in IXIA.
Please find the updated latency stats below.
Only case 1 and case 2 stats are published, where the burst size is 32.

BTW, while calculating latency, the stats parameter was set to 'Cut Through',
meaning latency is calculated as first bit in, first bit out. The acceptable
frame loss % is set to 1%. Note that the results below are aggregated over
approximately 9 iterations.

Benchmarks were done on the same commit
(83ede47a48eb92053f66815e462e94a39d8a1f2c) as v3.

Case 1: Matching IP flow rules for each IXIA stream
###
All values in nanoseconds, shown as Master / Patch per packet size:

Packet size       64                128               256               512
Min           25360 / 199000    30260 / 208890    23490 / 131320    19620 / 118700
Max          854260 / 577600   868680 / 302440   197420 / 195090   160930 / 184740
Avg          384182 / 261213   412612 / 262091   190386 / 166025   133661 / 154787

Packet size      1024              1280              1518
Min           20290 / 180650    30370 / 157260    19680 / 147550
Max          304290 / 239750   178570 / 216650   199140 / 209050
Avg          260350 / 209316   149328 / 185930   170091 / 177033


Case 2: ovs-ofctl add-flow br0 in_port=1,action=output:2
###
Packet size       64                128               256               512
Min           27870 / 30680     13080 / 29160     12000 / 18970     14520 / 14610
Max          323790 / 205930   282360 / 289470   39170 / 51610     48340 / 80670
Avg          162219 / 163582   40685 / 41677     21582 / 41546     35017 / 66192

Packet size      1024              1280              1518
Min           10820 / 29670     11270 / 24740     11510 / 24780
Max           29480 / 70300     29900 / 39010     32460 / 40010
Avg           18926 / 54582     19239 / 30636     19087 / 16722

Regards,
Bhanuprakash. 

>-Original Message-
>From: Bodireddy, Bhanuprakash
>Sent: Thursday, February 2, 2017 10:15 PM
>To: d...@openvswitch.org
>Cc: i.maxim...@samsung.com; ktray...@redhat.com; diproiet...@ovn.org;
>Bodireddy, Bhanuprakash ; Fischetti,
>Antonio ; Markus Magnusson
>
>Subject: [PATCH v3] netdev-dpdk: Implement Tx intermediate queue for
>dpdk ports.
>
>After packet classification, packets are queued in to batches depending on the
>matching netdev flow. Thereafter each batch is processed to execute the
>related actions. This becomes particularly inefficient if there are few packets
>in each batch as rte_eth_tx_burst() incurs expensive MMIO writes.
>
>This commit adds back intermediate queue implementation. Packets are
>queued and burst when the packet count exceeds threshold. Also drain logic
>is refactored to handle packets hanging in the tx queues. Testing shows
>significant performance gains with this implementation.
>
>Fixes: b59cc14e032d("netdev-dpdk: Use instant sending instead of queueing
>of packets.")
>Signed-off-by: Bhanuprakash Bodireddy
>>
>Signed-off-by: Antonio Fischetti 
>Co-authored-by: Antonio Fischetti 
>Signed-off-by: Markus Magnusson 
>Co-authored-by: Markus Magnusson 
>---
>v2->v3
>  * Refactor the code
>  * Use thread local copy 'send_port_cache' instead of 'tx_port' while draining
>  * Invoke dp_netdev_drain_txq_port() to drain the packets from the queue
>as
>part of pmd reconfiguration that gets triggered due to port
>addition/deletion
>or change in pmd-cpu-mask.
>  * Invoke netdev_txq_drain() from xps_get_tx_qid() to drain packets in old
>queue. This is possible in XPS case where the tx queue can change after
>timeout.
>  * Fix another bug in netdev_dpdk_eth_tx_burst() w.r.t 'txq->count'.
>
>Latency stats:
>Collected the latency stats with PHY2PHY loopback case using 30 IXIA streams
>/UDP packets/uni direction traffic. All the stats are in nanoseconds. Results
>below compare latency results between Master vs patch.
>
>case 1: Matching IP flow rules for each IXIA stream
>Eg: 

Re: [ovs-dev] [PATCH 4/5] dpif-netdev: Fix comments for dp_netdev_pmd_thread struct.

2017-04-18 Thread Bodireddy, Bhanuprakash
>On Sun, Mar 12, 2017 at 05:33:27PM +, Bhanuprakash Bodireddy wrote:
>> The sorted subtable ranking patch introduced a classifier instance per
>> ingress port with its subtables ranked on the frequency of hits. The
>> pmd thread can have more classifier instances now and solely depends
>> on the number of ingress ports currently handled by the pmd thread.
>>
>> Signed-off-by: Bhanuprakash Bodireddy
>> 
>
>Thank you for improving comments!
>
>I'm not the right person to review this patch, but from a process perspective I
>find myself wondering whether it corrects a comment that was already wrong
>before the beginning of the series (in which case it is fine as-is) or whether 
>it
>only needs correction following the series (in which case it should be folded
>into whichever patch made it incorrect).
>
>Thanks again!

Hi Ben,
Please note that the comments were right initially, but after the "subtable 
ranking" feature was introduced they needed correction. The subtable ranking 
patch series was merged a while ago, but this particular comment wasn't fixed 
then. I happened to find this during code inspection. What's the best way to 
handle it now?

- Bhanuprakash.

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v3] netdev-dpdk: Implement Tx intermediate queue for dpdk ports.

2017-04-18 Thread Bodireddy, Bhanuprakash
Hi Eelco,

Please find my comments inline. 

>
>Hi Bhanuprakash,
>
>I was doing some Physical to Virtual tests, and whenever the number of flows
>reaches the rx batch size performance dropped a lot. I created an
>experimental patch where I added an intermediate queue and flush it at the
>end of the rx batch.
>
>When I found your patch I decided to give it try to see how it behaves.
>I also modified your patch in such a way that it will flush the queue after
>every call to dp_netdev_process_rxq_port().

I presume you were doing something like below in the pmd_thread_main receive loop?

for (i = 0; i < poll_cnt; i++) {
    dp_netdev_process_rxq_port(pmd, poll_list[i].rx,
                               poll_list[i].port_no);
    dp_netdev_drain_txq_ports(pmd);
}

>
>Here are some pkt forwarding stats for the Physical to Physical scenario, for
>two 82599ES 10G port with 64 byte packets being send at wire speed:
>
>Number      plain                 patch +
>of flows    git clone    patch    flush
>========    =========  =========  =========
>      10     10727283   13527752   13393844
>      32      7042253   11285572   11228799
>      50      7515491    9642650    9607791
>     100      5838699    9461239    9430730
>     500      5285066    7859123    7845807
>    1000      5226477    7146404    7135601

Thanks for sharing the numbers, I do agree with your findings and I saw very 
similar results with our v3 patch.
In any case we see significant throughput improvement with the patch.

>
>
>I do not have an IXIA to do the latency tests you performed, however I do
>have a XENA tester which has a basic latency measurement feature.
>I used the following script to get the latency numbers:
>
>https://github.com/chaudron/XenaPythonLib/blob/latency/examples/latency.py

Thanks for pointing this out; it could be useful for users with no IXIA setup.

>
>
>As you can see in the numbers below, the default queue introduces quite
>some latency, however doing the flush every rx batch brings the latency down
>to almost the original values. The results mimics your test case 2, sending 10G
>traffic @ wire speed:
>
>   = GIT CLONE
>   Pkt size  min(ns)  avg(ns)  max(ns)
>       512     4,631    5,022  309,914
>      1024     5,545    5,749  104,294
>      1280     5,978    6,159   45,306
>      1518     6,419    6,774  946,850
>
>   = PATCH
>   Pkt size  min(ns)  avg(ns)    max(ns)
>       512     4,928  492,228  1,995,026
>      1024     5,761  499,206  2,006,628
>      1280     6,186  497,975  1,986,175
>      1518     6,579  494,434  2,005,947
>
>   = PATCH + FLUSH
>   Pkt size  min(ns)  avg(ns)  max(ns)
>       512     4,711    5,064  182,477
>      1024     5,601    5,888  701,654
>      1280     6,018    6,491  533,037
>      1518     6,467    6,734  312,471

The latency numbers above are very encouraging indeed. However, with RFC2544 
tests, especially on IXIA, we have a lot of parameters to tune.
I see that the latency stats fluctuate a lot with changes in the acceptable 
'Frame Loss'. I am no IXIA expert myself, but I am trying to figure out 
acceptable settings while measuring latency/throughput.

>
>Maybe it will be good to re-run your latency tests with the flush for every rx
>batch. This might get rid of your huge latency while still increasing the
>performance in the case the rx batch shares the same egress port.
>
>The overall patchset looks fine to me, see some comments inline.
Thanks for reviewing the patch.

>>
>> +#define MAX_LOOP_TO_DRAIN 128
>Is defining this inline ok?
I see that this convention is used in ovs. 

>>   NULL,
>>   NULL,
>>   netdev_dpdk_vhost_reconfigure,
>> -netdev_dpdk_vhost_rxq_recv);
>> +netdev_dpdk_vhost_rxq_recv,
>> +NULL);
>We need this patch even more in the vhost case as there is an even bigger
>drop in performance when we exceed the rx batch size. I measured around
>40%, when reducing the rx batch size to 4, and using 1 vs 5 flows (single PMD).

Completely agree. In fact, we did a quick patch doing batching for vhost ports 
as well and found significant performance improvement (though it's not 
thoroughly tested for all corner cases).
We have that in our backlog and will try posting that patch, as an RFC at 
least, to get feedback from the community.

-Bhanuprakash. 


Re: [ovs-dev] [PATCH 3/7] netdev-dpdk: Add support for keepalive functionality.

2017-04-24 Thread Bodireddy, Bhanuprakash
>>> The relay function is called by 'ovs_keepalive' thread. I didn't
>>> completely understand your concerns here. I would be happy to verify
>>> if you have any scenarios in mind that you think pose problems.
>>
>>My concern stems from this:
>>   "is relayed to monitoring framework by unlocking the semaphore."
>>
>>Assume the OvS vswitchd crashes while another process is pended on this
>>locked semaphore.  Since you do a shm_unlink, I think the pended
>>process waiting on that semaphore will be stuck.  Even worse, since we
>>do have a process left with a handle, the name will be unlinked, and
>>the memory and reading process (which may be ceilometer, but may be
>>something else) will be left orphaned waiting for an update so that they can
>destroy it.
>>
>>There's a lot of coordination that needs to be added here, and it's
>>likely not portable.  I'm not even sure if there are examples where it
>>has been shown to work 100% on specific systems.
>>
>>A much better design is to use a socket to share state.  Since it is
>>tied to the lifetime of a process, and cleanup is guaranteed by POSIX
>>(and all sane OSes,
>>anyway) you can rely on it.  It is a much better mechanism for
>>signaling changes, and a message based interface means you can make
>>this dump information to _any_ interested subscriber - even those who
>>are off system if you choose.
>>

Hello Aaron,

I got a chance to talk to the collectd team and understood that the 
'dpdkevents' plugin implementation changed a bit, which I wasn't aware of when 
I posted v1 of this patch. In the new implementation a semaphore isn't needed. 
How it works is:
- The DPDK primary process (OvS-DPDK in our case) initializes the shared 
memory during startup and keeps writing the events.
- The dpdkevents plugin of collectd reads the events from the SHM block and, 
in case of a change in 'core state', notifies ceilometer.
- When OvS crashes, it is automatically restarted (ovs-vswitchd --monitor) 
and, as part of initialization, the shared memory is reinitialized 
(flags=O_CREAT | O_TRUNC | O_RDWR, 0666).
- The dpdkevents plugin calls "dpdk_events_read()" at a predefined interval, 
opens the SHM block and gets the file descriptor. It then checks whether the 
'inode' has changed. If the inode is unchanged, OvS is intact and healthy. If 
the inode has changed (OvS has restarted), the plugin immediately unmaps the 
memory, closes the stale file descriptor, memory-maps the shared memory using 
the new file descriptor and reads the updated events.

This way we can avoid the semaphore and also make sure collectd detects 
primary-application crashes. I tested this and found it to be working in 
different scenarios. Please note that I was sending SIGABRT to OvS and had 
started OvS with the 'ovs-vswitchd --monitor' option to restart it 
automatically.

Sorry for taking so long to reply; some of the above-mentioned changes in 
collectd are still WIP and in the process of being upstreamed.

Regards,
Bhanuprakash. 

>
>Thanks Aaron for clarifying this and appreciate all your feedback on this patch
>series.
>I shall try out the scenarios leading to unexpected crashes and see how well
>the applications recover on a restart and check for any further shortcomings.
>
>- Bhanuprakash.


Re: [ovs-dev] [PATCH 0/7] Add OVS DPDK keep-alive functionality

2017-04-24 Thread Bodireddy, Bhanuprakash
>> > Agree that Latency is indeed important. I would collect the latency
>> > stats and will share the results in this thread.
>>
>> Awesome.  Even better would be to put those informations in the cover
>> letter of v2.  Just a short summary of the tests and what the results
>> were for latency / throughput would be good.
>
>I think it'd be even better to include measurements in one of the commit
>messages, because those are available in the repository after the patches are
>applied.  It's harder to find cover letters because they're only on the mailing
>list.

That's a good point, I will do this when I send on the v2 for this patch 
series. 
Also we will follow this for our other patches targeting performance!

- Bhanuprakash.


Re: [ovs-dev] [PATCH v2 6/6] Documentation: Update DPDK doc with Keepalive feature.

2017-04-27 Thread Bodireddy, Bhanuprakash
>>
>>    Compute Node          Controller          Compute Node
>>
>>    Collectd  <-------->  Ceilometer  <---->  Collectd
>>
>>    OvS DPDK                                  OvS DPDK
>>
>>       +-----+
>>       | VM  |
>>       +--+--+
>>          |
>>    +-----+----+    +-------------------+    +----------+
>>    |   OVS    |--->| dpdkevents plugin |--->| collectd |
>>    +----------+    +-------------------+    +-----+----+
>>                                                   |
>>    +------------+    +----------------------------+
>>    | Ceilometer |<---| collectd ceilometer plugin |
>>    +------------+    +----------------------------+
>
>You see all of this, right here ^^^ ? That's excellent. Put *that* in a doc. I
>would suggest 'Documentation/topics/dpdk/keepalive.rst'. You could include
>a reference to the below doc using something like the
>following:
>
>For information on how to use the keepalive feature, refer to
>:ref:`the HOWTO `.

I have plans to write a separate document with a detailed explanation of how 
this feature should be used with OpenStack, including how it is integrated 
with ceilometer. But I will do that document as a separate patch once this 
feature gets accepted.

>
>The only changes I'd make is to indent the diagram by four spaces and
>precede it with by '::' (to format as a literal block), and change the OVSDB
>settings overview piece to use definitions lists, which look like
>this:
>
>``keepalive=true``
>
>  Enable the keepalive feature. Defaults to false (disabled).
>
>This could be done as a separate patch unless you need to respin this series.

I will take this suggestion and do this if I have to send v3.

>
>> Performance impact:
>> No noticeable performance or latency impact is observed with KA
>> feature enabled. The tests were run with 100ms KA interval and latency
>> is (Min:134,710ns, Avg:173,005ns, Max:1,504,670ns) for Phy2Phy
>> loopback test case with 100 unique streams.
>>
>> Signed-off-by: Bhanuprakash Bodireddy
>> 
>> ---
>>  Documentation/howto/dpdk.rst | 93
>> 
>>  1 file changed, 93 insertions(+)
>>
>> diff --git a/Documentation/howto/dpdk.rst
>> b/Documentation/howto/dpdk.rst index dc63f7d..e482166 100644
>> --- a/Documentation/howto/dpdk.rst
>> +++ b/Documentation/howto/dpdk.rst
>> @@ -400,6 +400,99 @@ If ``N`` is set to 1, an insertion will be
>> performed for every flow. If set to
>>
>>  For more information on the EMC refer to :doc:`/intro/install/dpdk` .
>>
>> +.. _dpdk_keepalive:
>> +
>> +KeepAlive
>> +-
>> +
>> +OvS KeepAlive(KA) feature is disabled by default. To enable KA
>> feature::
>
>s/OvS/OVS/
>s/KeepAlive(KA)/KeepAlive (KA)/
>
Ok.

>> +
>> +$ ovs-vsctl --no-wait set Open_vSwitch .
>> other_config:keepalive=true
>> +
>> +The default timer interval for monitoring packet processing cores is
>> 100ms.
>> +To set a different timer value, run::
>> +
>> +$ ovs-vsctl --no-wait set Open_vSwitch . \
>> +other_config:keepalive-interval="50"
>> +
>> +The events comprise of core states and the last seen timestamps. The
>> events
>> +are written in to shared memory region
>> ``/dev/shm/dpdk_keepalive_shm_name``.
>> +To write in to a different shared memory region, run::
>> +
>> +$ ovs-vsctl --no-wait set Open_vSwitch . \
>> +other_config:keepalive-shm-name="/"
>> +
>
>nit: I assume the '/' before '' was a typo. Drop that, if
>so.
It's not a typo. It is expected.

>
>> +The events in the shared memory block can be read by external
>> monitoring
>> +framework (or) applications. `collectd `__
>> has built-in
>> +support for DPDK and provides a `dpdkevents` plugin that can be
>> enabled to
>> +relay the datapath core status to OpenStack service `Ceilometer
>> +`__.
>> +
>> +To install and configure `collectd`, run::
>> +
>> +# Clone collectd from Git repository
>> +$ git clone https://github.com/collectd/collectd.git
>> +
>> +# configure and install collectd
>> +$ cd collectd
>> +$ ./build.sh
>> +$ ./configure --enable-syslog --enable-logfile --with-
>> libdpdk=/usr
>> +$ make
>> +$ make install
>> +
>
>I should have called this out first time, but I'm not sure we want to duplicate
>collectd's installation procedure as these things can change.
>We might be better of linking to installation docs. _However_, we do this
>already for DPDK so there is precedent. If you think we should keep them
>(and you likely do), I'd be happy to simply include a link to the installation 
>docs
>like so:

While I agree with this, the collectd documentation is vast and users may get 
lost reading the installation instructions.
I would mention the basic steps here and point to the documentation as 
suggested, so that readers can refer to it for any advanced debugging.

>
>For further information on installing

Re: [ovs-dev] [PATCH v2 0/6] Add OVS DPDK keep-alive functionality

2017-04-28 Thread Bodireddy, Bhanuprakash
>> This patch is aimed at achieving Fastpath Service Assurance in
>> OVS-DPDK deployments. This commit adds support for monitoring the
>> packet processing cores(pmd thread cores) by dispatching heartbeats at
>> regular intervals. In case of a heartbeat miss the failure shall be
>> detected & reported to higher level fault management
>systems/frameworks.
>>
>> The implementation uses POSIX shared memory object for storing the
>> events that will be read by monitoring framework. keep-alive feature
>> can be enabled through below OVSDB settings.
>
Hi Aaron,

Thanks for the comments and also for adding Ben here. It would be nice to know 
his point of view on this design.

>I've been thinking about this design, and I'm concerned - shared memory is
>inflexible, and allows multiple actors to mess with the information.
>Is there a reason to use shared memory?  I am not sure of what advantage
>this form of reporting has vs. simply using a message passing interface.

This boils down to the shared memory vs message passing program models and 
which of them is the more elegant approach to sharing state between two 
processes. While I completely agree that sockets are good for sharing state 
and passing the information on to any interested subscriber (within or outside 
the system*), they also have some shortcomings.

For example, one of the design goals of this feature is to support sub-second 
detection of PMD thread stalls/lockups in carrier-grade NFV deployments. As 
you know, when we look at speed, shared memory is clearly the winner vs 
sockets. But the speed gain of SHM disappears once locks get deployed. In the 
KA design, POSIX shared memory is used as it can handle the needed 
granularities, and semaphores are intentionally avoided. Most importantly, 
there are only two actors (ovs_keepalive, collectd) here: the 'ovs_keepalive' 
thread periodically updates the shared memory with the core states and their 
last-seen timestamps, while the collectd events thread reads it and checks 
whether there is any change in core state.

>With messages there is clear abstraction, and the projects / processes are
>truly separated.  Shared memory leads to a situation of inextricably coupling
>two (or more) processes.

I completely agree to this. 

>
>As an example, if the constant changes, or a new statistic is desired to be
>tracked, the consumer which wants to use this data needs to be recompiled,
>and needs to have the *exact* correct version.  If the pad bits from the
>compiler change, if anything from the ovs side causes alignment to be shifted,
>if OvS wants to redefine the struct, if OvS uses any data from there as the
>rhs... the list of scenarios where this interface can fail goes on - and the
>failures are quite catastrophic.

While your concerns are genuine, the structure is well defined and carries the 
bare minimum of information that is absolutely necessary: the core states and 
last-seen timestamps of the respective cores are needed to know the health of 
the cores handling the datapath in OvS-DPDK.

We don't foresee any need in the near future to extend this structure and want 
to keep it simple, so that OvS-DPDK and collectd will work without worrying 
about version compatibility. On the alignment front, I see that the structure 
is 2048 bytes and no padding is added, but we can use the packed attribute to 
be sure.

>
>I think maybe a design doc of this interface would be good to read through, as
>it will explain why this design was chosen.  It also might allow for better
>feedback, putting a more generic solution (for example, any new threads that
>OvS spawns we might want to monitor, as well - and that would be good to
>report).  Do you agree?

This design only allows the datapath cores to be monitored, not independent 
threads, at this point. OvS does have the '--monitor' option, which monitors 
the application and restarts it automatically in case of a crash.

This design is aimed at monitoring the health of the datapath cores (PMD 
threads) that could impact the overall switching performance of the compute 
node. I am open to all the feedback and will answer the questions in this 
thread.

Adding Maryam, who is handling the collectd effort, to the thread.

Appreciate all your feedback and comments!

Regards,
Bhanuprakash. 


Re: [ovs-dev] [PATCH 4/5] dpif-netdev: Fix comments for dp_netdev_pmd_thread struct.

2017-04-28 Thread Bodireddy, Bhanuprakash
>
>
>How about something like this ?
>
>+ * Each struct has its own flow cache and classifier per managed inport.
>Packets
>+ *received from managed ports are looked up in the corresponding pmd
>+ * thread's flow cache and corresponding classifier, if the flow cache
>misses.
>+ * Packets are executed with the found actions in either case.
>

Sounds good, Darrell. I wanted to fix the earlier comment that reads 
"classifier per pmd thread", which can be confusing to users.
I will send a separate patch as Ben suggested, adding the 'Fixes' tag, and 
also add you to the 'Signed-off-by' list.

- Bhanuprakash.


Re: [ovs-dev] [PATCH v2 0/6] Add OVS DPDK keep-alive functionality

2017-05-01 Thread Bodireddy, Bhanuprakash
>> > This patch is aimed at achieving Fastpath Service Assurance in
>> > OVS-DPDK deployments. This commit adds support for monitoring the
>> > packet processing cores(pmd thread cores) by dispatching heartbeats
>> > at regular intervals. In case of a heartbeat miss the failure shall be
>> > detected & reported to higher level fault management
>systems/frameworks.
>> >
>> > The implementation uses POSIX shared memory object for storing the
>> > events that will be read by monitoring framework. keep-alive feature
>> > can be enabled through below OVSDB settings.
>>
>> I've been thinking about this design, and I'm concerned - shared
>> memory is inflexible, and allows multiple actors to mess with the
>information.
>> Is there a reason to use shared memory?  I am not sure of what
>> advantage this form of reporting has vs. simply using a message
>> passing interface.  With messages there is clear abstraction, and the
>> projects / processes are truly separated.  Shared memory is leads to a
>> situation of inextricably coupling two (or more) processes.
>
>Shared memory is great within a process, but it has drawbacks as an inter-
>process interface.  Bhanu, is there a reason that we need the same interface
>intra- and inter-process?  For example, OVS could have a dedicated thread
>that monitors the shared memory interface and, on failure, reports the
>problem over a Unix domain socket.

Ben,
In the original KA design the shared memory interface was meant for 
inter-process use. With OvS-DPDK, a keepalive thread is spawned that wakes up 
periodically to update the PMD core status in SHM. This SHM interface is 
monitored by collectd, which in case of an event notifies the OpenStack 
service Ceilometer.

I like what you suggested, wherein we can still have the SHM interface and 
additionally spawn a dedicated monitoring thread that watches the SHM 
interface and notifies the status (in case of events) to external monitoring 
frameworks (collectd) over Unix domain sockets.

As the consensus is to use Unix domain sockets, I will pass the feedback on to 
the collectd team for appropriate changes on their side.

Regards,
Bhanuprakash. 


Re: [ovs-dev] [PATCH v2 0/6] Add OVS DPDK keep-alive functionality

2017-05-01 Thread Bodireddy, Bhanuprakash
>There two ways to go with the design.
>
>1) Make it generic, so that it is not so PMD specific, as it is now.
>2) If it stays PMD specific, make it stronger; right now, the health check is
>limited – it detects that a PMD thread is proceeding or not.
>For something like DPDK, I don’t think that will be enough in the long run.
>This can result in some false negatives, as well.
>  Maybe, we want to know that the ports and queues are getting
>processed, PMD/port/queue mappings as expected, time spent processing
>packets per PMD, port state changes, packet stats, queue depths, etc This
>information could be correlated by the final receiver of the data.

Darrell,

Thanks for the feedback. I am implementing this for OvS-DPDK, so it would be 
PMD-specific at this point.
However, I completely agree with your suggestions here and will try extending 
the monitoring to factor in port and packet stats and other parameters, to 
make sure there aren't any false negatives.

>
>I also agree that socket communication is preferred over shm, although I don’t
>think any shm usage will necessarily lead to a meltdown.

As the consensus is to have Unix domain sockets over SHM,  I will make 
necessary changes.

Regards,
Bhanuprakash. 



Re: [ovs-dev] [RFC PATCH] netdev-dpdk: Add Tx intermediate queue for vhost ports.

2017-05-15 Thread Bodireddy, Bhanuprakash
Hi Eelco,

> > This commit adds the intermediate queue for vHost-user ports. It
> > improves the throughput in multiple virtual machines deployments and
> > also in cases with VM doing packet forwarding in kernel stack.
> >
> > This patch is aligned with intermediate queue implementation for dpdk
> > ports that can be found here: https://patchwork.ozlabs.org/patch/723309/
>
>This patch and the one above combined will increase throughput in general,
>however at the cost of additional latency (see some numbers below).
>
>However I still would like to see both patches applied with a flush every tx
>batch. This still increase performance if the rx batch has overlapping egress
>ports, but lacks the latency increase.
>
>It would be nice if you could do your latency tests with this flush included to
>see if you get the same results I got with this patch and the earlier one.

Thanks for reviewing this patch and for all the feedback. I shall work on 
merging both the DPDK and vHost-user intermediate queue implementations and 
post the result to the ML, factoring in all your comments. I have to go back 
and recheck a few things w.r.t. deleting the packets and the 'total pkts' and 
'dropped' counters.

>I did do some quick latency and throughput tests (with only this patch
>applied).
>Same test setup as for the other patch set, i.e. two 82599ES 10G port with 64
>byte packets being send at wire speed:
>
>Physical to Virtual test:
>
>Number      plain                 patch +
>of flows    git clone    patch    flush
>========    =========  =========  =========
>      10      5945899    8006593    7833914
>      32      3872211    6596310    6530133
>      50      3283713    5861894    6618711
>     100      3132540    5953752    5857226
>     500      2964499    5612901    5273006
>    1000      2931952    5233089    5178038
>
>
>Physical to Virtual to Physical test:
>
>Number      plain                 patch +
>of flows    git clone    patch    flush
>========    =========  =========  =========
>      10      3240647    2659526    3652217
>      32      2136872    2060313    2834941
>      50      1981795    1912476    2897763
>     100      1794678    1798084    2014881
>     500      1686756    1672014    1657513
>    1000      1677795    1628578    1612480
>
>The results for the latency tests mimic your test case 2 from the previous
>patch set, sending 10G traffic @ wire speed:
>
>= GIT CLONE
>Pkt size  min(ns)  avg(ns)  max(ns)
>     512   10,011   12,100  281,915
>    1024    7,870    9,313  193,116
>    1280    7,862    9,036  194,439
>    1518    8,215    9,417  204,782
>
>= PATCH
>Pkt size  min(ns)  avg(ns)  max(ns)
>     512   25,044   28,244  774,921
>    1024   29,029   33,031  218,653
>    1280   26,464   30,097  203,083
>    1518   25,870   29,412  204,165
>
>= PATCH + FLUSH
>Pkt size  min(ns)  avg(ns)  max(ns)
>     512   10,492   13,655  281,538
>    1024    8,407    9,784  205,095
>    1280    8,399    9,750  194,888
>    1518    8,367    9,722  196,973
>

Many thanks, and I appreciate you taking the time to test the patch and post 
the throughput and latency details. I will also run the tests once I merge the 
patches and cross-check against the above numbers. I am also going to include 
the flush logic as suggested, to improve latency.

- Bhanuprakash.


Re: [ovs-dev] [RFC] packets: Do not initialize ct_orig_tuple.

2017-05-15 Thread Bodireddy, Bhanuprakash
>Commit daf4d3c18da4("odp: Support conntrack orig tuple key.") introduced
>new fields in struct 'pkt_metadata'.  pkt_metadata_init() is called for every
>packet in the userspace datapath.  When testing a simple single flow case with
>DPDK, we observe a lower throughput after the above commit (it was 14.88
>Mpps before, it is 13 Mpps after).
>
>This patch skips initializing ct_orig_tuple in pkt_metadata_init().
>It should be enough to initialize ct_state, because nobody should look at
>ct_orig_tuple unless ct_state is != 0.
>
>CC: Jarno Rajahalme 
>Signed-off-by: Daniele Di Proietto 
>---
>I'm sending this as an RFC because I didn't check very carefully if we can 
>really
>avoid initializing ct_orig_tuple.
>
>Maybe there are better solutions to this problem.
>---
> lib/packets.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/lib/packets.h b/lib/packets.h index a5a483bc8..6f1791c7a 100644
>--- a/lib/packets.h
>+++ b/lib/packets.h
>@@ -129,7 +129,7 @@ pkt_metadata_init(struct pkt_metadata *md,
>odp_port_t port)
> /* It can be expensive to zero out all of the tunnel metadata. However,
>  * we can just zero out ip_dst and the rest of the data will never be
>  * looked at. */
>-memset(md, 0, offsetof(struct pkt_metadata, in_port));
>+memset(md, 0, offsetof(struct pkt_metadata, ct_orig_tuple));
> md->tunnel.ip_dst = 0;
> md->tunnel.ipv6_dst = in6addr_any;
>

It's been a while since this RFC patch was submitted to fix the performance 
drop on master. It does fix the OvS-DPDK performance drop that was introduced 
by the conntrack commit.
Is there a better fix than what is suggested above?

- Bhanuprakash


Re: [ovs-dev] [PATCH 4/5] dpif-netdev: Fix comments for dp_netdev_pmd_thread struct.

2017-05-15 Thread Bodireddy, Bhanuprakash
>>
>>
>>How about something like this ?
>>
>>+ * Each struct has its own flow cache and classifier per managed inport.
>>Packets
>>+ *received from managed ports are looked up in the corresponding pmd
>>+ * thread's flow cache and corresponding classifier, if the flow
>>cache misses.
>>+ * Packets are executed with the found actions in either case.
>>
>
>Sounds good Darrell.  I wanted to fix the earlier comment that reads classifier
>per pmd thread that can be  confusing to users.
>I will send on separate patch as Ben suggested by adding the 'Fixes' tag and
>also add you in the 'signed-off-by' list.
>

Hello Darrell,

I have posted the patch here: https://patchwork.ozlabs.org/patch/762597/
On a side note, I have a few patches posted here, 
https://patchwork.ozlabs.org/patch/737847/, that need your attention. The 
patches improve performance, especially in the EMC-disabled case.
Please note that the patch series was 'Acked' too.

Regards,
Bhanuprakash.


Re: [ovs-dev] [RFC 1/2] doc: Reduce duplication in 'man_pages'

2017-05-19 Thread Bodireddy, Bhanuprakash
make fails with the below error on my Fedora 22 target.

error.log--
Documentation/conf.py:126:9: F812 list comprehension redefines 'filename' from 
line 59
Makefile:5848: recipe for target 'flake8-check' failed
make[2]: *** [flake8-check] Error 1
make[2]: Leaving directory '/workspace/master/ovs'
Makefile:5182: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/workspace/master/ovs'
Makefile:3014: recipe for target 'all' failed
make: *** [all] Error 2

I don't see this issue when I replace 'filename' with 'file_name' as below.

diff --git a/Documentation/conf.py b/Documentation/conf.py
index d70ee6b..77c4df5 100644
--- a/Documentation/conf.py
+++ b/Documentation/conf.py
@@ -121,6 +121,6 @@ _man_pages = [

 # Generate list of (path, name, description, [author, ...], section)
 man_pages = [
-('ref/%s' % filename, filename.split('.', 1)[0],
- description, [author], filename.split('.', 1)[1])
-for filename, description in _man_pages]
+('ref/%s' % file_name, file_name.split('.', 1)[0],
+ description, [author], file_name.split('.', 1)[1])
+for file_name, description in _man_pages]

Regards,
Bhanuprakash. 

>-Original Message-
>From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
>boun...@openvswitch.org] On Behalf Of Ben Pfaff
>Sent: Thursday, May 18, 2017 11:50 PM
>To: Stephen Finucane 
>Cc: d...@openvswitch.org
>Subject: Re: [ovs-dev] [RFC 1/2] doc: Reduce duplication in 'man_pages'
>
>On Wed, May 10, 2017 at 09:32:18PM -0400, Stephen Finucane wrote:
>> All these entries are going to be roughly the same, with only two key
>> differences. Clarify things by focusing on those differences.
>>
>> Signed-off-by: Stephen Finucane 
>
>Fair enough!  Thanks, I applied this to master.


Re: [ovs-dev] [PATCH] doc: Resolve pep8 warnings in conf.py

2017-05-19 Thread Bodireddy, Bhanuprakash
>-Original Message-
>From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
>boun...@openvswitch.org] On Behalf Of Stephen Finucane
>Sent: Friday, May 19, 2017 10:15 AM
>To: d...@openvswitch.org
>Subject: [ovs-dev] [PATCH] doc: Resolve pep8 warnings in conf.py
>
>flake8 doesn't like us redefining variables in loops.
>
>Signed-off-by: Stephen Finucane 
>Reported-by: Bhanuprakash Bodireddy
>
>Fixes: f15010f ("doc: Reduce duplication in 'man_pages'")
>Cc: Ben Pfaff 
>---
> Documentation/conf.py | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
>diff --git a/Documentation/conf.py b/Documentation/conf.py index
>d70ee6b..77c4df5 100644
>--- a/Documentation/conf.py
>+++ b/Documentation/conf.py
>@@ -121,6 +121,6 @@ _man_pages = [
>
> # Generate list of (path, name, description, [author, ...], section)  
> man_pages
>= [
>-('ref/%s' % filename, filename.split('.', 1)[0],
>- description, [author], filename.split('.', 1)[1])
>-for filename, description in _man_pages]
>+('ref/%s' % file_name, file_name.split('.', 1)[0],
>+ description, [author], file_name.split('.', 1)[1])
>+for file_name, description in _man_pages]

Acked-by: Bhanuprakash Bodireddy 


Re: [ovs-dev] [RFC PATCH 0/6] Change dpdk rxq scheduling to incorporate rxq processing cycles.

2017-05-21 Thread Bodireddy, Bhanuprakash
Hello Kevin,

>
>Rxqs are scheduled to be handled across available pmds in round robin order
>with no weight or priority.
>
>It can happen that some very busy queues are handled by one pmd which
>does not have enough cycles to prevent packets being dropped on them.
>While at the same time another pmd which handles queues with no traffic on
>them, is essentially idling.
>
>Rxq scheduling happens as a result of a number of events and when it does,
>the same unweighted round robin approach is applied each time.
>
>This patchset proposes to augment the round robin nature of rxq scheduling
>by counting the processing cycles used by the rxqs during their operation and
>incorporate it into the rxq scheduling.
>
>Before distributing in a round robin manner, the rxqs will be sorted in order 
>of
>the processing cycles they have been consuming. Assuming multiple pmds,
>this ensures that the measured rxqs using most processing cycles will be
>distributed to different cores.

Thanks for working on this. This work is important and would also solve the 
OvS-DPDK scaling issue.
I have reviewed the patch series but haven't tested it with traffic yet, so I 
will stick to comments on the high-level design at this stage.

With this series, rxqs are sorted based on cycles consumed and are subsequently 
distributed across multiple PMDs in round-robin fashion. While this approach is 
fine, I have some concerns w.r.t. QoS.

Today OvS-DPDK doesn't support QoS (i.e. based on L2 PCP or L3 DSCP bits). But 
when QoS is implemented in the future, based on Weighted Round Robin (WRR), 
Strict Priority (SP), or hybrid (WRR and SP) models, the approach proposed in 
this RFC may conflict with it or pose some limitations.

I am aware that this RFC doesn't factor in QoS, but I would like to know your 
thoughts and whether you have ideas to make it less painful in the future.

1) In a WRR/SP model, the rxqs would be sorted according to queue 
credits/priorities, and a PMD may have to poll rxqs based on that criterion 
instead.
2) For example, high-priority signaling/control traffic may in most scenarios 
be redirected to a specific rxq (based on PCP/DSCP bits). Control traffic can 
also be <= 5% of total traffic, so the cycles spent on that queue can be 
insignificant, and under the proposed approach those rxqs would be pushed to 
the end of the sorted list. Most importantly, an rxq receiving *high-volume, 
lowest-priority traffic* will always top the list and be processed first by 
the respective PMD threads, potentially tail-dropping *low-rate, high-priority 
traffic* during congestion.

Also adding Billy from Intel, who is working on QoS for OvS-DPDK.

- Bhanuprakash
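[For reference, the cycle-based assignment the RFC describes (sort rxqs by
measured processing cycles, then deal them out to PMDs round robin so the
busiest queues land on different cores) can be sketched as below. This is an
illustrative Python sketch, not the actual OVS code; the function and data
shapes are assumptions.]

```python
# Illustrative sketch of cycle-sorted round-robin rxq assignment:
# sort rxqs busiest-first, then distribute round robin across PMDs
# so the heaviest queues end up on different cores.

def assign_rxqs(rxq_cycles, n_pmds):
    """rxq_cycles: {rxq_id: measured cycles}; returns {rxq_id: pmd_id}."""
    ordered = sorted(rxq_cycles, key=rxq_cycles.get, reverse=True)
    return {rxq: i % n_pmds for i, rxq in enumerate(ordered)}

# The two busy queues (1 and 3) end up on different PMDs:
print(assign_rxqs({0: 100, 1: 900, 2: 50, 3: 700}, n_pmds=2))
# -> {1: 0, 3: 1, 0: 0, 2: 1}
```

Note that this is exactly the unweighted behavior the QoS comments above are
about: the assignment only sees cycle counts, not priorities.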


Re: [ovs-dev] [RFC PATCH 0/6] Change dpdk rxq scheduling to incorporate rxq processing cycles.

2017-05-22 Thread Bodireddy, Bhanuprakash
>> Thanks for working on this. This work is important and would also solve OvS-
>DPDK scaling issue.
>> I have reviewed the patch series but haven't tested this yet with traffic. I
>would stick to comments on high level design at his stage.
>>
>> With this series, rxqs are sorted based on cycles consumed and are
>subsequently distributed across multiple PMDs in round robin fashion. While
>this approach is fine, I have some concerns w.r.t QoS.
>>
>> Now OvS-DPDK don't support QoS (i.e based on L2 - pcp, L3 - dscp bits). But
>when QoS is implemented in future based on Weighted Round Robin(WRR),
>Strict priority(SP) (or) hybrid models(WRR and SP) the proposed approach in
>this RFC may conflict or pose some limitations.
>>
>> I am aware that this RFC doesn't factor in QoS, but would like to know your
>thoughts and if you have some ideas to  make it less painful in future.
>>
>> 1)  In WRR/SP model, the rxq(s) shall be sorted according to queue
>credits/priorities.  PMD may have to poll rxq(s) based on this criteria?
>> 2) For example, signaling/control traffic which is high priority may be
>redirected to a specific rxq(based on pcp/dscp bits) in most of the scenarios.
>Also the control traffic can be <= 5% of total traffic and hence the cycles 
>spent
>on this queue can be insignificant and these rxq(s) shall be pushed to the end
>of the sorted list as per the proposed approach. Most importantly rxq
>receiving the *high volume lowest priority traffic* will top the list always 
>and
>shall be processed first by the respective PMD threads potentially tail
>dropping *low rate high priority traffic* during congestion.
>>
>> Also adding Billy from Intel who is working QoS for OvS-DPDK.
>>
>> - Bhanuprakash
>>
>
>Thanks for your comments. I can only speculate based on the high level
>description of QoS but I don't think it should be an issue. Think of this RFC 
>as a
>refinement of the default round robin queue to core assignment.
>
>At the moment, the rxq affinity pinning takes precedence over the default
>assignment. I would expect that a user defined QoS should also take
>precedence, but that there will be still be a need for code to perform the
>default assignment because QoS would be optional and/or queues may have
>equal priorities.

This makes sense. I will try to test this series with some traffic.

- Bhanuprakash.


Re: [ovs-dev] [PATCH v8] netdev-dpdk: Increase pmd thread priority

2017-05-26 Thread Bodireddy, Bhanuprakash
Hi Billy,
>Hi Bhanu,
>
>This patch no longer applies cleanly.

Thanks for looking into this patch.  It's a pretty old patch and needs a 
rebase.

>
>$git apply ...
>error: Documentation/intro/install/dpdk-advanced.rst: No such file or
>directory
>error: patch failed: lib/ovs-numa.h:56
>error: lib/ovs-numa.h: patch does not apply
>
>Also some more information on the rationale behind such a change would be
>useful. E.g. is it helpful in the case where the PMDs have not been given
>isolated cores?

As you would have already noticed, this patch was at v8.

Initially I started out by applying a real-time scheduling policy and priority 
to PMD threads. It was noticed that real-time priorities (the original 
implementation) could potentially cause problems in a few corner cases. Flavio 
and Daniele suggested bumping up the PMD thread priority instead of applying a 
real-time scheduling policy.

That is what was implemented in v8. With the datapath threads at a higher 
priority, overall switching performance wouldn't suffer and may improve in 
out-of-the-box deployments, as Flavio mentioned in another mail.

I will rebase and send the patch.

- Bhanuprakash.

>
>> -Original Message-
>> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
>> boun...@openvswitch.org] On Behalf Of Aaron Conole
>> Sent: Tuesday, January 3, 2017 8:08 PM
>> To: Bodireddy, Bhanuprakash 
>> Cc: d...@openvswitch.org
>> Subject: Re: [ovs-dev] [PATCH v8] netdev-dpdk: Increase pmd thread
>> priority
>>
>> Bhanuprakash Bodireddy  writes:
>>
>> > Increase the DPDK pmd thread scheduling priority by lowering the
>> > nice value. This will advise the kernel scheduler to prioritize pmd
>> > thread over other processes.
>> >
>> > Signed-off-by: Bhanuprakash Bodireddy
>> > 
>> > ---
>>
>> Sorry for jumping into this so late.  Is there a measured benefit to this
>patch?
>> Do you have a test case to reproduce the effect you're seeing?  Might
>> it be better to write up documentation for the user describing
>> chrt/nice/renice utilities?
>>
>> > v7->v8:
>> > * Rebase
>> > * Update the documentation file
>> > @Documentation/intro/install/dpdk-advanced.rst
>> >
>> > v6->v7:
>> > * Remove realtime scheduling policy logic.
>> > * Increase pmd thread scheduling priority by lowering nice value to -20.
>> > * Update doc accordingly.
>> >
>> > v5->v6:
>> > * Prohibit spawning pmd thread on the lowest core in dpdk-lcore-mask if
>> >   lcore-mask and pmd-mask affinity are identical.
>> > * Updated Note section in INSTALL.DPDK-ADVANCED doc.
>> > * Tested below cases to verify system stability with pmd priority
>> > patch
>> >
>> > v4->v5:
>> > * Reword Note section in DPDK-ADVANCED.md
>> >
>> > v3->v4:
>> > * Document update
>> > * Use ovs_strerror for reporting errors in lib-numa.c
>> >
>> > v2->v3:
>> > * Move set_priority() function to lib/ovs-numa.c
>> > * Apply realtime scheduling policy and priority to pmd thread only if
>> >   pmd-cpu-mask is passed.
>> > * Update INSTALL.DPDK-ADVANCED.
>> >
>> > v1->v2:
>> > * Removed #ifdef and introduced dummy function
>> "pmd_thread_setpriority"
>> >   in netdev-dpdk.h
>> > * Rebase
>> >
>> >  Documentation/intro/install/dpdk-advanced.rst |  8 +++-
>> >  lib/dpif-netdev.c |  4 
>> >  lib/ovs-numa.c| 19 +++
>> >  lib/ovs-numa.h|  1 +
>> >  4 files changed, 31 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/Documentation/intro/install/dpdk-advanced.rst
>> > b/Documentation/intro/install/dpdk-advanced.rst
>> > index 44d1cd7..67815ac 100644
>> > --- a/Documentation/intro/install/dpdk-advanced.rst
>> > +++ b/Documentation/intro/install/dpdk-advanced.rst
>> > @@ -238,7 +238,8 @@ affinitized accordingly.
>> >to be affinitized to isolated cores for optimum performance.
>> >
>> >By setting a bit in the mask, a pmd thread is created and pinned
>> > to the
>> > -  corresponding CPU core. e.g. to run a pmd thread on core 2::
>> > +  corresponding CPU core and the nice value set to '-20'.
>> > +  e.g. to run a pmd thread on core 2::
>> >
>> >$ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x4
>> >

Re: [ovs-dev] [PATCH 0/4] prioritizing latency sensitive traffic

2017-07-28 Thread Bodireddy, Bhanuprakash
Hi Billy,

>Hi All,
>
>This patch set provides a method to request ingress scheduling on interfaces.
>It also provides an implementation of same for DPDK physical ports.
>
>This allows specific packet types to be:
>* forwarded to their destination port ahead of other packets.
>and/or
>* be less likely to be dropped in an overloaded situation.
>
>It was previously discussed
>https://mail.openvswitch.org/pipermail/ovs-discuss/2017-May/044395.html
>and RFC'd
>https://mail.openvswitch.org/pipermail/ovs-dev/2017-July/335237.html
>
>Limitations of this patch:
>* The patch uses the Flow Director filter API in DPDK and has only been tested
>on Fortville (XL710) NIC.
>* Prioritization is limited to:
>** eth_type
>** Fully specified 5-tuple src & dst ip and port numbers for UDP & TCP packets
>* ovs-appctl dpif-netdev/pmd-*-show o/p should indicate rxq prioritization.
>* any requirements for a more granular prioritization mechanism
>
>Initial results:
>* even when userspace OVS is very much overloaded and
>  dropping significant numbers of packets the drop rate for prioritized traffic
>  is running at 1/1000th of the drop rate for non-prioritized traffic.
>
>* the latency profile of prioritized traffic through userspace OVS is also much
>  improved
>
>1e0 |*
>|*
>1e-1|* | Non-prioritized pkt latency
>|* * Prioritized pkt latency
>1e-2|*
>|*
>1e-3|*   |
>|*   |
>1e-4|*   | | |
>|*   |*| |
>1e-5|*   |*| | |
>|*   |*|*| |  |
>1e-6|*   |*|*|*|  |
>|*   |*|*|*|* |
>1e-7|*   |*|*|*|* |*
>|*   |*|*|*|* |*
>1e-8|*   |*|*|*|* |*
>  0-1 1-20 20-40 40-50 50-60 60-70 ... 120-400
>Latency (us)
>
> Proportion of packets per latency bin @ 80% Max Throughput
>  (Log scale)
> 

Thanks for working on this feature. I started reviewing the code but decided 
to test it first, as it uses the XL710 NIC flow director features and I wanted 
to know the implications, if any. I have a few observations and would like to 
know whether you saw the same during your unit tests.

1)  With this patch series, the Rx burst bulk-allocation callback function is 
invoked instead of the vector rx function, i.e. i40e_recv_pkts_bulk_alloc() 
gets invoked instead of i40e_recv_pkts_vec().  Please check 
i40e_set_rx_function() in the i40e DPDK driver.
 
  I am speculating this may be due to enabling the flow director and its 
rules. I don't know the implications of using the bulk_alloc() function; maybe 
we should check with the DPDK folks on this.
 
2)  When I tried to prioritize UDP packets for specific IPs and ports, I saw a 
massive performance drop. I am using an XL710 NIC with a stable firmware 
version.
   Below are my steps:

   -  Start OvS and make sure n_rxq for the dpdk0 and dpdk1 ports is set 
to 2.
   -  Run a simple P2P test with a single 
stream (ip_src=8.18.8.1,ip_dst=101.10.10.1,udp_src=10001,udp_dst=5001) and 
check the throughput.
   -  Prioritize the active stream:
 ovs-vsctl set interface dpdk0 
other_config:ingress_sched=udp,ip_src=8.18.8.1,ip_dst=101.10.10.1,udp_src=10001,udp_dst=5001
  -  A throughput drop is now observed (~1.7 Mpps).

Debugging case 2 a bit, I found that miniflow_hash_5tuple() is getting invoked 
and consuming 10% of the total cycles.
One of the commits had the below lines:

dpdk_eth_dev_queue_setup--
/* Ingress scheduling requires ETH_MQ_RX_NONE so limit it to when exactly
 * two rxqs are defined. Otherwise MQ will not work as expected. */
if (dev->ingress_sched_str && n_rxq == 2) {
conf.rxmode.mq_mode = ETH_MQ_RX_NONE;
}
else {
conf.rxmode.mq_mode = ETH_MQ_RX_RSS;
}
-

Does ingress scheduling turn off RSS?  This would be a big drawback, as 
calculating the hash in software consumes significant cycles.

3) This is another corner case.
  - Here n_rxq is set to 4 for my DPDK ports. OvS is started, traffic is 
started, and throughput is as expected.
  -  Now prioritize the stream:
   ovs-vsctl set interface dpdk0 
other_config:ingress_sched=udp,ip_src=8.18.8.1,ip_dst=101.10.10.1,udp_src=10001,udp_dst=5001
  - The above command shouldn't take effect, as n_rxq is set to 4, not 2, 
and the same is logged appropriately:

   "2017-07-28T11:11:57.792Z|00104|netdev_dpdk|ERR|Interface dpdk0: 
Ingress scheduling config ignored; Requires n_rxq==2.
   2017-07-28T11:11:57.809Z|00105|dpdk|INFO|PMD: i40e_pf_config_rss(): 
Max of contiguous 4 PF queues are configured"
  - However the throug

Re: [ovs-dev] [PATCH v2 0/4] Output packet batching.

2017-08-01 Thread Bodireddy, Bhanuprakash
>This patch-set inspired by [1] from Bhanuprakash Bodireddy.
>Implementation of [1] looks very complex and introduces many pitfalls for
>later code modifications like possible packet stucks.
>
>This version targeted to make simple and flexible output packet batching on
>higher level without introducing and even simplifying netdev layer.
>
>Patch set consists of 3 patches. All the functionality introduced in the first
>patch. Two others are just cleanups of netdevs to not do unnecessary things.
>
>Basic testing of 'PVP with OVS bonding on phy ports' scenario shows
>significant performance improvement.
>More accurate and intensive testing required.
>
>[1] [PATCH 0/6] netdev-dpdk: Use intermediate queue during packet
>transmission.
>https://mail.openvswitch.org/pipermail/ovs-dev/2017-June/334762.html
>
>Version 2:
>
>   * Rebased on current master.
>   * Added time based batching RFC patch.
>   * Fixed mixing packets with different sources in same batch.
>

I applied this series along with the other patches [1] and gave it an initial 
try.
With this series, a throughput drop of approximately half a million pps is 
observed in a simple test case (P2P, 1 stream, UDP) vs. master + [1].
A performance improvement is observed with multiple flows (which this series 
is meant to address).

No latency settings were used at this stage. I have yet to review and do more 
testing.

[1] Improves performance.
https://mail.openvswitch.org/pipermail/ovs-dev/2017-July/335359.html
https://mail.openvswitch.org/pipermail/ovs-dev/2017-July/336186.html
https://mail.openvswitch.org/pipermail/ovs-dev/2017-July/336187.html
https://mail.openvswitch.org/pipermail/ovs-dev/2017-July/336290.html

- Bhanuprakash.


Re: [ovs-dev] [PATCH RFC v2 4/4] dpif-netdev: Time based output batching.

2017-08-01 Thread Bodireddy, Bhanuprakash
>On 28.07.2017 10:20, Darrell Ball wrote:
>> I have not tested yet
>>
>> However, I would have expected something max latency config. to be
>specific to netdev-dpdk port types
>
>IMHO, if we can make it generic, we must make it generic.
>
>[Darrell]
>The first question I ask myself is -  is this functionality intrinsically 
>generic or is
>it not ?
>It is clearly not and trying to make it artificially so would do the following:
>
>1) We end up designing something the wrong way where it partially works.
>2) Breaks other features present and future that really do intersect.
>
>
> Making of this
>functionality netdev-dpdk specific will break the ability to test it using
>unit tests. As the change is complex and has a lot of pitfalls like
>possible packet stucks and possible latency issues, this code should be
>covered by unit tests to simplify the support and modifications.
>(And it's already partly covered because it is generic. And I fixed many
>minor issues while developing through unit test failures.)
>
>[Darrell]
>Most of dpdk is not tested by our unit tests because it cannot be simulated
>well at the moment. This is orthogonal to the basic question however.

Darrell is right: the unit tests we currently have don't exercise the DPDK 
datapath well, so having these changes in the netdev layer shouldn't impact 
the unit tests much.

While I share your other concern that changes in the netdev layer will be a 
little complex and slightly painful for future code changes, the max-latency 
config introduced at the dpif layer may not hold good for different port 
types, and users may introduce conflicting changes in the netdev layer in the 
future to suit their use cases.
 
>
>
>In the future this can be used also to improve performance of netdev-linux
>by replacing sendmsg() with batched sendmmsg(). This should significantly
>increase performance of flood actions while MACs are not learned yet in
>action NORMAL.
>
>> This type of code also seems to intersect with present and future QoS
>considerations in netdev-dpdk

>
>Maybe, but there are also some related features in mail-list like rx queue
>prioritization which are implemented in generic way on dpif-netdev layer.

If you are referring to the rxq prioritization work by Billy 
(https://mail.openvswitch.org/pipermail/ovs-dev/2017-July/336001.html),
that feature is mostly implemented in the netdev layer, with very minimal 
updates to the dpif layer.

BTW, dp_execute_cb() is getting cluttered with this patch.

- Bhanuprakash.
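[For clarity, the time-based flush the RFC proposes — packets wait in a
per-port output batch until either the batch fills or 'output-max-latency'
elapses — can be sketched roughly as below. This is an illustrative sketch,
not the actual dpif-netdev code; all names are assumptions.]

```python
# Sketch of time-based output batching: a packet is flushed when the
# batch is full, or when it has waited longer than max_latency_ms.
class OutputBatch:
    def __init__(self, max_batch=32, max_latency_ms=0, send=print):
        self.pkts = []
        self.first_queued = None
        self.max_batch = max_batch
        self.max_latency_ms = max_latency_ms
        self.send = send          # stand-in for the netdev send call

    def queue(self, pkt, now_ms):
        if not self.pkts:
            self.first_queued = now_ms
        self.pkts.append(pkt)
        if len(self.pkts) >= self.max_batch:
            self.flush()          # full batch: send immediately

    def maybe_flush(self, now_ms):
        # Called from the pmd main loop on every iteration.
        if self.pkts and now_ms - self.first_queued >= self.max_latency_ms:
            self.flush()

    def flush(self):
        self.send(self.pkts)
        self.pkts = []

sent = []
b = OutputBatch(max_batch=4, max_latency_ms=5, send=sent.append)
b.queue("p1", now_ms=0)
b.maybe_flush(now_ms=2)   # 2 ms < 5 ms: packet keeps waiting
b.maybe_flush(now_ms=6)   # 6 ms >= 5 ms: flushes ["p1"]
print(sent)               # -> [['p1']]
```

With max_latency_ms=0 (the RFC's DEFAULT_OUTPUT_MAX_LATENCY), maybe_flush()
sends on the next loop iteration, i.e. effectively instant send.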

>
>>
>> -Original Message-
>> From: Ilya Maximets 
>> Date: Wednesday, July 26, 2017 at 8:21 AM
>> To: "ovs-dev@openvswitch.org" ,
>Bhanuprakash Bodireddy 
>> Cc: Heetae Ahn , Ben Pfaff
>, Antonio Fischetti , Eelco
>Chaudron , Ciara Loftus ,
>Kevin Traynor , Darrell Ball ,
>Ilya Maximets 
>> Subject: [PATCH RFC v2 4/4] dpif-netdev: Time based output batching.
>>
>> This allows to collect packets from more than one RX burst
>> and send them together with a configurable maximum latency.
>>
>> 'other_config:output-max-latency' can be used to configure
>> time that a packet can wait in output batch for sending.
>>
>> Signed-off-by: Ilya Maximets 
>> ---
>>
>> millisecond granularity is used for now. Can be easily switched to use
>> microseconds instead.
>>
>>  lib/dpif-netdev.c| 97
>+++-
>>  vswitchd/vswitch.xml | 15 
>>  2 files changed, 95 insertions(+), 17 deletions(-)
>>
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>> index 07c7dad..e5f8a3d 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -84,6 +84,9 @@ VLOG_DEFINE_THIS_MODULE(dpif_netdev);
>>  #define MAX_RECIRC_DEPTH 5
>>  DEFINE_STATIC_PER_THREAD_DATA(uint32_t, recirc_depth, 0)
>>
>> +/* Use instant packet send by default. */
>> +#define DEFAULT_OUTPUT_MAX_LATENCY 0
>> +
>>  /* Configuration parameters. */
>>  enum { MAX_FLOWS = 65536 }; /* Maximum number of flows in flow
>table. */
>>  enum { MAX_METERS = 65536 };/* Maximum number of meters. */
>> @@ -261,6 +264,9 @@ struct dp_netdev {
>>  struct hmap ports;
>>  struct seq *port_seq;   /* Incremented whenever a port 
> changes.
>*/
>>
>> +/* The time that a packet can wait in output batch for sending. 
> */
>> +atomic_uint32_t output_max_latency;
>> +
>>  /* Meters. */
>>  struct ovs_mutex meter_locks[N_METER_LOCKS];
>>  struct dp_meter *meters[MAX_METERS]; /* Meter bands. */
>> @@ -498,6 +504,7 @@ struct tx_port {
>>  int qid;
>>  long long last_used;
>>  struct hmap_node node;
>> +long lo

Re: [ovs-dev] DPDK Merge Repo

2017-08-02 Thread Bodireddy, Bhanuprakash
>> Hi Darrell and Ben.
>>
>> > Hi All
>> >
>> > As mentioned before, I am using a repo for DPDK patch merging.
>> > The repo is here:
>> > https://github.com/darball/ovs/
>> >
>> > There are still some outstanding patches from Bhanu that have not
>> completed review yet:
>> >
>> > util: Add PADDED_MEMBERS_CACHELINE_MARKER macro to mark
>cachelines.-
>> > Bhanu
>> > packets: Reorganize the pkt_metadata structure. - Bhanu
>> >
>> > and a series we would like to get into 2.8
>> >
>> > netdev-dpdk: Use intermediate queue during packet transmission.
>> > Bhanu Jun 29/V3
>> > netdev: Add netdev_txq_flush function.
>> > netdev-dpdk: Add netdev_dpdk_txq_flush function.
>> > netdev-dpdk: Add netdev_dpdk_vhost_txq_flush function.
>> > netdev-dpdk: Add intermediate queue support.
>> > netdev-dpdk: Enable intermediate queue for vHost User port.
>> > dpif-netdev: Flush the packets in intermediate queue.
>>
>> I think that we still not reached agreement about the level of
>> implementation (netdev-dpdk or dpif-netdev). Just few people
>> participate in discussion which is not very productive. I suggest not
>> to target output batching for 2.8 release because of this and also
>> lack of testing and review.
>> As I understand, we have only 3 days merge window for the new features
>> and I expect that we can't finish discussion, review and testing in time.
>>
>
>My own opinion on this, this feature has been kicking around for quite a
>while,  the original patch from Bhanu went out back in December.
>https://mail.openvswitch.org/pipermail/ovs-dev/2016-
>December/326348.html

Unfortunately it dates back to August 2016, so it has been almost a year.
(Refer: https://mail.openvswitch.org/pipermail/ovs-dev/2016-August/321748.html)

I reported this issue and copied Ilya (the original author), whose commit 
b59cc14e032d ("netdev-dpdk: Use instant sending instead of queueing of 
packets") introduced this particular issue in 2.6.

Unfortunately the author who introduced the issue didn't respond to that 
question, and we came up with the patch series to address it. Multiple RFC 
versions were posted, and Ilya himself participated in reviews and provided 
feedback.
It's unacceptable now to say that this patch series hasn't been reviewed 
enough. A lot of time has been invested in this feature, especially on 
rebasing, testing, collecting latency stats, and promptly replying to all the 
questions on the ML.

>
>There's a level of due diligence carried out in terms of reviewing and testing
>from a range of people in the community for the netdev approach and a
>number of users are already using this without issue. As such I would like this
>approach to be included in the 2.8 release.

As Ian rightly pointed out, we know of a few internal and external customers 
already running this patch series (with incremental changes) in their 
deployments.
There may always be a few corner cases, but those can be addressed and 
shouldn't be a blocker for getting this into the 2.8 series.

>
>I think the dpif layer is more generic and in the long run more maintainable
>but it was quite late in being flagged as an alternate approach and is not as
>mature in terms of testing/reviews. As such I don't think it should block the
>netdev approach until it has reached the same level of feedback and testing
>from the community. The dpif approach could target the 2.9 release after it
>has received more feedback and replace the netdev approach when the pros
>and cons of both have been clearly demonstrated.

This has been discussed, and each approach has its own merits and demerits. 
Darrell has already put his views in other threads.

- Bhanuprakash. 


Re: [ovs-dev] [PATCH v3 00/19] Add OVS DPDK keep-alive functionality.

2017-08-04 Thread Bodireddy, Bhanuprakash
HI Ilya,

Thanks for looking into this and providing your feedback.

When this feature was first posted as an RFC 
(https://mail.openvswitch.org/pipermail/ovs-dev/2016-July/318243.html), the 
implementation in OvS was based on the DPDK keepalive library, keeping 
collectd in sync. As you can see from the RFC, it was pretty compact code, 
integrated well with Ceilometer, and provided end-to-end functionality. Much 
of the RFC code was there to handle shared memory (SHM).

However, the reviewers pointed out the flaws below:

- Very DPDK-specific.
- Shared memory for inter-process communication (between OvS and collectd 
threads).
- Tracks PMD cores, not threads.
- Limited support to detect false negatives and false positives.
- Limited support to query KA status.

Per the suggestions, the below changes were introduced:

- Basic infrastructure to register and track threads instead of cores. (For 
now only PMD threads are tracked, but this can be extended to non-PMD 
threads.)
- Keep most of the APIs generic so that they can be extended in the future; 
all generic APIs are in keepalive.[hc].
- Remove shared memory and introduce OVSDB.
- Add support to detect false negatives.
- appctl options to query status.

I agree that we have a few issues, but they can be reworked:
 -  Invoking dpdk_is_enabled() from generic code (vswitchd/bridge.c) isn't 
nice; I had to do it to pass a few unit test cases last time.
 -  Half a dozen stub APIs. I couldn't avoid them, as they are needed to get 
the kernel datapath to build.

The patch series can be categorized into sub-patchsets (KA infrastructure / 
OvSDB changes / querying KA stats / checking false positives). In its current 
form the series uses the rte_keepalive library to handle PMD threads, but, 
importantly, it introduces basic infrastructure to deal with other threads in 
the future.

Regards,
Bhanuprakash. 
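[As a rough illustration of the generic heartbeat check Ilya suggests
reimplementing in place of rte_keepalive_dispatch_pings(): each registered
PMD thread bumps its own counter, and a monitor thread periodically compares
counters and escalates threads that made no progress. This is a simplified
sketch; the names and the ALIVE/MISSED/DEAD states are assumptions, not the
actual OVS or DPDK code.]

```python
# Sketch of a generic keepalive check: pmd threads bump a per-thread
# heartbeat counter; a monitor escalates threads whose counter stalls.
ALIVE, MISSED_HEARTBEAT, DEAD = "ALIVE", "MISSED", "DEAD"

class Keepalive:
    def __init__(self):
        self.counters = {}   # thread_id -> heartbeat counter (bumped by pmd)
        self.last_seen = {}  # thread_id -> counter value at last check
        self.state = {}

    def register(self, tid):
        self.counters[tid] = 0
        self.last_seen[tid] = 0
        self.state[tid] = ALIVE

    def heartbeat(self, tid):          # called from the pmd loop
        self.counters[tid] += 1

    def dispatch_pings(self):          # called from the monitor thread
        for tid, count in self.counters.items():
            if count == self.last_seen[tid]:
                # No progress since the last check: escalate.
                self.state[tid] = (DEAD if self.state[tid] == MISSED_HEARTBEAT
                                   else MISSED_HEARTBEAT)
            else:
                self.state[tid] = ALIVE
            self.last_seen[tid] = count

ka = Keepalive()
ka.register("pmd-7")
ka.heartbeat("pmd-7")
ka.dispatch_pings()
print(ka.state["pmd-7"])   # -> ALIVE
ka.dispatch_pings()        # no heartbeat in between
print(ka.state["pmd-7"])   # -> MISSED
```

In the OVSDB-based design above, the monitor would additionally write each
thread's state out so external fault-management systems can subscribe to it.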

>-Original Message-
>From: Ilya Maximets [mailto:i.maxim...@samsung.com]
>Sent: Friday, August 4, 2017 2:40 PM
>To: ovs-dev@openvswitch.org; Bodireddy, Bhanuprakash
>
>Cc: Darrell Ball ; Ben Pfaff ; Aaron
>Conole 
>Subject: Re: [ovs-dev] [PATCH v3 00/19] Add OVS DPDK keep-alive
>functionality.
>
>Hi Bhanuprakash,
>
>Thanks for working on this.
>I have a general concern about implementation of this functionality:
>
>*What is the profit from using rte_keepalive library ?*
>
>Pros:
>
>* No need to implement 'rte_keepalive_dispatch_pings()' (40 LOC)
>  and struct rte_keepalive (30 LOC, can be significantly decreased
>  by removing not needed elements) ---> ~70 LOC.
>
>Cons:
>
>* DPDK dependency:
>
>* Implementation of PMD threads management (KA) inside netdev code
>  (netdev-dpdk) looks very strange.
>* Many DPDK references in generic code (like dpdk_is_enabled).
>* Feature isn't available for the common threads (main?) wihtout DPDK.
>* Many stubs and placeholders for cases without dpdk.
>* No ability for unit testing.
>
>So, does it worth to use rte_keepalive? To make functionality fully generic we
>only need to implement 'rte_keepalive_dispatch_pings()'
>and few helpers. As soon as this function does nothing dpdk-specific it's a
>really simple task which will allow to greatly clean up the code. The feature 
>is
>too big to use external library for 70 LOCs of really simple code. (Clean up
>should save much more).
>
>Am I missed something?
>Any thoughts?
>
>Best regards, Ilya Maximets.
>
>> Keepalive feature is aimed at achieving Fastpath Service Assurance in
>> OVS-DPDK deployments. It adds support for monitoring the packet
>> processing cores(PMD thread cores) by dispatching heartbeats at
>> regular intervals. Incase of heartbeat misses additional health checks
>> are enabled on the PMD thread to detect the failure and the same shall
>> be reported to higher level fault management systems/frameworks.
>>
>> The implementation uses OVSDB for reporting the health of the PMD
>threads.
>> Any external monitoring application can read the status from OVSDB at
>> regular intervals (or) subscribe to the updates in OVSDB so that they
>> get notified when the changes happen on OVSDB.
>>
>> keepalive info struct is created and initialized for storing the
>> status of the PMD threads. This is initialized by main
>> thread(vswitchd) as part of init process and will be periodically updated by
>'keepalive'
>> thread. keepalive feature can be enabled through below OVSDB settings.
>>
>> enable-keepalive=true
>>   - Keepalive feature is disabled by default.
>>
>> keepalive-interval="5000"
>>   - Timer interval in milliseconds for monitoring the packet
>> processing cores.
>>
>> 

Re: [ovs-dev] [PATCH v3 0/6] netdev-dpdk: Use intermediate queue during packet transmission.

2017-08-08 Thread Bodireddy, Bhanuprakash
Hi Darrell,

>Sorry, I was multitasking last week and did not get a chance to finish the
>responses on Friday
>
>I looked thru. the code for all the patches The last 3 patches of V3 needed a
>manual merge; as you know, the series needs a rebase after recent commits.

I  will rebase and send out v4. 

>For a full o/p batch case, I see about a 10% drop in pps; is that what you see 
>?

I see a ~200-250 kpps drop in the P2P case with a single flow, and significant 
improvements once the number of flows reaches the rx batch size.

Can you please let me know if 'full o/p batch' above means a simple P2P test 
with a single flow?
It would be helpful if you could share your traffic profile so I can reproduce 
it locally.

>After applying each patch, we should be able to build and nothing should be
>broken, which is not the case since patch 4 has a function only used in patch 6.
>I have some comments on the individual patches.

I might have introduced this problem when I reordered patches. I will fix this.

- Bhanuprakash.

>
>Darrell
>
>-Original Message-
>From:  on behalf of Bhanuprakash
>Bodireddy 
>Date: Thursday, June 29, 2017 at 3:39 PM
>To: "d...@openvswitch.org" 
>Subject: [ovs-dev] [PATCH v3 0/6] netdev-dpdk: Use intermediate queue
>during packet transmission.
>
>After packet classification, packets are queued into batches depending
>on the matching netdev flow. Thereafter each batch is processed to
>execute the related actions. This becomes particularly inefficient if
>there are few packets in each batch as rte_eth_tx_burst() incurs expensive
>MMIO writes.
>
>This patch series implements an intermediate queue for DPDK and vHost User
>ports. Packets are queued and burst when the packet count exceeds a
>threshold. Also, drain logic is implemented to handle cases where packets
>can get stuck in the tx queues under low rate traffic conditions. Care has
>been taken to see that latency is well within acceptable limits. Testing
>shows significant performance gains with this implementation.
>
>This patch series combines the earlier 2 patches posted below.
>  DPDK patch: https://mail.openvswitch.org/pipermail/ovs-dev/2017-April/331039.html
>  vHost User patch: https://mail.openvswitch.org/pipermail/ovs-dev/2017-May/332271.html
>
>Performance Numbers with intermediate queue:
>
>  DPDK ports
> ===
>
>  Throughput for P2P scenario, for two 82599ES 10G port with 64 byte
>packets
>
>  Number
>  flows   MASTER     With PATCH
>  =====   ========   ==========
>    10    10727283   13393844
>    32     7042253   11228799
>    50     7515491    9607791
>   100     5838699    9430730
>   500     5285066    7845807
>  1000     5226477    7135601
>
>   Latency test
>
>   MASTER
>   ===
>   Pkt size  min(ns)  avg(ns)  max(ns)
>    512      4,631    5,022    309,914
>   1024      5,545    5,749    104,294
>   1280      5,978    6,159     45,306
>   1518      6,419    6,774    946,850
>
>   PATCH
>   =
>   Pkt size  min(ns)  avg(ns)  max(ns)
>    512      4,711    5,064    182,477
>   1024      5,601    5,888    701,654
>   1280      6,018    6,491    533,037
>   1518      6,467    6,734    312,471
>
>   vHost User ports
>  ==
>
>  Throughput for PV scenario, with 64 byte packets
>
>   Number
>   flows   MASTER    With PATCH
>   =====   =======   ==========
>     10    5945899   7833914
>     32    3872211   6530133
>     50    3283713   6618711
>    100    3132540   5857226
>    500    2964499   5273006
>   1000    2931952   5178038
>
>  Latency test.
>
>  MASTER
>  ===
>  Pkt size  min(ns)  avg(ns)  max(ns)
>   512      10,011   12,100   281,915
>  1024       7,870    9,313   193,116
>  1280       7,862    9,036   194,439
>  1518       8,215    9,417   204,782
>
>  PATCH
>  ===
>  Pkt size  min(ns)  avg(ns)  max(ns)
>   512      10,492   13,655   281,538
>  1024       8,407    9,784   205,095
>  1280       8,399    9,750   194,888
>  1518       8,367    9,722   196,973
>
>Performance numbers reported by Eelco Chaudron at
>  https://mail.openvswitch.org/pipermail/ovs-dev/2017-June/333949

Re: [ovs-dev] [PATCH v3 2/6] netdev-dpdk: Add netdev_dpdk_txq_flush function.

2017-08-08 Thread Bodireddy, Bhanuprakash
>Hi Bhanu
>
>Would it be possible to combine patches 1 and 2, rather than initially defining
>an empty netdev_txq_flush for dpdk ? I think the combined patch would have
>more context.

No problem Darrell. I will merge 1 & 2 in v4.

- Bhanuprakash.

>
>
>-Original Message-
>From:  on behalf of Bhanuprakash
>Bodireddy 
>Date: Thursday, June 29, 2017 at 3:39 PM
>To: "d...@openvswitch.org" 
>Subject: [ovs-dev] [PATCH v3 2/6] netdev-dpdk: Add netdev_dpdk_txq_flush
>   function.
>
>This commit adds netdev_dpdk_txq_flush() function. If there are
>any packets waiting in the queue, they are transmitted instantly
>using the rte_eth_tx_burst function. In XPS enabled case, lock is
>taken on the tx queue before flushing the queue.
>
>Signed-off-by: Bhanuprakash Bodireddy
>
>Signed-off-by: Antonio Fischetti 
>Co-authored-by: Antonio Fischetti 
>Signed-off-by: Markus Magnusson 
>Co-authored-by: Markus Magnusson 
>Acked-by: Eelco Chaudron 
>---
> lib/netdev-dpdk.c | 31 +--
> 1 file changed, 29 insertions(+), 2 deletions(-)
>
>diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>index 9ca4433..dd42716 100644
>--- a/lib/netdev-dpdk.c
>+++ b/lib/netdev-dpdk.c
>@@ -293,6 +293,11 @@ struct dpdk_mp {
> struct ovs_list list_node OVS_GUARDED_BY(dpdk_mp_mutex);
> };
>
>+/* Queue 'INTERIM_QUEUE_BURST_THRESHOLD' packets before
>transmitting.
>+ * Defaults to 'NETDEV_MAX_BURST'(32) packets.
>+ */
>+#define INTERIM_QUEUE_BURST_THRESHOLD NETDEV_MAX_BURST
>+
> /* There should be one 'struct dpdk_tx_queue' created for
>  * each cpu core. */
> struct dpdk_tx_queue {
>@@ -302,6 +307,12 @@ struct dpdk_tx_queue {
> * pmd threads (see 'concurrent_txq'). 
> */
> int map;   /* Mapping of configured vhost-user 
> queues
> * to enabled by guest. */
>+int dpdk_pkt_cnt;  /* Number of buffered packets waiting 
> to
>+  be sent on DPDK tx queue. */
>+struct rte_mbuf
>*dpdk_burst_pkts[INTERIM_QUEUE_BURST_THRESHOLD];
>+   /* Intermediate queue where packets can
>+* be buffered to amortize the cost of 
> MMIO
>+* writes. */
> };
>
> /* dpdk has no way to remove dpdk ring ethernet devices
>@@ -1897,9 +1908,25 @@ netdev_dpdk_send__(struct netdev_dpdk *dev,
>int qid,
>  * few packets (< INTERIM_QUEUE_BURST_THRESHOLD) buffered in the
>queue.
>  */
> static int
>-netdev_dpdk_txq_flush(struct netdev *netdev OVS_UNUSED,
>-  int qid OVS_UNUSED, bool concurrent_txq OVS_UNUSED)
>+netdev_dpdk_txq_flush(struct netdev *netdev,
>+  int qid, bool concurrent_txq)
> {
>+struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>+struct dpdk_tx_queue *txq = &dev->tx_q[qid];
>+
>+if (OVS_LIKELY(txq->dpdk_pkt_cnt)) {
>+if (OVS_UNLIKELY(concurrent_txq)) {
>+qid = qid % dev->up.n_txq;
>+rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
>+}
>+
>+netdev_dpdk_eth_tx_burst(dev, qid, txq->dpdk_burst_pkts,
>+ txq->dpdk_pkt_cnt);
>+
>+if (OVS_UNLIKELY(concurrent_txq)) {
>+rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);
>+}
>+}
> return 0;
> }
>
>--
>2.4.11
>
>___
>dev mailing list
>d...@openvswitch.org
>https://mail.openvswitch.org/mailman/listinfo/ovs-dev

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v3 4/6] netdev-dpdk: Add intermediate queue support.

2017-08-08 Thread Bodireddy, Bhanuprakash
>
>This commit introduces the netdev_dpdk_eth_tx_queue() function that
>implements intermediate queueing and packet buffering. The packets get
>buffered till the threshold 'INTERIM_QUEUE_BURST_THRESHOLD' (32) is
>reached and eventually get transmitted.
>
>To handle the case (e.g. ping) where packets are sent at a low rate and
>can potentially get stuck in the queue, flush logic is implemented
>that gets invoked from dp_netdev_flush_txq_ports() as part of the PMD
>packet processing loop.
>
>Signed-off-by: Bhanuprakash Bodireddy
>
>Signed-off-by: Antonio Fischetti 
>Co-authored-by: Antonio Fischetti 
>Signed-off-by: Markus Magnusson 
>Co-authored-by: Markus Magnusson 
>Acked-by: Eelco Chaudron 
>---
> lib/dpif-netdev.c | 44
>+++-
> lib/netdev-dpdk.c | 37 +++--
> 2 files changed, 78 insertions(+), 3 deletions(-)
>
>diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>index 4e29085..7e1f5bc 100644
>--- a/lib/dpif-netdev.c
>+++ b/lib/dpif-netdev.c
>@@ -332,6 +332,7 @@ enum pmd_cycles_counter_type {
> };
>
> #define XPS_TIMEOUT_MS 500LL
>+#define LAST_USED_QID_NONE -1
>
> /* Contained by struct dp_netdev_port's 'rxqs' member.  */
> struct dp_netdev_rxq {
>@@ -492,7 +493,13 @@ struct rxq_poll {
> struct tx_port {
> struct dp_netdev_port *port;
> int qid;
>-long long last_used;
>+int last_used_qid;/* Last queue id where packets got
>+ enqueued. */
>+long long last_used;  /* In case XPS is enabled, it contains the
>+   * timestamp of the last time the port was
>+   * used by the thread to send data.  After
>+   * XPS_TIMEOUT_MS elapses the qid will be
>+   * marked as -1. */
> struct hmap_node node;
> };
>
>@@ -3080,6 +3087,25 @@ cycles_count_end(struct
>dp_netdev_pmd_thread *pmd,
> }
>
> static void
>+dp_netdev_flush_txq_ports(struct dp_netdev_pmd_thread *pmd)
>+{
>+struct tx_port *cached_tx_port;
>+int tx_qid;
>+
>+HMAP_FOR_EACH (cached_tx_port, node, &pmd->send_port_cache) {
>+tx_qid = cached_tx_port->last_used_qid;
>+
>+if (tx_qid != LAST_USED_QID_NONE) {
>+netdev_txq_flush(cached_tx_port->port->netdev, tx_qid,
>+ cached_tx_port->port->dynamic_txqs);
>+
>+/* Queue flushed and mark it empty. */
>+cached_tx_port->last_used_qid = LAST_USED_QID_NONE;
>+}
>+}
>+}
>+
>
>Could you move this function and I think the other code in dpif-netdev.c to
>patch 6, if you can ?

Should be a simple change. Will do this.

>This function is unused, so will generate a build error with –Werror when
>applied in sequence and logically this seems like it can go into patch 6.

Completely agree. 

- Bhanuprakash.

>
>Darrell
>
>
>+static void
> dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
>struct netdev_rxq *rx,
>odp_port_t port_no)
>@@ -4355,6 +4381,7 @@ dp_netdev_add_port_tx_to_pmd(struct
>dp_netdev_pmd_thread *pmd,
>
> tx->port = port;
> tx->qid = -1;
>+tx->last_used_qid = LAST_USED_QID_NONE;
>
> hmap_insert(&pmd->tx_ports, &tx->node, hash_port_no(tx->port-
>>port_no));
> pmd->need_reload = true;
>@@ -4925,6 +4952,14 @@ dpif_netdev_xps_get_tx_qid(const struct
>dp_netdev_pmd_thread *pmd,
>
> dpif_netdev_xps_revalidate_pmd(pmd, now, false);
>
>+/* The tx queue can change in XPS case, make sure packets in previous
>+ * queue is flushed properly. */
>+if (tx->last_used_qid != LAST_USED_QID_NONE &&
>+   tx->qid != tx->last_used_qid) {
>+netdev_txq_flush(port->netdev, tx->last_used_qid, port-
>>dynamic_txqs);
>+tx->last_used_qid = LAST_USED_QID_NONE;
>+}
>+
> VLOG_DBG("Core %d: New TX queue ID %d for port \'%s\'.",
>  pmd->core_id, tx->qid, netdev_get_name(tx->port->netdev));
> return min_qid;
>@@ -5020,6 +5055,13 @@ dp_execute_cb(void *aux_, struct
>dp_packet_batch *packets_,
> tx_qid = pmd->static_tx_qid;
> }
>
>+/* In case these packets gets buffered into an intermediate
>+ * queue and XPS is enabled the flush function could find a
>+ * different tx qid assigned to its thread.  We keep track
>+ * of the qid we're now using, that will trigger the flush
>+ * function and will select the right queue to flush. */
>+p->last_used_qid = tx_qid;
>+
> netdev_send(p->port

Re: [ovs-dev] [PATCH v3 6/6] dpif-netdev: Flush the packets in intermediate queue.

2017-08-08 Thread Bodireddy, Bhanuprakash
Hi Darrell,

>
>Under low rate traffic conditions, there can be 2 issues.
>  (1) Packets potentially can get stuck in the intermediate queue.
>  (2) Latency of the packets can increase significantly due to
>   buffering in intermediate queue.
>
>This commit handles issue (1) by flushing the tx port queues from the
>PMD processing loop. It also addresses issue (2) by flushing the tx
>queues after every rxq port processing. This reduces the latency
>without impacting the forwarding throughput.
>
>   MASTER
>  
>   Pkt size  min(ns)   avg(ns)   max(ns)
>    512       4,631     5,022    309,914
>   1024       5,545     5,749    104,294
>   1280       5,978     6,159     45,306
>   1518       6,419     6,774    946,850
>
>  MASTER + COMMIT
>  -
>   Pkt size  min(ns)   avg(ns)   max(ns)
>    512       4,711     5,064    182,477
>   1024       5,601     5,888    701,654
>   1280       6,018     6,491    533,037
>   1518       6,467     6,734    312,471
>
>PMDs can be torn down and spawned at runtime, and so the rxq and txq
>mapping of the PMD threads can change. In a few cases packets can get
>stuck in the queue due to reconfiguration, and this commit helps flush
>the queues.
>
>Suggested-by: Eelco Chaudron 
>Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2017-April/331039.html
>Signed-off-by: Bhanuprakash Bodireddy
>
>Signed-off-by: Antonio Fischetti 
>Co-authored-by: Antonio Fischetti 
>Signed-off-by: Markus Magnusson 
>Co-authored-by: Markus Magnusson 
>Acked-by: Eelco Chaudron 
>---
> lib/dpif-netdev.c | 7 +++
> 1 file changed, 7 insertions(+)
>
>diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>index 7e1f5bc..f03bd3e 100644
>--- a/lib/dpif-netdev.c
>+++ b/lib/dpif-netdev.c
>@@ -3603,6 +3603,8 @@ dpif_netdev_run(struct dpif *dpif)
> for (i = 0; i < port->n_rxq; i++) {
> dp_netdev_process_rxq_port(non_pmd, port->rxqs[i].rx,
>port->port_no);
>+
>+dp_netdev_flush_txq_ports(non_pmd);
>
>
>
>Is this a temporary change? It seems counter to the objective.
>Should it be latency based, as discussed on another thread a couple of
>months ago, and configurable by port type and port?

This is a temporary change and is made to verify that latency is well within
limits. With this change, the performance improvement is *only* observed when
the rx batch size is significant (unlikely in real use cases).
The incremental patch series (on top of this) should address that by buffering
packets in the intermediate queue across multiple rx batches. Also, latency
configs would be introduced, as you just mentioned above, so users can tune
according to their requirements.

This needs significant testing, as we need to strike a fine balance between
throughput and latency; it shall be done as part of the next series.

- Bhanuprakash.

>
>
>
> }
> }
> }
>@@ -3760,6 +3762,8 @@ reload:
> for (i = 0; i < poll_cnt; i++) {
> dp_netdev_process_rxq_port(pmd, poll_list[i].rx,
>poll_list[i].port_no);
>+
>+dp_netdev_flush_txq_ports(pmd);
> }
>
>
>
>Same comment as above.
>
>
>
> if (lc++ > 1024) {
>@@ -3780,6 +3784,9 @@ reload:
> }
> }
>
>+/* Flush the queues as part of reconfiguration logic. */
>+dp_netdev_flush_txq_ports(pmd);
>+
> poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
> exiting = latch_is_set(&pmd->exit_latch);
> /* Signal here to make sure the pmd finishes
>--
>2.4.11
>
>___
>dev mailing list
>d...@openvswitch.org
>https://mail.openvswitch.org/mailman/listinfo/ovs-dev

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v3 00/19] Add OVS DPDK keep-alive functionality.

2017-08-08 Thread Bodireddy, Bhanuprakash
HI Ilya,

>I understand that using rte_keepalive library was worth in the early RFC
>because size of RFC was comparable with the size of rte_keepalive library.
>But now, as so many generic things was implemented in lib/keepalive.{c,h}
>and the size of the patch-set is pretty large, IMHO, it's better to implement
>'struct rte_keepalive' and 'rte_keepalive_dispatch_pings()' inside
>lib/keepalive.{c,h} and remove dpdk library as a dependency for this
>functionality.

I agree with your suggestion and will factor in this input for the next series.

>
>'rte_keepalive' doesn't have any dpdk-specific things inside. It doesn't work
>with NICs or DPDK-allocated memory. This library is just a simple wrapper.
>So, do we need the dependency from dpdk only to use this wrapper? Without
>it we'll have generic keepalive functionality for the whole OVS without
>additional subs and dpdk references in generic code.

Completely agree and this will help us avoid dummy functions.   

>
>I'm asking you to try to implement 'struct rte_keepalive' and
>'rte_keepalive_dispatch_pings()' inside lib/keepalive.{c,h} and move all the
>keepalive related code out of [netdev-]dpdk.{c,h} to keepalive.{c,h} and,
>possibly, to dpif-netdev.{c,h}.
>I'm expecting significant improvements in code size, simplicity and 
>readability.
>Also, this will allow to use keepalive without DPDK.

I have tried my best not to clutter netdev-dpdk and dpif-netdev. I hope that
by removing the dependency on the DPDK keepalive library it will be even better.

I will work on this and wait for inputs from other reviewers before posting
the next version.

- Bhanuprakash.

>
>Best regards, Ilya Maximets.
>
>On 04.08.2017 18:24, Bodireddy, Bhanuprakash wrote:
>> HI Ilya,
>>
>> Thanks for looking in to this and providing your feedback.
>>
>> When this feature was first posted as RFC
>(https://mail.openvswitch.org/pipermail/ovs-dev/2016-July/318243.html),
>the implementation in OvS was done based on DPDK Keepalive library and
>keeping collectd in sync.  As you can see from RFC it was pretty compact code
>and integrated well with ceilometer and provided end to end functionality.
>Much of the RFC code was  to handle SHM.
>>
>> However the reviewers pointed below flaws.
>>
>> - Very DPDK specific.
>> - Shared memory for inter process communication(Between OvS and
>collectd threads).
>> - Tracks PMD cores and not threads.
>> - Limited support to detect false negatives & false positives.
>> - Limited support to query KA status.
>>
>> As per suggestions, below changes were introduced.
>>
>> - Basic infrastructure to register & track threads instead of cores. (Now
>> only PMDs are tracked, but this can be extended to track non-PMD threads.)
>> - Keep most of the APIs generic so that they can extended in the
>> future. All generic APIs are in Keepalive.[hc]
>> - Remove Shared memory and introduce OvSDB.
>> - Add support to detect false negatives.
>> - appctl options to query status.
>>
>> I agree that we have few issues but they can be reworked.
>>  - Invoking dpdk_is_enabled() from generic code (vswitchd/bridge.c) isn't
>> nice; I had to do it to pass a few unit test cases last time.
>>  - Half a dozen stub APIs. I couldn't avoid them as they are needed to get
>> the kernel datapath to build.
>>
>> The patch series can be categorized into sub patch sets (KA infrastructure /
>> OVSDB changes / Query KA stats / Check false positives). This patch series
>> in its current form uses the rte_keepalive library to handle PMD threads,
>> but importantly has introduced basic infrastructure to deal with other
>> threads in the future.
>>
>> Regards,
>> Bhanuprakash.
>>
>>> -Original Message-
>>> From: Ilya Maximets [mailto:i.maxim...@samsung.com]
>>> Sent: Friday, August 4, 2017 2:40 PM
>>> To: ovs-dev@openvswitch.org; Bodireddy, Bhanuprakash
>>> 
>>> Cc: Darrell Ball ; Ben Pfaff ; Aaron
>>> Conole 
>>> Subject: Re: [ovs-dev] [PATCH v3 00/19] Add OVS DPDK keep-alive
>>> functionality.
>>>
>>> Hi Bhanuprakash,
>>>
>>> Thanks for working on this.
>>> I have a general concern about implementation of this functionality:
>>>
>>> *What is the profit from using rte_keepalive library ?*
>>>
>>> Pros:
>>>
>>>* No need to implement 'rte_keepalive_dispatch_pings()' (40 LOC)
>>>  and struct rte_keepalive (30 LOC, can be significantly decreased
>>>  by removing not needed elements) ---> ~70 LOC.

Re: [ovs-dev] [PATCH v4 1/5] netdev: Add netdev_txq_flush function.

2017-08-09 Thread Bodireddy, Bhanuprakash
Hi Ilya,
>>
>> +/* Flush tx queues.
>> + * This is done periodically to empty the intermediate queue in case
>> +of
>> + * fewer packets (< INTERIM_QUEUE_BURST_THRESHOLD) buffered in the
>queue.
>> + */
>> +static int
>> +netdev_dpdk_txq_flush(struct netdev *netdev, int qid , bool
>> +concurrent_txq) {
>> +struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>> +struct dpdk_tx_queue *txq = &dev->tx_q[qid];
>> +
>> +if (OVS_LIKELY(txq->dpdk_pkt_cnt)) {
>> +if (OVS_UNLIKELY(concurrent_txq)) {
>> +qid = qid % dev->up.n_txq;
>> +rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
>> +}
>> +
>> +netdev_dpdk_eth_tx_burst(dev, qid, txq->dpdk_burst_pkts,
>> + txq->dpdk_pkt_cnt);
>
>The queue used for send and the locked one are different because you're
>remapping the qid before taking the spinlock.

>I suspect that we're always using right queue numbers in current
>implementation of dpif-netdev, but I need to recheck to be sure.

I believe the case you are referring to here is the XPS case ('dynamic_txqs'
true). When we have to flush the packets, we retrieve the qid from
'cached_tx_port->last_used_qid', which was initialized earlier by
'dpif_netdev_xps_get_tx_qid()'. The logic of remapping the qid and acquiring
the spinlock in the above function is no different from the current logic in
master. Can you elaborate the specific case where this would break the
functionality?

Please note that in 'dpif_netdev_xps_get_tx_qid()' the qid can change, and so
we do flush the queue.

- Bhanuprakash. 

>Anyway, logic of this function completely broken.
>
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v4 2/5] netdev-dpdk: Add netdev_dpdk_vhost_txq_flush function.

2017-08-09 Thread Bodireddy, Bhanuprakash
>>
>> +static int
>> +netdev_dpdk_vhost_tx_burst(struct netdev_dpdk *dev, int qid) {
>> +struct dpdk_tx_queue *txq = &dev->tx_q[qid];
>> +struct rte_mbuf **cur_pkts = (struct rte_mbuf
>> +**)txq->vhost_burst_pkts;
>> +
>> +int tx_vid = netdev_dpdk_get_vid(dev);
>> +int tx_qid = qid * VIRTIO_QNUM + VIRTIO_RXQ;
>> +uint32_t sent = 0;
>> +uint32_t retries = 0;
>> +uint32_t sum, total_pkts;
>> +
>> +total_pkts = sum = txq->vhost_pkt_cnt;
>> +do {
>> +uint32_t ret;
>> +ret = rte_vhost_enqueue_burst(tx_vid, tx_qid, &cur_pkts[sent],
>sum);
>> +if (OVS_UNLIKELY(!ret)) {
>> +/* No packets enqueued - do not retry. */
>> +break;
>> +} else {
>> +/* Packet have been sent. */
>> +sent += ret;
>> +
>> +/* 'sum' packet have to be retransmitted. */
>> +sum -= ret;
>> +}
>> +} while (sum && (retries++ < VHOST_ENQ_RETRY_NUM));
>> +
>> +for (int i = 0; i < total_pkts; i++) {
>> +dp_packet_delete(txq->vhost_burst_pkts[i]);
>> +}
>> +
>> +/* Reset pkt count. */
>> +txq->vhost_pkt_cnt = 0;
>> +
>> +/* 'sum' refers to packets dropped. */
>> +return sum;
>> +}
>> +
>> +/* Flush the txq if there are any packets available. */ static int
>> +netdev_dpdk_vhost_txq_flush(struct netdev *netdev, int qid,
>> +bool concurrent_txq OVS_UNUSED) {
>> +struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>> +struct dpdk_tx_queue *txq;
>> +
>> +qid = dev->tx_q[qid % netdev->n_txq].map;
>> +
>> +/* The qid may be disabled in the guest and has been set to
>> + * OVS_VHOST_QUEUE_DISABLED.
>> + */
>> +if (OVS_UNLIKELY(qid < 0)) {
>> +return 0;
>> +}
>> +
>> +txq = &dev->tx_q[qid];
>> +/* Increment the drop count and free the memory. */
>> +if (OVS_UNLIKELY(!is_vhost_running(dev) ||
>> + !(dev->flags & NETDEV_UP))) {
>> +
>> +if (txq->vhost_pkt_cnt) {
>> +rte_spinlock_lock(&dev->stats_lock);
>> +dev->stats.tx_dropped+= txq->vhost_pkt_cnt;
>> +rte_spinlock_unlock(&dev->stats_lock);
>> +
>> +for (int i = 0; i < txq->vhost_pkt_cnt; i++) {
>> +dp_packet_delete(txq->vhost_burst_pkts[i]);
>
>Spinlock (tx_lock) must be held here to avoid queue and mempool breakage.

I think you are right. tx_lock might be acquired for freeing the packets.

---
rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
for (int i = 0; i < txq->vhost_pkt_cnt; i++) {
 dp_packet_delete(txq->vhost_burst_pkts[i]);
}
rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);

- Bhanuprakash
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v4 2/5] netdev-dpdk: Add netdev_dpdk_vhost_txq_flush function.

2017-08-09 Thread Bodireddy, Bhanuprakash
>enable)
  if (enable) {
  dev->tx_q[qid].map = qid;
>>
>> Here flushing required too because we're possibly enabling previously
>remapped queue.
>>
  } else {
 +/* If the queue is disabled in the guest, the 
 corresponding qid
 + * map shall be set to OVS_VHOST_QUEUE_DISABLED(-2).
 + *
 + * The packets that were queued in 'qid' could be 
 potentially
 + * stuck and needs to be dropped.
 + *
 + * XXX: The queues may be already disabled in the guest so
 + * flush function in this case only helps in updating 
 stats
 + * and freeing memory.
 + */
 +netdev_dpdk_vhost_txq_flush(&dev->up, qid, 0);
  dev->tx_q[qid].map = OVS_VHOST_QUEUE_DISABLED;
  }
  netdev_dpdk_remap_txqs(dev);
>>
>> 'netdev_dpdk_remap_txqs()', actually, is able to change mapping for
>> all the disabled in guest queues. So, we need to flush all of them
>> while remapping somewhere inside the function.
>> One other thing is that there is a race window between flush and
>> mapping update where another process able to enqueue more packets in
>> just flushed queue. The order of operations should be changed, or both
>> of them should be done under the same tx_lock. I think, it's required
>> to make tx_q[].map field atomic to fix the race condition, because
>> send function takes the 'map' and then locks the corresponding queue.
>> It wasn't an issue before, because packets in case of race was just
>> dropped on attempt to send to disabled queue, but with this patch
>> applied they will be enqueued to the intermediate queue and stuck there.
>
>Making 'map' atomic will not help. To solve the race we should make 'reading
>of map + enqueue' an atomic operation by some spinlock.
>Like this:
>
>vhost_send:
>
>qid = qid % netdev->n_txq;
>rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
>
>mapped_qid = dev->tx_q[qid].map;
>
>if (qid != mapped_qid) {
>rte_spinlock_lock(&dev->tx_q[mapped_qid].tx_lock);
>}
>
>tx_enqueue(mapped_qid, pkts, cnt);
>
>if (qid != mapped_qid) {
>rte_spinlock_unlock(&dev->tx_q[mapped_qid].tx_lock);
>}
>
>rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);
>
>
>txq remapping inside 'netdev_dpdk_remap_txqs()' or
>'vring_state_changed()':
>
>qid - queue we need to remap.
>new_qid - queue we need to remap to.
>
>rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
>
>mapped_qid = dev->tx_q[qid].map;
>if (qid != mapped_qid) {
>rte_spinlock_lock(&dev->tx_q[mapped_qid].tx_lock);
>}
>
>tx_flush(mapped_qid)
>
>if (qid != mapped_qid) {
>rte_spinlock_unlock(&dev->tx_q[mapped_qid].tx_lock);
>}
>
>dev->tx_q[qid].map = new_qid;
>
>rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);
>
>
>Above schema should work without races, but looks kind of ugly and requires
>taking of additional spinlock on each send.
>
>P.S. Sorry for talking with myself. Just want to share my thoughts.

Hi Ilya,

Thanks for reviewing the patches and providing inputs.
I went through your comments for this patch (2/5) and agree with the
suggestions. Meanwhile, I will go through the changes above and get back to you.

Bhanuprakash. 


___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v4 2/5] netdev-dpdk: Add netdev_dpdk_vhost_txq_flush function.

2017-08-10 Thread Bodireddy, Bhanuprakash
>>
  } else {
 +/* If the queue is disabled in the guest, the 
 corresponding qid
 + * map shall be set to OVS_VHOST_QUEUE_DISABLED(-2).
 + *
 + * The packets that were queued in 'qid' could be 
 potentially
 + * stuck and needs to be dropped.
 + *
 + * XXX: The queues may be already disabled in the guest so
 + * flush function in this case only helps in updating 
 stats
 + * and freeing memory.
 + */
 +netdev_dpdk_vhost_txq_flush(&dev->up, qid, 0);
  dev->tx_q[qid].map = OVS_VHOST_QUEUE_DISABLED;
  }
  netdev_dpdk_remap_txqs(dev);
>>
>> 'netdev_dpdk_remap_txqs()', actually, is able to change mapping for
>> all the disabled in guest queues. So, we need to flush all of them
>> while remapping somewhere inside the function.
>> One other thing is that there is a race window between flush and
>> mapping update where another process able to enqueue more packets in
>> just flushed queue. The order of operations should be changed, or both
>> of them should be done under the same tx_lock. I think, it's required
>> to make tx_q[].map field atomic to fix the race condition, because
>> send function takes the 'map' and then locks the corresponding queue.
>> It wasn't an issue before, because packets in case of race was just
>> dropped on attempt to send to disabled queue, but with this patch
>> applied they will be enqueued to the intermediate queue and stuck there.
>
>Making 'map' atomic will not help. To solve the race we should make 'reading
>of map + enqueue' an atomic operation by some spinlock.
>Like this:
>
>vhost_send:
>
>qid = qid % netdev->n_txq;
>rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
>
>mapped_qid = dev->tx_q[qid].map;
>
>if (qid != mapped_qid) {
>rte_spinlock_lock(&dev->tx_q[mapped_qid].tx_lock);
>}
>
>tx_enqueue(mapped_qid, pkts, cnt);
>
>if (qid != mapped_qid) {
>rte_spinlock_unlock(&dev->tx_q[mapped_qid].tx_lock);
>}
>
>rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);
>
>
>txq remapping inside 'netdev_dpdk_remap_txqs()' or
>'vring_state_changed()':
>
>qid - queue we need to remap.
>new_qid - queue we need to remap to.
>
>rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
>
>mapped_qid = dev->tx_q[qid].map;
>if (qid != mapped_qid) {
>rte_spinlock_lock(&dev->tx_q[mapped_qid].tx_lock);
>}
>
>tx_flush(mapped_qid)
>
>if (qid != mapped_qid) {
>rte_spinlock_unlock(&dev->tx_q[mapped_qid].tx_lock);
>}
>
>dev->tx_q[qid].map = new_qid;
>
>rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);
>
>
>Above schema should work without races, but looks kind of ugly and requires
>taking of additional spinlock on each send.
>
>P.S. Sorry for talking with myself. Just want to share my thoughts.

Hi Ilya,

Can you please review the below changes based on what you suggested above?
As the problem only happens when the queues are enabled/disabled in the guest,
I did some preliminary testing with the below changes by sending traffic into
the VM while enabling and disabling the queues inside the guest at the same
time.

Vhost_send()
-
qid = qid % netdev->n_txq;

/* Acquire tx_lock before reading tx_q[qid].map and enqueueing packets.
 * tx_q[].map gets updated in vring_state_changed() when vrings are
 * enabled/disabled in the guest. */
rte_spinlock_lock(&dev->tx_q[qid].tx_lock);

mapped_qid = dev->tx_q[qid].map;
if (OVS_UNLIKELY(qid != mapped_qid)) {
rte_spinlock_lock(&dev->tx_q[mapped_qid].tx_lock);
}

if (OVS_UNLIKELY(!is_vhost_running(dev) || mapped_qid < 0
 || !(dev->flags & NETDEV_UP))) {
rte_spinlock_lock(&dev->stats_lock);
dev->stats.tx_dropped+= cnt;
rte_spinlock_unlock(&dev->stats_lock);

for (i = 0; i < total_pkts; i++) {
dp_packet_delete(pkts[i]);
}

if (OVS_UNLIKELY(qid != mapped_qid)) {
rte_spinlock_unlock(&dev->tx_q[mapped_qid].tx_lock);
}
rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);

return;
}

cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, cnt);
/* Check if QoS has been configured for the netdev. */
cnt = netdev_dpdk_qos_run(dev, cur_pkts, cnt);
dropped = total_pkts - cnt;

int idx = 0;
struct dpdk_tx_queue *txq = &dev->tx_q[mapped_qid];
while (idx < cnt) {
txq->vhost_burst_pkts[txq->

Re: [ovs-dev] [PATCH v4 2/5] netdev-dpdk: Add netdev_dpdk_vhost_txq_flush function.

2017-08-11 Thread Bodireddy, Bhanuprakash
>On 09.08.2017 15:35, Bodireddy, Bhanuprakash wrote:
>>>>
>>>> +static int
>>>> +netdev_dpdk_vhost_tx_burst(struct netdev_dpdk *dev, int qid) {
>>>> +struct dpdk_tx_queue *txq = &dev->tx_q[qid];
>>>> +struct rte_mbuf **cur_pkts = (struct rte_mbuf
>>>> +**)txq->vhost_burst_pkts;
>>>> +
>>>> +int tx_vid = netdev_dpdk_get_vid(dev);
>>>> +int tx_qid = qid * VIRTIO_QNUM + VIRTIO_RXQ;
>>>> +uint32_t sent = 0;
>>>> +uint32_t retries = 0;
>>>> +uint32_t sum, total_pkts;
>>>> +
>>>> +total_pkts = sum = txq->vhost_pkt_cnt;
>>>> +do {
>>>> +uint32_t ret;
>>>> +ret = rte_vhost_enqueue_burst(tx_vid, tx_qid,
>>>> + &cur_pkts[sent],
>>> sum);
>>>> +if (OVS_UNLIKELY(!ret)) {
>>>> +/* No packets enqueued - do not retry. */
>>>> +break;
>>>> +} else {
>>>> +/* Packet have been sent. */
>>>> +sent += ret;
>>>> +
>>>> +/* 'sum' packet have to be retransmitted. */
>>>> +sum -= ret;
>>>> +}
>>>> +} while (sum && (retries++ < VHOST_ENQ_RETRY_NUM));
>>>> +
>>>> +for (int i = 0; i < total_pkts; i++) {
>>>> +dp_packet_delete(txq->vhost_burst_pkts[i]);
>>>> +}
>>>> +
>>>> +/* Reset pkt count. */
>>>> +txq->vhost_pkt_cnt = 0;
>>>> +
>>>> +/* 'sum' refers to packets dropped. */
>>>> +return sum;
>>>> +}
>>>> +
>>>> +/* Flush the txq if there are any packets available. */ static int
>>>> +netdev_dpdk_vhost_txq_flush(struct netdev *netdev, int qid,
>>>> +bool concurrent_txq OVS_UNUSED) {
>>>> +struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>> +struct dpdk_tx_queue *txq;
>>>> +
>>>> +qid = dev->tx_q[qid % netdev->n_txq].map;
>>>> +
>>>> +/* The qid may be disabled in the guest and has been set to
>>>> + * OVS_VHOST_QUEUE_DISABLED.
>>>> + */
>>>> +if (OVS_UNLIKELY(qid < 0)) {
>>>> +return 0;
>>>> +}
>>>> +
>>>> +txq = &dev->tx_q[qid];
>>>> +/* Increment the drop count and free the memory. */
>>>> +if (OVS_UNLIKELY(!is_vhost_running(dev) ||
>>>> + !(dev->flags & NETDEV_UP))) {
>>>> +
>>>> +if (txq->vhost_pkt_cnt) {
>>>> +rte_spinlock_lock(&dev->stats_lock);
>>>> +dev->stats.tx_dropped+= txq->vhost_pkt_cnt;
>>>> +rte_spinlock_unlock(&dev->stats_lock);
>>>> +
>>>> +for (int i = 0; i < txq->vhost_pkt_cnt; i++) {
>>>> +dp_packet_delete(txq->vhost_burst_pkts[i]);
>>>
>>> Spinlock (tx_lock) must be held here to avoid queue and mempool
>breakage.
>>
>> I think you are right. tx_lock might be acquired for freeing the packets.
>
>I think that 'vhost_pkt_cnt' reads and updates also should be protected to
>avoid races.

From the discussion in the thread
https://mail.openvswitch.org/pipermail/ovs-dev/2017-August/337133.html,
we are going to acquire tx_lock for updating the map and flushing the queue
inside vring_state_changed().

That triggers a deadlock in the flushing function, as we have already acquired
the same lock in netdev_dpdk_vhost_txq_flush().
The same problem applies to freeing the memory and protecting the updates to
vhost_pkt_cnt.

    if (OVS_LIKELY(txq->vhost_pkt_cnt)) {
        rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
        netdev_dpdk_vhost_tx_burst(dev, qid);
        rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);
    }

As the problem is triggered when the guest queues are enabled/disabled, with a
small race window where packets can get enqueued into the queue just after the
flush and before the map value is updated in the callback function
(vring_state_changed()), how about this?

Technically, as the queues are disabled there is no point in flushing the
packets, so let's free the packets and reset txq->vhost_pkt_cnt in
vring_state_changed() itself instead of calling flush().
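As a rough sketch (with stand-in types, not the real OVS/DPDK structures — `dpdk_tx_queue` here is a minimal mock and `free()` stands in for `dp_packet_delete()`), the disabled-queue path could drop the buffered packets and reset the count under tx_lock like this:

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Minimal stand-ins for the real OVS/DPDK types, for illustration only. */
struct dp_packet { int id; };

struct dpdk_tx_queue {
    pthread_mutex_t tx_lock;            /* rte_spinlock_t stand-in. */
    int vhost_pkt_cnt;                  /* Number of buffered packets. */
    struct dp_packet *vhost_burst_pkts[32];
};

/* Sketch of the proposed vring_state_changed() handling: when a guest
 * queue is disabled, drop the buffered packets and reset the count under
 * tx_lock, instead of calling the flush function (which would try to take
 * the same lock again).  Returns the number of packets dropped. */
static int
txq_drop_buffered(struct dpdk_tx_queue *txq)
{
    int dropped;

    pthread_mutex_lock(&txq->tx_lock);
    dropped = txq->vhost_pkt_cnt;
    for (int i = 0; i < txq->vhost_pkt_cnt; i++) {
        free(txq->vhost_burst_pkts[i]); /* dp_packet_delete() stand-in. */
        txq->vhost_burst_pkts[i] = NULL;
    }
    txq->vhost_pkt_cnt = 0;             /* Update protected by tx_lock. */
    pthread_mutex_unlock(&txq->tx_lock);

    return dropped;
}
```

The caller would bump dev->stats.tx_dropped by the returned count under stats_lock.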

vring_state_changed().
--
rte_spinlock_l

Re: [ovs-dev] [PATCH v4 5/5] dpif-netdev: Flush the packets in intermediate queue.

2017-08-11 Thread Bodireddy, Bhanuprakash
Hello All,

Adding all the people here who had either reviewed or provided their feedback
on the batching patches at some stage.

You are already aware that there are two different series on the mailing list
to implement tx batching (netdev layer vs. dpif layer), both of which improve
DPDK datapath performance. Our output batching is in the netdev layer, whereas
Ilya moved it to the dpif layer and simplified it. Each approach has its own
pros and cons, which have been discussed in earlier threads.

While reviewing v4 of my patch series, Ilya detected a race condition that
happens when the queues in the guest are enabled/disabled at run time. Though
we have a solution to address this issue and implemented it, I realized that
the code complexity has increased, with changes spanning multiple functions
and additional spin locks to address this one corner case.

I think that, though our patch series has flexibility, it has gotten a lot
more complex now and would be difficult to maintain in the future. At this
stage I would like to lean towards the simpler, more maintainable solution,
which is the one implemented by Ilya.

I would like to thank Eelco, Darrell, Jan and Ilya for reviewing our series and 
providing their
feedback.

Bhanuprakash. 

>-Original Message-
>From: Darrell Ball [mailto:db...@vmware.com]
>Sent: Friday, August 11, 2017 2:03 AM
>To: Bodireddy, Bhanuprakash ;
>d...@openvswitch.org
>Subject: Re: [ovs-dev] [PATCH v4 5/5] dpif-netdev: Flush the packets in
>intermediate queue.
>
>Hi Bhanu
>
>Given that you ultimately intend changes beyond those in this patch, would it
>make sense just to fold the follow up series (at least, the key elements) into
>this series, essentially expanding on this patch 5 ?
>
>Thanks Darrell
>
>-Original Message-
>From:  on behalf of Bhanuprakash
>Bodireddy 
>Date: Tuesday, August 8, 2017 at 10:06 AM
>To: "d...@openvswitch.org" 
>Subject: [ovs-dev] [PATCH v4 5/5] dpif-netdev: Flush the packets in
>   intermediate queue.
>
>Under low rate traffic conditions, there can be 2 issues.
>  (1) Packets potentially can get stuck in the intermediate queue.
>  (2) Latency of the packets can increase significantly due to
>   buffering in intermediate queue.
>
>This commit handles the (1) issue by flushing the tx port queues using
>dp_netdev_flush_txq_ports() as part of PMD packet processing loop.
>Also this commit addresses issue (2) by flushing the tx queues after
>every rxq port processing. This reduces the latency with out impacting
>the forwarding throughput.
>
>   MASTER
>  
>   Pkt size  min(ns)   avg(ns)   max(ns)
>512  4,631  5,022309,914
>   1024  5,545  5,749104,294
>   1280  5,978  6,159 45,306
>   1518  6,419  6,774946,850
>
>  MASTER + COMMIT
>  -
>   Pkt size  min(ns)   avg(ns)   max(ns)
>512  4,711  5,064182,477
>   1024  5,601  5,888701,654
>   1280  6,018  6,491533,037
>   1518  6,467  6,734312,471
>
>PMDs can be teared down and spawned at runtime and so the rxq and txq
>mapping of the PMD threads can change. In few cases packets can get
>stuck in the queue due to reconfiguration and this commit helps flush
>the queues.
>
>Suggested-by: Eelco Chaudron 
>Reported-at: https://urldefense.proofpoint.com/v2/url?u=https-
>3A__mail.openvswitch.org_pipermail_ovs-2Ddev_2017-
>2DApril_331039.html&d=DwICAg&c=uilaK90D4TOVoH58JNXRgQ&r=BVhFA09C
>GX7JQ5Ih-
>uZnsw&m=qwwXxtIBvUf5cgPbYkcAKwCukS_ZiaeFE6lAdHHaw28&s=H0yNRh-
>c9pdYHacCzkoruc48Dj_Whkcwcjzv-vta-EI&e=
>Signed-off-by: Bhanuprakash Bodireddy
>
>Signed-off-by: Antonio Fischetti 
>Co-authored-by: Antonio Fischetti 
>Signed-off-by: Markus Magnusson 
>Co-authored-by: Markus Magnusson 
>Acked-by: Eelco Chaudron 
>---
> lib/dpif-netdev.c | 52
>+++-
> 1 file changed, 51 insertions(+), 1 deletion(-)
>
>diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>index e2cd931..bfb9650 100644
>--- a/lib/dpif-netdev.c
>+++ b/lib/dpif-netdev.c
>@@ -340,6 +340,7 @@ enum pmd_cycles_counter_type {
> };
>
> #define XPS_TIMEOUT_MS 500LL
>+#define LAST_USED_QID_NONE -1
>
> /* Contained by struct dp_netdev_port's 'rxqs' member.  */
> struct dp_netdev_rxq {
>@@ -500,7 +501,13 @@ struct rxq_poll {
> struct tx_port {
> struct dp_netdev_port *port;
> int qid;
>-long long last_used;
>+int last_used_qi

Re: [ovs-dev] [PATCH 2/2] dpif-netdev: Per-port conditional EMC insert.

2017-09-01 Thread Bodireddy, Bhanuprakash
Hi Ilya,

>> Tuning the per EMC insertion probability per port based on detailed
>knowledge about the nature of traffic patterns seems a micro-optimization to
>me, which might be helpful in very controlled setups e.g. in synthetic
>benchmarks, but very hard to apply in more general use cases, such as
>vSwitch in OpenStack, where the entity (Nova compute) configuring the
>vhostuser VM ports has no knowledge at all about traffic characteristics.
>>
>> The nice property of the probabilistic EMC insertion is that flows with more
>traffic have a higher chance of ending up in the EMC than flows with lower
>traffic. In your case the few big encapsulated flows from the VM should have
>a higher chance to make it into the EMC than the many smaller individual
>flows into the VM and thus automatically get the bulk of EMC hits.
>>
>> Do you have empirical data that shows that this effect is not sufficient and
>performance can be significantly improved by per-port probabilities?
>>
>> In any case I would request to keep the global configuration option and only
>add the per-port option to override the global probability if wanted.
>>
>
>+1 for backwards compatibility by keeping the global config.

Thanks for this patch. 
I proposed a similar approach as an incremental addition when the EMC
probabilistic insertion patch was upstreamed. My concern then was that, as
it's a global config, all the PMD threads and ports would be affected. This
was also discussed in one of the community calls at the time.

The general feedback was that, though it sounds helpful in lab scenarios
where the user has prior knowledge of the traffic, the number of VMs, and the
phy and vhostuser ports, that may not be the case in OpenStack deployments.
The OpenStack folks mentioned that this kind of optimization can't be easily
used in their deployments.

Regards,
Bhanuprakash. 
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v4 0/7] Add OVS DPDK keep-alive functionality.

2017-09-06 Thread Bodireddy, Bhanuprakash
Hi Aaron,

>Quick comment before I do an in-depth review.
>
>One thing that is missing in this series is some form of documentation added
>to explain why this feature should exist (for instance, why can't the standard
>posix process accounting information suffice?) and what the high-level
>concepts are (you have the states being used, but I don't see a definition that
>will be needed to understand when reading a keep-alive report).

I am planning to write a cookbook with instructions on setting up keepalive in
OvS, installing & configuring collectd, and setting up the ceilometer service
to read the events. The definition of the KA states and how to interpret them
will be explained in that document. Also, a minimal step guide will be added
to the OvS-DPDK advanced guide, with links to the cookbook.

On your other question of why POSIX process accounting isn't enough, please
see below for the test case and details.

>
>I think there could be a reason to provide this, but I think it's important to
>explain why collectd will need to use the ovsdb interface, rather than calling
>ex: times[1] or parsing /proc//stat for the runtime (and watching
>accumulation).

1) On collectd reading ovsdb rather than directly monitoring the threads.

  Collectd is certainly one popular daemon to collect and monitor system-wide
statistics. However, if we move OvS thread monitoring functionality into
collectd, we are *forcing* users to use collectd to monitor OvS health. This
may not be a problem for someone using collectd + OpenStack.

  Think of a customer using OvS but having their own proprietary monitoring
application with OpenStack, or worse, their own orchestrator; in this case
it's easy for them to monitor OvS by querying OvSDB with minimal code changes
to their app.

  Also, it might be easy for any monitoring application to query/subscribe to
OvSDB to learn the OvS configuration and health.

2) On /proc/[pid]/stat:

- I do read the 'stat' file in the 01/7 patch to get the thread name and the
'core id' the thread was last scheduled on.
- The other time-related fields in the stat file can't be completely relied
upon, due to the test case below.

This test scenario was to simulate & identify PMD stalls when a
higher-priority thread (kernel/other IO thread) gets scheduled on the same
core.

Test scenario:
- OvS with single or multiple PMD threads.
- Start a worker thread spinning continuously on the core (stress -c 1 &).
- Change the worker thread attributes to RT (chrt -r -p 99  ).
- Pin the worker thread on the same core as one of the PMDs (taskset -p  ).

Now the PMD stalls, as the other worker thread has a higher priority and is
favored & scheduled by the Linux scheduler, preempting the PMD thread.
However, /proc/[pid]/stat shows that the thread is still in the
 *Running (R)* state ->  field 3   (see the output below)
 utime, stime were incrementing ->  fields 14, 15  (-do-)

All the other time-related fields were '0' as they don't apply here.
For field information: http://man7.org/linux/man-pages/man5/proc.5.html

---sample output---
$ cat /proc/12506/stat
12506 (pmd61) R 1 12436 12436 0 -1 4210752 101244 0 0 0 389393 3099 0 0 20 0 35 
0 226680879 4798472192 4363 18446744073709551615
 4194304 9786556 140737290674320 140344414947088 4467454 0 0 4096 24579 0 0 0 
-1 3 0 0 0 0 0 11883752 12196256 48676864 140737290679638
 140737290679790 140737290679790 140737290682316 0

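For reference, the relevant fields can be pulled out of such a stat line with a small helper (hypothetical, not part of OVS); the comm field (2) may itself contain spaces or parentheses, so parsing should start after the last ')':

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: extract the state character and utime/stime
 * (fields 3, 14 and 15 of /proc/[pid]/stat).  Returns 0 on success. */
static int
parse_proc_stat(const char *line, char *state,
                unsigned long *utime, unsigned long *stime)
{
    /* The comm field may contain spaces, so skip past the last ')'. */
    const char *p = strrchr(line, ')');
    if (!p) {
        return -1;
    }
    /* After ')': state ppid pgrp session tty_nr tpgid flags minflt
     * cminflt majflt cmajflt utime stime ... */
    return sscanf(p + 1,
                  " %c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
                  state, utime, stime) == 3 ? 0 : -1;
}
```

On the sample line above this yields state 'R' with utime/stime still counting up, which is exactly why those fields can't distinguish a preempted PMD from a healthy one.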

But with the KA framework, the PMD stall will be detected immediately and
reported. IMHO, we can use the /proc interface or the other mechanisms you
suggested, but those should be used as part of additional health checks. I do
check /proc/[pid]/stat to read the thread states as part of the larger health
check mechanism in v3.

Hope I answered all your questions here. Let me know your comments while you
review this series in-depth.

- Bhanuprakash.


Re: [ovs-dev] [PATCH v4 2/7] Keepalive: Add initial keepalive support.

2017-09-07 Thread Bodireddy, Bhanuprakash
Hi Aaron,

My reply inline.

>Hi Bhanu,
>
>Bhanuprakash Bodireddy  writes:
>
>> This commit introduces the initial keepalive support by adding
>> 'keepalive' module and also helper and initialization functions that
>> will be invoked by later commits.
>>
>> This commit adds new ovsdb column "keepalive" that shows the status of
>> the datapath threads. This is implemented for DPDK datapath and only
>> status of PMD threads is reported.
>
>I don't see the value in having this enabled / disabled flag?  Why not just
>always have it on?

[BHANU]  

I was following the conventions here.
OvS statistics are done in a similar way, where the stats can be enabled with
'other_config:enable-statistics=true', the default being false.
Maybe this is done because an additional thread (system_stats) is spawned to
handle the functionality, and users should have an option to turn it on/off.

>
>Additionally, even setting these true in this commit won't do anything.
>No tracking starts until 3/7, afaict.
>
>I guess it's okay to document here, but it might be worth stating that.

[BHANU]  Ok. 

>
>> For eg:
>>   To enable keepalive feature.
>>   'ovs-vsctl --no-wait set Open_vSwitch . other_config:enable-
>keepalive=true'
>
>I'm not sure that a separate enable / disable flag is needed.
>
>>   To set timer interval of 5000ms for monitoring packet processing cores.
>>   'ovs-vsctl --no-wait set Open_vSwitch . \
>>  other_config:keepalive-interval="5000"
>>
>> Signed-off-by: Bhanuprakash Bodireddy
>> 
>> ---
>
>As stated earlier, please add a Documentation/ update with this.

[BHANU]  I would add the documentation in the respin.  

>
>>  lib/automake.mk|   2 +
>>  lib/keepalive.c| 183
>+
>>  lib/keepalive.h|  87 +
>>  vswitchd/bridge.c  |   3 +
>>  vswitchd/vswitch.ovsschema |   8 +-
>>  vswitchd/vswitch.xml   |  49 
>>  6 files changed, 330 insertions(+), 2 deletions(-)  create mode
>> 100644 lib/keepalive.c  create mode 100644 lib/keepalive.h
>>
>> diff --git a/lib/automake.mk b/lib/automake.mk index 2415f4c..0d99f0a
>> 100644
>> --- a/lib/automake.mk
>> +++ b/lib/automake.mk
>> @@ -110,6 +110,8 @@ lib_libopenvswitch_la_SOURCES = \
>>  lib/json.c \
>>  lib/jsonrpc.c \
>>  lib/jsonrpc.h \
>> +lib/keepalive.c \
>> +lib/keepalive.h \
>>  lib/lacp.c \
>>  lib/lacp.h \
>>  lib/latch.h \
>> diff --git a/lib/keepalive.c b/lib/keepalive.c new file mode 100644
>> index 000..ac73a42
>> --- /dev/null
>> +++ b/lib/keepalive.c
>> @@ -0,0 +1,183 @@
>> +/*
>> + * Copyright (c) 2014, 2015, 2016, 2017 Nicira, Inc.
>
>This line is not appropriately attributing the file.

[BHANU]  Should be "Copyright (c) 2017 Intel, Inc."

>
>> + *
>> + * Licensed under the Apache License, Version 2.0 (the "License");
>> + * you may not use this file except in compliance with the License.
>> + * You may obtain a copy of the License at:
>> + *
>> + * http://www.apache.org/licenses/LICENSE-2.0
>> + *
>> + * Unless required by applicable law or agreed to in writing,
>> + software
>> + * distributed under the License is distributed on an "AS IS" BASIS,
>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
>implied.
>> + * See the License for the specific language governing permissions
>> + and
>> + * limitations under the License.
>> + */
>> +
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +
>> +#include "keepalive.h"
>> +#include "lib/vswitch-idl.h"
>> +#include "openvswitch/vlog.h"
>> +#include "timeval.h"
>> +
>> +VLOG_DEFINE_THIS_MODULE(keepalive);
>> +
>> +static bool keepalive_enable = false;/* Keepalive disabled by default */
>> +static bool ka_init_status = ka_init_failure; /* Keepalive
>> +initialization */
>
>You're assigning this bool a value from an enum.  I know that's probably
>allowed, but it looks strange to me.  I would prefer that this type either 
>reflect
>the enum type or a true/false value is used instead.

[BHANU]   OK.

>
>> +static uint32_t keepalive_timer_interval; /* keepalive timer interval */
>> +static struct keepalive_info *ka_info = NULL;
>
>Why allocate ka_info?  It will simplify some of the later code to just keep it
>statically available.  It also means you can eliminate the
>xzalloc() and free() calls you use later on in code.

[BHANU]   Ok, saves me a few lines of code. 

>
>Also, the nice thing about a static declaration is the structure will already 
>be 0
>filled, and you'll know at program initialization time whether it will succeed 
>in
>getting the allocation.
>
>> +
>> +inline bool
>
>The inline keyword is inappropriate in .c files.  Please let the compiler do 
>it's
>job.

[BHANU]   Ok

>
>> +ka_is_enabled(void)
>> +{
>> +return keepalive_enable;
>> +}
>> +
>
>I'm not sure about enable / disable.  In this case, I think the branches are 
>not
>needed at all.  Better to have this always enabled and just deal wit

Re: [ovs-dev] [PATCH v4 3/7] dpif-netdev: Register packet processing cores to KA framework.

2017-09-07 Thread Bodireddy, Bhanuprakash
>Bhanuprakash Bodireddy  writes:
>
>> This commit registers the packet processing PMD cores to keepalive
>> framework. Only PMDs that have rxqs mapped will be registered and
>> actively monitored by KA framework.
>>
>> This commit spawns a keepalive thread that will dispatch heartbeats to
>> PMD cores. The pmd threads respond to heartbeats by marking themselves
>> alive. As long as PMD responds to heartbeats it is considered 'healthy'.
>>
>> Signed-off-by: Bhanuprakash Bodireddy
>> 
>> ---
>
>I'm really confused with this patch.  I've stopped reviewing the series.
>
>It seems like there's a mix of 'track by core id' and 'track by thread id'.
>
>I don't think it's possible to do anything by core id.  We can never know what
>else has been scheduled on those cores, and we cannot be sure that a taskset
>or other scheduler provisioning call will move the threads.

[BHANU] I have already answered this in the other thread.
One can't correlate threads with cores, and we shouldn't be tracking by cores.
However, with PMD threads there is a 1:1 mapping between a PMD and its
core-id, and it's safe to temporarily write PMD liveness info into an array
indexed by core id before updating it into the HMAP.

However, as already mentioned, we are using the tid for all other purposes, as
it is unique across the system.

>
>>  lib/dpif-netdev.c |  70 +
>>  lib/keepalive.c   | 153
>++
>>  lib/keepalive.h   |  17 ++
>>  lib/util.c|  23 
>>  lib/util.h|   2 +
>>  5 files changed, 254 insertions(+), 11 deletions(-)
>>
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
>> e2cd931..84c7ffd 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -49,6 +49,7 @@
>>  #include "flow.h"
>>  #include "hmapx.h"
>>  #include "id-pool.h"
>> +#include "keepalive.h"
>>  #include "latch.h"
>>  #include "netdev.h"
>>  #include "netdev-vport.h"
>> @@ -978,6 +979,63 @@ sorted_poll_thread_list(struct dp_netdev *dp,
>>  *n = k;
>>  }
>>
>> +static void *
>> +ovs_keepalive(void *f_ OVS_UNUSED)
>> +{
>> +pthread_detach(pthread_self());
>> +
>> +for (;;) {
>> +xusleep(get_ka_interval() * 1000);
>> +}
>> +
>> +return NULL;
>> +}
>> +
>> +static void
>> +ka_thread_start(struct dp_netdev *dp) {
>> +static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
>> +
>> +if (ovsthread_once_start(&once)) {
>> +ovs_thread_create("ovs_keepalive", ovs_keepalive, dp);
>> +
>> +ovsthread_once_done(&once);
>> +}
>> +}
>> +
>> +static void
>> +ka_register_datapath_threads(struct dp_netdev *dp) {
>> +int ka_init = get_ka_init_status();
>> +VLOG_DBG("Keepalive: Was initialization successful? [%s]",
>> +ka_init ? "Success" : "Failure");
>> +if (!ka_init) {
>> +return;
>> +}
>> +
>> +ka_thread_start(dp);
>> +
>> +struct dp_netdev_pmd_thread *pmd;
>> +CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
>> +/*  Register only PMD threads. */
>> +if (pmd->core_id != NON_PMD_CORE_ID) {
>> +int tid = ka_get_pmd_tid(pmd->core_id);
>> +
>> +/* Skip PMD thread with no rxqs mapping. */
>
>why skip these pmds?  we should still watch them, and then we can
>correlated interesting events (for instance... when an rxq gets added whats
>the change in utilization, etc).

[BHANU]  We should skip the PMDs that have no rxqs mapped. This would happen
in cases where there are more PMD threads than rxqs.

More importantly, a PMD thread with no mapped rxq will not even enter the
receive loop and will be in the sleep state, as below.

$ ps -eLo tid,psr,comm,state | grep pmd
 51727   3 pmd61   R
 51747   0 pmd62   S
 51749   1 pmd63   S
 51750   2 pmd64   R
 51756   6 pmd65   S
 51758   7 pmd66   R
 51759   4 pmd67   R
 51760   5 pmd68   S

When an rxq gets added to a sleeping PMD thread, the datapath reconfiguration
happens, and this time the threads get registered to the KA framework as
below.

CP:  reconfigure_datapath() -> ka_register_datapath_threads() ->
ka_register_thread().

>
>> +if (OVS_UNLIKELY(!hmap_count(&pmd->poll_list))) {
>> +/* rxq mapping changes due to reconfiguration,
>> + * if there are no rxqs mapped to PMD, unregister it. */
>> +ka_unregister_thread(tid, true);
>> +continue;
>> +}
>> +
>> +ka_register_thread(tid, true);
>> +VLOG_INFO("Registered PMD thread [%d] on Core [%d] to KA
>framework",
>> +  tid, pmd->core_id);
>> +}
>> +}
>> +}
>> +
>>  static void
>>  dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, const char
>*argv[],
>>   void *aux)
>> @@ -3625,6 +3683,9 @@ reconfigure_datapath(struct dp_netdev *dp)
>>
>>  /* Reload affected pmd threads. */
>>  reload_affected_pmds(dp);
>> +

Re: [ovs-dev] [PATCH v4 0/7] Add OVS DPDK keep-alive functionality.

2017-09-07 Thread Bodireddy, Bhanuprakash
>"Bodireddy, Bhanuprakash"  writes:
>
>> Hi Aaron,
>>
>>>Quick comment before I do an in-depth review.
>>>
>>>One thing that is missing in this series is some form of documentation
>>>added to explain why this feature should exist (for instance, why
>>>can't the standard posix process accounting information suffice?) and
>>>what the high-level concepts are (you have the states being used, but
>>>I don't see a definition that will be needed to understand when reading a
>keep-alive report).
>>
>> I am planning to write a cookbook with instructions on setting up
>> Keepalive in OvS, Installing & configuring collectd and setting up ceilometer
>service to read the events.
>> The definition of the KA states and how to interpret them would be
>> explained in the document. Also the minimal step guide would be added
>> in to OvS-DPDK Advanced guide with links to cookbook.
>
>Please put that as you go in the patches.  It will make review easier, too.

[BHANU] OK.

>
>> On your other question on why posix process accounting isn't enough,
>> please see below for testcase and details.
>>
>>>
>>>I think there could be a reason to provide this, but I think it's
>>>important to explain why collectd will need to use the ovsdb
>>>interface, rather than calling
>>>ex: times[1] or parsing /proc//stat for the runtime (and watching
>>>accumulation).
>>
>> 1) On collectd reading ovsdb rather than directly monitoring the threads.
>>
>>   Collectd for sure is one popular daemon to collect and monitor system
>wide statistics.
>>   However, if we move ovs thread monitoring functionality to collectd we
>are *forcing*
>>   the users to use collectd to monitor OvS health. This may not be a
>problem for someone using
>>   collectd + OpenStack.
>
>It's important to note - collectd monitoring threads has nothing to do with 
>this
>feature.  If collectd can monitor threads from arbitrary processes and report, 
>it
>becomes much more powerful, no?  Let's keep it focused on Open vSwitch.
>
>>   Think of customer using OvS but having their proprietary monitoring
>application with OpenStack or
>>   worse their own orchestrator, in this case it's easy for them to 
>> monitor
>OvS by querying OvSDB
>>   with minimal code changes in to their app.
>>
>>   Also it might be easy for any monitoring application to 
>> query/subscribe to
>OvSDB for knowing the
>>   OvS configuration and health.
>
>I don't really like using the idea of proprietary monitors as justification 
>for this.

>
>OTOH, I think there's a good justification when it comes to multi-node Open
>vSwitch tracking.  There, it may not be possible to aggregate the statistics on
>each individual node (due to possible some kind of administration policy) - so 
>I
>agree having something like this exposed through ovsdb could be useful.


[BHANU] In any case querying ovsdb is most suitable.

>
>> 2) On /proc/[pid]/stats:
>>
>> - I do read 'stats' file in 01/7  patch to get the thread name and 'core id' 
>> the
>thread was last scheduled.
>> - The other fields related to time in stats file can't be completely relied 
>> due
>to below test case.
>>
>> This test scenario was to simulate & identify the PMD stalls when a
>> higher priority thread(kernel/other IO thread) gets scheduled on the same
>core.
>>
>> Test scenario:
>> - OvS with single/multiple PMD thread.
>> - Start a worker thread spinning continuously on the core (stress -c 1 &).
>> - Change the worker thread attributes to RT (chrt -r -p 99  ).
>> - Pin the worker thread on the same core as one of the PMDs  (taskset
>> -p  )
>>
>> Now the PMD stalls as the other worker thread has higher priority and is
>favored & scheduled by Linux scheduler preempting PMD thread.
>> However the /proc/pid/stat shows that the thread is still in
>>  *Running (R)* state ->   field 3
>>   (see the output
>below)
>>Utime,stime were incrementing  ->field 14, 15(-do-)
>>
>> All the other time related fields were '0' as they don't apply here.
>> For fields information:
>> http://man7.org/linux/man-pages/man5/proc.5.html
>>
>> ---sample
>> output---
>> $ cat /proc/12506/stat
>> 12506 (pmd61) R 1 1243

Re: [ovs-dev] [PATCH 12/13] conntrack: Fix dead assignment reported by clang.

2017-09-10 Thread Bodireddy, Bhanuprakash
Hi Darrell,

>What version of clang are you using and in what environment ?

The clang version is  3.5.0. This was seen with clang static analysis.

- Bhanuprakash.

>
>On 9/8/17, 10:59 AM, "ovs-dev-boun...@openvswitch.org on behalf of
>Bhanuprakash Bodireddy" bhanuprakash.bodire...@intel.com> wrote:
>
>Clang reports that value stored to ftp, seq_skew_dir never read inside
>the function.
>
>Signed-off-by: Bhanuprakash Bodireddy
>
>---
> lib/conntrack.c | 5 ++---
> 1 file changed, 2 insertions(+), 3 deletions(-)
>
>diff --git a/lib/conntrack.c b/lib/conntrack.c
>index 419cb1d..a0838ee 100644
>--- a/lib/conntrack.c
>+++ b/lib/conntrack.c
>@@ -2615,7 +2615,7 @@ process_ftp_ctl_v4(struct conntrack *ct,
> char ftp_msg[LARGEST_FTP_MSG_OF_INTEREST + 1] = {0};
> get_ftp_ctl_msg(pkt, ftp_msg);
>
>-char *ftp = ftp_msg;
>+char *ftp;
> enum ct_alg_mode mode;
> if (!strncasecmp(ftp_msg, FTP_PORT_CMD, strlen(FTP_PORT_CMD))) {
> ftp = ftp_msg + strlen(FTP_PORT_CMD);
>@@ -2761,7 +2761,7 @@ process_ftp_ctl_v6(struct conntrack *ct,
> get_ftp_ctl_msg(pkt, ftp_msg);
> *ftp_data_start = tcp_hdr + tcp_hdr_len;
>
>-char *ftp = ftp_msg;
>+char *ftp;
> struct in6_addr ip6_addr;
> if (!strncasecmp(ftp_msg, FTP_EPRT_CMD, strlen(FTP_EPRT_CMD))) {
> ftp = ftp_msg + strlen(FTP_EPRT_CMD);
>@@ -2909,7 +2909,6 @@ handle_ftp_ctl(struct conntrack *ct, const struct
>conn_lookup_ctx *ctx,
> bool seq_skew_dir;
> if (ftp_ctl == CT_FTP_CTL_OTHER) {
> seq_skew = conn_for_expectation->seq_skew;
>-seq_skew_dir = conn_for_expectation->seq_skew_dir;
> } else if (ftp_ctl == CT_FTP_CTL_INTEREST) {
> enum ftp_ctl_pkt rc;
> if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) {
>--
>2.4.11
>



Re: [ovs-dev] [PATCH v4 3/7] dpif-netdev: Register packet processing cores to KA framework.

2017-09-13 Thread Bodireddy, Bhanuprakash
>"Bodireddy, Bhanuprakash"  writes:
>
>>>Bhanuprakash Bodireddy  writes:
>>>
>>>> This commit registers the packet processing PMD cores to keepalive
>>>> framework. Only PMDs that have rxqs mapped will be registered and
>>>> actively monitored by KA framework.
>>>>
>>>> This commit spawns a keepalive thread that will dispatch heartbeats
>>>> to PMD cores. The pmd threads respond to heartbeats by marking
>>>> themselves alive. As long as PMD responds to heartbeats it is considered
>'healthy'.
>>>>
>>>> Signed-off-by: Bhanuprakash Bodireddy
>>>> 
>>>> ---
>>>
>>>I'm really confused with this patch.  I've stopped reviewing the series.
>>>
>>>It seems like there's a mix of 'track by core id' and 'track by thread id'.
>>>
>>>I don't think it's possible to do anything by core id.  We can never
>>>know what else has been scheduled on those cores, and we cannot be
>>>sure that a taskset or other scheduler provisioning call will move the
>threads.
>>
>> [BHANU] I have already answered this in other thread.
>> one can't correlate threads with cores and we shouldn't be tracking by
>> cores. However with PMD threads there is 1:1 mapping of PMD and the
>> core-id and it's safe to temporarily write PMD liveness info in to an
>> array indexed by core id before updating this in to HMAP.
>
>The core-id as a concept here is deceptive.  An external entity (such as
>taskset) can rebalance the PMDs.  External entities can be scheduled on the
>cores.  I think it's dangerous to have anything called core-id in this series 
>or
>feature, because people will naturally infer things which aren't true.
>Additionally, it will lead to things like "well, we know that core x,y,z are
>running at A%, so we can schedule things thusly..."
>
>Makes sense?
>

The concerns above make sense and you have a valid point.
Unfortunately, the logic that you see w.r.t. the PMD/core_id mapping is
something that was implemented in the rte_keepalive library, and I inherited
it. As the 1:1 mapping of a thread (PMD) to a core is deceptive and makes
little sense, I reworked a different approach with no impact on the datapath
performance. I have been testing this for the last few days to check for perf
impacts and other possible issues.

Previous design:

As part of the heartbeat mechanism (dispatch_heartbeats()), the keepalive_info
structure had arrays indexed by the core-ids used by the PMDs and the
keepalive thread for heart-beating. The arrays were used to keep the logic
simple and lock-free.

The keepalive thread was also updating the status periodically into the
'process_list' map using a callback function.

New design:
---
We already have a 'process_list' map where all the PMD threads are added by the
main (vswitchd) thread. In this new approach, I take a copy of 'process_list',
call it 'cached_process_list', and use this cached map for heartbeating between
the keepalive thread and the PMD threads. No locks are needed on
'cached_process_list', thereby not impacting datapath performance.

Also, whenever there is a datapath reconfiguration (triggered by pmd-cpu-mask),
the 'process_list' map is updated and 'cached_process_list' is reloaded from
the main map, thereby keeping the two maps in sync. This is handled as part of
ka_register_datapath_threads(). I have been testing this and it seems to be
working fine.

This way we can completely avoid all references to core_id in this series. Let
me know if you have any comments on this new approach.
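To make the idea concrete, here is a minimal, self-contained sketch of the
cached-map scheme — this is not the actual OVS code; 'registry' stands in for
'process_list' (written by the main thread under a lock) and 'cache' stands in
for 'cached_process_list' (read lock-free on the heartbeat path, reloaded only
on reconfiguration). All names here are made up for illustration:

```c
#include <pthread.h>
#include <stdbool.h>
#include <string.h>

#define MAX_THREADS 16

struct thread_entry {
    int tid;
    bool alive;          /* heartbeat flag, written by the PMD thread */
};

/* Stand-in for 'process_list': modified by the main thread under a lock. */
static struct thread_entry registry[MAX_THREADS];
static int registry_n;
static pthread_mutex_t registry_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for 'cached_process_list': read without locks on the
 * heartbeat path, reloaded only during datapath reconfiguration (the
 * moral equivalent of ka_register_datapath_threads()). */
static struct thread_entry cache[MAX_THREADS];
static int cache_n;

static void register_thread(int tid)
{
    pthread_mutex_lock(&registry_mutex);
    registry[registry_n].tid = tid;
    registry[registry_n].alive = false;
    registry_n++;
    pthread_mutex_unlock(&registry_mutex);
}

/* Reload the cache from the registry; called only on reconfiguration,
 * so the per-heartbeat path never takes the lock. */
static void reload_cache(void)
{
    pthread_mutex_lock(&registry_mutex);
    memcpy(cache, registry, sizeof registry);
    cache_n = registry_n;
    pthread_mutex_unlock(&registry_mutex);
}

/* Lock-free liveness check over the cached list. */
static int count_alive(void)
{
    int n = 0;
    for (int i = 0; i < cache_n; i++) {
        n += cache[i].alive;
    }
    return n;
}
```

The key property: the heartbeat path only ever touches 'cache', and 'cache' is
only rewritten at reconfiguration time, so no per-iteration locking is needed.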

Regards,
Bhanuprakash.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH 02/13] netdev-dummy: Reorder elements in dummy_packet_stream structure.

2017-09-17 Thread Bodireddy, Bhanuprakash
Hi greg,

>On 09/08/2017 10:59 AM, Bhanuprakash Bodireddy wrote:
>> By reordering elements in dummy_packet_stream structure, sum holes
>
>Do you mean "the sum of the holes" can be reduced or do you mean "some
>holes"
>can be reduced?

In this patch series, "sum of the holes" means the sum/total of all the hole
bytes in the respective structure. For example, the 'dummy_packet_stream'
structure members are laid out as below; this structure has one hole of 56
bytes.

struct dummy_packet_stream {
    struct stream     *stream;   /*     0     8 */

    /* XXX 56 bytes hole */

    struct dp_packet   rxbuf;    /*    64   704 */
    struct ovs_list    txq;      /*   768    16 */
};

With the proposed change in this patch, the new layout is as below:

struct dummy_packet_stream {
    struct stream     *stream;   /*     0     8 */
    struct ovs_list    txq;      /*     8    16 */

    /* XXX 40 bytes hole */

    struct dp_packet   rxbuf;    /*    64   704 */
};

For all the patches, this information is added to the commit log to show the
improvement from the proposed changes. As claimed, the sum of the holes (in
bytes) is reduced from 56 to 40 in the case of this patch.
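The hole arithmetic can be checked with offsetof/sizeof. Below is a
self-contained model, not the real OVS types: I stand in for dp_packet with a
64-byte-aligned 704-byte object (the alignment is my assumption to reproduce
the hole) and for ovs_list with two pointers. The hole sizes match the pahole
numbers above (56 before, 40 after), though the modeled total sizes need not:

```c
#include <stdalign.h>
#include <stddef.h>

/* Stand-ins (assumptions) for the real types. */
struct fake_dp_packet { alignas(64) char data[704]; };
struct fake_ovs_list  { void *prev, *next; };

struct stream_before {               /* original member order */
    void *stream;                    /* offset 0, size 8 */
    struct fake_dp_packet rxbuf;     /* aligned to offset 64: 56-byte hole */
    struct fake_ovs_list  txq;       /* offset 768 */
};

struct stream_after {                /* reordered as in the patch */
    void *stream;                    /* offset 0, size 8 */
    struct fake_ovs_list  txq;       /* offset 8: fills part of the hole */
    struct fake_dp_packet rxbuf;     /* still offset 64: 40-byte hole */
};

/* Hole before rxbuf = rxbuf's offset minus the end of the previous member. */
#define HOLE_BEFORE (offsetof(struct stream_before, rxbuf) - sizeof(void *))
#define HOLE_AFTER  (offsetof(struct stream_after, rxbuf)          \
                     - offsetof(struct stream_after, txq)          \
                     - sizeof(struct fake_ovs_list))
```

Moving the small member into the alignment gap shrinks the hole without
touching the aligned member's offset, which is exactly why the reordering
saves a cache line.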

>> Before: structure size: 784, sum holes: 56, cachelines:13
>> After :  structure size: 768, sum holes: 40, cachelines:12

>
>Same question through several of the other patches where you use the same
>language.

In a few structures there are multiple holes, and in those cases 'sum holes'
adds up the hole bytes of all of them.

- Bhanuprakash.

>
>> can be reduced, thus saving a cache line.
>>
>> Before: structure size: 784, sum holes: 56, cachelines:13 After :
>> structure size: 768, sum holes: 40, cachelines:12
>>
>> Signed-off-by: Bhanuprakash Bodireddy
>> 
>> ---
>>   lib/netdev-dummy.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/lib/netdev-dummy.c b/lib/netdev-dummy.c index
>> f731af1..d888c40 100644
>> --- a/lib/netdev-dummy.c
>> +++ b/lib/netdev-dummy.c
>> @@ -50,8 +50,8 @@ struct reconnect;
>>
>>   struct dummy_packet_stream {
>>   struct stream *stream;
>> -struct dp_packet rxbuf;
>>   struct ovs_list txq;
>> +struct dp_packet rxbuf;
>>   };
>>
>>   enum dummy_packet_conn_type {
>>



Re: [ovs-dev] [PATCH 00/10] Use DP_PACKET_BATCH_FOR_EACH macro.

2017-09-20 Thread Bodireddy, Bhanuprakash
Hi Darrell,

>You have many instances where you want to use
>DP_PACKET_BATCH_FOR_EACH You have another series partially about this:
>https://patchwork.ozlabs.org/patch/813007/
>
>Also, this series mixes in other changes like creating new variables for 
>clarity, I
>guess, and removing unneeded variables. which anyways has different
>motivation but part of the same patch.
>
>Do you think it makes sense to group the DP_PACKET_BATCH_FOR_EACH
>changes in one patch and splice out the other changes as other patches in the
>same series by same theme ?

That makes sense, and I sent out a v2 by merging the two patches of my
previous series. This time the patches are grouped, and I added the details in
the cover letter under the version info.

Cover letter:  
https://mail.openvswitch.org/pipermail/ovs-dev/2017-September/338990.html
https://patchwork.ozlabs.org/patch/816191/

- Bhanuprakash.

>
>Thanks
>Darrell
>
>On 9/19/17, 12:39 PM, "ovs-dev-boun...@openvswitch.org on behalf of
>Bhanuprakash Bodireddy" bhanuprakash.bodire...@intel.com> wrote:
>
>DP_PACKET_BATCH_FOR_EACH macro was introduced early this year as part
>of enhancing packet batch APIs. Commit '72c84bc2' implemented this macro
>and replaced most of the calling sites with macros and simplified the logic.
>
>However, there are still many APIs that need to be fixed.
>This patch series is a simple and straightforward set of changes
>aimed at using the DP_PACKET_BATCH_FOR_EACH macro at all appropriate
>places. Also, minor code cleanup is done to improve readability of the code.
>
>No functionality changes and no performance impact with this series.
>
>Bhanuprakash Bodireddy (10):
>  netdev-linux: Clean up netdev_linux_sock_batch_send().
>  netdev-linux: Use DP_PACKET_BATCH_FOR_EACH in
>netdev_linux_tap_batch_send.
>  netdev-dpdk: Cleanup dpdk_do_tx_copy.
>  netdev-dpdk: Minor cleanup of netdev_dpdk_send__.
>  netdev-dpdk: Use DP_PACKET_BATCH_FOR_EACH in
>netdev_dpdk_ring_send
>  netdev-bsd: Use DP_PACKET_BATCH_FOR_EACH in netdev_bsd_send.
>  odp-execute: Use const qualifer for batch size.
>  dpif-netdev: Use DP_PACKET_BATCH_FOR_EACH in
>dp_netdev_run_meter.
>  dpif-netdev: Use DP_PACKET_BATCH_FOR_EACH in fast_path_processing.
>  dpif-netdev: Remove 'cnt' in dp_netdev_input__().
>
> lib/dpif-netdev.c  | 33 +++--
> lib/netdev-bsd.c   |  7 ---
> lib/netdev-dpdk.c  | 40 +++-
> lib/netdev-linux.c | 17 +
> lib/odp-execute.c  |  3 ++-
> 5 files changed, 49 insertions(+), 51 deletions(-)
>
>--
>2.4.11
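For readers unfamiliar with the pattern the series converts to: the macro
hides the index-based loop so call sites only name the packet and the batch.
A simplified, self-contained sketch (not the actual OVS definition; all names
here are stand-ins):

```c
#include <stddef.h>

#define BATCH_MAX 32

struct pkt { int len; };

struct pkt_batch {
    size_t count;
    struct pkt *packets[BATCH_MAX];
};

/* Simplified stand-in for DP_PACKET_BATCH_FOR_EACH: iterate the batch,
 * binding PKT to each packet in turn, with no visible index at the
 * call site. */
#define PKT_BATCH_FOR_EACH(PKT, BATCH)                      \
    for (size_t i_ = 0; i_ < (BATCH)->count                 \
         && ((PKT) = (BATCH)->packets[i_], 1); i_++)

/* Example call site: no manual batch->packets[i] indexing needed. */
static int total_len(struct pkt_batch *batch)
{
    struct pkt *pkt;
    int total = 0;

    PKT_BATCH_FOR_EACH (pkt, batch) {
        total += pkt->len;
    }
    return total;
}
```

Converting call sites to this shape is what removes the ad-hoc 'cnt'/'i'
variables the series cleans up.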



Re: [ovs-dev] is there any document about how to build debian package with dpdk?

2017-09-21 Thread Bodireddy, Bhanuprakash
>We modified a little of the code for DPDK, so we must rebuild the OVS debian
>package with DPDK ourselves.
>Is there any guide on how to build the openvswitch-dpdk package?

There is a guide on this here:
http://docs.openvswitch.org/en/latest/intro/install/debian/

- Bhanuprakash.




Re: [ovs-dev] [patch v2 1/5] conntrack: Fix clang static analysis reports.

2017-09-21 Thread Bodireddy, Bhanuprakash
>These dead assignment warnings do not affect functionality.
>In one case, a local variable could be removed and in another case, the
>working pointer should be used rather than the start pointer.
>
>Fixes: bd5e81a0e596 ("Userspace Datapath: Add ALG infra and FTP.")
>Reported-by: Bhanuprakash Bodireddy
>
>Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2017-
>September/338515.html
>Signed-off-by: Darrell Ball 
>---
> lib/conntrack.c | 12 
> 1 file changed, 4 insertions(+), 8 deletions(-)
>
>diff --git a/lib/conntrack.c b/lib/conntrack.c index 419cb1d..59d3c4e 100644
>--- a/lib/conntrack.c
>+++ b/lib/conntrack.c
>@@ -2617,7 +2617,7 @@ process_ftp_ctl_v4(struct conntrack *ct,
>
> char *ftp = ftp_msg;
> enum ct_alg_mode mode;
>-if (!strncasecmp(ftp_msg, FTP_PORT_CMD, strlen(FTP_PORT_CMD))) {
>+if (!strncasecmp(ftp, FTP_PORT_CMD, strlen(FTP_PORT_CMD))) {
> ftp = ftp_msg + strlen(FTP_PORT_CMD);
> mode = CT_FTP_MODE_ACTIVE;
> } else {
>@@ -2763,7 +2763,7 @@ process_ftp_ctl_v6(struct conntrack *ct,
>
> char *ftp = ftp_msg;
> struct in6_addr ip6_addr;
>-if (!strncasecmp(ftp_msg, FTP_EPRT_CMD, strlen(FTP_EPRT_CMD))) {
>+if (!strncasecmp(ftp, FTP_EPRT_CMD, strlen(FTP_EPRT_CMD))) {
> ftp = ftp_msg + strlen(FTP_EPRT_CMD);
> ftp = skip_non_digits(ftp);
> if (*ftp != FTP_AF_V6 || isdigit(ftp[1])) { @@ -2906,10 +2906,8 @@
>handle_ftp_ctl(struct conntrack *ct, const struct conn_lookup_ctx *ctx,
>
> struct ovs_16aligned_ip6_hdr *nh6 = dp_packet_l3(pkt);
> int64_t seq_skew = 0;
>-bool seq_skew_dir;
> if (ftp_ctl == CT_FTP_CTL_OTHER) {
> seq_skew = conn_for_expectation->seq_skew;
>-seq_skew_dir = conn_for_expectation->seq_skew_dir;
> } else if (ftp_ctl == CT_FTP_CTL_INTEREST) {
> enum ftp_ctl_pkt rc;
> if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) { @@ -2933,18 +2931,16
>@@ handle_ftp_ctl(struct conntrack *ct, const struct conn_lookup_ctx *ctx,
> seq_skew = repl_ftp_v6_addr(pkt, v6_addr_rep, ftp_data_start,
> addr_offset_from_ftp_data_start,
> addr_size, mode);
>-seq_skew_dir = ctx->reply;
> if (seq_skew) {
> ip_len = ntohs(nh6->ip6_ctlun.ip6_un1.ip6_un1_plen);
> ip_len += seq_skew;
> nh6->ip6_ctlun.ip6_un1.ip6_un1_plen = htons(ip_len);
> conn_seq_skew_set(ct, &conn_for_expectation->key, now,
>-  seq_skew, seq_skew_dir);
>+  seq_skew, ctx->reply);
> }
> } else {
> seq_skew = repl_ftp_v4_addr(pkt, v4_addr_rep, ftp_data_start,
> addr_offset_from_ftp_data_start);
>-seq_skew_dir = ctx->reply;
> ip_len = ntohs(l3_hdr->ip_tot_len);
> if (seq_skew) {
> ip_len += seq_skew; @@ -2952,7 +2948,7 @@
>handle_ftp_ctl(struct conntrack *ct, const struct conn_lookup_ctx *ctx,
>   l3_hdr->ip_tot_len, htons(ip_len));
> l3_hdr->ip_tot_len = htons(ip_len);
> conn_seq_skew_set(ct, &conn_for_expectation->key, now,
>-  seq_skew, seq_skew_dir);
>+  seq_skew, ctx->reply);
> }
> }
> } else {
>--

LGTM and verified with clang.

Acked-by: Bhanuprakash Bodireddy 


Re: [ovs-dev] [patch v2 2/5] conntrack: Minor performance enhancement.

2017-09-21 Thread Bodireddy, Bhanuprakash
>Add an OVS_UNLIKELY and reorder a few variable condition checks.
>
>Signed-off-by: Darrell Ball 
>---
> lib/conntrack.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
>diff --git a/lib/conntrack.c b/lib/conntrack.c index 59d3c4e..c94bc27 100644
>--- a/lib/conntrack.c
>+++ b/lib/conntrack.c
>@@ -1104,7 +1104,7 @@ process_one(struct conntrack *ct, struct dp_packet
>*pkt,
>
> bool tftp_ctl = is_tftp_ctl(pkt);
> struct conn conn_for_expectation;
>-if (conn && (ftp_ctl || tftp_ctl)) {
>+if (OVS_UNLIKELY((ftp_ctl || tftp_ctl) && conn)) {
> conn_for_expectation = *conn;
> }
>
>@@ -1115,10 +1115,10 @@ process_one(struct conntrack *ct, struct
>dp_packet *pkt,
> }
>
> /* FTP control packet handling with expectation creation. */
>-if (OVS_UNLIKELY(conn && ftp_ctl)) {
>+if (OVS_UNLIKELY(ftp_ctl && conn)) {
> handle_ftp_ctl(ct, ctx, pkt, &conn_for_expectation,
>now, CT_FTP_CTL_INTEREST, !!nat_action_info);
>-} else if (OVS_UNLIKELY(conn && tftp_ctl)) {
>+} else if (OVS_UNLIKELY(tftp_ctl && conn)) {
> handle_tftp_ctl(ct, &conn_for_expectation, now);
> }
> }

LGTM 
Acked-by: Bhanuprakash Bodireddy 
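For context on the hint being moved around here: OVS_UNLIKELY is typically
built on the GCC/Clang __builtin_expect builtin, and the swap from 'conn &&
ftp_ctl' to 'ftp_ctl && conn' just tests the rarer condition first so the
common case short-circuits sooner. A minimal sketch (names are stand-ins, not
the OVS definitions):

```c
/* Sketch of the usual definition: tell the compiler the condition is
 * expected to be false, so it lays out the fast path as fall-through.
 * The hint never changes the value of the condition, only codegen. */
#define MY_UNLIKELY(x) __builtin_expect(!!(x), 0)

static int classify(int is_ftp_ctl, int has_conn)
{
    if (MY_UNLIKELY(is_ftp_ctl && has_conn)) {
        return 1;   /* rare slow path: ALG handling */
    }
    return 0;       /* common fast path */
}
```

Since && short-circuits left to right, putting the most selective test first
also skips the second load entirely in the common case.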


Re: [ovs-dev] [patch v2 3/5] conntrack: Create nat_conn_keys_insert().

2017-09-21 Thread Bodireddy, Bhanuprakash
>Create a separate function from existing code, so the code can be reused in a
>subsequent patch; no change in functionality.
>
>Signed-off-by: Darrell Ball 
>---
> lib/conntrack.c | 42 +-
> 1 file changed, 29 insertions(+), 13 deletions(-)
>
>diff --git a/lib/conntrack.c b/lib/conntrack.c index c94bc27..2eca38d 100644
>--- a/lib/conntrack.c
>+++ b/lib/conntrack.c
>@@ -96,6 +96,11 @@ nat_conn_keys_lookup(struct hmap *nat_conn_keys,
>  const struct conn_key *key,
>  uint32_t basis);
>
>+static bool
>+nat_conn_keys_insert(struct hmap *nat_conn_keys,
>+ const struct conn *nat_conn,
>+ uint32_t hash_basis);
>+

This patch refactors the code with no change in functionality.
Small nit (not necessarily needed): change the variable name from 'hash_basis'
to 'basis' to keep it consistent with other APIs in this file.

LGTM
Acked-by: Bhanuprakash Bodireddy 




[ovs-dev] ovs-tcpdump error

2017-09-21 Thread Bodireddy, Bhanuprakash
Hi,

ovs-tcpdump throws the below error when trying to capture packets on one of
the vhostuser ports.

$ ovs-tcpdump -i dpdkvhostuser0
   ERROR: Please create an interface called `midpdkvhostuser0`
See your OS guide for how to do this.
Ex: ip link add midpdkvhostuser0 type veth peer name midpdkvhostuser02

$ ip link add midpdkvhostuser0 type veth peer name midpdkvhostuser02
 Error: argument "midpdkvhostuser0" is wrong: "name" too long

To get around this issue, I have to pass the '--mirror-to' option as below.

$ ovs-tcpdump -i dpdkvhostuser0 -XX --mirror-to vh0

Is this due to the length of the port name?  It would be nice to fix this issue.

Bhanuprakash.

