Re: [ovs-dev] [PATCH 4/4] doc: Update configure section with prefetchwt1 details.

2018-03-13 Thread Bodireddy, Bhanuprakash
>> -Original Message-
>> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
>> boun...@openvswitch.org] On Behalf Of Bhanuprakash Bodireddy
>> Sent: Friday, January 12, 2018 5:41 PM
>> To: d...@openvswitch.org
>> Subject: [ovs-dev] [PATCH 4/4] doc: Update configure section with
>> prefetchwt1 details.
>>
>> In spite of specifying -march=native, when using Low Temporal
>> Write (OPCH_LTW) the compiler generates the 'prefetchw' instruction
>> instead of the 'prefetchwt1' instruction available on the processor,
>> as in 'Case B'. To make the compiler emit prefetchwt1, -mprefetchwt1
>> needs to be passed to configure explicitly.
>>
>> [Problem]
>>   Case A:
>> OVS_PREFETCH_CACHE(addr, OPCH_HTW)  [__builtin_prefetch(addr, 1, 3)]
>> [Assembly]
>> leaq-112(%rbp), %rax
>> prefetchw  (%rax)
>>
>>   Case B:
>> OVS_PREFETCH_CACHE(addr, OPCH_LTW)  [__builtin_prefetch(addr, 1, 1)]
>> [Assembly]
>> leaq-112(%rbp), %rax
>> prefetchw  (%rax) <***problem***>
>>
>> [Solution]
>>./configure CFLAGS="-g -O2 -mprefetchwt1"
>>
>>   Case B:
>> OVS_PREFETCH_CACHE(addr, OPCH_LTW)  [__builtin_prefetch(addr, 1, 1)]
>> [Assembly]
>> leaq-112(%rbp), %rax
>> prefetchwt1  (%rax)
>>
>> See also:
>> https://mail.openvswitch.org/pipermail/ovs-dev/2017-December/341591.html
>>
>> Signed-off-by: Bhanuprakash Bodireddy
>> 
>> ---
>>  Documentation/intro/install/general.rst | 13 +
>>  1 file changed, 13 insertions(+)
>>
>> diff --git a/Documentation/intro/install/general.rst
>> b/Documentation/intro/install/general.rst
>> index 718e5c2..4d2db45 100644
>> --- a/Documentation/intro/install/general.rst
>> +++ b/Documentation/intro/install/general.rst
>> @@ -280,6 +280,19 @@ With this, GCC will detect the processor and
>> automatically set appropriate  flags for it. This should not be used
>> if you are compiling OVS outside the  target machine.
>>
>> +Compilers (GCC) won't emit the prefetchwt1 instruction even with
>> +'-march=native' specified. In such cases, -mprefetchwt1 needs to be
>> +explicitly passed during configuration.
>
>Is prefetchwt1 supported by other compilers (clang etc.)?

[BHANU]  I don't know for certain whether clang supports this instruction.
But the link below references the -mprefetchwt1 option, so it may be supported:
https://clang.llvm.org/docs/ClangCommandLineReference.html#cmdoption-clang-mprefetchwt1

>
>> +
>> +For example, in spite of specifying -march=native, when using Low
>> +Temporal Write, i.e. OVS_PREFETCH_CACHE(addr, OPCH_LTW), the compiler
>> +generates the 'prefetchw' instruction instead of the 'prefetchwt1'
>> +instruction available on the processor.
>> +
>> +To make the compiler generate the appropriate instruction, it is
>> +recommended to pass ``-mprefetchwt1``::
>> +
>> +$ ./configure CFLAGS="-g -O2 -march=native -mprefetchwt1"
>
>In the comments for patch 1 of the series you mentioned users had to enable
>the instruction.
>It would be worth mentioning that here also. If there is extra work external to
>OVS to enable this instruction we can't assume the user will know this.

[BHANU]  What I meant by enabling the instruction in the 1/4 patch was to
use the -mprefetchwt1 flag while configuring OvS.

Regards,
Bhanuprakash.

>
>> +
>>  .. note::
>>CFLAGS are not applied when building the Linux kernel module.
>> Custom CFLAGS
>>for the kernel module are supplied using the ``EXTRA_CFLAGS``
>> variable when
>> --
>> 2.4.11
>>
>> ___
>> dev mailing list
>> d...@openvswitch.org
>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH 1/4] compiler: Introduce OVS_PREFETCH variants.

2018-03-13 Thread Bodireddy, Bhanuprakash
>
>> -Original Message-
>> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
>> boun...@openvswitch.org] On Behalf Of Bhanuprakash Bodireddy
>> Sent: Friday, January 12, 2018 5:41 PM
>> To: d...@openvswitch.org
>> Subject: [ovs-dev] [PATCH 1/4] compiler: Introduce OVS_PREFETCH variants.
>>
>> This commit introduces prefetch variants by using the GCC built-in
>> prefetch function.
>>
>> The prefetch variants give the user better control when designing a data
>> caching strategy, in order to increase cache efficiency and minimize
>> cache pollution. Data reference patterns here can be classified into:
>>
>>  - Non-temporal(NT) - Data that is referenced once and not reused in
>>   immediate future.
>>  - Temporal - Data will be used again soon.
>>
>> The macro variants can be used where there are:
>>  - Predictable memory access patterns.
>>  - Code paths where the execution pipeline can stall if data isn't
>>    available.
>>  - Time-consuming loops.
>>
>> For example:
>>
>>   OVS_PREFETCH_CACHE(addr, OPCH_LTR)
>> - OPCH_LTR : OVS PREFETCH CACHE HINT-LOW TEMPORAL READ.
>> - __builtin_prefetch(addr, 0, 1)
>> - Prefetch data in to L3 cache for readonly purpose.
>>
>>   OVS_PREFETCH_CACHE(addr, OPCH_HTW)
>> - OPCH_HTW : OVS PREFETCH CACHE HINT-HIGH TEMPORAL WRITE.
>> - __builtin_prefetch(addr, 1, 3)
>> - Prefetch data in to all caches in anticipation of write. In doing
>>   so it invalidates other cached copies so as to gain 'exclusive'
>>   access.
>>
>>   OVS_PREFETCH(addr)
>> - OPCH_HTR : OVS PREFETCH CACHE HINT-HIGH TEMPORAL READ.
>> - __builtin_prefetch(addr, 0, 3)
>> - Prefetch data in to all caches in anticipation of read and that
>>   data will be used again soon (HTR - High Temporal Read).
>>
>> Signed-off-by: Bhanuprakash Bodireddy
>> 
>> ---
>>  include/openvswitch/compiler.h | 147
>> ++---
>>  1 file changed, 139 insertions(+), 8 deletions(-)
>>
>> diff --git a/include/openvswitch/compiler.h
>> b/include/openvswitch/compiler.h
>> index c7cb930..94bb24d 100644
>> --- a/include/openvswitch/compiler.h
>> +++ b/include/openvswitch/compiler.h
>> @@ -222,18 +222,149 @@
>>  static void f(void)
>>  #endif
>>
>> -/* OVS_PREFETCH() can be used to instruct the CPU to fetch the cache
>> - * line containing the given address to a CPU cache.
>> - * OVS_PREFETCH_WRITE() should be used when the memory is going to
>be
>> - * written to.  Depending on the target CPU, this can generate the
>> same
>> - * instruction as OVS_PREFETCH(), or bring the data into the cache in
>> an
>> - * exclusive state. */
>>  #if __GNUC__
>> -#define OVS_PREFETCH(addr) __builtin_prefetch((addr))
>> -#define OVS_PREFETCH_WRITE(addr) __builtin_prefetch((addr), 1)
>> +enum cache_locality {
>> +NON_TEMPORAL_LOCALITY,
>> +LOW_TEMPORAL_LOCALITY,
>> +MODERATE_TEMPORAL_LOCALITY,
>> +HIGH_TEMPORAL_LOCALITY
>> +};
>> +
>> +enum cache_rw {
>> +PREFETCH_READ,
>> +PREFETCH_WRITE
>> +};
>> +
>> +/* The prefetch variants gives the user better control on designing data
>> + * caching strategy in order to increase cache efficiency and minimize
>> + * cache pollution. Data reference patterns here can be classified in to
>> + *
>> + *   Non-temporal(NT) - Data that is referenced once and not reused in
>> + *  immediate future.
>> + *   Temporal - Data will be used again soon.
>> + *
>> + * The Macro variants can be used where there are
>> + *   o Predictable memory access patterns.
>> + *   o Execution pipeline can stall if data isn't available.
>> + *   o Time consuming loops.
>> + *
>> + * OVS_PREFETCH_CACHE() can be used to instruct the CPU to fetch the cache
>> + * line containing the given address to a CPU cache. The second argument
>> + * OPCH_XXR (or) OPCH_XXW is used to hint if the prefetched data is going
>> + * to be read or written to by core.
>> + *
>> + * Example Usage:
>> + *
>> + *   OVS_PREFETCH_CACHE(addr, OPCH_LTR)
>> + *   - OPCH_LTR : OVS PREFETCH CACHE HINT-LOW TEMPORAL READ.
>> + *   - __builtin_prefetch(addr, 0, 1)
>> + *   - Prefetch data in to L3 cache for readonly purpose.
>> + *
>> + *   OVS_PREFETCH_CACHE(addr, OPCH_HTW)
>> + *   - OPCH_HTW : OVS PREFETCH CACHE HINT-HIGH TEMPORAL WRITE.
>> + *   - __builtin_prefetch(addr, 1, 3)
>> + *   - Prefetch data in to all caches in anticipation of write. In doing
>> + * so it invalidates other cached copies so as to gain 'exclusive'
>> + * access.
>> + *
>> + *   OVS_PREFETCH(addr)
>> + *   - OPCH_HTR : OVS PREFETCH CACHE HINT-HIGH TEMPORAL READ.
>> + *   - __builtin_prefetch(addr, 0, 3)
>> + *   - Prefetch data in to all caches in anticipation of read and that
>> + * data will be used again soon (HTR - High Temporal Read).
>> + *
>> + * Implementation details of prefetch hint instructions may vary across
>> + * 
Re: [ovs-dev] [RFC 4/4] dpif-netdev.c: Add indirect table

2018-02-23 Thread Bodireddy, Bhanuprakash
Hi Yipeng,

>If we store pointers in DFC, then the memory requirement is large. When
>there are VMs or multiple PMDs running on the same platform, they will
>compete for the shared cache. So we want DFC to be as memory efficient as
>possible.

>
>The indirect table is a simple hash table that maps the DFC's result to the
>dp_netdev_flow's pointer. This is to reduce the memory size of the DFC
>cache, assuming that the megaflow count is much smaller than the exact
>match flow count. With this commit, we could reduce the 8-byte pointer to a
>2-byte index in DFC cache so that the memory/cache requirement is almost
>halved. Another approach we plan to try is to use the flow_table as the
>indirect table.

I assume this patch is only aimed at reducing the DFC cache memory footprint
and doesn't introduce any new functionality?

With this I see the dfc_bucket size is now 32 bytes, down from 80 bytes in 3/4,
and the buckets will now be aligned to cache lines.
Also, the dfc_cache size is reduced to ~8MB, from ~12MB in the 1/4 and ~14MB in
the 3/4 patches.

I am guessing there might be some performance improvement with this patch due
to the buckets aligning to cache lines, apart from the reduced memory
footprint. Do you see any such advantage in your benchmarks?

Regards,
Bhanuprakash.

>
>The indirect table size is a fixed constant for now.
>
>Signed-off-by: Yipeng Wang 
>---
> lib/dpif-netdev.c | 69 +++
>
> 1 file changed, 44 insertions(+), 25 deletions(-)
>
>diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>index 50a1d25..35197d3 100644
>--- a/lib/dpif-netdev.c
>+++ b/lib/dpif-netdev.c
>@@ -151,6 +151,12 @@ struct netdev_flow_key {
>
> #define DFC_MASK_LEN 20
> #define DFC_ENTRY_PER_BUCKET 8
>+
>+/* For now we fix the Indirect table size, ideally it should be sized
>+ * according to max megaflow count but less than 2^16  */
>+#define INDIRECT_TABLE_SIZE (1u << 12)
>+#define INDIRECT_TABLE_MASK (INDIRECT_TABLE_SIZE - 1)
> #define DFC_ENTRIES (1u << DFC_MASK_LEN)
> #define DFC_BUCKET_CNT (DFC_ENTRIES / DFC_ENTRY_PER_BUCKET)
> #define DFC_MASK (DFC_BUCKET_CNT - 1)
>@@ -175,13 +181,14 @@ struct emc_cache {
>
> struct dfc_bucket {
> uint16_t sig[DFC_ENTRY_PER_BUCKET];
>-struct dp_netdev_flow *flow[DFC_ENTRY_PER_BUCKET];
>+uint16_t index[DFC_ENTRY_PER_BUCKET];
> };
>
> struct dfc_cache {
> struct emc_cache emc_cache;
> struct dfc_bucket buckets[DFC_BUCKET_CNT];
> int sweep_idx;
>+struct dp_netdev_flow *indirect_table[INDIRECT_TABLE_SIZE];
> };
>
> 

>@@ -754,7 +761,7 @@ static int dpif_netdev_xps_get_tx_qid(const struct
>dp_netdev_pmd_thread *pmd,
>
> static inline bool dfc_entry_alive(struct dp_netdev_flow *flow);
> static void emc_clear_entry(struct emc_entry *ce);
>-static void dfc_clear_entry(struct dfc_bucket *b, int idx);
>+static void dfc_clear_entry(struct dp_netdev_flow **flow,
>+struct dfc_bucket *b, int idx);
>
> static void dp_netdev_request_reconfigure(struct dp_netdev *dp);
>
>@@ -782,9 +789,12 @@ dfc_cache_init(struct dfc_cache *flow_cache)
> emc_cache_init(&flow_cache->emc_cache);
> for (i = 0; i < DFC_BUCKET_CNT; i++) {
> for (j = 0; j < DFC_ENTRY_PER_BUCKET; j++) {
>-flow_cache->buckets[i].flow[j] = NULL;
>+flow_cache->buckets[i].sig[j] = 0;
> }
> }
>+for (i = 0; i < INDIRECT_TABLE_SIZE; i++) {
>+flow_cache->indirect_table[i] = NULL;
>+}
> flow_cache->sweep_idx = 0;
> }
>
>@@ -805,7 +815,7 @@ dfc_cache_uninit(struct dfc_cache *flow_cache)
>
> for (i = 0; i < DFC_BUCKET_CNT; i++) {
> for (j = 0; j < DFC_ENTRY_PER_BUCKET; j++) {
>-dfc_clear_entry(&(flow_cache->buckets[i]), j);
>+dfc_clear_entry(flow_cache->indirect_table,
>+ &(flow_cache->buckets[i]), j);
> }
> }
> emc_cache_uninit(&flow_cache->emc_cache);
>@@ -2259,7 +2269,7 @@ dfc_entry_get(struct dfc_cache *cache, const
>uint32_t hash)
> uint16_t sig = hash >> 16;
> for (int i = 0; i < DFC_ENTRY_PER_BUCKET; i++) {
> if(bucket->sig[i] == sig) {
>-return bucket->flow[i];
>+return cache->indirect_table[bucket->index[i]];
> }
> }
> return NULL;
>@@ -2272,28 +2282,33 @@ dfc_entry_alive(struct dp_netdev_flow *flow)
> }
>
> static void
>-dfc_clear_entry(struct dfc_bucket *b, int idx)
>+dfc_clear_entry(struct dp_netdev_flow **ind_table, struct dfc_bucket
>+*b, int idx)
> {
>-if (b->flow[idx]) {
>-dp_netdev_flow_unref(b->flow[idx]);
>-b->flow[idx] = NULL;
>+if (ind_table[b->index[idx]]) {
>+dp_netdev_flow_unref(ind_table[b->index[idx]]);
>+ind_table[b->index[idx]] = NULL;
> }
> }
>
>-static inline void
>-dfc_change_entry(struct dfc_bucket *b, int idx, struct dp_netdev_flow
>*flow)
>+
>+static inline uint16_t
>+indirect_table_insert(struct dp_netdev_flow **indirect_table,
>+struct dp_netdev_flow *flow)
> {
>-if (b->flow[idx] != flow) {
>-

Re: [ovs-dev] [RFC 3/4] dpif-netdev: Use way-associative cache

2018-02-23 Thread Bodireddy, Bhanuprakash
Hi Yipeng,

Thanks for the patch. Some high level questions/comments.

(1)  Am I right in understanding that this patch *only* introduces a new cache
approach into DFC to reduce collisions?

(2)  Why is the number of entries per bucket set to '8'?  With this, each
dfc_bucket size is 80 bytes (16 + 64).
If the number of entries were set to '6', the dfc_bucket size would be 60
bytes (64 with alignment padding) and could fit into a cache line.
I assume 'DFC_ENTRY_PER_BUCKET' isn't a randomly picked number. Was it
chosen based on any benchmarks?

(3) A 2-byte signature is introduced in each bucket and is used to insert
flows into, or retrieve them from, the bucket.
3a. Due to the introduction of the 2-byte signature, the size of dfc_cache
increased by 2MB per PMD thread.
3b. Every time we insert or retrieve a flow, we have to match the
packet signature (upper 16 bits of the RSS hash) against each entry of the
bucket. I wonder if that slows down the operations?

(4)  The number of buckets depends on the number of entries per bucket.  Which
of these plays the more important role in reducing collisions?
i.e. Would a higher number of entries per bucket reduce collisions?

(5) What is the performance delta observed with this new Cache implementation 
over 1/4 approach?

Some more minor comments below.

>This commit uses a way-associative cache (CD) rather than a simple
>single-entry hash table for DFC. Experiments show that this design generally
>has a much higher hit rate.
>
>Since a miss is much more costly than a hit, a CD-like structure that improves
>the hit rate should help in general.
>
>Signed-off-by: Yipeng Wang 
>---
> lib/dpif-netdev.c | 107 +++--
>-
> 1 file changed, 70 insertions(+), 37 deletions(-)
>
>diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>index 3e87992..50a1d25 100644
>--- a/lib/dpif-netdev.c
>+++ b/lib/dpif-netdev.c
>@@ -150,8 +150,10 @@ struct netdev_flow_key {
>  */
>
> #define DFC_MASK_LEN 20
>+#define DFC_ENTRY_PER_BUCKET 8
> #define DFC_ENTRIES (1u << DFC_MASK_LEN)
>-#define DFC_MASK (DFC_ENTRIES - 1)
>+#define DFC_BUCKET_CNT (DFC_ENTRIES / DFC_ENTRY_PER_BUCKET)
>+#define DFC_MASK (DFC_BUCKET_CNT - 1)
> #define EMC_MASK_LEN 14
> #define EMC_ENTRIES (1u << EMC_MASK_LEN)
> #define EMC_MASK (EMC_ENTRIES - 1)
>@@ -171,13 +173,14 @@ struct emc_cache {
> int sweep_idx;
> };
>
>-struct dfc_entry {
>-struct dp_netdev_flow *flow;
>+struct dfc_bucket {
>+uint16_t sig[DFC_ENTRY_PER_BUCKET];
>+struct dp_netdev_flow *flow[DFC_ENTRY_PER_BUCKET];
> };
>
> struct dfc_cache {
> struct emc_cache emc_cache;
>-struct dfc_entry entries[DFC_ENTRIES];
>+struct dfc_bucket buckets[DFC_BUCKET_CNT];
> int sweep_idx;
> };
>
>@@ -749,9 +752,9 @@ dpif_netdev_xps_revalidate_pmd(const struct
>dp_netdev_pmd_thread *pmd,
> static int dpif_netdev_xps_get_tx_qid(const struct dp_netdev_pmd_thread *pmd,
>   struct tx_port *tx);
>
>-static inline bool dfc_entry_alive(struct dfc_entry *ce);
>+static inline bool dfc_entry_alive(struct dp_netdev_flow *flow);
> static void emc_clear_entry(struct emc_entry *ce);
>-static void dfc_clear_entry(struct dfc_entry *ce);
>+static void dfc_clear_entry(struct dfc_bucket *b, int idx);
>
> static void dp_netdev_request_reconfigure(struct dp_netdev *dp);
>
>@@ -774,11 +777,13 @@ emc_cache_init(struct emc_cache *emc)
> static void
> dfc_cache_init(struct dfc_cache *flow_cache)
> {
>-int i;
>+int i, j;
>
> emc_cache_init(&flow_cache->emc_cache);
>-for (i = 0; i < ARRAY_SIZE(flow_cache->entries); i++) {
>-flow_cache->entries[i].flow = NULL;
>+for (i = 0; i < DFC_BUCKET_CNT; i++) {
>+for (j = 0; j < DFC_ENTRY_PER_BUCKET; j++) {
>+flow_cache->buckets[i].flow[j] = NULL;

[BHANU] How about initializing the signature?

>+}
> }
> flow_cache->sweep_idx = 0;
> }
>@@ -796,10 +801,12 @@ emc_cache_uninit(struct emc_cache *emc)
> static void
> dfc_cache_uninit(struct dfc_cache *flow_cache)
> {
>-int i;
>+int i, j;
>
>-for (i = 0; i < ARRAY_SIZE(flow_cache->entries); i++) {
>-dfc_clear_entry(&flow_cache->entries[i]);
>+for (i = 0; i < DFC_BUCKET_CNT; i++) {
>+for (j = 0; j < DFC_ENTRY_PER_BUCKET; j++) {
>+dfc_clear_entry(&(flow_cache->buckets[i]), j);
>+}
> }
> emc_cache_uninit(&flow_cache->emc_cache);
> }
>@@ -2245,39 +2252,46 @@ emc_lookup(struct emc_cache *emc, const struct
>netdev_flow_key *key)
> return NULL;
> }
>
>-static inline struct dfc_entry *
>+static inline struct dp_netdev_flow *
> dfc_entry_get(struct dfc_cache *cache, const uint32_t hash)
> {
>-return &cache->entries[hash & DFC_MASK];
>+struct dfc_bucket *bucket = &cache->buckets[hash & DFC_MASK];
>+uint16_t sig = hash >> 16;
>+for (int i = 0; i < DFC_ENTRY_PER_BUCKET; i++) {
>+if(bucket->sig[i] == sig) {
>+return bucket->flow[i];
>+}
>+}
>+return NULL;
> }
>
> static inline bool

Re: [ovs-dev] [RFC 2/4] dpif-netdev: Fix EMC key length

2018-02-20 Thread Bodireddy, Bhanuprakash
This fix is needed and can be included in 1/4 in the next revision.

- Bhanuprakash.

>-Original Message-
>From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
>boun...@openvswitch.org] On Behalf Of Yipeng Wang
>Sent: Thursday, January 18, 2018 6:20 PM
>To: d...@openvswitch.org; jan.scheur...@ericsson.com
>Cc: Tai, Charlie 
>Subject: [ovs-dev] [RFC 2/4] dpif-netdev: Fix EMC key length
>
>EMC's key length is not initialized on insertion. Initialize the key length
>before insertion.
>
>The code might be put in another place, for now I just put it in dfc_lookup.
>
>Signed-off-by: Yipeng Wang 
>---
> lib/dpif-netdev.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
>diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>index b9f4b6d..3e87992 100644
>--- a/lib/dpif-netdev.c
>+++ b/lib/dpif-netdev.c
>@@ -2295,7 +2295,7 @@ dfc_insert(struct dp_netdev_pmd_thread *pmd,
> }
>
> static inline struct dp_netdev_flow *
>-dfc_lookup(struct dfc_cache *cache, const struct netdev_flow_key *key,
>+dfc_lookup(struct dfc_cache *cache, struct netdev_flow_key *key,
>bool *exact_match)
> {
> struct dp_netdev_flow *flow;
>@@ -2317,6 +2317,7 @@ dfc_lookup(struct dfc_cache *cache, const struct
>netdev_flow_key *key,
> /* Found a match in DFC. Insert into EMC for subsequent lookups.
>  * We use probabilistic insertion here so that mainly elephant
>  * flows enter EMC. */
>+key->len = netdev_flow_key_size(miniflow_n_values(&key->mf));
> emc_probabilistic_insert(&cache->emc_cache, key, flow);
> *exact_match = false;
> return flow;
>--
>2.7.4
>


Re: [ovs-dev] [RFC 1/4] dpif-netdev: Refactor datapath flow cache

2018-02-20 Thread Bodireddy, Bhanuprakash
Hi Yipeng,

Thanks for the RFC series. This patch series needs to be rebased.
I applied it on an older commit to do initial testing. Some comments below.

I see that the DFC cache is implemented along similar lines to the EMC cache,
except that it holds a million entries and uses more bits of the RSS hash to
index into the cache. I agree that DPCLS lookup is expensive and consumes 30%
of total cycles in some test cases, and the DFC cache will definitely reduce
some pain there.

On the memory foot print:

On master:
EMC entry size = 592 bytes
8K entries = ~4.6MB

With this patch:
EMC entry size = 256 bytes
16K entries = ~4MB

I like the above reduction in flow key size, keeping the entry size to a
multiple of the cache line size and still keeping the overall EMC size to ~4MB
with more EMC entries.

However, my concern is the DFC cache size. As the DFC cache holds a million
entries and consumes ~12MB per PMD thread, it might not fit into the L3 cache.
Also note that on newer platforms the L3 cache is shrinking while L2 is
slightly increased (e.g. Skylake has 1MB L2 and 19MB L3 cache).

In spite of the memory footprint, I still think the DFC cache improves
switching performance, as it is a lot less expensive than invoking
dpcls_lookup(); the latter involves more expensive hash computation and
subtable traversal. It would be nice if more testing were done with real VNFs
to confirm that this patch doesn't cause cache thrashing or suffer from memory
bottlenecks.

Some more comments below.

>This is a rebase of Jan's previous patch [PATCH] dpif-netdev: Refactor
>datapath flow cache https://mail.openvswitch.org/pipermail/ovs-dev/2017-
>November/341066.html
>
>So far the netdev datapath uses an 8K EMC to speed up the lookup of
>frequently used flows by comparing the parsed packet headers against the
>miniflow of a cached flow, using 13 bits of the packet RSS hash as index. The
>EMC is too small for many applications with 100K or more parallel packet
>flows, so that EMC thrashing actually degrades performance.
>Furthermore, the size of struct miniflow and the flow copying cost prevent us
>from making it much larger.
>
>At the same time the lookup cost of the megaflow classifier (DPCLS) is
>increasing as the number of frequently hit subtables grows with the
>complexity of pipeline and the number of recirculations.
>
>To close the performance gap for many parallel flows, this patch introduces
>the datapath flow cache (DFC) with 1M entries as a lookup stage between EMC
>and DPCLS. It directly maps 20 bits of the RSS hash to a pointer to the last
>hit megaflow entry and performs a masked comparison of the packet flow with
>the megaflow key to confirm the hit. This avoids the costly DPCLS lookup even
>for a very large number of parallel flows, with a small memory overhead.
>
>Due to the large size of the DFC and the low risk of DFC thrashing, any DPCLS
>hit immediately inserts an entry in the DFC so that subsequent packets get
>sped up. The DFC thus also accelerates short-lived flows.
>
>To further accelerate the lookup of few elephant flows, every DFC hit triggers
>a probabilistic EMC insertion of the flow. As the DFC entry is already in place
>the default EMC insertion probability can be reduced to
>1/1000 to minimize EMC thrashing should there still be many fat flows.
>The inverse EMC insertion probability remains configurable.
>
>The EMC implementation is simplified by removing the possibility to store a
>flow in two slots, as there is no particular reason why two flows should
>systematically collide (the RSS hash is not symmetric).

[BHANU]
I am not sure it is a good idea to simplify the EMC by using a 1-way
associative implementation instead of the current 2-way associative one.
I prefer to leave the current approach as-is unless we have strong data
proving otherwise.
This comment applies to the code changes below w.r.t. EMC lookup and insert.

>The maximum size of the EMC flow key is limited to 256 bytes to reduce the
>memory footprint. This should be sufficient to hold most real life packet flow
>keys. Larger flows are not installed in the EMC.

+1 

>
>The pmd-stats-show command is enhanced to show both EMC and DFC hits
>separately.
>
>The sweep speed for cleaning up obsolete EMC and DFC flow entries and
>freeing dead megaflow entries is increased. With a typical PMD cycle duration
>of 100us under load and checking one DFC entry per cycle, the DFC sweep
>should normally complete within 100s.
>
>In PVP performance tests with an L3 pipeline over VXLAN we determined the
>optimal EMC size to be 16K entries to obtain a uniform speedup compared to
>the master branch over the full range of parallel flows. The measurement
>below is for 64 byte packets and the average number of subtable lookups per
>DPCLS hit in this pipeline is 1.0, i.e. the acceleration already starts for a 
>single
>busy mask. Tests with many visited subtables should show a strong increase
>of the gain through DFC.
>
>Flows   master  DFC+EMC  Gain
>   

Re: [ovs-dev] [PATCH] dpif-netdev: Refactor datapath flow cache

2018-02-16 Thread Bodireddy, Bhanuprakash
>
>>-Original Message-
>>>
>>> [Wang, Yipeng] In my test, I compared the proposed EMC with current
>EMC with same 16k entries.
>>> If I turned off THP, the current EMC will cause many TLB misses because of
>its larger entry size, which I profiled with vTunes.
>>> Once I turned on THP with no other changes, the current EMC's
>>> throughput increases a lot and is comparable with the newly proposed
>EMC. From vTunes, the EMC lookup TLB misses decreases from 100 million to
>0 during the 30sec profiling time.
>>> So if THP is enabled, reducing EMC entry size may not give too much
>benefit comparing to the current EMC.
>>> It is worth to mention that they both use similar amount of CPU cache
>>> since only the miniflow struct is accessed by CPU, thus the TLB should be
>the major concern.

[BHANU]
I found this thread on THP interesting and want to share my findings here.
I did some micro-benchmarks on this feature a long time ago and found there
was some performance improvement with THP enabled.
Some of this can be attributed to a faster emc_lookup() with THP enabled.

With a large number of flows, emc_lookup() is back-end bound, and further
analysis showed that there is significant DTLB overhead. One way to reduce
the overhead is to use larger pages, and with THP the overhead reduces by 40%
for this function.

So THP has a positive effect on emc_lookup()!

- Bhanuprakash.

>>
>>I understand your point. But I can't seem to reproduce the effect of THP on
>my system.
>>I don't have vTunes available, but I guess "perf stat" should also
>>provide TLB miss statistics.
>>
>>How can you check if ovs-vswitchd is using transparent huge pages for
>>backing e.g. the EMC memory?
>>
>
>[Wang, Yipeng]
>I used the master OVS and change the EMC to be 16k entries. I feed 10k or
>more flows to stress EMC.  With perf, I tried this command:
>sudo perf stat -p PID -e dTLB-load-misses It shows the TLB misses changed a
>lot with THP on or off on my machine. vtunes shows the EMC_lookup
>function's data separately though.
>
>To check if THP is used by OvS, I found a Redhat suggested command handy:
>From: https://access.redhat.com/solutions/46111
>grep -e AnonHugePages  /proc/*/smaps | awk  '{ if($2>4) print $0} ' |  awk -F
>"/"  '{print $0; system("ps -fp " $3)} '
>I don't know how to check each individual function though.
>
>>>
>>> [Wang, Yipeng] Yes that there is no systematic collisions. However,
>>> in general, 1-hash table tends to cause many more misses than 2-hash.
>>> For code simplicity, I agree that 1-hash is simpler and much easier
>>> to understand. For performance, if the flows can fit in 1-hash table,
>>> they should also stay in the primary location of the 2-hash table, so
>>> basically they should have similar lookup speed. For large numbers of
>>> flows in general, traffic will have higher miss ratio in 1-hash than
>>> 2-hash table. From one of our tests that has 10k flows and 3 subtable (test
>cases described later), and EMC is sized for 16k entries, the 2-hash EMC
>causes about 14% miss ratio,  while the 1-hash EMC causes 47% miss ratio.
>>
>>I agree that a lower EMC hit rate is a concern with just DPCLS or CD+DPCLS as
>second stage.
>>But with DFC the extra cost for a miss on EMC is low as the DFC lookup
>>only slightly higher than EMC itself. The EMC miss is cheap as it will
>>typically already detected when comparing the full RSS hash.
>>
>>Furthermore, the EMC is now mainly meant to speed up the biggest
>>elephant flows, so it can be smaller and thrashing is avoided by very low
>insertion probability.
>>Simplistic benchmarks using a large number of "eternal" flows with
>>equidistantly spaced packets are really an unrealistic worst case for any
>cache-based architecture.
>>
>
>[Wang, Yipeng]
>If the realistic traffic patterns mostly hit EMC with elephant flows, I agree 
>that
>EMC could be simplified.
>
>>>
>>> [Wang, Yipeng] We agree that a DFC hit performs better than a CD hit,
>>> but CD usually has higher hit rate for large number of flows, as the data
>shows later.
>>
>>That is something I don't yet understand. Is this because of the fact
>>that CD stores up to 16 entries per hash bucket and handles collisions better?
>
>[Wang, Yipeng]
>Yes, with 2-hash function and 16 entries per bucket, CD has much less misses
>in general.
>
>As first step to combine both CD and DFC, I incorporated the signature and
>way-associative structure from CD into DFC. I just did simple prototype
>without Any performance tuning, preliminary results show good
>improvement over miss ratio and throughput. I will post the complete results
>soon.
>
>Since DFC/CD is much faster than megaflow, I believe higher hit rate is
>preferred. So A CD-like way-associative structure should be helpful. The
>signature per entry also helps on performance, similar effect with EMC.
>
>>>
>>> [Wang, Yipeng] We use the test/rules we posted with our CD patch.
>>> Basically we vary src_IP to hit different subtables, and then vary
>>> dst_IP to create 

Re: [ovs-dev] [PATCH v6 0/8] Add OVS DPDK keep-alive functionality.

2018-01-16 Thread Bodireddy, Bhanuprakash
>Hi,
>
>Sorry to jump on this at v6 only, but I skimmed over the code and I am
>struggling to understand what problem you're trying to solve. Yes, I realize
>you want some sort of feedback about the PMD processing, but it's not clear
>to me what exactly you want from it.
>
>This last patchset uses a separate thread just to monitor the PMD threads
>which can update their status in the core busy loop.  I guess it tells you if 
>the
>PMD thread is stuck or not, but not really if it's processing packets.  That's
>again, my question above.
>
>If you need to know if the thread is running, I think any OVS can provide you
>the process stats which should be more reliable and doesn't depend on OVS
>at all.
>
>I appreciate if you could elaborate more on the use-case.

The Intel SA team has been working on an SA framework for NFV environments
and has defined interfaces for the base platform (aligned with ETSI GS NFV
002), which includes compute, storage, networking, virtual switch, OS and
hypervisor.
The core idea here is to monitor and detect service-impacting faults on the
base platform.
Both reactive and proactive fault detection techniques are employed, and
faults are reported to higher-level layers for corrective actions. Corrective
actions can include, for example, migrating workloads or marking the compute
node offline, based on the policies enforced at higher layers.

One aspect of the larger SA framework is monitoring virtual switch health.
Some of the events of interest here are link status, OvS DB connection
status, packet statistics (drops/errors), and PMD health.

This patch series has only implemented the *PMD health* monitoring and
reporting mechanism, and the details are already in the patch. The other
interesting virtual switch events are already implemented as part of a
collectd plugin.

On your questions:

> I guess it tells you if the PMD thread is stuck or not, but not really if 
> it's processing packets.  That's
>again, my question above.

The functionality to check if the PMD is processing the packets was implemented 
way back in v3.
https://mail.openvswitch.org/pipermail/ovs-dev/2017-August/336789.html

For easier review, the patch series was split up in v4 to get the basic
functionality in first. This is mentioned in the version change log below.
https://mail.openvswitch.org/pipermail/ovs-dev/2017-August/337702.html

>If you need to know if the thread is running, I think any OVS can provide you
>the process stats which should be more reliable and doesn't depend on OVS
>at all.

There is a problem here: I simulated the case in the thread below to show that
the stats reported by the OS aren't accurate.
https://mail.openvswitch.org/pipermail/ovs-dev/2017-September/338388.html

Check the details on /proc/[pid]/stat: though the PMD thread is stalled, the OS
reports the thread in the *Running (R)* state.

- Bhanuprakash.

>
>
>On Fri, Dec 08, 2017 at 12:04:19PM +, Bhanuprakash Bodireddy wrote:
>> Keepalive feature is aimed at achieving Fastpath Service Assurance in
>> OVS-DPDK deployments. It adds support for monitoring the packet
>> processing threads by dispatching heartbeats at regular intervals.
>>
>> keepalive feature can be enabled through below OVSDB settings.
>>
>> enable-keepalive=true
>>   - Keepalive feature is disabled by default and should be enabled
>> at startup before ovs-vswitchd daemon is started.
>>
>> keepalive-interval="5000"
>>   - Timer interval in milliseconds for monitoring the packet
>> processing cores.
>>
>> TESTING:
>> The testing of keepalive is done using stress cmd (simulating the 
>> stalls).
>>   - pmd-cpu-mask=0xf [MQ enabled on DPDK ports]
>>   - stress -c 1 &  [tid is usually the __tid + 1 of the output]
>>   - chrt -r -p 99 [set realtime priority for stress thread]
>>   - taskset -p 0x8[Pin the stress thread to the core PMD is 
>> running]
>>   - PMD thread will be descheduled due to its normal priority and yields
>> core to stress thread.
>>
>>   - ovs-appctl keepalive/pmd-health-show   [Display that the thread is
>GONE]
>>   - ./ovsdb/ovsdb-client monitor Open_vSwitch  [Should update the
>> status]
>>
>>   - taskset -p 0x10   [This brings back pmd thread to life as stress
>thread
>> is moved to idle core]
>>
>>   (watch out for stress threads, and carefully pin them to a core so as
>> not to hang your DUTs during testing).
>>
>> v5 -> v6
>>   * Remove 2 patches from series
>>  - xnanosleep was applied to master as part of high resolution timeout
>support.
>>  - Extend get_process_info() API was also applied to master earlier.
>>   * Remove KA_STATE_DOZING as it was initially meant to handle Core C
>states, not needed
>> for now.
>>   * Fixed ka_destroy(), to fix unit test cases 536, 537.
>>   * A minor performance degradation(0.5%) is observed with Keepalive
>enabled.
>> [Tested with loopback case using 1000 IXIA streams/64 byte 

Re: [ovs-dev] [PATCH 1/4] compiler: Introduce OVS_PREFETCH variants.

2018-01-12 Thread Bodireddy, Bhanuprakash
>-Original Message-
>From: Ben Pfaff [mailto:b...@ovn.org]
>Sent: Friday, January 12, 2018 6:20 PM
>To: Bodireddy, Bhanuprakash <bhanuprakash.bodire...@intel.com>
>Cc: d...@openvswitch.org
>Subject: Re: [ovs-dev] [PATCH 1/4] compiler: Introduce OVS_PREFETCH
>variants.
>
>Hi Bhanu, who do you think should review this series?  Is it something that Ian
>should pick up for dpdk_merge?

Hi Ben,

I will check with Ian if he has time to review this. As the patch series doesn't
change any functionality at this point it shouldn't take much time.

-Bhanuprakash.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] netdev-native-tnl: Add assertion in vxlan_pop_header.

2018-01-12 Thread Bodireddy, Bhanuprakash
Hi Ben,

>On Fri, Jan 12, 2018 at 05:43:13PM +, Bhanuprakash Bodireddy wrote:
>> During tunnel decapsulation the below steps are performed:
>>  [1] Tunnel information is populated in packet metadata, i.e., packet->md->tunnel.
>>  [2] Outer header gets popped.
>>  [3] Packet is recirculated.
>>
>> For [1] to work, the dp_packet L3 and L4 header offsets should be valid.
>> The offsets in the dp_packet are set as part of miniflow extraction.
>>
>> If offsets are accidentally reset (or) the pop header operation is
>> performed prior to miniflow extraction, step [1] fails silently and
>> creates issues that are harder to debug. Add the assertion to check if
>> the offsets are valid.
>>
>> Signed-off-by: Bhanuprakash Bodireddy
>> 
>> ---
>>  lib/netdev-native-tnl.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/lib/netdev-native-tnl.c b/lib/netdev-native-tnl.c index
>> 9ce8567..fb5eab0 100644
>> --- a/lib/netdev-native-tnl.c
>> +++ b/lib/netdev-native-tnl.c
>> @@ -508,6 +508,9 @@ netdev_vxlan_pop_header(struct dp_packet
>*packet)
>>  ovs_be32 vx_flags;
>>  enum packet_type next_pt = PT_ETH;
>>
>> +ovs_assert(packet->l3_ofs > 0);
>> +ovs_assert(packet->l4_ofs > 0);
>> +
>>  pkt_metadata_init_tnl(md);
>>  if (VXLAN_HLEN > dp_packet_l4_size(packet)) {
>>  goto err;
>
>Thanks for working to make OVS more reliable.
>
>How much risk do you think there is of these assertions triggering?  Are you
>debugging an issue where they would trigger, and has that been fixed?  I'm
>trying to figure out whether it makes more sense to put assertions here or
>whether something closer to a log message plus a jump to "err" would be
>better.  It's not great for OVS to assert-fail, but on the other hand if it 
>indicates
>a genuine bug then sometimes it's the best thing to do.

I was working on an RFC patch to skip recirculation on the VXLAN decap side.
I posted it today at
https://mail.openvswitch.org/pipermail/ovs-dev/2018-January/343103.html

In that implementation the VXLAN header is popped before miniflow extraction,
and that's when I ran into the above-mentioned problem.

I also found that if dp_packet_reset_packet() or dp_packet_reset_offsets() is
accidentally called, it clears the offsets, and any later invocation of
*vxlan_pop_header(), or for that matter any code that uses the dp_packet L3/L4
offsets, will fail. So I added the assertions to make this more explicit for
VXLAN.

Please note that there isn't any bug in the master code; this was done as a
precautionary measure to improve debuggability.

- Bhanuprakash.


Re: [ovs-dev] [PATCH] dpif-netdev: Allocate dp_netdev_pmd_thread struct by xzalloc_cacheline.

2017-12-08 Thread Bodireddy, Bhanuprakash
>
>On 08.12.2017 18:44, Bodireddy, Bhanuprakash wrote:
>>>
>>> On 08.12.2017 16:45, Stokes, Ian wrote:
>>>>> All instances of struct dp_netdev_pmd_thread are allocated by
>>>>> xzalloc and therefore doesn't guarantee memory allocation aligned
>>>>> on CACHE_LINE_SIZE boundary. Due to this any padding done inside
>>>>> the structure with this assumption might create holes.
>>>>>
>>>>> This commit replaces xzalloc, free with xzalloc_cacheline and
>>>>> free_cacheline. With the changes the memory is 64 byte aligned.
>>>>
>>>> Thanks for this Bhanu,
>>>>
>>>> I think this looks OK and I'm considering pushing to the DPDK_Merge
>>>> branch
>>> but as there has been a fair bit of debate lately regarding memory
>>> and cache alignment I want to flag to others who have engaged to date
>>> to have their say before I apply it as there has been no input yet for the
>patch.
>>>>
>>>> @Jan/Ilya, are you ok with this change?
>>>
>>> OVS will likely crash on destroying non_pmd thread because it still
>>> allocated by usual xzalloc, but freed with others by free_cacheline().
>>
>> Are you sure OvS crashes in this case and reproducible?
>> Firstly I didn't see a crash and to double check this I enabled a DBG
>> in dp_netdev_destroy_pmd() to see if free_cacheline() is called for
>> the non pmd thread (whose core_id is NON_PMD_CORE_ID) and that
>doesn't seem to be hitting and gets hit only for pmd threads having valid
>core_ids.
>
>This should happen in dp_netdev_free() on ovs exit or deletion of the
>datapath.
>
>I guess, you need following patch to reproduce:
>https://mail.openvswitch.org/pipermail/ovs-dev/2017-
>December/341617.html
>
>Ian is going to include it to the closest pull request.
>
>Even if it's not reproducible you have to fix memory allocation for non_pmd
>anyway. Current code logically wrong.

OK, that makes sense. I will use xzalloc_cacheline() to allocate memory for
non_pmd too.

Bhanuprakash.

>
>>
>> Also AFAIK, non pmd thread is nothing but vswitchd thread and I don’t
>> see how that can be freed from the above function.  Also I started
>wondering where the memory allocated for non_pmd thread is getting freed
>now.
>>
>> Let me know the steps if you can reproduce the crash as you mentioned.
>>
>> - Bhanuprakash.
>>
>>>
>>>>
>>>> Thanks
>>>> Ian
>>>>
>>>>>
>>>>> Before:
>>>>> With xzalloc, all the memory is 16 byte aligned.
>>>>>
>>>>> (gdb) p pmd
>>>>> $1 = (struct dp_netdev_pmd_thread *) 0x7eff8a813010
>>>>> (gdb) p &pmd->cacheline0
>>>>> $2 = (OVS_CACHE_LINE_MARKER *) 0x7eff8a813010
>>>>> (gdb) p &pmd->cacheline1
>>>>> $3 = (OVS_CACHE_LINE_MARKER *) 0x7eff8a813050
>>>>> (gdb) p &pmd->flow_cache
>>>>> $4 = (struct emc_cache *) 0x7eff8a813090
>>>>> (gdb) p &pmd->flow_table
>>>>> $5 = (struct cmap *) 0x7eff8acb30d0
>>>>> (gdb) p &pmd->stats
>>>>> $6 = (struct dp_netdev_pmd_stats *) 0x7eff8acb3110
>>>>> (gdb) p &pmd->port_mutex
>>>>> $7 = (struct ovs_mutex *) 0x7eff8acb3150
>>>>> (gdb) p &pmd->poll_list
>>>>> $8 = (struct hmap *) 0x7eff8acb3190
>>>>> (gdb) p &pmd->tnl_port_cache
>>>>> $9 = (struct hmap *) 0x7eff8acb31d0
>>>>> (gdb) p &pmd->stats_zero
>>>>> $10 = (unsigned long long (*)[5]) 0x7eff8acb3210
>>>>>
>>>>> After:
>>>>> With xzalloc_cacheline, all the memory is 64 byte aligned.
>>>>>
>>>>> (gdb) p pmd
>>>>> $1 = (struct dp_netdev_pmd_thread *) 0x7f39e2365040
>>>>> (gdb) p &pmd->cacheline0
>>>>> $2 = (OVS_CACHE_LINE_MARKER *) 0x7f39e2365040
>>>>> (gdb) p &pmd->cacheline1
>>>>> $3 = (OVS_CACHE_LINE_MARKER *) 0x7f39e2365080
>>>>> (gdb) p &pmd->flow_cache
>>>>> $4 = (struct emc_cache *) 0x7f39e23650c0
>>>>> (gdb) p &pmd->flow_table
>>>>> $5 = (struct cmap *) 0x7f39e2805100
>>>>> (gdb) p &pmd->stats
>>>>> $6 = (struct dp_netdev_pmd_stats *) 0x7f39e2805140
>>>>> (gdb) p &pmd->port_mutex
>>>>> $7 = (s

Re: [ovs-dev] [PATCH] dpif-netdev: Allocate dp_netdev_pmd_thread struct by xzalloc_cacheline.

2017-12-08 Thread Bodireddy, Bhanuprakash
>
>On 08.12.2017 16:45, Stokes, Ian wrote:
>>> All instances of struct dp_netdev_pmd_thread are allocated by xzalloc
>>> and therefore doesn't guarantee memory allocation aligned on
>>> CACHE_LINE_SIZE boundary. Due to this any padding done inside the
>>> structure with this assumption might create holes.
>>>
>>> This commit replaces xzalloc, free with xzalloc_cacheline and
>>> free_cacheline. With the changes the memory is 64 byte aligned.
>>
>> Thanks for this Bhanu,
>>
>> I think this looks OK and I'm considering pushing to the DPDK_Merge branch
>but as there has been a fair bit of debate lately regarding memory and cache
>alignment I want to flag to others who have engaged to date to have their say
>before I apply it as there has been no input yet for the patch.
>>
>> @Jan/Ilya, are you ok with this change?
>
>OVS will likely crash on destroying non_pmd thread because it still allocated 
>by
>usual xzalloc, but freed with others by free_cacheline().

Are you sure OVS crashes in this case, and is it reproducible?
Firstly, I didn't see a crash. To double-check, I enabled a debug log in
dp_netdev_destroy_pmd() to see whether free_cacheline() is called for the
non-PMD thread (whose core_id is NON_PMD_CORE_ID); it doesn't seem to be hit,
and is hit only for PMD threads with valid core_ids.

Also, AFAIK the non-PMD thread is nothing but the vswitchd thread, and I don't
see how that can be freed from the above function. I also started wondering
where the memory allocated for the non-PMD thread is currently freed.

Let me know the steps if you can reproduce the crash as you mentioned.

- Bhanuprakash.

>
>>
>> Thanks
>> Ian
>>
>>>
>>> Before:
>>> With xzalloc, all the memory is 16 byte aligned.
>>>
>>> (gdb) p pmd
>>> $1 = (struct dp_netdev_pmd_thread *) 0x7eff8a813010
>>> (gdb) p &pmd->cacheline0
>>> $2 = (OVS_CACHE_LINE_MARKER *) 0x7eff8a813010
>>> (gdb) p &pmd->cacheline1
>>> $3 = (OVS_CACHE_LINE_MARKER *) 0x7eff8a813050
>>> (gdb) p &pmd->flow_cache
>>> $4 = (struct emc_cache *) 0x7eff8a813090
>>> (gdb) p &pmd->flow_table
>>> $5 = (struct cmap *) 0x7eff8acb30d0
>>> (gdb) p &pmd->stats
>>> $6 = (struct dp_netdev_pmd_stats *) 0x7eff8acb3110
>>> (gdb) p &pmd->port_mutex
>>> $7 = (struct ovs_mutex *) 0x7eff8acb3150
>>> (gdb) p &pmd->poll_list
>>> $8 = (struct hmap *) 0x7eff8acb3190
>>> (gdb) p &pmd->tnl_port_cache
>>> $9 = (struct hmap *) 0x7eff8acb31d0
>>> (gdb) p &pmd->stats_zero
>>> $10 = (unsigned long long (*)[5]) 0x7eff8acb3210
>>>
>>> After:
>>> With xzalloc_cacheline, all the memory is 64 byte aligned.
>>>
>>> (gdb) p pmd
>>> $1 = (struct dp_netdev_pmd_thread *) 0x7f39e2365040
>>> (gdb) p &pmd->cacheline0
>>> $2 = (OVS_CACHE_LINE_MARKER *) 0x7f39e2365040
>>> (gdb) p &pmd->cacheline1
>>> $3 = (OVS_CACHE_LINE_MARKER *) 0x7f39e2365080
>>> (gdb) p &pmd->flow_cache
>>> $4 = (struct emc_cache *) 0x7f39e23650c0
>>> (gdb) p &pmd->flow_table
>>> $5 = (struct cmap *) 0x7f39e2805100
>>> (gdb) p &pmd->stats
>>> $6 = (struct dp_netdev_pmd_stats *) 0x7f39e2805140
>>> (gdb) p &pmd->port_mutex
>>> $7 = (struct ovs_mutex *) 0x7f39e2805180
>>> (gdb) p &pmd->poll_list
>>> $8 = (struct hmap *) 0x7f39e28051c0
>>> (gdb) p &pmd->tnl_port_cache
>>> $9 = (struct hmap *) 0x7f39e2805200
>>> (gdb) p &pmd->stats_zero
>>> $10 = (unsigned long long (*)[5]) 0x7f39e2805240
>>>
>>> Reported-by: Ilya Maximets 
>>> Signed-off-by: Bhanuprakash Bodireddy
>>> 
>>> ---
>>>  lib/dpif-netdev.c | 4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
>>> db78318..3e281ae
>>> 100644
>>> --- a/lib/dpif-netdev.c
>>> +++ b/lib/dpif-netdev.c
>>> @@ -3646,7 +3646,7 @@ reconfigure_pmd_threads(struct dp_netdev
>*dp)
>>>      FOR_EACH_CORE_ON_DUMP(core, pmd_cores) {
>>>          pmd = dp_netdev_get_pmd(dp, core->core_id);
>>>          if (!pmd) {
>>> -            pmd = xzalloc(sizeof *pmd);
>>> +            pmd = xzalloc_cacheline(sizeof *pmd);
>>>              dp_netdev_configure_pmd(pmd, dp, core->core_id, core->numa_id);
>>>              pmd->thread = ovs_thread_create("pmd", pmd_thread_main, pmd);
>>>              VLOG_INFO("PMD thread on numa_id: %d, core id: %2d created.",
>>> @@ -4574,7 +4574,7 @@ dp_netdev_destroy_pmd(struct dp_netdev_pmd_thread *pmd)
>>>      xpthread_cond_destroy(&pmd->cond);
>>>      ovs_mutex_destroy(&pmd->cond_mutex);
>>>      ovs_mutex_destroy(&pmd->port_mutex);
>>> -    free(pmd);
>>> +    free_cacheline(pmd);
>>>  }
>>>
>>>  /* Stops the pmd thread, removes it from the 'dp->poll_threads',
>>> --
>>> 2.4.11
>>>


Re: [ovs-dev] [PATCH] dpif-netdev: Optimize the exact match lookup.

2017-12-08 Thread Bodireddy, Bhanuprakash
Hi Tonghao,

>On Thu, Jul 27, 2017 at 11:38:00PM -0700, Tonghao Zhang wrote:
>> When inserting or updating (e.g. emc_insert) a flow to EMC, we compare
>> (e.g the hash and miniflow ) the netdev_flow_key.
>> If the key is matched, we will update it. If we didn’t find the
>> miniflow in the cache, the new flow will be stored.
>>
>> But when looking up the flow, we compare the hash and miniflow of key
>> and make sure it is alive. If a flow is not alive but the key is
>> matched, we still will go to next loop. More important, we can’t find
>> the flow in the next loop (the flow is not alive in the previous
>> loop). This patch simply compares the miniflows of the packets.
>>
>> The topo is shown as below. VM01 sends TCP packets to VM02, and OvS
>> forwards packtets.
>>
>>  VM01 -- OVS+DPDK VM02 -- VM03
>>
>> With this patch, the TCP throughput between VMs is 5.37, 5.45, 5.48,
>> 5.59, 5.65, 5.60 Gbs/sec avg: 5.52 Gbs/sec
>>
>> up to:
>> 5.64, 5.65, 5.66, 5.67, 5.62, 5.67 Gbs/sec avg: 5.65 Gbs/sec
>>
>> (maybe ~2.3% performance improve, but it is hard to tell exactly due
>> to variance in the test results).
>>
>> Signed-off-by: Tonghao Zhang 
>
>Thank you for the patch.  I haven't spotted any reviews for this on the mailing
>list.  I apologize for that--usually I expect to see a review much more quickly
>than this.  I hope that someone who understands the dpif-netdev code well
>will provide a review soon.

I reviewed and tested this patch; the performance improvement is marginal and
varies a lot depending on the traffic pattern.

In the original implementation, if the hashes match and the entry is alive in
the EMC, the miniflows are compared using memcmp(), which takes significant
cycles.
With the change proposed in this patch, if the hash matches we would do the
miniflow comparison first (which takes significant cycles, depending on
key->len) and only then check whether the entry is alive. If the entry isn't
alive (e.g. with the EMC saturated and packets hitting the classifier), we
would have wasted a lot of cycles doing the expensive memcmp().

What do you think?

- Bhanuprakash.


Re: [ovs-dev] [PATCH v6 1/7] dpif-netdev: Refactor PMD thread structure for further extension.

2017-12-07 Thread Bodireddy, Bhanuprakash
>On 07/12/17 14:28, Ilya Maximets wrote:
>> Thanks for review, comments inline.
>>
>> On 07.12.2017 15:49, Eelco Chaudron wrote:
>>> On 01/12/17 16:44, Ilya Maximets wrote:
 This is preparation for 'struct dp_netdev_pmd_thread' modification
 in upcoming commits. Needed to avoid reordering and regrouping while
 replacing old and adding new members.

>>> Should this be part of the TX batching set? Anyway, I'm ok if it's
>>> not stalling the approval :)
>> Unfortunately yes, because members reordered and regrouped just to
>> include new members: pmd->ctx and pmd->n_output_batches. This could
>> not be a standalone change because adding of different members will
>> require different regrouping/ reordering. I moved this change to a
>> separate patch to not do this twice while adding each member in patches
>2/7 and 6/7.
>>
>> Anyway, as I mentioned in cover letter, I still prefer reverting of
>> the padding at all by this patch:
>>
>> https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341153.html

I understand that with the PADDED_MEMBERS macro it was slightly tricky to
extend or reorganize the structure, which is why I suggested 'pahole'.
But I see that the problem hasn't gone away and there are still some strong
opinions in favor of reverting the earlier effort.

I don't mind reverting the patch, but it would be nice if changes to this
structure were made with alignment in mind.

- Bhanuprakash.


Re: [ovs-dev] [PATCH RFC 2/5] configure: Include -mprefetchwt1 explicitly.

2017-12-07 Thread Bodireddy, Bhanuprakash
>> >> If CPU just skips this instruction we will lost all the prefetching
>> >> optimizations because all the calls will be replaced by non-existent
>'prefetchwt1'.
>> >
>> > [Bhanu] I would be worried if core generates an exception treating
>> > it as illegal instruction. Instead pipeline units treat this as NOP
>> > if it
>> doesn't support it.
>> > So the micro optimizations doesn't really do any thing on the processors
>that doesn't support it.
>>
>> This could be an issue. If someday we'll have real performance
>> optimization based on OPCH_HTW prefetch, we will have prefetchwt1 on
>> system that supports it and NOP on others even if they have usual
>> prefetchw which could provide performance improvement too.

[Bhanu] Adding the information below only for future reference (I am going to
point to this thread in the commit log).

On systems that have *only* prefetchw and no prefetchwt1 instruction:
  OPCH_LTW  -  prefetchw
  OPCH_MTW  -  prefetchw
  OPCH_HTW  -  prefetchw
  OPCH_NTW  -  prefetchw

On systems that support both prefetchw and prefetchwt1:
  OPCH_LTW  -  prefetchwt1
  OPCH_MTW  -  prefetchwt1
  OPCH_HTW  -  prefetchw
  OPCH_NTW  -  prefetchwt1

So OPCH_HTW always maps to prefetchw, and OPCH_LTW/MTW/NTW may turn into NOPs
on processors that support prefetchw alone
(when compiled with CFLAGS="-march=native -mprefetchwt1").

>>
>> As I understand, checking of '-mprefetchwt1' is equal to checking
>> compiler version. It doesn't check anything about supporting of this
>instruction in CPU.
>> This could end up with non-working performance optimizations and even
>> degradation on systems that supports usual prefetches but not
>> prefetchwt1 (useless NOPs degrades performance if they are on a hot
>path).
>>
>> IMHO, This compiler option should be passed only if CPU really supports it.
>> I guess, the maximum that we can do is add a note into performance
>> optimization guide that '-mprefetchwt1' could be passed via CFLAGS if
>> user sure that it supported by target CPU.
>
>That is my thinking as well. The people/organizations building OVS packages
>for deployment have the responsibility to specify the minimum requirements
>on the target architecture and feed that into the compiler using CFLAGS. That
>may well be leaning towards the lower end of capabilities to maximize
>compatibility and sacrifice some performance on high-end CPUs.
>
>The specialized prefetch macros should be mapped to the best available
>target instructions by the compiler and/or conditional compile directives
>based on the CFLAGS architecture settings.
>
>We would gather all these target-specific compiler optimization guidelines in
>the advanced DPDK documentation of OVS.
>
>Of course developers or benchmark testers are free to use -march=native or
>similar at their discretion in their local test beds for best possible 
>performance.

If the general view is to get rid of this flag at compile time and only
document it, I am happy with that and can update the documentation.
But I still think we are being too defensive here; with a few NOPs the
performance impact isn't even noticeable.

- Bhanuprakash.


Re: [ovs-dev] [PATCH RFC 2/5] configure: Include -mprefetchwt1 explicitly.

2017-12-05 Thread Bodireddy, Bhanuprakash
[...]
>int main()
>{
>    int c;
>
>    __builtin_prefetch(&c, 1, 1);
>    c = 8;
>
>    return c;
>}
>
>on my old Ivy Bridge i7-3770 CPU. It does not support even 'prefetchw':
>
>  PREFETCHWT1  = false
>  3DNow! PREFETCH/PREFETCHW instructions = false
>
>Results:

[Bhanu] I found https://gcc.godbolt.org/ the other day; it's handy for
generating code for different targets and compilers.

>$ gcc 1.c
>$ objdump -S ./a.out | grep prefetch -A2 -B2
>  40055b:   31 c0   xor%eax,%eax
>  40055d:   48 8d 45 f4 lea-0xc(%rbp),%rax
>  400561:   0f 18 18prefetcht2 (%rax)
>  400564:   c7 45 f4 08 00 00 00movl   $0x8,-0xc(%rbp)
>  40056b:   8b 45 f4mov-0xc(%rbp),%eax

[Bhanu] Expected: the compiler generates prefetcht2.

>
>$ gcc 1.c -march=native
>$ objdump -S ./a.out | grep prefetch -A2 -B2
>  40055b:   31 c0   xor%eax,%eax
>  40055d:   48 8d 45 f4 lea-0xc(%rbp),%rax
>  400561:   0f 18 18prefetcht2 (%rax)
>  400564:   c7 45 f4 08 00 00 00movl   $0x8,-0xc(%rbp)
>  40056b:   8b 45 f4mov-0xc(%rbp),%eax

[Bhanu] Though -march=native is specified, the processor doesn't support the
instruction, and the compiler still generates prefetcht2.

>$ gcc 1.c -march=native -mprefetchwt1
>$ objdump -S ./a.out | grep prefetch -A2 -B2
>  40055b:   31 c0   xor%eax,%eax
>  40055d:   48 8d 45 f4 lea-0xc(%rbp),%rax
>  400561:   0f 0d 10prefetchwt1 (%rax)
>  400564:   c7 45 f4 08 00 00 00movl   $0x8,-0xc(%rbp)
>  40056b:   8b 45 f4mov-0xc(%rbp),%eax

[Bhanu] The compiler inserts prefetchwt1 instruction as we asked it to do.

>
>So, it inserts this instruction even if I have no such instruction in CPU.

[Bhanu]
Though the compiler generates this, since the instruction isn't available on
the processor it just becomes a multi-byte no-operation (NOP).
On Intel processors that don't have prefetchw, and AMD processors without the
3DNow! feature, it decodes into a NOP.
http://ref.x86asm.net/coder64.html#x0F0D
   - Click on '0D' in the two-byte opcode index - (16. 0F0D NOP)
   - More information can be found in the Intel Software Developer's Manual
     (Combined Volumes)

>More interesting is that program still works without any issues.
>I assume that CPU just skips that instruction or executes something else.

[Bhanu] This is what is mostly expected: on processors that support
prefetchwt1 it executes; on others it just becomes a NOP.

>
>So, it's really strange and it's unclear what CPU really executes in case where
>we have 'prefetchwt1' in code but not supported by CPU.

[Bhanu] It's likely decoded into a NOP by the pipeline decode units.

>
>If CPU just skips this instruction we will lost all the prefetching 
>optimizations
>because all the calls will be replaced by non-existent 'prefetchwt1'.

[Bhanu] I would be worried if the core generated an exception, treating it as
an illegal instruction. Instead, the pipeline treats it as a NOP if the
instruction isn't supported, so the micro-optimizations simply do nothing on
processors that don't support it.

>
>How can we be sure that 'prefetchwt1' was really executed?

[Bhanu] I don't know how we can see this unless we can peek into the
instruction queues and decoders of the pipeline :(.

- Bhanuprakash.


Re: [ovs-dev] [PATCH RFC 5/5] dpif-netdev: Prefetch the cacheline having the cycle stats.

2017-12-05 Thread Bodireddy, Bhanuprakash
>
>> Prefetch the cacheline having the cycle stats so that we can speed up
>> the cycles_count_start() and cycles_count_intermediate().
>
>Do you have any performance results?

I don't have numbers for this patch alone. I was testing the overall
throughput along with other patches (that were *not* part of this RFC series)
to verify the performance improvements. I will include numbers in the commit
log when I measure the individual patches.

BTW, I usually look at the percentage of total instructions retired and the
cycles spent in the front end and back end for the functions in question, to
see whether prefetching improves or degrades performance.

- Bhanuprakash.

>
>>
>> Signed-off-by: Bhanuprakash Bodireddy > intel.com>
>> ---
>>  lib/dpif-netdev.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
>> b74b5d7..ab13d83 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -576,7 +576,7 @@ struct dp_netdev_pmd_thread {
>>  struct ovs_mutex flow_mutex;
>>  /* 8 pad bytes. */
>>  );
>> -PADDED_MEMBERS(CACHE_LINE_SIZE,
>> +PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE,
>cachelineC,
>>  struct cmap flow_table OVS_GUARDED; /* Flow table. */
>>
>>  /* One classifier per in_port polled by the pmd */ @@ -4082,6
>> +4082,7 @@ reload:
>>  lc = UINT_MAX;
>>  }
>>
>> +    OVS_PREFETCH_CACHE(&pmd->cachelineC, OPCH_HTW);
>>  cycles_count_start(pmd);
>>  for (;;) {
>>  for (i = 0; i < poll_cnt; i++) {
>> --
>> 2.4.11


Re: [ovs-dev] [PATCH RFC 2/5] configure: Include -mprefetchwt1 explicitly.

2017-12-05 Thread Bodireddy, Bhanuprakash
>>>On Mon, Dec 04, 2017 at 08:16:47PM +, Bhanuprakash Bodireddy
>wrote:
 Processors support prefetch instruction in anticipation of write but
 compilers(gcc) won't use them unless explicitly asked to do so even
 with '-march=native' specified.

 [Problem]
   Case A:
 OVS_PREFETCH_CACHE(addr, OPCH_HTW)
__builtin_prefetch(addr, 1, 3)
  leaq-112(%rbp), %rax[Assembly]
  prefetchw  (%rax)

   Case B:
 OVS_PREFETCH_CACHE(addr, OPCH_LTW)
__builtin_prefetch(addr, 1, 1)
  leaq-112(%rbp), %rax[Assembly]
  prefetchw  (%rax) <***problem***>

   Inspite of specifying -march=native and using Low Temporal
>>>Write(OPCH_LTW),
   the compiler generates 'prefetchw' instruction instead of 'prefetchwt1'
   instruction available on processor.

 [Solution]
   Include -mprefetchwt1

   Case B:
 OVS_PREFETCH_CACHE(addr, OPCH_LTW)
__builtin_prefetch(addr, 1, 1)
  leaq-112(%rbp), %rax[Assembly]
  prefetchwt1  (%rax)

 [Testing]
   $ ./boot.sh
   $ ./configure
  checking target hint for cgcc... x86_64
  checking whether gcc accepts -mprefetchwt1... yes
   $ make -j

 Signed-off-by: Bhanuprakash Bodireddy >>> intel.com>
>>>
>>>Does this have any effect if the architecture or CPU configured for
>>>use does not support prefetchwt1?
>>
>> That's a good question and I spent reasonable time today to figure this out.
>> I have Haswell, Broadwell and Skylake CPUs and they all support this
>instruction.
>
>Hmm. I have 2 different Broadwell machines (Xeon E5 v4 and i7-6800K) and
>both of them doesn't have prefetchwt1 instruction according to cpuid:
>
>   PREFETCHWT1  = false

The Xeon E5-26XX v4 is a Broadwell workstation/server part, but the i7-6800K
is a Skylake desktop variant, whereas the E3-12XX v5 is the equivalent Skylake
workstation/server variant.
AFAIK, prefetchwt1 should be available on the above processors; I am not sure
why cpuid reports otherwise.

pmd_thread_main()
---
WITH OPCH_HTW, we see prefetchw instruction. 

OVS_PREFETCH_CACHE(&pmd->cachelineC, OPCH_HTW);
cycles_count_start(pmd);
for (;;) {
for (i = 0; i < poll_cnt; i++) {
process_packets =
dp_netdev_process_rxq_port(pmd, poll_list[i].rxq->rx,
   poll_list[i].port_no);
cycles_count_intermediate(pmd, poll_list[i].rxq,


Address    Source Line    Assembly
0x6e29ef   4,086          movl  0x823ecb(%rip), %edi
0x6e29f5   4,085          movq  0x50(%rsp), %rax
0x6e29fa   4,086          test %edi, %edi
0x6e29fc   4,085          prefetchw  (%rax)


With OPCH_LTW, we can see the prefetchwt1b instruction being used (a change
was made to show this).

OVS_PREFETCH_CACHE(&pmd->cachelineC, OPCH_LTW);
cycles_count_start(pmd);
for (;;) {
for (i = 0; i < poll_cnt; i++) {
..

Address    Source Line    Assembly
0x6e29ef   4,086          movl  0x823ecb(%rip), %edi
0x6e29f5   4,085          movq  0x50(%rsp), %rax
0x6e29fa   4,086          test %edi, %edi
0x6e29fc   4,085          prefetchwt1b  (%rax)

-

>
>This means that introducing of this change will break binary compatibility even
>between CPUs of the same generation, i.e. I will not be able to run on my
>system binaries compiled on yours.
>
>If it's true I prefer to not have this change.
>
>Anyway adding of this change will make compiling a generic binary for a
>different platforms impossible if your build server supports prefetchwt1.
>There should be way to disable this arch specific compiler flag even if it
>supported on my current platform.

I see your point: a build server can be more advanced and support the
prefetchwt1 instruction. When the precompiled binaries are copied to and run
on a server that doesn't support it, how does this behave?

I'm not sure about this. Maybe Red Hat/Canonical developers can comment on how
they handle this kind of case.

I will try to check this on my side.

- Bhanuprakash.

>
>Best regards, Ilya Maximets.
>
>> But I found that this instruction isn't enabled by default even with
>march=native and so need to explicitly enable this.
>>
>> Coming 

Re: [ovs-dev] [PATCH v6 0/7] Output packet batching.

2017-12-05 Thread Bodireddy, Bhanuprakash
>I have retested your "Output patches batching" v6 in our standard PVP L3-
>VPN/VXLAN benchmark setup [1]. The configuration is a single PMD serving a
>physical 10G port and a VM running DPDK testpmd as IP reflector with 4
>equally loaded vhostuser ports. The tests are run with 64 byte packets. Below
>are Mpps values averaged over four 10 second runs:
>
>Flows     master     patch                  patch
>          Mpps       tx-flush-interval=0    tx-flush-interval=50
>8         4.419      4.342   -1.7%          4.749    7.5%
>100       4.026      3.956   -1.7%          4.281    6.3%
>1000      3.630      3.632    0.1%          3.760    3.6%
>2000      3.394      3.390   -0.1%          3.490    2.8%
>5000      2.989      2.938   -1.7%          2.994    0.2%
>10000     2.756      2.711   -1.6%          2.746   -0.4%
>20000     2.641      2.598   -1.6%          2.622   -0.7%
>50000     2.604      2.558   -1.8%          2.579   -1.0%
>100000    2.598      2.552   -1.8%          2.572   -1.0%
>500000    2.598      2.550   -1.8%          2.571   -1.0%
>
>As expected output batching within rx bursts (tx-flush-interval=0) provides
>little or no benefit in this scenario. The test results reflect roughly a 1.7%
>performance penalty due to the tx batching overhead. This overhead is
>measurable, but should in my eyes not be a blocker for merging this patch
>series.

I had a similar observation when I was testing for regressions in the
non-batching scenario:
https://mail.openvswitch.org/pipermail/ovs-dev/2017-October/339719.html

Since tx-flush-interval defaults to 0 (instant send) and that causes a
performance degradation, I recommend documenting this in one of the commits
and linking to these performance numbers (adding a Tested-at tag) so that
users can tune tx-flush-interval accordingly.
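Assuming the knob keeps the name used in the merged output-batching series (not confirmed in this thread), the tuning would look like:

```shell
# Enable time-based tx batching with a 50 us minimum flush interval.
# The default of 0 means instant send (no batching across rx bursts).
ovs-vsctl set Open_vSwitch . other_config:tx-flush-interval=50
```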

>
>Interestingly, tests with time-based tx batching and a minimum flush interval
>of 50 microseconds show a consistent and significant performance increase
>for small number of flows (in the regime where EMC is effective) and a
>reduced penalty of 1% for many flows. I don't have a good explanation yet for
>this phenomenon. I would be interested to see if other benchmark results
>support the general positive impact of time-based tx batching on throughput
>also for synthetic DPDK applications in the VM. The average Ping RTT increases
>by 20-30 us as expected.

I think this depends on tx-flush-interval and also should be documented.

- Bhanuprakash.

>
>We will also retest the performance improvement of time-based tx batching
>on interrupt driven Linux kernel applications (such as iperf3).
>
>BR, Jan
>
>> -Original Message-
>> From: Ilya Maximets [mailto:i.maxim...@samsung.com]
>> Sent: Friday, 01 December, 2017 16:44
>> To: ovs-dev@openvswitch.org; Bhanuprakash Bodireddy
>
>> Cc: Heetae Ahn ; Antonio Fischetti
>; Eelco Chaudron
>> ; Ciara Loftus ; Kevin
>Traynor ; Jan Scheurich
>> ; Ian Stokes ; Ilya
>Maximets 
>> Subject: [PATCH v6 0/7] Output packet batching.
>>
>> This patch-set inspired by [1] from Bhanuprakash Bodireddy.
>> Implementation of [1] looks very complex and introduces many pitfalls [2]
>> for later code modifications like possible packet stucks.
>>
>> This version targeted to make simple and flexible output packet batching on
>> higher level without introducing and even simplifying netdev layer.
>>
>> Basic testing of 'PVP with OVS bonding on phy ports' scenario shows
>> significant performance improvement.
>>
>> Test results for time-based batching for v3:
>> https://mail.openvswitch.org/pipermail/ovs-dev/2017-
>September/338247.html
>>
>> Test results for v4:
>> https://mail.openvswitch.org/pipermail/ovs-dev/2017-
>October/339624.html
>>
>> [1] [PATCH v4 0/5] netdev-dpdk: Use intermediate queue during packet
>transmission.
>> https://mail.openvswitch.org/pipermail/ovs-dev/2017-
>August/337019.html
>>
>> [2] For example:
>> https://mail.openvswitch.org/pipermail/ovs-dev/2017-
>August/337133.html
>>
>> Version 6:
>>  * Rebased on current master:
>>- Added new patch to refactor dp_netdev_pmd_thread structure
>>  according to following suggestion:
>>  https://mail.openvswitch.org/pipermail/ovs-dev/2017-
>November/341230.html
>>
>>NOTE: I still prefer reverting of the padding related patch.
>>  Rebase done to not block acepting of this series.
>>  Revert patch and discussion here:
>>  https://mail.openvswitch.org/pipermail/ovs-dev/2017-
>November/341153.html
>>
>>  * Added comment about pmd_thread_ctx_time_update() usage.
>>
>> Version 5:
>>  * pmd_thread_ctx_time_update() calls moved to different places to
>>call them only from dp_netdev_process_rxq_port() and main
>>polling functions:
>>  pmd_thread_main, dpif_netdev_run and
>dpif_netdev_execute.
>>  

Re: [ovs-dev] [PATCH RFC 3/5] util: Extend ovs_prefetch_range to include prefetch type.

2017-12-04 Thread Bodireddy, Bhanuprakash
>On Mon, Dec 04, 2017 at 08:16:48PM +, Bhanuprakash Bodireddy wrote:
>> With ovs_prefetch_range(), large amounts of data can be prefetched in
>> to caches. Prefetch type gives better control over data caching
>> strategy; Meaning where the data should be prefetched(L1/L2/L3) and if
>> the data reference is temporal or non-temporal.
>>
>> Signed-off-by: Bhanuprakash Bodireddy
>> 
>
>I'll leave review of patches 3-5 to others who better understand the specific
>issues.

No problem. I posted this as an RFC to get early feedback, and I am currently
looking at bottlenecks in other use cases (VXLAN, conntrack) with multiple PMD
threads where prefetching could be applied.

- Bhanuprakash.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH RFC 1/5] compiler: Introduce OVS_PREFETCH variants.

2017-12-04 Thread Bodireddy, Bhanuprakash
Hi Ben,

>On Mon, Dec 04, 2017 at 08:16:46PM +, Bhanuprakash Bodireddy wrote:
>> This commit introduces prefetch variants by using the GCC built-in
>> prefetch function.
>>
>> The prefetch variants gives the user better control on designing data
>> caching strategy in order to increase cache efficiency and minimize
>> cache pollution. Data reference patterns here can be classified in to
>>
>>  - Non-temporal(NT) - Data that is referenced once and not reused in
>>   immediate future.
>>  - Temporal - Data will be used again soon.
>>
>> The Macro variants can be used where there are
>>  - Predictable memory access patterns.
>>  - Execution pipeline can stall if data isn't available.
>>  - Time consuming loops.
>>
>> For example:
>>
>>   OVS_PREFETCH_CACHE(addr, OPCH_LTR)
>> - OPCH_LTR : OVS PREFETCH CACHE HINT-LOW TEMPORAL READ.
>> - __builtin_prefetch(addr, 0, 1)
>> - Prefetch data in to L3 cache for readonly purpose.
>>
>>   OVS_PREFETCH_CACHE(addr, OPCH_HTW)
>> - OPCH_HTW : OVS PREFETCH CACHE HINT-HIGH TEMPORAL WRITE.
>> - __builtin_prefetch(addr, 1, 3)
>> - Prefetch data in to all caches in anticipation of write. In doing
>>   so it invalidates other cached copies so as to gain 'exclusive'
>>   access.
>>
>>   OVS_PREFETCH(addr)
>> - OPCH_HTR : OVS PREFETCH CACHE HINT-HIGH TEMPORAL READ.
>> - __builtin_prefetch(addr, 0, 3)
>> - Prefetch data in to all caches in anticipation of read and that
>>   data will be used again soon (HTR - High Temporal Read).
>>
>> Signed-off-by: Bhanuprakash Bodireddy
>> 
>
>The information in this commit message seems like it could also be useful as
>part of a code comment.

This makes sense and I can include this in the code comments with some examples 
of usage.
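Such a comment could, for instance, sketch how the hints map onto GCC's builtin. The hint encodings below are taken from the commit message; the macro spellings are illustrative, not the RFC's actual code:

```c
/* GCC's __builtin_prefetch(addr, rw, locality): 'rw' is 0 for read,
 * 1 for write; 'locality' runs from 0 (non-temporal) to 3 (keep in
 * all cache levels).  Each hint expands to an "rw, locality" pair. */
#define OPCH_LTR 0, 1   /* Low temporal read:   prefetch into L3 for read. */
#define OPCH_HTW 1, 3   /* High temporal write: all caches, exclusive.     */
#define OPCH_HTR 0, 3   /* High temporal read:  all caches, for read.      */

#define OVS_PREFETCH_CACHE(addr, hint) __builtin_prefetch((addr), hint)
```

Usage matches the RFC examples, e.g. `OVS_PREFETCH_CACHE(addr, OPCH_HTW)` expands to `__builtin_prefetch(addr, 1, 3)`.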

- Bhanuprakash.




Re: [ovs-dev] [PATCH] util: Make xmalloc_cacheline() allocate full cachelines.

2017-11-30 Thread Bodireddy, Bhanuprakash
>On Wed, Nov 29, 2017 at 08:02:17AM +0000, Bodireddy, Bhanuprakash wrote:
>> >
>> >On Tue, Nov 28, 2017 at 09:06:09PM +, Bodireddy, Bhanuprakash
>wrote:
>> >> >Until now, xmalloc_cacheline() has provided its caller memory that
>> >> >does not share a cache line, but when posix_memalign() is not
>> >> >available it did not provide a full cache line; instead, it
>> >> >returned memory that was offset 8 bytes into a cache line.  This
>> >> >makes it hard for clients to design structures to be cache
>> >> >line-aligned.  This commit changes
>> >> >xmalloc_cacheline() to always return a full cache line instead of
>> >> >memory offset into one.
>> >> >
>> >> >Signed-off-by: Ben Pfaff <b...@ovn.org>
>> >> >---
>> >> > lib/util.c | 60
>> >> >---
>> >> >-
>> >> > 1 file changed, 32 insertions(+), 28 deletions(-)
>> >> >
>> >> >diff --git a/lib/util.c b/lib/util.c index
>> >> >9e6edd27ae4c..137091a3cd4f 100644
>> >> >--- a/lib/util.c
>> >> >+++ b/lib/util.c
>> >> >@@ -196,15 +196,9 @@ x2nrealloc(void *p, size_t *n, size_t s)
>> >> > return xrealloc(p, *n * s);
>> >> > }
>> >> >
>> >> >-/* The desired minimum alignment for an allocated block of memory.
>> >> >*/ - #define MEM_ALIGN MAX(sizeof(void *), 8) -
>> >> >BUILD_ASSERT_DECL(IS_POW2(MEM_ALIGN));
>> >> >-BUILD_ASSERT_DECL(CACHE_LINE_SIZE >= MEM_ALIGN);
>> >> >-
>> >> >-/* Allocates and returns 'size' bytes of memory in dedicated
>> >> >cache lines.  That
>> >> >- * is, the memory block returned will not share a cache line with
>> >> >other data,
>> >> >- * avoiding "false sharing".  (The memory returned will not be at
>> >> >the start of
>> >> >- * a cache line, though, so don't assume such alignment.)
>> >> >+/* Allocates and returns 'size' bytes of memory aligned to a
>> >> >+cache line and in
>> >> >+ * dedicated cache lines.  That is, the memory block returned
>> >> >+will not share a
>> >> >+ * cache line with other data, avoiding "false sharing".
>> >> >  *
>> >> >  * Use free_cacheline() to free the returned memory block. */
>> >> >void
>> >> >* @@ -
>> >> >221,28 +215,37 @@ xmalloc_cacheline(size_t size)
>> >> > }
>> >> > return p;
>> >> > #else
>> >> >-void **payload;
>> >> >-void *base;
>> >> >-
>> >> > /* Allocate room for:
>> >> >  *
>> >> >- * - Up to CACHE_LINE_SIZE - 1 bytes before the payload, so that
>the
>> >> >- *   start of the payload doesn't potentially share a cache 
>> >> >line.
>> >> >+ * - Header padding: Up to CACHE_LINE_SIZE - 1 bytes, to allow 
>> >> >the
>> >> >+ *   pointer to be aligned exactly sizeof(void *) bytes before 
>> >> >the
>> >> >+ *   beginning of a cache line.
>> >> >  *
>> >> >- * - A payload consisting of a void *, followed by padding out 
>> >> >to
>> >> >- *   MEM_ALIGN bytes, followed by 'size' bytes of user data.
>> >> >+ * - Pointer: A pointer to the start of the header padding, to 
>> >> >allow
>us
>> >> >+ *   to free() the block later.
>> >> >  *
>> >> >- * - Space following the payload up to the end of the cache 
>> >> >line, so
>> >> >- *   that the end of the payload doesn't potentially share a 
>> >> >cache
>line
>> >> >- *   with some following block. */
>> >> >-base = xmalloc((CACHE_LINE_SIZE - 1)
>> >> >-   + ROUND_UP(MEM_ALIGN + size, CACHE_LINE_SIZE));
>> >> >-
>> >> >-/* Locate the payload and store a pointer to the base at the
>beginning.
>> >*/
>> >> >-payload = (void **) ROUND_UP((uintptr_t) base, CACHE_LINE_SIZE);
>> >> >-*payload = base;
>> >> >-
>> >> >-return (char *) paylo

Re: [ovs-dev] [PATCH v2 1/2] timeval: Introduce macros to convert timespec and timeval.

2017-11-28 Thread Bodireddy, Bhanuprakash
Hi Ben,

>On Tue, Nov 14, 2017 at 08:42:30PM +, Bhanuprakash Bodireddy wrote:
>> This commit replaces the numbers with MSEC_PER_SEC, NSEC_PER_SEC and
>> USEC_PER_MSEC macros when dealing with timespec and timeval.
>>
>> This commit doesn't change functionality.
>>
>> Signed-off-by: Bhanuprakash Bodireddy
>> 
>
>This still seems careless and risky to me.
>
>For example:
>msecs = secs * MSEC_PER_SEC * 1LL;
>which expands to
>msecs = secs * 1000L * 1LL;
>still risks overflow on a 32-bit system (where 1000L is 32 bits long).
>
>The previous version of the code didn't have that problem:
>msecs = secs * 1000LL;
>
>Maybe it would be better to just leave these as-is.

I agree with you and am withdrawing my changes that introduced the time
macros. I have posted a v3 that replaces them.
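The overflow Ben describes can be made concrete. The sketch below simulates a 32-bit `long` with `int32_t` (illustrative only; `msecs_narrow`/`msecs_wide` are made-up names, and unsigned arithmetic is used in the demo to avoid signed-overflow UB):

```c
#include <stdint.h>

/* Mimics 'secs * 1000L' on a system where long is 32 bits: the
 * multiplication wraps in 32 bits before the widening to 64 bits. */
static int64_t
msecs_narrow(int32_t secs)
{
    return (int32_t) ((uint32_t) secs * 1000u) * 1LL;
}

/* The original code, 'secs * 1000LL': widens first, so no overflow. */
static int64_t
msecs_wide(int32_t secs)
{
    return secs * 1000LL;
}
```

For `secs = 3,000,000`, the product 3,000,000,000 exceeds INT32_MAX, so the narrow version wraps while the wide version is correct.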
On an unrelated note, could you please also review the patch that extends
get_process_info():
https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/340762.html

My keepalive patch series depends on the high-resolution timer patch and the
above-mentioned API.
 
- Bhanuprakash.


Re: [ovs-dev] [PATCH] Revert "dpif_netdev: Refactor dp_netdev_pmd_thread structure."

2017-11-28 Thread Bodireddy, Bhanuprakash
>On 27.11.2017 20:02, Bodireddy, Bhanuprakash wrote:
>>> I agree with Ilya here. Adding theses cache line markers and
>>> re-grouping variables to minimize gaps in cache lines is creating a
>>> maintenance burden without any tangible benefit. I have had to go
>>> through the pain of refactoring my PMD Performance Metrics patch to
>>> the new dp_netdev_pmd_thread struct and spent a lot of time to
>>> analyze the actual memory layout with GDB and play Tetris with the
>variables.
>>
>> Analyzing the memory layout with gdb for large structures is time consuming
>and not usually recommended.
>> I would suggest using Poke-a-hole(pahole) and that helps to understand
>and fix the structures in no time.
>> With pahole it's going to be lot easier to work with large structures
>especially.
>
>Interesting tool, but it seems doesn't work perfectly. I see duplicated unions
>and zero length arrays in the output and I still need to check sizes by hands.
>And it fails trying to run on my ubuntu 16.04 LTS on x86.
>IMHO, the code should be simple enough to not use external utilities when
>you need to check the single structure.

Pahole has been around for a while, is available in most distributions, and
works reliably on RHEL. I am on Fedora, built pahole from source, and it
displays all the sizes and cache line boundaries.
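For reference, a typical invocation (struct and binary paths are illustrative) would be:

```shell
# Show the layout, holes, and cache line boundaries of one structure
# in a binary built with debug info (-g).
pahole -C dp_netdev_pmd_thread vswitchd/ovs-vswitchd
```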

>
>>>
>>> There will never be more than a handful of PMDs, so minimizing the
>>> gaps does not matter from memory perspective. And whether the
>>> individual members occupy 4 or 5 cache lines does not matter either
>>> compared to the many hundred cache lines touched for EMC and DPCLS
>>> lookups of an Rx batch. And any optimization done for x86 is not
>>> necessarily optimal for other architectures.
>>
>> I agree that optimization targeted for x86 doesn't necessarily suit ARM due
>to its different cache line size.
>>
>>>
>>> Finally, even for x86 there is not even a performance improvement. I
>>> re-ran our standard L3VPN over VXLAN performance PVP test on master
>>> and with Ilya's revert patch:
>>>
>>> Flows    master  reverted
>>> 8,       4.46    4.48
>>> 100,     4.27    4.29
>>> 1000,    4.07    4.07
>>> 2000,    3.68    3.68
>>> 5000,    3.03    3.03
>>> 10000,   2.76    2.77
>>> 20000,   2.64    2.65
>>> 50000,   2.60    2.61
>>> 100000,  2.60    2.61
>>> 500000,  2.60    2.61
>>
>> What are the  CFLAGS in this case, as they seem to make difference. I have
>added my finding here for a different patch targeted at performance
>>
>> https://mail.openvswitch.org/pipermail/ovs-dev/2017-
>November/341270.ht
>> ml
>
>Do you have any performance results that shows significant performance
>difference between above cases? Please describe your test scenario and
>environment so we'll be able to see that padding/alignment really needed
>here. I saw no such results yet.
>
>BTW, at one place you're saying that patch was not about performance, at the
>same time you're trying to show that it has some positive performance
>impact. I'm a bit confused with that.

To give a bit more context, I have been experimenting with *prefetching* in
OVS, since prefetching is currently used in only two places (emc_processing
and cmaps). This work aims to measure the performance benefits of prefetching
not just on Haswell but also on newer processors.

The best way to prefetch part of a structure is to mark that portion, which
isn't possible without some kind of cache line marking. That is what my patch
initially did; with cache line markers in place, we can prefetch portions of
a structure selectively. You can find an example in the pkt_metadata struct.

My point is that on x86, with cache line marking and the xzalloc_cacheline()
API, one should see no drop in performance, if not an improvement. The real
improvements will come when prefetching is done in the right places, and that
is work in progress.

Bhanuprakash.

>
>>
>> Patches to consider when testing your use case:
>>  Xzalloc_cachline:  https://mail.openvswitch.org/pipermail/ovs-dev/2017-
>November/341231.html
>>  (If using output batching)  
>> https://mail.openvswitch.org/pipermail/ovs-
>dev/2017-November/341230.html
>>
>> - Bhanuprakash.
>>
>>>
>>> All in all, I support reverting this change.
>>>
>>> Regards, Jan
>>>
>>> Acked-by: Jan Scheurich <jan.scheur...@ericsson.com>
>>>
>>>> -Original Message-
>>>> From: ovs-dev-boun...@openvswitch.org
>>>> [mailto:ovs-dev-boun...@openvswitch.org] On

Re: [ovs-dev] [PATCH] util: Make xmalloc_cacheline() allocate full cachelines.

2017-11-28 Thread Bodireddy, Bhanuprakash
>Until now, xmalloc_cacheline() has provided its caller memory that does not
>share a cache line, but when posix_memalign() is not available it did not
>provide a full cache line; instead, it returned memory that was offset 8 bytes
>into a cache line.  This makes it hard for clients to design structures to be 
>cache
>line-aligned.  This commit changes
>xmalloc_cacheline() to always return a full cache line instead of memory
>offset into one.
>
>Signed-off-by: Ben Pfaff 
>---
> lib/util.c | 60 ---
>-
> 1 file changed, 32 insertions(+), 28 deletions(-)
>
>diff --git a/lib/util.c b/lib/util.c
>index 9e6edd27ae4c..137091a3cd4f 100644
>--- a/lib/util.c
>+++ b/lib/util.c
>@@ -196,15 +196,9 @@ x2nrealloc(void *p, size_t *n, size_t s)
> return xrealloc(p, *n * s);
> }
>
>-/* The desired minimum alignment for an allocated block of memory. */ -
>#define MEM_ALIGN MAX(sizeof(void *), 8) -
>BUILD_ASSERT_DECL(IS_POW2(MEM_ALIGN));
>-BUILD_ASSERT_DECL(CACHE_LINE_SIZE >= MEM_ALIGN);
>-
>-/* Allocates and returns 'size' bytes of memory in dedicated cache lines.  
>That
>- * is, the memory block returned will not share a cache line with other data,
>- * avoiding "false sharing".  (The memory returned will not be at the start of
>- * a cache line, though, so don't assume such alignment.)
>+/* Allocates and returns 'size' bytes of memory aligned to a cache line
>+and in
>+ * dedicated cache lines.  That is, the memory block returned will not
>+share a
>+ * cache line with other data, avoiding "false sharing".
>  *
>  * Use free_cacheline() to free the returned memory block. */  void * @@ -
>221,28 +215,37 @@ xmalloc_cacheline(size_t size)
> }
> return p;
> #else
>-void **payload;
>-void *base;
>-
> /* Allocate room for:
>  *
>- * - Up to CACHE_LINE_SIZE - 1 bytes before the payload, so that the
>- *   start of the payload doesn't potentially share a cache line.
>+ * - Header padding: Up to CACHE_LINE_SIZE - 1 bytes, to allow the
>+ *   pointer to be aligned exactly sizeof(void *) bytes before the
>+ *   beginning of a cache line.
>  *
>- * - A payload consisting of a void *, followed by padding out to
>- *   MEM_ALIGN bytes, followed by 'size' bytes of user data.
>+ * - Pointer: A pointer to the start of the header padding, to allow 
>us
>+ *   to free() the block later.
>  *
>- * - Space following the payload up to the end of the cache line, so
>- *   that the end of the payload doesn't potentially share a cache 
>line
>- *   with some following block. */
>-base = xmalloc((CACHE_LINE_SIZE - 1)
>-   + ROUND_UP(MEM_ALIGN + size, CACHE_LINE_SIZE));
>-
>-/* Locate the payload and store a pointer to the base at the beginning. */
>-payload = (void **) ROUND_UP((uintptr_t) base, CACHE_LINE_SIZE);
>-*payload = base;
>-
>-return (char *) payload + MEM_ALIGN;
>+ * - User data: 'size' bytes.
>+ *
>+ * - Trailer padding: Enough to bring the user data up to a cache line
>+ *   multiple.
>+ *
>+ * +--------+---------+-----------+---------+
>+ * | header | pointer | user data | trailer |
>+ * +--------+---------+-----------+---------+
>+ * ^        ^         ^
>+ * |        |         |
>+ * p        q         r
>+ *
>+ */
>+void *p = xmalloc((CACHE_LINE_SIZE - 1)
>+  + sizeof(void *)
>+  + ROUND_UP(size, CACHE_LINE_SIZE));
>+bool runt = PAD_SIZE((uintptr_t) p, CACHE_LINE_SIZE) < sizeof(void *);
>+void *r = (void *) ROUND_UP((uintptr_t) p + (runt ? CACHE_LINE_SIZE : 0),
>+CACHE_LINE_SIZE);
>+void **q = (void **) r - 1;
>+*q = p;
>+return r;
> #endif
> }
>
>@@ -265,7 +268,8 @@ free_cacheline(void *p)
> free(p);
> #else
> if (p) {
>-free(*(void **) ((uintptr_t) p - MEM_ALIGN));
>+void **q = (void **) p - 1;
>+free(*q);
> }
> #endif
> }
>--

Thanks for the patch.
I reviewed and tested it, and it now returns a 64-byte-aligned address.
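For reference, the patch's non-posix_memalign() path can be sketched and sanity-checked outside the tree as follows. This is a minimal reconstruction under assumed 64-byte cache lines, with malloc in place of xmalloc and hypothetical function names, not the actual lib/util.c code:

```c
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE_SIZE 64
#define ROUND_UP(x, y)  ((((x) + (y) - 1) / (y)) * (y))
#define PAD_SIZE(x, y)  (ROUND_UP(x, y) - (x))

/* Mirrors the patch: header padding, a back-pointer stored immediately
 * before the payload, 'size' bytes of user data, and trailer padding
 * up to a cache line multiple. */
static void *
xmalloc_cacheline_fallback(size_t size)
{
    void *p = malloc((CACHE_LINE_SIZE - 1) + sizeof(void *)
                     + ROUND_UP(size, CACHE_LINE_SIZE));
    /* "Runt": too little room before the next cache line boundary to
     * hold the back-pointer, so skip ahead one extra line. */
    int runt = PAD_SIZE((uintptr_t) p, CACHE_LINE_SIZE) < sizeof(void *);
    void *r = (void *) ROUND_UP((uintptr_t) p + (runt ? CACHE_LINE_SIZE : 0),
                                CACHE_LINE_SIZE);

    ((void **) r)[-1] = p;      /* Save malloc()'ed base for free(). */
    return r;
}

static void
free_cacheline_fallback(void *p)
{
    if (p) {
        free(((void **) p)[-1]);
    }
}
```

The returned pointer always starts exactly on a cache line, which is the property the patch adds over the old MEM_ALIGN-offset scheme.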

Acked-by: Bhanuprakash Bodireddy 

- Bhanuprakash.




Re: [ovs-dev] [PATCH] Revert "dpif_netdev: Refactor dp_netdev_pmd_thread structure."

2017-11-28 Thread Bodireddy, Bhanuprakash
>
>> Analyzing the memory layout with gdb for large structures is time consuming
>and not usually recommended.
>> I would suggest using Poke-a-hole(pahole) and that helps to understand
>and fix the structures in no time.
>> With pahole it's going to be lot easier to work with large structures
>especially.
>
>Thanks for the pointer. I'll have a look at pahole.
>It doesn't affect my reasoning against optimizing the compactification of 
>struct
>dp_netdev_pmd_thread, though.
>
>> >Finally, even for x86 there is not even a performance improvement. I
>> >re-ran our standard L3VPN over VXLAN performance PVP test on master
>> >and with Ilya's revert patch:
>> >
>> >Flows    master  reverted
>> >8,       4.46    4.48
>> >100,     4.27    4.29
>> >1000,    4.07    4.07
>> >2000,    3.68    3.68
>> >5000,    3.03    3.03
>> >10000,   2.76    2.77
>> >20000,   2.64    2.65
>> >50000,   2.60    2.61
>> >100000,  2.60    2.61
>> >500000,  2.60    2.61
>>
>> What are the  CFLAGS in this case, as they seem to make difference. I
>> have added my finding here for a different patch targeted at performance
>>
>> https://mail.openvswitch.org/pipermail/ovs-dev/2017-
>November/341270.ht
>> ml
>
>I'm compiling with "-O3 -msse4.2" to be in line with production deployments
>of OVS-DPDK that need to run on a wider family of Xeon generations.

Thanks for this. AFAIK, specifying '-msse4.2' alone does not let GCC emit the
popcnt instruction for __builtin_popcount(). One way to enable it is to add
'-mpopcnt' to CFLAGS or to build with -march=native.

(This is slightly out of context for this thread, JFYI. Ignore it if you only
want to use intrinsics and not the popcount builtin.)
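For what it's worth, at the source level the builtin is flag-independent; the flags only decide whether GCC lowers it to the single POPCNT instruction or to a software fallback:

```c
/* __builtin_popcount() counts the set bits in its argument.  With
 * -mpopcnt (or -march=native on a POPCNT-capable CPU) GCC emits the
 * 'popcnt' instruction; otherwise it emits a software bit-counting
 * sequence with identical results. */
static int
count_bits(unsigned int x)
{
    return __builtin_popcount(x);
}
```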

>
>>
>> Patches to consider when testing your use case:
>>  Xzalloc_cachline:  https://mail.openvswitch.org/pipermail/ovs-dev/2017-
>November/341231.html
>>  (If using output batching)  
>> https://mail.openvswitch.org/pipermail/ovs-
>dev/2017-November/341230.html
>
>I didn't use these. Tx batching is not relevant here. And I understand the
>xzalloc_cacheline patch alone does not guarantee that the allocated memory
>is indeed cache line-aligned.

At least with posix_memalign(), the address will be 64-byte aligned and start
at a CACHE_LINE_SIZE boundary.
I have yet to check and test Ben's new patch.

- Bhanuprakash.

>
>Thx, Jan


Re: [ovs-dev] [PATCH] Revert "dpif_netdev: Refactor dp_netdev_pmd_thread structure."

2017-11-27 Thread Bodireddy, Bhanuprakash
>I agree with Ilya here. Adding theses cache line markers and re-grouping
>variables to minimize gaps in cache lines is creating a maintenance burden
>without any tangible benefit. I have had to go through the pain of refactoring
>my PMD Performance Metrics patch to the new dp_netdev_pmd_thread
>struct and spent a lot of time to analyze the actual memory layout with GDB
>and play Tetris with the variables.

Analyzing the memory layout of large structures with gdb is time consuming
and not usually recommended.
I would suggest using poke-a-hole (pahole), which helps to understand and fix
structure layouts quickly. With pahole it is much easier to work with large
structures.

>
>There will never be more than a handful of PMDs, so minimizing the gaps does
>not matter from memory perspective. And whether the individual members
>occupy 4 or 5 cache lines does not matter either compared to the many
>hundred cache lines touched for EMC and DPCLS lookups of an Rx batch. And
>any optimization done for x86 is not necessarily optimal for other
>architectures.

I agree that optimization targeted for x86 doesn't necessarily suit ARM due to 
its different cache line size.

>
>Finally, even for x86 there is not even a performance improvement. I re-ran
>our standard L3VPN over VXLAN performance PVP test on master and with
>Ilya's revert patch:
>
>Flows    master  reverted
>8,       4.46    4.48
>100,     4.27    4.29
>1000,    4.07    4.07
>2000,    3.68    3.68
>5000,    3.03    3.03
>10000,   2.76    2.77
>20000,   2.64    2.65
>50000,   2.60    2.61
>100000,  2.60    2.61
>500000,  2.60    2.61

What are the CFLAGS in this case? They seem to make a difference. I have
added my findings for a different performance-targeted patch here:
  https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341270.html

Patches to consider when testing your use case:
 xzalloc_cacheline:
https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341231.html
 (if using output batching):
https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341230.html

- Bhanuprakash.

>
>All in all, I support reverting this change.
>
>Regards, Jan
>
>Acked-by: Jan Scheurich <jan.scheur...@ericsson.com>
>
>> -Original Message-
>> From: ovs-dev-boun...@openvswitch.org
>> [mailto:ovs-dev-boun...@openvswitch.org] On Behalf Of Bodireddy,
>> Bhanuprakash
>> Sent: Friday, 24 November, 2017 17:09
>> To: Ilya Maximets <i.maxim...@samsung.com>; ovs-dev@openvswitch.org;
>> Ben Pfaff <b...@ovn.org>
>> Cc: Heetae Ahn <heetae82@samsung.com>
>> Subject: Re: [ovs-dev] [PATCH] Revert "dpif_netdev: Refactor
>dp_netdev_pmd_thread structure."
>>
>> >On 22.11.2017 20:14, Bodireddy, Bhanuprakash wrote:
>> >>> This reverts commit a807c15796ddc43ba1ffb2a6b0bd2ad4e2b73941.
>> >>>
>> >>> Padding and aligning of dp_netdev_pmd_thread structure members is
>> >>> useless, broken in a several ways and only greatly degrades
>> >>> maintainability and extensibility of the structure.
>> >>
>> >> The idea of my earlier patch was to mark the cache lines and reduce
>> >> the
>> >holes while still maintaining the grouping of related members in this
>structure.
>> >
>> >Some of the grouping aspects looks strange. For example, it looks
>> >illogical that 'exit_latch' is grouped with 'flow_table' but not the
>> >'reload_seq' and other reload related stuff. It looks strange that
>> >statistics and counters spread across different groups. So, IMHO, it's not
>well grouped.
>>
>> I had to strike a fine balance and some members may be placed in a
>> different group due to their sizes and importance. Let me think if I can make
>it better.
>>
>> >
>> >> Also cache line marking is a good practice to make some one extra
>> >> cautious
>> >when extending or editing important structures .
>> >> Most importantly I was experimenting with prefetching on this
>> >> structure
>> >and needed cache line markers for it.
>> >>
>> >> I see that you are on ARM (I don't have HW to test) and want to
>> >> know if this
>> >commit has any negative affect and any numbers would be appreciated.
>> >
>> >Basic VM-VM testing shows stable 0.5% perfromance improvement with
>> >revert applied.
>>
>> I did P2P, PVP and PVVP with IXIA and haven't noticed any drop on X86.
>>
>> >Padding adds 560 additional bytes of holes.
>> As the cache line in ARM is 128 , it created holes, I can find a

Re: [ovs-dev] [PATCH] packets: Prefetch the packet metadata in cacheline1.

2017-11-27 Thread Bodireddy, Bhanuprakash
>>Bhanuprakash Bodireddy  writes:
>>
>>> pkt_metadata_prefetch_init() is used to prefetch the packet metadata
>>> before initializing the metadata in pkt_metadata_init(). This is done
>>> for every packet in userspace datapath and is performance critical.
>>>
>>> Commit 99fc16c0 prefetches only cachline0 and cacheline2 as the
>>> metadata part of respective cachelines will be initialized by
>>pkt_metadata_init().
>>>
>>> However in VXLAN case when popping the vxlan header,
>>> netdev_vxlan_pop_header() invokes pkt_metadata_init_tnl() which
>>> zeroes out metadata part of
>>> cacheline1 that wasn't prefetched earlier and causes performance
>>> degradation.
>>>
>>> By prefetching cacheline1, 9% performance improvement is observed.
>>
>>Do we see a degredation in the non-vxlan case?  If not, then I don't
>>see any reason not to apply this patch.
>
>This patch doesn't impact the performance of non-vxlan cases and only have a
>positive impact in vxlan case.

The commit message claims a 9% performance improvement with this patch, but
when Sugesh checked he was not seeing that improvement on his Haswell.

I was chatting with Sugesh this afternoon about this patch and we found some
interesting details; much of this boils down to how OVS is built (apart from
the HW and BIOS settings, with Turbo Boost disabled).

The test case measures VXLAN decapsulation performance alone, with 118-byte
packets. The OVS CFLAGS and throughput numbers are below.

CFLAGS="-O2"
Master  4.667 Mpps  
With Patch   5.045 Mpps

CFLAGS="-O2 -msse4.2"
Master  4.710 Mpps
With Patch   5.097 Mpps

CFLAGS="-O2 -march=native"
Master  5.072 Mpps
With Patch   5.193 Mpps

CFLAGS="-Ofast -march=native"
Master  5.349 Mpps
With Patch   5.378 Mpps

This means performance measurements and claims are difficult to assess: as
seen above, with "-Ofast -march=native" the improvement is insignificant, but
that is very platform dependent because of the -march=native flag. The
optimization flags clearly make a significant difference.

- Bhanuprakash.


Re: [ovs-dev] [PATCH] Revert "dpif_netdev: Refactor dp_netdev_pmd_thread structure."

2017-11-27 Thread Bodireddy, Bhanuprakash
[ snip]
>>> Yes, you will always get aligned addressess on your x86 Linux system
>>> that supports
>>> posix_memalign() call. The comment says what it says because it will
>>> make some memory allocation tricks in case posix_memalign() is not
>>> available (Windows, some MacOS, maybe some Linux systems (not sure))
>>> and the address will not be aligned it this case.
>>
>> I also verified the other case when posix_memalign isn't available and
>> even in that case it returns the address aligned on CACHE_LINE_SIZE
>> boundary. I will send out a patch to use  xzalloc_cacheline for allocating 
>> the
>memory.
>
>I don't know how you tested this, because it is impossible:
>
>   1. OVS allocates some memory:
>   base = xmalloc(...);
>
>   2. Rounds it up to the cache line start:
>   payload = (void **) ROUND_UP((uintptr_t) base,
>CACHE_LINE_SIZE);
>
>   3. Returns the pointer increased by 8 bytes:
>   return (char *) payload + MEM_ALIGN;
>
>So, unless you redefined MEM_ALIGN to zero, you will never get aligned
>memory address while allocating by xmalloc_cacheline() on system without
>posix_memalign().
>

Hmm, I didn't set MEM_ALIGN to zero; instead I used the test code below to
get aligned addresses when posix_memalign() isn't available. Since we can't
set MEM_ALIGN to zero, we have to use this hack: return a cache-line-aligned
address and store the initial address (the original address returned by
malloc) just before the aligned location, so that it can be freed by a later
call to free(). (I should have mentioned this in my previous mail.)

-
void **payload;
void *base;

base = xmalloc(CACHE_LINE_SIZE + size + MEM_ALIGN);
/* Address aligned on a CACHE_LINE_SIZE boundary. */
payload = (void **) (((uintptr_t) base + CACHE_LINE_SIZE + MEM_ALIGN)
                     & ~(CACHE_LINE_SIZE - 1));
/* Store the original address so it can be freed later. */
payload[-1] = base;
return (char *) payload;
-
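Wrapped into a compilable form (CACHE_LINE_SIZE and MEM_ALIGN values assumed, malloc in place of xmalloc, and hypothetical function names), the hack above can be sanity-checked:

```c
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE_SIZE 64
#define MEM_ALIGN sizeof(void *)

static void *
aligned_alloc_hack(size_t size)
{
    void **payload;
    void *base;

    /* One extra cache line plus room for the stashed base pointer. */
    base = malloc(CACHE_LINE_SIZE + size + MEM_ALIGN);
    /* Round down past the stored pointer to a cache line boundary;
     * the result lands between base + MEM_ALIGN and base +
     * CACHE_LINE_SIZE + MEM_ALIGN, so both the back-pointer and
     * 'size' bytes of payload fit inside the allocation. */
    payload = (void **) (((uintptr_t) base + CACHE_LINE_SIZE + MEM_ALIGN)
                         & ~((uintptr_t) CACHE_LINE_SIZE - 1));
    payload[-1] = base;         /* Stash base for the later free(). */
    return payload;
}

static void
aligned_free_hack(void *p)
{
    if (p) {
        free(((void **) p)[-1]);
    }
}
```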

- Bhanuprakash.


Re: [ovs-dev] [PATCH] Revert "dpif_netdev: Refactor dp_netdev_pmd_thread structure."

2017-11-26 Thread Bodireddy, Bhanuprakash
[snip]
>>>
>>> I don't think it complicates development and instead I feel the
>>> commit gives a clear indication to the developer that the members are
>>> grouped and
>>aligned and marked with cacheline markers.
>>> This makes the developer extra cautious when adding new members so
>>> that
>>holes can be avoided.
>>
>>Starting rebase of the output batching patch-set I figured out that I
>>need to remove 'unsigned long long last_cycles' and add 'struct
>>dp_netdev_pmd_thread_ctx ctx'
>>which is 8 bytes larger. Could you, please, suggest me where should I
>>place that new structure member and what to do with a hole from
>'last_cycles'?
>>
>>This is not a trivial question, because already poor grouping will
>>become worse almost anyway.
>
>Aah, realized now that the batching series doesn't cleanly apply on master.
>Let me check this and will send across the changes that should fix this.
>

I see that two patches of the output batching series touch this structure, so I
modified the structure to factor in the below changes introduced by the batching
series:
 -  Include the dp_netdev_pmd_thread_ctx structure.
 -  Include the n_output_batches variable.
 -  Account for the changed sizes of the dp_netdev_pmd_stats struct and stats_zero.
 -  Account for the ovs_mutex size (48 bytes on x86 vs. 56 bytes on ARM).

I also carried out some testing and found no performance impact with the below
changes.

---
struct dp_netdev_pmd_thread {
PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline0,
struct dp_netdev *dp;
struct cmap_node node;  /* In 'dp->poll_threads'. */
pthread_cond_t cond;/* For synchronizing pmd thread 
   reload. */
);

PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
struct ovs_mutex cond_mutex;/* Mutex for condition variable. */
pthread_t thread;
);

/* Per thread exact-match cache.  Note, the instance for cpu core
 * NON_PMD_CORE_ID can be accessed by multiple threads, and thusly
 * need to be protected by 'non_pmd_mutex'.  Every other instance
 * will only be accessed by its own pmd thread. */
PADDED_MEMBERS(CACHE_LINE_SIZE,
OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct emc_cache flow_cache;
);

/* Flow-Table and classifiers
 *
 * Writers of 'flow_table' must take the 'flow_mutex'.  Corresponding
 * changes to 'classifiers' must be made while still holding the
 * 'flow_mutex'.
 */
PADDED_MEMBERS(CACHE_LINE_SIZE,
struct ovs_mutex flow_mutex;
);
PADDED_MEMBERS(CACHE_LINE_SIZE,
struct cmap flow_table OVS_GUARDED; /* Flow table. */

/* One classifier per in_port polled by the pmd */
struct cmap classifiers;
/* Periodically sort subtable vectors according to hit frequencies */
long long int next_optimization;
/* End of the next time interval for which processing cycles
   are stored for each polled rxq. */
long long int rxq_next_cycle_store;

/* Cycles counters */
struct dp_netdev_pmd_cycles cycles;

/* Current context of the PMD thread. */
struct dp_netdev_pmd_thread_ctx ctx;
);
PADDED_MEMBERS(CACHE_LINE_SIZE,
/* Statistics. */
struct dp_netdev_pmd_stats stats;
/* 8 pad bytes. */
);

PADDED_MEMBERS(CACHE_LINE_SIZE,
struct latch exit_latch;/* For terminating the pmd thread. */
struct seq *reload_seq;
uint64_t last_reload_seq;
atomic_bool reload; /* Do we need to reload ports? */
/* Set to true if the pmd thread needs to be reloaded. */
bool need_reload;
bool isolated;

struct ovs_refcount ref_cnt;/* Every reference must be refcount'ed. */

/* Queue id used by this pmd thread to send packets on all netdevs if
 * XPS disabled for this netdev. All static_tx_qid's are unique and less
 * than 'cmap_count(dp->poll_threads)'. */
uint32_t static_tx_qid;

/* Number of filled output batches. */
int n_output_batches;
unsigned core_id;   /* CPU core id of this pmd thread. */
int numa_id;/* numa node id of this pmd thread. */

/* 16 pad bytes. */
);

PADDED_MEMBERS(CACHE_LINE_SIZE,
struct ovs_mutex port_mutex;/* Mutex for 'poll_list'
   and 'tx_ports'. */
/* 16 pad bytes. */
);
PADDED_MEMBERS(CACHE_LINE_SIZE,
/* List of rx queues to poll. */
struct hmap poll_list OVS_GUARDED;
/* Map of 'tx_port's used for transmission.  Written by the main
 * thread, read by the pmd thread. */
struct hmap tx_ports OVS_GUARDED;
);
PADDED_MEMBERS(CACHE_LINE_SIZE,
/* These are thread-local copies of 'tx_ports'.  One 

Re: [ovs-dev] [PATCH] Revert "dpif_netdev: Refactor dp_netdev_pmd_thread structure."

2017-11-24 Thread Bodireddy, Bhanuprakash
>On 22.11.2017 20:14, Bodireddy, Bhanuprakash wrote:
>>> This reverts commit a807c15796ddc43ba1ffb2a6b0bd2ad4e2b73941.
>>>
>>> Padding and aligning of dp_netdev_pmd_thread structure members is
>>> useless, broken in a several ways and only greatly degrades
>>> maintainability and extensibility of the structure.
>>
>> The idea of my earlier patch was to mark the cache lines and reduce the
>holes while still maintaining the grouping of related members in this 
>structure.
>
>Some of the grouping aspects looks strange. For example, it looks illogical 
>that
>'exit_latch' is grouped with 'flow_table' but not the 'reload_seq' and other
>reload related stuff. It looks strange that statistics and counters spread 
>across
>different groups. So, IMHO, it's not well grouped.

I had to strike a fine balance, and some members may be placed in a different
group due to their sizes and importance. Let me think about whether I can make
it better.

>
>> Also cache line marking is a good practice to make some one extra cautious
>when extending or editing important structures .
>> Most importantly I was experimenting with prefetching on this structure
>and needed cache line markers for it.
>>
>> I see that you are on ARM (I don't have HW to test) and want to know if this
>commit has any negative affect and any numbers would be appreciated.
>
>Basic VM-VM testing shows stable 0.5% perfromance improvement with
>revert applied.

I did P2P, PVP and PVVP tests with IXIA and haven't noticed any drop on x86.

>Padding adds 560 additional bytes of holes.
As the cache line on ARM is 128 bytes, the padding created holes; I can find a
workaround to handle this.

>
>> More comments inline.
>>
>>>
>>> Issues:
>>>
>>>1. It's not working because all the instances of struct
>>>   dp_netdev_pmd_thread allocated only by usual malloc. All the
>>>   memory is not aligned to cachelines -> structure almost never
>>>   starts at aligned memory address. This means that any further
>>>   paddings and alignments inside the structure are completely
>>>   useless. Fo example:
>>>
>>>   Breakpoint 1, pmd_thread_main
>>>   (gdb) p pmd
>>>   $49 = (struct dp_netdev_pmd_thread *) 0x1b1af20
>>>   (gdb) p &pmd->cacheline1
>>>   $51 = (OVS_CACHE_LINE_MARKER *) 0x1b1af60
>>>   (gdb) p &pmd->cacheline0
>>>   $52 = (OVS_CACHE_LINE_MARKER *) 0x1b1af20
>>>   (gdb) p &pmd->flow_cache
>>>   $53 = (struct emc_cache *) 0x1b1afe0
>>>
>>>   All of the above addresses shifted from cacheline start by 32B.
>>
>> If you see below all the addresses are 64 byte aligned.
>>
>> (gdb) p pmd
>> $1 = (struct dp_netdev_pmd_thread *) 0x7fc1e9b1a040
>> (gdb) p &pmd->cacheline0
>> $2 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a040
>> (gdb) p &pmd->cacheline1
>> $3 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a080
>> (gdb) p &pmd->flow_cache
>> $4 = (struct emc_cache *) 0x7fc1e9b1a0c0
>> (gdb) p &pmd->flow_table
>> $5 = (struct cmap *) 0x7fc1e9fba100
>> (gdb) p &pmd->stats
>> $6 = (struct dp_netdev_pmd_stats *) 0x7fc1e9fba140
>> (gdb) p &pmd->port_mutex
>> $7 = (struct ovs_mutex *) 0x7fc1e9fba180
>> (gdb) p &pmd->poll_list
>> $8 = (struct hmap *) 0x7fc1e9fba1c0
>> (gdb) p &pmd->tnl_port_cache
>> $9 = (struct hmap *) 0x7fc1e9fba200
>> (gdb) p &pmd->stats_zero
>> $10 = (unsigned long long (*)[5]) 0x7fc1e9fba240
>>
>> I tried using xzalloc_cacheline instead of default xzalloc() here.  I
>> tried tens of times and always found that the address is
>> 64 byte aligned and it should start at the beginning of cache line on X86.
>> Not sure why the comment  " (The memory returned will not be at the start
>of  a cache line, though, so don't assume such alignment.)" says otherwise?
>
>Yes, you will always get aligned addressess on your x86 Linux system that
>supports
>posix_memalign() call. The comment says what it says because it will make
>some memory allocation tricks in case posix_memalign() is not available
>(Windows, some MacOS, maybe some Linux systems (not sure)) and the
>address will not be aligned it this case.

I also verified the other case, when posix_memalign() isn't available, and even
then it returns an address aligned on a CACHE_LINE_SIZE boundary. I will send
out a patch to use xzalloc_cacheline() for allocating the memory.

>
>>
>>>
>>>   Can we fix it properly? NO.
>>>   OVS currently doesn't have appropriate API to allocate aligned
>>>   memory. The best candidate is 'xm

Re: [ovs-dev] [PATCH] Revert "dpif_netdev: Refactor dp_netdev_pmd_thread structure."

2017-11-22 Thread Bodireddy, Bhanuprakash
>This reverts commit a807c15796ddc43ba1ffb2a6b0bd2ad4e2b73941.
>
>Padding and aligning of dp_netdev_pmd_thread structure members is
>useless, broken in a several ways and only greatly degrades maintainability
>and extensibility of the structure.

The idea of my earlier patch was to mark the cache lines and reduce the holes 
while still maintaining the grouping of related members in this structure.
Also, cache-line marking is a good practice: it makes someone extra cautious
when extending or editing important structures.
Most importantly, I was experimenting with prefetching on this structure and
needed cache line markers for it.

I see that you are on ARM (I don't have HW to test) and want to know if this
commit has any negative effect; any numbers would be appreciated.
More comments inline.

>
>Issues:
>
>1. It's not working because all the instances of struct
>   dp_netdev_pmd_thread allocated only by usual malloc. All the
>   memory is not aligned to cachelines -> structure almost never
>   starts at aligned memory address. This means that any further
>   paddings and alignments inside the structure are completely
>   useless. Fo example:
>
>   Breakpoint 1, pmd_thread_main
>   (gdb) p pmd
>   $49 = (struct dp_netdev_pmd_thread *) 0x1b1af20
>   (gdb) p &pmd->cacheline1
>   $51 = (OVS_CACHE_LINE_MARKER *) 0x1b1af60
>   (gdb) p &pmd->cacheline0
>   $52 = (OVS_CACHE_LINE_MARKER *) 0x1b1af20
>   (gdb) p &pmd->flow_cache
>   $53 = (struct emc_cache *) 0x1b1afe0
>
>   All of the above addresses shifted from cacheline start by 32B.

If you see below, all the addresses are 64-byte aligned.

(gdb) p pmd
$1 = (struct dp_netdev_pmd_thread *) 0x7fc1e9b1a040
(gdb) p &pmd->cacheline0
$2 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a040
(gdb) p &pmd->cacheline1
$3 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a080
(gdb) p &pmd->flow_cache
$4 = (struct emc_cache *) 0x7fc1e9b1a0c0
(gdb) p &pmd->flow_table
$5 = (struct cmap *) 0x7fc1e9fba100
(gdb) p &pmd->stats
$6 = (struct dp_netdev_pmd_stats *) 0x7fc1e9fba140
(gdb) p &pmd->port_mutex
$7 = (struct ovs_mutex *) 0x7fc1e9fba180
(gdb) p &pmd->poll_list
$8 = (struct hmap *) 0x7fc1e9fba1c0
(gdb) p &pmd->tnl_port_cache
$9 = (struct hmap *) 0x7fc1e9fba200
(gdb) p &pmd->stats_zero
$10 = (unsigned long long (*)[5]) 0x7fc1e9fba240

I tried using xzalloc_cacheline() instead of the default xzalloc() here.  I
tried tens of times and always found that the address is 64-byte aligned, so it
should start at the beginning of a cache line on x86.
Not sure why the comment "(The memory returned will not be at the start of a
cache line, though, so don't assume such alignment.)" says otherwise?

>
>   Can we fix it properly? NO.
>   OVS currently doesn't have appropriate API to allocate aligned
>   memory. The best candidate is 'xmalloc_cacheline()' but it
>   clearly states that "The memory returned will not be at the
>   start of a cache line, though, so don't assume such alignment".
>   And also, this function will never return aligned memory on
>   Windows or MacOS.
>
>2. CACHE_LINE_SIZE is not constant. Different architectures have
>   different cache line sizes, but the code assumes that
>   CACHE_LINE_SIZE is always equal to 64 bytes. All the structure
>   members are grouped by 64 bytes and padded to CACHE_LINE_SIZE.
>   This leads to a huge holes in a structures if CACHE_LINE_SIZE
>   differs from 64. This is opposite to portability. If I want
>   good performance of cmap I need to have CACHE_LINE_SIZE equal
>   to the real cache line size, but I will have huge holes in the
>   structures. If you'll take a look to struct rte_mbuf from DPDK
>   you'll see that it uses 2 defines: RTE_CACHE_LINE_SIZE and
>   RTE_CACHE_LINE_MIN_SIZE to avoid holes in mbuf structure.

I understand that ARM and a few other processors (like OCTEON) have 128-byte
cache lines.
But again, I am curious about the performance impact in your case with this new
alignment.

>
>3. Sizes of system/libc defined types are not constant for all the
>   systems. For example, sizeof(pthread_mutex_t) == 48 on my
>   ARMv8 machine, but only 40 on x86. The difference could be
>   much bigger on Windows or MacOS systems. But the code assumes
>   that sizeof(struct ovs_mutex) is always 48 bytes. This may lead
>   to broken alignment/big holes in case of padding/wrong comments
>   about amount of free pad bytes.

This isn't an issue, as you already mentioned; it's more about the comment that
states the number of pad bytes.
In the case of ARM it would be just 8 pad bytes instead of 16 on x86.

union {
    struct {
        struct ovs_mutex port_mutex;    /* 48 bytes */
    };
    uint8_t pad13[64];                  /* 64 bytes */
};

>
>4. Sizes of the many fileds in structure depends on defines 

Re: [ovs-dev] [PATCH] packets: Prefetch the packet metadata in cacheline1.

2017-11-20 Thread Bodireddy, Bhanuprakash
>
>Bhanuprakash Bodireddy  writes:
>
>> pkt_metadata_prefetch_init() is used to prefetch the packet metadata
>> before initializing the metadata in pkt_metadata_init(). This is done
>> for every packet in userspace datapath and is performance critical.
>>
>> Commit 99fc16c0 prefetches only cachline0 and cacheline2 as the
>> metadata part of respective cachelines will be initialized by
>pkt_metadata_init().
>>
>> However in VXLAN case when popping the vxlan header,
>> netdev_vxlan_pop_header() invokes pkt_metadata_init_tnl() which zeroes
>> out metadata part of
>> cacheline1 that wasn't prefetched earlier and causes performance
>> degradation.
>>
>> By prefetching cacheline1, 9% performance improvement is observed.
>
>Do we see a degredation in the non-vxlan case?  If not, then I don't see any
>reason not to apply this patch.

This patch doesn't impact the performance of non-VXLAN cases and only has a
positive impact in the VXLAN case.

- Bhanuprakash.

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH 3/7] util: High resolution sleep support for windows.

2017-11-14 Thread Bodireddy, Bhanuprakash
Hi Alin,

>Thanks a lot for the patch.
>
>I have a few comments inlined.
>
>> -Original Message-
>> From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
>> boun...@openvswitch.org] On Behalf Of Bhanuprakash Bodireddy
>> Sent: Wednesday, November 8, 2017 6:36 PM
>> To: d...@openvswitch.org
>> Cc: Alin Gabriel Serdean 
>> Subject: [ovs-dev] [PATCH 3/7] util: High resolution sleep support for
>> windows.
>>
>> This commit implements xnanosleep() for the threads needing high
>> resolution sleep timeouts in windows.
>>
>> CC: Alin Gabriel Serdean 
>> CC: Aaron Conole 
>> Signed-off-by: Bhanuprakash Bodireddy
>> 
>> ---
>>  lib/util.c | 17 +
>>  1 file changed, 17 insertions(+)
>>
>> diff --git a/lib/util.c b/lib/util.c
>> index a29e288..46b5691 100644
>> --- a/lib/util.c
>> +++ b/lib/util.c
>> @@ -2217,6 +2217,23 @@ xnanosleep(uint64_t nanoseconds)
>>  retval = nanosleep(&ts_sleep, NULL);
>>  error = retval < 0 ? errno : 0;
>>  } while (error == EINTR);
>> +#else
>> +HANDLE timer = CreateWaitableTimer(NULL, FALSE, "NSTIMER");
>[Alin Serdean] Small nit we don't need to name the timer because we don't
>reuse it.
>> +if (timer) {
>> +LARGE_INTEGER duetime;
>> +duetime.QuadPart = -nanoseconds;
>> +if (SetWaitableTimer(timer, &duetime, 0, NULL, NULL, FALSE)) {
>> +WaitForSingleObject(timer, INFINITE);
>> +CloseHandle(timer);
>> +} else {
>> +CloseHandle(timer);
>> +VLOG_ERR_ONCE("SetWaitableTimer Failed (%s)",
>> +   ovs_lasterror_to_string());
>> +}
>[Alin Serdean] Can you move the CloseHandle part here?
>> +} else {
>> +VLOG_ERR_ONCE("CreateWaitableTimer Failed (%s)",
>> +   ovs_lasterror_to_string());
>> +}
>>  #endif
>>  ovsrcu_quiesce_end();
>>  }

Thanks for your comments. I will send across v2 with the above changes.

- Bhanuprakash. 

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] ovs-tcpdump error

2017-11-10 Thread Bodireddy, Bhanuprakash
>Aaron Conole <acon...@redhat.com> writes:
>
>> Hi Bhanu,
>>
>> "Bodireddy, Bhanuprakash" <bhanuprakash.bodire...@intel.com> writes:
>>
>>> Hi,
>>>
>>>
>>>
>>> ovs-tcpdump throws the below error when trying to capture packets on
>>> one of the vhostuserports.
>>>
>>>
>>>
>>> $ ovs-tcpdump -i dpdkvhostuser0
>>>
>>>ERROR: Please create an interface called `midpdkvhostuser0`
>>>
>>> See your OS guide for how to do this.
>>>
>>> Ex: ip link add midpdkvhostuser0 type veth peer name
>>> midpdkvhostuser02
>>>
>>>
>>>
>>> $ ip link add midpdkvhostuser0 type veth peer name midpdkvhostuser02
>>>
>>>  Error: argument "midpdkvhostuser0" is wrong: "name" too long
>>>
>>>
>>>
>>> To get around this issue, I have to pass  ‘—mirror-to’ option as below.
>>>
>>>
>>>
>>> $ ovs-tcpdump -i dpdkvhostuser0 -XX --mirror-to vh0
>>>
>>>
>>>
>>> Is this due to the length of the port name?  Would be nice to fix this 
>>> issue.
>>
>> Thanks for the detailed write up.
>>
>> It is related to the mirror port name length.  The mirror port is
>> bound by IFNAMSIZ restriction, so it must be 15 characters + nul, and
>> midpdkvhostuser0 would be 16 + nul.  This is a linux specific
>> restriction, and it won't be changed because it is very much a well
>> established UAPI (and changing it will have implications on code not
>> able to deal with larger sized name buffers).
>>
>> I'm not sure how best to fix it.  My concession was the mirror-to
>> option.  Perhaps there's a better way?
>
>Hi Bhanu, I've been thinking about this a bit more.  How about something like
>the following patch?
>
>If you think it's acceptable, I'll submit it formally.

Hi Aaron,

I am on Fedora and applied the patch, but couldn't verify the fix as I get the
below error.

Traceback (most recent call last):
  File "./utilities/ovs-tcpdump", line 21, in 
import random.randint
ImportError: No module named randint

When I slightly change the code to

-import random.randint
+ from random import randint
...
-return "ovsmi%06d" % random.randint(1, 99)
+return "ovsmi%06d" % randint(1, 99)

I get below error
Traceback (most recent call last):
  File "./utilities/ovs-tcpdump", line 478, in 
main()
  File "./utilities/ovs-tcpdump", line 419, in main
mirror_interface = mirror_interface or _make_mirror_name(interface)
TypeError: 'dict' object is not callable

Why is this so?

- Bhanuprakash.

>
>---
>
>diff --git a/utilities/ovs-tcpdump.in b/utilities/ovs-tcpdump.in index
>6718c77..76e8a7b 100755
>--- a/utilities/ovs-tcpdump.in
>+++ b/utilities/ovs-tcpdump.in
>@@ -18,6 +18,7 @@ import fcntl
>
> import os
> import pwd
>+import random.randint
> import struct
> import subprocess
> import sys
>@@ -39,6 +40,7 @@ except Exception:
>
> tapdev_fd = None
> _make_taps = {}
>+_make_mirror_name = {}
>
>
> def _doexec(*args, **kwargs):
>@@ -76,8 +78,16 @@ def _install_tap_linux(tap_name, mtu_value=None):
> pipe.wait()
>
>
>+def _make_linux_mirror_name(interface_name):
>+if interface_name.length() > 13:
>+return "ovsmi%06d" % random.randint(1, 99)
>+return "mi%s" % interface_name
>+
>+
> _make_taps['linux'] = _install_tap_linux  _make_taps['linux2'] =
>_install_tap_linux
>+_make_mirror_name['linux'] = _make_linux_mirror_name
>+_make_mirror_name['linux2'] = _make_linux_mirror_name
>
>
> def username():
>@@ -406,7 +416,7 @@ def main():
> print("TCPDUMP Args: %s" % ' '.join(tcpdargs))
>
> ovsdb = OVSDB(db_sock)
>-mirror_interface = mirror_interface or "mi%s" % interface
>+mirror_interface = mirror_interface or _make_mirror_name(interface)
>
> if sys.platform in _make_taps and \
>mirror_interface not in netifaces.interfaces():
>---
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH 0/7] Introduce high resolution sleep support.

2017-11-08 Thread Bodireddy, Bhanuprakash
HI Ben,
>On Wed, Nov 08, 2017 at 04:35:52PM +, Bhanuprakash Bodireddy wrote:
>> This patchset introduces high resolution sleep support for linux and
>windows.
>> Also time macros are introduced to replace the numbers with meaningful
>> names.
>
>Thank you very much for the series.
>
>Did you test that the Windows version of the code compiles (e.g. via
>appveyor)?

I cross-checked with appveyor and the build was successful. I replied to
another thread where we were discussing the Windows implementation.

>
>I would normally squash patch 3 (the Windows version of xnanosleep) into
>patch 2 (the Linux version). 

I couldn't verify the functionality of the Windows implementation and hence
posted it as a separate patch for now.
I will squash it once I receive feedback from Alin.

>Also, I would normally squash the patches that
>just replace constants by xSEC_PER_ySEC macros into the patch that
>introduced those macros (if there are other changes then I would separate
>those).
Ok.

>
>I am concerned about types.  The xSEC_PER_ySEC macros all use type "long"
>for their constants, but in some cases the code needs to have type "long
>long", for example in many cases when multiplying by one of these macros.
>When the patches replace an LL-suffixed literal by one of the xSEC_PER_ySEC
>macros, this introduces a risk of overflow that was not present before.
>
>I am not certain that the xSEC_PER_ySEC macros clarify things, especially given
>the type issues.  I don't feel strongly about it though.

Yeah, I understand your concern here; it is difficult to test the overflow
cases with these changes. I will leave it the way it is now.
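Ben's type concern can be made concrete: if a conversion macro expands to a plain 32-bit constant, the product wraps before it is widened to long long. A small self-contained illustration (the macro names are mine, not the series' xSEC_PER_ySEC macros, which use type "long"):

```c
#include <assert.h>
#include <stdint.h>

/* Milliseconds-to-nanoseconds conversion done two ways. */
#define NSEC_PER_MSEC_NARROW 1000000u    /* 32-bit: product can wrap.  */
#define NSEC_PER_MSEC_WIDE   1000000ull  /* 64-bit: product is safe.   */

static uint64_t
msec_to_nsec_narrow(uint32_t msec)
{
    /* The multiply happens in 32 bits and wraps modulo 2^32 *before*
     * the implicit widening to the 64-bit return type. */
    return msec * NSEC_PER_MSEC_NARROW;
}

static uint64_t
msec_to_nsec_wide(uint32_t msec)
{
    /* The ull-suffixed constant promotes the whole product to 64 bits. */
    return msec * NSEC_PER_MSEC_WIDE;
}
```

5000 ms is already enough to trip the narrow version (5e9 exceeds 2^32), which is exactly the risk introduced when an LL-suffixed literal is replaced by a long-typed macro.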

>
>In the xnanosleep() implementation for Windows, I think that the two calls to
>CloseHandle can be consolidated into one.

Sure.

- Bhanuprakash.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v5 03/10] util: Add high resolution sleep support.

2017-11-08 Thread Bodireddy, Bhanuprakash
>
>> Ben Pfaff <b...@ovn.org> writes:
>>
>> > On Mon, Nov 06, 2017 at 05:29:26PM +, Bodireddy, Bhanuprakash
>> wrote:
>> >> Hi Ben,
>> >> >
>> >> >On Fri, Sep 15, 2017 at 05:40:23PM +0100, Bhanuprakash Bodireddy
>> wrote:
>> >> >> This commit introduces xnanosleep() for the threads needing high
>> >> >> resolution sleep timeouts.
>> >> >>
>> >> >> Signed-off-by: Bhanuprakash Bodireddy
>> >> >> <bhanuprakash.bodire...@intel.com>
>> >> >
>> >> >This is a little confusing.  The name xnanosleep() implies that
>> >> >its argument would be in nanoseconds, but it's in fact in milliseconds.
>> >> >Second, I don't understand why it's only implemented for Linux.
>> >>
>> >> I tried reworking this API with nanoseconds argument and
>> >> implementing
>> >> nsec_to_timespec() today.
>> >> This changes works fine on Linux, however the windows build breaks
>> >> with below error as reported by appveyor.
>> >>
>> >> error C4013: 'nanosleep' undefined; assuming extern returning int
>> >> (windows.h and time.h headers are included).
>> >>
>> >> But looks like nanosleep is supported on windows. Any inputs on
>> >> this would be helpful.
>> >
>> > If nanosleep isn't available on Windows (it looks like it isn't),
>> > then I'd recommend using some other function that Windows does have.
>> > If its argument isn't in nanoseconds, you can convert it.
>> >
>> > If you don't really need nanosecond resolution (the fact that the
>> > argument was in milliseconds seems like a hint), then maybe you
>> > could just use some other function instead of nanosleep, even on Linux.
>> >
>> > This stackoverflow page has some information:
>> > https://stackoverflow.com/questions/7827062/is-there-a-windows-
>> equival
>> > ent-of-nanosleep
>>
>> So, there's really no good way in windows of doing this - for OvS, I
>> would suggest reading up on the windows Wait calls
>> (https://msdn.microsoft.com/en-
>> us/library/windows/desktop/ms687069(v=vs.85).aspx#waitfunctionsandtim
>> e-outintervals).
>>
>> Prefer those to Sleep(), as Sleep(MS) can stall or deadlock the
>> process
>(at
>> least from what I remember a lifetime ago).
>There is no direct equivalent unfortunately.
>I would use
>CreateWaitableTimer(https://msdn.microsoft.com/en-
>us/library/windows/desktop
>/ms682492(v=vs.85).aspx) with SetWaitableTimer
>(https://msdn.microsoft.com/en-
>us/library/windows/desktop/ms686289(v=vs.85).
>aspx) and then wait on the timer(WaitForSingleObject) although you have 100
>nanosecond intervals.
>To go lower you can use: QueryPerformanceCounter
>(https://msdn.microsoft.com/en-
>us/library/windows/desktop/ms644904(v=vs.85).
>aspx) .
>I can try to do some benchmarks if you need such a high resolution.

Thanks for your inputs; those were helpful.
I implemented the Windows equivalent of nanosleep and posted the patch here:
https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/340743.html

I verified that this builds on Windows with appveyor. But I couldn't verify the
functionality here, and that's the reason I posted this as a separate patch
instead of folding it into 2/7.

- Bhanuprakash.



___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v5 03/10] util: Add high resolution sleep support.

2017-11-06 Thread Bodireddy, Bhanuprakash
Hi Ben,
>
>On Fri, Sep 15, 2017 at 05:40:23PM +0100, Bhanuprakash Bodireddy wrote:
>> This commit introduces xnanosleep() for the threads needing high
>> resolution sleep timeouts.
>>
>> Signed-off-by: Bhanuprakash Bodireddy
>> 
>
>This is a little confusing.  The name xnanosleep() implies that its argument
>would be in nanoseconds, but it's in fact in milliseconds.
>Second, I don't understand why it's only implemented for Linux.

I tried reworking this API with a nanoseconds argument and implementing
nsec_to_timespec() today.
This change works fine on Linux; however, the Windows build breaks with the
below error, as reported by appveyor.

error C4013: 'nanosleep' undefined; assuming extern returning int
(windows.h and time.h headers are included).

But it looks like nanosleep is supported on Windows. Any inputs on this would
be helpful.

- Bhanuprakash.

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v4 0/7] Output packet batching.

2017-10-13 Thread Bodireddy, Bhanuprakash
>Hi Ilya,
>
>Sorry for the late response, as I was rather busy and did not find time to look
>at your revisions 1 till 3. Hopefully, I can make it up looking at v4...
>
>I did some tests in-line with the earlier tests I did with Bhanu's patch 
>series.
>Here is a comparison for a simple PVP test using a single physical port with 64
>bytes packets (wire speed 10G), single PMD thread:
>
>#Flows  master patched
>==  =  =
>     10  3,123,350  4,174,807  pps
>     32  2,090,440  3,625,314  pps
>     50  1,954,184  3,499,402  pps
>    100  1,705,794  3,264,955  pps
>    500  1,601,252  2,956,190  pps
>   1000  1,568,175  2,712,385  pps
>
>In addition, I did some latency statistics based on a PVP setup with two
>physical ports, and one virtual port, and two OpenFlow rules:
>
>ovs-ofctl add-flow br0 "in_port=dpdk0,action=vhost0"
>ovs-ofctl add-flow br0 "in_port=vhost0,action=dpdk1"
>
>Also, note that there is some deviation on latency numbers, so I did 4 runs and
>reported the min-max values.
>
>First the master results:
>
>Summary (flows = 30, 10G line rate = 95%, runtime = 60 seconds):
>
>   Pkt size min(ns) avg(ns)  max(ns)
>     -  ---  -
>    512  7,437 - 7,469  11,416 - 13,770   99,395 - 112,296
>   1024  7,197 - 7,221  11,277 - 12,379   42,876 -  47,230
>   1280  7,373 - 7,549  10,647 - 12,528   37,240 -  42,235
>   1518  8,046 - 8,135  11,808 - 12,931   36,534 -  46,388
>
>And the patched results:
>
>   Pkt size min(ns) avg(ns)  max(ns)
>     -  ---  -
>    512  7,605 - 7,662  11,711 - 14,053   56,603 - 121,059
>   1024  7,285 - 7,317  11,291 - 12,695   44,753 -  69,624
>   1280  7,605 - 7,702  10,842 - 12,685   37,047 -  45,747
>   1518  8,111 - 8,159  11,434 - 13,045   38,587 -  41,754
>
>As you can see above for the default configuration there is a minimal latency
>increase. I assume you did some latency tests yourself, and I hope these
>numbers match your findings...

Thanks for sharing the numbers, Eelco. These should be useful for future
reference.

Also please note that with the batching patches applied there is a small
performance drop in the P2P test case (a non-batching scenario). This shouldn't
be a concern at this point, but VSPERF may raise a red flag with some of its
test cases when this series is applied.

Other than that, the series looks good to me. I have asked Ian to check QoS
functionality with this series plus the incremental patch (which fixes the
known issue with the policer) to catch any other corner cases.

- Bhanuprakash.

>
>I do have some small comments on your patchsets but will address them
>replying to the individual emails.
>
>Cheers,
>
>Eelco
>
>On 05/10/17 17:05, Ilya Maximets wrote:
>> This patch-set inspired by [1] from Bhanuprakash Bodireddy.
>> Implementation of [1] looks very complex and introduces many pitfalls
>> [2] for later code modifications like possible packet stucks.
>>
>> This version targeted to make simple and flexible output packet
>> batching on higher level without introducing and even simplifying netdev
>layer.
>>
>> Basic testing of 'PVP with OVS bonding on phy ports' scenario shows
>> significant performance improvement.
>>
>> Test results for time-based batching for v3:
>> https://mail.openvswitch.org/pipermail/ovs-dev/2017-September/338247.h
>> tml
>>
>> [1] [PATCH v4 0/5] netdev-dpdk: Use intermediate queue during packet
>transmission.
>>
>> https://mail.openvswitch.org/pipermail/ovs-dev/2017-August/337019.html
>>
>> [2] For example:
>>
>> https://mail.openvswitch.org/pipermail/ovs-dev/2017-August/337133.html
>>
>> Version 4:
>>  * Rebased on current master.
>>  * Rebased on top of "Keep latest measured time for PMD thread."
>>(Jan Scheurich)
>>  * Microsecond resolution related patches integrated.
>>  * Time-based batching without RFC tag.
>>  * 'output_time' renamed to 'flush_time'. (Jan Scheurich)
>>  * 'flush_time' update moved to
>'dp_netdev_pmd_flush_output_on_port'.
>>(Jan Scheurich)
>>  * 'output-max-latency' renamed to 'tx-flush-interval'.
>>  * Added patch for output batching statistics.
>>
>> Version 3:
>>
>>  * Rebased on current master.
>>  * Time based RFC: fixed assert on n_output_batches <= 0.
>>
>> Version 2:
>>
>>  * Rebased on current master.
>>  * Added time based batching RFC patch.
>>  * Fixed mixing packets with different sources in same batch.
>>
>>
>> Ilya Maximets (7):
>>dpif-netdev: Keep latest measured time for PMD thread.
>>dpif-netdev: Output packet batching.
>>netdev: Remove unused may_steal.
>>netdev: Remove useless cutlen.
>>timeval: Introduce time_usec().
>>dpif-netdev: Time based output batching.
>>dpif-netdev: Count sent packets and batches.
>>
>>   lib/dpif-netdev.c | 334 +--
>---
>>   lib/netdev-bsd.c   

Re: [ovs-dev] [PATCH v4 5/7] timeval: Introduce time_usec().

2017-10-13 Thread Bodireddy, Bhanuprakash
>This fanction will provide monotonic time in microseconds.

[BHANU] Typo here: 'fanction' should be 'function'.

>
>Signed-off-by: Ilya Maximets 
>---
> lib/timeval.c | 22 ++  lib/timeval.h |  2 ++
> 2 files changed, 24 insertions(+)
>
>diff --git a/lib/timeval.c b/lib/timeval.c index dd63f03..be2eddc 100644
>--- a/lib/timeval.c
>+++ b/lib/timeval.c
>@@ -233,6 +233,22 @@ time_wall_msec(void)
> return time_msec__(&wall_clock);
> }
>
>+static long long int
>+time_usec__(struct clock *c)
>+{
>+struct timespec ts;
>+
>+time_timespec__(c, &ts);
>+return timespec_to_usec(&ts);
>+}
>+
>+/* Returns a monotonic timer, in microseconds. */
>+long long int
>+time_usec(void)
>+{
>+return time_usec__(&monotonic_clock);
>+}
>+

[BHANU] As you are introducing support for microsecond granularity, can you
also add time_wall_usec() and time_wall_usec__() here?
The ipfix code (ipfix_now()) can be the first user for now. Maybe more in the
future!

> /* Configures the program to die with SIGALRM 'secs' seconds from now, if
>  * 'secs' is nonzero, or disables the feature if 'secs' is zero. */  void @@ 
> -360,6
>+376,12 @@ timeval_to_msec(const struct timeval *tv)
> return (long long int) tv->tv_sec * 1000 + tv->tv_usec / 1000;  }
>
>+long long int
>+timespec_to_usec(const struct timespec *ts)
>+{
>+    return (long long int) ts->tv_sec * 1000 * 1000 + ts->tv_nsec / 1000;
>+}
>+

[BHANU] How about adding timeval_to_usec()?
It would also be nice to have usec_to_timespec() and timeval_diff_usec()
implemented to make this commit complete.

- Bhanuprakash. 

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v4 4/7] netdev: Remove useless cutlen.

2017-10-13 Thread Bodireddy, Bhanuprakash
>Cutlen already applied while processing OVS_ACTION_ATTR_OUTPUT.
>
>Signed-off-by: Ilya Maximets 
>---
> lib/netdev-bsd.c   | 2 +-
> lib/netdev-dpdk.c  | 5 -
> lib/netdev-dummy.c | 2 +-
> lib/netdev-linux.c | 4 ++--
> 4 files changed, 4 insertions(+), 9 deletions(-)
>
>diff --git a/lib/netdev-bsd.c b/lib/netdev-bsd.c index 4f243b5..7454d03 100644
>--- a/lib/netdev-bsd.c
>+++ b/lib/netdev-bsd.c
>@@ -697,7 +697,7 @@ netdev_bsd_send(struct netdev *netdev_, int qid
>OVS_UNUSED,
>
> for (i = 0; i < batch->count; i++) {
> const void *data = dp_packet_data(batch->packets[i]);
>-size_t size = dp_packet_get_send_len(batch->packets[i]);
>+size_t size = dp_packet_size(batch->packets[i]);
>
> while (!error) {
> ssize_t retval;
>diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index 011c6f7..300a0ae
>100644
>--- a/lib/netdev-dpdk.c
>+++ b/lib/netdev-dpdk.c
>@@ -1851,8 +1851,6 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid,
>struct dp_packet_batch *batch)
> dropped += batch_cnt - cnt;
> }
>
>-dp_packet_batch_apply_cutlen(batch);
>-
> for (uint32_t i = 0; i < cnt; i++) {
> struct dp_packet *packet = batch->packets[i];
> uint32_t size = dp_packet_size(packet); @@ -1905,7 +1903,6 @@
>netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
> dpdk_do_tx_copy(netdev, qid, batch);
> dp_packet_delete_batch(batch, true);
> } else {
>-dp_packet_batch_apply_cutlen(batch);
> __netdev_dpdk_vhost_send(netdev, qid, batch->packets, batch-
>>count);
> }
> return 0;
>@@ -1936,8 +1933,6 @@ netdev_dpdk_send__(struct netdev_dpdk *dev, int
>qid,
> int batch_cnt = dp_packet_batch_size(batch);
> struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets;
>
>-dp_packet_batch_apply_cutlen(batch);
>-
> tx_cnt = netdev_dpdk_filter_packet_len(dev, pkts, batch_cnt);
> tx_cnt = netdev_dpdk_qos_run(dev, pkts, tx_cnt);
> dropped = batch_cnt - tx_cnt;
>diff --git a/lib/netdev-dummy.c b/lib/netdev-dummy.c index 57ef13f..1f846b5
>100644
>--- a/lib/netdev-dummy.c
>+++ b/lib/netdev-dummy.c
>@@ -1071,7 +1071,7 @@ netdev_dummy_send(struct netdev *netdev, int
>qid OVS_UNUSED,
> struct dp_packet *packet;
> DP_PACKET_BATCH_FOR_EACH(packet, batch) {
> const void *buffer = dp_packet_data(packet);
>-size_t size = dp_packet_get_send_len(packet);
>+size_t size = dp_packet_size(packet);
>
> if (batch->packets[i]->packet_type != htonl(PT_ETH)) {
> error = EPFNOSUPPORT;
>diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c index aaf4899..e70cef3
>100644
>--- a/lib/netdev-linux.c
>+++ b/lib/netdev-linux.c
>@@ -1197,7 +1197,7 @@ netdev_linux_sock_batch_send(int sock, int ifindex,
> for (int i = 0; i < batch->count; i++) {
> struct dp_packet *packet = batch->packets[i];
> iov[i].iov_base = dp_packet_data(packet);
>-iov[i].iov_len = dp_packet_get_send_len(packet);
>+iov[i].iov_len = dp_packet_size(packet);
> mmsg[i].msg_hdr = (struct msghdr) { .msg_name = &sll,
> .msg_namelen = sizeof sll,
> .msg_iov = [i], @@ -1234,7 
> +1234,7 @@
>netdev_linux_tap_batch_send(struct netdev *netdev_,
> struct netdev_linux *netdev = netdev_linux_cast(netdev_);
> for (int i = 0; i < batch->count; i++) {
> struct dp_packet *packet = batch->packets[i];
>-size_t size = dp_packet_get_send_len(packet);
>+size_t size = dp_packet_size(packet);
> ssize_t retval;
> int error;

With the above change, I think we can get rid of the dp_packet_get_send_len()
API altogether.
The only place it is still called is dp_packet_batch_apply_cutlen(), and that
call can be replaced:

dp_packet_batch_apply_cutlen(..) {
    ...
-    dp_packet_set_size(packet, dp_packet_get_send_len(packet));
+    dp_packet_set_size(packet,
+                       dp_packet_size(packet) - dp_packet_get_cutlen(packet));
}

- Bhanuprakash.


Re: [ovs-dev] [PATCH v3 3/4] netdev-dpdk: Remove useless cutlen.

2017-09-25 Thread Bodireddy, Bhanuprakash
>Cutlen already applied while processing OVS_ACTION_ATTR_OUTPUT.
>
>Signed-off-by: Ilya Maximets 

LGTM, 
The redundant calls below can be removed, as the packet cutlen is already
applied in the dpif layer.

-Bhanuprakash.

>---
> lib/netdev-dpdk.c | 5 -
> 1 file changed, 5 deletions(-)
>
>diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index 8e3158f..ddcc574
>100644
>--- a/lib/netdev-dpdk.c
>+++ b/lib/netdev-dpdk.c
>@@ -1819,8 +1819,6 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid,
>struct dp_packet_batch *batch)
> int newcnt = 0;
> int i;
>
>-dp_packet_batch_apply_cutlen(batch);
>-
> for (i = 0; i < batch->count; i++) {
> int size = dp_packet_size(batch->packets[i]);
>
>@@ -1879,7 +1877,6 @@ netdev_dpdk_vhost_send(struct netdev *netdev,
>int qid,
> dpdk_do_tx_copy(netdev, qid, batch);
> dp_packet_delete_batch(batch, true);
> } else {
>-dp_packet_batch_apply_cutlen(batch);
> __netdev_dpdk_vhost_send(netdev, qid, batch->packets, batch-
>>count);
> }
> return 0;
>@@ -1910,8 +1907,6 @@ netdev_dpdk_send__(struct netdev_dpdk *dev, int
>qid,
> int cnt = batch->count;
> struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets;
>
>-dp_packet_batch_apply_cutlen(batch);
>-
> cnt = netdev_dpdk_filter_packet_len(dev, pkts, cnt);
> cnt = netdev_dpdk_qos_run(dev, pkts, cnt);
> dropped = batch->count - cnt;
>--
>2.7.4



Re: [ovs-dev] [PATCH v3 2/4] netdev: Remove unused may_steal.

2017-09-25 Thread Bodireddy, Bhanuprakash
>Not needed anymore because 'may_steal' is already handled at the dpif-netdev
>layer and is always true.

LGTM.
'may_steal' is still used by the QoS policer in the netdev layer. I am not
familiar with the policer functionality, but I am wondering whether
'may_steal' is still needed there after this change.

- Bhanuprakash.

>
>Signed-off-by: Ilya Maximets 
>---
> lib/dpif-netdev.c |  2 +-
> lib/netdev-bsd.c  |  4 ++--
> lib/netdev-dpdk.c | 25 +++--
> lib/netdev-dummy.c|  4 ++--
> lib/netdev-linux.c|  4 ++--
> lib/netdev-provider.h |  7 +++
> lib/netdev.c  | 12 
> lib/netdev.h  |  2 +-
> 8 files changed, 26 insertions(+), 34 deletions(-)
>
>diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index a2a25be..dcf55f3
>100644
>--- a/lib/dpif-netdev.c
>+++ b/lib/dpif-netdev.c
>@@ -3121,7 +3121,7 @@ dp_netdev_pmd_flush_output_on_port(struct
>dp_netdev_pmd_thread *pmd,
> tx_qid = pmd->static_tx_qid;
> }
>
>-    netdev_send(p->port->netdev, tx_qid, &p->output_pkts, true,
>-                dynamic_txqs);
>+    netdev_send(p->port->netdev, tx_qid, &p->output_pkts,
>+                dynamic_txqs);
> dp_packet_batch_init(>output_pkts);
> }
>
>diff --git a/lib/netdev-bsd.c b/lib/netdev-bsd.c index 8a4cdb3..4f243b5 100644
>--- a/lib/netdev-bsd.c
>+++ b/lib/netdev-bsd.c
>@@ -680,7 +680,7 @@ netdev_bsd_rxq_drain(struct netdev_rxq *rxq_)
>  */
> static int
> netdev_bsd_send(struct netdev *netdev_, int qid OVS_UNUSED,
>-struct dp_packet_batch *batch, bool may_steal,
>+struct dp_packet_batch *batch,
> bool concurrent_txq OVS_UNUSED)  {
> struct netdev_bsd *dev = netdev_bsd_cast(netdev_); @@ -728,7 +728,7
>@@ netdev_bsd_send(struct netdev *netdev_, int qid OVS_UNUSED,
> }
>
> ovs_mutex_unlock(&dev->mutex);
>-dp_packet_delete_batch(batch, may_steal);
>+dp_packet_delete_batch(batch, true);
>
> return error;
> }
>diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index 1d82bca..8e3158f
>100644
>--- a/lib/netdev-dpdk.c
>+++ b/lib/netdev-dpdk.c
>@@ -1872,12 +1872,12 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid,
>struct dp_packet_batch *batch)  static int  netdev_dpdk_vhost_send(struct
>netdev *netdev, int qid,
>struct dp_packet_batch *batch,
>-   bool may_steal, bool concurrent_txq OVS_UNUSED)
>+   bool concurrent_txq OVS_UNUSED)
> {
>
>-if (OVS_UNLIKELY(!may_steal || batch->packets[0]->source !=
>DPBUF_DPDK)) {
>+if (OVS_UNLIKELY(batch->packets[0]->source != DPBUF_DPDK)) {
> dpdk_do_tx_copy(netdev, qid, batch);
>-dp_packet_delete_batch(batch, may_steal);
>+dp_packet_delete_batch(batch, true);
> } else {
> dp_packet_batch_apply_cutlen(batch);
> __netdev_dpdk_vhost_send(netdev, qid, batch->packets, batch-
>>count); @@ -1887,11 +1887,11 @@ netdev_dpdk_vhost_send(struct netdev
>*netdev, int qid,
>
> static inline void
> netdev_dpdk_send__(struct netdev_dpdk *dev, int qid,
>-   struct dp_packet_batch *batch, bool may_steal,
>+   struct dp_packet_batch *batch,
>bool concurrent_txq)  {
> if (OVS_UNLIKELY(!(dev->flags & NETDEV_UP))) {
>-dp_packet_delete_batch(batch, may_steal);
>+dp_packet_delete_batch(batch, true);
> return;
> }
>
>@@ -1900,12 +1900,11 @@ netdev_dpdk_send__(struct netdev_dpdk *dev,
>int qid,
> rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
> }
>
>-if (OVS_UNLIKELY(!may_steal ||
>- batch->packets[0]->source != DPBUF_DPDK)) {
>+if (OVS_UNLIKELY(batch->packets[0]->source != DPBUF_DPDK)) {
> struct netdev *netdev = &dev->up;
>
> dpdk_do_tx_copy(netdev, qid, batch);
>-dp_packet_delete_batch(batch, may_steal);
>+dp_packet_delete_batch(batch, true);
> } else {
> int dropped;
> int cnt = batch->count;
>@@ -1933,12 +1932,11 @@ netdev_dpdk_send__(struct netdev_dpdk *dev,
>int qid,
>
> static int
> netdev_dpdk_eth_send(struct netdev *netdev, int qid,
>- struct dp_packet_batch *batch, bool may_steal,
>- bool concurrent_txq)
>+ struct dp_packet_batch *batch, bool
>+ concurrent_txq)
> {
> struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>
>-netdev_dpdk_send__(dev, qid, batch, may_steal, concurrent_txq);
>+netdev_dpdk_send__(dev, qid, batch, concurrent_txq);
> return 0;
> }
>
>@@ -2905,8 +2903,7 @@ dpdk_ring_open(const char dev_name[],
>dpdk_port_t *eth_port_id)
>
> static int
> netdev_dpdk_ring_send(struct netdev *netdev, int qid,
>-  struct dp_packet_batch *batch, bool may_steal,
>-  bool concurrent_txq)
>+  struct dp_packet_batch *batch, bool
>+ concurrent_txq)
> {
> struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> unsigned i;
>@@ -2919,7 +2916,7 @@ netdev_dpdk_ring_send(struct netdev *netdev, int
>qid,
>   

Re: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Output packet batching.

2017-09-25 Thread Bodireddy, Bhanuprakash
Hi Ilya,

This series needs to be rebased.  Few comments below.

>While processing incoming batch of packets they are scattered across many
>per-flow batches and sent separately.
>
>This becomes an issue while using more than a few flows.
>
>For example if we have balanced-tcp OvS bonding with 2 ports there will be
>256 datapath internal flows for each dp_hash pattern. This will lead to
>scattering of a single received batch across all of those 256 per-flow batches 
>and
>invoking send for each packet separately. This behaviour greatly degrades
>overall performance of netdev_send because of inability to use advantages of
>vectorized transmit functions.
>But the half (if 2 ports in bonding) of datapath flows will have the same 
>output
>actions. This means that we can collect them in a single place back and send at
>once using single call to netdev_send. This patch introduces per-port packet
>batch for output packets for that purpose.
>
>'output_pkts' batch is thread local and located in send port cache.
>
>Signed-off-by: Ilya Maximets 
>---
> lib/dpif-netdev.c | 104
>++
> 1 file changed, 82 insertions(+), 22 deletions(-)
>
>diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index e2cd931..a2a25be
>100644
>--- a/lib/dpif-netdev.c
>+++ b/lib/dpif-netdev.c
>@@ -502,6 +502,7 @@ struct tx_port {
> int qid;
> long long last_used;
> struct hmap_node node;
>+struct dp_packet_batch output_pkts;
> };
>
> /* PMD: Poll modes drivers.  PMD accesses devices via polling to eliminate
>@@ -633,9 +634,10 @@ static void dp_netdev_execute_actions(struct
>dp_netdev_pmd_thread *pmd,
>   size_t actions_len,
>   long long now);  static void 
> dp_netdev_input(struct
>dp_netdev_pmd_thread *,
>-struct dp_packet_batch *, odp_port_t port_no);
>+struct dp_packet_batch *, odp_port_t port_no,
>+long long now);
> static void dp_netdev_recirculate(struct dp_netdev_pmd_thread *,
>-  struct dp_packet_batch *);
>+  struct dp_packet_batch *, long long
>+ now);
>
> static void dp_netdev_disable_upcall(struct dp_netdev *);  static void
>dp_netdev_pmd_reload_done(struct dp_netdev_pmd_thread *pmd); @@ -
>667,6 +669,9 @@ static void dp_netdev_add_rxq_to_pmd(struct
>dp_netdev_pmd_thread *pmd,  static void
>dp_netdev_del_rxq_from_pmd(struct dp_netdev_pmd_thread *pmd,
>struct rxq_poll *poll)
> OVS_REQUIRES(pmd->port_mutex);
>+static void
>+dp_netdev_pmd_flush_output_packets(struct dp_netdev_pmd_thread
>*pmd,
>+   long long now);
> static void reconfigure_datapath(struct dp_netdev *dp)
> OVS_REQUIRES(dp->port_mutex);
> static bool dp_netdev_pmd_try_ref(struct dp_netdev_pmd_thread *pmd);
>@@ -2809,6 +2814,7 @@ dpif_netdev_execute(struct dpif *dpif, struct
>dpif_execute *execute)
> struct dp_netdev *dp = get_dp_netdev(dpif);
> struct dp_netdev_pmd_thread *pmd;
> struct dp_packet_batch pp;
>+long long now = time_msec();

[BHANU] The time_msec() call can be moved a little further down in this
function, perhaps after the 'probe' check.

>
> if (dp_packet_size(execute->packet) < ETH_HEADER_LEN ||
> dp_packet_size(execute->packet) > UINT16_MAX) { @@ -2851,8 +2857,8
>@@ dpif_netdev_execute(struct dpif *dpif, struct dpif_execute *execute)
>
> dp_packet_batch_init_packet(&pp, execute->packet);
> dp_netdev_execute_actions(pmd, &pp, false, execute->flow,
>-  execute->actions, execute->actions_len,
>-  time_msec());
>+  execute->actions, execute->actions_len, now);
>+dp_netdev_pmd_flush_output_packets(pmd, now);

[BHANU] Is this code path mostly run in non-PMD thread context? I can only
think of the bfd case, where all of the above runs in a monitoring (non-PMD)
thread context.

>
> if (pmd->core_id == NON_PMD_CORE_ID) {
> ovs_mutex_unlock(&dp->non_pmd_mutex);
>@@ -3101,6 +3107,37 @@ cycles_count_intermediate(struct
>dp_netdev_pmd_thread *pmd,
> non_atomic_ullong_add(>cycles.n[type], interval);  }
>
>+static void
>+dp_netdev_pmd_flush_output_on_port(struct dp_netdev_pmd_thread
>*pmd,
>+   struct tx_port *p, long long now) {
>+int tx_qid;
>+bool dynamic_txqs;
>+
>+dynamic_txqs = p->port->dynamic_txqs;
>+if (dynamic_txqs) {
>+tx_qid = dpif_netdev_xps_get_tx_qid(pmd, p, now);
>+} else {
>+tx_qid = pmd->static_tx_qid;
>+}
>+
>+    netdev_send(p->port->netdev, tx_qid, &p->output_pkts, true,
>+                dynamic_txqs);
>+    dp_packet_batch_init(&p->output_pkts);
>+}
>+
>+static void
>+dp_netdev_pmd_flush_output_packets(struct dp_netdev_pmd_thread
>*pmd,
>+   long long now) {
>+

[ovs-dev] ovs-tcpdump error

2017-09-21 Thread Bodireddy, Bhanuprakash
Hi,

ovs-tcpdump throws the below error when trying to capture packets on one of the
vhostuser ports.

$ ovs-tcpdump -i dpdkvhostuser0
   ERROR: Please create an interface called `midpdkvhostuser0`
See your OS guide for how to do this.
Ex: ip link add midpdkvhostuser0 type veth peer name midpdkvhostuser02

$ ip link add midpdkvhostuser0 type veth peer name midpdkvhostuser02
 Error: argument "midpdkvhostuser0" is wrong: "name" too long

To get around this issue, I have to pass the '--mirror-to' option as below.

$ ovs-tcpdump -i dpdkvhostuser0 -XX --mirror-to vh0

Is this due to the length of the port name? It would be nice to fix this issue.

Bhanuprakash.


Re: [ovs-dev] [patch v2 3/5] conntrack: Create nat_conn_keys_insert().

2017-09-21 Thread Bodireddy, Bhanuprakash
>Create a separate function from existing code, so the code can be reused in a
>subsequent patch; no change in functionality.
>
>Signed-off-by: Darrell Ball 
>---
> lib/conntrack.c | 42 +-
> 1 file changed, 29 insertions(+), 13 deletions(-)
>
>diff --git a/lib/conntrack.c b/lib/conntrack.c index c94bc27..2eca38d 100644
>--- a/lib/conntrack.c
>+++ b/lib/conntrack.c
>@@ -96,6 +96,11 @@ nat_conn_keys_lookup(struct hmap *nat_conn_keys,
>  const struct conn_key *key,
>  uint32_t basis);
>
>+static bool
>+nat_conn_keys_insert(struct hmap *nat_conn_keys,
>+ const struct conn *nat_conn,
>+ uint32_t hash_basis);
>+

This patch refactors the code with no change in functionality.
Small nit (not strictly necessary): rename the variable from 'hash_basis' to
'basis' to keep it consistent with the other APIs in this file.

LGTM
Acked-by: Bhanuprakash Bodireddy 




Re: [ovs-dev] [patch v2 2/5] conntrack: Minor performance enhancement.

2017-09-21 Thread Bodireddy, Bhanuprakash
>Add an OVS_UNLIKELY and reorder a few variable condition checks.
>
>Signed-off-by: Darrell Ball 
>---
> lib/conntrack.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
>diff --git a/lib/conntrack.c b/lib/conntrack.c index 59d3c4e..c94bc27 100644
>--- a/lib/conntrack.c
>+++ b/lib/conntrack.c
>@@ -1104,7 +1104,7 @@ process_one(struct conntrack *ct, struct dp_packet
>*pkt,
>
> bool tftp_ctl = is_tftp_ctl(pkt);
> struct conn conn_for_expectation;
>-if (conn && (ftp_ctl || tftp_ctl)) {
>+if (OVS_UNLIKELY((ftp_ctl || tftp_ctl) && conn)) {
> conn_for_expectation = *conn;
> }
>
>@@ -1115,10 +1115,10 @@ process_one(struct conntrack *ct, struct
>dp_packet *pkt,
> }
>
> /* FTP control packet handling with expectation creation. */
>-if (OVS_UNLIKELY(conn && ftp_ctl)) {
>+if (OVS_UNLIKELY(ftp_ctl && conn)) {
> handle_ftp_ctl(ct, ctx, pkt, &conn_for_expectation,
>now, CT_FTP_CTL_INTEREST, !!nat_action_info);
>-} else if (OVS_UNLIKELY(conn && tftp_ctl)) {
>+} else if (OVS_UNLIKELY(tftp_ctl && conn)) {
> handle_tftp_ctl(ct, &conn_for_expectation, now);
> }
> }

LGTM 
Acked-by: Bhanuprakash Bodireddy 


Re: [ovs-dev] [patch v2 1/5] conntrack: Fix clang static analysis reports.

2017-09-21 Thread Bodireddy, Bhanuprakash
>These dead assignment warnings do not affect functionality.
>In one case, a local variable could be removed and in another case, the
>working pointer should be used rather than the start pointer.
>
>Fixes: bd5e81a0e596 ("Userspace Datapath: Add ALG infra and FTP.")
>Reported-by: Bhanuprakash Bodireddy
>
>Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2017-
>September/338515.html
>Signed-off-by: Darrell Ball 
>---
> lib/conntrack.c | 12 
> 1 file changed, 4 insertions(+), 8 deletions(-)
>
>diff --git a/lib/conntrack.c b/lib/conntrack.c index 419cb1d..59d3c4e 100644
>--- a/lib/conntrack.c
>+++ b/lib/conntrack.c
>@@ -2617,7 +2617,7 @@ process_ftp_ctl_v4(struct conntrack *ct,
>
> char *ftp = ftp_msg;
> enum ct_alg_mode mode;
>-if (!strncasecmp(ftp_msg, FTP_PORT_CMD, strlen(FTP_PORT_CMD))) {
>+if (!strncasecmp(ftp, FTP_PORT_CMD, strlen(FTP_PORT_CMD))) {
> ftp = ftp_msg + strlen(FTP_PORT_CMD);
> mode = CT_FTP_MODE_ACTIVE;
> } else {
>@@ -2763,7 +2763,7 @@ process_ftp_ctl_v6(struct conntrack *ct,
>
> char *ftp = ftp_msg;
> struct in6_addr ip6_addr;
>-if (!strncasecmp(ftp_msg, FTP_EPRT_CMD, strlen(FTP_EPRT_CMD))) {
>+if (!strncasecmp(ftp, FTP_EPRT_CMD, strlen(FTP_EPRT_CMD))) {
> ftp = ftp_msg + strlen(FTP_EPRT_CMD);
> ftp = skip_non_digits(ftp);
> if (*ftp != FTP_AF_V6 || isdigit(ftp[1])) { @@ -2906,10 +2906,8 @@
>handle_ftp_ctl(struct conntrack *ct, const struct conn_lookup_ctx *ctx,
>
> struct ovs_16aligned_ip6_hdr *nh6 = dp_packet_l3(pkt);
> int64_t seq_skew = 0;
>-bool seq_skew_dir;
> if (ftp_ctl == CT_FTP_CTL_OTHER) {
> seq_skew = conn_for_expectation->seq_skew;
>-seq_skew_dir = conn_for_expectation->seq_skew_dir;
> } else if (ftp_ctl == CT_FTP_CTL_INTEREST) {
> enum ftp_ctl_pkt rc;
> if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) { @@ -2933,18 +2931,16
>@@ handle_ftp_ctl(struct conntrack *ct, const struct conn_lookup_ctx *ctx,
> seq_skew = repl_ftp_v6_addr(pkt, v6_addr_rep, ftp_data_start,
> addr_offset_from_ftp_data_start,
> addr_size, mode);
>-seq_skew_dir = ctx->reply;
> if (seq_skew) {
> ip_len = ntohs(nh6->ip6_ctlun.ip6_un1.ip6_un1_plen);
> ip_len += seq_skew;
> nh6->ip6_ctlun.ip6_un1.ip6_un1_plen = htons(ip_len);
> conn_seq_skew_set(ct, &conn_for_expectation->key, now,
>-  seq_skew, seq_skew_dir);
>+  seq_skew, ctx->reply);
> }
> } else {
> seq_skew = repl_ftp_v4_addr(pkt, v4_addr_rep, ftp_data_start,
> addr_offset_from_ftp_data_start);
>-seq_skew_dir = ctx->reply;
> ip_len = ntohs(l3_hdr->ip_tot_len);
> if (seq_skew) {
> ip_len += seq_skew; @@ -2952,7 +2948,7 @@
>handle_ftp_ctl(struct conntrack *ct, const struct conn_lookup_ctx *ctx,
>   l3_hdr->ip_tot_len, htons(ip_len));
> l3_hdr->ip_tot_len = htons(ip_len);
> conn_seq_skew_set(ct, &conn_for_expectation->key, now,
>-  seq_skew, seq_skew_dir);
>+  seq_skew, ctx->reply);
> }
> }
> } else {
>--

LGTM and verified with clang.

Acked-by: Bhanuprakash Bodireddy 


Re: [ovs-dev] is there any document about how to build debian package with dpdk?

2017-09-21 Thread Bodireddy, Bhanuprakash
>We modified a little code for DPDK, so we must rebuild the OVS Debian package
>with DPDK ourselves.
>Is there any guide on how to build the openvswitch-dpdk package?

There is a guide on this here:
http://docs.openvswitch.org/en/latest/intro/install/debian/

- Bhanuprakash.




Re: [ovs-dev] [PATCH 00/10] Use DP_PACKET_BATCH_FOR_EACH macro.

2017-09-20 Thread Bodireddy, Bhanuprakash
Hi Darrell,

>You have many instances where you want to use
>DP_PACKET_BATCH_FOR_EACH You have another series partially about this:
>https://patchwork.ozlabs.org/patch/813007/
>
>Also, this series mixes in other changes like creating new variables for 
>clarity, I
>guess, and removing unneeded variables. which anyways has different
>motivation but part of the same patch.
>
>Do you think it makes sense to group the DP_PACKET_BATCH_FOR_EACH
>changes in one patch and splice out the other changes as other patches in the
>same series by same theme ?

That makes sense, and I sent out a v2 by merging the two patches of my previous
series. This time the patches are grouped by theme, and I added the details in
the cover letter under the version info.

Cover letter:  
https://mail.openvswitch.org/pipermail/ovs-dev/2017-September/338990.html
https://patchwork.ozlabs.org/patch/816191/

- Bhanuprakash.

>
>Thanks
>Darrell
>
>On 9/19/17, 12:39 PM, "ovs-dev-boun...@openvswitch.org on behalf of
>Bhanuprakash Bodireddy" bhanuprakash.bodire...@intel.com> wrote:
>
>DP_PACKET_BATCH_FOR_EACH macro was introduced early this year as
>part
>of enhancing packet batch APIs. Commit '72c84bc2' implemented this macro
>and replaced most of the calling sites with macros and simplified the 
> logic.
>
>However there are still many APIs that needs to be fixed.
>This patch series is a simple and straightforward set of changes
>aimed at using DP_PACKET_BATCH_FOR_EACH macro at all appropriate
>places.
>Also minor code cleanup is done to improve readability of the code.
>
>No functionality changes and no performance impact with this series.
>
>Bhanuprakash Bodireddy (10):
>  netdev-linux: Clean up netdev_linux_sock_batch_send().
>  netdev-linux: Use DP_PACKET_BATCH_FOR_EACH in
>netdev_linux_tap_batch_send.
>  netdev-dpdk: Cleanup dpdk_do_tx_copy.
>  netdev-dpdk: Minor cleanup of netdev_dpdk_send__.
>  netdev-dpdk: Use DP_PACKET_BATCH_FOR_EACH in
>netdev_dpdk_ring_send
>  netdev-bsd: Use DP_PACKET_BATCH_FOR_EACH in netdev_bsd_send.
>  odp-execute: Use const qualifer for batch size.
>  dpif-netdev: Use DP_PACKET_BATCH_FOR_EACH in
>dp_netdev_run_meter.
>  dpif-netdev: Use DP_PACKET_BATCH_FOR_EACH in fast_path_processing.
>  dpif-netdev: Remove 'cnt' in dp_netdev_input__().
>
> lib/dpif-netdev.c  | 33 +++--
> lib/netdev-bsd.c   |  7 ---
> lib/netdev-dpdk.c  | 40 +++-
> lib/netdev-linux.c | 17 +
> lib/odp-execute.c  |  3 ++-
> 5 files changed, 49 insertions(+), 51 deletions(-)
>
>--
>2.4.11
>



Re: [ovs-dev] [PATCH 02/13] netdev-dummy: Reorder elements in dummy_packet_stream structure.

2017-09-18 Thread Bodireddy, Bhanuprakash
Hi greg,

>On 09/08/2017 10:59 AM, Bhanuprakash Bodireddy wrote:
>> By reordering elements in dummy_packet_stream structure, sum holes
>
>Do you mean "the sum of the holes" can be reduced or do you mean "some
>holes"
>can be reduced?

In this patch series, "sum of the holes" means the total of all the hole bytes
in the respective structure. For example, the 'dummy_packet_stream' structure
members are laid out as shown below; this structure has one hole of 56 bytes.

struct dummy_packet_stream {
    struct stream *stream;       /*     0     8 */

    /* XXX 56 bytes hole, try to pack */

    struct dp_packet rxbuf;      /*    64   704 */
    struct ovs_list txq;         /*   768    16 */
};

With the proposed change in this patch, the new alignment is as below:

struct dummy_packet_stream {
    struct stream *stream;       /*     0     8 */
    struct ovs_list txq;         /*     8    16 */

    /* XXX 40 bytes hole, try to pack */

    struct dp_packet rxbuf;      /*    64   704 */
};

For all the patches, this information is added into the commit log to show the
improvement from the proposed changes. As claimed, the sum of the hole bytes is
reduced from 56 to 40 in this patch.

>> Before: structure size: 784, sum holes: 56, cachelines:13
>> After :  structure size: 768, sum holes: 40, cachelines:12

>
>Same question through several of the other patches where you use the same
>language.

A few structures have multiple holes, and in those cases 'sum holes' adds up
the hole bytes of all of them.

- Bhanuprakash.

>
>> can be reduced, thus saving a cache line.
>>
>> Before: structure size: 784, sum holes: 56, cachelines:13 After :
>> structure size: 768, sum holes: 40, cachelines:12
>>
>> Signed-off-by: Bhanuprakash Bodireddy
>> 
>> ---
>>   lib/netdev-dummy.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/lib/netdev-dummy.c b/lib/netdev-dummy.c index
>> f731af1..d888c40 100644
>> --- a/lib/netdev-dummy.c
>> +++ b/lib/netdev-dummy.c
>> @@ -50,8 +50,8 @@ struct reconnect;
>>
>>   struct dummy_packet_stream {
>>   struct stream *stream;
>> -struct dp_packet rxbuf;
>>   struct ovs_list txq;
>> +struct dp_packet rxbuf;
>>   };
>>
>>   enum dummy_packet_conn_type {
>>



Re: [ovs-dev] [PATCH v4 3/7] dpif-netdev: Register packet processing cores to KA framework.

2017-09-13 Thread Bodireddy, Bhanuprakash
>"Bodireddy, Bhanuprakash" <bhanuprakash.bodire...@intel.com> writes:
>
>>>Bhanuprakash Bodireddy <bhanuprakash.bodire...@intel.com> writes:
>>>
>>>> This commit registers the packet processing PMD cores to keepalive
>>>> framework. Only PMDs that have rxqs mapped will be registered and
>>>> actively monitored by KA framework.
>>>>
>>>> This commit spawns a keepalive thread that will dispatch heartbeats
>>>> to PMD cores. The pmd threads respond to heartbeats by marking
>>>> themselves alive. As long as PMD responds to heartbeats it is considered
>'healthy'.
>>>>
>>>> Signed-off-by: Bhanuprakash Bodireddy
>>>> <bhanuprakash.bodire...@intel.com>
>>>> ---
>>>
>>>I'm really confused with this patch.  I've stopped reviewing the series.
>>>
>>>It seems like there's a mix of 'track by core id' and 'track by thread id'.
>>>
>>>I don't think it's possible to do anything by core id.  We can never
>>>know what else has been scheduled on those cores, and we cannot be
>>>sure that a taskset or other scheduler provisioning call will move the
>threads.
>>
>> [BHANU] I have already answered this in other thread.
>> one can't correlate threads with cores and we shouldn't be tracking by
>> cores. However with PMD threads there is 1:1 mapping of PMD and the
>> core-id and it's safe to temporarily write PMD liveness info in to an
>> array indexed by core id before updating this in to HMAP.
>
>The core-id as a concept here is deceptive.  An external entity (such as
>taskset) can rebalance the PMDs.  External entities can be scheduled on the
>cores.  I think it's dangerous to have anything called core-id in this series 
>or
>feature, because people will naturally infer things which aren't true.
>Additionally, it will lead to things like "well, we know that core x,y,z are
>running at A%, so we can schedule things thusly..."
>
>Makes sense?
>

The concerns above make sense, and you have a valid point.
Unfortunately, the PMD-to-core mapping logic you see here was implemented in
the rte_keepalive library, and I inherited it. As the 1:1 mapping of a thread
(PMD) to a core is deceptive and makes little sense, I reworked this into a
different approach with no impact on datapath performance. I have been testing
it for the last few days to check for performance impacts and other possible
issues.

Previous design:

As part of the heartbeat mechanism (dispatch_heartbeats()), the keepalive_info
structure had arrays indexed by the core-ids used by the PMDs and the keepalive
thread for heartbeating. The arrays were used to keep the logic simple and
lock-free.

The keepalive thread also periodically updated the status into the
'process_list' map using a callback function.

New design:

We already have a 'process_list' map to which all the PMD threads are added by
the main (vswitchd) thread. In the new approach, I take a copy of
'process_list', call it 'cached_process_list', and use this cached map for
heartbeating between the keepalive thread and the PMD threads. No locks are
needed on 'cached_process_list', so the datapath performance is not affected.

Whenever there is a datapath reconfiguration (triggered by pmd-cpu-mask), the
'process_list' map is updated and 'cached_process_list' is reloaded from the
main map, thereby keeping the two maps in sync. This is handled as part of
ka_register_datapath_threads(). I have been testing this, and it seems to be
working fine.

This way we can completely avoid all references to core_id in this series. Let 
me know if you have
any comments on this new approach.

Regards,
Bhanuprakash.


Re: [ovs-dev] [PATCH 12/13] conntrack: Fix dead assignment reported by clang.

2017-09-10 Thread Bodireddy, Bhanuprakash
Hi Darrell,

>What version of clang are you using and in what environment ?

The clang version is  3.5.0. This was seen with clang static analysis.

- Bhanuprakash.

>
>On 9/8/17, 10:59 AM, "ovs-dev-boun...@openvswitch.org on behalf of
>Bhanuprakash Bodireddy" bhanuprakash.bodire...@intel.com> wrote:
>
>Clang reports that value stored to ftp, seq_skew_dir never read inside
>the function.
>
>Signed-off-by: Bhanuprakash Bodireddy
>
>---
> lib/conntrack.c | 5 ++---
> 1 file changed, 2 insertions(+), 3 deletions(-)
>
>diff --git a/lib/conntrack.c b/lib/conntrack.c
>index 419cb1d..a0838ee 100644
>--- a/lib/conntrack.c
>+++ b/lib/conntrack.c
>@@ -2615,7 +2615,7 @@ process_ftp_ctl_v4(struct conntrack *ct,
> char ftp_msg[LARGEST_FTP_MSG_OF_INTEREST + 1] = {0};
> get_ftp_ctl_msg(pkt, ftp_msg);
>
>-char *ftp = ftp_msg;
>+char *ftp;
> enum ct_alg_mode mode;
> if (!strncasecmp(ftp_msg, FTP_PORT_CMD, strlen(FTP_PORT_CMD))) {
> ftp = ftp_msg + strlen(FTP_PORT_CMD);
>@@ -2761,7 +2761,7 @@ process_ftp_ctl_v6(struct conntrack *ct,
> get_ftp_ctl_msg(pkt, ftp_msg);
> *ftp_data_start = tcp_hdr + tcp_hdr_len;
>
>-char *ftp = ftp_msg;
>+char *ftp;
> struct in6_addr ip6_addr;
> if (!strncasecmp(ftp_msg, FTP_EPRT_CMD, strlen(FTP_EPRT_CMD))) {
> ftp = ftp_msg + strlen(FTP_EPRT_CMD);
>@@ -2909,7 +2909,6 @@ handle_ftp_ctl(struct conntrack *ct, const struct
>conn_lookup_ctx *ctx,
> bool seq_skew_dir;
> if (ftp_ctl == CT_FTP_CTL_OTHER) {
> seq_skew = conn_for_expectation->seq_skew;
>-seq_skew_dir = conn_for_expectation->seq_skew_dir;
> } else if (ftp_ctl == CT_FTP_CTL_INTEREST) {
> enum ftp_ctl_pkt rc;
> if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) {
>--
>2.4.11
>
>___
>dev mailing list
>d...@openvswitch.org
>https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>



Re: [ovs-dev] [PATCH v4 0/7] Add OVS DPDK keep-alive functionality.

2017-09-07 Thread Bodireddy, Bhanuprakash
>"Bodireddy, Bhanuprakash" <bhanuprakash.bodire...@intel.com> writes:
>
>> Hi Aaron,
>>
>>>Quick comment before I do an in-depth review.
>>>
>>>One thing that is missing in this series is some form of documentation
>>>added to explain why this feature should exist (for instance, why
>>>can't the standard posix process accounting information suffice?) and
>>>what the high-level concepts are (you have the states being used, but
>>>I don't see a definition that will be needed to understand when reading a
>keep-alive report).
>>
>> I am planning to write a cookbook with instructions on setting up
>> Keepalive in OvS, Installing & configuring collectd and setting up ceilometer
>service to read the events.
>> The definition of the KA states and how to interpret them would be
>> explained in the document. Also the minimal step guide would be added
>> in to OvS-DPDK Advanced guide with links to cookbook.
>
>Please put that as you go in the patches.  It will make review easier, too.

[BHANU] OK.

>
>> On your other question on why posix process accounting isn't enough,
>> please see below for testcase and details.
>>
>>>
>>>I think there could be a reason to provide this, but I think it's
>>>important to explain why collectd will need to use the ovsdb
>>>interface, rather than calling
>>>ex: times[1] or parsing /proc//stat for the runtime (and watching
>>>accumulation).
>>
>> 1) On collectd reading ovsdb rather than directly monitoring the threads.
>>
>>   Collectd for sure is one popular daemon to collect and monitor system
>wide statistics.
>>   However, if we move ovs thread monitoring functionality to collectd we
>are *forcing*
>>   the users to use collectd to monitor OvS health. This may not be a
>problem for someone using
>>   collectd + OpenStack.
>
>It's important to note - collectd monitoring threads has nothing to do with 
>this
>feature.  If collectd can monitor threads from arbitrary processes and report, 
>it
>becomes much more powerful, no?  Let's keep it focused on Open vSwitch.
>
>>   Think of customer using OvS but having their proprietary monitoring
>application with OpenStack or
>>   worse their own orchestrator, in this case it's easy for them to 
>> monitor
>OvS by querying OvSDB
>>   with minimal code changes in to their app.
>>
>>   Also it might be easy for any monitoring application to 
>> query/subscribe to
>OvSDB for knowing the
>>   OvS configuration and health.
>
>I don't really like using the idea of proprietary monitors as justification 
>for this.

>
>OTOH, I think there's a good justification when it comes to multi-node Open
>vSwitch tracking.  There, it may not be possible to aggregate the statistics on
>each individual node (due to possible some kind of administration policy) - so 
>I
>agree having something like this exposed through ovsdb could be useful.


[BHANU] In any case querying ovsdb is most suitable.

>
>> 2) On /proc/[pid]/stats:
>>
>> - I do read 'stats' file in 01/7  patch to get the thread name and 'core id' 
>> the
>thread was last scheduled.
>> - The other fields related to time in stats file can't be completely relied 
>> due
>to below test case.
>>
>> This test scenario was to simulate & identify the PMD stalls when a
>> higher priority thread(kernel/other IO thread) gets scheduled on the same
>core.
>>
>> Test scenario:
>> - OvS with single/multiple PMD thread.
>> - Start a worker thread spinning continuously on the core (stress -c 1 &).
>> - Change the worker thread attributes to RT (chrt -r -p 99  ).
>> - Pin the worker thread on the same core as one of the PMDs  (taskset
>> -p  )
>>
>> Now the PMD stalls as the other worker thread has higher priority and is
>favored & scheduled by Linux scheduler preempting PMD thread.
>> However the /proc/pid/stat shows that the thread is still in
>>  *Running (R)* state  -> field 3   (see the output below)
>>  utime, stime were incrementing  -> fields 14, 15  (-do-)
>>
>> All the other time related fields were '0' as they don't apply here.
>> For fields information:
>> http://man7.org/linux/man-pages/man5/proc.5.html
>>
>> ---sample
>> output---
>> $ cat /proc/12506/stat
>> 12506 (pmd61) R 1 12436 12436 0 -1 4210752 101244 0 0 0 389393 309

Re: [ovs-dev] [PATCH v4 3/7] dpif-netdev: Register packet processing cores to KA framework.

2017-09-07 Thread Bodireddy, Bhanuprakash
>Bhanuprakash Bodireddy  writes:
>
>> This commit registers the packet processing PMD cores to keepalive
>> framework. Only PMDs that have rxqs mapped will be registered and
>> actively monitored by KA framework.
>>
>> This commit spawns a keepalive thread that will dispatch heartbeats to
>> PMD cores. The pmd threads respond to heartbeats by marking themselves
>> alive. As long as PMD responds to heartbeats it is considered 'healthy'.
>>
>> Signed-off-by: Bhanuprakash Bodireddy
>> 
>> ---
>
>I'm really confused with this patch.  I've stopped reviewing the series.
>
>It seems like there's a mix of 'track by core id' and 'track by thread id'.
>
>I don't think it's possible to do anything by core id.  We can never know what
>else has been scheduled on those cores, and we cannot be sure that a taskset
>or other scheduler provisioning call will move the threads.

[BHANU] I have already answered this in the other thread.
One can't correlate threads with cores, and we shouldn't be tracking by cores.
However, with PMD threads there is a 1:1 mapping between a PMD and its core id,
so it's safe to temporarily write PMD liveness info into an array indexed by
core id before updating it into the HMAP.

However, as already mentioned, we use the tid for all other purposes, as it is
unique across the system.
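As a rough illustration of that scheme (names are mine, not the patch's): the PMD marks itself alive with a cheap per-core store, and the keepalive thread harvests the flags before folding them into the tid-keyed map.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_CORES 128

/* One slot per core; valid because each PMD is pinned 1:1 to a core. */
static atomic_bool pmd_alive[MAX_CORES];

/* Called from the PMD polling loop: lock-free, no contention. */
static void
ka_mark_alive(int core_id)
{
    atomic_store_explicit(&pmd_alive[core_id], true, memory_order_relaxed);
}

/* Called from the keepalive thread: read and clear the flag, then the
 * result can be written into the tid-indexed HMAP under its own lock. */
static bool
ka_check_and_clear(int core_id)
{
    return atomic_exchange_explicit(&pmd_alive[core_id], false,
                                    memory_order_relaxed);
}
```

The array is only a staging area; the authoritative record stays keyed by tid, which is unique system-wide.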

>
>>  lib/dpif-netdev.c |  70 +
>>  lib/keepalive.c   | 153
>++
>>  lib/keepalive.h   |  17 ++
>>  lib/util.c|  23 
>>  lib/util.h|   2 +
>>  5 files changed, 254 insertions(+), 11 deletions(-)
>>
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
>> e2cd931..84c7ffd 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -49,6 +49,7 @@
>>  #include "flow.h"
>>  #include "hmapx.h"
>>  #include "id-pool.h"
>> +#include "keepalive.h"
>>  #include "latch.h"
>>  #include "netdev.h"
>>  #include "netdev-vport.h"
>> @@ -978,6 +979,63 @@ sorted_poll_thread_list(struct dp_netdev *dp,
>>  *n = k;
>>  }
>>
>> +static void *
>> +ovs_keepalive(void *f_ OVS_UNUSED)
>> +{
>> +pthread_detach(pthread_self());
>> +
>> +for (;;) {
>> +xusleep(get_ka_interval() * 1000);
>> +}
>> +
>> +return NULL;
>> +}
>> +
>> +static void
>> +ka_thread_start(struct dp_netdev *dp) {
>> +static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
>> +
>> +if (ovsthread_once_start(&once)) {
>> +ovs_thread_create("ovs_keepalive", ovs_keepalive, dp);
>> +
>> +ovsthread_once_done();
>> +}
>> +}
>> +
>> +static void
>> +ka_register_datapath_threads(struct dp_netdev *dp) {
>> +int ka_init = get_ka_init_status();
>> +VLOG_DBG("Keepalive: Was initialization successful? [%s]",
>> +ka_init ? "Success" : "Failure");
>> +if (!ka_init) {
>> +return;
>> +}
>> +
>> +ka_thread_start(dp);
>> +
>> +struct dp_netdev_pmd_thread *pmd;
>> +CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
>> +/*  Register only PMD threads. */
>> +if (pmd->core_id != NON_PMD_CORE_ID) {
>> +int tid = ka_get_pmd_tid(pmd->core_id);
>> +
>> +/* Skip PMD thread with no rxqs mapping. */
>
>why skip these pmds?  we should still watch them, and then we can
>correlated interesting events (for instance... when an rxq gets added whats
>the change in utilization, etc).

[BHANU]  We should skip the PMDs that have no rxqs mapped. This happens in
cases where there are more PMD threads than the number of rxqs.

More importantly, a PMD thread with no mapped rxq will not even enter the
receive loop and will be in the sleep state, as below.

$ ps -eLo tid,psr,comm,state | grep pmd
 51727   3 pmd61   R
 51747   0 pmd62   S
 51749   1 pmd63   S
 51750   2 pmd64   R
 51756   6 pmd65   S
 51758   7 pmd66   R
 51759   4 pmd67   R
 51760   5 pmd68   S

When an rxq gets added to a sleeping PMD thread, a datapath reconfiguration
happens, and at that point the thread gets registered to the KA framework as
below:

CP:  reconfigure_datapath() -> ka_register_datapath_threads() ->
ka_register_thread().

>
>> +if (OVS_UNLIKELY(!hmap_count(&pmd->poll_list))) {
>> +/* rxq mapping changes due to reconfiguration,
>> + * if there are no rxqs mapped to PMD, unregister it. */
>> +ka_unregister_thread(tid, true);
>> +continue;
>> +}
>> +
>> +ka_register_thread(tid, true);
>> +VLOG_INFO("Registered PMD thread [%d] on Core [%d] to KA
>framework",
>> +  tid, pmd->core_id);
>> +}
>> +}
>> +}
>> +
>>  static void
>>  dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, const char
>*argv[],
>>   void *aux)
>> @@ -3625,6 +3683,9 @@ reconfigure_datapath(struct dp_netdev *dp)
>>
>>  /* Reload affected pmd 

Re: [ovs-dev] [PATCH v4 2/7] Keepalive: Add initial keepalive support.

2017-09-07 Thread Bodireddy, Bhanuprakash
Hi Aaron,

My reply inline.

>Hi Bhanu,
>
>Bhanuprakash Bodireddy  writes:
>
>> This commit introduces the initial keepalive support by adding
>> 'keepalive' module and also helper and initialization functions that
>> will be invoked by later commits.
>>
>> This commit adds new ovsdb column "keepalive" that shows the status of
>> the datapath threads. This is implemented for DPDK datapath and only
>> status of PMD threads is reported.
>
>I don't see the value in having this enabled / disabled flag?  Why not just
>always have it on?

[BHANU]

I was following existing conventions here.
OvS statistics work in a similar way: stats can be enabled with
'other_config:enable-statistics=true', with the default being false.
Maybe that is done because an additional thread (system_stats) is spawned to
handle the functionality, and users should have an option to turn it on/off.

>
>Additionally, even setting these true in this commit won't do anything.
>No tracking starts until 3/7, afaict.
>
>I guess it's okay to document here, but it might be worth stating that.

[BHANU]  Ok. 

>
>> For eg:
>>   To enable keepalive feature.
>>   'ovs-vsctl --no-wait set Open_vSwitch . other_config:enable-
>keepalive=true'
>
>I'm not sure that a separate enable / disable flag is needed.
>
>>   To set timer interval of 5000ms for monitoring packet processing cores.
>>   'ovs-vsctl --no-wait set Open_vSwitch . \
>>  other_config:keepalive-interval="5000"
>>
>> Signed-off-by: Bhanuprakash Bodireddy
>> 
>> ---
>
>As stated earlier, please add a Documentation/ update with this.

[BHANU]  I would add the documentation in the respin.  

>
>>  lib/automake.mk|   2 +
>>  lib/keepalive.c| 183
>+
>>  lib/keepalive.h|  87 +
>>  vswitchd/bridge.c  |   3 +
>>  vswitchd/vswitch.ovsschema |   8 +-
>>  vswitchd/vswitch.xml   |  49 
>>  6 files changed, 330 insertions(+), 2 deletions(-)  create mode
>> 100644 lib/keepalive.c  create mode 100644 lib/keepalive.h
>>
>> diff --git a/lib/automake.mk b/lib/automake.mk index 2415f4c..0d99f0a
>> 100644
>> --- a/lib/automake.mk
>> +++ b/lib/automake.mk
>> @@ -110,6 +110,8 @@ lib_libopenvswitch_la_SOURCES = \
>>  lib/json.c \
>>  lib/jsonrpc.c \
>>  lib/jsonrpc.h \
>> +lib/keepalive.c \
>> +lib/keepalive.h \
>>  lib/lacp.c \
>>  lib/lacp.h \
>>  lib/latch.h \
>> diff --git a/lib/keepalive.c b/lib/keepalive.c new file mode 100644
>> index 000..ac73a42
>> --- /dev/null
>> +++ b/lib/keepalive.c
>> @@ -0,0 +1,183 @@
>> +/*
>> + * Copyright (c) 2014, 2015, 2016, 2017 Nicira, Inc.
>
>This line is not appropriately attributing the file.

[BHANU]  Should be "Copyright (c) 2017 Intel, Inc."

>
>> + *
>> + * Licensed under the Apache License, Version 2.0 (the "License");
>> + * you may not use this file except in compliance with the License.
>> + * You may obtain a copy of the License at:
>> + *
>> + * http://www.apache.org/licenses/LICENSE-2.0
>> + *
>> + * Unless required by applicable law or agreed to in writing,
>> + software
>> + * distributed under the License is distributed on an "AS IS" BASIS,
>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
>implied.
>> + * See the License for the specific language governing permissions
>> + and
>> + * limitations under the License.
>> + */
>> +
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +
>> +#include "keepalive.h"
>> +#include "lib/vswitch-idl.h"
>> +#include "openvswitch/vlog.h"
>> +#include "timeval.h"
>> +
>> +VLOG_DEFINE_THIS_MODULE(keepalive);
>> +
>> +static bool keepalive_enable = false;/* Keepalive disabled by default */
>> +static bool ka_init_status = ka_init_failure; /* Keepalive
>> +initialization */
>
>You're assigning this bool a value from an enum.  I know that's probably
>allowed, but it looks strange to me.  I would prefer that this type either 
>reflect
>the enum type or a true/false value is used instead.

[BHANU]   OK.

>
>> +static uint32_t keepalive_timer_interval; /* keepalive timer interval */
>> +static struct keepalive_info *ka_info = NULL;
>
>Why allocate ka_info?  It will simplify some of the later code to just keep it
>statically available.  It also means you can eliminate the
>xzalloc() and free() calls you use later on in code.

[BHANU]   Ok, saves me few lines of code. 

>
>Also, the nice thing about a static declaration is the structure will already 
>be 0
>filled, and you'll know at program initialization time whether it will succeed 
>in
>getting the allocation.
>
>> +
>> +inline bool
>
>The inline keyword is inappropriate in .c files.  Please let the compiler do 
>it's
>job.

[BHANU]   Ok

>
>> +ka_is_enabled(void)
>> +{
>> +return keepalive_enable;
>> +}
>> +
>
>I'm not sure about enable / disable.  In this case, I think the branches are 
>not

Re: [ovs-dev] [PATCH v4 0/7] Add OVS DPDK keep-alive functionality.

2017-09-06 Thread Bodireddy, Bhanuprakash
Hi Aaron,

>Quick comment before I do an in-depth review.
>
>One thing that is missing in this series is some form of documentation added
>to explain why this feature should exist (for instance, why can't the standard
>posix process accounting information suffice?) and what the high-level
>concepts are (you have the states being used, but I don't see a definition that
>will be needed to understand when reading a keep-alive report).

I am planning to write a cookbook with instructions on setting up Keepalive in
OvS, installing & configuring collectd, and setting up the ceilometer service to
read the events. The definitions of the KA states and how to interpret them will
be explained in that document. A minimal step guide will also be added to the
OvS-DPDK Advanced guide, with links to the cookbook.

On your other question of why posix process accounting isn't enough, please see
below for the test case and details.

>
>I think there could be a reason to provide this, but I think it's important to
>explain why collectd will need to use the ovsdb interface, rather than calling
>ex: times[1] or parsing /proc//stat for the runtime (and watching
>accumulation).

1) On collectd reading ovsdb rather than directly monitoring the threads.

  Collectd is certainly a popular daemon for collecting and monitoring
  system-wide statistics. However, if we move OvS thread monitoring
  functionality into collectd, we are *forcing* users to use collectd to
  monitor OvS health. This may not be a problem for someone using
  collectd + OpenStack.

  Think of a customer using OvS but having their own proprietary monitoring
  application with OpenStack, or worse, their own orchestrator; in that case
  it's easy for them to monitor OvS by querying OVSDB with minimal code
  changes to their app.

  It is also easy for any monitoring application to query/subscribe to OVSDB
  to learn the OvS configuration and health.

2) On /proc/[pid]/stat:

- I do read the 'stat' file in patch 01/7 to get the thread name and the core
  id the thread was last scheduled on.
- The other time-related fields in the stat file can't be completely relied
  on, due to the test case below.

This test scenario was designed to simulate & identify PMD stalls when a
higher-priority thread (kernel/other IO thread) gets scheduled on the same core.

Test scenario:
- OvS with single/multiple PMD threads.
- Start a worker thread spinning continuously on the core (stress -c 1 &).
- Change the worker thread attributes to RT (chrt -r -p 99  ).
- Pin the worker thread on the same core as one of the PMDs (taskset -p  )

Now the PMD stalls, as the other worker thread has higher priority and is
favored & scheduled by the Linux scheduler, preempting the PMD thread.
However, /proc/[pid]/stat shows that the thread is still in the
 *Running (R)* state  -> field 3   (see the output below)
 utime, stime were incrementing  -> fields 14, 15  (-do-)

All the other time-related fields were '0' as they don't apply here.
For field descriptions:  http://man7.org/linux/man-pages/man5/proc.5.html

---sample output---
$ cat /proc/12506/stat
12506 (pmd61) R 1 12436 12436 0 -1 4210752 101244 0 0 0 389393 3099 0 0 20 0 35 
0 226680879 4798472192 4363 18446744073709551615
 4194304 9786556 140737290674320 140344414947088 4467454 0 0 4096 24579 0 0 0 
-1 3 0 0 0 0 0 11883752 12196256 48676864 140737290679638
 140737290679790 140737290679790 140737290682316 0


But with the KA framework, the PMD stall is detected immediately and reported.
IMHO, we can use the /proc interface or the other mechanisms you suggested, but
those should be part of additional health checks. I do check /proc/[pid]/stat
to read the thread states as part of a larger health-check mechanism in v3.

Hope I answered all your questions here. Let me know your comments while you
review this series in-depth.

- Bhanuprakash.


Re: [ovs-dev] [PATCH 2/2] dpif-netdev: Per-port conditional EMC insert.

2017-09-01 Thread Bodireddy, Bhanuprakash
Hi Ilya,

>> Tuning the per EMC insertion probability per port based on detailed
>knowledge about the nature of traffic patterns seems a micro-optimization to
>me, which might be helpful in very controlled setups e.g. in synthetic
>benchmarks, but very hard to apply in more general use cases, such as
>vSwitch in OpenStack, where the entity (Nova compute) configuring the
>vhostuser VM ports has no knowledge at all about traffic characteristics.
>>
>> The nice property of the probabilistic EMC insertion is that flows with more
>traffic have a higher chance of ending up in the EMC than flows with lower
>traffic. In your case the few big encapsulated flows from the VM should have
>a higher chance to make it into the EMC than the many smaller individual
>flows into the VM and thus automatically get the bulk of EMC hits.
>>
>> Do you have empirical data that shows that this effect is not sufficient and
>performance can be significantly improved by per-port probabilities?
>>
>> In any case I would request to keep the global configuration option and only
>add the per-port option to override the global probability if wanted.
>>
>
>+1 for backwards compatibility by keeping the global config.

Thanks for this patch.
I proposed a similar approach as an incremental addition when the EMC
probabilistic insertion patch was upstreamed. My concern then was that, as a
global config, all PMD threads and ports would be affected. This was also
discussed in one of the community calls at the time.

The general feedback was that, though it sounds helpful in lab scenarios where
the user has prior knowledge of the traffic, the number of VMs, and the Phy and
vhostuser ports, that may not be the case in OpenStack deployments. The
OpenStack folks mentioned that this kind of optimization can't be easily used
in their deployments.
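For context, the probabilistic insertion being discussed is just a cheap random gate in front of the EMC insert; a minimal sketch with illustrative names, not the actual dpif-netdev code:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* With inverse probability 'inv_prob', insert a flow into the EMC roughly
 * once every inv_prob packets.  High-rate flows hit this gate more often,
 * so they are proportionally more likely to end up in the cache. */
static bool
emc_should_insert(uint32_t inv_prob)
{
    if (inv_prob <= 1) {
        return true;                 /* Probability 1: always insert. */
    }
    return (random() % inv_prob) == 0;
}
```

A per-port override would then only change where `inv_prob` comes from (per-port config versus the existing global default) without touching this fast-path check.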

Regards,
Bhanuprakash. 


Re: [ovs-dev] [PATCH v4 5/5] dpif-netdev: Flush the packets in intermediate queue.

2017-08-11 Thread Bodireddy, Bhanuprakash
Hello All,

Adding all the people here who had either reviewed or provided their feedback
on the batching patches at some stage.

You are already aware that there are 2 different series on the ML implementing
tx batching (netdev layer vs dpif layer) that improve DPDK datapath performance.
Our output batching is in the netdev layer, whereas Ilya moved it to the dpif
layer and simplified it. Each approach has its own pros and cons, which have
been discussed in earlier threads.

While reviewing v4 of my patch series, Ilya detected a race condition that
happens when the queues in the guest are enabled/disabled at run time. Though we
have solutions to address this issue and have implemented them, I realized that
the code complexity has increased, with changes spanning multiple functions and
additional spin locks to address this one corner case.

I think that, though our patch series has flexibility, it has grown a lot more
complex and would be difficult to maintain in the future. At this stage I would
like to lean towards the simpler, more maintainable solution implemented by
Ilya.

I would like to thank Eelco, Darrell, Jan and Ilya for reviewing our series and
providing their feedback.

Bhanuprakash. 

>-Original Message-
>From: Darrell Ball [mailto:db...@vmware.com]
>Sent: Friday, August 11, 2017 2:03 AM
>To: Bodireddy, Bhanuprakash <bhanuprakash.bodire...@intel.com>;
>d...@openvswitch.org
>Subject: Re: [ovs-dev] [PATCH v4 5/5] dpif-netdev: Flush the packets in
>intermediate queue.
>
>Hi Bhanu
>
>Given that you ultimately intend changes beyond those in this patch, would it
>make sense just to fold the follow up series (at least, the key elements) into
>this series, essentially expanding on this patch 5 ?
>
>Thanks Darrell
>
>-Original Message-
>From: <ovs-dev-boun...@openvswitch.org> on behalf of Bhanuprakash
>Bodireddy <bhanuprakash.bodire...@intel.com>
>Date: Tuesday, August 8, 2017 at 10:06 AM
>To: "d...@openvswitch.org" <d...@openvswitch.org>
>Subject: [ovs-dev] [PATCH v4 5/5] dpif-netdev: Flush the packets in
>   intermediate queue.
>
>Under low rate traffic conditions, there can be 2 issues.
>  (1) Packets potentially can get stuck in the intermediate queue.
>  (2) Latency of the packets can increase significantly due to
>   buffering in intermediate queue.
>
>This commit handles the (1) issue by flushing the tx port queues using
>dp_netdev_flush_txq_ports() as part of PMD packet processing loop.
>Also this commit addresses issue (2) by flushing the tx queues after
>every rxq port processing. This reduces the latency with out impacting
>the forwarding throughput.
>
>   MASTER
>  
>   Pkt size  min(ns)   avg(ns)   max(ns)
>512  4,631  5,022309,914
>   1024  5,545  5,749104,294
>   1280  5,978  6,159 45,306
>   1518  6,419  6,774946,850
>
>  MASTER + COMMIT
>  -
>   Pkt size  min(ns)   avg(ns)   max(ns)
>512  4,711  5,064182,477
>   1024  5,601  5,888701,654
>   1280  6,018  6,491533,037
>   1518  6,467  6,734312,471
>
>PMDs can be torn down and spawned at runtime and so the rxq and txq
>mapping of the PMD threads can change. In few cases packets can get
>stuck in the queue due to reconfiguration and this commit helps flush
>the queues.
>
>Suggested-by: Eelco Chaudron <echau...@redhat.com>
>Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2017-April/331039.html
>Signed-off-by: Bhanuprakash Bodireddy
><bhanuprakash.bodire...@intel.com>
>Signed-off-by: Antonio Fischetti <antonio.fische...@intel.com>
>Co-authored-by: Antonio Fischetti <antonio.fische...@intel.com>
>Signed-off-by: Markus Magnusson <markus.magnus...@ericsson.com>
>Co-authored-by: Markus Magnusson <markus.magnus...@ericsson.com>
>Acked-by: Eelco Chaudron <echau...@redhat.com>
>---
> lib/dpif-netdev.c | 52
>+++-
> 1 file changed, 51 insertions(+), 1 deletion(-)
>
>diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>index e2cd931..bfb9650 100644
>--- a/lib/dpif-netdev.c
>+++ b/lib/dpif-netdev.c
>@@ -340,6 +340,7 @@ enum pmd_cycles_counter_type {
> };
>
> #define XPS_TIMEOUT_MS 500LL
>+#define LAST_USED_QID_NO

Re: [ovs-dev] [PATCH v4 2/5] netdev-dpdk: Add netdev_dpdk_vhost_txq_flush function.

2017-08-11 Thread Bodireddy, Bhanuprakash
>On 09.08.2017 15:35, Bodireddy, Bhanuprakash wrote:
>>>>
>>>> +static int
>>>> +netdev_dpdk_vhost_tx_burst(struct netdev_dpdk *dev, int qid) {
>>>> +struct dpdk_tx_queue *txq = &dev->tx_q[qid];
>>>> +struct rte_mbuf **cur_pkts = (struct rte_mbuf
>>>> +**)txq->vhost_burst_pkts;
>>>> +
>>>> +int tx_vid = netdev_dpdk_get_vid(dev);
>>>> +int tx_qid = qid * VIRTIO_QNUM + VIRTIO_RXQ;
>>>> +uint32_t sent = 0;
>>>> +uint32_t retries = 0;
>>>> +uint32_t sum, total_pkts;
>>>> +
>>>> +total_pkts = sum = txq->vhost_pkt_cnt;
>>>> +do {
>>>> +uint32_t ret;
>>>> +ret = rte_vhost_enqueue_burst(tx_vid, tx_qid,
>>>> +  &cur_pkts[sent], sum);
>>>> +if (OVS_UNLIKELY(!ret)) {
>>>> +/* No packets enqueued - do not retry. */
>>>> +break;
>>>> +} else {
>>>> +/* Packet have been sent. */
>>>> +sent += ret;
>>>> +
>>>> +/* 'sum' packet have to be retransmitted. */
>>>> +sum -= ret;
>>>> +}
>>>> +} while (sum && (retries++ < VHOST_ENQ_RETRY_NUM));
>>>> +
>>>> +for (int i = 0; i < total_pkts; i++) {
>>>> +dp_packet_delete(txq->vhost_burst_pkts[i]);
>>>> +}
>>>> +
>>>> +/* Reset pkt count. */
>>>> +txq->vhost_pkt_cnt = 0;
>>>> +
>>>> +/* 'sum' refers to packets dropped. */
>>>> +return sum;
>>>> +}
>>>> +
>>>> +/* Flush the txq if there are any packets available. */ static int
>>>> +netdev_dpdk_vhost_txq_flush(struct netdev *netdev, int qid,
>>>> +bool concurrent_txq OVS_UNUSED) {
>>>> +struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>> +struct dpdk_tx_queue *txq;
>>>> +
>>>> +qid = dev->tx_q[qid % netdev->n_txq].map;
>>>> +
>>>> +/* The qid may be disabled in the guest and has been set to
>>>> + * OVS_VHOST_QUEUE_DISABLED.
>>>> + */
>>>> +if (OVS_UNLIKELY(qid < 0)) {
>>>> +return 0;
>>>> +}
>>>> +
>>>> +txq = &dev->tx_q[qid];
>>>> +/* Increment the drop count and free the memory. */
>>>> +if (OVS_UNLIKELY(!is_vhost_running(dev) ||
>>>> + !(dev->flags & NETDEV_UP))) {
>>>> +
>>>> +if (txq->vhost_pkt_cnt) {
>>>> +rte_spinlock_lock(&dev->stats_lock);
>>>> +dev->stats.tx_dropped += txq->vhost_pkt_cnt;
>>>> +rte_spinlock_unlock(&dev->stats_lock);
>>>> +
>>>> +for (int i = 0; i < txq->vhost_pkt_cnt; i++) {
>>>> +dp_packet_delete(txq->vhost_burst_pkts[i]);
>>>
>>> Spinlock (tx_lock) must be held here to avoid queue and mempool
>breakage.
>>
>> I think you are right. tx_lock might be acquired for freeing the packets.
>
>I think that 'vhost_pkt_cnt' reads and updates also should be protected to
>avoid races.

>From the discussion in the thread
https://mail.openvswitch.org/pipermail/ovs-dev/2017-August/337133.html,
we are going to acquire tx_lock for updating the map and flushing the queue
inside vring_state_changed().

That triggers a deadlock in the flushing function, as we have already acquired
the same lock in netdev_dpdk_vhost_txq_flush().
The same problem applies to freeing the memory and protecting updates to
vhost_pkt_cnt.

    if (OVS_LIKELY(txq->vhost_pkt_cnt)) {
        rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
        netdev_dpdk_vhost_tx_burst(dev, qid);
        rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);
    }

As the problem is triggered when the guest queues are enabled/disabled, with a
small race window where packets can get enqueued into the queue just after the
flush and before the map value is updated in the callback function
(vring_state_changed()), how about this?

Technically, as the queues are disabled, there is no point in flushing the
packets, so let's free the packets and reset txq->vhost_pkt_cnt in
vring_state_changed() itself instead of calling flush().

vring_state_changed().
--
rte_spinlock_lock(&dev->tx_q[qid].tx_lock);

mapped_qid = dev->tx_q[qid].map;
 if (OVS_UNLIKELY(qid != map

Re: [ovs-dev] [PATCH v4 2/5] netdev-dpdk: Add netdev_dpdk_vhost_txq_flush function.

2017-08-10 Thread Bodireddy, Bhanuprakash
>>
  } else {
 +/* If the queue is disabled in the guest, the 
 corresponding qid
 + * map shall be set to OVS_VHOST_QUEUE_DISABLED(-2).
 + *
 + * The packets that were queued in 'qid' could be 
 potentially
 + * stuck and needs to be dropped.
 + *
 + * XXX: The queues may be already disabled in the guest so
 + * flush function in this case only helps in updating 
 stats
 + * and freeing memory.
 + */
 +netdev_dpdk_vhost_txq_flush(&dev->up, qid, 0);
  dev->tx_q[qid].map = OVS_VHOST_QUEUE_DISABLED;
  }
  netdev_dpdk_remap_txqs(dev);
>>
>> 'netdev_dpdk_remap_txqs()', actually, is able to change mapping for
>> all the disabled in guest queues. So, we need to flush all of them
>> while remapping somewhere inside the function.
>> One other thing is that there is a race window between flush and
>> mapping update where another process able to enqueue more packets in
>> just flushed queue. The order of operations should be changed, or both
>> of them should be done under the same tx_lock. I think, it's required
>> to make tx_q[].map field atomic to fix the race condition, because
>> send function takes the 'map' and then locks the corresponding queue.
>> It wasn't an issue before, because packets in case of race was just
>> dropped on attempt to send to disabled queue, but with this patch
>> applied they will be enqueued to the intermediate queue and stuck there.
>
>Making 'map' atomic will not help. To solve the race we should make 'reading
>of map + enqueue' an atomic operation by some spinlock.
>Like this:
>
>vhost_send:
>
>qid = qid % netdev->n_txq;
>rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
>
>mapped_qid = dev->tx_q[qid].map;
>
>if (qid != mapped_qid) {
>rte_spinlock_lock(&dev->tx_q[mapped_qid].tx_lock);
>}
>
>tx_enqueue(mapped_qid, pkts, cnt);
>
>if (qid != mapped_qid) {
>rte_spinlock_unlock(&dev->tx_q[mapped_qid].tx_lock);
>}
>
>rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);
>
>
>txq remapping inside 'netdev_dpdk_remap_txqs()' or
>'vring_state_changed()':
>
>qid - queue we need to remap.
>new_qid - queue we need to remap to.
>
>rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
>
>mapped_qid = dev->tx_q[qid].map;
>if (qid != mapped_qid) {
>rte_spinlock_lock(&dev->tx_q[mapped_qid].tx_lock);
>}
>
>tx_flush(mapped_qid)
>
>if (qid != mapped_qid) {
>rte_spinlock_unlock(&dev->tx_q[mapped_qid].tx_lock);
>}
>
>dev->tx_q[qid].map = new_qid;
>
>rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);
>
>
>Above schema should work without races, but looks kind of ugly and requires
>taking of additional spinlock on each send.
>
>P.S. Sorry for talking with myself. Just want to share my thoughts.

Hi Ilya,

Can you please review the below changes, based on what you suggested above?
As the problem only happens when the queues are enabled/disabled in the guest,
I did some preliminary testing of the changes below by sending traffic into
the VM while enabling and disabling the queues inside the guest at the same
time.

Vhost_send()
-
qid = qid % netdev->n_txq;

/* Acquire tx_lock before reading tx_q[qid].map and enqueueing packets.
 * tx_q[].map gets updated in vring_state_changed() when vrings are
 * enabled/disabled in the guest. */
rte_spinlock_lock(&dev->tx_q[qid].tx_lock);

mapped_qid = dev->tx_q[qid].map;
if (OVS_UNLIKELY(qid != mapped_qid)) {
rte_spinlock_lock(&dev->tx_q[mapped_qid].tx_lock);
}

if (OVS_UNLIKELY(!is_vhost_running(dev) || mapped_qid < 0
 || !(dev->flags & NETDEV_UP))) {
rte_spinlock_lock(&dev->stats_lock);
dev->stats.tx_dropped += cnt;
rte_spinlock_unlock(&dev->stats_lock);

for (i = 0; i < total_pkts; i++) {
dp_packet_delete(pkts[i]);
}

if (OVS_UNLIKELY(qid != mapped_qid)) {
rte_spinlock_unlock(&dev->tx_q[mapped_qid].tx_lock);
}
rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);

return;
}

cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, cnt);
/* Check if QoS has been configured for the netdev. */
cnt = netdev_dpdk_qos_run(dev, cur_pkts, cnt);
dropped = total_pkts - cnt;

int idx = 0;
struct dpdk_tx_queue *txq = &dev->tx_q[mapped_qid];
while (idx < cnt) {
txq->vhost_burst_pkts[txq->vhost_pkt_cnt++] = pkts[idx++];

if (txq->vhost_pkt_cnt >= 

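The archived message is truncated at the threshold check above; presumably the queue is burst out once `vhost_pkt_cnt` reaches the burst threshold. A minimal sketch of that accounting follows; all names here are illustrative stand-ins, not the patch's actual code.

```c
#define BURST_THRESHOLD 32

struct burst_q {
    int pkt_cnt;    /* packets currently buffered */
    int flushes;    /* times the queue was burst out */
    int sent;       /* total packets handed to the backend */
};

static void q_enqueue(struct burst_q *q, int npkts)
{
    for (int i = 0; i < npkts; i++) {
        q->pkt_cnt++;
        if (q->pkt_cnt >= BURST_THRESHOLD) {
            q->sent += q->pkt_cnt;    /* rte_vhost_enqueue_burst() stand-in */
            q->pkt_cnt = 0;
            q->flushes++;
        }
    }
}
```

Anything left below the threshold stays buffered until the periodic flush from the PMD loop picks it up, which is exactly why the flush path discussed in this thread is needed.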
Re: [ovs-dev] [PATCH v4 2/5] netdev-dpdk: Add netdev_dpdk_vhost_txq_flush function.

2017-08-09 Thread Bodireddy, Bhanuprakash
>enable)
  if (enable) {
  dev->tx_q[qid].map = qid;
>>
>> Here flushing required too because we're possibly enabling previously
>remapped queue.
>>
  } else {
 +/* If the queue is disabled in the guest, the corresponding qid
 + * map shall be set to OVS_VHOST_QUEUE_DISABLED(-2).
 + *
 + * The packets that were queued in 'qid' could be potentially
 + * stuck and need to be dropped.
 + *
 + * XXX: The queues may be already disabled in the guest so
 + * flush function in this case only helps in updating stats
 + * and freeing memory.
 + */
 +netdev_dpdk_vhost_txq_flush(&dev->up, qid, 0);
  dev->tx_q[qid].map = OVS_VHOST_QUEUE_DISABLED;
  }
  netdev_dpdk_remap_txqs(dev);
>>
>> 'netdev_dpdk_remap_txqs()', actually, is able to change mapping for
>> all the disabled in guest queues. So, we need to flush all of them
>> while remapping somewhere inside the function.
>> One other thing is that there is a race window between flush and
>> mapping update where another process able to enqueue more packets in
>> just flushed queue. The order of operations should be changed, or both
>> of them should be done under the same tx_lock. I think, it's required
>> to make tx_q[].map field atomic to fix the race condition, because
>> send function takes the 'map' and then locks the corresponding queue.
>> It wasn't an issue before, because in case of a race packets were just
>> dropped on the attempt to send to a disabled queue, but with this patch
>> applied they will be enqueued to the intermediate queue and stuck there.
>
>Making 'map' atomic will not help. To solve the race we should make 'reading
>of map + enqueue' an atomic operation by some spinlock.
>Like this:
>
>vhost_send:
>
>qid = qid % netdev->n_txq;
>rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
>
>mapped_qid = dev->tx_q[qid].map;
>
>if (qid != mapped_qid) {
>rte_spinlock_lock(&dev->tx_q[mapped_qid].tx_lock);
>}
>
>tx_enqueue(mapped_qid, pkts, cnt);
>
>if (qid != mapped_qid) {
>rte_spinlock_unlock(&dev->tx_q[mapped_qid].tx_lock);
>}
>
>rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);
>
>
>txq remapping inside 'netdev_dpdk_remap_txqs()' or
>'vring_state_changed()':
>
>qid - queue we need to remap.
>new_qid - queue we need to remap to.
>
>rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
>
>mapped_qid = dev->tx_q[qid].map;
>if (qid != mapped_qid) {
>rte_spinlock_lock(&dev->tx_q[mapped_qid].tx_lock);
>}
>
>tx_flush(mapped_qid)
>
>if (qid != mapped_qid) {
>rte_spinlock_unlock(&dev->tx_q[mapped_qid].tx_lock);
>}
>
>dev->tx_q[qid].map = new_qid;
>
>rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);
>
>
>Above schema should work without races, but looks kind of ugly and requires
>taking of additional spinlock on each send.
>
>P.S. Sorry for talking with myself. Just want to share my thoughts.

Hi Ilya,

Thanks for reviewing the patches and providing inputs.
I went through your comments for this patch(2/5) and agree with the suggestions.
Meanwhile, I will go through the changes above and get back to you.

Bhanuprakash. 


___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v4 2/5] netdev-dpdk: Add netdev_dpdk_vhost_txq_flush function.

2017-08-09 Thread Bodireddy, Bhanuprakash
>>
>> +static int
>> +netdev_dpdk_vhost_tx_burst(struct netdev_dpdk *dev, int qid) {
>> +struct dpdk_tx_queue *txq = &dev->tx_q[qid];
>> +struct rte_mbuf **cur_pkts = (struct rte_mbuf **)txq->vhost_burst_pkts;
>> +
>> +int tx_vid = netdev_dpdk_get_vid(dev);
>> +int tx_qid = qid * VIRTIO_QNUM + VIRTIO_RXQ;
>> +uint32_t sent = 0;
>> +uint32_t retries = 0;
>> +uint32_t sum, total_pkts;
>> +
>> +total_pkts = sum = txq->vhost_pkt_cnt;
>> +do {
>> +uint32_t ret;
>> +ret = rte_vhost_enqueue_burst(tx_vid, tx_qid, &cur_pkts[sent], sum);
>> +if (OVS_UNLIKELY(!ret)) {
>> +/* No packets enqueued - do not retry. */
>> +break;
>> +} else {
>> +/* Packets have been sent. */
>> +sent += ret;
>> +
>> +/* 'sum' packets have to be retransmitted. */
>> +sum -= ret;
>> +}
>> +} while (sum && (retries++ < VHOST_ENQ_RETRY_NUM));
>> +
>> +for (int i = 0; i < total_pkts; i++) {
>> +dp_packet_delete(txq->vhost_burst_pkts[i]);
>> +}
>> +
>> +/* Reset pkt count. */
>> +txq->vhost_pkt_cnt = 0;
>> +
>> +/* 'sum' refers to packets dropped. */
>> +return sum;
>> +}
>> +
>> +/* Flush the txq if there are any packets available. */ static int
>> +netdev_dpdk_vhost_txq_flush(struct netdev *netdev, int qid,
>> +bool concurrent_txq OVS_UNUSED) {
>> +struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>> +struct dpdk_tx_queue *txq;
>> +
>> +qid = dev->tx_q[qid % netdev->n_txq].map;
>> +
>> +/* The qid may be disabled in the guest and has been set to
>> + * OVS_VHOST_QUEUE_DISABLED.
>> + */
>> +if (OVS_UNLIKELY(qid < 0)) {
>> +return 0;
>> +}
>> +
>> +txq = &dev->tx_q[qid];
>> +/* Increment the drop count and free the memory. */
>> +if (OVS_UNLIKELY(!is_vhost_running(dev) ||
>> + !(dev->flags & NETDEV_UP))) {
>> +
>> +if (txq->vhost_pkt_cnt) {
>> +rte_spinlock_lock(&dev->stats_lock);
>> +dev->stats.tx_dropped += txq->vhost_pkt_cnt;
>> +rte_spinlock_unlock(&dev->stats_lock);
>> +
>> +for (int i = 0; i < txq->vhost_pkt_cnt; i++) {
>> +dp_packet_delete(txq->vhost_burst_pkts[i]);
>
>Spinlock (tx_lock) must be held here to avoid queue and mempool breakage.

I think you are right; tx_lock should be acquired while freeing the packets.

---
rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
for (int i = 0; i < txq->vhost_pkt_cnt; i++) {
    dp_packet_delete(txq->vhost_burst_pkts[i]);
}
rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);

- Bhanuprakash
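The bounded retry loop in the quoted netdev_dpdk_vhost_tx_burst() above is worth isolating, since the final 'sum' doubles as the drop count. In the sketch below, `enqueue` is a stand-in for `rte_vhost_enqueue_burst()` that reports how many packets it accepted; all other names are illustrative.

```c
#define VHOST_ENQ_RETRY_NUM 8

/* Returns the number of packets dropped after the retries. */
static int tx_burst_retry(int total, int (*enqueue)(int want))
{
    int sent = 0;
    int sum = total;          /* packets still to (re)transmit */
    int retries = 0;

    do {
        int ret = enqueue(sum);
        if (ret == 0) {
            break;            /* nothing enqueued - do not retry */
        }
        sent += ret;
        sum -= ret;
    } while (sum && retries++ < VHOST_ENQ_RETRY_NUM);

    (void) sent;
    return sum;               /* 'sum' now counts dropped packets */
}

/* Example backends: one accepts at most 10 packets per call, one none. */
static int accept_up_to_10(int want) { return want < 10 ? want : 10; }
static int accept_none(int want) { (void) want; return 0; }
```

Note the loop body runs at most 1 + VHOST_ENQ_RETRY_NUM times, so a slow guest bounds the PMD's stall rather than blocking it indefinitely.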


Re: [ovs-dev] [PATCH v4 1/5] netdev: Add netdev_txq_flush function.

2017-08-09 Thread Bodireddy, Bhanuprakash
Hi Ilya,
>>
>> +/* Flush tx queues.
>> + * This is done periodically to empty the intermediate queue in case
>> +of
>> + * fewer packets (< INTERIM_QUEUE_BURST_THRESHOLD) buffered in the
>queue.
>> + */
>> +static int
>> +netdev_dpdk_txq_flush(struct netdev *netdev, int qid , bool
>> +concurrent_txq) {
>> +struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>> +struct dpdk_tx_queue *txq = &dev->tx_q[qid];
>> +
>> +if (OVS_LIKELY(txq->dpdk_pkt_cnt)) {
>> +if (OVS_UNLIKELY(concurrent_txq)) {
>> +qid = qid % dev->up.n_txq;
>> +rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
>> +}
>> +
>> +netdev_dpdk_eth_tx_burst(dev, qid, txq->dpdk_burst_pkts,
>> + txq->dpdk_pkt_cnt);
>
>The queue used for send and the locked one are different because you're
>remapping the qid before taking the spinlock.

>I suspect that we're always using right queue numbers in current
>implementation of dpif-netdev, but I need to recheck to be sure.

I believe the case you are referring to here is the XPS case ('dynamic_txqs' true).
When we have to flush the packets, we retrieve the qid from
'cached_tx_port->last_used_qid', which was initialized earlier by
'dpif_netdev_xps_get_tx_qid()'. The logic of remapping the qid and acquiring the
spin lock in the above function is no different from the current logic in master.
Can you elaborate the specific case where this would break the functionality?

Please note that in 'dpif_netdev_xps_get_tx_qid()' the qid can change, and that is
why we flush the queue.

- Bhanuprakash. 

>Anyway, logic of this function completely broken.
>
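Setting the dispute aside, one way to make the question moot is to remap the qid before taking either the buffer pointer or the lock, so that buffer, burst, and lock all refer to the same queue. The sketch below illustrates that ordering with simplified mock types; it is not the patch's actual code, and locking is only indicated in comments.

```c
#include <stdbool.h>

#define MAX_Q 4

struct mock_txq { int pkt_cnt; int bursts; };
struct mock_dev { int n_txq; struct mock_txq tx_q[MAX_Q]; };

static int txq_flush(struct mock_dev *dev, int qid, bool concurrent_txq)
{
    qid = qid % dev->n_txq;               /* remap BEFORE taking the pointer */
    struct mock_txq *txq = &dev->tx_q[qid];

    if (txq->pkt_cnt) {
        /* rte_spinlock_lock(&dev->tx_q[qid].tx_lock) here if concurrent_txq */
        txq->bursts++;                    /* netdev_dpdk_eth_tx_burst() stand-in */
        txq->pkt_cnt = 0;
        /* rte_spinlock_unlock(...) here if concurrent_txq */
    }
    (void) concurrent_txq;
    return 0;
}
```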


Re: [ovs-dev] [PATCH v3 00/19] Add OVS DPDK keep-alive functionality.

2017-08-08 Thread Bodireddy, Bhanuprakash
HI Ilya,

>I understand that using rte_keepalive library was worth in the early RFC
>because size of RFC was comparable with the size of rte_keepalive library.
>But now, as so many generic things was implemented in lib/keepalive.{c,h}
>and the size of the patch-set is pretty large, IMHO, it's better to implement
>'struct rte_keepalive' and 'rte_keepalive_dispatch_pings()' inside
>lib/keepalive.{c,h} and remove dpdk library as a dependency for this
>functionality.

I agree with your suggestion  and will factor in  this input for next series.

>
>'rte_keepalive' doesn't have any dpdk-specific things inside. It doesn't work
>with NICs or DPDK-allocated memory. This library is just a simple wrapper.
>So, do we need the dependency from dpdk only to use this wrapper? Without
>it we'll have generic keepalive functionality for the whole OVS without
>additional subs and dpdk references in generic code.

Completely agree and this will help us avoid dummy functions.   

>
>I'm asking you to try to implement 'struct rte_keepalive' and
>'rte_keepalive_dispatch_pings()' inside lib/keepalive.{c,h} and move all the
>keepalive related code out of [netdev-]dpdk.{c,h} to keepalive.{c,h} and,
>possibly, to dpif-netdev.{c,h}.
>I'm expecting significant improvements in code size, simplicity and 
>readability.
>Also, this will allow to use keepalive without DPDK.

I have tried my best not to clutter netdev-dpdk and dpif-netdev. I hope by 
removing
the dependency on DPDK Keepalive library it might be even better.  

I will work on this and wait for inputs from other reviewers before posting 
next version.

- Bhanuprakash.

>
>Best regards, Ilya Maximets.
>
>On 04.08.2017 18:24, Bodireddy, Bhanuprakash wrote:
>> HI Ilya,
>>
>> Thanks for looking in to this and providing your feedback.
>>
>> When this feature was first posted as RFC
>(https://mail.openvswitch.org/pipermail/ovs-dev/2016-July/318243.html),
>the implementation in OvS was done based on DPDK Keepalive library and
>keeping collectd in sync.  As you can see from RFC it was pretty compact code
>and integrated well with ceilometer and provided end to end functionality.
>Much of the RFC code was  to handle SHM.
>>
>> However the reviewers pointed below flaws.
>>
>> - Very DPDK specific.
>> - Shared memory for inter process communication(Between OvS and
>collectd threads).
>> - Tracks PMD cores and not threads.
>> - Limited support to detect false negatives & false positives.
>> - Limited support to query KA status.
>>
>> As per suggestions, below changes were introduced.
>>
>> - Basic infrastructure to register & track threads instead of cores. (Now 
>> only
>PMDs are tracked only but can be extended to track non-PMD threads).
>> - Keep most of the APIs generic so that they can extended in the
>> future. All generic APIs are in Keepalive.[hc]
>> - Remove Shared memory and introduce OvSDB.
>> - Add support to detect false negatives.
>> - appctl options to query status.
>>
>> I agree that we have few issues but they can be reworked.
>>  -  invoke dpdk_is_enabled() from generic code (vswitchd/bridge.c) isn't
>nice, I had to do  to pass few unit test cases last time.
>>  -  Half a dozen stub APIs. I couldn't avoid it as they are needed to get the
>kernel datapath build.
>>
>> The patch series can be categorized  in to sub patchesets (KA infrastructure/
>OvSDB changes/  Query KA stats / Check False positives).  This patch series in
>the current form is using rte_keepalive library to handle PMD thread. But
>importantly has  introduced basic infrastructure to deal with other threads in
>the future.
>>
>> Regards,
>> Bhanuprakash.
>>
>>> -Original Message-
>>> From: Ilya Maximets [mailto:i.maxim...@samsung.com]
>>> Sent: Friday, August 4, 2017 2:40 PM
>>> To: ovs-dev@openvswitch.org; Bodireddy, Bhanuprakash
>>> <bhanuprakash.bodire...@intel.com>
>>> Cc: Darrell Ball <db...@vmware.com>; Ben Pfaff <b...@ovn.org>; Aaron
>>> Conole <acon...@redhat.com>
>>> Subject: Re: [ovs-dev] [PATCH v3 00/19] Add OVS DPDK keep-alive
>>> functionality.
>>>
>>> Hi Bhanuprakash,
>>>
>>> Thanks for working on this.
>>> I have a general concern about implementation of this functionality:
>>>
>>> *What is the profit from using rte_keepalive library ?*
>>>
>>> Pros:
>>>
>>>* No need to implement 'rte_keepalive_dispatch_pings()' (40 LOC)
>>>  and struct rte_keepalive (30 LOC, can be significantly decreased
>>>  by removing not needed elements) ---> ~70 LOC.

Re: [ovs-dev] [PATCH v3 6/6] dpif-netdev: Flush the packets in intermediate queue.

2017-08-08 Thread Bodireddy, Bhanuprakash
Hi Darrell,

>
>Under low rate traffic conditions, there can be 2 issues.
>  (1) Packets potentially can get stuck in the intermediate queue.
>  (2) Latency of the packets can increase significantly due to
>   buffering in intermediate queue.
>
>This commit handles the (1) issue by flushing the tx port queues from
>PMD processing loop. Also this commit addresses issue (2) by flushing
>the tx queues after every rxq port processing. This reduces the latency
>with out impacting the forwarding throughput.
>
>   MASTER
>  
>   Pkt size  min(ns)   avg(ns)   max(ns)
>512  4,631  5,022309,914
>   1024  5,545  5,749104,294
>   1280  5,978  6,159 45,306
>   1518  6,419  6,774946,850
>
>  MASTER + COMMIT
>  -
>   Pkt size  min(ns)   avg(ns)   max(ns)
>512  4,711  5,064182,477
>   1024  5,601  5,888701,654
>   1280  6,018  6,491533,037
>   1518  6,467  6,734312,471
>
>PMDs can be teared down and spawned at runtime and so the rxq and txq
>mapping of the PMD threads can change. In few cases packets can get
>stuck in the queue due to reconfiguration and this commit helps flush
>the queues.
>
>Suggested-by: Eelco Chaudron 
>Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2017-April/331039.html
>Signed-off-by: Bhanuprakash Bodireddy
>
>Signed-off-by: Antonio Fischetti 
>Co-authored-by: Antonio Fischetti 
>Signed-off-by: Markus Magnusson 
>Co-authored-by: Markus Magnusson 
>Acked-by: Eelco Chaudron 
>---
> lib/dpif-netdev.c | 7 +++
> 1 file changed, 7 insertions(+)
>
>diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>index 7e1f5bc..f03bd3e 100644
>--- a/lib/dpif-netdev.c
>+++ b/lib/dpif-netdev.c
>@@ -3603,6 +3603,8 @@ dpif_netdev_run(struct dpif *dpif)
> for (i = 0; i < port->n_rxq; i++) {
> dp_netdev_process_rxq_port(non_pmd, port->rxqs[i].rx,
>port->port_no);
>+
>+dp_netdev_flush_txq_ports(non_pmd);
>
>
>
>Is this a temporary change ?; seems counter to the objective ?
>Should be latency based, as discussed on another thread couple months ago
>?; configurable by port type and port ?

This is a temporary change, made to keep the latency well within limits.
With this change, the performance improvement is *only* observed when the rx batch
size is significant (unlikely in real use cases).
The incremental patch series (on top of this) should address that by buffering
packets in the intermediate queue across multiple rx batches. Latency configs
would also be introduced, as you mentioned above, so users can tune according to
their requirements.

This needs significant testing as we need to strike a fine balance between 
throughput and latency and shall be done as part of next series.

- Bhanuprakash.
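The "latency based" tuning asked about above could take the shape of a deadline check alongside the burst threshold: flush either when enough packets are buffered or when the oldest buffered packet has waited too long. This is only a sketch of the policy; the names and the 50 us default are illustrative, not from any posted patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define BURST_THRESHOLD 32
#define MAX_LATENCY_NS  50000        /* 50 us; purely illustrative default */

struct timed_q {
    int pkt_cnt;
    uint64_t first_enqueue_ns;       /* arrival time of oldest buffered pkt */
};

static void timed_q_add(struct timed_q *q, uint64_t now_ns)
{
    if (q->pkt_cnt == 0) {
        q->first_enqueue_ns = now_ns;
    }
    q->pkt_cnt++;
}

/* Flush on either condition: burst threshold reached, or the oldest
 * buffered packet has exceeded the configured latency budget. */
static bool should_flush(const struct timed_q *q, uint64_t now_ns)
{
    return q->pkt_cnt >= BURST_THRESHOLD
           || (q->pkt_cnt > 0
               && now_ns - q->first_enqueue_ns >= MAX_LATENCY_NS);
}
```

Making MAX_LATENCY_NS configurable per port type would match the per-port tuning Darrell suggests.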

>
>
>
> }
> }
> }
>@@ -3760,6 +3762,8 @@ reload:
> for (i = 0; i < poll_cnt; i++) {
> dp_netdev_process_rxq_port(pmd, poll_list[i].rx,
>poll_list[i].port_no);
>+
>+dp_netdev_flush_txq_ports(pmd);
> }
>
>
>
>Same comment as above.
>
>
>
> if (lc++ > 1024) {
>@@ -3780,6 +3784,9 @@ reload:
> }
> }
>
>+/* Flush the queues as part of reconfiguration logic. */
>+dp_netdev_flush_txq_ports(pmd);
>+
> poll_cnt = pmd_load_queues_and_ports(pmd, _list);
> exiting = latch_is_set(>exit_latch);
> /* Signal here to make sure the pmd finishes
>--
>2.4.11
>



Re: [ovs-dev] [PATCH v3 4/6] netdev-dpdk: Add intermediate queue support.

2017-08-08 Thread Bodireddy, Bhanuprakash
>
>This commit introduces netdev_dpdk_eth_tx_queue() function that
>implements intermediate queue and packet buffering. The packets get
>buffered till the threshold 'INTERIM_QUEUE_BURST_THRESHOLD[32] is
>reached and eventually gets transmitted.
>
>To handle the case(eg: ping) where packets are sent at low rate and
>can potentially get stuck in the queue, flush logic is implemented
>that gets invoked from dp_netdev_flush_txq_ports() as part of PMD packet
>processing loop.
>
>Signed-off-by: Bhanuprakash Bodireddy
>
>Signed-off-by: Antonio Fischetti 
>Co-authored-by: Antonio Fischetti 
>Signed-off-by: Markus Magnusson 
>Co-authored-by: Markus Magnusson 
>Acked-by: Eelco Chaudron 
>---
> lib/dpif-netdev.c | 44
>+++-
> lib/netdev-dpdk.c | 37 +++--
> 2 files changed, 78 insertions(+), 3 deletions(-)
>
>diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>index 4e29085..7e1f5bc 100644
>--- a/lib/dpif-netdev.c
>+++ b/lib/dpif-netdev.c
>@@ -332,6 +332,7 @@ enum pmd_cycles_counter_type {
> };
>
> #define XPS_TIMEOUT_MS 500LL
>+#define LAST_USED_QID_NONE -1
>
> /* Contained by struct dp_netdev_port's 'rxqs' member.  */
> struct dp_netdev_rxq {
>@@ -492,7 +493,13 @@ struct rxq_poll {
> struct tx_port {
> struct dp_netdev_port *port;
> int qid;
>-long long last_used;
>+int last_used_qid;/* Last queue id where packets got
>+ enqueued. */
>+long long last_used;  /* In case XPS is enabled, it contains the
>+   * timestamp of the last time the port was
>+   * used by the thread to send data.  After
>+   * XPS_TIMEOUT_MS elapses the qid will be
>+   * marked as -1. */
> struct hmap_node node;
> };
>
>@@ -3080,6 +3087,25 @@ cycles_count_end(struct dp_netdev_pmd_thread *pmd,
> }
>
> static void
>+dp_netdev_flush_txq_ports(struct dp_netdev_pmd_thread *pmd)
>+{
>+struct tx_port *cached_tx_port;
>+int tx_qid;
>+
>+HMAP_FOR_EACH (cached_tx_port, node, &pmd->send_port_cache) {
>+tx_qid = cached_tx_port->last_used_qid;
>+
>+if (tx_qid != LAST_USED_QID_NONE) {
>+netdev_txq_flush(cached_tx_port->port->netdev, tx_qid,
>+ cached_tx_port->port->dynamic_txqs);
>+
>+/* Queue flushed and mark it empty. */
>+cached_tx_port->last_used_qid = LAST_USED_QID_NONE;
>+}
>+}
>+}
>+
>
>Could you move this function and I think the other code in dpif-netdev.c to
>patch 6, if you can ?

Should be a simple change. Will do this.

>This function is unused, so will generate a build error with –Werror when
>applied in sequence and logically this seems like it can go into patch 6.

Completely agree. 

- Bhanuprakash.

>
>Darrell
>
>
>+static void
> dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
>struct netdev_rxq *rx,
>odp_port_t port_no)
>@@ -4355,6 +4381,7 @@ dp_netdev_add_port_tx_to_pmd(struct
>dp_netdev_pmd_thread *pmd,
>
> tx->port = port;
> tx->qid = -1;
>+tx->last_used_qid = LAST_USED_QID_NONE;
>
> hmap_insert(&pmd->tx_ports, &tx->node, hash_port_no(tx->port->port_no));
> pmd->need_reload = true;
>@@ -4925,6 +4952,14 @@ dpif_netdev_xps_get_tx_qid(const struct
>dp_netdev_pmd_thread *pmd,
>
> dpif_netdev_xps_revalidate_pmd(pmd, now, false);
>
>+/* The tx queue can change in XPS case, make sure packets in previous
>+ * queue is flushed properly. */
>+if (tx->last_used_qid != LAST_USED_QID_NONE &&
>+   tx->qid != tx->last_used_qid) {
>+netdev_txq_flush(port->netdev, tx->last_used_qid, port->dynamic_txqs);
>+tx->last_used_qid = LAST_USED_QID_NONE;
>+}
>+
> VLOG_DBG("Core %d: New TX queue ID %d for port \'%s\'.",
>  pmd->core_id, tx->qid, netdev_get_name(tx->port->netdev));
> return min_qid;
>@@ -5020,6 +5055,13 @@ dp_execute_cb(void *aux_, struct
>dp_packet_batch *packets_,
> tx_qid = pmd->static_tx_qid;
> }
>
>+/* In case these packets gets buffered into an intermediate
>+ * queue and XPS is enabled the flush function could find a
>+ * different tx qid assigned to its thread.  We keep track
>+ * of the qid we're now using, that will trigger the flush
>  

Re: [ovs-dev] [PATCH v3 2/6] netdev-dpdk: Add netdev_dpdk_txq_flush function.

2017-08-08 Thread Bodireddy, Bhanuprakash
>Hi Bhanu
>
>Would it be possible to combine patches 1 and 2, rather than initially defining
>an empty netdev_txq_flush for dpdk ? I think the combined patch would have
>more context.

No problem Darrell. I will merge 1 & 2 in v4.

- Bhanuprakash.

>
>
>-Original Message-
>From:  on behalf of Bhanuprakash
>Bodireddy 
>Date: Thursday, June 29, 2017 at 3:39 PM
>To: "d...@openvswitch.org" 
>Subject: [ovs-dev] [PATCH v3 2/6] netdev-dpdk: Add netdev_dpdk_txq_flush
>   function.
>
>This commit adds netdev_dpdk_txq_flush() function. If there are
>any packets waiting in the queue, they are transmitted instantly
>using the rte_eth_tx_burst function. In XPS enabled case, lock is
>taken on the tx queue before flushing the queue.
>
>Signed-off-by: Bhanuprakash Bodireddy
>
>Signed-off-by: Antonio Fischetti 
>Co-authored-by: Antonio Fischetti 
>Signed-off-by: Markus Magnusson 
>Co-authored-by: Markus Magnusson 
>Acked-by: Eelco Chaudron 
>---
> lib/netdev-dpdk.c | 31 +--
> 1 file changed, 29 insertions(+), 2 deletions(-)
>
>diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>index 9ca4433..dd42716 100644
>--- a/lib/netdev-dpdk.c
>+++ b/lib/netdev-dpdk.c
>@@ -293,6 +293,11 @@ struct dpdk_mp {
> struct ovs_list list_node OVS_GUARDED_BY(dpdk_mp_mutex);
> };
>
>+/* Queue 'INTERIM_QUEUE_BURST_THRESHOLD' packets before
>transmitting.
>+ * Defaults to 'NETDEV_MAX_BURST'(32) packets.
>+ */
>+#define INTERIM_QUEUE_BURST_THRESHOLD NETDEV_MAX_BURST
>+
> /* There should be one 'struct dpdk_tx_queue' created for
>  * each cpu core. */
> struct dpdk_tx_queue {
>@@ -302,6 +307,12 @@ struct dpdk_tx_queue {
> * pmd threads (see 'concurrent_txq'). 
> */
> int map;   /* Mapping of configured vhost-user 
> queues
> * to enabled by guest. */
>+int dpdk_pkt_cnt;  /* Number of buffered packets waiting 
> to
>+  be sent on DPDK tx queue. */
>+struct rte_mbuf
>*dpdk_burst_pkts[INTERIM_QUEUE_BURST_THRESHOLD];
>+   /* Intermediate queue where packets can
>+* be buffered to amortize the cost of 
> MMIO
>+* writes. */
> };
>
> /* dpdk has no way to remove dpdk ring ethernet devices
>@@ -1897,9 +1908,25 @@ netdev_dpdk_send__(struct netdev_dpdk *dev,
>int qid,
>  * few packets (< INTERIM_QUEUE_BURST_THRESHOLD) buffered in the
>queue.
>  */
> static int
>-netdev_dpdk_txq_flush(struct netdev *netdev OVS_UNUSED,
>-  int qid OVS_UNUSED, bool concurrent_txq OVS_UNUSED)
>+netdev_dpdk_txq_flush(struct netdev *netdev,
>+  int qid, bool concurrent_txq)
> {
>+struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>+struct dpdk_tx_queue *txq = &dev->tx_q[qid];
>+
>+if (OVS_LIKELY(txq->dpdk_pkt_cnt)) {
>+if (OVS_UNLIKELY(concurrent_txq)) {
>+qid = qid % dev->up.n_txq;
>+rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
>+}
>+
>+netdev_dpdk_eth_tx_burst(dev, qid, txq->dpdk_burst_pkts,
>+ txq->dpdk_pkt_cnt);
>+
>+if (OVS_UNLIKELY(concurrent_txq)) {
>+rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);
>+}
>+}
> return 0;
> }
>
>--
>2.4.11
>



Re: [ovs-dev] [PATCH v3 0/6] netdev-dpdk: Use intermediate queue during packet transmission.

2017-08-08 Thread Bodireddy, Bhanuprakash
Hi Darrell,

>Sorry, I was multitasking last week and did not get a chance to finish the
>responses on Friday
>
>I looked thru. the code for all the patches The last 3 patches of V3 needed a
>manual merge; as you know, the series needs a rebase after recent commits.

I  will rebase and send out v4. 

>For a full o/p batch case, I see about a 10% drop in pps; is that what you see 
>?

I see a ~200-250 kpps drop in the P2P case with a single flow, and significant
improvements when the number of flows reaches the rx batch size.

Can you please let me know if 'full o/p batch' above means a simple P2P test with a
single flow?
It would be helpful if you could share your traffic profile so I can reproduce this
locally.

>After applying each patch, we should be able to build and nothing should be
>broken, which is not the case since patch 4 has a function only used in patch 
>6.
>I have some comments on the individual patches.

I might have introduced this problem when I reordered patches. I will fix this.

- Bhanuprakash.

>
>Darrell
>
>-Original Message-
>From:  on behalf of Bhanuprakash
>Bodireddy 
>Date: Thursday, June 29, 2017 at 3:39 PM
>To: "d...@openvswitch.org" 
>Subject: [ovs-dev] [PATCH v3 0/6] netdev-dpdk: Use intermediate queue
>during packet transmission.
>
>After packet classification, packets are queued in to batches depending
>on the matching netdev flow. Thereafter each batch is processed to
>execute the related actions. This becomes particularly inefficient if
>there are few packets in each batch as rte_eth_tx_burst() incurs expensive
>MMIO writes.
>
>This patch series implements intermediate queue for DPDK and vHost User
>ports.
>Packets are queued and burst when the packet count exceeds threshold.
>Also
>drain logic is implemented to handle cases where packets can get stuck in
>the tx queues at low rate traffic conditions. Care has been taken to see
>that latency is well with in the acceptable limits. Testing shows 
> significant
>performance gains with this implementation.
>
>This patch series combines the earlier 2 patches posted below.
>  DPDK patch: https://mail.openvswitch.org/pipermail/ovs-dev/2017-April/331039.html
>  vHost User patch: https://mail.openvswitch.org/pipermail/ovs-dev/2017-May/332271.html
>
>Performance Numbers with intermediate queue:
>
>  DPDK ports
> ===
>
>  Throughput for P2P scenario, for two 82599ES 10G port with 64 byte
>packets
>
>  Number
>  flows   MASTER With PATCH
>  ====
>10   10727283  13393844
>32    7042253  11228799
>50    7515491   9607791
>   100    5838699   9430730
>   500    5285066   7845807
>  1000    5226477   7135601
>
>   Latency test
>
>   MASTER
>   ===
>   Pkt size  min(ns)  avg(ns)  max(ns)
>512  4,631  5,022309,914
>   1024  5,545  5,749104,294
>   1280  5,978  6,159 45,306
>   1518  6,419  6,774946,850
>
>   PATCH
>   =
>   Pkt size  min(ns)  avg(ns)  max(ns)
>512  4,711  5,064182,477
>   1024  5,601  5,888701,654
>   1280  6,018  6,491533,037
>   1518  6,467  6,734312,471
>
>   vHost User ports
>  ==
>
>  Throughput for PV scenario, with 64 byte packets
>
>   Number
>   flows   MASTERWith PATCH
>    =   =
>10   5945899   7833914
>32   3872211   6530133
>50   3283713   6618711
>   100   3132540   5857226
>   500   2964499   5273006
>  1000   2931952   5178038
>
>  Latency test.
>
>  MASTER
>  ===
>  Pkt size  min(ns)  avg(ns)  max(ns)
>   512  10,011   12,100   281,915
>  1024   7,8709,313   193,116
>  1280   7,8629,036   194,439
>  1518   8,2159,417   204,782
>
>  PATCH
>  ===
>  Pkt size  min(ns)  avg(ns)  max(ns)
>   512  10,492   13,655   281,538
>  1024   8,4079,784   205,095
>  1280   8,3999,750   194,888
>  1518   8,3679,722   196,973
>
>Performance number reported by Eelco Chaudron redhat.com> at
>  https://urldefense.proofpoint.com/v2/url?u=https-

Re: [ovs-dev] [PATCH v3 00/19] Add OVS DPDK keep-alive functionality.

2017-08-04 Thread Bodireddy, Bhanuprakash
HI Ilya,

Thanks for looking in to this and providing your feedback. 

When this feature was first posted as RFC 
(https://mail.openvswitch.org/pipermail/ovs-dev/2016-July/318243.html), the 
implementation in OvS was done based on DPDK Keepalive library and keeping 
collectd in sync.  As you can see from RFC it was pretty compact code and 
integrated well with ceilometer and provided end to end functionality. Much of 
the RFC code was  to handle SHM. 

However the reviewers pointed below flaws.

- Very DPDK specific.
- Shared memory for inter process communication(Between OvS and collectd 
threads).
- Tracks PMD cores and not threads.
- Limited support to detect false negatives & false positives.
- Limited support to query KA status.

As per suggestions, below changes were introduced.

- Basic infrastructure to register & track threads instead of cores. (Currently only
PMDs are tracked, but this can be extended to track non-PMD threads.)
- Keep most of the APIs generic so that they can extended in the future. All 
generic APIs are in Keepalive.[hc]
- Remove Shared memory and introduce OvSDB.
- Add support to detect false negatives.
- appctl options to query status.

I agree that we have few issues but they can be reworked.
 -  invoke dpdk_is_enabled() from generic code (vswitchd/bridge.c) isn't nice; I had
to do it to pass a few unit test cases last time.
 -  Half a dozen stub APIs. I couldn't avoid them as they are needed to get the
kernel datapath to build.

The patch series can be categorized into sub patch-sets (KA infrastructure /
OvSDB changes/  Query KA stats / Check False positives).  This patch series in 
the current form is using rte_keepalive library to handle PMD thread. But 
importantly has  introduced basic infrastructure to deal with other threads in 
the future.  

Regards,
Bhanuprakash. 
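For a sense of scale, the generic (DPDK-free) dispatch Ilya proposes could look roughly like the sketch below: each monitored thread bumps its own counter, and a dispatcher periodically compares against the last snapshot to classify the thread. All names here are illustrative; the real rte_keepalive state machine also distinguishes "missing" from "dead" over multiple intervals.

```c
enum ka_state { KA_ALIVE, KA_MISSING };

struct ka_slot {
    unsigned long heartbeats;   /* incremented by the monitored thread */
    unsigned long last_seen;    /* snapshot taken by the dispatcher */
    enum ka_state state;
};

/* Called from the monitored (e.g. PMD) thread's main loop. */
static void ka_mark_alive(struct ka_slot *s)
{
    s->heartbeats++;
}

/* Called periodically from the keepalive thread: a thread is alive if
 * its counter moved since the last dispatch. */
static void ka_dispatch(struct ka_slot *slots, int n)
{
    for (int i = 0; i < n; i++) {
        slots[i].state = (slots[i].heartbeats != slots[i].last_seen)
                         ? KA_ALIVE : KA_MISSING;
        slots[i].last_seen = slots[i].heartbeats;
    }
}
```

In a real implementation the counters would need atomic or relaxed-ordering accesses, but the control flow stays this small, which supports the "~70 LOC" estimate above.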

>-Original Message-
>From: Ilya Maximets [mailto:i.maxim...@samsung.com]
>Sent: Friday, August 4, 2017 2:40 PM
>To: ovs-dev@openvswitch.org; Bodireddy, Bhanuprakash
><bhanuprakash.bodire...@intel.com>
>Cc: Darrell Ball <db...@vmware.com>; Ben Pfaff <b...@ovn.org>; Aaron
>Conole <acon...@redhat.com>
>Subject: Re: [ovs-dev] [PATCH v3 00/19] Add OVS DPDK keep-alive
>functionality.
>
>Hi Bhanuprakash,
>
>Thanks for working on this.
>I have a general concern about implementation of this functionality:
>
>*What is the profit from using rte_keepalive library ?*
>
>Pros:
>
>* No need to implement 'rte_keepalive_dispatch_pings()' (40 LOC)
>  and struct rte_keepalive (30 LOC, can be significantly decreased
>  by removing not needed elements) ---> ~70 LOC.
>
>Cons:
>
>* DPDK dependency:
>
>* Implementation of PMD threads management (KA) inside netdev code
>  (netdev-dpdk) looks very strange.
>* Many DPDK references in generic code (like dpdk_is_enabled).
>* Feature isn't available for the common threads (main?) wihtout DPDK.
>* Many stubs and placeholders for cases without dpdk.
>* No ability for unit testing.
>
>So, does it worth to use rte_keepalive? To make functionality fully generic we
>only need to implement 'rte_keepalive_dispatch_pings()'
>and few helpers. As soon as this function does nothing dpdk-specific it's a
>really simple task which will allow to greatly clean up the code. The feature 
>is
>too big to use external library for 70 LOCs of really simple code. (Clean up
>should save much more).
>
>Have I missed something?
>Any thoughts?
>
>Best regards, Ilya Maximets.
>
>> Keepalive feature is aimed at achieving Fastpath Service Assurance in
>> OVS-DPDK deployments. It adds support for monitoring the packet
>> processing cores(PMD thread cores) by dispatching heartbeats at
>> regular intervals. In case of heartbeat misses, additional health checks
>> are enabled on the PMD thread to detect the failure and the same shall
>> be reported to higher level fault management systems/frameworks.
>>
>> The implementation uses OVSDB for reporting the health of the PMD
>threads.
>> Any external monitoring application can read the status from OVSDB at
>> regular intervals (or) subscribe to the updates in OVSDB so that they
>> get notified when the changes happen on OVSDB.
>>
>> keepalive info struct is created and initialized for storing the
>> status of the PMD threads. This is initialized by main
>> thread(vswitchd) as part of init process and will be periodically updated by
>'keepalive'
>> thread. keepalive feature can be enabled through below OVSDB settings.
>>
>> enable-keepalive=true
>>   - Keepalive feature is disabled by default.
>>
>> keepalive-interval="5000"
>>   - Timer interval in milliseconds for monito

Re: [ovs-dev] DPDK Merge Repo

2017-08-02 Thread Bodireddy, Bhanuprakash
>> Hi Darrell and Ben.
>>
>> > Hi All
>> >
>> > As mentioned before, I am using a repo for DPDK patch merging.
>> > The repo is here:
>> > https://github.com/darball/ovs/
>> >
>> > There are still some outstanding patches from Bhanu that have not
>> completed review yet:
>> >
>> > util: Add PADDED_MEMBERS_CACHELINE_MARKER macro to mark
>cachelines.-
>> > Bhanu
>> > packets: Reorganize the pkt_metadata structure. - Bhanu
>> >
>> > and a series we would like to get into 2.8
>> >
>> > netdev-dpdk: Use intermediate queue during packet transmission.
>> > Bhanu Jun 29/V3
>> > netdev: Add netdev_txq_flush function.
>> > netdev-dpdk: Add netdev_dpdk_txq_flush function.
>> > netdev-dpdk: Add netdev_dpdk_vhost_txq_flush function.
>> > netdev-dpdk: Add intermediate queue support.
>> > netdev-dpdk: Enable intermediate queue for vHost User port.
>> > dpif-netdev: Flush the packets in intermediate queue.
>>
>> I think that we still not reached agreement about the level of
>> implementation (netdev-dpdk or dpif-netdev). Just few people
>> participate in discussion which is not very productive. I suggest not
>> to target output batching for 2.8 release because of this and also
>> lack of testing and review.
>> As I understand, we have only 3 days merge window for the new features
>> and I expect that we can't finish discussion, review and testing in time.
>>
>
>My own opinion on this, this feature has been kicking around for quite a
>while,  the original patch from Bhanu went out back in December.
>https://mail.openvswitch.org/pipermail/ovs-dev/2016-
>December/326348.html

Unfortunately it dates back to Aug 2016, almost a year ago.
(Refer: https://mail.openvswitch.org/pipermail/ovs-dev/2016-August/321748.html)

I reported this issue and copied Ilya (the original author), whose commit
b59cc14e032d ("netdev-dpdk: Use instant sending instead of queueing of
packets") introduced this particular issue in 2.6.

Unfortunately the author who introduced this issue didn't respond to that
question, and we came up with the patch series to address it. Multiple RFC
versions were posted, and Ilya participated in reviews and provided feedback.
It's unacceptable now to say that this patch series hasn't been reviewed
enough. A lot of time has been invested in this feature, especially for
rebasing, testing, collecting latency stats and promptly replying to all the
questions on the ML.

>
>There's a level of due diligence carried out in terms of reviewing and testing
>from a range of people in the community for the netdev approach and a
>number of users are already using this without issue. As such I would like this
>approach to be included in the 2.8 release.

As Ian rightly pointed out, we know of a few internal and external customers
already running this patch series (with incremental changes) in their
deployments. There may always be a few corner cases; those can be addressed,
and this shouldn't be a concern for getting this into the 2.8 series.

>
>I think the dpif layer is more generic and in the long run more maintainable
>but it was quite late in being flagged as an alternate approach and is not as
>mature in terms of testing/reviews. As such I don't think it should block the
>netdev approach until it has reached the same level of feedback and testing
>from the community. The dpif approach could target the 2.9 release after it
>has received more feedback and replace the netdev approach when the pros
>and cons of both have been clearly demonstrated.

This has been discussed, and each approach has its own merits and demerits.
Darrell has already put his views in other threads.

- Bhanuprakash. 
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH RFC v2 4/4] dpif-netdev: Time based output batching.

2017-08-01 Thread Bodireddy, Bhanuprakash
>On 28.07.2017 10:20, Darrell Ball wrote:
>> I have not tested yet
>>
>> However, I would have expected something max latency config. to be
>specific to netdev-dpdk port types
>
>IMHO, if we can make it generic, we must make it generic.
>
>[Darrell]
>The first question I ask myself is -  is this functionality intrinsically 
>generic or is
>it not ?
>It is clearly not and trying to make it artificially so would do the following:
>
>1) We end up designing something the wrong way where it partially works.
>2) Breaks other features present and future that really do intersect.
>
>
> Making of this
>functionality netdev-dpdk specific will break the ability to test it using
>unit tests. As the change is complex and has a lot of pitfalls like
>possible packet stucks and possible latency issues, this code should be
>covered by unit tests to simplify the support and modifications.
>(And it's already partly covered because it is generic. And I fixed many
>minor issues while developing through unit test failures.)
>
>[Darrell]
>Most of dpdk is not tested by our unit tests because it cannot be simulated
>well at the moment. This is orthogonal to the basic question however.

Darrell is right; the unit tests we currently have don't exercise the DPDK
datapath well, so having these changes in the netdev layer shouldn't impact
the unit tests much.

While I share your other concern that changes in the netdev layer will be a
little complex and slightly painful for future code changes, the max-latency
config introduced in the dpif layer may not hold good for different port
types, and users may later introduce conflicting changes in the netdev layer
to suit their use cases.
 
>
>
>In the future this can be used also to improve performance of netdev-linux
>by replacing sendmsg() with batched sendmmsg(). This should significantly
>increase performance of flood actions while MACs are not learned yet in
>action NORMAL.
>
>> This type of code also seems to intersect with present and future QoS
>considerations in netdev-dpdk

>
>Maybe, but there are also some related features in mail-list like rx queue
>prioritization which are implemented in generic way on dpif-netdev layer.

If you are referring to rxq prioritization work by Billy 
(https://mail.openvswitch.org/pipermail/ovs-dev/2017-July/336001.html),
this feature is more implemented in netdev layer with very minimal updates to 
dpif layer. 

BTW,  dp_execute_cb()  is getting cluttered with this patch. 

- Bhanuprakash.

>
>>
>> -Original Message-
>> From: Ilya Maximets 
>> Date: Wednesday, July 26, 2017 at 8:21 AM
>> To: "ovs-dev@openvswitch.org" ,
>Bhanuprakash Bodireddy 
>> Cc: Heetae Ahn , Ben Pfaff
>, Antonio Fischetti , Eelco
>Chaudron , Ciara Loftus ,
>Kevin Traynor , Darrell Ball ,
>Ilya Maximets 
>> Subject: [PATCH RFC v2 4/4] dpif-netdev: Time based output batching.
>>
>> This allows to collect packets from more than one RX burst
>> and send them together with a configurable maximum latency.
>>
>> 'other_config:output-max-latency' can be used to configure
>> time that a packet can wait in output batch for sending.
>>
>> Signed-off-by: Ilya Maximets 
>> ---
>>
>> millisecon granularity is used for now. Can be easily switched to use
>> microseconds instead.
>>
>>  lib/dpif-netdev.c| 97
>+++-
>>  vswitchd/vswitch.xml | 15 
>>  2 files changed, 95 insertions(+), 17 deletions(-)
>>
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>> index 07c7dad..e5f8a3d 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -84,6 +84,9 @@ VLOG_DEFINE_THIS_MODULE(dpif_netdev);
>>  #define MAX_RECIRC_DEPTH 5
>>  DEFINE_STATIC_PER_THREAD_DATA(uint32_t, recirc_depth, 0)
>>
>> +/* Use instant packet send by default. */
>> +#define DEFAULT_OUTPUT_MAX_LATENCY 0
>> +
>>  /* Configuration parameters. */
>>  enum { MAX_FLOWS = 65536 }; /* Maximum number of flows in flow
>table. */
>>  enum { MAX_METERS = 65536 };/* Maximum number of meters. */
>> @@ -261,6 +264,9 @@ struct dp_netdev {
>>  struct hmap ports;
>>  struct seq *port_seq;   /* Incremented whenever a port 
> changes.
>*/
>>
>> +/* The time that a packet can wait in output batch for sending. 
> */
>> +atomic_uint32_t output_max_latency;
>> +
>>  /* Meters. */
>>  

Re: [ovs-dev] [PATCH v2 0/4] Output packet batching.

2017-08-01 Thread Bodireddy, Bhanuprakash
>This patch-set inspired by [1] from Bhanuprakash Bodireddy.
>Implementation of [1] looks very complex and introduces many pitfalls for
>later code modifications like possible packet stucks.
>
>This version targeted to make simple and flexible output packet batching on
>higher level without introducing and even simplifying netdev layer.
>
>Patch set consists of 3 patches. All the functionality introduced in the first
>patch. Two others are just cleanups of netdevs to not do unnecessary things.
>
>Basic testing of 'PVP with OVS bonding on phy ports' scenario shows
>significant performance improvement.
>More accurate and intensive testing required.
>
>[1] [PATCH 0/6] netdev-dpdk: Use intermediate queue during packet
>transmission.
>https://mail.openvswitch.org/pipermail/ovs-dev/2017-June/334762.html
>
>Version 2:
>
>   * Rebased on current master.
>   * Added time based batching RFC patch.
>   * Fixed mixing packets with different sources in same batch.
>

Applied this series along with other patches [1] and gave it an initial try.
With this series, a throughput drop of approximately half a million is
observed in a simple test case (P2P - 1 stream - UDP) vs. master + [1].
The performance improvement is observed with multiple flows (which this
series is meant to address).

At this stage no latency settings were used. Yet to review and do more testing.

[1] Improves performance.
https://mail.openvswitch.org/pipermail/ovs-dev/2017-July/335359.html
https://mail.openvswitch.org/pipermail/ovs-dev/2017-July/336186.html
https://mail.openvswitch.org/pipermail/ovs-dev/2017-July/336187.html
https://mail.openvswitch.org/pipermail/ovs-dev/2017-July/336290.html

- Bhanuprakash.


Re: [ovs-dev] [PATCH 0/4] prioritizing latency sensitive traffic

2017-07-28 Thread Bodireddy, Bhanuprakash
Hi Billy,

>Hi All,
>
>This patch set provides a method to request ingress scheduling on interfaces.
>It also provides an implemtation of same for DPDK physical ports.
>
>This allows specific packet types to be:
>* forwarded to their destination port ahead of other packets.
>and/or
>* be less likely to be dropped in an overloaded situation.
>
>It was previously discussed
>https://mail.openvswitch.org/pipermail/ovs-discuss/2017-May/044395.html
>and RFC'd
>https://mail.openvswitch.org/pipermail/ovs-dev/2017-July/335237.html
>
>Limitations of this patch:
>* The patch uses the Flow Director filter API in DPDK and has only been tested
>on Fortville (XL710) NIC.
>* Prioritization is limited to:
>** eth_type
>** Fully specified 5-tuple src & dst ip and port numbers for UDP & TCP packets
>* ovs-appctl dpif-netdev/pmd-*-show o/p should indicate rxq prioritization.
>* any requirements for a more granular prioritization mechanism
>
>Initial results:
>* even when userspace OVS is very much overloaded and
>  dropping significant numbers of packets the drop rate for prioritized traffic
>  is running at 1/1000th of the drop rate for non-prioritized traffic.
>
>* the latency profile of prioritized traffic through userspace OVS is also much
>  improved
>
>1e0 |*
>|*
>1e-1|* | Non-prioritized pkt latency
>|* * Prioritized pkt latency
>1e-2|*
>|*
>1e-3|*   |
>|*   |
>1e-4|*   | | |
>|*   |*| |
>1e-5|*   |*| | |
>|*   |*|*| |  |
>1e-6|*   |*|*|*|  |
>|*   |*|*|*|* |
>1e-7|*   |*|*|*|* |*
>|*   |*|*|*|* |*
>1e-8|*   |*|*|*|* |*
>  0-1 1-20 20-40 40-50 50-60 60-70 ... 120-400
>Latency (us)
>
> Proportion of packets per latency bin @ 80% Max Throughput
>  (Log scale)
> 

Thanks for working on this feature. I started reviewing the code initially,
but later decided to test it first, since it uses the XL710 NIC flow director
features and I wanted to know the implications, if any. I have a few
observations here and would like to know if you saw these during your unit
tests.

1) With this patch series, the Rx Burst Bulk Allocation callback is invoked
   instead of the vector rx function: i40e_recv_pkts_bulk_alloc() gets invoked
   instead of i40e_recv_pkts_vec(). Please check i40e_set_rx_function() in the
   i40e DPDK driver.

   I am speculating this may be due to enabling the flow director and rules.
   I don't know the implications of using the bulk_alloc() function; maybe we
   should check with the DPDK guys on this.
 
2) When I tried to prioritize the UDP packets for specific IPs and ports, I
   see a massive performance drop. I am using an XL710 NIC with stable
   firmware. Below are my steps.

   - Start OvS and make sure n_rxq for the dpdk0/dpdk1 ports is set to 2.
   - Do a simple P2P test with a single stream
     (ip_src=8.18.8.1, ip_dst=101.10.10.1, udp_src=10001, udp_dst=5001) and
     check the throughput.
   - Prioritize the active stream:
     ovs-vsctl set interface dpdk0 other_config:ingress_sched=udp,ip_src=8.18.8.1,ip_dst=101.10.10.1,udp_src=10001,udp_dst=5001
   - A throughput drop is observed now (~1.7 Mpps).

A bit of debugging into case 2 shows that miniflow_hash_5tuple() is getting
invoked and consuming 10% of the total cycles. One of the commits had the
lines below.

In dpdk_eth_dev_queue_setup():
--------
/* Ingress scheduling requires ETH_MQ_RX_NONE so limit it to when exactly
 * two rxqs are defined. Otherwise MQ will not work as expected. */
if (dev->ingress_sched_str && n_rxq == 2) {
    conf.rxmode.mq_mode = ETH_MQ_RX_NONE;
} else {
    conf.rxmode.mq_mode = ETH_MQ_RX_RSS;
}
--------

Does ingress scheduling turn off RSS? That would be a big drawback, as
calculating the hash in software consumes significant cycles.

3) This is another corner case.
   - Here n_rxq is set to 4 for my DPDK ports. Start OvS; traffic is started
     and throughput is as expected.
   - Now prioritize the stream:
     ovs-vsctl set interface dpdk0 other_config:ingress_sched=udp,ip_src=8.18.8.1,ip_dst=101.10.10.1,udp_src=10001,udp_dst=5001
   - The above command shouldn't take effect, as n_rxq is set to 4 and not 2,
     and the same is logged appropriately:

     "2017-07-28T11:11:57.792Z|00104|netdev_dpdk|ERR|Interface dpdk0: Ingress scheduling config ignored; Requires n_rxq==2.
     2017-07-28T11:11:57.809Z|00105|dpdk|INFO|PMD: i40e_pf_config_rss(): Max of contiguous 4 PF queues are configured"

  - However the 

Re: [ovs-dev] [PATCH v2 0/4] Output packet batching.

2017-07-27 Thread Bodireddy, Bhanuprakash
Hi Ilya,

I am OOO and will review and test this patch series shortly (by Monday).

Bhanuprakash.

>-Original Message-
>From: Ilya Maximets [mailto:i.maxim...@samsung.com]
>Sent: Wednesday, July 26, 2017 4:21 PM
>To: ovs-dev@openvswitch.org; Bodireddy, Bhanuprakash
><bhanuprakash.bodire...@intel.com>
>Cc: Heetae Ahn <heetae82@samsung.com>; Ben Pfaff <b...@ovn.org>;
>Fischetti, Antonio <antonio.fische...@intel.com>; Eelco Chaudron
><echau...@redhat.com>; Loftus, Ciara <ciara.lof...@intel.com>; Kevin
>Traynor <ktray...@redhat.com>; Darrell Ball <db...@vmware.com>; Ilya
>Maximets <i.maxim...@samsung.com>
>Subject: [PATCH v2 0/4] Output packet batching.
>
>This patch-set inspired by [1] from Bhanuprakash Bodireddy.
>Implementation of [1] looks very complex and introduces many pitfalls for
>later code modifications like possible packet stucks.
>
>This version targeted to make simple and flexible output packet batching on
>higher level without introducing and even simplifying netdev layer.
>
>Patch set consists of 3 patches. All the functionality introduced in the first
>patch. Two others are just cleanups of netdevs to not do unnecessary things.
>
>Basic testing of 'PVP with OVS bonding on phy ports' scenario shows
>significant performance improvement.
>More accurate and intensive testing required.
>
>[1] [PATCH 0/6] netdev-dpdk: Use intermediate queue during packet
>transmission.
>https://mail.openvswitch.org/pipermail/ovs-dev/2017-June/334762.html
>
>Version 2:
>
>   * Rebased on current master.
>   * Added time based batching RFC patch.
>   * Fixed mixing packets with different sources in same batch.
>
>Ilya Maximets (4):
>  dpif-netdev: Output packet batching.
>  netdev: Remove unused may_steal.
>  netdev: Remove useless cutlen.
>  dpif-netdev: Time based output batching.
>
> lib/dpif-netdev.c | 175
>++
> lib/netdev-bsd.c  |   7 +-
> lib/netdev-dpdk.c |  30 -
> lib/netdev-dummy.c|   6 +-
> lib/netdev-linux.c|   7 +-
> lib/netdev-provider.h |   7 +-
> lib/netdev.c  |  12 ++--
> lib/netdev.h  |   2 +-
> vswitchd/vswitch.xml  |  15 +
> 9 files changed, 189 insertions(+), 72 deletions(-)
>
>--
>2.7.4



Re: [ovs-dev] [PATCH] packets: Reorganize the pkt_metdata structure.

2017-07-24 Thread Bodireddy, Bhanuprakash
Hello Ben,
>> >
>> >I doubt that this is about a warning, because as I understand it OVS
>> >on MSVC causes a lot of warnings, so it's probably a more serious issue.
>>
>> In will try to get MSVC installed and verify the OvS build. Will get back to 
>> you
>on this.
>
>If you push to a branch on github in your own repo, then appveyor will
>automatically do a build, so you don't have to install MSVC.

AppVeyor is very helpful for verifying the OvS build with MSVC. BTW, I sent
out 2 separate patches.

Introduced new  'PADDED_MEMBERS_CACHELINE_MARKER' macro to mark cachelines.
https://mail.openvswitch.org/pipermail/ovs-dev/2017-July/336186.html

v3 patch using the new macro.  
https://mail.openvswitch.org/pipermail/ovs-dev/2017-July/336187.html

With the above patches the build is successful on MSVC. I confirmed it using 
Appveyor.

- Bhanuprakash.




Re: [ovs-dev] [PATCH 00/20] Add OVS DPDK keep-alive functionality.

2017-07-21 Thread Bodireddy, Bhanuprakash
>> Keepalive feature is aimed at achieving Fastpath Service Assurance in
>> OVS-DPDK deployments. It adds support for monitoring the packet
>> processing cores(PMD thread cores) by dispatching heartbeats at
>> regular intervals. In case of heartbeat misses, additional health checks
>> are enabled on the PMD thread to detect the failure and the same shall
>> be reported to higher level fault management systems/frameworks.
>>
>> The implementation uses OVSDB for reporting the health of the PMD
>threads.
>> Any external monitoring application can read the status from OVSDB at
>> regular intervals (or) subscribe to the updates in OVSDB so that they
>> get notified when the changes happen on OVSDB.
>>
>
>Hi Bhanu,
>
>I had some problems applying this to master.  Can you rebase?
>

Hi Aaron,

V1 doesn't apply cleanly on the master any more. I rebased and posted v2 here 
https://mail.openvswitch.org/pipermail/ovs-dev/2017-July/336040.html
I am on vacation and will try to watch my email for further follow-ups.

- Bhanuprakash.


Re: [ovs-dev] [PATCH v2 1/2] util: Make PADDED_MEMBERS work more than once per struct.

2017-07-13 Thread Bodireddy, Bhanuprakash
>Until now, if the PADDED_MEMBERS macro was used more than once in a
>struct, it caused Clang and GCC warnings because of multiple definition of a
>member named "pad".  This commit fixes the problem by giving each of these
>a unique name.
>
>MSVC, Clang, and GCC all support the __COUNTER__ macro, although it is not
>standardized.
>
>Signed-off-by: Ben Pfaff 
>---
> include/openvswitch/util.h | 5 -
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
>diff --git a/include/openvswitch/util.h b/include/openvswitch/util.h index
>8453550cd845..17b06528f1a0 100644
>--- a/include/openvswitch/util.h
>+++ b/include/openvswitch/util.h
>@@ -172,10 +172,13 @@ OVS_NO_RETURN void ovs_assert_failure(const
>char *, const char *, const char *);
>  * PADDED_MEMBERS(8, uint8_t x; uint8_t y;);
>  * };
>  */
>+#define PAD_PASTE2(x, y) x##y
>+#define PAD_PASTE(x, y) PAD_PASTE2(x, y) #define PAD_ID
>PAD_PASTE(pad,
>+__COUNTER__)
> #define PADDED_MEMBERS(UNIT, MEMBERS)   \
> union { \
> struct { MEMBERS }; \
>-uint8_t pad[ROUND_UP(sizeof(struct { MEMBERS }), UNIT)];\
>+uint8_t PAD_ID[ROUND_UP(sizeof(struct { MEMBERS }), UNIT)]; \
> }
>
> static inline bool

LGTM; tested it as well.

Acked-by: Bhanuprakash Bodireddy 


Re: [ovs-dev] [PATCH] packets: Reorganize the pkt_metdata structure.

2017-07-12 Thread Bodireddy, Bhanuprakash
>
>On Wed, Jul 12, 2017 at 10:38:30AM +, Bodireddy, Bhanuprakash wrote:
>> Hi Ben,
>>
>> >On Thu, Jun 22, 2017 at 10:10:49PM +0100, Bhanuprakash Bodireddy wrote:
>> >> pkt_metadata_init() is called for every packet in userspace
>> >> datapath and initializes few members in pkt_metadata. Before this
>> >> the members that needs to be initialized are prefetched using
>> >pkt_metadata_prefetch_init().
>> >>
>> >> The above functions are critical to the userspace datapath
>> >> performance and should be in sync. Any changes to the pkt_metadata
>> >> should also include changes to metadata_init() and prefetch_init() if
>necessary.
>> >>
>> >> This commit slightly refactors the pkt_metadata structure and
>> >> introduces cache line markers to catch any violations to the
>> >> structure. Also only prefetch the cachelines having the members
>> >> that needs
>> >to be zeroed out.
>> >>
>> >> Signed-off-by: Bhanuprakash Bodireddy
>> >> <bhanuprakash.bodire...@intel.com>
>> >
>> >OVS has a PADDED_MEMBERS macro that makes this easier.
>Unfortunately
>> >it isn't currently adequate for use more than once per struct.  But
>> >it's fixable, so I sent a patch to fix it:
>> >
>> >https://mail.openvswitch.org/pipermail/ovs-dev/2017-July/335204.html
>>
>> Thanks for this patch. PADDED_MEMBERS macro is pretty handy.
>>
>> >and a fixed-up version of the original patch:
>> >
>> >https://mail.openvswitch.org/pipermail/ovs-dev/2017-July/335205.html
>>
>> Thanks  for improving the patch.
>>
>> >
>> >However, even with the fix, this is going to cause problems with
>> >MSVC, because it does not allow 0-length arrays.  Maybe you can find
>> >another way to mark the beginning of a cache line.
>>
>> Microsoft links mentions that the warning "zero-sized array in
>> struct/union" we encountered is a 'Level-4' warning and is numbered
>'C4200'.
>>
>> C4200: https://msdn.microsoft.com/en-us/library/79wf64bc.aspx
>>
>> I can't think of alternate ways to mark the cachelines, so how about
>> this incremental change in lib/util.h that disables the warning when building
>with MSVC?
>>
>> I am currently on vacation and have limited access to lab to download and
>test the changes with Microsoft compiliers.
>> Apologies for this.
>>
>> --8<--cut here-->8--
>>
>> #define CACHE_LINE_SIZE 64
>> BUILD_ASSERT_DECL(IS_POW2(CACHE_LINE_SIZE));
>> #ifndef _MSC_VER
>> typedef void *OVS_CACHE_LINE_MARKER[0]; #else __pragma
>(warning(push))
>> __pragma (warning(disable:4200)) typedef void
>> *OVS_CACHE_LINE_MARKER[0]; __pragma (warning(pop)) #endif
>>
>> - Bhanuprakash.
>
>I doubt that this is about a warning, because as I understand it OVS on MSVC
>causes a lot of warnings, so it's probably a more serious issue.

I will try to get MSVC installed and verify the OvS build. Will get back to
you on this.

- Bhanuprakash.



Re: [ovs-dev] [PATCH] packets: Reorganize the pkt_metdata structure.

2017-07-12 Thread Bodireddy, Bhanuprakash
>On Sun, Jul 09, 2017 at 10:53:52PM +, Darrell Ball wrote:
>> I went thru. this patch and see the merits of the objective.
>> I also did various testing with it.
>> I had one comment inline.

Thanks, Darrell, for testing this patch. I will factor in your comment in the
next version of the patch.

>>
>> However, I would feel more comfortable if Ben possibly could take a look as
>well.
>
>Thanks a lot, I did have some comments, so I followed up directly to the patch.

I sent out a possible fix in the other mail, disabling the C4200 warnings. I
have limited knowledge of compilers and unfortunately don't have MSVC
installed to test my incremental changes.

Please review it and let me know if it fixes the issue.

Note: Typo in subject, pkt_metdata  should be changed to 'pkt_metadata'.

- Bhanuprakash.


Re: [ovs-dev] [PATCH] packets: Reorganize the pkt_metdata structure.

2017-07-12 Thread Bodireddy, Bhanuprakash
Hi Ben, 

>On Thu, Jun 22, 2017 at 10:10:49PM +0100, Bhanuprakash Bodireddy wrote:
>> pkt_metadata_init() is called for every packet in userspace datapath
>> and initializes few members in pkt_metadata. Before this the members
>> that needs to be initialized are prefetched using
>pkt_metadata_prefetch_init().
>>
>> The above functions are critical to the userspace datapath performance
>> and should be in sync. Any changes to the pkt_metadata should also
>> include changes to metadata_init() and prefetch_init() if necessary.
>>
>> This commit slightly refactors the pkt_metadata structure and
>> introduces cache line markers to catch any violations to the
>> structure. Also only prefetch the cachelines having the members that needs
>to be zeroed out.
>>
>> Signed-off-by: Bhanuprakash Bodireddy
>> 
>
>OVS has a PADDED_MEMBERS macro that makes this easier.  Unfortunately it
>isn't currently adequate for use more than once per struct.  But it's fixable, 
>so I
>sent a patch to fix it:
>https://mail.openvswitch.org/pipermail/ovs-dev/2017-July/335204.html

Thanks for this patch. PADDED_MEMBERS macro is pretty handy. 

>and a fixed-up version of the original patch:
>https://mail.openvswitch.org/pipermail/ovs-dev/2017-July/335205.html

Thanks  for improving the patch. 

>
>However, even with the fix, this is going to cause problems with MSVC,
>because it does not allow 0-length arrays.  Maybe you can find another way to
>mark the beginning of a cache line.

Microsoft's documentation says that the "zero-sized array in struct/union"
warning we encountered is a Level-4 warning, numbered C4200.

C4200: https://msdn.microsoft.com/en-us/library/79wf64bc.aspx

I can't think of alternate ways to mark the cachelines, so how about this 
incremental change in lib/util.h
that disables the warning when building with MSVC?

I am currently on vacation and have limited access to the lab to download and
test the changes with Microsoft compilers. Apologies for this.

--8<--cut here-->8--

#define CACHE_LINE_SIZE 64
BUILD_ASSERT_DECL(IS_POW2(CACHE_LINE_SIZE));
#ifndef _MSC_VER
typedef void *OVS_CACHE_LINE_MARKER[0];
#else
__pragma (warning(push))
__pragma (warning(disable:4200))
typedef void *OVS_CACHE_LINE_MARKER[0];
__pragma (warning(pop))
#endif
 
- Bhanuprakash.


Re: [ovs-dev] [PATCH] packets: Do not initialize ct_orig_tuple.

2017-07-12 Thread Bodireddy, Bhanuprakash
Thanks, Darrell, for testing this patch. I will send out v2 with your
suggestions addressed.

- Bhanuprakash.

>-Original Message-
>From: Darrell Ball [mailto:db...@vmware.com]
>Sent: Sunday, July 9, 2017 9:44 PM
>To: Bodireddy, Bhanuprakash <bhanuprakash.bodire...@intel.com>;
>d...@openvswitch.org
>Subject: Re: [ovs-dev] [PATCH] packets: Do not initialize ct_orig_tuple.
>
>I tested and found similar ballpark performance increase (approx. 10%) in the
>most simple case with decreasing benefit as the pipeline gets more realistic
>and useful.
>
>This ideal case incremental beyond the original fix patch (from Daniele) shows
>a small decrease in performance of approx 100k pps (0.7 %). I cannot explain
>that right now.
>However, I think there is advantage in clearly defining what needs 
>initialization
>and was does not and the additional incremental by Bhanu does that.
>In realistic cases, I expect the 0.7 % difference, if it is real, to be much 
>less
>anyways.
>
>Hence, this patch looks fine except for some comments inline.
>
>
>
>On 6/22/17, 2:08 PM, "ovs-dev-boun...@openvswitch.org on behalf of
>Bhanuprakash Bodireddy" <ovs-dev-boun...@openvswitch.org on behalf of
>bhanuprakash.bodire...@intel.com> wrote:
>
>Commit "odp: Support conntrack orig tuple key." introduced new fields
>in struct 'pkt_metadata'.  pkt_metadata_init() is called for every
>packet in the userspace datapath.  When testing a simple single
>flow case with DPDK, we observe a lower throughput after the above
>commit (it was 14.88 Mpps before, it is 13 Mpps after).
>
>This patch skips initializing ct_orig_tuple in pkt_metadata_init().
>It should be enough to initialize ct_state, because nobody should look
>at ct_orig_tuple unless ct_state is != 0.
>
>It's discussed at:
>https://mail.openvswitch.org/pipermail/ovs-dev/2017-May/332419.html
>
>Fixes: daf4d3c18da4("odp: Support conntrack orig tuple key.")
>Signed-off-by: Daniele Di Proietto <diproiet...@vmware.com>
>Signed-off-by: Bhanuprakash Bodireddy
><bhanuprakash.bodire...@intel.com>
>Co-authored-by: Bhanuprakash Bodireddy
><bhanuprakash.bodire...@intel.com>
>---
>Original RFC was posted by Daniele here:
>https://mail.openvswitch.org/pipermail/ovs-dev/2017-March/329679.html
>
>In this patch moved the offset from ct_orig_tuple to 'ct_orig_tuple_ipv6'.
>This patch fixes the performance drop(~2.3Mpps for P2P - 64 byte pkts)
>with OvS-DPDK on Master.
>
> lib/packets.h | 11 ++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
>diff --git a/lib/packets.h b/lib/packets.h
>index a9d5e84..94c3dcc 100644
>--- a/lib/packets.h
>+++ b/lib/packets.h
>@@ -126,10 +126,19 @@ pkt_metadata_init_tnl(struct pkt_metadata *md)
> static inline void
> pkt_metadata_init(struct pkt_metadata *md, odp_port_t port)
> {
>+/* This is called for every packet in userspace datapath and affects
>+ * performance if all the metadata is initialized.
>
>I think you meant the sentence
>
> “Hence absolutely
>+ * necessary fields should be zeroed out.”
>
>to be something like
>
>“Hence, fields should only be zeroed out when necessary.”
>
>
>+ *
>+ * Initialize only the first 17 bytes of metadata (till ct_state).
>
>Do we really need to discuss “17 bytes” ?
>Can the sentence be:
>Initialize only till ct_state.
>
>+ * Once the ct_state is zeroed out rest of ct fields will not be 
> looked
>+ * at unless ct_state != 0.
>+ */
>+memset(md, 0, offsetof(struct pkt_metadata, ct_orig_tuple_ipv6));
>+
> /* It can be expensive to zero out all of the tunnel metadata. 
> However,
>  * we can just zero out ip_dst and the rest of the data will never be
>  * looked at. */
>-memset(md, 0, offsetof(struct pkt_metadata, in_port));
> md->tunnel.ip_dst = 0;
> md->tunnel.ipv6_dst = in6addr_any;
> md->in_port.odp_port = port;
>--
>2.4.11
>

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH 0/3] Output packet batching.

2017-07-04 Thread Bodireddy, Bhanuprakash
Apologies for snipping the text. I did it to keep this thread readable. 

>
>Hi Darrell and Jan.
>Thanks for looking at this. I agree with Darrell that mixing implementations on
>two different levels is a bad idea, but as I already wrote in reply to
>Bhanuprakash [2], there is no issues with implementing of output batching of
>more than one rx batch.
>
>[2] https://mail.openvswitch.org/pipermail/ovs-dev/2017-July/334808.html
>
>Look at the incremental below. This is how it may look like:
Hi Ilya,

I briefly went through the incremental patch and see that you introduced a
config parameter to tune the latency and built the logic around it. It may
work, but we are back to the same question.

Is the dpif layer the right place to do all of this? Shouldn't this logic be
part of the netdev layer, as tx batching rightly belongs there? If a specific
use case warrants tuning the queuing, flushing and latency parameters, let
this be done at the netdev layer by providing more configs with acceptable
defaults, and leave the dpif layer as simple as it is now.

You referred to performance issues with flushing triggered from a non-local
thread (on a different NUMA node). This may be because, in the lab, we
simulate these cases and saturate the 10G link; it may not be a very pressing
issue in real-world scenarios.

- Bhanuprakash.


Re: [ovs-dev] [PATCH 0/3] Output packet batching.

2017-07-03 Thread Bodireddy, Bhanuprakash
>This patch-set inspired by [1] from Bhanuprakash Bodireddy.
>Implementation of [1] looks very complex and introduces many pitfalls for
>later code modifications like possible packet stucks.
>
>This version targeted to make simple and flexible output packet batching on
>higher level without introducing and even simplifying netdev layer.

I didn't test the patches yet. In this series the batching is done at the
dpif layer, whereas in [1] it's in the netdev layer. In [1], batching was
implemented by introducing an intermediate queue in the netdev layer and had
some added complexity due to XPS.

However, I think [1] is more flexible and can be easily tweaked to suit
different use cases, with some of its APIs potentially consumed by future
implementations.

1. Why [1] is flexible?

The PMD thread polls the rxq[s] mapped to it and, after classification,
transmits the packets on the tx ports. For optimum performance, we need to
queue and burst the maximum number of packets to mitigate the transmission
(MMIO) cost. As it is now (on master), we end up transmitting fewer packets
due to the current instant-send logic.

Bit of background on this patch series:
In *v1* of [1], we added an intermediate queue and tried flushing the packets
once every 1024 PMD polling cycles, as below.

--pmd_thread_main()--
pmd_thread_main(void *f_) {
...
    for (i = 0; i < poll_cnt; i++) {
        dp_netdev_process_rxq_port(pmd, poll_list[i].rx,
                                   poll_list[i].port_no);
    }

    if (lc++ > 1024) {
        if ((now - prev) > DRAIN_TSC) {
            HMAP_FOR_EACH (tx_port, node, &pmd->send_port_cache) {
                dp_netdev_flush_txq_port(pmd, tx_port->port, now);
            }
        }
    }
...
}
---

Pros:
- The idea behind bursting them once (lc > 1024 cycles), instead of per rxq
port processing, was to queue more packets in the respective port txq[s] and
burst them to greatly improve throughput.

Cons:
- Increases latency, as flushing happens once every 1024 polling cycles.

Minimizing latency:
   To minimize latency, 'INTERIM_QUEUE_BURST_THRESHOLD' was introduced, which
   can be tuned based on the use case (throughput hungry vs latency
   sensitive). In fact, we also published numbers with BURST_THRESHOLD set to
   16 and flushing triggered every 1024 polling cycles. This was done to
   allow users to tune the thresholds for their respective use cases.

However, in *v3* of [1] the flushing was performed per rxq processing to get
the patches accepted, as latency was raised as the primary concern.

2. Why flush APIs in netdev layer?

The flush APIs were introduced for DPDK and vHost User ports as they can be
consumed in the future by other implementations (e.g. QoS priority queues).

Also, the current queueing and flushing logic can be abstracted using the
DPDK APIs rte_eth_tx_buffer() and rte_eth_tx_buffer_flush(), further
simplifying the code a lot.

We were targeting some optimizations in the future, like using the rte
functions for buffering and flushing, and further introducing timer-based
flushing logic that invokes flushing in a timely manner instead of per rxq
port, thereby striking a balance between throughput and latency.

>
>Patch set consists of 3 patches. All the functionality introduced in the first
>patch. Two others are just cleanups of netdevs to not do unnecessary things.

The cleanup of code in netdevs is good.

- Bhanuprakash.



Re: [ovs-dev] [PATCH v2] process: Consolidate process related APIs.

2017-06-29 Thread Bodireddy, Bhanuprakash
Hello Ben,

Can you please check if the changes look good and apply this patch to the 
master.  
The Keepalive patch series has a dependency on this patch (using few of the 
below APIs). I can send the first non-RFC version of KA patch series only if 
this patch is in.

Regards,
Bhanuprakash.

>-Original Message-
>From: Bodireddy, Bhanuprakash
>Sent: Tuesday, June 20, 2017 10:30 AM
>To: d...@openvswitch.org
>Cc: Bodireddy, Bhanuprakash <bhanuprakash.bodire...@intel.com>; Ben
>Pfaff <b...@ovn.org>
>Subject: [PATCH v2] process: Consolidate process related APIs.
>
>As part of retrieving system statistics, process status APIs along with helper
>functions were implemented. Some of them are very generic and can be
>reused by other subsystems.
>
>Move the APIs in system-stats.c to process.c and util.c and make them
>available. This patch doesn't change any functionality.
>
>CC: Ben Pfaff <b...@ovn.org>
>Signed-off-by: Bhanuprakash Bodireddy
><bhanuprakash.bodire...@intel.com>
>---
>v1->v2
>  * Move ticks_to_ms() from util.c to process.c
>  * Verify the changes and test it by enabling statistics using,
> $ovs-vsctl set Open_vSwitch . other_config:enable-statistics=true
>
> lib/process.c   | 189 
> lib/process.h   |  12 +++
> lib/util.c  |  68 +
> lib/util.h  |   3 +
> vswitchd/system-stats.c | 251 +---
> 5 files changed, 273 insertions(+), 250 deletions(-)
>
>diff --git a/lib/process.c b/lib/process.c index e9d0ba9..3e119b5 100644
>--- a/lib/process.c
>+++ b/lib/process.c
>@@ -33,6 +33,7 @@
> #include "poll-loop.h"
> #include "signals.h"
> #include "socket-util.h"
>+#include "timeval.h"
> #include "util.h"
> #include "openvswitch/vlog.h"
>
>@@ -40,6 +41,13 @@ VLOG_DEFINE_THIS_MODULE(process);
>
> COVERAGE_DEFINE(process_start);
>
>+#ifdef __linux__
>+#define LINUX 1
>+#include 
>+#else
>+#define LINUX 0
>+#endif
>+
> struct process {
> struct ovs_list node;
> char *name;
>@@ -50,6 +58,15 @@ struct process {
> int status;
> };
>
>+struct raw_process_info {
>+unsigned long int vsz;  /* Virtual size, in kB. */
>+unsigned long int rss;  /* Resident set size, in kB. */
>+long long int uptime;   /* ms since started. */
>+long long int cputime;  /* ms of CPU used during 'uptime'. */
>+pid_t ppid; /* Parent. */
>+char name[18];  /* Name (surrounded by parentheses). */
>+};
>+
> /* Pipe used to signal child termination. */
> static int fds[2];
>
>@@ -327,6 +344,178 @@ process_status(const struct process *p)
> return p->status;
> }
>
>+int
>+count_crashes(pid_t pid)
>+{
>+char file_name[128];
>+const char *paren;
>+char line[128];
>+int crashes = 0;
>+FILE *stream;
>+
>+ovs_assert(LINUX);
>+
>+sprintf(file_name, "/proc/%lu/cmdline", (unsigned long int) pid);
>+stream = fopen(file_name, "r");
>+if (!stream) {
>+VLOG_WARN_ONCE("%s: open failed (%s)", file_name,
>ovs_strerror(errno));
>+goto exit;
>+}
>+
>+if (!fgets(line, sizeof line, stream)) {
>+VLOG_WARN_ONCE("%s: read failed (%s)", file_name,
>+   feof(stream) ? "end of file" : ovs_strerror(errno));
>+goto exit_close;
>+}
>+
>+paren = strchr(line, '(');
>+if (paren) {
>+int x;
>+if (ovs_scan(paren + 1, "%d", &x)) {
>+crashes = x;
>+}
>+}
>+
>+exit_close:
>+fclose(stream);
>+exit:
>+return crashes;
>+}
>+
>+static unsigned long long int
>+ticks_to_ms(unsigned long long int ticks) {
>+ovs_assert(LINUX);
>+
>+#ifndef USER_HZ
>+#define USER_HZ 100
>+#endif
>+
>+#if USER_HZ == 100  /* Common case. */
>+return ticks * (1000 / USER_HZ);
>+#else  /* Alpha and some other architectures.  */
>+double factor = 1000.0 / USER_HZ;
>+return ticks * factor + 0.5;
>+#endif
>+}
>+
>+static bool
>+get_raw_process_info(pid_t pid, struct raw_process_info *raw) {
>+unsigned long long int vsize, rss, start_time, utime, stime;
>+long long int start_msec;
>+unsigned long ppid;
>+char file_name[128];
>+FILE *stream;
>+int n;
>+
>+ovs_assert(LINUX);
>+
>+sprintf(file_name, "/proc/%lu/stat", (unsigned long int) pid);
>+stream = fopen(file_name, "r");
>+if (!str

Re: [ovs-dev] [ovs-dev, 4/6] dpif-netdev: Flush the packets in intermediate queue.

2017-06-29 Thread Bodireddy, Bhanuprakash
>I found another issue while testing w.r.t vhostuser ports.
>
>When non-pmd thread tries to send packets to vHostUser port below is the
>call path:
> dp_execute_cb()
> netdev_dpdk_vhost_send()  [(may_steal == true) and (Pkt src type ==
>DPBUF_MALLOC)]
>dpdk_do_tx_copy()
> __netdev_dpdk_vhost_send()   [ Pkts are queues till the 
> cnt
>reaches 32]
>
>The packets aren't flushed and the ping can fail in this cases.  To verify if 
>it
>works,  I invoked 'vhost_tx_burst()'  in dpdk_do_tx_copy().
>But you mentioned this may pose a problem w.r.t flood traffic having higher
>priority in other mail.  What is the best place to flush the non-PMD thread
>queues?

The better solution may be to flush in dpif_netdev_run() after
dp_netdev_process_rxq_port() for non-PMD threads.

--dpif_netdev_run()--
for (i = 0; i < port->n_rxq; i++) {
dp_netdev_process_rxq_port(non_pmd, port->rxqs[i].rx,
   port->port_no);

dp_netdev_flush_txq_ports(non_pmd);
}

Bhanuprakash.

>
>>
>>>At least, you have to flush non-PMD threads too.
>>
>>In case of non-PMD threads we don't have to flush, as the packets aren't
>>queued but are bursted instantly. The call path on the transmit side is:
>>
>>Vswitchd thread:
>>
>>dp_execute_cb()
>>  netdev_send()
>>netdev_dpdk_send__()
>>dpdk_do_tx_copy()
>>   netdev_dpdk_eth_tx_burst(). [ Burst 
>> packets immediately]
>>
>>- Bhanuprakash.
>>
>>>
>>>On 18.06.2017 22:56, Bhanuprakash Bodireddy wrote:
 Under low rate traffic conditions, there can be 2 issues.
   (1) Packets potentially can get stuck in the intermediate queue.
   (2) Latency of the packets can increase significantly due to
buffering in intermediate queue.

 This commit handles the (1) issue by flushing the tx port queues
 from PMD processing loop. Also this commit addresses issue (2) by
 flushing the tx queues after every rxq port processing. This reduces
 the latency without impacting the forwarding throughput.

MASTER
   
Pkt size  min(ns)   avg(ns)   max(ns)
 512  4,631  5,022309,914
1024  5,545  5,749104,294
1280  5,978  6,159 45,306
1518  6,419  6,774946,850

   MASTER + COMMIT
   -
Pkt size  min(ns)   avg(ns)   max(ns)
 512  4,711  5,064182,477
1024  5,601  5,888701,654
1280  6,018  6,491533,037
1518  6,467  6,734312,471

 PMDs can be teared down and spawned at runtime and so the rxq and
 txq mapping of the PMD threads can change. In few cases packets can
 get stuck in the queue due to reconfiguration and this commit helps
 flush the queues.

 Suggested-by: Eelco Chaudron 
 Reported-at:
 https://mail.openvswitch.org/pipermail/ovs-dev/2017-April/331039.html
 Signed-off-by: Bhanuprakash Bodireddy
 
 Signed-off-by: Antonio Fischetti 
 Co-authored-by: Antonio Fischetti 
 Signed-off-by: Markus Magnusson 
 Co-authored-by: Markus Magnusson
>
 Acked-by: Eelco Chaudron 
 ---
  lib/dpif-netdev.c | 5 +
  1 file changed, 5 insertions(+)

 diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
 d59208e..dfd88aa 100644
 --- a/lib/dpif-netdev.c
 +++ b/lib/dpif-netdev.c
 @@ -3761,6 +3761,8 @@ reload:
  for (i = 0; i < poll_cnt; i++) {
  dp_netdev_process_rxq_port(pmd, poll_list[i].rx,
 poll_list[i].port_no);
 +
 +dp_netdev_flush_txq_ports(pmd);
  }

  if (lc++ > 1024) {
 @@ -3781,6 +3783,9 @@ reload:
  }
  }

 +/* Flush the queues as part of reconfiguration logic. */
 +dp_netdev_flush_txq_ports(pmd);
 +
   poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
   exiting = latch_is_set(&pmd->exit_latch);
  /* Signal here to make sure the pmd finishes



Re: [ovs-dev] [PATCH 5/6] netdev-dpdk: Add netdev_dpdk_vhost_txq_flush function.

2017-06-28 Thread Bodireddy, Bhanuprakash
>
>On 27.06.2017 23:31, Bodireddy, Bhanuprakash wrote:
>>> On 26.06.2017 00:52, Bodireddy, Bhanuprakash wrote:
>>>>>> +
>>>>>> +/* Flush the txq if there are any packets available.
>>>>>> + * dynamic_txqs/concurrent_txq is disabled for vHost User ports
>>>>>> +as
>>>>>> + * 'OVS_VHOST_MAX_QUEUE_NUM' txqs are preallocated.
>>>>>> + */
>>>>>
>>>>> This comment is completely untrue. You may ignore 'concurrent_txq'
>>>>> because you *must* lock the queue in any case because of dynamic
>>>>> txq remapping inside netdev-dpdk. You must take the spinlock for
>>>>> the
>>>>> 'dev-
>>>>>> tx_q[qid % netdev->n_txq].map' before sending packets.
>>>>
>>>> Thanks for catching this and the lock should be taken before
>>>> flushing the
>>> queue. Below is how the new logic with spinlocks.
>>>>
>>>> /* Flush the txq if there are any packets available. */ static int
>>>> netdev_dpdk_vhost_txq_flush(struct netdev *netdev, int qid,
>>>> bool concurrent_txq OVS_UNUSED) {
>>>> struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>> struct dpdk_tx_queue *txq;
>>>>
>>>> qid = dev->tx_q[qid % netdev->n_txq].map;
>>>>
>>>> txq = &dev->tx_q[qid];
>>>> if (OVS_LIKELY(txq->vhost_pkt_cnt)) {
>>>> rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
>>>> netdev_dpdk_vhost_tx_burst(dev, qid);
>>>> rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);
>>>> }
>>>>
>>>> return 0;
>>>> }
>>>>
>>>>>
>>>>> In current implementation you're able to call send and flush
>>>>> simultaneously for the same queue from different threads because
>>>>> 'flush' doesn't care about queue remapping.
>>>>>
>>>>> See '__netdev_dpdk_vhost_send' and 'netdev_dpdk_remap_txqs' for
>>> detail.
>>>>>
>>>>> Additionally, flushing logic will be broken in case of txq
>>>>> remapping because you may have different underneath queue each
>time
>>>>> you trying to send of flush.
>>>>
>>>> I remember you raised this point earlier. To handle this case,
>'last_used_qid'
>>> was introduced in tx_port.
>>>> With this we can track any change in qid and make sure the packets
>>>> are
>>> flushed in the old queue.
>>>> This logic is in patch 3/6 of this series.
>>>
>>> I'm talking about txq remapping inside netdev-dpdk not about XPS.
>>> You're trying to flush the 'qid = tx_q[...].map' but the 'map'
>>> value can be changed at any time because of enabling/disabling vrings
>>> inside guest. Refer the 'vring_state_changed()' callback that
>>> triggers 'netdev_dpdk_remap_txqs' which I already mentioned.
>>> It's actually the reason why we're using unconditional locking for
>>> vhost-user ports ignoring 'concurrent_txq' value.
>>
>> I  spent some time looking at the above mentioned functions and see that
>'qid = tx_q[...].map' is updated by 'vhost_thread' as part of callback 
>function.
>This  gets triggered only when the queues are enabled/disabled in the guest.
>I was using 'ethtool -L' in the guest for testing this.
>> As you rightly mentioned the flushing logic breaks and the packets may
>potentially get stuck in the queue. During my testing I found this corner case
>that poses a problem with the patch  (steps in order).
>>
>> -  Multiqueue is enabled on the guest with 2 rx,tx queues.
>> - In PVP case, Traffic is sent to VM and the packets are sent on the 'queue 
>> 1'
>of the vHostUser port.
>> - PMD thread  keeps queueing packets on the vhostuser port queue.
>> [using netdev_dpdk_vhost_send()]
>> - Meanwhile user changes queue configuration using 'ethtool -L ens3
>combined 1', enables only one queue now and disabling the other.
>> - Callback functions gets triggered and disables the queue 1.
>> - 'tx_q[...].map' is updated to '0' [queue 0] - default queue.
>>
>> - Meantime PMD thread reads the map and finds that tx_q[..].map == '0'
>and flushes the packets in the queue0.
>> This is the problem as the packets enqueued earlier on queue1 were not
>flushed.
>>
>> How about the below fix?
>> - Before disabling the queues (qid = OVS_VHOST_QUEUE_DISABLED), flush
>the pa

Re: [ovs-dev] [ovs-dev, 3/6] netdev-dpdk: Add intermediate queue support.

2017-06-28 Thread Bodireddy, Bhanuprakash
>At first, this patch should be applied after the patch with flushing on
>reconfiguration because we must not reconfigure ports while there are
>unsent packets in the intermediate queue.
>Otherwise we may destroy the memory pool which contains that packets and
>will try to send them after that. This may lead to serious problems.

This is a good point. Will handle this appropriately in next version.

>
>Second thing is that you should also modify 'dpdk_do_tx_copy'
>function, otherwise where will be reordering issues and flood traffic will have
>accidentally higher priority because not buffered.

I presume you are referring to netdev_dpdk_eth_tx_burst(), where we burst the
packets. You think we should queue the packets here and flush them?

- Bhanuprakash.

>On 18.06.2017 22:56, Bhanuprakash Bodireddy wrote:
>> This commit introduces netdev_dpdk_eth_tx_queue() function that
>> implements intermediate queue and packet buffering. The packets get
>> buffered till the threshold 'INTERIM_QUEUE_BURST_THRESHOLD[32] is
>> reached and eventually gets transmitted.
>>
>> To handle the case(eg: ping) where packets are sent at low rate and
>> can potentially get stuck in the queue, flush logic is implemented
>> that gets invoked from dp_netdev_flush_txq_ports() as part of PMD
>> packet processing loop.
>>
>> Signed-off-by: Bhanuprakash Bodireddy
>> 
>> Signed-off-by: Antonio Fischetti 
>> Co-authored-by: Antonio Fischetti 
>> Signed-off-by: Markus Magnusson 
>> Co-authored-by: Markus Magnusson 
>> Acked-by: Eelco Chaudron 
>> ---
>>  lib/dpif-netdev.c | 44
>+++-
>>  lib/netdev-dpdk.c | 35 ++-
>>  2 files changed, 77 insertions(+), 2 deletions(-)
>>
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
>> 2b65dc7..d59208e 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -332,6 +332,7 @@ enum pmd_cycles_counter_type {  };
>>
>>  #define XPS_TIMEOUT_MS 500LL
>> +#define LAST_USED_QID_NONE -1
>>
>>  /* Contained by struct dp_netdev_port's 'rxqs' member.  */  struct
>> dp_netdev_rxq { @@ -492,7 +493,13 @@ struct rxq_poll {  struct tx_port
>> {
>>  struct dp_netdev_port *port;
>>  int qid;
>> -long long last_used;
>> +int last_used_qid;/* Last queue id where packets got
>> + enqueued. */
>> +long long last_used;  /* In case XPS is enabled, it contains the
>> +   * timestamp of the last time the port was
>> +   * used by the thread to send data.  After
>> +   * XPS_TIMEOUT_MS elapses the qid will be
>> +   * marked as -1. */
>>  struct hmap_node node;
>>  };
>>
>> @@ -3081,6 +3088,25 @@ cycles_count_end(struct
>dp_netdev_pmd_thread
>> *pmd,  }
>>
>>  static void
>> +dp_netdev_flush_txq_ports(struct dp_netdev_pmd_thread *pmd) {
>> +struct tx_port *cached_tx_port;
>> +int tx_qid;
>> +
>> +HMAP_FOR_EACH (cached_tx_port, node, &pmd->send_port_cache) {
>> +tx_qid = cached_tx_port->last_used_qid;
>> +
>> +if (tx_qid != LAST_USED_QID_NONE) {
>> +netdev_txq_flush(cached_tx_port->port->netdev, tx_qid,
>> + cached_tx_port->port->dynamic_txqs);
>> +
>> +/* Queue flushed and mark it empty. */
>> +cached_tx_port->last_used_qid = LAST_USED_QID_NONE;
>> +}
>> +}
>> +}
>> +
>> +static void
>>  dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
>> struct netdev_rxq *rx,
>> odp_port_t port_no) @@ -4356,6 +4382,7 @@
>> dp_netdev_add_port_tx_to_pmd(struct dp_netdev_pmd_thread *pmd,
>>
>>  tx->port = port;
>>  tx->qid = -1;
>> +tx->last_used_qid = LAST_USED_QID_NONE;
>>
>>  hmap_insert(&pmd->tx_ports, &tx->node, hash_port_no(tx->port->port_no));
>>  pmd->need_reload = true;
>> @@ -4926,6 +4953,14 @@ dpif_netdev_xps_get_tx_qid(const struct
>> dp_netdev_pmd_thread *pmd,
>>
>>  dpif_netdev_xps_revalidate_pmd(pmd, now, false);
>>
>> +/* The tx queue can change in XPS case, make sure packets in previous
>> + * queue is flushed properly. */
>> +if (tx->last_used_qid != LAST_USED_QID_NONE &&
>> +   tx->qid != tx->last_used_qid) {
>> +netdev_txq_flush(port->netdev, tx->last_used_qid, port->dynamic_txqs);
>> +tx->last_used_qid = LAST_USED_QID_NONE;
>> +}
>> +
>>  VLOG_DBG("Core %d: New TX queue ID %d for port \'%s\'.",
>>   pmd->core_id, tx->qid, netdev_get_name(tx->port->netdev));
>>  return min_qid;
>> @@ -5021,6 +5056,13 @@ dp_execute_cb(void *aux_, struct
>dp_packet_batch *packets_,
>>  tx_qid = pmd->static_tx_qid;
>>  }
>>
>> +/* 

Re: [ovs-dev] [ovs-dev, 4/6] dpif-netdev: Flush the packets in intermediate queue.

2017-06-28 Thread Bodireddy, Bhanuprakash

>At least, you have to flush non-PMD threads too.

In case of non-PMD threads we don't have to flush, as the packets aren't
queued but are bursted instantly. The call path on the transmit side is:

Vswitchd thread:

dp_execute_cb()
  netdev_send()   
netdev_dpdk_send__() 
dpdk_do_tx_copy() 
   netdev_dpdk_eth_tx_burst(). [ Burst packets 
immediately]

- Bhanuprakash.

>
>On 18.06.2017 22:56, Bhanuprakash Bodireddy wrote:
>> Under low rate traffic conditions, there can be 2 issues.
>>   (1) Packets potentially can get stuck in the intermediate queue.
>>   (2) Latency of the packets can increase significantly due to
>>buffering in intermediate queue.
>>
>> This commit handles the (1) issue by flushing the tx port queues from
>> PMD processing loop. Also this commit addresses issue (2) by flushing
>> the tx queues after every rxq port processing. This reduces the
>> latency without impacting the forwarding throughput.
>>
>>MASTER
>>   
>>Pkt size  min(ns)   avg(ns)   max(ns)
>> 512  4,631  5,022309,914
>>1024  5,545  5,749104,294
>>1280  5,978  6,159 45,306
>>1518  6,419  6,774946,850
>>
>>   MASTER + COMMIT
>>   -
>>Pkt size  min(ns)   avg(ns)   max(ns)
>> 512  4,711  5,064182,477
>>1024  5,601  5,888701,654
>>1280  6,018  6,491533,037
>>1518  6,467  6,734312,471
>>
>> PMDs can be teared down and spawned at runtime and so the rxq and txq
>> mapping of the PMD threads can change. In few cases packets can get
>> stuck in the queue due to reconfiguration and this commit helps flush
>> the queues.
>>
>> Suggested-by: Eelco Chaudron 
>> Reported-at:
>> https://mail.openvswitch.org/pipermail/ovs-dev/2017-April/331039.html
>> Signed-off-by: Bhanuprakash Bodireddy
>> 
>> Signed-off-by: Antonio Fischetti 
>> Co-authored-by: Antonio Fischetti 
>> Signed-off-by: Markus Magnusson 
>> Co-authored-by: Markus Magnusson 
>> Acked-by: Eelco Chaudron 
>> ---
>>  lib/dpif-netdev.c | 5 +
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
>> d59208e..dfd88aa 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -3761,6 +3761,8 @@ reload:
>>  for (i = 0; i < poll_cnt; i++) {
>>  dp_netdev_process_rxq_port(pmd, poll_list[i].rx,
>> poll_list[i].port_no);
>> +
>> +dp_netdev_flush_txq_ports(pmd);
>>  }
>>
>>  if (lc++ > 1024) {
>> @@ -3781,6 +3783,9 @@ reload:
>>  }
>>  }
>>
>> +/* Flush the queues as part of reconfiguration logic. */
>> +dp_netdev_flush_txq_ports(pmd);
>> +
>>  poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
>>  exiting = latch_is_set(&pmd->exit_latch);
>>  /* Signal here to make sure the pmd finishes
>>


Re: [ovs-dev] [PATCH 5/6] netdev-dpdk: Add netdev_dpdk_vhost_txq_flush function.

2017-06-27 Thread Bodireddy, Bhanuprakash
>On 26.06.2017 00:52, Bodireddy, Bhanuprakash wrote:
>>>> +
>>>> +/* Flush the txq if there are any packets available.
>>>> + * dynamic_txqs/concurrent_txq is disabled for vHost User ports as
>>>> + * 'OVS_VHOST_MAX_QUEUE_NUM' txqs are preallocated.
>>>> + */
>>>
>>> This comment is completely untrue. You may ignore 'concurrent_txq'
>>> because you *must* lock the queue in any case because of dynamic txq
>>> remapping inside netdev-dpdk. You must take the spinlock for the
>>> 'dev-
>>>> tx_q[qid % netdev->n_txq].map' before sending packets.
>>
>> Thanks for catching this and the lock should be taken before flushing the
>queue. Below is how the new logic with spinlocks.
>>
>> /* Flush the txq if there are any packets available. */ static int
>> netdev_dpdk_vhost_txq_flush(struct netdev *netdev, int qid,
>> bool concurrent_txq OVS_UNUSED) {
>> struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>> struct dpdk_tx_queue *txq;
>>
>> qid = dev->tx_q[qid % netdev->n_txq].map;
>>
>> txq = &dev->tx_q[qid];
>> if (OVS_LIKELY(txq->vhost_pkt_cnt)) {
>> rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
>> netdev_dpdk_vhost_tx_burst(dev, qid);
>> rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);
>> }
>>
>> return 0;
>> }
>>
>>>
>>> In current implementation you're able to call send and flush
>>> simultaneously for the same queue from different threads because
>>> 'flush' doesn't care about queue remapping.
>>>
>>> See '__netdev_dpdk_vhost_send' and 'netdev_dpdk_remap_txqs' for
>detail.
>>>
>>> Additionally, flushing logic will be broken in case of txq remapping
>>> because you may have different underneath queue each time you trying
>>> to send of flush.
>>
>> I remember you raised this point earlier. To handle this case,  
>> 'last_used_qid'
>was introduced in tx_port.
>> With this we can track any change in qid and make sure the packets are
>flushed in the old queue.
>> This logic is in patch 3/6 of this series.
>
>I'm talking about txq remapping inside netdev-dpdk not about XPS.
>You're trying to flush the 'qid = tx_q[...].map' but the 'map'
>value can be changed at any time because of enabling/disabling vrings inside
>guest. Refer the 'vring_state_changed()' callback that triggers
>'netdev_dpdk_remap_txqs' which I already mentioned.
>It's actually the reason why we're using unconditional locking for vhost-user
>ports ignoring 'concurrent_txq' value.

I spent some time looking at the above mentioned functions and see that
'qid = tx_q[...].map' is updated by 'vhost_thread' as part of the callback
function. This gets triggered only when the queues are enabled/disabled in
the guest. I was using 'ethtool -L' in the guest for testing this.
As you rightly mentioned, the flushing logic breaks and the packets may
potentially get stuck in the queue. During my testing I found this corner
case that poses a problem with the patch (steps in order).

-  Multiqueue is enabled on the guest with 2 rx,tx queues.
- In PVP case, Traffic is sent to VM and the packets are sent on the 'queue 1' 
of the vHostUser port.
- PMD thread  keeps queueing packets on the vhostuser port queue. [using 
netdev_dpdk_vhost_send()]
- Meanwhile user changes queue configuration using 'ethtool -L ens3 combined 
1', enables only one queue now and disabling the other.
- Callback functions gets triggered and disables the queue 1.
- 'tx_q[...].map' is updated to '0' [queue 0] - default queue.

- Meantime PMD thread reads the map and finds that tx_q[..].map == '0' and 
flushes the packets in the queue0. 
This is the problem as the packets enqueued earlier on queue1 were not flushed.

How about the below fix?
- Before disabling the queues (qid = OVS_VHOST_QUEUE_DISABLED), flush the 
packets in the queue from vring_state_changed().

--vring_state_changed()
if (strncmp(ifname, dev->vhost_id, IF_NAME_SZ) == 0) {
if (enable) {
dev->tx_q[qid].map = qid;
} else {
/* If the queue is disabled in the guest, the corresponding qid
 * map should be set to OVS_VHOST_QUEUE_DISABLED(-2).
 *
 * The packets that were queued in 'qid' can be potentially
 * stuck and should be flushed before it is disabled.
 */
netdev_dpdk_vhost_txq_flush(&dev->up, dev->tx_q[qid].map, 0);
dev->tx_q[qid].map = OVS_VHOST_QUEUE_DISABLED;
   

Re: [ovs-dev] [PATCH v10] netdev-dpdk: Increase pmd thread priority.

2017-06-26 Thread Bodireddy, Bhanuprakash
>With this change and CFS in effect, it effectively means that the dpdk control
>threads need to be on different cores than the PMD threads or the response
>latency may be too long for their control work ?
>Have we tested having the control threads on the same cpu with -20 nice for
>the pmd thread ?

Yes, I did some testing, and had a reason to add the comment recommending
that dpdk-lcore-mask and pmd-cpu-mask be non-overlapping.
The testing was done with a simple script that adds and deletes 750 vHost
User ports (script copied below). The time statistics are captured below.

  dpdk-lcore-mask | PMD thread | PMD nice | Time statistics
  unspecified     | Core 3     | -20      | real 1m5.610s  / user 0m0.706s / sys 0m0.023s [with patch]
  Core 3          | Core 3     | -20      | real 2m14.089s / user 0m0.717s / sys 0m0.017s [with patch]
  unspecified     | Core 3     | 0        | real 1m5.209s  / user 0m0.711s / sys 0m0.020s [master]
  Core 3          | Core 3     | 0        | real 1m7.209s  / user 0m0.711s / sys 0m0.020s [master]

In all cases, if dpdk-lcore-mask is unspecified, the main thread floats
between the available cores (0-27 in my case).

With this patch (PMD nice value at -20), and with the main and pmd threads
pinned to core 3, port addition and deletion took twice the time. However,
the most important thing to notice is that with active traffic and with port
addition/deletion in progress, throughput drops instantly *without* the
patch. In that case the vswitchd thread consumes 7% of the CPU time at one
stage, thereby impacting the forwarding performance.

With the patch the throughput is still affected, but the drop happens
gradually. In this case the vswitchd thread was consuming no more than 2% of
the CPU time, and so port addition/deletion took longer.

>
>I see the comment is added below
>+It is recommended that the OVS control thread and pmd thread shouldn't
>be
>+pinned to the same core i.e 'dpdk-lcore-mask' and 'pmd-cpu-mask' cpu
>mask
>+settings should be non-overlapping.
>
>
>I understand that other heavy threads would be a problem for PMD threads
>and we want to effectively encourage these to be on different cores in the
>situation where we are using a pmd-cpu-mask.
>However, here we are almost shutting down other threads by default on the
>same core as PMDs threads using -20 nice, even those with little cpu load but
>just needing a reasonable latency.

I had the logic of completely shutting down other threads in early versions
of this patch by assigning real-time priority to the PMD thread. But that
seemed too dangerous, and changing the nice value is a safer bet. I agree
that latency can go up for non-pmd threads with this patch, but it's the
same problem as with other kernel threads that run at -20 nice value, and
some with 'rt' priority.

>
>Will this aggravate the argument from some quarters that using dpdk requires
>too much cpu reservation ?
At least for the PMD threads, which are the heart of packet processing in OvS-DPDK. 


More information on commands:

script to test the port addition and deletion.

$cat port_test.sh
   cmds=; for i in {1..750}; do cmds+=" -- add-port br0 dpdkvhostuser$i -- set 
Interface dpdkvhostuser$i type=dpdkvhostuser"; done
   ovs-vsctl $cmds

   sleep 1;

   cmds=; for i in {1..750}; do cmds+=" -- del-port br0 dpdkvhostuser$i"; done
   ovs-vsctl $cmds

$ time ./port_test.sh

dpdk-lcore-mask and pmd-cpu-mask explicitly set to CORE 3.
---
$ ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=8
$ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=8
$ ps -eLo tid,psr,comm | grep -e revalidator -e handler -e ovs -e pmd -e urc -e 
eal
   110881  20 ovsdb-server
   110892   3 ovs-vswitchd
   110976   3 pmd61
   110898   3 eal-intr-thread
   110903   3 urcu3
   110947   3 handler60

Dpdk-lcore-mask unspecified, pmd-cpu-mask explicitly set to CORE 3.
-
$ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=8
$  ps -eLo tid,psr,comm | grep -e revalidator -e handler -e ovs -e pmd -e urc 
-e eal
111474  14 ovsdb-server
111483   6 ovs-vswitchd
111566   3 pmd61
111564  10 revalidator60
111489   0 eal-intr-thread
111493   8 urcu3

Regards,
Bhanuprakash.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH 6/6] netdev: Fix null pointer dereference reported by clang.

2017-06-25 Thread Bodireddy, Bhanuprakash
Hi Mark,
>>
>>Clang reports that array access from 'dumps' variable result in null
>>pointer dereference.
>>
>>Signed-off-by: Bhanuprakash Bodireddy
>>
>
>Hi Bhanu,
>
>LGTM - I also compiled this with gcc, clang, and sparse without issue.
>Checkpatch reports no obvious problems either.
>
>Acked-by: Mark Kavanagh 
>
>One thing - what version of clang are you using? My version (3.4.2) didn't
>detect any of the issues in this patchset. Alternatively, are there additional
>flags that you use when compiling with clang?

My clang version is 3.5.0. I was running clang static analyzer on my Keepalive 
branch to detect memory leaks and dead code and that's when I found these 
issues.

- Bhanuprakash.
 



Re: [ovs-dev] [PATCH 5/6] netdev-dpdk: Add netdev_dpdk_vhost_txq_flush function.

2017-06-25 Thread Bodireddy, Bhanuprakash
>> +
>> +/* Flush the txq if there are any packets available.
>> + * dynamic_txqs/concurrent_txq is disabled for vHost User ports as
>> + * 'OVS_VHOST_MAX_QUEUE_NUM' txqs are preallocated.
>> + */
>
>This comment is completely untrue. You may ignore 'concurrent_txq'
>because you *must* lock the queue in any case because of dynamic txq
>remapping inside netdev-dpdk. You must take the spinlock for the
>'dev->tx_q[qid % netdev->n_txq].map' before sending packets.

Thanks for catching this and the lock should be taken before flushing the 
queue. Below is how the new logic with spinlocks. 

/* Flush the txq if there are any packets available. */
static int
netdev_dpdk_vhost_txq_flush(struct netdev *netdev, int qid,
                            bool concurrent_txq OVS_UNUSED)
{
    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
    struct dpdk_tx_queue *txq;

    qid = dev->tx_q[qid % netdev->n_txq].map;

    txq = &dev->tx_q[qid];
    if (OVS_LIKELY(txq->vhost_pkt_cnt)) {
        rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
        netdev_dpdk_vhost_tx_burst(dev, qid);
        rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);
    }

    return 0;
}

>
>In current implementation you're able to call send and flush simultaneously
>for the same queue from different threads because 'flush' doesn't care about
>queue remapping.
>
>See '__netdev_dpdk_vhost_send' and 'netdev_dpdk_remap_txqs' for detail.
>
>Additionally, flushing logic will be broken in case of txq remapping because
>you may have different underneath queue each time you trying to send of
>flush.

I remember you raised this point earlier. To handle this case, 'last_used_qid' 
was introduced in tx_port. With this we can track any change in qid and make 
sure that packets are flushed on the old queue. This logic is in patch 3/6 of 
this series.

>
>Have you ever tested this with multiqueue vhost?
>With disabling/enabling queues inside the guest?

I did basic sanity testing with vhost multiqueue to verify throughput and to 
check that the flush logic works at a low rate (1 packet at a time sent to 
each VM) across multiple VMs.
When you say queue 'enable/disable' in the guest, are you referring to using 
'ethtool -L <iface> combined <N>'?
If so, I did this by configuring 5 rxqs (DPDK and vhost User ports), changing 
the channel count, and verifying the flushing scenarios again with the testpmd 
app. I didn't test with kernel forwarding here.

Regards,
Bhanuprakash.


Re: [ovs-dev] [PATCH 7/8] netdev-dpdk: Configurable retries while enqueuing to vHost User ports.

2017-06-20 Thread Bodireddy, Bhanuprakash
>>On 06/07/2017 10:21 AM, Bhanuprakash Bodireddy wrote:
>>> This commit adds "vhost-enque-retry" where in the number of retries
>>> performed while enqueuing packets to vHostUser ports can be
>>> configured in ovsdb.
>>>
>>> Currently number of retries are set to '8' and a retry is performed
>>> when atleast some packets have been successfully sent on previous
>>attempt.
>>> While this approach works well, it causes throughput drop when
>>> multiple vHost User ports are servied by same PMD thread.
>>
>>Hi Bhanu,
>>
>>You are saying the approach works well but you are changing the default
>>behaviour. It would be good to explain a bit more about the negative
>>effects of changing the default and compare that against the positive
>>effects, so everyone gets a balanced view. If you have measurements
>>that would be even better.
>
>This issue was discussed earlier at different forums (OvS-DPDK day during
>2016 fall conference and community call) about the negative effect of retries
>on vHost User ports. Giving a bit of background for others interested in this
>problem:
>
>In OvS 2.5 Release:
>The retries on the vHost User ports were performed until a timeout(~100
>micro seconds)  is reached.
>The problem with that approach was If the guest is connected and isn't
>actively processing its queues, it could potentially impact the performance of
>neighboring guests (other vHost User ports) provided the same PMD thread is
>servicing them all.  It was reported by me and you indeed provided the fix in
>2.6
>
>In OvS 2.6 Release:
>Timeout logic is removed and retry logic is introduced. Here a maximum up to
>'8' retries can be performed provided atleast one packet is transmitted
>successfully in the previous attempt.
>
>Problem:
>Take the case where there are few VMs (with 3 vHost User ports each)
>serviced by same PMD thread. Some of the VMs are forwarding at high
>rates(using dpdk based app) and the remaining are slow VMs doing kernel
>forwarding in the guest. In this case the PMD would spend significant cycles
>for slower VMs and may end up doing maximum of 8 retries all the time.
>However, in some cases  doing a retry immediately isn't of much value as
>there may not be any free descriptors available.
>
>Also if there are more slow ports, the packets can potentially get tail dropped
>at the NIC as PMD is busy processing the packets and doing retries. I don't
>have numbers right now to back this problem but can do some tests next
>week to assess the impact with and without retries. Also adding jan here who
>wanted the retry logic to be configurable.

Hi Kevin,

I did some testing today with and without retries and found a small 
performance improvement with retries turned off.
My test bench is pretty basic and not tuned for performance:
 - 2 PMD threads
 - 4 VMs with kernel-based forwarding enabled in the guest
 - VMs running a 3.x kernel / QEMU 2.5 / mrg_rxbuf=off
 - 64-byte packets at line rate, with each VM receiving 25% of the traffic
   (3.7 Mpps)

With retries enabled the aggregate throughput stands at 2.39 Mpps in steady 
state, whereas with retries turned off it is 2.42 Mpps.

Regards,
Bhanuprakash.





Re: [ovs-dev] [PATCH 1/6] process: Consolidate process related APIs.

2017-06-20 Thread Bodireddy, Bhanuprakash
Hi Ben,

>On Mon, Jun 19, 2017 at 07:53:59PM +0100, Bhanuprakash Bodireddy wrote:
>> As part of retrieving system statistics, process status APIs along
>> with helper functions were implemented. Some of them are very generic
>> and can be reused by other subsystems.
>>
>> Move the APIs in system-stats.c to process.c and util.c and make them
>> available. This patch doesn't change any functionality.
>>
>> CC: Ben Pfaff 
>> Signed-off-by: Bhanuprakash Bodireddy
>> 
>
>Thanks for cleaning up the OVS internal APIs.
>
>Regarding most of these moves, I agree with them.  For ticks_to_ms(),
>though, I would prefer to keep it in the same place as get_process_info(), and
>as a static function, because I don't envision it being called from anywhere
>else.  Are you OK making that change?

Posted v2 with what you suggested here. (Moved ticks_to_ms() to process.c and 
made it static function.)
V2:  https://mail.openvswitch.org/pipermail/ovs-dev/2017-June/334362.html

Bhanuprakash.


Re: [ovs-dev] [RFC PATCH 00/21] Add OVS DPDK keep-alive functionality

2017-06-19 Thread Bodireddy, Bhanuprakash
Hi Aaron,

>>
>>I've been playing with this a little bit;  is it too late to consider tracking
>'threads'
>>instead of 'cores'?  I'm not sure what it means for a particular core
>>ID to be 'healthy' - but I know what 'pmd24' not responding means.
>
>That's an interesting input. It's not late and all suggestions are most 
>welcome.
>I will try doing this in the next series.

I reworked and sent out V3 patch series here: 
https://mail.openvswitch.org/pipermail/ovs-dev/2017-June/334229.html
In this series:
  - POSIX shared memory is removed.
  - The logic has been changed to track threads, as suggested in this thread.
    I have used hash maps for this.
  
>
>>
>>Additionally, I'd suggest keeping words like 'healthy', and 'unhealthy'
>>out of it.  I'd basically just have this keepalive report things on the
>>thread you
>>*know* - last time it poked your status register (and you can also
>>track things like cpu utilization, etc, if you'd like).  Then let your
>>higher level thing that reads ceilometer make those "healthy"
>>determinations.  After all, sometimes 0% utilization is "healthy," and
>>sometimes it isn't.
>
>This makes sense. Infact It was the case in the beginning where only the core
>status was reported.
> Only recently I added this Datapath status row with the overall status. I 
> shall
>remove this and leave it to external monitoring apps to parse the data and
>decide it.

I have also removed this logic, and now only the thread status is shown. It's 
now the job of the monitoring framework to read the thread status and 
determine the health of the compute node.

Bhanuprakash.


Re: [ovs-dev] [PATCH 0/6 V2] netdev-dpdk: Use intermediate queue during packet transmission.

2017-06-19 Thread Bodireddy, Bhanuprakash
>-Original Message-
>From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
>boun...@openvswitch.org] On Behalf Of Bhanuprakash Bodireddy
>Sent: Sunday, June 18, 2017 8:56 PM
>To: d...@openvswitch.org
>Subject: [ovs-dev] [PATCH 0/6 V2] netdev-dpdk: Use intermediate queue
>during packet transmission.
>
>After packet classification, packets are queued in to batches depending on the
>matching netdev flow. Thereafter each batch is processed to execute the
>related actions. This becomes particularly inefficient if there are few packets
>in each batch as rte_eth_tx_burst() incurs expensive MMIO writes.
>
>This patch series implements intermediate queue for DPDK and vHost User
>ports.
>Packets are queued and burst when the packet count exceeds threshold. Also
>drain logic is implemented to handle cases where packets can get stuck in the
>tx queues at low rate traffic conditions. Care has been taken to see that
>latency is well with in the acceptable limits. Testing shows significant
>performance gains with this implementation.
>
>This path series combines the earlier 2 patches posted below.
>  DPDK patch: https://mail.openvswitch.org/pipermail/ovs-dev/2017-
>April/331039.html
>  vHost User patch: https://mail.openvswitch.org/pipermail/ovs-dev/2017-
>May/332271.html
>
>Also this series proposes to disable the retries on vHost User ports and make
>it configurable via ovsdb.(controversial?)

Please ignore the lines above about the vHost User retries in this v2 series 
(a copy-paste leftover from v1).
I have removed the two vhost retry patches from this series, as they have 
nothing to do with the intermediate queue implementation; they will be sent 
separately.

Regards,
Bhanuprakash.



Re: [ovs-dev] [RFC PATCH 00/21] Add OVS DPDK keep-alive functionality

2017-06-14 Thread Bodireddy, Bhanuprakash
Hi Aaron,
>Hi Bhanu,
>
>I've been playing with this a little bit;  is it too late to consider tracking 
>'threads'
>instead of 'cores'?  I'm not sure what it means for a particular core ID to be
>'healthy' - but I know what 'pmd24' not responding means.

That's an interesting input. It's not late and all suggestions are most 
welcome. 
I will try doing this in the next series. 

>
>Additionally, I'd suggest keeping words like 'healthy', and 'unhealthy'
>out of it.  I'd basically just have this keepalive report things on the thread 
>you
>*know* - last time it poked your status register (and you can also track things
>like cpu utilization, etc, if you'd like).  Then let your higher level thing 
>that
>reads ceilometer make those "healthy"
>determinations.  After all, sometimes 0% utilization is "healthy," and
>sometimes it isn't.

This makes sense. In fact, it was the case in the beginning, when only the 
core status was reported.
Only recently I added this Datapath status row with the overall status. I 
shall remove this and leave it to external monitoring apps to parse the data 
and decide.

- Bhanuprakash.


Re: [ovs-dev] [PATCH 7/8] netdev-dpdk: Configurable retries while enqueuing to vHost User ports.

2017-06-13 Thread Bodireddy, Bhanuprakash
Hi Kevin,

>On 06/07/2017 10:21 AM, Bhanuprakash Bodireddy wrote:
>> This commit adds "vhost-enque-retry" where in the number of retries
>> performed while enqueuing packets to vHostUser ports can be configured
>> in ovsdb.
>>
>> Currently number of retries are set to '8' and a retry is performed
>> when atleast some packets have been successfully sent on previous
>attempt.
>> While this approach works well, it causes throughput drop when
>> multiple vHost User ports are servied by same PMD thread.
>
>Hi Bhanu,
>
>You are saying the approach works well but you are changing the default
>behaviour. It would be good to explain a bit more about the negative effects
>of changing the default and compare that against the positive effects, so
>everyone gets a balanced view. If you have measurements that would be
>even better.

This issue was discussed earlier at different forums (OvS-DPDK day during 2016 
fall conference and community call) about the negative effect of retries on 
vHost User ports. Giving a bit of background for others interested in this 
problem:

In the OvS 2.5 release: 
Retries on the vHost User ports were performed until a timeout (~100 
microseconds) was reached. 
The problem with that approach was that if the guest was connected but not 
actively processing its queues, it could potentially impact the performance of 
neighboring guests (other vHost User ports), provided the same PMD thread was 
servicing them all. I reported it, and you indeed provided the fix in 2.6.

In the OvS 2.6 release:
The timeout logic was removed and retry logic was introduced: a maximum of 8 
retries can be performed, provided at least one packet was transmitted 
successfully in the previous attempt.

Problem:
Take the case where a few VMs (with 3 vHost User ports each) are serviced by 
the same PMD thread. Some of the VMs are forwarding at high rates (using a 
DPDK-based app) and the remaining are slow VMs doing kernel forwarding in the 
guest. In this case the PMD would spend significant cycles on the slower VMs 
and may end up doing the maximum of 8 retries all the time. However, in some 
cases doing a retry immediately isn't of much value, as there may not be any 
free descriptors available.

Also, if there are more slow ports, packets can potentially get tail-dropped 
at the NIC while the PMD is busy processing packets and doing retries. I don't 
have numbers right now to back this up, but I can run some tests next week to 
assess the impact with and without retries. Also adding Jan here, who wanted 
the retry logic to be configurable.

Regards,
Bhanuprakash. 


Re: [ovs-dev] [PATCH 0/8] netdev-dpdk: Use intermediate queue during packet transmission.

2017-06-13 Thread Bodireddy, Bhanuprakash
Hi Eelco

>Hi Bhanu,
>
>Went over the full patch set, and the changes look good to me.
>All my previous concerns are addressed, and therefore I'm acking this series.

Thanks for reviewing the series and acking it.

>
>I do have one small remark regarding the dpdk_tx_queue struct, see
>individual patch email.

I agree with what you suggested.
I have to send out v2 anyways as Ben suggested to rename the API from 
netdev_txq_drain() to netdev_txq_flush(). I will factor in your suggestion in 
V2. 

>
>Here are some numbers with this patch on a none tuned system, single run.
>This just to make sure we still benefit with both patches applied.
>
>Throughput for PV scenario, with 64 byte packets
>
>Number
>flows   MASTER      With PATCH
>======  =========   ==========
>   10   4,531,424   7,884,607
>   32   3,137,300   6,367,643
>   50   2,552,725   6,649,985
>  100   2,473,835   5,876,677
>  500   2,308,840   5,265,986
> 1000   2,380,755   5,001,081
>
>
>Throughput for PVP scenario, with 64 byte packets
>
>Number
>flows   MASTER      With PATCH
>======  =========   ==========
>   10   2,309,254   3,800,747
>   32   1,626,380   3,324,561
>   50   1,538,879   3,092,792
>  100   1,429,028   2,887,488
>  500   1,271,773   2,537,624
> 1000   1,268,430   2,442,405
>
>Latency test
>
>  MASTER
>  ======
>  Pkt size  min(ns)  avg(ns)  max(ns)
>    512      9,947   12,381   264,131
>   1024      7,662    9,445   194,463
>   1280      7,790    9,115   196,059
>   1518      8,103    9,599   197,646
>
>  PATCH
>  =====
>  Pkt size  min(ns)  avg(ns)  max(ns)
>    512     10,195   12,551   199,699
>   1024      7,838    9,612   206,378
>   1280      8,151    9,575   187,848
>   1518      8,095    9,643   198,552
>
>
>Throughput for PP scenario, with 64 byte packets:
>
>Number
>flows   MASTER      With PATCH
>======  =========   ==========
>   10   7,430,616   8,853,037
>   32   4,770,190   6,774,006
>   50   4,736,259   7,336,776
>  100   4,699,237   6,146,151
>  500   3,870,019   5,242,781
> 1000   3,853,883   5,121,911
>
>
>Latency test
>
>  MASTER
>  ======
>  Pkt size  min(ns)  avg(ns)  max(ns)
>    512      4,887    5,596   165,246
>   1024      5,801    6,447   170,842
>   1280      6,355    7,056   159,056
>   1518      6,860    7,634   160,860
>
>  PATCH
>  =====
>  Pkt size  min(ns)  avg(ns)  max(ns)
>    512      4,783    5,521   158,134
>   1024      5,801    6,359   170,859
>   1280      6,315    6,878   150,301
>   1518      6,579    7,398   143,068
>
>
>Acked-by: Eelco Chaudron 

Thanks for your time in testing and sharing the numbers here.

Bhanuprakash.

