Re: [ovs-dev] [PATCH ovn v6 00/13] northd lflow incremental processing

2024-02-06 Thread Numan Siddique
On Tue, Feb 6, 2024 at 3:52 PM Han Zhou  wrote:
>
> On Mon, Feb 5, 2024 at 7:47 PM Numan Siddique  wrote:
> >
> > On Mon, Feb 5, 2024 at 9:41 PM Han Zhou  wrote:
> > >
> > > On Mon, Feb 5, 2024 at 4:12 PM Numan Siddique  wrote:
> > > >
> > > > On Mon, Feb 5, 2024 at 5:54 PM Han Zhou  wrote:
> > > > >
> > > > > On Mon, Feb 5, 2024 at 10:15 AM Ilya Maximets 
> > > wrote:
> > > > > >
> > > > > > On 2/5/24 15:45, Ilya Maximets wrote:
> > > > > > > On 2/5/24 11:34, Ilya Maximets wrote:
> > > > > > >> On 2/5/24 09:23, Dumitru Ceara wrote:
> > > > > > >>> On 2/5/24 08:13, Han Zhou wrote:
> > > > > >  On Sun, Feb 4, 2024 at 9:26 PM Numan Siddique  >
> > > wrote:
> > > > > > >
> > > > > > > On Sun, Feb 4, 2024 at 9:53 PM Han Zhou 
> wrote:
> > > > > > >>
> > > > > > >> On Sun, Feb 4, 2024 at 5:46 AM Ilya Maximets <
> > > i.maxim...@ovn.org>
> > > > > wrote:
> > > > > > >>>
> > > > > > >>
> > > > > > >>>  35 files changed, 9681 insertions(+), 4645
> deletions(-)
> > > > > > >>
> > > > > > >> I had another look at this series and acked the
> remaining
> > > > > >  patches.  I
> > > > > > >> just had some minor comments that can be easily fixed
> when
> > > > > >  applying
> > > > > > >> the
> > > > > > >> patches to the main branch.
> > > > > > >>
> > > > > > >> Thanks for all the work on this!  It was a very large
> > > change
> > > > > but
> > > > > >  it
> > > > > > >> improves northd performance significantly.  I just
> hope we
> > > > > don't
> > > > > > >> introduce too many bugs.  Hopefully the time we have
> until
> > > > > release
> > > > > > >> will
> > > > > > >> allow us to further test this change on the 24.03
> branch.
> > > > > > >>
> > > > > > >> Regards,
> > > > > > >> Dumitru
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Thanks a lot Dumitru and Han for the reviews and
> patience.
> > > > > > >
> > > > > > > I addressed the comments and applied the patches to
> main and
> > > > > also
> > > > > >  to
> > > > > >  branch-24.03.
> > > > > > >
> > > > > > > @Han - I know you wanted to take another look in to v6.
>  I
> > > > > didn't
> > > > > >  want
> > > > > > >> to
> > > > > >  delay further as branch-24.03 was created.  I'm more than
> > > happy
> > > > > to
> > > > > > >> submit
> > > > > >  follow up patches if you have any comments to address.
> > > Please
> > > > > let
> > > > > >  me
> > > > > > >> know.
> > > > > > >
> > > > > > 
> > > > > >  Hi Numan,
> > > > > > 
> > > > > >  I was writing the reply and saw your email just now.
> Thanks
> > > a lot
> > > > > >  for
> > > > > >  taking a huge effort to achieve the great optimization. I
> > > only
> > > > > left
> > > > > >  one
> > > > > >  comment on the implicit dependency left for the en_lrnat
> ->
> > > > > >  en_lflow.
> > > > > > >> Feel
> > > > > >  free to address it with a followup and no need to block
> the
> > > > > >  branching.
> > > > > > >> And
> > > > > >  take my Ack for the series with that addressed.
> > > > > > 
> > > > > >  Acked-by: Han Zhou 
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> Hi, Numan, Dumitru and Han.
> > > > > > >>>
> > > > > > >>> I see a huge negative performance impact, most likely from
> > > this
> > > > > set,
> > > > > >  on
> > > > > > >>> ovn-heater's cluster-density tests.  The memory
> consumption on
> > > > > northd
> > > > > > >>>
> > > > > > >>> Thanks for reporting this, Ilya!
> > > > > > >>>
> > > > > > >>> jumped about 4x and it constantly recomputes due to
> failures
> > > of
> > > > > >  port_group
> > > > > > >>> handler:
> > > > > > >>>
> > > > > > >>> 2024-02-03T11:09:12.441Z|01680|inc_proc_eng|INFO|node:
> lflow,
> > > > > >  recompute
> > > > > > >> (failed handler for input port_group) took 9762ms
> > > > > > >>> 2024-02-03T11:09:12.444Z|01681|timeval|WARN|Unreasonably
> long
> > > > > 9898ms
> > > > > >  poll
> > > > > > >> interval (5969ms user, 1786ms system)
> > > > > > >>> ...
> > > > > > >>> 2024-02-03T11:09:23.770Z|01690|inc_proc_eng|INFO|node:
> lflow,
> > > > > >  recompute
> > > > > > >> (failed handler for input port_group) took 9014ms
> > > > > > >>> 2024-02-03T11:09:23.773Z|01691|timeval|WARN|Unreasonably
> long
> > > > > 9118ms
> > > > > >  poll
> > > > > > >> interval (5376ms user, 1515ms system)
> > > > > > >>> ...
> > > > > > >>> 2024-02-03T11:09:36.692Z|01699|inc_proc_eng|INFO|node:
> lflow,
> > > > > >  recompute
> > > > > > >> (failed handler for input port_group) took 10695ms
> > > > > > >>> 2024-02-03T11:09:36.696Z|01700|timeval|WARN|Unreasonably
> long
> > > 

Re: [ovs-dev] [PATCH ovn v6 00/13] northd lflow incremental processing

2024-02-06 Thread Ilya Maximets
On 2/6/24 22:22, Ilya Maximets wrote:
>>>
>>> I did some testing with my patch and these are the findings
>>>
>>>                       | Avg. Poll Intervals | Total test time | northd RSS
>>> ----------------------+---------------------+-----------------+-----------
>>> Last week             |     1.5 seconds     |   1005 seconds  |   2.5 GB
>>> This week             |     6   seconds     |   2246 seconds  |   8.5 GB
>>> Ilya's Patch          |     2.5 seconds     |   1170 seconds  |   3.1 GB
>>> Numan's patch (run1)  |     1.6 seconds     |    992 seconds  |   2.43 GB
>>> Numan's patch (run2)  |     1.7 seconds     |   1022 seconds  |   2.43 GB
>>>
>>> Seems like removing dp ref cnt altogether is more efficient and very close
>>> to last week's results.
>>>
>>> I'll submit the patch tomorrow after some more testing.
>>>
>>> Feel free to provide any review comments or feedback if you have any.
>>>
>>> Thanks
>>> Numan
>>>
>
> Regarding the cluster density 500 node test with Ilya's fix, I think
> this seems OK to me since the test falls back to recompute a lot and
> there is some cost now with the lflow I-P patches.  This can be
> improved by adding port group I-P.  I can start looking into the
> port-group I-P if you think it would be worth it.  Let me know.
>

 Adding more I-P would be helpful, but it still doesn't explain the
 observation from Ilya regarding the 66% increase in recompute time and 25%
 increase in memory. My test cases couldn't reproduce such significant
 increases (it is ~10-20% in my test). I think we'd better figure out what
 caused such big increases - is it still related to ref count or is it
 something else. I think the lflow_ref might have contributed to it,
 especially the use of hmap instead of list. It is totally fine if the
 lflow_ref adds some cost, which is required for the I-P to work, but it is
 better to understand where the cost comes from and justify it (or optimize
 it).
>>>
>>> Looks like from my patch results,  dp_ref_cnt is contributing to this 
>>> increase.
>>> Is it possible for you to trigger another test with my patch -
>>> https://github.com/numansiddique/ovn/commit/92f6563d9e7b1e6c8f2b924dea7a220781e10a05
>>>  
>>> 
>>>
>> It is great to see that removing the ref count resolved the performance gap.
>> The test result is actually very interesting. I also had a test with more
>> LBs (10k instead of the earlier 1k) in the large LBG, trying to make the
>> cost of the bitmap scan more obvious, but still, the difference between your
>> patch and Ilya's patch is not significant in my test. The latency difference
>> is within 5%, and the memory <1%. So I wonder why it is so different in the
>> ovn-heater test. Is it possible that the ovn-heater test creates a lot of
>> duplicated lflows, so ref count hmap operations are actually triggered a
>> significant number of times, which might have contributed to the major cost
>> even with Ilya's patch, and with your patch it falls back to recompute much
>> more often?
> 
> FWIW, in my testing with the cluster-density 500node database there
> are 12 M refcounts allocated.  With 32 bytes each, it's about 370 MB
> of RAM.  We should also add a fair share of allocation overhead since
> we're allocating a huge number of very small objects.
> 
> Numan also gets rid of all the space allocated for hash maps that hold
> all these refcounts, and these were taking ~25% of all allocations.
> 
> Getting rid of these allocations likely saves a lot of CPU cycles as
> well, since they are very heavy.
> 
> P.S.
> Almost all of these 12M refcounts are for:
>   build_lrouter_defrag_flows_for_lb():ovn_lflow_add_with_dp_group()
> All of these have refcount == 3.
> 
> There is one with refcount 502:
>   build_egress_delivery_flows_for_lrouter_port():ovn_lflow_add_default_drop()
> 
> And there are 500-ish with refcount of 2:
>   build_lswitch_rport_arp_req_flow():ovn_lflow_add_with_hint()

On the current main we allocate ~120 M refcounts.

> 
> Best regards, Ilya Maximets.



Re: [ovs-dev] [PATCH ovn v6 00/13] northd lflow incremental processing

2024-02-06 Thread Ilya Maximets
>>
>> I did some testing with my patch and these are the findings
>>
>>                       | Avg. Poll Intervals | Total test time | northd RSS
>> ----------------------+---------------------+-----------------+-----------
>> Last week             |     1.5 seconds     |   1005 seconds  |   2.5 GB
>> This week             |     6   seconds     |   2246 seconds  |   8.5 GB
>> Ilya's Patch          |     2.5 seconds     |   1170 seconds  |   3.1 GB
>> Numan's patch (run1)  |     1.6 seconds     |    992 seconds  |   2.43 GB
>> Numan's patch (run2)  |     1.7 seconds     |   1022 seconds  |   2.43 GB
>>
>> Seems like removing dp ref cnt altogether is more efficient and very close
>> to last week's results.
>>
>> I'll submit the patch tomorrow after some more testing.
>>
>> Feel free to provide any review comments or feedback if you have any.
>>
>> Thanks
>> Numan
>>
>> > >
>> > > Regarding the cluster density 500 node test with Ilya's fix, I think
>> > > this seems OK to me since the test falls back to recompute a lot and
>> > > there is some cost now with the lflow I-P patches.  This can be
>> > > improved by adding port group I-P.  I can start looking into the
>> > > port-group I-P if you think it would be worth it.  Let me know.
>> > >
>> >
>> > Adding more I-P would be helpful, but it still doesn't explain the
>> > observation from Ilya regarding the 66% increase in recompute time and 25%
>> > increase in memory. My test cases couldn't reproduce such significant
>> > increases (it is ~10-20% in my test). I think we'd better figure out what
>> > caused such big increases - is it still related to ref count or is it
>> > something else. I think the lflow_ref might have contributed to it,
>> > especially the use of hmap instead of list. It is totally fine if the
>> > lflow_ref adds some cost, which is required for the I-P to work, but it is
>> > better to understand where the cost comes from and justify it (or optimize
>> > it).
>>
>> Looks like from my patch results,  dp_ref_cnt is contributing to this 
>> increase.
>> Is it possible for you to trigger another test with my patch -
>> https://github.com/numansiddique/ovn/commit/92f6563d9e7b1e6c8f2b924dea7a220781e10a05
>>  
>> 
>>
> It is great to see that removing the ref count resolved the performance gap.
> The test result is actually very interesting. I also had a test with more
> LBs (10k instead of the earlier 1k) in the large LBG, trying to make the
> cost of the bitmap scan more obvious, but still, the difference between your
> patch and Ilya's patch is not significant in my test. The latency difference
> is within 5%, and the memory <1%. So I wonder why it is so different in the
> ovn-heater test. Is it possible that the ovn-heater test creates a lot of
> duplicated lflows, so ref count hmap operations are actually triggered a
> significant number of times, which might have contributed to the major cost
> even with Ilya's patch, and with your patch it falls back to recompute much
> more often?

FWIW, in my testing with the cluster-density 500node database there
are 12 M refcounts allocated.  With 32 bytes each, it's about 370 MB
of RAM.  We should also add a fair share of allocation overhead since
we're allocating a huge number of very small objects.
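
As a rough, self-contained sketch of that arithmetic (the entry layout below
is a hypothetical stand-in used only for sizing, not the actual northd
struct):

    #include <stdio.h>
    #include <stddef.h>

    struct hmap_node_sketch {          /* stand-in for a 2-word hash node */
        size_t hash;                   /*  8 bytes */
        struct hmap_node_sketch *next; /*  8 bytes */
    };

    struct dp_refcnt_sketch {          /* hypothetical refcount entry */
        struct hmap_node_sketch node;  /* 16 bytes */
        size_t dp_index;               /*  8 bytes: which datapath */
        size_t refcnt;                 /*  8 bytes: the counter itself */
    };                                 /* 32 bytes before malloc overhead */

    int main(void)
    {
        size_t n = 12 * 1000 * 1000;   /* ~12 M refcounts in the test DB */
        printf("entry size: %zu bytes\n", sizeof(struct dp_refcnt_sketch));
        printf("total: ~%.0f MiB before allocator overhead\n",
               n * sizeof(struct dp_refcnt_sketch) / (1024.0 * 1024.0));
        return 0;
    }

On x86-64 this prints a 32-byte entry and ~366 MiB for 12 M entries, which
matches the numbers above before counting the per-allocation overhead.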

Numan also gets rid of all the space allocated for hash maps that hold
all these refcounts, and these were taking ~25% of all allocations.

Getting rid of these allocations likely saves a lot of CPU cycles as
well, since they are very heavy.
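
To make that concrete, below is a minimal, self-contained model of this kind
of per-(lflow, datapath) refcount bookkeeping.  It is toy code with a
fixed-size table, not the real ovn_lflow_add_with_dp_group() logic; it only
illustrates why every duplicated lflow costs a hash lookup and, on first use,
one more tiny allocation:

    #include <stdio.h>
    #include <stdlib.h>

    struct dp_refcnt_model {           /* illustrative layout, not northd's */
        struct dp_refcnt_model *next;  /* hash-chain link */
        size_t dp_index;               /* datapath this counter is for */
        size_t refcnt;                 /* times the lflow was added for it */
    };

    #define N_BUCKETS 64               /* toy fixed size; a real hmap resizes */

    /* Find the counter for dp_index, allocating it on first use. */
    static struct dp_refcnt_model *
    dp_refcnt_get(struct dp_refcnt_model **buckets, size_t dp_index)
    {
        size_t b = dp_index % N_BUCKETS;
        for (struct dp_refcnt_model *rc = buckets[b]; rc; rc = rc->next) {
            if (rc->dp_index == dp_index) {
                return rc;
            }
        }
        struct dp_refcnt_model *rc = calloc(1, sizeof *rc); /* tiny alloc */
        rc->dp_index = dp_index;
        rc->next = buckets[b];
        buckets[b] = rc;
        return rc;
    }

    int main(void)
    {
        struct dp_refcnt_model *buckets[N_BUCKETS] = { NULL };

        /* The same lflow re-added three times for datapath 7 (compare the
         * defrag flows in the P.S. below) bumps one counter to 3 instead
         * of duplicating the flow. */
        for (int i = 0; i < 3; i++) {
            dp_refcnt_get(buckets, 7)->refcnt++;
        }
        printf("refcnt for datapath 7: %zu\n",
               dp_refcnt_get(buckets, 7)->refcnt);
        return 0;
    }

Dropping this bookkeeping entirely, as Numan's patch does, removes both the
per-entry allocations and the hash maps that hold them.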

P.S.
Almost all of these 12M refcounts are for:
  build_lrouter_defrag_flows_for_lb():ovn_lflow_add_with_dp_group()
All of these have refcount == 3.

There is one with refcount 502:
  build_egress_delivery_flows_for_lrouter_port():ovn_lflow_add_default_drop()

And there are 500-ish with refcount of 2:
  build_lswitch_rport_arp_req_flow():ovn_lflow_add_with_hint()

Best regards, Ilya Maximets.


Re: [ovs-dev] [PATCH ovn v6 00/13] northd lflow incremental processing

2024-02-06 Thread Han Zhou
On Mon, Feb 5, 2024 at 7:47 PM Numan Siddique  wrote:
>
> On Mon, Feb 5, 2024 at 9:41 PM Han Zhou  wrote:
> >
> > On Mon, Feb 5, 2024 at 4:12 PM Numan Siddique  wrote:
> > >
> > > On Mon, Feb 5, 2024 at 5:54 PM Han Zhou  wrote:
> > > >
> > > > On Mon, Feb 5, 2024 at 10:15 AM Ilya Maximets 
> > wrote:
> > > > >
> > > > > On 2/5/24 15:45, Ilya Maximets wrote:
> > > > > > On 2/5/24 11:34, Ilya Maximets wrote:
> > > > > >> On 2/5/24 09:23, Dumitru Ceara wrote:
> > > > > >>> On 2/5/24 08:13, Han Zhou wrote:
> > > > >  On Sun, Feb 4, 2024 at 9:26 PM Numan Siddique 
> > wrote:
> > > > > >
> > > > > > On Sun, Feb 4, 2024 at 9:53 PM Han Zhou 
wrote:
> > > > > >>
> > > > > >> On Sun, Feb 4, 2024 at 5:46 AM Ilya Maximets <
> > i.maxim...@ovn.org>
> > > > wrote:
> > > > > >>>
> > > > > >>
> > > > > >>>  35 files changed, 9681 insertions(+), 4645
deletions(-)
> > > > > >>
> > > > > >> I had another look at this series and acked the
remaining
> > > > >  patches.  I
> > > > > >> just had some minor comments that can be easily fixed
when
> > > > >  applying
> > > > > >> the
> > > > > >> patches to the main branch.
> > > > > >>
> > > > > >> Thanks for all the work on this!  It was a very large
> > change
> > > > but
> > > > >  it
> > > > > >> improves northd performance significantly.  I just
hope we
> > > > don't
> > > > > >> introduce too many bugs.  Hopefully the time we have
until
> > > > release
> > > > > >> will
> > > > > >> allow us to further test this change on the 24.03
branch.
> > > > > >>
> > > > > >> Regards,
> > > > > >> Dumitru
> > > > > >
> > > > > >
> > > > > >
> > > > > > Thanks a lot Dumitru and Han for the reviews and
patience.
> > > > > >
> > > > > > I addressed the comments and applied the patches to
main and
> > > > also
> > > > >  to
> > > > >  branch-24.03.
> > > > > >
> > > > > > @Han - I know you wanted to take another look in to v6.
 I
> > > > didn't
> > > > >  want
> > > > > >> to
> > > > >  delay further as branch-24.03 was created.  I'm more than
> > happy
> > > > to
> > > > > >> submit
> > > > >  follow up patches if you have any comments to address.
> > Please
> > > > let
> > > > >  me
> > > > > >> know.
> > > > > >
> > > > > 
> > > > >  Hi Numan,
> > > > > 
> > > > >  I was writing the reply and saw your email just now.
Thanks
> > a lot
> > > > >  for
> > > > >  taking a huge effort to achieve the great optimization. I
> > only
> > > > left
> > > > >  one
> > > > >  comment on the implicit dependency left for the en_lrnat
->
> > > > >  en_lflow.
> > > > > >> Feel
> > > > >  free to address it with a followup and no need to block
the
> > > > >  branching.
> > > > > >> And
> > > > >  take my Ack for the series with that addressed.
> > > > > 
> > > > >  Acked-by: Han Zhou 
> > > > > >>>
> > > > > >>>
> > > > > >>> Hi, Numan, Dumitru and Han.
> > > > > >>>
> > > > > >>> I see a huge negative performance impact, most likely from
> > this
> > > > set,
> > > > >  on
> > > > > >>> ovn-heater's cluster-density tests.  The memory
consumption on
> > > > northd
> > > > > >>>
> > > > > >>> Thanks for reporting this, Ilya!
> > > > > >>>
> > > > > >>> jumped about 4x and it constantly recomputes due to
failures
> > of
> > > > >  port_group
> > > > > >>> handler:
> > > > > >>>
> > > > > >>> 2024-02-03T11:09:12.441Z|01680|inc_proc_eng|INFO|node:
lflow,
> > > > >  recompute
> > > > > >> (failed handler for input port_group) took 9762ms
> > > > > >>> 2024-02-03T11:09:12.444Z|01681|timeval|WARN|Unreasonably
long
> > > > 9898ms
> > > > >  poll
> > > > > >> interval (5969ms user, 1786ms system)
> > > > > >>> ...
> > > > > >>> 2024-02-03T11:09:23.770Z|01690|inc_proc_eng|INFO|node:
lflow,
> > > > >  recompute
> > > > > >> (failed handler for input port_group) took 9014ms
> > > > > >>> 2024-02-03T11:09:23.773Z|01691|timeval|WARN|Unreasonably
long
> > > > 9118ms
> > > > >  poll
> > > > > >> interval (5376ms user, 1515ms system)
> > > > > >>> ...
> > > > > >>> 2024-02-03T11:09:36.692Z|01699|inc_proc_eng|INFO|node:
lflow,
> > > > >  recompute
> > > > > >> (failed handler for input port_group) took 10695ms
> > > > > >>> 2024-02-03T11:09:36.696Z|01700|timeval|WARN|Unreasonably
long
> > > > 10890ms
> > > > > >> poll interval (6085ms user, 2745ms system)
> > > > > >>> ...
> > > > > >>> 2024-02-03T11:09:49.133Z|01708|inc_proc_eng|INFO|node:
lflow,
> > > > >  recompute
> > > > > >> (failed handler for input port_group) took 9985ms
> > > > > >>> 2024-02-03T11:09:49.137Z|01709|timeval|WARN|Unreasonably
long
> > > 

Re: [ovs-dev] [PATCH ovn v6 00/13] northd lflow incremental processing

2024-02-06 Thread Dumitru Ceara
On 2/5/24 23:53, Han Zhou wrote:
> For Dumitru's proposal:
>>> Maybe it would be an idea to integrate some of Han's performance testing
>>> scripts into the set of tests we already have in the upstream repo,
>>> ovn-performance.at [0], and run those in GitHub actions too.
>>> [0] https://github.com/ovn-org/ovn/blob/main/tests/ovn-performance.at
>>> Han, others, what do you think?
> I've been thinking about this for a long time but it wasn't prioritized. It
> was not easy for the reasons mentioned by Ilya. But I agree it is something
> we need to figure out. For memory increase it should be pretty reliable,
> but for CPU/latency, we might need to think about metrics that take into
> account the performance of the node that executes the test cases. Of course
> hardware resources (CPU and memory of the test node) vs. the scale we can
> test is another concern, but we may start with a scale that is not too big
> to run in GitHub Actions while big enough to provide good performance metrics.
> 

I guess we need to first try it out and see if there's any degree of
reliability (from a memory usage perspective) when running something like
this on public GitHub runners.

Han, do you have time to investigate this direction?  Otherwise, if
nobody else volunteers, I will try to add it to my todo list.

Thanks,
Dumitru



Re: [ovs-dev] [PATCH ovn v6 00/13] northd lflow incremental processing

2024-02-05 Thread Numan Siddique
On Mon, Feb 5, 2024 at 9:41 PM Han Zhou  wrote:
>
> On Mon, Feb 5, 2024 at 4:12 PM Numan Siddique  wrote:
> >
> > On Mon, Feb 5, 2024 at 5:54 PM Han Zhou  wrote:
> > >
> > > On Mon, Feb 5, 2024 at 10:15 AM Ilya Maximets 
> wrote:
> > > >
> > > > On 2/5/24 15:45, Ilya Maximets wrote:
> > > > > On 2/5/24 11:34, Ilya Maximets wrote:
> > > > >> On 2/5/24 09:23, Dumitru Ceara wrote:
> > > > >>> On 2/5/24 08:13, Han Zhou wrote:
> > > >  On Sun, Feb 4, 2024 at 9:26 PM Numan Siddique 
> wrote:
> > > > >
> > > > > On Sun, Feb 4, 2024 at 9:53 PM Han Zhou  wrote:
> > > > >>
> > > > >> On Sun, Feb 4, 2024 at 5:46 AM Ilya Maximets <
> i.maxim...@ovn.org>
> > > wrote:
> > > > >>>
> > > > >>
> > > > >>>  35 files changed, 9681 insertions(+), 4645 deletions(-)
> > > > >>
> > > > >> I had another look at this series and acked the remaining
> > > >  patches.  I
> > > > >> just had some minor comments that can be easily fixed when
> > > >  applying
> > > > >> the
> > > > >> patches to the main branch.
> > > > >>
> > > > >> Thanks for all the work on this!  It was a very large
> change
> > > but
> > > >  it
> > > > >> improves northd performance significantly.  I just hope we
> > > don't
> > > > >> introduce too many bugs.  Hopefully the time we have until
> > > release
> > > > >> will
> > > > >> allow us to further test this change on the 24.03 branch.
> > > > >>
> > > > >> Regards,
> > > > >> Dumitru
> > > > >
> > > > >
> > > > >
> > > > > Thanks a lot Dumitru and Han for the reviews and patience.
> > > > >
> > > > > I addressed the comments and applied the patches to main and
> > > also
> > > >  to
> > > >  branch-24.03.
> > > > >
> > > > > @Han - I know you wanted to take another look in to v6.  I
> > > didn't
> > > >  want
> > > > >> to
> > > >  delay further as branch-24.03 was created.  I'm more than
> happy
> > > to
> > > > >> submit
> > > >  follow up patches if you have any comments to address.
> Please
> > > let
> > > >  me
> > > > >> know.
> > > > >
> > > > 
> > > >  Hi Numan,
> > > > 
> > > >  I was writing the reply and saw your email just now. Thanks
> a lot
> > > >  for
> > > >  taking a huge effort to achieve the great optimization. I
> only
> > > left
> > > >  one
> > > >  comment on the implicit dependency left for the en_lrnat ->
> > > >  en_lflow.
> > > > >> Feel
> > > >  free to address it with a followup and no need to block the
> > > >  branching.
> > > > >> And
> > > >  take my Ack for the series with that addressed.
> > > > 
> > > >  Acked-by: Han Zhou 
> > > > >>>
> > > > >>>
> > > > >>> Hi, Numan, Dumitru and Han.
> > > > >>>
> > > > >>> I see a huge negative performance impact, most likely from
> this
> > > set,
> > > >  on
> > > > >>> ovn-heater's cluster-density tests.  The memory consumption on
> > > northd
> > > > >>>
> > > > >>> Thanks for reporting this, Ilya!
> > > > >>>
> > > > >>> jumped about 4x and it constantly recomputes due to failures
> of
> > > >  port_group
> > > > >>> handler:
> > > > >>>
> > > > >>> 2024-02-03T11:09:12.441Z|01680|inc_proc_eng|INFO|node: lflow,
> > > >  recompute
> > > > >> (failed handler for input port_group) took 9762ms
> > > > >>> 2024-02-03T11:09:12.444Z|01681|timeval|WARN|Unreasonably long
> > > 9898ms
> > > >  poll
> > > > >> interval (5969ms user, 1786ms system)
> > > > >>> ...
> > > > >>> 2024-02-03T11:09:23.770Z|01690|inc_proc_eng|INFO|node: lflow,
> > > >  recompute
> > > > >> (failed handler for input port_group) took 9014ms
> > > > >>> 2024-02-03T11:09:23.773Z|01691|timeval|WARN|Unreasonably long
> > > 9118ms
> > > >  poll
> > > > >> interval (5376ms user, 1515ms system)
> > > > >>> ...
> > > > >>> 2024-02-03T11:09:36.692Z|01699|inc_proc_eng|INFO|node: lflow,
> > > >  recompute
> > > > >> (failed handler for input port_group) took 10695ms
> > > > >>> 2024-02-03T11:09:36.696Z|01700|timeval|WARN|Unreasonably long
> > > 10890ms
> > > > >> poll interval (6085ms user, 2745ms system)
> > > > >>> ...
> > > > >>> 2024-02-03T11:09:49.133Z|01708|inc_proc_eng|INFO|node: lflow,
> > > >  recompute
> > > > >> (failed handler for input port_group) took 9985ms
> > > > >>> 2024-02-03T11:09:49.137Z|01709|timeval|WARN|Unreasonably long
> > > 10108ms
> > > > >> poll interval (5521ms user, 2440ms system)
> > > > >>>
> > > >>> That increases the 95th percentile ovn-installed latency in 500node
> > > >>> cluster-density from 3.6 seconds last week to 21.5 seconds this week.
> > > > >>>
> > > > >>> I think, this should be a release 

Re: [ovs-dev] [PATCH ovn v6 00/13] northd lflow incremental processing

2024-02-05 Thread Han Zhou
On Mon, Feb 5, 2024 at 4:12 PM Numan Siddique  wrote:
>
> On Mon, Feb 5, 2024 at 5:54 PM Han Zhou  wrote:
> >
> > On Mon, Feb 5, 2024 at 10:15 AM Ilya Maximets 
wrote:
> > >
> > > On 2/5/24 15:45, Ilya Maximets wrote:
> > > > On 2/5/24 11:34, Ilya Maximets wrote:
> > > >> On 2/5/24 09:23, Dumitru Ceara wrote:
> > > >>> On 2/5/24 08:13, Han Zhou wrote:
> > >  On Sun, Feb 4, 2024 at 9:26 PM Numan Siddique 
wrote:
> > > >
> > > > On Sun, Feb 4, 2024 at 9:53 PM Han Zhou  wrote:
> > > >>
> > > >> On Sun, Feb 4, 2024 at 5:46 AM Ilya Maximets <
i.maxim...@ovn.org>
> > wrote:
> > > >>>
> > > >>
> > > >>>  35 files changed, 9681 insertions(+), 4645 deletions(-)
> > > >>
> > > >> I had another look at this series and acked the remaining
> > >  patches.  I
> > > >> just had some minor comments that can be easily fixed when
> > >  applying
> > > >> the
> > > >> patches to the main branch.
> > > >>
> > > >> Thanks for all the work on this!  It was a very large
change
> > but
> > >  it
> > > >> improves northd performance significantly.  I just hope we
> > don't
> > > >> introduce too many bugs.  Hopefully the time we have until
> > release
> > > >> will
> > > >> allow us to further test this change on the 24.03 branch.
> > > >>
> > > >> Regards,
> > > >> Dumitru
> > > >
> > > >
> > > >
> > > > Thanks a lot Dumitru and Han for the reviews and patience.
> > > >
> > > > I addressed the comments and applied the patches to main and
> > also
> > >  to
> > >  branch-24.03.
> > > >
> > > > @Han - I know you wanted to take another look in to v6.  I
> > didn't
> > >  want
> > > >> to
> > >  delay further as branch-24.03 was created.  I'm more than
happy
> > to
> > > >> submit
> > >  follow up patches if you have any comments to address.
Please
> > let
> > >  me
> > > >> know.
> > > >
> > > 
> > >  Hi Numan,
> > > 
> > >  I was writing the reply and saw your email just now. Thanks
a lot
> > >  for
> > >  taking a huge effort to achieve the great optimization. I
only
> > left
> > >  one
> > >  comment on the implicit dependency left for the en_lrnat ->
> > >  en_lflow.
> > > >> Feel
> > >  free to address it with a followup and no need to block the
> > >  branching.
> > > >> And
> > >  take my Ack for the series with that addressed.
> > > 
> > >  Acked-by: Han Zhou 
> > > >>>
> > > >>>
> > > >>> Hi, Numan, Dumitru and Han.
> > > >>>
> > > >>> I see a huge negative performance impact, most likely from
this
> > set,
> > >  on
> > > >>> ovn-heater's cluster-density tests.  The memory consumption on
> > northd
> > > >>>
> > > >>> Thanks for reporting this, Ilya!
> > > >>>
> > > >>> jumped about 4x and it constantly recomputes due to failures
of
> > >  port_group
> > > >>> handler:
> > > >>>
> > > >>> 2024-02-03T11:09:12.441Z|01680|inc_proc_eng|INFO|node: lflow,
> > >  recompute
> > > >> (failed handler for input port_group) took 9762ms
> > > >>> 2024-02-03T11:09:12.444Z|01681|timeval|WARN|Unreasonably long
> > 9898ms
> > >  poll
> > > >> interval (5969ms user, 1786ms system)
> > > >>> ...
> > > >>> 2024-02-03T11:09:23.770Z|01690|inc_proc_eng|INFO|node: lflow,
> > >  recompute
> > > >> (failed handler for input port_group) took 9014ms
> > > >>> 2024-02-03T11:09:23.773Z|01691|timeval|WARN|Unreasonably long
> > 9118ms
> > >  poll
> > > >> interval (5376ms user, 1515ms system)
> > > >>> ...
> > > >>> 2024-02-03T11:09:36.692Z|01699|inc_proc_eng|INFO|node: lflow,
> > >  recompute
> > > >> (failed handler for input port_group) took 10695ms
> > > >>> 2024-02-03T11:09:36.696Z|01700|timeval|WARN|Unreasonably long
> > 10890ms
> > > >> poll interval (6085ms user, 2745ms system)
> > > >>> ...
> > > >>> 2024-02-03T11:09:49.133Z|01708|inc_proc_eng|INFO|node: lflow,
> > >  recompute
> > > >> (failed handler for input port_group) took 9985ms
> > > >>> 2024-02-03T11:09:49.137Z|01709|timeval|WARN|Unreasonably long
> > 10108ms
> > > >> poll interval (5521ms user, 2440ms system)
> > > >>>
> > > >>> That increases the 95th percentile ovn-installed latency in 500node
> > > >>> cluster-density from 3.6 seconds last week to 21.5 seconds this week.
> > > >>>
> > > >>> I think, this should be a release blocker.
> > > >>>
> > > >>> Memory usage is also very concerning.  Unfortunately it is not
> > tied
> > >  to the
> > > >>> cluster-density test.  The same 4-5x RSS jump is also seen in
> > other
> > >  test
> > > >>> like density-heavy.  Last week RSS of ovn-northd in
> > cluster-density
> > >  500

Re: [ovs-dev] [PATCH ovn v6 00/13] northd lflow incremental processing

2024-02-05 Thread Numan Siddique
On Mon, Feb 5, 2024 at 5:54 PM Han Zhou  wrote:
>
> On Mon, Feb 5, 2024 at 10:15 AM Ilya Maximets  wrote:
> >
> > On 2/5/24 15:45, Ilya Maximets wrote:
> > > On 2/5/24 11:34, Ilya Maximets wrote:
> > >> On 2/5/24 09:23, Dumitru Ceara wrote:
> > >>> On 2/5/24 08:13, Han Zhou wrote:
> >  On Sun, Feb 4, 2024 at 9:26 PM Numan Siddique  wrote:
> > >
> > > On Sun, Feb 4, 2024 at 9:53 PM Han Zhou  wrote:
> > >>
> > >> On Sun, Feb 4, 2024 at 5:46 AM Ilya Maximets 
> wrote:
> > >>>
> > >>
> > >>>  35 files changed, 9681 insertions(+), 4645 deletions(-)
> > >>
> > >> I had another look at this series and acked the remaining
> >  patches.  I
> > >> just had some minor comments that can be easily fixed when
> >  applying
> > >> the
> > >> patches to the main branch.
> > >>
> > >> Thanks for all the work on this!  It was a very large change
> but
> >  it
> > >> improves northd performance significantly.  I just hope we
> don't
> > >> introduce too many bugs.  Hopefully the time we have until
> release
> > >> will
> > >> allow us to further test this change on the 24.03 branch.
> > >>
> > >> Regards,
> > >> Dumitru
> > >
> > >
> > >
> > > Thanks a lot Dumitru and Han for the reviews and patience.
> > >
> > > I addressed the comments and applied the patches to main and
> also
> >  to
> >  branch-24.03.
> > >
> > > @Han - I know you wanted to take another look in to v6.  I
> didn't
> >  want
> > >> to
> >  delay further as branch-24.03 was created.  I'm more than happy
> to
> > >> submit
> >  follow up patches if you have any comments to address.  Please
> let
> >  me
> > >> know.
> > >
> > 
> >  Hi Numan,
> > 
> >  I was writing the reply and saw your email just now. Thanks a lot
> >  for
> >  taking a huge effort to achieve the great optimization. I only
> left
> >  one
> >  comment on the implicit dependency left for the en_lrnat ->
> >  en_lflow.
> > >> Feel
> >  free to address it with a followup and no need to block the
> >  branching.
> > >> And
> >  take my Ack for the series with that addressed.
> > 
> >  Acked-by: Han Zhou 
> > >>>
> > >>>
> > >>> Hi, Numan, Dumitru and Han.
> > >>>
> > >>> I see a huge negative performance impact, most likely from this
> set,
> >  on
> > >>> ovn-heater's cluster-density tests.  The memory consumption on
> northd
> > >>>
> > >>> Thanks for reporting this, Ilya!
> > >>>
> > >>> jumped about 4x and it constantly recomputes due to failures of
> >  port_group
> > >>> handler:
> > >>>
> > >>> 2024-02-03T11:09:12.441Z|01680|inc_proc_eng|INFO|node: lflow,
> >  recompute
> > >> (failed handler for input port_group) took 9762ms
> > >>> 2024-02-03T11:09:12.444Z|01681|timeval|WARN|Unreasonably long
> 9898ms
> >  poll
> > >> interval (5969ms user, 1786ms system)
> > >>> ...
> > >>> 2024-02-03T11:09:23.770Z|01690|inc_proc_eng|INFO|node: lflow,
> >  recompute
> > >> (failed handler for input port_group) took 9014ms
> > >>> 2024-02-03T11:09:23.773Z|01691|timeval|WARN|Unreasonably long
> 9118ms
> >  poll
> > >> interval (5376ms user, 1515ms system)
> > >>> ...
> > >>> 2024-02-03T11:09:36.692Z|01699|inc_proc_eng|INFO|node: lflow,
> >  recompute
> > >> (failed handler for input port_group) took 10695ms
> > >>> 2024-02-03T11:09:36.696Z|01700|timeval|WARN|Unreasonably long
> 10890ms
> > >> poll interval (6085ms user, 2745ms system)
> > >>> ...
> > >>> 2024-02-03T11:09:49.133Z|01708|inc_proc_eng|INFO|node: lflow,
> >  recompute
> > >> (failed handler for input port_group) took 9985ms
> > >>> 2024-02-03T11:09:49.137Z|01709|timeval|WARN|Unreasonably long
> 10108ms
> > >> poll interval (5521ms user, 2440ms system)
> > >>>
> > >>> That increases the 95th percentile ovn-installed latency in 500node
> > >>> cluster-density from 3.6 seconds last week to 21.5 seconds this week.
> > >>>
> > >>> I think, this should be a release blocker.
> > >>>
> > >>> Memory usage is also very concerning.  Unfortunately it is not
> tied
> >  to the
> > >>> cluster-density test.  The same 4-5x RSS jump is also seen in
> other
> >  test
> > >>> like density-heavy.  Last week RSS of ovn-northd in
> cluster-density
> >  500
> > >> node
> > >>> was between 1.5 and 2.5 GB, this week we have a range between 5.5
> and
> >  8.5
> > >> GB.
> > >>>
> > >>> I would consider this as a release blocker as well.
> > >>>
> > >>>
> > >>> I agree, we shouldn't release 24.03.0 unless these two issues are
> > >>> (sufficiently) addressed.  We do have until 

Re: [ovs-dev] [PATCH ovn v6 00/13] northd lflow incremental processing

2024-02-05 Thread Han Zhou
On Mon, Feb 5, 2024 at 10:15 AM Ilya Maximets  wrote:
>
> On 2/5/24 15:45, Ilya Maximets wrote:
> > On 2/5/24 11:34, Ilya Maximets wrote:
> >> On 2/5/24 09:23, Dumitru Ceara wrote:
> >>> On 2/5/24 08:13, Han Zhou wrote:
>  On Sun, Feb 4, 2024 at 9:26 PM Numan Siddique  wrote:
> >
> > On Sun, Feb 4, 2024 at 9:53 PM Han Zhou  wrote:
> >>
> >> On Sun, Feb 4, 2024 at 5:46 AM Ilya Maximets 
wrote:
> >>>
> >>
> >>>  35 files changed, 9681 insertions(+), 4645 deletions(-)
> >>
> >> I had another look at this series and acked the remaining
>  patches.  I
> >> just had some minor comments that can be easily fixed when
>  applying
> >> the
> >> patches to the main branch.
> >>
> >> Thanks for all the work on this!  It was a very large change
but
>  it
> >> improves northd performance significantly.  I just hope we
don't
> >> introduce too many bugs.  Hopefully the time we have until
release
> >> will
> >> allow us to further test this change on the 24.03 branch.
> >>
> >> Regards,
> >> Dumitru
> >
> >
> >
> > Thanks a lot Dumitru and Han for the reviews and patience.
> >
> > I addressed the comments and applied the patches to main and
also
>  to
>  branch-24.03.
> >
> > @Han - I know you wanted to take another look in to v6.  I
didn't
>  want
> >> to
>  delay further as branch-24.03 was created.  I'm more than happy
to
> >> submit
>  follow up patches if you have any comments to address.  Please
let
>  me
> >> know.
> >
> 
>  Hi Numan,
> 
>  I was writing the reply and saw your email just now. Thanks a lot
>  for
>  taking a huge effort to achieve the great optimization. I only
left
>  one
>  comment on the implicit dependency left for the en_lrnat ->
>  en_lflow.
> >> Feel
>  free to address it with a followup and no need to block the
>  branching.
> >> And
>  take my Ack for the series with that addressed.
> 
>  Acked-by: Han Zhou 
> >>>
> >>>
> >>> Hi, Numan, Dumitru and Han.
> >>>
> >>> I see a huge negative performance impact, most likely from this
set,
>  on
> >>> ovn-heater's cluster-density tests.  The memory consumption on
northd
> >>>
> >>> Thanks for reporting this, Ilya!
> >>>
> >>> jumped about 4x and it constantly recomputes due to failures of
>  port_group
> >>> handler:
> >>>
> >>> 2024-02-03T11:09:12.441Z|01680|inc_proc_eng|INFO|node: lflow,
>  recompute
> >> (failed handler for input port_group) took 9762ms
> >>> 2024-02-03T11:09:12.444Z|01681|timeval|WARN|Unreasonably long
9898ms
>  poll
> >> interval (5969ms user, 1786ms system)
> >>> ...
> >>> 2024-02-03T11:09:23.770Z|01690|inc_proc_eng|INFO|node: lflow,
>  recompute
> >> (failed handler for input port_group) took 9014ms
> >>> 2024-02-03T11:09:23.773Z|01691|timeval|WARN|Unreasonably long
9118ms
>  poll
> >> interval (5376ms user, 1515ms system)
> >>> ...
> >>> 2024-02-03T11:09:36.692Z|01699|inc_proc_eng|INFO|node: lflow,
>  recompute
> >> (failed handler for input port_group) took 10695ms
> >>> 2024-02-03T11:09:36.696Z|01700|timeval|WARN|Unreasonably long
10890ms
> >> poll interval (6085ms user, 2745ms system)
> >>> ...
> >>> 2024-02-03T11:09:49.133Z|01708|inc_proc_eng|INFO|node: lflow,
>  recompute
> >> (failed handler for input port_group) took 9985ms
> >>> 2024-02-03T11:09:49.137Z|01709|timeval|WARN|Unreasonably long
10108ms
> >> poll interval (5521ms user, 2440ms system)
> >>>
> >>> That increases the 95th percentile ovn-installed latency in 500node
> >>> cluster-density from 3.6 seconds last week to 21.5 seconds this week.
> >>>
> >>> I think, this should be a release blocker.
> >>>
> >>> Memory usage is also very concerning.  Unfortunately it is not
tied
>  to the
> >>> cluster-density test.  The same 4-5x RSS jump is also seen in
other
>  test
> >>> like density-heavy.  Last week RSS of ovn-northd in
cluster-density
>  500
> >> node
> >>> was between 1.5 and 2.5 GB, this week we have a range between 5.5
and
>  8.5
> >> GB.
> >>>
> >>> I would consider this as a release blocker as well.
> >>>
> >>>
> >>> I agree, we shouldn't release 24.03.0 unless these two issues are
> >>> (sufficiently) addressed.  We do have until March 1st (official
release
> >>> date) to do that or to revert any patches that cause regressions.
> >>>
> >>>
> >>> I don't have direct evidence that this particular series is a
>  culprit, but
> >>> it looks like the most likely candidate.  I can dig more into
> >> investigation
> >>> on Monday.
> >>>
> >>> 

Re: [ovs-dev] [PATCH ovn v6 00/13] northd lflow incremental processing

2024-02-05 Thread Ilya Maximets
On 2/5/24 15:45, Ilya Maximets wrote:
> On 2/5/24 11:34, Ilya Maximets wrote:
>> On 2/5/24 09:23, Dumitru Ceara wrote:
>>> On 2/5/24 08:13, Han Zhou wrote:
 On Sun, Feb 4, 2024 at 9:26 PM Numan Siddique  wrote:
>
> On Sun, Feb 4, 2024 at 9:53 PM Han Zhou  wrote:
>>
>> On Sun, Feb 4, 2024 at 5:46 AM Ilya Maximets  wrote:
>>>
>>
>>>  35 files changed, 9681 insertions(+), 4645 deletions(-)
>>
>> I had another look at this series and acked the remaining
 patches.  I
>> just had some minor comments that can be easily fixed when
 applying
>> the
>> patches to the main branch.
>>
>> Thanks for all the work on this!  It was a very large change but
 it
>> improves northd performance significantly.  I just hope we don't
>> introduce too many bugs.  Hopefully the time we have until release
>> will
>> allow us to further test this change on the 24.03 branch.
>>
>> Regards,
>> Dumitru
>
>
>
> Thanks a lot Dumitru and Han for the reviews and patience.
>
> I addressed the comments and applied the patches to main and also
 to
 branch-24.03.
>
> @Han - I know you wanted to take another look in to v6.  I didn't
 want
>> to
 delay further as branch-24.03 was created.  I'm more than happy to
>> submit
 follow up patches if you have any comments to address.  Please let
 me
>> know.
>

 Hi Numan,

 I was writing the reply and saw your email just now. Thanks a lot
 for
 taking a huge effort to achieve the great optimization. I only left
 one
 comment on the implicit dependency left for the en_lrnat ->
 en_lflow.
>> Feel
 free to address it with a followup and no need to block the
 branching.
>> And
 take my Ack for the series with that addressed.

 Acked-by: Han Zhou 
>>>
>>>
>>> Hi, Numan, Dumitru and Han.
>>>
>>> I see a huge negative performance impact, most likely from this set,
 on
>>> ovn-heater's cluster-density tests.  The memory consumption on northd
>>>
>>> Thanks for reporting this, Ilya!
>>>
>>> jumped about 4x and it constantly recomputes due to failures of
 port_group
>>> handler:
>>>
>>> 2024-02-03T11:09:12.441Z|01680|inc_proc_eng|INFO|node: lflow,
 recompute
>> (failed handler for input port_group) took 9762ms
>>> 2024-02-03T11:09:12.444Z|01681|timeval|WARN|Unreasonably long 9898ms
 poll
>> interval (5969ms user, 1786ms system)
>>> ...
>>> 2024-02-03T11:09:23.770Z|01690|inc_proc_eng|INFO|node: lflow,
 recompute
>> (failed handler for input port_group) took 9014ms
>>> 2024-02-03T11:09:23.773Z|01691|timeval|WARN|Unreasonably long 9118ms
 poll
>> interval (5376ms user, 1515ms system)
>>> ...
>>> 2024-02-03T11:09:36.692Z|01699|inc_proc_eng|INFO|node: lflow,
 recompute
>> (failed handler for input port_group) took 10695ms
>>> 2024-02-03T11:09:36.696Z|01700|timeval|WARN|Unreasonably long 10890ms
>> poll interval (6085ms user, 2745ms system)
>>> ...
>>> 2024-02-03T11:09:49.133Z|01708|inc_proc_eng|INFO|node: lflow,
 recompute
>> (failed handler for input port_group) took 9985ms
>>> 2024-02-03T11:09:49.137Z|01709|timeval|WARN|Unreasonably long 10108ms
>> poll interval (5521ms user, 2440ms system)
>>>
>>> That increases the 95th percentile ovn-installed latency in 500node
>>> cluster-density from 3.6 seconds last week to 21.5 seconds this week.
>>>
>>> I think, this should be a release blocker.
>>>
>>> Memory usage is also very concerning.  Unfortunately it is not tied
 to the
>>> cluster-density test.  The same 4-5x RSS jump is also seen in other
 test
>>> like density-heavy.  Last week RSS of ovn-northd in cluster-density
 500
>> node
>>> was between 1.5 and 2.5 GB, this week we have a range between 5.5 and
 8.5
>> GB.
>>>
>>> I would consider this as a release blocker as well.
>>>
>>>
>>> I agree, we shouldn't release 24.03.0 unless these two issues are
>>> (sufficiently) addressed.  We do have until March 1st (official release
>>> date) to do that or to revert any patches that cause regressions.
>>>
>>>
>>> I don't have direct evidence that this particular series is a
 culprit, but
>>> it looks like the most likely candidate.  I can dig more into
>> investigation
>>> on Monday.
>>>
>>> Best regards, Ilya Maximets.
>>
>> Thanks Ilya for reporting this. 95% latency and 4x RSS increase is a
 little
>> surprising to me. I did test this series with my scale test scripts for
>> recompute performance regression. It was 10+% increase in latency. I
 even
>> digged a little into it, and 

Re: [ovs-dev] [PATCH ovn v6 00/13] northd lflow incremental processing

2024-02-05 Thread Dumitru Ceara
On 2/5/24 12:10, Ilya Maximets wrote:
> On 2/5/24 11:58, Dumitru Ceara wrote:
>> On 2/5/24 11:34, Ilya Maximets wrote:
>>> On 2/5/24 09:23, Dumitru Ceara wrote:
 On 2/5/24 08:13, Han Zhou wrote:
> On Sun, Feb 4, 2024 at 9:26 PM Numan Siddique  wrote:
>>
>> On Sun, Feb 4, 2024 at 9:53 PM Han Zhou  wrote:
>>>
>>> On Sun, Feb 4, 2024 at 5:46 AM Ilya Maximets  wrote:

>>>
  35 files changed, 9681 insertions(+), 4645 deletions(-)
>>>
>>> I had another look at this series and acked the remaining
> patches.  I
>>> just had some minor comments that can be easily fixed when
> applying
>>> the
>>> patches to the main branch.
>>>
>>> Thanks for all the work on this!  It was a very large change but
> it
>>> improves northd performance significantly.  I just hope we don't
>>> introduce too many bugs.  Hopefully the time we have until release
>>> will
>>> allow us to further test this change on the 24.03 branch.
>>>
>>> Regards,
>>> Dumitru
>>
>>
>>
>> Thanks a lot Dumitru and Han for the reviews and patience.
>>
>> I addressed the comments and applied the patches to main and also
> to
> branch-24.03.
>>
>> @Han - I know you wanted to take another look in to v6.  I didn't
> want
>>> to
> delay further as branch-24.03 was created.  I'm more than happy to
>>> submit
> follow up patches if you have any comments to address.  Please let
> me
>>> know.
>>
>
> Hi Numan,
>
> I was writing the reply and saw your email just now. Thanks a lot
> for
> taking a huge effort to achieve the great optimization. I only left
> one
> comment on the implicit dependency left for the en_lrnat ->
> en_lflow.
>>> Feel
> free to address it with a followup and no need to block the
> branching.
>>> And
> take my Ack for the series with that addressed.
>
> Acked-by: Han Zhou 


 Hi, Numan, Dumitru and Han.

 I see a huge negative performance impact, most likely from this set,
> on
 ovn-heater's cluster-density tests.  The memory consumption on northd

 Thanks for reporting this, Ilya!

 jumped about 4x and it constantly recomputes due to failures of
> port_group
 handler:

 2024-02-03T11:09:12.441Z|01680|inc_proc_eng|INFO|node: lflow,
> recompute
>>> (failed handler for input port_group) took 9762ms
 2024-02-03T11:09:12.444Z|01681|timeval|WARN|Unreasonably long 9898ms
> poll
>>> interval (5969ms user, 1786ms system)
 ...
 2024-02-03T11:09:23.770Z|01690|inc_proc_eng|INFO|node: lflow,
> recompute
>>> (failed handler for input port_group) took 9014ms
 2024-02-03T11:09:23.773Z|01691|timeval|WARN|Unreasonably long 9118ms
> poll
>>> interval (5376ms user, 1515ms system)
 ...
 2024-02-03T11:09:36.692Z|01699|inc_proc_eng|INFO|node: lflow,
> recompute
>>> (failed handler for input port_group) took 10695ms
 2024-02-03T11:09:36.696Z|01700|timeval|WARN|Unreasonably long 10890ms
>>> poll interval (6085ms user, 2745ms system)
 ...
 2024-02-03T11:09:49.133Z|01708|inc_proc_eng|INFO|node: lflow,
> recompute
>>> (failed handler for input port_group) took 9985ms
 2024-02-03T11:09:49.137Z|01709|timeval|WARN|Unreasonably long 10108ms
>>> poll interval (5521ms user, 2440ms system)

 That increases the 95th percentile ovn-installed latency in 500node
 cluster-density from 3.6 seconds last week to 21.5 seconds this week.

 I think, this should be a release blocker.

 Memory usage is also very concerning.  Unfortunately it is not tied
> to the
 cluster-density test.  The same 4-5x RSS jump is also seen in other
> test
 like density-heavy.  Last week RSS of ovn-northd in cluster-density
> 500
>>> node
 was between 1.5 and 2.5 GB, this week we have a range between 5.5 and
> 8.5
>>> GB.

 I would consider this as a release blocker as well.


 I agree, we shouldn't release 24.03.0 unless these two issues are
 (sufficiently) addressed.  We do have until March 1st (official release
 date) to do that or to revert any patches that cause regressions.


 I don't have direct evidence that this particular series is a
> culprit, but
 it looks like the most likely candidate.  I can dig more into
>>> investigation
 on Monday.

 Best regards, Ilya Maximets.
>>>
>>> Thanks Ilya for reporting this. 95% latency and 4x RSS increase is a
> little
>>> surprising to me. I 

Re: [ovs-dev] [PATCH ovn v6 00/13] northd lflow incremental processing

2024-02-05 Thread Ilya Maximets
On 2/5/24 11:34, Ilya Maximets wrote:
> On 2/5/24 09:23, Dumitru Ceara wrote:
>> On 2/5/24 08:13, Han Zhou wrote:
>>> On Sun, Feb 4, 2024 at 9:26 PM Numan Siddique  wrote:

 On Sun, Feb 4, 2024 at 9:53 PM Han Zhou  wrote:
>
> On Sun, Feb 4, 2024 at 5:46 AM Ilya Maximets  wrote:
>>
>
>>  35 files changed, 9681 insertions(+), 4645 deletions(-)
>
> I had another look at this series and acked the remaining
>>> patches.  I
> just had some minor comments that can be easily fixed when
>>> applying
> the
> patches to the main branch.
>
> Thanks for all the work on this!  It was a very large change but
>>> it
> improves northd performance significantly.  I just hope we don't
> introduce too many bugs.  Hopefully the time we have until release
> will
> allow us to further test this change on the 24.03 branch.
>
> Regards,
> Dumitru



 Thanks a lot Dumitru and Han for the reviews and patience.

 I addressed the comments and applied the patches to main and also
>>> to
>>> branch-24.03.

 @Han - I know you wanted to take another look in to v6.  I didn't
>>> want
> to
>>> delay further as branch-24.03 was created.  I'm more than happy to
> submit
>>> follow up patches if you have any comments to address.  Please let
>>> me
> know.

>>>
>>> Hi Numan,
>>>
>>> I was writing the reply and saw your email just now. Thanks a lot
>>> for
>>> taking a huge effort to achieve the great optimization. I only left
>>> one
>>> comment on the implicit dependency left for the en_lrnat ->
>>> en_lflow.
> Feel
>>> free to address it with a followup and no need to block the
>>> branching.
> And
>>> take my Ack for the series with that addressed.
>>>
>>> Acked-by: Han Zhou 
>>
>>
>> Hi, Numan, Dumitru and Han.
>>
>> I see a huge negative performance impact, most likely from this set,
>>> on
>> ovn-heater's cluster-density tests.  The memory consumption on northd
>>
>> Thanks for reporting this, Ilya!
>>
>> jumped about 4x and it constantly recomputes due to failures of
>>> port_group
>> handler:
>>
>> 2024-02-03T11:09:12.441Z|01680|inc_proc_eng|INFO|node: lflow,
>>> recompute
> (failed handler for input port_group) took 9762ms
>> 2024-02-03T11:09:12.444Z|01681|timeval|WARN|Unreasonably long 9898ms
>>> poll
> interval (5969ms user, 1786ms system)
>> ...
>> 2024-02-03T11:09:23.770Z|01690|inc_proc_eng|INFO|node: lflow,
>>> recompute
> (failed handler for input port_group) took 9014ms
>> 2024-02-03T11:09:23.773Z|01691|timeval|WARN|Unreasonably long 9118ms
>>> poll
> interval (5376ms user, 1515ms system)
>> ...
>> 2024-02-03T11:09:36.692Z|01699|inc_proc_eng|INFO|node: lflow,
>>> recompute
> (failed handler for input port_group) took 10695ms
>> 2024-02-03T11:09:36.696Z|01700|timeval|WARN|Unreasonably long 10890ms
> poll interval (6085ms user, 2745ms system)
>> ...
>> 2024-02-03T11:09:49.133Z|01708|inc_proc_eng|INFO|node: lflow,
>>> recompute
> (failed handler for input port_group) took 9985ms
>> 2024-02-03T11:09:49.137Z|01709|timeval|WARN|Unreasonably long 10108ms
> poll interval (5521ms user, 2440ms system)
>>
>> That increases the 95th percentile ovn-installed latency in 500node
>> cluster-density from 3.6 seconds last week to 21.5 seconds this week.
>>
>> I think, this should be a release blocker.
>>
>> Memory usage is also very concerning.  Unfortunately it is not tied
>>> to the
>> cluster-density test.  The same 4-5x RSS jump is also seen in other
>>> test
>> like density-heavy.  Last week RSS of ovn-northd in cluster-density
>>> 500
> node
>> was between 1.5 and 2.5 GB, this week we have a range between 5.5 and
>>> 8.5
> GB.
>>
>> I would consider this as a release blocker as well.
>>
>>
>> I agree, we shouldn't release 24.03.0 unless these two issues are
>> (sufficiently) addressed.  We do have until March 1st (official release
>> date) to do that or to revert any patches that cause regressions.
>>
>>
>> I don't have direct evidence that this particular series is a
>>> culprit, but
>> it looks like the most likely candidate.  I can dig more into
> investigation
>> on Monday.
>>
>> Best regards, Ilya Maximets.
>
> Thanks Ilya for reporting this. 95% latency and 4x RSS increase is a
>>> little
> surprising to me. I did test this series with my scale test scripts for
> recompute performance regression. It was 10+% increase in latency. I
>>> even
> digged a little into it, and noticed ~5% increase caused by the hmap
>>> used
> to maintain the lflows in each lflow_ref. This was discussed in the code
> review for an earlier version (v2/v3). Overall it 

Re: [ovs-dev] [PATCH ovn v6 00/13] northd lflow incremental processing

2024-02-05 Thread Ilya Maximets
On 2/5/24 11:58, Dumitru Ceara wrote:
> On 2/5/24 11:34, Ilya Maximets wrote:
>> On 2/5/24 09:23, Dumitru Ceara wrote:
>>> On 2/5/24 08:13, Han Zhou wrote:
 On Sun, Feb 4, 2024 at 9:26 PM Numan Siddique  wrote:
>
> On Sun, Feb 4, 2024 at 9:53 PM Han Zhou  wrote:
>>
>> On Sun, Feb 4, 2024 at 5:46 AM Ilya Maximets  wrote:
>>>
>>
>>>  35 files changed, 9681 insertions(+), 4645 deletions(-)
>>
>> I had another look at this series and acked the remaining
 patches.  I
>> just had some minor comments that can be easily fixed when
 applying
>> the
>> patches to the main branch.
>>
>> Thanks for all the work on this!  It was a very large change but
 it
>> improves northd performance significantly.  I just hope we don't
>> introduce too many bugs.  Hopefully the time we have until release
>> will
>> allow us to further test this change on the 24.03 branch.
>>
>> Regards,
>> Dumitru
>
>
>
> Thanks a lot Dumitru and Han for the reviews and patience.
>
> I addressed the comments and applied the patches to main and also
 to
 branch-24.03.
>
> @Han - I know you wanted to take another look in to v6.  I didn't
 want
>> to
 delay further as branch-24.03 was created.  I'm more than happy to
>> submit
 follow up patches if you have any comments to address.  Please let
 me
>> know.
>

 Hi Numan,

 I was writing the reply and saw your email just now. Thanks a lot
 for
 taking a huge effort to achieve the great optimization. I only left
 one
 comment on the implicit dependency left for the en_lrnat ->
 en_lflow.
>> Feel
 free to address it with a followup and no need to block the
 branching.
>> And
 take my Ack for the series with that addressed.

 Acked-by: Han Zhou 
>>>
>>>
>>> Hi, Numan, Dumitru and Han.
>>>
>>> I see a huge negative performance impact, most likely from this set,
 on
>>> ovn-heater's cluster-density tests.  The memory consumption on northd
>>>
>>> Thanks for reporting this, Ilya!
>>>
>>> jumped about 4x and it constantly recomputes due to failures of
 port_group
>>> handler:
>>>
>>> 2024-02-03T11:09:12.441Z|01680|inc_proc_eng|INFO|node: lflow,
 recompute
>> (failed handler for input port_group) took 9762ms
>>> 2024-02-03T11:09:12.444Z|01681|timeval|WARN|Unreasonably long 9898ms
 poll
>> interval (5969ms user, 1786ms system)
>>> ...
>>> 2024-02-03T11:09:23.770Z|01690|inc_proc_eng|INFO|node: lflow,
 recompute
>> (failed handler for input port_group) took 9014ms
>>> 2024-02-03T11:09:23.773Z|01691|timeval|WARN|Unreasonably long 9118ms
 poll
>> interval (5376ms user, 1515ms system)
>>> ...
>>> 2024-02-03T11:09:36.692Z|01699|inc_proc_eng|INFO|node: lflow,
 recompute
>> (failed handler for input port_group) took 10695ms
>>> 2024-02-03T11:09:36.696Z|01700|timeval|WARN|Unreasonably long 10890ms
>> poll interval (6085ms user, 2745ms system)
>>> ...
>>> 2024-02-03T11:09:49.133Z|01708|inc_proc_eng|INFO|node: lflow,
 recompute
>> (failed handler for input port_group) took 9985ms
>>> 2024-02-03T11:09:49.137Z|01709|timeval|WARN|Unreasonably long 10108ms
>> poll interval (5521ms user, 2440ms system)
>>>
>>> That increases the 95th percentile ovn-installed latency in 500node
>>> cluster-density from 3.6 seconds last week to 21.5 seconds this week.
>>>
>>> I think, this should be a release blocker.
>>>
>>> Memory usage is also very concerning.  Unfortunately it is not tied
 to the
>>> cluster-density test.  The same 4-5x RSS jump is also seen in other
 test
>>> like density-heavy.  Last week RSS of ovn-northd in cluster-density
 500
>> node
>>> was between 1.5 and 2.5 GB, this week we have a range between 5.5 and
 8.5
>> GB.
>>>
>>> I would consider this as a release blocker as well.
>>>
>>>
>>> I agree, we shouldn't release 24.03.0 unless these two issues are
>>> (sufficiently) addressed.  We do have until March 1st (official release
>>> date) to do that or to revert any patches that cause regressions.
>>>
>>>
>>> I don't have direct evidence that this particular series is a
 culprit, but
>>> it looks like the most likely candidate.  I can dig more into
>> investigation
>>> on Monday.
>>>
>>> Best regards, Ilya Maximets.
>>
>> Thanks Ilya for reporting this. 95% latency and 4x RSS increase is a
 little
>> surprising to me. I did test this series with my scale test scripts for
>> recompute performance regression. It was 10+% increase in latency. I
 even
>> digged a little into it, and 

Re: [ovs-dev] [PATCH ovn v6 00/13] northd lflow incremental processing

2024-02-05 Thread Dumitru Ceara
On 2/5/24 11:34, Ilya Maximets wrote:
> On 2/5/24 09:23, Dumitru Ceara wrote:
>> On 2/5/24 08:13, Han Zhou wrote:
>>> On Sun, Feb 4, 2024 at 9:26 PM Numan Siddique  wrote:

 On Sun, Feb 4, 2024 at 9:53 PM Han Zhou  wrote:
>
> On Sun, Feb 4, 2024 at 5:46 AM Ilya Maximets  wrote:
>>
>
>>  35 files changed, 9681 insertions(+), 4645 deletions(-)
>
> I had another look at this series and acked the remaining
>>> patches.  I
> just had some minor comments that can be easily fixed when
>>> applying
> the
> patches to the main branch.
>
> Thanks for all the work on this!  It was a very large change but
>>> it
> improves northd performance significantly.  I just hope we don't
> introduce too many bugs.  Hopefully the time we have until release
> will
> allow us to further test this change on the 24.03 branch.
>
> Regards,
> Dumitru



 Thanks a lot Dumitru and Han for the reviews and patience.

 I addressed the comments and applied the patches to main and also
>>> to
>>> branch-24.03.

 @Han - I know you wanted to take another look in to v6.  I didn't
>>> want
> to
>>> delay further as branch-24.03 was created.  I'm more than happy to
> submit
>>> follow up patches if you have any comments to address.  Please let
>>> me
> know.

>>>
>>> Hi Numan,
>>>
>>> I was writing the reply and saw your email just now. Thanks a lot
>>> for
>>> taking a huge effort to achieve the great optimization. I only left
>>> one
>>> comment on the implicit dependency left for the en_lrnat ->
>>> en_lflow.
> Feel
>>> free to address it with a followup and no need to block the
>>> branching.
> And
>>> take my Ack for the series with that addressed.
>>>
>>> Acked-by: Han Zhou 
>>
>>
>> Hi, Numan, Dumitru and Han.
>>
>> I see a huge negative performance impact, most likely from this set,
>>> on
>> ovn-heater's cluster-density tests.  The memory consumption on northd
>>
>> Thanks for reporting this, Ilya!
>>
>> jumped about 4x and it constantly recomputes due to failures of
>>> port_group
>> handler:
>>
>> 2024-02-03T11:09:12.441Z|01680|inc_proc_eng|INFO|node: lflow,
>>> recompute
> (failed handler for input port_group) took 9762ms
>> 2024-02-03T11:09:12.444Z|01681|timeval|WARN|Unreasonably long 9898ms
>>> poll
> interval (5969ms user, 1786ms system)
>> ...
>> 2024-02-03T11:09:23.770Z|01690|inc_proc_eng|INFO|node: lflow,
>>> recompute
> (failed handler for input port_group) took 9014ms
>> 2024-02-03T11:09:23.773Z|01691|timeval|WARN|Unreasonably long 9118ms
>>> poll
> interval (5376ms user, 1515ms system)
>> ...
>> 2024-02-03T11:09:36.692Z|01699|inc_proc_eng|INFO|node: lflow,
>>> recompute
> (failed handler for input port_group) took 10695ms
>> 2024-02-03T11:09:36.696Z|01700|timeval|WARN|Unreasonably long 10890ms
> poll interval (6085ms user, 2745ms system)
>> ...
>> 2024-02-03T11:09:49.133Z|01708|inc_proc_eng|INFO|node: lflow,
>>> recompute
> (failed handler for input port_group) took 9985ms
>> 2024-02-03T11:09:49.137Z|01709|timeval|WARN|Unreasonably long 10108ms
> poll interval (5521ms user, 2440ms system)
>>
>> That increases 95%ile ovn-installed latency in 500node cluster-density
>>> from
>> 3.6 seconds last week to 21.5 seconds this week.
>>
>> I think, this should be a release blocker.
>>
>> Memory usage is also very concerning.  Unfortunately it is not tied
>>> to the
>> cluster-density test.  The same 4-5x RSS jump is also seen in other tests
>> like density-heavy.  Last week RSS of ovn-northd in cluster-density
>>> 500
> node
>> was between 1.5 and 2.5 GB, this week we have a range between 5.5 and
>>> 8.5
> GB.
>>
>> I would consider this as a release blocker as well.
>>
>>
>> I agree, we shouldn't release 24.03.0 unless these two issues are
>> (sufficiently) addressed.  We do have until March 1st (official release
>> date) to do that or to revert any patches that cause regressions.
>>
>>
>> I don't have direct evidence that this particular series is a
>>> culprit, but
>> it looks like the most likely candidate.  I can dig more into
> investigation
>> on Monday.
>>
>> Best regards, Ilya Maximets.
>
> Thanks Ilya for reporting this. 95% latency and 4x RSS increase is a
>>> little
> surprising to me. I did test this series with my scale test scripts for
> recompute performance regression. It was 10+% increase in latency. I
>>> even
> dug a little into it, and noticed a ~5% increase caused by the hmap
>>> used
> to maintain the lflows in each lflow_ref. This was discussed in the code
> review for an earlier version (v2/v3). Overall it 

Re: [ovs-dev] [PATCH ovn v6 00/13] northd lflow incremental processing

2024-02-05 Thread Ilya Maximets
On 2/5/24 09:23, Dumitru Ceara wrote:
> On 2/5/24 08:13, Han Zhou wrote:
>> On Sun, Feb 4, 2024 at 9:26 PM Numan Siddique  wrote:
>>>
>>> On Sun, Feb 4, 2024 at 9:53 PM Han Zhou  wrote:

 On Sun, Feb 4, 2024 at 5:46 AM Ilya Maximets  wrote:
>

>  35 files changed, 9681 insertions(+), 4645 deletions(-)

 I had another look at this series and acked the remaining
>> patches.  I
 just had some minor comments that can be easily fixed when
>> applying
 the
 patches to the main branch.

 Thanks for all the work on this!  It was a very large change but
>> it
 improves northd performance significantly.  I just hope we don't
 introduce too many bugs.  Hopefully the time we have until release
 will
 allow us to further test this change on the 24.03 branch.

 Regards,
 Dumitru
>>>
>>>
>>>
>>> Thanks a lot Dumitru and Han for the reviews and patience.
>>>
>>> I addressed the comments and applied the patches to main and also
>> to
>> branch-24.03.
>>>
>>> @Han - I know you wanted to take another look in to v6.  I didn't
>> want
 to
>> delay further as branch-24.03 was created.  I'm more than happy to
 submit
>> follow up patches if you have any comments to address.  Please let
>> me
 know.
>>>
>>
>> Hi Numan,
>>
>> I was writing the reply and saw your email just now. Thanks a lot
>> for
>> taking a huge effort to achieve the great optimization. I only left
>> one
>> comment on the implicit dependency left for the en_lrnat ->
>> en_lflow.
 Feel
>> free to address it with a followup and no need to block the
>> branching.
 And
>> take my Ack for the series with that addressed.
>>
>> Acked-by: Han Zhou 
>
>
> Hi, Numan, Dumitru and Han.
>
> I see a huge negative performance impact, most likely from this set,
>> on
> ovn-heater's cluster-density tests.  The memory consumption on northd
> 
> Thanks for reporting this, Ilya!
> 
> jumped about 4x and it constantly recomputes due to failures of
>> port_group
> handler:
>
> 2024-02-03T11:09:12.441Z|01680|inc_proc_eng|INFO|node: lflow,
>> recompute
 (failed handler for input port_group) took 9762ms
> 2024-02-03T11:09:12.444Z|01681|timeval|WARN|Unreasonably long 9898ms
>> poll
 interval (5969ms user, 1786ms system)
> ...
> 2024-02-03T11:09:23.770Z|01690|inc_proc_eng|INFO|node: lflow,
>> recompute
 (failed handler for input port_group) took 9014ms
> 2024-02-03T11:09:23.773Z|01691|timeval|WARN|Unreasonably long 9118ms
>> poll
 interval (5376ms user, 1515ms system)
> ...
> 2024-02-03T11:09:36.692Z|01699|inc_proc_eng|INFO|node: lflow,
>> recompute
 (failed handler for input port_group) took 10695ms
> 2024-02-03T11:09:36.696Z|01700|timeval|WARN|Unreasonably long 10890ms
 poll interval (6085ms user, 2745ms system)
> ...
> 2024-02-03T11:09:49.133Z|01708|inc_proc_eng|INFO|node: lflow,
>> recompute
 (failed handler for input port_group) took 9985ms
> 2024-02-03T11:09:49.137Z|01709|timeval|WARN|Unreasonably long 10108ms
 poll interval (5521ms user, 2440ms system)
>
> That increases 95%ile ovn-installed latency in 500node cluster-density
>> from
> 3.6 seconds last week to 21.5 seconds this week.
>
> I think, this should be a release blocker.
>
> Memory usage is also very concerning.  Unfortunately it is not tied
>> to the
> cluster-density test.  The same 4-5x RSS jump is also seen in other tests
> like density-heavy.  Last week RSS of ovn-northd in cluster-density
>> 500
 node
> was between 1.5 and 2.5 GB, this week we have a range between 5.5 and
>> 8.5
 GB.
>
> I would consider this as a release blocker as well.
>
> 
> I agree, we shouldn't release 24.03.0 unless these two issues are
> (sufficiently) addressed.  We do have until March 1st (official release
> date) to do that or to revert any patches that cause regressions.
> 
>
> I don't have direct evidence that this particular series is a
>> culprit, but
> it looks like the most likely candidate.  I can dig more into
 investigation
> on Monday.
>
> Best regards, Ilya Maximets.

 Thanks Ilya for reporting this. 95% latency and 4x RSS increase is a
>> little
 surprising to me. I did test this series with my scale test scripts for
 recompute performance regression. It was 10+% increase in latency. I
>> even
 dug a little into it, and noticed a ~5% increase caused by the hmap
>> used
 to maintain the lflows in each lflow_ref. This was discussed in the code
 review for an earlier version (v2/v3). Overall it looked not very bad,
>> if
 we now handle most common scenarios incrementally, and it is reasonable
>> to
 have some cost for maintaining the references/index for 

Re: [ovs-dev] [PATCH ovn v6 00/13] northd lflow incremental processing

2024-02-05 Thread Dumitru Ceara
On 2/5/24 08:13, Han Zhou wrote:
> On Sun, Feb 4, 2024 at 9:26 PM Numan Siddique  wrote:
>>
>> On Sun, Feb 4, 2024 at 9:53 PM Han Zhou  wrote:
>>>
>>> On Sun, Feb 4, 2024 at 5:46 AM Ilya Maximets  wrote:

>>>
  35 files changed, 9681 insertions(+), 4645 deletions(-)
>>>
>>> I had another look at this series and acked the remaining
> patches.  I
>>> just had some minor comments that can be easily fixed when
> applying
>>> the
>>> patches to the main branch.
>>>
>>> Thanks for all the work on this!  It was a very large change but
> it
>>> improves northd performance significantly.  I just hope we don't
>>> introduce too many bugs.  Hopefully the time we have until release
>>> will
>>> allow us to further test this change on the 24.03 branch.
>>>
>>> Regards,
>>> Dumitru
>>
>>
>>
>> Thanks a lot Dumitru and Han for the reviews and patience.
>>
>> I addressed the comments and applied the patches to main and also
> to
> branch-24.03.
>>
>> @Han - I know you wanted to take another look in to v6.  I didn't
> want
>>> to
> delay further as branch-24.03 was created.  I'm more than happy to
>>> submit
> follow up patches if you have any comments to address.  Please let
> me
>>> know.
>>
>
> Hi Numan,
>
> I was writing the reply and saw your email just now. Thanks a lot
> for
> taking a huge effort to achieve the great optimization. I only left
> one
> comment on the implicit dependency left for the en_lrnat ->
> en_lflow.
>>> Feel
> free to address it with a followup and no need to block the
> branching.
>>> And
> take my Ack for the series with that addressed.
>
> Acked-by: Han Zhou 


 Hi, Numan, Dumitru and Han.

 I see a huge negative performance impact, most likely from this set,
> on
 ovn-heater's cluster-density tests.  The memory consumption on northd

Thanks for reporting this, Ilya!

 jumped about 4x and it constantly recomputes due to failures of
> port_group
 handler:

 2024-02-03T11:09:12.441Z|01680|inc_proc_eng|INFO|node: lflow,
> recompute
>>> (failed handler for input port_group) took 9762ms
 2024-02-03T11:09:12.444Z|01681|timeval|WARN|Unreasonably long 9898ms
> poll
>>> interval (5969ms user, 1786ms system)
 ...
 2024-02-03T11:09:23.770Z|01690|inc_proc_eng|INFO|node: lflow,
> recompute
>>> (failed handler for input port_group) took 9014ms
 2024-02-03T11:09:23.773Z|01691|timeval|WARN|Unreasonably long 9118ms
> poll
>>> interval (5376ms user, 1515ms system)
 ...
 2024-02-03T11:09:36.692Z|01699|inc_proc_eng|INFO|node: lflow,
> recompute
>>> (failed handler for input port_group) took 10695ms
 2024-02-03T11:09:36.696Z|01700|timeval|WARN|Unreasonably long 10890ms
>>> poll interval (6085ms user, 2745ms system)
 ...
 2024-02-03T11:09:49.133Z|01708|inc_proc_eng|INFO|node: lflow,
> recompute
>>> (failed handler for input port_group) took 9985ms
 2024-02-03T11:09:49.137Z|01709|timeval|WARN|Unreasonably long 10108ms
>>> poll interval (5521ms user, 2440ms system)

 That increases 95%ile ovn-installed latency in 500node cluster-density
> from
 3.6 seconds last week to 21.5 seconds this week.

 I think, this should be a release blocker.

 Memory usage is also very concerning.  Unfortunately it is not tied
> to the
 cluster-density test.  The same 4-5x RSS jump is also seen in other tests
 like density-heavy.  Last week RSS of ovn-northd in cluster-density
> 500
>>> node
 was between 1.5 and 2.5 GB, this week we have a range between 5.5 and
> 8.5
>>> GB.

 I would consider this as a release blocker as well.


I agree, we shouldn't release 24.03.0 unless these two issues are
(sufficiently) addressed.  We do have until March 1st (official release
date) to do that or to revert any patches that cause regressions.


 I don't have direct evidence that this particular series is a
> culprit, but
 it looks like the most likely candidate.  I can dig more into
>>> investigation
 on Monday.

 Best regards, Ilya Maximets.
>>>
>>> Thanks Ilya for reporting this. 95% latency and 4x RSS increase is a
> little
>>> surprising to me. I did test this series with my scale test scripts for
>>> recompute performance regression. It was 10+% increase in latency. I
> even
>>> dug a little into it, and noticed a ~5% increase caused by the hmap
> used
>>> to maintain the lflows in each lflow_ref. This was discussed in the code
>>> review for an earlier version (v2/v3). Overall it looked not very bad,
> if
>>> we now handle most common scenarios incrementally, and it is reasonable
> to
>>> have some cost for maintaining the references/index for incremental
>>> processing. I wonder if my test scenario was too simple (didn't have LBs
>>> included) to find the problems, so today I did another test by
> including a
>>> LB group with 1k LBs 

Re: [ovs-dev] [PATCH ovn v6 00/13] northd lflow incremental processing

2024-02-04 Thread Han Zhou
On Sun, Feb 4, 2024 at 9:26 PM Numan Siddique  wrote:
>
> On Sun, Feb 4, 2024 at 9:53 PM Han Zhou  wrote:
> >
> > On Sun, Feb 4, 2024 at 5:46 AM Ilya Maximets  wrote:
> > >
> > > >>>
> > > >>> >  35 files changed, 9681 insertions(+), 4645 deletions(-)
> > > >>>
> > > >>> I had another look at this series and acked the remaining
patches.  I
> > > >>> just had some minor comments that can be easily fixed when
applying
> > the
> > > >>> patches to the main branch.
> > > >>>
> > > >>> Thanks for all the work on this!  It was a very large change but
it
> > > >>> improves northd performance significantly.  I just hope we don't
> > > >>> introduce too many bugs.  Hopefully the time we have until release
> > will
> > > >>> allow us to further test this change on the 24.03 branch.
> > > >>>
> > > >>> Regards,
> > > >>> Dumitru
> > > >>
> > > >>
> > > >>
> > > >> Thanks a lot Dumitru and Han for the reviews and patience.
> > > >>
> > > >> I addressed the comments and applied the patches to main and also
to
> > > > branch-24.03.
> > > >>
> > > >> @Han - I know you wanted to take another look in to v6.  I didn't
want
> > to
> > > > delay further as branch-24.03 was created.  I'm more than happy to
> > submit
> > > > follow up patches if you have any comments to address.  Please let
me
> > know.
> > > >>
> > > >
> > > > Hi Numan,
> > > >
> > > > I was writing the reply and saw your email just now. Thanks a lot
for
> > > > taking a huge effort to achieve the great optimization. I only left
one
> > > > comment on the implicit dependency left for the en_lrnat ->
en_lflow.
> > Feel
> > > > free to address it with a followup and no need to block the
branching.
> > And
> > > > take my Ack for the series with that addressed.
> > > >
> > > > Acked-by: Han Zhou 
> > >
> > >
> > > Hi, Numan, Dumitru and Han.
> > >
> > > I see a huge negative performance impact, most likely from this set,
on
> > > ovn-heater's cluster-density tests.  The memory consumption on northd
> > > jumped about 4x and it constantly recomputes due to failures of
port_group
> > > handler:
> > >
> > > 2024-02-03T11:09:12.441Z|01680|inc_proc_eng|INFO|node: lflow,
recompute
> > (failed handler for input port_group) took 9762ms
> > > 2024-02-03T11:09:12.444Z|01681|timeval|WARN|Unreasonably long 9898ms
poll
> > interval (5969ms user, 1786ms system)
> > > ...
> > > 2024-02-03T11:09:23.770Z|01690|inc_proc_eng|INFO|node: lflow,
recompute
> > (failed handler for input port_group) took 9014ms
> > > 2024-02-03T11:09:23.773Z|01691|timeval|WARN|Unreasonably long 9118ms
poll
> > interval (5376ms user, 1515ms system)
> > > ...
> > > 2024-02-03T11:09:36.692Z|01699|inc_proc_eng|INFO|node: lflow,
recompute
> > (failed handler for input port_group) took 10695ms
> > > 2024-02-03T11:09:36.696Z|01700|timeval|WARN|Unreasonably long 10890ms
> > poll interval (6085ms user, 2745ms system)
> > > ...
> > > 2024-02-03T11:09:49.133Z|01708|inc_proc_eng|INFO|node: lflow,
recompute
> > (failed handler for input port_group) took 9985ms
> > > 2024-02-03T11:09:49.137Z|01709|timeval|WARN|Unreasonably long 10108ms
> > poll interval (5521ms user, 2440ms system)
> > >
> > > That increases 95%ile ovn-installed latency in 500node cluster-density
from
> > > 3.6 seconds last week to 21.5 seconds this week.
> > >
> > > I think, this should be a release blocker.
> > >
> > > Memory usage is also very concerning.  Unfortunately it is not tied
to the
> > > cluster-density test.  The same 4-5x RSS jump is also seen in other tests
> > > like density-heavy.  Last week RSS of ovn-northd in cluster-density
500
> > node
> > > was between 1.5 and 2.5 GB, this week we have a range between 5.5 and
8.5
> > GB.
> > >
> > > I would consider this as a release blocker as well.
> > >
> > >
> > > I don't have direct evidence that this particular series is a
culprit, but
> > > it looks like the most likely candidate.  I can dig more into
> > investigation
> > > on Monday.
> > >
> > > Best regards, Ilya Maximets.
> >
> > Thanks Ilya for reporting this. 95% latency and 4x RSS increase is a
little
> > surprising to me. I did test this series with my scale test scripts for
> > recompute performance regression. It was 10+% increase in latency. I
even
> > dug a little into it, and noticed a ~5% increase caused by the hmap
used
> > to maintain the lflows in each lflow_ref. This was discussed in the code
> > review for an earlier version (v2/v3). Overall it looked not very bad,
if
> > we now handle most common scenarios incrementally, and it is reasonable
to
> > have some cost for maintaining the references/index for incremental
> > processing. I wonder if my test scenario was too simple (didn't have LBs
> > included) to find the problems, so today I did another test by
including a
> > LB group with 1k LBs applied to 100 node-LS & GR, and another 1K LBs per
> > node-LS & GR (101K LBs in total), and I did see more performance penalty
> > but still within ~20%. While for memory I didn't notice a significant
> > increase (<10%). I 

Re: [ovs-dev] [PATCH ovn v6 00/13] northd lflow incremental processing

2024-02-04 Thread Numan Siddique
On Sun, Feb 4, 2024 at 9:53 PM Han Zhou  wrote:
>
> On Sun, Feb 4, 2024 at 5:46 AM Ilya Maximets  wrote:
> >
> > >>>
> > >>> >  35 files changed, 9681 insertions(+), 4645 deletions(-)
> > >>>
> > >>> I had another look at this series and acked the remaining patches.  I
> > >>> just had some minor comments that can be easily fixed when applying
> the
> > >>> patches to the main branch.
> > >>>
> > >>> Thanks for all the work on this!  It was a very large change but it
> > >>> improves northd performance significantly.  I just hope we don't
> > >>> introduce too many bugs.  Hopefully the time we have until release
> will
> > >>> allow us to further test this change on the 24.03 branch.
> > >>>
> > >>> Regards,
> > >>> Dumitru
> > >>
> > >>
> > >>
> > >> Thanks a lot Dumitru and Han for the reviews and patience.
> > >>
> > >> I addressed the comments and applied the patches to main and also to
> > > branch-24.03.
> > >>
> > >> @Han - I know you wanted to take another look in to v6.  I didn't want
> to
> > > delay further as branch-24.03 was created.  I'm more than happy to
> submit
> > > follow up patches if you have any comments to address.  Please let me
> know.
> > >>
> > >
> > > Hi Numan,
> > >
> > > I was writing the reply and saw your email just now. Thanks a lot for
> > > taking a huge effort to achieve the great optimization. I only left one
> > > comment on the implicit dependency left for the en_lrnat -> en_lflow.
> Feel
> > > free to address it with a followup and no need to block the branching.
> And
> > > take my Ack for the series with that addressed.
> > >
> > > Acked-by: Han Zhou 
> >
> >
> > Hi, Numan, Dumitru and Han.
> >
> > I see a huge negative performance impact, most likely from this set, on
> > ovn-heater's cluster-density tests.  The memory consumption on northd
> > jumped about 4x and it constantly recomputes due to failures of port_group
> > handler:
> >
> > 2024-02-03T11:09:12.441Z|01680|inc_proc_eng|INFO|node: lflow, recompute
> (failed handler for input port_group) took 9762ms
> > 2024-02-03T11:09:12.444Z|01681|timeval|WARN|Unreasonably long 9898ms poll
> interval (5969ms user, 1786ms system)
> > ...
> > 2024-02-03T11:09:23.770Z|01690|inc_proc_eng|INFO|node: lflow, recompute
> (failed handler for input port_group) took 9014ms
> > 2024-02-03T11:09:23.773Z|01691|timeval|WARN|Unreasonably long 9118ms poll
> interval (5376ms user, 1515ms system)
> > ...
> > 2024-02-03T11:09:36.692Z|01699|inc_proc_eng|INFO|node: lflow, recompute
> (failed handler for input port_group) took 10695ms
> > 2024-02-03T11:09:36.696Z|01700|timeval|WARN|Unreasonably long 10890ms
> poll interval (6085ms user, 2745ms system)
> > ...
> > 2024-02-03T11:09:49.133Z|01708|inc_proc_eng|INFO|node: lflow, recompute
> (failed handler for input port_group) took 9985ms
> > 2024-02-03T11:09:49.137Z|01709|timeval|WARN|Unreasonably long 10108ms
> poll interval (5521ms user, 2440ms system)
> >
> > That increases 95%ile ovn-installed latency in 500node cluster-density from
> > 3.6 seconds last week to 21.5 seconds this week.
> >
> > I think, this should be a release blocker.
> >
> > Memory usage is also very concerning.  Unfortunately it is not tied to the
> > cluster-density test.  The same 4-5x RSS jump is also seen in other tests
> > like density-heavy.  Last week RSS of ovn-northd in cluster-density 500
> node
> > was between 1.5 and 2.5 GB, this week we have a range between 5.5 and 8.5
> GB.
> >
> > I would consider this as a release blocker as well.
> >
> >
> > I don't have direct evidence that this particular series is a culprit, but
> > it looks like the most likely candidate.  I can dig more into
> investigation
> > on Monday.
> >
> > Best regards, Ilya Maximets.
>
> Thanks Ilya for reporting this. 95% latency and 4x RSS increase is a little
> surprising to me. I did test this series with my scale test scripts for
> recompute performance regression. It was 10+% increase in latency. I even
> dug a little into it, and noticed a ~5% increase caused by the hmap used
> to maintain the lflows in each lflow_ref. This was discussed in the code
> review for an earlier version (v2/v3). Overall it looked not very bad, if
> we now handle most common scenarios incrementally, and it is reasonable to
> have some cost for maintaining the references/index for incremental
> processing. I wonder if my test scenario was too simple (didn't have LBs
> included) to find the problems, so today I did another test by including a
> LB group with 1k LBs applied to 100 node-LS & GR, and another 1K LBs per
> node-LS & GR (101K LBs in total), and I did see more performance penalty
> but still within ~20%. While for memory I didn't notice a significant
> increase (<10%). I believe I am missing some specific scenario that had the
> big impact in the ovn-heater's tests. Please share if you dig out more
> clues.

Hi Ilya,

Thanks for reporting these details.

I had a look at this regression.  There is a significant increase in the lflow recompute time 

Re: [ovs-dev] [PATCH ovn v6 00/13] northd lflow incremental processing

2024-02-04 Thread Han Zhou
On Sun, Feb 4, 2024 at 5:46 AM Ilya Maximets  wrote:
>
> >>>
> >>> >  35 files changed, 9681 insertions(+), 4645 deletions(-)
> >>>
> >>> I had another look at this series and acked the remaining patches.  I
> >>> just had some minor comments that can be easily fixed when applying
the
> >>> patches to the main branch.
> >>>
> >>> Thanks for all the work on this!  It was a very large change but it
> >>> improves northd performance significantly.  I just hope we don't
> >>> introduce too many bugs.  Hopefully the time we have until release
will
> >>> allow us to further test this change on the 24.03 branch.
> >>>
> >>> Regards,
> >>> Dumitru
> >>
> >>
> >>
> >> Thanks a lot Dumitru and Han for the reviews and patience.
> >>
> >> I addressed the comments and applied the patches to main and also to
> > branch-24.03.
> >>
> >> @Han - I know you wanted to take another look in to v6.  I didn't want
to
> > delay further as branch-24.03 was created.  I'm more than happy to
submit
> > follow up patches if you have any comments to address.  Please let me
know.
> >>
> >
> > Hi Numan,
> >
> > I was writing the reply and saw your email just now. Thanks a lot for
> > taking a huge effort to achieve the great optimization. I only left one
> > comment on the implicit dependency left for the en_lrnat -> en_lflow.
Feel
> > free to address it with a followup and no need to block the branching.
And
> > take my Ack for the series with that addressed.
> >
> > Acked-by: Han Zhou 
>
>
> Hi, Numan, Dumitru and Han.
>
> I see a huge negative performance impact, most likely from this set, on
> ovn-heater's cluster-density tests.  The memory consumption on northd
> jumped about 4x and it constantly recomputes due to failures of port_group
> handler:
>
> 2024-02-03T11:09:12.441Z|01680|inc_proc_eng|INFO|node: lflow, recompute
(failed handler for input port_group) took 9762ms
> 2024-02-03T11:09:12.444Z|01681|timeval|WARN|Unreasonably long 9898ms poll
interval (5969ms user, 1786ms system)
> ...
> 2024-02-03T11:09:23.770Z|01690|inc_proc_eng|INFO|node: lflow, recompute
(failed handler for input port_group) took 9014ms
> 2024-02-03T11:09:23.773Z|01691|timeval|WARN|Unreasonably long 9118ms poll
interval (5376ms user, 1515ms system)
> ...
> 2024-02-03T11:09:36.692Z|01699|inc_proc_eng|INFO|node: lflow, recompute
(failed handler for input port_group) took 10695ms
> 2024-02-03T11:09:36.696Z|01700|timeval|WARN|Unreasonably long 10890ms
poll interval (6085ms user, 2745ms system)
> ...
> 2024-02-03T11:09:49.133Z|01708|inc_proc_eng|INFO|node: lflow, recompute
(failed handler for input port_group) took 9985ms
> 2024-02-03T11:09:49.137Z|01709|timeval|WARN|Unreasonably long 10108ms
poll interval (5521ms user, 2440ms system)
>
> That increases 95%ile ovn-installed latency in 500node cluster-density from
> 3.6 seconds last week to 21.5 seconds this week.
>
> I think, this should be a release blocker.
>
> Memory usage is also very concerning.  Unfortunately it is not tied to the
> cluster-density test.  The same 4-5x RSS jump is also seen in other tests
> like density-heavy.  Last week RSS of ovn-northd in cluster-density 500
node
> was between 1.5 and 2.5 GB, this week we have a range between 5.5 and 8.5
GB.
>
> I would consider this as a release blocker as well.
>
>
> I don't have direct evidence that this particular series is a culprit, but
> it looks like the most likely candidate.  I can dig more into
investigation
> on Monday.
>
> Best regards, Ilya Maximets.

Thanks Ilya for reporting this. 95% latency and 4x RSS increase is a little
surprising to me. I did test this series with my scale test scripts for
recompute performance regression. It was 10+% increase in latency. I even
dug a little into it, and noticed a ~5% increase caused by the hmap used
to maintain the lflows in each lflow_ref. This was discussed in the code
review for an earlier version (v2/v3). Overall it looked not very bad, if
we now handle most common scenarios incrementally, and it is reasonable to
have some cost for maintaining the references/index for incremental
processing. I wonder if my test scenario was too simple (didn't have LBs
included) to find the problems, so today I did another test by including a
LB group with 1k LBs applied to 100 node-LS & GR, and another 1K LBs per
node-LS & GR (101K LBs in total), and I did see more performance penalty
but still within ~20%. While for memory I didn't notice a significant
increase (<10%). I believe I am missing some specific scenario that had the
big impact in the ovn-heater's tests. Please share if you dig out more
clues.

Thanks,
Han
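
The per-lflow_ref hmap discussed above is essentially a per-object reference
set: every object that generates logical flows records which flows it
contributed, so that an incremental update can drop and re-add exactly those
flows.  Below is a minimal, self-contained sketch of that idea; the names and
layout are illustrative only and do not match the actual ovn-northd structures
or the OVS hmap library.

/*
 * Illustrative sketch of a per-object "lflow reference set".  Each object
 * keeps a small hash table with one node per logical flow it produced, so
 * reprocessing the object can remove exactly those flows.  Not the real
 * ovn-northd code.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define REF_BUCKETS 64

struct lflow {                      /* Stand-in for a logical flow. */
    char match[64];
    int users;                      /* Objects still referencing this flow. */
};

struct ref_node {                   /* One lflow referenced by one object. */
    struct lflow *lflow;
    struct ref_node *next;          /* Hash-bucket chain. */
};

struct lflow_ref {                  /* Per-object set of referenced lflows. */
    struct ref_node *buckets[REF_BUCKETS];
};

static size_t
bucket_of(const struct lflow *f)
{
    return ((uintptr_t) f >> 4) % REF_BUCKETS;
}

/* Record that the owner of 'ref' contributed flow 'f'. */
static void
lflow_ref_add(struct lflow_ref *ref, struct lflow *f)
{
    struct ref_node *n = malloc(sizeof *n);
    n->lflow = f;
    n->next = ref->buckets[bucket_of(f)];
    ref->buckets[bucket_of(f)] = n;
    f->users++;
}

/* Drop every flow this object contributed; unreferenced flows are freed. */
static void
lflow_ref_clear(struct lflow_ref *ref)
{
    for (size_t i = 0; i < REF_BUCKETS; i++) {
        struct ref_node *n = ref->buckets[i];
        while (n) {
            struct ref_node *next = n->next;
            if (--n->lflow->users == 0) {
                printf("deleting lflow: %s\n", n->lflow->match);
                free(n->lflow);
            }
            free(n);
            n = next;
        }
        ref->buckets[i] = NULL;
    }
}

int
main(void)
{
    struct lflow_ref ref = { { 0 } };
    struct lflow *f = calloc(1, sizeof *f);

    snprintf(f->match, sizeof f->match, "ip4.dst == 10.0.0.10");
    lflow_ref_add(&ref, f);
    lflow_ref_clear(&ref);          /* Simulates reprocessing the object. */
    return 0;
}

The cost Han describes comes from exactly this kind of bookkeeping: every flow
insertion during a recompute also pays for a reference-node allocation and a
hash insert, which is consistent with the ~5% recompute overhead he measured.
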
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn v6 00/13] northd lflow incremental processing

2024-02-04 Thread Ilya Maximets
>>>
>>> >  35 files changed, 9681 insertions(+), 4645 deletions(-)
>>>
>>> I had another look at this series and acked the remaining patches.  I
>>> just had some minor comments that can be easily fixed when applying the
>>> patches to the main branch.
>>>
>>> Thanks for all the work on this!  It was a very large change but it
>>> improves northd performance significantly.  I just hope we don't
>>> introduce too many bugs.  Hopefully the time we have until release will
>>> allow us to further test this change on the 24.03 branch.
>>>
>>> Regards,
>>> Dumitru
>>
>>
>>
>> Thanks a lot Dumitru and Han for the reviews and patience.
>>
>> I addressed the comments and applied the patches to main and also to
> branch-24.03.
>>
>> @Han - I know you wanted to take another look in to v6.  I didn't want to
> delay further as branch-24.03 was created.  I'm more than happy to submit
> follow up patches if you have any comments to address.  Please let me know.
>>
> 
> Hi Numan,
> 
> I was writing the reply and saw your email just now. Thanks a lot for
> taking a huge effort to achieve the great optimization. I only left one
> comment on the implicit dependency left for the en_lrnat -> en_lflow. Feel
> free to address it with a followup and no need to block the branching. And
> take my Ack for the series with that addressed.
> 
> Acked-by: Han Zhou 


Hi, Numan, Dumitru and Han.

I see a huge negative performance impact, most likely from this set, on
ovn-heater's cluster-density tests.  The memory consumption on northd
jumped about 4x and it constantly recomputes due to failures of port_group
handler:

2024-02-03T11:09:12.441Z|01680|inc_proc_eng|INFO|node: lflow, recompute (failed handler for input port_group) took 9762ms
2024-02-03T11:09:12.444Z|01681|timeval|WARN|Unreasonably long 9898ms poll interval (5969ms user, 1786ms system)
...
2024-02-03T11:09:23.770Z|01690|inc_proc_eng|INFO|node: lflow, recompute (failed handler for input port_group) took 9014ms
2024-02-03T11:09:23.773Z|01691|timeval|WARN|Unreasonably long 9118ms poll interval (5376ms user, 1515ms system)
...
2024-02-03T11:09:36.692Z|01699|inc_proc_eng|INFO|node: lflow, recompute (failed handler for input port_group) took 10695ms
2024-02-03T11:09:36.696Z|01700|timeval|WARN|Unreasonably long 10890ms poll interval (6085ms user, 2745ms system)
...
2024-02-03T11:09:49.133Z|01708|inc_proc_eng|INFO|node: lflow, recompute (failed handler for input port_group) took 9985ms
2024-02-03T11:09:49.137Z|01709|timeval|WARN|Unreasonably long 10108ms poll interval (5521ms user, 2440ms system)

That increases 95%ile ovn-installed latency in 500node cluster-density from
3.6 seconds last week to 21.5 seconds this week.

I think, this should be a release blocker.

Memory usage is also very concerning.  Unfortunately it is not tied to the
cluster-density test.  The same 4-5x RSS jump is also seen in other tests
like density-heavy.  Last week RSS of ovn-northd in cluster-density 500 node
was between 1.5 and 2.5 GB, this week we have a range between 5.5 and 8.5 GB.

I would consider this as a release blocker as well.


I don't have direct evidence that this particular series is a culprit, but
it looks like the most likely candidate.  I can dig more into investigation
on Monday.

Best regards, Ilya Maximets.
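
The "recompute (failed handler for input port_group)" lines above come from
the incremental processing engine's fallback path: a node may register a
change handler per input, and when that handler cannot (or declines to)
process a change incrementally it reports failure, so the engine falls back
to a full recompute of the node.  The following is a rough, self-contained
model of that decision, with illustrative names; it is not the actual
inc-proc-eng API.

/*
 * Simplified model of the incremental-processing fallback: if the per-input
 * change handler returns false, the whole node is recomputed.  Not the real
 * OVN inc-proc-eng code.
 */
#include <stdbool.h>
#include <stdio.h>

struct engine_node;

typedef bool (*change_handler_fn)(struct engine_node *);
typedef void (*recompute_fn)(struct engine_node *);

struct engine_node {
    const char *name;
    change_handler_fn handle_input_change;  /* NULL or false => recompute. */
    recompute_fn recompute;
};

/* Run one node for one changed input, mirroring the engine's decision. */
static void
engine_run_node(struct engine_node *node, const char *changed_input)
{
    if (node->handle_input_change && node->handle_input_change(node)) {
        printf("node: %s, handled input %s incrementally\n",
               node->name, changed_input);
    } else {
        printf("node: %s, recompute (failed handler for input %s)\n",
               node->name, changed_input);
        node->recompute(node);
    }
}

/* Toy handler: pretend the port_group change is too complex to handle. */
static bool
lflow_port_group_handler(struct engine_node *node)
{
    (void) node;
    return false;               /* Give up; force a full recompute. */
}

static void
lflow_recompute(struct engine_node *node)
{
    printf("rebuilding all logical flows for node %s\n", node->name);
}

int
main(void)
{
    struct engine_node lflow = {
        .name = "lflow",
        .handle_input_change = lflow_port_group_handler,
        .recompute = lflow_recompute,
    };

    engine_run_node(&lflow, "port_group");
    return 0;
}

In the logs above the lflow node's port_group handler fails on every
iteration, so each run pays the full ~10-second recompute instead of an
incremental update, which is what drives the latency regression being
discussed.
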
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn v6 00/13] northd lflow incremental processing

2024-02-02 Thread Dumitru Ceara
On 1/30/24 22:20, num...@ovn.org wrote:
> From: Numan Siddique 
> 

Hi Numan,

> This patch series adds incremental processing in the lflow engine
> node to handle changes to northd and other engine nodes.
> Changes related to load balancers and NAT are mainly handled in
> this patch series.
> 
> This patch series can also be found here - 
> https://github.com/numansiddique/ovn/tree/northd_lbnatacl_lflow/v5
> 
> Prior to this patch series, most of the changes to the northd engine
> resulted in full recomputation of logical flows.  This series
> aims to improve the performance of ovn-northd by adding the I-P
> support.  In order to add this support, some of the northd engine
> node data (from struct ovn_datapath) is split and moved over to
> new engine nodes - mainly related to load balancers, NAT and ACLs.
> 
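
As a rough illustration of the split described above (the names here are made
up for the example and are not the real ovn-northd engine nodes): once the
load-balancer and NAT data live in their own engine nodes, the lflow stage can
rebuild only the flows derived from the input that actually changed instead of
recomputing everything from one monolithic per-datapath structure.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-topic inputs that used to live in one big structure. */
struct lb_data  { int n_load_balancers; bool changed; };
struct nat_data { int n_nat_entries;    bool changed; };

/* The lflow stage consumes both inputs but only touches what changed. */
static void
build_lflows(const struct lb_data *lb, const struct nat_data *nat)
{
    if (lb->changed) {
        printf("rebuilding only LB lflows (%d LBs)\n", lb->n_load_balancers);
    }
    if (nat->changed) {
        printf("rebuilding only NAT lflows (%d entries)\n",
               nat->n_nat_entries);
    }
    if (!lb->changed && !nat->changed) {
        printf("nothing to do\n");
    }
}

int
main(void)
{
    struct lb_data lb = { .n_load_balancers = 1000, .changed = true };
    struct nat_data nat = { .n_nat_entries = 500, .changed = false };

    build_lflows(&lb, &nat);    /* Only the LB-derived flows are rebuilt. */
    return 0;
}
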
> Below are the scale testing results done with these patches applied
> using ovn-heater.  The test ran the scenario  -
> ocp-500-density-heavy.yml [1].
> 
> With all the lflow I-P patches applied, the results are:
> 
> --------------------------------------------------------------------------------------------------------------
>                       Min (s)   Median (s)  90%ile (s)  99%ile (s)  Max (s)   Mean (s)   Total (s)    Count  Failed
> --------------------------------------------------------------------------------------------------------------
> Iteration Total       0.136883  1.129016    1.192001    1.204167    1.212728  0.665017   83.127099    125    0
> Namespace.add_ports   0.005216  0.005736    0.007034    0.015486    0.018978  0.006211   0.776373     125    0
> WorkerNode.bind_port  0.035030  0.046082    0.052469    0.058293    0.060311  0.045973   11.493259    250    0
> WorkerNode.ping_port  0.005057  0.006727    1.047692    1.069253    1.071336  0.266896   66.724094    250    0
> --------------------------------------------------------------------------------------------------------------
> 
> The results with the present main are:
> 
> --------------------------------------------------------------------------------------------------------------
>                       Min (s)   Median (s)  90%ile (s)  99%ile (s)  Max (s)   Mean (s)   Total (s)    Count  Failed
> --------------------------------------------------------------------------------------------------------------
> Iteration Total       0.135491  2.223805    3.311270    3.339078    3.345346  1.729172   216.146495   125    0
> Namespace.add_ports   0.005380  0.005744    0.006819    0.018773    0.020800  0.006292   0.786532     125    0
> WorkerNode.bind_port  0.034179  0.046055    0.053488    0.058801    0.071043  0.046117   11.529311    250    0
> WorkerNode.ping_port  0.004956  0.006952    3.086952    3.191743    3.192807  0.791544   197.886026   250    0
> --------------------------------------------------------------------------------------------------------------
> 
> Please see the link [2] which has a high level description of the
> changes done in this patch series.
> 
> 
> [1] - 
> https://github.com/ovn-org/ovn-heater/blob/main/test-scenarios/ocp-500-density-heavy.yml
> [2] - https://mail.openvswitch.org/pipermail/ovs-dev/2023-December/410053.html
> 
> v5 -> v6
> --
>* Applied the first 3 patches of v5 after addressing all the review
>  comments (and with the Acks)
>  
>* Rebased to latest main and resolved the conflicts.
> 
>* Addressed almost all of the review comments received for v5 from
>  Han and Dumitru.
> - Added detailed documentation on 'struct lflow_ref' and life
>   cycle of 'struct lflow_ref_node'.
> - Added documentation on the thread safety limitations when
>   using 'struct lflow_ref'.
> 
> v4 -> v5
> ---
>* Rebased to latest main and resolved the conflicts.
> 
>* Addressed the review comments from Han in patch 15 (and in p8).  Removed the
>  assert if SB dp group is missing and handled it by returning false
>  so that lflow engine recomputes.  Added test cases to cover this
>  scenario for both lflows (p8) and SB load balancers (p15).
> 
> v3 -> v4
> ---
>* Addressed most of the review comments from Dumitru and Han.
> 
>* Found a couple of bugs in v3 patch 9 -
>  "northd: Refactor lflow management into a separate module."
>  and addressed them in v4.
>  To brief  the issue, if a