Re: [ovs-discuss] Long northd recompute time with OpenStack deployment

2023-08-24 Thread Roberto Bartzen Acosta via discuss
Hi Ilya,

On Thu, Aug 24, 2023 at 12:03, Ilya Maximets wrote:

> On 8/23/23 19:58, Roberto Bartzen Acosta wrote:
> > Hi Ilya,
> >
> > On Wed, Aug 23, 2023 at 13:45, Ilya Maximets wrote:
> >
> > On 8/23/23 07:59, Roberto Bartzen Acosta wrote:
> > > Hi Ilya,
> > >
> > > Regarding what we've been talking about on the openvswitch IRC channel, I'm adding more information about the setup.
> >
> > Hi, Roberto.
> > Thanks for the information!
> >
> > >
> > > *Summary*
> > >
> > > Tests with OVN 22.03 version... (Northd / lflow) recompute time:
> > > 3k routers - 46183ms/3595ms
> > > 4k routers - 48008ms/4143ms
> > > 5k routers - 84347ms/8976ms
> > >
> > > Tests with OVN 23.03 / OVS 3.1.2 version (Northd / lflow) recompute time:
> > > *5k routers - 75532ms/4750ms*
> > >
> > > ovn-appctl -t ovn-northd coverage/show
> > > Event coverage, avg rate over last: 5 seconds, last minute, last hour,  hash=ccb8802c:
> > > hmap_pathological          0.0/sec     0.000/sec        1.3389/sec   total: 13037
> > > hmap_expand                0.0/sec     0.000/sec      694.1992/sec   total: 8429618
> > > hmap_reserve               0.0/sec     0.000/sec      438.7769/sec   total: 4365281
> > > txn_unchanged              0.0/sec     0.000/sec        0.0619/sec   total: 2366
> > > txn_incomplete             0.0/sec     0.000/sec        0.0797/sec   total: 997
> > > txn_success                0.0/sec     0.000/sec        0.0194/sec   total: 171
> > > txn_try_again              0.0/sec     0.000/sec        0./sec   total: 591
> > > poll_create_node           2.0/sec     0.333/sec        0.8039/sec   total: 26251
> > > poll_zero_timeout          0.0/sec     0.000/sec        0.0219/sec   total: 305
> > > seq_change                 1.0/sec     0.167/sec        0.3319/sec   total: 9972
> > > pstream_open               0.0/sec     0.000/sec        0./sec   total: 1
> > > stream_open                0.0/sec     0.000/sec        0./sec   total: 12
> > > unixctl_received           0.0/sec     0.000/sec        0./sec   total: 3
> > > unixctl_replied            0.0/sec     0.000/sec        0./sec   total: 3
> > > util_xalloc                4.0/sec     0.583/sec   145068.3125/sec   total: 23935972424
> > > 93 events never hit
> > >
> > >   * stopwatch/show [1]
> > >
> > > Statistics for 'ovnsb_db_run'
> > >   Total samples: 121
> > >   Maximum: 85562 msec
> > >   Minimum: 61409 msec
> > >   95th percentile: 81893.110630 msec
> > >   Short term average: 83914.189203 msec
> > >   Long term average: 74790.861215 msec
> > >
> > >
> > >   *  HA_Chassis_Group table [2]
> > >   * Port_Binding [3]
> > >   * NB database and SB database [4]
> > >
> > >
> > > I included the download link for the databases so we can continue with the recompute time analysis.
> > >
> > > What would be possible to improve in this recompute time considering the 40K entries in Port_Binding?
> >
> > AFAICT, northd is doing horribly inefficient recomputation of ref_chassis
> > column in HA_Chassis_Group table.  It's basically number of ports times
> > number of ha_chassis_groups (equals to the number of routers in your
> > setup) times number of chassis.  On your databases it turns into about
> > 270 million operations.  I re-worked the code locally to not check the
> > same things over and over again and re-computation of this particular
> > part went down from 80 seconds to 52 milliseconds on my machine.
> >
> >
> > That sounds very promising ;) Thank you for your help!
> > I'll be waiting to backport the patch and test in my setup.
> >
> >
> >
> > I still need to run some checks to be sure that I'm computing exactly
> > the same thing that the old code does, after that I'll post the patch.
> >
> > We're a little late for the 23.09 release, but I'll try to market this
> > change as a bug fix, maybe OVN maintainers will agree. :)
> >
> > Do you want to be mentioned in a Reported-by tag?
> >
> > It would be good to keep tracking.
>
> Ack.  Just for completeness of the mail thread, here is the list of patches
> I posted related to this issue:
>
> 1.
> https://patchwork.ozlabs.org/project/ovn/patch/20230823214140.1779255-1-i.maxim...@ovn.org/
> 2.
> https://patchwork.ozlabs.org/project/ovn/patch/20230823215705.1786348-1-i.maxim...@ovn.org/
>
>
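For anyone reading this thread later, here is a rough illustration of why the cost described above (ports times ha_chassis_groups times chassis) grows so quickly, and why not re-checking the same things helps. This is only a sketch under simplifying assumptions, not the northd code and not the posted patches: the data model (ports as (router_group, chassis) pairs, HA chassis groups mapped to a router group) and the function names ref_chassis_naive / ref_chassis_dedup are hypothetical, and "ops" counts loop iterations as a crude proxy for work.

```python
from collections import defaultdict


def ref_chassis_naive(ports, ha_groups):
    """Hypothetical naive pass: every port is checked against every group,
    and each check scans the chassis already recorded for that group."""
    ref = {name: [] for name in ha_groups}
    ops = 0
    for router_group, chassis in ports:
        for name, group_router_group in ha_groups.items():
            ops += 1
            if group_router_group != router_group:
                continue
            # Linear scan to avoid recording the same chassis twice.
            for existing in ref[name]:
                ops += 1
                if existing == chassis:
                    break
            else:
                ref[name].append(chassis)
    return {name: set(chassis) for name, chassis in ref.items()}, ops


def ref_chassis_dedup(ports, ha_groups):
    """Hypothetical reworked pass: collect the unique chassis per router
    group once, then hand that set to each HA chassis group."""
    chassis_by_router_group = defaultdict(set)
    ops = 0
    for router_group, chassis in ports:
        ops += 1
        chassis_by_router_group[router_group].add(chassis)
    ref = {}
    for name, group_router_group in ha_groups.items():
        chassis = chassis_by_router_group.get(group_router_group, set())
        ops += 1 + len(chassis)  # one lookup plus copying the set
        ref[name] = set(chassis)
    return ref, ops


# Toy input (made up for illustration): 60 ports, 10 groups, 3 chassis.
ports = [(p % 2, f"chassis-{p % 3}") for p in range(60)]
ha_groups = {f"grp-{g}": g % 2 for g in range(10)}

naive, naive_ops = ref_chassis_naive(ports, ha_groups)
fast, fast_ops = ref_chassis_dedup(ports, ha_groups)
print("same result:", naive == fast)
print("naive iterations:", naive_ops, "deduplicated iterations:", fast_ops)
```

Both functions produce the same ref_chassis sets on the toy input, but the deduplicated pass needs far fewer iterations. The real improvement reported in this thread comes from the patches linked above, not from this sketch.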
Here is a functional time comparison from the OpenStack environment running
OVN 23.03.0!
Time between "added interface tap7d5c3046-0e on chassis" and "Neutron
event: OVN reports status up for port":

Before these patches:
~2 minutes and 37 seconds
After these patches:
1 second  OMG \o/

Excellent work!!! Congrats!

Best regards,
Roberto


Event timeline:

Aug 24 17:22:27 LAB-SRV nova-compute[282881]: 2023-08-24 17:

Re: [ovs-discuss] Long northd recompute time with OpenStack deployment

2023-08-24 Thread Ilya Maximets via discuss
On 8/23/23 19:58, Roberto Bartzen Acosta wrote:
> Hi Ilya,
> 
> On Wed, Aug 23, 2023 at 13:45, Ilya Maximets wrote:
> 
> On 8/23/23 07:59, Roberto Bartzen Acosta wrote:
> > Hi Ilya,
> >
> > Regarding what we've been talking about on the openvswitch IRC channel, I'm adding more information about the setup.
> 
> Hi, Roberto.
> Thanks for the information!
> 
> >
> > *Summary*
> >
> > Tests with OVN 22.03 version... (Northd / lflow) recompute time:
> > 3k routers - 46183ms/3595ms
> > 4k routers - 48008ms/4143ms
> > 5k routers - 84347ms/8976ms
> >
> > Tests with OVN 23.03 / OVS 3.1.2 version (Northd / lflow) recompute time:
> > *5k routers - 75532ms/4750ms*
> >
> > ovn-appctl -t ovn-northd coverage/show
> > Event coverage, avg rate over last: 5 seconds, last minute, last hour,  hash=ccb8802c:
> > hmap_pathological          0.0/sec     0.000/sec        1.3389/sec   total: 13037
> > hmap_expand                0.0/sec     0.000/sec      694.1992/sec   total: 8429618
> > hmap_reserve               0.0/sec     0.000/sec      438.7769/sec   total: 4365281
> > txn_unchanged              0.0/sec     0.000/sec        0.0619/sec   total: 2366
> > txn_incomplete             0.0/sec     0.000/sec        0.0797/sec   total: 997
> > txn_success                0.0/sec     0.000/sec        0.0194/sec   total: 171
> > txn_try_again              0.0/sec     0.000/sec        0./sec   total: 591
> > poll_create_node           2.0/sec     0.333/sec        0.8039/sec   total: 26251
> > poll_zero_timeout          0.0/sec     0.000/sec        0.0219/sec   total: 305
> > seq_change                 1.0/sec     0.167/sec        0.3319/sec   total: 9972
> > pstream_open               0.0/sec     0.000/sec        0./sec   total: 1
> > stream_open                0.0/sec     0.000/sec        0./sec   total: 12
> > unixctl_received           0.0/sec     0.000/sec        0./sec   total: 3
> > unixctl_replied            0.0/sec     0.000/sec        0./sec   total: 3
> > util_xalloc                4.0/sec     0.583/sec   145068.3125/sec   total: 23935972424
> > 93 events never hit
> >
> >   * stopwatch/show [1]
> >
> > Statistics for 'ovnsb_db_run'
> >   Total samples: 121
> >   Maximum: 85562 msec
> >   Minimum: 61409 msec
> >   95th percentile: 81893.110630 msec
> >   Short term average: 83914.189203 msec
> >   Long term average: 74790.861215 msec
> >
> >
> >   *  HA_Chassis_Group table [2]
> >   * Port_Binding [3]
> >   * NB database and SB database [4]
> >
> >
> > I included the download link for the databases so we can continue with the recompute time analysis.
> >
> > What would be possible to improve in this recompute time considering the 40K entries in Port_Binding?
> 
> AFAICT, northd is doing horribly inefficient recomputation of ref_chassis
> column in HA_Chassis_Group table.  It's basically number of ports times
> number of ha_chassis_groups (equals to the number of routers in your
> setup) times number of chassis.  On your databases it turns into about
> 270 million operations.  I re-worked the code locally to not check the
> same things over and over again and re-computation of this particular
> part went down from 80 seconds to 52 milliseconds on my machine.
> 
> 
> That sounds very promising ;) Thank you for your help!
> I'll be waiting to backport the patch and test in my setup.
>  
> 
> 
> I still need to run some checks to be sure that I'm computing exactly
> the same thing that the old code does, after that I'll post the patch.
> 
> We're a little late for the 23.09 release, but I'll try to market this
> change as a bug fix, maybe OVN maintainers will agree. :)
> 
> Do you want to be mentioned in a Reported-by tag?
> 
> It would be good to keep tracking.

Ack.  Just for completeness of the mail thread, here is the list of patches
I posted related to this issue:

1. 
https://patchwork.ozlabs.org/project/ovn/patch/20230823214140.1779255-1-i.maxim...@ovn.org/
2. 
https://patchwork.ozlabs.org/project/ovn/patch/20230823215705.1786348-1-i.maxim...@ovn.org/

>  
> 
> 
> Also, this email has "[ovs-discuss]" in the name, but ovs-discuss is
> not in CC list, which is a little strange.
> 
> My bad! I forgot to cc it, it was too late when I sent it ...
> 
> 
> Best regards, Ilya Maximets.
> 
> >
> > Kind regards,
> > Roberto
> >
> > [1] - https://drive.google.com/file/d/1I5lfHgDu-ut8MiPjj4MqjdFzyhnZ1IeC/view?usp=drive_link

Re: [ovs-discuss] Long northd recompute time with OpenStack deployment

2023-08-23 Thread Roberto Bartzen Acosta via discuss
Hi Ilya,

On Wed, Aug 23, 2023 at 13:45, Ilya Maximets wrote:

> On 8/23/23 07:59, Roberto Bartzen Acosta wrote:
> > Hi Ilya,
> >
> > Regarding what we've been talking about on the openvswitch IRC channel, I'm adding more information about the setup.
>
> Hi, Roberto.
> Thanks for the information!
>
> >
> > *Summary*
> >
> > Tests with OVN 22.03 version... (Northd / lflow) recompute time:
> > 3k routers - 46183ms/3595ms
> > 4k routers - 48008ms/4143ms
> > 5k routers - 84347ms/8976ms
> >
> > Tests with OVN 23.03 / OVS 3.1.2 version (Northd / lflow) recompute time:
> > *5k routers - 75532ms/4750ms*
> >
> > ovn-appctl -t ovn-northd coverage/show
> > Event coverage, avg rate over last: 5 seconds, last minute, last hour,  hash=ccb8802c:
> > hmap_pathological          0.0/sec     0.000/sec        1.3389/sec   total: 13037
> > hmap_expand                0.0/sec     0.000/sec      694.1992/sec   total: 8429618
> > hmap_reserve               0.0/sec     0.000/sec      438.7769/sec   total: 4365281
> > txn_unchanged              0.0/sec     0.000/sec        0.0619/sec   total: 2366
> > txn_incomplete             0.0/sec     0.000/sec        0.0797/sec   total: 997
> > txn_success                0.0/sec     0.000/sec        0.0194/sec   total: 171
> > txn_try_again              0.0/sec     0.000/sec        0./sec   total: 591
> > poll_create_node           2.0/sec     0.333/sec        0.8039/sec   total: 26251
> > poll_zero_timeout          0.0/sec     0.000/sec        0.0219/sec   total: 305
> > seq_change                 1.0/sec     0.167/sec        0.3319/sec   total: 9972
> > pstream_open               0.0/sec     0.000/sec        0./sec   total: 1
> > stream_open                0.0/sec     0.000/sec        0./sec   total: 12
> > unixctl_received           0.0/sec     0.000/sec        0./sec   total: 3
> > unixctl_replied            0.0/sec     0.000/sec        0./sec   total: 3
> > util_xalloc                4.0/sec     0.583/sec   145068.3125/sec   total: 23935972424
> > 93 events never hit
> >
> >   * stopwatch/show [1]
> >
> > Statistics for 'ovnsb_db_run'
> >   Total samples: 121
> >   Maximum: 85562 msec
> >   Minimum: 61409 msec
> >   95th percentile: 81893.110630 msec
> >   Short term average: 83914.189203 msec
> >   Long term average: 74790.861215 msec
> >
> >
> >   *  HA_Chassis_Group table [2]
> >   * Port_Binding [3]
> >   * NB database and SB database [4]
> >
> >
> > I included the download link for the databases so we can continue with the recompute time analysis.
> >
> > What would be possible to improve in this recompute time considering the 40K entries in Port_Binding?
>
> AFAICT, northd is doing horribly inefficient recomputation of ref_chassis
> column in HA_Chassis_Group table.  It's basically number of ports times
> number of ha_chassis_groups (equals to the number of routers in your
> setup) times number of chassis.  On your databases it turns into about
> 270 million operations.  I re-worked the code locally to not check the
> same things over and over again and re-computation of this particular
> part went down from 80 seconds to 52 milliseconds on my machine.
>

That sounds very promising ;) Thank you for your help!
I'll be waiting to backport the patch and test in my setup.


>
> I still need to run some checks to be sure that I'm computing exactly
> the same thing that the old code does, after that I'll post the patch.
>
> We're a little late for the 23.09 release, but I'll try to market this
> change as a bug fix, maybe OVN maintainers will agree. :)
>
> Do you want to be mentioned in a Reported-by tag?
>
It would be good to keep tracking.


>
> Also, this email has "[ovs-discuss]" in the name, but ovs-discuss is
> not in CC list, which is a little strange.
>
My bad! I forgot to cc it, it was too late when I sent it ...


> Best regards, Ilya Maximets.
>
> >
> > Kind regards,
> > Roberto
> >
> > [1] - https://drive.google.com/file/d/1I5lfHgDu-ut8MiPjj4MqjdFzyhnZ1IeC/view?usp=drive_link
> > [2] - https://drive.google.com/file/d/1f9FHjDlQnSHeLmTPloS-BmqQbdnosklh/view?usp=sharing
> > [3] - https://drive.google.com/file/d/1RC0t0eOy3CCdYgGYLiPANAtvJUJjYwje/view?usp=sharing
> > [4] - https://drive.google.com/file/d/1YqCMkY2BasvlcUkzHDpttMcasgKd5BBl/view?usp=sharing