Re: [ovs-discuss] Why no-flood is not supported in later openflow versions?

2018-01-29 Thread Ben Pfaff
On Mon, Jan 29, 2018 at 11:52:34AM +0800, Jianfeng Tan wrote:
> >On Sun, Jan 28, 2018 at 11:39:29PM +0800, Jianfeng Tan wrote:
> >> Hi,
> >>
> >> In OpenFlow, we set the no-flood bit via a PortMod operation in OpenFlow 1.0;
> >> for later OpenFlow versions, no-flood is dropped.
> >> We can refer to the two macro definitions below.
> >>
> >> #define OFPPC10_ALL (OFPPC_PORT_DOWN | OFPPC10_NO_STP | OFPPC_NO_RECV | \
> >>                      OFPPC10_NO_RECV_STP | OFPPC10_NO_FLOOD | OFPPC_NO_FWD | \
> >>                      OFPPC_NO_PACKET_IN)
> >>
> >> #define OFPPC11_ALL \
> >>     (OFPPC_PORT_DOWN | OFPPC_NO_RECV | OFPPC_NO_FWD | OFPPC_NO_PACKET_IN)
> >>
> >> We would like to know why no-flood is not one of the common flags. Or do we
> >> have any other best-known method (BKM) to supersede such a port config?
> >
> >The OpenFlow 1.1 and later specifications don't have a "no flood" bit.
> 
> Thank you for the reply.
> 
> We also notice that it is the later versions of the OpenFlow protocol that do
> not support this bit. But we wonder why it was removed from the protocol. Is
> it useless? Or do we have an alternative to cover this?

I wasn't involved in the standardization of OpenFlow 1.1, but I'd guess
that groups were intended to replace this feature.
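The version split is visible directly in the two masks quoted above. As a
minimal sketch (hypothetical helper, not OVS code) of checking whether a
port-config bit can even be encoded for a given OpenFlow generation:

    #include <stdbool.h>
    #include <stdint.h>

    /* OFPPC10_NO_FLOOD is part of OFPPC10_ALL but absent from OFPPC11_ALL,
     * so that bit can only be carried in an OpenFlow 1.0 port-mod. */
    static bool
    port_config_bit_supported(uint32_t bit, bool openflow10)
    {
        return (bit & (openflow10 ? OFPPC10_ALL : OFPPC11_ALL)) != 0;
    }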
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] Mempool issue for OVS 2.9

2018-01-29 Thread Stokes, Ian
> -Original Message-
> From: Kevin Traynor [mailto:ktray...@redhat.com]
> Sent: Monday, January 29, 2018 8:03 AM
> To: Stokes, Ian ; Ilya Maximets
> ; ovs-discuss@openvswitch.org
> Cc: Flavio Leitner ; Loftus, Ciara
> ; Kavanagh, Mark B ;
> Jan Scheurich (jan.scheur...@ericsson.com) ;
> Ben Pfaff (b...@ovn.org) ; acon...@redhat.com; Venkatesan
> Pradeep 
> Subject: Re: Mempool issue for OVS 2.9
> 
> On 01/26/2018 05:27 PM, Stokes, Ian wrote:
> >> -Original Message-
> >> From: Kevin Traynor [mailto:ktray...@redhat.com]
> >> Sent: Friday, January 26, 2018 3:48 PM
> >> To: Ilya Maximets ; Stokes, Ian
> >> ; ovs-discuss@openvswitch.org
> >> Cc: Flavio Leitner ; Loftus, Ciara
> >> ; Kavanagh, Mark B
> >> ; Jan Scheurich
> >> (jan.scheur...@ericsson.com) ; Ben Pfaff
> >> (b...@ovn.org) ; acon...@redhat.com; Venkatesan Pradeep
> >> 
> >> Subject: Re: Mempool issue for OVS 2.9
> >>
> >> On 01/26/2018 03:16 PM, Ilya Maximets wrote:
> >>> On 26.01.2018 15:00, Stokes, Ian wrote:
>  Hi All,
> 
>  Recently an issue was raised regarding the move from a single
>  shared
> >> mempool model that was in place up to OVS 2.8, to a mempool per port
> >> model introduced in 2.9.
> 
>  https://mail.openvswitch.org/pipermail/ovs-discuss/2018-January/046021.html
> 
>  The per port mempool model was introduced on September 5th with commit
> >> d555d9bded to allow fine-grained control of memory usage on a per-port basis.
> 
>  In the 'common/shared mempool' model, ports sharing a similar MTU
>  would
> >> all share the same buffer mempool (e.g. the most common example of
> >> this being that all ports are by default created with a 1500B MTU,
> >> and as such share the same mbuf mempool).
> 
>  This approach had some drawbacks however. For example, with the
>  shared
> >> memory pool model a user could exhaust the shared memory pool (for
> >> instance by requesting a large number of RXQs for a port), this would
> >> cause the vSwitch to crash as any remaining ports would not have the
> >> required memory to function. This bug was discovered and reported to
> >> the community in late 2016
> >> https://mail.openvswitch.org/pipermail/ovs-
> >> discuss/2016-September/042560.html.
> 
>  The per port mempool patch aimed to avoid such issues by allocating
>  a
> >> separate buffer mempool to each port.
> 
>  An issue has been flagged on ovs-discuss, whereby memory dimensions
> >> provided for a given number of ports on OvS 2.6-2.8 may be
> >> insufficient to support the same number of ports in OvS 2.9, on
> >> account of the per-port mempool model without re-dimensioning extra
> >> memory. The effect of this is use case dependent (number of ports,
> >> RXQs, MTU settings, number of PMDs
> >> etc.) The previous 'common-pool' model was rudimentary in estimating
> >> the mempool size and was marked as something that was to be improved
> >> upon. The memory allocation calculation for per port model was
> >> modified to take the possible configuration factors mentioned into
> account.
> 
>  It's unfortunate that this came to light so close to the release
>  code
> >> freeze - but better late than never as it is a valid problem to be
> >> resolved.
> 
>  I wanted to highlight some options to the community as I don't
>  think
> >> the next steps should be taken in isolation due to the impact this
> >> feature has.
> 
>  There are a number of possibilities for the 2.9 release.
> 
>  (i) Revert the mempool per port patches and return to the shared
> >> mempool model. There are a number of features and refactoring in
> >> place on top of the change so this will not be a simple revert. I'm
> >> investigating what exactly is involved with this currently.
> >>>
> >>> This looks like a bad thing to do. Shared memory pools have their own
> >>> issues and hide the real memory usage by each port. Also, reverting
> >>> seems almost impossible; we'll have to re-implement it from scratch.
> >
> > I would agree, reverting isn’t as straightforward an option due to the
> > number of commits that were introduced in relation to the per port mempool
> > feature over time (listed below for completeness).
> >
> > netdev-dpdk: Create separate memory pool for each port: d555d9b
> > netdev-dpdk: fix management of pre-existing mempools:b6b26021d
> > netdev-dpdk: Fix mempool names to reflect socket id: f06546a
> > netdev-dpdk: skip init for existing mempools: 837c176
> > netdev-dpdk: manage failure in mempool name creation: 65056fd
> > netdev-dpdk: 

Re: [ovs-discuss] Mempool issue for OVS 2.9

2018-01-29 Thread Kevin Traynor
On 01/26/2018 05:21 PM, Ilya Maximets wrote:
> On 26.01.2018 18:47, Kevin Traynor wrote:
>> On 01/26/2018 03:16 PM, Ilya Maximets wrote:
>>> On 26.01.2018 15:00, Stokes, Ian wrote:
 Hi All,

 Recently an issue was raised regarding the move from a single shared 
 mempool model that was in place up to OVS 2.8, to a mempool per port model 
 introduced in 2.9.

 https://mail.openvswitch.org/pipermail/ovs-discuss/2018-January/046021.html

 The per port mempool model was introduced on September 5th with commit 
 d555d9bded to allow fine-grained control of memory usage on a per-port basis.

 In the 'common/shared mempool' model, ports sharing a similar MTU would 
 all share the same buffer mempool (e.g. the most common example of this 
 being that all ports are by default created with a 1500B MTU, and as such 
 share the same mbuf mempool).

 This approach had some drawbacks however. For example, with the shared 
 memory pool model a user could exhaust the shared memory pool (for 
 instance by requesting a large number of RXQs for a port), this would 
 cause the vSwitch to crash as any remaining ports would not have the 
 required memory to function. This bug was discovered and reported to the 
 community in late 2016 
 https://mail.openvswitch.org/pipermail/ovs-discuss/2016-September/042560.html.

 The per port mempool patch aimed to avoid such issues by allocating a 
 separate buffer mempool to each port.

 An issue has been flagged on ovs-discuss, whereby memory dimensions 
 provided for a given number of ports on OvS 2.6-2.8 may be insufficient to 
 support the same number of ports in OvS 2.9, on account of the per-port 
 mempool model without re-dimensioning extra memory. The effect of this is 
 use case dependent (number of ports, RXQs, MTU settings, number of PMDs 
 etc.) The previous 'common-pool' model was rudimentary in estimating the 
 mempool size and was marked as something that was to be improved upon. The 
 memory allocation calculation for per port model was modified to take the 
 possible configuration factors mentioned into account.

 It's unfortunate that this came to light so close to the release code 
 freeze - but better late than never as it is a valid problem to be 
 resolved.

 I wanted to highlight some options to the community as I don't think the 
 next steps should be taken in isolation due to the impact this feature has.

 There are a number of possibilities for the 2.9 release.

 (i) Revert the mempool per port patches and return to the shared mempool 
 model. There are a number of features and refactoring in place on top of 
 the change so this will not be a simple revert. I'm investigating what 
 exactly is involved with this currently.
>>>
>>> This looks like a bad thing to do. Shared memory pools have their own
>>> issues and hide the real memory usage by each port. Also, reverting seems
>>> almost impossible; we'll have to re-implement it from scratch.
> (ii) Leave the per port mempool implementation as is, flag to users
>> that memory requirements have increased. Extra memory may have to be
>> provided on a per use case basis.
>>>
>>> That's a good point. I prefer this option if possible.
>>
>> With 4 pmds and 4 rxqs per port a user would only be able to add 7 ports
>> per socket now. 9 ports if 1 Rxq. Of course a user may have some
>> headroom in the memory they have allocated but that can't be assumed. It
>> will surely mean some users' setups will not work after upgrade - then
>> they will have to debug why.
>>
>> If we just reduce the MIN_NB_MBUF to 8192, it would make it 10 ports/4
>> rxqs, or 13 ports/1 rxq without additional memory. It still gives
>> 24K/18K buffers per port.
>>
>> What do you think?
> 
> You're talking mostly about physical ports. But what about virtual?
> They will have only ~10K mbufs.
> 
> In my usual installation I have bonding with 2 HW NICs with 4096
> descriptors in each. --> 8192 mbufs will be consumed only by HW
> TX queues. With some fluctuations, possible sends to different
> TX queues on the HW, and possible sends to other HW NICs, I'll have
> an almost 100% probability of mempool exhaustion even in that simple
> installation. But there are more complicated installations too
> (4 bonded NICs, for example).
>

Yes, thanks for pointing it out. The calculation for the amount of
buffers needed is bad, so we'll just have to leave the additional
buffers to try and cover the other cases it misses.
>>
>>>
 (iii) Reduce the amount of memory allocated per mempool per port. An RFC 
 to this effect was submitted by Kevin but on follow up the feeling is that 
 it does not resolve the issue adequately.
>>>
>>
>> Right, we discussed it in the other thread. It fixes some of the missed
>> factors for worst case, but Pradeep pointed out there 

Re: [ovs-discuss] Mempool issue for OVS 2.9

2018-01-29 Thread Jan Scheurich
> -Original Message-
> From: Ilya Maximets [mailto:i.maxim...@samsung.com]
> Sent: Monday, 29 January, 2018 09:35
> To: Jan Scheurich ; Venkatesan Pradeep 
> ; Stokes, Ian
> ; d...@openvswitch.org
> Cc: Kevin Traynor ; Flavio Leitner ; 
> Loftus, Ciara ; Kavanagh, Mark B
> ; Ben Pfaff (b...@ovn.org) ; 
> acon...@redhat.com; disc...@openvswitch.org
> Subject: Re: Mempool issue for OVS 2.9
> 
> On 29.01.2018 11:19, Jan Scheurich wrote:
> > Hi,
> >
> > I'd like to take one step back and look at how many mbufs we actually 
> > need.
> >
> > Today mbufs are consumed in the following places:
> >
> >  1. Rx queues of **physical** dpdk ports: dev->requested_n_rxq * 
> > dev->requested_rxq_size
> > Note 1: These mbufs are hogged up at all times.
> > Note 2: There is little point in configuring more rx queues per phy 
> > port than there are PMDs to poll them.
> > Note 3: The rx queues of vhostuser ports exist as virtqueues in the 
> > guest and do not hog mbufs.
> >  2. One batch per PMD during processing: #PMD * NETDEV_MAX_BURST
> >  3. One batch per tx queue with time-based tx batching: 
> > dev->requested_n_txq * NETDEV_MAX_BURST
> >  4. Tx queues of **physical** ports: dev->requested_n_txq * expected peak 
> > tx queue fill level
> > Note 1:  The maximum of 2K mbufs per tx queue can only be reached if 
> > the OVS transmit rate exceeds the line rate for a long time.
> This can only happen for large packets and when the traffic originates from 
> VMs on the compute node. This would be a case of under-
> dimensioning and packets would be dropped in any case. Excluding that 
> scenario, a typical peak tx queue fill level would be when all
> PMDs transmit a full batch at the same time: #PMDs * NETDEV_MAX_BURST.
> 
> The above assumption is wrong. Just look at the ixgbe driver,
> drivers/net/ixgbe/ixgbe_rxtx.c, tx_xmit_pkts():
> 
>     /*
>      * Begin scanning the H/W ring for done descriptors when the
>      * number of available descriptors drops below tx_free_thresh.  For
>      * each done descriptor, free the associated buffer.
>      */
>     if (txq->nb_tx_free < txq->tx_free_thresh)
>         ixgbe_tx_free_bufs(txq);
> 
> The default value for 'tx_free_thresh' is 32. So, if I configure the number
> of TX descriptors to 4096, the driver will start to free mbufs only when it
> has more than 4063 mbufs inside its TX queue, no matter how frequently the
> send() function is called.

OK, but that doesn't change my general argument. The mbufs hogged in the tx 
side of the phy port driver are coming from all ports (least likely the port 
itself). Considering them in dimensioning the port's private mempool is 
conceptually wrong. In my simplified dimensioning formula below I have already 
assumed full occupancy of the tx queue for phy ports. The second key 
observation is that vhostuser ports do not hog mbufs at all. And vhost zero 
copy doesn't change that.

BTW, is there any reason why phy drivers should free tx mbufs only when the tx 
ring is close to becoming full? I'd understand that we need to free them in 
batches for performance reasons, but is there no cheap possibility to do this 
earlier?

> 
> > Note 2: Vhostuser ports do not store mbufs in tx queues due to copying 
> > to virtio descriptors
> >
> >
> > For concreteness let us use an example of a typical, rather large OVS 
> > deployment in an NFVI cloud:
> >
> >   * Two cores with 4 PMDs per NUMA socket using HT.
> >   * Two physical ports using RSS over 4 rx queues to enable load-sharing 
> > over the 4 local PMDs and 9 tx queues (8 PMDs plus non PMD)
> >   * 100 vhostuser ports with a varying number of rx and tx queue pairs (128 
> > in total).
> >
> >
> > In the above example deployments this translates into
> >
> >  1. 4 * 2K = 8K mbufs per physical port (16K in total)
> >  2. 8 * 32 = 256 mbufs total
> >  3. (128 +  23*9) * 32 = 4672 mbufs in total
> >  4. 9 * 32 = 288 mbufs per physical port (Adding some safety margin, a 
> > total of 2K mbufs)
> >
> > ---
> > Total : 23K mbufs
> >
> > This is way lower than the size of the earlier shared mempool (256K mbufs), 
> > which explains why we have never observed out of mbuf
> drops in our NFVI deployments. The vswitchd crash that triggered the change 
> to per-port mempools only happened because they tried to
> configure 64 rx and tx queues per physical port for multiple ports. I can’t 
> see any reason for configuring more rx and tx queues than
> polling PMDs, though.
> >
> > The actual consumption of mbufs scales primarily with the number of 
> > physical ports (a, c and d) and only to a much lower degree with
> the number of vhost ports/queues (c).
> >
> > Except for the phy rx queues, all other cases buffer a statistical mix of 
> > mbufs received on all ports. 

[ovs-discuss] Open vSwitch selected as ACM SOSR Software Systems Award Winner for 2018

2018-01-29 Thread Ben Pfaff
Yesterday, the ACM SOSR awards committee announced that it was granting
Open vSwitch its second annual Software Systems Award, which is given to
recognize the development of a software system that has had a
significant impact on SDN research, implementations, and tools.  The
award comes along with a small cash prize (which will be directed toward
OVS community benefit) and an invited talk at SOSR in Los Angeles on
March 28-29.

There is a one-line announcement of the award selection at
https://conferences.sigcomm.org/sosr/2018/index.html and more
information about the award itself at
https://conferences.sigcomm.org/sosr/2018/award.html.

Thanks to everyone for your support and contributions to Open vSwitch
over the years!  OVS has come to be a major force in industry and
academia and we could not have done it without all the great people in
our community.
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] Mempool issue for OVS 2.9

2018-01-29 Thread Ilya Maximets
On 28.01.2018 21:34, Venkatesan Pradeep wrote:
> Some more observations
> 
> Regards,
> 
> Pradeep
> 
>> -Original Message-   
>> From: ovs-discuss-boun...@openvswitch.org [mailto:ovs-discuss-
>> boun...@openvswitch.org] On Behalf Of Venkatesan Pradeep
>> Sent: Friday, January 26, 2018 11:04 PM
>> To: Jan Scheurich ; Stokes, Ian
>> ; ovs-discuss@openvswitch.org
>> Cc: Ilya Maximets (i.maxim...@samsung.com) ;
>> Flavio Leitner 
>> Subject: Re: [ovs-discuss] Mempool issue for OVS 2.9
>>
>> Response marked [Pradeep]
>>
>> Thanks,
>>
>> Pradeep
>>
>> -Original Message-
>> From: Jan Scheurich
>> Sent: Friday, January 26, 2018 10:26 PM
>> To: Stokes, Ian ; ovs-discuss@openvswitch.org
>> Cc: Kevin Traynor ; Flavio Leitner ;
>> Ilya Maximets (i.maxim...@samsung.com) ;
>> Loftus, Ciara ; Kavanagh, Mark B
>> ; Ben Pfaff (b...@ovn.org) ;
>> acon...@redhat.com; Venkatesan Pradeep
>> 
>> Subject: RE: Mempool issue for OVS 2.9
>>
>>> -Original Message-
>>> From: Stokes, Ian [mailto:ian.sto...@intel.com]
>>> Sent: Friday, 26 January, 2018 13:01
>>> To: ovs-discuss@openvswitch.org
>>> Cc: Kevin Traynor ; Flavio Leitner
>>> ; Ilya Maximets (i.maxim...@samsung.com)
>>> ; Loftus, Ciara ;
>>> Kavanagh, Mark B ; Jan Scheurich
>>> ; Ben Pfaff (b...@ovn.org) ;
>>> acon...@redhat.com; Venkatesan Pradeep
>>> 
>>> Subject: Mempool issue for OVS 2.9
>>>
>>> Hi All,
>>>
>>> Recently an issue was raised regarding the move from a single shared
>>> mempool model that was in place up to OVS 2.8, to a mempool per port
>> model introduced in 2.9.
>>>
>>> https://mail.openvswitch.org/pipermail/ovs-discuss/2018-January/046021
>>> .html
>>>
>>> The per port mempool model was introduced on September 5th with commit
>>> d555d9bded to allow fine-grained control of memory usage on a per-port basis.
>>>
>>> In the 'common/shared mempool' model, ports sharing a similar MTU
>>> would all share the same buffer mempool (e.g. the most common example of
>> this being that all ports are by default created with a 1500B MTU, and as 
>> such
>> share the same mbuf mempool).
>>>
>>> This approach had some drawbacks however. For example, with the shared
>>> memory pool model a user could exhaust the shared memory pool (for
>>> instance by requesting a large number of RXQs for a port), this would
>>> cause the vSwitch to crash as any remaining ports would not have the
>> required memory to function. This bug was discovered and reported to the
>> community in late 2016 https://mail.openvswitch.org/pipermail/ovs-
>> discuss/2016-September/042560.html.
>>>
>>> The per port mempool patch aimed to avoid such issues by allocating a
>> separate buffer mempool to each port.
>>>
>>> An issue has been flagged on ovs-discuss, whereby memory dimensions
>>> provided for a given number of ports on OvS 2.6-2.8 may be
>>> insufficient to support the same number of ports in OvS 2.9, on
>>> account of the per-port mempool model without re-dimensioning extra
>>> memory. The effect of this is use case dependent (number of ports, RXQs,
>> MTU settings, number of PMDs etc.) The previous 'common- pool' model was
>> rudimentary in estimating the mempool size and was marked as something
>> that was to be improved upon. The memory allocation calculation for per port
>> model was modified to take the possible configuration factors mentioned into
>> account.
>>>
>>> It's unfortunate that this came to light so close to the release code
>>> freeze - but better late than never as it is a valid problem to be resolved.
>>>
>>> I wanted to highlight some options to the community as I don't think
>>> the next steps should be taken in isolation due to the impact this feature 
>>> has.
>>>
>>> There are a number of possibilities for the 2.9 release.
>>>
>>> (i) Revert the mempool per port patches and return to the shared
>>> mempool model. There are a number of features and refactoring in place on
>> top of the change so this will not be a simple revert. I'm investigating what
>> exactly is involved with this currently.
>>
>> The shared mempool concept has been working fairly well in our NFVI systems
>> for a long time. One can argue that it is too simplistic (as it does not at 
>> all take
>> the number of ports and queue lengths into account) and may be insufficient 
>> in
>> specific scenarios but at least it can operate with a reasonable amount of
>> memory.
>>
>> Out of the very many vhostuser ports only very few actually carry significant
>> amount of traffic. To reserve the theoretical worst case amount of buffers 

Re: [ovs-discuss] Mempool issue for OVS 2.9

2018-01-29 Thread Ilya Maximets
On 29.01.2018 11:50, Jan Scheurich wrote:
>> -Original Message-
>> From: Ilya Maximets [mailto:i.maxim...@samsung.com]
>> Sent: Monday, 29 January, 2018 09:35
>> To: Jan Scheurich ; Venkatesan Pradeep 
>> ; Stokes, Ian
>> ; d...@openvswitch.org
>> Cc: Kevin Traynor ; Flavio Leitner ; 
>> Loftus, Ciara ; Kavanagh, Mark B
>> ; Ben Pfaff (b...@ovn.org) ; 
>> acon...@redhat.com; disc...@openvswitch.org
>> Subject: Re: Mempool issue for OVS 2.9
>>
>> On 29.01.2018 11:19, Jan Scheurich wrote:
>>> Hi,
>>>
>>> I'd like to take one step back and look at how many mbufs we actually 
>>> need.
>>>
>>> Today mbufs are consumed in the following places:
>>>
>>>  1. Rx queues of **physical** dpdk ports: dev->requested_n_rxq * 
>>> dev->requested_rxq_size
>>> Note 1: These mbufs are hogged up at all times.
>>> Note 2: There is little point in configuring more rx queues per phy 
>>> port than there are PMDs to poll them.
>>> Note 3: The rx queues of vhostuser ports exist as virtqueues in the 
>>> guest and do not hog mbufs.
>>>  2. One batch per PMD during processing: #PMD * NETDEV_MAX_BURST
>>>  3. One batch per tx queue with time-based tx batching: 
>>> dev->requested_n_txq * NETDEV_MAX_BURST
>>>  4. Tx queues of **physical** ports: dev->requested_n_txq * expected peak 
>>> tx queue fill level
>>> Note 1:  The maximum of 2K mbufs per tx queue can only be reached if 
>>> the OVS transmit rate exceeds the line rate for a long time.
>> This can only happen for large packets and when the traffic originates from 
>> VMs on the compute node. This would be a case of under-
>> dimensioning and packets would be dropped in any case. Excluding that 
>> scenario, a typical peak tx queue fill level would be when all
>> PMDs transmit a full batch at the same time: #PMDs * NETDEV_MAX_BURST.
>>
>> The above assumption is wrong. Just look at the ixgbe driver,
>> drivers/net/ixgbe/ixgbe_rxtx.c, tx_xmit_pkts():
>>
>>     /*
>>      * Begin scanning the H/W ring for done descriptors when the
>>      * number of available descriptors drops below tx_free_thresh.  For
>>      * each done descriptor, free the associated buffer.
>>      */
>>     if (txq->nb_tx_free < txq->tx_free_thresh)
>>         ixgbe_tx_free_bufs(txq);
>>
>> The default value for 'tx_free_thresh' is 32. So, if I configure the number
>> of TX descriptors to 4096, the driver will start to free mbufs only when it
>> has more than 4063 mbufs inside its TX queue, no matter how frequently the
>> send() function is called.
> 
> OK, but that doesn't change my general argument. The mbufs hogged in the tx 
> side of the phy port driver are coming from all ports (least likely the port 
> itself). Considering them in dimensioning the port's private mempool is 
> conceptually wrong. In my simplified dimensioning formula below I have 
> already assumed full occupancy of the tx queue for phy ports. The second key 
> observation is that vhostuser ports do not hog mbufs at all. And vhost zero 
> copy doesn't change that.

The formula below may be good for a static environment. I want to change the
number of PMD threads dynamically in my deployments, and this works in the
current per-port model and with an oversized shared pool. If we try to reduce
the memory consumption of the shared pool, we'll have to reconfigure all the
devices each time we change the number of PMD threads. This would be really bad.
So, the size of the memory pool should not depend on dynamic characteristics of
the datapath or of other ports, to avoid unexpected interruptions of traffic
flows in case of random changes in configuration. Of course, it could depend on
characteristics of the port itself in the per-port model. In the shared mempool
model, the size should only depend on the static datapath configuration.

> 
> BTW, is there any reason why phy drivers should free tx mbufs only when the 
> tx ring is close to becoming full? I'd understand that we need to free them 
> in batches for performance reasons, but is there no cheap possibility to do 
> this earlier?
> 
>>
>>> Note 2: Vhostuser ports do not store mbufs in tx queues due to copying 
>>> to virtio descriptors
>>>
>>>
>>> For concreteness let us use an example of a typical, rather large OVS 
>>> deployment in an NFVI cloud:
>>>
>>>   * Two cores with 4 PMDs per NUMA socket using HT.
>>>   * Two physical ports using RSS over 4 rx queues to enable load-sharing 
>>> over the 4 local PMDs and 9 tx queues (8 PMDs plus non PMD)
>>>   * 100 vhostuser ports with a varying number of rx and tx queue pairs (128 
>>> in total).
>>>
>>>
>>> In the above example deployments this translates into
>>>
>>>  1. 4 * 2K = 8K mbufs per physical port (16K in total)
>>>  2. 8 * 32 = 256 mbufs total
>>>  3. (128 + 2*9) * 32 = 4672 mbufs in total
>>>  4. 9 * 32 = 288 mbufs per 

Re: [ovs-discuss] Mempool issue for OVS 2.9

2018-01-29 Thread Ilya Maximets
On 29.01.2018 11:19, Jan Scheurich wrote:
> Hi,
>  
> I'd like to take one step back and look at how many mbufs we actually 
> need.
>  
> Today mbufs are consumed in the following places:
> 
>  1. Rx queues of **physical** dpdk ports: dev->requested_n_rxq * 
> dev->requested_rxq_size
> Note 1: These mbufs are hogged up at all times.
> Note 2: There is little point in configuring more rx queues per phy port 
> than there are PMDs to poll them.
> Note 3: The rx queues of vhostuser ports exist as virtqueues in the guest 
> and do not hog mbufs.
>  2. One batch per PMD during processing: #PMD * NETDEV_MAX_BURST
>  3. One batch per tx queue with time-based tx batching: dev->requested_n_txq 
> * NETDEV_MAX_BURST
>  4. Tx queues of **physical** ports: dev->requested_n_txq * expected peak tx 
> queue fill level
> Note 1:  The maximum of 2K mbufs per tx queue can only be reached if the 
> OVS transmit rate exceeds the line rate for a long time. This can only happen 
> for large packets and when the traffic originates from VMs on the compute 
> node. This would be a case of under-dimensioning and packets would be dropped 
> in any case. Excluding that scenario, a typical peak tx queue fill level 
> would be when all PMDs transmit a full batch at the same time: #PMDs * 
> NETDEV_MAX_BURST.

The above assumption is wrong. Just look at the ixgbe driver,
drivers/net/ixgbe/ixgbe_rxtx.c, tx_xmit_pkts():

    /*
     * Begin scanning the H/W ring for done descriptors when the
     * number of available descriptors drops below tx_free_thresh.  For
     * each done descriptor, free the associated buffer.
     */
    if (txq->nb_tx_free < txq->tx_free_thresh)
        ixgbe_tx_free_bufs(txq);

The default value for 'tx_free_thresh' is 32. So, if I configure the number
of TX descriptors to 4096, the driver will start to free mbufs only when it
has more than 4063 mbufs inside its TX queue, no matter how frequently the
send() function is called.
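
As a toy illustration of that behaviour (not DPDK code; the names are
hypothetical), the reclaim condition only fires once the ring is nearly full:

    #include <stdbool.h>

    struct toy_txq {
        unsigned nb_tx_desc;      /* ring size, e.g. 4096 */
        unsigned nb_tx_free;      /* currently free descriptors */
        unsigned tx_free_thresh;  /* e.g. 32, the default mentioned above */
    };

    /* Mirrors the quoted check: completed mbufs are reclaimed only when fewer
     * than tx_free_thresh descriptors remain free, so with a 4096-entry ring
     * roughly 4064 mbufs can sit parked in it between reclaims. */
    static bool
    toy_should_free(const struct toy_txq *txq)
    {
        return txq->nb_tx_free < txq->tx_free_thresh;
    }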

> Note 2: Vhostuser ports do not store mbufs in tx queues due to copying to 
> virtio descriptors
> 
>  
> For concreteness let us use an example of a typical, rather large OVS 
> deployment in an NFVI cloud:
> 
>   * Two cores with 4 PMDs per NUMA socket using HT.
>   * Two physical ports using RSS over 4 rx queues to enable load-sharing over 
> the 4 local PMDs and 9 tx queues (8 PMDs plus non PMD)
>   * 100 vhostuser ports with a varying number of rx and tx queue pairs (128 
> in total).
> 
>  
> In the above example deployments this translates into
> 
>  1. 4 * 2K = 8K mbufs per physical port (16K in total)
>  2. 8 * 32 = 256 mbufs total
>  3. (128 + 2*9) * 32 = 4672 mbufs in total
>  4. 9 * 32 = 288 mbufs per physical port (Adding some safety margin, a total 
> of 2K mbufs)
> 
> ---
> Total : 23K mbufs
>  
> This is way lower than the size of the earlier shared mempool (256K mbufs), 
> which explains why we have never observed out of mbuf drops in our NFVI 
> deployments. The vswitchd crash that triggered the change to per-port 
> mempools only happened because they tried to configure 64 rx and tx queues 
> per physical port for multiple ports. I can’t see any reason for configuring 
> more rx and tx queues than polling PMDs, though.
>  
> The actual consumption of mbufs scales primarily with the number of physical 
> ports (a, c and d) and only to a much lower degree with the number of vhost 
> ports/queues (c).
>  
> Except for the phy rx queues, all other cases buffer a statistical mix of 
> mbufs received on all ports. There seems little point in assigning per-port 
> mempools for these.
>  
> I think we should revert to a shared mempool (per MTU size) with a simple 
> dimensioning formula that only depends on the number of physical ports and 
> the number of PMDs, both of which are zero day configuration parameters that 
> are set by OVS users.
>  
> For example:
> #mbuf = SUM over physical ports [n_rxq * rxq_size + (#PMDs + 1) * txq_size] + 16K
>  
> The fixed 16K would cover for b) and c) for up to 512 vhostuser tx queues, 
> which should be ample.
> In the above example this result in 2 * [ 4 * 2K + 9 * 2K ] + 8K  = 60K mbufs.
>  
> BR, Jan
>  
>> -Original Message-
>> From: Venkatesan Pradeep
>> Sent: Friday, 26 January, 2018 18:34
>> To: Jan Scheurich ; Stokes, Ian 
>> ; ovs-discuss@openvswitch.org
>> Cc: Kevin Traynor ; Flavio Leitner ; 
>> Ilya Maximets (i.maxim...@samsung.com)
>> ; Loftus, Ciara ; Kavanagh, 
>> Mark B ; Ben Pfaff
>> (b...@ovn.org) ; acon...@redhat.com
>> Subject: RE: Mempool issue for OVS 2.9
>> 
>> Response marked [Pradeep]
>> 
>> Thanks,
>> 
>> Pradeep
>> 
>> -Original Message-
>> From: Jan Scheurich
>> Sent: Friday, January 26, 2018 10:26 PM
>> To: Stokes, Ian >; 
>> 

Re: [ovs-discuss] Why no-flood is not supported in later openflow versions?

2018-01-29 Thread Jianfeng Tan
>On Sun, Jan 28, 2018 at 11:39:29PM +0800, Jianfeng Tan wrote:
>> Hi,
>>
>> In OpenFlow, we set the no-flood bit via a PortMod operation in OpenFlow 1.0;
>> for later OpenFlow versions, no-flood is dropped.
>> We can refer to the two macro definitions below.
>>
>> #define OFPPC10_ALL (OFPPC_PORT_DOWN | OFPPC10_NO_STP | OFPPC_NO_RECV | \
>>                      OFPPC10_NO_RECV_STP | OFPPC10_NO_FLOOD | OFPPC_NO_FWD | \
>>                      OFPPC_NO_PACKET_IN)
>>
>> #define OFPPC11_ALL \
>>     (OFPPC_PORT_DOWN | OFPPC_NO_RECV | OFPPC_NO_FWD | OFPPC_NO_PACKET_IN)
>>
>> We would like to know why no-flood is not one of the common flags. Or do we
>> have any other best-known method (BKM) to supersede such a port config?
>
>The OpenFlow 1.1 and later specifications don't have a "no flood" bit.

Thank you for the reply.

We also notice that it is the later versions of the OpenFlow protocol that do
not support this bit. But we wonder why it was removed from the protocol. Is
it useless? Or do we have an alternative to cover this?

Thanks,
Jianfeng
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] Mempool issue for OVS 2.9

2018-01-29 Thread Jan Scheurich
Hi,

I'd like to take one step back and look at how many mbufs we actually need.

Today mbufs are consumed in the following places:
a)  Rx queues of *physical* dpdk ports: dev->requested_n_rxq * 
dev->requested_rxq_size
   Note 1: These mbufs are hogged up at all times.
   Note 2: There is little point in configuring more rx queues per phy port 
than there are PMDs to poll them.
   Note 3: The rx queues of vhostuser ports exist as virtqueues in the guest 
and do not hog mbufs.
b)  One batch per PMD during processing: #PMD * NETDEV_MAX_BURST
c)  One batch per tx queue with time-based tx batching: 
dev->requested_n_txq * NETDEV_MAX_BURST
d)  Tx queues of *physical* ports: dev->requested_n_txq * expected peak tx 
queue fill level
Note 1:  The maximum of 2K mbufs per tx queue can only be reached if the OVS 
transmit rate exceeds the line rate for a long time. This can only happen for 
large packets and when the traffic originates from VMs on the compute node. 
This would be a case of under-dimensioning and packets would be dropped in any 
case. Excluding that scenario, a typical peak tx queue fill level would be when 
all PMDs transmit a full batch at the same time: #PMDs * NETDEV_MAX_BURST.
Note 2: Vhostuser ports do not store mbufs in tx queues due to copying to 
virtio descriptors

For concreteness let us use an example of a typical, rather large OVS 
deployment in an NFVI cloud:
*   Two cores with 4 PMDs per NUMA socket using HT.
*   Two physical ports using RSS over 4 rx queues to enable load-sharing 
over the 4 local PMDs and 9 tx queues (8 PMDs plus non PMD)
*   100 vhostuser ports with a varying number of rx and tx queue pairs (128 
in total).

In the above example deployments this translates into
a)  4 * 2K = 8K mbufs per physical port (16K in total)
b)  8 * 32 = 256 mbufs total
c)  (128 + 2*9) * 32 = 4672 mbufs in total
d)  9 * 32 = 288 mbufs per physical port (Adding some safety margin, a 
total of 2K mbufs)
---
Total : 23K mbufs
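
For reference, the same back-of-envelope arithmetic as a small C sketch (all
figures taken from the example above; NETDEV_MAX_BURST assumed to be 32, and
the names are illustrative only, not OVS code):

    #include <stdio.h>

    int main(void)
    {
        const unsigned n_phy = 2, n_rxq = 4, rxq_size = 2048;
        const unsigned n_pmds = 8, n_txq_phy = 9;        /* 8 PMDs + non-PMD */
        const unsigned n_vhost_txq = 128, batch = 32;    /* NETDEV_MAX_BURST */

        unsigned a = n_phy * n_rxq * rxq_size;                    /* 16384 */
        unsigned b = n_pmds * batch;                              /*   256 */
        unsigned c = (n_vhost_txq + n_phy * n_txq_phy) * batch;   /*  4672 */
        unsigned d = n_phy * 1024;          /* d) with safety margin: ~2K total */

        printf("total ~= %u mbufs\n", a + b + c + d);             /* ~23360 */
        return 0;
    }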

This is way lower than the size of the earlier shared mempool (256K mbufs), 
which explains why we have never observed out of mbuf drops in our NFVI 
deployments. The vswitchd crash that triggered the change to per-port mempools 
only happened because they tried to configure 64 rx and tx queues per physical 
port for multiple ports. I can't see any reason for configuring more rx and tx 
queues than polling PMDs, though.

The actual consumption of mbufs scales primarily with the number of physical 
ports (a, c and d) and only to a much lower degree with the number of vhost 
ports/queues (c).

Except for the phy rx queues, all other cases buffer a statistical mix of mbufs 
received on all ports. There seems little point in assigning per-port mempools 
for these.

I think we should revert to a shared mempool (per MTU size) with a simple 
dimensioning formula that only depends on the number of physical ports and the 
number of PMDs, both of which are zero day configuration parameters that are 
set by OVS users.

For example:
 #mbuf = SUM over physical ports [n_rxq * rxq_size + (#PMDs + 1) * txq_size] + 16K

The fixed 16K would cover for b) and c) for up to 512 vhostuser tx queues, 
which should be ample.
In the above example this results in 2 * [ 4 * 2K + 9 * 2K ] + 8K = 60K mbufs.
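
A minimal C sketch of that dimensioning rule (hypothetical helper, not OVS
code; it follows the "+ 16K" fixed headroom written in the formula above):

    struct phy_port_cfg {
        unsigned n_rxq;       /* configured rx queues */
        unsigned rxq_size;    /* descriptors per rx queue */
        unsigned txq_size;    /* descriptors per tx queue */
    };

    #define FIXED_HEADROOM (16 * 1024)   /* covers b) and c) above */

    static unsigned
    shared_mempool_mbufs(const struct phy_port_cfg *ports, unsigned n_ports,
                         unsigned n_pmds)
    {
        unsigned total = FIXED_HEADROOM;
        for (unsigned i = 0; i < n_ports; i++) {
            total += ports[i].n_rxq * ports[i].rxq_size      /* rx ring hogging */
                   + (n_pmds + 1) * ports[i].txq_size;       /* peak tx occupancy */
        }
        return total;
    }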

BR, Jan

> -Original Message-
> From: Venkatesan Pradeep
> Sent: Friday, 26 January, 2018 18:34
> To: Jan Scheurich ; Stokes, Ian 
> ; ovs-discuss@openvswitch.org
> Cc: Kevin Traynor ; Flavio Leitner ; 
> Ilya Maximets (i.maxim...@samsung.com)
> ; Loftus, Ciara ; Kavanagh, 
> Mark B ; Ben Pfaff
> (b...@ovn.org) ; acon...@redhat.com
> Subject: RE: Mempool issue for OVS 2.9
>
> Response marked [Pradeep]
>
> Thanks,
>
> Pradeep
>
> -Original Message-
> From: Jan Scheurich
> Sent: Friday, January 26, 2018 10:26 PM
> To: Stokes, Ian >; 
> ovs-discuss@openvswitch.org
> Cc: Kevin Traynor >; Flavio 
> Leitner >; Ilya Maximets 
> (i.maxim...@samsung.com)
> >; Loftus, Ciara 
> >; Kavanagh, Mark B 
> >; Ben Pfaff
> (b...@ovn.org) >; 
> acon...@redhat.com; Venkatesan Pradeep 
> >
> Subject: RE: Mempool issue for OVS 2.9
>
> > -Original Message-
> > From: Stokes, Ian [mailto:ian.sto...@intel.com]
> > Sent: Friday, 26 January, 2018 13:01
> > To: 

Re: [ovs-discuss] Mempool issue for OVS 2.9

2018-01-29 Thread Kevin Traynor
On 01/26/2018 05:27 PM, Stokes, Ian wrote:
>> -Original Message-
>> From: Kevin Traynor [mailto:ktray...@redhat.com]
>> Sent: Friday, January 26, 2018 3:48 PM
>> To: Ilya Maximets ; Stokes, Ian
>> ; ovs-discuss@openvswitch.org
>> Cc: Flavio Leitner ; Loftus, Ciara
>> ; Kavanagh, Mark B ;
>> Jan Scheurich (jan.scheur...@ericsson.com) ;
>> Ben Pfaff (b...@ovn.org) ; acon...@redhat.com; Venkatesan
>> Pradeep 
>> Subject: Re: Mempool issue for OVS 2.9
>>
>> On 01/26/2018 03:16 PM, Ilya Maximets wrote:
>>> On 26.01.2018 15:00, Stokes, Ian wrote:
 Hi All,

 Recently an issue was raised regarding the move from a single shared
>> mempool model that was in place up to OVS 2.8, to a mempool per port model
>> introduced in 2.9.

 https://mail.openvswitch.org/pipermail/ovs-discuss/2018-January/046021.html

 The per port mempool model was introduced on September 5th with commit
>> d555d9bded to allow fine-grained control of memory usage on a per-port basis.

 In the 'common/shared mempool' model, ports sharing a similar MTU would
>> all share the same buffer mempool (e.g. the most common example of this
>> being that all ports are by default created with a 1500B MTU, and as such
>> share the same mbuf mempool).

 This approach had some drawbacks however. For example, with the shared
>> memory pool model a user could exhaust the shared memory pool (for
>> instance by requesting a large number of RXQs for a port), this would
>> cause the vSwitch to crash as any remaining ports would not have the
>> required memory to function. This bug was discovered and reported to the
>> community in late 2016 https://mail.openvswitch.org/pipermail/ovs-
>> discuss/2016-September/042560.html.

 The per port mempool patch aimed to avoid such issues by allocating a
>> separate buffer mempool to each port.

 An issue has been flagged on ovs-discuss, whereby memory dimensions
>> provided for a given number of ports on OvS 2.6-2.8 may be insufficient to
>> support the same number of ports in OvS 2.9, on account of the per-port
>> mempool model without re-dimensioning extra memory. The effect of this is
>> use case dependent (number of ports, RXQs, MTU settings, number of PMDs
>> etc.) The previous 'common-pool' model was rudimentary in estimating the
>> mempool size and was marked as something that was to be improved upon. The
>> memory allocation calculation for per port model was modified to take the
>> possible configuration factors mentioned into account.

 It's unfortunate that this came to light so close to the release code
>> freeze - but better late than never as it is a valid problem to be
>> resolved.

 I wanted to highlight some options to the community as I don't think
>> the next steps should be taken in isolation due to the impact this feature
>> has.

 There are a number of possibilities for the 2.9 release.

 (i) Revert the mempool per port patches and return to the shared
>> mempool model. There are a number of features and refactoring in place on
>> top of the change so this will not be a simple revert. I'm investigating
>> what exactly is involved with this currently.
>>>
>>> This looks like a bad thing to do. Shared memory pools have their own
>>> issues and hide the real memory usage by each port. Also, reverting
>>> seems almost impossible; we'll have to re-implement it from scratch.
> 
> I would agree, reverting isn’t as straightforward an option due to the 
> number of commits that were introduced in relation to the per port mempool 
> feature over time (listed below for completeness).
> 
> netdev-dpdk: Create separate memory pool for each port: d555d9b
> netdev-dpdk: fix management of pre-existing mempools:b6b26021d
> netdev-dpdk: Fix mempool names to reflect socket id: f06546a
> netdev-dpdk: skip init for existing mempools: 837c176
> netdev-dpdk: manage failure in mempool name creation: 65056fd 
> netdev-dpdk: Reword mp_size as n_mbufs: ad9b5b9
> netdev-dpdk: Rename dpdk_mp_put as dpdk_mp_free: a08a115
> netdev-dpdk: Fix mp_name leak on snprintf failure: ec6edc8
> netdev-dpdk: Fix dpdk_mp leak in case of EEXIST: 173ef76
> netdev-dpdk: Factor out struct dpdk_mp: 24e78f9
> netdev-dpdk: Remove unused MAX_NB_MBUF: bc57ed9
> netdev-dpdk: Fix mempool creation with large MTU: af5b0da
> netdev-dpdk: Add debug appctl to get mempool information: be48173
> 
> Although a lot of these are fixes/formatting, we would have to introduce a 
> new series and re-introduce the shared model by removing the per port 
> specifics. This will require extensive testing also.
> 
> Functionality such as the mempool debug tool would also be impacted 
> (I'm not sure how valuable it would be for 1 shared mempool).
> 
> I can look at putting an RFC for 

[ovs-discuss] failed to put[create] (Invalid argument)

2018-01-29 Thread John Lester
Hello,

I'm having issues forwarding a flow that is encapsulated with nsh+ethernet.
Below is the syslog error message. I'm using ovs v2.9.90 and Linux kernel
4.13.0-32-generic.

Jan 29 20:23:49 checkout ovs-vswitchd:
ovs|00672|dpif(handler5)|WARN|system@ovs-system: failed to put[create]
(Invalid argument) ufid:ec579e8d-da63-4c11-95e0-4bd069feff84
recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(2),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),eth(src=7e:20:c2:40:50:c0,dst=ff:ff:ff:ff:ff:ff),eth_type(0x0806),arp(sip=
192.168.2.1/0.0.0.0,tip=192.168.2.2/0.0.0.0,op=1/0,sha=7e:20:c2:40:50:c0/00:00:00:00:00:00,tha=00:00:00:00:00:00/00:00:00:00:00:00),
actions:push_nsh(flags=0,ttl=63,mdtype=1,np=3,spi=0x63,si=255,c1=0x9,c2=0xc0a83249,c3=0x12345678,c4=0x11223344),push_eth(src=00:00:11:11:11:11,dst=00:00:00:00:35:02),3

checkout@checkout:~$ sudo ovs-ofctl -O OpenFlow15 dump-flows br0
 cookie=0x0, duration=104.506s, table=0, n_packets=53, n_bytes=2282,
priority=10,in_port=veth0b
actions=encap(nsh(md_type=1)),set_field:0x63->nsh_spi,set_field:255->nsh_si,set_field:0x9->nsh_c1,set_field:0xc0a83249->nsh_c2,set_field:0x12345678->nsh_c3,set_field:0x11223344->nsh_c4,encap(ethernet),set_field:00:00:11:11:11:11->eth_src,set_field:00:00:00:00:35:02->eth_dst,output:veth1b

checkout@checkout:~$ sudo ovs-vsctl show
ac5b8edb-206d-423c-8213-ea41df9abf58
Bridge "br0"
Port "veth1b"
Interface "veth1b"
Port "veth0b"
Interface "veth0b"
Port "br0"
Interface "br0"
type: internal

Any help would be greatly appreciated, thanks.
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss