Re: [ovs-dev] [PATCH v9 00/14] Support multi-segment mbufs
Hi Ilya,

On 30/08/2018 12:06, Ilya Maximets wrote:
> Hi, Tiago. I applied that whole patch-set and looked through the code. Didn't finish the review yet, but have a few concerns about the current implementation:
>
> 1. dp_packet_l3/l4/others returns the pointer and checks only that the offset is less than the size of the first segment. But this check doesn't guarantee that even one byte of this header is located in the first segment. Users will commonly read/write using that pointer and crash/read wrong memory regions.
>
> 2. mbuf modifications with buf_addr, data_off, etc. look very dangerous. At least, in your current implementation dp_packet_get_allocated() will return a wrong value for a packet reallocated this way, because you're not updating nb_segs. It is really hard to track all the mbuf fields that should be stored, and every DPDK upgrade may introduce very complex issues. Maybe it's better to just clone the packet in a usual way?
>
> 3. dp_packet_linear_ofs(..) must be called before any call of dp_packet_l3/l4/others? Otherwise you will get a wrong pointer. This is a bad concept from the API point of view. Also, some users in your current implementation use both dp_packet_linear_ofs(pkt, pkt->l3_ofs) and dp_packet_l3(pkt) at the same time for unknown reason. See handle_ftp_ctl() for example.

I've fixed the other issues, but would like to clarify the two points above (2 and 3) before sending v10.

With regards to the way dp_packet_linear_data() makes the conversion between DPDK and non-DPDK packets, it seems to me that the options we have here are either to construct another packet entirely and store the copied data in the new packet, or to modify the current packet internally to hold the new data. I chose the latter to abstract the changes from the caller, so the caller doesn't have to update its pointer to the packet. I tend to agree that the current way it is done (saving `buf_addr` and such) is prone to error. However, I've noticed that in DPDK 18.05 a new function has been introduced, rte_pktmbuf_attach_extbuf() [1], which, theoretically (I haven't played with it just yet), will enable us to attach an external buffer to an mbuf. This would allow us to remove the process currently in place once we integrate with DPDK 18.11, by just attaching the malloc()'ed buffer and detaching it before returning the mbuf to the pool.

As for point 3, the purpose is not to use dp_packet_linear_ofs() before or after dp_packet_l3/l4/others, but instead of them. That is, if one plans to traverse the data, dp_packet_linear_ofs() is the function to use; otherwise dp_packet_l3() is the one. The caller just needs to be aware that by calling the linear APIs the address of where the data lives will change.

Does that make sense? What do you think?

Tiago.

[1] http://doc.dpdk.org/api/rte__mbuf_8h.html#a67b6d31fd6f79b55dca6565e18431668
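P.S. To make the external-buffer idea a bit more concrete, attaching a separately allocated buffer with the 18.05 API could look roughly like the untested sketch below. The helper names here are made up, and the exact calls would need to be re-checked against the DPDK release we end up integrating with:

#include <errno.h>
#include <rte_mbuf.h>
#include <rte_malloc.h>

/* Untested sketch: attach a separately allocated buffer to an mbuf using
 * the external-buffer API added in DPDK 18.05.  Helper names are made up. */
static void
linear_buf_free_cb(void *addr, void *opaque)
{
    (void) opaque;
    rte_free(addr);
}

static int
attach_linear_buf(struct rte_mbuf *m, uint16_t size)
{
    uint16_t buf_len = size;
    void *buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
    struct rte_mbuf_ext_shared_info *shinfo;

    if (!buf) {
        return -ENOMEM;
    }

    /* Places the shared info at the tail of 'buf' and trims 'buf_len'. */
    shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
                                                linear_buf_free_cb, NULL);
    if (!shinfo) {
        rte_free(buf);
        return -ENOMEM;
    }

    rte_pktmbuf_attach_extbuf(m, buf, rte_malloc_virt2iova(buf), buf_len,
                              shinfo);
    /* 'buf' is released via linear_buf_free_cb() once the mbuf is freed,
     * or after it is detached again before being returned to its pool. */
    return 0;
}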
> 4. handle_ftp_ctl() copies the whole packet even if the function will return after the first simple check.
>
> 5. You missed copying of 'rss_hash_valid' in dp_packet_clone_with_headroom(). In general, there is a lot of code duplication, which is not a good thing in such complicated code.
>
> 6. No need to have netdev_dpdk_is_multi_segment_mbufs_enabled() in a common header.
>
> 7. dp_packet_reset_packet() is broken. dp_packet_set_size() should not change the real packet, i.e. free the segments. In your implementation dp_packet_set_size() crops the packet from the tail and after that dp_packet_set_data() moves the head forward and reduces the packet size again. As a result dp_packet_reset_packet() eats the requested number of bytes from both sides of the packet.
>
> 8. dp_packet_l3/l4/others returns NULL in case the headers are not placed in the first segment. But callers never check the result. They expect that these functions always return a valid pointer for packets that contain these headers. In reality, you can not be sure that all the headers are in the first segment, for example in the case of a few levels of tunnelling.
>
> Best regards, Ilya Maximets.
>
> On 30.08.2018 10:36, Lam, Tiago wrote:
>> Hi Ilya,
>>
>> Since you've shown some concerns on the latest revision, does this address your concerns? More specifically, does patch 09/14 address the points you made on v7 of the series, where dp_packet_data() was being used to traverse the whole data, even if segmented?
>>
>> Thanks,
>> Tiago.
>>
>> On 24/08/2018 17:24, Tiago Lam wrote:
>>> Overview
>>>
>>> This patchset introduces support for multi-segment mbufs to OvS-DPDK. Multi-segment mbufs are typically used when the size of an mbuf is insufficient to contain the entirety of a packet's data. Instead, the data is split across numerous mbufs, each carrying a portion, or 'segment', of the packet data. Mbufs are chained via their 'next' attribute (an mbuf pointer).
Re: [ovs-dev] [PATCH v9 00/14] Support multi-segment mbufs
Hi Ilya,

Thanks for the comments, I have some responses in-line. I'll cook up the fixes for the next iteration, but would be interested in getting your views, mainly on points 1, 2 and 3.

On 30/08/2018 12:06, Ilya Maximets wrote:
> Hi, Tiago. I applied that whole patch-set and looked through the code. Didn't finish the review yet, but have a few concerns about the current implementation:
>
> 1. dp_packet_l3/l4/others returns the pointer and checks only that the offset is less than the size of the first segment. But this check doesn't guarantee that even one byte of this header is located in the first segment. Users will commonly read/write using that pointer and crash/read wrong memory regions.

For packets with small / few headers, there won't be any problems. But for packets where there's the need to access bigger / more headers, that's why dp_packet_linear_ofs() has been introduced: it copies the data so it can be accessed contiguously. More generally, the idea of the new linear APIs is that when a caller knows it will need all the data, such as when calculating a checksum or calling write(), it needs to use the linear APIs instead of the normal APIs.

> 2. mbuf modifications with buf_addr, data_off, etc. look very dangerous. At least, in your current implementation dp_packet_get_allocated() will return a wrong value for a packet reallocated this way, because you're not updating nb_segs. It is really hard to track all the mbuf fields that should be stored, and every DPDK upgrade may introduce very complex issues. Maybe it's better to just clone the packet in a usual way?
>
> 3. dp_packet_linear_ofs(..) must be called before any call of dp_packet_l3/l4/others? Otherwise you will get a wrong pointer. This is a bad concept from the API point of view. Also, some users in your current implementation use both dp_packet_linear_ofs(pkt, pkt->l3_ofs) and dp_packet_l3(pkt) at the same time for unknown reason. See handle_ftp_ctl() for example.

I see your point; the idea here was to avoid changing the callers of the dp_packet API as much as possible. If we make dp_packet_linear_data() call dp_packet_clone(), then the caller will have to update its pointer to the packet. In some cases this pointer is even a `const` pointer. Hence why I opted for this approach.

> 4. handle_ftp_ctl() copies the whole packet even if the function will return after the first simple check.
>
> 5. You missed copying of 'rss_hash_valid' in dp_packet_clone_with_headroom(). In general, there is a lot of code duplication, which is not a good thing in such complicated code.

I don't understand this one; `rss_hash_valid` is only supposed to be used when OvS has been compiled without DPDK. With regards to the duplication, a previous review pointed out that there were too many compilation flags around dp_packet_clone_with_headroom(), hence why it has been broken out into two different implementations (which leads to some duplication).

> 6. No need to have netdev_dpdk_is_multi_segment_mbufs_enabled() in a common header.
>
> 7. dp_packet_reset_packet() is broken. dp_packet_set_size() should not change the real packet, i.e. free the segments. In your implementation dp_packet_set_size() crops the packet from the tail and after that dp_packet_set_data() moves the head forward and reduces the packet size again. As a result dp_packet_reset_packet() eats the requested number of bytes from both sides of the packet.

You're right, I'll fix 6 and 7. However, dp_packet_set_size() was implemented like this (it frees the tail segments) so that the tail of the packet is always in the last segment mbuf. This is to avoid returning a pointer to segmented data in dp_packet_tail() or dp_packet_end().
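To make that a bit more concrete, the trimming is roughly along these lines (an untested, simplified sketch rather than the actual code in the patch; the helper name is made up, and the single-segment and non-DPDK cases are ignored here):

#include <stdint.h>
#include <rte_mbuf.h>

/* Untested sketch: trim an mbuf chain so that only 'new_len' bytes remain,
 * keeping the tail of the data in the last remaining segment.  Assumes
 * new_len <= head->pkt_len. */
static void
mbuf_chain_trim(struct rte_mbuf *head, uint32_t new_len)
{
    struct rte_mbuf *seg = head;
    uint32_t remaining = new_len;
    uint16_t nb_segs = 1;

    /* Walk the chain until the segment that will hold the new tail. */
    while (remaining > seg->data_len && seg->next) {
        remaining -= seg->data_len;
        seg = seg->next;
        nb_segs++;
    }

    /* Crop the last kept segment and free everything after it. */
    seg->data_len = (uint16_t) remaining;
    if (seg->next) {
        rte_pktmbuf_free(seg->next);
        seg->next = NULL;
    }

    head->nb_segs = nb_segs;
    head->pkt_len = new_len;
}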
> 8. dp_packet_l3/l4/others returns NULL in case the headers are not placed in the first segment. But callers never check the result. They expect that these functions always return a valid pointer for packets that contain these headers. In reality, you can not be sure that all the headers are in the first segment, for example in the case of a few levels of tunnelling.

It is true that some places are not checking for NULL, but I believe I've covered most cases by using dp_packet_linear_ofs(). Are you talking of any case in specific? Given this is hidden behind the "dpdk-multi-seg-mbufs" flag, I'd prefer to treat it on a case-by-case basis rather than "polluting" the code with NULL checks where none are required (most of the cases, I would say, won't require them).

Thanks,
Tiago.
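P.S. To illustrate the calling convention I have in mind for the linear API, here is a hypothetical caller (not code from the series; it assumes dp_packet_linear_ofs(pkt, ofs) linearizes the data if needed and returns a pointer to contiguous data starting at 'ofs'):

#include "dp-packet.h"
#include "packets.h"

/* Hypothetical caller, purely illustrative: returns a simple XOR over the
 * L3 payload when the whole payload is needed, or just the IP TTL when a
 * header peek is enough. */
static uint8_t
inspect_packet(struct dp_packet *pkt, bool need_full_payload)
{
    if (!need_full_payload) {
        /* Header access only: the header is expected to sit in the first
         * segment, so the usual accessor is enough. */
        struct ip_header *nh = dp_packet_l3(pkt);

        return nh ? nh->ip_ttl : 0;
    }

    /* The whole payload will be traversed (checksums, write(), ...), so
     * linearize first.  Note that this may copy the data, so previously
     * held pointers into the packet must not be reused afterwards. */
    char *l3 = dp_packet_linear_ofs(pkt, pkt->l3_ofs);
    size_t len = dp_packet_size(pkt) - pkt->l3_ofs;
    uint8_t acc = 0;

    for (size_t i = 0; i < len; i++) {
        acc ^= (uint8_t) l3[i];   /* stand-in for real work over the data */
    }
    return acc;
}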
Re: [ovs-dev] [PATCH v9 00/14] Support multi-segment mbufs
Hi, Tiago. I applied that whole patch-set and looked through the code. Didn't finish the review yet, but have a few concerns about the current implementation:

1. dp_packet_l3/l4/others returns the pointer and checks only that the offset is less than the size of the first segment. But this check doesn't guarantee that even one byte of this header is located in the first segment. Users will commonly read/write using that pointer and crash/read wrong memory regions.

2. mbuf modifications with buf_addr, data_off, etc. look very dangerous. At least, in your current implementation dp_packet_get_allocated() will return a wrong value for a packet reallocated this way, because you're not updating nb_segs. It is really hard to track all the mbuf fields that should be stored, and every DPDK upgrade may introduce very complex issues. Maybe it's better to just clone the packet in a usual way?

3. dp_packet_linear_ofs(..) must be called before any call of dp_packet_l3/l4/others? Otherwise you will get a wrong pointer. This is a bad concept from the API point of view. Also, some users in your current implementation use both dp_packet_linear_ofs(pkt, pkt->l3_ofs) and dp_packet_l3(pkt) at the same time for unknown reason. See handle_ftp_ctl() for example.

4. handle_ftp_ctl() copies the whole packet even if the function will return after the first simple check.

5. You missed copying of 'rss_hash_valid' in dp_packet_clone_with_headroom(). In general, there is a lot of code duplication, which is not a good thing in such complicated code.

6. No need to have netdev_dpdk_is_multi_segment_mbufs_enabled() in a common header.

7. dp_packet_reset_packet() is broken. dp_packet_set_size() should not change the real packet, i.e. free the segments. In your implementation dp_packet_set_size() crops the packet from the tail and after that dp_packet_set_data() moves the head forward and reduces the packet size again. As a result dp_packet_reset_packet() eats the requested number of bytes from both sides of the packet.

8. dp_packet_l3/l4/others returns NULL in case the headers are not placed in the first segment. But callers never check the result. They expect that these functions always return a valid pointer for packets that contain these headers. In reality, you can not be sure that all the headers are in the first segment, for example in the case of a few levels of tunnelling.

Best regards, Ilya Maximets.

On 30.08.2018 10:36, Lam, Tiago wrote:
> Hi Ilya,
>
> Since you've shown some concerns on the latest revision, does this address your concerns? More specifically, does patch 09/14 address the points you made on v7 of the series, where dp_packet_data() was being used to traverse the whole data, even if segmented?
>
> Thanks,
> Tiago.
>
> On 24/08/2018 17:24, Tiago Lam wrote:
>> Overview
>>
>> This patchset introduces support for multi-segment mbufs to OvS-DPDK. Multi-segment mbufs are typically used when the size of an mbuf is insufficient to contain the entirety of a packet's data. Instead, the data is split across numerous mbufs, each carrying a portion, or 'segment', of the packet data. Mbufs are chained via their 'next' attribute (an mbuf pointer).
>>
>> The main motivation behind the support for multi-segment mbufs is to later introduce TSO (use case i. below) / GRO in OvS-DPDK, which is planned to be introduced after this series.
>>
>> Use Cases
>> =========
>>
>> i. Handling oversized (guest-originated) frames, which are marked for hardware acceleration/offload (TSO, for example).
>>
>>    Packets which originate from a non-DPDK source may be marked for offload; as such, they may be larger than the permitted ingress interface's MTU, and may be stored in an oversized dp-packet. In order to transmit such packets over a DPDK port, their contents must be copied to a DPDK mbuf (via dpdk_do_tx_copy). However, in its current implementation, that function only copies data into a single mbuf; if the space available in the mbuf is exhausted, but not all packet data has been copied, then it is lost. Similarly, when cloning a DPDK mbuf, it must be considered whether that mbuf contains multiple segments. Both issues are resolved within this patchset.
>>
>> ii. Handling jumbo frames.
>>
>>     While OvS already supports jumbo frames, it does so by increasing mbuf size, such that the entirety of a jumbo frame may be handled in a single mbuf. This is certainly the preferred, and most performant, approach (and remains the default).
>>
>> Enabling multi-segment mbufs
>>
>> Multi-segment and single-segment mbufs are mutually exclusive, and the user must decide on which approach to adopt on init. The introduction of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this.
>>
>> This is a global boolean value, which determines how jumbo frames are represented across all DPDK ports.
Re: [ovs-dev] [PATCH v9 00/14] Support multi-segment mbufs
Hi Ilya,

Since you've shown some concerns on the latest revision, does this address your concerns? More specifically, does patch 09/14 address the points you made on v7 of the series, where dp_packet_data() was being used to traverse the whole data, even if segmented?

Thanks,
Tiago.

On 24/08/2018 17:24, Tiago Lam wrote:
> Overview
>
> This patchset introduces support for multi-segment mbufs to OvS-DPDK. Multi-segment mbufs are typically used when the size of an mbuf is insufficient to contain the entirety of a packet's data. Instead, the data is split across numerous mbufs, each carrying a portion, or 'segment', of the packet data. Mbufs are chained via their 'next' attribute (an mbuf pointer).
>
> The main motivation behind the support for multi-segment mbufs is to later introduce TSO (use case i. below) / GRO in OvS-DPDK, which is planned to be introduced after this series.
>
> Use Cases
> =========
>
> i. Handling oversized (guest-originated) frames, which are marked for hardware acceleration/offload (TSO, for example).
>
>    Packets which originate from a non-DPDK source may be marked for offload; as such, they may be larger than the permitted ingress interface's MTU, and may be stored in an oversized dp-packet. In order to transmit such packets over a DPDK port, their contents must be copied to a DPDK mbuf (via dpdk_do_tx_copy). However, in its current implementation, that function only copies data into a single mbuf; if the space available in the mbuf is exhausted, but not all packet data has been copied, then it is lost. Similarly, when cloning a DPDK mbuf, it must be considered whether that mbuf contains multiple segments. Both issues are resolved within this patchset.
>
> ii. Handling jumbo frames.
>
>     While OvS already supports jumbo frames, it does so by increasing mbuf size, such that the entirety of a jumbo frame may be handled in a single mbuf. This is certainly the preferred, and most performant, approach (and remains the default).
>
> Enabling multi-segment mbufs
>
> Multi-segment and single-segment mbufs are mutually exclusive, and the user must decide on which approach to adopt on init. The introduction of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this.
>
> This is a global boolean value, which determines how jumbo frames are represented across all DPDK ports. In the absence of a user-supplied value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment mbufs must be explicitly enabled / single-segment mbufs remain the default.
>
> Setting the field is identical to setting existing DPDK-specific OVSDB fields:
>
>     ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
>     ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
>     ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
> ==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
>
> Performance notes (based on v8, 1st non-RFC)
> ============================================
>
> In order to test for regressions in performance, tests were run on top of master 88125d6 and v8 of this patchset, both with the multi-segment mbufs option enabled and disabled.
>
> VSperf was used to run the phy2phy_cont and pvp_cont tests with varying packet sizes of 64B, 1500B and 7000B, on a 10Gbps interface.
>
> Test | Size | Master | Multi-seg disabled | Multi-seg enabled
> -----|------|--------|--------------------|------------------
> p2p  |   64 | ~22.7  | ~22.65             | ~18.3
> p2p  | 1500 | ~1.6   | ~1.6               | ~1.6
> p2p  | 7000 | ~0.36  | ~0.36              | ~0.36
> pvp  |   64 | ~6.7   | ~6.7               | ~6.3
> pvp  | 1500 | ~1.6   | ~1.6               | ~1.6
> pvp  | 7000 | ~0.36  | ~0.36              | ~0.36
>
> Packet size is in bytes, while all packet rates are reported in mpps (aggregated).
>
> No noticeable regression has been observed (certainly everything is within the ±5% margin of existing performance), aside from the 64B packet size case when multi-segment mbufs are enabled. This is expected, however, because Tx vectorized functions are incompatible with multi-segment mbufs on some PMDs. The PMD under use during these tests was the i40e (on an Intel X710 NIC), which indeed doesn't support vectorized Tx functions with multi-segment mbufs.
>
> ---
> v9: - Rebase on master e4e2009 ("tunnel, tests: Sort flow output in ERSPAN v1/v2 metadata");
>     - Simplify patch 09/14. The functions introduced in packets.c were dropped so the code in netdev-native-tnl.c remains largely the same. These can be introduced at a later time, if needed (maybe when csum across segmented data is introduced);
>
> v8: - Rebase on master 1dd218a ("ovsdb-idl: Fix recently introduced Python 3 tests.");
>     - Address Ian's comment:
[ovs-dev] [PATCH v9 00/14] Support multi-segment mbufs
Overview

This patchset introduces support for multi-segment mbufs to OvS-DPDK. Multi-segment mbufs are typically used when the size of an mbuf is insufficient to contain the entirety of a packet's data. Instead, the data is split across numerous mbufs, each carrying a portion, or 'segment', of the packet data. Mbufs are chained via their 'next' attribute (an mbuf pointer).

The main motivation behind the support for multi-segment mbufs is to later introduce TSO (use case i. below) / GRO in OvS-DPDK, which is planned to be introduced after this series.

Use Cases
=========

i. Handling oversized (guest-originated) frames, which are marked for hardware acceleration/offload (TSO, for example).

   Packets which originate from a non-DPDK source may be marked for offload; as such, they may be larger than the permitted ingress interface's MTU, and may be stored in an oversized dp-packet. In order to transmit such packets over a DPDK port, their contents must be copied to a DPDK mbuf (via dpdk_do_tx_copy). However, in its current implementation, that function only copies data into a single mbuf; if the space available in the mbuf is exhausted, but not all packet data has been copied, then it is lost. Similarly, when cloning a DPDK mbuf, it must be considered whether that mbuf contains multiple segments. Both issues are resolved within this patchset (a rough sketch of such a chained copy is included after the performance notes below).

ii. Handling jumbo frames.

    While OvS already supports jumbo frames, it does so by increasing mbuf size, such that the entirety of a jumbo frame may be handled in a single mbuf. This is certainly the preferred, and most performant, approach (and remains the default).

Enabling multi-segment mbufs

Multi-segment and single-segment mbufs are mutually exclusive, and the user must decide on which approach to adopt on init. The introduction of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this.

This is a global boolean value, which determines how jumbo frames are represented across all DPDK ports. In the absence of a user-supplied value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment mbufs must be explicitly enabled / single-segment mbufs remain the default.

Setting the field is identical to setting existing DPDK-specific OVSDB fields:

    ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
    ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
    ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true

Performance notes (based on v8, 1st non-RFC)
============================================

In order to test for regressions in performance, tests were run on top of master 88125d6 and v8 of this patchset, both with the multi-segment mbufs option enabled and disabled.

VSperf was used to run the phy2phy_cont and pvp_cont tests with varying packet sizes of 64B, 1500B and 7000B, on a 10Gbps interface.

Test | Size | Master | Multi-seg disabled | Multi-seg enabled
-----|------|--------|--------------------|------------------
p2p  |   64 | ~22.7  | ~22.65             | ~18.3
p2p  | 1500 | ~1.6   | ~1.6               | ~1.6
p2p  | 7000 | ~0.36  | ~0.36              | ~0.36
pvp  |   64 | ~6.7   | ~6.7               | ~6.3
pvp  | 1500 | ~1.6   | ~1.6               | ~1.6
pvp  | 7000 | ~0.36  | ~0.36              | ~0.36

Packet size is in bytes, while all packet rates are reported in mpps (aggregated).

No noticeable regression has been observed (certainly everything is within the ±5% margin of existing performance), aside from the 64B packet size case when multi-segment mbufs are enabled. This is expected, however, because Tx vectorized functions are incompatible with multi-segment mbufs on some PMDs. The PMD under use during these tests was the i40e (on an Intel X710 NIC), which indeed doesn't support vectorized Tx functions with multi-segment mbufs.
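As a rough illustration of the chained copy mentioned in use case i. above, copying a flat buffer into a chain of mbufs can be sketched as below. This is only an illustration of the idea, not the code in the series: the helper name is made up and error handling is simplified.

#include <string.h>
#include <rte_mbuf.h>

/* Untested sketch: copy 'size' bytes from a flat buffer into a chain of
 * mbufs allocated from 'mp', linking in a new segment whenever the
 * current one runs out of tailroom. */
static struct rte_mbuf *
copy_to_mbuf_chain(struct rte_mempool *mp, const void *data, uint32_t size)
{
    struct rte_mbuf *head = rte_pktmbuf_alloc(mp);
    struct rte_mbuf *seg = head;
    const char *src = data;

    if (!head) {
        return NULL;
    }

    while (size) {
        uint32_t room = rte_pktmbuf_tailroom(seg);
        uint32_t chunk = RTE_MIN(size, room);

        if (!chunk) {
            /* Current segment is full: allocate and link a new one. */
            struct rte_mbuf *next = rte_pktmbuf_alloc(mp);

            if (!next) {
                rte_pktmbuf_free(head);
                return NULL;
            }
            seg->next = next;
            seg = next;
            head->nb_segs++;
            continue;
        }

        memcpy(rte_pktmbuf_mtod_offset(seg, void *, seg->data_len),
               src, chunk);
        seg->data_len += chunk;
        head->pkt_len += chunk;
        src += chunk;
        size -= chunk;
    }

    return head;
}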
---
v9: - Rebase on master e4e2009 ("tunnel, tests: Sort flow output in ERSPAN v1/v2 metadata");
    - Simplify patch 09/14. The functions introduced in packets.c were dropped so the code in netdev-native-tnl.c remains largely the same. These can be introduced at a later time, if needed (maybe when csum across segmented data is introduced);

v8: - Rebase on master 1dd218a ("ovsdb-idl: Fix recently introduced Python 3 tests.");
    - Address Ian's comment:
      - Fix sparse warnings on patch 07/14 and 12/14 by allocating memory dynamically.
    - Address Ilya's comments:
      - netdev_linux_tap_batch_send() and udp_extract_tnl_md() now linearize the data beforehand, before write()'ing or performing the checksums on the data;
      - Some other cases have been found and adapted;
        The new patch 09/14 introduced in the series is where the "linearization" logic is introduced and, as a consequence, some users of the dp_packet API,