Re: [RFC] Patch to add Software/Generic Segmentation Offload (GSO) support in FreeBSD

2014-09-22 Thread Stefano Garzarella
Hi Ryan,
in gso_dispatch() I added the "eh_len" parameter to carry the offset of
the L3 header. This way, if someone adds QinQ support, they just call
gso_dispatch() with the correct length of the MAC header. During GSO
execution, the MAC header is simply copied as-is into each new segment.

For vxlan support, instead, we can define new entries in gso_type,
define new "gso_functions" to properly handle those packet types,
and mark the packet in the network stack with the correct GSO type.
For now we use only 4 bits to encode the gso_type in m_pkthdr.csum_flags,
but in the future we can use more bits or a dedicated field in m_pkthdr.

Your suggestions are very good, but I tried to implement a software TSO
while modifying the network stack as little as possible.

Thanks,
Stefano




2014-09-18 20:50 GMT+02:00 Ryan Stone :

> On Wed, Sep 17, 2014 at 4:27 AM, Stefano Garzarella
>  wrote:
> > Much of the advantage of TSO comes from crossing the network stack only
> > once per (large) segment instead of once per 1500-byte frame.
> > GSO does the same both for segmentation (TCP) and fragmentation (UDP)
> > by doing these operations as late as possible.
>
> My initial impression is that this is a layering violation.  Code like
> this gives me pause:
>
> +	eh = mtod(m, struct ether_vlan_header *);
> +	if (eh->evl_encap_proto == htons(ETHERTYPE_VLAN)) {
> +		eh_len = ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN;
> +	} else {
> +		eh_len = ETHER_HDR_LEN;
> +	}
> +
> +	return gso_dispatch(ifp, m, eh_len);
>
> If someone adds QinQ support, this code must be updated.  When vxlan
> support comes in, we must update this code or else the outer UDP
> packet gets fragmented instead of the inner TCP payload being
> segmented.  As more tunneling protocols get added to FreeBSD, the
> dispatch code for GSO gets uglier and uglier.
>
> It seems to me that the real problem that we are trying to solve is a
> lack of batching in the kernel.  Currently the network stack operates
> on the mbuf (packet) boundary.  It seems to me that we could introduce
> a "packet group" concept that is guaranteed to have the same L3 and L2
> endpoint.  In the transmit path, we would initially have a single
> (potentially oversized) packet in the group.  When TCP segments the
> packet, it would add each packet to the packet group and pass it down
> the stack.  Because we guarantee that the endpoints are the same for
> every packet in the group, the L3 code can do a single routing table
> lookup and the L2 code can do a single l2table lookup for the entire
> group.
>
> The disadvantages of packet groups would be that:
> a) You have to touch a lot more code in a lot more places to take
> advantage of the concept.
> b) TSO inherently has the same layering problems.  If we're going to
> solve the problem for tunneling protocols then GSO might well be able
> to take advantage of them.
>



-- 
*Stefano Garzarella*
stefano.garzare...@gmail.com
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: [RFC] Patch to add Software/Generic Segmentation Offload (GSO) support in FreeBSD

2014-09-20 Thread Stefano Garzarella
Hi Freddie,
this is a preliminary version and, for now, we have not analyzed all
aspects.
Thanks for your suggestion. We will try to analyze how GSO affects IPFW
as soon as possible.

Cheers,
Stefano

2014-09-18 17:27 GMT+02:00 Freddie Cash :

> On Thu, Sep 18, 2014 at 7:16 AM, Stefano Garzarella <
> stefanogarzare...@gmail.com> wrote:
>
>> I saw the discussion about TSO, but GSO is a software
>> implementation unrelated to the hardware.
>> Furthermore, if TSO is enabled (and supported by the NIC), GSO is
>> not executed, because it would be useless.
>>
>> After GSO runs, the packets passed to the device driver are smaller
>> than (or equal to) the MTU, so TSO is unnecessary. For this reason
>> GSO looks at neither "ifp->if_hw_tsomax" nor the hardware
>> segment limits.
>>
>> GSO is very useful when you can't use TSO.
>>
>
> How does GSO affect IPFW, specifically the libalias(3)-based, in-kernel
> NAT?  The ipfw(8) man page mentions that it doesn't play nicely with
> hardware-based TSO, and that one should disable TSO when using IPFW NAT.
>
> Will the software-based GSO play nicely with IPFW NAT?  Will it make any
> difference to packet throughput through IPFW?
>
> Or is it still way too early in development to be worrying about such
> things?  :)
>
> --
> Freddie Cash
> fjwc...@gmail.com
>



-- 
*Stefano Garzarella*
stefano.garzare...@gmail.com

Re: [RFC] Patch to add Software/Generic Segmentation Offload (GSO) support in FreeBSD

2014-09-18 Thread Ryan Stone
On Wed, Sep 17, 2014 at 4:27 AM, Stefano Garzarella
 wrote:
> Much of the advantage of TSO comes from crossing the network stack only
> once per (large) segment instead of once per 1500-byte frame.
> GSO does the same both for segmentation (TCP) and fragmentation (UDP)
> by doing these operations as late as possible.

My initial impression is that this is a layering violation.  Code like
this gives me pause:

+	eh = mtod(m, struct ether_vlan_header *);
+	if (eh->evl_encap_proto == htons(ETHERTYPE_VLAN)) {
+		eh_len = ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN;
+	} else {
+		eh_len = ETHER_HDR_LEN;
+	}
+
+	return gso_dispatch(ifp, m, eh_len);

If someone adds QinQ support, this code must be updated.  When vxlan
support comes in, we must update this code or else the outer UDP
packet gets fragmented instead of the inner TCP payload being
segmented.  As more tunneling protocols get added to FreeBSD, the
dispatch code for GSO gets uglier and uglier.

It seems to me that the real problem that we are trying to solve is a
lack of batching in the kernel.  Currently the network stack operates
on the mbuf (packet) boundary.  It seems to me that we could introduce
a "packet group" concept that is guaranteed to have the same L3 and L2
endpoint.  In the transmit path, we would initially have a single
(potentially oversized) packet in the group.  When TCP segments the
packet, it would add each packet to the packet group and pass it down
the stack.  Because we guarantee that the endpoints are the same for
every packet in the group, the L3 code can do a single routing table
lookup and the L2 code can do a single l2table lookup for the entire
group.

The disadvantages of packet groups would be that:
a) You have to touch a lot more code in a lot more places to take
advantage of the concept.
b) TSO inherently has the same layering problems.  If we're going to
solve the problem for tunneling protocols then GSO might well be able
to take advantage of them.


Re: [RFC] Patch to add Software/Generic Segmentation Offload (GSO) support in FreeBSD

2014-09-18 Thread Freddie Cash
On Thu, Sep 18, 2014 at 7:16 AM, Stefano Garzarella <
stefanogarzare...@gmail.com> wrote:

> I saw the discussion about TSO, but GSO is a software
> implementation unrelated to the hardware.
> Furthermore, if TSO is enabled (and supported by the NIC), GSO is
> not executed, because it would be useless.
>
> After GSO runs, the packets passed to the device driver are smaller
> than (or equal to) the MTU, so TSO is unnecessary. For this reason
> GSO looks at neither "ifp->if_hw_tsomax" nor the hardware
> segment limits.
>
> GSO is very useful when you can't use TSO.
>

How does GSO affect IPFW, specifically the libalias(3)-based, in-kernel
NAT?  The ipfw(8) man page mentions that it doesn't play nicely with
hardware-based TSO, and that one should disable TSO when using IPFW NAT.

Will the software-based GSO play nicely with IPFW NAT?  Will it make any
difference to packet throughput through IPFW?

Or is it still way too early in development to be worrying about such
things?  :)

-- 
Freddie Cash
fjwc...@gmail.com

Re: [RFC] Patch to add Software/Generic Segmentation Offload (GSO) support in FreeBSD

2014-09-18 Thread Stefano Garzarella
Hi Hans,
I saw the discussion about TSO, but GSO is a software
implementation unrelated to the hardware.
Furthermore, if TSO is enabled (and supported by the NIC), GSO is
not executed, because it would be useless.

After GSO runs, the packets passed to the device driver are smaller
than (or equal to) the MTU, so TSO is unnecessary. For this reason
GSO looks at neither "ifp->if_hw_tsomax" nor the hardware
segment limits.

GSO is very useful when you can't use TSO.

Cheers,
Stefano

2014-09-17 22:27 GMT+02:00 Hans Petter Selasky :

> On 09/17/14 20:18, Stefano Garzarella wrote:
>
>> Hi Adrian,
>> the results I sent are for just one flow, but I can try with two
>> simultaneous flows and send you the results.
>>
>> Thanks,
>> Stefano
>>
>>
> Hi Stefano,
>
> You might have seen the discussion about TSO. Is it so that the proposed
> GSO feature only looks at the "ifp->if_hw_tsomax" field, and ignores
> hardware limits regarding maximum segment size and maximum segment count?
>
> --HPS
>



-- 
*Stefano Garzarella*
stefano.garzare...@gmail.com


Re: [RFC] Patch to add Software/Generic Segmentation Offload (GSO) support in FreeBSD

2014-09-17 Thread Hans Petter Selasky

On 09/17/14 20:18, Stefano Garzarella wrote:

Hi Adrian,
the results I sent are for just one flow, but I can try with two
simultaneous flows and send you the results.

Thanks,
Stefano



Hi Stefano,

You might have seen the discussion about TSO. Is it so that the proposed 
GSO feature only looks at the "ifp->if_hw_tsomax" field, and ignores 
hardware limits regarding maximum segment size and maximum segment count?


--HPS


Re: [RFC] Patch to add Software/Generic Segmentation Offload (GSO) support in FreeBSD

2014-09-17 Thread Stefano Garzarella
Hi Adrian,
the results I sent are for just one flow, but I can try with two
simultaneous flows and send you the results.

Thanks,
Stefano

2014-09-17 19:27 GMT+02:00 Adrian Chadd :

> Hi!
>
> Cool!
>
> How many flows were you testing with? Just one or two?
>
> It's for outbound, so it's not _as_ big a deal as it is for inbound,
> but it'd still be nice to know.
>
>
> -a
>
>
> On 17 September 2014 01:27, Stefano Garzarella
>  wrote:
> > Hi all,
> > I have recently worked, during my master’s thesis with the supervision
> > of Prof. Luigi Rizzo, on a project to add GSO (Generic Segmentation
> > Offload) support in FreeBSD. I will present this project at EuroBSDcon
> > 2014, in Sofia (Bulgaria) on September 28, 2014.
> >
> > Following is a brief description of our project:
> >
> > The use of large frames makes network communication much less
> > demanding for the CPU. Yet, backward compatibility and slow links
> > require the use of 1500-byte or smaller frames.  Modern NICs with
> > hardware TCP segmentation offloading (TSO) address this problem.
> > However, a generic software version (GSO) provided by the OS has
> > reason to exist, for use on paths with no suitable hardware, such
> > as between virtual machines or with older or buggy NICs.
> >
> > Much of the advantage of TSO comes from crossing the network stack only
> > once per (large) segment instead of once per 1500-byte frame.
> > GSO does the same both for segmentation (TCP) and fragmentation (UDP)
> > by doing these operations as late as possible. Ideally, this could be
> > done
> > within the device driver, but that would require modifications to all
> > drivers.
> > A more convenient, similarly effective approach is to segment
> > just before the packet is passed to the driver (in ether_output()).
> >
> > Our preliminary implementation supports TCP and UDP on IPv4/IPv6;
> > it only intercepts packets larger than the MTU (others are left
> > unchanged),
> > and only when GSO is marked as enabled for the interface.
> >
> > Segments larger than the MTU are not split in tcp_output(),
> > udp_output(), or ip_output(), but marked with a flag (contained in
> > m_pkthdr.csum_flags), which is processed by ether_output() just
> > before calling the device driver.
> >
> > ether_output(), through gso_dispatch(), splits the large frame as needed,
> > creating headers and possibly doing checksums if not supported by
> > the hardware.
> >
> > In experiments against an LRO-enabled receiver (otherwise TSO/GSO
> > are ineffective) we have seen the following performance,
> > taken at different clock speeds (because at top speeds the
> > 10G link becomes the bottleneck):
> >
> >
> > Testing environment (all with Intel 10Gbit NIC)
> > Sender: FreeBSD 11-CURRENT - CPU i7-870 at 2.93 GHz + Turboboost
> > Receiver: Linux 3.12.8 - CPU i7-3770K at 3.50GHz + Turboboost
> > Benchmark tool: netperf 2.6.0
> >
> > --- TCP/IPv4 packets (checksum offloading enabled) ---
> > Freq.  TSO   GSO none Speedup
> > [GHz] [Gbps]   [Gbps]   [Gbps]   GSO-none
> > 2.93   9347  9298  8308 12 %
> > 2.53   9266  9401  6771 39 %
> > 2.00   9408  9294  5499 69 %
> > 1.46   9408  8087  4075 98 %
> > 1.05   9408  5673  2884 97 %
> > 0.45   6760  2206  1244 77 %
> >
> >
> > --- TCP/IPv6 packets (checksum offloading enabled) ---
> > Freq.  TSO   GSO none Speedup
> > [GHz] [Gbps]   [Gbps]   [Gbps]   GSO-none
> > 2.93   7530  6939  4966 40 %
> > 2.53   5133  7145  4008 78 %
> > 2.00   5965  6331  3152 101 %
> > 1.46   5565  5180  2348 121 %
> > 1.05   8501  3607  1732 108 %
> > 0.45   3665  1505   651 131 %
> >
> >
> > --- UDP/IPv4 packets (9K) ---
> > Freq.  GSO  none Speedup
> > [GHz] [Gbps]   [Gbps]   GSO-none
> > 2.93   9440  8084 17 %
> > 2.53   7772  6649 17 %
> > 2.00   6336  5338 19 %
> > 1.46   4748  4014 18 %
> > 1.05   3359  2831 19 %
> > 0.45   1312  1120 17 %
> >
> >
> > --- UDP/IPv6 packets (9K) ---
> > Freq.  GSO  none Speedup
> > [GHz] [Gbps]   [Gbps]   GSO-none
> > 2.93   7281  6197 18 %
> > 2.53   5953  5020 19 %
> > 2.00   4804  4048 19 %
> > 1.46   3582  3004 19 %
> > 1.05   2512  2092 20 %
> > 0.45    998   826 21 %
> >
> > We tried to change the network stack as little as possible to add
> > GSO support. To avoid changing API/ABI, we temporarily used spare
> > fields in struct tcpcb (TCP Control Block) and struct ifnet to store
> > some information related to GSO (enabled, max burst size, etc.).
> > The code that perform

Re: [RFC] Patch to add Software/Generic Segmentation Offload (GSO) support in FreeBSD

2014-09-17 Thread Adrian Chadd
Hi!

Cool!

How many flows were you testing with? Just one or two?

It's for outbound, so it's not _as_ big a deal as it is for inbound,
but it'd still be nice to know.


-a


On 17 September 2014 01:27, Stefano Garzarella
 wrote:
> Hi all,
> I have recently worked, during my master’s thesis with the supervision
> of Prof. Luigi Rizzo, on a project to add GSO (Generic Segmentation
> Offload) support in FreeBSD. I will present this project at EuroBSDcon
> 2014, in Sofia (Bulgaria) on September 28, 2014.
>
> Following is a brief description of our project:
>
> The use of large frames makes network communication much less
> demanding for the CPU. Yet, backward compatibility and slow links
> require the use of 1500-byte or smaller frames.  Modern NICs with
> hardware TCP segmentation offloading (TSO) address this problem.
> However, a generic software version (GSO) provided by the OS has
> reason to exist, for use on paths with no suitable hardware, such
> as between virtual machines or with older or buggy NICs.
>
> Much of the advantage of TSO comes from crossing the network stack only
> once per (large) segment instead of once per 1500-byte frame.
> GSO does the same both for segmentation (TCP) and fragmentation (UDP)
> by doing these operations as late as possible. Ideally, this could be done
> within the device driver, but that would require modifications to all
> drivers.
> A more convenient, similarly effective approach is to segment
> just before the packet is passed to the driver (in ether_output()).
>
> Our preliminary implementation supports TCP and UDP on IPv4/IPv6;
> it only intercepts packets larger than the MTU (others are left unchanged),
> and only when GSO is marked as enabled for the interface.
>
> Segments larger than the MTU are not split in tcp_output(),
> udp_output(), or ip_output(), but marked with a flag (contained in
> m_pkthdr.csum_flags), which is processed by ether_output() just
> before calling the device driver.
>
> ether_output(), through gso_dispatch(), splits the large frame as needed,
> creating headers and possibly doing checksums if not supported by
> the hardware.
>
> In experiments against an LRO-enabled receiver (otherwise TSO/GSO
> are ineffective) we have seen the following performance,
> taken at different clock speeds (because at top speeds the
> 10G link becomes the bottleneck):
>
>
> Testing environment (all with Intel 10Gbit NIC)
> Sender: FreeBSD 11-CURRENT - CPU i7-870 at 2.93 GHz + Turboboost
> Receiver: Linux 3.12.8 - CPU i7-3770K at 3.50GHz + Turboboost
> Benchmark tool: netperf 2.6.0
>
> --- TCP/IPv4 packets (checksum offloading enabled) ---
> Freq.  TSO   GSO none Speedup
> [GHz] [Gbps]   [Gbps]   [Gbps]   GSO-none
> 2.93   9347  9298  8308 12 %
> 2.53   9266  9401  6771 39 %
> 2.00   9408  9294  5499 69 %
> 1.46   9408  8087  4075 98 %
> 1.05   9408  5673  2884 97 %
> 0.45   6760  2206  1244 77 %
>
>
> --- TCP/IPv6 packets (checksum offloading enabled) ---
> Freq.  TSO   GSO none Speedup
> [GHz] [Gbps]   [Gbps]   [Gbps]   GSO-none
> 2.93   7530  6939  4966 40 %
> 2.53   5133  7145  4008 78 %
> 2.00   5965  6331  3152 101 %
> 1.46   5565  5180  2348 121 %
> 1.05   8501  3607  1732 108 %
> 0.45   3665  1505   651 131 %
>
>
> --- UDP/IPv4 packets (9K) ---
> Freq.  GSO  none Speedup
> [GHz] [Gbps]   [Gbps]   GSO-none
> 2.93   9440  8084 17 %
> 2.53   7772  6649 17 %
> 2.00   6336  5338 19 %
> 1.46   4748  4014 18 %
> 1.05   3359  2831 19 %
> 0.45   1312  1120 17 %
>
>
> --- UDP/IPv6 packets (9K) ---
> Freq.  GSO  none Speedup
> [GHz] [Gbps]   [Gbps]   GSO-none
> 2.93   7281  6197 18 %
> 2.53   5953  5020 19 %
> 2.00   4804  4048 19 %
> 1.46   3582  3004 19 %
> 1.05   2512  2092 20 %
> 0.45    998   826 21 %
>
> We tried to change the network stack as little as possible to add
> GSO support. To avoid changing API/ABI, we temporarily used spare
> fields in struct tcpcb (TCP Control Block) and struct ifnet to store
> some information related to GSO (enabled, max burst size, etc.).
> The code that performs the segmentation/fragmentation is contained
> in the file gso.[h|c] in sys/net.  We used 4 bits in m_pkthdr.csum_flags
> (CSUM_GSO_MASK) to encode the packet type (TCP/IPv4, TCP/IPv6, etc)
> to prevent access to the TCP/IP/Ethernet headers of each packet.
> In ether_output_frame(), if the packet requires the GSO
> ((m->m_pkthdr.csum_flags & CSUM_GSO_MASK) != 0), it is segmented
> or fragmented, and then they