Re: [RFC v3 net-next 00/18] Time based packet transmission
Hi,

On 03/08/2018 02:54 PM, Henrik Austad wrote:
> Just looking at the timestamp when the frames were received. They should be
> sent at regular intervals if I read udp_tai.c correctly, so the assumption
> was that the timestamp from tcpdump should give an inkling to how well it
> worked.
>
> I set it up to send a frame every 10ms and computed the diff between each
> UDP packet received. Nothing fancy, just tcpdump and grep for the
> timestamp and look at the distribution.

Ok, I see it now.

Just as a reference, this is how I've been running tcpdump on my tests:

  $ tcpdump -i enp3s0 -w foo.pcap -j adapter_unsynced \
        -tt --time-stamp-precision=nano udp port 7788 -c 1

>>> I have to dig more into why this is happening, a lot of frames delayed much
>>> more than I'd expect, but at this stage I'm pretty sure this is pebkac. One
>>> obvious fix is move some hw around and do a direct link, but I didn't have
>>> time for that right now.
>>>
>>> I'm very interested in doing what Richard's original test was when he used
>>> ptp-synched clocks and also used hw receive-time and compared with expected
>>> tx-time. So, while I'm getting that up and running, I thought I should
>>> share the early results.
>>
>> Sure, thanks. Which delta and clockid are you using, please?
>
> I used the example provided in -00,
>
> tc qdisc replace dev eth2 parent root handle 100 mqprio num_tc 3 \
>     map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
>
> tc qdisc add dev eth2 parent 100:1 tbs offload delta 10 clockid \
>     CLOCK_REALTIME sorting

The delta value is highly dependent on the system. I recommend playing around
with it a bit before running long tests. On my KabyLake desktop I noticed that
150us is quite a reliable value, for example (same kernel as yours, and no
preempt-rt applied).

But that is not the issue here, it seems.

>> Also, was this clock synchronized to the PHC? You need that for hw offload
>> with sorting enabled.
>
> Hmm, good point, no, NIC clock was not synchronized, I'll do that in the
> next round for both sender and receiver!

Oh, then you need to get that set up first. Here I synchronize both PHCs over
the network first with ptp4l:

  Rx) $ ptp4l --summary_interval=3 -i enp3s0 -m -2
  Tx) $ ptp4l --summary_interval=3 -i enp3s0 -s -m -2 &

My Rx is the PTP master and the Tx is the PTP slave.

Then I synchronize the PHC to the system clock on the Tx side only:

  Tx) $ phc2sys -a -r -r -u 8 &

And udp_tai is using CLOCK_REALTIME. The 37s offset between UTC and TAI makes
no difference for this test specifically, because I compensate for it when
calculating the offsets on the Rx side.

For the next patchset version I will be providing a more complete set of
testing instructions. I hope that helps for now.

Thanks,
Jesus
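
For readers who want to reproduce the Rx-side offset calculation mentioned in this mail, the sketch below shows one way it could be done. It is only an illustration under stated assumptions, not the script used for the test: the function name and sample numbers are hypothetical, and it assumes the Rx hardware timestamps are on the PTP (TAI) timescale while the expected Tx times come from CLOCK_REALTIME (UTC). Only the 37 s figure is taken from the mail.

```python
# Assumption: TAI is ahead of UTC by 37 s, the value quoted in the mail (2018).
TAI_UTC_OFFSET_NS = 37 * 1_000_000_000


def rx_offsets_ns(rx_tai_ns, expected_tx_utc_ns,
                  tai_utc_offset_ns=TAI_UTC_OFFSET_NS):
    """Return per-packet (Rx timestamp - expected Tx time) in nanoseconds.

    rx_tai_ns          -- Rx timestamps, assumed to be TAI nanoseconds
    expected_tx_utc_ns -- txtime each packet carried, assumed UTC nanoseconds
    """
    if len(rx_tai_ns) != len(expected_tx_utc_ns):
        raise ValueError("need exactly one expected Tx time per Rx timestamp")
    # Bring the Rx timestamp onto the UTC timescale before subtracting.
    return [rx - tai_utc_offset_ns - tx
            for rx, tx in zip(rx_tai_ns, expected_tx_utc_ns)]


if __name__ == "__main__":
    period_ns = 10_000_000                      # one frame every 10 ms
    base_ns = 1_520_500_000 * 1_000_000_000     # made-up start time
    tx = [base_ns + i * period_ns for i in range(3)]
    # Made-up link and NIC latencies of roughly 5 us per packet.
    rx = [t + TAI_UTC_OFFSET_NS + d for t, d in zip(tx, (5_000, 5_200, 4_900))]
    print(rx_offsets_ns(rx, tx))                # -> [5000, 5200, 4900]
```

If both timestamps are already on the same timescale, passing tai_utc_offset_ns=0 turns this into a plain subtraction.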
Re: [RFC v3 net-next 00/18] Time based packet transmission
On Thu, Mar 08, 2018 at 10:06:46AM -0800, Jesus Sanchez-Palencia wrote:
> Hi,
>
> On 03/08/2018 06:09 AM, Henrik Austad wrote:
>
> (...)
>
> > A lot of new knobs, I see the need, I would've liked to have fewer, but
> > you've documented them pretty well. Perhaps we should add something to
> > Documentation/ at one stage?
>
> Sure. The idea is working on that once the interfaces have been accepted.

Yeah, probably a good idea.

> > Anyways, the patches applied cleanly so I gave them a (very) quick spin.
> > Using udp_tai and tcpdump in the other end to grab the frames
> >
> > Setting up with hw offload and sorting in qdisc.
> >
> > Sender (every 10ms) (4.16-rc4 on a core2duo 1.8Ghz w/i210 and max_rss
> > bypass as dual-core and i210 are not friends):
> >
> >   udp_tai -c1 -i eth2 -p 20 -P 1000
> >
> > Receiver (imx7, kernel 4.9.11):
> >
> >   chrt -r 20 tcpdump -i eth0 ether host a0:36:9f:3f:c0:b8 | \
> >       grep "UDP, length 256" > tai_imx7.log
> >
> > Note: this involves 2 switches and a somewhat hackish kernel running on the
> > receiver, so these numbers can only improve.
> >
> >   count  2340.00
> >   mean   0.043770
> >   std    0.047784
> >   min    0.009025
> >   25%    0.010003
> >   50%    0.010010
> >   75%    0.109998
> >   max    0.120060
>
> Thanks for giving it a shot.
>
> But I'm not sure I follow the numbers above, sorry :/
> Are you computing the packet's Rx timestamp offset from the (expected) Tx
> time?

Just looking at the timestamp when the frames were received. They should be
sent at regular intervals if I read udp_tai.c correctly, so the assumption
was that the timestamp from tcpdump should give an inkling to how well it
worked.

I set it up to send a frame every 10ms and computed the diff between each
UDP packet received. Nothing fancy, just tcpdump and grep for the
timestamp and look at the distribution.

> > I have to dig more into why this is happening, a lot of frames delayed much
> > more than I'd expect, but at this stage I'm pretty sure this is pebkac. One
> > obvious fix is move some hw around and do a direct link, but I didn't have
> > time for that right now.
> >
> > I'm very interested in doing what Richard's original test was when he used
> > ptp-synched clocks and also used hw receive-time and compared with expected
> > tx-time. So, while I'm getting that up and running, I thought I should
> > share the early results.
>
> Sure, thanks. Which delta and clockid are you using, please?

I used the example provided in -00,

  tc qdisc replace dev eth2 parent root handle 100 mqprio num_tc 3 \
      map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

  tc qdisc add dev eth2 parent 100:1 tbs offload delta 10 clockid \
      CLOCK_REALTIME sorting

> Also, was this clock synchronized to the PHC? You need that for hw offload
> with sorting enabled.

Hmm, good point, no, NIC clock was not synchronized, I'll do that in the
next round for both sender and receiver!

-henrik
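
The inter-arrival measurement described in this mail (diff between consecutive receive timestamps, then a look at the distribution) can be sketched as follows. This is a hypothetical reconstruction, not the script that produced the numbers above: it assumes the tcpdump timestamps have already been parsed into seconds, and the function name and sample input are made up for illustration.

```python
import statistics


def interarrival_summary(timestamps_s):
    """Summarise the gaps between consecutive receive timestamps (seconds).

    Returns the same eight statistics as the table quoted in this thread:
    count, mean, std (sample standard deviation), min, quartiles and max.
    """
    diffs = [b - a for a, b in zip(timestamps_s, timestamps_s[1:])]
    if len(diffs) < 2:
        raise ValueError("need at least three timestamps")
    # 'inclusive' interpolates linearly between the observed data points.
    q25, q50, q75 = statistics.quantiles(diffs, n=4, method="inclusive")
    return {
        "count": len(diffs),
        "mean": statistics.fmean(diffs),
        "std": statistics.stdev(diffs),
        "min": min(diffs),
        "25%": q25,
        "50%": q50,
        "75%": q75,
        "max": max(diffs),
    }


if __name__ == "__main__":
    # Made-up receive times: a 10 ms cadence with one frame arriving late.
    rx_times = [0.000, 0.010, 0.020, 0.130, 0.140]
    for name, value in interarrival_summary(rx_times).items():
        print(f"{name:<6} {value:.6f}")
```

With a 10 ms sender period, a median close to 0.010 combined with a 75% value near 0.110 would indicate that a sizeable share of frames arrives roughly 100 ms later than the cadence suggests.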
Re: [RFC v3 net-next 00/18] Time based packet transmission
Hi,

On 03/08/2018 06:09 AM, Henrik Austad wrote:

(...)

> A lot of new knobs, I see the need, I would've liked to have fewer, but
> you've documented them pretty well. Perhaps we should add something to
> Documentation/ at one stage?

Sure. The idea is working on that once the interfaces have been accepted.

> Anyways, the patches applied cleanly so I gave them a (very) quick spin.
> Using udp_tai and tcpdump in the other end to grab the frames
>
> Setting up with hw offload and sorting in qdisc.
>
> Sender (every 10ms) (4.16-rc4 on a core2duo 1.8Ghz w/i210 and max_rss
> bypass as dual-core and i210 are not friends):
>
>   udp_tai -c1 -i eth2 -p 20 -P 1000
>
> Receiver (imx7, kernel 4.9.11):
>
>   chrt -r 20 tcpdump -i eth0 ether host a0:36:9f:3f:c0:b8 | \
>       grep "UDP, length 256" > tai_imx7.log
>
> Note: this involves 2 switches and a somewhat hackish kernel running on the
> receiver, so these numbers can only improve.
>
>   count  2340.00
>   mean   0.043770
>   std    0.047784
>   min    0.009025
>   25%    0.010003
>   50%    0.010010
>   75%    0.109998
>   max    0.120060

Thanks for giving it a shot.

But I'm not sure I follow the numbers above, sorry :/
Are you computing the packet's Rx timestamp offset from the (expected) Tx
time?

> I have to dig more into why this is happening, a lot of frames delayed much
> more than I'd expect, but at this stage I'm pretty sure this is pebkac. One
> obvious fix is move some hw around and do a direct link, but I didn't have
> time for that right now.
>
> I'm very interested in doing what Richard's original test was when he used
> ptp-synched clocks and also used hw receive-time and compared with expected
> tx-time. So, while I'm getting that up and running, I thought I should
> share the early results.

Sure, thanks. Which delta and clockid are you using, please?

Also, was this clock synchronized to the PHC? You need that for hw offload
with sorting enabled.

Thanks,
Jesus

(...)
Re: [RFC v3 net-next 00/18] Time based packet transmission
On Tue, Mar 06, 2018 at 05:12:12PM -0800, Jesus Sanchez-Palencia wrote:
> This series is the v3 of the Time based packet transmission RFC, which was
> originally proposed by Richard Cochran (v1: https://lwn.net/Articles/733962/ )
> and further developed by us with the addition of the tbs qdisc
> (v2: https://lwn.net/Articles/744797/ ).

Nice!

> It introduces a new socket option (SO_TXTIME), a new qdisc (tbs) and
> implements support for hw offloading on the igb driver for the Intel
> i210 NIC. The tbs qdisc also supports SW best effort that can be used
> as a fallback.
>
> The main changes since v2 can be found below.
>
> Fixes since v2:
>  - skb->tstamp is only cleared on the forwarding path;
>  - ktime_t is no longer the type used for timestamps (s64 is);
>  - get_unaligned() is now used for copying data from the cmsg header;
>  - added getsockopt() support for SO_TXTIME;
>  - restricted SO_TXTIME input range to [0,1];
>  - removed ns_capable() check from __sock_cmsg_send();
>  - the qdisc control struct now uses a 32-bit bitmap for config flags;
>  - fixed qdisc backlog decrement bug;
>  - 'overlimits' is now incremented on dequeue() drops in addition to the
>    'dropped' counter;
>
> Interface changes since v2:
>  * CMSG interface:
>    - added a per-packet clockid parameter to the cmsg (SCM_CLOCKID);
>    - added a per-packet drop_if_late flag to the cmsg (SCM_DROP_IF_LATE);
>  * tc-tbs:
>    - clockid now receives a string;
>      e.g.: CLOCK_REALTIME or /dev/ptp0
>    - offload is now a standalone argument (i.e. no more offload 1);
>    - sorting is now an argument that enables txtime based sorting provided
>      by the qdisc;
>
> Design changes since v2:
>  - Now on the dequeue() path, tbs only drops an expired packet if it has the
>    skb->tc_drop_if_late flag set. In practical terms, this will define if
>    the semantics of txtime on a system is "not earlier than" or "not later
>    than" a given timestamp;
>  - Now on the enqueue() path, the qdisc will drop a packet if its clockid
>    doesn't match the qdisc's one;
>  - Sorting the packets based on their txtime is now an option for the qdisc.
>    Effectively, this means it can be configured in 4 modes: HW offload or
>    SW best-effort, sorting enabled or disabled;

A lot of new knobs, I see the need, I would've liked to have fewer, but
you've documented them pretty well. Perhaps we should add something to
Documentation/ at one stage?

Anyways, the patches applied cleanly so I gave them a (very) quick spin.
Using udp_tai and tcpdump in the other end to grab the frames

Setting up with hw offload and sorting in qdisc.

Sender (every 10ms) (4.16-rc4 on a core2duo 1.8Ghz w/i210 and max_rss
bypass as dual-core and i210 are not friends):

  udp_tai -c1 -i eth2 -p 20 -P 1000

Receiver (imx7, kernel 4.9.11):

  chrt -r 20 tcpdump -i eth0 ether host a0:36:9f:3f:c0:b8 | \
      grep "UDP, length 256" > tai_imx7.log

Note: this involves 2 switches and a somewhat hackish kernel running on the
receiver, so these numbers can only improve.

  count  2340.00
  mean   0.043770
  std    0.047784
  min    0.009025
  25%    0.010003
  50%    0.010010
  75%    0.109998
  max    0.120060

I have to dig more into why this is happening, a lot of frames delayed much
more than I'd expect, but at this stage I'm pretty sure this is pebkac. One
obvious fix is move some hw around and do a direct link, but I didn't have
time for that right now.

I'm very interested in doing what Richard's original test was when he used
ptp-synched clocks and also used hw receive-time and compared with expected
tx-time. So, while I'm getting that up and running, I thought I should
share the early results.

-Henrik

> The tbs qdisc is designed so it buffers packets until a configurable time
> before their deadline (tx times). If sorting is enabled, regardless of HW
> offload or SW fallback modes, the qdisc uses a rbtree internally so the
> buffered packets are always 'ordered' by the earliest deadline.
>
> If sorting is disabled, then for HW offload the qdisc will use a 'raw' FIFO
> through qdisc_enqueue_tail() / qdisc_dequeue_head(), whereas for SW
> best-effort, it will use a 'scheduled' FIFO.
>
> The other configurable parameter from the tbs qdisc is the clockid to be used.
> In order to provide that, this series adds a new API to pkt_sched.h (i.e.
> qdisc_watchdog_init_clockid()).
>
> The tbs qdisc will drop any packets with a transmission time in the past or
> when a deadline is missed if SCM_DROP_IF_LATE is set. Queueing packets in
> advance plus configuring the delta parameter for the system correctly makes
> all the difference in reducing the number of drops. Moreover, note that the
> delta parameter ends up defining the Tx time when SW best-effort is used
> given that the timestamps won't be used by the NIC in this case.
>
> Examples:
>
> # SW best-effort with sorting #
>
> $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio
Re: [RFC v3 net-next 00/18] Time based packet transmission
On Tue, Mar 06, 2018 at 05:12:12PM -0800, Jesus Sanchez-Palencia wrote:
> Design changes since v2:
>  - Now on the dequeue() path, tbs only drops an expired packet if it has the
>    skb->tc_drop_if_late flag set. In practical terms, this will define if
>    the semantics of txtime on a system is "not earlier than" or "not later
>    than" a given timestamp;
>  - Now on the enqueue() path, the qdisc will drop a packet if its clockid
>    doesn't match the qdisc's one;
>  - Sorting the packets based on their txtime is now an option for the qdisc.
>    Effectively, this means it can be configured in 4 modes: HW offload or
>    SW best-effort, sorting enabled or disabled;

While all of this makes the series and the configuration more complex,
I still like the fact that the interface offers these different modes.

Looking forward to testing this...

Thanks,
Richard
[RFC v3 net-next 00/18] Time based packet transmission
This series is the v3 of the Time based packet transmission RFC, which was
originally proposed by Richard Cochran (v1: https://lwn.net/Articles/733962/ )
and further developed by us with the addition of the tbs qdisc
(v2: https://lwn.net/Articles/744797/ ).

It introduces a new socket option (SO_TXTIME), a new qdisc (tbs) and
implements support for hw offloading on the igb driver for the Intel
i210 NIC. The tbs qdisc also supports SW best effort that can be used
as a fallback.

The main changes since v2 can be found below.

Fixes since v2:
 - skb->tstamp is only cleared on the forwarding path;
 - ktime_t is no longer the type used for timestamps (s64 is);
 - get_unaligned() is now used for copying data from the cmsg header;
 - added getsockopt() support for SO_TXTIME;
 - restricted SO_TXTIME input range to [0,1];
 - removed ns_capable() check from __sock_cmsg_send();
 - the qdisc control struct now uses a 32-bit bitmap for config flags;
 - fixed qdisc backlog decrement bug;
 - 'overlimits' is now incremented on dequeue() drops in addition to the
   'dropped' counter;

Interface changes since v2:
 * CMSG interface:
   - added a per-packet clockid parameter to the cmsg (SCM_CLOCKID);
   - added a per-packet drop_if_late flag to the cmsg (SCM_DROP_IF_LATE);
 * tc-tbs:
   - clockid now receives a string;
     e.g.: CLOCK_REALTIME or /dev/ptp0
   - offload is now a standalone argument (i.e. no more offload 1);
   - sorting is now an argument that enables txtime based sorting provided
     by the qdisc;

Design changes since v2:
 - Now on the dequeue() path, tbs only drops an expired packet if it has the
   skb->tc_drop_if_late flag set. In practical terms, this will define if
   the semantics of txtime on a system is "not earlier than" or "not later
   than" a given timestamp;
 - Now on the enqueue() path, the qdisc will drop a packet if its clockid
   doesn't match the qdisc's one;
 - Sorting the packets based on their txtime is now an option for the qdisc.
   Effectively, this means it can be configured in 4 modes: HW offload or
   SW best-effort, sorting enabled or disabled;

The tbs qdisc is designed so it buffers packets until a configurable time
before their deadline (tx times). If sorting is enabled, regardless of HW
offload or SW fallback modes, the qdisc uses a rbtree internally so the
buffered packets are always 'ordered' by the earliest deadline.

If sorting is disabled, then for HW offload the qdisc will use a 'raw' FIFO
through qdisc_enqueue_tail() / qdisc_dequeue_head(), whereas for SW
best-effort, it will use a 'scheduled' FIFO.

The other configurable parameter from the tbs qdisc is the clockid to be used.
In order to provide that, this series adds a new API to pkt_sched.h (i.e.
qdisc_watchdog_init_clockid()).

The tbs qdisc will drop any packets with a transmission time in the past or
when a deadline is missed if SCM_DROP_IF_LATE is set. Queueing packets in
advance plus configuring the delta parameter for the system correctly makes
all the difference in reducing the number of drops. Moreover, note that the
delta parameter ends up defining the Tx time when SW best-effort is used
given that the timestamps won't be used by the NIC in this case.

Examples:

# SW best-effort with sorting #

    $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
               map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

    $ tc qdisc add dev enp2s0 parent 100:1 tbs delta 10 \
               clockid CLOCK_REALTIME sorting

    In this example, the mqprio qdisc is set up first, then the tbs qdisc is
    configured onto the first hw Tx queue using SW best-effort with sorting
    enabled. Also, it is configured so the timestamps on each packet are in
    reference to the clockid CLOCK_REALTIME and so packets are dequeued from
    the qdisc 10 nanoseconds before their transmission time.
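
To make the buffering behaviour described in this cover letter easier to picture, here is a toy user-space model of the sorted SW best-effort mode. It is a sketch under loud assumptions and not the kernel code: a binary heap stands in for the rbtree, the class and method names (ToyTbs, enqueue, dequeue) are hypothetical, and the expiry check is applied only on the dequeue path. The rules it encodes (drop on clockid mismatch at enqueue, release a packet delta before its txtime, drop an expired packet only when drop_if_late is set) are the ones stated in the text.

```python
import heapq
from itertools import count


class ToyTbs:
    """Toy model of a txtime-sorted queue (hypothetical, user space only)."""

    def __init__(self, clockid, delta_ns):
        self.clockid = clockid
        self.delta_ns = delta_ns
        self.dropped = 0
        self._heap = []        # entries: (txtime_ns, seq, drop_if_late, payload)
        self._seq = count()    # tie-breaker keeps equal txtimes in FIFO order

    def enqueue(self, payload, txtime_ns, clockid, drop_if_late=False):
        """Queue a packet; drop it if its clockid differs from the qdisc's."""
        if clockid != self.clockid:
            self.dropped += 1
            return False
        heapq.heappush(self._heap,
                       (txtime_ns, next(self._seq), drop_if_late, payload))
        return True

    def dequeue(self, now_ns):
        """Return the next packet that is due at now_ns, or None."""
        while self._heap:
            txtime_ns, _, drop_if_late, payload = self._heap[0]
            if now_ns < txtime_ns - self.delta_ns:
                return None    # earliest deadline is not due yet
            heapq.heappop(self._heap)
            if now_ns > txtime_ns and drop_if_late:
                self.dropped += 1      # expired and flagged: drop, try the next
                continue
            return payload
        return None


if __name__ == "__main__":
    q = ToyTbs("CLOCK_REALTIME", delta_ns=10)
    q.enqueue("b", 2000, "CLOCK_REALTIME")
    q.enqueue("a", 1000, "CLOCK_REALTIME")
    q.enqueue("x", 1500, "/dev/ptp0")   # clockid mismatch: dropped on enqueue
    print(q.dequeue(980))               # None: 10 ns window not reached yet
    print(q.dequeue(990))               # a: released delta before its txtime
    print(q.dequeue(2500))              # b: late, but drop_if_late is not set
    print(q.dropped)                    # 1
```

The model also shows why delta matters in SW best-effort mode: since the NIC does not look at the timestamp there, the moment dequeue() releases the packet is effectively its Tx time.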

# HW offload without sorting #

    $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
               map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

    $ tc qdisc add dev enp2s0 parent 100:1 tbs offload

    In this example, the Qdisc will use HW offload for the control of the
    transmission time through the network adapter. It is implicitly assumed
    that the timestamps in skbuffs are in reference to the interface's PHC,
    and setting any other valid clockid would be treated as an error. Because
    there is no scheduling being performed in the qdisc, setting a delta != 0
    would also be considered an error.

# HW offload with sorting #

    $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
               map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

    $ tc qdisc add dev enp2s0 parent 100:1 tbs offload delta 10 \
               clockid CLOCK_REALTIME sorting

    Here, the Qdisc will use HW offload for the txtime control again,
    but now sorting will be enabled, and thus there will be scheduling
    being performed by