Re: [RFC v3 net-next 00/18] Time based packet transmission

2018-03-08 Thread Jesus Sanchez-Palencia
Hi,


On 03/08/2018 02:54 PM, Henrik Austad wrote:
> Just looking at the timestamp when the frames were received. They should be 
> sent at regular intervals if I read udp_tai.c correctly, so the assumption 
> was that the timestamp from tcpdump should give an inkling to how well it 
> worked.
> 
> I set it up to send a frame every 10ms and computed the diff between each 
> UDP packet received. Nothing fancy, just tcpdump and grep for the 
> timestamp and look at the distribution.

Ok, I see it now. Just as a reference, this is how I've been running tcpdump in
my tests:

$ tcpdump -i enp3s0 -w foo.pcap -j adapter_unsynced \
-tt --time-stamp-precision=nano udp port 7788 -c 1


> 
>>> I have to dig more into why this is happening, a lot frames delayed much 
>>> more than I'd expect, but at this stage I'm pretty sure this is pebkac. One 
>>> obvious fix is move some hw around and do a direct link, but I didn't have 
>>> time for that right now.
>>>
>>> I'm very interested in doing what Richard's original test was when he used 
>>> ptp-synched clocks and also used hw receive-time and compared with expected 
>>> tx-time. So, while I'm getting that up and running, I thought I should 
>>> share the early results.
>>
>> Sure, thanks. Which delta and clockid are you using, please?
> 
> I used the example provided in -00,
> 
> tc qdisc replace dev eth2 parent root handle 100 mqprio num_tc 3 \
>  map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
> 
> tc qdisc add dev eth2 parent 100:1 tbs offload delta 10 clockid \
>  CLOCK_REALTIME sorting


The delta value is highly dependent on the system. I recommend playing around
with it a bit before running long tests. On my KabyLake desktop I noticed that
150us is quite a reliable value, for example (same kernel as yours, and no
preempt-rt applied). But that is not the issue here, it seems.
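
As a rough sketch of turning a delta measured in microseconds into the tc
invocation quoted above: this assumes delta is expressed in nanoseconds, as
the cover letter's examples describe, and the helper name tbs_offload_cmd is
hypothetical (it is not part of iproute2 or of this series).

```python
def tbs_offload_cmd(dev, parent, delta_us, clockid="CLOCK_REALTIME"):
    """Build the 'tbs offload ... sorting' command line used in this thread.

    Assumption: the qdisc's delta argument is in nanoseconds, so a value
    measured in microseconds is multiplied by 1000.
    """
    delta_ns = round(delta_us * 1000)
    return (
        f"tc qdisc add dev {dev} parent {parent} tbs offload "
        f"delta {delta_ns} clockid {clockid} sorting"
    )


# Example: the 150us value mentioned above, on the sender's interface.
print(tbs_offload_cmd("eth2", "100:1", 150))
```

The string only mirrors the arguments already shown in this thread; it is meant
for experimenting with different delta values, not as a reference for the final
tc syntax.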



> 
>> Also, was this clock synchronized to the PHC? You need that for hw offload
>> with sorting enabled.
> 
> Hmm, good point, no, NIC clock was not synchronized, I'll do that in the 
> next round for both sender and receiver!

Oh, then you need to get that set up first. Here I synchronize both PHCs over the
network first with ptp4l:

Rx) $ ptp4l --summary_interval=3 -i enp3s0 -m -2
Tx) $ ptp4l --summary_interval=3 -i enp3s0 -s -m -2 &

My Rx is the PTP master and the Tx is the PTP slave.
Then I synchronize the PHC to the system clock on the Tx side only:

Tx) $ phc2sys -a -r -r -u 8 &


And udp_tai is using CLOCK_REALTIME. The UTC vs TAI 37s offset makes no
difference for this test specifically because I compensate for it when
calculating the offsets on the Rx side.
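
A minimal sketch of that Rx-side compensation, assuming the Rx hardware
timestamps are on the PHC's TAI timescale while the txtime handed to udp_tai
is on CLOCK_REALTIME (UTC). The function and variable names are hypothetical
and not taken from udp_tai.c:

```python
# TAI has been ahead of UTC by 37 seconds since 2017-01-01.
TAI_UTC_OFFSET_NS = 37 * 1_000_000_000


def rx_offset_ns(rx_hw_timestamp_tai_ns, txtime_utc_ns):
    """Return how far (in ns) the Rx timestamp is from the requested Tx time.

    Assumption: rx_hw_timestamp_tai_ns comes from a PTP-synchronized PHC (TAI)
    and txtime_utc_ns was computed from CLOCK_REALTIME (UTC) on the sender.
    """
    rx_utc_ns = rx_hw_timestamp_tai_ns - TAI_UTC_OFFSET_NS
    return rx_utc_ns - txtime_utc_ns


# Example: a frame requested for T (UTC) and stamped at T + 37s + 500ns (TAI).
txtime = 1_520_000_000_000_000_000
print(rx_offset_ns(txtime + TAI_UTC_OFFSET_NS + 500, txtime))
```

Integer nanoseconds are used throughout so that the subtraction does not lose
precision the way floating-point seconds would.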

For the next patchset version I will be providing a more complete set of testing
instructions. I hope that helps for now.


Thanks,
Jesus






Re: [RFC v3 net-next 00/18] Time based packet transmission

2018-03-08 Thread Henrik Austad
On Thu, Mar 08, 2018 at 10:06:46AM -0800, Jesus Sanchez-Palencia wrote:
> Hi,
> 
> 
> On 03/08/2018 06:09 AM, Henrik Austad wrote:
> 
> (...)
> 
> > 
> > A lot of new knobs, I see the need, I would've like to have fewer, but 
> > you've documented them pretty well. Perhaps we should add something to 
> > Documentation/ at one stage?
> 
> Sure. The idea is working on that once the interfaces have been accepted.

Yeah, probably a good idea.

> > Anyways, the patches applied cleanly so I gave them a (very) quick spin. 
> > Using udp_tai and tcpdump in the other end to grab the frames
> > 
> > Setting up with hw offload and sorting in qdisc.
> > 
> > Sender (every 10ms) (4.16-rc4 on a core2duo 1.8Ghz w/i210 and max_rss 
> > bypass as dual-core and i210 is not friends):
> > 
> > udp_tai -c1 -i eth2 -p 20 -P 1000
> > 
> > Receiver (imx7, kernel 4.9.11):
> > chrt -r 20 tcpdump -i eth0 ether host a0:36:9f:3f:c0:b8 | grep "UDP, length 
> > 256" > tai_imx7.log
> > 
> > Note: this involves 2 swtiches and a somewhat hackish kernel running on the 
> > receiver, so these numbers can only improve.
> > 
> > count  2340.00
> > mean   0.043770
> > std    0.047784
> > min    0.009025
> > 25%    0.010003
> > 50%    0.010010
> > 75%    0.109998
> > max    0.120060
> > 
> 
> Thanks for giving it a shot.
> 
> But I'm not sure I follow the numbers above, sorry :/
> Are you computing the packet's Rx timestamp offset from the (expected) Tx 
> time?

Just looking at the timestamp when the frames were received. They should be 
sent at regular intervals if I read udp_tai.c correctly, so the assumption 
was that the timestamp from tcpdump should give an inkling to how well it 
worked.

I set it up to send a frame every 10ms and computed the diff between each 
UDP packet received. Nothing fancy, just tcpdump and grep for the 
timestamp and look at the distribution.
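
A minimal sketch of that kind of post-processing, assuming each log line
starts with tcpdump's default HH:MM:SS.ffffff timestamp and that the capture
does not cross midnight. The function names are hypothetical, and this is not
the script actually used for the numbers in this thread:

```python
import statistics


def parse_tcpdump_time(line):
    """Convert a leading 'HH:MM:SS.ffffff' timestamp to seconds since midnight."""
    hh, mm, ss = line.split(maxsplit=1)[0].split(":")
    return int(hh) * 3600 + int(mm) * 60 + float(ss)


def interarrival_summary(timestamps):
    """Summarize the gaps (in seconds) between consecutive receive timestamps."""
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    # 'inclusive' interpolates linearly between the observed data points.
    q1, median, q3 = statistics.quantiles(gaps, n=4, method="inclusive")
    return {
        "count": len(gaps),
        "mean": statistics.fmean(gaps),
        "std": statistics.stdev(gaps),
        "min": min(gaps),
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": max(gaps),
    }
```

Feeding it the parsed lines of a log such as tai_imx7.log gives one summary per
capture. With a 10ms send period, gaps far from 0.010 show up directly in the
upper quartile and max.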

> > I have to dig more into why this is happening, a lot frames delayed much 
> > more than I'd expect, but at this stage I'm pretty sure this is pebkac. One 
> > obvious fix is move some hw around and do a direct link, but I didn't have 
> > time for that right now.
> > 
> > I'm very interested in doing what Richard's original test was when he used 
> > ptp-synched clocks and also used hw receive-time and compared with expected 
> > tx-time. So, while I'm getting that up and running, I thought I should 
> > share the early results.
> 
> Sure, thanks. Which delta and clockid are you using, please?

I used the example provided in -00,

tc qdisc replace dev eth2 parent root handle 100 mqprio num_tc 3 \
 map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

tc qdisc add dev eth2 parent 100:1 tbs offload delta 10 clockid \
 CLOCK_REALTIME sorting

> Also, was this clock synchronized to the PHC? You need that for hw offload
> with sorting enabled.

Hmm, good point, no, NIC clock was not synchronized, I'll do that in the 
next round for both sender and receiver!

-henrik




Re: [RFC v3 net-next 00/18] Time based packet transmission

2018-03-08 Thread Jesus Sanchez-Palencia
Hi,


On 03/08/2018 06:09 AM, Henrik Austad wrote:

(...)

> 
> A lot of new knobs, I see the need, I would've like to have fewer, but 
> you've documented them pretty well. Perhaps we should add something to 
> Documentation/ at one stage?

Sure. The idea is to work on that once the interfaces have been accepted.


> 
> Anyways, the patches applied cleanly so I gave them a (very) quick spin. 
> Using udp_tai and tcpdump in the other end to grab the frames
> 
> Setting up with hw offload and sorting in qdisc.
> 
> Sender (every 10ms) (4.16-rc4 on a core2duo 1.8Ghz w/i210 and max_rss 
> bypass as dual-core and i210 is not friends):
> 
> udp_tai -c1 -i eth2 -p 20 -P 1000
> 
> Receiver (imx7, kernel 4.9.11):
> chrt -r 20 tcpdump -i eth0 ether host a0:36:9f:3f:c0:b8 | grep "UDP, length 
> 256" > tai_imx7.log
> 
> Note: this involves 2 swtiches and a somewhat hackish kernel running on the 
> receiver, so these numbers can only improve.
> 
> count  2340.00
> mean   0.043770
> std    0.047784
> min    0.009025
> 25%    0.010003
> 50%    0.010010
> 75%    0.109998
> max    0.120060
> 

Thanks for giving it a shot.

But I'm not sure I follow the numbers above, sorry :/
Are you computing the packet's Rx timestamp offset from the (expected) Tx time?


> I have to dig more into why this is happening, a lot frames delayed much 
> more than I'd expect, but at this stage I'm pretty sure this is pebkac. One 
> obvious fix is move some hw around and do a direct link, but I didn't have 
> time for that right now.
> 
> I'm very interested in doing what Richard's original test was when he used 
> ptp-synched clocks and also used hw receive-time and compared with expected 
> tx-time. So, while I'm getting that up and running, I thought I should 
> share the early results.


Sure, thanks. Which delta and clockid are you using, please?
Also, was this clock synchronized to the PHC? You need that for hw offload with
sorting enabled.

Thanks,
Jesus

(...)



Re: [RFC v3 net-next 00/18] Time based packet transmission

2018-03-08 Thread Henrik Austad
On Tue, Mar 06, 2018 at 05:12:12PM -0800, Jesus Sanchez-Palencia wrote:
> This series is the v3 of the Time based packet transmission RFC, which was
> originally proposed by Richard Cochran (v1: https://lwn.net/Articles/733962/ )
> and further developed by us with the addition of the tbs qdisc
> (v2: https://lwn.net/Articles/744797/ ).

Nice!

> It introduces a new socket option (SO_TXTIME), a new qdisc (tbs) and
> implements support for hw offloading on the igb driver for the Intel
> i210 NIC. The tbs qdisc also supports SW best effort that can be used
> as a fallback.
> 
> The main changes since v2 can be found below.
> 
> Fixes since v2:
>  - skb->tstamp is only cleared on the forwarding path;
>  - ktime_t is no longer the type used for timestamps (s64 is);
>  - get_unaligned() is now used for copying data from the cmsg header;
>  - added getsockopt() support for SO_TXTIME;
>  - restricted SO_TXTIME input range to [0,1];
>  - removed ns_capable() check from __sock_cmsg_send();
>  - the qdisc control struct now uses a 32-bit bitmap for config flags;
>  - fixed qdisc backlog decrement bug;
>  - 'overlimits' is now incremented on dequeue() drops in addition to the
>    'dropped' counter;
> 
> Interface changes since v2:
>  * CMSG interface:
>    - added a per-packet clockid parameter to the cmsg (SCM_CLOCKID);
>    - added a per-packet drop_if_late flag to the cmsg (SCM_DROP_IF_LATE);
>  * tc-tbs:
>    - clockid now receives a string;
>      e.g.: CLOCK_REALTIME or /dev/ptp0
>    - offload is now a standalone argument (i.e. no more offload 1);
>    - sorting is now an argument that enables txtime based sorting provided
>      by the qdisc;
> 
> Design changes since v2:
>  - Now on the dequeue() path, tbs only drops an expired packet if it has the
>    skb->tc_drop_if_late flag set. In practical terms, this will define if
>    the semantics of txtime on a system is "not earlier than" or "not later
>    than" a given timestamp;
>  - Now on the enqueue() path, the qdisc will drop a packet if its clockid
>    doesn't match the qdisc's one;
>  - Sorting the packets based on their txtime is now an option for the qdisc.
>    Effectively, this means it can be configured in 4 modes: HW offload or
>    SW best-effort, sorting enabled or disabled;

A lot of new knobs. I see the need; I would've liked to have fewer, but 
you've documented them pretty well. Perhaps we should add something to 
Documentation/ at one stage?

Anyways, the patches applied cleanly so I gave them a (very) quick spin, 
using udp_tai and tcpdump at the other end to grab the frames.

Setting up with hw offload and sorting in qdisc.

Sender (every 10ms) (4.16-rc4 on a core2duo 1.8GHz w/i210 and max_rss 
bypass as dual-core and i210 are not friends):

udp_tai -c1 -i eth2 -p 20 -P 1000

Receiver (imx7, kernel 4.9.11):
chrt -r 20 tcpdump -i eth0 ether host a0:36:9f:3f:c0:b8 | grep "UDP, length 256" > tai_imx7.log

Note: this involves 2 switches and a somewhat hackish kernel running on the 
receiver, so these numbers can only improve.

count  2340.00
mean   0.043770
std    0.047784
min    0.009025
25%    0.010003
50%    0.010010
75%    0.109998
max    0.120060

I have to dig more into why this is happening; a lot of frames are delayed much 
more than I'd expect, but at this stage I'm pretty sure this is pebkac. One 
obvious fix is to move some hw around and do a direct link, but I didn't have 
time for that right now.

I'm very interested in doing what Richard's original test was when he used 
ptp-synched clocks and also used hw receive-time and compared with expected 
tx-time. So, while I'm getting that up and running, I thought I should 
share the early results.

-Henrik

> The tbs qdisc is designed so it buffers packets until a configurable time before
> their deadline (tx times). If sorting is enabled, regardless of HW offload or SW
> fallback modes, the qdisc uses a rbtree internally so the buffered packets are
> always 'ordered' by the earliest deadline.
> 
> If sorting is disabled, then for HW offload the qdisc will use a 'raw' FIFO
> through qdisc_enqueue_tail() / qdisc_dequeue_head(), whereas for SW best-effort,
> it will use a 'scheduled' FIFO.
> 
> The other configurable parameter from the tbs qdisc is the clockid to be used.
> In order to provide that, this series adds a new API to pkt_sched.h (i.e.
> qdisc_watchdog_init_clockid()).
> 
> The tbs qdisc will drop any packets with a transmission time in the past or
> when a deadline is missed if SCM_DROP_IF_LATE is set. Queueing packets in
> advance plus configuring the delta parameter for the system correctly makes
> all the difference in reducing the number of drops. Moreover, note that the
> delta parameter ends up defining the Tx time when SW best-effort is used
> given that the timestamps won't be used by the NIC on this case.
> 
> Examples:
> 
> # SW best-effort with sorting #
> 
> $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio 

Re: [RFC v3 net-next 00/18] Time based packet transmission

2018-03-06 Thread Richard Cochran
On Tue, Mar 06, 2018 at 05:12:12PM -0800, Jesus Sanchez-Palencia wrote:
> Design changes since v2:
>  - Now on the dequeue() path, tbs only drops an expired packet if it has the
>    skb->tc_drop_if_late flag set. In practical terms, this will define if
>    the semantics of txtime on a system is "not earlier than" or "not later
>    than" a given timestamp;
>  - Now on the enqueue() path, the qdisc will drop a packet if its clockid
>    doesn't match the qdisc's one;
>  - Sorting the packets based on their txtime is now an option for the qdisc.
>    Effectively, this means it can be configured in 4 modes: HW offload or
>    SW best-effort, sorting enabled or disabled;

While all of this makes the series and the configuration more complex,
still I like the fact that the interface offers these different modes.

Looking forward to testing this...

Thanks,
Richard


[RFC v3 net-next 00/18] Time based packet transmission

2018-03-06 Thread Jesus Sanchez-Palencia
This series is the v3 of the Time based packet transmission RFC, which was
originally proposed by Richard Cochran (v1: https://lwn.net/Articles/733962/ )
and further developed by us with the addition of the tbs qdisc
(v2: https://lwn.net/Articles/744797/ ).

It introduces a new socket option (SO_TXTIME), a new qdisc (tbs) and
implements support for hw offloading on the igb driver for the Intel
i210 NIC. The tbs qdisc also supports SW best effort that can be used
as a fallback.

The main changes since v2 can be found below.

Fixes since v2:
 - skb->tstamp is only cleared on the forwarding path;
 - ktime_t is no longer the type used for timestamps (s64 is);
 - get_unaligned() is now used for copying data from the cmsg header;
 - added getsockopt() support for SO_TXTIME;
 - restricted SO_TXTIME input range to [0,1];
 - removed ns_capable() check from __sock_cmsg_send();
 - the qdisc control struct now uses a 32-bit bitmap for config flags;
 - fixed qdisc backlog decrement bug;
 - 'overlimits' is now incremented on dequeue() drops in addition to the
   'dropped' counter;

Interface changes since v2:
 * CMSG interface:
   - added a per-packet clockid parameter to the cmsg (SCM_CLOCKID);
   - added a per-packet drop_if_late flag to the cmsg (SCM_DROP_IF_LATE);
 * tc-tbs:
   - clockid now receives a string;
     e.g.: CLOCK_REALTIME or /dev/ptp0
   - offload is now a standalone argument (i.e. no more offload 1);
   - sorting is now an argument that enables txtime based sorting provided
     by the qdisc;

Design changes since v2:
 - Now on the dequeue() path, tbs only drops an expired packet if it has the
   skb->tc_drop_if_late flag set. In practical terms, this will define if
   the semantics of txtime on a system is "not earlier than" or "not later
   than" a given timestamp;
 - Now on the enqueue() path, the qdisc will drop a packet if its clockid
   doesn't match the qdisc's one;
 - Sorting the packets based on their txtime is now an option for the qdisc.
   Effectively, this means it can be configured in 4 modes: HW offload or
   SW best-effort, sorting enabled or disabled;


The tbs qdisc is designed so it buffers packets until a configurable time before
their deadline (tx times). If sorting is enabled, regardless of HW offload or SW
fallback modes, the qdisc uses an rbtree internally so the buffered packets are
always 'ordered' by the earliest deadline.
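
To illustrate the ordering only, here is a small user-space sketch that keeps
buffered packets sorted by txtime. It swaps in a binary heap (Python's heapq)
for the rbtree used by the qdisc, and the class name TxtimeQueue is
hypothetical:

```python
import heapq
import itertools


class TxtimeQueue:
    """Buffer packets and hand them back earliest txtime first."""

    def __init__(self):
        self._heap = []
        # Tie-breaker so packets with equal txtime keep their arrival order.
        self._seq = itertools.count()

    def enqueue(self, txtime_ns, packet):
        heapq.heappush(self._heap, (txtime_ns, next(self._seq), packet))

    def dequeue(self):
        txtime_ns, _, packet = heapq.heappop(self._heap)
        return txtime_ns, packet

    def __len__(self):
        return len(self._heap)


q = TxtimeQueue()
q.enqueue(300, "c")
q.enqueue(100, "a")
q.enqueue(200, "b")
print([q.dequeue()[1] for _ in range(len(q))])
```

Packets enqueued out of order come back in txtime order, which is the property
the sorting option provides regardless of whether HW offload is in use.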

If sorting is disabled, then for HW offload the qdisc will use a 'raw' FIFO
through qdisc_enqueue_tail() / qdisc_dequeue_head(), whereas for SW best-effort,
it will use a 'scheduled' FIFO.

The other configurable parameter of the tbs qdisc is the clockid to be used.
In order to provide that, this series adds a new API to pkt_sched.h (i.e.
qdisc_watchdog_init_clockid()).

The tbs qdisc will drop any packets with a transmission time in the past or
when a deadline is missed if SCM_DROP_IF_LATE is set. Queueing packets in
advance plus configuring the delta parameter for the system correctly makes
all the difference in reducing the number of drops. Moreover, note that the
delta parameter ends up defining the Tx time when SW best-effort is used,
given that the timestamps won't be used by the NIC in this case.
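
The interplay between delta and the drop_if_late flag on the dequeue() path
can be sketched as follows. This is a simplified reading of the description
above, not the qdisc's actual code; the function name and the three result
strings are hypothetical:

```python
def dequeue_action(now_ns, txtime_ns, delta_ns, drop_if_late):
    """Decide what to do with the packet at the head of the queue.

    Assumptions: all times are in nanoseconds on the qdisc's clockid, and
    drop_if_late mirrors the per-packet SCM_DROP_IF_LATE flag.
    """
    if now_ns > txtime_ns and drop_if_late:
        # Deadline missed and the sender asked for "not later than".
        return "drop"
    if now_ns < txtime_ns - delta_ns:
        # Too early: keep buffering until delta before the txtime.
        return "wait"
    # Inside the delta window, or late with "not earlier than" semantics.
    return "send"


for now in (0, 990, 1001):
    print(now, dequeue_action(now, txtime_ns=1000, delta_ns=10, drop_if_late=True))
```

A delta that is too small for the system leaves too little time between
dequeue and the deadline, so more packets end up in the "drop" branch when the
flag is set.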

Examples:

# SW best-effort with sorting #

$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
   map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

$ tc qdisc add dev enp2s0 parent 100:1 tbs delta 10 \
   clockid CLOCK_REALTIME sorting

In this example, the mqprio qdisc is set up first, then the tbs qdisc is
configured onto the first hw Tx queue using SW best-effort with sorting
enabled. It is also configured so that the timestamps on each packet are in
reference to the clockid CLOCK_REALTIME and so that packets are dequeued from
the qdisc 10 nanoseconds before their transmission time.


# HW offload without sorting #

$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
   map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

$ tc qdisc add dev enp2s0 parent 100:1 tbs offload

In this example, the Qdisc will use HW offload for the control of the
transmission time through the network adapter. It's implicitly assumed that
the timestamps in skbuffs are in reference to the interface's PHC, and
setting any other valid clockid would be treated as an error. Because
there is no scheduling being performed in the qdisc, setting a delta != 0
would also be considered an error.


# HW offload with sorting #

$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
   map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

$ tc qdisc add dev enp2s0 parent 100:1 tbs offload delta 10 \
   clockid CLOCK_REALTIME sorting

Here, the Qdisc will use HW offload for the txtime control again,
but now sorting will be enabled, and thus there will be scheduling being
performed by