[PATCH v1 net-next] igb: Use an advanced ctx descriptor for launchtime

2018-07-26 Thread Jesus Sanchez-Palencia
On i210, Launchtime (TxTime) requires the usage of an "Advanced
Transmit Context Descriptor" for retrieving the timestamp of a packet.

The igb driver correctly builds such a descriptor on the segmentation
flow (i.e. igb_tso()) or on the checksum one (i.e. igb_tx_csum()), but the
feature is broken for AF_PACKET if IGB_TX_FLAGS_VLAN is not set,
because of an early return in igb_tx_csum().

This flag is only set by the kernel when a VLAN interface is used,
thus we can't just rely on it. Here we are fixing this issue by checking
if launchtime is enabled for the current tx_ring before performing the
early return.

Signed-off-by: Jesus Sanchez-Palencia 
---
 drivers/net/ethernet/intel/igb/igb_main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index e3a0c02721c9..fa1089defcd5 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -5816,7 +5816,8 @@ static void igb_tx_csum(struct igb_ring *tx_ring, struct igb_tx_buffer *first)
 
 	if (skb->ip_summed != CHECKSUM_PARTIAL) {
 csum_failed:
-		if (!(first->tx_flags & IGB_TX_FLAGS_VLAN))
+		if (!(first->tx_flags & IGB_TX_FLAGS_VLAN)
+		    && !tx_ring->launchtime_enable)
 			return;
 		goto no_csum;
 	}
-- 
2.18.0
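
For readers following along, the condition changed by this patch can be modelled in isolation. This is only a sketch: the flag value and the function name below are made up for illustration and are not taken from the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag value, for illustration only; the driver has its own. */
#define TX_FLAGS_VLAN (1u << 1)

/*
 * Returns true when a context descriptor is still needed even though no
 * checksum offload was requested for the packet.
 */
static bool needs_ctx_desc_without_csum(uint32_t tx_flags, bool launchtime_enable)
{
	/* Before the patch, only the VLAN flag was taken into account here. */
	return (tx_flags & TX_FLAGS_VLAN) || launchtime_enable;
}
```

With launchtime enabled and no VLAN flag the function returns true, which is the AF_PACKET case described in the commit message.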



Re: [PATCH v2 net-next 01/14] net: Clear skb->tstamp only on the forwarding path

2018-07-18 Thread Jesus Sanchez-Palencia
Hi Eric,


On 07/16/2018 04:15 PM, Eric Dumazet wrote:
> 
> 
> On 07/16/2018 02:52 PM, Jesus Sanchez-Palencia wrote:
>> Hi Eric,
>>
>>
>>
>> On 07/13/2018 10:35 AM, Eric Dumazet wrote:
>>>
>>>
>>> On 07/03/2018 03:42 PM, Jesus Sanchez-Palencia wrote:
>>>> This is done in preparation for the upcoming time based transmission
>>>> patchset. Now that skb->tstamp will be used to hold packet's txtime,
>>>> we must ensure that it is being cleared when traversing namespaces.
>>>> Also, doing that from skb_scrub_packet() before the early return would
>>>> break our feature when tunnels are used.
>>>>
>>>> Signed-off-by: Jesus Sanchez-Palencia 
>>>> ---
>>>>  net/core/skbuff.c | 2 +-
>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>>> index 1357f36c8a5e..c4e24ac27464 100644
>>>> --- a/net/core/skbuff.c
>>>> +++ b/net/core/skbuff.c
>>>> @@ -4898,7 +4898,6 @@ EXPORT_SYMBOL(skb_try_coalesce);
>>>>   */
>>>>  void skb_scrub_packet(struct sk_buff *skb, bool xnet)
>>>>  {
>>>> -  skb->tstamp = 0;
>>>>skb->pkt_type = PACKET_HOST;
>>>>skb->skb_iif = 0;
>>>>skb->ignore_df = 0;
>>>> @@ -4912,6 +4911,7 @@ void skb_scrub_packet(struct sk_buff *skb, bool xnet)
>>>>  
>>>>ipvs_reset(skb);
>>>>skb->mark = 0;
>>>> +  skb->tstamp = 0;
>>>>  }
>>>>  EXPORT_SYMBOL_GPL(skb_scrub_packet);
>>>>  
>>>>
>>>
>>>
>>>
>>> I believe we had some misunderstanding here.
>>>
>>> What I meant by forwarding is the following case :
>>>
>>> - We receive a packet.
>>> - netstamp_wanted is >0 (because at least one packet capture is active)
>>> - __net_timestamp() is called and does :
>>> skb->tstamp = ktime_get_real();
>>>
>>> Then this skb is forwarded into an interface where EDT is taken into
>>> consideration by either a qdisc or a device.
>>>
>>> Since CLOCK_TAI is a different base than CLOCK_REALTIME, we might have a 
>>> problem.
>>
>>
>> I'm not sure we have a problem here. For the Tx path I only see
>> net_timestamp_set() being called from dev_queue_xmit_nit(). And even there, 
>> it's
>> a clone of the skb that gets timestamped.
>>
>> I believe the original skb, which had the valid txtime copied into 
>> skb->tstamp,
>> is not modified anywhere along that path.
>>
>> What am I missing, please?
>>
>> Thanks,
>> Jesus
>>
> 
> 
> I am simply stating that a linux router, receiving packet on ethX and 
> forwarding
> them on ethY, could have a problem if ethY has a qdisc looking at skb->tstamp
> assuming a timestamp in CLOCK_TAI base.
> 
> In this case, skb->tstamp would have been set at ingress (not using CLOCK_TAI
> but CLOCK_REALTIME), and would be read at egress (assuming CLOCK_TAI)
> 
> Normal IPV4 routing path would be in net/ipv4/ip_forward.c, no scrubbing ever 
> happens,
> and no cloning either.
> 
> Your patch  (Clear skb->tstamp only on the forwarding path) is not handling 
> the
> typical forward path, only the cases where 'scrubbing' is used.


Thanks for the clarification, I wasn't following you before.

I believe we're fine with what we have *today*, because the qdisc will drop skbs
that don't have a valid skb->sk, or if the SO_TXTIME flag is not set.

This is a problem, however, if another qdisc is developed tomorrow and uses the
same information but does not perform the same checks. Or, if the packet somehow
gets to the controller and the HW has the "launch time" feature enabled and the
driver uses the tstamp information the same way we do.

If we want to protect against that now, I think we should go with the most
immediate solution and clear skb->tstamp somewhere in ip_forward(), as you've
pointed out before. It seems to me this wouldn't break any other feature that
might be re-using the tstamp information.

This is also a more correct fix for the problem you raised before:

>>> - We receive a packet.
>>> - netstamp_wanted is >0 (because at least one packet capture is active)
>>> - __net_timestamp() is called and does :
>>> skb->tstamp = ktime_get_real();
>>>
>>> Then this skb is forwarded into an interface where EDT is taken into
>>> consideration by either a qdisc or a device.


IMHO this Rx tstamp should never be forwarded to the Tx path as a valid txtime,
regardless of whether the time base used was UTC or TAI.


What do you think?

Jesus
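
As a toy illustration of the idea discussed above (clearing the Rx timestamp before a forwarded packet reaches a txtime-aware egress), the following user-space sketch may help. The struct and function names are hypothetical stand-ins, not kernel code, and the real change would live in the forwarding path.

```c
#include <stdint.h>

/* Toy stand-in for struct sk_buff; only the field discussed here. */
struct toy_skb {
	int64_t tstamp;	/* ns: Rx timestamp at ingress, txtime at egress */
};

/* Hypothetical hook run once for every forwarded packet. */
static void toy_forward(struct toy_skb *skb)
{
	/* An Rx timestamp must not be read as a txtime on egress. */
	skb->tstamp = 0;
}

/* Hypothetical egress check: zero means "no txtime requested". */
static int toy_has_txtime(const struct toy_skb *skb)
{
	return skb->tstamp != 0;
}
```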






> 
> 
> 
>>
>>
>>>
>>>
>>> Solutions for this problem :
>>>
>>> 1) Convert all our skb->tstamp usages to CLOCK_TAI base.
>>>
>>> or
>>>
>>> 2) clear skb->tstamp in forwarding paths, including the ones not scrubbing 
>>> the packet.
>>>
>>> My preference is 1), even if it is a bit more work.
>>>


Re: [PATCH v2 net-next 01/14] net: Clear skb->tstamp only on the forwarding path

2018-07-16 Thread Jesus Sanchez-Palencia
Hi Eric,



On 07/13/2018 10:35 AM, Eric Dumazet wrote:
> 
> 
> On 07/03/2018 03:42 PM, Jesus Sanchez-Palencia wrote:
>> This is done in preparation for the upcoming time based transmission
>> patchset. Now that skb->tstamp will be used to hold packet's txtime,
>> we must ensure that it is being cleared when traversing namespaces.
>> Also, doing that from skb_scrub_packet() before the early return would
>> break our feature when tunnels are used.
>>
>> Signed-off-by: Jesus Sanchez-Palencia 
>> ---
>>  net/core/skbuff.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index 1357f36c8a5e..c4e24ac27464 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -4898,7 +4898,6 @@ EXPORT_SYMBOL(skb_try_coalesce);
>>   */
>>  void skb_scrub_packet(struct sk_buff *skb, bool xnet)
>>  {
>> -skb->tstamp = 0;
>>  skb->pkt_type = PACKET_HOST;
>>  skb->skb_iif = 0;
>>  skb->ignore_df = 0;
>> @@ -4912,6 +4911,7 @@ void skb_scrub_packet(struct sk_buff *skb, bool xnet)
>>  
>>  ipvs_reset(skb);
>>  skb->mark = 0;
>> +skb->tstamp = 0;
>>  }
>>  EXPORT_SYMBOL_GPL(skb_scrub_packet);
>>  
>>
> 
> 
> 
> I believe we had some misunderstanding here.
> 
> What I meant by forwarding is the following case :
> 
> - We receive a packet.
> - netstamp_wanted is >0 (because at least one packet capture is active)
> - __net_timestamp() is called and does :
> skb->tstamp = ktime_get_real();
> 
> Then this skb is forwarded into an interface where EDT is taken into
> consideration by either a qdisc or a device.
> 
> Since CLOCK_TAI is a different base than CLOCK_REALTIME, we might have a 
> problem.


I'm not sure we have a problem here. For the Tx path I only see
net_timestamp_set() being called from dev_queue_xmit_nit(). And even there, it's
a clone of the skb that gets timestamped.

I believe the original skb, which had the valid txtime copied into skb->tstamp,
is not modified anywhere along that path.

What am I missing, please?

Thanks,
Jesus



> 
> 
> Solutions for this problem :
> 
> 1) Convert all our skb->tstamp usages to CLOCK_TAI base.
> 
> or
> 
> 2) clear skb->tstamp in forwarding paths, including the ones not scrubbing 
> the packet.
> 
> My preference is 1), even if it is a bit more work.
> 


Re: tc mqprio offload command error

2018-07-16 Thread Jesus Sanchez-Palencia
Hi,


On 07/16/2018 10:20 AM, Alexander Duyck wrote:
> On Sun, Jul 15, 2018 at 6:30 PM, Chopra, Manish
>  wrote:
>> Hello Folks,
>>
>> I am trying to run the command below to try mqprio offload on a 4.18 kernel.
>> It is throwing the following error.
>>
>> # tc qdisc add dev eth0 root mqprio num_tc 2 map 1 1 1 1 0 0 0 0
>> RTNETLINK answers: Numerical result out of range
>>
>> I can't really make out what's wrong with the above command, since this 
>> works fine with other OS kernels.
>> Any thoughts if it is something broken on upstream kernel ?
>>
>> Thanks,
>> Manish
> 
> You might need to specify the traffic class for the 8 remaining
> priorities. The full map size is 16 entries, not just 8. The default
> value for the last 4 mapping entries is TC 3 which would be out of
> range if you only have 2 TCs specified.


In addition to that, you might hit the same bug we brought up [1] a while ago.
If that is the case, a fix was just proposed here [2]. Note that other qdiscs
might be broken as well, but we could only spot the issue with mqprio and netem
so far.

[1] https://patchwork.ozlabs.org/patch/867860/#1893405
[2] https://patchwork.ozlabs.org/patch/944565/


Regards,
Jesus


> 
> - Alex
> 
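
The range check described in the reply above can be sketched as follows. This is a simplified model, not the kernel's validation code; the function name is hypothetical, and in the usage example the values chosen for entries 8 to 11 are placeholders, with only the "last 4 entries default to TC 3" detail taken from the reply.

```c
#include <stddef.h>

#define PRIO_MAP_SIZE 16	/* full priority map size, per the reply above */

/*
 * Returns the index of the first priority that maps to a traffic class
 * outside [0, num_tc), or -1 if the whole map is valid.
 */
static int first_bad_prio(const unsigned char map[PRIO_MAP_SIZE],
			  unsigned int num_tc)
{
	for (size_t i = 0; i < PRIO_MAP_SIZE; i++)
		if (map[i] >= num_tc)
			return (int)i;
	return -1;
}
```

Given a map whose first 8 entries are "1 1 1 1 0 0 0 0" and whose last 4 entries are 3, a check with num_tc 2 fails at index 12, while a map that spells out all 16 entries with values below num_tc passes.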


[PATCH v1 iproute2] tc: Do not use addattr_nest_compat on mqprio and netem

2018-07-16 Thread Jesus Sanchez-Palencia
Here we are partially reverting commit c14f9d92eee107
"treewide: Use addattr_nest()/addattr_nest_end() to handle nested
attributes" .

As discussed in [1], changing from the 'manually' coded version that
used addattr_l() to addattr_nest_compat() wasn't functionally
equivalent, because now the messages have extra fields appended to them.

This introduced a regression since the implementation of parse_attr()
from both mqprio and netem can't handle this new message format.

Without this fix, mqprio returns an error. netem won't return an error
but its internal configuration ends up wrong.

As an example, this can be reproduced by the following commands when
this patch is not applied:

 1) mqprio
$ tc qdisc replace dev enp3s0 parent root handle 100 mqprio \
num_tc 3 map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
queues 1@0 1@1 2@2 hw 0

RTNETLINK answers: Numerical result out of range

 2) netem
$ tc qdisc add dev enp3s0 root netem rate 5kbit 20 100 5 \
distribution normal latency 1 1

$ tc -s qdisc

(...)
qdisc netem 8001: dev enp3s0 root refcnt 9 limit 1000 delay 0us  0us
 Sent 402 bytes 1 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
(...)

With this patch applied, the tc -s qdisc command above for netem instead
reads:

(...)
qdisc netem 8002: dev enp3s0 root refcnt 9 limit 1000 delay 0us  0us \
rate 5Kbit packetoverhead 20 cellsize 100 celloverhead 5
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
(...)

[1] https://patchwork.ozlabs.org/patch/867860/#1893405

Fixes: c14f9d92eee107 ("treewide: Use addattr_nest()/addattr_nest_end() to handle nested attributes")
Reported-by: Vinicius Costa Gomes 
Signed-off-by: Jesus Sanchez-Palencia 
---
 tc/q_mqprio.c | 5 +++--
 tc/q_netem.c  | 7 +--
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/tc/q_mqprio.c b/tc/q_mqprio.c
index 207d6441..89b46002 100644
--- a/tc/q_mqprio.c
+++ b/tc/q_mqprio.c
@@ -173,7 +173,8 @@ static int mqprio_parse_opt(struct qdisc_util *qu, int argc,
argc--; argv++;
}
 
-   tail = addattr_nest_compat(n, 1024, TCA_OPTIONS, &opt, sizeof(opt));
+   tail = NLMSG_TAIL(n);
+   addattr_l(n, 1024, TCA_OPTIONS, &opt, sizeof(opt));
 
if (flags & TC_MQPRIO_F_MODE)
addattr_l(n, 1024, TCA_MQPRIO_MODE,
@@ -208,7 +209,7 @@ static int mqprio_parse_opt(struct qdisc_util *qu, int argc,
addattr_nest_end(n, start);
}
 
-   addattr_nest_compat_end(n, tail);
+   tail->rta_len = (void *)NLMSG_TAIL(n) - (void *)tail;
 
return 0;
 }
diff --git a/tc/q_netem.c b/tc/q_netem.c
index 623ec903..9f9a9b3d 100644
--- a/tc/q_netem.c
+++ b/tc/q_netem.c
@@ -422,6 +422,8 @@ static int netem_parse_opt(struct qdisc_util *qu, int argc, 
char **argv,
}
}
 
+   tail = NLMSG_TAIL(n);
+
if (reorder.probability) {
if (opt.latency == 0) {
fprintf(stderr, "reordering not possible without 
specifying some delay\n");
@@ -450,7 +452,8 @@ static int netem_parse_opt(struct qdisc_util *qu, int argc, 
char **argv,
return -1;
}
 
-   tail = addattr_nest_compat(n, 1024, TCA_OPTIONS, &opt, sizeof(opt));
+   if (addattr_l(n, 1024, TCA_OPTIONS, &opt, sizeof(opt)) < 0)
+   return -1;
 
if (present[TCA_NETEM_CORR] &&
addattr_l(n, 1024, TCA_NETEM_CORR, &cor, sizeof(cor)) < 0)
@@ -509,7 +512,7 @@ static int netem_parse_opt(struct qdisc_util *qu, int argc, 
char **argv,
return -1;
free(dist_data);
}
-   addattr_nest_compat_end(n, tail);
+   tail->rta_len = (void *) NLMSG_TAIL(n) - (void *) tail;
return 0;
 }
 
-- 
2.18.0
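
The "manual" nesting that this patch restores (remember the tail, append attributes, then patch rta_len of the outer attribute) can be sketched outside of iproute2 as below. It assumes Linux uapi headers are available; struct toy_msg and the toy_* helpers are hypothetical, simplified stand-ins for struct nlmsghdr, NLMSG_TAIL() and addattr_l(), not the library's implementation.

```c
#include <string.h>
#include <linux/rtnetlink.h>

/* Toy message buffer; len is the number of bytes used so far. */
struct toy_msg {
	unsigned int len;
	char buf[256] __attribute__((aligned(4)));
};

/* Position where the next attribute will be written. */
static struct rtattr *toy_tail(struct toy_msg *m)
{
	return (struct rtattr *)(m->buf + m->len);
}

/* Append one attribute; a simplified stand-in for addattr_l(). */
static int toy_addattr(struct toy_msg *m, int type, const void *data, int alen)
{
	int len = RTA_LENGTH(alen);
	struct rtattr *rta = toy_tail(m);

	if (m->len + RTA_ALIGN(len) > sizeof(m->buf))
		return -1;
	rta->rta_type = type;
	rta->rta_len = len;
	if (alen)
		memcpy(RTA_DATA(rta), data, alen);
	m->len += RTA_ALIGN(len);
	return 0;
}

/* Stretch the outer attribute so it covers everything appended after it. */
static void toy_nest_close(struct toy_msg *m, struct rtattr *outer)
{
	outer->rta_len = (char *)toy_tail(m) - (char *)outer;
}
```

A caller saves toy_tail() first, appends the outer attribute carrying the options struct, appends the nested attributes, and finally calls toy_nest_close() on the saved pointer, which mirrors the tail->rta_len assignment in the diff above.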



[PATCH v4 iproute2-next 2/3] tc: Add support for the ETF Qdisc

2018-07-09 Thread Jesus Sanchez-Palencia
From: Vinicius Costa Gomes 

The "Earliest TxTime First" (ETF) queueing discipline allows precise
control of the transmission time of packets by providing a sorted
time-based scheduling of packets.

The syntax is:

tc qdisc add dev DEV parent NODE etf delta NANOS
 clockid CLOCKID [offload] [deadline_mode]

Signed-off-by: Vinicius Costa Gomes 
Signed-off-by: Jesus Sanchez-Palencia 
---
 tc/Makefile |   1 +
 tc/q_etf.c  | 181 
 2 files changed, 182 insertions(+)
 create mode 100644 tc/q_etf.c

diff --git a/tc/Makefile b/tc/Makefile
index dfd00267..4525c0fb 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -71,6 +71,7 @@ TCMODULES += q_clsact.o
 TCMODULES += e_bpf.o
 TCMODULES += f_matchall.o
 TCMODULES += q_cbs.o
+TCMODULES += q_etf.o
 
 TCSO :=
 ifeq ($(TC_CONFIG_ATM),y)
diff --git a/tc/q_etf.c b/tc/q_etf.c
new file mode 100644
index ..79a06ba8
--- /dev/null
+++ b/tc/q_etf.c
@@ -0,0 +1,181 @@
+/*
+ * q_etf.c Earliest TxTime First (ETF).
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors:Vinicius Costa Gomes 
+ *     Jesus Sanchez-Palencia 
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "utils.h"
+#include "tc_util.h"
+
+#define CLOCKID_INVALID (-1)
+static const struct static_clockid {
+   const char *name;
+   clockid_t clockid;
+} clockids_sysv[] = {
+   { "REALTIME", CLOCK_REALTIME },
+   { "TAI", CLOCK_TAI },
+   { "BOOTTIME", CLOCK_BOOTTIME },
+   { "MONOTONIC", CLOCK_MONOTONIC },
+   { NULL }
+};
+
+static void explain(void)
+{
+   fprintf(stderr, "Usage: ... etf delta NANOS clockid CLOCKID [offload] [deadline_mode]\n");
+   fprintf(stderr, "CLOCKID must be a valid SYS-V id (e.g. CLOCK_TAI)\n");
+}
+
+static void explain1(const char *arg, const char *val)
+{
+   fprintf(stderr, "etf: illegal value for \"%s\": \"%s\"\n", arg, val);
+}
+
+static void explain_clockid(const char *val)
+{
+   fprintf(stderr, "etf: illegal value for \"clockid\": \"%s\".\n", val);
+   fprintf(stderr, "It must be a valid SYS-V id (e.g. CLOCK_TAI)\n");
+}
+
+static int get_clockid(__s32 *val, const char *arg)
+{
+   const struct static_clockid *c;
+
+   /* Drop the CLOCK_ prefix if that is being used. */
+   if (strcasestr(arg, "CLOCK_") != NULL)
+   arg += sizeof("CLOCK_") - 1;
+
+   for (c = clockids_sysv; c->name; c++) {
+   if (strcasecmp(c->name, arg) == 0) {
+   *val = c->clockid;
+
+   return 0;
+   }
+   }
+
+   return -1;
+}
+
+static const char* get_clock_name(clockid_t clockid)
+{
+   const struct static_clockid *c;
+
+   for (c = clockids_sysv; c->name; c++) {
+   if (clockid == c->clockid)
+   return c->name;
+   }
+
+   return "invalid";
+}
+
+static int etf_parse_opt(struct qdisc_util *qu, int argc,
+char **argv, struct nlmsghdr *n, const char *dev)
+{
+   struct tc_etf_qopt opt = {
+   .clockid = CLOCKID_INVALID,
+   };
+   struct rtattr *tail;
+
+   while (argc > 0) {
+   if (matches(*argv, "offload") == 0) {
+   if (opt.flags & TC_ETF_OFFLOAD_ON) {
+   fprintf(stderr, "etf: duplicate \"offload\" specification\n");
+   return -1;
+   }
+
+   opt.flags |= TC_ETF_OFFLOAD_ON;
+   } else if (matches(*argv, "deadline_mode") == 0) {
+   if (opt.flags & TC_ETF_DEADLINE_MODE_ON) {
+   fprintf(stderr, "etf: duplicate \"deadline_mode\" specification\n");
+   return -1;
+   }
+
+   opt.flags |= TC_ETF_DEADLINE_MODE_ON;
+   } else if (matches(*argv, "delta") == 0) {
+   NEXT_ARG();
+   if (opt.delta) {
+   fprintf(stderr, "etf: duplicate \"delta\" specification\n");
+   return -1;
+   }
+   if (get_s32(&opt.delta, *argv, 0)) {
+   explain1("delta", *argv);
+   return -1;
+  

[PATCH v4 iproute2-next 0/3] Add support for ETF qdisc

2018-07-09 Thread Jesus Sanchez-Palencia
fixes since v3:
 - Add support for clock names with the "CLOCK_" prefix;
 - Print clock name on print_opt();
 - Use strcasecmp() instead of strncasecmp().


The ETF (earliest txtime first) qdisc was recently merged into net-next
[1], so this patchset adds support for it through the tc command line
tool.

An initial man page is also provided.

The first commit in this series is adding an updated version of
include/uapi/linux/pkt_sched.h and is not meant to be merged. It's
provided here just as a convenience for those who want to easily build
this patchset.

[1] https://patchwork.ozlabs.org/cover/938991/

Jesus Sanchez-Palencia (2):
  uapi pkt_sched: Add etf info - DO NOT COMMIT
  man: Add initial manpage for tc-etf(8)

Vinicius Costa Gomes (1):
  tc: Add support for the ETF Qdisc

 include/uapi/linux/pkt_sched.h |  21 
 man/man8/tc-etf.8  | 141 +
 tc/Makefile|   1 +
 tc/q_etf.c | 181 +
 4 files changed, 344 insertions(+)
 create mode 100644 man/man8/tc-etf.8
 create mode 100644 tc/q_etf.c

-- 
2.18.0



[PATCH v4 iproute2-next 3/3] man: Add initial manpage for tc-etf(8)

2018-07-09 Thread Jesus Sanchez-Palencia
Add an initial manpage for tc-etf covering all config options, basic
concepts and operation modes.

Signed-off-by: Jesus Sanchez-Palencia 
---
 man/man8/tc-etf.8 | 141 ++
 1 file changed, 141 insertions(+)
 create mode 100644 man/man8/tc-etf.8

diff --git a/man/man8/tc-etf.8 b/man/man8/tc-etf.8
new file mode 100644
index ..30a12de7
--- /dev/null
+++ b/man/man8/tc-etf.8
@@ -0,0 +1,141 @@
+.TH ETF 8 "05 Jul 2018" "iproute2" "Linux"
+.SH NAME
+ETF \- Earliest TxTime First (ETF) Qdisc
+.SH SYNOPSIS
+.B tc qdisc ... dev
+dev
+.B parent
+classid
+.B [ handle
+major:
+.B ] etf clockid
+clockid
+.B [ delta
+delta_nsecs
+.B ] [ deadline_mode ]
+.B [ offload ]
+
+.SH DESCRIPTION
+The ETF (Earliest TxTime First) qdisc allows applications to control
+the instant when a packet should be dequeued from the traffic control
+layer into the netdevice. If
+.B offload
+is configured and supported by the network interface card, then it will
+also control when packets leave the network controller.
+
+ETF achieves that by buffering packets until a configurable time
+before their transmission time (i.e. txtime, or deadline), which can
+be configured through the
+.B delta
+option.
+
+The qdisc uses a rb-tree internally so packets are always 'ordered' by
+their txtime and will be dequeued following the (next) earliest txtime
+first.
+
+It relies on the SO_TXTIME socket option and the SCM_TXTIME CMSG in
+each packet field to configure the behavior of time dependent sockets:
+the clockid to be used as a reference, if the expected mode of txtime
+for that socket is deadline or strict mode, and if packet drops should
+be reported on the socket's error queue. See
+.BR socket(7)
+for more information.
+
+The etf qdisc will drop any packet with a txtime in the past, as well as
+any packet that expires while waiting to be dequeued.
+
+This queueing discipline is intended to be used by TSN (Time Sensitive
+Networking) applications, and it exposes a traffic shaping functionality
+that is commonly documented as "Launch Time" or "Time-Based Scheduling"
+by vendors and the documentation of network interface controllers.
+
+ETF is meant to be installed under another qdisc that maps packet flows
+to traffic classes, one example is
+.BR mqprio(8).
+
+.SH PARAMETERS
+.TP
+clockid
+.br
+Specifies the clock to be used by qdisc's internal timer for measuring
+time and scheduling events. The qdisc expects packets passing
+through it to be using this same
+.B clockid
+as the reference of their txtime timestamps. It will drop packets
+coming from sockets that do not comply with that.
+
+For more information about time and clocks on Linux, please refer
+to
+.BR time(7)
+and
+.BR clock_gettime(3).
+
+.TP
+delta
+.br
+After enqueueing or dequeueing a packet, the qdisc will schedule its
+next wake-up time for the next txtime minus this delta value.
+This means
+.B delta
+can be used as a fudge factor for the scheduler latency of a system.
+This value must be specified in nanoseconds.
+The default value is 0 nanoseconds.
+
+.TP
+deadline_mode
+.br
+When
+.B deadline_mode
+is set, the qdisc will handle txtime with different semantics,
+changed from a 'strict' transmission time to a deadline.
+In practice, this means during the dequeue flow
+.BR etf(8)
+will set the txtime of the packet being dequeued to 'now'.
+The default is for this option to be disabled.
+
+.TP
+offload
+.br
+When
+.B offload
+is set,
+.BR etf(8)
+will try to configure the network interface so time-based transmission
+arbitration is enabled in the controller. This feature is commonly
+referred to as "Launch Time" or "Time-Based Scheduling" by the
+documentation of network interface controllers.
+The default is for this option to be disabled.
+
+.SH EXAMPLES
+
+ETF is used to enforce a Quality of Service. It controls when each
+packet should be dequeued and transmitted, and can be used for
+limiting the data rate of a traffic class. To separate packets into
+traffic classes the user may choose
+.BR mqprio(8),
+and configure it like this:
+
+.EX
+# tc qdisc add dev eth0 handle 100: parent root mqprio num_tc 3 \\
+   map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \\
+   queues 1@0 1@1 2@2 \\
+   hw 0
+.EE
+.P
+To replace the current queueing discipline by ETF in traffic class
+number 0, issue:
+.P
+.EX
+# tc qdisc replace dev eth0 parent 100:1 etf \\
+   clockid CLOCK_TAI delta 300000 offload
+.EE
+
+With the options above, etf will be configured to use CLOCK_TAI as
+its clockid_t, will schedule packets for 300 us before their txtime,
+and will enable offloading of this functionality to the network
+interface card. Deadline mode will not be enabled.
+
+.SH AUTHORS
+Jesus Sanchez-Palencia 
+.br
+Vinicius Costa Gomes 
-- 
2.18.0
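
On the application side, the man page above mentions the SO_TXTIME socket option and the SCM_TXTIME control message. The sketch below only shows how such a control message could be laid out for sendmsg(2). It assumes a Linux system; the fallback value 61 is an assumption taken from the generic socket header and may differ on some architectures, and set_txtime_cmsg() is a hypothetical helper name. Enabling SO_TXTIME on the socket via setsockopt(2) is not shown.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Assumption: generic-architecture value, used only if headers lack it. */
#ifndef SO_TXTIME
#define SO_TXTIME 61
#endif
#ifndef SCM_TXTIME
#define SCM_TXTIME SO_TXTIME
#endif

/*
 * Attach a txtime (in ns, on the clock configured for the socket) to msg.
 * ctrl must be suitably aligned and at least CMSG_SPACE(sizeof(uint64_t))
 * bytes long. Returns 0 on success, -1 if the buffer is too small.
 */
static int set_txtime_cmsg(struct msghdr *msg, void *ctrl, size_t ctrl_len,
			   uint64_t txtime_ns)
{
	struct cmsghdr *cmsg;

	if (ctrl_len < CMSG_SPACE(sizeof(txtime_ns)))
		return -1;

	memset(ctrl, 0, ctrl_len);
	msg->msg_control = ctrl;
	msg->msg_controllen = CMSG_SPACE(sizeof(txtime_ns));

	cmsg = CMSG_FIRSTHDR(msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_TXTIME;
	cmsg->cmsg_len = CMSG_LEN(sizeof(txtime_ns));
	memcpy(CMSG_DATA(cmsg), &txtime_ns, sizeof(txtime_ns));
	return 0;
}
```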



[PATCH v4 iproute2-next 1/3] uapi pkt_sched: Add etf info - DO NOT COMMIT

2018-07-09 Thread Jesus Sanchez-Palencia
This should come from the next uapi headers update.
Sending it now just as a convenience so anyone can build tc with etf
and taprio support.

Signed-off-by: Jesus Sanchez-Palencia 
---
 include/uapi/linux/pkt_sched.h | 21 +
 1 file changed, 21 insertions(+)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 37b5096a..94911846 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -539,6 +539,7 @@ enum {
TCA_NETEM_LATENCY64,
TCA_NETEM_JITTER64,
TCA_NETEM_SLOT,
+   TCA_NETEM_SLOT_DIST,
__TCA_NETEM_MAX,
 };
 
@@ -581,6 +582,8 @@ struct tc_netem_slot {
__s64   max_delay;
__s32   max_packets;
__s32   max_bytes;
+   __s64   dist_delay; /* nsec */
+   __s64   dist_jitter; /* nsec */
 };
 
 enum {
@@ -934,4 +937,22 @@ enum {
 
 #define TCA_CBS_MAX (__TCA_CBS_MAX - 1)
 
+
+/* ETF */
+struct tc_etf_qopt {
+   __s32 delta;
+   __s32 clockid;
+   __u32 flags;
+#define TC_ETF_DEADLINE_MODE_ON	BIT(0)
+#define TC_ETF_OFFLOAD_ON  BIT(1)
+};
+
+enum {
+   TCA_ETF_UNSPEC,
+   TCA_ETF_PARMS,
+   __TCA_ETF_MAX,
+};
+
+#define TCA_ETF_MAX (__TCA_ETF_MAX - 1)
+
 #endif
-- 
2.18.0



[PATCH v2 net-next] net: Use __u32 in uapi net_stamp.h

2018-07-09 Thread Jesus Sanchez-Palencia
We are not supposed to use u32 in uapi, so change the flags member of
struct sock_txtime from u32 to __u32 instead.

Fixes: 80b14dee2bea ("net: Add a new socket option for a future transmit time")
Reported-by: Eric Dumazet 
Signed-off-by: Jesus Sanchez-Palencia 
---
 include/uapi/linux/net_tstamp.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/net_tstamp.h b/include/uapi/linux/net_tstamp.h
index f8f4539f1135..97ff3c17ec4d 100644
--- a/include/uapi/linux/net_tstamp.h
+++ b/include/uapi/linux/net_tstamp.h
@@ -155,8 +155,8 @@ enum txtime_flags {
 };
 
 struct sock_txtime {
-   clockid_t   clockid;/* reference clockid */
-   u32 flags;  /* flags defined by enum txtime_flags */
+   clockid_t   clockid;/* reference clockid */
+   __u32   flags;  /* as defined by enum txtime_flags */
 };
 
 #endif /* _NET_TIMESTAMPING_H */
-- 
2.18.0



Re: [PATCH net-next] net: Use __u32 in uapi net_stamp.h

2018-07-09 Thread Jesus Sanchez-Palencia



On 07/09/2018 04:18 PM, Eric Dumazet wrote:
> 
> 
> On 07/09/2018 04:08 PM, Jesus Sanchez-Palencia wrote:
>> We are not supposed to use u32 in uapi, so change the flags member of
>> struct sock_txtime from u32 to __u32 instead.
>>
>> Fixes: 80b14dee2bea ("net: Add a new socket option for a future transmit 
>> time")
>> Signed-off-by: Jesus Sanchez-Palencia 
> 
> Could you use this patch as an opportunity to tab-align the fields names ?
> 
> Also you can credit the reporter, as in :
> 
> Reported-by: Eric Dumazet 
> 

Sure


> Thanks !
> 


Re: [PATCH v3 iproute2 2/3] tc: Add support for the ETF Qdisc

2018-07-09 Thread Jesus Sanchez-Palencia



On 07/09/2018 10:32 AM, David Ahern wrote:
> On 7/9/18 9:48 AM, Jesus Sanchez-Palencia wrote:
>> Hi David,
>>
>>
>> On 07/06/2018 08:58 AM, David Ahern wrote:
>>> On 7/5/18 4:42 PM, Jesus Sanchez-Palencia wrote:
>>>
>>>> +static int get_clockid(__s32 *val, const char *arg)
>>>> +{
>>>> +  const struct static_clockid {
>>>> +  const char *name;
>>>> +  clockid_t clockid;
>>>> +  } clockids_sysv[] = {
>>>> +  { "CLOCK_REALTIME", CLOCK_REALTIME },
>>>> +  { "CLOCK_TAI", CLOCK_TAI },
>>>> +  { "CLOCK_BOOTTIME", CLOCK_BOOTTIME },
>>>> +  { "CLOCK_MONOTONIC", CLOCK_MONOTONIC },
>>>> +  { NULL }
>>>> +  };
>>>> +
>>>> +  const struct static_clockid *c;
>>>> +
>>>> +  for (c = clockids_sysv; c->name; c++) {
>>>> +  if (strncasecmp(c->name, arg, 25) == 0) {
>>>
>>> Why 25?
>>
>>
>> That was just an upper bound giving some room beyond the longest
>> clockid name we have today. Should I add a define MAX_CLOCK_NAME ?
> 
> why not just strcasecmp? using the 'n' variant with n > strlen of either
> argument seems pointless.


Ok, will fix.


> 
>>
>>
>>>
>>> be nice to allow shortcuts -- e.g., just REALTIME or realtime.
>>
>>
>> I'd rather just keep it as is and use the names as they are defined for
>> everything else (i.e. CLOCK_REALTIME), unless there are some strong 
>> objections.
> 
> An all caps argument is unnecessary work on the pinky finger and the
> CLOCK_ prefix is redundant to the keyword. Really, just a thought on
> making it easier for users. A CLI argument does not need to maintain a
> 1:1 with code names.


Lower case already works given the strncasecmp() usage but, fair enough, I will
modify the implementation so it accepts both CLOCK_FOO or FOO (lower case
included), and will make it print one of the two strings during print_opt().


Thanks,
Jesus
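
The behaviour agreed on above (accept either CLOCK_FOO or FOO, in any case) could be sketched roughly like this. It is not the code from the patch: the table and function names are hypothetical, and this variant strips the prefix only when it is really at the start of the argument.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <strings.h>
#include <time.h>

static const struct {
	const char *name;
	clockid_t id;
} clock_names[] = {
	{ "REALTIME", CLOCK_REALTIME },
	{ "TAI", CLOCK_TAI },
	{ "BOOTTIME", CLOCK_BOOTTIME },
	{ "MONOTONIC", CLOCK_MONOTONIC },
};

/* Returns 0 and stores the clockid on success, -1 for an unknown name. */
static int parse_clock_name(const char *arg, clockid_t *out)
{
	const size_t plen = sizeof("CLOCK_") - 1;
	size_t i;

	/* Strip the prefix only when it is at the start of arg. */
	if (strncasecmp(arg, "CLOCK_", plen) == 0)
		arg += plen;

	for (i = 0; i < sizeof(clock_names) / sizeof(clock_names[0]); i++) {
		if (strcasecmp(clock_names[i].name, arg) == 0) {
			*out = clock_names[i].id;
			return 0;
		}
	}
	return -1;
}
```

With this sketch, "CLOCK_TAI", "tai" and "clock_tai" all resolve to CLOCK_TAI, while an unknown name is rejected.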



[PATCH net-next] net: Use __u32 in uapi net_stamp.h

2018-07-09 Thread Jesus Sanchez-Palencia
We are not supposed to use u32 in uapi, so change the flags member of
struct sock_txtime from u32 to __u32 instead.

Fixes: 80b14dee2bea ("net: Add a new socket option for a future transmit time")
Signed-off-by: Jesus Sanchez-Palencia 
---
 include/uapi/linux/net_tstamp.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/uapi/linux/net_tstamp.h b/include/uapi/linux/net_tstamp.h
index f8f4539f1135..bdae4806fe40 100644
--- a/include/uapi/linux/net_tstamp.h
+++ b/include/uapi/linux/net_tstamp.h
@@ -156,7 +156,7 @@ enum txtime_flags {
 
 struct sock_txtime {
clockid_t   clockid;/* reference clockid */
-   u32 flags;  /* flags defined by enum txtime_flags */
+   __u32 flags;/* flags defined by enum txtime_flags */
 };
 
 #endif /* _NET_TIMESTAMPING_H */
-- 
2.18.0



Re: [PATCH v2 net-next 02/14] net: Add a new socket option for a future transmit time.

2018-07-09 Thread Jesus Sanchez-Palencia



On 07/07/2018 05:44 PM, Eric Dumazet wrote:
> 
> 
> On 07/03/2018 03:42 PM, Jesus Sanchez-Palencia wrote:
>> From: Richard Cochran 
>>
>> This patch introduces SO_TXTIME. User space enables this option in
>> order to pass a desired future transmit time in a CMSG when calling
>> sendmsg(2). The argument to this socket option is an 8-byte struct
>> provided by the uapi header net_tstamp.h defined as:
>>
>> struct sock_txtime {
>>  clockid_t   clockid;
>>  u32 flags;
>> };
>>
>> Note that new fields were added to struct sock by filling a 2-byte
>> hole found in the struct. For that reason, neither the struct size nor
>> the number of cachelines was altered.
> 
> 
>> diff --git a/include/uapi/linux/net_tstamp.h 
>> b/include/uapi/linux/net_tstamp.h
>> index 4fe104b2411f..c9a77c353b98 100644
>> --- a/include/uapi/linux/net_tstamp.h
>> +++ b/include/uapi/linux/net_tstamp.h
>> @@ -141,4 +141,19 @@ struct scm_ts_pktinfo {
>>  __u32 reserved[2];
>>  };
>>  
>> +/*
>> + * SO_TXTIME gets a struct sock_txtime with flags being an integer bit
>> + * field comprised of these values.
>> + */
>> +enum txtime_flags {
>> +SOF_TXTIME_DEADLINE_MODE = (1 << 0),
>> +
>> +SOF_TXTIME_FLAGS_MASK = (SOF_TXTIME_DEADLINE_MODE)
>> +};
>> +
>> +struct sock_txtime {
>> +clockid_t   clockid;/* reference clockid */
>> +u32 flags;  /* flags defined by enum txtime_flags */
>> +};
>> +
> 
> I was under the impression that we could not use 'u32' type in 
> include/uapi/linux/* file
> 
> This must be replaced by __u32, right ?


I'm sending a patch fixing that now, thanks.



Re: [PATCH v3 iproute2 2/3] tc: Add support for the ETF Qdisc

2018-07-09 Thread Jesus Sanchez-Palencia
Hi David,


On 07/06/2018 08:58 AM, David Ahern wrote:
> On 7/5/18 4:42 PM, Jesus Sanchez-Palencia wrote:
> 
>> +static int get_clockid(__s32 *val, const char *arg)
>> +{
>> +const struct static_clockid {
>> +const char *name;
>> +clockid_t clockid;
>> +} clockids_sysv[] = {
>> +{ "CLOCK_REALTIME", CLOCK_REALTIME },
>> +{ "CLOCK_TAI", CLOCK_TAI },
>> +{ "CLOCK_BOOTTIME", CLOCK_BOOTTIME },
>> +{ "CLOCK_MONOTONIC", CLOCK_MONOTONIC },
>> +{ NULL }
>> +};
>> +
>> +const struct static_clockid *c;
>> +
>> +for (c = clockids_sysv; c->name; c++) {
>> +if (strncasecmp(c->name, arg, 25) == 0) {
> 
> Why 25?


That was just an upper bound giving some room beyond the longest
clockid name we have today. Should I add a define MAX_CLOCK_NAME ?


> 
> be nice to allow shortcuts -- e.g., just REALTIME or realtime.


I'd rather just keep it as is and use the names as they are defined for
everything else (i.e. CLOCK_REALTIME), unless there are some strong objections.



> 
>> +*val = c->clockid;
>> +
>> +return 0;
>> +}
>> +}
>> +
>> +return -1;
>> +}
>> +
>> +
>> +static int etf_parse_opt(struct qdisc_util *qu, int argc,
>> + char **argv, struct nlmsghdr *n, const char *dev)
>> +{
>> +struct tc_etf_qopt opt = {
>> +.clockid = CLOCKID_INVALID,
>> +};
>> +struct rtattr *tail;
>> +
>> +while (argc > 0) {
>> +if (matches(*argv, "offload") == 0) {
>> +if (opt.flags & TC_ETF_OFFLOAD_ON) {
>> +fprintf(stderr, "etf: duplicate \"offload\" 
>> specification\n");
>> +return -1;
>> +}
>> +
>> +opt.flags |= TC_ETF_OFFLOAD_ON;
>> +} else if (matches(*argv, "deadline_mode") == 0) {
>> +if (opt.flags & TC_ETF_DEADLINE_MODE_ON) {
>> +fprintf(stderr, "etf: duplicate 
>> \"deadline_mode\" specification\n");
>> +return -1;
>> +}
>> +
>> +opt.flags |= TC_ETF_DEADLINE_MODE_ON;
>> +} else if (matches(*argv, "delta") == 0) {
>> +NEXT_ARG();
>> +if (opt.delta) {
>> +fprintf(stderr, "etf: duplicate \"delta\" 
>> specification\n");
>> +return -1;
>> +}
>> +if (get_s32(&opt.delta, *argv, 0)) {
>> +explain1("delta", *argv);
>> +return -1;
>> +}
>> +} else if (matches(*argv, "clockid") == 0) {
>> +NEXT_ARG();
>> +if (opt.clockid != CLOCKID_INVALID) {
>> +fprintf(stderr, "etf: duplicate \"clockid\" 
>> specification\n");
>> +return -1;
>> +}
>> +if (get_clockid(&opt.clockid, *argv)) {
>> +explain_clockid(*argv);
>> +return -1;
>> +}
>> +} else if (strcmp(*argv, "help") == 0) {
>> +explain();
>> +return -1;
>> +} else {
>> +fprintf(stderr, "etf: unknown parameter \"%s\"\n", 
>> *argv);
>> +explain();
>> +return -1;
>> +}
>> +argc--; argv++;
>> +}
>> +
>> +tail = NLMSG_TAIL(n);
>> +addattr_l(n, 1024, TCA_OPTIONS, NULL, 0);
>> +addattr_l(n, 2024, TCA_ETF_PARMS, &opt, sizeof(opt));
>> +tail->rta_len = (void *) NLMSG_TAIL(n) - (void *) tail;
>> +return 0;
>> +}
>> +
>> +static int etf_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
>> +{
>> +struct rtattr *tb[TCA_ETF_MAX+1];
>> +struct tc_etf_qopt *qopt;
>> +
>> +if (opt == NULL)
>> +return 0;
>> +
>> +parse_rtattr_nested(tb, TCA_ETF_MAX, opt);
>> +
>> +if (tb[TCA_ETF_PARMS] == NULL)
>> +return -1;
>> +
>> +qopt = RTA_DATA(tb[TCA_ETF_PARMS]);
>> +if (RTA_PAYLOAD(tb[TCA_ETF_PARMS])  < sizeof(*qopt))
>> +return -1;
>> +
>> +if (qopt->clockid == CLOCKID_INVALID)
>> +print_string(PRINT_ANY, "clockid", "clockid %s ", "invalid");
>> +else
>> +print_uint(PRINT_ANY, "clockid", "clockid %d ", qopt->clockid);
> 
> If you allow the user to input a string, then you should return it here too.


Ok, will fix.

Thanks,
Jesus


Re: [PATCH v2 net-next 00/14] Scheduled packet Transmission: ETF

2018-07-09 Thread Jesus Sanchez-Palencia
Hi Stephen,


On 07/06/2018 02:38 PM, Stephen Hemminger wrote:
> On Tue,  3 Jul 2018 15:42:46 -0700
> Jesus Sanchez-Palencia  wrote:
> 
>> Changes since v1:
>>   - moved struct sock_txtime from socket.h to uapi net_tstamp.h;
>>   - sk_clockid was changed from u16 to u8;
>>   - sk_txtime_flags was changed from u16 to a u8 bit field in struct sock;
>>   - the socket option flags are now validated in sock_setsockopt();
>>   - added SO_EE_ORIGIN_TXTIME;
>>   - sockc.transmit_time is now initialized from all IPv4 Tx paths;
>>   - added support for the IPv6 Tx path;
>>
>>
>> Overview
>> ========
>>
>> This work consists of a set of kernel interfaces that can be used by
>> applications that require (time-based) Scheduled Tx of packets.
>> It comprises 3 new kernel components:
>>
>>   - SO_TXTIME: socket option + cmsg programming interfaces.
>>
>>   - etf: the "earliest txtime first" qdisc, that provides per-queue
>>   TxTime-based scheduling. This has been renamed from 'tbs' to
>>   'etf' to better describe its functionality.
>>
>>   - taprio: the "time-aware priority scheduler" qdisc, that provides
>>  per-port Time-Aware scheduling;
>>
>> This patchset is providing the first 2 components, which have been
>> developed for longer. The taprio qdisc will be shared as an RFC separately
>> (shortly).
>>
>> Note that this series is a follow up of the "Time based packet
>> transmission" RFCv3 [1].
>>
>>
>>
>> etf (formerly known as 'tbs')
>> =============================
>>
>> For applications/systems for which the concept of time slices isn't precise
>> enough, the etf qdisc allows applications to control the instant when
>> a packet should leave the network controller. When used in conjunction
>> with taprio, it can also be used in case the application needs to
>> control with greater guarantee the offset into each time slice a packet
>> will be sent. Another use case of etf, is when only a small number of
>> applications on a system are time sensitive, so it can then be used
>> with a more traditional root qdisc (like mqprio).
>>
>> The etf qdisc is designed so it buffers packets until a configurable
>> time before their deadline (Tx time). The qdisc uses a rbtree internally
>> so the buffered packets are always 'ordered' by their txtime (deadline)
>> and will be dequeued following the earliest txtime first.
>>
>> It relies on the SO_TXTIME API set for receiving the per-packet timestamp
>> (txtime) as well as the config flags for each socket: the clockid to be
>> used as a reference, if the expected mode of txtime for that socket is
>> deadline or strict mode, and if packet drops should be reported on the
>> socket's error queue or not.
>>
>> The qdisc will drop any packets with a Tx time in the past, or if a
>> packet expires while waiting to be dequeued. Drops can be reported
>> as errors back to userspace through the socket's error queue.
>>
>> Example configuration:
>>
>> $ tc qdisc add dev enp2s0 parent 100:1 etf offload delta 20 \
>> clockid CLOCK_TAI
>>
>> Here, the Qdisc will use HW offload for the txtime control.
>> Packets will be dequeued by the qdisc "delta" (20) nanoseconds before
>> their transmission time. Because this will be using HW offload and
>> since dynamic clocks are not supported by hrtimers, the system clock
>> and the PHC clock must be synchronized for this mode to behave as expected.
>>
>> A more complete example can be found here, with instructions on how to
>> test it:
>>
>> https://gist.github.com/jeez/bd3afeff081ba64a695008dd8215866f [2]
>>
>>
>> Note that we haven't modified the qdisc so it uses a timerqueue because
>> the modification needed was increasing the number of cachelines of a sk_buff.
>>
>>
>>
>> This series is also hosted on github and can be found at [3].
>> The companion iproute2 patches can be found at [4].
>>
>>
>> [1] https://patchwork.ozlabs.org/cover/882342/
>>
>> [2] github doesn't make it clear, but the gist can be cloned like this:
>> $ git clone https://gist.github.com/jeez/bd3afeff081ba64a695008dd8215866f 
>> scheduled-tx-tests
>>
>> [3] https://github.com/jeez/linux/tree/etf-v2
>>
>> [4] https://github.com/jeez/iproute2/tree/etf-v2
>>
>>
>>
>> Jesus Sanchez-Palencia (10):
>>   net: Clear skb->tstamp only on the forwarding path
>>   net: ipv4: Hook

Re: [PATCH v3 iproute2 2/3] tc: Add support for the ETF Qdisc

2018-07-06 Thread Jesus Sanchez-Palencia
Hi Stephen,


On 07/06/2018 02:32 PM, Stephen Hemminger wrote:
> 
>> diff --git a/tc/q_etf.c b/tc/q_etf.c
>> new file mode 100644
>> index ..5db1dd6f
>> --- /dev/null
>> +++ b/tc/q_etf.c
>> @@ -0,0 +1,168 @@
>> +/*
>> + * q_etf.c  Earliest TxTime First (ETF).
>> + *
>> + *  This program is free software; you can redistribute it and/or
>> + *  modify it under the terms of the GNU General Public License
>> + *  as published by the Free Software Foundation; either version
>> + *  2 of the License, or (at your option) any later version.
>> + *
>> + * Authors: Vinicius Costa Gomes 
>> + *  Jesus Sanchez-Palencia 
>> + *
>> + */
> 
> 
> Please use SPDX tag rather than GPL boilerplate when adding new code.

Sure, will do for v4.

> 
>> +static int get_clockid(__s32 *val, const char *arg)
>> +{
>> +const struct static_clockid {
>> +const char *name;
>> +clockid_t clockid;
>> +} clockids_sysv[] = {
>> +{ "CLOCK_REALTIME", CLOCK_REALTIME },
>> +{ "CLOCK_TAI", CLOCK_TAI },
>> +{ "CLOCK_BOOTTIME", CLOCK_BOOTTIME },
>> +{ "CLOCK_MONOTONIC", CLOCK_MONOTONIC },
>> +{ NULL }
>> +};
>> +
>> +const struct static_clockid *c;
>> +
>> +for (c = clockids_sysv; c->name; c++) {
>> +if (strncasecmp(c->name, arg, 25) == 0) {
>> +*val = c->clockid;
>> +
>> +return 0;
>> +}
>> +}
>> +
>> +return -1;
>> +}
> 
> Internally, kernel must use ktime. For the userspace part the TC standard
> is to use USER HZ of 100.
> 
> Please change user kernel API of this module to match other existing modules.
> Doing something unique for this module is not necessary.


I don't follow you on the above, sorry.

The qdisc must be configured with a valid clockid_t. This type is used by both
userspace and kernel.

As requested before, we made the tc etf command line interface more
user-friendly and allowed for users to input the clock name as a string. The
code above is just a lookup table converting the string into a valid clockid_t
so we can then pass it to the kernel.

There is no ktime or any other timestamp type above, only clockid_t.

Can you please clarify what your request is?


Thanks,
Jesus


Re: [PATCH net-next 0/6] sock cookie initializers

2018-07-06 Thread Jesus Sanchez-Palencia



On 07/06/2018 07:12 AM, Willem de Bruijn wrote:
> From: Willem de Bruijn 
> 
> Recent UDP GSO and SO_TXTIME features added new fields to cookie
> structs.
> 
> When adding a field, all sites where a struct is initialized have to
> be updated, which is a lot of boilerplate. Alternatively, a field can
> be initialized selectively, but this is fragile. I introduced a bug
> in udp gso where an uninitialized field was read. See also fix commit
> ("9887cba19978 ip: limit use of gso_size to udp").
> 
> Introduce initializers for structs ipcm(6)_cookie and sockc_cookie.
> 
> patch 1..3 do exactly this.
> patch 4..5 make ipv4 and ipv6 handle cookies the same way and
>            remove some boilerplate in doing so.
> patch 6    removes the udp gso branch that needed the above fix


Acked-by: Jesus Sanchez-Palencia 

I've applied this series and tested SO_TXTIME + the etf qdisc and everything is
working just fine.
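
For readers unfamiliar with the pattern, a minimal sketch of a single cookie initializer follows. The struct and field names are hypothetical and do not match the kernel's cookies; the point is only that one helper zeroes every field, so a newly added field cannot be read uninitialized:

```c
#include <stdint.h>

/* Hypothetical cookie; field names are illustrative, not the kernel's. */
struct ex_cookie {
	uint64_t transmit_time;
	uint32_t mark;
	uint16_t tsflags;
	uint16_t gso_size;
};

/*
 * Single initialization site: the compound literal zeroes every field
 * that is not named, so call sites need no update when a field is added.
 */
static inline void ex_cookie_init(struct ex_cookie *c, uint32_t mark)
{
	*c = (struct ex_cookie){ .mark = mark };
}
```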


Thanks,
Jesus



> 
> Willem de Bruijn (6):
>   ipv4: ipcm_cookie initializers
>   ipv6: ipcm6_cookie initializer
>   sock: sockc cookie initializer
>   ipv6: fold sockcm_cookie into ipcm6_cookie
>   ip: remove tx_flags from ipcm_cookie and use same logic for v4 and v6
>   ip: unconditionally set cork gso_size
> 
>  include/net/ip.h | 16 ++-
>  include/net/ipv6.h   | 26 
>  include/net/sock.h   |  6 ++
>  include/net/transp_v6.h  |  3 +--
>  net/ipv4/icmp.c  | 11 ++
>  net/ipv4/ip_output.c | 12 ---
>  net/ipv4/ping.c  | 11 +-
>  net/ipv4/raw.c   | 11 +-
>  net/ipv4/tcp.c   |  2 +-
>  net/ipv4/udp.c   | 12 +--
>  net/ipv6/datagram.c  |  4 ++--
>  net/ipv6/icmp.c  | 14 -
>  net/ipv6/ip6_flowlabel.c |  3 +--
>  net/ipv6/ip6_output.c| 43 +---
>  net/ipv6/ipv6_sockglue.c |  3 +--
>  net/ipv6/ping.c  |  7 ++-
>  net/ipv6/raw.c   | 15 +-
>  net/ipv6/udp.c   | 14 +
>  net/l2tp/l2tp_ip6.c  | 10 +++---
>  net/packet/af_packet.c   |  9 +++--
>  20 files changed, 98 insertions(+), 134 deletions(-)
> 


[PATCH v3 iproute2 0/3] Add support for ETF qdisc

2018-07-05 Thread Jesus Sanchez-Palencia
Changes since v2:
 - Added man page for tc-etf.

The ETF (earliest txtime first) qdisc was recently merged into net-next
[1], so this patchset adds support for it through the tc command line
tool.

An initial man page is also provided.

The first commit in this series is adding an updated version of
include/uapi/linux/pkt_sched.h and is not meant to be merged. It's
provided here just as a convenience for those who want to easily build
this patchset.

[1] https://patchwork.ozlabs.org/cover/938991/

Jesus Sanchez-Palencia (2):
  uapi pkt_sched: Add etf info - DO NOT COMMIT
  man: Add initial manpage for tc-etf(8)

Vinicius Costa Gomes (1):
  tc: Add support for the ETF Qdisc

 include/uapi/linux/pkt_sched.h |  21 +
 man/man8/tc-etf.8  | 141 +++
 tc/Makefile|   1 +
 tc/q_etf.c | 168 +
 4 files changed, 331 insertions(+)
 create mode 100644 man/man8/tc-etf.8
 create mode 100644 tc/q_etf.c

-- 
2.18.0



[PATCH v3 iproute2 3/3] man: Add initial manpage for tc-etf(8)

2018-07-05 Thread Jesus Sanchez-Palencia
Add an initial manpage for tc-etf covering all config options, basic
concepts and operation modes.
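
As a rough illustration of the delta semantics the manpage describes (wake up at txtime minus delta, drop packets whose txtime is already in the past), here is a small sketch. The helper names are hypothetical and this is not the qdisc's actual code:

```c
#include <stdint.h>

/* Hypothetical helper: when the qdisc would wake up for a given packet. */
static int64_t ex_next_wakeup_ns(int64_t txtime_ns, int32_t delta_ns)
{
	return txtime_ns - (int64_t)delta_ns;
}

/* Hypothetical helper: a txtime earlier than 'now' means the packet is late. */
static int ex_should_drop(int64_t txtime_ns, int64_t now_ns)
{
	return txtime_ns < now_ns;
}
```

For example, a packet with a txtime of 1000000 ns and a delta of 300000 ns would be handled at 700000 ns on the reference clock.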

Signed-off-by: Jesus Sanchez-Palencia 
---
 man/man8/tc-etf.8 | 141 ++
 1 file changed, 141 insertions(+)
 create mode 100644 man/man8/tc-etf.8

diff --git a/man/man8/tc-etf.8 b/man/man8/tc-etf.8
new file mode 100644
index ..30a12de7
--- /dev/null
+++ b/man/man8/tc-etf.8
@@ -0,0 +1,141 @@
+.TH ETF 8 "05 Jul 2018" "iproute2" "Linux"
+.SH NAME
+ETF \- Earliest TxTime First (ETF) Qdisc
+.SH SYNOPSIS
+.B tc qdisc ... dev
+dev
+.B parent
+classid
+.B [ handle
+major:
+.B ] etf clockid
+clockid
+.B [ delta
+delta_nsecs
+.B ] [ deadline_mode ]
+.B [ offload ]
+
+.SH DESCRIPTION
+The ETF (Earliest TxTime First) qdisc allows applications to control
+the instant when a packet should be dequeued from the traffic control
+layer into the netdevice. If
+.B offload
+is configured and supported by the network interface card, then it will
+also control when packets leave the network controller.
+
+ETF achieves that by buffering packets until a configurable time
+before their transmission time (i.e. txtime, or deadline), which can
+be configured through the
+.B delta
+option.
+
+The qdisc uses a rb-tree internally so packets are always 'ordered' by
+their txtime and will be dequeued following the (next) earliest txtime
+first.
+
+It relies on the SO_TXTIME socket option and the SCM_TXTIME CMSG in
+each packet field to configure the behavior of time dependent sockets:
+the clockid to be used as a reference, if the expected mode of txtime
+for that socket is deadline or strict mode, and if packet drops should
+be reported on the socket's error queue. See
+.BR socket(7)
+for more information.
+
+The etf qdisc will drop any packets with a txtime in the past, or if a
+packet expires while waiting to be dequeued.
+
+This queueing discipline is intended to be used by TSN (Time Sensitive
+Networking) applications, and it exposes a traffic shaping functionality
+that is commonly documented as "Launch Time" or "Time-Based Scheduling"
+by vendors and the documentation of network interface controllers.
+
+ETF is meant to be installed under another qdisc that maps packet flows
+to traffic classes, one example is
+.BR mqprio(8).
+
+.SH PARAMETERS
+.TP
+clockid
+.br
+Specifies the clock to be used by qdisc's internal timer for measuring
+time and scheduling events. The qdisc expects packets passing
+through it to use this same
+.B clockid
+as the reference of their txtime timestamps. It will drop packets
+coming from sockets that do not comply with that.
+
+For more information about time and clocks on Linux, please refer
+to
+.BR time(7)
+and
+.BR clock_gettime(3).
+
+.TP
+delta
+.br
+After enqueueing or dequeueing a packet, the qdisc will schedule its
+next wake-up time for the next txtime minus this delta value.
+This means
+.B delta
+can be used as a fudge factor for the scheduler latency of a system.
+This value must be specified in nanoseconds.
+The default value is 0 nanoseconds.
+
+.TP
+deadline_mode
+.br
+When
+.B deadline_mode
+is set, the qdisc will handle txtime with different semantics,
+changed from a 'strict' transmission time to a deadline.
+In practice, this means during the dequeue flow
+.BR etf(8)
+will set the txtime of the packet being dequeued to 'now'.
+The default is for this option to be disabled.
+
+.TP
+offload
+.br
+When
+.B offload
+is set,
+.BR etf(8)
+will try to configure the network interface so time-based transmission
+arbitration is enabled in the controller. This feature is commonly
+referred to as "Launch Time" or "Time-Based Scheduling" by the
+documentation of network interface controllers.
+The default is for this option to be disabled.
+
+.SH EXAMPLES
+
+ETF is used to enforce a Quality of Service. It controls when each
+packet should be dequeued and transmitted, and can be used for
+limiting the data rate of a traffic class. To separate packets into
+traffic classes the user may choose
+.BR mqprio(8),
+and configure it like this:
+
+.EX
+# tc qdisc add dev eth0 handle 100: parent root mqprio num_tc 3 \\
+   map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \\
+   queues 1@0 1@1 2@2 \\
+   hw 0
+.EE
+.P
+To replace the current queueing discipline by ETF in traffic class
+number 0, issue:
+.P
+.EX
+# tc qdisc replace dev eth0 parent 100:1 etf \\
+   clockid CLOCK_TAI delta 300000 offload
+.EE
+
+With the options above, etf will be configured to use CLOCK_TAI as
+its clockid_t, will schedule packets for 300 us before their txtime,
+and will enable offload in the network interface card. Deadline mode
+will not be configured for this qdisc.
+
+.SH AUTHORS
+Jesus Sanchez-Palencia 
+.br
+Vinicius Costa Gomes 
-- 
2.18.0



[PATCH v3 iproute2 2/3] tc: Add support for the ETF Qdisc

2018-07-05 Thread Jesus Sanchez-Palencia
From: Vinicius Costa Gomes 

The "Earliest TxTime First" (ETF) queueing discipline allows precise
control of the transmission time of packets by providing a sorted
time-based scheduling of packets.

The syntax is:

tc qdisc add dev DEV parent NODE etf delta NANOS
 clockid CLOCKID [offload] [deadline_mode]

Signed-off-by: Vinicius Costa Gomes 
Signed-off-by: Jesus Sanchez-Palencia 
---
 tc/Makefile |   1 +
 tc/q_etf.c  | 168 
 2 files changed, 169 insertions(+)
 create mode 100644 tc/q_etf.c

diff --git a/tc/Makefile b/tc/Makefile
index dfd00267..4525c0fb 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -71,6 +71,7 @@ TCMODULES += q_clsact.o
 TCMODULES += e_bpf.o
 TCMODULES += f_matchall.o
 TCMODULES += q_cbs.o
+TCMODULES += q_etf.o
 
 TCSO :=
 ifeq ($(TC_CONFIG_ATM),y)
diff --git a/tc/q_etf.c b/tc/q_etf.c
new file mode 100644
index ..5db1dd6f
--- /dev/null
+++ b/tc/q_etf.c
@@ -0,0 +1,168 @@
+/*
+ * q_etf.c Earliest TxTime First (ETF).
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors:Vinicius Costa Gomes 
+ *     Jesus Sanchez-Palencia 
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "utils.h"
+#include "tc_util.h"
+
+#define CLOCKID_INVALID (-1)
+static void explain(void)
+{
+   fprintf(stderr, "Usage: ... etf delta NANOS clockid CLOCKID [offload] 
[deadline_mode]\n");
+   fprintf(stderr, "CLOCKID must be a valid SYS-V id (i.e. CLOCK_TAI)\n");
+}
+
+static void explain1(const char *arg, const char *val)
+{
+   fprintf(stderr, "etf: illegal value for \"%s\": \"%s\"\n", arg, val);
+}
+
+static void explain_clockid(const char *val)
+{
+   fprintf(stderr, "etf: illegal value for \"clockid\": \"%s\".\n", val);
+   fprintf(stderr, "It must be a valid SYS-V id (i.e. CLOCK_TAI)");
+}
+
+static int get_clockid(__s32 *val, const char *arg)
+{
+   const struct static_clockid {
+   const char *name;
+   clockid_t clockid;
+   } clockids_sysv[] = {
+   { "CLOCK_REALTIME", CLOCK_REALTIME },
+   { "CLOCK_TAI", CLOCK_TAI },
+   { "CLOCK_BOOTTIME", CLOCK_BOOTTIME },
+   { "CLOCK_MONOTONIC", CLOCK_MONOTONIC },
+   { NULL }
+   };
+
+   const struct static_clockid *c;
+
+   for (c = clockids_sysv; c->name; c++) {
+   if (strncasecmp(c->name, arg, 25) == 0) {
+   *val = c->clockid;
+
+   return 0;
+   }
+   }
+
+   return -1;
+}
+
+
+static int etf_parse_opt(struct qdisc_util *qu, int argc,
+char **argv, struct nlmsghdr *n, const char *dev)
+{
+   struct tc_etf_qopt opt = {
+   .clockid = CLOCKID_INVALID,
+   };
+   struct rtattr *tail;
+
+   while (argc > 0) {
+   if (matches(*argv, "offload") == 0) {
+   if (opt.flags & TC_ETF_OFFLOAD_ON) {
+   fprintf(stderr, "etf: duplicate \"offload\" 
specification\n");
+   return -1;
+   }
+
+   opt.flags |= TC_ETF_OFFLOAD_ON;
+   } else if (matches(*argv, "deadline_mode") == 0) {
+   if (opt.flags & TC_ETF_DEADLINE_MODE_ON) {
+   fprintf(stderr, "etf: duplicate 
\"deadline_mode\" specification\n");
+   return -1;
+   }
+
+   opt.flags |= TC_ETF_DEADLINE_MODE_ON;
+   } else if (matches(*argv, "delta") == 0) {
+   NEXT_ARG();
+   if (opt.delta) {
+   fprintf(stderr, "etf: duplicate \"delta\" 
specification\n");
+   return -1;
+   }
+   if (get_s32(&opt.delta, *argv, 0)) {
+   explain1("delta", *argv);
+   return -1;
+   }
+   } else if (matches(*argv, "clockid") == 0) {
+   NEXT_ARG();
+   if (opt.clockid != CLOCKID_INVALID) {
+   fprintf(stderr, "etf: duplicate \"clockid\" 
specification\n");
+   return -1;
+ 

[PATCH v3 iproute2 1/3] uapi pkt_sched: Add etf info - DO NOT COMMIT

2018-07-05 Thread Jesus Sanchez-Palencia
This should come from the next uapi headers update.
Sending it now just as a convenience so anyone can build tc with etf
and taprio support.
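
For anyone building against this header, a small sketch of how the struct tc_etf_qopt added by this patch might be filled in. The ex_ names are hypothetical mirrors, and the sketch spells out its own shifts rather than relying on BIT(), under the assumption that user-space builds may not have that macro available:

```c
#include <stdint.h>

/* Hypothetical mirrors of the flag bits and of struct tc_etf_qopt. */
#define EX_ETF_DEADLINE_MODE_ON (1U << 0)
#define EX_ETF_OFFLOAD_ON       (1U << 1)

struct ex_etf_qopt {
	int32_t delta;    /* nanoseconds */
	int32_t clockid;
	uint32_t flags;
};

/* Build the options for: etf delta DELTA clockid ID [offload] [deadline_mode] */
static struct ex_etf_qopt ex_etf_qopt_make(int32_t delta_ns, int32_t clockid,
					   int offload, int deadline_mode)
{
	struct ex_etf_qopt opt = { .delta = delta_ns, .clockid = clockid };

	if (offload)
		opt.flags |= EX_ETF_OFFLOAD_ON;
	if (deadline_mode)
		opt.flags |= EX_ETF_DEADLINE_MODE_ON;

	return opt;
}
```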

Signed-off-by: Jesus Sanchez-Palencia 
---
 include/uapi/linux/pkt_sched.h | 21 +
 1 file changed, 21 insertions(+)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 37b5096a..94911846 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -539,6 +539,7 @@ enum {
TCA_NETEM_LATENCY64,
TCA_NETEM_JITTER64,
TCA_NETEM_SLOT,
+   TCA_NETEM_SLOT_DIST,
__TCA_NETEM_MAX,
 };
 
@@ -581,6 +582,8 @@ struct tc_netem_slot {
__s64   max_delay;
__s32   max_packets;
__s32   max_bytes;
+   __s64   dist_delay; /* nsec */
+   __s64   dist_jitter; /* nsec */
 };
 
 enum {
@@ -934,4 +937,22 @@ enum {
 
 #define TCA_CBS_MAX (__TCA_CBS_MAX - 1)
 
+
+/* ETF */
+struct tc_etf_qopt {
+   __s32 delta;
+   __s32 clockid;
+   __u32 flags;
+#define TC_ETF_DEADLINE_MODE_ON BIT(0)
+#define TC_ETF_OFFLOAD_ON       BIT(1)
+};
+
+enum {
+   TCA_ETF_UNSPEC,
+   TCA_ETF_PARMS,
+   __TCA_ETF_MAX,
+};
+
+#define TCA_ETF_MAX (__TCA_ETF_MAX - 1)
+
 #endif
-- 
2.18.0



[PATCH v2 iproute2] man: Fix typos on tc-cbs

2018-07-05 Thread Jesus Sanchez-Palencia
Fix 2 typos on the man page of the CBS qdisc.

Signed-off-by: Jesus Sanchez-Palencia 
Reviewed-by: Simon Horman 
---
 man/man8/tc-cbs.8 | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/man/man8/tc-cbs.8 b/man/man8/tc-cbs.8
index 32e1e0d4..ad1d8821 100644
--- a/man/man8/tc-cbs.8
+++ b/man/man8/tc-cbs.8
@@ -28,7 +28,7 @@ defined rate limiting method to the traffic.
 This queueing discipline is intended to be used by TSN (Time Sensitive
 Networking) applications, the CBS parameters are derived directly by
 what is described by the Annex L of the IEEE 802.1Q-2014
-Sepcification. The algorithm and how it affects the latency are
+Specification. The algorithm and how it affects the latency are
 detailed there.
 
 CBS is meant to be installed under another qdisc that maps packet
@@ -60,7 +60,7 @@ packet size, which is then used for calculating the idleslope.
 sendslope
 Sendslope is the rate of credits that is depleted (it should be a
 negative number of kilobits per second) when a transmission is
-ocurring. It can be calculated as follows, (IEEE 802.1Q-2014 Section
+occurring. It can be calculated as follows, (IEEE 802.1Q-2014 Section
 8.6.8.2 item g):
 
 sendslope = idleslope - port_transmit_rate
-- 
2.18.0



[PATCH v2 iproute2 2/2] tc: Add support for the ETF Qdisc

2018-07-03 Thread Jesus Sanchez-Palencia
From: Vinicius Costa Gomes 

The "Earliest TxTime First" (ETF) queueing discipline allows precise
control of the transmission time of packets by providing a sorted
time-based scheduling of packets.

The syntax is:

tc qdisc add dev DEV parent NODE etf delta NANOS
 clockid CLOCKID [offload] [deadline_mode]

Signed-off-by: Vinicius Costa Gomes 
Signed-off-by: Jesus Sanchez-Palencia 
---
 tc/Makefile |   1 +
 tc/q_etf.c  | 168 
 2 files changed, 169 insertions(+)
 create mode 100644 tc/q_etf.c

diff --git a/tc/Makefile b/tc/Makefile
index dfd00267..4525c0fb 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -71,6 +71,7 @@ TCMODULES += q_clsact.o
 TCMODULES += e_bpf.o
 TCMODULES += f_matchall.o
 TCMODULES += q_cbs.o
+TCMODULES += q_etf.o
 
 TCSO :=
 ifeq ($(TC_CONFIG_ATM),y)
diff --git a/tc/q_etf.c b/tc/q_etf.c
new file mode 100644
index ..5db1dd6f
--- /dev/null
+++ b/tc/q_etf.c
@@ -0,0 +1,168 @@
+/*
+ * q_etf.c Earliest TxTime First (ETF).
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors:Vinicius Costa Gomes 
+ *     Jesus Sanchez-Palencia 
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "utils.h"
+#include "tc_util.h"
+
+#define CLOCKID_INVALID (-1)
+static void explain(void)
+{
+   fprintf(stderr, "Usage: ... etf delta NANOS clockid CLOCKID [offload] 
[deadline_mode]\n");
+   fprintf(stderr, "CLOCKID must be a valid SYS-V id (i.e. CLOCK_TAI)\n");
+}
+
+static void explain1(const char *arg, const char *val)
+{
+   fprintf(stderr, "etf: illegal value for \"%s\": \"%s\"\n", arg, val);
+}
+
+static void explain_clockid(const char *val)
+{
+   fprintf(stderr, "etf: illegal value for \"clockid\": \"%s\".\n", val);
+   fprintf(stderr, "It must be a valid SYS-V id (i.e. CLOCK_TAI)");
+}
+
+static int get_clockid(__s32 *val, const char *arg)
+{
+   const struct static_clockid {
+   const char *name;
+   clockid_t clockid;
+   } clockids_sysv[] = {
+   { "CLOCK_REALTIME", CLOCK_REALTIME },
+   { "CLOCK_TAI", CLOCK_TAI },
+   { "CLOCK_BOOTTIME", CLOCK_BOOTTIME },
+   { "CLOCK_MONOTONIC", CLOCK_MONOTONIC },
+   { NULL }
+   };
+
+   const struct static_clockid *c;
+
+   for (c = clockids_sysv; c->name; c++) {
+   if (strncasecmp(c->name, arg, 25) == 0) {
+   *val = c->clockid;
+
+   return 0;
+   }
+   }
+
+   return -1;
+}
+
+
+static int etf_parse_opt(struct qdisc_util *qu, int argc,
+char **argv, struct nlmsghdr *n, const char *dev)
+{
+   struct tc_etf_qopt opt = {
+   .clockid = CLOCKID_INVALID,
+   };
+   struct rtattr *tail;
+
+   while (argc > 0) {
+   if (matches(*argv, "offload") == 0) {
+   if (opt.flags & TC_ETF_OFFLOAD_ON) {
+   fprintf(stderr, "etf: duplicate \"offload\" 
specification\n");
+   return -1;
+   }
+
+   opt.flags |= TC_ETF_OFFLOAD_ON;
+   } else if (matches(*argv, "deadline_mode") == 0) {
+   if (opt.flags & TC_ETF_DEADLINE_MODE_ON) {
+   fprintf(stderr, "etf: duplicate 
\"deadline_mode\" specification\n");
+   return -1;
+   }
+
+   opt.flags |= TC_ETF_DEADLINE_MODE_ON;
+   } else if (matches(*argv, "delta") == 0) {
+   NEXT_ARG();
+   if (opt.delta) {
+   fprintf(stderr, "etf: duplicate \"delta\" 
specification\n");
+   return -1;
+   }
+   if (get_s32(&opt.delta, *argv, 0)) {
+   explain1("delta", *argv);
+   return -1;
+   }
+   } else if (matches(*argv, "clockid") == 0) {
+   NEXT_ARG();
+   if (opt.clockid != CLOCKID_INVALID) {
+   fprintf(stderr, "etf: duplicate \"clockid\" 
specification\n");
+   return -1;
+ 

[PATCH v2 iproute2 1/2] uapi pkt_sched: Add etf info - DO NOT COMMIT

2018-07-03 Thread Jesus Sanchez-Palencia
This should come from the next uapi headers update.
Sending it now just as a convenience so anyone can build tc with etf
and taprio support.

Signed-off-by: Jesus Sanchez-Palencia 
---
 include/uapi/linux/pkt_sched.h | 21 +
 1 file changed, 21 insertions(+)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 37b5096a..94911846 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -539,6 +539,7 @@ enum {
TCA_NETEM_LATENCY64,
TCA_NETEM_JITTER64,
TCA_NETEM_SLOT,
+   TCA_NETEM_SLOT_DIST,
__TCA_NETEM_MAX,
 };
 
@@ -581,6 +582,8 @@ struct tc_netem_slot {
__s64   max_delay;
__s32   max_packets;
__s32   max_bytes;
+   __s64   dist_delay; /* nsec */
+   __s64   dist_jitter; /* nsec */
 };
 
 enum {
@@ -934,4 +937,22 @@ enum {
 
 #define TCA_CBS_MAX (__TCA_CBS_MAX - 1)
 
+
+/* ETF */
+struct tc_etf_qopt {
+   __s32 delta;
+   __s32 clockid;
+   __u32 flags;
+#define TC_ETF_DEADLINE_MODE_ON BIT(0)
+#define TC_ETF_OFFLOAD_ON       BIT(1)
+};
+
+enum {
+   TCA_ETF_UNSPEC,
+   TCA_ETF_PARMS,
+   __TCA_ETF_MAX,
+};
+
+#define TCA_ETF_MAX (__TCA_ETF_MAX - 1)
+
 #endif
-- 
2.17.1



[PATCH v2 net-next 02/14] net: Add a new socket option for a future transmit time.

2018-07-03 Thread Jesus Sanchez-Palencia
From: Richard Cochran 

This patch introduces SO_TXTIME. User space enables this option in
order to pass a desired future transmit time in a CMSG when calling
sendmsg(2). The argument to this socket option is an 8-byte struct
provided by the uapi header net_tstamp.h, defined as:

struct sock_txtime {
clockid_t   clockid;
u32 flags;
};

Note that new fields were added to struct sock by filling a 2-byte
hole found in the struct. For that reason, neither the struct size nor
the number of cachelines was altered.
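
As an illustration of the user-space side, here is a sketch of how a transmit time could be attached to a message as a control message. It assumes the txtime travels as a single 64-bit nanosecond value at level SOL_SOCKET; ex_attach_txtime is a hypothetical helper, and the fallback defines use the generic value from this patch:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SO_TXTIME
#define SO_TXTIME 61        /* generic value from this patch; some arches differ */
#endif
#ifndef SCM_TXTIME
#define SCM_TXTIME SO_TXTIME
#endif

/*
 * Hypothetical helper: point msg at buf and store one SCM_TXTIME cmsg
 * carrying txtime_ns. Returns 0 on success, -1 if buf is too small.
 */
static int ex_attach_txtime(struct msghdr *msg, void *buf, size_t buflen,
			    uint64_t txtime_ns)
{
	struct cmsghdr *cmsg;

	if (buflen < CMSG_SPACE(sizeof(txtime_ns)))
		return -1;

	memset(buf, 0, buflen);
	msg->msg_control = buf;
	msg->msg_controllen = CMSG_SPACE(sizeof(txtime_ns));

	cmsg = CMSG_FIRSTHDR(msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_TXTIME;
	cmsg->cmsg_len = CMSG_LEN(sizeof(txtime_ns));
	memcpy(CMSG_DATA(cmsg), &txtime_ns, sizeof(txtime_ns));

	return 0;
}
```

The caller would first enable SO_TXTIME on the socket with setsockopt(2), passing a struct sock_txtime, and then hand the prepared msghdr to sendmsg(2).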

Signed-off-by: Richard Cochran 
Signed-off-by: Jesus Sanchez-Palencia 
---
 arch/alpha/include/uapi/asm/socket.h  |  3 +++
 arch/ia64/include/uapi/asm/socket.h   |  3 +++
 arch/mips/include/uapi/asm/socket.h   |  3 +++
 arch/parisc/include/uapi/asm/socket.h |  3 +++
 arch/s390/include/uapi/asm/socket.h   |  3 +++
 arch/sparc/include/uapi/asm/socket.h  |  3 +++
 arch/xtensa/include/uapi/asm/socket.h |  3 +++
 include/net/sock.h| 10 
 include/uapi/asm-generic/socket.h |  3 +++
 include/uapi/linux/net_tstamp.h   | 15 
 net/core/sock.c   | 35 +++
 11 files changed, 84 insertions(+)

diff --git a/arch/alpha/include/uapi/asm/socket.h 
b/arch/alpha/include/uapi/asm/socket.h
index be14f16149d5..065fb372e355 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -112,4 +112,7 @@
 
 #define SO_ZEROCOPY 60
 
+#define SO_TXTIME  61
+#define SCM_TXTIME SO_TXTIME
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/ia64/include/uapi/asm/socket.h 
b/arch/ia64/include/uapi/asm/socket.h
index 3efba40adc54..c872c4e6bafb 100644
--- a/arch/ia64/include/uapi/asm/socket.h
+++ b/arch/ia64/include/uapi/asm/socket.h
@@ -114,4 +114,7 @@
 
 #define SO_ZEROCOPY 60
 
+#define SO_TXTIME  61
+#define SCM_TXTIME SO_TXTIME
+
 #endif /* _ASM_IA64_SOCKET_H */
diff --git a/arch/mips/include/uapi/asm/socket.h 
b/arch/mips/include/uapi/asm/socket.h
index 49c3d4795963..71370fb3ceef 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -123,4 +123,7 @@
 
 #define SO_ZEROCOPY 60
 
+#define SO_TXTIME  61
+#define SCM_TXTIME SO_TXTIME
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/parisc/include/uapi/asm/socket.h 
b/arch/parisc/include/uapi/asm/socket.h
index 1d0fdc3b5d22..061b9cf2a779 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -104,4 +104,7 @@
 
 #define SO_ZEROCOPY 0x4035
 
+#define SO_TXTIME  0x4036
+#define SCM_TXTIME SO_TXTIME
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/s390/include/uapi/asm/socket.h 
b/arch/s390/include/uapi/asm/socket.h
index 3510c0fd06f4..39d901476ee5 100644
--- a/arch/s390/include/uapi/asm/socket.h
+++ b/arch/s390/include/uapi/asm/socket.h
@@ -111,4 +111,7 @@
 
 #define SO_ZEROCOPY 60
 
+#define SO_TXTIME  61
+#define SCM_TXTIME SO_TXTIME
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/sparc/include/uapi/asm/socket.h 
b/arch/sparc/include/uapi/asm/socket.h
index d58520c2e6ff..7ea35e5601b6 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -101,6 +101,9 @@
 
 #define SO_ZEROCOPY 0x003e
 
+#define SO_TXTIME  0x003f
+#define SCM_TXTIME SO_TXTIME
+
 /* Security levels - as per NRL IPv6 - don't actually do anything */
 #define SO_SECURITY_AUTHENTICATION 0x5001
 #define SO_SECURITY_ENCRYPTION_TRANSPORT   0x5002
diff --git a/arch/xtensa/include/uapi/asm/socket.h 
b/arch/xtensa/include/uapi/asm/socket.h
index 75a07b8119a9..1de07a7f7680 100644
--- a/arch/xtensa/include/uapi/asm/socket.h
+++ b/arch/xtensa/include/uapi/asm/socket.h
@@ -116,4 +116,7 @@
 
 #define SO_ZEROCOPY 60
 
+#define SO_TXTIME  61
+#define SCM_TXTIME SO_TXTIME
+
 #endif /* _XTENSA_SOCKET_H */
diff --git a/include/net/sock.h b/include/net/sock.h
index 2ed99bfa4595..68347b9821c6 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -319,6 +319,9 @@ struct sock_common {
   *@sk_destruct: called at sock freeing time, i.e. when all refcnt == 0
   *@sk_reuseport_cb: reuseport group container
   *@sk_rcu: used during RCU grace period
+  *@sk_clockid: clockid used by time-based scheduling (SO_TXTIME)
+  *@sk_txtime_deadline_mode: set deadline mode for SO_TXTIME
+  *@sk_txtime_unused: unused txtime flags
   */
 struct sock {
/*
@@ -475,6 +478,11 @@ struct sock {
u8  sk_shutdown;
u32 sk_tskey;
atomic_t    sk_zckey;
+
+   u8  sk_clockid;
+   u8  sk_txtime_deadline_mode : 1,
+   sk_txtime_unused : 7

[PATCH v2 net-next 00/14] Scheduled packet Transmission: ETF

2018-07-03 Thread Jesus Sanchez-Palencia


Changes since v1:
  - moved struct sock_txtime from socket.h to uapi net_tstamp.h;
  - sk_clockid was changed from u16 to u8;
  - sk_txtime_flags was changed from u16 to a u8 bit field in struct sock;
  - the socket option flags are now validated in sock_setsockopt();
  - added SO_EE_ORIGIN_TXTIME;
  - sockc.transmit_time is now initialized from all IPv4 Tx paths;
  - added support for the IPv6 Tx path;


Overview
========

This work consists of a set of kernel interfaces that can be used by
applications that require (time-based) Scheduled Tx of packets.
It comprises 3 new kernel components:

  - SO_TXTIME: socket option + cmsg programming interfaces.

  - etf: the "earliest txtime first" qdisc, that provides per-queue
 TxTime-based scheduling. This has been renamed from 'tbs' to
 'etf' to better describe its functionality.

  - taprio: the "time-aware priority scheduler" qdisc, that provides
per-port Time-Aware scheduling;

This patchset provides the first 2 components, which have been in
development for longer. The taprio qdisc will be shared as an RFC separately
(shortly).

Note that this series is a follow up of the "Time based packet
transmission" RFCv3 [1].



etf (formerly known as 'tbs')
=============================

For applications/systems where the concept of time slices isn't precise
enough, the etf qdisc allows applications to control the instant when
a packet should leave the network controller. When used in conjunction
with taprio, it can also be used when the application needs to control,
with greater guarantees, the offset into each time slice at which a
packet will be sent. Another use case for etf is when only a small
number of applications on a system are time sensitive, so it can be used
with a more traditional root qdisc (like mqprio).

The etf qdisc is designed so it buffers packets until a configurable
time before their deadline (Tx time). The qdisc uses an rbtree internally
so the buffered packets are always 'ordered' by their txtime (deadline)
and will be dequeued following the earliest txtime first.

It relies on the SO_TXTIME API set for receiving the per-packet timestamp
(txtime) as well as the config flags for each socket: the clockid to be
used as a reference, whether the expected mode of txtime for that socket
is deadline or strict mode, and whether packet drops should be reported
on the socket's error queue.
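As an illustration of the per-packet side of this API, here is a minimal
userspace sketch that attaches a txtime to a message as an SCM_TXTIME
control message. The helper name attach_txtime() is hypothetical, the
fallback value 61 only matches the architectures that use that number in
this series (parisc and sparc differ), and the SO_TXTIME setsockopt()
step that configures clockid and flags is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Fallback for userspace headers that predate this series. 61 is the
 * value used on most architectures here; parisc and sparc differ.
 */
#ifndef SO_TXTIME
#define SO_TXTIME 61
#endif
#ifndef SCM_TXTIME
#define SCM_TXTIME SO_TXTIME
#endif

/* Hypothetical helper (not part of the series): attach a 64-bit txtime,
 * in nanoseconds, to msg as an SCM_TXTIME control message. buf must be
 * suitably aligned and hold at least CMSG_SPACE(sizeof(uint64_t)) bytes.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int attach_txtime(struct msghdr *msg, void *buf, size_t buflen,
			 uint64_t txtime_ns)
{
	struct cmsghdr *cmsg;

	if (buflen < CMSG_SPACE(sizeof(txtime_ns)))
		return -1;

	memset(buf, 0, buflen);
	msg->msg_control = buf;
	msg->msg_controllen = CMSG_SPACE(sizeof(txtime_ns));

	cmsg = CMSG_FIRSTHDR(msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_TXTIME;
	cmsg->cmsg_len = CMSG_LEN(sizeof(txtime_ns));
	memcpy(CMSG_DATA(cmsg), &txtime_ns, sizeof(txtime_ns));

	return 0;
}
```

The resulting msghdr would then be passed to sendmsg() on a socket that
has SO_TXTIME enabled.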

The qdisc will drop any packet with a Tx time in the past, or that
expires while waiting to be dequeued. Drops can be reported as errors
back to userspace through the socket's error queue.
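The buffering and drop rules described above can be modelled roughly as
follows. This is a simplified sketch, not the sch_etf.c code: the enum,
the function name etf_decide() and the exact boundary handling are
assumptions made for illustration, and all times are nanoseconds on the
qdisc's reference clock.

```c
#include <assert.h>
#include <stdint.h>

enum etf_action {
	ETF_DROP,	/* Tx time is already in the past */
	ETF_HOLD,	/* keep the packet buffered */
	ETF_RELEASE,	/* hand the packet to the device */
};

/* Hypothetical model of the per-packet decision in strict mode. */
static enum etf_action etf_decide(int64_t now, int64_t txtime, int64_t delta)
{
	if (txtime < now)
		return ETF_DROP;

	/* Release once we are within "delta" ns of the Tx time. */
	if (now >= txtime - delta)
		return ETF_RELEASE;

	return ETF_HOLD;
}
```

With delta set to 20, a packet with txtime 200 is held at now = 100 and
released from now = 180 onwards; if it is still queued at now = 201, it
is dropped.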

Example configuration:

$ tc qdisc add dev enp2s0 parent 100:1 etf offload delta 20 \
clockid CLOCK_TAI

Here, the Qdisc will use HW offload for the txtime control.
Packets will be dequeued by the qdisc "delta" (20) nanoseconds before
their transmission time. Because this will be using HW offload and
since dynamic clocks are not supported by hrtimers, the system clock
and the PHC clock must be synchronized for this mode to behave as expected.

A more complete example can be found here, with instructions on how to
test it:

https://gist.github.com/jeez/bd3afeff081ba64a695008dd8215866f [2]


Note that we haven't modified the qdisc to use a timerqueue, because the
required modification increased the number of cachelines of an sk_buff.



This series is also hosted on github and can be found at [3].
The companion iproute2 patches can be found at [4].


[1] https://patchwork.ozlabs.org/cover/882342/

[2] github doesn't make it clear, but the gist can be cloned like this:
$ git clone https://gist.github.com/jeez/bd3afeff081ba64a695008dd8215866f scheduled-tx-tests

[3] https://github.com/jeez/linux/tree/etf-v2

[4] https://github.com/jeez/iproute2/tree/etf-v2



Jesus Sanchez-Palencia (10):
  net: Clear skb->tstamp only on the forwarding path
  net: ipv4: Hook into time based transmission
  net: ipv6: Hook into time based transmission
  net/sched: Add HW offloading capability to ETF
  igb: Refactor igb_configure_cbs()
  igb: Only change Tx arbitration when CBS is on
  igb: Refactor igb_offload_cbs()
  igb: Only call skb_tx_timestamp after descriptors are ready
  igb: Add support for ETF offload
  net/sched: Make etf report drops on error_queue

Richard Cochran (2):
  net: Add a new socket option for a future transmit time.
  net: packet: Hook into time based transmission.

Vinicius Costa Gomes (2):
  net/sched: Allow creating a Qdisc watchdog with other clocks
  net/sched: Introduce the ETF Qdisc

 arch/alpha/include/uapi/asm/socket.h  |   3 +
 arch/ia64/include/uapi/asm/socket.h   |   3 +
 arch/mips/include/uapi/asm/socket.h   |   3 +
 arch/parisc/include/uapi/asm/socket.h |   3 +
 arch/s390/include/uapi/asm/socket.h   |   3 +
 arch/sparc/include/uapi/asm/socket.h  |   3 +
 arch/xtensa/include/uapi/asm/socket.h |   3 +

[PATCH v2 net-next 03/14] net: ipv4: Hook into time based transmission

2018-07-03 Thread Jesus Sanchez-Palencia
Add a transmit_time field to struct inet_cork, then copy the
timestamp from the CMSG cookie at ip_setup_cork() so we can
safely copy it into the skb later during __ip_make_skb().

For the raw fast path, just perform the copy at raw_send_hdrinc().

Signed-off-by: Richard Cochran 
Signed-off-by: Jesus Sanchez-Palencia 
---
 include/net/inet_sock.h | 1 +
 net/ipv4/icmp.c | 2 ++
 net/ipv4/ip_output.c| 3 +++
 net/ipv4/ping.c | 1 +
 net/ipv4/raw.c  | 2 ++
 net/ipv4/udp.c  | 1 +
 6 files changed, 10 insertions(+)

diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 83d5b3c2ac42..314be484c696 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -148,6 +148,7 @@ struct inet_cork {
__s16   tos;
charpriority;
__u16   gso_size;
+   u64 transmit_time;
 };
 
 struct inet_cork_full {
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 1617604c9284..937239afd68d 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -437,6 +437,7 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct 
sk_buff *skb)
ipc.tx_flags = 0;
ipc.ttl = 0;
ipc.tos = -1;
+   ipc.sockc.transmit_time = 0;
 
if (icmp_param->replyopts.opt.opt.optlen) {
ipc.opt = &icmp_param->replyopts.opt;
@@ -715,6 +716,7 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, 
__be32 info)
ipc.tx_flags = 0;
ipc.ttl = 0;
ipc.tos = -1;
+   ipc.sockc.transmit_time = 0;
 
rt = icmp_route_lookup(net, &fl4, skb_in, iph, saddr, tos, mark,
   type, code, &icmp_param);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index b3308e9d9762..135fb5036d18 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1153,6 +1153,7 @@ static int ip_setup_cork(struct sock *sk, struct 
inet_cork *cork,
cork->tos = ipc->tos;
cork->priority = ipc->priority;
cork->tx_flags = ipc->tx_flags;
+   cork->transmit_time = ipc->sockc.transmit_time;
 
return 0;
 }
@@ -1413,6 +1414,7 @@ struct sk_buff *__ip_make_skb(struct sock *sk,
 
skb->priority = (cork->tos != -1) ? cork->priority: sk->sk_priority;
skb->mark = sk->sk_mark;
+   skb->tstamp = cork->transmit_time;
/*
 * Steal rt from cork.dst to avoid a pair of atomic_inc/atomic_dec
 * on dst refcount
@@ -1550,6 +1552,7 @@ void ip_send_unicast_reply(struct sock *sk, struct 
sk_buff *skb,
ipc.tx_flags = 0;
ipc.ttl = 0;
ipc.tos = -1;
+   ipc.sockc.transmit_time = 0;
 
if (replyopts.opt.opt.optlen) {
ipc.opt = &replyopts.opt;
diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index 2ed64bca54e3..b47492205507 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -746,6 +746,7 @@ static int ping_v4_sendmsg(struct sock *sk, struct msghdr 
*msg, size_t len)
ipc.tx_flags = 0;
ipc.ttl = 0;
ipc.tos = -1;
+   ipc.sockc.transmit_time = 0;
 
if (msg->msg_controllen) {
err = ip_cmsg_send(sk, msg, &ipc, false);
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index abb3c9490c55..446af7be2b55 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -381,6 +381,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 
*fl4,
 
skb->priority = sk->sk_priority;
skb->mark = sk->sk_mark;
+   skb->tstamp = sockc->transmit_time;
skb_dst_set(skb, &rt->dst);
*rtp = NULL;
 
@@ -562,6 +563,7 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
}
 
ipc.sockc.tsflags = sk->sk_tsflags;
+   ipc.sockc.transmit_time = 0;
ipc.addr = inet->inet_saddr;
ipc.opt = NULL;
ipc.tx_flags = 0;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 24e116ddae79..5c76ba0666ec 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -930,6 +930,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t 
len)
ipc.tx_flags = 0;
ipc.ttl = 0;
ipc.tos = -1;
+   ipc.sockc.transmit_time = 0;
 
getfrag = is_udplite ? udplite_getfrag : ip_generic_getfrag;
 
-- 
2.18.0



[PATCH v2 net-next 09/14] igb: Refactor igb_configure_cbs()

2018-07-03 Thread Jesus Sanchez-Palencia
Make this function retrieve what it needs from the Tx ring being
addressed since it already relies on what had been saved on it before.
Also, since this function will be used by the upcoming Launchtime
patches rename it to better reflect its intention. Note that
Launchtime is not part of what 802.1Qav specifies, but the i210
datasheet refers to this set of functionality as "Qav Transmission
Mode".

Here we also perform a tiny refactor at is_any_cbs_enabled(), and add
further documentation to igb_setup_tx_mode().

Signed-off-by: Jesus Sanchez-Palencia 
---
 drivers/net/ethernet/intel/igb/igb_main.c | 60 +++
 1 file changed, 28 insertions(+), 32 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index f1e3397bd405..15f6b9c57ccf 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -1655,23 +1655,17 @@ static void set_queue_mode(struct e1000_hw *hw, int 
queue, enum queue_mode mode)
 }
 
 /**
- *  igb_configure_cbs - Configure Credit-Based Shaper (CBS)
+ *  igb_config_tx_modes - Configure "Qav Tx mode" features on igb
  *  @adapter: pointer to adapter struct
  *  @queue: queue number
- *  @enable: true = enable CBS, false = disable CBS
- *  @idleslope: idleSlope in kbps
- *  @sendslope: sendSlope in kbps
- *  @hicredit: hiCredit in bytes
- *  @locredit: loCredit in bytes
  *
- *  Configure CBS for a given hardware queue. When disabling, idleslope,
- *  sendslope, hicredit, locredit arguments are ignored. Returns 0 if
- *  success. Negative otherwise.
+ *  Configure CBS for a given hardware queue. Parameters are retrieved
+ *  from the correct Tx ring, so igb_save_cbs_params() should be used
+ *  for setting those correctly prior to this function being called.
  **/
-static void igb_configure_cbs(struct igb_adapter *adapter, int queue,
- bool enable, int idleslope, int sendslope,
- int hicredit, int locredit)
+static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
 {
+   struct igb_ring *ring = adapter->tx_ring[queue];
struct net_device *netdev = adapter->netdev;
struct e1000_hw *hw = &adapter->hw;
u32 tqavcc;
@@ -1680,7 +1674,7 @@ static void igb_configure_cbs(struct igb_adapter 
*adapter, int queue,
WARN_ON(hw->mac.type != e1000_i210);
WARN_ON(queue < 0 || queue > 1);
 
-   if (enable || queue == 0) {
+   if (ring->cbs_enable || queue == 0) {
/* i210 does not allow the queue 0 to be in the Strict
 * Priority mode while the Qav mode is enabled, so,
 * instead of disabling strict priority mode, we give
@@ -1690,10 +1684,10 @@ static void igb_configure_cbs(struct igb_adapter 
*adapter, int queue,
 * Queue0 QueueMode must be set to 1b when
 * TransmitMode is set to Qav."
 */
-   if (queue == 0 && !enable) {
+   if (queue == 0 && !ring->cbs_enable) {
/* max "linkspeed" idleslope in kbps */
-   idleslope = 1000000;
-   hicredit = ETH_FRAME_LEN;
+   ring->idleslope = 1000000;
+   ring->hicredit = ETH_FRAME_LEN;
}
 
set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_HIGH);
@@ -1756,14 +1750,15 @@ static void igb_configure_cbs(struct igb_adapter 
*adapter, int queue,
 *   calculated value, so the resulting bandwidth might
 *   be slightly higher for some configurations.
 */
-   value = DIV_ROUND_UP_ULL(idleslope * 61034ULL, 1000000);
+   value = DIV_ROUND_UP_ULL(ring->idleslope * 61034ULL, 1000000);
 
tqavcc = rd32(E1000_I210_TQAVCC(queue));
tqavcc &= ~E1000_TQAVCC_IDLESLOPE_MASK;
tqavcc |= value;
wr32(E1000_I210_TQAVCC(queue), tqavcc);
 
-   wr32(E1000_I210_TQAVHC(queue), 0x80000000 + hicredit * 0x7735);
+   wr32(E1000_I210_TQAVHC(queue),
+    0x80000000 + ring->hicredit * 0x7735);
} else {
set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_LOW);
set_queue_mode(hw, queue, QUEUE_MODE_STRICT_PRIORITY);
@@ -1783,8 +1778,9 @@ static void igb_configure_cbs(struct igb_adapter 
*adapter, int queue,
 */
 
netdev_dbg(netdev, "CBS %s: queue %d idleslope %d sendslope %d hiCredit 
%d locredit %d\n",
-  (enable) ? "enabled" : "disabled", queue,
-  idleslope, sendslope, hicredit, locredit);
+  (ring->cbs_enable) ? "enabled" : "disabled", queue,
+  ring->idleslope, ring->sendslope, ring->

[PATCH v2 net-next 11/14] igb: Refactor igb_offload_cbs()

2018-07-03 Thread Jesus Sanchez-Palencia
Split code into a separate function (igb_offload_apply()) that will be
used by ETF offload implementation.

Signed-off-by: Jesus Sanchez-Palencia 
---
 drivers/net/ethernet/intel/igb/igb_main.c | 23 ++-
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index 8c90f1e51add..c30ab7b260cc 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2474,6 +2474,19 @@ igb_features_check(struct sk_buff *skb, struct 
net_device *dev,
return features;
 }
 
+static void igb_offload_apply(struct igb_adapter *adapter, s32 queue)
+{
+   if (!is_fqtss_enabled(adapter)) {
+   enable_fqtss(adapter, true);
+   return;
+   }
+
+   igb_config_tx_modes(adapter, queue);
+
+   if (!is_any_cbs_enabled(adapter))
+   enable_fqtss(adapter, false);
+}
+
 static int igb_offload_cbs(struct igb_adapter *adapter,
   struct tc_cbs_qopt_offload *qopt)
 {
@@ -2494,15 +2507,7 @@ static int igb_offload_cbs(struct igb_adapter *adapter,
if (err)
return err;
 
-   if (is_fqtss_enabled(adapter)) {
-   igb_config_tx_modes(adapter, qopt->queue);
-
-   if (!is_any_cbs_enabled(adapter))
-   enable_fqtss(adapter, false);
-
-   } else {
-   enable_fqtss(adapter, true);
-   }
+   igb_offload_apply(adapter, qopt->queue);
 
return 0;
 }
-- 
2.18.0



[PATCH v2 net-next 13/14] igb: Add support for ETF offload

2018-07-03 Thread Jesus Sanchez-Palencia
Implement HW offload support for SO_TXTIME through igb's Launchtime
feature. This is done by extending igb_setup_tc() so it supports
TC_SETUP_QDISC_ETF and configuring i210 so time based transmit
arbitration is enabled.

The FQTSS transmission mode added before is extended so strict
priority (SP) queues wait for stream reservation (SR) ones.
igb_config_tx_modes() is extended so it can support enabling/disabling
Launchtime following the previous approach used for the credit-based
shaper (CBS).

As in the previous flow, FQTSS transmission mode is enabled automatically
by the driver once Launchtime (or CBS, as before) is enabled.
Similarly, it's automatically disabled when the feature is disabled
for the last queue that had it set up.

The driver just consumes the transmit times from the skbuffs directly,
so no special handling is done in case an 'invalid' time is provided.
We assume this has been handled by the ETF qdisc already.
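The automatic enabling and disabling of FQTSS mode described above can be
pictured with a small standalone model. This is only a sketch under
stated assumptions: the structs, NUM_TX_QUEUES and the function names
below are hypothetical stand-ins and do not reproduce the driver's
igb_adapter, igb_ring or register accesses.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_TX_QUEUES 4	/* assumption for this sketch */

struct queue_model {
	bool cbs_enable;
	bool launchtime_enable;
};

struct adapter_model {
	struct queue_model q[NUM_TX_QUEUES];
	bool fqtss;	/* models the FQTSS ("Qav") transmission mode */
};

/* True if any queue still has CBS or Launchtime enabled. */
static bool any_feature_enabled(const struct adapter_model *a)
{
	for (int i = 0; i < NUM_TX_QUEUES; i++) {
		if (a->q[i].cbs_enable || a->q[i].launchtime_enable)
			return true;
	}

	return false;
}

/* Toggle Launchtime on one queue and let the Tx mode follow: FQTSS is on
 * while at least one queue uses a Qav feature, and off otherwise.
 */
static void model_set_launchtime(struct adapter_model *a, int queue, bool on)
{
	assert(queue >= 0 && queue < NUM_TX_QUEUES);

	a->q[queue].launchtime_enable = on;
	a->fqtss = any_feature_enabled(a);
}
```

In this model, disabling Launchtime on one of two enabled queues leaves
FQTSS on, and only disabling it on the last one turns the mode off.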

Signed-off-by: Jesus Sanchez-Palencia 
---
 .../net/ethernet/intel/igb/e1000_defines.h|  16 ++
 drivers/net/ethernet/intel/igb/igb.h  |   1 +
 drivers/net/ethernet/intel/igb/igb_main.c | 138 +++---
 3 files changed, 138 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/e1000_defines.h 
b/drivers/net/ethernet/intel/igb/e1000_defines.h
index 252440a418dc..8a28f3388f69 100644
--- a/drivers/net/ethernet/intel/igb/e1000_defines.h
+++ b/drivers/net/ethernet/intel/igb/e1000_defines.h
@@ -1048,6 +1048,22 @@
 #define E1000_TQAVCTRL_XMIT_MODE   BIT(0)
 #define E1000_TQAVCTRL_DATAFETCHARB   BIT(4)
 #define E1000_TQAVCTRL_DATATRANARB BIT(8)
+#define E1000_TQAVCTRL_DATATRANTIM BIT(9)
+#define E1000_TQAVCTRL_SP_WAIT_SR  BIT(10)
+/* Fetch Time Delta - bits 31:16
+ *
+ * This field holds the value to be reduced from the launch time for
+ * fetch time decision. The FetchTimeDelta value is defined in 32 ns
+ * granularity.
+ *
+ * This field is 16 bits wide, and so the maximum value is:
+ *
+ * 65535 * 32 = 2097120 ~= 2.1 msec
+ *
+ * XXX: We are configuring the max value here since we couldn't come up
+ * with a reason for not doing so.
+ */
+#define E1000_TQAVCTRL_FETCHTIME_DELTA (0xFFFF << 16)
 
 /* TX Qav Credit Control fields */
 #define E1000_TQAVCC_IDLESLOPE_MASK    0xFFFF
diff --git a/drivers/net/ethernet/intel/igb/igb.h 
b/drivers/net/ethernet/intel/igb/igb.h
index 9643b5b3d444..ca54e268d157 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -262,6 +262,7 @@ struct igb_ring {
u16 count;  /* number of desc. in the ring */
u8 queue_index; /* logical index of the ring*/
u8 reg_idx; /* physical index of the ring */
+   bool launchtime_enable; /* true if LaunchTime is enabled */
bool cbs_enable;/* indicates if CBS is enabled */
s32 idleslope;  /* idleSlope in kbps */
s32 sendslope;  /* sendSlope in kbps */
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index 445da8285d9b..e3a0c02721c9 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -1666,13 +1666,26 @@ static bool is_any_cbs_enabled(struct igb_adapter 
*adapter)
return false;
 }
 
+static bool is_any_txtime_enabled(struct igb_adapter *adapter)
+{
+   int i;
+
+   for (i = 0; i < adapter->num_tx_queues; i++) {
+   if (adapter->tx_ring[i]->launchtime_enable)
+   return true;
+   }
+
+   return false;
+}
+
 /**
  *  igb_config_tx_modes - Configure "Qav Tx mode" features on igb
  *  @adapter: pointer to adapter struct
  *  @queue: queue number
  *
- *  Configure CBS for a given hardware queue. Parameters are retrieved
- *  from the correct Tx ring, so igb_save_cbs_params() should be used
+ *  Configure CBS and Launchtime for a given hardware queue.
+ *  Parameters are retrieved from the correct Tx ring, so
+ *  igb_save_cbs_params() and igb_save_txtime_params() should be used
  *  for setting those correctly prior to this function being called.
  **/
 static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
@@ -1686,6 +1699,19 @@ static void igb_config_tx_modes(struct igb_adapter 
*adapter, int queue)
WARN_ON(hw->mac.type != e1000_i210);
WARN_ON(queue < 0 || queue > 1);
 
+   /* If any of the Qav features is enabled, configure queues as SR and
+* with HIGH PRIO. If none is, then configure them with LOW PRIO and
+* as SP.
+*/
+   if (ring->cbs_enable || ring->launchtime_enable) {
+   set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_HIGH);
+   set_queue_mode(hw, queue, QUEUE_MODE_STREAM_RESERVATION);
+   } else {
+   set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_LOW

[PATCH v2 net-next 10/14] igb: Only change Tx arbitration when CBS is on

2018-07-03 Thread Jesus Sanchez-Palencia
Currently the data transmission arbitration algorithm - DataTranARB
field on TQAVCTRL reg - is always set to CBS when the Tx mode is
changed from legacy to 'Qav' mode.

Make that configuration a bit more granular in preparation for the
upcoming Launchtime enabling patches, since CBS and Launchtime can be
enabled separately. That is achieved by moving the DataTranARB setup
to igb_config_tx_modes() instead.

Similarly, when disabling CBS we must check if it has been disabled
for all queues, and clear the DataTranARB accordingly.

Signed-off-by: Jesus Sanchez-Palencia 
---
 drivers/net/ethernet/intel/igb/igb_main.c | 49 +++
 1 file changed, 33 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index 15f6b9c57ccf..8c90f1e51add 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -1654,6 +1654,18 @@ static void set_queue_mode(struct e1000_hw *hw, int 
queue, enum queue_mode mode)
wr32(E1000_I210_TQAVCC(queue), val);
 }
 
+static bool is_any_cbs_enabled(struct igb_adapter *adapter)
+{
+   int i;
+
+   for (i = 0; i < adapter->num_tx_queues; i++) {
+   if (adapter->tx_ring[i]->cbs_enable)
+   return true;
+   }
+
+   return false;
+}
+
 /**
  *  igb_config_tx_modes - Configure "Qav Tx mode" features on igb
  *  @adapter: pointer to adapter struct
@@ -1668,7 +1680,7 @@ static void igb_config_tx_modes(struct igb_adapter 
*adapter, int queue)
struct igb_ring *ring = adapter->tx_ring[queue];
struct net_device *netdev = adapter->netdev;
struct e1000_hw *hw = &adapter->hw;
-   u32 tqavcc;
+   u32 tqavcc, tqavctrl;
u16 value;
 
WARN_ON(hw->mac.type != e1000_i210);
@@ -1693,6 +1705,14 @@ static void igb_config_tx_modes(struct igb_adapter 
*adapter, int queue)
set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_HIGH);
set_queue_mode(hw, queue, QUEUE_MODE_STREAM_RESERVATION);
 
+   /* Always set data transfer arbitration to credit-based
+* shaper algorithm on TQAVCTRL if CBS is enabled for any of
+* the queues.
+*/
+   tqavctrl = rd32(E1000_I210_TQAVCTRL);
+   tqavctrl |= E1000_TQAVCTRL_DATATRANARB;
+   wr32(E1000_I210_TQAVCTRL, tqavctrl);
+
/* According to i210 datasheet section 7.2.7.7, we should set
 * the 'idleSlope' field from TQAVCC register following the
 * equation:
@@ -1770,6 +1790,16 @@ static void igb_config_tx_modes(struct igb_adapter 
*adapter, int queue)
 
/* Set hiCredit to zero. */
wr32(E1000_I210_TQAVHC(queue), 0);
+
+   /* If CBS is not enabled for any queues anymore, then return to
+* the default state of Data Transmission Arbitration on
+* TQAVCTRL.
+*/
+   if (!is_any_cbs_enabled(adapter)) {
+   tqavctrl = rd32(E1000_I210_TQAVCTRL);
+   tqavctrl &= ~E1000_TQAVCTRL_DATATRANARB;
+   wr32(E1000_I210_TQAVCTRL, tqavctrl);
+   }
}
 
/* XXX: In i210 controller the sendSlope and loCredit parameters from
@@ -1803,18 +1833,6 @@ static int igb_save_cbs_params(struct igb_adapter 
*adapter, int queue,
return 0;
 }
 
-static bool is_any_cbs_enabled(struct igb_adapter *adapter)
-{
-   int i;
-
-   for (i = 0; i < adapter->num_tx_queues; i++) {
-   if (adapter->tx_ring[i]->cbs_enable)
-   return true;
-   }
-
-   return false;
-}
-
 /**
  *  igb_setup_tx_mode - Switch to/from Qav Tx mode when applicable
  *  @adapter: pointer to adapter struct
@@ -1838,11 +1856,10 @@ static void igb_setup_tx_mode(struct igb_adapter 
*adapter)
int i, max_queue;
 
/* Configure TQAVCTRL register: set transmit mode to 'Qav',
-* set data fetch arbitration to 'round robin' and set data
-* transfer arbitration to 'credit shaper algorithm.
+* set data fetch arbitration to 'round robin'.
 */
val = rd32(E1000_I210_TQAVCTRL);
-   val |= E1000_TQAVCTRL_XMIT_MODE | E1000_TQAVCTRL_DATATRANARB;
+   val |= E1000_TQAVCTRL_XMIT_MODE;
val &= ~E1000_TQAVCTRL_DATAFETCHARB;
wr32(E1000_I210_TQAVCTRL, val);
 
-- 
2.18.0



[PATCH v2 net-next 08/14] net/sched: Add HW offloading capability to ETF

2018-07-03 Thread Jesus Sanchez-Palencia
Add infra so etf qdisc supports HW offload of time-based transmission.

For HW offload, the time-sorted list is still used, so packets are
always dequeued in order of txtime.

Example:

$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
   map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

$ tc qdisc add dev enp2s0 parent 100:1 etf offload delta 10 \
   clockid CLOCK_REALTIME

In this example, the Qdisc will use HW offload for the control of the
transmission time through the network adapter. The hrtimer used for
packet scheduling inside the qdisc will use the clockid CLOCK_REALTIME
as reference and packets leave the Qdisc "delta" (10) nanoseconds
before their transmission time. Because this will be using HW offload and
since dynamic clocks are not supported by the hrtimer, the system clock
and the PHC clock must be synchronized for this mode to behave as
expected.
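To see why the synchronization matters, consider how an application would
typically derive a txtime: it reads a system clock and adds an offset, so
the hardware only launches the packet at the intended instant if the PHC
agrees with that clock. A minimal userspace sketch follows; the helper
name txtime_from_now() is hypothetical and the CLOCK_TAI fallback value is
an assumption for older libc headers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <time.h>

#ifndef CLOCK_TAI
#define CLOCK_TAI 11	/* assumed value, as in the Linux uapi headers */
#endif

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical helper: absolute txtime in nanoseconds, offset_ns from
 * "now" on the given clock. Returns 0 if the clock cannot be read.
 */
static uint64_t txtime_from_now(clockid_t clockid, uint64_t offset_ns)
{
	struct timespec ts;

	if (clock_gettime(clockid, &ts) != 0)
		return 0;

	return (uint64_t)ts.tv_sec * NSEC_PER_SEC +
	       (uint64_t)ts.tv_nsec + offset_ns;
}
```

For instance, txtime_from_now(CLOCK_TAI, 1000000) yields a launch time
roughly 1 ms in the future on the TAI time scale.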

Signed-off-by: Jesus Sanchez-Palencia 
---
 include/net/pkt_sched.h|  5 +++
 include/uapi/linux/pkt_sched.h |  1 +
 net/sched/sch_etf.c| 71 +-
 3 files changed, 76 insertions(+), 1 deletion(-)

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 2466ea143d01..7dc769e5452b 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -155,4 +155,9 @@ struct tc_cbs_qopt_offload {
s32 sendslope;
 };
 
+struct tc_etf_qopt_offload {
+   u8 enable;
+   s32 queue;
+};
+
 #endif
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index d5e933ce1447..949118461009 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -944,6 +944,7 @@ struct tc_etf_qopt {
__s32 clockid;
__u32 flags;
 #define TC_ETF_DEADLINE_MODE_ON BIT(0)
+#define TC_ETF_OFFLOAD_ON  BIT(1)
 };
 
 enum {
diff --git a/net/sched/sch_etf.c b/net/sched/sch_etf.c
index 4b7f4903ac17..932a136db568 100644
--- a/net/sched/sch_etf.c
+++ b/net/sched/sch_etf.c
@@ -20,8 +20,10 @@
 #include 
 
 #define DEADLINE_MODE_IS_ON(x) ((x)->flags & TC_ETF_DEADLINE_MODE_ON)
+#define OFFLOAD_IS_ON(x) ((x)->flags & TC_ETF_OFFLOAD_ON)
 
 struct etf_sched_data {
+   bool offload;
bool deadline_mode;
int clockid;
int queue;
@@ -45,6 +47,9 @@ static inline int validate_input_params(struct tc_etf_qopt 
*qopt,
 *  * Dynamic clockids are not supported.
 *
 *  * Delta must be a positive integer.
+*
+* Also note that for the HW offload case, we must
+* expect that system clocks have been synchronized to PHC.
 */
if (qopt->clockid < 0) {
NL_SET_ERR_MSG(extack, "Dynamic clockids are not supported");
@@ -225,6 +230,56 @@ static struct sk_buff *etf_dequeue_timesortedlist(struct 
Qdisc *sch)
return skb;
 }
 
+static void etf_disable_offload(struct net_device *dev,
+   struct etf_sched_data *q)
+{
+   struct tc_etf_qopt_offload etf = { };
+   const struct net_device_ops *ops;
+   int err;
+
+   if (!q->offload)
+   return;
+
+   ops = dev->netdev_ops;
+   if (!ops->ndo_setup_tc)
+   return;
+
+   etf.queue = q->queue;
+   etf.enable = 0;
+
+   err = ops->ndo_setup_tc(dev, TC_SETUP_QDISC_ETF, &etf);
+   if (err < 0)
+   pr_warn("Couldn't disable ETF offload for queue %d\n",
+   etf.queue);
+}
+
+static int etf_enable_offload(struct net_device *dev, struct etf_sched_data *q,
+ struct netlink_ext_ack *extack)
+{
+   const struct net_device_ops *ops = dev->netdev_ops;
+   struct tc_etf_qopt_offload etf = { };
+   int err;
+
+   if (q->offload)
+   return 0;
+
+   if (!ops->ndo_setup_tc) {
+   NL_SET_ERR_MSG(extack, "Specified device does not support ETF 
offload");
+   return -EOPNOTSUPP;
+   }
+
+   etf.queue = q->queue;
+   etf.enable = 1;
+
+   err = ops->ndo_setup_tc(dev, TC_SETUP_QDISC_ETF, &etf);
+   if (err < 0) {
+   NL_SET_ERR_MSG(extack, "Specified device failed to setup ETF 
hardware offload");
+   return err;
+   }
+
+   return 0;
+}
+
 static int etf_init(struct Qdisc *sch, struct nlattr *opt,
struct netlink_ext_ack *extack)
 {
@@ -251,8 +306,9 @@ static int etf_init(struct Qdisc *sch, struct nlattr *opt,
 
qopt = nla_data(tb[TCA_ETF_PARMS]);
 
-   pr_debug("delta %d clockid %d deadline %s\n",
+   pr_debug("delta %d clockid %d offload %s deadline %s\n",
 qopt->delta, qopt->clockid,
+OFFLOAD_IS_ON(qopt) ? "on" : "off",
 DEADLINE_MODE_IS_ON(qopt) ? "on" : "off"

[PATCH v2 net-next 12/14] igb: Only call skb_tx_timestamp after descriptors are ready

2018-07-03 Thread Jesus Sanchez-Palencia
Currently, skb_tx_timestamp() is being called before the Tx
descriptors are prepared in igb_xmit_frame_ring(), which happens
during either the igb_tso() or igb_tx_csum() calls.

Given that now the skb->tstamp might be used to carry the timestamp
for SO_TXTIME, we must only call skb_tx_timestamp() after the
information has been copied into the Tx descriptors.

Signed-off-by: Jesus Sanchez-Palencia 
---
 drivers/net/ethernet/intel/igb/igb_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index c30ab7b260cc..445da8285d9b 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -6033,8 +6033,6 @@ netdev_tx_t igb_xmit_frame_ring(struct sk_buff *skb,
}
}
 
-   skb_tx_timestamp(skb);
-
if (skb_vlan_tag_present(skb)) {
tx_flags |= IGB_TX_FLAGS_VLAN;
tx_flags |= (skb_vlan_tag_get(skb) << IGB_TX_FLAGS_VLAN_SHIFT);
@@ -6050,6 +6048,8 @@ netdev_tx_t igb_xmit_frame_ring(struct sk_buff *skb,
else if (!tso)
igb_tx_csum(tx_ring, first);
 
+   skb_tx_timestamp(skb);
+
if (igb_tx_map(tx_ring, first, hdr_len))
goto cleanup_tx_tstamp;
 
-- 
2.18.0



[PATCH v2 net-next 05/14] net: packet: Hook into time based transmission.

2018-07-03 Thread Jesus Sanchez-Palencia
From: Richard Cochran 

For raw layer-2 packets, copy the desired future transmit time from
the CMSG cookie into the skb.

Signed-off-by: Richard Cochran 
Signed-off-by: Jesus Sanchez-Palencia 
---
 net/packet/af_packet.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 57634bc3da74..3428f7739ae9 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1951,6 +1951,7 @@ static int packet_sendmsg_spkt(struct socket *sock, 
struct msghdr *msg,
goto out_unlock;
}
 
+   sockc.transmit_time = 0;
sockc.tsflags = sk->sk_tsflags;
if (msg->msg_controllen) {
err = sock_cmsg_send(sk, msg, &sockc);
@@ -1962,6 +1963,7 @@ static int packet_sendmsg_spkt(struct socket *sock, 
struct msghdr *msg,
skb->dev = dev;
skb->priority = sk->sk_priority;
skb->mark = sk->sk_mark;
+   skb->tstamp = sockc.transmit_time;
 
sock_tx_timestamp(sk, sockc.tsflags, &skb_shinfo(skb)->tx_flags);
 
@@ -2457,6 +2459,7 @@ static int tpacket_fill_skb(struct packet_sock *po, 
struct sk_buff *skb,
skb->dev = dev;
skb->priority = po->sk.sk_priority;
skb->mark = po->sk.sk_mark;
+   skb->tstamp = sockc->transmit_time;
sock_tx_timestamp(&po->sk, sockc->tsflags, &skb_shinfo(skb)->tx_flags);
skb_shinfo(skb)->destructor_arg = ph.raw;
 
@@ -2633,6 +2636,7 @@ static int tpacket_snd(struct packet_sock *po, struct 
msghdr *msg)
if (unlikely(!(dev->flags & IFF_UP)))
goto out_put;
 
+   sockc.transmit_time = 0;
sockc.tsflags = po->sk.sk_tsflags;
if (msg->msg_controllen) {
err = sock_cmsg_send(&po->sk, msg, &sockc);
@@ -2829,6 +2833,7 @@ static int packet_snd(struct socket *sock, struct msghdr 
*msg, size_t len)
if (unlikely(!(dev->flags & IFF_UP)))
goto out_unlock;
 
+   sockc.transmit_time = 0;
sockc.tsflags = sk->sk_tsflags;
sockc.mark = sk->sk_mark;
if (msg->msg_controllen) {
@@ -2903,6 +2908,7 @@ static int packet_snd(struct socket *sock, struct msghdr 
*msg, size_t len)
skb->dev = dev;
skb->priority = sk->sk_priority;
skb->mark = sockc.mark;
+   skb->tstamp = sockc.transmit_time;
 
if (has_vnet_hdr) {
err = virtio_net_hdr_to_skb(skb, &vnet_hdr, vio_le());
-- 
2.18.0



[PATCH v2 net-next 07/14] net/sched: Introduce the ETF Qdisc

2018-07-03 Thread Jesus Sanchez-Palencia
From: Vinicius Costa Gomes 

The ETF (Earliest TxTime First) qdisc uses the information added
earlier in this series (the socket option SO_TXTIME and the new
role of sk_buff->tstamp) to schedule packet transmission based
on absolute time.

For some workloads, just bandwidth enforcement is not enough, and
precise control of the transmission of packets is necessary.

Example:

$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
   map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

$ tc qdisc add dev enp2s0 parent 100:1 etf delta 10 \
   clockid CLOCK_TAI

In this example, the Qdisc will provide SW best-effort for the control
of the transmission time to the network adapter, the time stamp in the
socket will be in reference to the clockid CLOCK_TAI and packets
will leave the qdisc "delta" (10) nanoseconds before their transmission
time.

The ETF qdisc will buffer packets sorted by their txtime. It will drop
packets on enqueue() if their skbuff clockid does not match the clock
reference of the Qdisc. Moreover, on dequeue(), a packet will be dropped
if its txtime expires while it is waiting in the queue.

The qdisc also supports the SO_TXTIME deadline mode. For this mode, it
will dequeue a packet as soon as possible and change the skb timestamp
to 'now' during etf_dequeue().
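
The dequeue behaviour described above can be sketched as a small decision
function. This is only a simplified model written for illustration, not the
code in sch_etf.c: the names etf_action and etf_decide() are hypothetical,
all times are assumed to be nanoseconds on the same clock, and the exact
boundary comparisons used by the qdisc may differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* Possible outcomes for the packet at the head of the sorted queue. */
enum etf_action { ETF_DROP, ETF_WAIT, ETF_SEND };

/*
 * Hypothetical helper modelling the dequeue decision.
 * now, txtime: nanoseconds on the qdisc's reference clock.
 * delta: how long before txtime the packet may leave the qdisc.
 */
enum etf_action etf_decide(int64_t now, int64_t txtime,
			   int32_t delta, bool deadline_mode)
{
	if (txtime < now)
		return ETF_DROP;	/* txtime expired while queued */
	if (deadline_mode)
		return ETF_SEND;	/* deadline mode: send as soon as possible */
	if (now >= txtime - delta)
		return ETF_SEND;	/* inside the [txtime - delta, txtime] window */
	return ETF_WAIT;		/* too early: wait for the watchdog timer */
}
```

With delta 10, a packet whose txtime is 105 is released at now = 100, while
one with txtime 200 keeps waiting unless deadline mode is on.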

Note that both the qdisc's and the SO_TXTIME ABIs allow for a clockid
to be configured, but it's been decided that usage of CLOCK_TAI should
be enforced until we decide to allow for other clockids to be used.
The rationale here is that PTP times are usually in the TAI scale, thus
no other clocks should be necessary. For now, the qdisc will return
EINVAL if any clocks other than CLOCK_TAI are used.

Signed-off-by: Jesus Sanchez-Palencia 
Signed-off-by: Vinicius Costa Gomes 
---
 include/linux/netdevice.h  |   1 +
 include/uapi/linux/pkt_sched.h |  17 ++
 net/sched/Kconfig  |  11 +
 net/sched/Makefile |   1 +
 net/sched/sch_etf.c| 384 +
 5 files changed, 414 insertions(+)
 create mode 100644 net/sched/sch_etf.c

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 64480a0f2c16..610df79b9845 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -798,6 +798,7 @@ enum tc_setup_type {
TC_SETUP_QDISC_RED,
TC_SETUP_QDISC_PRIO,
TC_SETUP_QDISC_MQ,
+   TC_SETUP_QDISC_ETF,
 };
 
 /* These structures hold the attributes of bpf state that are being passed
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index bad3c03bcf43..d5e933ce1447 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -937,4 +937,21 @@ enum {
 
 #define TCA_CBS_MAX (__TCA_CBS_MAX - 1)
 
+
+/* ETF */
+struct tc_etf_qopt {
+   __s32 delta;
+   __s32 clockid;
+   __u32 flags;
+#define TC_ETF_DEADLINE_MODE_ON  BIT(0)
+};
+
+enum {
+   TCA_ETF_UNSPEC,
+   TCA_ETF_PARMS,
+   __TCA_ETF_MAX,
+};
+
+#define TCA_ETF_MAX (__TCA_ETF_MAX - 1)
+
 #endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index a01169fb5325..fcc89706745b 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -183,6 +183,17 @@ config NET_SCH_CBS
  To compile this code as a module, choose M here: the
  module will be called sch_cbs.
 
+config NET_SCH_ETF
+   tristate "Earliest TxTime First (ETF)"
+   help
+ Say Y here if you want to use the Earliest TxTime First (ETF) packet
+ scheduling algorithm.
+
+ See the top of  for more details.
+
+ To compile this code as a module, choose M here: the
+ module will be called sch_etf.
+
 config NET_SCH_GRED
tristate "Generic Random Early Detection (GRED)"
---help---
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 8811d3804878..9a5a7077d217 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -54,6 +54,7 @@ obj-$(CONFIG_NET_SCH_FQ)  += sch_fq.o
 obj-$(CONFIG_NET_SCH_HHF)  += sch_hhf.o
 obj-$(CONFIG_NET_SCH_PIE)  += sch_pie.o
 obj-$(CONFIG_NET_SCH_CBS)  += sch_cbs.o
+obj-$(CONFIG_NET_SCH_ETF)  += sch_etf.o
 
 obj-$(CONFIG_NET_CLS_U32)  += cls_u32.o
 obj-$(CONFIG_NET_CLS_ROUTE4)   += cls_route.o
diff --git a/net/sched/sch_etf.c b/net/sched/sch_etf.c
new file mode 100644
index ..4b7f4903ac17
--- /dev/null
+++ b/net/sched/sch_etf.c
@@ -0,0 +1,384 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/* net/sched/sch_etf.c  Earliest TxTime First queueing discipline.
+ *
+ * Authors:Jesus Sanchez-Palencia 
+ * Vinicius Costa Gomes 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define DEADLINE_MODE_IS_ON(x) ((x)->flags & TC_ETF_DEADLINE_MODE_ON)
+
+struct etf_sched_data {
+   bool deadline_mode;
+   int clockid;
+   int queue;
+   s32

[PATCH v2 net-next 14/14] net/sched: Make etf report drops on error_queue

2018-07-03 Thread Jesus Sanchez-Palencia
Use the socket error queue for reporting dropped packets if the
socket has enabled that feature through the SO_TXTIME API.

Packets are dropped either on enqueue() if they aren't accepted by the
qdisc or on dequeue() if the system misses their deadline. Those are
reported as different errors so applications can react accordingly.

Userspace can retrieve the errors through the socket error queue and the
corresponding cmsg interfaces. A struct sock_extended_err* is used for
returning the error data, and the packet's timestamp can be retrieved by
combining the ee_data (high 32 bits) and ee_info (low 32 bits) fields, e.g.:

((__u64) serr->ee_data << 32) + serr->ee_info

This feature is disabled by default and must be explicitly enabled by
applications. Enabling it can bring some overhead for the Tx cycles
of the application.
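
As an illustration of the ee_data/ee_info encoding described above, the
sketch below splits a 64-bit txtime into two 32-bit halves and joins them
back. The helper names txtime_split() and txtime_join() are hypothetical
and exist only for this example; a real application would read the two
fields from the struct sock_extended_err it receives on the error queue.

```c
#include <stdint.h>

/* Split a 64-bit txtime: high half goes to ee_data, low half to ee_info. */
void txtime_split(uint64_t txtime, uint32_t *ee_data, uint32_t *ee_info)
{
	*ee_data = (uint32_t)(txtime >> 32);
	*ee_info = (uint32_t)txtime;
}

/* Rebuild the 64-bit txtime from the two 32-bit fields. */
uint64_t txtime_join(uint32_t ee_data, uint32_t ee_info)
{
	/* Widen before shifting, otherwise the high half would be lost. */
	return ((uint64_t)ee_data << 32) + ee_info;
}
```

The cast to a 64-bit type before the shift is the important detail: shifting
a 32-bit value by 32 would not produce the intended result.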

Signed-off-by: Jesus Sanchez-Palencia 
---
 include/net/sock.h  |  3 ++-
 include/uapi/linux/errqueue.h   |  4 
 include/uapi/linux/net_tstamp.h |  5 -
 net/core/sock.c |  4 
 net/sched/sch_etf.c | 35 +++--
 5 files changed, 47 insertions(+), 4 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 68347b9821c6..e0eac9ef44b5 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -481,7 +481,8 @@ struct sock {
 
u8  sk_clockid;
u8  sk_txtime_deadline_mode : 1,
-   sk_txtime_unused : 7;
+   sk_txtime_report_errors : 1,
+   sk_txtime_unused : 6;
 
struct socket   *sk_socket;
void*sk_user_data;
diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
index dc64cfaf13da..c0151200f7d1 100644
--- a/include/uapi/linux/errqueue.h
+++ b/include/uapi/linux/errqueue.h
@@ -20,12 +20,16 @@ struct sock_extended_err {
 #define SO_EE_ORIGIN_ICMP6 3
 #define SO_EE_ORIGIN_TXSTATUS  4
 #define SO_EE_ORIGIN_ZEROCOPY  5
+#define SO_EE_ORIGIN_TXTIME  6
 #define SO_EE_ORIGIN_TIMESTAMPING SO_EE_ORIGIN_TXSTATUS
 
 #define SO_EE_OFFENDER(ee) ((struct sockaddr*)((ee)+1))
 
 #define SO_EE_CODE_ZEROCOPY_COPIED 1
 
+#define SO_EE_CODE_TXTIME_INVALID_PARAM  1
+#define SO_EE_CODE_TXTIME_MISSED   2
+
 /**
  * struct scm_timestamping - timestamps exposed through cmsg
  *
diff --git a/include/uapi/linux/net_tstamp.h b/include/uapi/linux/net_tstamp.h
index c9a77c353b98..f8f4539f1135 100644
--- a/include/uapi/linux/net_tstamp.h
+++ b/include/uapi/linux/net_tstamp.h
@@ -147,8 +147,11 @@ struct scm_ts_pktinfo {
  */
 enum txtime_flags {
SOF_TXTIME_DEADLINE_MODE = (1 << 0),
+   SOF_TXTIME_REPORT_ERRORS = (1 << 1),
 
-   SOF_TXTIME_FLAGS_MASK = (SOF_TXTIME_DEADLINE_MODE)
+   SOF_TXTIME_FLAGS_LAST = SOF_TXTIME_REPORT_ERRORS,
+   SOF_TXTIME_FLAGS_MASK = (SOF_TXTIME_FLAGS_LAST - 1) |
+SOF_TXTIME_FLAGS_LAST
 };
 
 struct sock_txtime {
diff --git a/net/core/sock.c b/net/core/sock.c
index fe64b839f1b2..03fdea5b0f57 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1087,6 +1087,8 @@ int sock_setsockopt(struct socket *sock, int level, int 
optname,
sk->sk_clockid = sk_txtime.clockid;
sk->sk_txtime_deadline_mode =
!!(sk_txtime.flags & SOF_TXTIME_DEADLINE_MODE);
+   sk->sk_txtime_report_errors =
+   !!(sk_txtime.flags & SOF_TXTIME_REPORT_ERRORS);
}
break;
 
@@ -1429,6 +1431,8 @@ int sock_getsockopt(struct socket *sock, int level, int 
optname,
v.txtime.clockid = sk->sk_clockid;
v.txtime.flags |= sk->sk_txtime_deadline_mode ?
  SOF_TXTIME_DEADLINE_MODE : 0;
+   v.txtime.flags |= sk->sk_txtime_report_errors ?
+ SOF_TXTIME_REPORT_ERRORS : 0;
break;
 
default:
diff --git a/net/sched/sch_etf.c b/net/sched/sch_etf.c
index 932a136db568..1538d6fa8165 100644
--- a/net/sched/sch_etf.c
+++ b/net/sched/sch_etf.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -123,6 +124,32 @@ static void reset_watchdog(struct Qdisc *sch)
qdisc_watchdog_schedule_ns(&q->watchdog, ktime_to_ns(next));
 }
 
+static void report_sock_error(struct sk_buff *skb, u32 err, u8 code)
+{
+   struct sock_exterr_skb *serr;
+   struct sk_buff *clone;
+   ktime_t txtime = skb->tstamp;
+
+   if (!skb->sk || !(skb->sk->sk_txtime_report_errors))
+   return;
+
+   clone = skb_clone(skb, GFP_ATOMIC);
+   if (!clone)
+   return;
+
+   serr = SKB_EXT_ERR(clone);
+   serr->ee.ee_errno = err;
+   serr->ee.ee_origin = SO_EE_ORIGIN_TXTIME;
+   serr->ee

[PATCH v2 net-next 01/14] net: Clear skb->tstamp only on the forwarding path

2018-07-03 Thread Jesus Sanchez-Palencia
This is done in preparation for the upcoming time based transmission
patchset. Now that skb->tstamp will be used to hold packet's txtime,
we must ensure that it is being cleared when traversing namespaces.
Also, doing that from skb_scrub_packet() before the early return would
break our feature when tunnels are used.

Signed-off-by: Jesus Sanchez-Palencia 
---
 net/core/skbuff.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 1357f36c8a5e..c4e24ac27464 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4898,7 +4898,6 @@ EXPORT_SYMBOL(skb_try_coalesce);
  */
 void skb_scrub_packet(struct sk_buff *skb, bool xnet)
 {
-   skb->tstamp = 0;
skb->pkt_type = PACKET_HOST;
skb->skb_iif = 0;
skb->ignore_df = 0;
@@ -4912,6 +4911,7 @@ void skb_scrub_packet(struct sk_buff *skb, bool xnet)
 
ipvs_reset(skb);
skb->mark = 0;
+   skb->tstamp = 0;
 }
 EXPORT_SYMBOL_GPL(skb_scrub_packet);
 
-- 
2.18.0



[PATCH v2 net-next 04/14] net: ipv6: Hook into time based transmission

2018-07-03 Thread Jesus Sanchez-Palencia
Add a struct sockcm_cookie parameter to ip6_setup_cork() so
we can easily re-use the transmit_time field from struct inet_cork
for most paths, by copying the timestamp from the CMSG cookie.
This is later copied into the skb during __ip6_make_skb().

For the raw fast path, also pass the sockcm_cookie as a parameter
so we can just perform the copy at rawv6_send_hdrinc() directly.

Signed-off-by: Jesus Sanchez-Palencia 
---
 net/ipv6/ip6_output.c | 11 ---
 net/ipv6/raw.c|  7 +--
 net/ipv6/udp.c|  1 +
 3 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index a14fb4fcdf18..f48af7e62f12 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1158,7 +1158,8 @@ static void ip6_append_data_mtu(unsigned int *mtu,
 
 static int ip6_setup_cork(struct sock *sk, struct inet_cork_full *cork,
  struct inet6_cork *v6_cork, struct ipcm6_cookie *ipc6,
- struct rt6_info *rt, struct flowi6 *fl6)
+ struct rt6_info *rt, struct flowi6 *fl6,
+ const struct sockcm_cookie *sockc)
 {
struct ipv6_pinfo *np = inet6_sk(sk);
unsigned int mtu;
@@ -1226,6 +1227,8 @@ static int ip6_setup_cork(struct sock *sk, struct 
inet_cork_full *cork,
cork->base.flags |= IPCORK_ALLFRAG;
cork->base.length = 0;
 
+   cork->base.transmit_time = sockc->transmit_time;
+
return 0;
 }
 
@@ -1575,7 +1578,7 @@ int ip6_append_data(struct sock *sk,
 * setup for corking
 */
err = ip6_setup_cork(sk, &inet->cork, &np->cork,
-ipc6, rt, fl6);
+ipc6, rt, fl6, sockc);
if (err)
return err;
 
@@ -1673,6 +1676,8 @@ struct sk_buff *__ip6_make_skb(struct sock *sk,
skb->priority = sk->sk_priority;
skb->mark = sk->sk_mark;
 
+   skb->tstamp = cork->base.transmit_time;
+
skb_dst_set(skb, dst_clone(&rt->dst));
IP6_UPD_PO_STATS(net, rt->rt6i_idev, IPSTATS_MIB_OUT, skb->len);
if (proto == IPPROTO_ICMPV6) {
@@ -1765,7 +1770,7 @@ struct sk_buff *ip6_make_skb(struct sock *sk,
cork->base.opt = NULL;
cork->base.dst = NULL;
v6_cork.opt = NULL;
-   err = ip6_setup_cork(sk, cork, &v6_cork, ipc6, rt, fl6);
+   err = ip6_setup_cork(sk, cork, &v6_cork, ipc6, rt, fl6, sockc);
if (err) {
ip6_cork_release(cork, &v6_cork);
return ERR_PTR(err);
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index afc307c89d1a..5737c50f16eb 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -620,7 +620,7 @@ static int rawv6_push_pending_frames(struct sock *sk, 
struct flowi6 *fl6,
 
 static int rawv6_send_hdrinc(struct sock *sk, struct msghdr *msg, int length,
struct flowi6 *fl6, struct dst_entry **dstp,
-   unsigned int flags)
+   unsigned int flags, const struct sockcm_cookie *sockc)
 {
struct ipv6_pinfo *np = inet6_sk(sk);
struct net *net = sock_net(sk);
@@ -650,6 +650,7 @@ static int rawv6_send_hdrinc(struct sock *sk, struct msghdr 
*msg, int length,
skb->protocol = htons(ETH_P_IPV6);
skb->priority = sk->sk_priority;
skb->mark = sk->sk_mark;
+   skb->tstamp = sockc->transmit_time;
skb_dst_set(skb, &rt->dst);
*dstp = NULL;
 
@@ -848,6 +849,7 @@ static int rawv6_sendmsg(struct sock *sk, struct msghdr 
*msg, size_t len)
fl6.flowi6_oif = sk->sk_bound_dev_if;
 
sockc.tsflags = sk->sk_tsflags;
+   sockc.transmit_time = 0;
if (msg->msg_controllen) {
opt = &opt_space;
memset(opt, 0, sizeof(struct ipv6_txoptions));
@@ -921,7 +923,8 @@ static int rawv6_sendmsg(struct sock *sk, struct msghdr 
*msg, size_t len)
 
 back_from_confirm:
if (inet->hdrincl)
-   err = rawv6_send_hdrinc(sk, msg, len, &fl6, &dst, 
msg->msg_flags);
+   err = rawv6_send_hdrinc(sk, msg, len, &fl6, &dst,
+   msg->msg_flags, &sockc);
else {
ipc6.opt = opt;
lock_sock(sk);
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index e6645cae403e..ac6fc6728903 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1148,6 +1148,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
ipc6.dontfrag = -1;
ipc6.gso_size = up->gso_size;
sockc.tsflags = sk->sk_tsflags;
+   sockc.transmit_time = 0;
 
/* destination address check */
if (sin6) {
-- 
2.18.0



[PATCH v2 net-next 06/14] net/sched: Allow creating a Qdisc watchdog with other clocks

2018-07-03 Thread Jesus Sanchez-Palencia
From: Vinicius Costa Gomes 

This adds 'qdisc_watchdog_init_clockid()', which allows a clockid to be
passed so that other time references can be used when scheduling
the Qdisc to run.

Signed-off-by: Vinicius Costa Gomes 
---
 include/net/pkt_sched.h |  2 ++
 net/sched/sch_api.c | 11 +--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 815b92a23936..2466ea143d01 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -72,6 +72,8 @@ struct qdisc_watchdog {
struct Qdisc*qdisc;
 };
 
+void qdisc_watchdog_init_clockid(struct qdisc_watchdog *wd, struct Qdisc 
*qdisc,
+clockid_t clockid);
 void qdisc_watchdog_init(struct qdisc_watchdog *wd, struct Qdisc *qdisc);
 void qdisc_watchdog_schedule_ns(struct qdisc_watchdog *wd, u64 expires);
 
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 54eca685420f..98541c6399db 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -596,12 +596,19 @@ static enum hrtimer_restart qdisc_watchdog(struct hrtimer 
*timer)
return HRTIMER_NORESTART;
 }
 
-void qdisc_watchdog_init(struct qdisc_watchdog *wd, struct Qdisc *qdisc)
+void qdisc_watchdog_init_clockid(struct qdisc_watchdog *wd, struct Qdisc 
*qdisc,
+clockid_t clockid)
 {
-   hrtimer_init(&wd->timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
+   hrtimer_init(&wd->timer, clockid, HRTIMER_MODE_ABS_PINNED);
wd->timer.function = qdisc_watchdog;
wd->qdisc = qdisc;
 }
+EXPORT_SYMBOL(qdisc_watchdog_init_clockid);
+
+void qdisc_watchdog_init(struct qdisc_watchdog *wd, struct Qdisc *qdisc)
+{
+   qdisc_watchdog_init_clockid(wd, qdisc, CLOCK_MONOTONIC);
+}
 EXPORT_SYMBOL(qdisc_watchdog_init);
 
 void qdisc_watchdog_schedule_ns(struct qdisc_watchdog *wd, u64 expires)
-- 
2.18.0



Re: [PATCH v1 net-next 14/14] net/sched: Make etf report drops on error_queue

2018-06-29 Thread Jesus Sanchez-Palencia




On 06/29/2018 11:49 AM, Willem de Bruijn wrote:
>>>> diff --git a/net/sched/sch_etf.c b/net/sched/sch_etf.c
>>>> +static void report_sock_error(struct sk_buff *skb, u32 err, u8 code)
>>>> +{
>>>> +   struct sock_exterr_skb *serr;
>>>> +   ktime_t txtime = skb->tstamp;
>>>> +
>>>> +   if (!skb->sk || !(skb->sk->sk_txtime_flags & 
>>>> SK_TXTIME_RECV_ERR_MASK))
>>>> +   return;
>>>> +
>>>> +   skb = skb_clone_sk(skb);
>>>> +   if (!skb)
>>>> +   return;
>>>> +
>>>> +   sock_hold(skb->sk);
>>>
>>> Why take an extra reference? The skb holds a ref on the sk.
>>
>>
>> Yes, the cloned skb holds a ref on the socket, but the documentation of
>> skb_clone_sk() makes this explicit suggestion:
>>
>> (...)
>>  * When passing buffers allocated with this function to sock_queue_err_skb
>>  * it is necessary to wrap the call with sock_hold/sock_put in order to
>>  * prevent the socket from being released prior to being enqueued on
>>  * the sk_error_queue.
>>  */
>>
>> which I believe is here just so we are protected against a possible race 
>> after
>> skb_orphan() is called from sock_queue_err_skb(). Please let me know if I'm
>> misreading anything.
> 
> Yes, indeed. Code only has to worry about that if there are no
> concurrent references
> on the socket.
> 
> I may be mistaken, but I believe that this complicated logic exists
> only for cases where
> the timestamp may be queued after the original skb has been released.
> Specifically,
> when a tx timestamp is returned from a hardware device after transmission of 
> the
> original skb. Then the cloned timestamp skb needs its own reference on
> the sk while
> it is waiting for the timestamp data (i.e., until the device
> completion arrives) and then
> we need a temporary extra ref to work around the skb_orphan in
> sock_queue_err_skb.
> 
> Compare skb_complete_tx_timestamp with skb_tstamp_tx. The second is used in
> the regular datapath to clone an skb and queue it on the error queue
> immediately,
> while holding the original skb. This does not call skb_clone_sk and
> does not need the
> extra sock_hold. This should be good enough for this code path, too.
> As the skb holds a
> ref on skb->sk, the socket cannot go away in the middle of report_sock_error.


Oh, that makes sense. Great, I will give this a try and add it to the v2.

Thanks,
Jesus



Re: [PATCH v1 net-next 14/14] net/sched: Make etf report drops on error_queue

2018-06-29 Thread Jesus Sanchez-Palencia
Hi Willem,


On 06/28/2018 07:27 AM, Willem de Bruijn wrote:

(...)

> 
>>  struct sock_txtime {
>> clockid_t   clockid;/* reference clockid */
>> -   u16 flags;  /* bit 0: txtime in deadline_mode */
>> +   u16 flags;  /* bit 0: txtime in deadline_mode
>> +* bit 1: report drops on sk err 
>> queue
>> +*/
>>  };
> 
> If this is shared with userspace, should be defined in an uapi header.
> Same on the flag bits below. Self documenting code is preferable over
> comments.


Fixed for v2.


> 
>>  /*
>> diff --git a/include/net/sock.h b/include/net/sock.h
>> index 73f4404e49e4..e681a45cfe7e 100644
>> --- a/include/net/sock.h
>> +++ b/include/net/sock.h
>> @@ -473,6 +473,7 @@ struct sock {
>> u16 sk_clockid;
>> u16 sk_txtime_flags;
>>  #define SK_TXTIME_DEADLINE_MASK  BIT(0)
>> +#define SK_TXTIME_RECV_ERR_MASK  BIT(1)
> 
> Integer bitfields are (arguably) more readable. There is no requirement
> that the user interface be the same as the in-kernel implementation. Indeed
> if you can save bits in struct sock, that is preferable (but not so for the 
> ABI,
> which cannot easily be extended).


Sure, changed for v2.

(...)


>> diff --git a/net/sched/sch_etf.c b/net/sched/sch_etf.c
>> index 5514a8aa3bd5..166f4b72875b 100644
>> --- a/net/sched/sch_etf.c
>> +++ b/net/sched/sch_etf.c
>> @@ -11,6 +11,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  #include 
>>  #include 
>> @@ -124,6 +125,35 @@ static void reset_watchdog(struct Qdisc *sch)
>> qdisc_watchdog_schedule_ns(&q->watchdog, ktime_to_ns(next));
>>  }
>>
>> +static void report_sock_error(struct sk_buff *skb, u32 err, u8 code)
>> +{
>> +   struct sock_exterr_skb *serr;
>> +   ktime_t txtime = skb->tstamp;
>> +
>> +   if (!skb->sk || !(skb->sk->sk_txtime_flags & 
>> SK_TXTIME_RECV_ERR_MASK))
>> +   return;
>> +
>> +   skb = skb_clone_sk(skb);
>> +   if (!skb)
>> +   return;
>> +
>> +   sock_hold(skb->sk);
> 
> Why take an extra reference? The skb holds a ref on the sk.


Yes, the cloned skb holds a ref on the socket, but the documentation of
skb_clone_sk() makes this explicit suggestion:

(...)
 * When passing buffers allocated with this function to sock_queue_err_skb
 * it is necessary to wrap the call with sock_hold/sock_put in order to
 * prevent the socket from being released prior to being enqueued on
 * the sk_error_queue.
 */

which I believe is here just so we are protected against a possible race after
skb_orphan() is called from sock_queue_err_skb(). Please let me know if I'm
misreading anything.

And for v2 I will move the sock_hold() call to immediately before the
sock_queue_err_skb() to avoid any future confusion.



> 
>> +
>> +   serr = SKB_EXT_ERR(skb);
>> +   serr->ee.ee_errno = err;
>> +   serr->ee.ee_origin = SO_EE_ORIGIN_LOCAL;
> 
> I suggest adding a new SO_EE_ORIGIN_TXTIME as opposed to overloading
> the existing
> local origin. Then the EE_CODE can start at 1, as ee_code can be
> demultiplexed by origin.


OK, it looks better indeed. Fixed for v2.


> 
>> +   serr->ee.ee_type = 0;
>> +   serr->ee.ee_code = code;
>> +   serr->ee.ee_pad = 0;
>> +   serr->ee.ee_data = (txtime >> 32); /* high part of tstamp */
>> +   serr->ee.ee_info = txtime; /* low part of tstamp */
>> +
>> +   if (sock_queue_err_skb(skb->sk, skb))
>> +   kfree_skb(skb);
>> +
>> +   sock_put(skb->sk);
>> +}


Thanks,
Jesus


Re: [PATCH v1 net-next 02/14] net: Add a new socket option for a future transmit time.

2018-06-28 Thread Jesus Sanchez-Palencia
Hi Willem,


On 06/28/2018 07:40 AM, Willem de Bruijn wrote:
> On Thu, Jun 28, 2018 at 10:26 AM Willem de Bruijn
>  wrote:
>>
>> On Wed, Jun 27, 2018 at 6:08 PM Jesus Sanchez-Palencia
>>  wrote:
>>>
>>> From: Richard Cochran 
>>>
>>> This patch introduces SO_TXTIME. User space enables this option in
>>> order to pass a desired future transmit time in a CMSG when calling
>>> sendmsg(2). The argument to this socket option is a 6-bytes long struct
>>> defined as:
>>>
>>> struct sock_txtime {
>>> clockid_t   clockid;
>>> u16 flags;
>>> };
>>
>> clockid_t is __kernel_clockid_t is int is a variable length field.
>> Please use fixed length fields.
> 
> Sorry, int is fine, of course, and clockid_t is used between userspace and
> kernel already.


Great. So, in addition to the other feedback in sock.c, what I'm thinking here
for the v2 is:

- move this struct and the flags definition (as enums) to
include/uapi/linux/net_tstamp.h;

- keep clockid as a clockid_t and increase flags to u32 since this already takes
8 bytes in total anyway;

- reduce sk_clockid and sk_txtime_flags from struct sock from a u16 to a u8 
each.


Thanks,
Jesus



> 
>> Also, as MAX_CLOCKS is 16, only 4 bits are needed. A single u16
>> is probably sufficient as cmsg argument. To future proof, a u32 will
>> allow for more
>> than 4 flags. But in struct sock, 16 bits should be sufficient to
>> encode both clock id
>> and flags.
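
To illustrate the packing suggested above (4 bits for the clockid, the
remaining 12 bits of a u16 for flags), here is a minimal sketch. The macro
and function names are hypothetical and are not part of the patch, which
ended up using separate fields.

```c
#include <stdint.h>

/* MAX_CLOCKS is 16, so a clockid fits in 4 bits. */
#define TXTIME_CLOCKID_BITS 4
#define TXTIME_CLOCKID_MASK ((1u << TXTIME_CLOCKID_BITS) - 1)

/* Pack a clockid (low 4 bits) and flags (upper 12 bits) into one u16. */
uint16_t txtime_pack(unsigned int clockid, unsigned int flags)
{
	return (uint16_t)((flags << TXTIME_CLOCKID_BITS) |
			  (clockid & TXTIME_CLOCKID_MASK));
}

/* Extract the clockid from a packed value. */
unsigned int txtime_clockid(uint16_t packed)
{
	return packed & TXTIME_CLOCKID_MASK;
}

/* Extract the flags from a packed value. */
unsigned int txtime_flags(uint16_t packed)
{
	return packed >> TXTIME_CLOCKID_BITS;
}
```

For example, CLOCK_TAI (11) together with flags 0x3 packs into 0x3b and
unpacks back into the same pair.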


Re: [PATCH v1 net-next 12/14] igb: Only call skb_tx_timestamp after descriptors are ready

2018-06-28 Thread Jesus Sanchez-Palencia



On 06/27/2018 04:56 PM, Eric Dumazet wrote:
> 
> 
> On 06/27/2018 02:59 PM, Jesus Sanchez-Palencia wrote:
>> Currently, skb_tx_timestamp() is being called before the DMA
>> descriptors are prepared in igb_xmit_frame_ring(), which happens
>> during either the igb_tso() or igb_tx_csum() calls.
>>
>> Given that now the skb->tstamp might be used to carry the timestamp
>> for SO_TXTIME, we must only call skb_tx_timestamp() after the
>> information has been copied into the DMA tx_ring.
> 
> 
> Since when this skb->tstamp use happened ?
> 
> If this is in patch 11/14 (igb: Add support for ETF offload), then you should 
> either :
> 
> 1) Squash this into 11/14
> 
> 2) swap 11 and 12 patch, so that this change is done before "igb: Add support 
> for ETF offload"  
> 
> Otherwise a bisection could fail badly.


OK. Fixed for v2 by swapping patches 11 and 12.

Thanks,
Jesus


Re: [PATCH v1 net-next 13/14] net/sched: Enforce usage of CLOCK_TAI for sch_etf

2018-06-28 Thread Jesus Sanchez-Palencia



On 06/28/2018 07:26 AM, Willem de Bruijn wrote:
> On Wed, Jun 27, 2018 at 8:45 PM Jesus Sanchez-Palencia
>  wrote:
>>
>> The qdisc and the SO_TXTIME ABIs allow for a clockid to be configured,
>> but it's been decided that usage of CLOCK_TAI should be enforced until
>> we decide to allow for other clockids to be used. The rationale here is
>> that PTP times are usually in the TAI scale, thus no other clocks should
>> be necessary.
>>
>> For now, the qdisc will return EINVAL if any clocks other than
>> CLOCK_TAI are used.
>>
>> Signed-off-by: Jesus Sanchez-Palencia 
>> ---
>>  net/sched/sch_etf.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/sched/sch_etf.c b/net/sched/sch_etf.c
>> index cd6cb5b69228..5514a8aa3bd5 100644
>> --- a/net/sched/sch_etf.c
>> +++ b/net/sched/sch_etf.c
>> @@ -56,8 +56,8 @@ static inline int validate_input_params(struct tc_etf_qopt 
>> *qopt,
>> return -ENOTSUPP;
>> }
>>
>> -   if (qopt->clockid >= MAX_CLOCKS) {
>> -   NL_SET_ERR_MSG(extack, "Invalid clockid");
>> +   if (qopt->clockid != CLOCK_TAI) {
>> +   NL_SET_ERR_MSG(extack, "Invalid clockid. CLOCK_TAI must be 
>> used");
> 
> Similar to the comment in patch 12, this should be squashed (into
> patch 6) to avoid incorrect behavior in a range of SHA1s.


Ok. Fixed for v2.

Thanks,
Jesus


Re: [PATCH v1 net-next 02/14] net: Add a new socket option for a future transmit time.

2018-06-27 Thread Jesus Sanchez-Palencia
Hi Eric,


On 06/27/2018 03:16 PM, Eric Dumazet wrote:
> 
> 
> On 06/27/2018 02:59 PM, Jesus Sanchez-Palencia wrote:
>> From: Richard Cochran 
>>
>> This patch introduces SO_TXTIME. User space enables this option in
>> order to pass a desired future transmit time in a CMSG when calling
>> sendmsg(2). The argument to this socket option is a 6-bytes long struct
>> defined as:
>>
>> struct sock_txtime {
>>  clockid_t   clockid;
>>  u16 flags;
>> };
> 
> Note that sizeof(struct sock_txtime) is 8, not 6, because of alignments.


Oh yeah, sure.


> 
> This means that your implementation of getsockopt(... SO_TXTIME )
> is probably leaking two bytes of kernel stack to user space.

I'm failing to see how... There is a memset() in sock.c:1147 clearing all the 8
bytes that we later use to (explicitly) assign each member of the struct. Aren't
the 2 extra bytes sanitized, then? What have I missed?
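
For reference, the size difference discussed above can be checked with a
small sketch. The struct below is a stand-in written for this example: it
uses int in place of clockid_t and uint16_t in place of u16, and assumes a
common ABI where int is 4 bytes with 4-byte alignment.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the v1 struct sock_txtime (hypothetical name). */
struct sock_txtime_v1 {
	int      clockid;	/* stands in for clockid_t */
	uint16_t flags;		/* stands in for u16 */
};

/* Number of trailing padding bytes the compiler adds after 'flags'. */
size_t txtime_padding(void)
{
	size_t used = offsetof(struct sock_txtime_v1, flags) + sizeof(uint16_t);

	return sizeof(struct sock_txtime_v1) - used;
}
```

On such an ABI the members occupy 6 bytes but sizeof() reports 8, so two
trailing bytes are padding. Clearing the whole object with memset() before
assigning the members is the usual way to keep those bytes from carrying
stale stack contents.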


Thanks,
Jesus


[PATCH v1 iproute2 1/2] uapi pkt_sched: Add etf info - DO NOT COMMIT

2018-06-27 Thread Jesus Sanchez-Palencia
This should come from the next uapi headers update.
Sending it now just as a convenience so anyone can build tc with etf
and taprio support.

Signed-off-by: Jesus Sanchez-Palencia 
---
 include/uapi/linux/pkt_sched.h | 66 ++
 1 file changed, 66 insertions(+)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 37b5096a..4d5a5bd3 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -934,4 +934,70 @@ enum {
 
 #define TCA_CBS_MAX (__TCA_CBS_MAX - 1)
 
+
+/* ETF */
+struct tc_etf_qopt {
+   __s32 delta;
+   __s32 clockid;
+   __u32 flags;
+#define TC_ETF_DEADLINE_MODE_ON  BIT(0)
+#define TC_ETF_OFFLOAD_ON  BIT(1)
+};
+
+enum {
+   TCA_ETF_UNSPEC,
+   TCA_ETF_PARMS,
+   __TCA_ETF_MAX,
+};
+
+#define TCA_ETF_MAX (__TCA_ETF_MAX - 1)
+
+/* TAPRIO */
+enum {
+   TC_TAPRIO_CMD_SET_GATES = 0x00,
+   TC_TAPRIO_CMD_SET_AND_HOLD = 0x01,
+   TC_TAPRIO_CMD_SET_AND_RELEASE = 0x02,
+};
+
+enum {
+   TCA_TAPRIO_SCHED_ENTRY_UNSPEC,
+   TCA_TAPRIO_SCHED_ENTRY_INDEX, /* u32 */
+   TCA_TAPRIO_SCHED_ENTRY_CMD, /* u8 */
+   TCA_TAPRIO_SCHED_ENTRY_GATE_MASK, /* u32 */
+   TCA_TAPRIO_SCHED_ENTRY_INTERVAL, /* u32 */
+   __TCA_TAPRIO_SCHED_ENTRY_MAX,
+};
+#define TCA_TAPRIO_SCHED_ENTRY_MAX (__TCA_TAPRIO_SCHED_ENTRY_MAX - 1)
+
+/* The format for schedule entry list is:
+ * [TCA_TAPRIO_SCHED_ENTRY_LIST]
+ *   [TCA_TAPRIO_SCHED_ENTRY]
+ * [TCA_TAPRIO_SCHED_ENTRY_CMD]
+ * [TCA_TAPRIO_SCHED_ENTRY_GATES]
+ * [TCA_TAPRIO_SCHED_ENTRY_INTERVAL]
+ */
+enum {
+   TCA_TAPRIO_SCHED_UNSPEC,
+   TCA_TAPRIO_SCHED_ENTRY,
+   __TCA_TAPRIO_SCHED_MAX,
+};
+
+#define TCA_TAPRIO_SCHED_MAX (__TCA_TAPRIO_SCHED_MAX - 1)
+
+enum {
+   TCA_TAPRIO_ATTR_UNSPEC,
+   TCA_TAPRIO_ATTR_PRIOMAP, /* struct tc_mqprio_qopt */
+   TCA_TAPRIO_ATTR_PREEMPT_MASK, /* which traffic classes are preemptible, 
u32 */
+   TCA_TAPRIO_ATTR_SCHED_ENTRY_LIST, /* nested of entry */
+   TCA_TAPRIO_ATTR_SCHED_BASE_TIME, /* s64 */
+   TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME, /* s64 */
+   TCA_TAPRIO_ATTR_SCHED_EXTENSION_TIME, /* s64 */
+   TCA_TAPRIO_ATTR_SCHED_SINGLE_ENTRY, /*  */
+   TCA_TAPRIO_ATTR_SCHED_CLOCKID, /* s32 */
+   TCA_TAPRIO_PAD,
+   __TCA_TAPRIO_ATTR_MAX,
+};
+
+#define TCA_TAPRIO_ATTR_MAX (__TCA_TAPRIO_ATTR_MAX - 1)
+
 #endif
-- 
2.17.1



[PATCH v1 iproute2 2/2] tc: Add support for the ETF Qdisc

2018-06-27 Thread Jesus Sanchez-Palencia
From: Vinicius Costa Gomes 

The "Earliest TxTime First" (ETF) queueing discipline allows precise
control of the transmission time of packets by providing a sorted
time-based scheduling of packets.

The syntax is:

tc qdisc add dev DEV parent NODE etf delta NANOS
 clockid CLOCKID [offload] [deadline_mode]

Signed-off-by: Vinicius Costa Gomes 
Signed-off-by: Jesus Sanchez-Palencia 
---
 tc/Makefile |   1 +
 tc/q_etf.c  | 168 
 2 files changed, 169 insertions(+)
 create mode 100644 tc/q_etf.c

diff --git a/tc/Makefile b/tc/Makefile
index dfd00267..4525c0fb 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -71,6 +71,7 @@ TCMODULES += q_clsact.o
 TCMODULES += e_bpf.o
 TCMODULES += f_matchall.o
 TCMODULES += q_cbs.o
+TCMODULES += q_etf.o
 
 TCSO :=
 ifeq ($(TC_CONFIG_ATM),y)
diff --git a/tc/q_etf.c b/tc/q_etf.c
new file mode 100644
index ..5db1dd6f
--- /dev/null
+++ b/tc/q_etf.c
@@ -0,0 +1,168 @@
+/*
+ * q_etf.c Earliest TxTime First (ETF).
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors:Vinicius Costa Gomes 
+ *     Jesus Sanchez-Palencia 
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "utils.h"
+#include "tc_util.h"
+
+#define CLOCKID_INVALID (-1)
+static void explain(void)
+{
+   fprintf(stderr, "Usage: ... etf delta NANOS clockid CLOCKID [offload] 
[deadline_mode]\n");
+   fprintf(stderr, "CLOCKID must be a valid SYS-V id (i.e. CLOCK_TAI)\n");
+}
+
+static void explain1(const char *arg, const char *val)
+{
+   fprintf(stderr, "etf: illegal value for \"%s\": \"%s\"\n", arg, val);
+}
+
+static void explain_clockid(const char *val)
+{
+   fprintf(stderr, "etf: illegal value for \"clockid\": \"%s\".\n", val);
+   fprintf(stderr, "It must be a valid SYS-V id (i.e. CLOCK_TAI)");
+}
+
+static int get_clockid(__s32 *val, const char *arg)
+{
+   const struct static_clockid {
+   const char *name;
+   clockid_t clockid;
+   } clockids_sysv[] = {
+   { "CLOCK_REALTIME", CLOCK_REALTIME },
+   { "CLOCK_TAI", CLOCK_TAI },
+   { "CLOCK_BOOTTIME", CLOCK_BOOTTIME },
+   { "CLOCK_MONOTONIC", CLOCK_MONOTONIC },
+   { NULL }
+   };
+
+   const struct static_clockid *c;
+
+   for (c = clockids_sysv; c->name; c++) {
+   if (strncasecmp(c->name, arg, 25) == 0) {
+   *val = c->clockid;
+
+   return 0;
+   }
+   }
+
+   return -1;
+}
+
+
+static int etf_parse_opt(struct qdisc_util *qu, int argc,
+char **argv, struct nlmsghdr *n, const char *dev)
+{
+   struct tc_etf_qopt opt = {
+   .clockid = CLOCKID_INVALID,
+   };
+   struct rtattr *tail;
+
+   while (argc > 0) {
+   if (matches(*argv, "offload") == 0) {
+   if (opt.flags & TC_ETF_OFFLOAD_ON) {
+   fprintf(stderr, "etf: duplicate \"offload\" 
specification\n");
+   return -1;
+   }
+
+   opt.flags |= TC_ETF_OFFLOAD_ON;
+   } else if (matches(*argv, "deadline_mode") == 0) {
+   if (opt.flags & TC_ETF_DEADLINE_MODE_ON) {
+   fprintf(stderr, "etf: duplicate 
\"deadline_mode\" specification\n");
+   return -1;
+   }
+
+   opt.flags |= TC_ETF_DEADLINE_MODE_ON;
+   } else if (matches(*argv, "delta") == 0) {
+   NEXT_ARG();
+   if (opt.delta) {
+   fprintf(stderr, "etf: duplicate \"delta\" specification\n");
+   return -1;
+   }
+   if (get_s32(&opt.delta, *argv, 0)) {
+   explain1("delta", *argv);
+   return -1;
+   }
+   } else if (matches(*argv, "clockid") == 0) {
+   NEXT_ARG();
+   if (opt.clockid != CLOCKID_INVALID) {
+   fprintf(stderr, "etf: duplicate \"clockid\" specification\n");
+   return -1;
+ 

[PATCH v1 net-next 01/14] net: Clear skb->tstamp only on the forwarding path

2018-06-27 Thread Jesus Sanchez-Palencia
This is done in preparation for the upcoming time based transmission
patchset. Now that skb->tstamp will be used to hold packet's txtime,
we must ensure that it is being cleared when traversing namespaces.
Also, doing that from skb_scrub_packet() before the early return would
break our feature when tunnels are used.
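
The effect of moving the assignment can be sketched with a toy model. The
names below (toy_skb, toy_scrub_packet) are hypothetical stand-ins and not
kernel code; the sketch only assumes that skb_scrub_packet() returns early
when xnet is false, as the commit message implies.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct sk_buff; only the fields needed here. */
struct toy_skb {
	int64_t tstamp;		/* txtime set via SO_TXTIME, or 0 */
	uint32_t mark;
	int pkt_type;
};

/* Toy version of the scrub logic after this patch. */
static void toy_scrub_packet(struct toy_skb *skb, bool xnet)
{
	skb->pkt_type = 0;	/* always reset */

	if (!xnet)		/* assumed early return: same namespace */
		return;

	skb->mark = 0;
	skb->tstamp = 0;	/* cleared only when crossing namespaces */
}
```

With this ordering, a packet that goes through a tunnel inside the same
namespace (xnet == false) keeps its txtime, while a packet that crosses
namespaces has it cleared.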

Signed-off-by: Jesus Sanchez-Palencia 
---
 net/core/skbuff.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b1f274f22d85..236802b35203 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4898,7 +4898,6 @@ EXPORT_SYMBOL(skb_try_coalesce);
  */
 void skb_scrub_packet(struct sk_buff *skb, bool xnet)
 {
-   skb->tstamp = 0;
skb->pkt_type = PACKET_HOST;
skb->skb_iif = 0;
skb->ignore_df = 0;
@@ -4913,6 +4912,7 @@ void skb_scrub_packet(struct sk_buff *skb, bool xnet)
ipvs_reset(skb);
skb_orphan(skb);
skb->mark = 0;
+   skb->tstamp = 0;
 }
 EXPORT_SYMBOL_GPL(skb_scrub_packet);
 
-- 
2.17.1



[PATCH v1 net-next 02/14] net: Add a new socket option for a future transmit time.

2018-06-27 Thread Jesus Sanchez-Palencia
From: Richard Cochran 

This patch introduces SO_TXTIME. User space enables this option in
order to pass a desired future transmit time in a CMSG when calling
sendmsg(2). The argument to this socket option is a 6-byte struct
defined as:

struct sock_txtime {
clockid_t   clockid;
u16 flags;
};

Note that two new fields were added to struct sock by filling a 4-byte
hole found in the struct. For that reason, neither the struct size nor
the number of cachelines was altered.
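
As an illustration of the user-space side, the following is a minimal
sketch of building the control message for sendmsg(2). It assumes the CMSG
payload is a single 64-bit nanosecond timestamp (suggested by the u64
transmit_time field used elsewhere in this series); attach_txtime() is a
hypothetical helper, and the fallback value 61 is the one this patch uses
on most architectures (parisc and sparc differ).

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Fallbacks for headers that predate SO_TXTIME (assumed generic value). */
#ifndef SO_TXTIME
#define SO_TXTIME 61
#endif
#ifndef SCM_TXTIME
#define SCM_TXTIME SO_TXTIME
#endif

/* Attach one SCM_TXTIME control message carrying txtime_ns to msg.
 * buf must be suitably aligned and hold CMSG_SPACE(sizeof(uint64_t)) bytes.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int attach_txtime(struct msghdr *msg, void *buf, size_t buflen,
			 uint64_t txtime_ns)
{
	struct cmsghdr *cmsg;

	if (buflen < CMSG_SPACE(sizeof(txtime_ns)))
		return -1;

	memset(buf, 0, buflen);
	msg->msg_control = buf;
	msg->msg_controllen = CMSG_SPACE(sizeof(txtime_ns));

	cmsg = CMSG_FIRSTHDR(msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_TXTIME;
	cmsg->cmsg_len = CMSG_LEN(sizeof(txtime_ns));
	memcpy(CMSG_DATA(cmsg), &txtime_ns, sizeof(txtime_ns));

	return 0;
}
```

The socket would still need SO_TXTIME enabled through setsockopt(2) with a
struct sock_txtime argument before such a message has any effect; that step
is left out here because the struct layout is still under review.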

Signed-off-by: Richard Cochran 
Signed-off-by: Jesus Sanchez-Palencia 
---
 arch/alpha/include/uapi/asm/socket.h  |  3 +++
 arch/ia64/include/uapi/asm/socket.h   |  3 +++
 arch/mips/include/uapi/asm/socket.h   |  3 +++
 arch/parisc/include/uapi/asm/socket.h |  3 +++
 arch/s390/include/uapi/asm/socket.h   |  3 +++
 arch/sparc/include/uapi/asm/socket.h  |  3 +++
 arch/xtensa/include/uapi/asm/socket.h |  3 +++
 include/linux/socket.h                |  5 +++++
 include/net/sock.h                    |  8 ++++++++
 include/uapi/asm-generic/socket.h     |  3 +++
 net/core/sock.c                       | 32 ++++++++++++++++++++++++++++++++
 11 files changed, 69 insertions(+)

diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h
index be14f16149d5..065fb372e355 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -112,4 +112,7 @@
 
 #define SO_ZEROCOPY    60
 
+#define SO_TXTIME  61
+#define SCM_TXTIME SO_TXTIME
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/ia64/include/uapi/asm/socket.h b/arch/ia64/include/uapi/asm/socket.h
index 3efba40adc54..c872c4e6bafb 100644
--- a/arch/ia64/include/uapi/asm/socket.h
+++ b/arch/ia64/include/uapi/asm/socket.h
@@ -114,4 +114,7 @@
 
 #define SO_ZEROCOPY    60
 
+#define SO_TXTIME  61
+#define SCM_TXTIME SO_TXTIME
+
 #endif /* _ASM_IA64_SOCKET_H */
diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h
index 49c3d4795963..71370fb3ceef 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -123,4 +123,7 @@
 
 #define SO_ZEROCOPY    60
 
+#define SO_TXTIME  61
+#define SCM_TXTIME SO_TXTIME
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h
index 1d0fdc3b5d22..061b9cf2a779 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -104,4 +104,7 @@
 
 #define SO_ZEROCOPY    0x4035
 
+#define SO_TXTIME  0x4036
+#define SCM_TXTIME SO_TXTIME
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/s390/include/uapi/asm/socket.h b/arch/s390/include/uapi/asm/socket.h
index 3510c0fd06f4..39d901476ee5 100644
--- a/arch/s390/include/uapi/asm/socket.h
+++ b/arch/s390/include/uapi/asm/socket.h
@@ -111,4 +111,7 @@
 
 #define SO_ZEROCOPY    60
 
+#define SO_TXTIME  61
+#define SCM_TXTIME SO_TXTIME
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h
index d58520c2e6ff..7ea35e5601b6 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -101,6 +101,9 @@
 
 #define SO_ZEROCOPY    0x003e
 
+#define SO_TXTIME  0x003f
+#define SCM_TXTIME SO_TXTIME
+
 /* Security levels - as per NRL IPv6 - don't actually do anything */
 #define SO_SECURITY_AUTHENTICATION 0x5001
 #define SO_SECURITY_ENCRYPTION_TRANSPORT   0x5002
diff --git a/arch/xtensa/include/uapi/asm/socket.h b/arch/xtensa/include/uapi/asm/socket.h
index 75a07b8119a9..1de07a7f7680 100644
--- a/arch/xtensa/include/uapi/asm/socket.h
+++ b/arch/xtensa/include/uapi/asm/socket.h
@@ -116,4 +116,7 @@
 
 #define SO_ZEROCOPY    60
 
+#define SO_TXTIME  61
+#define SCM_TXTIME SO_TXTIME
+
 #endif /* _XTENSA_SOCKET_H */
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 7ed4713d5337..ca476b7a8ff0 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -83,6 +83,11 @@ struct cmsghdr {
 int cmsg_type;  /* protocol-specific type */
 };
 
+struct sock_txtime {
+   clockid_t   clockid;/* reference clockid */
+   u16 flags;  /* bit 0: txtime in deadline_mode */
+};
+
 /*
  * Ancillary data object information MACROS
  * Table 5-14 of POSIX 1003.1g
diff --git a/include/net/sock.h b/include/net/sock.h
index b3b75419eafe..73f4404e49e4 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -315,6 +315,7 @@ struct sock_common {
   *@sk_destruct: called at sock freeing time, i.e. when all refcnt == 0
   *@sk_reuseport_cb: reuseport group container
   *@sk_rcu: used during RCU grace period
+  *@sk_txtime: used by time-based scheduling
   */
 struct sock

[PATCH v1 net-next 04/14] net: packet: Hook into time based transmission.

2018-06-27 Thread Jesus Sanchez-Palencia
From: Richard Cochran 

For raw layer-2 packets, copy the desired future transmit time from
the CMSG cookie into the skb.

Signed-off-by: Richard Cochran 
Signed-off-by: Jesus Sanchez-Palencia 
---
 net/packet/af_packet.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index ff8e7e245c37..255c0164e0aa 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1951,6 +1951,7 @@ static int packet_sendmsg_spkt(struct socket *sock, struct msghdr *msg,
goto out_unlock;
}
 
+   sockc.transmit_time = 0;
sockc.tsflags = sk->sk_tsflags;
if (msg->msg_controllen) {
err = sock_cmsg_send(sk, msg, &sockc);
@@ -1962,6 +1963,7 @@ static int packet_sendmsg_spkt(struct socket *sock, struct msghdr *msg,
skb->dev = dev;
skb->priority = sk->sk_priority;
skb->mark = sk->sk_mark;
+   skb->tstamp = sockc.transmit_time;
 
sock_tx_timestamp(sk, sockc.tsflags, &skb_shinfo(skb)->tx_flags);
 
@@ -2457,6 +2459,7 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
skb->dev = dev;
skb->priority = po->sk.sk_priority;
skb->mark = po->sk.sk_mark;
+   skb->tstamp = sockc->transmit_time;
sock_tx_timestamp(&po->sk, sockc->tsflags, &skb_shinfo(skb)->tx_flags);
skb_shinfo(skb)->destructor_arg = ph.raw;
 
@@ -2633,6 +2636,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
if (unlikely(!(dev->flags & IFF_UP)))
goto out_put;
 
+   sockc.transmit_time = 0;
sockc.tsflags = po->sk.sk_tsflags;
if (msg->msg_controllen) {
err = sock_cmsg_send(&po->sk, msg, &sockc);
@@ -2829,6 +2833,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
if (unlikely(!(dev->flags & IFF_UP)))
goto out_unlock;
 
+   sockc.transmit_time = 0;
sockc.tsflags = sk->sk_tsflags;
sockc.mark = sk->sk_mark;
if (msg->msg_controllen) {
@@ -2903,6 +2908,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
skb->dev = dev;
skb->priority = sk->sk_priority;
skb->mark = sockc.mark;
+   skb->tstamp = sockc.transmit_time;
 
if (has_vnet_hdr) {
err = virtio_net_hdr_to_skb(skb, &vnet_hdr, vio_le());
-- 
2.17.1



[PATCH v1 net-next 12/14] igb: Only call skb_tx_timestamp after descriptors are ready

2018-06-27 Thread Jesus Sanchez-Palencia
Currently, skb_tx_timestamp() is being called before the DMA
descriptors are prepared in igb_xmit_frame_ring(), which happens
during either the igb_tso() or igb_tx_csum() calls.

Given that now the skb->tstamp might be used to carry the timestamp
for SO_TXTIME, we must only call skb_tx_timestamp() after the
information has been copied into the DMA tx_ring.

Signed-off-by: Jesus Sanchez-Palencia 
---
 drivers/net/ethernet/intel/igb/igb_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 9b9a6a6227e0..0d72f2417143 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -6138,8 +6138,6 @@ netdev_tx_t igb_xmit_frame_ring(struct sk_buff *skb,
}
}
 
-   skb_tx_timestamp(skb);
-
if (skb_vlan_tag_present(skb)) {
tx_flags |= IGB_TX_FLAGS_VLAN;
tx_flags |= (skb_vlan_tag_get(skb) << IGB_TX_FLAGS_VLAN_SHIFT);
@@ -6155,6 +6153,8 @@ netdev_tx_t igb_xmit_frame_ring(struct sk_buff *skb,
else if (!tso)
igb_tx_csum(tx_ring, first);
 
+   skb_tx_timestamp(skb);
+
if (igb_tx_map(tx_ring, first, hdr_len))
goto cleanup_tx_tstamp;
 
-- 
2.17.1



[PATCH v1 net-next 11/14] igb: Add support for ETF offload

2018-06-27 Thread Jesus Sanchez-Palencia
Implement HW offload support for SO_TXTIME through igb's Launchtime
feature. This is done by extending igb_setup_tc() so it supports
TC_SETUP_QDISC_ETF and configuring i210 so time based transmit
arbitration is enabled.

The FQTSS transmission mode added before is extended so strict
priority (SP) queues wait for stream reservation (SR) ones.
igb_config_tx_modes() is extended so it can support enabling/disabling
Launchtime following the previous approach used for the credit-based
shaper (CBS).

As in the previous flow, FQTSS transmission mode is enabled automatically
by the driver once Launchtime (or CBS, as before) is enabled.
Similarly, it's automatically disabled when the feature is disabled
for the last queue that had it set up.

The driver just consumes the transmit times from the skbuffs directly,
so no special handling is done in case an 'invalid' time is provided.
We assume this has been handled by the ETF qdisc already.
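
The Fetch Time Delta field that this patch adds to the TQAVCTRL defines
(bits 31:16, 32 ns granularity) can be illustrated with a small sketch.
The macro and function names here are hypothetical and not part of the
driver; the patch itself simply programs the maximum value.

```c
#include <stdint.h>

#define FETCHTIME_DELTA_SHIFT	16	/* field occupies bits 31:16 */
#define FETCHTIME_DELTA_MAX	0xFFFFu	/* 16-bit field */
#define FETCHTIME_UNIT_NS	32u	/* 32 ns granularity */

/* Largest fetch-time delta the field can express, in nanoseconds. */
static uint32_t fetchtime_delta_max_ns(void)
{
	return FETCHTIME_DELTA_MAX * FETCHTIME_UNIT_NS;
}

/* Encode a delta given in nanoseconds into the field position, rounding
 * down to the 32 ns granularity and saturating at the field maximum.
 */
static uint32_t fetchtime_delta_encode(uint32_t delta_ns)
{
	uint32_t units = delta_ns / FETCHTIME_UNIT_NS;

	if (units > FETCHTIME_DELTA_MAX)
		units = FETCHTIME_DELTA_MAX;

	return units << FETCHTIME_DELTA_SHIFT;
}
```

The maximum works out to 65535 * 32 = 2097120 ns, i.e. roughly 2.1 ms,
matching the arithmetic in the register comment.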

Signed-off-by: Jesus Sanchez-Palencia 
---
 .../net/ethernet/intel/igb/e1000_defines.h|  16 ++
 drivers/net/ethernet/intel/igb/igb.h  |   1 +
 drivers/net/ethernet/intel/igb/igb_main.c | 139 +++---
 3 files changed, 139 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/e1000_defines.h b/drivers/net/ethernet/intel/igb/e1000_defines.h
index 252440a418dc..8a28f3388f69 100644
--- a/drivers/net/ethernet/intel/igb/e1000_defines.h
+++ b/drivers/net/ethernet/intel/igb/e1000_defines.h
@@ -1048,6 +1048,22 @@
 #define E1000_TQAVCTRL_XMIT_MODE   BIT(0)
 #define E1000_TQAVCTRL_DATAFETCHARB BIT(4)
 #define E1000_TQAVCTRL_DATATRANARB BIT(8)
+#define E1000_TQAVCTRL_DATATRANTIM BIT(9)
+#define E1000_TQAVCTRL_SP_WAIT_SR  BIT(10)
+/* Fetch Time Delta - bits 31:16
+ *
+ * This field holds the value to be reduced from the launch time for
+ * fetch time decision. The FetchTimeDelta value is defined in 32 ns
+ * granularity.
+ *
+ * This field is 16 bits wide, and so the maximum value is:
+ *
+ * 65535 * 32 = 2097120 ~= 2.1 msec
+ *
+ * XXX: We are configuring the max value here since we couldn't come up
+ * with a reason for not doing so.
+ */
+#define E1000_TQAVCTRL_FETCHTIME_DELTA (0xFFFF << 16)
 
 /* TX Qav Credit Control fields */
 #define E1000_TQAVCC_IDLESLOPE_MASK    0xFFFF
diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h
index 9643b5b3d444..ca54e268d157 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -262,6 +262,7 @@ struct igb_ring {
u16 count;  /* number of desc. in the ring */
u8 queue_index; /* logical index of the ring*/
u8 reg_idx; /* physical index of the ring */
+   bool launchtime_enable; /* true if LaunchTime is enabled */
bool cbs_enable;/* indicates if CBS is enabled */
s32 idleslope;  /* idleSlope in kbps */
s32 sendslope;  /* sendSlope in kbps */
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index c30ab7b260cc..9b9a6a6227e0 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -1666,13 +1666,26 @@ static bool is_any_cbs_enabled(struct igb_adapter *adapter)
return false;
 }
 
+static bool is_any_txtime_enabled(struct igb_adapter *adapter)
+{
+   int i;
+
+   for (i = 0; i < adapter->num_tx_queues; i++) {
+   if (adapter->tx_ring[i]->launchtime_enable)
+   return true;
+   }
+
+   return false;
+}
+
 /**
  *  igb_config_tx_modes - Configure "Qav Tx mode" features on igb
  *  @adapter: pointer to adapter struct
  *  @queue: queue number
  *
- *  Configure CBS for a given hardware queue. Parameters are retrieved
- *  from the correct Tx ring, so igb_save_cbs_params() should be used
+ *  Configure CBS and Launchtime for a given hardware queue.
+ *  Parameters are retrieved from the correct Tx ring, so
+ *  igb_save_cbs_params() and igb_save_txtime_params() should be used
  *  for setting those correctly prior to this function being called.
  **/
 static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
@@ -1686,6 +1699,19 @@ static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
WARN_ON(hw->mac.type != e1000_i210);
WARN_ON(queue < 0 || queue > 1);
 
+   /* If any of the Qav features is enabled, configure queues as SR and
+* with HIGH PRIO. If none is, then configure them with LOW PRIO and
+* as SP.
+*/
+   if (ring->cbs_enable || ring->launchtime_enable) {
+   set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_HIGH);
+   set_queue_mode(hw, queue, QUEUE_MODE_STREAM_RESERVATION);
+   } else {
+   set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_LOW

[PATCH v1 net-next 00/14] Scheduled packet Transmission: ETF

2018-06-27 Thread Jesus Sanchez-Palencia
Overview
========


This work consists of a set of kernel interfaces that can be used by
applications that require (time-based) Scheduled Tx of packets.
It comprises 3 new kernel components:

  - SO_TXTIME: socket option + cmsg programming interfaces.

  - etf: the "earliest txtime first" qdisc, that provides per-queue
 TxTime-based scheduling. This has been renamed from 'tbs' to
 'etf' to better describe its functionality.

  - taprio: the "time-aware priority scheduler" qdisc, that provides
per-port Time-Aware scheduling;

This patchset provides the first 2 components, which have been under
development for longer. The taprio qdisc will be shared as an RFC
separately (shortly).

Note that this series is a follow-up to the "Time based packet
transmission" RFCv3 [1].



etf (formerly known as 'tbs')
=============================

Changes since the RFC v3:
  - removed patch adding CLOCKID_INVALID;
  - now we report packet drops through the socket's error queue;
  - the usage of CLOCK_TAI is enforced by the qdisc;
  - fixed bug on igb driver to avoid timestamps from being overwritten;
  - simplified queueing modes by making 'sorting' mandatory;
  - renamed qdisc from 'tbs' to 'etf'.

For applications/systems for which the concept of time slices isn't
precise enough, the etf qdisc allows applications to control the instant
when a packet should leave the network controller. When used in
conjunction with taprio, it can also be used when the application needs
to control, with greater guarantees, the offset into each time slice at
which a packet will be sent. Another use case for etf is when only a
small number of applications on a system are time sensitive, so it can
then be used with a more traditional root qdisc (like mqprio).

The etf qdisc is designed so it buffers packets until a configurable
time before their deadline (Tx time). The qdisc uses an rbtree internally
so the buffered packets are always 'ordered' by their txtime (deadline)
and will be dequeued following the earliest txtime first.

The qdisc will drop any packets with a Tx time in the past, or if a
packet expires while waiting to be dequeued. Drops can be reported
as errors back to userspace through the socket's error queue.
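
The buffering and drop rules above can be summarised in a simplified
decision function. This is only a sketch of the behaviour as described in
this cover letter, not the sch_etf implementation; etf_decide() and the
enum are hypothetical names, and deadline mode is not modelled.

```c
#include <stdint.h>

enum etf_action {
	ETF_DROP,	/* txtime already in the past */
	ETF_HOLD,	/* keep buffering, too early to dequeue */
	ETF_SEND,	/* within "delta" of txtime, hand to the driver */
};

/* Decide what to do with the earliest packet in the queue. */
static enum etf_action etf_decide(int64_t now_ns, int64_t txtime_ns,
				  int32_t delta_ns)
{
	if (txtime_ns < now_ns)
		return ETF_DROP;

	if (now_ns < txtime_ns - delta_ns)
		return ETF_HOLD;

	return ETF_SEND;
}
```

Because packets are kept ordered by txtime, only the earliest one needs to
be checked on each dequeue attempt.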

Example configuration:

$ tc qdisc add dev enp2s0 parent 100:1 etf offload delta 20 \
clockid CLOCK_TAI

Here, the Qdisc will use HW offload for the txtime control.
Packets will be dequeued by the qdisc "delta" (20) nanoseconds before
their transmission time. Because this will be using HW offload and
since dynamic clocks are not supported by hrtimers, the system clock
and the PHC clock must be synchronized for this mode to behave as expected.

A more complete example, with instructions on how to test it, can be
found here:

https://gist.github.com/jeez/bd3afeff081ba64a695008dd8215866f [2]


Note that we haven't modified the qdisc to use a timerqueue because the
required change would increase the number of cachelines in a sk_buff.



SO_TXTIME
=========

Changes since the RFC v3:
  - skb->tstamp is now cleared in skb_scrub_packet();
  - transmit time is now set for other send paths, and not only the
"fast" ones as before;
  - removed the per-packet parameters (clockid and drop_if_late).
Now just the skb->tstamp is used;
  - flags and clockid_t are now set per-socket as a parameter of
SO_TXTIME.



This series is also hosted on github and can be found at [3].
The companion iproute2 patches can be found at [4].


[1] https://patchwork.ozlabs.org/cover/882342/

[2] github doesn't make it clear, but the gist can be cloned like this:
$ git clone https://gist.github.com/jeez/bd3afeff081ba64a695008dd8215866f scheduled-tx-tests

[3] https://github.com/jeez/linux/tree/etf-v1

[4] https://github.com/jeez/iproute2/tree/etf-v1



Jesus Sanchez-Palencia (10):
  net: Clear skb->tstamp only on the forwarding path
  net: ipv4: Hook into time based transmission
  net/sched: Add HW offloading capability to ETF
  igb: Refactor igb_configure_cbs()
  igb: Only change Tx arbitration when CBS is on
  igb: Refactor igb_offload_cbs()
  igb: Add support for ETF offload
  igb: Only call skb_tx_timestamp after descriptors are ready
  net/sched: Enforce usage of CLOCK_TAI for sch_etf
  net/sched: Make etf report drops on error_queue

Richard Cochran (2):
  net: Add a new socket option for a future transmit time.
  net: packet: Hook into time based transmission.

Vinicius Costa Gomes (2):
  net/sched: Allow creating a Qdisc watchdog with other clocks
  net/sched: Introduce the ETF Qdisc

 arch/alpha/include/uapi/asm/socket.h  |   3 +
 arch/ia64/include/uapi/asm/socket.h   |   3 +
 arch/mips/include/uapi/asm/socket.h   |   3 +
 arch/parisc/include/uapi/asm/socket.h |   3 +
 arch/s390/include/uapi/asm/socket.h   |   3 +
 arch/sparc/include/uapi/asm/socket.h  |   3 +
 arch/xtensa/includ

[PATCH v1 net-next 05/14] net/sched: Allow creating a Qdisc watchdog with other clocks

2018-06-27 Thread Jesus Sanchez-Palencia
From: Vinicius Costa Gomes 

This adds 'qdisc_watchdog_init_clockid()', which allows a clockid to be
passed, so that other time references can be used when scheduling
the Qdisc to run.

Signed-off-by: Vinicius Costa Gomes 
---
 include/net/pkt_sched.h |  2 ++
 net/sched/sch_api.c | 11 +--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 815b92a23936..2466ea143d01 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -72,6 +72,8 @@ struct qdisc_watchdog {
struct Qdisc    *qdisc;
 };
 
+void qdisc_watchdog_init_clockid(struct qdisc_watchdog *wd, struct Qdisc *qdisc,
+clockid_t clockid);
 void qdisc_watchdog_init(struct qdisc_watchdog *wd, struct Qdisc *qdisc);
 void qdisc_watchdog_schedule_ns(struct qdisc_watchdog *wd, u64 expires);
 
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 54eca685420f..98541c6399db 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -596,12 +596,19 @@ static enum hrtimer_restart qdisc_watchdog(struct hrtimer *timer)
return HRTIMER_NORESTART;
 }
 
-void qdisc_watchdog_init(struct qdisc_watchdog *wd, struct Qdisc *qdisc)
+void qdisc_watchdog_init_clockid(struct qdisc_watchdog *wd, struct Qdisc *qdisc,
+clockid_t clockid)
 {
-   hrtimer_init(&wd->timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
+   hrtimer_init(&wd->timer, clockid, HRTIMER_MODE_ABS_PINNED);
wd->timer.function = qdisc_watchdog;
wd->qdisc = qdisc;
 }
+EXPORT_SYMBOL(qdisc_watchdog_init_clockid);
+
+void qdisc_watchdog_init(struct qdisc_watchdog *wd, struct Qdisc *qdisc)
+{
+   qdisc_watchdog_init_clockid(wd, qdisc, CLOCK_MONOTONIC);
+}
 EXPORT_SYMBOL(qdisc_watchdog_init);
 
 void qdisc_watchdog_schedule_ns(struct qdisc_watchdog *wd, u64 expires)
-- 
2.17.1



[PATCH v1 net-next 10/14] igb: Refactor igb_offload_cbs()

2018-06-27 Thread Jesus Sanchez-Palencia
Split code into a separate function (igb_offload_apply()) that will be
used by ETF offload implementation.

Signed-off-by: Jesus Sanchez-Palencia 
---
 drivers/net/ethernet/intel/igb/igb_main.c | 23 ++-
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 8c90f1e51add..c30ab7b260cc 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2474,6 +2474,19 @@ igb_features_check(struct sk_buff *skb, struct net_device *dev,
return features;
 }
 
+static void igb_offload_apply(struct igb_adapter *adapter, s32 queue)
+{
+   if (!is_fqtss_enabled(adapter)) {
+   enable_fqtss(adapter, true);
+   return;
+   }
+
+   igb_config_tx_modes(adapter, queue);
+
+   if (!is_any_cbs_enabled(adapter))
+   enable_fqtss(adapter, false);
+}
+
 static int igb_offload_cbs(struct igb_adapter *adapter,
   struct tc_cbs_qopt_offload *qopt)
 {
@@ -2494,15 +2507,7 @@ static int igb_offload_cbs(struct igb_adapter *adapter,
if (err)
return err;
 
-   if (is_fqtss_enabled(adapter)) {
-   igb_config_tx_modes(adapter, qopt->queue);
-
-   if (!is_any_cbs_enabled(adapter))
-   enable_fqtss(adapter, false);
-
-   } else {
-   enable_fqtss(adapter, true);
-   }
+   igb_offload_apply(adapter, qopt->queue);
 
return 0;
 }
-- 
2.17.1



[PATCH v1 net-next 13/14] net/sched: Enforce usage of CLOCK_TAI for sch_etf

2018-06-27 Thread Jesus Sanchez-Palencia
The qdisc and the SO_TXTIME ABIs allow for a clockid to be configured,
but it's been decided that usage of CLOCK_TAI should be enforced until
we decide to allow for other clockids to be used. The rationale here is
that PTP times are usually in the TAI scale, thus no other clocks should
be necessary.

For now, the qdisc will return EINVAL if any clocks other than
CLOCK_TAI are used.
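
From an application's point of view, this means transmit times should be
computed against CLOCK_TAI. The sketch below mirrors the check described
above and derives a txtime relative to the current TAI time; both helper
names are hypothetical, and the fallback definition of CLOCK_TAI assumes
the Linux value of 11.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stdint.h>
#include <time.h>

#ifndef CLOCK_TAI
#define CLOCK_TAI 11	/* Linux value, for older libc headers */
#endif

/* Mirrors the qdisc's rule: only CLOCK_TAI is accepted for now. */
static int check_etf_clockid(clockid_t clockid)
{
	return clockid == CLOCK_TAI ? 0 : -EINVAL;
}

/* Compute a txtime that is offset_ns nanoseconds from now on CLOCK_TAI.
 * Returns 0 on success or a negative errno value.
 */
static int txtime_from_now(uint64_t offset_ns, uint64_t *txtime_ns)
{
	struct timespec ts;

	if (clock_gettime(CLOCK_TAI, &ts) != 0)
		return -errno;

	*txtime_ns = (uint64_t)ts.tv_sec * 1000000000ULL +
		     (uint64_t)ts.tv_nsec + offset_ns;
	return 0;
}
```

The resulting value is what an application would place in the SCM_TXTIME
control message for each packet.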

Signed-off-by: Jesus Sanchez-Palencia 
---
 net/sched/sch_etf.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sched/sch_etf.c b/net/sched/sch_etf.c
index cd6cb5b69228..5514a8aa3bd5 100644
--- a/net/sched/sch_etf.c
+++ b/net/sched/sch_etf.c
@@ -56,8 +56,8 @@ static inline int validate_input_params(struct tc_etf_qopt *qopt,
return -ENOTSUPP;
}
 
-   if (qopt->clockid >= MAX_CLOCKS) {
-   NL_SET_ERR_MSG(extack, "Invalid clockid");
+   if (qopt->clockid != CLOCK_TAI) {
+   NL_SET_ERR_MSG(extack, "Invalid clockid. CLOCK_TAI must be used");
return -EINVAL;
}
 
-- 
2.17.1



[PATCH v1 net-next 09/14] igb: Only change Tx arbitration when CBS is on

2018-06-27 Thread Jesus Sanchez-Palencia
Currently the data transmission arbitration algorithm - DataTranARB
field on TQAVCTRL reg - is always set to CBS when the Tx mode is
changed from legacy to 'Qav' mode.

Make that configuration a bit more granular in preparation for the
upcoming Launchtime enabling patches, since CBS and Launchtime can be
enabled separately. That is achieved by moving the DataTranARB setup
to igb_config_tx_modes() instead.

Similarly, when disabling CBS we must check if it has been disabled
for all queues, and clear the DataTranARB accordingly.
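
The set/clear behaviour described above can be sketched as a pure function
over the register value. This is an illustration only: the helper name is
hypothetical, and the real driver performs a read-modify-write on the
hardware register through rd32()/wr32().

```c
#include <stdbool.h>
#include <stdint.h>

/* DataTranARB is BIT(8) of TQAVCTRL, as in e1000_defines.h. */
#define TQAVCTRL_DATATRANARB	(1u << 8)

/* Return the new TQAVCTRL value: credit-based arbitration is selected
 * while any queue has CBS enabled, and the default is restored otherwise.
 * All other bits are left untouched.
 */
static uint32_t tqavctrl_update_arb(uint32_t tqavctrl, bool any_cbs_enabled)
{
	if (any_cbs_enabled)
		return tqavctrl | TQAVCTRL_DATATRANARB;

	return tqavctrl & ~TQAVCTRL_DATATRANARB;
}
```

Keeping the bit tied to "any queue has CBS enabled" is what lets CBS and
Launchtime be switched on and off independently.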

Signed-off-by: Jesus Sanchez-Palencia 
---
 drivers/net/ethernet/intel/igb/igb_main.c | 49 +++
 1 file changed, 33 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 15f6b9c57ccf..8c90f1e51add 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -1654,6 +1654,18 @@ static void set_queue_mode(struct e1000_hw *hw, int queue, enum queue_mode mode)
wr32(E1000_I210_TQAVCC(queue), val);
 }
 
+static bool is_any_cbs_enabled(struct igb_adapter *adapter)
+{
+   int i;
+
+   for (i = 0; i < adapter->num_tx_queues; i++) {
+   if (adapter->tx_ring[i]->cbs_enable)
+   return true;
+   }
+
+   return false;
+}
+
 /**
  *  igb_config_tx_modes - Configure "Qav Tx mode" features on igb
  *  @adapter: pointer to adapter struct
@@ -1668,7 +1680,7 @@ static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
struct igb_ring *ring = adapter->tx_ring[queue];
struct net_device *netdev = adapter->netdev;
struct e1000_hw *hw = &adapter->hw;
-   u32 tqavcc;
+   u32 tqavcc, tqavctrl;
u16 value;
 
WARN_ON(hw->mac.type != e1000_i210);
@@ -1693,6 +1705,14 @@ static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_HIGH);
set_queue_mode(hw, queue, QUEUE_MODE_STREAM_RESERVATION);
 
+   /* Always set data transfer arbitration to credit-based
+* shaper algorithm on TQAVCTRL if CBS is enabled for any of
+* the queues.
+*/
+   tqavctrl = rd32(E1000_I210_TQAVCTRL);
+   tqavctrl |= E1000_TQAVCTRL_DATATRANARB;
+   wr32(E1000_I210_TQAVCTRL, tqavctrl);
+
/* According to i210 datasheet section 7.2.7.7, we should set
 * the 'idleSlope' field from TQAVCC register following the
 * equation:
@@ -1770,6 +1790,16 @@ static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
 
/* Set hiCredit to zero. */
wr32(E1000_I210_TQAVHC(queue), 0);
+
+   /* If CBS is not enabled for any queues anymore, then return to
+* the default state of Data Transmission Arbitration on
+* TQAVCTRL.
+*/
+   if (!is_any_cbs_enabled(adapter)) {
+   tqavctrl = rd32(E1000_I210_TQAVCTRL);
+   tqavctrl &= ~E1000_TQAVCTRL_DATATRANARB;
+   wr32(E1000_I210_TQAVCTRL, tqavctrl);
+   }
}
 
/* XXX: In i210 controller the sendSlope and loCredit parameters from
@@ -1803,18 +1833,6 @@ static int igb_save_cbs_params(struct igb_adapter *adapter, int queue,
return 0;
 }
 
-static bool is_any_cbs_enabled(struct igb_adapter *adapter)
-{
-   int i;
-
-   for (i = 0; i < adapter->num_tx_queues; i++) {
-   if (adapter->tx_ring[i]->cbs_enable)
-   return true;
-   }
-
-   return false;
-}
-
 /**
  *  igb_setup_tx_mode - Switch to/from Qav Tx mode when applicable
  *  @adapter: pointer to adapter struct
@@ -1838,11 +1856,10 @@ static void igb_setup_tx_mode(struct igb_adapter *adapter)
int i, max_queue;
 
/* Configure TQAVCTRL register: set transmit mode to 'Qav',
-* set data fetch arbitration to 'round robin' and set data
-* transfer arbitration to 'credit shaper algorithm.
+* set data fetch arbitration to 'round robin'.
 */
val = rd32(E1000_I210_TQAVCTRL);
-   val |= E1000_TQAVCTRL_XMIT_MODE | E1000_TQAVCTRL_DATATRANARB;
+   val |= E1000_TQAVCTRL_XMIT_MODE;
val &= ~E1000_TQAVCTRL_DATAFETCHARB;
wr32(E1000_I210_TQAVCTRL, val);
 
-- 
2.17.1



[PATCH v1 net-next 08/14] igb: Refactor igb_configure_cbs()

2018-06-27 Thread Jesus Sanchez-Palencia
Make this function retrieve what it needs from the Tx ring being
addressed since it already relies on what had been saved on it before.
Also, since this function will be used by the upcoming Launchtime
patches, rename it to better reflect its intention. Note that
Launchtime is not part of what 802.1Qav specifies, but the i210
datasheet refers to this set of functionality as "Qav Transmission
Mode".

Here we also perform a tiny refactor of is_any_cbs_enabled(), and add
further documentation to igb_setup_tx_mode().

Signed-off-by: Jesus Sanchez-Palencia 
---
 drivers/net/ethernet/intel/igb/igb_main.c | 60 +++
 1 file changed, 28 insertions(+), 32 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index f1e3397bd405..15f6b9c57ccf 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -1655,23 +1655,17 @@ static void set_queue_mode(struct e1000_hw *hw, int queue, enum queue_mode mode)
 }
 
 /**
- *  igb_configure_cbs - Configure Credit-Based Shaper (CBS)
+ *  igb_config_tx_modes - Configure "Qav Tx mode" features on igb
  *  @adapter: pointer to adapter struct
  *  @queue: queue number
- *  @enable: true = enable CBS, false = disable CBS
- *  @idleslope: idleSlope in kbps
- *  @sendslope: sendSlope in kbps
- *  @hicredit: hiCredit in bytes
- *  @locredit: loCredit in bytes
  *
- *  Configure CBS for a given hardware queue. When disabling, idleslope,
- *  sendslope, hicredit, locredit arguments are ignored. Returns 0 if
- *  success. Negative otherwise.
+ *  Configure CBS for a given hardware queue. Parameters are retrieved
+ *  from the correct Tx ring, so igb_save_cbs_params() should be used
+ *  for setting those correctly prior to this function being called.
  **/
-static void igb_configure_cbs(struct igb_adapter *adapter, int queue,
- bool enable, int idleslope, int sendslope,
- int hicredit, int locredit)
+static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
 {
+   struct igb_ring *ring = adapter->tx_ring[queue];
struct net_device *netdev = adapter->netdev;
struct e1000_hw *hw = &adapter->hw;
u32 tqavcc;
@@ -1680,7 +1674,7 @@ static void igb_configure_cbs(struct igb_adapter *adapter, int queue,
WARN_ON(hw->mac.type != e1000_i210);
WARN_ON(queue < 0 || queue > 1);
 
-   if (enable || queue == 0) {
+   if (ring->cbs_enable || queue == 0) {
/* i210 does not allow the queue 0 to be in the Strict
 * Priority mode while the Qav mode is enabled, so,
 * instead of disabling strict priority mode, we give
@@ -1690,10 +1684,10 @@ static void igb_configure_cbs(struct igb_adapter *adapter, int queue,
 * Queue0 QueueMode must be set to 1b when
 * TransmitMode is set to Qav."
 */
-   if (queue == 0 && !enable) {
+   if (queue == 0 && !ring->cbs_enable) {
/* max "linkspeed" idleslope in kbps */
-   idleslope = 1000000;
-   hicredit = ETH_FRAME_LEN;
+   ring->idleslope = 1000000;
+   ring->hicredit = ETH_FRAME_LEN;
}
 
set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_HIGH);
@@ -1756,14 +1750,15 @@ static void igb_configure_cbs(struct igb_adapter *adapter, int queue,
 *   calculated value, so the resulting bandwidth might
 *   be slightly higher for some configurations.
 */
-   value = DIV_ROUND_UP_ULL(idleslope * 61034ULL, 1000000);
+   value = DIV_ROUND_UP_ULL(ring->idleslope * 61034ULL, 1000000);
 
tqavcc = rd32(E1000_I210_TQAVCC(queue));
tqavcc &= ~E1000_TQAVCC_IDLESLOPE_MASK;
tqavcc |= value;
wr32(E1000_I210_TQAVCC(queue), tqavcc);
 
-   wr32(E1000_I210_TQAVHC(queue), 0x80000000 + hicredit * 0x7735);
+   wr32(E1000_I210_TQAVHC(queue),
+0x80000000 + ring->hicredit * 0x7735);
} else {
set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_LOW);
set_queue_mode(hw, queue, QUEUE_MODE_STRICT_PRIORITY);
@@ -1783,8 +1778,9 @@ static void igb_configure_cbs(struct igb_adapter *adapter, int queue,
 */
 
netdev_dbg(netdev, "CBS %s: queue %d idleslope %d sendslope %d hiCredit %d locredit %d\n",
-  (enable) ? "enabled" : "disabled", queue,
-  idleslope, sendslope, hicredit, locredit);
+  (ring->cbs_enable) ? "enabled" : "disabled", queue,
+  ring->idleslope, ring->sendslope, ring->

[PATCH v1 net-next 03/14] net: ipv4: Hook into time based transmission

2018-06-27 Thread Jesus Sanchez-Palencia
Add a transmit_time field to struct inet_cork, then copy the
timestamp from the CMSG cookie at ip_setup_cork() so we can
safely copy it into the skb later during __ip_make_skb().

For the raw fast path, just perform the copy at raw_send_hdrinc().

Signed-off-by: Richard Cochran 
Signed-off-by: Jesus Sanchez-Palencia 
---
 include/net/inet_sock.h | 1 +
 net/ipv4/ip_output.c| 3 +++
 net/ipv4/raw.c  | 2 ++
 net/ipv4/udp.c  | 1 +
 4 files changed, 7 insertions(+)

diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 83d5b3c2ac42..314be484c696 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -148,6 +148,7 @@ struct inet_cork {
__s16   tos;
charpriority;
__u16   gso_size;
+   u64 transmit_time;
 };
 
 struct inet_cork_full {
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index b3308e9d9762..904a54a090e9 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1153,6 +1153,7 @@ static int ip_setup_cork(struct sock *sk, struct 
inet_cork *cork,
cork->tos = ipc->tos;
cork->priority = ipc->priority;
cork->tx_flags = ipc->tx_flags;
+   cork->transmit_time = ipc->sockc.transmit_time;
 
return 0;
 }
@@ -1413,6 +1414,7 @@ struct sk_buff *__ip_make_skb(struct sock *sk,
 
skb->priority = (cork->tos != -1) ? cork->priority: sk->sk_priority;
skb->mark = sk->sk_mark;
+   skb->tstamp = cork->transmit_time;
/*
 * Steal rt from cork.dst to avoid a pair of atomic_inc/atomic_dec
 * on dst refcount
@@ -1495,6 +1497,7 @@ struct sk_buff *ip_make_skb(struct sock *sk,
cork->flags = 0;
cork->addr = 0;
cork->opt = NULL;
+   cork->transmit_time = 0;
err = ip_setup_cork(sk, cork, ipc, rtp);
if (err)
return ERR_PTR(err);
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index abb3c9490c55..446af7be2b55 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -381,6 +381,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 
*fl4,
 
skb->priority = sk->sk_priority;
skb->mark = sk->sk_mark;
+   skb->tstamp = sockc->transmit_time;
skb_dst_set(skb, &rt->dst);
*rtp = NULL;
 
@@ -562,6 +563,7 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
}
 
ipc.sockc.tsflags = sk->sk_tsflags;
+   ipc.sockc.transmit_time = 0;
ipc.addr = inet->inet_saddr;
ipc.opt = NULL;
ipc.tx_flags = 0;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 9bb27df4dac5..0ab2c13bc7a1 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -978,6 +978,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t 
len)
}
 
ipc.sockc.tsflags = sk->sk_tsflags;
+   ipc.sockc.transmit_time = 0;
ipc.addr = inet->inet_saddr;
ipc.oif = sk->sk_bound_dev_if;
ipc.gso_size = up->gso_size;
-- 
2.17.1



[PATCH v1 net-next 14/14] net/sched: Make etf report drops on error_queue

2018-06-27 Thread Jesus Sanchez-Palencia
Use the socket error queue for reporting dropped packets if the
socket has enabled that feature through the SO_TXTIME API.

Packets are dropped either on enqueue() if they aren't accepted by the
qdisc or on dequeue() if the system misses their deadline. Those are
reported as different errors so applications can react accordingly.

Userspace can retrieve the errors through the socket error queue and the
corresponding cmsg interfaces. A struct sock_extended_err* is used for
returning the error data, and the packet's timestamp can be retrieved by
combining the ee_data (high part) and ee_info (low part) fields, e.g.:

((__u64) serr->ee_data << 32) + serr->ee_info

This feature is disabled by default and must be explicitly enabled by
applications. Enabling it can bring some overhead for the Tx cycles
of the application.
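
The expression above can be sketched in plain C as follows. This is only
an illustration: txtime_from_err() and txtime_to_err() are hypothetical
helper names that do not exist in the patch, and the split simply mirrors
what report_sock_error() does with skb->tstamp.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: rebuild the 64-bit txtime from the two 32-bit
 * halves carried in struct sock_extended_err (ee_data = high part,
 * ee_info = low part).
 */
static uint64_t txtime_from_err(uint32_t ee_data, uint32_t ee_info)
{
	return ((uint64_t)ee_data << 32) | ee_info;
}

/* Hypothetical inverse: split a txtime the same way the qdisc does
 * before queueing the error.
 */
static void txtime_to_err(uint64_t txtime, uint32_t *ee_data,
			  uint32_t *ee_info)
{
	assert(ee_data && ee_info);
	*ee_data = (uint32_t)(txtime >> 32);	/* high part of tstamp */
	*ee_info = (uint32_t)txtime;		/* low part of tstamp */
}
```

An application would call txtime_from_err(serr->ee_data, serr->ee_info)
after reading the control message from the error queue. The cast before
the shift matters: shifting the 32-bit field directly would lose the
high part.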

Signed-off-by: Jesus Sanchez-Palencia 
---
 include/linux/socket.h|  4 +++-
 include/net/sock.h|  1 +
 include/uapi/linux/errqueue.h |  2 ++
 net/sched/sch_etf.c   | 37 +--
 4 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/include/linux/socket.h b/include/linux/socket.h
index ca476b7a8ff0..75e11d29b32a 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -85,7 +85,9 @@ struct cmsghdr {
 
 struct sock_txtime {
clockid_t   clockid;/* reference clockid */
-   u16 flags;  /* bit 0: txtime in deadline_mode */
+   u16 flags;  /* bit 0: txtime in deadline_mode
+* bit 1: report drops on sk err queue
+*/
 };
 
 /*
diff --git a/include/net/sock.h b/include/net/sock.h
index 73f4404e49e4..e681a45cfe7e 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -473,6 +473,7 @@ struct sock {
u16 sk_clockid;
u16 sk_txtime_flags;
 #define SK_TXTIME_DEADLINE_MASKBIT(0)
+#define SK_TXTIME_RECV_ERR_MASKBIT(1)
 
struct socket   *sk_socket;
void*sk_user_data;
diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
index dc64cfaf13da..66fd5e443c94 100644
--- a/include/uapi/linux/errqueue.h
+++ b/include/uapi/linux/errqueue.h
@@ -25,6 +25,8 @@ struct sock_extended_err {
 #define SO_EE_OFFENDER(ee) ((struct sockaddr*)((ee)+1))
 
 #define SO_EE_CODE_ZEROCOPY_COPIED 1
+#define SO_EE_CODE_TXTIME_INVALID_PARAM2
+#define SO_EE_CODE_TXTIME_MISSED   3
 
 /**
  * struct scm_timestamping - timestamps exposed through cmsg
diff --git a/net/sched/sch_etf.c b/net/sched/sch_etf.c
index 5514a8aa3bd5..166f4b72875b 100644
--- a/net/sched/sch_etf.c
+++ b/net/sched/sch_etf.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -124,6 +125,35 @@ static void reset_watchdog(struct Qdisc *sch)
qdisc_watchdog_schedule_ns(&q->watchdog, ktime_to_ns(next));
 }
 
+static void report_sock_error(struct sk_buff *skb, u32 err, u8 code)
+{
+   struct sock_exterr_skb *serr;
+   ktime_t txtime = skb->tstamp;
+
+   if (!skb->sk || !(skb->sk->sk_txtime_flags & SK_TXTIME_RECV_ERR_MASK))
+   return;
+
+   skb = skb_clone_sk(skb);
+   if (!skb)
+   return;
+
+   sock_hold(skb->sk);
+
+   serr = SKB_EXT_ERR(skb);
+   serr->ee.ee_errno = err;
+   serr->ee.ee_origin = SO_EE_ORIGIN_LOCAL;
+   serr->ee.ee_type = 0;
+   serr->ee.ee_code = code;
+   serr->ee.ee_pad = 0;
+   serr->ee.ee_data = (txtime >> 32); /* high part of tstamp */
+   serr->ee.ee_info = txtime; /* low part of tstamp */
+
+   if (sock_queue_err_skb(skb->sk, skb))
+   kfree_skb(skb);
+
+   sock_put(skb->sk);
+}
+
 static int etf_enqueue_timesortedlist(struct sk_buff *nskb, struct Qdisc *sch,
  struct sk_buff **to_free)
 {
@@ -131,8 +161,10 @@ static int etf_enqueue_timesortedlist(struct sk_buff 
*nskb, struct Qdisc *sch,
struct rb_node **p = &q->head.rb_node, *parent = NULL;
ktime_t txtime = nskb->tstamp;
 
-   if (!is_packet_valid(sch, nskb))
+   if (!is_packet_valid(sch, nskb)) {
+   report_sock_error(nskb, EINVAL, 
SO_EE_CODE_TXTIME_INVALID_PARAM);
return qdisc_drop(nskb, sch, to_free);
+   }
 
while (*p) {
struct sk_buff *skb;
@@ -175,6 +207,8 @@ static void timesortedlist_erase(struct Qdisc *sch, struct 
sk_buff *skb,
if (drop) {
struct sk_buff *to_free = NULL;
 
+   report_sock_error(skb, ECANCELED, SO_EE_CODE_TXTIME_MISSED);
+
qdisc_drop(skb, sch, &to_free);
kfree_skb_list(to_free);
qdisc_qstats_overlimit(sch);
@@ -200,7 +234,6 @@ static struct sk_buff *etf_dequeue_timesortedlist(struct 
Qdisc *sch)
 

[PATCH v1 net-next 07/14] net/sched: Add HW offloading capability to ETF

2018-06-27 Thread Jesus Sanchez-Palencia
Add infra so etf qdisc supports HW offload of time-based transmission.

For hw offload, the time sorted list is still used, so packets are
always dequeued in order of txtime.

Example:

$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
   map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

$ tc qdisc add dev enp2s0 parent 100:1 etf offload delta 10 \
   clockid CLOCK_REALTIME

In this example, the Qdisc will use HW offload for the control of the
transmission time through the network adapter. The hrtimer used for
packet scheduling inside the qdisc will use the clockid CLOCK_REALTIME
as reference, and packets leave the Qdisc "delta" (10) nanoseconds
before their transmission time. Because this mode uses HW offload and
dynamic clocks are not supported by the hrtimer, the system clock
and the PHC clock must be synchronized for it to behave as
expected.

Signed-off-by: Jesus Sanchez-Palencia 
---
 include/net/pkt_sched.h|  5 +++
 include/uapi/linux/pkt_sched.h |  1 +
 net/sched/sch_etf.c| 71 +-
 3 files changed, 76 insertions(+), 1 deletion(-)

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 2466ea143d01..7dc769e5452b 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -155,4 +155,9 @@ struct tc_cbs_qopt_offload {
s32 sendslope;
 };
 
+struct tc_etf_qopt_offload {
+   u8 enable;
+   s32 queue;
+};
+
 #endif
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 9d6fd2004a03..efad482e69d2 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -941,6 +941,7 @@ struct tc_etf_qopt {
__s32 clockid;
__u32 flags;
 #define TC_ETF_DEADLINE_MODE_ONBIT(0)
+#define TC_ETF_OFFLOAD_ON  BIT(1)
 };
 
 enum {
diff --git a/net/sched/sch_etf.c b/net/sched/sch_etf.c
index 5f01a285f399..cd6cb5b69228 100644
--- a/net/sched/sch_etf.c
+++ b/net/sched/sch_etf.c
@@ -20,8 +20,10 @@
 #include 
 
 #define DEADLINE_MODE_IS_ON(x) ((x)->flags & TC_ETF_DEADLINE_MODE_ON)
+#define OFFLOAD_IS_ON(x) ((x)->flags & TC_ETF_OFFLOAD_ON)
 
 struct etf_sched_data {
+   bool offload;
bool deadline_mode;
int clockid;
int queue;
@@ -45,6 +47,9 @@ static inline int validate_input_params(struct tc_etf_qopt 
*qopt,
 *  * Dynamic clockids are not supported.
 *
 *  * Delta must be a positive integer.
+*
+* Also note that for the HW offload case, we must
+* expect that system clocks have been synchronized to PHC.
 */
if (qopt->clockid < 0) {
NL_SET_ERR_MSG(extack, "Dynamic clockids are not supported");
@@ -226,6 +231,56 @@ static struct sk_buff *etf_dequeue_timesortedlist(struct 
Qdisc *sch)
return skb;
 }
 
+static void etf_disable_offload(struct net_device *dev,
+   struct etf_sched_data *q)
+{
+   struct tc_etf_qopt_offload etf = { };
+   const struct net_device_ops *ops;
+   int err;
+
+   if (!q->offload)
+   return;
+
+   ops = dev->netdev_ops;
+   if (!ops->ndo_setup_tc)
+   return;
+
+   etf.queue = q->queue;
+   etf.enable = 0;
+
+   err = ops->ndo_setup_tc(dev, TC_SETUP_QDISC_ETF, &etf);
+   if (err < 0)
+   pr_warn("Couldn't disable ETF offload for queue %d\n",
+   etf.queue);
+}
+
+static int etf_enable_offload(struct net_device *dev, struct etf_sched_data *q,
+ struct netlink_ext_ack *extack)
+{
+   const struct net_device_ops *ops = dev->netdev_ops;
+   struct tc_etf_qopt_offload etf = { };
+   int err;
+
+   if (q->offload)
+   return 0;
+
+   if (!ops->ndo_setup_tc) {
+   NL_SET_ERR_MSG(extack, "Specified device does not support ETF 
offload");
+   return -EOPNOTSUPP;
+   }
+
+   etf.queue = q->queue;
+   etf.enable = 1;
+
+   err = ops->ndo_setup_tc(dev, TC_SETUP_QDISC_ETF, &etf);
+   if (err < 0) {
+   NL_SET_ERR_MSG(extack, "Specified device failed to setup ETF 
hardware offload");
+   return err;
+   }
+
+   return 0;
+}
+
 static int etf_init(struct Qdisc *sch, struct nlattr *opt,
struct netlink_ext_ack *extack)
 {
@@ -252,8 +307,9 @@ static int etf_init(struct Qdisc *sch, struct nlattr *opt,
 
qopt = nla_data(tb[TCA_ETF_PARMS]);
 
-   pr_debug("delta %d clockid %d deadline %s\n",
+   pr_debug("delta %d clockid %d offload %s deadline %s\n",
 qopt->delta, qopt->clockid,
+OFFLOAD_IS_ON(qopt) ? "on" : "off",
 DEADLINE_MODE_IS_ON(qopt) ? "on" : "off"

[PATCH v1 net-next 06/14] net/sched: Introduce the ETF Qdisc

2018-06-27 Thread Jesus Sanchez-Palencia
From: Vinicius Costa Gomes 

The ETF (Earliest TxTime First) qdisc uses the information added
earlier in this series (the socket option SO_TXTIME and the new
role of sk_buff->tstamp) to schedule packets transmission based
on absolute time.

For some workloads, just bandwidth enforcement is not enough, and
precise control of the transmission of packets is necessary.

Example:

$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
   map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

$ tc qdisc add dev enp2s0 parent 100:1 etf delta 10 \
   clockid CLOCK_REALTIME

In this example, the Qdisc will provide SW best-effort control of the
transmission time to the network adapter. The timestamp in the
socket will be in reference to the clockid CLOCK_REALTIME, and packets
will leave the qdisc "delta" (10) nanoseconds before their transmission
time.

The ETF qdisc will buffer packets sorted by their txtime. It will drop
packets on enqueue() if their skbuff clockid does not match the clock
reference of the Qdisc. Moreover, on dequeue(), a packet will be dropped
if its txtime expires while it is still queued.

The qdisc also supports the SO_TXTIME deadline mode. For this mode, it
will dequeue a packet as soon as possible and change the skb timestamp
to 'now' during etf_dequeue().
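
The dequeue behaviour described above can be modelled in userspace as a
small decision function. This is a simplified sketch and not the code
from sch_etf.c: etf_decide() and enum etf_action are hypothetical names,
times are plain nanosecond counters, and the strict "<" comparison is an
assumption about the boundary case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical outcome of looking at the head of the time sorted list */
enum etf_action {
	ETF_WAIT,	/* too early: re-arm the watchdog and keep it queued */
	ETF_SEND,	/* hand the packet to the netdevice now */
	ETF_DROP,	/* txtime already passed: drop the packet */
};

/* Decide what to do with a packet whose launch time is 'txtime'.
 * All values are nanoseconds on the qdisc's reference clock.
 */
static enum etf_action etf_decide(int64_t now, int64_t txtime,
				  int32_t delta, bool deadline_mode)
{
	assert(delta >= 0);

	if (txtime < now)
		return ETF_DROP;	/* expired while queued */

	if (deadline_mode)
		return ETF_SEND;	/* send as soon as possible */

	if (txtime - delta < now)
		return ETF_SEND;	/* inside the "delta" window */

	return ETF_WAIT;
}
```

With delta 10 and now = 1000, a packet with txtime 1005 is sent, one
with txtime 2000 waits (unless deadline mode is on), and one with
txtime 900 is dropped.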

Signed-off-by: Jesus Sanchez-Palencia 
Signed-off-by: Vinicius Costa Gomes 
---
 include/linux/netdevice.h  |   1 +
 include/uapi/linux/pkt_sched.h |  17 ++
 net/sched/Kconfig  |  11 +
 net/sched/Makefile |   1 +
 net/sched/sch_etf.c| 385 +
 5 files changed, 415 insertions(+)
 create mode 100644 net/sched/sch_etf.c

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index c6b377a15869..7f650bdc6ec3 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -793,6 +793,7 @@ enum tc_setup_type {
TC_SETUP_QDISC_RED,
TC_SETUP_QDISC_PRIO,
TC_SETUP_QDISC_MQ,
+   TC_SETUP_QDISC_ETF,
 };
 
 /* These structures hold the attributes of bpf state that are being passed
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 37b5096ae97b..9d6fd2004a03 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -934,4 +934,21 @@ enum {
 
 #define TCA_CBS_MAX (__TCA_CBS_MAX - 1)
 
+
+/* ETF */
+struct tc_etf_qopt {
+   __s32 delta;
+   __s32 clockid;
+   __u32 flags;
+#define TC_ETF_DEADLINE_MODE_ONBIT(0)
+};
+
+enum {
+   TCA_ETF_UNSPEC,
+   TCA_ETF_PARMS,
+   __TCA_ETF_MAX,
+};
+
+#define TCA_ETF_MAX (__TCA_ETF_MAX - 1)
+
 #endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index a01169fb5325..fcc89706745b 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -183,6 +183,17 @@ config NET_SCH_CBS
  To compile this code as a module, choose M here: the
  module will be called sch_cbs.
 
+config NET_SCH_ETF
+   tristate "Earliest TxTime First (ETF)"
+   help
+ Say Y here if you want to use the Earliest TxTime First (ETF) packet
+ scheduling algorithm.
+
+ See the top of  for more details.
+
+ To compile this code as a module, choose M here: the
+ module will be called sch_etf.
+
 config NET_SCH_GRED
tristate "Generic Random Early Detection (GRED)"
---help---
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 8811d3804878..9a5a7077d217 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -54,6 +54,7 @@ obj-$(CONFIG_NET_SCH_FQ)  += sch_fq.o
 obj-$(CONFIG_NET_SCH_HHF)  += sch_hhf.o
 obj-$(CONFIG_NET_SCH_PIE)  += sch_pie.o
 obj-$(CONFIG_NET_SCH_CBS)  += sch_cbs.o
+obj-$(CONFIG_NET_SCH_ETF)  += sch_etf.o
 
 obj-$(CONFIG_NET_CLS_U32)  += cls_u32.o
 obj-$(CONFIG_NET_CLS_ROUTE4)   += cls_route.o
diff --git a/net/sched/sch_etf.c b/net/sched/sch_etf.c
new file mode 100644
index ..5f01a285f399
--- /dev/null
+++ b/net/sched/sch_etf.c
@@ -0,0 +1,385 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/* net/sched/sch_etf.c  Earliest TxTime First queueing discipline.
+ *
+ * Authors:Jesus Sanchez-Palencia 
+ * Vinicius Costa Gomes 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define DEADLINE_MODE_IS_ON(x) ((x)->flags & TC_ETF_DEADLINE_MODE_ON)
+
+struct etf_sched_data {
+   bool deadline_mode;
+   int clockid;
+   int queue;
+   s32 delta; /* in ns */
+   ktime_t last; /* The txtime of the last skb sent to the netdevice. */
+   struct rb_root head;
+   struct qdisc_watchdog watchdog;
+   ktime_t (*get_time)(void);
+};
+
+static const struct nla_policy etf_policy[TCA_ETF_MAX + 1] = {
+   [TCA_ETF_PARMS] = { .len = sizeof(struct tc_etf_qopt) },
+};
+
+static inline int valida

[PATCH iproute2] man: Fix typos on tc-cbs

2018-06-27 Thread Jesus Sanchez-Palencia
Signed-off-by: Jesus Sanchez-Palencia 
---
 man/man8/tc-cbs.8 | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/man/man8/tc-cbs.8 b/man/man8/tc-cbs.8
index 32e1e0d4..ad1d8821 100644
--- a/man/man8/tc-cbs.8
+++ b/man/man8/tc-cbs.8
@@ -28,7 +28,7 @@ defined rate limiting method to the traffic.
 This queueing discipline is intended to be used by TSN (Time Sensitive
 Networking) applications, the CBS parameters are derived directly by
 what is described by the Annex L of the IEEE 802.1Q-2014
-Sepcification. The algorithm and how it affects the latency are
+Specification. The algorithm and how it affects the latency are
 detailed there.
 
 CBS is meant to be installed under another qdisc that maps packet
@@ -60,7 +60,7 @@ packet size, which is then used for calculating the idleslope.
 sendslope
 Sendslope is the rate of credits that is depleted (it should be a
 negative number of kilobits per second) when a transmission is
-ocurring. It can be calculated as follows, (IEEE 802.1Q-2014 Section
+occurring. It can be calculated as follows, (IEEE 802.1Q-2014 Section
 8.6.8.2 item g):
 
 sendslope = idleslope - port_transmit_rate
-- 
2.17.1



Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc

2018-04-23 Thread Jesus Sanchez-Palencia
Hi Thomas,


On 03/21/2018 06:46 AM, Thomas Gleixner wrote:
> On Tue, 6 Mar 2018, Jesus Sanchez-Palencia wrote:
>> +struct tbs_sched_data {
>> +bool sorting;
>> +int clockid;
>> +int queue;
>> +s32 delta; /* in ns */
>> +ktime_t last; /* The txtime of the last skb sent to the netdevice. */
>> +struct rb_root head;
> 
> Hmm. You are reimplementing timerqueue open coded. Have you checked whether
> you could reuse the timerqueue implementation?
> 
> That requires to add a timerqueue node to struct skbuff
> 
> @@ -671,7 +671,8 @@ struct sk_buff {
>   unsigned long   dev_scratch;
>   };
>   };
> - struct rb_node  rbnode; /* used in netem & tcp stack */
> + struct rb_node  rbnode; /* used in netem & tcp stack */
> + struct timerqueue_node  tqnode;
>   };
>   struct sock *sk;
> 
> Then you can use timerqueue_head in your scheduler data and all the open
> coded rbtree handling goes away.


I just noticed that doing the above increases the size of struct sk_buff by 8
bytes - struct timerqueue_node is 32bytes long while struct rb_node is only
24bytes long.
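
For reference, the 8 byte growth can be reproduced in userspace with
mock-ups of the two structures. The mock_* types below are assumptions
that only imitate the field layout of struct rb_node and struct
timerqueue_node; the numbers hold for a 64-bit (LP64) build.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mock-up of struct rb_node: one word plus two pointers (24 bytes on LP64) */
struct mock_rb_node {
	unsigned long parent_color;
	struct mock_rb_node *right;
	struct mock_rb_node *left;
};

/* Mock-up of struct timerqueue_node: an rb_node plus a 64-bit expiry
 * (32 bytes on LP64)
 */
struct mock_timerqueue_node {
	struct mock_rb_node node;
	int64_t expires;
};

/* Mock-up of the union inside struct sk_buff after adding tqnode */
union mock_skb_node {
	struct mock_rb_node rbnode;
	struct mock_timerqueue_node tqnode;
};

/* A union is as large as its largest member, so this is the growth */
static size_t union_growth(void)
{
	assert(sizeof(union mock_skb_node) >= sizeof(struct mock_rb_node));
	return sizeof(union mock_skb_node) - sizeof(struct mock_rb_node);
}
```

On LP64, union_growth() evaluates to 8, matching the numbers above.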

Given the feedback we got here before against touching struct sk_buff at all for
non-generic use cases, I will keep the implementation of sch_tbs.c as is, thus
keeping the open-coded version for now, ok?

Thanks,
Jesus


(...)




Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc

2018-04-11 Thread Jesus Sanchez-Palencia
Hi,

On 04/11/2018 01:16 PM, Thomas Gleixner wrote:
>>>> Putting it all together, we end up with:
>>>>
>>>> 1) a new txtime aware qdisc, tbs, to be used per queue. Its cli will look like:
>>>> $ tc qdisc add (...) tbs clockid CLOCK_REALTIME delta 15 offload sorting
>>>
>>> Why CLOCK_REALTIME? The only interesting time in a TSN network is
>>> CLOCK_TAI, really.
>>
>> REALTIME was just an example here to show that the qdisc has to be configured
>> with a clockid parameter. Are you suggesting that instead both of the new 
>> qdiscs
>> (i.e. tbs and taprio) should always be using CLOCK_TAI implicitly?
> 
> I think so. It's _the_ network time on which everything is based on.

Yes, but more on this below.


> 
>>>> 2) a new cmsg-interface for setting a per-packet timestamp that will be used
>>>> either as a txtime or as deadline by tbs (and further the NIC driver for the
>>>> offload case): SCM_TXTIME.
>>>>
>>>> 3) a new socket option: SO_TXTIME. It will be used to enable the feature for a
>>>> socket, and will have as parameters a clockid and a txtime mode (deadline or
>>>> explicit), that defines the semantics of the timestamp set on packets using
>>>> SCM_TXTIME.
>>>>
>>>> 4) a new #define DYNAMIC_CLOCKID 15 added to include/uapi/linux/time.h .
>>>
>>> Can you remind me why we would need that?
>>
>> So there is a "clockid" that can be used for the full hw offload modes. On 
>> this
>> case, the txtimes are in reference to the NIC's PTP clock, and, as 
>> discussed, we
>> can't just use a clockid that was computed from the fd pointing to /dev/ptpX 
>> .
> 
> And the NICs PTP clock is CLOCK_TAI, so there should be no reason to have
> yet another clock, right?

Just breaking this down a bit, yes, TAI is the network time base, and the NIC's
PTP clock uses that because PTP is (commonly) based on TAI. After the PHCs have
been synchronized over the network (e.g. with ptp4l), my understanding is that
if applications want to use the clockid_t CLOCK_TAI as a network clock reference
it's required that something (i.e. phc2sys) is synchronizing the PHCs and the
system clock, and also that something calls adjtime to apply the TAI vs UTC
offset to CLOCK_TAI.

If we are fine with those 'dependencies', then I agree there is no need for
another clock.

I was thinking about the full offload use-cases, thus when no scheduling is
happening inside the qdiscs. Applications could just read the time from the PHC
clocks directly without having to rely on any of the above. On this case,
userspace would use DYNAMIC_CLOCK just to flag that this is the case, but I must
admit it's not clear to me how common of a use-case that is, or even if it makes
sense.


Thanks,
Jesus


> 
> Thanks,
> 
>   tglx
> 


Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc

2018-04-10 Thread Jesus Sanchez-Palencia
Hi Thomas,


On 04/10/2018 05:37 AM, Thomas Gleixner wrote:

(...)


>>>
>>>  - Simple feed through (Applications are time constraints aware and set the
>>>exact schedule). qdisc has admission control.
>>
>> This will be provided by the tbs qdisc. It will still provide a txtime sorted
>> list and hw offload, but now there will be a per-socket option that tells the
>> qdisc if the per-packet timestamp is the txtime (i.e. explicit mode, as 
>> you've
>> called it) or a deadline. The drop_if_late flag will be removed.
>>
>> When in explicit mode, packets from that socket are dequeued from the qdisc
>> during its time slice if their [(txtime - delta) < now].
>>
>>>
>>>  - Deadline aware qdisc to handle e.g. A/V streams. Applications are aware
>>>of time constraints and provide the packet deadline. qdisc has admission
>>>control. This can be a simple first comes, first served scheduler or
>>>something like EDF which allows optimized utilization. The qdisc sets
>>>the TX time depending on the deadline and feeds into the root.
>>
>> This will be provided by tbs if the socket which is transmitting packets is
>> configured for deadline mode.
> 
> You don't want the socket to decide that. The qdisc into which a socket
> feeds defines the mode and the qdisc rejects requests with the wrong mode.
> 
> Making a qdisc doing both and let the user decide what he wants it to be is
> not really going to fly. Especially if you have different users which want
> a different mode. It's clearly distinct functionality.


Ok, so just to make sure I got this right, are you suggesting that both the
'tbs' qdisc *and* the socket (i.e. through SO_TXTIME) should have a config
parameter for specifying the txtime mode? This way if there is a mismatch,
packets from that socket are rejected by the qdisc.



(...)


> 
>> Another question for this mode (but perhaps that applies to both modes) is, 
>> what
>> if the qdisc misses the deadline for *any* reason? I'm assuming it should 
>> drop
>> the packet during dequeue.
> 
> There the question is how user space is notified about that issue. The
> application which queued the packet on time does rightfully assume that
> it's going to be on the wire on time.
> 
> This is a violation of the overall scheduling plan, so you need to have
> a sane design to handle that.


In addition to the qdisc stats, we could look into using the socket's error
queue to notify the application about that.


> 
>> Putting it all together, we end up with:
>>
>> 1) a new txtime aware qdisc, tbs, to be used per queue. Its cli will look 
>> like:
>> $ tc qdisc add (...) tbs clockid CLOCK_REALTIME delta 15 offload sorting
> 
> Why CLOCK_REALTIME? The only interesting time in a TSN network is
> CLOCK_TAI, really.


REALTIME was just an example here to show that the qdisc has to be configured
with a clockid parameter. Are you suggesting that instead both of the new qdiscs
(i.e. tbs and taprio) should always be using CLOCK_TAI implicitly?


> 
>> 2) a new cmsg-interface for setting a per-packet timestamp that will be used
>> either as a txtime or as deadline by tbs (and further the NIC driver for the
>> offload case): SCM_TXTIME.
>>
>> 3) a new socket option: SO_TXTIME. It will be used to enable the feature for 
>> a
>> socket, and will have as parameters a clockid and a txtime mode (deadline or
>> explicit), that defines the semantics of the timestamp set on packets using
>> SCM_TXTIME.
>>
>> 4) a new #define DYNAMIC_CLOCKID 15 added to include/uapi/linux/time.h .
> 
> Can you remind me why we would need that?


So there is a "clockid" that can be used for the full hw offload modes. On this
case, the txtimes are in reference to the NIC's PTP clock, and, as discussed, we
can't just use a clockid that was computed from the fd pointing to /dev/ptpX .


> 
>> 5) a new schedule-aware qdisc, 'tas' or 'taprio', to be used per port. Its 
>> cli
>> will look like what was proposed for taprio (base time being an absolute 
>> timestamp).
>>
>> If we all agree with the above, we will start by closing on 1-4 asap and will
>> focus on 5 next.
>>
>> How does that sound?
> 
> Backwards to be honest.
> 
> You should start with the NIC facing qdisc because that's the key part of
> all this and the design might have implications on how the qdiscs which
> feed into it need to be designed.


Ok, let's just try to close on the above first.


Thanks,
Jesus


> 
> Thanks,
> 
>   tglx
> 


Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc

2018-04-09 Thread Jesus Sanchez-Palencia
Hi Thomas,


On 03/28/2018 12:48 AM, Thomas Gleixner wrote:

(...)

>
> There are two modes:
>
>   1) Send at the given TX time (Explicit mode)
>
>   2) Send before given TX time (Deadline mode)
>
> There is no need to specify 'drop if late' simply because if the message is
> handed in past the given TX time, it's too late by definition. What you are
> trying to implement is a hybrid of TSN and general purpose (not time aware)
> networking in one go. And you do that because your overall design is not
> looking at the big picture. You designed from a given use case assumption
> and tried to fit other things into it with duct tape.


Ok, I see the difference now, thanks. I have just two more questions about the
deadline mode, please see below.

(...)


>
>>> Coming back to the overall scheme. If you start upfront with a time slice
>>> manager which is designed to:
>>>
>>>   - Handle multiple channels
>>>
>>>   - Expose the time constraints, properties per channel
>>>
>>> then you can fit all kind of use cases, whether designed by committee or
>>> not. You can configure that thing per node or network wide. It does not
>>> make a difference. The only difference are the resulting constraints.
>>
>>
>> Ok, and I believe the above was covered by what we had proposed before, 
>> unless
>> what you meant by time constraints is beyond the configured port schedule.
>>
>> Are you suggesting that we'll need to have a kernel entity that is not only
>> aware of the current traffic classes 'schedule', but also of the resources 
>> that
>> are still available for new streams to be accommodated into the classes? 
>> Putting
>> it differently, is the TAS you envision just an entity that runs a schedule, 
>> or
>> is it a time-aware 'orchestrator'?
>
> In the first place it's something which runs a defined schedule.
>
> The accommodation for new streams is required, but not necessarily at the
> root qdisc level. That might be a qdisc feeding into it.
>
> Assume you have a bandwidth reservation, aka time slot, for audio. If your
> audio related qdisc does deadline scheduling then you can add new streams
> to it up to the point where it's no longer able to fit.
>
> The only thing which might be needed at the root qdisc is the ability to
> utilize unused time slots for other purposes, but that's not required to be
> there in the first place as long as it's designed in a way that it can be
> added later on.


Ok, agreed.


>
>>> So lets look once more at the picture in an abstract way:
>>>
>>>[ NIC ]
>>>   |
>>>  [ Time slice manager ]
>>> |   |
>>>  [ Ch 0 ] ... [ Ch N ]
>>>
>>> So you have a bunch of properties here:
>>>
>>> 1) Number of Channels ranging from 1 to N
>>>
>>> 2) Start point, slice period and slice length per channel
>>
>> Ok, so we agree that a TAS entity is needed. Assuming that channels are 
>> traffic
>> classes, do you have something else in mind other than a new root qdisc?
>
> Whatever you call it, the important point is that it is the gate keeper to
> the network adapter and there is no way around it. It fully controls the
> timed schedule how simple or how complex it may be.


Ok, and I've finally understood the nuance between the above and what we had
planned initially.


(...)


>>
>> * TAS:
>>
>>The idea we are currently exploring is to add a "time-aware", priority 
>> based
>>qdisc, that also exposes the Tx queues available and provides a mechanism 
>> for
>>mapping priority <-> traffic class <-> Tx queues in a similar fashion as
>>mqprio. We are calling this qdisc 'taprio', and its 'tc' cmd line would 
>> be:
>>
>>$ tc qdisc add dev ens4 parent root handle 100 taprio num_tc 4\
>> map 2 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3 \
>> queues 0 1 2 3  \
>> sched-file gates.sched [base-time ]   \
>>[cycle-time ] [extension-time ]
>>
>> is multi-line, with each line being of the following format:
>>  
>>
>>Qbv only defines one : "S" for 'SetGates'
>>
>>For example:
>>
>>S 0x01 300
>>S 0x03 500
>>
>>This means that there are two intervals, the first will have the gate
>>for traffic class 0 open for 300 nanoseconds, the second will have
>>both traffic classes open for 500 nanoseconds.
>
> To accommodate stuff like control systems you also need a base line, which
> is not expressed as interval. Otherwise you can't schedule network wide
> explicit plans. That's either an absolute network-time (TAI) time stamp or
> an offset to a well defined network-time (TAI) time stamp, e.g. start of
> epoch or something else which is agreed on. The actual schedule then fast
> forwards past now (TAI) and sets up the slots from there. That makes node
> hotplug possible as well.


Sure, and the [base-time ] on the command line above was actually
wrong. It should have been expressed 

Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc

2018-03-27 Thread Jesus Sanchez-Palencia
Hi Thomas,


On 03/25/2018 04:46 AM, Thomas Gleixner wrote:
> On Fri, 23 Mar 2018, Jesus Sanchez-Palencia wrote:
>> On 03/22/2018 03:52 PM, Thomas Gleixner wrote:
>>> So what's the plan for this? Having TAS as a separate entity or TAS feeding
>>> into the proposed 'basic' time transmission thing?
>>
>> The second one, I guess.
>
> That's just wrong. It won't work. See below.

Yes, our proposal does not handle the scenarios you are bringing into the
discussion.

I think we have more points of convergence than divergence already. I will just
go through some pieces of the discussion first, and then let's see if we can
agree on where we are trying to get.



>
>> Elaborating, the plan is at some point having TAS as a separate entity,
>> but which can use tbs for one of its classes (and cbs for another, and
>> strict priority for everything else, etc).
>>
>> Basically, the design would be something along the lines of 'taprio'. A root 
>> qdisc
>> that is both time and priority aware, and capable of running a schedule for 
>> the
>> port. That schedule can run inside the kernel with hrtimers, or just be
>> offloaded into the controller if Qbv is supported on HW.
>>
>> Because it would expose the inner traffic classes in a mq / mqprio / prio 
>> style,
>> then it would allow for other per-queue qdiscs to be attached to it. On a 
>> system
>> using the i210, for instance, we could then have tbs installed on traffic 
>> class
>> 0 just dialing hw offload. The Qbv schedule would be running in SW on the TAS
>> entity (i.e. 'taprio') which would be setting the packets' txtime before
>> dequeueing packets on a fast path -> tbs -> NIC.
>>
>> Similarly, other qdisc, like cbs, could be installed if all that traffic 
>> class
>> requires is traffic shaping once its 'gate' is allowed to execute the 
>> selected
>> tx algorithm attached to it.
>>
>>> I've not yet seen a convincing argument why this low level stuff with all
>>> of its weird flavours is superior over something which reflects the basic
>>> operating principle of TSN.
>>
>>
>> As you know, not all TSN systems are designed the same. Take AVB systems, for
>> example. These not always are running on networks that are aware of any time
>> schedule, or at least not quite like what is described by Qbv.
>>
>> On those systems there is usually a certain number of streams with different
>> priorities that care mostly about having their bandwidth reserved along the
>> network. The applications running on such systems are usually based on AVTP,
>> thus they already have to calculate and set the "avtp presentation time"
>> per-packet themselves. A Qbv scheduler would probably provide very little
>> benefits to this domain, IMHO. For "talkers" of these AVB systems, shaping
>> traffic using txtime (i.e. tbs) can provide a low-jitter alternative to cbs,
>> for instance.
>
> You're looking at it from particular use cases and try to accommodate for
> them in the simplest possible way. I don't think that cuts it.
>
> Let's take a step back and look at it from a more general POV without
> trying to make it fit to any of the standards first. I'm deliberately NOT
> using any of the standard defined terms.
>
> At the (local) network level you have always an explicit plan. This plan
> might range from no plan at all to a very elaborate plan which is strict
> about when each node is allowed to TX a particular class of packets.


Ok, we are aligned here.


>
> So let's assume we have the following picture:
>
> [NIC]
>   |
>[ Time slice manager ]
>
> Now in the simplest case, the time slice manager has no constraints and
> exposes a single input which allows the application to say: "Send my packet
> at time X". There is no restriction on 'time X' except if there is a time
> collision with an already queued packet or the requested TX time has
> already passed. That's close to what you implemented.
>
>   Is the TX timestamp which you defined in the user space ABI a fixed
>   scheduling point or is it a deadline?
>
>   That's an important distinction and for this all to work across various
>   use cases you need a way to express that in the ABI. It might be an
>   implicit property of the socket/channel to which the application connects
>   to but still you want to express it from the application side to do
>   proper sanity checking.
>
>   Just think about stuff like audio/video streaming. The point of
>   transmission does not have to be fixed if you have some intelligent
>   contro

Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc

2018-03-23 Thread Jesus Sanchez-Palencia
Hi,


On 03/22/2018 03:52 PM, Thomas Gleixner wrote:
> On Thu, 22 Mar 2018, Jesus Sanchez-Palencia wrote:
>> Our plan was to work directly with the Qbv-like scheduling (per-port) just
>> after the cbs qdisc (Qav), but the feedback here and offline was that there
>> were use cases for a more simplistic launchtime approach (per-queue) as
>> well. We've decided to invest on it first (and postpone the 'taprio' qdisc
>> until there was NIC available with HW support for it, basically).
> 
> I missed that discussion due to other urgent stuff on my plate. Just
> skimmed through it. More below.
> 
>> You are right, and we agree, that using tbs for a per-port schedule of any
>> sort will require a SW scheduler to be developed on top of it, but we've
>> never said the contrary either. Our vision has always been that these are
>> separate mechanisms with different use-cases, so we do see the value for the
>> kernel to provide both.
>>
>> In other words, tbs is not the final solution for Qbv, and we agree that a
>> 'TAS' qdisc is still necessary. And due to the wide range of applications
>> and hw being used for those out there, we need both, especially given that
>> one does not block the other.
> 
> So what's the plan for this? Having TAS as a separate entity or TAS feeding
> into the proposed 'basic' time transmission thing?


The second one, I guess. Elaborating, the plan is at some point having TAS as a
separate entity, but which can use tbs for one of its classes (and cbs for
another, and strict priority for everything else, etc).

Basically, the design would be something along the lines of 'taprio'. A root qdisc
that is both time and priority aware, and capable of running a schedule for the
port. That schedule can run inside the kernel with hrtimers, or just be
offloaded into the controller if Qbv is supported on HW.

Because it would expose the inner traffic classes in a mq / mqprio / prio style,
then it would allow for other per-queue qdiscs to be attached to it. On a system
using the i210, for instance, we could then have tbs installed on traffic class
0 just dialing hw offload. The Qbv schedule would be running in SW on the TAS
entity (i.e. 'taprio') which would be setting the packets' txtime before
dequeueing packets on a fast path -> tbs -> NIC.

Similarly, other qdisc, like cbs, could be installed if all that traffic class
requires is traffic shaping once its 'gate' is allowed to execute the selected
tx algorithm attached to it.



> 
> The general objection I have with the current approach is that it creates
> the playground for all flavours of misdesigned user space implementations
> and just replaces the home brewn and ugly user mode network adapter
> drivers.
> 
> But that's not helping the cause at all. There is enough crappy stuff out
> there already and I rather see a proper designed slice management which can
> be utilized and improved by all involved parties.
> 
> All variants which utilize the basic time driven packet transmission are
> based on periodic explicit plan scheduling with (local) network wide time
> slice assignment.
> 
> It does not matter whether you feed VLAN traffic into a time slice, where
> the VLAN itself does not even have to know about it, or if you have aware
> applications feeding packets to a designated timeslot. The basic principle
> of this is always the same.
> 
> So coming back to last years discussion. It totally went into the wrong
> direction because it turned from an approach (the patches) which came from
> the big picture to a single use case and application centric view. That's
> just wrong and I regret that I didn't have the time to pay attention back
> then.
> 
> You always need to look at the big picture first and design from there, not
> the other way round. There will always be the argument:
> 
> But my application is special and needs X
> 
> It's easy to fall for that. From a long experience I know that none of
> these claims ever held. These arguments are made because the people making
> them have either never looked at the big picture or are simply refusing to
> do so because it would cause them work.
> 
> If you start from the use case and application centric view and ignore the
> big picture then you end up in a gazillion of extra magic features over
> time which could have been completely avoided if you had put your foot down
> and made everyone to agree on a proper and versatile design in the first
> place.
> 
> The more low level access you hand out in the beginning the less commonly
> used, improved and maintained infrastructure you will get in the end. That
> has happened before in other areas and it will happen here as 

Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc

2018-03-23 Thread Jesus Sanchez-Palencia
Hi Thomas,


On 03/23/2018 01:49 AM, Thomas Gleixner wrote:
> On Thu, 22 Mar 2018, Jesus Sanchez-Palencia wrote:
>> On 03/22/2018 03:11 PM, Thomas Gleixner wrote:
>> So, are you just opposing to the case where sorting off + offload off is
>> used? (i.e. the scheduled FIFO case)
> 
> FIFO does not make any sense if your packets have a fixed transmission
> time. I yet have to see a reasonable explanation why FIFO in the context of
> time ordered would be a good thing.


In the context of tbs, the scheduled FIFO was developed just so consistency was kept
between all 4 variants, basically (sw best-effort or hw offload vs sorting
enabled or sorting disabled).

I don't have any strong argument in favor of this mode at the moment, so I will
just remove it on a next version - unless someone else brings up a valid use
case for it, of course.

Thanks for the feedback,
Jesus


Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc

2018-03-22 Thread Jesus Sanchez-Palencia
Hi Thomas,


On 03/22/2018 03:11 PM, Thomas Gleixner wrote:

(...)

>> Having the sorting always enabled requires that a valid static clockid is
>> passed to the qdisc. For the hw offload mode, that means that the PHC and
>> one of the system clocks must be synchronized since hrtimers do not support
>> dynamic clocks. Not all systems do that or want to, and given that we do not
>> want to perform crosstimestamping between the packets' clock reference and
>> the qdisc's one, the only solution for these systems would be using the raw
>> hw offload mode.
> 
> There are two variants of hardware offload:
> 
> 1) Full hardware offload
> 
>    That bypasses the queue completely. You just stick the thing into the
>    scatter gather buffers. Except when there is no room anymore, then you
>    have to queue, but it does not make any difference if you queue in FIFO
>    or in time order. The packets go out in time order anyway.


Illustrating your variants with the current qdisc's setup arguments.

The above is:
- sorting off
- offload on

(I call it a 'raw' fifo as a reference to the usage of qdisc_enqueue_tail() and
qdisc_dequeue_head(), basically.)


> 
> 2) Single packet hardware offload
> 
>    What you do here is to schedule a hrtimer a bit earlier than the first
>    packet tx time and when it fires stick the packet into the hardware and
>    rearm the timer for the next one.


The above is:
- sorting on
- offload on

right?


So, are you just opposing to the case where sorting off + offload off is used?
(i.e. the scheduled FIFO case)



> 
>    The whole point of TSN with hardware support is that you have:
> 
>    - Global network time
> 
>    and
> 
>    - Frequency adjustment of the system time base
> 
> PTP is TAI based and the kernel exposes clock TAI directly through
> hrtimers. You don't need dynamic clocks for that.
> 
> You can even use clock MONOTONIC as it basically is just
> 
>    TAI - offset
> 
> If the network card uses anything else than TAI or a time stamp with a
> strict correlation to TAI for actual TX scheduling then the whole thing is
> broken to begin with.


Sure, I agree.

Thanks,
Jesus

> 
> Thanks,
> 
>   tglx
> 


Re: [RFC v3 net-next 14/18] net/sched: Add HW offloading capability to TBS

2018-03-22 Thread Jesus Sanchez-Palencia
Hi,


On 03/21/2018 07:22 AM, Thomas Gleixner wrote:
> On Tue, 6 Mar 2018, Jesus Sanchez-Palencia wrote:
>> $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
>>map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
>>
>> $ tc qdisc add dev enp2s0 parent 100:1 tbs offload
>>
>> In this example, the Qdisc will use HW offload for the control of the
>> transmission time through the network adapter. It's assumed the timestamp
>> in skbuffs are in reference to the interface's PHC and setting any other
>> valid clockid would be treated as an error. Because there is no
>> scheduling being performed in the qdisc, setting a delta != 0 would also
>> be considered an error.
> 
> Which clockid will be handed in from the application? The network adapter
> time has no fixed clockid. The only way you can get to it is via a fd based
> posix clock and that does not work at all because the qdisc setup might
> have a different FD than the application which queues packets.


Yes. As a result, we came up with a rather simplistic solution that would still
allow for dynamic clocks to be used in the future without any API changes. As of
the v3 RFC, the qdisc returns -EINVAL if a netlink application (i.e. tc) tries
to initialize it in 'raw' hw offload passing any clockid != CLOCKID_INVALID. The
skbuffs' clockid was initialized with the same value, so if the application sets
its value to any other valid clockids through the cmsg interface, the qdisc
would just drop the packets on enqueue() due to the mismatch.

In other words, dynamic clocks are currently not used at all.

(I noticed later that this was broken anyway because the definition of invalid
clockids from posix-timers.h is actually only valid for negative numbers.)

Given all the feedback against adding the clockid into struct sk_buff, for the
next version, we'll have to re-think this anyway now that clockid will be set
per socket (i.e. as an argument to the SO_TXTIME) and not per packet anymore.




> 
> I think this should look like this:
> 
> clock_adapter: 1 = clock of the network adapter
>                0 = system clock selected by clock_system
> 
> clock_system:  0 = CLOCK_REALTIME
>                1 = CLOCK_MONOTONIC
> 
> or something like that.
> 
>> Example 2:
>>
>> $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
>>map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
>>
>> $ tc qdisc add dev enp2s0 parent 100:1 tbs offload delta 10 \
>> clockid CLOCK_REALTIME sorting
>>
>> Here, the Qdisc will use HW offload for the txtime control again,
>> but now sorting will be enabled, and thus there will be scheduling being
>> performed by the qdisc. That is done based on the clockid CLOCK_REALTIME
>> reference and packets leave the Qdisc "delta" (10) nanoseconds before
>> their transmission time. Because this will be using HW offload and
>> since dynamic clocks are not supported by the hrtimer, the system clock
>> and the PHC clock must be synchronized for this mode to behave as expected.
> 
> So what you do here is queueing the packets in the qdisc and then schedule
> them at some point ahead of actual transmission time for delivery to the
> hardware. That delivery uses the same txtime as used for qdisc scheduling
> to tell the hardware when the packet should go on the wire. That's needed
> when the network adapter does not support queueing of multiple packets.
> 
> Bah, and probably there you need CLOCK_TAI because that's what PTP is based
> on, so clock_system needs to accommodate that as well. Dammit, there goes
> the simple 2 bits implementation. CLOCK_TAI is 11, so we'd need 4 clock
> bits plus the adapter bit.
> 
> Though we could spare a bit. The fixed CLOCK_* space goes from 0 to 15. I
> don't see us adding new fixed clocks, so we really can reserve #15 for
> selecting the adapter clock if sparing that extra bit is truly required.


So what about just using the previous single 'clockid' argument, but then just
adding to uapi time.h something like:

#define DYNAMIC_CLOCKID 15

And using it for that, instead. This way applications that will use the raw hw
offload mode must use this value for their per-socket clockid, and the qdisc's
clockid would be implicitly initialized to the same value.
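
To make the clockid ranges being discussed concrete, here is a small user-space sketch. It assumes the fd-based encoding used for dynamic POSIX clocks, (~fd << 3) | 3, as found in the kernel's PTP test program, and it treats DYNAMIC_CLOCKID as the value proposed above, not as an existing uapi define. The helper names are invented for this note.

```c
#include <assert.h>

#define DYNAMIC_CLOCKID 15	/* value proposed in this mail, not an existing define */

/* Encode a posix clock file descriptor as a dynamic clockid, mirroring the
 * (~fd << 3) | 3 convention. The arithmetic is done on unsigned values and
 * converted back, which yields a negative id on the usual two's complement
 * targets.
 */
int fd_to_clockid(int fd)
{
	assert(fd >= 0);
	return (int)((~(unsigned int)fd << 3) | 3u);
}

/* Hypothetical check: a fixed system clock (CLOCK_REALTIME, CLOCK_MONOTONIC,
 * CLOCK_TAI, ...) lives in 0..14 once 15 is reserved for the adapter clock.
 */
int is_system_clockid(int clockid)
{
	return clockid >= 0 && clockid < DYNAMIC_CLOCKID;
}

/* Hypothetical check: the socket asked for the adapter's own clock. */
int uses_adapter_clock(int clockid)
{
	return clockid == DYNAMIC_CLOCKID;
}
```

The point of the sketch is only that fd-derived ids are negative and therefore never collide with the 0..15 fixed space, so a single reserved value is enough to say "use the adapter clock" without passing a file descriptor around.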

What do you think?

Thanks,
Jesus



> 
> Thanks,
> 
>   tglx
> 
> 


Re: [RFC v3 net-next 14/18] net/sched: Add HW offloading capability to TBS

2018-03-22 Thread Jesus Sanchez-Palencia
Hi,


On 03/21/2018 09:18 AM, Thomas Gleixner wrote:
> On Wed, 21 Mar 2018, Richard Cochran wrote:
> 
>> On Wed, Mar 21, 2018 at 03:22:11PM +0100, Thomas Gleixner wrote:
>>> Which clockid will be handed in from the application? The network adapter
>>> time has no fixed clockid. The only way you can get to it is via a fd based
>>> posix clock and that does not work at all because the qdisc setup might
>>> have a different FD than the application which queues packets.
>>
>> Duh.  That explains it.  Please ignore my "why not?" Q in the other thread...
> 
> :)
> 
> So in that case you are either bound to rely on the application to use the
> proper dynamic clock or if we need a sanity check, then you need a cookie
> of some form which can be retrieved from the posix clock file descriptor
> and handed in as 'clockid' together with clock_adapter = true.
> 
> That's doable, but that needs a bit more trickery. A simple unique ID per
> dynamic posix-clock would be trivial to add, but that would not give you
> any form of verification whether this ID actually belongs to the network
> adapter or not.
> 
> So either you ignore the clockid and rely on the application not being
> stupid when it says "clock_adapter = true" or you need some extra
> complexity to build an association of a "clockid" to a network adapter.
> 
> There is a connection already, via
> 
>  adapter->ptp_clock->devid
> 
> which is MKDEV(major, index) which is accessible at least at the network
> driver level, but probably not from networking core. So you'd need to drill
> a few more holes by adding yet another callback to net_device_ops.
> 
> I'm not sure if it's worth the trouble. If the application hands in bogus
> timestamps, packets go out at the wrong time or are dropped. That's true
> whether it uses the proper clock or not. So nothing the kernel should
> really worry about.


+1 and that is the approach we've taken so far with the qdisc setting
"CLOCKID_INVALID" to its internal clockid for the "raw" (non-assisted) hw
offload case.

thanks,
Jesus



> 
> For clock_system - REAL/MONO/TAI(sigh) - you surely need a sanity check,
> but that is independent of the underlying network adapter even in the
> qdisc assisted HW offload case.
> 
> Thanks,
> 
>   tglx
> 
> 
> 
> 
> 
> 


Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc

2018-03-22 Thread Jesus Sanchez-Palencia
Hi Thomas,


On 03/21/2018 06:46 AM, Thomas Gleixner wrote:
> On Tue, 6 Mar 2018, Jesus Sanchez-Palencia wrote:
>> +struct tbs_sched_data {
>> +	bool sorting;
>> +	int clockid;
>> +	int queue;
>> +	s32 delta; /* in ns */
>> +	ktime_t last; /* The txtime of the last skb sent to the netdevice. */
>> +	struct rb_root head;
> 
> Hmm. You are reimplementing timerqueue open coded. Have you checked whether
> you could reuse the timerqueue implementation?
> 
> That requires to add a timerqueue node to struct skbuff
> 
> @@ -671,7 +671,8 @@ struct sk_buff {
>   unsigned long   dev_scratch;
>   };
>   };
> - struct rb_node  rbnode; /* used in netem & tcp stack */
> + struct rb_node  rbnode; /* used in netem & tcp stack */
> + struct timerqueue_node  tqnode;
>   };
>   struct sock *sk;
> 
> Then you can use timerqueue_head in your scheduler data and all the open
> coded rbtree handling goes away.


Yes, you are right. We actually looked into that for the first prototype of this
qdisc but we weren't so sure about adding the timerqueue node to the sk_buff's
union and whether it would impact the other usages here, but looking again now
it looks fine.

We'll fix for the next version, thanks.


> 
>> +static bool is_packet_valid(struct Qdisc *sch, struct sk_buff *nskb)
>> +{
>> +	struct tbs_sched_data *q = qdisc_priv(sch);
>> +	ktime_t txtime = nskb->tstamp;
>> +	struct sock *sk = nskb->sk;
>> +	ktime_t now;
>> +
>> +	if (sk && !sock_flag(sk, SOCK_TXTIME))
>> +		return false;
>> +
>> +	/* We don't perform crosstimestamping.
>> +	 * Drop if packet's clockid differs from qdisc's.
>> +	 */
>> +	if (nskb->txtime_clockid != q->clockid)
>> +		return false;
>> +
>> +	now = get_time_by_clockid(q->clockid);
> 
> If you store the time getter function pointer in tbs_sched_data then you
> avoid the lookup and just can do
> 
>    now = q->get_time();
> 
> That applies to lots of other places.


Good idea, thanks. Will fix.
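
For reference, the getter-pointer idea could look roughly like the user-space sketch below. This is not the qdisc code: struct sched_data, sched_data_init() and the two getters are stand-ins invented for this note, and clock_gettime() replaces the kernel's ktime accessors. Only CLOCK_REALTIME and CLOCK_MONOTONIC are handled here; CLOCK_TAI would follow the same pattern.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef int64_t (*get_time_fn)(void);

/* Read a clock and return its value in nanoseconds. */
static int64_t read_clock_ns(clockid_t id)
{
	struct timespec ts;

	clock_gettime(id, &ts);
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int64_t get_realtime_ns(void)
{
	return read_clock_ns(CLOCK_REALTIME);
}

int64_t get_monotonic_ns(void)
{
	return read_clock_ns(CLOCK_MONOTONIC);
}

/* Hypothetical scheduler state: the getter is resolved once at init. */
struct sched_data {
	clockid_t clockid;
	get_time_fn get_time;
};

/* Resolve the getter for the configured clockid; -1 if unsupported. */
int sched_data_init(struct sched_data *q, clockid_t clockid)
{
	assert(q != NULL);

	switch (clockid) {
	case CLOCK_REALTIME:
		q->get_time = get_realtime_ns;
		break;
	case CLOCK_MONOTONIC:
		q->get_time = get_monotonic_ns;
		break;
	default:
		return -1;
	}

	q->clockid = clockid;
	return 0;
}
```

After init, every hot-path caller just does now = q->get_time() and the clockid switch is paid once, at setup time, which is also a natural place to reject unsupported clockids.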



>> +
>> +static struct sk_buff *tbs_peek_timesortedlist(struct Qdisc *sch)
>> +{
>> +	struct tbs_sched_data *q = qdisc_priv(sch);
>> +	struct rb_node *p;
>> +
>> +	p = rb_first(&q->head);
> 
> timerqueue gives you direct access to the first expiring entry w/o walking
> the rbtree. So that would become:
> 
>   p = timerqueue_getnext(&q->tqhead);
>   return p ? rb_to_skb(p) : NULL;

OK.

(...)

>> +static struct sk_buff *tbs_dequeue_scheduledfifo(struct Qdisc *sch)
>> +{
>> +	struct tbs_sched_data *q = qdisc_priv(sch);
>> +	struct sk_buff *skb = tbs_peek(sch);
>> +	ktime_t now, next;
>> +
>> +	if (!skb)
>> +		return NULL;
>> +
>> +	now = get_time_by_clockid(q->clockid);
>> +
>> +	/* Drop if packet has expired while in queue and the drop_if_late
>> +	 * flag is set.
>> +	 */
>> +	if (skb->tc_drop_if_late && ktime_before(skb->tstamp, now)) {
>> +		struct sk_buff *to_free = NULL;
>> +
>> +		qdisc_queue_drop_head(sch, &to_free);
>> +		kfree_skb_list(to_free);
>> +		qdisc_qstats_overlimit(sch);
>> +
>> +		skb = NULL;
>> +		goto out;
> 
> Instead of going out immediately you should check the next skb whether its
> due for sending already.

We wanted to have a baseline before starting with the optimizations, so we left
this for a later patchset. It was one of the opens we had listed on the v2 cover
letter IIRC, but we'll look into it.
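
A minimal sketch of what that follow-up could look like, written as plain user-space C over a txtime-sorted array instead of the real qdisc structures. struct pkt, dequeue_due() and their fields are hypothetical names for this note only. The window test uses >= where the patch uses ktime_after(), which only differs exactly on the boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a queued packet. */
struct pkt {
	int64_t txtime_ns;
	int drop_if_late;
};

/* The queue is an array sorted by txtime; *head indexes the first packet
 * still queued. Returns the index of the packet to send now, or -1 if
 * nothing is due yet. Expired packets flagged drop_if_late are dropped and
 * the loop moves on to the next packet instead of bailing out.
 */
int dequeue_due(const struct pkt *q, size_t len, size_t *head,
		int64_t now_ns, int64_t delta_ns, unsigned int *dropped)
{
	assert(q != NULL && head != NULL && dropped != NULL);

	while (*head < len) {
		const struct pkt *p = &q[*head];
		int idx;

		if (p->drop_if_late && p->txtime_ns < now_ns) {
			/* Expired while queued: drop it and check the next. */
			(*head)++;
			(*dropped)++;
			continue;
		}

		/* Not yet inside the [txtime - delta, txtime] window. */
		if (now_ns < p->txtime_ns - delta_ns)
			return -1;

		idx = (int)*head;
		(*head)++;
		return idx;
	}

	return -1;
}
```

In the qdisc the same loop would still have to re-arm the watchdog for whatever packet ends up at the head once it returns.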


(...)


>> +	}
>> +
>> +	next = ktime_sub_ns(skb->tstamp, q->delta);
>> +
>> +	/* Dequeue only if now is within the [txtime - delta, txtime] range. */
>> +	if (ktime_after(now, next))
>> +		timesortedlist_erase(sch, skb, false);
>> +	else
>> +		skb = NULL;
>> +
>> +out:
>> +	/* Now we may need to re-arm the qdisc watchdog for the next packet. */
>> +	reset_watchdog(sch);
>> +
>> +	return skb;
>> +}
>> +
>> +static inline void setup_queueing_mode(struct tbs_sched_data *q)
>> +{
>> +	if (q->sorting) {
>> +		q->enqueue = tbs_enqueue_timesortedlist;
>> +		q->dequeue = tbs_dequeue_timesortedlist;
>> +		q->peek = tbs_peek_timesortedlist;
>> +	} else {
>> + 

Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc

2018-03-22 Thread Jesus Sanchez-Palencia
Hi Thomas,


On 03/21/2018 03:29 PM, Thomas Gleixner wrote:
> On Wed, 21 Mar 2018, Thomas Gleixner wrote:
>> If you look at the use cases of TDM in various fields then FIFO mode is
>> pretty much useless. In industrial/automotive fieldbus applications the
>> various time slices are filled by different threads or even processes.
> 
> That brings me to a related question. The TDM cases I'm familiar with which
> aim to use this utilize multiple periodic time slices, aka 802.1Qbv
> time-aware scheduling.
> 
> Simple example:
> 
> [1a][1b][1c][1d]                                [1a][1b][1c][1d][.....
>                 [2a][2b][2c][2d]
>                                 [3a][3b]
>                                         [4a][4b]
> ---------------------------------------------------------------------->
> t
> 
> where 1-4 is the slice level and a-d are network nodes.
> 
> In most cases the slice levels on a node are handled by different
> applications or threads. Some of the protocols utilize dedicated time slice
> levels - lets assume '4' in the above example - to run general network
> traffic which might even be allowed to have collisions, i.e. [4a-d] would
> become [4] and any node can send; the involved components like switches are
> supposed to handle that.
> 
> I'm not seeing how TBS is going to assist with any of that. It requires
> everything to be handled at the application level. Not really useful
> especially not for general traffic which does not know about the scheduling
> bands at all.
> 
> If you look at an industrial control node. It basically does:
> 
>   queue_first_packet(tx, slice1);
>   while (!stop) {
>           if (wait_for_packet(rx) == ERROR)
>                   goto errorhandling;
>           tx = do_computation(rx);
>           queue_next_tx(tx, slice1);
>   }
> 
> that's a pretty common pattern for these kind of applications. For audio
> sources queue_next() might be triggered by the input sampler which needs to
> be synchronized to the network slices anyway in order to work properly.
> 
> TBS per current implementation is nice as a proof of concept, but it solves
> just a small portion of the complete problem space. I have the suspicion
> that this was 'designed' to replace the user space hack in the AVNU stack
> with something close to it. Not really a good plan to be honest.
> 
> I think what we really want is a strict periodic scheduler which supports
> multiple slices as shown above because that's what all relevant TDM use
> cases need: A/V, industrial fieldbusses.
> 
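>   |---------------------------------------------------------|
>   |                                                         |
>   |                           TAS                           |<- Config
>   |    1               2               3               4    |
>   |---------------------------------------------------------|
>        |               |               |               |
>        |               |               |               |
>        |               |               |               |
>        |               |               |               |
>   [DirectSocket]   [Qdisc FIFO]   [Qdisc Prio]     [Qdisc FIFO]
>                        |               |               |
>                        |               |               |
>                    [Socket]        [Socket]     [General traffic]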
> 
> 
> The interesting thing here is that it does not require any time stamp
> information brought in from the application. That's especially good for
> general network traffic which is routed through a dedicated time slot. If
> we don't have that then we need a user space scheduler which does exactly
> the same thing and we have to route the general traffic out to user space
> and back into the kernel, which is obviously a pointless exercise.
> 
> There are all kind of TDM schemes out there which are not directly driven
> by applications, but rather route categorized traffic like VLANs through
> dedicated time slices. That works pretty well with the above scheme because
> in that case the applications might be completely oblivious about the tx
> time schedule.
> 
> Surely there are protocols which do not utilize every time slice they could
> use, so we need a way to tell the number of empty slices between two
> consecutive packets. There are also different policies vs. the unused time
> slices, like sending dummy frames or just nothing which wants to be
> addressed, but I don't think that changes the general approach.
> 
> There might be some special cases for setup or node hotplug, but the
> protocols I'm familiar with handle these in dedicated time slices or
> through general traffic so it should just fit in.
> 
> I'm surely missing some details, but from my knowledge about the protocols
> which want to utilize this, the general direction should be fine.
> 
> Feel free to tell me that I'm missing the point 

Re: [RFC v3 net-next 00/18] Time based packet transmission

2018-03-08 Thread Jesus Sanchez-Palencia
Hi,


On 03/08/2018 02:54 PM, Henrik Austad wrote:
> Just looking at the timestamp when the frames were received. They should be 
> sent at regular intervals if I read udp_tai.c correctly, so the assumption 
> was that the timestamp from tcpdump should give an inkling to how well it 
> worked.
> 
> I set it up to send a frame every 10ms and computed the diff between each 
> UDP packet received. Nothing fancy, just tcpdump and grep for the 
> timestamp and look at the distribution.

Ok, I see it now. Just as a reference, this is how I've been running tcpdump on
my tests:

$ tcpdump -i enp3s0 -w foo.pcap -j adapter_unsynced \
-tt --time-stamp-precision=nano udp port 7788 -c 1


> 
>>> I have to dig more into why this is happening, a lot of frames delayed much 
>>> more than I'd expect, but at this stage I'm pretty sure this is pebkac. One 
>>> obvious fix is move some hw around and do a direct link, but I didn't have 
>>> time for that right now.
>>>
>>> I'm very interested in doing what Richard's original test was when he used 
>>> ptp-synched clocks and also used hw receive-time and compared with expected 
>>> tx-time. So, while I'm getting that up and running, I thought I should 
>>> share the early results.
>>
>> Sure, thanks. Which delta and clockid are you using, please?
> 
> I used the example provided in -00,
> 
> tc qdisc replace dev eth2 parent root handle 100 mqprio num_tc 3 \
>  map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
> 
> tc qdisc add dev eth2 parent 100:1 tbs offload delta 10 clockid \
>  CLOCK_REALTIME sorting


The delta value is highly dependent on the system. I recommend playing around
with it a bit before running long tests. On my KabyLake desktop I noticed that
150us is quite a reliable value, for example. (same kernel as yours, and no
preempt-rt applied) But that is not the issue here it seems.



> 
>> Also, was this clock synchronized to the PHC? You need that for hw offload
>> with sorting enabled.
> 
> Hmm, good point, no, NIC clock was not synchronized, I'll do that in the 
> next round for both sender and receiver!

Oh, then you need to get that setup first. Here I synchronize both PHCs over the
network first with ptp4l:

Rx) $ ptp4l --summary_interval=3 -i enp3s0 -m -2
Tx) $ ptp4l --summary_interval=3 -i enp3s0 -s -m -2 &

My Rx is the PTP master and the Tx is the PTP slave.
Then I synchronize the PHC to the system clock on the Tx side only:

Tx) $ phc2sys -a -r -r -u 8 &


And udp_tai is using CLOCK_REALTIME. The UTC vs TAI 37s offset makes no
difference for this test specifically because I compensate for it when
calculating the offsets on the Rx side.

For the next patchset version I will be providing a more complete set of testing
instructions. I hope that helps for now.


Thanks,
Jesus






Re: [RFC v3 net-next 00/18] Time based packet transmission

2018-03-08 Thread Jesus Sanchez-Palencia
Hi,


On 03/08/2018 06:09 AM, Henrik Austad wrote:

(...)

> 
> A lot of new knobs, I see the need, I would've liked to have fewer, but 
> you've documented them pretty well. Perhaps we should add something to 
> Documentation/ at one stage?

Sure. The idea is working on that once the interfaces have been accepted.


> 
> Anyways, the patches applied cleanly so I gave them a (very) quick spin. 
> Using udp_tai and tcpdump in the other end to grab the frames
> 
> Setting up with hw offload and sorting in qdisc.
> 
> Sender (every 10ms) (4.16-rc4 on a core2duo 1.8Ghz w/i210 and max_rss 
> bypass as dual-core and i210 is not friends):
> 
> udp_tai -c1 -i eth2 -p 20 -P 1000
> 
> Receiver (imx7, kernel 4.9.11):
> chrt -r 20 tcpdump -i eth0 ether host a0:36:9f:3f:c0:b8 | grep "UDP, length 
> 256" > tai_imx7.log
> 
> Note: this involves 2 switches and a somewhat hackish kernel running on the 
> receiver, so these numbers can only improve.
> 
> count    2340.00
> mean     0.043770
> std      0.047784
> min      0.009025
> 25%      0.010003
> 50%      0.010010
> 75%      0.109998
> max      0.120060
> 

Thanks for giving it a shot.

But I'm not sure I follow the numbers above, sorry :/
Are you computing the packet's Rx timestamp offset from the (expected) Tx time?


> I have to dig more into why this is happening, a lot of frames delayed much 
> more than I'd expect, but at this stage I'm pretty sure this is pebkac. One 
> obvious fix is move some hw around and do a direct link, but I didn't have 
> time for that right now.
> 
> I'm very interested in doing what Richard's original test was when he used 
> ptp-synched clocks and also used hw receive-time and compared with expected 
> tx-time. So, while I'm getting that up and running, I thought I should 
> share the early results.


Sure, thanks. Which delta and clockid are you using, please?
Also, was this clock synchronized to the PHC? You need that for hw offload with
sorting enabled.

Thanks,
Jesus

(...)



Re: [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params

2018-03-08 Thread Jesus Sanchez-Palencia
Hi,


On 03/08/2018 08:44 AM, Richard Cochran wrote:
> On Wed, Mar 07, 2018 at 09:47:40AM -0800, Eric Dumazet wrote:
>> I would love if skb->tstamp could be either 0 or expressed in
>> ktime_get() base all the time.
>>
>> ( Even if we would have to convert this to other bases when/if needed)
> 
> We really do need variable clock IDs.  Otherwise the HW offloading
> case won't work.  The desired transmit time must be expressed in terms
> of the clock inside the MAC.  This clock is not necessarily related to
> the system time at all.
> 
> But in addition to the performance concerns, I think putting this into
> a socket option is the more natural solution.


Ok, so we have it settled for clockid now. Providing it per-socket was what we'd
proposed previously, so this was just an attempt to accommodate all the feedback
we got on the v2 RFC.

What about the tc_drop_if_late bit, though? Would it be acceptable to keep it
per-packet, thus eating the 1-bit hole from skbuff if we would #if guard it
(e.g. with CONFIG_NET_SCH_TBS)?


Thanks,
Jesus


Re: [RFC v3 iproute2 3/3] tc: Add support for the TBS Qdisc

2018-03-07 Thread Jesus Sanchez-Palencia
Hi,


On 03/06/2018 05:51 PM, Stephen Hemminger wrote:
> On Tue,  6 Mar 2018 17:16:08 -0800
> Jesus Sanchez-Palencia <jesus.sanchez-palen...@intel.com> wrote:
> 
>> +static int tbs_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
>> +{
>> +struct rtattr *tb[TCA_TBS_MAX+1];
>> +struct tc_tbs_qopt *qopt;
>> +
>> +if (opt == NULL)
>> +return 0;
>> +
>> +parse_rtattr_nested(tb, TCA_TBS_MAX, opt);
>> +
>> +if (tb[TCA_TBS_PARMS] == NULL)
>> +return -1;
>> +
>> +qopt = RTA_DATA(tb[TCA_TBS_PARMS]);
>> +if (RTA_PAYLOAD(tb[TCA_TBS_PARMS])  < sizeof(*qopt))
>> +return -1;
>> +
>> +fprintf(f, "clockid ");
>> +if (qopt->clockid == CLOCKID_INVALID)
>> +fprintf(f, "invalid ");
>> +else
>> +fprintf(f, "%d ", qopt->clockid);
>> +
>> +fprintf(f, "delta %d ", qopt->delta);
>> +fprintf(f, "offload %s ", (qopt->flags & TC_TBS_OFFLOAD_ON) ?
>> +"on" : "off");
>> +fprintf(f, "sorting %s", (qopt->flags & TC_TBS_SORTING_ON) ?
>> +"on" : "off");
>> +
>> +return 0;
>> +}
> 
> All new print code in iproute2 should support JSON output.
> Look at other code using json_print.h for simple way to handle this.
> 


Fixed, thanks. I'm assuming that only applies to print code from print_qopt()
implementations. Please let me know if otherwise.



Re: [RFC v3 net-next 02/18] net: Clear skb->tstamp only on the forwarding path

2018-03-07 Thread Jesus Sanchez-Palencia


On 03/07/2018 08:59 AM, Willem de Bruijn wrote:
> On Tue, Mar 6, 2018 at 8:12 PM, Jesus Sanchez-Palencia
> <jesus.sanchez-palen...@intel.com> wrote:
>> This is done in preparation for the upcoming time based transmission
>> patchset. Now that skb->tstamp will be used to hold packet's txtime,
>> we must ensure that it is being cleared when traversing namespaces.
>> Also, doing that from skb_scrub_packet() would break our feature when
>> tunnels are used.
> 
> Then the right location to move to is skb_scrub_packet below the test for 
> xnet.

Fixed, thanks.



> 
>> Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palen...@intel.com>
>> ---
>>  include/linux/netdevice.h | 1 +
>>  net/core/skbuff.c | 1 -
>>  2 files changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index dbe6344b727a..7104de2bc957 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -3379,6 +3379,7 @@ static __always_inline int dev_forward_skb(struct 
>> net_device *dev,
>>
>> skb_scrub_packet(skb, true);
>> skb->priority = 0;
>> +   skb->tstamp = 0;
>> return 0;
>>  }
>>
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index 715c13495ba6..678fc5416ae1 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -4865,7 +4865,6 @@ EXPORT_SYMBOL(skb_try_coalesce);
>>   */
>>  void skb_scrub_packet(struct sk_buff *skb, bool xnet)
>>  {
>> -   skb->tstamp = 0;
>> skb->pkt_type = PACKET_HOST;
>> skb->skb_iif = 0;
>> skb->ignore_df = 0;
>> --
>> 2.16.2
>>


Re: [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params

2018-03-07 Thread Jesus Sanchez-Palencia
Hi,


On 03/06/2018 06:53 PM, Eric Dumazet wrote:
> On Tue, 2018-03-06 at 17:12 -0800, Jesus Sanchez-Palencia wrote:
>> Extend SO_TXTIME APIs with new per-packet parameters: a clockid_t and
>> a drop_if_late flag. With this commit the API becomes:
>>
>>
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index d8340e6e8814..951969ceaf65 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -788,6 +788,9 @@ struct sk_buff {
>      __u8        tc_redirected:1;
>      __u8        tc_from_ingress:1;
>  #endif
> +    __u8        tc_drop_if_late:1;
> +
> +    clockid_t   txtime_clockid;
>
>  #ifdef CONFIG_NET_SCHED
>      __u16       tc_index;   /* traffic control index */
> 
> 
> This is adding 32+1 bits to sk_buff, and possibly holes in this very
> very hot (and already too fat) structure.

I should have mentioned on the commit msg, but the tc_drop_if_late is actually
filling a 1 bit hole that was already there.


> 
> Do we really need 32 bits for a clockid_t ?

There is a 2 bytes hole just after tc_index, so a u16 clockid would fit
perfectly without increasing the skbuffs size / cachelines any further.

From Richard's reply, it seems safe to just change the definition here if we
make it explicit on the SCM_CLOCKID documentation the caveat about the max
possible fd count for dynamic clocks.

How does that sound?

Thanks,
Jesus
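
To make the "max possible fd count" caveat concrete, here is a minimal sketch of the arithmetic, not code from the series. It reuses the fd-to-clockid encoding quoted elsewhere in this thread and checks which file descriptors still yield a clockid that survives a 16-bit field. It assumes the 16-bit value would be sign-extended back into a clockid_t, which the mail does not spell out; the names fd_to_clockid32() and clockid_fits_s16() are made up for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CLOCKFD 3 /* low bits that mark an fd-based (dynamic) clock */

/* Same arithmetic as the fd_to_clockid() helper copied from posix-timers.h,
 * done in unsigned 32-bit math and then reinterpreted as a signed value. */
static int32_t fd_to_clockid32(int fd)
{
	assert(fd >= 0);
	return (int32_t)((~(uint32_t)fd << 3) | CLOCKFD);
}

/* True if the dynamic clockid derived from fd fits a signed 16-bit field
 * (assumption: the narrow field is sign-extended when read back). */
static bool clockid_fits_s16(int fd)
{
	int32_t id = fd_to_clockid32(fd);

	return id >= INT16_MIN && id <= INT16_MAX;
}
```

Under that assumption the encoding is -8 * (fd + 1) + 3, so fd 4095 is the largest descriptor whose clockid still fits and fd 4096 is the first that does not.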



[PATCH net-next] sock: Fix SO_ZEROCOPY switch case

2018-03-07 Thread Jesus Sanchez-Palencia
Fix the SO_ZEROCOPY switch case in sock_setsockopt() so that its ret value
is not overwritten by the one set in the default case.

Fixes: 28190752c7092 ("sock: permit SO_ZEROCOPY on PF_RDS socket")
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palen...@intel.com>
Acked-by: Willem de Bruijn <will...@google.com>
---
 net/core/sock.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 507d8c6c4319..27f218bba43f 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1062,8 +1062,9 @@ int sock_setsockopt(struct socket *sock, int level, int 
optname,
ret = -EINVAL;
else
sock_valbool_flag(sk, SOCK_ZEROCOPY, valbool);
-   break;
}
+   break;
+
default:
ret = -ENOPROTOOPT;
break;
-- 
2.16.2
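
For readers who want to see the effect outside the kernel, the following is a reduced userspace sketch of the control flow, not the kernel code itself. It assumes that an earlier check in the same case can leave ret non-zero before the block shown in the hunk is reached; OPT_ZEROCOPY, zerocopy_flag, the "supported" parameter and the use of EOPNOTSUPP are stand-ins invented for the illustration.

```c
#include <assert.h>
#include <errno.h>

#define OPT_ZEROCOPY 60   /* hypothetical stand-in for SO_ZEROCOPY */

static int zerocopy_flag; /* stand-in for the SOCK_ZEROCOPY socket flag */

/* Before the fix: "break" lives inside the inner block, so any path that
 * skips the block runs into "default" and loses its error code. */
static int setopt_before(int optname, int supported, int val)
{
	int ret = supported ? 0 : -EOPNOTSUPP;

	assert(optname >= 0);
	switch (optname) {
	case OPT_ZEROCOPY:
		if (!ret) {
			if (val < 0 || val > 1)
				ret = -EINVAL;
			else
				zerocopy_flag = val;
			break;
		}
		/* fall through */
	default:
		ret = -ENOPROTOOPT;
		break;
	}
	return ret;
}

/* After the fix: "break" follows the block, so the case always ends here. */
static int setopt_after(int optname, int supported, int val)
{
	int ret = supported ? 0 : -EOPNOTSUPP;

	assert(optname >= 0);
	switch (optname) {
	case OPT_ZEROCOPY:
		if (!ret) {
			if (val < 0 || val > 1)
				ret = -EINVAL;
			else
				zerocopy_flag = val;
		}
		break;

	default:
		ret = -ENOPROTOOPT;
		break;
	}
	return ret;
}
```

With supported == 0, the first variant reports -ENOPROTOOPT while the second keeps the error that the earlier check produced.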



[RFC v3 iproute2 2/3] uapi pkt_sched: Add tbs info - DO NOT COMMIT

2018-03-06 Thread Jesus Sanchez-Palencia
This should come from the next uapi headers update.
Sending it now just as a convenience so anyone can build tc with tbs
support.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palen...@intel.com>
---
 include/uapi/linux/pkt_sched.h | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 37b5096a..92af9fa4 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -934,4 +934,22 @@ enum {
 
 #define TCA_CBS_MAX (__TCA_CBS_MAX - 1)
 
+
+/* TBS */
+struct tc_tbs_qopt {
+   __s32 delta;
+   __s32 clockid;
+   __u32 flags;
+#define TC_TBS_SORTING_ON BIT(0)
+#define TC_TBS_OFFLOAD_ON BIT(1)
+};
+
+enum {
+   TCA_TBS_UNSPEC,
+   TCA_TBS_PARMS,
+   __TCA_TBS_MAX,
+};
+
+#define TCA_TBS_MAX (__TCA_TBS_MAX - 1)
+
 #endif
-- 
2.16.2
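
As a quick illustration of how the new struct and flag bits are meant to be read, here is a standalone sketch that mirrors the layout above with <stdint.h> types and renders the options in the same order tbs_print_opt() uses. BIT() is defined locally because it is not available to plain userspace code, and tbs_describe() is a hypothetical helper, not part of iproute2.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BIT(n) (1U << (n))
#define TC_TBS_SORTING_ON BIT(0)
#define TC_TBS_OFFLOAD_ON BIT(1)

/* Userspace mirror of the struct added by the patch (__s32/__u32 swapped
 * for the <stdint.h> equivalents). */
struct tc_tbs_qopt {
	int32_t  delta;
	int32_t  clockid;
	uint32_t flags;
};

/* Render the options in the order tbs_print_opt() prints them. Returns a
 * pointer to a static buffer, so this helper is not reentrant. */
static const char *tbs_describe(int32_t delta, int32_t clockid, uint32_t flags)
{
	static char buf[96];
	struct tc_tbs_qopt q = { .delta = delta, .clockid = clockid, .flags = flags };
	int n;

	n = snprintf(buf, sizeof(buf),
		     "clockid %" PRId32 " delta %" PRId32 " offload %s sorting %s",
		     q.clockid, q.delta,
		     (q.flags & TC_TBS_OFFLOAD_ON) ? "on" : "off",
		     (q.flags & TC_TBS_SORTING_ON) ? "on" : "off");
	assert(n > 0 && (size_t)n < sizeof(buf));
	assert(strlen(buf) == (size_t)n);
	return buf;
}
```

For example, tbs_describe(100000, 11, TC_TBS_OFFLOAD_ON) describes a qdisc using clockid 11 (CLOCK_TAI on Linux) with offload on and sorting off.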



[RFC v3 iproute2 3/3] tc: Add support for the TBS Qdisc

2018-03-06 Thread Jesus Sanchez-Palencia
From: Vinicius Costa Gomes <vinicius.go...@intel.com>

The Time Based Scheduler (TBS) queueing discipline allows precise
control of the transmission time of packets.

The syntax is:

tc qdisc add dev DEV parent NODE tbs delta NANOS
 clockid CLOCKID [offload] [sorting]

Signed-off-by: Vinicius Costa Gomes <vinicius.go...@intel.com>
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palen...@intel.com>
---
 tc/Makefile |   1 +
 tc/q_tbs.c  | 200 
 2 files changed, 201 insertions(+)
 create mode 100644 tc/q_tbs.c

diff --git a/tc/Makefile b/tc/Makefile
index 3716dd6a..3c87b0dc 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -71,6 +71,7 @@ TCMODULES += q_clsact.o
 TCMODULES += e_bpf.o
 TCMODULES += f_matchall.o
 TCMODULES += q_cbs.o
+TCMODULES += q_tbs.o
 
 TCSO :=
 ifeq ($(TC_CONFIG_ATM),y)
diff --git a/tc/q_tbs.c b/tc/q_tbs.c
new file mode 100644
index ..b0823dc9
--- /dev/null
+++ b/tc/q_tbs.c
@@ -0,0 +1,200 @@
+/*
+ * q_tbs.c TBS.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors:Vinicius Costa Gomes <vinicius.go...@intel.com>
+ *     Jesus Sanchez-Palencia <jesus.sanchez-palen...@intel.com>
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "utils.h"
+#include "tc_util.h"
+
+/* clockid is invalid if bits 0, 1, 2 are set as described by posix-timers.h */
+#define CLOCKID_INVALID (BIT(0) | BIT(1) | BIT(2))
+#define PTP_MAX_DEV_PATH 16
+
+/* fd to clockid helpers. Copied from posix-timers.h. */
+#define CLOCKFD 3
+static inline clockid_t make_process_cpuclock(const unsigned int pid,
+const clockid_t clock)
+{
+return ((~pid) << 3) | clock;
+}
+
+static inline clockid_t fd_to_clockid(const int fd)
+{
+return make_process_cpuclock((unsigned int) fd, CLOCKFD);
+}
+
+static void explain(void)
+{
+   fprintf(stderr, "Usage: ... tbs delta NANOS clockid CLOCKID [offload] 
[sorting]\n");
+   fprintf(stderr, "CLOCKID must be a valid SYS-V id (i.e. CLOCK_TAI) or \
+a dynamic clock (i.e. /dev/ptp0).\n");
+}
+
+static void explain1(const char *arg, const char *val)
+{
+   fprintf(stderr, "tbs: illegal value for \"%s\": \"%s\"\n", arg, val);
+}
+
+static void explain_clockid(const char *val)
+{
+   fprintf(stderr, "tbs: illegal value for \"clockid\": \"%s\".\n", val);
+   fprintf(stderr, "It must be a valid SYS-V id (i.e. CLOCK_TAI) or "\
+   "dynamic clock (i.e. /dev/ptp0).\n");
+}
+
+static int get_clockid(__s32 *val, const char *arg)
+{
+   const struct static_clockid {
+   const char *name;
+   clockid_t clockid;
+   } clockids_sysv[] = {
+   { "CLOCK_REALTIME", CLOCK_REALTIME },
+   { "CLOCK_TAI", CLOCK_TAI },
+   { "CLOCK_BOOTTIME", CLOCK_BOOTTIME },
+   { "CLOCK_MONOTONIC", CLOCK_MONOTONIC },
+   { NULL }
+   };
+
+   struct ptp_clock_caps capabilities;
+   char ptp_path[PTP_MAX_DEV_PATH];
+   const struct static_clockid *c;
+   int fd_ptp;
+
+   for (c = clockids_sysv; c->name; c++) {
+   if (strncasecmp(c->name, arg, 25) == 0) {
+   *val = c->clockid;
+
+   return 0;
+   }
+   }
+
+   snprintf(ptp_path, sizeof(ptp_path), "%s", arg);
+   fd_ptp = open(ptp_path, O_RDONLY);
+
+   /* Make sure the path provided points to a PTP chardev. */
+   if (fd_ptp < 0 || ioctl(fd_ptp, PTP_CLOCK_GETCAPS, &capabilities) < 0) {
+   return -1;
+   }
+
+   *val = fd_to_clockid(fd_ptp);
+   return 0;
+}
+
+
+static int tbs_parse_opt(struct qdisc_util *qu, int argc,
+char **argv, struct nlmsghdr *n, const char *dev)
+{
+   struct tc_tbs_qopt opt = {
+   .clockid = CLOCKID_INVALID,
+   };
+   struct rtattr *tail;
+
+   while (argc > 0) {
+   if (matches(*argv, "offload") == 0) {
+   if (opt.flags & TC_TBS_OFFLOAD_ON) {
+   fprintf(stderr, "tbs: duplicate \"offload\" 
specification\n");
+   return -1;
+   }
+
+   opt.flags |= TC_TBS_OFFLOAD_ON;
+   } else if (matches(*argv, "sorting") == 0) {
+   if (opt.flags 

[RFC v3 iproute2 1/3] include: Add ptp_clock.h to linux uapi

2018-03-06 Thread Jesus Sanchez-Palencia
This header will be used by the new tc-tbs qdisc.
It was copied from kernel tag 4.16.0-rc2.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palen...@intel.com>
---
 include/uapi/linux/ptp_clock.h | 147 +
 1 file changed, 147 insertions(+)
 create mode 100644 include/uapi/linux/ptp_clock.h

diff --git a/include/uapi/linux/ptp_clock.h b/include/uapi/linux/ptp_clock.h
new file mode 100644
index ..3039bf6a
--- /dev/null
+++ b/include/uapi/linux/ptp_clock.h
@@ -0,0 +1,147 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+/*
+ * PTP 1588 clock support - user space interface
+ *
+ * Copyright (C) 2010 OMICRON electronics GmbH
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+#ifndef _PTP_CLOCK_H_
+#define _PTP_CLOCK_H_
+
+#include <linux/ioctl.h>
+#include <linux/types.h>
+
+/* PTP_xxx bits, for the flags field within the request structures. */
+#define PTP_ENABLE_FEATURE (1<<0)
+#define PTP_RISING_EDGE    (1<<1)
+#define PTP_FALLING_EDGE   (1<<2)
+
+/*
+ * struct ptp_clock_time - represents a time value
+ *
+ * The sign of the seconds field applies to the whole value. The
+ * nanoseconds field is always unsigned. The reserved field is
+ * included for sub-nanosecond resolution, should the demand for
+ * this ever appear.
+ *
+ */
+struct ptp_clock_time {
+   __s64 sec;  /* seconds */
+   __u32 nsec; /* nanoseconds */
+   __u32 reserved;
+};
+
+struct ptp_clock_caps {
+   int max_adj;   /* Maximum frequency adjustment in parts per billon. */
+   int n_alarm;   /* Number of programmable alarms. */
+   int n_ext_ts;  /* Number of external time stamp channels. */
+   int n_per_out; /* Number of programmable periodic signals. */
+   int pps;   /* Whether the clock supports a PPS callback. */
+   int n_pins;/* Number of input/output pins. */
+   /* Whether the clock supports precise system-device cross timestamps */
+   int cross_timestamping;
+   int rsv[13];   /* Reserved for future use. */
+};
+
+struct ptp_extts_request {
+   unsigned int index;  /* Which channel to configure. */
+   unsigned int flags;  /* Bit field for PTP_xxx flags. */
+   unsigned int rsv[2]; /* Reserved for future use. */
+};
+
+struct ptp_perout_request {
+   struct ptp_clock_time start;  /* Absolute start time. */
+   struct ptp_clock_time period; /* Desired period, zero means disable. */
+   unsigned int index;   /* Which channel to configure. */
+   unsigned int flags;   /* Reserved for future use. */
+   unsigned int rsv[4];  /* Reserved for future use. */
+};
+
+#define PTP_MAX_SAMPLES 25 /* Maximum allowed offset measurement samples. */
+
+struct ptp_sys_offset {
+   unsigned int n_samples; /* Desired number of measurements. */
+   unsigned int rsv[3];/* Reserved for future use. */
+   /*
+* Array of interleaved system/phc time stamps. The kernel
+* will provide 2*n_samples + 1 time stamps, with the last
+* one as a system time stamp.
+*/
+   struct ptp_clock_time ts[2 * PTP_MAX_SAMPLES + 1];
+};
+
+struct ptp_sys_offset_precise {
+   struct ptp_clock_time device;
+   struct ptp_clock_time sys_realtime;
+   struct ptp_clock_time sys_monoraw;
+   unsigned int rsv[4];/* Reserved for future use. */
+};
+
+enum ptp_pin_function {
+   PTP_PF_NONE,
+   PTP_PF_EXTTS,
+   PTP_PF_PEROUT,
+   PTP_PF_PHYSYNC,
+};
+
+struct ptp_pin_desc {
+   /*
+* Hardware specific human readable pin name. This field is
+* set by the kernel during the PTP_PIN_GETFUNC ioctl and is
+* ignored for the PTP_PIN_SETFUNC ioctl.
+*/
+   char name[64];
+   /*
+* Pin index in the range of zero to ptp_clock_caps.n_pins - 1.
+*/
+   unsigned int index;
+   /*
+* Which of the PTP_PF_xxx functions to use on this pin.
+*/
+   unsigned int func;
+   /*
+* The specific channel to use for this function.
+* This corresponds to the 'index' field of the
+* PTP_EXTTS_REQUEST and PTP_PEROUT_REQUEST ioctls.
+*/
+   unsigned int chan;
+   /*
+* Reserved for future use.
+*/
+   unsigned int rsv[5];
+};

[RFC v3 net-next 05/18] net: ipv4: raw: Hook into time based transmission.

2018-03-06 Thread Jesus Sanchez-Palencia
From: Richard Cochran <rcoch...@linutronix.de>

For raw packets, copy the desired future transmit time from the CMSG
cookie into the skb.

Signed-off-by: Richard Cochran <rcoch...@linutronix.de>
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palen...@intel.com>
---
 net/ipv4/raw.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 54648d20bf0f..8e05970ba7c4 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -381,6 +381,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 
*fl4,
 
skb->priority = sk->sk_priority;
skb->mark = sk->sk_mark;
+   skb->tstamp = sockc->transmit_time;
skb_dst_set(skb, &rt->dst);
*rtp = NULL;
 
@@ -562,6 +563,7 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
}
 
ipc.sockc.tsflags = sk->sk_tsflags;
+   ipc.sockc.transmit_time = 0;
ipc.addr = inet->inet_saddr;
ipc.opt = NULL;
ipc.tx_flags = 0;
-- 
2.16.2
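
For context on the user-space side of this hook, below is a small sketch of how an application might build the control message that carries the transmit time. It only constructs and re-parses the cmsg; the setsockopt() call that enables the option and the actual sendmsg() are left out, since their exact arguments are not shown in this excerpt. The fallback value 61 is the asm-generic number from the series, and txtime_attach() and txtime_roundtrip() are names invented for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SO_TXTIME
#define SO_TXTIME 61          /* asm-generic value used by the series */
#endif
#ifndef SCM_TXTIME
#define SCM_TXTIME SO_TXTIME
#endif

/* Control buffer, aligned for struct cmsghdr as suggested by cmsg(3). */
static union {
	char buf[CMSG_SPACE(sizeof(uint64_t))];
	struct cmsghdr align;
} txtime_ctrl;

/* Attach one SCM_TXTIME control message carrying txtime_ns to msg. */
static void txtime_attach(struct msghdr *msg, uint64_t txtime_ns)
{
	struct cmsghdr *cm;

	assert(msg != NULL);
	memset(&txtime_ctrl, 0, sizeof(txtime_ctrl));
	msg->msg_control = txtime_ctrl.buf;
	msg->msg_controllen = sizeof(txtime_ctrl.buf);

	cm = CMSG_FIRSTHDR(msg);
	cm->cmsg_level = SOL_SOCKET;
	cm->cmsg_type = SCM_TXTIME;
	cm->cmsg_len = CMSG_LEN(sizeof(txtime_ns));
	memcpy(CMSG_DATA(cm), &txtime_ns, sizeof(txtime_ns));
}

/* Build a message, then walk its control data the way a consumer of the
 * cookie would; returns the txtime found, or 0 if none is present. */
static uint64_t txtime_roundtrip(uint64_t txtime_ns)
{
	struct msghdr msg;
	struct cmsghdr *cm;
	uint64_t out = 0;

	memset(&msg, 0, sizeof(msg));
	txtime_attach(&msg, txtime_ns);

	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
		if (cm->cmsg_level == SOL_SOCKET && cm->cmsg_type == SCM_TXTIME)
			memcpy(&out, CMSG_DATA(cm), sizeof(out));
	}
	return out;
}
```

In a real sender the msghdr filled by txtime_attach() would also carry the payload iovec and destination address before being handed to sendmsg(2).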



[RFC v3 net-next 03/18] posix-timers: Add CLOCKID_INVALID mask

2018-03-06 Thread Jesus Sanchez-Palencia
posix-timers.h states that a clockid_t value is invalid if bits 0, 1 and
2 are all set. Add a mask that can be safely used elsewhere even if this
implicit rule's implementation is changed.

This is done in preparation for the upcoming time based transmission
patchset.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palen...@intel.com>
---
 include/linux/posix-timers.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index c85704fcdbd2..0ba677cc8da6 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -28,6 +28,7 @@ struct cpu_timer_list {
  *
  * A clockid is invalid if bits 2, 1, and 0 are all set.
  */
+#define CLOCKID_INVALID    GENMASK(2, 0)
 #define CPUCLOCK_PID(clock)    ((pid_t) ~((clock) >> 3))
 #define CPUCLOCK_PERTHREAD(clock) \
(((clock) & (clockid_t) CPUCLOCK_PERTHREAD_MASK) != 0)
-- 
2.16.2
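
As a sanity check on the value of the new mask, the sketch below re-creates GENMASK() and BIT() for userspace (both are local stand-ins, since the kernel macros are not available there) and confirms that GENMASK(2, 0) equals the BIT(0) | BIT(1) | BIT(2) form used by the tc patch in this series. clockid_is_unset() is a hypothetical helper that mirrors the equality test in tbs_print_opt().

```c
#include <assert.h>
#include <stdbool.h>

#define BIT(n) (1U << (n))
/* Userspace stand-in for the kernel's GENMASK(h, l); fine for 0 <= l <= h < 31. */
#define GENMASK(h, l) (((1U << ((h) - (l) + 1)) - 1U) << (l))

#define CLOCKID_INVALID GENMASK(2, 0)

/* The mask and the spelled-out form used in q_tbs.c are the same value. */
static_assert(CLOCKID_INVALID == (BIT(0) | BIT(1) | BIT(2)), "bits 2..0 set");

/* Equality test against the sentinel, as tbs_print_opt() does. */
static bool clockid_is_unset(int clockid)
{
	return clockid == (int)CLOCKID_INVALID;
}
```

One thing worth keeping in mind: on Linux the numeric value 7 is also CLOCK_BOOTTIME, so a plain equality test like this cannot tell the two apart. Whether that matters for the series is not discussed in this excerpt.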



[RFC v3 net-next 00/18] Time based packet transmission

2018-03-06 Thread Jesus Sanchez-Palencia
duling being
performed by the qdisc. That is done based on the clockid CLOCK_REALTIME
and packets leave the Qdisc "delta" (10) nanoseconds before
their transmission time. Because this will be using HW offload and
since dynamic clocks are not supported by the hrtimer, the system clock
and the PHC clock must be synchronized for this mode to behave as expected.


For testing, we've followed a similar approach from the v1 and v2 testing and
no significant changes on the results were observed. An updated version of
udp_tai.c is attached to this cover letter.

Lastly, most of the to-dos we still have before a final patchset are related
to further testing of the igb support:
 - testing with L2 only talkers + AF_PACKET sockets;
 - testing tbs in conjunction with cbs;

Thanks for all the feedback so far,
Jesus


Jesus Sanchez-Palencia (12):
  sock: Fix SO_ZEROCOPY switch case
  net: Clear skb->tstamp only on the forwarding path
  posix-timers: Add CLOCKID_INVALID mask
  net: SO_TXTIME: Add clockid and drop_if_late params
  net: ipv4: raw: Handle remaining txtime parameters
  net: ipv4: udp: Handle remaining txtime parameters
  net: packet: Handle remaining txtime parameters
  net/sched: Add HW offloading capability to TBS
  igb: Refactor igb_configure_cbs()
  igb: Only change Tx arbitration when CBS is on
  igb: Refactor igb_offload_cbs()
  igb: Add support for TBS offload

Richard Cochran (4):
  net: Add a new socket option for a future transmit time.
  net: ipv4: raw: Hook into time based transmission.
  net: ipv4: udp: Hook into time based transmission.
  net: packet: Hook into time based transmission.

Vinicius Costa Gomes (2):
  net/sched: Allow creating a Qdisc watchdog with other clocks
  net/sched: Introduce the TBS Qdisc

 arch/alpha/include/uapi/asm/socket.h   |   5 +
 arch/frv/include/uapi/asm/socket.h |   5 +
 arch/ia64/include/uapi/asm/socket.h|   5 +
 arch/m32r/include/uapi/asm/socket.h|   5 +
 arch/mips/include/uapi/asm/socket.h|   5 +
 arch/mn10300/include/uapi/asm/socket.h |   5 +
 arch/parisc/include/uapi/asm/socket.h  |   5 +
 arch/s390/include/uapi/asm/socket.h|   5 +
 arch/sparc/include/uapi/asm/socket.h   |   5 +
 arch/xtensa/include/uapi/asm/socket.h  |   5 +
 drivers/net/ethernet/intel/igb/e1000_defines.h |  16 +
 drivers/net/ethernet/intel/igb/igb.h   |   1 +
 drivers/net/ethernet/intel/igb/igb_main.c  | 239 +++---
 include/linux/netdevice.h  |   2 +
 include/linux/posix-timers.h   |   1 +
 include/linux/skbuff.h |   3 +
 include/net/pkt_sched.h|   7 +
 include/net/sock.h |   4 +
 include/uapi/asm-generic/socket.h  |   5 +
 include/uapi/linux/pkt_sched.h |  18 +
 net/core/skbuff.c  |   1 -
 net/core/sock.c|  44 +-
 net/ipv4/raw.c |   7 +
 net/ipv4/udp.c |  10 +-
 net/packet/af_packet.c |  19 +
 net/sched/Kconfig  |  11 +
 net/sched/Makefile |   1 +
 net/sched/sch_api.c|  11 +-
 net/sched/sch_tbs.c| 591 +
 29 files changed, 978 insertions(+), 63 deletions(-)
 create mode 100644 net/sched/sch_tbs.c

-- 
2.16.2

---8<---
/*
 * This program demonstrates transmission of UDP packets using the
 * system TAI timer.
 *
 * Copyright (C) 2017 linutronix GmbH
 *
 * Large portions taken from the linuxptp stack.
 * Copyright (C) 2011, 2012 Richard Cochran <richardcoch...@gmail.com>
 *
 * Some portions taken from the sgd test program.
 * Copyright (C) 2015 linutronix GmbH
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License along
 * with this program; if not, write to the Free Software Foundation, Inc.,
 * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
 */
#define _GNU_SOURCE /*for CPU_SET*/
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#define DEFAULT_PERIOD  100
#define DEFAULT_DELAY   50
#define MCAST_IPADDR    "239.1.1.1"
#define

[RFC v3 net-next 02/18] net: Clear skb->tstamp only on the forwarding path

2018-03-06 Thread Jesus Sanchez-Palencia
This is done in preparation for the upcoming time based transmission
patchset. Now that skb->tstamp will be used to hold packet's txtime,
we must ensure that it is being cleared when traversing namespaces.
Also, doing that from skb_scrub_packet() would break our feature when
tunnels are used.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palen...@intel.com>
---
 include/linux/netdevice.h | 1 +
 net/core/skbuff.c | 1 -
 2 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index dbe6344b727a..7104de2bc957 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3379,6 +3379,7 @@ static __always_inline int dev_forward_skb(struct 
net_device *dev,
 
skb_scrub_packet(skb, true);
skb->priority = 0;
+   skb->tstamp = 0;
return 0;
 }
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 715c13495ba6..678fc5416ae1 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4865,7 +4865,6 @@ EXPORT_SYMBOL(skb_try_coalesce);
  */
 void skb_scrub_packet(struct sk_buff *skb, bool xnet)
 {
-   skb->tstamp = 0;
skb->pkt_type = PACKET_HOST;
skb->skb_iif = 0;
skb->ignore_df = 0;
-- 
2.16.2



[RFC v3 net-next 04/18] net: Add a new socket option for a future transmit time.

2018-03-06 Thread Jesus Sanchez-Palencia
From: Richard Cochran <rcoch...@linutronix.de>

This patch introduces SO_TXTIME.  User space enables this option in
order to pass a desired future transmit time in a CMSG when calling
sendmsg(2).

A new field is added to struct sockcm_cookie, and the tstamp from
skbuffs will be used later on.

Signed-off-by: Richard Cochran <rcoch...@linutronix.de>
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palen...@intel.com>
---
 arch/alpha/include/uapi/asm/socket.h   |  3 +++
 arch/frv/include/uapi/asm/socket.h |  3 +++
 arch/ia64/include/uapi/asm/socket.h|  3 +++
 arch/m32r/include/uapi/asm/socket.h|  3 +++
 arch/mips/include/uapi/asm/socket.h|  3 +++
 arch/mn10300/include/uapi/asm/socket.h |  3 +++
 arch/parisc/include/uapi/asm/socket.h  |  3 +++
 arch/s390/include/uapi/asm/socket.h|  3 +++
 arch/sparc/include/uapi/asm/socket.h   |  3 +++
 arch/xtensa/include/uapi/asm/socket.h  |  3 +++
 include/net/sock.h |  2 ++
 include/uapi/asm-generic/socket.h  |  3 +++
 net/core/sock.c| 21 +
 13 files changed, 56 insertions(+)

diff --git a/arch/alpha/include/uapi/asm/socket.h 
b/arch/alpha/include/uapi/asm/socket.h
index be14f16149d5..065fb372e355 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -112,4 +112,7 @@
 
 #define SO_ZEROCOPY  60
 
+#define SO_TXTIME  61
+#define SCM_TXTIME SO_TXTIME
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/frv/include/uapi/asm/socket.h 
b/arch/frv/include/uapi/asm/socket.h
index 9168e78fa32a..0e95f45cd058 100644
--- a/arch/frv/include/uapi/asm/socket.h
+++ b/arch/frv/include/uapi/asm/socket.h
@@ -105,5 +105,8 @@
 
 #define SO_ZEROCOPY  60
 
+#define SO_TXTIME  61
+#define SCM_TXTIME SO_TXTIME
+
 #endif /* _ASM_SOCKET_H */
 
diff --git a/arch/ia64/include/uapi/asm/socket.h 
b/arch/ia64/include/uapi/asm/socket.h
index 3efba40adc54..c872c4e6bafb 100644
--- a/arch/ia64/include/uapi/asm/socket.h
+++ b/arch/ia64/include/uapi/asm/socket.h
@@ -114,4 +114,7 @@
 
 #define SO_ZEROCOPY  60
 
+#define SO_TXTIME  61
+#define SCM_TXTIME SO_TXTIME
+
 #endif /* _ASM_IA64_SOCKET_H */
diff --git a/arch/m32r/include/uapi/asm/socket.h 
b/arch/m32r/include/uapi/asm/socket.h
index cf5018e82c3d..65276c95b8df 100644
--- a/arch/m32r/include/uapi/asm/socket.h
+++ b/arch/m32r/include/uapi/asm/socket.h
@@ -105,4 +105,7 @@
 
 #define SO_ZEROCOPY  60
 
+#define SO_TXTIME  61
+#define SCM_TXTIME SO_TXTIME
+
 #endif /* _ASM_M32R_SOCKET_H */
diff --git a/arch/mips/include/uapi/asm/socket.h 
b/arch/mips/include/uapi/asm/socket.h
index 49c3d4795963..71370fb3ceef 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -123,4 +123,7 @@
 
 #define SO_ZEROCOPY  60
 
+#define SO_TXTIME  61
+#define SCM_TXTIME SO_TXTIME
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/mn10300/include/uapi/asm/socket.h 
b/arch/mn10300/include/uapi/asm/socket.h
index b35eee132142..d029a40b1b55 100644
--- a/arch/mn10300/include/uapi/asm/socket.h
+++ b/arch/mn10300/include/uapi/asm/socket.h
@@ -105,4 +105,7 @@
 
 #define SO_ZEROCOPY  60
 
+#define SO_TXTIME  61
+#define SCM_TXTIME SO_TXTIME
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/parisc/include/uapi/asm/socket.h 
b/arch/parisc/include/uapi/asm/socket.h
index 1d0fdc3b5d22..061b9cf2a779 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -104,4 +104,7 @@
 
 #define SO_ZEROCOPY  0x4035
 
+#define SO_TXTIME  0x4036
+#define SCM_TXTIME SO_TXTIME
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/s390/include/uapi/asm/socket.h 
b/arch/s390/include/uapi/asm/socket.h
index 3510c0fd06f4..39d901476ee5 100644
--- a/arch/s390/include/uapi/asm/socket.h
+++ b/arch/s390/include/uapi/asm/socket.h
@@ -111,4 +111,7 @@
 
 #define SO_ZEROCOPY  60
 
+#define SO_TXTIME  61
+#define SCM_TXTIME SO_TXTIME
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/sparc/include/uapi/asm/socket.h 
b/arch/sparc/include/uapi/asm/socket.h
index d58520c2e6ff..7ea35e5601b6 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -101,6 +101,9 @@
 
 #define SO_ZEROCOPY  0x003e
 
+#define SO_TXTIME  0x003f
+#define SCM_TXTIME SO_TXTIME
+
 /* Security levels - as per NRL IPv6 - don't actually do anything */
 #define SO_SECURITY_AUTHENTICATION 0x5001
 #define SO_SECURITY_ENCRYPTION_TRANSPORT   0x5002
diff --git a/arch/xtensa/include/uapi/asm/socket.h 
b/arch/xtensa/include/uapi/asm/socket.h
index 75a07b8119a9..1de07a7f7680 100644
--- a/arch/xtensa/include/uapi/asm/socket.h
+++ b/arch/xtensa/include/uapi/asm/socket.h
@@ -116,4 +1

[RFC v3 net-next 06/18] net: ipv4: udp: Hook into time based transmission.

2018-03-06 Thread Jesus Sanchez-Palencia
From: Richard Cochran <rcoch...@linutronix.de>

For udp packets, copy the desired future transmit time from the CMSG
cookie into the skb.

Signed-off-by: Richard Cochran <rcoch...@linutronix.de>
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palen...@intel.com>
---
 net/ipv4/udp.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 3013404d0935..d683bbde526b 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -926,6 +926,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t 
len)
}
 
ipc.sockc.tsflags = sk->sk_tsflags;
+   ipc.sockc.transmit_time = 0;
ipc.addr = inet->inet_saddr;
ipc.oif = sk->sk_bound_dev_if;
 
@@ -1040,8 +1041,10 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
  sizeof(struct udphdr), &ipc, &rt,
  msg->msg_flags);
err = PTR_ERR(skb);
-   if (!IS_ERR_OR_NULL(skb))
+   if (!IS_ERR_OR_NULL(skb)) {
+   skb->tstamp = ipc.sockc.transmit_time;
err = udp_send_skb(skb, fl4);
+   }
goto out;
}
 
-- 
2.16.2



[RFC v3 net-next 17/18] igb: Refactor igb_offload_cbs()

2018-03-06 Thread Jesus Sanchez-Palencia
Split code into a separate function (igb_offload_apply()) that will be
used by TBS offload implementation.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palen...@intel.com>
---
 drivers/net/ethernet/intel/igb/igb_main.c | 23 ++-
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index 9c33f2d18d8c..10d7809a85d7 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2476,6 +2476,19 @@ igb_features_check(struct sk_buff *skb, struct 
net_device *dev,
return features;
 }
 
+static void igb_offload_apply(struct igb_adapter *adapter, s32 queue)
+{
+   if (!is_fqtss_enabled(adapter)) {
+   enable_fqtss(adapter, true);
+   return;
+   }
+
+   igb_config_tx_modes(adapter, queue);
+
+   if (!is_any_cbs_enabled(adapter))
+   enable_fqtss(adapter, false);
+}
+
 static int igb_offload_cbs(struct igb_adapter *adapter,
   struct tc_cbs_qopt_offload *qopt)
 {
@@ -2496,15 +2509,7 @@ static int igb_offload_cbs(struct igb_adapter *adapter,
if (err)
return err;
 
-   if (is_fqtss_enabled(adapter)) {
-   igb_config_tx_modes(adapter, qopt->queue);
-
-   if (!is_any_cbs_enabled(adapter))
-   enable_fqtss(adapter, false);
-
-   } else {
-   enable_fqtss(adapter, true);
-   }
+   igb_offload_apply(adapter, qopt->queue);
 
return 0;
 }
-- 
2.16.2



[RFC v3 net-next 07/18] net: packet: Hook into time based transmission.

2018-03-06 Thread Jesus Sanchez-Palencia
From: Richard Cochran <rcoch...@linutronix.de>

For raw layer-2 packets, copy the desired future transmit time from
the CMSG cookie into the skb.

Signed-off-by: Richard Cochran <rcoch...@linutronix.de>
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palen...@intel.com>
---
 net/packet/af_packet.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 2c5a6fe5d749..b2115fac2a8d 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1976,6 +1976,7 @@ static int packet_sendmsg_spkt(struct socket *sock, 
struct msghdr *msg,
goto out_unlock;
}
 
+   sockc.transmit_time = 0;
sockc.tsflags = sk->sk_tsflags;
if (msg->msg_controllen) {
err = sock_cmsg_send(sk, msg, &sockc);
@@ -1987,6 +1988,7 @@ static int packet_sendmsg_spkt(struct socket *sock, 
struct msghdr *msg,
skb->dev = dev;
skb->priority = sk->sk_priority;
skb->mark = sk->sk_mark;
+   skb->tstamp = sockc.transmit_time;
 
sock_tx_timestamp(sk, sockc.tsflags, &skb_shinfo(skb)->tx_flags);
 
@@ -2484,6 +2486,7 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
skb->dev = dev;
skb->priority = po->sk.sk_priority;
skb->mark = po->sk.sk_mark;
+   skb->tstamp = sockc->transmit_time;
sock_tx_timestamp(&po->sk, sockc->tsflags, &skb_shinfo(skb)->tx_flags);
skb_shinfo(skb)->destructor_arg = ph.raw;
 
@@ -2660,6 +2663,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
if (unlikely(!(dev->flags & IFF_UP)))
goto out_put;
 
+   sockc.transmit_time = 0;
sockc.tsflags = po->sk.sk_tsflags;
if (msg->msg_controllen) {
err = sock_cmsg_send(&po->sk, msg, &sockc);
@@ -2856,6 +2860,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
if (unlikely(!(dev->flags & IFF_UP)))
goto out_unlock;
 
+   sockc.transmit_time = 0;
sockc.tsflags = sk->sk_tsflags;
sockc.mark = sk->sk_mark;
if (msg->msg_controllen) {
@@ -2928,6 +2933,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
skb->dev = dev;
skb->priority = sk->sk_priority;
skb->mark = sockc.mark;
+   skb->tstamp = sockc.transmit_time;
 
if (has_vnet_hdr) {
err = virtio_net_hdr_to_skb(skb, &vnet_hdr, vio_le());
-- 
2.16.2



[RFC v3 net-next 12/18] net/sched: Allow creating a Qdisc watchdog with other clocks

2018-03-06 Thread Jesus Sanchez-Palencia
From: Vinicius Costa Gomes 

This adds 'qdisc_watchdog_init_clockid()', which takes a clockid so that
time references other than CLOCK_MONOTONIC can be used when scheduling
the Qdisc to run.
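
The shape of this change can be sketched in userspace as follows. This is
only an illustrative model, not kernel code: 'struct watchdog_model' and
both helper names are hypothetical stand-ins, and no timer is armed. It
shows the new initializer taking the clock while the old entry point
delegates to it with CLOCK_MONOTONIC, so existing callers are unaffected.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical userspace stand-in for struct qdisc_watchdog. */
struct watchdog_model {
	clockid_t clockid;	/* clock the timer would be armed against */
	void *owner;		/* stands in for the struct Qdisc pointer */
};

/* Models qdisc_watchdog_init_clockid(): the caller picks the clock. */
static void watchdog_model_init_clockid(struct watchdog_model *wd, void *owner,
					clockid_t clockid)
{
	assert(wd != NULL);
	wd->clockid = clockid;
	wd->owner = owner;
}

/* Models qdisc_watchdog_init(): existing callers keep CLOCK_MONOTONIC. */
static void watchdog_model_init(struct watchdog_model *wd, void *owner)
{
	watchdog_model_init_clockid(wd, owner, CLOCK_MONOTONIC);
}
```

Keeping the old function as a thin wrapper means no existing Qdisc has to
be touched by this patch.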

Signed-off-by: Vinicius Costa Gomes 
---
 include/net/pkt_sched.h |  2 ++
 net/sched/sch_api.c | 11 +--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 815b92a23936..2466ea143d01 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -72,6 +72,8 @@ struct qdisc_watchdog {
struct Qdisc*qdisc;
 };
 
+void qdisc_watchdog_init_clockid(struct qdisc_watchdog *wd, struct Qdisc *qdisc,
+                                clockid_t clockid);
 void qdisc_watchdog_init(struct qdisc_watchdog *wd, struct Qdisc *qdisc);
 void qdisc_watchdog_schedule_ns(struct qdisc_watchdog *wd, u64 expires);
 
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 68f9d942bed4..beb1dc296bfb 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -596,12 +596,19 @@ static enum hrtimer_restart qdisc_watchdog(struct hrtimer *timer)
return HRTIMER_NORESTART;
 }
 
-void qdisc_watchdog_init(struct qdisc_watchdog *wd, struct Qdisc *qdisc)
+void qdisc_watchdog_init_clockid(struct qdisc_watchdog *wd, struct Qdisc *qdisc,
+                                clockid_t clockid)
 {
-   hrtimer_init(&wd->timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
+   hrtimer_init(&wd->timer, clockid, HRTIMER_MODE_ABS_PINNED);
wd->timer.function = qdisc_watchdog;
wd->qdisc = qdisc;
 }
+EXPORT_SYMBOL(qdisc_watchdog_init_clockid);
+
+void qdisc_watchdog_init(struct qdisc_watchdog *wd, struct Qdisc *qdisc)
+{
+   qdisc_watchdog_init_clockid(wd, qdisc, CLOCK_MONOTONIC);
+}
 EXPORT_SYMBOL(qdisc_watchdog_init);
 
 void qdisc_watchdog_schedule_ns(struct qdisc_watchdog *wd, u64 expires)
-- 
2.16.2



[RFC v3 net-next 18/18] igb: Add support for TBS offload

2018-03-06 Thread Jesus Sanchez-Palencia
Implement HW offload support for SO_TXTIME through igb's Launchtime
feature. This is done by extending igb_setup_tc() so it supports
TC_SETUP_QDISC_TBS and configuring i210 so time based transmit
arbitration is enabled.

The FQTSS transmission mode added before is extended so strict
priority (SP) queues wait for stream reservation (SR) ones.
igb_config_tx_modes() is extended so it can support enabling/disabling
Launchtime following the previous approach used for the credit-based
shaper (CBS).

As in the previous flow, FQTSS transmission mode is enabled automatically
by the driver once Launchtime (or CBS, as before) is enabled.
Similarly, it's automatically disabled when the feature is disabled
for the last queue that had it set up.
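
The enable/disable rule above can be modelled with a small sketch. It is
a simplified model under assumptions, not the driver code: the struct and
function names are hypothetical, the queue count of 2 is picked for
illustration, and register programming is left out. It only shows that
FQTSS mode follows "any queue has CBS or Launchtime enabled".

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NUM_QUEUES 2	/* hypothetical number of Qav-capable queues */

struct ring_model {
	bool cbs_enable;
	bool launchtime_enable;
};

struct adapter_model {
	struct ring_model ring[MODEL_NUM_QUEUES];
	bool fqtss;	/* true while the adapter is in FQTSS mode */
};

/* True if CBS or Launchtime is enabled on at least one queue. */
static bool any_feature_enabled(const struct adapter_model *a)
{
	for (int i = 0; i < MODEL_NUM_QUEUES; i++) {
		if (a->ring[i].cbs_enable || a->ring[i].launchtime_enable)
			return true;
	}
	return false;
}

/* Record the new Launchtime state for a queue, then re-evaluate FQTSS. */
static void set_launchtime(struct adapter_model *a, int queue, bool enable)
{
	assert(queue >= 0 && queue < MODEL_NUM_QUEUES);
	a->ring[queue].launchtime_enable = enable;
	a->fqtss = any_feature_enabled(a);
}
```

In this model, disabling Launchtime on one queue leaves FQTSS on for as
long as another queue still uses Launchtime or CBS.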

The driver just consumes the transmit times from the skbuffs directly,
so no special handling is done in case an 'invalid' time is provided.
We assume this has been handled by the TBS qdisc already.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palen...@intel.com>
---
 drivers/net/ethernet/intel/igb/e1000_defines.h |  16 +++
 drivers/net/ethernet/intel/igb/igb.h   |   1 +
 drivers/net/ethernet/intel/igb/igb_main.c  | 135 ++---
 3 files changed, 137 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/e1000_defines.h b/drivers/net/ethernet/intel/igb/e1000_defines.h
index 83cabff1e0ab..9e357848c550 100644
--- a/drivers/net/ethernet/intel/igb/e1000_defines.h
+++ b/drivers/net/ethernet/intel/igb/e1000_defines.h
@@ -1066,6 +1066,22 @@
 #define E1000_TQAVCTRL_XMIT_MODE   BIT(0)
 #define E1000_TQAVCTRL_DATAFETCHARB BIT(4)
 #define E1000_TQAVCTRL_DATATRANARB BIT(8)
+#define E1000_TQAVCTRL_DATATRANTIM BIT(9)
+#define E1000_TQAVCTRL_SP_WAIT_SR  BIT(10)
+/* Fetch Time Delta - bits 31:16
+ *
+ * This field holds the value to be reduced from the launch time for
+ * fetch time decision. The FetchTimeDelta value is defined in 32 ns
+ * granularity.
+ *
+ * This field is 16 bits wide, and so the maximum value is:
+ *
+ * 65535 * 32 = 2097120 ~= 2.1 msec
+ *
+ * XXX: We are configuring the max value here since we couldn't come up
+ * with a reason for not doing so.
+ */
+#define E1000_TQAVCTRL_FETCHTIME_DELTA (0xFFFF << 16)
 
 /* TX Qav Credit Control fields */
 #define E1000_TQAVCC_IDLESLOPE_MASK    0xFFFF
diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h
index 1c6b8d9176a8..4e1146efa399 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -281,6 +281,7 @@ struct igb_ring {
u16 count;  /* number of desc. in the ring */
u8 queue_index; /* logical index of the ring*/
u8 reg_idx; /* physical index of the ring */
+   bool launchtime_enable; /* true if LaunchTime is enabled */
bool cbs_enable;/* indicates if CBS is enabled */
s32 idleslope;  /* idleSlope in kbps */
s32 sendslope;  /* sendSlope in kbps */
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 10d7809a85d7..fa931f66a1f8 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -1684,13 +1684,26 @@ static bool is_any_cbs_enabled(struct igb_adapter *adapter)
return false;
 }
 
+static bool is_any_txtime_enabled(struct igb_adapter *adapter)
+{
+   int i;
+
+   for (i = 0; i < adapter->num_tx_queues; i++) {
+   if (adapter->tx_ring[i]->launchtime_enable)
+   return true;
+   }
+
+   return false;
+}
+
 /**
  *  igb_config_tx_modes - Configure "Qav Tx mode" features on igb
  *  @adapter: pointer to adapter struct
  *  @queue: queue number
  *
- *  Configure CBS for a given hardware queue. Parameters are retrieved
- *  from the correct Tx ring, so igb_save_cbs_params() should be used
+ *  Configure CBS and Launchtime for a given hardware queue.
+ *  Parameters are retrieved from the correct Tx ring, so
+ *  igb_save_cbs_params() and igb_save_txtime_params() should be used
  *  for setting those correctly prior to this function being called.
  **/
 static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
@@ -1704,10 +1717,20 @@ static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
WARN_ON(hw->mac.type != e1000_i210);
WARN_ON(queue < 0 || queue > 1);
 
-   if (ring->cbs_enable) {
+   /* If any of the Qav features is enabled, configure queues as SR and
+* with HIGH PRIO. If none is, then configure them with LOW PRIO and
+* as SP.
+*/
+   if (ring->cbs_enable || ring->launchtime_enable) {
set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_HIGH);
set_queue_mode(hw, q

[RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params

2018-03-06 Thread Jesus Sanchez-Palencia
Extend SO_TXTIME APIs with new per-packet parameters: a clockid_t and
a drop_if_late flag. With this commit the API becomes:

- use SO_TXTIME to enable the feature on a socket;
- pass the per-packet arguments through the cmsg header using:
  * SCM_CLOCKID for the clockid to be used as the txtime clock source;
  * SCM_TXTIME for the txtime timestamp;
  * SCM_DROP_IF_LATE for the drop flag. This flag will be used by the
traffic control to decide if a delayed packet should be dropped.
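
From userspace, the per-packet part of this API could be exercised
roughly as sketched below. This is a hedged illustration only: the
RFC_SCM_* names are local stand-ins carrying the 61/62/63 values this RFC
uses on most architectures (SCM_DROP_IF_LATE and SCM_CLOCKID exist only
in this proposal), and the payload types (u64 txtime, clockid_t, u8 flag)
are assumptions that are not visible in the truncated diff below. The
helper names are hypothetical too.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>

/* Local stand-ins for the cmsg types proposed by this RFC. */
#define RFC_SCM_TXTIME		61
#define RFC_SCM_DROP_IF_LATE	62
#define RFC_SCM_CLOCKID		63

/* Room for the three control messages (payload sizes are assumed). */
#define TXTIME_CTRL_LEN (CMSG_SPACE(sizeof(uint64_t)) + \
			 CMSG_SPACE(sizeof(clockid_t)) + \
			 CMSG_SPACE(sizeof(uint8_t)))

/* Write one SOL_SOCKET control message at 'cm'; return the next slot. */
static struct cmsghdr *append_cmsg(struct msghdr *msg, struct cmsghdr *cm,
				   int type, const void *data, size_t len)
{
	cm->cmsg_level = SOL_SOCKET;
	cm->cmsg_type = type;
	cm->cmsg_len = CMSG_LEN(len);
	memcpy(CMSG_DATA(cm), data, len);
	return CMSG_NXTHDR(msg, cm);
}

/* Attach txtime, clockid and drop flag to 'msg'. 'ctrl' must be suitably
 * aligned for struct cmsghdr and hold at least TXTIME_CTRL_LEN bytes.
 */
static void fill_txtime_cmsgs(struct msghdr *msg, void *ctrl, uint64_t txtime,
			      clockid_t clockid, uint8_t drop_if_late)
{
	struct cmsghdr *cm;

	/* Zero first: CMSG_NXTHDR() inspects the header that follows. */
	memset(ctrl, 0, TXTIME_CTRL_LEN);
	msg->msg_control = ctrl;
	msg->msg_controllen = TXTIME_CTRL_LEN;

	cm = CMSG_FIRSTHDR(msg);
	cm = append_cmsg(msg, cm, RFC_SCM_TXTIME, &txtime, sizeof(txtime));
	cm = append_cmsg(msg, cm, RFC_SCM_CLOCKID, &clockid, sizeof(clockid));
	(void)append_cmsg(msg, cm, RFC_SCM_DROP_IF_LATE, &drop_if_late,
			  sizeof(drop_if_late));
}
```

The filled msghdr would then be handed to sendmsg() on a socket that had
SO_TXTIME enabled through setsockopt(); that step is omitted here.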

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palen...@intel.com>
---
 arch/alpha/include/uapi/asm/socket.h   |  2 ++
 arch/frv/include/uapi/asm/socket.h |  2 ++
 arch/ia64/include/uapi/asm/socket.h|  2 ++
 arch/m32r/include/uapi/asm/socket.h|  2 ++
 arch/mips/include/uapi/asm/socket.h|  2 ++
 arch/mn10300/include/uapi/asm/socket.h |  2 ++
 arch/parisc/include/uapi/asm/socket.h  |  2 ++
 arch/s390/include/uapi/asm/socket.h|  2 ++
 arch/sparc/include/uapi/asm/socket.h   |  2 ++
 arch/xtensa/include/uapi/asm/socket.h  |  2 ++
 include/linux/skbuff.h |  3 +++
 include/net/sock.h |  2 ++
 include/uapi/asm-generic/socket.h  |  2 ++
 net/core/sock.c| 22 +-
 14 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h
index 065fb372e355..3399dfefa579 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -114,5 +114,7 @@
 
 #define SO_TXTIME  61
 #define SCM_TXTIME SO_TXTIME
+#define SCM_DROP_IF_LATE   62
+#define SCM_CLOCKID63
 
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/frv/include/uapi/asm/socket.h b/arch/frv/include/uapi/asm/socket.h
index 0e95f45cd058..43b636836722 100644
--- a/arch/frv/include/uapi/asm/socket.h
+++ b/arch/frv/include/uapi/asm/socket.h
@@ -107,6 +107,8 @@
 
 #define SO_TXTIME  61
 #define SCM_TXTIME SO_TXTIME
+#define SCM_DROP_IF_LATE   62
+#define SCM_CLOCKID63
 
 #endif /* _ASM_SOCKET_H */
 
diff --git a/arch/ia64/include/uapi/asm/socket.h b/arch/ia64/include/uapi/asm/socket.h
index c872c4e6bafb..1f06d07aadbe 100644
--- a/arch/ia64/include/uapi/asm/socket.h
+++ b/arch/ia64/include/uapi/asm/socket.h
@@ -116,5 +116,7 @@
 
 #define SO_TXTIME  61
 #define SCM_TXTIME SO_TXTIME
+#define SCM_DROP_IF_LATE   62
+#define SCM_CLOCKID63
 
 #endif /* _ASM_IA64_SOCKET_H */
diff --git a/arch/m32r/include/uapi/asm/socket.h b/arch/m32r/include/uapi/asm/socket.h
index 65276c95b8df..69ab380d8d48 100644
--- a/arch/m32r/include/uapi/asm/socket.h
+++ b/arch/m32r/include/uapi/asm/socket.h
@@ -107,5 +107,7 @@
 
 #define SO_TXTIME  61
 #define SCM_TXTIME SO_TXTIME
+#define SCM_DROP_IF_LATE   62
+#define SCM_CLOCKID63
 
 #endif /* _ASM_M32R_SOCKET_H */
diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h
index 71370fb3ceef..97da79f58538 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -125,5 +125,7 @@
 
 #define SO_TXTIME  61
 #define SCM_TXTIME SO_TXTIME
+#define SCM_DROP_IF_LATE   62
+#define SCM_CLOCKID63
 
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/mn10300/include/uapi/asm/socket.h b/arch/mn10300/include/uapi/asm/socket.h
index d029a40b1b55..7c7a174fdfae 100644
--- a/arch/mn10300/include/uapi/asm/socket.h
+++ b/arch/mn10300/include/uapi/asm/socket.h
@@ -107,5 +107,7 @@
 
 #define SO_TXTIME  61
 #define SCM_TXTIME SO_TXTIME
+#define SCM_DROP_IF_LATE   62
+#define SCM_CLOCKID63
 
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h
index 061b9cf2a779..7fe86b5cd593 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -106,5 +106,7 @@
 
 #define SO_TXTIME  0x4036
 #define SCM_TXTIME SO_TXTIME
+#define SCM_DROP_IF_LATE   0x4037
+#define SCM_CLOCKID0x4038
 
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/s390/include/uapi/asm/socket.h b/arch/s390/include/uapi/asm/socket.h
index 39d901476ee5..97f90c4a9b8c 100644
--- a/arch/s390/include/uapi/asm/socket.h
+++ b/arch/s390/include/uapi/asm/socket.h
@@ -113,5 +113,7 @@
 
 #define SO_TXTIME  61
 #define SCM_TXTIME SO_TXTIME
+#define SCM_DROP_IF_LATE   62
+#define SCM_CLOCKID63
 
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h
index 7ea35e5601b6..6397c366dd2d 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -103,6 +103,8 @@
 
 #define SO_TXTIME  0x003f
 #define SCM_TXTIME SO_TXTIME
+#

[RFC v3 net-next 16/18] igb: Only change Tx arbitration when CBS is on

2018-03-06 Thread Jesus Sanchez-Palencia
Currently the data transmission arbitration algorithm - DataTranARB
field on TQAVCTRL reg - is always set to CBS when the Tx mode is
changed from legacy to 'Qav' mode.

Make that configuration a bit more granular in preparation for the
upcoming Launchtime enabling patches, since CBS and Launchtime can be
enabled separately. That is achieved by moving the DataTranARB setup
to igb_config_tx_modes() instead.

Similarly, when disabling CBS we must check if it has been disabled
for all queues, and clear the DataTranARB accordingly.
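
The intended DataTranARB handling can be sketched with a small model. It
is an illustration under assumptions rather than the driver code: the
register is a plain variable, the names are hypothetical, and only the
bit position (BIT(8)) is taken from e1000_defines.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_DATATRANARB (UINT32_C(1) << 8)	/* BIT(8) in TQAVCTRL */
#define MODEL_CBS_QUEUES 2	/* hypothetical number of CBS-capable queues */

struct cbs_model {
	bool cbs_enable[MODEL_CBS_QUEUES];
	uint32_t tqavctrl;	/* stands in for the TQAVCTRL register */
};

/* True if CBS is still enabled on at least one queue. */
static bool any_cbs_enabled(const struct cbs_model *m)
{
	for (int i = 0; i < MODEL_CBS_QUEUES; i++) {
		if (m->cbs_enable[i])
			return true;
	}
	return false;
}

/* Set the arbitration bit when CBS is enabled on a queue; clear it only
 * once no queue has CBS enabled anymore. Other bits are left untouched.
 */
static void config_cbs(struct cbs_model *m, int queue, bool enable)
{
	assert(queue >= 0 && queue < MODEL_CBS_QUEUES);
	m->cbs_enable[queue] = enable;

	if (enable)
		m->tqavctrl |= MODEL_DATATRANARB;
	else if (!any_cbs_enabled(m))
		m->tqavctrl &= ~MODEL_DATATRANARB;
}
```

Read-modify-write on the register keeps the Tx mode bit, which is still
programmed by igb_setup_tx_mode(), independent of the arbitration bit.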

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palen...@intel.com>
---
 drivers/net/ethernet/intel/igb/igb_main.c | 49 +--
 1 file changed, 33 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 49cfbe4fd2b1..9c33f2d18d8c 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -1672,6 +1672,18 @@ static void set_queue_mode(struct e1000_hw *hw, int queue, enum queue_mode mode)
wr32(E1000_I210_TQAVCC(queue), val);
 }
 
+static bool is_any_cbs_enabled(struct igb_adapter *adapter)
+{
+   int i;
+
+   for (i = 0; i < adapter->num_tx_queues; i++) {
+   if (adapter->tx_ring[i]->cbs_enable)
+   return true;
+   }
+
+   return false;
+}
+
 /**
  *  igb_config_tx_modes - Configure "Qav Tx mode" features on igb
  *  @adapter: pointer to adapter struct
@@ -1686,7 +1698,7 @@ static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
struct igb_ring *ring = adapter->tx_ring[queue];
struct net_device *netdev = adapter->netdev;
struct e1000_hw *hw = &adapter->hw;
-   u32 tqavcc;
+   u32 tqavcc, tqavctrl;
u16 value;
 
WARN_ON(hw->mac.type != e1000_i210);
@@ -1696,6 +1708,14 @@ static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_HIGH);
set_queue_mode(hw, queue, QUEUE_MODE_STREAM_RESERVATION);
 
+   /* Always set data transfer arbitration to credit-based
+* shaper algorithm on TQAVCTRL if CBS is enabled for any of
+* the queues.
+*/
+   tqavctrl = rd32(E1000_I210_TQAVCTRL);
+   tqavctrl |= E1000_TQAVCTRL_DATATRANARB;
+   wr32(E1000_I210_TQAVCTRL, tqavctrl);
+
/* According to i210 datasheet section 7.2.7.7, we should set
 * the 'idleSlope' field from TQAVCC register following the
 * equation:
@@ -1773,6 +1793,16 @@ static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
 
/* Set hiCredit to zero. */
wr32(E1000_I210_TQAVHC(queue), 0);
+
+   /* If CBS is not enabled for any queues anymore, then return to
+* the default state of Data Transmission Arbitration on
+* TQAVCTRL.
+*/
+   if (!is_any_cbs_enabled(adapter)) {
+   tqavctrl = rd32(E1000_I210_TQAVCTRL);
+   tqavctrl &= ~E1000_TQAVCTRL_DATATRANARB;
+   wr32(E1000_I210_TQAVCTRL, tqavctrl);
+   }
}
 
/* XXX: In i210 controller the sendSlope and loCredit parameters from
@@ -1806,18 +1836,6 @@ static int igb_save_cbs_params(struct igb_adapter *adapter, int queue,
return 0;
 }
 
-static bool is_any_cbs_enabled(struct igb_adapter *adapter)
-{
-   int i;
-
-   for (i = 0; i < adapter->num_tx_queues; i++) {
-   if (adapter->tx_ring[i]->cbs_enable)
-   return true;
-   }
-
-   return false;
-}
-
 /**
  *  igb_setup_tx_mode - Switch to/from Qav Tx mode when applicable
  *  @adapter: pointer to adapter struct
@@ -1841,11 +1859,10 @@ static void igb_setup_tx_mode(struct igb_adapter *adapter)
int i, max_queue;
 
/* Configure TQAVCTRL register: set transmit mode to 'Qav',
-* set data fetch arbitration to 'round robin' and set data
-* transfer arbitration to 'credit shaper algorithm.
+* set data fetch arbitration to 'round robin'.
 */
val = rd32(E1000_I210_TQAVCTRL);
-   val |= E1000_TQAVCTRL_XMIT_MODE | E1000_TQAVCTRL_DATATRANARB;
+   val |= E1000_TQAVCTRL_XMIT_MODE;
val &= ~E1000_TQAVCTRL_DATAFETCHARB;
wr32(E1000_I210_TQAVCTRL, val);
 
-- 
2.16.2



[RFC v3 net-next 11/18] net: packet: Handle remaining txtime parameters

2018-03-06 Thread Jesus Sanchez-Palencia
Initialize clockid to CLOCKID_INVALID instead of 0 (i.e.
CLOCK_REALTIME), and copy both drop_if_late and clockid from CMSG cookie
into skb.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palen...@intel.com>
---
 net/packet/af_packet.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index b2115fac2a8d..e455fbf5a356 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -94,6 +94,7 @@
 #endif
 #include 
 #include 
+#include 
 
 #include "internal.h"
 
@@ -1977,6 +1978,8 @@ static int packet_sendmsg_spkt(struct socket *sock, struct msghdr *msg,
}
 
sockc.transmit_time = 0;
+   sockc.drop_if_late = 0;
+   sockc.clockid = CLOCKID_INVALID;
sockc.tsflags = sk->sk_tsflags;
if (msg->msg_controllen) {
err = sock_cmsg_send(sk, msg, &sockc);
@@ -1989,6 +1992,8 @@ static int packet_sendmsg_spkt(struct socket *sock, struct msghdr *msg,
skb->priority = sk->sk_priority;
skb->mark = sk->sk_mark;
skb->tstamp = sockc.transmit_time;
+   skb->tc_drop_if_late = sockc.drop_if_late;
+   skb->txtime_clockid = sockc.clockid;
 
sock_tx_timestamp(sk, sockc.tsflags, &skb_shinfo(skb)->tx_flags);
 
@@ -2487,6 +2492,8 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
skb->priority = po->sk.sk_priority;
skb->mark = po->sk.sk_mark;
skb->tstamp = sockc->transmit_time;
+   skb->tc_drop_if_late = sockc->drop_if_late;
+   skb->txtime_clockid = sockc->clockid;
sock_tx_timestamp(&po->sk, sockc->tsflags, &skb_shinfo(skb)->tx_flags);
skb_shinfo(skb)->destructor_arg = ph.raw;
 
@@ -2664,6 +2671,8 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
goto out_put;
 
sockc.transmit_time = 0;
+   sockc.drop_if_late = 0;
+   sockc.clockid = CLOCKID_INVALID;
sockc.tsflags = po->sk.sk_tsflags;
if (msg->msg_controllen) {
err = sock_cmsg_send(&po->sk, msg, &sockc);
@@ -2861,6 +2870,8 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
goto out_unlock;
 
sockc.transmit_time = 0;
+   sockc.drop_if_late = 0;
+   sockc.clockid = CLOCKID_INVALID;
sockc.tsflags = sk->sk_tsflags;
sockc.mark = sk->sk_mark;
if (msg->msg_controllen) {
@@ -2934,6 +2945,8 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
skb->priority = sk->sk_priority;
skb->mark = sockc.mark;
skb->tstamp = sockc.transmit_time;
+   skb->tc_drop_if_late = sockc.drop_if_late;
+   skb->txtime_clockid = sockc.clockid;
 
if (has_vnet_hdr) {
err = virtio_net_hdr_to_skb(skb, &vnet_hdr, vio_le());
-- 
2.16.2



[RFC v3 net-next 14/18] net/sched: Add HW offloading capability to TBS

2018-03-06 Thread Jesus Sanchez-Palencia
Add new queueing modes to tbs qdisc so HW offload is supported.

For hw offload, if sorting is on, then the time sorted list will still
be used, but when sorting is disabled the enqueue / dequeue flow will
be based on a 'raw' FIFO through the usage of qdisc_enqueue_tail() and
qdisc_dequeue_head(). For the 'raw hw offload' mode, the drop_if_late
flag from skbuffs is not used by the Qdisc since this mode implicitly
assumes the PHC clock is being used by applications.

Example 1:

$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
   map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

$ tc qdisc add dev enp2s0 parent 100:1 tbs offload

In this example, the Qdisc will use HW offload for the control of the
transmission time through the network adapter. The timestamps in skbuffs
are assumed to be relative to the interface's PHC, and setting any other
valid clockid is treated as an error. Because there is no scheduling
being performed in the qdisc, setting a delta != 0 is also considered
an error.

Example 2:

$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
   map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

$ tc qdisc add dev enp2s0 parent 100:1 tbs offload delta 10 \
   clockid CLOCK_REALTIME sorting

Here, the Qdisc will use HW offload for the txtime control again,
but now sorting is enabled, so the qdisc also performs scheduling.
That is done against the CLOCK_REALTIME reference, and packets leave
the Qdisc "delta" (10) nanoseconds before their transmission time.
Because this mode uses HW offload and dynamic clocks are not supported
by the hrtimer, the system clock and the PHC clock must be synchronized
for it to behave as expected.
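
The parameter rules behind the two examples can be summarized in a short
sketch. It is a simplified model under stated assumptions, not the
sch_tbs.c code: the struct and function names are hypothetical, "clockid
was set" is reduced to a boolean, and every failure maps to -EINVAL
whereas the real validation also distinguishes unsupported clockids.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical condensed view of the tbs options given on the tc line. */
struct tbs_params_model {
	bool offload;		/* 'offload' keyword given */
	bool sorting;		/* 'sorting' keyword given */
	bool clockid_set;	/* a clockid was passed */
	int32_t delta;		/* nanoseconds; 0 when not passed */
};

/* Return 0 if the combination is acceptable, -EINVAL otherwise. */
static int validate_tbs_model(const struct tbs_params_model *p)
{
	assert(p != NULL);

	/* SW best-effort, or HW offload with sorting: the qdisc schedules
	 * packets itself, so it needs a clockid and a non-negative delta.
	 */
	if (!p->offload || p->sorting) {
		if (!p->clockid_set || p->delta < 0)
			return -EINVAL;
		return 0;
	}

	/* Raw HW offload: the PHC is implied, so neither may be set. */
	if (p->clockid_set || p->delta != 0)
		return -EINVAL;

	return 0;
}
```

Under this model, Example 1 corresponds to offload without sorting,
clockid or delta, and Example 2 to offload with sorting, a clockid and a
delta of 10.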

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palen...@intel.com>
---
 include/net/pkt_sched.h|   5 ++
 include/uapi/linux/pkt_sched.h |   1 +
 net/sched/sch_tbs.c| 159 +++--
 3 files changed, 144 insertions(+), 21 deletions(-)

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 2466ea143d01..d042ffda7f21 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -155,4 +155,9 @@ struct tc_cbs_qopt_offload {
s32 sendslope;
 };
 
+struct tc_tbs_qopt_offload {
+   u8 enable;
+   s32 queue;
+};
+
 #endif
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index a33b5b9da81a..92af9fa4dee4 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -941,6 +941,7 @@ struct tc_tbs_qopt {
__s32 clockid;
__u32 flags;
 #define TC_TBS_SORTING_ON BIT(0)
+#define TC_TBS_OFFLOAD_ON BIT(1)
 };
 
 enum {
diff --git a/net/sched/sch_tbs.c b/net/sched/sch_tbs.c
index c19eedda9bc5..2aafa55de42c 100644
--- a/net/sched/sch_tbs.c
+++ b/net/sched/sch_tbs.c
@@ -25,8 +25,10 @@
 #include 
 
 #define SORTING_IS_ON(x) (x->flags & TC_TBS_SORTING_ON)
+#define OFFLOAD_IS_ON(x) (x->flags & TC_TBS_OFFLOAD_ON)
 
 struct tbs_sched_data {
+   bool offload;
bool sorting;
int clockid;
int queue;
@@ -68,25 +70,42 @@ static inline int validate_input_params(struct tc_tbs_qopt *qopt,
struct netlink_ext_ack *extack)
 {
/* Check if params comply to the following rules:
-*  * If SW best-effort, then clockid and delta must be valid
-*regardless of sorting enabled or not.
+*  * If SW best-effort, then clockid and delta must be valid.
+*
+*  * If HW offload is ON and sorting is ON, then clockid and delta
+*must be valid.
+*
+*  * If HW offload is ON and sorting is OFF, then clockid and
+*delta must not have been set. The netdevice PHC will be used
+*implicitly.
 *
 *  * Dynamic clockids are not supported.
 *  * Delta must be a positive integer.
 */
-   if ((qopt->clockid & CLOCKID_INVALID) == CLOCKID_INVALID ||
-   qopt->clockid >= MAX_CLOCKS) {
-   NL_SET_ERR_MSG(extack, "Invalid clockid");
-   return -EINVAL;
-   } else if (qopt->clockid < 0 ||
-  !clockid_to_get_time[qopt->clockid]) {
-   NL_SET_ERR_MSG(extack, "Clockid is not supported");
-   return -ENOTSUPP;
-   }
-
-   if (qopt->delta < 0) {
-   NL_SET_ERR_MSG(extack, "Delta must be positive");
-   return -EINVAL;
+   if (!OFFLOAD_IS_ON(qopt) || SORTING_IS_ON(qopt)) {
+   if ((qopt->clockid & CLOCKID_INVALID) == CLOCKID_INVALID ||
+   qopt->clockid >= MAX_CLOCKS) {
+   NL_SET_ERR_MSG(extack, &qu
