Re: [RFC PATCH 0/5] Add driver bpf hook for early packet drop

2016-04-02 Thread Brenden Blanco
On Sat, Apr 02, 2016 at 12:47:16PM -0400, Tom Herbert wrote:
> Very nice! Do you think this hook will be sufficient to implement a
> fast forward patch also?
That is the goal, but more work needs to be done, of course. It won't be
possible with just a single pseudo skb; the driver will need a fast way to get
batches of pseudo skbs (per core?) from rx through to tx. In mlx4, for
instance, either the skb needs to be much more complete to be handled from the
start of mlx4_en_xmit(), or that function would need to be split so that the
fast tx path could start midway through.

Or, skb allocation just gets much faster. Then it should be pretty
straightforward.
> 
> Tom


[PATCH v3 net-next 6/8] ipv6: process socket-level control messages in IPv6

2016-04-02 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh 

Process socket-level control messages by invoking
__sock_cmsg_send in ip6_datagram_send_ctl for control messages on
the SOL_SOCKET layer.

This makes sure that, whenever ip6_datagram_send_ctl is called for
UDP and raw sockets, we also process socket-level control messages.

This is a bit uglier than IPv4, since IPv6 does not have
something like ipcm_cookie. Perhaps we can later create
a control message cookie for IPv6?

Note that this commit only adds handling for control messages that
were previously ignored. As such, it does not change the behavior
of existing IPv6 control messages.

Signed-off-by: Soheil Hassas Yeganeh 
Acked-by: Willem de Bruijn 
---
 include/net/transp_v6.h  | 3 ++-
 net/ipv6/datagram.c  | 9 -
 net/ipv6/ip6_flowlabel.c | 3 ++-
 net/ipv6/ipv6_sockglue.c | 3 ++-
 net/ipv6/raw.c   | 6 +-
 net/ipv6/udp.c   | 5 -
 net/l2tp/l2tp_ip6.c  | 8 +---
 7 files changed, 28 insertions(+), 9 deletions(-)

diff --git a/include/net/transp_v6.h b/include/net/transp_v6.h
index b927413..2b1c345 100644
--- a/include/net/transp_v6.h
+++ b/include/net/transp_v6.h
@@ -42,7 +42,8 @@ void ip6_datagram_recv_specific_ctl(struct sock *sk, struct 
msghdr *msg,
 
 int ip6_datagram_send_ctl(struct net *net, struct sock *sk, struct msghdr *msg,
  struct flowi6 *fl6, struct ipv6_txoptions *opt,
- int *hlimit, int *tclass, int *dontfrag);
+ int *hlimit, int *tclass, int *dontfrag,
+ struct sockcm_cookie *sockc);
 
 void ip6_dgram_sock_seq_show(struct seq_file *seq, struct sock *sp,
 __u16 srcp, __u16 destp, int bucket);
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 4281621..a73d701 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -685,7 +685,8 @@ EXPORT_SYMBOL_GPL(ip6_datagram_recv_ctl);
 int ip6_datagram_send_ctl(struct net *net, struct sock *sk,
  struct msghdr *msg, struct flowi6 *fl6,
  struct ipv6_txoptions *opt,
- int *hlimit, int *tclass, int *dontfrag)
+ int *hlimit, int *tclass, int *dontfrag,
+ struct sockcm_cookie *sockc)
 {
struct in6_pktinfo *src_info;
struct cmsghdr *cmsg;
@@ -702,6 +703,12 @@ int ip6_datagram_send_ctl(struct net *net, struct sock *sk,
goto exit_f;
}
 
+   if (cmsg->cmsg_level == SOL_SOCKET) {
+   if (__sock_cmsg_send(sk, msg, cmsg, sockc))
+   return -EINVAL;
+   continue;
+   }
+
if (cmsg->cmsg_level != SOL_IPV6)
continue;
 
diff --git a/net/ipv6/ip6_flowlabel.c b/net/ipv6/ip6_flowlabel.c
index dc2db4f..35d3ddc 100644
--- a/net/ipv6/ip6_flowlabel.c
+++ b/net/ipv6/ip6_flowlabel.c
@@ -372,6 +372,7 @@ fl_create(struct net *net, struct sock *sk, struct 
in6_flowlabel_req *freq,
if (olen > 0) {
struct msghdr msg;
struct flowi6 flowi6;
+   struct sockcm_cookie sockc_junk;
int junk;
 
err = -ENOMEM;
@@ -390,7 +391,7 @@ fl_create(struct net *net, struct sock *sk, struct 
in6_flowlabel_req *freq,
memset(&flowi6, 0, sizeof(flowi6));
 
err = ip6_datagram_send_ctl(net, sk, &msg, &flowi6, fl->opt,
-   &junk, &junk, &junk);
+   &junk, &junk, &junk, &sockc_junk);
if (err)
goto done;
err = -EINVAL;
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index 4449ad1..a5557d2 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -471,6 +471,7 @@ sticky_done:
struct ipv6_txoptions *opt = NULL;
struct msghdr msg;
struct flowi6 fl6;
+   struct sockcm_cookie sockc_junk;
int junk;
 
memset(&fl6, 0, sizeof(fl6));
@@ -503,7 +504,7 @@ sticky_done:
msg.msg_control = (void *)(opt+1);
 
retv = ip6_datagram_send_ctl(net, sk, &msg, &fl6, opt, &junk,
-&junk, &junk);
+&junk, &junk, &sockc_junk);
if (retv)
goto done;
 update:
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index fa59dd7..f175ec0 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -745,6 +745,7 @@ static int rawv6_sendmsg(struct sock *sk, struct msghdr 
*msg, size_t len)
struct dst_entry *dst = NULL;
struct raw6_frag_vec rfv;
struct flowi6 fl6;
+   struct sockcm_cookie sockc;
int addr_len = msg->msg_namelen;
int hlimit = -1;
int tclass = -1;
@@ -821,13 +822,16 @@ static int rawv6_sendmsg(struct sock *sk, struct msghdr 
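
For illustration, a minimal userspace sketch (not part of the patch) of what
this change enables: a single sendmsg() on an IPv6 UDP socket carrying both an
IPPROTO_IPV6-level control message and a SOL_SOCKET-level one, which
ip6_datagram_send_ctl now hands to __sock_cmsg_send instead of skipping. The
destination, hop limit, and flag values are arbitrary placeholders, and the
SO_TIMESTAMPING cmsg additionally assumes patches 4/8 and 7/8 of this series
are applied (on older kernels, SOL_SOCKET cmsgs on this path were silently
ignored).

/* Sketch only: one IPv6 UDP sendmsg() with an IPV6_HOPLIMIT cmsg
 * (IPPROTO_IPV6 level) and an SO_TIMESTAMPING cmsg (SOL_SOCKET level)
 * in the same control buffer.  ::1 port 9000 and hop limit 32 are arbitrary.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/net_tstamp.h>

#ifndef SO_TIMESTAMPING
#define SO_TIMESTAMPING 37
#endif

int main(void)
{
    int fd = socket(AF_INET6, SOCK_DGRAM, 0);
    struct sockaddr_in6 dst = { .sin6_family = AF_INET6,
                                .sin6_port = htons(9000) };
    char payload[] = "hello";
    struct iovec iov = { .iov_base = payload, .iov_len = sizeof(payload) };
    union {
        char buf[CMSG_SPACE(sizeof(int)) + CMSG_SPACE(sizeof(__u32))];
        struct cmsghdr align;
    } control;
    struct msghdr msg = {
        .msg_name = &dst,
        .msg_namelen = sizeof(dst),
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = control.buf,
        .msg_controllen = sizeof(control.buf),
    };
    struct cmsghdr *cmsg;
    int hoplimit = 32;
    __u32 tsflags = SOF_TIMESTAMPING_TX_SOFTWARE;

    dst.sin6_addr = in6addr_loopback;
    memset(control.buf, 0, sizeof(control.buf));

    /* IPPROTO_IPV6-level cmsg: handled by the existing SOL_IPV6 branch. */
    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = IPPROTO_IPV6;
    cmsg->cmsg_type = IPV6_HOPLIMIT;
    cmsg->cmsg_len = CMSG_LEN(sizeof(hoplimit));
    memcpy(CMSG_DATA(cmsg), &hoplimit, sizeof(hoplimit));

    /* SOL_SOCKET-level cmsg: forwarded to __sock_cmsg_send by this patch. */
    cmsg = CMSG_NXTHDR(&msg, cmsg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SO_TIMESTAMPING;
    cmsg->cmsg_len = CMSG_LEN(sizeof(tsflags));
    memcpy(CMSG_DATA(cmsg), &tsflags, sizeof(tsflags));

    if (sendmsg(fd, &msg, 0) < 0)
        perror("sendmsg");
    close(fd);
    return 0;
}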

[PATCH v3 net-next 8/8] sock: document timestamping via cmsg in Documentation

2016-04-02 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh 

Update docs and add code snippet for using cmsg for timestamping.

Signed-off-by: Soheil Hassas Yeganeh 
Acked-by: Willem de Bruijn 
---
 Documentation/networking/timestamping.txt | 48 +--
 1 file changed, 45 insertions(+), 3 deletions(-)

diff --git a/Documentation/networking/timestamping.txt 
b/Documentation/networking/timestamping.txt
index a977339..671cccf 100644
--- a/Documentation/networking/timestamping.txt
+++ b/Documentation/networking/timestamping.txt
@@ -44,11 +44,17 @@ timeval of SO_TIMESTAMP (ms).
 Supports multiple types of timestamp requests. As a result, this
 socket option takes a bitmap of flags, not a boolean. In
 
-  err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, (void *) val, );
+  err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, (void *) val,
+   sizeof(val));
 
 val is an integer with any of the following bits set. Setting other
 bit returns EINVAL and does not change the current state.
 
+The socket option configures timestamp generation for individual
+sk_buffs (1.3.1), timestamp reporting to the socket's error
+queue (1.3.2) and options (1.3.3). Timestamp generation can also
+be enabled for individual sendmsg calls using cmsg (1.3.4).
+
 
 1.3.1 Timestamp Generation
 
@@ -71,13 +77,16 @@ SOF_TIMESTAMPING_RX_SOFTWARE:
   kernel receive stack.
 
 SOF_TIMESTAMPING_TX_HARDWARE:
-  Request tx timestamps generated by the network adapter.
+  Request tx timestamps generated by the network adapter. This flag
+  can be enabled via both socket options and control messages.
 
 SOF_TIMESTAMPING_TX_SOFTWARE:
   Request tx timestamps when data leaves the kernel. These timestamps
   are generated in the device driver as close as possible, but always
   prior to, passing the packet to the network interface. Hence, they
   require driver support and may not be available for all devices.
+  This flag can be enabled via both socket options and control messages.
+
 
 SOF_TIMESTAMPING_TX_SCHED:
   Request tx timestamps prior to entering the packet scheduler. Kernel
@@ -90,7 +99,8 @@ SOF_TIMESTAMPING_TX_SCHED:
   machines with virtual devices where a transmitted packet travels
   through multiple devices and, hence, multiple packet schedulers,
   a timestamp is generated at each layer. This allows for fine
-  grained measurement of queuing delay.
+  grained measurement of queuing delay. This flag can be enabled
+  via both socket options and control messages.
 
 SOF_TIMESTAMPING_TX_ACK:
   Request tx timestamps when all data in the send buffer has been
@@ -99,6 +109,7 @@ SOF_TIMESTAMPING_TX_ACK:
   over-report measurement, because the timestamp is generated when all
   data up to and including the buffer at send() was acknowledged: the
   cumulative acknowledgment. The mechanism ignores SACK and FACK.
+  This flag can be enabled via both socket options and control messages.
 
 
 1.3.2 Timestamp Reporting
@@ -183,6 +194,37 @@ having access to the contents of the original packet, so 
cannot be
 combined with SOF_TIMESTAMPING_OPT_TSONLY.
 
 
+1.3.4. Enabling timestamps via control messages
+
+In addition to socket options, timestamp generation can be requested
+per write via cmsg, only for SOF_TIMESTAMPING_TX_* (see Section 1.3.1).
+Using this feature, applications can sample timestamps per sendmsg()
+without paying the overhead of enabling and disabling timestamps via
+setsockopt:
+
+  struct msghdr *msg;
+  ...
+  cmsg= CMSG_FIRSTHDR(msg);
+  cmsg->cmsg_level= SOL_SOCKET;
+  cmsg->cmsg_type = SO_TIMESTAMPING;
+  cmsg->cmsg_len  = CMSG_LEN(sizeof(__u32));
+  *((__u32 *) CMSG_DATA(cmsg)) = SOF_TIMESTAMPING_TX_SCHED |
+SOF_TIMESTAMPING_TX_SOFTWARE |
+SOF_TIMESTAMPING_TX_ACK;
+  err = sendmsg(fd, msg, 0);
+
+The SOF_TIMESTAMPING_TX_* flags set via cmsg will override
+the SOF_TIMESTAMPING_TX_* flags set via setsockopt.
+
+Moreover, applications must still enable timestamp reporting via
+setsockopt to receive timestamps:
+
+  __u32 val = SOF_TIMESTAMPING_SOFTWARE |
+ SOF_TIMESTAMPING_OPT_ID /* or any other flag */;
+  err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, (void *) val,
+   sizeof(val));
+
+
 1.4 Bytestream Timestamps
 
 The SO_TIMESTAMPING interface supports timestamping of bytes in a
-- 
2.8.0.rc3.226.g39d4020
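
To complement the send-side snippet in the documentation hunk above, a hedged
sketch of the receive side: once reporting has been enabled with setsockopt and
a send has been timestamped, the result is read from the socket's error queue
with recvmsg(MSG_ERRQUEUE). The parsing follows the SCM_TIMESTAMPING and
sock_extended_err layout described elsewhere in timestamping.txt; buffer sizes
and error handling are minimal, the helper assumes an IPv4 socket, and for IPv6
the second cmsg arrives at IPPROTO_IPV6/IPV6_RECVERR instead of IP_RECVERR.

/* Sketch: drain one message from the error queue of a socket that has
 * SO_TIMESTAMPING reporting enabled and print what was found.  Meant to
 * be called after a timestamped send on an IPv4/UDP socket.
 */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/net_tstamp.h>
#include <linux/errqueue.h>

#ifndef SO_TIMESTAMPING
#define SO_TIMESTAMPING 37
#endif
#ifndef SCM_TIMESTAMPING
#define SCM_TIMESTAMPING SO_TIMESTAMPING
#endif

static void read_tx_timestamp(int fd)
{
    char data[256];
    char control[512];
    struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = control,
        .msg_controllen = sizeof(control),
    };
    struct cmsghdr *cmsg;

    if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0) {
        perror("recvmsg(MSG_ERRQUEUE)");
        return;
    }

    for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_SOCKET &&
            cmsg->cmsg_type == SCM_TIMESTAMPING) {
            /* ts[0] holds the software timestamp, ts[2] the raw
             * hardware timestamp, if any. */
            struct scm_timestamping tss;

            memcpy(&tss, CMSG_DATA(cmsg), sizeof(tss));
            printf("sw ts: %lld.%09ld\n",
                   (long long)tss.ts[0].tv_sec, tss.ts[0].tv_nsec);
        } else if (cmsg->cmsg_level == IPPROTO_IP &&  /* SOL_IP == IPPROTO_IP */
                   cmsg->cmsg_type == IP_RECVERR) {
            /* ee_info carries the SCM_TSTAMP_* type and ee_data the
             * key when SOF_TIMESTAMPING_OPT_ID is set. */
            struct sock_extended_err serr;

            memcpy(&serr, CMSG_DATA(cmsg), sizeof(serr));
            printf("tstamp type %u, key %u\n", serr.ee_info, serr.ee_data);
        }
    }
}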



[PATCH v3 net-next 5/8] ipv4: process socket-level control messages in IPv4

2016-04-02 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh 

Process socket-level control messages by invoking
__sock_cmsg_send in ip_cmsg_send for control messages on
the SOL_SOCKET layer.

This makes sure that, whenever ip_cmsg_send is called for UDP, ICMP,
and raw sockets, we also process socket-level control messages.

Note that this commit only adds handling for control messages that
were previously ignored. As such, it does not change the behavior
of existing IPv4 control messages.

Signed-off-by: Soheil Hassas Yeganeh 
Acked-by: Willem de Bruijn 
---
 include/net/ip.h   | 3 ++-
 net/ipv4/ip_sockglue.c | 9 -
 net/ipv4/ping.c| 2 +-
 net/ipv4/raw.c | 2 +-
 net/ipv4/udp.c | 3 +--
 5 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index fad74d3..93725e5 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -56,6 +56,7 @@ static inline unsigned int ip_hdrlen(const struct sk_buff 
*skb)
 }
 
 struct ipcm_cookie {
+   struct sockcm_cookiesockc;
__be32  addr;
int oif;
struct ip_options_rcu   *opt;
@@ -550,7 +551,7 @@ int ip_options_rcv_srr(struct sk_buff *skb);
 
 void ipv4_pktinfo_prepare(const struct sock *sk, struct sk_buff *skb);
 void ip_cmsg_recv_offset(struct msghdr *msg, struct sk_buff *skb, int offset);
-int ip_cmsg_send(struct net *net, struct msghdr *msg,
+int ip_cmsg_send(struct sock *sk, struct msghdr *msg,
 struct ipcm_cookie *ipc, bool allow_ipv6);
 int ip_setsockopt(struct sock *sk, int level, int optname, char __user *optval,
  unsigned int optlen);
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 035ad64..1b7c077 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -219,11 +219,12 @@ void ip_cmsg_recv_offset(struct msghdr *msg, struct 
sk_buff *skb,
 }
 EXPORT_SYMBOL(ip_cmsg_recv_offset);
 
-int ip_cmsg_send(struct net *net, struct msghdr *msg, struct ipcm_cookie *ipc,
+int ip_cmsg_send(struct sock *sk, struct msghdr *msg, struct ipcm_cookie *ipc,
 bool allow_ipv6)
 {
int err, val;
struct cmsghdr *cmsg;
+   struct net *net = sock_net(sk);
 
for_each_cmsghdr(cmsg, msg) {
if (!CMSG_OK(msg, cmsg))
@@ -244,6 +245,12 @@ int ip_cmsg_send(struct net *net, struct msghdr *msg, 
struct ipcm_cookie *ipc,
continue;
}
 #endif
+   if (cmsg->cmsg_level == SOL_SOCKET) {
+   if (__sock_cmsg_send(sk, msg, cmsg, &ipc->sockc))
+   return -EINVAL;
+   continue;
+   }
+
if (cmsg->cmsg_level != SOL_IP)
continue;
switch (cmsg->cmsg_type) {
diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index cf9700b..670639b 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -747,7 +747,7 @@ static int ping_v4_sendmsg(struct sock *sk, struct msghdr 
*msg, size_t len)
sock_tx_timestamp(sk, &ipc.tx_flags);
 
if (msg->msg_controllen) {
-   err = ip_cmsg_send(sock_net(sk), msg, &ipc, false);
+   err = ip_cmsg_send(sk, msg, &ipc, false);
if (unlikely(err)) {
kfree(ipc.opt);
return err;
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 8d22de7..088ce66 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -548,7 +548,7 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
ipc.oif = sk->sk_bound_dev_if;
 
if (msg->msg_controllen) {
-   err = ip_cmsg_send(net, msg, &ipc, false);
+   err = ip_cmsg_send(sk, msg, &ipc, false);
if (unlikely(err)) {
kfree(ipc.opt);
goto out;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 08eed5e..bccb4e1 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1034,8 +1034,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
sock_tx_timestamp(sk, &ipc.tx_flags);
 
if (msg->msg_controllen) {
-   err = ip_cmsg_send(sock_net(sk), msg, &ipc,
-  sk->sk_family == AF_INET6);
+   err = ip_cmsg_send(sk, msg, &ipc, sk->sk_family == AF_INET6);
if (unlikely(err)) {
kfree(ipc.opt);
return err;
-- 
2.8.0.rc3.226.g39d4020
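
For illustration, a sketch of what ip_cmsg_send accepts after this change: a
SOL_SOCKET SO_MARK control message on an IPv4 UDP send, which is now routed to
__sock_cmsg_send instead of being ignored. Note that __sock_cmsg_send requires
CAP_NET_ADMIN for SO_MARK (the send fails otherwise), and per the cover letter
actually applying the mark to the route lookup is left to follow-up work; the
destination and mark value below are placeholders.

/* Sketch: IPv4 UDP sendmsg() carrying a SOL_SOCKET SO_MARK cmsg.
 * Requires CAP_NET_ADMIN; 127.0.0.1:9000 and mark 42 are arbitrary.
 */
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#ifndef SO_MARK
#define SO_MARK 36
#endif

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in dst = { .sin_family = AF_INET,
                               .sin_port = htons(9000) };
    char payload[] = "hello";
    struct iovec iov = { .iov_base = payload, .iov_len = sizeof(payload) };
    union {
        char buf[CMSG_SPACE(sizeof(uint32_t))];
        struct cmsghdr align;
    } control;
    struct msghdr msg = {
        .msg_name = &dst,
        .msg_namelen = sizeof(dst),
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = control.buf,
        .msg_controllen = sizeof(control.buf),
    };
    struct cmsghdr *cmsg;
    uint32_t mark = 42;

    inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);
    memset(control.buf, 0, sizeof(control.buf));

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SO_MARK;
    cmsg->cmsg_len = CMSG_LEN(sizeof(mark));
    memcpy(CMSG_DATA(cmsg), &mark, sizeof(mark));

    if (sendmsg(fd, &msg, 0) < 0)
        perror("sendmsg");  /* fails without CAP_NET_ADMIN */
    close(fd);
    return 0;
}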



[PATCH v3 net-next 7/8] sock: enable timestamping using control messages

2016-04-02 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh 

Currently, SO_TIMESTAMPING can only be enabled using setsockopt.
This is very costly when users want to sample writes to gather
tx timestamps.

Add support for enabling SO_TIMESTAMPING via control messages by
using the tsflags field added to `struct sockcm_cookie` in the
previous patches of this series to set the tx_flags of the last skb
created in a sendmsg. With this patch, the timestamp recording bits
in the skbuff's tx_flags are overridden if SO_TIMESTAMPING is passed
in a cmsg.

Please note that this is only effective for overriding the timestamp
recording flags. Users should enable timestamp reporting (e.g.,
SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using
socket options and then ask for SOF_TIMESTAMPING_TX_* flags
using control messages per sendmsg to sample timestamps for each
write.

Signed-off-by: Soheil Hassas Yeganeh 
Acked-by: Willem de Bruijn 
---
 drivers/net/tun.c  |  3 ++-
 include/net/ipv6.h |  6 --
 include/net/sock.h | 10 ++
 net/can/raw.c  |  2 +-
 net/ipv4/ping.c|  5 +++--
 net/ipv4/raw.c | 11 ++-
 net/ipv4/tcp.c | 20 +++-
 net/ipv4/udp.c |  7 ---
 net/ipv6/icmp.c|  6 --
 net/ipv6/ip6_output.c  | 15 +--
 net/ipv6/ping.c|  3 ++-
 net/ipv6/raw.c |  5 ++---
 net/ipv6/udp.c |  7 ---
 net/l2tp/l2tp_ip6.c|  2 +-
 net/packet/af_packet.c | 30 +-
 net/socket.c   | 10 +-
 16 files changed, 93 insertions(+), 49 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index afdf950..6d2fcd0 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -860,7 +860,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct 
net_device *dev)
goto drop;
 
if (skb->sk && sk_fullsock(skb->sk)) {
-   sock_tx_timestamp(skb->sk, &skb_shinfo(skb)->tx_flags);
+   sock_tx_timestamp(skb->sk, skb->sk->sk_tsflags,
+ &skb_shinfo(skb)->tx_flags);
sw_tx_timestamp(skb);
}
 
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index d0aeb97..55ee1eb 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -867,7 +867,8 @@ int ip6_append_data(struct sock *sk,
int odd, struct sk_buff *skb),
void *from, int length, int transhdrlen, int hlimit,
int tclass, struct ipv6_txoptions *opt, struct flowi6 *fl6,
-   struct rt6_info *rt, unsigned int flags, int dontfrag);
+   struct rt6_info *rt, unsigned int flags, int dontfrag,
+   const struct sockcm_cookie *sockc);
 
 int ip6_push_pending_frames(struct sock *sk);
 
@@ -884,7 +885,8 @@ struct sk_buff *ip6_make_skb(struct sock *sk,
 void *from, int length, int transhdrlen,
 int hlimit, int tclass, struct ipv6_txoptions *opt,
 struct flowi6 *fl6, struct rt6_info *rt,
-unsigned int flags, int dontfrag);
+unsigned int flags, int dontfrag,
+const struct sockcm_cookie *sockc);
 
 static inline struct sk_buff *ip6_finish_skb(struct sock *sk)
 {
diff --git a/include/net/sock.h b/include/net/sock.h
index af012da..e91b87f 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2057,19 +2057,21 @@ static inline void sock_recv_ts_and_drops(struct msghdr 
*msg, struct sock *sk,
sk->sk_stamp = skb->tstamp;
 }
 
-void __sock_tx_timestamp(const struct sock *sk, __u8 *tx_flags);
+void __sock_tx_timestamp(__u16 tsflags, __u8 *tx_flags);
 
 /**
  * sock_tx_timestamp - checks whether the outgoing packet is to be time stamped
  * @sk:socket sending this packet
+ * @tsflags:   timestamping flags to use
  * @tx_flags:  completed with instructions for time stamping
  *
  * Note : callers should take care of initial *tx_flags value (usually 0)
  */
-static inline void sock_tx_timestamp(const struct sock *sk, __u8 *tx_flags)
+static inline void sock_tx_timestamp(const struct sock *sk, __u16 tsflags,
+__u8 *tx_flags)
 {
-   if (unlikely(sk->sk_tsflags))
-   __sock_tx_timestamp(sk, tx_flags);
+   if (unlikely(tsflags))
+   __sock_tx_timestamp(tsflags, tx_flags);
if (unlikely(sock_flag(sk, SOCK_WIFI_STATUS)))
*tx_flags |= SKBTX_WIFI_STATUS;
 }
diff --git a/net/can/raw.c b/net/can/raw.c
index 2e67b14..972c187 100644
--- a/net/can/raw.c
+++ b/net/can/raw.c
@@ -755,7 +755,7 @@ static int raw_sendmsg(struct socket *sock, struct msghdr 
*msg, size_t size)
if (err < 0)
goto free_skb;
 
-   sock_tx_timestamp(sk, &skb_shinfo(skb)->tx_flags);
+   sock_tx_timestamp(sk, sk->sk_tsflags, 

[PATCH v3 net-next 4/8] sock: accept SO_TIMESTAMPING flags in socket cmsg

2016-04-02 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh 

Accept SO_TIMESTAMPING in control messages of the SOL_SOCKET level
as a basis to accept timestamping requests per write.

This implementation only accepts TX recording flags (i.e.,
SOF_TIMESTAMPING_TX_HARDWARE, SOF_TIMESTAMPING_TX_SOFTWARE,
SOF_TIMESTAMPING_TX_SCHED, and SOF_TIMESTAMPING_TX_ACK) in
control messages. Users need to set reporting flags (e.g.,
SOF_TIMESTAMPING_OPT_ID) per socket via socket options.

This commit adds a tsflags field to sockcm_cookie which is
set in __sock_cmsg_send. It only overrides the SOF_TIMESTAMPING_TX_*
bits in sockcm_cookie.tsflags, allowing the control message
to override the recording behavior per write while maintaining
the value of the other flags.

This patch validates the control message and sets tsflags in
struct sockcm_cookie. The next commits in this series implement
timestamping per write for the different protocols.

Signed-off-by: Soheil Hassas Yeganeh 
Acked-by: Willem de Bruijn 
---
 include/net/sock.h  |  1 +
 include/uapi/linux/net_tstamp.h | 10 ++
 net/core/sock.c | 13 +
 3 files changed, 24 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index 03772d4..af012da 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1418,6 +1418,7 @@ void sk_send_sigurg(struct sock *sk);
 
 struct sockcm_cookie {
u32 mark;
+   u16 tsflags;
 };
 
 int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg,
diff --git a/include/uapi/linux/net_tstamp.h b/include/uapi/linux/net_tstamp.h
index 6d1abea..264e515 100644
--- a/include/uapi/linux/net_tstamp.h
+++ b/include/uapi/linux/net_tstamp.h
@@ -31,6 +31,16 @@ enum {
 SOF_TIMESTAMPING_LAST
 };
 
+/*
+ * SO_TIMESTAMPING flags are either for recording a packet timestamp or for
+ * reporting the timestamp to user space.
+ * Recording flags can be set both via socket options and control messages.
+ */
+#define SOF_TIMESTAMPING_TX_RECORD_MASK(SOF_TIMESTAMPING_TX_HARDWARE | 
\
+SOF_TIMESTAMPING_TX_SOFTWARE | \
+SOF_TIMESTAMPING_TX_SCHED | \
+SOF_TIMESTAMPING_TX_ACK)
+
 /**
  * struct hwtstamp_config - %SIOCGHWTSTAMP and %SIOCSHWTSTAMP parameter
  *
diff --git a/net/core/sock.c b/net/core/sock.c
index 0a64fe2..315f5e5 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1870,6 +1870,8 @@ EXPORT_SYMBOL(sock_alloc_send_skb);
 int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg,
 struct sockcm_cookie *sockc)
 {
+   u32 tsflags;
+
switch (cmsg->cmsg_type) {
case SO_MARK:
if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
@@ -1878,6 +1880,17 @@ int __sock_cmsg_send(struct sock *sk, struct msghdr 
*msg, struct cmsghdr *cmsg,
return -EINVAL;
sockc->mark = *(u32 *)CMSG_DATA(cmsg);
break;
+   case SO_TIMESTAMPING:
+   if (cmsg->cmsg_len != CMSG_LEN(sizeof(u32)))
+   return -EINVAL;
+
+   tsflags = *(u32 *)CMSG_DATA(cmsg);
+   if (tsflags & ~SOF_TIMESTAMPING_TX_RECORD_MASK)
+   return -EINVAL;
+
+   sockc->tsflags &= ~SOF_TIMESTAMPING_TX_RECORD_MASK;
+   sockc->tsflags |= tsflags;
+   break;
default:
return -EINVAL;
}
-- 
2.8.0.rc3.226.g39d4020
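
To make the override semantics concrete, a small standalone sketch of the
masking above, starting from socket-level flags the way the per-protocol
sendmsg paths do later in this series. The mask is redefined locally (mirroring
SOF_TIMESTAMPING_TX_RECORD_MASK) since this patch is what introduces it; the
flag choices are illustrative.

/* Standalone demo of the cmsg override: only the TX recording bits in
 * tsflags are replaced by the per-call value; reporting/option bits are
 * preserved.  Flag values come from <linux/net_tstamp.h>.
 */
#include <stdio.h>
#include <linux/net_tstamp.h>

#define TX_RECORD_MASK (SOF_TIMESTAMPING_TX_HARDWARE | \
                        SOF_TIMESTAMPING_TX_SOFTWARE | \
                        SOF_TIMESTAMPING_TX_SCHED | \
                        SOF_TIMESTAMPING_TX_ACK)

int main(void)
{
    /* Socket-level flags: software reporting + OPT_ID + TX_SOFTWARE. */
    unsigned int tsflags = SOF_TIMESTAMPING_SOFTWARE |
                           SOF_TIMESTAMPING_OPT_ID |
                           SOF_TIMESTAMPING_TX_SOFTWARE;
    /* Per-call cmsg asks for scheduler and ACK timestamps only. */
    unsigned int cmsg_flags = SOF_TIMESTAMPING_TX_SCHED |
                              SOF_TIMESTAMPING_TX_ACK;

    tsflags &= ~TX_RECORD_MASK;  /* drop the old recording bits */
    tsflags |= cmsg_flags;       /* install the per-call ones */

    /* SOFTWARE and OPT_ID remain set; TX_SOFTWARE is gone, TX_SCHED and
     * TX_ACK are now set. */
    printf("effective tsflags: 0x%x\n", tsflags);
    return 0;
}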



[PATCH v3 net-next 2/8] tcp: accept SOF_TIMESTAMPING_OPT_ID for passive TFO

2016-04-02 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh 

SOF_TIMESTAMPING_OPT_ID is set to get data-independent IDs
to associate timestamps with send calls. For TCP connections,
tp->snd_una is used as the starting point to calculate
relative IDs.

This socket option will fail if set before the handshake on a
passive TCP fast open connection with data in SYN or SYN/ACK,
since setsockopt requires the connection to be in the
ESTABLISHED state.

To address this, instead of limiting the option to the
ESTABLISHED state, accept the SOF_TIMESTAMPING_OPT_ID option as
long as the connection is not in the LISTEN or CLOSE states.

Signed-off-by: Soheil Hassas Yeganeh 
Acked-by: Willem de Bruijn 
Acked-by: Yuchung Cheng 
Acked-by: Eric Dumazet 
---
 net/core/sock.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 66976f8..0a64fe2 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -832,7 +832,8 @@ set_rcvbuf:
!(sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)) {
if (sk->sk_protocol == IPPROTO_TCP &&
sk->sk_type == SOCK_STREAM) {
-   if (sk->sk_state != TCP_ESTABLISHED) {
+   if ((1 << sk->sk_state) &
+   (TCPF_CLOSE | TCPF_LISTEN)) {
ret = -EINVAL;
break;
}
-- 
2.8.0.rc3.226.g39d4020
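
The `(1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN)` test checks membership
in a set of states with a single mask rather than comparing against one state.
A trivial standalone illustration follows; the state numbering mirrors the
kernel's and is redefined locally because it is not exported to userspace.

#include <stdio.h>

/* State numbering mirrors the kernel's include/net/tcp_states.h. */
enum {
    TCP_ESTABLISHED = 1, TCP_SYN_SENT, TCP_SYN_RECV, TCP_FIN_WAIT1,
    TCP_FIN_WAIT2, TCP_TIME_WAIT, TCP_CLOSE, TCP_CLOSE_WAIT,
    TCP_LAST_ACK, TCP_LISTEN, TCP_CLOSING,
};

#define TCPF_CLOSE   (1 << TCP_CLOSE)
#define TCPF_LISTEN  (1 << TCP_LISTEN)

/* Mirrors the new check: OPT_ID is refused only in CLOSE and LISTEN. */
static int opt_id_allowed(int sk_state)
{
    return !((1 << sk_state) & (TCPF_CLOSE | TCPF_LISTEN));
}

int main(void)
{
    printf("ESTABLISHED: %d\n", opt_id_allowed(TCP_ESTABLISHED)); /* 1 */
    printf("SYN_RECV:    %d\n", opt_id_allowed(TCP_SYN_RECV));    /* 1 */
    printf("LISTEN:      %d\n", opt_id_allowed(TCP_LISTEN));      /* 0 */
    printf("CLOSE:       %d\n", opt_id_allowed(TCP_CLOSE));       /* 0 */
    return 0;
}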



[PATCH v3 net-next 3/8] tcp: use one bit in TCP_SKB_CB to mark ACK timestamps

2016-04-02 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh 

Currently, to avoid a cache line miss for accessing skb_shinfo,
tcp_ack_tstamp skips sockets that do not have the
SOF_TIMESTAMPING_TX_ACK bit set in sk_tsflags. This is
implemented based on the implicit assumption that
SOF_TIMESTAMPING_TX_ACK is set via socket options for the
duration that ACK timestamps are needed.

To implement per-write timestamps, this check should be
removed and replaced with a per-packet alternative that
quickly skips packets missing ACK timestamp marks without
a cache-line miss.

To enable per-packet marking without a cache-line miss, use
one bit in TCP_SKB_CB to mark whether an SKB might need an
ACK tx timestamp or not. Further checks in tcp_ack_tstamp are not
modified and work as before.

Signed-off-by: Soheil Hassas Yeganeh 
Acked-by: Willem de Bruijn 
Acked-by: Eric Dumazet 
---
 include/net/tcp.h| 3 ++-
 net/ipv4/tcp.c   | 2 ++
 net/ipv4/tcp_input.c | 2 +-
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index b91370f..f3a80ec 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -754,7 +754,8 @@ struct tcp_skb_cb {
TCPCB_REPAIRED)
 
__u8ip_dsfield; /* IPv4 tos or IPv6 dsfield */
-   /* 1 byte hole */
+   __u8txstamp_ack:1,  /* Record TX timestamp for ack? */
+   unused:7;
__u32   ack_seq;/* Sequence number ACK'd*/
union {
struct inet_skb_parmh4;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 08b8b96..ce3c9eb 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -432,10 +432,12 @@ static void tcp_tx_timestamp(struct sock *sk, struct 
sk_buff *skb)
 {
if (sk->sk_tsflags) {
struct skb_shared_info *shinfo = skb_shinfo(skb);
+   struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
 
sock_tx_timestamp(sk, &shinfo->tx_flags);
if (shinfo->tx_flags & SKBTX_ANY_TSTAMP)
shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
+   tcb->txstamp_ack = !!(shinfo->tx_flags & SKBTX_ACK_TSTAMP);
}
 }
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e6e65f7..2d5fee4 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3093,7 +3093,7 @@ static void tcp_ack_tstamp(struct sock *sk, struct 
sk_buff *skb,
const struct skb_shared_info *shinfo;
 
/* Avoid cache line misses to get skb_shinfo() and shinfo->tx_flags */
-   if (likely(!(sk->sk_tsflags & SOF_TIMESTAMPING_TX_ACK)))
+   if (likely(!TCP_SKB_CB(skb)->txstamp_ack))
return;
 
shinfo = skb_shinfo(skb);
-- 
2.8.0.rc3.226.g39d4020
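
The new flag fits into the existing one-byte hole in tcp_skb_cb, so the
control block does not grow. A standalone illustration of the pattern with
simplified stand-in structs (not the real tcp_skb_cb):

#include <stdio.h>
#include <stdint.h>

/* Simplified stand-ins: three one-byte members followed by a four-byte
 * member leave a one-byte hole before ack_seq. */
struct cb_before {
    uint8_t  tcp_flags;
    uint8_t  sacked;
    uint8_t  ip_dsfield;
    /* one byte of padding here */
    uint32_t ack_seq;
};

/* The hole now carries a 1-bit flag plus 7 spare bits; the size of the
 * structure does not change. */
struct cb_after {
    uint8_t  tcp_flags;
    uint8_t  sacked;
    uint8_t  ip_dsfield;
    uint8_t  txstamp_ack:1,   /* record TX timestamp for ACK? */
             unused:7;
    uint32_t ack_seq;
};

int main(void)
{
    printf("before: %zu bytes, after: %zu bytes\n",
           sizeof(struct cb_before), sizeof(struct cb_after));
    return 0;
}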



[PATCH v3 net-next 1/8] sock: break up sock_cmsg_snd into __sock_cmsg_snd and loop

2016-04-02 Thread Soheil Hassas Yeganeh
From: Willem de Bruijn 

To process cmsg's of the SOL_SOCKET level in addition to
cmsgs of another level, protocols can call sock_cmsg_send().
This causes a double walk on the cmsghdr list, one for SOL_SOCKET
and one for the other level.

Extract the inner demultiplex logic from the loop that walks the list,
so that it can be called directly from a walker in the protocol-specific
code.

Signed-off-by: Willem de Bruijn 
Signed-off-by: Soheil Hassas Yeganeh 
---
 include/net/sock.h |  2 ++
 net/core/sock.c| 33 ++---
 2 files changed, 24 insertions(+), 11 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 255d3e0..03772d4 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1420,6 +1420,8 @@ struct sockcm_cookie {
u32 mark;
 };
 
+int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg,
+struct sockcm_cookie *sockc);
 int sock_cmsg_send(struct sock *sk, struct msghdr *msg,
   struct sockcm_cookie *sockc);
 
diff --git a/net/core/sock.c b/net/core/sock.c
index b67b9ae..66976f8 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1866,27 +1866,38 @@ struct sk_buff *sock_alloc_send_skb(struct sock *sk, 
unsigned long size,
 }
 EXPORT_SYMBOL(sock_alloc_send_skb);
 
+int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg,
+struct sockcm_cookie *sockc)
+{
+   switch (cmsg->cmsg_type) {
+   case SO_MARK:
+   if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
+   return -EPERM;
+   if (cmsg->cmsg_len != CMSG_LEN(sizeof(u32)))
+   return -EINVAL;
+   sockc->mark = *(u32 *)CMSG_DATA(cmsg);
+   break;
+   default:
+   return -EINVAL;
+   }
+   return 0;
+}
+EXPORT_SYMBOL(__sock_cmsg_send);
+
 int sock_cmsg_send(struct sock *sk, struct msghdr *msg,
   struct sockcm_cookie *sockc)
 {
struct cmsghdr *cmsg;
+   int ret;
 
for_each_cmsghdr(cmsg, msg) {
if (!CMSG_OK(msg, cmsg))
return -EINVAL;
if (cmsg->cmsg_level != SOL_SOCKET)
continue;
-   switch (cmsg->cmsg_type) {
-   case SO_MARK:
-   if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
-   return -EPERM;
-   if (cmsg->cmsg_len != CMSG_LEN(sizeof(u32)))
-   return -EINVAL;
-   sockc->mark = *(u32 *)CMSG_DATA(cmsg);
-   break;
-   default:
-   return -EINVAL;
-   }
+   ret = __sock_cmsg_send(sk, msg, cmsg, sockc);
+   if (ret)
+   return ret;
}
return 0;
 }
-- 
2.8.0.rc3.226.g39d4020



[PATCH v3 net-next 0/8] add TX timestamping via cmsg

2016-04-02 Thread Soheil Hassas Yeganeh
From: Soheil Hassas Yeganeh 

This patch series aims at enabling TX timestamping via cmsg.

Currently, to occasionally sample TX timestamping on a socket,
applications need to call setsockopt twice: first for enabling
timestamps and then for disabling them. This is an unnecessary
overhead. With cmsg, in contrast, applications can sample TX
timestamps per sendmsg().

This patch series adds the code for processing SO_TIMESTAMPING
for cmsg's of the SOL_SOCKET level, and adds the glue code for
TCP, UDP, and RAW for both IPv4 and IPv6. This implementation
supports overriding timestamp generation flags (i.e.,
SOF_TIMESTAMPING_TX_*) but not timestamp reporting flags.
Applications must still enable timestamp reporting via
setsockopt to receive timestamps.

This series does not change existing timestamping behavior for
applications that are using socket options.

I will follow up with another patch to enable timestamping for
active TFO (client-side TCP Fast Open) and also setting packet
mark via cmsgs.

Thanks!

Changes in v2:
- Replace u32 with __u32 in the documentation.

Changes in v3:
- Fix the broken build for L2TP (due to changes
  in IPv6).

Soheil Hassas Yeganeh (7):
  tcp: accept SOF_TIMESTAMPING_OPT_ID for passive TFO
  tcp: use one bit in TCP_SKB_CB to mark ACK timestamps
  sock: accept SO_TIMESTAMPING flags in socket cmsg
  ipv4: process socket-level control messages in IPv4
  ipv6: process socket-level control messages in IPv6
  sock: enable timestamping using control messages
  sock: document timestamping via cmsg in Documentation

Willem de Bruijn (1):
  sock: break up sock_cmsg_snd into __sock_cmsg_snd and loop

 Documentation/networking/timestamping.txt | 48 --
 drivers/net/tun.c |  3 +-
 include/net/ip.h  |  3 +-
 include/net/ipv6.h|  6 ++--
 include/net/sock.h| 13 +---
 include/net/tcp.h |  3 +-
 include/net/transp_v6.h   |  3 +-
 include/uapi/linux/net_tstamp.h   | 10 +++
 net/can/raw.c |  2 +-
 net/core/sock.c   | 49 +++
 net/ipv4/ip_sockglue.c|  9 +-
 net/ipv4/ping.c   |  7 +++--
 net/ipv4/raw.c| 13 
 net/ipv4/tcp.c| 22 ++
 net/ipv4/tcp_input.c  |  2 +-
 net/ipv4/udp.c| 10 +++
 net/ipv6/datagram.c   |  9 +-
 net/ipv6/icmp.c   |  6 ++--
 net/ipv6/ip6_flowlabel.c  |  3 +-
 net/ipv6/ip6_output.c | 15 ++
 net/ipv6/ipv6_sockglue.c  |  3 +-
 net/ipv6/ping.c   |  3 +-
 net/ipv6/raw.c|  7 +++--
 net/ipv6/udp.c| 10 +--
 net/l2tp/l2tp_ip6.c   | 10 ---
 net/packet/af_packet.c| 30 +++
 net/socket.c  | 10 +++
 27 files changed, 231 insertions(+), 78 deletions(-)

-- 
2.8.0.rc3.226.g39d4020
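
For context, the per-write sampling pattern this series replaces looks roughly
like the sketch below: two setsockopt() system calls bracket every sampled
send. The flag choice is illustrative and error handling is minimal.

/* Sketch of the pre-series pattern: toggle SO_TIMESTAMPING around a send. */
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <linux/net_tstamp.h>

#ifndef SO_TIMESTAMPING
#define SO_TIMESTAMPING 37
#endif

static ssize_t sampled_send(int fd, const void *buf, size_t len)
{
    int on = SOF_TIMESTAMPING_TX_SOFTWARE |
             SOF_TIMESTAMPING_SOFTWARE |
             SOF_TIMESTAMPING_OPT_ID;
    int off = 0;
    ssize_t ret;

    /* Syscall 1: enable generation and reporting for the whole socket. */
    if (setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &on, sizeof(on)) < 0)
        return -1;

    ret = send(fd, buf, len, 0);

    /* Syscall 2: turn it off again so later writes are not timestamped. */
    setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &off, sizeof(off));
    return ret;
}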



Re: [RFC PATCH net 3/4] ipv6: datagram: Update dst cache of a connected datagram sk during pmtu update

2016-04-02 Thread Martin KaFai Lau
On Fri, Apr 01, 2016 at 04:13:41PM -0700, Cong Wang wrote:
> On Fri, Apr 1, 2016 at 3:56 PM, Martin KaFai Lau  wrote:
> > +   bh_lock_sock(sk);
> > +   if (!sock_owned_by_user(sk))
> > +   ip6_datagram_dst_update(sk, false);
> > +   bh_unlock_sock(sk);
>
>
> My discussion with Eric shows that we probably don't need to hold
> this sock lock here, and you are Cc'ed in that thread, so
>
> 1) why do you still take the lock here?
> 2) why didn't you involve in our discussion if you disagree?
It is because I agree with the conclusion of that thread that updating
sk->sk_dst_cache does not need the sk lock.  I also don't see that a
lock is needed for the other operations discussed in that thread.

I am thinking of another case that needs a lock, so I started
another RFC thread.  A quick recap of this commit message:
>> It is done under '!sock_owned_by_user(sk)' condition because
>> the user may make another ip6_datagram_connect() while
>> dst lookup and update are happening.
If that could not happen, then the lock is not needed.

One thing to note is that this patch uses the addresses from the sk
instead of the iph when updating sk->sk_dst_cache.  It is basically the
same logic that __ip6_datagram_connect() uses, hence the refactoring
work in the first two patches.

AFAIK, a UDP socket can become connected after sending out some
datagrams in an unconnected state, or it can be connected
multiple times to different destinations.  I did some quick
tests, but I could be wrong.

I am wondering whether there is a chance that skb->data, which
holds the original outgoing iph, is not related to the currently
connected address.  If that is possible, we have to specifically
use the addresses in the sk instead of skb->data (i.e. the iph) when
updating sk->sk_dst_cache.

If we need to use the sk addresses (and other info) to find a
new dst for a connected UDP socket, it is better not to do it while
userland is connecting to somewhere else.

If the above case is impossible, we can keep using the info from the iph
to update the dst of a connected UDP sk without taking the lock.

>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>> index ed44663..f7e6a6d 100644
>> --- a/net/ipv6/route.c
>> +++ b/net/ipv6/route.c
>> @@ -1417,8 +1417,19 @@ EXPORT_SYMBOL_GPL(ip6_update_pmtu);
>>
>>  void ip6_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, __be32 mtu)
>>  {
>> +struct dst_entry *dst;
>> +
>>  ip6_update_pmtu(skb, sock_net(sk), mtu,
>>  sk->sk_bound_dev_if, sk->sk_mark);
iph's addresses are used to update the pmtu.  It is fine
because it does not update the sk->sk_dst_cache.

>>> +
>> +dst = __sk_dst_get(sk);
>> +if (!dst || dst->ops->check(dst, inet6_sk(sk)->dst_cookie))
>> +return;
>> +
>> +bh_lock_sock(sk);
>> +if (!sock_owned_by_user(sk))
sk is not connecting to another address.  Find a new dst
for the connected address.
>> +ip6_datagram_dst_update(sk, false);
>> +bh_unlock_sock(sk);
>>  }
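
On the UDP connect semantics mentioned above: a datagram socket can indeed
send while unconnected, then be connect()ed, and later be connect()ed again to
a different destination. A quick userspace illustration; the loopback
addresses and ports are arbitrary and the sends need no listener.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

static void set_addr(struct sockaddr_in *a, const char *ip, unsigned short port)
{
    memset(a, 0, sizeof(*a));
    a->sin_family = AF_INET;
    a->sin_port = htons(port);
    inet_pton(AF_INET, ip, &a->sin_addr);
}

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in a, b;
    char pkt[] = "x";

    set_addr(&a, "127.0.0.1", 9000);
    set_addr(&b, "127.0.0.2", 9001);

    /* Unconnected send. */
    sendto(fd, pkt, sizeof(pkt), 0, (struct sockaddr *)&a, sizeof(a));

    /* Connect and send... */
    if (connect(fd, (struct sockaddr *)&a, sizeof(a)) < 0)
        perror("connect a");
    send(fd, pkt, sizeof(pkt), 0);

    /* ...then re-connect the same socket to a different destination. */
    if (connect(fd, (struct sockaddr *)&b, sizeof(b)) < 0)
        perror("connect b");
    send(fd, pkt, sizeof(pkt), 0);

    close(fd);
    return 0;
}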


Re: [RFC PATCH 0/5] Add driver bpf hook for early packet drop

2016-04-02 Thread Lorenzo Colitti
On Sun, Apr 3, 2016 at 7:57 AM, Tom Herbert  wrote:
> I am curious though, how do you think this would specifically help
> Android with power? Seems like the receiver still needs to be powered
> to receive packets to filter them anyway...

The receiver is powered up, but its wake/sleep cycles are much shorter
than the main CPU's. On a phone, leaving the CPU asleep with wifi on
might consume ~5mA average, but getting the CPU out of suspend might
average ~200mA for ~300ms as the system comes out of sleep,
initializes other hardware, wakes up userspace processes whose
timeouts have fired, freezes, and suspends again. Receiving one such
superfluous packet every 3 seconds (e.g., on networks that send
identical IPv6 RAs once every 3 seconds) works out to ~25mA, which is
5x the cost of idle. Pushing down filters to the hardware so it can
drop the packet without waking up the CPU thus saves a lot of idle
power.

That said, getting BPF to the driver is only part of the picture. On the
chipsets we're targeting for APF, we're only seeing 2k-4k of memory
available for filtering code (at 8 bytes per instruction, that's 256-512
BPF instructions), which means that BPF might be too large.


Re: [PATCH v2 net-next 0/8] add TX timestamping via cmsg

2016-04-02 Thread Soheil Hassas Yeganeh
On Sat, Apr 2, 2016 at 9:27 PM, David Miller  wrote:
> From: David Miller 
> Date: Sat, 02 Apr 2016 21:19:42 -0400 (EDT)
>
>> Series applied, thanks.
>
> I had to revert, this breaks the build:
>
> net/l2tp/l2tp_ip6.c: In function ‘l2tp_ip6_sendmsg’:
> net/l2tp/l2tp_ip6.c:565:9: error: too few arguments to function 
> ‘ip6_datagram_send_ctl’
>err = ip6_datagram_send_ctl(sock_net(sk), sk, msg, &fl6, opt,
>  ^
> In file included from net/l2tp/l2tp_ip6.c:33:0:
> include/net/transp_v6.h:43:5: note: declared here
>  int ip6_datagram_send_ctl(struct net *net, struct sock *sk, struct msghdr 
> *msg,
>  ^
> net/l2tp/l2tp_ip6.c:625:8: error: too few arguments to function 
> ‘ip6_append_data’
>   err = ip6_append_data(sk, ip_generic_getfrag, msg,
> ^
> In file included from include/net/inetpeer.h:15:0,
>  from include/net/route.h:28,
>  from include/net/ip.h:31,
>  from net/l2tp/l2tp_ip6.c:23:
> include/net/ipv6.h:865:5: note: declared here
>  int ip6_append_data(struct sock *sk,
>  ^

I'm really sorry about this. CONFIG_L2TP was not enabled in my config.
I'll fix the patch and mail v3.

Thanks,
Soheil


Re: [PATCH v2 net-next 0/8] add TX timestamping via cmsg

2016-04-02 Thread David Miller
From: David Miller 
Date: Sat, 02 Apr 2016 21:19:42 -0400 (EDT)

> Series applied, thanks.

I had to revert, this breaks the build:

net/l2tp/l2tp_ip6.c: In function ‘l2tp_ip6_sendmsg’:
net/l2tp/l2tp_ip6.c:565:9: error: too few arguments to function 
‘ip6_datagram_send_ctl’
   err = ip6_datagram_send_ctl(sock_net(sk), sk, msg, &fl6, opt,
 ^
In file included from net/l2tp/l2tp_ip6.c:33:0:
include/net/transp_v6.h:43:5: note: declared here
 int ip6_datagram_send_ctl(struct net *net, struct sock *sk, struct msghdr *msg,
 ^
net/l2tp/l2tp_ip6.c:625:8: error: too few arguments to function 
‘ip6_append_data’
  err = ip6_append_data(sk, ip_generic_getfrag, msg,
^
In file included from include/net/inetpeer.h:15:0,
 from include/net/route.h:28,
 from include/net/ip.h:31,
 from net/l2tp/l2tp_ip6.c:23:
include/net/ipv6.h:865:5: note: declared here
 int ip6_append_data(struct sock *sk,
 ^


Re: [PATCH v2 net-next 0/8] add TX timestamping via cmsg

2016-04-02 Thread David Miller
From: Soheil Hassas Yeganeh 
Date: Fri,  1 Apr 2016 11:04:32 -0400

> From: Soheil Hassas Yeganeh 
> 
> This patch series aim at enabling TX timestamping via cmsg.
> 
> Currently, to occasionally sample TX timestamping on a socket,
> applications need to call setsockopt twice: first for enabling
> timestamps and then for disabling them. This is an unnecessary
> overhead. With cmsg, in contrast, applications can sample TX
> timestamps per sendmsg().
> 
> This patch series adds the code for processing SO_TIMESTAMPING
> for cmsg's of the SOL_SOCKET level, and adds the glue code for
> TCP, UDP, and RAW for both IPv4 and IPv6. This implementation
> supports overriding timestamp generation flags (i.e.,
> SOF_TIMESTAMPING_TX_*) but not timestamp reporting flags.
> Applications must still enable timestamp reporting via
> setsockopt to receive timestamps.
> 
> This series does not change existing timestamping behavior for
> applications that are using socket options.
> 
> I will follow up with another patch to enable timestamping for
> active TFO (client-side TCP Fast Open) and also setting packet
> mark via cmsgs.
 ...
> Changes in v2:
>   - Replace u32 with __u32 in the documentation.

Series applied, thanks.


Re: [RESEND PATCH net-next 00/13] Enhance stmmac driver to support GMAC4.x IP

2016-04-02 Thread David Miller
From: Alexandre TORGUE 
Date: Fri, 1 Apr 2016 11:37:24 +0200

> This is a subset of patch to enhance current stmmac driver to support
> new GMAC4.x chips. New set of callbacks is defined to support this new
> family: descriptors, dma, core.

Series applied, thanks.


Re: [PATCH v2 net-next] net: hns: add support of pause frame ctrl for HNS V2

2016-04-02 Thread David Miller
From: Yisen Zhuang 
Date: Thu, 31 Mar 2016 21:00:09 +0800

> From: Lisheng 
> 
> The patch adds support of pause ctrl for HNS V2, and this feature is lost
> by HNS V1:
>1) service ports can disable rx pause frame,
>2) debug ports can open tx/rx pause frame.
> 
> And this patch updates the REGs about the pause ctrl when updated
> status function called by upper layer routine.
> 
> Signed-off-by: Lisheng 
> Signed-off-by: Yisen Zhuang 
> Reviewed-by: Andy Shevchenko 

Applied.


Re: [PATCH] netlink: use nla_get_in_addr and nla_put_in_addr for ipv4 address

2016-04-02 Thread David Miller
From: Haishuang Yan 
Date: Thu, 31 Mar 2016 18:21:38 +0800

> Since nla_get_in_addr and nla_put_in_addr were implemented,
> so use them appropriately.
> 
> Signed-off-by: Haishuang Yan 

Applied, thank you.


Re: [PATCH v2 net-next] tcp: remove cwnd moderation after recovery

2016-04-02 Thread David Miller
From: Yuchung Cheng 
Date: Wed, 30 Mar 2016 14:54:20 -0700

> For non-SACK connections, cwnd is lowered to inflight plus 3 packets
> when the recovery ends. This is an optional feature in the NewReno
> RFC 2582 to reduce the potential burst when cwnd is "re-opened"
> after recovery and inflight is low.
> 
> This feature is questionably effective because of PRR: when
> the recovery ends (i.e., snd_una == high_seq) NewReno holds the
> CA_Recovery state for another round trip to prevent false fast
> retransmits. But if the inflight is low, PRR will overwrite the
> moderated cwnd in tcp_cwnd_reduction() later regardless. So if a
> receiver responds with bogus ACKs (i.e., acking future data) to speed up
> transfer after recovery, it can only induce a burst up to a window
> worth of data packets by acking up to SND.NXT. A restart from (short)
> idle or receiving stretched ACKs can both cause such bursts as well.
> 
> On the other hand, if the recovery ends because the sender
> detects that the losses were spurious (e.g., due to reordering), this
> feature unconditionally lowers a reverted cwnd even though nothing
> was lost.
> 
> By principle loss recovery module should not update cwnd. Further
> pacing is much more effective to reduce burst. Hence this patch
> removes the cwnd moderation feature.
> 
> v2 changes: revised commit message on bogus ACKs and burst, and
> missing signature
> 
> Signed-off-by: Matt Mathis 
> Signed-off-by: Neal Cardwell 
> Signed-off-by: Soheil Hassas Yeganeh 
> Signed-off-by: Yuchung Cheng 

Applied, thanks.


Re: [RFC PATCH 0/5] Add driver bpf hook for early packet drop

2016-04-02 Thread Tom Herbert
On Sat, Apr 2, 2016 at 2:41 PM, Johannes Berg  wrote:
> On Fri, 2016-04-01 at 18:21 -0700, Brenden Blanco wrote:
>> This patch set introduces new infrastructure for programmatically
>> processing packets in the earliest stages of rx, as part of an effort
>> others are calling Express Data Path (XDP) [1]. Start this effort by
>> introducing a new bpf program type for early packet filtering, before
>> even
>> an skb has been allocated.
>>
>> With this, hope to enable line rate filtering, with this initial
>> implementation providing drop/allow action only.
>
> Since this is handed to the driver in some way, I assume the API would
> also allow offloading the program to the NIC itself, and as such be
> useful for what Android wants to do to save power in wireless?
>
Conceptually, yes. There is some ongoing work to offload BPF and one
goal is that BPF programs (like for XDP) could be portable between
userspace, kernel (maybe even other OSes), and devices.

I am curious though, how do you think this would specifically help
Android with power? Seems like the receiver still needs to be powered
to receive packets to filter them anyway...

Thanks,
Tom

> johannes


Re: bridge/brctl/ip

2016-04-02 Thread Nikolay Aleksandrov
On 04/02/2016 09:26 PM, Bert Vermeulen wrote:
> Hi all,
> 
> I'm wondering about the current userspace toolset to control bridging in
> the Linux kernel. As far as I can determine, functionality is a bit
> scattered right now between the iproute2 (ip, bridge) and bridge-utils
> (brctl) tools:
> 
> - creating/deleting bridges: ip or brctl
> - adding/deleting ports to/from bridge: brctl only

ip link set dev ethX master bridgeY
ip link set dev ethX nomaster

> - showing bridge fdb: brctl (in-kernel fdb), bridge (hardware offloaded
>   fdb) (!)

bridge fdb show - shows all fdb entries, offloaded or not.

> ...and no doubt a few other things.
> 
> Also the brctl tool seems not to be getting updates, whereas the
> iproute2 tools are of course updated regularly. Is brctl considered
> obsolete?

iproute2 supports almost everything now (user-space STP being a possible
exception); there have been many recent additions to the options that can
be manipulated.
$ ip link set dev bridge0 type bridge help
Usage: ... bridge [ forward_delay FORWARD_DELAY ]
  [ hello_time HELLO_TIME ]
  [ max_age MAX_AGE ]
  [ ageing_time AGEING_TIME ]
  [ stp_state STP_STATE ]
  [ priority PRIORITY ]
  [ group_fwd_mask MASK ]
  [ group_address ADDRESS ]
  [ vlan_filtering VLAN_FILTERING ]
  [ vlan_protocol VLAN_PROTOCOL ]
  [ vlan_default_pvid VLAN_DEFAULT_PVID ]
  [ mcast_snooping MULTICAST_SNOOPING ]
  [ mcast_router MULTICAST_ROUTER ]
  [ mcast_query_use_ifaddr MCAST_QUERY_USE_IFADDR ]
  [ mcast_querier MULTICAST_QUERIER ]
  [ mcast_hash_elasticity HASH_ELASTICITY ]
  [ mcast_hash_max HASH_MAX ]
  [ mcast_last_member_count LAST_MEMBER_COUNT ]
  [ mcast_startup_query_count STARTUP_QUERY_COUNT ]
  [ mcast_last_member_interval LAST_MEMBER_INTERVAL ]
  [ mcast_membership_interval MEMBERSHIP_INTERVAL ]
  [ mcast_querier_interval QUERIER_INTERVAL ]
  [ mcast_query_interval QUERY_INTERVAL ]
  [ mcast_query_response_interval QUERY_RESPONSE_INTERVAL ]
  [ mcast_startup_query_interval STARTUP_QUERY_INTERVAL ]
  [ nf_call_iptables NF_CALL_IPTABLES ]
  [ nf_call_ip6tables NF_CALL_IP6TABLES ]
  [ nf_call_arptables NF_CALL_ARPTABLES ]

Where: VLAN_PROTOCOL := { 802.1Q | 802.1ad }


> 
> If that is the case, would patches to add the missing functionality into
> the bridge tool be welcome? I'm thinking primarily of creating/deleting
> bridges, and adding/deleting ports in bridges.
> 
> 



Re: bridge/brctl/ip

2016-04-02 Thread Andrew Lunn
On Sat, Apr 02, 2016 at 09:26:55PM +0200, Bert Vermeulen wrote:
> Hi all,
> 
> I'm wondering about the current userspace toolset to control bridging in
> the Linux kernel. As far as I can determine, functionality is a bit
> scattered right now between the iproute2 (ip, bridge) and bridge-utils
> (brctl) tools:
> 
> - adding/deleting ports to/from bridge: brctl only

ip link set lan0 master br0

I think most of the normal operations can be done with iproute2.  What
might be missing are things like setting the forwarding delay, hello
time, etc.

   Andrew


Re: net: memory leak due to CLONE_NEWNET

2016-04-02 Thread Cong Wang
On Sat, Apr 2, 2016 at 6:55 AM, Dmitry Vyukov  wrote:
> Hello,
>
> The following program leads to memory leaks in:
>
> unreferenced object 0x88005c10d208 (size 96):
>   comm "a.out", pid 10753, jiffies 4296778619 (age 43.118s)
>   hex dump (first 32 bytes):
> e8 31 85 2d 00 88 ff ff 0f 00 00 00 00 00 00 00  .1.-
> 00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00  .N..
>   backtrace:
> [] kmemleak_alloc+0x63/0xa0 mm/kmemleak.c:915
> [< inline >] kmemleak_alloc_recursive include/linux/kmemleak.h:47
> [< inline >] slab_post_alloc_hook mm/slab.h:406
> [< inline >] slab_alloc_node mm/slub.c:2602
> [< inline >] slab_alloc mm/slub.c:2610
> [] kmem_cache_alloc_trace+0x160/0x3d0 mm/slub.c:2627
> [< inline >] kmalloc include/linux/slab.h:478
> [< inline >] tc_action_net_init include/net/act_api.h:122
> [] csum_init_net+0x15e/0x450 net/sched/act_csum.c:593
> [] ops_init+0xa9/0x3a0 net/core/net_namespace.c:109
> [] setup_net+0x1b4/0x3e0 net/core/net_namespace.c:287
> [] copy_net_ns+0xd6/0x1a0 net/core/net_namespace.c:367
> [] create_new_namespaces+0x37f/0x740 
> kernel/nsproxy.c:106
> [] unshare_nsproxy_namespaces+0xa9/0x1d0

The following patch should fix it.

diff --git a/include/net/act_api.h b/include/net/act_api.h
index 2a19fe1..03e322b 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -135,6 +135,7 @@ void tcf_hashinfo_destroy(const struct tc_action_ops *ops,
 static inline void tc_action_net_exit(struct tc_action_net *tn)
 {
tcf_hashinfo_destroy(tn->ops, tn->hinfo);
+   kfree(tn->hinfo);
 }

 int tcf_generic_walker(struct tc_action_net *tn, struct sk_buff *skb,


Re: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864

2016-04-02 Thread Rick Jones

On 04/01/2016 07:21 PM, Eric Dumazet wrote:

On Fri, 2016-04-01 at 22:16 -0400, David Miller wrote:

From: Alexander Duyck 
Date: Fri, 1 Apr 2016 12:58:41 -0700


RFC 6864 is pretty explicit about this, IPv4 ID used only for
fragmentation.  https://tools.ietf.org/html/rfc6864#section-4.1

The goal with this change is to try and keep most of the existing
behavior intact without violating this rule?  I would think the
sequence number should give you the ability to infer a drop in the
case of TCP.  In the case of UDP tunnels we are now getting a bit more
data since we were ignoring the outer IP header ID before.


When retransmits happen, the sequence numbers are the same.  But you
can then use the IP ID to see exactly what happened.  You can even
tell if multiple retransmits got reordered.

Eric's use case is extremely useful, and flat out eliminates ambiguity
when analyzing TCP traces.


Yes, our team (including Van Jacobson ;) ) would be sad to not have
sequential IP ID (but then we don't have them for IPv6 ;) )


Your team would not be the only one sad to see that go away.

rick jones


Since the cost of generating them is pretty small (inet->inet_id
counter), we probably should keep them in linux. Their usage will phase
out as IPv6 wins the Internet war...






bridge/brctl/ip

2016-04-02 Thread Bert Vermeulen
Hi all,

I'm wondering about the current userspace toolset to control bridging in
the Linux kernel. As far as I can determine, functionality is a bit
scattered right now between the iproute2 (ip, bridge) and bridge-utils
(brctl) tools:

- creating/deleting bridges: ip or brctl
- adding/deleting ports to/from bridge: brctl only
- showing bridge fdb: brctl (in-kernel fdb), bridge (hardware offloaded
  fdb) (!)
...and no doubt a few other things.

Also the brctl tool seems not to be getting updates, whereas the
iproute2 tools are of course updated regularly. Is brctl considered
obsolete?

If that is the case, would patches to add the missing functionality into
the bridge tool be welcome? I'm thinking primarily of creating/deleting
bridges, and adding/deleting ports in bridges.


-- 
Bert Vermeulen
b...@biot.com


[PATCH v3 -next] net/core/dev: Warn on a too-short GRO frame

2016-04-02 Thread Aaron Conole
From: Aaron Conole 

When signaling that a GRO frame is ready to be processed, the network stack
correctly checks length and aborts processing when a frame is less than 14
bytes. However, such a condition is really indicative of a broken driver,
and should be loudly signaled, rather than silently dropped, as is the
case today.

Convert the condition to use net_warn_ratelimited() to ensure the stack
loudly complains about such broken drivers.

Signed-off-by: Aaron Conole 
---
v2:
* Switched from WARN_ON to net_warn_ratelimited

v3:
* Amend the string to include device name as a hint

 net/core/dev.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index b9bcbe7..273f10d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4663,6 +4663,8 @@ static struct sk_buff *napi_frags_skb(struct napi_struct 
*napi)
if (unlikely(skb_gro_header_hard(skb, hlen))) {
eth = skb_gro_header_slow(skb, hlen, 0);
if (unlikely(!eth)) {
+   net_warn_ratelimited("%s: dropping impossible skb from 
%s\n",
+__func__, napi->dev->name);
napi_reuse_skb(napi, skb);
return NULL;
}
-- 
2.5.5



Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program

2016-04-02 Thread Johannes Berg
On Fri, 2016-04-01 at 18:21 -0700, Brenden Blanco wrote:

> +static int mlx4_bpf_set(struct net_device *dev, int fd)
> +{
[...]
> + if (prog->type != BPF_PROG_TYPE_PHYS_DEV) {
> + bpf_prog_put(prog);
> + return -EINVAL;
> + }
> + }

Why wouldn't this check be done in the generic code that calls
ndo_bpf_set()?

johannes


Re: [RFC PATCH 0/5] Add driver bpf hook for early packet drop

2016-04-02 Thread Johannes Berg
On Fri, 2016-04-01 at 18:21 -0700, Brenden Blanco wrote:
> This patch set introduces new infrastructure for programmatically
> processing packets in the earliest stages of rx, as part of an effort
> others are calling Express Data Path (XDP) [1]. Start this effort by
> introducing a new bpf program type for early packet filtering, before
> even
> an skb has been allocated.
> 
> With this, hope to enable line rate filtering, with this initial
> implementation providing drop/allow action only.

Since this is handed to the driver in some way, I assume the API would
also allow offloading the program to the NIC itself, and as such be
useful for what Android wants to do to save power in wireless?

johannes


Re: Question on rhashtable in worst-case scenario.

2016-04-02 Thread Johannes Berg
On Sat, 2016-04-02 at 09:46 +0800, Herbert Xu wrote:
> On Fri, Apr 01, 2016 at 11:34:10PM +0200, Johannes Berg wrote:
> > 
> > 
> > I was thinking about that one - it's not obvious to me from the
> > code
> > how this "explicitly checking for dups" would be done or let's say
> > how
> > rhashtable differentiates. But since it seems to work for Ben until
> > hitting a certain number of identical keys, surely that's just me
> > not
> > understanding the code rather than anything else :)
> It's really simple, rhashtable_insert_fast does not check for dups
> while rhashtable_lookup_insert_* do.

Oh, ok, thanks :)

johannes


[PATCH v2] net: remove unimplemented RTNH_F_PERVASIVE

2016-04-02 Thread Quentin Armitage
Linux 2.1.68 introduced RTNH_F_PERVASIVE, but it had no implementation
and couldn't be enabled since the required config parameter wasn't in
any Kconfig file (see commit d088dde7b196 ("ipv4: obsolete config in
kernel source (IP_ROUTE_PERVASIVE)")).

This commit removes all remaining references to RTNH_F_PERVASIVE.
Although this will cause userspace applications that were using the
flag to fail to build, they will be alerted to the fact that using
RTNH_F_PERVASIVE was not achieving anything.

Signed-off-by: Quentin Armitage 
---
 include/uapi/linux/rtnetlink.h |2 +-
 net/decnet/dn_fib.c|2 +-
 net/ipv4/fib_semantics.c   |2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index ca764b5..58e6ba0 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -339,7 +339,7 @@ struct rtnexthop {
 /* rtnh_flags */
 
 #define RTNH_F_DEAD1   /* Nexthop is dead (used by multipath)  
*/
-#define RTNH_F_PERVASIVE   2   /* Do recursive gateway lookup  */
+   /* 2 was RTNH_F_PERVASIVE (never 
implemented) */
 #define RTNH_F_ONLINK  4   /* Gateway is forced on link*/
 #define RTNH_F_OFFLOAD 8   /* offloaded route */
 #define RTNH_F_LINKDOWN16  /* carrier-down on nexthop */
diff --git a/net/decnet/dn_fib.c b/net/decnet/dn_fib.c
index df48034..c53aa74 100644
--- a/net/decnet/dn_fib.c
+++ b/net/decnet/dn_fib.c
@@ -243,7 +243,7 @@ out:
} else {
struct net_device *dev;
 
-   if (nh->nh_flags&(RTNH_F_PERVASIVE|RTNH_F_ONLINK))
+   if (nh->nh_flags & RTNH_F_ONLINK)
return -EINVAL;
 
dev = __dev_get_by_index(&init_net, nh->nh_oif);
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index d97268e..3883860 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -803,7 +803,7 @@ static int fib_check_nh(struct fib_config *cfg, struct 
fib_info *fi,
} else {
struct in_device *in_dev;
 
-   if (nh->nh_flags & (RTNH_F_PERVASIVE | RTNH_F_ONLINK))
+   if (nh->nh_flags & RTNH_F_ONLINK)
return -EINVAL;
 
rcu_read_lock();
-- 
1.7.7.6



Re: [RFC PATCH 0/5] Add driver bpf hook for early packet drop

2016-04-02 Thread Tom Herbert
On Fri, Apr 1, 2016 at 9:21 PM, Brenden Blanco  wrote:
> This patch set introduces new infrastructure for programmatically
> processing packets in the earliest stages of rx, as part of an effort
> others are calling Express Data Path (XDP) [1]. Start this effort by
> introducing a new bpf program type for early packet filtering, before even
> an skb has been allocated.
>
> With this, hope to enable line rate filtering, with this initial
> implementation providing drop/allow action only.
>
> Patch 1 introduces the new prog type and helpers for validating the bpf
> program. A new userspace struct is defined containing only len as a field,
> with others to follow in the future.
> In patch 2, create a new ndo to pass the fd to support drivers.
> In patch 3, expose a new rtnl option to userspace.
> In patch 4, enable support in mlx4 driver. No skb allocation is required,
> instead a static percpu skb is kept in the driver and minimally initialized
> for each driver frag.
> In patch 5, create a sample drop and count program. With single core,
> achieved ~14.5 Mpps drop rate on a 40G mlx4. This includes packet data
> access, bpf array lookup, and increment.
>
Very nice! Do you think this hook will be sufficient to implement a
fast forward patch also?

Tom

> Interestingly, accessing packet data from the program did not have a
> noticeable impact on performance. Even so, future enhancements to
> prefetching / batching / page-allocs should hopefully improve the
> performance in this path.
>
> [1] https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf
>
> Brenden Blanco (5):
>   bpf: add PHYS_DEV prog type for early driver filter
>   net: add ndo to set bpf prog in adapter rx
>   rtnl: add option for setting link bpf prog
>   mlx4: add support for fast rx drop bpf program
>   Add sample for adding simple drop program to link
>
>  drivers/net/ethernet/mellanox/mlx4/en_netdev.c |  61 ++
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c |  18 +++
>  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |   2 +
>  include/linux/netdevice.h  |   8 ++
>  include/uapi/linux/bpf.h   |   5 +
>  include/uapi/linux/if_link.h   |   1 +
>  kernel/bpf/verifier.c  |   1 +
>  net/core/dev.c |  12 ++
>  net/core/filter.c  |  68 +++
>  net/core/rtnetlink.c   |  10 ++
>  samples/bpf/Makefile   |   4 +
>  samples/bpf/bpf_load.c |   8 ++
>  samples/bpf/netdrvx1_kern.c|  26 +
>  samples/bpf/netdrvx1_user.c| 155 
> +
>  14 files changed, 379 insertions(+)
>  create mode 100644 samples/bpf/netdrvx1_kern.c
>  create mode 100644 samples/bpf/netdrvx1_user.c
>
> --
> 2.8.0
>


Re: [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter

2016-04-02 Thread Tom Herbert
On Fri, Apr 1, 2016 at 9:21 PM, Brenden Blanco  wrote:
> Add a new bpf prog type that is intended to run in early stages of the
> packet rx path. Only minimal packet metadata will be available, hence a new
> context type, struct xdp_metadata, is exposed to userspace. So far only
> expose the readable packet length, and only in read mode.
>
This would eventually be a generic abstraction of receive descriptors?

> The PHYS_DEV name is chosen to represent that the program is meant only
> for physical adapters, rather than all netdevs.
>
Is there a hard restriction that this could only work with physical devices?

> While the user visible struct is new, the underlying context must be
> implemented as a minimal skb in order for the packet load_* instructions
> to work. The skb filled in by the driver must have skb->len, skb->head,
> and skb->data set, and skb->data_len == 0.
>
> Signed-off-by: Brenden Blanco 
> ---
>  include/uapi/linux/bpf.h |  5 
>  kernel/bpf/verifier.c|  1 +
>  net/core/filter.c| 68 
> 
>  3 files changed, 74 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 924f537..b8a4ef2 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -92,6 +92,7 @@ enum bpf_prog_type {
> BPF_PROG_TYPE_KPROBE,
> BPF_PROG_TYPE_SCHED_CLS,
> BPF_PROG_TYPE_SCHED_ACT,
> +   BPF_PROG_TYPE_PHYS_DEV,
>  };
>
>  #define BPF_PSEUDO_MAP_FD  1
> @@ -367,6 +368,10 @@ struct __sk_buff {
> __u32 tc_classid;
>  };
>
> +struct xdp_metadata {
> +   __u32 len;
> +};
> +
>  struct bpf_tunnel_key {
> __u32 tunnel_id;
> union {
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 2e08f8e..804ca70 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -1340,6 +1340,7 @@ static bool may_access_skb(enum bpf_prog_type type)
> case BPF_PROG_TYPE_SOCKET_FILTER:
> case BPF_PROG_TYPE_SCHED_CLS:
> case BPF_PROG_TYPE_SCHED_ACT:
> +   case BPF_PROG_TYPE_PHYS_DEV:
> return true;
> default:
> return false;
> diff --git a/net/core/filter.c b/net/core/filter.c
> index b7177d0..c417db6 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2018,6 +2018,12 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
> }
>  }
>
> +static const struct bpf_func_proto *
> +phys_dev_func_proto(enum bpf_func_id func_id)
> +{
> +   return sk_filter_func_proto(func_id);
> +}
> +
>  static bool __is_valid_access(int off, int size, enum bpf_access_type type)
>  {
> /* check bounds */
> @@ -2073,6 +2079,36 @@ static bool tc_cls_act_is_valid_access(int off, int size,
> return __is_valid_access(off, size, type);
>  }
>
> +static bool __is_valid_xdp_access(int off, int size,
> + enum bpf_access_type type)
> +{
> +   if (off < 0 || off >= sizeof(struct xdp_metadata))
> +   return false;
> +
> +   if (off % size != 0)
> +   return false;
> +
> +   if (size != 4)
> +   return false;
> +
> +   return true;
> +}
> +
> +static bool phys_dev_is_valid_access(int off, int size,
> +enum bpf_access_type type)
> +{
> +   if (type == BPF_WRITE)
> +   return false;
> +
> +   switch (off) {
> +   case offsetof(struct xdp_metadata, len):
> +   break;
> +   default:
> +   return false;
> +   }
> +   return __is_valid_xdp_access(off, size, type);
> +}
> +
>  static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
>   int src_reg, int ctx_off,
>   struct bpf_insn *insn_buf,
> @@ -2210,6 +2246,26 @@ static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
> return insn - insn_buf;
>  }
>
> +static u32 bpf_phys_dev_convert_ctx_access(enum bpf_access_type type,
> +  int dst_reg, int src_reg,
> +  int ctx_off,
> +  struct bpf_insn *insn_buf,
> +  struct bpf_prog *prog)
> +{
> +   struct bpf_insn *insn = insn_buf;
> +
> +   switch (ctx_off) {
> +   case offsetof(struct xdp_metadata, len):
> +   BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, len) != 4);
> +
> +   *insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg,
> + offsetof(struct sk_buff, len));
> +   break;
> +   }
> +
> +   return insn - insn_buf;
> +}
> +
>  static const struct bpf_verifier_ops sk_filter_ops = {
> .get_func_proto = sk_filter_func_proto,
> .is_valid_access = sk_filter_is_valid_access,
> @@ -2222,6 +2278,12 @@

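To illustrate what the is_valid_access/convert_ctx_access pair quoted above
buys: the program only ever sees the restricted xdp_metadata view, and the
verifier rewrites each field load into a load from the pseudo skb built by the
driver. A rough sketch (conceptual only; register naming is illustrative):

/* What the program writes: */
int prog(struct xdp_metadata *ctx)
{
	return ctx->len;	/* verifier checks offset/size against xdp_metadata */
}

/* What bpf_phys_dev_convert_ctx_access() turns the ctx load into,
 * conceptually:
 *
 *	dst_reg = *(u32 *)(ctx_reg + offsetof(struct sk_buff, len));
 *
 * i.e. the read is redirected to skb->len of the minimal skb the
 * driver fills in. */
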
Re: [PATCH v2 net-next 01/11] net: add SOCK_RCU_FREE socket flag

2016-04-02 Thread Tom Herbert
On Fri, Apr 1, 2016 at 11:52 AM, Eric Dumazet  wrote:
> We want a generic way to insert an RCU grace period before socket
> freeing for cases where RCU_SLAB_DESTROY_BY_RCU is adding too
> much overhead.
>
> SLAB_DESTROY_BY_RCU strict rules force us to take a reference
> on the socket sk_refcnt, and it is a performance problem for UDP
> encapsulation, or TCP synflood behavior, as many CPUs might
> attempt the atomic operations on a shared sk_refcnt
>
> UDP sockets and TCP listeners can set SOCK_RCU_FREE so that their
> lookup can use traditional RCU rules, without refcount changes.
> They can set the flag only once hashed and visible by other cpus.
>
> Signed-off-by: Eric Dumazet 
> Cc: Tom Herbert 
> ---
>  include/net/sock.h |  2 ++
>  net/core/sock.c| 14 +-
>  2 files changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 255d3e03727b..c88785a3e76c 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -438,6 +438,7 @@ struct sock {
>   struct sk_buff *skb);
> void(*sk_destruct)(struct sock *sk);
> struct sock_reuseport __rcu *sk_reuseport_cb;
> +   struct rcu_head sk_rcu;
>  };
>
>  #define __sk_user_data(sk) ((*((void __rcu **)&(sk)->sk_user_data)))
> @@ -720,6 +721,7 @@ enum sock_flags {
>  */
> SOCK_FILTER_LOCKED, /* Filter cannot be changed anymore */
> SOCK_SELECT_ERR_QUEUE, /* Wake select on error queue */
> +   SOCK_RCU_FREE, /* wait rcu grace period in sk_destruct() */
>  };
>
>  #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
> diff --git a/net/core/sock.c b/net/core/sock.c
> index b67b9aedb230..238a94f879ca 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1418,8 +1418,12 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
>  }
>  EXPORT_SYMBOL(sk_alloc);
>
> -void sk_destruct(struct sock *sk)
> +/* Sockets having SOCK_RCU_FREE will call this function after one RCU
> + * grace period. This is the case for UDP sockets and TCP listeners.
> + */
> +static void __sk_destruct(struct rcu_head *head)
>  {
> +   struct sock *sk = container_of(head, struct sock, sk_rcu);
> struct sk_filter *filter;
>
> if (sk->sk_destruct)
> @@ -1448,6 +1452,14 @@ void sk_destruct(struct sock *sk)
> sk_prot_free(sk->sk_prot_creator, sk);
>  }
>
> +void sk_destruct(struct sock *sk)
> +{
> +   if (sock_flag(sk, SOCK_RCU_FREE))
> +   call_rcu(&sk->sk_rcu, __sk_destruct);
> +   else
> +   __sk_destruct(&sk->sk_rcu);
> +}
> +
>  static void __sk_free(struct sock *sk)
>  {
> if (unlikely(sock_diag_has_destroy_listeners(sk) && sk->sk_net_refcnt))
> --
> 2.8.0.rc3.226.g39d4020
>

Tested-by: Tom Herbert 

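For context, the opt-in from a protocol's side amounts to setting the flag once
the socket is hashed and visible to other CPUs; the lines below are an
illustrative sketch, not part of the patch:

	/* in the protocol's hash/insert path, after the socket has been
	 * published to the lookup structures */
	sock_set_flag(sk, SOCK_RCU_FREE);

	/* lookups may then run under rcu_read_lock() without touching
	 * sk_refcnt; sk_destruct() defers the actual free via call_rcu(),
	 * so the object stays valid for the whole read-side section. */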

Re: [PATCH v2 net-next 02/11] udp: no longer use SLAB_DESTROY_BY_RCU

2016-04-02 Thread Tom Herbert
On Fri, Apr 1, 2016 at 11:52 AM, Eric Dumazet  wrote:
> Tom Herbert would like not touching UDP socket refcnt for encapsulated
> traffic. For this to happen, we need to use normal RCU rules, with a grace
> period before freeing a socket. UDP sockets are not short lived in the
> high usage case, so the added cost of call_rcu() should not be a concern.
>
> This actually removes a lot of complexity in UDP stack.
>
> Multicast receives no longer need to hold a bucket spinlock.
>
> Note that ip early demux still needs to take a reference on the socket.
>
> Same remark for functions used by xt_socket and xt_PROXY netfilter modules,
> but this might be changed later.
>
> Performance for a single UDP socket receiving flood traffic from
> many RX queues/cpus.
>
> Simple udp_rx using simple recvfrom() loop :
> 438 kpps instead of 374 kpps : 17 % increase of the peak rate.
>
> v2: Addressed Willem de Bruijn feedback in multicast handling
>  - keep early demux break in __udp4_lib_demux_lookup()
>
Works fine with UDP encapsulation also.

Tested-by: Tom Herbert 

> Signed-off-by: Eric Dumazet 
> Cc: Tom Herbert 
> Cc: Willem de Bruijn 
> ---
>  include/linux/udp.h |   8 +-
>  include/net/sock.h  |  12 +--
>  include/net/udp.h   |   2 +-
>  net/ipv4/udp.c  | 293 
> 
>  net/ipv4/udp_diag.c |  18 ++--
>  net/ipv6/udp.c  | 196 ---
>  6 files changed, 171 insertions(+), 358 deletions(-)
>
> diff --git a/include/linux/udp.h b/include/linux/udp.h
> index 87c094961bd5..32342754643a 100644
> --- a/include/linux/udp.h
> +++ b/include/linux/udp.h
> @@ -98,11 +98,11 @@ static inline bool udp_get_no_check6_rx(struct sock *sk)
> return udp_sk(sk)->no_check6_rx;
>  }
>
> -#define udp_portaddr_for_each_entry(__sk, node, list) \
> -   hlist_nulls_for_each_entry(__sk, node, list, __sk_common.skc_portaddr_node)
> +#define udp_portaddr_for_each_entry(__sk, list) \
> +   hlist_for_each_entry(__sk, list, __sk_common.skc_portaddr_node)
>
> -#define udp_portaddr_for_each_entry_rcu(__sk, node, list) \
> -   hlist_nulls_for_each_entry_rcu(__sk, node, list, __sk_common.skc_portaddr_node)
> +#define udp_portaddr_for_each_entry_rcu(__sk, list) \
> +   hlist_for_each_entry_rcu(__sk, list, __sk_common.skc_portaddr_node)
>
>  #define IS_UDPLITE(__sk) (udp_sk(__sk)->pcflag)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index c88785a3e76c..c3a707d1cee8 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -178,7 +178,7 @@ struct sock_common {
> int skc_bound_dev_if;
> union {
> struct hlist_node   skc_bind_node;
> -   struct hlist_nulls_node skc_portaddr_node;
> +   struct hlist_node   skc_portaddr_node;
> };
> struct proto*skc_prot;
> possible_net_t  skc_net;
> @@ -670,18 +670,18 @@ static inline void sk_add_bind_node(struct sock *sk,
> hlist_for_each_entry(__sk, list, sk_bind_node)
>
>  /**
> - * sk_nulls_for_each_entry_offset - iterate over a list at a given struct offset
> + * sk_for_each_entry_offset_rcu - iterate over a list at a given struct offset
>   * @tpos:   the type * to use as a loop cursor.
>   * @pos:    the &hlist_node to use as a loop cursor.
>   * @head:   the head for your list.
>   * @offset: offset of hlist_node within the struct.
>   *
>   */
> -#define sk_nulls_for_each_entry_offset(tpos, pos, head, offset)           \
> -   for (pos = (head)->first;                                              \
> -        (!is_a_nulls(pos)) &&                                             \
> +#define sk_for_each_entry_offset_rcu(tpos, pos, head, offset)             \
> +   for (pos = rcu_dereference((head)->first);                             \
> +        pos != NULL &&                                                    \
>         ({ tpos = (typeof(*tpos) *)((void *)pos - offset); 1;});           \
> -        pos = pos->next)
> +        pos = rcu_dereference(pos->next))
>
>  static inline struct user_namespace *sk_user_ns(struct sock *sk)
>  {
> diff --git a/include/net/udp.h b/include/net/udp.h
> index 92927f729ac8..d870ec1611c4 100644
> --- a/include/net/udp.h
> +++ b/include/net/udp.h
> @@ -59,7 +59,7 @@ struct udp_skb_cb {
>   * @lock:  spinlock protecting changes to head/count
>   */
>  struct udp_hslot {
> -   struct hlist_nulls_head head;
> +   struct hlist_head   head;
> int count;
> spinlock_t  lock;
> } __attribute__((aligned(2 * sizeof(long))));
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 08eed5e16df0..0475aaf95040 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -143,10 +143,9 @@ 

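In rough outline, the effect on a receive-side lookup is the pattern below.
This is a sketch only, assuming the new two-argument
udp_portaddr_for_each_entry_rcu(); the real __udp4_lib_lookup() adds
score-based matching, reuseport selection and more:

/* Sketch: refcount-free lookup of a SOCK_RCU_FREE socket under RCU. */
static struct sock *udp_lookup_sketch(struct udp_hslot *hslot2,
				      __be32 daddr, unsigned short hnum)
{
	struct sock *sk;

	/* caller holds rcu_read_lock(); no sock_hold() on this path,
	 * the socket cannot be freed before the read-side section ends */
	udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
		if (sk->sk_rcv_saddr == daddr && sk->sk_num == hnum)
			return sk;
	}
	return NULL;
}
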
Re: [PATCH] net: remove unimplemented RTNH_F_PERVASIVE

2016-04-02 Thread Sergei Shtylyov

Hello.

On 4/2/2016 11:43 AM, Quentin Armitage wrote:


Linux 2.1.68 introduced RTNH_F_PERVASIVE, but it had no implementation
and couldn't be enabled since the required config parameter wasn't in
any Kconfig file (see commit d088dde7b).


   scripts/checkpatch.pl now enforces a certain commit citing format, and yours
doesn't match it, i.e. you need a 12-digit SHA1 followed by the subject in the
form ("<commit summary>").


This commit removes all remaining references to RTNH_F_PERVASIVE.
Although this will cause userspace applications that were using the
flag to fail to build, they will be alerted to the fact that using
RTNH_F_PERVASIVE was not achieving anything.

Signed-off-by: Quentin Armitage 

[...]


diff --git a/net/decnet/dn_fib.c b/net/decnet/dn_fib.c
index df48034..f5660c6 100644
--- a/net/decnet/dn_fib.c
+++ b/net/decnet/dn_fib.c
@@ -243,7 +243,7 @@ out:
} else {
struct net_device *dev;

-   if (nh->nh_flags&(RTNH_F_PERVASIVE|RTNH_F_ONLINK))
+   if (nh->nh_flags&RTNH_F_ONLINK)


   Please enclose the & in spaces, like below.


diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index d97268e..3883860 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -803,7 +803,7 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
} else {
struct in_device *in_dev;

-   if (nh->nh_flags & (RTNH_F_PERVASIVE | RTNH_F_ONLINK))
+   if (nh->nh_flags & RTNH_F_ONLINK)
return -EINVAL;

rcu_read_lock();


MBR, Sergei



net: memory leak due to CLONE_NEWNET

2016-04-02 Thread Dmitry Vyukov
Hello,

The following program leads to memory leaks in:

unreferenced object 0x88005c10d208 (size 96):
  comm "a.out", pid 10753, jiffies 4296778619 (age 43.118s)
  hex dump (first 32 bytes):
e8 31 85 2d 00 88 ff ff 0f 00 00 00 00 00 00 00  .1.-
00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00  .N..
  backtrace:
[] kmemleak_alloc+0x63/0xa0 mm/kmemleak.c:915
[< inline >] kmemleak_alloc_recursive include/linux/kmemleak.h:47
[< inline >] slab_post_alloc_hook mm/slab.h:406
[< inline >] slab_alloc_node mm/slub.c:2602
[< inline >] slab_alloc mm/slub.c:2610
[] kmem_cache_alloc_trace+0x160/0x3d0 mm/slub.c:2627
[< inline >] kmalloc include/linux/slab.h:478
[< inline >] tc_action_net_init include/net/act_api.h:122
[] csum_init_net+0x15e/0x450 net/sched/act_csum.c:593
[] ops_init+0xa9/0x3a0 net/core/net_namespace.c:109
[] setup_net+0x1b4/0x3e0 net/core/net_namespace.c:287
[] copy_net_ns+0xd6/0x1a0 net/core/net_namespace.c:367
[] create_new_namespaces+0x37f/0x740 kernel/nsproxy.c:106
[] unshare_nsproxy_namespaces+0xa9/0x1d0
kernel/nsproxy.c:205
[< inline >] SYSC_unshare kernel/fork.c:2019
[] SyS_unshare+0x3b3/0x800 kernel/fork.c:1969
[] entry_SYSCALL_64_fastpath+0x23/0xc1
arch/x86/entry/entry_64.S:207
[] 0x
unreferenced object 0x88005c10e1c8 (size 96):
  comm "a.out", pid 10753, jiffies 4296778620 (age 43.117s)
  hex dump (first 32 bytes):
e8 0b 85 2d 00 88 ff ff 0f 00 00 00 00 00 ad de  ...-
00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00  .N..
  backtrace:
[] kmemleak_alloc+0x63/0xa0 mm/kmemleak.c:915
[< inline >] kmemleak_alloc_recursive include/linux/kmemleak.h:47
[< inline >] slab_post_alloc_hook mm/slab.h:406
[< inline >] slab_alloc_node mm/slub.c:2602
[< inline >] slab_alloc mm/slub.c:2610
[] kmem_cache_alloc_trace+0x160/0x3d0 mm/slub.c:2627
[< inline >] kmalloc include/linux/slab.h:478
[< inline >] tc_action_net_init include/net/act_api.h:122
[] ife_init_net+0x15e/0x450 net/sched/act_ife.c:838
[] ops_init+0xa9/0x3a0 net/core/net_namespace.c:109
[] setup_net+0x1b4/0x3e0 net/core/net_namespace.c:287
[] copy_net_ns+0xd6/0x1a0 net/core/net_namespace.c:367
[] create_new_namespaces+0x37f/0x740 kernel/nsproxy.c:106
[] unshare_nsproxy_namespaces+0xa9/0x1d0
kernel/nsproxy.c:205
[< inline >] SYSC_unshare kernel/fork.c:2019
[] SyS_unshare+0x3b3/0x800 kernel/fork.c:1969
[] entry_SYSCALL_64_fastpath+0x23/0xc1
arch/x86/entry/entry_64.S:207
[] 0x
unreferenced object 0x880025a55b08 (size 96):
  comm "a.out", pid 10702, jiffies 4296768144 (age 61.526s)
  hex dump (first 32 bytes):
28 ed 55 2b 00 88 ff ff 0f 00 00 00 00 00 00 00  (.U+
00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00  .N..
  backtrace:
[] kmemleak_alloc+0x63/0xa0 mm/kmemleak.c:915
[< inline >] kmemleak_alloc_recursive include/linux/kmemleak.h:47
[< inline >] slab_post_alloc_hook mm/slab.h:406
[< inline >] slab_alloc_node mm/slub.c:2602
[< inline >] slab_alloc mm/slub.c:2610
[] kmem_cache_alloc_trace+0x160/0x3d0 mm/slub.c:2627
[< inline >] kmalloc include/linux/slab.h:478
[< inline >] tc_action_net_init include/net/act_api.h:122
[] nat_init_net+0x15e/0x450 net/sched/act_nat.c:311
[] ops_init+0xa9/0x3a0 net/core/net_namespace.c:109
[] setup_net+0x1b4/0x3e0 net/core/net_namespace.c:287
[] copy_net_ns+0xd6/0x1a0 net/core/net_namespace.c:367
[] create_new_namespaces+0x37f/0x740 kernel/nsproxy.c:106
[] unshare_nsproxy_namespaces+0xa9/0x1d0
kernel/nsproxy.c:205
[< inline >] SYSC_unshare kernel/fork.c:2019
[] SyS_unshare+0x3b3/0x800 kernel/fork.c:1969
[] entry_SYSCALL_64_fastpath+0x23/0xc1
arch/x86/entry/entry_64.S:207
[] 0x


#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main()
{
int pid, status;

pid = fork();
if (pid == 0) {
unshare(CLONE_NEWNET);
exit(0);
}
while (waitpid(pid, &status, 0) != pid) {
}
return 0;
}


grep "kmalloc-96" /proc/slabinfo confirms the leak.

I am on commit 05cf8077e54b20dddb756eaa26f3aeb5c38dd3cf (Apr 1).


[PATCH] net: remove unimplemented RTNH_F_PERVASIVE

2016-04-02 Thread Quentin Armitage
Linux 2.1.68 introduced RTNH_F_PERVASIVE, but it had no implementation
and couldn't be enabled since the required config parameter wasn't in
any Kconfig file (see commit d088dde7b).

This commit removes all remaining references to RTNH_F_PERVASIVE.
Although this will cause userspace applications that were using the
flag to fail to build, they will be alerted to the fact that using
RTNH_F_PERVASIVE was not achieving anything.

Signed-off-by: Quentin Armitage 
---
 include/uapi/linux/rtnetlink.h |2 +-
 net/decnet/dn_fib.c|2 +-
 net/ipv4/fib_semantics.c   |2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index ca764b5..58e6ba0 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -339,7 +339,7 @@ struct rtnexthop {
 /* rtnh_flags */
 
 #define RTNH_F_DEAD		1	/* Nexthop is dead (used by multipath)	*/
-#define RTNH_F_PERVASIVE	2	/* Do recursive gateway lookup	*/
+					/* 2 was RTNH_F_PERVASIVE (never implemented) */
 #define RTNH_F_ONLINK		4	/* Gateway is forced on link	*/
 #define RTNH_F_OFFLOAD		8	/* offloaded route */
 #define RTNH_F_LINKDOWN		16	/* carrier-down on nexthop */
diff --git a/net/decnet/dn_fib.c b/net/decnet/dn_fib.c
index df48034..f5660c6 100644
--- a/net/decnet/dn_fib.c
+++ b/net/decnet/dn_fib.c
@@ -243,7 +243,7 @@ out:
} else {
struct net_device *dev;
 
-   if (nh->nh_flags&(RTNH_F_PERVASIVE|RTNH_F_ONLINK))
+   if (nh->nh_flags&RTNH_F_ONLINK)
return -EINVAL;
 
dev = __dev_get_by_index(&init_net, nh->nh_oif);
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index d97268e..3883860 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -803,7 +803,7 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
} else {
struct in_device *in_dev;
 
-   if (nh->nh_flags & (RTNH_F_PERVASIVE | RTNH_F_ONLINK))
+   if (nh->nh_flags & RTNH_F_ONLINK)
return -EINVAL;
 
rcu_read_lock();
-- 
1.7.7.6


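To make the "fail to build" point concrete, a hypothetical application snippet
that would now break looks like this (illustrative only, not from any real
program):

	struct rtnexthop rtnh = { 0 };

	/* used to compile but never did anything; after this patch the
	 * symbol is gone from the uapi header and the build fails,
	 * alerting the author */
	rtnh.rtnh_flags |= RTNH_F_PERVASIVE;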

Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program

2016-04-02 Thread Jesper Dangaard Brouer

First of all, I'm very happy to see people start working on this!
Thanks you Brenden!

On Fri,  1 Apr 2016 18:21:57 -0700
Brenden Blanco  wrote:

> Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx4 driver.  Since
> bpf programs require a skb context to navigate the packet, build a
> percpu fake skb with the minimal fields. This avoids the costly
> allocation for packets that end up being dropped.
> 
> Since mlx4 is so far the only user of this pseudo skb, the build
> function is defined locally.
> 
> Signed-off-by: Brenden Blanco 
> ---
[...]
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 86bcfe5..03fe005 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
[...]
> @@ -764,6 +765,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int budget)
>   if (budget <= 0)
>   return polled;
>  
> + prog = READ_ONCE(priv->prog);
> +
>   /* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
>* descriptor offset can be deduced from the CQE index instead of
>* reading 'cqe->index' */
> @@ -840,6 +843,21 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int budget)
>   l2_tunnel = (dev->hw_enc_features & NETIF_F_RXCSUM) &&
>   (cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL));
>  
> + /* A bpf program gets first chance to drop the packet. It may
> +  * read bytes but not past the end of the frag. A non-zero
> +  * return indicates packet should be dropped.
> +  */
> + if (prog) {
> + struct ethhdr *ethh;
> +

I think you need to DMA sync RX-page before you can safely access
packet data in page (on all arch's).

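A sync along those lines might look roughly like the following (sketch only;
the length to sync is an assumption, and the fields used are the ones already
in scope in mlx4_en_process_rx_cq()):

	dma_addr_t dma = be64_to_cpu(rx_desc->data[0].addr);

	/* make the packet bytes of the first frag visible to the CPU
	 * before handing them to the bpf program */
	dma_sync_single_for_cpu(priv->ddev, dma,
				priv->frag_info[0].frag_size,
				DMA_FROM_DEVICE);
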
> + ethh = (struct ethhdr *)(page_address(frags[0].page) +
> +  frags[0].page_offset);
> + if (mlx4_call_bpf(prog, ethh, length)) {

AFAIK length here covers all the frags[n].page, thus potentially
causing the BPF program to access memory out of bound (crash).

Having several page fragments is AFAIK an optimization for jumbo-frames
on PowerPC (which is a bit annoying for your use-case ;-)).


> + priv->stats.rx_dropped++;
> + goto next;
> + }
> + }
> +



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864

2016-04-02 Thread Alexander Duyck
On Fri, Apr 1, 2016 at 7:16 PM, David Miller  wrote:
> From: Alexander Duyck 
> Date: Fri, 1 Apr 2016 12:58:41 -0700
>
>> RFC 6864 is pretty explicit about this, IPv4 ID used only for
>> fragmentation.  https://tools.ietf.org/html/rfc6864#section-4.1
>>
>> The goal with this change is to try and keep most of the existing
behavior intact without violating this rule?  I would think the
>> sequence number should give you the ability to infer a drop in the
>> case of TCP.  In the case of UDP tunnels we are now getting a bit more
>> data since we were ignoring the outer IP header ID before.
>
> When retransmits happen, the sequence numbers are the same.  But you
> can then use the IP ID to see exactly what happened.  You can even
> tell if multiple retransmits got reordered.
>
> Eric's use case is extremely useful, and flat out eliminates ambiguity
> when analyzing TCP traces.

I'm not really sure the IP ID is really all that useful.  On a 10G
link it takes about 80ms for it to wrap using an MTU of 1500.  That
value is going to keep dropping as we move up to 40G and 100G.  That
was one of the motivations behind RFC 6864 because it starts becoming
a bottle-neck if you want to keep the IP IDs truly unique.  In
addition while this change would allow you to combined disjointed IP
IDs I don't think you would lose the re-transmission as there would
likely be a gap in sequence numbers that would cause the flow to be
flushed from GRO, and it isn't as if we can prepend the retransmit to
the aggregated frame, we are always adding to the tail.

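For reference, the 80ms figure is simple arithmetic: at 10 Gb/s a stream of
1500-byte packets is roughly 10^10 / (1500 * 8) ~ 833 kpps, so the 16-bit ID
space of 65536 values wraps in about 65536 / 833000 ~ 79 ms; at 40 Gb/s that
drops to roughly 20 ms and at 100 Gb/s to about 8 ms.
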
I would think the most likely case where this change would foul up any
IP IDs would be the garbage-in/garbage-out case like the IPv6 to IPv4
translation that is using the fixed IP ID of 0.  I agree that it isn't
desirable, however at the same time per RFC 6864 there is nothing
there to say we cannot do that.  In addition it would likely help to
mitigate some of the impact of IP ID on things like SLHC compression
since the resegmented frames would be incrementing so it would reduce
the number of times the IP ID would have to be updated.

- Alex