Re: [PATCH net-next v2] net: ipmr/ip6mr: add support for keeping an entry age

2016-07-15 Thread David Miller
From: Nikolay Aleksandrov 
Date: Fri, 15 Jul 2016 23:11:15 -0700

> Sorry about that, would you like me to resubmit the patch ?

That's not necessary.


Re: [PATCH net] bnxt_en: Fix potential race condition in bnxt_tx_enable()

2016-07-15 Thread David Miller
From: Florian Fainelli 
Date: Fri, 15 Jul 2016 16:42:01 -0700

> @@ -4599,7 +4599,9 @@ static void bnxt_tx_enable(struct bnxt *bp)
>   for (i = 0; i < bp->tx_nr_rings; i++) {
>   txr = &bp->tx_ring[i];
>   txq = netdev_get_tx_queue(bp->dev, i);
> + __netif_tx_lock(txq, smp_processor_id());
>   txr->dev_state = 0;
> + __netif_tx_unlock(txq);

You're going to have to explain how this could possibly cause a
problem, because I'm pretty sure it can't.

Either the reader sees 0, or non-zero, in this value.

And adding locking around this assignment does not change that at all.
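
The argument above can be illustrated with a small userspace sketch. It is only an analogy: struct ring and the ring_* helpers are hypothetical names, a pthread mutex stands in for the TX queue lock, and C11 atomics stand in for the kernel's plain store. A reader that only loads the value sees the old or the new state in both variants.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-in for a TX ring; this is not the bnxt structure. */
struct ring {
	pthread_mutex_t lock;
	atomic_int dev_state;
};

/* Writer without the lock: a single store of 0. */
static void ring_enable_plain(struct ring *r)
{
	atomic_store_explicit(&r->dev_state, 0, memory_order_relaxed);
}

/* Writer with the lock: the same single store, wrapped in lock/unlock. */
static void ring_enable_locked(struct ring *r)
{
	int err = pthread_mutex_lock(&r->lock);

	assert(err == 0);
	(void)err;
	atomic_store_explicit(&r->dev_state, 0, memory_order_relaxed);
	pthread_mutex_unlock(&r->lock);
}

/* Reader that does not take the lock: it observes either the old or new value. */
static int ring_read_state(struct ring *r)
{
	return atomic_load_explicit(&r->dev_state, memory_order_relaxed);
}
```

The lock would only make a difference if some reader depended on the value staying stable across its own locked region, which is the kind of explanation being asked for here.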


Re: [PATCH net v2] tcp_timer.c: Add kernel-doc function descriptions

2016-07-15 Thread David Miller
From: Richard Sailer 
Date: Sat, 16 Jul 2016 04:04:34 +0200

> This adds kernel-doc style descriptions for 6 functions and
> fixes 1 typo.
> 
> Signed-off-by: Richard Sailer 
> ---
> Changes from v1:
>   * All comments are indented consistently
>   * Consistently no "." at end of function short description line

Applied to net-next, thanks.
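
For readers of the new retransmits_timed_out() kernel-doc, here is a rough userspace sketch of the default timeout it describes: "boundary" exponentially backed-off retransmissions starting from TCP_RTO_MIN (or TCP_TIMEOUT_INIT when syn_set is true), with each step capped at TCP_RTO_MAX. The millisecond constants (200 ms, 1 s, 120 s) and the names default_timeout_ms() and ilog2_u32() are assumptions made for illustration; this is my reading of the description, not a copy of the kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values, expressed in milliseconds so the sketch does not depend on HZ. */
#define RTO_MIN_MS      200u
#define TIMEOUT_INIT_MS 1000u
#define RTO_MAX_MS      120000u

/* floor(log2(v)) for v > 0 */
static unsigned int ilog2_u32(unsigned int v)
{
	unsigned int r = 0;

	assert(v > 0);
	while (v >>= 1)
		r++;
	return r;
}

/* Total time covered by "boundary" backed-off retransmissions. */
static unsigned int default_timeout_ms(unsigned int boundary, bool syn_set)
{
	unsigned int rto_base = syn_set ? TIMEOUT_INIT_MS : RTO_MIN_MS;
	/* Number of doublings before the RTO reaches the RTO_MAX_MS cap. */
	unsigned int thresh = ilog2_u32(RTO_MAX_MS / rto_base);

	if (boundary <= thresh)
		return ((2u << boundary) - 1) * rto_base;

	/* Past the threshold every further retry waits the capped RTO. */
	return ((2u << thresh) - 1) * rto_base +
	       (boundary - thresh) * RTO_MAX_MS;
}
```

Under these assumptions default_timeout_ms(15, false) gives 924600 ms, and default_timeout_ms(6, true) gives 127000 ms.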


Re: [PATCH net-next v2] net: ipmr/ip6mr: add support for keeping an entry age

2016-07-15 Thread Nikolay Aleksandrov

> On Jul 15, 2016, at 10:56 PM, David Miller  wrote:
> 
> From: Nikolay Aleksandrov 
> Date: Thu, 14 Jul 2016 19:28:27 +0300
> 
>> In preparation for hardware offloading of ipmr/ip6mr we need an
>> interface that allows to check (and later update) the age of entries.
>> Relying on stats alone can show activity but not actual age of the entry,
>> furthermore when there're tens of thousands of entries a lot of the
>> hardware implementations only support "hit" bits which are cleared on
>> read to denote that the entry was active and shouldn't be aged out,
>> these can then be naturally translated into age timestamp and will be
>> compatible with the software forwarding age. Using a lastuse entry doesn't
>> affect performance because the members in that cache line are written to
>> along with the age.
>> Since all new users are encouraged to use ipmr via netlink, this is
>> exported via the RTA_EXPIRES attribute.
>> Also do a minor local variable declaration style adjustment - arrange them
>> longest to shortest.
>> 
>> Signed-off-by: Nikolay Aleksandrov 
>> CC: Roopa Prabhu 
>> CC: Shrijeet Mukherjee 
>> CC: Satish Ashok 
>> CC: Donald Sharp 
>> CC: David S. Miller 
>> CC: Alexey Kuznetsov 
>> CC: James Morris 
>> CC: Hideaki YOSHIFUJI 
>> CC: Patrick McHardy 
>> ---
>> v2: Just reuse RTA_EXPIRES instead to minimize the attr size and simplify,
>> others will be added when needed
> 
> Why are your dates on these changes in the past?
> 
> Having them in the past messes up the ordering on patchwork because
> patchwork orders incoming patches by date, and therefore I can't just
> look at the first page to see "newer" submissions.
> 
> So please don't do whatever propagates commit dates into your emails,
> or whatever is causing this problem.  It's best always to use the
> current time.

Hmm, it seems my VM has its time zone messed up and since I’m in California
right now the dates come out wrong. Sorry about that, would you like me to
resubmit the patch?

Thanks,
 Nik



Re: [PATCH net-next v2] net: ipmr/ip6mr: add support for keeping an entry age

2016-07-15 Thread David Miller
From: Nikolay Aleksandrov 
Date: Thu, 14 Jul 2016 19:28:27 +0300

> In preparation for hardware offloading of ipmr/ip6mr we need an
> interface that allows to check (and later update) the age of entries.
> Relying on stats alone can show activity but not actual age of the entry,
> furthermore when there're tens of thousands of entries a lot of the
> hardware implementations only support "hit" bits which are cleared on
> read to denote that the entry was active and shouldn't be aged out,
> these can then be naturally translated into age timestamp and will be
> compatible with the software forwarding age. Using a lastuse entry doesn't
> affect performance because the members in that cache line are written to
> along with the age.
> Since all new users are encouraged to use ipmr via netlink, this is
> exported via the RTA_EXPIRES attribute.
> Also do a minor local variable declaration style adjustment - arrange them
> longest to shortest.
> 
> Signed-off-by: Nikolay Aleksandrov 
> CC: Roopa Prabhu 
> CC: Shrijeet Mukherjee 
> CC: Satish Ashok 
> CC: Donald Sharp 
> CC: David S. Miller 
> CC: Alexey Kuznetsov 
> CC: James Morris 
> CC: Hideaki YOSHIFUJI 
> CC: Patrick McHardy 
> ---
> v2: Just reuse RTA_EXPIRES instead to minimize the attr size and simplify,
> others will be added when needed

Why are your dates on these changes in the past?

Having them in the past messes up the ordering on patchwork because
patchwork orders incoming patches by date, and therefore I can't just
look at the first page to see "newer" submissions.

So please don't do whatever propagates commit dates into your emails,
or whatever is causing this problem.  It's best always to use the
current time.


Re: [PATCH 1/1] tracing, bpf: Implement function bpf_probe_write

2016-07-15 Thread Alexei Starovoitov
On Fri, Jul 15, 2016 at 07:16:01PM -0700, Sargun Dhillon wrote:
> 
> 
> On Thu, 14 Jul 2016, Alexei Starovoitov wrote:
> 
> >On Wed, Jul 13, 2016 at 01:31:57PM -0700, Sargun Dhillon wrote:
> >>
> >>
> >>On Wed, 13 Jul 2016, Alexei Starovoitov wrote:
> >>
> >>>On Wed, Jul 13, 2016 at 03:36:11AM -0700, Sargun Dhillon wrote:
> >>>>Provides BPF programs, attached to kprobes a safe way to write to
> >>>>memory referenced by probes. This is done by making probe_kernel_write
> >>>>accessible to bpf functions via the bpf_probe_write helper.
> >>>
> >>>not quite :)
> >>>
> >>>>Signed-off-by: Sargun Dhillon
> >>>>---
> >>>> include/uapi/linux/bpf.h  |  3 +++
> >>>> kernel/trace/bpf_trace.c  | 20
> >>>> samples/bpf/bpf_helpers.h |  2 ++
> >>>> 3 files changed, 25 insertions(+)
> >>>>
> >>>>diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >>>>index 406459b..355b565 100644
> >>>>--- a/include/uapi/linux/bpf.h
> >>>>+++ b/include/uapi/linux/bpf.h
> >>>>@@ -313,6 +313,9 @@ enum bpf_func_id {
> >>>>  */
> >>>>  BPF_FUNC_skb_get_tunnel_opt,
> >>>>  BPF_FUNC_skb_set_tunnel_opt,
> >>>>+
> >>>>+ BPF_FUNC_probe_write, /* int bpf_probe_write(void *dst, void *src,
> >>>>int size) */
> >>>>+
> >>>
> >>>the patch is against some old kernel.
> >>>Please always make the patch against net-next tree and cc netdev list.
> >>>
> >>Sorry, I did this against Linus's tree, not net-next. Will fix.
> >>
> >>>>+static u64 bpf_probe_write(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
> >>>>+{
> >>>>+ void *dst = (void *) (long) r1;
> >>>>+ void *unsafe_ptr = (void *) (long) r2;
> >>>>+ int  size = (int) r3;
> >>>>+
> >>>>+ return probe_kernel_write(dst, unsafe_ptr, size);
> >>>>+}
> >>>
> >>>the patch is whitespace mangled. Please see
> >>>Documentation/networking/netdev-FAQ.txt
> >>Also will fix.
> >>
> >>>
> >>>the main issue though that we cannot simply allow bpf to do probe_write,
> >>>since it may crash the kernel.
> >>>What might be ok is to allow writing into memory of current
> >>>user space process only. This way bpf prog will keep kernel safety 
> >>>guarantees,
> >>>yet it will be able to modify user process memory when necessary.
> >>>Since bpf+tracing is root only, it doesn't pose security risk.
> >>>
> >>>
> >>
> >>Doesn't probe_write prevent you from writing to protected memory and
> >>generate an EFAULT? Or are you worried about the situation where a bpf
> >>program writes to some other chunk of kernel memory, or writes bad data
> >>to said kernel memory?
> >>
> >>I guess when I meant "safe" -- it's safer than allowing arbitrary memcpy.
> >>I don't see a good way to ensure safety otherwise as we don't know
> >>which registers point to memory that it's reasonable for probes to
> >>manipulate. It's not like skb_store_bytes where we can check the pointer
> >>going in is the same pointer that's referenced, and with a super
> >>restricted datatype.
> >
> >exactly. probe_write can write anywhere in the kernel and that
> >will cause crashes. If we allow that bpf becomes no different than
> >kernel module.
> >
> >>Perhaps, it would be a good idea to describe an example where I used this:
> >>#include 
> >>#include 
> >>#include 
> >>
> >>
> >>int trace_inet_stream_connect(struct pt_regs *ctx)
> >>{
> >>if (!PT_REGS_PARM2(ctx)) {
> >>return 0;
> >>}
> >>struct sockaddr uaddr = {};
> >>struct sockaddr_in *addr_in;
> >>bpf_probe_read(&uaddr, sizeof(struct sockaddr), (void 
> >> *)PT_REGS_PARM2(ctx));
> >>if (uaddr.sa_family == AF_INET) {
> >>// Simple cast causes LLVM weirdness
> >>addr_in = &uaddr;
> >>char fmt[] = "Connecting on port: %d\n";
> >>bpf_trace_printk(fmt, sizeof(fmt), ntohs(addr_in->sin_port));
> >>if (ntohs(addr_in->sin_port) == 80) {
> >>addr_in->sin_port = htons(443);
> >>bpf_probe_write((void *)PT_REGS_PARM2(ctx), &uaddr, 
> >> sizeof(uaddr));
> >>}
> >>}
> >>return 0;
> >>};
> >>
> >>There are two reasons I want to do this:
> >>1) Debugging - sometimes, it makes sense to divert a program's syscalls in
> >>order to allow for better debugging
> >>2) Network Functions - I wrote a load balancer which intercepts
> >>inet_stream_connect & tcp_set_state. We can manipulate the destination
> >>address as necessary at connect time. This also has the nice side effect
> >>that getpeername() returns the real IP that a server is connected to, and
> >>the performance is far better than doing "network load balancing"
> >>
> >>(I realize this is a total hack, better approaches would be appreciated)
> >
> >nice. interesting idea.
> >Have you considered ld_preload hack to do port rewrite?
> >
> We've thought about it. It won't really work for us, because we're doing this
> to manipulate 3rd party runtimes, many of which are written in languages
> that don't play nice with LD_PRELOAD. Go is the primary problem child in
> this case. We also looked a

Re: [PATCH 1/1] tracing, bpf: Implement function bpf_probe_write

2016-07-15 Thread Sargun Dhillon



On Thu, 14 Jul 2016, Alexei Starovoitov wrote:

>On Wed, Jul 13, 2016 at 01:31:57PM -0700, Sargun Dhillon wrote:
>>
>>On Wed, 13 Jul 2016, Alexei Starovoitov wrote:
>>
>>>On Wed, Jul 13, 2016 at 03:36:11AM -0700, Sargun Dhillon wrote:
>>>>Provides BPF programs, attached to kprobes a safe way to write to
>>>>memory referenced by probes. This is done by making probe_kernel_write
>>>>accessible to bpf functions via the bpf_probe_write helper.
>>>
>>>not quite :)
>>>
>>>>Signed-off-by: Sargun Dhillon
>>>>---
>>>> include/uapi/linux/bpf.h  |  3 +++
>>>> kernel/trace/bpf_trace.c  | 20
>>>> samples/bpf/bpf_helpers.h |  2 ++
>>>> 3 files changed, 25 insertions(+)
>>>>
>>>>diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>>>index 406459b..355b565 100644
>>>>--- a/include/uapi/linux/bpf.h
>>>>+++ b/include/uapi/linux/bpf.h
>>>>@@ -313,6 +313,9 @@ enum bpf_func_id {
>>>>  */
>>>>  BPF_FUNC_skb_get_tunnel_opt,
>>>>  BPF_FUNC_skb_set_tunnel_opt,
>>>>+
>>>>+ BPF_FUNC_probe_write, /* int bpf_probe_write(void *dst, void *src,
>>>>int size) */
>>>>+
>>>
>>>the patch is against some old kernel.
>>>Please always make the patch against net-next tree and cc netdev list.
>>>
>>Sorry, I did this against Linus's tree, not net-next. Will fix.
>>
>>>>+static u64 bpf_probe_write(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
>>>>+{
>>>>+ void *dst = (void *) (long) r1;
>>>>+ void *unsafe_ptr = (void *) (long) r2;
>>>>+ int  size = (int) r3;
>>>>+
>>>>+ return probe_kernel_write(dst, unsafe_ptr, size);
>>>>+}
>>>
>>>the patch is whitespace mangled. Please see
>>>Documentation/networking/netdev-FAQ.txt
>>Also will fix.
>>
>>>
>>>the main issue though that we cannot simply allow bpf to do probe_write,
>>>since it may crash the kernel.
>>>What might be ok is to allow writing into memory of current
>>>user space process only. This way bpf prog will keep kernel safety guarantees,
>>>yet it will be able to modify user process memory when necessary.
>>>Since bpf+tracing is root only, it doesn't pose security risk.
>>>
>>
>>Doesn't probe_write prevent you from writing to protected memory and
>>generate an EFAULT? Or are you worried about the situation where a bpf
>>program writes to some other chunk of kernel memory, or writes bad data
>>to said kernel memory?
>>
>>I guess when I meant "safe" -- it's safer than allowing arbitrary memcpy.
>>I don't see a good way to ensure safety otherwise as we don't know
>>which registers point to memory that it's reasonable for probes to
>>manipulate. It's not like skb_store_bytes where we can check the pointer
>>going in is the same pointer that's referenced, and with a super
>>restricted datatype.
>
>exactly. probe_write can write anywhere in the kernel and that
>will cause crashes. If we allow that bpf becomes no different than
>kernel module.
>
>>Perhaps, it would be a good idea to describe an example where I used this:
>>#include 
>>#include 
>>#include 
>>
>>
>>int trace_inet_stream_connect(struct pt_regs *ctx)
>>{
>>    if (!PT_REGS_PARM2(ctx)) {
>>        return 0;
>>    }
>>    struct sockaddr uaddr = {};
>>    struct sockaddr_in *addr_in;
>>    bpf_probe_read(&uaddr, sizeof(struct sockaddr), (void *)PT_REGS_PARM2(ctx));
>>    if (uaddr.sa_family == AF_INET) {
>>        // Simple cast causes LLVM weirdness
>>        addr_in = &uaddr;
>>        char fmt[] = "Connecting on port: %d\n";
>>        bpf_trace_printk(fmt, sizeof(fmt), ntohs(addr_in->sin_port));
>>        if (ntohs(addr_in->sin_port) == 80) {
>>            addr_in->sin_port = htons(443);
>>            bpf_probe_write((void *)PT_REGS_PARM2(ctx), &uaddr, sizeof(uaddr));
>>        }
>>    }
>>    return 0;
>>};
>>
>>There are two reasons I want to do this:
>>1) Debugging - sometimes, it makes sense to divert a program's syscalls in
>>order to allow for better debugging
>>2) Network Functions - I wrote a load balancer which intercepts
>>inet_stream_connect & tcp_set_state. We can manipulate the destination
>>address as necessary at connect time. This also has the nice side effect
>>that getpeername() returns the real IP that a server is connected to, and
>>the performance is far better than doing "network load balancing"
>>
>>(I realize this is a total hack, better approaches would be appreciated)
>
>nice. interesting idea.
>Have you considered ld_preload hack to do port rewrite?

We've thought about it. It won't really work for us, because we're doing 
this to manipulate 3rd party runtimes, many of which are written in 
languages that don't play nice with LD_PRELOAD. Go is the primary problem 
child in this case. We also looked at using SECCOMP + ptrace, but again, 
not all runtimes play nice with ptrace.
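
For reference, the LD_PRELOAD approach discussed above might look roughly like the sketch below. It assumes Linux with glibc; rewrite_port() is a hypothetical helper, and the wrapper forwards through syscall(SYS_connect, ...) rather than the more common dlsym(RTLD_NEXT, "connect") lookup, purely to keep the example free of extra link dependencies.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>

static_assert(sizeof(struct sockaddr_in) <= sizeof(struct sockaddr_storage),
	      "sockaddr_in must fit in the scratch copy");

/* Hypothetical helper: rewrite IPv4 port 80 to 443. Returns 1 if rewritten. */
static int rewrite_port(struct sockaddr *addr, socklen_t len)
{
	struct sockaddr_in *in;

	if (!addr || len < sizeof(*in) || addr->sa_family != AF_INET)
		return 0;
	in = (struct sockaddr_in *)addr;
	if (ntohs(in->sin_port) != 80)
		return 0;
	in->sin_port = htons(443);
	return 1;
}

/* Interposed connect(): patch a private copy, then issue the real syscall. */
int connect(int fd, const struct sockaddr *addr, socklen_t len)
{
	struct sockaddr_storage copy;

	if (addr && len <= sizeof(copy)) {
		memcpy(&copy, addr, len);
		rewrite_port((struct sockaddr *)&copy, len);
		return (int)syscall(SYS_connect, fd, &copy, len);
	}
	return (int)syscall(SYS_connect, fd, addr, len);
}
```

Built as a shared object (for example with cc -shared -fPIC) and listed in LD_PRELOAD, this only affects programs that call connect() through the C library. A runtime that issues system calls directly never reaches the wrapper, which matches the limitation described above.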



>>If we allowed manipulation of the current task's user memory by exposing
>>copy_to_user, that could also work if I attach the probe to sys_connect,
>>I could overwrite the address there before it gets copied into
>>kernel space, but that could lead to its own weirdness.
>
>we cannot simply call copy_to_user from the bpf either,
>but yeah, something semantically equivalent to copy_to_user should
>solve your port rewriting case, right?
>Could you explain little bit more on 'sy

[PATCH net v2] tcp_timer.c: Add kernel-doc function descriptions

2016-07-15 Thread Richard Sailer
This adds kernel-doc style descriptions for 6 functions and
fixes 1 typo.

Signed-off-by: Richard Sailer 
---
Changes from v1:
  * All comments are indented consistently
  * Consistently no "." at end of function short description line

 net/ipv4/tcp_timer.c | 81 +---
 1 file changed, 64 insertions(+), 17 deletions(-)

diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index debdd8b..d84930b 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -24,6 +24,13 @@
 
 int sysctl_tcp_thin_linear_timeouts __read_mostly;
 
+/**
+ *  tcp_write_err() - close socket and save error info
+ *  @sk:  The socket the error has appeared on.
+ *
+ *  Returns: Nothing (void)
+ */
+
 static void tcp_write_err(struct sock *sk)
 {
sk->sk_err = sk->sk_err_soft ? : ETIMEDOUT;
@@ -33,16 +40,21 @@ static void tcp_write_err(struct sock *sk)
__NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONTIMEOUT);
 }
 
-/* Do not allow orphaned sockets to eat all our resources.
- * This is direct violation of TCP specs, but it is required
- * to prevent DoS attacks. It is called when a retransmission timeout
- * or zero probe timeout occurs on orphaned socket.
+/**
+ *  tcp_out_of_resources() - Close socket if out of resources
+ *  @sk:pointer to current socket
+ *  @do_reset:  send a last packet with reset flag
  *
- * Criteria is still not confirmed experimentally and may change.
- * We kill the socket, if:
- * 1. If number of orphaned sockets exceeds an administratively configured
- *limit.
- * 2. If we have strong memory pressure.
+ *  Do not allow orphaned sockets to eat all our resources.
+ *  This is direct violation of TCP specs, but it is required
+ *  to prevent DoS attacks. It is called when a retransmission timeout
+ *  or zero probe timeout occurs on orphaned socket.
+ *
+ *  Criteria is still not confirmed experimentally and may change.
+ *  We kill the socket, if:
+ *  1. If number of orphaned sockets exceeds an administratively configured
+ * limit.
+ *  2. If we have strong memory pressure.
  */
 static int tcp_out_of_resources(struct sock *sk, bool do_reset)
 {
@@ -74,7 +86,11 @@ static int tcp_out_of_resources(struct sock *sk, bool 
do_reset)
return 0;
 }
 
-/* Calculate maximal number or retries on an orphaned socket. */
+/**
+ *  tcp_orphan_retries() - Returns maximal number of retries on an orphaned 
socket
+ *  @sk:Pointer to the current socket.
+ *  @alive: bool, socket alive state
+ */
 static int tcp_orphan_retries(struct sock *sk, bool alive)
 {
int retries = sock_net(sk)->ipv4.sysctl_tcp_orphan_retries; /* May be 
zero. */
@@ -115,10 +131,22 @@ static void tcp_mtu_probing(struct inet_connection_sock 
*icsk, struct sock *sk)
}
 }
 
-/* This function calculates a "timeout" which is equivalent to the timeout of a
- * TCP connection after "boundary" unsuccessful, exponentially backed-off
+
+/**
+ *  retransmits_timed_out() - returns true if this connection has timed out
+ *  @sk:   The current socket
+ *  @boundary: max number of retransmissions
+ *  @timeout:  A custom timeout value.
+ * If set to 0 the default timeout is calculated and used.
+ * Using TCP_RTO_MIN and the number of unsuccessful retransmits.
+ *  @syn_set:  true if the SYN Bit was set.
+ *
+ * The default "timeout" value this function can calculate and use
+ * is equivalent to the timeout of a TCP Connection
+ * after "boundary" unsuccessful, exponentially backed-off
  * retransmissions with an initial RTO of TCP_RTO_MIN or TCP_TIMEOUT_INIT if
  * syn_set flag is set.
+ *
  */
 static bool retransmits_timed_out(struct sock *sk,
  unsigned int boundary,
@@ -257,6 +285,16 @@ out:
sk_mem_reclaim(sk);
 }
 
+
+/**
+ *  tcp_delack_timer() - The TCP delayed ACK timeout handler
+ *  @data:  Pointer to the current socket. (gets casted to struct sock *)
+ *
+ *  This function gets (indirectly) called when the kernel timer for a TCP 
packet
+ *  of this socket expires. Calls tcp_delack_timer_handler() to do the actual 
work.
+ *
+ *  Returns: Nothing (void)
+ */
 static void tcp_delack_timer(unsigned long data)
 {
struct sock *sk = (struct sock *)data;
@@ -350,10 +388,18 @@ static void tcp_fastopen_synack_timer(struct sock *sk)
  TCP_TIMEOUT_INIT << req->num_timeout, TCP_RTO_MAX);
 }
 
-/*
- * The TCP retransmit timer.
- */
 
+/**
+ *  tcp_retransmit_timer() - The TCP retransmit timeout handler
+ *  @sk:  Pointer to the current socket.
+ *
+ *  This function gets called when the kernel timer for a TCP packet
+ *  of this socket expires.
+ *
+ *  It handles retransmission, timer adjustment and other necessary measures.
+ *
+ *  Returns: Nothing (void)
+ */
 void tcp_retransmit_timer(struct sock *sk)
 {
struct tcp_sock *tp = tcp_sk(sk);
@@ -494,7 +540,8 @@ out_reset_timer:
 out:;
 }
 
-/* Called with BH disabled */
+/* Called with 

[PATCH net-next v2] net: ipmr/ip6mr: add support for keeping an entry age

2016-07-15 Thread Nikolay Aleksandrov
In preparation for hardware offloading of ipmr/ip6mr we need an
interface that allows to check (and later update) the age of entries.
Relying on stats alone can show activity but not actual age of the entry,
furthermore when there're tens of thousands of entries a lot of the
hardware implementations only support "hit" bits which are cleared on
read to denote that the entry was active and shouldn't be aged out,
these can then be naturally translated into age timestamp and will be
compatible with the software forwarding age. Using a lastuse entry doesn't
affect performance because the members in that cache line are written to
along with the age.
Since all new users are encouraged to use ipmr via netlink, this is
exported via the RTA_EXPIRES attribute.
Also do a minor local variable declaration style adjustment - arrange them
longest to shortest.

Signed-off-by: Nikolay Aleksandrov 
CC: Roopa Prabhu 
CC: Shrijeet Mukherjee 
CC: Satish Ashok 
CC: Donald Sharp 
CC: David S. Miller 
CC: Alexey Kuznetsov 
CC: James Morris 
CC: Hideaki YOSHIFUJI 
CC: Patrick McHardy 
---
v2: Just reuse RTA_EXPIRES instead to minimize the attr size and simplify,
others will be added when needed

 include/linux/mroute.h  |  1 +
 include/linux/mroute6.h |  1 +
 net/ipv4/ipmr.c | 13 +
 net/ipv6/ip6mr.c| 13 +
 4 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/include/linux/mroute.h b/include/linux/mroute.h
index bf9b322cb0b0..d351fd3e1049 100644
--- a/include/linux/mroute.h
+++ b/include/linux/mroute.h
@@ -104,6 +104,7 @@ struct mfc_cache {
unsigned long bytes;
unsigned long pkt;
unsigned long wrong_if;
+   unsigned long lastuse;
unsigned char ttls[MAXVIFS];/* TTL thresholds   
*/
} res;
} mfc_un;
diff --git a/include/linux/mroute6.h b/include/linux/mroute6.h
index 66982e764051..3987b64040c5 100644
--- a/include/linux/mroute6.h
+++ b/include/linux/mroute6.h
@@ -92,6 +92,7 @@ struct mfc6_cache {
unsigned long bytes;
unsigned long pkt;
unsigned long wrong_if;
+   unsigned long lastuse;
unsigned char ttls[MAXMIFS];/* TTL thresholds   
*/
} res;
} mfc_un;
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 5ad48ec77710..e0d76f5f0113 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1150,6 +1150,7 @@ static int ipmr_mfc_add(struct net *net, struct mr_table 
*mrt,
c->mfc_origin = mfc->mfcc_origin.s_addr;
c->mfc_mcastgrp = mfc->mfcc_mcastgrp.s_addr;
c->mfc_parent = mfc->mfcc_parent;
+   c->mfc_un.res.lastuse = jiffies;
ipmr_update_thresholds(mrt, c, mfc->mfcc_ttls);
if (!mrtsock)
c->mfc_flags |= MFC_STATIC;
@@ -1792,6 +1793,7 @@ static void ip_mr_forward(struct net *net, struct 
mr_table *mrt,
vif = cache->mfc_parent;
cache->mfc_un.res.pkt++;
cache->mfc_un.res.bytes += skb->len;
+   cache->mfc_un.res.lastuse = jiffies;
 
if (cache->mfc_origin == htonl(INADDR_ANY) && true_vifi >= 0) {
struct mfc_cache *cache_proxy;
@@ -2071,10 +2073,10 @@ drop:
 static int __ipmr_fill_mroute(struct mr_table *mrt, struct sk_buff *skb,
  struct mfc_cache *c, struct rtmsg *rtm)
 {
-   int ct;
-   struct rtnexthop *nhp;
-   struct nlattr *mp_attr;
struct rta_mfc_stats mfcs;
+   struct nlattr *mp_attr;
+   struct rtnexthop *nhp;
+   int ct;
 
/* If cache is unresolved, don't try to parse IIF and OIF */
if (c->mfc_parent >= MAXVIFS)
@@ -2106,7 +2108,10 @@ static int __ipmr_fill_mroute(struct mr_table *mrt, 
struct sk_buff *skb,
mfcs.mfcs_packets = c->mfc_un.res.pkt;
mfcs.mfcs_bytes = c->mfc_un.res.bytes;
mfcs.mfcs_wrong_if = c->mfc_un.res.wrong_if;
-   if (nla_put_64bit(skb, RTA_MFC_STATS, sizeof(mfcs), &mfcs, RTA_PAD) < 0)
+   if (nla_put_64bit(skb, RTA_MFC_STATS, sizeof(mfcs), &mfcs, RTA_PAD) ||
+   nla_put_u64_64bit(skb, RTA_EXPIRES,
+ jiffies_to_clock_t(c->mfc_un.res.lastuse),
+ RTA_PAD))
return -EMSGSIZE;
 
rtm->rtm_type = RTN_MULTICAST;
diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index c7ca0f5d1a3b..7adce139d92a 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -1500,6 +1500,7 @@ static int ip6mr_mfc_add(struct net *net, struct 
mr6_table *mrt,
c->mf6c_origin = mfc->mf6cc_origin.sin6_addr;
c->mf6c_mcastgrp = mfc->mf6cc_mcastgrp.sin6_addr;
c->mf6c_parent = mfc->mf6cc_parent;
+   c->mfc_un.res.lastuse = jiffies;
ip6mr_update_thresholds(mrt, c, ttls);
if (!mrtsock)
c->mfc_flags |= MFC_STATIC;
@@ -2092,6 +2093,7 @@ static void 

Re: [PATCH v2 1/2] net: ethernet: ethoc: use phydev from struct net_device

2016-07-15 Thread David Miller
From: Philippe Reynes 
Date: Fri, 15 Jul 2016 09:59:11 +0200

> The private structure contain a pointer to phydev, but the structure
> net_device already contain such pointer. So we can remove the pointer
> phy in the private structure, and update the driver to use the
> one contained in struct net_device.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH 2/2] net: ethernet: xilinx: axienet: use phy_ethtool_{get|set}_link_ksettings

2016-07-15 Thread David Miller
From: Philippe Reynes 
Date: Thu, 14 Jul 2016 19:45:58 +0200

> There are two generics functions phy_ethtool_{get|set}_link_ksettings,
> so we can use them instead of defining the same code in the driver.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH 1/2] net: ethernet: xilinx: axienet: use phydev from struct net_device

2016-07-15 Thread David Miller
From: Philippe Reynes 
Date: Thu, 14 Jul 2016 19:45:57 +0200

> The private structure contain a pointer to phydev, but the structure
> net_device already contain such pointer. So we can remove the pointer
> phy in the private structure, and update the driver to use the
> one contained in struct net_device.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH 1/2] net: ethernet: tc35815: use phydev from struct net_device

2016-07-15 Thread David Miller
From: Philippe Reynes 
Date: Thu, 14 Jul 2016 15:20:46 +0200

> The private structure contain a pointer to phydev, but the structure
> net_device already contain such pointer. So we can remove the pointer
> phy in the private structure, and update the driver to use the
> one contained in struct net_device.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH v2 2/2] net: ethernet: ethoc: use phy_ethtool_{get|set}_link_ksettings

2016-07-15 Thread David Miller
From: Philippe Reynes 
Date: Fri, 15 Jul 2016 09:59:12 +0200

> There are two generics functions phy_ethtool_{get|set}_link_ksettings,
> so we can use them instead of defining the same code in the driver.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH 2/2] net: ethernet: tc35815: use phy_ethtool_{get|set}_link_ksettings

2016-07-15 Thread David Miller
From: Philippe Reynes 
Date: Thu, 14 Jul 2016 15:20:47 +0200

> There are two generics functions phy_ethtool_{get|set}_link_ksettings,
> so we can use them instead of defining the same code in the driver.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH 1/2] net: ethernet: pasemi_mac: use phydev from struct net_device

2016-07-15 Thread David Miller
From: Philippe Reynes 
Date: Thu, 14 Jul 2016 23:44:52 +0200

> The private structure contain a pointer to phydev, but the structure
> net_device already contain such pointer. So we can remove the pointer
> phy in the private structure, and update the driver to use the
> one contained in struct net_device.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH 2/2] net: ethernet: pasemi_mac: use phy_ethtool_{get|set}_link_ksettings

2016-07-15 Thread David Miller
From: Philippe Reynes 
Date: Thu, 14 Jul 2016 23:44:53 +0200

> There are two generics functions phy_ethtool_{get|set}_link_ksettings,
> so we can use them instead of defining the same code in the driver.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH 2/2] net: ethernet: smsc9420: use phy_ethtool_{get|set}_link_ksettings

2016-07-15 Thread David Miller
From: Philippe Reynes 
Date: Fri, 15 Jul 2016 10:36:21 +0200

> There are two generics functions phy_ethtool_{get|set}_link_ksettings,
> so we can use them instead of defining the same code in the driver.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH 1/2] net: ethernet: amd: au1000_eth: use phydev from struct net_device

2016-07-15 Thread David Miller
From: Philippe Reynes 
Date: Fri, 15 Jul 2016 12:05:11 +0200

> The private structure contain a pointer to phydev, but the structure
> net_device already contain such pointer. So we can remove the pointer
> phydev in the private structure, and update the driver to use the
> one contained in struct net_device.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH 1/2] net: ethernet: smsc9420: use phydev from struct net_device

2016-07-15 Thread David Miller
From: Philippe Reynes 
Date: Fri, 15 Jul 2016 10:36:20 +0200

> The private structure contain a pointer to phydev, but the structure
> net_device already contain such pointer. So we can remove the pointer
> phy in the private structure, and update the driver to use the
> one contained in struct net_device.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH 2/2] net: ethernet: ti: cpmac: use phy_ethtool_{get|set}_link_ksettings

2016-07-15 Thread David Miller
From: Philippe Reynes 
Date: Fri, 15 Jul 2016 12:39:02 +0200

> There are two generics functions phy_ethtool_{get|set}_link_ksettings,
> so we can use them instead of defining the same code in the driver.
> 
> There was a check on CAP_NET_ADMIN in cpmac_set_settings, but this
> check is already done in dev_ethtool, so no need to repeat it before
> calling the generic function.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH 1/2] net: ethernet: ti: cpmac: use phydev from struct net_device

2016-07-15 Thread David Miller
From: Philippe Reynes 
Date: Fri, 15 Jul 2016 12:39:01 +0200

> The private structure contain a pointer to phydev, but the structure
> net_device already contain such pointer. So we can remove the pointer
> phy in the private structure, and update the driver to use the
> one contained in struct net_device.
> 
> Signed-off-by: Philippe Reynes 

Applied.


Re: [PATCH 2/2] net: ethernet: amd: au1000_eth: use phy_ethtool_{get|set}_link_ksettings

2016-07-15 Thread David Miller
From: Philippe Reynes 
Date: Fri, 15 Jul 2016 12:05:12 +0200

> There are two generics functions phy_ethtool_{get|set}_link_ksettings,
> so we can use them instead of defining the same code in the driver.
> 
> There was a check on CAP_NET_ADMIN in au1000_set_settings, but this
> check is already done in dev_ethtool, so no need to repeat it before
> calling the generic function.
> 
> Signed-off-by: Philippe Reynes 

Applied.


[PATCH net] net: cavium: liquidio: Avoid dma_unmap_single on uninitialized ndata

2016-07-15 Thread Florian Fainelli
The label lio_xmit_failed is used 3 times through liquidio_xmit() but it
always makes a call to dma_unmap_single() using potentially
uninitialized variables from "ndata" variable. Out of the 3 gotos, 2 run
after ndata has been initialized, and had a prior dma_map_single() call.

Fix this by adding a new error label, lio_xmit_dma_failed, which does
this dma_unmap_single() and then proceeds with the lio_xmit_failed
fallthrough.

Fixes: f21fb3ed364bb ("Add support of Cavium Liquidio ethernet adapters")
Reported-by: coverity (CID 1309740)
Signed-off-by: Florian Fainelli 
---
 drivers/net/ethernet/cavium/liquidio/lio_main.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c 
b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index 8de79ae63231..0e7e7da8d201 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -2821,7 +2821,7 @@ static int liquidio_xmit(struct sk_buff *skb, struct 
net_device *netdev)
if (!g) {
netif_info(lio, tx_err, lio->netdev,
   "Transmit scatter gather: glist null!\n");
-   goto lio_xmit_failed;
+   goto lio_xmit_dma_failed;
}
 
cmdsetup.s.gather = 1;
@@ -2892,7 +2892,7 @@ static int liquidio_xmit(struct sk_buff *skb, struct 
net_device *netdev)
else
status = octnet_send_nic_data_pkt(oct, &ndata, xmit_more);
if (status == IQ_SEND_FAILED)
-   goto lio_xmit_failed;
+   goto lio_xmit_dma_failed;
 
netif_info(lio, tx_queued, lio->netdev, "Transmit queued successfully\n");
 
@@ -2906,12 +2906,13 @@ static int liquidio_xmit(struct sk_buff *skb, struct net_device *netdev)
 
return NETDEV_TX_OK;
 
+lio_xmit_dma_failed:
+   dma_unmap_single(&oct->pci_dev->dev, ndata.cmd.dptr,
+ndata.datasize, DMA_TO_DEVICE);
 lio_xmit_failed:
stats->tx_dropped++;
netif_info(lio, tx_err, lio->netdev, "IQ%d Transmit dropped:%llu\n",
   iq_no, stats->tx_dropped);
-   dma_unmap_single(&oct->pci_dev->dev, ndata.cmd.dptr,
-ndata.datasize, DMA_TO_DEVICE);
recv_buffer_free(skb);
return NETDEV_TX_OK;
 }
-- 
2.7.4
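
The label ordering described in this patch can be sketched outside the
driver. What follows is a minimal userspace model, not the liquidio code:
model_map(), model_unmap(), model_xmit() and the model_fail enum are
hypothetical names introduced only for illustration. It shows that a failure
before the mapping jumps past the unmap, while a failure after the mapping
enters at the earlier label and falls through.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for dma_map_single()/dma_unmap_single():
 * they only count outstanding mappings so the model can be checked.
 */
static int outstanding_maps;

static void model_map(void)
{
	outstanding_maps++;
}

static void model_unmap(void)
{
	outstanding_maps--;
}

enum model_fail { MODEL_OK, MODEL_FAIL_BEFORE_MAP, MODEL_FAIL_AFTER_MAP };

/* Returns true if the packet was dropped. */
static bool model_xmit(enum model_fail fail)
{
	if (fail == MODEL_FAIL_BEFORE_MAP)
		goto xmit_failed;	/* nothing mapped yet, so skip the unmap */

	model_map();

	if (fail == MODEL_FAIL_AFTER_MAP)
		goto xmit_dma_failed;	/* a mapping exists and must be undone */

	/* Assumption for this model only: the mapping is released right
	 * away on success; a real driver would typically do so later.
	 */
	model_unmap();
	return false;

xmit_dma_failed:
	model_unmap();
xmit_failed:
	return true;
}
```

With a single shared label that always unmaps, the MODEL_FAIL_BEFORE_MAP
case would leave outstanding_maps at -1, which is the model's analogue of
unmapping something that was never mapped.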



[PATCH net] bnxt_en: Fix potential race condition in bnxt_tx_enable()

2016-07-15 Thread Florian Fainelli
txr->dev_state is always manipulated after acquiring the transmit queue
lock, except in bnxt_tx_enable(), which seems suspicious here, so also
acquire the transmit queue lock before changing the value.

Reported-by: coverity (CID 1339583)
Fixes: c0c050c58d840 ("bnxt_en: New Broadcom ethernet driver.")
Signed-off-by: Florian Fainelli 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index c777cde85ce4..904c2a8ece12 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -4599,7 +4599,9 @@ static void bnxt_tx_enable(struct bnxt *bp)
for (i = 0; i < bp->tx_nr_rings; i++) {
txr = &bp->tx_ring[i];
txq = netdev_get_tx_queue(bp->dev, i);
+   __netif_tx_lock(txq, smp_processor_id());
txr->dev_state = 0;
+   __netif_tx_unlock(txq);
}
netif_tx_wake_all_queues(bp->dev);
if (bp->link_info.link_up)
-- 
2.7.4



[PATCH net] net: nb8800: Fix SKB leak in nb8800_receive()

2016-07-15 Thread Florian Fainelli
If nb8800_receive() fails to allocate a fragment, we leak the freshly
allocated SKB and just return. Free it instead.

Reported-by: coverity (CID 1341750)
Signed-off-by: Florian Fainelli 
---
 drivers/net/ethernet/aurora/nb8800.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/aurora/nb8800.c b/drivers/net/ethernet/aurora/nb8800.c
index 08a23e6b60e9..1a3555d03a96 100644
--- a/drivers/net/ethernet/aurora/nb8800.c
+++ b/drivers/net/ethernet/aurora/nb8800.c
@@ -259,6 +259,7 @@ static void nb8800_receive(struct net_device *dev, unsigned int i,
if (err) {
netdev_err(dev, "rx buffer allocation failed\n");
dev->stats.rx_dropped++;
+   dev_kfree_skb(skb);
return;
}
 
-- 
2.7.4
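
The shape of this fix can be illustrated with a small userspace model. This
is only a sketch under stated assumptions: buf_alloc(), buf_free() and
receive_model() are hypothetical names, and a counter of live buffers stands
in for SKB accounting. The point is that every early return taken after the
allocation has to release the buffer first.

```c
#include <stdlib.h>

/* Number of buffers currently allocated and not yet freed. */
static int live_buffers;

/* Hypothetical allocator standing in for the SKB allocation. */
static void *buf_alloc(void)
{
	void *p = malloc(64);

	if (p)
		live_buffers++;
	return p;
}

/* Hypothetical counterpart standing in for dev_kfree_skb(). */
static void buf_free(void *p)
{
	if (p) {
		free(p);
		live_buffers--;
	}
}

/* Returns 0 on success, -1 on failure. */
static int receive_model(int frag_alloc_fails)
{
	void *skb = buf_alloc();

	if (!skb)
		return -1;

	if (frag_alloc_fails) {
		buf_free(skb);	/* without this, the buffer would leak */
		return -1;
	}

	/* Assumption for this model: the consumer frees the buffer. */
	buf_free(skb);
	return 0;
}
```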



[PATCH net] et131x: Fix logical vs bitwise check in et131x_tx_timeout()

2016-07-15 Thread Florian Fainelli
We should be using a logical check here instead of a bitwise operation
to check if the device is closed already in et131x_tx_timeout().

Reported-by: coverity (CID 146498)
Fixes: 38df6492eb511 ("et131x: Add PCIe gigabit ethernet driver et131x to drivers/net")
Signed-off-by: Florian Fainelli 
---
 drivers/net/ethernet/agere/et131x.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/agere/et131x.c b/drivers/net/ethernet/agere/et131x.c
index 30defe6c81f2..821d86c38ab2 100644
--- a/drivers/net/ethernet/agere/et131x.c
+++ b/drivers/net/ethernet/agere/et131x.c
@@ -3851,7 +3851,7 @@ static void et131x_tx_timeout(struct net_device *netdev)
unsigned long flags;
 
/* If the device is closed, ignore the timeout */
-   if (~(adapter->flags & FMP_ADAPTER_INTERRUPT_IN_USE))
+   if (!(adapter->flags & FMP_ADAPTER_INTERRUPT_IN_USE))
return;
 
/* Any nonrecoverable hardware error?
-- 
2.7.4
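
The difference between the two operators can be checked in isolation. The
sketch below is not driver code: EXAMPLE_FLAG_IN_USE is a hypothetical flag
value (the real FMP_ADAPTER_INTERRUPT_IN_USE constant is not shown in this
patch), and the two helper names are invented for the comparison. For any
single-bit flag, the bitwise complement of the masked value is never zero,
so the old test was always true.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical single-bit flag; the driver's real value is not shown here. */
#define EXAMPLE_FLAG_IN_USE 0x0010u

/* Old form: ~ flips every bit, so the result is non-zero in both cases. */
static bool closed_bitwise(uint32_t flags)
{
	return ~(flags & EXAMPLE_FLAG_IN_USE) != 0;
}

/* Fixed form: ! yields true only when the flag bit is clear. */
static bool closed_logical(uint32_t flags)
{
	return !(flags & EXAMPLE_FLAG_IN_USE);
}
```

With the flag set, closed_bitwise() still reports "closed" because
~0x00000010 is 0xffffffef, whereas closed_logical() reports "open".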



[PATCH v9 11/11] bpf: add sample for xdp forwarding and rewrite

2016-07-15 Thread Brenden Blanco
Add a sample that rewrites and forwards packets out on the same
interface. Observed single core forwarding performance of ~10Mpps.

Since the mlx4 driver under test recycles every single packet page, the
perf output shows almost exclusively just the ring management and bpf
program work. Slowdowns are likely occurring due to cache misses.

Signed-off-by: Brenden Blanco 
---
 samples/bpf/Makefile|   5 +++
 samples/bpf/xdp2_kern.c | 114 
 2 files changed, 119 insertions(+)
 create mode 100644 samples/bpf/xdp2_kern.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 0e4ab3a..d2d2b35 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -22,6 +22,7 @@ hostprogs-y += map_perf_test
 hostprogs-y += test_overhead
 hostprogs-y += test_cgrp2_array_pin
 hostprogs-y += xdp1
+hostprogs-y += xdp2
 
 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
@@ -44,6 +45,8 @@ map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
 test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
 test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
 xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
+# reuse xdp1 source intentionally
+xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -67,6 +70,7 @@ always += test_overhead_kprobe_kern.o
 always += parse_varlen.o parse_simple.o parse_ldabs.o
 always += test_cgrp2_tc_kern.o
 always += xdp1_kern.o
+always += xdp2_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 
@@ -88,6 +92,7 @@ HOSTLOADLIBES_spintest += -lelf
 HOSTLOADLIBES_map_perf_test += -lelf -lrt
 HOSTLOADLIBES_test_overhead += -lelf -lrt
 HOSTLOADLIBES_xdp1 += -lelf
+HOSTLOADLIBES_xdp2 += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/xdp2_kern.c b/samples/bpf/xdp2_kern.c
new file mode 100644
index 000..38fe7e1
--- /dev/null
+++ b/samples/bpf/xdp2_kern.c
@@ -0,0 +1,114 @@
+/* Copyright (c) 2016 PLUMgrid
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#define KBUILD_MODNAME "foo"
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") dropcnt = {
+   .type = BPF_MAP_TYPE_PERCPU_ARRAY,
+   .key_size = sizeof(u32),
+   .value_size = sizeof(long),
+   .max_entries = 256,
+};
+
+static void swap_src_dst_mac(void *data)
+{
+   unsigned short *p = data;
+   unsigned short dst[3];
+
+   dst[0] = p[0];
+   dst[1] = p[1];
+   dst[2] = p[2];
+   p[0] = p[3];
+   p[1] = p[4];
+   p[2] = p[5];
+   p[3] = dst[0];
+   p[4] = dst[1];
+   p[5] = dst[2];
+}
+
+static int parse_ipv4(void *data, u64 nh_off, void *data_end)
+{
+   struct iphdr *iph = data + nh_off;
+
+   if (iph + 1 > data_end)
+   return 0;
+   return iph->protocol;
+}
+
+static int parse_ipv6(void *data, u64 nh_off, void *data_end)
+{
+   struct ipv6hdr *ip6h = data + nh_off;
+
+   if (ip6h + 1 > data_end)
+   return 0;
+   return ip6h->nexthdr;
+}
+
+SEC("xdp1")
+int xdp_prog1(struct xdp_md *ctx)
+{
+   void *data_end = (void *)(long)ctx->data_end;
+   void *data = (void *)(long)ctx->data;
+   struct ethhdr *eth = data;
+   int rc = XDP_DROP;
+   long *value;
+   u16 h_proto;
+   u64 nh_off;
+   u32 index;
+
+   nh_off = sizeof(*eth);
+   if (data + nh_off > data_end)
+   return rc;
+
+   h_proto = eth->h_proto;
+
+   if (h_proto == htons(ETH_P_8021Q) || h_proto == htons(ETH_P_8021AD)) {
+   struct vlan_hdr *vhdr;
+
+   vhdr = data + nh_off;
+   nh_off += sizeof(struct vlan_hdr);
+   if (data + nh_off > data_end)
+   return rc;
+   h_proto = vhdr->h_vlan_encapsulated_proto;
+   }
+   if (h_proto == htons(ETH_P_8021Q) || h_proto == htons(ETH_P_8021AD)) {
+   struct vlan_hdr *vhdr;
+
+   vhdr = data + nh_off;
+   nh_off += sizeof(struct vlan_hdr);
+   if (data + nh_off > data_end)
+   return rc;
+   h_proto = vhdr->h_vlan_encapsulated_proto;
+   }
+
+   if (h_proto == htons(ETH_P_IP))
+   index = parse_ipv4(data, nh_off, data_end);
+   else if (h_proto == htons(ETH_P_IPV6))
+   index = parse_ipv6(data, nh_off, data_end);
+   else
+   index = 0;
+
+   value = bpf_map_lookup_elem(&dropcnt, &index);
+   if (value)
+   *value += 1;
+
+   if (index == 17) {
+   swap_src_dst_mac(data);
+ 

[PATCH v9 10/11] bpf: enable direct packet data write for xdp progs

2016-07-15 Thread Brenden Blanco
For forwarding to be effective, XDP programs should be allowed to
rewrite packet data.

This requires that all drivers supporting XDP map the packet memory as
TODEVICE or BIDIRECTIONAL before invoking the program.

Signed-off-by: Brenden Blanco 
---
 kernel/bpf/verifier.c | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index a8d67d0..f72f23b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -653,6 +653,16 @@ static int check_map_access(struct verifier_env *env, u32 regno, int off,
 
 #define MAX_PACKET_OFF 0x
 
+static bool may_write_pkt_data(enum bpf_prog_type type)
+{
+   switch (type) {
+   case BPF_PROG_TYPE_XDP:
+   return true;
+   default:
+   return false;
+   }
+}
+
 static int check_packet_access(struct verifier_env *env, u32 regno, int off,
   int size)
 {
@@ -806,10 +816,15 @@ static int check_mem_access(struct verifier_env *env, u32 regno, int off,
err = check_stack_read(state, off, size, value_regno);
}
} else if (state->regs[regno].type == PTR_TO_PACKET) {
-   if (t == BPF_WRITE) {
+   if (t == BPF_WRITE && !may_write_pkt_data(env->prog->type)) {
verbose("cannot write into packet\n");
return -EACCES;
}
+   if (t == BPF_WRITE && value_regno >= 0 &&
+   is_pointer_value(env, value_regno)) {
+   verbose("R%d leaks addr into packet\n", value_regno);
+   return -EACCES;
+   }
err = check_packet_access(env, regno, off, size);
if (!err && t == BPF_READ && value_regno >= 0)
mark_reg_unknown_value(state->regs, value_regno);
-- 
2.8.2
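
The order of the two write checks added above can be modelled in plain C.
This is a sketch under stated assumptions, not verifier code: the model_*
enums and functions are hypothetical, and value_is_pointer stands in for the
result of is_pointer_value(). A packet write is rejected first if the
program type may not write at all, and second if the value being stored is a
pointer, which would leak a kernel address into the packet.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, reduced set of program types for this model. */
enum model_prog_type { MODEL_PROG_OTHER, MODEL_PROG_XDP };
enum model_access { MODEL_READ, MODEL_WRITE };

/* Mirrors the shape of may_write_pkt_data(): only XDP may write. */
static bool model_may_write_pkt_data(enum model_prog_type type)
{
	switch (type) {
	case MODEL_PROG_XDP:
		return true;
	default:
		return false;
	}
}

/* Returns 0 if the packet access is allowed, -EACCES otherwise. */
static int model_check_packet_write(enum model_prog_type type,
				    enum model_access t,
				    bool value_is_pointer)
{
	if (t == MODEL_WRITE && !model_may_write_pkt_data(type))
		return -EACCES;	/* "cannot write into packet" */

	if (t == MODEL_WRITE && value_is_pointer)
		return -EACCES;	/* "leaks addr into packet" */

	return 0;
}
```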



[PATCH v9 09/11] net/mlx4_en: add xdp forwarding and data write support

2016-07-15 Thread Brenden Blanco
A user will now be able to loop packets back out of the same port using
a bpf program attached to the xdp hook. Updates to the packet contents
from the bpf program are also supported.

For the packet write feature to work, the rx buffers are now mapped as
bidirectional when the page is allocated. This occurs only when the xdp
hook is active.

When the program returns a TX action, enqueue the packet directly to a
dedicated tx ring, so as to avoid locking completely. This requires the
tx ring to be allocated 1:1 for each rx ring, as well as the tx
completion running in the same softirq.

Upon tx completion, this dedicated tx ring recycles pages directly back
to the original rx ring without unmapping them. In a steady-state
tx/drop workload, effectively 0 page allocs/frees will occur.

In order to separate out the paths between free and recycle, a
free_tx_desc func pointer is introduced that is optionally updated
whenever recycle_ring is activated. By default the original free
function is always initialized.

Signed-off-by: Brenden Blanco 
---
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c |   9 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c  |  34 ++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c  |  14 +++
 drivers/net/ethernet/mellanox/mlx4/en_tx.c  | 140 +++-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h|  28 -
 5 files changed, 217 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index 51a2e82..f6c2625 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -1722,6 +1722,12 @@ static int mlx4_en_set_channels(struct net_device *dev,
!channel->tx_count || !channel->rx_count)
return -EINVAL;
 
+   if (channel->tx_count * MLX4_EN_NUM_UP <= priv->rsv_tx_rings) {
+   en_err(priv, "Minimum %d tx channels required with XDP on\n",
+  priv->rsv_tx_rings / MLX4_EN_NUM_UP + 1);
+   return -EINVAL;
+   }
+
mutex_lock(&mdev->state_lock);
if (priv->port_up) {
port_up = 1;
@@ -1740,7 +1746,8 @@ static int mlx4_en_set_channels(struct net_device *dev,
goto out;
}
 
-   netif_set_real_num_tx_queues(dev, priv->tx_ring_num);
+   netif_set_real_num_tx_queues(dev, priv->tx_ring_num -
+   priv->rsv_tx_rings);
netif_set_real_num_rx_queues(dev, priv->rx_ring_num);
 
if (dev->num_tc)
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index e54541e..a1542a1 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -1522,6 +1522,24 @@ static void mlx4_en_free_affinity_hint(struct mlx4_en_priv *priv, int ring_idx)
free_cpumask_var(priv->rx_ring[ring_idx]->affinity_mask);
 }
 
+static void mlx4_en_init_recycle_ring(struct mlx4_en_priv *priv,
+ int tx_ring_idx)
+{
+   struct mlx4_en_tx_ring *tx_ring = priv->tx_ring[tx_ring_idx];
+   int rr_index;
+
+   rr_index = (priv->rsv_tx_rings - priv->tx_ring_num) + tx_ring_idx;
+   if (rr_index >= 0) {
+   tx_ring->free_tx_desc = mlx4_en_recycle_tx_desc;
+   tx_ring->recycle_ring = priv->rx_ring[rr_index];
+   en_dbg(DRV, priv,
+  "Set tx_ring[%d]->recycle_ring = rx_ring[%d]\n",
+  tx_ring_idx, rr_index);
+   } else {
+   tx_ring->recycle_ring = NULL;
+   }
+}
+
 int mlx4_en_start_port(struct net_device *dev)
 {
struct mlx4_en_priv *priv = netdev_priv(dev);
@@ -1644,6 +1662,8 @@ int mlx4_en_start_port(struct net_device *dev)
}
tx_ring->tx_queue = netdev_get_tx_queue(dev, i);
 
+   mlx4_en_init_recycle_ring(priv, i);
+
/* Arm CQ for TX completions */
mlx4_en_arm_cq(priv, cq);
 
@@ -2534,9 +2554,12 @@ static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
struct mlx4_en_priv *priv = netdev_priv(dev);
struct mlx4_en_dev *mdev = priv->mdev;
struct bpf_prog *old_prog;
+   int rsv_tx_rings;
int port_up = 0;
int err;
 
+   rsv_tx_rings = prog ? ALIGN(priv->rx_ring_num, MLX4_EN_NUM_UP) : 0;
+
/* No need to reconfigure buffers when simply swapping the
 * program for a new one.
 */
@@ -2555,12 +2578,23 @@ static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
return -EOPNOTSUPP;
}
 
+   if (priv->tx_ring_num < rsv_tx_rings + MLX4_EN_NUM_UP) {
+   en_err(priv,
+  "Minimum %d tx channels required to run XDP\n",
+  (rsv_tx_rings + MLX4_EN_NUM_UP) / MLX4_EN_NUM_UP);
+ 

[PATCH v9 06/11] net/mlx4_en: add page recycle to prepare rx ring for tx support

2016-07-15 Thread Brenden Blanco
The mlx4 driver by default allocates order-3 pages for the ring to
consume in multiple fragments. When the device has an xdp program, this
behavior will prevent tx actions since the page must be re-mapped in
TODEVICE mode, which cannot be done if the page is still shared.

Start by making the allocator configurable based on whether xdp is
running, such that order-0 pages are always used and never shared.

Since this will stress the page allocator, add a simple page cache to
each rx ring. Pages in the cache are left dma-mapped, and in drop-only
stress tests the page allocator is eliminated from the perf report.

Note that setting an xdp program will now require the rings to be
reconfigured.

Before:
 26.91%  ksoftirqd/0  [mlx4_en] [k] mlx4_en_process_rx_cq
 17.88%  ksoftirqd/0  [mlx4_en] [k] mlx4_en_alloc_frags
  6.00%  ksoftirqd/0  [mlx4_en] [k] mlx4_en_free_frag
  4.49%  ksoftirqd/0  [kernel.vmlinux]  [k] get_page_from_freelist
  3.21%  swapper  [kernel.vmlinux]  [k] intel_idle
  2.73%  ksoftirqd/0  [kernel.vmlinux]  [k] bpf_map_lookup_elem
  2.57%  swapper  [mlx4_en] [k] mlx4_en_process_rx_cq

After:
 31.72%  swapper  [kernel.vmlinux]   [k] intel_idle
  8.79%  swapper  [mlx4_en]  [k] mlx4_en_process_rx_cq
  7.54%  swapper  [kernel.vmlinux]   [k] poll_idle
  6.36%  swapper  [mlx4_core][k] mlx4_eq_int
  4.21%  swapper  [kernel.vmlinux]   [k] tasklet_action
  4.03%  swapper  [kernel.vmlinux]   [k] cpuidle_enter_state
  3.43%  swapper  [mlx4_en]  [k] mlx4_en_prepare_rx_desc
  2.18%  swapper  [kernel.vmlinux]   [k] native_irq_return_iret
  1.37%  swapper  [kernel.vmlinux]   [k] menu_select
  1.09%  swapper  [kernel.vmlinux]   [k] bpf_map_lookup_elem

Signed-off-by: Brenden Blanco 
---
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 35 +++--
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 70 +++---
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   | 11 +++-
 3 files changed, 104 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index b7c1804..e54541e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -2532,20 +2532,49 @@ static int mlx4_en_set_tx_maxrate(struct net_device *dev, int queue_index, u32 m
 static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
 {
struct mlx4_en_priv *priv = netdev_priv(dev);
+   struct mlx4_en_dev *mdev = priv->mdev;
struct bpf_prog *old_prog;
+   int port_up = 0;
+   int err;
+
+   /* No need to reconfigure buffers when simply swapping the
+* program for a new one.
+*/
+   if (READ_ONCE(priv->prog) && prog) {
+   /* This xchg is paired with READ_ONCE in the fast path, but is
+* also protected from itself via rtnl lock
+*/
+   old_prog = xchg(&priv->prog, prog);
+   if (old_prog)
+   bpf_prog_put(old_prog);
+   return 0;
+   }
 
if (priv->num_frags > 1) {
en_err(priv, "Cannot set XDP if MTU requires multiple frags\n");
return -EOPNOTSUPP;
}
 
-   /* This xchg is paired with READ_ONCE in the fast path, but is
-* also protected from itself via rtnl lock
-*/
+   mutex_lock(&mdev->state_lock);
+   if (priv->port_up) {
+   port_up = 1;
+   mlx4_en_stop_port(dev, 1);
+   }
+
old_prog = xchg(&priv->prog, prog);
if (old_prog)
bpf_prog_put(old_prog);
 
+   if (port_up) {
+   err = mlx4_en_start_port(dev);
+   if (err) {
+   en_err(priv, "Failed starting port %d for XDP change\n",
+  priv->port);
+   queue_work(mdev->workqueue, &priv->watchdog_task);
+   }
+   }
+
+   mutex_unlock(&mdev->state_lock);
return 0;
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index adfa123..6020c37 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -57,7 +57,7 @@ static int mlx4_alloc_pages(struct mlx4_en_priv *priv,
struct page *page;
dma_addr_t dma;
 
-   for (order = MLX4_EN_ALLOC_PREFER_ORDER; ;) {
+   for (order = frag_info->order; ;) {
gfp_t gfp = _gfp;
 
if (order)
@@ -70,7 +70,7 @@ static int mlx4_alloc_pages(struct mlx4_en_priv *priv,
return -ENOMEM;
}
dma = dma_map_page(priv->ddev, page, 0, PAGE_SIZE << order,
-  PCI_DMA_FROMDEVICE);
+  frag_info->dma_dir);
if (dma_mapping_error(priv->dde

[PATCH v9 03/11] rtnl: add option for setting link xdp prog

2016-07-15 Thread Brenden Blanco
Sets the bpf program represented by fd as an early filter in the rx path
of the netdev. The fd must have been created as BPF_PROG_TYPE_XDP.
Providing a negative value as fd clears the program. Getting the fd back
via rtnl is not possible; therefore, reading this value merely provides
a bool indicating whether a program is attached to the link or not.

Signed-off-by: Brenden Blanco 
---
 include/uapi/linux/if_link.h | 12 +
 net/core/rtnetlink.c | 64 
 2 files changed, 76 insertions(+)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 4285ac3..a1b5202 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -156,6 +156,7 @@ enum {
IFLA_GSO_MAX_SEGS,
IFLA_GSO_MAX_SIZE,
IFLA_PAD,
+   IFLA_XDP,
__IFLA_MAX
 };
 
@@ -843,4 +844,15 @@ enum {
 };
 #define LINK_XSTATS_TYPE_MAX (__LINK_XSTATS_TYPE_MAX - 1)
 
+/* XDP section */
+
+enum {
+   IFLA_XDP_UNSPEC,
+   IFLA_XDP_FD,
+   IFLA_XDP_ATTACHED,
+   __IFLA_XDP_MAX,
+};
+
+#define IFLA_XDP_MAX (__IFLA_XDP_MAX - 1)
+
 #endif /* _UAPI_LINUX_IF_LINK_H */
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index a9e3805..eba2b82 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -891,6 +891,16 @@ static size_t rtnl_port_size(const struct net_device *dev,
return port_self_size;
 }
 
+static size_t rtnl_xdp_size(const struct net_device *dev)
+{
+   size_t xdp_size = nla_total_size(1);/* XDP_ATTACHED */
+
+   if (!dev->netdev_ops->ndo_xdp)
+   return 0;
+   else
+   return xdp_size;
+}
+
 static noinline size_t if_nlmsg_size(const struct net_device *dev,
 u32 ext_filter_mask)
 {
@@ -927,6 +937,7 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
   + nla_total_size(MAX_PHYS_ITEM_ID_LEN) /* IFLA_PHYS_PORT_ID */
   + nla_total_size(MAX_PHYS_ITEM_ID_LEN) /* IFLA_PHYS_SWITCH_ID */
   + nla_total_size(IFNAMSIZ) /* IFLA_PHYS_PORT_NAME */
+  + rtnl_xdp_size(dev) /* IFLA_XDP */
   + nla_total_size(1); /* IFLA_PROTO_DOWN */
 
 }
@@ -1211,6 +1222,33 @@ static int rtnl_fill_link_ifmap(struct sk_buff *skb, struct net_device *dev)
return 0;
 }
 
+static int rtnl_xdp_fill(struct sk_buff *skb, struct net_device *dev)
+{
+   struct netdev_xdp xdp_op = {};
+   struct nlattr *xdp;
+   int err;
+
+   if (!dev->netdev_ops->ndo_xdp)
+   return 0;
+   xdp = nla_nest_start(skb, IFLA_XDP);
+   if (!xdp)
+   return -EMSGSIZE;
+   xdp_op.command = XDP_QUERY_PROG;
+   err = dev->netdev_ops->ndo_xdp(dev, &xdp_op);
+   if (err)
+   goto err_cancel;
+   err = nla_put_u8(skb, IFLA_XDP_ATTACHED, xdp_op.prog_attached);
+   if (err)
+   goto err_cancel;
+
+   nla_nest_end(skb, xdp);
+   return 0;
+
+err_cancel:
+   nla_nest_cancel(skb, xdp);
+   return err;
+}
+
 static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
int type, u32 pid, u32 seq, u32 change,
unsigned int flags, u32 ext_filter_mask)
@@ -1307,6 +1345,9 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
if (rtnl_port_fill(skb, dev, ext_filter_mask))
goto nla_put_failure;
 
+   if (rtnl_xdp_fill(skb, dev))
+   goto nla_put_failure;
+
if (dev->rtnl_link_ops || rtnl_have_link_slave_info(dev)) {
if (rtnl_link_fill(skb, dev) < 0)
goto nla_put_failure;
@@ -1392,6 +1433,7 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
[IFLA_PHYS_SWITCH_ID]   = { .type = NLA_BINARY, .len = MAX_PHYS_ITEM_ID_LEN },
[IFLA_LINK_NETNSID] = { .type = NLA_S32 },
[IFLA_PROTO_DOWN]   = { .type = NLA_U8 },
+   [IFLA_XDP]  = { .type = NLA_NESTED },
 };
 
 static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
@@ -1429,6 +1471,11 @@ static const struct nla_policy ifla_port_policy[IFLA_PORT_MAX+1] = {
[IFLA_PORT_RESPONSE]= { .type = NLA_U16, },
 };
 
+static const struct nla_policy ifla_xdp_policy[IFLA_XDP_MAX + 1] = {
+   [IFLA_XDP_FD]   = { .type = NLA_S32 },
+   [IFLA_XDP_ATTACHED] = { .type = NLA_U8 },
+};
+
 static const struct rtnl_link_ops *linkinfo_to_kind_ops(const struct nlattr *nla)
 {
const struct rtnl_link_ops *ops = NULL;
@@ -2054,6 +2101,23 @@ static int do_setlink(const struct sk_buff *skb,
status |= DO_SETLINK_NOTIFY;
}
 
+   if (tb[IFLA_XDP]) {
+   struct nlattr *xdp[IFLA_XDP_MAX + 1];
+
+   err = nla_parse_nested(xdp, IFLA_XDP_MAX, tb[IFLA_XDP],
+  ifla_xdp_policy);
+   if (err < 0)
+   goto

[PATCH v9 05/11] Add sample for adding simple drop program to link

2016-07-15 Thread Brenden Blanco
Add a sample program that only drops packets at the BPF_PROG_TYPE_XDP_RX
hook of a link. With the drop-only program, observed single core rate is
~20Mpps.

Other tests were run, for instance without the dropcnt increment or
without reading from the packet header; the packet rate was mostly
unchanged.

$ perf record -a samples/bpf/xdp1 $(
---
 samples/bpf/Makefile|   4 ++
 samples/bpf/bpf_load.c  |   8 +++
 samples/bpf/xdp1_kern.c |  93 +
 samples/bpf/xdp1_user.c | 181 
 4 files changed, 286 insertions(+)
 create mode 100644 samples/bpf/xdp1_kern.c
 create mode 100644 samples/bpf/xdp1_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index a98b780..0e4ab3a 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -21,6 +21,7 @@ hostprogs-y += spintest
 hostprogs-y += map_perf_test
 hostprogs-y += test_overhead
 hostprogs-y += test_cgrp2_array_pin
+hostprogs-y += xdp1
 
 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
@@ -42,6 +43,7 @@ spintest-objs := bpf_load.o libbpf.o spintest_user.o
 map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
 test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
 test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
+xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -64,6 +66,7 @@ always += test_overhead_tp_kern.o
 always += test_overhead_kprobe_kern.o
 always += parse_varlen.o parse_simple.o parse_ldabs.o
 always += test_cgrp2_tc_kern.o
+always += xdp1_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 
@@ -84,6 +87,7 @@ HOSTLOADLIBES_offwaketime += -lelf
 HOSTLOADLIBES_spintest += -lelf
 HOSTLOADLIBES_map_perf_test += -lelf -lrt
 HOSTLOADLIBES_test_overhead += -lelf -lrt
+HOSTLOADLIBES_xdp1 += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 022af71..0cfda23 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -50,6 +50,7 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
bool is_kprobe = strncmp(event, "kprobe/", 7) == 0;
bool is_kretprobe = strncmp(event, "kretprobe/", 10) == 0;
bool is_tracepoint = strncmp(event, "tracepoint/", 11) == 0;
+   bool is_xdp = strncmp(event, "xdp", 3) == 0;
enum bpf_prog_type prog_type;
char buf[256];
int fd, efd, err, id;
@@ -66,6 +67,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
prog_type = BPF_PROG_TYPE_KPROBE;
} else if (is_tracepoint) {
prog_type = BPF_PROG_TYPE_TRACEPOINT;
+   } else if (is_xdp) {
+   prog_type = BPF_PROG_TYPE_XDP;
} else {
printf("Unknown event '%s'\n", event);
return -1;
@@ -79,6 +82,9 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 
prog_fd[prog_cnt++] = fd;
 
+   if (is_xdp)
+   return 0;
+
if (is_socket) {
event += 6;
if (*event != '/')
@@ -319,6 +325,7 @@ int load_bpf_file(char *path)
if (memcmp(shname_prog, "kprobe/", 7) == 0 ||
memcmp(shname_prog, "kretprobe/", 10) == 0 ||
memcmp(shname_prog, "tracepoint/", 11) == 0 ||
+   memcmp(shname_prog, "xdp", 3) == 0 ||
memcmp(shname_prog, "socket", 6) == 0)
load_and_attach(shname_prog, insns, data_prog->d_size);
}
@@ -336,6 +343,7 @@ int load_bpf_file(char *path)
if (memcmp(shname, "kprobe/", 7) == 0 ||
memcmp(shname, "kretprobe/", 10) == 0 ||
memcmp(shname, "tracepoint/", 11) == 0 ||
+   memcmp(shname, "xdp", 3) == 0 ||
memcmp(shname, "socket", 6) == 0)
load_and_attach(shname, data->d_buf, data->d_size);
}
diff --git a/samples/bpf/xdp1_kern.c b/samples/bpf/xdp1_kern.c
new file mode 100644
index 000..e7dd8ac
--- /dev/null
+++ b/samples/bpf/xdp1_kern.c
@@ -0,0 +1,93 @@
+/* Copyright (c) 2016 PLUMgrid
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#define KBUILD_MODNAME "foo"
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") dropcnt = {
+   .type = BPF_MAP_TYPE_PERCPU_ARRAY,
+   .key_size = sizeof(u32),
+   .value_size = sizeof(long),
+   .max_entries = 256,
+};
+
+st

[PATCH v9 07/11] bpf: add XDP_TX xdp_action for direct forwarding

2016-07-15 Thread Brenden Blanco
XDP enabled drivers must transmit received packets back out on the same
port they were received on when a program returns this action.

Signed-off-by: Brenden Blanco 
---
 include/uapi/linux/bpf.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 4282d44..a8f1ea1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -447,6 +447,7 @@ enum xdp_action {
XDP_ABORTED = 0,
XDP_DROP,
XDP_PASS,
+   XDP_TX,
 };
 
 /* user accessible metadata for XDP packet hook
-- 
2.8.2
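
How a driver might act on the extended action set can be sketched as a
simple dispatch. This is a hypothetical model, not code from any driver:
the model_* names are invented, the enum mirrors the ordering shown in the
hunk above (so the new action takes the value 3), and treating unknown
return codes like a drop is an assumption made for this sketch.

```c
/* Mirrors the ordering of enum xdp_action after this patch. */
enum model_xdp_action {
	MODEL_XDP_ABORTED = 0,
	MODEL_XDP_DROP,
	MODEL_XDP_PASS,
	MODEL_XDP_TX,
};

/* What the hypothetical rx path does with the packet afterwards. */
enum model_verdict {
	MODEL_FREE_PAGE,	/* packet is discarded */
	MODEL_TO_STACK,		/* packet continues up the normal rx path */
	MODEL_SEND_SAME_PORT,	/* packet goes back out the receiving port */
};

static enum model_verdict model_handle_action(int act)
{
	switch (act) {
	case MODEL_XDP_PASS:
		return MODEL_TO_STACK;
	case MODEL_XDP_TX:
		return MODEL_SEND_SAME_PORT;
	case MODEL_XDP_ABORTED:
	case MODEL_XDP_DROP:
	default:
		/* Assumption: unknown codes are handled like a drop. */
		return MODEL_FREE_PAGE;
	}
}
```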



[PATCH v9 08/11] net/mlx4_en: break out tx_desc write into separate function

2016-07-15 Thread Brenden Blanco
In preparation for writing the tx descriptor from multiple functions,
create a helper for both normal and blueflame access.

Signed-off-by: Brenden Blanco 
---
 drivers/infiniband/hw/mlx4/qp.c|  11 +--
 drivers/net/ethernet/mellanox/mlx4/en_tx.c | 127 +
 include/linux/mlx4/qp.h|  18 ++--
 3 files changed, 90 insertions(+), 66 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 8db8405..768085f 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -232,7 +232,7 @@ static void stamp_send_wqe(struct mlx4_ib_qp *qp, int n, int size)
}
} else {
ctrl = buf = get_send_wqe(qp, n & (qp->sq.wqe_cnt - 1));
-   s = (ctrl->fence_size & 0x3f) << 4;
+   s = (ctrl->qpn_vlan.fence_size & 0x3f) << 4;
for (i = 64; i < s; i += 64) {
wqe = buf + i;
*wqe = cpu_to_be32(0x);
@@ -264,7 +264,7 @@ static void post_nop_wqe(struct mlx4_ib_qp *qp, int n, int size)
inl->byte_count = cpu_to_be32(1 << 31 | (size - s - sizeof *inl));
}
ctrl->srcrb_flags = 0;
-   ctrl->fence_size = size / 16;
+   ctrl->qpn_vlan.fence_size = size / 16;
/*
 * Make sure descriptor is fully written before setting ownership bit
 * (because HW can start executing as soon as we do).
@@ -1992,7 +1992,8 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp,
ctrl = get_send_wqe(qp, i);
ctrl->owner_opcode = cpu_to_be32(1 << 31);
if (qp->sq_max_wqes_per_wr == 1)
-   ctrl->fence_size = 1 << (qp->sq.wqe_shift - 4);
+   ctrl->qpn_vlan.fence_size =
+   1 << (qp->sq.wqe_shift - 4);
 
stamp_send_wqe(qp, i, 1 << qp->sq.wqe_shift);
}
@@ -3169,8 +3170,8 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
wmb();
*lso_wqe = lso_hdr_sz;
 
-   ctrl->fence_size = (wr->send_flags & IB_SEND_FENCE ?
-   MLX4_WQE_CTRL_FENCE : 0) | size;
+   ctrl->qpn_vlan.fence_size = (wr->send_flags & IB_SEND_FENCE ?
+MLX4_WQE_CTRL_FENCE : 0) | size;
 
/*
 * Make sure descriptor is fully written before
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 76aa4d2..c29191e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -700,10 +700,66 @@ static void mlx4_bf_copy(void __iomem *dst, const void *src,
__iowrite64_copy(dst, src, bytecnt / 8);
 }
 
+void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring)
+{
+   wmb();
+   /* Since there is no iowrite*_native() that writes the
+* value as is, without byteswapping - using the one
+* that doesn't do byteswapping in the relevant arch
+* endianness.
+*/
+#if defined(__LITTLE_ENDIAN)
+   iowrite32(
+#else
+   iowrite32be(
+#endif
+ ring->doorbell_qpn,
+ ring->bf.uar->map + MLX4_SEND_DOORBELL);
+}
+
+static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
+ struct mlx4_en_tx_desc *tx_desc,
+ union mlx4_wqe_qpn_vlan qpn_vlan,
+ int desc_size, int bf_index,
+ __be32 op_own, bool bf_ok,
+ bool send_doorbell)
+{
+   tx_desc->ctrl.qpn_vlan = qpn_vlan;
+
+   if (bf_ok) {
+   op_own |= htonl((bf_index & 0x) << 8);
+   /* Ensure new descriptor hits memory
+* before setting ownership of this descriptor to HW
+*/
+   dma_wmb();
+   tx_desc->ctrl.owner_opcode = op_own;
+
+   wmb();
+
+   mlx4_bf_copy(ring->bf.reg + ring->bf.offset, &tx_desc->ctrl,
+desc_size);
+
+   wmb();
+
+   ring->bf.offset ^= ring->bf.buf_size;
+   } else {
+   /* Ensure new descriptor hits memory
+* before setting ownership of this descriptor to HW
+*/
+   dma_wmb();
+   tx_desc->ctrl.owner_opcode = op_own;
+   if (send_doorbell)
+   mlx4_en_xmit_doorbell(ring);
+   else
+   ring->xmit_more++;
+   }
+}
+
 netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 {
struct skb_shared_info *shinfo = skb_shinfo(skb);
struct mlx4_en_priv *priv = netdev_priv(dev);
+   union mlx4_wqe_qpn_vlan

[PATCH v9 04/11] net/mlx4_en: add support for fast rx drop bpf program

2016-07-15 Thread Brenden Blanco
Add support for the BPF_PROG_TYPE_XDP hook in mlx4 driver.

In tc/socket bpf programs, helpers linearize skb fragments as needed
when the program touches the packet data. However, in the pursuit of
speed, XDP programs will not be allowed to use these slower functions,
especially if it involves allocating an skb.

Therefore, disallow MTU settings that would produce a multi-fragment
packet that XDP programs would fail to access. Future enhancements could
be done to increase the allowable MTU.
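
For readers following along, the restriction can be sketched in userspace roughly as below. This is only an illustration, not the driver code: the sketch_* names are made up, and the first-fragment size of 1536 is an assumed stand-in for FRAG_SZ0. The header lengths (14 bytes Ethernet, 4 bytes VLAN) are the usual ones.

```c
#include <errno.h>
#include <stdbool.h>

/* Illustrative stand-ins; the real definitions live in mlx4_en.h. */
#define SKETCH_ETH_HLEN   14    /* Ethernet header length */
#define SKETCH_VLAN_HLEN  4     /* one VLAN tag */
#define SKETCH_FRAG_SZ0   1536  /* assumed size of the first rx fragment */

/* Frame size the rx path has to hold for a given MTU. */
static int sketch_eff_mtu(int mtu)
{
	return mtu + SKETCH_ETH_HLEN + SKETCH_VLAN_HLEN;
}

/* Return 0 if the MTU is acceptable, or -EOPNOTSUPP if an XDP program
 * is attached and the frame would spill into a second fragment.
 */
static int sketch_check_mtu(bool xdp_attached, int new_mtu)
{
	if (xdp_attached && sketch_eff_mtu(new_mtu) > SKETCH_FRAG_SZ0)
		return -EOPNOTSUPP;
	return 0;
}
```

Under these assumed values a standard 1500-byte MTU still fits in one fragment, while a jumbo MTU is refused only while a program is attached.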

Signed-off-by: Brenden Blanco 
---
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 51 ++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 37 +--
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  5 +++
 3 files changed, 89 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 6083775..b7c1804 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -31,6 +31,7 @@
  *
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -2084,6 +2085,9 @@ void mlx4_en_destroy_netdev(struct net_device *dev)
if (mdev->dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_TS)
mlx4_en_remove_timestamp(mdev);
 
+   if (priv->prog)
+   bpf_prog_put(priv->prog);
+
/* Detach the netdev so tasks would not attempt to access it */
mutex_lock(&mdev->state_lock);
mdev->pndev[priv->port] = NULL;
@@ -2112,6 +2116,11 @@ static int mlx4_en_change_mtu(struct net_device *dev, int new_mtu)
en_err(priv, "Bad MTU size:%d.\n", new_mtu);
return -EPERM;
}
+   if (priv->prog && MLX4_EN_EFF_MTU(new_mtu) > FRAG_SZ0) {
+   en_err(priv, "MTU size:%d requires frags but XDP running\n",
+  new_mtu);
+   return -EOPNOTSUPP;
+   }
dev->mtu = new_mtu;
 
if (netif_running(dev)) {
@@ -2520,6 +2529,46 @@ static int mlx4_en_set_tx_maxrate(struct net_device *dev, int queue_index, u32 m
return err;
 }
 
+static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
+{
+   struct mlx4_en_priv *priv = netdev_priv(dev);
+   struct bpf_prog *old_prog;
+
+   if (priv->num_frags > 1) {
+   en_err(priv, "Cannot set XDP if MTU requires multiple frags\n");
+   return -EOPNOTSUPP;
+   }
+
+   /* This xchg is paired with READ_ONCE in the fast path, but is
+* also protected from itself via rtnl lock
+*/
+   old_prog = xchg(&priv->prog, prog);
+   if (old_prog)
+   bpf_prog_put(old_prog);
+
+   return 0;
+}
+
+static bool mlx4_xdp_attached(struct net_device *dev)
+{
+   struct mlx4_en_priv *priv = netdev_priv(dev);
+
+   return !!READ_ONCE(priv->prog);
+}
+
+static int mlx4_xdp(struct net_device *dev, struct netdev_xdp *xdp)
+{
+   switch (xdp->command) {
+   case XDP_SETUP_PROG:
+   return mlx4_xdp_set(dev, xdp->prog);
+   case XDP_QUERY_PROG:
+   xdp->prog_attached = mlx4_xdp_attached(dev);
+   return 0;
+   default:
+   return -EINVAL;
+   }
+}
+
 static const struct net_device_ops mlx4_netdev_ops = {
.ndo_open   = mlx4_en_open,
.ndo_stop   = mlx4_en_close,
@@ -2548,6 +2597,7 @@ static const struct net_device_ops mlx4_netdev_ops = {
.ndo_udp_tunnel_del = mlx4_en_del_vxlan_port,
.ndo_features_check = mlx4_en_features_check,
.ndo_set_tx_maxrate = mlx4_en_set_tx_maxrate,
+   .ndo_xdp= mlx4_xdp,
 };
 
 static const struct net_device_ops mlx4_netdev_ops_master = {
@@ -2584,6 +2634,7 @@ static const struct net_device_ops mlx4_netdev_ops_master = {
.ndo_udp_tunnel_del = mlx4_en_del_vxlan_port,
.ndo_features_check = mlx4_en_features_check,
.ndo_set_tx_maxrate = mlx4_en_set_tx_maxrate,
+   .ndo_xdp= mlx4_xdp,
 };
 
 struct mlx4_en_bond {
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index c1b3a9c..adfa123 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -743,6 +743,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
struct mlx4_en_rx_ring *ring = priv->rx_ring[cq->ring];
struct mlx4_en_rx_alloc *frags;
struct mlx4_en_rx_desc *rx_desc;
+   struct bpf_prog *prog;
struct sk_buff *skb;
int index;
int nr;
@@ -759,6 +760,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
if (budget <= 0)
return polled;
 
+   prog = READ_ONCE(priv->prog);
+
/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
 * descriptor offset can be deduced from the CQE in

[PATCH v9 02/11] net: add ndo to setup/query xdp prog in adapter rx

2016-07-15 Thread Brenden Blanco
Add one new netdev op for drivers implementing the BPF_PROG_TYPE_XDP
filter. The single op is used for both setup/query of the xdp program,
modelled after ndo_setup_tc.
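
As a rough userspace illustration of the single-op calling convention (a command field plus a union of per-command arguments), something like the following would do. The sketch_* names are hypothetical and only mirror the shape of the kernel structures; reference counting of the program is left out entirely.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace mirror of the command/union convention. */
enum sketch_cmd { SKETCH_SETUP_PROG, SKETCH_QUERY_PROG };

struct sketch_xdp {
	enum sketch_cmd command;
	union {
		const void *prog;    /* SKETCH_SETUP_PROG: program, NULL clears */
		bool prog_attached;  /* SKETCH_QUERY_PROG: filled in by callee */
	};
};

struct sketch_dev {
	const void *prog;  /* currently attached program, NULL if none */
};

/* One entry point handles both setup and query, keyed on the command. */
static int sketch_ndo_xdp(struct sketch_dev *dev, struct sketch_xdp *xdp)
{
	switch (xdp->command) {
	case SKETCH_SETUP_PROG:
		dev->prog = xdp->prog;
		return 0;
	case SKETCH_QUERY_PROG:
		xdp->prog_attached = dev->prog != NULL;
		return 0;
	default:
		return -EINVAL;  /* unknown command */
	}
}
```

The point of the union is that new commands can be added later without touching the op's signature; each command only reads or writes its own member.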

Signed-off-by: Brenden Blanco 
---
 include/linux/netdevice.h | 34 ++
 net/core/dev.c| 33 +
 2 files changed, 67 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 49736a3..fab9a1c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -63,6 +63,7 @@ struct wpan_dev;
 struct mpls_dev;
 /* UDP Tunnel offloads */
 struct udp_tunnel_info;
+struct bpf_prog;
 
 void netdev_set_default_ethtool_ops(struct net_device *dev,
const struct ethtool_ops *ops);
@@ -799,6 +800,33 @@ struct tc_to_netdev {
};
 };
 
+/* These structures hold the attributes of xdp state that are being passed
+ * to the netdevice through the xdp op.
+ */
+enum xdp_netdev_command {
+   /* Set or clear a bpf program used in the earliest stages of packet
+* rx. The prog will have been loaded as BPF_PROG_TYPE_XDP. The callee
+* is responsible for calling bpf_prog_put on any old progs that are
+* stored. In case of error, the callee need not release the new prog
+* reference, but on success it takes ownership and must bpf_prog_put
+* when it is no longer used.
+*/
+   XDP_SETUP_PROG,
+   /* Check if a bpf program is set on the device.  The callee should
+* return true if a program is currently attached and running.
+*/
+   XDP_QUERY_PROG,
+};
+
+struct netdev_xdp {
+   enum xdp_netdev_command command;
+   union {
+   /* XDP_SETUP_PROG */
+   struct bpf_prog *prog;
+   /* XDP_QUERY_PROG */
+   bool prog_attached;
+   };
+};
 
 /*
  * This structure defines the management hooks for network devices.
@@ -1087,6 +1115,9 @@ struct tc_to_netdev {
  * appropriate rx headroom value allows avoiding skb head copy on
  * forward. Setting a negative value resets the rx headroom to the
  * default value.
+ * int (*ndo_xdp)(struct net_device *dev, struct netdev_xdp *xdp);
+ * This function is used to set or query state related to XDP on the
+ * netdevice. See definition of enum xdp_netdev_command for details.
  *
  */
 struct net_device_ops {
@@ -1271,6 +1302,8 @@ struct net_device_ops {
   struct sk_buff *skb);
void(*ndo_set_rx_headroom)(struct net_device *dev,
   int needed_headroom);
+   int (*ndo_xdp)(struct net_device *dev,
+  struct netdev_xdp *xdp);
 };
 
 /**
@@ -3257,6 +3290,7 @@ int dev_get_phys_port_id(struct net_device *dev,
 int dev_get_phys_port_name(struct net_device *dev,
   char *name, size_t len);
 int dev_change_proto_down(struct net_device *dev, bool proto_down);
+int dev_change_xdp_fd(struct net_device *dev, int fd);
 struct sk_buff *validate_xmit_skb_list(struct sk_buff *skb, struct net_device *dev);
 struct sk_buff *dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
struct netdev_queue *txq, int *ret);
diff --git a/net/core/dev.c b/net/core/dev.c
index 7894e40..2a9c39f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -94,6 +94,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -6615,6 +6616,38 @@ int dev_change_proto_down(struct net_device *dev, bool proto_down)
 EXPORT_SYMBOL(dev_change_proto_down);
 
 /**
+ * dev_change_xdp_fd - set or clear a bpf program for a device rx path
+ * @dev: device
+ * @fd: new program fd or negative value to clear
+ *
+ * Set or clear a bpf program for a device
+ */
+int dev_change_xdp_fd(struct net_device *dev, int fd)
+{
+   const struct net_device_ops *ops = dev->netdev_ops;
+   struct bpf_prog *prog = NULL;
+   struct netdev_xdp xdp = {};
+   int err;
+
+   if (!ops->ndo_xdp)
+   return -EOPNOTSUPP;
+   if (fd >= 0) {
+   prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_XDP);
+   if (IS_ERR(prog))
+   return PTR_ERR(prog);
+   }
+
+   xdp.command = XDP_SETUP_PROG;
+   xdp.prog = prog;
+   err = ops->ndo_xdp(dev, &xdp);
+   if (err < 0 && prog)
+   bpf_prog_put(prog);
+
+   return err;
+}
+EXPORT_SYMBOL(dev_change_xdp_fd);
+
+/**
  * dev_new_index   -   allocate an ifindex
  * @net: the applicable net namespace
  *
-- 
2.8.2



[PATCH v9 00/11] Add driver bpf hook for early packet drop and forwarding

2016-07-15 Thread Brenden Blanco
This patch set introduces new infrastructure for programmatically
processing packets in the earliest stages of rx, as part of an effort
others are calling eXpress Data Path (XDP) [1]. Start this effort by
introducing a new bpf program type for early packet filtering, before
even an skb has been allocated.

Extend on this with the ability to modify packet data and send back out
on the same port.

Patch 1 introduces the new prog type and helpers for validating the bpf
  program. A new userspace struct is defined containing only data and
  data_end as fields, with others to follow in the future.
In patch 2, create a new ndo to pass the fd to supported drivers.
In patch 3, expose a new rtnl option to userspace.
In patch 4, enable support in mlx4 driver.
In patch 5, create a sample drop and count program. With single core,
  achieved ~20 Mpps drop rate on a 40G ConnectX3-Pro. This includes
  packet data access, bpf array lookup, and increment.
In patch 6, add a page recycle facility to mlx4 rx, enabled when xdp is
  active.
In patch 7, add the XDP_TX type to bpf.h
In patch 8, add helper in tx path for writing tx_desc
In patch 9, add support in mlx4 for packet data write and forwarding
In patch 10, turn on packet write support in the bpf verifier
In patch 11, add a sample program for packet write and forwarding. With
  single core, achieved ~10 Mpps rewrite and forwarding.

[1] https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf

v9:
 4/12: Add missing newline in en_err message.
 6/12: Move page_cache cleanup from mlx4_en_destroy_rx_ring to
   mlx4_en_deactivate_rx_ring. Move mlx4_en_moderation_update back to
   static. Remove calls to mlx4_en_alloc/free_resources in mlx4_xdp_set.
   Adopt instead the approach of mlx4_en_change_mtu to use a watchdog.
 9/12: Use a per-ring function pointer in tx to separate out the code
   for regular and recycle paths of tx completion handling. Add a helper
   function to init the recycle ring and callback, called just after
   activating tx. Remove extra tx ring resource requirement, and instead
   steal from the upper rings. This helps to avoid needing
   mlx4_en_alloc_resources. Add some hopefully meaningful error
   messages for the various error cases. Reverted some of the
   hard-to-follow logic that was accounting for the extra tx rings.

v8:
 1/11: Reduce WARN_ONCE to single line. Also, change act param of that
   function to u32 to match return type of bpf_prog_run_xdp.
 2/11: Clarify locking semantics in ndo comment.
 4/11: Add en_err warning in mlx4_xdp_set on num_frags/mtu violation.

v7:
 Addressing two of the major discussion points: return codes and ndo.
 The rest will be taken as todo items for separate patches.

 Add an XDP_ABORTED type, which explicitly falls through to DROP. The
 same result must be taken for the default case as well, as it is now
 well-defined API behavior.

 Merge ndo_xdp_* into a single ndo. The style is similar to
 ndo_setup_tc, but with less unidirectional naming convention. The IFLA
 parameter names are unchanged.

 TODOs:
 Add ethtool per-ring stats for aborted, default cases, maybe even drop
 and tx as well.
 Avoid duplicate dma sync operation in XDP_PASS case as mentioned by
 Saeed.

  1/12: Add XDP_ABORTED enum, reword API comment, and update commit
   message.
  2/12: Rewrite ndo_xdp_*() into single ndo_xdp() with type/union style
calling convention.
  3/12: Switch to ndo_xdp callback.
  4/12: Add XDP_ABORTED case as a fall-through to XDP_DROP. Implement
ndo_xdp.
 12/12: Dropped, this will need some more work.

v6:
  2/12: drop unnecessary netif_device_present check
  4/12, 6/12, 9/12: Reorder default case statement above drop case to
remove some copy/paste.

v5:
  0/12: Rebase and remove previous 1/13 patch
  1/12: Fix nits from Daniel. Left the (void *) cast as-is, to be fixed
in future. Add bpf_warn_invalid_xdp_action() helper, to be used when
out of bounds action is returned by the program. Add a comment to
bpf.h denoting the undefined nature of out of bounds returns.
  2/12: Switch to using bpf_prog_get_type(). Rename ndo_xdp_get() to
ndo_xdp_attached().
  3/12: Add IFLA_XDP as a nested type, and add the associated nla_policy
for the new subtypes IFLA_XDP_FD and IFLA_XDP_ATTACHED.
  4/12: Fixup the use of READ_ONCE in the ndos. Add a user of
bpf_warn_invalid_xdp_action helper.
  5/12: Adjust to using the nested netlink options.
  6/12: kbuild was complaining about overflow of u16 on tile
architecture...bump frag_stride to u32. The page_offset member that
is computed from this was already u32.

v4:
  2/12: Add inline helper for calling xdp bpf prog under rcu
  3/12: Add detail to ndo comments
  5/12: Remove mlx4_call_xdp and use inline helper instead.
  6/12: Fix checkpatch complaints
  9/12: Introduce new patch 9/12 with common helper for tx_desc write
Refactor to use common tx_desc write helper
 11/12: Fix checkpatch complaints

v3:
  Rewrite from v2 trying to 

[PATCH v9 01/11] bpf: add XDP prog type for early driver filter

2016-07-15 Thread Brenden Blanco
Add a new bpf prog type that is intended to run in early stages of the
packet rx path. Only minimal packet metadata will be available, hence a
new context type, struct xdp_md, is exposed to userspace. So far only
expose the packet start and end pointers, and only in read mode.

An XDP program must return one of the well known enum values, all other
return codes are reserved for future use. Unfortunately, this
restriction is hard to enforce at verification time, so take the
approach of warning at runtime when such programs are encountered. Out
of bounds return codes should alias to XDP_ABORTED.
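
To make the intended return-code handling concrete, here is a small userspace sketch, assuming a driver that only knows "drop" and "pass" verdicts. All sketch_* names are hypothetical, and the counter merely stands in for the runtime warning; it is not how the kernel helper is implemented.

```c
#include <stdint.h>

/* Hypothetical mirror of the action codes described above. */
enum sketch_action { SKETCH_ABORTED = 0, SKETCH_DROP, SKETCH_PASS };

enum sketch_verdict { SKETCH_VERDICT_DROP, SKETCH_VERDICT_PASS };

/* Number of out-of-range return codes seen so far. */
static unsigned int sketch_invalid_actions;

static enum sketch_verdict sketch_handle_action(uint32_t act)
{
	switch (act) {
	case SKETCH_PASS:
		return SKETCH_VERDICT_PASS;
	default:
		sketch_invalid_actions++;  /* stand-in for the runtime warning */
		/* fall through */
	case SKETCH_ABORTED:
	case SKETCH_DROP:
		return SKETCH_VERDICT_DROP;
	}
}
```

Placing the default label above the aborted and drop cases lets an unknown code share the drop path without duplicating it.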

Signed-off-by: Brenden Blanco 
---
 include/linux/filter.h   | 18 +++
 include/uapi/linux/bpf.h | 20 
 kernel/bpf/verifier.c|  1 +
 net/core/filter.c| 79 
 4 files changed, 118 insertions(+)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 6fc31ef..15d816a 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -368,6 +368,11 @@ struct bpf_skb_data_end {
void *data_end;
 };
 
+struct xdp_buff {
+   void *data;
+   void *data_end;
+};
+
 /* compute the linear packet data range [data, data_end) which
  * will be accessed by cls_bpf and act_bpf programs
  */
@@ -429,6 +434,18 @@ static inline u32 bpf_prog_run_clear_cb(const struct bpf_prog *prog,
return BPF_PROG_RUN(prog, skb);
 }
 
+static inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
+  struct xdp_buff *xdp)
+{
+   u32 ret;
+
+   rcu_read_lock();
+   ret = BPF_PROG_RUN(prog, (void *)xdp);
+   rcu_read_unlock();
+
+   return ret;
+}
+
 static inline unsigned int bpf_prog_size(unsigned int proglen)
 {
return max(sizeof(struct bpf_prog),
@@ -509,6 +526,7 @@ bool bpf_helper_changes_skb_data(void *func);
 
 struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
   const struct bpf_insn *patch, u32 len);
+void bpf_warn_invalid_xdp_action(u32 act);
 
 #ifdef CONFIG_BPF_JIT
 extern int bpf_jit_enable;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 262a7e8..4282d44 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -94,6 +94,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_SCHED_CLS,
BPF_PROG_TYPE_SCHED_ACT,
BPF_PROG_TYPE_TRACEPOINT,
+   BPF_PROG_TYPE_XDP,
 };
 
 #define BPF_PSEUDO_MAP_FD  1
@@ -437,4 +438,23 @@ struct bpf_tunnel_key {
__u32 tunnel_label;
 };
 
+/* User return codes for XDP prog type.
+ * A valid XDP program must return one of these defined values. All other
+ * return codes are reserved for future use. Unknown return codes will result
+ * in packet drop.
+ */
+enum xdp_action {
+   XDP_ABORTED = 0,
+   XDP_DROP,
+   XDP_PASS,
+};
+
+/* user accessible metadata for XDP packet hook
+ * new fields must be added to the end of this structure
+ */
+struct xdp_md {
+   __u32 data;
+   __u32 data_end;
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index e206c21..a8d67d0 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -713,6 +713,7 @@ static int check_ptr_alignment(struct verifier_env *env, struct reg_state *reg,
switch (env->prog->type) {
case BPF_PROG_TYPE_SCHED_CLS:
case BPF_PROG_TYPE_SCHED_ACT:
+   case BPF_PROG_TYPE_XDP:
break;
default:
verbose("verifier is misconfigured\n");
diff --git a/net/core/filter.c b/net/core/filter.c
index 10c4a2f..2d770f5 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2369,6 +2369,12 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
}
 }
 
+static const struct bpf_func_proto *
+xdp_func_proto(enum bpf_func_id func_id)
+{
+   return sk_filter_func_proto(func_id);
+}
+
 static bool __is_valid_access(int off, int size, enum bpf_access_type type)
 {
if (off < 0 || off >= sizeof(struct __sk_buff))
@@ -2436,6 +2442,44 @@ static bool tc_cls_act_is_valid_access(int off, int size,
return __is_valid_access(off, size, type);
 }
 
+static bool __is_valid_xdp_access(int off, int size,
+ enum bpf_access_type type)
+{
+   if (off < 0 || off >= sizeof(struct xdp_md))
+   return false;
+   if (off % size != 0)
+   return false;
+   if (size != 4)
+   return false;
+
+   return true;
+}
+
+static bool xdp_is_valid_access(int off, int size,
+   enum bpf_access_type type,
+   enum bpf_reg_type *reg_type)
+{
+   if (type == BPF_WRITE)
+   return false;
+
+   switch (off) {
+   case offsetof(struct xdp_md, data):
+   *reg_type = PTR_TO_PACKET;
+   break;
+   case offsetof(struct xdp_md, data_end):
+   *reg_type = PTR_TO_PACKET_END;

Re: [PATCH net-next] net: ipmr/ip6mr: add support for keeping an entry age

2016-07-15 Thread Nikolay Aleksandrov

> On Jul 14, 2016, at 8:08 AM, Nikolay Aleksandrov wrote:
> 
> In preparation for hardware offloading of ipmr/ip6mr we need an
> interface that allows to check (and later update) the age of entries.
> Relying on stats alone can show activity but not actual age of the entry,
> furthermore when there're tens of thousands of entries a lot of the
> hardware implementations only support "hit" bits which are cleared on
> read to denote that the entry was active and shouldn't be aged out,
> these can then be naturally translated into age timestamp and will be
> compatible with the software forwarding age. Using a lastuse entry doesn't
> affect performance because the entries in that cache line are written to
> along with the age. Once an entry goes above the member size (32 bits) we
> keep it at UINT_MAX as we cannot afford to wrap it which will falsely show
> that it was used recently. This is not supposed to happen as entries should
> be aged out in matter of minutes or seconds.
> Since all new users are encouraged to use ipmr via netlink, this is
> exported via the RTA_CACHEINFO attribute which has rta_lastuse entry.
> 
> Signed-off-by: Nikolay Aleksandrov 
> CC: Roopa Prabhu 
> CC: Shrijeet Mukherjee 
> CC: Satish Ashok 
> CC: Donald Sharp 
> CC: David S. Miller 
> CC: Alexey Kuznetsov 
> CC: James Morris 
> CC: Hideaki YOSHIFUJI 
> CC: Patrick McHardy 
> ---

Self-NAK, I’ll send a revised v2 version using a single u32 attribute 
(RTA_EXPIRES), no need to waste the space
right now. We’ll add more as we need them.

Sorry for the noise.

Cheers,
 Nik




[PATCH net-next] bpf: bpf_event_entry_gen's alloc needs to be in atomic context

2016-07-15 Thread Daniel Borkmann
Should have been obvious, only called from bpf() syscall via map_update_elem()
that calls bpf_fd_array_map_update_elem() under RCU read lock and thus this
must also be in GFP_ATOMIC, of course.

Fixes: 3b1efb196eee ("bpf, maps: flush own entries on perf map release")
Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 Sorry for missing this typo, sigh.

 kernel/bpf/arraymap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index db1a743..633a650 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -430,7 +430,7 @@ static struct bpf_event_entry *bpf_event_entry_gen(struct file *perf_file,
 {
struct bpf_event_entry *ee;
 
-   ee = kzalloc(sizeof(*ee), GFP_KERNEL);
+   ee = kzalloc(sizeof(*ee), GFP_ATOMIC);
if (ee) {
ee->event = perf_file->private_data;
ee->perf_file = perf_file;
-- 
1.9.3



headsup: llvm now uses official bpf e_machine value

2016-07-15 Thread Alexei Starovoitov
just pushed Richard's patch
https://github.com/llvm-mirror/llvm/commit/36b9c09330bfb5e771914cfe307588f30d5510d2

tested with bcc and different bpf loaders.
Thankfully none of them rely on old and arguably
buggy em_none value, so no breakage expected.
if you're using special bpf elf loader, please test.

Thanks


Re: [PATCH net] net: bgmac: Fix infinite loop in bgmac_dma_tx_add()

2016-07-15 Thread David Miller
From: Florian Fainelli 
Date: Fri, 15 Jul 2016 15:42:52 -0700

> Nothing is decrementing the index "i" while we are cleaning up the
> fragments we could not successfully transmit.
> 
> Fixes: 9cde94506eacf ("bgmac: implement scatter/gather support")
> Reported-by: coverity (CID 1352048)
> Signed-off-by: Florian Fainelli 

Applied and queued up for -stable, thanks.


Re: [PATCH] igb: fix adjusting ptp timestamps for tx/rx latency

2016-07-15 Thread kbuild test robot
Hi,

[auto build test ERROR on jkirsher-next-queue/dev-queue]
[also build test ERROR on v4.7-rc7 next-20160715]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:    https://github.com/0day-ci/linux/commits/Kshitiz-Gupta/igb-fix-adjusting-ptp-timestamps-for-tx-rx-latency/20160716-062544
base:   https://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git dev-queue
config: sparc64-allyesconfig (attached as .config)
compiler: sparc64-linux-gnu-gcc (Debian 5.3.1-8) 5.3.1 20160205
reproduce:
wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=sparc64 

All errors (new ones prefixed by >>):

   drivers/net/ethernet/intel/igb/igb_ptp.c: In function 'igb_ptp_rx_pktstamp':
>> drivers/net/ethernet/intel/igb/igb_ptp.c:783:12: error: 'IGB_RX_LATENCY_10' undeclared (first use in this function)
  adjust = IGB_RX_LATENCY_10;
   ^
   drivers/net/ethernet/intel/igb/igb_ptp.c:783:12: note: each undeclared identifier is reported only once for each function it appears in
>> drivers/net/ethernet/intel/igb/igb_ptp.c:786:12: error: 'IGB_RX_LATENCY_100' undeclared (first use in this function)
  adjust = IGB_RX_LATENCY_100;
   ^
>> drivers/net/ethernet/intel/igb/igb_ptp.c:789:12: error: 'IGB_RX_LATENCY_1000' undeclared (first use in this function)
  adjust = IGB_RX_LATENCY_1000;
   ^

vim +/IGB_RX_LATENCY_10 +783 drivers/net/ethernet/intel/igb/igb_ptp.c

   777  igb_ptp_systim_to_hwtstamp(adapter, skb_hwtstamps(skb),
   778 le64_to_cpu(regval[1]));
   779  
   780  /* adjust timestamp for the RX latency based on link speed */
   781  switch (adapter->link_speed) {
   782  case SPEED_10:
 > 783  adjust = IGB_RX_LATENCY_10;
   784  break;
   785  case SPEED_100:
 > 786  adjust = IGB_RX_LATENCY_100;
   787  break;
   788  case SPEED_1000:
 > 789  adjust = IGB_RX_LATENCY_1000;
   790  break;
   791  }
   792  skb_hwtstamps(skb)->hwtstamp =

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation




Re: [PATCH] net: fixup for tracepoint napi:napi_poll

2016-07-15 Thread David Miller
From: Jesper Dangaard Brouer 
Date: Fri, 15 Jul 2016 23:55:20 +0200

> The recent change to tracepoint napi:napi_poll changed the order of
> the parameters that perf scripts sees, the printk was correct.  The
> problem was that the new parameters (work and budget) were pushed
> in front of dev_name.
> 
> The new parameters obviously need to be appended to keep backward
> compatible.
> 
> Fixes: 1db19db7f5ff ("net: tracepoint napi:napi_poll add work and budget")
> Signed-off-by: Jesper Dangaard Brouer 

Applied, thanks Jesper.


Re: [RFC PATCH v2 08/10] net: sched: pfifo_fast use alf_queue

2016-07-15 Thread Alexei Starovoitov
On Fri, Jul 15, 2016 at 03:18:12PM -0700, John Fastabend wrote:
> 
> nolock (pfifo_fast)
> 1:  1440293 1421602 1409553 1393469 1424543
> 2:  1754890 1819292 1727948 1797711 1743427
> 4:  3282665 3344095 3315220 3332777 3348972
> 8:  2940079 1644450 2950777 2922085 2946310
> 12: 2042084 2610060 2857581 3493162 3104611
> 
> lock (pfifo_fast)
> 1:  1471479 1469142 1458825 1456788 1453952
> 2:  1746231 1749490 1753176 1753780 1755959
> 4:  1119626 1120515 1121478 1119220 1121115
> 8:  1001471  999308 1000318 1000776 1000384
> 12:  989269  992122  991590  986581  990430
> 
> So then if we just use the first test example because I'm being a
> bit lazy and don't want to calculate the avg/mean/whatever we get
> a pfifo_fast chart like,
> 
>       locked    nolock      diff
> ---------------------------------
> 1    1471479   1440293  -   31186
> 2    1746231   1754890  +    8659
> 4    1119626   3282665  + 2163039
> 8    1119626   2940079  + 1820453
> 12    989269   2857581* + 1868312
...
> Also I'm going to take a look at Jesper's microbenchmark numbers but I
> think if I can convince myself that using skb_array helps or at least
> does no harm I might push to have this include with skb_array and then
> work on optimizing the ring type/kind/etc. as a follow up patch.
> Additionally it does seem to provide goodness on the pfifo_fast single
> queue case.

Agreed. I think the pfifo_fast gains are worth applying this patch set
as-is and working on further improvements in a follow-up.



[PATCH v2 iproute2] ss: Add option to suppress header line

2016-07-15 Thread David Ahern
Add option to suppress header line. When used the following line
is not shown:
"State  Recv-Q Send-Q Local Address:Port  Peer Address:Port"

Signed-off-by: David Ahern 
---
v2
- rebased to master branch

 man/man8/ss.8 |  3 +++
 misc/ss.c | 28 +++-
 2 files changed, 22 insertions(+), 9 deletions(-)

diff --git a/man/man8/ss.8 b/man/man8/ss.8
index 758460c27c95..8911976faa35 100644
--- a/man/man8/ss.8
+++ b/man/man8/ss.8
@@ -21,6 +21,9 @@ Show summary of options.
 .B \-V, \-\-version
 Output version information.
 .TP
+.B \-H, \-\-no-header
+Suppress header line.
+.TP
 .B \-n, \-\-numeric
 Do not try to resolve service names.
 .TP
diff --git a/misc/ss.c b/misc/ss.c
index abece96c0946..38205b0e8c28 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -97,6 +97,7 @@ int show_tcpinfo;
 int show_bpf;
 int show_proc_ctx;
 int show_sock_ctx;
+int show_header = 1;
 /* If show_users & show_proc_ctx only do user_ent_hash_build() once */
 int user_ent_hash_build_init;
 int follow_events;
@@ -3669,6 +3670,7 @@ static void _usage(FILE *dest)
 "   FAMILY := {inet|inet6|link|unix|netlink|help}\n"
 "\n"
 "   -K, --kill  forcibly close sockets, display what was closed\n"
+"   -H, --no-header Suppress header line\n"
 "\n"
 "   -A, --query=QUERY, --socket=QUERY\n"
 "   QUERY := {all|inet|tcp|udp|raw|unix|unix_dgram|unix_stream|unix_seqpacket|packet|netlink}[,QUERY]\n"
@@ -3762,6 +3764,7 @@ static const struct option long_opts[] = {
{ "contexts", 0, 0, 'z' },
{ "net", 1, 0, 'N' },
{ "kill", 0, 0, 'K' },
+   { "no-header", 0, 0, 'H' },
{ 0 }
 
 };
@@ -3776,7 +3779,7 @@ int main(int argc, char *argv[])
int ch;
int state_filter = 0;
 
-   while ((ch = getopt_long(argc, argv, "dhaletuwxnro460spbEf:miA:D:F:vVzZN:K",
+   while ((ch = getopt_long(argc, argv, "dhaletuwxnro460spbEf:miA:D:F:vVzZN:KH",
 long_opts, NULL)) != EOF) {
switch (ch) {
case 'n':
@@ -3961,6 +3964,9 @@ int main(int argc, char *argv[])
case 'K':
current_filter.kill = 1;
break;
+   case 'H':
+   show_header = 0;
+   break;
case 'h':
help();
case '?':
@@ -4086,19 +4092,23 @@ int main(int argc, char *argv[])
 
addr_width = addrp_width - serv_width - 1;
 
-   if (netid_width)
-   printf("%-*s ", netid_width, "Netid");
-   if (state_width)
-   printf("%-*s ", state_width, "State");
-   printf("%-6s %-6s ", "Recv-Q", "Send-Q");
+   if (show_header) {
+   if (netid_width)
+   printf("%-*s ", netid_width, "Netid");
+   if (state_width)
+   printf("%-*s ", state_width, "State");
+   printf("%-6s %-6s ", "Recv-Q", "Send-Q");
+   }
 
/* Make enough space for the local/remote port field */
addr_width -= 13;
serv_width += 13;
 
-   printf("%*s:%-*s %*s:%-*s\n",
-  addr_width, "Local Address", serv_width, "Port",
-  addr_width, "Peer Address", serv_width, "Port");
+   if (show_header) {
+   printf("%*s:%-*s %*s:%-*s\n",
+  addr_width, "Local Address", serv_width, "Port",
+  addr_width, "Peer Address", serv_width, "Port");
+   }
 
fflush(stdout);
 
-- 
2.1.4



[PATCH net-next] net: ipmr/ip6mr: add support for keeping an entry age

2016-07-15 Thread Nikolay Aleksandrov
In preparation for hardware offloading of ipmr/ip6mr we need an
interface that allows to check (and later update) the age of entries.
Relying on stats alone can show activity but not actual age of the entry,
furthermore when there're tens of thousands of entries a lot of the
hardware implementations only support "hit" bits which are cleared on
read to denote that the entry was active and shouldn't be aged out,
these can then be naturally translated into age timestamp and will be
compatible with the software forwarding age. Using a lastuse entry doesn't
affect performance because the entries in that cache line are written to
along with the age. Once an entry goes above the member size (32 bits) we
keep it at UINT_MAX as we cannot afford to wrap it which will falsely show
that it was used recently. This is not supposed to happen as entries should
be aged out in matter of minutes or seconds.
Since all new users are encouraged to use ipmr via netlink, this is
exported via the RTA_CACHEINFO attribute which has rta_lastuse entry.
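
To illustrate why the age is clamped rather than truncated, here is a minimal userspace sketch. The helper name is hypothetical and it uses UINT32_MAX for a fixed-width field; the patch itself does the equivalent with min_t() against UINT_MAX.

```c
#include <stdint.h>

/* Clamp a 64-bit age (in clock ticks) into a 32-bit field. A plain
 * cast would wrap, so a very old entry could look recently used.
 */
static uint32_t sketch_clamp_age(uint64_t age_ticks)
{
	return age_ticks > UINT32_MAX ? UINT32_MAX : (uint32_t)age_ticks;
}
```

With a plain cast, an age of UINT32_MAX + 6 ticks would be reported as 5; with the clamp it stays pinned at the maximum.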

Signed-off-by: Nikolay Aleksandrov 
CC: Roopa Prabhu 
CC: Shrijeet Mukherjee 
CC: Satish Ashok 
CC: Donald Sharp 
CC: David S. Miller 
CC: Alexey Kuznetsov 
CC: James Morris 
CC: Hideaki YOSHIFUJI 
CC: Patrick McHardy 
---
RTA_CACHEINFO was chosen because there're other useful members of the
struct which will be used later when we gradually remove the ipmr/ip6mr
entry cache limitations.

 include/linux/mroute.h  |  1 +
 include/linux/mroute6.h |  1 +
 net/ipv4/ipmr.c | 18 +++---
 net/ipv6/ip6mr.c| 17 ++---
 4 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/include/linux/mroute.h b/include/linux/mroute.h
index bf9b322cb0b0..d351fd3e1049 100644
--- a/include/linux/mroute.h
+++ b/include/linux/mroute.h
@@ -104,6 +104,7 @@ struct mfc_cache {
unsigned long bytes;
unsigned long pkt;
unsigned long wrong_if;
+   unsigned long lastuse;
unsigned char ttls[MAXVIFS];/* TTL thresholds */
} res;
} mfc_un;
diff --git a/include/linux/mroute6.h b/include/linux/mroute6.h
index 66982e764051..3987b64040c5 100644
--- a/include/linux/mroute6.h
+++ b/include/linux/mroute6.h
@@ -92,6 +92,7 @@ struct mfc6_cache {
unsigned long bytes;
unsigned long pkt;
unsigned long wrong_if;
+   unsigned long lastuse;
unsigned char ttls[MAXMIFS];/* TTL thresholds */
} res;
} mfc_un;
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 5ad48ec77710..b0ba7f6d2731 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1150,6 +1150,7 @@ static int ipmr_mfc_add(struct net *net, struct mr_table *mrt,
c->mfc_origin = mfc->mfcc_origin.s_addr;
c->mfc_mcastgrp = mfc->mfcc_mcastgrp.s_addr;
c->mfc_parent = mfc->mfcc_parent;
+   c->mfc_un.res.lastuse = jiffies;
ipmr_update_thresholds(mrt, c, mfc->mfcc_ttls);
if (!mrtsock)
c->mfc_flags |= MFC_STATIC;
@@ -1792,6 +1793,7 @@ static void ip_mr_forward(struct net *net, struct mr_table *mrt,
vif = cache->mfc_parent;
cache->mfc_un.res.pkt++;
cache->mfc_un.res.bytes += skb->len;
+   cache->mfc_un.res.lastuse = jiffies;
 
if (cache->mfc_origin == htonl(INADDR_ANY) && true_vifi >= 0) {
struct mfc_cache *cache_proxy;
@@ -2071,10 +2073,13 @@ drop:
 static int __ipmr_fill_mroute(struct mr_table *mrt, struct sk_buff *skb,
  struct mfc_cache *c, struct rtmsg *rtm)
 {
-   int ct;
-   struct rtnexthop *nhp;
-   struct nlattr *mp_attr;
struct rta_mfc_stats mfcs;
+   struct rta_cacheinfo ci;
+   struct nlattr *mp_attr;
+   struct rtnexthop *nhp;
+   long delta;
+   int ct;
+
 
/* If cache is unresolved, don't try to parse IIF and OIF */
if (c->mfc_parent >= MAXVIFS)
@@ -2109,7 +2114,14 @@ static int __ipmr_fill_mroute(struct mr_table *mrt, struct sk_buff *skb,
if (nla_put_64bit(skb, RTA_MFC_STATS, sizeof(mfcs), &mfcs, RTA_PAD) < 0)
return -EMSGSIZE;
 
+   memset(&ci, 0, sizeof(ci));
+   delta = jiffies - c->mfc_un.res.lastuse;
+   /* rta_lastuse is 32 bit, we shouldn't wrap the age */
+   ci.rta_lastuse = min_t(u64, jiffies_delta_to_clock_t(delta), UINT_MAX);
+   if (nla_put(skb, RTA_CACHEINFO, sizeof(ci), &ci) < 0)
+   return -EMSGSIZE;
rtm->rtm_type = RTN_MULTICAST;
+
return 1;
 }
 
diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index c7ca0f5d1a3b..6a5f1ca1dcca 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -1500,6 +1500,7 @@ static int ip6mr_mfc_add(struct net *net, struct mr6_table *mrt,
c->mf6c_origin = mfc->mf6cc_origin.sin6_addr;
c->mf6c_mcast

[PATCH net] net: bgmac: Fix infinite loop in bgmac_dma_tx_add()

2016-07-15 Thread Florian Fainelli
Nothing is decrementing the index "i" while we are cleaning up the
fragments we could not successfully transmit.
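
A stand-alone sketch of the unwind loop shows why the post-decrement matters. The ring size, the flag array and unwind_fragments() are assumptions made for illustration; only the loop condition and the index arithmetic mirror the driver.

```c
#define RING_SLOTS 8	/* hypothetical ring size for this sketch */

/*
 * Release the slots used by fragments 0..mapped-1, newest first.
 * Returns how many slots were released.
 */
static int unwind_fragments(int mapped_flags[RING_SLOTS], int ring_end,
			    int mapped)
{
	int released = 0;
	int i = mapped;

	/* "while (i > 0)" with no decrement in the body would never exit. */
	while (i-- > 0) {
		int index = (ring_end + i) % RING_SLOTS;

		mapped_flags[index] = 0;
		released++;
	}
	return released;
}
```

With the post-decrement, the body sees i = mapped-1 down to 0 and the loop ends after exactly "mapped" iterations, including the case where nothing was mapped.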

Fixes: 9cde94506eacf ("bgmac: implement scatter/gather support")
Reported-by: coverity (CID 1352048)
Signed-off-by: Florian Fainelli 
---
 drivers/net/ethernet/broadcom/bgmac.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bgmac.c b/drivers/net/ethernet/broadcom/bgmac.c
index a6333d38ecc0..25bbae5928d4 100644
--- a/drivers/net/ethernet/broadcom/bgmac.c
+++ b/drivers/net/ethernet/broadcom/bgmac.c
@@ -231,7 +231,7 @@ err_dma:
dma_unmap_single(dma_dev, slot->dma_addr, skb_headlen(skb),
 DMA_TO_DEVICE);
 
-   while (i > 0) {
+   while (i-- > 0) {
int index = (ring->end + i) % BGMAC_TX_RING_SLOTS;
struct bgmac_slot_info *slot = &ring->slots[index];
u32 ctl1 = le32_to_cpu(ring->cpu_base[index].ctl1);
-- 
2.7.4



[PATCH iproute2] ss: Fix support for device filter by index

2016-07-15 Thread David Ahern
Support was recently added for device filters. The intent was to allow
the device to be specified by name or index, and using the if%u format
(dev == if5) or the simpler and more intuitive index alone (dev == 5).
The latter case is broken since the index is not saved to the filter
after the strtoul conversion. Further, the tmp variable used for the
conversion shadows another variable used in the function. Fix both.
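
The conversion step can be sketched on its own. This is a hedged illustration rather than the ss code: parse_ifindex() is a hypothetical helper, and the errno check is an addition that the patch itself does not make.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse a bare interface index such as "62".
 * Returns 0 and stores the index in *ifindex on success, -1 otherwise.
 */
static int parse_ifindex(const char *name, unsigned int *ifindex)
{
	char *end;
	unsigned long n;

	errno = 0;
	n = strtoul(name, &end, 0);
	if (errno || end == name || *end != '\0' || n > UINT_MAX)
		return -1;

	/* The step that was missing: keep the converted value. */
	*ifindex = (unsigned int)n;
	return 0;
}
```

Using a distinct local name for the strtoul() result also avoids shadowing any outer variable, which was the second problem the patch addresses.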

With this change all 3 variants work as expected:
$ ss -t 'dev == 62'
State   Recv-Q Send-Q Local Address:PortPeer Address:Port
ESTAB   0  224 10.0.1.3%mgmt:ssh   192.168.0.50:58442

$ ss -t 'dev == mgmt'
State   Recv-Q Send-Q Local Address:PortPeer Address:Port
ESTAB   0  224 10.0.1.3%mgmt:ssh   192.168.0.50:58442

$ ss -t 'dev == if62'
State   Recv-Q Send-Q Local Address:PortPeer Address:Port
ESTAB   0  36  10.0.1.3%mgmt:ssh   192.168.0.50:58442

Fixes: 2d2932125616 ("ss: Add support to filter on device")
Signed-off-by: David Ahern 
---
No changes since last version. Applies cleanly to master branch.

 misc/ss.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index a0f9c6b9623c..abece96c0946 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -1435,11 +1435,13 @@ void *parse_devcond(char *name)
a.iface = xll_name_to_index(name);
if (a.iface == 0) {
char *end;
-   unsigned long res;
+   unsigned long n;
 
-   res = strtoul(name, &end, 0);
-   if (!end || end == name || *end || res > UINT_MAX)
+   n = strtoul(name, &end, 0);
+   if (!end || end == name || *end || n > UINT_MAX)
return NULL;
+
+   a.iface = n;
}
 
res = malloc(sizeof(*res));
-- 
2.1.4



Re: [v10, 3/7] soc: fsl: add GUTS driver for QorIQ platforms

2016-07-15 Thread Paul Gortmaker
[Re: [v10, 3/7] soc: fsl: add GUTS driver for QorIQ platforms] On 15/07/2016 (Fri 14:12) Scott Wood wrote:

> On Fri, 2016-07-15 at 12:43 -0400, Paul Gortmaker wrote:
> > > +source "drivers/soc/fsl/qe/Kconfig"

[...]

> > > +
> > > +config FSL_GUTS
> > > +   bool
> > > diff --git a/drivers/soc/fsl/Makefile b/drivers/soc/fsl/Makefile
> > > index 203307f..02afb7f 100644
> > > --- a/drivers/soc/fsl/Makefile
> > > +++ b/drivers/soc/fsl/Makefile
> > > @@ -4,3 +4,4 @@
> > > 
> > >  obj-$(CONFIG_QUICC_ENGINE) += qe/
> > >  obj-$(CONFIG_CPM)  += qe/
> > > +obj-$(CONFIG_FSL_GUTS) += guts.o
> > > diff --git a/drivers/soc/fsl/guts.c b/drivers/soc/fsl/guts.c
> > > new file mode 100644
> > > index 000..fa155e6
> > > --- /dev/null
> > > +++ b/drivers/soc/fsl/guts.c
> > > @@ -0,0 +1,119 @@
> > > +/*
> > > + * Freescale QorIQ Platforms GUTS Driver
> > > + *
> > > + * Copyright (C) 2016 Freescale Semiconductor, Inc.
> > > + *
> > > + * This program is free software; you can redistribute it and/or modify
> > > + * it under the terms of the GNU General Public License as published by
> > > + * the Free Software Foundation; either version 2 of the License, or
> > > + * (at your option) any later version.
> > > + */
> > > +
> > > +#include 
> > > +#include 
> > Seems there was lots of discussion on this.  If it does end up being
> > resent, it would be nice to get the module.h and other modular stuff
> > gone since it is a bool Kconfig.
> 
> I plan to resend just the GUTS driver portion and send it through the PPC
> tree.
> 
> I don't see any modular stuff in there besides the linux/module.h include.

Great.  Normally I'm seeing the MODULE_DEVICE_TABLE and MODULE_AUTHOR
and MODULE_LICENSE etc, so it has (unfortunately) become a knee jerk
reaction to assume the latter follows a module.h presence...  thanks for
removing the extraneous include.

Paul.
--

> 
> -Scott
> 
> 


[PATCH] igb: fix adjusting ptp timestamps for tx/rx latency

2016-07-15 Thread Kshitiz Gupta
Fix PHY delay compensation math in igb_ptp_tx_hwtstamp() and
igb_ptp_rx_rgtstamp. Add PHY delay compensation in
igb_ptp_rx_pktstamp().

In the IGB driver, there are two functions that retrieve timestamps
received by the PHY - igb_ptp_rx_rgtstamp() and igb_ptp_rx_pktstamp().
The previous commit only changed igb_ptp_rx_rgtstamp(), and the change
was incorrect.

There are two instances in which PHY delay compensations should be
made:

- Before a packet is transmitted over the PHY, the latency between
  when the packet is timestamped and when it is transmitted
  should be added to the timestamp, but it is currently subtracted.

- After a packet is received from the PHY, the latency between
  receiving and timestamping the packet should be subtracted
  from the timestamp, but it is currently added.
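
The sign convention can be written down as a small stand-alone sketch. The latency values and the names below are placeholders for illustration, not the IGB_TX_LATENCY_* / IGB_RX_LATENCY_* constants from the driver.

```c
#include <stdint.h>

enum ts_dir { TS_TX, TS_RX };

/* Placeholder per-speed PHY latencies in nanoseconds; not the igb values. */
static int64_t phy_latency_ns(int speed_mbps)
{
	switch (speed_mbps) {
	case 10:
		return 9000;
	case 100:
		return 2000;
	case 1000:
		return 200;
	default:
		return 0;	/* unknown speed: leave the timestamp alone */
	}
}

/*
 * TX: the stamp is taken before the frame reaches the wire, so add.
 * RX: the stamp is taken after the frame left the wire, so subtract.
 */
static int64_t adjust_hwtstamp(int64_t raw_ns, enum ts_dir dir, int speed_mbps)
{
	int64_t latency = phy_latency_ns(speed_mbps);

	return dir == TS_TX ? raw_ns + latency : raw_ns - latency;
}
```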

Signed-off-by: Kshitiz Gupta 
Fixes: 3f544d2 (igb: adjust ptp timestamps for tx/rx latency)
---
 drivers/net/ethernet/intel/igb/igb_ptp.c | 23 ---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_ptp.c b/drivers/net/ethernet/intel/igb/igb_ptp.c
index f097c5a..3f27f13 100644
--- a/drivers/net/ethernet/intel/igb/igb_ptp.c
+++ b/drivers/net/ethernet/intel/igb/igb_ptp.c
@@ -743,7 +743,7 @@ static void igb_ptp_tx_hwtstamp(struct igb_adapter *adapter)
}
}
 
-   shhwtstamps.hwtstamp = ktime_sub_ns(shhwtstamps.hwtstamp, adjust);
+   shhwtstamps.hwtstamp = ktime_add_ns(shhwtstamps.hwtstamp, adjust);
 
skb_tstamp_tx(adapter->ptp_tx_skb, &shhwtstamps);
dev_kfree_skb_any(adapter->ptp_tx_skb);
@@ -766,13 +766,30 @@ void igb_ptp_rx_pktstamp(struct igb_q_vector *q_vector,
 struct sk_buff *skb)
 {
__le64 *regval = (__le64 *)va;
+   struct igb_adapter *adapter = q_vector->adapter;
+   int adjust = 0;
 
/* The timestamp is recorded in little endian format.
 * DWORD: 0123
 * Field: Reserved Reserved SYSTIML  SYSTIMH
 */
-   igb_ptp_systim_to_hwtstamp(q_vector->adapter, skb_hwtstamps(skb),
+   igb_ptp_systim_to_hwtstamp(adapter, skb_hwtstamps(skb),
   le64_to_cpu(regval[1]));
+
+   /* adjust timestamp for the RX latency based on link speed */
+   switch (adapter->link_speed) {
+   case SPEED_10:
+   adjust = IGB_RX_LATENCY_10;
+   break;
+   case SPEED_100:
+   adjust = IGB_RX_LATENCY_100;
+   break;
+   case SPEED_1000:
+   adjust = IGB_RX_LATENCY_1000;
+   break;
+   }
+   skb_hwtstamps(skb)->hwtstamp =
+   ktime_sub_ns(skb_hwtstamps(skb)->hwtstamp, adjust);
 }
 
 /**
@@ -824,7 +841,7 @@ void igb_ptp_rx_rgtstamp(struct igb_q_vector *q_vector,
}
}
skb_hwtstamps(skb)->hwtstamp =
-   ktime_add_ns(skb_hwtstamps(skb)->hwtstamp, adjust);
+   ktime_sub_ns(skb_hwtstamps(skb)->hwtstamp, adjust);
 
/* Update the last_rx_timestamp timer in order to enable watchdog check
 * for error case of latched timestamp on a dropped packet.
-- 
2.1.4



Re: [RFC PATCH v2 08/10] net: sched: pfifo_fast use alf_queue

2016-07-15 Thread John Fastabend
On 16-07-15 04:23 AM, Jesper Dangaard Brouer wrote:
> On Thu, 14 Jul 2016 17:07:33 -0700
> John Fastabend  wrote:
> 
>> On 16-07-14 04:42 PM, Alexei Starovoitov wrote:
>>> On Wed, Jul 13, 2016 at 11:23:12PM -0700, John Fastabend wrote:
>>>> This converts the pfifo_fast qdisc to use the alf_queue enqueue and
>>>> dequeue routines then sets the NOLOCK bit.
>>>>
>>>> This also removes the logic used to pick the next band to dequeue from
>>>> and instead just checks each alf_queue for packets from top priority
>>>> to lowest. This might need to be a bit more clever but seems to work
>>>> for now.
>>>>
>>>> Signed-off-by: John Fastabend
>>>> ---
>>>>  net/sched/sch_generic.c |  131 +++
>>>
>>>>  static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
>>>>  struct sk_buff **to_free)
>>>>  {
>>>> -  return qdisc_drop(skb, qdisc, to_free);
>>>> +  err = skb_array_produce_bh(q, skb);
>>> ..
>>>>  static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>>>>  {
>>>> +  skb = skb_array_consume_bh(q);
>>>
>>> For this particular qdisc the performance gain should come from
>>> granularity of spin_lock, right?
>>
>> And the fact that the consumer and producer are using different
>> locks now.
> 
> Yes. Splitting up enqueue'ers (producers) from the dequeuer (consumer)
> is an important step, because today the qdisc layer has this problem
> that enqueue'ers can starve the single dequeuer.  The current
> mitigation tricks are the enq busy_lock and bulk dequeue.
> 
> As John says, using skb_array causes producers and consumer to use
> different locks.
> 
>>> Before we were taking the lock much earlier. Here we keep the lock,
>>> but for the very short time.
>>>     original pps  lockless  diff
>>> 1   1418168       1269450   -148718
>>> 2   1587390       1553408   -33982
>>> 4   1084961       1683639   +598678
>>> 8   989636        1522723   +533087
>>> 12  1014018       1348172   +334154
>>>
> 

I was able to recover the performance loss here and actually improve it
by fixing a few things in the patchset. Namely, qdisc_run was being
called unnecessarily in a few places, creating a fairly large per-packet
overhead, and using the _bh locks was costing quite a bit and is not
needed, as Jesper pointed out.

So new pps data here in somewhat raw format. I ran five iterations of
each thread count (1,2,4,8,12)

nolock (pfifo_fast)
1:  1440293 1421602 1409553 1393469 1424543
2:  1754890 1819292 1727948 1797711 1743427
4:  3282665 3344095 3315220 3332777 3348972
8:  2940079 1644450 2950777 2922085 2946310
12: 2042084 2610060 2857581 3493162 3104611

lock (pfifo_fast)
1:  1471479 1469142 1458825 1456788 1453952
2:  1746231 1749490 1753176 1753780 1755959
4:  1119626 1120515 1121478 1119220 1121115
8:  1001471  999308 1000318 1000776 1000384
12:  989269  992122  991590  986581  990430

nolock (mq)
1:   1435952  1459523  1448860  1385451   1435031
2:   2850662  2855702  2859105  2855443   2843382
4:   5288135  5271192  5252242  5270192   5311642
8:  10042731 10018063  9891813  9968382   9956727
12: 13265277 13384199 13438955 13363771  13436198

lock (mq)
1:   1448374  1444208  1437459  1437088  1452453
2:   2687963  2679221  2651059  2691630  2667479
4:   5153884  4684153  5091728  4635261  4902381
8:   9292395  9625869  9681835  9711651  9660498
12: 13553918 13682410 14084055 13946138 13724726

So then if we just use the first test example because I'm being a
bit lazy and don't want to calculate the avg/mean/whatever we get
a pfifo_fast chart like,

      locked    nolock       diff
----------------------------------
 1   1471479   1440293  -   31186
 2   1746231   1754890  +    8659
 4   1119626   3282665  + 2163039
 8   1001471   2940079  + 1938608
12    989269   2857581* + 1868312

[*] I pulled the 3rd iteration here as the 1st one seems off
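
Instead of picking a single iteration, the five runs per thread count could be averaged. The sketch below does that for the 8-thread pfifo_fast rows, whose samples are copied from the raw lists above; mean_pps() and mean_diff() are hypothetical helper names.

```c
#include <stddef.h>

#define RUNS 5

/* Arithmetic mean of one row of pps samples. */
static double mean_pps(const unsigned long samples[RUNS])
{
	double sum = 0.0;
	size_t i;

	for (i = 0; i < RUNS; i++)
		sum += (double)samples[i];
	return sum / RUNS;
}

/* Difference of the means; positive means nolock is faster. */
static double mean_diff(const unsigned long nolock[RUNS],
			const unsigned long lock[RUNS])
{
	return mean_pps(nolock) - mean_pps(lock);
}

/* pfifo_fast, 8 threads, as posted in the raw data. */
static const unsigned long nolock_8[RUNS] = {
	2940079, 1644450, 2950777, 2922085, 2946310
};
static const unsigned long lock_8[RUNS] = {
	1001471, 999308, 1000318, 1000776, 1000384
};
```

Averaging keeps an outlier such as the second nolock run from being either silently dropped or taken as representative.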

And the mq chart looks reasonable again with these changes,


      locked    nolock      diff
---------------------------------
 1   1448374   1435952  -  12422
 2   2687963   2850662  + 162699
 4   5153884   5288135  + 134251
 8   9292395  10042731  + 750336
12  13553918  13265277  - 288641

So the mq case is a bit of a wash from my point of view which I sort
of expected seeing in this test case there is no contention on the
enqueue()/producer or dequeue()/consumer case when running pktgen
at 1 thread per qdisc/queue. A better test would be to fire up a few
thousand udp sessions and bang on the qdiscs to get contention on the
enqueue side. I'll try this next. On another note the variance is a
touch con

Re: [PATCH 0 / 5] move the common CDC parser

2016-07-15 Thread Greg KH
On Fri, Jul 15, 2016 at 11:51:47AM -0700, David Miller wrote:
> From: Oliver Neukum 
> Date: Thu, 14 Jul 2016 15:41:29 +0200
> 
> > Experience has shown that making all CDC drivers depend on usbnet
> > is not practical, because some of them are not network drivers.
> > So this patch moves the common parser from usbnet into the messages
> > helpers of usbcore.
> > The rest of the series applies it to the non-network CDC drivers.
> > 
> > I hope it can go through Greg's tree although it touches usbnet.
> 
> I'm fine with Greg taking this series, sure.

Ok, I'll take it, thanks.

greg k-h


[PATCH] net: fixup for tracepoint napi:napi_poll

2016-07-15 Thread Jesper Dangaard Brouer
The recent change to tracepoint napi:napi_poll changed the order of
the parameters that perf scripts see; the printk was correct.  The
problem was that the new parameters (work and budget) were pushed
in front of dev_name.

The new parameters obviously need to be appended to stay backward
compatible.
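
A toy model of a positional consumer shows why appending is the compatible choice. Nothing here is the tracing or perf API; struct tp_field, the layouts and old_consumer_dev_name() are assumptions made purely for illustration.

```c
#include <stddef.h>

/* One named field of a toy event record. */
struct tp_field {
	const char *name;
	const char *value;
};

/* Hypothetical old consumer, written when dev_name was the second field. */
static const char *old_consumer_dev_name(const struct tp_field *fields,
					 size_t nfields)
{
	return nfields > 1 ? fields[1].value : NULL;
}

/* New fields pushed in front of dev_name: the old consumer reads "work". */
static const struct tp_field layout_inserted[] = {
	{ "napi", "0xabc" }, { "work", "3" },
	{ "budget", "64" }, { "dev_name", "eth0" },
};

/* New fields appended: the old consumer still finds dev_name. */
static const struct tp_field layout_appended[] = {
	{ "napi", "0xabc" }, { "dev_name", "eth0" },
	{ "work", "3" }, { "budget", "64" },
};
```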

Fixes: 1db19db7f5ff ("net: tracepoint napi:napi_poll add work and budget")
Signed-off-by: Jesper Dangaard Brouer 
---
 include/trace/events/napi.h |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/napi.h b/include/trace/events/napi.h
index 118ed7767639..0b9e5136a2a3 100644
--- a/include/trace/events/napi.h
+++ b/include/trace/events/napi.h
@@ -18,16 +18,16 @@ TRACE_EVENT(napi_poll,
 
TP_STRUCT__entry(
__field(struct napi_struct *,   napi)
+   __string(   dev_name, napi->dev ? napi->dev->name : NO_DEV)
__field(int,work)
__field(int,budget)
-   __string(   dev_name, napi->dev ? napi->dev->name : NO_DEV)
),
 
TP_fast_assign(
__entry->napi = napi;
+   __assign_str(dev_name, napi->dev ? napi->dev->name : NO_DEV);
__entry->work = work;
__entry->budget = budget;
-   __assign_str(dev_name, napi->dev ? napi->dev->name : NO_DEV);
),
 
TP_printk("napi poll on napi struct %p for device %s work %d budget %d",



Re: [PATCH v3 2/8] thunderbolt: Updating device IDs

2016-07-15 Thread David Miller
From: "Levy, Amir (Jer)" 
Date: Fri, 15 Jul 2016 18:56:39 +

> On Fri, Jul 15 2016, 09:49 PM, David Miller wrote:
>> From: Amir Levy 
>> Date: Thu, 14 Jul 2016 14:28:16 +0300
>> 
>> > Adding the new Thunderbolt(TM) device IDs to the list.
>> >
>> > Signed-off-by: Amir Levy 
>> 
>> Unless these PCI-IDs, all of them, are going to be used in multiple spots in
>> the kernel, it is not appropriate to add them here.
>> 
>> They belong as private macros in the drivers themselves, instead.
> 
> They might be used: http://www.spinics.net/lists/linux-pci/msg51331.html

Then move them to the common location when this "might" thing actually
happens.


Re: [PATCH v8 06/11] net/mlx4_en: add page recycle to prepare rx ring for tx support

2016-07-15 Thread Brenden Blanco
On Wed, Jul 13, 2016 at 08:40:59AM -0700, Brenden Blanco wrote:
> On Wed, Jul 13, 2016 at 10:17:26AM +0300, Tariq Toukan wrote:
> > 
> > On 13/07/2016 3:54 AM, Brenden Blanco wrote:
> > >On Tue, Jul 12, 2016 at 02:18:32PM -0700, David Miller wrote:
> > >>From: Brenden Blanco 
> > >>Date: Tue, 12 Jul 2016 00:51:29 -0700
> > >>
> > >>>+mlx4_en_free_resources(priv);
> > >>>+
> > >>> old_prog = xchg(&priv->prog, prog);
> > >>> if (old_prog)
> > >>> bpf_prog_put(old_prog);
> > >>>-return 0;
> > >>>+err = mlx4_en_alloc_resources(priv);
> > >>>+if (err) {
> > >>>+en_err(priv, "Failed reallocating port resources\n");
> > >>>+goto out;
> > >>>+}
> > >>>+if (port_up) {
> > >>>+err = mlx4_en_start_port(dev);
> > >>>+if (err)
> > >>>+en_err(priv, "Failed starting port\n");
> > >>A failed configuration operation should _NEVER_ leave the interface in
> > >>an inoperative state like these error paths do.
> > >>
> > >>You must instead preallocate the necessary resources, and only change
> > >>the chip's configuration and commit to the new settings once you have
> > >>successfully allocated those resources.
> > >I'll see what I can do here.
> > That's exactly what we're doing in a patchset that will be submitted
> > to net very soon (this week).
> Thanks Tariq!
> As an example, I had originally tried to integrate this code into
> mlx4_en_set_channels, which seems to have the same problem.
> > It fixes/refactors these failure flows just like Dave described,
> > something like:
> > 
> > err = mlx4_en_try_alloc_resources(priv, tmp, &new_prof);
> > if (err)
> > goto out;
> > 
> > if (priv->port_up) {
> > port_up = 1;
> > mlx4_en_stop_port(dev, 1);
> > }
> > 
> > mlx4_en_safe_replace_resources(priv, tmp);
> > 
> > if (port_up) {
> > err = mlx4_en_start_port(dev);
> > if (err)
> > en_err(priv, "Failed starting port\n");
> > }
> > 
> > I suggest you keep your code aligned with current net-next driver,
> > and later I will take it and fix it (once merged with net).
So, I took Dave's suggestion to heart, and spent the last 2 days seeing
what was possible to implement with just xdp as the focus, rather than
an overall cleanup which Tariq will be looking at.

Unfortunately, this turned out to be a bit of a rat hole.

What I wanted to do was to pre-allocate all the required pages before
reaching the point of no return. Doing this isn't all that hard, since
it should just be a few loops. However, I ended with a bit more
duplicated code than one would like, since I had to tease out the
various sections that assume exclusive access to hardware.
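
The "allocate first, commit afterwards" pattern can be sketched generically in user space. This is not mlx4 code: struct ring_cfg, the helpers and the injectable allocator are assumptions, and the allocator passed in is expected to be compatible with free().

```c
#include <stdlib.h>

typedef void *(*alloc_fn)(size_t);

struct ring_cfg {
	size_t nbufs;
	void **bufs;
};

static void ring_cfg_free(struct ring_cfg *cfg)
{
	size_t i;

	for (i = 0; i < cfg->nbufs; i++)
		free(cfg->bufs[i]);
	free(cfg->bufs);
	cfg->bufs = NULL;
	cfg->nbufs = 0;
}

/* Build a complete new configuration without touching the live one. */
static int ring_cfg_alloc(struct ring_cfg *cfg, size_t nbufs, size_t buf_size,
			  alloc_fn alloc)
{
	size_t i;

	cfg->nbufs = 0;
	cfg->bufs = calloc(nbufs, sizeof(*cfg->bufs));
	if (!cfg->bufs)
		return -1;
	for (i = 0; i < nbufs; i++) {
		cfg->bufs[i] = alloc(buf_size);
		if (!cfg->bufs[i]) {
			ring_cfg_free(cfg);
			return -1;
		}
		cfg->nbufs++;
	}
	return 0;
}

/* Allocator that always fails, used to exercise the error path. */
static void *always_fail(size_t size)
{
	(void)size;
	return NULL;
}

static int ring_reconfigure(struct ring_cfg *live, size_t nbufs,
			    size_t buf_size, alloc_fn alloc)
{
	struct ring_cfg next;

	if (ring_cfg_alloc(&next, nbufs, buf_size, alloc))
		return -1;	/* live configuration is untouched */

	ring_cfg_free(live);	/* point of no return: nothing can fail now */
	*live = next;
	return 0;
}
```

The sketch covers only the memory side. It does not address the harder part raised in this mail, namely swapping buffers while hardware may still write into the old ones.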

But, more than that, I don't see a way to fill these pages into
the rings safely while hardware still has the ability to write into the old
ones. There was no "pause" API that I could find besides
mlx4_en_stop_port(). That function is fairly destructive and requires
the resource allocation in mlx4_en_start_port() to succeed to recover
the port status.

One option that I considered would be to drain buffers from the rx ring,
and just let mlx4_en_recover_from_oom() do its job once we update the
page template in frag_info[]. This, however, also requires the queues to
be paused safely, so we again have to rely on mlx4_en_stop_port().

One change I can make is to avoid allocating additional tx rings, which
means that we can skip the calls to mlx4_en_free/alloc_resources().

The resulting code would then mirror what mlx4_en_change_mtu() does:

if (port_up) {
err = mlx4_en_start_port(dev);
if (err)
queue_work(mdev->workqueue, &priv->watchdog_task);
}

I intend to respin the patchset with this approach, and a few other
changes as requested elsewhere. If the above is still unacceptable, feel
free to let me know and I will avoid spamming the list.
> Another option is to avoid entirely the tx_ring_num change, so as to
> keep the majority of the initialized state valid. We would only allocate
> a new set of pages and refill the rx rings once we have confirmed there
> are enough resources.
> 
> So others can follow the discussion, there are multiple reasons to
> reconfigure the rings.
> 1. The rx frags should be page-per-packet
> 2. The pages should be mapped DMA_BIDIRECTIONAL
> 3. Each rx ring should have a dedicated tx ring, which is off limits
> from the upper stack
> 4. The dedicated tx ring will have a pointer back to its rx ring for
> recycling
> 
> #1 and #2 can be done to the side ahead of time, as you are also
> suggesting.
> 
> Currently, to achieve #3, we increase tx_ring_num while keeping
> num_tx_rings_p_up the same. This precipitates a round of
> free/alloc_resources, which takes some time and has many opportunities
> for failure.
> However, we could resurrect an earlier appr

Re: [patch net 0/5] mlxsw: Couple of fixes

2016-07-15 Thread David Miller
From: Jiri Pirko 
Date: Fri, 15 Jul 2016 11:14:57 +0200

> Couple of fixes for mlxsw driver from Ido.

Series applied, thanks.


Re: [PATCH net-next 1/2] macvtap: avoid hash calculating for single queue

2016-07-15 Thread David Miller
From: Jason Wang 
Date: Fri, 15 Jul 2016 03:46:30 -0400

> We decide the rxq through calculating its hash which is not necessary
> if we only have one rx queue. So this patch skip this and just return
> queue 0. Test shows 22% improving on guest rx pps.
> 
> Before: 1201504 pkts/s
> After:  1472731 pkts/s
> 
> Signed-off-by: Jason Wang 

Applied.


Re: [PATCH net-next 2/2] macvtap: switch to use skb array

2016-07-15 Thread David Miller
From: Jason Wang 
Date: Fri, 15 Jul 2016 03:46:31 -0400

> This patch switch to use skb array instead of sk_receive_queue to
> avoid spinlock contentions. Tests shows about 21% improvements for
> guest rx pps:
> 
> Before: 1472731 pkts/s
> After:  1786289 pkts/s
> 
> Signed-off-by: Jason Wang 

Looks great, nice work.

Applied, thanks Jason.


Re: [PATCH] r8152: add MODULE_VERSION

2016-07-15 Thread Grant Grundler
On Fri, Jul 15, 2016 at 2:25 PM, David Miller  wrote:
> From: Grant Grundler 
> Date: Thu, 14 Jul 2016 11:27:16 -0700
>
>> ethtool -i provides a driver version that is hard coded.
>> Export the same value via "modinfo".
>>
>> Signed-off-by: Grant Grundler 
>
> Applied.

Excellent - thank you. :)

grant


Re: [net 0/4][pull request] Intel Wired LAN Driver Updates 2016-07-14

2016-07-15 Thread David Miller
From: Jeff Kirsher 
Date: Fri, 15 Jul 2016 00:02:00 -0700

> This series contains fixes to i40e and ixgbe.

Pulled, thanks Jeff.


Re: [PATCH] r8152: add MODULE_VERSION

2016-07-15 Thread David Miller
From: Grant Grundler 
Date: Thu, 14 Jul 2016 11:27:16 -0700

> ethtool -i provides a driver version that is hard coded.
> Export the same value via "modinfo".
> 
> Signed-off-by: Grant Grundler 

Applied.


Re: [PATCH net-next v2 0/3] BPF event output helper improvements

2016-07-15 Thread David Miller
From: Daniel Borkmann 
Date: Thu, 14 Jul 2016 18:08:02 +0200

> This set adds improvements to the BPF event output helper to
> support non-linear data sampling, here specifically, for skb
> context. For details please see individual patches. The set
> is based against net-next tree.
> 
> v1 -> v2:
>   - Integrated and adapted Peter's diff into patch 1, updated
> the remaining ones accordingly. Thanks Peter!
> 
> Thanks a lot!

Series applied, thanks Daniel.


Re: [PATCH net] tcp: enable per-socket rate limiting of all 'challenge acks'

2016-07-15 Thread David Miller
From: Jason Baron 
Date: Thu, 14 Jul 2016 11:38:40 -0400

> From: Jason Baron 
> 
> The per-socket rate limit for 'challenge acks' was introduced in the
> context of limiting ack loops:
> 
> commit f2b2c582e824 ("tcp: mitigate ACK loops for connections as tcp_sock")
> 
> And I think it can be extended to rate limit all 'challenge acks' on a
> per-socket basis.
> 
> Since we have the global tcp_challenge_ack_limit, this patch allows for
> tcp_challenge_ack_limit to be set to a large value and effectively rely on
> the per-socket limit, or set tcp_challenge_ack_limit to a lower value and
> still prevents a single connections from consuming the entire challenge ack
> quota.
> 
> It further moves in the direction of eliminating the global limit at some
> point, as Eric Dumazet has suggested. This a follow-up to:
> Subject: tcp: make challenge acks less predictable
> 
> Cc: Eric Dumazet 
> Cc: David S. Miller 
> Cc: Neal Cardwell 
> Cc: Yuchung Cheng 
> Cc: Yue Cao 
> Signed-off-by: Jason Baron 

Applied, thanks.


Re: [PATCH net-next] rxrpc: checking for IS_ERR() instead of NULL

2016-07-15 Thread David Miller
From: David Howells 
Date: Thu, 14 Jul 2016 15:47:01 +0100

> From: Dan Carpenter 
> 
> The rxrpc_lookup_peer() function returns NULL on error, it never returns
> error pointers.
> 
> Fixes: 8496af50eb38 ('rxrpc: Use RCU to access a peer's service connection tree')
> Signed-off-by: Dan Carpenter 
> Signed-off-by: David Howells 

Applied.


Re: [ovs-dev] [PATCH net-next v11 5/6] openvswitch: add layer 3 flow/port support

2016-07-15 Thread pravin shelar
On Wed, Jul 13, 2016 at 12:31 AM, Simon Horman
 wrote:
> Hi Pravin,
>
> On Thu, Jul 07, 2016 at 01:54:15PM -0700, pravin shelar wrote:
>> On Wed, Jul 6, 2016 at 10:59 AM, Simon Horman
>>  wrote:
>
> ...

>
>> > diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
>> > index 0ea128eeeab2..86f2cfb19de3 100644
>> > --- a/net/openvswitch/flow.c
>> > +++ b/net/openvswitch/flow.c
>> ...
>>
>> > @@ -723,9 +729,17 @@ int ovs_flow_key_extract(const struct ip_tunnel_info *tun_info,
>> > key->phy.skb_mark = skb->mark;
>> > ovs_ct_fill_key(skb, key);
>> > key->ovs_flow_hash = 0;
>> > +   key->phy.is_layer3 = skb->mac_len == 0;
>>
>> I do not think mac_len can be used. mac_header needs to be checked.
>> ...
>
> Yes, indeed. The update to use skb_mac_header_was_set() here accidentally
> slipped into the following patch, sorry about that.
>
> With that change in place I believe that this patch is internally
> consistent because mac_header and mac_len are set correctly by the
> call to key_extract() which is called by ovs_flow_key_extract() just
> after where the excerpt above ends.
>
> That said, I do think that it is possible to rely on skb_mac_header_was_set
> throughout the datapath, including action processing etc... I have provided
> an incremental patch - which I created on top of this entire series - at
> the end of this email. If you prefer that approach I am happy to take it,
> though I do feel that using mac_len leads to slightly cleaner code. Let me
> know what you think.
>


I am not sure you can use only mac_len to detect an L3 packet. This
does not work with MPLS packets, because mac_len is also used to account
for MPLS headers pushed on the skb. Therefore, in the case of an MPLS
header on an L3 packet, mac_len would be non-zero and we have to look at
either mac_header or some other metadata, like an is_layer3 flag from the
key, to check for an L3 packet.
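
A toy model makes the objection concrete. None of this is kernel code: struct toy_pkt and the helpers are assumptions that only mimic how mac_len is described here, as the L2 header length plus any pushed MPLS headers.

```c
#include <stdbool.h>

/* Toy stand-in for the fields under discussion; not struct sk_buff. */
struct toy_pkt {
	bool has_l2_header;	/* ground truth for the sketch */
	unsigned int l2_len;	/* Ethernet header bytes, 0 for an L3 packet */
	unsigned int mpls_len;	/* bytes of MPLS label stack pushed */
};

/* Models mac_len as described: L2 header plus pushed MPLS headers. */
static unsigned int toy_mac_len(const struct toy_pkt *p)
{
	return p->l2_len + p->mpls_len;
}

/* The heuristic being questioned: "mac_len == 0 means L3". */
static bool looks_l3_by_mac_len(const struct toy_pkt *p)
{
	return toy_mac_len(p) == 0;
}
```

An L3 packet carrying a single 4-byte MPLS label has no L2 header, yet the heuristic reports it as not L3, which is the misclassification pointed out above.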


>> > diff --git a/net/openvswitch/vport-netdev.c b/net/openvswitch/vport-netdev.c
>> > index 4e3972344aa6..733e7914f6bd 100644
>> > --- a/net/openvswitch/vport-netdev.c
>> > +++ b/net/openvswitch/vport-netdev.c
>> > @@ -57,8 +57,10 @@ static void netdev_port_receive(struct sk_buff *skb)
>> > if (unlikely(!skb))
>> > return;
>> >
>> > -   skb_push(skb, ETH_HLEN);
>> > -   skb_postpush_rcsum(skb, skb->data, ETH_HLEN);
>> > +   if (vport->dev->type == ARPHRD_ETHER) {
>> > +   skb_push(skb, ETH_HLEN);
>> > +   skb_postpush_rcsum(skb, skb->data, ETH_HLEN);
>> > +   }
>> This is still required for tunnel device of ARPHRD_NONE which can
>> handle l2 packets.
>
> That is not necessary given the current implementation (of ipgre) as it
> supplies an skb with the mac header in place if the inner packet was an
> Ethernet packet. This scheme could of course be adjusted.
>
> ...
>

I think we should send L2 packets with the L2 header pushed on the skb.
This is what OVS expects. The skb-push should be done for all L2 packets
rather than for a particular type of device.

>
>
> Update to use skb_mac_header_was_set() more as mentioned above.
> Please let me know what you think about this approach.
>
>  include/net/mpls.h   |4 ++-
>  net/openvswitch/actions.c|   42 ---
>  net/openvswitch/flow.c   |   23 +++
>  net/openvswitch/vport-internal_dev.c |2 -
>  net/openvswitch/vport-netdev.c   |4 +--
>  5 files changed, 44 insertions(+), 31 deletions(-)
>
> diff --git a/include/net/mpls.h b/include/net/mpls.h
> index 5b3b5addfb08..296b68661be0 100644
> --- a/include/net/mpls.h
> +++ b/include/net/mpls.h
> @@ -34,6 +34,8 @@ static inline bool eth_p_mpls(__be16 eth_type)
>   */
>  static inline unsigned char *skb_mpls_header(struct sk_buff *skb)
>  {
> -   return skb_mac_header(skb) + skb->mac_len;
> +   return skb_mac_header_was_set(skb) ?
> +   skb_mac_header(skb) + skb->mac_len :
> +   skb->data;
>  }

This function is also called from the GSO layer. The issue is that the
GSO layer resets the mac header and mac length and then calls the
mpls-gso-handler, so all subsequent checks for an L3 packet fail.
So far we have explored three different ways to detect L3 packet but
each has its own issue.
1. skb mac header : GSO can reset mac header.
2. skb mac length : MPLS uses mac_len to account for MPLS header
length along with L2 header
3. skb protocol: ETH_P_TEB is not set for all L2 frames, networking
stack is not ready to handle this type for given skb.

So none of them works consistently. I think the only option to detect
L3 packet reliably (and without adding field to skb) is to use
skb-protocol along with ARPHRD_NONE device type. If ARPHRD_NONE type
device generates L2 packet it needs to set protocol to ETH_P_TEB. Some
networking stack function also needs to be fixed to handle this
protocol type, e.g. vlan_get_protocol(), br_dev_queue_push_xmit(),
etc.


Re: [PATCH 00/14] Present useful limits to user (v2)

2016-07-15 Thread H. Peter Anvin

On July 15, 2016 6:59:56 AM PDT, Peter Zijlstra  wrote:
>On Fri, Jul 15, 2016 at 01:52:48PM +, Topi Miettinen wrote:
>> On 07/15/16 12:43, Peter Zijlstra wrote:
>> > On Fri, Jul 15, 2016 at 01:35:47PM +0300, Topi Miettinen wrote:
>> >> Hello,
>> >>
>> >> There are many basic ways to control processes, including capabilities,
>> >> cgroups and resource limits. However, there are far fewer ways to find out
>> >> useful values for the limits, except blind trial and error.
>> >>
>> >> This patch series attempts to fix that by giving at least a nice starting
>> >> point from the highwater mark values of the resources in question.
>> >> I looked where each limit is checked and added a call to update the mark
>> >> nearby.
>> > 
>> > And how is that useful? Setting things to the high watermark is
>> > basically the same as not setting the limit at all.
>> 
>> What else would you use, too small limits?
>
>That question doesn't make sense.
>
>What's the point of setting a limit if it ends up being the same as
>no-limit (aka unlimited).
>
>If you cannot explain; and you have not so far; what use these values
>are, why would we look at the patches.

One reason is to catch a malfunctioning process rather than dragging the whole 
system down with it.  It could also be useful for development.
-- 
Sent from my Android device with K-9 Mail. Please excuse brevity and formatting.


Re: [PATCH] dt: bindings: Add a generic ethernet device binding

2016-07-15 Thread Arnd Bergmann
On Wednesday, July 13, 2016 12:20:04 PM CEST Hans de Goede wrote:
> +&mmc1 {
> +   non-removable;
> +   status = "okay";
> +
> +   sdio_wifi: sdio_wifi@1 {
> +   compatible = "generic,ethernet";
> +   reg = <1>;
> +   };
> +};

For discoverable buses, we normally use a compatible property that
reflects the device ID on that bus, e.g. on PCI we have "pci1A2B:3C4D",
and I think that makes more sense than having to come up with strings
for sdio devices.

In fact, Linux completely ignores the compatible strings on those
buses (pci, usb, sdio, ...), so I think we can just do the same thing
using no compatible string at all.

Arnd



Question on IPv6 default route metrics

2016-07-15 Thread Petri Gynther
netdev:

I have the same question as Jan in his original thread:
http://lkml.iu.edu/hypermail/linux/kernel/1108.3/01897.html

If a Linux device has multiple IPv6 default routes (e.g. via eth0 and
wlan0), they all currently have the same metric 1024.

But the wired route is normally preferred, so it should have a lower metric.

It is my understanding that the kernel currently sets this default
route metric to 1024 when it receives an IPv6 router advertisement on a
link.

Is there any way to control the default route metric value on a
per-interface basis, i.e. router advertisements received on eth0 will
get metric 1024, and router advertisements received on wlan0 will get
metric, say, 1028?


[PATCH net-next] sctp: fix GSO for IPv6

2016-07-15 Thread Marcelo Ricardo Leitner
commit 90017accff61 ("sctp: Add GSO support") didn't register SCTP GSO
offloading for IPv6 and yet didn't put any restrictions on generating
GSO packets while in IPv6, which causes all IPv6 GSO'ed packets to be
silently dropped.

The fix is to properly register the offload this time.

Fixes: 90017accff61 ("sctp: Add GSO support")
Signed-off-by: Marcelo Ricardo Leitner 
---
I guess sctp multi-homing outsmarted myself during testing, ugh.

 net/sctp/offload.c | 23 ++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/net/sctp/offload.c b/net/sctp/offload.c
index a37887b373a75524a54a1443f7df2d45ecf6cef7..7e869d0cca69826ee3e892e389bacdc9a58a1637 100644
--- a/net/sctp/offload.c
+++ b/net/sctp/offload.c
@@ -92,7 +92,28 @@ static const struct net_offload sctp_offload = {
},
 };
 
+static const struct net_offload sctp6_offload = {
+   .callbacks = {
+   .gso_segment = sctp_gso_segment,
+   },
+};
+
 int __init sctp_offload_init(void)
 {
-   return inet_add_offload(&sctp_offload, IPPROTO_SCTP);
+   int ret;
+
+   ret = inet_add_offload(&sctp_offload, IPPROTO_SCTP);
+   if (ret)
+   goto out;
+
+   ret = inet6_add_offload(&sctp6_offload, IPPROTO_SCTP);
+   if (ret)
+   goto ipv4;
+
+   return ret;
+
+ipv4:
+   inet_del_offload(&sctp_offload, IPPROTO_SCTP);
+out:
+   return ret;
 }
-- 
2.7.4



[PATCH net-next] sctp: recvmsg should be able to run even if sock is in closing state

2016-07-15 Thread Marcelo Ricardo Leitner
Commit d46e416c11c8 missed updating some other places which checked for
the socket being TCP-style AND in Established state, as the Closing state
overlaps with the previous understanding of Established.

Without this fix, one of the effects is that some already queued rx
messages may not be readable anymore depending on how the association
was torn down, and sending may also not be possible if the peer initiated
the shutdown.

Also merge two if() blocks into one condition on sctp_sendmsg().

Cc: Xin Long 
Fixes: d46e416c11c8 ("sctp: sctp should change socket state when shutdown is received")
Signed-off-by: Marcelo Ricardo Leitner 
---
 net/sctp/socket.c | 32 +---
 1 file changed, 17 insertions(+), 15 deletions(-)

diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 52fdd540a9ef153336e0c6df725ce47c9ebab11b..d2681cb1dd30044d62b443311923a94659ce9395 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -202,7 +202,7 @@ struct sctp_association *sctp_id2assoc(struct sock *sk, sctp_assoc_t id)
 * could be a TCP-style listening socket or a socket which
 * hasn't yet called connect() to establish an association.
 */
-   if (!sctp_sstate(sk, ESTABLISHED))
+   if (!sctp_sstate(sk, ESTABLISHED) && !sctp_sstate(sk, CLOSING))
return NULL;
 
/* Get the first and the only association from the list. */
@@ -1068,7 +1068,7 @@ static int __sctp_connect(struct sock *sk,
 * is already connected.
 * It cannot be done even on a TCP-style listening socket.
 */
-   if (sctp_sstate(sk, ESTABLISHED) ||
+   if (sctp_sstate(sk, ESTABLISHED) || sctp_sstate(sk, CLOSING) ||
(sctp_style(sk, TCP) && sctp_sstate(sk, LISTENING))) {
err = -EISCONN;
goto out_free;
@@ -1705,18 +1705,19 @@ static int sctp_sendmsg(struct sock *sk, struct msghdr *msg, size_t msg_len)
if (msg_name) {
/* Look for a matching association on the endpoint. */
asoc = sctp_endpoint_lookup_assoc(ep, &to, &transport);
-   if (!asoc) {
-   /* If we could not find a matching association on the
-* endpoint, make sure that it is not a TCP-style
-* socket that already has an association or there is
-* no peeled-off association on another socket.
-*/
-   if ((sctp_style(sk, TCP) &&
-sctp_sstate(sk, ESTABLISHED)) ||
-   sctp_endpoint_is_peeled_off(ep, &to)) {
-   err = -EADDRNOTAVAIL;
-   goto out_unlock;
-   }
+
+   /* If we could not find a matching association on the
+* endpoint, make sure that it is not a TCP-style
+* socket that already has an association or there is
+* no peeled-off association on another socket.
+*/
+   if (!asoc &&
+   ((sctp_style(sk, TCP) &&
+ (sctp_sstate(sk, ESTABLISHED) ||
+  sctp_sstate(sk, CLOSING))) ||
+sctp_endpoint_is_peeled_off(ep, &to))) {
+   err = -EADDRNOTAVAIL;
+   goto out_unlock;
}
} else {
asoc = sctp_id2assoc(sk, associd);
@@ -2077,7 +2078,8 @@ static int sctp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 
lock_sock(sk);
 
-   if (sctp_style(sk, TCP) && !sctp_sstate(sk, ESTABLISHED)) {
+   if (sctp_style(sk, TCP) && !sctp_sstate(sk, ESTABLISHED) &&
+   !sctp_sstate(sk, CLOSING)) {
err = -ENOTCONN;
goto out;
}
-- 
2.7.4



Re: [v10, 3/7] soc: fsl: add GUTS driver for QorIQ platforms

2016-07-15 Thread Scott Wood
On Fri, 2016-07-15 at 12:43 -0400, Paul Gortmaker wrote:
> On Wed, May 4, 2016 at 11:12 PM, Yangbo Lu  wrote:
> > 
> > The global utilities block controls power management, I/O device
> > enabling, power-on reset (POR) configuration monitoring, alternate
> > function selection for multiplexed signals, and clock control.
> > 
> > This patch adds GUTS driver to manage and access global utilities
> > block.
> > 
> > Signed-off-by: Yangbo Lu 
> > Acked-by: Scott Wood 
> > ---
> > Changes for v4:
> > - Added this patch
> > Changes for v5:
> > - Modified copyright info
> > - Changed MODULE_LICENSE to GPL
> > - Changed EXPORT_SYMBOL_GPL to EXPORT_SYMBOL
> > - Made FSL_GUTS user-invisible
> > - Added a complete compatible list for GUTS
> > - Stored guts info in file-scope variable
> > - Added mfspr() getting SVR
> > - Redefined GUTS APIs
> > - Called fsl_guts_init rather than using platform driver
> > - Removed useless parentheses
> > - Removed useless 'extern' key words
> > Changes for v6:
> > - Made guts thread safe in fsl_guts_init
> > Changes for v7:
> > - Removed 'ifdef' for function declaration in guts.h
> > Changes for v8:
> > - Fixes lines longer than 80 characters checkpatch issue
> > - Added 'Acked-by: Scott Wood'
> > Changes for v9:
> > - None
> > Changes for v10:
> > - None
> > ---
> >  drivers/soc/Kconfig  |   2 +-
> >  drivers/soc/fsl/Kconfig  |   8 +++
> >  drivers/soc/fsl/Makefile |   1 +
> >  drivers/soc/fsl/guts.c   | 119
> > 
> >  include/linux/fsl/guts.h | 126 +-
> > -
> >  5 files changed, 207 insertions(+), 49 deletions(-)
> >  create mode 100644 drivers/soc/fsl/Kconfig
> >  create mode 100644 drivers/soc/fsl/guts.c
> > 
> > diff --git a/drivers/soc/Kconfig b/drivers/soc/Kconfig
> > index cb58ef0..7106463 100644
> > --- a/drivers/soc/Kconfig
> > +++ b/drivers/soc/Kconfig
> > @@ -2,7 +2,7 @@ menu "SOC (System On Chip) specific Drivers"
> > 
> >  source "drivers/soc/bcm/Kconfig"
> >  source "drivers/soc/brcmstb/Kconfig"
> > -source "drivers/soc/fsl/qe/Kconfig"
> > +source "drivers/soc/fsl/Kconfig"
> >  source "drivers/soc/mediatek/Kconfig"
> >  source "drivers/soc/qcom/Kconfig"
> >  source "drivers/soc/rockchip/Kconfig"
> > diff --git a/drivers/soc/fsl/Kconfig b/drivers/soc/fsl/Kconfig
> > new file mode 100644
> > index 000..b313759
> > --- /dev/null
> > +++ b/drivers/soc/fsl/Kconfig
> > @@ -0,0 +1,8 @@
> > +#
> > +# Freescale SOC drivers
> > +#
> > +
> > +source "drivers/soc/fsl/qe/Kconfig"
> > +
> > +config FSL_GUTS
> > +   bool
> > diff --git a/drivers/soc/fsl/Makefile b/drivers/soc/fsl/Makefile
> > index 203307f..02afb7f 100644
> > --- a/drivers/soc/fsl/Makefile
> > +++ b/drivers/soc/fsl/Makefile
> > @@ -4,3 +4,4 @@
> > 
> >  obj-$(CONFIG_QUICC_ENGINE) += qe/
> >  obj-$(CONFIG_CPM)  += qe/
> > +obj-$(CONFIG_FSL_GUTS) += guts.o
> > diff --git a/drivers/soc/fsl/guts.c b/drivers/soc/fsl/guts.c
> > new file mode 100644
> > index 000..fa155e6
> > --- /dev/null
> > +++ b/drivers/soc/fsl/guts.c
> > @@ -0,0 +1,119 @@
> > +/*
> > + * Freescale QorIQ Platforms GUTS Driver
> > + *
> > + * Copyright (C) 2016 Freescale Semiconductor, Inc.
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation; either version 2 of the License, or
> > + * (at your option) any later version.
> > + */
> > +
> > +#include 
> > +#include 
> Seems there was lots of discussion on this.  If it does end up being
> resent, it would be nice to get the module.h and other modular stuff
> gone since it is a bool Kconfig.

I plan to resend just the GUTS driver portion and send it through the PPC
tree.

I don't see any modular stuff in there besides the linux/module.h include.

-Scott




Re: [PATCH 3/3] RFC: net: smsc911x: add wake-up event interrupt support

2016-07-15 Thread Florian Fainelli
On 07/08/2016 02:07 AM, Linus Walleij wrote:
> +static irqreturn_t smsc911x_pme_irq_thread(int irq, void *dev_id)
> +{
> + struct net_device *dev = dev_id;
> + struct smsc911x_data *pdata __maybe_unused = netdev_priv(dev);
> +
> + SMSC_TRACE(pdata, pm, "wakeup event");
> + /* This signal is active for 50 ms, wait for it to deassert */
> + usleep_range(5, 10);

Shouldn't you have a call to pm_wakeup_event() so that this properly
gets accounted for as a wake-up event in /sys/*?
-- 
Florian


Re: [PATCH v8 04/11] net/mlx4_en: add support for fast rx drop bpf program

2016-07-15 Thread Jesper Dangaard Brouer

On Fri, 15 Jul 2016 09:47:46 -0700 Alexei Starovoitov wrote:
> On Fri, Jul 15, 2016 at 09:18:13AM -0700, Tom Herbert wrote:
[..]
> > > We don't need extra complexity of figuring out number of rings and
> > > struggling with lack of atomicity.  
> > 
> > We already have this problem with other per ring configuration.  
> 
> not really. without atomicity of the program change, the user space
> daemon that controls it will struggle to adjust. Consider the case
> where we're pushing new update for loadbalancer. In such case we
> want to reuse the established bpf map, since we cannot atomically
> move it from old to new, but we want to swap the program that uses
> in one go, otherwise two different programs will be accessing
> the same map. Technically it's valid, but difference in the programs
> may cause issues. Lack of atomicity is not intractable problem,
> it just makes user space quite a bit more complex for no reason.

I don't think you have a problem with updating the program on a per-queue
basis, as the programs will be updated atomically per RX queue (thus a CPU
can only see one program).
Today, you already have to handle multiple CPUs running the same
program and needing to access the same map.

You mention that there might be a problem if the programs differ too
much to share the map.  But that is the same problem as today.  If you
need to load a program that e.g. changes the map layout, then you
obviously cannot allow it to inherit the old map, but must feed the new
program a new map (with the new layout).


There is actually a performance advantage in knowing that a program is
only attached to a single RX queue, as only a single CPU can process an
RX ring. Thus, when e.g. accessing a map (or other lookup table) you can
avoid any locking.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
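
The locking argument above can be illustrated with a small user-space
sketch. It is not mlx4 or XDP code: the ring count, the table layout and
the names ring_table_hit() and ring_table_total() are assumptions made up
for illustration. The only point is that when exactly one CPU ever
processes a given ring, a table private to that ring can be updated with
plain stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sizes, chosen only for this sketch. */
#define SKETCH_NUM_RINGS 4
#define SKETCH_NUM_KEYS  8

/* One private table per RX ring. */
struct ring_table_sketch {
	uint64_t hits[SKETCH_NUM_KEYS];
};

static struct ring_table_sketch ring_tables[SKETCH_NUM_RINGS];

/*
 * Called from the (single) context that processes 'ring'. Because no
 * other CPU writes to this ring's table, a plain increment is enough;
 * no lock or atomic operation is taken.
 */
static void ring_table_hit(unsigned int ring, unsigned int key)
{
	assert(ring < SKETCH_NUM_RINGS && key < SKETCH_NUM_KEYS);
	ring_tables[ring].hits[key]++;
}

/* Control-path helper: sum one key over all per-ring tables. */
static uint64_t ring_table_total(unsigned int key)
{
	uint64_t sum = 0;
	size_t i;

	assert(key < SKETCH_NUM_KEYS);
	for (i = 0; i < SKETCH_NUM_RINGS; i++)
		sum += ring_tables[i].hits[key];
	return sum;
}
```

A table shared between rings would need per-CPU storage, atomics or a
lock for the same update, which is the cost the per-queue attachment
would let a program avoid.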


Re: [PATCH 1/3] net: smsc911x: augment device tree bindings

2016-07-15 Thread Rob Herring
On Fri, Jul 08, 2016 at 11:07:30AM +0200, Linus Walleij wrote:
> This adds device tree bindings for:
> 
> - An optional GPIO line for releasing the RESET signal to the
>   SMSC911x devices
> 
> - An optional PME (power management event) interrupt line that
>   can be utilized to wake up the system on network activity.
>   This signal exists on all the SMSC911x devices, it is just not
>   very often routed.
> 
> Both these lines are routed to the SoC on the Qualcomm APQ8060
> Dragonboard and thus need to be bound in the device tree.
> 
> Cc: devicet...@vger.kernel.org
> Signed-off-by: Linus Walleij 
> ---
>  Documentation/devicetree/bindings/net/smsc911x.txt | 16 
>  1 file changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/devicetree/bindings/net/smsc911x.txt 
> b/Documentation/devicetree/bindings/net/smsc911x.txt
> index 3fed3c124411..7b01c37272c1 100644
> --- a/Documentation/devicetree/bindings/net/smsc911x.txt
> +++ b/Documentation/devicetree/bindings/net/smsc911x.txt
> @@ -3,9 +3,12 @@
>  Required properties:
>  - compatible : Should be "smsc,lan", "smsc,lan9115"
>  - reg : Address and length of the io space for SMSC LAN
> -- interrupts : Should contain SMSC LAN interrupt line
> -- interrupt-parent : Should be the phandle for the interrupt controller
> -  that services interrupts for this device
> +- interrupts/extended-interrupts : Should contain the SMSC LAN

It's interrupts-extended. Documentation-wise both are always supported, 
so we just document 'interrupts' unless interrupts-extended is only ever 
valid which would not be the case here.

> +  interrupt line as cell 0, cell 1 is an OPTIONAL PME (power
> +  management event) interrupt that is able to wake up the host
> +  system with a 50ms pulse on network activity
> +  For generic bindings for interrupt controller parents, refer to
> +  interrupt-controller/interrupts.txt
>  - phy-mode : See ethernet.txt file in the same directory


Re: [PATCH iproute2] ss: Add option to suppress header line

2016-07-15 Thread David Ahern

On 7/15/16 11:20 AM, David Ahern wrote:

Add option to suppress header line. When used the following line
is not shown:
"State  Recv-Q Send-Q Local Address:Port  Peer Address:Port"

Signed-off-by: David Ahern 
---
 man/man8/ss.8 |  3 +++
 misc/ss.c | 28 +++-
 2 files changed, 22 insertions(+), 9 deletions(-)



This one I goofed and made relative to our branch.

If you are ok with the option, I'll rebase to the master branch.



Re: [PATCH iproute2] ss: Fix support for device filter by index

2016-07-15 Thread David Ahern

On 7/15/16 12:53 PM, Stephen Hemminger wrote:

On Fri, 15 Jul 2016 09:29:28 -0700
David Ahern  wrote:


Support was recently added for device filters. The intent was to allow
the device to be specified by name or index, and using the if%u format
(dev == if5) or the simpler and more intuitive index alone (dev == 5).
The latter case is broken since the index is not saved to the filter
after the strtoul conversion. Further, the tmp variable used for the
conversion shadows another variable used in the function. Fix both.

With this change all 3 variants work as expected:
$ ss -t 'dev == 62'
State   Recv-Q Send-Q Local Address:PortPeer Address:Port
ESTAB   0  224 10.0.1.3%mgmt:ssh   192.168.0.50:58442

$ ss -t 'dev == mgmt'
State   Recv-Q Send-Q Local Address:PortPeer Address:Port
ESTAB   0  224 10.0.1.3%mgmt:ssh   192.168.0.50:58442

$ ss -t 'dev == if62'
State   Recv-Q Send-Q Local Address:PortPeer Address:Port
ESTAB   0  36  10.0.1.3%mgmt:ssh   192.168.0.50:58442

Fixes: 2d2932125616 ("ss: Add support to filter on device")
Signed-off-by: David Ahern 


Won't apply to current code.
Please rebase.



It applies cleanly for me to master branch. That's where the ss dev 
filter is from June:


commit 2d29321256168e13e10fbde3c57f33e70dcb6cc8
Author: David Ahern 
Date:   Mon Jun 27 11:34:25 2016 -0700

ss: Add support to filter on device

No commits on that file since except 
62000e51e05d635016bae9891a4e00134ed8aefb which does not impact the 
function in question.
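
For reference, the three accepted spellings described in the commit
message (a bare index, the if%u form, or a device name) could be parsed
roughly as below. This is a sketch only, not the ss.c code:
parse_dev_spec(), parse_index() and demo_name_to_index() are made-up
names, and the name lookup is a stub that only knows "mgmt" as index 62.
The detail that matters for the bug is that the converted index is
actually stored through the output pointer.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a real name-to-index lookup. */
static unsigned int demo_name_to_index(const char *name)
{
	return strcmp(name, "mgmt") == 0 ? 62 : 0;
}

/* Parse a strictly decimal, non-zero interface index. */
static int parse_index(const char *s, unsigned int *out)
{
	char *end;
	unsigned long val;

	if (*s < '0' || *s > '9')	/* reject empty, sign, whitespace */
		return -1;

	errno = 0;
	val = strtoul(s, &end, 10);
	if (errno || *end != '\0' || val == 0 || val > UINT_MAX)
		return -1;

	*out = (unsigned int)val;
	return 0;
}

/* Accept "62", "if62" or a device name; returns 0 on success. */
static int parse_dev_spec(const char *spec, unsigned int *ifindex)
{
	unsigned int idx;

	assert(spec && ifindex);

	if (parse_index(spec, &idx) == 0 ||
	    (strncmp(spec, "if", 2) == 0 &&
	     parse_index(spec + 2, &idx) == 0)) {
		*ifindex = idx;		/* save the converted index */
		return 0;
	}

	idx = demo_name_to_index(spec);
	if (idx == 0)
		return -1;

	*ifindex = idx;
	return 0;
}
```

Trying the numeric forms before the name lookup is a choice made for this
sketch; a device literally named "62" would need the opposite order.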





RE: [PATCH v3 2/8] thunderbolt: Updating device IDs

2016-07-15 Thread Levy, Amir (Jer)
On Fri, Jul 15 2016, 09:49 PM, David Miller wrote:
> From: Amir Levy 
> Date: Thu, 14 Jul 2016 14:28:16 +0300
> 
> > Adding the new Thunderbolt(TM) device IDs to the list.
> >
> > Signed-off-by: Amir Levy 
> 
> Unless these PCI-IDs, all of them, are going to be used in multiple spots
> in the kernel, it is not appropriate to add them here.
> 
> They belong as private macros in the drivers themselves, instead.

They might be used: http://www.spinics.net/lists/linux-pci/msg51331.html



Re: [PATCH iproute2] ss: Fix support for device filter by index

2016-07-15 Thread Stephen Hemminger
On Fri, 15 Jul 2016 09:29:28 -0700
David Ahern  wrote:

> Support was recently added for device filters. The intent was to allow
> the device to be specified by name or index, and using the if%u format
> (dev == if5) or the simpler and more intuitive index alone (dev == 5).
> The latter case is broken since the index is not saved to the filter
> after the strtoul conversion. Further, the tmp variable used for the
> conversion shadows another variable used in the function. Fix both.
> 
> With this change all 3 variants work as expected:
> $ ss -t 'dev == 62'
> State   Recv-Q Send-Q Local Address:PortPeer Address:Port
> ESTAB   0  224 10.0.1.3%mgmt:ssh   192.168.0.50:58442
> 
> $ ss -t 'dev == mgmt'
> State   Recv-Q Send-Q Local Address:PortPeer Address:Port
> ESTAB   0  224 10.0.1.3%mgmt:ssh   192.168.0.50:58442
> 
> $ ss -t 'dev == if62'
> State   Recv-Q Send-Q Local Address:PortPeer Address:Port
> ESTAB   0  36  10.0.1.3%mgmt:ssh   192.168.0.50:58442
> 
> Fixes: 2d2932125616 ("ss: Add support to filter on device")
> Signed-off-by: David Ahern 

Won't apply to current code.
Please rebase.


Re: [PATCH] net: phy: micrel: Add KSZ8041FTL fiber mode support

2016-07-15 Thread David Miller
From: Philipp Zabel 
Date: Thu, 14 Jul 2016 16:29:43 +0200

> We can't detect the FXEN (fiber mode) bootstrap pin, so configure
> it via a boolean device tree property "micrel,fiber-mode".
> If it is enabled, auto-negotiation is not supported.
> The only available modes are 100base-fx (full duplex and half duplex).
> 
> Signed-off-by: Philipp Zabel 

Applied to net-next, thanks.


Re: [PATCH 0 / 5] move the common CDC parser

2016-07-15 Thread David Miller
From: Oliver Neukum 
Date: Thu, 14 Jul 2016 15:41:29 +0200

> Experience has shown that making all CDC drivers depend on usbnet
> is not practical, because some of them are not network drivers.
> So this patch moves the common parser from usbnet into the messages
> helpers of usbcore.
> The rest of the series applies it to the non-network CDC drivers.
> 
> I hope it can go through Greg's tree although it touches usbnet.

I'm fine with Greg taking this series, sure.


Re: [PATCH v3 2/8] thunderbolt: Updating device IDs

2016-07-15 Thread David Miller
From: Amir Levy 
Date: Thu, 14 Jul 2016 14:28:16 +0300

> Adding the new Thunderbolt(TM) device IDs to the list.
> 
> Signed-off-by: Amir Levy 

Unless these PCI-IDs, all of them, are going to be used in multiple
spots in the kernel, it is not appropriate to add them here.

They belong as private macros in the drivers themselves, instead.


Re: [PATCH iproute2-master 1/2] man: Add devlink man pages to Makefile

2016-07-15 Thread Stephen Hemminger
On Wed, 13 Jul 2016 09:53:53 +0300
Ido Schimmel  wrote:

> Signed-off-by: Jiri Pirko 
> Signed-off-by: Ido Schimmel 
> ---
>  man/man8/Makefile | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/man/man8/Makefile b/man/man8/Makefile
> index 929826e..9badbed 100644
> --- a/man/man8/Makefile
> +++ b/man/man8/Makefile
> @@ -16,7 +16,8 @@ MAN8PAGES = $(TARGETS) ip.8 arpd.8 lnstat.8 routel.8 
> rtacct.8 rtmon.8 rtpr.8 ss.
>   tc-basic.8 tc-cgroup.8 tc-flow.8 tc-flower.8 tc-fw.8 tc-route.8 \
>   tc-tcindex.8 tc-u32.8 \
>   tc-connmark.8 tc-csum.8 tc-mirred.8 tc-nat.8 tc-pedit.8 tc-police.8 \
> - tc-simple.8 tc-skbedit.8 tc-vlan.8 tc-xt.8
> + tc-simple.8 tc-skbedit.8 tc-vlan.8 tc-xt.8 \
> + devlink.8 devlink-dev.8 devlink-monitor.8 devlink-port.8 devlink-sb.8
>  
>  all: $(TARGETS)
>  

Both applied thanks


Re: [PATCH v8 04/11] net/mlx4_en: add support for fast rx drop bpf program

2016-07-15 Thread Jesper Dangaard Brouer
On Fri, 15 Jul 2016 11:08:06 -0700
Tom Herbert  wrote:

> On Thu, Jul 14, 2016 at 12:25 AM, Jesper Dangaard Brouer
>  wrote:
> >
> > I would really really like to see the XDP program associated with the
> > RX ring queues, instead of a single XDP program covering the entire NIC.
> > (Just move the bpf_prog pointer to struct mlx4_en_rx_ring)
> >  
> I think it would be helpful to have some concrete implementation to
> look at for this. Jesper, can you code up some patches, taking into
> account Alexei's concerns about the atomic program update problem?

Bad timing, as I'm going on 3 weeks vacation from today.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [iproute PATCH v4 0/5] Big C99 style initializer rework

2016-07-15 Thread Stephen Hemminger
On Wed, 13 Jul 2016 20:47:14 +0200
Phil Sutter  wrote:

> This is v4 of my C99-style initializer related patch series. The changes
> since v3 are:
> 
> - Use empty initializer instead of the universal zero initializer:
>   The latter one triggers warnings in older GCCs, and this appears to
>   be the least intrusive workaround. Plus, empty initializers are used
>   all over the code already, so it won't make things worse. (GCC in
>   pedantic mode does not like them, but that is a can of worms by
>   itself.)
> 
> - Dropped patch 6 (unsigned value comparison simplification):
>   It unintendedly changes that comparison's semantics, and I am not
>   completely sure the change is correct - therefore rather leave it as
>   is.
> 
> - Rebased onto current origin/master again (no conflicts).
> 
> For reference, here's the v3 changelog:
> 
> - Flattened embedded struct's initializers:
>   Since the field names are very short, I figured it makes more sense to
>   keep indenting low. Also, the same style is already used in
>   ip/xfrm_policy.c so take that as an example.
> 
> - Moved leftover nlmsg_seq initializing into the common place as well:
>   I was unsure whether this is a good idea at first (due to the
>   increment), but again it's done in ip/xfrm_policy.c as well so should
>   be fine.
> 
> - Added a comma after the last field initializer as suggested by Jakub.
> 
> - Dropped patch 7 since it was NACKed.
> 
> - Eliminated checkpatch non-compliance.
> 
> - Second go at union bpf_attr in tc/tc_bpf.c:
>   I figured that while it is not possible to initialize fields, gcc-3.4.6
>   does not complain when setting the whole union to zero using '= {0}'.
>   So I did this and thereby at least got rid of the memset calls.
> 
> For reference, here's the v2 changelog:
> 
> - Rebased onto current upstream master:
>   My own commit a0a73b298a579 ("tc: m_action: Use C99 style initializers
>   for struct req") contains most of the changes to tc/m_action.c already,
>   so I put the remaining ones into a dedicated patch (the first one here)
>   with a better description.
> 
> - Tested against gcc-3.4.6:
>   This is the oldest gcc version I was able to install locally. It indeed
>   does not like the former changes in tc/tc_bpf.c, so I reverted them.
>   Apart from emitting many warnings, it successfully compiles the
>   sources.
> 
> In the process of compatibility testing, I made a few more changes which
> make sense to have:
> 
> - New patch 5 allows to conveniently override the compiler via command
>   line.
> 
> - New patch 6 eliminates a warning with old gcc but looks valid in
>   general.
> 
> - A warning made me look at ip/tcp_metrics.c and I found a minor code
>   simplification (patch 7).
> 
> Phil Sutter (5):
>   tc: m_action: Improve conversion to C99 style initializers
>   Use C99 style initializers everywhere
>   Replace malloc && memset by calloc
>   No need to initialize rtattr fields before parsing
>   Makefile: Allow to override CC
> 
>  Makefile   |   4 +-
>  bridge/fdb.c   |  25 ++--
>  bridge/link.c  |  14 +++
>  bridge/mdb.c   |  17 -
>  bridge/vlan.c  |  17 -
>  genl/ctrl.c|  44 +
>  genl/genl.c|   3 +-
>  ip/ip6tunnel.c |  10 ++---
>  ip/ipaddress.c |  33 +++-
>  ip/ipaddrlabel.c   |  21 --
>  ip/iplink.c|  61 -
>  ip/iplink_can.c|   4 +-
>  ip/ipmaddr.c   |  25 
>  ip/ipmroute.c  |   8 +---
>  ip/ipneigh.c   |  30 ++-
>  ip/ipnetconf.c |  10 ++---
>  ip/ipnetns.c   |  39 +--
>  ip/ipntable.c  |  25 
>  ip/iproute.c   |  78 +
>  ip/iprule.c|  22 +--
>  ip/iptoken.c   |  19 -
>  ip/iptunnel.c  |  31 +--
>  ip/ipxfrm.c|  26 -
>  ip/link_gre.c  |  18 -
>  ip/link_gre6.c |  18 -
>  ip/link_ip6tnl.c   |  25 +---
>  ip/link_iptnl.c|  22 +--
>  ip/link_vti.c  |  18 -
>  ip/link_vti6.c |  18 -
>  ip/xfrm_policy.c   |  99 +++
>  ip/xfrm_state.c| 110 
> ++---
>  lib/libnetlink.c   |  77 ++---
>  lib/ll_map.c   |   1 -
>  lib/names.c|   7 +---
>  misc/arpd.c|  64 ++-
>  misc/lnstat.c  |   6 +--
>  misc/lnstat_util.c |   4 +-
>  misc/ss.c  |  37 +++---
>  tc/e_bpf.c |   7 +---
>  tc/em_canid.c  |   4 +-
>  tc/em_cmp.c|   4 +-
>  tc/em_ipset.c  |   4 +-
>  tc/em_meta.c   |   4 +-
>  tc/em_nbyte.c  |   4 +-
>  tc/em_u32.c|   4 +-
>  tc/f_flow.c|   3 --
>  tc/f_flower.c  |   3 +-
>  tc/f_fw.c  |   6 +--
>  tc/f_route.c   |   3 --
>  tc/f_rsvp.c|   6 +-

Re: [PATCH net-next 0/5] RDS: TCP: Enable mprds for rds-tcp

2016-07-15 Thread David Miller
From: Sowmini Varadhan 
Date: Thu, 14 Jul 2016 03:51:00 -0700

> The third, and final, installment for mprds-tcp changes.
> 
> In Patch 3 of this set, if the transport support t_mp_capable, 
> we hash outgoing traffic across multiple paths.  Additionally, even if 
> the transport is MP capable, we may be peering with some node that does
> not support mprds, or supports a different number of paths. This
> necessitates RDS control plane changes so that both peers agree
> on the number of paths to be used for the rds-tcp connection.
> Patch 3 implements all these changes, which are documented in patch 5
> of the series.
> 
> Patch 1 of this series is a bug fix for a race-condition
> that has always existed, but is now more easily encountered with mprds. 
> Patch 2 is code refactoring. Patches 4 and 5 are Documentation updates.

Series applied, thanks.


Re: [patch v2 -next] wan/fsl_ucc_hdlc: info leak in uhdlc_ioctl()

2016-07-15 Thread David Miller
From: Dan Carpenter 
Date: Thu, 14 Jul 2016 14:16:53 +0300

> There is a 2 byte struct hole after line.loopback so we need to clear
> that out to avoid disclosing stack information.
> 
> Fixes: c19b6d246a35 ('drivers/net: support hdlc function for QE-UCC')
> Signed-off-by: Dan Carpenter 
> ---
> v2: remove the other initialization to zero

Applied.
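
As a user-space illustration of the kind of hole being discussed: the
struct below is a made-up stand-in (line_sketch is not the driver's type,
and the field layout is an assumption), and whether a hole exists depends
on the ABI. The sketch clears the whole object before filling in fields,
so that any padding after the short member does not carry stale stack
contents when the object is later copied out byte for byte.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: on common ABIs a 2 byte hole follows 'loopback'. */
struct line_sketch {
	unsigned int clock_rate;
	unsigned int clock_type;
	unsigned short loopback;
	unsigned int slot_map;
};

/* Zero everything first, then set only the fields that are needed. */
static void fill_line(struct line_sketch *line)
{
	assert(line);
	memset(line, 0, sizeof(*line));
	line->clock_type = 1;	/* arbitrary demo value */
}

/* Count non-zero bytes in the gap between 'loopback' and 'slot_map'. */
static size_t nonzero_hole_bytes(const struct line_sketch *line)
{
	const unsigned char *p = (const unsigned char *)line;
	size_t start = offsetof(struct line_sketch, loopback) +
		       sizeof(line->loopback);
	size_t end = offsetof(struct line_sketch, slot_map);
	size_t i, n = 0;

	for (i = start; i < end; i++)
		if (p[i])
			n++;
	return n;
}
```

Initializing the fields one by one instead would leave the hole holding
whatever was on the stack before, which is the disclosure the patch
avoids.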


Re: [PATCH iproute2] ip route: restore route entries in correct order

2016-07-15 Thread Stephen Hemminger
On Tue, 12 Jul 2016 21:37:58 +0800
Xin Long  wrote:

> Sometimes we cannot restore route entries, because in kernel
>   [1] fib_check_nh()
>   [2] fib_valid_prefsrc()
> cause some routes to depend on existence of others while adding.
> 
> For example, we saved all the routes, and flushed all tables
>   [a] default via 192.168.122.1 dev eth0
>   [b] 192.168.122.0/24 dev eth0 src 192.168.122.21
>   [c] broadcast 127.0.0.0 dev lo table local src 127.0.0.1
>   [d] local 127.0.0.0/8 dev lo table local  src 127.0.0.1
>   [e] local 127.0.0.1 dev lo table local src 127.0.0.1
>   [f] broadcast 127.255.255.255 dev lo table local src 127.0.0.1
>   [g] broadcast 192.168.122.0 dev eth0 table local src 192.168.122.21
>   [h] local 192.168.122.21 dev eth0 table local src 192.168.122.21
>   [i] broadcast 192.168.122.255 dev eth0 table local src 192.168.122.21
> 
>   Now start to restore them:
> If we want to add [a], we have to add [b] first, as [1] and
> 'via 192.168.122.1' in [a].
> If we want to add [b], we have to add [h] first, as [2] and
> 'src 192.168.122.21' in [b].
> 
>   So the correct order to restore should be like:
> [e][h] -> [b][c][d][f][g][i] -> [a]
> 
> This patch fixes it by traversing the file 3 times, it only restores
> part of them in each run according to the following conditions, to
> make sure every entry can be restored successfully.
>   1. !gw && (!fib_prefsrc || fib_prefsrc == cfg->fc_dst)
>   2. !gw && (fib_prefsrc != cfg->fc_dst)
>   3. gw
> 
> Signed-off-by: Xin Long 

Applied, then I changed rtattr_cmp() to have const args.
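
The three restore conditions quoted above amount to sorting each saved
route into one of three passes. A minimal sketch of that classification
follows; struct route_sketch and restore_pass() are hypothetical names,
not the iproute2 code, and addresses are reduced to plain 32-bit values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of one saved route entry. */
struct route_sketch {
	bool has_gw;		/* entry has a 'via' gateway */
	bool has_prefsrc;	/* entry has a 'src' address */
	uint32_t prefsrc;	/* preferred source, if any */
	uint32_t dst;		/* destination prefix address */
};

/*
 * Return the pass (1, 2 or 3) in which the entry should be restored:
 *   1: no gateway, and no prefsrc or prefsrc equal to the destination
 *   2: no gateway, prefsrc different from the destination
 *   3: gateway routes, which depend on the routes added earlier
 */
static int restore_pass(const struct route_sketch *r)
{
	assert(r);
	if (r->has_gw)
		return 3;
	if (!r->has_prefsrc || r->prefsrc == r->dst)
		return 1;
	return 2;
}
```

With the example table from the commit message, [e] and [h] land in pass
1, [b] in pass 2 and the default route [a] in pass 3, matching the order
given there.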


Re: [PATCH v2] Add support for configuring Infiniband GUIDs

2016-07-15 Thread Stephen Hemminger
On Thu,  7 Jul 2016 16:09:03 -0500
Eli Cohen  wrote:

> Add two NLA's that allow configuration of Infiniband node or port GUIDs
> by referencing the IPoIB net device set over the physical function. The
> format to be used is as follows:
> 
> ip link set dev ib0 vf 0 node_guid 00:02:c9:03:00:21:6e:70
> ip link set dev ib0 vf 0 port_guid 00:02:c9:03:00:21:6e:78
> 
> Signed-off-by: Eli Cohen 

Applied, thanks



Re: [PATCH net-next V3] net: ena: Add a driver for Amazon Elastic Network Adapters (ENA)

2016-07-15 Thread David Miller
From: Matt Wilson 
Date: Thu, 14 Jul 2016 09:15:11 -0700

> On Thu, Jul 14, 2016 at 09:08:03AM -0700, Benjamin Poirier wrote:
>> On 2016/07/14 08:22, Matt Wilson wrote:
>> [...]
>> > 
>> > Dave and Benjamin,
>> > 
>> > Do you want to see the interrupt moderation extensions to ethtool and
>> > the sysfs nodes removed before this lands in net-next? Or should
>> > Netanel remove the sysfs bits until we can extend the ethtool
>> > interfaces to cover the parameters that ena uses?
>> 
>> I couldn't say what's acceptable or not. A few other drivers (qlcnic,
>> sfc, ...) already have sysfs tunables. Maybe John, as the new ethtool
>> maintainer, can weigh in too about the changes required to ethtool.
> 
> We definitely want ethtool to handle all the settings, it's just a
> question of when. We also want to address and resolve all the great
> feedback so far, and since you originally raised the point about
> extending ethtool I wanted to see if you have any major objection.

If you add the sysfs stuff you're stuck with it forever, so I definitely
do not want to see that.

You guys should start simple, a basic driver that supports what is
possible with no core kernel changes or non-portable driver private
sysfs knobs.  Only then should you think about adding new things.


Re: [PATCH iproute 0/5] iproute: ila and fou additions

2016-07-15 Thread Stephen Hemminger
On Thu, 14 Jul 2016 12:22:11 -0700
Tom Herbert  wrote:

> Patch set includes:
> 
> - Allow configuring checksum mode for ila LWT (e.g. configure
>   checksum neutral)
> - Configuration for performing ila translations using netfilter hook
> - fou encapsulation for ip6tnl and gre6
> - fou listener for IPv6
> 
> Tom Herbert (5):
>   ila: Support for checksum neutral translation
>   ila: Support for configuring ila to use netfilter hook
>   ip6tnl: Support for fou encapsulation
>   gre6: Support for fou encapsulation
>   fou: Allowing configuring IPv6 listener
> 
>  ip/Makefile   |   2 +-
>  ip/ip.c   |   3 +-
>  ip/ip_common.h|   1 +
>  ip/ipfou.c|   8 +-
>  ip/ipila.c| 259 ++
>  ip/iproute_lwtunnel.c |  57 ++-
>  ip/link_gre6.c| 101 
>  ip/link_ip6tnl.c  |  92 +-
>  8 files changed, 516 insertions(+), 7 deletions(-)
>  create mode 100644 ip/ipila.c
> 

I am okay with the content of these patches, but they have lots of style
issues.

1. Bad indentation in several spots
2. Lines are too long
3. Use rta_getattr_u64, look for places that access RTA_DATA() directly
4. Run checkpatch
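
On the third point, a minimal sketch of why an accessor is preferable to
dereferencing the attribute payload through a cast: netlink attribute
payloads are only guaranteed 4-byte alignment, so a 64-bit load through a
cast pointer can be misaligned on some architectures. As far as I recall,
rta_getattr_u64() in iproute2 copies the value out with memcpy(); the code
below models that idea only. struct attr_hdr, ATTR_DATA(), get_attr_u64()
and put_attr_u64() are hypothetical stand-ins for illustration, not the
real libnetlink definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct rtattr: 4-byte header, then payload. */
struct attr_hdr {
	uint16_t len;	/* header plus payload length in bytes */
	uint16_t type;
};

/* Payload starts right after the header, so it is only 4-byte aligned. */
#define ATTR_DATA(a) ((void *)((char *)(a) + sizeof(struct attr_hdr)))

/* Copy the payload out instead of casting, so alignment does not matter. */
static inline uint64_t get_attr_u64(const struct attr_hdr *a)
{
	uint64_t v;

	memcpy(&v, ATTR_DATA(a), sizeof(v));
	return v;
}

/* Build an attribute carrying a 64-bit value in buf (at least 12 bytes). */
static inline struct attr_hdr *put_attr_u64(void *buf, uint16_t type,
					    uint64_t v)
{
	struct attr_hdr *a = buf;

	assert(buf != NULL);
	a->len = sizeof(*a) + sizeof(v);
	a->type = type;
	memcpy(ATTR_DATA(a), &v, sizeof(v));
	return a;
}
```

The pattern to look for in the patches would be something like
*(__u64 *)RTA_DATA(attr); each such access is a candidate for the accessor.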



Re: [PATCH v8 04/11] net/mlx4_en: add support for fast rx drop bpf program

2016-07-15 Thread Tom Herbert
On Thu, Jul 14, 2016 at 12:25 AM, Jesper Dangaard Brouer
 wrote:
>
> I would really really like to see the XDP program associated with the
> RX ring queues, instead of a single XDP program covering the entire NIC.
> (Just move the bpf_prog pointer to struct mlx4_en_rx_ring)
>
I think it would be helpful to have some concrete implementation to
look at for this. Jesper, can you code up some patches, taking into
account Alexei's concerns about the atomic program update problem?

Thanks,
Tom


Re: [PATCH] dt: bindings: Add a generic ethernet device binding

2016-07-15 Thread David Miller
From: Hans de Goede 
Date: Fri, 15 Jul 2016 08:40:00 +0200

> Hi,
> 
> On 15-07-16 01:17, David Miller wrote:
>> From: Hans de Goede 
>> Date: Wed, 13 Jul 2016 12:20:04 +0200
>>
>>> On some boards (android tablets) different batches use different sdio
>>> wifi modules. This is not a problem since sdio is a discoverable bus,
>>> so we only need to describe and activate the mmc controller in dt and
>>> then the kernel will automatically load the right driver.
>>>
>>> But sometimes it is useful to specify certain ethernet properties for
>>> these "unknown" sdio devices, specifically we want the boot-loader
>>> to be able to set "local-mac-address" as some of these sdio wifi
>>> modules come without an eeprom / without a factory programmed mac
>>> address.
>>>
>>> Since the exact device is unknown (differs per batch) we cannot use
>>> a wifi-chip specific compatible. This commit adds a new
>>> "generic,ethernet" binding for use in dt-nodes describing such an
>>> unknown ethernet device.
>>>
>>> Cc: Maxime Ripard 
>>> Signed-off-by: Hans de Goede 
>>
>> Precedent exists for a "system ethernet address" as far back as the
>> original sparc device tree implementation, so please just specify it
>> that way rather than trying to force having to make an alias or
>> reference to it from a specific device.
> 
> Some boards where this is applicable have both a wired and a wireless
> ethernet, so one global setting will not work.

Then call it "eth:local-mac-address" and "wifi:local-mac-address"


Re: [PATCH v8 04/11] net/mlx4_en: add support for fast rx drop bpf program

2016-07-15 Thread Tom Herbert
On Fri, Jul 15, 2016 at 9:47 AM, Alexei Starovoitov
 wrote:
> On Fri, Jul 15, 2016 at 09:18:13AM -0700, Tom Herbert wrote:
>> > attaching program to all rings at once is a fundamental part for correct
>> > operation. As was pointed out in the past the bpf_prog pointer
>> > in the ring design loses atomicity of the update. While the new program is
>> > being attached the old program is still running on other rings.
>> > That is not something user space can compensate for.
>> > So for current 'one prog for all rings' we cannot do what you're suggesting,
>> > yet it doesn't mean we won't do prog per ring tomorrow. To do that the other
>> > aspects need to be agreed upon before we jump into implementation:
>> > - what is the way for the program to know which ring it's running on?
>> >   if there is no such way, then attaching the same prog to multiple
>> >   ring is meaningless.
>>
>> Why would it need to know? If the user can say run this program on
>> this ring that should be sufficient.
>
> and the program would have to be recompiled with #define for every ring?

Why would we need to recompile? We should be able to run the same
program on different rings, this is just a matter of associating each
ring with a program.

>> We already have this problem with other per ring configuration.
>
> not really. without atomicity of the program change, the user space
> daemon that controls it will struggle to adjust. Consider the case
> where we're pushing new update for loadbalancer. In such case we
> want to reuse the established bpf map, since we cannot atomically
> move it from old to new, but we want to swap the program that uses
> in one go, otherwise two different programs will be accessing
> the same map. Technically it's valid, but difference in the programs
> may cause issues. Lack of atomicity is not intractable problem,
> it just makes user space quite a bit more complex for no reason.
>

I'm really missing why having a program pointer per ring could be so
complicated. This should just be a matter of maintaining a pointer to the
BPF program in each RX queue. If we want to latch together all
the rings to run the same program then just have an API that does
that-- walk all the queues and set the pointer to the program.  If
necessary this can be done atomically by taking the device down for
the operation.
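
A rough userspace model of what is being proposed here, not driver code:
each RX ring holds its own program pointer, and a helper walks all rings
to attach one program everywhere. All names (struct xdp_prog, struct
rx_ring, struct nic, ring_set_prog(), ring_get_prog(), nic_set_prog_all())
are made up for illustration. Note that each individual store is atomic
but the walk as a whole is not, which is the concern raised earlier in
the thread; the suggestion above is to quiesce the device around the walk
when that matters.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a loaded BPF program. */
struct xdp_prog {
	int id;
};

/* Each RX ring carries its own program pointer. */
struct rx_ring {
	_Atomic(struct xdp_prog *) prog;
};

struct nic {
	struct rx_ring *rings;
	size_t nr_rings;
};

/* Attach a program to a single ring; the pointer swap itself is atomic. */
static void ring_set_prog(struct rx_ring *ring, struct xdp_prog *prog)
{
	atomic_store(&ring->prog, prog);
}

static struct xdp_prog *ring_get_prog(struct rx_ring *ring)
{
	return atomic_load(&ring->prog);
}

/*
 * Attach the same program to every ring. Rings not yet visited still
 * run the old program while the loop is in progress.
 */
static void nic_set_prog_all(struct nic *nic, struct xdp_prog *prog)
{
	assert(nic != NULL);
	for (size_t i = 0; i < nic->nr_rings; i++)
		ring_set_prog(&nic->rings[i], prog);
}
```

With this shape, "one program for the whole device" is just the special
case of calling nic_set_prog_all(), and a per-ring override is a single
ring_set_prog() call.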

To me, an XDP program is just another attribute of an RX queue, it's
really not special! We already have a very good infrastructure for
managing multiqueue and pretty much everything in the receive path
operates at the queue level not the device level-- we should follow
that model.

Tom


Re: [PATCH] rndis_host: Set random MAC for ZTE MF910

2016-07-15 Thread Bjørn Mork
David Laight  writes:
> From: Bjørn Mork
>> Sent: 13 July 2016 23:23
> ...
>> Or how about the more generic?:
>> 
>> if (bp[0] & 0x02)
>> 	eth_hw_addr_random(net);
>> else
>> 	ether_addr_copy(net->dev_addr, bp);
>> 
>> That would catch similar screwups from other vendors too.
>
> Not really, that disables 'locally administered' addresses.

... when the 'locally administered' addresses come from firmware, yes.
That was the idea.  We are better off using our own random locally
administered address if some vendor has been cheap/stupid enough to
program that into firmware.

The administrator is of course still free to set any address, 'locally
administered' or whatever.  This is not the question here.

> If a vendor has used the same address on lots of cards it could easily
> be a 'real' address.

Sure.  We cannot easily detect that.  The only way is to keep a
blacklist of such  'real' addresses, the way Kristian initially
proposed.

But I thought that we could simplify this particular screwup since the
address in question had the local bit set, and catch every other similar
abuse at the same time. If you get the local bit from firmware, then you
know for sure that there is something wrong.
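
To make the bit test concrete, here is a small standalone sketch. Bit 1
(0x02) of the first octet is the "locally administered" bit, and bit 0
(0x01) is the group/multicast bit. MAC_LEN, mac_is_local(),
mac_is_multicast() and fw_mac_usable() are illustrative names only; if I
remember correctly the kernel has is_local_ether_addr() for the first
check. The policy in fw_mac_usable() is just the proposal from this
thread, not what any driver is confirmed to do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAC_LEN 6

/* Bit 1 of the first octet marks a locally administered address. */
static bool mac_is_local(const uint8_t addr[MAC_LEN])
{
	assert(addr != NULL);
	return (addr[0] & 0x02) != 0;
}

/* Bit 0 of the first octet marks a group (multicast) address. */
static bool mac_is_multicast(const uint8_t addr[MAC_LEN])
{
	assert(addr != NULL);
	return (addr[0] & 0x01) != 0;
}

/*
 * Proposed policy: an address read from firmware is only used if it is
 * a unicast address without the local bit; otherwise the caller would
 * fall back to a random address.
 */
static bool fw_mac_usable(const uint8_t addr[MAC_LEN])
{
	return !mac_is_local(addr) && !mac_is_multicast(addr);
}
```

As pointed out in the quoted objection, this does not catch a vendor that
reuses one globally unique address on every device; that case still needs
an explicit list.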

> Not only that, there certainly used to be manufacturers that used 'locally
> administered' addresses on all their cards (as well as those that used
> unallocated address blocks).

Sure. But is there any reason to care about those addresses?

> Not to mention the bit-reversed addresses

Listing all the ways vendors have screwed up is going to be a long and
rather boring thread ;)


Bjørn


Re: [PATCH net-next 2/4] net: bridge: rearrange flood vs unicast receive paths

2016-07-15 Thread Nikolay Aleksandrov

> On Jul 15, 2016, at 10:35 AM, Cong Wang  wrote:
> 
> On Wed, Jul 13, 2016 at 8:10 PM, Nikolay Aleksandrov
>  wrote:
>> This patch removes one conditional from the unicast path by using the fact
>> that skb is NULL only when the packet is multicast or is local.
>> 
>> Signed-off-by: Nikolay Aleksandrov 
>> ---
>> net/bridge/br_input.c | 29 ++---
>> 1 file changed, 14 insertions(+), 15 deletions(-)
>> 
>> diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
>> index 0b6d32619468..c20c5be6fc22 100644
>> --- a/net/bridge/br_input.c
>> +++ b/net/bridge/br_input.c
>> @@ -134,10 +134,10 @@ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb
>>struct net_bridge_port *p = br_port_get_rcu(skb->dev);
>>const unsigned char *dest = eth_hdr(skb)->h_dest;
>>struct net_bridge_fdb_entry *dst = NULL;
>> +   bool mcast_hit = false, unicast = true;
>>struct net_bridge_mdb_entry *mdst;
>>struct net_bridge *br;
>>struct sk_buff *skb2;
>> -   bool unicast = true;
>>u16 vid = 0;
>> 
>>if (!p || p->state == BR_STATE_DISABLED)
>> @@ -177,30 +177,29 @@ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb
>>if ((mdst || BR_INPUT_SKB_CB_MROUTERS_ONLY(skb)) &&
>>br_multicast_querier_exists(br, eth_hdr(skb))) {
>>if ((mdst && mdst->mglist) ||
>> -   br_multicast_is_router(br))
>> +   br_multicast_is_router(br)) {
>>skb2 = skb;
>> -   br_multicast_forward(mdst, skb, skb2);
>> -   skb = NULL;
>> -   if (!skb2)
>> -   goto out;
>> +   br->dev->stats.multicast++;
>> +   }
>> +   mcast_hit = true;
>>} else {
>>skb2 = skb;
>> +   br->dev->stats.multicast++;
>>}
>>unicast = false;
>> -   br->dev->stats.multicast++;
> 
> 
> Looks like you change the unconditional increment of this counter,
> is this intended?

Oops +CC all, mixed the reply list.

Yes, this counter must increment only when the packet is going to be locally
received. As you can see, before this change there was an unconditional jump
if the packet was not going to be locally received (if !skb2, goto).
This is exactly one of the confusing cases I'm trying to avoid with these
skb0/skb2 variables.

Cheers,
Nik



Re: [PATCH net-next 2/4] net: bridge: rearrange flood vs unicast receive paths

2016-07-15 Thread Cong Wang
On Wed, Jul 13, 2016 at 8:10 PM, Nikolay Aleksandrov
 wrote:
> This patch removes one conditional from the unicast path by using the fact
> that skb is NULL only when the packet is multicast or is local.
>
> Signed-off-by: Nikolay Aleksandrov 
> ---
>  net/bridge/br_input.c | 29 ++---
>  1 file changed, 14 insertions(+), 15 deletions(-)
>
> diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
> index 0b6d32619468..c20c5be6fc22 100644
> --- a/net/bridge/br_input.c
> +++ b/net/bridge/br_input.c
> @@ -134,10 +134,10 @@ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb
> struct net_bridge_port *p = br_port_get_rcu(skb->dev);
> const unsigned char *dest = eth_hdr(skb)->h_dest;
> struct net_bridge_fdb_entry *dst = NULL;
> +   bool mcast_hit = false, unicast = true;
> struct net_bridge_mdb_entry *mdst;
> struct net_bridge *br;
> struct sk_buff *skb2;
> -   bool unicast = true;
> u16 vid = 0;
>
> if (!p || p->state == BR_STATE_DISABLED)
> @@ -177,30 +177,29 @@ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb
> if ((mdst || BR_INPUT_SKB_CB_MROUTERS_ONLY(skb)) &&
> br_multicast_querier_exists(br, eth_hdr(skb))) {
> if ((mdst && mdst->mglist) ||
> -   br_multicast_is_router(br))
> +   br_multicast_is_router(br)) {
> skb2 = skb;
> -   br_multicast_forward(mdst, skb, skb2);
> -   skb = NULL;
> -   if (!skb2)
> -   goto out;
> +   br->dev->stats.multicast++;
> +   }
> +   mcast_hit = true;
> } else {
> skb2 = skb;
> +   br->dev->stats.multicast++;
> }
> unicast = false;
> -   br->dev->stats.multicast++;


Looks like you change the unconditional increment of this counter,
is this intended?


Re: [RFC PATCH v2 08/10] net: sched: pfifo_fast use alf_queue

2016-07-15 Thread John Fastabend
On 16-07-15 03:09 AM, Jesper Dangaard Brouer wrote:
> On Thu, 14 Jul 2016 17:09:43 -0700
> John Fastabend  wrote:
> 
>>>>  static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
>>>>  struct sk_buff **to_free)
>>>>  {
>>>> -  if (skb_queue_len(&qdisc->q) < qdisc_dev(qdisc)->tx_queue_len) {
>>>> -  int band = prio2band[skb->priority & TC_PRIO_MAX];
>>>> -  struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
>>>> -  struct sk_buff_head *list = band2list(priv, band);
>>>> -
>>>> -  priv->bitmap |= (1 << band);
>>>> -  qdisc->q.qlen++;
>>>> -  return __qdisc_enqueue_tail(skb, qdisc, list);
>>>> -  }
>>>> +  int band = prio2band[skb->priority & TC_PRIO_MAX];
>>>> +  struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
>>>> +  struct skb_array *q = band2list(priv, band);
>>>> +  int err;
>>>>
>>>> -  return qdisc_drop(skb, qdisc, to_free);
>>>> +  err = skb_array_produce_bh(q, skb);
>>>
>>> Do you need the _bh variant here?  (Doesn't the qdisc run with BH disabled?)
>>>
>>>   
>>
>> Yep it's inside rcu_read_lock_bh().
> 
> The call rcu_read_lock_bh() already disabled BH (local_bh_disable()).
> Thus, you can use the normal variants of skb_array_produce(), it is
> (approx 20 cycles) faster than the _bh variant...
> 

hah I was agreeing with you as in yep no need for the _bh variant :)
I must have been low on coffee or something when I wrote that response
because when I read it now it sounds like I really think the _bh is
needed.

At any rate _bh removed thanks!

