Re: [PATCH 0/6] eBPF JIT for PPC64

2016-06-12 Thread Naveen N. Rao
On 2016/06/10 10:47PM, David Miller wrote:
> From: "Naveen N. Rao" 
> Date: Tue,  7 Jun 2016 19:02:17 +0530
> 
> > Please note that patch [2] is a pre-requisite for this patchset, and is
> > not yet upstream.
>  ...
> > [1] http://thread.gmane.org/gmane.linux.kernel/2188694
> > [2] http://thread.gmane.org/gmane.linux.ports.ppc.embedded/96514
> 
> Because of #2 I don't think I can take this directly into the networking
> tree, right?
> 
> Therefore, how would you like this to be merged?

Hi David,
Thanks for asking. Yes, I think it is better to take this through the 
powerpc tree as all the changes are contained within arch/powerpc, 
unless Michael Ellerman feels differently.

Michael?


Regards,
Naveen



Re: [PATCH ipvs-next] ipvs: count pre-established TCP states as active

2016-06-12 Thread Simon Horman
On Sun, Jun 12, 2016 at 06:27:39PM +0300, Julian Anastasov wrote:
> 
>   Hello,
> 
> On Fri, 3 Jun 2016, Michal Kubecek wrote:
> 
> > Some users observed that "least connection" distribution algorithm doesn't
> > handle well bursts of TCP connections from reconnecting clients after
> > a node or network failure.
> > 
> > This is because the algorithm counts active connection as worth 256
> > inactive ones where for TCP, "active" only means TCP connections in
> > ESTABLISHED state. In case of a connection burst, new connections are
> > handled before previous ones have finished the three way handshaking so
> > that all are still counted as "inactive", i.e. cheap ones. The become
> > "active" quickly but at that time, all of them are already assigned to one
> > real server (or few), resulting in highly unbalanced distribution.
> > 
> > Address this by counting the "pre-established" states as "active".
> > 
> > Signed-off-by: Michal Kubecek 
> 
> Acked-by: Julian Anastasov 
> 
>   Simon, please apply!

Thanks, done.


[PATCH net] net_sched: prio: rollback allocations if prio_init() fails

2016-06-12 Thread Eric Dumazet
From: Eric Dumazet 

Now prio_init() can return -ENOMEM, it also has to make sure
any allocated qdisc are freed, since the caller (qdisc_create()) wont
call ->destroy() handler for us.

Fixes: cbdf45116478 ("net_sched: prio: properly report out of memory errors")
Signed-off-by: Eric Dumazet 
Reported-by: Cong Wang 
---
 net/sched/sch_prio.c |   14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
index 071718bccdab..9b703f2c921b 100644
--- a/net/sched/sch_prio.c
+++ b/net/sched/sch_prio.c
@@ -221,20 +221,18 @@ static int prio_tune(struct Qdisc *sch, struct nlattr 
*opt)
 static int prio_init(struct Qdisc *sch, struct nlattr *opt)
 {
struct prio_sched_data *q = qdisc_priv(sch);
-   int i;
+   int err, i;
 
for (i = 0; i < TCQ_PRIO_BANDS; i++)
q->queues[i] = _qdisc;
 
-   if (opt == NULL) {
+   if (!opt)
return -EINVAL;
-   } else {
-   int err;
 
-   if ((err = prio_tune(sch, opt)) != 0)
-   return err;
-   }
-   return 0;
+   err = prio_tune(sch, opt);
+   if (err)
+   prio_destroy(sch);
+   return err;
 }
 
 static int prio_dump(struct Qdisc *sch, struct sk_buff *skb)




Re: [PATCH net] net_sched: prio: properly report out of memory errors

2016-06-12 Thread Eric Dumazet
On Sun, 2016-06-12 at 20:45 -0700, Cong Wang wrote:
> On Sun, Jun 12, 2016 at 4:21 PM, Eric Dumazet  wrote:
> > +   struct Qdisc *child;
> > +
> > +   if (q->queues[i] != _qdisc)
> > +   continue;
> > +
> > +   child = qdisc_create_dflt(sch->dev_queue, _qdisc_ops,
> > + TC_H_MAKE(sch->handle, i + 1));
> > +   if (!child)
> > +   return -ENOMEM;
> 
> Since this is inside a loop, shouldn't we kfree the previous child
> creations when we fail?

You're right.

prio_init() needs to do the cleanup, as prio_destroy() wont be called
from qdisc_create()

I am testing a fix with fault injection.

Thanks.




Re: [PATCH net] net_sched: fix pfifo_head_drop behavior vs backlog

2016-06-12 Thread Cong Wang
On Sun, Jun 12, 2016 at 8:01 PM, Eric Dumazet  wrote:
> From: Eric Dumazet 
>
> When the qdisc is full, we drop a packet at the head of the queue,
> queue the current skb and return NET_XMIT_CN
>
> Now we track backlog on upper qdiscs, we need to call
> qdisc_tree_reduce_backlog(), even if the qlen did not change.
>
> Fixes: 2f5fb43f ("net_sched: update hierarchical backlog too")
> Signed-off-by: Eric Dumazet 
> Cc: WANG Cong 
> Cc: Jamal Hadi Salim 


Acked-by: Cong Wang 


Re: [PATCH net] net_sched: prio: properly report out of memory errors

2016-06-12 Thread Cong Wang
On Sun, Jun 12, 2016 at 4:21 PM, Eric Dumazet  wrote:
> +   struct Qdisc *child;
> +
> +   if (q->queues[i] != _qdisc)
> +   continue;
> +
> +   child = qdisc_create_dflt(sch->dev_queue, _qdisc_ops,
> + TC_H_MAKE(sch->handle, i + 1));
> +   if (!child)
> +   return -ENOMEM;

Since this is inside a loop, shouldn't we kfree the previous child
creations when we fail?



> +   sch_tree_lock(sch);
> +   q->queues[i] = child;
> +   sch_tree_unlock(sch);
> }
> return 0;
>  }
>
>


Re: [PATCH] netfilter/nflog: nflog-range does not truncate packets

2016-06-12 Thread Vishwanath Pai
On 06/09/2016 01:57 PM, Vishwanath Pai wrote:
> On 06/08/2016 08:16 AM, Pablo Neira Ayuso wrote:
>> Looking again at your code:
>>
>> case NFULNL_COPY_PACKET:
>> -   if (inst->copy_range > skb->len)
>> +   data_len = inst->copy_range;
>> +   if (li->u.ulog.copy_len < data_len)
>> +   data_len = li->u.ulog.copy_len;
>>
>> data_len is set to instance's copy_range.
>>
>> But then, if the NFLOG rule indicates smaller copy_len, you use this
>> value. So to my understanding, NFLOG rule prevails over instance's
>> copy_range in what matters, which is to shrink the copy range.
>>
 --nflog-range will not override the per-instance default,
 the only time it would get preference is when its value is lesser than
 the per-instance value. If copy_range is lesser than --nflog-range then
 we retain copy_range.

 So basically what we are doing is min(copy_range, nflog-range).
 Just wanted to clarify this, if this is not how it's meant to be
 please let me know.

 Also, there is a bug in my patch, li->u.ulog.copy_len can be set to "0"
 from userspace (if --nflog-range is not specified), so we have to check
 for this condition before using the value. I will send a V2 of the patch
 based on your reply.
>> Currently, li->u.ulog.copy_len is set to "0" by default when not
>> specified.
>>
>> But copy_len = 0 is a valid possibility, so this looks a bit more
>> tricky to me to fix since we may need to get flags here to know when
>> this is set.
>>
>> Probably something like:
>>
>> if (li->flags & NF_LOG_F_COPY_RANGE)
>> data_len = li->u.ulog.copy_len;
>> /* Per-instance copy range prevails over global per-rule option. */
>> if (data_len < inst->copy_range)
>> data_len = inst->copy_range;
>> if (data_len > skb->len)
>> data_len = skb->len;
>>
>> Although this would require a bit more code to introduce these flags.
>>
>> You will also need a new flag for xt_NFLOG:
>>
>> #define XT_NFLOG_COPY_LEN   0x2
>>
>> it seems other XT_NFLOG_* flags were already in place.
>>
>> Interesting that nobody noticed this for so long BTW.
> 
> I tried this out, I added two flags: one for XT_NFLOG to notify the
> kernel when --nflog-range is set by the user, and another flag for
> nfnetlink_log to pass this on to nfulnl_log_packet. This design works
> fine but while testing this I found a problem.
> 
> Lets say --nflog-range is set to 200, and the instance copy_range is set
> to 100. According to the code above the final value of data_len will be
> 200 so we try to pack 200 bytes into the skb. But somewhere between
> nfnetlink_log to dumpcap the packet is getting truncated and dumpcap
> doesn't like this:
> 
> $> dumpcap -y NFLOG -i nflog:5 -s 100
> Capturing on 'nflog:5'
> File: /tmp/wireshark_pcapng_nflog-5_20160609133531_pi6MrS
> dumpcap: Error while capturing packets: Message truncated: (got: 228)
> (nlmsg_len: 320)
> Please report this to the Wireshark developers.
> https://bugs.wireshark.org/
> (This is not a crash; please do not report it as such.)
> Packets captured: 0
> Packets received/dropped on interface 'nflog:5': 0/0
> (pcap:0/dumpcap:0/flushed:0/ps_ifdrop:0) (0.0%)
> 
> I'm trying to figure out where the packet is getting truncated.
> 

I found where the problem is. This is the userspace code for libpcap:

do {
len = recv(handle->fd, handle->buffer, handle->bufsize, 0);
if (handle->break_loop) {
handle->break_loop = 0;
return -2;
}
} while ((len == -1) && (errno == EINTR));

   ...

buf = handle->buffer;
while (len >= NLMSG_SPACE(0)) {
const struct nlmsghdr *nlh = (const struct nlmsghdr *) buf;
u_int32_t msg_len;
nftype_t type = OTHER;

if (nlh->nlmsg_len < sizeof(struct nlmsghdr) || len <
nlh->nlmsg_len) {
snprintf(handle->errbuf, PCAP_ERRBUF_SIZE,
"Message truncated: (got: %d) (nlmsg_len: %u)", len, nlh->nlmsg_len);
return -1;
}

handle->bufsize is only big enough to accommodate the snaplen specified
by the user in dumpcap. So if we send more data than that then we will
break userspace. One way around this is to send min(li->u.ulog.copy_len,
inst->copy_range). If this is OK then I can send a patch for this,
please suggest.




Re: [PATCH v4] udp reuseport: fix packet of same flow hashed to different socket

2016-06-12 Thread Eric Dumazet
On Mon, 2016-06-13 at 11:02 +0800, Su Xuemin wrote:
> From: "Su, Xuemin" 
> 
> There is a corner case in which udp packets belonging to a same
> flow are hashed to different socket when hslot->count changes from 10
> to 11:
...
> Signed-off-by: Su, Xuemin 
> Signed-off-by: Eric Dumazet 
> ---
> I use this tree to generate the patch:
>   git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> 
>  net/ipv4/udp.c | 73 
> +-
>  net/ipv6/udp.c | 71 +---
>  2 files changed, 32 insertions(+), 112 deletions(-)

Very nice simplification of UDP stack, thanks a lot for finalizing this.





[PATCH v4] udp reuseport: fix packet of same flow hashed to different socket

2016-06-12 Thread Su Xuemin
From: "Su, Xuemin" 

There is a corner case in which udp packets belonging to a same
flow are hashed to different socket when hslot->count changes from 10
to 11:

1) When hslot->count <= 10, __udp_lib_lookup() searches udp_table->hash,
and always passes 'daddr' to udp_ehashfn().

2) When hslot->count > 10, __udp_lib_lookup() searches udp_table->hash2,
but may pass 'INADDR_ANY' to udp_ehashfn() if the sockets are bound to
INADDR_ANY instead of some specific addr.

That means when hslot->count changes from 10 to 11, the hash calculated by
udp_ehashfn() is also changed, and the udp packets belonging to a same
flow will be hashed to different socket.

This is easily reproduced:
1) Create 10 udp sockets and bind all of them to 0.0.0.0:4.
2) From the same host send udp packets to 127.0.0.1:4, record the
socket index which receives the packets.
3) Create 1 more udp socket and bind it to 0.0.0.0:44096. The number 44096
is 4 + UDP_HASH_SIZE(4096), this makes the new socket put into the
same hslot as the aformentioned 10 sockets, and makes the hslot->count
change from 10 to 11.
4) From the same host send udp packets to 127.0.0.1:4, and the socket
index which receives the packets will be different from the one received
in step 2.
This should not happen as the socket bound to 0.0.0.0:44096 should not
change the behavior of the sockets bound to 0.0.0.0:4.

It's the same case for IPv6, and this patch also fixes that.

Signed-off-by: Su, Xuemin 
Signed-off-by: Eric Dumazet 
---
I use this tree to generate the patch:
  git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

 net/ipv4/udp.c | 73 +-
 net/ipv6/udp.c | 71 +---
 2 files changed, 32 insertions(+), 112 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 0ff31d9..55ec77c 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -391,9 +391,9 @@ int udp_v4_get_port(struct sock *sk, unsigned short snum)
return udp_lib_get_port(sk, snum, ipv4_rcv_saddr_equal, hash2_nulladdr);
 }
 
-static inline int compute_score(struct sock *sk, struct net *net,
-   __be32 saddr, unsigned short hnum, __be16 sport,
-   __be32 daddr, __be16 dport, int dif)
+static int compute_score(struct sock *sk, struct net *net,
+__be32 saddr, __be16 sport,
+__be32 daddr, unsigned short hnum, int dif)
 {
int score;
struct inet_sock *inet;
@@ -434,52 +434,6 @@ static inline int compute_score(struct sock *sk, struct 
net *net,
return score;
 }
 
-/*
- * In this second variant, we check (daddr, dport) matches (inet_rcv_sadd, 
inet_num)
- */
-static inline int compute_score2(struct sock *sk, struct net *net,
-__be32 saddr, __be16 sport,
-__be32 daddr, unsigned int hnum, int dif)
-{
-   int score;
-   struct inet_sock *inet;
-
-   if (!net_eq(sock_net(sk), net) ||
-   ipv6_only_sock(sk))
-   return -1;
-
-   inet = inet_sk(sk);
-
-   if (inet->inet_rcv_saddr != daddr ||
-   inet->inet_num != hnum)
-   return -1;
-
-   score = (sk->sk_family == PF_INET) ? 2 : 1;
-
-   if (inet->inet_daddr) {
-   if (inet->inet_daddr != saddr)
-   return -1;
-   score += 4;
-   }
-
-   if (inet->inet_dport) {
-   if (inet->inet_dport != sport)
-   return -1;
-   score += 4;
-   }
-
-   if (sk->sk_bound_dev_if) {
-   if (sk->sk_bound_dev_if != dif)
-   return -1;
-   score += 4;
-   }
-
-   if (sk->sk_incoming_cpu == raw_smp_processor_id())
-   score++;
-
-   return score;
-}
-
 static u32 udp_ehashfn(const struct net *net, const __be32 laddr,
   const __u16 lport, const __be32 faddr,
   const __be16 fport)
@@ -492,11 +446,11 @@ static u32 udp_ehashfn(const struct net *net, const 
__be32 laddr,
  udp_ehash_secret + net_hash_mix(net));
 }
 
-/* called with read_rcu_lock() */
+/* called with rcu_read_lock() */
 static struct sock *udp4_lib_lookup2(struct net *net,
__be32 saddr, __be16 sport,
__be32 daddr, unsigned int hnum, int dif,
-   struct udp_hslot *hslot2, unsigned int slot2,
+   struct udp_hslot *hslot2,
struct sk_buff *skb)
 {
struct sock *sk, *result;
@@ -506,7 +460,7 @@ static struct sock *udp4_lib_lookup2(struct net *net,
result = NULL;
badness = 0;
udp_portaddr_for_each_entry_rcu(sk, >head) {
-   score = compute_score2(sk, net, saddr, sport,
+   score = compute_score(sk, net, 

[PATCH net] net_sched: fix pfifo_head_drop behavior vs backlog

2016-06-12 Thread Eric Dumazet
From: Eric Dumazet 

When the qdisc is full, we drop a packet at the head of the queue,
queue the current skb and return NET_XMIT_CN

Now we track backlog on upper qdiscs, we need to call
qdisc_tree_reduce_backlog(), even if the qlen did not change.

Fixes: 2f5fb43f ("net_sched: update hierarchical backlog too")
Signed-off-by: Eric Dumazet 
Cc: WANG Cong 
Cc: Jamal Hadi Salim 
---
 net/sched/sch_fifo.c |4 
 1 file changed, 4 insertions(+)

diff --git a/net/sched/sch_fifo.c b/net/sched/sch_fifo.c
index 2177eac0a61e..2e4bd2c0a50c 100644
--- a/net/sched/sch_fifo.c
+++ b/net/sched/sch_fifo.c
@@ -37,14 +37,18 @@ static int pfifo_enqueue(struct sk_buff *skb, struct Qdisc 
*sch)
 
 static int pfifo_tail_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 {
+   unsigned int prev_backlog;
+
if (likely(skb_queue_len(>q) < sch->limit))
return qdisc_enqueue_tail(skb, sch);
 
+   prev_backlog = sch->qstats.backlog;
/* queue full, remove one skb to fulfill the limit */
__qdisc_queue_drop_head(sch, >q);
qdisc_qstats_drop(sch);
qdisc_enqueue_tail(skb, sch);
 
+   qdisc_tree_reduce_backlog(sch, 0, prev_backlog - sch->qstats.backlog);
return NET_XMIT_CN;
 }
 




udp failures traced to e858fae2b0b8

2016-06-12 Thread David Ahern

Mike:

UDP tests in my vrf unit test suite are failing. git bisect points to:


dsa@kenny:~/kernel-2.git$ git bisect good
e858fae2b0b8f41f0bed2cdffde25e7c97da38a7 is the first bad commit
commit e858fae2b0b8f41f0bed2cdffde25e7c97da38a7
Author: Mike Rapoport 
Date:   Wed Jun 8 16:09:21 2016 +0300

virtio_net: use common code for virtio_net_hdr and skb GSO conversion

Replace open coded conversion between virtio_net_hdr to skb GSO 
info with

virtio_net_hdr_{from,to}_skb

Signed-off-by: Mike Rapoport 
Signed-off-by: David S. Miller 

:04 04 7e53f434e2d3a50ac685ce153dd92fc1a207de0e 
470864c731ae7dd680f9807d2a1a359c715a4270 M  drivers



On console I get the following:

[   80.414134] skbuff: bad partial csum: csum=34/6 len=40
[   80.416105] eth1: bad gso: type: 0, size: 0


Nothing special on the setup -- shell scripts and qemu commands that 
have worked fine for years. Host OS is debian/jessie:


dsa@kenny:dsa-ns:~$ /usr/bin/qemu-system-x86_64 --version
QEMU emulator version 2.1.2 (Debian 1:2.1+dfsg-12+deb8u6), Copyright (c) 
2003-2008 Fabrice Bellard


Networking arg:

-netdev 
type=tap,vhost=on,ifname=vm01-eth1,script=no,downscript=no,id=netdev1

-device virtio-net-pci,mac=02:e0:f9:1c:b9:74,netdev=netdev1,romfile=

tap device is connected to a bridge.

David


Re: [RFC] Handle error writing UINT_MAX to u32 fields

2016-06-12 Thread subashab

The suggested change would extend the usable range of positive numbers
by one bit only. As many systems are 64 bit this does not seem forward
looking.

I would prefer to have a routine that can handle 64 bit integers with
limits (let's call it proc_doint64vec_minmax) which uses fields extra1
and extra2 of ctl_table as min and max.

Then set xfrm_table[].extra1 = 0 and xfrm_table[].extra2 = UINT_MAX if
you need a result in the u32 range.



Thanks Heinrich. Do you think we can use proc_doulongvec_minmax for 
this?


Re: [very-RFC 6/8] Add TSN event-tracing

2016-06-12 Thread Steven Rostedt
On Sun, 12 Jun 2016 23:25:10 +0200
Henrik Austad  wrote:

> > > +#include 
> > > +#include 
> > > +/* #include  */
> > > +
> > > +/* FIXME: update to TRACE_CLASS to reduce overhead */  
> > 
> > I'm curious to why I didn't do this now. A class would make less
> > duplication of typing too ;-)  
> 
> Yeah, I found this in a really great article written by some tracing-dude, 
> I hear he talks really, really fast!

I plead the 5th!

> 
> https://lwn.net/Articles/381064/
> 
> > > +TRACE_EVENT(tsn_buffer_write,
> > > +
> > > + TP_PROTO(struct tsn_link *link,
> > > + size_t bytes),
> > > +
> > > + TP_ARGS(link, bytes),
> > > +
> > > + TP_STRUCT__entry(
> > > + __field(u64, stream_id)
> > > + __field(size_t, size)
> > > + __field(size_t, bsize)
> > > + __field(size_t, size_left)
> > > + __field(void *, buffer)
> > > + __field(void *, head)
> > > + __field(void *, tail)
> > > + __field(void *, end)
> > > + ),
> > > +
> > > + TP_fast_assign(
> > > + __entry->stream_id = link->stream_id;
> > > + __entry->size = bytes;
> > > + __entry->bsize = link->used_buffer_size;
> > > + __entry->size_left = (link->head - link->tail) % 
> > > link->used_buffer_size;  
> > 
> > Move this logic into the print statement, since you save head and tail.  
> 
> Ok, any particular reason?

Because it removes calculations during the trace. The calculations done
in TP_printk() are done at the time of reading the trace, and
calculations done in TP_fast_assign() are done during the recording and
hence adding more overhead to the trace itself.


> 
> > > + __entry->buffer = link->buffer;
> > > + __entry->head = link->head;
> > > + __entry->tail = link->tail;
> > > + __entry->end = link->end;
> > > + ),
> > > +
> > > + TP_printk("stream_id=%llu, copy=%zd, buffer: %zd, avail=%zd, 
> > > [buffer=%p, head=%p, tail=%p, end=%p]",
> > > + __entry->stream_id, __entry->size, __entry->bsize, 
> > > __entry->size_left,  
> > 
> >  __entry->stream_id, __entry->size, __entry->bsize,
> >  (__entry->head - __entry->tail) % __entry->bsize,
> >   
> 
> Ok, so is this about saving space by dropping one intermediate value, or is 
> it some other point I'm missing here?

Nope, just moving the overhead from the recording of the trace to the
reading of the trace.

> 
> > > + __entry->buffer,__entry->head, __entry->tail,  __entry->end)
> > > +
> > > + );
> > > +


> > > +
> > > + TP_fast_assign(
> > > + __entry->stream_id = link->stream_id;
> > > + __entry->vlan_tag = (skb_vlan_tag_present(skb) ? 
> > > skb_vlan_tag_get(skb) : 0);
> > > + __entry->bytes = bytes;
> > > + __entry->data_len = skb->data_len;
> > > + __entry->headlen = skb_headlen(skb);
> > > + __entry->protocol = ntohs(vlan_get_protocol(skb));  
> > 
> > Maybe it would be better to do the ntohs() in the TP_printk() as well.
> >   
> > > + __entry->prot_native = ntohs(skb->protocol);  
> > 
> > here too.
> >   
> > > + __entry->tx_idx = skb_get_queue_mapping(skb);
> > > +
> > > + __entry->mac_len = skb->mac_len;
> > > + __entry->hdr_len = skb->hdr_len;
> > > + __entry->vlan_tci = skb->vlan_tci;
> > > + __entry->mac_header = skb->mac_header;
> > > + __entry->tail = (unsigned int)skb->tail;
> > > + __entry->end  = (unsigned int)skb->end;
> > > + __entry->truesize = skb->truesize;
> > > + ),
> > > +
> > > + 
> > > TP_printk("stream_id=%llu,vlan_tag=0x%04x,data_size=%zd,data_len=%zd,headlen=%u,proto=0x%04x
> > >  
> > > (0x%04x),tx_idx=%d,mac_len=%u,hdr_len=%u,vlan_tci=0x%02x,mac_header=0x%02x,tail=%u,end=%u,truesize=%u",
> > > + __entry->stream_id,
> > > + __entry->vlan_tag,
> > > + __entry->bytes,
> > > + __entry->data_len,
> > > + __entry->headlen,
> > > + __entry->protocol,
> > > + __entry->prot_native, __entry->tx_idx,
> > > + __entry->mac_len,
> > > + __entry->hdr_len,
> > > + __entry->vlan_tci,
> > > + __entry->mac_header,  
> > 
> > Is this an ether mac header? If so we support %M. But as it's defined
> > as only u16, it doesn't seem like it can be.  
> 
> Actually, looking at the output, I'm not quite sure what it is that I 
> wanted to grab with that, the skb->mac_header should give an offset into 
> the header-area of skb, so it should be a constant offset from skb->head 
> (that is an actual pointer).
> 
> I *think* I wanted to make sure I updated things correctly so that the 
> offset didn't suddenly change, but the fact that I'm no longer sure 
> indicates that I should just drop that one. That whole printout is too long 
> anyway..
> 
> Thanks for pointing a finger at this!
> 
> 
> I'm still a bit stymied as to why logic should be in TP_printk() and not 
> TP_fast_assign(). Not that I really 

Re: [PATCH v2 -next] sched: remove NET_XMIT_POLICED

2016-06-12 Thread David Miller
From: Cong Wang 
Date: Sat, 11 Jun 2016 17:14:09 -0700

> On Sat, Jun 11, 2016 at 3:54 PM, David Miller  wrote:
>> From: Florian Westphal 
>> Date: Sat, 11 Jun 2016 12:46:04 +0200
>>
>>> sch_atm returns this when TC_ACT_SHOT classification occurs.
>>>
>>> But all other schedulers that use tc_classify
>>> (htb, hfsc, drr, fq_codel ...) return NET_XMIT_SUCCESS | __BYPASS
>>> in this case so just do that in atm.
>>
>> Yes, but it's been like this for two decades.
>>
>> One of the points of my reply to the previous version of this
>> patch is that indeed it is explicitly returned, intentionally.
> 
> I think this is a leftover from CONFIG_NET_CLS_POLICE removal.
> 
> BTW, probably TC_POLICE_* can be removed together.

Ok I'll apply this.


Re: [PATCH] net: ethernet: ti: cpsw: use destroy ctlr to destroy channels

2016-06-12 Thread David Miller
From: Ivan Khoronzhuk 
Date: Sat, 11 Jun 2016 01:11:54 +0300

> There is no reason to destroy channels that are destroyed while
> cpdma_ctlr destroy. In this case no need to remember how much
> channels where created and destroy them by one, as cpdma_ctlr
> destroys all of them.
> 
> Signed-off-by: Ivan Khoronzhuk 

Applied.


Re: [PATCH net-next RFT] net: fec: handle small PHY reset durations more precisely

2016-06-12 Thread David Miller
From: Stefan Wahren 
Date: Wed,  8 Jun 2016 20:42:46 +

> Since msleep is based on jiffies the PHY reset could take longer
> than expected. So use msleep for values greater than 20 msec otherwise
> usleep_range.
> 
> Signed-off-by: Stefan Wahren 

Applied.


Re: [PATCH net] net_sched: prio: properly report out of memory errors

2016-06-12 Thread David Miller
From: Eric Dumazet 
Date: Sun, 12 Jun 2016 16:21:47 -0700

> From: Eric Dumazet 
> 
> At Qdisc creation or change time, prio_tune() creates missing
> pfifo qdiscs but does not return an error code if one
> qdisc could not be allocated.
> 
> Leaving a qdisc in non operational state without telling user
> anything about this problem is not good.
> 
> Also, testing if we replace something different than noop_qdisc
> a second time makes no sense so I removed useless code.
> 
> Signed-off-by: Eric Dumazet 

Applied.


RE: [PATCH v2 1/2] ARM: imx6: disable deeper idle states when FEC is active w/o HW workaround

2016-06-12 Thread Fugang Duan
From: Lucas Stach  Sent: Thursday, June 09, 2016 5:35 PM
> To: Fugang Duan ; Shawn Guo 
> Cc: devicet...@vger.kernel.org; patchwork-...@pengutronix.de;
> ker...@pengutronix.de; linux-arm-ker...@lists.infradead.org;
> netdev@vger.kernel.org
> Subject: Re: [PATCH v2 1/2] ARM: imx6: disable deeper idle states when FEC is
> active w/o HW workaround
> 
> Hi Fugang,
> 
> Am Montag, den 06.06.2016, 02:00 + schrieb Fugang Duan:
> > From: Lucas Stach  Sent: Saturday, June 04,
> > 2016 12:31 AM
> > >
> > > To: Shawn Guo ; Fugang Duan  > > om>
> > > Cc: devicet...@vger.kernel.org; patchwork-...@pengutronix.de;
> > > ker...@pengutronix.de; linux-arm-ker...@lists.infradead.org;
> > > netdev@vger.kernel.org
> > > Subject: [PATCH v2 1/2] ARM: imx6: disable deeper idle states when
> > > FEC is active w/o HW workaround
> > >
> > > The i.MX6 Q/DL has an erratum (ERR006687) that prevents the FEC from
> > > waking the CPUs when they are in wait(unclocked) state. As the
> > > hardware workaround isn't applicable to all boards, disable the
> > > deeper idle state when the workaround isn't present and the FEC is
> > > in use.
> > >
> > > This allows to safely run a kernel with CPUidle enabled on all
> > > i.MX6 boards.
> > >
> > > Signed-off-by: Lucas Stach 
> > > Acked-by: David S. Miller  (for network
> > > changes)
> > > ---
> 
> [...]
> 
> > Hi, Lucas,
> >
> > FEC irq cannot wake up CPUs when system is in wait mode. But we can
> > use GPIO_6 for FEC interrupt that GPIO irq wake up CPUs.
> > No need to disable wait mode as your such patches.
> >
> > You just config the gpio irq like below patches:
> > bc20a5d6da71 (ARM: dts: imx6qdl-sabreauto: use GPIO_6 for FEC
> > interrupt.)
> > 6261c4c8f13e (ARM: dts: imx6qdl-sabrelite: use GPIO_6 for FEC
> > interrupt.)
> >
> Please look at the description of this series again. The changes don't 
> disable the
> deeper idle states on boards where the HW waorkaround is available. There is a
> large number of boards in the wild which can not use the HW workaround, as
> they use GPIO_6 for other purposes. The aim of this series is to have an
> automatic software workaround available for those boards.
> 
> Regards,
> Lucas

I see.  My concern is the wild boards why don't follow NXP HW reference guide.

For the patch itself, it is fine for me to fix the wild boards issue.

Acked-by: Fugang Duan 


[PATCH] esp: correct offset for ESN when using NAT-T

2016-06-12 Thread Blair Steven
The offset for calculating ESN was not taking into account the new UDP
header created for NAT-T.
---
 net/ipv4/esp4.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 4779374..c84d1fc 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -223,6 +223,8 @@ static int esp_output(struct xfrm_state *x, struct sk_buff 
*skb)
uh->len = htons(skb->len - skb_transport_offset(skb));
uh->check = 0;
 
+   skb_set_transport_header(skb, skb_transport_offset(skb) + 
sizeof(struct udphdr));
+
switch (encap_type) {
default:
case UDP_ENCAP_ESPINUDP:
-- 
2.8.3



[PATCH] IPsec NAT-T issue

2016-06-12 Thread Blair Steven

During testing we have discovered an issue with IPsec NAT-T where the SPI
is over writing the source and dest ports of the UDP header. I'm not
super familiar with this code, but I've found a solution that seems
to work in my setup.

I'd like some feedback on this please, if it's the right thing to be doing
here, or if it should be done elsewhere.

Thanks very much

Blair Steven (1):
  esp: correct offset for ESN when using NAT-T

 net/ipv4/esp4.c | 2 ++
 1 file changed, 2 insertions(+)

-- 
2.8.3



Re: off-by-one in DecodeQ931

2016-06-12 Thread Toby DiPasquale
Attached is the patch generated with git format-patch.

On Mon, Jun 6, 2016 at 10:55 AM, Pablo Neira Ayuso  wrote:
> On Mon, Jun 06, 2016 at 04:35:55PM +0200, Florian Westphal wrote:
>> Toby DiPasquale  wrote:
>> > Is this latest patch OK?
>>
>> Yes, I don't know why it wasn't applied yet.
>>
>> Pablo?
>
> This doesn't apply.
>
> $ git am /tmp/off-by-one-in-DecodeQ931.patch -s
> Applying: off-by-one in DecodeQ931
> error: patch failed: net/netfilter/nf_conntrack_h323_asn1.c:846
> error: net/netfilter/nf_conntrack_h323_asn1.c: patch does not apply
> Patch failed at 0001 off-by-one in DecodeQ931
> The copy of the patch that failed is found in:
>patch
> When you have resolved this problem, run "git am --continue".
> If you prefer to skip this patch, run "git am --skip" instead.
> To restore the original branch and stop patching, run "git am --abort".
>
> Could you resubmit using git-format-patch? Thanks.



-- 
Toby DiPasquale


off-by-one.patch
Description: Binary data


[PATCH net] net_sched: prio: properly report out of memory errors

2016-06-12 Thread Eric Dumazet
From: Eric Dumazet 

At Qdisc creation or change time, prio_tune() creates missing
pfifo qdiscs but does not return an error code if one
qdisc could not be allocated.

Leaving a qdisc in non operational state without telling user
anything about this problem is not good.

Also, testing if we replace something different than noop_qdisc
a second time makes no sense so I removed useless code.

Signed-off-by: Eric Dumazet 
---
 net/sched/sch_prio.c |   32 
 1 file changed, 12 insertions(+), 20 deletions(-)

diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
index 4b0a82191bc4..071718bccdab 100644
--- a/net/sched/sch_prio.c
+++ b/net/sched/sch_prio.c
@@ -202,26 +202,18 @@ static int prio_tune(struct Qdisc *sch, struct nlattr 
*opt)
sch_tree_unlock(sch);
 
for (i = 0; i < q->bands; i++) {
-   if (q->queues[i] == _qdisc) {
-   struct Qdisc *child, *old;
-
-   child = qdisc_create_dflt(sch->dev_queue,
- _qdisc_ops,
- TC_H_MAKE(sch->handle, i + 
1));
-   if (child) {
-   sch_tree_lock(sch);
-   old = q->queues[i];
-   q->queues[i] = child;
-
-   if (old != _qdisc) {
-   qdisc_tree_reduce_backlog(old,
- old->q.qlen,
- 
old->qstats.backlog);
-   qdisc_destroy(old);
-   }
-   sch_tree_unlock(sch);
-   }
-   }
+   struct Qdisc *child;
+
+   if (q->queues[i] != _qdisc)
+   continue;
+
+   child = qdisc_create_dflt(sch->dev_queue, _qdisc_ops,
+ TC_H_MAKE(sch->handle, i + 1));
+   if (!child)
+   return -ENOMEM;
+   sch_tree_lock(sch);
+   q->queues[i] = child;
+   sch_tree_unlock(sch);
}
return 0;
 }




Re: [PATCH net] fib_rules: don't break ECN with TOS rules

2016-06-12 Thread David Ahern

On 6/11/16 3:10 PM, Hannes Frederic Sowa wrote:

@@ -215,6 +216,9 @@ static int fib4_rule_configure(struct fib_rule *rule, 
struct sk_buff *skb,
rule4->dst_len = frh->dst_len;
rule4->dstmask = inet_make_mask(rule4->dst_len);
rule4->tos = frh->tos;
+   if (!INET_ECN_ignore(rule4->tos))
+   pr_warn("ipv4: never matching ipv4 rule with match for %x 
added\n",
+   rule4->tos);



The wording of the warning message is not very clear on what the problem is.


Re: [very-RFC 0/8] TSN driver for the kernel

2016-06-12 Thread Henrik Austad
On Sun, Jun 12, 2016 at 07:43:34PM +0900, Takashi Sakamoto wrote:
> On Jun 12 2016 17:31, Henrik Austad wrote:
> > On Sun, Jun 12, 2016 at 01:30:24PM +0900, Takashi Sakamoto wrote:
> >> On Jun 12 2016 12:38, Takashi Sakamoto wrote:
> >>> In your patcset, there's no actual codes about how to handle any
> >>> interrupt contexts (software / hardware), how to handle packet payload,
> >>> and so on. Especially, for recent sound subsystem, the timing of
> >>> generating interrupts and which context does what works are important to
> >>> reduce playback/capture latency and power consumption.
> >>>
> >>> Of source, your intention of this patchset is to show your early concept
> >>> of TSN feature. Nevertheless, both of explaination and codes are
> >>> important to the other developers who have little knowledges about TSN,
> >>> AVB and AES-64 such as me.
> >>
> >> Oops. Your 5th patch was skipped by alsa-project.org. I guess that size
> >> of the patch is too large to the list service. I can see it:
> >> http://marc.info/?l=linux-netdev=146568672728661=2
> >>
> >> As long as seeing the patch, packets are queueing in hrtimer callbacks
> >> every 1 seconds.
> > 
> > Actually, the hrtimer fires every 1ms, and that part is something I have to 
> > do something about, also because it sends of the same number of frames 
> > every time, regardless of how accurate the internal timer is to the rest of 
> > the network (there's no backpressure from the networking layer).
> > 
> >> (This is a high level discussion and it's OK to ignore it for the
> >> moment. When writing packet-oriented drivers for sound subsystem, you
> >> need to pay special attention to accuracy of the number of PCM frames
> >> transferred currently, and granularity of the number of PCM frames
> >> transferred by one operation. In this case, snd_avb_hw,
> >> snd_avb_pcm_pointer(), tsn_buffer_write_net() and tsn_buffer_read_net()
> >> are involved in this discussion. You can see ALSA developers' struggle
> >> in USB audio device class drivers and (of cource) IEC 61883-1/6 drivers.)
> > 
> > Ah, good point. Any particular parts of the USB-subsystem I should start 
> > looking at?
> 
> I don't think it's a beter way for you to study USB Audio Device Class
> driver unless you're interested in ALSA or USB subsystem.
> 
> (But for your information, snd-usb-audio is in sound/usb/* of Linux
> kernel. IEC 61883-1/6 driver is in sound/firewire/*.)

Ok, thanks, I'll definately be looking at the firewire bit

> We need different strategy to achieve it on different transmission backend.
> 
> > Knowing where to start looking is a tremendous help
> 
> It's not well-documented, and not well-generalized for packet-oriented
> drivers. Most of developers who have enough knowledge about it work for
> DMA-oriented drivers in mobile platforms and have little interests in
> packet-oriented drivers. You need to find your own way.
> 
> Currently I have few advices to you, because I'm also on the way for
> drivers to process IEC 61883-1/6 packets on IEEE 1394 bus with enough
> accuracy and granularity. The paper I introduced is for the way (but not
> mature).
> 
> I wish you get more helps from the other developers. Your work is more
> significant to Linux system, than mine.
> 
> (And I hope your future work get no ignorance and no unreasonable
> hostility from coarse users.)

Ah well, I have asbestos-underwear so that should be fine :)

Thanks for the pointers, I really appreciate them!



-- 
Henrik Austad


signature.asc
Description: Digital signature


QUICK LOAN OFFER

2016-06-12 Thread International Loan Firm



[iproute2 PATCH 1/1] action pedit: stylistic changes

2016-06-12 Thread Jamal Hadi Salim
From: Jamal Hadi Salim 

More modern layout.

Signed-off-by: Jamal Hadi Salim 
---
 tc/m_pedit.c | 118 +++
 1 file changed, 62 insertions(+), 56 deletions(-)

diff --git a/tc/m_pedit.c b/tc/m_pedit.c
index a539b68..d276ba0 100644
--- a/tc/m_pedit.c
+++ b/tc/m_pedit.c
@@ -32,8 +32,7 @@
 static struct m_pedit_util *pedit_list;
 static int pedit_debug;
 
-static void
-explain(void)
+static void explain(void)
 {
fprintf(stderr, "Usage: ... pedit munge  [CONTROL]\n");
fprintf(stderr,
@@ -48,22 +47,24 @@ explain(void)
 
 }
 
-static void
-usage(void)
+static void usage(void)
 {
explain();
exit(-1);
 }
 
-static int
-pedit_parse_nopopt (int *argc_p, char ***argv_p, struct tc_pedit_sel *sel, 
struct tc_pedit_key *tkey)
+static int pedit_parse_nopopt(int *argc_p, char ***argv_p,
+ struct tc_pedit_sel *sel,
+ struct tc_pedit_key *tkey)
 {
int argc = *argc_p;
char **argv = *argv_p;
 
if (argc) {
-   fprintf(stderr, "Unknown action  hence option \"%s\" is 
unparsable\n", *argv);
-   return -1;
+   fprintf(stderr,
+   "Unknown action  hence option \"%s\" is unparsable\n",
+   *argv);
+   return -1;
}
 
return 0;
@@ -75,7 +76,7 @@ static struct m_pedit_util *get_pedit_kind(const char *str)
static void *pBODY;
void *dlh;
char buf[256];
-   struct  m_pedit_util *p;
+   struct m_pedit_util *p;
 
for (p = pedit_list; p; p = p->next) {
if (strcmp(p->id, str) == 0)
@@ -107,15 +108,14 @@ noexist:
p = malloc(sizeof(*p));
if (p) {
memset(p, 0, sizeof(*p));
-   strncpy(p->id, str, sizeof(p->id)-1);
+   strncpy(p->id, str, sizeof(p->id) - 1);
p->parse_peopt = pedit_parse_nopopt;
goto reg;
}
return p;
 }
 
-int
-pack_key(struct tc_pedit_sel *sel, struct tc_pedit_key *tkey)
+int pack_key(struct tc_pedit_sel *sel, struct tc_pedit_key *tkey)
 {
int hwm = sel->nkeys;
 
@@ -137,9 +137,8 @@ pack_key(struct tc_pedit_sel *sel, struct tc_pedit_key 
*tkey)
return 0;
 }
 
-
-int
-pack_key32(__u32 retain, struct tc_pedit_sel *sel, struct tc_pedit_key *tkey)
+int pack_key32(__u32 retain, struct tc_pedit_sel *sel,
+  struct tc_pedit_key *tkey)
 {
if (tkey->off > (tkey->off & ~3)) {
fprintf(stderr,
@@ -152,11 +151,11 @@ pack_key32(__u32 retain, struct tc_pedit_sel *sel, struct 
tc_pedit_key *tkey)
return pack_key(sel, tkey);
 }
 
-int
-pack_key16(__u32 retain, struct tc_pedit_sel *sel, struct tc_pedit_key *tkey)
+int pack_key16(__u32 retain, struct tc_pedit_sel *sel,
+  struct tc_pedit_key *tkey)
 {
int ind, stride;
-   __u32 m[4] = {0x, 0xFFFF, 0x};
+   __u32 m[4] = { 0x, 0xFFFF, 0x };
 
if (tkey->val > 0x || tkey->mask > 0x) {
fprintf(stderr, "pack_key16 bad value\n");
@@ -177,19 +176,20 @@ pack_key16(__u32 retain, struct tc_pedit_sel *sel, struct 
tc_pedit_key *tkey)
tkey->off &= ~3;
 
if (pedit_debug)
-   printf("pack_key16: Final val %08x mask %08x\n", tkey->val, 
tkey->mask);
+   printf("pack_key16: Final val %08x mask %08x\n",
+  tkey->val, tkey->mask);
return pack_key(sel, tkey);
 
 }
 
-int
-pack_key8(__u32 retain, struct tc_pedit_sel *sel, struct tc_pedit_key *tkey)
+int pack_key8(__u32 retain, struct tc_pedit_sel *sel, struct tc_pedit_key 
*tkey)
 {
int ind, stride;
-   __u32 m[4] = {0x00FF, 0xFF00, 0x00FF, 0xFF00};
+   __u32 m[4] = { 0x00FF, 0xFF00, 0x00FF, 0xFF00 };
 
if (tkey->val > 0xFF || tkey->mask > 0xFF) {
-   fprintf(stderr, "pack_key8 bad value (val %x mask %x\n", 
tkey->val, tkey->mask);
+   fprintf(stderr, "pack_key8 bad value (val %x mask %x\n",
+   tkey->val, tkey->mask);
return -1;
}
 
@@ -202,12 +202,12 @@ pack_key8(__u32 retain, struct tc_pedit_sel *sel, struct 
tc_pedit_key *tkey)
tkey->off &= ~3;
 
if (pedit_debug)
-   printf("pack_key8: Final word off %d  val %08x mask %08x\n", 
tkey->off, tkey->val, tkey->mask);
+   printf("pack_key8: Final word off %d  val %08x mask %08x\n",
+  tkey->off, tkey->val, tkey->mask);
return pack_key(sel, tkey);
 }
 
-int
-parse_val(int *argc_p, char ***argv_p, __u32 *val, int type)
+int parse_val(int *argc_p, char ***argv_p, __u32 * val, int type)
 {
int argc = *argc_p;
char **argv = *argv_p;
@@ -216,7 +216,7 @@ parse_val(int *argc_p, char ***argv_p, __u32 *val, int type)

[PATCH] net: ethernet: enic: move to new ethtool api {get|set}_link_ksettings

2016-06-12 Thread Philippe Reynes
The ethtool api {get|set}_settings is deprecated.
We move the enic driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes 
---
 drivers/net/ethernet/cisco/enic/enic_ethtool.c |   28 +--
 1 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/cisco/enic/enic_ethtool.c 
b/drivers/net/ethernet/cisco/enic/enic_ethtool.c
index f44a39c..fd3980c 100644
--- a/drivers/net/ethernet/cisco/enic/enic_ethtool.c
+++ b/drivers/net/ethernet/cisco/enic/enic_ethtool.c
@@ -103,25 +103,29 @@ static void enic_intr_coal_set_rx(struct enic *enic, u32 
timer)
}
 }
 
-static int enic_get_settings(struct net_device *netdev,
-   struct ethtool_cmd *ecmd)
+static int enic_get_ksettings(struct net_device *netdev,
+ struct ethtool_link_ksettings *ecmd)
 {
struct enic *enic = netdev_priv(netdev);
+   struct ethtool_link_settings *base = >base;
 
-   ecmd->supported = (SUPPORTED_1baseT_Full | SUPPORTED_FIBRE);
-   ecmd->advertising = (ADVERTISED_1baseT_Full | ADVERTISED_FIBRE);
-   ecmd->port = PORT_FIBRE;
-   ecmd->transceiver = XCVR_EXTERNAL;
+   ethtool_link_ksettings_add_link_mode(ecmd, supported,
+1baseT_Full);
+   ethtool_link_ksettings_add_link_mode(ecmd, supported, FIBRE);
+   ethtool_link_ksettings_add_link_mode(ecmd, advertising,
+1baseT_Full);
+   ethtool_link_ksettings_add_link_mode(ecmd, advertising, FIBRE);
+   base->port = PORT_FIBRE;
 
if (netif_carrier_ok(netdev)) {
-   ethtool_cmd_speed_set(ecmd, vnic_dev_port_speed(enic->vdev));
-   ecmd->duplex = DUPLEX_FULL;
+   base->speed = vnic_dev_port_speed(enic->vdev);
+   base->duplex = DUPLEX_FULL;
} else {
-   ethtool_cmd_speed_set(ecmd, SPEED_UNKNOWN);
-   ecmd->duplex = DUPLEX_UNKNOWN;
+   base->speed = SPEED_UNKNOWN;
+   base->duplex = DUPLEX_UNKNOWN;
}
 
-   ecmd->autoneg = AUTONEG_DISABLE;
+   base->autoneg = AUTONEG_DISABLE;
 
return 0;
 }
@@ -500,7 +504,6 @@ static int enic_set_rxfh(struct net_device *netdev, const 
u32 *indir,
 }
 
 static const struct ethtool_ops enic_ethtool_ops = {
-   .get_settings = enic_get_settings,
.get_drvinfo = enic_get_drvinfo,
.get_msglevel = enic_get_msglevel,
.set_msglevel = enic_set_msglevel,
@@ -516,6 +519,7 @@ static const struct ethtool_ops enic_ethtool_ops = {
.get_rxfh_key_size = enic_get_rxfh_key_size,
.get_rxfh = enic_get_rxfh,
.set_rxfh = enic_set_rxfh,
+   .get_link_ksettings = enic_get_ksettings,
 };
 
 void enic_set_ethtool_ops(struct net_device *netdev)
-- 
1.7.4.4



Re: [RFC PATCH iproute2] tc: let m_ipt work with new iptables API headers

2016-06-12 Thread Alexander Aring

Hi,

On 05/29/2016 08:27 PM, Alexander Aring wrote:
> Since commit 5cd1adb ("Update to current iptables headers") the build
> with m_ipt.o and the following config will fail:
> 
> TC_CONFIG_XT:=n
> TC_CONFIG_XT_OLD:=n
> TC_CONFIG_XT_OLD_H:=n
> 
> This patch renames "iptables_target" to "xtables_target" and some other
> things which gets renamed and I noticed while reading iptables git log.
> Functions which are not used in m_ipt.c and not exported by the header
> are removed, if they still used in m_ipt.c I added a static to the function.
> 
> Reported-by: Clemens Gruber 
> Signed-off-by: Alexander Aring 
> ---
> I removed also "linux/if.h" because some redifinition errors and seems
> not be necessary anymore.
> 
> I didn't test this implementation, I just fix the compile issues at m_ipt
> implementation according to the algoritmn described above.
> 
> Maybe there are more TC_CONFIG_XT* config variants broken than just m_ipt?
> 

ping, any comments to this patch?
I can rebase and resend this patch if necessary.

- Alex


Re: [very-RFC 6/8] Add TSN event-tracing

2016-06-12 Thread Henrik Austad
On Sun, Jun 12, 2016 at 12:58:03PM -0400, Steven Rostedt wrote:
> On Sun, 12 Jun 2016 01:01:34 +0200
> Henrik Austad  wrote:
> 
> > From: Henrik Austad 
> > 
> > This needs refactoring and should be updated to use TRACE_CLASS, but for
> > now it provides a fair debug-window into TSN.
> > 
> > Cc: "David S. Miller" 
> > Cc: Steven Rostedt  (maintainer:TRACING)
> > Cc: Ingo Molnar  (maintainer:TRACING)
> > Signed-off-by: Henrik Austad 
> > ---
> >  include/trace/events/tsn.h | 349 
> > +
> >  1 file changed, 349 insertions(+)
> >  create mode 100644 include/trace/events/tsn.h
> > 
> > diff --git a/include/trace/events/tsn.h b/include/trace/events/tsn.h
> > new file mode 100644
> > index 000..ac1f31b
> > --- /dev/null
> > +++ b/include/trace/events/tsn.h
> > @@ -0,0 +1,349 @@
> > +#undef TRACE_SYSTEM
> > +#define TRACE_SYSTEM tsn
> > +
> > +#if !defined(_TRACE_TSN_H) || defined(TRACE_HEADER_MULTI_READ)
> > +#define _TRACE_TSN_H
> > +
> > +#include 
> > +#include 
> > +
> > +#include 
> > +#include 
> > +/* #include  */
> > +
> > +/* FIXME: update to TRACE_CLASS to reduce overhead */
> 
> I'm curious to why I didn't do this now. A class would make less
> duplication of typing too ;-)

Yeah, I found this in a really great article written by some tracing-dude, 
I hear he talks really, really fast!

https://lwn.net/Articles/381064/

> > +TRACE_EVENT(tsn_buffer_write,
> > +
> > +   TP_PROTO(struct tsn_link *link,
> > +   size_t bytes),
> > +
> > +   TP_ARGS(link, bytes),
> > +
> > +   TP_STRUCT__entry(
> > +   __field(u64, stream_id)
> > +   __field(size_t, size)
> > +   __field(size_t, bsize)
> > +   __field(size_t, size_left)
> > +   __field(void *, buffer)
> > +   __field(void *, head)
> > +   __field(void *, tail)
> > +   __field(void *, end)
> > +   ),
> > +
> > +   TP_fast_assign(
> > +   __entry->stream_id = link->stream_id;
> > +   __entry->size = bytes;
> > +   __entry->bsize = link->used_buffer_size;
> > +   __entry->size_left = (link->head - link->tail) % 
> > link->used_buffer_size;
> 
> Move this logic into the print statement, since you save head and tail.

Ok, any particular reason?

> > +   __entry->buffer = link->buffer;
> > +   __entry->head = link->head;
> > +   __entry->tail = link->tail;
> > +   __entry->end = link->end;
> > +   ),
> > +
> > +   TP_printk("stream_id=%llu, copy=%zd, buffer: %zd, avail=%zd, 
> > [buffer=%p, head=%p, tail=%p, end=%p]",
> > +   __entry->stream_id, __entry->size, __entry->bsize, 
> > __entry->size_left,
> 
>  __entry->stream_id, __entry->size, __entry->bsize,
>  (__entry->head - __entry->tail) % __entry->bsize,
> 

Ok, so is this about saving space by dropping one intermediate value, or is 
it some other point I'm missing here?

> > +   __entry->buffer,__entry->head, __entry->tail,  __entry->end)
> > +
> > +   );
> > +
> > +TRACE_EVENT(tsn_buffer_write_net,
> > +
> > +   TP_PROTO(struct tsn_link *link,
> > +   size_t bytes),
> > +
> > +   TP_ARGS(link, bytes),
> > +
> > +   TP_STRUCT__entry(
> > +   __field(u64, stream_id)
> > +   __field(size_t, size)
> > +   __field(size_t, bsize)
> > +   __field(size_t, size_left)
> > +   __field(void *, buffer)
> > +   __field(void *, head)
> > +   __field(void *, tail)
> > +   __field(void *, end)
> > +   ),
> > +
> > +   TP_fast_assign(
> > +   __entry->stream_id = link->stream_id;
> > +   __entry->size = bytes;
> > +   __entry->bsize = link->used_buffer_size;
> > +   __entry->size_left = (link->head - link->tail) % 
> > link->used_buffer_size;
> > +   __entry->buffer = link->buffer;
> > +   __entry->head = link->head;
> > +   __entry->tail = link->tail;
> > +   __entry->end = link->end;
> > +   ),
> > +
> > +   TP_printk("stream_id=%llu, copy=%zd, buffer: %zd, avail=%zd, 
> > [buffer=%p, head=%p, tail=%p, end=%p]",
> > +   __entry->stream_id, __entry->size, __entry->bsize, 
> > __entry->size_left,
> > +   __entry->buffer,__entry->head, __entry->tail,  __entry->end)
> > +
> > +   );
> > +
> > +
> > +TRACE_EVENT(tsn_buffer_read,
> > +
> > +   TP_PROTO(struct tsn_link *link,
> > +   size_t bytes),
> > +
> > +   TP_ARGS(link, bytes),
> > +
> > +   TP_STRUCT__entry(
> > +   __field(u64, stream_id)
> > +   __field(size_t, size)
> > +   __field(size_t, bsize)
> > +   __field(size_t, size_left)
> > +   __field(void *, buffer)
> > +   __field(void *, head)
> > +   __field(void *, tail)
> > +   __field(void *, end)
> > +   ),
> > +
> > +   TP_fast_assign(
> 

[net-next PATCH v2 1/1] net sched actions: skbedit add support for mod-ing skb pkt_type

2016-06-12 Thread Jamal Hadi Salim
From: Jamal Hadi Salim 

Extremely useful for setting packet type to host so i dont
have to modify the dst mac address using pedit (which requires
that i know the mac address)

Signed-off-by: Jamal Hadi Salim 
---
 include/net/tc_act/tc_skbedit.h| 10 +-
 include/uapi/linux/tc_act/tc_skbedit.h |  2 ++
 net/sched/act_skbedit.c| 18 +-
 3 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/include/net/tc_act/tc_skbedit.h b/include/net/tc_act/tc_skbedit.h
index b496d5a..d01a5d4 100644
--- a/include/net/tc_act/tc_skbedit.h
+++ b/include/net/tc_act/tc_skbedit.h
@@ -24,11 +24,11 @@
 
 struct tcf_skbedit {
struct tcf_common   common;
-   u32 flags;
-   u32 priority;
-   u32 mark;
-   u16 queue_mapping;
-   /* XXX: 16-bit pad here? */
+   u32 flags;
+   u32 priority;
+   u32 mark;
+   u16 queue_mapping;
+   u16 ptype;
 };
 #define to_skbedit(a) \
container_of(a->priv, struct tcf_skbedit, common)
diff --git a/include/uapi/linux/tc_act/tc_skbedit.h 
b/include/uapi/linux/tc_act/tc_skbedit.h
index fecb5cc..a4d00c6 100644
--- a/include/uapi/linux/tc_act/tc_skbedit.h
+++ b/include/uapi/linux/tc_act/tc_skbedit.h
@@ -27,6 +27,7 @@
 #define SKBEDIT_F_PRIORITY 0x1
 #define SKBEDIT_F_QUEUE_MAPPING0x2
 #define SKBEDIT_F_MARK 0x4
+#define SKBEDIT_F_PTYPE0x8
 
 struct tc_skbedit {
tc_gen;
@@ -40,6 +41,7 @@ enum {
TCA_SKBEDIT_QUEUE_MAPPING,
TCA_SKBEDIT_MARK,
TCA_SKBEDIT_PAD,
+   TCA_SKBEDIT_PTYPE,
__TCA_SKBEDIT_MAX
 };
 #define TCA_SKBEDIT_MAX (__TCA_SKBEDIT_MAX - 1)
diff --git a/net/sched/act_skbedit.c b/net/sched/act_skbedit.c
index 8f235ff..8a7a34b 100644
--- a/net/sched/act_skbedit.c
+++ b/net/sched/act_skbedit.c
@@ -47,6 +47,8 @@ static int tcf_skbedit(struct sk_buff *skb, const struct 
tc_action *a,
skb_set_queue_mapping(skb, d->queue_mapping);
if (d->flags & SKBEDIT_F_MARK)
skb->mark = d->mark;
+   if (d->flags & SKBEDIT_F_PTYPE)
+   skb->pkt_type = d->ptype;
 
spin_unlock(>tcf_lock);
return d->tcf_action;
@@ -57,6 +59,7 @@ static const struct nla_policy skbedit_policy[TCA_SKBEDIT_MAX 
+ 1] = {
[TCA_SKBEDIT_PRIORITY]  = { .len = sizeof(u32) },
[TCA_SKBEDIT_QUEUE_MAPPING] = { .len = sizeof(u16) },
[TCA_SKBEDIT_MARK]  = { .len = sizeof(u32) },
+   [TCA_SKBEDIT_PTYPE] = { .len = sizeof(u16) },
 };
 
 static int tcf_skbedit_init(struct net *net, struct nlattr *nla,
@@ -68,7 +71,7 @@ static int tcf_skbedit_init(struct net *net, struct nlattr 
*nla,
struct tc_skbedit *parm;
struct tcf_skbedit *d;
u32 flags = 0, *priority = NULL, *mark = NULL;
-   u16 *queue_mapping = NULL;
+   u16 *queue_mapping = NULL, *ptype = NULL;
int ret = 0, err, aexists = 0;
 
if (nla == NULL)
@@ -91,6 +94,13 @@ static int tcf_skbedit_init(struct net *net, struct nlattr 
*nla,
queue_mapping = nla_data(tb[TCA_SKBEDIT_QUEUE_MAPPING]);
}
 
+   if (tb[TCA_SKBEDIT_PTYPE] != NULL) {
+   flags |= SKBEDIT_F_PTYPE;
+   ptype = nla_data(tb[TCA_SKBEDIT_PTYPE]);
+   if (*ptype > PACKET_KERNEL)
+   return -EINVAL;
+   }
+
if (tb[TCA_SKBEDIT_MARK] != NULL) {
flags |= SKBEDIT_F_MARK;
mark = nla_data(tb[TCA_SKBEDIT_MARK]);
@@ -135,6 +145,8 @@ static int tcf_skbedit_init(struct net *net, struct nlattr 
*nla,
d->queue_mapping = *queue_mapping;
if (flags & SKBEDIT_F_MARK)
d->mark = *mark;
+   if (flags & SKBEDIT_F_PTYPE)
+   d->ptype = *ptype;
 
d->tcf_action = parm->action;
 
@@ -172,6 +184,10 @@ static int tcf_skbedit_dump(struct sk_buff *skb, struct 
tc_action *a,
nla_put(skb, TCA_SKBEDIT_MARK, sizeof(d->mark),
>mark))
goto nla_put_failure;
+   if ((d->flags & SKBEDIT_F_PTYPE) &&
+   nla_put(skb, TCA_SKBEDIT_PTYPE, sizeof(d->ptype),
+   >ptype))
+   goto nla_put_failure;
 
tcf_tm_dump(, >tcf_tm);
if (nla_put_64bit(skb, TCA_SKBEDIT_TM, sizeof(t), , TCA_SKBEDIT_PAD))
-- 
1.9.1



Re: [net-next PATCH 1/1] net sched actions: skbedit add support for mod-ing skb pkt_type

2016-06-12 Thread Jamal Hadi Salim

On 16-06-12 05:12 PM, Jamal Hadi Salim wrote:

From: Jamal Hadi Salim 

Extremely useful for setting packet type to host so i dont
have to modify the dst mac address using pedit (which requires
that i know the mac address)

Signed-off-by: Jamal Hadi Salim 


Please ignore this version; dumping was buggy. Sending a new version
in a minute.

cheers.
jamal



[net-next PATCH 1/1] net sched actions: skbedit add support for mod-ing skb pkt_type

2016-06-12 Thread Jamal Hadi Salim
From: Jamal Hadi Salim 

Extremely useful for setting packet type to host so i dont
have to modify the dst mac address using pedit (which requires
that i know the mac address)

Signed-off-by: Jamal Hadi Salim 
---
 include/net/tc_act/tc_skbedit.h| 10 +-
 include/uapi/linux/tc_act/tc_skbedit.h |  2 ++
 net/sched/act_skbedit.c| 17 -
 3 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/include/net/tc_act/tc_skbedit.h b/include/net/tc_act/tc_skbedit.h
index b496d5a..d01a5d4 100644
--- a/include/net/tc_act/tc_skbedit.h
+++ b/include/net/tc_act/tc_skbedit.h
@@ -24,11 +24,11 @@
 
 struct tcf_skbedit {
struct tcf_common   common;
-   u32 flags;
-   u32 priority;
-   u32 mark;
-   u16 queue_mapping;
-   /* XXX: 16-bit pad here? */
+   u32 flags;
+   u32 priority;
+   u32 mark;
+   u16 queue_mapping;
+   u16 ptype;
 };
 #define to_skbedit(a) \
container_of(a->priv, struct tcf_skbedit, common)
diff --git a/include/uapi/linux/tc_act/tc_skbedit.h 
b/include/uapi/linux/tc_act/tc_skbedit.h
index fecb5cc..a4d00c6 100644
--- a/include/uapi/linux/tc_act/tc_skbedit.h
+++ b/include/uapi/linux/tc_act/tc_skbedit.h
@@ -27,6 +27,7 @@
 #define SKBEDIT_F_PRIORITY 0x1
 #define SKBEDIT_F_QUEUE_MAPPING0x2
 #define SKBEDIT_F_MARK 0x4
+#define SKBEDIT_F_PTYPE0x8
 
 struct tc_skbedit {
tc_gen;
@@ -40,6 +41,7 @@ enum {
TCA_SKBEDIT_QUEUE_MAPPING,
TCA_SKBEDIT_MARK,
TCA_SKBEDIT_PAD,
+   TCA_SKBEDIT_PTYPE,
__TCA_SKBEDIT_MAX
 };
 #define TCA_SKBEDIT_MAX (__TCA_SKBEDIT_MAX - 1)
diff --git a/net/sched/act_skbedit.c b/net/sched/act_skbedit.c
index 8f235ff..8ab5d0c 100644
--- a/net/sched/act_skbedit.c
+++ b/net/sched/act_skbedit.c
@@ -47,6 +47,8 @@ static int tcf_skbedit(struct sk_buff *skb, const struct 
tc_action *a,
skb_set_queue_mapping(skb, d->queue_mapping);
if (d->flags & SKBEDIT_F_MARK)
skb->mark = d->mark;
+   if (d->flags & SKBEDIT_F_PTYPE)
+   skb->pkt_type = d->ptype;
 
spin_unlock(>tcf_lock);
return d->tcf_action;
@@ -57,6 +59,7 @@ static const struct nla_policy skbedit_policy[TCA_SKBEDIT_MAX 
+ 1] = {
[TCA_SKBEDIT_PRIORITY]  = { .len = sizeof(u32) },
[TCA_SKBEDIT_QUEUE_MAPPING] = { .len = sizeof(u16) },
[TCA_SKBEDIT_MARK]  = { .len = sizeof(u32) },
+   [TCA_SKBEDIT_PTYPE] = { .len = sizeof(u16) },
 };
 
 static int tcf_skbedit_init(struct net *net, struct nlattr *nla,
@@ -68,7 +71,7 @@ static int tcf_skbedit_init(struct net *net, struct nlattr 
*nla,
struct tc_skbedit *parm;
struct tcf_skbedit *d;
u32 flags = 0, *priority = NULL, *mark = NULL;
-   u16 *queue_mapping = NULL;
+   u16 *queue_mapping = NULL, *ptype = NULL;
int ret = 0, err, aexists = 0;
 
if (nla == NULL)
@@ -91,6 +94,13 @@ static int tcf_skbedit_init(struct net *net, struct nlattr 
*nla,
queue_mapping = nla_data(tb[TCA_SKBEDIT_QUEUE_MAPPING]);
}
 
+   if (tb[TCA_SKBEDIT_PTYPE] != NULL) {
+   flags |= SKBEDIT_F_PTYPE;
+   ptype = nla_data(tb[TCA_SKBEDIT_PTYPE]);
+   if (*ptype > PACKET_KERNEL)
+   return -EINVAL;
+   }
+
if (tb[TCA_SKBEDIT_MARK] != NULL) {
flags |= SKBEDIT_F_MARK;
mark = nla_data(tb[TCA_SKBEDIT_MARK]);
@@ -135,6 +145,8 @@ static int tcf_skbedit_init(struct net *net, struct nlattr 
*nla,
d->queue_mapping = *queue_mapping;
if (flags & SKBEDIT_F_MARK)
d->mark = *mark;
+   if (flags & SKBEDIT_F_PTYPE)
+   d->ptype = *ptype;
 
d->tcf_action = parm->action;
 
@@ -171,6 +183,9 @@ static int tcf_skbedit_dump(struct sk_buff *skb, struct 
tc_action *a,
if ((d->flags & SKBEDIT_F_MARK) &&
nla_put(skb, TCA_SKBEDIT_MARK, sizeof(d->mark),
>mark))
+   if ((d->flags & SKBEDIT_F_PTYPE) &&
+   nla_put(skb, TCA_SKBEDIT_PTYPE, sizeof(d->ptype),
+   >ptype))
goto nla_put_failure;
 
tcf_tm_dump(, >tcf_tm);
-- 
1.9.1



Re: ebpf: issue with clang

2016-06-12 Thread Eric Leblond
Hello,

On Sun, 2016-06-12 at 20:35 +0200, Daniel Borkmann wrote:
> On 06/12/2016 07:37 PM, Eric Leblond wrote:
> > On Thu, 2016-06-09 at 17:34 -0700, Alexei Starovoitov wrote:
> > > On Thu, Jun 09, 2016 at 11:10:05PM +0200, Eric Leblond wrote:
> > > > Hello,
> > > > 
> > > > I'm working on integrating ebpf cluster load balancing for
> > > > AF_PACKET
> > > > and I've got some problem to get real code inside the EBPF
> > > > filter.
> > > > 
> > > > I've tried different command lines in the build process. One of
> > > > them
> > > > is:
> > > > clang-3.9 -Wall -O2 -emit-llvm -c hash_ports.c -o - | llc-3.9
> > > > -march=bpf -filetype=obj -o hash_ports.bpf
> > > > 
> > > > If I use that one, then the generated code is almost void. If I
> > > > remove
> > > > the -O2 then I've got a generated code that fails during load.
> > > > When
> > > > not
> > > > using -O2, I manage to load a trivial filter (return of static
> > > > value).
> > > > 
> > > > The C code is the following (a derivative of http-simple-
> > > > filter.c
> > > > used
> > > > for testing):
> > > > 
> > > > int filter(struct __sk_buff *skb) {
> > > > uint8_t *cursor = 0;
> > > > struct ethernet_t *ethernet = cursor_advance(cursor,
> > > > sizeof(*ethernet));
> > > 
> > > this is bcc C syntax that is hiding the explicit
> > > load_byte/half/word
> > > operations
> > > we have to do when using plain C.
> > > If you want to compile C code with clang -O2 -target bpf file.c
> > > -c -o
> > > file.o
> > > and copy .o around to be used in tc like:
> > > tc filter add dev eth0 ingress bpf da obj file.o
> > > then plain C should be used like in all samples/bpf/*_kern.c
> > > examples.
> > > Other folks like the convenience of bcc that hides clang/llvm
> > > invocation.
> > > It mostly applicable to tracing tools where both bcc-C and
> > > corresponding python or lua bits are in the same file
> > > like in iovisor/bcc/tools/* scripts.
> > > The iovisor/bcc/examples/networking/* (where this http-simple-
> > > filter.c came from)
> > > are also suitable for networking and relying on pyroute2 to talk
> > > to
> > > kernel to create netns, veth and to attach bpf to qdisc.
> > > 
> > > In summary there are several ways to write bpf C code:
> > > 1. plain C syntax as in samples/bpf/*_kern.c
> > > Pro: compiles with clang into .o
> > > Con: .o requires elf loader (integrated into tc already for
> > > networking),
> > 
> > Yes, that's not an easy part. I've devel one loader for suricata
> > but I
> > will check the one in tc to see if I can take advantage of it.
> 
> Sure, feel free to rip it out and adapt it.
> 
> With AF_PACKET load balancing you mean a packet fanout eBPF demuxing
> or
> something else controlled via tc ingress?

I'm using fanout eBPF demuxing to implement load balancing in Suricata.
Current alpha level code is here:

https://github.com/regit/suricata/commit/f299abe90bfed3590a9f3de1179091
b7afc2d90c

I'm currently working on the demuxing to implement something more
realistic than what current demuxing function.

> 
> If packet fanout, then you also need to adapt the program type into
> BPF_PROG_TYPE_SOCKET_FILTER.

Yes, already done that (or at least it seems to work).

BR,
-- 

Eric Leblond 


Re: ebpf: issue with clang

2016-06-12 Thread Daniel Borkmann

On 06/12/2016 07:37 PM, Eric Leblond wrote:

On Thu, 2016-06-09 at 17:34 -0700, Alexei Starovoitov wrote:

On Thu, Jun 09, 2016 at 11:10:05PM +0200, Eric Leblond wrote:

Hello,

I'm working on integrating ebpf cluster load balancing for
AF_PACKET
and I've got some problem to get real code inside the EBPF filter.

I've tried different command lines in the build process. One of
them
is:
clang-3.9 -Wall -O2 -emit-llvm -c hash_ports.c -o - | llc-3.9
-march=bpf -filetype=obj -o hash_ports.bpf

If I use that one, then the generated code is almost void. If I
remove
the -O2 then I've got a generated code that fails during load. When
not
using -O2, I manage to load a trivial filter (return of static
value).

The C code is the following (a derivative of http-simple-filter.c
used
for testing):

int filter(struct __sk_buff *skb) {
uint8_t *cursor = 0;
struct ethernet_t *ethernet = cursor_advance(cursor,
sizeof(*ethernet));


this is bcc C syntax that is hiding the explicit load_byte/half/word
operations
we have to do when using plain C.
If you want to compile C code with clang -O2 -target bpf file.c -c -o
file.o
and copy .o around to be used in tc like:
tc filter add dev eth0 ingress bpf da obj file.o
then plain C should be used like in all samples/bpf/*_kern.c
examples.
Other folks like the convenience of bcc that hides clang/llvm
invocation.
It mostly applicable to tracing tools where both bcc-C and
corresponding python or lua bits are in the same file
like in iovisor/bcc/tools/* scripts.
The iovisor/bcc/examples/networking/* (where this http-simple-
filter.c came from)
are also suitable for networking and relying on pyroute2 to talk to
kernel to create netns, veth and to attach bpf to qdisc.

In summary there are several ways to write bpf C code:
1. plain C syntax as in samples/bpf/*_kern.c
Pro: compiles with clang into .o
Con: .o requires elf loader (integrated into tc already for
networking),


Yes, that's not an easy part. I've devel one loader for suricata but I
will check the one in tc to see if I can take advantage of it.


Sure, feel free to rip it out and adapt it.

With AF_PACKET load balancing you mean a packet fanout eBPF demuxing or
something else controlled via tc ingress?

If packet fanout, then you also need to adapt the program type into
BPF_PROG_TYPE_SOCKET_FILTER.

Thanks!


but not friendly for tracing that needs recompile for every kernel
due to unstable kprobes
2. bcc C syntax that compiles C on the fly in memory and loads
directly
Pro: there is no .o, no extra files, no need to install clang/llvm
Con: bcc is not widely available yet. ubuntu and others already have
it in apt.
python and lua may not be for everyone. c++ api is not stable yet.


I need to include it into suricata which is C code. I've played with
bcc and it is a great tool but installation on the different platform
may be complicated for our users.


3. perf+bpf, it is similar to samples/pbf/ C style with few
extensions.
If .c file is passed, the perf calls external clang and loads .o
eventually
Pro: out-of-the-box perf and clang work well
Con: not available for networking


Out of scope for me then.


Sounds like you want to use it with af_packet then
tools/testing/selftests/net/reuseport_bpf.c
could be a good start too, but there bpf is written in asm.


Yes, bad point, asm is not really what I want. I want "normal advanced"
users to be able to edit the load balancing function.


If you pick bcc style then iovisor-...@lists.iovisor.org mailing list
is a good place to ask questions. Be sure to subscribe first, since
it rejects non-subscriber emails to reduce spam.


Thanks a lot for all these explanations. You saved me days!

BR,





Re: [PATCH 1/5] ethernet: add sun8i-emac driver

2016-06-12 Thread Florian Fainelli
Le 09/06/2016 02:44, LABBE Corentin a écrit :
> Hello
> 
> I agree to all your comments, but for some I have additionnal questions
> 
> On Mon, Jun 06, 2016 at 11:25:15AM -0700, Florian Fainelli wrote:
>> On 06/03/2016 02:56 AM, LABBE Corentin wrote:
>>
>> [snip]
>>
>>> +
>>> +/* The datasheet said that each descriptor can transfers up to 4096bytes
>>> + * But latter, a register documentation reduce that value to 2048
>>> + * Anyway using 2048 cause strange behaviours and even BSP driver use 2047
>>> + */
>>> +#define DESC_BUF_MAX 2044
>>> +#if (DESC_BUF_MAX < (ETH_FRAME_LEN + 4))
>>> +#error "DESC_BUF_MAX must be set at minimum to ETH_FRAME_LEN + 4"
>>> +#endif
>>
>> You can probably drop that, it would not make much sense to enable
>> fragments and a buffer size smaller than ETH_FRAME_LEN + ETH_FCS_LEN anyway.
>>
> 
> I has added this test for preventing someone who want to "optimize" 
> DESC_BUF_MAX to doing mistake.
> But I agree that it is of low use.

It's actually dangerous, and if you don't make sure that the value is
properly rounded to whatever the DMA controller's alignment should be,
performance could be terribel too.

> 
>>> +/* Return the number of contiguous free descriptors
>>> + * starting from tx_slot
>>> + */
>>> +static int rb_tx_numfreedesc(struct net_device *ndev)
>>> +{
>>> +   struct sun8i_emac_priv *priv = netdev_priv(ndev);
>>> +
>>> +   if (priv->tx_slot < priv->tx_dirty)
>>> +   return priv->tx_dirty - priv->tx_slot;
>>
>> Does this work with if tx_dirty wraps around?
>>
> 
> The tx_dirty cannot wrap since I always keep an empty slot. (tx_slot cannot 
> go equal or after tx_dirty)

OK, fair enough.

> 
>>> +/* Grab a frame into a skb from descriptor number i */
>>> +static int sun8i_emac_rx_from_ddesc(struct net_device *ndev, int i)
>>> +{
>>> +   struct sk_buff *skb;
>>> +   struct sun8i_emac_priv *priv = netdev_priv(ndev);
>>> +   struct dma_desc *ddesc = priv->dd_rx + i;
>>> +   int frame_len;
>>> +   int crc_checked = 0;
>>> +
>>> +   if (ndev->features & NETIF_F_RXCSUM)
>>> +   crc_checked = 1;
>>
>> Assuming CRC here refers to the Ethernet frame's FCS, then this is
>> absolutely not how NETIF_F_RXCSUM works. NETIF_F_RXCSUM is about your
>> Ethernet adapter supporting L3/L4 checksum offloads, while the Ethernet
>> FCS is pretty much mandatory for the frame to be properly received in
>> the first place. Can you clarify which way it is?
>>
> 
> No CRC here is RXCSUM. I understand the misnaming.
> I will rename the variable to rxcsum_done.

Thanks

> 
>>> +
>>> +   priv->ndev->stats.rx_packets++;
>>> +   priv->ndev->stats.rx_bytes += frame_len;
>>> +   priv->rx_sk[i] = NULL;
>>> +
>>> +   /* this frame is not the last */
>>> +   if ((ddesc->status & BIT(8)) == 0) {
>>> +   dev_warn(priv->dev, "Multi frame not implemented currlen=%d\n",
>>> +frame_len);
>>> +   }
>>> +
>>> +   sun8i_emac_rx_sk(ndev, i);
>>> +
>>> +   netif_rx(skb);
>>
>> netif_receive_skb() at the very least, or if you implement NAPI, like
>> you shoud napi_gro_receive().
>>
> 
> netif_receive_skb documentation say
> "This function may only be called from softirq context and interrupts should 
> be enabled."
> but the calling functions is in hardirq context.

Well, yes, because you are not implementing NAPI, while you should, you
execute in hard interrupt context. Once you move to NAPI, you can and
should use netif_receive_skb().

> 
>>> +   return 0;
>>> +}
>>> +
>>> +/* Cycle over RX DMA descriptors for finding frame to receive
>>> + */
>>> +static int sun8i_emac_receive_all(struct net_device *ndev)
>>> +{
>>> +   struct sun8i_emac_priv *priv = netdev_priv(ndev);
>>> +   struct dma_desc *ddesc;
>>> +
>>> +   ddesc = priv->dd_rx + priv->rx_dirty;
>>> +   while (!(ddesc->status & BIT(31))) {
>>> +   sun8i_emac_rx_from_ddesc(ndev, priv->rx_dirty);
>>> +   rb_inc(>rx_dirty, nbdesc_rx);
>>> +   ddesc = priv->dd_rx + priv->rx_dirty;
>>> +   };
>>
>> So, what if we ping flood your device here, is not there a remote chance
>> that we keep the RX interrupt so busy we can't break out of this loop,
>> and we are executing from hard IRQ context, that's bad.
>>
> 
> I have added a start variable for preventing to do more than a full loop.

Which gets you close to a proper NAPI implementation, good.

> 
>>> +
>>> +   return 0;
>>> +}
>>> +
>>> +/* iterate over dma_desc for finding completed xmit.
>>> + * Called from interrupt context, so no need to spinlock tx
>>
>> Humm, well it depends if what you are doing here may race with
>> ndo_start_xmit(), and usually it does.
>>
> 
> I believe that how it is designed it cannot race each over (access the same 
> descriptor slot) since I keep a free slot between each other.
> 
>> Also, you should consider completing TX packets in NAPI context (soft
>> IRQ) instead of hard IRQs like here.
>>
> 
> I wanted to finish this driver the "old" way (with hard IRQ) and implementing 
> NAPI after as a Kconfig choice.
> Does 

Re: ebpf: issue with clang

2016-06-12 Thread Eric Leblond
Hello,

On Thu, 2016-06-09 at 17:34 -0700, Alexei Starovoitov wrote:
> On Thu, Jun 09, 2016 at 11:10:05PM +0200, Eric Leblond wrote:
> > Hello,
> > 
> > I'm working on integrating ebpf cluster load balancing for
> > AF_PACKET
> > and I've got some problem to get real code inside the EBPF filter.
> > 
> > I've tried different command lines in the build process. One of
> > them
> > is:
> > clang-3.9 -Wall -O2 -emit-llvm -c hash_ports.c -o - | llc-3.9
> > -march=bpf -filetype=obj -o hash_ports.bpf
> > 
> > If I use that one, then the generated code is almost void. If I
> > remove
> > the -O2 then I've got a generated code that fails during load. When
> > not
> > using -O2, I manage to load a trivial filter (return of static
> > value).
> >  
> > The C code is the following (a derivative of http-simple-filter.c
> > used
> > for testing):
> > 
> > int filter(struct __sk_buff *skb) {
> > uint8_t *cursor = 0;
> > struct ethernet_t *ethernet = cursor_advance(cursor,
> > sizeof(*ethernet));
> 
> this is bcc C syntax that is hiding the explicit load_byte/half/word
> operations
> we have to do when using plain C.
> If you want to compile C code with clang -O2 -target bpf file.c -c -o
> file.o
> and copy .o around to be used in tc like:
> tc filter add dev eth0 ingress bpf da obj file.o
> then plain C should be used like in all samples/bpf/*_kern.c
> examples.
> Other folks like the convenience of bcc that hides clang/llvm
> invocation.
> It mostly applicable to tracing tools where both bcc-C and
> corresponding python or lua bits are in the same file
> like in iovisor/bcc/tools/* scripts.
> The iovisor/bcc/examples/networking/* (where this http-simple-
> filter.c came from)
> are also suitable for networking and relying on pyroute2 to talk to
> kernel to create netns, veth and to attach bpf to qdisc.
> 
> In summary there are several ways to write bpf C code:
> 1. plain C syntax as in samples/bpf/*_kern.c
> Pro: compiles with clang into .o
> Con: .o requires elf loader (integrated into tc already for
> networking),

Yes, that's not an easy part. I've devel one loader for suricata but I
will check the one in tc to see if I can take advantage of it.

> but not friendly for tracing that needs recompile for every kernel
> due to unstable kprobes
> 2. bcc C syntax that compiles C on the fly in memory and loads
> directly
> Pro: there is no .o, no extra files, no need to install clang/llvm
> Con: bcc is not widely available yet. ubuntu and others already have
> it in apt.
> python and lua may not be for everyone. c++ api is not stable yet.

I need to include it into suricata which is C code. I've played with
bcc and it is a great tool but installation on the different platform
may be complicated for our users.

> 3. perf+bpf, it is similar to samples/pbf/ C style with few
> extensions.
> If .c file is passed, the perf calls external clang and loads .o
> eventually
> Pro: out-of-the-box perf and clang work well
> Con: not available for networking

Out of scope for me then.

> Sounds like you want to use it with af_packet then
> tools/testing/selftests/net/reuseport_bpf.c
> could be a good start too, but there bpf is written in asm.

Yes, bad point, asm is not really what I want. I want "normal advanced"
users to be able to edit the load balancing function.

> If you pick bcc style then iovisor-...@lists.iovisor.org mailing list
> is a good place to ask questions. Be sure to subscribe first, since
> it rejects non-subscriber emails to reduce spam.

Thanks a lot for all these explanations. You saved me days!

BR,
-- 

Eric Leblond 


Re: [very-RFC 6/8] Add TSN event-tracing

2016-06-12 Thread Steven Rostedt
On Sun, 12 Jun 2016 01:01:34 +0200
Henrik Austad  wrote:

> From: Henrik Austad 
> 
> This needs refactoring and should be updated to use TRACE_CLASS, but for
> now it provides a fair debug-window into TSN.
> 
> Cc: "David S. Miller" 
> Cc: Steven Rostedt  (maintainer:TRACING)
> Cc: Ingo Molnar  (maintainer:TRACING)
> Signed-off-by: Henrik Austad 
> ---
>  include/trace/events/tsn.h | 349 
> +
>  1 file changed, 349 insertions(+)
>  create mode 100644 include/trace/events/tsn.h
> 
> diff --git a/include/trace/events/tsn.h b/include/trace/events/tsn.h
> new file mode 100644
> index 000..ac1f31b
> --- /dev/null
> +++ b/include/trace/events/tsn.h
> @@ -0,0 +1,349 @@
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM tsn
> +
> +#if !defined(_TRACE_TSN_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_TSN_H
> +
> +#include 
> +#include 
> +
> +#include 
> +#include 
> +/* #include  */
> +
> +/* FIXME: update to TRACE_CLASS to reduce overhead */

I'm curious to why I didn't do this now. A class would make less
duplication of typing too ;-)

> +TRACE_EVENT(tsn_buffer_write,
> +
> + TP_PROTO(struct tsn_link *link,
> + size_t bytes),
> +
> + TP_ARGS(link, bytes),
> +
> + TP_STRUCT__entry(
> + __field(u64, stream_id)
> + __field(size_t, size)
> + __field(size_t, bsize)
> + __field(size_t, size_left)
> + __field(void *, buffer)
> + __field(void *, head)
> + __field(void *, tail)
> + __field(void *, end)
> + ),
> +
> + TP_fast_assign(
> + __entry->stream_id = link->stream_id;
> + __entry->size = bytes;
> + __entry->bsize = link->used_buffer_size;
> + __entry->size_left = (link->head - link->tail) % 
> link->used_buffer_size;

Move this logic into the print statement, since you save head and tail.

> + __entry->buffer = link->buffer;
> + __entry->head = link->head;
> + __entry->tail = link->tail;
> + __entry->end = link->end;
> + ),
> +
> + TP_printk("stream_id=%llu, copy=%zd, buffer: %zd, avail=%zd, 
> [buffer=%p, head=%p, tail=%p, end=%p]",
> + __entry->stream_id, __entry->size, __entry->bsize, 
> __entry->size_left,

 __entry->stream_id, __entry->size, __entry->bsize,
 (__entry->head - __entry->tail) % __entry->bsize,

> + __entry->buffer,__entry->head, __entry->tail,  __entry->end)
> +
> + );
> +
> +TRACE_EVENT(tsn_buffer_write_net,
> +
> + TP_PROTO(struct tsn_link *link,
> + size_t bytes),
> +
> + TP_ARGS(link, bytes),
> +
> + TP_STRUCT__entry(
> + __field(u64, stream_id)
> + __field(size_t, size)
> + __field(size_t, bsize)
> + __field(size_t, size_left)
> + __field(void *, buffer)
> + __field(void *, head)
> + __field(void *, tail)
> + __field(void *, end)
> + ),
> +
> + TP_fast_assign(
> + __entry->stream_id = link->stream_id;
> + __entry->size = bytes;
> + __entry->bsize = link->used_buffer_size;
> + __entry->size_left = (link->head - link->tail) % 
> link->used_buffer_size;
> + __entry->buffer = link->buffer;
> + __entry->head = link->head;
> + __entry->tail = link->tail;
> + __entry->end = link->end;
> + ),
> +
> + TP_printk("stream_id=%llu, copy=%zd, buffer: %zd, avail=%zd, 
> [buffer=%p, head=%p, tail=%p, end=%p]",
> + __entry->stream_id, __entry->size, __entry->bsize, 
> __entry->size_left,
> + __entry->buffer,__entry->head, __entry->tail,  __entry->end)
> +
> + );
> +
> +
> +TRACE_EVENT(tsn_buffer_read,
> +
> + TP_PROTO(struct tsn_link *link,
> + size_t bytes),
> +
> + TP_ARGS(link, bytes),
> +
> + TP_STRUCT__entry(
> + __field(u64, stream_id)
> + __field(size_t, size)
> + __field(size_t, bsize)
> + __field(size_t, size_left)
> + __field(void *, buffer)
> + __field(void *, head)
> + __field(void *, tail)
> + __field(void *, end)
> + ),
> +
> + TP_fast_assign(
> + __entry->stream_id = link->stream_id;
> + __entry->size = bytes;
> + __entry->bsize = link->used_buffer_size;
> + __entry->size_left = (link->head - link->tail) % 
> link->used_buffer_size;
> + __entry->buffer = link->buffer;
> + __entry->head = link->head;
> + __entry->tail = link->tail;
> + __entry->end = link->end;
> + ),
> +
> + TP_printk("stream_id=%llu, copy=%zd, buffer: %zd, 

Re: [PATCH] net: alx: Work around the DMA RX overflow issue

2016-06-12 Thread Eric Dumazet
On Sun, 2016-06-12 at 17:36 +0800, Feng Tang wrote:
> Commit 26c5f03 uses a new skb allocator to avoid the RFD overflow
> issue.
> 
> But from debugging without datasheet, we found the error always
> happen when the DMA RX address is set to 0xfc0, which is very
> likely to be a HW/silicon problem.
> 
> So one idea is instead of adding a new allocator, why not just
> hitting the right target by avaiding the error-prone DMA address?
> 
> This patch will actually
> * Remove the commit 26c5f03
> * Apply rx skb with 64 bytes longer space, and if the allocated skb
>   has a 0x...fc0 address, it will use skb_resever(skb, 64) to
>   advance the address, so that the RX overflow can be avoided.
> 
> In theory this method should also apply to atl1c driver, which
> I can't find anyone who can help to test on real devices.
> 
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=70761
> Signed-off-by: Feng Tang 
> Suggested-by: Eric Dumazet 
> Tested-by: Ole Lukoie 
> ---
>  drivers/net/ethernet/atheros/alx/alx.h  |  4 ---
>  drivers/net/ethernet/atheros/alx/main.c | 61 
> -
>  2 files changed, 14 insertions(+), 51 deletions(-)

Acked-by: Eric Dumazet 

Thanks !




Re: [PATCH v3] udp reuseport: fix packet of same flow hashed to different socket

2016-06-12 Thread Eric Dumazet
On Sun, 2016-06-12 at 20:43 +0800, Su Xuemin wrote:
> From: "Su, Xuemin" 
...

> Signed-off-by: Su, Xuemin 
> Signed-off-by: Eric Dumazet 
> ---

First, I want to thank you for this very high quality submission,
especially if this is your first linux kernel patch.

I have one additional comment to make :



>   if (score > badness) {
>   reuseport = sk->sk_reuseport;
> @@ -556,14 +510,20 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 
> saddr,
> daddr, hnum, dif,
> hslot2, slot2, skb);
>   if (!result) {
> + unsigned int old = hash2;
>   hash2 = udp4_portaddr_hash(net, htonl(INADDR_ANY), 
> hnum);
> +
> + /* avoid search the same slot again. */
> + if (unlikely(old == hash2))
> + return result;
> +

Technically speaking, what matters is the slot, not the hash value
(32bit)

So I would save in old, slot2

>   slot2 = hash2 & udptable->mask;

And here perform the check
if (unlikely(slot2 != old_slot))
return result;

>   hslot2 = >hash2[slot2];
>   if (hslot->count < hslot2->count)
>   goto begin;


(Same remark applies in IPv6)

Thanks !




Re: [PATCH ipvs-next] ipvs: count pre-established TCP states as active

2016-06-12 Thread Julian Anastasov

Hello,

On Fri, 3 Jun 2016, Michal Kubecek wrote:

> Some users observed that "least connection" distribution algorithm doesn't
> handle well bursts of TCP connections from reconnecting clients after
> a node or network failure.
> 
> This is because the algorithm counts active connection as worth 256
> inactive ones where for TCP, "active" only means TCP connections in
> ESTABLISHED state. In case of a connection burst, new connections are
> handled before previous ones have finished the three way handshaking so
> that all are still counted as "inactive", i.e. cheap ones. The become
> "active" quickly but at that time, all of them are already assigned to one
> real server (or few), resulting in highly unbalanced distribution.
> 
> Address this by counting the "pre-established" states as "active".
> 
> Signed-off-by: Michal Kubecek 

Acked-by: Julian Anastasov 

Simon, please apply!

> ---
>  net/netfilter/ipvs/ip_vs_proto_tcp.c | 25 +++--
>  1 file changed, 23 insertions(+), 2 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_proto_tcp.c 
> b/net/netfilter/ipvs/ip_vs_proto_tcp.c
> index d7024b2ed769..5117bcb7d2f0 100644
> --- a/net/netfilter/ipvs/ip_vs_proto_tcp.c
> +++ b/net/netfilter/ipvs/ip_vs_proto_tcp.c
> @@ -395,6 +395,20 @@ static const char *const 
> tcp_state_name_table[IP_VS_TCP_S_LAST+1] = {
>   [IP_VS_TCP_S_LAST]  =   "BUG!",
>  };
>  
> +static const bool tcp_state_active_table[IP_VS_TCP_S_LAST] = {
> + [IP_VS_TCP_S_NONE]  =   false,
> + [IP_VS_TCP_S_ESTABLISHED]   =   true,
> + [IP_VS_TCP_S_SYN_SENT]  =   true,
> + [IP_VS_TCP_S_SYN_RECV]  =   true,
> + [IP_VS_TCP_S_FIN_WAIT]  =   false,
> + [IP_VS_TCP_S_TIME_WAIT] =   false,
> + [IP_VS_TCP_S_CLOSE] =   false,
> + [IP_VS_TCP_S_CLOSE_WAIT]=   false,
> + [IP_VS_TCP_S_LAST_ACK]  =   false,
> + [IP_VS_TCP_S_LISTEN]=   false,
> + [IP_VS_TCP_S_SYNACK]=   true,
> +};
> +
>  #define sNO IP_VS_TCP_S_NONE
>  #define sES IP_VS_TCP_S_ESTABLISHED
>  #define sSS IP_VS_TCP_S_SYN_SENT
> @@ -418,6 +432,13 @@ static const char * tcp_state_name(int state)
>   return tcp_state_name_table[state] ? tcp_state_name_table[state] : "?";
>  }
>  
> +static bool tcp_state_active(int state)
> +{
> + if (state >= IP_VS_TCP_S_LAST)
> + return false;
> + return tcp_state_active_table[state];
> +}
> +
>  static struct tcp_states_t tcp_states [] = {
>  /*   INPUT */
>  /*sNO, sES, sSS, sSR, sFW, sTW, sCL, sCW, sLA, sLI, sSA  */
> @@ -540,12 +561,12 @@ set_tcp_state(struct ip_vs_proto_data *pd, struct 
> ip_vs_conn *cp,
>  
>   if (dest) {
>   if (!(cp->flags & IP_VS_CONN_F_INACTIVE) &&
> - (new_state != IP_VS_TCP_S_ESTABLISHED)) {
> + !tcp_state_active(new_state)) {
>   atomic_dec(>activeconns);
>   atomic_inc(>inactconns);
>   cp->flags |= IP_VS_CONN_F_INACTIVE;
>   } else if ((cp->flags & IP_VS_CONN_F_INACTIVE) &&
> -(new_state == IP_VS_TCP_S_ESTABLISHED)) {
> +tcp_state_active(new_state)) {
>   atomic_inc(>activeconns);
>   atomic_dec(>inactconns);
>   cp->flags &= ~IP_VS_CONN_F_INACTIVE;
> -- 
> 2.8.3

Regards

--
Julian Anastasov 


[PATCH v3] udp reuseport: fix packet of same flow hashed to different socket

2016-06-12 Thread Su Xuemin
From: "Su, Xuemin" 

There is a corner case in which udp packets belonging to a same
flow are hashed to different socket when hslot->count changes from 10
to 11:

1) When hslot->count <= 10, __udp_lib_lookup() searches udp_table->hash,
and always passes 'daddr' to udp_ehashfn().

2) When hslot->count > 10, __udp_lib_lookup() searches udp_table->hash2,
but may pass 'INADDR_ANY' to udp_ehashfn() if the sockets are bound to
INADDR_ANY instead of some specific addr.

That means when hslot->count changes from 10 to 11, the hash calculated by
udp_ehashfn() is also changed, and the udp packets belonging to a same
flow will be hashed to different socket.

This is easily reproduced:
1) Create 10 udp sockets and bind all of them to 0.0.0.0:4.
2) From the same host send udp packets to 127.0.0.1:4, record the
socket index which receives the packets.
3) Create 1 more udp socket and bind it to 0.0.0.0:44096. The number 44096
is 4 + UDP_HASH_SIZE(4096), this makes the new socket put into the
same hslot as the aformentioned 10 sockets, and makes the hslot->count
change from 10 to 11.
4) From the same host send udp packets to 127.0.0.1:4, and the socket
index which receives the packets will be different from the one received
in step 2.
This should not happen as the socket bound to 0.0.0.0:44096 should not
change the behavior of the sockets bound to 0.0.0.0:4.

It's the same case for IPv6, and this patch also fixes that.

Signed-off-by: Su, Xuemin 
Signed-off-by: Eric Dumazet 
---
I use this tree to generate the patch:
  git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

 net/ipv4/udp.c | 68 --
 net/ipv6/udp.c | 66 
 2 files changed, 28 insertions(+), 106 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 0ff31d9..586f1b0 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -391,9 +391,9 @@ int udp_v4_get_port(struct sock *sk, unsigned short snum)
return udp_lib_get_port(sk, snum, ipv4_rcv_saddr_equal, hash2_nulladdr);
 }
 
-static inline int compute_score(struct sock *sk, struct net *net,
-   __be32 saddr, unsigned short hnum, __be16 sport,
-   __be32 daddr, __be16 dport, int dif)
+static int compute_score(struct sock *sk, struct net *net,
+__be32 saddr, __be16 sport,
+__be32 daddr, unsigned short hnum, int dif)
 {
int score;
struct inet_sock *inet;
@@ -434,52 +434,6 @@ static inline int compute_score(struct sock *sk, struct 
net *net,
return score;
 }
 
-/*
- * In this second variant, we check (daddr, dport) matches (inet_rcv_sadd, 
inet_num)
- */
-static inline int compute_score2(struct sock *sk, struct net *net,
-__be32 saddr, __be16 sport,
-__be32 daddr, unsigned int hnum, int dif)
-{
-   int score;
-   struct inet_sock *inet;
-
-   if (!net_eq(sock_net(sk), net) ||
-   ipv6_only_sock(sk))
-   return -1;
-
-   inet = inet_sk(sk);
-
-   if (inet->inet_rcv_saddr != daddr ||
-   inet->inet_num != hnum)
-   return -1;
-
-   score = (sk->sk_family == PF_INET) ? 2 : 1;
-
-   if (inet->inet_daddr) {
-   if (inet->inet_daddr != saddr)
-   return -1;
-   score += 4;
-   }
-
-   if (inet->inet_dport) {
-   if (inet->inet_dport != sport)
-   return -1;
-   score += 4;
-   }
-
-   if (sk->sk_bound_dev_if) {
-   if (sk->sk_bound_dev_if != dif)
-   return -1;
-   score += 4;
-   }
-
-   if (sk->sk_incoming_cpu == raw_smp_processor_id())
-   score++;
-
-   return score;
-}
-
 static u32 udp_ehashfn(const struct net *net, const __be32 laddr,
   const __u16 lport, const __be32 faddr,
   const __be16 fport)
@@ -492,7 +446,7 @@ static u32 udp_ehashfn(const struct net *net, const __be32 
laddr,
  udp_ehash_secret + net_hash_mix(net));
 }
 
-/* called with read_rcu_lock() */
+/* called with rcu_read_lock() */
 static struct sock *udp4_lib_lookup2(struct net *net,
__be32 saddr, __be16 sport,
__be32 daddr, unsigned int hnum, int dif,
@@ -506,7 +460,7 @@ static struct sock *udp4_lib_lookup2(struct net *net,
result = NULL;
badness = 0;
udp_portaddr_for_each_entry_rcu(sk, >head) {
-   score = compute_score2(sk, net, saddr, sport,
+   score = compute_score(sk, net, saddr, sport,
  daddr, hnum, dif);
if (score > badness) {
reuseport = sk->sk_reuseport;
@@ -556,14 

Re: [PATCH net-next] ethtool: Macro definition for SFF-8436/8636 Memory map max sizes

2016-06-12 Thread Ben Hutchings
On Sat, 2016-06-11 at 19:26 -0700, David Miller wrote:
> From: Vidya Sagar Ravipati 
> Date: Sat, 11 Jun 2016 16:22:38 -0700
> 
> > As part of ethtool application, application is requesting  the drivers
> > to provide the supported eeprom size to allocate memory buffer for
> > getting complete dump.
> 
> And the right way to do that is the driver requests the eeprom info
> with a buffer size of zero, then the driver fills in the size field
> for what the size actually is.
> 
> Then the application can allocate the proper buffer size and rerun
> the eeprom request.
> 
> Putting endless values for each and every eeprom type a device has is
> just rediculous.
> 
> I'm not going to continue promoting this broken and unscalable scheme,
> we have to fix this.

I don't think there's nothing broken here.  ethtool doesn't use those
macros, the drivers do.

Ben.

-- 
Ben Hutchings
The program is absolutely right; therefore, the computer must be wrong.


signature.asc
Description: This is a digitally signed message part


Re: [PATCH net-next] ethtool: Macro definition for SFF-8436/8636 Memory map max sizes

2016-06-12 Thread Ben Hutchings
On Sat, 2016-06-11 at 15:51 -0700, David Miller wrote:
[...]
> Why do we need these values in the header file at all?

Because we don't like putting magic numbers in driver code, and these
sizes are defined by standards that are independent of a single driver.

> The application can probe the size by repeated eeprom calls, increasing
> the buffer size each time as needed until success.

But really it should use ETHTOOL_GMODULEINFO first.

Ben.

-- 
Ben Hutchings
The program is absolutely right; therefore, the computer must be wrong.


signature.asc
Description: This is a digitally signed message part


Re: [linux-sunxi] [PATCH 0/5] net-next: ethernet: add sun8i-emac driver

2016-06-12 Thread Hans de Goede

Hi,

On 03-06-16 11:56, LABBE Corentin wrote:

Hello

This patch series add the driver for sun8i-emac which handle the Ethernet MAC
present on Allwinner H3/A83T/A64 SoCs.

It supports 10/100/1000 Mbit/s speed with half/full duplex.
It can use an internal PHY (MII 10/100) or an external PHY
via RGMII/RMII.

This patch series enable the driver only for the H3 SoC since A83T and A64
doesn't have the necessary clocks present in mainline.

This patch series enable the driver only for the OrangePiPC board since other
board with H3 use external PHY which need optional regulators that will be
supported later.

The driver have been tested on the following boards:
- H3 Orange PI PC, Orange PI Plus, BananaPI-M2+
- A64 Pine64
- A83T BananaPI-M3

I would like to thanks Chen-Yu Tsai for his help on developing this driver.


Awesome! Thanks for your work on this.

I've tested this on a orangepi-one. You can add my:

Tested-by: Hans de Goede 

For the next version of this series.

Attached is a dts patch for adding support
for the emac one that one (to be applied on top of this patch-set).

I've a Orange PI Plus and a Orange PI 2 as well, both of which use an external
phy, I would love to test on those as well, do you've a tree with adds the
necessary regulator bits and an example of how this would like in dts
somewhere ?

One remark about:
[PATCH 5/5] ARM: dts: sun8i: Enable sun8i-emac on the Orange PI PC

We always keep the nodes in board.dts files sorted alphebetically,
so the emac node should be after the ehci node rather then at the end.

Regards,

Hans



Regards

LABBE Corentin (5):
  ethernet: add sun8i-emac driver
  MAINTAINERS: Add myself as maintainers of sun8i-emac
  ARM: sun8i: dt: Add DT bindings documentation for Allwinner sun8i-emac
  ARM: dts: sun8i-h3: add sun8i-emac ethernet driver
  ARM: dts: sun8i: Enable sun8i-emac on the Orange PI PC

 .../bindings/net/allwinner,sun8i-emac.txt  |   64 +
 MAINTAINERS|6 +
 arch/arm/boot/dts/sun8i-h3-orangepi-pc.dts |   11 +
 arch/arm/boot/dts/sun8i-h3.dtsi|   14 +
 drivers/net/ethernet/allwinner/Kconfig |   13 +
 drivers/net/ethernet/allwinner/Makefile|1 +
 drivers/net/ethernet/allwinner/sun8i-emac.c| 1943 
 7 files changed, 2052 insertions(+)
 create mode 100644 
Documentation/devicetree/bindings/net/allwinner,sun8i-emac.txt
 create mode 100644 drivers/net/ethernet/allwinner/sun8i-emac.c

>From 25ea802c4c2962dc50c34bd02faa9a5f3acd2e34 Mon Sep 17 00:00:00 2001
From: Hans de Goede 
Date: Sun, 12 Jun 2016 13:20:05 +0200
Subject: [PATCH v3] ARM: dts: sun8i: Enable sun8i-emac on the Orange PI One

The sun8i-emac hardware is present on the Orange PI One.
It uses the internal PHY.

This patch create the needed emac and phy nodes.

Signed-off-by: Hans de Goede 
---
 arch/arm/boot/dts/sun8i-h3-orangepi-one.dts | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/arch/arm/boot/dts/sun8i-h3-orangepi-one.dts b/arch/arm/boot/dts/sun8i-h3-orangepi-one.dts
index 0adf932..8df5c74 100644
--- a/arch/arm/boot/dts/sun8i-h3-orangepi-one.dts
+++ b/arch/arm/boot/dts/sun8i-h3-orangepi-one.dts
@@ -94,6 +94,17 @@
 	status = "okay";
 };
 
+ {
+	phy = <>;
+	phy-mode = "mii";
+	allwinner,use-internal-phy;
+	allwinner,leds-active-low;
+	status = "okay";
+	phy1: ethernet-phy@1 {
+		reg = <1>;
+	};
+};
+
  {
 	pinctrl-names = "default";
 	pinctrl-0 = <_pins_a>, <_cd_pin>;
-- 
2.7.4



Re: [PATCH] netlink.7: describe netlink socket options

2016-06-12 Thread Michael Kerrisk (man-pages)
Hi Andrey,

On 06/10/2016 10:28 PM, Andrey Vagin wrote:
> Cc: Kir Kolyshkin 
> Cc: Michael Kerrisk 
> Cc: Herbert Xu 
> Cc: Patrick McHardy 
> Cc: Christophe Ricard 
> Cc: Nicolas Dichtel 
> Signed-off-by: Andrey Vagin 
> ---
>  man7/netlink.7 | 75 
> ++
>  1 file changed, 75 insertions(+)


Thanks for the nicely done patch. Applied!

Cheers,

Michael


> diff --git a/man7/netlink.7 b/man7/netlink.7
> index 513f854..b4848df 100644
> --- a/man7/netlink.7
> +++ b/man7/netlink.7
> @@ -368,6 +368,81 @@ and
>  .BR NETLINK_SELINUX
>  groups allow other users to receive messages.
>  No groups allow other users to send messages.
> +
> +.SS Socket options
> +To set or get a netlink socket option, call
> +.BR getsockopt (2)
> +to read or
> +.BR setsockopt (2)
> +to write the option with the option level argument set to
> +.BR SOL_NETLINK .
> +Unless otherwise noted,
> +.I optval
> +is a pointer to an
> +.IR int .
> +.TP
> +.BR NETLINK_PKTINFO " (since Linux 2.6.14)"
> +Enable
> +.B nl_pktinfo
> +control messages for received packets to get the extended
> +destination group number.
> +.TP
> +.BR NETLINK_ADD_MEMBERSHIP ,\  NETLINK_DROP_MEMBERSHIP " (since Linux 
> 2.6.14)"
> +Join/leave a group specified by
> +.IR optval .
> +.\"  commit 9a4595bc7e67962f13232ee55a64e063062c3a99
> +.\"  Author: Patrick McHardy 
> +.TP
> +.BR NETLINK_LIST_MEMBERSHIPS " (since Linux 4.2)"
> +Retrieve all groups a socket is a member of.
> +.I optval
> +is a pointer to
> +.B __u32
> +and
> +.I optlen
> +is the size of the array. The array is filled with the full membership set 
> of the
> +socket, and the required array size is returned in
> +.I optlen.
> +.\"  commit b42be38b2778eda2237fc759e55e3b698b05b315
> +.\"  Author: David Herrmann 
> +.TP
> +.BR NETLINK_BROADCAST_ERROR " (since Linux 2.6.30)"
> +When not set,
> +.B netlink_broadcast()
> +only reports
> +.B ESRCH
> +errors and silently ignore
> +.B NOBUFS
> +errors.
> +.\"  commit be0c22a46cfb79ab2342bb28fde99afa94ef868e
> +.\"  Author: Pablo Neira Ayuso 
> +.TP
> +.BR NETLINK_NO_ENOBUFS " (since Linux 2.6.30)"
> +This flag can be used by unicast and broadcast listeners to avoid receiving
> +.B ENOBUFS
> +errors.
> +.\"  commit 38938bfe3489394e2eed5e40c9bb8f66a2ce1405
> +.\"  Author: Pablo Neira Ayuso 
> +.TP
> +.BR NETLINK_LISTEN_ALL_NSID " (since Linux 4.2)"
> +When set, this socket will receive netlink notifications from all network 
> namespaces that
> +have an
> +.I nsid
> +assigned into the network namespace where the socket has been opened. The
> +.I nsid
> +is sent to user space via an ancillary data.
> +.\"  commit 59324cf35aba5336b611074028777838a963d03b
> +.\"  Author: Nicolas Dichtel 
> +.TP
> +.BR NETLINK_CAP_ACK " (since Linux 4.2)"
> +The kernel may fail to allocate the necessary room for the acknowledgment
> +message back to userspace. This option trims off the payload of the original
> +netlink message.
> +The netlink message header is still included, so the user can guess from the
> +sequence number what is the message that has triggered the acknowledgment.
> +.\"  commit 0a6a3a23ea6efde079a5b77688541a98bf202721
> +.\"  Author: Christophe Ricard 
> +
>  .SH VERSIONS
>  The socket interface to netlink is a new feature of Linux 2.2.
>  
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [very-RFC 0/8] TSN driver for the kernel

2016-06-12 Thread Takashi Sakamoto
On Jun 12 2016 17:31, Henrik Austad wrote:
> On Sun, Jun 12, 2016 at 01:30:24PM +0900, Takashi Sakamoto wrote:
>> On Jun 12 2016 12:38, Takashi Sakamoto wrote:
>>> In your patcset, there's no actual codes about how to handle any
>>> interrupt contexts (software / hardware), how to handle packet payload,
>>> and so on. Especially, for recent sound subsystem, the timing of
>>> generating interrupts and which context does what works are important to
>>> reduce playback/capture latency and power consumption.
>>>
>>> Of source, your intention of this patchset is to show your early concept
>>> of TSN feature. Nevertheless, both of explaination and codes are
>>> important to the other developers who have little knowledges about TSN,
>>> AVB and AES-64 such as me.
>>
>> Oops. Your 5th patch was skipped by alsa-project.org. I guess that size
>> of the patch is too large to the list service. I can see it:
>> http://marc.info/?l=linux-netdev=146568672728661=2
>>
>> As long as seeing the patch, packets are queueing in hrtimer callbacks
>> every 1 seconds.
> 
> Actually, the hrtimer fires every 1ms, and that part is something I have to 
> do something about, also because it sends of the same number of frames 
> every time, regardless of how accurate the internal timer is to the rest of 
> the network (there's no backpressure from the networking layer).
> 
>> (This is a high level discussion and it's OK to ignore it for the
>> moment. When writing packet-oriented drivers for sound subsystem, you
>> need to pay special attention to accuracy of the number of PCM frames
>> transferred currently, and granularity of the number of PCM frames
>> transferred by one operation. In this case, snd_avb_hw,
>> snd_avb_pcm_pointer(), tsn_buffer_write_net() and tsn_buffer_read_net()
>> are involved in this discussion. You can see ALSA developers' struggle
>> in USB audio device class drivers and (of cource) IEC 61883-1/6 drivers.)
> 
> Ah, good point. Any particular parts of the USB-subsystem I should start 
> looking at?

I don't think it's a beter way for you to study USB Audio Device Class
driver unless you're interested in ALSA or USB subsystem.

(But for your information, snd-usb-audio is in sound/usb/* of Linux
kernel. IEC 61883-1/6 driver is in sound/firewire/*.)

We need different strategy to achieve it on different transmission backend.

> Knowing where to start looking is a tremendous help

It's not well-documented, and not well-generalized for packet-oriented
drivers. Most of developers who have enough knowledge about it work for
DMA-oriented drivers in mobile platforms and have little interests in
packet-oriented drivers. You need to find your own way.

Currently I have few advices to you, because I'm also on the way for
drivers to process IEC 61883-1/6 packets on IEEE 1394 bus with enough
accuracy and granularity. The paper I introduced is for the way (but not
mature).

I wish you get more helps from the other developers. Your work is more
significant to Linux system, than mine.

(And I hope your future work get no ignorance and no unreasonable
hostility from coarse users.)


Regards

Takashi Sakamoto



signature.asc
Description: OpenPGP digital signature


Re: [PATCH net] fib_rules: don't break ECN with TOS rules

2016-06-12 Thread Hannes Frederic Sowa
On Sun, Jun 12, 2016, at 11:28, Julian Anastasov wrote:
> 
>   Hello,
> 
> On Sun, 12 Jun 2016, Hannes Frederic Sowa wrote:
> 
> > On Sun, Jun 12, 2016, at 02:09, Julian Anastasov wrote:
> > > 
> > >   Well, may be the confusion comes from commit 89aef8921bfb
> > > ("ipv4: Delete routing cache.") where the 'tos &= IPTOS_RT_MASK;'
> > > line is lost from ip_route_input_common. I think, we should
> > > add it back, so that we can properly match input routes with rules
> > > that specify tos value. Old kernels didn't stored ECN bits in
> > > flowi4_tos in the input path, so we should do the same.
> > 
> > I would love to have done that but was fearing problems with user space
> > compatibility. Also IPTOS_RT_MASK is not enough for filtering, we need
> > to check for the whole INET_ECN_MASK.
> 
>   I checked even 2.4.33 for reference. Nobody sets bit 1
> in key->tos but using IPTOS_TOS_MASK on configuration allows
> rules with 'tos 2'. I guess, in this case such rule can not be
> matched. And the TOS matching was used only under
> CONFIG_IP_ROUTE_TOS, so the ip rules were supposed to match only
> bits 2..4, not full TOS values.

Thanks for looking it up! :)

> > > > +   INET_ECN_ignore(r->tos) != INET_ECN_ignore(fl4->flowi4_tos))
> > > > return 0;
> > > 
> > >   fib4_rule_configure already rejects ECN bits in r->tos,
> > > so no need to filter them again in fast path.
> > 
> > The problem is that IPTOS_TOS_MASK is just the masking of the 0x1 bit,
> > not both ECN bits. Thi stems from the pre-DSCP time where it was
> > forbidden to use bit 0x1 in TOS.
> 
>   I see, I didn't noticed it. If we decide to reject
> rules with tos 2 we can break scripts. May be we should just
> filter it:
> 
>   if (r->tos && ((r->tos & IPTOS_RT_MASK) != fl4->flowi4_tos))
> 
>   This will allow the hidden feature 'ip rule add tos 2' to
> match properly bits2..4 == 0 :) But may be we should not spend
> extra cycles in fast path for such feature. As nobody complained
> for problems with 3.6+, it seems match by tos is not used often.

Not sure if people really noticed or it just quietly broke ecn. :/

> May be we just need to restore the missing 'tos &= IPTOS_RT_MASK;'
> line?

I fear we already use those lower bits for other flags, like RTO_ONLINK.
I will study this today and will report back.

My last proposal would be to simply enclose my block with a if (r->tos)
{ ... }. Not sure, yet.

Bye,
Hannes


[PATCH net] ipv4: fix checksum annotation in udp4_csum_init

2016-06-12 Thread Hannes Frederic Sowa
Reported-by: Cong Wang 
Cc: Cong Wang 
Cc: Tom Herbert 
Fixes: 4068579e1e098fa ("net: Implmement RFC 6936 (zero RX csums for UDP/IPv6")
Signed-off-by: Hannes Frederic Sowa 
---
 net/ipv4/udp.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 0ff31d97d48586..ba0d8b8b76900a 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1755,8 +1755,11 @@ static inline int udp4_csum_init(struct sk_buff *skb, 
struct udphdr *uh,
return err;
}
 
-   return skb_checksum_init_zero_check(skb, proto, uh->check,
-   inet_compute_pseudo);
+   /* Note, we are only interested in != 0 or == 0, thus the
+* force to int.
+*/
+   return (__force int)skb_checksum_init_zero_check(skb, proto, uh->check,
+inet_compute_pseudo);
 }
 
 /*
-- 
2.5.5



Re: [PATCH v9 11/22] IB/hns: Add IB device registration

2016-06-12 Thread Wei Hu (Xavier)



On 2016/6/9 14:26, Leon Romanovsky wrote:

On Wed, Jun 01, 2016 at 11:37:53PM +0800, Lijun Ou wrote:

This patch registered IB device when loaded, and unregistered
IB device when removed.

Signed-off-by: Wei Hu 
Signed-off-by: Nenglong Zhao 
Signed-off-by: Lijun Ou 
---
  drivers/infiniband/hw/hns/hns_roce_main.c | 46 +++
  1 file changed, 46 insertions(+)

diff --git a/drivers/infiniband/hw/hns/hns_roce_main.c 
b/drivers/infiniband/hw/hns/hns_roce_main.c
index 7fb0d34..f179a7f 100644
--- a/drivers/infiniband/hw/hns/hns_roce_main.c
+++ b/drivers/infiniband/hw/hns/hns_roce_main.c
@@ -62,6 +62,41 @@
  #include "hns_roce_device.h"
  #include "hns_roce_icm.h"
  
+void hns_roce_unregister_device(struct hns_roce_dev *hr_dev)

You are not calling to this function in this patch.


+{
+   ib_unregister_device(_dev->ib_dev);
+}
+
+int hns_roce_register_device(struct hns_roce_dev *hr_dev)

This function should be static.


+{
+   int ret;
+   struct hns_roce_ib_iboe *iboe = NULL;
+   struct ib_device *ib_dev = NULL;
+   struct device *dev = _dev->pdev->dev;
+
+   iboe = _dev->iboe;
+
+   ib_dev = _dev->ib_dev;
+   strlcpy(ib_dev->name, "hisi_%d", IB_DEVICE_NAME_MAX);
+
+   ib_dev->owner= THIS_MODULE;
+   ib_dev->node_type= RDMA_NODE_IB_CA;
+   ib_dev->dma_device   = dev;
+
+   ib_dev->phys_port_cnt= hr_dev->caps.num_ports;
+   ib_dev->local_dma_lkey   = hr_dev->caps.reserved_lkey;
+   ib_dev->num_comp_vectors = hr_dev->caps.num_comp_vectors;
+   ib_dev->uverbs_abi_ver   = 1;
+
+   ret = ib_register_device(ib_dev, NULL);
+   if (ret) {
+   dev_err(dev, "ib_register_device failed!\n");
+   return ret;
+   }
+
+   return 0;
+}
+
  int hns_roce_get_cfg(struct hns_roce_dev *hr_dev)
  {
int i;
@@ -363,6 +398,17 @@ static int hns_roce_probe(struct platform_device *pdev)
goto error_failed_engine_init;
}
  
+	ret = hns_roce_register_device(hr_dev);

+   if (ret) {
+   dev_err(dev, "register_device failed!\n");

According to the current code, you will print this error together with
error line in hns_roce_register_device for the same failure.

"ib_register_device failed!"
"register_device failed!"

Hi, leon
In this patch [PATCH v9 11/22], there is only one error branch in 
funtion named hns_roce_register_device.
In the following patch [PATCH v9 13/22], we add more operation, 
there are more

than two error branch in this function as below.

 @@ -212,8 +402,27 @@ int hns_roce_register_device(struct hns_roce_dev *hr_dev)
goto error_failed_setup_mtu_gids;
}
 
	spin_lock_init(>lock);


iboe->nb.notifier_call = hns_roce_netdev_event;
ret = register_netdevice_notifier(>nb);
if (ret) {
dev_err(dev, "register_netdevice_notifier failed!\n");
goto error_failed_setup_mtu_gids;
}

iboe->nb_inet.notifier_call = hns_roce_inet_event;
ret = register_inetaddr_notifier(>nb_inet);
if (ret) {
dev_err(dev, "register inet addr notifier failed!\n");
goto error_failed_register_inetaddr_notifier;
}

return 0;
 
 error_failed_register_inetaddr_notifier:

unregister_netdevice_notifier(>nb);

 error_failed_setup_mtu_gids:
ib_unregister_device(ib_dev);



Regards
Wei Hu



+   goto error_failed_register_device;
+   }
+
+   return 0;
+
+error_failed_register_device:
+   hns_roce_engine_exit(hr_dev);
+
  error_failed_engine_init:
hns_roce_cleanup_bitmap(hr_dev);
  
--

1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html





[PATCH] net: alx: Work around the DMA RX overflow issue

2016-06-12 Thread Feng Tang
Commit 26c5f03 uses a new skb allocator to avoid the RFD overflow
issue.

But from debugging without datasheet, we found the error always
happen when the DMA RX address is set to 0xfc0, which is very
likely to be a HW/silicon problem.

So one idea is instead of adding a new allocator, why not just
hitting the right target by avaiding the error-prone DMA address?

This patch will actually
* Remove the commit 26c5f03
* Apply rx skb with 64 bytes longer space, and if the allocated skb
  has a 0x...fc0 address, it will use skb_resever(skb, 64) to
  advance the address, so that the RX overflow can be avoided.

In theory this method should also apply to atl1c driver, which
I can't find anyone who can help to test on real devices.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=70761
Signed-off-by: Feng Tang 
Suggested-by: Eric Dumazet 
Tested-by: Ole Lukoie 
---
 drivers/net/ethernet/atheros/alx/alx.h  |  4 ---
 drivers/net/ethernet/atheros/alx/main.c | 61 -
 2 files changed, 14 insertions(+), 51 deletions(-)

diff --git a/drivers/net/ethernet/atheros/alx/alx.h 
b/drivers/net/ethernet/atheros/alx/alx.h
index d02c424..8fc93c5 100644
--- a/drivers/net/ethernet/atheros/alx/alx.h
+++ b/drivers/net/ethernet/atheros/alx/alx.h
@@ -96,10 +96,6 @@ struct alx_priv {
unsigned int rx_ringsz;
unsigned int rxbuf_size;
 
-   struct page  *rx_page;
-   unsigned int rx_page_offset;
-   unsigned int rx_frag_size;
-
struct napi_struct napi;
struct alx_tx_queue txq;
struct alx_rx_queue rxq;
diff --git a/drivers/net/ethernet/atheros/alx/main.c 
b/drivers/net/ethernet/atheros/alx/main.c
index c98acdc..e708e36 100644
--- a/drivers/net/ethernet/atheros/alx/main.c
+++ b/drivers/net/ethernet/atheros/alx/main.c
@@ -70,35 +70,6 @@ static void alx_free_txbuf(struct alx_priv *alx, int entry)
}
 }
 
-static struct sk_buff *alx_alloc_skb(struct alx_priv *alx, gfp_t gfp)
-{
-   struct sk_buff *skb;
-   struct page *page;
-
-   if (alx->rx_frag_size > PAGE_SIZE)
-   return __netdev_alloc_skb(alx->dev, alx->rxbuf_size, gfp);
-
-   page = alx->rx_page;
-   if (!page) {
-   alx->rx_page = page = alloc_page(gfp);
-   if (unlikely(!page))
-   return NULL;
-   alx->rx_page_offset = 0;
-   }
-
-   skb = build_skb(page_address(page) + alx->rx_page_offset,
-   alx->rx_frag_size);
-   if (likely(skb)) {
-   alx->rx_page_offset += alx->rx_frag_size;
-   if (alx->rx_page_offset >= PAGE_SIZE)
-   alx->rx_page = NULL;
-   else
-   get_page(page);
-   }
-   return skb;
-}
-
-
 static int alx_refill_rx_ring(struct alx_priv *alx, gfp_t gfp)
 {
struct alx_rx_queue *rxq = >rxq;
@@ -115,9 +86,22 @@ static int alx_refill_rx_ring(struct alx_priv *alx, gfp_t 
gfp)
while (!cur_buf->skb && next != rxq->read_idx) {
struct alx_rfd *rfd = >rfd[cur];
 
-   skb = alx_alloc_skb(alx, gfp);
+   /*
+* When DMA RX address is set to something like
+* 0xfc0, it will be very likely to cause DMA
+* RFD overflow issue.
+*
+* To work around it, we apply rx skb with 64 bytes
+* longer space, and offset the address whenever
+* 0xfc0 is detected.
+*/
+   skb = __netdev_alloc_skb(alx->dev, alx->rxbuf_size + 64, gfp);
if (!skb)
break;
+
+   if (((unsigned long)skb->data & 0xfff) == 0xfc0)
+   skb_reserve(skb, 64);
+
dma = dma_map_single(>hw.pdev->dev,
 skb->data, alx->rxbuf_size,
 DMA_FROM_DEVICE);
@@ -153,7 +137,6 @@ static int alx_refill_rx_ring(struct alx_priv *alx, gfp_t 
gfp)
alx_write_mem16(>hw, ALX_RFD_PIDX, cur);
}
 
-
return count;
 }
 
@@ -622,11 +605,6 @@ static void alx_free_rings(struct alx_priv *alx)
kfree(alx->txq.bufs);
kfree(alx->rxq.bufs);
 
-   if (alx->rx_page) {
-   put_page(alx->rx_page);
-   alx->rx_page = NULL;
-   }
-
dma_free_coherent(>hw.pdev->dev,
  alx->descmem.size,
  alx->descmem.virt,
@@ -681,7 +659,6 @@ static int alx_request_irq(struct alx_priv *alx)
  alx->dev->name, alx);
if (!err)
goto out;
-
/* fall back to legacy interrupt */
pci_disable_msi(alx->hw.pdev);
}
@@ -725,7 +702,6 @@ static int alx_init_sw(struct alx_priv *alx)
struct pci_dev *pdev = alx->hw.pdev;
struct alx_hw 

Re: [PATCH net] fib_rules: don't break ECN with TOS rules

2016-06-12 Thread Julian Anastasov

Hello,

On Sun, 12 Jun 2016, Hannes Frederic Sowa wrote:

> On Sun, Jun 12, 2016, at 02:09, Julian Anastasov wrote:
> > 
> > Well, may be the confusion comes from commit 89aef8921bfb
> > ("ipv4: Delete routing cache.") where the 'tos &= IPTOS_RT_MASK;'
> > line is lost from ip_route_input_common. I think, we should
> > add it back, so that we can properly match input routes with rules
> > that specify tos value. Old kernels didn't stored ECN bits in
> > flowi4_tos in the input path, so we should do the same.
> 
> I would love to have done that but was fearing problems with user space
> compatibility. Also IPTOS_RT_MASK is not enough for filtering, we need
> to check for the whole INET_ECN_MASK.

I checked even 2.4.33 for reference. Nobody sets bit 1
in key->tos but using IPTOS_TOS_MASK on configuration allows
rules with 'tos 2'. I guess, in this case such rule can not be
matched. And the TOS matching was used only under
CONFIG_IP_ROUTE_TOS, so the ip rules were supposed to match only
bits 2..4, not full TOS values.

> > > + INET_ECN_ignore(r->tos) != INET_ECN_ignore(fl4->flowi4_tos))
> > >   return 0;
> > 
> > fib4_rule_configure already rejects ECN bits in r->tos,
> > so no need to filter them again in fast path.
> 
> The problem is that IPTOS_TOS_MASK is just the masking of the 0x1 bit,
> not both ECN bits. Thi stems from the pre-DSCP time where it was
> forbidden to use bit 0x1 in TOS.

I see, I didn't noticed it. If we decide to reject
rules with tos 2 we can break scripts. May be we should just
filter it:

if (r->tos && ((r->tos & IPTOS_RT_MASK) != fl4->flowi4_tos))

This will allow the hidden feature 'ip rule add tos 2' to
match properly bits2..4 == 0 :) But may be we should not spend
extra cycles in fast path for such feature. As nobody complained
for problems with 3.6+, it seems match by tos is not used often.
May be we just need to restore the missing 'tos &= IPTOS_RT_MASK;'
line?

Regards

--
Julian Anastasov 


Re: [very-RFC 0/8] TSN driver for the kernel

2016-06-12 Thread Takashi Sakamoto
On Jun 12 2016 17:28, Henrik Austad wrote:
> On Sun, Jun 12, 2016 at 12:38:36PM +0900, Takashi Sakamoto wrote:
>> I'm one of maintainers for ALSA firewire stack, which handles IEC
>> 61883-1/6 and vendor-unique packets on IEEE 1394 bus for consumer
>> recording equipments.
>> (I'm not in MAINTAINERS because I'm a shy boy.)
>>
>> IEC 61883-6 describes that one packet can multiplex several types of
>> data in its data channels; i.e. Multi Bit Linear Audio data (PCM
>> samples), One Bit Audio Data (DSD), MIDI messages and so on.
> 
> Hmm, that I did not know, not sure how that applies to AVB, but definately 
> something I have to look into.

For your information, I describe more about it.

You can see pre-standardized specification for IEC 61883-6 in website of
1394 Trade Association. Let's look for 'Audio and Music Data
Transmission Protocol 2.3 (October 13, 2010, 1394TA)'
http://1394ta.org/specifications/

In 'clause 12. AM824 SEQUENCE ADAPTATION LAYERS', you can see that one
data block includes several types of data.


But I can imagine that joint group for AVB loosely refers to IEC
61883-6. In this case, AVB specification might describe one data block
transfers one type of data, to drop unreasonable complexities.

>> If you handles packet payload in 'struct snd_pcm_ops.copy', a process
>> context of an ALSA PCM applications performs the work. Thus, no chances
>> to multiplex data with the other types.
> 
> The driver is not adhering fully to any standards right now, the amount of 
> detail is quite high - but I'm slowly improving as I go through the 
> standards. Getting on top of all the standards and all the different 
> subsystems are definately a work in progress (it's a lot to digest!)

In my taste, the driver is not necessarily compliant to any standards.
It's enough just to work its task, without bad side-effects to Linux
system. Based on this concept, current ALSA firewire stack just support
PCM frames and MIDI messages.

Here, I tell you that actual devices tend not to be compliant to any
standards and lost inter-operability.

(Especially, most of audio and music units on IEEE 1394 bus ignores some
of items in standards. In short, they already lost inter-operability.)

So here, we just consider about what actual devices do, instead of
following any standards.


Regards

Takashi Sakamoto



signature.asc
Description: OpenPGP digital signature


Re: [very-RFC 0/8] TSN driver for the kernel

2016-06-12 Thread Henrik Austad
On Sun, Jun 12, 2016 at 01:30:24PM +0900, Takashi Sakamoto wrote:
> On Jun 12 2016 12:38, Takashi Sakamoto wrote:
> > In your patcset, there's no actual codes about how to handle any
> > interrupt contexts (software / hardware), how to handle packet payload,
> > and so on. Especially, for recent sound subsystem, the timing of
> > generating interrupts and which context does what works are important to
> > reduce playback/capture latency and power consumption.
> > 
> > Of source, your intention of this patchset is to show your early concept
> > of TSN feature. Nevertheless, both of explaination and codes are
> > important to the other developers who have little knowledges about TSN,
> > AVB and AES-64 such as me.
> 
> Oops. Your 5th patch was skipped by alsa-project.org. I guess that size
> of the patch is too large to the list service. I can see it:
> http://marc.info/?l=linux-netdev=146568672728661=2
> 
> As long as seeing the patch, packets are queueing in hrtimer callbacks
> every 1 seconds.

Actually, the hrtimer fires every 1ms, and that part is something I have to 
do something about, also because it sends of the same number of frames 
every time, regardless of how accurate the internal timer is to the rest of 
the network (there's no backpressure from the networking layer).

> (This is a high level discussion and it's OK to ignore it for the
> moment. When writing packet-oriented drivers for sound subsystem, you
> need to pay special attention to accuracy of the number of PCM frames
> transferred currently, and granularity of the number of PCM frames
> transferred by one operation. In this case, snd_avb_hw,
> snd_avb_pcm_pointer(), tsn_buffer_write_net() and tsn_buffer_read_net()
> are involved in this discussion. You can see ALSA developers' struggle
> in USB audio device class drivers and (of cource) IEC 61883-1/6 drivers.)

Ah, good point. Any particular parts of the USB-subsystem I should start 
looking at? Knowing where to start looking is a tremendous help

Thanks for the feedback!

-- 
Henrik Austad


signature.asc
Description: Digital signature


Re: [very-RFC 0/8] TSN driver for the kernel

2016-06-12 Thread Henrik Austad
On Sun, Jun 12, 2016 at 12:38:36PM +0900, Takashi Sakamoto wrote:
> Hi,
> 
> I'm one of maintainers for ALSA firewire stack, which handles IEC
> 61883-1/6 and vendor-unique packets on IEEE 1394 bus for consumer
> recording equipments.
> (I'm not in MAINTAINERS because I'm a shy boy.)
> 
> IEC 61883-6 describes that one packet can multiplex several types of
> data in its data channels; i.e. Multi Bit Linear Audio data (PCM
> samples), One Bit Audio Data (DSD), MIDI messages and so on.

Hmm, that I did not know, not sure how that applies to AVB, but definately 
something I have to look into.

> If you handles packet payload in 'struct snd_pcm_ops.copy', a process
> context of an ALSA PCM applications performs the work. Thus, no chances
> to multiplex data with the other types.

Hmm, ok, I didn't know that, that is something I need to look into -and 
incidentally one of the reasons why I posted the series now instead of a 
few more months down the road - thanks!

The driver is not adhering fully to any standards right now, the amount of 
detail is quite high - but I'm slowly improving as I go through the 
standards. Getting on top of all the standards and all the different 
subsystems are definately a work in progress (it's a lot to digest!)

> To prevent this situation, current ALSA firewire stack handles packet
> payload in software interrupt context of isochronous context of OHCI
> 1394. As a result of this, the software stack supports PCM substreams
> and MIDI substreams.
> 
> In your patcset, there's no actual codes about how to handle any
> interrupt contexts (software / hardware), how to handle packet payload,
> and so on. Especially, for recent sound subsystem, the timing of
> generating interrupts and which context does what works are important to
> reduce playback/capture latency and power consumption.

See reply in other mail :)

> Of source, your intention of this patchset is to show your early concept
> of TSN feature. Nevertheless, both of explaination and codes are
> important to the other developers who have little knowledges about TSN,
> AVB and AES-64 such as me.

Yes, that is one of the things I aimed for, and also getting feedback on 
the overall thinking

> And, I might cooperate to prepare for common IEC 61883 layer. For actual
> codes of ALSA firewire stack, please see mainline kernel code. For
> actual devices of IEC 61883-1/6 and IEEE 1394 bus, please refer to my
> report in 2014. At least, you can get to know what to consider about
> developing upper drivers near ALSA userspace applications.
> https://github.com/takaswie/alsa-firewire-report

Thanks, I'll dig into that, much appreciated

> (But I confirm that the report includes my misunderstandings in clause
> 3.4 and 6.2. need more time...)

ok, good to know

Thank you for your input, very much appreicated!

-- 
Henrik Austad


signature.asc
Description: Digital signature


Re: [ovs-dev] [PATCH net-next] NSH(Network Service Header) implementation

2016-06-12 Thread Yang, Yi
On Mon, Jun 06, 2016 at 05:45:08PM -0700, pravin shelar wrote:
> On Mon, Jun 6, 2016 at 2:34 AM, Yi Yang  wrote:
> > IETF defined NSH(Network Service Header) for Service
> > Function Chaining, this is an IETF draft
> >
> > https://tools.ietf.org/html/draft-ietf-sfc-nsh-05
> >
> > It will be a IETF standard shortly, this patch implemented
> > NSH for Open vSwitch.
> >
> > Signed-off-by: Johnson Li 
> > Signed-off-by: Yi Yang 
> > ---
> >  drivers/net/vxlan.c  |   7 ++
> >  include/net/nsh.h| 117 +++
> >  include/uapi/linux/openvswitch.h |  32 +++
> >  net/openvswitch/actions.c|  68 +
> >  net/openvswitch/flow.c   |  45 -
> >  net/openvswitch/flow.h   |  15 +++
> >  net/openvswitch/flow_netlink.c   | 202 
> > ++-
> >  net/openvswitch/vport-netdev.c   |   3 +-
> >  net/openvswitch/vport-vxlan.c|  15 +++
> >  9 files changed, 501 insertions(+), 3 deletions(-)
> >  create mode 100644 include/net/nsh.h
> >
> 
> ...
> ...
> > diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
> > index 9a3eb7a..38e787c 100644
> > --- a/net/openvswitch/actions.c
> > +++ b/net/openvswitch/actions.c
> > @@ -29,6 +29,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >
> >  #include 
> >  #include 
> > @@ -38,6 +39,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >
> >  #include "datapath.h"
> >  #include "flow.h"
> > @@ -259,6 +261,64 @@ static int push_vlan(struct sk_buff *skb, struct 
> > sw_flow_key *key,
> >  ntohs(vlan->vlan_tci) & ~VLAN_TAG_PRESENT);
> >  }
> >
> ...
> ...
> > +
> > +static int push_nsh(struct sk_buff *skb, struct sw_flow_key *key,
> > +   const struct ovs_action_push_nsh *nsh)
> > +{
> > +   if (nsh->len > 0 && nsh->len <= 256) {
> > +   struct nsh_hdr *nsh_hdr = NULL;
> > +
> > +   if (skb_cow_head(skb, nsh->len) < 0)
> > +   return -ENOMEM;
> > +
> > +   skb_push(skb, nsh->len);
> > +   nsh_hdr = (struct nsh_hdr *)(skb->data);
> > +   memcpy(nsh_hdr, nsh->header, nsh->len);
> > +
> > +   if (!skb->inner_protocol)
> > +   skb_set_inner_protocol(skb, skb->protocol);
> > +
> > +   skb->protocol = htons(ETH_P_NSH); /* 0x894F */
> > +   key->eth.type = htons(ETH_P_NSH);
> > +   } else {
> > +   return -EINVAL;
> > +   }
> > +
> > +   return 0;
> > +}
> 
> Networking stack or OVS can not handle arbitrary skb-protocol. For
> example what happens if OVS has push vlan action or it sends this nsh
> packet to net device which can not handle nsh packet? Even networking
> stack can not parse such packet for handling offloads in software.
> ...
>
ovs nsh patch set have been sent out, they can handle nsh header. This
patch just implemented NSH push and pop actions, we do need some cdoe to
do some sanity check for the case you mentioned, but I think it is out
of scope of thi patch.

> > diff --git a/net/openvswitch/vport-vxlan.c b/net/openvswitch/vport-vxlan.c
> > index 5eb7694..3d060c4 100644
> > --- a/net/openvswitch/vport-vxlan.c
> > +++ b/net/openvswitch/vport-vxlan.c
> > @@ -52,6 +52,18 @@ static int vxlan_get_options(const struct vport *vport, 
> > struct sk_buff *skb)
> > return -EMSGSIZE;
> >
> > nla_nest_end(skb, exts);
> > +   } else if (vxlan->flags & VXLAN_F_GPE) {
> > +   struct nlattr *exts;
> > +
> > +   exts = nla_nest_start(skb, OVS_TUNNEL_ATTR_EXTENSION);
> > +   if (!exts)
> > +   return -EMSGSIZE;
> > +
> > +   if (vxlan->flags & VXLAN_F_GPE &&
> > +   nla_put_flag(skb, OVS_VXLAN_EXT_GPE))
> > +   return -EMSGSIZE;
> > +
> > +   nla_nest_end(skb, exts);
> > }
> >
> > return 0;
> > @@ -59,6 +71,7 @@ static int vxlan_get_options(const struct vport *vport, 
> > struct sk_buff *skb)
> >
> >  static const struct nla_policy exts_policy[OVS_VXLAN_EXT_MAX + 1] = {
> > [OVS_VXLAN_EXT_GBP] = { .type = NLA_FLAG, },
> > +   [OVS_VXLAN_EXT_GPE] = { .type = NLA_FLAG, },
> >  };
> >
> >  static int vxlan_configure_exts(struct vport *vport, struct nlattr *attr,
> > @@ -76,6 +89,8 @@ static int vxlan_configure_exts(struct vport *vport, 
> > struct nlattr *attr,
> >
> > if (exts[OVS_VXLAN_EXT_GBP])
> > conf->flags |= VXLAN_F_GBP;
> > +   else if (exts[OVS_VXLAN_EXT_GPE])
> > +   conf->flags |= VXLAN_F_GPE;
> >
> This is compatibility code, no need to add new features to this code.
> Now we should be directly using net devices.
Will use net device after those patches are merged into net-next, It
seems current net device implementation in 

Re: [Bridge] [PATCH net-next v2] bridge: Synchronize unicast filtering with FDB

2016-06-12 Thread Toshiaki Makita

On 16/06/12 (日) 1:17, Nikolay Aleksandrov via Bridge wrote:

On 06/11/2016 07:35 AM, David Miller wrote:

From: Toshiaki Makita 
Date: Mon,  6 Jun 2016 21:20:13 +0900


Patrick Schaaf reported that flooding due to a missing fdb entry of
the address of macvlan on the bridge device caused high CPU
consumption of an openvpn process behind a tap bridge port.
Adding an fdb entry of the macvlan address can suppress flooding
and avoid this problem.

This change makes bridge able to synchronize unicast filtering with
fdb automatically so admin do not need to manually add an fdb entry.
This effectively supports IFF_UNICAST_FLT in bridge, thus adding an
macvlan device would not place bridge into promiscuous mode as well.

v2:
- Test vlan with br_vlan_should_use() in br_fdb_sync_uc() as per
   Nikolay Aleksandrov.

Reported-by: Patrick Schaaf 
Signed-off-by: Toshiaki Makita 


I really need bridging experts to review and ACK/NACK this.

Thanks.



Oops, I almost missed the v2, sorry about that. So, technically it looks 
correct, but
I only fear the scalability impact of the change. If there're a large number of 
vlans
adding a macvlan (or any device that syncs uc addr) might become very slow and 
every
flag change will become very slow too without an option to revert to the 
original
behaviour so we'll have to wait for the entries to be added in order to delete 
them.
Another common scenario is having 8021q interfaces on top of the bridge with 
different
mac addresses for some of the configured vlans (or with macvlans on top of them 
for VRR),
that use case would suffer as well because their macs need to be local only for 
those vlans,
and not the 2000+ other vlans that might exist.
On every sync_uc() call all the fdb entries get deleted and added again, so 
even after deleting
some manually they can come back unexpectedly after some operation and also the 
message storm from
all the deletes and adds could be problematic as well.

E.g. 2000 br0 vlans, 25 macvlans on br0 (adding them took more than 5 minutes, 
53k fdb entries):
$ bridge fdb del de:8e:9f:16:c5:71 dev br0 vlan 289
$ ip l set br0 multicast on
$ bridge fdb | grep 289 | grep de:8e:9f:16:c5:71
de:8e:9f:16:c5:71 dev br0 vlan 1289 master br0 permanent
de:8e:9f:16:c5:71 dev br0 vlan 289 master br0 permanent

In fact you can't escape the slow performance even if you delete all entries 
because on the
next flag change or interface add, they will be added back.


I still think this auto-sync should be done, otherwise macvlan imposes 
promiscuous mode on bridge even if you manually add such fdb entries.
I believe most of your concern would disappear by making use of 
__dev_uc_sync() instead.
Indeed it seems that there is no easy way to propagate the combination 
of uc addr and vlan from upper device, so local entries for unneeded 
vlan can still be created even if using __dev_uc_sync(). In case you 
worry about those unneeded entries, I can add a knob to disable this 
feature.

Are you comfortable with this change?

Toshiaki Makita