Re: [RFC PATCH v3 2/7] proc: Reduce cache miss in {snmp,netstat}_seq_show

2016-09-13 Thread hejianet

Hi Marcelo


On 9/13/16 2:57 AM, Marcelo wrote:

On Fri, Sep 09, 2016 at 02:33:57PM +0800, Jia He wrote:

This uses the generic interface snmp_get_cpu_field{,64}_batch to
aggregate the data by going through all the items of each cpu sequentially.
snmp_seq_show and netstat_seq_show are then split into 2 parts to avoid the
build warning about the frame size being larger than 1024 bytes on s390.
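
To illustrate the change in aggregation order described above, here is a minimal user-space sketch. It is not the kernel's snmp_get_cpu_field_batch implementation; struct demo_mib, fold_one, fold_batch and the two DEMO constants are hypothetical names chosen only for this illustration. The old order walks every CPU once per item, while the batch order walks each CPU's array once and accumulates all items.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NR_CPUS_DEMO  4	/* hypothetical CPU count for the demo */
#define NR_ITEMS_DEMO 8	/* hypothetical number of MIB items */

/* One counter array per CPU, standing in for a per-cpu MIB. */
struct demo_mib {
	unsigned long mibs[NR_ITEMS_DEMO];
};

/* Old order: for a single item, visit every CPU (called once per item). */
static unsigned long fold_one(const struct demo_mib *percpu, size_t item)
{
	unsigned long sum = 0;
	size_t cpu;

	for (cpu = 0; cpu < NR_CPUS_DEMO; cpu++)
		sum += percpu[cpu].mibs[item];
	return sum;
}

/* Batch order: visit each CPU once and add all of its items to buff. */
static void fold_batch(unsigned long *buff, const struct demo_mib *percpu)
{
	size_t cpu, item;

	assert(buff != NULL && percpu != NULL);
	memset(buff, 0, NR_ITEMS_DEMO * sizeof(*buff));
	for (cpu = 0; cpu < NR_CPUS_DEMO; cpu++)
		for (item = 0; item < NR_ITEMS_DEMO; item++)
			buff[item] += percpu[cpu].mibs[item];
}
```

Both orders produce the same sums; the batch form touches each CPU's memory sequentially, at the cost of needing a result buffer sized for all items, which is where the stack usage question comes from.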

Yeah about that, did you test it with stack overflow detection?
These arrays can be quite large.

One more below..

Do you think it is acceptable if the stack usage is a little larger than 1024
bytes, e.g. 1120?
I can't find any other way to reduce the stack usage except putting "static" before
unsigned long buff[TCP_MIB_MAX]

PS. sizeof(buff) is about TCP_MIB_MAX (116) * 8 = 928 bytes
B.R.

Signed-off-by: Jia He 
---
  net/ipv4/proc.c | 106 +++-
  1 file changed, 74 insertions(+), 32 deletions(-)

diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 9f665b6..c6fc80e 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -46,6 +46,8 @@
  #include 
  #include 
  
+#define TCPUDP_MIB_MAX max_t(u32, UDP_MIB_MAX, TCP_MIB_MAX)

+
  /*
   *Report socket allocation statistics [m...@utu.fi]
   */
@@ -378,13 +380,15 @@ static void icmp_put(struct seq_file *seq)
  /*
   *Called from the PROCfs module. This outputs /proc/net/snmp.
   */
-static int snmp_seq_show(struct seq_file *seq, void *v)
+static int snmp_seq_show_ipstats(struct seq_file *seq, void *v)
  {
int i;
+   u64 buff64[IPSTATS_MIB_MAX];
struct net *net = seq->private;
  
-	seq_puts(seq, "Ip: Forwarding DefaultTTL");

+   memset(buff64, 0, IPSTATS_MIB_MAX * sizeof(u64));
  
+	seq_puts(seq, "Ip: Forwarding DefaultTTL");

for (i = 0; snmp4_ipstats_list[i].name != NULL; i++)
seq_printf(seq, " %s", snmp4_ipstats_list[i].name);
  
@@ -393,57 +397,77 @@ static int snmp_seq_show(struct seq_file *seq, void *v)

   net->ipv4.sysctl_ip_default_ttl);
  
  	BUILD_BUG_ON(offsetof(struct ipstats_mib, mibs) != 0);

+   snmp_get_cpu_field64_batch(buff64, snmp4_ipstats_list,
+  net->mib.ip_statistics,
+  offsetof(struct ipstats_mib, syncp));
for (i = 0; snmp4_ipstats_list[i].name != NULL; i++)
-   seq_printf(seq, " %llu",
-  snmp_fold_field64(net->mib.ip_statistics,
-snmp4_ipstats_list[i].entry,
-offsetof(struct ipstats_mib, syncp)));
+   seq_printf(seq, " %llu", buff64[i]);
  
-	icmp_put(seq);	/* RFC 2011 compatibility */

-   icmpmsg_put(seq);
+   return 0;
+}
+
+static int snmp_seq_show_tcp_udp(struct seq_file *seq, void *v)
+{
+   int i;
+   unsigned long buff[TCPUDP_MIB_MAX];
+   struct net *net = seq->private;
+
+   memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));
  
  	seq_puts(seq, "\nTcp:");

for (i = 0; snmp4_tcp_list[i].name != NULL; i++)
seq_printf(seq, " %s", snmp4_tcp_list[i].name);
  
  	seq_puts(seq, "\nTcp:");

+   snmp_get_cpu_field_batch(buff, snmp4_tcp_list,
+net->mib.tcp_statistics);
for (i = 0; snmp4_tcp_list[i].name != NULL; i++) {
/* MaxConn field is signed, RFC 2012 */
if (snmp4_tcp_list[i].entry == TCP_MIB_MAXCONN)
-   seq_printf(seq, " %ld",
-  snmp_fold_field(net->mib.tcp_statistics,
-  snmp4_tcp_list[i].entry));
+   seq_printf(seq, " %ld", buff[i]);
else
-   seq_printf(seq, " %lu",
-  snmp_fold_field(net->mib.tcp_statistics,
-  snmp4_tcp_list[i].entry));
+   seq_printf(seq, " %lu", buff[i]);
}
  
+	memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));

+
+   snmp_get_cpu_field_batch(buff, snmp4_udp_list,
+net->mib.udp_statistics);
seq_puts(seq, "\nUdp:");
for (i = 0; snmp4_udp_list[i].name != NULL; i++)
seq_printf(seq, " %s", snmp4_udp_list[i].name);
-
seq_puts(seq, "\nUdp:");
for (i = 0; snmp4_udp_list[i].name != NULL; i++)
-   seq_printf(seq, " %lu",
-  snmp_fold_field(net->mib.udp_statistics,
-  snmp4_udp_list[i].entry));
+   seq_printf(seq, " %lu", buff[i]);
+
+   memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));
  
  	/* the UDP and UDP-Lite MIBs are the same */

seq_puts(seq, "\nUdpLite:");
+   snmp_get_cpu_field_batch(buff, snmp4_udp_list,
+net->mib.udplite_statistics);
for (i = 0; snmp4_udp_list[i].name != 

[PATCH net-next] tcp: fix a stale ooo_last_skb after a replace

2016-09-13 Thread Eric Dumazet
From: Eric Dumazet 

When an skb replaces another one in the ooo queue, I forgot to also
update tp->ooo_last_skb if the replaced skb was the last one
in the queue.

To fix this, we can simply reuse the code that runs after an insertion,
trying to merge skbs to the right of the current skb.

This not only fixes the bug, but also removes all small skbs that might
be a subset of the new one.

Example:

We receive segments 2001:3001,  4001:5001

Then we receive 2001:8001: we should replace 2001:3001 with the big
skb, but also remove 4001:5001 from the queue to save space.
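
The replace-then-merge-right idea in this example can be sketched in user space as follows. This is only an illustration under simplifying assumptions: the queue is a sorted array instead of an RB tree, sequence numbers do not wrap, and struct seg, merge_right and replace_and_merge are hypothetical names, not the functions in tcp_input.c.

```c
#include <assert.h>
#include <stddef.h>

/* A queued segment covering sequence range [start, end). */
struct seg {
	unsigned int start;
	unsigned int end;
};

/*
 * Drop the segments that follow queue[idx] and are fully covered by it.
 * The queue is assumed sorted by start, with non-overlapping entries
 * after idx. Returns the new number of entries.
 */
static size_t merge_right(struct seg *queue, size_t n, size_t idx)
{
	size_t next = idx + 1, drop = 0, i;

	assert(idx < n);
	while (next + drop < n && queue[next + drop].end <= queue[idx].end)
		drop++;
	for (i = next; i + drop < n; i++)
		queue[i] = queue[i + drop];
	return n - drop;
}

/* Replace queue[idx] with a bigger segment, then clean up to its right. */
static size_t replace_and_merge(struct seg *queue, size_t n, size_t idx,
				struct seg big)
{
	queue[idx] = big;
	return merge_right(queue, n, idx);
}
```

With the queue holding 2001:3001 and 4001:5001, replacing the first entry with 2001:8001 leaves a single entry, so whatever tracks "the last segment in the queue" has to be refreshed after the merge; in the array form that is simply index n - 1.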

packetdrill test demonstrating the bug

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

+0 < S 0:0(0) win 32792 
+0 > S. 0:0(0) ack 1 
+0.100 < . 1:1(0) ack 1 win 1024
+0 accept(3, ..., ...) = 4

+0.01 < . 1001:2001(1000) ack 1 win 1024
+0> . 1:1(0) ack 1 

+0.01 < . 1001:3001(2000) ack 1 win 1024
+0> . 1:1(0) ack 1 


Fixes: 9f5afeae5152 ("tcp: use an RB tree for ooo receive queue")
Signed-off-by: Eric Dumazet 
Reported-by: Yuchung Cheng 
Cc: Yaogong Wang 
---
 net/ipv4/tcp_input.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 70b892db99018fb42ab38ab7e5ce0dab498f9571..dad3e7eeed94b6f76f4bef4812c5d0fe9944e5f0 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4502,7 +4502,7 @@ coalesce_done:
NET_INC_STATS(sock_net(sk),
  LINUX_MIB_TCPOFOMERGE);
__kfree_skb(skb1);
-   goto add_sack;
+   goto merge_right;
}
} else if (tcp_try_coalesce(sk, skb1, skb, )) {
goto coalesce_done;
@@ -4514,6 +4514,7 @@ insert:
rb_link_node(>rbnode, parent, p);
rb_insert_color(>rbnode, >out_of_order_queue);
 
+merge_right:
/* Remove other segments covered by skb. */
while ((q = rb_next(>rbnode)) != NULL) {
skb1 = rb_entry(q, struct sk_buff, rbnode);




Re: Modification to skb->queue_mapping affecting performance

2016-09-13 Thread Eric Dumazet
On Tue, 2016-09-13 at 22:13 -0700, Michael Ma wrote:

> I don't intend to install multiple qdisc - the only reason that I'm
> doing this now is to leverage MQ to workaround the lock contention,
> and based on the profile this all worked. However to simplify the way
> to setup HTB I wanted to use TXQ to partition HTB classes so that a
> HTB class only belongs to one TXQ, which also requires mapping skb to
> TXQ using some rules (here I'm using priority but I assume it's
> straightforward to use other information such as classid). And the
> problem I found here is that when using priority to infer the TXQ so
> that queue_mapping is changed, bandwidth is affected significantly -
> the only thing I can guess is that due to queue switch, there are more
> cache misses assuming processor cores have a static mapping to all the
> queues. Any suggestion on what to do next for the investigation?
> 
> I would also guess that this should be a common problem if anyone
> wants to use MQ+IFB to workaround the qdisc lock contention on the
> receiver side and classful qdisc is used on IFB, but haven't really
> found a similar thread here...

But why are you changing the queue ?

NIC already does the proper RSS thing, meaning all packets of one flow
should land on one RX queue. No need to ' classify yourself and risk
lock contention' 

I use IFB + MQ + netem every day, and it scales to 10 Mpps with no
problem.

Do you really need to rate limit flows? It is not clear what your goals are,
or why, for example, you use HTB to begin with.






RE: [PATCH net-next 2/3] net: ethernet: mediatek: add ethtool functions to configure RX flows of HW LRO

2016-09-13 Thread Nelson Chang
(resend)

Thanks Florian for the review!
I will add ndo_fix_features hook in v2 to prevent the case that a user
wants to turn off NETIF_F_LRO but RX flow is programmed.
If any programmed RX flow exists, NETIF_F_LRO cannot be turned off.

-Original Message-
From: Florian Fainelli [mailto:f.faine...@gmail.com]
Sent: Wednesday, September 14, 2016 2:27 AM
To: Nelson Chang (張家祥); j...@phrozen.org; da...@davemloft.net
Cc: n...@openwrt.org; netdev@vger.kernel.org;
linux-media...@lists.infradead.org; nelsonch...@gmail.com
Subject: Re: [PATCH net-next 2/3] net: ethernet: mediatek: add ethtool
functions to configure RX flows of HW LRO

On 09/13/2016 06:54 AM, Nelson Chang wrote:
> This code adds ethtool functions to set RX flows for HW LRO. Because
> the HW LRO hardware can only recognize the destination IP of TCP/IP RX
> flows, the ethtool command to add a HW LRO flow is as below:
> ethtool -N [devname] flow-type tcp4 dst-ip [ip_addr] loc [0~1]
> 
> Also, because the hardware can hold four destination IPs in total, each
> GMAC (GMAC1/GMAC2) can set at most two IPs.
> 
> Signed-off-by: Nelson Chang 
> ---

> +
> +static int mtk_set_features(struct net_device *dev, netdev_features_t
> +features) {
> + int err = 0;
> +
> + if (!((dev->features ^ features) & NETIF_F_LRO))
> + return 0;
> +
> + if (!(features & NETIF_F_LRO))
> + mtk_hwlro_netdev_disable(dev);

you may want to implement a fix_features ndo operation which makes sure
that NETIF_F_LRO stays turned on while an RX flow is programmed;
otherwise, it may be confusing to the user that a flow was programmed
but no offload is happening.
--
Florian
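
As a rough illustration of the suggestion above, the sketch below keeps the LRO bit set while any RX flow is programmed, and shows the XOR test used by the quoted mtk_set_features to detect whether the bit actually changes. It is a user-space model, not driver code: demo_features_t, DEMO_F_LRO (including its bit position), struct demo_dev and hwlro_flow_count are hypothetical stand-ins for netdev_features_t, NETIF_F_LRO and the driver's private state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t demo_features_t;		/* stand-in for netdev_features_t */
#define DEMO_F_LRO ((demo_features_t)1 << 15)	/* hypothetical bit position */

struct demo_dev {
	demo_features_t features;	/* currently active features */
	int hwlro_flow_count;		/* hypothetical: programmed RX flows */
};

/* fix_features-style hook: refuse to drop LRO while flows are programmed. */
static demo_features_t demo_fix_features(const struct demo_dev *dev,
					 demo_features_t wanted)
{
	assert(dev != NULL);
	if (!(wanted & DEMO_F_LRO) && dev->hwlro_flow_count > 0)
		wanted |= DEMO_F_LRO;
	return wanted;
}

/* Nonzero only if the LRO bit differs between current and requested. */
static int demo_lro_changed(const struct demo_dev *dev,
			    demo_features_t features)
{
	assert(dev != NULL);
	return ((dev->features ^ features) & DEMO_F_LRO) != 0;
}
```

In this model a request to clear LRO is silently corrected while flows exist, so the later set_features step sees no change and leaves the hardware alone.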




RE: [PATCH net-next 3/3] net: ethernet: mediatek: add dts configuration to enable HW LRO

2016-09-13 Thread Nelson Chang
(resend)

The property description you suggested is more precise.
The property should describe a capability, i.e. whether the hardware
supports LRO. I'll rephrase the property description in v2.

Thanks Florian!

-Original Message-
From: Florian Fainelli [mailto:f.faine...@gmail.com]
Sent: Wednesday, September 14, 2016 2:25 AM
To: Nelson Chang (張家祥); j...@phrozen.org; da...@davemloft.net
Cc: n...@openwrt.org; netdev@vger.kernel.org;
linux-media...@lists.infradead.org; nelsonch...@gmail.com
Subject: Re: [PATCH net-next 3/3] net: ethernet: mediatek: add dts
configuration to enable HW LRO

On 09/13/2016 06:54 AM, Nelson Chang wrote:
> Add the configuration of HW LRO in the binding document.
> 
> Signed-off-by: Nelson Chang 
> ---
>  Documentation/devicetree/bindings/net/mediatek-net.txt | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/Documentation/devicetree/bindings/net/mediatek-net.txt
> b/Documentation/devicetree/bindings/net/mediatek-net.txt
> index 32eaaca..f43c0d1 100644
> --- a/Documentation/devicetree/bindings/net/mediatek-net.txt
> +++ b/Documentation/devicetree/bindings/net/mediatek-net.txt
> @@ -20,6 +20,7 @@ Required properties:
>  - mediatek,ethsys: phandle to the syscon node that handles the port setup
>  - mediatek,pctl: phandle to the syscon node that handles the ports slew rate
>   and driver current
> +- mediatek,hwlro: set to enable HW LRO functions of PDMA rx rings

That sounds like implementing an enable/disable policy in the Device Tree
as opposed to providing an indication as to whether the HW supports LRO
or not. If all versions of the hardware support LRO, then you would
rather let the users change NETIF_F_LRO using ethtool features instead
of having this be defined in the Device Tree.

If, on the other hand, not all versions of the HW support LRO, then you
would just want to rephrase the property description to say this
describes a capability.
--
Florian




Re: Modification to skb->queue_mapping affecting performance

2016-09-13 Thread Michael Ma
2016-09-13 22:13 GMT-07:00 Michael Ma :
> 2016-09-13 18:18 GMT-07:00 Eric Dumazet :
>> On Tue, 2016-09-13 at 17:23 -0700, Michael Ma wrote:
>>
>>> If I understand correctly this is still to associate a qdisc with each
>>> ifb TXQ. How should I do this if I want to use HTB? I guess I'll need
>>> to divide the bandwidth of each class in HTB by the number of TX
>>> queues for each individual HTB qdisc associated?
>>>
>>> My original idea was to attach a HTB qdisc for each ifb queue
>>> representing a set of flows not sharing bandwidth with others so that
>>> root lock contention still happens but only affects flows in the same
>>> HTB. Did I understand the root lock contention issue incorrectly for
>>> ifb? I do see some comments in __dev_queue_xmit() about using a
>>> different code path for software devices which bypasses
>>> __dev_xmit_skb(). Does this mean ifb won't go through
>>> __dev_xmit_skb()?
>>
>> You can install HTB on all of your MQ children for sure.
>>
>> Again, there is no qdisc lock contention if you properly use MQ.
>>
>> Now if you _need_ to install a single qdisc for whatever reason, then
>> maybe you want to use a single rx queue on the NIC, to reduce lock
>> contention ;)

Yes - this might reduce lock contention but there would still be
contention and I'm really looking for more concurrency...

>>
>>
> I don't intend to install multiple qdisc - the only reason that I'm
> doing this now is to leverage MQ to workaround the lock contention,
> and based on the profile this all worked. However to simplify the way
> to setup HTB I wanted to use TXQ to partition HTB classes so that a
> HTB class only belongs to one TXQ, which also requires mapping skb to
> TXQ using some rules (here I'm using priority but I assume it's
> straightforward to use other information such as classid). And the
> problem I found here is that when using priority to infer the TXQ so
> that queue_mapping is changed, bandwidth is affected significantly -
> the only thing I can guess is that due to queue switch, there are more
> cache misses assuming processor cores have a static mapping to all the
> queues. Any suggestion on what to do next for the investigation?
>
> I would also guess that this should be a common problem if anyone
> wants to use MQ+IFB to workaround the qdisc lock contention on the
> receiver side and classful qdisc is used on IFB, but haven't really
> found a similar thread here...

Hi Cong - I saw quite a few threads from you regarding ingress qdisc
+ MQ and queue_mapping issues. Do you by any chance have a similar
setup (classful qdiscs attached to the queues of IFB, which requires
modifying queue_mapping so that the qdisc is selected at
queue selection time based on information such as skb
priority/classid)? I would appreciate any suggestions.


Re: Modification to skb->queue_mapping affecting performance

2016-09-13 Thread Michael Ma
2016-09-13 18:18 GMT-07:00 Eric Dumazet :
> On Tue, 2016-09-13 at 17:23 -0700, Michael Ma wrote:
>
>> If I understand correctly this is still to associate a qdisc with each
>> ifb TXQ. How should I do this if I want to use HTB? I guess I'll need
>> to divide the bandwidth of each class in HTB by the number of TX
>> queues for each individual HTB qdisc associated?
>>
>> My original idea was to attach a HTB qdisc for each ifb queue
>> representing a set of flows not sharing bandwidth with others so that
>> root lock contention still happens but only affects flows in the same
>> HTB. Did I understand the root lock contention issue incorrectly for
>> ifb? I do see some comments in __dev_queue_xmit() about using a
>> different code path for software devices which bypasses
>> __dev_xmit_skb(). Does this mean ifb won't go through
>> __dev_xmit_skb()?
>
> You can install HTB on all of your MQ children for sure.
>
> Again, there is no qdisc lock contention if you properly use MQ.
>
> Now if you _need_ to install a single qdisc for whatever reason, then
> maybe you want to use a single rx queue on the NIC, to reduce lock
> contention ;)
>
>
I don't intend to install multiple qdisc - the only reason that I'm
doing this now is to leverage MQ to workaround the lock contention,
and based on the profile this all worked. However to simplify the way
to setup HTB I wanted to use TXQ to partition HTB classes so that a
HTB class only belongs to one TXQ, which also requires mapping skb to
TXQ using some rules (here I'm using priority but I assume it's
straightforward to use other information such as classid). And the
problem I found here is that when using priority to infer the TXQ so
that queue_mapping is changed, bandwidth is affected significantly -
the only thing I can guess is that due to queue switch, there are more
cache misses assuming processor cores have a static mapping to all the
queues. Any suggestion on what to do next for the investigation?

I would also guess that this should be a common problem if anyone
wants to use MQ+IFB to workaround the qdisc lock contention on the
receiver side and classful qdisc is used on IFB, but haven't really
found a similar thread here...


Re: [PATCH v5 0/6] Add eBPF hooks for cgroups

2016-09-13 Thread Alexei Starovoitov
On Tue, Sep 13, 2016 at 07:24:08PM +0200, Pablo Neira Ayuso wrote:
> On Tue, Sep 13, 2016 at 03:31:20PM +0200, Daniel Mack wrote:
> > Hi,
> > 
> > On 09/13/2016 01:56 PM, Pablo Neira Ayuso wrote:
> > > On Mon, Sep 12, 2016 at 06:12:09PM +0200, Daniel Mack wrote:
> > >> This is v5 of the patch set to allow eBPF programs for network
> > >> filtering and accounting to be attached to cgroups, so that they apply
> > >> to all sockets of all tasks placed in that cgroup. The logic also
> > >> allows to be extended for other cgroup based eBPF logic.
> > > 
> > > 1) This infrastructure can only be useful to systemd, or any similar
> > >orchestration daemon. Look, you can only apply filtering policies
> > >to processes that are launched by systemd, so this only works
> > >for server processes.
> > 
> > Sorry, but both statements aren't true. The eBPF policies apply to every
> > process that is placed in a cgroup, and my example program in 6/6 shows
> > how that can be done from the command line.
> 
> Then you have to explain to me how anyone other than systemd can use this
> infrastructure?

Sounds like systemd and bpf phobia combined :)
Jokes aside. I'm puzzled why systemd is even being mentioned here.
Here we use tupperware (our internal container management system) that
is heavily using cgroups and has nothing to do with systemd.
we're working as part of open container initiative, so hopefully soon
all container management systems will benefit from what we're building.
cgroups and bpf are crucial part of this process.

> > Also, systemd is able to control userspace processes just fine, and
> > it not limited to 'server processes'.
> 
> My main point is that those processes *need* to be launched by the
> orchestrator, which is what I was referring to as 'server processes'.

No experience in systemd, so cannot comment about it,
but that statement is not true for our stuff.

> > > For client processes this infrastructure is
> > >*racy*, you have to add new processes in runtime to the cgroup,
> > >thus there will be some little time where no filtering policy
> > >will be applied. For quality of service, this may be an acceptable
> > >race, but this is aiming to deploy a filtering policy.
> > 
> > That's a limitation that applies to many more control mechanisms in the
> > kernel, and it's something that can easily be solved with fork+exec.
> 
> As long as you have control to launch the processes yes, but this
> will not work in other scenarios. Just like cgroup net_cls and friends
> are broken for filtering for things that you have no control to
> fork+exec.

not true

> To use this infrastructure from a non-launcher process, you'll have to
> rely on the proc connection to subscribe to new process events, then
> echo that pid to the cgroup, and that interface is asynchronous so
> *adding new processes to the cgroup is subject to races*.

in general not true either. have you worked with cgroups, or are you just speculating?
 
> *You're proposing a socket filtering facility that hooks layer 2
> output path*!

flashback. Not too long ago you were beating drums about netfilter
ingress hook operating at layer 2... sounds like nobody used it
and that was a bad call? Should we remove that netfilter hook then?

Our use case is different from Daniel's.
For us this cgroup+bpf is _not_ for filtering and _not_ for security.
We run a ton of tasks in cgroups that launch all sorts of
things on their own. We need to monitor what they do from a networking
point of view. Therefore bpf programs need to monitor the traffic in a
particular part of the cgroup hierarchy. Not globally, and with no pass/drop
decisions.
The monitoring itself is complicated. For example, we need to group and
aggregate within the bpf program based on certain bits of the ipv6 address
and so on. bpf is the only programmable engine that can do this job.
nft is simply not flexible enough to do that.
I'd really love to have an alternative to bpf for such tasks,
but you seem to spend all the energy arguing against bpf whereas
nft still leaves a lot to be desired.



Re: [PATCH net-next v3] net: inet: diag: expose the socket mark to privileged processes.

2016-09-13 Thread David Ahern
On 9/13/16 10:00 PM, Lorenzo Colitti wrote:
> On Fri, Sep 9, 2016 at 2:23 PM, Lorenzo Colitti  wrote:
>> RFC patch sent out as http://patchwork.ozlabs.org/patch/667892/ . This
>> achieves a fair bit of simplification with no or negligible
>> performance impact, because there was a lot of redundancy in the
>> parameters that were passed in.
> 
> David, any thoughts on that patch? I submitted it as RFC because I
> wasn't sure what you wanted. Should I have sent it as non-RFC instead?
> 

I realize you meant DaveM, but this one has been accepted. It's your other 2 
that are marked RFC by you and in patchwork.


Re: [PATCH net-next v3] net: inet: diag: expose the socket mark to privileged processes.

2016-09-13 Thread Lorenzo Colitti
On Fri, Sep 9, 2016 at 2:23 PM, Lorenzo Colitti  wrote:
> RFC patch sent out as http://patchwork.ozlabs.org/patch/667892/ . This
> achieves a fair bit of simplification with no or negligible
> performance impact, because there was a lot of redundancy in the
> parameters that were passed in.

David, any thoughts on that patch? I submitted it as RFC because I
wasn't sure what you wanted. Should I have sent it as non-RFC instead?


[PATCH net-next 3/7] net: ethernet: mediatek: cleanup error path inside mtk_hw_init

2016-09-13 Thread sean.wang
From: Sean Wang 

This cleans up the error path inside the mtk_hw_init call so that it can
exit appropriately when something fails. It also refactors the
mtk_cleanup call to make part of its logic reusable on the error path.
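
The general shape of such an error path is a chain of labels that undo the setup steps in reverse order. The sketch below is a self-contained model of that pattern only; struct demo_state, enum demo_step and demo_probe are hypothetical, and the steps do not correspond one-to-one to what mtk_probe really does.

```c
#include <assert.h>
#include <stddef.h>

/* Which setup step should fail in the demo; STEP_NONE means success. */
enum demo_step { STEP_HW, STEP_DEV, STEP_MDIO, STEP_REGISTER, STEP_NONE };

/* Tracks which resources are currently held. */
struct demo_state {
	int hw_up;
	int dev_allocated;
	int mdio_up;
};

static int demo_probe(struct demo_state *s, enum demo_step fail_at)
{
	int err = -1;	/* generic failure code for the demo */

	assert(s != NULL);

	if (fail_at == STEP_HW)
		return err;		/* nothing to undo yet */
	s->hw_up = 1;

	if (fail_at == STEP_DEV)
		goto err_deinit_hw;
	s->dev_allocated = 1;

	if (fail_at == STEP_MDIO)
		goto err_free_dev;
	s->mdio_up = 1;

	if (fail_at == STEP_REGISTER)
		goto err_deinit_mdio;

	return 0;

	/* Each label undoes one step and falls through to the earlier ones. */
err_deinit_mdio:
	s->mdio_up = 0;
err_free_dev:
	s->dev_allocated = 0;
err_deinit_hw:
	s->hw_up = 0;
	return err;
}
```

Splitting a monolithic cleanup helper into smaller pieces is what makes this possible: each label can call only the piece that matches the step already completed.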

Signed-off-by: Sean Wang 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 34 -
 1 file changed, 29 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index c71b0b3..917a49c6 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -1564,17 +1564,36 @@ static void mtk_pending_work(struct work_struct *work)
rtnl_unlock();
 }
 
-static int mtk_cleanup(struct mtk_eth *eth)
+static int mtk_free_dev(struct mtk_eth *eth)
 {
int i;
 
for (i = 0; i < MTK_MAC_COUNT; i++) {
if (!eth->netdev[i])
continue;
+   free_netdev(eth->netdev[i]);
+   }
+
+   return 0;
+}
 
+static int mtk_unreg_dev(struct mtk_eth *eth)
+{
+   int i;
+
+   for (i = 0; i < MTK_MAC_COUNT; i++) {
+   if (!eth->netdev[i])
+   continue;
unregister_netdev(eth->netdev[i]);
-   free_netdev(eth->netdev[i]);
}
+
+   return 0;
+}
+
+static int mtk_cleanup(struct mtk_eth *eth)
+{
+   mtk_unreg_dev(eth);
+   mtk_free_dev(eth);
cancel_work_sync(>pending_work);
 
return 0;
@@ -1872,7 +1891,7 @@ static int mtk_probe(struct platform_device *pdev)
 
err = mtk_add_mac(eth, mac_np);
if (err)
-   goto err_free_dev;
+   goto err_deinit_hw;
}
 
err = devm_request_irq(eth->dev, eth->irq[1], mtk_handle_irq_tx, 0,
@@ -1896,7 +1915,7 @@ static int mtk_probe(struct platform_device *pdev)
err = register_netdev(eth->netdev[i]);
if (err) {
dev_err(eth->dev, "error bringing up device\n");
-   goto err_free_dev;
+   goto err_deinit_mdio;
} else
netif_info(eth, probe, eth->netdev[i],
   "mediatek frame engine at 0x%08lx, irq %d\n",
@@ -1916,8 +1935,13 @@ static int mtk_probe(struct platform_device *pdev)
 
return 0;
 
+err_deinit_mdio:
+   mtk_mdio_cleanup(eth);
 err_free_dev:
-   mtk_cleanup(eth);
+   mtk_free_dev(eth);
+err_deinit_hw:
+   mtk_hw_deinit(eth);
+
return err;
 }
 
-- 
1.9.1



[PATCH net-next 7/7] net: ethernet: mediatek: avoid race condition during the reset process

2016-09-13 Thread sean.wang
From: Sean Wang 

Add protection against the race condition between
the reset process and hardware access happening
in the related callbacks.
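
The guard used throughout this patch is a "resetting" bit that the reset worker takes and that callbacks test before touching the hardware. Below is a user-space sketch of the same idea using C11 atomics in place of the kernel's test_and_set_bit_lock, test_bit and clear_bit_unlock; struct demo_hw and the demo_* functions are hypothetical names for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_hw {
	atomic_bool resetting;	/* plays the role of the MTK_RESETTING bit */
	int mac_addr_reg;	/* stand-in for a hardware register */
};

/* Reset worker: spin until this caller flips the flag from false to true. */
static void demo_reset_begin(struct demo_hw *hw)
{
	while (atomic_exchange_explicit(&hw->resetting, true,
					memory_order_acquire))
		;	/* the kernel version calls cpu_relax() here */
}

/* Reset worker: release the flag once the hardware is usable again. */
static void demo_reset_end(struct demo_hw *hw)
{
	atomic_store_explicit(&hw->resetting, false, memory_order_release);
}

/* Callback: back off with -EBUSY while a reset is in flight. */
static int demo_set_mac(struct demo_hw *hw, int value)
{
	assert(hw != NULL);
	if (atomic_load_explicit(&hw->resetting, memory_order_acquire))
		return -EBUSY;
	hw->mac_addr_reg = value;
	return 0;
}
```

Note that in this sketch the flag check and the register write are separate steps, so a reset that begins right after the check is not excluded by the flag alone; additional locking would be needed for that.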

Signed-off-by: Sean Wang 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 36 +
 drivers/net/ethernet/mediatek/mtk_eth_soc.h |  3 ++-
 2 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 48cddf9..a6a9a2f 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -145,6 +145,9 @@ static void mtk_phy_link_adjust(struct net_device *dev)
  MAC_MCR_RX_EN | MAC_MCR_BACKOFF_EN |
  MAC_MCR_BACKPR_EN;
 
+   if (unlikely(test_bit(MTK_RESETTING, >hw->state)))
+   return;
+
switch (mac->phy_dev->speed) {
case SPEED_1000:
mcr |= MAC_MCR_SPEED_1000;
@@ -370,6 +373,9 @@ static int mtk_set_mac_address(struct net_device *dev, void *p)
if (ret)
return ret;
 
+   if (unlikely(test_bit(MTK_RESETTING, >hw->state)))
+   return -EBUSY;
+
spin_lock_bh(>hw->page_lock);
mtk_w32(mac->hw, (macaddr[0] << 8) | macaddr[1],
MTK_GDMA_MAC_ADRH(mac->id));
@@ -770,6 +776,9 @@ static int mtk_start_xmit(struct sk_buff *skb, struct net_device *dev)
 */
spin_lock(>page_lock);
 
+   if (unlikely(test_bit(MTK_RESETTING, >state)))
+   goto drop;
+
tx_num = mtk_cal_txd_req(skb);
if (unlikely(atomic_read(>free_count) <= tx_num)) {
mtk_stop_queue(eth);
@@ -842,6 +851,9 @@ static int mtk_poll_rx(struct napi_struct *napi, int budget,
 
netdev = eth->netdev[mac];
 
+   if (unlikely(test_bit(MTK_RESETTING, >state)))
+   goto release_desc;
+
/* alloc new buffer */
new_data = napi_alloc_frag(ring->frag_size);
if (unlikely(!new_data)) {
@@ -1573,6 +1585,12 @@ static void mtk_pending_work(struct work_struct *work)
 
rtnl_lock();
 
+   dev_dbg(eth->dev, "[%s][%d] reset\n", __func__, __LINE__);
+
+   while (test_and_set_bit_lock(MTK_RESETTING, >state))
+   cpu_relax();
+
+   dev_dbg(eth->dev, "[%s][%d] mtk_stop starts\n", __func__, __LINE__);
/* stop all devices to make sure that dma is properly shut down */
for (i = 0; i < MTK_MAC_COUNT; i++) {
if (!eth->netdev[i])
@@ -1580,6 +1598,7 @@ static void mtk_pending_work(struct work_struct *work)
mtk_stop(eth->netdev[i]);
__set_bit(i, );
}
+   dev_dbg(eth->dev, "[%s][%d] mtk_stop ends\n", __func__, __LINE__);
 
/* restart underlying hardware such as power, clock, pin mux
 * and the connected phy
@@ -1613,6 +1632,11 @@ static void mtk_pending_work(struct work_struct *work)
dev_close(eth->netdev[i]);
}
}
+
+   dev_dbg(eth->dev, "[%s][%d] reset done\n", __func__, __LINE__);
+
+   clear_bit_unlock(MTK_RESETTING, >state);
+
rtnl_unlock();
 }
 
@@ -1657,6 +1681,9 @@ static int mtk_get_settings(struct net_device *dev,
struct mtk_mac *mac = netdev_priv(dev);
int err;
 
+   if (unlikely(test_bit(MTK_RESETTING, >hw->state)))
+   return -EBUSY;
+
err = phy_read_status(mac->phy_dev);
if (err)
return -ENODEV;
@@ -1707,6 +1734,9 @@ static int mtk_nway_reset(struct net_device *dev)
 {
struct mtk_mac *mac = netdev_priv(dev);
 
+   if (unlikely(test_bit(MTK_RESETTING, >hw->state)))
+   return -EBUSY;
+
return genphy_restart_aneg(mac->phy_dev);
 }
 
@@ -1715,6 +1745,9 @@ static u32 mtk_get_link(struct net_device *dev)
struct mtk_mac *mac = netdev_priv(dev);
int err;
 
+   if (unlikely(test_bit(MTK_RESETTING, >hw->state)))
+   return -EBUSY;
+
err = genphy_update_link(mac->phy_dev);
if (err)
return ethtool_op_get_link(dev);
@@ -1755,6 +1788,9 @@ static void mtk_get_ethtool_stats(struct net_device *dev,
unsigned int start;
int i;
 
+   if (unlikely(test_bit(MTK_RESETTING, >hw->state)))
+   return;
+
if (netif_running(dev) && netif_device_present(dev)) {
if (spin_trylock(>stats_lock)) {
mtk_stats_update_mac(mac);
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.h b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
index 7efa00f..79954b4 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.h
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
@@ -336,7 +336,8 @@ enum mtk_clks_map {
 };
 
 enum mtk_dev_state {
-   MTK_HW_INIT
+   MTK_HW_INIT,
+   MTK_RESETTING
 };
 
 /* struct mtk_tx_buf - This struct holds the pointers to 

[PATCH net-next 0/7] add enhancement into the existing reset flow

2016-09-13 Thread sean.wang
From: Sean Wang 

The current driver only resets the DMA used by the descriptor rings, which
cannot guarantee recovery from all kinds of fatal
errors, so this series
1) tries to reset, from scratch, the underlying hardware resources that
ethernet requires on Mediatek SoCs.
2) refactors code to improve the reusability of existing code.
3) handles the race condition between the reset flow and the
callbacks registered with the core that access the hardware.
4) introduces power domain usage into the hardware setup, so that the
hardware can be cleanly and completely restored to its initial state.

Sean Wang (7):
  net: ethernet: mediatek: refactoring mtk_hw_init to be reused
  net: ethernet: mediatek: add mtk_hw_deinit call as the opposite to
mtk_hw_init call
  net: ethernet: mediatek: cleanup error path inside mtk_hw_init
  net: ethernet: mediatek: add controlling power domain the ethernet
belongs to
  net: ethernet: mediatek: add the whole ethernet reset into the reset
process
  net: ethernet: mediatek: add more resets for internal ethernet circuit
block
  net: ethernet: mediatek: avoid race condition during the reset process

 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 227 +---
 drivers/net/ethernet/mediatek/mtk_eth_soc.h |  15 +-
 2 files changed, 187 insertions(+), 55 deletions(-)

-- 
1.9.1



[PATCH net-next 2/7] net: ethernet: mediatek: add mtk_hw_deinit call as the opposite to mtk_hw_init call

2016-09-13 Thread sean.wang
From: Sean Wang 

Group the deinitialization of everything the
mtk_hw_init call sets up, so that it can be reused by the reset
process and the error path handling.

Signed-off-by: Sean Wang 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index ca46e82..c71b0b3 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -1477,6 +1477,16 @@ static int __init mtk_hw_init(struct mtk_eth *eth)
return 0;
 }
 
+static int mtk_hw_deinit(struct mtk_eth *eth)
+{
+   clk_disable_unprepare(eth->clks[MTK_CLK_GP2]);
+   clk_disable_unprepare(eth->clks[MTK_CLK_GP1]);
+   clk_disable_unprepare(eth->clks[MTK_CLK_ESW]);
+   clk_disable_unprepare(eth->clks[MTK_CLK_ETHIF]);
+
+   return 0;
+}
+
 static int __init mtk_init(struct net_device *dev)
 {
struct mtk_mac *mac = netdev_priv(dev);
@@ -1923,10 +1933,7 @@ static int mtk_remove(struct platform_device *pdev)
mtk_stop(eth->netdev[i]);
}
 
-   clk_disable_unprepare(eth->clks[MTK_CLK_ETHIF]);
-   clk_disable_unprepare(eth->clks[MTK_CLK_ESW]);
-   clk_disable_unprepare(eth->clks[MTK_CLK_GP1]);
-   clk_disable_unprepare(eth->clks[MTK_CLK_GP2]);
+   mtk_hw_deinit(eth);
 
netif_napi_del(>tx_napi);
netif_napi_del(>rx_napi);
-- 
1.9.1



[PATCH net-next 6/7] net: ethernet: mediatek: add more resets for internal ethernet circuit block

2016-09-13 Thread sean.wang
From: Sean Wang 

struct mtk_eth already contains the struct regmap ethsys pointer
to the address range of the internal circuit reset, so we reuse it
to reset more internal blocks of the ethernet hardware, such as the packet
processing engine (PPE) and the frame engine (FE), instead of using rstc, which
deals with FE only.

Signed-off-by: Sean Wang 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 27 +++
 drivers/net/ethernet/mediatek/mtk_eth_soc.h |  6 +-
 2 files changed, 20 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index b9ddbcb..48cddf9 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -1414,6 +1414,19 @@ static int mtk_stop(struct net_device *dev)
return 0;
 }
 
+static void ethsys_reset(struct mtk_eth *eth, u32 reset_bits)
+{
+   regmap_update_bits(eth->ethsys, ETHSYS_RSTCTRL,
+  reset_bits,
+  reset_bits);
+
+   usleep_range(1000, 1100);
+   regmap_update_bits(eth->ethsys, ETHSYS_RSTCTRL,
+  reset_bits,
+  ~reset_bits);
+   mdelay(10);
+}
+
 static int mtk_hw_init(struct mtk_eth *eth)
 {
int i, val;
@@ -1428,12 +1441,8 @@ static int mtk_hw_init(struct mtk_eth *eth)
clk_prepare_enable(eth->clks[MTK_CLK_ESW]);
clk_prepare_enable(eth->clks[MTK_CLK_GP1]);
clk_prepare_enable(eth->clks[MTK_CLK_GP2]);
-
-   /* reset the frame engine */
-   reset_control_assert(eth->rstc);
-   usleep_range(10, 20);
-   reset_control_deassert(eth->rstc);
-   usleep_range(10, 20);
+   ethsys_reset(eth, RSTCTRL_FE);
+   ethsys_reset(eth, RSTCTRL_PPE);
 
regmap_read(eth->ethsys, ETHSYS_SYSCFG0, &val);
for (i = 0; i < MTK_MAC_COUNT; i++) {
@@ -1894,12 +1903,6 @@ static int mtk_probe(struct platform_device *pdev)
return PTR_ERR(eth->pctl);
}
 
-   eth->rstc = devm_reset_control_get(&pdev->dev, "eth");
-   if (IS_ERR(eth->rstc)) {
-   dev_err(&pdev->dev, "no eth reset found\n");
-   return PTR_ERR(eth->rstc);
-   }
-
for (i = 0; i < 3; i++) {
eth->irq[i] = platform_get_irq(pdev, i);
if (eth->irq[i] < 0) {
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.h b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
index 388cbe7..7efa00f 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.h
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
@@ -266,6 +266,11 @@
 #define SYSCFG0_GE_MASK 0x3
 #define SYSCFG0_GE_MODE(x, y)  (x << (12 + (y * 2)))
 
+/*ethernet reset control register*/
+#define ETHSYS_RSTCTRL 0x34
+#define RSTCTRL_FE BIT(6)
+#define RSTCTRL_PPE BIT(31)
+
 struct mtk_rx_dma {
unsigned int rxd1;
unsigned int rxd2;
@@ -423,7 +428,6 @@ struct mtk_rx_ring {
 struct mtk_eth {
struct device   *dev;
void __iomem *base;
-   struct reset_control *rstc;
spinlock_t  page_lock;
spinlock_t  irq_lock;
struct net_device   dummy_dev;
-- 
1.9.1
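

The deassert step of ethsys_reset passes ~reset_bits as the value. That works because regmap_update_bits only takes the bits selected by the mask from the value, so the reset bits end up cleared. The sketch below models this on a plain integer. model_update_bits and the MODEL_* names are hypothetical, and it assumes the usual update rule new = (old & ~mask) | (val & mask); it does not touch a real regmap.

```c
#include <stdint.h>

/* Hypothetical one-register model; the real helper goes through a regmap. */
static uint32_t model_update_bits(uint32_t old, uint32_t mask, uint32_t val)
{
	/* Bits outside mask are preserved; bits inside mask come from val. */
	return (old & ~mask) | (val & mask);
}

#define MODEL_BIT(n)		(UINT32_C(1) << (n))
#define MODEL_RSTCTRL_FE	MODEL_BIT(6)
#define MODEL_RSTCTRL_PPE	MODEL_BIT(31)

/* Assert step of ethsys_reset(): val == mask sets the reset bits. */
static uint32_t model_reset_assert(uint32_t reg, uint32_t reset_bits)
{
	return model_update_bits(reg, reset_bits, reset_bits);
}

/* Deassert step: ~mask shares no bits with mask, so the reset bits clear. */
static uint32_t model_reset_deassert(uint32_t reg, uint32_t reset_bits)
{
	return model_update_bits(reg, reset_bits, ~reset_bits);
}
```

Under that rule, passing 0 as the value in the deassert step gives the same result as ~reset_bits, and unrelated bits of the register are left alone in both steps.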



[PATCH net-next 5/7] net: ethernet: mediatek: add the whole ethernet reset into the reset process

2016-09-13 Thread sean.wang
From: Sean Wang 

1) The original driver only resets the DMA used by the descriptor
rings, which cannot guarantee recovery from all kinds of fatal
errors, so this patch resets from scratch the underlying hardware
resources that ethernet needs on Mediatek SoCs, including power,
pin mux control, clocks and the internal ethernet circuits, in
order to restore the initial state that a rebooted machine provides.

2) Add a state variable to struct mtk_eth to help distinguish whether
mtk_hw_init is called for the initialization at boot time or for the
re-initialization during the reset process.

3) Add a ge_mode variable to struct mtk_mac for restoring the
interface mode of the current setup for the target MAC.

4) Remove the __init attribute from the mtk_hw_init definition.

Signed-off-by: Sean Wang 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 52 -
 drivers/net/ethernet/mediatek/mtk_eth_soc.h |  8 +
 2 files changed, 52 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index fd5d064..b9ddbcb 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -231,7 +231,7 @@ static int mtk_phy_connect(struct mtk_mac *mac)
 {
struct mtk_eth *eth = mac->hw;
struct device_node *np;
-   u32 val, ge_mode;
+   u32 val;
 
np = of_parse_phandle(mac->of_node, "phy-handle", 0);
if (!np && of_phy_is_fixed_link(mac->of_node))
@@ -245,18 +245,18 @@ static int mtk_phy_connect(struct mtk_mac *mac)
case PHY_INTERFACE_MODE_RGMII_RXID:
case PHY_INTERFACE_MODE_RGMII_ID:
case PHY_INTERFACE_MODE_RGMII:
-   ge_mode = 0;
+   mac->ge_mode = 0;
break;
case PHY_INTERFACE_MODE_MII:
-   ge_mode = 1;
+   mac->ge_mode = 1;
break;
case PHY_INTERFACE_MODE_REVMII:
-   ge_mode = 2;
+   mac->ge_mode = 2;
break;
case PHY_INTERFACE_MODE_RMII:
if (!mac->id)
goto err_phy;
-   ge_mode = 3;
+   mac->ge_mode = 3;
break;
default:
goto err_phy;
@@ -265,7 +265,7 @@ static int mtk_phy_connect(struct mtk_mac *mac)
/* put the gmac into the right mode */
regmap_read(eth->ethsys, ETHSYS_SYSCFG0, &val);
val &= ~SYSCFG0_GE_MODE(SYSCFG0_GE_MASK, mac->id);
-   val |= SYSCFG0_GE_MODE(ge_mode, mac->id);
+   val |= SYSCFG0_GE_MODE(mac->ge_mode, mac->id);
regmap_write(eth->ethsys, ETHSYS_SYSCFG0, val);
 
mtk_phy_connect_node(eth, mac, np);
@@ -1414,9 +1414,12 @@ static int mtk_stop(struct net_device *dev)
return 0;
 }
 
-static int __init mtk_hw_init(struct mtk_eth *eth)
+static int mtk_hw_init(struct mtk_eth *eth)
 {
-   int i;
+   int i, val;
+
+   if (test_and_set_bit(MTK_HW_INIT, &eth->state))
+   return 0;
 
pm_runtime_enable(eth->dev);
pm_runtime_get_sync(eth->dev);
@@ -1432,6 +1435,15 @@ static int __init mtk_hw_init(struct mtk_eth *eth)
reset_control_deassert(eth->rstc);
usleep_range(10, 20);
 
+   regmap_read(eth->ethsys, ETHSYS_SYSCFG0, &val);
+   for (i = 0; i < MTK_MAC_COUNT; i++) {
+   if (!eth->mac[i])
+   continue;
+   val &= ~SYSCFG0_GE_MODE(SYSCFG0_GE_MASK, eth->mac[i]->id);
+   val |= SYSCFG0_GE_MODE(eth->mac[i]->ge_mode, eth->mac[i]->id);
+   }
+   regmap_write(eth->ethsys, ETHSYS_SYSCFG0, val);
+
/* Set GE2 driving and slew rate */
regmap_write(eth->pctl, GPIO_DRV_SEL10, 0xa00);
 
@@ -1483,6 +1495,9 @@ static int __init mtk_hw_init(struct mtk_eth *eth)
 
 static int mtk_hw_deinit(struct mtk_eth *eth)
 {
+   if (!test_and_clear_bit(MTK_HW_INIT, &eth->state))
+   return 0;
+
clk_disable_unprepare(eth->clks[MTK_CLK_GP2]);
clk_disable_unprepare(eth->clks[MTK_CLK_GP1]);
clk_disable_unprepare(eth->clks[MTK_CLK_ESW]);
@@ -1557,6 +1572,27 @@ static void mtk_pending_work(struct work_struct *work)
__set_bit(i, &restart);
}
 
+   /* restart underlying hardware such as power, clock, pin mux
+* and the connected phy
+*/
+   mtk_hw_deinit(eth);
+
+   if (eth->dev->pins)
+   devm_kfree(eth->dev, eth->dev->pins);
+   pinctrl_bind_pins(eth->dev);
+
+   mtk_hw_init(eth);
+
+   for (i = 0; i < MTK_MAC_COUNT; i++) {
+   if (!eth->mac[i] ||
+   of_phy_is_fixed_link(eth->mac[i]->of_node))
+   continue;
+   err = phy_init_hw(eth->mac[i]->phy_dev);
+   if (err)
+   dev_err(eth->dev, "%s: PHY init failed.\n",
+   eth->netdev[i]->name);
+   }
+
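

The MTK_HW_INIT bit makes mtk_hw_init and mtk_hw_deinit safe to call repeatedly, and the saved ge_mode lets a re-initialization rebuild the GE mode fields of ETHSYS_SYSCFG0. A rough userspace sketch of both ideas follows. struct model_eth and every model_* helper are hypothetical, the flag helpers are plain non-atomic analogues of test_and_set_bit and test_and_clear_bit, and the macro copies the field layout of SYSCFG0_GE_MODE with its arguments parenthesized.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_MAC_COUNT		2
#define MODEL_GE_MASK		0x3u
/* Same layout as SYSCFG0_GE_MODE, with the arguments parenthesized. */
#define MODEL_GE_MODE(x, y)	((uint32_t)(x) << (12 + ((y) * 2)))

struct model_eth {
	bool hw_init_done;                 /* stands in for the MTK_HW_INIT bit */
	uint32_t syscfg0;                  /* stands in for ETHSYS_SYSCFG0 */
	uint32_t ge_mode[MODEL_MAC_COUNT]; /* saved per-MAC interface mode */
	int init_runs;                     /* how often the init body really ran */
};

/* Non-atomic analogue of test_and_set_bit(): set the flag, return old value. */
static bool model_test_and_set(bool *flag)
{
	bool old = *flag;

	*flag = true;
	return old;
}

/* Non-atomic analogue of test_and_clear_bit(). */
static bool model_test_and_clear(bool *flag)
{
	bool old = *flag;

	*flag = false;
	return old;
}

static int model_hw_init(struct model_eth *eth)
{
	if (model_test_and_set(&eth->hw_init_done))
		return 0; /* already initialized, nothing to do */

	/* Re-apply the saved mode of every MAC, as the added loop does. */
	for (int i = 0; i < MODEL_MAC_COUNT; i++) {
		eth->syscfg0 &= ~MODEL_GE_MODE(MODEL_GE_MASK, i);
		eth->syscfg0 |= MODEL_GE_MODE(eth->ge_mode[i], i);
	}
	eth->init_runs++;
	return 0;
}

static int model_hw_deinit(struct model_eth *eth)
{
	if (!model_test_and_clear(&eth->hw_init_done))
		return 0; /* never initialized, or already torn down */

	eth->syscfg0 = 0; /* model a block that lost its register state */
	return 0;
}
```

With ge_mode set to 1 for MAC 0 and 3 for MAC 1, a deinit and init cycle rebuilds the register as (1 << 12) | (3 << 14), and a second back-to-back init call returns early without running the body again.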

[PATCH net-next 4/7] net: ethernet: mediatek: add controlling power domain the ethernet belongs to

2016-09-13 Thread sean.wang
From: Sean Wang 

Introduce control of the power domain that the digital circuit of
the ethernet belongs to into the flow of hardware initialization
and deinitialization, so that the entire ethernet hardware block
can restart cleanly and completely, returning to the initial
state it has when the whole machine reboots.

Signed-off-by: Sean Wang 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 917a49c6..fd5d064 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1417,6 +1418,9 @@ static int __init mtk_hw_init(struct mtk_eth *eth)
 {
int i;
 
+   pm_runtime_enable(eth->dev);
+   pm_runtime_get_sync(eth->dev);
+
clk_prepare_enable(eth->clks[MTK_CLK_ETHIF]);
clk_prepare_enable(eth->clks[MTK_CLK_ESW]);
clk_prepare_enable(eth->clks[MTK_CLK_GP1]);
@@ -1484,6 +1488,9 @@ static int mtk_hw_deinit(struct mtk_eth *eth)
clk_disable_unprepare(eth->clks[MTK_CLK_ESW]);
clk_disable_unprepare(eth->clks[MTK_CLK_ETHIF]);
 
+   pm_runtime_put_sync(eth->dev);
+   pm_runtime_disable(eth->dev);
+
return 0;
 }
 
-- 
1.9.1



[PATCH net-next 1/7] net: ethernet: mediatek: refactoring mtk_hw_init to be reused

2016-09-13 Thread sean.wang
From: Sean Wang 

The existing mtk_hw_init mixes hardware and software initialization,
which makes it hard to reuse for the reset recovery process. Split
it so that mtk_hw_init keeps only the hardware initialization, and
move the rest, such as IRQ registration and MDIO initialization,
which concern the interface of the core driver, to a more suitable
place, because there is no need to register the IRQs and
re-initialize the structures again during the reset process.

Signed-off-by: Sean Wang 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 62 -
 1 file changed, 34 insertions(+), 28 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 66fd45a..ca46e82 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -1415,7 +1415,12 @@ static int mtk_stop(struct net_device *dev)
 
 static int __init mtk_hw_init(struct mtk_eth *eth)
 {
-   int err, i;
+   int i;
+
+   clk_prepare_enable(eth->clks[MTK_CLK_ETHIF]);
+   clk_prepare_enable(eth->clks[MTK_CLK_ESW]);
+   clk_prepare_enable(eth->clks[MTK_CLK_GP1]);
+   clk_prepare_enable(eth->clks[MTK_CLK_GP2]);
 
/* reset the frame engine */
reset_control_assert(eth->rstc);
@@ -1441,19 +1446,6 @@ static int __init mtk_hw_init(struct mtk_eth *eth)
/* Enable RX VLan Offloading */
mtk_w32(eth, 1, MTK_CDMP_EG_CTRL);
 
-   err = devm_request_irq(eth->dev, eth->irq[1], mtk_handle_irq_tx, 0,
-  dev_name(eth->dev), eth);
-   if (err)
-   return err;
-   err = devm_request_irq(eth->dev, eth->irq[2], mtk_handle_irq_rx, 0,
-  dev_name(eth->dev), eth);
-   if (err)
-   return err;
-
-   err = mtk_mdio_init(eth);
-   if (err)
-   return err;
-
/* disable delay and normal interrupt */
mtk_w32(eth, 0, MTK_QDMA_DELAY_INT);
mtk_w32(eth, 0, MTK_PDMA_DELAY_INT);
@@ -1783,16 +1775,7 @@ static int mtk_add_mac(struct mtk_eth *eth, struct device_node *np)
eth->netdev[id]->features |= MTK_HW_FEATURES;
eth->netdev[id]->ethtool_ops = &mtk_ethtool_ops;
 
-   err = register_netdev(eth->netdev[id]);
-   if (err) {
-   dev_err(eth->dev, "error bringing up device\n");
-   goto free_netdev;
-   }
eth->netdev[id]->irq = eth->irq[0];
-   netif_info(eth, probe, eth->netdev[id],
-  "mediatek frame engine at 0x%08lx, irq %d\n",
-  eth->netdev[id]->base_addr, eth->irq[0]);
-
return 0;
 
 free_netdev:
@@ -1862,11 +1845,6 @@ static int mtk_probe(struct platform_device *pdev)
}
}
 
-   clk_prepare_enable(eth->clks[MTK_CLK_ETHIF]);
-   clk_prepare_enable(eth->clks[MTK_CLK_ESW]);
-   clk_prepare_enable(eth->clks[MTK_CLK_GP1]);
-   clk_prepare_enable(eth->clks[MTK_CLK_GP2]);
-
eth->msg_enable = netif_msg_init(mtk_msg_level, MTK_DEFAULT_MSG_ENABLE);
INIT_WORK(&eth->pending_work, mtk_pending_work);
 
@@ -1887,6 +1865,34 @@ static int mtk_probe(struct platform_device *pdev)
goto err_free_dev;
}
 
+   err = devm_request_irq(eth->dev, eth->irq[1], mtk_handle_irq_tx, 0,
+  dev_name(eth->dev), eth);
+   if (err)
+   goto err_free_dev;
+
+   err = devm_request_irq(eth->dev, eth->irq[2], mtk_handle_irq_rx, 0,
+  dev_name(eth->dev), eth);
+   if (err)
+   goto err_free_dev;
+
+   err = mtk_mdio_init(eth);
+   if (err)
+   goto err_free_dev;
+
+   for (i = 0; i < MTK_MAX_DEVS; i++) {
+   if (!eth->netdev[i])
+   continue;
+
+   err = register_netdev(eth->netdev[i]);
+   if (err) {
+   dev_err(eth->dev, "error bringing up device\n");
+   goto err_free_dev;
+   } else
+   netif_info(eth, probe, eth->netdev[i],
+  "mediatek frame engine at 0x%08lx, irq %d\n",
+  eth->netdev[i]->base_addr, eth->irq[0]);
+   }
+
/* we run 2 devices on the same DMA ring so we need a dummy device
 * for NAPI to work
 */
-- 
1.9.1



Re: [PATCHv2 next 3/3] ipvlan: Introduce l3s mode

2016-09-13 Thread David Ahern
On 9/12/16 12:01 PM, Mahesh Bandewar wrote:

> +struct sk_buff *ipvlan_l3_rcv(struct net_device *dev, struct sk_buff *skb,
> +   u16 proto)
> +{
> + struct ipvl_addr *addr;
> + struct net_device *sdev;
> +
> + addr = ipvlan_skb_to_addr(skb, dev);
> + if (!addr)
> + goto out;
> +
> + sdev = addr->master->dev;
> + switch (proto) {
> + case AF_INET:
> + {
> + int err;
> + struct iphdr *ip4h = ip_hdr(skb);
> +
> + err = ip_route_input_noref(skb, ip4h->daddr, ip4h->saddr,
> +ip4h->tos, sdev);
> + if (unlikely(err))
> + goto out;
> + break;
> + }
> + case AF_INET6:
> + {
> + struct dst_entry *dst;
> + struct ipv6hdr *ip6h = ipv6_hdr(skb);
> + int flags = RT6_LOOKUP_F_HAS_SADDR;
> + struct flowi6 fl6 = {
> + .flowi6_iif   = sdev->ifindex,
> + .daddr= ip6h->daddr,
> + .saddr= ip6h->saddr,
> + .flowlabel= ip6_flowinfo(ip6h),
> + .flowi6_mark  = skb->mark,
> + .flowi6_proto = ip6h->nexthdr,
> + };
> +
> + skb_dst_drop(skb);
> + dst = ip6_route_input_lookup(dev_net(sdev), sdev, &fl6, flags);
> + skb_dst_set(skb, dst);
> + break;
> + }
> + default:
> + break;
> + }

Nit: why not put the above in separate per-version functions (ipvlan_ip_rcv and 
ipvlan_ip6_rcv) similar to what is done for ipvlan_process_outbound?


> +
> +out:
> + return skb;
> +}
> +
> +unsigned int ipvlan_nf_input(void *priv, struct sk_buff *skb,
> +  const struct nf_hook_state *state)
> +{
> + struct ipvl_addr *addr;
> + unsigned int len;
> +
> + addr = ipvlan_skb_to_addr(skb, skb->dev);
> + if (!addr)
> + goto out;
> +
> + skb->dev = addr->master->dev;
> + len = skb->len + ETH_HLEN;
> + ipvlan_count_rx(addr->master, len, true, false);
> +out:
> + return NF_ACCEPT;
> +}
> diff --git a/drivers/net/ipvlan/ipvlan_main.c 
> b/drivers/net/ipvlan/ipvlan_main.c
> index 18b4e8c7f68a..d02be277e1db 100644
> --- a/drivers/net/ipvlan/ipvlan_main.c
> +++ b/drivers/net/ipvlan/ipvlan_main.c
> @@ -9,24 +9,65 @@
>  
>  #include "ipvlan.h"
>  
> +static struct nf_hook_ops ipvl_nfops[] __read_mostly = {
> + {
> + .hook = ipvlan_nf_input,
> + .pf   = NFPROTO_IPV4,
> + .hooknum  = NF_INET_LOCAL_IN,
> + .priority = INT_MAX,
> + },
> + {
> + .hook = ipvlan_nf_input,
> + .pf   = NFPROTO_IPV6,
> + .hooknum  = NF_INET_LOCAL_IN,
> + .priority = INT_MAX,
> + },
> +};
> +
> +static struct l3mdev_ops ipvl_l3mdev_ops __read_mostly = {
> + .l3mdev_l3_rcv = ipvlan_l3_rcv,
> +};
> +
>  static void ipvlan_adjust_mtu(struct ipvl_dev *ipvlan, struct net_device 
> *dev)
>  {
>   ipvlan->dev->mtu = dev->mtu - ipvlan->mtu_adj;
>  }
>  
> -static void ipvlan_set_port_mode(struct ipvl_port *port, u16 nval)
> +static int ipvlan_set_port_mode(struct ipvl_port *port, u16 nval)
>  {
>   struct ipvl_dev *ipvlan;
> + int err = 0;
>  
> + ASSERT_RTNL();
>   if (port->mode != nval) {
> + if (nval == IPVLAN_MODE_L3S) {
> + port->dev->l3mdev_ops = &ipvl_l3mdev_ops;
> + port->dev->priv_flags |= IFF_L3MDEV_MASTER;
> + if (!port->ipt_hook_added) {
> + err = _nf_register_hooks(ipvl_nfops,
> + ARRAY_SIZE(ipvl_nfops));

That's clever. The hooks are not device based so why do the register for each 
device? Alternatively, you could use a static dst like VRF does for Tx. In the 
ipvlan rcv function set the dst input handler to send the packet back to the 
ipvlan driver via dst->input. From there send the packet through the netfilter 
hooks and then do the real lookup, update the dst and call its input function. 
I have working code for VRF driver somewhere that shows how to do this.

 
> + if (!err)
> + port->ipt_hook_added = true;
> + else
> + return err;
> + }
> + } else {
> + port->dev->priv_flags &= ~IFF_L3MDEV_MASTER;
> + port->dev->l3mdev_ops = NULL;
> + if (port->ipt_hook_added)
> + _nf_unregister_hooks(ipvl_nfops,
> +  ARRAY_SIZE(ipvl_nfops));
> + port->ipt_hook_added = false;
> + }




Re: [RFC PATCH v3 3/7] proc: Reduce cache miss in snmp6_seq_show

2016-09-13 Thread hejianet



On 9/13/16 3:05 AM, Marcelo wrote:

On Fri, Sep 09, 2016 at 02:33:58PM +0800, Jia He wrote:

This is to use the generic interface snmp_get_cpu_field{,64}_batch to
aggregate the data by going through all the items of each cpu sequentially.

Signed-off-by: Jia He 
---
  net/ipv6/proc.c | 32 +++-
  1 file changed, 23 insertions(+), 9 deletions(-)

diff --git a/net/ipv6/proc.c b/net/ipv6/proc.c
index 679253d0..50ba2c3 100644
--- a/net/ipv6/proc.c
+++ b/net/ipv6/proc.c
@@ -30,6 +30,11 @@
  #include 
  #include 
  
+#define MAX4(a, b, c, d) \
+   max_t(u32, max_t(u32, a, b), max_t(u32, c, d))
+#define SNMP_MIB_MAX MAX4(UDP_MIB_MAX, TCP_MIB_MAX, \
+   IPSTATS_MIB_MAX, ICMP_MIB_MAX)
+
  static int sockstat6_seq_show(struct seq_file *seq, void *v)
  {
struct net *net = seq->private;
@@ -192,13 +197,19 @@ static void snmp6_seq_show_item(struct seq_file *seq, 
void __percpu *pcpumib,
const struct snmp_mib *itemlist)
  {
int i;
-   unsigned long val;
-
-   for (i = 0; itemlist[i].name; i++) {
-   val = pcpumib ?
-   snmp_fold_field(pcpumib, itemlist[i].entry) :
-   atomic_long_read(smib + itemlist[i].entry);
-   seq_printf(seq, "%-32s\t%lu\n", itemlist[i].name, val);
+   unsigned long buff[SNMP_MIB_MAX];
+
+   memset(buff, 0, sizeof(unsigned long) * SNMP_MIB_MAX);

This memset() could be moved...


+
+   if (pcpumib) {

... here, so it's not executed if it hits the else block.

Thanks for the suggestion
B.R.
Jia

+   snmp_get_cpu_field_batch(buff, itemlist, pcpumib);
+   for (i = 0; itemlist[i].name; i++)
+   seq_printf(seq, "%-32s\t%lu\n",
+  itemlist[i].name, buff[i]);
+   } else {
+   for (i = 0; itemlist[i].name; i++)
+   seq_printf(seq, "%-32s\t%lu\n", itemlist[i].name,
+  atomic_long_read(smib + itemlist[i].entry));
}
  }
  
@@ -206,10 +217,13 @@ static void snmp6_seq_show_item64(struct seq_file *seq, void __percpu *mib,

  const struct snmp_mib *itemlist, size_t 
syncpoff)
  {
int i;
+   u64 buff64[SNMP_MIB_MAX];
+
+   memset(buff64, 0, sizeof(unsigned long) * SNMP_MIB_MAX);
  
+	snmp_get_cpu_field64_batch(buff64, itemlist, mib, syncpoff);

for (i = 0; itemlist[i].name; i++)
-   seq_printf(seq, "%-32s\t%llu\n", itemlist[i].name,
-  snmp_fold_field64(mib, itemlist[i].entry, syncpoff));
+   seq_printf(seq, "%-32s\t%llu\n", itemlist[i].name, buff64[i]);
  }
  
  static int snmp6_seq_show(struct seq_file *seq, void *v)

--
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
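

The cache-miss argument in this series comes down to loop order: folding one counter at a time walks every cpu's MIB once per counter, while the batch interface walks each cpu's MIB once and picks up all counters from it. The userspace sketch below illustrates only that difference. model_mib_t, model_fold_field and model_fold_batch are hypothetical stand-ins for the per-cpu storage and helpers, and it assumes the batch helper accumulates into a caller-zeroed buffer, which would explain the memset() discussed above.

```c
#include <stddef.h>

#define MODEL_NR_CPUS	4
#define MODEL_NR_ITEMS	3

/* Hypothetical stand-in for per-cpu MIB storage: one row per cpu. */
typedef unsigned long model_mib_t[MODEL_NR_CPUS][MODEL_NR_ITEMS];

/* Per-item fold: each call walks every cpu's row to sum one counter. */
static unsigned long model_fold_field(model_mib_t mib, size_t item)
{
	unsigned long sum = 0;

	for (size_t cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		sum += mib[cpu][item];
	return sum;
}

/*
 * Batched fold: visit each cpu's row once and accumulate all items from it.
 * buff must be zeroed by the caller before the call.
 */
static void model_fold_batch(unsigned long *buff, model_mib_t mib)
{
	for (size_t cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		for (size_t item = 0; item < MODEL_NR_ITEMS; item++)
			buff[item] += mib[cpu][item];
}
```

Both variants produce the same sums. The batched one reads each row sequentially, which is the access pattern the commit message describes as going through all the items of each cpu sequentially.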





Re: [RFC PATCH v3 2/7] proc: Reduce cache miss in {snmp,netstat}_seq_show

2016-09-13 Thread hejianet

Hi Marcelo


On 9/13/16 2:57 AM, Marcelo wrote:

On Fri, Sep 09, 2016 at 02:33:57PM +0800, Jia He wrote:

This is to use the generic interface snmp_get_cpu_field{,64}_batch to
aggregate the data by going through all the items of each cpu sequentially.
Then snmp_seq_show and netstat_seq_show are split into 2 parts to avoid build
warning "the frame size" larger than 1024 on s390.

Yeah about that, did you test it with stack overflow detection?
These arrays can be quite large.

One more below..

I found that scripts/checkstack.pl can analyze the stack usage statically.
[root@tian-lp1 kernel]# objdump -d vmlinux | scripts/checkstack.pl ppc64 | grep seq
0xc07d4b18 netstat_seq_show_tcpext.isra.7 [vmlinux]:1120
0xc07ccbe8 fib_triestat_seq_show [vmlinux]: 496
0xc083e7a4 tcp6_seq_show [vmlinux]: 480
0xc07d4908 snmp_seq_show_ipstats.isra.6 [vmlinux]:464
0xc07d4d18 netstat_seq_show_ipext.isra.8 [vmlinux]:464
0xc06f5bd8 proto_seq_show [vmlinux]:416
0xc07f5718 xfrm_statistics_seq_show [vmlinux]:  416
0xc07405b4 dev_seq_printf_stats [vmlinux]:  400

It seems the stack usage in netstat_seq_show_tcpext is too big.
I will consider how to reduce it.

B.R.
Jia

Signed-off-by: Jia He 
---
  net/ipv4/proc.c | 106 +++-
  1 file changed, 74 insertions(+), 32 deletions(-)

diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 9f665b6..c6fc80e 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -46,6 +46,8 @@
  #include 
  #include 
  
+#define TCPUDP_MIB_MAX max_t(u32, UDP_MIB_MAX, TCP_MIB_MAX)

+
  /*
   *Report socket allocation statistics [m...@utu.fi]
   */
@@ -378,13 +380,15 @@ static void icmp_put(struct seq_file *seq)
  /*
   *Called from the PROCfs module. This outputs /proc/net/snmp.
   */
-static int snmp_seq_show(struct seq_file *seq, void *v)
+static int snmp_seq_show_ipstats(struct seq_file *seq, void *v)
  {
int i;
+   u64 buff64[IPSTATS_MIB_MAX];
struct net *net = seq->private;
  
-	seq_puts(seq, "Ip: Forwarding DefaultTTL");

+   memset(buff64, 0, IPSTATS_MIB_MAX * sizeof(u64));
  
+	seq_puts(seq, "Ip: Forwarding DefaultTTL");

for (i = 0; snmp4_ipstats_list[i].name != NULL; i++)
seq_printf(seq, " %s", snmp4_ipstats_list[i].name);
  
@@ -393,57 +397,77 @@ static int snmp_seq_show(struct seq_file *seq, void *v)

   net->ipv4.sysctl_ip_default_ttl);
  
  	BUILD_BUG_ON(offsetof(struct ipstats_mib, mibs) != 0);

+   snmp_get_cpu_field64_batch(buff64, snmp4_ipstats_list,
+  net->mib.ip_statistics,
+  offsetof(struct ipstats_mib, syncp));
for (i = 0; snmp4_ipstats_list[i].name != NULL; i++)
-   seq_printf(seq, " %llu",
-  snmp_fold_field64(net->mib.ip_statistics,
-snmp4_ipstats_list[i].entry,
-offsetof(struct ipstats_mib, 
syncp)));
+   seq_printf(seq, " %llu", buff64[i]);
  
-	icmp_put(seq);	/* RFC 2011 compatibility */

-   icmpmsg_put(seq);
+   return 0;
+}
+
+static int snmp_seq_show_tcp_udp(struct seq_file *seq, void *v)
+{
+   int i;
+   unsigned long buff[TCPUDP_MIB_MAX];
+   struct net *net = seq->private;
+
+   memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));
  
  	seq_puts(seq, "\nTcp:");

for (i = 0; snmp4_tcp_list[i].name != NULL; i++)
seq_printf(seq, " %s", snmp4_tcp_list[i].name);
  
  	seq_puts(seq, "\nTcp:");

+   snmp_get_cpu_field_batch(buff, snmp4_tcp_list,
+net->mib.tcp_statistics);
for (i = 0; snmp4_tcp_list[i].name != NULL; i++) {
/* MaxConn field is signed, RFC 2012 */
if (snmp4_tcp_list[i].entry == TCP_MIB_MAXCONN)
-   seq_printf(seq, " %ld",
-  snmp_fold_field(net->mib.tcp_statistics,
-  snmp4_tcp_list[i].entry));
+   seq_printf(seq, " %ld", buff[i]);
else
-   seq_printf(seq, " %lu",
-  snmp_fold_field(net->mib.tcp_statistics,
-  snmp4_tcp_list[i].entry));
+   seq_printf(seq, " %lu", buff[i]);
}
  
+	memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));

+
+   snmp_get_cpu_field_batch(buff, snmp4_udp_list,
+net->mib.udp_statistics);
seq_puts(seq, "\nUdp:");
for (i = 0; snmp4_udp_list[i].name != NULL; i++)
seq_printf(seq, " %s", snmp4_udp_list[i].name);
-
seq_puts(seq, "\nUdp:");
for (i = 0; snmp4_udp_list[i].name != NULL; i++)
-   seq_printf(seq, " %lu",
-   

RE: [PATCH net-next 3/5] liquidio CN23XX: Mailbox support

2016-09-13 Thread Vatsavayi, Raghu
Sure Dave, Will submit new patches with these changes.
Thanks
Raghu.

> -Original Message-
> From: David Miller [mailto:da...@davemloft.net]
> Sent: Saturday, September 10, 2016 9:42 PM
> To: Vatsavayi, Raghu
> Cc: netdev@vger.kernel.org; Chickles, Derek; Burla, Satananda; Manlunas,
> Felix; Vatsavayi, Raghu
> Subject: Re: [PATCH net-next 3/5] liquidio CN23XX: Mailbox support
> 
> From: Raghu Vatsavayi 
> Date: Fri, 9 Sep 2016 13:08:25 -0700
> 
> > +int octeon_mbox_read(struct octeon_mbox *mbox) {
> > +   int ret = 0;
> > +   union octeon_mbox_message msg;
> > +
> 
> Please always order local variable declarations from longest to shortest line.
> 
> Please audit your entire submission for this problem.
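

The ordering requested above (longest declaration line first, shortest last) would put the union before the int in the quoted function. A small sketch of what that looks like follows. The model_* types and the function body are hypothetical placeholders, since the real definitions live in the liquidio driver and are not shown in this thread.

```c
/* Hypothetical stand-ins; the real definitions live in the liquidio driver. */
union model_mbox_message {
	unsigned long long u64;
	unsigned char bytes[8];
};

struct model_mbox {
	union model_mbox_message last;
};

static int model_mbox_read(struct model_mbox *mbox)
{
	/* Longest declaration line first, shortest last. */
	union model_mbox_message msg;
	int ret = 0;

	msg = mbox->last;
	if (msg.u64 != 0)
		ret = 1; /* a message was pending */
	return ret;
}
```

Only the order of the two declarations matters here; the body exists just to make the sketch compile and return something.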


Re: Modification to skb->queue_mapping affecting performance

2016-09-13 Thread Eric Dumazet
On Tue, 2016-09-13 at 17:23 -0700, Michael Ma wrote:

> If I understand correctly this is still to associate a qdisc with each
> ifb TXQ. How should I do this if I want to use HTB? I guess I'll need
> to divide the bandwidth of each class in HTB by the number of TX
> queues for each individual HTB qdisc associated?
> 
> My original idea was to attach a HTB qdisc for each ifb queue
> representing a set of flows not sharing bandwidth with others so that
> root lock contention still happens but only affects flows in the same
> HTB. Did I understand the root lock contention issue incorrectly for
> ifb? I do see some comments in __dev_queue_xmit() about using a
> different code path for software devices which bypasses
> __dev_xmit_skb(). Does this mean ifb won't go through
> __dev_xmit_skb()?

You can install HTB on all of your MQ children for sure.

Again, there is no qdisc lock contention if you properly use MQ.

Now if you _need_ to install a single qdisc for whatever reason, then
maybe you want to use a single rx queue on the NIC, to reduce lock
contention ;)





Re: [PATCH] MAINTAINERS: Remove myself from PA Semi entries

2016-09-13 Thread Wolfram Sang
On Tue, Sep 13, 2016 at 02:48:38PM -0700, Olof Johansson wrote:
> The platform is old, very few users and I lack bandwidth to keep after
> it these days.
> 
> Mark the base platform as well as the drivers as orphans, patches have
> been flowing through the fallback maintainers for a while already.
> 
> Signed-off-by: Olof Johansson 
> ---
> 
> Jean, Dave,
> 
> I was hoping to have Michael merge this since the bulk of the platform is 
> under him,
> cc:ing you mostly to be aware that I am orphaning a driver in your subsystems.

Let me answer for Jean since I took over I2C in November 2012 ;) I'd
think the entry can go completely. The last 'F:' tag for the platform
catches the I2C driver anyhow. But in general:

Acked-by: Wolfram Sang 

Thanks,

   Wolfram

> 
> 
> Thanks,
> 
> -Olof
> 
>  MAINTAINERS | 9 +++--
>  1 file changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index d8e81b1..411f4f7 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -7155,9 +7155,8 @@ F:  arch/powerpc/platforms/83xx/
>  F:   arch/powerpc/platforms/85xx/
>  
>  LINUX FOR POWERPC PA SEMI PWRFICIENT
> -M:   Olof Johansson 
>  L:   linuxppc-...@lists.ozlabs.org
> -S:   Maintained
> +S:   Orphan
>  F:   arch/powerpc/platforms/pasemi/
>  F:   drivers/*/*pasemi*
>  F:   drivers/*/*/*pasemi*
> @@ -8849,15 +8848,13 @@ S:Maintained
>  F:   drivers/net/wireless/intersil/p54/
>  
>  PA SEMI ETHERNET DRIVER
> -M:   Olof Johansson 
>  L:   netdev@vger.kernel.org
> -S:   Maintained
> +S:   Orphan
>  F:   drivers/net/ethernet/pasemi/*
>  
>  PA SEMI SMBUS DRIVER
> -M:   Olof Johansson 
>  L:   linux-...@vger.kernel.org
> -S:   Maintained
> +S:   Orphan
>  F:   drivers/i2c/busses/i2c-pasemi.c
>  
>  PADATA PARALLEL EXECUTION MECHANISM
> -- 
> 2.8.0.rc3.29.gb552ff8
> 




Re: Modification to skb->queue_mapping affecting performance

2016-09-13 Thread Michael Ma
2016-09-13 17:09 GMT-07:00 Eric Dumazet :
> On Tue, 2016-09-13 at 16:30 -0700, Michael Ma wrote:
>
>> The RX queue number I found from "ls /sys/class/net/eth0/queues" is
>> 64. (is this the correct way of identifying the queue number on NIC?)
>> I setup ifb with 24 queues which is equal to the TX queue number of
>> eth0 and also the number of CPU cores.
>
> Please do not drop netdev@ from this mail exchange.

Sorry that I accidentally dropped that.
>
> ethtool -l eth0
>
>>
>> > There is no qdisc lock contention anymore AFAIK, since each cpu will use
>> > a dedicate IFB queue and tasklet.
>> >
>> How is this achieved? I thought qdisc on ifb will still be protected
>> by the qdisc root lock in __dev_xmit_skb() so essentially all threads
>> processing qdisc are still serialized without using MQ?
>
> You have to properly setup ifb/mq like in :
>
> # netem based setup, installed at receiver side only
> ETH=eth0
> IFB=ifb10
> #DELAY="delay 100ms"
> EST="est 1sec 4sec"
> #REORDER=1000us
> #LOSS="loss 2.0"
> TXQ=24  # change this to number of TX queues on the physical NIC
>
> ip link add $IFB numtxqueues $TXQ type ifb
> ip link set dev $IFB up
>
> tc qdisc del dev $ETH ingress 2>/dev/null
> tc qdisc add dev $ETH ingress 2>/dev/null
>
> tc filter add dev $ETH parent ffff: \
>protocol ip u32 match u32 0 0 flowid 1:1 \
> action mirred egress redirect dev $IFB
>
> tc qdisc del dev $IFB root 2>/dev/null
>
> tc qdisc add dev $IFB root handle 1: mq
> for i in `seq 1 $TXQ`
> do
>  slot=$( printf %x $(( i )) )
>  tc qd add dev $IFB parent 1:$slot $EST netem \
> limit 10 $DELAY $REORDER $LOSS
> done
>
>
If I understand correctly this is still to associate a qdisc with each
ifb TXQ. How should I do this if I want to use HTB? I guess I'll need
to divide the bandwidth of each class in HTB by the number of TX
queues for each individual HTB qdisc associated?

My original idea was to attach a HTB qdisc for each ifb queue
representing a set of flows not sharing bandwidth with others so that
root lock contention still happens but only affects flows in the same
HTB. Did I understand the root lock contention issue incorrectly for
ifb? I do see some comments in __dev_queue_xmit() about using a
different code path for software devices which bypasses
__dev_xmit_skb(). Does this mean ifb won't go through
__dev_xmit_skb()?


Re: [Intel-wired-lan] [net-next PATCH v3 2/3] e1000: add initial XDP support

2016-09-13 Thread Rustad, Mark D

Alexei Starovoitov  wrote:


On Tue, Sep 13, 2016 at 10:41:12PM +, Rustad, Mark D wrote:

That said, I can see that you have tried to keep the original code path
pretty much intact. I would note that you introduced rcu calls into the !bpf
path that would never have been done before. While that should be ok, I
would really like to see it tested, at least for the !bpf case, on real
hardware to be sure.


please go ahead and test. rcu_read_lock is zero extra instructions
for everything but preempt or debug kernels.


Well, I don't have any hardware in hand to test with, though my former  
employer would. I guess my current employer would too! :-) FWIW, the kernel  
used in that system I referred to before was a preempt kernel.


The test matrix is large, the tail is long and you can't just gloss these  
things over. I understand that it isn't the focus of your work, just as  
regression testing the e1000 is not the focus of any of our work any more.  
That is precisely why it is a sensitive area.


--
Mark Rustad, Networking Division, Intel Corporation




Re: Modification to skb->queue_mapping affecting performance

2016-09-13 Thread Eric Dumazet
On Tue, 2016-09-13 at 16:30 -0700, Michael Ma wrote:

> The RX queue number I found from "ls /sys/class/net/eth0/queues" is
> 64. (is this the correct way of identifying the queue number on NIC?)
> I setup ifb with 24 queues which is equal to the TX queue number of
> eth0 and also the number of CPU cores.

Please do not drop netdev@ from this mail exchange.

ethtool -l eth0

> 
> > There is no qdisc lock contention anymore AFAIK, since each cpu will use
> > a dedicate IFB queue and tasklet.
> >
> How is this achieved? I thought qdisc on ifb will still be protected
> by the qdisc root lock in __dev_xmit_skb() so essentially all threads
> processing qdisc are still serialized without using MQ?

You have to properly setup ifb/mq like in :

# netem based setup, installed at receiver side only
ETH=eth0
IFB=ifb10
#DELAY="delay 100ms"
EST="est 1sec 4sec"
#REORDER=1000us
#LOSS="loss 2.0"
TXQ=24  # change this to number of TX queues on the physical NIC

ip link add $IFB numtxqueues $TXQ type ifb
ip link set dev $IFB up

tc qdisc del dev $ETH ingress 2>/dev/null
tc qdisc add dev $ETH ingress 2>/dev/null

tc filter add dev $ETH parent ffff: \
   protocol ip u32 match u32 0 0 flowid 1:1 \
action mirred egress redirect dev $IFB

tc qdisc del dev $IFB root 2>/dev/null

tc qdisc add dev $IFB root handle 1: mq
for i in `seq 1 $TXQ`
do
 slot=$( printf %x $(( i )) )
 tc qd add dev $IFB parent 1:$slot $EST netem \
limit 10 $DELAY $REORDER $LOSS
done




Re: [PATCH v2] bnx2: Reset device during driver initialization

2016-09-13 Thread Baoquan He
On 09/13/16 at 11:25am, David Miller wrote:
> 
> Just to be clear, I did actually apply this v2 of the patch
> rather than the initial version.:)

Thanks a lot!



Re: [Intel-wired-lan] [net-next PATCH v3 2/3] e1000: add initial XDP support

2016-09-13 Thread Alexei Starovoitov
On Tue, Sep 13, 2016 at 10:41:12PM +, Rustad, Mark D wrote:
> That said, I can see that you have tried to keep the original code path
> pretty much intact. I would note that you introduced rcu calls into the !bpf
> path that would never have been done before. While that should be ok, I
> would really like to see it tested, at least for the !bpf case, on real
> hardware to be sure. 

please go ahead and test. rcu_read_lock is zero extra instructions
for everything but preempt or debug kernels.



Re: [Intel-wired-lan] [net-next PATCH v3 2/3] e1000: add initial XDP support

2016-09-13 Thread Francois Romieu
Rustad, Mark D  :
> Alexei Starovoitov  wrote:
[...]
> > the point that it's only used virtualized, since PCI (not PCIE) have
> > been long dead.
> 
> My point is precisely the opposite. It is a real device, it exists in real
> systems and it is used in those systems. I worked on embedded systems that
> ran Linux and used e1000 devices. I am sure they are still out there because
> customers are still paying for support of those systems.

Old PCI is not the bulk of my professional hardware but it's still used here,
both for networking and video. No embedded systems. Mostly dumb file serving
ranging from quad core to i5 and xeon. Recent kernels. 

> The day is coming when all the motherboards with PCI(-X) will be gone, but I
> think it is still at least a few years off.

Add some time for the PCI <-> PCIe adapters to disappear as well. :o)

-- 
Ueimor


Re: Modification to skb->queue_mapping affecting performance

2016-09-13 Thread Eric Dumazet
On Tue, 2016-09-13 at 15:59 -0700, Michael Ma wrote:
> Hi -
> 
> We currently use mqprio on ifb to work around the qdisc root lock
> contention on the receiver side. The problem that we found was that
> queue_mapping is already set when redirecting from ingress qdisc to
> ifb (based on RX selection, I guess?) so the TX queue selection is not
> based on priority.
> 
> Then we implemented a filter which can set skb->queue_mapping to 0 so
> that TX queue selection can be done as expected and flows with
> different priorities will go through different TX queues. However with
> the queue_mapping recomputed, we found the achievable bandwidth with
> small packets (512 bytes) dropped significantly if they're targeting
> different queues. From perf profile I don't see any bottleneck from
> CPU perspective.
> 
> Any thoughts on why modifying queue_mapping will have this kind of
> effect? Also is there any better way of achieving receiver side
> throttling using HTB while avoiding the qdisc root lock on ifb?

But, how many queues do you have on your NIC, and have you setup ifb to
have a same number of queues ?

There is no qdisc lock contention anymore AFAIK, since each cpu will use
a dedicate IFB queue and tasklet.








Modification to skb->queue_mapping affecting performance

2016-09-13 Thread Michael Ma
Hi -

We currently use mqprio on ifb to work around the qdisc root lock
contention on the receiver side. The problem that we found was that
queue_mapping is already set when redirecting from ingress qdisc to
ifb (based on RX selection, I guess?) so the TX queue selection is not
based on priority.

Then we implemented a filter which can set skb->queue_mapping to 0 so
that TX queue selection can be done as expected and flows with
different priorities will go through different TX queues. However with
the queue_mapping recomputed, we found the achievable bandwidth with
small packets (512 bytes) dropped significantly if they're targeting
different queues. From perf profile I don't see any bottleneck from
CPU perspective.

Any thoughts on why modifying queue_mapping will have this kind of
effect? Also is there any better way of achieving receiver side
throttling using HTB while avoiding the qdisc root lock on ifb?

Thanks,
Michael


[PATCH net-next 1/4] rxrpc: Create an address for sendmsg() to bind unbound socket with

2016-09-13 Thread David Howells
Create an address for sendmsg() to bind unbound socket with rather than
using a completely blank address otherwise the transport socket creation
will fail because it will try to use address family 0.

We use the address family specified in the protocol argument when the
AF_RXRPC socket was created and SOCK_DGRAM as the default.  For anything
else, bind() must be used.

Signed-off-by: David Howells 
---

 net/rxrpc/af_rxrpc.c |   12 
 1 file changed, 12 insertions(+)

diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c
index 25d00ded24bc..741b0d8d2e8c 100644
--- a/net/rxrpc/af_rxrpc.c
+++ b/net/rxrpc/af_rxrpc.c
@@ -401,6 +401,18 @@ static int rxrpc_sendmsg(struct socket *sock, struct msghdr *m, size_t len)
 
switch (rx->sk.sk_state) {
case RXRPC_UNBOUND:
+   rx->srx.srx_family = AF_RXRPC;
+   rx->srx.srx_service = 0;
+   rx->srx.transport_type = SOCK_DGRAM;
+   rx->srx.transport.family = rx->family;
+   switch (rx->family) {
+   case AF_INET:
+   rx->srx.transport_len = sizeof(struct sockaddr_in);
+   break;
+   default:
+   ret = -EAFNOSUPPORT;
+   goto error_unlock;
+   }
local = rxrpc_lookup_local(>srx);
if (IS_ERR(local)) {
ret = PTR_ERR(local);



[PATCH net-next 3/4] rxrpc: Use rxrpc_extract_addr_from_skb() rather than doing this manually

2016-09-13 Thread David Howells
There are two places that want to transmit a packet in response to one just
received and manually pick the address to reply to out of the sk_buff.
Make them use rxrpc_extract_addr_from_skb() instead so that IPv6 is handled
automatically.

Signed-off-by: David Howells 
---

 net/rxrpc/local_event.c |   13 +
 net/rxrpc/output.c  |   32 ++--
 2 files changed, 11 insertions(+), 34 deletions(-)

diff --git a/net/rxrpc/local_event.c b/net/rxrpc/local_event.c
index cdd58e6e9fbd..f073e932500e 100644
--- a/net/rxrpc/local_event.c
+++ b/net/rxrpc/local_event.c
@@ -15,8 +15,6 @@
 #include 
 #include 
 #include 
-#include 
-#include 
 #include 
 #include 
 #include 
@@ -33,7 +31,7 @@ static void rxrpc_send_version_request(struct rxrpc_local *local,
 {
struct rxrpc_wire_header whdr;
struct rxrpc_skb_priv *sp = rxrpc_skb(skb);
-   struct sockaddr_in sin;
+   struct sockaddr_rxrpc srx;
struct msghdr msg;
struct kvec iov[2];
size_t len;
@@ -41,12 +39,11 @@ static void rxrpc_send_version_request(struct rxrpc_local *local,
 
_enter("");
 
-   sin.sin_family = AF_INET;
-   sin.sin_port = udp_hdr(skb)->source;
-   sin.sin_addr.s_addr = ip_hdr(skb)->saddr;
+   if (rxrpc_extract_addr_from_skb(&srx, skb) < 0)
+   return;
 
-   msg.msg_name= 
-   msg.msg_namelen = sizeof(sin);
+   msg.msg_name= 
+   msg.msg_namelen = srx.transport_len;
msg.msg_control = NULL;
msg.msg_controllen = 0;
msg.msg_flags   = 0;
diff --git a/net/rxrpc/output.c b/net/rxrpc/output.c
index 90c7722d5779..ec3621f2c5c8 100644
--- a/net/rxrpc/output.c
+++ b/net/rxrpc/output.c
@@ -15,8 +15,6 @@
 #include 
 #include 
 #include 
-#include 
-#include 
 #include 
 #include 
 #include "ar-internal.h"
@@ -272,10 +270,7 @@ send_fragmentable:
  */
 void rxrpc_reject_packets(struct rxrpc_local *local)
 {
-   union {
-   struct sockaddr sa;
-   struct sockaddr_in sin;
-   } sa;
+   struct sockaddr_rxrpc srx;
struct rxrpc_skb_priv *sp;
struct rxrpc_wire_header whdr;
struct sk_buff *skb;
@@ -292,32 +287,21 @@ void rxrpc_reject_packets(struct rxrpc_local *local)
iov[1].iov_len = sizeof(code);
size = sizeof(whdr) + sizeof(code);
 
-   msg.msg_name = 
+   msg.msg_name = 
msg.msg_control = NULL;
msg.msg_controllen = 0;
msg.msg_flags = 0;
 
-   memset(, 0, sizeof(sa));
-   sa.sa.sa_family = local->srx.transport.family;
-   switch (sa.sa.sa_family) {
-   case AF_INET:
-   msg.msg_namelen = sizeof(sa.sin);
-   break;
-   default:
-   msg.msg_namelen = 0;
-   break;
-   }
-
memset(, 0, sizeof(whdr));
whdr.type = RXRPC_PACKET_TYPE_ABORT;
 
while ((skb = skb_dequeue(>reject_queue))) {
rxrpc_see_skb(skb);
sp = rxrpc_skb(skb);
-   switch (sa.sa.sa_family) {
-   case AF_INET:
-   sa.sin.sin_port = udp_hdr(skb)->source;
-   sa.sin.sin_addr.s_addr = ip_hdr(skb)->saddr;
+
+   if (rxrpc_extract_addr_from_skb(&srx, skb) == 0) {
+   msg.msg_namelen = srx.transport_len;
+
code = htonl(skb->priority);
 
whdr.epoch  = htonl(sp->hdr.epoch);
@@ -329,10 +313,6 @@ void rxrpc_reject_packets(struct rxrpc_local *local)
whdr.flags  &= RXRPC_CLIENT_INITIATED;
 
kernel_sendmsg(local->socket, , iov, 2, size);
-   break;
-
-   default:
-   break;
}
 
rxrpc_free_skb(skb);



[PATCH net-next 2/4] rxrpc: Don't specify protocol to when creating transport socket

2016-09-13 Thread David Howells
Pass 0 as the protocol argument when creating the transport socket rather
than IPPROTO_UDP.

Signed-off-by: David Howells 
---

 net/rxrpc/local_object.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/rxrpc/local_object.c b/net/rxrpc/local_object.c
index 782b9adf67cb..8720be2a6250 100644
--- a/net/rxrpc/local_object.c
+++ b/net/rxrpc/local_object.c
@@ -103,8 +103,8 @@ static int rxrpc_open_socket(struct rxrpc_local *local)
_enter("%p{%d}", local, local->srx.transport_type);
 
/* create a socket to represent the local endpoint */
-   ret = sock_create_kern(_net, PF_INET, local->srx.transport_type,
-  IPPROTO_UDP, >socket);
+   ret = sock_create_kern(_net, local->srx.transport.family,
+  local->srx.transport_type, 0, >socket);
if (ret < 0) {
_leave(" = %d [socket]", ret);
return ret;



[PATCH net-next 4/4] rxrpc: Add IPv6 support

2016-09-13 Thread David Howells
Add IPv6 support to AF_RXRPC.  With this, AF_RXRPC sockets can be created:

service = socket(AF_RXRPC, SOCK_DGRAM, PF_INET6);

instead of:

service = socket(AF_RXRPC, SOCK_DGRAM, PF_INET);

The AFS filesystem doesn't support IPv6 at the moment, though, since that
requires upgrades to some of the RPC calls.

Note that a good portion of this patch is replacing "%pI4:%u" in print
statements with "%pISpc" which is able to handle both protocols and print
the port.

Signed-off-by: David Howells 
---

 net/rxrpc/af_rxrpc.c |   15 +-
 net/rxrpc/conn_object.c  |8 +++
 net/rxrpc/local_object.c |   35 ++-
 net/rxrpc/output.c   |   16 +++
 net/rxrpc/peer_event.c   |   24 ++
 net/rxrpc/peer_object.c  |  109 +-
 net/rxrpc/proc.c |   30 +
 7 files changed, 154 insertions(+), 83 deletions(-)

diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c
index 741b0d8d2e8c..f61f7b2d1ca4 100644
--- a/net/rxrpc/af_rxrpc.c
+++ b/net/rxrpc/af_rxrpc.c
@@ -106,19 +106,23 @@ static int rxrpc_validate_address(struct rxrpc_sock *rx,
case AF_INET:
if (srx->transport_len < sizeof(struct sockaddr_in))
return -EINVAL;
-   _debug("INET: %x @ %pI4",
-  ntohs(srx->transport.sin.sin_port),
-  >transport.sin.sin_addr);
tail = offsetof(struct sockaddr_rxrpc, transport.sin.__pad);
break;
 
case AF_INET6:
+   if (srx->transport_len < sizeof(struct sockaddr_in6))
+   return -EINVAL;
+   tail = offsetof(struct sockaddr_rxrpc, transport) +
+   sizeof(struct sockaddr_in6);
+   break;
+
default:
return -EAFNOSUPPORT;
}
 
if (tail < len)
memset((void *)srx + tail, 0, len - tail);
+   _debug("INET: %pISp", >transport);
return 0;
 }
 
@@ -409,6 +413,9 @@ static int rxrpc_sendmsg(struct socket *sock, struct msghdr *m, size_t len)
case AF_INET:
rx->srx.transport_len = sizeof(struct sockaddr_in);
break;
+   case AF_INET6:
+   rx->srx.transport_len = sizeof(struct sockaddr_in6);
+   break;
default:
ret = -EAFNOSUPPORT;
goto error_unlock;
@@ -563,7 +570,7 @@ static int rxrpc_create(struct net *net, struct socket *sock, int protocol,
return -EAFNOSUPPORT;
 
/* we support transport protocol UDP/UDP6 only */
-   if (protocol != PF_INET)
+   if (protocol != PF_INET && protocol != PF_INET6)
return -EPROTONOSUPPORT;
 
if (sock->type != SOCK_DGRAM)
diff --git a/net/rxrpc/conn_object.c b/net/rxrpc/conn_object.c
index ffa9addb97b2..c0ddba787fd4 100644
--- a/net/rxrpc/conn_object.c
+++ b/net/rxrpc/conn_object.c
@@ -134,6 +134,14 @@ struct rxrpc_connection *rxrpc_find_connection_rcu(struct rxrpc_local *local,
srx.transport.sin.sin_addr.s_addr)
goto not_found;
break;
+   case AF_INET6:
+   if (peer->srx.transport.sin6.sin6_port !=
+   srx.transport.sin6.sin6_port ||
+   memcmp(>srx.transport.sin6.sin6_addr,
+  _addr,
+  sizeof(struct in6_addr)) != 0)
+   goto not_found;
+   break;
default:
BUG();
}
diff --git a/net/rxrpc/local_object.c b/net/rxrpc/local_object.c
index 8720be2a6250..f5b9bb0d3f98 100644
--- a/net/rxrpc/local_object.c
+++ b/net/rxrpc/local_object.c
@@ -58,6 +58,15 @@ static long rxrpc_local_cmp_key(const struct rxrpc_local *local,
memcmp(>srx.transport.sin.sin_addr,
   >transport.sin.sin_addr,
   sizeof(struct in_addr));
+   case AF_INET6:
+   /* If the choice of UDP6 port is left up to the transport, then
+* the endpoint record doesn't match.
+*/
+   return ((u16 __force)local->srx.transport.sin6.sin6_port -
+   (u16 __force)srx->transport.sin6.sin6_port) ?:
+   memcmp(>srx.transport.sin6.sin6_addr,
+  >transport.sin6.sin6_addr,
+  sizeof(struct in6_addr));
default:
BUG();
}
@@ -100,7 +109,8 @@ static int rxrpc_open_socket(struct rxrpc_local *local)
struct sock *sock;
int ret, opt;
 
-   _enter("%p{%d}", local, local->srx.transport_type);
+   _enter("%p{%d,%d}",
+  local, 
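
[Editor's sketch, not part of the patch.] The AF_INET6 case added to
rxrpc_local_cmp_key() above orders two endpoints by port first and only
compares addresses when the ports are equal. A minimal userspace analogue
is sketched below; the name cmp_in6_endpoint() is hypothetical, and the
GNU "?:" shorthand used in the kernel is spelled out as an if statement.
Ports are compared as stored (network byte order), so the sign of the
result gives a consistent ordering rather than a numeric one.

```c
#include <netinet/in.h>
#include <string.h>

/*
 * Hypothetical userspace analogue of the AF_INET6 comparison: order two
 * endpoints by port, then by address.  Returns 0 when both port and
 * address match and non-zero otherwise.
 */
static long cmp_in6_endpoint(const struct sockaddr_in6 *a,
			     const struct sockaddr_in6 *b)
{
	/* Both operands are promoted before subtracting, so no wraparound. */
	long diff = (long)a->sin6_port - (long)b->sin6_port;

	if (diff)
		return diff;

	/* Ports match: fall back to a bytewise address comparison. */
	return memcmp(&a->sin6_addr, &b->sin6_addr, sizeof(struct in6_addr));
}
```

A lookup that walks a tree of local endpoints could use the sign of the
result to pick a branch and a zero result to detect an existing record.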

[PATCH net-next 0/4] rxrpc: Support IPv6

2016-09-13 Thread David Howells

Here is a set of patches that add IPv6 support.  They need to be applied on
top of the just-posted miscellaneous fix patches.  They are:

 (1) Make autobinding of an unconnected socket work when sendmsg() is
 called to initiate a client call.

 (2) Don't specify the protocol when creating the client socket, but rather
 take the default instead.

 (3) Use rxrpc_extract_addr_from_skb() in a couple of places that were
 doing the same thing manually.  This allows the IPv6 address
 extraction to be done in fewer places.

 (4) Add IPv6 support.  With this, calls can be made to IPv6 servers from
 userspace AF_RXRPC programs; AFS, however, can't use IPv6 yet as the
 RPC calls need to be upgradeable.

The patches can be found here also:


http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-rewrite

Tagged thusly:

git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
rxrpc-rewrite-20160913-2

David
---
David Howells (4):
  rxrpc: Create an address for sendmsg() to bind unbound socket with
  rxrpc: Don't specify protocol to when creating transport socket
  rxrpc: Use rxrpc_extract_addr_from_skb() rather than doing this manually
  rxrpc: Add IPv6 support


 net/rxrpc/af_rxrpc.c |   27 ++-
 net/rxrpc/conn_object.c  |8 +++
 net/rxrpc/local_event.c  |   13 ++---
 net/rxrpc/local_object.c |   39 +++-
 net/rxrpc/output.c   |   48 +---
 net/rxrpc/peer_event.c   |   24 ++
 net/rxrpc/peer_object.c  |  109 +-
 net/rxrpc/proc.c |   30 +
 8 files changed, 179 insertions(+), 119 deletions(-)



Re: [Intel-wired-lan] [net-next PATCH v3 2/3] e1000: add initial XDP support

2016-09-13 Thread Rustad, Mark D

Alexei Starovoitov wrote:

> On Tue, Sep 13, 2016 at 07:14:27PM +, Rustad, Mark D wrote:
> > Alexei Starovoitov wrote:
> > > On Tue, Sep 13, 2016 at 06:28:03PM +, Rustad, Mark D wrote:
> > > > Alexei Starovoitov wrote:
> > > > > I've looked through qemu and it appears only emulate e1k and tg3.
> > > > > The latter is still used in the field, so the risk of touching
> > > > > it is higher.
> > > >
> > > > I have no idea what makes you think that e1k is *not* "used in the
> > > > field". I grant you it probably is used more virtualized than not
> > > > these days, but it certainly exists and is used. You can still buy
> > > > them new at Newegg for goodness sakes!
> > >
> > > the point that it's only used virtualized, since PCI (not PCIE) have
> > > been long dead.
> >
> > My point is precisely the opposite. It is a real device, it exists in real
> > systems and it is used in those systems. I worked on embedded systems that
> > ran Linux and used e1000 devices. I am sure they are still out there
> > because customers are still paying for support of those systems.
> >
> > Yes, PCI(-X) is absent from any current hardware and has been for some
> > years now, but there is an installed base that continues. What part of that
> > installed base updates software? I don't know, but I would not just assume
> > that it is 0. I know that I updated the kernel on those embedded systems
> > that I worked on when I was supporting them. Never at the bleeding edge,
> > but generally hopping from one LTS kernel to another as needed.
>
> I suspect modern linux won't boot on such old pci only systems for other
> reasons not related to networking, since no one really cares to test
> kernels there.

Actually it does boot, because although the motherboard was PCIe, the slots
and the adapters in them were PCI-X. So the core architecture was not so
stale.

> So I think we mostly agree. There is chance that this xdp e1k code will
> find a way to that old system. What are the chances those users will
> be using xdp there? I think pretty close to zero.

For sure they wouldn't be using XDP, but they could suffer regressions in a
changed driver that might find its way there. That is the risk.

> The pci-e nics integrated into motherboards that pretend to be tg3
> (though they're not at all build by broadcom) are significantly more
> common. That's why I picked e1k instead of tg3.

That may be true (I really don't know anything about tg3 so I certainly
can't dispute it), so the risk could be smaller with e1k, but there is
still a regression risk for real existing hardware. That is my concern.

> Also note how this patch is not touching anything in the main e1k path
> (because I don't have a hw to test and I suspect Intel's driver team
> doesn't have it either) to make sure there is no breakage on those
> old systems. I created separate e1000_xmit_raw_frame() routine
> instead of adding flags into e1000_xmit_frame() for the same reasons:
> to make sure there is no breakage.
> Same reasoning for not doing an union of page/skb as Alexander suggested.
> I wanted minimal change to e1k that allows development xdp programs in kvm
> without affecting e1k main path. If you see the actual bug in the patch,
> please point out the line.

I can't say that I can, because I am not familiar with the internals of
e1k. When I was using it, I never had cause to even look at the driver
because it just worked. My attentions then were properly elsewhere.

My concern is with messing with a driver that probably no one has an active
testbed routinely running regression tests.

Maybe a new ID should be assigned and the driver forked for this purpose.
At least then only the virtualization case would have to be tested. Of
course the hosts would have to offer that ID, but if this is just for
testing that should be ok, or at least possible.

If someone with more internal knowledge of this driver has enough
confidence to bless such patches, that might be fine, but it is a fallacy
to think that e1k is *only* a virtualization driver today. Not yet anyway.
Maybe around 2020.

That said, I can see that you have tried to keep the original code path
pretty much intact. I would note that you introduced rcu calls into the
!bpf path that would never have been done before. While that should be ok,
I would really like to see it tested, at least for the !bpf case, on real
hardware to be sure. I really can't comment on the workaround issue brought
up by Eric, because I just don't know about them. At least that risk seems
to only be in the bpf case.


--
Mark Rustad, Networking Division, Intel Corporation




Re: [v11, 5/8] soc: fsl: add GUTS driver for QorIQ platforms

2016-09-13 Thread Scott Wood
On Tue, 2016-09-13 at 07:23 +, Y.B. Lu wrote:
> > -Original Message-
> > From: linux-mmc-ow...@vger.kernel.org [mailto:linux-mmc-
> > ow...@vger.kernel.org] On Behalf Of Scott Wood
> > Sent: Tuesday, September 13, 2016 7:25 AM
> > To: Y.B. Lu; linux-...@vger.kernel.org; ulf.hans...@linaro.org; Arnd
> > Bergmann
> > Cc: linuxppc-...@lists.ozlabs.org; devicet...@vger.kernel.org; linux-arm-
> > ker...@lists.infradead.org; linux-ker...@vger.kernel.org; linux-
> > c...@vger.kernel.org; linux-...@vger.kernel.org; iommu@lists.linux-
> > foundation.org; netdev@vger.kernel.org; Mark Rutland; Rob Herring;
> > Russell King; Jochen Friedrich; Joerg Roedel; Claudiu Manoil; Bhupesh
> > Sharma; Qiang Zhao; Kumar Gala; Santosh Shilimkar; Leo Li; X.B. Xie
> > Subject: Re: [v11, 5/8] soc: fsl: add GUTS driver for QorIQ platforms
> > 
> > BTW, aren't ls2080a and ls2085a the same die?  And is there no non-E
> > version of LS2080A/LS2040A?
> [Lu Yangbo-B47093] I checked all the svr values in chip errata doc "Revision
> level to part marking cross-reference" table.
> I found ls2080a and ls2085a were in two separate doc. And I didn’t find non-
> E version of LS2080A/LS2040A in chip errata doc.
> Do you know is there any other doc we can confirm this?

No.  Traditionally we've always had E and non-E versions of each chip, but I
have no knowledge of whether that has changed (I do note that the way that E-
status is indicated in SVR has changed).

But please label LS2080A and LS2085A as the same die (or provide strong
evidence that they are not).

> > > > > + do {
> > > > > + if (!matches->soc_id)
> > > > > + return NULL;
> > > > > + if (glob_match(svr_match, matches->soc_id))
> > > > > + break;
> > > > > + } while (matches++);
> > > > Are you expecting "matches++" to ever evaluate as false?
> > > [Lu Yangbo-B47093] Yes, this is used to match the soc we use in
> > > qoriq_soc array until getting true.
> > > We need to get the name and die information defined in array.
> > I'm not asking whether the glob_match will ever return true.  I'm saying
> > that "matches++" will never become NULL.
> [Lu Yangbo-B47093] The matches++ will never become NULL while it will return
> NULL after matching for all the members in array.

"matches++" will never "return NULL".  It's just an incrementing address.  It
won't be null until you wrap around the address space, and even if the other
loop terminators never kicked in you'd crash long before that happens.

Please rewrite the loop as something like:

while (matches->soc_id) {
if (glob_match(...))
return matches;

matches++;
}

return NULL;
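
[Editor's sketch, not part of the original mail.] The loop suggested above
can be tried in userspace roughly as follows. Everything here is an
assumption made for illustration: struct soc_match, the demo_matches
table and its contents are hypothetical, and strcmp() stands in for the
kernel's glob_match(). The point it shows is that termination comes from
the sentinel entry's NULL soc_id, not from the pointer itself ever
becoming NULL.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the driver's match-table entry. */
struct soc_match {
	const char *soc_id;	/* NULL marks the end of the table */
	const char *die;
};

/* Illustrative data only; none of these values come from the patch. */
static const struct soc_match demo_matches[] = {
	{ "soc-a", "die-1" },
	{ "soc-b", "die-1" },
	{ "soc-c", "die-2" },
	{ NULL, NULL },		/* sentinel entry */
};

/*
 * Walk the table until the sentinel.  The loop condition tests the
 * entry's soc_id, so the walk stops at the end of the array instead of
 * running off it.  Returns the matching entry, or NULL if none matches.
 */
static const struct soc_match *find_match(const struct soc_match *matches,
					  const char *id)
{
	while (matches->soc_id) {
		if (strcmp(id, matches->soc_id) == 0)
			return matches;

		matches++;
	}

	return NULL;
}
```

Calling find_match(demo_matches, "soc-b") returns the second entry, while
an unknown id falls through to the sentinel and yields NULL.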


> > > > > + /* Register soc device */
> > > > > + soc_dev_attr = kzalloc(sizeof(*soc_dev_attr), GFP_KERNEL);
> > > > > + if (!soc_dev_attr) {
> > > > > + ret = -ENOMEM;
> > > > > + goto out_unmap;
> > > > > + }
> > > > Couldn't this be statically allocated?
> > > [Lu Yangbo-B47093] Do you mean we define this struct statically ?
> > > 
> > > static struct soc_device_attribute soc_dev_attr;
> > Yes.
> > 
> [Lu Yangbo-B47093] It's ok to define it statically. Is there any need to do
> that?

It's simpler.

-Scott



[PATCH net-next 07/10] rxrpc: Allow tx_winsize to grow in response to an ACK

2016-09-13 Thread David Howells
Allow tx_winsize to grow when the ACK info packet shows a larger receive
window at the other end rather than only permitting it to shrink.

Signed-off-by: David Howells 
---

 net/rxrpc/input.c |8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/net/rxrpc/input.c b/net/rxrpc/input.c
index 5958ef8ba2a0..8e529afcd6c1 100644
--- a/net/rxrpc/input.c
+++ b/net/rxrpc/input.c
@@ -333,14 +333,16 @@ static void rxrpc_input_ackinfo(struct rxrpc_call *call, struct sk_buff *skb,
struct rxrpc_skb_priv *sp = rxrpc_skb(skb);
struct rxrpc_peer *peer;
unsigned int mtu;
+   u32 rwind = ntohl(ackinfo->rwind);
 
_proto("Rx ACK %%%u Info { rx=%u max=%u rwin=%u jm=%u }",
   sp->hdr.serial,
   ntohl(ackinfo->rxMTU), ntohl(ackinfo->maxMTU),
-  ntohl(ackinfo->rwind), ntohl(ackinfo->jumbo_max));
+  rwind, ntohl(ackinfo->jumbo_max));
 
-   if (call->tx_winsize > ntohl(ackinfo->rwind))
-   call->tx_winsize = ntohl(ackinfo->rwind);
+   if (rwind > RXRPC_RXTX_BUFF_SIZE - 1)
+   rwind = RXRPC_RXTX_BUFF_SIZE - 1;
+   call->tx_winsize = rwind;
 
mtu = min(ntohl(ackinfo->rxMTU), ntohl(ackinfo->maxMTU));
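
[Editor's sketch, not part of the patch.] The new behaviour in the hunk
above is that the advertised receive window is taken as-is, whether
larger or smaller than the current value, and only capped at the ring
size less one. A standalone version of that clamp might look like this;
clamp_tx_winsize() is a hypothetical name, and the value 64 is the
RXRPC_RXTX_BUFF_SIZE shown in the ar-internal.h hunk of patch 10/10 in
this series.

```c
#include <stdint.h>

/* Ring size as defined in the ar-internal.h hunk of patch 10/10. */
#define RXRPC_RXTX_BUFF_SIZE	64

/*
 * Hypothetical helper: pick the Tx window from the receive window the
 * peer advertised (already converted to host byte order).  The result
 * may be larger or smaller than the previous window, but never exceeds
 * the ring size less one slot, so it always fits in a u8.
 */
static uint8_t clamp_tx_winsize(uint32_t rwind)
{
	if (rwind > RXRPC_RXTX_BUFF_SIZE - 1)
		rwind = RXRPC_RXTX_BUFF_SIZE - 1;

	return (uint8_t)rwind;
}
```

Clamping before the narrowing cast matters: assigning an unclamped 32-bit
value straight to a u8 field would silently truncate it.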
 



[PATCH net-next 09/10] rxrpc: Fix prealloc refcounting

2016-09-13 Thread David Howells
The preallocated call buffer holds a ref on the calls within that buffer.
The ref was being released in the wrong place - it worked okay for incoming
calls to the AFS cache manager service, but doesn't work right for incoming
calls to a userspace service.

Instead of releasing an extra ref service calls in rxrpc_release_call(),
the ref needs to be released during the acceptance/rejectance process.  To
this end:

 (1) The prealloc ref is now normally released during
 rxrpc_new_incoming_call().

 (2) For preallocated kernel API calls, the kernel API's ref needs to be
 released when the call is discarded on socket close.

 (3) We shouldn't take a second ref in rxrpc_accept_call().

 (4) rxrpc_recvmsg_new_call() needs to get a ref of its own when it adds
 the call to the to_be_accepted socket queue.

In doing (4) above, we would prefer not to put the call's refcount down to
0 as that entails doing cleanup in softirq context, but it's unlikely as
there are several refs held elsewhere, at least one of which must be put by
someone in process context calling rxrpc_release_call().  However, it's not
a problem if we do have to do that.

Signed-off-by: David Howells 
---

 net/rxrpc/call_accept.c |9 -
 net/rxrpc/call_object.c |3 ---
 net/rxrpc/recvmsg.c |1 +
 3 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/net/rxrpc/call_accept.c b/net/rxrpc/call_accept.c
index 5fd9d2c89b7f..26c293ef98eb 100644
--- a/net/rxrpc/call_accept.c
+++ b/net/rxrpc/call_accept.c
@@ -221,6 +221,7 @@ void rxrpc_discard_prealloc(struct rxrpc_sock *rx)
if (rx->discard_new_call) {
_debug("discard %lx", call->user_call_ID);
rx->discard_new_call(call, call->user_call_ID);
+   rxrpc_put_call(call, rxrpc_call_put_kernel);
}
rxrpc_call_completed(call);
rxrpc_release_call(rx, call);
@@ -402,6 +403,13 @@ found_service:
if (call->state == RXRPC_CALL_SERVER_ACCEPTING)
rxrpc_notify_socket(call);
 
+   /* We have to discard the prealloc queue's ref here and rely on a
+* combination of the RCU read lock and refs held either by the socket
+* (recvmsg queue, to-be-accepted queue or user ID tree) or the kernel
+* service to prevent the call from being deallocated too early.
+*/
+   rxrpc_put_call(call, rxrpc_call_put);
+
_leave(" = %p{%d}", call, call->debug_id);
 out:
spin_unlock(>incoming_lock);
@@ -469,7 +477,6 @@ struct rxrpc_call *rxrpc_accept_call(struct rxrpc_sock *rx,
}
 
/* formalise the acceptance */
-   rxrpc_get_call(call, rxrpc_call_got);
call->notify_rx = notify_rx;
call->user_call_ID = user_call_ID;
rxrpc_get_call(call, rxrpc_call_got_userid);
diff --git a/net/rxrpc/call_object.c b/net/rxrpc/call_object.c
index 3f9476508204..9aa1c4b53563 100644
--- a/net/rxrpc/call_object.c
+++ b/net/rxrpc/call_object.c
@@ -464,9 +464,6 @@ void rxrpc_release_call(struct rxrpc_sock *rx, struct rxrpc_call *call)
call->rxtx_buffer[i] = NULL;
}
 
-   /* We have to release the prealloc backlog ref */
-   if (rxrpc_is_service_call(call))
-   rxrpc_put_call(call, rxrpc_call_put);
_leave("");
 }
 
diff --git a/net/rxrpc/recvmsg.c b/net/rxrpc/recvmsg.c
index 16ff56f69256..a284205b8ecf 100644
--- a/net/rxrpc/recvmsg.c
+++ b/net/rxrpc/recvmsg.c
@@ -118,6 +118,7 @@ static int rxrpc_recvmsg_new_call(struct rxrpc_sock *rx,
list_del_init(>recvmsg_link);
write_unlock_bh(>recvmsg_lock);
 
+   rxrpc_get_call(call, rxrpc_call_got);
write_lock(>call_lock);
list_add_tail(>accept_link, >to_be_accepted);
write_unlock(>call_lock);



[PATCH net-next 01/10] rxrpc: Make sure we initialise the peer hash key

2016-09-13 Thread David Howells
Peer records created for incoming connections weren't getting their hash
key set.  This meant that incoming calls wouldn't see more than one DATA
packet - which is not a problem for AFS CM calls with small request data
blobs.

Signed-off-by: David Howells 
---

 net/rxrpc/peer_object.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/rxrpc/peer_object.c b/net/rxrpc/peer_object.c
index 2efe29a4c232..3e6cd174b53d 100644
--- a/net/rxrpc/peer_object.c
+++ b/net/rxrpc/peer_object.c
@@ -203,6 +203,7 @@ struct rxrpc_peer *rxrpc_alloc_peer(struct rxrpc_local *local, gfp_t gfp)
  */
 static void rxrpc_init_peer(struct rxrpc_peer *peer, unsigned long hash_key)
 {
+   peer->hash_key = hash_key;
rxrpc_assess_MTU_size(peer);
peer->mtu = peer->if_mtu;
 
@@ -238,7 +239,6 @@ static struct rxrpc_peer *rxrpc_create_peer(struct rxrpc_local *local,
 
peer = rxrpc_alloc_peer(local, gfp);
if (peer) {
-   peer->hash_key = hash_key;
memcpy(>srx, srx, sizeof(*srx));
rxrpc_init_peer(peer, hash_key);
}



[PATCH net-next 03/10] rxrpc: The IDLE ACK packet should use rxrpc_idle_ack_delay

2016-09-13 Thread David Howells
The IDLE ACK packet should use the rxrpc_idle_ack_delay setting when the
timer is set for it.

Signed-off-by: David Howells 
---

 net/rxrpc/call_event.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/rxrpc/call_event.c b/net/rxrpc/call_event.c
index 2b976e789562..61432049869b 100644
--- a/net/rxrpc/call_event.c
+++ b/net/rxrpc/call_event.c
@@ -95,7 +95,7 @@ static void __rxrpc_propose_ACK(struct rxrpc_call *call, u8 ack_reason,
break;
 
case RXRPC_ACK_IDLE:
-   if (rxrpc_soft_ack_delay < expiry)
+   if (rxrpc_idle_ack_delay < expiry)
expiry = rxrpc_idle_ack_delay;
break;
 



[PATCH net-next 10/10] rxrpc: Correctly initialise, limit and transmit call->rx_winsize

2016-09-13 Thread David Howells
call->rx_winsize should be initialised to the sysctl setting and the sysctl
setting should be limited to the maximum we want to permit.  Further, we
need to place this in the ACK info instead of the sysctl setting.

Furthermore, discard the idea of accepting the subpackets of a jumbo packet
that lie beyond the receive window when the first packet of the jumbo is
within the window.  Just discard the excess subpackets instead.  This
allows the receive window to be opened up right to the buffer size less one
for the dead slot.

Signed-off-by: David Howells 
---

 net/rxrpc/ar-internal.h |3 ++-
 net/rxrpc/call_object.c |2 +-
 net/rxrpc/input.c   |   23 ---
 net/rxrpc/misc.c|5 -
 net/rxrpc/output.c  |4 ++--
 net/rxrpc/sysctl.c  |2 +-
 6 files changed, 26 insertions(+), 13 deletions(-)

diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index 47c74a581a0f..e78c40b37db5 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -498,6 +498,7 @@ struct rxrpc_call {
 */
 #define RXRPC_RXTX_BUFF_SIZE   64
 #define RXRPC_RXTX_BUFF_MASK   (RXRPC_RXTX_BUFF_SIZE - 1)
+#define RXRPC_INIT_RX_WINDOW_SIZE 32
struct sk_buff  **rxtx_buffer;
u8  *rxtx_annotations;
 #define RXRPC_TX_ANNO_ACK  0
@@ -518,7 +519,7 @@ struct rxrpc_call {
rxrpc_seq_t rx_expect_next; /* Expected next packet sequence number */
u8  rx_winsize; /* Size of Rx window */
u8  tx_winsize; /* Maximum size of Tx window */
-   u8  nr_jumbo_dup;   /* Number of jumbo duplicates */
+   u8  nr_jumbo_bad;   /* Number of jumbo dups/exceeds-windows */
 
/* receive-phase ACK management */
u8  ackr_reason;/* reason to ACK */
diff --git a/net/rxrpc/call_object.c b/net/rxrpc/call_object.c
index 9aa1c4b53563..22f9b0d1a138 100644
--- a/net/rxrpc/call_object.c
+++ b/net/rxrpc/call_object.c
@@ -152,7 +152,7 @@ struct rxrpc_call *rxrpc_alloc_call(gfp_t gfp)
memset(>sock_node, 0xed, sizeof(call->sock_node));
 
/* Leave space in the ring to handle a maxed-out jumbo packet */
-   call->rx_winsize = RXRPC_RXTX_BUFF_SIZE - 1 - 46;
+   call->rx_winsize = rxrpc_rx_window_size;
call->tx_winsize = 16;
call->rx_expect_next = 1;
return call;
diff --git a/net/rxrpc/input.c b/net/rxrpc/input.c
index 8e529afcd6c1..75af0bd316c7 100644
--- a/net/rxrpc/input.c
+++ b/net/rxrpc/input.c
@@ -164,7 +164,7 @@ protocol_error:
  * (that information is encoded in the ACK packet).
  */
 static void rxrpc_input_dup_data(struct rxrpc_call *call, rxrpc_seq_t seq,
-u8 annotation, bool *_jumbo_dup)
+u8 annotation, bool *_jumbo_bad)
 {
/* Discard normal packets that are duplicates. */
if (annotation == 0)
@@ -174,9 +174,9 @@ static void rxrpc_input_dup_data(struct rxrpc_call *call, rxrpc_seq_t seq,
 * more partially duplicate jumbo packets, we refuse to take any more
 * jumbos for this call.
 */
-   if (!*_jumbo_dup) {
-   call->nr_jumbo_dup++;
-   *_jumbo_dup = true;
+   if (!*_jumbo_bad) {
+   call->nr_jumbo_bad++;
+   *_jumbo_bad = true;
}
 }
 
@@ -191,7 +191,7 @@ static void rxrpc_input_data(struct rxrpc_call *call, struct sk_buff *skb,
unsigned int ix;
rxrpc_serial_t serial = sp->hdr.serial, ack_serial = 0;
rxrpc_seq_t seq = sp->hdr.seq, hard_ack;
-   bool immediate_ack = false, jumbo_dup = false, queued;
+   bool immediate_ack = false, jumbo_bad = false, queued;
u16 len;
u8 ack = 0, flags, annotation = 0;
 
@@ -222,7 +222,7 @@ static void rxrpc_input_data(struct rxrpc_call *call, struct sk_buff *skb,
 
flags = sp->hdr.flags;
if (flags & RXRPC_JUMBO_PACKET) {
-   if (call->nr_jumbo_dup > 3) {
+   if (call->nr_jumbo_bad > 3) {
ack = RXRPC_ACK_NOSPACE;
ack_serial = serial;
goto ack;
@@ -259,7 +259,7 @@ next_subpacket:
}
 
if (call->rxtx_buffer[ix]) {
-   rxrpc_input_dup_data(call, seq, annotation, &jumbo_dup);
+   rxrpc_input_dup_data(call, seq, annotation, &jumbo_bad);
if (ack != RXRPC_ACK_DUPLICATE) {
ack = RXRPC_ACK_DUPLICATE;
ack_serial = serial;
@@ -304,6 +304,15 @@ skip:
annotation++;
if (flags & RXRPC_JUMBO_PACKET)
annotation |= RXRPC_RX_ANNO_JLAST;
+   if (after(seq, hard_ack + call->rx_winsize)) {
+   ack = RXRPC_ACK_EXCEEDS_WINDOW;
+   ack_serial = serial;
+   if (!jumbo_bad) 

[PATCH net-next 06/10] rxrpc: Use skb->len not skb->data_len

2016-09-13 Thread David Howells
skb->len should be used rather than skb->data_len when referring to the
amount of data in a packet.  This will only cause a malfunction in the
following cases:

 (1) We receive a jumbo packet (validation and splitting both are wrong).

 (2) We see if there's extra ACK info in an ACK packet (we think it's not
 there and just ignore it).

Signed-off-by: David Howells 
---

 net/rxrpc/input.c |8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/rxrpc/input.c b/net/rxrpc/input.c
index a707d5952164..5958ef8ba2a0 100644
--- a/net/rxrpc/input.c
+++ b/net/rxrpc/input.c
@@ -127,7 +127,7 @@ static bool rxrpc_validate_jumbo(struct sk_buff *skb)
 {
struct rxrpc_skb_priv *sp = rxrpc_skb(skb);
unsigned int offset = sp->offset;
-   unsigned int len = skb->data_len;
+   unsigned int len = skb->len;
int nr_jumbo = 1;
u8 flags = sp->hdr.flags;
 
@@ -196,7 +196,7 @@ static void rxrpc_input_data(struct rxrpc_call *call, struct sk_buff *skb,
u8 ack = 0, flags, annotation = 0;
 
_enter("{%u,%u},{%u,%u}",
-  call->rx_hard_ack, call->rx_top, skb->data_len, seq);
+  call->rx_hard_ack, call->rx_top, skb->len, seq);
 
_proto("Rx DATA %%%u { #%u f=%02x }",
   sp->hdr.serial, seq, sp->hdr.flags);
@@ -233,7 +233,7 @@ static void rxrpc_input_data(struct rxrpc_call *call, struct sk_buff *skb,
 next_subpacket:
queued = false;
ix = seq & RXRPC_RXTX_BUFF_MASK;
-   len = skb->data_len;
+   len = skb->len;
if (flags & RXRPC_JUMBO_PACKET)
len = RXRPC_JUMBO_DATALEN;
 
@@ -444,7 +444,7 @@ static void rxrpc_input_ack(struct rxrpc_call *call, struct sk_buff *skb,
}
 
offset = sp->offset + nr_acks + 3;
-   if (skb->data_len >= offset + sizeof(buf.info)) {
+   if (skb->len >= offset + sizeof(buf.info)) {
if (skb_copy_bits(skb, offset, &buf.info, sizeof(buf.info)) < 0)
return rxrpc_proto_abort("XAI", call, 0);
rxrpc_input_ackinfo(call, skb, &buf.info);
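
To illustrate why the choice of field matters, here is a small userspace sketch. It does not use the kernel's struct sk_buff: struct mock_skb and both helper functions are made-up names for illustration. The only property modelled is that len counts every byte in the packet, whereas data_len counts only the bytes held outside the linear area, so it is 0 for a fully linear packet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two sk_buff length fields discussed above. */
struct mock_skb {
	unsigned int len;      /* total bytes: linear plus non-linear */
	unsigned int data_len; /* bytes in the non-linear part only */
};

/* Pre-patch form of the check: compares against the non-linear byte count. */
static bool has_extra_info_buggy(const struct mock_skb *skb,
				 unsigned int offset, size_t info_size)
{
	assert(skb->data_len <= skb->len);
	return skb->data_len >= offset + info_size;
}

/* Post-patch form of the check: compares against the total byte count. */
static bool has_extra_info_fixed(const struct mock_skb *skb,
				 unsigned int offset, size_t info_size)
{
	assert(skb->data_len <= skb->len);
	return skb->len >= offset + info_size;
}
```

With a 100-byte fully linear packet, the first helper reports that no trailing info is present even though it is, which matches case (2) in the commit message.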



[PATCH net-next 05/10] rxrpc: Add missing unlock in rxrpc_call_accept()

2016-09-13 Thread David Howells
Add a missing unlock in rxrpc_call_accept() in the path taken if there's no
call to wake up.

Signed-off-by: David Howells 
---

 net/rxrpc/call_accept.c |8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/net/rxrpc/call_accept.c b/net/rxrpc/call_accept.c
index b8acec0d596e..06e328f6b0f0 100644
--- a/net/rxrpc/call_accept.c
+++ b/net/rxrpc/call_accept.c
@@ -425,9 +425,11 @@ struct rxrpc_call *rxrpc_accept_call(struct rxrpc_sock *rx,
 
write_lock(&rx->call_lock);
 
-   ret = -ENODATA;
-   if (list_empty(&rx->to_be_accepted))
-   goto out;
+   if (list_empty(&rx->to_be_accepted)) {
+   write_unlock(&rx->call_lock);
+   kleave(" = -ENODATA [empty]");
+   return ERR_PTR(-ENODATA);
+   }
 
/* check the user ID isn't already in use */
pp = &rx->calls.rb_node;
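
The shape of the fix can be sketched in userspace as follows. This is only a model under stated assumptions: struct mock_lock, struct mock_sock and mock_accept() are invented for illustration and stand in for the kernel's rwlock and socket structures. The point is that the early-return path drops the lock itself before returning.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical lock that only records whether it is currently held. */
struct mock_lock {
	int held;
};

static void mock_write_lock(struct mock_lock *l)
{
	assert(!l->held);
	l->held = 1;
}

static void mock_write_unlock(struct mock_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/* Hypothetical socket: a lock plus a count of calls awaiting acceptance. */
struct mock_sock {
	struct mock_lock call_lock;
	int nr_to_be_accepted;
};

/* Returns 0 on success or -ENODATA if there is nothing to accept. */
static int mock_accept(struct mock_sock *rx)
{
	mock_write_lock(&rx->call_lock);

	if (rx->nr_to_be_accepted == 0) {
		/* Early exit: release the lock here rather than leaking it. */
		mock_write_unlock(&rx->call_lock);
		return -ENODATA;
	}

	rx->nr_to_be_accepted--;
	mock_write_unlock(&rx->call_lock);
	return 0;
}
```

Calling mock_accept() twice in a row on an empty socket would trip the assertion in mock_write_lock() if the early exit forgot the unlock, which is the failure the patch addresses.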



[PATCH net-next 04/10] rxrpc: Requeue call for recvmsg if more data

2016-09-13 Thread David Howells
rxrpc_recvmsg() needs to make sure that the call it has just been
processing gets requeued for further attention if the buffer has been
filled and there's more data to be consumed.  The softirq producer only
queues the call and wakes the socket if it fills the first slot in the
window, so userspace might end up sleeping forever otherwise, despite there
being data available.

This is not a problem provided the userspace buffer is big enough or it
empties the buffer completely before more data comes in.

Signed-off-by: David Howells 
---

 net/rxrpc/recvmsg.c |4 
 1 file changed, 4 insertions(+)

diff --git a/net/rxrpc/recvmsg.c b/net/rxrpc/recvmsg.c
index 20d0b5c6f81b..16ff56f69256 100644
--- a/net/rxrpc/recvmsg.c
+++ b/net/rxrpc/recvmsg.c
@@ -463,6 +463,10 @@ try_again:
 flags, );
if (ret == -EAGAIN)
ret = 0;
+
+   if (after(call->rx_top, call->rx_hard_ack) &&
+   call->rxtx_buffer[(call->rx_hard_ack + 1) & RXRPC_RXTX_BUFF_MASK])
+   rxrpc_notify_socket(call);
break;
default:
ret = 0;
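
The requeue condition added above can be modelled in userspace as follows. This is a sketch, not the kernel code: struct mock_call, seq_after(), mock_store() and more_data_ready() are invented names, and seq_after() assumes the usual wrapping comparison of 32-bit sequence numbers. The ring size of 64 is taken from RXRPC_RXTX_BUFF_SIZE in the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BUFF_SIZE 64
#define BUFF_MASK (BUFF_SIZE - 1)

/* Wrapping "a is later than b" test on 32-bit sequence numbers (assumed). */
static bool seq_after(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) > 0;
}

/* Hypothetical cut-down call: just the receive ring and its two cursors. */
struct mock_call {
	uint32_t rx_hard_ack;  /* last sequence number consumed */
	uint32_t rx_top;       /* highest sequence number received */
	const void *rxtx_buffer[BUFF_SIZE];
};

/* Place a packet in the ring and advance rx_top if it is the newest. */
static void mock_store(struct mock_call *call, uint32_t seq, const void *pkt)
{
	assert(pkt != NULL);
	call->rxtx_buffer[seq & BUFF_MASK] = pkt;
	if (seq_after(seq, call->rx_top))
		call->rx_top = seq;
}

/* True if the next packet the reader needs is already in the ring. */
static bool more_data_ready(const struct mock_call *call)
{
	return seq_after(call->rx_top, call->rx_hard_ack) &&
	       call->rxtx_buffer[(call->rx_hard_ack + 1) & BUFF_MASK] != NULL;
}
```

Note that having packets beyond rx_hard_ack is not enough on its own: the slot immediately after rx_hard_ack must be filled, otherwise the reader would be woken only to find a gap.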



[PATCH net-next 08/10] rxrpc: Adjust the call ref tracepoint to show kernel API refs

2016-09-13 Thread David Howells
Adjust the call ref tracepoint to show references held on a call by the
kernel API separately as much as possible and add an additional trace at
the allocation point from the preallocation buffer for an incoming call.

Note that this doesn't show the allocation of a client call for the kernel
separately at the moment.

Signed-off-by: David Howells 
---

 net/rxrpc/af_rxrpc.c|2 +-
 net/rxrpc/ar-internal.h |2 ++
 net/rxrpc/call_accept.c |3 ++-
 net/rxrpc/call_object.c |2 ++
 4 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c
index caa226dd436e..25d00ded24bc 100644
--- a/net/rxrpc/af_rxrpc.c
+++ b/net/rxrpc/af_rxrpc.c
@@ -299,7 +299,7 @@ void rxrpc_kernel_end_call(struct socket *sock, struct rxrpc_call *call)
 {
_enter("%d{%d}", call->debug_id, atomic_read(&call->usage));
rxrpc_release_call(rxrpc_sk(sock->sk), call);
-   rxrpc_put_call(call, rxrpc_call_put);
+   rxrpc_put_call(call, rxrpc_call_put_kernel);
 }
 EXPORT_SYMBOL(rxrpc_kernel_end_call);
 
diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index b1cb79ec4e96..47c74a581a0f 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -540,8 +540,10 @@ enum rxrpc_call_trace {
rxrpc_call_seen,
rxrpc_call_got,
rxrpc_call_got_userid,
+   rxrpc_call_got_kernel,
rxrpc_call_put,
rxrpc_call_put_userid,
+   rxrpc_call_put_kernel,
rxrpc_call_put_noqueue,
rxrpc_call__nr_trace
 };
diff --git a/net/rxrpc/call_accept.c b/net/rxrpc/call_accept.c
index 06e328f6b0f0..5fd9d2c89b7f 100644
--- a/net/rxrpc/call_accept.c
+++ b/net/rxrpc/call_accept.c
@@ -121,7 +121,7 @@ static int rxrpc_service_prealloc_one(struct rxrpc_sock *rx,
 
call->user_call_ID = user_call_ID;
call->notify_rx = notify_rx;
-   rxrpc_get_call(call, rxrpc_call_got);
+   rxrpc_get_call(call, rxrpc_call_got_kernel);
user_attach_call(call, user_call_ID);
rxrpc_get_call(call, rxrpc_call_got_userid);
rb_link_node(&call->sock_node, parent, pp);
@@ -300,6 +300,7 @@ static struct rxrpc_call *rxrpc_alloc_incoming_call(struct rxrpc_sock *rx,
smp_store_release(>call_backlog_tail,
  (call_tail + 1) & (RXRPC_BACKLOG_MAX - 1));
 
+   rxrpc_see_call(call);
call->conn = conn;
call->peer = rxrpc_get_peer(conn->params.peer);
return call;
diff --git a/net/rxrpc/call_object.c b/net/rxrpc/call_object.c
index 18ab13f82f6e..3f9476508204 100644
--- a/net/rxrpc/call_object.c
+++ b/net/rxrpc/call_object.c
@@ -56,8 +56,10 @@ const char rxrpc_call_traces[rxrpc_call__nr_trace][4] = {
[rxrpc_call_seen]   = "SEE",
[rxrpc_call_got]= "GOT",
[rxrpc_call_got_userid] = "Gus",
+   [rxrpc_call_got_kernel] = "Gke",
[rxrpc_call_put]= "PUT",
[rxrpc_call_put_userid] = "Pus",
+   [rxrpc_call_put_kernel] = "Pke",
[rxrpc_call_put_noqueue]= "PNQ",
 };
 



[PATCH net-next 02/10] rxrpc: Add missing wakeup on Tx window rotation

2016-09-13 Thread David Howells
We need to wake up the sender when Tx window rotation due to an incoming
ACK makes space in the buffer; otherwise the sender is liable to just hang
endlessly.

This problem isn't noticeable if the Tx phase transfers no more than will
fit in a single window or the Tx window rotates fast enough that it doesn't
get full.

Signed-off-by: David Howells 
---

 net/rxrpc/input.c |2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/rxrpc/input.c b/net/rxrpc/input.c
index afeba98004b1..a707d5952164 100644
--- a/net/rxrpc/input.c
+++ b/net/rxrpc/input.c
@@ -59,6 +59,8 @@ static void rxrpc_rotate_tx_window(struct rxrpc_call *call, rxrpc_seq_t to)
 
spin_unlock(&call->lock);
 
+   wake_up(&call->waitq);
+
while (list) {
skb = list;
list = skb->next;
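
A toy model of the hang and its fix, under stated assumptions: struct mock_tx and the three helpers below are invented for illustration, and the sleeping sender is represented by a flag rather than a real wait queue. Unlike the unconditional wake_up() in the patch, the model only counts a wakeup when the sender was actually asleep.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical transmit state: a window between hard_ack and top. */
struct mock_tx {
	uint32_t hard_ack;      /* highest sequence number ACKed by the peer */
	uint32_t top;           /* highest sequence number queued so far */
	uint32_t winsize;       /* maximum number of unACKed packets */
	bool sender_asleep;     /* sender blocked waiting for window space */
	unsigned int wakeups;   /* number of times the sender was woken */
};

static bool tx_window_full(const struct mock_tx *tx)
{
	return tx->top - tx->hard_ack >= tx->winsize;
}

/* Try to queue one packet; the sender goes to sleep if the window is full. */
static bool mock_queue_packet(struct mock_tx *tx)
{
	if (tx_window_full(tx)) {
		tx->sender_asleep = true;
		return false;
	}
	tx->top++;
	return true;
}

/* An incoming ACK rotates the window up to 'to' and wakes the sender. */
static void mock_rotate_tx_window(struct mock_tx *tx, uint32_t to)
{
	assert(to - tx->hard_ack <= tx->top - tx->hard_ack);
	tx->hard_ack = to;

	/* Without this step the sender would stay asleep despite free space. */
	if (tx->sender_asleep) {
		tx->sender_asleep = false;
		tx->wakeups++;
	}
}
```

As the commit message notes, a transfer that never fills the window never sets sender_asleep, which is why the missing wakeup could go unnoticed.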



[PATCH net-next 00/10] rxrpc: Miscellaneous fixes

2016-09-13 Thread David Howells

Here's a set of miscellaneous fix patches.  There are a couple of points of
note:

 (1) There is one non-fix patch that adjusts the call ref tracking
 tracepoint to make kernel API-held refs on calls more obvious.  This
 is a prerequisite for the patch that fixes prealloc refcounting.

 (2) The final patch alters how jumbo packets that partially exceed the
 receive window are handled.  Previously, space was being left in the
 Rx buffer for them, but this significantly hurts performance as the Rx
 window can't be increased to match the OpenAFS Tx window size.

 Instead, the excess subpackets are discarded and an EXCEEDS_WINDOW ACK
 is generated for the first.  To avoid the problem of someone trying to
 run the kernel out of space by feeding the kernel a series of
 overlapping maximal jumbo packets, we stop allowing jumbo packets on a
 call if we encounter more than three jumbo packets with duplicate or
 excessive subpackets.
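
The policy in point (2) can be sketched roughly as below. This is a userspace model, not the patch itself: struct mock_rx_call, mock_input_jumbo() and the verdict enum are invented names, duplicate detection is left out, and seq_after() assumes a wrapping comparison of 32-bit sequence numbers. The limit of three comes from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_BAD_JUMBOS 3 /* "more than three" bad jumbos disables jumbos */

enum jumbo_verdict {
	JUMBO_ACCEPT,          /* every subpacket fits in the window */
	JUMBO_EXCEEDS_WINDOW,  /* excess subpackets discarded, ACK generated */
	JUMBO_REFUSED          /* call no longer accepts jumbo packets */
};

/* Hypothetical cut-down receive state for one call. */
struct mock_rx_call {
	uint32_t rx_hard_ack;
	uint8_t  rx_winsize;
	uint8_t  nr_jumbo_bad;
};

static bool seq_after(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) > 0;
}

/* Classify a jumbo packet of nr_sub subpackets starting at first_seq. */
static enum jumbo_verdict mock_input_jumbo(struct mock_rx_call *call,
					   uint32_t first_seq,
					   unsigned int nr_sub)
{
	uint32_t window_top = call->rx_hard_ack + call->rx_winsize;
	unsigned int i;

	assert(nr_sub > 0);

	if (call->nr_jumbo_bad > MAX_BAD_JUMBOS)
		return JUMBO_REFUSED;

	for (i = 0; i < nr_sub; i++) {
		if (seq_after(first_seq + i, window_top)) {
			/* Counted once per jumbo, however many subpackets overflow. */
			call->nr_jumbo_bad++;
			return JUMBO_EXCEEDS_WINDOW;
		}
	}
	return JUMBO_ACCEPT;
}
```

Because a jumbo is counted as bad at most once, a peer needs four separate offending jumbo packets before the call starts refusing them, which bounds how much buffer space overlapping maximal jumbos can tie up.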

The patches can be found here also (non-terminally on the branch):


http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-rewrite

Tagged thusly:

git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
rxrpc-rewrite-20160913-1

David
---
David Howells (10):
  rxrpc: Make sure we initialise the peer hash key
  rxrpc: Add missing wakeup on Tx window rotation
  rxrpc: The IDLE ACK packet should use rxrpc_idle_ack_delay
  rxrpc: Requeue call for recvmsg if more data
  rxrpc: Add missing unlock in rxrpc_call_accept()
  rxrpc: Use skb->len not skb->data_len
  rxrpc: Allow tx_winsize to grow in response to an ACK
  rxrpc: Adjust the call ref tracepoint to show kernel API refs
  rxrpc: Fix prealloc refcounting
  rxrpc: Correctly initialise, limit and transmit call->rx_winsize


 net/rxrpc/af_rxrpc.c|2 +-
 net/rxrpc/ar-internal.h |5 -
 net/rxrpc/call_accept.c |   20 +++-
 net/rxrpc/call_event.c  |2 +-
 net/rxrpc/call_object.c |7 +++
 net/rxrpc/input.c   |   41 +++--
 net/rxrpc/misc.c|5 -
 net/rxrpc/output.c  |4 ++--
 net/rxrpc/peer_object.c |2 +-
 net/rxrpc/recvmsg.c |5 +
 net/rxrpc/sysctl.c  |2 +-
 11 files changed, 64 insertions(+), 31 deletions(-)



Re: [PATCH net-next 2/2] net: deprecate eth_change_mtu, remove usage

2016-09-13 Thread kbuild test robot
Hi Jarod,

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Jarod-Wilson/net-centralize-net_device-min-max-MTU-checking/20160913-042130
config: mips-xway_defconfig (attached as .config)
compiler: mips-linux-gnu-gcc (Debian 5.4.0-6) 5.4.0 20160609
reproduce:
wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=mips 

All error/warnings (new ones prefixed by >>):

   drivers/net/ethernet/lantiq_etop.c: In function 'ltq_etop_change_mtu':
>> drivers/net/ethernet/lantiq_etop.c:524:7: error: 'ret' undeclared (first use in this function)
 if (!ret) {
  ^
   drivers/net/ethernet/lantiq_etop.c:524:7: note: each undeclared identifier is reported only once for each function it appears in
>> drivers/net/ethernet/lantiq_etop.c:534:1: warning: control reaches end of non-void function [-Wreturn-type]
}
^

vim +/ret +524 drivers/net/ethernet/lantiq_etop.c

504d4721 drivers/net/lantiq_etop.c          John Crispin 2011-05-06  518
504d4721 drivers/net/lantiq_etop.c          John Crispin 2011-05-06  519  static int
504d4721 drivers/net/lantiq_etop.c          John Crispin 2011-05-06  520  ltq_etop_change_mtu(struct net_device *dev, int new_mtu)
504d4721 drivers/net/lantiq_etop.c          John Crispin 2011-05-06  521  {
707d312a drivers/net/ethernet/lantiq_etop.c Jarod Wilson 2016-09-12  522      dev->mtu = new_mtu;
504d4721 drivers/net/lantiq_etop.c          John Crispin 2011-05-06  523
504d4721 drivers/net/lantiq_etop.c          John Crispin 2011-05-06 @524      if (!ret) {
504d4721 drivers/net/lantiq_etop.c          John Crispin 2011-05-06  525          struct ltq_etop_priv *priv = netdev_priv(dev);
504d4721 drivers/net/lantiq_etop.c          John Crispin 2011-05-06  526          unsigned long flags;
504d4721 drivers/net/lantiq_etop.c          John Crispin 2011-05-06  527
504d4721 drivers/net/lantiq_etop.c          John Crispin 2011-05-06  528          spin_lock_irqsave(&priv->lock, flags);
504d4721 drivers/net/lantiq_etop.c          John Crispin 2011-05-06  529          ltq_etop_w32((ETOP_PLEN_UNDER << 16) | new_mtu,
504d4721 drivers/net/lantiq_etop.c          John Crispin 2011-05-06  530                  LTQ_ETOP_IGPLEN);
504d4721 drivers/net/lantiq_etop.c          John Crispin 2011-05-06  531          spin_unlock_irqrestore(&priv->lock, flags);
504d4721 drivers/net/lantiq_etop.c          John Crispin 2011-05-06  532      }
504d4721 drivers/net/lantiq_etop.c          John Crispin 2011-05-06  533      return ret;
504d4721 drivers/net/lantiq_etop.c          John Crispin 2011-05-06 @534  }
504d4721 drivers/net/lantiq_etop.c          John Crispin 2011-05-06  535
504d4721 drivers/net/lantiq_etop.c          John Crispin 2011-05-06  536  static int
504d4721 drivers/net/lantiq_etop.c          John Crispin 2011-05-06  537  ltq_etop_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)

:: The code at line 524 was first introduced by commit
:: 504d4721ee8e432af4b5f196a08af38bc4dac5fe MIPS: Lantiq: Add ethernet driver

:: TO: John Crispin <blo...@openwrt.org>
:: CC: Ralf Baechle <r...@linux-mips.org>

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: Binary data


Re: [Intel-wired-lan] [net-next PATCH v3 2/3] e1000: add initial XDP support

2016-09-13 Thread Alexei Starovoitov
On Tue, Sep 13, 2016 at 07:14:27PM +, Rustad, Mark D wrote:
> Alexei Starovoitov  wrote:
> 
> >On Tue, Sep 13, 2016 at 06:28:03PM +, Rustad, Mark D wrote:
> >>Alexei Starovoitov  wrote:
> >>
> >>>I've looked through qemu and it appears to only emulate e1k and tg3.
> >>>The latter is still used in the field, so the risk of touching
> >>>it is higher.
> >>
> >>I have no idea what makes you think that e1k is *not* "used in the field".
> >>I grant you it probably is used more virtualized than not these days,
> >>but it
> >>certainly exists and is used. You can still buy them new at Newegg for
> >>goodness sakes!
> >
> >the point is that it's only used virtualized, since PCI (not PCIe) has
> >long been dead.
> 
> My point is precisely the opposite. It is a real device, it exists in real
> systems and it is used in those systems. I worked on embedded systems that
> ran Linux and used e1000 devices. I am sure they are still out there because
> customers are still paying for support of those systems.
> 
> Yes, PCI(-X) is absent from any current hardware and has been for some years
> now, but there is an installed base that continues. What part of that
> installed base updates software? I don't know, but I would not just assume
> that it is 0. I know that I updated the kernel on those embedded systems
> that I worked on when I was supporting them. Never at the bleeding edge, but
> generally hopping from one LTS kernel to another as needed.

I suspect modern Linux won't boot on such old PCI-only systems for other
reasons not related to networking, since no one really cares to test
kernels there.
So I think we mostly agree. There is a chance that this xdp e1k code will
find its way onto such an old system. What are the chances those users will
be using xdp there? I think pretty close to zero.

The PCIe NICs integrated into motherboards that pretend to be tg3
(though they're not built by Broadcom at all) are significantly more common.
That's why I picked e1k instead of tg3.

Also note how this patch is not touching anything in the main e1k path
(because I don't have the hw to test and I suspect Intel's driver team
doesn't have it either) to make sure there is no breakage on those
old systems. I created a separate e1000_xmit_raw_frame() routine
instead of adding flags to e1000_xmit_frame() for the same reason:
to make sure there is no breakage.
Same reasoning for not doing a union of page/skb as Alexander suggested.
I wanted a minimal change to e1k that allows development of xdp programs
in kvm without affecting the e1k main path. If you see an actual bug in
the patch, please point out the line.



[PATCH] MAINTAINERS: Remove myself from PA Semi entries

2016-09-13 Thread Olof Johansson
The platform is old and has very few users, and I lack the bandwidth to
keep up with it these days.

Mark the base platform as well as the drivers as orphans; patches have
been flowing through the fallback maintainers for a while already.

Signed-off-by: Olof Johansson 
---

Jean, Dave,

I was hoping to have Michael merge this since the bulk of the platform is
under him; cc:ing you mostly so you are aware that I am orphaning a driver
in your subsystems.


Thanks,

-Olof

 MAINTAINERS | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index d8e81b1..411f4f7 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7155,9 +7155,8 @@ F:arch/powerpc/platforms/83xx/
 F: arch/powerpc/platforms/85xx/
 
 LINUX FOR POWERPC PA SEMI PWRFICIENT
-M: Olof Johansson 
 L: linuxppc-...@lists.ozlabs.org
-S: Maintained
+S: Orphan
 F: arch/powerpc/platforms/pasemi/
 F: drivers/*/*pasemi*
 F: drivers/*/*/*pasemi*
@@ -8849,15 +8848,13 @@ S:  Maintained
 F: drivers/net/wireless/intersil/p54/
 
 PA SEMI ETHERNET DRIVER
-M: Olof Johansson 
 L: netdev@vger.kernel.org
-S: Maintained
+S: Orphan
 F: drivers/net/ethernet/pasemi/*
 
 PA SEMI SMBUS DRIVER
-M: Olof Johansson 
 L: linux-...@vger.kernel.org
-S: Maintained
+S: Orphan
 F: drivers/i2c/busses/i2c-pasemi.c
 
 PADATA PARALLEL EXECUTION MECHANISM
-- 
2.8.0.rc3.29.gb552ff8



Re: [PATCH v2 6/6] selftests: move watchdog tests from Documentation/watchdog

2016-09-13 Thread Timur Tabi

Shuah Khan wrote:

Is watchdog-simple a test or a sample/example? I thought this
is an example, and planning to move that under examples?

If this is a test, I will re-do the patch to include it.


I guess it's really just a sample.

--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.


Re: [PATCH net-next] openvswitch: avoid deferred execution of recirc actions

2016-09-13 Thread pravin shelar
On Tue, Sep 13, 2016 at 7:08 AM, Lance Richardson  wrote:
> The ovs kernel data path currently defers the execution of all
> recirc actions until stack utilization is at a minimum.
> This is too limiting for some packet forwarding scenarios due to
> the small size of the deferred action FIFO (10 entries). For
> example, broadcast traffic sent out more than 10 ports with
> recirculation results in packet drops when the deferred action
> FIFO becomes full, as reported here:
>
>  http://openvswitch.org/pipermail/dev/2016-March/067672.html
>
> Since the current recursion depth is available (it is already tracked
> by the exec_actions_level pcpu variable), we can use it to determine
> whether to execute recirculation actions immediately (safe when
> recursion depth is low) or defer execution until more stack space is
> available.
>
> With this change, the deferred action fifo size becomes a non-issue
> for currently failing scenarios because it is no longer used when
> there are three or fewer recursions through ovs_execute_actions().
>
> Suggested-by: Pravin Shelar 
> Signed-off-by: Lance Richardson 

Thanks for working on it.

Acked-by: Pravin B Shelar 
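
For readers following the thread, the decision described in the quoted commit message can be modelled roughly as below. This is a userspace sketch under stated assumptions, not the openvswitch code: struct mock_fifo, mock_recirc() and IMMEDIATE_RECIRC_LIMIT are invented names, and the exact comparison used in the real patch may differ. The FIFO size of 10 and the "three or fewer recursions" threshold are taken from the text above.

```c
#include <assert.h>

#define DEFERRED_FIFO_SIZE     10 /* deferred action FIFO entries */
#define IMMEDIATE_RECIRC_LIMIT 3  /* hypothetical name for the depth limit */

enum recirc_result {
	RECIRC_RAN_NOW,   /* executed immediately on the current stack */
	RECIRC_DEFERRED,  /* queued in the deferred action FIFO */
	RECIRC_DROPPED    /* FIFO full: the packet is dropped */
};

/* Hypothetical FIFO that only tracks how many entries are queued. */
struct mock_fifo {
	unsigned int count;
};

/* Decide how to handle one recirc action at recursion depth exec_level. */
static enum recirc_result mock_recirc(struct mock_fifo *fifo,
				      unsigned int exec_level)
{
	assert(exec_level >= 1);

	/* Shallow recursion: enough stack remains to run the action now. */
	if (exec_level <= IMMEDIATE_RECIRC_LIMIT)
		return RECIRC_RAN_NOW;

	/* Deep recursion: fall back to the small deferred FIFO. */
	if (fifo->count >= DEFERRED_FIFO_SIZE)
		return RECIRC_DROPPED;

	fifo->count++;
	return RECIRC_DEFERRED;
}
```

In this model, a broadcast to more than 10 ports at depth 1 never touches the FIFO, so the drop case is reached only when recursion is already deep.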


Re: [PATCH v2 6/6] selftests: move watchdog tests from Documentation/watchdog

2016-09-13 Thread Shuah Khan
On 09/13/2016 02:33 PM, Timur Tabi wrote:
> Shuah Khan wrote:
>> Remove watchdog-test from Makefile to move the test to selftests.
>>
>> Add Makefile and .gitignore to for watchdog-test. watchdog-test will
> 
> to for?

Thanks - will fix that.

> 
>> not be run as part of selftests suite and will not be included in
>> install targets.  It can be built separately for now.
>>
>> Signed-off-by: Shuah Khan 
>> ---
>>   Documentation/watchdog/src/.gitignore|   1 -
>>   Documentation/watchdog/src/Makefile  |   2 +-
>>   Documentation/watchdog/src/watchdog-test.c   | 105 
>> ---
>>   tools/testing/selftests/watchdog/.gitignore  |   1 +
>>   tools/testing/selftests/watchdog/Makefile|   8 ++
>>   tools/testing/selftests/watchdog/watchdog-test.c | 105 
>> +++
> 
> Please use -M when calling git-format-patch

okay.

> 
>>   6 files changed, 115 insertions(+), 107 deletions(-)
>>   delete mode 100644 Documentation/watchdog/src/watchdog-test.c
>>   create mode 100644 tools/testing/selftests/watchdog/.gitignore
>>   create mode 100644 tools/testing/selftests/watchdog/Makefile
>>   create mode 100644 tools/testing/selftests/watchdog/watchdog-test.c
>>
>> diff --git a/Documentation/watchdog/src/.gitignore 
>> b/Documentation/watchdog/src/.gitignore
>> index ac90997..ff0ebb5 100644
>> --- a/Documentation/watchdog/src/.gitignore
>> +++ b/Documentation/watchdog/src/.gitignore
>> @@ -1,2 +1 @@
>>   watchdog-simple
>> -watchdog-test
> 
> Why not also watchdog-simple?
> 

Is watchdog-simple a test or a sample/example? I thought this
is an example, and planning to move that under examples?

If this is a test, I will re-do the patch to include it.

thanks,
-- Shuah



Re: [PATCH v2 6/6] selftests: move watchdog tests from Documentation/watchdog

2016-09-13 Thread Timur Tabi

Shuah Khan wrote:

Remove watchdog-test from Makefile to move the test to selftests.

Add Makefile and .gitignore to for watchdog-test. watchdog-test will


to for?


not be run as part of selftests suite and will not be included in
install targets.  It can be built separately for now.

Signed-off-by: Shuah Khan 
---
  Documentation/watchdog/src/.gitignore|   1 -
  Documentation/watchdog/src/Makefile  |   2 +-
  Documentation/watchdog/src/watchdog-test.c   | 105 ---
  tools/testing/selftests/watchdog/.gitignore  |   1 +
  tools/testing/selftests/watchdog/Makefile|   8 ++
  tools/testing/selftests/watchdog/watchdog-test.c | 105 +++


Please use -M when calling git-format-patch


  6 files changed, 115 insertions(+), 107 deletions(-)
  delete mode 100644 Documentation/watchdog/src/watchdog-test.c
  create mode 100644 tools/testing/selftests/watchdog/.gitignore
  create mode 100644 tools/testing/selftests/watchdog/Makefile
  create mode 100644 tools/testing/selftests/watchdog/watchdog-test.c

diff --git a/Documentation/watchdog/src/.gitignore b/Documentation/watchdog/src/.gitignore
index ac90997..ff0ebb5 100644
--- a/Documentation/watchdog/src/.gitignore
+++ b/Documentation/watchdog/src/.gitignore
@@ -1,2 +1 @@
  watchdog-simple
-watchdog-test


Why not also watchdog-simple?



--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.


Re: [PATCH v3] net: ip, diag -- Add diag interface for raw sockets

2016-09-13 Thread Greg
On Tue, 2016-09-13 at 20:18 +, Rustad, Mark D wrote:
> Greg  wrote:
> 
> > Someday Linux will be a modern OS that just includes IPV6 and forces a
> > config option to NOT have it.
> >
> > That'll be great.  All the IS_ENABLED_(CONFIG_IPV6) scattered everywhere
> > is nuts.
> >
> > 
> 
> Better wait until everyone at least *has* IPv6! I have yet to have IPv6  
> deployed on any of my employer's networks or get IPv6 service from any ISP  
> at my home. When I was at Apple in the 90's I was told that Apple needed  
> IPv6 by next year or "we were dead". Well Apple nearly died, but IPv6 had  
> nothing to do with that! And I still haven't experienced an IPv6  
> deployment! Yeah, I have run it a bit point-to-point to resolve technical  
> issues, but that isn't a "deployment" and not very interesting.
> 
> As much as we would like things to move faster, much of the world just  
> doesn't. Witness the e1000 discussion today for example. Hardware doesn't  
> vanish overnight, and I know that my ISP has a network full of CPE that  
> doesn't do IPv6, so I'm not expecting their status to change any time soon.

Well that's why we can have a configuration to turn it off...

But yeah.  /pipedream

- Greg

> 
> It would be great though.
> 
> 
> --
> Mark Rustad, Networking Division, Intel Corporation




[PATCH v2 4/6] selftests: move vDSO tests from Documentation/vDSO

2016-09-13 Thread Shuah Khan
Remove vDSO from Makefile to move the tests to selftests. Update vDSO Makefile
to work under selftests. vDSO will not be run as part of selftests suite
and will not be included in install targets. They can be built separately
for now.

Signed-off-by: Shuah Khan 
---
 Documentation/Makefile |   2 +-
 Documentation/vDSO/.gitignore  |   2 -
 Documentation/vDSO/Makefile|  17 --
 Documentation/vDSO/parse_vdso.c| 269 -
 Documentation/vDSO/vdso_standalone_test_x86.c  | 128 --
 Documentation/vDSO/vdso_test.c |  52 
 tools/testing/selftests/vDSO/.gitignore|   2 +
 tools/testing/selftests/vDSO/Makefile  |  20 ++
 tools/testing/selftests/vDSO/parse_vdso.c  | 269 +
 .../selftests/vDSO/vdso_standalone_test_x86.c  | 128 ++
 tools/testing/selftests/vDSO/vdso_test.c   |  52 
 11 files changed, 472 insertions(+), 469 deletions(-)
 delete mode 100644 Documentation/vDSO/.gitignore
 delete mode 100644 Documentation/vDSO/Makefile
 delete mode 100644 Documentation/vDSO/parse_vdso.c
 delete mode 100644 Documentation/vDSO/vdso_standalone_test_x86.c
 delete mode 100644 Documentation/vDSO/vdso_test.c
 create mode 100644 tools/testing/selftests/vDSO/.gitignore
 create mode 100644 tools/testing/selftests/vDSO/Makefile
 create mode 100644 tools/testing/selftests/vDSO/parse_vdso.c
 create mode 100644 tools/testing/selftests/vDSO/vdso_standalone_test_x86.c
 create mode 100644 tools/testing/selftests/vDSO/vdso_test.c

diff --git a/Documentation/Makefile b/Documentation/Makefile
index 8cd6d1a..085b917 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -1,3 +1,3 @@
 subdir-y := accounting auxdisplay blackfin \
ia64 laptops mic misc-devices \
-   networking pcmcia timers vDSO watchdog
+   networking pcmcia timers watchdog
diff --git a/Documentation/vDSO/.gitignore b/Documentation/vDSO/.gitignore
deleted file mode 100644
index 133bf9e..000
--- a/Documentation/vDSO/.gitignore
+++ /dev/null
@@ -1,2 +0,0 @@
-vdso_test
-vdso_standalone_test_x86
diff --git a/Documentation/vDSO/Makefile b/Documentation/vDSO/Makefile
deleted file mode 100644
index b12e987..000
--- a/Documentation/vDSO/Makefile
+++ /dev/null
@@ -1,17 +0,0 @@
-ifndef CROSS_COMPILE
-# vdso_test won't build for glibc < 2.16, so disable it
-# hostprogs-y := vdso_test
-hostprogs-$(CONFIG_X86) := vdso_standalone_test_x86
-vdso_standalone_test_x86-objs := vdso_standalone_test_x86.o parse_vdso.o
-vdso_test-objs := parse_vdso.o vdso_test.o
-
-# Tell kbuild to always build the programs
-always := $(hostprogs-y)
-
-HOSTCFLAGS := -I$(objtree)/usr/include -std=gnu99
-HOSTCFLAGS_vdso_standalone_test_x86.o := -fno-asynchronous-unwind-tables -fno-stack-protector
-HOSTLOADLIBES_vdso_standalone_test_x86 := -nostdlib
-ifeq ($(CONFIG_X86_32),y)
-HOSTLOADLIBES_vdso_standalone_test_x86 += -lgcc_s
-endif
-endif
diff --git a/Documentation/vDSO/parse_vdso.c b/Documentation/vDSO/parse_vdso.c
deleted file mode 100644
index 1dbb4b8..000
--- a/Documentation/vDSO/parse_vdso.c
+++ /dev/null
@@ -1,269 +0,0 @@
-/*
- * parse_vdso.c: Linux reference vDSO parser
- * Written by Andrew Lutomirski, 2011-2014.
- *
- * This code is meant to be linked in to various programs that run on Linux.
- * As such, it is available with as few restrictions as possible.  This file
- * is licensed under the Creative Commons Zero License, version 1.0,
- * available at http://creativecommons.org/publicdomain/zero/1.0/legalcode
- *
- * The vDSO is a regular ELF DSO that the kernel maps into user space when
- * it starts a program.  It works equally well in statically and dynamically
- * linked binaries.
- *
- * This code is tested on x86.  In principle it should work on any
- * architecture that has a vDSO.
- */
-
-#include 
-#include 
-#include 
-#include 
-#include 
-
-/*
- * To use this vDSO parser, first call one of the vdso_init_* functions.
- * If you've already parsed auxv, then pass the value of AT_SYSINFO_EHDR
- * to vdso_init_from_sysinfo_ehdr.  Otherwise pass auxv to vdso_init_from_auxv.
- * Then call vdso_sym for each symbol you want.  For example, to look up
- * gettimeofday on x86_64, use:
- *
- *  = vdso_sym("LINUX_2.6", "gettimeofday");
- * or
- *  = vdso_sym("LINUX_2.6", "__vdso_gettimeofday");
- *
- * vdso_sym will return 0 if the symbol doesn't exist or if the init function
- * failed or was not called.  vdso_sym is a little slow, so its return value
- * should be cached.
- *
- * vdso_sym is threadsafe; the init functions are not.
- *
- * These are the prototypes:
- */
-extern void vdso_init_from_auxv(void *auxv);
-extern void vdso_init_from_sysinfo_ehdr(uintptr_t base);
-extern void *vdso_sym(const char *version, const char *name);
-
-
-/* And here's the code. */
-#ifndef ELF_BITS
-# if ULONG_MAX > 0xUL
-#  define 

[PATCH v2 1/6] selftests: move dnotify_test from Documentation/filesystems

2016-09-13 Thread Shuah Khan
Move dnotify_test.c, Makefile, and .gitignore from Documentation/filesystems
to selftests/filesystems.

Remove filesystems build target from Documentation/Makefile and update
selftests/filesystems/Makefile to work under selftests. dnotify_test will
not be run as part of selftests suite and will not be included in install
targets. It can be built separately for now.

Signed-off-by: Shuah Khan 
---
 Documentation/Makefile |  2 +-
 Documentation/filesystems/.gitignore   |  1 -
 Documentation/filesystems/Makefile |  5 
 Documentation/filesystems/dnotify_test.c   | 34 --
 tools/testing/selftests/filesystems/.gitignore |  1 +
 tools/testing/selftests/filesystems/Makefile   |  7 +
 tools/testing/selftests/filesystems/dnotify_test.c | 34 ++
 7 files changed, 43 insertions(+), 41 deletions(-)
 delete mode 100644 Documentation/filesystems/.gitignore
 delete mode 100644 Documentation/filesystems/Makefile
 delete mode 100644 Documentation/filesystems/dnotify_test.c
 create mode 100644 tools/testing/selftests/filesystems/.gitignore
 create mode 100644 tools/testing/selftests/filesystems/Makefile
 create mode 100644 tools/testing/selftests/filesystems/dnotify_test.c

diff --git a/Documentation/Makefile b/Documentation/Makefile
index de955e1..0473710 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -1,3 +1,3 @@
 subdir-y := accounting auxdisplay blackfin \
-   filesystems filesystems ia64 laptops mic misc-devices \
+   ia64 laptops mic misc-devices \
networking pcmcia prctl ptp timers vDSO watchdog
diff --git a/Documentation/filesystems/.gitignore b/Documentation/filesystems/.gitignore
deleted file mode 100644
index 31d6e42..000
--- a/Documentation/filesystems/.gitignore
+++ /dev/null
@@ -1 +0,0 @@
-dnotify_test
diff --git a/Documentation/filesystems/Makefile b/Documentation/filesystems/Makefile
deleted file mode 100644
index 883010c..000
--- a/Documentation/filesystems/Makefile
+++ /dev/null
@@ -1,5 +0,0 @@
-# List of programs to build
-hostprogs-y := dnotify_test
-
-# Tell kbuild to always build the programs
-always := $(hostprogs-y)
diff --git a/Documentation/filesystems/dnotify_test.c b/Documentation/filesystems/dnotify_test.c
deleted file mode 100644
index 8b37b4a..000
--- a/Documentation/filesystems/dnotify_test.c
+++ /dev/null
@@ -1,34 +0,0 @@
-#define _GNU_SOURCE    /* needed to get the defines */
-#include <fcntl.h>     /* in glibc 2.2 this has the needed
-                          values defined */
-#include <signal.h>
-#include <stdio.h>
-#include <unistd.h>
-
-static volatile int event_fd;
-
-static void handler(int sig, siginfo_t *si, void *data)
-{
-   event_fd = si->si_fd;
-}
-
-int main(void)
-{
-   struct sigaction act;
-   int fd;
-
-   act.sa_sigaction = handler;
-   sigemptyset(&act.sa_mask);
-   act.sa_flags = SA_SIGINFO;
-   sigaction(SIGRTMIN + 1, &act, NULL);
-
-   fd = open(".", O_RDONLY);
-   fcntl(fd, F_SETSIG, SIGRTMIN + 1);
-   fcntl(fd, F_NOTIFY, DN_MODIFY|DN_CREATE|DN_MULTISHOT);
-   /* we will now be notified if any of the files
-  in "." is modified or new files are created */
-   while (1) {
-   pause();
-   printf("Got event on fd=%d\n", event_fd);
-   }
-}
diff --git a/tools/testing/selftests/filesystems/.gitignore b/tools/testing/selftests/filesystems/.gitignore
new file mode 100644
index 000..31d6e42
--- /dev/null
+++ b/tools/testing/selftests/filesystems/.gitignore
@@ -0,0 +1 @@
+dnotify_test
diff --git a/tools/testing/selftests/filesystems/Makefile b/tools/testing/selftests/filesystems/Makefile
new file mode 100644
index 000..0ab1130
--- /dev/null
+++ b/tools/testing/selftests/filesystems/Makefile
@@ -0,0 +1,7 @@
+TEST_PROGS := dnotify_test
+all: $(TEST_PROGS)
+
+include ../lib.mk
+
+clean:
+   rm -fr $(TEST_PROGS)
diff --git a/tools/testing/selftests/filesystems/dnotify_test.c b/tools/testing/selftests/filesystems/dnotify_test.c
new file mode 100644
index 000..8b37b4a
--- /dev/null
+++ b/tools/testing/selftests/filesystems/dnotify_test.c
@@ -0,0 +1,34 @@
+#define _GNU_SOURCE    /* needed to get the defines */
+#include <fcntl.h>     /* in glibc 2.2 this has the needed
+                          values defined */
+#include <signal.h>
+#include <stdio.h>
+#include <unistd.h>
+
+static volatile int event_fd;
+
+static void handler(int sig, siginfo_t *si, void *data)
+{
+   event_fd = si->si_fd;
+}
+
+int main(void)
+{
+   struct sigaction act;
+   int fd;
+
+   act.sa_sigaction = handler;
+   sigemptyset(&act.sa_mask);
+   act.sa_flags = SA_SIGINFO;
+   sigaction(SIGRTMIN + 1, &act, NULL);
+
+   fd = open(".", O_RDONLY);
+   fcntl(fd, F_SETSIG, SIGRTMIN + 1);
+   fcntl(fd, F_NOTIFY, DN_MODIFY|DN_CREATE|DN_MULTISHOT);
+   /* we will now be notified if any of the files
+  in "." is modified or new files are created */
+   

[PATCH v2 5/6] selftests: move ia64 tests from Documentation/ia64

2016-09-13 Thread Shuah Khan
Remove ia64 from Makefile to move the test to selftests.

Update ia64 Makefile to work under selftests. ia64 will not be run as part
of selftests suite and will not be included in install targets. They can be
built separately for now.

The original Makefile built this test on all architectures and this update
doesn't change that.

Signed-off-by: Shuah Khan 
---
 Documentation/Makefile   |   2 +-
 Documentation/ia64/.gitignore|   1 -
 Documentation/ia64/Makefile  |   5 -
 Documentation/ia64/aliasing-test.c   | 263 ---
 tools/testing/selftests/ia64/.gitignore  |   1 +
 tools/testing/selftests/ia64/Makefile|   8 +
 tools/testing/selftests/ia64/aliasing-test.c | 263 +++
 7 files changed, 273 insertions(+), 270 deletions(-)
 delete mode 100644 Documentation/ia64/.gitignore
 delete mode 100644 Documentation/ia64/Makefile
 delete mode 100644 Documentation/ia64/aliasing-test.c
 create mode 100644 tools/testing/selftests/ia64/.gitignore
 create mode 100644 tools/testing/selftests/ia64/Makefile
 create mode 100644 tools/testing/selftests/ia64/aliasing-test.c

diff --git a/Documentation/Makefile b/Documentation/Makefile
index 085b917..572e9b7 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -1,3 +1,3 @@
 subdir-y := accounting auxdisplay blackfin \
-   ia64 laptops mic misc-devices \
+   laptops mic misc-devices \
networking pcmcia timers watchdog
diff --git a/Documentation/ia64/.gitignore b/Documentation/ia64/.gitignore
deleted file mode 100644
index ab806ed..000
--- a/Documentation/ia64/.gitignore
+++ /dev/null
@@ -1 +0,0 @@
-aliasing-test
diff --git a/Documentation/ia64/Makefile b/Documentation/ia64/Makefile
deleted file mode 100644
index d493163..000
--- a/Documentation/ia64/Makefile
+++ /dev/null
@@ -1,5 +0,0 @@
-# List of programs to build
-hostprogs-y := aliasing-test
-
-# Tell kbuild to always build the programs
-always := $(hostprogs-y)
diff --git a/Documentation/ia64/aliasing-test.c b/Documentation/ia64/aliasing-test.c
deleted file mode 100644
index 62a190d..000
--- a/Documentation/ia64/aliasing-test.c
+++ /dev/null
@@ -1,263 +0,0 @@
-/*
- * Exercise /dev/mem mmap cases that have been troublesome in the past
- *
- * (c) Copyright 2007 Hewlett-Packard Development Company, L.P.
- * Bjorn Helgaas 
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
- */
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-int sum;
-
-static int map_mem(char *path, off_t offset, size_t length, int touch)
-{
-   int fd, rc;
-   void *addr;
-   int *c;
-
-   fd = open(path, O_RDWR);
-   if (fd == -1) {
-   perror(path);
-   return -1;
-   }
-
-   if (fnmatch("/proc/bus/pci/*", path, 0) == 0) {
-   rc = ioctl(fd, PCIIOC_MMAP_IS_MEM);
-   if (rc == -1)
-   perror("PCIIOC_MMAP_IS_MEM ioctl");
-   }
-
-   addr = mmap(NULL, length, PROT_READ|PROT_WRITE, MAP_SHARED, fd, offset);
-   if (addr == MAP_FAILED)
-   return 1;
-
-   if (touch) {
-   c = (int *) addr;
-   while (c < (int *) (addr + length))
-   sum += *c++;
-   }
-
-   rc = munmap(addr, length);
-   if (rc == -1) {
-   perror("munmap");
-   return -1;
-   }
-
-   close(fd);
-   return 0;
-}
-
-static int scan_tree(char *path, char *file, off_t offset, size_t length, int touch)
-{
-   struct dirent **namelist;
-   char *name, *path2;
-   int i, n, r, rc = 0, result = 0;
-   struct stat buf;
-
-   n = scandir(path, &namelist, 0, alphasort);
-   if (n < 0) {
-   perror("scandir");
-   return -1;
-   }
-
-   for (i = 0; i < n; i++) {
-   name = namelist[i]->d_name;
-
-   if (fnmatch(".", name, 0) == 0)
-   goto skip;
-   if (fnmatch("..", name, 0) == 0)
-   goto skip;
-
-   path2 = malloc(strlen(path) + strlen(name) + 3);
-   strcpy(path2, path);
-   strcat(path2, "/");
-   strcat(path2, name);
-
-   if (fnmatch(file, name, 0) == 0) {
-   rc = map_mem(path2, offset, length, touch);
-   if (rc == 0)
-   fprintf(stderr, "PASS: %s 0x%lx-0x%lx is %s\n", path2, offset, offset + length, touch ? "readable" : "mappable");
-   else if (rc > 0)
-   fprintf(stderr, "PASS: %s 0x%lx-0x%lx not mappable\n", path2, offset, offset + length);
-  

[PATCH v2 6/6] selftests: move watchdog tests from Documentation/watchdog

2016-09-13 Thread Shuah Khan
Remove watchdog-test from Makefile to move the test to selftests.

Add Makefile and .gitignore for watchdog-test. watchdog-test will
not be run as part of selftests suite and will not be included in
install targets.  It can be built separately for now.

Signed-off-by: Shuah Khan 
---
 Documentation/watchdog/src/.gitignore|   1 -
 Documentation/watchdog/src/Makefile  |   2 +-
 Documentation/watchdog/src/watchdog-test.c   | 105 ---
 tools/testing/selftests/watchdog/.gitignore  |   1 +
 tools/testing/selftests/watchdog/Makefile|   8 ++
 tools/testing/selftests/watchdog/watchdog-test.c | 105 +++
 6 files changed, 115 insertions(+), 107 deletions(-)
 delete mode 100644 Documentation/watchdog/src/watchdog-test.c
 create mode 100644 tools/testing/selftests/watchdog/.gitignore
 create mode 100644 tools/testing/selftests/watchdog/Makefile
 create mode 100644 tools/testing/selftests/watchdog/watchdog-test.c

diff --git a/Documentation/watchdog/src/.gitignore b/Documentation/watchdog/src/.gitignore
index ac90997..ff0ebb5 100644
--- a/Documentation/watchdog/src/.gitignore
+++ b/Documentation/watchdog/src/.gitignore
@@ -1,2 +1 @@
 watchdog-simple
-watchdog-test
diff --git a/Documentation/watchdog/src/Makefile b/Documentation/watchdog/src/Makefile
index 4a892c3..47be791 100644
--- a/Documentation/watchdog/src/Makefile
+++ b/Documentation/watchdog/src/Makefile
@@ -1,5 +1,5 @@
 # List of programs to build
-hostprogs-y := watchdog-simple watchdog-test
+hostprogs-y := watchdog-simple
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
diff --git a/Documentation/watchdog/src/watchdog-test.c b/Documentation/watchdog/src/watchdog-test.c
deleted file mode 100644
index 6983d05..000
--- a/Documentation/watchdog/src/watchdog-test.c
+++ /dev/null
@@ -1,105 +0,0 @@
-/*
- * Watchdog Driver Test Program
- */
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-int fd;
-const char v = 'V';
-
-/*
- * This function simply sends an IOCTL to the driver, which in turn ticks
- * the PC Watchdog card to reset its internal timer so it doesn't trigger
- * a computer reset.
- */
-static void keep_alive(void)
-{
-int dummy;
-
-printf(".");
-ioctl(fd, WDIOC_KEEPALIVE, &dummy);
-}
-
-/*
- * The main program.  Run the program with "-d" to disable the card,
- * or "-e" to enable the card.
- */
-
-static void term(int sig)
-{
-int ret = write(fd, &v, 1);
-
-close(fd);
-if (ret < 0)
-   printf("\nStopping watchdog ticks failed (%d)...\n", errno);
-else
-   printf("\nStopping watchdog ticks...\n");
-exit(0);
-}
-
-int main(int argc, char *argv[])
-{
-int flags;
-unsigned int ping_rate = 1;
-int ret;
-
-setbuf(stdout, NULL);
-
-fd = open("/dev/watchdog", O_WRONLY);
-
-if (fd == -1) {
-   printf("Watchdog device not enabled.\n");
-   exit(-1);
-}
-
-if (argc > 1) {
-   if (!strncasecmp(argv[1], "-d", 2)) {
-   flags = WDIOS_DISABLECARD;
-   ioctl(fd, WDIOC_SETOPTIONS, &flags);
-   printf("Watchdog card disabled.\n");
-   goto end;
-   } else if (!strncasecmp(argv[1], "-e", 2)) {
-   flags = WDIOS_ENABLECARD;
-   ioctl(fd, WDIOC_SETOPTIONS, &flags);
-   printf("Watchdog card enabled.\n");
-   goto end;
-   } else if (!strncasecmp(argv[1], "-t", 2) && argv[2]) {
-   flags = atoi(argv[2]);
-   ioctl(fd, WDIOC_SETTIMEOUT, &flags);
-   printf("Watchdog timeout set to %u seconds.\n", flags);
-   goto end;
-   } else if (!strncasecmp(argv[1], "-p", 2) && argv[2]) {
-   ping_rate = strtoul(argv[2], NULL, 0);
-   printf("Watchdog ping rate set to %u seconds.\n", ping_rate);
-   } else {
-   printf("-d to disable, -e to enable, -t  to set " \
-   "the timeout,\n-p  to set the ping rate, and \n");
-   printf("run by itself to tick the card.\n");
-   goto end;
-   }
-}
-
-printf("Watchdog Ticking Away!\n");
-
-signal(SIGINT, term);
-
-while(1) {
-   keep_alive();
-   sleep(ping_rate);
-}
-end:
-ret = write(fd, &v, 1);
-if (ret < 0)
-   printf("Stopping watchdog ticks failed (%d)...\n", errno);
-close(fd);
-return 0;
-}
diff --git a/tools/testing/selftests/watchdog/.gitignore b/tools/testing/selftests/watchdog/.gitignore
new file mode 100644
index 000..5aac515
--- /dev/null
+++ b/tools/testing/selftests/watchdog/.gitignore
@@ -0,0 +1 @@
+watchdog-test
diff --git a/tools/testing/selftests/watchdog/Makefile b/tools/testing/selftests/watchdog/Makefile
new file mode 100644
index 000..f863c66
--- /dev/null
+++ b/tools/testing/selftests/watchdog/Makefile
@@ -0,0 +1,8 @@
+TEST_PROGS := watchdog-test
+
+all: $(TEST_PROGS)
+
+include ../lib.mk
+
+clean:
+   rm -fr $(TEST_PROGS)
diff --git 

Re: [PATCH v3] net: ip, diag -- Add diag interface for raw sockets

2016-09-13 Thread Rustad, Mark D

Greg  wrote:


Someday Linux will be a modern OS that just includes IPV6 and forces a
config option to NOT have it.

That'll be great.  All the IS_ENABLED(CONFIG_IPV6) scattered everywhere
is nuts.




Better wait until everyone at least *has* IPv6! I have yet to have IPv6  
deployed on any of my employer's networks or get IPv6 service from any ISP  
at my home. When I was at Apple in the 90's I was told that Apple needed  
IPv6 by next year or "we were dead". Well Apple nearly died, but IPv6 had  
nothing to do with that! And I still haven't experienced an IPv6  
deployment! Yeah, I have run it a bit point-to-point to resolve technical  
issues, but that isn't a "deployment" and not very interesting.


As much as we would like things to move faster, much of the world just  
doesn't. Witness the e1000 discussion today for example. Hardware doesn't  
vanish overnight, and I know that my ISP has a network full of CPE that  
doesn't do IPv6, so I'm not expecting their status to change any time soon.


It would be great though.


--
Mark Rustad, Networking Division, Intel Corporation




[PATCH v2 0/6] Move runnable code (tests) from Documentation to selftests

2016-09-13 Thread Shuah Khan
Move runnable code (tests) from Documentation to selftests and update
Makefiles to work under selftests.

Jon Corbet and I discussed this in an email thread and as per that
discussion, this patch series moves all the tests that are under the
Documentation directory to selftests. There is more runnable code in
the form of examples and utils and that is going to be another patch
series. I moved just the tests and left the documentation files as is.

Checkpatch isn't happy with a few of the patches as some of the
renamed files have existing checkpatch errors and warnings. I am
working on another patch series that will address those.

Changes since v1:
- Changes to Documentation/Makefile to remove test targets as the tests
  get moved.
- Combined patches based on Michael Ellerman's comments.
- Fixed change log errors based on Sergei Shtylyov's comments.
- Expanded the To: list to a wider audience, including people who responded
  with comments and ideas.
- Included ia64 and watchdog which I missed in the v1 series.

Shuah Khan (6):
  selftests: move dnotify_test from Documentation/filesystems
  selftests: move prctl tests from Documentation/prctl
  selftests: move ptp tests from Documentation/ptp
  selftests: move vDSO tests from Documentation/vDSO
  selftests: move ia64 tests from Documentation/ia64
  selftests: move watchdog tests from Documentation/watchdog

 Documentation/Makefile |   4 +-
 Documentation/filesystems/.gitignore   |   1 -
 Documentation/filesystems/Makefile |   5 -
 Documentation/filesystems/dnotify_test.c   |  34 --
 Documentation/ia64/.gitignore  |   1 -
 Documentation/ia64/Makefile|   5 -
 Documentation/ia64/aliasing-test.c | 263 ---
 Documentation/prctl/.gitignore |   3 -
 Documentation/prctl/Makefile   |  10 -
 .../prctl/disable-tsc-ctxt-sw-stress-test.c|  97 
 .../prctl/disable-tsc-on-off-stress-test.c |  96 
 Documentation/prctl/disable-tsc-test.c |  95 
 Documentation/ptp/.gitignore   |   1 -
 Documentation/ptp/Makefile |   8 -
 Documentation/ptp/testptp.c| 523 -
 Documentation/ptp/testptp.mk   |  33 --
 Documentation/vDSO/.gitignore  |   2 -
 Documentation/vDSO/Makefile|  17 -
 Documentation/vDSO/parse_vdso.c| 269 ---
 Documentation/vDSO/vdso_standalone_test_x86.c  | 128 -
 Documentation/vDSO/vdso_test.c |  52 --
 Documentation/watchdog/src/.gitignore  |   1 -
 Documentation/watchdog/src/Makefile|   2 +-
 Documentation/watchdog/src/watchdog-test.c | 105 -
 tools/testing/selftests/filesystems/.gitignore |   1 +
 tools/testing/selftests/filesystems/Makefile   |   7 +
 tools/testing/selftests/filesystems/dnotify_test.c |  34 ++
 tools/testing/selftests/ia64/.gitignore|   1 +
 tools/testing/selftests/ia64/Makefile  |   8 +
 tools/testing/selftests/ia64/aliasing-test.c   | 263 +++
 tools/testing/selftests/prctl/.gitignore   |   3 +
 tools/testing/selftests/prctl/Makefile |  15 +
 .../prctl/disable-tsc-ctxt-sw-stress-test.c|  97 
 .../prctl/disable-tsc-on-off-stress-test.c |  96 
 tools/testing/selftests/prctl/disable-tsc-test.c   |  95 
 tools/testing/selftests/ptp/.gitignore |   1 +
 tools/testing/selftests/ptp/Makefile   |   8 +
 tools/testing/selftests/ptp/testptp.c  | 523 +
 tools/testing/selftests/ptp/testptp.mk |  33 ++
 tools/testing/selftests/vDSO/.gitignore|   2 +
 tools/testing/selftests/vDSO/Makefile  |  20 +
 tools/testing/selftests/vDSO/parse_vdso.c  | 269 +++
 .../selftests/vDSO/vdso_standalone_test_x86.c  | 128 +
 tools/testing/selftests/vDSO/vdso_test.c   |  52 ++
 tools/testing/selftests/watchdog/.gitignore|   1 +
 tools/testing/selftests/watchdog/Makefile  |   8 +
 tools/testing/selftests/watchdog/watchdog-test.c   | 105 +
 47 files changed, 1773 insertions(+), 1752 deletions(-)
 delete mode 100644 Documentation/filesystems/.gitignore
 delete mode 100644 Documentation/filesystems/Makefile
 delete mode 100644 Documentation/filesystems/dnotify_test.c
 delete mode 100644 Documentation/ia64/.gitignore
 delete mode 100644 Documentation/ia64/Makefile
 delete mode 100644 Documentation/ia64/aliasing-test.c
 delete mode 100644 Documentation/prctl/.gitignore
 delete mode 100644 Documentation/prctl/Makefile
 delete mode 100644 Documentation/prctl/disable-tsc-ctxt-sw-stress-test.c
 delete mode 100644 Documentation/prctl/disable-tsc-on-off-stress-test.c
 delete mode 100644 Documentation/prctl/disable-tsc-test.c
 

[PATCH v2 2/6] selftests: move prctl tests from Documentation/prctl

2016-09-13 Thread Shuah Khan
Move prctl tests from Documentation/prctl to selftests/prctl.

Remove prctl from Makefile to move the test. Update prctl Makefile to work
under selftests. prctl will not be run as part of selftests suite and will
not be included in install targets. They can be built separately for now.

Signed-off-by: Shuah Khan 
---
 Documentation/Makefile |  2 +-
 Documentation/prctl/.gitignore |  3 -
 Documentation/prctl/Makefile   | 10 ---
 .../prctl/disable-tsc-ctxt-sw-stress-test.c| 97 --
 .../prctl/disable-tsc-on-off-stress-test.c | 96 -
 Documentation/prctl/disable-tsc-test.c | 95 -
 tools/testing/selftests/prctl/.gitignore   |  3 +
 tools/testing/selftests/prctl/Makefile | 15 
 .../prctl/disable-tsc-ctxt-sw-stress-test.c| 97 ++
 .../prctl/disable-tsc-on-off-stress-test.c | 96 +
 tools/testing/selftests/prctl/disable-tsc-test.c   | 95 +
 11 files changed, 307 insertions(+), 302 deletions(-)
 delete mode 100644 Documentation/prctl/.gitignore
 delete mode 100644 Documentation/prctl/Makefile
 delete mode 100644 Documentation/prctl/disable-tsc-ctxt-sw-stress-test.c
 delete mode 100644 Documentation/prctl/disable-tsc-on-off-stress-test.c
 delete mode 100644 Documentation/prctl/disable-tsc-test.c
 create mode 100644 tools/testing/selftests/prctl/.gitignore
 create mode 100644 tools/testing/selftests/prctl/Makefile
 create mode 100644 tools/testing/selftests/prctl/disable-tsc-ctxt-sw-stress-test.c
 create mode 100644 tools/testing/selftests/prctl/disable-tsc-on-off-stress-test.c
 create mode 100644 tools/testing/selftests/prctl/disable-tsc-test.c

diff --git a/Documentation/Makefile b/Documentation/Makefile
index 0473710..7a28f6c 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -1,3 +1,3 @@
 subdir-y := accounting auxdisplay blackfin \
ia64 laptops mic misc-devices \
-   networking pcmcia prctl ptp timers vDSO watchdog
+   networking pcmcia ptp timers vDSO watchdog
diff --git a/Documentation/prctl/.gitignore b/Documentation/prctl/.gitignore
deleted file mode 100644
index 0b5c274..000
--- a/Documentation/prctl/.gitignore
+++ /dev/null
@@ -1,3 +0,0 @@
-disable-tsc-ctxt-sw-stress-test
-disable-tsc-on-off-stress-test
-disable-tsc-test
diff --git a/Documentation/prctl/Makefile b/Documentation/prctl/Makefile
deleted file mode 100644
index 44de308..000
--- a/Documentation/prctl/Makefile
+++ /dev/null
@@ -1,10 +0,0 @@
-ifndef CROSS_COMPILE
-# List of programs to build
-hostprogs-$(CONFIG_X86) := disable-tsc-ctxt-sw-stress-test disable-tsc-on-off-stress-test disable-tsc-test
-# Tell kbuild to always build the programs
-always := $(hostprogs-y)
-
-HOSTCFLAGS_disable-tsc-ctxt-sw-stress-test.o += -I$(objtree)/usr/include
-HOSTCFLAGS_disable-tsc-on-off-stress-test.o += -I$(objtree)/usr/include
-HOSTCFLAGS_disable-tsc-test.o += -I$(objtree)/usr/include
-endif
diff --git a/Documentation/prctl/disable-tsc-ctxt-sw-stress-test.c b/Documentation/prctl/disable-tsc-ctxt-sw-stress-test.c
deleted file mode 100644
index f7499d1..000
--- a/Documentation/prctl/disable-tsc-ctxt-sw-stress-test.c
+++ /dev/null
@@ -1,97 +0,0 @@
-/*
- * Tests for prctl(PR_GET_TSC, ...) / prctl(PR_SET_TSC, ...)
- *
- * Tests if the control register is updated correctly
- * at context switches
- *
- * Warning: this test will cause a very high load for a few seconds
- *
- */
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-
-#include 
-#include 
-
-/* Get/set the process' ability to use the timestamp counter instruction */
-#ifndef PR_GET_TSC
-#define PR_GET_TSC 25
-#define PR_SET_TSC 26
-# define PR_TSC_ENABLE 1   /* allow the use of the timestamp counter */
-# define PR_TSC_SIGSEGV 2   /* throw a SIGSEGV instead of reading the TSC */
-#endif
-
-static uint64_t rdtsc(void)
-{
-uint32_t lo, hi;
-/* We cannot use "=A", since this would use %rax on x86_64 */
-__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
-return (uint64_t)hi << 32 | lo;
-}
-
-static void sigsegv_expect(int sig)
-{
-   /* */
-}
-
-static void segvtask(void)
-{
-   if (prctl(PR_SET_TSC, PR_TSC_SIGSEGV) < 0)
-   {
-   perror("prctl");
-   exit(0);
-   }
-   signal(SIGSEGV, sigsegv_expect);
-   alarm(10);
-   rdtsc();
-   fprintf(stderr, "FATAL ERROR, rdtsc() succeeded while disabled\n");
-   exit(0);
-}
-
-
-static void sigsegv_fail(int sig)
-{
-   fprintf(stderr, "FATAL ERROR, rdtsc() failed while enabled\n");
-   exit(0);
-}
-
-static void rdtsctask(void)
-{
-   if (prctl(PR_SET_TSC, PR_TSC_ENABLE) < 0)
-   {
-   perror("prctl");
-   exit(0);
-   }
-   signal(SIGSEGV, sigsegv_fail);
-   alarm(10);
-   for(;;) 
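
[Editor's note, not part of the patch] The rdtsc() helper in the removed test
joins the two 32-bit halves that the instruction returns in %edx and %eax into
one 64-bit value. A minimal portable sketch of just that combining step
follows. The names combine_tsc_halves and combine_tsc_halves_selfcheck are
hypothetical and do not appear in the patch; the sketch does not execute rdtsc
or call prctl().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: join the high and low 32-bit halves of a counter,
 * as rdtsc() in the test does with the values read from %edx and %eax.
 * hi is widened to 64 bits before the shift; shifting a 32-bit value by
 * 32 positions would be undefined behaviour in C.
 */
uint64_t combine_tsc_halves(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Small self-check of the expression on known values. */
void combine_tsc_halves_selfcheck(void)
{
    assert(combine_tsc_halves(0, 0) == 0);
    assert(combine_tsc_halves(1, 2) == UINT64_C(0x100000002));
    assert(combine_tsc_halves(UINT32_MAX, UINT32_MAX) == UINT64_MAX);
}
```

The extra parentheses are only for readability; the cast and the shift already
bind tighter than the bitwise OR in the original expression.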

[PATCH v2 3/6] selftests: move ptp tests from Documentation/ptp

2016-09-13 Thread Shuah Khan
Remove ptp from Makefile to move the test to selftests. Update ptp Makefile
to work under selftests. ptp will not be run as part of selftests suite and
will not be included in install targets. They can be built separately for
now.

Signed-off-by: Shuah Khan 
---
 Documentation/Makefile |   2 +-
 Documentation/ptp/.gitignore   |   1 -
 Documentation/ptp/Makefile |   8 -
 Documentation/ptp/testptp.c| 523 -
 Documentation/ptp/testptp.mk   |  33 ---
 tools/testing/selftests/ptp/.gitignore |   1 +
 tools/testing/selftests/ptp/Makefile   |   8 +
 tools/testing/selftests/ptp/testptp.c  | 523 +
 tools/testing/selftests/ptp/testptp.mk |  33 +++
 9 files changed, 566 insertions(+), 566 deletions(-)
 delete mode 100644 Documentation/ptp/.gitignore
 delete mode 100644 Documentation/ptp/Makefile
 delete mode 100644 Documentation/ptp/testptp.c
 delete mode 100644 Documentation/ptp/testptp.mk
 create mode 100644 tools/testing/selftests/ptp/.gitignore
 create mode 100644 tools/testing/selftests/ptp/Makefile
 create mode 100644 tools/testing/selftests/ptp/testptp.c
 create mode 100644 tools/testing/selftests/ptp/testptp.mk

diff --git a/Documentation/Makefile b/Documentation/Makefile
index 7a28f6c..8cd6d1a 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -1,3 +1,3 @@
 subdir-y := accounting auxdisplay blackfin \
ia64 laptops mic misc-devices \
-   networking pcmcia ptp timers vDSO watchdog
+   networking pcmcia timers vDSO watchdog
diff --git a/Documentation/ptp/.gitignore b/Documentation/ptp/.gitignore
deleted file mode 100644
index f562e49..000
--- a/Documentation/ptp/.gitignore
+++ /dev/null
@@ -1 +0,0 @@
-testptp
diff --git a/Documentation/ptp/Makefile b/Documentation/ptp/Makefile
deleted file mode 100644
index 293d6c0..000
--- a/Documentation/ptp/Makefile
+++ /dev/null
@@ -1,8 +0,0 @@
-# List of programs to build
-hostprogs-y := testptp
-
-# Tell kbuild to always build the programs
-always := $(hostprogs-y)
-
-HOSTCFLAGS_testptp.o += -I$(objtree)/usr/include
-HOSTLOADLIBES_testptp := -lrt
diff --git a/Documentation/ptp/testptp.c b/Documentation/ptp/testptp.c
deleted file mode 100644
index 5d2eae1..000
--- a/Documentation/ptp/testptp.c
+++ /dev/null
@@ -1,523 +0,0 @@
-/*
- * PTP 1588 clock support - User space test program
- *
- * Copyright (C) 2010 OMICRON electronics GmbH
- *
- *  This program is free software; you can redistribute it and/or modify
- *  it under the terms of the GNU General Public License as published by
- *  the Free Software Foundation; either version 2 of the License, or
- *  (at your option) any later version.
- *
- *  This program is distributed in the hope that it will be useful,
- *  but WITHOUT ANY WARRANTY; without even the implied warranty of
- *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- *  GNU General Public License for more details.
- *
- *  You should have received a copy of the GNU General Public License
- *  along with this program; if not, write to the Free Software
- *  Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
- */
-#define _GNU_SOURCE
-#define __SANE_USERSPACE_TYPES__/* For PPC64, to get LL64 types */
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-#include 
-
-#define DEVICE "/dev/ptp0"
-
-#ifndef ADJ_SETOFFSET
-#define ADJ_SETOFFSET 0x0100
-#endif
-
-#ifndef CLOCK_INVALID
-#define CLOCK_INVALID -1
-#endif
-
-/* clock_adjtime is not available in GLIBC < 2.14 */
-#if !__GLIBC_PREREQ(2, 14)
-#include 
-static int clock_adjtime(clockid_t id, struct timex *tx)
-{
-   return syscall(__NR_clock_adjtime, id, tx);
-}
-#endif
-
-static clockid_t get_clockid(int fd)
-{
-#define CLOCKFD 3
-#define FD_TO_CLOCKID(fd)  ((~(clockid_t) (fd) << 3) | CLOCKFD)
-
-   return FD_TO_CLOCKID(fd);
-}
-
-static void handle_alarm(int s)
-{
-   printf("received signal %d\n", s);
-}
-
-static int install_handler(int signum, void (*handler)(int))
-{
-   struct sigaction action;
-   sigset_t mask;
-
-   /* Unblock the signal. */
-   sigemptyset(&mask);
-   sigaddset(&mask, signum);
-   sigprocmask(SIG_UNBLOCK, &mask, NULL);
-
-   /* Install the signal handler. */
-   action.sa_handler = handler;
-   action.sa_flags = 0;
-   sigemptyset(&action.sa_mask);
-   sigaction(signum, &action, NULL);
-
-   return 0;
-}
-
-static long ppb_to_scaled_ppm(int ppb)
-{
-   /*
-* The 'freq' field in the 'struct timex' is in parts per
-* million, but with a 16 bit binary fractional field.
-* Instead of calculating either one of
-*
-*scaled_ppm = (ppb / 1000) << 16  [1]
-*scaled_ppm = (ppb << 16) / 1000  [2]
-*
-* we simply use double precision math, in order 
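
[Editor's note, not part of the patch] The comment above weighs two integer
formulas against double-precision math for converting parts per billion into
the 16-bit-fraction ppm used by the 'freq' field. The function body is cut off
in this archive, so the sketch below is an assumption built from the comment
alone: all three function names are hypothetical, and the factor 65.536 is
simply 65536 / 1000.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Formula [1]: divide first, then scale by 2^16.
 * The integer division throws away the sub-ppm part of ppb.
 * Multiplying by 65536 equals "<< 16" for non-negative values.
 */
long scaled_ppm_divide_first(int ppb)
{
    return ((long)ppb / 1000) * 65536;
}

/*
 * Formula [2]: scale by 2^16 first, then divide.
 * Done in 64 bits here; with a 32-bit int the shift would overflow
 * once ppb exceeds 32767.
 */
long scaled_ppm_shift_first(int ppb)
{
    return (long)(((long long)ppb * 65536) / 1000);
}

/* Double-precision variant: one ppb is 65.536 units of scaled ppm. */
long scaled_ppm_double(int ppb)
{
    return (long)(ppb * 65.536);
}

/* Compare the three variants for 1500 ppb (1.5 ppm). */
void scaled_ppm_selfcheck(void)
{
    assert(scaled_ppm_divide_first(1500) == 65536);  /* 0.5 ppm lost */
    assert(scaled_ppm_shift_first(1500) == 98304);
    /* Allow one unit of slack for floating-point truncation. */
    assert(labs(scaled_ppm_double(1500) - 98304) <= 1);
}
```

For 1500 ppb the divide-first form reports exactly 1 ppm, while the other two
keep the fractional half ppm.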

RE: [net-next PATCH 00/11] iw_cxgb4,cxgbit: remove duplicate code

2016-09-13 Thread Steve Wise
> This patch series removes duplicate code from
> iw_cxgb4 and cxgbit by adding common function
> definitions in libcxgb.
> 
> Please review.
> 
> Thanks
> Varun
> 
> Varun Prakash (11):
>   libcxgb,iw_cxgb4,cxgbit: add cxgb_get_4tuple()
>   libcxgb,iw_cxgb4,cxgbit: add cxgb_find_route()
>   libcxgb,iw_cxgb4,cxgbit: add cxgb_find_route6()
>   libcxgb,iw_cxgb4,cxgbit: add cxgb_is_neg_adv()
>   libcxgb,iw_cxgb4,cxgbit: add cxgb_best_mtu()
>   libcxgb,iw_cxgb4,cxgbit: add cxgb_compute_wscale()
>   libcxgb,iw_cxgb4,cxgbit: add cxgb_mk_tid_release()
>   libcxgb,iw_cxgb4,cxgbit: add cxgb_mk_close_con_req()
>   libcxgb,iw_cxgb4,cxgbit: add cxgb_mk_abort_req()
>   libcxgb,iw_cxgb4,cxgbit: add cxgb_mk_abort_rpl()
>   libcxgb,iw_cxgb4,cxgbit: add cxgb_mk_rx_data_ack()
> 
>  drivers/infiniband/hw/cxgb4/Kconfig   |   1 +
>  drivers/infiniband/hw/cxgb4/Makefile  |   1 +
>  drivers/infiniband/hw/cxgb4/cm.c  | 288 ++
>  drivers/infiniband/hw/cxgb4/iw_cxgb4.h|   9 -
>  drivers/net/ethernet/chelsio/libcxgb/Makefile |   4 +-
>  drivers/net/ethernet/chelsio/libcxgb/libcxgb_cm.c | 149 +++
>  drivers/net/ethernet/chelsio/libcxgb/libcxgb_cm.h | 160 
>  drivers/target/iscsi/cxgbit/cxgbit_cm.c   | 234 +++---
>  8 files changed, 428 insertions(+), 418 deletions(-)
>  create mode 100644 drivers/net/ethernet/chelsio/libcxgb/libcxgb_cm.c
>  create mode 100644 drivers/net/ethernet/chelsio/libcxgb/libcxgb_cm.h
> 

This series looks good.

Reviewed-by: Steve Wise 

Thanks Varun!

Steve




pull-request: mac80211 2016-09-13

2016-09-13 Thread Johannes Berg
Hi Dave,

We found a few more issues, I'm sending you small fixes here. The diffstat
would be even shorter, but one of Felix's patches has to move about 30 lines
of code, which makes it seem much bigger than it really is.

Let me know if there's any problem.

Thanks,
johannes



The following changes since commit 15543692a010192b4264ade0d45390e8bb3dc639:

  Merge tag 'mac80211-for-davem-2016-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 (2016-08-30 21:34:48 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211.git tags/mac80211-for-davem-2016-09-13

for you to fetch changes up to ad5987b47e96a0fb6d13fea250e936aed93c:

  nl80211: validate number of probe response CSA counters (2016-09-13 20:19:27 +0200)


A few more fixes:
 * better mesh path fixing, from Thomas
 * fix TIM IE recalculation after sending frames
   to a sleeping station, from Felix
 * fix sequence number assignment while sending
   frames to a sleeping station, also from Felix
 * validate number of probe response CSA counter
   offsets, fixing a copy/paste bug (from myself)


Felix Fietkau (2):
  mac80211: fix tim recalculation after PS response
  mac80211: fix sequence number assignment for PS response frames

Johannes Berg (1):
  nl80211: validate number of probe response CSA counters

Pedersen, Thomas (1):
  mac80211: make mpath path fixing more robust

 net/mac80211/mesh_hwmp.c|  3 ++-
 net/mac80211/mesh_pathtbl.c |  2 +-
 net/mac80211/sta_info.c |  4 +--
 net/mac80211/tx.c   | 65 +++--
 net/wireless/nl80211.c  |  2 +-
 5 files changed, 39 insertions(+), 37 deletions(-)


Re: [PATCH v3 net 1/1] net sched actions: fix GETing actions

2016-09-13 Thread Jamal Hadi Salim

On 16-09-13 12:20 PM, Cong Wang wrote:

On Mon, Sep 12, 2016 at 4:07 PM, Jamal Hadi Salim  wrote:

From: Jamal Hadi Salim 

With the batch changes that translated transient actions into
a temporary list, lost in the translation was the fact that
tcf_action_destroy() will eventually delete the action from
the permanent location if the refcount is zero.

Example of what broke:
...add a gact action to drop
sudo $TC actions add action drop index 10
...now retrieve it, looks good
sudo $TC actions get action gact index 10
...retrieve it again and find it is gone!
sudo $TC actions get action gact index 10

Fixes:
commit 22dc13c837c3 ("net_sched: convert tcf_exts from list to pointer array"),
commit 824a7e8863b3 ("net_sched: remove an unnecessary list_del()")
commit f07fed82ad79 ("net_sched: remove the leftover cleanup_a()")

Signed-off-by: Jamal Hadi Salim 
---
 net/sched/act_api.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index d09d068..50720b1 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -592,6 +592,16 @@ err_out:
return ERR_PTR(err);
 }

+static void cleanup_a(struct list_head *actions, int ovr)
+{
+   struct tc_action *a;
+
+   list_for_each_entry(a, actions, list) {
+   if (ovr)
+   a->tcfa_refcnt -= 1;
+   }
+}
+
 int tcf_action_init(struct net *net, struct nlattr *nla,
  struct nlattr *est, char *name, int ovr,
  int bind, struct list_head *actions)
@@ -612,8 +622,15 @@ int tcf_action_init(struct net *net, struct nlattr *nla,
goto err;
}
act->order = i;
+   if (ovr)
+   act->tcfa_refcnt += 1;
list_add_tail(&act->list, actions);
}
+
+   /* Remove the temp refcnt which was necessary to protect against
+* destroying an existing action which was being replaced
+*/
+   cleanup_a(actions, ovr);
return 0;


I am still trying to understand this piece, so here you hold the refcnt
for the same action used by the later iteration? Otherwise there is
almost no user in between the hold and the release...

The comment you added is not clear to me; we use RTNL/RCU to
sync destroy and replace, so how could that happen?



I was worried about the destroy() hitting an error in that function.
If an action already existed and all we asked for was to
replace some attribute it would be deleted. It was the way the code was
before your changes so i just restored it to its original form.

cheers,
jamal
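
One way to read the concern above is sketched below in plain userspace C.
It is only an illustration under stated assumptions: struct action,
action_put() and init_batch() are hypothetical stand-ins, not the kernel's
tc_action API. The sketch shows why a temporary reference taken in replace
mode (ovr) keeps an already existing action alive if a later entry in the
same batch fails and the error path destroys the list.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct tc_action; not the kernel type. */
struct action {
	int refcnt;
	bool destroyed;
};

/* Drop one reference; the action is destroyed when the count hits zero. */
static void action_put(struct action *a)
{
	if (--a->refcnt <= 0)
		a->destroyed = true;
}

/*
 * Simulate a batch init in which entry `fail_at` fails (pass -1 for no
 * failure).  With `ovr` set, every action gets a temporary reference
 * before it is collected, so the error path's action_put() cannot
 * destroy an action that existed before the request.
 * Returns 0 on success, -1 on failure.
 */
static int init_batch(struct action **acts, size_t n, int ovr, int fail_at)
{
	size_t i, done = 0;

	for (i = 0; i < n; i++) {
		if ((int)i == fail_at)
			goto err;
		if (ovr)
			acts[i]->refcnt += 1;	/* temporary hold */
		done++;
	}

	/* Success: drop the temporary holds again, as cleanup_a() does. */
	for (i = 0; i < n; i++)
		if (ovr)
			acts[i]->refcnt -= 1;
	return 0;

err:
	/* Error: put everything collected so far. */
	for (i = 0; i < done; i++)
		action_put(acts[i]);
	return -1;
}
```

With ovr set, a failing batch leaves the pre-existing action at its
original refcount; without it, the same failure drops the count to zero.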


Re: [Intel-wired-lan] [net-next PATCH v3 2/3] e1000: add initial XDP support

2016-09-13 Thread Rustad, Mark D

Alexei Starovoitov  wrote:


On Tue, Sep 13, 2016 at 06:28:03PM +, Rustad, Mark D wrote:

Alexei Starovoitov  wrote:


I've looked through qemu and it appears to emulate only e1k and tg3.
The latter is still used in the field, so the risk of touching
it is higher.


I have no idea what makes you think that e1k is *not* "used in the field".
I grant you it probably is used more virtualized than not these days, but it
certainly exists and is used. You can still buy them new at Newegg for
goodness sakes!


The point is that it's only used virtualized, since PCI (not PCIe) has
long been dead.


My point is precisely the opposite. It is a real device, it exists in real  
systems and it is used in those systems. I worked on embedded systems that  
ran Linux and used e1000 devices. I am sure they are still out there  
because customers are still paying for support of those systems.


Yes, PCI(-X) is absent from any current hardware and has been for some  
years now, but there is an installed base that continues. What part of that  
installed base updates software? I don't know, but I would not just assume  
that it is 0. I know that I updated the kernel on those embedded systems  
that I worked on when I was supporting them. Never at the bleeding edge,  
but generally hopping from one LTS kernel to another as needed.


The day is coming when all the motherboards with PCI(-X) will be gone, but  
I think it is still at least a few years off.


--
Mark Rustad, Networking Division, Intel Corporation




Re: [PATCH net-next] MAINTAINERS: Add an entry for the core network DSA code

2016-09-13 Thread Andrew Lunn
> F:drivers/net/dsa/
> 
> Other than that, LGTM

I was a bit hesitant to go that far. We have individual entries for
b53 and mv88e6xxx, but not for bcm_sf2. But O.K, v2 on the way.

Andrew


Re: [PATCH 3/3] net-next: dsa: add new driver for qca8xxx family

2016-09-13 Thread John Crispin


On 13/09/2016 21:07, Andrew Lunn wrote:
>> Since the former alternative is preferred, we may want to remove the
>> latter soon from DSA. If this phy_port_map is needed for that case, it'd
>> be preferable not to add it.
> 
> O.K, so maybe we should solve it the device tree way:
> 
> 
> {
>   phy_port1: phy@0 {
>   reg = <0>;
>   };
> 
>   phy_port2: phy@1 {
>   reg = <1>;
>   };
> 
>   phy_port3: phy@2 {
>   reg = <2>;
>   };
> 
>   phy_port4: phy@3 {
>   reg = <3>;
>   };
> 
>   phy_port5: phy@4 {
>   reg = <4>;
>   };
> 
>   switch@0 {
>compatible = "qca,qca8337";
> 
>#address-cells = <1>;
>#size-cells = <0>;
>reg = <30>;
> 
>ports {
>port@11 {
>reg = <11>;
>label = "cpu";
>ethernet = <>;
>phy-mode = "rgmii";
>};
> 
>port@1 {
>reg = <1>;
>label = "lan1";
>phy-handle = <&phy_port1>;
>};
> 
>port@2 {
>reg = <2>;
>label = "lan2";
>phy-handle = <&phy_port2>;
>};
> 
>port@3 {
>reg = <3>;
>label = "lan3";
>phy-handle = <&phy_port3>;
>};
> 
>port@4 {
>reg = <4>;
>label = "lan4";
>phy-handle = <&phy_port4>;
>};
>};
>};
>};
> 
> and remove the phy_read() and phy_write() functions.
> 
> 
>Andrew
> 

Hi Andrew

ok, will give it a spin in the morning and add a note to the binding doc
explaining this. thanks for taking the time !

John


Re: [PATCH 3/3] net-next: dsa: add new driver for qca8xxx family

2016-09-13 Thread Andrew Lunn
> Since the former alternative is preferred, we may want to remove the
> latter soon from DSA. If this phy_port_map is needed for that case, it'd
> be preferable not to add it.

O.K, so maybe we should solve it the device tree way:


{
	phy_port1: phy@0 {
		reg = <0>;
	};

	phy_port2: phy@1 {
		reg = <1>;
	};

	phy_port3: phy@2 {
		reg = <2>;
	};

	phy_port4: phy@3 {
		reg = <3>;
	};

	phy_port5: phy@4 {
		reg = <4>;
	};

	switch@0 {
		compatible = "qca,qca8337";

		#address-cells = <1>;
		#size-cells = <0>;
		reg = <30>;

		ports {
			port@11 {
				reg = <11>;
				label = "cpu";
				ethernet = <>;
				phy-mode = "rgmii";
			};

			port@1 {
				reg = <1>;
				label = "lan1";
				phy-handle = <&phy_port1>;
			};

			port@2 {
				reg = <2>;
				label = "lan2";
				phy-handle = <&phy_port2>;
			};

			port@3 {
				reg = <3>;
				label = "lan3";
				phy-handle = <&phy_port3>;
			};

			port@4 {
				reg = <4>;
				label = "lan4";
				phy-handle = <&phy_port4>;
			};
		};
	};
};

and remove the phy_read() and phy_write() functions.


   Andrew


Re: [PATCH net-next] MAINTAINERS: Add an entry for the core network DSA code

2016-09-13 Thread Florian Fainelli
On 09/13/2016 12:00 PM, Andrew Lunn wrote:
> The core distributed switch architecture code currently does not have
> a MAINTAINERS entry, which results in some contributions not landing
> in the right people's inbox.
> 
> Signed-off-by: Andrew Lunn 
> ---
>  MAINTAINERS | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index ce80b36aab69..98cf0c4a14cf 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -8169,6 +8169,13 @@ S: Maintained
>  W:   https://fedorahosted.org/dropwatch/
>  F:   net/core/drop_monitor.c
>  
> +NETWORKING [DSA]
> +M:   Andrew Lunn 
> +M:   Vivien Didelot 
> +M:   Florian Fainelli 
> +S:   Maintained
> +F:   net/dsa/

F:  include/net/dsa.h
F:  drivers/net/dsa/

Other than that, LGTM
-- 
Florian


[PATCH net-next] MAINTAINERS: Add an entry for the core network DSA code

2016-09-13 Thread Andrew Lunn
The core distributed switch architecture code currently does not have
a MAINTAINERS entry, which results in some contributions not landing
in the right people's inbox.

Signed-off-by: Andrew Lunn 
---
 MAINTAINERS | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index ce80b36aab69..98cf0c4a14cf 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8169,6 +8169,13 @@ S:   Maintained
 W: https://fedorahosted.org/dropwatch/
 F: net/core/drop_monitor.c
 
+NETWORKING [DSA]
+M: Andrew Lunn 
+M: Vivien Didelot 
+M: Florian Fainelli 
+S: Maintained
+F: net/dsa/
+
 NETWORKING [GENERAL]
 M: "David S. Miller" 
 L: netdev@vger.kernel.org
-- 
2.9.3



Re: [PATCH v3] net: ip, diag -- Add diag interface for raw sockets

2016-09-13 Thread Greg
On Tue, 2016-09-13 at 20:19 +0300, Cyrill Gorcunov wrote:
> In criu we actively use the diag interface to collect the sockets
> present in the system when dumping applications. While it works as
> expected for unix, tcp, udp[lite], packet and netlink sockets,
> raw sockets have no such interface. Thus add it.
> 
> v2:
>  - add missing sock_put calls in raw_diag_dump_one (by eric.dumazet@)
>  - implement @destroy for diag requests (by dsa@)
> 
> v3:
>  - add export of raw_abort for IPv6 (by dsa@)
>  - pass net-admin flag into inet_sk_diag_fill due to
>changes in net-next branch (by dsa@)
> 
> CC: David S. Miller 
> CC: Eric Dumazet 
> CC: David Ahern 
> CC: Alexey Kuznetsov 
> CC: James Morris 
> CC: Hideaki YOSHIFUJI 
> CC: Patrick McHardy 
> CC: Andrey Vagin 
> CC: Stephen Hemminger 
> Signed-off-by: Cyrill Gorcunov 
> ---
> 
>  include/net/raw.h   |6 +
>  include/net/rawv6.h |7 +
>  net/ipv4/Kconfig|8 +
>  net/ipv4/Makefile   |1 
>  net/ipv4/raw.c  |   21 
>  net/ipv4/raw_diag.c |  226 
> 
>  net/ipv6/raw.c  |7 +
>  7 files changed, 272 insertions(+), 4 deletions(-)
> 
> Index: linux-ml.git/include/net/raw.h
> ===
> --- linux-ml.git.orig/include/net/raw.h
> +++ linux-ml.git/include/net/raw.h
> @@ -23,6 +23,12 @@
>  
>  extern struct proto raw_prot;
>  
> +extern struct raw_hashinfo raw_v4_hashinfo;
> +struct sock *__raw_v4_lookup(struct net *net, struct sock *sk,
> +  unsigned short num, __be32 raddr,
> +  __be32 laddr, int dif);
> +
> +int raw_abort(struct sock *sk, int err);
>  void raw_icmp_error(struct sk_buff *, int, u32);
>  int raw_local_deliver(struct sk_buff *, int);
>  
> Index: linux-ml.git/include/net/rawv6.h
> ===
> --- linux-ml.git.orig/include/net/rawv6.h
> +++ linux-ml.git/include/net/rawv6.h
> @@ -3,6 +3,13 @@
>  
>  #include 
>  
> +extern struct raw_hashinfo raw_v6_hashinfo;
> +struct sock *__raw_v6_lookup(struct net *net, struct sock *sk,
> +  unsigned short num, const struct in6_addr *loc_addr,
> +  const struct in6_addr *rmt_addr, int dif);
> +
> +int raw_abort(struct sock *sk, int err);
> +
>  void raw6_icmp_error(struct sk_buff *, int nexthdr,
>   u8 type, u8 code, int inner_offset, __be32);
>  bool raw6_local_deliver(struct sk_buff *, int);
> Index: linux-ml.git/net/ipv4/Kconfig
> ===
> --- linux-ml.git.orig/net/ipv4/Kconfig
> +++ linux-ml.git/net/ipv4/Kconfig
> @@ -430,6 +430,14 @@ config INET_UDP_DIAG
> Support for UDP socket monitoring interface used by the ss tool.
> If unsure, say Y.
>  
> +config INET_RAW_DIAG
> + tristate "RAW: socket monitoring interface"
> + depends on INET_DIAG && (IPV6 || IPV6=n)
> + default n
> + ---help---
> +   Support for RAW socket monitoring interface used by the ss tool.
> +   If unsure, say Y.
> +
>  config INET_DIAG_DESTROY
>   bool "INET: allow privileged process to administratively close sockets"
>   depends on INET_DIAG
> Index: linux-ml.git/net/ipv4/Makefile
> ===
> --- linux-ml.git.orig/net/ipv4/Makefile
> +++ linux-ml.git/net/ipv4/Makefile
> @@ -40,6 +40,7 @@ obj-$(CONFIG_NETFILTER) += netfilter.o n
>  obj-$(CONFIG_INET_DIAG) += inet_diag.o 
>  obj-$(CONFIG_INET_TCP_DIAG) += tcp_diag.o
>  obj-$(CONFIG_INET_UDP_DIAG) += udp_diag.o
> +obj-$(CONFIG_INET_RAW_DIAG) += raw_diag.o
>  obj-$(CONFIG_NET_TCPPROBE) += tcp_probe.o
>  obj-$(CONFIG_TCP_CONG_BIC) += tcp_bic.o
>  obj-$(CONFIG_TCP_CONG_CDG) += tcp_cdg.o
> Index: linux-ml.git/net/ipv4/raw.c
> ===
> --- linux-ml.git.orig/net/ipv4/raw.c
> +++ linux-ml.git/net/ipv4/raw.c
> @@ -89,9 +89,10 @@ struct raw_frag_vec {
>   int hlen;
>  };
>  
> -static struct raw_hashinfo raw_v4_hashinfo = {
> +struct raw_hashinfo raw_v4_hashinfo = {
>   .lock = __RW_LOCK_UNLOCKED(raw_v4_hashinfo.lock),
>  };
> +EXPORT_SYMBOL_GPL(raw_v4_hashinfo);
>  
>  int raw_hash_sk(struct sock *sk)
>  {
> @@ -120,7 +121,7 @@ void raw_unhash_sk(struct sock *sk)
>  }
>  EXPORT_SYMBOL_GPL(raw_unhash_sk);
>  
> -static struct sock *__raw_v4_lookup(struct net *net, struct sock *sk,
> +struct sock *__raw_v4_lookup(struct net *net, struct sock *sk,
>   unsigned short num, __be32 raddr, __be32 laddr, int dif)
>  {
>   sk_for_each_from(sk) {
> @@ -136,6 +137,7 @@ static struct sock *__raw_v4_lookup(stru
>  found:
>   return sk;
>  }
> 

Re: [PATCH net-next 3/3] net: ethernet: mediatek: add dts configuration to enable HW LRO

2016-09-13 Thread Florian Fainelli
On 09/13/2016 06:54 AM, Nelson Chang wrote:
> Add the configuration of HW LRO in the binding document.
> 
> Signed-off-by: Nelson Chang 
> ---
>  Documentation/devicetree/bindings/net/mediatek-net.txt | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/Documentation/devicetree/bindings/net/mediatek-net.txt b/Documentation/devicetree/bindings/net/mediatek-net.txt
> index 32eaaca..f43c0d1 100644
> --- a/Documentation/devicetree/bindings/net/mediatek-net.txt
> +++ b/Documentation/devicetree/bindings/net/mediatek-net.txt
> @@ -20,6 +20,7 @@ Required properties:
>  - mediatek,ethsys: phandle to the syscon node that handles the port setup
>  - mediatek,pctl: phandle to the syscon node that handles the ports slew rate
>   and driver current
> +- mediatek,hwlro: set to enable HW LRO functions of PDMA rx rings

That sounds like implementing an enable/disable policy in the Device Tree
as opposed to providing an indication as to whether the HW supports LRO
or not. If all versions of the hardware support LRO, then you would
rather let the users change NETIF_F_LRO using ethtool features instead
of having this be defined in the Device Tree.

If, on the other hand, not all versions of the HW support LRO, then you
would just want to rephrase the property description to say this
describes a capability.
-- 
Florian


Re: [Intel-wired-lan] [net-next PATCH v3 2/3] e1000: add initial XDP support

2016-09-13 Thread Alexei Starovoitov
On Tue, Sep 13, 2016 at 06:28:03PM +, Rustad, Mark D wrote:
> Alexei Starovoitov  wrote:
> 
> >I've looked through qemu and it appears to emulate only e1k and tg3.
> >The latter is still used in the field, so the risk of touching
> >it is higher.
> 
> I have no idea what makes you think that e1k is *not* "used in the field".
> I grant you it probably is used more virtualized than not these days, but it
> certainly exists and is used. You can still buy them new at Newegg for
> goodness sakes!

The point is that it's only used virtualized, since PCI (not PCIe) has
long been dead.



Re: [Intel-wired-lan] [net-next PATCH v3 2/3] e1000: add initial XDP support

2016-09-13 Thread Rustad, Mark D

Alexei Starovoitov  wrote:


I've looked through qemu and it appears to emulate only e1k and tg3.
The latter is still used in the field, so the risk of touching
it is higher.


I have no idea what makes you think that e1k is *not* "used in the field".   
I grant you it probably is used more virtualized than not these days, but  
it certainly exists and is used. You can still buy them new at Newegg for  
goodness sakes!


Maybe I'll go home and plug in my old e100 into my machine that still has a  
PCI slot, just for old times sake. Oh darn, I have a SCSI card in that  
slot...


--
Mark Rustad, Networking Division, Intel Corporation




Re: [PATCH net-next 2/3] net: ethernet: mediatek: add ethtool functions to configure RX flows of HW LRO

2016-09-13 Thread Florian Fainelli
On 09/13/2016 06:54 AM, Nelson Chang wrote:
> The code adds ethtool functions to set RX flows for HW LRO. Because the
> HW LRO hardware can only recognize the destination IP of TCP/IP RX flows,
> the ethtool command to add a HW LRO flow is as below:
> ethtool -N [devname] flow-type tcp4 dst-ip [ip_addr] loc [0~1]
> 
> Also, because the hardware can hold four destination IPs in total, each
> GMAC (GMAC1/GMAC2) can set at most two IPs.
> 
> Signed-off-by: Nelson Chang 
> ---

> +
> +static int mtk_set_features(struct net_device *dev, netdev_features_t features)
> +{
> + int err = 0;
> +
> + if (!((dev->features ^ features) & NETIF_F_LRO))
> + return 0;
> +
> + if (!(features & NETIF_F_LRO))
> + mtk_hwlro_netdev_disable(dev);

You may want to implement a fix_features ndo operation which makes sure
that NETIF_F_LRO is turned on in case an RX flow is programmed;
otherwise, it may be confusing to the user that a flow was programmed,
but no offload is happening.
-- 
Florian
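
The suggestion above could look roughly like the following userspace
sketch. It is not driver code: FEAT_LRO, struct dev_state and
lro_flow_cnt are hypothetical names used only for illustration, and the
real ndo_fix_features callback works on netdev_features_t. The second
helper mirrors the XOR test from the quoted mtk_set_features() hunk.

```c
#include <stdint.h>

/* Hypothetical feature bit; the kernel's NETIF_F_LRO has its own value. */
#define FEAT_LRO (UINT64_C(1) << 0)

/* Hypothetical per-device state: programmed HW LRO RX flow count. */
struct dev_state {
	uint64_t features;
	unsigned int lro_flow_cnt;
};

/* Keep LRO set in the requested mask while any RX flow is programmed. */
static uint64_t fix_features(const struct dev_state *dev, uint64_t requested)
{
	if (!(requested & FEAT_LRO) && dev->lro_flow_cnt > 0)
		requested |= FEAT_LRO;
	return requested;
}

/* Return 1 when the LRO bit differs between current and new features. */
static int lro_changed(const struct dev_state *dev, uint64_t features)
{
	return ((dev->features ^ features) & FEAT_LRO) != 0;
}
```

With a check like this, a request to clear LRO while flows exist is
turned back into "LRO on", so the set_features path sees no change.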


Re: net/bluetooth: workqueue destruction WARNING in hci_unregister_dev

2016-09-13 Thread Jiri Slaby
On 09/13/2016, 05:35 PM, Tejun Heo wrote:
> Hello,
> 
> On Sat, Sep 10, 2016 at 11:33:48AM +0200, Dmitry Vyukov wrote:
>> Hit the WARNING with the patch. It showed "Showing busy workqueues and
>> worker pools:" after the WARNING, but then no queue info. Was it
>> already destroyed and removed from the list?...
> 
> Hmm...  It either means that the work item which was in flight when
> WARN_ON() ran finished by the time the debug printout got to it or
> that it's something unrelated to busy work items.
> 
>> [ 198.113838] WARNING: CPU: 2 PID: 26691 at kernel/workqueue.c:4042
>> destroy_workqueue+0x17b/0x630
> 
> I don't seem to have the same source code that you have.  Which exact
> WARN_ON() is this?

I assume Dmitry sees the same thing I am still seeing, which I reported
some time ago:
https://lkml.org/lkml/2016/3/21/492

This warning is triggered there and still occurs with "HEAD":
  (pwq != wq->dfl_pwq) && (pwq->refcnt > 1)
and the state dump in the log is empty too:
destroy_workqueue: name='hci0' pwq=88006b5c8f00
wq->dfl_pwq=88006b5c9b00 pwq->refcnt=2 pwq->nr_active=0 delayed_works:
  pwq 13:
 cpus=2-3 node=1 flags=0x4 nice=-20 active=0/1
in-flight: 2669:wq_barrier_func

thanks,
-- 
js
suse labs
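
For readers following along, the quoted condition can be written as a
small predicate. This is only a sketch: the two structs below are
hypothetical, trimmed-down stand-ins for the kernel's pool_workqueue and
workqueue_struct, keeping just the fields that appear in the dump above.

```c
#include <stdbool.h>

/* Hypothetical, trimmed-down stand-ins for the kernel structures. */
struct pool_workqueue {
	int refcnt;
	int nr_active;
};

struct workqueue {
	struct pool_workqueue *dfl_pwq;
};

/* The sanity condition quoted in the report. */
static bool pwq_still_referenced(const struct workqueue *wq,
				 const struct pool_workqueue *pwq)
{
	return (pwq != wq->dfl_pwq) && (pwq->refcnt > 1);
}
```

Plugging in the values from the dump (pwq differs from wq->dfl_pwq and
pwq->refcnt=2) makes the predicate true, even though nr_active is 0.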


Re: [PATCH 3/3] net-next: dsa: add new driver for qca8xxx family

2016-09-13 Thread Florian Fainelli
On 09/13/2016 11:07 AM, John Crispin wrote:
> 
> 
> On 13/09/2016 19:09, Florian Fainelli wrote:
>> On 09/13/2016 08:59 AM, Andrew Lunn wrote:
>>>> Hi Andrew,
>>>>
>>>> this function does indeed duplicate the functionality of
>>>> phy_ethtool_get_eee() with the small difference, that e->eee_active is
>>>> also set which phy_ethtool_get_eee() does not set.
>>>>
>>>> dsa_slave_get_eee() will call phy_ethtool_get_eee() right after the
>>>> get_eee() op has been called. would it be ok to move the code setting
>>>> eee_active to  phy_ethtool_get_eee().
>>
>> Humm, AFAIR, the reason why eee_active is set outside of
>> phy_ethtool_set_eee() is because this is a MAC + PHY thing, both need to
>> agree and support that, and so while the PHY may be configured to have
>> EEE advertised and enabled, you also need to take care of the MAC
>> portion and enable EEE in there as well. Is not there such a thing for
>> the qca8k switch where the PHY needs to be configured through the
>> standard phylib calls, but the switch's transmitter/receiver also needs
>> to have EEE enabled?
>>
> 
> Hi Florian,
> 
> the switch needs to enable EEE on a per-MAC basis, but there is no
> way to tell if the autonegotiation worked and EEE is enabled without
> reading the PHY's registers.

OK, that does not sound atypical here, most drivers I see do have a way
to tell if EEE is active by reading e.g: the LPI indication register, or
something that is able to reflect the negotiated result.

> 
> setting the eee_active inside phy_ethtool_get_eee() would break those
> dsa drivers that have a register telling if AN worked. if it is ok i
> will just call phy_ethtool_get_eee() inside get_eee().

Ok, let's see the code and then we can discuss from there, not very
clear on the proposed change here.
-- 
Florian
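
The "MAC + PHY thing" discussed in this thread can be summarised in a
few lines of C. This is a sketch only, assuming a hypothetical struct
eee_state; it does not reflect how phylib or any DSA driver stores this
state.

```c
#include <stdbool.h>

/* Hypothetical view of the two halves of an EEE link. */
struct eee_state {
	bool mac_lpi_enabled;	/* MAC/switch port has EEE enabled */
	bool phy_negotiated;	/* PHY autonegotiation agreed on EEE */
};

/* EEE is only reported active when both halves agree. */
static bool eee_active(const struct eee_state *s)
{
	return s->mac_lpi_enabled && s->phy_negotiated;
}
```

This is why a value derived from the PHY alone is not enough for drivers
whose MAC side has to be enabled separately.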


Re: [PATCH 3/3] net-next: dsa: add new driver for qca8xxx family

2016-09-13 Thread John Crispin


On 13/09/2016 19:09, Florian Fainelli wrote:
> On 09/13/2016 08:59 AM, Andrew Lunn wrote:
>>> Hi Andrew,
>>>
>>> this function does indeed duplicate the functionality of
>>> phy_ethtool_get_eee() with the small difference, that e->eee_active is
>>> also set which phy_ethtool_get_eee() does not set.
>>>
>>> dsa_slave_get_eee() will call phy_ethtool_get_eee() right after the
>>> get_eee() op has been called. would it be ok to move the code setting
>>> eee_active to  phy_ethtool_get_eee().
> 
> Humm, AFAIR, the reason why eee_active is set outside of
> phy_ethtool_set_eee() is because this is a MAC + PHY thing, both need to
> agree and support that, and so while the PHY may be configured to have
> EEE advertised and enabled, you also need to take care of the MAC
> portion and enable EEE in there as well. Is not there such a thing for
> the qca8k switch where the PHY needs to be configured through the
> standard phylib calls, but the switch's transmitter/receiver also needs
> to have EEE enabled?
>

Hi Florian,

the switch needs to enable EEE on a per-MAC basis, but there is no
way to tell if the autonegotiation worked and EEE is enabled without
reading the PHY's registers.

setting the eee_active inside phy_ethtool_get_eee() would break those
dsa drivers that have a register telling if AN worked. if it is ok i
will just call phy_ethtool_get_eee() inside get_eee().

John


[PATCHv2 net 5/6] sctp: make sctp_outq_flush/tail/uncork return void

2016-09-13 Thread Xin Long
The sctp_outq_flush return value is meaningless now, so this patch makes
sctp_outq_flush return void, as well as sctp_outq_tail and
sctp_outq_uncork.

Signed-off-by: Xin Long 
---
 include/net/sctp/structs.h |  4 ++--
 net/sctp/outqueue.c| 19 +++
 net/sctp/sm_sideeffect.c   |  9 -
 3 files changed, 13 insertions(+), 19 deletions(-)

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index f61fb7c..8693dc4 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -1077,7 +1077,7 @@ struct sctp_outq {
 void sctp_outq_init(struct sctp_association *, struct sctp_outq *);
 void sctp_outq_teardown(struct sctp_outq *);
 void sctp_outq_free(struct sctp_outq*);
-int sctp_outq_tail(struct sctp_outq *, struct sctp_chunk *chunk, gfp_t);
+void sctp_outq_tail(struct sctp_outq *, struct sctp_chunk *chunk, gfp_t);
 int sctp_outq_sack(struct sctp_outq *, struct sctp_chunk *);
 int sctp_outq_is_empty(const struct sctp_outq *);
 void sctp_outq_restart(struct sctp_outq *);
@@ -1085,7 +1085,7 @@ void sctp_outq_restart(struct sctp_outq *);
 void sctp_retransmit(struct sctp_outq *, struct sctp_transport *,
 sctp_retransmit_reason_t);
 void sctp_retransmit_mark(struct sctp_outq *, struct sctp_transport *, __u8);
-int sctp_outq_uncork(struct sctp_outq *, gfp_t gfp);
+void sctp_outq_uncork(struct sctp_outq *, gfp_t gfp);
 void sctp_prsctp_prune(struct sctp_association *asoc,
   struct sctp_sndrcvinfo *sinfo, int msg_len);
 /* Uncork and flush an outqueue.  */
diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index 052a479..8c3f446 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -68,7 +68,7 @@ static void sctp_mark_missing(struct sctp_outq *q,
 
 static void sctp_generate_fwdtsn(struct sctp_outq *q, __u32 sack_ctsn);
 
-static int sctp_outq_flush(struct sctp_outq *q, int rtx_timeout, gfp_t gfp);
+static void sctp_outq_flush(struct sctp_outq *q, int rtx_timeout, gfp_t gfp);
 
 /* Add data to the front of the queue. */
 static inline void sctp_outq_head_data(struct sctp_outq *q,
@@ -285,10 +285,9 @@ void sctp_outq_free(struct sctp_outq *q)
 }
 
 /* Put a new chunk in an sctp_outq.  */
-int sctp_outq_tail(struct sctp_outq *q, struct sctp_chunk *chunk, gfp_t gfp)
+void sctp_outq_tail(struct sctp_outq *q, struct sctp_chunk *chunk, gfp_t gfp)
 {
struct net *net = sock_net(q->asoc->base.sk);
-   int error = 0;
 
pr_debug("%s: outq:%p, chunk:%p[%s]\n", __func__, q, chunk,
 chunk && chunk->chunk_hdr ?
@@ -318,9 +317,7 @@ int sctp_outq_tail(struct sctp_outq *q, struct sctp_chunk *chunk, gfp_t gfp)
}
 
if (!q->cork)
-   error = sctp_outq_flush(q, 0, gfp);
-
-   return error;
+   sctp_outq_flush(q, 0, gfp);
 }
 
 /* Insert a chunk into the sorted list based on the TSNs.  The retransmit list
@@ -748,12 +745,12 @@ redo:
 }
 
 /* Cork the outqueue so queued chunks are really queued. */
-int sctp_outq_uncork(struct sctp_outq *q, gfp_t gfp)
+void sctp_outq_uncork(struct sctp_outq *q, gfp_t gfp)
 {
if (q->cork)
q->cork = 0;
 
-   return sctp_outq_flush(q, 0, gfp);
+   sctp_outq_flush(q, 0, gfp);
 }
 
 
@@ -766,7 +763,7 @@ int sctp_outq_uncork(struct sctp_outq *q, gfp_t gfp)
  * locking concerns must be made.  Today we use the sock lock to protect
  * this function.
  */
-static int sctp_outq_flush(struct sctp_outq *q, int rtx_timeout, gfp_t gfp)
+static void sctp_outq_flush(struct sctp_outq *q, int rtx_timeout, gfp_t gfp)
 {
struct sctp_packet *packet;
struct sctp_packet singleton;
@@ -891,7 +888,7 @@ static int sctp_outq_flush(struct sctp_outq *q, int rtx_timeout, gfp_t gfp)
error = sctp_packet_transmit(&singleton, gfp);
if (error < 0) {
asoc->base.sk->sk_err = -error;
-   return 0;
+   return;
}
break;
 
@@ -1175,8 +1172,6 @@ sctp_flush_out:
/* Clear the burst limited state, if any */
sctp_transport_burst_reset(t);
}
-
-   return 0;
 }
 
 /* Update unack_data based on the incoming SACK chunk */
diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
index cf6e4f0..c345bf1 100644
--- a/net/sctp/sm_sideeffect.c
+++ b/net/sctp/sm_sideeffect.c
@@ -1421,8 +1421,7 @@ static int sctp_cmd_interpreter(sctp_event_t event_type,
local_cork = 1;
}
/* Send a chunk to our peer.  */
-   error = sctp_outq_tail(>outqueue, cmd->obj.chunk,
-  gfp);
+   sctp_outq_tail(>outqueue, cmd->obj.chunk, gfp);
break;
 
case SCTP_CMD_SEND_PKT:
@@ -1676,7 +1675,7 @@ static int 

[PATCHv2 net 6/6] sctp: not return ENOMEM err back in sctp_packet_transmit

2016-09-13 Thread Xin Long
As David and Marcelo suggested, an ENOMEM error shouldn't be returned to the
user in the transmit path. Instead, sctp's retransmit takes care of
the chunks that fail to send because of ENOMEM.

This patch only releases resources when alloc_skb fails and no longer
returns ENOMEM.

Besides, it also cleans up sctp_packet_transmit's err path, and fixes
some issues in the err path:

 - It didn't free the head skb in nomem: path.
 - No need to check nskb in no_route: path.
 - It should goto err: path if alloc_skb fails for head.
 - Not all the NOMEMs should free nskb.
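
The reordered error path in this patch relies on labels that fall
through into each other (nomem, then nodst, then err, then out). A
minimal userspace sketch of that layout follows; enum fail, struct trace
and transmit() are hypothetical names for illustration, and the counters
merely stand in for the real cleanup calls.

```c
/* Counters standing in for the real cleanup actions. */
struct trace {
	int auth_freed;		/* sctp_chunk_free(packet->auth) */
	int head_freed;		/* kfree_skb(head) */
	int chunks_dropped;	/* control chunks dropped from the list */
	int reset;		/* sctp_packet_reset(packet) */
};

enum fail { FAIL_NONE, FAIL_HEAD_ALLOC, FAIL_NO_DST, FAIL_NOMEM };

/* Each label does its own cleanup and falls through to the next one. */
static int transmit(enum fail f, struct trace *t)
{
	int ret = 0;	/* ENOMEM is no longer reported to the caller */

	if (f == FAIL_HEAD_ALLOC)
		goto err;	/* no head skb yet, nothing to free */
	if (f == FAIL_NO_DST)
		goto nodst;
	if (f == FAIL_NOMEM)
		goto nomem;
	goto out;

nomem:
	t->auth_freed++;
nodst:
	t->head_freed++;
err:
	t->chunks_dropped++;
out:
	t->reset++;
	return ret;
}
```

Ordering the labels from "most to undo" to "least to undo" is what lets
a head allocation failure skip the kfree_skb() step entirely.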

Signed-off-by: Xin Long 
---
 net/sctp/output.c | 47 ++-
 1 file changed, 22 insertions(+), 25 deletions(-)

diff --git a/net/sctp/output.c b/net/sctp/output.c
index f2597a9..0c605ec 100644
--- a/net/sctp/output.c
+++ b/net/sctp/output.c
@@ -442,14 +442,14 @@ int sctp_packet_transmit(struct sctp_packet *packet, gfp_t gfp)
 * time. Application may notice this error.
 */
pr_err_once("Trying to GSO but underlying device doesn't support it.");
-   goto nomem;
+   goto err;
}
} else {
pkt_size = packet->size;
}
head = alloc_skb(pkt_size + MAX_HEADER, gfp);
if (!head)
-   goto nomem;
+   goto err;
if (gso) {
NAPI_GRO_CB(head)->last = head;
skb_shinfo(head)->gso_type = sk->sk_gso_type;
@@ -470,8 +470,12 @@ int sctp_packet_transmit(struct sctp_packet *packet, gfp_t gfp)
}
}
dst = dst_clone(tp->dst);
-   if (!dst)
-   goto no_route;
+   if (!dst) {
+   if (asoc)
+   IP_INC_STATS(sock_net(asoc->base.sk),
+IPSTATS_MIB_OUTNOROUTES);
+   goto nodst;
+   }
skb_dst_set(head, dst);
 
/* Build the SCTP header.  */
@@ -622,8 +626,10 @@ int sctp_packet_transmit(struct sctp_packet *packet, gfp_t gfp)
if (!gso)
break;
 
-   if (skb_gro_receive(&head, nskb))
+   if (skb_gro_receive(&head, nskb)) {
+   kfree_skb(nskb);
goto nomem;
+   }
nskb = NULL;
if (WARN_ON_ONCE(skb_shinfo(head)->gso_segs >=
 sk->sk_gso_max_segs))
@@ -717,18 +723,13 @@ int sctp_packet_transmit(struct sctp_packet *packet, gfp_t gfp)
}
head->ignore_df = packet->ipfragok;
tp->af_specific->sctp_xmit(head, tp);
+   goto out;
 
-out:
-   sctp_packet_reset(packet);
-   return err;
-no_route:
-   kfree_skb(head);
-   if (nskb != head)
-   kfree_skb(nskb);
-
-   if (asoc)
-   IP_INC_STATS(sock_net(asoc->base.sk), IPSTATS_MIB_OUTNOROUTES);
+nomem:
+   if (packet->auth && list_empty(&packet->auth->list))
+   sctp_chunk_free(packet->auth);
 
+nodst:
/* FIXME: Returning the 'err' will effect all the associations
 * associated with a socket, although only one of the paths of the
 * association is unreachable.
@@ -737,22 +738,18 @@ no_route:
 * required.
 */
 /* err = -EHOSTUNREACH; */
-err:
-   /* Control chunks are unreliable so just drop them.  DATA chunks
-* will get resent or dropped later.
-*/
+   kfree_skb(head);
 
+err:
list_for_each_entry_safe(chunk, tmp, &packet->chunk_list, list) {
list_del_init(&chunk->list);
if (!sctp_chunk_is_data(chunk))
sctp_chunk_free(chunk);
}
-   goto out;
-nomem:
-   if (packet->auth && list_empty(&packet->auth->list))
-   sctp_chunk_free(packet->auth);
-   err = -ENOMEM;
-   goto err;
+
+out:
+   sctp_packet_reset(packet);
+   return err;
 }
 
 /
-- 
2.1.0



[PATCHv2 net 3/6] sctp: free msg->chunks when sctp_primitive_SEND return err

2016-09-13 Thread Xin Long
The previous patch, "sctp: do not return the transmit err back to sctp_sendmsg",
made sctp_primitive_SEND return err only when the asoc state is unavailable.
In this case the chunks are not enqueued, and they have no chance to be freed
if we don't take care of them later.

This patch effectively reverts commit 1cd4d5c4326a ("sctp: remove the
unused sctp_datamsg_free()"), commit 69b5777f2e57 ("sctp: hold the chunks
only after the chunk is enqueued in outq") and commit 8b570dc9f7b6 ("sctp:
only drop the reference on the datamsg after sending a msg"), to use
sctp_datamsg_free to free the chunks of the current msg.

Fixes: 8b570dc9f7b6 ("sctp: only drop the reference on the datamsg after sending a msg")
Signed-off-by: Xin Long 
---
 include/net/sctp/structs.h |  1 +
 net/sctp/chunk.c   | 13 +
 net/sctp/outqueue.c|  1 -
 net/sctp/socket.c  |  8 ++--
 4 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index ce93c4b..f61fb7c 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -537,6 +537,7 @@ struct sctp_datamsg {
 struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *,
struct sctp_sndrcvinfo *,
struct iov_iter *);
+void sctp_datamsg_free(struct sctp_datamsg *);
 void sctp_datamsg_put(struct sctp_datamsg *);
 void sctp_chunk_fail(struct sctp_chunk *, int error);
 int sctp_chunk_abandoned(struct sctp_chunk *);
diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c
index a55e547..af9cc80 100644
--- a/net/sctp/chunk.c
+++ b/net/sctp/chunk.c
@@ -70,6 +70,19 @@ static struct sctp_datamsg *sctp_datamsg_new(gfp_t gfp)
return msg;
 }
 
+void sctp_datamsg_free(struct sctp_datamsg *msg)
+{
+   struct sctp_chunk *chunk;
+
+   /* This doesn't have to be a _safe variant because
+* sctp_chunk_free() only drops the refs.
+*/
+   list_for_each_entry(chunk, &msg->chunks, frag_list)
+   sctp_chunk_free(chunk);
+
+   sctp_datamsg_put(msg);
+}
+
 /* Final destructruction of datamsg memory. */
 static void sctp_datamsg_destroy(struct sctp_datamsg *msg)
 {
diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index da2418b..6c109b0 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -304,7 +304,6 @@ int sctp_outq_tail(struct sctp_outq *q, struct sctp_chunk *chunk, gfp_t gfp)
 sctp_cname(SCTP_ST_CHUNK(chunk->chunk_hdr->type)) :
 "illegal chunk");
 
-   sctp_chunk_hold(chunk);
sctp_outq_tail_data(q, chunk);
if (chunk->asoc->prsctp_enable &&
SCTP_PR_PRIO_ENABLED(chunk->sinfo.sinfo_flags))
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 9fc417a..6cdc61c 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -1958,6 +1958,8 @@ static int sctp_sendmsg(struct sock *sk, struct msghdr *msg, size_t msg_len)
 
/* Now send the (possibly) fragmented message. */
list_for_each_entry(chunk, &datamsg->chunks, frag_list) {
+   sctp_chunk_hold(chunk);
+
/* Do accounting for the write space.  */
sctp_set_owner_w(chunk);
 
@@ -1970,13 +1972,15 @@ static int sctp_sendmsg(struct sock *sk, struct msghdr *msg, size_t msg_len)
 * breaks.
 */
err = sctp_primitive_SEND(net, asoc, datamsg);
-   sctp_datamsg_put(datamsg);
/* Did the lower layer accept the chunk? */
-   if (err)
+   if (err) {
+   sctp_datamsg_free(datamsg);
goto out_free;
+   }
 
pr_debug("%s: we sent primitively\n", __func__);
 
+   sctp_datamsg_put(datamsg);
err = msg_len;
 
if (unlikely(wait_connect)) {
-- 
2.1.0



[PATCHv2 net 4/6] sctp: save transmit error to sk_err in sctp_outq_flush

2016-09-13 Thread Xin Long
Every time sctp calls sctp_outq_flush, it sends out the chunks of the
control queue, retransmit queue and data queue. Even if some chunks
fail to transmit, it still has to flush all the transports, as it's
the only chance to clean that transmit_list.

So the latest transmit error is what should be returned here. This
transmit error is an internal error of the sctp stack.

I checked all the places where it uses the transmit error (the return
value of sctp_outq_flush); most of them actually just save it to
sk_err.

The exceptions are sctp_assoc/endpoint_bh_rcv: they drop the chunk if
sending a REPLY fails, which is actually incorrect, as we can't
be sure the error that sctp_outq_flush returns is from sending that
REPLY.

So it's meaningless for sctp_outq_flush to return error back.

This patch is to save transmit error to sk_err in sctp_outq_flush, the
new error can update the old value. Eventually, sctp_wait_for_* would
check for it.

Signed-off-by: Xin Long 
---
 net/sctp/output.c   |  3 ++-
 net/sctp/outqueue.c | 21 -
 2 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/net/sctp/output.c b/net/sctp/output.c
index 31b7bc3..f2597a9 100644
--- a/net/sctp/output.c
+++ b/net/sctp/output.c
@@ -180,7 +180,6 @@ sctp_xmit_t sctp_packet_transmit_chunk(struct sctp_packet 
*packet,
   int one_packet, gfp_t gfp)
 {
sctp_xmit_t retval;
-   int error = 0;
 
pr_debug("%s: packet:%p size:%Zu chunk:%p size:%d\n", __func__,
 packet, packet->size, chunk, chunk->skb ? chunk->skb->len : 
-1);
@@ -188,6 +187,8 @@ sctp_xmit_t sctp_packet_transmit_chunk(struct sctp_packet 
*packet,
switch ((retval = (sctp_packet_append_chunk(packet, chunk)))) {
case SCTP_XMIT_PMTU_FULL:
if (!packet->has_cookie_echo) {
+   int error = 0;
+
error = sctp_packet_transmit(packet, gfp);
if (error < 0)
chunk->skb->sk->sk_err = -error;
diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index 6c109b0..052a479 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -533,7 +533,6 @@ void sctp_retransmit(struct sctp_outq *q, struct 
sctp_transport *transport,
 sctp_retransmit_reason_t reason)
 {
struct net *net = sock_net(q->asoc->base.sk);
-   int error = 0;
 
switch (reason) {
case SCTP_RTXR_T3_RTX:
@@ -577,10 +576,7 @@ void sctp_retransmit(struct sctp_outq *q, struct 
sctp_transport *transport,
 * will be flushed at the end.
 */
if (reason != SCTP_RTXR_FAST_RTX)
-   error = sctp_outq_flush(q, /* rtx_timeout */ 1, GFP_ATOMIC);
-
-   if (error)
-   q->asoc->base.sk->sk_err = -error;
+   sctp_outq_flush(q, /* rtx_timeout */ 1, GFP_ATOMIC);
 }
 
 /*
@@ -893,8 +889,10 @@ static int sctp_outq_flush(struct sctp_outq *q, int 
rtx_timeout, gfp_t gfp)
sctp_packet_config(, vtag, 0);
sctp_packet_append_chunk(, chunk);
error = sctp_packet_transmit(, gfp);
-   if (error < 0)
-   return error;
+   if (error < 0) {
+   asoc->base.sk->sk_err = -error;
+   return 0;
+   }
break;
 
case SCTP_CID_ABORT:
@@ -992,6 +990,8 @@ static int sctp_outq_flush(struct sctp_outq *q, int 
rtx_timeout, gfp_t gfp)
retran:
error = sctp_outq_flush_rtx(q, packet,
rtx_timeout, _timer);
+   if (error < 0)
+   asoc->base.sk->sk_err = -error;
 
if (start_timer) {
sctp_transport_reset_t3_rtx(transport);
@@ -1166,14 +1166,17 @@ sctp_flush_out:
  struct sctp_transport,
  send_ready);
packet = >packet;
-   if (!sctp_packet_empty(packet))
+   if (!sctp_packet_empty(packet)) {
error = sctp_packet_transmit(packet, gfp);
+   if (error < 0)
+   asoc->base.sk->sk_err = -error;
+   }
 
/* Clear the burst limited state, if any */
sctp_transport_burst_reset(t);
}
 
-   return error;
+   return 0;
 }
 
 /* Update unack_data based on the incoming SACK chunk */
-- 
2.1.0
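
The behaviour described in the changelog above (flush everything, and
let the most recent failure land in sk_err instead of the return value)
can be sketched as a toy model. This is a minimal illustration, not the
sctp code: struct toy_sock, toy_outq_flush() and toy_xmit() are
hypothetical names, and the packets are plain integers.

```c
#include <errno.h>
#include <stddef.h>

/* Toy stand-in for struct sock: only the pending-error slot is modelled. */
struct toy_sock {
	int sk_err;	/* positive errno, 0 when no error is pending */
};

/* Hypothetical transmit callback: returns 0 or a negative errno. */
typedef int (*toy_xmit_fn)(int packet);

/*
 * Flush every packet even if some fail. Each failure overwrites sk_err,
 * so a later reader sees the most recent error there; nothing is
 * reported through a return value.
 */
void toy_outq_flush(struct toy_sock *sk, const int *packets, size_t n,
		    toy_xmit_fn xmit)
{
	for (size_t i = 0; i < n; i++) {
		int error = xmit(packets[i]);

		if (error < 0)
			sk->sk_err = -error;
	}
}

/* Example transmit: packet 7 has no route, other odd packets hit nomem. */
int toy_xmit(int packet)
{
	if (packet == 7)
		return -EHOSTUNREACH;
	return (packet % 2) ? -ENOMEM : 0;
}
```

Flushing the packets { 2, 1, 4, 7, 6 } through toy_xmit() transmits all
five and leaves sk_err set to EHOSTUNREACH, the later of the two
failures.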



[PATCHv2 net 2/6] sctp: do not return the transmit err back to sctp_sendmsg

2016-09-13 Thread Xin Long
Once a chunk is enqueued successfully, the sctp queues can take care of
it. Even if it fails to transmit (for example because of nomem), it
should be put into the retransmit queue.

If sctp reports this error to users, it confuses them: they may resend
that msg, while the in-kernel sctp stack is already in charge of
retransmitting it.

Besides, this error is probably not from a failure to transmit the
current msg, but from transmitting or retransmitting another msg's
chunks, as sctp_outq_flush just tries to send out all transports'
chunks.

This patch makes sctp_cmd_send_msg return void, and no longer returns
the transmit err back to sctp_sendmsg.

Fixes: 8b570dc9f7b6 ("sctp: only drop the reference on the datamsg after 
sending a msg")
Signed-off-by: Xin Long 
---
 net/sctp/sm_sideeffect.c | 16 +---
 1 file changed, 5 insertions(+), 11 deletions(-)

diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
index 12d4519..cf6e4f0 100644
--- a/net/sctp/sm_sideeffect.c
+++ b/net/sctp/sm_sideeffect.c
@@ -1020,19 +1020,13 @@ static void sctp_cmd_t1_timer_update(struct 
sctp_association *asoc,
  * This way the whole message is queued up and bundling if
  * encouraged for small fragments.
  */
-static int sctp_cmd_send_msg(struct sctp_association *asoc,
-   struct sctp_datamsg *msg, gfp_t gfp)
+static void sctp_cmd_send_msg(struct sctp_association *asoc,
+ struct sctp_datamsg *msg, gfp_t gfp)
 {
struct sctp_chunk *chunk;
-   int error = 0;
-
-   list_for_each_entry(chunk, >chunks, frag_list) {
-   error = sctp_outq_tail(>outqueue, chunk, gfp);
-   if (error)
-   break;
-   }
 
-   return error;
+   list_for_each_entry(chunk, >chunks, frag_list)
+   sctp_outq_tail(>outqueue, chunk, gfp);
 }
 
 
@@ -1709,7 +1703,7 @@ static int sctp_cmd_interpreter(sctp_event_t event_type,
sctp_outq_cork(>outqueue);
local_cork = 1;
}
-   error = sctp_cmd_send_msg(asoc, cmd->obj.msg, gfp);
+   sctp_cmd_send_msg(asoc, cmd->obj.msg, gfp);
break;
case SCTP_CMD_SEND_NEXT_ASCONF:
sctp_cmd_send_asconf(asoc);
-- 
2.1.0



[PATCHv2 net 1/6] sctp: remove the unnecessary state check in sctp_outq_tail

2016-09-13 Thread Xin Long
Data Chunks are only sent by sctp_primitive_SEND, in which sctp checks
the asoc's state through statetable before calling sctp_outq_tail. So
there's no need to check the asoc's state again in sctp_outq_tail.

Besides, sctp_do_sm is protected by lock_sock; even if sending a msg is
interrupted by timer events, the event processing still needs to acquire
lock_sock first. It means no other CMDs can be enqueued into the side
effect list before CMD_SEND_MSG to change asoc->state, so it's safe to
remove it.

This patch is to remove redundant asoc->state check from sctp_outq_tail.

Signed-off-by: Xin Long 
---
 net/sctp/outqueue.c | 53 ++---
 1 file changed, 14 insertions(+), 39 deletions(-)

diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index 72e54a4..da2418b 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -299,50 +299,25 @@ int sctp_outq_tail(struct sctp_outq *q, struct sctp_chunk 
*chunk, gfp_t gfp)
 * immediately.
 */
if (sctp_chunk_is_data(chunk)) {
-   /* Is it OK to queue data chunks?  */
-   /* From 9. Termination of Association
-*
-* When either endpoint performs a shutdown, the
-* association on each peer will stop accepting new
-* data from its user and only deliver data in queue
-* at the time of sending or receiving the SHUTDOWN
-* chunk.
-*/
-   switch (q->asoc->state) {
-   case SCTP_STATE_CLOSED:
-   case SCTP_STATE_SHUTDOWN_PENDING:
-   case SCTP_STATE_SHUTDOWN_SENT:
-   case SCTP_STATE_SHUTDOWN_RECEIVED:
-   case SCTP_STATE_SHUTDOWN_ACK_SENT:
-   /* Cannot send after transport endpoint shutdown */
-   error = -ESHUTDOWN;
-   break;
-
-   default:
-   pr_debug("%s: outqueueing: outq:%p, chunk:%p[%s])\n",
-__func__, q, chunk, chunk && chunk->chunk_hdr ?
-
sctp_cname(SCTP_ST_CHUNK(chunk->chunk_hdr->type)) :
-"illegal chunk");
-
-   sctp_chunk_hold(chunk);
-   sctp_outq_tail_data(q, chunk);
-   if (chunk->asoc->prsctp_enable &&
-   SCTP_PR_PRIO_ENABLED(chunk->sinfo.sinfo_flags))
-   chunk->asoc->sent_cnt_removable++;
-   if (chunk->chunk_hdr->flags & SCTP_DATA_UNORDERED)
-   SCTP_INC_STATS(net, SCTP_MIB_OUTUNORDERCHUNKS);
-   else
-   SCTP_INC_STATS(net, SCTP_MIB_OUTORDERCHUNKS);
-   break;
-   }
+   pr_debug("%s: outqueueing: outq:%p, chunk:%p[%s])\n",
+__func__, q, chunk, chunk && chunk->chunk_hdr ?
+sctp_cname(SCTP_ST_CHUNK(chunk->chunk_hdr->type)) :
+"illegal chunk");
+
+   sctp_chunk_hold(chunk);
+   sctp_outq_tail_data(q, chunk);
+   if (chunk->asoc->prsctp_enable &&
+   SCTP_PR_PRIO_ENABLED(chunk->sinfo.sinfo_flags))
+   chunk->asoc->sent_cnt_removable++;
+   if (chunk->chunk_hdr->flags & SCTP_DATA_UNORDERED)
+   SCTP_INC_STATS(net, SCTP_MIB_OUTUNORDERCHUNKS);
+   else
+   SCTP_INC_STATS(net, SCTP_MIB_OUTORDERCHUNKS);
} else {
list_add_tail(>list, >control_chunk_list);
SCTP_INC_STATS(net, SCTP_MIB_OUTCTRLCHUNKS);
}
 
-   if (error < 0)
-   return error;
-
if (!q->cork)
error = sctp_outq_flush(q, 0, gfp);
 
-- 
2.1.0



[PATCHv2 net 0/6] sctp: fix the transmit err process

2016-09-13 Thread Xin Long
This patchset is to improve the transmit err process and also fix some
issues.

After this patchset, once the chunks are enqueued successfully, even
if the chunks fail to be sent out, whether because of nodst or nomem,
no err is returned to users any more. Instead, they are taken care
of by retransmit.

v1->v2:
  - add more details to the changelog in patch 1/6
  - add Fixes: tag in patch 2/6, 3/6
  - also revert 69b5777f2e57 in patch 3/6

Xin Long (6):
  sctp: remove the unnecessary state check in sctp_outq_tail
  sctp: do not return the transmit err back to sctp_sendmsg
  sctp: free msg->chunks when sctp_primitive_SEND return err
  sctp: save transmit error to sk_err in sctp_outq_flush
  sctp: make sctp_outq_flush/tail/uncork return void
  sctp: not return ENOMEM err back in sctp_packet_transmit

 include/net/sctp/structs.h |  5 +--
 net/sctp/chunk.c   | 13 +++
 net/sctp/output.c  | 50 +-
 net/sctp/outqueue.c| 88 --
 net/sctp/sm_sideeffect.c   | 25 +
 net/sctp/socket.c  |  8 +++--
 6 files changed, 85 insertions(+), 104 deletions(-)

-- 
2.1.0



Re: [net-next PATCH v3 2/3] e1000: add initial XDP support

2016-09-13 Thread Alexei Starovoitov
On Tue, Sep 13, 2016 at 10:37:32AM -0700, Eric Dumazet wrote:
> On Tue, 2016-09-13 at 10:13 -0700, Alexei Starovoitov wrote:
> 
> > I'm afraid the point 'only for debugging' still didn't make it across.
> > xdp+e1k is for development (and debugging) of xdp-type of bpf
> > programs and _not_ for debugging of xdp itself, kernel or anything else.
> > The e1k provided interfaces and behavior needs to match exactly
> > what real hw nics (like mlx4, mlx5, ixgbe, i40e) will do.
> > Doing special hacks are not acceptable. Therefore your
> > 'proposed fix' misses the mark, since:
> > 1. ignoring bql/qdisc is not a bug, but the requirement
> > 2. such 'fix' goes against the goal above since behaviors will be
> > different and xdp developer won't be able to build something like
> > xdp loadbalancer in the kvm.
> > 
> 
> 
> Is e1k the only way a VM can receive and send packets ?
> 
> Instead of adding more cruft to a legacy driver, risking breaking real
> old machines, 

Agreed, that is a concern.

> I am sure we can find modern alternative.

I've looked through qemu and it appears to emulate only e1k and tg3.
The latter is still used in the field, so the risk of touching
it is higher.
The other alternative is virtio, but it doesn't have dma and/or pages,
so it looks like an even messier hack to me.
The last alternative considered was to invent xdp-only fake 'hw' nic,
but it's too much work to get it into qemu then ask the world
to upgrade qemu.
At that point I ran out of ideas and settled on hacking e1k :(
Not proud of this hack at all.



Re: [PATCH net 1/6] sctp: remove the unnecessary state check in sctp_outq_tail

2016-09-13 Thread Xin Long
>
> I also don't see an issue with this patch, btw.
>
> Xin, you may want to add more/such details to the changelog, specially
> about the timer versus primitive handling.
>
OK, I will post v2 of this patchset.


Re: [net-next PATCH v3 2/3] e1000: add initial XDP support

2016-09-13 Thread Tom Herbert
On Tue, Sep 13, 2016 at 10:13 AM, Alexei Starovoitov
 wrote:
> On Tue, Sep 13, 2016 at 09:21:47AM -0700, Tom Herbert wrote:
>> On Mon, Sep 12, 2016 at 6:28 PM, Alexei Starovoitov
>>  wrote:
>> > On Mon, Sep 12, 2016 at 05:03:25PM -0700, Tom Herbert wrote:
>> >> On Mon, Sep 12, 2016 at 4:46 PM, Eric Dumazet  
>> >> wrote:
>> >> > On Mon, 2016-09-12 at 16:07 -0700, Alexei Starovoitov wrote:
>> >> >
>> >> >> yep. there are various ways to shoot yourself in the foot with xdp.
>> >> >> The simplest program that drops all the packets will make the box 
>> >> >> unpingable.
>> >> >
>> >> > Well, my comment was about XDP_TX only, not about XDP_DROP or driving a
>> >> > scooter on 101 highway ;)
>> >> >
>> >> > This XDP_TX thing was one of the XDP marketing stuff, but there is
>> >> > absolutely no documentation on it, warning users about possible
>> >> > limitations/outcomes.
>> >> >
>> >> > BTW, I am not sure mlx4 implementation even works, vs BQL :
>> >> >
>> >> > mlx4_en_xmit_frame() does not call netdev_tx_sent_queue(),
>> >> > but tx completion will call netdev_tx_completed_queue() -> crash
>> >> >
>> >> > Do we have one test to validate that a XDP_TX implementation is actually
>> >> > correct ?
>> >> >
>> >> Obviously not for e1000 :-(. We really need some real test and
>> >> performance results and analysis on the interaction between the stack
>> >> data path and XDP data path.
>> >
>> > no. we don't need it for e1k and we cannot really do it.
>> >  this patch is for debugging of xdp programs only.
>> >
>> You can say this "only for a debugging" a thousand times and that
>> still won't justify putting bad code into the kernel. Material issues
>> have been raised with these patches, I have proposed a fix for one
>> core issue, and we have requested a lot more testing. So, please, if
>> you really want to move these patches forward start addressing the
>> concerns being raised by reviewers.
>
> I'm afraid the point 'only for debugging' still didn't make it across.
> xdp+e1k is for development (and debugging) of xdp-type of bpf
> programs and _not_ for debugging of xdp itself, kernel or anything else.
> The e1k provided interfaces and behavior needs to match exactly
> what real hw nics (like mlx4, mlx5, ixgbe, i40e) will do.
> Doing special hacks are not acceptable. Therefore your
> 'proposed fix' misses the mark, since:
> 1. ignoring bql/qdisc is not a bug, but the requirement

You don't seem to understand the problem. In the shared queue scenario
if one party (the stack) implements qdiscs, BQL, and such and the
other (XDP) just throws packets onto the queue then these are
incompatible behaviors and something will break. I suppose it's
possible that somehow this does not affect the stack path, but that
remains to be proven. In any case the patches under review look very
much like they break things; either a fix is needed or tests run to
show it's not a problem. Until this is resolved I am going to nack the
patch.

Tom

> 2. such 'fix' goes against the goal above since behaviors will be
> different and xdp developer won't be able to build something like
> xdp loadbalancer in the kvm.
>
> If you have other concerns please raise them or if you have
> suggestions on how to develop xdp programs without this e1k patch
> I would love to hear them.
> Alexander's review comments are discussed in separate thread.
>
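
The accounting mismatch raised in this thread (one path reports bytes
at completion time that were never reported at transmit time) can be
illustrated with a toy counter pair. This is a sketch under stated
assumptions, not the kernel's BQL code: struct toy_bql and both
helpers are hypothetical, and the model only tracks byte counts.

```c
#include <stdbool.h>

/* Toy model of byte-queue accounting on one tx queue. */
struct toy_bql {
	unsigned int num_queued;	/* bytes reported at transmit time */
	unsigned int num_completed;	/* bytes reported at completion time */
};

/* What a transmit path is expected to report when it posts a frame. */
void toy_tx_sent(struct toy_bql *q, unsigned int bytes)
{
	q->num_queued += bytes;
}

/*
 * What the completion path reports. Returns false when the completion
 * covers bytes that were never reported as sent, which is the mismatch
 * the quoted message expects to end in a crash.
 */
bool toy_tx_completed(struct toy_bql *q, unsigned int bytes)
{
	if (bytes > q->num_queued - q->num_completed)
		return false;
	q->num_completed += bytes;
	return true;
}
```

A frame sent by the stack path goes through both helpers and balances.
A frame put on the same queue by a path that skips toy_tx_sent() still
reaches toy_tx_completed(), and that call is the one that fails.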


Re: [net-next PATCH v3 2/3] e1000: add initial XDP support

2016-09-13 Thread Eric Dumazet
On Tue, 2016-09-13 at 10:13 -0700, Alexei Starovoitov wrote:

> I'm afraid the point 'only for debugging' still didn't make it across.
> xdp+e1k is for development (and debugging) of xdp-type of bpf
> programs and _not_ for debugging of xdp itself, kernel or anything else.
> The e1k provided interfaces and behavior needs to match exactly
> what real hw nics (like mlx4, mlx5, ixgbe, i40e) will do.
> Doing special hacks are not acceptable. Therefore your
> 'proposed fix' misses the mark, since:
> 1. ignoring bql/qdisc is not a bug, but the requirement
> 2. such 'fix' goes against the goal above since behaviors will be
> different and xdp developer won't be able to build something like
> xdp loadbalancer in the kvm.
> 


Is e1k the only way a VM can receive and send packets ?

Instead of adding more cruft to a legacy driver, risking breaking real
old machines, I am sure we can find modern alternative.





Re: [PATCH v5 0/6] Add eBPF hooks for cgroups

2016-09-13 Thread Pablo Neira Ayuso
On Tue, Sep 13, 2016 at 03:31:20PM +0200, Daniel Mack wrote:
> Hi,
> 
> On 09/13/2016 01:56 PM, Pablo Neira Ayuso wrote:
> > On Mon, Sep 12, 2016 at 06:12:09PM +0200, Daniel Mack wrote:
> >> This is v5 of the patch set to allow eBPF programs for network
> >> filtering and accounting to be attached to cgroups, so that they apply
> >> to all sockets of all tasks placed in that cgroup. The logic can
> >> also be extended to other cgroup based eBPF logic.
> > 
> > 1) This infrastructure can only be useful to systemd, or any similar
> >orchestration daemon. Look, you can only apply filtering policies
> >to processes that are launched by systemd, so this only works
> >for server processes.
> 
> Sorry, but both statements aren't true. The eBPF policies apply to every
> process that is placed in a cgroup, and my example program in 6/6 shows
> how that can be done from the command line.

Then you have to explain to me how anyone other than systemd can use
this infrastructure.

> Also, systemd is able to control userspace processes just fine, and
> it is not limited to 'server processes'.

My main point is that those processes *need* to be launched by the
orchestrator, which is what I was referring to as 'server processes'.

> > For client processes this infrastructure is
> >*racy*, you have to add new processes in runtime to the cgroup,
> >thus there will be some little time where no filtering policy
> >will be applied. For quality of service, this may be an acceptable
> >race, but this is aiming to deploy a filtering policy.
> 
> That's a limitation that applies to many more control mechanisms in the
> kernel, and it's something that can easily be solved with fork+exec.

As long as you have control to launch the processes yes, but this
will not work in other scenarios. Just like cgroup net_cls and friends
are broken for filtering for things that you have no control to
fork+exec.

To use this infrastructure from a non-launcher process, you'll have to
rely on the proc connection to subscribe to new process events, then
echo that pid to the cgroup, and that interface is asynchronous so
*adding new processes to the cgroup is subject to races*.

> > 2) This approach looks uninfrastructured to me. This provides a hook
> >to push a bpf blob at a place in the stack that deploys a filtering
> >policy that is not visible to others.
> 
> That's just as transparent as SO_ATTACH_FILTER. What kind of
> introspection mechanism do you have in mind?

SO_ATTACH_FILTER is called from the process itself, so this is a local
filtering policy that you apply to your own process.

In this case, this filtering policy is *global*, other processes with
similar capabilities can get just a bpf blob at best...

[...]
> >> After chatting with Daniel Borkmann and Alexei off-list, we concluded
> >> that __dev_queue_xmit() is the place where the egress hooks should live
> >> when eBPF programs need access to the L2 bits of the skb.
> > 
> > 3) This egress hook is coming very late, the only reason I find to
> >place it at __dev_queue_xmit() is that bpf naturally works with
> >layer 2 information in place. But this new hook is placed in
> >_everyone's output path_ that only works for the very specific
> >usecase I exposed above.
> 
> It's about filtering outgoing network packets of applications, and
> providing them with L2 information for filtering purposes. I don't think
> that's a very specific use-case.
> 
> When the feature is not used at all, the added costs on the output path
> are close to zero, due to the use of static branches.

*You're proposing a socket filtering facility that hooks layer 2
output path*!

[...]
> > I have nothing against systemd or the needs for more
> > programmability/flexibility in the stack, but I think this needs to
> > fulfill some requirements to fit into the infrastructure that we have
> > in the right way.
> 
> Well, as I explained already, this patch set results from endless
> discussions that went nowhere, about how such a thing can be achieved
> with netfilter.

That is only a rough ~30-line kernel patchset to support this in
netfilter and only one extra input hook, with potential access to
conntrack and better integration with other existing subsystems.
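
The fork+exec approach mentioned in the quoted reply (the launcher puts
the child into the cgroup before the target program starts, so the
program never runs outside it) can be sketched as follows. This is a
minimal illustration under assumptions: enter_cgroup() and
spawn_in_cgroup() are hypothetical helpers, and procs_path stands for
whatever membership file the cgroup exposes. It only covers processes
the caller launches itself, which is exactly the limitation argued
about above.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write the calling process's pid into a cgroup membership file. */
int enter_cgroup(const char *procs_path)
{
	FILE *f = fopen(procs_path, "w");

	if (!f)
		return -1;
	if (fprintf(f, "%ld\n", (long)getpid()) < 0) {
		fclose(f);
		return -1;
	}
	return fclose(f) == 0 ? 0 : -1;
}

/*
 * Fork, move the child into the cgroup, and only then exec the target.
 * Returns the child's wait status, or -1 if fork or waitpid fails.
 */
int spawn_in_cgroup(const char *procs_path, char *const argv[])
{
	pid_t pid = fork();
	int status;

	if (pid < 0)
		return -1;
	if (pid == 0) {
		if (enter_cgroup(procs_path) != 0)
			_exit(126);	/* could not join the cgroup */
		execvp(argv[0], argv);
		_exit(127);		/* exec failed */
	}
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return status;
}
```

A process that was started by someone else has no such window: it must
be moved after the fact, which is where the asynchronous notification
race described above comes in.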


[PATCH v3] net: ip, diag -- Add diag interface for raw sockets

2016-09-13 Thread Cyrill Gorcunov
In criu we are actively using the diag interface to collect sockets
present in the system when dumping applications. And while it works
as expected for unix, tcp, udp[lite], packet and netlink sockets,
raw sockets do not have such an interface. Thus add it.

v2:
 - add missing sock_put calls in raw_diag_dump_one (by eric.dumazet@)
 - implement @destroy for diag requests (by dsa@)

v3:
 - add export of raw_abort for IPv6 (by dsa@)
 - pass net-admin flag into inet_sk_diag_fill due to
   changes in net-next branch (by dsa@)

CC: David S. Miller 
CC: Eric Dumazet 
CC: David Ahern 
CC: Alexey Kuznetsov 
CC: James Morris 
CC: Hideaki YOSHIFUJI 
CC: Patrick McHardy 
CC: Andrey Vagin 
CC: Stephen Hemminger 
Signed-off-by: Cyrill Gorcunov 
---

 include/net/raw.h   |6 +
 include/net/rawv6.h |7 +
 net/ipv4/Kconfig|8 +
 net/ipv4/Makefile   |1 
 net/ipv4/raw.c  |   21 
 net/ipv4/raw_diag.c |  226 
 net/ipv6/raw.c  |7 +
 7 files changed, 272 insertions(+), 4 deletions(-)

Index: linux-ml.git/include/net/raw.h
===
--- linux-ml.git.orig/include/net/raw.h
+++ linux-ml.git/include/net/raw.h
@@ -23,6 +23,12 @@
 
 extern struct proto raw_prot;
 
+extern struct raw_hashinfo raw_v4_hashinfo;
+struct sock *__raw_v4_lookup(struct net *net, struct sock *sk,
+unsigned short num, __be32 raddr,
+__be32 laddr, int dif);
+
+int raw_abort(struct sock *sk, int err);
 void raw_icmp_error(struct sk_buff *, int, u32);
 int raw_local_deliver(struct sk_buff *, int);
 
Index: linux-ml.git/include/net/rawv6.h
===
--- linux-ml.git.orig/include/net/rawv6.h
+++ linux-ml.git/include/net/rawv6.h
@@ -3,6 +3,13 @@
 
 #include 
 
+extern struct raw_hashinfo raw_v6_hashinfo;
+struct sock *__raw_v6_lookup(struct net *net, struct sock *sk,
+unsigned short num, const struct in6_addr 
*loc_addr,
+const struct in6_addr *rmt_addr, int dif);
+
+int raw_abort(struct sock *sk, int err);
+
 void raw6_icmp_error(struct sk_buff *, int nexthdr,
u8 type, u8 code, int inner_offset, __be32);
 bool raw6_local_deliver(struct sk_buff *, int);
Index: linux-ml.git/net/ipv4/Kconfig
===
--- linux-ml.git.orig/net/ipv4/Kconfig
+++ linux-ml.git/net/ipv4/Kconfig
@@ -430,6 +430,14 @@ config INET_UDP_DIAG
  Support for UDP socket monitoring interface used by the ss tool.
  If unsure, say Y.
 
+config INET_RAW_DIAG
+   tristate "RAW: socket monitoring interface"
+   depends on INET_DIAG && (IPV6 || IPV6=n)
+   default n
+   ---help---
+ Support for RAW socket monitoring interface used by the ss tool.
+ If unsure, say Y.
+
 config INET_DIAG_DESTROY
bool "INET: allow privileged process to administratively close sockets"
depends on INET_DIAG
Index: linux-ml.git/net/ipv4/Makefile
===
--- linux-ml.git.orig/net/ipv4/Makefile
+++ linux-ml.git/net/ipv4/Makefile
@@ -40,6 +40,7 @@ obj-$(CONFIG_NETFILTER)   += netfilter.o n
 obj-$(CONFIG_INET_DIAG) += inet_diag.o 
 obj-$(CONFIG_INET_TCP_DIAG) += tcp_diag.o
 obj-$(CONFIG_INET_UDP_DIAG) += udp_diag.o
+obj-$(CONFIG_INET_RAW_DIAG) += raw_diag.o
 obj-$(CONFIG_NET_TCPPROBE) += tcp_probe.o
 obj-$(CONFIG_TCP_CONG_BIC) += tcp_bic.o
 obj-$(CONFIG_TCP_CONG_CDG) += tcp_cdg.o
Index: linux-ml.git/net/ipv4/raw.c
===
--- linux-ml.git.orig/net/ipv4/raw.c
+++ linux-ml.git/net/ipv4/raw.c
@@ -89,9 +89,10 @@ struct raw_frag_vec {
int hlen;
 };
 
-static struct raw_hashinfo raw_v4_hashinfo = {
+struct raw_hashinfo raw_v4_hashinfo = {
.lock = __RW_LOCK_UNLOCKED(raw_v4_hashinfo.lock),
 };
+EXPORT_SYMBOL_GPL(raw_v4_hashinfo);
 
 int raw_hash_sk(struct sock *sk)
 {
@@ -120,7 +121,7 @@ void raw_unhash_sk(struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(raw_unhash_sk);
 
-static struct sock *__raw_v4_lookup(struct net *net, struct sock *sk,
+struct sock *__raw_v4_lookup(struct net *net, struct sock *sk,
unsigned short num, __be32 raddr, __be32 laddr, int dif)
 {
sk_for_each_from(sk) {
@@ -136,6 +137,7 @@ static struct sock *__raw_v4_lookup(stru
 found:
return sk;
 }
+EXPORT_SYMBOL_GPL(__raw_v4_lookup);
 
 /*
  * 0 - deliver
@@ -918,6 +920,20 @@ static int compat_raw_ioctl(struct sock
 }
 #endif
 
+int raw_abort(struct sock *sk, int err)
+{
+   lock_sock(sk);
+
+   sk->sk_err = err;
+   sk->sk_error_report(sk);
+   

Re: [net-next PATCH v3 2/3] e1000: add initial XDP support

2016-09-13 Thread Alexei Starovoitov
On Tue, Sep 13, 2016 at 09:21:47AM -0700, Tom Herbert wrote:
> On Mon, Sep 12, 2016 at 6:28 PM, Alexei Starovoitov
>  wrote:
> > On Mon, Sep 12, 2016 at 05:03:25PM -0700, Tom Herbert wrote:
> >> On Mon, Sep 12, 2016 at 4:46 PM, Eric Dumazet  
> >> wrote:
> >> > On Mon, 2016-09-12 at 16:07 -0700, Alexei Starovoitov wrote:
> >> >
> >> >> yep. there are various ways to shoot yourself in the foot with xdp.
> >> >> The simplest program that drops all the packets will make the box 
> >> >> unpingable.
> >> >
> >> > Well, my comment was about XDP_TX only, not about XDP_DROP or driving a
> >> > scooter on 101 highway ;)
> >> >
> >> > This XDP_TX thing was one of the XDP marketing stuff, but there is
> >> > absolutely no documentation on it, warning users about possible
> >> > limitations/outcomes.
> >> >
> >> > BTW, I am not sure mlx4 implementation even works, vs BQL :
> >> >
> >> > mlx4_en_xmit_frame() does not call netdev_tx_sent_queue(),
> >> > but tx completion will call netdev_tx_completed_queue() -> crash
> >> >
> >> > Do we have one test to validate that a XDP_TX implementation is actually
> >> > correct ?
> >> >
> >> Obviously not for e1000 :-(. We really need some real test and
> >> performance results and analysis on the interaction between the stack
> >> data path and XDP data path.
> >
> > no. we don't need it for e1k and we cannot really do it.
> >  this patch is for debugging of xdp programs only.
> >
> You can say this "only for a debugging" a thousand times and that
> still won't justify putting bad code into the kernel. Material issues
> have been raised with these patches, I have proposed a fix for one
> core issue, and we have requested a lot more testing. So, please, if
> you really want to move these patches forward start addressing the
> concerns being raised by reviewers.

I'm afraid the point 'only for debugging' still didn't make it across.
xdp+e1k is for development (and debugging) of xdp-type of bpf
programs and _not_ for debugging of xdp itself, kernel or anything else.
The e1k provided interfaces and behavior needs to match exactly
what real hw nics (like mlx4, mlx5, ixgbe, i40e) will do.
Doing special hacks are not acceptable. Therefore your
'proposed fix' misses the mark, since:
1. ignoring bql/qdisc is not a bug, but the requirement
2. such 'fix' goes against the goal above since behaviors will be
different and xdp developer won't be able to build something like
xdp loadbalancer in the kvm.

If you have other concerns please raise them or if you have
suggestions on how to develop xdp programs without this e1k patch
I would love to hear them.
Alexander's review comments are discussed in separate thread.



Re: [PATCH 3/3] net-next: dsa: add new driver for qca8xxx family

2016-09-13 Thread Vivien Didelot
Hi Andrew,

Andrew Lunn  writes:

>> ok, i will simply subtract 1 from the phy_addr inside the mdio
>> callbacks. this would make the code more readable and make the DT
>> binding compliant with the ePAPR spec.
>
> It does however need well commenting. It is setting a trap for anybody
> who puts an external PHY on port 6. If they access that PHY via these
> functions, the address is off by one.
>
> This is the first silicon vendor who made their MDIO addresses for
> PHYs illogical. So i'm thinking we maybe should add a new function to
> dsa_switch_ops.
>
>   /* Return the MDIO address for the PHY for this port. */
> int (*phy_port_map)(struct dsa_switch *ds, int port);
>
> This should return the MDIO address for integrated PHYs only, or
> -ENODEV if the port does not have an integrated PHY. For an external
> PHY, a phy-handle should be used. This phy_port_map() is used in
> dsa_slave_phy_setup(). But dsa_slave_phy_setup() is already too
> complex, so it needs doing with care.

Note that some switch drivers *have to* register their slave MDIO bus
themselves (e.g. bcm_sf2). This becomes confusing with the DSA
phy_{read,write} ops.

Since the former alternative is preferred, we may want to remove the
latter soon from DSA. If this phy_port_map is needed for that case, it'd
be preferable not to add it.

Thanks,

Vivien
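
For reference, the phy_port_map() idea quoted above can be sketched as
a toy, outside of DSA. Everything here is hypothetical: struct
toy_switch_ops is not dsa_switch_ops, the callback takes only a port
number, and the port layout (integrated PHYs on ports 1..5 at MDIO
addresses 0..4, none on ports 0 and 6) is an assumption made for
illustration, chosen to match the "subtract 1" discussion.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-switch hook, modelled on the proposed phy_port_map(). */
struct toy_switch_ops {
	/* MDIO address of the integrated PHY, or -ENODEV if there is none. */
	int (*phy_port_map)(int port);
};

/* Assumed layout: ports 1..5 map to MDIO addresses 0..4. */
int toy_qca_phy_port_map(int port)
{
	if (port < 1 || port > 5)
		return -ENODEV;
	return port - 1;
}

const struct toy_switch_ops toy_qca_ops = {
	.phy_port_map = toy_qca_phy_port_map,
};

/* Generic caller: identity mapping when the switch provides no hook. */
int toy_port_to_mdio_addr(const struct toy_switch_ops *ops, int port)
{
	if (ops && ops->phy_port_map)
		return ops->phy_port_map(port);
	return port;
}
```

With this shape the off-by-one lives in one documented place, and a
port without an integrated PHY gets -ENODEV instead of a silently
shifted address.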


Re: [PATCH 3/3] net-next: dsa: add new driver for qca8xxx family

2016-09-13 Thread Florian Fainelli
On 09/13/2016 08:59 AM, Andrew Lunn wrote:
>> Hi Andrew,
>>
>> this function does indeed duplicate the functionality of
>> phy_ethtool_get_eee() with the small difference, that e->eee_active is
>> also set which phy_ethtool_get_eee() does not set.
>>
>> dsa_slave_get_eee() will call phy_ethtool_get_eee() right after the
>> get_eee() op has been called. would it be ok to move the code setting
>> eee_active to  phy_ethtool_get_eee().

Humm, AFAIR, the reason why eee_active is set outside of
phy_ethtool_set_eee() is because this is a MAC + PHY thing, both need to
agree and support that, and so while the PHY may be configured to have
EEE advertised and enabled, you also need to take care of the MAC
portion and enable EEE in there as well. Is not there such a thing for
the qca8k switch where the PHY needs to be configured through the
standard phylib calls, but the switch's transmitter/receiver also needs
to have EEE enabled?
-- 
Florian

