Re: PATCH: netdev: add a cast NLMSG_OK to avoid a GCC warning in users' code

2015-09-11 Thread D. Hugh Redelmeier
| From: David Miller 

| From: "D. Hugh Redelmeier" 
| Date: Wed, 9 Sep 2015 16:24:07 -0400 (EDT)

| > 1) netlink(3) says that the type of the second parameter is "int".
| >From the synopsis:
| >int NLMSG_OK(struct nlmsghdr *nlh, int len);
| >Surely then "int" should be appropriate.
| 
| Documentation can, and often is, wrong.  The code that has been there
| for more than two decades determines what the interface and semantics
| actually are.
| 
| Whatever is actually in the macro is what people have to accomodate
| and cope with.

That's pretty scary.  Especially since I've found that more than half
the files in Fedora that used the macor used a signed type.

Thanks for responding to the start of my message.  Here's the rest
again in case you missed it.

2) if you use the type "unsigned int" on a 32-bit machine, you get the 
   warning for an earlier conjunct:

#define NLMSG_OK(nlh,len) ((len) >= (int)sizeof(struct nlmsghdr) && \
   (nlh)->nlmsg_len >= sizeof(struct nlmsghdr) && \
   (nlh)->nlmsg_len <= (len))

   (len) >= (int)sizeof(struct nlmsghdr)  <=== unsigned >= int

3) on a 32-bit machine, size_t is likely "unsigned int" so the
   same problem as (2) should arise.

4) on a 64-bit machine with 64-bit ints, the same problems are likely.
   I don't have one to test on.

Casting to "short" or "unsigned short" works, but I don't know that
the value is guaranteed to fit in either of them.

| I'm not applying this, sorry.

Thanks for looking at this.  Could you reconsider?

An alternative would be to change netlink(3) to say that len is
of type ssize_t or size_t (there are arguments for each).  This would
be cleaner, but it would be an API change (at least in theory).  The
macro would still have to be changed, just in a different way.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] net: irda: pxaficp_ir: convert to readl and writel

2015-09-11 Thread Robert Jarzmik
Petr Cvek  writes:

>> Should have been posted to linux arm kernel mailing list, unless my mailer
>> failed ...
>> 
> Searching for:
>
>   "ARM: pxa: add resources to pxaficp_ir"
>
> did not found anything, same was for "ficp" in the 
> linux-arm-kernel/netdev/linux-kernel
> mailing list archive.
Ah ok, I'll resend it then.

>>> BTW This patch required update of my kernel repo. It seems that my:
>>>
>>> magician.c patches + ficp patch + new dma engine
>>>
>>> does not work for me at all. Kernel throws some panic about interrupts and 
>>> then
>>> it ends in an infinite stack dumping loop. Fault occurs before rootfs is
>>> mounted, so probably around MMC init (with removed SD card it fails normally
>>> with no rootfs found error).
>> Could you send me (privately) the stack you're getting please. This is 
>> something
>> I'd like to catch up early in the -rc releases.
>
> Well this will be problem as I cannot save anything to an SD card after and
> during the failure.  Only viable interfaces would be earlycon on an infraport
> or high speed camera on LCD :-).
Ah just as on my mioa701. I ended up soldering a JTAG cable :)

> But I was able to revert this commit:
>
>   6464b71409511939efce1ae4fb4ec6e3483b11b2mmc: pxamci: switch 
> over to dmaengine use
>
> and after that I am able to boot.
Okay. I'll try to reproduce this failure then. If I fail, well, before using the
JTAG cable, I used another trick: I was taking a movie from the LCD with a
smartphone, and it worked. It was an horrible thing to decrypt ... Let's hope
I'll be lucky on one of my platforms.

>> Now with your stack, could you also give me the upstream commit id of the 
>> tip of
>> the tree you're using (before your patches) please ?
>
> It is probably irelevant now, but for complete information:
>
> Discovered on my working repo: mainline 
> b8889c4fc6ba03e289cec6a4d692f6f080a55e53
> Still present on fresh downloaded: linux-next
> 22dc312d56ba077db27a9798b340e7d161f1df05
Ok, thanks.

>> And it is true I have not tested the rootfs special case, where drivers are 
>> not
>> yet initialized (and more specifically gpio and interrupt chip). Your 
>> backtrace
>> should tell me if you fall into this category of issues ... but I digress, 
>> this
>> has no link with pxaficp.
>
> Should I start new thread? (same bug can be present in the FICP too)
Yes, this pxamci bothers me, it deserves a thread.

> I will try to configure an initrd rootfs this should create more ways to save
> kernel log.
Great idea.

>
> Anyway after mmc dma revert I was still not able to start FICP. There is an 
> error:
>
>   Unable to handle kernel paging request at virtual address 32e4
>
> from pxa_irda_startup() and it seems it is caused by register definitions. 
> For example:
>
>   writel_relaxed((val), (irda)->stuart_base + (off));
>
> is called by
>
>   stuart_writel(si, 0, STIER);
>
> but STIER is not just an offset, but full register address:
>   
>   __REG(0x4074)
>
> So the definition should be changed, unless there is another patch I did not
> received (in that case, send me full patchset again please) :-).
Agreed, this is a bug in this patch. With this fix, is the pxaficp working or do
you need a bit more time to experiment ?

Cheers.

-- 
Robert
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC v2 net-next 3/4] ravb: Document binding for r8a7795 SoC

2015-09-11 Thread Geert Uytterhoeven
Hi Simon,

On Fri, Sep 11, 2015 at 4:01 AM, Simon Horman
 wrote:
> --- a/Documentation/devicetree/bindings/net/renesas,ravb.txt
> +++ b/Documentation/devicetree/bindings/net/renesas,ravb.txt
> @@ -6,8 +6,11 @@ interface contains.
>  Required properties:
>  - compatible: "renesas,etheravb-r8a7790" if the device is a part of R8A7790 
> SoC.
>   "renesas,etheravb-r8a7794" if the device is a part of R8A7794 
> SoC.
> + "renesas,etheravb-r8a7795" if the device is a part of R8A7795 
> SoC.
>  - reg: offset and length of (1) the register block and (2) the stream buffer.
> -- interrupts: interrupt specifier for the sole interrupt.
> +- interrupts: interrupt specifiers.
> + One for each entry in interrupt-names the R8A7795 SoC;

... for the R8A7795 SoC

> + One entry for a multiplexed interrupt otherwise.
>  - phy-mode: see ethernet.txt file in the same directory.
>  - phy-handle: see ethernet.txt file in the same directory.
>  - #address-cells: number of address cells for the MDIO bus, must be equal to 
> 1.
> @@ -18,6 +21,9 @@ Required properties:
>  Optional properties:
>  - interrupt-parent: the phandle for the interrupt controller that services
> interrupts for this device.
> +- interrupt-names: One entry per interrupt named "ch%u".
> +  For the R8A7795 SoC this property is mandatory,
> +  and "ch0" through "ch24" are mandatory.

This suggests the single multiplexed interrupt on R-Car Gen2 can be called
"ch0". Is that what you want? I know the driver doesn't care.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PATCH: netdev: add a cast NLMSG_OK to avoid a GCC warning in users' code

2015-09-11 Thread D. Hugh Redelmeier
| From: David Miller 

| Everyone either uses an unsigned type (such as "size_t") or adds an
| explicit cast to an unsinged type for the second argument.

That sounded odd to me.  So I looked through the Fedora code base, as
indexed by searchcode.com.  This took a stupid amount of time so I
didn't check my work.

Summary: 37 files use a signed type, 32 files use an unsigned type
(some are in each category!).

So your point is not correct.

It is amazing how many private definitions there are for NLMSG_OK.
They are mostly just copies but at least some leave out the int cast of
the result of sizeof.

The kernel's own version seems to be an inline function (with a
lower-case name).  It would be nice to replace the NLMSG_OK macro with
an inline function: this supports better type checking.  I don't know
if there are technical reasons not to do this.

Supporting data:

strongswan 
/strongswan-5.0.0/src/libhydra/plugins/kernel_netlink/kernel_netlink_ipsec.c
int, size_t, size_t, size_t, size_t, size_t

libnfnetlink /libnfnetlink-1.0.0/src/libnfnetlink.c
size_t, size_t, size_t, unsigned int

glibc /glibc-2.16-75f0d304/sysdeps/unix/sysv/linux/ifaddrs.c
arm-gp2x-linux-glibc 
/arm-gp2x-linux-glibc-2.3.6/glibc-2.3.6/sysdeps/unix/sysv/linux/ifaddrs.c
size_t, size_t, size_t

strongswan 
/strongswan-5.0.0/src/libhydra/plugins/kernel_netlink/kernel_netlink_net.c
int, size_t, size_t, size_t

fcoe-utils /fcoe-utils-1.0.23/fcoemon.c
int, int, int

uClibc /uClibc-0.9.32/libc/inet/ifaddrs.c
size_t, size_t, size_t

glibc /glibc-2.16-75f0d304/sysdeps/unix/sysv/linux/if_index.c
arm-gp2x-linux-glibc 
/arm-gp2x-linux-glibc-2.3.6/glibc-2.3.6/sysdeps/unix/sysv/linux/if_index.c
size_t, size_t

bootchart /bootchart2-0.14.0/collector/tasks-netlink.c
size_, size_t

netlabel_tools /netlabel_tools-0.19/libnetlabel/mod_mgmt.c
netlabel_tools /netlabel_tools-0.19/libnetlabel/mod_unlabeled.c
each has int, int

bird /bird-1.3.7/sysdep/linux/netlink/netlink.c
unsigned int, unsigned int

gtk-gnutella /gtk-gnutella-0.97.1/src/lib/getgateway.c
unsigned, unsigned

iproute /iproute2-3.4.0/lib/libnetlink.c
int, int

iproute /iproute2-3.4.0/misc/ss.c
int, ssize_t

lldpad /lldpad-0.9.44/lldp_dcbx_nl.c
unsigned

dlm /dlm-3.99.5/dlm_controld/netlink.c
int, int

gnet2 /gnet-2.0.8/src/usagi_ifaddrs.c
int, int

dnsmasq /dnsmasq-2.59/src/netlink.c
size_t (through cast of ssize_t value), size_t (through cast of ssize_t 
value)

net-snmp /net-snmp-5.7.1/agent/mibgroup/ip-mib/data_access/ipaddress_linux.c
int, int

net-snmp /net-snmp-5.7.1/agent/mibgroup/ip-mib/data_access/defaultrouter_linux.c
int

rb_libtorrent /libtorrent-rasterbar-0.16.2/src/enum_net.cpp
int, int

bootparamd /netkit-bootparamd-0.17/rpc.bootparamd/bootparam_default_route.c
u_int (through cast of int value), u_int (through cast of int value)

nss-myhostname /nss-myhostname-0.3/netlink.c
size_t (through cast of ssize_t value)

glibc /glibc-2.16-75f0d304/sysdeps/unix/sysv/linux/check_native.c
glibc /glibc-2.16-75f0d304/sysdeps/unix/sysv/linux/check_pf.c
size_t (through cast of ssize_t value)

audit /audit-2.2.1/lib/netlink.c
unsigned int (through cast of int)

rstp /rstp/libnetlink.c
int

fence-virt /fence-virt-0.3.0/common/ip_lookup.c
int

libpcap /libpcap-1.3.0/pcap-netfilter-linux.c
int

oidentd /oidentd-2.0.8/src/kernel/linux.c
has it's own copy of the definition of NLMSG_OK, without cast to int
size_t (cast of ssize_t value)

netlabel_tools /netlabel_tools-0.19/libnetlabel/netlabel_comm.c
int

netlabel_tools /netlabel_tools-0.19/libnetlabel/mod_cipsov4.c
int

keepalived /keepalived-1.2.2/keepalived/vrrp/vrrp_netlink.c
int

xen /xen-4.1.2/tools/python/xen/lowlevel/netlink/libnetlink.c
int

mipv6-daemon /mipv6-daemon-2.0.2.20110203bgit/libnetlink/libnetlink.c
size_t (cast of int value)

avahi /avahi-0.6.31/avahi-core/netlink.c
size_t (cast of ssize_t)

avahi /avahi-0.6.31/avahi-autoipd/iface-linux.c
size_t

libcgroup /libcgroup-0.38/src/daemon/cgrulesengd.c
size_t (but code would probably be more correct with ssize_t)

libnl /libnl-1.1/lib/nl.c
int

pl /pl-6.0.2/packages/tipc/tipcutils/tipc-config.c
int

mingw-glib2 /glib-2.33.1/gio/gnetworkmonitornetlink.c
size_t (cast of gssize value)

fcoe-utils /fcoe-utils-1.0.23/lib/rtnetlink.c
int

libnl3 /libnl-3.2.7/lib/nl.c
int

iscsi-initiator-utils /open-iscsi-2.0-872-rc4-bnx2i/usr/dcb_app.c
unsigned int (cast from int)

lldpad /lldpad-0.9.44/lldp_rtnl.c
unsigned

lldpad /lldpad-0.9.44/nltest.c
unsigned int (cast from int)

ebtables /ebtables-v2.0.10-4/examples/ulog/test_ulog.c
65536 (odd! bug?); counts as an int

wpa_supplicant 

Re: NFS/TCP/IPv6 acting strangely in 4.2

2015-09-11 Thread Russell King - ARM Linux
On Fri, Sep 11, 2015 at 05:24:17PM +0100, Russell King - ARM Linux wrote:
> On Fri, Sep 11, 2015 at 08:18:43AM -0700, Eric Dumazet wrote:
> > On Fri, 2015-09-11 at 16:06 +0100, Russell King - ARM Linux wrote:
> > > On Fri, Sep 11, 2015 at 03:33:47PM +0100, Russell King - ARM Linux wrote:
> > > > It looks like 0c78789e3a030615c6650fde89546cadf40ec2cc might be relevant
> > > > too, but I don't see that solving the multiple _concurrent_ connection
> > > > attempts with the same port number - presumably it's somehow trying to
> > > > make the same socket repeatedly connect despite a previous connection
> > > > being in progress, which would have nothing to do with cleaning up a
> > > > previous attempt.
> > > 
> > > As I suspected, applying the above commit in addition does not solve the
> > > problem, I still see the same behaviour: SYN SYNACK SYN RSTACK, SYN
> > > SYNACK SYN RSTACK, and eventual SYN storms.
> > > 
> > > I do have this captured as well:
> > > 
> > > 2558   0.834316 armada388 -> n2100 TCP [TCP Port numbers reused] 
> > > rpasswd→nfs [SYN] Seq=1053655487 Win=28800 Len=0 MSS=1440 SACK_PERM=1 
> > > TSval=60001 TSecr=0 WS=128
> > > 2559   0.834572 n2100 -> armada388 TCP nfs→rpasswd [SYN, ACK] 
> > > Seq=3076611574 Ack=1053655488 Win=28560 Len=0 MSS=1440 SACK_PERM=1 
> > > TSval=869622246 TSecr=60001 WS=64
> > > 2560   0.834666 armada388 -> n2100 TCP [TCP Port numbers reused] 
> > > rpasswd→nfs [SYN] Seq=1054228544 Win=28800 Len=0 MSS=1440 SACK_PERM=1 
> > > TSval=60005 TSecr=0 WS=128
> > > 2561   0.834895 n2100 -> armada388 TCP nfs→rpasswd [ACK] Seq=3076611575 
> > > Ack=1053655488 Win=28560 Len=0 TSval=869622246 TSecr=60001
> > > 
> > > The packet at 2561 looks wrong to me - this doesn't follow what I know
> > > would be the standard TCP setup of syn, synack, ack, because that final
> > > ack is in the wrong direction.
> > > 
> > 
> > This 2561 packet is an ACK packet, because n2100 has a SYN_RECV socket
> > created by packet 2558.
> > 
> > It receives a SYN packet (2560) that it interprets as a packet slightly
> > out of sequence (1054228544 being above 1053655487) for this SYN_RECV
> > 
> > The wrong packet is 2560, not 2561
> 
> Ok.
> 
> Looking deeper at the XPRT sunrpc code, I have to wonder about the
> sanity of this:
> 
> void xprt_connect(struct rpc_task *task)
> {
> ...
> if (!xprt_connected(xprt)) {
> ...
> if (test_bit(XPRT_CLOSING, >state))
> return;
> if (xprt_test_and_set_connecting(xprt))
> return;
> xprt->stat.connect_start = jiffies;
> xprt->ops->connect(xprt, task);
> 
> That calls into xs_connect(), which schedules a workqueue to do the
> connection.  The workqueue will call xs_tcp_setup_socket().
> 
> xs_tcp_setup_socket() creates a socket if one didn't exist, otherwise
> re-using the previously obtained socket (which'll be why its using the
> same socket) and then goes on to call xs_tcp_finish_connecting().
> 
> xs_tcp_finish_connecting() calls kernel_connect(), which will return
> -EINPROGRESS.  We seem to treat EINPROGRESS as if the connection was
> successful:
> 
> case 0:
> case -EINPROGRESS:
> case -EALREADY:
> xprt_unlock_connect(xprt, transport);
> xprt_clear_connecting(xprt);
> return;
> 
> and the xprt_clear_connecting() results in this whole path being
> re-opened: the socket is not yet connected, so xprt_connected() will
> return false, and despite the socket connection still being mid-way
> through being connected, we clear the "connecting" status, causing
> xprt_test_and_set_connecting() to return false.
> 
> That allows us to re-call xprt->ops->connect, re-queue the connect
> worker, and re-run the call to kernel_connect() for a socket which is
> already mid-way through being connected.
> 
> Shouldn't the "connecting" status only be cleared when either the socket
> has _finished_ connecting, or when the connection has _failed_ to connect,
> and not when it's mid-way through connecting?
> 
> I've not been able to prove this: I've set rpc_debug to 129 to log
> just xprt and trans RPC facilities, and that's sufficient to change
> the timing such that this doesn't happen.

Following that idea, I just tried the patch below, and it seems to work.
I don't know whether it handles all cases after a call to kernel_connect(),
but it stops the multiple connection attempts:

  1   0.00 armada388 -> n2100 TCP 1009→nfs [SYN] Seq=3794066539 Win=28560 
Len=0 MSS=1440 SACK_PERM=1 TSval=15712 TSecr=870317691 WS=128
  2   0.000414 n2100 -> armada388 TCP nfs→1009 [SYN, ACK] Seq=1884476522 
Ack=3794066540 Win=28560 Len=0 MSS=1440 SACK_PERM=1 TSval=870318939 TSecr=15712 
WS=64
  3   0.000787 armada388 -> n2100 TCP 1009→nfs [ACK] Seq=3794066540 
Ack=1884476523 Win=28672 Len=0 TSval=15712 TSecr=870318939
  4   0.001304 armada388 -> n2100 NFS V3 ACCESS Call, FH: 0x905379cc, [Check: 
RD LU MD 

[PATCH v3 net 2/5] ipv6: Rename the dst_cache helper functions in ip6_tunnel

2015-09-11 Thread Martin KaFai Lau
It is a prep work to fix the dst_entry refcnt bugs in
ip6_tunnel.

This patch rename:
1. ip6_tnl_dst_check() to ip6_tnl_dst_get() to better
   reflect that it will take a dst refcnt in the next patch.
2. ip6_tnl_dst_store() to ip6_tnl_dst_set() to have a more
   conventional name matching with ip6_tnl_dst_get().

Signed-off-by: Martin KaFai Lau 
---
 include/net/ip6_tunnel.h |  4 ++--
 net/ipv6/ip6_gre.c   |  4 ++--
 net/ipv6/ip6_tunnel.c| 12 ++--
 3 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index b8529aa..979b081 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -60,9 +60,9 @@ struct ipv6_tlv_tnl_enc_lim {
__u8 encap_limit;   /* tunnel encapsulation limit   */
 } __packed;
 
-struct dst_entry *ip6_tnl_dst_check(struct ip6_tnl *t);
+struct dst_entry *ip6_tnl_dst_get(struct ip6_tnl *t);
 void ip6_tnl_dst_reset(struct ip6_tnl *t);
-void ip6_tnl_dst_store(struct ip6_tnl *t, struct dst_entry *dst);
+void ip6_tnl_dst_set(struct ip6_tnl *t, struct dst_entry *dst);
 int ip6_tnl_rcv_ctl(struct ip6_tnl *t, const struct in6_addr *laddr,
const struct in6_addr *raddr);
 int ip6_tnl_xmit_ctl(struct ip6_tnl *t, const struct in6_addr *laddr,
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index af60d46..24f5dd8 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -634,7 +634,7 @@ static netdev_tx_t ip6gre_xmit2(struct sk_buff *skb,
}
 
if (!fl6->flowi6_mark)
-   dst = ip6_tnl_dst_check(tunnel);
+   dst = ip6_tnl_dst_get(tunnel);
 
if (!dst) {
ndst = ip6_route_output(net, NULL, fl6);
@@ -763,7 +763,7 @@ static netdev_tx_t ip6gre_xmit2(struct sk_buff *skb,
 
ip6tunnel_xmit(NULL, skb, dev);
if (ndst)
-   ip6_tnl_dst_store(tunnel, ndst);
+   ip6_tnl_dst_set(tunnel, ndst);
return 0;
 tx_err_link_failure:
stats->tx_carrier_errors++;
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index b0ab420..599b0b4 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -126,7 +126,7 @@ static struct net_device_stats *ip6_get_stats(struct 
net_device *dev)
  * Locking : hash tables are protected by RCU and RTNL
  */
 
-struct dst_entry *ip6_tnl_dst_check(struct ip6_tnl *t)
+struct dst_entry *ip6_tnl_dst_get(struct ip6_tnl *t)
 {
struct dst_entry *dst = t->dst_cache;
 
@@ -139,7 +139,7 @@ struct dst_entry *ip6_tnl_dst_check(struct ip6_tnl *t)
 
return dst;
 }
-EXPORT_SYMBOL_GPL(ip6_tnl_dst_check);
+EXPORT_SYMBOL_GPL(ip6_tnl_dst_get);
 
 void ip6_tnl_dst_reset(struct ip6_tnl *t)
 {
@@ -148,14 +148,14 @@ void ip6_tnl_dst_reset(struct ip6_tnl *t)
 }
 EXPORT_SYMBOL_GPL(ip6_tnl_dst_reset);
 
-void ip6_tnl_dst_store(struct ip6_tnl *t, struct dst_entry *dst)
+void ip6_tnl_dst_set(struct ip6_tnl *t, struct dst_entry *dst)
 {
struct rt6_info *rt = (struct rt6_info *) dst;
t->dst_cookie = rt6_get_cookie(rt);
dst_release(t->dst_cache);
t->dst_cache = dst;
 }
-EXPORT_SYMBOL_GPL(ip6_tnl_dst_store);
+EXPORT_SYMBOL_GPL(ip6_tnl_dst_set);
 
 /**
  * ip6_tnl_lookup - fetch tunnel matching the end-point addresses
@@ -1010,7 +1010,7 @@ static int ip6_tnl_xmit2(struct sk_buff *skb,
memcpy(>daddr, addr6, sizeof(fl6->daddr));
neigh_release(neigh);
} else if (!fl6->flowi6_mark)
-   dst = ip6_tnl_dst_check(t);
+   dst = ip6_tnl_dst_get(t);
 
if (!ip6_tnl_xmit_ctl(t, >saddr, >daddr))
goto tx_err_link_failure;
@@ -1102,7 +1102,7 @@ static int ip6_tnl_xmit2(struct sk_buff *skb,
ipv6h->daddr = fl6->daddr;
ip6tunnel_xmit(NULL, skb, dev);
if (ndst)
-   ip6_tnl_dst_store(t, ndst);
+   ip6_tnl_dst_set(t, ndst);
return 0;
 tx_err_link_failure:
stats->tx_carrier_errors++;
-- 
2.5.1

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[fw filter]: Broken! fw mark based tc class selection not working

2015-09-11 Thread Akshat Kakkar
Recently I came to know that,
Without any options fw classifier maps fwmark to classid.

tc filter add dev  parent  protocol ip prio 1 fw

i.e. if my packet has mark(0x10001) and class id is not set,
then above tc filter, will set class id = 0x10001 i.e. 1:1

But when I am trying it out, its not working!
I am having class 1:1 defined but its not at all hit.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: PATCH: netdev: add a cast NLMSG_OK to avoid a GCC warning in users' code

2015-09-11 Thread D. Hugh Redelmeier
| From: David Laight 

| From: D. Hugh Redelmeier

| > #define NLMSG_OK(nlh,len) ((len) >= (int)sizeof(struct nlmsghdr) && \
| >(nlh)->nlmsg_len >= sizeof(struct nlmsghdr) && \
| >(nlh)->nlmsg_len <= (len))

| Why not cast (nlh)->nl_msg_len instead?
| Perhaps:
|   (typeof (len))(nlh)->nlmsg_len <= (len)
| which is almost certainly safe unless 'len' is 'signed char'.

(Nit pick: it doesn't feel safe to cast to short either.)

You would also want to adjust the first compare to
(len) >= (typeof (len))sizeof(struct nlmsghdr)
Remember that, on 32-bit machines, with the current macro, GCC gives a
warning on the first compare if len is unsigned.

These casts together might cover up some weird typing errors, like
using char * or double.

This doesn't seem like a bad fix.  It probably doesn't break any
currently working code unless something depended on one of the
surprising conversions.

I think that this change is better than the status quo but
isn't the best place to get to.


The most common source of the value is from a call to one of the
recv(2) functions.  These functions return a ssize_t.  In particular,
the error indicator is -1.

=> The error indicator -1 is only handled correctly by NLMSG_OK if the
   first compare is a signed compare.

The right change is to force the type to be ssize_t.  This should
break no currently working code (unless it is accidentally working).
After all, no practical length would be positive in size_t and negative in
ssize_t.

Using "int" instead of ssize_t would work on all Linux machines that I
know and has the merit of matching the documentation, so that is the
code change I suggested.

| Or subtract the two values and compare against zero?

If it is an unsigned subtract, the result cannot be negative, so that
doesn't work.

If it is a signed subtract, you have just the same problems as a
signed compare, plus you can get overflow (undefined in C, but
probably not in Linux).

So this isn't a win.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: NFS/TCP/IPv6 acting strangely in 4.2

2015-09-11 Thread Russell King - ARM Linux
On Fri, Sep 11, 2015 at 08:18:43AM -0700, Eric Dumazet wrote:
> On Fri, 2015-09-11 at 16:06 +0100, Russell King - ARM Linux wrote:
> > On Fri, Sep 11, 2015 at 03:33:47PM +0100, Russell King - ARM Linux wrote:
> > > It looks like 0c78789e3a030615c6650fde89546cadf40ec2cc might be relevant
> > > too, but I don't see that solving the multiple _concurrent_ connection
> > > attempts with the same port number - presumably it's somehow trying to
> > > make the same socket repeatedly connect despite a previous connection
> > > being in progress, which would have nothing to do with cleaning up a
> > > previous attempt.
> > 
> > As I suspected, applying the above commit in addition does not solve the
> > problem, I still see the same behaviour: SYN SYNACK SYN RSTACK, SYN
> > SYNACK SYN RSTACK, and eventual SYN storms.
> > 
> > I do have this captured as well:
> > 
> > 2558   0.834316 armada388 -> n2100 TCP [TCP Port numbers reused] 
> > rpasswd→nfs [SYN] Seq=1053655487 Win=28800 Len=0 MSS=1440 SACK_PERM=1 
> > TSval=60001 TSecr=0 WS=128
> > 2559   0.834572 n2100 -> armada388 TCP nfs→rpasswd [SYN, ACK] 
> > Seq=3076611574 Ack=1053655488 Win=28560 Len=0 MSS=1440 SACK_PERM=1 
> > TSval=869622246 TSecr=60001 WS=64
> > 2560   0.834666 armada388 -> n2100 TCP [TCP Port numbers reused] 
> > rpasswd→nfs [SYN] Seq=1054228544 Win=28800 Len=0 MSS=1440 SACK_PERM=1 
> > TSval=60005 TSecr=0 WS=128
> > 2561   0.834895 n2100 -> armada388 TCP nfs→rpasswd [ACK] Seq=3076611575 
> > Ack=1053655488 Win=28560 Len=0 TSval=869622246 TSecr=60001
> > 
> > The packet at 2561 looks wrong to me - this doesn't follow what I know
> > would be the standard TCP setup of syn, synack, ack, because that final
> > ack is in the wrong direction.
> > 
> 
> This 2561 packet is an ACK packet, because n2100 has a SYN_RECV socket
> created by packet 2558.
> 
> It receives a SYN packet (2560) that it interprets as a packet slightly
> out of sequence (1054228544 being above 1053655487) for this SYN_RECV
> 
> The wrong packet is 2560, not 2561

Ok.

Looking deeper at the XPRT sunrpc code, I have to wonder about the
sanity of this:

void xprt_connect(struct rpc_task *task)
{
...
if (!xprt_connected(xprt)) {
...
if (test_bit(XPRT_CLOSING, >state))
return;
if (xprt_test_and_set_connecting(xprt))
return;
xprt->stat.connect_start = jiffies;
xprt->ops->connect(xprt, task);

That calls into xs_connect(), which schedules a workqueue to do the
connection.  The workqueue will call xs_tcp_setup_socket().

xs_tcp_setup_socket() creates a socket if one didn't exist, otherwise
re-using the previously obtained socket (which'll be why its using the
same socket) and then goes on to call xs_tcp_finish_connecting().

xs_tcp_finish_connecting() calls kernel_connect(), which will return
-EINPROGRESS.  We seem to treat EINPROGRESS as if the connection was
successful:

case 0:
case -EINPROGRESS:
case -EALREADY:
xprt_unlock_connect(xprt, transport);
xprt_clear_connecting(xprt);
return;

and the xprt_clear_connecting() results in this whole path being
re-opened: the socket is not yet connected, so xprt_connected() will
return false, and despite the socket connection still being mid-way
through being connected, we clear the "connecting" status, causing
xprt_test_and_set_connecting() to return false.

That allows us to re-call xprt->ops->connect, re-queue the connect
worker, and re-run the call to kernel_connect() for a socket which is
already mid-way through being connected.

Shouldn't the "connecting" status only be cleared when either the socket
has _finished_ connecting, or when the connection has _failed_ to connect,
and not when it's mid-way through connecting?

I've not been able to prove this: I've set rpc_debug to 129 to log
just xprt and trans RPC facilities, and that's sufficient to change
the timing such that this doesn't happen.

-- 
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ebpf: emit correct src_reg for conditional jumps

2015-09-11 Thread Daniel Borkmann

On 09/11/2015 05:50 PM, Tycho Andersen wrote:

On Fri, Sep 11, 2015 at 08:40:43AM -0700, Alexei Starovoitov wrote:

On Fri, Sep 11, 2015 at 11:28:24AM +0200, Daniel Borkmann wrote:

[off topic for this patch]

... this requirement also breaks down for cases where you have a single
classic BPF instruction that maps into 2 or more eBPF instructions, hitting
BPF_MAXINSNS early at the time when you try to call into bpf(2) again with
the dumped result. If I recall correctly, Chrome seems to use up quite a lot
of insns space on (classic) seccomp-BPF, so this could potentially be a real
issue, next to artificially crafted examples that would fail.

With regards to the latter, also classic programs that could have holes of
dead code where you jump over them (see lib/test_bpf.c for examples) are
unfortunately allowed on the ABI side, so while the classic -> eBPF converter
may translate this dead region 1:1, it will be rejected by the verifier when
you dump and try to reload that. ;) Anyway, it's perhaps a silly example, but
it shows that this use-case can only work on a 'best-effort' basis, and not
cover everything of the classic BPF ABI, as opposed to having an interface
that can dump/restore classic BPF instructions directly.

Do you need this for checkpoint/restore? Wouldn't this therefore be better
done as dump/restore classic<->classic and eBPF<->eBPF directly? In socket
filters we do this by keeping orig_prog around, I guess it's better to bite
the bullet of additional memory overhead in case of classic seccomp-BPF, too.


I don't think so.
When I played with libseccomp and chrome I saw that browser installed
several bpf programs for every new tab. The longest program was 275
classic insns which translated to slightly more eBPF insns
(because in eBPF we don't have < and <= instructions, so converter
needs to emit extra jump insns)


Yes, these cases, and also exit code translates into two and you have a
preamble moving ctx to arg1 for every classic program. So it should be
slightly more overall, depending on the program structure.


Also they don't produce unreachable code.
So getting over 4k limit and unreachable are rather hypothetical
problems. I wouldn't want to have two interfaces to criu seccomp.


Thinking out loud, is there such a use-case where you checkpoint your
application on kernel X (that allows, say, to dump /and/ inject classic
seccomp insns) and restore it elsewhere on kernel Y, where Y is older
than X (f.e. it can easily inject classic insns on seccomp as it's
present for some time, but /not/ dump). I guess that could be ignored
as you couldn't move it away from there w/o dumping insns then? Also,
if the *whole* environment is even to a little degree non-homogeneous,
then your own seccomp rules can already kill you. ;)


eBPF is going to be used by seccomp as well, so having two is extra
burden for user space criu.


I don't know how much burden it actually is for criu, it for sure already
would need to do exactly this for eBPF socket filters. If I could choose
between adding extra complexity on user space or kernel space, I would
choose user space when possible.


I think the burden is not so huge once we have eBPF c/r in place (we
could just check the classic flag first, then dump the eBPF version or
something). However, it doesn't seem ideal to have to support two
interfaces.


If we start hitting 4k limit for eBPF, we can easily bump it
and/or add < and <= insns to eBPF (which was on my todo list anyway).


I think bumping BPF_MAXINSNS seems fragile wrt user space breakage, it's
at least unclear to me whether applications interacting with the kernel
depend on this in some [even weird] way. I'd probably go for the <, <=.


I was hoping you might say this (assuming you mean add < BPF_MAXINSNS
to the converter).

The dead code is certainly a problem, but perhaps we can wait on this
until there become some critical application that has this issue. I
was hoping we could get away without extra memory usage on the kernel
side (indeed, this patchset is mostly to try and avoid that).

Tycho



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: v2 of seccomp filter c/r patches

2015-09-11 Thread Andy Lutomirski
On Fri, Sep 11, 2015 at 9:30 AM, Andy Lutomirski  wrote:
> On Sep 10, 2015 5:22 PM, "Tycho Andersen"  
> wrote:
>>
>> Hi all,
>>
>> Here is v2 of the seccomp filter c/r set. The patch notes have individual
>> changes from the last series, but there are two points not noted:
>>
>> * The series still does not allow us to correctly restore state for programs
>>   that will use SECCOMP_FILTER_FLAG_TSYNC in the future. Given that we want 
>> to
>>   keep seccomp_filter's identity, I think something along the lines of 
>> another
>>   seccomp command like SECCOMP_INHERIT_PARENT is needed (although I'm not 
>> sure
>>   if this can even be done yet). In addition, we'll need a kcmp command for
>>   figuring out if filters are the same, although this too needs to compare
>>   seccomp_filter objects, so it's a little screwy. Any thoughts on how to do
>>   this nicely are welcome.
>
> Let's add a concept of a seccompfd.
>
> For background of what I want to add: I want to be able to create a
> seccomp monitor.  A seccomp monitor will be, logically, a pair of a
> struct file that represents the monitor and a seccomp_filter that is
> controlled by the monitor.  Depending on flags, whoever holds the
> monitor fd could change the active filter, intercept syscalls, and
> issue syscalls on behalf of a process that is trapped in an
> intercepted syscall.
>
> Seccomp filters would nest properly.
>
> The interface would probably be (extremely pseudocoded):
>
> monitor_fd, filter_fd = seccomp(CREATE_MONITOR, flags, ...);
>
> Then, later:
>
> seccomp(ATTACH_TO_FILTER, filter_fd);  /* now filtered */
>
> read(monitor_fd, buf, size); /* returns an intercepted syscall */
> write(monitor_fd, buf, size); /* issues a syscall or releases the
> trapped task */
>
> This can't be implemented on x86 without either going insane or
> finishing the massive set of pending cleanups to the x86 entry code.
> I favor the latter.
>
> We could, however, add part of it right now: we could have a way to
> create a filterfd, we could add kcmp support for it, and we could add
> the ATTACH_TO_FILTER thing.  I think that would solve your problem.
>
> One major open question: does a filter_fd know what its parent is and,
> if so, will it just refuse to attach if the caller's parent is wrong?
> Or will a filter_fd attach anywhere.
>

Let me add one more thought:

Currently, struct seccomp_filter encodes a strict tree hierarchy: it
knows what its parent is.  This only matters as an implementation
detail and because TSYNC checks for seccomp_filter equality.

We could change this without user-visible effects.  We could say that,
for TSYNC purposes, two filter states match if they contain exactly
the same layers in the same order where a layer does *not* encode a
concept of parent.  We could then say that attaching a classic bpf
filter creates a branch new layer that is not equal to any other layer
that's been created.

This has no effect whatsoever.  The difference would be that we could
declare that attaching the same ebpf program twice creates the *same*
layer so that, if you fork and both children attach the same ebpf
program, then they match for TSYNC purposes.  Similarly, attaching the
same hypothetical filterfd would create the same layer.

Thoughts?

--Andy
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] x86: Wire up 32-bit direct socket calls

2015-09-11 Thread Andy Lutomirski
On Fri, Sep 11, 2015 at 3:14 AM, Arnd Bergmann  wrote:
> On Friday 11 September 2015 11:54:50 Geert Uytterhoeven wrote:
>> To make sure I don't miss any (it seems I missed recvmmsg and sendmmsg for
>> the socketcall case, sigh), this is the list of ipc syscalls to implement?
>>
>> sys_msgget
>> sys_msgctl
>> sys_msgrcv
>> sys_msgsnd
>> sys_semget
>> sys_semctl
>> sys_semtimedop
>> sys_shmget
>> sys_shmctl
>> sys_shmat
>> sys_shmdt
>>
>> sys_semop() seems to be unneeded because it can be implemented using
>> sys_semtimedop()?
>>
>
> Yes, that list looks right. IPC also includes a set of six sys_mq_*
> call, but I believe that everyone already has those as they are not
> covered by sys_ipc.
>
> For y2038 compatibility, we will likely add a new variant of
> semtimedop that takes a 64-bit timespec. While the argument passed
> there is a relative time that will never need to be longer than 68
> years, we need to accommodate user space that defines timespec
> in a sane way, and converting the argument in libc would be awkward.
>

I missed sys_ipc entirely.

Ingo, Thomas, want to just wire those up, too?  I can send a patch
next week, but it'll be as trivial as the socket one.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net] bridge: fix igmpv3 / mldv2 report parsing

2015-09-11 Thread Linus Lüssing
With the newly introduced helper functions the skb pulling is hidden in
the checksumming function - and undone before returning to the caller.

The IGMPv3 and MLDv2 report parsing functions in the bridge still
assumed that the skb is pointing to the beginning of the IGMP/MLD
message while it is now kept at the beginning of the IPv4/6 header,
breaking the message parsing and creating packet loss.

Fixing this by taking the offset between IP and IGMP/MLD header into
account, too.

Fixes: 9afd85c9e455 ("net: Export IGMP/MLD message validation code")
Reported-by: Tobias Powalowski 
Tested-by: Tobias Powalowski 
Signed-off-by: Linus Lüssing 
---
 net/bridge/br_multicast.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index 66efdc2..480b3de 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -1006,7 +1006,7 @@ static int br_ip4_multicast_igmp3_report(struct 
net_bridge *br,
 
ih = igmpv3_report_hdr(skb);
num = ntohs(ih->ngrec);
-   len = sizeof(*ih);
+   len = skb_transport_offset(skb) + sizeof(*ih);
 
for (i = 0; i < num; i++) {
len += sizeof(*grec);
@@ -1067,7 +1067,7 @@ static int br_ip6_multicast_mld2_report(struct net_bridge 
*br,
 
icmp6h = icmp6_hdr(skb);
num = ntohs(icmp6h->icmp6_dataun.un_data16[1]);
-   len = sizeof(*icmp6h);
+   len = skb_transport_offset(skb) + sizeof(*icmp6h);
 
for (i = 0; i < num; i++) {
__be16 *nsrcs, _nsrcs;
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Net: core: datagram.c coding style fixes

2015-09-11 Thread Krzysztof Majzerowicz-Jaszcz
Fixed several coding style issues reported by checkpatch.pl

Signed-off-by: Krzysztof Majzerowicz-Jaszcz 
---
 net/core/datagram.c | 89 ++---
 1 file changed, 51 insertions(+), 38 deletions(-)

diff --git a/net/core/datagram.c b/net/core/datagram.c
index 617088a..b284a6d 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -9,7 +9,8 @@
  * identical recvmsg() code. So we share it here. The poll was
  * shared before but buried in udp.c so I moved it.
  *
- * Authors:Alan Cox . (datagram_poll() 
from old
+ * Authors:
+ * Alan Cox . (datagram_poll() from old
  *  udp.c code)
  *
  * Fixes:
@@ -36,7 +37,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
@@ -60,29 +61,24 @@
 #include 
 #include 
 
-/*
- * Is a socket 'connection oriented' ?
- */
+/* Is a socket 'connection oriented' ? */
 static inline int connection_based(struct sock *sk)
 {
return sk->sk_type == SOCK_SEQPACKET || sk->sk_type == SOCK_STREAM;
 }
 
-static int receiver_wake_function(wait_queue_t *wait, unsigned int mode, int 
sync,
- void *key)
+static int receiver_wake_function(wait_queue_t *wait, unsigned int mode,
+ int sync, void *key)
 {
unsigned long bits = (unsigned long)key;
 
-   /*
-* Avoid a wakeup if event not interesting for us
-*/
+/* Avoid a wakeup if event not interesting for us */
if (bits && !(bits & (POLLIN | POLLERR)))
return 0;
return autoremove_wake_function(wait, mode, sync, key);
 }
-/*
- * Wait for the last received packet to be different from skb
- */
+
+/* Wait for the last received packet to be different from skb */
 static int wait_for_more_packets(struct sock *sk, int *err, long *timeo_p,
 const struct sk_buff *skb)
 {
@@ -198,8 +194,9 @@ struct sk_buff *__skb_recv_datagram(struct sock *sk, 
unsigned int flags,
struct sk_buff *skb, *last;
unsigned long cpu_flags;
long timeo;
-   /*
-* Caller is allowed not to check sk->sk_err before skb_recv_datagram()
+
+   /* Caller is allowed not to check sk->sk_err before
+* skb_recv_datagram()
 */
int error = sock_error(sk);
 
@@ -235,9 +232,9 @@ struct sk_buff *__skb_recv_datagram(struct sock *sk, 
unsigned int flags,
goto unlock_err;
 
atomic_inc(>users);
-   } else
+   } else {
__skb_unlink(skb, queue);
-
+   }
spin_unlock_irqrestore(>lock, cpu_flags);
*off = _off;
return skb;
@@ -367,7 +364,8 @@ int skb_copy_datagram_iter(const struct sk_buff *skb, int 
offset,
copy = len;
if (copy_to_iter(skb->data + offset, copy, to) != copy)
goto short_copy;
-   if ((len -= copy) == 0)
+   len -= copy;
+   if (len == 0)
return 0;
offset += copy;
}
@@ -380,14 +378,16 @@ int skb_copy_datagram_iter(const struct sk_buff *skb, int 
offset,
WARN_ON(start > offset + len);
 
end = start + skb_frag_size(frag);
-   if ((copy = end - offset) > 0) {
+   copy = end - offset;
+   if (copy > 0) {
if (copy > len)
copy = len;
if (copy_page_to_iter(skb_frag_page(frag),
  frag->page_offset + offset -
  start, copy, to) != copy)
goto short_copy;
-   if (!(len -= copy))
+   len -= copy;
+   if (len == 0)
return 0;
offset += copy;
}
@@ -400,13 +400,15 @@ int skb_copy_datagram_iter(const struct sk_buff *skb, int 
offset,
WARN_ON(start > offset + len);
 
end = start + frag_iter->len;
-   if ((copy = end - offset) > 0) {
+   copy = end - offset;
+   if (copy > 0) {
if (copy > len)
copy = len;
if (skb_copy_datagram_iter(frag_iter, offset - start,
   to, copy))
goto fault;
-   if ((len -= copy) == 0)
+   len -= copy;
+   if (len == 0)

Re: [PATCH v2 2/5] seccomp: make underlying bpf ref counted as well

2015-09-11 Thread Tycho Andersen
On Fri, Sep 11, 2015 at 06:03:59PM +0200, Daniel Borkmann wrote:
> On 09/11/2015 04:44 PM, Tycho Andersen wrote:
> >On Fri, Sep 11, 2015 at 03:02:36PM +0200, Daniel Borkmann wrote:
> >>On 09/11/2015 02:20 AM, Tycho Andersen wrote:
> >>>In the next patch, we're going to add a way to access the underlying
> >>>filters via bpf fds. This means that we need to ref-count both the
> >>>struct seccomp_filter objects and the struct bpf_prog objects separately,
> >>>in case a process dies but a filter is still referred to by another
> >>>process.
> >>>
> >>>Additionally, we mark classic converted seccomp filters as seccomp eBPF
> >>>programs, since they are a subset of what is supported in seccomp eBPF.
> >>>
> >>>Signed-off-by: Tycho Andersen 
> >>>CC: Kees Cook 
> >>>CC: Will Drewry 
> >>>CC: Oleg Nesterov 
> >>>CC: Andy Lutomirski 
> >>>CC: Pavel Emelyanov 
> >>>CC: Serge E. Hallyn 
> >>>CC: Alexei Starovoitov 
> >>>CC: Daniel Borkmann 
> >>>---
> >>>  kernel/seccomp.c | 4 +++-
> >>>  1 file changed, 3 insertions(+), 1 deletion(-)
> >>>
> >>>diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> >>>index 245df6b..afaeddf 100644
> >>>--- a/kernel/seccomp.c
> >>>+++ b/kernel/seccomp.c
> >>>@@ -378,6 +378,8 @@ static struct seccomp_filter 
> >>>*seccomp_prepare_filter(struct sock_fprog *fprog)
> >>>   }
> >>>
> >>>   atomic_set(>usage, 1);
> >>>+  atomic_set(>prog->aux->refcnt, 1);
> >>>+  sfilter->prog->type = BPF_PROG_TYPE_SECCOMP;
> >>
> >>So, if you do this, then this breaks the assumption of eBPF JITs
> >>that, currently, all classic converted BPF programs always have a
> >>prog->type of BPF_PROG_TYPE_UNSPEC (see: bpf_prog_was_classic()).
> >>
> >>Currently, JITs make use of this information to determine whether
> >>A and X mappings for such programs should or should not be cleared
> >>in the prologue (s390 currently).
> >>
> >>In the seccomp_prepare_filter() stage, we're already past that, so
> >>it will not cause an issue, but we certainly would need to be very
> >>careful in future, if bpf_prog_was_classic() is then used at a later
> >>stage when we already have a generated bpf_prog somewhere, as then
> >>this assumption will break.
> >
> >The only reason we need to do this is to allow BPF_DUMP_PROG to work,
> >since we were restricting it to only allow dumping of seccomp
> >programs, since those don't have maps. Instead, perhaps we could allow
> >dumping of BPF_PROG_TYPE_SECCOMP and BPF_PROG_TYPE_UNSPEC?
> 
> There are possibilities that BPF_PROG_TYPE_UNSPEC is calling helpers
> already today, at least in networking case, not seccomp. So, since
> you want to export [classic -> eBPF] only for seccomp, put fds on them
> and dump these via bpf(2), you could allow that (with a big comment
> stating why it's safe), but mid-term we really need to sanitize all
> this stuff properly as this is needed for other types, too.

Sorry, just to be clear, you're suggesting that the patch is ok modulo
a comment describing the jit issues?

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: v2 of seccomp filter c/r patches

2015-09-11 Thread Tycho Andersen
On Fri, Sep 11, 2015 at 10:00:22AM -0700, Andy Lutomirski wrote:
> On Fri, Sep 11, 2015 at 9:30 AM, Andy Lutomirski  wrote:
> > On Sep 10, 2015 5:22 PM, "Tycho Andersen"  
> > wrote:
> >>
> >> Hi all,
> >>
> >> Here is v2 of the seccomp filter c/r set. The patch notes have individual
> >> changes from the last series, but there are two points not noted:
> >>
> >> * The series still does not allow us to correctly restore state for 
> >> programs
> >>   that will use SECCOMP_FILTER_FLAG_TSYNC in the future. Given that we 
> >> want to
> >>   keep seccomp_filter's identity, I think something along the lines of 
> >> another
> >>   seccomp command like SECCOMP_INHERIT_PARENT is needed (although I'm not 
> >> sure
> >>   if this can even be done yet). In addition, we'll need a kcmp command for
> >>   figuring out if filters are the same, although this too needs to compare
> >>   seccomp_filter objects, so it's a little screwy. Any thoughts on how to 
> >> do
> >>   this nicely are welcome.
> >
> > Let's add a concept of a seccompfd.
> >
> > For background of what I want to add: I want to be able to create a
> > seccomp monitor.  A seccomp monitor will be, logically, a pair of a
> > struct file that represents the monitor and a seccomp_filter that is
> > controlled by the monitor.  Depending on flags, whoever holds the
> > monitor fd could change the active filter, intercept syscalls, and
> > issue syscalls on behalf of a process that is trapped in an
> > intercepted syscall.
> >
> > Seccomp filters would nest properly.
> >
> > The interface would probably be (extremely pseudocoded):
> >
> > monitor_fd, filter_fd = seccomp(CREATE_MONITOR, flags, ...);
> >
> > Then, later:
> >
> > seccomp(ATTACH_TO_FILTER, filter_fd);  /* now filtered */
> >
> > read(monitor_fd, buf, size); /* returns an intercepted syscall */
> > write(monitor_fd, buf, size); /* issues a syscall or releases the
> > trapped task */
> >
> > This can't be implemented on x86 without either going insane or
> > finishing the massive set of pending cleanups to the x86 entry code.
> > I favor the latter.
> >
> > We could, however, add part of it right now: we could have a way to
> > create a filterfd, we could add kcmp support for it, and we could add
> > the ATTACH_TO_FILTER thing.  I think that would solve your problem.
> >
> > One major open question: does a filter_fd know what its parent is and,
> > if so, will it just refuse to attach if the caller's parent is wrong?
> > Or will a filter_fd attach anywhere.
> >
> 
> Let me add one more thought:
> 
> Currently, struct seccomp_filter encodes a strict tree hierarchy: it
> knows what its parent is.  This only matters as an implementation
> detail and because TSYNC checks for seccomp_filter equality.
> 
> We could change this without user-visible effects.  We could say that,
> for TSYNC purposes, two filter states match if they contain exactly
> the same layers in the same order where a layer does *not* encode a
> concept of parent.  We could then say that attaching a classic bpf
> filter creates a branch new layer that is not equal to any other layer
> that's been created.
> 
> This has no effect whatsoever.  The difference would be that we could
> declare that attaching the same ebpf program twice creates the *same*
> layer so that, if you fork and both children attach the same ebpf
> program, then they match for TSYNC purposes.

Would you keep struct seccomp_filter identity here (meaning that you'd
reach over and grab the seccomp_filter from a sibling thread if it
existed)? Would it only work for the last filter attached to siblings,
or for all the filters? This does make my life easier, but I like the
idea of just using seccompfd directly below as it seems somewhat
easier (for me at least) to understand,

> Similarly, attaching the
> same hypothetical filterfd would create the same layer.

If we change the api of my current set to have the ptrace commands
iterate over seccomp fds, it looks something like:

seccompfd = ptrace(GET_FILTER_FD, pid);
while (ptrace(NEXT_FD, pid, seccompfd) == 0) {
if (seccomp(CHECK_INHERITED, seccompfd))
break;

bpffd = seccomp(GET_BPF_FD, seccompfd);
err = buf(BPF_PROG_DUMP, bpffd, );
/* save the bpf prog */
}

then restore can look like:

while (have_noninherited_filters()) {
filter = load_filter();
bpffd = bpf(BPF_PROG_LOAD, filter);
seccompfd = seccomp(SECCOMP_FD_CREATE, bpffd);

filters[n_filters++] = seccompfd;
}

/* fork any children as necessary and do the rest of the restore */

for (i = 0; i < n_filters; i++) {
seccomp(SECCOMP_FD_INSTALL, filters[i]);
}

then the only question is how to implement the CHECK_INHERITED command
on dump.

If we support the above API, we don't need to think about the concept
of layers at all, or do any extra work on filter install to preserve
struct seccomp_filter identity, it just comes 

[PATCH v3 net 0/5] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel

2015-09-11 Thread Martin KaFai Lau
v3:
- Merge a 'if else if' test in patch 4
- Use rcu_dereference_protected in patch 5 to fix a sparse check when
  CONFIG_SPARSE_RCU_POINTER is enabled

v2:
- Add patch 4 and 5 to remove the spinlock

v1:
This patch series is to fix the dst refcnt bugs in ip6_tunnel.

Patch 1 and 2 are the prep works.  Patch 3 is the fix.

I can reproduce the bug by adding and removing the ip6gre tunnel
while running a super_netperf TCP_CRR test.  I get the following
trace by adding WARN_ON_ONCE(newrefcnt < 0) to dst_release():

[  312.760432] [ cut here ]
[  312.774664] WARNING: CPU: 2 PID: 10263 at net/core/dst.c:288 
dst_release+0xf3/0x100()
[  312.776041] Modules linked in: k10temp coretemp hwmon ip6_gre ip6_tunnel 
tunnel6 ipmi_devintf ipmi_ms\
ghandler ip6table_filter ip6_tables xt_NFLOG nfnetlink_log nfnetlink xt_comment 
xt_statistic iptable_fil\
ter ip_tables x_tables nfsv3 nfs_acl nfs fscache lockd grace mptctl netconsole 
autofs4 rpcsec_gss_krb5 a\
uth_rpcgss oid_registry sunrpc ipv6 dm_mod loop iTCO_wdt iTCO_vendor_support 
serio_raw rtc_cmos pcspkr i\
2c_i801 i2c_core lpc_ich mfd_core ehci_pci ehci_hcd e1000e mlx4_en ptp pps_core 
vxlan udp_tunnel ip6_udp\
_tunnel mlx4_core sg button ext3 jbd mpt2sas raid_class
[  312.785302] CPU: 2 PID: 10263 Comm: netperf Not tainted 
4.2.0-rc8-00046-g4db9b63-dirty #15
[  312.791695] Hardware name: Quanta Freedom /Windmill-EP, BIOS F03_3B04 
09/12/2013
[  312.792965]  819dca2c 8811dfbdf6f8 816537de 
88123788fdb8
[  312.794263]   8811dfbdf738 81052646 
8811dfbdf768
[  312.795593]  881203a98180  88242927a000 
88120a2532e0
[  312.796946] Call Trace:
[  312.797380]  [] dump_stack+0x45/0x57
[  312.798288]  [] warn_slowpath_common+0x86/0xc0
[  312.799699]  [] warn_slowpath_null+0x1a/0x20
[  312.800852]  [] dst_release+0xf3/0x100
[  312.801834]  [] ip6_tnl_dst_store+0x48/0x70 [ip6_tunnel]
[  312.803738]  [] ip6gre_xmit2+0x536/0x720 [ip6_gre]
[  312.804774]  [] ip6gre_tunnel_xmit+0x16a/0x410 [ip6_gre]
[  312.805986]  [] dev_hard_start_xmit+0x23b/0x390
[  312.808810]  [] ? neigh_destroy+0xef/0x140
[  312.809843]  [] __dev_queue_xmit+0x48c/0x4f0
[  312.813931]  [] dev_queue_xmit_sk+0x13/0x20
[  312.814993]  [] neigh_direct_output+0x12/0x20
[  312.817448]  [] ip6_finish_output2+0x183/0x460 [ipv6]
[  312.818762]  [] ? find_next_bit+0x15/0x20
[  312.819671]  [] ip6_finish_output+0x89/0xe0 [ipv6]
[  312.820720]  [] ip6_output+0x44/0xe0 [ipv6]
[  312.821762]  [] ? nf_hook_slow+0x69/0xc0
[  312.823123]  [] ip6_xmit+0x242/0x4c0 [ipv6]
[  312.824073]  [] ? ac6_proc_exit+0x20/0x20 [ipv6]
[  312.825116]  [] inet6_csk_xmit+0x61/0xa0 [ipv6]
[  312.826127]  [] tcp_transmit_skb+0x4f0/0x9b0
[  312.827441]  [] tcp_connect+0x637/0x7a0
[  312.828327]  [] tcp_v6_connect+0x2d6/0x550 [ipv6]
[  312.829581]  [] __inet_stream_connect+0x95/0x2f0
[  312.830600]  [] ? hrtimer_try_to_cancel+0x1a/0xf0
[  312.833456]  [] ? timerqueue_add+0x59/0xb0
[  312.834407]  [] inet_stream_connect+0x38/0x50
[  312.835886]  [] SYSC_connect+0xb7/0xf0
[  312.840035]  [] ? do_setitimer+0x1b3/0x200
[  312.840983]  [] ? alarm_setitimer+0x3a/0x70
[  312.841941]  [] SyS_connect+0xe/0x10
[  312.842818]  [] entry_SYSCALL_64_fastpath+0x12/0x6a
[  312.844206] ---[ end trace 43f3ecd86c3b1313 ]---

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 net 4/5] ipv6: Avoid double dst_free

2015-09-11 Thread Martin KaFai Lau
It is a prep work to get dst freeing from fib tree undergo
a rcu grace period.

The following is a common paradigm:
if (ip6_del_rt(rt))
dst_free(rt)

which means, if rt cannot be deleted from the fib tree, dst_free(rt) now.
1. We don't know the ip6_del_rt(rt) failure is because it
   was not managed by fib tree (e.g. DST_NOCACHE) or it had already been
   removed from the fib tree.
2. If rt had been managed by the fib tree, ip6_del_rt(rt) failure means
   dst_free(rt) has been called already.  A second
   dst_free(rt) is not always obviously safe.  The rt may have
   been destroyed already.
3. If rt is a DST_NOCACHE, dst_free(rt) should not be called.
4. It is a stopper to make dst freeing from fib tree undergo a
   rcu grace period.

This patch is to use a DST_NOCACHE flag to indicate a rt is
not managed by the fib tree.

Signed-off-by: Martin KaFai Lau 
---
 net/ipv6/addrconf.c |  7 +++
 net/ipv6/ip6_fib.c  | 11 +--
 net/ipv6/route.c|  7 ---
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 030fefd..9001133 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -5127,13 +5127,12 @@ static void __ipv6_ifa_notify(int event, struct 
inet6_ifaddr *ifp)
 
rt = addrconf_get_prefix_route(>peer_addr, 128,
   ifp->idev->dev, 0, 0);
-   if (rt && ip6_del_rt(rt))
-   dst_free(>dst);
+   if (rt)
+   ip6_del_rt(rt);
}
dst_hold(>rt->dst);
 
-   if (ip6_del_rt(ifp->rt))
-   dst_free(>rt->dst);
+   ip6_del_rt(ifp->rt);
 
rt_genid_bump_ipv6(net);
break;
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 418d982..e68350b 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -933,6 +933,10 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt,
int replace_required = 0;
int sernum = fib6_new_sernum(info->nl_net);
 
+   if (WARN_ON_ONCE((rt->dst.flags & DST_NOCACHE) &&
+!atomic_read(>dst.__refcnt)))
+   return -EINVAL;
+
if (info->nlh) {
if (!(info->nlh->nlmsg_flags & NLM_F_CREATE))
allow_create = 0;
@@ -1025,6 +1029,7 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt,
fib6_start_gc(info->nl_net, rt);
if (!(rt->rt6i_flags & RTF_CACHE))
fib6_prune_clones(info->nl_net, pn);
+   rt->dst.flags &= ~DST_NOCACHE;
}
 
 out:
@@ -1049,7 +1054,8 @@ out:
atomic_inc(>leaf->rt6i_ref);
}
 #endif
-   dst_free(>dst);
+   if (!(rt->dst.flags & DST_NOCACHE))
+   dst_free(>dst);
}
return err;
 
@@ -1060,7 +1066,8 @@ out:
 st_failure:
if (fn && !(fn->fn_flags & (RTN_RTINFO|RTN_ROOT)))
fib6_repair_tree(info->nl_net, fn);
-   dst_free(>dst);
+   if (!(rt->dst.flags & DST_NOCACHE))
+   dst_free(>dst);
return err;
 #endif
 }
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 53617d7..3d3c1b2 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1322,8 +1322,7 @@ static void ip6_link_failure(struct sk_buff *skb)
if (rt) {
if (rt->rt6i_flags & RTF_CACHE) {
dst_hold(>dst);
-   if (ip6_del_rt(rt))
-   dst_free(>dst);
+   ip6_del_rt(rt);
} else if (rt->rt6i_node && (rt->rt6i_flags & RTF_DEFAULT)) {
rt->rt6i_node->fn_sernum = -1;
}
@@ -2028,7 +2027,8 @@ static int __ip6_del_rt(struct rt6_info *rt, struct 
nl_info *info)
struct fib6_table *table;
struct net *net = dev_net(rt->dst.dev);
 
-   if (rt == net->ipv6.ip6_null_entry) {
+   if (rt == net->ipv6.ip6_null_entry ||
+   rt->dst.flags & DST_NOCACHE) {
err = -ENOENT;
goto out;
}
@@ -2515,6 +2515,7 @@ struct rt6_info *addrconf_dst_alloc(struct inet6_dev 
*idev,
rt->rt6i_dst.addr = *addr;
rt->rt6i_dst.plen = 128;
rt->rt6i_table = fib6_get_table(net, RT6_TABLE_LOCAL);
+   rt->dst.flags |= DST_NOCACHE;
 
atomic_set(>dst.__refcnt, 1);
 
-- 
2.5.1

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 net 1/5] ipv6: Refactor common ip6gre_tunnel_init codes

2015-09-11 Thread Martin KaFai Lau
It is a prep work to fix the dst_entry refcnt bugs in ip6_tunnel.

This patch refactors some common init codes used by both
ip6gre_tunnel_init and ip6gre_tap_init.

Signed-off-by: Martin KaFai Lau 
---
 net/ipv6/ip6_gre.c | 37 -
 1 file changed, 24 insertions(+), 13 deletions(-)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 4038c69..af60d46 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -1245,7 +1245,7 @@ static void ip6gre_tunnel_setup(struct net_device *dev)
netif_keep_dst(dev);
 }
 
-static int ip6gre_tunnel_init(struct net_device *dev)
+static int ip6gre_tunnel_init_common(struct net_device *dev)
 {
struct ip6_tnl *tunnel;
 
@@ -1255,16 +1255,30 @@ static int ip6gre_tunnel_init(struct net_device *dev)
tunnel->net = dev_net(dev);
strcpy(tunnel->parms.name, dev->name);
 
+   dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
+   if (!dev->tstats)
+   return -ENOMEM;
+
+   return 0;
+}
+
+static int ip6gre_tunnel_init(struct net_device *dev)
+{
+   struct ip6_tnl *tunnel;
+   int ret;
+
+   ret = ip6gre_tunnel_init_common(dev);
+   if (ret)
+   return ret;
+
+   tunnel = netdev_priv(dev);
+
memcpy(dev->dev_addr, >parms.laddr, sizeof(struct in6_addr));
memcpy(dev->broadcast, >parms.raddr, sizeof(struct in6_addr));
 
if (ipv6_addr_any(>parms.raddr))
dev->header_ops = _header_ops;
 
-   dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
-   if (!dev->tstats)
-   return -ENOMEM;
-
return 0;
 }
 
@@ -1460,19 +1474,16 @@ static void ip6gre_netlink_parms(struct nlattr *data[],
 static int ip6gre_tap_init(struct net_device *dev)
 {
struct ip6_tnl *tunnel;
+   int ret;
 
-   tunnel = netdev_priv(dev);
+   ret = ip6gre_tunnel_init_common(dev);
+   if (ret)
+   return ret;
 
-   tunnel->dev = dev;
-   tunnel->net = dev_net(dev);
-   strcpy(tunnel->parms.name, dev->name);
+   tunnel = netdev_priv(dev);
 
ip6gre_tnl_link_config(tunnel, 1);
 
-   dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
-   if (!dev->tstats)
-   return -ENOMEM;
-
return 0;
 }
 
-- 
2.5.1

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 net 5/5] ipv6: Replace spinlock with seqlock and rcu in ip6_tunnel

2015-09-11 Thread Martin KaFai Lau
This patch uses a seqlock to ensure consistency between idst->dst and
idst->cookie.  It also makes dst freeing from fib tree to undergo a
rcu grace period.

Signed-off-by: Martin KaFai Lau 
---
 include/net/ip6_tunnel.h |  4 ++--
 net/ipv6/ip6_fib.c   |  9 +++--
 net/ipv6/ip6_tunnel.c| 51 +---
 3 files changed, 36 insertions(+), 28 deletions(-)

diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index 60b4f40..65c2a93 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -33,8 +33,8 @@ struct __ip6_tnl_parm {
 };
 
 struct ip6_tnl_dst {
-   spinlock_t lock;
-   struct dst_entry *dst;
+   seqlock_t lock;
+   struct dst_entry __rcu *dst;
u32 cookie;
 };
 
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index e68350b..8a9ec01 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -155,6 +155,11 @@ static void node_free(struct fib6_node *fn)
kmem_cache_free(fib6_node_kmem, fn);
 }
 
+static void rt6_rcu_free(struct rt6_info *rt)
+{
+   call_rcu(>dst.rcu_head, dst_rcu_free);
+}
+
 static void rt6_free_pcpu(struct rt6_info *non_pcpu_rt)
 {
int cpu;
@@ -169,7 +174,7 @@ static void rt6_free_pcpu(struct rt6_info *non_pcpu_rt)
ppcpu_rt = per_cpu_ptr(non_pcpu_rt->rt6i_pcpu, cpu);
pcpu_rt = *ppcpu_rt;
if (pcpu_rt) {
-   dst_free(_rt->dst);
+   rt6_rcu_free(pcpu_rt);
*ppcpu_rt = NULL;
}
}
@@ -181,7 +186,7 @@ static void rt6_release(struct rt6_info *rt)
 {
if (atomic_dec_and_test(>rt6i_ref)) {
rt6_free_pcpu(rt);
-   dst_free(>dst);
+   rt6_rcu_free(rt);
}
 }
 
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 851cf6d..d27312a 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -126,45 +126,48 @@ static struct net_device_stats *ip6_get_stats(struct 
net_device *dev)
  * Locking : hash tables are protected by RCU and RTNL
  */
 
-static void __ip6_tnl_per_cpu_dst_set(struct ip6_tnl_dst *idst,
- struct dst_entry *dst)
+static void ip6_tnl_per_cpu_dst_set(struct ip6_tnl_dst *idst,
+   struct dst_entry *dst)
 {
-   dst_release(idst->dst);
+   write_seqlock_bh(>lock);
+   dst_release(rcu_dereference_protected(
+   idst->dst,
+   lockdep_is_held(>lock->lock)));
if (dst) {
dst_hold(dst);
idst->cookie = rt6_get_cookie((struct rt6_info *)dst);
} else {
idst->cookie = 0;
}
-   idst->dst = dst;
-}
-
-static void ip6_tnl_per_cpu_dst_set(struct ip6_tnl_dst *idst,
-   struct dst_entry *dst)
-{
-
-   spin_lock_bh(>lock);
-   __ip6_tnl_per_cpu_dst_set(idst, dst);
-   spin_unlock_bh(>lock);
+   rcu_assign_pointer(idst->dst, dst);
+   write_sequnlock_bh(>lock);
 }
 
 struct dst_entry *ip6_tnl_dst_get(struct ip6_tnl *t)
 {
struct ip6_tnl_dst *idst;
struct dst_entry *dst;
+   unsigned int seq;
+   u32 cookie;
 
idst = raw_cpu_ptr(t->dst_cache);
-   spin_lock_bh(>lock);
-   dst = idst->dst;
-   if (dst) {
-   if (!dst->obsolete || dst->ops->check(dst, idst->cookie)) {
-   dst_hold(idst->dst);
-   } else {
-   __ip6_tnl_per_cpu_dst_set(idst, NULL);
-   dst = NULL;
-   }
+
+   rcu_read_lock();
+   do {
+   seq = read_seqbegin(>lock);
+   dst = rcu_dereference(idst->dst);
+   cookie = idst->cookie;
+   } while (read_seqretry(>lock, seq));
+
+   if (dst && !atomic_inc_not_zero(>__refcnt))
+   dst = NULL;
+   rcu_read_unlock();
+
+   if (dst && dst->obsolete && !dst->ops->check(dst, cookie)) {
+   ip6_tnl_per_cpu_dst_set(idst, NULL);
+   dst_release(dst);
+   dst = NULL;
}
-   spin_unlock_bh(>lock);
return dst;
 }
 EXPORT_SYMBOL_GPL(ip6_tnl_dst_get);
@@ -204,7 +207,7 @@ int ip6_tnl_dst_init(struct ip6_tnl *t)
return -ENOMEM;
 
for_each_possible_cpu(i)
-   spin_lock_init(_cpu_ptr(t->dst_cache, i)->lock);
+   seqlock_init(_cpu_ptr(t->dst_cache, i)->lock);
 
return 0;
 }
-- 
2.5.1

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 net 3/5] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel

2015-09-11 Thread Martin KaFai Lau
Problems in the current dst_entry cache in the ip6_tunnel:

1. ip6_tnl_dst_set is racy.  There is no lock to protect it:
   - One major problem is that the dst refcnt gets messed up. F.e.
 the same dst_cache can be released multiple times and then
 triggering the infamous dst refcnt < 0 warning message.
   - Another issue is the inconsistency between dst_cache and
 dst_cookie.

   It can be reproduced by adding and removing the ip6gre tunnel
   while running a super_netperf TCP_CRR test.

2. ip6_tnl_dst_get does not take the dst refcnt before returning
   the dst.

This patch:
1. Create a percpu dst_entry cache in ip6_tnl
2. Use a spinlock to protect the dst_cache operations
3. ip6_tnl_dst_get always takes the dst refcnt before returning

Signed-off-by: Martin KaFai Lau 
---
 include/net/ip6_tunnel.h |  11 -
 net/ipv6/ip6_gre.c   |  38 ---
 net/ipv6/ip6_tunnel.c| 122 +++
 3 files changed, 123 insertions(+), 48 deletions(-)

diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index 979b081..60b4f40 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -32,6 +32,12 @@ struct __ip6_tnl_parm {
__be32  o_key;
 };
 
+struct ip6_tnl_dst {
+   spinlock_t lock;
+   struct dst_entry *dst;
+   u32 cookie;
+};
+
 /* IPv6 tunnel */
 struct ip6_tnl {
struct ip6_tnl __rcu *next; /* next tunnel in list */
@@ -39,8 +45,7 @@ struct ip6_tnl {
struct net *net;/* netns for packet i/o */
struct __ip6_tnl_parm parms;/* tunnel configuration parameters */
struct flowi fl;/* flowi template for xmit */
-   struct dst_entry *dst_cache;/* cached dst */
-   u32 dst_cookie;
+   struct ip6_tnl_dst __percpu *dst_cache; /* cached dst */
 
int err_count;
unsigned long err_time;
@@ -61,6 +66,8 @@ struct ipv6_tlv_tnl_enc_lim {
 } __packed;
 
 struct dst_entry *ip6_tnl_dst_get(struct ip6_tnl *t);
+int ip6_tnl_dst_init(struct ip6_tnl *t);
+void ip6_tnl_dst_destroy(struct ip6_tnl *t);
 void ip6_tnl_dst_reset(struct ip6_tnl *t);
 void ip6_tnl_dst_set(struct ip6_tnl *t, struct dst_entry *dst);
 int ip6_tnl_rcv_ctl(struct ip6_tnl *t, const struct in6_addr *laddr,
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 24f5dd8..6465124 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -637,17 +637,17 @@ static netdev_tx_t ip6gre_xmit2(struct sk_buff *skb,
dst = ip6_tnl_dst_get(tunnel);
 
if (!dst) {
-   ndst = ip6_route_output(net, NULL, fl6);
+   dst = ip6_route_output(net, NULL, fl6);
 
-   if (ndst->error)
+   if (dst->error)
goto tx_err_link_failure;
-   ndst = xfrm_lookup(net, ndst, flowi6_to_flowi(fl6), NULL, 0);
-   if (IS_ERR(ndst)) {
-   err = PTR_ERR(ndst);
-   ndst = NULL;
+   dst = xfrm_lookup(net, dst, flowi6_to_flowi(fl6), NULL, 0);
+   if (IS_ERR(dst)) {
+   err = PTR_ERR(dst);
+   dst = NULL;
goto tx_err_link_failure;
}
-   dst = ndst;
+   ndst = dst;
}
 
tdev = dst->dev;
@@ -702,12 +702,9 @@ static netdev_tx_t ip6gre_xmit2(struct sk_buff *skb,
skb = new_skb;
}
 
-   if (fl6->flowi6_mark) {
-   skb_dst_set(skb, dst);
-   ndst = NULL;
-   } else {
-   skb_dst_set_noref(skb, dst);
-   }
+   if (!fl6->flowi6_mark && ndst)
+   ip6_tnl_dst_set(tunnel, ndst);
+   skb_dst_set(skb, dst);
 
proto = NEXTHDR_GRE;
if (encap_limit >= 0) {
@@ -762,14 +759,12 @@ static netdev_tx_t ip6gre_xmit2(struct sk_buff *skb,
skb_set_inner_protocol(skb, protocol);
 
ip6tunnel_xmit(NULL, skb, dev);
-   if (ndst)
-   ip6_tnl_dst_set(tunnel, ndst);
return 0;
 tx_err_link_failure:
stats->tx_carrier_errors++;
dst_link_failure(skb);
 tx_err_dst_release:
-   dst_release(ndst);
+   dst_release(dst);
return err;
 }
 
@@ -1223,6 +1218,9 @@ static const struct net_device_ops ip6gre_netdev_ops = {
 
 static void ip6gre_dev_free(struct net_device *dev)
 {
+   struct ip6_tnl *t = netdev_priv(dev);
+
+   ip6_tnl_dst_destroy(t);
free_percpu(dev->tstats);
free_netdev(dev);
 }
@@ -1248,6 +1246,7 @@ static void ip6gre_tunnel_setup(struct net_device *dev)
 static int ip6gre_tunnel_init_common(struct net_device *dev)
 {
struct ip6_tnl *tunnel;
+   int ret;
 
tunnel = netdev_priv(dev);
 
@@ -1259,6 +1258,13 @@ static int ip6gre_tunnel_init_common(struct net_device 
*dev)
if (!dev->tstats)
return -ENOMEM;
 
+   ret = ip6_tnl_dst_init(tunnel);
+   if (ret) {
+   

Re: [PATCH v2 2/5] seccomp: make underlying bpf ref counted as well

2015-09-11 Thread Daniel Borkmann

On 09/11/2015 04:44 PM, Tycho Andersen wrote:

On Fri, Sep 11, 2015 at 03:02:36PM +0200, Daniel Borkmann wrote:

On 09/11/2015 02:20 AM, Tycho Andersen wrote:

In the next patch, we're going to add a way to access the underlying
filters via bpf fds. This means that we need to ref-count both the
struct seccomp_filter objects and the struct bpf_prog objects separately,
in case a process dies but a filter is still referred to by another
process.

Additionally, we mark classic converted seccomp filters as seccomp eBPF
programs, since they are a subset of what is supported in seccomp eBPF.

Signed-off-by: Tycho Andersen 
CC: Kees Cook 
CC: Will Drewry 
CC: Oleg Nesterov 
CC: Andy Lutomirski 
CC: Pavel Emelyanov 
CC: Serge E. Hallyn 
CC: Alexei Starovoitov 
CC: Daniel Borkmann 
---
  kernel/seccomp.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 245df6b..afaeddf 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -378,6 +378,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct 
sock_fprog *fprog)
}

atomic_set(>usage, 1);
+   atomic_set(>prog->aux->refcnt, 1);
+   sfilter->prog->type = BPF_PROG_TYPE_SECCOMP;


So, if you do this, then this breaks the assumption of eBPF JITs
that, currently, all classic converted BPF programs always have a
prog->type of BPF_PROG_TYPE_UNSPEC (see: bpf_prog_was_classic()).

Currently, JITs make use of this information to determine whether
A and X mappings for such programs should or should not be cleared
in the prologue (s390 currently).

In the seccomp_prepare_filter() stage, we're already past that, so
it will not cause an issue, but we certainly would need to be very
careful in future, if bpf_prog_was_classic() is then used at a later
stage when we already have a generated bpf_prog somewhere, as then
this assumption will break.


The only reason we need to do this is to allow BPF_DUMP_PROG to work,
since we were restricting it to only allow dumping of seccomp
programs, since those don't have maps. Instead, perhaps we could allow
dumping of BPF_PROG_TYPE_SECCOMP and BPF_PROG_TYPE_UNSPEC?


There are possibilities that BPF_PROG_TYPE_UNSPEC is calling helpers
already today, at least in networking case, not seccomp. So, since
you want to export [classic -> eBPF] only for seccomp, put fds on them
and dump these via bpf(2), you could allow that (with a big comment
stating why it's safe), but mid-term we really need to sanitize all
this stuff properly as this is needed for other types, too.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net] openvswitch: Fix mask generation for nested attributes.

2015-09-11 Thread Pravin Shelar
On Thu, Sep 10, 2015 at 6:36 PM, Jesse Gross  wrote:
> Masks were added to OVS flows in a way that was backwards compatible
> with userspace programs that did not generate masks. As a result, it is
> possible that we may receive flows that do not have a mask and we need
> to synthesize one.
>
> Generating a mask requires iterating over attributes and descending into
> nested attributes. For each level we need to know the size to generate the
> correct mask. We do this with a linked table of attribute types.
>
> Although the logic to handle these nested attributes was there in concept,
> there are a number of bugs in practice. Examples include incomplete links
> between tables, variable length attributes being treated as nested and
> missing sanity checks.
>
Missing Fixes tag.

> Signed-off-by: Jesse Gross 

This patch adds white space space errors.

./scripts/checkpatch.pl
0001-openvswitch-Fix-mask-generation-for-nested-attribute.patch


total: 15 errors, 19 warnings, 0 checks, 130 lines checked

> ---
>  net/openvswitch/flow_netlink.c | 71 
> +-
>  1 file changed, 50 insertions(+), 21 deletions(-)
>
> diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
> index c92d6a2..83a6847 100644
> --- a/net/openvswitch/flow_netlink.c
> +++ b/net/openvswitch/flow_netlink.c
> @@ -57,6 +57,7 @@ struct ovs_len_tbl {
>  };
>

...
> unsigned long opt_key_offset;
> struct vxlan_metadata opts;
> -   int err;
>
> BUILD_BUG_ON(sizeof(opts) > sizeof(match->key->tun_opts));
>
> -   err = nla_parse_nested(tb, OVS_VXLAN_EXT_MAX, a, vxlan_opt_policy);
> -   if (err < 0)
> -   return err;
> -
> memset(, 0, sizeof(opts));
> +   nla_for_each_nested(a, attr, rem) {
> +   int type = nla_type(a);
>
> -   if (tb[OVS_VXLAN_EXT_GBP])
> -   opts.gbp = nla_get_u32(tb[OVS_VXLAN_EXT_GBP]);
> +   if (type > OVS_VXLAN_EXT_MAX) {
> +OVS_NLERR(log, "VXLAN extension %d out of range max 
> %d",
> +  type, OVS_VXLAN_EXT_MAX);
> +return -EINVAL;
> +}
> +
> +if (ovs_vxlan_ext_key_lens[type].len != nla_len(a) &&
> +!(ovs_vxlan_ext_key_lens[type].len == OVS_ATTR_NESTED ||
> +  ovs_vxlan_ext_key_lens[type].len == 
> OVS_ATTR_VARIABLE)) {

ovs_vxlan_ext_key_lens types is never set to nested or variable len.
so this would not true.

> +OVS_NLERR(log, "VXLAN extension %d has unexpected 
> len %d expected %d",
> +  type, nla_len(a), 
> ovs_vxlan_ext_key_lens[type].len);
> +return -EINVAL;
> +}
> +
> +switch (type) {
> +   case OVS_VXLAN_EXT_GBP:
> +   opts.gbp = nla_get_u32(a);
> +   break;
> +   default:
> +OVS_NLERR(log, "Unknown VXLAN extension attribute 
> %d",
> +  type);
> +return -EINVAL;
> +   }
> +   }

Need to check for remaining bytes in this attribute.

>
> if (!is_mask)
> SW_FLOW_KEY_PUT(match, tun_opts_len, sizeof(opts), false);
> @@ -529,7 +553,8 @@ static int ipv4_tun_from_nlattr(const struct nlattr *attr,
> }
>
> if (ovs_tunnel_key_lens[type].len != nla_len(a) &&
> -   ovs_tunnel_key_lens[type].len != OVS_ATTR_NESTED) {
> +   !(ovs_tunnel_key_lens[type].len == OVS_ATTR_NESTED ||
> + ovs_tunnel_key_lens[type].len == OVS_ATTR_VARIABLE)) {
> OVS_NLERR(log, "Tunnel attr %d has unexpected len %d 
> expected %d",
>   type, nla_len(a), 
> ovs_tunnel_key_lens[type].len);
> return -EINVAL;
> @@ -1052,10 +1077,13 @@ static void nlattr_set(struct nlattr *attr, u8 val,
>
> /* The nlattr stream should already have been validated */
> nla_for_each_nested(nla, attr, rem) {
> -   if (tbl && tbl[nla_type(nla)].len == OVS_ATTR_NESTED)
> -   nlattr_set(nla, val, tbl[nla_type(nla)].next);
> -   else
> +   if (tbl[nla_type(nla)].len == OVS_ATTR_NESTED) {
> +   if (tbl[nla_type(nla)].next)
> +   tbl = tbl[nla_type(nla)].next;
> +   nlattr_set(nla, val, tbl);
> +   } else {
> memset(nla_data(nla), val, nla_len(nla));
> +   }
> }
>  }
>
> @@ -1923,7 +1951,8 @@ static int validate_set(const struct nlattr *a,
>
> if (key_type > OVS_KEY_ATTR_MAX ||
> (ovs_key_lens[key_type].len != key_len &&
> -ovs_key_lens[key_type].len != 

Re: [PATCH v2 4/5] seccomp: add a way to access filters via bpf fds

2015-09-11 Thread Andy Lutomirski
On Sep 10, 2015 5:22 PM, "Tycho Andersen"  wrote:
>
> This patch adds a way for a process that is "real root" to access the
> seccomp filters of another process. The process first does a
> PTRACE_SECCOMP_GET_FILTER_FD to get an fd with that process' seccomp filter
> attached, and then iterates on this with PTRACE_SECCOMP_NEXT_FILTER using
> bpf(BPF_PROG_DUMP) to dump the actual program at each step.
>

> +
> +   fd = bpf_new_fd(filter->prog, O_RDONLY);
> +   if (fd > 0)
> +   atomic_inc(>prog->aux->refcnt);

Why isn't this folded into bpf_new_fd?

> +
> +   return fd;
> +}
> +
> +long seccomp_next_filter(struct task_struct *child, u32 fd)
> +{
> +   struct seccomp_filter *cur;
> +   struct bpf_prog *prog;
> +   long ret = -ESRCH;
> +
> +   if (!capable(CAP_SYS_ADMIN))
> +   return -EACCES;
> +
> +   if (child->seccomp.mode != SECCOMP_MODE_FILTER)
> +   return -EINVAL;
> +
> +   prog = bpf_prog_get(fd);
> +   if (IS_ERR(prog)) {
> +   ret = PTR_ERR(prog);
> +   goto out;
> +   }
> +
> +   for (cur = child->seccomp.filter; cur; cur = cur->prev) {
> +   if (cur->prog == prog) {
> +   if (!cur->prev)
> +   ret = -ENOENT;
> +   else
> +   ret = bpf_prog_set(fd, cur->prev->prog);

This lets you take an fd pointing to one prog and point it elsewhere.
I'm not sure that's a good idea.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH 0/3] 1588 support for Zynq Ultrascale+ MPSoC

2015-09-11 Thread Harini Katakam
This series adds 1588 support in Cadence MACB driver for Zynq Ultrascale+ MPSoC

This IP supports HW timestamping and this is accesible through extended BD.
The first patch adds support for extended BD through a config option.
Since this required the use two extra u32 variables in the macb_dma_desc
structure, we opted for a static config option.
The second patch adds support to access the timestamp in TX and RX paths,
register to PTP framework and provide time and frequency adjustment.
This was tested in  two step mode using linuxptp.
The TSU clock is expected to be provided through devicetree.

Harini Katakam (3):
  net: macb: Add support for extended BD with a config option
  net: macb: Add support for 1588 for Zynq Ultrascale+ MPSoC
  devicetree: macb: Add optional property tsu-clk

 Documentation/devicetree/bindings/net/macb.txt |3 +
 drivers/net/ethernet/cadence/Kconfig   |8 +
 drivers/net/ethernet/cadence/macb.c|  376 +++-
 drivers/net/ethernet/cadence/macb.h|   72 +
 4 files changed, 451 insertions(+), 8 deletions(-)

-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] bnx2x: use ktime_get_seconds() for timestamp

2015-09-11 Thread Arnd Bergmann
commit c48f350ff5e7 "bnx2x: Add MFW dump support" added the
bnx2x_update_mfw_dump() function that reads the current time and stores
it in a 32-bit field that gets passed into a buffer in a fixed format.

This is potentially broken when the epoch overflows in 2038, and
otherwise overflows in 2106. As we're trying to avoid uses of
struct timeval for this reason, I noticed the addition of this
function, and tried to rewrite it in a way that is more explicit
about the overflow and that will keep working once we deprecate
struct timeval.

I assume that it is not possible to change the ABI any more, otherwise
we should try to use a 64-bit field for the seconds right away.

Signed-off-by: Arnd Bergmann 
Cc: Yuval Mintz 
Cc: Ariel Elior 
---
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index e3da2bddf143..89a174fa1300 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -3705,16 +3705,14 @@ out:
 
 void bnx2x_update_mfw_dump(struct bnx2x *bp)
 {
-   struct timeval epoc;
u32 drv_ver;
u32 valid_dump;
 
if (!SHMEM2_HAS(bp, drv_info))
return;
 
-   /* Update Driver load time */
-   do_gettimeofday();
-   SHMEM2_WR(bp, drv_info.epoc, epoc.tv_sec);
+   /* Update Driver load time, possibly broken in y2038 */
+   SHMEM2_WR(bp, drv_info.epoc, (u32)ktime_get_real_seconds());
 
drv_ver = bnx2x_update_mng_version_utility(DRV_MODULE_VERSION, true);
SHMEM2_WR(bp, drv_info.drv_ver, drv_ver);

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: xfrm4_garbage_collect reaching limit

2015-09-11 Thread Steffen Klassert
Hi Dan.

On Thu, Sep 10, 2015 at 05:01:26PM -0400, Dan Streetman wrote:
> Hi Steffen,
> 
> I've been working with Jay on a ipsec issue, which I believe he
> discussed with you.  

Yes, we talked about this at the LPC.

> In this case the xfrm4_garbage_collect is
> returning error because the number of xfrm4 dst entries has exceeded
> twice the gc_thresh, which causes new allocations of xfrm4 dst objects
> to fail, thus making the ipsec connection unusable (until dst objects
> are removed/freed).
> 
> The main reason the count gets to the limit is because the
> xfrm4_policy_afinfo.garbage_collect function - which points to
> flow_cache_flush (indirectly) - doesn't actually guarantee any xfrm4
> dst will get cleaned up, it only cleans up unused entries.
> 
> The flow cache hashtable size limit watermark does restrict how many
> flow cache entries exist (by shrinking the per-cpu hashtable once it
> has 4k entries), and therefore indirectly controls the total number of
> xfrm4 dst objects.  However, there's a mismatch between the default
> xfrm4 gc_thresh - of 32k objects (which sets a 64k max of xfrm4 dst
> objects) - and the flow cache hashtable limit of 4k objects per cpu.
> Any system with 16 or less cpus will have a total limit of 64k (or
> less) flow cache entries, so the 64k xfrm4 dst entry limit will never
> be reached.  However for any system with more than 16 cpus, the flow
> cache limit is greater than the xfrm4 dst limit, and so the xfrm4 dst
> allocation can fail, rendering the ipsec connection unusable.
> 
> The most obvious solution is for the system admin to increase the
> xfrm4_gc_thresh value, although it's not really an obvious solution to
> the end-user what value they should set it to :-) 

Yes, a static gc threshold is always wrong for some workloads. So
the user needs to adjust it to his needs, even if the right value
is not obvious.

> Possibly the
> default value of xfrm4_gc_thresh could be set proportional to
> num_online_cpus(), but that doesn't help when cpus are onlined after
> boot.  

This could be an option, we could change the xfrm4_gc_thresh value with
a cpu notifier callback if more cpus come up after boot.

> Also, a warning message indicating the xfrm4_gc_thresh limit
> was reached, and a suggestion to increase the limit, may help anyone
> who hits the issue.
> 
> I'm not sure if something more aggressive is appropriate, like
> removing active entries during garbage collection. 

It would not make too much sense to push an active flow out of the
fastpath just to add some other flow. If the number of active
entries is to high, there is no other option than increasing the
gc threshold.

You could try to reduce the number of active entries by shutting
down stale security associations frequently.

> Or, removing the
> failure condition from xfrm4_garbage_collect so xfrm4 dst_ops can
> always be allocated,

This would open doors for DOS attacks, we can't do this.

> or just increasing it from gc_thresh * 2 up to *
> 4 or more.

This would just defer the problem, so not a real solution.

That said, whatever we do, we just paper over the real problem,
that is the flowcache itself. Everything that need this kind
of garbage collecting is fundamentally broken. But as long as
nobody volunteers to work on a replacement, we have to live
with this situation somehow.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] x86: Wire up 32-bit direct socket calls

2015-09-11 Thread Geert Uytterhoeven
On Fri, Sep 11, 2015 at 10:46 AM, Arnd Bergmann  wrote:
> On Friday 11 September 2015 10:24:29 Heiko Carstens wrote:
>>
>> FWIW, the s390 approach (ignoring the "new" system calls) is only 
>> temporarily.
>> I'll enable the seperate calls later when I have time to test everything,
>> especially the glibc stuff.
>
> Ok, thanks for clarifying.
>
>> The same is true for the ipc system call. (any reason why the seperate system
>> calls haven't been enabled on x86 now as well?)
>
> Agreed, we should split that out on all architectures as well.
> Almost the same set of architectures that have sys_socketcall also
> have sys_ipc, and the reasons for changing are identical. I don't
> think we have any other system calls that are handled like this
> on some architectures but not on others. There are a couple of
> system calls (e.g. futex) that are also multiplexers, but at
> least they do it consistently.

To make sure I don't miss any (it seems I missed recvmmsg and sendmmsg for
the socketcall case, sigh), this is the list of ipc syscalls to implement?

sys_msgget
sys_msgctl
sys_msgrcv
sys_msgsnd
sys_semget
sys_semctl
sys_semtimedop
sys_shmget
sys_shmctl
sys_shmat
sys_shmdt

sys_semop() seems to be unneeded because it can be implemented using
sys_semtimedop()?

Thanks!

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] x86: Wire up 32-bit direct socket calls

2015-09-11 Thread Heiko Carstens
On Mon, Sep 07, 2015 at 02:53:12PM +0200, Arnd Bergmann wrote:
> On Wednesday 02 September 2015 13:16:19 H. Peter Anvin wrote:
> > On 09/02/2015 02:48 AM, Geert Uytterhoeven wrote:
> > > 
> > > Should all other architectures follow suit?
> > > Or should we follow the s390 approach:
> > > 
> > 
> > It is up to the maintainer(s), largely dependent on how likely you are
> > going to want to support this in your libc, but in general, socketcall
> > is an abomination which there is no reason not to bypass.
> > 
> > So follow suit unless you have a strong reason not to.
> 
> +1
> 
> In my y2038 syscall series, I'm adding a new recvmmsg64 call, and
> we may decide to add new setsockopt/getsockopt variants as well.
> This is probably not the last change to socketcall, and it would
> be made much easier if all architectures had separate calls here.
> 
> It seems that there are very few architectures that don't already have
> the separate calls:
> 
> $ git grep -l __NR_socketcall arch/*/include/uapi  | xargs git grep -L 
> recvmsg 
> arch/cris/include/uapi/asm/unistd.h
> arch/frv/include/uapi/asm/unistd.h
> arch/m32r/include/uapi/asm/unistd.h
> arch/m68k/include/uapi/asm/unistd.h
> arch/mn10300/include/uapi/asm/unistd.h
> arch/s390/include/uapi/asm/unistd.h
> 
> These are of course all examples of architectures that originally followed
> the i386 syscall scheme closely rather than trying to leave out obsolete
> calls.

FWIW, the s390 approach (ignoring the "new" system calls) is only temporarily.
I'll enable the seperate calls later when I have time to test everything,
especially the glibc stuff.

The same is true for the ipc system call. (any reason why the seperate system
calls haven't been enabled on x86 now as well?)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH 3/3] devicetree: macb: Add optional property tsu-clk

2015-09-11 Thread Harini Katakam
Add TSU clock frequency to be used for 1588 support in macb driver.

Signed-off-by: Harini Katakam 
---
 Documentation/devicetree/bindings/net/macb.txt |3 +++
 1 file changed, 3 insertions(+)

diff --git a/Documentation/devicetree/bindings/net/macb.txt 
b/Documentation/devicetree/bindings/net/macb.txt
index b5d7976..f7c0ea8 100644
--- a/Documentation/devicetree/bindings/net/macb.txt
+++ b/Documentation/devicetree/bindings/net/macb.txt
@@ -19,6 +19,9 @@ Required properties:
Optional elements: 'tx_clk'
 - clocks: Phandles to input clocks.
 
+Optional properties:
+- tsu-clk: Time stamp unit clock frequency used.
+
 Examples:
 
macb0: ethernet@fffc4000 {
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH 1/3] net: macb: Add support for extended BD with a config option

2015-09-11 Thread Harini Katakam
Cadence GEM supports extended buffer descriptors.
This patch adds a config option to enable use of extended BD.
This adds two extra words to the TX BD and RX BD by configuring the
necessary registers. Corresponding variables are added to the
macb_dma_desc structure.

Signed-off-by: Harini Katakam 
---
 drivers/net/ethernet/cadence/Kconfig |8 
 drivers/net/ethernet/cadence/macb.c  |4 
 drivers/net/ethernet/cadence/macb.h  |8 
 3 files changed, 20 insertions(+)

diff --git a/drivers/net/ethernet/cadence/Kconfig 
b/drivers/net/ethernet/cadence/Kconfig
index f0bcb15..33e4198 100644
--- a/drivers/net/ethernet/cadence/Kconfig
+++ b/drivers/net/ethernet/cadence/Kconfig
@@ -31,4 +31,12 @@ config MACB
  To compile this driver as a module, choose M here: the module
  will be called macb.
 
+config MACB_EXT_BD
+   tristate "Cadence MACB/GEM extended buffer descriptor"
+   depends on HAS_DMA && MACB
+   ---help---
+ The Cadence MACB host supports use of extended buffer descriptor.
+ This option enables additon of two extra words to TX BD and RX BD.
+ These two extra words are currently used to obtain PTP timestamp.
+
 endif # NET_CADENCE
diff --git a/drivers/net/ethernet/cadence/macb.c 
b/drivers/net/ethernet/cadence/macb.c
index 88c1e1a..bb2932c 100644
--- a/drivers/net/ethernet/cadence/macb.c
+++ b/drivers/net/ethernet/cadence/macb.c
@@ -1665,6 +1665,10 @@ static void macb_configure_dma(struct macb *bp)
dmacfg |= GEM_BIT(TXCOEN);
else
dmacfg &= ~GEM_BIT(TXCOEN);
+#ifdef CONFIG_MACB_EXT_BD
+   dmacfg |= GEM_BIT(RXBDEXT);
+   dmacfg |= GEM_BIT(TXBDEXT);
+#endif
netdev_dbg(bp->dev, "Cadence configure DMA with 0x%08x\n",
   dmacfg);
gem_writel(bp, DMACFG, dmacfg);
diff --git a/drivers/net/ethernet/cadence/macb.h 
b/drivers/net/ethernet/cadence/macb.h
index 6e1faea..58c9870 100644
--- a/drivers/net/ethernet/cadence/macb.h
+++ b/drivers/net/ethernet/cadence/macb.h
@@ -244,6 +244,10 @@
 #define GEM_RXBS_SIZE  8
 #define GEM_DDRP_OFFSET24 /* disc_when_no_ahb */
 #define GEM_DDRP_SIZE  1
+#define GEM_RXBDEXT_OFFSET 28 /* Extended RX BD */
+#define GEM_RXBDEXT_SIZE   1
+#define GEM_TXBDEXT_OFFSET 29 /* Extended TX BD */
+#define GEM_TXBDEXT_SIZE   1
 
 
 /* Bitfields in NSR */
@@ -466,6 +470,10 @@
 struct macb_dma_desc {
u32 addr;
u32 ctrl;
+#ifdef CONFIG_MACB_EXT_BD
+   u32 tsl;
+   u32 tsh;
+#endif
 };
 
 /* DMA descriptor bitfields */
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH 2/3] net: macb: Add support for 1588 for Zynq Ultrascale+ MPSoC

2015-09-11 Thread Harini Katakam
Cadence GEM in Zynq Ultrascale+ MPSoC supports 1588 and provides a
102 bit time counter with 48 bits for seconds, 30 bits for nsecs and
24 bits for sub-nsecs. The timestamp is made available to the SW through
registers as well as (more precisely) through upper two words in
an extended BD.

This patch does the following:
- Adds MACB_CAPS_TSU in zynqmp_config.
- Registers to ptp clock framework (after checking for timestamp support in
  IP and capability in config).
- TX BD and RX BD control registers are written to populate timestamp in
  extended BD words.
- Timer initialization is done by writing time of day to the timer counter.
- ns increment register is programmed as NS_PER_SEC/TSU_CLK.
  For a 24 bit subns precision, the subns increment equals
  remainder of (NS_PER_SEC/TSU_CLK) * (2^24).
  TSU (Time stamp unit) clock is obtained by the  driver from devicetree.
- HW time stamp capabilities are advertised via ethtool and macb ioctl is
  updated accordingly.
- For all PTP event frames, nanoseconds and the lower 5 bits of seconds are
  obtained from the BD. This offers a precise timestamp. The upper bits
  (which dont vary between consecutive packets) are obtained from the
  TX/RX PTP event/PEER registers. The timestamp obtained thus is updated
  in skb for upper layers to access.
- The drivers register functions with ptp to perform time and frequency
  adjustment.
- Time adjustment is done by writing to the 1558_ADJUST register.
  The controller will read the delta in this register and update the timer
  counter register. Alternatively, for large time offset adjustments,
  the driver reads the secs and nsecs counter values, adds/subtracts the
  delta and updates the timer counter. In order to be as precise as possible,
  nsecs counter is read again if secs has incremented during the counter read.
- Frequency adjustment is not directly supported by this IP.
  addend is the initial value ns increment and similarly addendesub.
  The ppb (parts per billion) provided is used as
  ns_incr = addend +/- (ppb/rate).
  Similarly the remainder of the above is used to populate subns increment.
  In case the ppb requested is negative AND subns adjustment greater than
  the addendsub, ns_incr is reduced by 1 and subns_incr is adjusted in
  positive accordingly.

Signed-off-by: Harini Katakam :
---
 drivers/net/ethernet/cadence/macb.c |  372 ++-
 drivers/net/ethernet/cadence/macb.h |   64 ++
 2 files changed, 428 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb.c 
b/drivers/net/ethernet/cadence/macb.c
index bb2932c..b531008 100644
--- a/drivers/net/ethernet/cadence/macb.c
+++ b/drivers/net/ethernet/cadence/macb.c
@@ -30,6 +30,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include "macb.h"
 
@@ -56,6 +58,9 @@
 
 #define GEM_MTU_MIN_SIZE   68
 
+#define GEM_TX_PTPHDR_OFFSET   42
+#define GEM_RX_PTPHDR_OFFSET   28
+
 /*
  * Graceful stop timeouts in us. We should allow up to
  * 1 frame time (10 Mbits/s, full-duplex, ignoring collisions)
@@ -165,6 +170,9 @@ static void macb_set_hwaddr(struct macb *bp)
top = cpu_to_le16(*((u16 *)(bp->dev->dev_addr + 4)));
macb_or_gem_writel(bp, SA1T, top);
 
+   gem_writel(bp, RXPTPUNI, bottom);
+   gem_writel(bp, TXPTPUNI, bottom);
+
/* Clear unused address register sets */
macb_or_gem_writel(bp, SA2B, 0);
macb_or_gem_writel(bp, SA2T, 0);
@@ -653,6 +661,40 @@ static void macb_tx_error_task(struct work_struct *work)
spin_unlock_irqrestore(>lock, flags);
 }
 
+static inline void macb_handle_txtstamp(struct macb *bp, struct sk_buff *skb,
+   struct macb_dma_desc *desc)
+{
+   u32 ts_s, ts_ns;
+   u8 msg_type;
+
+   skb_copy_from_linear_data_offset(skb, GEM_TX_PTPHDR_OFFSET,
+_type, 1);
+
+   /* Bit[32:6] of TS secs from register
+* Bit[5:0] of TS secs from BD
+* TS nano secs is available in BD
+*/
+   if (msg_type & 0x2) {
+   /* PTP Peer Event Frame packets */
+   ts_s = (gem_readl(bp, 1588PEERTXSEC) & GEM_SEC_MASK) |
+  ((desc->tsl >> GEM_TSL_SEC_RS) |
+  (desc->tsh << GEM_TSH_SEC_LS));
+   ts_ns = desc->tsl & GEM_TSL_NSEC_MASK;
+   } else {
+   /* PTP Event Frame packets */
+   ts_s = (gem_readl(bp, 1588TXSEC) & GEM_SEC_MASK) |
+  ((desc->tsl >> GEM_TSL_SEC_RS) |
+  (desc->tsh << GEM_TSH_SEC_LS));
+   ts_ns = desc->tsl & GEM_TSL_NSEC_MASK;
+   }
+
+   struct skb_shared_hwtstamps *shhwtstamps = skb_hwtstamps(skb);
+
+   memset(shhwtstamps, 0, sizeof(struct skb_shared_hwtstamps));
+   shhwtstamps->hwtstamp = ns_to_ktime((ts_s * NS_PER_SEC) + ts_ns);
+   skb_tstamp_tx(skb, skb_hwtstamps(skb));
+}
+
 static void macb_tx_interrupt(struct 

RE: PATCH: netdev: add a cast NLMSG_OK to avoid a GCC warning in users' code

2015-09-11 Thread David Laight
From: D. Hugh Redelmeier
> Sent: 09 September 2015 21:24
...
> 2) if you use the type "unsigned int" on a 32-bit machine, you get the
>warning for an earlier conjunct:
> 
> #define NLMSG_OK(nlh,len) ((len) >= (int)sizeof(struct nlmsghdr) && \
>  (nlh)->nlmsg_len >= sizeof(struct nlmsghdr) && \
>  (nlh)->nlmsg_len <= (len))
> 
>(len) >= (int)sizeof(struct nlmsghdr)  <=== unsigned >= int
> 
> 3) on a 32-bit machine, size_t is likely "unsigned int" so the
>same problem as (2) should arise.
> 
> 4) on a 64-bit machine with 64-bit ints, the same problems are likely.
>I don't have one to test on.
> 
> Casting to "short" or "unsigned short" works, but I don't know that
> the value is guaranteed to fit in either of them.

Why not cast (nlh)->nl_msg_len instead?
Or subtract the two values and compare against zero?
Perhaps:
(typeof (len))(nlh)->nlmsg_len <= (len)
which is almost certainly safe unless 'len' is 'signed char'.

David

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC v2 net-next 3/4] ravb: Document binding for r8a7795 SoC

2015-09-11 Thread Geert Uytterhoeven
Hi Simon,

On Fri, Sep 11, 2015 at 10:14 AM, Simon Horman  wrote:
>> > @@ -18,6 +21,9 @@ Required properties:
>> >  Optional properties:
>> >  - interrupt-parent: the phandle for the interrupt controller that services
>> > interrupts for this device.
>> > +- interrupt-names: One entry per interrupt named "ch%u".
>> > +  For the R8A7795 SoC this property is mandatory,
>> > +  and "ch0" through "ch24" are mandatory.
>>
>> This suggests the single multiplexed interrupt on R-Car Gen2 can be called
>> "ch0". Is that what you want? I know the driver doesn't care.
>
> No, its not what I intended.
>
> I think its reasonable to allow the multiplexed interrupt to be named,
> but to what I wonder. The documentation seems to call the interrupt
> "EthernetAVB", which isn't very exciting.

Perhaps "mux", like I did for rspi, cfr.
Documentation/devicetree/bindings/spi/spi-rspi.txt:

  - interrupts   : A list of interrupt-specifiers, one for each entry in
   interrupt-names.
   If interrupt-names is not present, an interrupt specifier
   for a single muxed interrupt.
  - interrupt-names  : A list of interrupt names. Should contain (if present):
 - "error" for SPEI,
 - "rx" for SPRI,
 - "tx" to SPTI,
 - "mux" for a single muxed interrupt.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] x86: Wire up 32-bit direct socket calls

2015-09-11 Thread Arnd Bergmann
On Friday 11 September 2015 10:24:29 Heiko Carstens wrote:
> 
> FWIW, the s390 approach (ignoring the "new" system calls) is only temporarily.
> I'll enable the seperate calls later when I have time to test everything,
> especially the glibc stuff.

Ok, thanks for clarifying.

> The same is true for the ipc system call. (any reason why the seperate system
> calls haven't been enabled on x86 now as well?)

Agreed, we should split that out on all architectures as well.
Almost the same set of architectures that have sys_socketcall also
have sys_ipc, and the reasons for changing are identical. I don't
think we have any other system calls that are handled like this
on some architectures but not on others. There are a couple of
system calls (e.g. futex) that are also multiplexers, but at
least they do it consistently.

Arnd
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC v2 net-next 3/4] ravb: Document binding for r8a7795 SoC

2015-09-11 Thread Simon Horman
On Fri, Sep 11, 2015 at 10:41:31AM +0200, Geert Uytterhoeven wrote:
> Hi Simon,
> 
> On Fri, Sep 11, 2015 at 10:14 AM, Simon Horman  wrote:
> >> > @@ -18,6 +21,9 @@ Required properties:
> >> >  Optional properties:
> >> >  - interrupt-parent: the phandle for the interrupt controller that 
> >> > services
> >> > interrupts for this device.
> >> > +- interrupt-names: One entry per interrupt named "ch%u".
> >> > +  For the R8A7795 SoC this property is mandatory,
> >> > +  and "ch0" through "ch24" are mandatory.
> >>
> >> This suggests the single multiplexed interrupt on R-Car Gen2 can be called
> >> "ch0". Is that what you want? I know the driver doesn't care.
> >
> > No, its not what I intended.
> >
> > I think its reasonable to allow the multiplexed interrupt to be named,
> > but to what I wonder. The documentation seems to call the interrupt
> > "EthernetAVB", which isn't very exciting.
> 
> Perhaps "mux", like I did for rspi, cfr.
> Documentation/devicetree/bindings/spi/spi-rspi.txt:
> 
>   - interrupts   : A list of interrupt-specifiers, one for each entry in
>interrupt-names.
>If interrupt-names is not present, an interrupt 
> specifier
>for a single muxed interrupt.
>   - interrupt-names  : A list of interrupt names. Should contain (if present):
>  - "error" for SPEI,
>  - "rx" for SPRI,
>  - "tx" to SPTI,
>  - "mux" for a single muxed interrupt.

Thanks for for that example. "mux" sounds good to me.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ebpf: emit correct src_reg for conditional jumps

2015-09-11 Thread Daniel Borkmann

On 09/11/2015 10:45 AM, Daniel Borkmann wrote:

On 09/11/2015 02:25 AM, Tycho Andersen wrote:

Instead of always emitting BPF_REG_X, let's emit BPF_REG_X only when the
source actually is BPF_X. This causes programs generated by the classic
converter to not be importable via bpf(), as the eBPF verifier checks that
the src_reg is correct or 0. While not a problem yet, this will be a
problem when BPF_PROG_DUMP lands, and we can potentially dump and re-import
programs generated by the converter.

Signed-off-by: Tycho Andersen 
CC: Alexei Starovoitov 
CC: Daniel Borkmann 


I think the description at the beginning could have been a bit clearer. I.e.
it's safe to zero insn->src_reg for BPF_SRC(fp->code) that is not X, because
we can either have X or K for classic jump compares, and in case of K we're
only interested in the immediate value, but not a possible src_reg, of course,
so eBPF converter should have zeroed it.

Yeah, it was never really the aim of the converter to cover something like
that requirement you seem to have:

   load classic BPF
 -> transform into eBPF
   -> dump that eBPF result to uspace
 -> load that eBPF via bpf(2)


[off topic for this patch]

... this requirement also breaks down for cases where you have a single
classic BPF instruction that maps into 2 or more eBPF instructions, hitting
BPF_MAXINSNS early at the time when you try to call into bpf(2) again with
the dumped result. If I recall correctly, Chrome seems to use up quite a lot
of insns space on (classic) seccomp-BPF, so this could potentially be a real
issue, next to artificially crafted examples that would fail.

With regards to the latter, also classic programs that could have holes of
dead code where you jump over them (see lib/test_bpf.c for examples) are
unfortunately allowed on the ABI side, so while the classic -> eBPF converter
may translate this dead region 1:1, it will be rejected by the verifier when
you dump and try to reload that. ;) Anyway, it's perhaps a silly example, but
it shows that this use-case can only work on a 'best-effort' basis, and not
cover everything of the classic BPF ABI, as opposed to having an interface
that can dump/restore classic BPF instructions directly.

Do you need this for checkpoint/restore? Wouldn't this therefore be better
done as dump/restore classic<->classic and eBPF<->eBPF directly? In socket
filters we do this by keeping orig_prog around, I guess it's better to bite
the bullet of additional memory overhead in case of classic seccomp-BPF, too.

Cheers,
Daniel
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] xfrm6: Fix ICMPv6 and MH header checks in _decode_session6

2015-09-11 Thread Mathias Krause
From: Mathias Krause 

Ensure there's enough data left prior calling pskb_may_pull(). If
skb->data was already advanced, we'll call pskb_may_pull() with a
negative value converted to unsigned int -- leading to a huge
positive value. That won't matter in practice as pskb_may_pull()
will likely fail in this case, but it leads to underflow reports on
kernels handling such kind of over-/underflows, e.g. a PaX enabled
kernel instrumented with the size_overflow plugin.

Reported-by: satmd 
Reported-and-tested-by: Marcin Jurkowski 
Signed-off-by: Mathias Krause 
Cc: PaX Team 
---
 net/ipv6/xfrm6_policy.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index ed0583c1b9fc..c988c3f033cf 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -174,7 +174,8 @@ _decode_session6(struct sk_buff *skb, struct flowi *fl, int 
reverse)
return;
 
case IPPROTO_ICMPV6:
-   if (!onlyproto && pskb_may_pull(skb, nh + offset + 2 - 
skb->data)) {
+   if (!onlyproto && (nh + offset + 2 < skb->data ||
+   pskb_may_pull(skb, nh + offset + 2 - skb->data))) {
u8 *icmp;
 
nh = skb_network_header(skb);
@@ -188,7 +189,8 @@ _decode_session6(struct sk_buff *skb, struct flowi *fl, int 
reverse)
 #if IS_ENABLED(CONFIG_IPV6_MIP6)
case IPPROTO_MH:
offset += ipv6_optlen(exthdr);
-   if (!onlyproto && pskb_may_pull(skb, nh + offset + 3 - 
skb->data)) {
+   if (!onlyproto && (nh + offset + 3 < skb->data ||
+   pskb_may_pull(skb, nh + offset + 3 - skb->data))) {
struct ip6_mh *mh;
 
nh = skb_network_header(skb);
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ebpf: emit correct src_reg for conditional jumps

2015-09-11 Thread Daniel Borkmann

On 09/11/2015 02:25 AM, Tycho Andersen wrote:

Instead of always emitting BPF_REG_X, let's emit BPF_REG_X only when the
source actually is BPF_X. This causes programs generated by the classic
converter to not be importable via bpf(), as the eBPF verifier checks that
the src_reg is correct or 0. While not a problem yet, this will be a
problem when BPF_PROG_DUMP lands, and we can potentially dump and re-import
programs generated by the converter.

Signed-off-by: Tycho Andersen 
CC: Alexei Starovoitov 
CC: Daniel Borkmann 


I think the description at the beginning could have been a bit clearer. I.e.
it's safe to zero insn->src_reg for BPF_SRC(fp->code) that is not X, because
we can either have X or K for classic jump compares, and in case of K we're
only interested in the immediate value, but not a possible src_reg, of course,
so eBPF converter should have zeroed it.

Yeah, it was never really the aim of the converter to cover something like
that requirement you seem to have:

  load classic BPF
-> transform into eBPF
  -> dump that eBPF result to uspace
-> load that eBPF via bpf(2)

Anyway, it doesn't cause an issue in the current code as stated, but it's good
if we fix it up nevertheless.

Looks good to me:

Acked-by: Daniel Borkmann 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] at86rf230: fix build warning

2015-09-11 Thread Sudip Mukherjee
e616a00ce786 ("drivers/net/ieee802154/at86rf230.c: seq_printf() now returns 
NULL")
has removed the usage of the integer "ret" but missed removing the
variable. And we were getting a build warning about "unused variable".

Fixes: e616a00ce786 ("drivers/net/ieee802154/at86rf230.c: seq_printf() now 
returns NULL")
Signed-off-by: Sudip Mukherjee 
---
 drivers/net/ieee802154/at86rf230.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/ieee802154/at86rf230.c 
b/drivers/net/ieee802154/at86rf230.c
index a9f3af6..9ae98fd 100644
--- a/drivers/net/ieee802154/at86rf230.c
+++ b/drivers/net/ieee802154/at86rf230.c
@@ -1645,7 +1645,6 @@ static struct dentry *at86rf230_debugfs_root;
 static int at86rf230_stats_show(struct seq_file *file, void *offset)
 {
struct at86rf230_local *lp = file->private;
-   int ret;
 
seq_printf(file, "SUCCESS:\t\t%8llu\n", lp->trac.success);
seq_printf(file, "SUCCESS_DATA_PENDING:\t%8llu\n",
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC v2 net-next 3/4] ravb: Document binding for r8a7795 SoC

2015-09-11 Thread Simon Horman
On Fri, Sep 11, 2015 at 09:12:17AM +0200, Geert Uytterhoeven wrote:
> Hi Simon,
> 
> On Fri, Sep 11, 2015 at 4:01 AM, Simon Horman
>  wrote:
> > --- a/Documentation/devicetree/bindings/net/renesas,ravb.txt
> > +++ b/Documentation/devicetree/bindings/net/renesas,ravb.txt
> > @@ -6,8 +6,11 @@ interface contains.
> >  Required properties:
> >  - compatible: "renesas,etheravb-r8a7790" if the device is a part of 
> > R8A7790 SoC.
> >   "renesas,etheravb-r8a7794" if the device is a part of R8A7794 
> > SoC.
> > + "renesas,etheravb-r8a7795" if the device is a part of R8A7795 
> > SoC.
> >  - reg: offset and length of (1) the register block and (2) the stream 
> > buffer.
> > -- interrupts: interrupt specifier for the sole interrupt.
> > +- interrupts: interrupt specifiers.
> > + One for each entry in interrupt-names the R8A7795 SoC;
> 
> ... for the R8A7795 SoC
> 
> > + One entry for a multiplexed interrupt otherwise.
> >  - phy-mode: see ethernet.txt file in the same directory.
> >  - phy-handle: see ethernet.txt file in the same directory.
> >  - #address-cells: number of address cells for the MDIO bus, must be equal 
> > to 1.
> > @@ -18,6 +21,9 @@ Required properties:
> >  Optional properties:
> >  - interrupt-parent: the phandle for the interrupt controller that services
> > interrupts for this device.
> > +- interrupt-names: One entry per interrupt named "ch%u".
> > +  For the R8A7795 SoC this property is mandatory,
> > +  and "ch0" through "ch24" are mandatory.
> 
> This suggests the single multiplexed interrupt on R-Car Gen2 can be called
> "ch0". Is that what you want? I know the driver doesn't care.

No, its not what I intended.

I think its reasonable to allow the multiplexed interrupt to be named,
but to what I wonder. The documentation seems to call the interrupt
"EthernetAVB", which isn't very exciting.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [net-next PATCH] net: bridge: fix for bridging 802.1Q without REORDER_HDR

2015-09-11 Thread Stephen Hemminger
On Fri, 11 Sep 2015 21:22:03 +0200
Phil Sutter  wrote:

> When forwarding packets from an 802.1Q interface with REORDER_HDR set to
> zero, the VLAN header previously inserted by vlan_do_receive() needs to
> be stripped from the packet and the mac_header adjustment undone,
> otherwise a tagged frame with first four bytes missing will be
> transmitted.
> 
> Signed-off-by: Phil Sutter 
> ---
>  net/bridge/br_input.c | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
> index f921a5d..e4e3fc7 100644
> --- a/net/bridge/br_input.c
> +++ b/net/bridge/br_input.c
> @@ -288,6 +288,16 @@ rx_handler_result_t br_handle_frame(struct sk_buff 
> **pskb)
>   }
>  
>  forward:
> + if (is_vlan_dev(skb->dev) &&
> + !(vlan_dev_priv(skb->dev)->flags & VLAN_FLAG_REORDER_HDR)) {
> + unsigned int offset = skb->data - skb_mac_header(skb);
> +
> + skb_push(skb, offset);
> + memmove(skb->data + VLAN_HLEN, skb->data, 2 * ETH_ALEN);
> + skb->mac_header += VLAN_HLEN;
> + skb_pull(skb, offset);
> + skb_reset_mac_len(skb);
> + }
>   switch (p->state) {
>   case BR_STATE_FORWARDING:
>   rhook = rcu_dereference(br_should_route_hook);

Thanks for finding this. Is this a new thing or has it always been there?

Sorry, this looks so special case it doesn't seem like a good idea.
Something is broken in VLAN handling if this is required.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Net: core: stream.c: coding style fixes

2015-09-11 Thread Krzysztof Majzerowicz-Jaszcz
Fixed coding style issues reported by checkpatch.pl

Signed-off-by: Krzysztof Majzerowicz-Jaszcz 
---
 net/core/stream.c | 29 -
 1 file changed, 16 insertions(+), 13 deletions(-)

diff --git a/net/core/stream.c b/net/core/stream.c
index d70f77a..f169de8 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -7,9 +7,9 @@
  * identical sendmsg() and recvmsg() code.
  * So we (will) share it here.
  *
- * Authors:Arnaldo Carvalho de Melo 
- * (from old tcp.c code)
- * Alan Cox  (Borrowed comments 
8-))
+ * Authors:Arnaldo Carvalho de Melo 
+ * (from old tcp.c code)
+ * Alan Cox  (Borrowed comments 8-))
  */
 
 #include 
@@ -60,6 +60,7 @@ int sk_stream_wait_connect(struct sock *sk, long *timeo_p)
 
do {
int err = sock_error(sk);
+
if (err)
return err;
if ((1 << sk->sk_state) & ~(TCPF_SYN_SENT | TCPF_SYN_RECV))
@@ -122,9 +123,10 @@ int sk_stream_wait_memory(struct sock *sk, long *timeo_p)
bool noblock = (*timeo_p ? false : true);
DEFINE_WAIT(wait);
 
-   if (sk_stream_memory_free(sk))
-   current_timeo = vm_wait = (prandom_u32() % (HZ / 5)) + 2;
-
+   if (sk_stream_memory_free(sk)) {
+   vm_wait = (prandom_u32() % (HZ / 5)) + 2;
+   current_timeo = vm_wait;
+   }
while (1) {
set_bit(SOCK_ASYNC_NOSPACE, >sk_socket->flags);
 
@@ -145,18 +147,19 @@ int sk_stream_wait_memory(struct sock *sk, long *timeo_p)
 
set_bit(SOCK_NOSPACE, >sk_socket->flags);
sk->sk_write_pending++;
-   sk_wait_event(sk, _timeo, sk->sk_err ||
- (sk->sk_shutdown & 
SEND_SHUTDOWN) ||
- (sk_stream_memory_free(sk) &&
- !vm_wait));
+   sk_wait_event(sk, _timeo,
+ sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN) ||
+ (sk_stream_memory_free(sk) && !vm_wait));
sk->sk_write_pending--;
 
if (vm_wait) {
vm_wait -= current_timeo;
current_timeo = *timeo_p;
-   if (current_timeo != MAX_SCHEDULE_TIMEOUT &&
-   (current_timeo -= vm_wait) < 0)
-   current_timeo = 0;
+   if (current_timeo != MAX_SCHEDULE_TIMEOUT) {
+   current_timeo -= vm_wait;
+   if (current_timeo < 0)
+   current_timeo = 0;
+   }
vm_wait = 0;
}
*timeo_p = current_timeo;
-- 
2.5.1

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] net-sysfs: get_netdev_queue_index() cleanup

2015-09-11 Thread Thadeu Lima de Souza Cascardo
Redo commit ed1acc8cd8c22efa919da8d300bab646e01c2dce.

Commit 822b3b2ebfff8e9b3d006086c527738a7ca00cd0 ("net: Add max rate tx queue
attribute") moved get_netdev_queue_index around, but kept the old version.
Probably because of a reuse of the original patch from before Eric's change to
that function.

Remove one inline keyword, and no need for a loop to find
an index into a table.

Signed-off-by: Thadeu Lima de Souza Cascardo 
Fixes: 822b3b2ebfff8e9b3d006086c527738a7ca00cd0
Cc: Or Gerlitz 
Cc: John Fastabend 
Cc: Eric Dumazet 
---

Not sure what is the best way to credit Eric Dumazet here. I assume he will add
any appropriate tags.

---
 net/core/net-sysfs.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index b279077..49b5990 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1004,15 +1004,12 @@ static ssize_t show_trans_timeout(struct netdev_queue 
*queue,
 }
 
 #ifdef CONFIG_XPS
-static inline unsigned int get_netdev_queue_index(struct netdev_queue *queue)
+static unsigned int get_netdev_queue_index(struct netdev_queue *queue)
 {
struct net_device *dev = queue->dev;
-   int i;
-
-   for (i = 0; i < dev->num_tx_queues; i++)
-   if (queue == >_tx[i])
-   break;
+   unsigned int i;
 
+   i = queue - dev->_tx;
BUG_ON(i >= dev->num_tx_queues);
 
return i;
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net] netlink, mmap: transform mmap skb into full skb on taps

2015-09-11 Thread Daniel Borkmann

On 09/11/2015 09:42 PM, David Miller wrote:

From: Daniel Borkmann 
Date: Fri, 11 Sep 2015 12:25:45 +0200


Already calling into skb_clone() is an issue itself, as the data
area is user space buffer, and skb_clone() as well as skb_copy()
access skb_shinfo() area. :/ So in that regard netlink mmap skbs are
even further restrictive on what we can do than netlink large skbs.


Indeed, this is fatal.

So we'd still need something special like your
netlink_to_full_skb_clone to elide trying to touch the skb_shinfo
area.

I thought briefly about somehow cobbling up extra space in the ring
entries so we could have a real skb_shinfo() there, but that's illegal
too as the user could scribble all over it randomly while we interpret
the contents.  We don't own that memory.  So this doesn't work.


Yes, agreed.


We could rename the clone_preserves_destructor and have it also mean
that the SKB lacks frags and skb_shinfo() should not be inspected.

Something like this:

[...]

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 2738d35..898c53d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h

[...]

@@ -2220,7 +2221,8 @@ static inline void skb_orphan(struct sk_buff *skb)
   */
  static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
  {
-   if (likely(!(skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY)))
+   if (likely(!(skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) ||
+  skb->private_buffers))


(These two would need to be swapped.)


return 0;
return skb_copy_ubufs(skb, gfp_mask);
  }
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index dad4dd3..54f9d6e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -825,7 +825,10 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, 
struct sk_buff *skb)
n->hdr_len = skb->nohdr ? skb_headroom(skb) : skb->hdr_len;
n->cloned = 1;
n->nohdr = 0;
-   n->destructor = NULL;
+   if (!skb->private_buffers)
+   n->destructor = NULL;
+   else
+   C(destructor);
C(tail);
C(end);
C(head);


We would also have to conditionally skip the __skb_clone()'s ...

  atomic_inc(&(skb_shinfo(skb)->dataref));

Thus, the issue here is that while netlink_alloc_large_skb() and
netlink_ring_setup_skb() would set both skb->private_buffers = 1,
the large skb case would actually need to inspect dataref count
(which it also can legally do) to properly release the vmalloc'ed
area again, while the other case must not even touch it. So if I
see this correctly, it looks like it's unfortunately not possible
to combine the two cases in a single flag. :/

If there's a good case to burn this flag outside of netlink for e.g.
vmalloc backend memory on skbs, it could be solved like that, while
the mmap case be declared netlink's problem. ;) I currently don't
have a better idea than to copy these guys, hmmm.

[...]

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 7f86d3b..523adac 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -854,6 +855,14 @@ static void netlink_ring_set_copied(struct sock *sk, 
struct sk_buff *skb)
  #define netlink_mmap_sendmsg(sk, msg, dst_portid, dst_group, scm) 0
  #endif /* CONFIG_NETLINK_MMAP */

+static bool skb_can_release_head(struct sk_buff *skb)
+{
+   if (!skb->cloned ||
+   !atomic_dec_return(&(skb_shinfo(skb)->dataref)))
+   return true;
+   return false;
+}
+
  static void netlink_skb_destructor(struct sk_buff *skb)
  {
  #ifdef CONFIG_NETLINK_MMAP
@@ -866,31 +875,35 @@ static void netlink_skb_destructor(struct sk_buff *skb)

[...]
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 net-next] rtnetlink: RTEXT_FILTER_SKIP_STATS support to avoid dumping inet/inet6 stats

2015-09-11 Thread Sowmini Varadhan

Many commonly used functions like getifaddrs() invoke RTM_GETLINK
to dump the interface information, and do not need the
the AF_INET6 statististics that are always returned by default
from rtnl_fill_ifinfo().

Computing the statistics can be an expensive operation that impacts
scaling, so it is desirable to avoid this if the information is
not needed.

This patch adds a the RTEXT_FILTER_SKIP_STATS extended info flag that
can be passed with netlink_request() to avoid statistics computation
for the ifinfo path.

Signed-off-by: Sowmini Varadhan 
---
v2: David Miller comments: pass u32 ext_filter_mask down.
v3: non-RFC version of v2.

 include/net/rtnetlink.h|3 ++-
 include/uapi/linux/rtnetlink.h |1 +
 net/core/rtnetlink.c   |2 +-
 net/ipv4/devinet.c |3 ++-
 net/ipv6/addrconf.c|   13 +
 5 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/include/net/rtnetlink.h b/include/net/rtnetlink.h
index 18fdb98..aff6ceb 100644
--- a/include/net/rtnetlink.h
+++ b/include/net/rtnetlink.h
@@ -122,7 +122,8 @@ struct rtnl_af_ops {
int family;
 
int (*fill_link_af)(struct sk_buff *skb,
-   const struct net_device *dev);
+   const struct net_device *dev,
+   u32 ext_filter_mask);
size_t  (*get_link_af_size)(const struct net_device 
*dev);
 
int (*validate_link_af)(const struct net_device 
*dev,
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 7020247..434227f 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -666,6 +666,7 @@ struct tcamsg {
 #define RTEXT_FILTER_VF(1 << 0)
 #define RTEXT_FILTER_BRVLAN(1 << 1)
 #define RTEXT_FILTER_BRVLAN_COMPRESSED (1 << 2)
+#defineRTEXT_FILTER_SKIP_STATS (1 << 3)
 
 /* End of information exported to user level */
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index a466821..e545229 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1272,7 +1272,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct 
net_device *dev,
if (!(af = nla_nest_start(skb, af_ops->family)))
goto nla_put_failure;
 
-   err = af_ops->fill_link_af(skb, dev);
+   err = af_ops->fill_link_af(skb, dev, ext_filter_mask);
 
/*
 * Caller may return ENODATA to indicate that there
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 2d9cb17..7350084 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -1654,7 +1654,8 @@ static size_t inet_get_link_af_size(const struct 
net_device *dev)
return nla_total_size(IPV4_DEVCONF_MAX * 4); /* IFLA_INET_CONF */
 }
 
-static int inet_fill_link_af(struct sk_buff *skb, const struct net_device *dev)
+static int inet_fill_link_af(struct sk_buff *skb, const struct net_device *dev,
+u32 ext_filter_mask)
 {
struct in_device *in_dev = rcu_dereference_rtnl(dev->ip_ptr);
struct nlattr *nla;
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 99c0f2b..9acbb09 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4760,7 +4760,8 @@ static void snmp6_fill_stats(u64 *stats, struct inet6_dev 
*idev, int attrtype,
}
 }
 
-static int inet6_fill_ifla6_attrs(struct sk_buff *skb, struct inet6_dev *idev)
+static int inet6_fill_ifla6_attrs(struct sk_buff *skb, struct inet6_dev *idev,
+ u32 ext_filter_mask)
 {
struct nlattr *nla;
struct ifla_cacheinfo ci;
@@ -4780,6 +4781,9 @@ static int inet6_fill_ifla6_attrs(struct sk_buff *skb, 
struct inet6_dev *idev)
 
/* XXX - MC not implemented */
 
+   if (!!(ext_filter_mask & RTEXT_FILTER_SKIP_STATS))
+   return 0;
+
nla = nla_reserve(skb, IFLA_INET6_STATS, IPSTATS_MIB_MAX * sizeof(u64));
if (!nla)
goto nla_put_failure;
@@ -4815,14 +4819,15 @@ static size_t inet6_get_link_af_size(const struct 
net_device *dev)
return inet6_ifla6_size();
 }
 
-static int inet6_fill_link_af(struct sk_buff *skb, const struct net_device 
*dev)
+static int inet6_fill_link_af(struct sk_buff *skb, const struct net_device 
*dev,
+ u32 ext_filter_mask)
 {
struct inet6_dev *idev = __in6_dev_get(dev);
 
if (!idev)
return -ENODATA;
 
-   if (inet6_fill_ifla6_attrs(skb, idev) < 0)
+   if (inet6_fill_ifla6_attrs(skb, idev, ext_filter_mask) < 0)
return -EMSGSIZE;
 
return 0;
@@ -4977,7 +4982,7 @@ static int inet6_fill_ifinfo(struct sk_buff *skb, struct 
inet6_dev *idev,
if (!protoinfo)
 

Re: [PATCHv2 RFC] RTEXT_FILTER_SKIP_STATS support to avoid dumping inet/inet6 stats

2015-09-11 Thread Sowmini Varadhan
On (09/12/15 00:22), Raghavendra K T wrote:
> 
> Sowmini, Thanks for the patch which is more cleaner way without
> breaking current behaviour.
> 
> [ Though RTEXT_FILTER_NEED_STATS flag with reverse effect  would have
> helped immediately :)]

Agree, but existing legacy usage will not set this flag, so I had 
few choices here.

> /me waits for the RTEXT_FILTER_SKIP_STATS to be supported in
> gccgo/golang, so that it can be used in docker newNetlinkRequest() to
> exploit this.

yes, I'm working on lining up the glibc bit as well (thus the cc to Jose..)
I'll send out the non-rfc version in a bit..

thanks for the feedback!

--Sowmini

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net] netlink, mmap: transform mmap skb into full skb on taps

2015-09-11 Thread David Miller
From: Daniel Borkmann 
Date: Fri, 11 Sep 2015 12:25:45 +0200

> Already calling into skb_clone() is an issue itself, as the data
> area is user space buffer, and skb_clone() as well as skb_copy()
> access skb_shinfo() area. :/ So in that regard netlink mmap skbs are
> even further restrictive on what we can do than netlink large skbs.

Indeed, this is fatal.

So we'd still need something special like your
netlink_to_full_skb_clone to elide trying to touch the skb_shinfo
area.

I thought briefly about somehow cobbling up extra space in the ring
entries so we could have a real skb_shinfo() there, but that's illegal
too as the user could scribble all over it randomly while we interpret
the contents.  We don't own that memory.  So this doesn't work.

We could rename the clone_preserves_destructor and have it also mean
that the SKB lacks frags and skb_shinfo() should not be inspected.

Something like this:

diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index 639e9b8..47d5875 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -97,22 +97,6 @@ int netlink_attachskb(struct sock *sk, struct sk_buff *skb,
 void netlink_detachskb(struct sock *sk, struct sk_buff *skb);
 int netlink_sendskb(struct sock *sk, struct sk_buff *skb);
 
-static inline struct sk_buff *
-netlink_skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
-{
-   struct sk_buff *nskb;
-
-   nskb = skb_clone(skb, gfp_mask);
-   if (!nskb)
-   return NULL;
-
-   /* This is a large skb, set destructor callback to release head */
-   if (is_vmalloc_addr(skb->head))
-   nskb->destructor = skb->destructor;
-
-   return nskb;
-}
-
 /*
  * skb should fit one page. This choice is good for headerless malloc.
  * But we should limit to 8K so that userspace does not have to
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 2738d35..898c53d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -584,8 +584,9 @@ struct sk_buff {
fclone:2,
peeked:1,
head_frag:1,
-   xmit_more:1;
-   /* one bit hole */
+   xmit_more:1,
+   private_buffers:1;
+
kmemcheck_bitfield_end(flags1);
 
/* fields enclosed in headers_start/headers_end are copied
@@ -2220,7 +2221,8 @@ static inline void skb_orphan(struct sk_buff *skb)
  */
 static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
 {
-   if (likely(!(skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY)))
+   if (likely(!(skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) ||
+  skb->private_buffers))
return 0;
return skb_copy_ubufs(skb, gfp_mask);
 }
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index dad4dd3..54f9d6e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -825,7 +825,10 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, 
struct sk_buff *skb)
n->hdr_len = skb->nohdr ? skb_headroom(skb) : skb->hdr_len;
n->cloned = 1;
n->nohdr = 0;
-   n->destructor = NULL;
+   if (!skb->private_buffers)
+   n->destructor = NULL;
+   else
+   C(destructor);
C(tail);
C(end);
C(head);
@@ -1675,6 +1678,9 @@ int skb_copy_bits(const struct sk_buff *skb, int offset, 
void *to, int len)
to += copy;
}
 
+   if (skb->private_buffers)
+   goto out;
+
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
int end;
skb_frag_t *f = _shinfo(skb)->frags[i];
@@ -1720,7 +1726,7 @@ int skb_copy_bits(const struct sk_buff *skb, int offset, 
void *to, int len)
}
start = end;
}
-
+out:
if (!len)
return 0;
 
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 6fcbd21..2ec5425 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -1075,7 +1075,7 @@ static void nl_fib_input(struct sk_buff *skb)
nlmsg_len(nlh) < sizeof(*frn))
return;
 
-   skb = netlink_skb_clone(skb, GFP_KERNEL);
+   skb = skb_clone(skb, GFP_KERNEL);
if (!skb)
return;
nlh = nlmsg_hdr(skb);
diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c
index 70277b1..4c612481 100644
--- a/net/netfilter/nfnetlink.c
+++ b/net/netfilter/nfnetlink.c
@@ -291,7 +291,7 @@ static void nfnetlink_rcv_batch(struct sk_buff *skb, struct 
nlmsghdr *nlh,
 replay:
status = 0;
 
-   skb = netlink_skb_clone(oskb, GFP_KERNEL);
+   skb = skb_clone(oskb, GFP_KERNEL);
if (!skb)
return netlink_ack(oskb, nlh, -ENOMEM);
 
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 7f86d3b..523adac 100644
--- a/net/netlink/af_netlink.c
+++ 

Re: [net-next PATCH] net: bridge: fix for bridging 802.1Q without REORDER_HDR

2015-09-11 Thread Phil Sutter
On Fri, Sep 11, 2015 at 12:24:45PM -0700, Stephen Hemminger wrote:
> On Fri, 11 Sep 2015 21:22:03 +0200
> Phil Sutter  wrote:
> 
> > When forwarding packets from an 802.1Q interface with REORDER_HDR set to
> > zero, the VLAN header previously inserted by vlan_do_receive() needs to
> > be stripped from the packet and the mac_header adjustment undone,
> > otherwise a tagged frame with first four bytes missing will be
> > transmitted.
> > 
> > Signed-off-by: Phil Sutter 
> > ---
> >  net/bridge/br_input.c | 10 ++
> >  1 file changed, 10 insertions(+)
> > 
> > diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
> > index f921a5d..e4e3fc7 100644
> > --- a/net/bridge/br_input.c
> > +++ b/net/bridge/br_input.c
> > @@ -288,6 +288,16 @@ rx_handler_result_t br_handle_frame(struct sk_buff 
> > **pskb)
> > }
> >  
> >  forward:
> > +   if (is_vlan_dev(skb->dev) &&
> > +   !(vlan_dev_priv(skb->dev)->flags & VLAN_FLAG_REORDER_HDR)) {
> > +   unsigned int offset = skb->data - skb_mac_header(skb);
> > +
> > +   skb_push(skb, offset);
> > +   memmove(skb->data + VLAN_HLEN, skb->data, 2 * ETH_ALEN);
> > +   skb->mac_header += VLAN_HLEN;
> > +   skb_pull(skb, offset);
> > +   skb_reset_mac_len(skb);
> > +   }
> > switch (p->state) {
> > case BR_STATE_FORWARDING:
> > rhook = rcu_dereference(br_should_route_hook);
> 
> Thanks for finding this. Is this a new thing or has it always been there?

Sorry, I didn't check if this is a regression or not. Seen initially
with RHEL7's kernel-3.10.0-229.7.2, which due to the massive backporting
is by far not as old as it might seem. But it's surely not a brand new
problem of net-next or so.

Since nowadays no sane mind touches REORDER_HDR (there was originally a
bug in NetworkManager which defaulted this to 0), it may very well be
there for a long time already.

> Sorry, this looks so special case it doesn't seem like a good idea.
> Something is broken in VLAN handling if this is required.

It is so ugly, I wish I had found a better way to fix the problem. Well,
maybe I miss something:

- packet enters __netif_receive_skb_core():
  - skb->protocol is set to ETH_P_8021Q, so:
- packet is untagged
- skb->vlan_tci set
- skb->protocol set to 'real' protocol
  - skb_vlan_tag_present(skb) == true, so:
- vlan_do_receive() is called:
  - tags the packet again
  - zeroes vlan_tci
- goto another_round
- __netif_receive_skb_core(), round 2:
  - skb->protocol is not ETH_P_8021Q -> no untagging
  - skb_vlan_tag_present(skb) == false -> no vlan_do_receive()
  - rx_handler handler (== br_handle_frame) is called

IMO the root of all evil is the existence of REORDER_HDR itself. It
causes an skb which should have been untagged to being passed along with
VLAN header present and code dealing with it needs to clean up the mess.

Cheers, Phil
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[net-next PATCH] net: bridge: fix for bridging 802.1Q without REORDER_HDR

2015-09-11 Thread Phil Sutter
When forwarding packets from an 802.1Q interface with REORDER_HDR set to
zero, the VLAN header previously inserted by vlan_do_receive() needs to
be stripped from the packet and the mac_header adjustment undone,
otherwise a tagged frame with first four bytes missing will be
transmitted.

Signed-off-by: Phil Sutter 
---
 net/bridge/br_input.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index f921a5d..e4e3fc7 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -288,6 +288,16 @@ rx_handler_result_t br_handle_frame(struct sk_buff **pskb)
}
 
 forward:
+   if (is_vlan_dev(skb->dev) &&
+   !(vlan_dev_priv(skb->dev)->flags & VLAN_FLAG_REORDER_HDR)) {
+   unsigned int offset = skb->data - skb_mac_header(skb);
+
+   skb_push(skb, offset);
+   memmove(skb->data + VLAN_HLEN, skb->data, 2 * ETH_ALEN);
+   skb->mac_header += VLAN_HLEN;
+   skb_pull(skb, offset);
+   skb_reset_mac_len(skb);
+   }
switch (p->state) {
case BR_STATE_FORWARDING:
rhook = rcu_dereference(br_should_route_hook);
-- 
2.1.2

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 3/3] devicetree: macb: Add optional property tsu-clk

2015-09-11 Thread Sören Brinkmann
Hi Harini,

On Fri, 2015-09-11 at 01:27PM +0530, Harini Katakam wrote:
> Add TSU clock frequency to be used for 1588 support in macb driver.
> 
> Signed-off-by: Harini Katakam 
> ---
>  Documentation/devicetree/bindings/net/macb.txt |3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/Documentation/devicetree/bindings/net/macb.txt 
> b/Documentation/devicetree/bindings/net/macb.txt
> index b5d7976..f7c0ea8 100644
> --- a/Documentation/devicetree/bindings/net/macb.txt
> +++ b/Documentation/devicetree/bindings/net/macb.txt
> @@ -19,6 +19,9 @@ Required properties:
>   Optional elements: 'tx_clk'
>  - clocks: Phandles to input clocks.
>  
> +Optional properties:
> +- tsu-clk: Time stamp unit clock frequency used.

Why are we not using the CCF and a clk_get_rate() in the driver?

Sören
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] net: stmmac: Use msleep rather then udelay for reset delay

2015-09-11 Thread Sjoerd Simons
The reset delays used for stmmac are in the order of 10ms to 1 second,
which is far too long for udelay usage, so switch to using msleep.

Practically this fixes the PHY not being reliably detected in some cases
as udelay wouldn't actually delay for long enough to let the phy
reliably be reset.

Signed-off-by: Sjoerd Simons 

---

Changes in v2:
- Use DIV_ROUND_UP instead of handcoding the same

 drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c 
b/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c
index b735fa2..ebf6abc 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c
@@ -161,11 +161,16 @@ int stmmac_mdio_reset(struct mii_bus *bus)
 
if (!gpio_request(reset_gpio, "mdio-reset")) {
gpio_direction_output(reset_gpio, active_low ? 1 : 0);
-   udelay(data->delays[0]);
+   if (data->delays[0])
+   msleep(DIV_ROUND_UP(data->delays[0], 1000));
+
gpio_set_value(reset_gpio, active_low ? 0 : 1);
-   udelay(data->delays[1]);
+   if (data->delays[1])
+   msleep(DIV_ROUND_UP(data->delays[1], 1000));
+
gpio_set_value(reset_gpio, active_low ? 1 : 0);
-   udelay(data->delays[2]);
+   if (data->delays[2])
+   msleep(DIV_ROUND_UP(data->delays[2], 1000));
}
}
 #endif
-- 
2.5.1

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv2 RFC] RTEXT_FILTER_SKIP_STATS support to avoid dumping inet/inet6 stats

2015-09-11 Thread Raghavendra K T

On 09/11/2015 03:04 AM, Sowmini Varadhan wrote:


Many commonly used functions like getifaddrs() invoke RTM_GETLINK
to dump the interface information, and do not need the
the AF_INET6 statististics that are always returned by default
from rtnl_fill_ifinfo().

Computing the statistics can be an expensive operation that impacts
scaling, so it is desirable to avoid this if the information is
not needed.

This patch adds a the RTEXT_FILTER_SKIP_STATS extended info flag that
can be passed with netlink_request() to avoid statistics computation
for the ifinfo path.

Signed-off-by: Sowmini Varadhan 
---
v2: David Miller comments: pass u32 ext_filter_mask down.

  include/net/rtnetlink.h|3 ++-
  include/uapi/linux/rtnetlink.h |1 +
  net/core/rtnetlink.c   |2 +-
  net/ipv4/devinet.c |3 ++-
  net/ipv6/addrconf.c|   13 +
  5 files changed, 15 insertions(+), 7 deletions(-)


Sowmini, Thanks for the patch which is more cleaner way without
breaking current behaviour.

[ Though RTEXT_FILTER_NEED_STATS flag with reverse effect  would have
helped immediately :)]

/me waits for the RTEXT_FILTER_SKIP_STATS to be supported in
gccgo/golang, so that it can be used in docker newNetlinkRequest() to
exploit this.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] net-sysfs: get_netdev_queue_index() cleanup

2015-09-11 Thread John Fastabend
On 15-09-11 01:08 PM, Thadeu Lima de Souza Cascardo wrote:
> Redo commit ed1acc8cd8c22efa919da8d300bab646e01c2dce.
> 
> Commit 822b3b2ebfff8e9b3d006086c527738a7ca00cd0 ("net: Add max rate tx queue
> attribute") moved get_netdev_queue_index around, but kept the old version.
> Probably because of a reuse of the original patch from before Eric's change to
> that function.
> 
> Remove one inline keyword, and no need for a loop to find
> an index into a table.
> 
> Signed-off-by: Thadeu Lima de Souza Cascardo 
> Fixes: 822b3b2ebfff8e9b3d006086c527738a7ca00cd0
> Cc: Or Gerlitz 
> Cc: John Fastabend 
> Cc: Eric Dumazet 
> ---
> 
> Not sure what is the best way to credit Eric Dumazet here. I assume he will 
> add
> any appropriate tags.
> 

Yep, I guess Or just overlooked this and I missed it in the review when
the patch was submitted.

Thanks!

Acked-by: John Fastabend 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Net: core: stream.c: coding style fixes

2015-09-11 Thread Stephen Hemminger
On Fri, 11 Sep 2015 22:13:00 +0200
Krzysztof Majzerowicz-Jaszcz  wrote:

>   * So we (will) share it here.
>   *
> - * Authors:Arnaldo Carvalho de Melo 
> - * (from old tcp.c code)
> - * Alan Cox  (Borrowed 
> comments 8-))
> + * Authors:Arnaldo Carvalho de Melo 
> + * (from old tcp.c code)
> + * Alan Cox  (Borrowed comments 
> 8-))
>   */
>  
>  #incl

Leave the comments alone.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [fw filter]: Broken! fw mark based tc class selection not working

2015-09-11 Thread Cong Wang
On Fri, Sep 11, 2015 at 9:34 AM, Akshat Kakkar  wrote:
> Recently I came to know that,
> Without any options fw classifier maps fwmark to classid.
>
> tc filter add dev  parent  protocol ip prio 1 fw
>
> i.e. if my packet has mark(0x10001) and class id is not set,
> then above tc filter, will set class id = 0x10001 i.e. 1:1
>
> But when I am trying it out, its not working!
> I am having class 1:1 defined but its not at all hit.

Did you specify a right filter handle? What is your setup?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net] netlink, mmap: transform mmap skb into full skb on taps

2015-09-11 Thread David Miller
From: Daniel Borkmann 
Date: Fri, 11 Sep 2015 22:35:08 +0200

> On 09/11/2015 09:42 PM, David Miller wrote:
>> @@ -2220,7 +2221,8 @@ static inline void skb_orphan(struct sk_buff
>> *skb)
>>*/
>>   static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t
>>   gfp_mask)
>>   {
>> -if (likely(!(skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY)))
>> +if (likely(!(skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) ||
>> +   skb->private_buffers))
> 
> (These two would need to be swapped.)

Right, good catch.

>> -n->destructor = NULL;
>> +if (!skb->private_buffers)
>> +n->destructor = NULL;
>> +else
>> +C(destructor);
>>  C(tail);
>>  C(end);
>>  C(head);
> 
> We would also have to conditionally skip the __skb_clone()'s ...
> 
>   atomic_inc(&(skb_shinfo(skb)->dataref));

Sigh, the solution was too good to be true I guess. :-/ That's right,
we can't touch skb_shinfo for mmap skbs.

Another approach would be to put the mmap user data into a page frag,
but that obviously has some costs associated with it.  However,
nothing in netlink is ready for fragged skbs yet.  It's the reason why
we have the large skb via vmalloc facility.

Long term fixing that is probably the way to go, but that would be an
enormous project.

So for now I'm going to apply your patch Daniel, thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] irda: ali-ircc: Fix deadlock in ali_ircc_sir_change_speed()

2015-09-11 Thread Alexey Khoroshilov
ali_ircc_sir_change_speed() is always called with self->lock held,
so acquiring the lock inside it leads to unavoidable deadlock.

Call graph:
ali_ircc_sir_change_speed() is called from ali_ircc_change_speed()
  ali_ircc_fir_hard_xmit() under spin_lock_irqsave(>lock, flags);
  ali_ircc_sir_hard_xmit() under spin_lock_irqsave(>lock, flags);
  ali_ircc_net_ioctl() under spin_lock_irqsave(>lock, flags);
  ali_ircc_dma_xmit_complete()
ali_ircc_fir_interrupt()
  ali_ircc_interrupt() under spin_lock(>lock);
  ali_ircc_sir_write_wakeup()
ali_ircc_sir_interrupt()
  ali_ircc_interrupt() under spin_lock(>lock);

The patch removes spin_lock/unlock from ali_ircc_sir_change_speed().

Found by Linux Driver Verification project (linuxtesting.org).

Signed-off-by: Alexey Khoroshilov 
---
 drivers/net/irda/ali-ircc.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/drivers/net/irda/ali-ircc.c b/drivers/net/irda/ali-ircc.c
index 58ae11a14bb6..64bb44d5d867 100644
--- a/drivers/net/irda/ali-ircc.c
+++ b/drivers/net/irda/ali-ircc.c
@@ -1031,7 +1031,6 @@ static void ali_ircc_fir_change_speed(struct ali_ircc_cb 
*priv, __u32 baud)
 static void ali_ircc_sir_change_speed(struct ali_ircc_cb *priv, __u32 speed)
 {
struct ali_ircc_cb *self = priv;
-   unsigned long flags;
int iobase; 
int fcr;/* FIFO control reg */
int lcr;/* Line control reg */
@@ -1061,8 +1060,6 @@ static void ali_ircc_sir_change_speed(struct ali_ircc_cb 
*priv, __u32 speed)
/* Update accounting for new speed */
self->io.speed = speed;
 
-   spin_lock_irqsave(>lock, flags);
-
divisor = 115200/speed;

fcr = UART_FCR_ENABLE_FIFO;
@@ -1089,9 +1086,6 @@ static void ali_ircc_sir_change_speed(struct ali_ircc_cb 
*priv, __u32 speed)
/* without this, the connection will be broken after come back from FIR 
speed,
   but with this, the SIR connection is harder to established */
outb((UART_MCR_DTR | UART_MCR_RTS | UART_MCR_OUT2), iobase+UART_MCR);
-   
-   spin_unlock_irqrestore(>lock, flags);
-   
 }
 
 static void ali_ircc_change_dongle_speed(struct ali_ircc_cb *priv, int speed)
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net] bridge: fix igmpv3 / mldv2 report parsing

2015-09-11 Thread David Miller
From: Linus Lüssing 
Date: Fri, 11 Sep 2015 18:39:48 +0200

> With the newly introduced helper functions the skb pulling is hidden in
> the checksumming function - and undone before returning to the caller.
> 
> The IGMPv3 and MLDv2 report parsing functions in the bridge still
> assumed that the skb is pointing to the beginning of the IGMP/MLD
> message while it is now kept at the beginning of the IPv4/6 header,
> breaking the message parsing and creating packet loss.
> 
> Fixing this by taking the offset between IP and IGMP/MLD header into
> account, too.
> 
> Fixes: 9afd85c9e455 ("net: Export IGMP/MLD message validation code")
> Reported-by: Tobias Powalowski 
> Tested-by: Tobias Powalowski 
> Signed-off-by: Linus Lüssing 

There were many serious regressions introduced by that commit, I'm
extremely disappointed.

Patch applied and queued up for -stable, thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net] openvswitch: Fix dependency on IPv6 defrag.

2015-09-11 Thread Pravin Shelar
On Fri, Sep 11, 2015 at 3:01 PM, Joe Stringer  wrote:
> When NF_CONNTRACK is built-in, NF_DEFRAG_IPV6 is a module, and
> OPENVSWITCH is built-in, the following build error would occur:
>
> net/built-in.o: In function `ovs_ct_execute':
> (.text+0x10f587): undefined reference to `nf_ct_frag6_gather'
>
> Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")
> Reported-by: Jim Davis 
> Signed-off-by: Joe Stringer 

Looks good
Acked-by: Pravin B Shelar 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [fw filter]: Broken! fw mark based tc class selection not working

2015-09-11 Thread Akshat Kakkar
There is no handle with fw filter. That's the whole point is. If
handle and class (flow id) is not specified, then whatever be the mark
on the packet, its automatically set as flowid. So if mark is 0x10003,
then this fw filter

tc filter add dev eth0 parent 1:0 protocol ip fw

will cause 0x10003 being set as classid I.e. 1:3.


tc qdisc add dev eth0 root handle 1: htb
tc class add dev eth0 parent 1: classid 1:a htb rate 1mbit
tc class add dev eth0 parent 1: classid 1:b htb rate 1mbit
tc class add dev eth0 parent 1: classid 1:c htb rate 1mbit
tc filter add dev eth0 parent 1:0 protocol ip fw

iptables -t mangle -I OUTPUT -o eth0 -p tcp -j MARK --set-mark 0x1000a
iptables -t mangle -I OUTPUT -o eth0 -p icmp -j MARK --set-mark 0x1000b
iptables -t mangle -I OUTPUT -o eth0 -p udp -j MARK --set-mark 0x1000c
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ebpf: emit correct src_reg for conditional jumps

2015-09-11 Thread David Miller
From: Tycho Andersen 
Date: Thu, 10 Sep 2015 18:25:07 -0600

> Instead of always emitting BPF_REG_X, let's emit BPF_REG_X only when the
> source actually is BPF_X. This causes programs generated by the classic
> converter to not be importable via bpf(), as the eBPF verifier checks that
> the src_reg is correct or 0. While not a problem yet, this will be a
> problem when BPF_PROG_DUMP lands, and we can potentially dump and re-import
> programs generated by the converter.
> 
> Signed-off-by: Tycho Andersen 

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net v2] sctp: fix race on protocol/netns initialization

2015-09-11 Thread David Miller
From: Marcelo Ricardo Leitner 
Date: Thu, 10 Sep 2015 17:31:15 -0300

> Consider sctp module is unloaded and is being requested because an user
> is creating a sctp socket.
> 
> During initialization, sctp will add the new protocol type and then
> initialize pernet subsys:
 ...
> The problem is that after those calls to sctp_v{4,6}_protosw_init(), it
> is possible for userspace to create SCTP sockets like if the module is
> already fully loaded. If that happens, one of the possible effects is
> that we will have readers for net->sctp.local_addr_list list earlier
> than expected and sctp_net_init() does not take precautions while
> dealing with that list, leading to a potential panic but not limited to
> that, as sctp_sock_init() will copy a bunch of blank/partially
> initialized values from net->sctp.
> 
> The race happens like this:
 ...
> Simply inverting the initialization order between
> register_pernet_subsys() and sctp_v4_protosw_init() is not possible
> because register_pernet_subsys() will create a control sctp socket, so
> the protocol must be already visible by then. Deferring the socket
> creation to a work-queue is not good specially because we loose the
> ability to handle its errors.
> 
> So, as suggested by Vlad, the fix is to split netns initialization in
> two moments: defaults and control socket, so that the defaults are
> already loaded by when we register the protocol, while control socket
> initialization is kept at the same moment it is today.
> 
> Fixes: 4db67e808640 ("sctp: Make the address lists per network namespace")
> Signed-off-by: Vlad Yasevich 
> Signed-off-by: Marcelo Ricardo Leitner 

Applied and queued up for -stable, thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 net 1/5] ipv6: Refactor common ip6gre_tunnel_init codes

2015-09-11 Thread David Miller
From: Martin KaFai Lau 
Date: Fri, 11 Sep 2015 11:06:17 -0700

> @@ -1460,19 +1474,16 @@ static void ip6gre_netlink_parms(struct nlattr 
> *data[],
>  static int ip6gre_tap_init(struct net_device *dev)
>  {
>   struct ip6_tnl *tunnel;
> + int ret;
>  
> - tunnel = netdev_priv(dev);
> + ret = ip6gre_tunnel_init_common(dev);
> + if (ret)
> + return ret;
>  
> - tunnel->dev = dev;
> - tunnel->net = dev_net(dev);
> - strcpy(tunnel->parms.name, dev->name);
> + tunnel = netdev_priv(dev);
>  
>   ip6gre_tnl_link_config(tunnel, 1);
>  
> - dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
> - if (!dev->tstats)
> - return -ENOMEM;
> -
>   return 0;
>  }

Is this really equivalent?

The existing code for GRE tap device initialization would use whatever
ether_setup() left in the broadcast address, it would leave
dev->dev_addr alone, and unconditionally use eth_header_ops.

You are changing behavior here, and it's been like this has been this
way long enough that I can't see clearly whether this is a valid
change or not.  It probably is not.

Either way, even if it is valid, you have to document what is happening
here and why it's ok.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 net 1/5] ipv6: Refactor common ip6gre_tunnel_init codes

2015-09-11 Thread Martin KaFai Lau
On Fri, Sep 11, 2015 at 03:30:59PM -0700, David Miller wrote:
> From: Martin KaFai Lau 
> Date: Fri, 11 Sep 2015 11:06:17 -0700
>
> > @@ -1460,19 +1474,16 @@ static void ip6gre_netlink_parms(struct nlattr 
> > *data[],
> >  static int ip6gre_tap_init(struct net_device *dev)
> >  {
> > struct ip6_tnl *tunnel;
> > +   int ret;
> >
> > -   tunnel = netdev_priv(dev);
> > +   ret = ip6gre_tunnel_init_common(dev);
> > +   if (ret)
> > +   return ret;
> >
> > -   tunnel->dev = dev;
> > -   tunnel->net = dev_net(dev);
> > -   strcpy(tunnel->parms.name, dev->name);
> > +   tunnel = netdev_priv(dev);
> >
> > ip6gre_tnl_link_config(tunnel, 1);
> >
> > -   dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
> > -   if (!dev->tstats)
> > -   return -ENOMEM;
> > -
> > return 0;
> >  }
>
> Is this really equivalent?
>
> The existing code for GRE tap device initialization would use whatever
> ether_setup() left in the broadcast address, it would leave
> dev->dev_addr alone, and unconditionally use eth_header_ops.
The new ip6gre_tunnel_init_common() does not set the dev->dev_addr
or dev->header_ops.  The change in ip6gre_tap_init() is to initialize
dev->tstats before ip6gre_tnl_link_config().

or I am missing something and have overlooked a bug?

Thanks,
Martin
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: randconfig build error with next-20150911, in net/openvswitch

2015-09-11 Thread Joe Stringer
On 11 September 2015 at 09:53, Jim Davis  wrote:
> Building with the attached random configuration file,
>
> net/built-in.o: In function `ovs_ct_execute':
> (.text+0x10f587): undefined reference to `nf_ct_frag6_gather'

Thanks for the report, I sent a patch:
https://patchwork.ozlabs.org/patch/517033/
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V5 net-next 1/2] net: introduce socket family constants

2015-09-11 Thread David Miller
From: Ursula Braun 
Date: Fri, 11 Sep 2015 14:09:09 +0200

> @@ -0,0 +1,13 @@
> +/*
> + * SMC   Definitions for the SMC protocol.
> + *
> + * Author:   Ursula Braun 
> + */
> +#ifndef _SMC_H
> +#define _SMC_H
> +
> +/* SMC socket options - disjunct with TCP socket options */
> +#define SMC_KEEPALIVE99  /* start/stop keepalives */
> +
> +#endif /* _SMC_H */
> +

Trailing empty lines in files always generate warnings when such
patches are attempted to be applied by GIT.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net] openvswitch: Fix dependency on IPv6 defrag.

2015-09-11 Thread Joe Stringer
When NF_CONNTRACK is built-in, NF_DEFRAG_IPV6 is a module, and
OPENVSWITCH is built-in, the following build error would occur:

net/built-in.o: In function `ovs_ct_execute':
(.text+0x10f587): undefined reference to `nf_ct_frag6_gather'

Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")
Reported-by: Jim Davis 
Signed-off-by: Joe Stringer 
---
 net/openvswitch/Kconfig | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig
index 2a071f4..d143aa9 100644
--- a/net/openvswitch/Kconfig
+++ b/net/openvswitch/Kconfig
@@ -5,7 +5,8 @@
 config OPENVSWITCH
tristate "Open vSwitch"
depends on INET
-   depends on (!NF_CONNTRACK || NF_CONNTRACK)
+   depends on !NF_CONNTRACK || \
+  (NF_CONNTRACK && (!NF_DEFRAG_IPV6 || NF_DEFRAG_IPV6))
select LIBCRC32C
select MPLS
select NET_MPLS_GSO
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] bnx2x: use ktime_get_seconds() for timestamp

2015-09-11 Thread David Miller
From: Arnd Bergmann 
Date: Fri, 11 Sep 2015 11:33:01 +0200

> commit c48f350ff5e7 "bnx2x: Add MFW dump support" added the
> bnx2x_update_mfw_dump() function that reads the current time and stores
> it in a 32-bit field that gets passed into a buffer in a fixed format.
> 
> This is potentially broken when the epoch overflows in 2038, and
> otherwise overflows in 2106. As we're trying to avoid uses of
> struct timeval for this reason, I noticed the addition of this
> function, and tried to rewrite it in a way that is more explicit
> about the overflow and that will keep working once we deprecate
> struct timeval.
> 
> I assume that it is not possible to change the ABI any more, otherwise
> we should try to use a 64-bit field for the seconds right away.
> 
> Signed-off-by: Arnd Bergmann 

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net] netlink, mmap: transform mmap skb into full skb on taps

2015-09-11 Thread Daniel Borkmann

On 09/11/2015 11:34 PM, David Miller wrote:
...

Another approach would be to put the mmap user data into a page frag,
but that obviously has some costs associated with it.  However,
nothing in netlink is ready for fragged skbs yet.  It's the reason why
we have the large skb via vmalloc facility.

Long term fixing that is probably the way to go, but that would be an
enormous project.


Yeah, that could resolve it.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net] openvswitch: Fix dependency on IPv6 defrag.

2015-09-11 Thread David Miller
From: Joe Stringer 
Date: Fri, 11 Sep 2015 15:01:16 -0700

> When NF_CONNTRACK is built-in, NF_DEFRAG_IPV6 is a module, and
> OPENVSWITCH is built-in, the following build error would occur:
> 
> net/built-in.o: In function `ovs_ct_execute':
> (.text+0x10f587): undefined reference to `nf_ct_frag6_gather'
> 
> Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")
> Reported-by: Jim Davis 
> Signed-off-by: Joe Stringer 

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] irda: ali-ircc: Fix deadlock in ali_ircc_sir_change_speed()

2015-09-11 Thread David Miller
From: Alexey Khoroshilov 
Date: Sat, 12 Sep 2015 00:34:48 +0300

> ali_ircc_sir_change_speed() is always called with self->lock held,
> so acquiring the lock inside it leads to unavoidable deadlock.
> 
> Call graph:
> ali_ircc_sir_change_speed() is called from ali_ircc_change_speed()
>   ali_ircc_fir_hard_xmit() under spin_lock_irqsave(>lock, flags);
>   ali_ircc_sir_hard_xmit() under spin_lock_irqsave(>lock, flags);
>   ali_ircc_net_ioctl() under spin_lock_irqsave(>lock, flags);
>   ali_ircc_dma_xmit_complete()
> ali_ircc_fir_interrupt()
>   ali_ircc_interrupt() under spin_lock(>lock);
>   ali_ircc_sir_write_wakeup()
> ali_ircc_sir_interrupt()
>   ali_ircc_interrupt() under spin_lock(>lock);
> 
> The patch removes spin_lock/unlock from ali_ircc_sir_change_speed().
> 
> Found by Linux Driver Verification project (linuxtesting.org).
> 
> Signed-off-by: Alexey Khoroshilov 

Looks good, applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 net 1/5] ipv6: Refactor common ip6gre_tunnel_init codes

2015-09-11 Thread David Miller
From: Martin KaFai Lau 
Date: Fri, 11 Sep 2015 16:20:26 -0700

> or I am missing something and have overlooked a bug?

My bad, I simply misread your patch.

Sorry about that.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 4/5] seccomp: add a way to access filters via bpf fds

2015-09-11 Thread Daniel Borkmann

On 09/11/2015 02:21 AM, Tycho Andersen wrote:

This patch adds a way for a process that is "real root" to access the
seccomp filters of another process. The process first does a
PTRACE_SECCOMP_GET_FILTER_FD to get an fd with that process' seccomp filter
attached, and then iterates on this with PTRACE_SECCOMP_NEXT_FILTER using
bpf(BPF_PROG_DUMP) to dump the actual program at each step.

Signed-off-by: Tycho Andersen 
CC: Kees Cook 
CC: Will Drewry 
CC: Oleg Nesterov 
CC: Andy Lutomirski 
CC: Pavel Emelyanov 
CC: Serge E. Hallyn 
CC: Alexei Starovoitov 
CC: Daniel Borkmann 

[...]

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 58ae9f4..ac3ed1c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -506,6 +506,30 @@ struct bpf_prog *bpf_prog_get(u32 ufd)
  }
  EXPORT_SYMBOL_GPL(bpf_prog_get);

+int bpf_prog_set(u32 ufd, struct bpf_prog *new)
+{
+   struct fd f;
+   struct bpf_prog *prog;
+
+   f = fdget(ufd);
+
+   prog = get_prog(f);
+   if (!IS_ERR(prog) && prog)
+   bpf_prog_put(prog);
+
+   atomic_inc(>aux->refcnt);
+   f.file->private_data = new;
+   fdput(f);
+   return 0;


So in case get_prog() fails, and for example f.file is infact NULL,
you assign the bpf prog then to ERR_PTR(-EBADF)'s private_data? :(


+}
+EXPORT_SYMBOL_GPL(bpf_prog_set);
+
+int bpf_new_fd(struct bpf_prog *prog, int flags)
+{
+   return anon_inode_getfd("bpf-prog", _prog_fops, prog, flags);
+}
+EXPORT_SYMBOL_GPL(bpf_new_fd);


Any reason why these two need to be exported for modules? Which
modules are using them?

I think modules should probably not mess with this.

If you already name it generic, it would also be good if bpf_new_fd()
is used in case of maps that call anon_inode_getfd(), too.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC v2 net-next 1/4] phylib: Add phy_set_max_speed helper

2015-09-11 Thread Sergei Shtylyov

Hello.

On 9/11/2015 5:01 AM, Simon Horman wrote:


Add a helper to allow ethernet drivers to limit the speed of a phy
(that they are attached to).

This mainly involves factoring out the business-end of
of_set_phy_supported() and exporting a new symbol.

This code seems to be open coded in several places, in several different
variants.

This code is envisaged this will be used in situations where setting
the "max-speed" property is not appropriate, e.g. because the maximum
speed is not a property of the phy hardware.

Signed-off-by: Simon Horman 

---

v2
* First post
---
  drivers/net/phy/phy_device.c | 52 
  include/linux/phy.h  |  1 +
  2 files changed, 34 insertions(+), 19 deletions(-)

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index c0f211127274..d9a020095972 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -1205,6 +1205,37 @@ static int gen10g_resume(struct phy_device *phydev)
return 0;
  }

+static void __set_phy_supported(struct phy_device *phydev, u32 max_speed)
+{
+   if (!IS_ENABLED(CONFIG_OF_MDIO))
+   return;
+
+   /* The default values for phydev->supported are provided by the PHY
+* driver "features" member, we want to reset to sane defaults fist


   s/fist/first/.


+* before supporting higher speeds.
+*/
+   phydev->supported &= PHY_DEFAULT_FEATURES;
+
+   switch (max_speed) {
+   default:
+   return;
+


   I don't think empty line is needed here.


+   case SPEED_1000:
+   phydev->supported |= PHY_1000BT_FEATURES;


   Need a comment like /* Fall thru */.


+   case SPEED_100:
+   phydev->supported |= PHY_100BT_FEATURES;


   And here.


+   case SPEED_10:
+   phydev->supported |= PHY_10BT_FEATURES;
+   }
+}

[...]

MBR, Sergei

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 3/5] ebpf: add a way to dump an eBPF program

2015-09-11 Thread Daniel Borkmann

On 09/11/2015 02:21 AM, Tycho Andersen wrote:

This commit adds a way to dump eBPF programs. The initial implementation
doesn't support maps, and therefore only allows dumping seccomp ebpf
programs which themselves don't currently support maps.

v2: don't export a prog_id for the filter

Signed-off-by: Tycho Andersen 
CC: Kees Cook 
CC: Will Drewry 
CC: Oleg Nesterov 
CC: Andy Lutomirski 
CC: Pavel Emelyanov 
CC: Serge E. Hallyn 
CC: Alexei Starovoitov 
CC: Daniel Borkmann 

[...]

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index dc9b464..58ae9f4 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -586,6 +586,44 @@ free_prog:
return err;
  }

+static int bpf_prog_dump(union bpf_attr *attr, union bpf_attr __user *uattr)
+{
+   int ufd = attr->prog_fd;
+   struct fd f = fdget(ufd);
+   struct bpf_prog *prog;
+   int ret = -EINVAL;
+
+   prog = get_prog(f);
+   if (IS_ERR(prog))
+   return PTR_ERR(prog);
+
+   /* For now, let's refuse to dump anything that isn't a seccomp program.
+* Other program types have support for maps, which our current dump
+* code doesn't support.
+*/
+   if (prog->type != BPF_PROG_TYPE_SECCOMP)
+   goto out;


Yep, also when you start adding helper calls (next to map objects) you'd
need to undo kernel pointers that the verifier sets here.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: v2 of seccomp filter c/r patches

2015-09-11 Thread Andy Lutomirski
On Sep 10, 2015 5:22 PM, "Tycho Andersen"  wrote:
>
> Hi all,
>
> Here is v2 of the seccomp filter c/r set. The patch notes have individual
> changes from the last series, but there are two points not noted:
>
> * The series still does not allow us to correctly restore state for programs
>   that will use SECCOMP_FILTER_FLAG_TSYNC in the future. Given that we want to
>   keep seccomp_filter's identity, I think something along the lines of another
>   seccomp command like SECCOMP_INHERIT_PARENT is needed (although I'm not sure
>   if this can even be done yet). In addition, we'll need a kcmp command for
>   figuring out if filters are the same, although this too needs to compare
>   seccomp_filter objects, so it's a little screwy. Any thoughts on how to do
>   this nicely are welcome.

Let's add a concept of a seccompfd.

For background of what I want to add: I want to be able to create a
seccomp monitor.  A seccomp monitor will be, logically, a pair of a
struct file that represents the monitor and a seccomp_filter that is
controlled by the monitor.  Depending on flags, whoever holds the
monitor fd could change the active filter, intercept syscalls, and
issue syscalls on behalf of a process that is trapped in an
intercepted syscall.

Seccomp filters would nest properly.

The interface would probably be (extremely pseudocoded):

monitor_fd, filter_fd = seccomp(CREATE_MONITOR, flags, ...);

Then, later:

seccomp(ATTACH_TO_FILTER, filter_fd);  /* now filtered */

read(monitor_fd, buf, size); /* returns an intercepted syscall */
write(monitor_fd, buf, size); /* issues a syscall or releases the
trapped task */

This can't be implemented on x86 without either going insane or
finishing the massive set of pending cleanups to the x86 entry code.
I favor the latter.

We could, however, add part of it right now: we could have a way to
create a filterfd, we could add kcmp support for it, and we could add
the ATTACH_TO_FILTER thing.  I think that would solve your problem.

One major open question: does a filter_fd know what its parent is and,
if so, will it just refuse to attach if the caller's parent is wrong?
Or will a filter_fd attach anywhere.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 4/5] seccomp: add a way to access filters via bpf fds

2015-09-11 Thread Tycho Andersen
On Fri, Sep 11, 2015 at 09:20:55AM -0700, Andy Lutomirski wrote:
> On Sep 10, 2015 5:22 PM, "Tycho Andersen"  
> wrote:
> >
> > This patch adds a way for a process that is "real root" to access the
> > seccomp filters of another process. The process first does a
> > PTRACE_SECCOMP_GET_FILTER_FD to get an fd with that process' seccomp filter
> > attached, and then iterates on this with PTRACE_SECCOMP_NEXT_FILTER using
> > bpf(BPF_PROG_DUMP) to dump the actual program at each step.
> >
> 
> > +
> > +   fd = bpf_new_fd(filter->prog, O_RDONLY);
> > +   if (fd > 0)
> > +   atomic_inc(>prog->aux->refcnt);
> 
> Why isn't this folded into bpf_new_fd?

No reason it can't be as far as I can see. I'll make the change for
the next version.

> > +
> > +   return fd;
> > +}
> > +
> > +long seccomp_next_filter(struct task_struct *child, u32 fd)
> > +{
> > +   struct seccomp_filter *cur;
> > +   struct bpf_prog *prog;
> > +   long ret = -ESRCH;
> > +
> > +   if (!capable(CAP_SYS_ADMIN))
> > +   return -EACCES;
> > +
> > +   if (child->seccomp.mode != SECCOMP_MODE_FILTER)
> > +   return -EINVAL;
> > +
> > +   prog = bpf_prog_get(fd);
> > +   if (IS_ERR(prog)) {
> > +   ret = PTR_ERR(prog);
> > +   goto out;
> > +   }
> > +
> > +   for (cur = child->seccomp.filter; cur; cur = cur->prev) {
> > +   if (cur->prog == prog) {
> > +   if (!cur->prev)
> > +   ret = -ENOENT;
> > +   else
> > +   ret = bpf_prog_set(fd, cur->prev->prog);
> 
> This lets you take an fd pointing to one prog and point it elsewhere.
> I'm not sure that's a good idea.

That's how the interface was designed (calling ptrace(NEXT_FILTER, fd) and
then doing bpf(DUMP, fd)). I suppose we could have NEXT_FILTER return
a new fd instead if that seems better to you.

Tycho
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/5] seccomp: make underlying bpf ref counted as well

2015-09-11 Thread Daniel Borkmann

On 09/11/2015 07:33 PM, Tycho Andersen wrote:

On Fri, Sep 11, 2015 at 06:03:59PM +0200, Daniel Borkmann wrote:

On 09/11/2015 04:44 PM, Tycho Andersen wrote:

On Fri, Sep 11, 2015 at 03:02:36PM +0200, Daniel Borkmann wrote:

On 09/11/2015 02:20 AM, Tycho Andersen wrote:

In the next patch, we're going to add a way to access the underlying
filters via bpf fds. This means that we need to ref-count both the
struct seccomp_filter objects and the struct bpf_prog objects separately,
in case a process dies but a filter is still referred to by another
process.

Additionally, we mark classic converted seccomp filters as seccomp eBPF
programs, since they are a subset of what is supported in seccomp eBPF.

Signed-off-by: Tycho Andersen 
CC: Kees Cook 
CC: Will Drewry 
CC: Oleg Nesterov 
CC: Andy Lutomirski 
CC: Pavel Emelyanov 
CC: Serge E. Hallyn 
CC: Alexei Starovoitov 
CC: Daniel Borkmann 
---
  kernel/seccomp.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 245df6b..afaeddf 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -378,6 +378,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct 
sock_fprog *fprog)
}

atomic_set(>usage, 1);
+   atomic_set(>prog->aux->refcnt, 1);
+   sfilter->prog->type = BPF_PROG_TYPE_SECCOMP;


So, if you do this, then this breaks the assumption of eBPF JITs
that, currently, all classic converted BPF programs always have a
prog->type of BPF_PROG_TYPE_UNSPEC (see: bpf_prog_was_classic()).

Currently, JITs make use of this information to determine whether
A and X mappings for such programs should or should not be cleared
in the prologue (s390 currently).

In the seccomp_prepare_filter() stage, we're already past that, so
it will not cause an issue, but we certainly would need to be very
careful in future, if bpf_prog_was_classic() is then used at a later
stage when we already have a generated bpf_prog somewhere, as then
this assumption will break.


The only reason we need to do this is to allow BPF_DUMP_PROG to work,
since we were restricting it to only allow dumping of seccomp
programs, since those don't have maps. Instead, perhaps we could allow
dumping of BPF_PROG_TYPE_SECCOMP and BPF_PROG_TYPE_UNSPEC?


There are possibilities that BPF_PROG_TYPE_UNSPEC is calling helpers
already today, at least in networking case, not seccomp. So, since
you want to export [classic -> eBPF] only for seccomp, put fds on them
and dump these via bpf(2), you could allow that (with a big comment
stating why it's safe), but mid-term we really need to sanitize all
this stuff properly as this is needed for other types, too.


Sorry, just to be clear, you're suggesting that the patch is ok modulo
a comment describing the jit issues?


I think due to the given insns restrictions on classic seccomp, this
could work for "most cases" (see below) for the time being until pointer
sanitation is resolved and that seccomp-only restriction from the dump
could be removed, BUT there's one more stone in the road which you still
need to take care of with this whole 'giving classic seccomp-BPF -> eBPF
transforms an fd, dumping and restoring that via bpf(2)' approach:

If you have JIT enabled on ARM32, and add a classic seccomp-BPF filter,
and dump that via your bpf(2) interface based on the current patches, what
you'll get is not eBPF opcodes but classic (!) BPF opcodes as ARM32 classic
JIT supports compilation of seccomp, since commit 24e737c1ebac ("ARM: net:
add JIT support for loads from struct seccomp_data.").

So in that case, bpf_prepare_filter() will not call into bpf_migrate_filter()
as there's simply no need for it, because the classic code could already
be JITed there. I guess other archs where JIT support for eBPF in not yet
within near sight might sooner or later support this insn for their classic
JITs, too ...
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] net: irda: pxaficp_ir: convert to readl and writel

2015-09-11 Thread Petr Cvek
Dne 11.9.2015 v 08:18 Robert Jarzmik napsal(a):
> Petr Cvek  writes:
> 
>>> Should have been posted to linux arm kernel mailing list, unless my mailer
>>> failed ...
>>>
>> Searching for:
>>
>>  "ARM: pxa: add resources to pxaficp_ir"
>>
>> did not found anything, same was for "ficp" in the 
>> linux-arm-kernel/netdev/linux-kernel
>> mailing list archive.
> Ah ok, I'll resend it then.
> 
 BTW This patch required update of my kernel repo. It seems that my:

magician.c patches + ficp patch + new dma engine

 does not work for me at all. Kernel throws some panic about interrupts and 
 then
 it ends in an infinite stack dumping loop. Fault occurs before rootfs is
 mounted, so probably around MMC init (with removed SD card it fails 
 normally
 with no rootfs found error).
>>> Could you send me (privately) the stack you're getting please. This is 
>>> something
>>> I'd like to catch up early in the -rc releases.
>>
>> Well this will be problem as I cannot save anything to an SD card after and
>> during the failure.  Only viable interfaces would be earlycon on an infraport
>> or high speed camera on LCD :-).
> Ah just as on my mioa701. I ended up soldering a JTAG cable :)
Rather not :-D.

> 
>>> And it is true I have not tested the rootfs special case, where drivers are 
>>> not
>>> yet initialized (and more specifically gpio and interrupt chip). Your 
>>> backtrace
>>> should tell me if you fall into this category of issues ... but I digress, 
>>> this
>>> has no link with pxaficp.
>>
>> Should I start new thread? (same bug can be present in the FICP too)
> Yes, this pxamci bothers me, it deserves a thread.

Will start soon.

>>
>> Anyway after mmc dma revert I was still not able to start FICP. There is an 
>> error:
>>
>>  Unable to handle kernel paging request at virtual address 32e4
>>
>> from pxa_irda_startup() and it seems it is caused by register definitions. 
>> For example:
>>
>>  writel_relaxed((val), (irda)->stuart_base + (off));
>>
>> is called by
>>
>>  stuart_writel(si, 0, STIER);
>>
>> but STIER is not just an offset, but full register address:
>>  
>>  __REG(0x4074)
>>
>> So the definition should be changed, unless there is another patch I did not
>> received (in that case, send me full patchset again please) :-).
> Agreed, this is a bug in this patch. With this fix, is the pxaficp working or 
> do
> you need a bit more time to experiment ?

I have tried with a nasty hack (use only lower part of address, it should equal 
with reg offset):
#undef __REG
-#define __REG(x) (x)
+#define __REG(x) (x & 0x)

and it seems to work. The module inits and I am able to see IrDA traffic and 
ping other machine. FIR mode (mostly impacted by DMA) is still untested as 
magician unfortunately supports only SIR mode.

Petr
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net] netlink, mmap: transform mmap skb into full skb on taps

2015-09-11 Thread Daniel Borkmann

On 09/11/2015 07:11 AM, David Miller wrote:
...

Looking more deeply into this, I think we have the same exact problem
with netlink skbs that use vmalloc memory at skb->head.


Yes, agreed, the test in the patch covered those as well via:

  if (netlink_skb_is_mmaped(skb) || is_vmalloc_addr(skb->head))


We have a special hack, but only when netlink_skb_clone() is used,
to preserve the destructor.  But these skbs can escape anywhere
and be eventually cloned as we have seen with the mmap stuff.


Yes, it looks like they are currently only used from user space to
kernel space. I saw that 3a36515f7294 ("netlink: fix splat in skb_clone
with large messages") fixed a couple of these in upper layers with
regards to large skbs, so there's a chance that this can be overseen
rather easily as well in other places and then only come to light in
cases where we allocate more than NLMSG_GOODSIZE, so we don't actually
use the alloc_skb() path. :/ So I like your idea below!


I'm wondering if we should do something more precise to fix this,
and in a way that handles both the mmap and vmalloc cases.


Perhaps it might also be useful if the kernel would one day want to
use netlink_alloc_large_skb() for answers back to user space, or in
places where we use network skbs (haven't looked into it with regards
to this).

Some more comments below:


diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 2738d35..77b804c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -584,8 +584,9 @@ struct sk_buff {
fclone:2,
peeked:1,
head_frag:1,
-   xmit_more:1;
-   /* one bit hole */
+   xmit_more:1,
+   clone_preserves_destructor;


( Nit: maybe as clone_preserves_destructor:1 )


+
kmemcheck_bitfield_end(flags1);

/* fields enclosed in headers_start/headers_end are copied
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index dad4dd3..4a7b8e3 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -825,7 +825,8 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, 
struct sk_buff *skb)
n->hdr_len = skb->nohdr ? skb_headroom(skb) : skb->hdr_len;
n->cloned = 1;
n->nohdr = 0;
-   n->destructor = NULL;
+   if (!skb->clone_preserves_destructor)
+   n->destructor = NULL;


I think we also need here:

else
C(destructor);


C(tail);
C(end);
C(head);

[...]

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 7f86d3b..214f1a1 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -719,6 +719,7 @@ static void netlink_ring_setup_skb(struct sk_buff *skb, 
struct sock *sk,
skb->end = skb->tail + size;
skb->len = 0;

+   skb->clone_preserves_destructor = 1;
skb->destructor  = netlink_skb_destructor;
NETLINK_CB(skb).flags |= NETLINK_SKB_MMAPED;
NETLINK_CB(skb).sk = sk;
@@ -854,6 +855,14 @@ static void netlink_ring_set_copied(struct sock *sk, 
struct sk_buff *skb)
  #define netlink_mmap_sendmsg(sk, msg, dst_portid, dst_group, scm) 0
  #endif /* CONFIG_NETLINK_MMAP */


One more thing that came to my mind. For netlink mmap skbs, the
skb->clone_preserves_destructor is actually not enough.

Already calling into skb_clone() is an issue itself, as the data area is
user space buffer, and skb_clone() as well as skb_copy() access skb_shinfo()
area. :/ So in that regard netlink mmap skbs are even further restrictive
on what we can do than netlink large skbs.

Best,
Daniel
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH V5 net-next 0/2] net: implement SMC-R solution

2015-09-11 Thread Ursula Braun
From: Ursula Braun 

Dave,

this is V5 of my SMC-R patches taking care about your V4 comments:
getting rid of tcp_set_keepalive() and enforcing inverse christmas
tree ordering for local variables.
Since you are asking for a solution "100% in our own separate module
with our own can of worms", we have to give up the transparent detection
whether a communication peer can do SMC-R or not (this has been the
purpose of the rejected TCP hooks). Instead, we want just the new
self-contained SMC-R socket family added to the kernel.
By the way, since August 2015 the SMC-R Informational RFC is no longer
a draft, but published as RFC7609.

V5 changes:
1. Do no longer invoke tcp_set_keepalive() to implement setsockopt
   SO_KEEPALIVE.
2. Enforce inverse christmas tree ordering for all declarations of local
   function variables.
3. switch back to TCP if IPSEC is needed
4. fix dangling sockets
5. make sure link groups are freed before module unload

V4 changes:
1. Remove tcp patches supporting TCP experimental options
2. Remove references to tcp_sock syn_smc flag in smc-code, since TCP
   experimental options are not supported by the Linux-tcp.
3. clc_wait_msg() simplified

V3 changes:
1. Avoid adding of new space for smc-related bits in the tcp structures.
2. Make the smc feature to be nearly zero cost using Static Keys / jump
   labels
3. Increase / decrease smc static key in the smc-code
4. Make sure the next-to-last patch does not break the build
5. Additional pnet table checking

V2 changes:
1. activate tcp changes for CONFIG_AFSMC only (as suggested by Eric Dumazet)
2. add additional hook in net/core/sock.c
3. fix bitfield endianness problem

Thanks,
Ursula

In 2013, IBM introduced an optimized communications solution for the
IBM zEnterprise EC12 and BC12 (s390 in Linux terminology) that is
comprised of the IBM 10GbE RoCE Express feature with Shared Memory
Communications-RDMA (SMC-R) protocol [1].
SMC-R is designed for the enterprise data center environment and is an open
protocol as specified in the informational RFC7609 [2]. It has been
published in August 2015. Another implementation of this protocol is
available since 2013 with IBM z/OS Version 2 Release 1. 

SMC-R provides a “sockets over RDMA” solution that leverages industry
standard RDMA over Converged Ethernet (RoCE) technology.

IBM has developed a Linux implementation of the SMC-R standard. A new
socket protocol family AF_SMC is introduced. A preload library can be used
to enable TCP-based applications to use SMC-R without changes. 

Key aspects of SMC-R are: 
1. Provides optimized performance compared to standard TCP/IP over Ethernet
   within the data center for both request/response (latency) and streaming
   workloads (CPU savings) [3]. 
   Initial benchmarks on Linux on x86 processors have shown latency
   reduction of up to 52% with a throughput gain of 111% using SMC-R vs TCP
   for request/response message patterns (10 concurrent TCP connections
   with 16KBmessages) and CPU savings of up to 69% for streaming data
   patterns (single TCP connection with 20MB of data in one direction).
   [1] is currently updated to contain more detailed information on Linux
   and performance.
2. In order to preserve the traditional network administrative model the
   SMC-R protocol ties into the existing IP addresses and uses TCP's
   handshake to establish connections. This allows existing management
   tools and security infrastructure to control the creation of SMC
   connections.
3. The SMC-R protocol logically bonds multiple RoCE adapters together
   providingredundancy with transparent fail-over for improved high
   availability, increased bandwidth and load balancing across multiple
   RDMA-capable devices.
Without the rejected TCP Experimental Options the following aspects are
restricted; alternate solutions are in discussion. 
4. Due to its handshake protocol, SMC-R is compatible with (transparent to)
   existing TCP connection load balancers that are commonly used in the
   enterprise data center environment for multi-tier application workloads.
5. SMC-R's handshake protocol allows for transparent fallback to TCP/IP,
   should one of the peers not be capable of the protocol.

Additional SMC-R overview and reference materials are available [1].  

The SMC-R “rendezvous" protocol eliminates the need for RDMA-CM and the
exchange occurs through an initial TCP connection. Building on a TCP
connection to establish an SMC-R connection solves many key requirements.
The rendezvous process occurs now in 1 phase only: 
1. TCP/IP 3-way exchange with TCP experimental options is skipped.
2. SMC-R 3-way exchange:
   It is assumed both partners indicate SMC-R capability. Then at the
   completion of the 3-way TCP handshake the SMC-R layers in each peer take
   control of the TCP connection and exchange their RDMA credentials. If
   this 3-way exchange completes successfully the connection continues using
   SMC-R. If the 

Re: [PATCH v2 4/5] seccomp: add a way to access filters via bpf fds

2015-09-11 Thread Michael Kerrisk (man-pages)
HI Tycho

On 11 September 2015 at 02:21, Tycho Andersen
 wrote:
> This patch adds a way for a process that is "real root" to access the
> seccomp filters of another process. The process first does a
> PTRACE_SECCOMP_GET_FILTER_FD to get an fd with that process' seccomp filter
> attached, and then iterates on this with PTRACE_SECCOMP_NEXT_FILTER using
> bpf(BPF_PROG_DUMP) to dump the actual program at each step.

Do you have a man- page patch for this change?

Cheers,

Michael

> Signed-off-by: Tycho Andersen 
> CC: Kees Cook 
> CC: Will Drewry 
> CC: Oleg Nesterov 
> CC: Andy Lutomirski 
> CC: Pavel Emelyanov 
> CC: Serge E. Hallyn 
> CC: Alexei Starovoitov 
> CC: Daniel Borkmann 
> ---
>  include/linux/bpf.h | 12 ++
>  include/linux/seccomp.h | 14 +++
>  include/uapi/linux/ptrace.h |  3 +++
>  kernel/bpf/syscall.c| 26 -
>  kernel/ptrace.c |  7 ++
>  kernel/seccomp.c| 57 
> +
>  6 files changed, 118 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index f57d7fe..bfd9cab 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -162,6 +162,8 @@ void bpf_register_prog_type(struct bpf_prog_type_list 
> *tl);
>  void bpf_register_map_type(struct bpf_map_type_list *tl);
>
>  struct bpf_prog *bpf_prog_get(u32 ufd);
> +int bpf_prog_set(u32 ufd, struct bpf_prog *new);
> +int bpf_new_fd(struct bpf_prog *prog, int flags);
>  void bpf_prog_put(struct bpf_prog *prog);
>  void bpf_prog_put_rcu(struct bpf_prog *prog);
>
> @@ -180,6 +182,16 @@ static inline struct bpf_prog *bpf_prog_get(u32 ufd)
> return ERR_PTR(-EOPNOTSUPP);
>  }
>
> +static inline int bpf_prog_set(u32 ufd, struct bpf_prog *new)
> +{
> +   return -EINVAL;
> +}
> +
> +static inline int bpf_new_fd(struct bpf_prog *prog, int flags)
> +{
> +   return -EINVAL;
> +}
> +
>  static inline void bpf_prog_put(struct bpf_prog *prog)
>  {
>  }
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index a19ddac..41b083c 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -95,4 +95,18 @@ static inline void get_seccomp_filter(struct task_struct 
> *tsk)
> return;
>  }
>  #endif /* CONFIG_SECCOMP_FILTER */
> +
> +#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_CHECKPOINT_RESTORE)
> +extern long seccomp_get_filter_fd(struct task_struct *child);
> +extern long seccomp_next_filter(struct task_struct *child, u32 fd);
> +#else
> +static inline long seccomp_get_filter_fd(struct task_struct *child)
> +{
> +   return -EINVAL;
> +}
> +static inline long seccomp_next_filter(struct task_struct *child, u32 fd)
> +{
> +   return -EINVAL;
> +}
> +#endif /* CONFIG_SECCOMP_FILTER && CONFIG_CHECKPOINT_RESTORE */
>  #endif /* _LINUX_SECCOMP_H */
> diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
> index cf1019e..041c3c3 100644
> --- a/include/uapi/linux/ptrace.h
> +++ b/include/uapi/linux/ptrace.h
> @@ -23,6 +23,9 @@
>
>  #define PTRACE_SYSCALL   24
>
> +#define PTRACE_SECCOMP_GET_FILTER_FD   40
> +#define PTRACE_SECCOMP_NEXT_FILTER 41
> +
>  /* 0x4200-0x4300 are reserved for architecture-independent additions.  */
>  #define PTRACE_SETOPTIONS  0x4200
>  #define PTRACE_GETEVENTMSG 0x4201
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 58ae9f4..ac3ed1c 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -506,6 +506,30 @@ struct bpf_prog *bpf_prog_get(u32 ufd)
>  }
>  EXPORT_SYMBOL_GPL(bpf_prog_get);
>
> +int bpf_prog_set(u32 ufd, struct bpf_prog *new)
> +{
> +   struct fd f;
> +   struct bpf_prog *prog;
> +
> +   f = fdget(ufd);
> +
> +   prog = get_prog(f);
> +   if (!IS_ERR(prog) && prog)
> +   bpf_prog_put(prog);
> +
> +   atomic_inc(>aux->refcnt);
> +   f.file->private_data = new;
> +   fdput(f);
> +   return 0;
> +}
> +EXPORT_SYMBOL_GPL(bpf_prog_set);
> +
> +int bpf_new_fd(struct bpf_prog *prog, int flags)
> +{
> +   return anon_inode_getfd("bpf-prog", _prog_fops, prog, flags);
> +}
> +EXPORT_SYMBOL_GPL(bpf_new_fd);
> +
>  /* last field in 'union bpf_attr' used by this command */
>  #defineBPF_PROG_LOAD_LAST_FIELD kern_version
>
> @@ -572,7 +596,7 @@ static int bpf_prog_load(union bpf_attr *attr)
> if (err < 0)
> goto free_used_maps;
>
> -   err = anon_inode_getfd("bpf-prog", _prog_fops, prog, O_RDWR | 
> O_CLOEXEC);
> +   err = bpf_new_fd(prog, O_RDWR | O_CLOEXEC);
> if (err < 0)
> /* failed to allocate fd */
> goto free_used_maps;
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index c8e0e05..a151c35 

RE: [PATCH] bnx2x: use ktime_get_seconds() for timestamp

2015-09-11 Thread Yuval Mintz
> I assume that it is not possible to change the ABI any more, otherwise
> we should try to use a 64-bit field.

Actually, I did suggest exactly that to our management team,
But this was the decided ABI. 

Acked-by: Yuval Mintz 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next] Revert "net/ipv6: add sysctl option accept_ra_min_hop_limit"

2015-09-11 Thread Florian Westphal
YOSHIFUJI Hideaki  wrote:
> Sabrina Dubroca wrote:
> > 2015-09-10, 14:52:45 +0900, YOSHIFUJI Hideaki wrote:
> >> Sabrina Dubroca wrote:
> >>> Would you agree with a default of 64, as Florian suggested?
> >>
> >> 1 was chosen to restore our behavior before introduction of current
> >> hoplimit check.  I am not in favor of changing that value.
> > 
> > But our old behavior had a security issue, which is why the >= current
> > check was introduced.
> 
> We have the knob to "protect" ourselves now but it has drawbacks no to
> accept lower values than specified.  We can never have ultimate default
> for everybody.  The knob might "mitigate" the issue but once we have
> any rouge routers on our L2, we lose anyway.  So, I do want to keep it
> as-is not to change our traditional behavior.

If that argument is brough forward (and it's a good point!), then the
entire case for rejecting 'low' hoplimit values in first place becomes moot.

If this is an important security issue, then either the sysctl has to be
removed or the default raised to some 'safe' value (32, for example).

If its not a security issue -- and it isn't if we think "1" is a good
default choice -- then we should seriously consider reverting both
the added sysctl and the 'original' commit (6fd99094de2b; "ipv6: Don't
reduce hop limit for an interface").

Cheers,
Florian
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


NFS/TCP/IPv6 acting strangely in 4.2

2015-09-11 Thread Russell King - ARM Linux
I have a recent Marvell Armada 388 board here which uses the mvneta
driver.  I'm seeing some weird effects with NFS with it acting as a
client.  Unfortunately, the board does not have a functional RTC.

The NFS server is the same NFS server that I've used for years with
multiple other clients (it's my main NFS server) and it continues to
support all other clients without any ill effect.  The NFS server
kernel is rather old now, being a 3.x kernel.

The NFS client appears to connect using TCP/IPv6 and initially is
accessible.  Everything appears to work normally, and then the NFS
server appears to stop responding.

Running tcpdump on the NFS server, and then dumping the captured packets
with tshark (because tcpdump appears not to understand IPv6 SYNs on the
NFS port) shows:

  1   0.00 fe80::250:43ff:fe02:201 -> fe80::214:fdff:fe10:4f86 ICMPv6 
Neighbor Solicitation for fe80::214:fdff:fe10:4f86 from 00:50:43:02:02:01
  2   0.000252 fe80::214:fdff:fe10:4f86 -> fe80::250:43ff:fe02:201 ICMPv6 
Neighbor Advertisement fe80::214:fdff:fe10:4f86 (sol)
  3   0.030036 armada388 -> n2100 TCP 843→nfs [SYN] Seq=936803106 Win=28800 
Len=0 MSS=1440 SACK_PERM=1 TSval=892366 TSecr=0 WS=128
  4   0.030409 n2100 -> armada388 TCP nfs→843 [SYN, ACK] Seq=409465870 
Ack=936803107 Win=28560 Len=0 MSS=1440 SACK_PERM=1 TSval=818169117 TSecr=892366 
WS=64
  5   0.030493 armada388 -> n2100 TCP [TCP Port numbers reused] 843→nfs [SYN] 
Seq=936803633 Win=28800 Len=0 MSS=1440 SACK_PERM=1 TSval=892366 TSecr=0 WS=128
  6   0.030699 n2100 -> armada388 TCP nfs→843 [RST, ACK] Seq=0 Ack=936803634 
Win=0 Len=0
  7   0.030766 armada388 -> n2100 TCP 843→nfs [RST] Seq=936803107 Win=0 Len=0
  8   3.040150 armada388 -> n2100 TCP [TCP Port numbers reused] 843→nfs [SYN] 
Seq=983834371 Win=28800 Len=0 MSS=1440 SACK_PERM=1 TSval=892667 TSecr=0 WS=128
  9   3.040467 n2100 -> armada388 TCP nfs→843 [SYN, ACK] Seq=456498252 
Ack=983834372 Win=28560 Len=0 MSS=1440 SACK_PERM=1 TSval=818169418 TSecr=892667 
WS=64
 10   3.040552 armada388 -> n2100 TCP [TCP Port numbers reused] 843→nfs [SYN] 
Seq=983834922 Win=28800 Len=0 MSS=1440 SACK_PERM=1 TSval=892667 TSecr=0 WS=128
 11   3.040771 n2100 -> armada388 TCP nfs→843 [RST, ACK] Seq=0 Ack=983834923 
Win=0 Len=0
 12   3.040845 armada388 -> n2100 TCP 843→nfs [RST] Seq=983834372 Win=0 Len=0
 13   6.050296 armada388 -> n2100 TCP [TCP Port numbers reused] 843→nfs [SYN] 
Seq=1030865629 Win=28800 Len=0 MSS=1440 SACK_PERM=1 TSval=892968 TSecr=0 WS=128
 14   6.050673 n2100 -> armada388 TCP nfs→843 [SYN, ACK] Seq=503532268 
Ack=1030865630 Win=28560 Len=0 MSS=1440 SACK_PERM=1 TSval=818169719 
TSecr=892968 WS=64
 15   6.050761 armada388 -> n2100 TCP [TCP Port numbers reused] 843→nfs [SYN] 
Seq=1030866205 Win=28800 Len=0 MSS=1440 SACK_PERM=1 TSval=892968 TSecr=0 WS=128

It seems rather strange that the Armada388 NFS client is trying to
connect _two_ TCP/IPv6 connections from the same address and the same
port but with different sequence numbers (which suggests that packet
numbers 5, 10 and 15 are the problem ones.

However, what happens with the reset packets is most interesting.
Packet 6 looks like it's resetting the duplicate connection caused by
packet 5.  Packet 7 looks like it's resetting the initial connection
from packet 4.

It seems fairly reproducable, though I haven't worked out exactly what
triggers it.

Is this a known bug?  Any ideas where to start looking?

Thanks.

-- 
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/5] ebpf: add a seccomp program type

2015-09-11 Thread Michael Kerrisk (man-pages)
On 11 September 2015 at 02:20, Tycho Andersen
 wrote:
> seccomp uses eBPF as its underlying storage and execution format, and eBPF
> has features that seccomp would like to make use of in the future. This
> patch adds a formal seccomp type to the eBPF verifier.
>
> The current implementation of the seccomp eBPF type is very limited, and
> doesn't support some interesting features (notably, maps) of eBPF. However,
> the primary motivation for this patchset is to enable checkpoint/restore
> for seccomp filters later in the series, to this limited feature set is ok
> for now.

Hi Tycho,

Seems like a man-pages patch is warranted here also?

Cheers,

Michael


> v2: * don't allow seccomp eBPF programs to call any functions
> * get rid of superfluous seccomp_convert_ctx_access
>
> Signed-off-by: Tycho Andersen 
> CC: Kees Cook 
> CC: Will Drewry 
> CC: Oleg Nesterov 
> CC: Andy Lutomirski 
> CC: Pavel Emelyanov 
> CC: Serge E. Hallyn 
> CC: Alexei Starovoitov 
> CC: Daniel Borkmann 
> ---
>  include/uapi/linux/bpf.h |  1 +
>  net/core/filter.c| 31 +++
>  2 files changed, 32 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 92a48e2..631cdee 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -123,6 +123,7 @@ enum bpf_prog_type {
> BPF_PROG_TYPE_KPROBE,
> BPF_PROG_TYPE_SCHED_CLS,
> BPF_PROG_TYPE_SCHED_ACT,
> +   BPF_PROG_TYPE_SECCOMP,
>  };
>
>  #define BPF_PSEUDO_MAP_FD  1
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 13079f0..faaae67 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -1612,6 +1612,15 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
> }
>  }
>
> +static const struct bpf_func_proto *
> +seccomp_func_proto(enum bpf_func_id func_id)
> +{
> +   /* At some point in the future seccomp filters may grow support for
> +* eBPF functions. For now, these are disabled.
> +*/
> +   return NULL;
> +}
> +
>  static bool __is_valid_access(int off, int size, enum bpf_access_type type)
>  {
> /* check bounds */
> @@ -1662,6 +1671,17 @@ static bool tc_cls_act_is_valid_access(int off, int 
> size,
> return __is_valid_access(off, size, type);
>  }
>
> +static bool seccomp_is_valid_access(int off, int size,
> +   enum bpf_access_type type)
> +{
> +   if (type == BPF_WRITE)
> +   return false;
> +
> +   if (off < 0 || off >= sizeof(struct seccomp_data) || off & 3)
> +   return false;
> +
> +   return true;
> +}
>  static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
>   int src_reg, int ctx_off,
>   struct bpf_insn *insn_buf)
> @@ -1795,6 +1815,11 @@ static const struct bpf_verifier_ops tc_cls_act_ops = {
> .convert_ctx_access = bpf_net_convert_ctx_access,
>  };
>
> +static const struct bpf_verifier_ops seccomp_ops = {
> +   .get_func_proto = seccomp_func_proto,
> +   .is_valid_access = seccomp_is_valid_access,
> +};
> +
>  static struct bpf_prog_type_list sk_filter_type __read_mostly = {
> .ops = _filter_ops,
> .type = BPF_PROG_TYPE_SOCKET_FILTER,
> @@ -1810,11 +1835,17 @@ static struct bpf_prog_type_list sched_act_type 
> __read_mostly = {
> .type = BPF_PROG_TYPE_SCHED_ACT,
>  };
>
> +static struct bpf_prog_type_list seccomp_type __read_mostly = {
> +   .ops = _ops,
> +   .type = BPF_PROG_TYPE_SECCOMP,
> +};
> +
>  static int __init register_sk_filter_ops(void)
>  {
> bpf_register_prog_type(_filter_type);
> bpf_register_prog_type(_cls_type);
> bpf_register_prog_type(_act_type);
> +   bpf_register_prog_type(_type);
>
> return 0;
>  }
> --
> 2.1.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH V5 net-next 1/2] net: introduce socket family constants

2015-09-11 Thread Ursula Braun
From: Ursula Braun 

The new socket family is assigned the next available address / protocol
family constant 41.

Signed-off-by: Ursula Braun 
---
 include/linux/socket.h |  4 +++-
 include/net/smc.h  | 13 +
 2 files changed, 16 insertions(+), 1 deletion(-)
 create mode 100644 include/net/smc.h

diff --git a/include/linux/socket.h b/include/linux/socket.h
index 5bf59c8..1adcbcc 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -200,7 +200,8 @@ struct ucred {
 #define AF_ALG 38  /* Algorithm sockets*/
 #define AF_NFC 39  /* NFC sockets  */
 #define AF_VSOCK   40  /* vSockets */
-#define AF_MAX 41  /* For now.. */
+#define AF_SMC 41  /* smc sockets  */
+#define AF_MAX 42  /* For now.. */
 
 /* Protocol families, same as address families. */
 #define PF_UNSPEC  AF_UNSPEC
@@ -246,6 +247,7 @@ struct ucred {
 #define PF_ALG AF_ALG
 #define PF_NFC AF_NFC
 #define PF_VSOCK   AF_VSOCK
+#define PF_SMC AF_SMC
 #define PF_MAX AF_MAX
 
 /* Maximum queue length specifiable by listen.  */
diff --git a/include/net/smc.h b/include/net/smc.h
new file mode 100644
index 000..cd513ee
--- /dev/null
+++ b/include/net/smc.h
@@ -0,0 +1,13 @@
+/*
+ * SMC Definitions for the SMC protocol.
+ *
+ * Author: Ursula Braun 
+ */
+#ifndef _SMC_H
+#define _SMC_H
+
+/* SMC socket options - disjunct with TCP socket options */
+#define SMC_KEEPALIVE  99  /* start/stop keepalives */
+
+#endif /* _SMC_H */
+
-- 
2.3.8

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 5/5] seccomp: add a way to attach a filter via eBPF fd

2015-09-11 Thread Michael Kerrisk (man-pages)
On 11 September 2015 at 02:21, Tycho Andersen
 wrote:
> This is the final bit needed to support seccomp filters created via the bpf
> syscall. The patch adds a new seccomp operation SECCOMP_MODE_FILTER_EBPF,
> which takes exactly one command (presumably to be expanded upon later when
> seccomp EBPFs support more interesting things) and an argument struct
> similar to that of bpf(), although the size is explicit in the struct to
> avoid changing the signature of seccomp().
>
> v2: Don't abuse seccomp's third argument; use a separate command and a
> pointer to a structure instead.

Hi Tycho,

Here, I'm entering broken record territory :-). Seems like a man-pages
patch is warranted here also?

Cheers,

Michael


> Signed-off-by: Tycho Andersen 
> CC: Kees Cook 
> CC: Will Drewry 
> CC: Oleg Nesterov 
> CC: Andy Lutomirski 
> CC: Pavel Emelyanov 
> CC: Serge E. Hallyn 
> CC: Alexei Starovoitov 
> CC: Daniel Borkmann 
> ---
>  include/uapi/linux/seccomp.h |  16 +
>  kernel/seccomp.c | 135 
> ++-
>  2 files changed, 138 insertions(+), 13 deletions(-)
>
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index 0f238a4..a8694e2 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -13,10 +13,14 @@
>  /* Valid operations for seccomp syscall. */
>  #define SECCOMP_SET_MODE_STRICT0
>  #define SECCOMP_SET_MODE_FILTER1
> +#define SECCOMP_MODE_FILTER_EBPF   2
>
>  /* Valid flags for SECCOMP_SET_MODE_FILTER */
>  #define SECCOMP_FILTER_FLAG_TSYNC  1
>
> +/* Valid cmds for SECCOMP_MODE_FILTER_EBPF */
> +#define SECCOMP_EBPF_ADD_FD0
> +
>  /*
>   * All BPF programs must return a 32-bit value.
>   * The bottom 16-bits are for optional return data.
> @@ -51,4 +55,16 @@ struct seccomp_data {
> __u64 args[6];
>  };
>
> +struct seccomp_ebpf {
> +   unsigned int size;
> +
> +   union {
> +   /* SECCOMP_EBPF_ADD_FD */
> +   struct {
> +   unsigned intadd_flags;
> +   __u32   add_fd;
> +   };
> +   };
> +};
> +
>  #endif /* _UAPI_LINUX_SECCOMP_H */
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 1856f69..e78175a 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -65,6 +65,9 @@ struct seccomp_filter {
>  /* Limit any path through the tree to 256KB worth of instructions. */
>  #define MAX_INSNS_PER_PATH ((1 << 18) / sizeof(struct sock_filter))
>
> +static long seccomp_install_filter(unsigned int flags,
> +  struct seccomp_filter *prepared);
> +
>  /*
>   * Endianness is explicitly ignored and left for BPF program authors to 
> manage
>   * as per the specific architecture.
> @@ -356,17 +359,6 @@ static struct seccomp_filter 
> *seccomp_prepare_filter(struct sock_fprog *fprog)
>
> BUG_ON(INT_MAX / fprog->len < sizeof(struct sock_filter));
>
> -   /*
> -* Installing a seccomp filter requires that the task has
> -* CAP_SYS_ADMIN in its namespace or be running with no_new_privs.
> -* This avoids scenarios where unprivileged tasks can affect the
> -* behavior of privileged children.
> -*/
> -   if (!task_no_new_privs(current) &&
> -   security_capable_noaudit(current_cred(), current_user_ns(),
> -CAP_SYS_ADMIN) != 0)
> -   return ERR_PTR(-EACCES);
> -
> /* Allocate a new seccomp_filter */
> sfilter = kzalloc(sizeof(*sfilter), GFP_KERNEL | __GFP_NOWARN);
> if (!sfilter)
> @@ -510,8 +502,105 @@ static void seccomp_send_sigsys(int syscall, int reason)
> info.si_syscall = syscall;
> force_sig_info(SIGSYS, , current);
>  }
> +
>  #endif /* CONFIG_SECCOMP_FILTER */
>
> +#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_SECCOMP_FILTER)
> +static struct seccomp_filter *seccomp_prepare_ebpf(int fd)
> +{
> +   struct seccomp_filter *ret;
> +   struct bpf_prog *prog;
> +
> +   prog = bpf_prog_get(fd);
> +   if (IS_ERR(prog))
> +   return (struct seccomp_filter *) prog;
> +
> +   if (prog->type != BPF_PROG_TYPE_SECCOMP) {
> +   bpf_prog_put(prog);
> +   return ERR_PTR(-EINVAL);
> +   }
> +
> +   ret = kzalloc(sizeof(*ret), GFP_KERNEL | __GFP_NOWARN);
> +   if (!ret) {
> +   bpf_prog_put(prog);
> +   return ERR_PTR(-ENOMEM);
> +   }
> +
> +   ret->prog = prog;
> +   atomic_set(>usage, 1);
> +
> +   /* Intentionally don't bpf_prog_put() here, because the underlying 
> prog
> +* is refcounted too and we're holding a reference from the struct
> +* 

Re: [PATCH] x86: Wire up 32-bit direct socket calls

2015-09-11 Thread Arnd Bergmann
On Friday 11 September 2015 11:54:50 Geert Uytterhoeven wrote:
> To make sure I don't miss any (it seems I missed recvmmsg and sendmmsg for
> the socketcall case, sigh), this is the list of ipc syscalls to implement?
> 
> sys_msgget
> sys_msgctl
> sys_msgrcv
> sys_msgsnd
> sys_semget
> sys_semctl
> sys_semtimedop
> sys_shmget
> sys_shmctl
> sys_shmat
> sys_shmdt
> 
> sys_semop() seems to be unneeded because it can be implemented using
> sys_semtimedop()?
> 

Yes, that list looks right. IPC also includes a set of six sys_mq_*
call, but I believe that everyone already has those as they are not
covered by sys_ipc.

For y2038 compatibility, we will likely add a new variant of
semtimedop that takes a 64-bit timespec. While the argument passed
there is a relative time that will never need to be longer than 68
years, we need to accommodate user space that defines timespec
in a sane way, and converting the argument in libc would be awkward.

Arnd
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next] Revert "net/ipv6: add sysctl option accept_ra_min_hop_limit"

2015-09-11 Thread D.S. Ljungmark
On 11/09/15 12:53, Florian Westphal wrote:
> YOSHIFUJI Hideaki  wrote:
>> Sabrina Dubroca wrote:
>>> 2015-09-10, 14:52:45 +0900, YOSHIFUJI Hideaki wrote:
 Sabrina Dubroca wrote:
> Would you agree with a default of 64, as Florian suggested?

 1 was chosen to restore our behavior before introduction of current
 hoplimit check.  I am not in favor of changing that value.
>>>
>>> But our old behavior had a security issue, which is why the >= current
>>> check was introduced.
>>
>> We have the knob to "protect" ourselves now but it has drawbacks no to
>> accept lower values than specified.  We can never have ultimate default
>> for everybody.  The knob might "mitigate" the issue but once we have
>> any rouge routers on our L2, we lose anyway.  So, I do want to keep it
>> as-is not to change our traditional behavior.
> 
> If that argument is brough forward (and it's a good point!), then the
> entire case for rejecting 'low' hoplimit values in first place becomes moot.
> 
> If this is an important security issue, then either the sysctl has to be
> removed or the default raised to some 'safe' value (32, for example).
> 
> If its not a security issue -- and it isn't if we think "1" is a good
> default choice -- then we should seriously consider reverting both
> the added sysctl and the 'original' commit (6fd99094de2b; "ipv6: Don't
> reduce hop limit for an interface").
> 


The most common use-case for this is public WiFi.  So far, a negible
amount of access points have even remote ability to filter "unwanted" L2
traffic.

The fact that a single, empty RA packet with a hop limit of 2 will take
down your entire ipv6, even if your infrastructure uses DHCPv6 for
addressing is problematic.

There are scenarios where an L2 agent can push a link-local or
Peer-to-peer routes with a low hoplimit. These routes would then lower
the interface-level hop limit to something that breaks your other routing.

Personally, I think the concept of hop-limit being per interface in IPv6
is disasterously stupid, but I'm not arguing against the RFC there.

//D.S.



-- 
8362 CB14 98AD 11EF CEB6  FA81 FCC3 7674 449E 3CFC



signature.asc
Description: OpenPGP digital signature


Re: [v2 04/11] soc/fsl: Introduce drivers for the DPAA QMan

2015-09-11 Thread Scott Wood
On Wed, Aug 12, 2015 at 04:14:50PM -0400, Roy Pledge wrote:
> +/* Lock/unlock frame queues, subject to the "LOCKED" flag. This is about
> + * inter-processor locking only. Note, FQLOCK() is always called either 
> under a
> + * local_irq_save() or from interrupt context - hence there's no need for irq
> + * protection (and indeed, attempting to nest irq-protection doesn't work, as
> + * the "irq en/disable" machinery isn't recursive...). */
> +#define FQLOCK(fq) \
> + do { \
> + struct qman_fq *__fq478 = (fq); \
> + if (fq_isset(__fq478, QMAN_FQ_FLAG_LOCKED)) \
> + spin_lock(&__fq478->fqlock); \
> + } while (0)
> +#define FQUNLOCK(fq) \
> + do { \
> + struct qman_fq *__fq478 = (fq); \
> + if (fq_isset(__fq478, QMAN_FQ_FLAG_LOCKED)) \
> + spin_unlock(&__fq478->fqlock); \
> + } while (0)
> +

I don't see QMAN_FQ_FLAG_LOCKED set anywhere.  What is the use case?

-Scott
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net] openvswitch: Fix mask generation for nested attributes.

2015-09-11 Thread Jesse Gross
On Fri, Sep 11, 2015 at 9:04 AM, Pravin Shelar  wrote:
> On Thu, Sep 10, 2015 at 6:36 PM, Jesse Gross  wrote:
>> Masks were added to OVS flows in a way that was backwards compatible
>> with userspace programs that did not generate masks. As a result, it is
>> possible that we may receive flows that do not have a mask and we need
>> to synthesize one.
>>
>> Generating a mask requires iterating over attributes and descending into
>> nested attributes. For each level we need to know the size to generate the
>> correct mask. We do this with a linked table of attribute types.
>>
>> Although the logic to handle these nested attributes was there in concept,
>> there are a number of bugs in practice. Examples include incomplete links
>> between tables, variable length attributes being treated as nested and
>> missing sanity checks.
>>
> Missing Fixes tag.

This isn't really fixing one commit, it's more something that has
degraded over time. I'm also not looking for this to go to stable
since I haven't actually heard reports of this causing a problem, it's
just something I saw by looking at the code.

> This patch adds white space space errors.
>
> ./scripts/checkpatch.pl
> 0001-openvswitch-Fix-mask-generation-for-nested-attribute.patch
>
> total: 15 errors, 19 warnings, 0 checks, 130 lines checked

Fixed, thanks.

>> diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
>> index c92d6a2..83a6847 100644
>> --- a/net/openvswitch/flow_netlink.c
>> +++ b/net/openvswitch/flow_netlink.c
>> @@ -57,6 +57,7 @@ struct ovs_len_tbl {
>>  };
>>
>
> ...
>> unsigned long opt_key_offset;
>> struct vxlan_metadata opts;
>> -   int err;
>>
>> BUILD_BUG_ON(sizeof(opts) > sizeof(match->key->tun_opts));
>>
>> -   err = nla_parse_nested(tb, OVS_VXLAN_EXT_MAX, a, vxlan_opt_policy);
>> -   if (err < 0)
>> -   return err;
>> -
>> memset(, 0, sizeof(opts));
>> +   nla_for_each_nested(a, attr, rem) {
>> +   int type = nla_type(a);
>>
>> -   if (tb[OVS_VXLAN_EXT_GBP])
>> -   opts.gbp = nla_get_u32(tb[OVS_VXLAN_EXT_GBP]);
>> +   if (type > OVS_VXLAN_EXT_MAX) {
>> +OVS_NLERR(log, "VXLAN extension %d out of range max 
>> %d",
>> +  type, OVS_VXLAN_EXT_MAX);
>> +return -EINVAL;
>> +}
>> +
>> +if (ovs_vxlan_ext_key_lens[type].len != nla_len(a) &&
>> +!(ovs_vxlan_ext_key_lens[type].len == OVS_ATTR_NESTED ||
>> +  ovs_vxlan_ext_key_lens[type].len == 
>> OVS_ATTR_VARIABLE)) {
>
> ovs_vxlan_ext_key_lens types is never set to nested or variable len.
> so this would not true.

At the moment, it's true that it can't happen. However, I'd like to
avoid making assumptions like that in the code since it could make it
more likely that this type of problem is reintroduced in the future.
It's somewhat moot anyways after the introduction of a common helper
function.

>> +OVS_NLERR(log, "VXLAN extension %d has unexpected 
>> len %d expected %d",
>> +  type, nla_len(a), 
>> ovs_vxlan_ext_key_lens[type].len);
>> +return -EINVAL;
>> +}
>> +
>> +switch (type) {
>> +   case OVS_VXLAN_EXT_GBP:
>> +   opts.gbp = nla_get_u32(a);
>> +   break;
>> +   default:
>> +OVS_NLERR(log, "Unknown VXLAN extension attribute 
>> %d",
>> +  type);
>> +return -EINVAL;
>> +   }
>> +   }
>
> Need to check for remaining bytes in this attribute.

Added.

>> @@ -1923,7 +1951,8 @@ static int validate_set(const struct nlattr *a,
>>
>> if (key_type > OVS_KEY_ATTR_MAX ||
>> (ovs_key_lens[key_type].len != key_len &&
>> -ovs_key_lens[key_type].len != OVS_ATTR_NESTED))
>> +!(ovs_key_lens[key_type].len == OVS_ATTR_NESTED ||
>> +  ovs_key_lens[key_type].len == OVS_ATTR_VARIABLE)))
>
> Same check are done multiple times, It would be nice if we add helper
> function for this.

I agree it's better. I factored these out.

I'll send out a v2 shortly.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net v2] openvswitch: Fix mask generation for nested attributes.

2015-09-11 Thread Jesse Gross
Masks were added to OVS flows in a way that was backwards compatible
with userspace programs that did not generate masks. As a result, it is
possible that we may receive flows that do not have a mask and we need
to synthesize one.

Generating a mask requires iterating over attributes and descending into
nested attributes. For each level we need to know the size to generate the
correct mask. We do this with a linked table of attribute types.

Although the logic to handle these nested attributes was there in concept,
there are a number of bugs in practice. Examples include incomplete links
between tables, variable length attributes being treated as nested and
missing sanity checks.

Signed-off-by: Jesse Gross 
---
v2: Fix whitespace errors.
Add check for unknown bytes in VXLAN extensions.
Factor out check for nested or variable attributes.
---
 net/openvswitch/flow_netlink.c | 82 ++
 1 file changed, 59 insertions(+), 23 deletions(-)

diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index c92d6a2..5c030a4 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -57,6 +57,7 @@ struct ovs_len_tbl {
 };
 
 #define OVS_ATTR_NESTED -1
+#define OVS_ATTR_VARIABLE -2
 
 static void update_range(struct sw_flow_match *match,
 size_t offset, size_t size, bool is_mask)
@@ -304,6 +305,10 @@ size_t ovs_key_attr_size(void)
+ nla_total_size(28); /* OVS_KEY_ATTR_ND */
 }
 
+static const struct ovs_len_tbl ovs_vxlan_ext_key_lens[OVS_VXLAN_EXT_MAX + 1] 
= {
+   [OVS_VXLAN_EXT_GBP] = { .len = sizeof(u32) },
+};
+
 static const struct ovs_len_tbl ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX + 
1] = {
[OVS_TUNNEL_KEY_ATTR_ID]= { .len = sizeof(u64) },
[OVS_TUNNEL_KEY_ATTR_IPV4_SRC]  = { .len = sizeof(u32) },
@@ -315,8 +320,9 @@ static const struct ovs_len_tbl 
ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX + 1]
[OVS_TUNNEL_KEY_ATTR_TP_SRC]= { .len = sizeof(u16) },
[OVS_TUNNEL_KEY_ATTR_TP_DST]= { .len = sizeof(u16) },
[OVS_TUNNEL_KEY_ATTR_OAM]   = { .len = 0 },
-   [OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS]   = { .len = OVS_ATTR_NESTED },
-   [OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS]= { .len = OVS_ATTR_NESTED },
+   [OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS]   = { .len = OVS_ATTR_VARIABLE },
+   [OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS]= { .len = OVS_ATTR_NESTED,
+   .next = ovs_vxlan_ext_key_lens 
},
 };
 
 /* The size of the argument for each %OVS_KEY_ATTR_* Netlink attribute.  */
@@ -349,6 +355,13 @@ static const struct ovs_len_tbl 
ovs_key_lens[OVS_KEY_ATTR_MAX + 1] = {
[OVS_KEY_ATTR_CT_LABEL]  = { .len = sizeof(struct ovs_key_ct_label) },
 };
 
+static bool check_attr_len(unsigned int attr_len, unsigned int expected_len)
+{
+   return expected_len == attr_len ||
+  expected_len == OVS_ATTR_NESTED ||
+  expected_len == OVS_ATTR_VARIABLE;
+}
+
 static bool is_all_zero(const u8 *fp, size_t size)
 {
int i;
@@ -388,7 +401,7 @@ static int __parse_flow_nlattrs(const struct nlattr *attr,
}
 
expected_len = ovs_key_lens[type].len;
-   if (nla_len(nla) != expected_len && expected_len != 
OVS_ATTR_NESTED) {
+   if (!check_attr_len(nla_len(nla), expected_len)) {
OVS_NLERR(log, "Key %d has unexpected len %d expected 
%d",
  type, nla_len(nla), expected_len);
return -EINVAL;
@@ -473,29 +486,50 @@ static int genev_tun_opt_from_nlattr(const struct nlattr 
*a,
return 0;
 }
 
-static const struct nla_policy vxlan_opt_policy[OVS_VXLAN_EXT_MAX + 1] = {
-   [OVS_VXLAN_EXT_GBP] = { .type = NLA_U32 },
-};
-
-static int vxlan_tun_opt_from_nlattr(const struct nlattr *a,
+static int vxlan_tun_opt_from_nlattr(const struct nlattr *attr,
 struct sw_flow_match *match, bool is_mask,
 bool log)
 {
-   struct nlattr *tb[OVS_VXLAN_EXT_MAX+1];
+   struct nlattr *a;
+   int rem;
unsigned long opt_key_offset;
struct vxlan_metadata opts;
-   int err;
 
BUILD_BUG_ON(sizeof(opts) > sizeof(match->key->tun_opts));
 
-   err = nla_parse_nested(tb, OVS_VXLAN_EXT_MAX, a, vxlan_opt_policy);
-   if (err < 0)
-   return err;
-
memset(, 0, sizeof(opts));
+   nla_for_each_nested(a, attr, rem) {
+   int type = nla_type(a);
 
-   if (tb[OVS_VXLAN_EXT_GBP])
-   opts.gbp = nla_get_u32(tb[OVS_VXLAN_EXT_GBP]);
+   if (type > OVS_VXLAN_EXT_MAX) {
+   OVS_NLERR(log, "VXLAN extension %d out of range max %d",
+ type, OVS_VXLAN_EXT_MAX);
+   return -EINVAL;
+  

Re: [fw filter]: Broken! fw mark based tc class selection not working

2015-09-11 Thread Cong Wang
On Fri, Sep 11, 2015 at 3:24 PM, Akshat Kakkar  wrote:
> There is no handle with fw filter. That's the whole point is. If
> handle and class (flow id) is not specified, then whatever be the mark
> on the packet, its automatically set as flowid. So if mark is 0x10003,
> then this fw filter
>
> tc filter add dev eth0 parent 1:0 protocol ip fw
>
> will cause 0x10003 being set as classid I.e. 1:3.
>
>
> tc qdisc add dev eth0 root handle 1: htb
> tc class add dev eth0 parent 1: classid 1:a htb rate 1mbit
> tc class add dev eth0 parent 1: classid 1:b htb rate 1mbit
> tc class add dev eth0 parent 1: classid 1:c htb rate 1mbit
> tc filter add dev eth0 parent 1:0 protocol ip fw
>
> iptables -t mangle -I OUTPUT -o eth0 -p tcp -j MARK --set-mark 0x1000a
> iptables -t mangle -I OUTPUT -o eth0 -p icmp -j MARK --set-mark 0x1000b
> iptables -t mangle -I OUTPUT -o eth0 -p udp -j MARK --set-mark 0x1000c


Hmm, I didn't know that before either. Looks like my tp->init change
breaks it.

Could you try the following patch?

Thanks!
diff --git a/net/sched/cls_fw.c b/net/sched/cls_fw.c
index 715e01e..b6394ab 100644
--- a/net/sched/cls_fw.c
+++ b/net/sched/cls_fw.c
@@ -34,6 +34,7 @@
 struct fw_head {
u32 mask;
boolmask_set;
+   boolold_method;
struct fw_filter __rcu  *ht[HTSIZE];
struct rcu_head rcu;
 };
@@ -65,7 +66,7 @@ static int fw_classify(struct sk_buff *skb, const struct 
tcf_proto *tp,
int r;
u32 id = skb->mark;
 
-   if (head != NULL) {
+   if (!head->old_method) {
id &= head->mask;
 
for (f = rcu_dereference_bh(head->ht[fw_hash(id)]); f;
@@ -120,6 +121,7 @@ static int fw_init(struct tcf_proto *tp)
if (head == NULL)
return -ENOBUFS;
 
+   head->old_method = true;
head->mask_set = false;
rcu_assign_pointer(tp->root, head);
return 0;
@@ -302,6 +304,7 @@ static int fw_change(struct net *net, struct sk_buff 
*in_skb,
if (!handle)
return -EINVAL;
 
+   head->old_method = false;
if (!head->mask_set) {
head->mask = 0x;
if (tb[TCA_FW_MASK])


[PATCH] ip link: missing options in bond usage

2015-09-11 Thread Arthur Gautier
Signed-off-by: Arthur Gautier 
---
 ip/iplink_bond.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ip/iplink_bond.c b/ip/iplink_bond.c
index 9d96cfeee88b..cb2f045a5a27 100644
--- a/ip/iplink_bond.c
+++ b/ip/iplink_bond.c
@@ -145,7 +145,7 @@ static void print_explain(FILE *f)
"ARP_ALL_TARGETS := any|all\n"
"PRIMARY_RESELECT := always|better|failure\n"
"FAIL_OVER_MAC := none|active|follow\n"
-   "XMIT_HASH_POLICY := layer2|layer2+3|layer3+4\n"
+   "XMIT_HASH_POLICY := 
layer2|layer2+3|layer3+4|encap2+3|encap3+4\n"
"LACP_RATE := slow|fast\n"
"AD_SELECT := stable|bandwidth|count\n"
);
-- 
2.5.1

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 5/5] seccomp: add a way to attach a filter via eBPF fd

2015-09-11 Thread Daniel Borkmann

On 09/11/2015 02:21 AM, Tycho Andersen wrote:

This is the final bit needed to support seccomp filters created via the bpf
syscall. The patch adds a new seccomp operation SECCOMP_MODE_FILTER_EBPF,
which takes exactly one command (presumably to be expanded upon later when
seccomp EBPFs support more interesting things) and an argument struct
similar to that of bpf(), although the size is explicit in the struct to
avoid changing the signature of seccomp().

v2: Don't abuse seccomp's third argument; use a separate command and a
 pointer to a structure instead.


Comments below ...


Signed-off-by: Tycho Andersen 
CC: Kees Cook 
CC: Will Drewry 
CC: Oleg Nesterov 
CC: Andy Lutomirski 
CC: Pavel Emelyanov 
CC: Serge E. Hallyn 
CC: Alexei Starovoitov 
CC: Daniel Borkmann 
---
  include/uapi/linux/seccomp.h |  16 +
  kernel/seccomp.c | 135 ++-
  2 files changed, 138 insertions(+), 13 deletions(-)

diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 0f238a4..a8694e2 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -13,10 +13,14 @@
  /* Valid operations for seccomp syscall. */
  #define SECCOMP_SET_MODE_STRICT   0
  #define SECCOMP_SET_MODE_FILTER   1
+#define SECCOMP_MODE_FILTER_EBPF   2


Should this be SECCOMP_SET_MODE_FILTER_EBPF or just SECCOMP_SET_MODE_EBPF?


  /* Valid flags for SECCOMP_SET_MODE_FILTER */
  #define SECCOMP_FILTER_FLAG_TSYNC 1

+/* Valid cmds for SECCOMP_MODE_FILTER_EBPF */
+#define SECCOMP_EBPF_ADD_FD0
+
  /*
   * All BPF programs must return a 32-bit value.
   * The bottom 16-bits are for optional return data.
@@ -51,4 +55,16 @@ struct seccomp_data {
__u64 args[6];
  };

+struct seccomp_ebpf {
+   unsigned int size;
+
+   union {
+   /* SECCOMP_EBPF_ADD_FD */
+   struct {
+   unsigned intadd_flags;
+   __u32   add_fd;
+   };
+   };
+};
+
  #endif /* _UAPI_LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 1856f69..e78175a 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -65,6 +65,9 @@ struct seccomp_filter {
  /* Limit any path through the tree to 256KB worth of instructions. */
  #define MAX_INSNS_PER_PATH ((1 << 18) / sizeof(struct sock_filter))

+static long seccomp_install_filter(unsigned int flags,
+  struct seccomp_filter *prepared);
+
  /*
   * Endianness is explicitly ignored and left for BPF program authors to manage
   * as per the specific architecture.
@@ -356,17 +359,6 @@ static struct seccomp_filter 
*seccomp_prepare_filter(struct sock_fprog *fprog)

BUG_ON(INT_MAX / fprog->len < sizeof(struct sock_filter));

-   /*
-* Installing a seccomp filter requires that the task has
-* CAP_SYS_ADMIN in its namespace or be running with no_new_privs.
-* This avoids scenarios where unprivileged tasks can affect the
-* behavior of privileged children.
-*/
-   if (!task_no_new_privs(current) &&
-   security_capable_noaudit(current_cred(), current_user_ns(),
-CAP_SYS_ADMIN) != 0)
-   return ERR_PTR(-EACCES);
-
/* Allocate a new seccomp_filter */
sfilter = kzalloc(sizeof(*sfilter), GFP_KERNEL | __GFP_NOWARN);
if (!sfilter)
@@ -510,8 +502,105 @@ static void seccomp_send_sigsys(int syscall, int reason)
info.si_syscall = syscall;
force_sig_info(SIGSYS, , current);
  }
+
  #endif/* CONFIG_SECCOMP_FILTER */

+#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_SECCOMP_FILTER)
+static struct seccomp_filter *seccomp_prepare_ebpf(int fd)
+{
+   struct seccomp_filter *ret;
+   struct bpf_prog *prog;
+
+   prog = bpf_prog_get(fd);
+   if (IS_ERR(prog))
+   return (struct seccomp_filter *) prog;


ERR_CAST()


+
+   if (prog->type != BPF_PROG_TYPE_SECCOMP) {
+   bpf_prog_put(prog);
+   return ERR_PTR(-EINVAL);
+   }
+
+   ret = kzalloc(sizeof(*ret), GFP_KERNEL | __GFP_NOWARN);
+   if (!ret) {
+   bpf_prog_put(prog);
+   return ERR_PTR(-ENOMEM);
+   }
+
+   ret->prog = prog;
+   atomic_set(>usage, 1);
+
+   /* Intentionally don't bpf_prog_put() here, because the underlying prog
+* is refcounted too and we're holding a reference from the struct
+* seccomp_filter object.
+*/
+   return ret;
+}
+
+static long seccomp_ebpf_add_fd(struct seccomp_ebpf *ebpf)
+{
+   struct seccomp_filter *prepared;
+
+   prepared = seccomp_prepare_ebpf(ebpf->add_fd);
+   if (IS_ERR(prepared))
+   return PTR_ERR(prepared);

Re: NFS/TCP/IPv6 acting strangely in 4.2

2015-09-11 Thread Eric Dumazet
On Fri, 2015-09-11 at 12:38 +0100, Russell King - ARM Linux wrote:
> I have a recent Marvell Armada 388 board here which uses the mvneta
> driver.  I'm seeing some weird effects with NFS with it acting as a
> client.  Unfortunately, the board does not have a functional RTC.
> 
> The NFS server is the same NFS server that I've used for years with
> multiple other clients (it's my main NFS server) and it continues to
> support all other clients without any ill effect.  The NFS server
> kernel is rather old now, being a 3.x kernel.
> 
> The NFS client appears to connect using TCP/IPv6 and initially is
> accessible.  Everything appears to work normally, and then the NFS
> server appears to stop responding.
> 
> Running tcpdump on the NFS server, and then dumping the captured
> packets
> with tshark (because tcpdump appears not to understand IPv6 SYNs on
> the
> NFS port) shows:
> 
>   3   0.030036 armada388 -> n2100 TCP 843→nfs [SYN] Seq=936803106
> Win=28800 Len=0 MSS=1440 SACK_PERM=1 TSval=892366 TSecr=0 WS=128

>   4   0.030409 n2100 -> armada388 TCP nfs→843 [SYN, ACK] Seq=409465870
> Ack=936803107 Win=28560 Len=0 MSS=1440 SACK_PERM=1
>  TSval=818169117 TSecr=892366 WS=64

>   5   0.030493 armada388 -> n2100 TCP [TCP Port numbers reused]
> 843→nfs [SYN] Seq=936803633 Win=28800 Len=0 MSS=1440 SACK_PERM=1
> TSval=892366 TSecr=0 WS=128

Yes, this packet looks very wrong. Like two simultaneous connect with
same source port. It is not a retransmit (Seq numbers differ)

>   6   0.030699 n2100 -> armada388 TCP nfs→843 [RST, ACK] Seq=0
> Ack=936803634 Win=0 Len=0
>   7   0.030766 armada388 -> n2100 TCP 843→nfs [RST] Seq=936803107
> Win=0 Len=0

I suspect port reuse in NFS being broken.

Have you tried to apply

commit 099392048cd443349c50310f7fdc96070e40f4e7
Author: Trond Myklebust 
Date:   Sat Aug 29 19:11:21 2015 -0700

SUNRPC: Prevent SYN+SYNACK+RST storms

Add a shutdown() call before we release the socket in order to ensure the
reset is sent before we try to reconnect.

Signed-off-by: Trond Myklebust 

diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 214ca9dfb14e..7be90bc1a7c2 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -822,6 +822,8 @@ static void xs_reset_transport(struct sock_xprt *transport)
if (atomic_read(>xprt.swapper))
sk_clear_memalloc(sk);
 
+   kernel_sock_shutdown(sock, SHUT_RDWR);
+
write_lock_bh(>sk_callback_lock);
transport->inet = NULL;
transport->sock = NULL;


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/5] seccomp: make underlying bpf ref counted as well

2015-09-11 Thread Daniel Borkmann

On 09/11/2015 02:20 AM, Tycho Andersen wrote:

In the next patch, we're going to add a way to access the underlying
filters via bpf fds. This means that we need to ref-count both the
struct seccomp_filter objects and the struct bpf_prog objects separately,
in case a process dies but a filter is still referred to by another
process.

Additionally, we mark classic converted seccomp filters as seccomp eBPF
programs, since they are a subset of what is supported in seccomp eBPF.

Signed-off-by: Tycho Andersen 
CC: Kees Cook 
CC: Will Drewry 
CC: Oleg Nesterov 
CC: Andy Lutomirski 
CC: Pavel Emelyanov 
CC: Serge E. Hallyn 
CC: Alexei Starovoitov 
CC: Daniel Borkmann 
---
  kernel/seccomp.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 245df6b..afaeddf 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -378,6 +378,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct 
sock_fprog *fprog)
}

atomic_set(>usage, 1);
+   atomic_set(>prog->aux->refcnt, 1);
+   sfilter->prog->type = BPF_PROG_TYPE_SECCOMP;


So, if you do this, then this breaks the assumption of eBPF JITs
that, currently, all classic converted BPF programs always have a
prog->type of BPF_PROG_TYPE_UNSPEC (see: bpf_prog_was_classic()).

Currently, JITs make use of this information to determine whether
A and X mappings for such programs should or should not be cleared
in the prologue (s390 currently).

In the seccomp_prepare_filter() stage, we're already past that, so
it will not cause an issue, but we certainly would need to be very
careful in future, if bpf_prog_was_classic() is then used at a later
stage when we already have a generated bpf_prog somewhere, as then
this assumption will break.


return sfilter;
  }
@@ -470,7 +472,7 @@ void get_seccomp_filter(struct task_struct *tsk)
  static inline void seccomp_filter_free(struct seccomp_filter *filter)
  {
if (filter) {
-   bpf_prog_free(filter->prog);
+   bpf_prog_put(filter->prog);
kfree(filter);
}
  }



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: NFS/TCP/IPv6 acting strangely in 4.2

2015-09-11 Thread Russell King - ARM Linux
On Fri, Sep 11, 2015 at 06:04:51AM -0700, Eric Dumazet wrote:
> On Fri, 2015-09-11 at 12:38 +0100, Russell King - ARM Linux wrote:
> > I have a recent Marvell Armada 388 board here which uses the mvneta
> > driver.  I'm seeing some weird effects with NFS with it acting as a
> > client.  Unfortunately, the board does not have a functional RTC.
> > 
> > The NFS server is the same NFS server that I've used for years with
> > multiple other clients (it's my main NFS server) and it continues to
> > support all other clients without any ill effect.  The NFS server
> > kernel is rather old now, being a 3.x kernel.
> > 
> > The NFS client appears to connect using TCP/IPv6 and initially is
> > accessible.  Everything appears to work normally, and then the NFS
> > server appears to stop responding.
> > 
> > Running tcpdump on the NFS server, and then dumping the captured
> > packets
> > with tshark (because tcpdump appears not to understand IPv6 SYNs on
> > the
> > NFS port) shows:
> > 
> >   3   0.030036 armada388 -> n2100 TCP 843→nfs [SYN] Seq=936803106
> > Win=28800 Len=0 MSS=1440 SACK_PERM=1 TSval=892366 TSecr=0 WS=128
> 
> >   4   0.030409 n2100 -> armada388 TCP nfs→843 [SYN, ACK] Seq=409465870
> > Ack=936803107 Win=28560 Len=0 MSS=1440 SACK_PERM=1
> >  TSval=818169117 TSecr=892366 WS=64
> 
> >   5   0.030493 armada388 -> n2100 TCP [TCP Port numbers reused]
> > 843→nfs [SYN] Seq=936803633 Win=28800 Len=0 MSS=1440 SACK_PERM=1
> > TSval=892366 TSecr=0 WS=128
> 
> Yes, this packet looks very wrong. Like two simultaneous connect with
> same source port. It is not a retransmit (Seq numbers differ)
> 
> >   6   0.030699 n2100 -> armada388 TCP nfs→843 [RST, ACK] Seq=0
> > Ack=936803634 Win=0 Len=0
> >   7   0.030766 armada388 -> n2100 TCP 843→nfs [RST] Seq=936803107
> > Win=0 Len=0
> 
> I suspect port reuse in NFS being broken.
> 
> Have you tried to apply

Thanks, I'll give this a go and report back.

I've been trying to debug this unsuccessfully.  It's really not helpful
to enable rpc debugging, and then be greeted with this from printk:

[ 2760.156924] RPC:   state 4 conn 1 dead 0 zapped 1 sk_shutdown 3
[ 2760.163218] RPC:   xs_tcp_state_change client ee240800...
[ 2760.168976] RPC:   state 5 conn 0 dead 0 zapped 1 sk_shutdown 3
[ 2760.175273] RPC:   xs_tcp_state_change client ee240800...
[ 2760.181032] RPC:   state 7 conn 0 dead 0 zapped 1 sk_shutdown 3
[ 2760.187315] RPC:   disconnected transport ee240800
[ 2760.192467] RPC:   xs_tcp_state_change client ee240800...
[ 2760.198224] RPC:   state 7 conn 0 dead 0 zapped 1 sk_shutdown 3
[ 2760.204505] RPC:   disconnected transport ee240800
[ 2760.209653] RPC:   xs_tcp_data_ready...
[ 2760.215001] RPC:   wake_up_first(ee240918 "xprt_sending")
lspci -vvx |less
[ 2805.912330] RPC:   looking up Generic cred
[ 2805.916791] RPC:   looking up Generic cred
[ 2805.921259] RPC:   looking up Generic cred
[ 2805.921784] RPC:   looking up Generic cred
[ 2805.921787] RPC:   looking up Generic cred
[ 2805.921788] RPC:   looking up Generic cred
[ 2805.921792] RPC:   new task initialized, procpid 327
[ 2805.921794] RPC:   allocated task ee358300
[ 2805.921797] RPC:  1335 __rpc_execute flags=0x80
[ 2805.921800] RPC:  1335 call_start nfs3 proc ACCESS (sync)
[ 2805.921801] RPC:  1335 call_reserve (status 0)
[ 2805.921805] RPC:  1335 reserved req ee8a2900 xid af74741b
** 168 printk messages dropped ** [ 2805.928207] RPC:   read reply XID 
b174741b
** 366 printk messages dropped ** [ 2805.941242] RPC:  1343 setting alarm for 
6 ms
** 310 printk messages dropped ** [ 2805.958274] RPC:   xprt = ee240800, 
tcp_copied = 120, tcp_offset = 120, tcp_reclen = 120
** 89 printk messages dropped ** [ 2805.959613] RPC:  1350 reserved req 
ee8a2900 xid be74741b
[ 2805.959614] RPC:   wake_up_first(ee240918 "xprt_sending")

which is just where we want to see the printk messages.  This kind'a
makes printk and the rpc debug stuff rather useless. :(

-- 
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


  1   2   >