Re: [RFC PATCH net] sctp: ASCONF-ACK with Unresolvable Address should be sent
On Mon, Jul 27, 2015 at 9:44 PM, Marcelo Ricardo Leitner marcelo.leit...@gmail.com wrote:
> On Sat, Jul 25, 2015 at 01:08:08PM +0800, Xin Long wrote:
>> RFC 5061:
>> This is an opaque integer assigned by the sender to identify each
>> request parameter. The receiver of the ASCONF Chunk will copy this
>> 32-bit value into the ASCONF Response Correlation ID field of the
>> ASCONF-ACK response parameter. The sender of the ASCONF can use this
>> same value in the ASCONF-ACK to find which request the response is
>> for. Note that the receiver MUST NOT change this 32-bit value.
>>
>> Address Parameter: TLV
>>
>> This field contains an IPv4 or IPv6 address parameter, as described
>> in Section 3.3.2.1 of [RFC4960].
>>
>> ASCONF chunk with Error Cause Indication Parameter (Unresolvable
>> Address) should be sent if the Delete IP Address is not part of the
>> association.
>>
>> Endpoint A                           Endpoint B
>> (ESTABLISHED)                        (ESTABLISHED)
>>
>> ASCONF        ----------------->
>> (Delete IP Address)
>>               <-----------------      ASCONF-ACK
>>                                       (Unresolvable Address)
>>
>> Signed-off-by: Xin Long lucien@gmail.com
>> ---
>>  net/sctp/sm_make_chunk.c | 15 +++++++++++++--
>>  1 file changed, 13 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
>> index 06320c8..6e399f6 100644
>> --- a/net/sctp/sm_make_chunk.c
>> +++ b/net/sctp/sm_make_chunk.c
>> @@ -3090,8 +3090,19 @@ static __be16 sctp_process_asconf_param(struct sctp_association *asoc,
>>  			sctp_assoc_set_primary(asoc, asconf->transport);
>>  			sctp_assoc_del_nonprimary_peers(asoc,
>>  							asconf->transport);
>> -		} else
>> -			sctp_assoc_del_peer(asoc, addr);
>> +			return SCTP_ERROR_NO_ERROR;
>> +		}
>> +
>> +		/* If the address is not part of the association, the
>> +		 * ASCONF-ACK with Error Cause Indication Parameter
>> +		 * which including cause of Unresolvable Address should
>> +		 * be sent.
>> +		 */
>> +		peer = sctp_assoc_lookup_paddr(asoc, addr);
>> +		if (!peer)
>> +			return SCTP_ERROR_DNS_FAILED;
>> +
>> +		sctp_assoc_rm_peer(asoc, peer);
>>  		break;
>>  	case SCTP_PARAM_SET_PRIMARY:
>>  		/* ADDIP Section 4.2.4
>> --
>> 2.1.0
>
> Looks good to me.

Marcelo any update for this one? is it accepted?
-- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
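The behaviour the SCTP patch above is after can be modelled in a few lines of user-space C. This is only a sketch under stated assumptions, not the kernel code: `struct assoc`, `lookup_peer()`, `process_del_ip()` and the two result constants are hypothetical stand-ins for the association, `sctp_assoc_lookup_paddr()`, the Delete IP Address branch and the SCTP error causes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the SCTP error causes used by the patch. */
enum asconf_result { ASCONF_NO_ERROR = 0, ASCONF_UNRESOLVABLE_ADDR = 1 };

#define MAX_PEERS 8

/* Hypothetical, much simplified association: a flat list of peer addresses. */
struct assoc {
	uint32_t peers[MAX_PEERS];	/* IPv4 addresses in host order */
	size_t npeers;
};

/* Return the index of addr in the association, or -1 if it is not a peer. */
int lookup_peer(const struct assoc *a, uint32_t addr)
{
	for (size_t i = 0; i < a->npeers; i++)
		if (a->peers[i] == addr)
			return (int)i;
	return -1;
}

/*
 * Model of the Delete IP Address handling: an unknown address yields an
 * error cause instead of being silently treated as success.
 */
enum asconf_result process_del_ip(struct assoc *a, uint32_t addr)
{
	int idx = lookup_peer(a, addr);

	if (idx < 0)
		return ASCONF_UNRESOLVABLE_ADDR;

	/* Remove the peer by moving the last entry into the freed slot. */
	a->peers[idx] = a->peers[a->npeers - 1];
	a->npeers--;
	return ASCONF_NO_ERROR;
}
```

The point of the sketch is the ordering: the lookup happens first, and removal only runs once the address is known to belong to the association.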
[PATCH net-next] bpf: fix the bug 'struct bpf_array' has no member named 'prog' in s390 architecture
'Kbuild test robot' sent me an email about a build error 'struct bpf_array' has no member named 'prog' in s390 architecture. This error is caused by commit:

  2a36f0b92eb 638dd023870574eb471b1c56be9ad [656/692] bpf: Make the bpf _prog_array_map more generic.

In this patch, the member 'prog' of struct bpf_array has been replaced by 'ptrs'. So this patch fix it.
---
 arch/s390/net/bpf_jit_comp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/s390/net/bpf_jit_comp.c b/arch/s390/net/bpf_jit_comp.c
index 9f4bbc0..eeda051 100644
--- a/arch/s390/net/bpf_jit_comp.c
+++ b/arch/s390/net/bpf_jit_comp.c
@@ -1032,7 +1032,7 @@ static noinline int bpf_jit_insn(struct bpf_jit *jit, struct bpf_prog *fp, int i
 			      MAX_TAIL_CALL_CNT, 0, 0x2);
 
 		/*
-		 * prog = array->prog[index];
+		 * prog = array->ptrs[index];
 		 * if (prog == NULL)
 		 *         goto out;
 		 */
@@ -1041,7 +1041,7 @@ static noinline int bpf_jit_insn(struct bpf_jit *jit, struct bpf_prog *fp, int i
 		EMIT6_DISP_LH(0xeb000000, 0x000d, REG_1, BPF_REG_3, REG_0, 3);
 		/* lg %r1,prog(%b2,%r1) */
 		EMIT6_DISP_LH(0xe3000000, 0x0004, REG_1, BPF_REG_2,
-			      REG_1, offsetof(struct bpf_array, prog));
+			      REG_1, offsetof(struct bpf_array, ptrs));
 		/* clgij %r1,0,0x8,label0 */
 		EMIT6_PCREL_IMM_LABEL(0xec000000, 0x007d, REG_1, 0, 0, 0x8);
 
--
1.8.3.4
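The s390 JIT hunk above loads `array->ptrs[index]` by adding `offsetof(struct bpf_array, ptrs)` to a base register, which is why a renamed member breaks the build. A minimal user-space sketch of the same address arithmetic follows; `struct demo_array` and `load_slot()` are hypothetical names for illustration and do not match the real `struct bpf_array` layout.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct bpf_array. */
struct demo_array {
	uint32_t max_entries;
	void *ptrs[4];
};

/*
 * Fetch ptrs[index] the way generated code would:
 * base address + offset of the member + index * element size.
 */
void *load_slot(const struct demo_array *array, size_t index)
{
	const unsigned char *base = (const unsigned char *)array;
	const unsigned char *slot = base + offsetof(struct demo_array, ptrs)
					 + index * sizeof(void *);
	void *value;

	/* memcpy avoids any aliasing or alignment assumptions. */
	memcpy(&value, slot, sizeof(value));
	return value;
}
```

Because `offsetof()` names the member explicitly, renaming the member without updating every architecture-specific user fails at compile time rather than reading the wrong slot at run time.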
Re: [PATCH v2] net/fddi: remove HWM_REVERSE() macro
On Aug 11, 2015, at 12:24, David Miller da...@davemloft.net wrote:
> From: yalin wang yalin.wang2...@gmail.com
> Date: Tue, 11 Aug 2015 09:57:21 +0800
>
>> HWM_REVERSE() macro is unused, remove it.
>>
>> Signed-off-by: yalin wang yalin.wang2...@gmail.com
>
> Your email client has corrupted this patch.
>
> Please read Documentation/email-clients.txt, send a test patch to
> yourself, and only resubmit this change once you are able to
> successfully apply the patch you receive in that test email.
>
> Thanks.

ok, Thanks.
[PATCH v2 Resend] net/fddi: remove HWM_REVERSE() macro
HWM_REVERSE() macro is unused, remove it.

Signed-off-by: yalin wang yalin.wang2...@gmail.com
---
 drivers/net/fddi/skfp/h/hwmtm.h | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/drivers/net/fddi/skfp/h/hwmtm.h b/drivers/net/fddi/skfp/h/hwmtm.h
index 5924d42..4ca2341 100644
--- a/drivers/net/fddi/skfp/h/hwmtm.h
+++ b/drivers/net/fddi/skfp/h/hwmtm.h
@@ -74,15 +74,6 @@
 #define NULL 0
 #endif
 
-#ifdef LITTLE_ENDIAN
-#define HWM_REVERSE(x)	(x)
-#else
-#define	HWM_REVERSE(x)	((((x)<<24L)&0xff000000L)	+	\
-			 (((x)<< 8L)&0x00ff0000L)	+	\
-			 (((x)>> 8L)&0x0000ff00L)	+	\
-			 (((x)>>24L)&0x000000ffL))
-#endif
-
 #define C_INDIC		(1L<<25)
 #define A_INDIC		(1L<<26)
 #define	RD_FS_LOCAL	0x80
--
1.9.1
Re: [PATCH net-next] bpf: fix the bug 'struct bpf_array' has no member named 'prog' in s390 architecture
From: Kaixu Xia xiaka...@huawei.com
Date: Tue, 11 Aug 2015 05:00:24 +0000

> 'Kbuild test robot' sent me an email about a build error 'struct
> bpf_array' has no member named 'prog' in s390 architecture. This
> error is caused by commit:
>
>   2a36f0b92eb 638dd023870574eb471b1c56be9ad [656/692] bpf: Make the bpf _prog_array_map more generic.
>
> In this patch, the member 'prog' of struct bpf_array has been
> replaced by 'ptrs'. So this patch fix it.

Please resubmit with a proper Fixes: and Signed-off-by: tags.
Re: [PATCH v2 Resend] net/fddi: remove HWM_REVERSE() macro
From: yalin wang yalin.wang2...@gmail.com
Date: Tue, 11 Aug 2015 13:11:22 +0800

> HWM_REVERSE() macro is unused, remove it.
>
> Signed-off-by: yalin wang yalin.wang2...@gmail.com

You did not do as I asked you to, this patch is still corrupted and there is no way you successfully applied what is in this patch.

> -#define HWM_REVERSE(x) ((((x)<<24L)&0xff000000L) + \
> -                        (((x)<< 8L)&0x00ff0000L) + \
> -                        (((x)>> 8L)&0x0000ff00L) + \
> -                        (((x)>>24L)&0x000000ffL))

This indentation here is spaces, whereas in the source files they are TABS. Your email client did this.

If you fail to properly verify that your outgoing patches are not corrupted before submitting them here, I will stop reviewing and considering your changes.

Thank you.
Re: VxLAN support question
On 8/10/15 4:47 PM, Andrew Qu wrote:
> Pretty much what I want is that kernel will have about 1K interfaces
> (something like Tunnel100.1-tunnel100.1000 To be created and attached
> to 1K bridge domains on which each VNI is associated with given VNI to
> bridge-domain will be assigned using other CLIs)

creating 1k vxlan devices is doable, but you probably want to take a look at recently added metadata mode of vxlan. Also sounds like for each vni you'd need a different multicast group? What fabric going to support that?

> * Email Confidentiality Notice

please avoid such banners.
Re: [PATCH v2 Resend] net/fddi: remove HWM_REVERSE() macro
On Aug 11, 2015, at 13:37, David Miller da...@davemloft.net wrote:
> From: yalin wang yalin.wang2...@gmail.com
> Date: Tue, 11 Aug 2015 13:11:22 +0800
>
>> HWM_REVERSE() macro is unused, remove it.
>>
>> Signed-off-by: yalin wang yalin.wang2...@gmail.com
>
> You did not do as I asked you to, this patch is still corrupted and
> there is no way you successfully applied what is in this patch.
>
>> -#define HWM_REVERSE(x) ((((x)<<24L)&0xff000000L) + \
>> -                        (((x)<< 8L)&0x00ff0000L) + \
>> -                        (((x)>> 8L)&0x0000ff00L) + \
>> -                        (((x)>>24L)&0x000000ffL))
>
> This indentation here is spaces, whereas in the source files they are
> TABS. Your email client did this.
>
> If you fail to properly verify that your outgoing patches are not
> corrupted before submitting them here, I will stop reviewing and
> considering your changes.
>
> Thank you.

ouch, i am sorry that i am sending from windows PC, let me check that. Sorry for that.
Re: [PATCH net] netconsole: Check for carrier before calling netpoll_send_udp()
> What if the carrier check passes, and then the chip reset starts on
> another cpu? You'll have the same problem.

Okay, let me see if I can come up with a better way to mitigate this.

On Tue, Aug 11, 2015 at 2:22 PM, David Miller da...@davemloft.net wrote:
> From: Jon Maxwell jmaxwel...@gmail.com
> Date: Tue, 11 Aug 2015 11:32:26 +1000
>
>> We have seen a few crashes recently where a NIC is getting reset for
>> some reason and then the driver or another module calls printk()
>> which invokes netconsole. Netconsole then calls the adapter specific
>> poll routine via netpoll which crashes because the adapter is
>> resetting and its structures are being reinitialized.
>
> This isn't a fix.
>
> What if the carrier check passes, and then the chip reset starts on
> another cpu? You'll have the same problem.
>
> I'm not applying this, sorry.
[PATCH net-next] vxlan: fix fdb_dump index calculation
When too many remotes are bound to an FDB entry, index may not be increased. This problem will be caused on the large scale environment that is based on the unicast default destination, for instance.

Signed-off-by: Atzm Watanabe a...@iij.ad.jp
---
 drivers/net/vxlan.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index b6731fa..06c0731 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -931,10 +931,10 @@ static int vxlan_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb,
 		hlist_for_each_entry_rcu(f, &vxlan->fdb_head[h], hlist) {
 			struct vxlan_rdst *rd;
 
-			if (idx < cb->args[0])
-				goto skip;
-
 			list_for_each_entry_rcu(rd, &f->remotes, list) {
+				if (idx < cb->args[0])
+					goto skip;
+
 				err = vxlan_fdb_info(skb, vxlan, f,
 						     NETLINK_CB(cb->skb).portid,
 						     cb->nlh->nlmsg_seq,
@@ -942,9 +942,9 @@ static int vxlan_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb,
 						     NLM_F_MULTI, rd);
 				if (err < 0)
 					goto out;
-			}
 skip:
-			++idx;
+				++idx;
+			}
 		}
 	}
 out:
--
2.4.6
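The effect of moving the index check and increment inside the per-remote loop can be seen in a small user-space model of a resumable dump. This is a sketch under assumptions, not the vxlan code: `struct fdb_entry`, `dump_fdb()`, the `room` limit (standing in for a full netlink buffer) and the `resume` cursor (standing in for `cb->args[0]`) are all hypothetical.

```c
#include <stddef.h>

/* Hypothetical FDB entry: only the number of remotes matters here. */
struct fdb_entry {
	int nremotes;
};

/*
 * Emit at most `room` (entry, remote) records, skipping the first *resume
 * records that earlier calls already sent. On return, *resume holds the
 * position the next call should continue from. Returns the number emitted.
 */
size_t dump_fdb(const struct fdb_entry *tbl, size_t nentries,
		size_t *resume, size_t room)
{
	size_t idx = 0, emitted = 0;

	for (size_t e = 0; e < nentries; e++) {
		for (int r = 0; r < tbl[e].nremotes; r++) {
			if (idx < *resume)
				goto skip;	/* sent by an earlier call */
			if (emitted == room)
				goto out;	/* buffer full, do not count this one */
			emitted++;
skip:
			++idx;			/* one index per remote, not per entry */
		}
	}
out:
	*resume = idx;
	return emitted;
}
```

With one index per remote, a dump interrupted in the middle of an entry resumes at the next unsent remote. If the index only advanced once per entry, an entry with more remotes than fit in one buffer would restart from its first remote on every call.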
Re: [PATCHv1 bluetooth-next] cc2520: set the default fifo pin value from platform data
Yup... :-)

Your name in the From (LIYONG) address is different from SOB (Yong Li) address. It should be same, please fix your email-client.

On 08/10/2015 12:59 PM, LIYONG wrote:
> In case of the device tree support is disabled, the fifo_pin is
> uninitialized, this patch will set the fifo_pin value based on
> platform data
>
> Signed-off-by: Yong Li sdliy...@gmail.com
> ---
>  drivers/net/ieee802154/cc2520.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/net/ieee802154/cc2520.c b/drivers/net/ieee802154/cc2520.c
> index 613dae5..c5b54a1 100644
> --- a/drivers/net/ieee802154/cc2520.c
> +++ b/drivers/net/ieee802154/cc2520.c
> @@ -833,6 +833,7 @@ static int cc2520_get_platform_data(struct spi_device *spi,
>  	if (!spi_pdata)
>  		return -ENOENT;
>  	*pdata = *spi_pdata;
> +	priv->fifo_pin = pdata->fifo;
>  	return 0;
>  }
>
> --
> 2.1.0

This patch is not applying. Please use 'git format-patch' to generate the patch. And send it by 'git send-email'.

In your case:

  git commit -a -s -m "cc2520: set the default fifo pin value from platform data"
  git format-patch --subject-prefix="PATCH v2 bluetooth-next" -1
  git send-email 0001-

--
Varka Bhadram.
[PATCH] net/fddi:change HWM_REVERSE() macro
change HWM_REVERSE() macro to generic le32_to_cpu()

Signed-off-by: yalin wang yalin.wang2...@gmail.com
---
 drivers/net/fddi/skfp/h/hwmtm.h | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/drivers/net/fddi/skfp/h/hwmtm.h b/drivers/net/fddi/skfp/h/hwmtm.h
index 5924d42..72701ef 100644
--- a/drivers/net/fddi/skfp/h/hwmtm.h
+++ b/drivers/net/fddi/skfp/h/hwmtm.h
@@ -14,7 +14,7 @@
 #ifndef	_HWM_
 #define	_HWM_
 
-
+#include <linux/byteorder/generic.h>
 #include "mbuf.h"
 
 /*
@@ -74,14 +74,7 @@
 #define NULL 0
 #endif
 
-#ifdef LITTLE_ENDIAN
-#define HWM_REVERSE(x)	(x)
-#else
-#define	HWM_REVERSE(x)	((((x)<<24L)&0xff000000L)	+	\
-			 (((x)<< 8L)&0x00ff0000L)	+	\
-			 (((x)>> 8L)&0x0000ff00L)	+	\
-			 (((x)>>24L)&0x000000ffL))
-#endif
+#define HWM_REVERSE(x)	le32_to_cpu(x)
 
 #define C_INDIC		(1L<<25)
 #define A_INDIC		(1L<<26)
--
1.9.1
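The open-coded macro is a no-op on little-endian hosts and a 32-bit byte swap otherwise, which is what a little-endian-to-CPU conversion does. A minimal user-space sketch of both halves follows; `swap32()` and `le32_decode()` are hypothetical names for illustration and are not the kernel's `le32_to_cpu()` implementation.

```c
#include <stdint.h>

/* Unconditional 32-bit byte swap, equivalent to the open-coded macro body. */
uint32_t swap32(uint32_t x)
{
	return ((x << 24) & 0xff000000u) |
	       ((x <<  8) & 0x00ff0000u) |
	       ((x >>  8) & 0x0000ff00u) |
	       ((x >> 24) & 0x000000ffu);
}

/*
 * Portable model of a little-endian-to-CPU conversion: assemble the value
 * byte by byte, so the result is the same on any host endianness.
 */
uint32_t le32_decode(const unsigned char b[4])
{
	return (uint32_t)b[0] |
	       ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) |
	       ((uint32_t)b[3] << 24);
}
```

Decoding from bytes sidesteps the `#ifdef LITTLE_ENDIAN` split entirely, which is the same reason a generic helper is preferable to a driver-private macro.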
Re: [PATCH net-next v3 0/8] net: dsa: mv88e6xxx: support switchdev FDB objects
Hi Andrew,

On 15-08-10 16:11:38, Andrew Lunn wrote:
> On Mon, Aug 10, 2015 at 09:09:45AM -0400, Vivien Didelot wrote:
>> This patchset refactors the FDB management in the mv88e6xxx code and
>> adds the glue in DSA to use the switchdev FDB objects.
>
> Hi Vivien
>
> Thanks for reworking these patches. Now they are much smaller, they
> are much easier to review.
>
> Reviewed-by: Andrew Lunn and...@lunn.ch

Thanks for your time and suggestions on this, indeed with the reworked order, the diffs got smaller and more natural.

Regards,
-v
[PATCH net] inet: fix races with reqsk timers
From: Eric Dumazet eduma...@google.com

reqsk_queue_destroy() and reqsk_queue_unlink() should use del_timer_sync() instead of del_timer() before calling reqsk_put(), otherwise we could free a req still used by another cpu.

But before doing so, reqsk_queue_destroy() must release syn_wait_lock spinlock or risk a dead lock, as reqsk_timer_handler() might need to take this same spinlock from reqsk_queue_unlink() (called from inet_csk_reqsk_queue_drop())

Fixes: fa76ce7328b2 ("inet: get rid of central tcp/dccp listener timer")
Signed-off-by: Eric Dumazet eduma...@google.com
---
 net/core/request_sock.c         |    8 +++++++-
 net/ipv4/inet_connection_sock.c |    2 +-
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/core/request_sock.c b/net/core/request_sock.c
index 87b22c0..b42f0e2 100644
--- a/net/core/request_sock.c
+++ b/net/core/request_sock.c
@@ -103,10 +103,16 @@ void reqsk_queue_destroy(struct request_sock_queue *queue)
 			spin_lock_bh(&queue->syn_wait_lock);
 			while ((req = lopt->syn_table[i]) != NULL) {
 				lopt->syn_table[i] = req->dl_next;
+				/* Because of following del_timer_sync(),
+				 * we must release the spinlock here
+				 * or risk a dead lock.
+				 */
+				spin_unlock_bh(&queue->syn_wait_lock);
 				atomic_inc(&lopt->qlen_dec);
-				if (del_timer(&req->rsk_timer))
+				if (del_timer_sync(&req->rsk_timer))
 					reqsk_put(req);
 				reqsk_put(req);
+				spin_lock_bh(&queue->syn_wait_lock);
 			}
 			spin_unlock_bh(&queue->syn_wait_lock);
 		}
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 60021d0..05e3145 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -593,7 +593,7 @@ static bool reqsk_queue_unlink(struct request_sock_queue *queue,
 	}
 
 	spin_unlock(&queue->syn_wait_lock);
-	if (del_timer(&req->rsk_timer))
+	if (del_timer_sync(&req->rsk_timer))
 		reqsk_put(req);
 	return found;
 }
Re: [PATCH] net/fddi:change HWM_REVERSE() macro
On Tue, 2015-08-11 at 00:14 +0800, yalin wang wrote:

HWM_REVERSE is unused and it would be better if removed.
Re: [PATCH 08/10] ss: symmetrical subhandler output extension example
On Mon, 2015-08-10 at 15:19 +0300, Sergei Shtylyov wrote:
> {} not needed. I guess you haven't run your patches thru
> scripts/checkpatch.pl?

Yes, although this is missing from iproute2 sources ;)
[PATCH net-next 9/9] net: Introduce VRF device driver
This driver borrows heavily from IPvlan and teaming drivers.

Routing domains (VRF-lite) are created by instantiating a VRF master device with an associated table and enslaving all routed interfaces that participate in the domain. As part of the enslavement, all connected routes for the enslaved devices are moved to the table associated with the VRF device. Outgoing sockets must bind to the VRF device to function.

Standard FIB rules bind the VRF device to tables and regular fib rule processing is followed. Routed traffic through the box, is forwarded by using the VRF device as the IIF and following the IIF rule to a table that is mated with the VRF.

Example:

   Create vrf 1:
     ip link add vrf1 type vrf table 5
     ip rule add iif vrf1 table 5
     ip rule add oif vrf1 table 5
     ip route add table 5 prohibit default
     ip link set vrf1 up

   Add interface to vrf 1:
     ip link set eth1 master vrf1

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 drivers/net/Kconfig  | 7 +
 drivers/net/Makefile | 1 +
 drivers/net/vrf.c    | 685 +++
 3 files changed, 693 insertions(+)
 create mode 100644 drivers/net/vrf.c

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index c18f9e62a9fa..e58468b02987 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -297,6 +297,13 @@ config NLMON
 	  diagnostics, etc. This is mostly intended for developers or support
 	  to debug netlink issues. If unsure, say N.
 
+config NET_VRF
+	tristate "Virtual Routing and Forwarding (Lite)"
+	depends on IP_MULTIPLE_TABLES && IPV6_MULTIPLE_TABLES
+	---help---
+	  This option enables the support for mapping interfaces into VRF's. The
+	  support enables VRF devices.
+
 endif # NET_CORE
 
 config SUNGEM_PHY
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index c12cb22478a7..ca16dd689b36 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -25,6 +25,7 @@ obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
 obj-$(CONFIG_VXLAN) += vxlan.o
 obj-$(CONFIG_GENEVE) += geneve.o
 obj-$(CONFIG_NLMON) += nlmon.o
+obj-$(CONFIG_NET_VRF) += vrf.o
 
 #
 # Networking Drivers
diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
new file mode 100644
index 000000000000..95097cb79354
--- /dev/null
+++ b/drivers/net/vrf.c
@@ -0,0 +1,685 @@
+/*
+ * vrf.c: device driver to encapsulate a VRF space
+ *
+ * Copyright (c) 2015 Cumulus Networks. All rights reserved.
+ * Copyright (c) 2015 Shrijeet Mukherjee s...@cumulusnetworks.com
+ * Copyright (c) 2015 David Ahern d...@cumulusnetworks.com
+ *
+ * Based on dummy, team and ipvlan drivers
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/ip.h>
+#include <linux/init.h>
+#include <linux/moduleparam.h>
+#include <linux/netfilter.h>
+#include <linux/rtnetlink.h>
+#include <net/rtnetlink.h>
+#include <linux/u64_stats_sync.h>
+#include <linux/hashtable.h>
+
+#include <linux/inetdevice.h>
+#include <net/ip.h>
+#include <net/ip_fib.h>
+#include <net/ip6_route.h>
+#include <net/rtnetlink.h>
+#include <net/route.h>
+#include <net/addrconf.h>
+#include <net/vrf.h>
+
+#define DRV_NAME	"vrf"
+#define DRV_VERSION	"1.0"
+
+#define vrf_is_slave(dev)   ((dev)->flags & IFF_SLAVE)
+
+#define vrf_master_get_rcu(dev) \
+	((struct net_device *)rcu_dereference(dev->rx_handler_data))
+
+struct pcpu_dstats {
+	u64			tx_pkts;
+	u64			tx_bytes;
+	u64			tx_drps;
+	u64			rx_pkts;
+	u64			rx_bytes;
+	struct u64_stats_sync	syncp;
+};
+
+static struct dst_entry *vrf_ip_check(struct dst_entry *dst, u32 cookie)
+{
+	return dst;
+}
+
+static int vrf_ip_local_out(struct sk_buff *skb)
+{
+	return ip_local_out(skb);
+}
+
+static unsigned int vrf_v4_mtu(const struct dst_entry *dst)
+{
+	/* TO-DO: return max ethernet size? */
+	return dst->dev->mtu;
+}
+
+static void vrf_dst_destroy(struct dst_entry *dst)
+{
+	/* our dst lives forever - or until the device is closed */
+}
+
+static unsigned int vrf_default_advmss(const struct dst_entry *dst)
+{
+	return 65535 - 40;
+}
+
+static struct dst_ops vrf_dst_ops = {
+	.family		= AF_INET,
+	.local_out	= vrf_ip_local_out,
+	.check		= vrf_ip_check,
+	.mtu		= vrf_v4_mtu,
+	.destroy	= vrf_dst_destroy,
+	.default_advmss	= vrf_default_advmss,
+};
+
+static bool is_ip_rx_frame(struct sk_buff *skb)
+{
+	switch (skb->protocol) {
+	case htons(ETH_P_IP):
+	case htons(ETH_P_IPV6):
+
[PATCH 3/5] netfilter: conntrack: Use flags in nf_ct_tmpl_alloc()
From: Joe Stringer joestrin...@nicira.com

The flags were ignored for this function when it was introduced. Also fix the style problem in kzalloc.

Fixes: 0838aa7fc ("netfilter: fix netns dependencies with conntrack templates")
Signed-off-by: Joe Stringer joestrin...@nicira.com
Signed-off-by: Pablo Neira Ayuso pa...@netfilter.org
---
 net/netfilter/nf_conntrack_core.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index f168099..3c20d02 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -292,7 +292,7 @@ struct nf_conn *nf_ct_tmpl_alloc(struct net *net, u16 zone, gfp_t flags)
 {
 	struct nf_conn *tmpl;
 
-	tmpl = kzalloc(sizeof(struct nf_conn), GFP_KERNEL);
+	tmpl = kzalloc(sizeof(*tmpl), flags);
 	if (tmpl == NULL)
 		return NULL;
 
@@ -303,7 +303,7 @@ struct nf_conn *nf_ct_tmpl_alloc(struct net *net, u16 zone, gfp_t flags)
 	if (zone) {
 		struct nf_conntrack_zone *nf_ct_zone;
 
-		nf_ct_zone = nf_ct_ext_add(tmpl, NF_CT_EXT_ZONE, GFP_ATOMIC);
+		nf_ct_zone = nf_ct_ext_add(tmpl, NF_CT_EXT_ZONE, flags);
 		if (!nf_ct_zone)
 			goto out_free;
 		nf_ct_zone->id = zone;
--
1.7.10.4
[PATCH 4/5] netfilter: ip6t_SYNPROXY: fix NULL pointer dereference
From: Phil Sutter p...@nwl.cc

This happens when networking namespaces are enabled.

Suggested-by: Patrick McHardy ka...@trash.net
Signed-off-by: Phil Sutter p...@nwl.cc
Acked-by: Patrick McHardy ka...@trash.net
Signed-off-by: Pablo Neira Ayuso pa...@netfilter.org
---
 net/ipv6/netfilter/ip6t_SYNPROXY.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/net/ipv6/netfilter/ip6t_SYNPROXY.c b/net/ipv6/netfilter/ip6t_SYNPROXY.c
index 6edb7b1..bcebc24 100644
--- a/net/ipv6/netfilter/ip6t_SYNPROXY.c
+++ b/net/ipv6/netfilter/ip6t_SYNPROXY.c
@@ -37,12 +37,13 @@ synproxy_build_ip(struct sk_buff *skb, const struct in6_addr *saddr,
 }
 
 static void
-synproxy_send_tcp(const struct sk_buff *skb, struct sk_buff *nskb,
+synproxy_send_tcp(const struct synproxy_net *snet,
+		  const struct sk_buff *skb, struct sk_buff *nskb,
 		  struct nf_conntrack *nfct, enum ip_conntrack_info ctinfo,
 		  struct ipv6hdr *niph, struct tcphdr *nth,
 		  unsigned int tcp_hdr_size)
 {
-	struct net *net = nf_ct_net((struct nf_conn *)nfct);
+	struct net *net = nf_ct_net(snet->tmpl);
 	struct dst_entry *dst;
 	struct flowi6 fl6;
 
@@ -83,7 +84,8 @@ free_nskb:
 }
 
 static void
-synproxy_send_client_synack(const struct sk_buff *skb, const struct tcphdr *th,
+synproxy_send_client_synack(const struct synproxy_net *snet,
+			    const struct sk_buff *skb, const struct tcphdr *th,
 			    const struct synproxy_options *opts)
 {
 	struct sk_buff *nskb;
@@ -119,7 +121,7 @@ synproxy_send_client_synack(const struct sk_buff *skb, const struct tcphdr *th,
 
 	synproxy_build_options(nth, opts);
 
-	synproxy_send_tcp(skb, nskb, skb->nfct, IP_CT_ESTABLISHED_REPLY,
+	synproxy_send_tcp(snet, skb, nskb, skb->nfct, IP_CT_ESTABLISHED_REPLY,
 			  niph, nth, tcp_hdr_size);
 }
 
@@ -163,7 +165,7 @@ synproxy_send_server_syn(const struct synproxy_net *snet,
 
 	synproxy_build_options(nth, opts);
 
-	synproxy_send_tcp(skb, nskb, &snet->tmpl->ct_general, IP_CT_NEW,
+	synproxy_send_tcp(snet, skb, nskb, &snet->tmpl->ct_general, IP_CT_NEW,
 			  niph, nth, tcp_hdr_size);
 }
 
@@ -203,7 +205,7 @@ synproxy_send_server_ack(const struct synproxy_net *snet,
 
 	synproxy_build_options(nth, opts);
 
-	synproxy_send_tcp(skb, nskb, NULL, 0, niph, nth, tcp_hdr_size);
+	synproxy_send_tcp(snet, skb, nskb, NULL, 0, niph, nth, tcp_hdr_size);
 }
 
 static void
@@ -241,7 +243,7 @@ synproxy_send_client_ack(const struct synproxy_net *snet,
 
 	synproxy_build_options(nth, opts);
 
-	synproxy_send_tcp(skb, nskb, NULL, 0, niph, nth, tcp_hdr_size);
+	synproxy_send_tcp(snet, skb, nskb, NULL, 0, niph, nth, tcp_hdr_size);
 }
 
 static bool
@@ -301,7 +303,7 @@ synproxy_tg6(struct sk_buff *skb, const struct xt_action_param *par)
 					  XT_SYNPROXY_OPT_SACK_PERM |
 					  XT_SYNPROXY_OPT_ECN);
 
-		synproxy_send_client_synack(skb, th, &opts);
+		synproxy_send_client_synack(snet, skb, th, &opts);
 		return NF_DROP;
 
 	} else if (th->ack && !(th->fin || th->rst || th->syn)) {
--
1.7.10.4
[PATCH 1/5] netfilter: nf_conntrack: silence warning on falling back to vmalloc()
Since 88eab472ec21 ("netfilter: conntrack: adjust nf_conntrack_buckets default value"), the hashtable can easily hit this warning. We got reports from users that are getting this message in a quite spamming fashion, so better silence this.

Signed-off-by: Pablo Neira Ayuso pa...@netfilter.org
Acked-by: Florian Westphal f...@strlen.de
---
 net/netfilter/nf_conntrack_core.c |    4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 651039a..f168099 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1544,10 +1544,8 @@ void *nf_ct_alloc_hashtable(unsigned int *sizep, int nulls)
 	sz = nr_slots * sizeof(struct hlist_nulls_head);
 	hash = (void *)__get_free_pages(GFP_KERNEL | __GFP_NOWARN | __GFP_ZERO,
 					get_order(sz));
-	if (!hash) {
-		printk(KERN_WARNING "nf_conntrack: falling back to vmalloc.\n");
+	if (!hash)
 		hash = vzalloc(sz);
-	}
 
 	if (hash && nulls)
 		for (i = 0; i < nr_slots; i++)
--
1.7.10.4
Re: [PATCH net-next 1/2] net: track link status of ipv6 nexthops
From: Andy Gospodarek go...@cumulusnetworks.com
Date: Thu, 6 Aug 2015 11:42:33 -0400

> Add support to track current link status of ipv6 nexthops to match
> recent changes that added support for ipv4 nexthops. There was not a
> field already available that could track these and no space available
> in the existing rt6i_flags field, so this patch adds rt6i_nhflags to
> struct rt6_info.
>
> Signed-off-by: Andy Gospodarek go...@cumulusnetworks.com
> Signed-off-by: Dinesh Dutt dd...@cumulusnetworks.com

This doesn't really make any sense to me.

You can evaluate the state of the link at the time you look at the route at all of the places where it matters as far as I can tell.

It's so expensive to walk the entire routing table every time a link goes up and down, so it's much better to take an evaluate as needed approach to implementing this.
[PATCH 0/5] Netfilter fixes for net
Hi David,

The following patchset contains five Netfilter fixes for your net tree, they are:

1) Silence a warning on falling back to vmalloc(). Since 88eab472ec21, we can easily hit this warning message, that gets users confused. So let's get rid of it.

2) Recently when porting the template object allocation on top of kmalloc to fix the netns dependencies between x_tables and conntrack, the error checks were left unchanged. Remove IS_ERR() and check for NULL instead. Patch from Dan Carpenter.

3) Don't ignore gfp_flags in the new nf_ct_tmpl_alloc() function, from Joe Stringer.

4) Fix a crash due to NULL pointer dereference in ip6t_SYNPROXY, patch from Phil Sutter.

5) The sequence number of the Syn+ack that is sent from SYNPROXY to clients is not adjusted through our NAT infrastructure, as a result the client may ignore this TCP packet and TCP flow hangs until the client probes us. Also from Phil Sutter.

You can pull these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git

Thanks!

The following changes since commit 15f1bb1f1e067be7088ed43ef23d59629bd24348:

  qlcnic: Fix corruption while copying (2015-07-29 23:57:26 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git master

for you to fetch changes up to 3c16241c445303a90529565e7437e1f240acfef2:

  netfilter: SYNPROXY: fix sending window update to client (2015-08-10 13:55:07 +0200)

Dan Carpenter (1):
      netfilter: nf_conntrack: checking for IS_ERR() instead of NULL

Joe Stringer (1):
      netfilter: conntrack: Use flags in nf_ct_tmpl_alloc()

Pablo Neira Ayuso (1):
      netfilter: nf_conntrack: silence warning on falling back to vmalloc()

Phil Sutter (2):
      netfilter: ip6t_SYNPROXY: fix NULL pointer dereference
      netfilter: SYNPROXY: fix sending window update to client

 net/ipv4/netfilter/ipt_SYNPROXY.c  |    3 ++-
 net/ipv6/netfilter/ip6t_SYNPROXY.c |   19 +++++++++++--------
 net/netfilter/nf_conntrack_core.c  |    8 +++-----
 net/netfilter/nf_synproxy_core.c   |    4 +---
 net/netfilter/xt_CT.c              |    5 +++--
 5 files changed, 20 insertions(+), 19 deletions(-)
[PATCH 2/5] netfilter: nf_conntrack: checking for IS_ERR() instead of NULL
From: Dan Carpenter dan.carpen...@oracle.com

We recently changed this from nf_conntrack_alloc() to nf_ct_tmpl_alloc() so the error handling needs to be changed to check for NULL instead of IS_ERR().

Fixes: 0838aa7fcfcd ('netfilter: fix netns dependencies with conntrack templates')
Signed-off-by: Dan Carpenter dan.carpen...@oracle.com
Signed-off-by: Pablo Neira Ayuso pa...@netfilter.org
---
 net/netfilter/nf_synproxy_core.c |    4 +---
 net/netfilter/xt_CT.c            |    5 +++--
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/net/netfilter/nf_synproxy_core.c b/net/netfilter/nf_synproxy_core.c
index 71f1e9f..d7f1685 100644
--- a/net/netfilter/nf_synproxy_core.c
+++ b/net/netfilter/nf_synproxy_core.c
@@ -353,10 +353,8 @@ static int __net_init synproxy_net_init(struct net *net)
 	int err = -ENOMEM;
 
 	ct = nf_ct_tmpl_alloc(net, 0, GFP_KERNEL);
-	if (IS_ERR(ct)) {
-		err = PTR_ERR(ct);
+	if (!ct)
 		goto err1;
-	}
 
 	if (!nfct_seqadj_ext_add(ct))
 		goto err2;
diff --git a/net/netfilter/xt_CT.c b/net/netfilter/xt_CT.c
index c663003..43ddeee 100644
--- a/net/netfilter/xt_CT.c
+++ b/net/netfilter/xt_CT.c
@@ -202,9 +202,10 @@ static int xt_ct_tg_check(const struct xt_tgchk_param *par,
 		goto err1;
 
 	ct = nf_ct_tmpl_alloc(par->net, info->zone, GFP_KERNEL);
-	ret = PTR_ERR(ct);
-	if (IS_ERR(ct))
+	if (!ct) {
+		ret = -ENOMEM;
 		goto err2;
+	}
 
 	ret = 0;
 	if ((info->ct_events || info->exp_events) &&
--
1.7.10.4
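Why an IS_ERR() check is wrong for an allocator that returns NULL can be shown with user-space models of the error-pointer helpers. This is a sketch, assuming the usual convention that the top 4095 addresses encode negative errno values; `demo_err_ptr()`, `demo_ptr_err()`, `demo_is_err()` and `DEMO_MAX_ERRNO` are hypothetical names, not the kernel's own definitions.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
void *demo_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* Recover the errno value from an error pointer. */
long demo_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* True only for pointers in the top DEMO_MAX_ERRNO addresses. */
int demo_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-DEMO_MAX_ERRNO;
}
```

NULL is address zero, so `demo_is_err(NULL)` is false: a caller that only tests the error-pointer range treats a failed NULL-returning allocation as success and goes on to dereference it.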
[PATCH 5/5] netfilter: SYNPROXY: fix sending window update to client
From: Phil Sutter p...@nwl.cc

Upon receipt of SYNACK from the server, ipt_SYNPROXY first sends back an ACK
to finish the server handshake, then calls nf_ct_seqadj_init() to initiate
sequence number adjustment of forwarded packets to the client and finally
sends a window update to the client to unblock its TX queue.

Since synproxy_send_client_ack() does not set synproxy_send_tcp()'s nfct
parameter, no sequence number adjustment happens and the client receives the
window update with an incorrect sequence number. Depending on the client's TCP
implementation, this leads to a significant delay (until a window probe is
sent).

Signed-off-by: Phil Sutter p...@nwl.cc
Signed-off-by: Pablo Neira Ayuso pa...@netfilter.org
---
 net/ipv4/netfilter/ipt_SYNPROXY.c  | 3 ++-
 net/ipv6/netfilter/ip6t_SYNPROXY.c | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/netfilter/ipt_SYNPROXY.c b/net/ipv4/netfilter/ipt_SYNPROXY.c
index fe8cc18..95ea633e 100644
--- a/net/ipv4/netfilter/ipt_SYNPROXY.c
+++ b/net/ipv4/netfilter/ipt_SYNPROXY.c
@@ -226,7 +226,8 @@ synproxy_send_client_ack(const struct synproxy_net *snet,

 	synproxy_build_options(nth, opts);

-	synproxy_send_tcp(skb, nskb, NULL, 0, niph, nth, tcp_hdr_size);
+	synproxy_send_tcp(skb, nskb, skb->nfct, IP_CT_ESTABLISHED_REPLY,
+			  niph, nth, tcp_hdr_size);
 }

 static bool
diff --git a/net/ipv6/netfilter/ip6t_SYNPROXY.c b/net/ipv6/netfilter/ip6t_SYNPROXY.c
index bcebc24..ebbb754 100644
--- a/net/ipv6/netfilter/ip6t_SYNPROXY.c
+++ b/net/ipv6/netfilter/ip6t_SYNPROXY.c
@@ -243,7 +243,8 @@ synproxy_send_client_ack(const struct synproxy_net *snet,

 	synproxy_build_options(nth, opts);

-	synproxy_send_tcp(snet, skb, nskb, NULL, 0, niph, nth, tcp_hdr_size);
+	synproxy_send_tcp(snet, skb, nskb, skb->nfct, IP_CT_ESTABLISHED_REPLY,
+			  niph, nth, tcp_hdr_size);
 }

 static bool
--
1.7.10.4
Re: [PATCH net] netlink: make sure -EBUSY won't escape from netlink_insert
From: Daniel Borkmann dan...@iogearbox.net
Date: Fri, 7 Aug 2015 00:26:41 +0200

Linus reports the following deadlock on rtnl_mutex; triggered only once so far (extract):

...

It seems so far plausible that the recursive call into rtnetlink_rcv() looks suspicious. One way this could trigger is that the sender's NETLINK_CB(skb).portid was wrongly 0 (which is the rtnetlink socket), so the rtnl_getlink() request's answer would be sent to the kernel instead of to the actual user process, thus grabbing rtnl_mutex() twice.

One theory would be that netlink_autobind() triggered via netlink_sendmsg() internally overwrites the -EBUSY error to 0, but where it is wrongly originating from __netlink_insert() instead. That would reset the socket's portid to 0, which is then filled into NETLINK_CB(skb).portid later on. As commit d470e3b483dc ([NETLINK]: Fix two socket hashing bugs.) also puts it, -EBUSY should not be propagated from netlink_insert().

It looks like it's very unlikely to reproduce. We need to trigger the rhashtable_insert_rehash() handler under a situation where rehashing currently occurs (one /rare/ way would be to hit ht->elasticity limits while not filled enough to expand the hashtable, but that would rather require a specifically crafted bind() sequence with knowledge about destination slots, seems unlikely). It probably makes sense to guard __netlink_insert() in any case and remap that error. It was suggested that EOVERFLOW might be better than an already overloaded ENOMEM.

Reference: http://thread.gmane.org/gmane.linux.network/372676
Reported-by: Linus Torvalds torva...@linux-foundation.org
Signed-off-by: Daniel Borkmann dan...@iogearbox.net

Applied and queued up for -stable, thanks.
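The remapping idea from the commit message can be sketched in a few lines of userspace C. This is not the applied patch; table_insert_stub(), insert_guarded() and autobind_like() are hypothetical names, and the stub simply returns whatever result it is told to simulate. The sketch only illustrates the reasoning: a caller that treats -EBUSY as "already bound, not an error" will silently swallow an -EBUSY that came from the hash table instead, unless the insert path remaps it first.

```c
#include <errno.h>

/*
 * Hypothetical stand-in for the hash table insert. It returns the simulated
 * result unchanged: 0 on success or a negative errno.
 */
static int table_insert_stub(int simulated_result)
{
	return simulated_result;
}

/*
 * Guarded insert: an -EBUSY coming out of the table (for example during a
 * rehash) is remapped so it cannot be confused with "socket already bound".
 */
static int insert_guarded(int simulated_result)
{
	int err = table_insert_stub(simulated_result);

	if (err == -EBUSY)
		err = -EOVERFLOW;
	return err;
}

/*
 * Caller in the style described for autobind: -EBUSY is taken to mean that
 * another thread already bound the socket, so it is turned into success.
 */
static int autobind_like(int insert_result)
{
	if (insert_result == -EBUSY)
		return 0;
	return insert_result;
}
```

Without the guard, autobind_like(table_insert_stub(-EBUSY)) reports success even though nothing was inserted; with it, the failure stays visible as -EOVERFLOW.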
Re: kernel warning in tcp_fragment
Ping?

We are seeing a lot of these warnings in our production system. It would be greatly appreciated if someone could give us a fix for them. :)

On Fri, Jul 31, 2015 at 11:04 AM, Jovi Zhangwei j...@cloudflare.com wrote:

Hi Eric,

Would you like to share your thoughts on this bug? Many thanks.

On Mon, Jul 27, 2015 at 4:19 PM, Martin KaFai Lau ka...@fb.com wrote:

On Wed, Jul 22, 2015 at 11:55:35AM -0700, Jovi Zhangwei wrote:

Sorry for disturbing. Our production systems (3.14 and 3.18 stable kernels) have many tcp_fragment warnings; the trace is the same as the one below, which you discussed before.

https://urldefense.proofpoint.com/v1/url?u=http://comments.gmane.org/gmane.linux.network/365658k=ZVNjlDMF0FElm4dQtryO4A%3D%3D%0Ar=%2Faj1ZOQObwbmtLwlDw3XzQ%3D%3D%0Am=fQUME5h%2FYY3oZjXbnLC3z6TaEEcTBSCAji4PkNqFjq8%3D%0As=1527f3221a6f31cba9544e5ddaa20986aafe8be8c898b42c7e9ce5e68d3803d8

But I didn't find the final solution in that mail thread. Do you have any new ideas or patches for this warning?

I think the following points to the last discussion. We are currently using a similar patch:

http://comments.gmane.org/gmane.linux.network/366549

Eric, any update on your findings? Or have you already pushed a fix?

Thanks,
--Martin
Re: [PATCH 08/10] ss: symmetrical subhandler output extension example
On 08/10/2015 05:53 PM, Eric Dumazet wrote:

{} not needed. I guess you haven't run your patches thru scripts/checkpatch.pl?

Yes, although this is missing from iproute2 sources ;)

Oh, sorry, somehow I thought it's a kernel patch. :-)

MBR, Sergei
Re: IPv6 and private net with masquerading not working correctly
(Cc'ing netdev and netfilter-devel)

On Fri, Aug 7, 2015 at 6:00 AM, Gerhard Wiesinger li...@wiesinger.com wrote:

On 06.08.2015 20:43, Gerhard Wiesinger wrote:

Hello,

I'm having the following problem with IPv6 and a private internal LAN which will be masqueraded to the public internet (I don't want to have public IPs in the LAN because of some static IPs and tracking). Rules are generated by shorewall.

The problem is that the source address of ICMP6 packets is not translated by the kernel on the reply when the MTU has to be discovered because of packets that are too big and limited MTU capabilities on the path (this also happens with tcp6, which therefore does not work correctly).

# From an internal host on net fd00:1234:5678::/64
ping6 -s 2000 2a02:1234:5678:7::2

/etc/shorewall6/masq
EXT_IF fc00::/7

ip6tables rule:
MASQUERADE all * * fc00::/7 ::/0

# Internal interface
IP6 fd00:1234:5678::9 > 2a02:1234:5678:7::2: frag (0|1432) ICMP6, echo request, seq 1, length 1432
IP6 fd00:1234:5678::9 > 2a02:1234:5678:7::2: frag (1432|576)
IP6 2a02:1234:5678:9abc::115 > fd00:1234:5678::9: ICMP6, packet too big, mtu 1440, length 1240

# External interface
IP6 2001:1234:5678:9abc::1 > 2a02:1234:5678:7::2: frag (0|1432) ICMP6, echo request, seq 1, length 1432
IP6 2001:1234:5678:9abc::1 > 2a02:1234:5678:7::2: frag (1432|576)
IP6 2a02:1234:5678:9abc::115 > 2001:1234:5678:9abc::1: ICMP6, packet too big, mtu 1440, length 1240

Looks to me like a major kernel bug.

Kernel version is: 4.1.3-201.fc22.x86_64 from Fedora 22

Any ideas?

Any comments?

Ciao,
Gerhard

--
http://www.wiesinger.com/
[PATCH net-next 00/10] VRF-lite - v5
In the context of internet scale routing a requirement that always comes up is the need to partition the available routing tables into disjoint routing planes. A specific use case is the multi-tenancy problem where each tenant has their own unique routing tables and at the very least needs different default gateways.

This patch set adds the ability to create virtual router domains (aka VRFs, VRF-lite to be specific) in the Linux packet forwarding stack. The main observation is that through the use of rules and socket binding to interfaces, all the facilities that we need are already present in the infrastructure. What is missing is a handle that identifies a routing domain and can be used to gather applicable rules/tables and uniqify neighbor selection. The scheme used needs to preserve the notions of ECMP and general routing principles.

This driver is a cross between the functionality that the IPVLAN driver and the Team drivers provide, where a device is created and packets into/out of the routing domain are shuttled through this device. The device is then used as a handle to identify the applicable rules. The VRF device is thus the layer3 equivalent of a vlan device.

The very important point to note is that this is only a Layer3 concept, so L2 tools (e.g., LLDP) do not need to be run in each VRF; processes can run in unaware mode or select a VRF to be talking through. Also, the behavioral model is a generalized application of the familiar VRF-Lite model with some performance paths that need optimization. (Specifically the output route selector that Roopa, Robert, Thomas and EricB are currently discussing on the MPLS thread.)

High Level points
=================

1. Simple overlay driver (minimal changes to current stack)
   * uses the existing fib tables and fib rules infrastructure

2. Modelled closely after the ipvlan driver

3. Uses current API and infrastructure.
   * Applications can use SO_BINDTODEVICE or cmsg device identifiers to pick VRF (ping, traceroute just work)
   * Standard IP Rules work, and since they are aggregated against the device, scale is manageable

4. Completely orthogonal to Namespaces and only provides separation in the routing plane (and ARP)

                                         N2
        N1 (all configs here)      +---------------+
   +--------------+                |               |
   |swp1 :10.0.1.1+----------------+swp1 :10.0.1.2 |
   |              |                |               |
   |swp2 :10.0.2.1+----------------+swp2 :10.0.2.2 |
   |              |                +---------------+
   | VRF 1        |
   | table 5      |
   |              |
   +--------------+
   |              |
   | VRF 2        |                      N3
   | table 6      |                +---------------+
   |              |                |               |
   |swp3 :10.0.2.1+----------------+swp1 :10.0.2.2 |
   |              |                |               |
   |swp4 :10.0.3.1+----------------+swp2 :10.0.3.2 |
   +--------------+                +---------------+

Given the topology above, the setup needed to get the basic VRF functions working would be:

Create the VRF devices and associate with a table
    ip link add vrf1 type vrf table 5
    ip link add vrf2 type vrf table 6

Install the lookup rules that map table to VRF domain
    ip rule add pref 200 oif vrf1 lookup 5
    ip rule add pref 200 iif vrf1 lookup 5
    ip rule add pref 200 oif vrf2 lookup 6
    ip rule add pref 200 iif vrf2 lookup 6

    ip link set vrf1 up
    ip link set vrf2 up

Enslave the routing member interfaces
    ip link set swp1 master vrf1
    ip link set swp2 master vrf1
    ip link set swp3 master vrf2
    ip link set swp4 master vrf2

Connected and local routes are automatically moved from main and local tables to the VRF table.

ping using VRF 1 is simply
    ping -I vrf1 10.0.1.2

Design Highlights
=================

If a device is enslaved to a VRF device (i.e., associated with a VRF) then:

1. Rx path
   The master device index is used as the iif for all lookups.

2. Tx path
   Similarly, for Tx the VRF device oif is used in the flow to direct lookups to the table associated with the VRF via its rule. From there the FLOWI_FLAG_VRFSRC flag is used to indicate that the oif should not be used for FIB table lookups.

3. Connected and local routes
   On link up for a device, connected and local routes are added to the table associated with the VRF device, rather than the local and main tables.

4. Socket lookups
   Socket lookups use the VRF device for comparison with sk_bound_dev_if. If a socket is not bound to a device, a socket match can happen based on destination address, port and protocol, in which case a VRF global or agnostic process handles the connection (i.e., this allows 1 listener socket to handle connections across VRFs). The child socket becomes bound to the
[PATCH net-next 2/9] net: Use VRF device index for lookups on RX
On ingress use index of VRF master device for route lookups if real device
is enslaved. Rules are expected to be installed for the VRF device to
direct lookups to a specific table.

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 net/ipv4/fib_frontend.c | 8 +++-
 net/ipv4/route.c        | 3 ++-
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 6b98de0d7949..d8ced1d89f1b 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -45,6 +45,7 @@
 #include <net/ip_fib.h>
 #include <net/rtnetlink.h>
 #include <net/xfrm.h>
+#include <net/vrf.h>

 #ifndef CONFIG_IP_MULTIPLE_TABLES

@@ -309,7 +310,9 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
 	bool dev_match;

 	fl4.flowi4_oif = 0;
-	fl4.flowi4_iif = oif ? : LOOPBACK_IFINDEX;
+	fl4.flowi4_iif = vrf_master_ifindex_rcu(dev);
+	if (!fl4.flowi4_iif)
+		fl4.flowi4_iif = oif ? : LOOPBACK_IFINDEX;
 	fl4.daddr = src;
 	fl4.saddr = dst;
 	fl4.flowi4_tos = tos;
@@ -339,6 +342,9 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
 			if (nh->nh_dev == dev) {
 				dev_match = true;
 				break;
+			} else if (vrf_master_ifindex_rcu(nh->nh_dev) == dev->ifindex) {
+				dev_match = true;
+				break;
 			}
 		}
 #else
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 18fd7c9095c7..c26ff1f7067d 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -112,6 +112,7 @@
 #endif
 #include <net/secure_seq.h>
 #include <net/ip_tunnels.h>
+#include <net/vrf.h>

 #define RT_FL_TOS(oldflp4) \
 	((oldflp4)->flowi4_tos & (IPTOS_RT_MASK | RTO_ONLINK))
@@ -1726,7 +1727,7 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	 *	Now we are ready to route packet.
 	 */
 	fl4.flowi4_oif = 0;
-	fl4.flowi4_iif = dev->ifindex;
+	fl4.flowi4_iif = vrf_master_ifindex_rcu(dev) ? : dev->ifindex;
 	fl4.flowi4_mark = skb->mark;
 	fl4.flowi4_tos = tos;
 	fl4.flowi4_scope = RT_SCOPE_UNIVERSE;
--
2.3.2 (Apple Git-55)
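The selection rule in the route.c hunk (use the VRF master's index when the ingress device is enslaved, otherwise the device's own index) can be modelled in plain C. The struct and function names below are hypothetical and pared down for illustration; the real vrf_master_ifindex_rcu() works on struct net_device under RCU, which this sketch does not attempt to reproduce. It also spells out the conditional in standard C rather than using the GNU `?:` shorthand from the patch.

```c
#include <stddef.h>

/* Hypothetical, pared-down device: only the fields this sketch needs. */
struct netdev_sketch {
	int ifindex;
	const struct netdev_sketch *master;	/* NULL when not enslaved */
	int is_vrf;				/* nonzero for a VRF device */
};

/* Returns the ifindex of the VRF master, or 0 if the device has none. */
static int vrf_master_ifindex_sketch(const struct netdev_sketch *dev)
{
	if (dev->master && dev->master->is_vrf)
		return dev->master->ifindex;
	return 0;
}

/* Input interface index to use for the route lookup. */
static int lookup_iif(const struct netdev_sketch *dev)
{
	int idx = vrf_master_ifindex_sketch(dev);

	return idx ? idx : dev->ifindex;
}
```

Because 0 is never a valid interface index, it can double as the "no VRF master" result, which is what makes the short fallback expression in the patch work.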
[PATCH 06/10] ss: replaced old output mechanisms with fmt handlers interfaces
Now, since the fmt (json, hr) handlers are in place, all can be output via these newly deviced code parts. Signed-off-by: Matthias Tafelmeier matthias.tafelme...@gmx.net Suggested-by: Hagen Paul Pfeifer ha...@jauu.net --- misc/ss.c | 330 +- 1 file changed, 152 insertions(+), 178 deletions(-) diff --git a/misc/ss.c b/misc/ss.c index 8fb6e7d..993a87b 100644 --- a/misc/ss.c +++ b/misc/ss.c @@ -105,6 +105,7 @@ int show_sock_ctx = 0; int user_ent_hash_build_init = 0; int follow_events = 0; int json_output = 0; +int json_first_elem = 1; int netid_width; int state_width; @@ -113,6 +114,8 @@ int addr_width; int serv_width; int screen_width; +enum out_fmt_type fmt_type = FMT_HR; + static const char *TCP_PROTO = tcp; static const char *UDP_PROTO = udp; static const char *RAW_PROTO = raw; @@ -346,6 +349,16 @@ static FILE *ephemeral_ports_open(void) #define USER_ENT_HASH_SIZE 256 struct user_ent *user_ent_hash[USER_ENT_HASH_SIZE]; +static void json_print_opening(void) +{ + if (json_output json_first_elem) { + json_first_elem = 0; + printf({\n); + } else if (json_output) { + printf(,\n{\n); + } +} + static int user_ent_hashfn(unsigned int ino) { int val = (ino 24) ^ (ino 16) ^ (ino 8) ^ ino; @@ -791,7 +804,8 @@ do_numeric: return buf; } -static void inet_addr_print(const inet_prefix *a, int port, unsigned int ifindex) +static void inet_addr_print(const inet_prefix *a, int port, + unsigned int ifindex, char *peer_kind) { char buf[1024]; const char *ap = buf; @@ -819,8 +833,8 @@ static void inet_addr_print(const inet_prefix *a, int port, unsigned int ifindex est_len -= strlen(ifname) + 1; /* +1 for percent char */ } - sock_addr_print_width(est_len, ap, :, serv_width, resolve_service(port), - ifname); + sock_addr_fmt(ap, est_len, :, serv_width, resolve_service(port), + ifname, peer_kind); } static int inet2_addr_match(const inet_prefix *a, const inet_prefix *p, @@ -1352,21 +1366,27 @@ static void inet_stats_print(struct sockstat *s, int protocol) { char *buf = NULL; - 
sock_state_print(s, proto_name(protocol)); + sock_state_fmt(s, sstate_name, proto_name(protocol), + netid_width, state_width); - inet_addr_print(s-local, s-lport, s-iface); - inet_addr_print(s-remote, s-rport, 0); + if (json_output) + printf(\t,\peers\: {\n); + + inet_addr_print(s-local, s-lport, s-iface, local); + inet_addr_print(s-remote, s-rport, 0, remote); + if (json_output) + printf(}); if (show_proc_ctx || show_sock_ctx) { if (find_entry(s-ino, buf, - (show_proc_ctx show_sock_ctx) ? - PROC_SOCK_CTX : PROC_CTX) 0) { - printf( users:(%s), buf); + (show_proc_ctx show_sock_ctx) ? + PROC_SOCK_CTX : PROC_CTX) 0) { + sock_users_fmt(buf); free(buf); } } else if (show_users) { if (find_entry(s-ino, buf, USERS) 0) { - printf( users:(%s), buf); + sock_users_fmt(buf); free(buf); } } @@ -1470,16 +1490,16 @@ static int tcp_show_line(char *line, const struct filter *f, int family) inet_stats_print(s.ss, IPPROTO_TCP); if (show_options) - tcp_timer_print(s); + tcp_timer_fmt(s); if (show_details) { - sock_details_print(s.ss); + sock_details_fmt(s.ss, GENERIC_DETAIL, 0, 0); if (opt[0]) - printf( opt:\%s\, opt); + opt_fmt(opt); } if (show_tcpinfo) - tcp_stats_print(s); + tcp_stats_fmt(s); printf(\n); return 0; @@ -1523,31 +1543,14 @@ static void print_skmeminfo(struct rtattr *tb[], int attrtype) const struct inet_diag_meminfo *minfo = RTA_DATA(tb[INET_DIAG_MEMINFO]); - printf( mem:(r%u,w%u,f%u,t%u), - minfo-idiag_rmem, - minfo-idiag_wmem, - minfo-idiag_fmem, - minfo-idiag_tmem); + mem_fmt(minfo); } return; } skmeminfo = RTA_DATA(tb[attrtype]); - printf( skmem:(r%u,rb%u,t%u,tb%u,f%u,w%u,o%u, - skmeminfo[SK_MEMINFO_RMEM_ALLOC], - skmeminfo[SK_MEMINFO_RCVBUF], -
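The json_print_opening() helper in the diff above decides whether an object is the first element of the output or needs a separating comma. A small self-contained sketch of that pattern follows. It writes into a caller-supplied buffer instead of stdout so the result can be inspected, and the struct and function names are hypothetical, not taken from the patch.

```c
#include <stdio.h>

/* Tracks whether the next object is the first one in the list. */
struct json_list_writer {
	int first_elem;
};

/*
 * Write the opening of one object into buf. The first object opens with a
 * bare brace; every later one is preceded by a comma so the list stays
 * syntactically valid. Returns what snprintf() returns.
 */
static int json_open_elem(struct json_list_writer *w, char *buf, size_t size)
{
	if (w->first_elem) {
		w->first_elem = 0;
		return snprintf(buf, size, "{\n");
	}
	return snprintf(buf, size, ",\n{\n");
}
```

Keeping the flag in a small struct rather than in a global, as the patch does with json_first_elem, is only a choice made for this sketch so that it is easy to test in isolation.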
[PATCH 09/10] ss: symmetrical formatter extension example
This commit shall show shortly where to place changes when one wants to extend an ss output formatter with a new handler (format print procedure). The extension is done symmetrically. That means, every up to now existing formatter is extended with a semantically equivalent handler (hr and json formatter). Signed-off-by: Matthias Tafelmeier matthias.tafelme...@gmx.net Suggested-by: Hagen Paul Pfeifer ha...@jauu.net --- misc/ss_hr_fmt.c | 61 ++ misc/ss_json_fmt.c | 65 ++ misc/ss_out_fmt.c | 10 + misc/ss_out_fmt.h | 10 + 4 files changed, 146 insertions(+) diff --git a/misc/ss_hr_fmt.c b/misc/ss_hr_fmt.c index 40b6b7c..ca73dda 100644 --- a/misc/ss_hr_fmt.c +++ b/misc/ss_hr_fmt.c @@ -242,6 +242,66 @@ static void packet_show_ring_hr_fmt(struct packet_diag_ring *ring) printf(,features:0x%x, ring-pdr_features); } +static void packet_details_hr_fmt(struct packet_diag_info *pinfo, + struct packet_diag_ring *ring_rx, + struct packet_diag_ring *ring_tx, + uint32_t fanout, + bool has_fanout) +{ + if (pinfo) { + printf(\n\tver:%d, pinfo-pdi_version); + printf( cpy_thresh:%d, pinfo-pdi_copy_thresh); + printf( flags( ); + if (pinfo-pdi_flags PDI_RUNNING) + printf(running); + if (pinfo-pdi_flags PDI_AUXDATA) + printf( auxdata); + if (pinfo-pdi_flags PDI_ORIGDEV) + printf( origdev); + if (pinfo-pdi_flags PDI_VNETHDR) + printf( vnethdr); + if (pinfo-pdi_flags PDI_LOSS) + printf( loss); + if (!pinfo-pdi_flags) + printf(0); + printf( )); + } + if (ring_rx) { + printf(\n\tring_rx(); + packet_show_ring_fmt(ring_rx); + printf()); + } + if (ring_tx) { + printf(\n\tring_tx(); + packet_show_ring_fmt(ring_tx); + printf()); + } + if (has_fanout) { + uint16_t type = (fanout 16) 0x; + + printf(\n\tfanout(); + printf(id:%d,, fanout 0x); + printf(type:); + + if (type == 0) + printf(hash); + else if (type == 1) + printf(lb); + else if (type == 2) + printf(cpu); + else if (type == 3) + printf(roll); + else if (type == 4) + printf(random); + else if (type == 5) + printf(qm); + else + printf(0x%x, 
type); + + printf()); + } +} + const struct fmt_op_hub hr_output_op = { .tcp_stats_fmt = tcp_stats_hr_fmt, .tcp_timer_fmt = tcp_timer_hr_fmt, @@ -257,4 +317,5 @@ const struct fmt_op_hub hr_output_op = { .opt_fmt = opt_hr_fmt, .proc_fmt = proc_hr_fmt, .packet_show_ring_fmt = packet_show_ring_hr_fmt, + .packet_details_fmt = packet_details_hr_fmt }; diff --git a/misc/ss_json_fmt.c b/misc/ss_json_fmt.c index d7dfce9..3d10220 100644 --- a/misc/ss_json_fmt.c +++ b/misc/ss_json_fmt.c @@ -355,6 +355,70 @@ static void packet_show_ring_json_fmt(struct packet_diag_ring *ring) printf(\features_0x\ : \%x\\n, ring-pdr_features); } +static void packet_details_json_fmt(struct packet_diag_info *pinfo, + struct packet_diag_ring *ring_rx, + struct packet_diag_ring *ring_tx, + uint32_t fanout, + bool has_fanout) +{ + printf(,\n); + if (pinfo) { + printf(\t\ver\: \%d\,\n, pinfo-pdi_version); + printf(\t\cpy_thresh\: \%d\,\n, pinfo-pdi_copy_thresh); + printf(\t\flags\: \); + if (pinfo-pdi_flags PDI_RUNNING) + printf(running); + if (pinfo-pdi_flags PDI_AUXDATA) + printf(_auxdata); + if (pinfo-pdi_flags PDI_ORIGDEV) + printf(_origdev); + if (pinfo-pdi_flags PDI_VNETHDR) + printf(_vnethdr); + if (pinfo-pdi_flags PDI_LOSS) + printf(_loss); + if (!pinfo-pdi_flags) + printf(0); + printf(\); + res_json_fmt_branch(ring_rx || ring_tx || has_fanout, ' '); + } + if (ring_rx) { + printf(\t\ring_rx\: {); + packet_show_ring_fmt(ring_rx); + printf(}); + res_json_fmt_branch(ring_tx || has_fanout, ' '); + } + if (ring_tx) { + printf(\t\ring_tx\: {); +
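Both handlers above decode the packet fanout value before printing it. The operators in the archived diff are garbled, so the layout used below (group id in the low 16 bits, fanout type in the high 16 bits) is an assumption reconstructed from the fragments `fanout 16` and `0x`; the type names for values 0 to 5 are the ones printed by the patch. The struct and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct fanout_info {
	uint16_t id;	/* fanout group id */
	uint16_t type;	/* fanout mode */
};

/* Assumed layout: low 16 bits are the id, high 16 bits the type. */
static struct fanout_info fanout_decode(uint32_t fanout)
{
	struct fanout_info fi;

	fi.id = (uint16_t)(fanout & 0xffff);
	fi.type = (uint16_t)((fanout >> 16) & 0xffff);
	return fi;
}

/* Name for a known type, or NULL so the caller can print the raw value. */
static const char *fanout_type_name(uint16_t type)
{
	static const char *const names[] = {
		"hash", "lb", "cpu", "roll", "random", "qm",
	};

	if (type < sizeof(names) / sizeof(names[0]))
		return names[type];
	return NULL;
}
```

A table lookup like this could replace the if/else chain in both the hr and the json handler, so that a new fanout type only has to be added in one place.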
V2 iproute2: full ss json support and general output simplification
TLDR:
- Add full JSON support for ss.
- The patchset provides a general and easy to use abstraction to extend ss later.
- The patchset is large in order to minimize effort in daily use (users should not have to deal with formatting (json, human readable) later on).
- Patches 8/10 and 9/10 illustrate how to extend ss for new data to support human readable and json output.
- Example usages:
  1. ss -jt to print out all tcp related information formatted in json
  2. ss --json -a to print out all info (also summary)

STATS:

Matthias Tafelmeier (10):
  ss: rooted out ss type declarations for output formatters
  ss: created formatters for json and hr
  ss: removed obsolet fmt functions
  ss: prepare timer for output handler usage
  ss: framed skeleton for json output in ss
  ss: replaced old output mechanisms with fmt handlers interfaces
  ss: renaming and export of current_filter
  ss: symmetrical subhandler output extension example
  ss: symmetrical formatter extension example
  ss: fixed free on local array for valid json output

 misc/Makefile      |    2 +-
 misc/ss.c          | 1006 +++-
 misc/ss_hr_fmt.c   |  321 +
 misc/ss_hr_fmt.h   |    9 +
 misc/ss_json_fmt.c |  438 +++
 misc/ss_json_fmt.h |   24 ++
 misc/ss_out_fmt.c  |  137 +++
 misc/ss_out_fmt.h  |   92 +
 misc/ss_types.h    |  186 ++
 9 files changed, 1564 insertions(+), 651 deletions(-)
 create mode 100644 misc/ss_hr_fmt.c
 create mode 100644 misc/ss_hr_fmt.h
 create mode 100644 misc/ss_json_fmt.c
 create mode 100644 misc/ss_json_fmt.h
 create mode 100644 misc/ss_out_fmt.c
 create mode 100644 misc/ss_out_fmt.h
 create mode 100644 misc/ss_types.h

--

Abstract:

This patch set originates from the need to give ss the ability to output in json format. In order not to clutter up ss too much, the author of the patch decided on a simple distributor-to-handler approach. That is, the distributor provides the mechanical interface which passes the output requests coming from ss to the appropriate handler. This simplifies the interaction with ss and provides a maximum of future extensibility. In addition, ss loses weight, since output previously implemented in ss itself migrates to the appropriate handler.

Additionally, because types are shared amongst the handlers, the distributor and ss, the author concluded that a separate container module for types has to be formed. In future, all type declarations and extensions go there.

In sum, the patchset is this large because there is no viable way of producing syntactically correct human readable and json output in a simpler manner. The requirement for convenient extensibility of output and data is another justification for the patchset size.

Concept sketch:

[ASCII diagram: ss passes output requests to the distributor, which hands them on to formatter1 or formatter2]

At the moment, the distributor is the ss_out_fmt module, while two handlers exist: ss_json_fmt and ss_hr_fmt (human readable). You can use those modules as the main reference for your own extensions.

Future Extension:

In the following, I will expand on the extensibility of the formatter model. The explanation advances from the minimal to the most sweeping extension in mind.

Sub Format Handler
[PATCH 08/10] ss: symmetrical subhandler output extension example
This small patch shows the locations that have to be changed for a symmetrical output extension. Symmetrical means, in this context, that all existing semantically related handlers in the various formatters (hr and json so far) are extended.

Signed-off-by: Matthias Tafelmeier matthias.tafelme...@gmx.net
Suggested-by: Hagen Paul Pfeifer ha...@jauu.net
---
 misc/ss_hr_fmt.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/misc/ss_hr_fmt.c b/misc/ss_hr_fmt.c
index 6955ea5..40b6b7c 100644
--- a/misc/ss_hr_fmt.c
+++ b/misc/ss_hr_fmt.c
@@ -82,6 +82,8 @@ static void tcp_stats_hr_fmt(struct tcpstat *s)
 		printf(" reordering:%d", s->reordering);
 	if (s->rcv_rtt)
 		printf(" rcv_rtt:%g", s->rcv_rtt);
+	if (s->rcv_space)
+		printf(" rcv_space:%d", s->rcv_space);

 	CHECK_FMT_ADAPT(s->rcv_space, s,
 	hr_handler_must_be_adapted_accordingly_when_json_fmt_is_extended);
--
1.9.1
[PATCH 04/10] ss: prepare timer for output handler usage
Minor preparation patch: renamed and exported the timer name array so that it does not have to be passed as a function parameter.

Signed-off-by: Matthias Tafelmeier matthias.tafelme...@gmx.net
Suggested-by: Hagen Paul Pfeifer ha...@jauu.net
---
 misc/ss.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/misc/ss.c b/misc/ss.c
index e241b2f..1b3ef90 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -647,7 +647,7 @@ static const char *sstate_namel[] = {
 	[SS_CLOSING] = "closing",
 };

-static const char *tmr_name[] = {
+const char *ss_timer_name[] = {
 	"off",
 	"on",
 	"keepalive",
--
1.9.1
[PATCH 07/10] ss: renaming and export of current_filter
Exported current_filter as ss_current_filter, because in the fmt handlers, I need that piece of info to resolve out issues of json. Signed-off-by: Matthias Tafelmeier matthias.tafelme...@gmx.net Suggested-by: Hagen Paul Pfeifer ha...@jauu.net --- misc/ss.c | 218 +++--- 1 file changed, 109 insertions(+), 109 deletions(-) diff --git a/misc/ss.c b/misc/ss.c index 993a87b..5eba08d 100644 --- a/misc/ss.c +++ b/misc/ss.c @@ -199,7 +199,7 @@ static const struct filter default_afs[AF_MAX] = { }; static int do_default = 1; -static struct filter current_filter; +struct filter ss_current_filter; static void filter_db_set(struct filter *f, int db) { @@ -1189,7 +1189,7 @@ void *parse_hostcond(char *addr, bool is_port) struct aafilter a = { .port = -1 }; struct aafilter *res; int fam = preferred_family; - struct filter *f = current_filter; + struct filter *f = ss_current_filter; if (fam == AF_UNIX || strncmp(addr, unix:, 5) == 0) { char *p; @@ -1288,9 +1288,9 @@ void *parse_hostcond(char *addr, bool is_port) if (get_integer(a.port, port, 0)) { struct servent *se1 = NULL; struct servent *se2 = NULL; - if (current_filter.dbs(1UDP_DB)) + if (ss_current_filter.dbs (1 UDP_DB)) se1 = getservbyname(port, UDP_PROTO); - if (current_filter.dbs(1TCP_DB)) + if (ss_current_filter.dbs (1 TCP_DB)) se2 = getservbyname(port, TCP_PROTO); if (se1 se2 se1-s_port != se2-s_port) { fprintf(stderr, Error: ambiguous port \%s\.\n, port); @@ -1304,9 +1304,9 @@ void *parse_hostcond(char *addr, bool is_port) struct scache *s; for (s = rlist; s; s = s-next) { if ((s-proto == UDP_PROTO - (current_filter.dbs(1UDP_DB))) || + (ss_current_filter.dbs(1UDP_DB))) || (s-proto == TCP_PROTO - (current_filter.dbs(1TCP_DB { + (ss_current_filter.dbs(1TCP_DB { if (s-name strcmp(s-name, port) == 0) { if (a.port 0 a.port != s-port) { fprintf(stderr, Error: ambiguous port \%s\.\n, port); @@ -3221,19 +3221,19 @@ int main(int argc, char *argv[]) follow_events = 1; break; case 'd': - filter_db_set(current_filter, DCCP_DB); + 
filter_db_set(ss_current_filter, DCCP_DB); break; case 't': - filter_db_set(current_filter, TCP_DB); + filter_db_set(ss_current_filter, TCP_DB); break; case 'u': - filter_db_set(current_filter, UDP_DB); + filter_db_set(ss_current_filter, UDP_DB); break; case 'w': - filter_db_set(current_filter, RAW_DB); + filter_db_set(ss_current_filter, RAW_DB); break; case 'x': - filter_af_set(current_filter, AF_UNIX); + filter_af_set(ss_current_filter, AF_UNIX); break; case 'a': state_filter = SS_ALL; @@ -3242,25 +3242,25 @@ int main(int argc, char *argv[]) state_filter = (1 SS_LISTEN) | (1 SS_CLOSE); break; case '4': - filter_af_set(current_filter, AF_INET); + filter_af_set(ss_current_filter, AF_INET); break; case '6': - filter_af_set(current_filter, AF_INET6); + filter_af_set(ss_current_filter, AF_INET6); break; case '0': - filter_af_set(current_filter, AF_PACKET); + filter_af_set(ss_current_filter, AF_PACKET); break; case 'f': if (strcmp(optarg, inet) == 0) - filter_af_set(current_filter, AF_INET); +
[PATCH 03/10] ss: removed obsolet fmt functions
Those functions are obsoleted since the new fmt handler mechanism subsumes their tasks. Rendundancy would be contradictory to the new mechanism. Signed-off-by: Matthias Tafelmeier matthias.tafelme...@gmx.net Suggested-by: Hagen Paul Pfeifer ha...@jauu.net --- misc/ss.c | 190 -- 1 file changed, 190 deletions(-) diff --git a/misc/ss.c b/misc/ss.c index 3d31b81..e241b2f 100644 --- a/misc/ss.c +++ b/misc/ss.c @@ -647,43 +647,6 @@ static const char *sstate_namel[] = { [SS_CLOSING] = closing, }; -static void sock_state_print(struct sockstat *s, const char *sock_name) -{ - if (netid_width) - printf(%-*s , netid_width, sock_name); - if (state_width) - printf(%-*s , state_width, sstate_name[s-state]); - - printf(%-6d %-6d , s-rq, s-wq); -} - -static void sock_details_print(struct sockstat *s) -{ - if (s-uid) - printf( uid:%u, s-uid); - - printf( ino:%u, s-ino); - printf( sk:%llx, s-sk); -} - -static void sock_addr_print_width(int addr_len, const char *addr, char *delim, - int port_len, const char *port, const char *ifname) -{ - if (ifname) { - printf(%*s%%%s%s%-*s , addr_len, addr, ifname, delim, - port_len, port); - } - else { - printf(%*s%s%-*s , addr_len, addr, delim, port_len, port); - } -} - -static void sock_addr_print(const char *addr, char *delim, const char *port, - const char *ifname) -{ - sock_addr_print_width(addr_width, addr, delim, serv_width, port, ifname); -} - static const char *tmr_name[] = { off, on, @@ -693,33 +656,6 @@ static const char *tmr_name[] = { unknown }; -static const char *print_ms_timer(int timeout) -{ - static char buf[64]; - int secs, msecs, minutes; - if (timeout 0) - timeout = 0; - secs = timeout/1000; - minutes = secs/60; - secs = secs%60; - msecs = timeout%1000; - buf[0] = 0; - if (minutes) { - msecs = 0; - snprintf(buf, sizeof(buf)-16, %dmin, minutes); - if (minutes 9) - secs = 0; - } - if (secs) { - if (secs 9) - msecs = 0; - sprintf(buf+strlen(buf), %d%s, secs, msecs ? . 
: sec); - } - if (msecs) - sprintf(buf+strlen(buf), %03dms, msecs); - return buf; -} - struct scache *rlist; static void init_service_resolver(void) @@ -1482,122 +1418,6 @@ static int proc_inet_split_line(char *line, char **loc, char **rem, char **data) return 0; } -static char *sprint_bw(char *buf, double bw) -{ - if (bw 100.) - sprintf(buf,%.1fM, bw / 100.); - else if (bw 1000.) - sprintf(buf,%.1fK, bw / 1000.); - else - sprintf(buf, %g, bw); - - return buf; -} - -static void tcp_stats_print(struct tcpstat *s) -{ - char b1[64]; - - if (s-has_ts_opt) - printf( ts); - if (s-has_sack_opt) - printf( sack); - if (s-has_ecn_opt) - printf( ecn); - if (s-has_ecnseen_opt) - printf( ecnseen); - if (s-has_fastopen_opt) - printf( fastopen); - if (s-cong_alg[0]) - printf( %s, s-cong_alg); - if (s-has_wscale_opt) - printf( wscale:%d,%d, s-snd_wscale, s-rcv_wscale); - if (s-rto) - printf( rto:%g, s-rto); - if (s-backoff) - printf( backoff:%u, s-backoff); - if (s-rtt) - printf( rtt:%g/%g, s-rtt, s-rttvar); - if (s-ato) - printf( ato:%g, s-ato); - - if (s-qack) - printf( qack:%d, s-qack); - if (s-qack 1) - printf( bidir); - - if (s-mss) - printf( mss:%d, s-mss); - if (s-cwnd) - printf( cwnd:%d, s-cwnd); - if (s-ssthresh) - printf( ssthresh:%d, s-ssthresh); - - if (s-bytes_acked) - printf( bytes_acked:%llu, s-bytes_acked); - if (s-bytes_received) - printf( bytes_received:%llu, s-bytes_received); - if (s-segs_out) - printf( segs_out:%u, s-segs_out); - if (s-segs_in) - printf( segs_in:%u, s-segs_in); - - if (s-dctcp s-dctcp-enabled) { - struct dctcpstat *dctcp = s-dctcp; - - printf( dctcp:(ce_state:%u,alpha:%u,ab_ecn:%u,ab_tot:%u), - dctcp-ce_state, dctcp-alpha, dctcp-ab_ecn, - dctcp-ab_tot); - } else if (s-dctcp) { - printf( dctcp:fallback_mode); - } - - if (s-send_bps) - printf( send %sbps, sprint_bw(b1, s-send_bps)); - if (s-lastsnd) - printf( lastsnd:%u,
[PATCH 02/10] ss: created formatters for json and hr
This patch creates a central formatter module that acts as a kind of switch. From there, more specific handler modules for the individual output formats are called. So far, human readable and JSON exist. This prepares ss for potential output format extensions in the future, and with such a mechanism in place those extensions should be convenient to write.

For a completely new output format, a new handler module must be created and should be structured like its relatives (e.g. ss_json_fmt.c). Moreover, its functions need to be registered with the central output distributor. This is done by registering the fmt_op_hub of the new handler module in the fmt_op_hub array.

Extending only the tcp_stats output should boil down to extending the corresponding handler function with the new predicate and its value. The context of the output subparts is important. With JSON, for instance, you have to ensure that the commas are set at the right places.

Further, an interim solution for all tcp_stats extensions is to statically assert that every new field makes it through to all fmt handlers. The solution is only an interim one, since a central structure would be much more worthwhile for maintainability, and this method does not ensure a correct output fmt extension in a foolproof manner.

Examples for tcp_stats output extension:

ss_json_fmt.c: To add a new foo_param in tcp_stats for output (Pseudocode):

    [...]
    if (s->has_ts_opt) {
        printf(",\n%s\"ts\": \"true\"", indent1);
    }
    if (s->has_sack_opt) {
        printf(",\n%s\"sack\": \"true\"", indent1);
    }
    if (s->has_ecn_opt) {
        printf(",\n%s\"ecn\": \"true\"", indent1);
    }
    [...]
- macro to ensure statically no new tcp_stats info will be forgotten in - any of the fmt handlers CHECK_FMT_ADAPT(s-new_foo_pred, s, error_msg_adapation_issue); Signed-off-by: Matthias Tafelmeier matthias.tafelme...@gmx.net Suggested-by: Hagen Paul Pfeifer ha...@jauu.net --- misc/Makefile | 2 +- misc/ss_hr_fmt.c | 258 misc/ss_hr_fmt.h | 9 ++ misc/ss_json_fmt.c | 373 + misc/ss_json_fmt.h | 24 misc/ss_out_fmt.c | 127 ++ misc/ss_out_fmt.h | 82 7 files changed, 874 insertions(+), 1 deletion(-) create mode 100644 misc/ss_hr_fmt.c create mode 100644 misc/ss_hr_fmt.h create mode 100644 misc/ss_json_fmt.c create mode 100644 misc/ss_json_fmt.h create mode 100644 misc/ss_out_fmt.c create mode 100644 misc/ss_out_fmt.h diff --git a/misc/Makefile b/misc/Makefile index b7ecba9..fb67ead 100644 --- a/misc/Makefile +++ b/misc/Makefile @@ -1,4 +1,4 @@ -SSOBJ=ss.o ssfilter.o +SSOBJ=ss.o ssfilter.o ss_hr_fmt.o ss_json_fmt.o ss_out_fmt.o LNSTATOBJ=lnstat.o lnstat_util.o TARGETS=ss nstat ifstat rtacct arpd lnstat diff --git a/misc/ss_hr_fmt.c b/misc/ss_hr_fmt.c new file mode 100644 index 000..6955ea5 --- /dev/null +++ b/misc/ss_hr_fmt.c @@ -0,0 +1,258 @@ +#include linux/sock_diag.h +#include linux/rtnetlink.h +#include ss_out_fmt.h +#include ss_types.h +#include ss_hr_fmt.h + +static void tcp_stats_hr_fmt(struct tcpstat *s) +{ + char b1[64]; + + if (s-has_ts_opt) + printf( ts); + if (s-has_sack_opt) + printf( sack); + if (s-has_ecn_opt) + printf( ecn); + if (s-has_ecnseen_opt) + printf( ecnseen); + if (s-has_fastopen_opt) + printf( fastopen); + if (s-cong_alg) + printf( %s, s-cong_alg); + if (s-has_wscale_opt) + printf( wscale:%d,%d, s-snd_wscale, s-rcv_wscale); + if (s-rto) + printf( rto:%g, s-rto); + if (s-backoff) + printf( backoff:%u, s-backoff); + if (s-rtt) + printf( rtt:%g/%g, s-rtt, s-rttvar); + if (s-ato) + printf( ato:%g, s-ato); + + if (s-qack) + printf( qack:%d, s-qack); + if (s-qack 1) + printf( bidir); + + if (s-mss) + printf( mss:%d, s-mss); + if (s-cwnd) + printf( 
cwnd:%d, s-cwnd); + if (s-ssthresh) + printf( ssthresh:%d, s-ssthresh); + + if (s-dctcp s-dctcp-enabled) { + struct dctcpstat *dctcp = s-dctcp; + + printf( dctcp:(ce_state:%u,alpha:%u,ab_ecn:%u,ab_tot:%u), + dctcp-ce_state, dctcp-alpha, dctcp-ab_ecn, + dctcp-ab_tot); + } else if (s-dctcp) { + printf( dctcp:fallback_mode); + } + + if (s-send_bps) + printf( send %sbps, sprint_bw(b1, s-send_bps)); + if (s-lastsnd) + printf( lastsnd:%u, s-lastsnd); + if (s-lastrcv) + printf( lastrcv:%u, s-lastrcv); + if (s-lastack) + printf( lastack:%u, s-lastack); + + if
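As a rough illustration of the switch-like dispatch described in the 02/10 commit message: the names fmt_op_hub and FMT_JSON appear in the series, but the struct layout, FMT_HR, render_ts_opt() and the buffer-based handlers below are assumptions made for this sketch, not the code in ss_out_fmt.c.

```c
#include <stdio.h>
#include <stddef.h>

/* Output formats. FMT_JSON is named in the series; FMT_HR is assumed here. */
enum fmt_type { FMT_HR, FMT_JSON, FMT_MAX };

/* Hypothetical handler table: one function pointer per output part. */
struct fmt_op_hub {
	int (*ts_opt_fmt)(char *buf, size_t len);
};

/* Human readable handler for the "ts" option. */
static int ts_opt_hr_fmt(char *buf, size_t len)
{
	return snprintf(buf, len, " ts");
}

/* JSON handler for the same option; the leading comma is its own job. */
static int ts_opt_json_fmt(char *buf, size_t len)
{
	return snprintf(buf, len, ",\n\t\"ts\": \"true\"");
}

/* Central array: a new output format registers its hub here. */
static const struct fmt_op_hub fmt_op_hubs[FMT_MAX] = {
	[FMT_HR]   = { .ts_opt_fmt = ts_opt_hr_fmt },
	[FMT_JSON] = { .ts_opt_fmt = ts_opt_json_fmt },
};

/* Callers pass only the selected format and never name a concrete handler. */
static int render_ts_opt(enum fmt_type type, char *buf, size_t len)
{
	if ((unsigned int)type >= FMT_MAX || !fmt_op_hubs[type].ts_opt_fmt)
		return -1;
	return fmt_op_hubs[type].ts_opt_fmt(buf, len);
}
```

With a table like this, adding a format means adding one array entry, and a missing handler shows up as a NULL slot rather than as silently absent output.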
[PATCH 05/10] ss: framed skeleton for json output in ss
This patch just adds the --json flag to ss. It also ensures that the stats components (e.g. TCP, UDP, NETLINK) are bracketed properly. Moreover, this patch prevents the human readable headers from being printed.

The first-element flag ensures that the first JSON container element of every output is treated specially, while all the others are treated equally. That is, only the first one does not print a comma ahead of itself; all the others do. This mechanism ensures the correct comma placement as demanded by the spec. Illustration:

PSEUDOCODE:
    {
      no comma
      {first}
      , {sec}
      , {third}
      . . .
    }

Signed-off-by: Matthias Tafelmeier matthias.tafelme...@gmx.net
Suggested-by: Hagen Paul Pfeifer ha...@jauu.net
---
misc/ss.c | 198 -- 1 file changed, 155 insertions(+), 43 deletions(-) diff --git a/misc/ss.c b/misc/ss.c index 1b3ef90..8fb6e7d 100644 --- a/misc/ss.c +++ b/misc/ss.c @@ -34,6 +34,9 @@ #include libnetlink.h #include namespace.h #include SNAPSHOT.h +#include ss_out_fmt.h +#include ss_json_fmt.h +#include ss_types.h #include linux/tcp.h #include linux/sock_diag.h @@ -101,6 +104,7 @@ int show_sock_ctx = 0; /* If show_users show_proc_ctx only do user_ent_hash_build() once */ int user_ent_hash_build_init = 0; int follow_events = 0; +int json_output = 0; int netid_width; int state_width; @@ -714,7 +718,6 @@ static int is_ephemeral(int port) return (port = ip_local_port_min port= ip_local_port_max); } - static const char *__resolve_service(int port) { struct scache *c; @@ -3064,6 +3067,9 @@ static int print_summary(void) printf(\n); + if (json_output has_successor) + printf(,\n); + return 0; } @@ -3090,6 +3096,7 @@ static void _usage(FILE *dest) -z, --contexts display process and socket SELinux security contexts\n -N, --net switch to the specified network namespace name\n \n + -j, --json format output in JSON\n -4, --ipv4 display only IP version 4 sockets\n -6, --ipv6 display only IP version 6 sockets\n -0, --packetdisplay PACKET sockets\n @@ -3189,6 +3196,7 @@ static const 
struct option long_opts[] = { { help, 0, 0, 'h' }, { context, 0, 0, 'Z' }, { contexts, 0, 0, 'z' }, + { json, 0, 0, 'j' }, { net, 1, 0, 'N' }, { 0 } @@ -3204,7 +3212,7 @@ int main(int argc, char *argv[]) int ch; int state_filter = 0; - while ((ch = getopt_long(argc, argv, dhaletuwxnro460spbEf:miA:D:F:vVzZN:, + while ((ch = getopt_long(argc, argv, dhaletuwxnro460spbEf:miA:D:F:vVzZN:j, long_opts, NULL)) != EOF) { switch(ch) { case 'n': @@ -3383,6 +3391,10 @@ int main(int argc, char *argv[]) if (netns_switch(optarg)) exit(1); break; + case 'j': + fmt_type = FMT_JSON; + json_output = 1; + break; case 'h': case '?': help(); @@ -3464,11 +3476,33 @@ int main(int argc, char *argv[]) exit(-1); } } + printf(\TCP\: [\n); inet_show_netlink(current_filter, dump_fp, IPPROTO_TCP); + res_json_fmt_branch(current_filter.dbs (1NETLINK_DB) || + current_filter.dbs PACKET_DBM || + current_filter.dbs UNIX_DBM || + current_filter.dbs (1RAW_DB) || + current_filter.dbs (1UDP_DB) || + current_filter.dbs (1TCP_DB) || + current_filter.dbs (1DCCP_DB), ']'); fflush(dump_fp); exit(0); } + if (do_summary) { + print_summary(current_filter.dbs PACKET_DBM || + current_filter.dbs UNIX_DBM || + current_filter.dbs (1RAW_DB) || + current_filter.dbs (1UDP_DB) || + current_filter.dbs (1TCP_DB) || + current_filter.dbs (1DCCP_DB)); + if (do_default argc == 0) { + if (json_output) + printf(}\n); + exit(0); + } + } + if (ssfilter_parse(current_filter.f, argc, argv, filter_fp)) usage(); @@ -3490,62 +3524,140 @@ int main(int argc, char *argv[]) } } - addrp_width = screen_width; - addrp_width -= netid_width+1; - addrp_width -= state_width+1; - addrp_width -= 14; + if
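The first-element rule from the 05/10 commit message can be sketched in isolation as follows. This is only an illustration: json_join() and append_str() are hypothetical helpers that write into a caller-supplied buffer, whereas the patch itself prints directly and tracks the flag inside ss.c.

```c
#include <stddef.h>
#include <string.h>

/* Append s to buf if it still fits (including the terminating NUL). */
static int append_str(char *buf, size_t len, size_t *used, const char *s)
{
	size_t n = strlen(s);

	if (*used + n + 1 > len)
		return -1;
	memcpy(buf + *used, s, n + 1);
	*used += n;
	return 0;
}

/*
 * Join already-serialized JSON elements into one array.
 * Only the first element is written without a comma ahead of it;
 * every later element prints its own leading comma.
 */
static int json_join(char *buf, size_t len,
		     const char *const *elems, size_t n)
{
	size_t used = 0;
	size_t i;
	int first = 1;

	if (append_str(buf, len, &used, "[") < 0)
		return -1;

	for (i = 0; i < n; i++) {
		if (!first && append_str(buf, len, &used, ",") < 0)
			return -1;
		first = 0;
		if (append_str(buf, len, &used, elems[i]) < 0)
			return -1;
	}

	return append_str(buf, len, &used, "]");
}
```

Putting the comma in front of every element but the first means an element never needs to know whether a successor exists, which is why the flag is enough.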
[PATCH 01/10] ss: rooted out ss type declarations for output formatters
The prospected output formatters and ss do share type declarations like slabstat or tcpstat so that the decision has been made to centralize those declarations in ss_types.h. Potential future declarations shall be placed there. The latter should help amend the extent of ss.c as well. Signed-off-by: Matthias Tafelmeier matthias.tafelme...@gmx.net Suggested-by: Hagen Paul Pfeifer ha...@jauu.net --- misc/ss.c | 186 +--- misc/ss_types.h | 186 2 files changed, 187 insertions(+), 185 deletions(-) create mode 100644 misc/ss_types.h diff --git a/misc/ss.c b/misc/ss.c index f4c828c..3d31b81 100644 --- a/misc/ss.c +++ b/misc/ss.c @@ -27,6 +27,7 @@ #include getopt.h #include stdbool.h +#include ss_types.h #include utils.h #include rt_names.h #include ll_map.h @@ -113,55 +114,17 @@ static const char *UDP_PROTO = udp; static const char *RAW_PROTO = raw; static const char *dg_proto = NULL; -enum -{ - TCP_DB, - DCCP_DB, - UDP_DB, - RAW_DB, - UNIX_DG_DB, - UNIX_ST_DB, - UNIX_SQ_DB, - PACKET_DG_DB, - PACKET_R_DB, - NETLINK_DB, - MAX_DB -}; #define PACKET_DBM ((1PACKET_DG_DB)|(1PACKET_R_DB)) #define UNIX_DBM ((1UNIX_DG_DB)|(1UNIX_ST_DB)|(1UNIX_SQ_DB)) #define ALL_DB ((1MAX_DB)-1) #define INET_DBM ((1TCP_DB)|(1UDP_DB)|(1DCCP_DB)|(1RAW_DB)) -enum { - SS_UNKNOWN, - SS_ESTABLISHED, - SS_SYN_SENT, - SS_SYN_RECV, - SS_FIN_WAIT1, - SS_FIN_WAIT2, - SS_TIME_WAIT, - SS_CLOSE, - SS_CLOSE_WAIT, - SS_LAST_ACK, - SS_LISTEN, - SS_CLOSING, - SS_MAX -}; - #define SS_ALL ((1 SS_MAX) - 1) #define SS_CONN (SS_ALL ~((1SS_LISTEN)|(1SS_CLOSE)|(1SS_TIME_WAIT)|(1SS_SYN_RECV))) #include ssfilter.h -struct filter -{ - int dbs; - int states; - int families; - struct ssfilter *f; -}; - static const struct filter default_dbs[MAX_DB] = { [TCP_DB] = { .states = SS_CONN, @@ -376,16 +339,6 @@ static FILE *ephemeral_ports_open(void) return generic_proc_open(PROC_IP_LOCAL_PORT_RANGE, sys/net/ipv4/ip_local_port_range); } -struct user_ent { - struct user_ent *next; - unsigned intino; - int pid; - int fd; - char*process; 
- char*process_ctx; - char*socket_ctx; -}; - #define USER_ENT_HASH_SIZE 256 struct user_ent *user_ent_hash[USER_ENT_HASH_SIZE]; @@ -538,12 +491,6 @@ static void user_ent_hash_build(void) closedir(dir); } -enum entry_types { - USERS, - PROC_CTX, - PROC_SOCK_CTX -}; - #define ENTRY_BUF_SIZE 512 static int find_entry(unsigned ino, char **buf, int type) { @@ -616,17 +563,6 @@ next: return cnt; } -/* Get stats from slab */ - -struct slabstat -{ - int socks; - int tcp_ports; - int tcp_tws; - int tcp_syns; - int skbs; -}; - static struct slabstat slabstat; static const char *slabstat_ids[] = @@ -711,75 +647,6 @@ static const char *sstate_namel[] = { [SS_CLOSING] = closing, }; -struct sockstat -{ - struct sockstat*next; - unsigned inttype; - uint16_tprot; - inet_prefix local; - inet_prefix remote; - int lport; - int rport; - int state; - int rq, wq; - unsignedino; - unsigneduid; - int refcnt; - unsigned intiface; - unsigned long long sk; - char *name; - char *peer_name; -}; - -struct dctcpstat -{ - unsigned intce_state; - unsigned intalpha; - unsigned intab_ecn; - unsigned intab_tot; - boolenabled; -}; - -struct tcpstat -{ - struct sockstat ss; - int timer; - int timeout; - int probes; - charcong_alg[16]; - double rto, ato, rtt, rttvar; - int qack, cwnd, ssthresh, backoff; - double send_bps; - int snd_wscale; - int rcv_wscale; - int mss; - unsigned intlastsnd; - unsigned intlastrcv; - unsigned intlastack; - double pacing_rate; - double pacing_rate_max; - unsigned long long bytes_acked; - unsigned long long bytes_received; - unsigned intsegs_out; - unsigned intsegs_in; - unsigned intunacked; - unsigned intretrans; - unsigned intretrans_total; - unsigned intlost; - unsigned intsacked; -
[PATCH 10/10] ss: fixed free on local array for valid json output
Minor fix to enable valid JSON output. Remove the free() of the automatic char array name, which is released anyway when the function's stack frame is cleaned up. Remove another one after tcp_stats_fmt() that frees parts of the automatic tcpstat struct instance.

Signed-off-by: Matthias Tafelmeier matthias.tafelme...@gmx.net
Suggested-by: Hagen Paul Pfeifer ha...@jauu.net
---
 misc/ss.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index 5eba08d..722253a 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -1664,10 +1664,6 @@ static void tcp_show_info(const struct nlmsghdr *nlh, struct inet_diag_msg *r,
 		s.segs_out = info->tcpi_segs_out;
 		s.segs_in = info->tcpi_segs_in;
 		tcp_stats_fmt(&s);
-		if (s.dctcp)
-			free(s.dctcp);
-		if (s.cong_alg)
-			free(s.cong_alg);
 	}
 }
 
@@ -2366,8 +2362,6 @@ if (show_mem) {
 	if (json_output)
 		printf("}\n");
-	if (name)
-		free(name);
 	return 0;
 }
-- 
1.9.1

--
To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 08/10] ss: symmetrical subhandler output extension example
> {} not needed.
> I guess you haven't run your patches thru scripts/checkpatch.pl?
> Yes, although this is missing from iproute2 sources ;)

Thank you for reviewing so far. I see that some parts of the patch for which I am responsible slipped past checkpatch.pl. I will send a v2 patch for these as soon as possible.

Nevertheless, there are parts of the patch for which I am not responsible, so please bear with me: I only copied those over from the original version. I am quite prepared to correct them as well, to make up for the break in history.
Re: [RFC PATCH net-next] tcp: reduce cpu usage under tcp memory pressure when SO_SNDBUF is set
On 08/10/2015 10:47 AM, Eric Dumazet wrote:
> On Fri, 2015-08-07 at 18:31 +, Jason Baron wrote:
>> From: Jason Baron jba...@akamai.com
>>
>> When SO_SNDBUF is set and we are under tcp memory pressure, the effective write buffer space can be much lower than what was set using SO_SNDBUF. For example, we may have set the buffer to 100kb, but we may only be able to write 10kb. In this scenario poll()/select()/epoll() are going to continuously return POLLOUT, followed by -EAGAIN from write(), in a very tight loop.
>>
>> Introduce sk->sk_effective_sndbuf, such that we can track the 'effective' size of the sndbuf when we have a short write due to memory pressure. By using sk->sk_effective_sndbuf instead of sk->sk_sndbuf when we are under memory pressure, we can delay the POLLOUT until 1/3 of the buffer clears, as we normally do. There is no issue here when SO_SNDBUF is not set, since the tcp layer will auto tune the sk->sndbuf.
>>
>> In my testing, this brought a single thread's cpu usage down from 100% to 1% while maintaining the same level of throughput when under memory pressure.
>
> I am not sure we need to grow socket for something that looks like a flag ?

So I added a new field because I needed to store the new 'effective' sndbuf somewhere and then restore the original value that was set via SO_SNDBUF. So it's really because of SO_SNDBUF. We could perhaps use the fact that we are in memory pressure to signal wakeups differently, but I'm not sure exactly how.

> Also you add a race in sk_stream_wspace() as sk_effective_sndbuf value can change under us.
>
> +	if (sk->sk_effective_sndbuf)
> +		return sk->sk_effective_sndbuf - sk->sk_wmem_queued;
> +

thanks. better?

--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -798,8 +798,10 @@ static inline int sk_stream_min_wspace(const struct sock *sk)
 
 static inline int sk_stream_wspace(const struct sock *sk)
 {
-	if (sk->sk_effective_sndbuf)
-		return sk->sk_effective_sndbuf - sk->sk_wmem_queued;
+	int effective_sndbuf = sk->sk_effective_sndbuf;
+
+	if (effective_sndbuf)
+		return effective_sndbuf - sk->sk_wmem_queued;
 
 	return sk->sk_sndbuf - sk->sk_wmem_queued;
 }

Thanks,

-Jason
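For readers following the race discussion, here is a user-space sketch of the "read the field once, then test and use the copy" idea from the revised hunk. struct sock_sketch and READ_ONCE_INT() are stand-ins invented for this example; the hunk in the mail uses a plain local copy, and the volatile cast is an addition here, on the assumption that one wants to stop the compiler from reloading the field between the test and the subtraction.

```c
/* Stand-in for the three struct sock fields that matter here. */
struct sock_sketch {
	int sk_sndbuf;
	int sk_effective_sndbuf;
	int sk_wmem_queued;
};

/* Hypothetical helper: force exactly one load of an int field. */
#define READ_ONCE_INT(x) (*(const volatile int *)&(x))

static int stream_wspace_sketch(const struct sock_sketch *sk)
{
	/*
	 * Snapshot the value once. Testing and then re-reading the field
	 * could see it change to 0 in between and return -sk_wmem_queued.
	 */
	int effective_sndbuf = READ_ONCE_INT(sk->sk_effective_sndbuf);

	if (effective_sndbuf)
		return effective_sndbuf - sk->sk_wmem_queued;

	return sk->sk_sndbuf - sk->sk_wmem_queued;
}
```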
[PATCH] mkiss: Fix error handling in mkiss_open()
If register_netdev() fails we do not propagate the error; we return success because ax_open() succeeded previously.

Fix this by checking the return values of ax_open() and register_netdev() and propagating the error in case of failure.

Reported-by: RUC_Soft_Sec zy900...@163.com
Signed-off-by: Fabio Estevam fabio.este...@freescale.com
---
 drivers/net/hamradio/mkiss.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/hamradio/mkiss.c b/drivers/net/hamradio/mkiss.c
index 2ffbf13..216bfd3 100644
--- a/drivers/net/hamradio/mkiss.c
+++ b/drivers/net/hamradio/mkiss.c
@@ -728,11 +728,12 @@ static int mkiss_open(struct tty_struct *tty)
 	dev->type = ARPHRD_AX25;
 
 	/* Perform the low-level AX25 initialization. */
-	if ((err = ax_open(ax->dev))) {
+	err = ax_open(ax->dev);
+	if (err)
 		goto out_free_netdev;
-	}
 
-	if (register_netdev(dev))
+	err = register_netdev(dev);
+	if (err)
 		goto out_free_buffers;
 
 	/* after register_netdev() - because else printk smashes the kernel */
-- 
1.9.1
Re: [PATCH v2] openvswitch: Fix L4 checksum handling when dealing with IP fragments
From: Glenn Griffin ggriffin.ker...@gmail.com
Date: Mon, 10 Aug 2015 10:43:16 -0700

> On Mon, Aug 03, 2015 at 02:03:28PM -0700, David Miller wrote:
>> From: Glenn Griffin ggriffin.ker...@gmail.com
>> Date: Mon, 3 Aug 2015 09:56:54 -0700
>>
>>> openvswitch modifies the L4 checksum of a packet when modifying the ip address. When an IP packet is fragmented only the first fragment contains an L4 header and checksum. Prior to this change openvswitch would modify all fragments, modifying application data in non-first fragments, causing checksum failures in the reassembled packet.
>>>
>>> Signed-off-by: Glenn Griffin ggriffin.ker...@gmail.com
>>> ---
>>> Changes in v2:
>>> - Compare frag_off in network byte order rather than host byte order
>>
>> Applied and queued up for -stable.
>
> I noticed this change didn't seem to make it into 4.2-rc6. I'm not too familiar with the release schedule so wasn't sure if that was expected or an oversight. Will this remain queued up until the 4.3 merge window opens?

It's in my 'net' tree and will be pushed to Linus's tree at a time that I deem appropriate.

Usually I try to push to Linus once every week or so, in order for changes to soak and get tested in my tree before they get pushed to his.
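To make the v2 note about byte order concrete, here is a minimal user-space sketch of a "does this packet carry the L4 header?" test. It is not the openvswitch code: has_l4_header() and IP_OFFSET_MASK are names chosen for this example, and the only facts relied on are the IPv4 header layout (the low 13 bits of frag_off are the fragment offset) and htons().

```c
#include <stdint.h>
#include <arpa/inet.h>	/* htons() */

/* Low 13 bits of the IPv4 frag_off field: the fragment offset. */
#define IP_OFFSET_MASK 0x1FFF

/*
 * frag_off_be is the header field exactly as it sits in the packet,
 * i.e. in network byte order. Returns nonzero when the offset is 0,
 * which is the case for unfragmented packets and for first fragments;
 * only those carry the L4 header and its checksum.
 */
static int has_l4_header(uint16_t frag_off_be)
{
	/* Convert the mask, not the field, so both sides are big endian. */
	return (frag_off_be & htons(IP_OFFSET_MASK)) == 0;
}
```

Masking the raw field with a host-order constant would test the wrong bits on a little-endian machine; for example, a first fragment with only the MF flag set (0x2000 on the wire) would look like it had a nonzero offset.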
[PATCH net-next 8/9] net: Use passed in table for nexthop lookups
If a user passes in a table for new routes, use that table for nexthop lookups. Specifically, this solves the case where a connected route does not exist in the main table but only in another table, and a subsequent route is then added with a next hop that uses the connected route, i.e.:

$ ip route ls
default via 10.0.2.2 dev eth0
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15
169.254.0.0/16 dev eth0 scope link metric 1003
192.168.56.0/24 dev eth1 proto kernel scope link src 192.168.56.51

$ ip route ls table 10
1.1.1.0/24 dev eth2 scope link

Without this patch, adding a nexthop route fails:

$ ip route add table 10 2.2.2.0/24 via 1.1.1.10
RTNETLINK answers: Network is unreachable

With this patch the route is added successfully.

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 net/ipv4/fib_semantics.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 85e9a8abf15c..b7f1d20a9615 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -691,6 +691,7 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 		}
 		rcu_read_lock();
 		{
+			struct fib_table *tbl = NULL;
 			struct flowi4 fl4 = {
 				.daddr = nh->nh_gw,
 				.flowi4_scope = cfg->fc_scope + 1,
@@ -701,8 +702,16 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 			/* It is not necessary, but requires a bit of thinking */
 			if (fl4.flowi4_scope < RT_SCOPE_LINK)
 				fl4.flowi4_scope = RT_SCOPE_LINK;
-			err = fib_lookup(net, &fl4, &res,
-					 FIB_LOOKUP_IGNORE_LINKSTATE);
+
+			if (cfg->fc_table)
+				tbl = fib_get_table(net, cfg->fc_table);
+
+			if (tbl)
+				err = fib_table_lookup(tbl, &fl4, &res,
+						       FIB_LOOKUP_IGNORE_LINKSTATE);
+			else
+				err = fib_lookup(net, &fl4, &res,
+						 FIB_LOOKUP_IGNORE_LINKSTATE);
 			if (err) {
 				rcu_read_unlock();
 				return err;
-- 
2.3.2 (Apple Git-55)
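The failure and the fix in the 8/9 description can be reproduced with a toy model. Everything below is hypothetical scaffolding (struct route_sketch, struct table_sketch, check_nexthop_sketch(), table id 254 standing in for main); it models only connected routes, on the assumption that the nexthop check is limited to link scope, and it falls back to the main table where the patch falls back to the regular fib_lookup().

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define IP4(a, b, c, d) \
	(((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
	 ((uint32_t)(c) << 8) | (uint32_t)(d))

#define MAIN_TABLE_ID 254u	/* stand-in for the main table */

struct route_sketch { uint32_t prefix; uint32_t mask; };

struct table_sketch {
	uint32_t id;
	const struct route_sketch *routes;
	size_t nroutes;
};

static const struct table_sketch *
get_table(const struct table_sketch *tables, size_t ntables, uint32_t id)
{
	size_t i;

	for (i = 0; i < ntables; i++)
		if (tables[i].id == id)
			return &tables[i];
	return NULL;
}

/* 0 if some connected route covers addr, -ENETUNREACH otherwise. */
static int table_lookup(const struct table_sketch *t, uint32_t addr)
{
	size_t i;

	for (i = 0; i < t->nroutes; i++)
		if ((addr & t->routes[i].mask) == t->routes[i].prefix)
			return 0;
	return -ENETUNREACH;
}

/* Prefer the table the user named; otherwise use the default lookup. */
static int check_nexthop_sketch(const struct table_sketch *tables,
				size_t ntables, uint32_t fc_table,
				uint32_t gw)
{
	const struct table_sketch *tbl = NULL;

	if (fc_table)
		tbl = get_table(tables, ntables, fc_table);
	if (!tbl)
		tbl = get_table(tables, ntables, MAIN_TABLE_ID);

	return tbl ? table_lookup(tbl, gw) : -ENETUNREACH;
}
```

With the routes from the mail, gateway 1.1.1.10 is unreachable when no table is passed (only the main table is consulted) and reachable when table 10 is passed.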
[PATCH] iproute2: Add support for VRF device
Allow the user to create a vrf device and specify its table binding. Based on the iplink_vlan implementation.

Signed-off-by: Shrijeet Mukherjee s...@cumulusnetworks.com
Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 include/linux/if_link.h | 8 +
 ip/Makefile | 2 +-
 ip/iplink.c | 2 +-
 ip/iplink_vrf.c | 85 +
 4 files changed, 95 insertions(+), 2 deletions(-)
 create mode 100644 ip/iplink_vrf.c

diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index b905cf7f4948..74dedf4320b8 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -338,6 +338,14 @@ enum macvlan_macaddr_mode {
 
 #define MACVLAN_FLAG_NOPROMISC 1
 
+/* VRF section */
+enum {
+	IFLA_VRF_UNSPEC,
+	IFLA_VRF_TABLE,
+	__IFLA_VRF_MAX
+};
+
+#define IFLA_VRF_MAX (__IFLA_VRF_MAX - 1)
 /* IPVLAN section */
 enum {
 	IFLA_IPVLAN_UNSPEC,
diff --git a/ip/Makefile b/ip/Makefile
index 77653ecc5785..d8b38ac2e44b 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -7,7 +7,7 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o ipnetns.o \
     iplink_vxlan.o tcp_metrics.o iplink_ipoib.o ipnetconf.o link_ip6tnl.o \
     link_iptnl.o link_gre6.o iplink_bond.o iplink_bond_slave.o iplink_hsr.o \
     iplink_bridge.o iplink_bridge_slave.o ipfou.o iplink_ipvlan.o \
-    iplink_geneve.o
+    iplink_geneve.o iplink_vrf.o
 
 RTMONOBJ=rtmon.o
 
diff --git a/ip/iplink.c b/ip/iplink.c
index 369d50eab94e..14bf7211a447 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -94,7 +94,7 @@ void iplink_usage(void)
 		fprintf(stderr, "TYPE := { vlan | veth | vcan | dummy | ifb | macvlan | macvtap |\n");
 		fprintf(stderr, " bridge | bond | ipoib | ip6tnl | ipip | sit | vxlan |\n");
 		fprintf(stderr, " gre | gretap | ip6gre | ip6gretap | vti | nlmon |\n");
-		fprintf(stderr, " bond_slave | ipvlan | geneve }\n");
+		fprintf(stderr, " bond_slave | ipvlan | geneve | vrf }\n");
 	}
 	exit(-1);
 }
diff --git a/ip/iplink_vrf.c b/ip/iplink_vrf.c
new file mode 100644
index ..0d7e21c7c152
--- /dev/null
+++ b/ip/iplink_vrf.c
@@ -0,0 +1,85 @@
+/* iplink_vrf.c	VRF device support
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors: Shrijeet Mukherjee s...@cumulusnetworks.com
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <linux/if_link.h>
+
+#include "rt_names.h"
+#include "utils.h"
+#include "ip_common.h"
+
+static void vrf_explain(FILE *f)
+{
+	fprintf(f, "Usage: ... vrf table TABLEID \n");
+}
+
+static void explain(void)
+{
+	vrf_explain(stderr);
+}
+
+static int table_arg(void)
+{
+	fprintf(stderr,"Error: argument of \"table\" must be 0-32767 and currently unused\n");
+	return -1;
+}
+
+static int vrf_parse_opt(struct link_util *lu, int argc, char **argv,
+			 struct nlmsghdr *n)
+{
+	while (argc > 0) {
+		if (matches(*argv, "table") == 0) {
+			__u32 table = 0;
+			NEXT_ARG();
+
+			table = atoi(*argv);
+			if (table < 0 || table > 32767)
+				return table_arg();
+			addattr32(n, 1024, IFLA_VRF_TABLE, table);
+		} else if (matches(*argv, "help") == 0) {
+			explain();
+			return -1;
+		} else {
+			fprintf(stderr, "vrf: unknown option \"%s\"?\n",
+				*argv);
+			explain();
+			return -1;
+		}
+		argc--, argv++;
+	}
+
+	return 0;
+}
+
+static void vrf_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
+{
+	if (!tb)
+		return;
+
+	if (tb[IFLA_VRF_TABLE])
+		fprintf(f, "table %u ", rta_getattr_u32(tb[IFLA_VRF_TABLE]));
+}
+
+static void vrf_print_help(struct link_util *lu, int argc, char **argv,
+			   FILE *f)
+{
+	vrf_explain(f);
+}
+
+struct link_util vrf_link_util = {
+	.id		= "vrf",
+	.maxattr	= IFLA_VRF_MAX,
+	.parse_opt	= vrf_parse_opt,
+	.print_opt	= vrf_print_opt,
+	.print_help	= vrf_print_help,
+};
-- 
2.3.2 (Apple Git-55)
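One detail in the posted vrf_parse_opt(): the table id is stored in an unsigned __u32 and filled via atoi(), so a lower-bound test on it cannot fire, and trailing garbage such as "12abc" is accepted silently. A stricter parse might look like the sketch below. parse_table_id() is a hypothetical helper written for this note, using only standard C library calls; the 0-32767 range is taken from the error message in the patch.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a decimal table id in the range 0-32767.
 * Returns 0 and stores the value on success, -1 on any malformed input.
 */
static int parse_table_id(const char *arg, uint32_t *table)
{
	char *end;
	unsigned long val;

	/* strtoul() would accept leading blanks and a sign; reject them. */
	if (!arg || arg[0] < '0' || arg[0] > '9')
		return -1;

	errno = 0;
	val = strtoul(arg, &end, 10);
	if (errno || *end != '\0' || val > 32767)
		return -1;

	*table = (uint32_t)val;
	return 0;
}
```

Checking the first character up front is what keeps "-1" from wrapping around to a large unsigned value inside strtoul().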
[PATCH net-next 6/9] net: Fix up inet_addr_type checks
Currently inet_addr_type and inet_dev_addr_type expect local addresses to be in the local table. With the VRF device local routes for devices associated with a VRF will be in the table associated with the VRF. Provide an alternate inet_addr lookup to use a specific table rather than defaulting to the local table. inet_addr_type_dev_table keeps the same semantics as inet_addr_type but if the passed in device is enslaved to a VRF then the table for that VRF is used for the lookup. Signed-off-by: David Ahern d...@cumulusnetworks.com --- include/net/route.h | 3 +++ net/ipv4/af_inet.c | 13 - net/ipv4/arp.c | 15 +-- net/ipv4/fib_frontend.c | 28 +--- net/ipv4/fib_semantics.c | 6 -- net/ipv4/icmp.c | 5 +++-- 6 files changed, 56 insertions(+), 14 deletions(-) diff --git a/include/net/route.h b/include/net/route.h index 6ba681f0b98d..6dda2c1bf8c6 100644 --- a/include/net/route.h +++ b/include/net/route.h @@ -192,6 +192,9 @@ unsigned int inet_addr_type(struct net *net, __be32 addr); unsigned int inet_addr_type_table(struct net *net, __be32 addr, int tb_id); unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev, __be32 addr); +unsigned int inet_addr_type_dev_table(struct net *net, + const struct net_device *dev, + __be32 addr); void ip_rt_multicast_event(struct in_device *); int ip_rt_ioctl(struct net *, unsigned int cmd, void __user *arg); void ip_rt_get_source(u8 *src, struct sk_buff *skb, struct rtable *rt); diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index cc4e498a0ccf..96fba4f63454 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -119,6 +119,7 @@ #ifdef CONFIG_IP_MROUTE #include linux/mroute.h #endif +#include net/vrf.h /* The inetsw table contains everything that inet_create needs to @@ -427,6 +428,7 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len) struct net *net = sock_net(sk); unsigned short snum; int chk_addr_ret; + int tb_id = 0; int err; /* If the socket has its own bind function then use it. 
(RAW) */ @@ -448,7 +450,16 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len) goto out; } - chk_addr_ret = inet_addr_type(net, addr-sin_addr.s_addr); + if (sk-sk_bound_dev_if) { + struct net_device *dev; + + rcu_read_lock(); + dev = dev_get_by_index_rcu(net, sk-sk_bound_dev_if); + if (dev) + tb_id = vrf_dev_table_rcu(dev); + rcu_read_unlock(); + } + chk_addr_ret = inet_addr_type_table(net, addr-sin_addr.s_addr, tb_id); /* Not specified by any standard per-se, however it breaks too * many applications when removed. It is unfortunate since diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c index 34a308573f4b..30409b75e925 100644 --- a/net/ipv4/arp.c +++ b/net/ipv4/arp.c @@ -233,7 +233,7 @@ static int arp_constructor(struct neighbour *neigh) return -EINVAL; } - neigh-type = inet_addr_type(dev_net(dev), addr); + neigh-type = inet_addr_type_dev_table(dev_net(dev), dev, addr); parms = in_dev-arp_parms; __neigh_parms_put(neigh-parms); @@ -343,7 +343,7 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb) switch (IN_DEV_ARP_ANNOUNCE(in_dev)) { default: case 0: /* By default announce any local IP */ - if (skb inet_addr_type(dev_net(dev), + if (skb inet_addr_type_dev_table(dev_net(dev), dev, ip_hdr(skb)-saddr) == RTN_LOCAL) saddr = ip_hdr(skb)-saddr; break; @@ -351,7 +351,8 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb) if (!skb) break; saddr = ip_hdr(skb)-saddr; - if (inet_addr_type(dev_net(dev), saddr) == RTN_LOCAL) { + if (inet_addr_type_dev_table(dev_net(dev), dev, +saddr) == RTN_LOCAL) { /* saddr should be known to target */ if (inet_addr_onlink(in_dev, target, saddr)) break; @@ -751,7 +752,7 @@ static int arp_process(struct sock *sk, struct sk_buff *skb) /* Special case: IPv4 duplicate address detection packet (RFC2131) */ if (sip == 0) { if (arp-ar_op == htons(ARPOP_REQUEST) - inet_addr_type(net, tip) == RTN_LOCAL + inet_addr_type_dev_table(net, dev, tip) == RTN_LOCAL !arp_ignore(in_dev, sip, tip)) 
arp_send(ARPOP_REPLY, ETH_P_ARP, sip, dev, tip,
[PATCH net-next 7/9] net: Add routes to the table associated with the device
When a device associated with a VRF is brought up or down, routes
should be added to/removed from the table associated with the VRF.
fib_magic defaults to using the main or local tables. Have it use the
table associated with the device if there is one.

A part of this is directing prefsrc validations to the correct
table as well.

Signed-off-by: David Ahern <d...@cumulusnetworks.com>
---
 net/ipv4/fib_frontend.c  |  8 ++++----
 net/ipv4/fib_semantics.c | 25 +++++++++++++++++++------
 2 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index d84ae0e30369..0a50a08ab844 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -803,6 +803,7 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 static void fib_magic(int cmd, int type, __be32 dst, int dst_len, struct in_ifaddr *ifa)
 {
 	struct net *net = dev_net(ifa->ifa_dev->dev);
+	int tb_id = vrf_dev_table_rtnl(ifa->ifa_dev->dev);
 	struct fib_table *tb;
 	struct fib_config cfg = {
 		.fc_protocol = RTPROT_KERNEL,
@@ -817,11 +818,10 @@ static void fib_magic(int cmd, int type, __be32 dst, int dst_len, struct in_ifad
 		},
 	};
 
-	if (type == RTN_UNICAST)
-		tb = fib_new_table(net, RT_TABLE_MAIN);
-	else
-		tb = fib_new_table(net, RT_TABLE_LOCAL);
+	if (!tb_id)
+		tb_id = (type == RTN_UNICAST) ? RT_TABLE_MAIN : RT_TABLE_LOCAL;
 
+	tb = fib_new_table(net, tb_id);
 	if (!tb)
 		return;
 
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 410ddb67221e..85e9a8abf15c 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -838,6 +838,23 @@ __be32 fib_info_update_nh_saddr(struct net *net, struct fib_nh *nh)
 	return nh->nh_saddr;
 }
 
+static bool fib_valid_prefsrc(struct fib_config *cfg, __be32 fib_prefsrc)
+{
+	if (cfg->fc_type != RTN_LOCAL || !cfg->fc_dst ||
+	    fib_prefsrc != cfg->fc_dst) {
+		int tb_id = cfg->fc_table;
+
+		if (tb_id == RT_TABLE_MAIN)
+			tb_id = RT_TABLE_LOCAL;
+
+		if (inet_addr_type_table(cfg->fc_nlinfo.nl_net,
+					 fib_prefsrc, tb_id) != RTN_LOCAL) {
+			return false;
+		}
+	}
+	return true;
+}
+
 struct fib_info *fib_create_info(struct fib_config *cfg)
 {
 	int err;
@@ -1033,12 +1050,8 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
 			fi->fib_flags |= RTNH_F_LINKDOWN;
 	}
 
-	if (fi->fib_prefsrc) {
-		if (cfg->fc_type != RTN_LOCAL || !cfg->fc_dst ||
-		    fi->fib_prefsrc != cfg->fc_dst)
-			if (inet_addr_type(net, fi->fib_prefsrc) != RTN_LOCAL)
-				goto err_inval;
-	}
+	if (fi->fib_prefsrc && !fib_valid_prefsrc(cfg, fi->fib_prefsrc))
+		goto err_inval;
 
 	change_nexthops(fi) {
 		fib_info_update_nh_saddr(net, nexthop_nh);
-- 
2.3.2 (Apple Git-55)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
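
Editor's sketch of the table selection described in the patch above. This is not kernel code: the toy_ names are hypothetical and only the two numeric table ids follow RT_TABLE_MAIN (254) and RT_TABLE_LOCAL (255). It assumes a device table id of 0 means "no VRF table", as in the patch.

```c
/* Hypothetical stand-ins; the values mirror RT_TABLE_MAIN / RT_TABLE_LOCAL. */
enum { TOY_TABLE_MAIN = 254, TOY_TABLE_LOCAL = 255 };
enum toy_route_type { TOY_RTN_UNICAST = 1, TOY_RTN_LOCAL = 2, TOY_RTN_BROADCAST = 3 };

/* Table a fib_magic-style helper would use: the device's VRF table when it
 * has one, otherwise main for unicast routes and local for everything else.
 */
static int toy_magic_table(int dev_tb_id, enum toy_route_type type)
{
	if (!dev_tb_id)
		dev_tb_id = (type == TOY_RTN_UNICAST) ? TOY_TABLE_MAIN
						      : TOY_TABLE_LOCAL;
	return dev_tb_id;
}

/* Table used to validate a preferred source address: a route configured in
 * the main table is checked against the local table, any other table
 * (for example a VRF table) is checked against itself.
 */
static int toy_prefsrc_table(int cfg_table)
{
	return cfg_table == TOY_TABLE_MAIN ? TOY_TABLE_LOCAL : cfg_table;
}
```

With a hypothetical VRF table id of 10, every route type for that device lands in table 10, while devices without a VRF keep the old main/local split.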
[PATCH net-next 5/9] net: Add inet_addr lookup by table
Currently inet_addr_type and inet_dev_addr_type expect local addresses
to be in the local table. With the VRF device, local routes for devices
associated with a VRF will be in the table associated with the VRF.
Provide an alternate inet_addr lookup to use a specific table rather
than defaulting to the local table.

Signed-off-by: Shrijeet Mukherjee <s...@cumulusnetworks.com>
Signed-off-by: David Ahern <d...@cumulusnetworks.com>
---
 include/net/route.h     |  1 +
 net/ipv4/fib_frontend.c | 22 +++++++++++++++-------
 2 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index 94189d4bd899..6ba681f0b98d 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -189,6 +189,7 @@ void ipv4_sk_redirect(struct sk_buff *skb, struct sock *sk);
 void ip_rt_send_redirect(struct sk_buff *skb);
 
 unsigned int inet_addr_type(struct net *net, __be32 addr);
+unsigned int inet_addr_type_table(struct net *net, __be32 addr, int tb_id);
 unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev,
 				__be32 addr);
 void ip_rt_multicast_event(struct in_device *);
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index d8ced1d89f1b..b11321a8e58d 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -212,12 +212,12 @@ void fib_flush_external(struct net *net)
  */
 static inline unsigned int __inet_dev_addr_type(struct net *net,
 						const struct net_device *dev,
-						__be32 addr)
+						__be32 addr, int tb_id)
 {
 	struct flowi4		fl4 = { .daddr = addr };
 	struct fib_result	res;
 	unsigned int ret = RTN_BROADCAST;
-	struct fib_table *local_table;
+	struct fib_table *table;
 
 	if (ipv4_is_zeronet(addr) || ipv4_is_lbcast(addr))
 		return RTN_BROADCAST;
@@ -226,10 +226,10 @@ static inline unsigned int __inet_dev_addr_type(struct net *net,
 
 	rcu_read_lock();
 
-	local_table = fib_get_table(net, RT_TABLE_LOCAL);
-	if (local_table) {
+	table = fib_get_table(net, tb_id);
+	if (table) {
 		ret = RTN_UNICAST;
-		if (!fib_table_lookup(local_table, &fl4, &res, FIB_LOOKUP_NOREF)) {
+		if (!fib_table_lookup(table, &fl4, &res, FIB_LOOKUP_NOREF)) {
 			if (!dev || dev == res.fi->fib_dev)
 				ret = res.type;
 		}
@@ -239,16 +239,24 @@ static inline unsigned int __inet_dev_addr_type(struct net *net,
 	return ret;
 }
 
+unsigned int inet_addr_type_table(struct net *net, __be32 addr, int tb_id)
+{
+	return __inet_dev_addr_type(net, NULL, addr, tb_id);
+}
+EXPORT_SYMBOL(inet_addr_type_table);
+
 unsigned int inet_addr_type(struct net *net, __be32 addr)
 {
-	return __inet_dev_addr_type(net, NULL, addr);
+	return __inet_dev_addr_type(net, NULL, addr, RT_TABLE_LOCAL);
 }
 EXPORT_SYMBOL(inet_addr_type);
 
 unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev,
 				__be32 addr)
 {
-	return __inet_dev_addr_type(net, dev, addr);
+	int rt_table = vrf_dev_table(dev) ? : RT_TABLE_LOCAL;
+
+	return __inet_dev_addr_type(net, dev, addr, rt_table);
 }
 EXPORT_SYMBOL(inet_dev_addr_type);
 
-- 
2.3.2 (Apple Git-55)
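
Editor's sketch of the three lookup entry points from the patch above, reduced to a userspace model. Everything prefixed toy_ is hypothetical, the "tables" are plain arrays of addresses, and the VRF table id 10 is an arbitrary assumption. It only illustrates how the table argument changes the answer, not how the FIB is actually searched.

```c
#include <stddef.h>
#include <stdint.h>

enum { TOY_TABLE_LOCAL = 255, TOY_TABLE_VRF = 10 };
enum toy_rtn { TOY_RTN_UNICAST = 1, TOY_RTN_LOCAL = 2, TOY_RTN_BROADCAST = 3 };

struct toy_table {
	int id;
	const uint32_t *local_addrs;
	size_t naddrs;
};

struct toy_dev {
	int vrf_table;	/* 0 when the device is not part of a VRF */
};

static const uint32_t toy_local_addrs[] = { 0x7f000001 };	/* 127.0.0.1 */
static const uint32_t toy_vrf_addrs[] = { 0x0a000001 };	/* 10.0.0.1 */

static const struct toy_table toy_tables[] = {
	{ TOY_TABLE_LOCAL, toy_local_addrs, 1 },
	{ TOY_TABLE_VRF, toy_vrf_addrs, 1 },
};

/* Look the address up in one specific table. */
static enum toy_rtn toy_addr_type_table(uint32_t addr, int tb_id)
{
	size_t t, i;

	for (t = 0; t < sizeof(toy_tables) / sizeof(toy_tables[0]); t++) {
		if (toy_tables[t].id != tb_id)
			continue;
		for (i = 0; i < toy_tables[t].naddrs; i++)
			if (toy_tables[t].local_addrs[i] == addr)
				return TOY_RTN_LOCAL;
		return TOY_RTN_UNICAST;	/* table exists, address is not local in it */
	}
	return TOY_RTN_BROADCAST;	/* no such table: the patch's initial "ret" */
}

/* Old behaviour: always the local table. */
static enum toy_rtn toy_addr_type(uint32_t addr)
{
	return toy_addr_type_table(addr, TOY_TABLE_LOCAL);
}

/* Device variant: the device's VRF table if it has one, else the local table. */
static enum toy_rtn toy_dev_addr_type(const struct toy_dev *dev, uint32_t addr)
{
	int tb_id = (dev && dev->vrf_table) ? dev->vrf_table : TOY_TABLE_LOCAL;

	return toy_addr_type_table(addr, tb_id);
}
```

In this model 10.0.0.1 is reported as local only when the lookup is pointed at the VRF table, which is the situation the patch description calls out.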
Re: [PATCH net] bna: fix interrupts storm caused by erroneous packets
From: Ivan Vecera <ivec...@redhat.com>
Date: Thu, 6 Aug 2015 22:48:23 +0200

> The commit "e29aa33 bna: Enable Multi Buffer RX" moved the packets
> counter increment from the beginning of the NAPI processing loop to
> after the check for erroneous packets, so they are never accounted.
> This counter is used to inform firmware about the number of processed
> completions (packets). As these packets are never acked, the firmware
> fires IRQs for them again and again.
>
> Fixes: e29aa33 ("bna: Enable Multi Buffer RX")
> Signed-off-by: Ivan Vecera <ivec...@redhat.com>

Applied, thanks.
Re: [PATCH iproute2] tipc: fix bearer get/set help synopsis
On Fri, 7 Aug 2015 09:55:09 +0200
<richard.a...@ericsson.com> wrote:

> From: Richard Alpe <richard.a...@ericsson.com>
>
> One option is required for bearer set and bearer get.

Applied, thanks
Re: [PATCH v2] openvswitch: Fix L4 checksum handling when dealing with IP fragments
On Mon, Aug 03, 2015 at 02:03:28PM -0700, David Miller wrote:
> From: Glenn Griffin <ggriffin.ker...@gmail.com>
> Date: Mon, 3 Aug 2015 09:56:54 -0700
>
> > openvswitch modifies the L4 checksum of a packet when modifying
> > the IP address. When an IP packet is fragmented only the first
> > fragment contains an L4 header and checksum. Prior to this change
> > openvswitch would modify all fragments, modifying application data
> > in non-first fragments, causing checksum failures in the
> > reassembled packet.
> >
> > Signed-off-by: Glenn Griffin <ggriffin.ker...@gmail.com>
> > ---
> > Changes in v2:
> > - Compare frag_off in network byte order rather than host byte order
>
> Applied and queued up for -stable.

I noticed this change didn't seem to make it into 4.2-rc6. I'm not too
familiar with the release schedule so wasn't sure if that was expected
or an oversight. Will this remain queued up until the 4.3 merge window
opens?
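
Editor's sketch of the check the v2 changelog refers to. It is not the openvswitch code: toy_has_l4_header and TOY_IP_OFFSET are hypothetical names, and only the 0x1FFF mask (the 13-bit fragment offset in the IPv4 frag_off field) is taken from the IPv4 header layout. The point is that the mask is converted to network byte order before it is applied to the on-wire field.

```c
#include <arpa/inet.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_IP_OFFSET 0x1FFF	/* low 13 bits of frag_off: fragment offset */

/* frag_off_be is the IPv4 header field exactly as it sits on the wire.
 * Only a packet whose fragment offset is zero (an unfragmented packet or
 * the first fragment) carries the L4 header and checksum.
 */
static bool toy_has_l4_header(uint16_t frag_off_be)
{
	return (frag_off_be & htons(TOY_IP_OFFSET)) == 0;
}
```

A caller would update the TCP/UDP checksum only when this returns true. As an illustration of why the byte order matters: on a little-endian host, an on-wire offset of 0x0020 loads as 0x2000, so masking it with a host-order 0x1FFF would wrongly report a first fragment.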
[PATCH net-next 1/9] net: Introduce VRF related flags and helpers
Add a VRF_MASTER flag for interfaces and helper functions for determining
if a device is a VRF_MASTER.

Add link attribute for passing VRF_TABLE id.

Add vrf_ptr to netdevice.

Add various macros for determining if a device is a VRF device, the index
of the master VRF device and table associated with VRF device.

Signed-off-by: Shrijeet Mukherjee <s...@cumulusnetworks.com>
Signed-off-by: David Ahern <d...@cumulusnetworks.com>
---
 include/linux/netdevice.h    |  20 +++
 include/net/vrf.h            | 139 +++
 include/uapi/linux/if_link.h |   9 +++
 3 files changed, 168 insertions(+)
 create mode 100644 include/net/vrf.h

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 607b5f41f46f..f7a6ef2fae3a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1289,6 +1289,7 @@ enum netdev_priv_flags {
 	IFF_XMIT_DST_RELEASE_PERM	= 1<<22,
 	IFF_IPVLAN_MASTER		= 1<<23,
 	IFF_IPVLAN_SLAVE		= 1<<24,
+	IFF_VRF_MASTER			= 1<<25,
 };
 
 #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
@@ -1316,6 +1317,7 @@ enum netdev_priv_flags {
 #define IFF_XMIT_DST_RELEASE_PERM	IFF_XMIT_DST_RELEASE_PERM
 #define IFF_IPVLAN_MASTER		IFF_IPVLAN_MASTER
 #define IFF_IPVLAN_SLAVE		IFF_IPVLAN_SLAVE
+#define IFF_VRF_MASTER			IFF_VRF_MASTER
 
 /**
  *	struct net_device - The DEVICE structure.
@@ -1432,6 +1434,7 @@ enum netdev_priv_flags {
  *	@dn_ptr:	DECnet specific data
  *	@ip6_ptr:	IPv6 specific data
  *	@ax25_ptr:	AX.25 specific data
+ *	@vrf_ptr:	VRF specific data
  *	@ieee80211_ptr:	IEEE 802.11 specific data, assign before registering
  *
  *	@last_rx:	Time of last Rx
@@ -1650,6 +1653,7 @@ struct net_device {
 	struct dn_dev __rcu     *dn_ptr;
 	struct inet6_dev __rcu	*ip6_ptr;
 	void			*ax25_ptr;
+	struct net_vrf_dev __rcu *vrf_ptr;
 	struct wireless_dev	*ieee80211_ptr;
 	struct wpan_dev		*ieee802154_ptr;
 #if IS_ENABLED(CONFIG_MPLS_ROUTING)
@@ -3808,6 +3812,22 @@ static inline bool netif_supports_nofcs(struct net_device *dev)
 	return dev->priv_flags & IFF_SUPP_NOFCS;
 }
 
+static inline bool netif_is_vrf(const struct net_device *dev)
+{
+	return dev->priv_flags & IFF_VRF_MASTER;
+}
+
+static inline bool netif_index_is_vrf(struct net *net, int ifindex)
+{
+	struct net_device *dev = dev_get_by_index_rcu(net, ifindex);
+	bool rc = false;
+
+	if (dev)
+		rc = netif_is_vrf(dev);
+
+	return rc;
+}
+
 /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
 static inline void netif_keep_dst(struct net_device *dev)
 {
diff --git a/include/net/vrf.h b/include/net/vrf.h
new file mode 100644
index 000000000000..25c709fdb98f
--- /dev/null
+++ b/include/net/vrf.h
@@ -0,0 +1,139 @@
+/*
+ * include/net/net_vrf.h - adds vrf dev structure definitions
+ * Copyright (c) 2015 Cumulus Networks
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#ifndef __LINUX_NET_VRF_H
+#define __LINUX_NET_VRF_H
+
+struct net_vrf_dev {
+	struct rcu_head		rcu;
+	int			ifindex; /* ifindex of master dev */
+	u32			tb_id;   /* table id for VRF */
+};
+
+struct slave {
+	struct list_head	list;
+	struct net_device	*dev;
+};
+
+struct slave_queue {
+	struct list_head	all_slaves;
+	int			num_slaves;
+};
+
+struct net_vrf {
+	struct slave_queue	queue;
+	struct rtable		*rth;
+	u32			tb_id;
+};
+
+
+#if IS_ENABLED(CONFIG_NET_VRF)
+/* called with rcu_read_lock() */
+static inline int vrf_master_ifindex_rcu(const struct net_device *dev)
+{
+	struct net_vrf_dev *vrf_ptr;
+	int ifindex = 0;
+
+	if (!dev)
+		return 0;
+
+	if (netif_is_vrf(dev))
+		ifindex = dev->ifindex;
+	else {
+		vrf_ptr = rcu_dereference(dev->vrf_ptr);
+		if (vrf_ptr)
+			ifindex = vrf_ptr->ifindex;
+	}
+
+	return ifindex;
+}
+
+/* called with rcu_read_lock */
+static inline int vrf_dev_table_rcu(const struct net_device *dev)
+{
+	int tb_id = 0;
+
+	if (dev) {
+		struct net_vrf_dev *vrf_ptr;
+
+		vrf_ptr = rcu_dereference(dev->vrf_ptr);
+		if (vrf_ptr)
+			tb_id = vrf_ptr->tb_id;
+	}
+	return tb_id;
+}
+
+static inline int vrf_dev_table(const struct net_device *dev)
+{
+	int tb_id = 0;
+
+	rcu_read_lock();
+	tb_id =
[PATCH net-next 4/9] udp: Handle VRF device in sendmsg
For unconnected UDP sockets using a VRF device, look up the source
address based on the VRF table. This allows the UDP header to be
properly set up before showing up at the VRF device via the dst.

Signed-off-by: Shrijeet Mukherjee <s...@cumulusnetworks.com>
Signed-off-by: David Ahern <d...@cumulusnetworks.com>
---
 net/ipv4/udp.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 83aa604f9273..7af5052e3b1f 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1013,11 +1013,31 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
 	if (!rt) {
 		struct net *net = sock_net(sk);
+		__u8 flow_flags = inet_sk_flowi_flags(sk);
 
 		fl4 = &fl4_stack;
+
+		/* unconnected socket. If output device is enslaved to a VRF
+		 * device lookup source address from VRF table. This mimics
+		 * behavior of ip_route_connect{_init}.
+		 */
+		if (netif_index_is_vrf(net, ipc.oif)) {
+			flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
+					   RT_SCOPE_UNIVERSE, sk->sk_protocol,
+					   (flow_flags | FLOWI_FLAG_VRFSRC),
+					   faddr, saddr, dport,
+					   inet->inet_sport);
+
+			rt = ip_route_output_flow(net, fl4, sk);
+			if (!IS_ERR(rt)) {
+				saddr = fl4->saddr;
+				ip_rt_put(rt);
+			}
+		}
+
 		flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
 				   RT_SCOPE_UNIVERSE, sk->sk_protocol,
-				   inet_sk_flowi_flags(sk),
+				   flow_flags,
 				   faddr, saddr, dport, inet->inet_sport);
 
 		security_sk_classify_flow(sk, flowi4_to_flowi(fl4));
-- 
2.3.2 (Apple Git-55)
[PATCH net-next 3/9] net: Use VRF device index for lookups on TX
As with ingress, use the index of the VRF master device for route
lookups on egress. However, the oif should only be used to direct the
lookups to a specific table. Routes in the table are not based on the
VRF device but rather interfaces that are part of the VRF, so do not
consider the oif for lookups within the table. The FLOWI_FLAG_VRFSRC
is used to control this latter part.

Signed-off-by: Shrijeet Mukherjee <s...@cumulusnetworks.com>
Signed-off-by: David Ahern <d...@cumulusnetworks.com>
---
 include/net/flow.h  | 1 +
 include/net/route.h | 3 +++
 net/ipv4/fib_trie.c | 7 +++++--
 net/ipv4/icmp.c     | 4 ++++
 net/ipv4/route.c    | 5 +++++
 5 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/net/flow.h b/include/net/flow.h
index 3098ae33a178..f305588fc162 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -33,6 +33,7 @@ struct flowi_common {
 	__u8	flowic_flags;
 #define FLOWI_FLAG_ANYSRC		0x01
 #define FLOWI_FLAG_KNOWN_NH		0x02
+#define FLOWI_FLAG_VRFSRC		0x04
 	__u32	flowic_secid;
 	struct flowi_tunnel flowic_tun_key;
 };
diff --git a/include/net/route.h b/include/net/route.h
index 2d45f419477f..94189d4bd899 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -251,6 +251,9 @@ static inline void ip_route_connect_init(struct flowi4 *fl4, __be32 dst, __be32
 	if (inet_sk(sk)->transparent)
 		flow_flags |= FLOWI_FLAG_ANYSRC;
 
+	if (netif_index_is_vrf(sock_net(sk), oif))
+		flow_flags |= FLOWI_FLAG_VRFSRC;
+
 	flowi4_init_output(fl4, oif, sk->sk_mark, tos, RT_SCOPE_UNIVERSE,
 			   protocol, flow_flags, dst, src, dport, sport);
 }
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 37c4bb89a708..1243c79cb5b0 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1423,8 +1423,11 @@ int fib_table_lookup(struct fib_table *tb, const struct flowi4 *flp,
 			    nh->nh_flags & RTNH_F_LINKDOWN &&
 			    !(fib_flags & FIB_LOOKUP_IGNORE_LINKSTATE))
 				continue;
-			if (flp->flowi4_oif && flp->flowi4_oif != nh->nh_oif)
-				continue;
+			if (!(flp->flowi4_flags & FLOWI_FLAG_VRFSRC)) {
+				if (flp->flowi4_oif &&
+				    flp->flowi4_oif != nh->nh_oif)
+					continue;
+			}
 
 			if (!(fib_flags & FIB_LOOKUP_NOREF))
 				atomic_inc(&fi->fib_clntref);
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index c0556f1e4bf0..1164fc4ce3bc 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -96,6 +96,7 @@
 #include <net/xfrm.h>
 #include <net/inet_common.h>
 #include <net/ip_fib.h>
+#include <net/vrf.h>
 
 /*
  *	Build xmit assembly blocks
@@ -425,6 +426,7 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
 	fl4.flowi4_mark = mark;
 	fl4.flowi4_tos = RT_TOS(ip_hdr(skb)->tos);
 	fl4.flowi4_proto = IPPROTO_ICMP;
+	fl4.flowi4_oif = vrf_master_ifindex_rcu(skb->dev) ? : skb->dev->ifindex;
 	security_skb_classify_flow(skb, flowi4_to_flowi(&fl4));
 	rt = ip_route_output_key(net, &fl4);
 	if (IS_ERR(rt))
@@ -458,6 +460,8 @@ static struct rtable *icmp_route_lookup(struct net *net,
 	fl4->flowi4_proto = IPPROTO_ICMP;
 	fl4->fl4_icmp_type = type;
 	fl4->fl4_icmp_code = code;
+	fl4->flowi4_oif = vrf_master_ifindex_rcu(skb_in->dev) ? : skb_in->dev->ifindex;
+
 	security_skb_classify_flow(skb_in, flowi4_to_flowi(fl4));
 	rt = __ip_route_output_key(net, fl4);
 	if (IS_ERR(rt))
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index c26ff1f7067d..2c89d294b669 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2131,6 +2131,11 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
 				fl4->saddr = inet_select_addr(dev_out, 0,
 							      RT_SCOPE_HOST);
 		}
+		if (netif_is_vrf(dev_out) &&
+		    !(fl4->flowi4_flags & FLOWI_FLAG_VRFSRC)) {
+			rth = vrf_dev_get_rth(dev_out);
+			goto out;
+		}
 	}
 
 	if (!fl4->daddr) {
-- 
2.3.2 (Apple Git-55)
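
Editor's sketch of the nexthop filter changed in fib_table_lookup() by the patch above, as a standalone model. struct toy_flow and toy_nh_usable are hypothetical. The flag value matches the FLOWI_FLAG_VRFSRC definition added in the patch, and the interface indexes used in the examples are arbitrary.

```c
#include <stdbool.h>

#define TOY_FLAG_VRFSRC 0x04	/* same bit value as FLOWI_FLAG_VRFSRC in the patch */

struct toy_flow {
	int oif;		/* requested output interface, 0 for "any" */
	unsigned char flags;
};

/* Returns true when a nexthop leaving through nh_oif may be used for the flow.
 * Without the VRF flag a non-zero oif must match the nexthop's interface.
 * With the flag set, oif only selected the table (it names the VRF master),
 * so it is not compared against the nexthop.
 */
static bool toy_nh_usable(const struct toy_flow *flp, int nh_oif)
{
	if (!(flp->flags & TOY_FLAG_VRFSRC)) {
		if (flp->oif && flp->oif != nh_oif)
			return false;
	}
	return true;
}
```

For example, a flow with oif 5 (the VRF master) and the flag set can still use a nexthop on interface 3, a slave of that VRF; without the flag the same nexthop would be skipped.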
Re: [PATCH 0/3] Fixes for the network driver of Marvell Armada 375 SoC
From: Marcin Wojtas <m...@semihalf.com>
Date: Thu, 6 Aug 2015 19:00:27 +0200

> This is a set of three patches that fix long-standing problems
> introduced with the initial support for the Armada 375 network
> controller. Due to an inappropriate approach to handling per-CPU
> processing of sent packets on the TX path, the driver suffered from
> numerous problems, such as RCU stalls. Those have been fixed; details
> can be found in the commit logs.
>
> The patches were intensively tested on top of v4.2-rc5. I'm looking
> forward to any comments or remarks.

Series applied, thanks.
Re: [PATCH iproute2 -next] m_bpf: add frontend support for late binding
On Fri, 7 Aug 2015 11:36:50 +0200
Daniel Borkmann <dan...@iogearbox.net> wrote:

> Frontend support for kernel commit a5c90b29e5cc ("act_bpf: properly
> support late binding of bpf action to a classifier").
>
> Signed-off-by: Daniel Borkmann <dan...@iogearbox.net>

Applied to net-next
Re: [PATCH iproute2 net-next] iplink: bonding: add support for IFLA_BOND_TLB_DYNAMIC_LB
On Mon, 3 Aug 2015 12:19:55 +0200
Nikolay Aleksandrov <ra...@blackwall.org> wrote:

> From: Nikolay Aleksandrov <niko...@cumulusnetworks.com>
>
> Add support to be able to set and show the value of tlb_dynamic_lb
> (IFLA_BOND_TLB_DYNAMIC_LB).
>
> Example:
> $ ip -d link show dev bond0 type bond
> 7: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN
> mode DEFAULT group default
>     link/ether ce:2f:e1:6e:d7:e0 brd ff:ff:ff:ff:ff:ff promiscuity 0
>     bond mode balance-tlb miimon 100 updelay 0 downdelay 0 use_carrier 1
>     arp_interval 0 arp_validate none arp_all_targets any
>     primary_reselect always fail_over_mac none xmit_hash_policy layer2
>     resend_igmp 1 num_grat_arp 1 all_slaves_active 0 min_links 0
>     lp_interval 1 packets_per_slave 1 lacp_rate slow ad_select stable
>     tlb_dynamic_lb 1 addrgenmode eui64
>
> $ ip -d l set dev bond0 type bond tlb_dynamic_lb 0
> $ ip -d link show dev bond0 type bond
> 7: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN
> mode DEFAULT group default
>     link/ether ce:2f:e1:6e:d7:e0 brd ff:ff:ff:ff:ff:ff promiscuity 0
>     bond mode balance-tlb miimon 100 updelay 0 downdelay 0 use_carrier 1
>     arp_interval 0 arp_validate none arp_all_targets any
>     primary_reselect always fail_over_mac none xmit_hash_policy layer2
>     resend_igmp 1 num_grat_arp 1 all_slaves_active 0 min_links 0
>     lp_interval 1 packets_per_slave 1 lacp_rate slow ad_select stable
>     tlb_dynamic_lb 0 addrgenmode eui64
>
> Signed-off-by: Nikolay Aleksandrov <niko...@cumulusnetworks.com>

Applied to net-next
Re: [iproute PATCH] misc/ss: don't imply -a when -A was specified
On Fri, 7 Aug 2015 15:31:27 +0200
Phil Sutter <p...@nwl.cc> wrote:

> Signed-off-by: Phil Sutter <p...@nwl.cc>

Ok, applied
Re: kernel warning in tcp_fragment
On Mon, Aug 10, 2015 at 2:10 PM, Jovi Zhangwei <j...@cloudflare.com> wrote:
> Ping?
>
> We saw a lot of these warnings in our production system. It would be
> greatly appreciated if someone could give us a fix for these
> warnings. :)

What is your net.ipv4.tcp_mtu_probing setting? If 1, have you tried
setting it to 0?

Previous reports ( https://patchwork.ozlabs.org/patch/480882/ ) have
shown that this gets rid of at least one source of the warning. So
that would provide a useful data point.

Separately, you could also try the attached patch. This is against
3.14.39. It tries to attack a different possible source of this
warning.

Please let us know if that patch helps.

Thanks!
neal

0001-RFC-for-tests-on-v3.14.39-tcp-resegment-skbs-that-we.patch
Description: Binary data
Re: [PATCH net-next 1/2] net: track link status of ipv6 nexthops
On Mon, Aug 10, 2015 at 10:54:00AM -0700, David Miller wrote:
> From: Andy Gospodarek <go...@cumulusnetworks.com>
> Date: Thu, 6 Aug 2015 11:42:33 -0400
>
> > Add support to track current link status of ipv6 nexthops to match
> > recent changes that added support for ipv4 nexthops. There was not
> > a field already available that could track these and no space
> > available in the existing rt6i_flags field, so this patch adds
> > rt6i_nhflags to struct rt6_info.
> >
> > Signed-off-by: Andy Gospodarek <go...@cumulusnetworks.com>
> > Signed-off-by: Dinesh Dutt <dd...@cumulusnetworks.com>
>
> This doesn't really make any sense to me.
>
> You can evaluate the state of the link at the time you look at the
> route at all of the places where it matters as far as I can tell.
>
> It's so expensive to walk the entire routing table every time a link
> goes up and down, so it's much better to take an evaluate-as-needed
> approach to implementing this.

I went with storing this info in a flags field for 2 reasons:

- Marking routes on link status changes and checking for that mark
  during forwarding is what was suggested by Alex et al for the ipv4
  code, and I wanted to keep the overall design similar.

- New flags will likely be needed when switchdev support is added for
  ipv6 routes, so going ahead and mirroring the RTNH_F* flags in the
  ipv6 code seemed reasonable.

I would actually be fine with what you proposed (it is closer to the
first implementation), so if my justification above does not change
your mind, let me know and I'll post a v2 that does not add
rt6i_nhflags and simply checks netif_carrier_ok() rather than
RTNH_F_LINKDOWN.
Re: [RFC PATCH net-next] tcp: reduce cpu usage under tcp memory pressure when SO_SNDBUF is set
On Fri, 2015-08-07 at 18:31 +, Jason Baron wrote:
> From: Jason Baron <jba...@akamai.com>
>
> When SO_SNDBUF is set and we are under tcp memory pressure, the
> effective write buffer space can be much lower than what was set
> using SO_SNDBUF. For example, we may have set the buffer to 100kb,
> but we may only be able to write 10kb. In this scenario
> poll()/select()/epoll() are going to continuously return POLLOUT,
> followed by -EAGAIN from write() in a very tight loop.
>
> Introduce sk->sk_effective_sndbuf, such that we can track the
> 'effective' size of the sndbuf, when we have a short write due to
> memory pressure. By using the sk->sk_effective_sndbuf instead of the
> sk->sk_sndbuf when we are under memory pressure, we can delay the
> POLLOUT until 1/3 of the buffer clears as we normally do. There is no
> issue here when SO_SNDBUF is not set, since the tcp layer will auto
> tune the sk->sndbuf.
>
> In my testing, this brought a single thread's cpu usage down from
> 100% to 1% while maintaining the same level of throughput when under
> memory pressure.

I am not sure we need to grow socket for something that looks like a
flag?

Also you add a race in sk_stream_wspace() as sk_effective_sndbuf value
can change under us.

> +	if (sk->sk_effective_sndbuf)
> +		return sk->sk_effective_sndbuf - sk->sk_wmem_queued;
> +
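
Editor's sketch of the write-space arithmetic discussed in this thread, as a single-threaded userspace model. struct toy_sock and the toy_ helpers are hypothetical, and the "writeable when free space is at least half of what is queued" rule is an assumption chosen to match the "1/3 of the buffer clears" behaviour the patch description mentions. The effective limit is read once into a local, which is the toy analogue of the race raised in the reply; real kernel code would need a proper once-only read.

```c
#include <stdbool.h>

struct toy_sock {
	int sndbuf;		/* configured limit (SO_SNDBUF) */
	int effective_sndbuf;	/* 0, or a lower limit seen under memory pressure */
	int wmem_queued;	/* bytes currently queued for sending */
};

/* Free space against whichever limit is in force. The field is read a
 * single time so that the test and the subtraction use the same value.
 */
static int toy_wspace(const struct toy_sock *sk)
{
	int eff = sk->effective_sndbuf;
	int limit = eff ? eff : sk->sndbuf;

	return limit - sk->wmem_queued;
}

/* Report the socket writeable only once free space is at least half of the
 * queued bytes, that is, once roughly a third of the limit is free.
 */
static bool toy_writeable(const struct toy_sock *sk)
{
	return toy_wspace(sk) >= sk->wmem_queued / 2;
}
```

With a 100000-byte sndbuf and 10000 bytes queued, the model reports writeable as long as no effective limit is set. Once an effective limit of 10000 is recorded, it stays non-writeable until the queue drains to about two thirds of that limit, which is what stops the POLLOUT/-EAGAIN loop in the description.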
Re: [PATCH net-next v3 7/8] net: switchdev: support static FDB addresses
On Mon, Aug 10, 2015 at 6:09 AM, Vivien Didelot
<vivien.dide...@savoirfairelinux.com> wrote:
> This patch adds an ndm_state member to the switchdev_obj_fdb
> structure, in order to support static FDB addresses.
>
> Set Rocker ndm_state to NUD_REACHABLE.
>
> Signed-off-by: Vivien Didelot <vivien.dide...@savoirfairelinux.com>

Acked-by: Scott Feldman <sfel...@gmail.com>
[PATCH 2/3] gianfar: correct list membership accounting
From: Jakub Kicinski <kubak...@wp.pl>

At a cost of one line let's make sure .count is correct
when calling gfar_process_filer_changes().

Signed-off-by: Jakub Kicinski <kubak...@wp.pl>
---
 drivers/net/ethernet/freescale/gianfar_ethtool.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/freescale/gianfar_ethtool.c b/drivers/net/ethernet/freescale/gianfar_ethtool.c
index e543d3b01838..b955ed83ca98 100644
--- a/drivers/net/ethernet/freescale/gianfar_ethtool.c
+++ b/drivers/net/ethernet/freescale/gianfar_ethtool.c
@@ -1723,13 +1723,14 @@ static int gfar_add_cls(struct gfar_private *priv,
 	}
 
 process:
+	priv->rx_list.count++;
 	ret = gfar_process_filer_changes(priv);
 	if (ret)
 		goto clean_list;
-	priv->rx_list.count++;
 	return ret;
 
 clean_list:
+	priv->rx_list.count--;
 	list_del(&temp->list);
 clean_mem:
 	kfree(temp);
-- 
2.1.0
Re: [PATCH net] ipv6: don't reject link-local nexthop on other interface
From: Florian Westphal <f...@strlen.de>
Date: Fri, 7 Aug 2015 10:54:28 +0200

> 48ed7b26faa7 ("ipv6: reject locally assigned nexthop addresses") is
> too strict; it rejects following corner-case:
>
> ip -6 route add default via fe80::1:2:3 dev eth1
>
> [ where fe80::1:2:3 is assigned to a local interface, but not eth1 ]
>
> Fix this by restricting search to given device if nh is linklocal.
>
> Joint work with Hannes Frederic Sowa.
>
> Fixes: 48ed7b26faa7 ("ipv6: reject locally assigned nexthop addresses")
> Signed-off-by: Hannes Frederic Sowa <han...@stressinduktion.org>
> Signed-off-by: Florian Westphal <f...@strlen.de>

Applied, thank you.
Re: [PATCH net-next] net: fec: fix the race between xmit and bdp reclaiming path
From: Kevin Hao <haoke...@gmail.com>
Date: Fri, 7 Aug 2015 13:52:37 +0800

> When we transmit a fragmented skb, we may run into a race like the
> following scenario (assume txq->cur_tx is next to txq->dirty_tx):
>
>            cpu 0                            cpu 1
>   fec_enet_txq_submit_skb
>     reserve a bdp for the first fragment
>     fec_enet_txq_submit_frag_skb
>       update the bdp for the other fragment
>       update txq->cur_tx
>                                     fec_enet_tx_queue
>                                       bdp = fec_enet_get_nextdesc(txq->dirty_tx, fep, queue_id);
>                                       This bdp is the bdp reserved for the first segment.
>                                       Given that this bdp BD_ENET_TX_READY bit is not set
>                                       and txq->cur_tx is already pointed to a bdp beyond
>                                       this one. We think this is a completed bdp and try
>                                       to reclaim it.
>     update the bdp for the first segment
>     update txq->cur_tx
>
> So we shouldn't update the txq->cur_tx until all the update to the
> bdps used for fragments are performed. Also add the corresponding
> memory barrier to guarantee that the update to the bdps, dirty_tx and
> cur_tx performed in the proper order.
>
> Signed-off-by: Kevin Hao <haoke...@gmail.com>

Applied, thanks.
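
Editor's sketch of the ordering rule from the commit message above, written with C11 atomics in userspace. This is not the fec driver: the ring layout, the toy_ names and the toy_hw_complete() helper (which stands in for the hardware clearing the ready bit) are all hypothetical. It only shows the idea of writing every descriptor of a fragmented packet first and publishing cur_tx last with release semantics, so that the reclaim side, which reads cur_tx with acquire semantics, never sees a half-written descriptor inside its range.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_RING 8

struct toy_bd {
	bool ready;	/* set by xmit, cleared when the "hardware" has sent it */
	int len;
};

struct toy_txq {
	struct toy_bd bd[TOY_RING];
	atomic_uint cur_tx;	/* next free descriptor, published by xmit */
	unsigned int dirty_tx;	/* next descriptor the reclaim path looks at */
};

static void toy_init(struct toy_txq *q)
{
	unsigned int i;

	for (i = 0; i < TOY_RING; i++) {
		q->bd[i].ready = false;
		q->bd[i].len = 0;
	}
	atomic_init(&q->cur_tx, 0);
	q->dirty_tx = 0;
}

/* Queue one packet made of nfrags fragments (nfrags >= 1). */
static void toy_submit(struct toy_txq *q, const int *frag_len, unsigned int nfrags)
{
	unsigned int first = atomic_load_explicit(&q->cur_tx, memory_order_relaxed);
	unsigned int i;

	/* Descriptors for the later fragments are filled in first ... */
	for (i = 1; i < nfrags; i++) {
		struct toy_bd *bd = &q->bd[(first + i) % TOY_RING];

		bd->len = frag_len[i];
		bd->ready = true;
	}
	/* ... then the descriptor reserved for the first fragment ... */
	q->bd[first % TOY_RING].len = frag_len[0];
	q->bd[first % TOY_RING].ready = true;

	/* ... and only then is cur_tx moved past the whole packet. */
	atomic_store_explicit(&q->cur_tx, first + nfrags, memory_order_release);
}

/* Stand-in for the hardware finishing the oldest count descriptors. */
static void toy_hw_complete(struct toy_txq *q, unsigned int count)
{
	unsigned int i;

	for (i = 0; i < count; i++)
		q->bd[(q->dirty_tx + i) % TOY_RING].ready = false;
}

/* Reclaim completed descriptors in [dirty_tx, cur_tx); returns how many. */
static unsigned int toy_reclaim(struct toy_txq *q)
{
	unsigned int end = atomic_load_explicit(&q->cur_tx, memory_order_acquire);
	unsigned int n = 0;

	while (q->dirty_tx != end && !q->bd[q->dirty_tx % TOY_RING].ready) {
		q->dirty_tx++;
		n++;
	}
	return n;
}
```

If toy_submit() advanced cur_tx before marking the first descriptor ready, toy_reclaim() could see a not-ready descriptor inside its range and treat it as completed, which is the race the table describes.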
[FWD] PROBLEM: there exists a wrong return value of function mkiss_open()
I don't know how many people care about hamradio, but the report that
mkiss_open() returns success even when register_netdev() fails seems
entirely true.

The email was just not sent to the right people..

                 Linus

On Sun, Aug 9, 2015 at 5:08 PM, RUC_Soft_Sec <zy900...@163.com> wrote:
> Summary:
> The function mkiss_open() returns a wrong value. It's a theoretical
> problem; we used a static analysis method to detect this bug.
>
> Bug Description:
>
> In function mkiss_open() at drivers/net/hamradio/mkiss.c:726, the call
> to register_netdev() in line 765 may return a negative error code, in
> which case mkiss_open() returns the value of variable err. When
> everything goes well, mkiss_open() returns 0. However, when the call
> to register_netdev() in line 765 returns a negative error code, the
> value of err is still 0. So mkiss_open() returns 0 to its callers even
> though register_netdev() failed, which is a wrong return value.
>
> The related code snippets in mkiss_open() are as follows.
> mkiss_open @@ drivers/net/hamradio/mkiss.c:726
>  726 static int mkiss_open(struct tty_struct *tty)
>  727 {
>          ...
>  761     if ((err = ax_open(ax->dev))) {
>  762         goto out_free_netdev;
>  763     }
>  764
>  765     if (register_netdev(dev))
>  766         goto out_free_buffers;
>          ...
>  800 out_free_buffers:
>  801     kfree(ax->rbuff);
>  802     kfree(ax->xbuff);
>  803
>  804 out_free_netdev:
>  805     free_netdev(dev);
>  806
>  807 out:
>  808     return err;
>  809 }
>
> Generally, when the call to register_netdev() fails, the return value
> of the caller should differ from the value returned when the call to
> register_netdev() succeeds, as in the following code from another
> file.
> com90io_found @@ drivers/net/arcnet/com90io.c:234
>  234 static int __init com90io_found(struct net_device *dev)
>  235 {
>          ...
>  268     err = register_netdev(dev);
>  269     if (err) {
>  270         outb((inb(_CONFIG) & ~IOMAPflag), _CONFIG);
>  271         free_irq(dev->irq, dev);
>  272         release_region(dev->base_addr, ARCNET_TOTAL_SIZE);
>  273         return err;
>  274     }
>  275
>  276     BUGMSG(D_NORMAL, "COM90IO: station %02Xh found at %03lXh, IRQ %d.\n",
>  277            dev->dev_addr[0], dev->base_addr, dev->irq);
>  278
>  279     return 0;
>  280 }
>
> Kernel version: 3.19.1
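
Editor's sketch of the pattern described in the report above, reduced to a standalone model. toy_register() is a hypothetical stand-in for register_netdev(), and the error code -ENODEV is an arbitrary choice. The two toy_open_ functions are not the mkiss code, only the shape of the bug and of the usual fix.

```c
#include <errno.h>

/* Stand-in for a registration call that can fail. */
static int toy_register(int should_fail)
{
	return should_fail ? -ENODEV : 0;
}

/* Shape of the reported bug: the failure is detected, but the error code
 * is never stored, so the caller is told that everything went well.
 */
static int toy_open_buggy(int should_fail)
{
	int err = 0;	/* an earlier step succeeded and left err at 0 */

	if (toy_register(should_fail))
		goto out;	/* result discarded, err is still 0 */
out:
	return err;
}

/* Usual fix: keep the return value and propagate it. */
static int toy_open_fixed(int should_fail)
{
	int err;

	err = toy_register(should_fail);
	if (err)
		goto out;
out:
	return err;
}
```

In this model toy_open_buggy() returns 0 whether or not registration fails, while toy_open_fixed() passes the negative error code back to its caller.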
[PATCH 3/3] gianfar: remove faulty filer optimizer
From: Jakub Kicinski kubak...@wp.pl

Current filer rule optimization is broken in several ways:
(1) It destroys rule ordering.
(2) It performs reads/writes beyond end of allocated tables.
(3) It breaks badly for rules with more than 2 specifiers
    (e.g. matching ip, port, tos).
(4) We observed that the masking rules it generates do not
    play well with clustering on P2020. Only first rule
    of the cluster would ever fire. Given that optimizer
    relies heavily on masking this is very hard to fix.

The fact that nobody noticed (1), (3) or (4) makes me think
that this feature is not very widely used and we should just
remove it.

Reported-by: Aleksander Dutkowski adutkow...@gmail.com
Signed-off-by: Jakub Kicinski kubak...@wp.pl
---
 drivers/net/ethernet/freescale/gianfar_ethtool.c | 337 ---
 1 file changed, 337 deletions(-)

diff --git a/drivers/net/ethernet/freescale/gianfar_ethtool.c b/drivers/net/ethernet/freescale/gianfar_ethtool.c
index b955ed83ca98..6bdc89179b72 100644
--- a/drivers/net/ethernet/freescale/gianfar_ethtool.c
+++ b/drivers/net/ethernet/freescale/gianfar_ethtool.c
@@ -902,27 +902,6 @@ static int gfar_check_filer_hardware(struct gfar_private *priv)
 	return 0;
 }
 
-static int gfar_comp_asc(const void *a, const void *b)
-{
-	return memcmp(a, b, 4);
-}
-
-static int gfar_comp_desc(const void *a, const void *b)
-{
-	return -memcmp(a, b, 4);
-}
-
-static void gfar_swap(void *a, void *b, int size)
-{
-	u32 *_a = a;
-	u32 *_b = b;
-
-	swap(_a[0], _b[0]);
-	swap(_a[1], _b[1]);
-	swap(_a[2], _b[2]);
-	swap(_a[3], _b[3]);
-}
-
 /* Write a mask to filer cache */
 static void gfar_set_mask(u32 mask, struct filer_table *tab)
 {
@@ -1272,310 +1251,6 @@ static int gfar_convert_to_filer(struct ethtool_rx_flow_spec *rule,
 	return 0;
 }
 
-/* Copy size filer entries */
-static void gfar_copy_filer_entries(struct gfar_filer_entry dst[0],
-				    struct gfar_filer_entry src[0], s32 size)
-{
-	while (size > 0) {
-		size--;
-		dst[size].ctrl = src[size].ctrl;
-		dst[size].prop = src[size].prop;
-	}
-}
-
-/* Delete the contents of the filer-table between start and end
- * and collapse them
- */
-static int gfar_trim_filer_entries(u32 begin, u32 end, struct filer_table *tab)
-{
-	int length;
-
-	if (end > MAX_FILER_CACHE_IDX || end < begin)
-		return -EINVAL;
-
-	end++;
-	length = end - begin;
-
-	/* Copy */
-	while (end < tab->index) {
-		tab->fe[begin].ctrl = tab->fe[end].ctrl;
-		tab->fe[begin++].prop = tab->fe[end++].prop;
-
-	}
-	/* Fill up with don't cares */
-	while (begin < tab->index) {
-		tab->fe[begin].ctrl = 0x60;
-		tab->fe[begin].prop = 0xFFFFFFFF;
-		begin++;
-	}
-
-	tab->index -= length;
-	return 0;
-}
-
-/* Make space on the wanted location */
-static int gfar_expand_filer_entries(u32 begin, u32 length,
-				     struct filer_table *tab)
-{
-	if (length == 0 || length + tab->index > MAX_FILER_CACHE_IDX ||
-	    begin > MAX_FILER_CACHE_IDX)
-		return -EINVAL;
-
-	gfar_copy_filer_entries(&(tab->fe[begin + length]), &(tab->fe[begin]),
-				tab->index - length + 1);
-
-	tab->index += length;
-	return 0;
-}
-
-static int gfar_get_next_cluster_start(int start, struct filer_table *tab)
-{
-	for (; (start < tab->index) && (start < MAX_FILER_CACHE_IDX - 1);
-	     start++) {
-		if ((tab->fe[start].ctrl & (RQFCR_AND | RQFCR_CLE)) ==
-		    (RQFCR_AND | RQFCR_CLE))
-			return start;
-	}
-	return -1;
-}
-
-static int gfar_get_next_cluster_end(int start, struct filer_table *tab)
-{
-	for (; (start < tab->index) && (start < MAX_FILER_CACHE_IDX - 1);
-	     start++) {
-		if ((tab->fe[start].ctrl & (RQFCR_AND | RQFCR_CLE)) ==
-		    (RQFCR_CLE))
-			return start;
-	}
-	return -1;
-}
-
-/* Uses hardwares clustering option to reduce
- * the number of filer table entries
- */
-static void gfar_cluster_filer(struct filer_table *tab)
-{
-	s32 i = -1, j, iend, jend;
-
-	while ((i = gfar_get_next_cluster_start(++i, tab)) != -1) {
-		j = i;
-		while ((j = gfar_get_next_cluster_start(++j, tab)) != -1) {
-			/* The cluster entries self and the previous one
-			 * (a mask) must be identical!
-			 */
-			if (tab->fe[i].ctrl != tab->fe[j].ctrl)
-				break;
-			if (tab->fe[i].prop != tab->fe[j].prop)
-				break;
-			if (tab->fe[i - 1].ctrl != tab->fe[j - 1].ctrl)
-				break;
[PATCH 1/3] gianfar: correct filer table writing
From: Jakub Kicinski kubak...@wp.pl

MAX_FILER_IDX is the last usable index. Using less-than will already
guarantee that one entry for catch-all rule will be left, no need
to subtract 1 here.

Signed-off-by: Jakub Kicinski kubak...@wp.pl
---
 drivers/net/ethernet/freescale/gianfar_ethtool.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/freescale/gianfar_ethtool.c b/drivers/net/ethernet/freescale/gianfar_ethtool.c
index 555e461b0cfe..e543d3b01838 100644
--- a/drivers/net/ethernet/freescale/gianfar_ethtool.c
+++ b/drivers/net/ethernet/freescale/gianfar_ethtool.c
@@ -1585,11 +1585,10 @@ static int gfar_write_filer_table(struct gfar_private *priv,
 		return -EBUSY;
 
 	/* Fill regular entries */
-	for (; i < MAX_FILER_IDX - 1 && (tab->fe[i].ctrl | tab->fe[i].prop);
-	     i++)
+	for (; i < MAX_FILER_IDX && (tab->fe[i].ctrl | tab->fe[i].prop); i++)
 		gfar_write_filer(priv, i, tab->fe[i].ctrl, tab->fe[i].prop);
 	/* Fill the rest with fall-troughs */
-	for (; i < MAX_FILER_IDX - 1; i++)
+	for (; i < MAX_FILER_IDX; i++)
 		gfar_write_filer(priv, i, 0x60, 0xFFFFFFFF);
 	/* Last entry must be default accept
 	 * because that's what people expect
--
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
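For readers who want to convince themselves of the bound in the patch above, here is a minimal user-space sketch of the same loop structure. It is not the driver code: MAX_IDX, FALL_THROUGH, DEFAULT_ACCEPT and write_table() are hypothetical stand-ins, and the constant values are placeholders chosen for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-ins; the real driver defines its own constants. */
#define MAX_IDX 7              /* last usable index: the table has MAX_IDX + 1 slots */
#define FALL_THROUGH 0x60u     /* placeholder for "no match, try the next rule" */
#define DEFAULT_ACCEPT 0x20u   /* placeholder for the catch-all rule */

/*
 * Copy up to n_rules rules into hw[], pad the remainder with fall-through
 * entries and reserve the final slot for the catch-all rule.
 * Returns the number of regular rules written.
 */
static size_t write_table(unsigned int hw[MAX_IDX + 1],
                          const unsigned int *rules, size_t n_rules)
{
    size_t i = 0;
    size_t written;

    /* "i < MAX_IDX" already stops one slot short of the last index. */
    for (; i < MAX_IDX && i < n_rules; i++)
        hw[i] = rules[i];
    written = i;

    /* Pad every remaining regular slot. */
    for (; i < MAX_IDX; i++)
        hw[i] = FALL_THROUGH;

    /* i == MAX_IDX here, so the catch-all lands in the last slot. */
    hw[MAX_IDX] = DEFAULT_ACCEPT;
    return written;
}
```

With a bound of MAX_IDX - 1 instead, slot MAX_IDX - 1 would hold neither a rule nor a fall-through entry written by these loops, which is the off-by-one the patch description refers to.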
[PATCH 0/3] gianfar: filer changes
From: Jakub Kicinski kubak...@wp.pl

Hi,

I've been working with the gianfar filer code recently and got some
code to offer. Well, maybe not that much code to offer actually: two
small fixes and removal of the current optimizer.

I'm not sure what your feelings on patch 3 will be. It would be great
to have a working optimizer if someone wants to take this task up but
currently we have a semi-broken one and I vote for killing the beast
entirely before it has a chance to bite more people...

Jakub Kicinski (3):
  gianfar: correct filer table writing
  gianfar: correct list membership accounting
  gianfar: remove faulty filer optimizer

 drivers/net/ethernet/freescale/gianfar_ethtool.c | 345 +--
 1 file changed, 4 insertions(+), 341 deletions(-)

--
2.1.0
Re: [PATCHv2 net-next 5/9] openvswitch: Add conntrack action
On 6 August 2015 at 14:36, Pravin Shelar pshe...@nicira.com wrote:

+static void ovs_fragment(struct vport *vport, struct sk_buff *skb,
+			 unsigned int mru, __be16 ethertype)
+{
+	if (skb_network_offset(skb) > MAX_L2_LEN) {
+		OVS_NLERR(1, "L2 header too long to fragment");
+		return;
+	}
+
+	if (ethertype == htons(ETH_P_IP)) {
+		struct dst_entry ovs_dst;
+
+		prepare_frag(vport, skb);
+		dst_init(&ovs_dst, &ovs_dst_ops, NULL, 1,
+			 DST_OBSOLETE_NONE, DST_NOCOUNT);
+		ovs_dst.dev = vport->dev;
+
+		skb_dst_set_noref(skb, &ovs_dst);
+		IPCB(skb)->frag_max_size = mru;
+
+		ip_do_fragment(skb->sk, skb, ovs_vport_output);
+	} else if (ethertype == htons(ETH_P_IPV6)) {
+		const struct nf_ipv6_ops *v6ops = nf_get_ipv6_ops();
+		struct rt6_info ovs_rt;
+
+		if (!v6ops) {
+			kfree_skb(skb);
+			return;
+		}
+
+		prepare_frag(vport, skb);
+		memset(&ovs_rt, 0, sizeof(ovs_rt));
+		dst_init(&ovs_rt.dst, &ovs_dst_ops, NULL, 1,
+			 DST_OBSOLETE_NONE, DST_NOCOUNT);
+		ovs_rt.dst.dev = vport->dev;
+
+		skb_dst_set_noref(skb, &ovs_rt.dst);
+		IP6CB(skb)->frag_max_size = mru;
+
+		v6ops->fragment(skb->sk, skb, ovs_vport_output);
+	} else {
+		WARN_ONCE(1, "Failed fragment ->%s: eth=%04x, MRU=%d, MTU=%d.",
+			  ovs_vport_name(vport), htons(ethertype), mru,
+			  vport->dev->mtu);
+		kfree_skb(skb);
+	}
+}
+

We also need something similar if this packet is going to userspace so
that we can send the original packets to userspace. Otherwise we would
send a defragmented packet to userspace.

OK, in that case we'll need to get an MTU from somewhere. I'll look at
using the MRU as the MTU for this path, since corner cases where the
MRU is greater than the netlink payload size seem pretty unlikely (and
the netlink sending code should already handle such cases).

The other concern I have is exactly how this should be presented to
userspace. Currently the conntrack action is treated as an implicit
reassembly, which will implicitly refragment on output. In between, it
remains defragmented. If we fragment on miss, then lookup will use the
key representing the defragmented packet (ie no OVS_FRAG_TYPE_* bits
set), so we should send the same up to userspace. I assume that
userspace would then re-parse the packets and see that they are
fragments for representing up to the higher layers like OpenFlow, but
for flow installation it would reuse the key passed up from the kernel.
Is that the model you have in mind?

Right, reassembly is transparent to the conntrack action, so it should
be to userspace. But this means we will need to fragment in upcall and
defrag the skb again when the packet re-enters the kernel module from
the packet execute code path if the action needs to look at the entire
packet. So let's just keep it based on the MRU parameter and we can
enhance it later if we need it.

OK, we'll retain the upcall MRU and keep it assembled for the moment,
so the implicit "conntrack will reassemble" behaviour is retained from
this version. I'll fix up the other issues and send v3, thanks.
Re: [PATCH net-next v2] bridge: netlink: add support for vlan_filtering attribute
From: Nikolay Aleksandrov ra...@blackwall.org
Date: Fri, 7 Aug 2015 19:40:45 +0300

From: Nikolay Aleksandrov niko...@cumulusnetworks.com

This patch adds the ability to toggle the vlan filtering support via
netlink. Since we're already running with rtnl in .changelink() we
don't need to take any additional locks.

Signed-off-by: Nikolay Aleksandrov niko...@cumulusnetworks.com

Applied, thanks.
Re: [PATCH net-next] net: add explicit logging and stat for neighbour table overflow
From: r...@tardy.usa.hp.com (Rick Jones)
Date: Fri, 7 Aug 2015 11:10:37 -0700 (PDT)

From: Rick Jones rick.jon...@hp.com

Add an explicit neighbour table overflow message (ratelimited) and
statistic to make diagnosing neighbour table overflows tractable in
the wild.

Diagnosing a neighbour table overflow can be quite difficult in the
wild because there is no explicit dmesg logged. Callers to neighbour
code seem to use net_dbg_ratelimit when the neighbour call fails which
means the base message is not emitted and the callback suppressed
messages from the ratelimiting can end-up juxtaposed with unrelated
messages. Further, a forced garbage collection will increment a stat
on each call whether it was successful in freeing-up a table entry or
not, so that statistic is only a hint. So, add a net_info_ratelimited
message and explicit statistic to the neighbour code.

Signed-off-by: Rick Jones rick.jon...@hp.com

Looks fine, applied, thanks Rick.
Re: [RFC PATCH 1/4] vhost: Introduce a universal thread to serve all users
Michael S. Tsirkin m...@redhat.com writes:

On Mon, Jul 13, 2015 at 12:07:32AM -0400, Bandan Das wrote:

vhost threads are per-device, but in most cases a single thread
is enough. This change creates a single thread that is used to
serve all guests.

However, this complicates cgroups associations. The current policy
is to attach the per-device thread to all cgroups of the parent process
that the device is associated with. This is no longer possible if we
have a single thread. So, we end up moving the thread around to
cgroups of whichever device that needs servicing. This is a very
inefficient protocol but seems to be the only way to integrate
cgroups support.

Signed-off-by: Razya Ladelsky ra...@il.ibm.com
Signed-off-by: Bandan Das b...@redhat.com

BTW, how does this interact with virtio net MQ?
It would seem that MQ gains from more parallelism and
CPU locality.

Hm.. Good point. As of this version, this design will always have
one worker thread servicing a guest. Now suppose we have 10 virtio
queues for a guest, surely, we could benefit from spawning off another
worker just like we are doing in case of a new guest/device with
the devs_per_worker parameter.

---
 drivers/vhost/scsi.c  |  15 +++--
 drivers/vhost/vhost.c | 150 --
 drivers/vhost/vhost.h |  19 +--
 3 files changed, 97 insertions(+), 87 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index ea32b38..6c42936 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -535,7 +535,7 @@ static void vhost_scsi_complete_cmd(struct vhost_scsi_cmd *cmd)
 
 	llist_add(&cmd->tvc_completion_list, &vs->vs_completion_list);
 
-	vhost_work_queue(&vs->dev, &vs->vs_completion_work);
+	vhost_work_queue(vs->dev.worker, &vs->vs_completion_work);
 }
 
 static int vhost_scsi_queue_data_in(struct se_cmd *se_cmd)
@@ -1282,7 +1282,7 @@ vhost_scsi_send_evt(struct vhost_scsi *vs,
 	}
 
 	llist_add(&evt->list, &vs->vs_event_list);
-	vhost_work_queue(&vs->dev, &vs->vs_event_work);
+	vhost_work_queue(vs->dev.worker, &vs->vs_event_work);
 }
 
 static void vhost_scsi_evt_handle_kick(struct vhost_work *work)
@@ -1335,8 +1335,8 @@ static void vhost_scsi_flush(struct vhost_scsi *vs)
 	/* Flush both the vhost poll and vhost work */
 	for (i = 0; i < VHOST_SCSI_MAX_VQ; i++)
 		vhost_scsi_flush_vq(vs, i);
-	vhost_work_flush(&vs->dev, &vs->vs_completion_work);
-	vhost_work_flush(&vs->dev, &vs->vs_event_work);
+	vhost_work_flush(vs->dev.worker, &vs->vs_completion_work);
+	vhost_work_flush(vs->dev.worker, &vs->vs_event_work);
 
 	/* Wait for all reqs issued before the flush to be finished */
 	for (i = 0; i < VHOST_SCSI_MAX_VQ; i++)
@@ -1584,8 +1584,11 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
 	if (!vqs)
 		goto err_vqs;
 
-	vhost_work_init(&vs->vs_completion_work, vhost_scsi_complete_cmd_work);
-	vhost_work_init(&vs->vs_event_work, vhost_scsi_evt_work);
+	vhost_work_init(&vs->dev, &vs->vs_completion_work,
+			vhost_scsi_complete_cmd_work);
+
+	vhost_work_init(&vs->dev, &vs->vs_event_work,
+			vhost_scsi_evt_work);
 
 	vs->vs_events_nr = 0;
 	vs->vs_events_missed = false;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 2ee2826..951c96b 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -11,6 +11,8 @@
  * Generic code for virtio server in host kernel.
  */
 
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
 #include <linux/eventfd.h>
 #include <linux/vhost.h>
 #include <linux/uio.h>
@@ -28,6 +30,9 @@
 
 #include "vhost.h"
 
+/* Just one worker thread to service all devices */
+static struct vhost_worker *worker;
+
 enum {
 	VHOST_MEMORY_MAX_NREGIONS = 64,
 	VHOST_MEMORY_F_LOG = 0x1,
@@ -58,13 +63,15 @@ static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
 	return 0;
 }
 
-void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn)
+void vhost_work_init(struct vhost_dev *dev,
+		     struct vhost_work *work, vhost_work_fn_t fn)
 {
 	INIT_LIST_HEAD(&work->node);
 	work->fn = fn;
 	init_waitqueue_head(&work->done);
 	work->flushing = 0;
 	work->queue_seq = work->done_seq = 0;
+	work->dev = dev;
 }
 EXPORT_SYMBOL_GPL(vhost_work_init);
 
@@ -78,7 +85,7 @@ void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
 	poll->dev = dev;
 	poll->wqh = NULL;
 
-	vhost_work_init(&poll->work, fn);
+	vhost_work_init(dev, &poll->work, fn);
 }
 EXPORT_SYMBOL_GPL(vhost_poll_init);
 
@@ -116,30 +123,30 @@ void vhost_poll_stop(struct vhost_poll *poll)
 }
 EXPORT_SYMBOL_GPL(vhost_poll_stop);
 
-static bool vhost_work_seq_done(struct vhost_dev *dev, struct vhost_work *work,
-				unsigned seq)
+static bool vhost_work_seq_done(struct vhost_worker *worker,
+				struct
Re: [RFC PATCH 1/4] vhost: Introduce a universal thread to serve all users
Bandan Das b...@redhat.com writes:

Michael S. Tsirkin m...@redhat.com writes:

On Mon, Jul 13, 2015 at 12:07:32AM -0400, Bandan Das wrote:

vhost threads are per-device, but in most cases a single thread
is enough. This change creates a single thread that is used to
serve all guests.

However, this complicates cgroups associations. The current policy
is to attach the per-device thread to all cgroups of the parent process
that the device is associated with. This is no longer possible if we
have a single thread. So, we end up moving the thread around to
cgroups of whichever device that needs servicing. This is a very
inefficient protocol but seems to be the only way to integrate
cgroups support.

Signed-off-by: Razya Ladelsky ra...@il.ibm.com
Signed-off-by: Bandan Das b...@redhat.com

BTW, how does this interact with virtio net MQ?
It would seem that MQ gains from more parallelism and
CPU locality.

Hm.. Good point. As of this version, this design will always have
one worker thread servicing a guest. Now suppose we have 10 virtio
queues for a guest, surely, we could benefit from spawning off another
worker just like we are doing in case of a new guest/device with
the devs_per_worker parameter.

So, I did a quick smoke test with virtio-net and the Elvis patches.
virtio net MQ already spawns a new worker thread for every queue, it
seems? So, the above setup already works! :) I will run some tests and
post back the results.
Re: [PATCH] eventfd: implementation of EFD_MASK flag
On 2015-08-10 10:57, Damian Hobson-Garcia wrote:

Hi Martin,

Thanks for your comments.

On 2015-08-10 3:39 PM, Martin Sustrik wrote:

On 2015-08-10 08:23, Damian Hobson-Garcia wrote:

Replying to my own post, but I had the following comments/questions.
Martin, if you have any response to my comments I would be very happy
to hear them.

On 2015-08-10 2:51 PM, Damian Hobson-Garcia wrote:

From: Martin Sustrik sust...@250bpm.com

[snip]

write(2):

User is allowed to write only buffers containing the following
structure:

struct efd_mask {
	__u32 events;
	__u64 data;
};

The value of 'events' should be any combination of event flags as
defined by poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.)
Specified events will be signaled when polling (select, poll, epoll)
on the eventfd is done later on. 'data' is opaque data that are not
interpreted by eventfd object.

I'm not fully clear on the purpose that the 'data' member serves. Does
this opaque handle need to be tied together with this event
synchronization construct?

It's a convenience thing. Imagine you are implementing your own file
descriptor type in user space. You create an EFD_MASK socket and a
structure that will hold any state that you need for the socket (tx/rx
buffers and such).

Now you have two things to pass around. If you want to pass the fd to
a function, it must have two parameters (fd and pointer to the
structure).

To fix it you can put the fd into the structure. That way there's only
one thing to pass around (the structure).

The problem with that approach is when you have generic code that
deals with file descriptors. For example, a simple poller which accepts
a list of (fd, callback) pairs and invokes the callback when one of the
fds signals POLLIN. You can't send a pointer to a structure to such
function. All you can send is the fd, but then, when the callback is
invoked, fd is all you have. You have no idea where your state is.

'data' member allows you to put the pointer to the state into the
socket itself. Thus, if you have a fd, you can always find out where
the associated data is by reading the mask structure from the fd.

Ok, I see what you're saying. I guess that keeping track of the mapping
between the fd and the struct in user space could be non-trivial if
there are a large number of active fds that are polling very
frequently. Wouldn't it be sufficient to just use epoll() in this case
though? It already seems to support this kind of thing.

My use case was like this:

int s = mysocket();
...
// myrecv() can get the pointer to the structure
// without user having to pass it as an argument
myrecv(s, buf, sizeof(buf));

However, same behaviour can be accomplished by simply keeping a static
array of pointers in the user space.

So let's cut this part out of the patch.

[snip]

@@ -55,6 +69,9 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
 {
+	/* This function should never be used with eventfd in the mask mode. */
+	BUG_ON(ctx->flags & EFD_MASK);
+
...
@@ -158,6 +180,9 @@ int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_t *wait,
 {
+	/* This function should never be used with eventfd in the mask mode. */
+	BUG_ON(ctx->flags & EFD_MASK);
+
...
@@ -188,6 +213,9 @@ ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, __u64 *cnt)
+	/* This function should never be used with eventfd in the mask mode. */
+	BUG_ON(ctx->flags & EFD_MASK);
+

If eventfd_ctx_fileget() returns EINVAL when EFD_MASK is set, I don't
think that there will be a way to call these functions in the mask
mode, so it should be possible to get rid of the BUG_ON checks.

Sure. Feel free to do so.

[snip]

@@ -230,6 +258,19 @@ static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
 	ssize_t res;
 	__u64 cnt;
 
+	if (ctx->flags & EFD_MASK) {
+		struct efd_mask mask;
+
+		if (count < sizeof(mask))
+			return -EINVAL;
+		spin_lock_irq(&ctx->wqh.lock);
+		mask = ctx->mask;
+		spin_unlock_irq(&ctx->wqh.lock);
+		if (copy_to_user(buf, &mask, sizeof(mask)))
+			return -EFAULT;
+		return sizeof(mask);
+	}
+

For the other eventfd modes, reading the value will update the internal
state of the eventfd (either clearing or decrementing the counter).
Should something similar be done here? I'm thinking of a case where a
process is polling on this fd in a loop. Clearing the efd_mask data on
read should provide an easy way for the polling process to know if it
is seeing new poll events.

No. In this case reading the value has no effect on the state of the
fd. How it should work is rather:

// fd is in POLLIN state
poll(fd);
// function exits with POLLIN but fd remains in POLLIN state
my_recv(fd, buf, size);
// my_recv function has found out that there's no more data to recv
// and switched off the POLLIN flag
poll(fd);
// we block here waiting for more data to arrive from the network

How
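The user-space alternative mentioned earlier in this message (a static array of pointers, indexed by fd) could look roughly like the sketch below. Everything here is hypothetical: MAX_FDS, struct mysock_state and the fd_attach()/fd_lookup()/fd_detach() helpers are names invented for illustration and are not part of the patch or of any existing API.

```c
#include <stddef.h>

#define MAX_FDS 1024   /* hypothetical upper bound on descriptor values */

/* Hypothetical per-socket state for a user-space "socket" built on an fd. */
struct mysock_state {
    char rx_buf[256];
    size_t rx_len;
};

/* Static table: slot i holds the state registered for descriptor i. */
static struct mysock_state *fd_table[MAX_FDS];

/* Associate state with fd; returns 0 on success, -1 if fd is out of range. */
static int fd_attach(int fd, struct mysock_state *st)
{
    if (fd < 0 || fd >= MAX_FDS)
        return -1;
    fd_table[fd] = st;
    return 0;
}

/* Find the state again given only the fd, e.g. inside a poller callback. */
static struct mysock_state *fd_lookup(int fd)
{
    if (fd < 0 || fd >= MAX_FDS)
        return NULL;
    return fd_table[fd];
}

/* Forget the association, typically just before close(fd). */
static void fd_detach(int fd)
{
    if (fd >= 0 && fd < MAX_FDS)
        fd_table[fd] = NULL;
}
```

A function such as myrecv(s, buf, len) would then start with fd_lookup(s), which is why the 'data' member is a convenience rather than a necessity. The sketch ignores locking, so a multi-threaded program would need to protect the table.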
Re: [PATCH net-next] net: dsa: mv88e6352: Use mnemonics for EEPROM registers and bits
From: Andrew Lunn and...@lunn.ch
Date: Sat, 8 Aug 2015 17:04:50 +0200

Add register definitions #defines for accessing the EEPROM.

Signed-off-by: Andrew Lunn and...@lunn.ch

Applied, thanks.
[patch] hamradio/kiss: missing error code in mkiss_open()
If register_netdev() fails we return success but we should return an
error code instead.

Reported-by: RUC_Soft_Sec zy900...@163.com
Signed-off-by: Dan Carpenter dan.carpen...@oracle.com

diff --git a/drivers/net/hamradio/mkiss.c b/drivers/net/hamradio/mkiss.c
index 2ffbf13..dcb6bb7 100644
--- a/drivers/net/hamradio/mkiss.c
+++ b/drivers/net/hamradio/mkiss.c
@@ -732,7 +732,8 @@ static int mkiss_open(struct tty_struct *tty)
 		goto out_free_netdev;
 	}
 
-	if (register_netdev(dev))
+	err = register_netdev(dev);
+	if (err)
 		goto out_free_buffers;
 
 	/* after register_netdev() - because else printk smashes the kernel */
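The bug pattern fixed above is easy to reproduce outside the kernel. The following is a minimal user-space sketch, not the mkiss code: fake_register(), open_buggy() and open_fixed() are hypothetical names, and the toy frees its buffer on both paths only to keep the example short.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical registration step; returns a negative errno when told to fail. */
static int fake_register(int should_fail)
{
    return should_fail ? -ENODEV : 0;
}

/* Buggy shape: the result is only tested, so err is still 0 at the label. */
static int open_buggy(int should_fail)
{
    int err = 0;
    char *buf = malloc(16);

    if (!buf)
        return -ENOMEM;
    if (fake_register(should_fail))
        goto out_free;          /* err was never assigned */
out_free:
    free(buf);
    return err;                 /* reports success even on failure */
}

/* Fixed shape: capture the return value, then branch on it. */
static int open_fixed(int should_fail)
{
    int err;
    char *buf = malloc(16);

    if (!buf)
        return -ENOMEM;
    err = fake_register(should_fail);
    if (err)
        goto out_free;
    /* ... real code would go on to use buf here ... */
out_free:
    free(buf);
    return err;
}
```

Calling open_buggy(1) returns 0 although the registration step failed, while open_fixed(1) propagates -ENODEV to the caller.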
Re: kernel warning in tcp_fragment
Hi Neal,

Many thanks for your reply; we will arrange testing against that patch.

On Mon, Aug 10, 2015 at 11:35 AM, Neal Cardwell ncardw...@google.com wrote:

On Mon, Aug 10, 2015 at 2:10 PM, Jovi Zhangwei j...@cloudflare.com wrote:

Ping?

We saw a lot of these warnings in our production system. It would be
greatly appreciated if someone could give us a fix for these
warnings. :)

What is your net.ipv4.tcp_mtu_probing setting? If 1, have you tried
setting it to 0? Previous reports
( https://patchwork.ozlabs.org/patch/480882/ ) have shown that this
gets rid of at least one source of the warning. So that would provide
a useful data point.

Separately, you could also try the attached patch. This is against
3.14.39. It tries to attack a different possible source of this
warning. Please let us know if that patch helps.

Thanks!
neal
Re: [RFC PATCH net-next] tcp: reduce cpu usage under tcp memory pressure when SO_SNDBUF is set
On Mon, 2015-08-10 at 13:29 -0400, Jason Baron wrote:

thanks. better?

--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -798,8 +798,10 @@ static inline int sk_stream_min_wspace(const struct sock *sk)
 
 static inline int sk_stream_wspace(const struct sock *sk)
 {
-	if (sk->sk_effective_sndbuf)
-		return sk->sk_effective_sndbuf - sk->sk_wmem_queued;
+	int effective_sndbuf = sk->sk_effective_sndbuf;
+
+	if (effective_sndbuf)
+		return effective_sndbuf - sk->sk_wmem_queued;
 
 	return sk->sk_sndbuf - sk->sk_wmem_queued;
 }

You need to use instead :

	int effective_sndbuf = READ_ONCE(sk->sk_effective_sndbuf);
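To illustrate the point of the READ_ONCE() remark above, here is a small user-space sketch. It assumes a simplified stand-in macro, READ_ONCE_INT(), built on a volatile access; the kernel's real READ_ONCE() is more general. struct toy_sk and toy_stream_wspace() are likewise hypothetical and only mirror the three fields under discussion.

```c
/* User-space stand-in for the kernel's READ_ONCE(); an assumption of this sketch. */
#define READ_ONCE_INT(x) (*(const volatile int *)&(x))

/* Hypothetical cut-down socket with only the fields the discussion mentions. */
struct toy_sk {
    int sndbuf;
    int effective_sndbuf;   /* may be changed concurrently by another context */
    int wmem_queued;
};

static int toy_stream_wspace(const struct toy_sk *sk)
{
    /* Load the shared field exactly once; test and use the same snapshot. */
    int effective_sndbuf = READ_ONCE_INT(sk->effective_sndbuf);

    if (effective_sndbuf)
        return effective_sndbuf - sk->wmem_queued;

    return sk->sndbuf - sk->wmem_queued;
}
```

Without the forced single load, a compiler is free to read the field once for the test and again for the subtraction. If another context zeroes the field in between, the function would return -wmem_queued rather than falling back to sndbuf.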
Re: [PATCH net-next] dsa: Support multiple MDIO busses
From: Andrew Lunn and...@lunn.ch
Date: Sat, 8 Aug 2015 17:09:14 +0200

When using a cluster of switches, some topologies will have an MDIO
bus per switch, not one for the whole cluster. Allow this to be
represented in the device tree, by adding an optional mii-bus property
at the switch level. The old platform_device method of instantiation
supports this already, so only the device tree binding needs extending
with an additional optional phandle.

Signed-off-by: Andrew Lunn and...@lunn.ch

Also applied, thanks Andrew.
Re: [patch 2/2] cxgb4: cleanup some indenting
From: Dan Carpenter dan.carpen...@oracle.com
Date: Sat, 8 Aug 2015 22:15:59 +0300

Add or remove some tabs so that statements line up correctly.

Signed-off-by: Dan Carpenter dan.carpen...@oracle.com

Applied to 'net-next', thanks Dan.
Re: [patch 1/2 -mainline] cxgb4: missing curly braces in t4_setup_debugfs()
From: Dan Carpenter dan.carpen...@oracle.com
Date: Sat, 8 Aug 2015 22:15:25 +0300

There were missing curly braces so it means we call add_debugfs_mem()
unintentionally.

Fixes: 3ccc6cf74d8c ('cxgb4: Adds support for T6 adapter')
Signed-off-by: Dan Carpenter dan.carpen...@oracle.com

Applied to 'net'.
Re: [PATCH] isdn:remove reverse_bits(), use revbit8()
From: yalin wang yalin.wang2...@gmail.com
Date: Mon, 10 Aug 2015 17:15:57 +0800

This change isdn driver, remove reverse_bits() function, use the
generic revbit8() function instead.

Signed-off-by: yalin wang yalin.wang2...@gmail.com

Applied, however please format your Subject lines better in the future.

There should be a space after the subsystem specifier and the ':'
character. So "isdn: ".

Then you should capitalize the description in the Subject line because
it is very much like an English sentence.
ipv6_mc_check_mld - kernel BUG at net/core/skbuff.c:1128
Hi folks, Here is a crash that I am able to easily reproduce. The setup is: 2 VMs, running in libvirt (qemu-kvm) CPU mode is host-passthrough, virtio drivers used wherever available Disable ipv6 (just to limit the amount of multicast noise) Set up a multicast vxlan tunnel between the two VMs Attach the vxlan device to a linux bridge Attach a veth pair to the linux bridge Enable ipv6 on a single veth At this point, either one of the VMs may crash with the attached trace Here is the test script. Not all lines are necessary, some are a byproduct of eliminating various functions from the trace to eliminate them as suspects. --- rmmod ebtable_nat rmmod ebtables sysctl net.ipv6.conf.all.disable_ipv6=1 ip l add vxlan0 type vxlan id 1 group 239.1.1.1 dev eth1 ip l add br0 type bridge ip l set vxlan0 master br0 ip l set br0 up ip l set vxlan0 up ip l add v1a type veth peer name v1b ip l set v1b master br0 ip l set v1b up ip l set v1a up sysctl net.ipv6.conf.v1a.disable_ipv6=0 Doing some code reading with Alexei, we found a suspect commit, which introduces an skb_get and skb_may_pull of the same skb, which leads to the BUG when skb-len == len. 9afd85c9e4552 net: Export IGMP/MLD message validation code static struct sk_buff *skb_checksum_maybe_trim(struct sk_buff *skb, unsigned int transport_len) ... if (skb-len len) { kfree_skb(skb); return NULL; } else if (skb-len == len) { return skb; } ... static int __ipv6_mc_check_mld(struct sk_buff *skb, struct sk_buff **skb_trimmed) ... skb_get(skb); skb_chk = skb_checksum_trimmed(skb, transport_len, ipv6_mc_validate_checksum); Would someone more familiar with the code be able to suggest a viable solution or patch to try? Cheers, Brenden Apologies for some of the mangled text: [ 100.879047] [ cut here ] [ 100.879105] kernel BUG at net/core/skbuff.c:1128! 
[ 100.879144] invalid opcode: [#1] [ 100.879250] Modules linked in: veth bridge stp llc vxlan ip6_udp_tunnel udp_tunnel ip6table_filter ip6_tables iptable_filter ip_tables x_tables netconsole configfs btrfs ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr openvswitch iscsi_tcp libiscsi_tcp libiscsi xor scsi_transport_iscsi libcrc32c raid6_pq dm_crypt iosf_mbi kvm_intel kvm ppdev dm_multipath crct10dif_pclmul scsi_dh crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd psmouse input_leds serio_raw floppy 8250_fintek i2c_piix4 parport_pc pata_acpi mac_hid lp parport virtio_scsi [last unloaded: ebtables] [ 100.881340] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.2.0-rc4+ #3 [ 100.881375] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150617_082717-anatol 04/01/2014 [ 100.881416] task: 88013abca940 ti: 88013abdc000 task.ti: 88013abdc000 [ 100.881457] RIP: 0010:[8168d3d7] [8168d3d7] pskb_expand_head+0x227/0x260 [ 100.881532] RSP: 0018:88013fd03ab8 EFLAGS: 00010202 [ 100.881567] RAX: 0002 RBX: 8800bb601500 RCX: 0020 [ 100.881604] RDX: 0148 RSI: RDI: 8800bb601500 [ 100.881642] RBP: 88013fd03af8 R08: R09: 001c [ 100.881677] R10: R11: 0001 R12: [ 100.881714] R13: 8800bb601500 R14: 8800bb358840 R15: [ 100.881749] FS: () GS:88013fd0() knlGS: [ 100.881790] CS: 0010 DS: ES: CR0: 80050033 [ 100.881828] CR2: 7f91049f2162 CR3: 000136a94000 CR4: 001406e0 [ 100.881864] DR0: DR1: DR2: [ 100.881902] DR3: DR6: fffe0ff0 DR7: 0400 [ 100.881936] Stack: [ 100.881977] 88013fd03b38 816cd766 88013fd03b67 8800bb601500 [ 100.882149] 88013fd03be0 0008 8800bb358840 [ 100.882316] 88013fd03b48 8168e68f 88013fd03b48 00088168f4d0 [ 100.882486] Call Trace: [ 100.882524] IRQ 100.882524] IRQ ace: 03b48 DR6: fffe0ff0 DR7: 0400 00 2717-anatol 04/01/2014 _netfilter if you need this. may change behavior in the future. 
rser 31 c0 87 87 b0 01 00 00 f7 tpm br_netfilter e1000e dw_dmac i2c_hid dw_dmac_core wmi video bridge 8250_dw gpio_lynxpoint i2c_designware_platform ptp mei_me stp i2c_designware_core pps_core llc acpi_pad mei i2c_core shpchp spi_pxa2xx_platform processor button xt_addrtype nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ip6table_filter ip6_tables nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack iptable_filter sch_fq_codel nfsd nfs auth_rpcgss fscache oid_registry nfs_acl lockd grace sunrpc ip_tables
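The failure mode described in the report can be modelled in a few lines of user-space C. This is only a sketch under stated assumptions: `struct mock_skb` and the `mock_*` helpers are hypothetical stand-ins, not kernel APIs, and the model assumes the BUG at net/core/skbuff.c:1128 is the check in pskb_expand_head() that refuses to reallocate the head of a shared skb. It also omits the kfree_skb() call on the short-packet branch.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct sk_buff: a length and a reference count. */
struct mock_skb {
	unsigned int len;
	int users;
};

/* Models skb_get(): take one more reference on the same buffer. */
static struct mock_skb *mock_skb_get(struct mock_skb *skb)
{
	skb->users++;
	return skb;
}

/* Models the quoted branch: when the length already matches, the very
 * same (now shared) buffer is handed back instead of a private copy.
 */
static struct mock_skb *mock_maybe_trim(struct mock_skb *skb, unsigned int len)
{
	struct mock_skb *copy;

	if (skb->len < len)
		return NULL;
	if (skb->len == len)
		return skb;

	copy = malloc(sizeof(*copy));
	if (!copy)
		return NULL;
	copy->len = len;
	copy->users = 1;
	return copy;
}

/* Assumption: expanding the head is only legal for an unshared buffer. */
static bool mock_may_expand(const struct mock_skb *skb)
{
	return skb->users == 1;
}
```

With equal lengths the caller ends up holding a buffer whose reference count is 2, so any later operation that needs to expand the head would trip the shared-buffer check. With a longer packet the helper returns a private copy and the problem does not arise.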
Re: [PATCH net-next v5 0/4] GRE: Use flow based tunneling for OVS GRE vport.
From: Pravin B Shelar pshe...@nicira.com
Date: Fri, 7 Aug 2015 23:49:50 -0700

The following patches make use of the new GRE tunnel metadata collection feature. This allows us to directly use the netdev based GRE tunnel implementation. While doing so I have removed the GRE demux API, which was targeted at OVS. Most of the GRE protocol code is now consolidated in the ip_gre module.

v5-v4: Fixed Kconfig dependency for vport-gre module.
v3-v4: Added interface to ip-gre device to enable meta data collection. While doing this I split the second patch into two patches.
v2-v3: Add API to create GRE flow based device.

Series applied, thanks.

--
To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net 0/2] bnx2x: small fixes
From: Yuval Mintz yuval.mi...@qlogic.com
Date: Mon, 10 Aug 2015 12:49:34 +0300

This adds 2 small fixes, one to error flows during memory release and the other to flash writes via ethtool API.

Series applied, thanks.
[PATCH net] inet: fix possible request socket leak
From: Eric Dumazet eduma...@google.com

In commit b357a364c57c9 (inet: fix possible panic in reqsk_queue_unlink()), I missed the fact that tcp_check_req() can return the listener socket in one case, and that we must release the request socket refcount or we leak it.

Tested:

The following packetdrill test template shows the issue:

0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

+0 < S 0:0(0) win 2920 <mss 1460,sackOK,nop,nop>
+0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
+.002 < . 1:1(0) ack 21 win 2920
+0 > R 21:21(0)

Fixes: b357a364c57c9 (inet: fix possible panic in reqsk_queue_unlink())
Signed-off-by: Eric Dumazet eduma...@google.com
---
 net/ipv4/tcp_ipv4.c |    2 +-
 net/ipv6/tcp_ipv6.c |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index d7d4c2b79cf2..0ea2e1c5d395 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1348,7 +1348,7 @@ static struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
 	req = inet_csk_search_req(sk, th->source, iph->saddr, iph->daddr);
 	if (req) {
 		nsk = tcp_check_req(sk, skb, req, false);
-		if (!nsk)
+		if (!nsk || nsk == sk)
 			reqsk_put(req);
 		return nsk;
 	}
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 6748c4277aff..7a6cea5e4274 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -943,7 +943,7 @@ static struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
 				   &ipv6_hdr(skb)->daddr, tcp_v6_iif(skb));
 	if (req) {
 		nsk = tcp_check_req(sk, skb, req, false);
-		if (!nsk)
+		if (!nsk || nsk == sk)
 			reqsk_put(req);
 		return nsk;
 	}
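The reference-counting rule behind this fix can be illustrated with a small user-space model. It is a sketch, not the kernel code: the `mock_*` names and the `enum check_result` are hypothetical, and the model assumes only what the commit message states, namely that the lookup takes a reference on the request and that the check can return nothing, the listener itself, or a different socket. What happens to the reference when a child socket is returned is deliberately not modelled.

```c
#include <stddef.h>

struct mock_sock { int id; };
struct mock_req { int refcnt; };

/* Hypothetical outcomes of the request check. */
enum check_result { CHECK_DROP, CHECK_LISTENER, CHECK_CHILD };

static void mock_reqsk_put(struct mock_req *req)
{
	req->refcnt--;
}

static struct mock_sock *mock_check_req(struct mock_sock *listener,
					struct mock_sock *child,
					enum check_result outcome)
{
	switch (outcome) {
	case CHECK_LISTENER:
		return listener;
	case CHECK_CHILD:
		return child;
	default:
		return NULL;
	}
}

/* Mirrors the patched logic: drop the lookup reference both when the
 * check fails and when it hands back the listener.
 */
static struct mock_sock *mock_hnd_req(struct mock_sock *listener,
				      struct mock_sock *child,
				      struct mock_req *req,
				      enum check_result outcome)
{
	struct mock_sock *nsk;

	req->refcnt++;		/* reference taken by the lookup */
	nsk = mock_check_req(listener, child, outcome);
	if (!nsk || nsk == listener)
		mock_reqsk_put(req);
	return nsk;
}
```

Before the patch, the listener case would leave the count one higher than it started on every such packet, which is the leak the commit message describes.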
[PATCH] rt2x00: adjust EEPROM_SIZE for rt2500usb
rt2500usb_validate_eeprom() reads data up to 0x6e (EEPROM_CALIBRATE_OFFSET), but only 0x6a bytes have been allocated and read from the EEPROM. This leads to out-of-bounds accesses and invalid values for EEPROM_BBPTUNE_R17 and EEPROM_CALIBRATE_OFFSET.

Change EEPROM_SIZE to 0x6e in order to retrieve all the fields.

Tested with a rt2570 device.

Signed-off-by: Adrien Schildknecht adrien+...@schischi.me
---
 drivers/net/wireless/rt2x00/rt2500usb.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/wireless/rt2x00/rt2500usb.h b/drivers/net/wireless/rt2x00/rt2500usb.h
index afba073..78cc035 100644
--- a/drivers/net/wireless/rt2x00/rt2500usb.h
+++ b/drivers/net/wireless/rt2x00/rt2500usb.h
@@ -54,7 +54,7 @@
 #define CSR_REG_BASE 0x0400
 #define CSR_REG_SIZE 0x0100
 #define EEPROM_BASE 0x
-#define EEPROM_SIZE 0x006a
+#define EEPROM_SIZE 0x006e
 #define BBP_BASE 0x
 #define BBP_SIZE 0x0060
 #define RF_BASE 0x0004
--
2.5.0
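The arithmetic behind the size change can be checked with a short sketch. It assumes the EEPROM is addressed as 16-bit words, so word index w occupies bytes 2w and 2w+1; the helper name and the word index 0x36 are illustrative, derived only from the 0x6e end offset quoted above, and are not taken from the driver headers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: does 16-bit word `word` lie fully inside a
 * buffer of `size_bytes` bytes?
 */
static bool eeprom_word_in_bounds(size_t word, size_t size_bytes)
{
	size_t end = (word + 1) * sizeof(uint16_t);	/* one past the last byte */

	return end <= size_bytes;
}
```

Under these assumptions, a word ending at byte offset 0x6e is out of bounds for a 0x6a-byte buffer and in bounds once the buffer is 0x6e bytes long.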
[PATCH] xfrm: Add oif to dst lookups
Rules can be installed that direct route lookups to specific tables based on oif. Plumb the oif through the xfrm lookups so it gets set in the flow struct and passed to the resolver routines.

Signed-off-by: David Ahern d...@cumulusnetworks.com
---
 include/net/xfrm.h      |  7 +--
 net/ipv4/xfrm4_policy.c | 11 ++-
 net/ipv6/xfrm6_policy.c |  7 ---
 net/xfrm/xfrm_policy.c  | 24 ++--
 4 files changed, 29 insertions(+), 20 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index f0ee97eec24d..312e3fee9ccf 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -285,10 +285,13 @@ struct xfrm_policy_afinfo {
 	unsigned short		family;
 	struct dst_ops		*dst_ops;
 	void			(*garbage_collect)(struct net *net);
-	struct dst_entry	*(*dst_lookup)(struct net *net, int tos,
+	struct dst_entry	*(*dst_lookup)(struct net *net,
+					       int tos, int oif,
 					       const xfrm_address_t *saddr,
 					       const xfrm_address_t *daddr);
-	int			(*get_saddr)(struct net *net, xfrm_address_t *saddr, xfrm_address_t *daddr);
+	int			(*get_saddr)(struct net *net, int oif,
+					     xfrm_address_t *saddr,
+					     xfrm_address_t *daddr);
 	void			(*decode_session)(struct sk_buff *skb,
 						  struct flowi *fl,
 						  int reverse);
diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c
index bff69746e05f..55b3c0f4dde5 100644
--- a/net/ipv4/xfrm4_policy.c
+++ b/net/ipv4/xfrm4_policy.c
@@ -19,7 +19,7 @@
 static struct xfrm_policy_afinfo xfrm4_policy_afinfo;
 
 static struct dst_entry *__xfrm4_dst_lookup(struct net *net, struct flowi4 *fl4,
-					    int tos,
+					    int tos, int oif,
 					    const xfrm_address_t *saddr,
 					    const xfrm_address_t *daddr)
 {
@@ -28,6 +28,7 @@ static struct dst_entry *__xfrm4_dst_lookup(struct net *net, struct flowi4 *fl4,
 	memset(fl4, 0, sizeof(*fl4));
 	fl4->daddr = daddr->a4;
 	fl4->flowi4_tos = tos;
+	fl4->flowi4_oif = oif;
 	if (saddr)
 		fl4->saddr = saddr->a4;
 
@@ -38,22 +39,22 @@ static struct dst_entry *__xfrm4_dst_lookup(struct net *net, struct flowi4 *fl4,
 	return ERR_CAST(rt);
 }
 
-static struct dst_entry *xfrm4_dst_lookup(struct net *net, int tos,
+static struct dst_entry *xfrm4_dst_lookup(struct net *net, int tos, int oif,
 					  const xfrm_address_t *saddr,
 					  const xfrm_address_t *daddr)
 {
 	struct flowi4 fl4;
 
-	return __xfrm4_dst_lookup(net, &fl4, tos, saddr, daddr);
+	return __xfrm4_dst_lookup(net, &fl4, tos, oif, saddr, daddr);
 }
 
-static int xfrm4_get_saddr(struct net *net,
+static int xfrm4_get_saddr(struct net *net, int oif,
 			   xfrm_address_t *saddr, xfrm_address_t *daddr)
 {
 	struct dst_entry *dst;
 	struct flowi4 fl4;
 
-	dst = __xfrm4_dst_lookup(net, &fl4, 0, NULL, daddr);
+	dst = __xfrm4_dst_lookup(net, &fl4, 0, oif, NULL, daddr);
 	if (IS_ERR(dst))
 		return -EHOSTUNREACH;
 
diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index ed0583c1b9fc..a74013d3eceb 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -26,7 +26,7 @@
 
 static struct xfrm_policy_afinfo xfrm6_policy_afinfo;
 
-static struct dst_entry *xfrm6_dst_lookup(struct net *net, int tos,
+static struct dst_entry *xfrm6_dst_lookup(struct net *net, int tos, int oif,
 					  const xfrm_address_t *saddr,
 					  const xfrm_address_t *daddr)
 {
@@ -35,6 +35,7 @@ static struct dst_entry *xfrm6_dst_lookup(struct net *net, int tos,
 	int err;
 
 	memset(&fl6, 0, sizeof(fl6));
+	fl6.flowi6_oif = oif;
 	memcpy(&fl6.daddr, daddr, sizeof(fl6.daddr));
 	if (saddr)
 		memcpy(&fl6.saddr, saddr, sizeof(fl6.saddr));
@@ -50,13 +51,13 @@ static struct dst_entry *xfrm6_dst_lookup(struct net *net, int tos,
 	return dst;
 }
 
-static int xfrm6_get_saddr(struct net *net,
+static int xfrm6_get_saddr(struct net *net, int oif,
 			   xfrm_address_t *saddr, xfrm_address_t *daddr)
 {
 	struct dst_entry *dst;
 	struct net_device *dev;
 
-	dst = xfrm6_dst_lookup(net, 0, NULL, daddr);
+	dst = xfrm6_dst_lookup(net, 0, oif, NULL, daddr);
 	if (IS_ERR(dst))
 		return -EHOSTUNREACH;
 
diff --git a/net/xfrm/xfrm_policy.c
VxLAN support question
Hi VxLAN experts,

In user space, we are developing a CLI like the following:

Interface tunnel 100
  Mode vxlan
  Remote ip ipv4 19.1.1.1
  Local ip ipv4 20.1.1.1
  Vni 1-1000

With kernel 3.12.37, we can't support the above configuration in the kernel (please correct me if I am wrong). Noticing that VxLAN support has been actively worked on, I am hoping that a recent kernel now supports the functionality above.

What I want is for the kernel to have about 1K interfaces (something like tunnel100.1 to tunnel100.1000) created and attached to 1K bridge domains, with each VNI associated with one bridge domain (the VNI to bridge-domain mapping will be assigned using other CLIs).

Thanks,
Andrew
RE: [v4, 0/9] Freescale DPAA FMan
Hello David,

Thank you for your feedback. I understand your concerns regarding the FMan driver. We've come a long way from where we started, but there are still issues. Community support is critical for getting the code to the desired quality level, and I appreciate the support I receive from you and from the other previous reviewers.

In order to reduce the code scattering, I plan to put all the code for a given IP block in one file. For example, the FMan port in its current state in /drivers/net/freescale/fman/:

flib (directory)
    fsl_fman_port.h
inc (directory)
    fm_port_ext.h (API for other drivers/modules)
port (directory)
    fman_port.c (flib)
    fm_port.c
    fm_port.h
    Makefile
fm_port_drv.c (file)

New proposed structure in /drivers/net/freescale/fman/:

fman_port_drv.c (includes simplified code from fm_port.c, fman_port.c and fm_port_drv.c)
fman_port_drv.h (exported structures and API, minimal)

Of course, I'll do the same for the other modules (MAC, FMan itself). After this structure change we get:
- Subdirectories completely removed
- Layering reduced: each module becomes much flatter, with one source file and one header file
- Fewer files (sources and headers)
- Namespace pollution drastically reduced
- General complexity of the driver reduced

I would appreciate your comments about the steps described above.
Regards,
Igal

-----Original Message-----
From: David Miller [mailto:da...@davemloft.net]
Sent: Saturday, August 08, 2015 1:31 AM
To: Liberman Igal-B31950 igal.liber...@freescale.com
Cc: netdev@vger.kernel.org; linuxppc-...@lists.ozlabs.org; linux-ker...@vger.kernel.org; Wood Scott-B07421 scottw...@freescale.com; Bucur Madalin-Cristian-B32716 madalin.bu...@freescale.com; pebo...@tiscali.nl; joakim.tjernl...@transmode.se; p...@mindchasers.com; step...@networkplumber.org
Subject: Re: [v4, 0/9] Freescale DPAA FMan

From: igal.liber...@freescale.com
Date: Wed, 5 Aug 2015 12:25:16 +0300

The Freescale Data Path Acceleration Architecture (DPAA) is a set of hardware components on specific QorIQ multicore processors. This architecture provides the infrastructure to support simplified sharing of networking interfaces and accelerators by multiple CPU cores and the accelerators.

I think the directory and code structure of this new driver is quite excessive.

Because you've split things up _so_ much, you have to have all of these directories, and even worse and much more important to me you have to export so many functions from one source file to another.

I think this is way too much.

For example, in one file you have a bunch of initialization routines. init_a(), init_b(), init_c(), and you export them all. Then they are always called in sequence:

	init_a();
	init_b();
	init_c();

This is completely pointless. You just needed to export one function which calls all three functions.

The namespace pollution of this driver is out of control.

You really need to completely rework the architecture and layout of this driver before I will even begin to review it again. And the lack of review interest by other developers should be an indication to you how undesirable this code submission is to read.

Thanks.
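The "export one function" advice in the quoted review can be sketched as follows. This is an illustration only: `example_module_init()` and the recording arrays are hypothetical names invented for this sketch, and the bodies of init_a(), init_b() and init_c() are placeholders that merely record the order in which they ran.

```c
#include <stddef.h>

/* Bookkeeping for the sketch: which stages ran, and in what order. */
static int init_trace[3];
static size_t init_count;

/* The individual stages stay private to this file (static), so they
 * add nothing to the global namespace.
 */
static int init_a(void) { init_trace[init_count++] = 1; return 0; }
static int init_b(void) { init_trace[init_count++] = 2; return 0; }
static int init_c(void) { init_trace[init_count++] = 3; return 0; }

/* The only symbol other files would need: it runs the stages in the
 * fixed order and stops at the first failure.
 */
int example_module_init(void)
{
	int err;

	err = init_a();
	if (err)
		return err;
	err = init_b();
	if (err)
		return err;
	return init_c();
}
```

Callers then depend on a single entry point, and the ordering constraint lives in one place instead of being repeated at every call site.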
Re: [PATCH net-next 0/6] qlcnic: enhancements
From: Shahed Shaikh shahed.sha...@qlogic.com
Date: Fri, 7 Aug 2015 07:17:01 -0400

This series adds a few enhancements.

o Patch from Harish reorders the sequence of header file inclusion, keeping the kernel's header files on top.
o Firmware introduced a new feature which allows the driver to increase the size of the firmware dump of the iSCSI function, which is collected by the NIC driver.
o Print the address of the buffer which holds a firmware dump.
o Use vzalloc() instead of kzalloc() for allocating a large chunk of memory, which avoids a potential memory allocation failure.
o Add new device ID 0x8C30, which is an 83xx series based VF function.

Please apply this series to net-next.

Series applied, thanks.
Re: [PATCH v2 bluetooth-next] cc2520: set the default fifo pin value from platform data
On 08/11/2015 08:13 AM, sdliy...@gmail.com wrote:
From: Yong Li sdliy...@gmail.com

When device tree support is disabled, the fifo_pin is uninitialized. This patch sets the fifo_pin value based on platform data.

Signed-off-by: Yong Li sdliy...@gmail.com

Acked-by: Varka Bhadram varkabhad...@gmail.com

---
 drivers/net/ieee802154/cc2520.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ieee802154/cc2520.c b/drivers/net/ieee802154/cc2520.c
index 613dae5..c5b54a1 100644
--- a/drivers/net/ieee802154/cc2520.c
+++ b/drivers/net/ieee802154/cc2520.c
@@ -833,6 +833,7 @@ static int cc2520_get_platform_data(struct spi_device *spi,
 	if (!spi_pdata)
 		return -ENOENT;
 	*pdata = *spi_pdata;
+	priv->fifo_pin = pdata->fifo;
 	return 0;
 }

--
Varka Bhadram.
Re: [PATCH] net: Unbreak resetting default values for tcp_wmem/udp_wmem_min
On Sunday 08/09 at 22:41 -0700, David Miller wrote:
From: Calvin Owens calvinow...@fb.com
Date: Wed, 5 Aug 2015 13:26:54 -0700

Commit 8133534c760d4083 (net: limit tcp/udp rmem/wmem to SOCK_{RCV,SND}BUF_MIN) modified four sysctls to enforce that the values written to them are not less than SOCK_MIN_{RCV,SND}BUF.

This change is fine for tcp_rmem and udp_rmem_min, since SOCK_MIN_RCVBUF is equal to TCP_SKB_MIN_TRUESIZE. But it breaks tcp_wmem and udp_wmem_min for previously valid values because SOCK_MIN_SNDBUF is (2 * TCP_SKB_MIN_TRUESIZE), which ends up being greater than 4KB.

Thus, 4096 is no longer accepted as a valid value, despite still being the default for udp_wmem_min, and for 'min' in tcp_wmem. A huge number of sysctl configurations at FB use 4096 as 'min', so this change breaks all of them.

This patch changes the sysctls to simply enforce that the value written is greater than or equal to the default value of SK_MEM_QUANTUM.

Fixes: 8133534c760d4083 (net: limit tcp/udp rmem/wmem to SOCK_MIN...)
Signed-off-by: Calvin Owens calvinow...@fb.com

I think increasing the default makes more sense. If we don't allow applications to set 4K, the kernel shouldn't start with that value either.

I'm really questioning the limitation itself: why enforce a minimum of SOCK_MIN_SNDBUF here? Why not SK_MEM_QUANTUM?

Commit 8133534c760d4083 referred to b1cb59cf2efe7971, which chose to use the SOCK_MIN constants as the lower limits to avoid nasty bugs. But AFAICS, a limit of SOCK_MIN_SNDBUF isn't necessary to do that: the BUG_ON cited in the commit message for b1cb59cf2efe7971 seems to have happened because unix_stream_sendmsg() expects a minimum of a full page (i.e. SK_MEM_QUANTUM) and the math broke, not because it had less than SOCK_MIN_SNDBUF allocated.

Nothing seems to assume that it has at least SOCK_MIN_SNDBUF to play with, so my argument is that enforcing a minimum of SK_MEM_QUANTUM avoids the sort of bugs commit 8133534c760d4083 was trying to avoid, and it does so without breaking anybody's sysctl configurations.

What do you think?

Thanks very much,
Calvin
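The effect of the two candidate lower bounds can be shown with a small sketch. The constants are assumptions for illustration only: `MOCK_SK_MEM_QUANTUM` assumes a 4 KB page, and `MOCK_SOCK_MIN_SNDBUF` is a made-up value chosen simply to be "greater than 4KB" as the message states; the real value depends on the kernel build. The helper name is likewise hypothetical.

```c
#include <stdbool.h>

enum {
	MOCK_SK_MEM_QUANTUM = 4096,	/* assumption: one 4 KB page */
	MOCK_SOCK_MIN_SNDBUF = 4608,	/* hypothetical value above 4 KB */
};

/* A write to the sysctl is accepted only if it is at least `minimum`. */
static bool sysctl_write_accepted(int value, int minimum)
{
	return value >= minimum;
}
```

With the larger bound, writing back the default of 4096 is rejected; with the page-sized bound, it is accepted again.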
Re: [PATCH] net: Unbreak resetting default values for tcp_wmem/udp_wmem_min
From: Calvin Owens calvinow...@fb.com
Date: Mon, 10 Aug 2015 20:34:06 -0700

I'm really questioning the limitation itself: why enforce a minimum of SOCK_MIN_SNDBUF here? Why not SK_MEM_QUANTUM?

Commit 8133534c760d4083 referred to b1cb59cf2efe7971, which chose to use the SOCK_MIN constants as the lower limits to avoid nasty bugs. But AFAICS, a limit of SOCK_MIN_SNDBUF isn't necessary to do that: the BUG_ON cited in the commit message for b1cb59cf2efe7971 seems to have happened because unix_stream_sendmsg() expects a minimum of a full page (i.e. SK_MEM_QUANTUM) and the math broke, not because it had less than SOCK_MIN_SNDBUF allocated.

Nothing seems to assume that it has at least SOCK_MIN_SNDBUF to play with, so my argument is that enforcing a minimum of SK_MEM_QUANTUM avoids the sort of bugs commit 8133534c760d4083 was trying to avoid, and it does so without breaking anybody's sysctl configurations.

What do you think?

The author of said commit argues that too small values lead to really bad performance, but I guess he should have adjusted the default if he cared about it so much.

Ok, can you respin your patch with some added details in the commit message like what you said above?

Thanks.
Re: [PATCH] mkiss: Fix error handling in mkiss_open()
From: Fabio Estevam fabio.este...@freescale.com
Date: Mon, 10 Aug 2015 14:22:43 -0300

If register_netdev() fails we are not propagating the error and we return success because ax_open() succeeded previously. Fix this by checking the return value of ax_open() and register_netdev() and propagate the error in case of failure.

Reported-by: RUC_Soft_Sec zy900...@163.com
Signed-off-by: Fabio Estevam fabio.este...@freescale.com

Applied, thanks.
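The error-propagation pattern this patch describes can be sketched in user-space C. Everything here is hypothetical scaffolding: the `mock_*` functions stand in for ax_open() and register_netdev(), the `fail_*` flags simulate failures, and the specific errno values are arbitrary. The sketch does not reproduce the actual mkiss_open() cleanup.

```c
#include <errno.h>

/* Knobs for the sketch: force either step to fail. */
static int fail_open;
static int fail_register;
static int release_calls;

static int mock_ax_open(void)
{
	return fail_open ? -ENOMEM : 0;
}

static int mock_register_netdev(void)
{
	return fail_register ? -EBUSY : 0;
}

static void mock_release(void)
{
	release_calls++;
}

/* Check each step and return its error code instead of reporting
 * success just because an earlier step worked.
 */
static int mock_line_open(void)
{
	int err;

	err = mock_ax_open();
	if (err)
		goto out_release;

	err = mock_register_netdev();
	if (err)
		goto out_release;

	return 0;

out_release:
	mock_release();
	return err;
}
```

The key point is that `err` always holds the result of the most recent step, so a late failure cannot be masked by an earlier success.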
Re: [PATCH 0/5] Netfilter fixes for net
From: Pablo Neira Ayuso pa...@netfilter.org
Date: Mon, 10 Aug 2015 19:58:34 +0200

The following patchset contains five Netfilter fixes for your net tree, they are:

1) Silence a warning on falling back to vmalloc(). Since 88eab472ec21, we can easily hit this warning message, which gets users confused. So let's get rid of it.

2) Recently, when porting the template object allocation on top of kmalloc to fix the netns dependencies between x_tables and conntrack, the error checks were left unchanged. Remove IS_ERR() and check for NULL instead. Patch from Dan Carpenter.

3) Don't ignore gfp_flags in the new nf_ct_tmpl_alloc() function, from Joe Stringer.

4) Fix a crash due to NULL pointer dereference in ip6t_SYNPROXY, patch from Phil Sutter.

5) The sequence number of the Syn+ack that is sent from SYNPROXY to clients is not adjusted through our NAT infrastructure; as a result the client may ignore this TCP packet and the TCP flow hangs until the client probes us. Also from Phil Sutter.

You can pull these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git

Pulled, thanks Pablo.
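Item 2 above hinges on the fact that an error-pointer check does not catch a NULL return. A user-space sketch makes this concrete. `mock_is_err()` is a hypothetical re-implementation written for this illustration, assuming the usual convention that the top 4095 values of the address space encode negative errno codes; it is not the kernel macro itself.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_MAX_ERRNO 4095

/* True only for pointers in the last MOCK_MAX_ERRNO values of the
 * address space, i.e. values that encode a negative errno.
 */
static bool mock_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MOCK_MAX_ERRNO;
}

/* The check that matches an allocator which reports failure as NULL. */
static bool mock_alloc_failed(const void *ptr)
{
	return ptr == NULL;
}
```

An allocator that returns NULL on failure therefore slips straight past an error-pointer test, which is why the check has to change together with the allocation scheme.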
Re: [PATCH net-next] mellanox: mlxsw: Use '%zx' to print size_t format
From: Fabio Estevam fabio.este...@freescale.com
Date: Mon, 10 Aug 2015 09:54:28 -0300

Use '%zx' to print size_t format in order to fix the following build warning:

drivers/net/ethernet/mellanox/mlxsw/item.h:65:3: warning: format '%lx' expects argument of type 'long unsigned int', but argument 6 has type 'size_t' [-Wformat=]

Signed-off-by: Fabio Estevam fabio.este...@freescale.com

Applied, thanks.
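The same length modifier exists in standard C, so the point can be shown outside the kernel. This is a minimal user-space sketch using snprintf(); the helper name `format_size_hex` is hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

/* 'z' is the length modifier for size_t, so the format stays correct
 * whether size_t is 32 or 64 bits wide on the target.
 */
static int format_size_hex(char *buf, size_t buflen, size_t value)
{
	return snprintf(buf, buflen, "0x%zx", value);
}
```

Using '%lx' instead only happens to work on targets where size_t and unsigned long are the same type, which is what the compiler warning is pointing out.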