[PATCH iproute] gitignore: Ignore 'tags' file generated by ctags
Signed-off-by: Hangbin Liu
---
 .gitignore | 1 +
 1 file changed, 1 insertion(+)

diff --git a/.gitignore b/.gitignore
index ef03b17..74a5496 100644
--- a/.gitignore
+++ b/.gitignore
@@ -10,6 +10,7 @@ Config
 # cscope
 cscope.*
 ncscope.*
+tags
 TAGS
 
 # git files that we don't want to ignore even it they are dot-files
-- 
2.5.5
[PATCH iproute2] ip route: check ftell, fseek return value
ftell() may return -1 on error; this was not handled, so a negative offset
could be passed to fseek(). The return code of fseek() was also not checked.

Reported-by: Phil Sutter
Signed-off-by: Hangbin Liu
---
 ip/iproute.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/ip/iproute.c b/ip/iproute.c
index 3da23af..ba877dc 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -1859,7 +1859,11 @@ static int iproute_restore(void)
 	if (route_dump_check_magic())
 		exit(-1);
 
-	pos = ftell(stdin);
+	if ((pos = ftell(stdin)) == -1) {
+		perror("Failed to restore: ftell");
+		exit(errno);
+	}
+
 	for (prio = 0; prio < 3; prio++) {
 		int err;
 
@@ -1867,7 +1871,10 @@ static int iproute_restore(void)
 		if (err)
 			exit(err);
 
-		fseek(stdin, pos, SEEK_SET);
+		if (fseek(stdin, pos, SEEK_SET) == -1) {
+			perror("Failed to restore: fseek");
+			exit(errno);
+		}
 	}
 
 	exit(0);
-- 
2.5.5
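For reference, the checked pattern this patch applies can be sketched outside iproute2 like this. The helper names checked_tell()/checked_seek() are illustrative only, not part of the real patch, which open-codes the checks inside iproute_restore():

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Return the current offset of fp, or report the error and exit,
 * mirroring the error handling added by the patch. */
long checked_tell(FILE *fp)
{
	long pos = ftell(fp);

	if (pos == -1) {
		perror("Failed to restore: ftell");
		exit(errno);
	}
	return pos;
}

/* Rewind fp to pos, checking fseek()'s return value as well. */
void checked_seek(FILE *fp, long pos)
{
	if (fseek(fp, pos, SEEK_SET) == -1) {
		perror("Failed to restore: fseek");
		exit(errno);
	}
}
```

Note that a non-seekable stream (e.g. stdin fed from a pipe) makes ftell() fail with ESPIPE, which is exactly the case the patch now reports instead of silently passing -1 to fseek().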
Re: Centralizing support for TCAM?
On Sat, Sep 03, 2016 at 09:09:50AM +0200, Jiri Pirko wrote: > Fri, Sep 02, 2016 at 08:49:34PM CEST, john.fastab...@gmail.com wrote: > >On 16-09-02 10:18 AM, Florian Fainelli wrote: > >> Hi all, > >> > > > >Hi Florian, > > > >> (apologies for the long CC list and the fact that I can't type correctly > >> email addresses) > >> > > > >My favorite topic ;) > > > >> While working on adding support for the Broadcom Ethernet switches > >> Compact Field Processor (which is essentially a TCAM, > >> action/policer/rate meter RAMs, 256 entries), I started working with the > >> ethtool::rxnfc API which is actually kind of nice in that it fits nicely > >> with my use simple case of being able to insert rules at a given or > >> driver selected location and has a pretty good flow representation for > >> common things you may match: TCP/UDP v4/v6 (not so much for non-IP, or > >> L2 stuff though you can use the extension flow representation). It lacks > >> support for more complex actions other than redirect to a particular > >> port/queue. > > > >When I was doing this for one of the products I work on I decided that > >extending ethtool was likely not a good approach and building a netlink > >interface would be a better choice. My reasons were mainly extending > >ethtool is a bit painful to keep structure compatibility across versions > >and I also had use cases that wanted to get notifications both made > >easier when using netlink. However my netlink port+extensions were not > >accepted and were called a "kernel bypass" and the general opinion was > >that it was not going to be accepted upstream. Hence the 'tc' effort. > > Ethtool should die peacefully. Don't poke in it in the process... 
> > > > > >> > >> Now ethtool::rxnfc is one possible user, but tc and netfiler also are, > >> more powerful and extensible, but since this is a resource constrained > >> piece of hardware, and it would suck for people to have to implement > >> these 3 APIs if we could come up with a central one that satisfies the > >> superset offered by tc + netfilter. We can surely imagine an use case we > > > >My opinion is that tc and netfilter are sufficiently different that > >building a common layer is challenging and is actually more complex vs > >just implementing two interfaces. Always happy to review code though. > > In february, Pablo did some work on finding the common intermediate > layer for classifier-action subsystem. It was rejected with the argument > of unnecessary overhead. Makes sense to me. After that, you introduced > u32 tc offload. Since that, couple more tc classifiers and actions were > offloaded. > > I believe that for Florian's usecase, TC is a great fit. You can just use > cls_flower with couple of actions. > > My colleagues are working hard on enabling cls_flower offload. You can > easily benefit that. In mlxsw we also plan to use that for our TCAM ACLs > offloading. > > > > > >There is also an already established packet flow through tc, netfilter, > >fdb, l3 in linux that folks want to maintain. At the moment I just don't > >see the need for a common layer IMO. > > > >Also adding another layer of abstraction so we end up doing multiple > >translations into and out of these layers adds overhead. Eventually > >I need to get reasonable operations per second on the TCAM tables. > >Reasonable for me being somewhere in the 50k to 100k add/del/update > >commands per second. I'm hesitant to create more abstractions then > >are actually needed. 
> > > >> centralize the whole matching + action into a Domain Specific Language > >> that we compiled into eBPF and then translate into whatever the HW > >> understands, although that raises the question of where do we put the > >> translation tool in user space or kernel space. > > > >The eBPF to HW translation I started to look at but gave up. The issue > >was the program space of eBPF is much larger than any traditional > >parser, table hardware implementation can support so most programs get > >rejected (obvious observation right?). I'm more inclined to build > >hardware that can support eBPF vs restricting eBPF to fit into a > >parser/table model. > > +1 > I have been thinging a lot about this and I believe that parsing bpf > program in drivers into some pre-defined tables is quite complex. I > think that bpf is just very unsuitable to offload, if you don't have a > hw which could directly interpret it. > I know that Alexei disagrees :) lol :) compiling bpf into fixed pipeline asic is definitely not easy. The problem with adding new cls classifieris and actions to match what configurable hw does isn't pretty either. The fixed pipeline isn't interesting beyond l2/l3 and flow-based hw features are mostly useless in the tor. I'm not against adding new classifiers, since it's better than sdk, but we won't be using such tc features either. Since this thread about tcam... my 0.02 here is it's pretty bad in the nic(host) due to power consumption and in the tor it's only good as
[RFC PATCH v2 6/6] net: Suppress the "Comparison to NULL could be written" warning
This is to suppress the checkpatch.pl warning "Comparison to NULL could
be written". No functional changes here.

Signed-off-by: Jia He
---
 net/ipv4/proc.c      | 44 ++++++++++++++++++++++----------------------
 net/sctp/proc.c      |  4 ++--
 net/xfrm/xfrm_proc.c |  4 ++--
 3 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index f413fdc..bf0bb22 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -358,22 +358,22 @@ static void icmp_put(struct seq_file *seq)
 	atomic_long_t *ptr = net->mib.icmpmsg_statistics->mibs;
 
 	seq_puts(seq, "\nIcmp: InMsgs InErrors InCsumErrors");
-	for (i = 0; icmpmibmap[i].name != NULL; i++)
+	for (i = 0; icmpmibmap[i].name; i++)
 		seq_printf(seq, " In%s", icmpmibmap[i].name);
 	seq_puts(seq, " OutMsgs OutErrors");
-	for (i = 0; icmpmibmap[i].name != NULL; i++)
+	for (i = 0; icmpmibmap[i].name; i++)
 		seq_printf(seq, " Out%s", icmpmibmap[i].name);
 	seq_printf(seq, "\nIcmp: %lu %lu %lu",
 		   snmp_fold_field(net->mib.icmp_statistics, ICMP_MIB_INMSGS),
 		   snmp_fold_field(net->mib.icmp_statistics, ICMP_MIB_INERRORS),
 		   snmp_fold_field(net->mib.icmp_statistics, ICMP_MIB_CSUMERRORS));
-	for (i = 0; icmpmibmap[i].name != NULL; i++)
+	for (i = 0; icmpmibmap[i].name; i++)
 		seq_printf(seq, " %lu",
 			   atomic_long_read(ptr + icmpmibmap[i].index));
 	seq_printf(seq, " %lu %lu",
 		   snmp_fold_field(net->mib.icmp_statistics, ICMP_MIB_OUTMSGS),
 		   snmp_fold_field(net->mib.icmp_statistics, ICMP_MIB_OUTERRORS));
-	for (i = 0; icmpmibmap[i].name != NULL; i++)
+	for (i = 0; icmpmibmap[i].name; i++)
 		seq_printf(seq, " %lu",
 			   atomic_long_read(ptr + (icmpmibmap[i].index | 0x100)));
 }
@@ -390,7 +390,7 @@ static int snmp_seq_show_ipstats(struct seq_file *seq, void *v)
 	memset(buff64, 0, IPSTATS_MIB_MAX * sizeof(u64));
 
 	seq_puts(seq, "Ip: Forwarding DefaultTTL");
-	for (i = 0; snmp4_ipstats_list[i].name != NULL; i++)
+	for (i = 0; snmp4_ipstats_list[i].name; i++)
 		seq_printf(seq, " %s", snmp4_ipstats_list[i].name);
 
 	seq_printf(seq, "\nIp: %d %d",
@@ -400,13 +400,13 @@ static int snmp_seq_show_ipstats(struct seq_file *seq, void *v)
 	BUILD_BUG_ON(offsetof(struct ipstats_mib, mibs) != 0);
 	for_each_possible_cpu(c) {
-		for (i = 0; snmp4_ipstats_list[i].name != NULL; i++)
+		for (i = 0; snmp4_ipstats_list[i].name; i++)
 			buff64[i] += snmp_get_cpu_field64(
 					net->mib.ip_statistics,
 					c, snmp4_ipstats_list[i].entry,
 					offsetof(struct ipstats_mib, syncp));
 	}
-	for (i = 0; snmp4_ipstats_list[i].name != NULL; i++)
+	for (i = 0; snmp4_ipstats_list[i].name; i++)
 		seq_printf(seq, " %llu", buff64[i]);
 
 	return 0;
@@ -421,17 +421,17 @@ static int snmp_seq_show_tcp_udp(struct seq_file *seq, void *v)
 	memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));
 
 	seq_puts(seq, "\nTcp:");
-	for (i = 0; snmp4_tcp_list[i].name != NULL; i++)
+	for (i = 0; snmp4_tcp_list[i].name; i++)
 		seq_printf(seq, " %s", snmp4_tcp_list[i].name);
 
 	seq_puts(seq, "\nTcp:");
 	for_each_possible_cpu(c) {
-		for (i = 0; snmp4_tcp_list[i].name != NULL; i++)
+		for (i = 0; snmp4_tcp_list[i].name; i++)
 			buff[i] += snmp_get_cpu_field(net->mib.tcp_statistics,
 						      c, snmp4_tcp_list[i].entry);
 	}
-	for (i = 0; snmp4_tcp_list[i].name != NULL; i++) {
+	for (i = 0; snmp4_tcp_list[i].name; i++) {
 		/* MaxConn field is signed, RFC 2012 */
 		if (snmp4_tcp_list[i].entry == TCP_MIB_MAXCONN)
 			seq_printf(seq, " %ld", buff[i]);
@@ -442,15 +442,15 @@ static int snmp_seq_show_tcp_udp(struct seq_file *seq, void *v)
 	memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));
 
 	for_each_possible_cpu(c) {
-		for (i = 0; snmp4_udp_list[i].name != NULL; i++)
+		for (i = 0; snmp4_udp_list[i].name; i++)
 			buff[i] += snmp_get_cpu_field(net->mib.udp_statistics,
 						      c, snmp4_udp_list[i].entry);
 	}
 	seq_puts(seq, "\nUdp:");
-	for (i = 0; snmp4_udp_list[i].name != NULL; i++)
+	for (i = 0; snmp4_udp_list[i].name; i++)
 		seq_printf(seq, " %s", snmp4_udp_list[i].name);
 
 	seq_puts(seq, "\nUdp:");
-	for (i = 0; snmp4_udp_list[i].name != NULL; i++)
+	for (i = 0; snmp4_udp_list[i].name; i++)
 		seq_printf(seq, " %lu", buff[i
[RFC PATCH v2 3/6] proc: Reduce cache miss in sctp_snmp_seq_show
This patch exchanges the two loops that collect the percpu statistics
data. This reduces cache misses by going through all the items of each
cpu sequentially.

Signed-off-by: Jia He
---
 net/sctp/proc.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/net/sctp/proc.c b/net/sctp/proc.c
index ef8ba77..085fb95 100644
--- a/net/sctp/proc.c
+++ b/net/sctp/proc.c
@@ -74,12 +74,19 @@ static const struct snmp_mib sctp_snmp_list[] = {
 static int sctp_snmp_seq_show(struct seq_file *seq, void *v)
 {
 	struct net *net = seq->private;
-	int i;
+	int i, c;
+	unsigned long buff[SCTP_MIB_MAX];
+
+	memset(buff, 0, sizeof(unsigned long) * SCTP_MIB_MAX);
+
+	for_each_possible_cpu(c)
+		for (i = 0; sctp_snmp_list[i].name != NULL; i++)
+			buff[i] += snmp_get_cpu_field(
+					net->sctp.sctp_statistics,
+					c, sctp_snmp_list[i].entry);
 	for (i = 0; sctp_snmp_list[i].name != NULL; i++)
 		seq_printf(seq, "%-32s\t%ld\n", sctp_snmp_list[i].name,
-			   snmp_fold_field(net->sctp.sctp_statistics,
-					   sctp_snmp_list[i].entry));
+			   buff[i]);
 
 	return 0;
 }
-- 
1.8.3.1
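The loop exchange used throughout this series can be illustrated outside the kernel with a toy percpu table. The array sizes and names below are invented for the sketch, and snmp_get_cpu_field() is modelled as a plain array read:

```c
#include <string.h>

#define NR_CPUS  4
#define NR_ITEMS 8

static long percpu_stats[NR_CPUS][NR_ITEMS];

/* Before: one full pass over all CPUs per item (snmp_fold_field style).
 * Every printed item walks every CPU's data again, so each CPU's cache
 * lines are re-fetched once per item. */
long fold_one(int item)
{
	long sum = 0;

	for (int c = 0; c < NR_CPUS; c++)
		sum += percpu_stats[c][item];
	return sum;
}

/* After: exchange the loops and walk each CPU's items sequentially,
 * accumulating into a local buffer, as the patches do. Each CPU's data
 * is now streamed through once, in order. */
void fold_all(long buff[NR_ITEMS])
{
	memset(buff, 0, sizeof(long) * NR_ITEMS);
	for (int c = 0; c < NR_CPUS; c++)
		for (int i = 0; i < NR_ITEMS; i++)
			buff[i] += percpu_stats[c][i];
}
```

Both orders compute the same sums; only the memory access pattern differs, which is where the cache-miss reduction in the cover letter comes from.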
[RFC PATCH v2 1/6] proc: Reduce cache miss in {snmp,netstat}_seq_show
This patch exchanges the two loops that collect the percpu statistics
data, aggregating the data by going through all the items of each cpu
sequentially. snmp_seq_show is then split into 2 parts to avoid the
build warning about a frame size larger than 1024.

Signed-off-by: Jia He
---
 net/ipv4/proc.c | 112 ++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 81 insertions(+), 31 deletions(-)

diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 9f665b6..f413fdc 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -46,6 +46,9 @@
 #include
 #include
 
+#define MAX(a, b) ((u32)(a) >= (u32)(b) ? (a) : (b))
+#define TCPUDP_MIB_MAX MAX(UDP_MIB_MAX, TCP_MIB_MAX)
+
 /*
  * Report socket allocation statistics [m...@utu.fi]
  */
@@ -378,13 +381,15 @@ static void icmp_put(struct seq_file *seq)
 /*
  * Called from the PROCfs module. This outputs /proc/net/snmp.
  */
-static int snmp_seq_show(struct seq_file *seq, void *v)
+static int snmp_seq_show_ipstats(struct seq_file *seq, void *v)
 {
-	int i;
+	int i, c;
+	u64 buff64[IPSTATS_MIB_MAX];
 	struct net *net = seq->private;
 
-	seq_puts(seq, "Ip: Forwarding DefaultTTL");
+	memset(buff64, 0, IPSTATS_MIB_MAX * sizeof(u64));
 
+	seq_puts(seq, "Ip: Forwarding DefaultTTL");
 	for (i = 0; snmp4_ipstats_list[i].name != NULL; i++)
 		seq_printf(seq, " %s", snmp4_ipstats_list[i].name);
 
@@ -393,57 +398,92 @@ static int snmp_seq_show(struct seq_file *seq, void *v)
 		   net->ipv4.sysctl_ip_default_ttl);
 
 	BUILD_BUG_ON(offsetof(struct ipstats_mib, mibs) != 0);
+
+	for_each_possible_cpu(c) {
+		for (i = 0; snmp4_ipstats_list[i].name != NULL; i++)
+			buff64[i] += snmp_get_cpu_field64(
+					net->mib.ip_statistics,
+					c, snmp4_ipstats_list[i].entry,
+					offsetof(struct ipstats_mib, syncp));
+	}
 	for (i = 0; snmp4_ipstats_list[i].name != NULL; i++)
-		seq_printf(seq, " %llu",
-			   snmp_fold_field64(net->mib.ip_statistics,
-					     snmp4_ipstats_list[i].entry,
-					     offsetof(struct ipstats_mib, syncp)));
+		seq_printf(seq, " %llu", buff64[i]);
 
-	icmp_put(seq);	/* RFC 2011 compatibility */
-	icmpmsg_put(seq);
+	return 0;
+}
+
+static int snmp_seq_show_tcp_udp(struct seq_file *seq, void *v)
+{
+	int i, c;
+	unsigned long buff[TCPUDP_MIB_MAX];
+	struct net *net = seq->private;
+
+	memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));
 
 	seq_puts(seq, "\nTcp:");
 	for (i = 0; snmp4_tcp_list[i].name != NULL; i++)
 		seq_printf(seq, " %s", snmp4_tcp_list[i].name);
 
 	seq_puts(seq, "\nTcp:");
+	for_each_possible_cpu(c) {
+		for (i = 0; snmp4_tcp_list[i].name != NULL; i++)
+			buff[i] += snmp_get_cpu_field(net->mib.tcp_statistics,
+						      c, snmp4_tcp_list[i].entry);
+	}
+
 	for (i = 0; snmp4_tcp_list[i].name != NULL; i++) {
 		/* MaxConn field is signed, RFC 2012 */
 		if (snmp4_tcp_list[i].entry == TCP_MIB_MAXCONN)
-			seq_printf(seq, " %ld",
-				   snmp_fold_field(net->mib.tcp_statistics,
-						   snmp4_tcp_list[i].entry));
+			seq_printf(seq, " %ld", buff[i]);
 		else
-			seq_printf(seq, " %lu",
-				   snmp_fold_field(net->mib.tcp_statistics,
-						   snmp4_tcp_list[i].entry));
+			seq_printf(seq, " %lu", buff[i]);
 	}
 
+	memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));
+
+	for_each_possible_cpu(c) {
+		for (i = 0; snmp4_udp_list[i].name != NULL; i++)
+			buff[i] += snmp_get_cpu_field(net->mib.udp_statistics,
+						      c, snmp4_udp_list[i].entry);
+	}
 	seq_puts(seq, "\nUdp:");
 	for (i = 0; snmp4_udp_list[i].name != NULL; i++)
 		seq_printf(seq, " %s", snmp4_udp_list[i].name);
-
 	seq_puts(seq, "\nUdp:");
 	for (i = 0; snmp4_udp_list[i].name != NULL; i++)
-		seq_printf(seq, " %lu",
-			   snmp_fold_field(net->mib.udp_statistics,
-					   snmp4_udp_list[i].entry));
+		seq_printf(seq, " %lu", buff[i]);
+
+	memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));
 
 	/* the UDP and UDP-Lite MIBs are the same */
 	seq_puts(seq, "\nUdpLite:");
+	for_each_possible_cpu(c) {
+		for (i = 0; snmp4_udp_list[i].name != NULL; i++
[RFC PATCH v2 5/6] ipv6: Remove useless parameter in __snmp6_fill_statsdev
The parameter items (always ICMP6_MIB_MAX) is useless for
__snmp6_fill_statsdev.

Signed-off-by: Jia He
---
 net/ipv6/addrconf.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index f418d2e..e170554 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4952,18 +4952,18 @@ static inline size_t inet6_if_nlmsg_size(void)
 }
 
 static inline void __snmp6_fill_statsdev(u64 *stats, atomic_long_t *mib,
-					 int items, int bytes)
+					 int bytes)
 {
 	int i;
-	int pad = bytes - sizeof(u64) * items;
+	int pad = bytes - sizeof(u64) * ICMP6_MIB_MAX;
 
 	BUG_ON(pad < 0);
 
 	/* Use put_unaligned() because stats may not be aligned for u64. */
-	put_unaligned(items, &stats[0]);
-	for (i = 1; i < items; i++)
+	put_unaligned(ICMP6_MIB_MAX, &stats[0]);
+	for (i = 1; i < ICMP6_MIB_MAX; i++)
 		put_unaligned(atomic_long_read(&mib[i]), &stats[i]);
 
-	memset(&stats[items], 0, pad);
+	memset(&stats[ICMP6_MIB_MAX], 0, pad);
 }
 
 static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib,
@@ -4996,7 +4996,7 @@ static void snmp6_fill_stats(u64 *stats, struct inet6_dev *idev, int attrtype,
 			     offsetof(struct ipstats_mib, syncp));
 		break;
 	case IFLA_INET6_ICMP6STATS:
-		__snmp6_fill_statsdev(stats, idev->stats.icmpv6dev->mibs, ICMP6_MIB_MAX, bytes);
+		__snmp6_fill_statsdev(stats, idev->stats.icmpv6dev->mibs, bytes);
 		break;
 	}
 }
-- 
1.8.3.1
[RFC PATCH v2 2/6] proc: Reduce cache miss in snmp6_seq_show
This patch exchanges the two loops that collect the percpu statistics
data. This can reduce cache misses by going through all the items of
each cpu sequentially.

Signed-off-by: Jia He
---
 net/ipv6/proc.c | 47 ++++++++++++++++++++++++++++++++++-----------
 1 file changed, 36 insertions(+), 11 deletions(-)

diff --git a/net/ipv6/proc.c b/net/ipv6/proc.c
index 679253d0..c834646 100644
--- a/net/ipv6/proc.c
+++ b/net/ipv6/proc.c
@@ -30,6 +30,13 @@
 #include
 #include
 
+#define MAX(a, b) ((u32)(a) >= (u32)(b) ? (a) : (b))
+
+#define MAX4(a, b, c, d) \
+	MAX(MAX(a, b), MAX(c, d))
+#define SNMP_MIB_MAX MAX4(UDP_MIB_MAX, TCP_MIB_MAX, \
+			  IPSTATS_MIB_MAX, ICMP_MIB_MAX)
+
 static int sockstat6_seq_show(struct seq_file *seq, void *v)
 {
 	struct net *net = seq->private;
@@ -191,25 +198,43 @@ static void snmp6_seq_show_item(struct seq_file *seq, void __percpu *pcpumib,
 				atomic_long_t *smib,
 				const struct snmp_mib *itemlist)
 {
-	int i;
-	unsigned long val;
-
-	for (i = 0; itemlist[i].name; i++) {
-		val = pcpumib ?
-			snmp_fold_field(pcpumib, itemlist[i].entry) :
-			atomic_long_read(smib + itemlist[i].entry);
-		seq_printf(seq, "%-32s\t%lu\n", itemlist[i].name, val);
+	int i, c;
+	unsigned long buff[SNMP_MIB_MAX];
+
+	memset(buff, 0, sizeof(unsigned long) * SNMP_MIB_MAX);
+
+	if (pcpumib) {
+		for_each_possible_cpu(c)
+			for (i = 0; itemlist[i].name; i++)
+				buff[i] += snmp_get_cpu_field(pcpumib, c,
+							      itemlist[i].entry);
+		for (i = 0; itemlist[i].name; i++)
+			seq_printf(seq, "%-32s\t%lu\n",
+				   itemlist[i].name, buff[i]);
+	} else {
+		for (i = 0; itemlist[i].name; i++)
+			seq_printf(seq, "%-32s\t%lu\n", itemlist[i].name,
+				   atomic_long_read(smib + itemlist[i].entry));
 	}
 }
 
 static void snmp6_seq_show_item64(struct seq_file *seq, void __percpu *mib,
 				  const struct snmp_mib *itemlist, size_t syncpoff)
 {
-	int i;
+	int i, c;
+	u64 buff[SNMP_MIB_MAX];
+
+	memset(buff, 0, sizeof(unsigned long) * SNMP_MIB_MAX);
+
+	for_each_possible_cpu(c) {
+		for (i = 0; itemlist[i].name; i++) {
+			buff[i] += snmp_get_cpu_field64(mib, c,
+							itemlist[i].entry,
+							syncpoff);
+		}
+	}
 	for (i = 0; itemlist[i].name; i++)
-		seq_printf(seq, "%-32s\t%llu\n", itemlist[i].name,
-			   snmp_fold_field64(mib, itemlist[i].entry, syncpoff));
+		seq_printf(seq, "%-32s\t%llu\n", itemlist[i].name, buff[i]);
 }
 
 static int snmp6_seq_show(struct seq_file *seq, void *v)
-- 
1.8.3.1
[RFC PATCH v2 4/6] proc: Reduce cache miss in xfrm_statistics_seq_show
This patch exchanges the two loops that collect the percpu statistics
data. This can reduce cache misses by going through all the items of
each cpu sequentially.

Signed-off-by: Jia He
---
 net/xfrm/xfrm_proc.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/net/xfrm/xfrm_proc.c b/net/xfrm/xfrm_proc.c
index 9c4fbd8..c9df546 100644
--- a/net/xfrm/xfrm_proc.c
+++ b/net/xfrm/xfrm_proc.c
@@ -51,11 +51,20 @@ static const struct snmp_mib xfrm_mib_list[] = {
 static int xfrm_statistics_seq_show(struct seq_file *seq, void *v)
 {
 	struct net *net = seq->private;
-	int i;
-	for (i = 0; xfrm_mib_list[i].name; i++)
+	int i, c;
+	unsigned long buff[LINUX_MIB_XFRMMAX];
+
+	memset(buff, 0, sizeof(unsigned long) * LINUX_MIB_XFRMMAX);
+
+	for_each_possible_cpu(c)
+		for (i = 0; xfrm_mib_list[i].name != NULL; i++)
+			buff[i] += snmp_get_cpu_field(
+					net->mib.xfrm_statistics,
+					c, xfrm_mib_list[i].entry);
+	for (i = 0; xfrm_mib_list[i].name != NULL; i++)
 		seq_printf(seq, "%-24s\t%lu\n", xfrm_mib_list[i].name,
-			   snmp_fold_field(net->mib.xfrm_statistics,
-					   xfrm_mib_list[i].entry));
+			   buff[i]);
+
 	return 0;
 }
-- 
1.8.3.1
[RFC PATCH v2 0/6] Reduce cache miss for snmp_fold_field
In a PowerPC server with a large cpu number (160), besides commit
a3a773726c9f ("net: Optimize snmp stat aggregation by walking all
the percpu data at once"), I noticed several other snmp_fold_field
call sites which cause a high cache-miss rate.

My simple test case, which reads from the procfs items endlessly:

/***/
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

#define LINELEN 2560

int main(int argc, char **argv)
{
	int i;
	int fd = -1;
	int rdsize = 0;
	char buf[LINELEN+1];

	buf[LINELEN] = 0;
	memset(buf, 0, LINELEN);

	if (1 >= argc) {
		printf("file name empty\n");
		return -1;
	}

	fd = open(argv[1], O_RDWR, 0644);
	if (0 > fd) {
		printf("open error\n");
		return -2;
	}

	for (i = 0; i < 0x; i++) {
		while (0 < (rdsize = read(fd, buf, LINELEN))) {
			//nothing here
		}
		lseek(fd, 0, SEEK_SET);
	}

	close(fd);
	return 0;
}
/**/

compile and run:

gcc test.c -o test
perf stat -d -e cache-misses ./test /proc/net/snmp
perf stat -d -e cache-misses ./test /proc/net/snmp6
perf stat -d -e cache-misses ./test /proc/net/netstat
perf stat -d -e cache-misses ./test /proc/net/sctp/snmp
perf stat -d -e cache-misses ./test /proc/net/xfrm_stat

before the patch set:
=====================
 Performance counter stats for 'system wide':

     355911097  cache-misses                                        [40.08%]
    2356829300  L1-dcache-loads                                     [60.04%]
     355642645  L1-dcache-load-misses  # 15.09% of all L1-dcache hits [60.02%]
     346544541  LLC-loads                                           [59.97%]
        389763  LLC-load-misses        #  0.11% of all LL-cache hits  [40.02%]

   6.245162638 seconds time elapsed

After the patch set:
====================
 Performance counter stats for 'system wide':

     194992476  cache-misses                                        [40.03%]
    6718051877  L1-dcache-loads                                     [60.07%]
     194871921  L1-dcache-load-misses  #  2.90% of all L1-dcache hits [60.11%]
     187632232  LLC-loads                                           [60.04%]
        464466  LLC-load-misses        #  0.25% of all LL-cache hits  [39.89%]

   6.868422769 seconds time elapsed

The cache-miss rate can be reduced from 15% to 2.9%.

v2:
- 1/6 fix bug in udplite statistics.
- 1/6 snmp_seq_show is split into 2 parts

Jia He (6):
  proc: Reduce cache miss in {snmp,netstat}_seq_show
  proc: Reduce cache miss in snmp6_seq_show
  proc: Reduce cache miss in sctp_snmp_seq_show
  proc: Reduce cache miss in xfrm_statistics_seq_show
  ipv6: Remove useless parameter in __snmp6_fill_statsdev
  net: Suppress the "Comparison to NULL could be written" warning

 net/ipv4/proc.c      | 144 ++++++++++++++++++++++++++++-----------------------
 net/ipv6/addrconf.c  |  12 ++--
 net/ipv6/proc.c      |  47 +++++++++++-----
 net/sctp/proc.c      |  15 ++++--
 net/xfrm/xfrm_proc.c |  15 ++++--
 5 files changed, 162 insertions(+), 71 deletions(-)

-- 
1.8.3.1
Re: [PATCH net-next 4/9] rxrpc: Randomise epoch and starting client conn ID values
Reply inline On 9/5/2016 12:24 PM, David Howells wrote: > [cc'ing Jeff Altman for comment] > > David Laight wrote: > >>> Create a random epoch value rather than a time-based one on startup and set >>> the top bit to indicate that this is the case. >> >> Why set the top bit? >> There is nothing to stop the time (in seconds) from having the top bit set. >> Nothing else can care - otherwise this wouldn't work. > > This is what I'm told I should do by purveyors of other RxRPC solutions. The protocol specification requires that the top bit be 1 for a random epoch and 0 for a time derived epoch. > >>> Also create a random starting client connection ID value. This will be >>> incremented from here as new client connections are created. >> >> I'm guessing this is to make duplicates less likely after a restart? Its to reduce the possibility of duplicates on multiple machines that might at some point exchange an endpoint address either due to mobility or NAT/PAT. > > Again, it's been suggested that I do this, but I would guess so. > >> You may want to worry about duplicate allocations (after 2^32 connects). > > It's actually a quarter of that, but connection != call, so a connection may > be used for up to ~16 billion RPC operations before it *has* to be flushed. > >> There are id allocation algorithms that guarantee not to generate duplicates >> and not to reuse values quickly while still being fixed cost. >> Look at the code NetBSD uses to allocate process ids for an example. > > I'm using idr_alloc_cyclic()[*] with a fixed size "window" on the active conn > ID values. Client connections with IDs outside of that window are discarded > as soon as possible to keep the memory consumption of the tree down (and to > force security renegotiation occasionally). However, given that there are a > billion IDs to cycle through, it will take quite a while for reuse to become > an issue. 
> > I like the idea of incrementing the epoch every time we cycle through the ID > space, but I'm told that a change in the epoch value is an indication that the > client rebooted - with what consequences I cannot say. State information might be recorded about an rx peer with the assumption that state will be reset when the epoch changes. The most frequent use of this technique is for rx rpc statistics monitoring. > > [*] which is what Linux uses to allocate process IDs. > > David
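The epoch rule stated above (top bit 1 for a random epoch, 0 for a time-derived one) can be sketched as follows. The helper names and the RXRPC_RANDOM_EPOCH constant are illustrative; the kernel patch draws its randomness from the kernel RNG, not from a caller-supplied value:

```c
#include <stdint.h>
#include <time.h>

#define RXRPC_RANDOM_EPOCH 0x80000000u	/* top bit: epoch was randomised */

/* Random epoch: any 32-bit random value with the top bit forced on,
 * so peers can tell it apart from a boot-time timestamp. */
uint32_t rxrpc_random_epoch(uint32_t rnd)
{
	return rnd | RXRPC_RANDOM_EPOCH;
}

/* Time-derived epoch: seconds since the Unix epoch, top bit forced off
 * per the protocol specification quoted in the thread. */
uint32_t rxrpc_time_epoch(time_t now)
{
	return (uint32_t)now & ~RXRPC_RANDOM_EPOCH;
}
```

The distinction matters because, as noted above, peers may cache per-epoch state and treat an epoch change as a client reboot.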
Re: [PATCH nf] netfilter: seqadj: Fix the wrong ack adjust for the RST packet without ack
On Tue, Sep 6, 2016 at 10:06 AM, wrote: > From: Gao Feng > > It is valid that the TCP RST packet which does not set ack flag, and bytes > of ack number are zero. For these RST packets, seqadj could not adjust the > ack number. > > Signed-off-by: Gao Feng > --- > v2: Regenerate because the first patch is removed > v1: Initial patch > > net/netfilter/nf_conntrack_seqadj.c | 34 +++--- > 1 file changed, 19 insertions(+), 15 deletions(-) > > diff --git a/net/netfilter/nf_conntrack_seqadj.c > b/net/netfilter/nf_conntrack_seqadj.c > index dff0f0c..3bd9c7e 100644 > --- a/net/netfilter/nf_conntrack_seqadj.c > +++ b/net/netfilter/nf_conntrack_seqadj.c > @@ -179,30 +179,34 @@ int nf_ct_seq_adjust(struct sk_buff *skb, > > tcph = (void *)skb->data + protoff; > spin_lock_bh(&ct->lock); > + > if (after(ntohl(tcph->seq), this_way->correction_pos)) > seqoff = this_way->offset_after; > else > seqoff = this_way->offset_before; > > - if (after(ntohl(tcph->ack_seq) - other_way->offset_before, > - other_way->correction_pos)) > - ackoff = other_way->offset_after; > - else > - ackoff = other_way->offset_before; > - > newseq = htonl(ntohl(tcph->seq) + seqoff); > - newack = htonl(ntohl(tcph->ack_seq) - ackoff); > - > inet_proto_csum_replace4(&tcph->check, skb, tcph->seq, newseq, false); > - inet_proto_csum_replace4(&tcph->check, skb, tcph->ack_seq, newack, > -false); > - > - pr_debug("Adjusting sequence number from %u->%u, ack from %u->%u\n", > -ntohl(tcph->seq), ntohl(newseq), ntohl(tcph->ack_seq), > -ntohl(newack)); > > + pr_debug("Adjusting sequence number from %u->%u\n", > +ntohl(tcph->seq), ntohl(newseq)); > tcph->seq = newseq; > - tcph->ack_seq = newack; > + > + if (likely(tcph->ack)) { > + if (after(ntohl(tcph->ack_seq) - other_way->offset_before, > + other_way->correction_pos)) > + ackoff = other_way->offset_after; > + else > + ackoff = other_way->offset_before; > + > + newack = htonl(ntohl(tcph->ack_seq) - ackoff); > + inet_proto_csum_replace4(&tcph->check, skb, tcph->ack_seq, > 
+newack, false); > + > + pr_debug("Adjusting ack number from %u->%u\n", > +ntohl(tcph->ack_seq), ntohl(newack)); > + tcph->ack_seq = newack; > + } > > res = nf_ct_sack_adjust(skb, protoff, tcph, ct, ctinfo); > spin_unlock_bh(&ct->lock); > -- > 1.9.1 > > Sorry, I forget to add the v2 in the subject. Best Regards Feng
[PATCH nf] netfilter: seqadj: Fix the wrong ack adjust for the RST packet without ack
From: Gao Feng

It is valid for a TCP RST packet to not set the ACK flag and to carry an
all-zero ack number. For these RST packets, seqadj should not adjust the
ack number.

Signed-off-by: Gao Feng
---
v2: Regenerate because the first patch is removed
v1: Initial patch

 net/netfilter/nf_conntrack_seqadj.c | 34 +++++++++++++++++++---------------
 1 file changed, 19 insertions(+), 15 deletions(-)

diff --git a/net/netfilter/nf_conntrack_seqadj.c b/net/netfilter/nf_conntrack_seqadj.c
index dff0f0c..3bd9c7e 100644
--- a/net/netfilter/nf_conntrack_seqadj.c
+++ b/net/netfilter/nf_conntrack_seqadj.c
@@ -179,30 +179,34 @@ int nf_ct_seq_adjust(struct sk_buff *skb,
 
 	tcph = (void *)skb->data + protoff;
 	spin_lock_bh(&ct->lock);
+
 	if (after(ntohl(tcph->seq), this_way->correction_pos))
 		seqoff = this_way->offset_after;
 	else
 		seqoff = this_way->offset_before;
 
-	if (after(ntohl(tcph->ack_seq) - other_way->offset_before,
-		  other_way->correction_pos))
-		ackoff = other_way->offset_after;
-	else
-		ackoff = other_way->offset_before;
-
 	newseq = htonl(ntohl(tcph->seq) + seqoff);
-	newack = htonl(ntohl(tcph->ack_seq) - ackoff);
-
 	inet_proto_csum_replace4(&tcph->check, skb, tcph->seq, newseq, false);
-	inet_proto_csum_replace4(&tcph->check, skb, tcph->ack_seq, newack,
-				 false);
-
-	pr_debug("Adjusting sequence number from %u->%u, ack from %u->%u\n",
-		 ntohl(tcph->seq), ntohl(newseq), ntohl(tcph->ack_seq),
-		 ntohl(newack));
 
+	pr_debug("Adjusting sequence number from %u->%u\n",
+		 ntohl(tcph->seq), ntohl(newseq));
 	tcph->seq = newseq;
-	tcph->ack_seq = newack;
+
+	if (likely(tcph->ack)) {
+		if (after(ntohl(tcph->ack_seq) - other_way->offset_before,
+			  other_way->correction_pos))
+			ackoff = other_way->offset_after;
+		else
+			ackoff = other_way->offset_before;
+
+		newack = htonl(ntohl(tcph->ack_seq) - ackoff);
+		inet_proto_csum_replace4(&tcph->check, skb, tcph->ack_seq,
+					 newack, false);
+
+		pr_debug("Adjusting ack number from %u->%u\n",
+			 ntohl(tcph->ack_seq), ntohl(newack));
+		tcph->ack_seq = newack;
+	}
 
 	res = nf_ct_sack_adjust(skb, protoff, tcph, ct, ctinfo);
 	spin_unlock_bh(&ct->lock);
-- 
1.9.1
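The behavioural change of this patch (only touch ack_seq when the ACK flag is actually set) can be modelled with a minimal stand-in header struct. The types below are invented for the sketch; the real nf_ct_seq_adjust() additionally fixes up the TCP checksum and logs via pr_debug:

```c
#include <stdint.h>
#include <stdbool.h>

struct toy_tcphdr {
	uint32_t seq;
	uint32_t ack_seq;
	bool     ack;		/* ACK flag */
};

/* Apply sequence/ack offsets the way the fixed code does: the sequence
 * number is always adjusted, but the ack number is left alone for
 * packets (e.g. some RSTs) that do not carry an ACK, so an all-zero
 * ack_seq is no longer rewritten to a bogus value. */
void toy_seqadj(struct toy_tcphdr *th, uint32_t seqoff, uint32_t ackoff)
{
	th->seq += seqoff;
	if (th->ack)
		th->ack_seq -= ackoff;
}
```

Before the patch the ack adjustment ran unconditionally, which is exactly the case the commit message calls out for RSTs without the ACK flag.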
[PATCH v4 nf] netfilter: seqadj: Drop the packet directly when fail to add seqadj extension to avoid dereference NULL pointer later
From: Gao Feng

When memory is exhausted, nfct_seqadj_ext_add may fail to add the seqadj
extension, but nf_ct_seqadj_init doesn't check whether a valid seqadj
pointer was obtained via nfct_seqadj. Now drop the packet directly when
adding the seqadj extension fails, to avoid dereferencing a NULL pointer
in nf_ct_seqadj_init later.

Signed-off-by: Gao Feng
---
v4: Drop the packet directly when fail to add seqadj extension;
v3: Remove the warning log when seqadj is null;
v2: Remove the unnecessary seqadj check in nf_ct_seq_adjust
v1: Initial patch

 net/netfilter/nf_conntrack_core.c | 6 +++++-
 net/netfilter/nf_nat_core.c       | 3 ++-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index dd2c43a..dfa76ce 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1036,7 +1036,11 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
 		return (struct nf_conntrack_tuple_hash *)ct;
 
 	if (tmpl && nfct_synproxy(tmpl)) {
-		nfct_seqadj_ext_add(ct);
+		if (!nfct_seqadj_ext_add(ct)) {
+			nf_conntrack_free(ct);
+			pr_debug("Can't add seqadj extension\n");
+			return NULL;
+		}
 		nfct_synproxy_ext_add(ct);
 	}
 
diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_core.c
index de31818..b82282a 100644
--- a/net/netfilter/nf_nat_core.c
+++ b/net/netfilter/nf_nat_core.c
@@ -441,7 +441,8 @@ nf_nat_setup_info(struct nf_conn *ct,
 		ct->status |= IPS_DST_NAT;
 
 		if (nfct_help(ct))
-			nfct_seqadj_ext_add(ct);
+			if (!nfct_seqadj_ext_add(ct))
+				return NF_DROP;
 	}
 
 	if (maniptype == NF_NAT_MANIP_SRC) {
-- 
1.9.1
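The failure handling added here is the usual "allocate extension, back out on failure" pattern. The stand-in types and the TOY_DROP return code below are illustrative only, loosely mirroring NF_DROP and the conntrack extension API:

```c
#include <stddef.h>

#define TOY_ACCEPT 1
#define TOY_DROP   0

struct toy_ct {
	void *seqadj;	/* NULL until the extension is added */
};

/* Stand-in for nfct_seqadj_ext_add(): may fail under memory pressure,
 * modelled here with an explicit 'oom' flag. */
void *toy_seqadj_ext_add(struct toy_ct *ct, int oom)
{
	if (oom)
		return NULL;
	ct->seqadj = &ct->seqadj;	/* any non-NULL marker */
	return ct->seqadj;
}

/* As in the patch: if the extension cannot be added, drop the packet
 * now instead of dereferencing a NULL extension pointer later in
 * nf_ct_seqadj_init(). */
int toy_setup(struct toy_ct *ct, int oom)
{
	if (!toy_seqadj_ext_add(ct, oom))
		return TOY_DROP;
	return TOY_ACCEPT;
}
```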
Re: [PATCH] RDS: Simplify code
On 9/4/16 11:23 AM, Leon Romanovsky wrote:
> On Sun, Sep 04, 2016 at 05:57:20PM +0200, Christophe JAILLET wrote:
>> On 04/09/2016 at 14:20, Leon Romanovsky wrote:
>>> On Sat, Sep 03, 2016 at 07:33:29AM +0200, Christophe JAILLET wrote:
>>>> Calling 'list_splice' followed by 'INIT_LIST_HEAD' is equivalent to
>>>> 'list_splice_init'.
>>>
>>> It is not 100% accurate:
>>>
>>>     list_splice(y, z)
>>>     INIT_LIST_HEAD(y)
>>> ==>
>>>     if (!list_empty(y))
>>>         __list_splice(y, z, z->next);
>>>     INIT_LIST_HEAD(y)
>>>
>>> and not
>>>
>>>     if (!list_empty(y)) {
>>>         __list_splice(y, z, z->next);
>>>         INIT_LIST_HEAD(y)
>>>     }
>>>
>>> as list_splice_init will do.
>>
>> You are right, but if you dig further you will see that calling
>> INIT_LIST_HEAD on an empty list is a no-op (AFAIK). And if this list
>> was not already correctly initialized, then you would have some other
>> troubles.
>
> Thank you for the suggestion. It looks like the code after that can be
> skipped in case the loop_conns list is empty; the tmp_list will be
> empty too.
>
>  174         list_for_each_entry_safe(lc, _lc, &tmp_list, loop_node) {
>  175                 WARN_ON(lc->conn->c_passive);
>  176                 rds_conn_destroy(lc->conn);
>  177         }

Thanks for trying. As already pointed out, your change doesn't simplify
much, rather it changes the behavior. The loop cursor already takes care
of the list-empty case. I don't see any reason to change that code.

Regards,
Santosh
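The equivalence under discussion can be checked with a userspace copy of the kernel's list primitives, trimmed to what the argument needs. This is a sketch for illustration, not the kernel's full list.h:

```c
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static int list_empty(const struct list_head *h)
{
	return h->next == h;
}

static void list_add(struct list_head *new, struct list_head *head)
{
	head->next->prev = new;
	new->next = head->next;
	new->prev = head;
	head->next = new;
}

static void __list_splice(const struct list_head *list,
			  struct list_head *prev, struct list_head *next)
{
	struct list_head *first = list->next, *last = list->prev;

	first->prev = prev;
	prev->next = first;
	last->next = next;
	next->prev = last;
}

/* list_splice(): leaves @list untouched (and thus stale) when non-empty. */
static void list_splice(struct list_head *list, struct list_head *head)
{
	if (!list_empty(list))
		__list_splice(list, head, head->next);
}

/* list_splice_init(): the combined form the patch switches to. The only
 * structural difference from list_splice() + INIT_LIST_HEAD() is that
 * the reinit runs inside the if -- observably a no-op distinction,
 * since INIT_LIST_HEAD on an already-empty list changes nothing. */
static void list_splice_init(struct list_head *list, struct list_head *head)
{
	if (!list_empty(list)) {
		__list_splice(list, head, head->next);
		INIT_LIST_HEAD(list);
	}
}
```

Running both forms on a one-element list ends in the same state, which is Christophe's point: the two sequences differ only when INIT_LIST_HEAD would be a no-op anyway.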
Re: [PATCH v3 2/6] cgroup: add support for eBPF programs
On 9/5/16 2:40 PM, Sargun Dhillon wrote: On Mon, Sep 05, 2016 at 04:49:26PM +0200, Daniel Mack wrote: Hi, On 08/30/2016 01:04 AM, Sargun Dhillon wrote: On Fri, Aug 26, 2016 at 09:58:48PM +0200, Daniel Mack wrote: This patch adds two sets of eBPF program pointers to struct cgroup. One for such that are directly pinned to a cgroup, and one for such that are effective for it. To illustrate the logic behind that, assume the following example cgroup hierarchy. A - B - C \ D - E If only B has a program attached, it will be effective for B, C, D and E. If D then attaches a program itself, that will be effective for both D and E, and the program in B will only affect B and C. Only one program of a given type is effective for a cgroup. How does this work when running an orchestrator within an orchestrator? The Docker in Docker / Mesos in Mesos use case, where the top level orchestrator is observing the traffic, and there is an orchestrator within that also needs to run it. In this case, I'd like to run E's filter, then if it returns 0, D's, and B's, and so on. Running multiple programs was an idea I had in one of my earlier drafts, but after some discussion, I refrained from it again because potentially walking the cgroup hierarchy on every packet is just too expensive. I think you're correct here. Maybe this is something I do with the LSM-attached filters, and not for skb filters. Do you think there might be a way to opt-in to this option? Is it possible to allow this, either by flattening out the datastructure (copy a ref to the bpf programs to C and E) or something similar? That would mean we carry a list of eBPF program pointers of dynamic size. IOW, the deeper inside the cgroup hierarchy, the bigger the list, so it can store a reference to all programs of all of its ancestors. While I think that would be possible, even at some later point, I'd really like to avoid it for the sake of simplicity. Is there any reason why this can't be done in userspace? 
Compile a program X for A, and overload it with Y, with Y doing the same as X but adding some extra checks? Note that all users of the bpf(2) syscall API will need CAP_NET_ADMIN anyway, so there is no delegation to unprivileged sub-orchestrators or anything alike really. One of the use-cases that's becoming more and more common is containers-in-containers. In this, you have a privileged container that's running something like build orchestration, and you want to do macro-isolation (say limit access to only that tenant's infrastructure). Then, when the build orchestrator runs a build, it may want to monitor, and further isolate the tasks that run in the build job. This is a side-effect of composing different container technologies. Typically you use one system for images, then another for orchestration, and the actual program running inside of it can also leverage containerization. Example: K8s->Docker->Jenkins Agent->Jenkins Build Job frankly I don't buy this argument, since the above and other 'examples' of container-in-container look fake to me. There is a ton of work to be done for such a scheme to be even remotely feasible. The cgroup+bpf stuff would be the last on my list to 'fix' for such deployments. I don't think we should worry about it at present.
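The "only one program of a given type is effective" rule from the patch reduces to a nearest-ancestor lookup, which is what keeps the per-packet cost constant, and why the proposed E-then-D-then-B chaining would instead require walking the hierarchy. A hedged model of that rule (not the kernel's data structures):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative model of the cgroup hierarchy from the patch discussion:
 * each cgroup holds an optional attached program and a parent pointer. */
struct cgroup {
	struct cgroup *parent;
	const char *attached;	/* stand-in for a bpf_prog pointer */
};

/* Effective program: the cgroup's own program if one is attached,
 * otherwise the nearest ancestor's. Exactly one program of a given
 * type runs; chaining E's, D's and B's would need this whole walk
 * per packet. */
static const char *effective_prog(const struct cgroup *cg)
{
	for (; cg; cg = cg->parent)
		if (cg->attached)
			return cg->attached;
	return NULL;
}
```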
Re: [PATCH v3 2/6] cgroup: add support for eBPF programs
On Mon, Sep 05, 2016 at 04:49:26PM +0200, Daniel Mack wrote: > Hi, > > On 08/30/2016 01:04 AM, Sargun Dhillon wrote: > > On Fri, Aug 26, 2016 at 09:58:48PM +0200, Daniel Mack wrote: > >> This patch adds two sets of eBPF program pointers to struct cgroup. > >> One for such that are directly pinned to a cgroup, and one for such > >> that are effective for it. > >> > >> To illustrate the logic behind that, assume the following example > >> cgroup hierarchy. > >> > >> A - B - C > >> \ D - E > >> > >> If only B has a program attached, it will be effective for B, C, D > >> and E. If D then attaches a program itself, that will be effective for > >> both D and E, and the program in B will only affect B and C. Only one > >> program of a given type is effective for a cgroup. > >> > > How does this work when running an orchestrator within an orchestrator? > > The > > Docker in Docker / Mesos in Mesos use case, where the top level > > orchestrator is > > observing the traffic, and there is an orchestrator within that also needs > > to run > > it. > > > > In this case, I'd like to run E's filter, then if it returns 0, D's, and > > B's, > > and so on. > > Running multiple programs was an idea I had in one of my earlier drafts, > but after some discussion, I refrained from it again because potentially > walking the cgroup hierarchy on every packet is just too expensive. > I think you're correct here. Maybe this is something I do with the LSM-attached filters, and not for skb filters. Do you think there might be a way to opt-in to this option? > > Is it possible to allow this, either by flattening out the > > datastructure (copy a ref to the bpf programs to C and E) or > > something similar? > > That would mean we carry a list of eBPF program pointers of dynamic > size. IOW, the deeper inside the cgroup hierarchy, the bigger the list, > so it can store a reference to all programs of all of its ancestors. 
> > While I think that would be possible, even at some later point, I'd > really like to avoid it for the sake of simplicity. > > Is there any reason why this can't be done in userspace? Compile a > program X for A, and overload it with Y, with Y doing the same as X > but adding some extra checks? Note that all users of the bpf(2) syscall API > will need CAP_NET_ADMIN anyway, so there is no delegation to > unprivileged sub-orchestrators or anything alike really. One of the use-cases that's becoming more and more common is containers-in-containers. In this, you have a privileged container that's running something like build orchestration, and you want to do macro-isolation (say limit access to only that tenant's infrastructure). Then, when the build orchestrator runs a build, it may want to monitor, and further isolate the tasks that run in the build job. This is a side-effect of composing different container technologies. Typically you use one system for images, then another for orchestration, and the actual program running inside of it can also leverage containerization. Example: K8s->Docker->Jenkins Agent->Jenkins Build Job There's also a differentiation of ownership in each of these systems. I would really rather not require a middleware system that all my software has to talk to, because sometimes I'm taking off-the-shelf software (Jenkins), and porting it to containers. I think one of the pieces that's led to the success of cgroups is the straightforward API, and ease of use (and it's getting even easier in v2). It's perfectly fine to give the lower level tasks CAP_NET_ADMIN, because we use something like seccomp-bpf plus some of the work I've been doing with the LSM to prevent the sub-orchestrators from accidentally blowing away the system. Usually, we trust these orchestrators (internal users), so it's more of a precautionary measure as opposed to a true security measure. 
Also, rewriting BPF programs, although pretty straightforward, sounds like a pain to do in userspace, even with a helper. If we were to take people's programs and chain them together via tail call, or similar, I can imagine where rewriting a program might push you over the instruction limit. > > > Thanks, > Daniel >
Re: [PATCH -next v2] net: hns: fix return value check in hns_dsaf_get_cfg()
From: Salil Mehta Date: Mon, 5 Sep 2016 14:20:33 + > This patch will conflict with Doug Ledford's hns-roce's HNS driver. > This might lead to problems later during this merge window of 4.9. You don't need to say this three times. These changes will not be reverted, instead the conflicts will need to simply be resolved during the merges just like any other conflict that ever happens in our trees.
Re: [PATCH for-next 0/2] {IB,net}/hns: Add support of ACPI to the Hisilicon RoCE Driver
From: Salil Mehta Date: Mon, 5 Sep 2016 12:53:07 + > There is a patch in net-next for HNS Ethernet driver which has been accepted. > "b3dc935 net: hns: remove redundant dev_err call in hns_dsaf_get_cfg()" > > This patch is creating conflict with Doug Ledford's hns-roce branch. > Internally, > we have decided to let all the HNS patches pass through the hns-roce for 4.9 > merge window to facilitate non-conflict merge of pending RoCE driver by Doug > Ledford. > > Though, we are trying to take a grip of the process for this merge window but > somehow this one bypassed the internal process. This will create problems for > hns-roce during merge later this window. Can I request you to drop this patch > for now. We shall re-submit this patch later when things have been settled at > the RoCE/RDMA end or would come again to you through hns-roce branch. > > Please let me know if this is possible this time. Thanks in anticipation! This cannot work like this. If I see an obvious fix to something in networking, I will apply the patch. Merge conflicts happen and they have to be resolved. We don't just revert legitimate changes to eliminate the conflict.
Re: [PATCH v4 4/5] net: stmmac: add a glue driver for the Amlogic Meson 8b / GXBB DWMAC
On Mon, Sep 5, 2016 at 12:53 PM, Arnd Bergmann wrote: > On Monday, September 5, 2016 9:37:29 AM CEST kbuild test robot wrote: >> All error/warnings (new ones prefixed by >>): >> >> >> drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c:63:18: error: field >> >> 'm250_mux' has incomplete type >> struct clk_mux m250_mux; >> ^ >> >> drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c:67:21: error: field >> >> 'm250_div' has incomplete type >> struct clk_divider m250_div; >> ^ >> > > I think this needs a compile-time dependency on COMMON_CLK Indeed, since we are also a clock provider we have to depend on CONFIG_COMMON_CLK. That brings up a question though: so far the new driver uses the same Kconfig symbol as the "old" driver (CONFIG_DWMAC_MESON). The "old" driver does not need CONFIG_COMMON_CLK while the new one does. I see a few options here: 1. simply adding the dependency (as most configurations will have CONFIG_COMMON_CLK enabled anyways) 2. add some "depends on COMMON_CLK || MACH_MESON6 || MACH_MESON8" foo 3. use a new Kconfig symbol for the new driver (CONFIG_DWMAC_MESON8B?) And finally regarding your other mail: I have already changed WARN_ON(PTR_ERR_OR_ZERO(...)) to WARN_ON(IS_ERR(...)) in v4 Regards, Martin
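If option 1 were chosen, the Kconfig change might look like the sketch below. CONFIG_DWMAC_MESON is the symbol named in the thread, but the prompt text and exact dependency line here are assumptions for illustration, not the merged fix:

```
config DWMAC_MESON
	tristate "Amlogic Meson dwmac support"
	depends on OF && COMMON_CLK && (ARCH_MESON || COMPILE_TEST)
	help
	  Support for the Ethernet controllers on Amlogic Meson SoCs,
	  including the Meson 8b / GXBB glue layer that registers its
	  own clock dividers and therefore needs COMMON_CLK.
```

Option 3 would instead add a second `config DWMAC_MESON8B` entry carrying the COMMON_CLK dependency, leaving the old symbol untouched.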
Re: [PATCH] rxrpc: initialize sched to false to ensure it is not a garbage value
Colin King wrote: > From: Colin Ian King > > sched will be uninitialized (and contain a garbage value) in the case > where call->state >= RXRPC_CALL_DEAD; fix this by initializing sched > to false to avoid an inadvertent call to rxrpc_queue_call. > > Signed-off-by: Colin Ian King I already have a patch queued for this, thanks. David
Re: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
On 09/05/2016 08:32 PM, Alexei Starovoitov wrote: > On 9/5/16 10:09 AM, Daniel Borkmann wrote: >> On 09/05/2016 04:09 PM, Daniel Mack wrote: >>> I really don't think it's worth sparing 8 bytes here and then do the >>> binary compat dance after flags are added, for no real gain. >> >> Sure, but there's not much of a dance needed, see for example how map_flags >> were added some time ago. So, iff there's really no foreseeable use-case in >> sight and since we have this flexibility in place already, then I don't >> quite >> follow why it's needed, if there's zero pain to add it later on. I would >> understand it of course, if it cannot be handled later on anymore. > > I agree with Daniel B. Since flags are completely unused right now, > there is no plan to use it for anything in the coming months and > even worse they make annoying hole in the struct, let's not > add them. We can safely do that later. CHECK_ATTR() allows us to > do it easily. It's not like syscall where flags are must have, > since we cannot add it later. Here it's done trivially. Okay then. If you both agree, I won't interfere :) Daniel
Re: [PATCH v3 nf] netfilter: seqadj: Fix one possible panic in seqadj when mem is exhausted
On Sat, Sep 03, 2016 at 07:51:50PM +0800, f...@ikuai8.com wrote: > From: Gao Feng > > When memory is exhausted, nfct_seqadj_ext_add may fail to add the seqadj > extension. But the function nf_ct_seqadj_init doesn't check whether it got a > valid seqadj pointer from nfct_seqadj, while other functions perform the > sanity check. > > So the system would panic when nfct_seqadj_ext_add failed. > > Signed-off-by: Gao Feng > --- > v3: Remove the warning log when seqadj is null; > v2: Remove the unnecessary seqadj check in nf_ct_seq_adjust > v1: Initial patch > > net/netfilter/nf_conntrack_seqadj.c | 5 - > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/net/netfilter/nf_conntrack_seqadj.c > b/net/netfilter/nf_conntrack_seqadj.c > index dff0f0c..7f8d814 100644 > --- a/net/netfilter/nf_conntrack_seqadj.c > +++ b/net/netfilter/nf_conntrack_seqadj.c > @@ -16,9 +16,12 @@ int nf_ct_seqadj_init(struct nf_conn *ct, enum > ip_conntrack_info ctinfo, > if (off == 0) > return 0; > > + seqadj = nfct_seqadj(ct); > + if (unlikely(!seqadj)) > + return 0; I think we should handle this from init_conntrack() by simply dropping the packet as we do under similar circumstances (too stressed to deal).
Re: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
On 9/5/16 10:09 AM, Daniel Borkmann wrote: On 09/05/2016 04:09 PM, Daniel Mack wrote: On 09/05/2016 03:56 PM, Daniel Borkmann wrote: On 09/05/2016 02:54 PM, Daniel Mack wrote: On 08/30/2016 01:00 AM, Daniel Borkmann wrote: On 08/26/2016 09:58 PM, Daniel Mack wrote: enum bpf_map_type { @@ -147,6 +149,13 @@ union bpf_attr { __aligned_u64pathname; __u32bpf_fd; }; + +struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */ +__u32target_fd;/* container object to attach to */ +__u32attach_bpf_fd;/* eBPF program to attach */ +__u32attach_type;/* BPF_ATTACH_TYPE_* */ +__u64attach_flags; Could we just do ... __u32 dst_fd; __u32 src_fd; __u32 attach_type; ... and leave flags out, since unused anyway? Also see below. I'd really like to keep the flags, even if they're unused right now. This only adds 8 bytes during the syscall operation, so it doesn't harm. However, we cannot change the userspace API after the fact, and who knows what this (rather generic) interface will be used for later on. With the below suggestion added, then flags doesn't need to be added currently as it can be done safely at a later point in time with respecting old binaries. See also the syscall handling code in kernel/bpf/syscall.c +825 and the CHECK_ATTR() macro. The underlying idea of this was taken from perf_event_open() syscall back then, see [1] for a summary. [1] https://lkml.org/lkml/2014/8/26/116 Yes, I know that's possible, and I like the idea, but I don't think any new interface should come without flags really, as flags are something that will most certainly be needed at some point anyway. I didn't have them in my first shot, but Alexei pointed out that they should be added, and I agree. Also, this optimization wouldn't make the transported struct payload any smaller anyway, because the member of that union used by BPF_PROG_LOAD is still by far the biggest. 
I really don't think it's worth sparing 8 bytes here and then do the binary compat dance after flags are added, for no real gain. Sure, but there's not much of a dance needed, see for example how map_flags were added some time ago. So, iff there's really no foreseeable use-case in sight and since we have this flexibility in place already, then I don't quite follow why it's needed, if there's zero pain to add it later on. I would understand it of course, if it cannot be handled later on anymore. I agree with Daniel B. Since flags are completely unused right now, there is no plan to use it for anything in the coming months and even worse they make annoying hole in the struct, let's not add them. We can safely do that later. CHECK_ATTR() allows us to do it easily. It's not like syscall where flags are must have, since we cannot add it later. Here it's done trivially.
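The CHECK_ATTR()-style forward compatibility both Daniels refer to works because the kernel rejects any nonzero bytes past the fields a given command understands, so a flags field can be appended later without breaking old binaries. A standalone sketch of the idea (the field names mirror the patch, but this is not the real uapi struct or macro):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative model of bpf(2)'s CHECK_ATTR() trick: the union can grow
 * over time, and old commands stay valid because the kernel verifies
 * that every byte past the fields it understands is zero. */
struct attach_attr {
	unsigned int target_fd;
	unsigned int attach_bpf_fd;
	unsigned int attach_type;
	/* a later kernel could append: __u64 attach_flags; */
};

union fake_bpf_attr {
	struct attach_attr attach;
	unsigned char raw[64];	/* pretend the whole union is this big */
};

/* Returns nonzero if every byte after `known` is zero, i.e. userspace
 * did not set fields this kernel doesn't know about. */
static int tail_is_zero(const union fake_bpf_attr *attr, size_t known)
{
	size_t i;

	for (i = known; i < sizeof(*attr); i++)
		if (attr->raw[i])
			return 0;
	return 1;
}
```

An old kernel running a new binary that sets a "future" field sees a nonzero tail and can return -EINVAL, which is what makes adding flags later trivially safe.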
Re: 6pack: stack-out-of-bounds in sixpack_receive_buf
On Mon, Sep 5, 2016 at 7:49 PM, One Thousand Gnomes wrote: >> different runs). Looking at code, the following looks suspicious -- we >> limit copy by 512 bytes, but use the original count which can be >> larger than 512: >> >> static void sixpack_receive_buf(struct tty_struct *tty, >> const unsigned char *cp, char *fp, int count) >> { >> unsigned char buf[512]; >> >> memcpy(buf, cp, count < sizeof(buf) ? count : sizeof(buf)); >> >> sixpack_decode(sp, buf, count1); >> >> >> On commit 0f98f121e1670eaa2a2fbb675e07d6ba7f0e146f of linux-next. > > With the sane tty locking we now have I believe the following is safe as > we consume the bytes and move them into the decoded buffer before > returning. > > diff --git a/drivers/net/hamradio/6pack.c b/drivers/net/hamradio/6pack.c > index 5a1e985..470b3dc 100644 > --- a/drivers/net/hamradio/6pack.c > +++ b/drivers/net/hamradio/6pack.c > @@ -127,7 +127,7 @@ struct sixpack { > > #define AX25_6PACK_HEADER_LEN 0 > > -static void sixpack_decode(struct sixpack *, unsigned char[], int); > +static void sixpack_decode(struct sixpack *, const unsigned char[], int); > static int encode_sixpack(unsigned char *, unsigned char *, int, unsigned > char); > > /* > @@ -428,7 +428,7 @@ out: > > /* > * Handle the 'receiver data ready' interrupt. > - * This function is called by the 'tty_io' module in the kernel when > + * This function is called by the tty module in the kernel when > * a block of 6pack data has been received, which can now be decapsulated > * and sent on to some IP layer for further processing. > */ > @@ -436,7 +436,6 @@ static void sixpack_receive_buf(struct tty_struct *tty, > const unsigned char *cp, char *fp, int count) > { > struct sixpack *sp; > - unsigned char buf[512]; > int count1; > > if (!count) > @@ -446,10 +445,7 @@ static void sixpack_receive_buf(struct tty_struct *tty, > if (!sp) > return; > > - memcpy(buf, cp, count < sizeof(buf) ? 
count : sizeof(buf)); > - > /* Read the characters out of the buffer */ > - > count1 = count; > while (count) { > count--; > @@ -459,7 +455,7 @@ static void sixpack_receive_buf(struct tty_struct *tty, > continue; > } > } > - sixpack_decode(sp, buf, count1); > + sixpack_decode(sp, cp, count1); > > sp_put(sp); > tty_unthrottle(tty); > @@ -992,7 +988,7 @@ static void decode_std_command(struct sixpack *sp, > unsigned char cmd) > /* decode a 6pack packet */ > > static void > -sixpack_decode(struct sixpack *sp, unsigned char *pre_rbuff, int count) > +sixpack_decode(struct sixpack *sp, const unsigned char *pre_rbuff, int count) > { > unsigned char inbyte; > int count1; Applied locally for testing. I will notify if I see this bug again. Thanks!
Re: 6pack: stack-out-of-bounds in sixpack_receive_buf
> different runs). Looking at code, the following looks suspicious -- we > limit copy by 512 bytes, but use the original count which can be > larger than 512: > > static void sixpack_receive_buf(struct tty_struct *tty, > const unsigned char *cp, char *fp, int count) > { > unsigned char buf[512]; > > memcpy(buf, cp, count < sizeof(buf) ? count : sizeof(buf)); > > sixpack_decode(sp, buf, count1); > > > On commit 0f98f121e1670eaa2a2fbb675e07d6ba7f0e146f of linux-next. With the sane tty locking we now have I believe the following is safe as we consume the bytes and move them into the decoded buffer before returning. diff --git a/drivers/net/hamradio/6pack.c b/drivers/net/hamradio/6pack.c index 5a1e985..470b3dc 100644 --- a/drivers/net/hamradio/6pack.c +++ b/drivers/net/hamradio/6pack.c @@ -127,7 +127,7 @@ struct sixpack { #define AX25_6PACK_HEADER_LEN 0 -static void sixpack_decode(struct sixpack *, unsigned char[], int); +static void sixpack_decode(struct sixpack *, const unsigned char[], int); static int encode_sixpack(unsigned char *, unsigned char *, int, unsigned char); /* @@ -428,7 +428,7 @@ out: /* * Handle the 'receiver data ready' interrupt. - * This function is called by the 'tty_io' module in the kernel when + * This function is called by the tty module in the kernel when * a block of 6pack data has been received, which can now be decapsulated * and sent on to some IP layer for further processing. */ @@ -436,7 +436,6 @@ static void sixpack_receive_buf(struct tty_struct *tty, const unsigned char *cp, char *fp, int count) { struct sixpack *sp; - unsigned char buf[512]; int count1; if (!count) @@ -446,10 +445,7 @@ static void sixpack_receive_buf(struct tty_struct *tty, if (!sp) return; - memcpy(buf, cp, count < sizeof(buf) ? 
count : sizeof(buf)); - /* Read the characters out of the buffer */ - count1 = count; while (count) { count--; @@ -459,7 +455,7 @@ static void sixpack_receive_buf(struct tty_struct *tty, continue; } } - sixpack_decode(sp, buf, count1); + sixpack_decode(sp, cp, count1); sp_put(sp); tty_unthrottle(tty); @@ -992,7 +988,7 @@ static void decode_std_command(struct sixpack *sp, unsigned char cmd) /* decode a 6pack packet */ static void -sixpack_decode(struct sixpack *sp, unsigned char *pre_rbuff, int count) +sixpack_decode(struct sixpack *sp, const unsigned char *pre_rbuff, int count) { unsigned char inbyte; int count1;
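The underlying bug is a classic one: the copy into the 512-byte stack buffer was clamped, but the length later handed to the decoder was not. A standalone sketch of the broken versus fixed length handling (the applied patch goes further and removes the bounce buffer entirely, decoding straight from `cp`):

```c
#include <assert.h>
#include <stddef.h>

#define LOCAL_BUF 512

/* Models the original sixpack_receive_buf() bug: the memcpy() length is
 * clamped to the 512-byte stack buffer, but the decode length is not,
 * so any count > 512 makes the decoder read past the end of buf. */
static size_t bounced_decode_len(size_t count)
{
	size_t copied = count < LOCAL_BUF ? count : LOCAL_BUF;

	(void)copied;
	return count;	/* BUG: should be `copied` */
}

/* Safe variant: every later use of the length is clamped exactly the
 * same way as the copy itself. */
static size_t clamped_decode_len(size_t count)
{
	return count < LOCAL_BUF ? count : LOCAL_BUF;
}
```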
Re: [PATCH 02/29] netfilter: physdev: add missed blank
On Mon, 2016-09-05 at 12:58 +0200, Pablo Neira Ayuso wrote: [] > diff --git a/net/netfilter/xt_physdev.c b/net/netfilter/xt_physdev.c [] > @@ -107,8 +107,8 @@ static int physdev_mt_check(const struct xt_mtchk_param > *par) > info->invert & XT_PHYSDEV_OP_BRIDGED) && > par->hook_mask & ((1 << NF_INET_LOCAL_OUT) | > (1 << NF_INET_FORWARD) | (1 << NF_INET_POST_ROUTING))) { > - pr_info("using --physdev-out and --physdev-is-out are only" > - "supported in the FORWARD and POSTROUTING chains with" > + pr_info("using --physdev-out and --physdev-is-out are only " > + "supported in the FORWARD and POSTROUTING chains with " > "bridged traffic.\n"); > if (par->hook_mask & (1 << NF_INET_LOCAL_OUT)) > return -EINVAL; Perhaps it would be reasonable at some point to coalesce all the string fragments. Maybe using this could help: $ git ls-files -- "net/netfilter/*.[ch]" | \ xargs ./scripts/checkpatch.pl -f --types=split_string --fix-inplace
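The bug the patch fixes comes from C's adjacent-string-literal concatenation: split literals are glued together with no separator, so the missing trailing spaces produced run-together words in the log message. A minimal demonstration of the before and after:

```c
#include <string.h>

/* Before the fix: adjacent literals are concatenated verbatim, so the
 * missing trailing spaces yield "onlysupported" and "withbridged". */
static const char *broken_msg(void)
{
	return "using --physdev-out and --physdev-is-out are only"
	       "supported in the FORWARD and POSTROUTING chains with"
	       "bridged traffic.\n";
}

/* After the fix (and what checkpatch --fix-inplace would coalesce into
 * one literal): trailing spaces restore the word boundaries. */
static const char *fixed_msg(void)
{
	return "using --physdev-out and --physdev-is-out are only "
	       "supported in the FORWARD and POSTROUTING chains with "
	       "bridged traffic.\n";
}
```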
Re: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
On Mon, 2016-09-05 at 14:56 +0200, Daniel Mack wrote: > On 08/27/2016 02:08 AM, Alexei Starovoitov wrote: [] > > + switch (attr->attach_type) { > > + case BPF_ATTACH_TYPE_CGROUP_INET_INGRESS: > > + case BPF_ATTACH_TYPE_CGROUP_INET_EGRESS: { > > + struct cgroup *cgrp; > > + > > + prog = bpf_prog_get_type(attr->attach_bpf_fd, > > + BPF_PROG_TYPE_CGROUP_SOCKET_FILTER); > > + if (IS_ERR(prog)) > > + return PTR_ERR(prog); > > + > > + cgrp = cgroup_get_from_fd(attr->target_fd); > > + if (IS_ERR(cgrp)) { > > + bpf_prog_put(prog); > > + return PTR_ERR(cgrp); > > + } > > + > > + cgroup_bpf_update(cgrp, prog, attr->attach_type); > > + cgroup_put(cgrp); > > + > > + break; > > + } > this } formatting style is confusing. The above } looks > like it matches 'switch () {'. > May be move 'struct cgroup *cgrp' to the top to avoid that? This style of case statements that declare local variables with an open brace after the case statement switch (bar) { [cases...] case foo: { local declarations; code... } [cases...] } is used quite frequently in the kernel. I think it's fine as is.
Re: [PATCH net-next v3] gso: Support partial splitting at the frag_list pointer
On Mon, Sep 5, 2016 at 3:47 AM, Steffen Klassert wrote: > Since commit 8a29111c7 ("net: gro: allow to build full sized skb") > gro may build buffers with a frag_list. This can hurt forwarding > because most NICs can't offload such packets, they need to be > segmented in software. This patch splits buffers with a frag_list > at the frag_list pointer into buffers that can be TSO offloaded. > > Signed-off-by: Steffen Klassert > --- > > Changes since v1: > > - Use the assumption that all buffers in the chain excluding the last > containing the same amount of data. > > - Simplify some checks against gso partial. > > - Fix the generation of IP IDs. > > Changes since v2: > > - Merge common code of gso partial and frag_list pointer splitting. > > net/core/skbuff.c | 50 > +++--- > net/ipv4/af_inet.c | 14 ++ > net/ipv4/gre_offload.c | 6 -- > net/ipv4/tcp_offload.c | 13 +++-- > net/ipv4/udp_offload.c | 6 -- > net/ipv6/ip6_offload.c | 5 - > 6 files changed, 68 insertions(+), 26 deletions(-) > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index 3864b4b6..f754d47 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -3078,11 +3078,30 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb, > sg = !!(features & NETIF_F_SG); > csum = !!can_checksum_protocol(features, proto); > > - /* GSO partial only requires that we trim off any excess that > -* doesn't fit into an MSS sized block, so take care of that > -* now. > -*/ > - if (sg && csum && (features & NETIF_F_GSO_PARTIAL)) { > + if (sg && csum && (mss != GSO_BY_FRAGS)) { > + if (!(features & NETIF_F_GSO_PARTIAL)) { > + if (list_skb && > + net_gso_ok(features, > skb_shinfo(head_skb)->gso_type)) { The testing logic here is a bit off. You need to prevent us from doing the partial_segs bit below if NETIF_F_GSO_PARTIAL is not set and if your list_skb or net_gso_ok tests fail. 
Since as you pointed out we shouldn't ever be trying to perform GSO_PARTIAL on a frame that has a frag_list, what you might do is something like: if (!list_skb || !net_gso_ok(...)) goto normal; That way we don't setup partial_segs unless we are actually using it. > + struct sk_buff *iter; > + > + /* Split the buffer at the frag_list pointer. > +* This is based on the assumption that all > +* buffers in the chain excluding the last > +* containing the same amount of data. > +*/ > + skb_walk_frags(head_skb, iter) { > + if (skb_headlen(iter)) > + goto normal; > + > + len -= iter->len; > + } > + } > + } > + > + /* GSO partial only requires that we trim off any excess that > +* doesn't fit into an MSS sized block, so take care of that > +* now. > +*/ > partial_segs = len / mss; > if (partial_segs > 1) > mss *= partial_segs; > @@ -3090,6 +3109,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb, > partial_segs = 0; > } > > +normal: > headroom = skb_headroom(head_skb); > pos = skb_headlen(head_skb); > > @@ -3281,21 +3301,29 @@ perform_csum_check: > */ > segs->prev = tail; > > - /* Update GSO info on first skb in partial sequence. */ > if (partial_segs) { > + struct sk_buff *iter; > int type = skb_shinfo(head_skb)->gso_type; > + unsigned short gso_size = skb_shinfo(head_skb)->gso_size; > > /* Update type to add partial and then remove dodgy if set */ > - type |= SKB_GSO_PARTIAL; > + type |= (features & NETIF_F_GSO_PARTIAL) / > NETIF_F_GSO_PARTIAL * SKB_GSO_PARTIAL; > type &= ~SKB_GSO_DODGY; > > /* Update GSO info and prepare to start updating headers on > * our way back down the stack of protocols. 
> */ > - skb_shinfo(segs)->gso_size = skb_shinfo(head_skb)->gso_size; > - skb_shinfo(segs)->gso_segs = partial_segs; > - skb_shinfo(segs)->gso_type = type; > - SKB_GSO_CB(segs)->data_offset = skb_headroom(segs) + doffset; > + for (iter = segs; iter; iter = iter->next) { > + skb_shinfo(iter)->gso_size = gso_size; > + skb_shinfo(iter)->gso_segs = partial_segs; > + skb_shinfo(iter)->gs
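The partial_segs arithmetic discussed in this thread trims the payload to a whole number of MSS-sized blocks, so one "super-segment" of partial_segs * mss bytes can be handed to the NIC. A standalone sketch of just that calculation (the function name is illustrative):

```c
/* Models the GSO-partial trimming arithmetic in skb_segment(): the
 * payload is cut into blocks of partial_segs * mss bytes, leaving only
 * the excess that doesn't fill a whole MSS block for the tail segment. */
static unsigned int gso_partial_mss(unsigned int len, unsigned int mss)
{
	unsigned int partial_segs = len / mss;

	if (partial_segs > 1)
		return mss * partial_segs;	/* one big TSO-able block */
	return mss;	/* not worth it, keep the plain mss */
}
```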
[patch net-next v7 2/3] net: core: Add offload stats to if_stats_msg
From: Nogah Frankel Add a nested attribute of offload stats to if_stats_msg named IFLA_STATS_LINK_OFFLOAD_XSTATS. Under it, add SW stats, meaning stats only per packets that went via slowpath to the cpu, named IFLA_OFFLOAD_XSTATS_CPU_HIT. Signed-off-by: Nogah Frankel Signed-off-by: Jiri Pirko --- include/uapi/linux/if_link.h | 10 + net/core/rtnetlink.c | 88 ++-- 2 files changed, 94 insertions(+), 4 deletions(-) diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h index 9bf3aec..4aaa2a1 100644 --- a/include/uapi/linux/if_link.h +++ b/include/uapi/linux/if_link.h @@ -826,6 +826,7 @@ enum { IFLA_STATS_LINK_64, IFLA_STATS_LINK_XSTATS, IFLA_STATS_LINK_XSTATS_SLAVE, + IFLA_STATS_LINK_OFFLOAD_XSTATS, __IFLA_STATS_MAX, }; @@ -845,6 +846,15 @@ enum { }; #define LINK_XSTATS_TYPE_MAX (__LINK_XSTATS_TYPE_MAX - 1) +/* These are stats embedded into IFLA_STATS_LINK_OFFLOAD_XSTATS */ +enum { + IFLA_OFFLOAD_XSTATS_UNSPEC, + IFLA_OFFLOAD_XSTATS_CPU_HIT, /* struct rtnl_link_stats64 */ + __IFLA_OFFLOAD_XSTATS_MAX +}; +#define IFLA_OFFLOAD_XSTATS_MAX (__IFLA_OFFLOAD_XSTATS_MAX - 1) +#define IFLA_OFFLOAD_XSTATS_FIRST (IFLA_OFFLOAD_XSTATS_UNSPEC + 1) + /* XDP section */ enum { diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c index 1dfca1c..95eb131 100644 --- a/net/core/rtnetlink.c +++ b/net/core/rtnetlink.c @@ -3577,6 +3577,74 @@ static bool stats_attr_valid(unsigned int mask, int attrid, int idxattr) (!idxattr || idxattr == attrid); } +static int rtnl_get_offload_stats_attr_size(int attr_id) +{ + switch (attr_id) { + case IFLA_OFFLOAD_XSTATS_CPU_HIT: + return sizeof(struct rtnl_link_stats64); + } + + return 0; +} + +static int rtnl_get_offload_stats(struct sk_buff *skb, struct net_device *dev) +{ + struct nlattr *attr; + int attr_id, size; + void *attr_data; + int err; + + if (!(dev->netdev_ops && dev->netdev_ops->ndo_has_offload_stats && + dev->netdev_ops->ndo_get_offload_stats)) + return 0; + + for (attr_id = IFLA_OFFLOAD_XSTATS_FIRST; +attr_id <= 
IFLA_OFFLOAD_XSTATS_MAX; attr_id++) { + size = rtnl_get_offload_stats_attr_size(attr_id); + if (!size) + continue; + + if (!dev->netdev_ops->ndo_has_offload_stats(attr_id)) + continue; + + attr = nla_reserve_64bit(skb, attr_id, size, +IFLA_OFFLOAD_XSTATS_UNSPEC); + if (!attr) + return -EMSGSIZE; + + attr_data = nla_data(attr); + memset(attr_data, 0, size); + err = dev->netdev_ops->ndo_get_offload_stats(attr_id, dev, +attr_data); + if (err) + return err; + } + + return 0; +} + +static int rtnl_get_offload_stats_size(const struct net_device *dev) +{ + int size = 0; + int attr_id; + + if (!(dev->netdev_ops && dev->netdev_ops->ndo_has_offload_stats && + dev->netdev_ops->ndo_get_offload_stats)) + return nla_total_size(0); + + for (attr_id = IFLA_OFFLOAD_XSTATS_FIRST; +attr_id <= IFLA_OFFLOAD_XSTATS_MAX; attr_id++) { + if (!dev->netdev_ops->ndo_has_offload_stats(attr_id)) + continue; + + size += rtnl_get_offload_stats_attr_size(attr_id); + } + + size += nla_total_size(0); + + return size; +} + static int rtnl_fill_statsinfo(struct sk_buff *skb, struct net_device *dev, int type, u32 pid, u32 seq, u32 change, unsigned int flags, unsigned int filter_mask, @@ -3586,6 +3654,7 @@ static int rtnl_fill_statsinfo(struct sk_buff *skb, struct net_device *dev, struct nlmsghdr *nlh; struct nlattr *attr; int s_prividx = *prividx; + int err; ASSERT_RTNL(); @@ -3614,8 +3683,6 @@ static int rtnl_fill_statsinfo(struct sk_buff *skb, struct net_device *dev, const struct rtnl_link_ops *ops = dev->rtnl_link_ops; if (ops && ops->fill_linkxstats) { - int err; - *idxattr = IFLA_STATS_LINK_XSTATS; attr = nla_nest_start(skb, IFLA_STATS_LINK_XSTATS); @@ -3639,8 +3706,6 @@ static int rtnl_fill_statsinfo(struct sk_buff *skb, struct net_device *dev, if (master) ops = master->rtnl_link_ops; if (ops && ops->fill_linkxstats) { - int err; - *idxattr = IFLA_STATS_LINK_XSTATS_SLAVE; attr = nla_nest_start(skb,
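The patch uses the usual netlink two-pass shape: a size pass (rtnl_get_offload_stats_size) that must walk the same attribute ids with the same predicates as the fill pass (rtnl_get_offload_stats), or the reserved message ends up under-sized. A generic standalone sketch of that invariant (the ids and sizes are made up; the real code uses nla_total_size() and nla_reserve_64bit()):

```c
#include <stddef.h>
#include <string.h>

/* Illustrative attribute ids, modeled on IFLA_OFFLOAD_XSTATS_*. */
enum { ATTR_FIRST = 1, ATTR_CPU_HIT = 1, ATTR_MAX = 1 };

static size_t attr_size(int attr_id)
{
	/* stand-in for sizeof(struct rtnl_link_stats64) */
	return attr_id == ATTR_CPU_HIT ? 24 * sizeof(unsigned long long) : 0;
}

/* Models ndo_has_offload_stats(): per-id capability predicate. */
static int supported(int attr_id)
{
	return attr_id == ATTR_CPU_HIT;
}

/* Pass 1: compute how many payload bytes the dump will need. */
static size_t dump_size(void)
{
	size_t size = 0;
	int id;

	for (id = ATTR_FIRST; id <= ATTR_MAX; id++)
		if (supported(id))
			size += attr_size(id);
	return size;
}

/* Pass 2: emit the attributes. It must make exactly the decisions pass 1
 * made, otherwise the reserved buffer is under- or over-sized. */
static size_t dump_fill(unsigned char *buf)
{
	size_t off = 0;
	int id;

	for (id = ATTR_FIRST; id <= ATTR_MAX; id++) {
		size_t sz = attr_size(id);

		if (!sz || !supported(id))
			continue;
		memset(buf + off, 0, sz);  /* reserve + zero, driver fills */
		off += sz;
	}
	return off;
}
```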
[patch net-next v7 1/3] netdevice: Add offload statistics ndo
From: Nogah Frankel Add a new ndo to return statistics for offloaded operations. Since there can be many different offloaded operations with many stats types, the ndo gets an attribute id by which it knows which stats are wanted. The ndo also gets a void pointer to be cast according to the attribute id. Signed-off-by: Nogah Frankel Signed-off-by: Jiri Pirko --- include/linux/netdevice.h | 12 1 file changed, 12 insertions(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 67bb978..2d2c09b 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -924,6 +924,14 @@ struct netdev_xdp { * 3. Update dev->stats asynchronously and atomically, and define *neither operation. * + * bool (*ndo_has_offload_stats)(int attr_id) + * Return true if this device supports offload stats of this attr_id. + * + * int (*ndo_get_offload_stats)(int attr_id, const struct net_device *dev, + * void *attr_data) + * Get statistics for offload operations by attr_id. Write it into the + * attr_data pointer. + * * int (*ndo_vlan_rx_add_vid)(struct net_device *dev, __be16 proto, u16 vid); * If device supports VLAN filtering this function is called when a * VLAN id is registered. @@ -1155,6 +1163,10 @@ struct net_device_ops { struct rtnl_link_stats64* (*ndo_get_stats64)(struct net_device *dev, struct rtnl_link_stats64 *storage); + bool(*ndo_has_offload_stats)(int attr_id); + int (*ndo_get_offload_stats)(int attr_id, +const struct net_device *dev, +void *attr_data); struct net_device_stats* (*ndo_get_stats)(struct net_device *dev); int (*ndo_vlan_rx_add_vid)(struct net_device *dev, -- 2.5.5
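The two-ndo contract above splits capability from retrieval: the has-callback is queried per attribute id when sizing the dump, and the get-callback fills caller-provided storage whose layout is implied by that id. An illustrative model of that contract (not the kernel structs):

```c
#include <stddef.h>
#include <string.h>

/* Illustrative attribute id, modeled on IFLA_OFFLOAD_XSTATS_CPU_HIT. */
enum { XSTATS_CPU_HIT = 1 };

/* Stand-in for struct rtnl_link_stats64. */
struct fake_stats { unsigned long long rx_packets, tx_packets; };

/* Models the new ndo pair: a capability probe keyed by attribute id,
 * plus a getter that writes into caller-provided storage cast according
 * to that id. */
struct fake_ops {
	int (*has_offload_stats)(int attr_id);
	int (*get_offload_stats)(int attr_id, void *attr_data);
};

static int demo_has(int attr_id)
{
	return attr_id == XSTATS_CPU_HIT;
}

static int demo_get(int attr_id, void *attr_data)
{
	struct fake_stats *s = attr_data;

	if (attr_id != XSTATS_CPU_HIT)
		return -1;	/* -EINVAL in the kernel */
	s->rx_packets = 10;	/* made-up counter values */
	s->tx_packets = 20;
	return 0;
}

static const struct fake_ops demo_ops = { demo_has, demo_get };
```

Keeping the two callbacks keyed by the same ids is what lets rtnetlink size the message in one pass and fill it in another without the driver seeing any netlink details.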
[patch net-next v7 3/3] mlxsw: spectrum: Implement offload stats ndo and expose HW stats by default
From: Nogah Frankel Change the default statistics ndo to return HW statistics (like the one returned by ethtool_ops). The HW stats are collected to a cache by delayed work every 1 sec. Implement the offload stat ndo. Add a function to get SW statistics, to be called from this function. Signed-off-by: Nogah Frankel Signed-off-by: Jiri Pirko --- drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 129 +++-- drivers/net/ethernet/mellanox/mlxsw/spectrum.h | 5 + 2 files changed, 127 insertions(+), 7 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c index 6c6b726..2a7f93f 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c @@ -818,9 +818,9 @@ err_span_port_mtu_update: return err; } -static struct rtnl_link_stats64 * -mlxsw_sp_port_get_stats64(struct net_device *dev, - struct rtnl_link_stats64 *stats) +int +mlxsw_sp_port_get_sw_stats64(const struct net_device *dev, +struct rtnl_link_stats64 *stats) { struct mlxsw_sp_port *mlxsw_sp_port = netdev_priv(dev); struct mlxsw_sp_port_pcpu_stats *p; @@ -847,6 +847,107 @@ mlxsw_sp_port_get_stats64(struct net_device *dev, tx_dropped += p->tx_dropped; } stats->tx_dropped = tx_dropped; + return 0; +} + +bool mlxsw_sp_port_has_offload_stats(int attr_id) +{ + switch (attr_id) { + case IFLA_OFFLOAD_XSTATS_CPU_HIT: + return true; + } + + return false; +} + +int mlxsw_sp_port_get_offload_stats(int attr_id, const struct net_device *dev, + void *sp) +{ + switch (attr_id) { + case IFLA_OFFLOAD_XSTATS_CPU_HIT: + return mlxsw_sp_port_get_sw_stats64(dev, sp); + } + + return -EINVAL; +} + +static int mlxsw_sp_port_get_stats_raw(struct net_device *dev, int grp, + int prio, char *ppcnt_pl) +{ + struct mlxsw_sp_port *mlxsw_sp_port = netdev_priv(dev); + struct mlxsw_sp *mlxsw_sp = mlxsw_sp_port->mlxsw_sp; + + mlxsw_reg_ppcnt_pack(ppcnt_pl, mlxsw_sp_port->local_port, grp, prio); + return mlxsw_reg_query(mlxsw_sp->core, 
MLXSW_REG(ppcnt), ppcnt_pl); +} + +static int mlxsw_sp_port_get_hw_stats(struct net_device *dev, + struct rtnl_link_stats64 *stats) +{ + char ppcnt_pl[MLXSW_REG_PPCNT_LEN]; + int err; + + err = mlxsw_sp_port_get_stats_raw(dev, MLXSW_REG_PPCNT_IEEE_8023_CNT, + 0, ppcnt_pl); + if (err) + goto out; + + stats->tx_packets = + mlxsw_reg_ppcnt_a_frames_transmitted_ok_get(ppcnt_pl); + stats->rx_packets = + mlxsw_reg_ppcnt_a_frames_received_ok_get(ppcnt_pl); + stats->tx_bytes = + mlxsw_reg_ppcnt_a_octets_transmitted_ok_get(ppcnt_pl); + stats->rx_bytes = + mlxsw_reg_ppcnt_a_octets_received_ok_get(ppcnt_pl); + stats->multicast = + mlxsw_reg_ppcnt_a_multicast_frames_received_ok_get(ppcnt_pl); + + stats->rx_crc_errors = + mlxsw_reg_ppcnt_a_frame_check_sequence_errors_get(ppcnt_pl); + stats->rx_frame_errors = + mlxsw_reg_ppcnt_a_alignment_errors_get(ppcnt_pl); + + stats->rx_length_errors = ( + mlxsw_reg_ppcnt_a_in_range_length_errors_get(ppcnt_pl) + + mlxsw_reg_ppcnt_a_out_of_range_length_field_get(ppcnt_pl) + + mlxsw_reg_ppcnt_a_frame_too_long_errors_get(ppcnt_pl)); + + stats->rx_errors = (stats->rx_crc_errors + + stats->rx_frame_errors + stats->rx_length_errors); + +out: + return err; +} + +static void update_stats_cache(struct work_struct *work) +{ + struct mlxsw_sp_port *mlxsw_sp_port = + container_of(work, struct mlxsw_sp_port, +hw_stats.update_dw.work); + + if (!netif_carrier_ok(mlxsw_sp_port->dev)) + goto out; + + mlxsw_sp_port_get_hw_stats(mlxsw_sp_port->dev, + mlxsw_sp_port->hw_stats.cache); + +out: + mlxsw_core_schedule_dw(&mlxsw_sp_port->hw_stats.update_dw, + MLXSW_HW_STATS_UPDATE_TIME); +} + +/* Return the stats from a cache that is updated periodically, + * as this function might get called in an atomic context. 
+ */ +static struct rtnl_link_stats64 * +mlxsw_sp_port_get_stats64(struct net_device *dev, + struct rtnl_link_stats64 *stats) +{ + struct mlxsw_sp_port *mlxsw_sp_port = netdev_priv(dev); + + memcpy(stats, mlxsw_sp_port->hw_stats.cache, sizeof(*stats)); + return stats; } @@ -1208,6 +1309,8 @@ static const struct net_device_ops mlxsw_sp_port_netdev_ops = { .ndo_set_mac_address= mlxsw_sp_port_set_mac_a
[patch net-next v7 0/3] return offloaded stats as default and expose original sw stats
From: Jiri Pirko

The problem we try to handle is about offloaded forwarded packets
which are not seen by the kernel. Let me try to draw it:

       port1                 port2 (HW stats are counted here)
          \                  /
           \                /
            \              /
           --(A)-- ASIC --(B)--
                    |
                   (C)
                    |
                   CPU (SW stats are counted here)

Now we have a couple of flows for TX and RX (direction does not matter
here):

1) port1->A->ASIC->C->CPU
   For this flow, HW and SW stats are equal.

2) port1->A->ASIC->C->CPU->C->ASIC->B->port2
   For this flow, HW and SW stats are equal.

3) port1->A->ASIC->B->port2
   For this flow, SW stats are 0.

The purpose of this patchset is to provide a facility for the user to
find out the difference between flows 1+2 and 3. In other words, the
user will be able to see the statistics for the slow path (through the
kernel).

Also note that HW stats are what someone calls "accumulated" stats.
Every packet counted by SW is also counted by HW. Not the other way
around.

By default the accumulated stats (HW) will be exposed to the user so
the userspace apps can react properly.

This patchset adds the SW stats (flows 1+2) under offload related
stats, so in the future we can expose other offload related stats in a
similar way.

---
v6->v7:
- patch 1/3:
  - ndo interface changed to get the wanted stats type as an input.
  - change commit message.
- patch 2/3:
  - create a nesting for offloaded stats and put SW stats under it.
  - change the ndo call to indicate which offload stats we want.
  - change commit message.
- patch 3/3:
  - change ndo implementation to match the changes in the previous
    patches.
  - change commit message.
v5->v6:
- patch 2/4 was dropped as requested by Roopa
- patch 1/3:
  - comment changed to indicate that default stats are combined stats
  - commit message changed
- patch 2/3: (previously 3/4)
  - SW stats return nothing if there is no SW stats ndo

v4->v5:
- updated cover letter
- patch 3/4:
  - using memcpy directly to copy stats as requested by DaveM

v3->v4:
- patch 1/4:
  - fixed "return ()" pointed out by EricD
- patch 2/4:
  - fixed if_nlmsg_size as pointed out by EricD

v2->v3:
- patch 1/4:
  - added dev_have_sw_stats helper
- patch 2/4:
  - avoided memcpy as requested by DaveM
- patch 3/4:
  - use new dev_have_sw_stats helper

v1->v2:
- patch 3/4:
  - fixed NULL initialization

Nogah Frankel (3):
  netdevice: Add offload statistics ndo
  net: core: Add offload stats to if_stats_msg
  mlxsw: spectrum: Implement offload stats ndo and expose HW stats by
    default

 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 129 +++--
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h |   5 +
 include/linux/netdevice.h                      |  12 +++
 include/uapi/linux/if_link.h                   |  10 ++
 net/core/rtnetlink.c                           |  88 -
 5 files changed, 233 insertions(+), 11 deletions(-)

-- 
2.5.5
Re: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
On 09/05/2016 04:09 PM, Daniel Mack wrote: On 09/05/2016 03:56 PM, Daniel Borkmann wrote: On 09/05/2016 02:54 PM, Daniel Mack wrote: On 08/30/2016 01:00 AM, Daniel Borkmann wrote: On 08/26/2016 09:58 PM, Daniel Mack wrote: enum bpf_map_type { @@ -147,6 +149,13 @@ union bpf_attr { __aligned_u64 pathname; __u32 bpf_fd; }; + + struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */ + __u32 target_fd; /* container object to attach to */ + __u32 attach_bpf_fd; /* eBPF program to attach */ + __u32 attach_type;/* BPF_ATTACH_TYPE_* */ + __u64 attach_flags; Could we just do ... __u32 dst_fd; __u32 src_fd; __u32 attach_type; ... and leave flags out, since unused anyway? Also see below. I'd really like to keep the flags, even if they're unused right now. This only adds 8 bytes during the syscall operation, so it doesn't harm. However, we cannot change the userspace API after the fact, and who knows what this (rather generic) interface will be used for later on. With the below suggestion added, then flags doesn't need to be added currently as it can be done safely at a later point in time with respecting old binaries. See also the syscall handling code in kernel/bpf/syscall.c +825 and the CHECK_ATTR() macro. The underlying idea of this was taken from perf_event_open() syscall back then, see [1] for a summary. [1] https://lkml.org/lkml/2014/8/26/116 Yes, I know that's possible, and I like the idea, but I don't think any new interface should come without flags really, as flags are something that will most certainly be needed at some point anyway. I didn't have them in my first shot, but Alexei pointed out that they should be added, and I agree. Also, this optimization wouldn't make the transported struct payload any smaller anyway, because the member of that union used by BPF_PROG_LOAD is still by far the biggest. I really don't think it's worth sparing 8 bytes here and then do the binary compat dance after flags are added, for no real gain. 
Sure, but there's not much of a dance needed, see for example how map_flags were added some time ago. So, iff there's really no foreseeable use-case in sight and since we have this flexibility in place already, then I don't quite follow why it's needed, if there's zero pain to add it later on. I would understand it of course, if it cannot be handled later on anymore.
[PATCH] rxrpc: initialize sched to false to ensure it is not a garbage value
From: Colin Ian King

sched will be uninitialized (and contain a garbage value) in the case
where call->state >= RXRPC_CALL_DEAD; fix this by initializing sched to
false to avoid an inadvertent call to rxrpc_queue_call.

Signed-off-by: Colin Ian King
---
 net/rxrpc/call_object.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/rxrpc/call_object.c b/net/rxrpc/call_object.c
index 516d8ea..57e00fc 100644
--- a/net/rxrpc/call_object.c
+++ b/net/rxrpc/call_object.c
@@ -586,7 +586,7 @@ static void rxrpc_dead_call_expired(unsigned long _call)
  */
 static void rxrpc_mark_call_released(struct rxrpc_call *call)
 {
-	bool sched;
+	bool sched = false;
 
 	rxrpc_see_call(call);
 	write_lock(&call->state_lock);
-- 
2.9.3
Re: 6pack: stack-out-of-bounds in sixpack_receive_buf
On Sat, 3 Sep 2016 15:38:08 +0200 Dmitry Vyukov wrote: > Hello, > > While running syzkaller fuzzer I've got the following report: > > BUG: KASAN: stack-out-of-bounds in sixpack_receive_buf+0xf8a/0x1450 at > addr 880037fbf850 > Read of size 1 by task syz-executor/6759 > page:eadfefc0 count:0 mapcount:0 mapping: (null) index:0x0 > flags: 0x1fffc00() > page dumped because: kasan: bad access detected > CPU: 3 PID: 6759 Comm: syz-executor Not tainted 4.8.0-rc3-next-20160825+ #8 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 > 886b6fe0 880037fbf520 82db38d9 37fbf5b0 > fbfff10d6dfc 880037fbf5b0 880037fbf850 880037fbf850 > 880037d3f180 dc00 880037fbf5a0 8180a383 > Call Trace: > [] __asan_report_load1_noabort+0x3e/0x40 > mm/kasan/report.c:319 > [< inline >] sixpack_decode drivers/net/hamradio/6pack.c:1001 > [] sixpack_receive_buf+0xf8a/0x1450 > drivers/net/hamradio/6pack.c:462 > [] tty_ldisc_receive_buf+0x168/0x1b0 > drivers/tty/tty_buffer.c:433 > [] paste_selection+0x27e/0x3e0 > drivers/tty/vt/selection.c:363 > [] tioclinux+0x126/0x410 drivers/tty/vt/vt.c:2683 > [] vt_ioctl+0x13ef/0x2910 drivers/tty/vt/vt_ioctl.c:365 > [] tty_ioctl+0x69d/0x21e0 drivers/tty/tty_io.c:2983 > [< inline >] vfs_ioctl fs/ioctl.c:43 > [] do_vfs_ioctl+0x18c/0x1080 fs/ioctl.c:675 > [< inline >] SYSC_ioctl fs/ioctl.c:690 > [] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:681 > [] entry_SYSCALL_64_fastpath+0x23/0xc1 > Memory state around the buggy address: > 880037fbf700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > 880037fbf780: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > >880037fbf800: 00 00 00 00 00 00 00 00 00 00 f3 f3 f3 f3 00 00 > ^ > 880037fbf880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > 880037fbf900: 00 00 00 00 f1 f1 f1 f1 00 f4 f4 f4 f2 f2 f2 f2 > == > > > It is then followed by similar reports that access subsequent stack bytes. > Unfortunately I can't reproduce it (though, I got 6 similar crashes in > different runs). 
Looking at code, the following looks suspicious -- we > limit copy by 512 bytes, but use the original count which can be > larger than 512: > > static void sixpack_receive_buf(struct tty_struct *tty, > const unsigned char *cp, char *fp, int count) > { > unsigned char buf[512]; > > memcpy(buf, cp, count < sizeof(buf) ? count : sizeof(buf)); > > sixpack_decode(sp, buf, count1); > Your suspicion is correct. Alan
Re: [PATCH net-next 4/9] rxrpc: Randomise epoch and starting client conn ID values
[cc'ing Jeff Altman for comment] David Laight wrote: > > Create a random epoch value rather than a time-based one on startup and set > > the top bit to indicate that this is the case. > > Why set the top bit? > There is nothing to stop the time (in seconds) from having the top bit set. > Nothing else can care - otherwise this wouldn't work. This is what I'm told I should do by purveyors of other RxRPC solutions. > > Also create a random starting client connection ID value. This will be > > incremented from here as new client connections are created. > > I'm guessing this is to make duplicates less likely after a restart? Again, it's been suggested that I do this, but I would guess so. > You may want to worry about duplicate allocations (after 2^32 connects). It's actually a quarter of that, but connection != call, so a connection may be used for up to ~16 billion RPC operations before it *has* to be flushed. > There are id allocation algorithms that guarantee not to generate duplicates > and not to reuse values quickly while still being fixed cost. > Look at the code NetBSD uses to allocate process ids for an example. I'm using idr_alloc_cyclic()[*] with a fixed size "window" on the active conn ID values. Client connections with IDs outside of that window are discarded as soon as possible to keep the memory consumption of the tree down (and to force security renegotiation occasionally). However, given that there are a billion IDs to cycle through, it will take quite a while for reuse to become an issue. I like the idea of incrementing the epoch every time we cycle through the ID space, but I'm told that a change in the epoch value is an indication that the client rebooted - with what consequences I cannot say. [*] which is what Linux uses to allocate process IDs. David
Re: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
On 09/05/2016 05:30 PM, David Laight wrote: > From: Daniel Mack + + struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */ + __u32 target_fd; /* container object to attach to */ + __u32 attach_bpf_fd; /* eBPF program to attach */ + __u32 attach_type;/* BPF_ATTACH_TYPE_* */ + __u64 attach_flags; + }; >>> >>> there is a 4 byte hole in this struct. Can we pack it differently? >> >> Okay - I swapped "type" and "flags" to repair this. > > That just moves the pad to the end of the structure. > Still likely to cause a problem for 32bit apps on a 64bit kernel. What kind of problem do you have in mind? Again, this is embedded in a union of much bigger total size, and the API is not used in any kind of hot-path. > If you can't think of any flags, why 64 of them? I can't think of them right now, but this is about defining an API that can be used in other context as well. Also, it doesn't matter at all, they don't harm. IMO, it's just better to have them right away than to do a binary compat dance once someone needs them. Thanks, Daniel
RE: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
From: Daniel Mack > >> + > >> + struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */ > >> + __u32 target_fd; /* container object to attach > >> to */ > >> + __u32 attach_bpf_fd; /* eBPF program to attach */ > >> + __u32 attach_type;/* BPF_ATTACH_TYPE_* */ > >> + __u64 attach_flags; > >> + }; > > > > there is a 4 byte hole in this struct. Can we pack it differently? > > Okay - I swapped "type" and "flags" to repair this. That just moves the pad to the end of the structure. Still likely to cause a problem for 32bit apps on a 64bit kernel. If you can't think of any flags, why 64 of them? David
Re: [PATCH 2/2 nf] netfilter: seqadj: Fix the wrong ack adjust for the RST packet without ack
Hi Pablo, On Mon, Sep 5, 2016 at 11:02 PM, wrote: > From: Gao Feng > > It is valid that the TCP RST packet which does not set ack flag, and bytes > of ack number are zero. For these RST packets, seqadj could not adjust the > ack number. > > Signed-off-by: Gao Feng > --- > net/netfilter/nf_conntrack_seqadj.c | 34 +++--- > 1 file changed, 19 insertions(+), 15 deletions(-) > > diff --git a/net/netfilter/nf_conntrack_seqadj.c > b/net/netfilter/nf_conntrack_seqadj.c > index 7f8d814..65bb4a6 100644 > --- a/net/netfilter/nf_conntrack_seqadj.c > +++ b/net/netfilter/nf_conntrack_seqadj.c > @@ -182,30 +182,34 @@ int nf_ct_seq_adjust(struct sk_buff *skb, > > tcph = (void *)skb->data + protoff; > spin_lock_bh(&ct->lock); > + > if (after(ntohl(tcph->seq), this_way->correction_pos)) > seqoff = this_way->offset_after; > else > seqoff = this_way->offset_before; > > - if (after(ntohl(tcph->ack_seq) - other_way->offset_before, > - other_way->correction_pos)) > - ackoff = other_way->offset_after; > - else > - ackoff = other_way->offset_before; > - > newseq = htonl(ntohl(tcph->seq) + seqoff); > - newack = htonl(ntohl(tcph->ack_seq) - ackoff); > - > inet_proto_csum_replace4(&tcph->check, skb, tcph->seq, newseq, false); > - inet_proto_csum_replace4(&tcph->check, skb, tcph->ack_seq, newack, > -false); > - > - pr_debug("Adjusting sequence number from %u->%u, ack from %u->%u\n", > -ntohl(tcph->seq), ntohl(newseq), ntohl(tcph->ack_seq), > -ntohl(newack)); > > + pr_debug("Adjusting sequence number from %u->%u\n", > +ntohl(tcph->seq), ntohl(newseq)); > tcph->seq = newseq; > - tcph->ack_seq = newack; > + > + if (likely(tcph->ack)) { > + if (after(ntohl(tcph->ack_seq) - other_way->offset_before, > + other_way->correction_pos)) > + ackoff = other_way->offset_after; > + else > + ackoff = other_way->offset_before; > + > + newack = htonl(ntohl(tcph->ack_seq) - ackoff); > + inet_proto_csum_replace4(&tcph->check, skb, tcph->ack_seq, > +newack, false); > + > + pr_debug("Adjusting ack number from 
%u->%u\n", > +ntohl(tcph->ack_seq), ntohl(newack)); > + tcph->ack_seq = newack; > + } > > res = nf_ct_sack_adjust(skb, protoff, tcph, ct, ctinfo); > spin_unlock_bh(&ct->lock); > -- > 1.9.1 > > This patch is generated based on the patch commit "netfilter: seqadj: Fix one possible panic in seqadj when mem is exhausted" whose link is http://patchwork.ozlabs.org/patch/665116/. So its subject contains "2/2". Best Regards Feng
RE: [PATCH net-next 4/9] rxrpc: Randomise epoch and starting client conn ID values
From: David Howells > Sent: 04 September 2016 22:03 > Create a random epoch value rather than a time-based one on startup and set > the top bit to indicate that this is the case. Why set the top bit? There is nothing to stop the time (in seconds) from having the top bit set. Nothing else can care - otherwise this wouldn't work. > Also create a random starting client connection ID value. This will be > incremented from here as new client connections are created. I'm guessing this is to make duplicates less likely after a restart? You may want to worry about duplicate allocations (after 2^32 connects). There are id allocation algorithms that guarantee not to generate duplicates and not to reuse values quickly while still being fixed cost. Look at the code NetBSD uses to allocate process ids for an example. David
[PATCH 2/2 nf] netfilter: seqadj: Fix the wrong ack adjust for the RST packet without ack
From: Gao Feng It is valid for a TCP RST packet to not set the ack flag, with the bytes of the ack number being zero. For these RST packets, seqadj should not adjust the ack number. Signed-off-by: Gao Feng --- net/netfilter/nf_conntrack_seqadj.c | 34 +++--- 1 file changed, 19 insertions(+), 15 deletions(-) diff --git a/net/netfilter/nf_conntrack_seqadj.c b/net/netfilter/nf_conntrack_seqadj.c index 7f8d814..65bb4a6 100644 --- a/net/netfilter/nf_conntrack_seqadj.c +++ b/net/netfilter/nf_conntrack_seqadj.c @@ -182,30 +182,34 @@ int nf_ct_seq_adjust(struct sk_buff *skb, tcph = (void *)skb->data + protoff; spin_lock_bh(&ct->lock); + if (after(ntohl(tcph->seq), this_way->correction_pos)) seqoff = this_way->offset_after; else seqoff = this_way->offset_before; - if (after(ntohl(tcph->ack_seq) - other_way->offset_before, - other_way->correction_pos)) - ackoff = other_way->offset_after; - else - ackoff = other_way->offset_before; - newseq = htonl(ntohl(tcph->seq) + seqoff); - newack = htonl(ntohl(tcph->ack_seq) - ackoff); - inet_proto_csum_replace4(&tcph->check, skb, tcph->seq, newseq, false); - inet_proto_csum_replace4(&tcph->check, skb, tcph->ack_seq, newack, -false); - - pr_debug("Adjusting sequence number from %u->%u, ack from %u->%u\n", -ntohl(tcph->seq), ntohl(newseq), ntohl(tcph->ack_seq), -ntohl(newack)); + pr_debug("Adjusting sequence number from %u->%u\n", +ntohl(tcph->seq), ntohl(newseq)); tcph->seq = newseq; - tcph->ack_seq = newack; + + if (likely(tcph->ack)) { + if (after(ntohl(tcph->ack_seq) - other_way->offset_before, + other_way->correction_pos)) + ackoff = other_way->offset_after; + else + ackoff = other_way->offset_before; + + newack = htonl(ntohl(tcph->ack_seq) - ackoff); + inet_proto_csum_replace4(&tcph->check, skb, tcph->ack_seq, +newack, false); + + pr_debug("Adjusting ack number from %u->%u\n", +ntohl(tcph->ack_seq), ntohl(newack)); + tcph->ack_seq = newack; + } res = nf_ct_sack_adjust(skb, protoff, tcph, ct, ctinfo); spin_unlock_bh(&ct->lock); -- 1.9.1
Re: [PATCH v3 2/6] cgroup: add support for eBPF programs
Hi, On 08/30/2016 01:04 AM, Sargun Dhillon wrote: > On Fri, Aug 26, 2016 at 09:58:48PM +0200, Daniel Mack wrote: >> This patch adds two sets of eBPF program pointers to struct cgroup. >> One for such that are directly pinned to a cgroup, and one for such >> that are effective for it. >> >> To illustrate the logic behind that, assume the following example >> cgroup hierarchy. >> >> A - B - C >> \ D - E >> >> If only B has a program attached, it will be effective for B, C, D >> and E. If D then attaches a program itself, that will be effective for >> both D and E, and the program in B will only affect B and C. Only one >> program of a given type is effective for a cgroup. >> > How does this work when running and orchestrator within an orchestrator? The > Docker in Docker / Mesos in Mesos use case, where the top level orchestrator > is > observing the traffic, and there is an orchestrator within that also need to > run > it. > > In this case, I'd like to run E's filter, then if it returns 0, D's, and B's, > and so on. Running multiple programs was an idea I had in one of my earlier drafts, but after some discussion, I refrained from it again because potentially walking the cgroup hierarchy on every packet is just too expensive. > Is it possible to allow this, either by flattening out the > datastructure (copy a ref to the bpf programs to C and E) or > something similar? That would mean we carry a list of eBPF program pointers of dynamic size. IOW, the deeper inside the cgroup hierarchy, the bigger the list, so it can store a reference to all programs of all of its ancestor. While I think that would be possible, even at some later point, I'd really like to avoid it for the sake of simplicity. Is there any reason why this can't be done in userspace? Compile a program X for A, and overload it with Y, with Y doing the same than X but add some extra checks? 
Note that all users of the bpf(2) syscall API will need CAP_NET_ADMIN anyway, so there is no delegation to unprivileged sub-orchestators or anything alike really. Thanks, Daniel
Re: [PATCH v3 5/6] net: core: run cgroup eBPF egress programs
On 08/30/2016 12:03 AM, Daniel Borkmann wrote: > On 08/26/2016 09:58 PM, Daniel Mack wrote: >> diff --git a/net/core/dev.c b/net/core/dev.c >> index a75df86..17484e6 100644 >> --- a/net/core/dev.c >> +++ b/net/core/dev.c >> @@ -141,6 +141,7 @@ >> #include >> #include >> #include >> +#include >> >> #include "net-sysfs.h" >> >> @@ -3329,6 +3330,11 @@ static int __dev_queue_xmit(struct sk_buff *skb, void >> *accel_priv) >> if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP)) >> __skb_tstamp_tx(skb, NULL, skb->sk, SCM_TSTAMP_SCHED); >> >> +rc = cgroup_bpf_run_filter(skb->sk, skb, >> + BPF_ATTACH_TYPE_CGROUP_INET_EGRESS); >> +if (rc) >> +return rc; > > This would leak the whole skb by the way. Ah, right. > Apart from that, could this be modeled w/o affecting the forwarding path (at > some > local output point where we know to have a valid socket)? Then you could also > drop > the !sk and sk->sk_family tests, and we wouldn't need to replicate parts of > what > clsact is doing as well. Hmm, maybe access to src/dst mac could be handled to > be > just zeroes since not available at that point? Hmm, I wonder where this hook could be put instead then. When placed in ip_output() and ip6_output(), the mac headers cannot be pushed before running the program, resulting in bogus skb data from the eBPF program. Also, if I read the code correctly, ip[6]_output is not called for multicast packets. Any other ideas? Thanks, Daniel
RE: [PATCH -next v2] net: hns: fix return value check in hns_dsaf_get_cfg()
Hello, This patch will conflict with Doug Ledford's hns-roce's HNS driver. This might lead to problems later during this merge window of 4.9. Therefore, Please re-submit it later. The patch files it has are Directly conflicting with RoCE patches: [PATCH for-next 1/2] net: hns: Add support of ACPI to HNS driver RoCE Reset function drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.c drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.h drivers/net/ethernet/hisilicon/hns/hns_dsaf_misc.c drivers/net/ethernet/hisilicon/hns/hns_dsaf_misc.h drivers/net/ethernet/hisilicon/hns/hns_dsaf_reg.h we are in process of streamlining the internal process for the better synchronization between HNS and RoCE/RDMA teams and would maintain internal repo for such conflicting patches and which can be git-pulled by David Miller and Doug Ledford. Best regards Salil > -Original Message- > From: weiyj...@163.com [mailto:weiyj...@163.com] > Sent: Tuesday, July 05, 2016 8:57 AM > To: Zhuangyuzeng (Yisen); Salil Mehta; Yankejian (Hackim Yim) > Cc: Wei Yongjun; netdev@vger.kernel.org > Subject: [PATCH -next v2] net: hns: fix return value check in > hns_dsaf_get_cfg() > > From: Wei Yongjun > > In case of error, function devm_ioremap_resource() returns ERR_PTR() > and never returns NULL. The NULL test in the return value check should > be replaced with IS_ERR(). 
> > Signed-off-by: Wei Yongjun > --- > drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c | 16 -- > -- > 1 file changed, 8 insertions(+), 8 deletions(-) > > diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c > b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c > index 86ce28a..2ef4277 100644 > --- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c > +++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c > @@ -114,9 +114,9 @@ int hns_dsaf_get_cfg(struct dsaf_device *dsaf_dev) > > dsaf_dev->sc_base = devm_ioremap_resource(&pdev->dev, > res); > - if (!dsaf_dev->sc_base) { > + if (IS_ERR(dsaf_dev->sc_base)) { > dev_err(dsaf_dev->dev, "subctrl can not > map!\n"); > - return -ENOMEM; > + return PTR_ERR(dsaf_dev->sc_base); > } > > res = platform_get_resource(pdev, IORESOURCE_MEM, > @@ -128,9 +128,9 @@ int hns_dsaf_get_cfg(struct dsaf_device *dsaf_dev) > > dsaf_dev->sds_base = devm_ioremap_resource(&pdev- > >dev, > res); > - if (!dsaf_dev->sds_base) { > + if (IS_ERR(dsaf_dev->sds_base)) { > dev_err(dsaf_dev->dev, "serdes-ctrl can not > map!\n"); > - return -ENOMEM; > + return PTR_ERR(dsaf_dev->sds_base); > } > } else { > dsaf_dev->sub_ctrl = syscon; > @@ -146,9 +146,9 @@ int hns_dsaf_get_cfg(struct dsaf_device *dsaf_dev) > } > } > dsaf_dev->ppe_base = devm_ioremap_resource(&pdev->dev, res); > - if (!dsaf_dev->ppe_base) { > + if (IS_ERR(dsaf_dev->ppe_base)) { > dev_err(dsaf_dev->dev, "ppe-base resource can not map!\n"); > - return -ENOMEM; > + return PTR_ERR(dsaf_dev->ppe_base); > } > dsaf_dev->ppe_paddr = res->start; > > @@ -165,9 +165,9 @@ int hns_dsaf_get_cfg(struct dsaf_device *dsaf_dev) > } > } > dsaf_dev->io_base = devm_ioremap_resource(&pdev->dev, res); > - if (!dsaf_dev->io_base) { > + if (IS_ERR(dsaf_dev->io_base)) { > dev_err(dsaf_dev->dev, "dsaf-base resource can not > map!\n"); > - return -ENOMEM; > + return PTR_ERR(dsaf_dev->io_base); > } > } >
Re: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
On 09/05/2016 03:56 PM, Daniel Borkmann wrote: > On 09/05/2016 02:54 PM, Daniel Mack wrote: >> On 08/30/2016 01:00 AM, Daniel Borkmann wrote: >>> On 08/26/2016 09:58 PM, Daniel Mack wrote: >> enum bpf_map_type { @@ -147,6 +149,13 @@ union bpf_attr { __aligned_u64 pathname; __u32 bpf_fd; }; + + struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */ + __u32 target_fd; /* container object to attach to */ + __u32 attach_bpf_fd; /* eBPF program to attach */ + __u32 attach_type;/* BPF_ATTACH_TYPE_* */ + __u64 attach_flags; >>> >>> Could we just do ... >>> >>> __u32 dst_fd; >>> __u32 src_fd; >>> __u32 attach_type; >>> >>> ... and leave flags out, since unused anyway? Also see below. >> >> I'd really like to keep the flags, even if they're unused right now. >> This only adds 8 bytes during the syscall operation, so it doesn't harm. >> However, we cannot change the userspace API after the fact, and who >> knows what this (rather generic) interface will be used for later on. > > With the below suggestion added, then flags doesn't need to be > added currently as it can be done safely at a later point in time > with respecting old binaries. See also the syscall handling code > in kernel/bpf/syscall.c +825 and the CHECK_ATTR() macro. The > underlying idea of this was taken from perf_event_open() syscall > back then, see [1] for a summary. > >[1] https://lkml.org/lkml/2014/8/26/116 Yes, I know that's possible, and I like the idea, but I don't think any new interface should come without flags really, as flags are something that will most certainly be needed at some point anyway. I didn't have them in my first shot, but Alexei pointed out that they should be added, and I agree. Also, this optimization wouldn't make the transported struct payload any smaller anyway, because the member of that union used by BPF_PROG_LOAD is still by far the biggest. 
I really don't think it's worth sparing 8 bytes here and then do the binary compat dance after flags are added, for no real gain. Thanks, Daniel
RE: [PATCH -next] net: hns: remove redundant dev_err call in hns_dsaf_get_cfg()
> -Original Message- > From: David Miller [mailto:da...@davemloft.net] > Sent: Wednesday, August 24, 2016 1:19 AM > To: weiyj...@gmail.com > Cc: Zhuangyuzeng (Yisen); Salil Mehta; huangdaode; Yankejian (Hackim > Yim); xieqianqian; weiyongjun (A); netdev@vger.kernel.org > Subject: Re: [PATCH -next] net: hns: remove redundant dev_err call in > hns_dsaf_get_cfg() > > From: Wei Yongjun > Date: Tue, 23 Aug 2016 15:11:03 + > > > From: Wei Yongjun > > > > There is a error message within devm_ioremap_resource > > already, so remove the dev_err call to avoid redundant > > error message. > > > > Signed-off-by: Wei Yongjun > > Applied. Hi David, Forgive me for acting late on this but just realized that this patch will Conflict with Doug Ledford's hns-roce's HNS driver. This might lead to problems later during this merge window of 4.9. RoCE base driver has already suffered because of such problem during earlier merge window. Can I request you to drop this patch for now? We shall submit this patch later again either through hns-roce or directly to net-next. Thanks in anticipation Salil
Re: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
On 09/05/2016 02:54 PM, Daniel Mack wrote: On 08/30/2016 01:00 AM, Daniel Borkmann wrote: On 08/26/2016 09:58 PM, Daniel Mack wrote: enum bpf_map_type { @@ -147,6 +149,13 @@ union bpf_attr { __aligned_u64 pathname; __u32 bpf_fd; }; + + struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */ + __u32 target_fd; /* container object to attach to */ + __u32 attach_bpf_fd; /* eBPF program to attach */ + __u32 attach_type;/* BPF_ATTACH_TYPE_* */ + __u64 attach_flags; Could we just do ... __u32 dst_fd; __u32 src_fd; __u32 attach_type; ... and leave flags out, since unused anyway? Also see below. I'd really like to keep the flags, even if they're unused right now. This only adds 8 bytes during the syscall operation, so it doesn't harm. However, we cannot change the userspace API after the fact, and who knows what this (rather generic) interface will be used for later on. With the below suggestion added, then flags doesn't need to be added currently as it can be done safely at a later point in time with respecting old binaries. See also the syscall handling code in kernel/bpf/syscall.c +825 and the CHECK_ATTR() macro. The underlying idea of this was taken from perf_event_open() syscall back then, see [1] for a summary. [1] https://lkml.org/lkml/2014/8/26/116 #define BPF_PROG_ATTACH_LAST_FIELD attach_type #define BPF_PROG_DETACH_LAST_FIELD BPF_PROG_ATTACH_LAST_FIELD ... Correct way would be to: if (CHECK_ATTR(BPF_PROG_ATTACH)) return -EINVAL; Will add - thanks! Daniel
[PATCH] ath10k: remove unused variable ar_pci
Trival fix to remove unused variable ar_pci in ath10k_pci_tx_pipe_cleanup when building with W=1: drivers/net/wireless/ath/ath10k/pci.c:1696:21: warning: variable 'ar_pci' set but not used [-Wunused-but-set-variable] Signed-off-by: Chaehyun Lim --- drivers/net/wireless/ath/ath10k/pci.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/net/wireless/ath/ath10k/pci.c b/drivers/net/wireless/ath/ath10k/pci.c index 1b841ad..afd5ef7 100644 --- a/drivers/net/wireless/ath/ath10k/pci.c +++ b/drivers/net/wireless/ath/ath10k/pci.c @@ -1693,14 +1693,12 @@ static void ath10k_pci_rx_pipe_cleanup(struct ath10k_pci_pipe *pci_pipe) static void ath10k_pci_tx_pipe_cleanup(struct ath10k_pci_pipe *pci_pipe) { struct ath10k *ar; - struct ath10k_pci *ar_pci; struct ath10k_ce_pipe *ce_pipe; struct ath10k_ce_ring *ce_ring; struct sk_buff *skb; int i; ar = pci_pipe->hif_ce_state; - ar_pci = ath10k_pci_priv(ar); ce_pipe = pci_pipe->ce_hdl; ce_ring = ce_pipe->src_ring; -- 2.9.2
Re: net/bluetooth: workqueue destruction WARNING in hci_unregister_dev
On Mon, Sep 5, 2016 at 3:08 PM, Tejun Heo wrote: > Hello, > > On Sat, Sep 03, 2016 at 12:58:33PM +0200, Dmitry Vyukov wrote: >> > I've seen it only several times in several months, so I don't it will >> > be helpful. >> >> >> Bad news: I hit it again. >> On 0f98f121e1670eaa2a2fbb675e07d6ba7f0e146f of linux-next, so I have >> bf389cabb3b8079c23f9762e62b05f291e2d5e99. > > Hmmm... we're not getting anywhere. I've applied the following patch > to wq/for-4.8-fixes so that when this happens the next time we can > actually tell what the hell is going wrong. > > Thanks. > > -- 8< -- > From 278930ada88c972d20025b0f20def27b1a09dff7 Mon Sep 17 00:00:00 2001 > From: Tejun Heo > Date: Mon, 5 Sep 2016 08:54:06 -0400 > Subject: [PATCH] workqueue: dump workqueue state on sanity check failures in > destroy_workqueue() > > destroy_workqueue() performs a number of sanity checks to ensure that > the workqueue is empty before proceeding with destruction. However, > it's not always easy to tell what's going on just from the warning > message. Let's dump workqueue state after sanity check failures to > help debugging. > > Signed-off-by: Tejun Heo > Link: > http://lkml.kernel.org/r/cact4y+zs6vkjho9qhb4treiz3s4+quvvvq9vwvj2mx6petg...@mail.gmail.com > Cc: Dmitry Vyukov > --- > kernel/workqueue.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/kernel/workqueue.c b/kernel/workqueue.c > index ef071ca..4eaec8b8 100644 > --- a/kernel/workqueue.c > +++ b/kernel/workqueue.c > @@ -4021,6 +4021,7 @@ void destroy_workqueue(struct workqueue_struct *wq) > for (i = 0; i < WORK_NR_COLORS; i++) { > if (WARN_ON(pwq->nr_in_flight[i])) { > mutex_unlock(&wq->mutex); > + show_workqueue_state(); > return; > } > } > @@ -4029,6 +4030,7 @@ void destroy_workqueue(struct workqueue_struct *wq) > WARN_ON(pwq->nr_active) || > WARN_ON(!list_empty(&pwq->delayed_works))) { > mutex_unlock(&wq->mutex); > + show_workqueue_state(); > return; > } > } Applied this to my test tree.
Re: net/bluetooth: workqueue destruction WARNING in hci_unregister_dev
Hello, On Sat, Sep 03, 2016 at 12:58:33PM +0200, Dmitry Vyukov wrote: > > I've seen it only several times in several months, so I don't it will > > be helpful. > > > Bad news: I hit it again. > On 0f98f121e1670eaa2a2fbb675e07d6ba7f0e146f of linux-next, so I have > bf389cabb3b8079c23f9762e62b05f291e2d5e99. Hmmm... we're not getting anywhere. I've applied the following patch to wq/for-4.8-fixes so that when this happens the next time we can actually tell what the hell is going wrong. Thanks. -- 8< -- >From 278930ada88c972d20025b0f20def27b1a09dff7 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Mon, 5 Sep 2016 08:54:06 -0400 Subject: [PATCH] workqueue: dump workqueue state on sanity check failures in destroy_workqueue() destroy_workqueue() performs a number of sanity checks to ensure that the workqueue is empty before proceeding with destruction. However, it's not always easy to tell what's going on just from the warning message. Let's dump workqueue state after sanity check failures to help debugging. Signed-off-by: Tejun Heo Link: http://lkml.kernel.org/r/cact4y+zs6vkjho9qhb4treiz3s4+quvvvq9vwvj2mx6petg...@mail.gmail.com Cc: Dmitry Vyukov --- kernel/workqueue.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index ef071ca..4eaec8b8 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -4021,6 +4021,7 @@ void destroy_workqueue(struct workqueue_struct *wq) for (i = 0; i < WORK_NR_COLORS; i++) { if (WARN_ON(pwq->nr_in_flight[i])) { mutex_unlock(&wq->mutex); + show_workqueue_state(); return; } } @@ -4029,6 +4030,7 @@ void destroy_workqueue(struct workqueue_struct *wq) WARN_ON(pwq->nr_active) || WARN_ON(!list_empty(&pwq->delayed_works))) { mutex_unlock(&wq->mutex); + show_workqueue_state(); return; } } -- 2.7.4
Re: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
On 08/27/2016 02:08 AM, Alexei Starovoitov wrote: > On Fri, Aug 26, 2016 at 09:58:49PM +0200, Daniel Mack wrote: >> + >> +struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */ >> +__u32 target_fd; /* container object to attach >> to */ >> +__u32 attach_bpf_fd; /* eBPF program to attach */ >> +__u32 attach_type;/* BPF_ATTACH_TYPE_* */ >> +__u64 attach_flags; >> +}; > > there is a 4 byte hole in this struct. Can we pack it differently? Okay - I swapped "type" and "flags" to repair this. >> +switch (attr->attach_type) { >> +case BPF_ATTACH_TYPE_CGROUP_INET_INGRESS: >> +case BPF_ATTACH_TYPE_CGROUP_INET_EGRESS: { >> +struct cgroup *cgrp; >> + >> +prog = bpf_prog_get_type(attr->attach_bpf_fd, >> + BPF_PROG_TYPE_CGROUP_SOCKET_FILTER); >> +if (IS_ERR(prog)) >> +return PTR_ERR(prog); >> + >> +cgrp = cgroup_get_from_fd(attr->target_fd); >> +if (IS_ERR(cgrp)) { >> +bpf_prog_put(prog); >> +return PTR_ERR(cgrp); >> +} >> + >> +cgroup_bpf_update(cgrp, prog, attr->attach_type); >> +cgroup_put(cgrp); >> + >> +break; >> +} > > this } formatting style is confusing. The above } looks > like it matches 'switch () {'. > May be move 'struct cgroup *cgrp' to the top to avoid that? I kept it local to its users, but you're right, it's not worth it. Will change. Thanks, Daniel
Re: [PATCH v3 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
On 08/30/2016 01:00 AM, Daniel Borkmann wrote: > On 08/26/2016 09:58 PM, Daniel Mack wrote: >> enum bpf_map_type { >> @@ -147,6 +149,13 @@ union bpf_attr { >> __aligned_u64 pathname; >> __u32 bpf_fd; >> }; >> + >> +struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */ >> +__u32 target_fd; /* container object to attach >> to */ >> +__u32 attach_bpf_fd; /* eBPF program to attach */ >> +__u32 attach_type;/* BPF_ATTACH_TYPE_* */ >> +__u64 attach_flags; > > Could we just do ... > > __u32 dst_fd; > __u32 src_fd; > __u32 attach_type; > > ... and leave flags out, since unused anyway? Also see below. I'd really like to keep the flags, even if they're unused right now. This only adds 8 bytes during the syscall operation, so it doesn't harm. However, we cannot change the userspace API after the fact, and who knows what this (rather generic) interface will be used for later on. > #define BPF_PROG_ATTACH_LAST_FIELD attach_type > #define BPF_PROG_DETACH_LAST_FIELD BPF_PROG_ATTACH_LAST_FIELD ... > > Correct way would be to: > > if (CHECK_ATTR(BPF_PROG_ATTACH)) > return -EINVAL; Will add - thanks! Daniel
RE: [PATCH for-next 0/2] {IB,net}/hns: Add support of ACPI to the Hisilicon RoCE Driver
> -Original Message- > From: Doug Ledford [mailto:dledf...@redhat.com] > Sent: Thursday, August 25, 2016 12:57 PM > To: David Miller; Salil Mehta > Cc: Huwei (Xavier); oulijun; Zhuangyuzeng (Yisen); > mehta.salil@gmail.com; linux-r...@vger.kernel.org; > netdev@vger.kernel.org; linux-ker...@vger.kernel.org; Linuxarm > Subject: Re: [PATCH for-next 0/2] {IB,net}/hns: Add support of ACPI to > the Hisilicon RoCE Driver > > On 8/25/2016 12:53 AM, David Miller wrote: > > From: Salil Mehta > > Date: Wed, 24 Aug 2016 04:44:48 +0800 > > > >> This patch is meant to add support of ACPI to the Hisilicon RoCE > driver. > >> Following changes have been made in the driver(s): > >> > >> Patch 1/2: HNS Ethernet Driver: changes to support ACPI have been > done in > >>the RoCE reset function part of the HNS ethernet driver. Earlier > it only > >>supported DT/syscon. > >> > >> Patch 2/2. HNS RoCE driver: changes done in RoCE driver are meant to > detect > >>the type and then either use DT specific or ACPI spcific > functions. Where > >>ever possible, this patch tries to make use of "Unified Device > Property > >>Interface" APIs to support both DT and ACPI through single > interface. > >> > >> NOTE 1: ACPI changes done in both of the drivers depend upon the > ACPI Table > >> (DSDT and IORT tables) changes part of UEFI/BIOS. These changes > are NOT > >> part of this patch-set. > >> NOTE 2: Reset function in Patch 1/2 depends upon the reset function > added in > >> ACPI tables(basically DSDT table) part of the UEFI/BIOS. Again, > this > >> change is NOT reflected in this patch-set. > > > > I can't apply this series to my tree because the hns infiniband > driver > > doesn't exist in it. > > > > No. This probably needs to go through my tree. Although with all of > the requirements, I'm a bit concerned about those being present > elsewhere. > Hi David, There is a patch in net-next for HNS Ethernet driver which has been accepted. 
"b3dc935 net: hns: remove redundant dev_err call in hns_dsaf_get_cfg()" This patch is creating a conflict with Doug Ledford's hns-roce branch. Internally, we have decided to let all the HNS patches pass through hns-roce for the 4.9 merge window to facilitate a non-conflicting merge of the pending RoCE driver by Doug Ledford. Though we are trying to take a grip of the process for this merge window, somehow this one bypassed the internal process. This will create problems for hns-roce during the merge later this window. Can I request you to drop this patch for now? We shall re-submit this patch later when things have been settled at the RoCE/RDMA end, or it would come to you again through the hns-roce branch. Please let me know if this is possible this time. Thanks in anticipation! Best regards Salil > -- > Doug Ledford > GPG Key ID: 0E572FDD
Re: [PATCH] tcp: cwnd does not increase in TCP YeAH
On Mon, Sep 5, 2016 at 12:03 AM, Artem Germanov wrote: > > Commit 76174004a0f19785a328f40388e87e982bbf69b9 > (tcp: do not slow start when cwnd equals ssthresh ) > introduced regression in TCP YeAH. Using 100ms delay 1% loss virtual > ethernet link kernel 4.2 shows bandwidth ~500KB/s for single TCP > connection and kernel 4.3 and above (including 4.8-rc4) shows bandwidth > ~100KB/s. > That is caused by stalled cwnd when cwnd equals ssthresh. This patch > fixes it by proper increasing cwnd in this case. > > Signed-off-by: Artem Germanov Acked-by: Neal Cardwell neal
Re: [PATCH v3 2/6] cgroup: add support for eBPF programs
On 08/30/2016 12:42 AM, Daniel Borkmann wrote: > On 08/26/2016 09:58 PM, Daniel Mack wrote: >> This patch adds two sets of eBPF program pointers to struct cgroup. >> One for such that are directly pinned to a cgroup, and one for such >> that are effective for it. >> >> To illustrate the logic behind that, assume the following example >> cgroup hierarchy. >> >>A - B - C >> \ D - E >> >> If only B has a program attached, it will be effective for B, C, D >> and E. If D then attaches a program itself, that will be effective for >> both D and E, and the program in B will only affect B and C. Only one >> program of a given type is effective for a cgroup. >> >> Attaching and detaching programs will be done through the bpf(2) >> syscall. For now, ingress and egress inet socket filtering are the >> only supported use-cases. >> >> Signed-off-by: Daniel Mack > [...] >> +void __cgroup_bpf_update(struct cgroup *cgrp, >> + struct cgroup *parent, >> + struct bpf_prog *prog, >> + enum bpf_attach_type type) >> +{ >> +struct bpf_prog *old_prog, *effective; >> +struct cgroup_subsys_state *pos; >> + >> +old_prog = xchg(cgrp->bpf.prog + type, prog); >> + >> +if (prog) >> +static_branch_inc(&cgroup_bpf_enabled_key); >> + >> +if (old_prog) { >> +bpf_prog_put(old_prog); >> +static_branch_dec(&cgroup_bpf_enabled_key); >> +} >> + >> +effective = (!prog && parent) ? >> +rcu_dereference_protected(parent->bpf.effective[type], >> + lockdep_is_held(&cgroup_mutex)) : >> +prog; >> + >> +css_for_each_descendant_pre(pos, &cgrp->self) { >> +struct cgroup *desc = container_of(pos, struct cgroup, self); >> + >> +/* skip the subtree if the descendant has its own program */ >> +if (desc->bpf.prog[type] && desc != cgrp) >> +pos = css_rightmost_descendant(pos); >> +else >> +rcu_assign_pointer(desc->bpf.effective[type], >> + effective); >> +} > > Shouldn't the old_prog reference only be released right here at the end > instead of above (otherwise this could race)? Yes, that's right. Will change as well. 
Thanks for spotting! Daniel
Re: [PATCH v3 1/6] bpf: add new prog type for cgroup socket filtering
On 08/30/2016 12:14 AM, Daniel Borkmann wrote: > On 08/26/2016 09:58 PM, Daniel Mack wrote: >> For now, this program type is equivalent to BPF_PROG_TYPE_SOCKET_FILTER in >> terms of checks during the verification process. It may access the skb as >> well. >> >> Programs of this type will be attached to cgroups for network filtering >> and accounting. >> >> Signed-off-by: Daniel Mack >> --- >> include/uapi/linux/bpf.h | 7 +++ >> kernel/bpf/verifier.c| 1 + >> net/core/filter.c| 6 ++ >> 3 files changed, 14 insertions(+) >> >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h >> index e4c5a1b..1d5db42 100644 >> --- a/include/uapi/linux/bpf.h >> +++ b/include/uapi/linux/bpf.h >> @@ -95,6 +95,13 @@ enum bpf_prog_type { >> BPF_PROG_TYPE_SCHED_ACT, >> BPF_PROG_TYPE_TRACEPOINT, >> BPF_PROG_TYPE_XDP, >> +BPF_PROG_TYPE_CGROUP_SOCKET_FILTER, >> +}; > > Nit: can we drop the _FILTER suffix? So just leaving it > at BPF_PROG_TYPE_CGROUP_SOCKET. Some of these use cases > might not always strictly be related to filtering, so > seems cleaner to just leave it out everywhere. > >> + >> +enum bpf_attach_type { >> +BPF_ATTACH_TYPE_CGROUP_INET_INGRESS, >> +BPF_ATTACH_TYPE_CGROUP_INET_EGRESS, >> +__MAX_BPF_ATTACH_TYPE >> }; > > #define BPF_MAX_ATTACH_TYPE __BPF_MAX_ATTACH_TYPE > > And then use that in your follow-up patches for declaring > arrays, etc? Agreed, will change. Thanks, Daniel
Re: [PATCH v3 2/6] cgroup: add support for eBPF programs
Hi Alexei, On 08/27/2016 02:03 AM, Alexei Starovoitov wrote: > On Fri, Aug 26, 2016 at 09:58:48PM +0200, Daniel Mack wrote: >> This patch adds two sets of eBPF program pointers to struct cgroup. >> One for such that are directly pinned to a cgroup, and one for such >> that are effective for it. >> >> To illustrate the logic behind that, assume the following example >> cgroup hierarchy. >> >> A - B - C >> \ D - E >> >> If only B has a program attached, it will be effective for B, C, D >> and E. If D then attaches a program itself, that will be effective for >> both D and E, and the program in B will only affect B and C. Only one >> program of a given type is effective for a cgroup. >> >> Attaching and detaching programs will be done through the bpf(2) >> syscall. For now, ingress and egress inet socket filtering are the >> only supported use-cases. >> >> Signed-off-by: Daniel Mack > ... >> +css_for_each_descendant_pre(pos, &cgrp->self) { >> +struct cgroup *desc = container_of(pos, struct cgroup, self); >> + >> +/* skip the subtree if the descendant has its own program */ >> +if (desc->bpf.prog[type] && desc != cgrp) > > is desc != cgrp really needed? > I thought css_for_each_descendant_pre() shouldn't walk itself > or I'm missing how it works. Hmm, no - that check is necessary in my tests, and also according to the documentation: /** * css_for_each_descendant_pre - pre-order walk of a css's descendants * @pos: the css * to use as the loop cursor * @root: css whose descendants to walk * * Walk @root's descendants. @root is included in the iteration and the * first node to be visited. Must be called under rcu_read_lock(). * Daniel
[Patch v6] net: ethernet: xilinx: Enable emaclite for MIPS
The MIPS based xilfpga platform uses this driver. Enable it for MIPS Signed-off-by: Zubair Lutfullah Kakakhel --- V1 -> V6 are from a series that has gotten too big. So I have split this patch and am sending it separately. --- drivers/net/ethernet/xilinx/Kconfig | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/xilinx/Kconfig b/drivers/net/ethernet/xilinx/Kconfig index 4f5c024..6d68c8a 100644 --- a/drivers/net/ethernet/xilinx/Kconfig +++ b/drivers/net/ethernet/xilinx/Kconfig @@ -5,7 +5,7 @@ config NET_VENDOR_XILINX bool "Xilinx devices" default y - depends on PPC || PPC32 || MICROBLAZE || ARCH_ZYNQ + depends on PPC || PPC32 || MICROBLAZE || ARCH_ZYNQ || MIPS ---help--- If you have a network (Ethernet) card belonging to this class, say Y. @@ -18,7 +18,7 @@ if NET_VENDOR_XILINX config XILINX_EMACLITE tristate "Xilinx 10/100 Ethernet Lite support" - depends on (PPC32 || MICROBLAZE || ARCH_ZYNQ) + depends on PPC32 || MICROBLAZE || ARCH_ZYNQ || MIPS select PHYLIB ---help--- This driver supports the 10/100 Ethernet Lite from Xilinx. -- 1.9.1
Re: [Patch v5 0/2] net: ethernet: xilinx: mac addr and mips
On 09/04/2016 07:45 PM, David Miller wrote: From: Zubair Lutfullah Kakakhel Date: Fri, 2 Sep 2016 12:39:24 +0100 A couple of simple patches to generate the random mac address if none is found. And enabling the driver for mips. Based on v4.8-rc4. These were part of a larger series but that series is growing wildly. Splitting and submitting the net subsystem patches separately. Hence the v5. This doesn't apply cleanly to any of my trees. :) Looks like there is an identical patch by someone else already in net-next. https://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/xilinx/xilinx_emaclite.c?id=5575cf133cf7f564da991595c6bc9344afa7d89a Regards, ZubairLK
Re: Reply: [PATCH] ipv6: addrconf: fix dev refcount leak when DAD failed
On 05.09.2016 13:54, weiyongjun (A) wrote: > On 05.09.2016 10:06, Wei Yongjun wrote: In general, when DAD detected IPv6 duplicate address, ifp->state will be set to INET6_IFADDR_STATE_ERRDAD and DAD is stopped by a delayed work, the call tree should be like this: ndisc_recv_ns -> addrconf_dad_failure<- missing ifp put -> addrconf_mod_dad_work -> schedule addrconf_dad_work() -> addrconf_dad_stop() <- missing ifp hold before call it addrconf_dad_failure() called with ifp refcont holding but not put. addrconf_dad_work() call addrconf_dad_stop() without extra holding refcount. This will not cause any issue normally. But the race between addrconf_dad_failure() and addrconf_dad_work() may cause ifp refcount leak and netdevice can not be unregister, dmesg show the following messages: IPv6: eth0: IPv6 duplicate address fe80::XX:::XX detected! ... unregister_netdevice: waiting for eth0 to become free. Usage count = 1 Cc: sta...@vger.kernel.org Fixes: c15b1ccadb32 ("ipv6: move DAD and addrconf_verify processing to workqueue") Signed-off-by: Wei Yongjun diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index bdf368e..2f1f5d4 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -1948,6 +1948,7 @@ errdad: spin_unlock_bh(&ifp->lock); addrconf_mod_dad_work(ifp, 0); + in6_ifa_put(ifp); } > >>> This in6_ifa_put makes sense. > /* Join to solicited addr multicast group. @@ -3857,6 +3858,7 @@ static void addrconf_dad_work(struct work_struct *w) addrconf_dad_begin(ifp); goto out; } else if (action == DAD_ABORT) { + in6_ifa_hold(ifp); addrconf_dad_stop(ifp, 1); if (disable_ipv6) addrconf_ifdown(idev->dev, 0); > >>> But why you add a in6_ifa_hold here isn't clear to me. Could you >>> explain why this is necessary? I don't see any async stuff being done >>> in addrconf_dad_stop, thus the reference we already have should be >>> sufficient for the lifetime of addrconf_dad_stop. 
> >> I think the issue is that the link-local address is added with the IFA_F_PERMANENT flag; what we really >> need is to remove the in6_ifa_put() in addrconf_dad_stop. > >> static void addrconf_dad_stop(...) >> { >> if (ifp->flags&IFA_F_PERMANENT) { >> ... >> in6_ifa_put(ifp); <== remove this line since the caller holds a >> refcount >> } else if (ifp->flags&IFA_F_TEMPORARY) { >> ... >> ipv6_del_addr(ifp); >> } else { >> ipv6_del_addr(ifp); >> } >> } > > I see the comment of ipv6_del_addr: > > /* This function wants to get referenced ifp and releases it before return */ > static void ipv6_del_addr(struct inet6_ifaddr *ifp) > { >... > } > > Both in6_ifa_put() and ipv6_del_addr() consume a held refcount, so we should > use an extra in6_ifa_hold() call before calling addrconf_dad_stop(), as in the original > patch. You are correct! We always call in6_ifa_put() at the end of addrconf_dad_work. In this case it is not correct. Either we add an in6_ifa_hold() or we suppress or jump over the last in6_ifa_put(). Good catch! Hannes
Reply: [PATCH] ipv6: addrconf: fix dev refcount leak when DAD failed
On 05.09.2016 10:06, Wei Yongjun wrote: >>> In general, when DAD detected IPv6 duplicate address, ifp->state will >>> be set to INET6_IFADDR_STATE_ERRDAD and DAD is stopped by a delayed >>> work, the call tree should be like this: >>> >>> ndisc_recv_ns >>> -> addrconf_dad_failure<- missing ifp put >>> -> addrconf_mod_dad_work >>>-> schedule addrconf_dad_work() >>> -> addrconf_dad_stop() <- missing ifp hold before call it >>> >>> addrconf_dad_failure() called with ifp refcont holding but not put. >>> addrconf_dad_work() call addrconf_dad_stop() without extra holding >>> refcount. This will not cause any issue normally. >>> >>> But the race between addrconf_dad_failure() and addrconf_dad_work() >>> may cause ifp refcount leak and netdevice can not be unregister, >>> dmesg show the following messages: >>> >>> IPv6: eth0: IPv6 duplicate address fe80::XX:::XX detected! >>> ... >>> unregister_netdevice: waiting for eth0 to become free. Usage count = >>> 1 >>> >>> Cc: sta...@vger.kernel.org >>> Fixes: c15b1ccadb32 ("ipv6: move DAD and addrconf_verify processing >>> to >>> workqueue") >>> Signed-off-by: Wei Yongjun >>> >>> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index >>> bdf368e..2f1f5d4 100644 >>> --- a/net/ipv6/addrconf.c >>> +++ b/net/ipv6/addrconf.c >>> @@ -1948,6 +1948,7 @@ errdad: >>> spin_unlock_bh(&ifp->lock); >>> >>> addrconf_mod_dad_work(ifp, 0); >>> + in6_ifa_put(ifp); >>> } >>This in6_ifa_put makes sense. >>> >>> /* Join to solicited addr multicast group. >>> @@ -3857,6 +3858,7 @@ static void addrconf_dad_work(struct work_struct *w) >>> addrconf_dad_begin(ifp); >>> goto out; >>> } else if (action == DAD_ABORT) { >>> + in6_ifa_hold(ifp); >>> addrconf_dad_stop(ifp, 1); >>> if (disable_ipv6) >>> addrconf_ifdown(idev->dev, 0); >>> >>But why you add a in6_ifa_hold here isn't clear to me. Could you >>explain why this is necessary? 
I don't see any async stuff being done >>in addrconf_dad_stop, thus the reference we already have should be sufficient >>for the lifetime of addrconf_dad_stop. > I think the issue is that the link-local address is added with the IFA_F_PERMANENT flag; what we really > need is to remove the in6_ifa_put() in addrconf_dad_stop. > static void addrconf_dad_stop(...) > { > if (ifp->flags&IFA_F_PERMANENT) { > ... > in6_ifa_put(ifp); <== remove this line since the caller holds a > refcount > } else if (ifp->flags&IFA_F_TEMPORARY) { > ... > ipv6_del_addr(ifp); > } else { > ipv6_del_addr(ifp); > } >} I see the comment of ipv6_del_addr: /* This function wants to get referenced ifp and releases it before return */ static void ipv6_del_addr(struct inet6_ifaddr *ifp) { ... } Both in6_ifa_put() and ipv6_del_addr() consume a held refcount, so we should use an extra in6_ifa_hold() call before calling addrconf_dad_stop(), as in the original patch. Regards, Wei Yongjun
[PATCH net-next 0/3] qed*: Debug data collection
This patch series adds the support of debug data collection in the qed driver, and the means to extract it in the qede driver via the get_regs operation. Hi Dave, Please consider applying this to 'net-next'. Thanks, Tomer Tomer Tayar (3): qed: Add infrastructure for debug data collection qed: Add support for debug data collection qed*: Add support for the ethtool get_regs operation drivers/net/ethernet/qlogic/qed/Makefile|2 +- drivers/net/ethernet/qlogic/qed/qed.h | 20 + drivers/net/ethernet/qlogic/qed/qed_debug.c | 6898 +++ drivers/net/ethernet/qlogic/qed/qed_debug.h | 54 + drivers/net/ethernet/qlogic/qed/qed_hsi.h | 1059 +++- drivers/net/ethernet/qlogic/qed/qed_main.c |6 + drivers/net/ethernet/qlogic/qed/qed_mcp.c | 76 + drivers/net/ethernet/qlogic/qed/qed_mcp.h | 46 + drivers/net/ethernet/qlogic/qed/qed_reg_addr.h | 900 +++ drivers/net/ethernet/qlogic/qede/qede_ethtool.c | 24 + include/linux/qed/common_hsi.h |3 + include/linux/qed/qed_if.h |4 + 12 files changed, 9080 insertions(+), 12 deletions(-) create mode 100644 drivers/net/ethernet/qlogic/qed/qed_debug.c create mode 100644 drivers/net/ethernet/qlogic/qed/qed_debug.h -- 1.8.3.1
Re: A potential bug in drivers/net/ethernet/synopsys/dwc_eth_qos.ko
Hi Pavel, Thanks for the notification. I agree that we should register the device after all initialization has completed. A patch will be sent shortly. BR, Lars On 09/05/2016 10:26 AM, Pavel Andrianov wrote: Hi! There is a potential bug in drivers/net/ethernet/synopsys/dwc_eth_qos.ko. In dwceqos_probe there is a registration of the net device (line 2879). After that, initialization of common resources continues (lines 2918, 2924, 2941), but by that point the net device's handlers may already be executing. Should the registration of the net device be moved to the end of dwceqos_probe?
[PATCH net-next 1/3] qed: Add infrastructure for debug data collection
Adds support for several infrastructure operations that are done as part of debug data collection. Signed-off-by: Tomer Tayar Signed-off-by: Yuval Mintz --- drivers/net/ethernet/qlogic/qed/qed_hsi.h | 3 + drivers/net/ethernet/qlogic/qed/qed_mcp.c | 76 ++ drivers/net/ethernet/qlogic/qed/qed_mcp.h | 46 drivers/net/ethernet/qlogic/qed/qed_reg_addr.h | 6 ++ 4 files changed, 131 insertions(+) diff --git a/drivers/net/ethernet/qlogic/qed/qed_hsi.h b/drivers/net/ethernet/qlogic/qed/qed_hsi.h index df94a8b7..3a4cb4a 100644 --- a/drivers/net/ethernet/qlogic/qed/qed_hsi.h +++ b/drivers/net/ethernet/qlogic/qed/qed_hsi.h @@ -7640,6 +7640,7 @@ struct public_drv_mb { #define DRV_MSG_CODE_CFG_VF_MSIX 0xc001 #define DRV_MSG_CODE_MCP_RESET 0x0009 #define DRV_MSG_CODE_SET_VERSION 0x000f +#define DRV_MSG_CODE_MCP_HALT 0x0010 #define DRV_MSG_CODE_GET_STATS 0x0013 #define DRV_MSG_CODE_STATS_TYPE_LAN 1 @@ -7647,6 +7648,8 @@ struct public_drv_mb { #define DRV_MSG_CODE_STATS_TYPE_ISCSI 3 #define DRV_MSG_CODE_STATS_TYPE_RDMA4 +#define DRV_MSG_CODE_MASK_PARITIES 0x001a + #define DRV_MSG_CODE_BIST_TEST 0x001e #define DRV_MSG_CODE_SET_LED_MODE 0x0020 diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.c b/drivers/net/ethernet/qlogic/qed/qed_mcp.c index 4c21266..ea673e5 100644 --- a/drivers/net/ethernet/qlogic/qed/qed_mcp.c +++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.c @@ -390,6 +390,34 @@ int qed_mcp_cmd(struct qed_hwfn *p_hwfn, return 0; } +int qed_mcp_nvm_rd_cmd(struct qed_hwfn *p_hwfn, + struct qed_ptt *p_ptt, + u32 cmd, + u32 param, + u32 *o_mcp_resp, + u32 *o_mcp_param, u32 *o_txn_size, u32 *o_buf) +{ + struct qed_mcp_mb_params mb_params; + union drv_union_data union_data; + int rc; + + memset(&mb_params, 0, sizeof(mb_params)); + mb_params.cmd = cmd; + mb_params.param = param; + mb_params.p_data_dst = &union_data; + rc = qed_mcp_cmd_and_union(p_hwfn, p_ptt, &mb_params); + if (rc) + return rc; + + *o_mcp_resp = mb_params.mcp_resp; + *o_mcp_param = mb_params.mcp_param; + + *o_txn_size = 
*o_mcp_param; + memcpy(o_buf, &union_data.raw_data, *o_txn_size); + + return 0; +} + int qed_mcp_load_req(struct qed_hwfn *p_hwfn, struct qed_ptt *p_ptt, u32 *p_load_code) { @@ -1169,6 +1197,33 @@ qed_mcp_send_drv_version(struct qed_hwfn *p_hwfn, return rc; } +int qed_mcp_halt(struct qed_hwfn *p_hwfn, struct qed_ptt *p_ptt) +{ + u32 resp = 0, param = 0; + int rc; + + rc = qed_mcp_cmd(p_hwfn, p_ptt, DRV_MSG_CODE_MCP_HALT, 0, &resp, +¶m); + if (rc) + DP_ERR(p_hwfn, "MCP response failure, aborting\n"); + + return rc; +} + +int qed_mcp_resume(struct qed_hwfn *p_hwfn, struct qed_ptt *p_ptt) +{ + u32 value, cpu_mode; + + qed_wr(p_hwfn, p_ptt, MCP_REG_CPU_STATE, 0x); + + value = qed_rd(p_hwfn, p_ptt, MCP_REG_CPU_MODE); + value &= ~MCP_REG_CPU_MODE_SOFT_HALT; + qed_wr(p_hwfn, p_ptt, MCP_REG_CPU_MODE, value); + cpu_mode = qed_rd(p_hwfn, p_ptt, MCP_REG_CPU_MODE); + + return (cpu_mode & MCP_REG_CPU_MODE_SOFT_HALT) ? -EAGAIN : 0; +} + int qed_mcp_set_led(struct qed_hwfn *p_hwfn, struct qed_ptt *p_ptt, enum qed_led_mode mode) { @@ -1196,6 +1251,27 @@ int qed_mcp_set_led(struct qed_hwfn *p_hwfn, return rc; } +int qed_mcp_mask_parities(struct qed_hwfn *p_hwfn, + struct qed_ptt *p_ptt, u32 mask_parities) +{ + u32 resp = 0, param = 0; + int rc; + + rc = qed_mcp_cmd(p_hwfn, p_ptt, DRV_MSG_CODE_MASK_PARITIES, +mask_parities, &resp, ¶m); + + if (rc) { + DP_ERR(p_hwfn, + "MCP response failure for mask parities, aborting\n"); + } else if (resp != FW_MSG_CODE_OK) { + DP_ERR(p_hwfn, + "MCP did not acknowledge mask parity request. 
Old MFW?\n"); + rc = -EINVAL; + } + + return rc; +} + int qed_mcp_bist_register_test(struct qed_hwfn *p_hwfn, struct qed_ptt *p_ptt) { u32 drv_mb_param = 0, rsp, param; diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.h b/drivers/net/ethernet/qlogic/qed/qed_mcp.h index c6372fa..dff520e 100644 --- a/drivers/net/ethernet/qlogic/qed/qed_mcp.h +++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.h @@ -468,6 +468,29 @@ int qed_mcp_reset(struct qed_hwfn *p_hwfn, struct qed_ptt *p_ptt); /** + * @brief - Sends an NVM read command request to the MFW to get + *a buffer. + * + * @param p_h
[PATCH net-next 3/3] qed*: Add support for the ethtool get_regs operation
Signed-off-by: Tomer Tayar Signed-off-by: Yuval Mintz --- drivers/net/ethernet/qlogic/qed/qed_main.c | 2 ++ drivers/net/ethernet/qlogic/qede/qede_ethtool.c | 24 include/linux/qed/qed_if.h | 4 3 files changed, 30 insertions(+) diff --git a/drivers/net/ethernet/qlogic/qed/qed_main.c b/drivers/net/ethernet/qlogic/qed/qed_main.c index 78cbf99..1716979 100644 --- a/drivers/net/ethernet/qlogic/qed/qed_main.c +++ b/drivers/net/ethernet/qlogic/qed/qed_main.c @@ -1421,6 +1421,8 @@ const struct qed_common_ops qed_common_ops_pass = { .get_link = &qed_get_current_link, .drain = &qed_drain, .update_msglvl = &qed_init_dp, + .dbg_all_data = &qed_dbg_all_data, + .dbg_all_data_size = &qed_dbg_all_data_size, .chain_alloc = &qed_chain_alloc, .chain_free = &qed_chain_free, .get_coalesce = &qed_get_coalesce, diff --git a/drivers/net/ethernet/qlogic/qede/qede_ethtool.c b/drivers/net/ethernet/qlogic/qede/qede_ethtool.c index 6e17ee1..eba0128 100644 --- a/drivers/net/ethernet/qlogic/qede/qede_ethtool.c +++ b/drivers/net/ethernet/qlogic/qede/qede_ethtool.c @@ -689,6 +689,28 @@ static int qede_set_pauseparam(struct net_device *dev, return 0; } +static void qede_get_regs(struct net_device *ndev, + struct ethtool_regs *regs, void *buffer) +{ + struct qede_dev *edev = netdev_priv(ndev); + + regs->version = 0; + memset(buffer, 0, regs->len); + + if (edev->ops && edev->ops->common) + edev->ops->common->dbg_all_data(edev->cdev, buffer); +} + +static int qede_get_regs_len(struct net_device *ndev) +{ + struct qede_dev *edev = netdev_priv(ndev); + + if (edev->ops && edev->ops->common) + return edev->ops->common->dbg_all_data_size(edev->cdev); + else + return -EINVAL; +} + static void qede_update_mtu(struct qede_dev *edev, union qede_reload_args *args) { edev->ndev->mtu = args->mtu; @@ -1340,6 +1362,8 @@ static const struct ethtool_ops qede_ethtool_ops = { .get_link_ksettings = qede_get_link_ksettings, .set_link_ksettings = qede_set_link_ksettings, .get_drvinfo = qede_get_drvinfo, + .get_regs_len = 
qede_get_regs_len, + .get_regs = qede_get_regs, .get_msglevel = qede_get_msglevel, .set_msglevel = qede_set_msglevel, .nway_reset = qede_nway_reset, diff --git a/include/linux/qed/qed_if.h b/include/linux/qed/qed_if.h index 864265f..5a39a2c 100644 --- a/include/linux/qed/qed_if.h +++ b/include/linux/qed/qed_if.h @@ -449,6 +449,10 @@ struct qed_common_ops { void(*simd_handler_clean)(struct qed_dev *cdev, int index); + int (*dbg_all_data) (struct qed_dev *cdev, void *buffer); + + int (*dbg_all_data_size) (struct qed_dev *cdev); + /** * @brief can_link_change - can the instance change the link or not * -- 1.8.3.1
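The get_regs_len/get_regs pairing wired up above follows the usual ethtool contract: userspace first queries the dump size, allocates a buffer of that length, and then asks the driver to fill it. A minimal userspace sketch of that two-call contract (toy names, not the qed API):

```c
#include <stdint.h>
#include <string.h>

/* Toy model of the ethtool get_regs_len/get_regs pairing: size query
 * first, then a fill into a caller-allocated buffer. Illustrative only. */
struct toy_dev {
	const uint8_t *regs;   /* register dump the "driver" would produce */
	size_t nregs;
};

static int toy_get_regs_len(const struct toy_dev *dev)
{
	/* qede returns -EINVAL when the common ops are missing */
	return dev->regs ? (int)dev->nregs : -22;
}

static void toy_get_regs(const struct toy_dev *dev, uint8_t *buf, size_t len)
{
	memset(buf, 0, len);	/* qede_get_regs() zeroes the buffer first */
	if (dev->regs)
		memcpy(buf, dev->regs, dev->nregs < len ? dev->nregs : len);
}
```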
Re: linux-next: manual merge of the char-misc tree with the net-next tree
On Mon, Sep 05, 2016 at 04:56:50PM +1000, Stephen Rothwell wrote: > Hi all, > > Today's linux-next merge of the char-misc tree got a conflict in: > > include/linux/hyperv.h > > between commit: > > 30d1de08c87d ("hv_netvsc: make inline functions static") > > from the net-next tree and commit: > > bb08d431a914 ("Drivers: hv: ring_buffer: count on wrap around mappings in > get_next_pkt_raw()") > > from the char-misc tree. > > I fixed it up (the former moved the code modified by the latter, so the > below patch applies to the new location of the code) and can carry the > fix as necessary. This is now fixed as far as linux-next is concerned, > but any non trivial conflicts should be mentioned to your upstream > maintainer when your tree is submitted for merging. You may also want > to consider cooperating with the maintainer of the conflicting tree to > minimise any particularly complex conflicts. > > From: Stephen Rothwell > Date: Mon, 5 Sep 2016 16:53:06 +1000 > Subject: [PATCH] Drivers: hv: ring_buffer: merge fix up for "hv_netvsc: make > inline functions static" > > Signed-off-by: Stephen Rothwell > --- > drivers/net/hyperv/netvsc.c | 32 +++- > 1 file changed, 11 insertions(+), 21 deletions(-) > > diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c > index 2a9ccc4d9e3c..026df6556068 100644 > --- a/drivers/net/hyperv/netvsc.c > +++ b/drivers/net/hyperv/netvsc.c > @@ -42,31 +42,23 @@ static struct vmpacket_descriptor * > get_next_pkt_raw(struct vmbus_channel *channel) > { > struct hv_ring_buffer_info *ring_info = &channel->inbound; > - u32 read_loc = ring_info->priv_read_index; > + u32 priv_read_loc = ring_info->priv_read_index; > void *ring_buffer = hv_get_ring_buffer(ring_info); > - struct vmpacket_descriptor *cur_desc; > - u32 packetlen; > u32 dsize = ring_info->ring_datasize; > - u32 delta = read_loc - ring_info->ring_buffer->read_index; > + /* > + * delta is the difference between what is available to read and > + * what was already consumed in 
place. We commit read index after > + * the whole batch is processed. > + */ > + u32 delta = priv_read_loc >= ring_info->ring_buffer->read_index ? > + priv_read_loc - ring_info->ring_buffer->read_index : > + (dsize - ring_info->ring_buffer->read_index) + priv_read_loc; > u32 bytes_avail_toread = (hv_get_bytes_to_read(ring_info) - delta); > > if (bytes_avail_toread < sizeof(struct vmpacket_descriptor)) > return NULL; > > - if ((read_loc + sizeof(*cur_desc)) > dsize) > - return NULL; > - > - cur_desc = ring_buffer + read_loc; > - packetlen = cur_desc->len8 << 3; > - > - /* > - * If the packet under consideration is wrapping around, > - * return failure. > - */ > - if ((read_loc + packetlen + VMBUS_PKT_TRAILER) > (dsize - 1)) > - return NULL; > - > - return cur_desc; > + return ring_buffer + priv_read_loc; > } > > /* > @@ -78,16 +70,14 @@ static void put_pkt_raw(struct vmbus_channel *channel, > struct vmpacket_descriptor *desc) > { > struct hv_ring_buffer_info *ring_info = &channel->inbound; > - u32 read_loc = ring_info->priv_read_index; > u32 packetlen = desc->len8 << 3; > u32 dsize = ring_info->ring_datasize; > > - BUG_ON((read_loc + packetlen + VMBUS_PKT_TRAILER) > dsize); > - > /* >* Include the packet trailer. >*/ > ring_info->priv_read_index += packetlen + VMBUS_PKT_TRAILER; > + ring_info->priv_read_index %= dsize; > } > > /* Ugh, messy. Thanks for this. KY, how did this happen? greg k-h
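The core of the merge fix is replacing the non-wrapping `read_loc - read_index` delta with arithmetic that stays correct when the private read index wraps past the end of the ring. A small sketch of that arithmetic (names are illustrative, not the hyperv API):

```c
#include <stdint.h>

/* Wrap-aware distance between the private (uncommitted) read index and
 * the committed read index in a ring of dsize bytes, mirroring the
 * delta computation in the merge fix above. */
static uint32_t ring_delta(uint32_t priv_read, uint32_t read, uint32_t dsize)
{
	return priv_read >= read ? priv_read - read
				 : (dsize - read) + priv_read;
}

/* Advancing the private index likewise wraps modulo the ring size,
 * as in the added `priv_read_index %= dsize`. */
static uint32_t ring_advance(uint32_t priv_read, uint32_t len, uint32_t dsize)
{
	return (priv_read + len) % dsize;
}
```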
Re: [PATCH] ipv6: addrconf: fix dev refcont leak when DAD failed
On 05.09.2016 10:06, Wei Yongjun wrote: >> In general, when DAD detected IPv6 duplicate address, ifp->state will >> be set to INET6_IFADDR_STATE_ERRDAD and DAD is stopped by a delayed >> work, the call tree should be like this: >> >> ndisc_recv_ns >> -> addrconf_dad_failure<- missing ifp put >> -> addrconf_mod_dad_work >>-> schedule addrconf_dad_work() >> -> addrconf_dad_stop() <- missing ifp hold before call it >> >> addrconf_dad_failure() called with ifp refcont holding but not put. >> addrconf_dad_work() call addrconf_dad_stop() without extra holding >> refcount. This will not cause any issue normally. >> >> But the race between addrconf_dad_failure() and addrconf_dad_work() >> may cause ifp refcount leak and netdevice can not be unregister, dmesg >> show the following messages: >> >> IPv6: eth0: IPv6 duplicate address fe80::XX:::XX detected! >> ... >> unregister_netdevice: waiting for eth0 to become free. Usage count = 1 >> >> Cc: sta...@vger.kernel.org >> Fixes: c15b1ccadb32 ("ipv6: move DAD and addrconf_verify processing to >> workqueue") >> Signed-off-by: Wei Yongjun >> >> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index >> bdf368e..2f1f5d4 100644 >> --- a/net/ipv6/addrconf.c >> +++ b/net/ipv6/addrconf.c >> @@ -1948,6 +1948,7 @@ errdad: >> spin_unlock_bh(&ifp->lock); >> >> addrconf_mod_dad_work(ifp, 0); >> +in6_ifa_put(ifp); >> } >This in6_ifa_put makes sense. >> >> /* Join to solicited addr multicast group. >> @@ -3857,6 +3858,7 @@ static void addrconf_dad_work(struct work_struct *w) >> addrconf_dad_begin(ifp); >> goto out; >> } else if (action == DAD_ABORT) { >> +in6_ifa_hold(ifp); >> addrconf_dad_stop(ifp, 1); >> if (disable_ipv6) >> addrconf_ifdown(idev->dev, 0); >> >But why you add a in6_ifa_hold here isn't clear to me. Could you explain why >this is >necessary? I don't see any async stuff being done in addrconf_dad_stop, thus >the >reference we already have should be sufficient for the lifetime of >addrconf_dad_stop. 
I think the issue is that the link-local address is added with the IFA_F_PERMANENT flag; what we really need is to remove the in6_ifa_put() in addrconf_dad_stop(): static void addrconf_dad_stop(...) { if (ifp->flags&IFA_F_PERMANENT) { ... in6_ifa_put(ifp); <== remove this line since the caller holds the refcount } else if (ifp->flags&IFA_F_TEMPORARY) { ... ipv6_del_addr(ifp); } else { ipv6_del_addr(ifp); } } If so, addrconf_dad_begin() also needs fixing, because it holds a ref before calling addrconf_dad_stop(): static void addrconf_dad_begin(struct inet6_ifaddr *ifp) { ... in6_ifa_hold(ifp); <-- remove this line addrconf_dad_stop(ifp, 0); ... } The same applies to inet6_addr_del(), which calls ipv6_del_addr() with a refcount held: inet6_addr_del(...) { ... list_for_each_entry(...) { ... in6_ifa_hold(ifp); <-- remove this line ... ipv6_del_addr(ifp); ... } ... } Regards, Yongjun Wei
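The disagreement in this thread is about which function owns the reference across the DAD-stop path. A toy refcount model (illustrative only, not the kernel's in6_ifa API) shows the hold/put discipline being argued over: a path whose holds and puts are unbalanced leaves the count stuck above zero, which is exactly the "waiting for eth0 to become free" leak:

```c
/* Toy model of the in6_ifa_hold()/in6_ifa_put() pairing: a "leak" is
 * simply a count that never reaches zero, so the object is never freed. */
struct toy_ifa {
	int refcnt;
	int freed;
};

static void hold(struct toy_ifa *ifp) { ifp->refcnt++; }

static void put(struct toy_ifa *ifp)
{
	if (--ifp->refcnt == 0)
		ifp->freed = 1;	/* object may now be destroyed */
}

/* Balanced path: the extra hold taken before the stop routine is
 * consumed inside it, and the caller drops its own reference last. */
static int balanced_path(struct toy_ifa *ifp)
{
	hold(ifp);	/* extra hold before calling the stop routine ... */
	put(ifp);	/* ... which consumes it internally */
	put(ifp);	/* caller drops its own reference */
	return ifp->freed;
}
```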
[PATCH 25/29] netfilter: conntrack: add gc worker to remove timed-out entries
From: Florian Westphal Conntrack gc worker to evict stale entries. GC happens once every 5 seconds, but we only scan at most 1/64th of the table (and not more than 8k buckets) to avoid hogging the cpu. This means that a complete scan of the table will take several minutes of wall-clock time. Considering that the gc run will never have to evict any entries during normal operation, because that happens from the packet path, this should be fine. We only need gc to make sure userspace (conntrack event listeners) eventually learns of the timeout, and for resource reclaim in case the system becomes idle. We do not disable BH, and we cond_resched() for every bucket, so this should not introduce noticeable latencies either. A followup patch will add a small change to speed up GC for the extreme case where most entries are timed out on an otherwise idle system. v2: Use cond_resched_rcu_qs & add comment wrt. missing restart on nulls value change in gc worker, suggested by Eric Dumazet. v3: don't call cancel_delayed_work_sync twice (again, Eric).
Signed-off-by: Florian Westphal Acked-by: Eric Dumazet Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nf_conntrack_core.c | 76 +++ 1 file changed, 76 insertions(+) diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index 87ee6da..f95a9e9 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -72,11 +72,24 @@ EXPORT_SYMBOL_GPL(nf_conntrack_expect_lock); struct hlist_nulls_head *nf_conntrack_hash __read_mostly; EXPORT_SYMBOL_GPL(nf_conntrack_hash); +struct conntrack_gc_work { + struct delayed_work dwork; + u32 last_bucket; + boolexiting; +}; + static __read_mostly struct kmem_cache *nf_conntrack_cachep; static __read_mostly spinlock_t nf_conntrack_locks_all_lock; static __read_mostly DEFINE_SPINLOCK(nf_conntrack_locks_all_lock); static __read_mostly bool nf_conntrack_locks_all; +#define GC_MAX_BUCKETS_DIV 64u +#define GC_MAX_BUCKETS 8192u +#define GC_INTERVAL(5 * HZ) +#define GC_MAX_EVICTS 256u + +static struct conntrack_gc_work conntrack_gc_work; + void nf_conntrack_lock(spinlock_t *lock) __acquires(lock) { spin_lock(lock); @@ -928,6 +941,63 @@ static noinline int early_drop(struct net *net, unsigned int _hash) return false; } +static void gc_worker(struct work_struct *work) +{ + unsigned int i, goal, buckets = 0, expired_count = 0; + unsigned long next_run = GC_INTERVAL; + struct conntrack_gc_work *gc_work; + + gc_work = container_of(work, struct conntrack_gc_work, dwork.work); + + goal = min(nf_conntrack_htable_size / GC_MAX_BUCKETS_DIV, GC_MAX_BUCKETS); + i = gc_work->last_bucket; + + do { + struct nf_conntrack_tuple_hash *h; + struct hlist_nulls_head *ct_hash; + struct hlist_nulls_node *n; + unsigned int hashsz; + struct nf_conn *tmp; + + i++; + rcu_read_lock(); + + nf_conntrack_get_ht(&ct_hash, &hashsz); + if (i >= hashsz) + i = 0; + + hlist_nulls_for_each_entry_rcu(h, n, &ct_hash[i], hnnode) { + tmp = nf_ct_tuplehash_to_ctrack(h); + + if (nf_ct_is_expired(tmp)) { + nf_ct_gc_expired(tmp); + 
expired_count++; + continue; + } + } + + /* could check get_nulls_value() here and restart if ct +* was moved to another chain. But given gc is best-effort +* we will just continue with next hash slot. +*/ + rcu_read_unlock(); + cond_resched_rcu_qs(); + } while (++buckets < goal && +expired_count < GC_MAX_EVICTS); + + if (gc_work->exiting) + return; + + gc_work->last_bucket = i; + schedule_delayed_work(&gc_work->dwork, next_run); +} + +static void conntrack_gc_work_init(struct conntrack_gc_work *gc_work) +{ + INIT_DELAYED_WORK(&gc_work->dwork, gc_worker); + gc_work->exiting = false; +} + static struct nf_conn * __nf_conntrack_alloc(struct net *net, const struct nf_conntrack_zone *zone, @@ -1534,6 +1604,7 @@ static int untrack_refs(void) void nf_conntrack_cleanup_start(void) { + conntrack_gc_work.exiting = true; RCU_INIT_POINTER(ip_ct_attach, NULL); } @@ -1543,6 +1614,7 @@ void nf_conntrack_cleanup_end(void) while (untrack_refs() > 0) schedule(); + cancel_delayed_work_sync(&conntrack_gc_work.dwork); nf_ct_free_hashtable(nf_conntrack_hash, nf_conntrack_htable_size); nf_conntrack_proto_fini(); @@ -1817,6 +1889,10 @@ int nf_conntrack_init_s
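The "1/64th of the table, capped at 8k buckets" budget in the commit message reduces to the min() in the worker, and the scan index simply resumes where the last run stopped and wraps at the table end. A hedged sketch of just that arithmetic:

```c
#include <stdint.h>

#define GC_MAX_BUCKETS_DIV 64u
#define GC_MAX_BUCKETS     8192u

static uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }

/* Buckets to scan per gc run: at most 1/64th of the table, and never
 * more than 8192, as in the gc_worker() goal computation above. */
static uint32_t gc_goal(uint32_t htable_size)
{
	return min_u32(htable_size / GC_MAX_BUCKETS_DIV, GC_MAX_BUCKETS);
}

/* The scan resumes at last_bucket and wraps to 0 at the table end,
 * mirroring the `i++; if (i >= hashsz) i = 0;` in the worker. */
static uint32_t gc_next_bucket(uint32_t i, uint32_t hashsz)
{
	i++;
	return i >= hashsz ? 0 : i;
}
```

With a 65536-bucket table each run scans 1024 buckets, so a full pass takes 64 runs, i.e. about five minutes at one run per 5 seconds, matching the "several minutes of wall-clock time" in the commit message.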
[PATCH 01/29] netfilter: conntrack: Only need first 4 bytes to get l4proto ports
From: Gao Feng We only need first 4 bytes instead of 8 bytes to get the ports of tcp/udp/dccp/sctp/udplite in their pkt_to_tuple function. Signed-off-by: Gao Feng Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nf_conntrack_proto_dccp.c| 3 ++- net/netfilter/nf_conntrack_proto_sctp.c| 4 ++-- net/netfilter/nf_conntrack_proto_tcp.c | 4 ++-- net/netfilter/nf_conntrack_proto_udp.c | 4 ++-- net/netfilter/nf_conntrack_proto_udplite.c | 3 ++- 5 files changed, 10 insertions(+), 8 deletions(-) diff --git a/net/netfilter/nf_conntrack_proto_dccp.c b/net/netfilter/nf_conntrack_proto_dccp.c index 399a38f..a45bee5 100644 --- a/net/netfilter/nf_conntrack_proto_dccp.c +++ b/net/netfilter/nf_conntrack_proto_dccp.c @@ -402,7 +402,8 @@ static bool dccp_pkt_to_tuple(const struct sk_buff *skb, unsigned int dataoff, { struct dccp_hdr _hdr, *dh; - dh = skb_header_pointer(skb, dataoff, sizeof(_hdr), &_hdr); + /* Actually only need first 4 bytes to get ports. */ + dh = skb_header_pointer(skb, dataoff, 4, &_hdr); if (dh == NULL) return false; diff --git a/net/netfilter/nf_conntrack_proto_sctp.c b/net/netfilter/nf_conntrack_proto_sctp.c index 1d7ab96..e769f05 100644 --- a/net/netfilter/nf_conntrack_proto_sctp.c +++ b/net/netfilter/nf_conntrack_proto_sctp.c @@ -161,8 +161,8 @@ static bool sctp_pkt_to_tuple(const struct sk_buff *skb, unsigned int dataoff, const struct sctphdr *hp; struct sctphdr _hdr; - /* Actually only need first 8 bytes. */ - hp = skb_header_pointer(skb, dataoff, 8, &_hdr); + /* Actually only need first 4 bytes to get ports. 
*/ + hp = skb_header_pointer(skb, dataoff, 4, &_hdr); if (hp == NULL) return false; diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c index 70c8381..4abe9e1 100644 --- a/net/netfilter/nf_conntrack_proto_tcp.c +++ b/net/netfilter/nf_conntrack_proto_tcp.c @@ -282,8 +282,8 @@ static bool tcp_pkt_to_tuple(const struct sk_buff *skb, unsigned int dataoff, const struct tcphdr *hp; struct tcphdr _hdr; - /* Actually only need first 8 bytes. */ - hp = skb_header_pointer(skb, dataoff, 8, &_hdr); + /* Actually only need first 4 bytes to get ports. */ + hp = skb_header_pointer(skb, dataoff, 4, &_hdr); if (hp == NULL) return false; diff --git a/net/netfilter/nf_conntrack_proto_udp.c b/net/netfilter/nf_conntrack_proto_udp.c index 4fd0405..8a057e1 100644 --- a/net/netfilter/nf_conntrack_proto_udp.c +++ b/net/netfilter/nf_conntrack_proto_udp.c @@ -44,8 +44,8 @@ static bool udp_pkt_to_tuple(const struct sk_buff *skb, const struct udphdr *hp; struct udphdr _hdr; - /* Actually only need first 8 bytes. */ - hp = skb_header_pointer(skb, dataoff, sizeof(_hdr), &_hdr); + /* Actually only need first 4 bytes to get ports. */ + hp = skb_header_pointer(skb, dataoff, 4, &_hdr); if (hp == NULL) return false; diff --git a/net/netfilter/nf_conntrack_proto_udplite.c b/net/netfilter/nf_conntrack_proto_udplite.c index 9d692f5..029206e 100644 --- a/net/netfilter/nf_conntrack_proto_udplite.c +++ b/net/netfilter/nf_conntrack_proto_udplite.c @@ -54,7 +54,8 @@ static bool udplite_pkt_to_tuple(const struct sk_buff *skb, const struct udphdr *hp; struct udphdr _hdr; - hp = skb_header_pointer(skb, dataoff, sizeof(_hdr), &_hdr); + /* Actually only need first 4 bytes to get ports. */ + hp = skb_header_pointer(skb, dataoff, 4, &_hdr); if (hp == NULL) return false; -- 2.1.4
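The patch works because TCP, UDP, DCCP, SCTP and UDP-Lite all place the 16-bit source and destination ports at offsets 0 and 2 of their headers (per the respective RFCs), so the first 4 bytes are all pkt_to_tuple needs. A quick offset check with a plain struct (illustrative, not the kernel's header definitions):

```c
#include <stdint.h>
#include <stddef.h>

/* First fields of a UDP header (RFC 768); TCP, DCCP, SCTP and UDP-Lite
 * lay out their ports identically, which is what lets the patch fetch
 * only 4 bytes via skb_header_pointer(). */
struct l4_ports {
	uint16_t source;	/* offset 0 */
	uint16_t dest;		/* offset 2 */
};

/* Extract a big-endian (network order) port from a raw header buffer. */
static uint16_t port_at(const uint8_t *hdr, size_t off)
{
	return (uint16_t)((hdr[off] << 8) | hdr[off + 1]);
}
```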
[PATCH 13/29] netfilter: fix spelling mistake: "delimitter" -> "delimiter"
From: Colin Ian King trivial fix to spelling mistake in pr_debug message Signed-off-by: Colin Ian King Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nf_conntrack_ftp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/netfilter/nf_conntrack_ftp.c b/net/netfilter/nf_conntrack_ftp.c index 4314700..b6934b5 100644 --- a/net/netfilter/nf_conntrack_ftp.c +++ b/net/netfilter/nf_conntrack_ftp.c @@ -237,7 +237,7 @@ static int try_eprt(const char *data, size_t dlen, struct nf_conntrack_man *cmd, } delim = data[0]; if (isdigit(delim) || delim < 33 || delim > 126 || data[2] != delim) { - pr_debug("try_eprt: invalid delimitter.\n"); + pr_debug("try_eprt: invalid delimiter.\n"); return 0; } -- 2.1.4
[PATCH 12/29] netfilter: nf_tables: add number generator expression
From: Laura Garcia Liebana This patch adds the numgen expression that allows us to generate incremental and random numbers; the generator is bound to an upper limit that is specified by userspace. This expression is useful to distribute packets in a round-robin fashion as well as randomly. Signed-off-by: Laura Garcia Liebana Signed-off-by: Pablo Neira Ayuso --- include/uapi/linux/netfilter/nf_tables.h | 24 net/netfilter/Kconfig| 6 + net/netfilter/Makefile | 1 + net/netfilter/nft_numgen.c | 192 +++ 4 files changed, 223 insertions(+) create mode 100644 net/netfilter/nft_numgen.c diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h index 784fbf1..8c9d6ff 100644 --- a/include/uapi/linux/netfilter/nf_tables.h +++ b/include/uapi/linux/netfilter/nf_tables.h @@ -1121,4 +1121,28 @@ enum nft_trace_types { __NFT_TRACETYPE_MAX }; #define NFT_TRACETYPE_MAX (__NFT_TRACETYPE_MAX - 1) + +/** + * enum nft_ng_attributes - nf_tables number generator expression netlink attributes + * + * @NFTA_NG_DREG: destination register (NLA_U32) + * @NFTA_NG_UNTIL: source value to increment the counter until reset (NLA_U32) + * @NFTA_NG_TYPE: operation type (NLA_U32) + */ +enum nft_ng_attributes { + NFTA_NG_UNSPEC, + NFTA_NG_DREG, + NFTA_NG_UNTIL, + NFTA_NG_TYPE, + __NFTA_NG_MAX +}; +#define NFTA_NG_MAX(__NFTA_NG_MAX - 1) + +enum nft_ng_types { + NFT_NG_INCREMENTAL, + NFT_NG_RANDOM, + __NFT_NG_MAX +}; +#define NFT_NG_MAX (__NFT_NG_MAX - 1) + #endif /* _LINUX_NF_TABLES_H */ diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig index 29a8078..e8d56d9 100644 --- a/net/netfilter/Kconfig +++ b/net/netfilter/Kconfig @@ -474,6 +474,12 @@ config NFT_META This option adds the "meta" expression that you can use to match and to set packet metainformation such as the packet mark.
+config NFT_NUMGEN + tristate "Netfilter nf_tables number generator module" + help + This option adds the number generator expression used to perform + incremental counting and random numbers bound to a upper limit. + config NFT_CT depends on NF_CONNTRACK tristate "Netfilter nf_tables conntrack module" diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile index 0fc42df..0c858110 100644 --- a/net/netfilter/Makefile +++ b/net/netfilter/Makefile @@ -80,6 +80,7 @@ obj-$(CONFIG_NF_TABLES_NETDEV)+= nf_tables_netdev.o obj-$(CONFIG_NFT_COMPAT) += nft_compat.o obj-$(CONFIG_NFT_EXTHDR) += nft_exthdr.o obj-$(CONFIG_NFT_META) += nft_meta.o +obj-$(CONFIG_NFT_NUMGEN) += nft_numgen.o obj-$(CONFIG_NFT_CT) += nft_ct.o obj-$(CONFIG_NFT_LIMIT)+= nft_limit.o obj-$(CONFIG_NFT_NAT) += nft_nat.o diff --git a/net/netfilter/nft_numgen.c b/net/netfilter/nft_numgen.c new file mode 100644 index 000..176e26d --- /dev/null +++ b/net/netfilter/nft_numgen.c @@ -0,0 +1,192 @@ +/* + * Copyright (c) 2016 Laura Garcia + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +static DEFINE_PER_CPU(struct rnd_state, nft_numgen_prandom_state); + +struct nft_ng_inc { + enum nft_registers dreg:8; + u32 until; + atomic_tcounter; +}; + +static void nft_ng_inc_eval(const struct nft_expr *expr, + struct nft_regs *regs, + const struct nft_pktinfo *pkt) +{ + struct nft_ng_inc *priv = nft_expr_priv(expr); + u32 nval, oval; + + do { + oval = atomic_read(&priv->counter); + nval = (oval + 1 < priv->until) ? 
oval + 1 : 0; + } while (atomic_cmpxchg(&priv->counter, oval, nval) != oval); + + memcpy(®s->data[priv->dreg], &priv->counter, sizeof(u32)); +} + +static const struct nla_policy nft_ng_policy[NFTA_NG_MAX + 1] = { + [NFTA_NG_DREG] = { .type = NLA_U32 }, + [NFTA_NG_UNTIL] = { .type = NLA_U32 }, + [NFTA_NG_TYPE] = { .type = NLA_U32 }, +}; + +static int nft_ng_inc_init(const struct nft_ctx *ctx, + const struct nft_expr *expr, + const struct nlattr * const tb[]) +{ + struct nft_ng_inc *priv = nft_expr_priv(expr); + + priv->until = ntohl(nla_get_be32(tb[NFTA_NG_UNTIL])); + if (priv->until == 0) + return -ERANGE; + + priv->dreg = nft_parse_register(tb[NFTA_NG_DREG]); + atomic_set(&priv->counter, 0); + + return nft_validate_register_store(ctx, priv->dreg, NULL, + NFT_DA
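The incremental generator's lock-free counter in nft_ng_inc_eval() can be modeled with C11 atomics: compute the next value (wrapping to 0 at `until`), then install it with a compare-and-swap, retrying if another cpu raced in. A sketch under that assumption (the kernel uses atomic_cmpxchg; this version returns the value the calling thread installed):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Lock-free round-robin counter yielding 0, 1, ..., until-1, 0, ...
 * even under concurrent callers, mirroring the cmpxchg loop in
 * nft_ng_inc_eval() above. */
static uint32_t ng_inc_next(atomic_uint *counter, uint32_t until)
{
	uint32_t oval, nval;

	do {
		oval = atomic_load(counter);
		nval = (oval + 1 < until) ? oval + 1 : 0;
		/* retry if someone else updated the counter meanwhile */
	} while (!atomic_compare_exchange_weak(counter, &oval, nval));

	return nval;
}
```

The init path's `until == 0 ? -ERANGE` check is what guarantees the modulus is never zero here.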
[PATCH 27/29] netfilter: remove __nf_ct_kill_acct helper
From: Florian Westphal After timer removal this just calls nf_ct_delete so remove the __ prefix version and make nf_ct_kill a shorthand for nf_ct_delete. Signed-off-by: Florian Westphal Signed-off-by: Pablo Neira Ayuso --- include/net/netfilter/nf_conntrack.h | 13 +++-- net/netfilter/nf_conntrack_core.c| 12 +--- 2 files changed, 8 insertions(+), 17 deletions(-) diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h index 7277751..5041805 100644 --- a/include/net/netfilter/nf_conntrack.h +++ b/include/net/netfilter/nf_conntrack.h @@ -219,21 +219,14 @@ static inline void nf_ct_refresh(struct nf_conn *ct, __nf_ct_refresh_acct(ct, 0, skb, extra_jiffies, 0); } -bool __nf_ct_kill_acct(struct nf_conn *ct, enum ip_conntrack_info ctinfo, - const struct sk_buff *skb, int do_acct); - /* kill conntrack and do accounting */ -static inline bool nf_ct_kill_acct(struct nf_conn *ct, - enum ip_conntrack_info ctinfo, - const struct sk_buff *skb) -{ - return __nf_ct_kill_acct(ct, ctinfo, skb, 1); -} +bool nf_ct_kill_acct(struct nf_conn *ct, enum ip_conntrack_info ctinfo, +const struct sk_buff *skb); /* kill conntrack without accounting */ static inline bool nf_ct_kill(struct nf_conn *ct) { - return __nf_ct_kill_acct(ct, 0, NULL, 0); + return nf_ct_delete(ct, 0, 0); } /* These are for NAT. Icky. 
*/ diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index 7c66ce40..ac1db40 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -1430,17 +1430,15 @@ acct: } EXPORT_SYMBOL_GPL(__nf_ct_refresh_acct); -bool __nf_ct_kill_acct(struct nf_conn *ct, - enum ip_conntrack_info ctinfo, - const struct sk_buff *skb, - int do_acct) +bool nf_ct_kill_acct(struct nf_conn *ct, +enum ip_conntrack_info ctinfo, +const struct sk_buff *skb) { - if (do_acct) - nf_ct_acct_update(ct, ctinfo, skb->len); + nf_ct_acct_update(ct, ctinfo, skb->len); return nf_ct_delete(ct, 0, 0); } -EXPORT_SYMBOL_GPL(__nf_ct_kill_acct); +EXPORT_SYMBOL_GPL(nf_ct_kill_acct); #if IS_ENABLED(CONFIG_NF_CT_NETLINK) -- 2.1.4
[PATCH 28/29] netfilter: log_arp: Use ARPHRD_ETHER instead of literal '1'
From: Gao Feng There is one macro ARPHRD_ETHER which defines the ethernet proto for ARP, so we could use it instead of the literal number '1'. Signed-off-by: Gao Feng Signed-off-by: Pablo Neira Ayuso --- net/ipv4/netfilter/nf_log_arp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/ipv4/netfilter/nf_log_arp.c b/net/ipv4/netfilter/nf_log_arp.c index e7ad950..cf8f2d4 100644 --- a/net/ipv4/netfilter/nf_log_arp.c +++ b/net/ipv4/netfilter/nf_log_arp.c @@ -62,7 +62,7 @@ static void dump_arp_packet(struct nf_log_buf *m, /* If it's for Ethernet and the lengths are OK, then log the ARP * payload. */ - if (ah->ar_hrd != htons(1) || + if (ah->ar_hrd != htons(ARPHRD_ETHER) || ah->ar_hln != ETH_ALEN || ah->ar_pln != sizeof(__be32)) return; -- 2.1.4
[PATCH 23/29] netfilter: conntrack: get rid of conntrack timer
From: Florian Westphal With stats enabled this eats 80 bytes on x86_64 per nf_conn entry, as Eric Dumazet pointed out during netfilter workshop 2016. Eric also says: "Another reason was the fact that Thomas was about to change max timer range [..]" (500462a9de657f8, 'timers: Switch to a non-cascading wheel'). Remove the timer and use a 32bit jiffies value containing the timestamp until which the entry is valid. During conntrack lookup, even before doing the tuple comparison, check the timeout value and evict the entry in case it is too old. The dying bit is used as a synchronization point to avoid races where multiple cpus try to evict the same entry. Because lookup is always lockless, we need to bump the refcnt once when we evict, else we could try to evict an already-dead entry that is being recycled. This is the standard/expected way when conntrack entries are destroyed. Followup patches will introduce garbage collection via a work queue and further places where we can reap obsoleted entries (e.g. during netlink dumps); this is needed to keep expired conntracks from hanging around for too long when the lookup rate is low after a busy period.
Signed-off-by: Florian Westphal Acked-by: Eric Dumazet Signed-off-by: Pablo Neira Ayuso --- include/net/netfilter/nf_conntrack.h | 23 +++-- net/netfilter/nf_conntrack_core.c| 91 net/netfilter/nf_conntrack_netlink.c | 14 ++ net/netfilter/nf_conntrack_pptp.c| 3 +- net/netfilter/nf_nat_core.c | 6 --- 5 files changed, 74 insertions(+), 63 deletions(-) diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h index 2a12748..7277751 100644 --- a/include/net/netfilter/nf_conntrack.h +++ b/include/net/netfilter/nf_conntrack.h @@ -42,7 +42,6 @@ union nf_conntrack_expect_proto { #include #include -#include #ifdef CONFIG_NETFILTER_DEBUG #define NF_CT_ASSERT(x)WARN_ON(!(x)) @@ -73,7 +72,7 @@ struct nf_conn_help { #include struct nf_conn { - /* Usage count in here is 1 for hash table/destruct timer, 1 per skb, + /* Usage count in here is 1 for hash table, 1 per skb, * plus 1 for any connection(s) we are `master' for * * Hint, SKB address this struct and refcnt via skb->nfct and @@ -96,8 +95,8 @@ struct nf_conn { /* Have we seen traffic both ways yet? (bitset) */ unsigned long status; - /* Timer function; drops refcnt when it goes off. */ - struct timer_list timeout; + /* jiffies32 when this ct is considered dead */ + u32 timeout; possible_net_t ct_net; @@ -291,14 +290,28 @@ static inline bool nf_is_loopback_packet(const struct sk_buff *skb) return skb->dev && skb->skb_iif && skb->dev->flags & IFF_LOOPBACK; } +#define nfct_time_stamp ((u32)(jiffies)) + /* jiffies until ct expires, 0 if already expired */ static inline unsigned long nf_ct_expires(const struct nf_conn *ct) { - long timeout = (long)ct->timeout.expires - (long)jiffies; + s32 timeout = ct->timeout - nfct_time_stamp; return timeout > 0 ? 
timeout : 0; } +static inline bool nf_ct_is_expired(const struct nf_conn *ct) +{ + return (__s32)(ct->timeout - nfct_time_stamp) <= 0; +} + +/* use after obtaining a reference count */ +static inline bool nf_ct_should_gc(const struct nf_conn *ct) +{ + return nf_ct_is_expired(ct) && nf_ct_is_confirmed(ct) && + !nf_ct_is_dying(ct); +} + struct kernel_param; int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp); diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index 887926a..87ee6da 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -371,7 +371,6 @@ destroy_conntrack(struct nf_conntrack *nfct) pr_debug("destroy_conntrack(%p)\n", ct); NF_CT_ASSERT(atomic_read(&nfct->use) == 0); - NF_CT_ASSERT(!timer_pending(&ct->timeout)); if (unlikely(nf_ct_is_template(ct))) { nf_ct_tmpl_free(ct); @@ -434,35 +433,30 @@ bool nf_ct_delete(struct nf_conn *ct, u32 portid, int report) { struct nf_conn_tstamp *tstamp; + if (test_and_set_bit(IPS_DYING_BIT, &ct->status)) + return false; + tstamp = nf_conn_tstamp_find(ct); if (tstamp && tstamp->stop == 0) tstamp->stop = ktime_get_real_ns(); - if (nf_ct_is_dying(ct)) - goto delete; - if (nf_conntrack_event_report(IPCT_DESTROY, ct, portid, report) < 0) { - /* destroy event was not delivered */ + /* destroy event was not delivered. nf_ct_put will +* be done by event cache worker on redelivery. +*/ nf_ct_delete_from_lists(ct); nf_conntrack_ecache_delayed_work(nf_ct_net(ct)); return false; } nf_conn
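The expiry helpers added above rely on wrap-safe 32-bit arithmetic: the subtraction is reduced mod 2^32 and then interpreted as a signed distance, so the comparison stays correct even after the jiffies counter wraps. A standalone sketch of the same idiom:

```c
#include <stdint.h>

/* Wrap-safe "is this deadline in the past?", as in nf_ct_is_expired():
 * (s32)(timeout - now) is negative or zero once `now` has passed
 * `timeout`, even across a 32-bit wrap of the clock. */
static int ct_is_expired(uint32_t timeout, uint32_t now)
{
	return (int32_t)(timeout - now) <= 0;
}

/* Remaining lifetime, clamped at 0, as in nf_ct_expires(). */
static uint32_t ct_expires(uint32_t timeout, uint32_t now)
{
	int32_t left = (int32_t)(timeout - now);

	return left > 0 ? (uint32_t)left : 0;
}
```

This only stays correct while deadlines are less than 2^31 ticks away, which is why the entry's validity is bounded by the timeout value rather than stored as an absolute time.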
[PATCH 19/29] netfilter: nf_tables: honor NLM_F_EXCL flag in set element insertion
If the NLM_F_EXCL flag is set, then new elements that clash with an existing one return EEXIST. In case you try to add an element whose data area differs from what we have, then this returns EBUSY. If no flag is specified at all, then this returns success to userspace. This patch also updates the set insert operation so that we can fetch the existing element that clashes with the one being added; we need this to make sure the element data doesn't differ. Signed-off-by: Pablo Neira Ayuso --- include/net/netfilter/nf_tables.h | 3 ++- net/netfilter/nf_tables_api.c | 20 +++- net/netfilter/nft_set_hash.c | 17 + net/netfilter/nft_set_rbtree.c| 12 4 files changed, 38 insertions(+), 14 deletions(-) diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h index f2f1339..8972468 100644 --- a/include/net/netfilter/nf_tables.h +++ b/include/net/netfilter/nf_tables.h @@ -251,7 +251,8 @@ struct nft_set_ops { int (*insert)(const struct net *net, const struct nft_set *set, - const struct nft_set_elem *elem); + const struct nft_set_elem *elem, + struct nft_set_ext **ext); void(*activate)(const struct net *net, const struct nft_set *set, const struct nft_set_elem *elem); diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c index 221d27f..bd9715e 100644 --- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -3483,12 +3483,12 @@ static int nft_setelem_parse_flags(const struct nft_set *set, } static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set, - const struct nlattr *attr) + const struct nlattr *attr, u32 nlmsg_flags) { struct nlattr *nla[NFTA_SET_ELEM_MAX + 1]; struct nft_data_desc d1, d2; struct nft_set_ext_tmpl tmpl; - struct nft_set_ext *ext; + struct nft_set_ext *ext, *ext2; struct nft_set_elem elem; struct nft_set_binding *binding; struct nft_userdata *udata; @@ -3615,9 +3615,19 @@ static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set, goto err4; ext->genmask =
nft_genmask_cur(ctx->net) | NFT_SET_ELEM_BUSY_MASK; - err = set->ops->insert(ctx->net, set, &elem); - if (err < 0) + err = set->ops->insert(ctx->net, set, &elem, &ext2); + if (err) { + if (err == -EEXIST) { + if (nft_set_ext_exists(ext, NFT_SET_EXT_DATA) && + nft_set_ext_exists(ext2, NFT_SET_EXT_DATA) && + memcmp(nft_set_ext_data(ext), + nft_set_ext_data(ext2), set->dlen) != 0) + err = -EBUSY; + else if (!(nlmsg_flags & NLM_F_EXCL)) + err = 0; + } goto err5; + } nft_trans_elem(trans) = elem; list_add_tail(&trans->list, &ctx->net->nft.commit_list); @@ -3673,7 +3683,7 @@ static int nf_tables_newsetelem(struct net *net, struct sock *nlsk, !atomic_add_unless(&set->nelems, 1, set->size + set->ndeact)) return -ENFILE; - err = nft_add_set_elem(&ctx, set, attr); + err = nft_add_set_elem(&ctx, set, attr, nlh->nlmsg_flags); if (err < 0) { atomic_dec(&set->nelems); break; diff --git a/net/netfilter/nft_set_hash.c b/net/netfilter/nft_set_hash.c index 564fa79..3794cb2 100644 --- a/net/netfilter/nft_set_hash.c +++ b/net/netfilter/nft_set_hash.c @@ -126,7 +126,8 @@ err1: } static int nft_hash_insert(const struct net *net, const struct nft_set *set, - const struct nft_set_elem *elem) + const struct nft_set_elem *elem, + struct nft_set_ext **ext) { struct nft_hash *priv = nft_set_priv(set); struct nft_hash_elem *he = elem->priv; @@ -135,9 +136,17 @@ static int nft_hash_insert(const struct net *net, const struct nft_set *set, .set = set, .key = elem->key.val.data, }; - - return rhashtable_lookup_insert_key(&priv->ht, &arg, &he->node, - nft_hash_params); + struct nft_hash_elem *prev; + + prev = rhashtable_lookup_get_insert_key(&priv->ht, &arg, &he->node, + nft_hash_params); + if (IS_ERR(prev)) + return PTR_ERR(prev); + if (pr
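The clash handling in the nft_add_set_elem() hunk reduces to a small decision table: same key with different data is EBUSY, same key with identical data is EEXIST only when NLM_F_EXCL was requested, and otherwise the add is treated as an idempotent success. A sketch of just that decision (toy errno constants; the kernel uses the real -EEXIST/-EBUSY):

```c
#include <string.h>
#include <stddef.h>

#define TOY_EEXIST 17
#define TOY_EBUSY  16
#define NLM_F_EXCL 0x200	/* matches the uapi netlink flag value */

/* Decision applied after ->insert() reports a clashing element, modeled
 * on the nft_add_set_elem() hunk above. both_have_data says whether both
 * elements carry a data part worth comparing. */
static int resolve_clash(int both_have_data,
			 const void *old_data, const void *new_data,
			 size_t dlen, unsigned int nlmsg_flags)
{
	if (both_have_data && memcmp(old_data, new_data, dlen) != 0)
		return -TOY_EBUSY;	/* same key, different data */
	if (nlmsg_flags & NLM_F_EXCL)
		return -TOY_EEXIST;	/* caller demanded no clash */
	return 0;			/* idempotent add succeeds */
}
```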
[PATCH 08/29] netfilter: remove ip_conntrack* sysctl compat code
This backward compatibility has been around for more than ten years, since Yasuyuki Kozakai introduced IPv6 in conntrack. These days, we have alternate /proc/net/nf_conntrack* entries, the ctnetlink interface and the conntrack utility got adopted by many people in the user community according to what I observed on the netfilter user mailing list. So let's get rid of this. Note that nf_conntrack_htable_size and unsigned int nf_conntrack_max do not need to be exported as symbols anymore. Signed-off-by: Pablo Neira Ayuso --- include/net/netfilter/nf_conntrack_l4proto.h | 8 - include/net/netns/conntrack.h | 8 - net/ipv4/netfilter/Kconfig | 11 - net/ipv4/netfilter/Makefile| 5 - net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c | 70 --- .../netfilter/nf_conntrack_l3proto_ipv4_compat.c | 491 - net/ipv4/netfilter/nf_conntrack_proto_icmp.c | 39 +- net/netfilter/nf_conntrack_core.c | 3 - net/netfilter/nf_conntrack_proto.c | 81 +--- net/netfilter/nf_conntrack_proto_generic.c | 39 +- net/netfilter/nf_conntrack_proto_sctp.c| 85 +--- net/netfilter/nf_conntrack_proto_tcp.c | 127 +- net/netfilter/nf_conntrack_proto_udp.c | 49 +- 13 files changed, 7 insertions(+), 1009 deletions(-) delete mode 100644 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c diff --git a/include/net/netfilter/nf_conntrack_l4proto.h b/include/net/netfilter/nf_conntrack_l4proto.h index 1a5fb36..de629f1 100644 --- a/include/net/netfilter/nf_conntrack_l4proto.h +++ b/include/net/netfilter/nf_conntrack_l4proto.h @@ -134,14 +134,6 @@ void nf_ct_l4proto_pernet_unregister(struct net *net, int nf_ct_l4proto_register(struct nf_conntrack_l4proto *proto); void nf_ct_l4proto_unregister(struct nf_conntrack_l4proto *proto); -static inline void nf_ct_kfree_compat_sysctl_table(struct nf_proto_net *pn) -{ -#if defined(CONFIG_SYSCTL) && defined(CONFIG_NF_CONNTRACK_PROC_COMPAT) - kfree(pn->ctl_compat_table); - pn->ctl_compat_table = NULL; -#endif -} - /* Generic netlink helpers */ int nf_ct_port_tuple_to_nlattr(struct sk_buff
*skb, const struct nf_conntrack_tuple *tuple); diff --git a/include/net/netns/conntrack.h b/include/net/netns/conntrack.h index 38b1a80..e469e85 100644 --- a/include/net/netns/conntrack.h +++ b/include/net/netns/conntrack.h @@ -15,10 +15,6 @@ struct nf_proto_net { #ifdef CONFIG_SYSCTL struct ctl_table_header *ctl_table_header; struct ctl_table*ctl_table; -#ifdef CONFIG_NF_CONNTRACK_PROC_COMPAT - struct ctl_table_header *ctl_compat_header; - struct ctl_table*ctl_compat_table; -#endif #endif unsigned intusers; }; @@ -58,10 +54,6 @@ struct nf_ip_net { struct nf_udp_net udp; struct nf_icmp_net icmp; struct nf_icmp_net icmpv6; -#if defined(CONFIG_SYSCTL) && defined(CONFIG_NF_CONNTRACK_PROC_COMPAT) - struct ctl_table_header *ctl_table_header; - struct ctl_table*ctl_table; -#endif }; struct ct_pcpu { diff --git a/net/ipv4/netfilter/Kconfig b/net/ipv4/netfilter/Kconfig index c187c60..d613309 100644 --- a/net/ipv4/netfilter/Kconfig +++ b/net/ipv4/netfilter/Kconfig @@ -25,17 +25,6 @@ config NF_CONNTRACK_IPV4 To compile it as a module, choose M here. If unsure, say N. -config NF_CONNTRACK_PROC_COMPAT - bool "proc/sysctl compatibility with old connection tracking" - depends on NF_CONNTRACK_PROCFS && NF_CONNTRACK_IPV4 - default y - help - This option enables /proc and sysctl compatibility with the old - layer 3 dependent connection tracking. This is needed to keep - old programs that have not been adapted to the new names working. - - If unsure, say Y. 
- if NF_TABLES config NF_TABLES_IPV4 diff --git a/net/ipv4/netfilter/Makefile b/net/ipv4/netfilter/Makefile index 87b073d..853328f 100644 --- a/net/ipv4/netfilter/Makefile +++ b/net/ipv4/netfilter/Makefile @@ -4,11 +4,6 @@ # objects for l3 independent conntrack nf_conntrack_ipv4-y:= nf_conntrack_l3proto_ipv4.o nf_conntrack_proto_icmp.o -ifeq ($(CONFIG_NF_CONNTRACK_PROC_COMPAT),y) -ifeq ($(CONFIG_PROC_FS),y) -nf_conntrack_ipv4-objs += nf_conntrack_l3proto_ipv4_compat.o -endif -endif # connection tracking obj-$(CONFIG_NF_CONNTRACK_IPV4) += nf_conntrack_ipv4.o diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c index ae1a71a..870aebd 100644 --- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c +++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c @@ -202,47 +202,6 @@ static struct nf_hook_ops ipv4_conntrack_ops[] __read_mostly = { }, }; -#if defined(CONFIG_SYSCTL) && defined(CONFIG_NF_CONNTRACK_PROC_COMPAT) -static int log_invalid_proto_min = 0; -static int log_in
[PATCH 05/29] ipvs: use nf_ct_kill helper
From: Florian Westphal Once timer is removed from nf_conn struct we cannot open-code the removal sequence anymore. Signed-off-by: Florian Westphal Acked-by: Julian Anastasov Signed-off-by: Pablo Neira Ayuso --- net/netfilter/ipvs/ip_vs_nfct.c | 7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/net/netfilter/ipvs/ip_vs_nfct.c b/net/netfilter/ipvs/ip_vs_nfct.c index f04fd8d..fc230d9 100644 --- a/net/netfilter/ipvs/ip_vs_nfct.c +++ b/net/netfilter/ipvs/ip_vs_nfct.c @@ -281,13 +281,10 @@ void ip_vs_conn_drop_conntrack(struct ip_vs_conn *cp) h = nf_conntrack_find_get(cp->ipvs->net, &nf_ct_zone_dflt, &tuple); if (h) { ct = nf_ct_tuplehash_to_ctrack(h); - /* Show what happens instead of calling nf_ct_kill() */ - if (del_timer(&ct->timeout)) { - IP_VS_DBG(7, "%s: ct=%p, deleted conntrack timer for tuple=" + if (nf_ct_kill(ct)) { + IP_VS_DBG(7, "%s: ct=%p, deleted conntrack for tuple=" FMT_TUPLE "\n", __func__, ct, ARG_TUPLE(&tuple)); - if (ct->timeout.function) - ct->timeout.function(ct->timeout.data); } else { IP_VS_DBG(7, "%s: ct=%p, no conntrack timer for tuple=" FMT_TUPLE "\n", -- 2.1.4
[PATCH 17/29] netfilter: nf_tables: reject hook configuration updates on existing chains
Currently, if you add a base chain whose name clashes with an existing non-base chain, nf_tables doesn't complain about this. The same happens if you update the chain type, the hook number or the priority. With this patch, nf_tables bails out with EBUSY in case any of these unsupported operations occurs. # nft add table x # nft add chain x y # nft add chain x y { type nat hook input priority 0\; } :1:1-49: Error: Could not process rule: Device or resource busy add chain x y { type nat hook input priority 0; } ^ Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nf_tables_api.c | 31 +++ 1 file changed, 31 insertions(+) diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c index 463fcad..221d27f 100644 --- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -1348,6 +1348,37 @@ static int nf_tables_newchain(struct net *net, struct sock *nlsk, if (nlh->nlmsg_flags & NLM_F_REPLACE) return -EOPNOTSUPP; + if (nla[NFTA_CHAIN_HOOK]) { + struct nft_base_chain *basechain; + struct nft_chain_hook hook; + struct nf_hook_ops *ops; + + if (!(chain->flags & NFT_BASE_CHAIN)) + return -EBUSY; + + err = nft_chain_parse_hook(net, nla, afi, &hook, + create); + if (err < 0) + return err; + + basechain = nft_base_chain(chain); + if (basechain->type != hook.type) { + nft_chain_release_hook(&hook); + return -EBUSY; + } + + for (i = 0; i < afi->nops; i++) { + ops = &basechain->ops[i]; + if (ops->hooknum != hook.num || + ops->priority != hook.priority || + ops->dev != hook.dev) { + nft_chain_release_hook(&hook); + return -EBUSY; + } + } + nft_chain_release_hook(&hook); + } + if (nla[NFTA_CHAIN_HANDLE] && name) { struct nft_chain *chain2; -- 2.1.4
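Stripped of the locking and netlink plumbing, the comparison logic in the hunk above reduces to a few equality checks. The sketch below is a hypothetical userspace condensation (the function name and the hard-coded errno constant are invented for illustration, not part of the patch): a hook specification on an existing chain is accepted only when it matches the chain's current type, hook number and priority exactly; everything else yields -EBUSY (16 on Linux).

```c
#include <stdbool.h>

#define EBUSY_SKETCH 16	/* Linux EBUSY, hard-coded for the sketch */

/* Hypothetical condensation of the checks nf_tables_newchain() gains:
 * re-specifying a hook on an existing chain is only allowed if nothing
 * actually changes. */
int chain_hook_update_check(bool is_base_chain,
			    int old_type, int new_type,
			    unsigned int old_hooknum, unsigned int new_hooknum,
			    int old_prio, int new_prio)
{
	if (!is_base_chain)		/* hook given for a non-base chain */
		return -EBUSY_SKETCH;
	if (old_type != new_type)	/* chain type update */
		return -EBUSY_SKETCH;
	if (old_hooknum != new_hooknum || old_prio != new_prio)
		return -EBUSY_SKETCH;	/* hook number/priority update */
	return 0;
}
```

An identical re-specification returns 0, which is why re-running the original `add chain` command stays idempotent.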
[PATCH 11/29] netfilter: nf_tables: add quota expression
This patch adds the quota expression. This new stateful expression integrates easily into the dynset expression to build 'hashquota' flow tables. Arguably, we could use "counter bytes > 1000" instead, but this approach has several problems: 1) We only support one single stateful expression in dynamic set definitions, and the expression above is a composite of two expressions: get counter + comparison. 2) We would need to restore the packed counter representation (that we used to have) based on a seqlock to synchronize this, since per-cpu is not suitable for this. So instead of bloating the counter expression back with the seqlock representation and extending the existing set infrastructure to make it more complex for the composite described above, let's follow the simpler approach of adding a quota expression that we can plug into our existing infrastructure. Signed-off-by: Pablo Neira Ayuso --- include/uapi/linux/netfilter/nf_tables.h | 19 + net/netfilter/Kconfig| 6 ++ net/netfilter/Makefile | 1 + net/netfilter/nft_quota.c| 121 +++ 4 files changed, 147 insertions(+) create mode 100644 net/netfilter/nft_quota.c diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h index 6ce0a6d..784fbf1 100644 --- a/include/uapi/linux/netfilter/nf_tables.h +++ b/include/uapi/linux/netfilter/nf_tables.h @@ -900,6 +900,25 @@ enum nft_queue_attributes { #define NFT_QUEUE_FLAG_CPU_FANOUT 0x02 /* use current CPU (no hashing) */ #define NFT_QUEUE_FLAG_MASK0x03 +enum nft_quota_flags { + NFT_QUOTA_F_INV = (1 << 0), +}; + +/** + * enum nft_quota_attributes - nf_tables quota expression netlink attributes + * + * @NFTA_QUOTA_BYTES: quota in bytes (NLA_U16) + * @NFTA_QUOTA_FLAGS: flags (NLA_U32) + */ +enum nft_quota_attributes { + NFTA_QUOTA_UNSPEC, + NFTA_QUOTA_BYTES, + NFTA_QUOTA_FLAGS, + NFTA_QUOTA_PAD, + __NFTA_QUOTA_MAX +}; +#define NFTA_QUOTA_MAX (__NFTA_QUOTA_MAX - 1) + /** * enum nft_reject_types - nf_tables reject expression
reject types * diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig index 9cfaa00..29a8078 100644 --- a/net/netfilter/Kconfig +++ b/net/netfilter/Kconfig @@ -542,6 +542,12 @@ config NFT_QUEUE This is required if you intend to use the userspace queueing infrastructure (also known as NFQUEUE) from nftables. +config NFT_QUOTA + tristate "Netfilter nf_tables quota module" + help + This option adds the "quota" expression that you can use to match + enforce bytes quotas. + config NFT_REJECT default m if NETFILTER_ADVANCED=n tristate "Netfilter nf_tables reject support" diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile index 1106ccd..0fc42df 100644 --- a/net/netfilter/Makefile +++ b/net/netfilter/Makefile @@ -84,6 +84,7 @@ obj-$(CONFIG_NFT_CT) += nft_ct.o obj-$(CONFIG_NFT_LIMIT)+= nft_limit.o obj-$(CONFIG_NFT_NAT) += nft_nat.o obj-$(CONFIG_NFT_QUEUE)+= nft_queue.o +obj-$(CONFIG_NFT_QUOTA)+= nft_quota.o obj-$(CONFIG_NFT_REJECT) += nft_reject.o obj-$(CONFIG_NFT_REJECT_INET) += nft_reject_inet.o obj-$(CONFIG_NFT_SET_RBTREE) += nft_set_rbtree.o diff --git a/net/netfilter/nft_quota.c b/net/netfilter/nft_quota.c new file mode 100644 index 000..6eafbf9 --- /dev/null +++ b/net/netfilter/nft_quota.c @@ -0,0 +1,121 @@ +/* + * Copyright (c) 2016 Pablo Neira Ayuso + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. 
+ */ + +#include +#include +#include +#include +#include +#include +#include +#include + +struct nft_quota { + u64 quota; + boolinvert; + atomic64_t remain; +}; + +static inline long nft_quota(struct nft_quota *priv, +const struct nft_pktinfo *pkt) +{ + return atomic64_sub_return(pkt->skb->len, &priv->remain); +} + +static void nft_quota_eval(const struct nft_expr *expr, + struct nft_regs *regs, + const struct nft_pktinfo *pkt) +{ + struct nft_quota *priv = nft_expr_priv(expr); + + if (nft_quota(priv, pkt) < 0 && !priv->invert) + regs->verdict.code = NFT_BREAK; +} + +static const struct nla_policy nft_quota_policy[NFTA_QUOTA_MAX + 1] = { + [NFTA_QUOTA_BYTES] = { .type = NLA_U64 }, + [NFTA_QUOTA_FLAGS] = { .type = NLA_U32 }, +}; + +static int nft_quota_init(const struct nft_ctx *ctx, + const struct nft_expr *expr, + const struct nlattr * const tb[]) +{ + struct nft_quota *priv = nft_expr_priv(expr); + u32 flags = 0; +
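The evaluation path in nft_quota_eval() is compact enough to model in plain userspace C. The sketch below is hypothetical (none of these names exist in the kernel): an atomic remaining-byte counter is decremented by each packet's length, and the expression stops matching once the counter goes negative, mirroring the sign test on atomic64_sub_return() above. The invert flag is modelled here as "match only once over quota", which is the intended NFT_QUOTA_F_INV semantics; the invented quota_count_matches() helper just replays a list of packet lengths against a fresh quota.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model of the quota accounting in
 * nft_quota_eval(): decrement an atomic remaining-byte counter by the
 * packet length; a negative result means the quota is exhausted. */
struct quota_sketch {
	atomic_llong remain;
	bool invert;		/* models NFT_QUOTA_F_INV */
};

bool quota_match(struct quota_sketch *q, uint32_t pkt_len)
{
	/* atomic_fetch_sub returns the old value; compute the new one. */
	long long left = atomic_fetch_sub(&q->remain, pkt_len) - pkt_len;

	return q->invert ? left < 0 : left >= 0;
}

/* Invented test helper: run 'n' packet lengths against a fresh quota
 * of 'quota_bytes' and count how many packets matched. */
int quota_count_matches(long long quota_bytes, bool invert,
			const uint32_t *lens, int n)
{
	struct quota_sketch q = { .invert = invert };
	int i, matched = 0;

	atomic_init(&q.remain, quota_bytes);
	for (i = 0; i < n; i++)
		matched += quota_match(&q, lens[i]);
	return matched;
}
```

Because the subtraction happens unconditionally, concurrent callers can never double-count a packet, which is the property the per-packet atomic buys over a read-then-write counter.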
[PATCH 24/29] netfilter: evict stale entries on netlink dumps
From: Florian Westphal When dumping we already have to look at the entire table, so we might as well toss those entries whose timeout value is in the past. We also look at every entry during resize operations. However, eviction there is not as simple because we hold the global resize lock so we can't evict without adding a 'expired' list to drop from later. Considering that resizes are very rare it doesn't seem worth doing it. Signed-off-by: Florian Westphal Acked-by: Eric Dumazet Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nf_conntrack_netlink.c | 25 - 1 file changed, 24 insertions(+), 1 deletion(-) diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c index 81fd34c..dedbe4b 100644 --- a/net/netfilter/nf_conntrack_netlink.c +++ b/net/netfilter/nf_conntrack_netlink.c @@ -815,14 +815,23 @@ ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb) struct hlist_nulls_node *n; struct nfgenmsg *nfmsg = nlmsg_data(cb->nlh); u_int8_t l3proto = nfmsg->nfgen_family; - int res; + struct nf_conn *nf_ct_evict[8]; + int res, i; spinlock_t *lockp; last = (struct nf_conn *)cb->args[1]; + i = 0; local_bh_disable(); for (; cb->args[0] < nf_conntrack_htable_size; cb->args[0]++) { restart: + while (i) { + i--; + if (nf_ct_should_gc(nf_ct_evict[i])) + nf_ct_kill(nf_ct_evict[i]); + nf_ct_put(nf_ct_evict[i]); + } + lockp = &nf_conntrack_locks[cb->args[0] % CONNTRACK_LOCKS]; nf_conntrack_lock(lockp); if (cb->args[0] >= nf_conntrack_htable_size) { @@ -834,6 +843,13 @@ restart: if (NF_CT_DIRECTION(h) != IP_CT_DIR_ORIGINAL) continue; ct = nf_ct_tuplehash_to_ctrack(h); + if (nf_ct_is_expired(ct)) { + if (i < ARRAY_SIZE(nf_ct_evict) && + atomic_inc_not_zero(&ct->ct_general.use)) + nf_ct_evict[i++] = ct; + continue; + } + if (!net_eq(net, nf_ct_net(ct))) continue; @@ -875,6 +891,13 @@ out: if (last) nf_ct_put(last); + while (i) { + i--; + if (nf_ct_should_gc(nf_ct_evict[i])) + nf_ct_kill(nf_ct_evict[i]); + nf_ct_put(nf_ct_evict[i]); + } + 
return skb->len; } -- 2.1.4
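The collect-then-kill pattern in the hunks above is easy to isolate. Below is a hypothetical sketch (none of these names exist in the kernel; the "table" is reduced to an array of expiry timestamps): while the bucket lock is held only references to expired entries are recorded, up to a small fixed batch, and the actual kill is deferred until after the lock is dropped, mirroring how ctnetlink_dump_table() fills nf_ct_evict[8] and calls nf_ct_kill() outside the locked section.

```c
#include <stddef.h>

/* Hypothetical model of the nf_ct_evict[] batching: record at most
 * 'batch' indices of expired entries for deferred killing, the
 * analogue of taking a reference under the bucket lock and calling
 * nf_ct_kill() only after the lock is released. */
size_t collect_expired(const long *timeout, size_t n, long now,
		       size_t *evict, size_t batch)
{
	size_t i, found = 0;

	for (i = 0; i < n && found < batch; i++)
		if (timeout[i] <= now)		/* nf_ct_is_expired() analogue */
			evict[found++] = i;	/* deferred kill */
	return found;
}
```

Bounding the batch keeps the on-stack footprint fixed; anything not collected in this pass is simply picked up by a later dump or by the gc worker.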
[PATCH 20/29] netfilter: nf_tables: Use nla_put_be32() to dump immediate parameters
nft_dump_register() should only be used with registers, not with immediates. Fixes: cb1b69b0b15b ("netfilter: nf_tables: add hash expression") Fixes: 91dbc6be0a62("netfilter: nf_tables: add number generator expression") Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nft_hash.c | 6 +++--- net/netfilter/nft_numgen.c | 4 ++-- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/net/netfilter/nft_hash.c b/net/netfilter/nft_hash.c index e090aee..764251d 100644 --- a/net/netfilter/nft_hash.c +++ b/net/netfilter/nft_hash.c @@ -88,11 +88,11 @@ static int nft_hash_dump(struct sk_buff *skb, goto nla_put_failure; if (nft_dump_register(skb, NFTA_HASH_DREG, priv->dreg)) goto nla_put_failure; - if (nft_dump_register(skb, NFTA_HASH_LEN, priv->len)) + if (nla_put_be32(skb, NFTA_HASH_LEN, htonl(priv->len))) goto nla_put_failure; - if (nft_dump_register(skb, NFTA_HASH_MODULUS, priv->modulus)) + if (nla_put_be32(skb, NFTA_HASH_MODULUS, htonl(priv->modulus))) goto nla_put_failure; - if (nft_dump_register(skb, NFTA_HASH_SEED, priv->seed)) + if (nla_put_be32(skb, NFTA_HASH_SEED, htonl(priv->seed))) goto nla_put_failure; return 0; diff --git a/net/netfilter/nft_numgen.c b/net/netfilter/nft_numgen.c index 176e26d..294745e 100644 --- a/net/netfilter/nft_numgen.c +++ b/net/netfilter/nft_numgen.c @@ -68,9 +68,9 @@ static int nft_ng_dump(struct sk_buff *skb, enum nft_registers dreg, { if (nft_dump_register(skb, NFTA_NG_DREG, dreg)) goto nla_put_failure; - if (nft_dump_register(skb, NFTA_NG_UNTIL, until)) + if (nla_put_be32(skb, NFTA_NG_UNTIL, htonl(until))) goto nla_put_failure; - if (nft_dump_register(skb, NFTA_NG_TYPE, type)) + if (nla_put_be32(skb, NFTA_NG_TYPE, htonl(type))) goto nla_put_failure; return 0; -- 2.1.4
[PATCH 16/29] netfilter: nf_tables: introduce nft_chain_parse_hook()
Introduce a new function to wrap the code that parses the chain hook configuration so we can reuse this code to validate chain updates. Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nf_tables_api.c | 152 +- 1 file changed, 89 insertions(+), 63 deletions(-) diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c index 7e1c876..463fcad 100644 --- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -1196,6 +1196,83 @@ static void nf_tables_chain_destroy(struct nft_chain *chain) } } +struct nft_chain_hook { + u32 num; + u32 priority; + const struct nf_chain_type *type; + struct net_device *dev; +}; + +static int nft_chain_parse_hook(struct net *net, + const struct nlattr * const nla[], + struct nft_af_info *afi, + struct nft_chain_hook *hook, bool create) +{ + struct nlattr *ha[NFTA_HOOK_MAX + 1]; + const struct nf_chain_type *type; + struct net_device *dev; + int err; + + err = nla_parse_nested(ha, NFTA_HOOK_MAX, nla[NFTA_CHAIN_HOOK], + nft_hook_policy); + if (err < 0) + return err; + + if (ha[NFTA_HOOK_HOOKNUM] == NULL || + ha[NFTA_HOOK_PRIORITY] == NULL) + return -EINVAL; + + hook->num = ntohl(nla_get_be32(ha[NFTA_HOOK_HOOKNUM])); + if (hook->num >= afi->nhooks) + return -EINVAL; + + hook->priority = ntohl(nla_get_be32(ha[NFTA_HOOK_PRIORITY])); + + type = chain_type[afi->family][NFT_CHAIN_T_DEFAULT]; + if (nla[NFTA_CHAIN_TYPE]) { + type = nf_tables_chain_type_lookup(afi, nla[NFTA_CHAIN_TYPE], + create); + if (IS_ERR(type)) + return PTR_ERR(type); + } + if (!(type->hook_mask & (1 << hook->num))) + return -EOPNOTSUPP; + if (!try_module_get(type->owner)) + return -ENOENT; + + hook->type = type; + + hook->dev = NULL; + if (afi->flags & NFT_AF_NEEDS_DEV) { + char ifname[IFNAMSIZ]; + + if (!ha[NFTA_HOOK_DEV]) { + module_put(type->owner); + return -EOPNOTSUPP; + } + + nla_strlcpy(ifname, ha[NFTA_HOOK_DEV], IFNAMSIZ); + dev = dev_get_by_name(net, ifname); + if (!dev) { + module_put(type->owner); + return -ENOENT; + } + hook->dev = 
dev; + } else if (ha[NFTA_HOOK_DEV]) { + module_put(type->owner); + return -EOPNOTSUPP; + } + + return 0; +} + +static void nft_chain_release_hook(struct nft_chain_hook *hook) +{ + module_put(hook->type->owner); + if (hook->dev != NULL) + dev_put(hook->dev); +} + static int nf_tables_newchain(struct net *net, struct sock *nlsk, struct sk_buff *skb, const struct nlmsghdr *nlh, const struct nlattr * const nla[]) @@ -1206,10 +1283,8 @@ static int nf_tables_newchain(struct net *net, struct sock *nlsk, struct nft_table *table; struct nft_chain *chain; struct nft_base_chain *basechain = NULL; - struct nlattr *ha[NFTA_HOOK_MAX + 1]; u8 genmask = nft_genmask_next(net); int family = nfmsg->nfgen_family; - struct net_device *dev = NULL; u8 policy = NF_ACCEPT; u64 handle = 0; unsigned int i; @@ -1320,102 +1395,53 @@ static int nf_tables_newchain(struct net *net, struct sock *nlsk, return -EOVERFLOW; if (nla[NFTA_CHAIN_HOOK]) { - const struct nf_chain_type *type; + struct nft_chain_hook hook; struct nf_hook_ops *ops; nf_hookfn *hookfn; - u32 hooknum, priority; - - type = chain_type[family][NFT_CHAIN_T_DEFAULT]; - if (nla[NFTA_CHAIN_TYPE]) { - type = nf_tables_chain_type_lookup(afi, - nla[NFTA_CHAIN_TYPE], - create); - if (IS_ERR(type)) - return PTR_ERR(type); - } - err = nla_parse_nested(ha, NFTA_HOOK_MAX, nla[NFTA_CHAIN_HOOK], - nft_hook_policy); + err = nft_chain_parse_hook(net, nla, afi, &hook, create); if (err < 0) return err; - if (ha[NFTA_HOOK_HOOKNUM] == NULL || - ha[NFTA_HOOK_PRIORITY] == NULL) - return -EINVAL; -
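As a rough userspace model of the attribute validation order in nft_chain_parse_hook() — the function name and errno constant below are invented for illustration — both mandatory attributes must be present and the hook number must be in range before anything with side effects (module refcounts, device lookups) is attempted:

```c
#include <stdbool.h>

#define EINVAL_SKETCH 22	/* Linux EINVAL, hard-coded for the sketch */

/* Hypothetical model of the early checks in nft_chain_parse_hook():
 * NFTA_HOOK_HOOKNUM and NFTA_HOOK_PRIORITY are mandatory, and the
 * hook number must be below the family's afi->nhooks. */
int hook_attrs_check(bool have_hooknum, bool have_priority,
		     unsigned int hooknum, unsigned int nhooks)
{
	if (!have_hooknum || !have_priority)
		return -EINVAL_SKETCH;
	if (hooknum >= nhooks)
		return -EINVAL_SKETCH;
	return 0;
}
```

Doing the cheap checks first is what lets the real function keep its error unwinding down to a single module_put()/dev_put() pair in nft_chain_release_hook().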
[PATCH 26/29] netfilter: conntrack: resched gc again if eviction rate is high
From: Florian Westphal If we evicted a large fraction of the scanned conntrack entries re-schedule the next gc cycle for immediate execution. This triggers during tests where load is high, then drops to zero and many connections will be in TW/CLOSE state with < 30 second timeouts. Without this change it will take several minutes until conntrack count comes back to normal. Signed-off-by: Florian Westphal Acked-by: Eric Dumazet Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nf_conntrack_core.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index f95a9e9..7c66ce40 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -945,6 +945,7 @@ static void gc_worker(struct work_struct *work) { unsigned int i, goal, buckets = 0, expired_count = 0; unsigned long next_run = GC_INTERVAL; + unsigned int ratio, scanned = 0; struct conntrack_gc_work *gc_work; gc_work = container_of(work, struct conntrack_gc_work, dwork.work); @@ -969,6 +970,7 @@ static void gc_worker(struct work_struct *work) hlist_nulls_for_each_entry_rcu(h, n, &ct_hash[i], hnnode) { tmp = nf_ct_tuplehash_to_ctrack(h); + scanned++; if (nf_ct_is_expired(tmp)) { nf_ct_gc_expired(tmp); expired_count++; @@ -988,6 +990,10 @@ static void gc_worker(struct work_struct *work) if (gc_work->exiting) return; + ratio = scanned ? expired_count * 100 / scanned : 0; + if (ratio >= 90) + next_run = 0; + gc_work->last_bucket = i; schedule_delayed_work(&gc_work->dwork, next_run); } -- 2.1.4
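The heuristic itself is a one-liner and worth stating explicitly. In the sketch below (function and constant names invented; GC_INTERVAL_MS stands in for the kernel's jiffies-based GC_INTERVAL), the worker computes the delay for its next cycle: at a 90% or higher eviction ratio it reschedules itself immediately, and a scan of zero entries falls back to the normal interval to avoid a division by zero.

```c
#define GC_INTERVAL_MS 5000	/* stand-in for the kernel's GC_INTERVAL */

/* Hypothetical model of the rescheduling decision in gc_worker(). */
unsigned long gc_next_run_ms(unsigned int scanned, unsigned int expired)
{
	unsigned int ratio = scanned ? expired * 100 / scanned : 0;

	return ratio >= 90 ? 0 : GC_INTERVAL_MS;
}
```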
[PATCH 06/29] netfilter: nf_tables: rename set implementations
Use nft_set_* prefix for backend set implementations, thus we can use nft_hash for the new hash expression. Signed-off-by: Pablo Neira Ayuso --- net/netfilter/Kconfig| 4 ++-- net/netfilter/Makefile | 4 ++-- net/netfilter/{nft_hash.c => nft_set_hash.c} | 0 net/netfilter/{nft_rbtree.c => nft_set_rbtree.c} | 0 4 files changed, 4 insertions(+), 4 deletions(-) rename net/netfilter/{nft_hash.c => nft_set_hash.c} (100%) rename net/netfilter/{nft_rbtree.c => nft_set_rbtree.c} (100%) diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig index 9266cee..e5740e1 100644 --- a/net/netfilter/Kconfig +++ b/net/netfilter/Kconfig @@ -481,13 +481,13 @@ config NFT_CT This option adds the "meta" expression that you can use to match connection tracking information such as the flow state. -config NFT_RBTREE +config NFT_SET_RBTREE tristate "Netfilter nf_tables rbtree set module" help This option adds the "rbtree" set type (Red Black tree) that is used to build interval-based sets. -config NFT_HASH +config NFT_SET_HASH tristate "Netfilter nf_tables hash set module" help This option adds the "hash" set type that is used to build one-way diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile index 6913454..101fb85 100644 --- a/net/netfilter/Makefile +++ b/net/netfilter/Makefile @@ -86,8 +86,8 @@ obj-$(CONFIG_NFT_NAT) += nft_nat.o obj-$(CONFIG_NFT_QUEUE)+= nft_queue.o obj-$(CONFIG_NFT_REJECT) += nft_reject.o obj-$(CONFIG_NFT_REJECT_INET) += nft_reject_inet.o -obj-$(CONFIG_NFT_RBTREE) += nft_rbtree.o -obj-$(CONFIG_NFT_HASH) += nft_hash.o +obj-$(CONFIG_NFT_SET_RBTREE) += nft_set_rbtree.o +obj-$(CONFIG_NFT_SET_HASH) += nft_set_hash.o obj-$(CONFIG_NFT_COUNTER) += nft_counter.o obj-$(CONFIG_NFT_LOG) += nft_log.o obj-$(CONFIG_NFT_MASQ) += nft_masq.o diff --git a/net/netfilter/nft_hash.c b/net/netfilter/nft_set_hash.c similarity index 100% rename from net/netfilter/nft_hash.c rename to net/netfilter/nft_set_hash.c diff --git a/net/netfilter/nft_rbtree.c b/net/netfilter/nft_set_rbtree.c 
similarity index 100% rename from net/netfilter/nft_rbtree.c rename to net/netfilter/nft_set_rbtree.c -- 2.1.4
[PATCH 21/29] netfilter: restart search if moved to other chain
From: Florian Westphal In case nf_conntrack_tuple_taken did not find a conflicting entry check that all entries in this hash slot were tested and restart in case an entry was moved to another chain. Reported-by: Eric Dumazet Fixes: ea781f197d6a ("netfilter: nf_conntrack: use SLAB_DESTROY_BY_RCU and get rid of call_rcu()") Signed-off-by: Florian Westphal Acked-by: Eric Dumazet Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nf_conntrack_core.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index 7d90a5d..887926a 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -809,6 +809,7 @@ nf_conntrack_tuple_taken(const struct nf_conntrack_tuple *tuple, zone = nf_ct_zone(ignored_conntrack); rcu_read_lock(); + begin: nf_conntrack_get_ht(&ct_hash, &hsize); hash = __hash_conntrack(net, tuple, hsize); @@ -822,6 +823,12 @@ nf_conntrack_tuple_taken(const struct nf_conntrack_tuple *tuple, } NF_CT_STAT_INC_ATOMIC(net, searched); } + + if (get_nulls_value(n) != hash) { + NF_CT_STAT_INC_ATOMIC(net, search_restart); + goto begin; + } + rcu_read_unlock(); return 0; -- 2.1.4
[PATCH 03/29] netfilter: nf_dup4: remove redundant checksum recalculation
From: Liping Zhang IP header checksum will be recalculated at ip_local_out, so there's no need to calculate it here; remove it. Also update the code comments to illustrate this, and delete the misleading comments about checksum recalculation. Signed-off-by: Liping Zhang Signed-off-by: Pablo Neira Ayuso --- net/ipv4/netfilter/nf_dup_ipv4.c | 10 -- 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/net/ipv4/netfilter/nf_dup_ipv4.c b/net/ipv4/netfilter/nf_dup_ipv4.c index ceb1873..cf986e1 100644 --- a/net/ipv4/netfilter/nf_dup_ipv4.c +++ b/net/ipv4/netfilter/nf_dup_ipv4.c @@ -74,21 +74,19 @@ void nf_dup_ipv4(struct net *net, struct sk_buff *skb, unsigned int hooknum, nf_conntrack_get(skb->nfct); #endif /* -* If we are in PREROUTING/INPUT, the checksum must be recalculated -* since the length could have changed as a result of defragmentation. -* -* We also decrease the TTL to mitigate potential loops between two -* hosts. +* If we are in PREROUTING/INPUT, decrease the TTL to mitigate potential +* loops between two hosts. * * Set %IP_DF so that the original source is notified of a potentially * decreased MTU on the clone route. IPv6 does this too. +* +* IP header checksum will be recalculated at ip_local_out. */ iph = ip_hdr(skb); iph->frag_off |= htons(IP_DF); if (hooknum == NF_INET_PRE_ROUTING || hooknum == NF_INET_LOCAL_IN) --iph->ttl; - ip_send_check(iph); if (nf_dup_ipv4_route(net, skb, gw, oif)) { __this_cpu_write(nf_skb_duplicated, true); -- 2.1.4
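For reference, the recalculation being dropped is the standard 16-bit ones'-complement checksum over the IP header. The sketch below is a hypothetical userspace equivalent of the core of ip_send_check() (the function name is invented); the patch is safe precisely because ip_local_out() performs this same computation over the final header anyway.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace equivalent of the checksum ip_send_check()
 * computes: sum the header as big-endian 16-bit words, fold the
 * carries back in, and take the ones' complement.  The checksum field
 * itself must be zero on input.  'ihl' is the header length in
 * 32-bit words, as in the IP header's IHL field. */
uint16_t ip_hdr_csum(const uint8_t *hdr, size_t ihl)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i < ihl * 4; i += 2)
		sum += (uint32_t)hdr[i] << 8 | hdr[i + 1];
	while (sum >> 16)		/* fold carries back in */
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```

A receiver that sums the header including the stored checksum gets 0xffff, which is how verification works without recomputing from scratch.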
[PATCH 14/29] netfilter: nft_hash: fix non static symbol warning
From: Wei Yongjun Fixes the following sparse warning: net/netfilter/nft_hash.c:40:25: warning: symbol 'nft_hash_policy' was not declared. Should it be static? Signed-off-by: Wei Yongjun Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nft_hash.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/netfilter/nft_hash.c b/net/netfilter/nft_hash.c index b82ff29..e090aee 100644 --- a/net/netfilter/nft_hash.c +++ b/net/netfilter/nft_hash.c @@ -37,7 +37,7 @@ static void nft_hash_eval(const struct nft_expr *expr, priv->modulus); } -const struct nla_policy nft_hash_policy[NFTA_HASH_MAX + 1] = { +static const struct nla_policy nft_hash_policy[NFTA_HASH_MAX + 1] = { [NFTA_HASH_SREG]= { .type = NLA_U32 }, [NFTA_HASH_DREG]= { .type = NLA_U32 }, [NFTA_HASH_LEN] = { .type = NLA_U32 }, -- 2.1.4
[PATCH 15/29] netfilter: nf_tables: typo in trace attribute definition
From: Pablo Neira Should be attributes, instead of attibutes, for consistency with other definitions. Signed-off-by: Pablo Neira Ayuso --- include/uapi/linux/netfilter/nf_tables.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h index 8c9d6ff..8a63f22 100644 --- a/include/uapi/linux/netfilter/nf_tables.h +++ b/include/uapi/linux/netfilter/nf_tables.h @@ -1090,7 +1090,7 @@ enum nft_gen_attributes { * @NFTA_TRACE_NFPROTO: nf protocol processed (NLA_U32) * @NFTA_TRACE_POLICY: policy that decided fate of packet (NLA_U32) */ -enum nft_trace_attibutes { +enum nft_trace_attributes { NFTA_TRACE_UNSPEC, NFTA_TRACE_TABLE, NFTA_TRACE_CHAIN, -- 2.1.4
[PATCH 29/29] netfilter: log: Check param to avoid overflow in nf_log_set
From: Gao Feng The nf_log_set is an interface function, so it should do the strict sanity check of parameters. Convert the return value of nf_log_set as int instead of void. When the pf is invalid, return -EOPNOTSUPP. Signed-off-by: Gao Feng Signed-off-by: Pablo Neira Ayuso --- include/net/netfilter/nf_log.h | 3 +-- net/bridge/netfilter/nf_log_bridge.c | 3 +-- net/ipv4/netfilter/nf_log_arp.c | 3 +-- net/ipv4/netfilter/nf_log_ipv4.c | 3 +-- net/ipv6/netfilter/nf_log_ipv6.c | 3 +-- net/netfilter/nf_log.c | 8 +--- 6 files changed, 10 insertions(+), 13 deletions(-) diff --git a/include/net/netfilter/nf_log.h b/include/net/netfilter/nf_log.h index 83d855b..ee07dc8 100644 --- a/include/net/netfilter/nf_log.h +++ b/include/net/netfilter/nf_log.h @@ -60,8 +60,7 @@ struct nf_logger { int nf_log_register(u_int8_t pf, struct nf_logger *logger); void nf_log_unregister(struct nf_logger *logger); -void nf_log_set(struct net *net, u_int8_t pf, - const struct nf_logger *logger); +int nf_log_set(struct net *net, u_int8_t pf, const struct nf_logger *logger); void nf_log_unset(struct net *net, const struct nf_logger *logger); int nf_log_bind_pf(struct net *net, u_int8_t pf, diff --git a/net/bridge/netfilter/nf_log_bridge.c b/net/bridge/netfilter/nf_log_bridge.c index 5d9953a..1663df5 100644 --- a/net/bridge/netfilter/nf_log_bridge.c +++ b/net/bridge/netfilter/nf_log_bridge.c @@ -50,8 +50,7 @@ static struct nf_logger nf_bridge_logger __read_mostly = { static int __net_init nf_log_bridge_net_init(struct net *net) { - nf_log_set(net, NFPROTO_BRIDGE, &nf_bridge_logger); - return 0; + return nf_log_set(net, NFPROTO_BRIDGE, &nf_bridge_logger); } static void __net_exit nf_log_bridge_net_exit(struct net *net) diff --git a/net/ipv4/netfilter/nf_log_arp.c b/net/ipv4/netfilter/nf_log_arp.c index cf8f2d4..8945c26 100644 --- a/net/ipv4/netfilter/nf_log_arp.c +++ b/net/ipv4/netfilter/nf_log_arp.c @@ -111,8 +111,7 @@ static struct nf_logger nf_arp_logger __read_mostly = { static int __net_init 
nf_log_arp_net_init(struct net *net) { - nf_log_set(net, NFPROTO_ARP, &nf_arp_logger); - return 0; + return nf_log_set(net, NFPROTO_ARP, &nf_arp_logger); } static void __net_exit nf_log_arp_net_exit(struct net *net) diff --git a/net/ipv4/netfilter/nf_log_ipv4.c b/net/ipv4/netfilter/nf_log_ipv4.c index 076aadd..20f2255 100644 --- a/net/ipv4/netfilter/nf_log_ipv4.c +++ b/net/ipv4/netfilter/nf_log_ipv4.c @@ -347,8 +347,7 @@ static struct nf_logger nf_ip_logger __read_mostly = { static int __net_init nf_log_ipv4_net_init(struct net *net) { - nf_log_set(net, NFPROTO_IPV4, &nf_ip_logger); - return 0; + return nf_log_set(net, NFPROTO_IPV4, &nf_ip_logger); } static void __net_exit nf_log_ipv4_net_exit(struct net *net) diff --git a/net/ipv6/netfilter/nf_log_ipv6.c b/net/ipv6/netfilter/nf_log_ipv6.c index 8dd8696..c1bcf69 100644 --- a/net/ipv6/netfilter/nf_log_ipv6.c +++ b/net/ipv6/netfilter/nf_log_ipv6.c @@ -379,8 +379,7 @@ static struct nf_logger nf_ip6_logger __read_mostly = { static int __net_init nf_log_ipv6_net_init(struct net *net) { - nf_log_set(net, NFPROTO_IPV6, &nf_ip6_logger); - return 0; + return nf_log_set(net, NFPROTO_IPV6, &nf_ip6_logger); } static void __net_exit nf_log_ipv6_net_exit(struct net *net) diff --git a/net/netfilter/nf_log.c b/net/netfilter/nf_log.c index aa5847a..30a17d6 100644 --- a/net/netfilter/nf_log.c +++ b/net/netfilter/nf_log.c @@ -39,12 +39,12 @@ static struct nf_logger *__find_logger(int pf, const char *str_logger) return NULL; } -void nf_log_set(struct net *net, u_int8_t pf, const struct nf_logger *logger) +int nf_log_set(struct net *net, u_int8_t pf, const struct nf_logger *logger) { const struct nf_logger *log; - if (pf == NFPROTO_UNSPEC) - return; + if (pf == NFPROTO_UNSPEC || pf >= ARRAY_SIZE(net->nf.nf_loggers)) + return -EOPNOTSUPP; mutex_lock(&nf_log_mutex); log = nft_log_dereference(net->nf.nf_loggers[pf]); @@ -52,6 +52,8 @@ void nf_log_set(struct net *net, u_int8_t pf, const struct nf_logger *logger) 
rcu_assign_pointer(net->nf.nf_loggers[pf], logger); mutex_unlock(&nf_log_mutex); + + return 0; } EXPORT_SYMBOL(nf_log_set); -- 2.1.4
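The contract change can be stated compactly. Below is a hypothetical userspace condensation of the new sanity check (all names and constants are invented stand-ins; NFPROTO_NUMPROTO_SKETCH plays the role of ARRAY_SIZE(net->nf.nf_loggers)): both the wildcard family and any out-of-range family are rejected with -EOPNOTSUPP (95 on Linux) instead of being silently ignored or indexing past the per-netns logger array.

```c
#define NFPROTO_UNSPEC_SKETCH	0
#define NFPROTO_NUMPROTO_SKETCH	13	/* stand-in for the array size */
#define EOPNOTSUPP_SKETCH	95	/* Linux EOPNOTSUPP */

/* Hypothetical model of the parameter check nf_log_set() gains. */
int nf_log_set_check(unsigned int pf)
{
	if (pf == NFPROTO_UNSPEC_SKETCH || pf >= NFPROTO_NUMPROTO_SKETCH)
		return -EOPNOTSUPP_SKETCH;
	return 0;
}
```

Turning the return type from void to int is what lets the per-family `__net_init` hooks above propagate this error instead of discarding it.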
[PATCH 18/29] rhashtable: add rhashtable_lookup_get_insert_key()
This patch modifies __rhashtable_insert_fast() so it returns the existing object that clashes with the one that you want to insert. In case the object is successfully inserted, NULL is returned. Otherwise, you get an error via ERR_PTR(). This patch adapts the existing callers of __rhashtable_insert_fast() so they handle this new logic, and it adds a new rhashtable_lookup_get_insert_key() interface to fetch this existing object. nf_tables needs this change to improve handling of EEXIST cases via honoring the NLM_F_EXCL flag and by checking if the data part of the mapping matches what we have. Cc: Herbert Xu Cc: Thomas Graf Signed-off-by: Pablo Neira Ayuso Acked-by: Herbert Xu --- include/linux/rhashtable.h | 70 +- lib/rhashtable.c | 10 +-- 2 files changed, 64 insertions(+), 16 deletions(-) diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h index 3eef080..26b7a05 100644 --- a/include/linux/rhashtable.h +++ b/include/linux/rhashtable.h @@ -343,7 +343,8 @@ int rhashtable_init(struct rhashtable *ht, struct bucket_table *rhashtable_insert_slow(struct rhashtable *ht, const void *key, struct rhash_head *obj, - struct bucket_table *old_tbl); + struct bucket_table *old_tbl, + void **data); int rhashtable_insert_rehash(struct rhashtable *ht, struct bucket_table *tbl); int rhashtable_walk_init(struct rhashtable *ht, struct rhashtable_iter *iter, @@ -563,8 +564,11 @@ restart: return NULL; } -/* Internal function, please use rhashtable_insert_fast() instead */ -static inline int __rhashtable_insert_fast( +/* Internal function, please use rhashtable_insert_fast() instead. This + * function returns the existing element already in hashes in there is a clash, + * otherwise it returns an error via ERR_PTR(). 
+ */ +static inline void *__rhashtable_insert_fast( struct rhashtable *ht, const void *key, struct rhash_head *obj, const struct rhashtable_params params) { @@ -577,6 +581,7 @@ static inline int __rhashtable_insert_fast( spinlock_t *lock; unsigned int elasticity; unsigned int hash; + void *data = NULL; int err; restart: @@ -601,11 +606,14 @@ restart: new_tbl = rht_dereference_rcu(tbl->future_tbl, ht); if (unlikely(new_tbl)) { - tbl = rhashtable_insert_slow(ht, key, obj, new_tbl); + tbl = rhashtable_insert_slow(ht, key, obj, new_tbl, &data); if (!IS_ERR_OR_NULL(tbl)) goto slow_path; err = PTR_ERR(tbl); + if (err == -EEXIST) + err = 0; + goto out; } @@ -619,25 +627,25 @@ slow_path: err = rhashtable_insert_rehash(ht, tbl); rcu_read_unlock(); if (err) - return err; + return ERR_PTR(err); goto restart; } - err = -EEXIST; + err = 0; elasticity = ht->elasticity; rht_for_each(head, tbl, hash) { if (key && unlikely(!(params.obj_cmpfn ? params.obj_cmpfn(&arg, rht_obj(ht, head)) : - rhashtable_compare(&arg, rht_obj(ht, head) + rhashtable_compare(&arg, rht_obj(ht, head) { + data = rht_obj(ht, head); goto out; + } if (!--elasticity) goto slow_path; } - err = 0; - head = rht_dereference_bucket(tbl->buckets[hash], tbl, hash); RCU_INIT_POINTER(obj->next, head); @@ -652,7 +660,7 @@ out: spin_unlock_bh(lock); rcu_read_unlock(); - return err; + return err ? ERR_PTR(err) : data; } /** @@ -675,7 +683,13 @@ static inline int rhashtable_insert_fast( struct rhashtable *ht, struct rhash_head *obj, const struct rhashtable_params params) { - return __rhashtable_insert_fast(ht, NULL, obj, params); + void *ret; + + ret = __rhashtable_insert_fast(ht, NULL, obj, params); + if (IS_ERR(ret)) + return PTR_ERR(ret); + + return ret == NULL ? 
0 : -EEXIST; } /** @@ -704,11 +718,15 @@ static inline int rhashtable_lookup_insert_fast( const struct rhashtable_params params) { const char *key = rht_obj(ht, obj); + void *ret; BUG_ON(ht->p.obj_hashfn); - return __rhashtable_insert_fast(ht, key + ht->p.key_offset, obj, - params); + ret = __rhashtable_insert_fast(ht, key + ht->p.key_offset, obj, params); + if (IS_ERR(ret)) + retu
[PATCH 22/29] netfilter: don't rely on DYING bit to detect when destroy event was sent
From: Florian Westphal The reliable event delivery mode currently (ab)uses the DYING bit to detect which entries on the dying list have to be skipped when re-delivering events from the ecache worker in reliable event mode. Currently, when we delete the conntrack from the main table we only set this bit if we could also deliver the netlink destroy event to userspace. If we fail, we move it to the dying list; the ecache worker will reattempt event delivery for all confirmed conntracks on the dying list that do not have the DYING bit set. Once the timer is gone, we can no longer use if (del_timer()) to detect when we 'stole' the reference count owned by the timer/hash entry, so we need some other way to avoid racing with another cpu. Pablo suggested adding a marker in the ecache extension that skips entries that have been unhashed from the main table but are still waiting for the last reference count to be dropped (e.g. because one skb waiting on an nfqueue verdict still holds a reference). We do this by adding a tristate. If we fail to deliver the destroy event, we make a note of this in the ecache extension. The worker can then skip all entries that are in a different state: either they never delivered a destroy event, e.g. because the netlink backend was not loaded, or redelivery already took place. Once the conntrack timer is removed we will be able to replace the del_timer() test with test_and_set_bit(DYING, &ct->status) to avoid racing with another cpu that tries to evict the same conntrack. Because DYING will then be set right before we report the destroy event, we can no longer skip event reporting when the dying bit is set.
Suggested-by: Pablo Neira Ayuso Signed-off-by: Florian Westphal Acked-by: Eric Dumazet Signed-off-by: Pablo Neira Ayuso --- include/net/netfilter/nf_conntrack_ecache.h | 17 - net/netfilter/nf_conntrack_ecache.c | 22 ++ 2 files changed, 26 insertions(+), 13 deletions(-) diff --git a/include/net/netfilter/nf_conntrack_ecache.h b/include/net/netfilter/nf_conntrack_ecache.h index fa36447..12d967b 100644 --- a/include/net/netfilter/nf_conntrack_ecache.h +++ b/include/net/netfilter/nf_conntrack_ecache.h @@ -12,12 +12,19 @@ #include #include +enum nf_ct_ecache_state { + NFCT_ECACHE_UNKNOWN,/* destroy event not sent */ + NFCT_ECACHE_DESTROY_FAIL, /* tried but failed to send destroy event */ + NFCT_ECACHE_DESTROY_SENT, /* sent destroy event after failure */ +}; + struct nf_conntrack_ecache { - unsigned long cache;/* bitops want long */ - unsigned long missed; /* missed events */ - u16 ctmask; /* bitmask of ct events to be delivered */ - u16 expmask;/* bitmask of expect events to be delivered */ - u32 portid; /* netlink portid of destroyer */ + unsigned long cache;/* bitops want long */ + unsigned long missed; /* missed events */ + u16 ctmask; /* bitmask of ct events to be delivered */ + u16 expmask;/* bitmask of expect events to be delivered */ + u32 portid; /* netlink portid of destroyer */ + enum nf_ct_ecache_state state; /* ecache state */ }; static inline struct nf_conntrack_ecache * diff --git a/net/netfilter/nf_conntrack_ecache.c b/net/netfilter/nf_conntrack_ecache.c index d28011b..da9df2d 100644 --- a/net/netfilter/nf_conntrack_ecache.c +++ b/net/netfilter/nf_conntrack_ecache.c @@ -49,8 +49,13 @@ static enum retry_state ecache_work_evict_list(struct ct_pcpu *pcpu) hlist_nulls_for_each_entry(h, n, &pcpu->dying, hnnode) { struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h); + struct nf_conntrack_ecache *e; - if (nf_ct_is_dying(ct)) + if (!nf_ct_is_confirmed(ct)) + continue; + + e = nf_ct_ecache_find(ct); + if (!e || e->state != NFCT_ECACHE_DESTROY_FAIL) continue; if 
(nf_conntrack_event(IPCT_DESTROY, ct)) { @@ -58,8 +63,7 @@ static enum retry_state ecache_work_evict_list(struct ct_pcpu *pcpu) break; } - /* we've got the event delivered, now it's dying */ - set_bit(IPS_DYING_BIT, &ct->status); + e->state = NFCT_ECACHE_DESTROY_SENT; refs[evicted] = ct; if (++evicted >= ARRAY_SIZE(refs)) { @@ -130,7 +134,7 @@ int nf_conntrack_eventmask_report(unsigned int eventmask, struct nf_conn *ct, if (!e) goto out_unlock; - if (nf_ct_is_confirmed(ct) && !nf_ct_is_dying(ct)) { + if (nf_ct_is_confirmed(ct)) { struct nf_ct_event item = { .ct = ct, .portid = e->portid ? e->portid : portid, @@ -150,11 +154,13 @@ int nf_conntrack_eventmask_report(uns
[PATCH 10/29] netfilter: nf_conntrack: restore nf_conntrack_htable_size as exported symbol
This is required to iterate over the hash table in cttimeout, ctnetlink and nf_conntrack_ipv4. >> ERROR: "nf_conntrack_htable_size" [net/netfilter/nfnetlink_cttimeout.ko] >> undefined! ERROR: "nf_conntrack_htable_size" [net/netfilter/nf_conntrack_netlink.ko] undefined! ERROR: "nf_conntrack_htable_size" [net/ipv4/netfilter/nf_conntrack_ipv4.ko] undefined! Fixes: adf0516845bcd0 ("netfilter: remove ip_conntrack* sysctl compat code") Reported-by: kbuild test robot Reported-by: Stephen Rothwell Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nf_conntrack_core.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index aeba28c..7d90a5d 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -160,6 +160,8 @@ static void nf_conntrack_all_unlock(void) } unsigned int nf_conntrack_htable_size __read_mostly; +EXPORT_SYMBOL_GPL(nf_conntrack_htable_size); + unsigned int nf_conntrack_max __read_mostly; seqcount_t nf_conntrack_generation __read_mostly; -- 2.1.4
[PATCH 02/29] netfilter: physdev: add missed blank
From: Hangbin Liu Signed-off-by: Hangbin Liu Signed-off-by: Pablo Neira Ayuso --- net/netfilter/xt_physdev.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/netfilter/xt_physdev.c b/net/netfilter/xt_physdev.c index e5f1898..bb33598 100644 --- a/net/netfilter/xt_physdev.c +++ b/net/netfilter/xt_physdev.c @@ -107,8 +107,8 @@ static int physdev_mt_check(const struct xt_mtchk_param *par) info->invert & XT_PHYSDEV_OP_BRIDGED) && par->hook_mask & ((1 << NF_INET_LOCAL_OUT) | (1 << NF_INET_FORWARD) | (1 << NF_INET_POST_ROUTING))) { - pr_info("using --physdev-out and --physdev-is-out are only" - "supported in the FORWARD and POSTROUTING chains with" + pr_info("using --physdev-out and --physdev-is-out are only " + "supported in the FORWARD and POSTROUTING chains with " "bridged traffic.\n"); if (par->hook_mask & (1 << NF_INET_LOCAL_OUT)) return -EINVAL; -- 2.1.4
[PATCH 04/29] netfilter: use nf_ct_expires helper in more places
From: Florian Westphal ... so we don't need to touch all of these places when we get rid of the timer in nf_conn. Signed-off-by: Florian Westphal Signed-off-by: Pablo Neira Ayuso --- net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c | 3 +-- net/netfilter/nf_conntrack_netlink.c | 5 + net/netfilter/nf_conntrack_standalone.c | 3 +-- net/netfilter/xt_conntrack.c | 4 +--- 4 files changed, 4 insertions(+), 11 deletions(-) diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c index 6392371..67bfc69 100644 --- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c +++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c @@ -163,8 +163,7 @@ static int ct_seq_show(struct seq_file *s, void *v) ret = -ENOSPC; seq_printf(s, "%-8s %u %ld ", l4proto->name, nf_ct_protonum(ct), - timer_pending(&ct->timeout) - ? (long)(ct->timeout.expires - jiffies)/HZ : 0); + nf_ct_expires(ct) / HZ); if (l4proto->print_conntrack) l4proto->print_conntrack(s, ct); diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c index 050bb34..68800c1 100644 --- a/net/netfilter/nf_conntrack_netlink.c +++ b/net/netfilter/nf_conntrack_netlink.c @@ -149,10 +149,7 @@ nla_put_failure: static int ctnetlink_dump_timeout(struct sk_buff *skb, const struct nf_conn *ct) { - long timeout = ((long)ct->timeout.expires - (long)jiffies) / HZ; - - if (timeout < 0) - timeout = 0; + long timeout = nf_ct_expires(ct) / HZ; if (nla_put_be32(skb, CTA_TIMEOUT, htonl(timeout))) goto nla_put_failure; diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c index 958a145..4e7becd 100644 --- a/net/netfilter/nf_conntrack_standalone.c +++ b/net/netfilter/nf_conntrack_standalone.c @@ -224,8 +224,7 @@ static int ct_seq_show(struct seq_file *s, void *v) seq_printf(s, "%-8s %u %-8s %u %ld ", l3proto->name, nf_ct_l3num(ct), l4proto->name, nf_ct_protonum(ct), - timer_pending(&ct->timeout) - ? 
(long)(ct->timeout.expires - jiffies)/HZ : 0); + nf_ct_expires(ct) / HZ); if (l4proto->print_conntrack) l4proto->print_conntrack(s, ct); diff --git a/net/netfilter/xt_conntrack.c b/net/netfilter/xt_conntrack.c index 188404b9..a3b8f69 100644 --- a/net/netfilter/xt_conntrack.c +++ b/net/netfilter/xt_conntrack.c @@ -233,10 +233,8 @@ conntrack_mt(const struct sk_buff *skb, struct xt_action_param *par, return false; if (info->match_flags & XT_CONNTRACK_EXPIRES) { - unsigned long expires = 0; + unsigned long expires = nf_ct_expires(ct) / HZ; - if (timer_pending(&ct->timeout)) - expires = (ct->timeout.expires - jiffies) / HZ; if ((expires >= info->expires_min && expires <= info->expires_max) ^ !(info->invert_flags & XT_CONNTRACK_EXPIRES)) -- 2.1.4
[PATCH 07/29] netfilter: nf_tables: add hash expression
From: Laura Garcia Liebana This patch adds a new hash expression; it provides jhash support for now, but it can be extended to support other hash functions. The modulus and seed already come embedded in this new expression. Use case example: ... meta mark set hash ip saddr mod 10 Signed-off-by: Laura Garcia Liebana Signed-off-by: Pablo Neira Ayuso --- include/uapi/linux/netfilter/nf_tables.h | 20 + net/netfilter/Kconfig| 6 ++ net/netfilter/Makefile | 1 + net/netfilter/nft_hash.c | 136 +++ 4 files changed, 163 insertions(+) create mode 100644 net/netfilter/nft_hash.c diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h index 01751fa..6ce0a6d 100644 --- a/include/uapi/linux/netfilter/nf_tables.h +++ b/include/uapi/linux/netfilter/nf_tables.h @@ -724,6 +724,26 @@ enum nft_meta_keys { }; /** + * enum nft_hash_attributes - nf_tables hash expression netlink attributes + * + * @NFTA_HASH_SREG: source register (NLA_U32) + * @NFTA_HASH_DREG: destination register (NLA_U32) + * @NFTA_HASH_LEN: source data length (NLA_U32) + * @NFTA_HASH_MODULUS: modulus value (NLA_U32) + * @NFTA_HASH_SEED: seed value (NLA_U32) + */ +enum nft_hash_attributes { + NFTA_HASH_UNSPEC, + NFTA_HASH_SREG, + NFTA_HASH_DREG, + NFTA_HASH_LEN, + NFTA_HASH_MODULUS, + NFTA_HASH_SEED, + __NFTA_HASH_MAX, +}; +#define NFTA_HASH_MAX (__NFTA_HASH_MAX - 1) + +/** * enum nft_meta_attributes - nf_tables meta expression netlink attributes * * @NFTA_META_DREG: destination register (NLA_U32) diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig index e5740e1..9cfaa00 100644 --- a/net/netfilter/Kconfig +++ b/net/netfilter/Kconfig @@ -563,6 +563,12 @@ config NFT_COMPAT x_tables match/target extensions over the nf_tables framework. +config NFT_HASH + tristate "Netfilter nf_tables hash module" + help + This option adds the "hash" expression that you can use to perform + a hash operation on registers. 
+ if NF_TABLES_NETDEV config NF_DUP_NETDEV diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile index 101fb85..1106ccd 100644 --- a/net/netfilter/Makefile +++ b/net/netfilter/Makefile @@ -92,6 +92,7 @@ obj-$(CONFIG_NFT_COUNTER) += nft_counter.o obj-$(CONFIG_NFT_LOG) += nft_log.o obj-$(CONFIG_NFT_MASQ) += nft_masq.o obj-$(CONFIG_NFT_REDIR)+= nft_redir.o +obj-$(CONFIG_NFT_HASH) += nft_hash.o # nf_tables netdev obj-$(CONFIG_NFT_DUP_NETDEV) += nft_dup_netdev.o diff --git a/net/netfilter/nft_hash.c b/net/netfilter/nft_hash.c new file mode 100644 index 000..b82ff29 --- /dev/null +++ b/net/netfilter/nft_hash.c @@ -0,0 +1,136 @@ +/* + * Copyright (c) 2016 Laura Garcia + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +struct nft_hash { + enum nft_registers sreg:8; + enum nft_registers dreg:8; + u8 len; + u32 modulus; + u32 seed; +}; + +static void nft_hash_eval(const struct nft_expr *expr, + struct nft_regs *regs, + const struct nft_pktinfo *pkt) +{ + struct nft_hash *priv = nft_expr_priv(expr); + const void *data = ®s->data[priv->sreg]; + + regs->data[priv->dreg] = + reciprocal_scale(jhash(data, priv->len, priv->seed), +priv->modulus); +} + +const struct nla_policy nft_hash_policy[NFTA_HASH_MAX + 1] = { + [NFTA_HASH_SREG]= { .type = NLA_U32 }, + [NFTA_HASH_DREG]= { .type = NLA_U32 }, + [NFTA_HASH_LEN] = { .type = NLA_U32 }, + [NFTA_HASH_MODULUS] = { .type = NLA_U32 }, + [NFTA_HASH_SEED]= { .type = NLA_U32 }, +}; + +static int nft_hash_init(const struct nft_ctx *ctx, +const struct nft_expr *expr, +const struct nlattr * const tb[]) +{ + struct nft_hash *priv = nft_expr_priv(expr); + u32 len; + + if (!tb[NFTA_HASH_SREG] || + !tb[NFTA_HASH_DREG] || + !tb[NFTA_HASH_LEN] || + !tb[NFTA_HASH_SEED] || + !tb[NFTA_HASH_MODULUS]) + 
return -EINVAL; + + priv->sreg = nft_parse_register(tb[NFTA_HASH_SREG]); + priv->dreg = nft_parse_register(tb[NFTA_HASH_DREG]); + + len = ntohl(nla_get_be32(tb[NFTA_HASH_LEN])); + if (len == 0 || len > U8_MAX) + return -ERANGE; + + priv->len = len; + + priv->modulus = ntohl(nla_get_be32(tb[NFTA_HASH_MODULUS])); + if (priv->modu