Re: aio poll, io_pgetevents and a new in-kernel poll API V4
On Mon, Jan 22, 2018 at 09:12:07PM +0100, Christoph Hellwig wrote: > Hi all, > > this series adds support for the IOCB_CMD_POLL operation to poll for the > readyness of file descriptors using the aio subsystem. The API is based > on patches that existed in RHAS2.1 and RHEL3, which means it already is > supported by libaio. To implement the poll support efficiently new > methods to poll are introduced in struct file_operations: get_poll_head > and poll_mask. The first one returns a wait_queue_head to wait on > (lifetime is bound by the file), and the second does a non-blocking > check for the POLL* events. This allows aio poll to work without > any additional context switches, unlike epoll. I implemented something similar back in December, but did so without changing the in-kernel poll API. See below for the patch that implements it. Is changing the in-kernel poll API really desirable given how many drivers that will touch? The tree with that as a work-in-progress is available at git://git.kvack.org/~bcrl/aio-wip-20171215.git . I have some time scheduled right now to work on cleaning up that series which also includes support for more aio options by making use of queue_work() to dispatch operations that make use of helper threads. The aio poll implementation patch is below as an alternative approach. It has passed some amount of QA by Solace where we use it for a unified event loop with various filesystem / block AIO combined with poll on TCP, unix domains sockets and pipes. Reworking cancellation was required to fix several lock ordering issues that are impossible to address with the in-kernel aio cancellation support. -ben > To make the interface fully useful a new io_pgetevents system call is > added, which atomically saves and restores the signal mask over the > io_pgetevents system call. It it the logical equivalent to pselect and > ppoll for io_pgetevents. That looks useful. I'll have to look at this in detail. -ben commit a299c474b19107122eae846b53f742d876f304f9 Author: Benjamin LaHaise Date: Fri Dec 15 13:19:23 2017 -0500 aio: add aio poll implementation Some applications using AIO have a need to be able to poll on file descriptors. This is typically the case with libraries using non-blocking operations inside of an application using an io_getevents() based main event loop. This patch implements IOCB_CMD_POLL directly inside fs/aio.c using file_operations->poll() to insert a waiter. This avoids the need to use helper threads, enabling completion events to be generated directly in interrupt or task context. diff --git a/fs/aio.c b/fs/aio.c index 3381b2e..23b9a06 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -43,6 +43,7 @@ #include #include +#include #include "internal.h" @@ -172,7 +173,7 @@ struct kioctx { * synchronize cancellation and completion - we only set it to KIOCB_CANCELLED * with xchg() or cmpxchg(), see batch_complete_aio() and kiocb_cancel(). 
*/ -#define KIOCB_CANCELLED((void *) (~0ULL)) +#define KIOCB_CANCELLED((void *) (~5ULL)) struct aio_kiocb { struct kiocbcommon; @@ -205,6 +206,13 @@ struct aio_kiocb { aio_thread_work_fn_tki_work_fn; struct work_struct ki_work; #endif + + unsigned long ki_data; + struct poll_table_structpoll; + struct tasklet_struct tasklet; + int wait_idx; + unsignedevents; + struct poll_table_entry poll_wait[2]; }; /*-- sysctl variables*/ @@ -212,7 +220,7 @@ struct aio_kiocb { unsigned long aio_nr; /* current system wide number of aio requests */ unsigned long aio_max_nr = 0x1; /* system wide maximum number of aio requests */ #if IS_ENABLED(CONFIG_AIO_THREAD) -unsigned long aio_auto_threads;/* Currently disabled by default */ +unsigned long aio_auto_threads = 1;/* Currently disabled by default */ #endif /*end sysctl variables---*/ @@ -1831,6 +1839,213 @@ static ssize_t aio_write(struct aio_kiocb *kiocb, struct iocb *iocb, bool compat return ret; } +static int aio_poll_cancel_cb(struct kiocb *iocb, unsigned long data, int ret) +{ + struct aio_kiocb *req = container_of(iocb, struct aio_kiocb, common); + unsigned i; + + for (i=0; iwait_idx; i++) { + wait_queue_head_t *head = req->poll_wait[i].wait_address; + remove_wait_queue(head, &req->poll_wait[i].wait); + } + + aio_complete(iocb, -EINTR, 0); + return 0; +} + +static int aio_poll_cancel(struct kiocb *iocb, kiocb_cancel_cb_fn **cb_p, + unsigned long *data_p) +{ + *cb_p = aio_poll_cancel_cb; + *data_p = 0; + return 0; +} + +stati
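For readers who have not followed the thread, the split described at the top of this message looks roughly like the following in a driver. This is an illustrative sketch only: the foo_dev structure and its fields are invented, and the prototypes are approximated from the description above rather than copied from either patch series.

static struct wait_queue_head *foo_get_poll_head(struct file *file,
                                                 __poll_t events)
{
        struct foo_dev *foo = file->private_data;

        /* lifetime of the returned wait queue is bound to the file */
        return &foo->waitq;
}

static __poll_t foo_poll_mask(struct file *file, __poll_t events)
{
        struct foo_dev *foo = file->private_data;
        __poll_t mask = 0;

        /* non-blocking readiness check only: no poll_wait(), no sleeping */
        if (foo->rx_ready)
                mask |= EPOLLIN | EPOLLRDNORM;
        if (foo->tx_space)
                mask |= EPOLLOUT | EPOLLWRNORM;
        return mask & events;
}

static const struct file_operations foo_fops = {
        .get_poll_head  = foo_get_poll_head,
        .poll_mask      = foo_poll_mask,
};

The point of the split is that the caller (aio poll here) can add its own entry to the returned wait queue and run the readiness check from its own context, which is where the "no additional context switches" claim above comes from.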
Re: [PATCH 2/6] move _body_io_syscall to the generic syscall.h
On Fri, Jan 05, 2018 at 11:25:17AM -0500, Jeff Moyer wrote: > Christoph Hellwig writes: > > > This way it can be used for the fallback 6-argument version on > > all architectures. > > > > Signed-off-by: Christoph Hellwig > > This is a strange way to do things. However, I was never really sold on > libaio having to implement its own system call wrappers. That decision > definitely resulted in some maintenance overhead. > > Ben, what was your reasoning for not just using syscall? The main issue was that glibc's pthreads implementation really sucked back during initial development and there was a use-case for having the io_XXX functions usable directly from clone()ed threads that didn't have all the glibc pthread state setup for per-cpu areas to handle per-thread errno. That made sense back then, but is rather silly today. Technically, I'm not sure the generic syscall wrapper is safe to use. The io_XXX arch wrappers don't modify errno, while it appears the generic one does. That said, nobody has ever noticed... -ben > -Jeff > > > --- > > src/syscall-generic.h | 6 -- > > src/syscall.h | 7 +++ > > 2 files changed, 7 insertions(+), 6 deletions(-) > > > > diff --git a/src/syscall-generic.h b/src/syscall-generic.h > > index 24d7c7c..35b8580 100644 > > --- a/src/syscall-generic.h > > +++ b/src/syscall-generic.h > > @@ -2,12 +2,6 @@ > > #include > > #include > > > > -#define _body_io_syscall(sname, args...) \ > > -{ \ > > - int ret = syscall(__NR_##sname, ## args); \ > > - return ret < 0 ? -errno : ret; \ > > -} > > - > > #define io_syscall1(type,fname,sname,type1,arg1) \ > > type fname(type1 arg1) \ > > _body_io_syscall(sname, (long)arg1) > > diff --git a/src/syscall.h b/src/syscall.h > > index a2da030..3819519 100644 > > --- a/src/syscall.h > > +++ b/src/syscall.h > > @@ -10,6 +10,13 @@ > > #define DEFSYMVER(compat_sym, orig_sym, ver_sym) \ > > __asm__(".symver " SYMSTR(compat_sym) "," SYMSTR(orig_sym) "@@LIBAIO_" > > SYMSTR(ver_sym)); > > > > +/* generic fallback */ > > +#define _body_io_syscall(sname, args...) \ > > +{ \ > > + int ret = syscall(__NR_##sname, ## args); \ > > + return ret < 0 ? -errno : ret; \ > > +} > > + > > #if defined(__i386__) > > #include "syscall-i386.h" > > #elif defined(__x86_64__) > -- "Thought is the essence of where you are now."
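A small userspace sketch of the errno point above; this is not part of the patch, io_setup() is only used as a convenient failing call, and the exact error value is whatever the kernel returns.

#include <errno.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* the generic fallback style: syscall() stores the error in errno and
 * the wrapper folds it into a negative return value */
static int my_io_setup(unsigned nr_events, unsigned long *ctx)
{
        int ret = syscall(__NR_io_setup, nr_events, ctx);

        return ret < 0 ? -errno : ret;
}

int main(void)
{
        unsigned long ctx = 0;
        int ret;

        errno = 0;
        ret = my_io_setup(0, &ctx);     /* invalid: nr_events of 0 */
        printf("ret=%d errno=%d\n", ret, errno);
        /* prints a negative ret and a non-zero errno: the generic wrapper
         * has modified errno.  The hand-rolled arch wrappers in libaio
         * return the same negative value but leave errno untouched. */
        return 0;
}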
[PATCH] flower: check unused bits in MPLS fields
Since several of the the netlink attributes used to configure the flower classifier's MPLS TC, BOS and Label fields have additional bits which are unused, check those bits to ensure that they are actually 0 as suggested by Jamal. Signed-off-by: Benjamin LaHaise Cc: David Miller Cc: Jamal Hadi Salim Cc: Simon Horman Cc: Jakub Kicinski Cc: Jiri Pirko --- net/sched/cls_flower.c | 32 ++-- 1 file changed, 22 insertions(+), 10 deletions(-) diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c index 3ecf076..ca526c0 100644 --- a/net/sched/cls_flower.c +++ b/net/sched/cls_flower.c @@ -439,29 +439,39 @@ static void fl_set_key_val(struct nlattr **tb, memcpy(mask, nla_data(tb[mask_type]), len); } -static void fl_set_key_mpls(struct nlattr **tb, - struct flow_dissector_key_mpls *key_val, - struct flow_dissector_key_mpls *key_mask) +static int fl_set_key_mpls(struct nlattr **tb, + struct flow_dissector_key_mpls *key_val, + struct flow_dissector_key_mpls *key_mask) { if (tb[TCA_FLOWER_KEY_MPLS_TTL]) { key_val->mpls_ttl = nla_get_u8(tb[TCA_FLOWER_KEY_MPLS_TTL]); key_mask->mpls_ttl = MPLS_TTL_MASK; } if (tb[TCA_FLOWER_KEY_MPLS_BOS]) { - key_val->mpls_bos = nla_get_u8(tb[TCA_FLOWER_KEY_MPLS_BOS]); + u8 bos = nla_get_u8(tb[TCA_FLOWER_KEY_MPLS_BOS]); + + if (bos & ~MPLS_BOS_MASK) + return -EINVAL; + key_val->mpls_bos = bos; key_mask->mpls_bos = MPLS_BOS_MASK; } if (tb[TCA_FLOWER_KEY_MPLS_TC]) { - key_val->mpls_tc = - nla_get_u8(tb[TCA_FLOWER_KEY_MPLS_TC]) & MPLS_TC_MASK; + u8 tc = nla_get_u8(tb[TCA_FLOWER_KEY_MPLS_TC]); + + if (tc & ~MPLS_TC_MASK) + return -EINVAL; + key_val->mpls_tc = tc; key_mask->mpls_tc = MPLS_TC_MASK; } if (tb[TCA_FLOWER_KEY_MPLS_LABEL]) { - key_val->mpls_label = - nla_get_u32(tb[TCA_FLOWER_KEY_MPLS_LABEL]) & - MPLS_LABEL_MASK; + u32 label = nla_get_u32(tb[TCA_FLOWER_KEY_MPLS_LABEL]); + + if (label & ~MPLS_LABEL_MASK) + return -EINVAL; + key_val->mpls_label = label; key_mask->mpls_label = MPLS_LABEL_MASK; } + return 0; } static void fl_set_key_vlan(struct nlattr **tb, @@ -622,7 +632,9 @@ static int fl_set_key(struct net *net, struct nlattr **tb, sizeof(key->icmp.code)); } else if (key->basic.n_proto == htons(ETH_P_MPLS_UC) || key->basic.n_proto == htons(ETH_P_MPLS_MC)) { - fl_set_key_mpls(tb, &key->mpls, &mask->mpls); + ret = fl_set_key_mpls(tb, &key->mpls, &mask->mpls); + if (ret) + return ret; } else if (key->basic.n_proto == htons(ETH_P_ARP) || key->basic.n_proto == htons(ETH_P_RARP)) { fl_set_key_val(tb, &key->arp.sip, TCA_FLOWER_KEY_ARP_SIP, -- 2.7.4
Re: [PATCH net-next 0/2] flower: add MPLS matching support
On Tue, Apr 25, 2017 at 08:47:00AM -0400, Jamal Hadi Salim wrote: > On 17-04-25 07:55 AM, Simon Horman wrote: > [..] > > > > I agree something should be done wrt BOS. If the LABEL and TC are to > > be left as-is then I think a similar treatment of BOS - that is masking it > > - makes sense. > > > > I also agree with statements made earlier in the thread that it is unlikely > > that the unused bits of these attributes will be used - as opposed to a > > bitmask of flag values which seems ripe for re-use for future flags. > > > > For your use case, I think you are fine if you just do the mask in the > kernel. A mask to a user value implies "I am ignoring the rest > of these bits - I dont care if you set them " > > > I would like to add to the discussion that I think in future it would > > be good to expand the features provided by this patch to support supplying > > a mask as part of the match - as flower supports for other fields such > > as IP addresses. But I think the current scheme of masking out invalid bits > > should also work in conjunction with user-supplied masks. > > > > The challenge we have right now is "users do stoopid or malicious > things". So are you going to accept the wrong bitmap + mask? I think rejecting bits in a mask that clearly cannot be set (like the bits above the lower 20 bits in an MPLS label) makes perfect sense. It doesn't impact usability for the tools since they shouldn't be set (and actually can't be in the iproute2 changes). -ben > cheers, > jamal >
Re: [PATCH net-next 0/2] flower: add MPLS matching support
On Mon, Apr 24, 2017 at 08:58:18PM -0400, Jamal Hadi Salim wrote: > On 17-04-24 02:32 PM, David Miller wrote: > > From: Benjamin LaHaise > > > > > Series applied, but in the future: > > > > 1) Put the "v2", "v3", whatever in the initial [PATCH xxx] bracketed > >part of the subject line, otherwise it ends up in the GIT commit > >message header line and that's not desired. > > > > 2) Please cut it with these double signoffs, and add an appropriate > >entry to the email aliases file. > > I know i should have spoken earlier, wanted to but got distracted - but > shouldnt the new rules have applied to this patch too? ;-> > > You have 3 TLVs, one of which is u8 that only allows use of 3 bits. > The other is a u32 which allows only 20 bits to be set. What are the new rules for TLVs -- do you want a new type added for the subset values that are limited to the number of bits actually used? A pointer here would be helpful. -ben > cheers, > jamal
[PATCH net-next 2/2] cls_flower: add support for matching MPLS fields (v2)
Add support to the tc flower classifier to match based on fields in MPLS labels (TTL, Bottom of Stack, TC field, Label). Signed-off-by: Benjamin LaHaise Signed-off-by: Benjamin LaHaise Reviewed-by: Jakub Kicinski Cc: "David S. Miller" Cc: Simon Horman Cc: Jamal Hadi Salim Cc: Cong Wang Cc: Jiri Pirko Cc: Eric Dumazet Cc: Hadar Hen Zion Cc: Gao Feng --- include/uapi/linux/pkt_cls.h | 5 +++ net/sched/cls_flower.c | 74 2 files changed, 79 insertions(+) diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h index 7a69f2a..f1129e3 100644 --- a/include/uapi/linux/pkt_cls.h +++ b/include/uapi/linux/pkt_cls.h @@ -432,6 +432,11 @@ enum { TCA_FLOWER_KEY_ARP_THA, /* ETH_ALEN */ TCA_FLOWER_KEY_ARP_THA_MASK,/* ETH_ALEN */ + TCA_FLOWER_KEY_MPLS_TTL,/* u8 - 8 bits */ + TCA_FLOWER_KEY_MPLS_BOS,/* u8 - 1 bit */ + TCA_FLOWER_KEY_MPLS_TC, /* u8 - 3 bits */ + TCA_FLOWER_KEY_MPLS_LABEL, /* be32 - 20 bits */ + __TCA_FLOWER_MAX, }; diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c index 31ee340..3ecf076 100644 --- a/net/sched/cls_flower.c +++ b/net/sched/cls_flower.c @@ -18,6 +18,7 @@ #include #include #include +#include #include #include @@ -47,6 +48,7 @@ struct fl_flow_key { struct flow_dissector_key_ipv6_addrs enc_ipv6; }; struct flow_dissector_key_ports enc_tp; + struct flow_dissector_key_mpls mpls; } __aligned(BITS_PER_LONG / 8); /* Ensure that we can do comparisons as longs. */ struct fl_flow_mask_range { @@ -418,6 +420,10 @@ static const struct nla_policy fl_policy[TCA_FLOWER_MAX + 1] = { [TCA_FLOWER_KEY_ARP_SHA_MASK] = { .len = ETH_ALEN }, [TCA_FLOWER_KEY_ARP_THA]= { .len = ETH_ALEN }, [TCA_FLOWER_KEY_ARP_THA_MASK] = { .len = ETH_ALEN }, + [TCA_FLOWER_KEY_MPLS_TTL] = { .type = NLA_U8 }, + [TCA_FLOWER_KEY_MPLS_BOS] = { .type = NLA_U8 }, + [TCA_FLOWER_KEY_MPLS_TC]= { .type = NLA_U8 }, + [TCA_FLOWER_KEY_MPLS_LABEL] = { .type = NLA_U32 }, }; static void fl_set_key_val(struct nlattr **tb, @@ -433,6 +439,31 @@ static void fl_set_key_val(struct nlattr **tb, memcpy(mask, nla_data(tb[mask_type]), len); } +static void fl_set_key_mpls(struct nlattr **tb, + struct flow_dissector_key_mpls *key_val, + struct flow_dissector_key_mpls *key_mask) +{ + if (tb[TCA_FLOWER_KEY_MPLS_TTL]) { + key_val->mpls_ttl = nla_get_u8(tb[TCA_FLOWER_KEY_MPLS_TTL]); + key_mask->mpls_ttl = MPLS_TTL_MASK; + } + if (tb[TCA_FLOWER_KEY_MPLS_BOS]) { + key_val->mpls_bos = nla_get_u8(tb[TCA_FLOWER_KEY_MPLS_BOS]); + key_mask->mpls_bos = MPLS_BOS_MASK; + } + if (tb[TCA_FLOWER_KEY_MPLS_TC]) { + key_val->mpls_tc = + nla_get_u8(tb[TCA_FLOWER_KEY_MPLS_TC]) & MPLS_TC_MASK; + key_mask->mpls_tc = MPLS_TC_MASK; + } + if (tb[TCA_FLOWER_KEY_MPLS_LABEL]) { + key_val->mpls_label = + nla_get_u32(tb[TCA_FLOWER_KEY_MPLS_LABEL]) & + MPLS_LABEL_MASK; + key_mask->mpls_label = MPLS_LABEL_MASK; + } +} + static void fl_set_key_vlan(struct nlattr **tb, struct flow_dissector_key_vlan *key_val, struct flow_dissector_key_vlan *key_mask) @@ -589,6 +620,9 @@ static int fl_set_key(struct net *net, struct nlattr **tb, &mask->icmp.code, TCA_FLOWER_KEY_ICMPV6_CODE_MASK, sizeof(key->icmp.code)); + } else if (key->basic.n_proto == htons(ETH_P_MPLS_UC) || + key->basic.n_proto == htons(ETH_P_MPLS_MC)) { + fl_set_key_mpls(tb, &key->mpls, &mask->mpls); } else if (key->basic.n_proto == htons(ETH_P_ARP) || key->basic.n_proto == htons(ETH_P_RARP)) { fl_set_key_val(tb, &key->arp.sip, TCA_FLOWER_KEY_ARP_SIP, @@ -725,6 +759,8 @@ static void fl_init_dissector(struct cls_fl_head *head, FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt, FLOW_DISSECTOR_KEY_ARP, arp); 
FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt, +FLOW_DISSECTOR_KEY_MPLS, mpls); + FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt, FLOW_DISSECTOR_KEY_VLAN, vlan); FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt, FLOW_DISSECTOR_KEY_ENC_KEYID, enc_key_id); @@ -991,6 +1027,41 @@ static int fl_dump_key_val(struct sk_buff *skb, return 0; } +stat
[PATCH net-next 0/2] flower: add MPLS matching support
From: Benjamin LaHaise This patch series adds support for parsing MPLS flows in the flow dissector and the flower classifier. Each of the MPLS TTL, BOS, TC and Label fields can be used for matching. v2: incorporate style feedback, move #defines to include/linux/mpls.h Note: this omits Jiri's request to remove tabs between the type and field names in struct declarations. This would be inconsistent with numerous other struct definitions. Benjamin LaHaise (2): flow_dissector: add mpls support (v2) cls_flower: add support for matching MPLS fields (v2) include/linux/mpls.h | 5 +++ include/net/flow_dissector.h | 8 + include/uapi/linux/pkt_cls.h | 5 +++ net/core/flow_dissector.c| 25 +-- net/sched/cls_flower.c | 74 5 files changed, 114 insertions(+), 3 deletions(-) -- 2.7.4
[PATCH net-next 1/2] flow_dissector: add mpls support (v2)
Add support for parsing MPLS flows to the flow dissector in preparation for adding MPLS match support to cls_flower. Signed-off-by: Benjamin LaHaise Signed-off-by: Benjamin LaHaise Reviewed-by: Jakub Kicinski Cc: "David S. Miller" Cc: Simon Horman Cc: Jamal Hadi Salim Cc: Cong Wang Cc: Jiri Pirko Cc: Eric Dumazet Cc: Hadar Hen Zion Cc: Gao Feng --- include/linux/mpls.h | 5 + include/net/flow_dissector.h | 8 net/core/flow_dissector.c| 25 ++--- 3 files changed, 35 insertions(+), 3 deletions(-) diff --git a/include/linux/mpls.h b/include/linux/mpls.h index 145..384fb22 100644 --- a/include/linux/mpls.h +++ b/include/linux/mpls.h @@ -3,4 +3,9 @@ #include +#define MPLS_TTL_MASK (MPLS_LS_TTL_MASK >> MPLS_LS_TTL_SHIFT) +#define MPLS_BOS_MASK (MPLS_LS_S_MASK >> MPLS_LS_S_SHIFT) +#define MPLS_TC_MASK (MPLS_LS_TC_MASK >> MPLS_LS_TC_SHIFT) +#define MPLS_LABEL_MASK(MPLS_LS_LABEL_MASK >> MPLS_LS_LABEL_SHIFT) + #endif /* _LINUX_MPLS_H */ diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h index ac97030..8d21d44 100644 --- a/include/net/flow_dissector.h +++ b/include/net/flow_dissector.h @@ -41,6 +41,13 @@ struct flow_dissector_key_vlan { u16 padding; }; +struct flow_dissector_key_mpls { + u32 mpls_ttl:8, + mpls_bos:1, + mpls_tc:3, + mpls_label:20; +}; + struct flow_dissector_key_keyid { __be32 keyid; }; @@ -169,6 +176,7 @@ enum flow_dissector_key_id { FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS, /* struct flow_dissector_key_ipv6_addrs */ FLOW_DISSECTOR_KEY_ENC_CONTROL, /* struct flow_dissector_key_control */ FLOW_DISSECTOR_KEY_ENC_PORTS, /* struct flow_dissector_key_ports */ + FLOW_DISSECTOR_KEY_MPLS, /* struct flow_dissector_key_mpls */ FLOW_DISSECTOR_KEY_MAX, }; diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c index c9cf425..28d94bc 100644 --- a/net/core/flow_dissector.c +++ b/net/core/flow_dissector.c @@ -126,9 +126,11 @@ __skb_flow_dissect_mpls(const struct sk_buff *skb, { struct flow_dissector_key_keyid *key_keyid; struct mpls_label *hdr, _hdr[2]; + u32 entry, label; if (!dissector_uses_key(flow_dissector, - FLOW_DISSECTOR_KEY_MPLS_ENTROPY)) + FLOW_DISSECTOR_KEY_MPLS_ENTROPY) && + !dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_MPLS)) return FLOW_DISSECT_RET_OUT_GOOD; hdr = __skb_header_pointer(skb, nhoff, sizeof(_hdr), data, @@ -136,8 +138,25 @@ __skb_flow_dissect_mpls(const struct sk_buff *skb, if (!hdr) return FLOW_DISSECT_RET_OUT_BAD; - if ((ntohl(hdr[0].entry) & MPLS_LS_LABEL_MASK) >> - MPLS_LS_LABEL_SHIFT == MPLS_LABEL_ENTROPY) { + entry = ntohl(hdr[0].entry); + label = (entry & MPLS_LS_LABEL_MASK) >> MPLS_LS_LABEL_SHIFT; + + if (dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_MPLS)) { + struct flow_dissector_key_mpls *key_mpls; + + key_mpls = skb_flow_dissector_target(flow_dissector, +FLOW_DISSECTOR_KEY_MPLS, +target_container); + key_mpls->mpls_label = label; + key_mpls->mpls_ttl = (entry & MPLS_LS_TTL_MASK) + >> MPLS_LS_TTL_SHIFT; + key_mpls->mpls_tc = (entry & MPLS_LS_TC_MASK) + >> MPLS_LS_TC_SHIFT; + key_mpls->mpls_bos = (entry & MPLS_LS_S_MASK) + >> MPLS_LS_S_SHIFT; + } + + if (label == MPLS_LABEL_ENTROPY) { key_keyid = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_MPLS_ENTROPY, target_container); -- 2.7.4
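As a standalone illustration of the label stack entry layout the dissector parses above, the snippet below decodes one 32-bit entry with the same masks and shifts. The constants mirror the MPLS_LS_* definitions from the UAPI mpls.h; the label/TC/TTL values are arbitrary.

#include <stdint.h>
#include <stdio.h>

#define LS_LABEL_MASK   0xFFFFF000u
#define LS_LABEL_SHIFT  12
#define LS_TC_MASK      0x00000E00u
#define LS_TC_SHIFT     9
#define LS_S_MASK       0x00000100u
#define LS_S_SHIFT      8
#define LS_TTL_MASK     0x000000FFu

int main(void)
{
        /* label 16, TC 5, bottom-of-stack set, TTL 64 -- arbitrary values,
         * already in host byte order as after ntohl(hdr[0].entry) */
        uint32_t entry = (16u << LS_LABEL_SHIFT) | (5u << LS_TC_SHIFT) |
                         (1u << LS_S_SHIFT) | 64u;

        printf("label=%u tc=%u bos=%u ttl=%u\n",
               (entry & LS_LABEL_MASK) >> LS_LABEL_SHIFT,
               (entry & LS_TC_MASK) >> LS_TC_SHIFT,
               (entry & LS_S_MASK) >> LS_S_SHIFT,
               entry & LS_TTL_MASK);
        return 0;
}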
Re: [PATCH net-next 2/2] cls_flower: add support for matching MPLS labels
On Mon, Mar 27, 2017 at 10:30:41PM +0200, Jiri Pirko wrote: > Mon, Mar 27, 2017 at 08:16:02PM CEST, benjamin.laha...@netronome.com wrote: > >Add support to tc flower to match based on fields in MPLS labels (TTL, > >Bottom of Stack, TC field, Label). > > Please use scripts/get_maintainer.pl to get list of ccs for the patches > you submit. Oops. Adding Jamal to the Cc -- please holler if you want me to resend. -ben > > > > >Signed-off-by: Benjamin LaHaise > >Signed-off-by: Benjamin LaHaise > >Reviewed-by: Simon Horman > >Reviewed-by: Jakub Kicinski > > > >diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h > >index 7a69f2a..f1129e3 100644 > >--- a/include/uapi/linux/pkt_cls.h > >+++ b/include/uapi/linux/pkt_cls.h > >@@ -432,6 +432,11 @@ enum { > > TCA_FLOWER_KEY_ARP_THA, /* ETH_ALEN */ > > TCA_FLOWER_KEY_ARP_THA_MASK,/* ETH_ALEN */ > > > >+TCA_FLOWER_KEY_MPLS_TTL,/* u8 - 8 bits */ > >+TCA_FLOWER_KEY_MPLS_BOS,/* u8 - 1 bit */ > >+TCA_FLOWER_KEY_MPLS_TC, /* u8 - 3 bits */ > >+TCA_FLOWER_KEY_MPLS_LABEL, /* be32 - 20 bits */ > >+ > > __TCA_FLOWER_MAX, > > }; > > > >diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c > >index 9d0c99d..24619f9 100644 > >--- a/net/sched/cls_flower.c > >+++ b/net/sched/cls_flower.c > >@@ -18,6 +18,7 @@ > > #include > > #include > > #include > >+#include > > > > #include > > #include > >@@ -47,6 +48,7 @@ struct fl_flow_key { > > struct flow_dissector_key_ipv6_addrs enc_ipv6; > > }; > > struct flow_dissector_key_ports enc_tp; > >+struct flow_dissector_key_mpls mpls; > > } __aligned(BITS_PER_LONG / 8); /* Ensure that we can do comparisons as > > longs. */ > > > > struct fl_flow_mask_range { > >@@ -423,6 +425,10 @@ static const struct nla_policy fl_policy[TCA_FLOWER_MAX > >+ 1] = { > > [TCA_FLOWER_KEY_ARP_SHA_MASK] = { .len = ETH_ALEN }, > > [TCA_FLOWER_KEY_ARP_THA]= { .len = ETH_ALEN }, > > [TCA_FLOWER_KEY_ARP_THA_MASK] = { .len = ETH_ALEN }, > >+[TCA_FLOWER_KEY_MPLS_TTL] = { .type = NLA_U8 }, > >+[TCA_FLOWER_KEY_MPLS_BOS] = { .type = NLA_U8 }, > >+[TCA_FLOWER_KEY_MPLS_TC]= { .type = NLA_U8 }, > >+[TCA_FLOWER_KEY_MPLS_LABEL] = { .type = NLA_U32 }, > > }; > > > > static void fl_set_key_val(struct nlattr **tb, > >@@ -438,6 +444,36 @@ static void fl_set_key_val(struct nlattr **tb, > > memcpy(mask, nla_data(tb[mask_type]), len); > > } > > > >+static void fl_set_key_mpls(struct nlattr **tb, > >+struct flow_dissector_key_mpls *key_val, > >+struct flow_dissector_key_mpls *key_mask) > >+{ > >+#define MPLS_TTL_MASK (MPLS_LS_TTL_MASK >> MPLS_LS_TTL_SHIFT) > >+#define MPLS_BOS_MASK (MPLS_LS_S_MASK >> MPLS_LS_S_SHIFT) > >+#define MPLS_TC_MASK(MPLS_LS_TC_MASK >> MPLS_LS_TC_SHIFT) > >+#define MPLS_LABEL_MASK (MPLS_LS_LABEL_MASK >> MPLS_LS_LABEL_SHIFT) > >+ > >+if (tb[TCA_FLOWER_KEY_MPLS_TTL]) { > >+key_val->mpls_ttl = nla_get_u8(tb[TCA_FLOWER_KEY_MPLS_TTL]); > >+key_mask->mpls_ttl = MPLS_TTL_MASK; > >+} > >+if (tb[TCA_FLOWER_KEY_MPLS_BOS]) { > >+key_val->mpls_bos = nla_get_u8(tb[TCA_FLOWER_KEY_MPLS_BOS]); > >+key_mask->mpls_bos = MPLS_BOS_MASK; > >+} > >+if (tb[TCA_FLOWER_KEY_MPLS_TC]) { > >+key_val->mpls_tc = > >+nla_get_u8(tb[TCA_FLOWER_KEY_MPLS_TC]) & MPLS_TC_MASK; > >+key_mask->mpls_tc = MPLS_TC_MASK; > >+} > >+if (tb[TCA_FLOWER_KEY_MPLS_LABEL]) { > >+key_val->mpls_label = > >+nla_get_u32(tb[TCA_FLOWER_KEY_MPLS_LABEL]) & > >+MPLS_LABEL_MASK; > >+key_mask->mpls_label = MPLS_LABEL_MASK; > >+} > >+} > >+ > > static void fl_set_key_vlan(struct nlattr **tb, > > struct flow_dissector_key_vlan *key_val, > > struct flow_dissector_key_vlan *key_mask) > >@@ 
-594,6 +630,9 @@ static int fl_set_key(struct net *net, struct nlattr > >**tb, > >&mask->icmp.code, > >TCA_FLOWER_KEY_ICMPV6_CODE_MASK, > >
[RFC PATCH iproute2 net-next] tc flower: support for matching MPLS
[RFC until kernel code is accepted/rejected] This patch adds support to the iproute2 tc filter command for matching MPLS labels in the flower classifier. The ability to match the Time To Live, Bottom Of Stack, Traffic Control and Label fields are added as options to the flower filter. Signed-off-by: Benjamin LaHaise Signed-off-by: Benjamin LaHaise Reviewed-by: Simon Horman Reviewed-by: Jakub Kicinski diff --git a/include/linux/pkt_cls.h b/include/linux/pkt_cls.h index 7a69f2a..f1129e3 100644 --- a/include/linux/pkt_cls.h +++ b/include/linux/pkt_cls.h @@ -432,6 +432,11 @@ enum { TCA_FLOWER_KEY_ARP_THA, /* ETH_ALEN */ TCA_FLOWER_KEY_ARP_THA_MASK,/* ETH_ALEN */ + TCA_FLOWER_KEY_MPLS_TTL,/* u8 - 8 bits */ + TCA_FLOWER_KEY_MPLS_BOS,/* u8 - 1 bit */ + TCA_FLOWER_KEY_MPLS_TC, /* u8 - 3 bits */ + TCA_FLOWER_KEY_MPLS_LABEL, /* be32 - 20 bits */ + __TCA_FLOWER_MAX, }; diff --git a/man/man8/tc-flower.8 b/man/man8/tc-flower.8 index fc5bac5..6ce2d56 100644 --- a/man/man8/tc-flower.8 +++ b/man/man8/tc-flower.8 @@ -29,6 +29,14 @@ flower \- flow based traffic control filter .IR PRIORITY " | " .BR vlan_ethtype " { " ipv4 " | " ipv6 " | " .IR ETH_TYPE " } | " +.B mpls_label +.IR LABEL " | " +.B mpls_tc +.IR TC " | " +.B mpls_bos +.IR BOS " | " +.B mpls_ttl +.IR TTL " | " .BR ip_proto " { " tcp " | " udp " | " sctp " | " icmp " | " icmpv6 " | " .IR IP_PROTO " } | { " .BR dst_ip " | " src_ip " } " @@ -113,6 +121,27 @@ may be either .BR ipv4 ", " ipv6 or an unsigned 16bit value in hexadecimal format. .TP +.BI mpls_label " LABEL" +Match the outermost MPLS label id in an MPLS packet. +.I LABEL +is an unsigned 20 bit value in decimal format. +.TP +.BI mpls_tc " TC" +Match on the MPLS TC field, which is typically used for packet priority. +.I TC +is an unsigned 3 bit value in decimal format. +.TP +.BI mpls_bos " BOS" +Match on the MPLS Bottom Of Stack field in the outermost MPLS label. +.I BOS +is a 1 bit value. +.TP +.BI mpls_ttl " TTL" +Match on the MPLS Time To Live field. +.I TTL +is an unsigned 8 bit value which matches the MPLS TTL field in the outermost +label. +.TP .BI ip_proto " IP_PROTO" Match on layer four protocol. .I IP_PROTO diff --git a/tc/f_flower.c b/tc/f_flower.c index 5aac4a0..af9ef3d 100644 --- a/tc/f_flower.c +++ b/tc/f_flower.c @@ -19,6 +19,7 @@ #include #include #include +#include #include "utils.h" #include "tc_util.h" @@ -53,6 +54,10 @@ static void explain(void) " dst_mac MASKED-LLADDR |\n" " src_mac MASKED-LLADDR |\n" " ip_proto [tcp | udp | sctp | icmp | icmpv6 | IP-PROTO ] |\n" + " mpls_label LABEL |\n" + " mpls_tc TC |\n" + " mpls_bos BOS |\n" + " mpls_ttl TTL |\n" " dst_ip PREFIX |\n" " src_ip PREFIX |\n" " dst_port PORT-NUMBER |\n" @@ -599,6 +604,70 @@ static int flower_parse_opt(struct filter_util *qu, char *handle, &vlan_ethtype, n); if (ret < 0) return -1; + } else if (matches(*argv, "mpls_label") == 0) { + __u32 label; + + NEXT_ARG(); + if (eth_type != htons(ETH_P_MPLS_UC) && + eth_type != htons(ETH_P_MPLS_MC)) { + fprintf(stderr, + "Can't set \"mpls_label\" if ethertype isn't MPLS\n"); + return -1; + } + ret = get_u32(&label, *argv, 10); + if (ret < 0 || label & ~(MPLS_LS_LABEL_MASK >> MPLS_LS_LABEL_SHIFT)) { + fprintf(stderr, "Illegal \"mpls_label\"\n"); + return -1; + } + addattr32(n, MAX_MSG, TCA_FLOWER_KEY_MPLS_LABEL, label); + } else if (matches(*argv, "mpls_tc") == 0) { + __u8 tc; + + NEXT_ARG(); + if (eth_type != htons(ETH_P_MPLS_UC) && +
[PATCH net-next 2/2] cls_flower: add support for matching MPLS labels
Add support to tc flower to match based on fields in MPLS labels (TTL, Bottom of Stack, TC field, Label). Signed-off-by: Benjamin LaHaise Signed-off-by: Benjamin LaHaise Reviewed-by: Simon Horman Reviewed-by: Jakub Kicinski diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h index 7a69f2a..f1129e3 100644 --- a/include/uapi/linux/pkt_cls.h +++ b/include/uapi/linux/pkt_cls.h @@ -432,6 +432,11 @@ enum { TCA_FLOWER_KEY_ARP_THA, /* ETH_ALEN */ TCA_FLOWER_KEY_ARP_THA_MASK,/* ETH_ALEN */ + TCA_FLOWER_KEY_MPLS_TTL,/* u8 - 8 bits */ + TCA_FLOWER_KEY_MPLS_BOS,/* u8 - 1 bit */ + TCA_FLOWER_KEY_MPLS_TC, /* u8 - 3 bits */ + TCA_FLOWER_KEY_MPLS_LABEL, /* be32 - 20 bits */ + __TCA_FLOWER_MAX, }; diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c index 9d0c99d..24619f9 100644 --- a/net/sched/cls_flower.c +++ b/net/sched/cls_flower.c @@ -18,6 +18,7 @@ #include #include #include +#include #include #include @@ -47,6 +48,7 @@ struct fl_flow_key { struct flow_dissector_key_ipv6_addrs enc_ipv6; }; struct flow_dissector_key_ports enc_tp; + struct flow_dissector_key_mpls mpls; } __aligned(BITS_PER_LONG / 8); /* Ensure that we can do comparisons as longs. */ struct fl_flow_mask_range { @@ -423,6 +425,10 @@ static const struct nla_policy fl_policy[TCA_FLOWER_MAX + 1] = { [TCA_FLOWER_KEY_ARP_SHA_MASK] = { .len = ETH_ALEN }, [TCA_FLOWER_KEY_ARP_THA]= { .len = ETH_ALEN }, [TCA_FLOWER_KEY_ARP_THA_MASK] = { .len = ETH_ALEN }, + [TCA_FLOWER_KEY_MPLS_TTL] = { .type = NLA_U8 }, + [TCA_FLOWER_KEY_MPLS_BOS] = { .type = NLA_U8 }, + [TCA_FLOWER_KEY_MPLS_TC]= { .type = NLA_U8 }, + [TCA_FLOWER_KEY_MPLS_LABEL] = { .type = NLA_U32 }, }; static void fl_set_key_val(struct nlattr **tb, @@ -438,6 +444,36 @@ static void fl_set_key_val(struct nlattr **tb, memcpy(mask, nla_data(tb[mask_type]), len); } +static void fl_set_key_mpls(struct nlattr **tb, + struct flow_dissector_key_mpls *key_val, + struct flow_dissector_key_mpls *key_mask) +{ +#define MPLS_TTL_MASK (MPLS_LS_TTL_MASK >> MPLS_LS_TTL_SHIFT) +#define MPLS_BOS_MASK (MPLS_LS_S_MASK >> MPLS_LS_S_SHIFT) +#define MPLS_TC_MASK (MPLS_LS_TC_MASK >> MPLS_LS_TC_SHIFT) +#define MPLS_LABEL_MASK(MPLS_LS_LABEL_MASK >> MPLS_LS_LABEL_SHIFT) + + if (tb[TCA_FLOWER_KEY_MPLS_TTL]) { + key_val->mpls_ttl = nla_get_u8(tb[TCA_FLOWER_KEY_MPLS_TTL]); + key_mask->mpls_ttl = MPLS_TTL_MASK; + } + if (tb[TCA_FLOWER_KEY_MPLS_BOS]) { + key_val->mpls_bos = nla_get_u8(tb[TCA_FLOWER_KEY_MPLS_BOS]); + key_mask->mpls_bos = MPLS_BOS_MASK; + } + if (tb[TCA_FLOWER_KEY_MPLS_TC]) { + key_val->mpls_tc = + nla_get_u8(tb[TCA_FLOWER_KEY_MPLS_TC]) & MPLS_TC_MASK; + key_mask->mpls_tc = MPLS_TC_MASK; + } + if (tb[TCA_FLOWER_KEY_MPLS_LABEL]) { + key_val->mpls_label = + nla_get_u32(tb[TCA_FLOWER_KEY_MPLS_LABEL]) & + MPLS_LABEL_MASK; + key_mask->mpls_label = MPLS_LABEL_MASK; + } +} + static void fl_set_key_vlan(struct nlattr **tb, struct flow_dissector_key_vlan *key_val, struct flow_dissector_key_vlan *key_mask) @@ -594,6 +630,9 @@ static int fl_set_key(struct net *net, struct nlattr **tb, &mask->icmp.code, TCA_FLOWER_KEY_ICMPV6_CODE_MASK, sizeof(key->icmp.code)); + } else if (key->basic.n_proto == htons(ETH_P_MPLS_UC) || + key->basic.n_proto == htons(ETH_P_MPLS_MC)) { + fl_set_key_mpls(tb, &key->mpls, &mask->mpls); } else if (key->basic.n_proto == htons(ETH_P_ARP) || key->basic.n_proto == htons(ETH_P_RARP)) { fl_set_key_val(tb, &key->arp.sip, TCA_FLOWER_KEY_ARP_SIP, @@ -730,6 +769,8 @@ static void fl_init_dissector(struct cls_fl_head *head, FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt, 
FLOW_DISSECTOR_KEY_ARP, arp); FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt, +FLOW_DISSECTOR_KEY_MPLS, mpls); + FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt, FLOW_DISSECTOR_KEY_VLAN, vlan); FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt, FLOW_DISSECTOR_KEY_ENC_KEYID, enc_key_id); @@ -994,6 +1035,41 @@ static int fl_dump_key_val(struct sk_buff *skb, return 0; } +static int fl_dump_key_mpl
[PATCH net-next 1/2] flow_dissector: add mpls support
Add support for parsing MPLS flows to the flow dissector in preparation for adding MPLS match support to cls_flower. Signed-off-by: Benjamin LaHaise Signed-off-by: Benjamin LaHaise Reviewed-by: Simon Horman Reviewed-by: Jakub Kicinski diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h index ac97030..00d704f 100644 --- a/include/net/flow_dissector.h +++ b/include/net/flow_dissector.h @@ -41,6 +41,13 @@ struct flow_dissector_key_vlan { u16 padding; }; +struct flow_dissector_key_mpls { + u32 mpls_ttl : 8, + mpls_bos : 1, + mpls_tc : 3, + mpls_label : 20; +}; + struct flow_dissector_key_keyid { __be32 keyid; }; @@ -169,6 +176,7 @@ enum flow_dissector_key_id { FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS, /* struct flow_dissector_key_ipv6_addrs */ FLOW_DISSECTOR_KEY_ENC_CONTROL, /* struct flow_dissector_key_control */ FLOW_DISSECTOR_KEY_ENC_PORTS, /* struct flow_dissector_key_ports */ + FLOW_DISSECTOR_KEY_MPLS, /* struct flow_dissector_key_mpls */ FLOW_DISSECTOR_KEY_MAX, }; diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c index 5f3ae92..15185d8 100644 --- a/net/core/flow_dissector.c +++ b/net/core/flow_dissector.c @@ -126,9 +126,11 @@ __skb_flow_dissect_mpls(const struct sk_buff *skb, { struct flow_dissector_key_keyid *key_keyid; struct mpls_label *hdr, _hdr[2]; + u32 entry, label; if (!dissector_uses_key(flow_dissector, - FLOW_DISSECTOR_KEY_MPLS_ENTROPY)) + FLOW_DISSECTOR_KEY_MPLS_ENTROPY) && + !dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_MPLS)) return FLOW_DISSECT_RET_OUT_GOOD; hdr = __skb_header_pointer(skb, nhoff, sizeof(_hdr), data, @@ -136,8 +138,25 @@ __skb_flow_dissect_mpls(const struct sk_buff *skb, if (!hdr) return FLOW_DISSECT_RET_OUT_BAD; - if ((ntohl(hdr[0].entry) & MPLS_LS_LABEL_MASK) >> - MPLS_LS_LABEL_SHIFT == MPLS_LABEL_ENTROPY) { + entry = ntohl(hdr[0].entry); + label = (entry & MPLS_LS_LABEL_MASK) >> MPLS_LS_LABEL_SHIFT; + + if (dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_MPLS)) { + struct flow_dissector_key_mpls *key_mpls; + + key_mpls = skb_flow_dissector_target(flow_dissector, +FLOW_DISSECTOR_KEY_MPLS, +target_container); + key_mpls->mpls_label = label; + key_mpls->mpls_ttl = (entry & MPLS_LS_TTL_MASK) + >> MPLS_LS_TTL_SHIFT; + key_mpls->mpls_tc = (entry & MPLS_LS_TC_MASK) + >> MPLS_LS_TC_SHIFT; + key_mpls->mpls_bos = (entry & MPLS_LS_S_MASK) + >> MPLS_LS_S_SHIFT; + } + + if (label == MPLS_LABEL_ENTROPY) { key_keyid = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_MPLS_ENTROPY, target_container);
[PATCH v2 iproute2] f_flower: don't set TCA_FLOWER_KEY_ETH_TYPE for "protocol all"
v2 - update to address changes in 00697ca19ae3e1118f2af82c3b41ac4335fe918b. When using the tc flower filter, rules marked with "protocol all" do not actually match all packets. This is due to a bug in f_flower.c that passes in ETH_P_ALL in the TCA_FLOWER_KEY_ETH_TYPE attribute when adding a rule. Fix this by omitting TCA_FLOWER_KEY_ETH_TYPE if the protocol is set to ETH_P_ALL. Fixes: 488b41d020fb ("tc: flower no need to specify the ethertype") Cc: Jamal Hadi Salim Signed-off-by: Benjamin LaHaise Signed-off-by: Benjamin LaHaise diff --git a/tc/f_flower.c b/tc/f_flower.c index 314c2dd..145a856 100644 --- a/tc/f_flower.c +++ b/tc/f_flower.c @@ -529,9 +529,11 @@ parse_done: if (ret) return ret; - ret = addattr16(n, MAX_MSG, TCA_FLOWER_KEY_ETH_TYPE, eth_type); - if (ret) - return ret; + if (eth_type != htons(ETH_P_ALL)) { + ret = addattr16(n, MAX_MSG, TCA_FLOWER_KEY_ETH_TYPE, eth_type); + if (ret) + return ret; + } tail->rta_len = (((void *)n)+n->nlmsg_len) - (void *)tail;
[PATCH iproute2] f_flower: don't set TCA_FLOWER_KEY_ETH_TYPE for "protocol all"
When using the tc filter flower, rules marked with "protocol all" do not actually match all packets. This is due to a bug in f_flower.c that passes in ETH_P_ALL in the TCA_FLOWER_KEY_ETH_TYPE attribute when adding a rule. Fix this by omitting TCA_FLOWER_KEY_ETH_TYPE if the protocol is set to ETH_P_ALL. Signed-off-by: Benjamin LaHaise Signed-off-by: Benjamin LaHaise diff --git a/tc/f_flower.c b/tc/f_flower.c index 1dbc532..1f90da3 100644 --- a/tc/f_flower.c +++ b/tc/f_flower.c @@ -527,11 +527,13 @@ static int flower_parse_opt(struct filter_util *qu, char *handle, parse_done: addattr32(n, MAX_MSG, TCA_FLOWER_FLAGS, flags); - ret = addattr16(n, MAX_MSG, TCA_FLOWER_KEY_ETH_TYPE, eth_type); - if (ret) { - fprintf(stderr, "Illegal \"eth_type\"(0x%x)\n", - ntohs(eth_type)); - return -1; + if (eth_type != htons(ETH_P_ALL)) { + ret = addattr16(n, MAX_MSG, TCA_FLOWER_KEY_ETH_TYPE, eth_type); + if (ret) { + fprintf(stderr, "Illegal \"eth_type\"(0x%x)\n", + ntohs(eth_type)); + return -1; + } } tail->rta_len = (((void *)n)+n->nlmsg_len) - (void *)tail;
Re: use-after-free in sock_wake_async
On Tue, Nov 24, 2015 at 04:30:01PM -0500, Jason Baron wrote: > So looking at this trace I think its the other->sk_socket that gets > freed and then we call sk_wake_async() on it. > > We could I think grab the socket reference there with unix_state_lock(), > since that is held by unix_release_sock() before the final iput() is called. > > So something like below might work (compile tested only): That just adds the performance regression back in. It should be possible to protect the other socket dereference using RCU. I haven't had time to look at this yet today, but will try to find some time this evening to come up with a suggested patch. -ben > diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c > index aaa0b58..2b014f1 100644 > --- a/net/unix/af_unix.c > +++ b/net/unix/af_unix.c > @@ -196,6 +196,19 @@ static inline int unix_recvq_full(struct sock const > *sk) > return skb_queue_len(&sk->sk_receive_queue) > sk->sk_max_ack_backlog; > } > > +struct socket *unix_peer_get_socket(struct sock *s) > +{ > + struct socket *peer; > + > + unix_state_lock(s); > + peer = s->sk_socket; > + if (peer) > + __iget(SOCK_INODE(s->sk_socket)); > + unix_state_unlock(s); > + > + return peer; > +} > + > struct sock *unix_peer_get(struct sock *s) > { > struct sock *peer; > @@ -1639,6 +1652,7 @@ static int unix_stream_sendmsg(struct socket > *sock, struct msghdr *msg, > { > struct sock *sk = sock->sk; > struct sock *other = NULL; > + struct socket *other_socket = NULL; > int err, size; > struct sk_buff *skb; > int sent = 0; > @@ -1662,7 +1676,10 @@ static int unix_stream_sendmsg(struct socket > *sock, struct msghdr *msg, > } else { > err = -ENOTCONN; > other = unix_peer(sk); > - if (!other) > + if (other) > + other_socket = unix_peer_get_socket(other); > + > + if (!other_socket) > goto out_err; > } > > @@ -1721,6 +1738,9 @@ static int unix_stream_sendmsg(struct socket > *sock, struct msghdr *msg, > sent += size; > } > > + if (other_socket) > + iput(SOCK_INODE(other_socket)); > + > scm_destroy(&scm); > > return sent; > @@ -1733,6 +1753,8 @@ pipe_err: > send_sig(SIGPIPE, current, 0); > err = -EPIPE; > out_err: > + if (other_socket) > + iput(SOCK_INODE(other_socket)); > scm_destroy(&scm); > return sent ? : err; > } -- "Thought is the essence of where you are now." -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
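The RCU approach mentioned above would follow the usual publish/unpublish pattern sketched below. This is deliberately generic -- the conn/peer structures and the peer_wake()/peer_free() helpers are made up for illustration, and it is not the eventual af_unix fix -- but it shows the shape: readers only touch the pointer inside an RCU read-side section, and teardown unpublishes the pointer and waits out a grace period before freeing.

#include <linux/rcupdate.h>
#include <linux/spinlock.h>

struct peer;                            /* made-up types for the sketch */
void peer_wake(struct peer *p);
void peer_free(struct peer *p);

struct conn {
        spinlock_t lock;
        struct peer __rcu *peer;
};

/* reader: no reference counting, the pointer is only used inside the
 * RCU read-side critical section and nothing here may sleep */
void conn_wake_peer(struct conn *c)
{
        struct peer *p;

        rcu_read_lock();
        p = rcu_dereference(c->peer);
        if (p)
                peer_wake(p);
        rcu_read_unlock();
}

/* teardown: unpublish under the lock, wait out existing readers,
 * then free */
void conn_drop_peer(struct conn *c)
{
        struct peer *p;

        spin_lock(&c->lock);
        p = rcu_dereference_protected(c->peer, lockdep_is_held(&c->lock));
        RCU_INIT_POINTER(c->peer, NULL);
        spin_unlock(&c->lock);

        if (p) {
                synchronize_rcu();      /* or call_rcu() to avoid blocking */
                peer_free(p);
        }
}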
Re: [PATCH] fib_semantics: prevent long hash chains in access server config
On Sat, Jan 12, 2008 at 09:38:57PM -0800, David Miller wrote: > And guess why we don't do this? Because it's not part of > the key. Other aspects of the base fib_info and nexthops > provide the uniqueness, not the devindex of the first hop. > > So you'll need to find another way to do this. Ah, you're right indeed. It's probably easier for me to change how the daemon adds the local ip address for these point to point interfaces. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] fib_semantics: prevent long hash chains in access server config
This is a patch from a while ago that I'm resending. Basically, in access server configurations, a lot of routes have the same local ip address but on different devices. This fixes the long chains that result from not including the device index in the hash. -ben diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c index 1351a26..5375824 100644 --- a/net/ipv4/fib_semantics.c +++ b/net/ipv4/fib_semantics.c @@ -196,11 +196,21 @@ static __inline__ int nh_comp(const struct fib_info *fi, const struct fib_info * return 0; } +static inline unsigned int fib_devindex_hashfn(unsigned int val) +{ + unsigned int mask = DEVINDEX_HASHSIZE - 1; + + return (val ^ + (val >> DEVINDEX_HASHBITS) ^ + (val >> (DEVINDEX_HASHBITS * 2))) & mask; +} + static inline unsigned int fib_info_hashfn(const struct fib_info *fi) { unsigned int mask = (fib_hash_size - 1); unsigned int val = fi->fib_nhs; + val ^= fib_devindex_hashfn(fi->fib_dev->ifindex); val ^= fi->fib_protocol; val ^= (__force u32)fi->fib_prefsrc; val ^= fi->fib_priority; @@ -234,15 +244,6 @@ static struct fib_info *fib_find_info(const struct fib_info *nfi) return NULL; } -static inline unsigned int fib_devindex_hashfn(unsigned int val) -{ - unsigned int mask = DEVINDEX_HASHSIZE - 1; - - return (val ^ - (val >> DEVINDEX_HASHBITS) ^ - (val >> (DEVINDEX_HASHBITS * 2))) & mask; -} - /* Check, that the gateway is already configured. Used only by redirect accept routine. */ -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
announce: Babylon PPP (scalable L2TP) 2.0 beta 1
Hello all, This is an announcement for a new beta release of a decently scalable L2TP stack for Linux. It is based off of the Babylon PPP stack created by SpellCaster back in the '98-'00 timeframe. Right now there are lots of rough edges (especially in terms of documentation), but it works well enough for many purposes. The main difference between the Babylon PPP stack and pppd is that the Babylon code uses a single process for controlling all PPP sessions established throughout the system, which uses significantly less memory and processing power per connection. Practically speaking, a mid range system can terminate thousands of sessions without breaking a sweat. Features - entirely GPLed - can terminate thousands of sessions - includes kernel mode multihop for L2TP - can act as an LAC or LNS - has gigawords accounting for sessions - includes Radius client support - already in use to terminate real L2TP traffic for 2 ISPs. - works on x86/x86-64. Not expected to work on architectures where unaligned memory accesses are problematic. Non-Features - Uses its own kernel mode PPP stack for L2TP and PPPoE. The exact direction for this is open to discussions, but will require lots of work to address the scalability issues. - PPPoE support is not well integrated at present. It works for dial out (kernel mode via drivers/pppoe/pppoed and kernel/bab_pppoe.o) or PPPoE->L2TP conversion (userspace), but not incoming calls as-is. - device nodes are still a work in progress - packaging needs improvement - the kernel code is not at all ready for merging as it needs fairly major cleanup in several areas The README.l2tp file gives a rough description of what is needed to get things going on recent 2.6 kernels, but I've only been testing 2.6.22.5 for this release. Questions/comments are appreciated. The source can be had via git clone http://www.kvack.org/~bcrl/babylon.git/.git . Cheers, -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] don't allow netfilter --setmss to increase mss
Hi folks, When terminating DSL connections for an assortment of random customers, I've found it necessary to use iptables to clamp the MSS used for connections to work around the various ICMP blackholes in the greater net. Unfortunately, the current behaviour in Linux is imperfect and actually makes things worse, so I'm proposing the following: increasing the MSS in a packet can never be a good thing, so make --set-mss only lower the MSS in a packet. Yes, I am aware of --clamp-mss-to-pmtu, but it doesn't work for outgoing connections from clients (ie web traffic), as it only looks at the PMTU on the destination route, not the source of the packet (the DSL interfaces in question have a 1442 byte MTU while the destination ethernet interface is 1500 -- there are problematic hosts which use a 1300 byte MTU). Reworking that is probably a good idea at some point, but it's more work than this is. Thoughts? Would it be better to add a new flag? -ben Signed-off-by: Benjamin LaHaise <[EMAIL PROTECTED]> diff --git a/net/netfilter/xt_TCPMSS.c b/net/netfilter/xt_TCPMSS.c index d40f7e4..411c482 100644 --- a/net/netfilter/xt_TCPMSS.c +++ b/net/netfilter/xt_TCPMSS.c @@ -88,8 +88,11 @@ tcpmss_mangle_packet(struct sk_buff **pskb, oldmss = (opt[i+2] << 8) | opt[i+3]; - if (info->mss == XT_TCPMSS_CLAMP_PMTU && - oldmss <= newmss) + /* Never increase MSS, even when setting it, as +* doing so results in problems for hosts that rely +* on MSS being set correctly. +*/ + if (oldmss <= newmss) return 0; opt[i+2] = (newmss & 0xff00) >> 8; -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
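To restate the proposed semantics: whatever target MSS is in effect (a fixed --set-mss value or one derived from the path MTU), the option in the packet is only ever lowered, never raised. A trivial sketch of that rule (illustrative, not the netfilter code itself):

/* never raise the MSS already present in the packet, only lower it */
static unsigned int clamp_mss(unsigned int oldmss, unsigned int target)
{
        return oldmss <= target ? oldmss : target;
}

With the 1442-byte DSL links mentioned above the derived target would be 1402 (MTU minus 40 bytes of IPv4+TCP header); a host already advertising 1260 because of its 1300-byte MTU is left untouched.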
Re: [patch 03/18] drivers/net/ns83820.c: add paramter to disable autonegotiation
On Fri, Aug 10, 2007 at 02:05:13PM -0700, [EMAIL PROTECTED] wrote: > Also added a "disable_autoneg" module argument to completely disable > autoneg on all cards using this driver. ... > [akpm: this is a previously-nacked patch, but the problem is real] Please remove this part of the patch. The ethtool support is sufficient and doesn't clobber other cards in the system. At the very least the module parameter has to be limited to a specific card. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: FSCKED clock sources WAS(Re: [WIP][PATCHES] Network xmit batching
On Mon, Jun 25, 2007 at 12:59:54PM -0400, jamal wrote: > On Thu, 2007-21-06 at 12:55 -0400, Benjamin LaHaise wrote: > > > You should qualify that as 'Old P4 Xeon', as the Core 2 Xeons are leagues > > better. > > The Xeon hardware is not that old - about a year or so (and so is the > opteron). > BTW, how could you tell this was old Xeon? CPUID:

vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel(R) Xeon(TM) CPU 2.80GHz

shows that it is a P4 Xeon, which sucks compared to:

vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Genuine Intel(R) CPU @ 2.66GHz

which is a Core 2 based Xeon. The tuning required by the P4 is quite different than the Core 2, and it generally performs more poorly due to the length of the pipeline and the expense of pipeline flushes. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: FSCKED clock sources WAS(Re: [WIP][PATCHES] Network xmit batching
On Thu, Jun 21, 2007 at 12:08:19PM -0400, jamal wrote: > The results in the table for opteron and xeon are swapped when > cutnpasting from a larger test result. So Opteron is the one with better > results. > In any case - off for the day over here. You should qualify that as 'Old P4 Xeon', as the Core 2 Xeons are leagues better. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: r8169 tx problem (1s pause with ping)
On Fri, Jun 15, 2007 at 01:33:14AM +1000, David Gundersen wrote: > In the mean-time I'll attach my patch for the r8168-8.001.00 realtek > driver here in case anybody else wants to have a play with it and see if > it helps them out. Out of curiosity, does it work if you just do a single read (ie RTL_R8(TxPoll);) of the register before writing to it? That would clear things up if it is a PCI posting problem. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
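For anyone unfamiliar with the posting issue being probed here: PCI MMIO reads are non-posted, so a read from the device forces any earlier posted writes to that device to complete before the read returns. A generic sketch of the pattern (the register offset and doorbell value are placeholders; this is not the r8169 driver code):

#include <linux/io.h>
#include <linux/types.h>

/* generic example only -- not the r8169 fix itself */
static void kick_tx(void __iomem *ioaddr, unsigned int doorbell_reg, u8 val)
{
        (void)readb(ioaddr + doorbell_reg);     /* drains earlier posted writes */
        writeb(val, ioaddr + doorbell_reg);     /* ring the doorbell */
}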
r8169 tx problem (1s pause with ping)
Hello folks, I'm seeing something odd with r8169 on FC7: doing a ping -s 1600 alternates between a 1s latency and sub 1ms. Has anyone else seen anything like this? The system in question is an Asus M2A-VM with an onboard RTL8111 (I think). NAPI doesn't seem to make a difference. The kernel in question is currently a vanilla 2.6.21.5. Sub-mtu sized packets behave normally. 02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 01) PING 1.2.3.4 (1.2.3.4) 1600(1628) bytes of data. 1608 bytes from 1.2.3.4: icmp_seq=1 ttl=64 time=1000 ms 1608 bytes from 1.2.3.4: icmp_seq=2 ttl=64 time=0.816 ms 1608 bytes from 1.2.3.4: icmp_seq=3 ttl=64 time=1000 ms 1608 bytes from 1.2.3.4: icmp_seq=4 ttl=64 time=0.661 ms -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] TCP FIN gets dropped prematurely, results in ack storm
On Tue, May 01, 2007 at 02:03:04PM -0400, John Heffner wrote: > Actually, you cannot get in this situation by loss or reordering of > packets, only by corruption of state on one side. It sends the FIN, > which effectively increases the sequence number by one. However, all > later segments it sends have an old lower sequence number, which are now > out of window. Okay, I missed the other packets with a FIN later on in the storm. What is different about them is that they get sent with different timestamps than the acks being thrown about. Perhaps narrowly looking at the lack of FIN is wrong -- I'll try instrumenting what the PAWS code is doing on both sides as that is probably what short circuits an ACK into being sent. > Being liberal in what you accept is good to a point, but sometimes you > have to draw the line. True. Still, both sides are doing completely the wrong thing in this case, and I'd like to get an idea of the best way to prevent the ACK storm from happening. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] TCP FIN gets dropped prematurely, results in ack storm
On Tue, May 01, 2007 at 01:54:03PM -0400, John Heffner wrote: > Looking at your trace, it seems like the behavior of the test system > 192.168.2.2 is broken in two ways. First, like you said it has broken > state in that it has forgotten that it sent the FIN. Once you do that, > the connection state is corrupt and all bets are off. It's sending an > out-of-window segment that's getting tossed by Linux, and Linux > generates an ack in response. This is in direct RFC compliance. The > second problem is that the other system is generating these broken acks > in response to the legitimate acks Linux is sending, causing the ack > war. I can't really guess why it's doing that... I know it's a bug, and I'm trying to fix it, but that doesn't change the fact that A) the system is already deployed and B) Linux is not retransmitting the FIN, which (from Linux's point of view) remains unacknowledged by the other side. The patch might be wrong, but the goal of fixing the behaviour isn't. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] TCP FIN gets dropped prematurely, results in ack storm
On Tue, May 01, 2007 at 09:41:28PM +0400, Evgeniy Polyakov wrote: > Hmm, 2.2 machine in your test seems to behave incorrectly: I am aware of that. However, I think that the loss of certain packets and reordering can result in the same behaviour. What's more, this behaviour can occur in real deployed systems. "Be strict in what you send and liberal in what you accept." Both systems should be fixed, which is what I'm trying to do. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] TCP FIN gets dropped prematurely, results in ack storm
On Tue, May 01, 2007 at 08:20:50PM +0400, Evgeniy Polyakov wrote: > > http://www.kvack.org/~bcrl/ack-storm.log . As near as I can tell, a > > similar effect can occur between two Linux boxes if the right packets get > > reordered/dropped during connection teardown. > > Could you archive 24Mb file or cut more precise bits out of it? The interesting bits are the first 10 lines. > According to your patch, several packets with fin bit might be sent, > including one with data. If another host does not receive fin > retransmit, then that logic is broken, and it can not be fixed by > duplicating fins, I would even say, that remote box should drop second > packet with fin, while it can carry data, which will break higher > connection logic. The FIN hasn't been ack'd by the other side, though, and yet Linux is no longer transmitting packets with it set. Read the beginning of the trace. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] TCP FIN gets dropped prematurely, results in ack storm
Hello, While testing a failover scenario, I managed to trigger an ack storm between a Linux box and another system. Although the cause of this particular ACK storm was due to the other box forgetting that it sent out a FIN (the second node was unaware of the FIN the first sent in its dying gasp, which is what I'm trying to fix, but it's a tricky race), the resulting Linux behaviour wasn't very robust. Is there any particularly good reason that FIN flag gets cleared on a connection which is being shut down? The trace that motivates this can be seen at http://www.kvack.org/~bcrl/ack-storm.log . As near as I can tell, a similar effect can occur between two Linux boxes if the right packets get reordered/dropped during connection teardown. -ben diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 0faacf9..1e54291 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -635,9 +635,9 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len, unsigned int mss TCP_SKB_CB(buff)->end_seq = TCP_SKB_CB(skb)->end_seq; TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(buff)->seq; - /* PSH and FIN should only be set in the second packet. */ + /* PSH should only be set in the second packet. */ flags = TCP_SKB_CB(skb)->flags; - TCP_SKB_CB(skb)->flags = flags & ~(TCPCB_FLAG_FIN|TCPCB_FLAG_PSH); + TCP_SKB_CB(skb)->flags = flags & ~(TCPCB_FLAG_PSH); TCP_SKB_CB(buff)->flags = flags; TCP_SKB_CB(buff)->sacked = TCP_SKB_CB(skb)->sacked; TCP_SKB_CB(skb)->sacked &= ~TCPCB_AT_TAIL; @@ -1124,9 +1124,9 @@ static int tso_fragment(struct sock *sk, struct sk_buff *skb, unsigned int len, TCP_SKB_CB(buff)->end_seq = TCP_SKB_CB(skb)->end_seq; TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(buff)->seq; - /* PSH and FIN should only be set in the second packet. */ + /* PSH should only be set in the second packet. */ flags = TCP_SKB_CB(skb)->flags; - TCP_SKB_CB(skb)->flags = flags & ~(TCPCB_FLAG_FIN|TCPCB_FLAG_PSH); + TCP_SKB_CB(skb)->flags = flags & ~(TCPCB_FLAG_PSH); TCP_SKB_CB(buff)->flags = flags; /* This packet was never sent out yet, so no SACK bits. */ @@ -1308,7 +1308,7 @@ static int tcp_mtu_probe(struct sock *sk) sk_stream_free_skb(sk, skb); } else { TCP_SKB_CB(nskb)->flags |= TCP_SKB_CB(skb)->flags & - ~(TCPCB_FLAG_FIN|TCPCB_FLAG_PSH); + ~(TCPCB_FLAG_PSH); if (!skb_shinfo(skb)->nr_frags) { skb_pull(skb, copy); if (skb->ip_summed != CHECKSUM_PARTIAL) - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] fib_info_hashfn leads to long hash chains
Hello The patch below fixes a case where fib_find_info() consumes excessive amounts of CPU during the creation of large numbers of PPP interfaces. In access servers, each point-to-point link has the same local address, but a different destination and interface. Because the device is not included in the hash calculation, the chain grows excessively large and we end up spinning the CPU walking the list. As near as I can tell, this shouldn't have any negative side effects, but someone with a better understanding of fib_semantics.c will need to check over it. Cheers, -ben

Before:
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 2
samples  %        image name  app name   symbol name
2469220  47.8872  vmlinux     vmlinux    fib_find_info
576393   11.1784  vmlinux     vmlinux    fn_trie_delete
559561   10.8519  vmlinux     vmlinux    fn_trie_insert
518975   10.0648  vmlinux     vmlinux    rt_run_flush
284646    5.5203  vmlinux     vmlinux    local_bh_enable_ip
52469     1.0176  vmlinux     vmlinux    local_bh_disable
47638     0.9239  vmlinux     vmlinux    fib_nh_match
20588     0.3993  oprofiled   oprofiled  sfile_find
16074     0.3117  oprofiled   oprofiled  odb_update_node
13957     0.2707  ahci        ahci       (no symbols)
13648     0.2647  vmlinux     vmlinux    rtl8169_interrupt
13206     0.2561  vmlinux     vmlinux    register_netdevice

After:
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 2
512204   26.1316  vmlinux     vmlinux    rt_run_flush
289378   14.7635  vmlinux     vmlinux    local_bh_enable_ip
267266   13.6354  vmlinux     vmlinux    fn_trie_delete
253438   12.9299  vmlinux     vmlinux    fn_trie_insert
53329     2.7207  vmlinux     vmlinux    local_bh_disable
21864     1.1155  vmlinux     vmlinux    fib_nh_match
15105     0.7706  ahci        ahci       (no symbols)
12332     0.6292  vmlinux     vmlinux    rtl8169_interrupt
9504      0.4849  vmlinux     vmlinux    ehci_irq
8436      0.4304  vmlinux     vmlinux    journal_clean_one_cp_list
7645      0.3900  oprofiled   oprofiled  odb_update_node
7462      0.3807  oprofiled   oprofiled  sfile_find
6366      0.3248  babylond    babylond   memset

Signed-off-by: Benjamin LaHaise <[EMAIL PROTECTED]> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c index 3dad12e..e790842 100644 --- a/net/ipv4/fib_semantics.c +++ b/net/ipv4/fib_semantics.c @@ -197,11 +197,23 @@ static __inline__ int nh_comp(const struct fib_info *fi, const struct fib_info * return 0; } +static inline unsigned int fib_devindex_hashfn(unsigned int val) +{ + unsigned int mask = DEVINDEX_HASHSIZE - 1; + + return (val ^ + (val >> DEVINDEX_HASHBITS) ^ + (val >> (DEVINDEX_HASHBITS * 2))) & mask; +} + static inline unsigned int fib_info_hashfn(const struct fib_info *fi) { unsigned int mask = (fib_hash_size - 1); unsigned int val = fi->fib_nhs; + if (val) + val ^= fib_devindex_hashfn(fi->fib_dev->ifindex); + val ^= fi->fib_protocol; val ^= (__force u32)fi->fib_prefsrc; val ^= fi->fib_priority; @@ -235,15 +247,6 @@ static struct fib_info *fib_find_info(const struct fib_info *nfi) return NULL; } -static inline unsigned int fib_devindex_hashfn(unsigned int val) -{ - unsigned int mask = DEVINDEX_HASHSIZE - 1; - - return (val ^ - (val >> DEVINDEX_HASHBITS) ^ - (val >> (DEVINDEX_HASHBITS * 2))) & mask; -} - /* Check, that the gateway is already configured. Used only by redirect accept routine. */ - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: TCP connection stops after high load.
On Wed, Apr 11, 2007 at 02:06:31PM -0700, Ben Greear wrote: > For the dup acks, I see nothing *but* dup acks on the wire...going in > both directions interestingly, at greater than 100,000 packets per second. > > I don't mind adding printks...and I've started reading through the code, > but there is a lot of it, and indiscriminate printks will likely just > hide the problem because it will slow down performance so much. What do the timestamps look like? PAWS contains logic which will drop packets if the timestamps are too old compared to what the receiver expects. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
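If the timestamps have gone backwards from the receiver's point of view, PAWS discards the segment. As a rough sketch of the test being referred to (the helper name is made up here; the real logic is in tcp_paws_discard() in net/ipv4/tcp_input.c and also allows for a roughly 24-day idle window and other exceptions):

	/* Conceptual sketch only: the peer's current timestamp option value is
	 * compared, with wraparound-safe signed arithmetic, against the last
	 * timestamp recorded for the connection. */
	static int paws_reject(u32 rcv_tsval, u32 ts_recent)
	{
		return (s32)(rcv_tsval - ts_recent) < 0;	/* older timestamp => drop */
	}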
Re: [patch 1/4] network dev read_mostly
On Thu, Mar 15, 2007 at 12:25:16AM -0700, David Miller wrote: > Could we obtain %rip relative addressing with the ELF > relocation approach I mentioned? I think we can for some of the objects -- things like slab caches are good candidates if we have the initialization done at init time, which would actually be a very cool way of getting rid of the static slab creation calls. I'll cook something better up on the slab front. As for other variables, many can't be rip relative as they often end up pointing to addresses outside of the +/-2GB range we have there. The main reason I came up with this was from looking at the various kprobes and notifier chain overhead in common code paths. In many instances we need a single byte flag showing the feature is in use to jump out of the hot path. Then there are the selinux hooks -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 1/4] network dev read_mostly
On Mon, Mar 12, 2007 at 02:08:18PM -0700, Stephen Hemminger wrote: > For Eric, mark packet type and network device watermarks > as read mostly. The following x86-64 bits might be intersting, as they allow you to completely eliminate the memory access for run time defined constants. Note that read_always writes are non-atomic, so some other form of protection is necessary for readers (and rcu won't cut it). That can be fixed somewhat by specifying the alignment for the mov instruction to ensure writes are atomic, but for many uses that is overkill. This kind of change can make the biggest difference for high-latency cases, like L1 cache misses on the Prescott P4. I've not benched it on a P4 of late, though. -ben diff --git a/arch/x86_64/kernel/head64.c b/arch/x86_64/kernel/head64.c index 5f197b0..022ee38 100644 --- a/arch/x86_64/kernel/head64.c +++ b/arch/x86_64/kernel/head64.c @@ -70,6 +70,8 @@ void __init x86_64_start_kernel(char * real_mode_data) memcpy(init_level4_pgt, boot_level4_pgt, PTRS_PER_PGD*sizeof(pgd_t)); asm volatile("movq %0,%%cr3" :: "r" (__pa_symbol(&init_level4_pgt))); + init_read_always(); + for (i = 0; i < NR_CPUS; i++) cpu_pda(i) = &boot_cpu_pda[i]; diff --git a/arch/x86_64/kernel/vmlinux.lds.S b/arch/x86_64/kernel/vmlinux.lds.S index b73212c..19852dc 100644 --- a/arch/x86_64/kernel/vmlinux.lds.S +++ b/arch/x86_64/kernel/vmlinux.lds.S @@ -50,6 +50,10 @@ SECTIONS __ex_table : AT(ADDR(__ex_table) - LOAD_OFFSET) { *(__ex_table) } __stop___ex_table = .; + start__read_always = .; + __read_always : AT(ADDR(__read_always) - LOAD_OFFSET) { *(__read_always) } + stop__read_always = .; + RODATA BUG_TABLE diff --git a/arch/x86_64/mm/fault.c b/arch/x86_64/mm/fault.c index 6ada723..d48415e 100644 --- a/arch/x86_64/mm/fault.c +++ b/arch/x86_64/mm/fault.c @@ -31,6 +31,7 @@ #include #include #include +#include #include /* Page fault error code bits */ @@ -41,11 +42,67 @@ #define PF_INSTR (1<<4) static ATOMIC_NOTIFIER_HEAD(notify_page_fault_chain); +static DEFINE_READ_ALWAYS(char, notify_page_fault_active); +static DEFINE_READ_ALWAYS(char, page_fault_trace); + +void init_read_always(void) +{ + extern unsigned int start__read_always[], stop__read_always[]; + unsigned int *fixup; + + fixup = start__read_always; + while (fixup < stop__read_always) { + void *where = (void *)(fixup[0] - 0x1L); + void *which = (void *)(fixup[1] - 0x1L); + long size = fixup[2]; + fixup += 3; + + switch (size) { + case 1: *(u8 *)where = *(u8 *)which;break; + case 2: *(u16 *)where = *(u16 *)which; break; + case 4: *(u32 *)where = *(u32 *)which; break; + case 8: *(u64 *)where = *(u64 *)which; break; + } + } +} + +void set_read_always_size(void *ptr, long val, int size) +{ + extern unsigned int start__read_always[], stop__read_always[]; + unsigned int *fixup; + + switch(size) { + case 1: *(u8 *)ptr = val; break; + case 2: *(u16 *)ptr = val; break; + case 4: *(u32 *)ptr = val; break; + case 8: *(u64 *)ptr = val; break; + } + + fixup = start__read_always; + while (fixup < stop__read_always) { + void *where = (void *)(fixup[0] - 0x1L); + void *which = (void *)(fixup[1] - 0x1L); + long actual_size = fixup[2]; + fixup += 3; + + if (which != ptr) + continue; + + BUG_ON(size != actual_size); + switch(size) { + case 1: *(u8 *)where = val; break; + case 2: *(u16 *)where = val;break; + case 4: *(u32 *)where = val;break; + case 8: *(u64 *)where = val;break; + } + } +} /* Hook to register for page fault notifications */ int register_page_fault_notifier(struct notifier_block *nb) { vmalloc_sync_all(); + 
set_read_always(notify_page_fault_active, 1); return atomic_notifier_chain_register(&notify_page_fault_chain, nb); } EXPORT_SYMBOL_GPL(register_page_fault_notifier); @@ -56,7 +113,7 @@ int unregister_page_fault_notifier(struct notifier_block *nb) { } EXPORT_SYMBOL_GPL(unregister_page_fault_notifier); -static inline int notify_page_fault(struct pt_regs *regs, long err) +static int notify_page_fault(struct pt_regs *regs, long err) { struct die_args args = { .regs = regs, @@ -301,7 +358,6 @@ static int vmalloc_fault(unsigned long address) return 0; } -int page_fault_trace = 0; int exception_trace = 1; /* @@ -355,7 +411,8 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs, if (vmalloc_fault(address) >
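The header side of the patch (the DEFINE_READ_ALWAYS and reader definitions) is not included in the quoted hunks, so the following is only a sketch of the intended usage pattern; the read_always() reader macro and the slow_feature_work() helper are assumptions, not part of the posted patch:

	/* Hypothetical usage sketch: the flag is read as an instruction
	 * immediate rather than a memory load, and set_read_always() patches
	 * every reader's immediate via the __read_always fixup section shown
	 * in the patch above. */
	static DEFINE_READ_ALWAYS(char, feature_active);

	static inline void hot_path(void)
	{
		if (!read_always(feature_active))	/* no memory access on the fast path */
			return;
		slow_feature_work();			/* hypothetical slow-path helper */
	}

	static void enable_feature(void)
	{
		set_read_always(feature_active, 1);	/* rewrites the readers' immediates */
	}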
Re: [PATCH 1/2] avoid OPEN_MAX in SCM_MAX_FD
On Tue, Mar 13, 2007 at 01:39:12AM -0700, Roland McGrath wrote: > The OPEN_MAX constant is an arbitrary number with no useful relation to > anything. Nothing should be using it. This patch changes SCM_MAX_FD to > use NR_OPEN instead of OPEN_MAX. This increases the size of the struct > scm_fp_list type fourfold, to make it big enough to contain as many file > descriptors as could be asked of it. This size increase may not be very > worthwhile, but at any rate if an arbitrary limit unrelated to anything > else is being defined it should be done explicitly here with: > -#define SCM_MAX_FD (OPEN_MAX-1) > +#define SCM_MAX_FD (NR_OPEN-1) This is a bad idea. From linux/fs.h: #undef NR_OPEN #define NR_OPEN (1024*1024) /* Absolute upper limit on fd num */ There isn't anything I can see guaranteeing that net/scm.h is included before fs.h. This affects networking and should really be Cc'd to netdev@vger.kernel.org, which will raise the issue that if SCM_MAX_FD is raised, the resulting simple kmalloc() must be changed. That said, I doubt SCM_MAX_FD really needs to be raised, as applications using many file descriptors are unlikely to try to send their entire file table to another process in one go -- they have to handle the limits imposed by SCM_MAX_FD anyways. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
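To put a number on the kmalloc() concern: scm_fp_list embeds the descriptor array directly in the structure, so the illustration below (layout abridged from net/scm.h; the arithmetic assumes a 64-bit kernel) shows why bumping SCM_MAX_FD to NR_OPEN-1 breaks the existing allocation:

	/* Back-of-the-envelope only, not a proposed change. */
	struct scm_fp_list {
		int		count;
		struct file	*fp[SCM_MAX_FD];
	};
	/* With SCM_MAX_FD = NR_OPEN - 1 = 1048575, the fp[] array alone is
	 * roughly 1048575 * 8 bytes ~= 8MB on 64-bit, far larger than anything
	 * the plain kmalloc() used to copy it can be expected to satisfy. */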
Re: all syscalls initially taking 4usec on a P4? Re: nonblocking UDPv4 recvfrom() taking 4usec @ 3GHz?
On Tue, Feb 20, 2007 at 08:33:20PM +0100, bert hubert wrote: > I'm investigating this further for other system calls. It might be that my > measurements are off, but it appears even a slight delay between calls > incurs a large penalty. Make sure your system is idle. Userspace bloat means that *lots* of idle activity occurs in between timer ticks on recent distributions -- all those widgets polling the hardware to see if something changed or needs updating do a lot of damage to the caches. Try comparing a run under init=/bin/bash with one while logged in to a desktop environment to see just how painful it is. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Extensible hashing and RCU
On Mon, Feb 19, 2007 at 01:26:42PM -0500, Benjamin LaHaise wrote: > On Mon, Feb 19, 2007 at 07:13:07PM +0100, Eric Dumazet wrote: > > So even with a lazy hash function, 89 % of lookups are satisfied with less > > than 6 compares. > > Which sucks, as those are typically going to be cache misses (costing many > hundreds of cpu cycles). Hash chains fare very poorly under DoS conditions, > and must be removed under a heavy load. Worst case handling is very > important next to common case. I should clarify. Back of the napkin calculations show that there are only 157 cycles on a 3GHz processor in which to decide what happens to a packet, which means a single cache miss is more than enough to consume that budget. In theory we can get pretty close to line rate with quad core processors, but it definitely needs some of the features that newer chipsets have for stuffing packets directly into the cache. I would venture a guess that we also need to intelligently partition packets so that we make the most use of available cache resources. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
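To make the arithmetic explicit (the packet rate assumed behind the 157-cycle figure is not stated above, so the worked number below is only an illustration):

	/*
	 * The per-packet budget is simply:
	 *
	 *	cycles_per_packet = cpu_hz / packets_per_second
	 *
	 * e.g. 10GbE at line rate with minimum-sized frames is ~14.88M
	 * packets/sec (10^10 bits/s over 84 bytes on the wire per frame),
	 * so at 3GHz:
	 *
	 *	3,000,000,000 / 14,880,952 ~= 200 cycles per packet,
	 *
	 * and a single cache miss of a few hundred cycles already blows it.
	 */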
Re: Extensible hashing and RCU
On Mon, Feb 19, 2007 at 07:13:07PM +0100, Eric Dumazet wrote: > So even with a lazy hash function, 89 % of lookups are satisfied with less > than 6 compares. Which sucks, as those are typically going to be cache misses (costing many hundreds of cpu cycles). Hash chains fare very poorly under DoS conditions, and must be removed under a heavy load. Worst case handling is very important next to common case. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -mm 0/10][RFC] aio: make struct kiocb private
On Mon, Jan 15, 2007 at 08:25:15PM -0800, Nate Diller wrote: > the right thing to do from a design perspective. Hopefully it enables > a new architecture that can reduce context switches in I/O completion, > and reduce overhead. That's the real motive ;) And it's a broken motive. Context switches per se are not bad, as they make it possible to properly schedule code in a busy system (which is *very* important when realtime concerns come into play). Have a look at how things were done in the 2.4 aio code to see how completion would get done with a non-retry method, typically in interrupt context. I had code that did direct I/O rather differently by sharing code with the read/write code paths at some point, the catch being that it was pretty invasive, which meant that it never got merged with the changes to handle writeback pressure and other work that happened during 2.5. That said, you can't make kiocb private without completely removing the ability of the rest of the kernel to complete an aio sanely from irq context. You need some form of i/o descriptor, and a kiocb is just that. Adding more layering is just going to make things messier and slower for no real gain. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SKB BUG: Invalid truesize, current git
On Tue, Nov 07, 2006 at 02:57:24PM -0800, David Miller wrote: > > Since pskb_copy tacks on the non-linear bits from the original > > skb, it needs to count them in the truesize field of the new skb. > > > > Signed-off-by: Herbert Xu <[EMAIL PROTECTED]> > > Applied, thanks Herbert. This seems to work -- I haven't been able to trigger the message with the patch applied now. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
SKB BUG: Invalid truesize, current git
Hi all, I managed to get a backtrace for the Invalid truesize bug. The trigger is running LMbench2, but it's rater intermittent. Traffic should be going over the loopback interface, but the main nic on the machine is e1000. Let me know if anyone has any ideas for things to try. -ben Linux version 2.6.19-rc4 ([EMAIL PROTECTED]) (gcc version 4.1.1 20060525 (Red Hat 4.1.1-1)) #73 SMP Mon Nov 6 13:13:44 EST 2006 Command line: ro root=LABEL=/1 console=ttyS0,38400 BIOS-provided physical RAM map: BIOS-e820: - 0009cc00 (usable) BIOS-e820: 0009cc00 - 000a (reserved) BIOS-e820: 000cc000 - 000d (reserved) BIOS-e820: 000e4000 - 0010 (reserved) BIOS-e820: 0010 - bff6 (usable) BIOS-e820: bff6 - bff69000 (ACPI data) BIOS-e820: bff69000 - bff8 (ACPI NVS) BIOS-e820: bff8 - c000 (reserved) BIOS-e820: e000 - f000 (reserved) BIOS-e820: fec0 - fec1 (reserved) BIOS-e820: fee0 - fee01000 (reserved) BIOS-e820: ff00 - 0001 (reserved) BIOS-e820: 0001 - 00014000 (usable) Entering add_active_range(0, 0, 156) 0 entries of 256 used Entering add_active_range(0, 256, 786272) 1 entries of 256 used Entering add_active_range(0, 1048576, 1310720) 2 entries of 256 used end_pfn_map = 1310720 DMI present. ACPI: RSDP (v000 PTLTD ) @ 0x000f58d0 ACPI: RSDT (v001 PTLTDRSDT 0x0604 LTP 0x) @ 0xbff636df ACPI: FADT (v001 INTEL TUMWATER 0x0604 PTL 0x0003) @ 0xbff68e48 ACPI: MADT (v001 PTLTD APIC 0x0604 LTP 0x) @ 0xbff68ebc ACPI: MCFG (v001 PTLTDMCFG 0x0604 LTP 0x) @ 0xbff68f4c ACPI: BOOT (v001 PTLTD $SBFTBL$ 0x0604 LTP 0x0001) @ 0xbff68f88 ACPI: SPCR (v001 PTLTD $UCRTBL$ 0x0604 PTL 0x0001) @ 0xbff68fb0 ACPI: SSDT (v001 PmRefCpuPm 0x3000 INTL 0x20050228) @ 0xbff6371b ACPI: DSDT (v001 Intel BLAKFORD 0x0604 MSFT 0x010e) @ 0x Entering add_active_range(0, 0, 156) 0 entries of 256 used Entering add_active_range(0, 256, 786272) 1 entries of 256 used Entering add_active_range(0, 1048576, 1310720) 2 entries of 256 used Zone PFN ranges: DMA 0 -> 4096 DMA324096 -> 1048576 Normal1048576 -> 1310720 early_node_map[3] active PFN ranges 0:0 -> 156 0: 256 -> 786272 0: 1048576 -> 1310720 On node 0 totalpages: 1048316 DMA zone: 56 pages used for memmap DMA zone: 1395 pages reserved DMA zone: 2545 pages, LIFO batch:0 DMA32 zone: 14280 pages used for memmap DMA32 zone: 767896 pages, LIFO batch:31 Normal zone: 3584 pages used for memmap Normal zone: 258560 pages, LIFO batch:31 ACPI: Local APIC address 0xfee0 ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled) Processor #0 (Bootup-CPU) ACPI: LAPIC (acpi_id[0x01] lapic_id[0x06] enabled) Processor #6 ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled) Processor #1 ACPI: LAPIC (acpi_id[0x03] lapic_id[0x07] enabled) Processor #7 ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1]) ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1]) ACPI: LAPIC_NMI (acpi_id[0x02] high edge lint[0x1]) ACPI: LAPIC_NMI (acpi_id[0x03] high edge lint[0x1]) ACPI: IOAPIC (id[0x02] address[0xfec0] gsi_base[0]) IOAPIC[0]: apic_id 2, address 0xfec0, GSI 0-23 ACPI: IOAPIC (id[0x03] address[0xfec8] gsi_base[24]) IOAPIC[1]: apic_id 3, address 0xfec8, GSI 24-47 ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 high edge) ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level) ACPI: IRQ0 used by override. ACPI: IRQ2 used by override. ACPI: IRQ9 used by override. 
Setting APIC routing to flat Using ACPI (MADT) for SMP configuration information Nosave address range: 0009c000 - 0009d000 Nosave address range: 0009d000 - 000a Nosave address range: 000a - 000cc000 Nosave address range: 000cc000 - 000d Nosave address range: 000d - 000e4000 Nosave address range: 000e4000 - 0010 Nosave address range: bff6 - bff69000 Nosave address range: bff69000 - bff8 Nosave address range: bff8 - c000 Nosave address range: c000 - e000 Nosave address range: e000 - f000 Nosave address range: f000 - fec0 Nosave address range: fec0 - fec1 Nosave address range: fec1 - fee0 Nosave address range: fee0 - fee01000 Nosave address range: fee01000 - ff00 Nosave address range: ff00 - 0001 Allocati
Re: [PATCH?] tcp and delayed acks
On Wed, Aug 16, 2006 at 12:11:12PM -0700, Stephen Hemminger wrote: > > is throttled waiting for ACKs to arrive. The problem is exacerbated when > > the sender is using a small send buffer -- running netperf -C -c -- -s 1024 > > shows a miserable 420Kbit/s at essentially 0% CPU usage. Tests over gige > > are similarly constrained to a mere 96Mbit/s. > > What ethernet hardware? The defaults are often not big enough > for full speed on gigabit hardware. I need increase rmem/wmem to allow > for more buffering. This is for small transmit buffer sizes over either loopback or e1000. The artifact also shows up over localhost for somewhat larger buffer sizes, although it is much more difficult to get results that don't have large fluctuations because of other scheduling issues. Pinning the tasks to CPUs is on my list of things to try, but something in the multiple variants of sched_setaffinity() has resulted in it being broken in netperf. > The point of delayed ack's was to merge the response and the ack on > request/response > protocols like NFS or telnet. It does make sense to get it out sooner though. I would like to see what sort of effect this change has on higher-latency links. Ideally, quick ack mode should be doing the right thing, but it might need more input about the receiver's intent. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH?] tcp and delayed acks
Hello folks, In looking at a few benchmarks (especially netperf) run locally, it seems that tcp is unable to make full use of available CPU cycles as the sender is throttled waiting for ACKs to arrive. The problem is exacerbated when the sender is using a small send buffer -- running netperf -C -c -- -s 1024 shows a miserable 420Kbit/s at essentially 0% CPU usage. Tests over gige are similarly constrained to a mere 96Mbit/s. Since there is no way for the receiver to know if the sender is being blocked on transmit space, would it not make sense for the receiver to send out any delayed ACKs when it is clear that the receiving process is waiting for more data? The patch below attempts this (I make no guarantees of its correctness with respect to the rest of the delayed ack code). One point I'm still contemplating is what to do if the receiver is waiting in poll/select/epoll. [All tests run with maxcpus=1 on a 2.67GHz Woodcrest system.]

Recv   Send    Send                           Utilization        Service Demand
Socket Socket  Message  Elapsed               Send     Recv      Send    Recv
Size   Size    Size     Time     Throughput   local    remote    local   remote
bytes  bytes   bytes    secs.    10^6bits/s   % S      % S       us/KB   us/KB

Base (2.6.17-rc4): default send buffer size
netperf -C -c
87380  16384   16384    10.02    14127.79     99.90    99.90     0.579   0.579
87380  16384   16384    10.02    13875.28     99.90    99.90     0.590   0.590
87380  16384   16384    10.01    13777.25     99.90    99.90     0.594   0.594
87380  16384   16384    10.02    13796.31     99.90    99.90     0.593   0.593
87380  16384   16384    10.01    13801.97     99.90    99.90     0.593   0.593

netperf -C -c -- -s 1024
87380  2048    2048     10.02    0.43         -0.04    -0.04     -7.105  -7.377
87380  2048    2048     10.02    0.43         -0.01    -0.01     -2.337  -2.620
87380  2048    2048     10.02    0.43         -0.03    -0.03     -5.683  -5.940
87380  2048    2048     10.02    0.43         -0.05    -0.05     -9.373  -9.625
87380  2048    2048     10.02    0.43         -0.05    -0.05     -9.373  -9.625

from a remote system over gigabit ethernet
netperf -H woody -C -c
87380  16384   16384    10.03    936.23       19.32    20.47     3.382   1.791
87380  16384   16384    10.03    936.27       17.67    20.95     3.091   1.833
87380  16384   16384    10.03    936.17       19.18    20.77     3.356   1.817
87380  16384   16384    10.03    936.26       18.22    20.26     3.188   1.773
87380  16384   16384    10.03    936.26       17.35    20.54     3.036   1.797

netperf -H woody -C -c -- -s 1024
87380  2048    2048     10.00    95.72        10.04    6.64      17.188  5.683
87380  2048    2048     10.00    95.94        9.47     6.42      16.170  5.478
87380  2048    2048     10.00    96.83        9.62     5.72      16.283  4.840
87380  2048    2048     10.00    95.91        9.58     6.13      16.368  5.236
87380  2048    2048     10.00    95.91        9.58     6.13      16.368  5.236

Patched: default send buffer size
netperf -C -c
87380  16384   16384    10.01    13923.16     99.90    99.90     0.588   0.588
87380  16384   16384    10.01    13854.59     99.90    99.90     0.591   0.591
87380  16384   16384    10.02    13840.42     99.90    99.90     0.591   0.591
87380  16384   16384    10.01    13810.96     99.90    99.90     0.593   0.593
87380  16384   16384    10.01    13771.27     99.90    99.90     0.594   0.594

netperf -C -c -- -s 1024
87380  2048    2048     10.02    2473.48      99.90    99.90     3.309   3.309
87380  2048    2048     10.02    2421.46      99.90    99.90     3.380   3.380
87380  2048    2048     10.02    2288.07      99.90    99.90     3.577   3.577
87380  2048    2048     10.02    2405.41      99.90    99.90     3.402   3.402
87380  2048    2048     10.02    2284.41      99.90    99.90     3.582   3.582

netperf -H woody -C -c
87380  16384   16384    10.04    936.10       23.04    21.60     4.033   1.890
87380  16384   16384    10.03    936.20       18.52    21.06     3.242   1.843
87380  16384   16384    10.03    936.52       17.61    21.05     3.082   1.841
87380  16384   16384    10.03    936.18       18.24    20.73     3.191   1.814
87380  16384   16384    10.03    936.28       18.30    21.04     3.202   1.841

netperf -H woody -C -c -- -s 1024
87380  2048    2048     10.00    142.46       10.19    7.53      11.714  4.332
87380  2048    2048     10.00    147.28       9.73     7.93      10.829  4.412
87380  2048    2048     10.00    143.37       10.64    6.54      12.161  3.738
87380  2048    2048     10.00    146.41       9.18     7.43      10.277  4.158
87380  2048    2048     10.01    145.58       9.80     7.25      11.032  4.081

Comments/thoughts?
-ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 934396b..e554ceb 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -127
Re: [PATCH] [e1000]: Remove unnecessary tx_lock
On Wed, Aug 09, 2006 at 10:25:30AM +1000, Herbert Xu wrote: > The problem here is that the TX clean function does not take the lock > (nor do we want it to). It can thus come in while you're transmitting > and empty the queue. That can be solved with sequence numbers -- ie, we keep track of the number of packets queued to the hardware, as well as tracking the number of tx completions that have been reaped. Because those numbers are flushed to memory by other locks (barriers) taken during processing, the code in hard_start_xmit() and the actual tx completion reaping could be done without the atomic ops. The key thing is that the higher level code in the network stack calling dev->hard_start_xmit() would know that it needs to retry if the tx complete sequence changes. I'll ponder this a bit more and try to come up with some sample code. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
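As a sketch of the sequence-number idea only (names invented, not a tested patch):

	/* tx_queued is written only by the xmit path, tx_reaped only by the
	 * completion path; neither needs an atomic op or the tx lock. */
	struct tx_counters {
		unsigned int	size;		/* ring size */
		unsigned int	tx_queued;	/* packets handed to the hardware */
		unsigned int	tx_reaped;	/* completions processed */
	};

	/* xmit path: queue the packet, then report whether the ring looked full */
	static int xmit_one(struct tx_counters *tx, struct sk_buff *skb)
	{
		unsigned int reaped = tx->tx_reaped;	/* snapshot before queueing */

		queue_to_hardware(skb);			/* hypothetical helper */
		tx->tx_queued++;

		return (tx->tx_queued - reaped) >= tx->size;	/* "sent, but now full" */
	}

	/* The caller treats "full" as advisory: before actually stopping the
	 * queue it rechecks tx_reaped and retries if completions have advanced
	 * in the meantime, instead of relying on the netif_stop/wake atomic
	 * handshake. */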
Re: [PATCH] [e1000]: Remove unnecessary tx_lock
On Tue, Aug 08, 2006 at 03:06:07PM -0700, David Miller wrote: > The driver ->hard_start_xmit() method is invoked with the queue > unlocked, so this kind of scheme would not be workable. Looking at e1000, NETIF_F_LLTX is a waste -- the driver takes a spinlock almost immediately after entering. Taking the spinlock higher up in the network stack, preferably for multiple packets, would at least let the CPU deal with things sooner. tg3 doesn't use LLTX and thus holds the lock over hard_start_xmit(). bonding is atrocious and uses a read lock and unlock in many of the xmit routines. The only true lockless tx seems to be loopback. > While the queue is unlocked, another cpu could empty entries in the > TX queue making it not-full, which invalidates this "queue is now > full" return value the driver just gave. > > We don't do things the way we do it now for fun, it's just the most > reasonable scheme we've come up with given the locking constraints. Given that the vast majority of drivers are locked during their xmit routine, it is probably worth implementing at some point. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] [e1000]: Remove unnecessary tx_lock
On Fri, Aug 04, 2006 at 04:31:11PM -0700, David Miller wrote: > Yes, it's meant to catch unintented races. > > The queueing layer that calls ->hard_start_xmit() technically has no > need to support NETDEV_TX_BUSY as a return value, since the device > is able to prevent this. > > If we could avoid NETDEV_TX_BUSY et al. from happening, the idea is > that we could eliminate all of the requeueing logic in the packet > scheduler layer. It is a pipe dream from many years ago. Maybe the way NETDEV_TX_BUSY is implemented is wrong -- that is, it should be possible to return a result saying "we sent the packet, but the queue is now full". That would eliminate the atomic op that netif_queue_stop() performs and allow higher layers to merge the necessary barrier on the full case with the op on the tx lock. That way we could have lockless queue handling within the driver. This might need start/stop sequence counters, though. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: problems with e1000 and jumboframes
On Thu, Aug 03, 2006 at 04:49:15PM +0200, Krzysztof Oledzki wrote: > With 1 GB of RAM full 1GB/3GB (CONFIG_VMSPLIT_3G_OPT) seems to be > enough... Nope, you lose ~128MB of RAM for vmalloc space. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: problems with e1000 and jumboframes
On Thu, Aug 03, 2006 at 03:48:39PM +0200, Arnd Hannemann wrote: > However the box is a VIA Epia MII12000 with 1 GB of Ram and 1 GB of swap > enabled, so there should be plenty of memory available. HIGHMEM support > is off. The e1000 nic seems to be an 82540EM, which to my knowledge > should support jumboframes. > However I can't always reproduce this on a freshly booted system, so > someone else may be the culprit and leaking pages? > > Any ideas how to debug this? This is memory fragmentation, and all you can do is work around it until the e1000 driver is changed to split jumbo frames up on rx. Here are a few ideas that should improve things for you: - switch to a 2GB/2GB split to recover the memory lost to highmem (see Processor Type and Features / Memory split) - increase /proc/sys/vm/min_free_kbytes -- more free memory will improve the odds that enough unfragmented memory is available for incoming network packets I hope this helps. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 1/4] kevent: core files.
On Thu, Jul 27, 2006 at 02:44:50PM -0700, Zach Brown wrote: > > >>int kevent_getevents(int event_fd, struct ukevent *events, > >>int min_events, int max_events, > >>struct timeval *timeout); > > > > You've just reinvented io_getevents(). > > Well, that's certainly one inflammatory way to put it. I would describe > it as suggesting that the kevents collection interface not lose the > nicer properties of io_getevents(). Perhaps, but there seems to be a lot of talk about introducing new APIs where it isn't entirely clear that it is needed. Sorry if that sounded rather acerbic. > > What exactly are we getting from > > reinventing this (aside from breaking existing apps and creating more of > > an API mess)? > > A generic event collection interface that isn't so strongly bound to the > existing semantics of io_setup() and io_submit(). It can be a file > descriptor instead of a mysterious cookie/pointer to the mapped region, > to start. Things were like that at one point in time, but file descriptors turn out to introduce a huge gaping security hole with SUID programs. The problem is that any event context is closely tied to the address space of the thread issuing the syscalls, and file descriptors do not have this close binding. > Sure, so maybe we experiment with these things in the context of the > kevent patches and maybe merge them back into the AIO paths if in the > end that's the right thing to do. I see no problem with separating > development from the existing code. Excellent! > >> epoll and kevent both have the notion of an event type that always > >> creates an event at the time of the collection syscall while the event > >> source is on a ready list. Think of epoll calling ->poll(POLLOUT) for > >> an empty socket buffer at every sys_epoll_wait() call. We can't have > >> some source constantly spewing into the ring :/. We could fix this by > >> the API requiring that level events can *only* be collected through the > >> syscall interface. userspace could call into the collection syscall > >> every N events collected through the ring, say. N would be tuned to > >> amortize the syscall cost and still provide fairness or latency for the > >> level sources. I'd be fine with that, especially when it's hidden off > >> in glibc. > > > > This is exactly why I think level triggered events are nasty. It's > > impossible to do cleanly without requiring a syscall. > > I'm not convinced that it isn't possible to get a sufficiently clean > interface that involves the mix. My arguement is that this approach introduces a slow path into the heavily loaded server case. If you can show me how to avoid that, I'd be happy to see such an implementation. =-) > > As soon as you allow queueing events up in kernel space, it becomes > > necessary to do another syscall after pulling events out of the queue, > > which is a waste of CPU cycles when you're under heavy load (exactly the > > point at which you want the system to be its most efficient). > > If we've just consumed a full ring worth of events, and done real work > with them, I'm not convinced that an empty syscall is going to be that > painful. If we're really under load it might well return some newly > arrived events. It becomes a mix of ring completions and syscall > completions. Except that you're not usually pulling a full ring worth of events at a time, more often just one. 
One of the driving forces behind AIO use is in realtime apps where you don't want to eat occasional spikes in the latency of request processing; one just wants to handle the highest-priority event and then work on the next. By keeping each step small and manageable, the properties of the system are much easier to predict. Yes, batching can be helpful performance-wise, but it is somewhat opposite to the design criteria that need to be considered. The right way to cope with that may be to have two different modes of operation that trade off one way or the other on the batching question. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 1/4] kevent: core files.
On Thu, Jul 27, 2006 at 12:18:42PM -0700, Zach Brown wrote: > The easy part is fixing up the somewhat obfuscated collection call. > Instead of coming in through a multiplexer that magically treats a void > * as a struct kevent_user_control followed by N ukevents (as specified > in the kevent_user_control!) we'd turn it into a more explicit > collection syscall: > > int kevent_getevents(int event_fd, struct ukevent *events, > int min_events, int max_events, > struct timeval *timeout); You've just reinvented io_getevents(). What exactly are we getting from reinventing this (aside from breaking existing apps and creating more of an API mess)? > Say we have a ring of event structs. AIO has this today, but it sort of > gets it wrong because each event element doesn't specify whether it is > owned by the kernel or userspace. (It really gets it wrong because it > doesn't flush_dcache_page() after updating the ring via kmap(), but > never mind that! No one actually uses this mmap() AIO ring.) In AIO > today there is also a control struct mapped along with the ring that has > head and tail pointers. We don't want to bounce that cacheline around. > net/socket/af_packet.c gets this right with it's tp_status member of > tpacket_hdr. That could be rev'd in the mmap() ring buffer, as there are compat and incompat bits for changing the structure layout. As for bouncing the cacheline of head/tail around, I don't think it matters on real machines, as the multithreaded/SMP case will hit that cacheline bouncing if the user is sharing the event ring between multiple threads on multiple CPUs. The only way around that is to use multiple event rings, say one per node, at which point you have to do load balancing of io requests explicitely between queues (which might be worth it). > So, great, glibc can now find pending events very quickly if they're > waiting in the ring and can fall back to the collection syscall if it > wants to wait and the ring is empty. If it consumes events via the > syscall it increases its ring index by the number the syscall returned. > > There's two things we should address: level events and the notion of > only submitting as much as fits in the ring. > > epoll and kevent both have the notion of an event type that always > creates an event at the time of the collection syscall while the event > source is on a ready list. Think of epoll calling ->poll(POLLOUT) for > an empty socket buffer at every sys_epoll_wait() call. We can't have > some source constantly spewing into the ring :/. We could fix this by > the API requiring that level events can *only* be collected through the > syscall interface. userspace could call into the collection syscall > every N events collected through the ring, say. N would be tuned to > amortize the syscall cost and still provide fairness or latency for the > level sources. I'd be fine with that, especially when it's hidden off > in glibc. This is exactly why I think level triggered events are nasty. It's impossible to do cleanly without requiring a syscall. > Today AIO only allows submission of as many events as there are space in > the ring. It mostly does this so its completion can drop an event in > the ring from any context. If we back away from this so that we can > have long-lived source registration generate multiple edge events (and I > think we want to!), we have to be kind of careful. A source could > generate an event while the ring is full. 
The event could go in a list > but if userspace is collecting events in userspace the kernel won't be > told when there's space. We'd first have to check this ready list when > later events are generated so that pending events on the list aren't > overlooked. Userspace would also want to use the collection syscall as > the ring empties. Neither seem hard. As soon as you allow queueing events up in kernel space, it becomes necessary to do another syscall after pulling events out of the queue, which is a waste of CPU cycles when you're under heavy load (exactly the point at which you want the system to be its most efficient). Given that growing the ring buffer is easy enough to do, I'm not sure that the hit is worth it. At some point there has to be some form of flow control involved, and it is much better if it is explicitly obvious where this happens (as opposed to signal queues and our wonderful OOM handling). -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
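A sketch of the per-slot ownership idea being discussed, modelled loosely on af_packet's tp_status (the structure and names below are illustrative, not an existing API):

	/* Each slot carries its own ownership flag, so producer and consumer
	 * never share a head/tail cacheline. */
	struct event_slot {
		unsigned int	status;		/* 0: owned by kernel, 1: event ready */
		int		result;
		unsigned long	cookie;
	};

	/* kernel side: fill a slot it owns, then hand it to userspace */
	static void post_event(struct event_slot *slot, unsigned long cookie, int result)
	{
		slot->cookie = cookie;
		slot->result = result;
		smp_wmb();			/* payload must be visible before the flip */
		slot->status = 1;
	}
	/* Userspace consumes a slot with status == 1, reads the payload, issues
	 * a read barrier, and writes status back to 0 to return it to the kernel. */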
Re: [patch 2/3] drivers/net/ns83820.c: add paramter to disable autonegotiation
On Sun, Jun 25, 2006 at 01:44:36AM -0700, [EMAIL PROTECTED] wrote: > > From: Dan Faerch <[EMAIL PROTECTED]> > > Adds "ethtool command" support to driver. Initially 2 commands are > implemented: force fullduplex and toggle autoneg. This part is good, although doing something for copper cards needs doing, which probably means poking around to support the phy properly. > Also added a "disable_autoneg" module argument to completely disable > autoneg on all cards using this driver. This is the part I disagree with. Are you sure it isn't a bug in the link autonegotiation state machine for fibre cards? It should be defaulting to 1Gbit/full duplex if no autonegotiation is happening, and if it isn't then that should be fixed instead of papering over things with a config option. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [1/4] kevent: core files.
On Fri, Jun 23, 2006 at 01:54:23PM -0700, David Miller wrote: > From: Benjamin LaHaise <[EMAIL PROTECTED]> > Date: Fri, 23 Jun 2006 16:31:14 -0400 > > > Eh? Nobody has posted any numbers comparing the approaches yet, so this > > is pure handwaving, unless you have real concrete results? > > Evgeniy posts numbers and performance graphs on his kevent work all > the time. But you're arguing that the performance of something that hasn't been tested is worse simply by nature of it not having been tested. That's a fallacy of omission, iiuc. > Van Jacobson did in his LCA2006 net channel slides too, perhaps you > missed that. I have yet to be convinced that the layering violation known as net channels is the right way to go, mostly because it breaks horribly in a few cases -- think what happens during periods of CPU overcommit, in which case doing too much in interrupt context will kill a system (which is why softirqs are needed). The effect of doing all processing in user context creates issues with delayed acks (due to context switching to other tasks in the system), which will cause excess retransmits. The hard problems associated with packet filtering and security are also still unresolved, which is okay for a paper, but a concern in real life. There are also a number of performance flaws in the current stack that show up under profiling, some of which I posted fixes for, some of which have yet to be fixed. The pushf/popf pipeline stall was one of the bigger instances of CPU wastage that Van Jacobson noticed (it shows up as bottom halves using lots of CPU). Iirc, Ingo's real time patches may avoid that by way of reworking the irq disable/enable mechanism, which would mean the results need retesting. Using the cr8 register to enable/disable interrupts on x86-64 might also improve things, as that would eliminate the flags dependency of cli/sti... In short, there's a lot of work that still has to be done. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [1/4] kevent: core files.
On Sat, Jun 24, 2006 at 01:08:27AM +0400, Evgeniy Polyakov wrote: > On Fri, Jun 23, 2006 at 04:44:42PM -0400, Benjamin LaHaise ([EMAIL > PROTECTED]) wrote: > > > AIO completion approach was designed to be used with process context VFS > > > update. read/write approach can not cover other types of notifications, > > > like inode updates or timers. > > > > The completion event is 100% generic and does not need to come from process > > context. Calling aio_complete() from irq context is entirely valid. > > put_ioctx() can sleep. Err, no, that should definitely not be the case. If it can, someone has completely broken aio. > It is not syscall, but overall design should be analyzed. > It is possible to use existing ssycalls, kevent design does not care > about how it's data structures are delivered to the internal > "processor". Okay, that's good to hear. =-) -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [1/4] kevent: core files.
On Sat, Jun 24, 2006 at 12:17:17AM +0400, Evgeniy Polyakov wrote: > But now it is implemented as repeated call for the same work, which does > not look like it can be used for any other types of work. Given an iocb, you do not have to return -EIOCBRETRY, instead return -EIOCBQUEUED and then from whatever context do an aio_complete() with the result for that iocb. > And repeated work introduce latencies. > As far as I recall, it is you who wanted to remove thread based approach > from AIO subsystem. I have essentially given up on trying to get the filesystem AIO patches in given that the concerns against them are "woe complexity" with no real recourse for inclusion being open. If David is open to changes in the networking area, I'd love to see it built on top of your code. > AIO completion approach was designed to be used with process context VFS > update. read/write approach can not cover other types of notifications, > like inode updates or timers. The completion event is 100% generic and does not need to come from process context. Calling aio_complete() from irq context is entirely valid. > Format of the structure transferred between the objects does not matter > at all. We can create a wrapper on kevent structures or kevent can > transform data from AIO objects. > The main design goal of kevent is to provide easy connected hooks into > any state machine, which might be used by kernelspace to notify about > any kind of events without any knowledge of it's background nature. > Kevent can be used for example as notification blocks for address > changes or it can replace netlink completely (it can even emulate > event multicasting). > > Kevent is queue of events, which can be transferred from any object to > any destination. And io_getevents() reads a queue of events, so I'm not sure why you need a new syscall. > Not at all! > Kevent is a mechanism, which allows to impleement AIO, network AIO, poll > and select, timer control, adaptive readhead (as example of AIO VFS > update). All the code I present shows how to use kevent, it is not part > of the kevent. One can find Makefile in kevent dir to check what is the > core of the subsystem, which allows to be used as a transport for > events. > > AIO, NAIO, poll/select, socket and timer notifications are just users. > One can add it's own usage as easy as to call kevent_storage > initialization function and event generation function. All other pieces > are hidded in the implementation. I'll look at adapting your code to use the existing syscalls. Maybe code will be better at expressing my concerns. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
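As a rough illustration of that completion pattern (the device and helper names are invented; only -EIOCBQUEUED and aio_complete() are the real interfaces being discussed, and the method signatures assume a 2.6.17-era kernel):

	/* Queue the request and return -EIOCBQUEUED; the interrupt handler
	 * later reports the result with aio_complete() instead of re-driving
	 * the iocb through an -EIOCBRETRY loop. */
	static ssize_t mydev_aio_read(struct kiocb *iocb, char __user *buf,
				      size_t count, loff_t pos)
	{
		mydev_start_io(iocb, buf, count, pos);	/* hypothetical: hand to hardware */
		return -EIOCBQUEUED;
	}

	static irqreturn_t mydev_interrupt(int irq, void *dev_id, struct pt_regs *regs)
	{
		struct kiocb *iocb = mydev_completed_iocb(dev_id);	/* hypothetical */

		aio_complete(iocb, mydev_bytes_done(dev_id), 0);	/* from irq context */
		return IRQ_HANDLED;
	}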
Re: [1/4] kevent: core files.
On Fri, Jun 23, 2006 at 01:19:40PM -0700, David Miller wrote: > I completely agree with Evgeniy here. > > There is nothing in the kernel today that provides integrated event > handling. Nothing. So when someone says to use the "existing" stuff, > they need to have their head examined. The existing AIO events are *events*, with the syscalls providing the reading of events. > The existing AIO stuff stinks as a set of interfaces. It was designed > by a standards committee, not by people truly interested in a good > performing event processing design. It is especially poorly suited > for networking, and any networking developer understands this. I disagree. Stuffing an event that a read or write is complete/ready is a good way of handling things, even more so with hardware that will perform the memory copies to/from user buffers. > It is pretty much a foregone conclusion that we will need new > APIs to get good networking performance. Every existing interface > has one limitation or another. Eh? Nobody has posted any numbers comparing the approaches yet, so this is pure handwaving, unless you have real concrete results? > So we should be happy people like Evgeniy try to work on this stuff, > instead of discouraging them. I would like to encourage him, but at the same time I don't want to see creating APIs that essentially duplicate existing work and needlessly break compatibility. I completely agree that the in-kernel APIs are not as encompassing as they should be, and within the kernel Evgeniy's work may well be the way to go. What I do not agree is that we need new syscalls at this point. I'm perfectly willing to accept proof that change is needed if we do a proper comparision between any new syscall API and the use of the existing syscall API, but the pain of introducing a new API is sufficiently large that I think it is worth looking at the numbers. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [1/4] kevent: core files.
On Fri, Jun 23, 2006 at 11:24:29PM +0400, Evgeniy Polyakov wrote: > What API are you talking about? > There is only epoll(), which is 40% slower than kevent, and AIO, which > works not as state machine, but as repeated call for the same work. > There is also inotify, which allocates new message each time event > occurs, which is not a good solution for every situation. AIO can be implemented as a state machine. Nothing in the API stops you from doing that, and in fact there was code which was implemented as a state machine used on 2.4 kernels. > Linux just does not have unified event processing mechanism, which was > pointed to many times in AIO mail list and when epoll() was only > introduced. I would even say, that Linux does not have such mechanism at > all, since every potential user implements it's own, which can not be > used with others. The epoll event API doesn't have space in the event fields for result codes as needed for AIO. The AIO API does -- how is it lacking in this regard? > Kevent fixes that. Although implementation itself can be suboptimal for > some cases or even unacceptible at all, but it is really needed > functionality. At the expense of adding another API? How is this a good thing? Why not spit out events in the existing format? > Every existing notification can be built on top of kevent. One can find > how easy it was to implement generic poll/select notifications (what > epoll() does) or socket notifications (which are similar to epoll(), but > are called from inside socket state machine, thus improving processing > performance). So far your code is adding a lot without unifying anything. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [1/4] kevent: core files.
On Fri, Jun 23, 2006 at 11:09:34AM +0400, Evgeniy Polyakov wrote: > This patch includes core kevent files: > - userspace controlling > - kernelspace interfaces > - initialisation > - notification state machines We don't need yet another event mechanism in the kernel, so I don't see why the new syscalls should be added when they don't interoperate with existing solutions. If your results are enough to sway akpm that it is worth taking the patches, then it would make sense to merge the code with the already in-tree APIs. -ben - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 7/8] lock validator: fix ns83820.c irq-flags bug
> The above code snippet removes the nested unlock-irq, but now the code > is unbalanced, so IMO this patch _adds_ confusion. > > I think the conservative patch for 2.6.17 is the one I have attached. > Unless there are objections, that is what I will forward... This looks reasonable and sufficiently conservative. Reworking locking is something that I'm a bit more hesitant about, although folding misc_lock in with the other locks perhaps makes sense. I would like to keep the split between tx and tx completion, though. Also, any rework is going to need real testing, which is not something that a simple release cycle is likely to get enough coverage on. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: TSO and IPoIB performance degradation
On Mon, Mar 20, 2006 at 02:04:07PM +0200, Michael S. Tsirkin wrote: > does not stretch ACKs anymore. RFC 2581 does mention that it might be OK to > stretch ACKs "after careful consideration", and we are seeing that it helps > IP over InfiniBand, so recent Linux kernels perform worse in that respect. > > And since there does not seem to be a way to figure it out automagically when > doing this is a good idea, I proposed adding some kind of knob that will let > the > user apply the consideration for us. Wouldn't it make sense to stretch the ACK when the previous ACK is still in the TX queue of the device? I know that sort of behaviour was always an issue on modem links where you don't want to send out redundant ACKs. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scm: fold __scm_send() into scm_send()
On Mon, Mar 13, 2006 at 09:05:31PM +0100, Ingo Oeser wrote: > From: Ingo Oeser <[EMAIL PROTECTED]> > > Fold __scm_send() into scm_send() and remove that interface completly > from the kernel. Whoa, what are you doing here? Uninlining scm_send() is a Bad Thing to do given that scm_send() is in the AF_UNIX hot path. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/PATCH] rcuification of ipv4 established and timewait connections
On Thu, Mar 09, 2006 at 01:12:20PM -0800, David S. Miller wrote: > Once we have RCU in place for the TCP hash tables, we have the chance > to code up dynamically sized hash tables. With the current locking, > this is basically impossible, with RCU it can be done. Nice! > So Ben can you work to figure out what the bind(0.0.0.0) > problem was? Once that is fully resolved, I think I'll apply > your RCU patch to net-2.6.17. I think I know what the problem is. Basically, sockets from the listen table can have their refcount leaked in the do_time_wait chunk of code. I'll send out a patch that reworks that area and also adds one thing I missed, which is that sock_hold() in an rcu section needs to be an atomic_inc_not_zero, or the socket can be leaked (resulting in exactly that sort of failure when a listening socket is destroyed). -ben - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
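A sketch of the lookup-side rule being described (the helper name is invented): under rcu_read_lock(), a bare sock_hold() can resurrect a socket whose refcount has already dropped to zero, so the reference has to be taken conditionally.

	static struct sock *hold_found_sock(struct sock *sk)
	{
		if (sk && !atomic_inc_not_zero(&sk->sk_refcnt))
			sk = NULL;		/* lost the race with socket teardown */
		return sk;
	}
	/* The caller, still inside rcu_read_lock(), must then re-check the hash
	 * keys: the slot may have been reused before the reference was taken,
	 * in which case sock_put() and retry the lookup. */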
Re: [RFC/PATCH] rcuification of ipv4 established and timewait connections
On Thu, Mar 09, 2006 at 07:25:25PM +0100, Eric Dumazet wrote: > On a second thought, do you think we still need one rwlock per hash chain ? > > TCP established hash table entries: 1048576 (order: 12, 16777216 bytes) > > On this x86_64 machine, we 'waste' 8 MB of ram for those rwlocks. > > With RCU, we touch these rwlocks only on TCP connection creation/deletion, > maybe we could reduce to one rwlock or a hashed array of 2^N rwlocks (2^N > depending on NR_CPUS), like in net/ipv4/route.c ? Hrm, maybe use cmpxchg? That gets rid of the lock and automatically provides us with hashed spinlocks on archs without a cmpxchg implementation. -ben
Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh
On Thu, Mar 09, 2006 at 07:41:08PM +1100, Nick Piggin wrote: > Considering that local_t has been broken so that basically nobody > is using it, now is a great time to rethink the types before it > gets fixed and people start using it. I'm starting to get more concerned, as the per-cpu changes keep adding more overhead and it's getting out of hand. > And modelling the type on the atomic types would make the most > sense because everyone already knows them. Except that the usage models are different; local_t is most likely to be used for cpu local statistics or for sequences, where making them signed is a bit backwards. -ben
Re: [RFC/PATCH] rcuification of ipv4 established and timewait connections
On Thu, Mar 09, 2006 at 01:18:26PM +0300, Evgeniy Polyakov wrote: > Ok, I hacked quite a bit in the patch, but I think nothing major was > changed, basically patch rejects. > And I'm now unable to bind to 0.0.0.0 address, i.e. bind() does not > fail, but all connections are refused. > Bind to machine's IP works fine. Odd. Can you fire off a copy of your backport to me? I'll see if I can spot any quirks. bind to 0.0.0.0 seems to work for me (ssh, portmap and a few test programs worked), but maybe something different is being done in your case. -ben - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh
On Wed, Mar 08, 2006 at 02:25:28PM -0800, Ravikiran G Thirumalai wrote: > Then, for the batched percpu_counters, we could gain by using local_t only > for > the UP case. But we will have to have a new local_long_t implementation > for that. Do you think just one use case of local_long_t warrants for a new > set of apis? I think it may make more sense to simply convert local_t into a long, given that most of the users will be things like stats counters. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh
On Wed, Mar 08, 2006 at 01:07:26PM -0800, Ravikiran G Thirumalai wrote: > But on non x86, local_bh_disable() is gonna be cheaper than a cli/atomic op > no? > (Even if they were switched over to do local_irq_save() and > local_irq_restore() from atomic_t's that is). It's still more expensive than local_t. > And if we use local_t, we will add the overhead for the non bh > percpu_counter_mod for non x86 arches. Last time I checked, all the major architectures had efficient local_t implementations. Most of the RISC CPUs are able to do a load-linked/store-conditional implementation that is the same cost (since memory barriers tend to be explicit on powerpc). So why not use it? -ben
Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh
On Wed, Mar 08, 2006 at 12:26:56PM -0800, Ravikiran G Thirumalai wrote: > +static inline void percpu_counter_mod_bh(struct percpu_counter *fbc, long > amount) > +{ > + local_bh_disable(); > + fbc->count += amount; > + local_bh_enable(); > +} Please use local_t instead, then you don't have to do the local_bh_disable() and enable() and the whole thing collapses down into 1 instruction on x86. -ben
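A minimal sketch of the local_t variant being suggested, assuming a hypothetical lcounters field (a per-CPU array of local_t set up with alloc_percpu()) in place of the existing long-based counters:

	static inline void percpu_counter_mod_fast(struct percpu_counter *fbc,
						   long amount)
	{
		int cpu = get_cpu();

		/* local_add() is a single add instruction on x86/x86-64 and is
		 * already safe against interrupts and softirqs on this CPU, so
		 * neither local_bh_disable() nor a lock prefix is needed. */
		local_add(amount, per_cpu_ptr(fbc->lcounters, cpu));
		put_cpu();
	}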
Re: [RFC/PATCH] rcuification of ipv4 established and timewait connections
On Wed, Mar 08, 2006 at 02:01:04PM +0300, Evgeniy Polyakov wrote: > When I tested RCU for similar change for kevent, but postponed more work > to RCU callback, including socket closing and some attempts to inode > dereferencing, such change forced performance degradation for httperf > benchmark and trivial web server from 2600 requests/sec down to 2200 > with 4.1K index page and 1500 MTU. > > I still have this testbed and will gladly test this change tomorrow > if it applies cleanly to 2.6.15-rc7? I think it should, the code it touches isn't that heavily changed. The results should be most interesting. It might be useful to compare SMP and non-SMP cases as the difference in locks may show up in the test. -ben - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] x86-64, use page->virtual to get 64 byte struct page
On Wed, Mar 08, 2006 at 10:40:38AM +0100, Andi Kleen wrote: > On Wednesday 08 March 2006 03:38, Benjamin LaHaise wrote: > > > It's hardly that uncommon for pages to cross cachelines or for pages to > > move around CPUs with networking. > > Data? I posted a workload that shows this. Anything that transmits over TCP on a CPU other than the one the interrupt occurs on is going to hit that behaviour for 7/8 of pages. > > Please name some sort of benchmarks that show your concerns for decreased > > performance. > > Anything that manipulates lots of data. LMBench certainly doesn't look any worse. In fact, things look slightly better. (2.6.16n is with WANT_PAGE_VIRTUAL, while 2.6.16O is without). cd results && make summary percent 2>/dev/null | more make[1]: Entering directory `/md0/root/LMbench2/results' L M B E N C H 2 . 0 S U M M A R Y Basic system parameters Host OS Description Mhz - - --- cobra.kva Linux 2.6.16nx86_64-linux-gnu 2997 cobra.kva Linux 2.6.16nx86_64-linux-gnu 2997 cobra.kva Linux 2.6.16nx86_64-linux-gnu 2997 cobra.kva Linux 2.6.16nx86_64-linux-gnu 2997 cobra.kva Linux 2.6.16nx86_64-linux-gnu 2997 cobra.kva Linux 2.6.16Ox86_64-linux-gnu 2997 cobra.kva Linux 2.6.16Ox86_64-linux-gnu 2997 cobra.kva Linux 2.6.16Ox86_64-linux-gnu 2997 cobra.kva Linux 2.6.16Ox86_64-linux-gnu 2997 cobra.kva Linux 2.6.16Ox86_64-linux-gnu 2997 Processor, Processes - times in microseconds - smaller is better Host OS Mhz null null open selct sig sig fork exec sh call I/O stat clos TCP inst hndl proc proc proc - - - cobra.kva Linux 2.6.16n 2997 0.23 0.30 1.45 2.70 8.974 0.35 3.19 111. 443. 1674 cobra.kva Linux 2.6.16n 2997 0.23 0.30 1.46 2.81 8.984 0.35 3.19 116. 443. 1679 cobra.kva Linux 2.6.16n 2997 0.23 0.30 1.45 2.74 9.761 0.35 3.17 112. 446. 1683 cobra.kva Linux 2.6.16n 2997 0.23 0.30 1.43 2.82 8.992 0.34 3.20 113. 444. 1683 cobra.kva Linux 2.6.16n 2997 0.23 0.30 1.45 2.83 8.975 0.35 3.18 112. 445. 1673 cobra.kva Linux 2.6.16O 2997 0.23 0.30 1.44 2.74 9.008 0.36 3.17 110. 443. 1674 cobra.kva Linux 2.6.16O 2997 0.23 0.30 1.45 2.75 9.003 0.37 3.19 112. 445. 1681 cobra.kva Linux 2.6.16O 2997 0.23 0.30 1.45 2.80 9.010 0.37 3.17 112. 452. 1689 cobra.kva Linux 2.6.16O 2997 0.23 0.30 1.45 2.75 9.007 0.36 3.15 112. 453. 1689 cobra.kva Linux 2.6.16O 2997 0.23 0.30 1.44 2.77 9.021 0.36 3.15 113. 448. 
1686 Context switching - times in microseconds - smaller is better - Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw - - - -- -- -- -- --- --- cobra.kva Linux 2.6.16n 2.360 3.7500 6.6100 4.6600 10.1 5.215.3 cobra.kva Linux 2.6.16n 2.420 3.7600 6.6500 4.8500 10.5 5.4100015.0 cobra.kva Linux 2.6.16n 2.400 3.7600 6.5900 4.7000 9.6600 5.3700015.0 cobra.kva Linux 2.6.16n 2.400 3.7600 6.5600 4.6300 9.6100 5.8100015.0 cobra.kva Linux 2.6.16n 2.420 3.8200 6.5800 4.8400 10.7 5.4700014.7 cobra.kva Linux 2.6.16O 2.430 4.4800 7.2100 4.8500 10.4 5.9100015.8 cobra.kva Linux 2.6.16O 2.460 4.3100 7.2400 4.9000 10.7 5.4200015.9 cobra.kva Linux 2.6.16O 2.450 4.4800 7.2800 4.7000 10.1 5.215.9 cobra.kva Linux 2.6.16O 2.460 4.3900 6.9900 4.7200 9.7400 5.6500015.2 cobra.kva Linux 2.6.16O 2.450 4.4700 6.9400 4.8100 8.7100 5.5200015.5 *Local* Communication latencies in microseconds - smaller is better --- Host OS 2p/0K Pipe AF UDP RPC/ TCP RPC/ TCP ctxsw UNIX UDP TCP conn - - - - - - - - cobra.kva Linux 2.6.16n 2.360 8.228 12.8 12.7 21.7 17.6 29.0 43.6 cobra.kva Linux 2.6.16n 2.420 8.192 13.1 12.7 22.1 17.6 29.2 44.0 cobra.kva Linux 2.6.16n 2.400 8.163 13.0 12.7 22.4 17.6 29.3 43.9 cobra.kva Linux 2.6.16n 2.400 8.163 11.7 12.7 21.9 17.5 29.0 44.4 cobra.kva Linux 2.6.16n 2.420 8.211 13.1 12.7 22.3 17.6 29.5 44.4 cobra.kva Linux 2.6.16O 2.430 8.273 13.1 14.4 23.8 17.7 29.7 43.9 cobra.kva Linux 2.6.16O 2.460 8.284 10.6 14.2 24.1 17.7 29.6 43.8 cobra.kva Linux 2.6.16O 2.450 8.454 13.5 14.3 24.1 17.7 29.8 43.9 cobra.kva Linux 2.6.16O 2.460 8.245 10.7 14.2 24.3 17.7
Re: [PATCH] x86-64, use page->virtual to get 64 byte struct page
On Tue, Mar 07, 2006 at 07:50:52PM +0100, Andi Kleen wrote: > > My vmlinux has > > 80278382 : > 80278382: 8b 0d 78 ea 41 00 mov4319864(%rip),%ecx ># 80696e00 > 80278388: 48 89 f8mov%rdi,%rax > 8027838b: 48 c1 e0 0c shl$0xc,%rax > 8027838f: 48 d3 e8shr%cl,%rax > 80278392: 48 0f b6 80 00 5e 69movzbq > 0x80695e00(%rax),%rax > 80278399: 80 > 8027839a: 48 8b 14 c5 40 93 71mov > 0x80719340(,%rax,8),%rdx > 802783a1: 80 > 802783a2: 48 2b ba 40 36 00 00sub0x3640(%rdx),%rdi > 802783a9: 48 6b c7 38 imul $0x38,%rdi,%rax > 802783ad: 48 03 82 30 36 00 00add0x3630(%rdx),%rax > 802783b4: c3 retq That's easily in the 90+ cycles range as you've got 3 data dependant loads which will hit in the L2, but likely not in the L1 given that the workload is manipulating lots of data. Assuming the instruction scheduler gets things right. > 802783b5 : > 802783b5: 48 8b 07mov(%rdi),%rax > 802783b8: 48 c1 e8 38 shr$0x38,%rax > 802783bc: 48 8b 14 c5 80 97 71mov > 0x80719780(,%rax,8),%rdx > 802783c3: 80 > 802783c4: 48 b8 b7 6d db b6 6dmov > $0x6db6db6db6db6db7,%rax > 802783cb: db b6 6d > 802783ce: 48 2b ba 20 03 00 00sub0x320(%rdx),%rdi > 802783d5: 48 c1 ff 03 sar$0x3,%rdi > 802783d9: 48 0f af f8 imul %rax,%rdi > 802783dd: 48 03 ba 28 03 00 00add0x328(%rdx),%rdi > 802783e4: 48 89 f8mov%rdi,%rax > 802783e7: c3 retq > > > Both look quite optimized to me. I haven't timed them but it would surprise > me > if P4 needed more than 20 cycles to crunch through each of them. It's more than that because you've got the data dependancies on the load. Yes, imul is 10 cycles, but shift is 1. > Where is that idiv exactly? I don't see it. My memory seems to be failing me, I can't find it. Whoops. > Only in pathological workloads. Normally the working set is so large > that the probability of two pages are near each other is very small. It's hardly that uncommon for pages to cross cachelines or for pages to move around CPUs with networking. Remember that we're using pages for the data buffers in networking, so you'll have pages get freed on the wrong CPU quite often. Please name some sort of benchmarks that show your concerns for decreased performance. I've shown you one that gets improved, and I think the pages not overlapping cachelines is only a good thing. I know these things look like piddly little worthless optimizations, but they add up big time. Mea culpa for not having a 10Gbit nic to show more "real world" applications. -ben - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] x86-64, use page->virtual to get 64 byte struct page
On Tue, Mar 07, 2006 at 05:27:37PM +0100, Andi Kleen wrote: > On Wednesday 08 March 2006 00:26, Benjamin LaHaise wrote: > > Hi Andi, > > > > On x86-64 one inefficiency that shows up on profiles is the handling of > > struct page conversion to/from idx and addresses. This is mostly due to > > the fact that struct page is currently 56 bytes on x86-64, so gcc has to > > emit a slow division or multiplication to convert. > > Huh? You used an unsigned long, but ptrdiff_t is signed. gcc cannot use any shifting tricks because they round incorrectly in the signed case. > AFAIK mul has a latency of < 10 cycles even on P4 so I can't imagine > it's a real problem. Something must be wrong with your measurements. mul isn't particularly interesting in the profiles, it's the idiv. > My guess would be that on more macro loads it would be a loss due > to more cache misses. But you get less false sharing of struct page on SMP as well. With a 56 byte page a single struct page can overlap two cachelines, and on this workload the page definitely gets transferred from one CPU to the other. -ben
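For illustration only (the mem_map arithmetic written out by hand, not the real pfn_to_page()/page_to_pfn() macros), this is the difference the struct page size makes to the conversions:

	/* 56 byte struct page: index -> page needs a multiply by 56, and
	 * page -> index divides a signed ptrdiff_t by 56, which at best is a
	 * multiply by a magic reciprocal -- the imul sequences visible in the
	 * disassembly quoted earlier in this thread. */
	static inline struct page *pfn_to_page_56(unsigned long pfn)
	{
		return (struct page *)((char *)mem_map + pfn * 56);
	}

	/* 64 byte struct page: both directions collapse to a 6 bit shift. */
	static inline struct page *pfn_to_page_64(unsigned long pfn)
	{
		return (struct page *)((char *)mem_map + (pfn << 6));
	}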
[PATCH] x86-64, use page->virtual to get 64 byte struct page
Hi Andi, On x86-64 one inefficiency that shows up on profiles is the handling of struct page conversion to/from idx and addresses. This is mostly due to the fact that struct page is currently 56 bytes on x86-64, so gcc has to emit a slow division or multiplication to convert. By switching to using WANT_PAGE_VIRTUAL in asm/page.h, struct page grows to 64 bytes. Address calculation becomes cheaper because it is a memory load from the already hot struct page. For netperf, this shows up as a ~150 Mbit/s improvement. FYI, I also tested with the 64 byte page but no WANT_PAGE_VIRTUAL, but it wasn't quite as much of an improvement on the P4. Before: 87380 16384 1638410.01 9516.38 92.2692.261.588 1.588 87380 16384 1638410.01 9459.36 70.4370.431.220 1.220 87380 16384 1638410.00 9509.38 67.5567.551.164 1.164 87380 16384 1638410.00 9432.95 68.9168.911.197 1.197 87380 16384 1638410.01 9479.88 69.3269.321.198 1.198 87380 16384 1638410.01 9466.74 70.3770.371.218 1.218 87380 16384 1638410.01 9493.93 69.1869.181.194 1.194 87380 16384 1638410.00 9486.44 71.0671.061.227 1.227 87380 16384 1638410.01 9477.95 70.6770.671.222 1.222 87380 16384 1638410.00 9477.33 70.2070.201.214 1.214 After: 87380 16384 1638410.01 9629.42 92.0192.011.565 1.565 87380 16384 1638410.01 9641.69 90.1690.161.532 1.532 87380 16384 1638410.01 9650.40 90.1690.161.531 1.531 87380 16384 1638410.00 9638.69 90.6090.601.540 1.540 87380 16384 1638410.01 9667.15 89.3689.361.514 1.514 87380 16384 1638410.01 9684.13 89.8689.861.520 1.520 87380 16384 1638410.01 9642.38 90.3190.311.534 1.534 87380 16384 1638410.00 9669.24 90.9090.901.540 1.540 87380 16384 1638410.00 9676.82 90.2590.251.528 1.528 87380 16384 1638410.00 9711.26 90.8090.801.532 1.532 -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. Signed-off-by: Benjamin LaHaise <[EMAIL PROTECTED]> diff --git a/include/asm-x86_64/page.h b/include/asm-x86_64/page.h index 615e3e4..2c16c62 100644 --- a/include/asm-x86_64/page.h +++ b/include/asm-x86_64/page.h @@ -3,6 +3,8 @@ #include +#define WANT_PAGE_VIRTUAL 1 + /* PAGE_SHIFT determines the page size */ #define PAGE_SHIFT 12 #ifdef __ASSEMBLY__ - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
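The mechanism, roughly as it reads in include/linux/mm.h (paraphrased; the HASHED_PAGE_VIRTUAL variant is omitted): with WANT_PAGE_VIRTUAL defined, struct page carries the mapped address and page_address() becomes a single load from the already hot struct page instead of address arithmetic:

	#if defined(WANT_PAGE_VIRTUAL)
	#define page_address(page)	((page)->virtual)
	#else
	#define page_address(page)	lowmem_page_address(page)
	#endif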
[PATCH] use wait queue spinlock for the socket spinlock
Hi Dave et al, The patch below merges the use of the wait queue lock and socket spinlock into one. This gains us ~100-150Mbit/s on netperf, mostly due to the fact that because we know how the spinlock is used, we can avoid the whole irq save, disable and reenable sequence since the spinlock only needs bh protection. As a bonus, this also removes one atomic operation per wakeup. [This is with x86-64 task switching tweaks as otherwise it is in the noise.] Before: 87380 16384 1638410.01 9501.56 90.6690.661.563 1.563 87380 16384 1638410.01 9476.08 93.4093.401.615 1.615 87380 16384 1638410.00 9473.65 80.6680.661.395 1.395 87380 16384 1638410.00 9525.82 80.4180.411.383 1.383 87380 16384 1638410.00 9523.49 80.7180.711.388 1.388 87380 16384 1638410.00 9430.09 80.0180.011.390 1.390 87380 16384 1638410.00 9469.60 80.7180.711.396 1.396 87380 16384 1638410.01 9517.88 79.3279.321.365 1.365 87380 16384 1638410.01 9512.30 80.3180.311.383 1.383 87380 16384 1638410.00 9453.69 80.9080.901.402 1.402 After: 87380 16384 1638410.01 9629.42 92.0192.011.565 1.565 87380 16384 1638410.01 9641.69 90.1690.161.532 1.532 87380 16384 1638410.01 9650.40 90.1690.161.531 1.531 87380 16384 1638410.00 9638.69 90.6090.601.540 1.540 87380 16384 1638410.01 9667.15 89.3689.361.514 1.514 87380 16384 1638410.01 9684.13 89.8689.861.520 1.520 87380 16384 1638410.01 9642.38 90.3190.311.534 1.534 87380 16384 1638410.00 9669.24 90.9090.901.540 1.540 87380 16384 1638410.00 9676.82 90.2590.251.528 1.528 87380 16384 1638410.00 9711.26 90.8090.801.532 1.532 -ben Signed-off-by: Benjamin LaHaise <[EMAIL PROTECTED]> diff --git a/include/net/sock.h b/include/net/sock.h index 57e5f6b..a864d32 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -76,13 +76,12 @@ */ struct sock_iocb; typedef struct { - spinlock_t slock; struct sock_iocb*owner; wait_queue_head_t wq; } socket_lock_t; #define sock_lock_init(__sk) \ -do { spin_lock_init(&((__sk)->sk_lock.slock)); \ +do { \ (__sk)->sk_lock.owner = NULL; \ init_waitqueue_head(&((__sk)->sk_lock.wq)); \ } while(0) @@ -733,8 +732,8 @@ extern void FASTCALL(lock_sock(struct so extern void FASTCALL(release_sock(struct sock *sk)); /* BH context may only use the following locking interface. 
*/ -#define bh_lock_sock(__sk) spin_lock(&((__sk)->sk_lock.slock)) -#define bh_unlock_sock(__sk) spin_unlock(&((__sk)->sk_lock.slock)) +#define bh_lock_sock(__sk) spin_lock(&((__sk)->sk_lock.wq.lock)) +#define bh_unlock_sock(__sk) spin_unlock(&((__sk)->sk_lock.wq.lock)) extern struct sock *sk_alloc(int family, gfp_t priority, diff --git a/net/core/filter.c b/net/core/filter.c index 93fbd01..69e4636 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -421,10 +421,10 @@ int sk_attach_filter(struct sock_fprog * if (!err) { struct sk_filter *old_fp; - spin_lock_bh(&sk->sk_lock.slock); + spin_lock_bh(&sk->sk_lock.wq.lock); old_fp = sk->sk_filter; sk->sk_filter = fp; - spin_unlock_bh(&sk->sk_lock.slock); + spin_unlock_bh(&sk->sk_lock.wq.lock); fp = old_fp; } diff --git a/net/core/sock.c b/net/core/sock.c index f152783..96008af 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -445,15 +445,15 @@ set_rcvbuf: break; case SO_DETACH_FILTER: - spin_lock_bh(&sk->sk_lock.slock); + spin_lock_bh(&sk->sk_lock.wq.lock); filter = sk->sk_filter; if (filter) { sk->sk_filter = NULL; - spin_unlock_bh(&sk->sk_lock.slock); + spin_unlock_bh(&sk->sk_lock.wq.lock); sk_filter_release(sk, filter); break; } - spin_unlock_bh(&sk->sk_lock.slock); + spin_unlock_bh(&sk->sk_lock.wq.lock); ret = -ENONET; break; @@ -1031,20 +1031,25 @@ struct sk_buff *sock_alloc_send_skb(stru return sock_alloc_send_pskb(sk, size, 0, no
[EXPERIMENTAL] HT aware loopback device (hack, x86-64 only atm)
Hi folks, I'd like to start some discussions on SMP optimizations for the networking stack. The patch below is one such example which changes the loopback device in a way that helps out on workloads like netperf by trying to share more work with the other CPU on an HT system. Basically, if the other CPU is idle, we punt the netif_rx onto the other CPU. Using a kernel thread for this is fairly inefficient, so I am wondering if it makes sense to do it on the softirq level. This particular patch improves netperf over localhost by ~600Mbit/s (from ~9874Mbit/s to ~10475Mbit/s while raising %CPU usage from ~92% to ~95%, although it varies quite a bit) on a 3GHz P4 with HT (this is with a pile of other patches to optimize task switching on x86-64). The bigger part of the discussion is probably a question of how we can make the network stack scale with multicore CPUs. For workloads like routing lots of small packets, a single CPU can be easily overwhelmed. The question becomes where does partitioning the work make sense? At the very least we probably need to do some preprocessing of incoming packets so that a series of packets destined for a particular flow end up on the same CPU. This sort of preprocessing probably makes sense for other reasons: by processing a group of packets for a particular socket in one go, we can avoid the overhead of locking and unlocking the socket repeatedly (which is pretty expensive due to the memory barrier nature of locks). At this point I'd just like to stir up some discussion, so please comment away with any ideas and concerns. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c index 690a1aa..ef283a3 100644 --- a/drivers/net/loopback.c +++ b/drivers/net/loopback.c @@ -58,6 +58,69 @@ #include #include +#define LOOP_NR_SKBS 256 +static struct sk_buff *loop_skb_ring[LOOP_NR_SKBS]; +static unsigned loop_skb_head, loop_skb_tail; +static struct task_struct *loop_task; + +static void smp_loop_netif_rx(struct sk_buff *skb) +{ + unsigned int next = (loop_skb_head + 1) % LOOP_NR_SKBS; + + if (next == loop_skb_tail) { + dev_kfree_skb(skb); + return; + } + + loop_skb_ring[loop_skb_head] = skb; + wmb(); + loop_skb_head = next; +} + +static void smp_loop_wake(void) +{ + if (loop_task && loop_task->state != TASK_RUNNING) + wake_up_process(loop_task); +} + +static int loop_netif_rx_thread(void *data) +{ + loop_task = current; + + for (;;) { + int nr = 0; + while (loop_skb_tail != loop_skb_head) { + unsigned next; + struct sk_buff *skb = loop_skb_ring[loop_skb_tail]; + loop_skb_ring[loop_skb_tail] = NULL; + next = (loop_skb_tail + 1) % LOOP_NR_SKBS; + barrier(); + loop_skb_tail = next; + netif_rx(skb); + if (nr++ >= 96) { + do_softirq(); + nr = 0; + } + } + + do_softirq(); + + set_current_state(TASK_INTERRUPTIBLE); + if (loop_skb_tail == loop_skb_head) + schedule(); + set_current_state(TASK_RUNNING); + } +} + +static inline int sibling_is_idle(void) +{ + int cpu = smp_processor_id() ^ 1; + struct x8664_pda *pda = cpu_pda(cpu); + if (pda->pcurrent == idle_task(cpu) || pda->pcurrent == loop_task) + return 1; + return 0; +} + static DEFINE_PER_CPU(struct net_device_stats, loopback_stats); #define LOOPBACK_OVERHEAD (128 + MAX_HEADER + 16 + 16) @@ -69,6 +132,7 @@ static DEFINE_PER_CPU(struct net_device_ */ #ifdef LOOPBACK_TSO +static void smp_loop_netif_rx(struct sk_buff *skb); static void emulate_large_send_offload(struct sk_buff *skb) { struct iphdr *iph = 
skb->nh.iph; @@ -76,6 +140,7 @@ static void emulate_large_send_offload(s unsigned int doffset = (iph->ihl + th->doff) * 4; unsigned int mtu = skb_shinfo(skb)->tso_size + doffset; unsigned int offset = 0; + int use_sibling = sibling_is_idle(); u32 seq = ntohl(th->seq); u16 id = ntohs(iph->id); @@ -112,12 +177,21 @@ static void emulate_large_send_offload(s th->seq = htonl(seq); if (offset + doffset + frag_size < skb->len) th->fin = th->psh = 0; +#ifdef CONFIG_SMP + if (use_sibling) + smp_loop_netif_rx(nskb); + else + netif_rx(nskb); +#else netif_rx(nskb); +#endif offset += frag_size; seq += frag_size; id++;
Re: [PATCH] avoid atomic op on page free
On Tue, Mar 07, 2006 at 01:04:36PM +1100, Nick Piggin wrote: > I'd say it will turn out to be more trouble than its worth, for the > miserly cost > avoiding one atomic_inc, and one atomic_dec_and_test on page-local data > that will > be in L1 cache. I'd never turn my nose up at anyone just having a go > though :) The cost is anything but miserly. Consider that every lock instruction is a memory barrier which takes your OoO CPU with lots of instructions in flight to ramp down to just 1 for the time it takes that instruction to execute. That synchronization is what makes the atomic expensive. In the case of netperf, I ended up with a 2.5Gbit/s (~30%) performance improvement through nothing but microoptimizations. There is method to my madness. ;-) -ben - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] avoid atomic op on page free
On Tue, Mar 07, 2006 at 12:53:27PM +1100, Nick Piggin wrote: > You can't do this because you can't test PageLRU like that. > > Have a look in the lkml archives a few months back, where I proposed > a way to do this for __free_pages(). You can't do it for put_page. Even if we know that we are the last user of the page (the count is 1)? Who can bump the page's count then? > BTW I have quite a large backlog of patches in -mm which should end > up avoiding an atomic or two around these parts. That certainly looks like it will help. Not taking the spinlock unconditionally gets rid of quite a bit of the cost. -ben - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] avoid atomic op on page free
On Mon, Mar 06, 2006 at 05:39:41PM -0800, Andrew Morton wrote: > > It's just a simple send() and recv() pair of processes. Networking uses > > pages for the buffer on user transmits. > > You mean non-zero-copy transmits? If they were zero-copy then those pages > would still be on the LRU. Correct. > > Those pages tend to be freed > > in irq context on transmit or in the receiver if the traffic is local. > > If it was a non-zero-copy Tx then networking owns that page and can just do > free_hot_page() on it and avoid all that stuff in put_page(). At least currently, networking has no way of knowing that is the case since pages may have their reference count increased when an skb() is cloned, and in fact do when TCP sends them off. > Thing is, that case would represent about 100th of the number of > put_pages()s which get done in the world. IOW: a net loss. Those 1-2 cycles are free if you look at how things get scheduled with the execution of the surrounding code. I bet $20 that you can't find a modern CPU where the cost is measurable (meaning something like a P4, Athlon). If this level of cost for the common case is a concern, it's probably worth making atomic_dec_and_test() inline for page_cache_release(). The overhead of the function call and the PageCompound() test is probably more than what we're talking about as you're increasing the cache footprint and actually performing a write to memory. -ben - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] avoid atomic op on page free
On Mon, Mar 06, 2006 at 04:50:39PM -0800, Andrew Morton wrote: > Am a bit surprised at those numbers. > Because userspace has to do peculiar things to get its pages taken off the > LRU. What exactly was that application doing? It's just a simple send() and recv() pair of processes. Networking uses pages for the buffer on user transmits. Those pages tend to be freed in irq context on transmit or in the receiver if the traffic is local. > The patch adds slight overhead to the common case while providing > improvement to what I suspect is a very uncommon case? At least on any modern CPU with branch prediction, the test is essentially free (2 memory reads that pipeline well, iow 1 cycle, maybe 2). The upside is that you get to avoid the atomic (~17 cycles on a P4 with a simple test program; the penalty doubles if there is one other instruction that operates on memory in the loop), disabling interrupts (~20 cycles? I don't remember), another atomic for the spinlock, another atomic for TestClearPageLRU() and the pushf/popf (expensive, as they have to wait for whatever instructions are still in flight to complete, plus the penalty for changing irq state). That's at least 70 cycles without including the memory barrier side effects, which can cost 100+ cycles. Add in the costs for the cacheline bouncing of the lru_lock and we're talking *expensive*. So, a 1-2 cycle cost for a case that normally takes from 17 to 100+ cycles? I think that's worth it given the benefits. Also, I think the common case (page cache read / map) is something that should be done differently, as those atomics really do add up to major pain. Using rcu for page cache reads would be truly wonderful, but that will take some time. -ben
Re: [PATCH] avoid memory barrier bitops in hot paths
On Mon, Mar 06, 2006 at 04:25:32PM -0800, David S. Miller wrote: > Wait... > > what about "test_and_clear_bit()"? > > Most implementations should be doing the light-weight test _first_, > and only do the update if the bit isn't in the state desired. > > I think in such cases we can elide the memory barrier. Hrmmm, clear_bit() doesn't seem to imply being a memory barrier, but if we do that things can be doubly worse for the sites that use smp_mb__*_clear_bit() (as sometimes we'd perform the barrier and then do the locked bit clear on x86). That would hurt in net_tx_action(). Oooh, I see some optimizations to make there... -ben - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] avoid memory barrier bitops in hot paths
On Tue, Mar 07, 2006 at 12:59:17AM +0100, Eric Dumazet wrote: > I'm not even sure this 'optimization' is valid on UP. It can be, as branch prediction makes the test essentially free. The real answer is that it depends on the CPU, how much pressure there is on the write combining buffers and reorder buffers, etc. Something like this is almost in the noise when measured on its own, but the combined effect adds up (in this case netperf throughput increased by ~30%). I'll respin it with the fast_clear_bit() suggestion. -ben
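A minimal sketch of the fast_clear_bit() helper being discussed (illustrative; no such helper exists in the tree): skip the locked read-modify-write entirely when the bit is already clear, which is the common case in the paths these patches touch:

	static inline void fast_clear_bit(int nr, volatile unsigned long *addr)
	{
		/* test_bit() is an unlocked load; only the rare "bit still
		 * set" case pays for the serialising locked clear_bit(). */
		if (test_bit(nr, addr))
			clear_bit(nr, addr);
	}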
Re: [RFC/PATCH] rcuification of ipv4 established and timewait connections
On Tue, Mar 07, 2006 at 12:48:23AM +0100, Eric Dumazet wrote: > If I understand your patch correctly, your future plan is to change "struct > inet_ehash_bucket" rwlock_t wlock to a pure spinlock (when ipv6 is > converted to rcu lookups too), because no more read_lock() are expected ? Yes/no... The cost doesn't change if we convert it to a spinlock as we would have to insert a memory barrier where the write_unlock() was. > I didnt understand why you added a test on sk->sk_node.pprev against > LIST_POISON2 in sk_unhashed() and sk_hashed() In doing the conversion the code was changed over to use hlist_del_rcu(), which poisons the pprev member. RCU basically adds another state where the hash list entry is unhashed but not freed, which can be detected by the presence of the poison value. Without the check there are a couple of places where we end up trying to remove an already dead hash. -ben
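Roughly what the extra test amounts to (a sketch built from the description above, not a verbatim quote of the patch): hlist_del_rcu() leaves pprev pointing at LIST_POISON2, so the unhashed-but-not-yet-freed state has to be recognised explicitly:

	static inline int sk_unhashed_rcu(const struct sock *sk)
	{
		return hlist_unhashed(&sk->sk_node) ||
		       sk->sk_node.pprev == LIST_POISON2;
	}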
Re: [PATCH] avoid memory barrier bitops in hot paths
On Mon, Mar 06, 2006 at 08:29:30PM -0300, Arnaldo Carvalho de Melo wrote: > - clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags); > + if (test_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags)) > + clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags); > > Something like fast_clear_bit() that does the above sequence in a inline > function? That might make sense. The general case tends to be better handled by avoiding atomic bitops wherever possible. In the socket case, a better approach may be to have a set of flags which are protected by the socket lock instead of being atomic bitops. > Something similar, no? > > fast_atomic_dec_and_test anyone? :-) Yup. One of the other patches I've got which really helps networking is to use that hack in the VM's put_page() function. The problem is that the conditions under which it is okay tend to vary from case to case. -ben - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] avoid memory barrier bitops in hot paths
This patch removes a couple of memory barriers from atomic bitops that showed up on profiles of netperf. -ben Signed-off-by: Benjamin LaHaise <[EMAIL PROTECTED]> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 00aa80e..dadc84c 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -682,7 +682,8 @@ int tcp_sendmsg(struct kiocb *iocb, stru goto out_err; /* This should be in poll */ - clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags); + if (test_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags)) + clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags); mss_now = tcp_current_mss(sk, !(flags&MSG_OOB)); size_goal = tp->xmit_size_goal; diff --git a/net/ipv4/netfilter/ip_conntrack_core.c b/net/ipv4/netfilter/ip_conntrack_core.c index 84c66db..416e310 100644 --- a/net/ipv4/netfilter/ip_conntrack_core.c +++ b/net/ipv4/netfilter/ip_conntrack_core.c @@ -881,7 +881,12 @@ unsigned int ip_conntrack_in(unsigned in return -ret; } - if (set_reply && !test_and_set_bit(IPS_SEEN_REPLY_BIT, &ct->status)) + /* Use test_bit() first as that avoids a memory barrier on implicite +* in test_and_set_bit() on some CPUs. It only gets set once per +* connection. +*/ + if (set_reply && !test_bit(IPS_SEEN_REPLY_BIT, &ct->status) && + !test_and_set_bit(IPS_SEEN_REPLY_BIT, &ct->status)) ip_conntrack_event_cache(IPCT_STATUS, *pskb); return ret; - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC/PATCH] rcuification of ipv4 established and timewait connections
Hello again, This patch introduces the use of rcu for the ipv4 established connections hashtable, as well as the timewait table since they are closely intertwined. This removes 4 atomic operations per packet from the tcp_v4_rcv codepath, which helps quite a bit when the other performance barriers in the system are removed. Eliminating the rwlock cache bouncing should also help on SMP systems. By itself, this improves local netperf performance on a P4/HT by ~260Mbit/s on average. With smaller packets (say, ethernet size) the difference should be larger. Note that this patch changes the semantics of __inet_lookup() and __inet_lookup_established() to *not* perform a sock_hold(). This way we can avoid the atomic inc and dec of the reference count and rely on rcu pinning the socket over the scope of the rcu_read_lock_bh() region. Only minimal fixes for ipv6 and dccp to use the renamed wlock, as I am not setup to test them. I have stared at this patch for a while, and I think it's correct, but more eyes are definately warranted. Most of the issues regarding the state and lifespan of a struct sock are already handled by the network layer as the socket can be locked after being retrieved from the connection hashtable, we just have to be careful about reinitializing fields that the rcu hash list depends on. -ben Before: max 7838.86, min 7771.90, avg 7811.68 87380 16384 1638410.01 7793.04 90.5690.561.904 1.904 87380 16384 1638410.01 7806.54 91.6191.611.923 1.923 87380 16384 1638410.00 7819.29 90.8090.801.903 1.903 87380 16384 1638410.00 7815.89 90.7090.701.901 1.901 87380 16384 1638410.01 7771.90 91.6691.661.932 1.932 87380 16384 1638410.00 7831.59 90.2090.201.887 1.887 87380 16384 1638410.01 7796.56 91.2691.261.918 1.918 87380 16384 1638410.01 7838.86 89.2689.261.866 1.866 87380 16384 1638410.01 7835.44 90.5690.561.894 1.894 After: max 8113.70, min 8026.32, avg 8072.34 87380 16384 1638410.01 8045.55 87.1187.111.774 1.774 87380 16384 1638410.01 8065.14 90.8690.861.846 1.846 87380 16384 1638410.00 8077.76 89.8589.851.822 1.822 87380 16384 1638410.00 8026.32 89.8089.801.833 1.833 87380 16384 1638410.01 8108.59 89.8189.811.815 1.815 87380 16384 1638410.01 8034.53 89.0189.011.815 1.815 87380 16384 1638410.00 8113.70 90.4590.451.827 1.827 87380 16384 1638410.00 8111.37 89.9089.901.816 1.816 87380 16384 1638410.01 8077.75 87.9687.961.784 1.784 87380 16384 1638410.00 8062.70 90.2590.251.834 1.834 -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. Signed-off-by: Benjamin LaHaise <[EMAIL PROTECTED]> diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h index 25f708f..73b05ab 100644 --- a/include/net/inet6_hashtables.h +++ b/include/net/inet6_hashtables.h @@ -65,7 +65,7 @@ static inline void __inet6_hash(struct i sk->sk_hash = hash = inet6_sk_ehashfn(sk); hash &= (hashinfo->ehash_size - 1); list = &hashinfo->ehash[hash].chain; - lock = &hashinfo->ehash[hash].lock; + lock = &hashinfo->ehash[hash].wlock; write_lock(lock); } @@ -98,7 +98,7 @@ static inline struct sock * struct inet_ehash_bucket *head = inet_ehash_bucket(hashinfo, hash); prefetch(head->chain.first); - read_lock(&head->lock); + read_lock(&head->wlock); sk_for_each(sk, node, &head->chain) { /* For IPV6 do the cheaper port and family tests first. 
*/ if (INET6_MATCH(sk, hash, saddr, daddr, ports, dif)) @@ -118,12 +118,12 @@ static inline struct sock * goto hit; } } - read_unlock(&head->lock); + read_unlock(&head->wlock); return NULL; hit: sock_hold(sk); - read_unlock(&head->lock); + read_unlock(&head->wlock); return sk; } diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h index 135d80f..4cde832 100644 --- a/include/net/inet_hashtables.h +++ b/include/net/inet_hashtables.h @@ -39,7 +39,7 @@ * for the rest. I'll experiment with dynamic table growth later. */ struct inet_ehash_bucket { - rwlock_t lock; + rwlock_t wlock; struct hlist_head chain; }; @@ -224,7 +224,7 @@ static inline void __inet_hash(str
[EMAIL PROTECTED]: [PATCH] use fget_light() in net/socket.c]
Hello folks, Here's an updated copy of the patch to use fget_light in net/socket.c. Rerunning the tests show a drop of ~80Mbit/s on average, which looks bad until you see the drop in cpu usage from ~89% to ~82%. That will get fixed in another patch... Before: max 8113.70, min 8026.32, avg 8072.34 87380 16384 1638410.01 8045.55 87.1187.111.774 1.774 87380 16384 1638410.01 8065.14 90.8690.861.846 1.846 87380 16384 1638410.00 8077.76 89.8589.851.822 1.822 87380 16384 1638410.00 8026.32 89.8089.801.833 1.833 87380 16384 1638410.01 8108.59 89.8189.811.815 1.815 87380 16384 1638410.01 8034.53 89.0189.011.815 1.815 87380 16384 1638410.00 8113.70 90.4590.451.827 1.827 87380 16384 1638410.00 8111.37 89.9089.901.816 1.816 87380 16384 1638410.01 8077.75 87.9687.961.784 1.784 87380 16384 1638410.00 8062.70 90.2590.251.834 1.834 After: max 8035.81, min 7963.69, avg 7998.14 87380 16384 1638410.01 8000.93 82.1182.111.682 1.682 87380 16384 1638410.01 8016.17 83.6783.671.710 1.710 87380 16384 1638410.01 7963.69 83.4783.471.717 1.717 87380 16384 1638410.01 8014.35 81.7181.711.671 1.671 87380 16384 1638410.00 7967.68 83.4183.411.715 1.715 87380 16384 1638410.00 7995.22 81.0081.001.660 1.660 87380 16384 1638410.00 8002.61 83.9083.901.718 1.718 87380 16384 1638410.00 8035.81 81.7181.711.666 1.666 87380 16384 1638410.01 8005.36 82.5682.561.690 1.690 87380 16384 1638410.00 7979.61 82.5082.501.694 1.694 -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. Signed-off-by: Benjamin LaHaise <[EMAIL PROTECTED]> diff --git a/net/socket.c b/net/socket.c index 3ca31e8..24c4a1a 100644 --- a/net/socket.c +++ b/net/socket.c @@ -429,6 +429,28 @@ int sock_map_fd(struct socket *sock) return fd; } +static struct socket *sock_from_file(struct file *file, int *err) +{ + struct inode *inode; + struct socket *sock; + + if (file->f_op == &socket_file_ops) + return file->private_data; /* set in sock_map_fd */ + + inode = file->f_dentry->d_inode; + if (!S_ISSOCK(inode->i_mode)) { + *err = -ENOTSOCK; + return NULL; + } + + sock = SOCKET_I(inode); + if (sock->file != file) { + printk(KERN_ERR "socki_lookup: socket file changed!\n"); + sock->file = file; + } + return sock; +} + /** * sockfd_lookup - Go from a file number to its socket slot * @fd: file handle @@ -445,31 +467,31 @@ int sock_map_fd(struct socket *sock) struct socket *sockfd_lookup(int fd, int *err) { struct file *file; - struct inode *inode; struct socket *sock; - if (!(file = fget(fd))) - { + if (!(file = fget(fd))) { *err = -EBADF; return NULL; } - - if (file->f_op == &socket_file_ops) - return file->private_data; /* set in sock_map_fd */ - - inode = file->f_dentry->d_inode; - if (!S_ISSOCK(inode->i_mode)) { - *err = -ENOTSOCK; + sock = sock_from_file(file, err); + if (!sock) fput(file); - return NULL; - } + return sock; +} - sock = SOCKET_I(inode); - if (sock->file != file) { - printk(KERN_ERR "socki_lookup: socket file changed!\n"); - sock->file = file; +static struct socket *sockfd_lookup_light(int fd, int *err, int *fput_needed) +{ + struct file *file; + struct socket *sock; + + file = fget_light(fd, fput_needed); + if (file) { + sock = sock_from_file(file, err); + if (sock) + return sock; + fput_light(file, *fput_needed); } - return sock; + return NULL; } /** @@ -1304,19 +1326,17 @@ asmlinkage long sys_bind(int fd, struct { struct socket *sock; char address[MAX_SOCK_ADDR]; - int err; + int err, fput_needed; - if((sock = sockfd_lookup(fd,&err))!=NULL) + if((sock = sockfd_lookup_light(fd, &err, &fput_needed))!=NULL) { 
if((err=move_addr_to_kernel(umyaddr,addrlen,address))>=0) { err = security_socket_bind(sock, (struct sockaddr *)address, addrlen); - if (err) { - sockfd_put
Re: [PATCH 1/8] [I/OAT] DMA memcpy subsystem
On Fri, Mar 03, 2006 at 01:42:20PM -0800, Chris Leech wrote: > +void dma_async_device_unregister(struct dma_device* device) > +{ ... > + kref_put(&device->refcount, dma_async_device_cleanup); > + wait_for_completion(&device->done); > +} This looks like a bug: device is dereferenced after it is potentially freed. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Poor performance with MTU 9000
On Tue, Feb 28, 2006 at 09:33:38PM -0500, John Zielinski wrote: > Rick Jones wrote: > >And if you add the test-specific -D option? > > No difference. I have TCP_NODELAY in my Samba config and copying files > from the server is painfully slow. That's what got me started on all > these tests. You might want to have a look at the PCI latency settings on the system if your BIOS lets you change them (try increasing it). It's hard to say exactly what is happening without a bus analyzer, but 9000 is a pretty sub-optimal mtu as it probably results in at least a couple of additional bus transactions beyond 8192. Oh, one other thing to try: if the network driver does tx checksum offloading, disable it (replace NETIF_F_IP_CSUM with 0 in the driver). On some chips (the 83820 is one), the tx fifo is only 8KB, and if you try to use tx checksumming, the chip will read the packet over the bus *twice*, completely hosing performance in the process. This is probably a better idea to check than the pci latency settings. -ben -- "Ladies and gentlemen, I'm sorry to interrupt, but the police are here and they've asked us to stop the party." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] use fget_light for network syscalls
The patch below reworks socket.c to use fget_light() in place of fget() for the various networking syscalls. This is of particular value with SMP kernels running on the P4 as the memory barriers the atomic ops on the file counter causes result in the CPU having to wait for quite a few memory transactions to complete (we can have up to ~128 instructions in flight). On the default netperf (which uses fairly large buffers) the increase is around 25Mbit/s (on 8000Mbit/s). Using 'netperf -- -m 64' performance goes to ~284Mbit/s from 247Mbit/s, which is pretty substantial. -ben Signed-off-by: Benjamin LaHaise diff --git a/net/socket.c b/net/socket.c index a00851f..64019f8 100644 --- a/net/socket.c +++ b/net/socket.c @@ -414,6 +414,28 @@ out: return fd; } +static struct socket *sock_from_file(struct file *file, int *err) +{ + struct inode *inode; + struct socket *sock; + + if (file->f_op == &socket_file_ops) + return file->private_data; /* set in sock_map_fd */ + + inode = file->f_dentry->d_inode; + if (!S_ISSOCK(inode->i_mode)) { + *err = -ENOTSOCK; + return NULL; + } + + sock = SOCKET_I(inode); + if (sock->file != file) { + printk(KERN_ERR "socki_lookup: socket file changed!\n"); + sock->file = file; + } + return sock; +} + /** * sockfd_lookup - Go from a file number to its socket slot * @fd: file handle @@ -430,31 +452,31 @@ out: struct socket *sockfd_lookup(int fd, int *err) { struct file *file; - struct inode *inode; struct socket *sock; - if (!(file = fget(fd))) - { + if (!(file = fget(fd))) { *err = -EBADF; return NULL; } - - if (file->f_op == &socket_file_ops) - return file->private_data; /* set in sock_map_fd */ - - inode = file->f_dentry->d_inode; - if (!S_ISSOCK(inode->i_mode)) { - *err = -ENOTSOCK; + sock = sock_from_file(file, err); + if (!sock) fput(file); - return NULL; - } + return sock; +} - sock = SOCKET_I(inode); - if (sock->file != file) { - printk(KERN_ERR "socki_lookup: socket file changed!\n"); - sock->file = file; +static struct socket *sockfd_lookup_light(int fd, int *err, int *fput_needed) +{ + struct file *file; + struct socket *sock; + + file = fget_light(fd, fput_needed); + if (file) { + sock = sock_from_file(file, err); + if (sock) + return sock; + fput_light(file, *fput_needed); } - return sock; + return NULL; } /** @@ -1289,19 +1311,17 @@ asmlinkage long sys_bind(int fd, struct { struct socket *sock; char address[MAX_SOCK_ADDR]; - int err; + int err, fput_needed; - if((sock = sockfd_lookup(fd,&err))!=NULL) + if((sock = sockfd_lookup_light(fd, &err, &fput_needed))!=NULL) { if((err=move_addr_to_kernel(umyaddr,addrlen,address))>=0) { err = security_socket_bind(sock, (struct sockaddr *)address, addrlen); - if (err) { - sockfd_put(sock); - return err; - } - err = sock->ops->bind(sock, (struct sockaddr *)address, addrlen); + if (!err) + err = sock->ops->bind(sock, + (struct sockaddr *)address, addrlen); } - sockfd_put(sock); + fput_light(sock->file, fput_needed); } return err; } @@ -1318,20 +1338,17 @@ int sysctl_somaxconn = SOMAXCONN; asmlinkage long sys_listen(int fd, int backlog) { struct socket *sock; - int err; + int err, fput_needed; - if ((sock = sockfd_lookup(fd, &err)) != NULL) { + if ((sock = sockfd_lookup_light(fd, &err, &fput_needed)) != NULL) { if ((unsigned) backlog > sysctl_somaxconn) backlog = sysctl_somaxconn; err = security_socket_listen(sock, backlog); - if (err) { - sockfd_put(sock); - return err; - } + if (!err) + err = sock->ops->listen(sock, backlog); - err=sock->ops->listen(sock, backlog); - sockfd_put(sock); + fput_light(sock->file, fput_needed); 
} return err; } @@ -1352,10 +1369,10 @@ asmlinkage long sys_listen(int fd, int b asmlinkage long
Re: GbE performance
On Tue, Feb 28, 2006 at 09:21:39AM +0100, Dag Bakke wrote: > Got both. Max throughput increased from ~605 Mbps with carefully tuned > options to iperf, to ~650 Mbps with less carefully tuned options to iperf. One other suggestion that might be worth testing: try comparing SMP and UP kernels. I've been doing some profiling of the network stack which shows that locked instructions carry a heavy penalty on the P4. Running a non-SMP kernel through the hoops would give us an idea just how much overhead this is causing. -ben -- "Ladies and gentlemen, I'm sorry to interrupt, but the police are here and they've asked us to stop the party." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [0/2] Kevent. Network AIO.
On Thu, Feb 09, 2006 at 08:03:26PM +0300, Evgeniy Polyakov wrote: > It is completely different things. > The only common is that they _require_ some kind of notification > mechanism, > but none provide them. > epoll() can not be used for AIO and timers. True, that is a disappointment. There really should be a timer device in /dev which can be used with all of the possible event mechanisms. > AIO can not be used with anything except AIO. Actually, there are patches for generic thread based aio so that other file descriptors can be used for AIO; perhaps they should be dusted off? They're pretty small and easy to understand. Btw, your kevent aio is not interoperable with O_DIRECT AIO on files. This is a pretty major setback for quite a few types of high performance servers. > inotify is good, but it can not be used for anything else. Now this is funny! =-) > > > asynchronously from socket's receiving queue from softirq context. > > > > Is there any reason for not using the existing aio read method? > Current AIO has it's own design which I do not accept completely. > It does not use internal state machines of underlying layers, but instead > repeatedly tries to call nonblocking functions, similar to synchronous > ones. Actually, that behaviour is completely up to the implementor of the aio method. Retry simply happens to be a very simple technique that allows existing code to be easily updated to implement aio. AIO operations can be implemented as a state machine -- the 2.4 codebase did that for file io, and it worked quite well. For something like networking, it could well be the way to go. On the networking front, it would be very useful to have an AIO method to implement zero copy transmit. Most of the pieces are in place with the sendpage code, but the catch is that we would need a destructor to use with data submitted via sendpage. Are you interested in exploring a few more approaches to the design and tie-in with existing kernel AIO? Cheers, -ben
Re: [0/2] Kevent. Network AIO.
On Thu, Feb 09, 2006 at 04:56:11PM +0300, Evgeniy Polyakov wrote: > Hello. > > I'm pleased to announce following projects: > > 1/2 - Kevent subsystem. > This subsystem incorporates several AIO/kqueue design notes and ideas. > Kevent can be used both for edge and level notifications. It supports > socket notifications (accept and receiving), inode notifications > (create/remove), generic poll()/select() notifications, timer > notifications and network AIO notifications. What is the justification behind adding yet another notification system? We already have AIO, epoll, traditional select/poll. This seems like a lot of new code for no additional functionality. > 2/2 - Network asynchronous IO. Receiving (TCP) support. > Network AIO is based on kevent and works as usual kevent storage on top > of it. It allows to receive data directly into userspace pages > asynchronously from socket's receiving queue from softirq context. Is there any reason for not using the existing aio read method? -ben -- "Ladies and gentlemen, I'm sorry to interrupt, but the police are here and they've asked us to stop the party." Don't Email: <[EMAIL PROTECTED]>. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] af_unix: use shift instead of integer division
On Tue, Feb 07, 2006 at 04:15:31PM +0100, Andi Kleen wrote: > On Tuesday 07 February 2006 15:54, Benjamin LaHaise wrote: > > > + if (size > ((sk->sk_sndbuf >> 1) - 64)) > > + size = (sk->sk_sndbuf >> 1) - 64; > > This is really surprising. Are you double plus sure gcc doesn't > do this automatically? As I said, sk_sndbuf is a signed integer, so gcc can't use a bare arithmetic shift (an arithmetic shift rounds toward negative infinity when the result is negative, while C integer division truncates toward zero -- and gcc has no way of knowing that sk_sndbuf will be positive). The alternative would be to convert sk_sndbuf to unsigned, but that would mean rechecking all the users for side effects. -ben
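A userspace illustration of the rounding difference that stops the compiler from substituting a bare shift (plain C, nothing kernel specific):

	#include <stdio.h>

	int main(void)
	{
		int v = -3;

		/* C integer division truncates toward zero, an arithmetic
		 * shift rounds toward negative infinity, so the two disagree
		 * for negative values. */
		printf("%d / 2  = %d\n", v, v / 2);	/* -3 / 2  = -1 */
		printf("%d >> 1 = %d\n", v, v >> 1);	/* -3 >> 1 = -2 */
		return 0;
	}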
[PATCH] af_unix: scm: better initialization
Instead of doing a memset then initialization of the fields of the scm structure, just initialize all the members explicitly. Prevent reloading of current on x86 and x86-64 by storing the value in a local variable for subsequent dereferences. This is worth a ~7KB/s increase in af_unix bandwidth. Note that we avoid the issues surrounding potentially uninitialized members of the ucred structure by constructing a struct ucred instead of assigning the members individually, which forces the compiler to zero any padding. Signed-off-by: Benjamin LaHaise <[EMAIL PROTECTED]> diff --git a/include/net/scm.h b/include/net/scm.h index c3fa3d5..0d90fa2 100644 --- a/include/net/scm.h +++ b/include/net/scm.h @@ -37,10 +37,14 @@ static __inline__ void scm_destroy(struc static __inline__ int scm_send(struct socket *sock, struct msghdr *msg, struct scm_cookie *scm) { - memset(scm, 0, sizeof(*scm)); - scm->creds.uid = current->uid; - scm->creds.gid = current->gid; - scm->creds.pid = current->tgid; + struct task_struct *p = current; + scm->creds = (struct ucred) { + .uid = p->uid, + .gid = p->gid, + .pid = p->tgid + }; + scm->fp = NULL; + scm->seq = 0; if (msg->msg_controllen <= 0) return 0; return __scm_send(sock, msg, scm); - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] af_unix: use shift instead of integer division
The patch below replaces a divide by 2 with a shift -- sk_sndbuf is an integer, so gcc emits an idiv, which takes 10x longer than a shift by 1. This improves af_unix bandwidth by ~6-10K/s. Also, tidy up the comment to fit in 80 columns while we're at it. -ben Signed-off-by: Benjamin LaHaise <[EMAIL PROTECTED]> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index 1b5989b..b57d4d9 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -1427,15 +1427,15 @@ static int unix_stream_sendmsg(struct ki while(sent < len) { /* -* Optimisation for the fact that under 0.01% of X messages typically -* need breaking up. +* Optimisation for the fact that under 0.01% of X +* messages typically need breaking up. */ - size=len-sent; + size = len-sent; /* Keep two messages in the pipe so it schedules better */ - if (size > sk->sk_sndbuf / 2 - 64) - size = sk->sk_sndbuf / 2 - 64; + if (size > ((sk->sk_sndbuf >> 1) - 64)) + size = (sk->sk_sndbuf >> 1) - 64; if (size > SKB_MAX_ALLOC) size = SKB_MAX_ALLOC; - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [2.6 patch] schedule eepro100.c for removal
Where's the hunk to make the eepro100 driver spew messages about being obsolete when it loads? -ben

On Fri, Feb 03, 2006 at 10:32:34PM +0100, Adrian Bunk wrote:
> Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]>
>
> ---
>
> This patch was already sent on:
> - 18 Jan 2006
>
> --- linux-2.6.15-mm4-full/Documentation/feature-removal-schedule.txt.old	2006-01-18 08:38:57.0 +0100
> +++ linux-2.6.15-mm4-full/Documentation/feature-removal-schedule.txt	2006-01-18 08:39:59.0 +0100
> @@ -164,0 +165,6 @@
> +---
> +
> +What:	eepro100 network driver
> +When:	April 2006
> +Why:	replaced by the e100 driver
> +Who:	Adrian Bunk <[EMAIL PROTECTED]>

-- "Ladies and gentlemen, I'm sorry to interrupt, but the police are here and they've asked us to stop the party." Don't Email: <[EMAIL PROTECTED]>.
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
On Sun, Jan 29, 2006 at 07:54:09AM +0100, Eric Dumazet wrote: > Well, I think that might be doable, maybe RCU magic ? > > 1) local_t are not that nice on all archs. It is on the architectures that matter, and the hooks are there if someone finds it to be a performance problem. > 2) The consolidation phase (summing all the cpus' local offsets to > consolidate the central counter) might be more difficult to do (we would > need kind of 2 counters per cpu, and an index that can be changed by the cpu > that wants a consolidation (still 'expensive')) For the vast majority of these sorts of statistics counters, we don't need 100% accurate counts. And I think it should be possible to choose between a lightweight implementation and the expensive one. On a chip like the Core Duo the cost of bouncing between the two cores is minimal, so all the extra code and data is a waste. > 3) Are the locked ops so expensive if done on a cache line that is mostly > in exclusive state in cpu cache ? Yes. What happens on the P4 is that the lock prefix forces outstanding memory transactions in the reorder buffer to be flushed so that its memory barrier semantics are observed. This can take a long time, as there can be over a hundred instructions in flight. -ben -- "Ladies and gentlemen, I'm sorry to interrupt, but the police are here and they've asked us to stop the party." Don't Email: <[EMAIL PROTECTED]>.
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
On Sat, Jan 28, 2006 at 04:55:49PM -0800, Andrew Morton wrote: > local_t isn't much use until we get rid of asm-generic/local.h. Bloaty, > racy with nested interrupts. The overuse of atomics is horrific in what is being proposed. All the major architectures except powerpc (i386, x86-64, ia64, and sparc64) implement local_t. It would make far more sense to push the last few stragglers (which mostly seem to be uniprocessor) into writing the appropriate implementations. Perhaps it's time to add a #error in asm-generic/local.h? -ben -- "Ladies and gentlemen, I'm sorry to interrupt, but the police are here and they've asked us to stop the party." Don't Email: <[EMAIL PROTECTED]>.
Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
On Sat, Jan 28, 2006 at 01:28:20AM +0100, Eric Dumazet wrote: > We might use atomic_long_t only (and no spinlocks) > Something like this ? Erk, complex and slow... Try using local_t instead, which is substantially cheaper on the P4 as it doesn't use the lock prefix or act as a memory barrier. See asm/local.h. -ben
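A rough sketch of the kind of local_t-based per-cpu counter being suggested (not part of any patch in this thread; the counter and helper names are hypothetical, and an approximate sum is assumed to be acceptable for statistics):

/*
 * Sketch of a per-cpu statistics counter built on local_t.
 * local_inc() avoids the lock prefix on architectures where a
 * cpu-local atomic is cheaper than a full atomic.
 */
#include <linux/percpu.h>
#include <linux/cpumask.h>
#include <asm/local.h>

static DEFINE_PER_CPU(local_t, sockets_allocated);

static inline void sockets_allocated_inc(void)
{
	local_inc(&get_cpu_var(sockets_allocated));
	put_cpu_var(sockets_allocated);
}

/* Consolidation: sum the per-cpu values; slightly stale results are fine. */
static long sockets_allocated_sum(void)
{
	long sum = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		sum += local_read(&per_cpu(sockets_allocated, cpu));
	return sum;
}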
Re: driver skb reuse
On Tue, Jan 24, 2006 at 10:35:06PM +0100, Robert Olsson wrote: > I split alloc_skb in two parts to get the memset() part done > in the driver just before passing the skb to the stack. > > I did expect a big win but I didn't see any gain. Strange, but the code > was bad so it might be worth a try again. But this is a bit different > from the skb reuse we discuss now. Right. Btw, looking over your changes for skb reuse and how the e1000 lays out its data structures, I think there is still some room for improvement: currently you still touch the skb that the e1000 used to allocate the receive buffers. That cacheline is likely to be L1 cold because of the large number of outstanding buffers the e1000 uses. If you instead only touch the buffer_info structure to populate a freshly allocated skb, you can eliminate a couple of cache misses and increase the chances that the skb you're reusing stays in the L1 cache. That will require a lot more surgery to the e1000 code, but it could be worth it. -ben -- "Ladies and gentlemen, I'm sorry to interrupt, but the police are here and they've asked us to stop the party." Don't Email: <[EMAIL PROTECTED]>.
Re: driver skb reuse
On Tue, Jan 24, 2006 at 02:23:15PM +0100, Robert Olsson wrote: > etc. In the test below I use my usual lab setup but just let netfilter > drop the packets. We win about 13% in this experiment below. > > Here we process (drop) about 13% more packets when skb's get reused. Instead of doing a completely separate skb reuse path, what happens if you remove the memset() from __alloc_skb() and instead do it in a slab ctor? I remember seeing that in the profiles for af_unix. Dave, could you refresh my memory why the slab ctor ended up being dropped? Doing the memset() at free time is usually almost free since the object is usually still in the cache at that point. -ben
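The general shape of the "zero in a slab ctor, re-zero at free" idea looks roughly like the sketch below (illustration only, not a patch from this thread: cache and helper names are made up, the current kmem_cache_create() signature is assumed, and the offsetof() bound mirrors what __alloc_skb() zeroes today):

/*
 * Sketch: a dedicated skb cache whose constructor zeroes objects when a
 * slab page is first populated.  Allocation can then skip the memset;
 * the cost moves to free time, when the object is still cache-hot, since
 * the slab contract requires objects to be returned in constructed state.
 */
#include <linux/init.h>
#include <linux/errno.h>
#include <linux/slab.h>
#include <linux/skbuff.h>
#include <linux/string.h>

static struct kmem_cache *my_skb_cache;

/* Runs once per object when a new slab page is populated, not per alloc. */
static void my_skb_ctor(void *obj)
{
	memset(obj, 0, offsetof(struct sk_buff, tail));
}

static int __init my_skb_cache_init(void)
{
	my_skb_cache = kmem_cache_create("my_skb_cache", sizeof(struct sk_buff),
					 0, SLAB_HWCACHE_ALIGN, my_skb_ctor);
	return my_skb_cache ? 0 : -ENOMEM;
}

/* Re-zero at free so the next allocation sees a constructed object. */
static void my_skb_free(struct sk_buff *skb)
{
	memset(skb, 0, offsetof(struct sk_buff, tail));
	kmem_cache_free(my_skb_cache, skb);
}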
Re: [2.6 patch] schedule SHAPER for removal
On Sat, Jan 21, 2006 at 01:48:48AM +0100, Adrian Bunk wrote: > Do we really have to wait the three years between stable Debian releases > for removing an obsolete driver that has always been marked as > EXPERIMENTAL? > > Please be serious. I am completely serious. The traditional cycle for obsolete code that works and is not causing a maintenance burden is 2 major releases -- one release during which the obsolete feature spews warnings on use, and another development cycle until it is actually removed. That's at least 3 years, which is still pretty short compared to distro cycles. There seems to be a lot of this disease of removing code for the sake of removal lately, and it's getting to the point of being really annoying. If the maintainer of the code in question isn't pushing for its removal, I see no need to rush the process, especially when the affected users aren't even likely to notice the feature being marked obsolete, since they don't troll the source code looking for things that will break their setups. -ben