Re: aio poll, io_pgetevents and a new in-kernel poll API V4

2018-01-25 Thread Benjamin LaHaise
On Mon, Jan 22, 2018 at 09:12:07PM +0100, Christoph Hellwig wrote:
> Hi all,
> 
> this series adds support for the IOCB_CMD_POLL operation to poll for the
> readiness of file descriptors using the aio subsystem.  The API is based
> on patches that existed in RHAS2.1 and RHEL3, which means it already is
> supported by libaio.  To implement the poll support efficiently, new
> methods to poll are introduced in struct file_operations:  get_poll_head
> and poll_mask.  The first one returns a wait_queue_head to wait on
> (lifetime is bound by the file), and the second does a non-blocking
> check for the POLL* events.  This allows aio poll to work without
> any additional context switches, unlike epoll.
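
As a rough illustration of the proposed interface (a sketch based only on the
description above -- the exact names, argument types and event constants in
the real series may differ, and my_dev is a made-up placeholder device), a
driver-side implementation of the two hooks might look like:

#include <linux/fs.h>
#include <linux/poll.h>
#include <linux/wait.h>

struct my_dev {
	wait_queue_head_t waitq;	/* lifetime tied to the open file */
	bool readable;
	bool writable;
};

/* return the wait queue aio poll should wait on; must not sleep */
static wait_queue_head_t *my_get_poll_head(struct file *file, unsigned int events)
{
	struct my_dev *dev = file->private_data;

	return &dev->waitq;
}

/* non-blocking readiness check: no poll_wait(), no context switch */
static unsigned int my_poll_mask(struct file *file, unsigned int events)
{
	struct my_dev *dev = file->private_data;
	unsigned int mask = 0;

	if (dev->readable)
		mask |= POLLIN | POLLRDNORM;
	if (dev->writable)
		mask |= POLLOUT | POLLWRNORM;
	return mask & events;
}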

I implemented something similar back in December, but did so without
changing the in-kernel poll API.  See below for the patch that implements
it.  Is changing the in-kernel poll API really desirable given how many
drivers it will touch?

The tree with that as a work-in-progress is available at
git://git.kvack.org/~bcrl/aio-wip-20171215.git .  I have some time
scheduled right now to work on cleaning up that series; it also adds
support for more aio operations by using queue_work() to dispatch
operations that need helper threads.  The aio poll implementation
patch is below as an alternative approach.  It has passed some amount of
QA at Solace, where we use it for a unified event loop with various
filesystem / block AIO combined with poll on TCP sockets, unix domain
sockets and pipes.  Reworking cancellation was required to fix several
lock ordering issues that are impossible to address with the in-kernel
aio cancellation support.

-ben


> To make the interface fully useful a new io_pgetevents system call is
> added, which atomically saves and restores the signal mask over the
> io_pgetevents system call.  It is the logical equivalent of pselect and
> ppoll for io_pgetevents.
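
For context, a hedged sketch of the race that the atomic sigmask handling
closes -- the same one pselect()/ppoll() exist for.  This uses the raw
syscalls only; with a separate sigprocmask() there is a window in which a
signal is delivered but the following wait still sleeps for its full timeout:

#include <signal.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>
#include <linux/aio_abi.h>

int main(void)
{
	aio_context_t ctx = 0;
	struct io_event ev;
	struct timespec ts = { 10, 0 };
	sigset_t unblock, old;

	if (syscall(__NR_io_setup, 16, &ctx) < 0)
		return 1;

	sigemptyset(&unblock);
	sigprocmask(SIG_SETMASK, &unblock, &old);
	/*
	 * RACE: a signal delivered right here runs its handler, but the
	 * io_getevents() below still blocks for the full timeout.
	 * io_pgetevents() swaps the mask and sleeps atomically instead.
	 */
	syscall(__NR_io_getevents, ctx, 1, 1, &ev, &ts);
	sigprocmask(SIG_SETMASK, &old, NULL);

	return syscall(__NR_io_destroy, ctx) ? 1 : 0;
}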

That looks useful.  I'll have to look at this in detail.

-ben


commit a299c474b19107122eae846b53f742d876f304f9
Author: Benjamin LaHaise 
Date:   Fri Dec 15 13:19:23 2017 -0500

aio: add aio poll implementation

Some applications using AIO have a need to be able to poll on file
descriptors.  This is typically the case with libraries using
non-blocking operations inside of an application using an io_getevents()
based main event loop.  This patch implements IOCB_CMD_POLL directly
inside fs/aio.c using file_operations->poll() to insert a waiter.  This
avoids the need to use helper threads, enabling completion events to be
generated directly in interrupt or task context.
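
For illustration, a hedged userspace sketch of submitting such a poll request
through the raw aio syscalls.  The IOCB_CMD_POLL opcode value (5) and the
convention of passing the requested POLL* mask in aio_buf follow the interface
as it was later merged upstream; this particular patch may wire the fields
slightly differently:

#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/aio_abi.h>

#ifndef IOCB_CMD_POLL
#define IOCB_CMD_POLL 5			/* historical opcode value */
#endif

int main(void)
{
	aio_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	int pipefd[2];

	if (pipe(pipefd) < 0 || syscall(__NR_io_setup, 16, &ctx) < 0)
		return 1;

	memset(&cb, 0, sizeof(cb));
	cb.aio_lio_opcode = IOCB_CMD_POLL;
	cb.aio_fildes = pipefd[0];
	cb.aio_buf = POLLIN;			/* requested events */

	if (syscall(__NR_io_submit, ctx, 1, cbs) != 1)
		return 1;

	write(pipefd[1], "x", 1);		/* make the read end readable */

	/* the completion carries the returned POLL* mask in ev.res */
	if (syscall(__NR_io_getevents, ctx, 1, 1, &ev, NULL) == 1)
		printf("poll events: 0x%llx\n", (unsigned long long)ev.res);

	return syscall(__NR_io_destroy, ctx) ? 1 : 0;
}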

diff --git a/fs/aio.c b/fs/aio.c
index 3381b2e..23b9a06 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -43,6 +43,7 @@
 
 #include 
 #include 
+#include 
 
 #include "internal.h"
 
@@ -172,7 +173,7 @@ struct kioctx {
  * synchronize cancellation and completion - we only set it to KIOCB_CANCELLED
  * with xchg() or cmpxchg(), see batch_complete_aio() and kiocb_cancel().
  */
-#define KIOCB_CANCELLED((void *) (~0ULL))
+#define KIOCB_CANCELLED((void *) (~5ULL))
 
 struct aio_kiocb {
struct kiocbcommon;
@@ -205,6 +206,13 @@ struct aio_kiocb {
aio_thread_work_fn_tki_work_fn;
struct work_struct  ki_work;
 #endif
+
+   unsigned long   ki_data;
+   struct poll_table_structpoll;
+   struct tasklet_struct   tasklet;
+   int wait_idx;
+   unsignedevents;
+   struct poll_table_entry poll_wait[2];
 };
 
 /*-- sysctl variables*/
@@ -212,7 +220,7 @@ struct aio_kiocb {
 unsigned long aio_nr;  /* current system wide number of aio requests */
 unsigned long aio_max_nr = 0x10000; /* system wide maximum number of aio requests */
 #if IS_ENABLED(CONFIG_AIO_THREAD)
-unsigned long aio_auto_threads;/* Currently disabled by default */
+unsigned long aio_auto_threads = 1;/* Currently disabled by default */
 #endif
 /*end sysctl variables---*/
 
@@ -1831,6 +1839,213 @@ static ssize_t aio_write(struct aio_kiocb *kiocb, 
struct iocb *iocb, bool compat
return ret;
 }
 
+static int aio_poll_cancel_cb(struct kiocb *iocb, unsigned long data, int ret)
+{
+   struct aio_kiocb *req = container_of(iocb, struct aio_kiocb, common);
+   unsigned i;
+
+   for (i=0; i<req->wait_idx; i++) {
+   wait_queue_head_t *head = req->poll_wait[i].wait_address;
+   remove_wait_queue(head, &req->poll_wait[i].wait);
+   }
+
+   aio_complete(iocb, -EINTR, 0);
+   return 0;
+}
+
+static int aio_poll_cancel(struct kiocb *iocb, kiocb_cancel_cb_fn **cb_p,
+  unsigned long *data_p)
+{
+   *cb_p = aio_poll_cancel_cb;
+   *data_p = 0;
+   return 0;
+}
+
+stati

Re: [PATCH 2/6] move _body_io_syscall to the generic syscall.h

2018-01-05 Thread Benjamin LaHaise
On Fri, Jan 05, 2018 at 11:25:17AM -0500, Jeff Moyer wrote:
> Christoph Hellwig  writes:
> 
> > This way it can be used for the fallback 6-argument version on
> > all architectures.
> >
> > Signed-off-by: Christoph Hellwig 
> 
> This is a strange way to do things.  However, I was never really sold on
> libaio having to implement its own system call wrappers.  That decision
> definitely resulted in some maintenance overhead.
> 
> Ben, what was your reasoning for not just using syscall?

The main issue was that glibc's pthreads implementation really sucked back
during initial development and there was a use-case for having the io_XXX
functions usable directly from clone()ed threads that didn't have all the
glibc pthread state setup for per-cpu areas to handle per-thread errno.
That made sense back then, but is rather silly today.

Technically, I'm not sure the generic syscall wrapper is safe to use.  The
io_XXX arch wrappers don't modify errno, while it appears the generic one
does.  That said, nobody has ever noticed...
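
For illustration, a hedged sketch (x86-64 only, not libaio code) of the
difference: a raw arch wrapper returns the kernel's -errno and never touches
the errno variable, while a fallback built on glibc's syscall() has errno
written on failure as a side effect:

#include <errno.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/aio_abi.h>

/* arch-style wrapper: issues the syscall directly, errno is never written */
static long raw_io_setup(unsigned nr_events, aio_context_t *ctx)
{
	long ret;

	asm volatile ("syscall"
		      : "=a" (ret)
		      : "0" (__NR_io_setup), "D" (nr_events), "S" (ctx)
		      : "rcx", "r11", "memory");
	return ret;				/* 0 or -errno from the kernel */
}

/* generic fallback: glibc's syscall() sets errno and returns -1 on failure */
static long generic_io_setup(unsigned nr_events, aio_context_t *ctx)
{
	long ret = syscall(__NR_io_setup, nr_events, ctx);

	return ret < 0 ? -errno : ret;
}

int main(void)
{
	aio_context_t ctx = 0;

	errno = 0;				/* nr_events == 0 fails with EINVAL */
	printf("raw:     ret %ld errno %d\n", raw_io_setup(0, &ctx), errno);
	errno = 0;
	printf("generic: ret %ld errno %d\n", generic_io_setup(0, &ctx), errno);
	return 0;
}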

-ben

> -Jeff
> 
> > ---
> >  src/syscall-generic.h | 6 --
> >  src/syscall.h | 7 +++
> >  2 files changed, 7 insertions(+), 6 deletions(-)
> >
> > diff --git a/src/syscall-generic.h b/src/syscall-generic.h
> > index 24d7c7c..35b8580 100644
> > --- a/src/syscall-generic.h
> > +++ b/src/syscall-generic.h
> > @@ -2,12 +2,6 @@
> >  #include 
> >  #include 
> >  
> > -#define _body_io_syscall(sname, args...) \
> > -{ \
> > -   int ret = syscall(__NR_##sname, ## args); \
> > -   return ret < 0 ? -errno : ret; \
> > -}
> > -
> >  #define io_syscall1(type,fname,sname,type1,arg1) \
> >  type fname(type1 arg1) \
> >  _body_io_syscall(sname, (long)arg1)
> > diff --git a/src/syscall.h b/src/syscall.h
> > index a2da030..3819519 100644
> > --- a/src/syscall.h
> > +++ b/src/syscall.h
> > @@ -10,6 +10,13 @@
> >  #define DEFSYMVER(compat_sym, orig_sym, ver_sym)   \
> > __asm__(".symver " SYMSTR(compat_sym) "," SYMSTR(orig_sym) "@@LIBAIO_" 
> > SYMSTR(ver_sym));
> >  
> > +/* generic fallback */
> > +#define _body_io_syscall(sname, args...) \
> > +{ \
> > +   int ret = syscall(__NR_##sname, ## args); \
> > +   return ret < 0 ? -errno : ret; \
> > +}
> > +
> >  #if defined(__i386__)
> >  #include "syscall-i386.h"
> >  #elif defined(__x86_64__)
> 

-- 
"Thought is the essence of where you are now."


[PATCH] flower: check unused bits in MPLS fields

2017-05-01 Thread Benjamin LaHaise
Since several of the netlink attributes used to configure the flower
classifier's MPLS TC, BOS and Label fields have additional bits which are
unused, check those bits to ensure that they are actually 0 as suggested
by Jamal.

Signed-off-by: Benjamin LaHaise 
Cc: David Miller 
Cc: Jamal Hadi Salim 
Cc: Simon Horman 
Cc: Jakub Kicinski 
Cc: Jiri Pirko 
---
 net/sched/cls_flower.c | 32 ++--
 1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 3ecf076..ca526c0 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -439,29 +439,39 @@ static void fl_set_key_val(struct nlattr **tb,
memcpy(mask, nla_data(tb[mask_type]), len);
 }
 
-static void fl_set_key_mpls(struct nlattr **tb,
-   struct flow_dissector_key_mpls *key_val,
-   struct flow_dissector_key_mpls *key_mask)
+static int fl_set_key_mpls(struct nlattr **tb,
+  struct flow_dissector_key_mpls *key_val,
+  struct flow_dissector_key_mpls *key_mask)
 {
if (tb[TCA_FLOWER_KEY_MPLS_TTL]) {
key_val->mpls_ttl = nla_get_u8(tb[TCA_FLOWER_KEY_MPLS_TTL]);
key_mask->mpls_ttl = MPLS_TTL_MASK;
}
if (tb[TCA_FLOWER_KEY_MPLS_BOS]) {
-   key_val->mpls_bos = nla_get_u8(tb[TCA_FLOWER_KEY_MPLS_BOS]);
+   u8 bos = nla_get_u8(tb[TCA_FLOWER_KEY_MPLS_BOS]);
+
+   if (bos & ~MPLS_BOS_MASK)
+   return -EINVAL;
+   key_val->mpls_bos = bos;
key_mask->mpls_bos = MPLS_BOS_MASK;
}
if (tb[TCA_FLOWER_KEY_MPLS_TC]) {
-   key_val->mpls_tc =
-   nla_get_u8(tb[TCA_FLOWER_KEY_MPLS_TC]) & MPLS_TC_MASK;
+   u8 tc = nla_get_u8(tb[TCA_FLOWER_KEY_MPLS_TC]);
+
+   if (tc & ~MPLS_TC_MASK)
+   return -EINVAL;
+   key_val->mpls_tc = tc;
key_mask->mpls_tc = MPLS_TC_MASK;
}
if (tb[TCA_FLOWER_KEY_MPLS_LABEL]) {
-   key_val->mpls_label =
-   nla_get_u32(tb[TCA_FLOWER_KEY_MPLS_LABEL]) &
-   MPLS_LABEL_MASK;
+   u32 label = nla_get_u32(tb[TCA_FLOWER_KEY_MPLS_LABEL]);
+
+   if (label & ~MPLS_LABEL_MASK)
+   return -EINVAL;
+   key_val->mpls_label = label;
key_mask->mpls_label = MPLS_LABEL_MASK;
}
+   return 0;
 }
 
 static void fl_set_key_vlan(struct nlattr **tb,
@@ -622,7 +632,9 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
   sizeof(key->icmp.code));
} else if (key->basic.n_proto == htons(ETH_P_MPLS_UC) ||
   key->basic.n_proto == htons(ETH_P_MPLS_MC)) {
-   fl_set_key_mpls(tb, &key->mpls, &mask->mpls);
+   ret = fl_set_key_mpls(tb, &key->mpls, &mask->mpls);
+   if (ret)
+   return ret;
} else if (key->basic.n_proto == htons(ETH_P_ARP) ||
   key->basic.n_proto == htons(ETH_P_RARP)) {
fl_set_key_val(tb, &key->arp.sip, TCA_FLOWER_KEY_ARP_SIP,
-- 
2.7.4



Re: [PATCH net-next 0/2] flower: add MPLS matching support

2017-04-25 Thread Benjamin LaHaise
On Tue, Apr 25, 2017 at 08:47:00AM -0400, Jamal Hadi Salim wrote:
> On 17-04-25 07:55 AM, Simon Horman wrote:
> [..]
> > 
> > I agree something should be done wrt BOS. If the LABEL and TC are to
> > be left as-is then I think a similar treatment of BOS - that is masking it
> > - makes sense.
> > 
> > I also agree with statements made earlier in the thread that it is unlikely
> > that the unused bits of these attributes will be used - as opposed to a
> > bitmask of flag values which seems ripe for re-use for future flags.
> > 
> 
> For your use case, I think you are fine if you just do the mask in the
> kernel.  A mask applied to a user value implies "I am ignoring the rest
> of these bits - I don't care if you set them"
> 
> > I would like to add to the discussion that I think in future it would
> > be good to expand the features provided by this patch to support supplying
> > a mask as part of the match - as flower supports for other fields such
> > as IP addresses. But I think the current scheme of masking out invalid bits
> > should also work in conjunction with user-supplied masks.
> > 
> 
> The challenge we have right now is "users do stoopid or malicious
> things". So are you going to accept the wrong bitmap + mask?

I think rejecting bits in a mask that clearly cannot be set (like the 
bits above the lower 20 bits in an MPLS label) makes perfect sense.  It 
doesn't impact usability for the tools since they shouldn't be set (and 
actually can't be in the iproute2 changes).

-ben

> cheers,
> jamal
> 


Re: [PATCH net-next 0/2] flower: add MPLS matching support

2017-04-24 Thread Benjamin LaHaise
On Mon, Apr 24, 2017 at 08:58:18PM -0400, Jamal Hadi Salim wrote:
> On 17-04-24 02:32 PM, David Miller wrote:
> > From: Benjamin LaHaise 
> 
> > 
> > Series applied, but in the future:
> > 
> > 1) Put the "v2", "v3", whatever in the initial [PATCH xxx] bracketed
> >part of the subject line, otherwise it ends up in the GIT commit
> >message header line and that's not desired.
> > 
> > 2) Please cut it with these double signoffs, and add an appropriate
> >entry to the email aliases file.
> 
> I know i should have spoken earlier, wanted to but got distracted - but
> shouldn't the new rules have applied to this patch too? ;->
> 
> You have 3 TLVs, one of which is u8 that only allows use of 3 bits.
> The other is a u32 which allows only 20 bits to be set.

What are the new rules for TLVs -- do you want a new type added for the 
subset values that are limited to the number of bits actually used?  A 
pointer here would be helpful.

-ben

> cheers,
> jamal


[PATCH net-next 2/2] cls_flower: add support for matching MPLS fields (v2)

2017-04-22 Thread Benjamin LaHaise
Add support to the tc flower classifier to match based on fields in MPLS
labels (TTL, Bottom of Stack, TC field, Label).

Signed-off-by: Benjamin LaHaise 
Signed-off-by: Benjamin LaHaise 
Reviewed-by: Jakub Kicinski 
Cc: "David S. Miller" 
Cc: Simon Horman 
Cc: Jamal Hadi Salim 
Cc: Cong Wang 
Cc: Jiri Pirko 
Cc: Eric Dumazet 
Cc: Hadar Hen Zion 
Cc: Gao Feng 
---
 include/uapi/linux/pkt_cls.h |  5 +++
 net/sched/cls_flower.c   | 74 
 2 files changed, 79 insertions(+)

diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 7a69f2a..f1129e3 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -432,6 +432,11 @@ enum {
TCA_FLOWER_KEY_ARP_THA, /* ETH_ALEN */
TCA_FLOWER_KEY_ARP_THA_MASK,/* ETH_ALEN */
 
+   TCA_FLOWER_KEY_MPLS_TTL,/* u8 - 8 bits */
+   TCA_FLOWER_KEY_MPLS_BOS,/* u8 - 1 bit */
+   TCA_FLOWER_KEY_MPLS_TC, /* u8 - 3 bits */
+   TCA_FLOWER_KEY_MPLS_LABEL,  /* be32 - 20 bits */
+
__TCA_FLOWER_MAX,
 };
 
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 31ee340..3ecf076 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -47,6 +48,7 @@ struct fl_flow_key {
struct flow_dissector_key_ipv6_addrs enc_ipv6;
};
struct flow_dissector_key_ports enc_tp;
+   struct flow_dissector_key_mpls mpls;
 } __aligned(BITS_PER_LONG / 8); /* Ensure that we can do comparisons as longs. 
*/
 
 struct fl_flow_mask_range {
@@ -418,6 +420,10 @@ static const struct nla_policy fl_policy[TCA_FLOWER_MAX + 
1] = {
[TCA_FLOWER_KEY_ARP_SHA_MASK]   = { .len = ETH_ALEN },
[TCA_FLOWER_KEY_ARP_THA]= { .len = ETH_ALEN },
[TCA_FLOWER_KEY_ARP_THA_MASK]   = { .len = ETH_ALEN },
+   [TCA_FLOWER_KEY_MPLS_TTL]   = { .type = NLA_U8 },
+   [TCA_FLOWER_KEY_MPLS_BOS]   = { .type = NLA_U8 },
+   [TCA_FLOWER_KEY_MPLS_TC]= { .type = NLA_U8 },
+   [TCA_FLOWER_KEY_MPLS_LABEL] = { .type = NLA_U32 },
 };
 
 static void fl_set_key_val(struct nlattr **tb,
@@ -433,6 +439,31 @@ static void fl_set_key_val(struct nlattr **tb,
memcpy(mask, nla_data(tb[mask_type]), len);
 }
 
+static void fl_set_key_mpls(struct nlattr **tb,
+   struct flow_dissector_key_mpls *key_val,
+   struct flow_dissector_key_mpls *key_mask)
+{
+   if (tb[TCA_FLOWER_KEY_MPLS_TTL]) {
+   key_val->mpls_ttl = nla_get_u8(tb[TCA_FLOWER_KEY_MPLS_TTL]);
+   key_mask->mpls_ttl = MPLS_TTL_MASK;
+   }
+   if (tb[TCA_FLOWER_KEY_MPLS_BOS]) {
+   key_val->mpls_bos = nla_get_u8(tb[TCA_FLOWER_KEY_MPLS_BOS]);
+   key_mask->mpls_bos = MPLS_BOS_MASK;
+   }
+   if (tb[TCA_FLOWER_KEY_MPLS_TC]) {
+   key_val->mpls_tc =
+   nla_get_u8(tb[TCA_FLOWER_KEY_MPLS_TC]) & MPLS_TC_MASK;
+   key_mask->mpls_tc = MPLS_TC_MASK;
+   }
+   if (tb[TCA_FLOWER_KEY_MPLS_LABEL]) {
+   key_val->mpls_label =
+   nla_get_u32(tb[TCA_FLOWER_KEY_MPLS_LABEL]) &
+   MPLS_LABEL_MASK;
+   key_mask->mpls_label = MPLS_LABEL_MASK;
+   }
+}
+
 static void fl_set_key_vlan(struct nlattr **tb,
struct flow_dissector_key_vlan *key_val,
struct flow_dissector_key_vlan *key_mask)
@@ -589,6 +620,9 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
   &mask->icmp.code,
   TCA_FLOWER_KEY_ICMPV6_CODE_MASK,
   sizeof(key->icmp.code));
+   } else if (key->basic.n_proto == htons(ETH_P_MPLS_UC) ||
+  key->basic.n_proto == htons(ETH_P_MPLS_MC)) {
+   fl_set_key_mpls(tb, &key->mpls, &mask->mpls);
} else if (key->basic.n_proto == htons(ETH_P_ARP) ||
   key->basic.n_proto == htons(ETH_P_RARP)) {
fl_set_key_val(tb, &key->arp.sip, TCA_FLOWER_KEY_ARP_SIP,
@@ -725,6 +759,8 @@ static void fl_init_dissector(struct cls_fl_head *head,
FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
 FLOW_DISSECTOR_KEY_ARP, arp);
FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
+FLOW_DISSECTOR_KEY_MPLS, mpls);
+   FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
 FLOW_DISSECTOR_KEY_VLAN, vlan);
FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
 FLOW_DISSECTOR_KEY_ENC_KEYID, enc_key_id);
@@ -991,6 +1027,41 @@ static int fl_dump_key_val(struct sk_buff *skb,
return 0;
 }
 
+stat

[PATCH net-next 0/2] flower: add MPLS matching support

2017-04-22 Thread Benjamin LaHaise
From: Benjamin LaHaise 

This patch series adds support for parsing MPLS flows in the flow dissector
and the flower classifier.  Each of the MPLS TTL, BOS, TC and Label fields
can be used for matching.
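
For reference, a small standalone sketch of how those four fields pack into a
32-bit MPLS label stack entry (RFC 3032 layout); the masks and shifts mirror
the MPLS_LS_* definitions in the kernel's uapi mpls.h that these patches
build on:

#include <stdint.h>
#include <stdio.h>

#define LS_LABEL_MASK	0xFFFFF000u		/* bits 31..12 */
#define LS_LABEL_SHIFT	12
#define LS_TC_MASK	0x00000E00u		/* bits 11..9  */
#define LS_TC_SHIFT	9
#define LS_S_MASK	0x00000100u		/* bit 8 (bottom of stack) */
#define LS_S_SHIFT	8
#define LS_TTL_MASK	0x000000FFu		/* bits 7..0   */

int main(void)
{
	/* label 16, TC 5, bottom-of-stack set, TTL 64 */
	uint32_t entry = (16u << LS_LABEL_SHIFT) | (5u << LS_TC_SHIFT) |
			 (1u << LS_S_SHIFT) | 64u;

	printf("label %u tc %u bos %u ttl %u\n",
	       (entry & LS_LABEL_MASK) >> LS_LABEL_SHIFT,
	       (entry & LS_TC_MASK) >> LS_TC_SHIFT,
	       (entry & LS_S_MASK) >> LS_S_SHIFT,
	       entry & LS_TTL_MASK);
	return 0;
}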

v2: incorporate style feedback, move #defines to include/linux/mpls.h
Note: this omits Jiri's request to remove tabs between the type and 
field names in struct declarations.  This would be inconsistent with 
numerous other struct definitions.

Benjamin LaHaise (2):
  flow_dissector: add mpls support (v2)
  cls_flower: add support for matching MPLS fields (v2)

 include/linux/mpls.h |  5 +++
 include/net/flow_dissector.h |  8 +
 include/uapi/linux/pkt_cls.h |  5 +++
 net/core/flow_dissector.c| 25 +--
 net/sched/cls_flower.c   | 74 
 5 files changed, 114 insertions(+), 3 deletions(-)

-- 
2.7.4



[PATCH net-next 1/2] flow_dissector: add mpls support (v2)

2017-04-22 Thread Benjamin LaHaise
Add support for parsing MPLS flows to the flow dissector in preparation for
adding MPLS match support to cls_flower.

Signed-off-by: Benjamin LaHaise 
Signed-off-by: Benjamin LaHaise 
Reviewed-by: Jakub Kicinski 
Cc: "David S. Miller" 
Cc: Simon Horman 
Cc: Jamal Hadi Salim 
Cc: Cong Wang 
Cc: Jiri Pirko 
Cc: Eric Dumazet 
Cc: Hadar Hen Zion 
Cc: Gao Feng 
---
 include/linux/mpls.h |  5 +
 include/net/flow_dissector.h |  8 
 net/core/flow_dissector.c| 25 ++---
 3 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/include/linux/mpls.h b/include/linux/mpls.h
index 145..384fb22 100644
--- a/include/linux/mpls.h
+++ b/include/linux/mpls.h
@@ -3,4 +3,9 @@
 
 #include 
 
+#define MPLS_TTL_MASK  (MPLS_LS_TTL_MASK >> MPLS_LS_TTL_SHIFT)
+#define MPLS_BOS_MASK  (MPLS_LS_S_MASK >> MPLS_LS_S_SHIFT)
+#define MPLS_TC_MASK   (MPLS_LS_TC_MASK >> MPLS_LS_TC_SHIFT)
+#define MPLS_LABEL_MASK(MPLS_LS_LABEL_MASK >> 
MPLS_LS_LABEL_SHIFT)
+
 #endif  /* _LINUX_MPLS_H */
diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index ac97030..8d21d44 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -41,6 +41,13 @@ struct flow_dissector_key_vlan {
u16 padding;
 };
 
+struct flow_dissector_key_mpls {
+   u32 mpls_ttl:8,
+   mpls_bos:1,
+   mpls_tc:3,
+   mpls_label:20;
+};
+
 struct flow_dissector_key_keyid {
__be32  keyid;
 };
@@ -169,6 +176,7 @@ enum flow_dissector_key_id {
FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS, /* struct 
flow_dissector_key_ipv6_addrs */
FLOW_DISSECTOR_KEY_ENC_CONTROL, /* struct flow_dissector_key_control */
FLOW_DISSECTOR_KEY_ENC_PORTS, /* struct flow_dissector_key_ports */
+   FLOW_DISSECTOR_KEY_MPLS, /* struct flow_dissector_key_mpls */
 
FLOW_DISSECTOR_KEY_MAX,
 };
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index c9cf425..28d94bc 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -126,9 +126,11 @@ __skb_flow_dissect_mpls(const struct sk_buff *skb,
 {
struct flow_dissector_key_keyid *key_keyid;
struct mpls_label *hdr, _hdr[2];
+   u32 entry, label;
 
if (!dissector_uses_key(flow_dissector,
-   FLOW_DISSECTOR_KEY_MPLS_ENTROPY))
+   FLOW_DISSECTOR_KEY_MPLS_ENTROPY) &&
+   !dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_MPLS))
return FLOW_DISSECT_RET_OUT_GOOD;
 
hdr = __skb_header_pointer(skb, nhoff, sizeof(_hdr), data,
@@ -136,8 +138,25 @@ __skb_flow_dissect_mpls(const struct sk_buff *skb,
if (!hdr)
return FLOW_DISSECT_RET_OUT_BAD;
 
-   if ((ntohl(hdr[0].entry) & MPLS_LS_LABEL_MASK) >>
-   MPLS_LS_LABEL_SHIFT == MPLS_LABEL_ENTROPY) {
+   entry = ntohl(hdr[0].entry);
+   label = (entry & MPLS_LS_LABEL_MASK) >> MPLS_LS_LABEL_SHIFT;
+
+   if (dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_MPLS)) {
+   struct flow_dissector_key_mpls *key_mpls;
+
+   key_mpls = skb_flow_dissector_target(flow_dissector,
+FLOW_DISSECTOR_KEY_MPLS,
+target_container);
+   key_mpls->mpls_label = label;
+   key_mpls->mpls_ttl = (entry & MPLS_LS_TTL_MASK)
+   >> MPLS_LS_TTL_SHIFT;
+   key_mpls->mpls_tc = (entry & MPLS_LS_TC_MASK)
+   >> MPLS_LS_TC_SHIFT;
+   key_mpls->mpls_bos = (entry & MPLS_LS_S_MASK)
+   >> MPLS_LS_S_SHIFT;
+   }
+
+   if (label == MPLS_LABEL_ENTROPY) {
key_keyid = skb_flow_dissector_target(flow_dissector,
  
FLOW_DISSECTOR_KEY_MPLS_ENTROPY,
  target_container);
-- 
2.7.4



Re: [PATCH net-next 2/2] cls_flower: add support for matching MPLS labels

2017-03-27 Thread Benjamin LaHaise
On Mon, Mar 27, 2017 at 10:30:41PM +0200, Jiri Pirko wrote:
> Mon, Mar 27, 2017 at 08:16:02PM CEST, benjamin.laha...@netronome.com wrote:
> >Add support to tc flower to match based on fields in MPLS labels (TTL, 
> >Bottom of Stack, TC field, Label).
> 
> Please use scripts/get_maintainer.pl to get list of ccs for the patches
> you submit.

Oops.  Adding Jamal to the Cc -- please holler if you want me to resend.

-ben

> 
> >
> >Signed-off-by: Benjamin LaHaise 
> >Signed-off-by: Benjamin LaHaise 
> >Reviewed-by: Simon Horman 
> >Reviewed-by: Jakub Kicinski 
> >
> >diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
> >index 7a69f2a..f1129e3 100644
> >--- a/include/uapi/linux/pkt_cls.h
> >+++ b/include/uapi/linux/pkt_cls.h
> >@@ -432,6 +432,11 @@ enum {
> > TCA_FLOWER_KEY_ARP_THA, /* ETH_ALEN */
> > TCA_FLOWER_KEY_ARP_THA_MASK,/* ETH_ALEN */
> > 
> >+TCA_FLOWER_KEY_MPLS_TTL,/* u8 - 8 bits */
> >+TCA_FLOWER_KEY_MPLS_BOS,/* u8 - 1 bit */
> >+TCA_FLOWER_KEY_MPLS_TC, /* u8 - 3 bits */
> >+TCA_FLOWER_KEY_MPLS_LABEL,  /* be32 - 20 bits */
> >+
> > __TCA_FLOWER_MAX,
> > };
> > 
> >diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
> >index 9d0c99d..24619f9 100644
> >--- a/net/sched/cls_flower.c
> >+++ b/net/sched/cls_flower.c
> >@@ -18,6 +18,7 @@
> > #include 
> > #include 
> > #include 
> >+#include 
> > 
> > #include 
> > #include 
> >@@ -47,6 +48,7 @@ struct fl_flow_key {
> > struct flow_dissector_key_ipv6_addrs enc_ipv6;
> > };
> > struct flow_dissector_key_ports enc_tp;
> >+struct flow_dissector_key_mpls mpls;
> > } __aligned(BITS_PER_LONG / 8); /* Ensure that we can do comparisons as 
> > longs. */
> > 
> > struct fl_flow_mask_range {
> >@@ -423,6 +425,10 @@ static const struct nla_policy fl_policy[TCA_FLOWER_MAX 
> >+ 1] = {
> > [TCA_FLOWER_KEY_ARP_SHA_MASK]   = { .len = ETH_ALEN },
> > [TCA_FLOWER_KEY_ARP_THA]= { .len = ETH_ALEN },
> > [TCA_FLOWER_KEY_ARP_THA_MASK]   = { .len = ETH_ALEN },
> >+[TCA_FLOWER_KEY_MPLS_TTL]   = { .type = NLA_U8 },
> >+[TCA_FLOWER_KEY_MPLS_BOS]   = { .type = NLA_U8 },
> >+[TCA_FLOWER_KEY_MPLS_TC]= { .type = NLA_U8 },
> >+[TCA_FLOWER_KEY_MPLS_LABEL] = { .type = NLA_U32 },
> > };
> > 
> > static void fl_set_key_val(struct nlattr **tb,
> >@@ -438,6 +444,36 @@ static void fl_set_key_val(struct nlattr **tb,
> > memcpy(mask, nla_data(tb[mask_type]), len);
> > }
> > 
> >+static void fl_set_key_mpls(struct nlattr **tb,
> >+struct flow_dissector_key_mpls *key_val,
> >+struct flow_dissector_key_mpls *key_mask)
> >+{
> >+#define MPLS_TTL_MASK   (MPLS_LS_TTL_MASK >> MPLS_LS_TTL_SHIFT)
> >+#define MPLS_BOS_MASK   (MPLS_LS_S_MASK >> MPLS_LS_S_SHIFT)
> >+#define MPLS_TC_MASK(MPLS_LS_TC_MASK >> MPLS_LS_TC_SHIFT)
> >+#define MPLS_LABEL_MASK (MPLS_LS_LABEL_MASK >> MPLS_LS_LABEL_SHIFT)
> >+
> >+if (tb[TCA_FLOWER_KEY_MPLS_TTL]) {
> >+key_val->mpls_ttl = nla_get_u8(tb[TCA_FLOWER_KEY_MPLS_TTL]);
> >+key_mask->mpls_ttl = MPLS_TTL_MASK;
> >+}
> >+if (tb[TCA_FLOWER_KEY_MPLS_BOS]) {
> >+key_val->mpls_bos = nla_get_u8(tb[TCA_FLOWER_KEY_MPLS_BOS]);
> >+key_mask->mpls_bos = MPLS_BOS_MASK;
> >+}
> >+if (tb[TCA_FLOWER_KEY_MPLS_TC]) {
> >+key_val->mpls_tc =
> >+nla_get_u8(tb[TCA_FLOWER_KEY_MPLS_TC]) & MPLS_TC_MASK;
> >+key_mask->mpls_tc = MPLS_TC_MASK;
> >+}
> >+if (tb[TCA_FLOWER_KEY_MPLS_LABEL]) {
> >+key_val->mpls_label =
> >+nla_get_u32(tb[TCA_FLOWER_KEY_MPLS_LABEL]) &
> >+MPLS_LABEL_MASK;
> >+key_mask->mpls_label = MPLS_LABEL_MASK;
> >+}
> >+}
> >+
> > static void fl_set_key_vlan(struct nlattr **tb,
> > struct flow_dissector_key_vlan *key_val,
> > struct flow_dissector_key_vlan *key_mask)
> >@@ -594,6 +630,9 @@ static int fl_set_key(struct net *net, struct nlattr 
> >**tb,
> >&mask->icmp.code,
> >TCA_FLOWER_KEY_ICMPV6_CODE_MASK,
> >  

[RFC PATCH iproute2 net-next] tc flower: support for matching MPLS

2017-03-27 Thread Benjamin LaHaise
[RFC until kernel code is accepted/rejected]

This patch adds support to the iproute2 tc filter command for matching MPLS
labels in the flower classifier.  The ability to match the Time To Live,
Bottom Of Stack, Traffic Control and Label fields are added as options to
the flower filter.

Signed-off-by: Benjamin LaHaise 
Signed-off-by: Benjamin LaHaise 
Reviewed-by: Simon Horman 
Reviewed-by: Jakub Kicinski 

diff --git a/include/linux/pkt_cls.h b/include/linux/pkt_cls.h
index 7a69f2a..f1129e3 100644
--- a/include/linux/pkt_cls.h
+++ b/include/linux/pkt_cls.h
@@ -432,6 +432,11 @@ enum {
TCA_FLOWER_KEY_ARP_THA, /* ETH_ALEN */
TCA_FLOWER_KEY_ARP_THA_MASK,/* ETH_ALEN */
 
+   TCA_FLOWER_KEY_MPLS_TTL,/* u8 - 8 bits */
+   TCA_FLOWER_KEY_MPLS_BOS,/* u8 - 1 bit */
+   TCA_FLOWER_KEY_MPLS_TC, /* u8 - 3 bits */
+   TCA_FLOWER_KEY_MPLS_LABEL,  /* be32 - 20 bits */
+
__TCA_FLOWER_MAX,
 };
 
diff --git a/man/man8/tc-flower.8 b/man/man8/tc-flower.8
index fc5bac5..6ce2d56 100644
--- a/man/man8/tc-flower.8
+++ b/man/man8/tc-flower.8
@@ -29,6 +29,14 @@ flower \- flow based traffic control filter
 .IR PRIORITY " | "
 .BR vlan_ethtype " { " ipv4 " | " ipv6 " | "
 .IR ETH_TYPE " } | "
+.B mpls_label
+.IR LABEL " | "
+.B mpls_tc
+.IR TC " | "
+.B mpls_bos
+.IR BOS " | "
+.B mpls_ttl
+.IR TTL " | "
 .BR ip_proto " { " tcp " | " udp " | " sctp " | " icmp " | " icmpv6 " | "
 .IR IP_PROTO " } | { "
 .BR dst_ip " | " src_ip " } "
@@ -113,6 +121,27 @@ may be either
 .BR ipv4 ", " ipv6
 or an unsigned 16bit value in hexadecimal format.
 .TP
+.BI mpls_label " LABEL"
+Match the outermost MPLS label id in an MPLS packet.
+.I LABEL
+is an unsigned 20 bit value in decimal format.
+.TP
+.BI mpls_tc " TC"
+Match on the MPLS TC field, which is typically used for packet priority.
+.I TC
+is an unsigned 3 bit value in decimal format.
+.TP
+.BI mpls_bos " BOS"
+Match on the MPLS Bottom Of Stack field in the outermost MPLS label.
+.I BOS
+is a 1 bit value.
+.TP
+.BI mpls_ttl " TTL"
+Match on the MPLS Time To Live field.
+.I TTL
+is an unsigned 8 bit value which matches the MPLS TTL field in the outermost
+label.
+.TP
 .BI ip_proto " IP_PROTO"
 Match on layer four protocol.
 .I IP_PROTO
diff --git a/tc/f_flower.c b/tc/f_flower.c
index 5aac4a0..af9ef3d 100644
--- a/tc/f_flower.c
+++ b/tc/f_flower.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "utils.h"
 #include "tc_util.h"
@@ -53,6 +54,10 @@ static void explain(void)
"   dst_mac MASKED-LLADDR |\n"
"   src_mac MASKED-LLADDR |\n"
"   ip_proto [tcp | udp | sctp | icmp | 
icmpv6 | IP-PROTO ] |\n"
+   "   mpls_label LABEL |\n"
+   "   mpls_tc TC |\n"
+   "   mpls_bos BOS |\n"
+   "   mpls_ttl TTL |\n"
"   dst_ip PREFIX |\n"
"   src_ip PREFIX |\n"
"   dst_port PORT-NUMBER |\n"
@@ -599,6 +604,70 @@ static int flower_parse_opt(struct filter_util *qu, char 
*handle,
 &vlan_ethtype, n);
if (ret < 0)
return -1;
+   } else if (matches(*argv, "mpls_label") == 0) {
+   __u32 label;
+
+   NEXT_ARG();
+   if (eth_type != htons(ETH_P_MPLS_UC) &&
+   eth_type != htons(ETH_P_MPLS_MC)) {
+   fprintf(stderr,
+   "Can't set \"mpls_label\" if ethertype 
isn't MPLS\n");
+   return -1;
+   }
+   ret = get_u32(&label, *argv, 10);
+   if (ret < 0 || label & ~(MPLS_LS_LABEL_MASK >> 
MPLS_LS_LABEL_SHIFT)) {
+   fprintf(stderr, "Illegal \"mpls_label\"\n");
+   return -1;
+   }
+   addattr32(n, MAX_MSG, TCA_FLOWER_KEY_MPLS_LABEL, label);
+   } else if (matches(*argv, "mpls_tc") == 0) {
+   __u8 tc;
+
+   NEXT_ARG();
+   if (eth_type != htons(ETH_P_MPLS_UC) &&
+  

[PATCH net-next 2/2] cls_flower: add support for matching MPLS labels

2017-03-27 Thread Benjamin LaHaise
Add support to tc flower to match based on fields in MPLS labels (TTL, 
Bottom of Stack, TC field, Label).

Signed-off-by: Benjamin LaHaise 
Signed-off-by: Benjamin LaHaise 
Reviewed-by: Simon Horman 
Reviewed-by: Jakub Kicinski 

diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 7a69f2a..f1129e3 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -432,6 +432,11 @@ enum {
TCA_FLOWER_KEY_ARP_THA, /* ETH_ALEN */
TCA_FLOWER_KEY_ARP_THA_MASK,/* ETH_ALEN */
 
+   TCA_FLOWER_KEY_MPLS_TTL,/* u8 - 8 bits */
+   TCA_FLOWER_KEY_MPLS_BOS,/* u8 - 1 bit */
+   TCA_FLOWER_KEY_MPLS_TC, /* u8 - 3 bits */
+   TCA_FLOWER_KEY_MPLS_LABEL,  /* be32 - 20 bits */
+
__TCA_FLOWER_MAX,
 };
 
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 9d0c99d..24619f9 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -47,6 +48,7 @@ struct fl_flow_key {
struct flow_dissector_key_ipv6_addrs enc_ipv6;
};
struct flow_dissector_key_ports enc_tp;
+   struct flow_dissector_key_mpls mpls;
 } __aligned(BITS_PER_LONG / 8); /* Ensure that we can do comparisons as longs. 
*/
 
 struct fl_flow_mask_range {
@@ -423,6 +425,10 @@ static const struct nla_policy fl_policy[TCA_FLOWER_MAX + 
1] = {
[TCA_FLOWER_KEY_ARP_SHA_MASK]   = { .len = ETH_ALEN },
[TCA_FLOWER_KEY_ARP_THA]= { .len = ETH_ALEN },
[TCA_FLOWER_KEY_ARP_THA_MASK]   = { .len = ETH_ALEN },
+   [TCA_FLOWER_KEY_MPLS_TTL]   = { .type = NLA_U8 },
+   [TCA_FLOWER_KEY_MPLS_BOS]   = { .type = NLA_U8 },
+   [TCA_FLOWER_KEY_MPLS_TC]= { .type = NLA_U8 },
+   [TCA_FLOWER_KEY_MPLS_LABEL] = { .type = NLA_U32 },
 };
 
 static void fl_set_key_val(struct nlattr **tb,
@@ -438,6 +444,36 @@ static void fl_set_key_val(struct nlattr **tb,
memcpy(mask, nla_data(tb[mask_type]), len);
 }
 
+static void fl_set_key_mpls(struct nlattr **tb,
+   struct flow_dissector_key_mpls *key_val,
+   struct flow_dissector_key_mpls *key_mask)
+{
+#define MPLS_TTL_MASK  (MPLS_LS_TTL_MASK >> MPLS_LS_TTL_SHIFT)
+#define MPLS_BOS_MASK  (MPLS_LS_S_MASK >> MPLS_LS_S_SHIFT)
+#define MPLS_TC_MASK   (MPLS_LS_TC_MASK >> MPLS_LS_TC_SHIFT)
+#define MPLS_LABEL_MASK(MPLS_LS_LABEL_MASK >> MPLS_LS_LABEL_SHIFT)
+
+   if (tb[TCA_FLOWER_KEY_MPLS_TTL]) {
+   key_val->mpls_ttl = nla_get_u8(tb[TCA_FLOWER_KEY_MPLS_TTL]);
+   key_mask->mpls_ttl = MPLS_TTL_MASK;
+   }
+   if (tb[TCA_FLOWER_KEY_MPLS_BOS]) {
+   key_val->mpls_bos = nla_get_u8(tb[TCA_FLOWER_KEY_MPLS_BOS]);
+   key_mask->mpls_bos = MPLS_BOS_MASK;
+   }
+   if (tb[TCA_FLOWER_KEY_MPLS_TC]) {
+   key_val->mpls_tc =
+   nla_get_u8(tb[TCA_FLOWER_KEY_MPLS_TC]) & MPLS_TC_MASK;
+   key_mask->mpls_tc = MPLS_TC_MASK;
+   }
+   if (tb[TCA_FLOWER_KEY_MPLS_LABEL]) {
+   key_val->mpls_label =
+   nla_get_u32(tb[TCA_FLOWER_KEY_MPLS_LABEL]) &
+   MPLS_LABEL_MASK;
+   key_mask->mpls_label = MPLS_LABEL_MASK;
+   }
+}
+
 static void fl_set_key_vlan(struct nlattr **tb,
struct flow_dissector_key_vlan *key_val,
struct flow_dissector_key_vlan *key_mask)
@@ -594,6 +630,9 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
   &mask->icmp.code,
   TCA_FLOWER_KEY_ICMPV6_CODE_MASK,
   sizeof(key->icmp.code));
+   } else if (key->basic.n_proto == htons(ETH_P_MPLS_UC) ||
+  key->basic.n_proto == htons(ETH_P_MPLS_MC)) {
+   fl_set_key_mpls(tb, &key->mpls, &mask->mpls);
} else if (key->basic.n_proto == htons(ETH_P_ARP) ||
   key->basic.n_proto == htons(ETH_P_RARP)) {
fl_set_key_val(tb, &key->arp.sip, TCA_FLOWER_KEY_ARP_SIP,
@@ -730,6 +769,8 @@ static void fl_init_dissector(struct cls_fl_head *head,
FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
 FLOW_DISSECTOR_KEY_ARP, arp);
FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
+FLOW_DISSECTOR_KEY_MPLS, mpls);
+   FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
 FLOW_DISSECTOR_KEY_VLAN, vlan);
FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
 FLOW_DISSECTOR_KEY_ENC_KEYID, enc_key_id);
@@ -994,6 +1035,41 @@ static int fl_dump_key_val(struct sk_buff *skb,
return 0;
 }
 
+static int fl_dump_key_mpl

[PATCH net-next 1/2] flow_dissector: add mpls support

2017-03-27 Thread Benjamin LaHaise
Add support for parsing MPLS flows to the flow dissector in preparation for
adding MPLS match support to cls_flower.

Signed-off-by: Benjamin LaHaise 
Signed-off-by: Benjamin LaHaise 
Reviewed-by: Simon Horman 
Reviewed-by: Jakub Kicinski 

diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index ac97030..00d704f 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -41,6 +41,13 @@ struct flow_dissector_key_vlan {
u16 padding;
 };
 
+struct flow_dissector_key_mpls {
+   u32 mpls_ttl : 8,
+   mpls_bos : 1,
+   mpls_tc : 3,
+   mpls_label : 20;
+};
+
 struct flow_dissector_key_keyid {
__be32  keyid;
 };
@@ -169,6 +176,7 @@ enum flow_dissector_key_id {
FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS, /* struct 
flow_dissector_key_ipv6_addrs */
FLOW_DISSECTOR_KEY_ENC_CONTROL, /* struct flow_dissector_key_control */
FLOW_DISSECTOR_KEY_ENC_PORTS, /* struct flow_dissector_key_ports */
+   FLOW_DISSECTOR_KEY_MPLS, /* struct flow_dissector_key_mpls */
 
FLOW_DISSECTOR_KEY_MAX,
 };
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 5f3ae92..15185d8 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -126,9 +126,11 @@ __skb_flow_dissect_mpls(const struct sk_buff *skb,
 {
struct flow_dissector_key_keyid *key_keyid;
struct mpls_label *hdr, _hdr[2];
+   u32 entry, label;
 
if (!dissector_uses_key(flow_dissector,
-   FLOW_DISSECTOR_KEY_MPLS_ENTROPY))
+   FLOW_DISSECTOR_KEY_MPLS_ENTROPY) &&
+   !dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_MPLS))
return FLOW_DISSECT_RET_OUT_GOOD;
 
hdr = __skb_header_pointer(skb, nhoff, sizeof(_hdr), data,
@@ -136,8 +138,25 @@ __skb_flow_dissect_mpls(const struct sk_buff *skb,
if (!hdr)
return FLOW_DISSECT_RET_OUT_BAD;
 
-   if ((ntohl(hdr[0].entry) & MPLS_LS_LABEL_MASK) >>
-   MPLS_LS_LABEL_SHIFT == MPLS_LABEL_ENTROPY) {
+   entry = ntohl(hdr[0].entry);
+   label = (entry & MPLS_LS_LABEL_MASK) >> MPLS_LS_LABEL_SHIFT;
+
+   if (dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_MPLS)) {
+   struct flow_dissector_key_mpls *key_mpls;
+
+   key_mpls = skb_flow_dissector_target(flow_dissector,
+FLOW_DISSECTOR_KEY_MPLS,
+target_container);
+   key_mpls->mpls_label = label;
+   key_mpls->mpls_ttl = (entry & MPLS_LS_TTL_MASK)
+   >> MPLS_LS_TTL_SHIFT;
+   key_mpls->mpls_tc = (entry & MPLS_LS_TC_MASK)
+   >> MPLS_LS_TC_SHIFT;
+   key_mpls->mpls_bos = (entry & MPLS_LS_S_MASK)
+   >> MPLS_LS_S_SHIFT;
+   }
+
+   if (label == MPLS_LABEL_ENTROPY) {
key_keyid = skb_flow_dissector_target(flow_dissector,
  
FLOW_DISSECTOR_KEY_MPLS_ENTROPY,
  target_container);



[PATCH v2 iproute2] f_flower: don't set TCA_FLOWER_KEY_ETH_TYPE for "protocol all"

2017-01-20 Thread Benjamin LaHaise
v2 - update to address changes in 00697ca19ae3e1118f2af82c3b41ac4335fe918b.

When using the tc flower filter, rules marked with "protocol all" do not
actually match all packets.  This is due to a bug in f_flower.c that passes
in ETH_P_ALL in the TCA_FLOWER_KEY_ETH_TYPE attribute when adding a rule.
Fix this by omitting TCA_FLOWER_KEY_ETH_TYPE if the protocol is set to
ETH_P_ALL.

Fixes: 488b41d020fb ("tc: flower no need to specify the ethertype")
Cc: Jamal Hadi Salim 
Signed-off-by: Benjamin LaHaise 
Signed-off-by: Benjamin LaHaise 

diff --git a/tc/f_flower.c b/tc/f_flower.c
index 314c2dd..145a856 100644
--- a/tc/f_flower.c
+++ b/tc/f_flower.c
@@ -529,9 +529,11 @@ parse_done:
if (ret)
return ret;
 
-   ret = addattr16(n, MAX_MSG, TCA_FLOWER_KEY_ETH_TYPE, eth_type);
-   if (ret)
-   return ret;
+   if (eth_type != htons(ETH_P_ALL)) {
+   ret = addattr16(n, MAX_MSG, TCA_FLOWER_KEY_ETH_TYPE, eth_type);
+   if (ret)
+   return ret;
+   }
 
tail->rta_len = (((void *)n)+n->nlmsg_len) - (void *)tail;
 


[PATCH iproute2] f_flower: don't set TCA_FLOWER_KEY_ETH_TYPE for "protocol all"

2017-01-19 Thread Benjamin LaHaise
When using the tc flower filter, rules marked with "protocol all" do not
actually match all packets.  This is due to a bug in f_flower.c that passes
in ETH_P_ALL in the TCA_FLOWER_KEY_ETH_TYPE attribute when adding a rule.
Fix this by omitting TCA_FLOWER_KEY_ETH_TYPE if the protocol is set to
ETH_P_ALL.

Signed-off-by: Benjamin LaHaise 
Signed-off-by: Benjamin LaHaise 

diff --git a/tc/f_flower.c b/tc/f_flower.c
index 1dbc532..1f90da3 100644
--- a/tc/f_flower.c
+++ b/tc/f_flower.c
@@ -527,11 +527,13 @@ static int flower_parse_opt(struct filter_util *qu, char 
*handle,
 parse_done:
addattr32(n, MAX_MSG, TCA_FLOWER_FLAGS, flags);
 
-   ret = addattr16(n, MAX_MSG, TCA_FLOWER_KEY_ETH_TYPE, eth_type);
-   if (ret) {
-   fprintf(stderr, "Illegal \"eth_type\"(0x%x)\n",
-   ntohs(eth_type));
-   return -1;
+   if (eth_type != htons(ETH_P_ALL)) {
+   ret = addattr16(n, MAX_MSG, TCA_FLOWER_KEY_ETH_TYPE, eth_type);
+   if (ret) {
+   fprintf(stderr, "Illegal \"eth_type\"(0x%x)\n",
+   ntohs(eth_type));
+   return -1;
+   }
}
 
tail->rta_len = (((void *)n)+n->nlmsg_len) - (void *)tail;


Re: use-after-free in sock_wake_async

2015-11-24 Thread Benjamin LaHaise
On Tue, Nov 24, 2015 at 04:30:01PM -0500, Jason Baron wrote:
> So looking at this trace I think its the other->sk_socket that gets
> freed and then we call sk_wake_async() on it.
> 
> We could I think grab the socket reference there with unix_state_lock(),
> since that is held by unix_release_sock() before the final iput() is called.
> 
> So something like below might work (compile tested only):

That just adds the performance regression back in.  It should be possible 
to protect the other socket dereference using RCU.  I haven't had time to 
look at this yet today, but will try to find some time this evening to come 
up with a suggested patch.

-ben

> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index aaa0b58..2b014f1 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -196,6 +196,19 @@ static inline int unix_recvq_full(struct sock const
> *sk)
>   return skb_queue_len(&sk->sk_receive_queue) > sk->sk_max_ack_backlog;
>  }
> 
> +struct socket *unix_peer_get_socket(struct sock *s)
> +{
> + struct socket *peer;
> +
> + unix_state_lock(s);
> + peer = s->sk_socket;
> + if (peer)
> + __iget(SOCK_INODE(s->sk_socket));
> + unix_state_unlock(s);
> +
> + return peer;
> +}
> +
>  struct sock *unix_peer_get(struct sock *s)
>  {
>   struct sock *peer;
> @@ -1639,6 +1652,7 @@ static int unix_stream_sendmsg(struct socket
> *sock, struct msghdr *msg,
>  {
>   struct sock *sk = sock->sk;
>   struct sock *other = NULL;
> + struct socket *other_socket = NULL;
>   int err, size;
>   struct sk_buff *skb;
>   int sent = 0;
> @@ -1662,7 +1676,10 @@ static int unix_stream_sendmsg(struct socket
> *sock, struct msghdr *msg,
>   } else {
>   err = -ENOTCONN;
>   other = unix_peer(sk);
> - if (!other)
> + if (other)
> + other_socket = unix_peer_get_socket(other);
> +
> + if (!other_socket)
>   goto out_err;
>   }
> 
> @@ -1721,6 +1738,9 @@ static int unix_stream_sendmsg(struct socket
> *sock, struct msghdr *msg,
>   sent += size;
>   }
> 
> + if (other_socket)
> + iput(SOCK_INODE(other_socket));
> +
>   scm_destroy(&scm);
> 
>   return sent;
> @@ -1733,6 +1753,8 @@ pipe_err:
>   send_sig(SIGPIPE, current, 0);
>   err = -EPIPE;
>  out_err:
> + if (other_socket)
> + iput(SOCK_INODE(other_socket));
>   scm_destroy(&scm);
>   return sent ? : err;
>  }

-- 
"Thought is the essence of where you are now."
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fib_semantics: prevent long hash chains in access server config

2008-01-13 Thread Benjamin LaHaise
On Sat, Jan 12, 2008 at 09:38:57PM -0800, David Miller wrote:
> And guess why we don't do this?  Because it's not part of
> the key.  Other aspects of the base fib_info and nexthops
> provide the uniqueness, not the devindex of the first hop.
> 
> So you'll need to find another way to do this.

Ah, you're right indeed.  It's probably easier for me to change how the 
daemon adds the local ip address for these point to point interfaces.

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] fib_semantics: prevent long hash chains in access server config

2008-01-12 Thread Benjamin LaHaise
This is a patch from a while ago that I'm resending.  Basically, in access 
server configurations, a lot of routes have the same local ip address but on 
different devices.  This fixes the long chains that result from not including 
the device index in the hash.
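
As a back-of-the-envelope illustration (a standalone sketch, not kernel code;
the table size here is made up), folding the device index into the hash
spreads routes that share a local address across the table instead of piling
them into one chain:

#include <stdio.h>

#define HASHBITS	8
#define HASHSIZE	(1 << HASHBITS)

/* same bit-folding idea as fib_devindex_hashfn() in the patch below */
static unsigned int devindex_hashfn(unsigned int val)
{
	return (val ^ (val >> HASHBITS) ^ (val >> (HASHBITS * 2))) & (HASHSIZE - 1);
}

int main(void)
{
	static unsigned int chain_without[HASHSIZE], chain_with[HASHSIZE];
	unsigned int used_without = 0, used_with = 0;
	unsigned int local_addr = 0x0a000001, ifindex;

	for (ifindex = 1; ifindex <= 10000; ifindex++) {
		unsigned int h1 = local_addr & (HASHSIZE - 1);
		unsigned int h2 = (local_addr ^ devindex_hashfn(ifindex)) &
				  (HASHSIZE - 1);

		if (!chain_without[h1]++)
			used_without++;
		if (!chain_with[h2]++)
			used_with++;
	}
	printf("buckets used: %u without ifindex, %u with ifindex\n",
	       used_without, used_with);
	return 0;
}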

-ben

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 1351a26..5375824 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -196,11 +196,21 @@ static __inline__ int nh_comp(const struct fib_info *fi, 
const struct fib_info *
return 0;
 }
 
+static inline unsigned int fib_devindex_hashfn(unsigned int val)
+{
+   unsigned int mask = DEVINDEX_HASHSIZE - 1;
+
+   return (val ^
+   (val >> DEVINDEX_HASHBITS) ^
+   (val >> (DEVINDEX_HASHBITS * 2))) & mask;
+}
+
 static inline unsigned int fib_info_hashfn(const struct fib_info *fi)
 {
unsigned int mask = (fib_hash_size - 1);
unsigned int val = fi->fib_nhs;
 
+   val ^= fib_devindex_hashfn(fi->fib_dev->ifindex);
val ^= fi->fib_protocol;
val ^= (__force u32)fi->fib_prefsrc;
val ^= fi->fib_priority;
@@ -234,15 +244,6 @@ static struct fib_info *fib_find_info(const struct 
fib_info *nfi)
return NULL;
 }
 
-static inline unsigned int fib_devindex_hashfn(unsigned int val)
-{
-   unsigned int mask = DEVINDEX_HASHSIZE - 1;
-
-   return (val ^
-   (val >> DEVINDEX_HASHBITS) ^
-   (val >> (DEVINDEX_HASHBITS * 2))) & mask;
-}
-
 /* Check, that the gateway is already configured.
Used only by redirect accept routine.
  */
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


announce: Babylon PPP (scalable L2TP) 2.0 beta 1

2008-01-01 Thread Benjamin LaHaise
Hello all,

This is an announcement for a new beta release of a decently scalable L2TP 
stack for Linux.  It is based off of the Babylon PPP stack created by 
SpellCaster back in the '98-'00 timeframe.  Right now there are lots of 
rough edges (especially in terms of documentation), but it works well enough 
for many purposes.

The main difference between the Babylon PPP stack and pppd is that the 
Babylon code uses a single process for controlling all PPP sessions 
established throughout the system, which uses significantly less memory 
and processing power per connection.  Practically speaking, a mid range 
system can terminate thousands of sessions without breaking a sweat.

Features
- entirely GPLed
- can terminate thousands of sessions
- includes kernel mode multihop for L2TP
- can act as an LAC or LNS
- has gigawords accounting for sessions
- includes Radius client support
- already in use to terminate real L2TP traffic for 2 ISPs.
- works on x86/x86-64.  Not expected to work on architectures where 
unaligned memory accesses are problematic.

Non-Features
- Uses its own kernel mode PPP stack for L2TP and PPPoE.  The 
exact direction for this is open to discussions, but will require 
lots of work to address the scalability issues.
- PPPoE support is not well integrated at present.  It works for 
dial out (kernel mode via drivers/pppoe/pppoed and kernel/bab_pppoe.o) 
or PPPoE->L2TP conversion (userspace), but not incoming calls as-is.
- device nodes are still a work in progress
- packaging needs improvement
- the kernel code is not at all ready for merging as it needs fairly 
major cleanup in several areas

The README.l2tp file gives a rough description of what is needed to get 
things going on recent 2.6 kernels, but I've only been testing 2.6.22.5 
for this release.  Questions/comments are appreciated.  The source can 
be had via git clone http://www.kvack.org/~bcrl/babylon.git/.git .  Cheers,

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] don't allow netfilter --setmss to increase mss

2007-12-04 Thread Benjamin LaHaise
Hi folks,

When terminating DSL connections for an assortment of random customers, I've 
found it necessary to use iptables to clamp the MSS used for connections to 
work around the various ICMP blackholes in the greater net.  Unfortunately, 
the current behaviour in Linux is imperfect and actually make things worse, 
so I'm proposing the following: increasing the MSS in a packet can never be 
a good thing, so make --set-mss only lower the MSS in a packet.

Yes, I am aware of --clamp-mss-to-pmtu, but it doesn't work for outgoing 
connections from clients (ie web traffic), as it only looks at the PMTU on 
the destination route, not the source of the packet (the DSL interfaces in 
question have a 1442 byte MTU while the destination ethernet interface is 
1500 -- there are problematic hosts which use a 1300 byte MTU).  Reworking 
that is probably a good idea at some point, but it's more work than this is.

Thoughts?  Would it be better to add a new flag?

-ben

Signed-off-by: Benjamin LaHaise <[EMAIL PROTECTED]>
diff --git a/net/netfilter/xt_TCPMSS.c b/net/netfilter/xt_TCPMSS.c
index d40f7e4..411c482 100644
--- a/net/netfilter/xt_TCPMSS.c
+++ b/net/netfilter/xt_TCPMSS.c
@@ -88,8 +88,11 @@ tcpmss_mangle_packet(struct sk_buff **pskb,
 
oldmss = (opt[i+2] << 8) | opt[i+3];
 
-   if (info->mss == XT_TCPMSS_CLAMP_PMTU &&
-   oldmss <= newmss)
+   /* Never increase MSS, even when setting it, as
+* doing so results in problems for hosts that rely
+* on MSS being set correctly.
+*/
+   if (oldmss <= newmss)
return 0;
 
opt[i+2] = (newmss & 0xff00) >> 8;
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 03/18] drivers/net/ns83820.c: add paramter to disable autonegotiation

2007-08-10 Thread Benjamin LaHaise
On Fri, Aug 10, 2007 at 02:05:13PM -0700, [EMAIL PROTECTED] wrote:
> Also added a "disable_autoneg" module argument to completely disable
> autoneg on all cards using this driver.
...
> [akpm: this is a previously-nacked patch, but the problem is real]

Please remove this part of the patch.  The ethtool support is sufficient and 
doesn't clobber other cards in the system.  At the very least the module 
parameter has to be limited to a specific card.

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: FSCKED clock sources WAS(Re: [WIP][PATCHES] Network xmit batching

2007-06-25 Thread Benjamin LaHaise
On Mon, Jun 25, 2007 at 12:59:54PM -0400, jamal wrote:
> On Thu, 2007-21-06 at 12:55 -0400, Benjamin LaHaise wrote:
> 
> > You should qualify that as 'Old P4 Xeon', as the Core 2 Xeons are leagues 
> > better.
> 
> The Xeon hardware is not that old - about a year or so (and so is the
> opteron). 
> BTW, how could you tell this was old Xeon?

CPUID:

vendor_id   : GenuineIntel
cpu family  : 15
model   : 4
model name  : Intel(R) Xeon(TM) CPU 2.80GHz

shows that it is a P4 Xeon, which sucks compared to:

vendor_id   : GenuineIntel
cpu family  : 6
model   : 15
model name  : Genuine Intel(R) CPU  @ 2.66GHz

which is a Core 2 based Xeon.  The tuning required by the P4 is quite 
different from that of the Core 2, and the P4 generally performs more poorly 
due to the length of its pipeline and the expense of pipeline flushes.

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: FSCKED clock sources WAS(Re: [WIP][PATCHES] Network xmit batching

2007-06-21 Thread Benjamin LaHaise
On Thu, Jun 21, 2007 at 12:08:19PM -0400, jamal wrote:
> The results in the table for opteron and xeon are swapped when
> cutnpasting from a larger test result. So Opteron is the one with better
> results.
> In any case - off for the day over here.

You should qualify that as 'Old P4 Xeon', as the Core 2 Xeons are leagues 
better.

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: r8169 tx problem (1s pause with ping)

2007-06-18 Thread Benjamin LaHaise
On Fri, Jun 15, 2007 at 01:33:14AM +1000, David Gundersen wrote:
> In the mean-time I'll attach my patch for the r8168-8.001.00 realtek 
> driver here in case anybody else wants to have a play with it and see if 
> it helps them out.

Out of curiosity, does it work if you just do a single read (ie 
RTL_R8(TxPoll);) of the register before writing to it?  That would clear 
things up if it is a PCI posting problem.
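
For reference, a hedged sketch of the MMIO posted-write issue being probed
here (the register offset, doorbell value and helper names are illustrative,
not the driver's): PCI writes may sit in intermediate buffers, and a read from
the device forces previously posted writes to complete -- which is why adding
a read, either as suggested above or as a read-back after the doorbell write,
would change the behaviour if posting is the problem:

#include <stdint.h>

#define TX_POLL_REG	0x38			/* illustrative offset */

static inline void mmio_write8(volatile uint8_t *regs, unsigned int off, uint8_t v)
{
	regs[off] = v;				/* may be posted (buffered) on the bus */
}

static inline uint8_t mmio_read8(volatile uint8_t *regs, unsigned int off)
{
	return regs[off];			/* reads are never posted */
}

static void kick_tx(volatile uint8_t *regs)
{
	mmio_write8(regs, TX_POLL_REG, 0x40);	/* doorbell write */
	(void)mmio_read8(regs, TX_POLL_REG);	/* read-back flushes the posted write */
}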

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


r8169 tx problem (1s pause with ping)

2007-06-12 Thread Benjamin LaHaise
Hello folks,

I'm seeing something odd with r8169 on FC7: doing a ping -s 1600 alternates 
between a 1s latency and sub 1ms.  Has anyone else seen anything like this?  
The system in question is an Asus M2A-VM with an onboard RTL8111 (I think).  
NAPI doesn't seem to make a difference.  The kernel in question is currently 
a vanilla 2.6.21.5.  Sub-mtu sized packets behave normally.

02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI 
Express Gigabit Ethernet controller (rev 01)

PING 1.2.3.4 (1.2.3.4) 1600(1628) bytes of data.
1608 bytes from 1.2.3.4: icmp_seq=1 ttl=64 time=1000 ms
1608 bytes from 1.2.3.4: icmp_seq=2 ttl=64 time=0.816 ms
1608 bytes from 1.2.3.4: icmp_seq=3 ttl=64 time=1000 ms
1608 bytes from 1.2.3.4: icmp_seq=4 ttl=64 time=0.661 ms

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] TCP FIN gets dropped prematurely, results in ack storm

2007-05-01 Thread Benjamin LaHaise
On Tue, May 01, 2007 at 02:03:04PM -0400, John Heffner wrote:
> Actually, you cannot get in this situation by loss or reordering of 
> packets, only be corruption of state on one side.  It sends the FIN, 
> which effectively increases the sequence number by one.  However, all 
> later segments it sends have an old lower sequence number, which are now 
> out of window.

Okay, I missed the other packets with a FIN later on in the storm.  What is 
different about them is that they get sent with different timestamps from 
the acks being thrown about.  Perhaps narrowly looking at the lack of a FIN 
is wrong -- I'll try instrumenting what the PAWS code is doing on both 
sides, as that is probably what short-circuits an ACK into being sent.

> Being liberal in what you accept is good to a point, but sometimes you 
> have to draw the line.

True.  Still, both sides are doing completely the wrong thing in this case, 
and I'd like to get an idea of the best way to prevent the ACK storm from 
happening.

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] TCP FIN gets dropped prematurely, results in ack storm

2007-05-01 Thread Benjamin LaHaise
On Tue, May 01, 2007 at 01:54:03PM -0400, John Heffner wrote:
> Looking at your trace, it seems like the behavior of the test system 
> 192.168.2.2 is broken in two ways.  First, like you said it has broken 
> state in that it has forgotten that it sent the FIN.  Once you do that, 
> the connection state is corrupt and all bets are off.  It's sending an 
> out-of-window segment that's getting tossed by Linux, and Linux 
> generates an ack in response.  This is in direct RFC compliance.  The 
> second problem is that the other system is generating these broken acks 
> in response to the legitimate acks Linux is sending, causing the ack 
> war.  I can't really guess why it's doing that...

I know it's a bug, and I'm trying to fix it, but that doesn't change the 
fact that A) the system is already deployed and B) Linux is not retransmitting 
the FIN, which (from Linux's point of view) remains unacknowledged by the 
other side.  The patch might be wrong, but the goal of fixing the behaviour 
isn't.

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] TCP FIN gets dropped prematurely, results in ack storm

2007-05-01 Thread Benjamin LaHaise
On Tue, May 01, 2007 at 09:41:28PM +0400, Evgeniy Polyakov wrote:
> Hmm, 2.2 machine in your test seems to behave incorrectly:

I am aware of that.  However, I think that the loss of certain packets and 
reordering can result in the same behaviour.  What's more, this 
behaviour can occur in real deployed systems.  "Be strict in what you send 
and liberal in what you accept."  Both systems should be fixed, which is 
what I'm trying to do.

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] TCP FIN gets dropped prematurely, results in ack storm

2007-05-01 Thread Benjamin LaHaise
On Tue, May 01, 2007 at 08:20:50PM +0400, Evgeniy Polyakov wrote:
> > http://www.kvack.org/~bcrl/ack-storm.log .  As near as I can tell, a 
> > similar effect can occur between two Linux boxes if the right packets get 
> > reordered/dropped during connection teardown.
> 
> Could you archive 24Mb file or cut more precise bits out of it?

The interesting bits are the first 10 lines.

> According to your patch, several packets with fin bit might be sent,
> including one with data. If another host does not receive fin
> retransmit, then that logic is broken, and it can not be fixed by
> duplicating fins, I would even say, that remote box should drop second
> packet with fin, while it can carry data, which will break higher
> connection logic.

The FIN hasn't been ack'd by the other side, and yet Linux is no longer 
transmitting packets with it set.  Read the beginning of the trace.

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] TCP FIN gets dropped prematurely, results in ack storm

2007-05-01 Thread Benjamin LaHaise
Hello,

While testing a failover scenario, I managed to trigger an ack storm 
between a Linux box and another system.  Although this particular ACK storm 
was caused by the other box forgetting that it had sent out a FIN (the 
second node was unaware of the FIN the first sent in its dying gasp, which 
is what I'm trying to fix, but it's a tricky race), the resulting Linux 
behaviour wasn't very robust.  Is there any particularly good reason that the 
FIN flag gets cleared on a connection which is being shut down?  The trace 
that motivates this can be seen at 
http://www.kvack.org/~bcrl/ack-storm.log .  As near as I can tell, a 
similar effect can occur between two Linux boxes if the right packets get 
reordered/dropped during connection teardown.

-ben

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0faacf9..1e54291 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -635,9 +635,9 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 
len, unsigned int mss
TCP_SKB_CB(buff)->end_seq = TCP_SKB_CB(skb)->end_seq;
TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(buff)->seq;
 
-   /* PSH and FIN should only be set in the second packet. */
+   /* PSH should only be set in the second packet. */
flags = TCP_SKB_CB(skb)->flags;
-   TCP_SKB_CB(skb)->flags = flags & ~(TCPCB_FLAG_FIN|TCPCB_FLAG_PSH);
+   TCP_SKB_CB(skb)->flags = flags & ~(TCPCB_FLAG_PSH);
TCP_SKB_CB(buff)->flags = flags;
TCP_SKB_CB(buff)->sacked = TCP_SKB_CB(skb)->sacked;
TCP_SKB_CB(skb)->sacked &= ~TCPCB_AT_TAIL;
@@ -1124,9 +1124,9 @@ static int tso_fragment(struct sock *sk, struct sk_buff 
*skb, unsigned int len,
TCP_SKB_CB(buff)->end_seq = TCP_SKB_CB(skb)->end_seq;
TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(buff)->seq;
 
-   /* PSH and FIN should only be set in the second packet. */
+   /* PSH should only be set in the second packet. */
flags = TCP_SKB_CB(skb)->flags;
-   TCP_SKB_CB(skb)->flags = flags & ~(TCPCB_FLAG_FIN|TCPCB_FLAG_PSH);
+   TCP_SKB_CB(skb)->flags = flags & ~(TCPCB_FLAG_PSH);
TCP_SKB_CB(buff)->flags = flags;
 
/* This packet was never sent out yet, so no SACK bits. */
@@ -1308,7 +1308,7 @@ static int tcp_mtu_probe(struct sock *sk)
sk_stream_free_skb(sk, skb);
} else {
TCP_SKB_CB(nskb)->flags |= TCP_SKB_CB(skb)->flags &
-  
~(TCPCB_FLAG_FIN|TCPCB_FLAG_PSH);
+  ~(TCPCB_FLAG_PSH);
if (!skb_shinfo(skb)->nr_frags) {
skb_pull(skb, copy);
if (skb->ip_summed != CHECKSUM_PARTIAL)
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] fib_info_hashfn leads to long hash chains

2007-04-29 Thread Benjamin LaHaise
Hello

The patch below fixes a case where fib_find_info() consumes excessive
amounts of CPU during the creation of 1 PPP interfaces.  In access
servers, each point-to-point link has the same local address, but a different
destination and interface.  Because the device is not included in the hash
calculation, the chain grows excessively large and we end up spinning the
CPU walking the list.  As near as I can tell, this shouldn't have any negative
side effects, but someone with a better understanding of fib_semantics.c will
need to check over it.  Cheers,

-ben

Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit 
mask of 0x00 (No unit mask) count 2
samples  %        image name   app name     symbol name
2469220  47.8872  vmlinux  vmlinux  fib_find_info
576393   11.1784  vmlinux  vmlinux  fn_trie_delete
559561   10.8519  vmlinux  vmlinux  fn_trie_insert
518975   10.0648  vmlinux  vmlinux  rt_run_flush
284646    5.5203  vmlinux  vmlinux  local_bh_enable_ip
52469 1.0176  vmlinux  vmlinux  local_bh_disable
47638 0.9239  vmlinux  vmlinux  fib_nh_match
20588 0.3993  oprofiled    oprofiled    sfile_find
16074 0.3117  oprofiled    oprofiled    odb_update_node
13957 0.2707  ahci ahci (no symbols)
13648 0.2647  vmlinux  vmlinux  rtl8169_interrupt
13206 0.2561  vmlinux  vmlinux  register_netdevice

After:

Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit 
mask of 0x00 (No unit mask) count 2
512204   26.1316  vmlinux  vmlinux  rt_run_flush
289378   14.7635  vmlinux  vmlinux  local_bh_enable_ip
267266   13.6354  vmlinux  vmlinux  fn_trie_delete
253438   12.9299  vmlinux  vmlinux  fn_trie_insert
53329 2.7207  vmlinux  vmlinux  local_bh_disable
21864 1.1155  vmlinux  vmlinux  fib_nh_match
15105 0.7706  ahci ahci (no symbols)
12332 0.6292  vmlinux  vmlinux  rtl8169_interrupt
9504  0.4849  vmlinux  vmlinux  ehci_irq
8436  0.4304  vmlinux  vmlinux  journal_clean_one_cp_list
7645  0.3900  oprofiled    oprofiled    odb_update_node
7462  0.3807  oprofiled    oprofiled    sfile_find
6366  0.3248  babylond babylond memset


Signed-off-by: Benjamin LaHaise <[EMAIL PROTECTED]>
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 3dad12e..e790842 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -197,11 +197,23 @@ static __inline__ int nh_comp(const struct fib_info *fi, 
const struct fib_info *
return 0;
 }
 
+static inline unsigned int fib_devindex_hashfn(unsigned int val)
+{
+   unsigned int mask = DEVINDEX_HASHSIZE - 1;
+
+   return (val ^
+   (val >> DEVINDEX_HASHBITS) ^
+   (val >> (DEVINDEX_HASHBITS * 2))) & mask;
+}
+
 static inline unsigned int fib_info_hashfn(const struct fib_info *fi)
 {
unsigned int mask = (fib_hash_size - 1);
unsigned int val = fi->fib_nhs;
 
+   if (val)
+   val ^= fib_devindex_hashfn(fi->fib_dev->ifindex);
+
val ^= fi->fib_protocol;
val ^= (__force u32)fi->fib_prefsrc;
val ^= fi->fib_priority;
@@ -235,15 +247,6 @@ static struct fib_info *fib_find_info(const struct 
fib_info *nfi)
return NULL;
 }
 
-static inline unsigned int fib_devindex_hashfn(unsigned int val)
-{
-   unsigned int mask = DEVINDEX_HASHSIZE - 1;
-
-   return (val ^
-   (val >> DEVINDEX_HASHBITS) ^
-   (val >> (DEVINDEX_HASHBITS * 2))) & mask;
-}
-
 /* Check, that the gateway is already configured.
Used only by redirect accept routine.
  */
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TCP connection stops after high load.

2007-04-11 Thread Benjamin LaHaise
On Wed, Apr 11, 2007 at 02:06:31PM -0700, Ben Greear wrote:
> For the dup acks, I see nothing *but* dup acks on the wire...going in
> both directions interestingly, at greater than 100,000 packets per second.
> 
> I don't mind adding printks...and I've started reading through the code,
> but there is a lot of it, and indiscriminate printks will likely just
> hide the problem because it will slow down performance so much.

What do the timestamps look like?  PAWS contains logic which will drop 
packets if the timestamps are too old compared to what the receiver 
expects.
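
For reference, the discard test amounts to roughly the following (a
simplified sketch of tcp_paws_discard(), leaving out the disordered-ACK
exception, so field names are illustrative rather than a verbatim copy):

	/* Reject a segment whose timestamp is older than the last one seen,
	 * unless the stored timestamp is itself ancient (~24 days old). */
	static int paws_reject(const struct tcp_sock *tp)
	{
		return (s32)(tp->rx_opt.ts_recent - tp->rx_opt.rcv_tsval) > TCP_PAWS_WINDOW &&
		       get_seconds() < tp->rx_opt.ts_recent_stamp + TCP_PAWS_24DAYS;
	}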

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 1/4] network dev read_mostly

2007-03-15 Thread Benjamin LaHaise
On Thu, Mar 15, 2007 at 12:25:16AM -0700, David Miller wrote:
> Could we obtain %rip relative addressing with the ELF
> relocation approach I mentioned?

I think we can for some of the objects -- things like slab caches are 
good candidates if we have the initialization done at init time, which 
would actually be a very cool way of getting rid of the static slab 
creation calls.  I'll cook something better up on the slab front.

As for other variables, many can't be rip relative as they often end up 
pointing to addresses outside of the +/-2GB range we have there.

The main reason I came up with this was from looking at the various 
kprobes and notifier chain overhead in common code paths.  In many 
instances we need a single byte flag showing the feature is in use to 
jump out of the hot path.

Then there are the SELinux hooks...

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 1/4] network dev read_mostly

2007-03-14 Thread Benjamin LaHaise
On Mon, Mar 12, 2007 at 02:08:18PM -0700, Stephen Hemminger wrote:
> For Eric, mark packet type and network device watermarks
> as read mostly.

The following x86-64 bits might be interesting, as they allow you to
completely eliminate the memory access for run-time defined constants.
Note that read_always writes are non-atomic, so some other form of
protection is necessary for readers (and RCU won't cut it).  That can be
fixed somewhat by specifying the alignment for the mov instruction to
ensure writes are atomic, but for many uses that is overkill.  This kind
of change can make the biggest difference for high-latency cases, like L1
cache misses on the Prescott P4.  I've not benched it on a P4 of late,
though.

-ben


diff --git a/arch/x86_64/kernel/head64.c b/arch/x86_64/kernel/head64.c
index 5f197b0..022ee38 100644
--- a/arch/x86_64/kernel/head64.c
+++ b/arch/x86_64/kernel/head64.c
@@ -70,6 +70,8 @@ void __init x86_64_start_kernel(char * real_mode_data)
memcpy(init_level4_pgt, boot_level4_pgt, PTRS_PER_PGD*sizeof(pgd_t));
asm volatile("movq %0,%%cr3" :: "r" (__pa_symbol(&init_level4_pgt)));
 
+   init_read_always();
+
for (i = 0; i < NR_CPUS; i++)
cpu_pda(i) = &boot_cpu_pda[i];
 
diff --git a/arch/x86_64/kernel/vmlinux.lds.S b/arch/x86_64/kernel/vmlinux.lds.S
index b73212c..19852dc 100644
--- a/arch/x86_64/kernel/vmlinux.lds.S
+++ b/arch/x86_64/kernel/vmlinux.lds.S
@@ -50,6 +50,10 @@ SECTIONS
   __ex_table : AT(ADDR(__ex_table) - LOAD_OFFSET) { *(__ex_table) }
   __stop___ex_table = .;
 
+  start__read_always = .;
+  __read_always : AT(ADDR(__read_always) - LOAD_OFFSET) { *(__read_always) }
+  stop__read_always = .;
+
   RODATA
 
   BUG_TABLE
diff --git a/arch/x86_64/mm/fault.c b/arch/x86_64/mm/fault.c
index 6ada723..d48415e 100644
--- a/arch/x86_64/mm/fault.c
+++ b/arch/x86_64/mm/fault.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /* Page fault error code bits */
@@ -41,11 +42,67 @@
 #define PF_INSTR   (1<<4)
 
 static ATOMIC_NOTIFIER_HEAD(notify_page_fault_chain);
+static DEFINE_READ_ALWAYS(char, notify_page_fault_active);
+static DEFINE_READ_ALWAYS(char, page_fault_trace);
+
+void init_read_always(void)
+{
+   extern unsigned int start__read_always[], stop__read_always[];
+   unsigned int *fixup;
+
+   fixup = start__read_always;
+   while (fixup < stop__read_always) {
+   void *where = (void *)(fixup[0] - 0x1L);
+   void *which = (void *)(fixup[1] - 0x1L);
+   long size = fixup[2];
+   fixup += 3;
+
+   switch (size) {
+   case 1: *(u8 *)where = *(u8 *)which;break;
+   case 2: *(u16 *)where = *(u16 *)which;  break;
+   case 4: *(u32 *)where = *(u32 *)which;  break;
+   case 8: *(u64 *)where = *(u64 *)which;  break;
+   }
+   }
+}
+
+void set_read_always_size(void *ptr, long val, int size)
+{
+   extern unsigned int start__read_always[], stop__read_always[];
+   unsigned int *fixup;
+
+   switch(size) {
+   case 1: *(u8 *)ptr = val;   break;
+   case 2: *(u16 *)ptr = val;  break;
+   case 4: *(u32 *)ptr = val;  break;
+   case 8: *(u64 *)ptr = val;  break;
+   }
+
+   fixup = start__read_always;
+   while (fixup < stop__read_always) {
+   void *where = (void *)(fixup[0] - 0x1L);
+   void *which = (void *)(fixup[1] - 0x1L);
+   long actual_size = fixup[2];
+   fixup += 3;
+
+   if (which != ptr)
+   continue;
+
+   BUG_ON(size != actual_size);
+   switch(size) {
+   case 1: *(u8 *)where = val; break;
+   case 2: *(u16 *)where = val;break;
+   case 4: *(u32 *)where = val;break;
+   case 8: *(u64 *)where = val;break;
+   }
+   }
+}
 
 /* Hook to register for page fault notifications */
 int register_page_fault_notifier(struct notifier_block *nb)
 {
vmalloc_sync_all();
+   set_read_always(notify_page_fault_active, 1);
	return atomic_notifier_chain_register(&notify_page_fault_chain, nb);
 }
 EXPORT_SYMBOL_GPL(register_page_fault_notifier);
@@ -56,7 +113,7 @@ int unregister_page_fault_notifier(struct notifier_block *nb)
 }
 EXPORT_SYMBOL_GPL(unregister_page_fault_notifier);
 
-static inline int notify_page_fault(struct pt_regs *regs, long err)
+static int notify_page_fault(struct pt_regs *regs, long err)
 {
struct die_args args = {
.regs = regs,
@@ -301,7 +358,6 @@ static int vmalloc_fault(unsigned long address)
return 0;
 }
 
-int page_fault_trace = 0;
 int exception_trace = 1;
 
 /*
@@ -355,7 +411,8 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs 
*regs,
if (vmalloc_fault(address) >

Re: [PATCH 1/2] avoid OPEN_MAX in SCM_MAX_FD

2007-03-13 Thread Benjamin LaHaise
On Tue, Mar 13, 2007 at 01:39:12AM -0700, Roland McGrath wrote:
> The OPEN_MAX constant is an arbitrary number with no useful relation to
> anything.  Nothing should be using it.  This patch changes SCM_MAX_FD to
> use NR_OPEN instead of OPEN_MAX.  This increases the size of the struct
> scm_fp_list type fourfold, to make it big enough to contain as many file
> descriptors as could be asked of it.  This size increase may not be very
> worthwhile, but at any rate if an arbitrary limit unrelated to anything
> else is being defined it should be done explicitly here with:

> -#define SCM_MAX_FD   (OPEN_MAX-1)
> +#define SCM_MAX_FD   (NR_OPEN-1)

This is a bad idea.  From linux/fs.h:

#undef NR_OPEN
#define NR_OPEN (1024*1024) /* Absolute upper limit on fd num */

There isn't anything I can see guaranteeing that net/scm.h is included 
before fs.h.  This affects networking and should really be Cc'd to 
netdev@vger.kernel.org, which will raise the issue that if SCM_MAX_FD is 
raised, the resulting simple kmalloc() must be changed.  That said, I 
doubt SCM_MAX_FD really needs to be raised, as applications using many 
file descriptors are unlikely to try to send their entire file table to 
another process in one go -- they have to handle the limits imposed by 
SCM_MAX_FD anyways.
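
To put numbers on the kmalloc() point (a sketch; sizes assume 64-bit
pointers, and which NR_OPEN wins depends on include order, which is exactly
the problem):

	/* struct scm_fp_list is a count plus an array of file pointers: */
	struct scm_fp_list {
		int		count;
		struct file	*fp[SCM_MAX_FD];
	};

	/* SCM_MAX_FD = OPEN_MAX - 1            =     255  ->  ~2KB, a sane kmalloc()      */
	/* SCM_MAX_FD = NR_OPEN - 1 (limits.h)  =    1023  ->  ~8KB, the "fourfold" growth */
	/* SCM_MAX_FD = NR_OPEN - 1 (fs.h)      = 1048575  ->  ~8MB, not kmalloc()-able    */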

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: all syscalls initially taking 4usec on a P4? Re: nonblocking UDPv4 recvfrom() taking 4usec @ 3GHz?

2007-02-20 Thread Benjamin LaHaise
On Tue, Feb 20, 2007 at 08:33:20PM +0100, bert hubert wrote:
> I'm investigating this further for other system calls. It might be that my
> measurements are off, but it appears even a slight delay between calls
> incurs a large penalty.

Make sure your system is idle.  Userspace bloat means that *lots* of idle 
activity occurs in between timer ticks on recent distributions -- all those 
widgets polling the hardware to see if something changed or needs updating 
do a lot of damage to the caches.  Try comparing a run under init=/bin/bash 
with one while logged in to a desktop environment to see just how painful it 
is.

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Extensible hashing and RCU

2007-02-19 Thread Benjamin LaHaise
On Mon, Feb 19, 2007 at 01:26:42PM -0500, Benjamin LaHaise wrote:
> On Mon, Feb 19, 2007 at 07:13:07PM +0100, Eric Dumazet wrote:
> > So even with a lazy hash function, 89 % of lookups are satisfied with less 
> > than 6 compares.
> 
> Which sucks, as those are typically going to be cache misses (costing many
> hundreds of CPU cycles).  Hash chains fare very poorly under DoS conditions,
> and must be removed under a heavy load.  Worst-case handling is very
> important next to the common case.

I should clarify.  Back-of-the-napkin calculations show that there are only
157 cycles on a 3GHz processor in which to decide what happens to a packet,
which means one cache miss is more than enough to blow the budget.  In theory
we can get pretty close to line rate with quad-core processors, but it
definitely needs some of the features that newer chipsets have for stuffing
packets directly into the cache.  I would venture a guess that we also need
to intelligently partition packets so that we make the most use of available
cache resources.
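
The napkin version, assuming 10GbE and minimum-size 64-byte packets with
framing overhead ignored:

	10 Gbit/s / (64 bytes * 8 bits/byte)  ~= 19.5 million packets/s
	3.0 GHz / 19.5 Mpps                   ~= 155 cycles per packet

which is roughly where the 157-cycle budget comes from; a single cache miss
of several hundred cycles blows it on its own.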

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Extensible hashing and RCU

2007-02-19 Thread Benjamin LaHaise
On Mon, Feb 19, 2007 at 07:13:07PM +0100, Eric Dumazet wrote:
> So even with a lazy hash function, 89 % of lookups are satisfied with less 
> than 6 compares.

Which sucks, as those are typically going to be cache misses (costing many
hundreds of CPU cycles).  Hash chains fare very poorly under DoS conditions,
and must be removed under a heavy load.  Worst-case handling is very
important next to the common case.

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -mm 0/10][RFC] aio: make struct kiocb private

2007-01-17 Thread Benjamin LaHaise
On Mon, Jan 15, 2007 at 08:25:15PM -0800, Nate Diller wrote:
> the right thing to do from a design perspective.  Hopefully it enables
> a new architecture that can reduce context switches in I/O completion,
> and reduce overhead.  That's the real motive ;)

And it's a broken motive.  Context switches per se are not bad, as they 
make it possible to properly schedule code in a busy system (which is 
*very* important when realtime concerns come into play).  Have a look 
at how things were done in the 2.4 aio code to see how completion would 
get done with a non-retry method, typically in interrupt context.  I had 
code that did direct I/O rather differently by sharing code with the 
read/write code paths at some point, the catch being that it was pretty 
invasive, which meant that it never got merged with the changes to handle 
writeback pressure and other work that happened during 2.5.

That said, you can't make kiocb private without completely removing the 
ability of the rest of the kernel to complete an aio sanely from irq context.  
You need some form of i/o descriptor, and a kiocb is just that.  Adding more 
layering is just going to make things messier and slower for no real gain.

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SKB BUG: Invalid truesize, current git

2006-11-08 Thread Benjamin LaHaise
On Tue, Nov 07, 2006 at 02:57:24PM -0800, David Miller wrote:
> > Since pskb_copy tacks on the non-linear bits from the original
> > skb, it needs to count them in the truesize field of the new skb.
> > 
> > Signed-off-by: Herbert Xu <[EMAIL PROTECTED]>
> 
> Applied, thanks Herbert.

This seems to work -- I haven't been able to trigger the message with the 
patch applied now.

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


SKB BUG: Invalid truesize, current git

2006-11-06 Thread Benjamin LaHaise
Hi all,

I managed to get a backtrace for the Invalid truesize bug.  The trigger is
running LMbench2, but it's rather intermittent.  Traffic should be going
over the loopback interface, but the main NIC on the machine is e1000.
Let me know if anyone has any ideas for things to try.

-ben

Linux version 2.6.19-rc4 ([EMAIL PROTECTED]) (gcc version 4.1.1 20060525 (Red 
Hat 4.1.1-1)) #73 SMP Mon Nov 6 13:13:44 EST 2006
Command line: ro root=LABEL=/1 console=ttyS0,38400
BIOS-provided physical RAM map:
 BIOS-e820:  - 0009cc00 (usable)
 BIOS-e820: 0009cc00 - 000a (reserved)
 BIOS-e820: 000cc000 - 000d (reserved)
 BIOS-e820: 000e4000 - 0010 (reserved)
 BIOS-e820: 0010 - bff6 (usable)
 BIOS-e820: bff6 - bff69000 (ACPI data)
 BIOS-e820: bff69000 - bff8 (ACPI NVS)
 BIOS-e820: bff8 - c000 (reserved)
 BIOS-e820: e000 - f000 (reserved)
 BIOS-e820: fec0 - fec1 (reserved)
 BIOS-e820: fee0 - fee01000 (reserved)
 BIOS-e820: ff00 - 0001 (reserved)
 BIOS-e820: 0001 - 00014000 (usable)
Entering add_active_range(0, 0, 156) 0 entries of 256 used
Entering add_active_range(0, 256, 786272) 1 entries of 256 used
Entering add_active_range(0, 1048576, 1310720) 2 entries of 256 used
end_pfn_map = 1310720
DMI present.
ACPI: RSDP (v000 PTLTD ) @ 0x000f58d0
ACPI: RSDT (v001 PTLTDRSDT   0x0604  LTP 0x) @ 
0xbff636df
ACPI: FADT (v001 INTEL  TUMWATER 0x0604 PTL  0x0003) @ 
0xbff68e48
ACPI: MADT (v001 PTLTD   APIC   0x0604  LTP 0x) @ 
0xbff68ebc
ACPI: MCFG (v001 PTLTDMCFG   0x0604  LTP 0x) @ 
0xbff68f4c
ACPI: BOOT (v001 PTLTD  $SBFTBL$ 0x0604  LTP 0x0001) @ 
0xbff68f88
ACPI: SPCR (v001 PTLTD  $UCRTBL$ 0x0604 PTL  0x0001) @ 
0xbff68fb0
ACPI: SSDT (v001  PmRefCpuPm 0x3000 INTL 0x20050228) @ 
0xbff6371b
ACPI: DSDT (v001  Intel BLAKFORD 0x0604 MSFT 0x010e) @ 
0x
Entering add_active_range(0, 0, 156) 0 entries of 256 used
Entering add_active_range(0, 256, 786272) 1 entries of 256 used
Entering add_active_range(0, 1048576, 1310720) 2 entries of 256 used
Zone PFN ranges:
  DMA 0 -> 4096
  DMA324096 ->  1048576
  Normal1048576 ->  1310720
early_node_map[3] active PFN ranges
0:0 ->  156
0:  256 ->   786272
0:  1048576 ->  1310720
On node 0 totalpages: 1048316
  DMA zone: 56 pages used for memmap
  DMA zone: 1395 pages reserved
  DMA zone: 2545 pages, LIFO batch:0
  DMA32 zone: 14280 pages used for memmap
  DMA32 zone: 767896 pages, LIFO batch:31
  Normal zone: 3584 pages used for memmap
  Normal zone: 258560 pages, LIFO batch:31
ACPI: Local APIC address 0xfee0
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 (Bootup-CPU)
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x06] enabled)
Processor #6
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
Processor #1
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x07] enabled)
Processor #7
ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x02] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x03] high edge lint[0x1])
ACPI: IOAPIC (id[0x02] address[0xfec0] gsi_base[0])
IOAPIC[0]: apic_id 2, address 0xfec0, GSI 0-23
ACPI: IOAPIC (id[0x03] address[0xfec8] gsi_base[24])
IOAPIC[1]: apic_id 3, address 0xfec8, GSI 24-47
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 high edge)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
ACPI: IRQ0 used by override.
ACPI: IRQ2 used by override.
ACPI: IRQ9 used by override.
Setting APIC routing to flat
Using ACPI (MADT) for SMP configuration information
Nosave address range: 0009c000 - 0009d000
Nosave address range: 0009d000 - 000a
Nosave address range: 000a - 000cc000
Nosave address range: 000cc000 - 000d
Nosave address range: 000d - 000e4000
Nosave address range: 000e4000 - 0010
Nosave address range: bff6 - bff69000
Nosave address range: bff69000 - bff8
Nosave address range: bff8 - c000
Nosave address range: c000 - e000
Nosave address range: e000 - f000
Nosave address range: f000 - fec0
Nosave address range: fec0 - fec1
Nosave address range: fec1 - fee0
Nosave address range: fee0 - fee01000
Nosave address range: fee01000 - ff00
Nosave address range: ff00 - 0001
Allocati

Re: [PATCH?] tcp and delayed acks

2006-08-16 Thread Benjamin LaHaise
On Wed, Aug 16, 2006 at 12:11:12PM -0700, Stephen Hemminger wrote:
> > is throttled waiting for ACKs to arrive.  The problem is exacerbated when 
> > the sender is using a small send buffer -- running netperf -C -c -- -s 1024 
> > show a miserable 420Kbit/s at essentially 0% CPU usage.  Tests over gige 
> > are similarly constrained to a mere 96Mbit/s.
> 
> What ethernet hardware? The defaults are often not big enough
> for full speed on gigabit hardware. I need increase rmem/wmem to allow
> for more buffering. 

This is for small transmit buffer sizes over either loopback or 
e1000.  The artifact also shows up over localhost for somewhat larger buffer 
sizes, although it is much more difficult to get results that don't have 
large fluctuations because of other scheduling issues.  Pinning the tasks to 
CPUs is on my list of things to try, but something in the multiple variants 
of sched_setaffinity() has resulted in it being broken in netperf.

> The point of delayed ack's was to merge the response and the ack on 
> request/response
> protocols like NFS or telnet. It does make sense to get it out sooner though.

I would like to see what sort of effect this change has on higher latency.  
Ideally, quick ack mode should be doing the right thing, but it might need 
more input about the receiver's intent.

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH?] tcp and delayed acks

2006-08-16 Thread Benjamin LaHaise
Hello folks,

In looking at a few benchmarks (especially netperf) run locally, it seems
that TCP is unable to make full use of available CPU cycles as the sender
is throttled waiting for ACKs to arrive.  The problem is exacerbated when
the sender is using a small send buffer -- running netperf -C -c -- -s 1024
shows a miserable 420Kbit/s at essentially 0% CPU usage.  Tests over gige
are similarly constrained to a mere 96Mbit/s.

Since there is no way for the receiver to know if the sender is being 
blocked on transmit space, would it not make sense for the receiver to 
send out any delayed ACKs when it is clear that the receiving process is 
waiting for more data?  The patch below attempts this (I make no guarantees 
of its correctness with respect to the rest of the delayed ack code).  One 
point I'm still contemplating is what to do if the receiver is waiting in 
poll/select/epoll.
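
As a rough illustration of the idea (a sketch only, not the patch itself;
names as in 2.6.17-era TCP code, and exactly where the hook belongs is part
of what needs figuring out), the receive path would push the pending ACK out
just before going to sleep:

	/* Sketch: in tcp_recvmsg(), before blocking for more data, emit any
	 * ACK the delayed-ACK machinery is still holding so that a sender
	 * throttled on a tiny send buffer gets its window opened sooner. */
	struct inet_connection_sock *icsk = inet_csk(sk);

	if (copied > 0 && (icsk->icsk_ack.pending & ICSK_ACK_SCHED))
		tcp_send_ack(sk);

	sk_wait_data(sk, &timeo);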

[All tests run with maxcpus=1 on a 2.67GHz Woodcrest system.]

Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

Base (2.6.17-rc4):
default send buffer size
netperf -C -c
 87380  16384  16384    10.02   14127.79   99.90    99.90    0.579   0.579
 87380  16384  16384    10.02   13875.28   99.90    99.90    0.590   0.590
 87380  16384  16384    10.01   13777.25   99.90    99.90    0.594   0.594
 87380  16384  16384    10.02   13796.31   99.90    99.90    0.593   0.593
 87380  16384  16384    10.01   13801.97   99.90    99.90    0.593   0.593

netperf -C -c -- -s 1024
 87380   2048   2048     10.02       0.43   -0.04    -0.04   -7.105  -7.377
 87380   2048   2048     10.02       0.43   -0.01    -0.01   -2.337  -2.620
 87380   2048   2048     10.02       0.43   -0.03    -0.03   -5.683  -5.940
 87380   2048   2048     10.02       0.43   -0.05    -0.05   -9.373  -9.625
 87380   2048   2048     10.02       0.43   -0.05    -0.05   -9.373  -9.625

from a remote system over gigabit ethernet
netperf -H woody -C -c
 87380  16384  16384    10.03     936.23   19.32    20.47    3.382   1.791
 87380  16384  16384    10.03     936.27   17.67    20.95    3.091   1.833
 87380  16384  16384    10.03     936.17   19.18    20.77    3.356   1.817
 87380  16384  16384    10.03     936.26   18.22    20.26    3.188   1.773
 87380  16384  16384    10.03     936.26   17.35    20.54    3.036   1.797

netperf -H woody -C -c -- -s 1024
 87380   2048   2048     10.00      95.72   10.04     6.64   17.188  5.683
 87380   2048   2048     10.00      95.94    9.47     6.42   16.170  5.478
 87380   2048   2048     10.00      96.83    9.62     5.72   16.283  4.840
 87380   2048   2048     10.00      95.91    9.58     6.13   16.368  5.236
 87380   2048   2048     10.00      95.91    9.58     6.13   16.368  5.236


Patched:
default send buffer size
netperf -C -c
 87380  16384  16384    10.01   13923.16   99.90    99.90    0.588   0.588
 87380  16384  16384    10.01   13854.59   99.90    99.90    0.591   0.591
 87380  16384  16384    10.02   13840.42   99.90    99.90    0.591   0.591
 87380  16384  16384    10.01   13810.96   99.90    99.90    0.593   0.593
 87380  16384  16384    10.01   13771.27   99.90    99.90    0.594   0.594

netperf -C -c -- -s 1024
 87380   2048   2048     10.02    2473.48   99.90    99.90    3.309   3.309
 87380   2048   2048     10.02    2421.46   99.90    99.90    3.380   3.380
 87380   2048   2048     10.02    2288.07   99.90    99.90    3.577   3.577
 87380   2048   2048     10.02    2405.41   99.90    99.90    3.402   3.402
 87380   2048   2048     10.02    2284.41   99.90    99.90    3.582   3.582

netperf -H woody -C -c
 87380  16384  16384    10.04     936.10   23.04    21.60    4.033   1.890
 87380  16384  16384    10.03     936.20   18.52    21.06    3.242   1.843
 87380  16384  16384    10.03     936.52   17.61    21.05    3.082   1.841
 87380  16384  16384    10.03     936.18   18.24    20.73    3.191   1.814
 87380  16384  16384    10.03     936.28   18.30    21.04    3.202   1.841

netperf -H woody -C -c -- -s 1024
 87380   2048   2048     10.00     142.46   10.19     7.53   11.714  4.332
 87380   2048   2048     10.00     147.28    9.73     7.93   10.829  4.412
 87380   2048   2048     10.00     143.37   10.64     6.54   12.161  3.738
 87380   2048   2048     10.00     146.41    9.18     7.43   10.277  4.158
 87380   2048   2048     10.01     145.58    9.80     7.25   11.032  4.081

Comments/thoughts?

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.


diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 934396b..e554ceb 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -127

Re: [PATCH] [e1000]: Remove unnecessary tx_lock

2006-08-08 Thread Benjamin LaHaise
On Wed, Aug 09, 2006 at 10:25:30AM +1000, Herbert Xu wrote:
> The problem here is that the TX clean function does not take the lock
> (nor do we want it to).  It can thus come in while you're transmitting
> and empty the queue.

That can be solved with sequence numbers -- i.e., we keep track of the number 
of packets queued to the hardware, as well as tracking the number of tx 
completions that have been reaped.  Because those numbers are flushed to 
memory by other locks (barriers) taken during processing, the code in 
hard_start_xmit() and the actual tx completion reaping could be done without 
the atomic ops.  The key thing is that the higher level code in the network 
stack calling dev->hard_start_xmit() would know that it needs to retry if 
the tx complete sequence changes.  I'll ponder this a bit more and try to 
come up with some sample code.
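
A rough sketch of the counting scheme (illustrative only -- the structure
and helpers below are made up, not e1000 code):

	/* Sketch: all names here are hypothetical. */
	struct xmit_counters {
		unsigned int	queued;		/* packets handed to the hardware   */
		unsigned int	reaped;		/* tx completions already processed */
	};

	static int example_xmit_one(struct xmit_counters *c, unsigned int ring_size)
	{
		unsigned int reaped = c->reaped;	/* snapshot before posting */

		if (c->queued - reaped >= ring_size)
			return 1;	/* looked full; caller rechecks c->reaped and retries */

		/* post the descriptor to the hardware here ... */
		c->queued++;
		return 0;
	}

The completion path only ever advances the reaped count, and the barriers in
the surrounding locks make both counters visible, so neither side needs an
atomic op; the caller simply retries if the completion sequence has moved.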

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [e1000]: Remove unnecessary tx_lock

2006-08-08 Thread Benjamin LaHaise
On Tue, Aug 08, 2006 at 03:06:07PM -0700, David Miller wrote:
> The driver ->hard_start_xmit() method is invoked with the queue
> unlocked, so this kind of scheme would not be workable.

Looking at e1000, NETIF_F_LLTX is a waste -- the driver takes a spinlock 
almost immediately after entering.  Taking the spinlock higher up in the 
network stack, preferably for multiple packets, would at least let the 
CPU deal with things sooner.  tg3 doesn't use LLTX and thus holds the lock 
over hard_start_xmit().  bonding is atrocious and uses a read lock and unlock 
in many of the xmit routines.  The only true lockless tx seems to be 
loopback.

> While the queue is unlocked, another cpu could empty entries in the
> TX queue making it not-full, which invalidates this "queue is now
> full" return value the driver just gave.
> 
> We don't do things the way we do it now for fun, it's just the most
> reasonable scheme we've come up with given the locking constraints.

Given that the vast majority of drivers are locked during their xmit 
routine, it is probably worth implementing at some point.

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [e1000]: Remove unnecessary tx_lock

2006-08-08 Thread Benjamin LaHaise
On Fri, Aug 04, 2006 at 04:31:11PM -0700, David Miller wrote:
> Yes, it's meant to catch unintented races.
> 
> The queueing layer that calls ->hard_start_xmit() technically has no
> need to support NETDEV_TX_BUSY as a return value, since the device
> is able to prevent this.
> 
> If we could avoid NETDEV_TX_BUSY et al. from happening, the idea is
> that we could eliminate all of the requeueing logic in the packet
> scheduler layer.  It is a pipe dream from many years ago.

Maybe the way NETDEV_TX_BUSY is implemented is wrong -- that is, it should 
be possible to return a result saying "we sent the packet, but the queue is 
now full".  That would eliminate the atomic op that netif_queue_stop() 
performs and allow higher layers to merge the necessary barrier on the full 
case with the op on the tx lock.  That way we could have lockless queue 
handling within the driver.  This might need start/stop sequence counters, 
though.
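
Spelled out (purely hypothetical -- neither the return code nor the driver
below exists; it just illustrates the combined "sent, but now full" result):

	#define NETDEV_TX_OK_FULL	2	/* hypothetical: packet accepted, ring now full */

	struct example_priv {			/* hypothetical driver state */
		unsigned int	next_to_use;	/* producer index */
		unsigned int	next_to_clean;	/* consumer index */
		unsigned int	ring_size;
	};

	static int example_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
	{
		struct example_priv *np = netdev_priv(dev);

		/* ... post skb to descriptor np->next_to_use here ... */
		np->next_to_use++;

		if (np->next_to_use - np->next_to_clean >= np->ring_size)
			return NETDEV_TX_OK_FULL;	/* caller stops the queue, no atomic op in the driver */

		return NETDEV_TX_OK;
	}

The higher layer would then do the netif_stop_queue() equivalent itself,
under whatever lock it already holds for the qdisc.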

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: problems with e1000 and jumboframes

2006-08-03 Thread Benjamin LaHaise
On Thu, Aug 03, 2006 at 04:49:15PM +0200, Krzysztof Oledzki wrote:
> With 1 GB of RAM full 1GB/3GB (CONFIG_VMSPLIT_3G_OPT) seems to be 
> enough...

Nope, you lose ~128MB of RAM for vmalloc space.

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: problems with e1000 and jumboframes

2006-08-03 Thread Benjamin LaHaise
On Thu, Aug 03, 2006 at 03:48:39PM +0200, Arnd Hannemann wrote:
> However the box is a VIA Epia MII12000 with 1 GB of Ram and 1 GB of swap
> enabled, so there should be plenty of memory available. HIGHMEM support
> is off. The e1000 nic seems to be an 82540EM, which to my knowledge
> should support jumboframes.

> However I can't always reproduce this on a freshly booted system, so
> someone else may be the culprit and leaking pages?
> 
> Any ideas how to debug this?

This is memory fragmentation, and all you can do is work around it until 
the e1000 driver is changed to split jumbo frames up on rx.  Here are a 
few ideas that should improve things for you:

- switch to a 2GB/2GB split to recover the memory lost to highmem
  (see Processor Type and Features / Memory split)
- increase /proc/sys/vm/min_free_kbytes -- more free memory will 
  improve the odds that enough unfragmented memory is available for 
  incoming network packets

I hope this helps.

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC 1/4] kevent: core files.

2006-07-27 Thread Benjamin LaHaise
On Thu, Jul 27, 2006 at 02:44:50PM -0700, Zach Brown wrote:
> 
> >>int kevent_getevents(int event_fd, struct ukevent *events,
> >>int min_events, int max_events,
> >>struct timeval *timeout);
> > 
> > You've just reinvented io_getevents().
> 
> Well, that's certainly one inflammatory way to put it.  I would describe
> it as suggesting that the kevents collection interface not lose the
> nicer properties of io_getevents().

Perhaps, but there seems to be a lot of talk about introducing new APIs 
where it isn't entirely clear that it is needed.  Sorry if that sounded 
rather acerbic.

> > What exactly are we getting from 
> > reinventing this (aside from breaking existing apps and creating more of 
> > an API mess)?
> 
> A generic event collection interface that isn't so strongly bound to the
> existing semantics of io_setup() and io_submit().  It can be a file
> descriptor instead of a mysterious cookie/pointer to the mapped region,
> to start.

Things were like that at one point in time, but file descriptors turn out 
to introduce a huge gaping security hole with SUID programs.  The problem 
is that any event context is closely tied to the address space of the 
thread issuing the syscalls, and file descriptors do not have this close 
binding.

> Sure, so maybe we experiment with these things in the context of the
> kevent patches and maybe merge them back into the AIO paths if in the
> end that's the right thing to do.  I see no problem with separating
> development from the existing code.

Excellent!

> >> epoll and kevent both have the notion of an event type that always
> >> creates an event at the time of the collection syscall while the event
> >> source is on a ready list.  Think of epoll calling ->poll(POLLOUT) for
> >> an empty socket buffer at every sys_epoll_wait() call.  We can't have
> >> some source constantly spewing into the ring :/.  We could fix this by
> >> the API requiring that level events can *only* be collected through the
> >> syscall interface.  userspace could call into the collection syscall
> >> every N events collected through the ring, say.  N would be tuned to
> >> amortize the syscall cost and still provide fairness or latency for the
> >> level sources.  I'd be fine with that, especially when it's hidden off
> >> in glibc.
> > 
> > This is exactly why I think level triggered events are nasty.  It's 
> > impossible to do cleanly without requiring a syscall.
> 
> I'm not convinced that it isn't possible to get a sufficiently clean
> interface that involves the mix.

My argument is that this approach introduces a slow path into the heavily 
loaded server case.  If you can show me how to avoid that, I'd be happy to 
see such an implementation. =-)

> > As soon as you allow queueing events up in kernel space, it becomes 
> > necessary to do another syscall after pulling events out of the queue, 
> > which is a waste of CPU cycles when you're under heavy load (exactly the 
> > point at which you want the system to be its most efficient).
> 
> If we've just consumed a full ring worth of events, and done real work
> with them, I'm not convinced that an empty syscall is going to be that
> painful.  If we're really under load it might well return some newly
> arrived events.  It becomes a mix of ring completions and syscall
> completions.

Except that you're not usually pulling a full ring worth of events at a 
time, more often just one.  One of the driving forces behind AIO use is 
in realtime apps where you don't want to eat occasional spikes in the 
latency of request processing, one just wants to eat the highest priority 
event then work on the next.  By keeping each step small and manageable, 
the properties of the system are much easier to predict.  Yes, batching 
can be helpful performance-wise, but it is somewhat opposite to the design 
criteria that need to be considered.  The right way to cope with that may 
be to have two different modes of operation that trade off one way or the 
other on the batching question.

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC 1/4] kevent: core files.

2006-07-27 Thread Benjamin LaHaise
On Thu, Jul 27, 2006 at 12:18:42PM -0700, Zach Brown wrote:
> The easy part is fixing up the somewhat obfuscated collection call.
> Instead of coming in through a multiplexer that magically treats a void
> * as a struct kevent_user_control followed by N ukevents (as specified
> in the kevent_user_control!) we'd turn it into a more explicit
> collection syscall:
> 
>   int kevent_getevents(int event_fd, struct ukevent *events,
>   int min_events, int max_events,
>   struct timeval *timeout);

You've just reinvented io_getevents().  What exactly are we getting from 
reinventing this (aside from breaking existing apps and creating more of 
an API mess)?

> Say we have a ring of event structs.  AIO has this today, but it sort of
> gets it wrong because each event element doesn't specify whether it is
> owned by the kernel or userspace.  (It really gets it wrong because it
> doesn't flush_dcache_page() after updating the ring via kmap(), but
> never mind that!  No one actually uses this mmap() AIO ring.)  In AIO
> today there is also a control struct mapped along with the ring that has
> head and tail pointers.  We don't want to bounce that cacheline around.
>  net/socket/af_packet.c gets this right with it's tp_status member of
> tpacket_hdr.

That could be rev'd in the mmap() ring buffer, as there are compat and 
incompat bits for changing the structure layout.  As for bouncing the 
cacheline of head/tail around, I don't think it matters on real machines, 
as the multithreaded/SMP case will hit that cacheline bouncing if the 
user is sharing the event ring between multiple threads on multiple CPUs.  
The only way around that is to use multiple event rings, say one per node, 
at which point you have to do load balancing of io requests explicitly 
between queues (which might be worth it).

> So, great, glibc can now find pending events very quickly if they're
> waiting in the ring and can fall back to the collection syscall if it
> wants to wait and the ring is empty.  If it consumes events via the
> syscall it increases its ring index by the number the syscall returned.
> 
> There's two things we should address: level events and the notion of
> only submitting as much as fits in the ring.
> 
> epoll and kevent both have the notion of an event type that always
> creates an event at the time of the collection syscall while the event
> source is on a ready list.  Think of epoll calling ->poll(POLLOUT) for
> an empty socket buffer at every sys_epoll_wait() call.  We can't have
> some source constantly spewing into the ring :/.  We could fix this by
> the API requiring that level events can *only* be collected through the
> syscall interface.  userspace could call into the collection syscall
> every N events collected through the ring, say.  N would be tuned to
> amortize the syscall cost and still provide fairness or latency for the
> level sources.  I'd be fine with that, especially when it's hidden off
> in glibc.

This is exactly why I think level triggered events are nasty.  It's 
impossible to do cleanly without requiring a syscall.

> Today AIO only allows submission of as many events as there are space in
> the ring.  It mostly does this so its completion can drop an event in
> the ring from any context.  If we back away from this so that we can
> have long-lived source registration generate multiple edge events (and I
> think we want to!), we have to be kind of careful.  A source could
> generate an event while the ring is full.  The event could go in a list
> but if userspace is collecting events in userspace the kernel won't be
> told when there's space.  We'd first have to check this ready list when
> later events are generated so that pending events on the list aren't
> overlooked.  Userspace would also want to use the collection syscall as
> the ring empties.  Neither seem hard.

As soon as you allow queueing events up in kernel space, it becomes 
necessary to do another syscall after pulling events out of the queue, 
which is a waste of CPU cycles when you're under heavy load (exactly the 
point at which you want the system to be its most efficient).  Given that 
growing the ring buffer is easy enough to do, I'm not sure that the hit 
is worth it.  At some point there has to be some form of flow control 
involved, and it is much better if it is explicitly obvious where this 
happens (as opposed to signal queues and our wonderful OOM handling).

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 2/3] drivers/net/ns83820.c: add paramter to disable autonegotiation

2006-06-26 Thread Benjamin LaHaise
On Sun, Jun 25, 2006 at 01:44:36AM -0700, [EMAIL PROTECTED] wrote:
> 
> From: Dan Faerch <[EMAIL PROTECTED]>
> 
> Adds "ethtool command" support to driver.  Initially 2 commands are
> implemented: force fullduplex and toggle autoneg.

This part is good, although something for copper cards still needs doing,
which probably means poking around to support the PHY properly.

> Also added a "disable_autoneg" module argument to completely disable
> autoneg on all cards using this driver.

This is the part I disagree with.  Are you sure it isn't a bug in the 
link autonegotiation state machine for fibre cards?  It should be defaulting 
to 1Gbit/full duplex if no autonegotiation is happening, and if it isn't 
then that should be fixed instead of papering over things with a config 
option.

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [1/4] kevent: core files.

2006-06-23 Thread Benjamin LaHaise
On Fri, Jun 23, 2006 at 01:54:23PM -0700, David Miller wrote:
> From: Benjamin LaHaise <[EMAIL PROTECTED]>
> Date: Fri, 23 Jun 2006 16:31:14 -0400
> 
> > Eh?  Nobody has posted any numbers comparing the approaches yet, so this 
> > is pure handwaving, unless you have real concrete results?
> 
> Evgeniy posts numbers and performance graphs on his kevent work all
> the time.

But you're arguing that the performance of something that hasn't been 
tested is worse simply by nature of it not having been tested.  That's a 
fallacy of omission, iiuc.

> Van Jacobson did in his LCA2006 net channel slides too, perhaps you
> missed that.

I have yet to be convinced that the layering violation known as net channels 
is the right way to go, mostly because it breaks horribly in a few cases -- 
think what happens during periods of CPU overcommit, in which case doing too 
much in interrupt context will kill a system (which is why softirqs are 
needed).  The effect of doing all processing in user context creates issues 
with delayed acks (due to context switching to other tasks in the system), 
which will cause excess retransmits.  The hard problems associated with 
packet filtering and security are also still unresolved, which is okay for 
a paper, but a concern in real life.

There are also a number of performance flaws in the current stack that 
show up under profiling, some of which I posted fixes for, some of which 
have yet to be fixed.  The pushf/popf pipeline stall was one of the bigger 
instances of CPU wastage that Van Jacobson noticed (it shows up as bottom 
halves using lots of CPU).  IIRC, Ingo's real-time patches may avoid that 
by way of reworking the irq disable/enable mechanism, which would mean the 
results need retesting.  Using the cr8 register to enable/disable 
interrupts on x86-64 might also improve things, as that would eliminate the 
flags dependency of cli/sti...

In short, there's a lot of work that still has to be done.

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [1/4] kevent: core files.

2006-06-23 Thread Benjamin LaHaise
On Sat, Jun 24, 2006 at 01:08:27AM +0400, Evgeniy Polyakov wrote:
> On Fri, Jun 23, 2006 at 04:44:42PM -0400, Benjamin LaHaise ([EMAIL 
> PROTECTED]) wrote:
> > > AIO completion approach was designed to be used with process context VFS
> > > update. read/write approach can not cover other types of notifications,
> > > like inode updates or timers.
> > 
> > The completion event is 100% generic and does not need to come from process 
> > context.  Calling aio_complete() from irq context is entirely valid.
> 
> put_ioctx() can sleep.

Err, no, that should definitely not be the case.  If it can, someone has 
completely broken aio.

> It is not syscall, but overall design should be analyzed.
> It is possible to use existing ssycalls, kevent design does not care
> about how it's data structures are delivered to the internal
> "processor".

Okay, that's good to hear. =-)

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [1/4] kevent: core files.

2006-06-23 Thread Benjamin LaHaise
On Sat, Jun 24, 2006 at 12:17:17AM +0400, Evgeniy Polyakov wrote:
> But now it is implemented as repeated call for the same work, which does
> not look like it can be used for any other types of work.

Given an iocb, you do not have to return -EIOCBRETRY; instead, return
-EIOCBQUEUED and then, from whatever context, do an aio_complete() with the
result for that iocb.
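
In code form, the non-retry pattern amounts to this (a sketch of the pattern
only, with a hypothetical submit helper, using the 2.6.17-era aio_read
prototype):

	/* Sketch: queue the request, then complete it later from irq context. */
	static ssize_t example_aio_read(struct kiocb *iocb, char __user *buf,
					size_t count, loff_t pos)
	{
		example_submit_to_hw(iocb, buf, count, pos);	/* hypothetical */
		return -EIOCBQUEUED;		/* iocb is in flight, no retry needed */
	}

	/* Later, from the device's interrupt handler (or any other context): */
	static void example_irq_done(struct kiocb *iocb, long bytes_done)
	{
		aio_complete(iocb, bytes_done, 0);	/* queues the completion event */
	}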

> And repeated work introduce latencies.
> As far as I recall, it is you who wanted to remove thread based approach
> from AIO subsystem.

I have essentially given up on trying to get the filesystem AIO patches 
in given that the concerns against them are "woe complexity" with no real 
recourse for inclusion being open.  If David is open to changes in the 
networking area, I'd love to see it built on top of your code.

> AIO completion approach was designed to be used with process context VFS
> update. read/write approach can not cover other types of notifications,
> like inode updates or timers.

The completion event is 100% generic and does not need to come from process 
context.  Calling aio_complete() from irq context is entirely valid.

> Format of the structure transferred between the objects does not matter
> at all. We can create a wrapper on kevent structures or kevent can
> transform data from AIO objects.

> The main design goal of kevent is to provide easy connected hooks into
> any state machine, which might be used by kernelspace to notify about
> any kind of events without any knowledge of it's background nature.
> Kevent can be used for example as notification blocks for address
> changes or it can replace netlink completely (it can even emulate
> event multicasting).
> 
> Kevent is queue of events, which can be transferred from any object to
> any destination.

And io_getevents() reads a queue of events, so I'm not sure why you need 
a new syscall.
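
From userspace the existing interface already behaves as an event queue; a
minimal collection loop with libaio looks roughly like this (handle_event()
is a stand-in for application code):

	#include <libaio.h>

	static void event_loop(void)
	{
		io_context_t ctx = 0;
		struct io_event events[64];
		int i, n;

		io_setup(256, &ctx);		/* create the in-kernel event queue */

		for (;;) {
			/* block until at least one completion is available */
			n = io_getevents(ctx, 1, 64, events, NULL);
			for (i = 0; i < n; i++)
				handle_event(&events[i]);	/* events[i].obj, .res, .data */
		}
	}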

> Not at all!
> Kevent is a mechanism, which allows to impleement AIO, network AIO, poll
> and select, timer control, adaptive readhead (as example of AIO VFS
> update). All the code I present shows how to use kevent, it is not part
> of the kevent. One can find Makefile in kevent dir to check what is the
> core of the subsystem, which allows to be used as a transport for
> events.
> 
> AIO, NAIO, poll/select, socket and timer notifications are just users.
> One can add it's own usage as easy as to call kevent_storage
> initialization function and event generation function. All other pieces
> are hidded in the implementation.

I'll look at adapting your code to use the existing syscalls.  Maybe code 
will be better at expressing my concerns.

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [1/4] kevent: core files.

2006-06-23 Thread Benjamin LaHaise
On Fri, Jun 23, 2006 at 01:19:40PM -0700, David Miller wrote:
> I completely agree with Evgeniy here.
> 
> There is nothing in the kernel today that provides integrated event
> handling.  Nothing.  So when someone says to use the "existing" stuff,
> they need to have their head examined.

The existing AIO events are *events*, with the syscalls providing the 
reading of events.

> The existing AIO stuff stinks as a set of interfaces.  It was designed
> by a standards committee, not by people truly interested in a good
> performing event processing design.  It is especially poorly suited
> for networking, and any networking developer understands this.

I disagree.  Stuffing in an event saying that a read or write is
complete/ready is a good way of handling things, even more so with hardware
that will perform the memory copies to/from user buffers.

> It is pretty much a foregone conclusion that we will need new
> APIs to get good networking performance.  Every existing interface
> has one limitation or another.

Eh?  Nobody has posted any numbers comparing the approaches yet, so this 
is pure handwaving, unless you have real concrete results?

> So we should be happy people like Evgeniy try to work on this stuff,
> instead of discouraging them.

I would like to encourage him, but at the same time I don't want to see
APIs created that essentially duplicate existing work and needlessly
break compatibility.  I completely agree that the in-kernel APIs are not
as encompassing as they should be, and within the kernel Evgeniy's work
may well be the way to go.  What I do not agree with is that we need new
syscalls at this point.  I'm perfectly willing to accept proof that change
is needed if we do a proper comparison between any new syscall API and the
use of the existing syscall API, but the pain of introducing a new API is
sufficiently large that I think it is worth looking at the numbers.

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [1/4] kevent: core files.

2006-06-23 Thread Benjamin LaHaise
On Fri, Jun 23, 2006 at 11:24:29PM +0400, Evgeniy Polyakov wrote:
> What API are you talking about?
> There is only epoll(), which is 40% slower than kevent, and AIO, which
> works not as state machine, but as repeated call for the same work.
> There is also inotify, which allocates new message each time event
> occurs, which is not a good solution for every situation.

AIO can be implemented as a state machine.  Nothing in the API stops 
you from doing that, and in fact there was code which was implemented as 
a state machine used on 2.4 kernels.

> Linux just does not have unified event processing mechanism, which was
> pointed to many times in AIO mail list and when epoll() was only
> introduced. I would even say, that Linux does not have such mechanism at
> all, since every potential user implements it's own, which can not be
> used with others.

The epoll event API doesn't have space in the event fields for result codes 
as needed for AIO.  The AIO API does -- how is it lacking in this regard?
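
Side by side (paraphrasing the userspace headers), the two per-event
structures make the point:

	/* epoll: what epoll_wait() hands back per event */
	struct epoll_event {
		uint32_t	events;		/* EPOLLIN, EPOLLOUT, ...           */
		epoll_data_t	data;		/* opaque user cookie, nothing else */
	};

	/* AIO: what io_getevents() hands back per event */
	struct io_event {
		__u64		data;		/* user cookie (iocb->aio_data) */
		__u64		obj;		/* which iocb completed         */
		__s64		res;		/* result / byte count          */
		__s64		res2;		/* secondary result code        */
	};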

> Kevent fixes that. Although implementation itself can be suboptimal for
> some cases or even unacceptible at all, but it is really needed
> functionality.

At the expense of adding another API?  How is this a good thing?  Why 
not spit out events in the existing format?

> Every existing notification can be built on top of kevent. One can find
> how easy it was to implement generic poll/select notifications (what
> epoll() does) or socket notifications (which are similar to epoll(), but
> are called from inside socket state machine, thus improving processing
> performance).

So far your code is adding a lot without unifying anything.

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.


Re: [1/4] kevent: core files.

2006-06-23 Thread Benjamin LaHaise
On Fri, Jun 23, 2006 at 11:09:34AM +0400, Evgeniy Polyakov wrote:
> This patch includes core kevent files:
>  - userspace controlling
>  - kernelspace interfaces
>  - initialisation
>  - notification state machines

We don't need yet another event mechanism in the kernel, so I don't see 
why the new syscalls should be added when they don't interoperate with 
existing solutions.  If your results are enough to sway akpm that it is 
worth taking the patches, then it would make sense to merge the code with 
the already in-tree APIs.

-ben


Re: [patch 7/8] lock validator: fix ns83820.c irq-flags bug

2006-06-11 Thread Benjamin LaHaise
> The above code snippet removes the nested unlock-irq, but now the code 
> is unbalanced, so IMO this patch _adds_ confusion.
> 
> I think the conservative patch for 2.6.17 is the one I have attached. 
> Unless there are objections, that is what I will forward...

This looks reasonable and sufficiently conservative.  Reworking locking is 
something that I'm a bit more hesitant about, although folding misc_lock 
in with the other locks perhaps makes sense.  I would like to keep the 
split between tx and tx completion, though.  Also, any rework is going to 
need real testing, which is not something that a simple release cycle is 
likely to get enough coverage on.

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.


Re: TSO and IPoIB performance degradation

2006-03-20 Thread Benjamin LaHaise
On Mon, Mar 20, 2006 at 02:04:07PM +0200, Michael S. Tsirkin wrote:
> does not stretch ACKs anymore. RFC 2581 does mention that it might be OK to
> stretch ACKs "after careful consideration", and we are seeing that it helps
> IP over InfiniBand, so recent Linux kernels perform worse in that respect.
> 
> And since there does not seem to be a way to figure it out automagically when
> doing this is a good idea, I proposed adding some kind of knob that will let 
> the
> user apply the consideration for us.

Wouldn't it make sense to stretch the ACK when the previous ACK is still in 
the TX queue of the device?  I know that sort of behaviour was always an 
issue on modem links where you don't want to send out redundant ACKs.

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.


Re: [PATCH] scm: fold __scm_send() into scm_send()

2006-03-13 Thread Benjamin LaHaise
On Mon, Mar 13, 2006 at 09:05:31PM +0100, Ingo Oeser wrote:
> From: Ingo Oeser <[EMAIL PROTECTED]>
> 
> Fold __scm_send() into scm_send() and remove that interface completly
> from the kernel.

Whoa, what are you doing here?  Uninlining scm_send() is a Bad Thing to do 
given that scm_send() is in the AF_UNIX hot path.

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.


Re: [RFC/PATCH] rcuification of ipv4 established and timewait connections

2006-03-09 Thread Benjamin LaHaise
On Thu, Mar 09, 2006 at 01:12:20PM -0800, David S. Miller wrote:
> Once we have RCU in place for the TCP hash tables, we have the chance
> to code up dynamically sized hash tables.  With the current locking,
> this is basically impossible, with RCU it can be done.

Nice!

> So Ben can you work to figure out what the bind(0.0.0.0)
> problem was?  Once that is fully resolved, I think I'll apply
> your RCU patch to net-2.6.17.

I think I know what the problem is.  Basically, sockets from the listen 
table can have their refcount leaked in the do_time_wait chunk of code.  
I'll send out a patch that reworks that area and also adds one thing I 
missed, which is that sock_hold() in an rcu section needs to be an 
atomic_inc_not_zero, or the socket can be leaked (resulting in exactly 
that sort of failure when a listening socket is destroyed).
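
Roughly, the lookup side then has to do something like the following
(sketch only, not the patch itself; my_lookup() is a made-up stand-in for
the RCU hash chain walk):

#include <net/sock.h>
#include <linux/rcupdate.h>

extern struct sock *my_lookup(u32 saddr, u16 sport, u32 daddr, u16 dport,
			      int dif);	/* hypothetical helper */

static struct sock *my_lookup_and_hold(u32 saddr, u16 sport,
				       u32 daddr, u16 dport, int dif)
{
	struct sock *sk;

	rcu_read_lock();
	sk = my_lookup(saddr, sport, daddr, dport, dif);
	/* The entry may already have dropped to a zero refcount and be
	 * waiting out its grace period, so a plain sock_hold() is unsafe;
	 * only an increment that fails on zero will do. */
	if (sk && !atomic_inc_not_zero(&sk->sk_refcnt))
		sk = NULL;
	rcu_read_unlock();
	return sk;
}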

-ben


Re: [RFC/PATCH] rcuification of ipv4 established and timewait connections

2006-03-09 Thread Benjamin LaHaise
On Thu, Mar 09, 2006 at 07:25:25PM +0100, Eric Dumazet wrote:
> On a second thought, do you think we still need one rwlock per hash chain ?
> 
> TCP established hash table entries: 1048576 (order: 12, 16777216 bytes)
> 
> On this x86_64 machine, we 'waste' 8 MB of ram for those rwlocks.
> 
> With RCU, we touch these rwlocks only on TCP connection creation/deletion, 
> maybe we could reduce to one rwlock or a hashed array of 2^N rwlocks (2^N 
> depending on NR_CPUS), like in net/ipv4/route.c ?

Hrm, maybe use cmpxchg?  That gets rid of the lock and automatically provides 
us with hashed spinlocks on archs without a cmpxchg implementation.

-ben


Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh

2006-03-09 Thread Benjamin LaHaise
On Thu, Mar 09, 2006 at 07:41:08PM +1100, Nick Piggin wrote:
> Considering that local_t has been broken so that basically nobody
> is using it, now is a great time to rethink the types before it
> gets fixed and people start using it.

I'm starting to get more concerned, as the per-cpu changes keep adding 
more overhead and are getting out of hand.

> And modelling the type on the atomic types would make the most
> sense because everyone already knows them.

Except that the usage models are different; local_t is most likely to be 
used for cpu local statistics or for sequences, where making them signed 
is a bit backwards.

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.


Re: [RFC/PATCH] rcuification of ipv4 established and timewait connections

2006-03-09 Thread Benjamin LaHaise
On Thu, Mar 09, 2006 at 01:18:26PM +0300, Evgeniy Polyakov wrote:
> Ok, I hacked quite a bit in the patch, but I think nothing major was
> changed, basically patch rejects.
> And I'm now unable to bind to 0.0.0.0 address, i.e. bind() does not
> fail, but all connections are refused.
> Bind to machine's IP works fine.

Odd.  Can you fire off a copy of your backport to me?  I'll see if I can 
spot any quirks.  bind to 0.0.0.0 seems to work for me (ssh, portmap and 
a few test programs worked), but maybe something different is being done 
in your case.

-ben


Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh

2006-03-08 Thread Benjamin LaHaise
On Wed, Mar 08, 2006 at 02:25:28PM -0800, Ravikiran G Thirumalai wrote:
> Then, for the batched percpu_counters, we could gain by using local_t only 
> for 
> the UP case. But we will have to have a new local_long_t implementation 
> for that.  Do you think just one use case of local_long_t warrants for a new
> set of apis?

I think it may make more sense to simply convert local_t into a long, given 
that most of the users will be things like stats counters.

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.


Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh

2006-03-08 Thread Benjamin LaHaise
On Wed, Mar 08, 2006 at 01:07:26PM -0800, Ravikiran G Thirumalai wrote:
> But on non x86, local_bh_disable() is gonna be cheaper than a cli/atomic op 
> no?
> (Even if they were switched over to do local_irq_save() and
> local_irq_restore() from atomic_t's that is).

It's still more expensive than local_t.

> And if we use local_t, we will add the overhead for the non bh 
> percpu_counter_mod for non x86 arches.

Last time I checked, all the major architectures had efficient local_t 
implementations.  Most of the RISC CPUs are able to do a load / store 
conditional implementation that is the same cost (since memory barriers 
tend to be explicit on powerpc).  So why not use it?

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.


Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh

2006-03-08 Thread Benjamin LaHaise
On Wed, Mar 08, 2006 at 12:26:56PM -0800, Ravikiran G Thirumalai wrote:
> +static inline void percpu_counter_mod_bh(struct percpu_counter *fbc, long 
> amount)
> +{
> + local_bh_disable();
> + fbc->count += amount;
> + local_bh_enable();
> +}

Please use local_t instead, then you don't have to do the local_bh_disable() 
and enable() and the whole thing collapses down into 1 instruction on x86.
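
For illustration, the sort of thing I mean (sketch with a made-up counter,
not a patch):

#include <linux/percpu.h>
#include <asm/local.h>

static DEFINE_PER_CPU(local_t, my_stat);	/* hypothetical statistic */

static inline void my_stat_add(long amount)
{
	/* the update itself is a single non-locked add on x86 */
	local_add(amount, &get_cpu_var(my_stat));
	put_cpu_var(my_stat);
}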

-ben


Re: [RFC/PATCH] rcuification of ipv4 established and timewait connections

2006-03-08 Thread Benjamin LaHaise
On Wed, Mar 08, 2006 at 02:01:04PM +0300, Evgeniy Polyakov wrote:
> When I tested RCU for similar change for kevent, but postponed more work
> to RCU callback, including socket closing and some attempts to inode
> dereferencing, such change forced performance degradation for httperf
> benchmark and trivial web server from 2600 requests/sec down to 2200
> with 4.1K index page and 1500 MTU. 
> 
> I still have this testbed and will gladly test this change tomorrow 
> if it applies cleanly to 2.6.15-rc7?

I think it should; the code it touches isn't that heavily changed.  The 
results should be most interesting.  It might be useful to compare SMP 
and non-SMP cases, as the difference in locks may show up in the test.

-ben


Re: [PATCH] x86-64, use page->virtual to get 64 byte struct page

2006-03-08 Thread Benjamin LaHaise
On Wed, Mar 08, 2006 at 10:40:38AM +0100, Andi Kleen wrote:
> On Wednesday 08 March 2006 03:38, Benjamin LaHaise wrote:
> 
> > It's hardly that uncommon for pages to cross cachelines or for pages to
> > move around CPUs with networking. 
> 
> Data?

I posted a workload that shows this.   Anything that transmits over TCP on 
a CPU other than the one the interrupt occurs on is going to hit that 
behaviour for 7/8 of pages.

> > Please name some sort of benchmarks that show your concerns for decreased
> > performance.  
> 
> Anything that manipulates lots of data.

LMBench certainly doesn't look any worse.  In fact, things look slightly 
better.  (2.6.16n is with WANT_PAGE_VIRTUAL, while 2.6.16O is without).

cd results && make summary percent 2>/dev/null | more
make[1]: Entering directory `/md0/root/LMbench2/results'

 L M B E N C H  2 . 0   S U M M A R Y
 


Basic system parameters

Host OS Description  Mhz

- - --- 
cobra.kva Linux 2.6.16nx86_64-linux-gnu 2997
cobra.kva Linux 2.6.16nx86_64-linux-gnu 2997
cobra.kva Linux 2.6.16nx86_64-linux-gnu 2997
cobra.kva Linux 2.6.16nx86_64-linux-gnu 2997
cobra.kva Linux 2.6.16nx86_64-linux-gnu 2997
cobra.kva Linux 2.6.16Ox86_64-linux-gnu 2997
cobra.kva Linux 2.6.16Ox86_64-linux-gnu 2997
cobra.kva Linux 2.6.16Ox86_64-linux-gnu 2997
cobra.kva Linux 2.6.16Ox86_64-linux-gnu 2997
cobra.kva Linux 2.6.16Ox86_64-linux-gnu 2997

Processor, Processes - times in microseconds - smaller is better

Host OS  Mhz null null  open selct sig  sig  fork exec sh  
 call  I/O stat clos TCP   inst hndl proc proc proc
- -      -     
cobra.kva Linux 2.6.16n 2997 0.23 0.30 1.45 2.70 8.974 0.35 3.19 111. 443. 1674
cobra.kva Linux 2.6.16n 2997 0.23 0.30 1.46 2.81 8.984 0.35 3.19 116. 443. 1679
cobra.kva Linux 2.6.16n 2997 0.23 0.30 1.45 2.74 9.761 0.35 3.17 112. 446. 1683
cobra.kva Linux 2.6.16n 2997 0.23 0.30 1.43 2.82 8.992 0.34 3.20 113. 444. 1683
cobra.kva Linux 2.6.16n 2997 0.23 0.30 1.45 2.83 8.975 0.35 3.18 112. 445. 1673
cobra.kva Linux 2.6.16O 2997 0.23 0.30 1.44 2.74 9.008 0.36 3.17 110. 443. 1674
cobra.kva Linux 2.6.16O 2997 0.23 0.30 1.45 2.75 9.003 0.37 3.19 112. 445. 1681
cobra.kva Linux 2.6.16O 2997 0.23 0.30 1.45 2.80 9.010 0.37 3.17 112. 452. 1689
cobra.kva Linux 2.6.16O 2997 0.23 0.30 1.45 2.75 9.007 0.36 3.15 112. 453. 1689
cobra.kva Linux 2.6.16O 2997 0.23 0.30 1.44 2.77 9.021 0.36 3.15 113. 448. 1686

Context switching - times in microseconds - smaller is better
-
Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
- - - -- -- -- -- --- ---
cobra.kva Linux 2.6.16n 2.360 3.7500 6.6100 4.6600   10.1 5.215.3
cobra.kva Linux 2.6.16n 2.420 3.7600 6.6500 4.8500   10.5 5.4100015.0
cobra.kva Linux 2.6.16n 2.400 3.7600 6.5900 4.7000 9.6600 5.3700015.0
cobra.kva Linux 2.6.16n 2.400 3.7600 6.5600 4.6300 9.6100 5.8100015.0
cobra.kva Linux 2.6.16n 2.420 3.8200 6.5800 4.8400   10.7 5.4700014.7
cobra.kva Linux 2.6.16O 2.430 4.4800 7.2100 4.8500   10.4 5.9100015.8
cobra.kva Linux 2.6.16O 2.460 4.3100 7.2400 4.9000   10.7 5.4200015.9
cobra.kva Linux 2.6.16O 2.450 4.4800 7.2800 4.7000   10.1 5.215.9
cobra.kva Linux 2.6.16O 2.460 4.3900 6.9900 4.7200 9.7400 5.6500015.2
cobra.kva Linux 2.6.16O 2.450 4.4700 6.9400 4.8100 8.7100 5.5200015.5

*Local* Communication latencies in microseconds - smaller is better
---
Host OS 2p/0K  Pipe AF UDP  RPC/   TCP  RPC/ TCP
ctxsw   UNIX UDP TCP conn
- - - -  - - - - 
cobra.kva Linux 2.6.16n 2.360 8.228 12.8  12.7  21.7  17.6  29.0 43.6
cobra.kva Linux 2.6.16n 2.420 8.192 13.1  12.7  22.1  17.6  29.2 44.0
cobra.kva Linux 2.6.16n 2.400 8.163 13.0  12.7  22.4  17.6  29.3 43.9
cobra.kva Linux 2.6.16n 2.400 8.163 11.7  12.7  21.9  17.5  29.0 44.4
cobra.kva Linux 2.6.16n 2.420 8.211 13.1  12.7  22.3  17.6  29.5 44.4
cobra.kva Linux 2.6.16O 2.430 8.273 13.1  14.4  23.8  17.7  29.7 43.9
cobra.kva Linux 2.6.16O 2.460 8.284 10.6  14.2  24.1  17.7  29.6 43.8
cobra.kva Linux 2.6.16O 2.450 8.454 13.5  14.3  24.1  17.7  29.8 43.9
cobra.kva Linux 2.6.16O 2.460 8.245 10.7  14.2  24.3  17.7  

Re: [PATCH] x86-64, use page->virtual to get 64 byte struct page

2006-03-07 Thread Benjamin LaHaise
On Tue, Mar 07, 2006 at 07:50:52PM +0100, Andi Kleen wrote:
> 
> My vmlinux has
> 
> 80278382 :
> 80278382:   8b 0d 78 ea 41 00   mov4319864(%rip),%ecx 
># 80696e00 
> 80278388:   48 89 f8mov%rdi,%rax
> 8027838b:   48 c1 e0 0c shl$0xc,%rax
> 8027838f:   48 d3 e8shr%cl,%rax
> 80278392:   48 0f b6 80 00 5e 69movzbq 
> 0x80695e00(%rax),%rax
> 80278399:   80 
> 8027839a:   48 8b 14 c5 40 93 71mov
> 0x80719340(,%rax,8),%rdx
> 802783a1:   80 
> 802783a2:   48 2b ba 40 36 00 00sub0x3640(%rdx),%rdi
> 802783a9:   48 6b c7 38 imul   $0x38,%rdi,%rax
> 802783ad:   48 03 82 30 36 00 00add0x3630(%rdx),%rax
> 802783b4:   c3  retq   

That's easily in the 90+ cycles range as you've got 3 data dependent loads 
which will hit in the L2, but likely not in the L1 given that the workload is 
manipulating lots of data.  Assuming the instruction scheduler gets things 
right.

> 802783b5 :
> 802783b5:   48 8b 07mov(%rdi),%rax
> 802783b8:   48 c1 e8 38 shr$0x38,%rax
> 802783bc:   48 8b 14 c5 80 97 71mov
> 0x80719780(,%rax,8),%rdx
> 802783c3:   80 
> 802783c4:   48 b8 b7 6d db b6 6dmov
> $0x6db6db6db6db6db7,%rax
> 802783cb:   db b6 6d 
> 802783ce:   48 2b ba 20 03 00 00sub0x320(%rdx),%rdi
> 802783d5:   48 c1 ff 03 sar$0x3,%rdi
> 802783d9:   48 0f af f8 imul   %rax,%rdi
> 802783dd:   48 03 ba 28 03 00 00add0x328(%rdx),%rdi
> 802783e4:   48 89 f8mov%rdi,%rax
> 802783e7:   c3  retq   
> 
> 
> Both look quite optimized to me. I haven't timed them but it would surprise 
> me 
> if P4 needed more than 20 cycles to crunch through each of them.

It's more than that because you've got the data dependencies on the load.  
Yes, imul is 10 cycles, but shift is 1.

> Where is that idiv exactly? I don't see it.

My memory seems to be failing me; I can't find it.  Whoops.

> Only in pathological workloads. Normally the working set is so large 
> that the probability of two pages are near each other is very small.

It's hardly that uncommon for pages to cross cachelines or for pages to move 
around CPUs with networking.  Remember that we're using pages for the data 
buffers in networking, so you'll have pages get freed on the wrong CPU quite 
often.

Please name some sort of benchmarks that show your concerns for decreased 
performance.  I've shown you one that gets improved, and I think the pages 
not overlapping cachelines is only a good thing.

I know these things look like piddly little worthless optimizations, but 
they add up big time.  Mea culpa for not having a 10Gbit nic to show more 
"real world" applications.

-ben


Re: [PATCH] x86-64, use page->virtual to get 64 byte struct page

2006-03-07 Thread Benjamin LaHaise
On Tue, Mar 07, 2006 at 05:27:37PM +0100, Andi Kleen wrote:
> On Wednesday 08 March 2006 00:26, Benjamin LaHaise wrote:
> > Hi Andi,
> > 
> > On x86-64 one inefficiency that shows up on profiles is the handling of 
> > struct page conversion to/from idx and addresses.  This is mostly due to 
> > the fact that struct page is currently 56 bytes on x86-64, so gcc has to 
> > emit a slow division or multiplication to convert. 
> 
> Huh? 

You used an unsigned long, but ptrdiff_t is signed.  gcc cannot use any 
shifting tricks because they round incorrectly in the signed case.

> AFAIK mul has a latency of < 10 cycles even on P4 so I can't imagine
> it's a real problem. Something must be wrong with your measurements.

mul isn't particularly interesting in the profiles, it's the idiv.

> My guess would be that on more macro loads it would be a loss due 
> to more cache misses.

But you get less false sharing of struct page on SMP as well.  With a 56 byte 
page a single struct page can overlap two cachelines, and on this workload 
the page definitely gets transferred from one CPU to the other.

-ben


[PATCH] x86-64, use page->virtual to get 64 byte struct page

2006-03-07 Thread Benjamin LaHaise
Hi Andi,

On x86-64 one inefficiency that shows up on profiles is the handling of 
struct page conversion to/from idx and addresses.  This is mostly due to 
the fact that struct page is currently 56 bytes on x86-64, so gcc has to 
emit a slow division or multiplication to convert.  By switching to using 
WANT_PAGE_VIRTUAL in asm/page.h, struct page grows to 64 bytes.  Address 
calculation becomes cheaper because it is a memory load from the already 
hot struct page.  For netperf, this shows up as a ~150 Mbit/s improvement.
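
To see where the divide comes from, here is a userspace illustration of the
arithmetic (not kernel code):

struct page56 { char pad[56]; };	/* current x86-64 struct page size */
struct page64 { char pad[64]; };	/* with WANT_PAGE_VIRTUAL */

/* page_to_pfn()-style conversion: pointer subtraction divides by the size */
long idx56(struct page56 *base, struct page56 *p) { return p - base; }
long idx64(struct page64 *base, struct page64 *p) { return p - base; }

For the 56 byte case gcc emits the shift plus multiply-by-reciprocal
sequence (the 0x6db6db6db6db6db7 constant) quoted elsewhere in this thread;
for 64 bytes it reduces to a single shift, and with WANT_PAGE_VIRTUAL the
address is simply loaded from page->virtual.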

FYI, I also tested with the 64 byte page but no WANT_PAGE_VIRTUAL, but it 
wasn't quite as much of an improvement on the P4.

Before:
 87380  16384  16384  10.01   9516.38   92.26   92.26   1.588   1.588
 87380  16384  16384  10.01   9459.36   70.43   70.43   1.220   1.220
 87380  16384  16384  10.00   9509.38   67.55   67.55   1.164   1.164
 87380  16384  16384  10.00   9432.95   68.91   68.91   1.197   1.197
 87380  16384  16384  10.01   9479.88   69.32   69.32   1.198   1.198
 87380  16384  16384  10.01   9466.74   70.37   70.37   1.218   1.218
 87380  16384  16384  10.01   9493.93   69.18   69.18   1.194   1.194
 87380  16384  16384  10.00   9486.44   71.06   71.06   1.227   1.227
 87380  16384  16384  10.01   9477.95   70.67   70.67   1.222   1.222
 87380  16384  16384  10.00   9477.33   70.20   70.20   1.214   1.214


After:
 87380  16384  16384  10.01   9629.42   92.01   92.01   1.565   1.565
 87380  16384  16384  10.01   9641.69   90.16   90.16   1.532   1.532
 87380  16384  16384  10.01   9650.40   90.16   90.16   1.531   1.531
 87380  16384  16384  10.00   9638.69   90.60   90.60   1.540   1.540
 87380  16384  16384  10.01   9667.15   89.36   89.36   1.514   1.514
 87380  16384  16384  10.01   9684.13   89.86   89.86   1.520   1.520
 87380  16384  16384  10.01   9642.38   90.31   90.31   1.534   1.534
 87380  16384  16384  10.00   9669.24   90.90   90.90   1.540   1.540
 87380  16384  16384  10.00   9676.82   90.25   90.25   1.528   1.528
 87380  16384  16384  10.00   9711.26   90.80   90.80   1.532   1.532

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.

Signed-off-by: Benjamin LaHaise <[EMAIL PROTECTED]>
diff --git a/include/asm-x86_64/page.h b/include/asm-x86_64/page.h
index 615e3e4..2c16c62 100644
--- a/include/asm-x86_64/page.h
+++ b/include/asm-x86_64/page.h
@@ -3,6 +3,8 @@
 
 #include 
 
+#define WANT_PAGE_VIRTUAL  1
+
 /* PAGE_SHIFT determines the page size */
 #define PAGE_SHIFT 12
 #ifdef __ASSEMBLY__



[PATCH] use wait queue spinlock for the socket spinlock

2006-03-07 Thread Benjamin LaHaise
Hi Dave et al,

The patch below merges the use of the wait queue lock and socket spinlock 
into one.  This gains us ~100-150Mbit/s on netperf, mostly because we know 
how the spinlock is used and can avoid the whole irq save, disable and 
re-enable sequence, since the spinlock only needs bh protection.  As a 
bonus, this also removes one atomic operation per wakeup.

[This is with x86-64 task switching tweaks as otherwise it is in the noise.]
Before:
 87380  16384  16384  10.01   9501.56   90.66   90.66   1.563   1.563
 87380  16384  16384  10.01   9476.08   93.40   93.40   1.615   1.615
 87380  16384  16384  10.00   9473.65   80.66   80.66   1.395   1.395
 87380  16384  16384  10.00   9525.82   80.41   80.41   1.383   1.383
 87380  16384  16384  10.00   9523.49   80.71   80.71   1.388   1.388
 87380  16384  16384  10.00   9430.09   80.01   80.01   1.390   1.390
 87380  16384  16384  10.00   9469.60   80.71   80.71   1.396   1.396
 87380  16384  16384  10.01   9517.88   79.32   79.32   1.365   1.365
 87380  16384  16384  10.01   9512.30   80.31   80.31   1.383   1.383
 87380  16384  16384  10.00   9453.69   80.90   80.90   1.402   1.402

After:
 87380  16384  16384  10.01   9629.42   92.01   92.01   1.565   1.565
 87380  16384  16384  10.01   9641.69   90.16   90.16   1.532   1.532
 87380  16384  16384  10.01   9650.40   90.16   90.16   1.531   1.531
 87380  16384  16384  10.00   9638.69   90.60   90.60   1.540   1.540
 87380  16384  16384  10.01   9667.15   89.36   89.36   1.514   1.514
 87380  16384  16384  10.01   9684.13   89.86   89.86   1.520   1.520
 87380  16384  16384  10.01   9642.38   90.31   90.31   1.534   1.534
 87380  16384  16384  10.00   9669.24   90.90   90.90   1.540   1.540
 87380  16384  16384  10.00   9676.82   90.25   90.25   1.528   1.528
 87380  16384  16384  10.00   9711.26   90.80   90.80   1.532   1.532

-ben

Signed-off-by: Benjamin LaHaise <[EMAIL PROTECTED]>
diff --git a/include/net/sock.h b/include/net/sock.h
index 57e5f6b..a864d32 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -76,13 +76,12 @@
  */
 struct sock_iocb;
 typedef struct {
-   spinlock_t  slock;
struct sock_iocb*owner;
wait_queue_head_t   wq;
 } socket_lock_t;
 
 #define sock_lock_init(__sk) \
-do {   spin_lock_init(&((__sk)->sk_lock.slock)); \
+do { \
(__sk)->sk_lock.owner = NULL; \
init_waitqueue_head(&((__sk)->sk_lock.wq)); \
 } while(0)
@@ -733,8 +732,8 @@ extern void FASTCALL(lock_sock(struct so
 extern void FASTCALL(release_sock(struct sock *sk));
 
 /* BH context may only use the following locking interface. */
-#define bh_lock_sock(__sk) spin_lock(&((__sk)->sk_lock.slock))
-#define bh_unlock_sock(__sk)   spin_unlock(&((__sk)->sk_lock.slock))
+#define bh_lock_sock(__sk) spin_lock(&((__sk)->sk_lock.wq.lock))
+#define bh_unlock_sock(__sk)   spin_unlock(&((__sk)->sk_lock.wq.lock))
 
 extern struct sock *sk_alloc(int family,
  gfp_t priority,
diff --git a/net/core/filter.c b/net/core/filter.c
index 93fbd01..69e4636 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -421,10 +421,10 @@ int sk_attach_filter(struct sock_fprog *
if (!err) {
struct sk_filter *old_fp;
 
-   spin_lock_bh(&sk->sk_lock.slock);
+   spin_lock_bh(&sk->sk_lock.wq.lock);
old_fp = sk->sk_filter;
sk->sk_filter = fp;
-   spin_unlock_bh(&sk->sk_lock.slock);
+   spin_unlock_bh(&sk->sk_lock.wq.lock);
fp = old_fp;
}
 
diff --git a/net/core/sock.c b/net/core/sock.c
index f152783..96008af 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -445,15 +445,15 @@ set_rcvbuf:
break;
 
case SO_DETACH_FILTER:
-   spin_lock_bh(&sk->sk_lock.slock);
+   spin_lock_bh(&sk->sk_lock.wq.lock);
filter = sk->sk_filter;
 if (filter) {
sk->sk_filter = NULL;
-   spin_unlock_bh(&sk->sk_lock.slock);
+   spin_unlock_bh(&sk->sk_lock.wq.lock);
sk_filter_release(sk, filter);
break;
}
-   spin_unlock_bh(&sk->sk_lock.slock);
+   spin_unlock_bh(&sk->sk_lock.wq.lock);
ret = -ENONET;
break;
 
@@ -1031,20 +1031,25 @@ struct sk_buff *sock_alloc_send_skb(stru
return sock_alloc_send_pskb(sk, size, 0, no

[EXPERIMENTAL] HT aware loopback device (hack, x86-64 only atm)

2006-03-07 Thread Benjamin LaHaise
Hi folks,

I'd like to start some discussions on SMP optimizations for the networking 
stack.  The patch below is one such example which changes the loopback 
device in a way that helps out on workloads like netperf by trying to share 
more work with the other CPU on an HT system.  Basically, if the other CPU 
is idle, we punt the netif_rx onto the other CPU.  Using a kernel thread 
for this is fairly inefficient, so I am wondering if it makes sense to do 
it on the softirq level.  This particular patch improves netperf over 
localhost by ~600Mbit/s (from ~9874Mbit/s to ~10475Mbit/s while raising 
%CPU usage from ~92% to ~95%, although it varies quite a bit) on a 3GHz P4 
with HT (this is with a pile of other patches to optimize task switching 
on x86-64).

The bigger part of the discussion is probably a question of how we can make 
the network stack scale with multicore CPUs.  For workloads like routing 
lots of small packets, a single CPU can be easily overwhelmed.  The question 
becomes where does partitioning the work make sense?  At the very least we 
probably need to do some preprocessing of incoming packets so that a series 
of packets destined for a particular flow end up on the same CPU.  This 
sort of preprocessing probably makes sense for other reasons: by processing 
a group of packets for a particular socket in one go, we can avoid the 
overhead of locking and unlocking the socket repeatedly (which is pretty 
expensive due to the memory barrier nature of locks).
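
As a strawman, the preprocessing could be as simple as hashing the flow to
pick a CPU (illustration only; assumes online CPUs are numbered densely from
0, which is good enough for a sketch):

#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/cpumask.h>

static unsigned int flow_to_cpu(const struct iphdr *iph,
				const struct tcphdr *th)
{
	u32 hash = iph->saddr ^ iph->daddr ^
		   ((u32)th->source << 16) ^ th->dest;

	hash ^= hash >> 16;
	return hash % num_online_cpus();	/* one flow -> one CPU */
}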

At this point I'd just like to stir up some discussion, so please comment 
away with any ideas and concerns.

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.


diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index 690a1aa..ef283a3 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -58,6 +58,69 @@
 #include 
 #include 
 
+#define LOOP_NR_SKBS   256
+static struct sk_buff *loop_skb_ring[LOOP_NR_SKBS];
+static unsigned loop_skb_head, loop_skb_tail;
+static struct task_struct *loop_task;
+
+static void smp_loop_netif_rx(struct sk_buff *skb)
+{
+   unsigned int next = (loop_skb_head + 1) % LOOP_NR_SKBS;
+
+   if (next == loop_skb_tail) {
+   dev_kfree_skb(skb);
+   return;
+   }
+
+   loop_skb_ring[loop_skb_head] = skb;
+   wmb();
+   loop_skb_head = next;
+}
+
+static void smp_loop_wake(void)
+{
+   if (loop_task && loop_task->state != TASK_RUNNING)
+   wake_up_process(loop_task);
+}
+
+static int loop_netif_rx_thread(void *data)
+{
+   loop_task = current;
+
+   for (;;) {
+   int nr = 0;
+   while (loop_skb_tail != loop_skb_head) {
+   unsigned next;
+   struct sk_buff *skb = loop_skb_ring[loop_skb_tail];
+   loop_skb_ring[loop_skb_tail] = NULL;
+   next = (loop_skb_tail + 1) % LOOP_NR_SKBS;
+   barrier();
+   loop_skb_tail = next;
+   netif_rx(skb);
+   if (nr++ >= 96) {
+   do_softirq();
+   nr = 0;
+   }
+   }
+
+   do_softirq();
+
+   set_current_state(TASK_INTERRUPTIBLE);
+   if (loop_skb_tail == loop_skb_head)
+   schedule();
+   set_current_state(TASK_RUNNING);
+   }
+}
+
+static inline int sibling_is_idle(void)
+{
+   int cpu = smp_processor_id() ^ 1;
+   struct x8664_pda *pda = cpu_pda(cpu);
+   if (pda->pcurrent == idle_task(cpu) || pda->pcurrent == loop_task)
+   return 1;
+   return 0;
+}
+
 static DEFINE_PER_CPU(struct net_device_stats, loopback_stats);
 
 #define LOOPBACK_OVERHEAD (128 + MAX_HEADER + 16 + 16)
@@ -69,6 +132,7 @@ static DEFINE_PER_CPU(struct net_device_
  */
 
 #ifdef LOOPBACK_TSO
+static void smp_loop_netif_rx(struct sk_buff *skb);
 static void emulate_large_send_offload(struct sk_buff *skb)
 {
struct iphdr *iph = skb->nh.iph;
@@ -76,6 +140,7 @@ static void emulate_large_send_offload(s
unsigned int doffset = (iph->ihl + th->doff) * 4;
unsigned int mtu = skb_shinfo(skb)->tso_size + doffset;
unsigned int offset = 0;
+   int use_sibling = sibling_is_idle();
u32 seq = ntohl(th->seq);
u16 id  = ntohs(iph->id);
 
@@ -112,12 +177,21 @@ static void emulate_large_send_offload(s
th->seq = htonl(seq);
if (offset + doffset + frag_size < skb->len)
th->fin = th->psh = 0;
+#ifdef CONFIG_SMP
+   if (use_sibling)
+   smp_loop_netif_rx(nskb);
+   else
+   netif_rx(nskb);
+#else
netif_rx(nskb);
+#endif
offset += frag_size;
seq += frag_size;
id++;
  

Re: [PATCH] avoid atomic op on page free

2006-03-06 Thread Benjamin LaHaise
On Tue, Mar 07, 2006 at 01:04:36PM +1100, Nick Piggin wrote:
> I'd say it will turn out to be more trouble than its worth, for the 
> miserly cost
> avoiding one atomic_inc, and one atomic_dec_and_test on page-local data 
> that will
> be in L1 cache. I'd never turn my nose up at anyone just having a go 
> though :)

The cost is anything but miserly.  Consider that every lock instruction is 
a memory barrier which takes your OoO CPU with lots of instructions in flight 
to ramp down to just 1 for the time it takes that instruction to execute.  
That synchronization is what makes the atomic expensive.
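
The sort of userspace test that shows the cost (x86-64 sketch; the absolute
numbers vary a lot by CPU, so treat them as illustrative only):

#include <stdio.h>
#include <stdint.h>

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;
	asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	long v = 0;
	uint64_t t0, t1;
	int i;

	t0 = rdtsc();
	for (i = 0; i < 1000000; i++)
		asm volatile("lock incq %0" : "+m"(v));
	t1 = rdtsc();
	printf("locked inc: ~%.1f cycles each\n", (t1 - t0) / 1000000.0);

	t0 = rdtsc();
	for (i = 0; i < 1000000; i++)
		asm volatile("incq %0" : "+m"(v));
	t1 = rdtsc();
	printf("plain inc:  ~%.1f cycles each\n", (t1 - t0) / 1000000.0);
	return 0;
}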

In the case of netperf, I ended up with a 2.5Gbit/s (~30%) performance 
improvement through nothing but microoptimizations.  There is method to 
my madness. ;-)

-ben


Re: [PATCH] avoid atomic op on page free

2006-03-06 Thread Benjamin LaHaise
On Tue, Mar 07, 2006 at 12:53:27PM +1100, Nick Piggin wrote:
> You can't do this because you can't test PageLRU like that.
> 
> Have a look in the lkml archives a few months back, where I proposed
> a way to do this for __free_pages(). You can't do it for put_page.

Even if we know that we are the last user of the page (the count is 1)?  
Who can bump the page's count then?

> BTW I have quite a large backlog of patches in -mm which should end
> up avoiding an atomic or two around these parts.

That certainly looks like it will help.  Not taking the spinlock 
unconditionally gets rid of quite a bit of the cost.

-ben


Re: [PATCH] avoid atomic op on page free

2006-03-06 Thread Benjamin LaHaise
On Mon, Mar 06, 2006 at 05:39:41PM -0800, Andrew Morton wrote:
> > It's just a simple send() and recv() pair of processes.  Networking uses 
> > pages for the buffer on user transmits.
> 
> You mean non-zero-copy transmits?  If they were zero-copy then those pages
> would still be on the LRU.

Correct.

> >  Those pages tend to be freed 
> > in irq context on transmit or in the receiver if the traffic is local.
> 
> If it was a non-zero-copy Tx then networking owns that page and can just do
> free_hot_page() on it and avoid all that stuff in put_page().

At least currently, networking has no way of knowing that is the case since 
pages may have their reference count increased when an skb is cloned, and 
in fact do when TCP sends them off.

> Thing is, that case would represent about 100th of the number of
> put_pages()s which get done in the world.  IOW: a net loss.

Those 1-2 cycles are free if you look at how things get scheduled with the 
execution of the surrounding code. I bet $20 that you can't find a modern 
CPU where the cost is measurable (meaning something like a P4, Athlon).  
If this level of cost for the common case is a concern, it's probably worth 
making atomic_dec_and_test() inline for page_cache_release().  The overhead 
of the function call and the PageCompound() test is probably more than what 
we're talking about as you're increasing the cache footprint and actually 
performing a write to memory.

-ben


Re: [PATCH] avoid atomic op on page free

2006-03-06 Thread Benjamin LaHaise
On Mon, Mar 06, 2006 at 04:50:39PM -0800, Andrew Morton wrote:
> Am a bit surprised at those numbers.

> Because userspace has to do peculiar things to get its pages taken off the
> LRU.  What exactly was that application doing?

It's just a simple send() and recv() pair of processes.  Networking uses 
pages for the buffer on user transmits.  Those pages tend to be freed 
in irq context on transmit or in the receiver if the traffic is local.

> The patch adds slight overhead to the common case while providing
> improvement to what I suspect is a very uncommon case?

At least on any modern CPU with branch prediction, the test is essentially 
free (2 memory reads that pipeline well, iow 1 cycle, maybe 2).  The upside 
is that you get to avoid the atomic (~17 cycles on a P4 with a simple test 
program; the penalty doubles if there is one other instruction that operates 
on memory in the loop), disabling interrupts (~20 cycles?  I don't remember), 
another atomic for the spinlock, another atomic for TestClearPageLRU(), and 
the pushf/popf (expensive, as they rely on whatever instruction might still 
be in flight to complete, and add the penalty for changing irq state).  
That's at least 70 cycles without including the memory barrier side effects, 
which can cost 100+ cycles.  Add in the cost of the cacheline bouncing of 
the lru_lock and we're talking *expensive*.

So, a 1-2 cycle cost for a case that normally takes from 17 to 100+ cycles?  
I think that's worth it given the benefits.

Also, I think the common case (page cache read / map) is something that 
should be done differently, as those atomics really do add up to major 
pain.  Using rcu for page cache reads would be truely wonderful, but that 
will take some time.
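
For reference, the fast path whose cost is being argued about looks roughly
like this (sketch of the idea, not the actual patch; as Nick points out
above, the PageLRU test is only safe under additional conditions):

#include <linux/mm.h>

static inline void my_put_page(struct page *page)
{
	if (!PageLRU(page) && page_count(page) == 1) {
		/* last user and never on the LRU: free directly, skipping
		 * the atomic dec-and-test and the lru_lock entirely */
		free_hot_page(page);
		return;
	}
	put_page(page);		/* normal refcounted path */
}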

-ben


Re: [PATCH] avoid memory barrier bitops in hot paths

2006-03-06 Thread Benjamin LaHaise
On Mon, Mar 06, 2006 at 04:25:32PM -0800, David S. Miller wrote:
> Wait...
> 
> what about "test_and_clear_bit()"?
> 
> Most implementations should be doing the light-weight test _first_,
> and only do the update if the bit isn't in the state desired.
> 
> I think in such cases we can elide the memory barrier.

Hrmmm, clear_bit() doesn't seem to imply being a memory barrier, but if 
we do that things can be doubly worse for the sites that use 
smp_mb__*_clear_bit() (as sometimes we'd perform the barrier and then do 
the locked bit clear on x86).  That would hurt in net_tx_action().  Oooh, 
I see some optimizations to make there...

-ben


Re: [PATCH] avoid memory barrier bitops in hot paths

2006-03-06 Thread Benjamin LaHaise
On Tue, Mar 07, 2006 at 12:59:17AM +0100, Eric Dumazet wrote:
> I'm not even sure this 'optimization' is valid on UP.

It can be, as branch prediction makes the test essentially free.  The real 
answer is that it depends on the CPU, how much pressure there is on the write 
combining buffers and reorder buffers, etc.  Something like this is almost in 
the noise when measured on its own, but the combined effect adds up (in this 
case netperf throughput increased by ~30%).  I'll respin it with the 
fast_clear_bit() suggestion.

-ben


Re: [RFC/PATCH] rcuification of ipv4 established and timewait connections

2006-03-06 Thread Benjamin LaHaise
On Tue, Mar 07, 2006 at 12:48:23AM +0100, Eric Dumazet wrote:
> If I understand your patch correctly, your future plan is to change "struct 
> inet_ehash_bucket" rwlock_t wlock to a pure spinlock (when ipv6 is 
> converted to rcu lookups too), because no more read_lock() are expected ?

Yes/no... The cost doesn't change if we convert it to a spinlock as we would 
have to insert a memory barrier where the write_unlock() was.

> I didnt understand why you added a test on sk->sk_node.pprev against 
> LIST_POISON2 in sk_unhashed() and sk_hashed()

In doing the conversion the code was changed over to use hlist_del_rcu(), 
which poisons the pprev member.  RCU basically adds another state where the 
hash list entry is unhashed but not freed, which can be detected by the 
presence of the poison value.  Without the check there are a couple of 
places where we end up trying to remove an already dead hash.
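
Concretely, the check amounts to something like this (sketch of the idea,
not the exact hunk from the patch):

#include <net/sock.h>
#include <linux/list.h>

static inline int my_sk_unhashed(const struct sock *sk)
{
	/* NULL pprev: never hashed.  LIST_POISON2: removed via
	 * hlist_del_rcu() but possibly still visible to RCU readers. */
	return !sk->sk_node.pprev ||
	       sk->sk_node.pprev == LIST_POISON2;
}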

-ben


Re: [PATCH] avoid memory barrier bitops in hot paths

2006-03-06 Thread Benjamin LaHaise
On Mon, Mar 06, 2006 at 08:29:30PM -0300, Arnaldo Carvalho de Melo wrote:
> -   clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
> +   if (test_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags))
> +   clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
> 
> Something like fast_clear_bit() that does the above sequence in a inline
> function?

That might make sense.  The general case tends to be better handled by 
avoiding atomic bitops wherever possible.  In the socket case, a better 
approach may be to have a set of flags which are protected by the socket 
lock instead of being atomic bitops.
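
The helper being suggested would be just (sketch):

#include <linux/bitops.h>

static inline void fast_clear_bit(int nr, volatile unsigned long *addr)
{
	/* only pay for the atomic (and any implied barrier) when the bit is
	 * actually set; the plain test is free on a predicted branch */
	if (test_bit(nr, addr))
		clear_bit(nr, addr);
}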

> Something similar, no?
> 
> fast_atomic_dec_and_test anyone? :-)

Yup.  One of the other patches I've got which really helps networking is 
to use that hack in the VM's put_page() function.  The problem is that the 
conditions under which it is okay tend to vary from case to case.

-ben


[PATCH] avoid memory barrier bitops in hot paths

2006-03-06 Thread Benjamin LaHaise
This patch removes a couple of memory barriers from atomic bitops that 
showed up on profiles of netperf.

-ben

Signed-off-by: Benjamin LaHaise <[EMAIL PROTECTED]>
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 00aa80e..dadc84c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -682,7 +682,8 @@ int tcp_sendmsg(struct kiocb *iocb, stru
goto out_err;
 
/* This should be in poll */
-   clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
+   if (test_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags))
+   clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
 
mss_now = tcp_current_mss(sk, !(flags&MSG_OOB));
size_goal = tp->xmit_size_goal;
diff --git a/net/ipv4/netfilter/ip_conntrack_core.c 
b/net/ipv4/netfilter/ip_conntrack_core.c
index 84c66db..416e310 100644
--- a/net/ipv4/netfilter/ip_conntrack_core.c
+++ b/net/ipv4/netfilter/ip_conntrack_core.c
@@ -881,7 +881,12 @@ unsigned int ip_conntrack_in(unsigned in
return -ret;
}
 
-   if (set_reply && !test_and_set_bit(IPS_SEEN_REPLY_BIT, &ct->status))
+   /* Use test_bit() first, as that avoids the memory barrier implicit
+* in test_and_set_bit() on some CPUs.  It only gets set once per
+* connection.
+*/
+   if (set_reply && !test_bit(IPS_SEEN_REPLY_BIT, &ct->status) &&
+   !test_and_set_bit(IPS_SEEN_REPLY_BIT, &ct->status))
ip_conntrack_event_cache(IPCT_STATUS, *pskb);
 
return ret;



[RFC/PATCH] rcuification of ipv4 established and timewait connections

2006-03-06 Thread Benjamin LaHaise
Hello again,

This patch introduces the use of rcu for the ipv4 established connections 
hashtable, as well as the timewait table since they are closely intertwined.  
This removes 4 atomic operations per packet from the tcp_v4_rcv codepath, 
which helps quite a bit when the other performance barriers in the system 
are removed.  Eliminating the rwlock cache bouncing should also help on SMP 
systems.

By itself, this improves local netperf performance on a P4/HT by ~260Mbit/s 
on average.  With smaller packets (say, ethernet size) the difference should 
be larger.

Note that this patch changes the semantics of __inet_lookup() and 
__inet_lookup_established() to *not* perform a sock_hold().  This way 
we can avoid the atomic inc and dec of the reference count and rely on 
rcu pinning the socket over the scope of the rcu_read_lock_bh() region.

Only minimal fixes for ipv6 and dccp to use the renamed wlock, as I am not 
set up to test them.  I have stared at this patch for a while, and I think 
it's correct, but more eyes are definitely warranted.  Most of the issues 
regarding the state and lifespan of a struct sock are already handled by 
the network layer, as the socket can be locked after being retrieved from 
the connection hashtable; we just have to be careful about reinitializing 
fields that the rcu hash list depends on.

-ben

Before: max 7838.86, min 7771.90, avg 7811.68
 87380  16384  16384  10.01   7793.04   90.56   90.56   1.904   1.904
 87380  16384  16384  10.01   7806.54   91.61   91.61   1.923   1.923
 87380  16384  16384  10.00   7819.29   90.80   90.80   1.903   1.903
 87380  16384  16384  10.00   7815.89   90.70   90.70   1.901   1.901
 87380  16384  16384  10.01   7771.90   91.66   91.66   1.932   1.932
 87380  16384  16384  10.00   7831.59   90.20   90.20   1.887   1.887
 87380  16384  16384  10.01   7796.56   91.26   91.26   1.918   1.918
 87380  16384  16384  10.01   7838.86   89.26   89.26   1.866   1.866
 87380  16384  16384  10.01   7835.44   90.56   90.56   1.894   1.894


After: max 8113.70, min 8026.32, avg 8072.34
 87380  16384  16384  10.01   8045.55   87.11   87.11   1.774   1.774
 87380  16384  16384  10.01   8065.14   90.86   90.86   1.846   1.846
 87380  16384  16384  10.00   8077.76   89.85   89.85   1.822   1.822
 87380  16384  16384  10.00   8026.32   89.80   89.80   1.833   1.833
 87380  16384  16384  10.01   8108.59   89.81   89.81   1.815   1.815
 87380  16384  16384  10.01   8034.53   89.01   89.01   1.815   1.815
 87380  16384  16384  10.00   8113.70   90.45   90.45   1.827   1.827
 87380  16384  16384  10.00   8111.37   89.90   89.90   1.816   1.816
 87380  16384  16384  10.01   8077.75   87.96   87.96   1.784   1.784
 87380  16384  16384  10.00   8062.70   90.25   90.25   1.834   1.834

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.

Signed-off-by: Benjamin LaHaise <[EMAIL PROTECTED]>
diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
index 25f708f..73b05ab 100644
--- a/include/net/inet6_hashtables.h
+++ b/include/net/inet6_hashtables.h
@@ -65,7 +65,7 @@ static inline void __inet6_hash(struct i
sk->sk_hash = hash = inet6_sk_ehashfn(sk);
hash &= (hashinfo->ehash_size - 1);
list = &hashinfo->ehash[hash].chain;
-   lock = &hashinfo->ehash[hash].lock;
+   lock = &hashinfo->ehash[hash].wlock;
write_lock(lock);
}
 
@@ -98,7 +98,7 @@ static inline struct sock *
struct inet_ehash_bucket *head = inet_ehash_bucket(hashinfo, hash);
 
prefetch(head->chain.first);
-   read_lock(&head->lock);
+   read_lock(&head->wlock);
sk_for_each(sk, node, &head->chain) {
/* For IPV6 do the cheaper port and family tests first. */
if (INET6_MATCH(sk, hash, saddr, daddr, ports, dif))
@@ -118,12 +118,12 @@ static inline struct sock *
goto hit;
}
}
-   read_unlock(&head->lock);
+   read_unlock(&head->wlock);
return NULL;
 
 hit:
sock_hold(sk);
-   read_unlock(&head->lock);
+   read_unlock(&head->wlock);
return sk;
 }
 
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 135d80f..4cde832 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -39,7 +39,7 @@
  * for the rest.  I'll experiment with dynamic table growth later.
  */
 struct inet_ehash_bucket {
-   rwlock_t  lock;
+   rwlock_t wlock;
struct hlist_head chain;
 };
 
@@ -224,7 +224,7 @@ static inline void __inet_hash(str

[EMAIL PROTECTED]: [PATCH] use fget_light() in net/socket.c]

2006-03-06 Thread Benjamin LaHaise
Hello folks,

Here's an updated copy of the patch to use fget_light in net/socket.c.  
Rerunning the tests shows a drop of ~80Mbit/s on average, which looks 
bad until you see the drop in CPU usage from ~89% to ~82%.  That will 
get fixed in another patch...

Before: max 8113.70, min 8026.32, avg 8072.34
 87380  16384  16384  10.01   8045.55   87.11   87.11   1.774   1.774
 87380  16384  16384  10.01   8065.14   90.86   90.86   1.846   1.846
 87380  16384  16384  10.00   8077.76   89.85   89.85   1.822   1.822
 87380  16384  16384  10.00   8026.32   89.80   89.80   1.833   1.833
 87380  16384  16384  10.01   8108.59   89.81   89.81   1.815   1.815
 87380  16384  16384  10.01   8034.53   89.01   89.01   1.815   1.815
 87380  16384  16384  10.00   8113.70   90.45   90.45   1.827   1.827
 87380  16384  16384  10.00   8111.37   89.90   89.90   1.816   1.816
 87380  16384  16384  10.01   8077.75   87.96   87.96   1.784   1.784
 87380  16384  16384  10.00   8062.70   90.25   90.25   1.834   1.834

After: max 8035.81, min 7963.69, avg 7998.14
 87380  16384  16384  10.01   8000.93   82.11   82.11   1.682   1.682
 87380  16384  16384  10.01   8016.17   83.67   83.67   1.710   1.710
 87380  16384  16384  10.01   7963.69   83.47   83.47   1.717   1.717
 87380  16384  16384  10.01   8014.35   81.71   81.71   1.671   1.671
 87380  16384  16384  10.00   7967.68   83.41   83.41   1.715   1.715
 87380  16384  16384  10.00   7995.22   81.00   81.00   1.660   1.660
 87380  16384  16384  10.00   8002.61   83.90   83.90   1.718   1.718
 87380  16384  16384  10.00   8035.81   81.71   81.71   1.666   1.666
 87380  16384  16384  10.01   8005.36   82.56   82.56   1.690   1.690
 87380  16384  16384  10.00   7979.61   82.50   82.50   1.694   1.694


-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.

Signed-off-by: Benjamin LaHaise <[EMAIL PROTECTED]>
diff --git a/net/socket.c b/net/socket.c
index 3ca31e8..24c4a1a 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -429,6 +429,28 @@ int sock_map_fd(struct socket *sock)
return fd;
 }
 
+static struct socket *sock_from_file(struct file *file, int *err)
+{
+   struct inode *inode;
+   struct socket *sock;
+
+   if (file->f_op == &socket_file_ops)
+   return file->private_data;  /* set in sock_map_fd */
+
+   inode = file->f_dentry->d_inode;
+   if (!S_ISSOCK(inode->i_mode)) {
+   *err = -ENOTSOCK;
+   return NULL;
+   }
+
+   sock = SOCKET_I(inode);
+   if (sock->file != file) {
+   printk(KERN_ERR "socki_lookup: socket file changed!\n");
+   sock->file = file;
+   }
+   return sock;
+}
+
 /**
  * sockfd_lookup   -   Go from a file number to its socket slot
  * @fd: file handle
@@ -445,31 +467,31 @@ int sock_map_fd(struct socket *sock)
 struct socket *sockfd_lookup(int fd, int *err)
 {
struct file *file;
-   struct inode *inode;
struct socket *sock;
 
-   if (!(file = fget(fd)))
-   {
+   if (!(file = fget(fd))) {
*err = -EBADF;
return NULL;
}
-
-   if (file->f_op == &socket_file_ops)
-   return file->private_data;  /* set in sock_map_fd */
-
-   inode = file->f_dentry->d_inode;
-   if (!S_ISSOCK(inode->i_mode)) {
-   *err = -ENOTSOCK;
+   sock = sock_from_file(file, err);
+   if (!sock)
fput(file);
-   return NULL;
-   }
+   return sock;
+}
 
-   sock = SOCKET_I(inode);
-   if (sock->file != file) {
-   printk(KERN_ERR "socki_lookup: socket file changed!\n");
-   sock->file = file;
+static struct socket *sockfd_lookup_light(int fd, int *err, int *fput_needed)
+{
+   struct file *file;
+   struct socket *sock;
+
+   file = fget_light(fd, fput_needed);
+   if (file) {
+   sock = sock_from_file(file, err);
+   if (sock)
+   return sock;
+   fput_light(file, *fput_needed);
}
-   return sock;
+   return NULL;
 }
 
 /**
@@ -1304,19 +1326,17 @@ asmlinkage long sys_bind(int fd, struct 
 {
struct socket *sock;
char address[MAX_SOCK_ADDR];
-   int err;
+   int err, fput_needed;
 
-   if((sock = sockfd_lookup(fd,&err))!=NULL)
+   if((sock = sockfd_lookup_light(fd, &err, &fput_needed))!=NULL)
{
if((err=move_addr_to_kernel(umyaddr,addrlen,address))>=0) {
err = security_socket_bind(sock, (struct sockaddr 
*)address, addrlen);
-   if (err) {
-   sockfd_put

Re: [PATCH 1/8] [I/OAT] DMA memcpy subsystem

2006-03-04 Thread Benjamin LaHaise
On Fri, Mar 03, 2006 at 01:42:20PM -0800, Chris Leech wrote:
> +void dma_async_device_unregister(struct dma_device* device)
> +{
...
> + kref_put(&device->refcount, dma_async_device_cleanup);
> + wait_for_completion(&device->done);
> +}

This looks like a bug: device is dereferenced after it is potentially 
freed.

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.


Re: Poor performance with MTU 9000

2006-02-28 Thread Benjamin LaHaise
On Tue, Feb 28, 2006 at 09:33:38PM -0500, John Zielinski wrote:
> Rick Jones wrote:
> >And if you add the test-specific -D option?
> 
> No difference.  I have TCP_NODELAY in my Samba config and copying files 
> from the server is painfully slow.  That's what got me started on all 
> these tests.

You might want to have a look at the PCI latency settings on the system if 
your BIOS lets you change them (try increasing it).  It's hard to say 
exactly what is happening without a bus analyzer, but 9000 is a pretty 
sub-optimal mtu as it probably results in at least a couple of additional 
bus transactions beyond 8192.

Oh, one other thing to try: if the network driver does tx checksum 
offloading, disable it (replace NETIF_F_IP_CSUM with 0 in the driver).  
On some chips (the 83820 is one), the tx fifo is only 8KB, and if you 
try to use tx checksumming, the chip will read the packet over the bus 
*twice*, completely hosing performance in the process.  This is probably 
a better idea to check than the pci latency settings.
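
Where the driver exposes ethtool ops, "ethtool -K ethX tx off" is the quick
way to test this at runtime; otherwise the change amounts to something like
the following in the driver's probe function (untested sketch):

	/* stop advertising hardware tx checksumming; the stack checksums in
	 * software and the chip only has to read the packet once */
	dev->features &= ~NETIF_F_IP_CSUM;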

-ben
-- 
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here 
and they've asked us to stop the party."  Don't Email: <[EMAIL PROTECTED]>.


[PATCH] use fget_light for network syscalls

2006-02-28 Thread Benjamin LaHaise
The patch below reworks socket.c to use fget_light() in place of fget() 
for the various networking syscalls.  This is of particular value with 
SMP kernels running on the P4, as the memory barriers caused by the atomic 
ops on the file counter result in the CPU having to wait for quite a few 
memory transactions to complete (we can have up to ~128 instructions in 
flight).  

On the default netperf (which uses fairly large buffers) the increase is 
around 25Mbit/s (on 8000Mbit/s).  Using 'netperf -- -m 64' performance goes 
to ~284Mbit/s from 247Mbit/s, which is pretty substantial.

-ben

Signed-off-by: Benjamin LaHaise 
diff --git a/net/socket.c b/net/socket.c
index a00851f..64019f8 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -414,6 +414,28 @@ out:
return fd;
 }
 
+static struct socket *sock_from_file(struct file *file, int *err)
+{
+   struct inode *inode;
+   struct socket *sock;
+
+   if (file->f_op == &socket_file_ops)
+   return file->private_data;  /* set in sock_map_fd */
+
+   inode = file->f_dentry->d_inode;
+   if (!S_ISSOCK(inode->i_mode)) {
+   *err = -ENOTSOCK;
+   return NULL;
+   }
+
+   sock = SOCKET_I(inode);
+   if (sock->file != file) {
+   printk(KERN_ERR "socki_lookup: socket file changed!\n");
+   sock->file = file;
+   }
+   return sock;
+}
+
 /**
  * sockfd_lookup   -   Go from a file number to its socket slot
  * @fd: file handle
@@ -430,31 +452,31 @@ out:
 struct socket *sockfd_lookup(int fd, int *err)
 {
struct file *file;
-   struct inode *inode;
struct socket *sock;
 
-   if (!(file = fget(fd)))
-   {
+   if (!(file = fget(fd))) {
*err = -EBADF;
return NULL;
}
-
-   if (file->f_op == &socket_file_ops)
-   return file->private_data;  /* set in sock_map_fd */
-
-   inode = file->f_dentry->d_inode;
-   if (!S_ISSOCK(inode->i_mode)) {
-   *err = -ENOTSOCK;
+   sock = sock_from_file(file, err);
+   if (!sock)
fput(file);
-   return NULL;
-   }
+   return sock;
+}
 
-   sock = SOCKET_I(inode);
-   if (sock->file != file) {
-   printk(KERN_ERR "socki_lookup: socket file changed!\n");
-   sock->file = file;
+static struct socket *sockfd_lookup_light(int fd, int *err, int *fput_needed)
+{
+   struct file *file;
+   struct socket *sock;
+
+   file = fget_light(fd, fput_needed);
+   if (file) {
+   sock = sock_from_file(file, err);
+   if (sock)
+   return sock;
+   fput_light(file, *fput_needed);
}
-   return sock;
+   return NULL;
 }
 
 /**
@@ -1289,19 +1311,17 @@ asmlinkage long sys_bind(int fd, struct 
 {
struct socket *sock;
char address[MAX_SOCK_ADDR];
-   int err;
+   int err, fput_needed;
 
-   if((sock = sockfd_lookup(fd,&err))!=NULL)
+   if((sock = sockfd_lookup_light(fd, &err, &fput_needed))!=NULL)
{
if((err=move_addr_to_kernel(umyaddr,addrlen,address))>=0) {
err = security_socket_bind(sock, (struct sockaddr 
*)address, addrlen);
-   if (err) {
-   sockfd_put(sock);
-   return err;
-   }
-   err = sock->ops->bind(sock, (struct sockaddr *)address, 
addrlen);
+   if (!err)
+   err = sock->ops->bind(sock,
+   (struct sockaddr *)address, addrlen);
}
-   sockfd_put(sock);
+   fput_light(sock->file, fput_needed);
}   
return err;
 }
@@ -1318,20 +1338,17 @@ int sysctl_somaxconn = SOMAXCONN;
 asmlinkage long sys_listen(int fd, int backlog)
 {
struct socket *sock;
-   int err;
+   int err, fput_needed;

-   if ((sock = sockfd_lookup(fd, &err)) != NULL) {
+   if ((sock = sockfd_lookup_light(fd, &err, &fput_needed)) != NULL) {
if ((unsigned) backlog > sysctl_somaxconn)
backlog = sysctl_somaxconn;
 
err = security_socket_listen(sock, backlog);
-   if (err) {
-   sockfd_put(sock);
-   return err;
-   }
+   if (!err)
+   err = sock->ops->listen(sock, backlog);
 
-   err=sock->ops->listen(sock, backlog);
-   sockfd_put(sock);
+   fput_light(sock->file, fput_needed);
}
return err;
 }
@@ -1352,10 +1369,10 @@ asmlinkage long sys_listen(int fd, int b
 asmlinkage long 

Re: GbE performance

2006-02-28 Thread Benjamin LaHaise
On Tue, Feb 28, 2006 at 09:21:39AM +0100, Dag Bakke wrote:
> Got both. Max throughput increased from ~605 Mbps with carefully tuned
> options to iperf, to ~650 Mbps with less carefully tuned options to iperf.

One other suggestion that might be worth testing: try comparing SMP and 
UP kernels.  I've been doing some profiling of the network stack which 
shows that locked instructions carry a heavy penalty on the P4.  Running 
a non-SMP kernel through the hoops would give us an idea just how much 
overhead this is causing.

-ben


Re: [0/2] Kevent. Network AIO.

2006-02-09 Thread Benjamin LaHaise
On Thu, Feb 09, 2006 at 08:03:26PM +0300, Evgeniy Polyakov wrote:
> It is completely different things.
> The only common is that they _require_ some kind of notification
> mechanism, 
> but none provide them.
> epoll() can not be used for AIO and timers.

True, that is a disappointment.  There really should be a timer device in 
/dev which can be used with all of the possible event mechanisms.
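
For illustration, the kind of pollable timer file descriptor being asked for
here is roughly what later appeared in Linux as timerfd; a minimal userspace
sketch (assuming the timerfd/epoll interfaces, which are not part of this
thread) would look like:

#include <sys/timerfd.h>
#include <sys/epoll.h>
#include <time.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int tfd = timerfd_create(CLOCK_MONOTONIC, 0);
	int epfd = epoll_create1(0);
	struct itimerspec its = { .it_value = { .tv_sec = 1 } };
	struct epoll_event ev = { .events = EPOLLIN, .data.fd = tfd };
	struct epoll_event out;
	uint64_t expirations;

	/* arm the timer and add it to the event loop like any other fd */
	timerfd_settime(tfd, 0, &its, NULL);
	epoll_ctl(epfd, EPOLL_CTL_ADD, tfd, &ev);

	if (epoll_wait(epfd, &out, 1, -1) == 1 &&
	    read(tfd, &expirations, sizeof(expirations)) == (ssize_t)sizeof(expirations))
		printf("timer fired %llu time(s)\n",
		       (unsigned long long)expirations);

	close(tfd);
	close(epfd);
	return 0;
}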

> AIO can not be used with anything except AIO.

Actually, there are patches for generic thread-based aio so that other 
file descriptors can be used for AIO; perhaps they should be dusted off?  
They're pretty small and easy to understand.

Btw, your kevent aio is not interoperable with O_DIRECT AIO on files.  This 
is a pretty major setback for quite a few types of high performance servers.

> inotify is good, but it  can not be used for anything else.

Now this is funny! =-)

> > >   asynchronously from socket's receiving queue from softirq context.
> > 
> > Is there any reason for not using the existing aio read method?
> 
> Current AIO has it's own design which I do not accept completely.
> It does not use internal state machines of underlying layers, but instead
> repeatedly tries to call nonblocking functions, similar to synchronous
> ones.

Actually, that behaviour is completely up to the implementor of the aio 
method.  Retry just happens to be a simple technique that allows existing 
code to be easily updated to implement aio.  AIO operations can 
be implemented as a state machine -- the 2.4 codebase did that for file 
io, and it worked quite well.  For something like networking, it could 
well be the way to go.
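
To make the state-machine idea concrete, here is a small userspace sketch
(my own illustration, not the 2.4 fs/aio code): each readiness event advances
a partially completed read by one non-blocking step, with the progress kept
in the operation structure rather than on a blocked thread's stack.

#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

struct read_op {
	int fd;
	char *buf;
	size_t len;
	size_t done;	/* state: how far the operation has progressed */
};

/* One step: returns 1 when complete, 0 to wait for more readiness, -1 on error */
static int read_op_step(struct read_op *op)
{
	while (op->done < op->len) {
		ssize_t n = read(op->fd, op->buf + op->done, op->len - op->done);
		if (n > 0) {
			op->done += n;
			continue;
		}
		if (n == 0)
			return 1;			/* EOF: complete */
		if (errno == EAGAIN || errno == EWOULDBLOCK)
			return 0;			/* resume on the next readiness event */
		return -1;
	}
	return 1;
}

int main(void)
{
	char buf[4096];
	struct read_op op = { .fd = STDIN_FILENO, .buf = buf, .len = sizeof(buf) };
	struct pollfd pfd = { .fd = STDIN_FILENO, .events = POLLIN };
	int rc;

	fcntl(op.fd, F_SETFL, fcntl(op.fd, F_GETFL) | O_NONBLOCK);
	while ((rc = read_op_step(&op)) == 0)
		poll(&pfd, 1, -1);
	printf("read %zu bytes (rc=%d)\n", op.done, rc);
	return 0;
}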

On the networking front, it would be very useful to have an AIO method to 
implement zero copy transmit.  Most of the pieces are in place with the 
sendpage code, but the catch is that we would need a destructor to use 
with data submitted via sendpage.  Are you interested in exploring a few 
more approaches to the design and tie-in with existing kernel AIO?  Cheers,

-ben


Re: [0/2] Kevent. Network AIO.

2006-02-09 Thread Benjamin LaHaise
On Thu, Feb 09, 2006 at 04:56:11PM +0300, Evgeniy Polyakov wrote:
> Hello.
> 
> I'm pleased to announce following projects:
> 
> 1/2 - Kevent subsystem.
>   This subsystem incorporates several AIO/kqueue design notes and ideas.
>   Kevent can be used both for edge and level notifications. It supports
>   socket notifications (accept and receiving), inode notifications
>   (create/remove), generic poll()/select() notifications, timer
>   notifications and network AIO notifications.

What is the justification behind adding yet another notification system?  
We already have AIO, epoll, traditional select/poll.  This seems like a lot 
of new code for no additional functionality.

> 2/2 - Network asynchronous IO. Receiving (TCP) support.
>   Network AIO is based on kevent and works as usual kevent storage on top
>   of it. It allows to receive data directly into userspace pages
>   asynchronously from socket's receiving queue from softirq context.

Is there any reason for not using the existing aio read method?

-ben


Re: [PATCH] af_unix: use shift instead of integer division

2006-02-07 Thread Benjamin LaHaise
On Tue, Feb 07, 2006 at 04:15:31PM +0100, Andi Kleen wrote:
> On Tuesday 07 February 2006 15:54, Benjamin LaHaise wrote:
> 
> > +   if (size > ((sk->sk_sndbuf >> 1) - 64))
> > +   size = (sk->sk_sndbuf >> 1) - 64;
> 
> This is really surprising. Are you double plus sure gcc doesn't 
> do this automatically?

As I said, sk_sndbuf is a signed integer, so gcc can't use an arithmetic 
shift (which would round toward negative infinity if the result is negative, 
whereas C division truncates toward zero -- and gcc has no way of knowing 
that sk_sndbuf will be positive).  The alternative would be to convert 
sk_sndbuf to unsigned, but that would mean rechecking all the users for 
side effects.
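
The difference is easy to demonstrate in a few lines of userspace C
(illustration only; assumes the usual two's-complement behaviour):

#include <stdio.h>

int main(void)
{
	int x = -7;

	/* C division truncates toward zero, while an arithmetic shift rounds
	 * toward negative infinity, so the compiler cannot blindly turn x/2
	 * into x>>1 for a value that might be negative. */
	printf("x / 2  = %d\n", x / 2);		/* -3 */
	printf("x >> 1 = %d\n", x >> 1);	/* -4 */
	return 0;
}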

-ben


[PATCH] af_unix: scm: better initialization

2006-02-07 Thread Benjamin LaHaise
Instead of doing a memset followed by initializing the fields of the scm 
structure, just initialize all the members explicitly.  Prevent reloading 
of current on x86 and x86-64 by storing the value in a local variable for 
subsequent dereferences.  This is worth a ~7KB/s increase in af_unix 
bandwidth.  Note that we avoid the issues surrounding potentially 
uninitialized members of the ucred structure by constructing a struct 
ucred instead of assigning the members individually, which forces the 
compiler to zero any padding.

Signed-off-by: Benjamin LaHaise <[EMAIL PROTECTED]>
diff --git a/include/net/scm.h b/include/net/scm.h
index c3fa3d5..0d90fa2 100644
--- a/include/net/scm.h
+++ b/include/net/scm.h
@@ -37,10 +37,14 @@ static __inline__ void scm_destroy(struc
 static __inline__ int scm_send(struct socket *sock, struct msghdr *msg,
   struct scm_cookie *scm)
 {
-   memset(scm, 0, sizeof(*scm));
-   scm->creds.uid = current->uid;
-   scm->creds.gid = current->gid;
-   scm->creds.pid = current->tgid;
+   struct task_struct *p = current;
+   scm->creds = (struct ucred) {
+   .uid = p->uid,
+   .gid = p->gid,
+   .pid = p->tgid
+   };
+   scm->fp = NULL;
+   scm->seq = 0;
if (msg->msg_controllen <= 0)
return 0;
return __scm_send(sock, msg, scm);
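
The same pattern, pulled out of the kernel into a standalone illustration
(the struct here is a stand-in, not the real struct ucred):

#include <stdio.h>
#include <string.h>

struct ucred_like {
	unsigned int pid;
	unsigned int uid;
	unsigned int gid;
};

int main(void)
{
	struct ucred_like creds;

	/* old style: clear everything, then fill in the fields */
	memset(&creds, 0, sizeof(creds));
	creds.uid = 1000;
	creds.gid = 100;
	creds.pid = 42;

	/* new style: one compound-literal assignment; members not named in
	 * the initializer are zero-initialized, so no separate memset() is
	 * needed */
	creds = (struct ucred_like) { .uid = 1000, .gid = 100, .pid = 42 };

	printf("%u %u %u\n", creds.pid, creds.uid, creds.gid);
	return 0;
}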


[PATCH] af_unix: use shift instead of integer division

2006-02-07 Thread Benjamin LaHaise
The patch below replaces a divide by 2 with a shift -- sk_sndbuf is a signed 
integer, so gcc emits an idiv, which takes roughly 10x longer than a shift 
by 1.  This improves af_unix bandwidth by ~6-10K/s.  Also, tidy up the 
comment to fit in 80 columns while we're at it.

-ben

Signed-off-by: Benjamin LaHaise <[EMAIL PROTECTED]>
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 1b5989b..b57d4d9 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1427,15 +1427,15 @@ static int unix_stream_sendmsg(struct ki
while(sent < len)
{
/*
-*  Optimisation for the fact that under 0.01% of X 
messages typically
-*  need breaking up.
+*  Optimisation for the fact that under 0.01% of X
+*  messages typically need breaking up.
 */
 
-   size=len-sent;
+   size = len-sent;
 
/* Keep two messages in the pipe so it schedules better */
-   if (size > sk->sk_sndbuf / 2 - 64)
-   size = sk->sk_sndbuf / 2 - 64;
+   if (size > ((sk->sk_sndbuf >> 1) - 64))
+   size = (sk->sk_sndbuf >> 1) - 64;
 
if (size > SKB_MAX_ALLOC)
size = SKB_MAX_ALLOC;


Re: [2.6 patch] schedule eepro100.c for removal

2006-02-03 Thread Benjamin LaHaise
Where's the hunk to make the eepro100 driver spew messages about being 
obsolete when it is loaded?

-ben

On Fri, Feb 03, 2006 at 10:32:34PM +0100, Adrian Bunk wrote:
> Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]>
> 
> ---
> 
> This patch was already sent on:
> - 18 Jan 2006
> 
> --- linux-2.6.15-mm4-full/Documentation/feature-removal-schedule.txt.old  
> 2006-01-18 08:38:57.0 +0100
> +++ linux-2.6.15-mm4-full/Documentation/feature-removal-schedule.txt  
> 2006-01-18 08:39:59.0 +0100
> @@ -164,0 +165,6 @@
> +---
> +
> +What:   eepro100 network driver
> +When:   April 2006
> +Why:replaced by the e100 driver
> +Who:Adrian Bunk <[EMAIL PROTECTED]>
> 



Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-01-29 Thread Benjamin LaHaise
On Sun, Jan 29, 2006 at 07:54:09AM +0100, Eric Dumazet wrote:
> Well, I think that might be doable, maybe RCU magic ?
> 
> 1) local_t are not that nice on all archs.

It is for the users that matter, and the hooks are there if someone finds 
it to be a performance problem.

> 2) The consolidation phase (summing all the cpus local offset to 
> consolidate the central counter) might be more difficult to do (we would 
> need kind of 2 counters per cpu, and a index that can be changed by the cpu 
> that wants a consolidation (still 'expensive'))

For the vast majority of these sorts of statistics counters, we don't 
need 100% accurate counts.  And I think it should be possible to choose 
between a lightweight implementation and the expensive implementation.  
On a chip like the Core Duo the cost of bouncing between the two cores 
is minimal, so all the extra code and data is a waste.
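
As a userspace sketch of the lightweight/approximate approach (illustration
only; the names and padding are mine, not the patch under discussion): each
thread bumps its own slot with a plain increment, and readers sum the slots,
accepting a slightly stale answer.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

/* one slot per thread, padded so the slots land on different cache lines */
static long counters[NTHREADS][16];

static void *worker(void *arg)
{
	long id = (long)arg;

	for (int i = 0; i < 1000000; i++)
		counters[id][0]++;	/* plain increment, no lock prefix */
	return NULL;
}

static long approximate_total(void)
{
	long sum = 0;

	for (int i = 0; i < NTHREADS; i++)
		sum += counters[i][0];	/* may race with writers; close enough */
	return sum;
}

int main(void)
{
	pthread_t tids[NTHREADS];

	for (long i = 0; i < NTHREADS; i++)
		pthread_create(&tids[i], NULL, worker, (void *)i);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tids[i], NULL);
	printf("total ~ %ld\n", approximate_total());
	return 0;
}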

> 3) Are the locked ops so expensive if done on a cache line that is mostly 
> in exclusive state in cpu cache ?

Yes.  What happens on the P4 is that it forces outstanding memory 
transactions in the reorder buffer to be flushed so that the memory barrier 
semantics of the lock prefix are observed.  This can take a long time as
there can be over a hundred instructions in flight.
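
A rough way to see the penalty from userspace (x86-only sketch of my own,
not part of this thread; rdtsc granularity and frequency scaling make the
absolute numbers approximate):

#include <stdio.h>
#include <stdint.h>

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;
	__asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	volatile long counter = 0;
	const int iters = 1000000;
	uint64_t t0, t1, t2;
	int i;

	t0 = rdtsc();
	for (i = 0; i < iters; i++)
		counter++;				/* plain read-modify-write */
	t1 = rdtsc();
	for (i = 0; i < iters; i++)
		__sync_fetch_and_add(&counter, 1);	/* lock-prefixed add */
	t2 = rdtsc();

	printf("plain:  %llu cycles/op\n", (unsigned long long)(t1 - t0) / iters);
	printf("locked: %llu cycles/op\n", (unsigned long long)(t2 - t1) / iters);
	return 0;
}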

-ben


Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-01-28 Thread Benjamin LaHaise
On Sat, Jan 28, 2006 at 04:55:49PM -0800, Andrew Morton wrote:
> local_t isn't much use until we get rid of asm-generic/local.h.  Bloaty,
> racy with nested interrupts.

The overuse of atomics is horrific in what is being proposed.  All the
major architectures except powerpc (i386, x86-64, ia64, and sparc64) 
implement local_t.  It would make far more sense to push the last few 
stragglers (which mostly seem to be uniprocessor) into writing the 
appropriate implementations.  Perhaps it's time to add a #error in 
asm-generic/local.h?
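
For reference, the usage pattern being advocated looks roughly like the
following kernel-style sketch (illustrative only -- the variable is made up
and the per-cpu iteration helpers have varied between kernel versions):

#include <linux/percpu.h>
#include <asm/local.h>

static DEFINE_PER_CPU(local_t, sockets_allocated_pcpu);

static inline void sockets_allocated_inc(void)
{
	/* update the owning CPU's counter: no lock prefix, no memory barrier */
	local_inc(&get_cpu_var(sockets_allocated_pcpu));
	put_cpu_var(sockets_allocated_pcpu);
}

static long sockets_allocated_read(void)
{
	long sum = 0;
	int cpu;

	/* readers sum the per-cpu values; the result is only approximate */
	for_each_possible_cpu(cpu)
		sum += local_read(&per_cpu(sockets_allocated_pcpu, cpu));
	return sum;
}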

-ben


Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated

2006-01-28 Thread Benjamin LaHaise
On Sat, Jan 28, 2006 at 01:28:20AM +0100, Eric Dumazet wrote:
> We might use atomic_long_t only (and no spinlocks)
> Something like this ?

Erk, complex and slow...  Try using local_t instead, which is substantially 
cheaper on the P4 as it doesn't use the lock prefix or act as a memory 
barrier.  See asm/local.h.

-ben


Re: driver skb reuse

2006-01-25 Thread Benjamin LaHaise
On Tue, Jan 24, 2006 at 10:35:06PM +0100, Robert Olsson wrote:
>  I splitted alloc_skb in two parts to get the memset() part even done 
>  in the driver just before passing the skb to the stack.
> 
>  I did expect a big win in but I didn't see any gain. Strange but the code 
>  was bad so it might be worth a try again. But this is a bit different 
>  from the skb reuse we discuss now.

Right.  Btw, looking over your changes for skb reuse and how the e1000 
lays out its data structures, I think there is still some room for 
improvement: currently you still touch the skb that the e1000 used to 
allocate the receive buffers in.  That cacheline is likely to be L1 cold 
because of the large number of outstanding buffers the e1000 uses.  If you 
instead only touch the buffer_info structure to populate a freshly 
allocated skb, you can eliminate a couple of cache misses and increase 
the chances that the skb you're reusing stays in the L1 cache.  That will 
require a lot more surgery to the e1000 code, but it could be worth it.

-ben


Re: driver skb reuse

2006-01-24 Thread Benjamin LaHaise
On Tue, Jan 24, 2006 at 02:23:15PM +0100, Robert Olsson wrote:
> etc. In the test below I use my usual lab setup but just let netfilter
> drop the packets. We win about 13% in this experiment below.
> 
> Here we process (drop) about 13%  packets more when skb'a get reued.

Instead of doing a completely separate skb reuse path, what happens if 
you remove the memset() from __alloc_skb() and instead do it in a slab 
ctor?  I remember seeing that in the profiles for af_unix.  Dave, could 
you refresh my memory why the slab ctor ended up being dropped?  Doing 
the memset() at free time is almost free since the data is usually still in 
the cache at that point.

-ben


Re: [2.6 patch] schedule SHAPER for removal

2006-01-22 Thread Benjamin LaHaise
On Sat, Jan 21, 2006 at 01:48:48AM +0100, Adrian Bunk wrote:
> Do we really have to wait the three years between stable Debian releases 
> for removing an obsolete driver that has always been marked as 
> EXPERIMENTAL?
> 
> Please be serious.

I am completely serious.  The traditional cycle of obsolete code that works 
and is not causing a maintenance burden is 2 major releases -- one release 
during which the obsolete feature spews warnings on use, and another 
development cycle until it is actually removed.  That's at least 3 years, 
which is still pretty short compared to distro cycles.

There seems to be a lot of this disease of removing code for the sake of 
removal lately, and it's getting to the point of being really annoying.  If 
the maintainer of the code in question isn't pushing for its removal, I see 
no need to rush the process too much, especially when the affected users 
aren't even likely to see the feature being marked obsolete since they don't 
troll the source code looking for things that break setups.

-ben

