Re: tbench regression in 2.6.25-rc1

2008-02-20 Thread Zhang, Yanmin
Compared with kernel 2.6.24, the tbench result has a regression with
2.6.25-rc1:
1) On the 2 quad-core processor stoakley: 4%.
2) On the 4 quad-core processor tigerton: more than 30%.

Bisect located the patch below.

b4ce92775c2e7ff9cf79cca4e0a19c8c5fd6287b is first bad commit
commit b4ce92775c2e7ff9cf79cca4e0a19c8c5fd6287b
Author: Herbert Xu <[EMAIL PROTECTED]>
Date:   Tue Nov 13 21:33:32 2007 -0800

[IPV6]: Move nfheader_len into rt6_info

The dst member nfheader_len is only used by IPv6.  It's also currently
creating a rather ugly alignment hole in struct dst.  Therefore this patch
moves it from there into struct rt6_info.

The above patch changes the cache line alignment, especially of member __refcnt.
I did a test by adding 2 unsigned long of padding before lastuse, so the 3 members,
lastuse/__refcnt/__use, are moved to the next cache line. The performance is
recovered.
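
For reference, the layout effect of this experiment can be reproduced in userspace
with a small sketch; the struct below is a hypothetical stand-in for the 2.6.25-rc1
field order (assuming x86_64 and 64-byte cache lines), not the real struct dst_entry:

/*
 * Minimal userspace sketch of the padding experiment.
 * Build with and without -DWITH_PADDING to see the hot members move.
 */
#include <stdio.h>
#include <stddef.h>

struct dst_mock {
	char		head[0x98];	/* rcu_head ... xfrm, everything before input */
	void		*input;		/* 0x98: read-mostly */
	void		*output;	/* 0xa0: read-mostly */
	void		*ops;		/* 0xa8: read-mostly */
#ifdef WITH_PADDING
	unsigned long	pad[2];		/* the 2 unsigned long of padding from the experiment */
#endif
	unsigned long	lastuse;	/* written per packet */
	int		refcnt;		/* written per packet (atomic_t in the kernel) */
	int		use;		/* written per packet */
};

#define CACHE_LINE_OF(m)	(offsetof(struct dst_mock, m) / 64)

int main(void)
{
	printf("ops    : offset 0x%zx, cache line %zu\n",
	       offsetof(struct dst_mock, ops), CACHE_LINE_OF(ops));
	printf("refcnt : offset 0x%zx, cache line %zu\n",
	       offsetof(struct dst_mock, refcnt), CACHE_LINE_OF(refcnt));
	/* Without WITH_PADDING both members land on cache line 2 (false
	 * sharing); with it, refcnt moves to line 3, which is the layout
	 * the padding experiment tested. */
	return 0;
}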

I created a patch to rearrange the members in struct dst_entry.

With Eric's and Valdis Kletnieks's suggestions, I made a finer arrangement:
1) Move tclassid under ops in case CONFIG_NET_CLS_ROUTE=y, so sizeof(dst_entry)=200
no matter whether CONFIG_NET_CLS_ROUTE is y or n. I tested many patches on my 16-core
tigerton by moving tclassid to different places. It looks like tclassid could also
have an impact on performance. If tclassid is moved before metrics, or not moved at
all, the performance isn't good, so I moved it behind metrics.
2) Add a comment before __refcnt.

On 16-core tigerton:
If CONFIG_NET_CLS_ROUTE=y, the result with the patch below is about 18% better than
the one without the patch;
if CONFIG_NET_CLS_ROUTE=n, the result with the patch below is about 30% better than
the one without the patch.

With 32-bit 2.6.25-rc1 on 8-core stoakley, the new patch doesn't introduce a
regression.

Thanks to Eric, Valdis, and David!

Signed-off-by: Zhang Yanmin <[EMAIL PROTECTED]>
Acked-by: Eric Dumazet <[EMAIL PROTECTED]>

---

--- linux-2.6.25-rc1/include/net/dst.h  2008-02-21 14:33:43.0 +0800
+++ linux-2.6.25-rc1_work/include/net/dst.h 2008-02-22 12:52:19.0 +0800
@@ -52,15 +52,10 @@ struct dst_entry
unsigned short  header_len; /* more space at head required */
unsigned short  trailer_len;/* space to reserve at tail */
 
-   u32 metrics[RTAX_MAX];
-   struct dst_entry*path;
-
-   unsigned long   rate_last;  /* rate limiting for ICMP */
unsigned intrate_tokens;
+   unsigned long   rate_last;  /* rate limiting for ICMP */
 
-#ifdef CONFIG_NET_CLS_ROUTE
-   __u32   tclassid;
-#endif
+   struct dst_entry*path;
 
struct neighbour*neighbour;
struct hh_cache *hh;
@@ -70,10 +65,20 @@ struct dst_entry
int (*output)(struct sk_buff*);
 
struct  dst_ops *ops;
-   
-   unsigned long   lastuse;
+
+   u32 metrics[RTAX_MAX];
+
+#ifdef CONFIG_NET_CLS_ROUTE
+   __u32   tclassid;
+#endif
+
+   /*
+* __refcnt wants to be on a different cache line from
+* input/output/ops or performance tanks badly
+*/
atomic_t__refcnt;   /* client references*/
int __use;
+   unsigned long   lastuse;
union {
struct dst_entry *next;
struct rtable*rt_next;




Re: tbench regression in 2.6.25-rc1

2008-02-20 Thread David Miller
From: Eric Dumazet <[EMAIL PROTECTED]>
Date: Wed, 20 Feb 2008 08:38:17 +0100

> Thanks very much Yanmin, I think we can apply your patch as is, if no 
> regression was found for 32bits.

Great.  Can I get a resubmission of the patch with a cleaned-up
changelog entry that describes the regression, along with the
changelog bits I saw in the most recent version of the patch?

An explicit "Acked-by:" from Eric would be nice too :-)

Thanks!



Re: tbench regression in 2.6.25-rc1

2008-02-19 Thread Eric Dumazet

Zhang, Yanmin a écrit :

On Tue, 2008-02-19 at 08:40 +0100, Eric Dumazet wrote:

Zhang, Yanmin a écrit :
On Mon, 2008-02-18 at 12:33 -0500, [EMAIL PROTECTED] wrote: 

On Mon, 18 Feb 2008 16:12:38 +0800, "Zhang, Yanmin" said:


I also think __refcnt is the key. I did a new testing by adding 2 unsigned long
pading before lastuse, so the 3 members are moved to next cache line. The 
performance is
recovered.

How about below patch? Almost all performance is recovered with the new patch.

Signed-off-by: Zhang Yanmin <[EMAIL PROTECTED]>

Could you add a comment someplace that says "refcnt wants to be on a different
cache line from input/output/ops or performance tanks badly", to warn some
future kernel hacker who starts adding new fields to the structure?

Ok. Below is the new patch.

1) Move tclassid under ops in case CONFIG_NET_CLS_ROUTE=y. So 
sizeof(dst_entry)=200
no matter if CONFIG_NET_CLS_ROUTE=y/n. I tested many patches on my 16-core 
tigerton by
moving tclassid to different place. It looks like tclassid could also have 
impact on
performance.
If moving tclassid before metrics, or just don't move tclassid, the performance 
isn't
good. So I move it behind metrics.

2) Add comments before __refcnt.

If CONFIG_NET_CLS_ROUTE=y, the result with below patch is about 18% better than
the one without the patch.

If CONFIG_NET_CLS_ROUTE=n, the result with below patch is about 30% better than
the one without the patch.

Signed-off-by: Zhang Yanmin <[EMAIL PROTECTED]>

---

--- linux-2.6.25-rc1/include/net/dst.h  2008-02-21 14:33:43.0 +0800
+++ linux-2.6.25-rc1_work/include/net/dst.h 2008-02-22 12:52:19.0 
+0800
@@ -52,15 +52,10 @@ struct dst_entry
unsigned short  header_len; /* more space at head required 
*/
unsigned short  trailer_len;/* space to reserve at tail */
 
-	u32			metrics[RTAX_MAX];

-   struct dst_entry*path;
-
-   unsigned long   rate_last;  /* rate limiting for ICMP */
unsigned intrate_tokens;
+   unsigned long   rate_last;  /* rate limiting for ICMP */
 
-#ifdef CONFIG_NET_CLS_ROUTE

-   __u32   tclassid;
-#endif
+   struct dst_entry*path;
 
 	struct neighbour	*neighbour;

struct hh_cache *hh;
@@ -70,10 +65,20 @@ struct dst_entry
int (*output)(struct sk_buff*);
 
 	struct  dst_ops	*ops;

-   
-   unsigned long   lastuse;
+
+   u32 metrics[RTAX_MAX];
+
+#ifdef CONFIG_NET_CLS_ROUTE
+   __u32   tclassid;
+#endif
+
+   /*
+* __refcnt wants to be on a different cache line from
+* input/output/ops or performance tanks badly
+*/
atomic_t__refcnt;   /* client references*/
int __use;
+   unsigned long   lastuse;
union {
struct dst_entry *next;
struct rtable*rt_next;




I prefer this patch, but unfortunatly your perf numbers are for 64 bits kernels.

Could you please test now with 32 bits one ?

I tested it with 32bit 2.6.25-rc1 on 8-core stoakley. The result almost has no 
difference
between pure kernel and patched kernel.

New update: On 8-core stoakley, the regression becomes 2~3% with kernel 
2.6.25-rc2. On
tigerton, the regression is still 30% with 2.6.25-rc2. On Tulsa( 8 
cores+hyperthreading),
the regression is still 4% with 2.6.25-rc2.

With my patch, on tigerton, almost all regression disappears. On tulsa, only 
about 2%
regression disappears.

So this issue is triggerred with multiple-cpu. Perhaps process scheduler is 
another
factor causing the issue to happen, but it's very hard to change scheduler.



Thanks very much Yanmin, I think we can apply your patch as is, if no 
regression was found for 32bits.




Eric,

I tested your new patch in function loopback_xmit. It has no improvement, while 
it doesn't
introduce new issues. As you tested it on dual-core machine and got 
improvement, how about
merging your patch with mine?


No, thank you, that was an experiment and is not related to your findings on 
dst_entry.


I am currently working on a 'distributed refcount' infrastructure, to be able
to spread across several nodes (for NUMA machines) or several cache lines (normal
SMP machines) the high pressure we currently have on some refcnts (struct
dst_entry, struct net_device, and many more refcnts ...).


Instead of NR_CPUS allocations, the goal is to be able to restrict the number of
32-bit entities used to store one refcnt to a small value like 4, 8 or 16,
even if NR_CPUS=4096 or so.


atomic_inc(&p->refcnt) ->  distref_inc(&p->refcnt)

distref_inc(struct distref *p)
{
atomic_inc(myptr[p->offset]);
}
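
A minimal userspace sketch of that idea follows, under the assumption of a small
fixed slot count. The names (distref_t, DISTREF_SLOTS) are hypothetical, C11
atomics stand in for the kernel's atomic_t, and the slot is picked from the CPU id
rather than the per-object offset used in the pseudocode above:

/*
 * Sketch: spread one logical refcount across a few cache-line-separated
 * slots, so concurrent increments from different CPUs don't all bounce
 * the same cache line.  Reading the value sums the slots.
 */
#include <stdatomic.h>
#include <stdio.h>

#define DISTREF_SLOTS	8	/* small fixed value, independent of NR_CPUS */

struct distref_slot {
	_Alignas(64) atomic_int count;	/* one slot per cache line */
};

typedef struct {
	struct distref_slot slot[DISTREF_SLOTS];
} distref_t;

static void distref_inc(distref_t *r, unsigned int cpu)
{
	/* pick a slot from the CPU id; the refcount is the sum over slots */
	atomic_fetch_add(&r->slot[cpu % DISTREF_SLOTS].count, 1);
}

static void distref_dec(distref_t *r, unsigned int cpu)
{
	atomic_fetch_sub(&r->slot[cpu % DISTREF_SLOTS].count, 1);
}

static int distref_read(distref_t *r)
{
	int sum = 0;
	for (int i = 0; i < DISTREF_SLOTS; i++)
		sum += atomic_load(&r->slot[i].count);	/* may be transiently skewed */
	return sum;
}

int main(void)
{
	static distref_t ref;		/* zero-initialized */

	distref_inc(&ref, 0);
	distref_inc(&ref, 5);
	distref_dec(&ref, 12);
	printf("refcount = %d\n", distref_read(&ref));	/* prints 1 */
	return 0;
}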


Re: tbench regression in 2.6.25-rc1

2008-02-19 Thread Zhang, Yanmin
On Tue, 2008-02-19 at 08:40 +0100, Eric Dumazet wrote:
> Zhang, Yanmin a écrit :
> > On Mon, 2008-02-18 at 12:33 -0500, [EMAIL PROTECTED] wrote: 
> >> On Mon, 18 Feb 2008 16:12:38 +0800, "Zhang, Yanmin" said:
> >>
> >>> I also think __refcnt is the key. I did a new testing by adding 2 
> >>> unsigned long
> >>> pading before lastuse, so the 3 members are moved to next cache line. The 
> >>> performance is
> >>> recovered.
> >>>
> >>> How about below patch? Almost all performance is recovered with the new 
> >>> patch.
> >>>
> >>> Signed-off-by: Zhang Yanmin <[EMAIL PROTECTED]>
> >> Could you add a comment someplace that says "refcnt wants to be on a 
> >> different
> >> cache line from input/output/ops or performance tanks badly", to warn some
> >> future kernel hacker who starts adding new fields to the structure?
> > Ok. Below is the new patch.
> > 
> > 1) Move tclassid under ops in case CONFIG_NET_CLS_ROUTE=y. So 
> > sizeof(dst_entry)=200
> > no matter if CONFIG_NET_CLS_ROUTE=y/n. I tested many patches on my 16-core 
> > tigerton by
> > moving tclassid to different place. It looks like tclassid could also have 
> > impact on
> > performance.
> > If moving tclassid before metrics, or just don't move tclassid, the 
> > performance isn't
> > good. So I move it behind metrics.
> > 
> > 2) Add comments before __refcnt.
> > 
> > If CONFIG_NET_CLS_ROUTE=y, the result with below patch is about 18% better 
> > than
> > the one without the patch.
> > 
> > If CONFIG_NET_CLS_ROUTE=n, the result with below patch is about 30% better 
> > than
> > the one without the patch.
> > 
> > Signed-off-by: Zhang Yanmin <[EMAIL PROTECTED]>
> > 
> > ---
> > 
> > --- linux-2.6.25-rc1/include/net/dst.h  2008-02-21 14:33:43.0 
> > +0800
> > +++ linux-2.6.25-rc1_work/include/net/dst.h 2008-02-22 12:52:19.0 
> > +0800
> > @@ -52,15 +52,10 @@ struct dst_entry
> > unsigned short  header_len; /* more space at head required 
> > */
> > unsigned short  trailer_len;/* space to reserve at tail */
> >  
> > -   u32 metrics[RTAX_MAX];
> > -   struct dst_entry*path;
> > -
> > -   unsigned long   rate_last;  /* rate limiting for ICMP */
> > unsigned intrate_tokens;
> > +   unsigned long   rate_last;  /* rate limiting for ICMP */
> >  
> > -#ifdef CONFIG_NET_CLS_ROUTE
> > -   __u32   tclassid;
> > -#endif
> > +   struct dst_entry*path;
> >  
> > struct neighbour*neighbour;
> > struct hh_cache *hh;
> > @@ -70,10 +65,20 @@ struct dst_entry
> > int (*output)(struct sk_buff*);
> >  
> > struct  dst_ops *ops;
> > -   
> > -   unsigned long   lastuse;
> > +
> > +   u32 metrics[RTAX_MAX];
> > +
> > +#ifdef CONFIG_NET_CLS_ROUTE
> > +   __u32   tclassid;
> > +#endif
> > +
> > +   /*
> > +* __refcnt wants to be on a different cache line from
> > +* input/output/ops or performance tanks badly
> > +*/
> > atomic_t__refcnt;   /* client references*/
> > int __use;
> > +   unsigned long   lastuse;
> > union {
> > struct dst_entry *next;
> > struct rtable*rt_next;
> > 
> > 
> > 
> 
> I prefer this patch, but unfortunatly your perf numbers are for 64 bits 
> kernels.
> 
> Could you please test now with 32 bits one ?
I tested it with 32-bit 2.6.25-rc1 on 8-core stoakley. The result has almost no
difference between the pure kernel and the patched kernel.

New update: On 8-core stoakley, the regression becomes 2~3% with kernel 2.6.25-rc2.
On tigerton, the regression is still 30% with 2.6.25-rc2. On Tulsa (8 cores +
hyperthreading), the regression is still 4% with 2.6.25-rc2.

With my patch, on tigerton, almost all of the regression disappears. On tulsa,
only about 2% of the regression disappears.

So this issue is triggered on multiple-CPU machines. Perhaps the process scheduler
is another factor causing the issue, but it's very hard to change the scheduler.


Eric,

I tested your new patch in function loopback_xmit. It has no improvement, but it
doesn't introduce new issues either. As you tested it on a dual-core machine and got
an improvement, how about merging your patch with mine?

-yanmin




Re: tbench regression in 2.6.25-rc1

2008-02-19 Thread Zhang, Yanmin
On Tue, 2008-02-19 at 08:35 +0100, Eric Dumazet wrote:
 Zhang, Yanmin a écrit :
  On Mon, 2008-02-18 at 11:11 +0100, Eric Dumazet wrote:
  On Mon, 18 Feb 2008 16:12:38 +0800
  Zhang, Yanmin [EMAIL PROTECTED] wrote:
 
  On Fri, 2008-02-15 at 15:22 -0800, David Miller wrote:
  From: Eric Dumazet [EMAIL PROTECTED]
  Date: Fri, 15 Feb 2008 15:21:48 +0100
 
  On linux-2.6.25-rc1 x86_64 :
 
  offsetof(struct dst_entry, lastuse)=0xb0
  offsetof(struct dst_entry, __refcnt)=0xb8
  offsetof(struct dst_entry, __use)=0xbc
  offsetof(struct dst_entry, next)=0xc0
 
  So it should be optimal... I dont know why tbench prefers __refcnt 
  being 
  on 0xc0, since in this case lastuse will be on a different cache line...
 
  Each incoming IP packet will need to change lastuse, __refcnt and 
  __use, 
  so keeping them in the same cache line is a win.
 
  I suspect then that even this patch could help tbench, since it avoids 
  writing lastuse...
  I think your suspicions are right, and even moreso
  it helps to keep __refcnt out of the same cache line
  as input/output/ops which are read-almost-entirely :-
  I think you are right. The issue is these three variables sharing the 
  same cache line
  with input/output/ops.
 
  )
 
  I haven't done an exhaustive analysis, but it seems that
  the write traffic to lastuse and __refcnt are about the
  same.  However if we find that __refcnt gets hit more
  than lastuse in this workload, it explains the regression.
  I also think __refcnt is the key. I did a new testing by adding 2 
  unsigned long
  pading before lastuse, so the 3 members are moved to next cache line. The 
  performance is
  recovered.
 
  How about below patch? Almost all performance is recovered with the new 
  patch.
 
  Signed-off-by: Zhang Yanmin [EMAIL PROTECTED]
 
  ---
 
  --- linux-2.6.25-rc1/include/net/dst.h2008-02-21 14:33:43.0 
  +0800
  +++ linux-2.6.25-rc1_work/include/net/dst.h   2008-02-21 
  14:36:22.0 +0800
  @@ -52,11 +52,10 @@ struct dst_entry
unsigned short  header_len; /* more space at head required 
  */
unsigned short  trailer_len;/* space to reserve at tail */
   
  - u32 metrics[RTAX_MAX];
  - struct dst_entry*path;
  -
  - unsigned long   rate_last;  /* rate limiting for ICMP */
unsigned intrate_tokens;
  + unsigned long   rate_last;  /* rate limiting for ICMP */
  +
  + struct dst_entry*path;
   
   #ifdef CONFIG_NET_CLS_ROUTE
__u32   tclassid;
  @@ -70,10 +69,12 @@ struct dst_entry
int (*output)(struct sk_buff*);
   
struct  dst_ops *ops;
  - 
  - unsigned long   lastuse;
  +
  + u32 metrics[RTAX_MAX];
  +
atomic_t__refcnt;   /* client references*/
int __use;
  + unsigned long   lastuse;
union {
struct dst_entry *next;
struct rtable*rt_next;
 
 
  Well, after this patch, we grow dst_entry by 8 bytes :
  With my .config, it doesn't grow. Perhaps because of CONFIG_NET_CLS_ROUTE, 
  I don't
  enable it. I will move tclassid under ops.
  
  sizeof(struct dst_entry)=0xd0
  offsetof(struct dst_entry, input)=0x68
  offsetof(struct dst_entry, output)=0x70
  offsetof(struct dst_entry, __refcnt)=0xb4
  offsetof(struct dst_entry, lastuse)=0xc0
  offsetof(struct dst_entry, __use)=0xb8
  sizeof(struct rtable)=0x140
 
 
  So we dirty two cache lines instead of one, unless your cpu have 128 bytes 
  cache lines ?
 
  I am quite suprised that my patch to not change lastuse if already set to 
  jiffies changes nothing...
 
  If you have some time, could you also test this (unrelated) patch ?
 
  We can avoid dirty all the time a cache line of loopback device.
 
  diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
  index f2a6e71..0a4186a 100644
  --- a/drivers/net/loopback.c
  +++ b/drivers/net/loopback.c
  @@ -150,7 +150,10 @@ static int loopback_xmit(struct sk_buff *skb, struct 
  net_device *dev)
  return 0;
  }
   #endif
  -   dev->last_rx = jiffies;
  +#ifdef CONFIG_SMP
  +   if (dev->last_rx != jiffies)
  +#endif
  +   dev->last_rx = jiffies;
   
  /* it's OK to use per_cpu_ptr() because BHs are off */
  pcpu_lstats = netdev_priv(dev);
 
  Although I didn't test it, I don't think it's ok. The key is __refcnt 
  shares the same
  cache line with ops/input/output.
  
 
 Note it was unrelated to struct dst, but dirtying of one cache line of 
 'loopback netdevice'
 
 I tested it, and tbench result was better with this patch : 890 MB/s instead 
 of 870 MB/s on a bi dual core machine.
I tested your new patch and it doesn't help tbench.

On my 8-core stoakley machine, the regression is only 5%, but it's 30% on 
16-core tigerton.
It looks like the scalability is poor.

 
 
 I was curious of the potential gain on your 16 cores 

Re: tbench regression in 2.6.25-rc1

2008-02-18 Thread Eric Dumazet

Zhang, Yanmin a écrit :
On Mon, 2008-02-18 at 12:33 -0500, [EMAIL PROTECTED] wrote: 

On Mon, 18 Feb 2008 16:12:38 +0800, "Zhang, Yanmin" said:


I also think __refcnt is the key. I did a new testing by adding 2 unsigned long
pading before lastuse, so the 3 members are moved to next cache line. The 
performance is
recovered.

How about below patch? Almost all performance is recovered with the new patch.

Signed-off-by: Zhang Yanmin <[EMAIL PROTECTED]>

Could you add a comment someplace that says "refcnt wants to be on a different
cache line from input/output/ops or performance tanks badly", to warn some
future kernel hacker who starts adding new fields to the structure?

Ok. Below is the new patch.

1) Move tclassid under ops in case CONFIG_NET_CLS_ROUTE=y. So 
sizeof(dst_entry)=200
no matter if CONFIG_NET_CLS_ROUTE=y/n. I tested many patches on my 16-core 
tigerton by
moving tclassid to different place. It looks like tclassid could also have 
impact on
performance.
If moving tclassid before metrics, or just don't move tclassid, the performance 
isn't
good. So I move it behind metrics.

2) Add comments before __refcnt.

If CONFIG_NET_CLS_ROUTE=y, the result with below patch is about 18% better than
the one without the patch.

If CONFIG_NET_CLS_ROUTE=n, the result with below patch is about 30% better than
the one without the patch.

Signed-off-by: Zhang Yanmin <[EMAIL PROTECTED]>

---

--- linux-2.6.25-rc1/include/net/dst.h  2008-02-21 14:33:43.0 +0800
+++ linux-2.6.25-rc1_work/include/net/dst.h 2008-02-22 12:52:19.0 
+0800
@@ -52,15 +52,10 @@ struct dst_entry
unsigned short  header_len; /* more space at head required 
*/
unsigned short  trailer_len;/* space to reserve at tail */
 
-	u32			metrics[RTAX_MAX];

-   struct dst_entry*path;
-
-   unsigned long   rate_last;  /* rate limiting for ICMP */
unsigned intrate_tokens;
+   unsigned long   rate_last;  /* rate limiting for ICMP */
 
-#ifdef CONFIG_NET_CLS_ROUTE

-   __u32   tclassid;
-#endif
+   struct dst_entry*path;
 
 	struct neighbour	*neighbour;

struct hh_cache *hh;
@@ -70,10 +65,20 @@ struct dst_entry
int (*output)(struct sk_buff*);
 
 	struct  dst_ops	*ops;

-   
-   unsigned long   lastuse;
+
+   u32 metrics[RTAX_MAX];
+
+#ifdef CONFIG_NET_CLS_ROUTE
+   __u32   tclassid;
+#endif
+
+   /*
+* __refcnt wants to be on a different cache line from
+* input/output/ops or performance tanks badly
+*/
atomic_t__refcnt;   /* client references*/
int __use;
+   unsigned long   lastuse;
union {
struct dst_entry *next;
struct rtable*rt_next;





I prefer this patch, but unfortunately your perf numbers are for 64-bit kernels.

Could you please test now with a 32-bit one?

Thank you


Re: tbench regression in 2.6.25-rc1

2008-02-18 Thread Eric Dumazet

Zhang, Yanmin a écrit :

On Mon, 2008-02-18 at 11:11 +0100, Eric Dumazet wrote:

On Mon, 18 Feb 2008 16:12:38 +0800
"Zhang, Yanmin" <[EMAIL PROTECTED]> wrote:


On Fri, 2008-02-15 at 15:22 -0800, David Miller wrote:

From: Eric Dumazet <[EMAIL PROTECTED]>
Date: Fri, 15 Feb 2008 15:21:48 +0100


On linux-2.6.25-rc1 x86_64 :

offsetof(struct dst_entry, lastuse)=0xb0
offsetof(struct dst_entry, __refcnt)=0xb8
offsetof(struct dst_entry, __use)=0xbc
offsetof(struct dst_entry, next)=0xc0

So it should be optimal... I dont know why tbench prefers __refcnt being 
on 0xc0, since in this case lastuse will be on a different cache line...


Each incoming IP packet will need to change lastuse, __refcnt and __use, 
so keeping them in the same cache line is a win.


I suspect then that even this patch could help tbench, since it avoids 
writing lastuse...

I think your suspicions are right, and even moreso
it helps to keep __refcnt out of the same cache line
as input/output/ops which are read-almost-entirely :-

I think you are right. The issue is these three variables sharing the same 
cache line
with input/output/ops.


)

I haven't done an exhaustive analysis, but it seems that
the write traffic to lastuse and __refcnt are about the
same.  However if we find that __refcnt gets hit more
than lastuse in this workload, it explains the regression.

I also think __refcnt is the key. I did a new testing by adding 2 unsigned long
pading before lastuse, so the 3 members are moved to next cache line. The 
performance is
recovered.

How about below patch? Almost all performance is recovered with the new patch.

Signed-off-by: Zhang Yanmin <[EMAIL PROTECTED]>

---

--- linux-2.6.25-rc1/include/net/dst.h  2008-02-21 14:33:43.0 +0800
+++ linux-2.6.25-rc1_work/include/net/dst.h 2008-02-21 14:36:22.0 
+0800
@@ -52,11 +52,10 @@ struct dst_entry
unsigned short  header_len; /* more space at head required 
*/
unsigned short  trailer_len;/* space to reserve at tail */
 
-	u32			metrics[RTAX_MAX];

-   struct dst_entry*path;
-
-   unsigned long   rate_last;  /* rate limiting for ICMP */
unsigned intrate_tokens;
+   unsigned long   rate_last;  /* rate limiting for ICMP */
+
+   struct dst_entry*path;
 
 #ifdef CONFIG_NET_CLS_ROUTE

__u32   tclassid;
@@ -70,10 +69,12 @@ struct dst_entry
int (*output)(struct sk_buff*);
 
 	struct  dst_ops	*ops;

-   
-   unsigned long   lastuse;
+
+   u32 metrics[RTAX_MAX];
+
atomic_t__refcnt;   /* client references*/
int __use;
+   unsigned long   lastuse;
union {
struct dst_entry *next;
struct rtable*rt_next;



Well, after this patch, we grow dst_entry by 8 bytes :

With my .config, it doesn't grow. Perhaps because of CONFIG_NET_CLS_ROUTE, I 
don't
enable it. I will move tclassid under ops.


sizeof(struct dst_entry)=0xd0
offsetof(struct dst_entry, input)=0x68
offsetof(struct dst_entry, output)=0x70
offsetof(struct dst_entry, __refcnt)=0xb4
offsetof(struct dst_entry, lastuse)=0xc0
offsetof(struct dst_entry, __use)=0xb8
sizeof(struct rtable)=0x140


So we dirty two cache lines instead of one, unless your cpu have 128 bytes 
cache lines ?

I am quite suprised that my patch to not change lastuse if already set to 
jiffies changes nothing...

If you have some time, could you also test this (unrelated) patch ?

We can avoid dirty all the time a cache line of loopback device.

diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index f2a6e71..0a4186a 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -150,7 +150,10 @@ static int loopback_xmit(struct sk_buff *skb, struct 
net_device *dev)
return 0;
}
 #endif
-   dev->last_rx = jiffies;
+#ifdef CONFIG_SMP
+   if (dev->last_rx != jiffies)
+#endif
+   dev->last_rx = jiffies;
 
/* it's OK to use per_cpu_ptr() because BHs are off */

pcpu_lstats = netdev_priv(dev);


Although I didn't test it, I don't think it's ok. The key is __refcnt shares 
the same
cache line with ops/input/output.



Note it was unrelated to struct dst; it was about the dirtying of one cache line of
the 'loopback' netdevice.


I tested it, and the tbench result was better with this patch: 890 MB/s instead
of 870 MB/s on a machine with two dual-core CPUs.



I was curious about the potential gain on your 16-core (4x4) machine.


Re: tbench regression in 2.6.25-rc1

2008-02-18 Thread Zhang, Yanmin
On Mon, 2008-02-18 at 12:33 -0500, [EMAIL PROTECTED] wrote: 
> On Mon, 18 Feb 2008 16:12:38 +0800, "Zhang, Yanmin" said:
> 
> > I also think __refcnt is the key. I did a new testing by adding 2 unsigned 
> > long
> > pading before lastuse, so the 3 members are moved to next cache line. The 
> > performance is
> > recovered.
> > 
> > How about below patch? Almost all performance is recovered with the new 
> > patch.
> > 
> > Signed-off-by: Zhang Yanmin <[EMAIL PROTECTED]>
> 
> Could you add a comment someplace that says "refcnt wants to be on a different
> cache line from input/output/ops or performance tanks badly", to warn some
> future kernel hacker who starts adding new fields to the structure?
Ok. Below is the new patch.

1) Move tclassid under ops in case CONFIG_NET_CLS_ROUTE=y, so sizeof(dst_entry)=200
no matter whether CONFIG_NET_CLS_ROUTE is y or n. I tested many patches on my 16-core
tigerton by moving tclassid to different places. It looks like tclassid could also
have an impact on performance. If tclassid is moved before metrics, or not moved at
all, the performance isn't good, so I moved it behind metrics.

2) Add a comment before __refcnt.

If CONFIG_NET_CLS_ROUTE=y, the result with the patch below is about 18% better than
the one without the patch.

If CONFIG_NET_CLS_ROUTE=n, the result with the patch below is about 30% better than
the one without the patch.

Signed-off-by: Zhang Yanmin <[EMAIL PROTECTED]>

---

--- linux-2.6.25-rc1/include/net/dst.h  2008-02-21 14:33:43.0 +0800
+++ linux-2.6.25-rc1_work/include/net/dst.h 2008-02-22 12:52:19.0 +0800
@@ -52,15 +52,10 @@ struct dst_entry
unsigned short  header_len; /* more space at head required */
unsigned short  trailer_len;/* space to reserve at tail */
 
-   u32 metrics[RTAX_MAX];
-   struct dst_entry*path;
-
-   unsigned long   rate_last;  /* rate limiting for ICMP */
unsigned intrate_tokens;
+   unsigned long   rate_last;  /* rate limiting for ICMP */
 
-#ifdef CONFIG_NET_CLS_ROUTE
-   __u32   tclassid;
-#endif
+   struct dst_entry*path;
 
struct neighbour*neighbour;
struct hh_cache *hh;
@@ -70,10 +65,20 @@ struct dst_entry
int (*output)(struct sk_buff*);
 
struct  dst_ops *ops;
-   
-   unsigned long   lastuse;
+
+   u32 metrics[RTAX_MAX];
+
+#ifdef CONFIG_NET_CLS_ROUTE
+   __u32   tclassid;
+#endif
+
+   /*
+* __refcnt wants to be on a different cache line from
+* input/output/ops or performance tanks badly
+*/
atomic_t__refcnt;   /* client references*/
int __use;
+   unsigned long   lastuse;
union {
struct dst_entry *next;
struct rtable*rt_next;




Re: tbench regression in 2.6.25-rc1

2008-02-18 Thread Zhang, Yanmin
On Mon, 2008-02-18 at 11:11 +0100, Eric Dumazet wrote:
> On Mon, 18 Feb 2008 16:12:38 +0800
> "Zhang, Yanmin" <[EMAIL PROTECTED]> wrote:
> 
> > On Fri, 2008-02-15 at 15:22 -0800, David Miller wrote:
> > > From: Eric Dumazet <[EMAIL PROTECTED]>
> > > Date: Fri, 15 Feb 2008 15:21:48 +0100
> > > 
> > > > On linux-2.6.25-rc1 x86_64 :
> > > > 
> > > > offsetof(struct dst_entry, lastuse)=0xb0
> > > > offsetof(struct dst_entry, __refcnt)=0xb8
> > > > offsetof(struct dst_entry, __use)=0xbc
> > > > offsetof(struct dst_entry, next)=0xc0
> > > > 
> > > > So it should be optimal... I dont know why tbench prefers __refcnt 
> > > > being 
> > > > on 0xc0, since in this case lastuse will be on a different cache line...
> > > > 
> > > > Each incoming IP packet will need to change lastuse, __refcnt and 
> > > > __use, 
> > > > so keeping them in the same cache line is a win.
> > > > 
> > > > I suspect then that even this patch could help tbench, since it avoids 
> > > > writing lastuse...
> > > 
> > > I think your suspicions are right, and even moreso
> > > it helps to keep __refcnt out of the same cache line
> > > as input/output/ops which are read-almost-entirely :-
> > I think you are right. The issue is these three variables sharing the same 
> > cache line
> > with input/output/ops.
> > 
> > > )
> > > 
> > > I haven't done an exhaustive analysis, but it seems that
> > > the write traffic to lastuse and __refcnt are about the
> > > same.  However if we find that __refcnt gets hit more
> > > than lastuse in this workload, it explains the regression.
> > I also think __refcnt is the key. I did a new testing by adding 2 unsigned 
> > long
> > pading before lastuse, so the 3 members are moved to next cache line. The 
> > performance is
> > recovered.
> > 
> > How about below patch? Almost all performance is recovered with the new 
> > patch.
> > 
> > Signed-off-by: Zhang Yanmin <[EMAIL PROTECTED]>
> > 
> > ---
> > 
> > --- linux-2.6.25-rc1/include/net/dst.h  2008-02-21 14:33:43.0 
> > +0800
> > +++ linux-2.6.25-rc1_work/include/net/dst.h 2008-02-21 14:36:22.0 
> > +0800
> > @@ -52,11 +52,10 @@ struct dst_entry
> > unsigned short  header_len; /* more space at head required 
> > */
> > unsigned short  trailer_len;/* space to reserve at tail */
> >  
> > -   u32 metrics[RTAX_MAX];
> > -   struct dst_entry*path;
> > -
> > -   unsigned long   rate_last;  /* rate limiting for ICMP */
> > unsigned intrate_tokens;
> > +   unsigned long   rate_last;  /* rate limiting for ICMP */
> > +
> > +   struct dst_entry*path;
> >  
> >  #ifdef CONFIG_NET_CLS_ROUTE
> > __u32   tclassid;
> > @@ -70,10 +69,12 @@ struct dst_entry
> > int (*output)(struct sk_buff*);
> >  
> > struct  dst_ops *ops;
> > -   
> > -   unsigned long   lastuse;
> > +
> > +   u32 metrics[RTAX_MAX];
> > +
> > atomic_t__refcnt;   /* client references*/
> > int __use;
> > +   unsigned long   lastuse;
> > union {
> > struct dst_entry *next;
> > struct rtable*rt_next;
> > 
> > 
> 
> Well, after this patch, we grow dst_entry by 8 bytes :
With my .config, it doesn't grow, perhaps because I don't enable
CONFIG_NET_CLS_ROUTE. I will move tclassid under ops.

> 
> sizeof(struct dst_entry)=0xd0
> offsetof(struct dst_entry, input)=0x68
> offsetof(struct dst_entry, output)=0x70
> offsetof(struct dst_entry, __refcnt)=0xb4
> offsetof(struct dst_entry, lastuse)=0xc0
> offsetof(struct dst_entry, __use)=0xb8
> sizeof(struct rtable)=0x140
> 
> 
> So we dirty two cache lines instead of one, unless your cpu have 128 bytes 
> cache lines ?
> 
> I am quite suprised that my patch to not change lastuse if already set to 
> jiffies changes nothing...
> 
> If you have some time, could you also test this (unrelated) patch ?
> 
> We can avoid dirty all the time a cache line of loopback device.
> 
> diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
> index f2a6e71..0a4186a 100644
> --- a/drivers/net/loopback.c
> +++ b/drivers/net/loopback.c
> @@ -150,7 +150,10 @@ static int loopback_xmit(struct sk_buff *skb, struct 
> net_device *dev)
> return 0;
> }
>  #endif
> -   dev->last_rx = jiffies;
> +#ifdef CONFIG_SMP
> +   if (dev->last_rx != jiffies)
> +#endif
> +   dev->last_rx = jiffies;
>  
> /* it's OK to use per_cpu_ptr() because BHs are off */
> pcpu_lstats = netdev_priv(dev);
> 
Although I didn't test it, I don't think it's OK. The key is that __refcnt shares
the same cache line with ops/input/output.

-yanmin



Re: tbench regression in 2.6.25-rc1

2008-02-18 Thread Valdis . Kletnieks
On Mon, 18 Feb 2008 16:12:38 +0800, "Zhang, Yanmin" said:

> I also think __refcnt is the key. I did a new testing by adding 2 unsigned 
> long
> pading before lastuse, so the 3 members are moved to next cache line. The 
> performance is
> recovered.
> 
> How about below patch? Almost all performance is recovered with the new patch.
> 
> Signed-off-by: Zhang Yanmin <[EMAIL PROTECTED]>

Could you add a comment someplace that says "refcnt wants to be on a different
cache line from input/output/ops or performance tanks badly", to warn some
future kernel hacker who starts adding new fields to the structure?




Re: tbench regression in 2.6.25-rc1

2008-02-18 Thread Eric Dumazet
On Mon, 18 Feb 2008 16:12:38 +0800
"Zhang, Yanmin" <[EMAIL PROTECTED]> wrote:

> On Fri, 2008-02-15 at 15:22 -0800, David Miller wrote:
> > From: Eric Dumazet <[EMAIL PROTECTED]>
> > Date: Fri, 15 Feb 2008 15:21:48 +0100
> > 
> > > On linux-2.6.25-rc1 x86_64 :
> > > 
> > > offsetof(struct dst_entry, lastuse)=0xb0
> > > offsetof(struct dst_entry, __refcnt)=0xb8
> > > offsetof(struct dst_entry, __use)=0xbc
> > > offsetof(struct dst_entry, next)=0xc0
> > > 
> > > So it should be optimal... I dont know why tbench prefers __refcnt being 
> > > on 0xc0, since in this case lastuse will be on a different cache line...
> > > 
> > > Each incoming IP packet will need to change lastuse, __refcnt and __use, 
> > > so keeping them in the same cache line is a win.
> > > 
> > > I suspect then that even this patch could help tbench, since it avoids 
> > > writing lastuse...
> > 
> > I think your suspicions are right, and even moreso
> > it helps to keep __refcnt out of the same cache line
> > as input/output/ops which are read-almost-entirely :-
> I think you are right. The issue is these three variables sharing the same 
> cache line
> with input/output/ops.
> 
> > )
> > 
> > I haven't done an exhaustive analysis, but it seems that
> > the write traffic to lastuse and __refcnt are about the
> > same.  However if we find that __refcnt gets hit more
> > than lastuse in this workload, it explains the regression.
> I also think __refcnt is the key. I did a new testing by adding 2 unsigned 
> long
> pading before lastuse, so the 3 members are moved to next cache line. The 
> performance is
> recovered.
> 
> How about below patch? Almost all performance is recovered with the new patch.
> 
> Signed-off-by: Zhang Yanmin <[EMAIL PROTECTED]>
> 
> ---
> 
> --- linux-2.6.25-rc1/include/net/dst.h2008-02-21 14:33:43.0 
> +0800
> +++ linux-2.6.25-rc1_work/include/net/dst.h   2008-02-21 14:36:22.0 
> +0800
> @@ -52,11 +52,10 @@ struct dst_entry
>   unsigned short  header_len; /* more space at head required 
> */
>   unsigned short  trailer_len;/* space to reserve at tail */
>  
> - u32 metrics[RTAX_MAX];
> - struct dst_entry*path;
> -
> - unsigned long   rate_last;  /* rate limiting for ICMP */
>   unsigned intrate_tokens;
> + unsigned long   rate_last;  /* rate limiting for ICMP */
> +
> + struct dst_entry*path;
>  
>  #ifdef CONFIG_NET_CLS_ROUTE
>   __u32   tclassid;
> @@ -70,10 +69,12 @@ struct dst_entry
>   int (*output)(struct sk_buff*);
>  
>   struct  dst_ops *ops;
> - 
> - unsigned long   lastuse;
> +
> + u32 metrics[RTAX_MAX];
> +
>   atomic_t__refcnt;   /* client references*/
>   int __use;
> + unsigned long   lastuse;
>   union {
>   struct dst_entry *next;
>   struct rtable*rt_next;
> 
> 

Well, after this patch, we grow dst_entry by 8 bytes :

sizeof(struct dst_entry)=0xd0
offsetof(struct dst_entry, input)=0x68
offsetof(struct dst_entry, output)=0x70
offsetof(struct dst_entry, __refcnt)=0xb4
offsetof(struct dst_entry, lastuse)=0xc0
offsetof(struct dst_entry, __use)=0xb8
sizeof(struct rtable)=0x140


So we dirty two cache lines instead of one, unless your CPU has 128-byte
cache lines?

I am quite surprised that my patch to not change lastuse if it is already set to
jiffies changes nothing...

If you have some time, could you also test this (unrelated) patch?

We can avoid dirtying a cache line of the loopback device all the time.

diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index f2a6e71..0a4186a 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -150,7 +150,10 @@ static int loopback_xmit(struct sk_buff *skb, struct net_device *dev)
return 0;
}
 #endif
-   dev->last_rx = jiffies;
+#ifdef CONFIG_SMP
+   if (dev->last_rx != jiffies)
+#endif
+   dev->last_rx = jiffies;
 
/* it's OK to use per_cpu_ptr() because BHs are off */
pcpu_lstats = netdev_priv(dev);
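
The patch applies a general write-avoidance idiom: skip the store when the value is
already current, so the shared cache line is not taken exclusive on every packet.
A userspace sketch of the same idiom follows; the struct, field and function names
are placeholders for illustration, not kernel APIs:

#include <stdio.h>
#include <time.h>

struct fake_netdev {
	unsigned long last_rx;	/* hot, shared field (like dev->last_rx) */
	/* ... other fields that share this cache line ... */
};

/*
 * Only dirty the shared field when the timestamp actually changed
 * (roughly once per tick), instead of on every packet.  The racy plain
 * read is fine here: at worst one redundant store happens.
 */
static void touch_last_rx(struct fake_netdev *dev, unsigned long now)
{
	if (dev->last_rx != now)
		dev->last_rx = now;
}

int main(void)
{
	struct fake_netdev dev = { 0 };
	unsigned long now = (unsigned long)time(NULL);	/* stand-in for jiffies */

	for (int i = 0; i < 1000; i++)
		touch_last_rx(&dev, now);	/* only the first call stores */
	printf("last_rx = %lu\n", dev.last_rx);
	return 0;
}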



Re: tbench regression in 2.6.25-rc1

2008-02-18 Thread Zhang, Yanmin
On Fri, 2008-02-15 at 15:22 -0800, David Miller wrote:
> From: Eric Dumazet <[EMAIL PROTECTED]>
> Date: Fri, 15 Feb 2008 15:21:48 +0100
> 
> > On linux-2.6.25-rc1 x86_64 :
> > 
> > offsetof(struct dst_entry, lastuse)=0xb0
> > offsetof(struct dst_entry, __refcnt)=0xb8
> > offsetof(struct dst_entry, __use)=0xbc
> > offsetof(struct dst_entry, next)=0xc0
> > 
> > So it should be optimal... I dont know why tbench prefers __refcnt being 
> > on 0xc0, since in this case lastuse will be on a different cache line...
> > 
> > Each incoming IP packet will need to change lastuse, __refcnt and __use, 
> > so keeping them in the same cache line is a win.
> > 
> > I suspect then that even this patch could help tbench, since it avoids 
> > writing lastuse...
> 
> I think your suspicions are right, and even moreso
> it helps to keep __refcnt out of the same cache line
> as input/output/ops which are read-almost-entirely :-
I think you are right. The issue is that these three variables share the same
cache line with input/output/ops.

> )
> 
> I haven't done an exhaustive analysis, but it seems that
> the write traffic to lastuse and __refcnt are about the
> same.  However if we find that __refcnt gets hit more
> than lastuse in this workload, it explains the regression.
I also think __refcnt is the key. I did a new test by adding 2 unsigned long of
padding before lastuse, so the 3 members are moved to the next cache line. The
performance is recovered.

How about the patch below? Almost all of the performance is recovered with the new
patch.

Signed-off-by: Zhang Yanmin <[EMAIL PROTECTED]>

---

--- linux-2.6.25-rc1/include/net/dst.h  2008-02-21 14:33:43.0 +0800
+++ linux-2.6.25-rc1_work/include/net/dst.h 2008-02-21 14:36:22.0 +0800
@@ -52,11 +52,10 @@ struct dst_entry
unsigned short  header_len; /* more space at head required */
unsigned short  trailer_len;/* space to reserve at tail */
 
-   u32 metrics[RTAX_MAX];
-   struct dst_entry*path;
-
-   unsigned long   rate_last;  /* rate limiting for ICMP */
unsigned intrate_tokens;
+   unsigned long   rate_last;  /* rate limiting for ICMP */
+
+   struct dst_entry*path;
 
 #ifdef CONFIG_NET_CLS_ROUTE
__u32   tclassid;
@@ -70,10 +69,12 @@ struct dst_entry
int (*output)(struct sk_buff*);
 
struct  dst_ops *ops;
-   
-   unsigned long   lastuse;
+
+   u32 metrics[RTAX_MAX];
+
atomic_t__refcnt;   /* client references*/
int __use;
+   unsigned long   lastuse;
union {
struct dst_entry *next;
struct rtable*rt_next;





Re: tbench regression in 2.6.25-rc1

2008-02-18 Thread Valdis Kletnieks
On Mon, 18 Feb 2008 16:12:38 +0800, Zhang, Yanmin said:

 I also think __refcnt is the key. I did a new testing by adding 2 unsigned 
 long
 pading before lastuse, so the 3 members are moved to next cache line. The 
 performance is
 recovered.
 
 How about below patch? Almost all performance is recovered with the new patch.
 
 Signed-off-by: Zhang Yanmin [EMAIL PROTECTED]

Could you add a comment someplace that says refcnt wants to be on a different
cache line from input/output/ops or performance tanks badly, to warn some
future kernel hacker who starts adding new fields to the structure?
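A comment is probably the realistic answer; for completeness, the warning could even be made mechanical. A hedged sketch of the idea on a mock structure (in the kernel one would reach for BUILD_BUG_ON; the struct and the 64-byte assumption below are illustrative, not <net/dst.h>):

/* Compile-time guard sketch: refuse to build if refcnt shares a 64-byte
 * line with ops.  Mock structure, illustration only. */
#include <stddef.h>
#include <stdio.h>

struct mock_dst {
	void *ops;
	int (*input)(void *);
	int (*output)(void *);
	unsigned int metrics[14];	/* arbitrary filler */
	int refcnt;
	int use;
	unsigned long lastuse;
};

#define SAME_LINE(a, b) \
	(offsetof(struct mock_dst, a) / 64 == offsetof(struct mock_dst, b) / 64)

/* Negative array size on violation -- the same trick BUILD_BUG_ON() uses. */
typedef char refcnt_must_not_share_a_line_with_ops[SAME_LINE(refcnt, ops) ? -1 : 1];

int main(void)
{
	printf("ops on line %lu, refcnt on line %lu\n",
	       (unsigned long)(offsetof(struct mock_dst, ops) / 64),
	       (unsigned long)(offsetof(struct mock_dst, refcnt) / 64));
	return 0;
}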




Re: tbench regression in 2.6.25-rc1

2008-02-18 Thread Zhang, Yanmin
On Mon, 2008-02-18 at 11:11 +0100, Eric Dumazet wrote:
 On Mon, 18 Feb 2008 16:12:38 +0800
 Zhang, Yanmin [EMAIL PROTECTED] wrote:
 
  On Fri, 2008-02-15 at 15:22 -0800, David Miller wrote:
   From: Eric Dumazet [EMAIL PROTECTED]
   Date: Fri, 15 Feb 2008 15:21:48 +0100
   
On linux-2.6.25-rc1 x86_64 :

offsetof(struct dst_entry, lastuse)=0xb0
offsetof(struct dst_entry, __refcnt)=0xb8
offsetof(struct dst_entry, __use)=0xbc
offsetof(struct dst_entry, next)=0xc0

So it should be optimal... I dont know why tbench prefers __refcnt 
being 
on 0xc0, since in this case lastuse will be on a different cache line...

Each incoming IP packet will need to change lastuse, __refcnt and 
__use, 
so keeping them in the same cache line is a win.

I suspect then that even this patch could help tbench, since it avoids 
writing lastuse...
   
   I think your suspicions are right, and even moreso
   it helps to keep __refcnt out of the same cache line
   as input/output/ops which are read-almost-entirely :-
  I think you are right. The issue is these three variables sharing the same 
  cache line
  with input/output/ops.
  
   )
   
   I haven't done an exhaustive analysis, but it seems that
   the write traffic to lastuse and __refcnt are about the
   same.  However if we find that __refcnt gets hit more
   than lastuse in this workload, it explains the regression.
  I also think __refcnt is the key. I did a new testing by adding 2 unsigned 
  long
  pading before lastuse, so the 3 members are moved to next cache line. The 
  performance is
  recovered.
  
  How about below patch? Almost all performance is recovered with the new 
  patch.
  
  Signed-off-by: Zhang Yanmin [EMAIL PROTECTED]
  
  ---
  
  --- linux-2.6.25-rc1/include/net/dst.h  2008-02-21 14:33:43.0 
  +0800
  +++ linux-2.6.25-rc1_work/include/net/dst.h 2008-02-21 14:36:22.0 
  +0800
  @@ -52,11 +52,10 @@ struct dst_entry
  unsigned short  header_len; /* more space at head required 
  */
  unsigned short  trailer_len;/* space to reserve at tail */
   
  -   u32 metrics[RTAX_MAX];
  -   struct dst_entry*path;
  -
  -   unsigned long   rate_last;  /* rate limiting for ICMP */
  unsigned intrate_tokens;
  +   unsigned long   rate_last;  /* rate limiting for ICMP */
  +
  +   struct dst_entry*path;
   
   #ifdef CONFIG_NET_CLS_ROUTE
  __u32   tclassid;
  @@ -70,10 +69,12 @@ struct dst_entry
  int (*output)(struct sk_buff*);
   
  struct  dst_ops *ops;
  -   
  -   unsigned long   lastuse;
  +
  +   u32 metrics[RTAX_MAX];
  +
  atomic_t__refcnt;   /* client references*/
  int __use;
  +   unsigned long   lastuse;
  union {
  struct dst_entry *next;
  struct rtable*rt_next;
  
  
 
 Well, after this patch, we grow dst_entry by 8 bytes :
With my .config, it doesn't grow, probably because I don't enable
CONFIG_NET_CLS_ROUTE. I will move tclassid under ops.
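A small sketch of why the placement of a conditional 4-byte field decides whether the struct grows (mock layouts; CONFIG_NET_CLS_ROUTE is used here as a plain compile-time define for illustration):

/* Illustration only: a 4-byte field compiled in conditionally either
 * drops into an existing alignment hole (no growth) or drags in 8 bytes
 * when wedged between 8-byte-aligned members. */
#include <stdio.h>

struct fills_a_hole {
	void *ops;
	unsigned int metrics[7];	/* odd number of u32s leaves a 4-byte hole */
#ifdef CONFIG_NET_CLS_ROUTE
	unsigned int tclassid;		/* lands in that hole */
#endif
	unsigned long lastuse;
};

struct adds_eight {
	void *ops;
#ifdef CONFIG_NET_CLS_ROUTE
	unsigned int tclassid;		/* 4 bytes of data + 4 bytes of padding */
#endif
	void *path;
	unsigned int metrics[7];
	unsigned long lastuse;
};

int main(void)
{
	printf("fills_a_hole: %zu bytes, adds_eight: %zu bytes\n",
	       sizeof(struct fills_a_hole), sizeof(struct adds_eight));
	return 0;
}

Compiled with and without -DCONFIG_NET_CLS_ROUTE on LP64, the first layout keeps its size while the second grows by 8, which is roughly the effect being discussed for tclassid.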

 
 sizeof(struct dst_entry)=0xd0
 offsetof(struct dst_entry, input)=0x68
 offsetof(struct dst_entry, output)=0x70
 offsetof(struct dst_entry, __refcnt)=0xb4
 offsetof(struct dst_entry, lastuse)=0xc0
 offsetof(struct dst_entry, __use)=0xb8
 sizeof(struct rtable)=0x140
 
 
 So we dirty two cache lines instead of one, unless your cpu have 128 bytes 
 cache lines ?
 
 I am quite suprised that my patch to not change lastuse if already set to 
 jiffies changes nothing...
 
 If you have some time, could you also test this (unrelated) patch ?
 
 We can avoid dirty all the time a cache line of loopback device.
 
 diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
 index f2a6e71..0a4186a 100644
 --- a/drivers/net/loopback.c
 +++ b/drivers/net/loopback.c
 @@ -150,7 +150,10 @@ static int loopback_xmit(struct sk_buff *skb, struct net_device *dev)
 		return 0;
 	}
  #endif
 -	dev->last_rx = jiffies;
 +#ifdef CONFIG_SMP
 +	if (dev->last_rx != jiffies)
 +#endif
 +		dev->last_rx = jiffies;
  
 /* it's OK to use per_cpu_ptr() because BHs are off */
 pcpu_lstats = netdev_priv(dev);
 
Although I didn't test it, I don't think it's ok. The key is __refcnt shares 
the same
cache line with ops/input/output.

-yanmin




Re: tbench regression in 2.6.25-rc1

2008-02-18 Thread Eric Dumazet

Zhang, Yanmin a écrit :
On Mon, 2008-02-18 at 12:33 -0500, [EMAIL PROTECTED] wrote: 

On Mon, 18 Feb 2008 16:12:38 +0800, Zhang, Yanmin said:


I also think __refcnt is the key. I did a new testing by adding 2 unsigned long
pading before lastuse, so the 3 members are moved to next cache line. The 
performance is
recovered.

How about below patch? Almost all performance is recovered with the new patch.

Signed-off-by: Zhang Yanmin [EMAIL PROTECTED]

Could you add a comment someplace that says refcnt wants to be on a different
cache line from input/output/ops or performance tanks badly, to warn some
future kernel hacker who starts adding new fields to the structure?

Ok. Below is the new patch.

1) Move tclassid under ops in case CONFIG_NET_CLS_ROUTE=y. So 
sizeof(dst_entry)=200
no matter if CONFIG_NET_CLS_ROUTE=y/n. I tested many patches on my 16-core 
tigerton by
moving tclassid to different place. It looks like tclassid could also have 
impact on
performance.
If moving tclassid before metrics, or just don't move tclassid, the performance 
isn't
good. So I move it behind metrics.

2) Add comments before __refcnt.

If CONFIG_NET_CLS_ROUTE=y, the result with below patch is about 18% better than
the one without the patch.

If CONFIG_NET_CLS_ROUTE=n, the result with below patch is about 30% better than
the one without the patch.

Signed-off-by: Zhang Yanmin [EMAIL PROTECTED]

---

--- linux-2.6.25-rc1/include/net/dst.h  2008-02-21 14:33:43.0 +0800
+++ linux-2.6.25-rc1_work/include/net/dst.h 2008-02-22 12:52:19.0 
+0800
@@ -52,15 +52,10 @@ struct dst_entry
unsigned short  header_len; /* more space at head required 
*/
unsigned short  trailer_len;/* space to reserve at tail */
 
-	u32			metrics[RTAX_MAX];
-   struct dst_entry*path;
-
-   unsigned long   rate_last;  /* rate limiting for ICMP */
unsigned intrate_tokens;
+   unsigned long   rate_last;  /* rate limiting for ICMP */
 
-#ifdef CONFIG_NET_CLS_ROUTE
-   __u32   tclassid;
-#endif
+   struct dst_entry*path;
 
 	struct neighbour	*neighbour;
struct hh_cache *hh;
@@ -70,10 +65,20 @@ struct dst_entry
int (*output)(struct sk_buff*);
 
 	struct  dst_ops	*ops;
-   
-   unsigned long   lastuse;
+
+   u32 metrics[RTAX_MAX];
+
+#ifdef CONFIG_NET_CLS_ROUTE
+   __u32   tclassid;
+#endif
+
+   /*
+* __refcnt wants to be on a different cache line from
+* input/output/ops or performance tanks badly
+*/
atomic_t__refcnt;   /* client references*/
int __use;
+   unsigned long   lastuse;
union {
struct dst_entry *next;
struct rtable*rt_next;





I prefer this patch, but unfortunately your perf numbers are for 64-bit kernels.

Could you please test now with a 32-bit one?

Thank you
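For reference, the 32-bit run matters because pointer and long widths halve, so every offset and the total size shift; a quick way to compare is to build a layout probe with -m64 and then -m32 (mock structure again, just to show the mechanics, not the real dst_entry):

/* Layout probe: build with -m64 and then -m32 and compare the output. */
#include <stddef.h>
#include <stdio.h>

struct mock_dst {
	void *ops;
	int (*input)(void *);
	int (*output)(void *);
	unsigned int metrics[14];
	int refcnt;
	int use;
	unsigned long lastuse;
};

int main(void)
{
	printf("sizeof(void *)=%zu sizeof(long)=%zu sizeof(struct mock_dst)=%zu\n",
	       sizeof(void *), sizeof(long), sizeof(struct mock_dst));
	printf("refcnt at %zu (line %zu), lastuse at %zu (line %zu)\n",
	       offsetof(struct mock_dst, refcnt),
	       offsetof(struct mock_dst, refcnt) / 64,
	       offsetof(struct mock_dst, lastuse),
	       offsetof(struct mock_dst, lastuse) / 64);
	return 0;
}

The same source can therefore be cache-friendly on one ABI and split a hot line on the other, which is presumably why the 32-bit numbers are being asked for.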


Re: tbench regression in 2.6.25-rc1

2008-02-18 Thread Eric Dumazet

Zhang, Yanmin a écrit :

On Mon, 2008-02-18 at 11:11 +0100, Eric Dumazet wrote:

On Mon, 18 Feb 2008 16:12:38 +0800
Zhang, Yanmin [EMAIL PROTECTED] wrote:


On Fri, 2008-02-15 at 15:22 -0800, David Miller wrote:

From: Eric Dumazet [EMAIL PROTECTED]
Date: Fri, 15 Feb 2008 15:21:48 +0100


On linux-2.6.25-rc1 x86_64 :

offsetof(struct dst_entry, lastuse)=0xb0
offsetof(struct dst_entry, __refcnt)=0xb8
offsetof(struct dst_entry, __use)=0xbc
offsetof(struct dst_entry, next)=0xc0

So it should be optimal... I dont know why tbench prefers __refcnt being 
on 0xc0, since in this case lastuse will be on a different cache line...


Each incoming IP packet will need to change lastuse, __refcnt and __use, 
so keeping them in the same cache line is a win.


I suspect then that even this patch could help tbench, since it avoids 
writing lastuse...

I think your suspicions are right, and even moreso
it helps to keep __refcnt out of the same cache line
as input/output/ops which are read-almost-entirely :-

I think you are right. The issue is these three variables sharing the same 
cache line
with input/output/ops.


)

I haven't done an exhaustive analysis, but it seems that
the write traffic to lastuse and __refcnt are about the
same.  However if we find that __refcnt gets hit more
than lastuse in this workload, it explains the regression.

I also think __refcnt is the key. I did a new testing by adding 2 unsigned long
pading before lastuse, so the 3 members are moved to next cache line. The 
performance is
recovered.

How about below patch? Almost all performance is recovered with the new patch.

Signed-off-by: Zhang Yanmin [EMAIL PROTECTED]

---

--- linux-2.6.25-rc1/include/net/dst.h  2008-02-21 14:33:43.0 +0800
+++ linux-2.6.25-rc1_work/include/net/dst.h 2008-02-21 14:36:22.0 
+0800
@@ -52,11 +52,10 @@ struct dst_entry
unsigned short  header_len; /* more space at head required 
*/
unsigned short  trailer_len;/* space to reserve at tail */
 
-	u32			metrics[RTAX_MAX];

-   struct dst_entry*path;
-
-   unsigned long   rate_last;  /* rate limiting for ICMP */
unsigned intrate_tokens;
+   unsigned long   rate_last;  /* rate limiting for ICMP */
+
+   struct dst_entry*path;
 
 #ifdef CONFIG_NET_CLS_ROUTE

__u32   tclassid;
@@ -70,10 +69,12 @@ struct dst_entry
int (*output)(struct sk_buff*);
 
 	struct  dst_ops	*ops;

-   
-   unsigned long   lastuse;
+
+   u32 metrics[RTAX_MAX];
+
atomic_t__refcnt;   /* client references*/
int __use;
+   unsigned long   lastuse;
union {
struct dst_entry *next;
struct rtable*rt_next;



Well, after this patch, we grow dst_entry by 8 bytes :

With my .config, it doesn't grow. Perhaps because of CONFIG_NET_CLS_ROUTE, I 
don't
enable it. I will move tclassid under ops.


sizeof(struct dst_entry)=0xd0
offsetof(struct dst_entry, input)=0x68
offsetof(struct dst_entry, output)=0x70
offsetof(struct dst_entry, __refcnt)=0xb4
offsetof(struct dst_entry, lastuse)=0xc0
offsetof(struct dst_entry, __use)=0xb8
sizeof(struct rtable)=0x140


So we dirty two cache lines instead of one, unless your cpu have 128 bytes 
cache lines ?

I am quite suprised that my patch to not change lastuse if already set to 
jiffies changes nothing...

If you have some time, could you also test this (unrelated) patch ?

We can avoid dirty all the time a cache line of loopback device.

diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index f2a6e71..0a4186a 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -150,7 +150,10 @@ static int loopback_xmit(struct sk_buff *skb, struct net_device *dev)
 		return 0;
 	}
 #endif
-	dev->last_rx = jiffies;
+#ifdef CONFIG_SMP
+	if (dev->last_rx != jiffies)
+#endif
+		dev->last_rx = jiffies;
 
/* it's OK to use per_cpu_ptr() because BHs are off */

pcpu_lstats = netdev_priv(dev);


Although I didn't test it, I don't think it's ok. The key is __refcnt shares 
the same
cache line with ops/input/output.



Note it was unrelated to struct dst; it is about dirtying one cache line of
the 'loopback' net device.

I tested it, and the tbench result was better with this patch: 890 MB/s
instead of 870 MB/s on a machine with two dual-core CPUs.

I was curious about the potential gain on your 16-core (4x4) machine.
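The loopback driver already keeps its byte/packet counters per-CPU (the pcpu_lstats visible in the context above); last_rx looks like the remaining field every CPU still writes on this path. A hedged userspace analogue of that difference, a shared counter versus padded per-thread slots (pthreads and C11 atomics, nothing kernel-specific; build with cc -pthread):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS 4
#define NITERS   (1 << 22)

static atomic_ulong shared;			/* one line everybody dirties */

struct slot {
	unsigned long n;
	char pad[64 - sizeof(unsigned long)];	/* keep each slot on its own line */
};
static struct slot per_thread[NTHREADS];

static void *hit_shared(void *arg)
{
	(void)arg;
	for (int i = 0; i < NITERS; i++)
		atomic_fetch_add(&shared, 1);	/* line ping-pongs between cores */
	return NULL;
}

static void *hit_local(void *arg)
{
	struct slot *s = arg;

	for (int i = 0; i < NITERS; i++)
		s->n++;				/* stays in this core's cache */
	return NULL;
}

int main(void)
{
	pthread_t t[NTHREADS];
	unsigned long sum = 0;
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, hit_shared, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, hit_local, &per_thread[i]);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	for (i = 0; i < NTHREADS; i++)
		sum += per_thread[i].n;

	printf("shared=%lu per-thread sum=%lu\n",
	       (unsigned long)atomic_load(&shared), sum);
	return 0;
}

Timing the two phases (clock_gettime() around each loop, or time(1) around separate builds) typically shows the shared counter scaling much worse as the thread count grows; the last_rx write is a milder version of the same effect.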


Re: tbench regression in 2.6.25-rc1

2008-02-18 Thread Zhang, Yanmin
On Mon, 2008-02-18 at 12:33 -0500, [EMAIL PROTECTED] wrote: 
 On Mon, 18 Feb 2008 16:12:38 +0800, Zhang, Yanmin said:
 
  I also think __refcnt is the key. I did a new testing by adding 2 unsigned 
  long
  pading before lastuse, so the 3 members are moved to next cache line. The 
  performance is
  recovered.
  
  How about below patch? Almost all performance is recovered with the new 
  patch.
  
  Signed-off-by: Zhang Yanmin [EMAIL PROTECTED]
 
 Could you add a comment someplace that says refcnt wants to be on a different
 cache line from input/output/ops or performance tanks badly, to warn some
 future kernel hacker who starts adding new fields to the structure?
Ok. Below is the new patch.

1) Move tclassid under ops in case CONFIG_NET_CLS_ROUTE=y, so sizeof(dst_entry)=200
no matter whether CONFIG_NET_CLS_ROUTE is y or n. I tested many patches on my
16-core tigerton by moving tclassid to different places. It looks like tclassid
can also have an impact on performance. If tclassid is moved before metrics, or
not moved at all, the performance isn't good, so I moved it behind metrics.

2) Add comments before __refcnt.

If CONFIG_NET_CLS_ROUTE=y, the result with below patch is about 18% better than
the one without the patch.

If CONFIG_NET_CLS_ROUTE=n, the result with below patch is about 30% better than
the one without the patch.

Signed-off-by: Zhang Yanmin [EMAIL PROTECTED]

---

--- linux-2.6.25-rc1/include/net/dst.h  2008-02-21 14:33:43.0 +0800
+++ linux-2.6.25-rc1_work/include/net/dst.h 2008-02-22 12:52:19.0 
+0800
@@ -52,15 +52,10 @@ struct dst_entry
unsigned short  header_len; /* more space at head required 
*/
unsigned short  trailer_len;/* space to reserve at tail */
 
-   u32 metrics[RTAX_MAX];
-   struct dst_entry*path;
-
-   unsigned long   rate_last;  /* rate limiting for ICMP */
unsigned intrate_tokens;
+   unsigned long   rate_last;  /* rate limiting for ICMP */
 
-#ifdef CONFIG_NET_CLS_ROUTE
-   __u32   tclassid;
-#endif
+   struct dst_entry*path;
 
struct neighbour*neighbour;
struct hh_cache *hh;
@@ -70,10 +65,20 @@ struct dst_entry
int (*output)(struct sk_buff*);
 
struct  dst_ops *ops;
-   
-   unsigned long   lastuse;
+
+   u32 metrics[RTAX_MAX];
+
+#ifdef CONFIG_NET_CLS_ROUTE
+   __u32   tclassid;
+#endif
+
+   /*
+* __refcnt wants to be on a different cache line from
+* input/output/ops or performance tanks badly
+*/
atomic_t__refcnt;   /* client references*/
int __use;
+   unsigned long   lastuse;
union {
struct dst_entry *next;
struct rtable*rt_next;




Re: tbench regression in 2.6.25-rc1

2008-02-17 Thread Zhang, Yanmin
On Fri, 2008-02-15 at 15:21 +0100, Eric Dumazet wrote:
> Zhang, Yanmin a écrit :
> > On Fri, 2008-02-15 at 07:05 +0100, Eric Dumazet wrote:
> >   
> >> Zhang, Yanmin a écrit :
> >> 
> >>> Comparing with kernel 2.6.24, tbench result has regression with
> >>> 2.6.25-rc1.
> >>>
> >>> 1) On 2 quad-core processor stoakley: 4%.
> >>> 2) On 4 quad-core processor tigerton: more than 30%.
> >>>
> >>> bisect located below patch.
> >>>
> >>> b4ce92775c2e7ff9cf79cca4e0a19c8c5fd6287b is first bad commit
> >>> commit b4ce92775c2e7ff9cf79cca4e0a19c8c5fd6287b
> >>> Author: Herbert Xu <[EMAIL PROTECTED]>
> >>> Date:   Tue Nov 13 21:33:32 2007 -0800
> >>>
> >>> [IPV6]: Move nfheader_len into rt6_info
> >>> 
> >>> The dst member nfheader_len is only used by IPv6.  It's also currently
> >>> creating a rather ugly alignment hole in struct dst.  Therefore this 
> >>> patch
> >>> moves it from there into struct rt6_info.
> >>>
> >>>
> >>> As tbench uses ipv4, so the patch's real impact on ipv4 is it deletes
> >>> nfheader_len in dst_entry. It might change cache line alignment.
> >>>
> >>> To verify my finding, I just added nfheader_len back to dst_entry in 
> >>> 2.6.25-rc1
> >>> and reran tbench on the 2 machines. Performance could be recovered 
> >>> completely.
> >>>
> >>> I started cpu_number*2 tbench processes. On my 16-core tigerton:
> >>> #./tbench_srv &
> >>> #./tbench 32 127.0.0.1
> >>>
> >>> -yanmin
> >>>   
> >> Yup. struct dst is sensitive to alignements, especially for benches.
> >>
> >> In the real world, we need to make sure that next pointer start at a cache 
> >> line bondary (or a litle bit after), so that RT cache lookups use one 
> >> cache 
> >> line per entry instead of two. This permits better behavior in DDOS 
> >> attacks.
> >>
> >> (check commit 1e19e02ca0c5e33ea73a25127dbe6c3b8fcaac4b for reference)
> >>
> >> Are you using a 64 or a 32 bit kernel ?
> >> 
> > 64bit x86-64 machine. On another 4-way Madison Itanium machine, tbench has 
> > the
> > similiar regression.
> >
> >   
> 
> On linux-2.6.25-rc1 x86_64 :
> 
> offsetof(struct dst_entry, lastuse)=0xb0
> offsetof(struct dst_entry, __refcnt)=0xb8
> offsetof(struct dst_entry, __use)=0xbc
> offsetof(struct dst_entry, next)=0xc0
> 
> So it should be optimal... I dont know why tbench prefers __refcnt being 
> on 0xc0, since in this case lastuse will be on a different cache line...
> 
> Each incoming IP packet will need to change lastuse, __refcnt and __use, 
> so keeping them in the same cache line is a win.
> 
> I suspect then that even this patch could help tbench, since it avoids 
> writing lastuse...
> 
> diff --git a/include/net/dst.h b/include/net/dst.h
> index e3ac7d0..24d3c4e 100644
> --- a/include/net/dst.h
> +++ b/include/net/dst.h
> @@ -147,7 +147,8 @@ static inline void dst_use(struct dst_entry *dst, unsigned long time)
>  {
> dst_hold(dst);
> dst->__use++;
> -   dst->lastuse = time;
> +   if (time != dst->lastuse)
> +   dst->lastuse = time;
>  }
I did a quick test and this patch doesn't help tbench.

Thanks,
-yanmin



Re: tbench regression in 2.6.25-rc1

2008-02-15 Thread David Miller
From: Eric Dumazet <[EMAIL PROTECTED]>
Date: Fri, 15 Feb 2008 15:21:48 +0100

> On linux-2.6.25-rc1 x86_64 :
> 
> offsetof(struct dst_entry, lastuse)=0xb0
> offsetof(struct dst_entry, __refcnt)=0xb8
> offsetof(struct dst_entry, __use)=0xbc
> offsetof(struct dst_entry, next)=0xc0
> 
> So it should be optimal... I dont know why tbench prefers __refcnt being 
> on 0xc0, since in this case lastuse will be on a different cache line...
> 
> Each incoming IP packet will need to change lastuse, __refcnt and __use, 
> so keeping them in the same cache line is a win.
> 
> I suspect then that even this patch could help tbench, since it avoids 
> writing lastuse...

I think your suspicions are right, and even moreso
it helps to keep __refcnt out of the same cache line
as input/output/ops which are read-almost-entirely :-)

I haven't done an exhaustive analysis, but it seems that
the write traffic to lastuse and __refcnt are about the
same.  However if we find that __refcnt gets hit more
than lastuse in this workload, it explains the regression.




Re: tbench regression in 2.6.25-rc1

2008-02-15 Thread Eric Dumazet

Zhang, Yanmin a écrit :

On Fri, 2008-02-15 at 07:05 +0100, Eric Dumazet wrote:
  

Zhang, Yanmin a écrit :


Comparing with kernel 2.6.24, tbench result has regression with
2.6.25-rc1.

1) On 2 quad-core processor stoakley: 4%.
2) On 4 quad-core processor tigerton: more than 30%.

bisect located below patch.

b4ce92775c2e7ff9cf79cca4e0a19c8c5fd6287b is first bad commit
commit b4ce92775c2e7ff9cf79cca4e0a19c8c5fd6287b
Author: Herbert Xu <[EMAIL PROTECTED]>
Date:   Tue Nov 13 21:33:32 2007 -0800

[IPV6]: Move nfheader_len into rt6_info

The dst member nfheader_len is only used by IPv6.  It's also currently

creating a rather ugly alignment hole in struct dst.  Therefore this patch
moves it from there into struct rt6_info.


As tbench uses ipv4, so the patch's real impact on ipv4 is it deletes
nfheader_len in dst_entry. It might change cache line alignment.

To verify my finding, I just added nfheader_len back to dst_entry in 2.6.25-rc1
and reran tbench on the 2 machines. Performance could be recovered completely.

I started cpu_number*2 tbench processes. On my 16-core tigerton:
#./tbench_srv &
#./tbench 32 127.0.0.1

-yanmin
  

Yup. struct dst is sensitive to alignements, especially for benches.

In the real world, we need to make sure that next pointer start at a cache 
line bondary (or a litle bit after), so that RT cache lookups use one cache 
line per entry instead of two. This permits better behavior in DDOS attacks.


(check commit 1e19e02ca0c5e33ea73a25127dbe6c3b8fcaac4b for reference)

Are you using a 64 or a 32 bit kernel ?


64bit x86-64 machine. On another 4-way Madison Itanium machine, tbench has the
similiar regression.

  


On linux-2.6.25-rc1 x86_64 :

offsetof(struct dst_entry, lastuse)=0xb0
offsetof(struct dst_entry, __refcnt)=0xb8
offsetof(struct dst_entry, __use)=0xbc
offsetof(struct dst_entry, next)=0xc0

So it should be optimal... I dont know why tbench prefers __refcnt being 
on 0xc0, since in this case lastuse will be on a different cache line...


Each incoming IP packet will need to change lastuse, __refcnt and __use, 
so keeping them in the same cache line is a win.


I suspect then that even this patch could help tbench, since it avoids 
writing lastuse...


diff --git a/include/net/dst.h b/include/net/dst.h
index e3ac7d0..24d3c4e 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -147,7 +147,8 @@ static inline void dst_use(struct dst_entry *dst, unsigned long time)
 {
 	dst_hold(dst);
 	dst->__use++;
-	dst->lastuse = time;
+	if (time != dst->lastuse)
+		dst->lastuse = time;
 }







Re: tbench regression in 2.6.25-rc1

2008-02-14 Thread Zhang, Yanmin
On Fri, 2008-02-15 at 07:05 +0100, Eric Dumazet wrote:
> Zhang, Yanmin a écrit :
> > Comparing with kernel 2.6.24, tbench result has regression with
> > 2.6.25-rc1.
> > 
> > 1) On 2 quad-core processor stoakley: 4%.
> > 2) On 4 quad-core processor tigerton: more than 30%.
> > 
> > bisect located below patch.
> > 
> > b4ce92775c2e7ff9cf79cca4e0a19c8c5fd6287b is first bad commit
> > commit b4ce92775c2e7ff9cf79cca4e0a19c8c5fd6287b
> > Author: Herbert Xu <[EMAIL PROTECTED]>
> > Date:   Tue Nov 13 21:33:32 2007 -0800
> > 
> > [IPV6]: Move nfheader_len into rt6_info
> > 
> > The dst member nfheader_len is only used by IPv6.  It's also currently
> > creating a rather ugly alignment hole in struct dst.  Therefore this 
> > patch
> > moves it from there into struct rt6_info.
> > 
> > 
> > As tbench uses ipv4, so the patch's real impact on ipv4 is it deletes
> > nfheader_len in dst_entry. It might change cache line alignment.
> > 
> > To verify my finding, I just added nfheader_len back to dst_entry in 
> > 2.6.25-rc1
> > and reran tbench on the 2 machines. Performance could be recovered 
> > completely.
> > 
> > I started cpu_number*2 tbench processes. On my 16-core tigerton:
> > #./tbench_srv &
> > #./tbench 32 127.0.0.1
> > 
> > -yanmin
> 
> Yup. struct dst is sensitive to alignements, especially for benches.
> 
> In the real world, we need to make sure that next pointer start at a cache 
> line bondary (or a litle bit after), so that RT cache lookups use one cache 
> line per entry instead of two. This permits better behavior in DDOS attacks.
> 
> (check commit 1e19e02ca0c5e33ea73a25127dbe6c3b8fcaac4b for reference)
> 
> Are you using a 64 or a 32 bit kernel ?
It is a 64-bit x86-64 machine. On another 4-way Madison Itanium machine, tbench
has a similar regression.

-yanmin




Re: tbench regression in 2.6.25-rc1

2008-02-14 Thread Eric Dumazet

Zhang, Yanmin a écrit :

Comparing with kernel 2.6.24, tbench result has regression with
2.6.25-rc1.

1) On 2 quad-core processor stoakley: 4%.
2) On 4 quad-core processor tigerton: more than 30%.

bisect located below patch.

b4ce92775c2e7ff9cf79cca4e0a19c8c5fd6287b is first bad commit
commit b4ce92775c2e7ff9cf79cca4e0a19c8c5fd6287b
Author: Herbert Xu <[EMAIL PROTECTED]>
Date:   Tue Nov 13 21:33:32 2007 -0800

[IPV6]: Move nfheader_len into rt6_info

The dst member nfheader_len is only used by IPv6.  It's also currently

creating a rather ugly alignment hole in struct dst.  Therefore this patch
moves it from there into struct rt6_info.


As tbench uses ipv4, so the patch's real impact on ipv4 is it deletes
nfheader_len in dst_entry. It might change cache line alignment.

To verify my finding, I just added nfheader_len back to dst_entry in 2.6.25-rc1
and reran tbench on the 2 machines. Performance could be recovered completely.

I started cpu_number*2 tbench processes. On my 16-core tigerton:
#./tbench_srv &
#./tbench 32 127.0.0.1

-yanmin


Yup. struct dst is sensitive to alignments, especially for benchmarks.

In the real world, we need to make sure that the next pointer starts at a cache
line boundary (or a little bit after), so that RT cache lookups use one cache
line per entry instead of two. This permits better behaviour under DDoS attacks.


(check commit 1e19e02ca0c5e33ea73a25127dbe6c3b8fcaac4b for reference)
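A sketch of that layout rule on a mock entry (not the kernel's struct rtable): if the chain pointer and the lookup keys sit in the first 64 bytes, a bucket walk that keeps missing touches one cache line per entry.

/* Mock route-cache entry, illustration only.  The walk below reads
 * next/key_src/key_dst, all inside the first 64 bytes of each entry. */
#include <stddef.h>
#include <stdio.h>

struct mock_rt {
	struct mock_rt *next;		/* followed on every lookup */
	unsigned int key_src;		/* compared on every lookup */
	unsigned int key_dst;
	/* everything below is only touched on a hit */
	unsigned long lastuse;
	unsigned int metrics[14];
	char rest[64];
};

static struct mock_rt *lookup(struct mock_rt *head,
			      unsigned int src, unsigned int dst)
{
	struct mock_rt *rt;

	for (rt = head; rt; rt = rt->next)
		if (rt->key_src == src && rt->key_dst == dst)
			return rt;
	return NULL;
}

int main(void)
{
	struct mock_rt a = { .key_src = 1, .key_dst = 2 };
	struct mock_rt b = { .key_src = 3, .key_dst = 4, .next = &a };

	printf("walk fields end at byte %zu of each entry\n",
	       offsetof(struct mock_rt, key_dst) + sizeof(unsigned int));
	printf("lookup(1,2): %s\n", lookup(&b, 1, 2) ? "hit" : "miss");
	return 0;
}

If next sat past the first line, every entry visited during a miss would cost two cache misses instead of one, which is the DDoS concern mentioned above.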

Are you using a 64 or a 32 bit kernel ?




tbench regression in 2.6.25-rc1

2008-02-14 Thread Zhang, Yanmin
Comparing with kernel 2.6.24, tbench result has regression with
2.6.25-rc1.

1) On 2 quad-core processor stoakley: 4%.
2) On 4 quad-core processor tigerton: more than 30%.

bisect located below patch.

b4ce92775c2e7ff9cf79cca4e0a19c8c5fd6287b is first bad commit
commit b4ce92775c2e7ff9cf79cca4e0a19c8c5fd6287b
Author: Herbert Xu <[EMAIL PROTECTED]>
Date:   Tue Nov 13 21:33:32 2007 -0800

[IPV6]: Move nfheader_len into rt6_info

The dst member nfheader_len is only used by IPv6.  It's also currently
creating a rather ugly alignment hole in struct dst.  Therefore this patch
moves it from there into struct rt6_info.


As tbench uses ipv4, the patch's real impact on ipv4 is that it deletes
nfheader_len from dst_entry. That might change the cache line alignment of the
members that follow.

To verify my finding, I added nfheader_len back to dst_entry in 2.6.25-rc1 and
reran tbench on the two machines. The performance was recovered completely.
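A minimal sketch of that mechanism (mock struct; nfheader_len is the real field name from the bisected commit, but the surrounding field sizes are invented to make the boundary effect visible): removing one 4-byte member shifts everything after it, which can pull a write-hot member back onto a read-mostly line.

/* Illustration only: deleting a 4-byte field slides later members down,
 * here moving refcnt from its own 64-byte line back onto line 0. */
#include <stddef.h>
#include <stdio.h>

struct with_nfheader_len {
	void *ops;			/* read-mostly */
	int (*input)(void *);
	int (*output)(void *);
	unsigned int metrics[9];	/* sized so nfheader_len ends exactly at byte 64 */
	unsigned int nfheader_len;	/* the member the commit removed */
	int refcnt;			/* starts at 64: its own line */
};

struct without_nfheader_len {
	void *ops;
	int (*input)(void *);
	int (*output)(void *);
	unsigned int metrics[9];
	int refcnt;			/* now at 60: shares line 0 with ops */
};

int main(void)
{
	printf("with   : refcnt at %zu (line %zu)\n",
	       offsetof(struct with_nfheader_len, refcnt),
	       offsetof(struct with_nfheader_len, refcnt) / 64);
	printf("without: refcnt at %zu (line %zu)\n",
	       offsetof(struct without_nfheader_len, refcnt),
	       offsetof(struct without_nfheader_len, refcnt) / 64);
	return 0;
}

Which direction the shift hurts depends on the original layout; adding nfheader_len back, as done above, simply restores the old offsets.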

I started cpu_number*2 tbench processes. On my 16-core tigerton:
#./tbench_srv &
#./tbench 32 127.0.0.1

-yanmin

