Re: [RFC V2] test_bit before clear files_struct bits

2015-02-10 Thread Linus Torvalds
On Tue, Feb 10, 2015 at 2:46 PM, Kirill A. Shutemov
 wrote:
>
> But I still fail to understand why my micro-benchmark is faster with a
> branch before the store compared to a plain store.

Very tight artificial loops like that tend to be horrible for
performance analysis on modern cores, because you end up seeing mostly
random microarchitectural details rather than any real performance.

At a guess, since you write just one word per cacheline, what happens
is that the store buffer continually fills up faster than the stores
get drained to cache. So then the stores start stalling. The extra load
- that you expect to slow things down - likely ends up effectively just
prefetching the hot L2 cacheline into L1 so that the store buffer then
drains more cleanly.

And both the load and the branch are effectively free, because the
branch predicts perfectly, and the load just prefetches a cacheline
that will have to be fetched for the subsequent store buffer drain
anyway.
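
To make this concrete, here's a minimal sketch of the two loop shapes
being compared - purely illustrative, assuming 64-byte cachelines and a
buffer pre-filled with non-zero words; this is not Kirill's actual
benchmark:

#include <stddef.h>

#define WORDS_PER_LINE (64 / sizeof(long))	/* one long per cacheline */

/* Plain store: unconditionally dirties one word per cacheline. */
static void store_plain(long *buf, size_t nwords)
{
	size_t i;

	for (i = 0; i < nwords; i += WORDS_PER_LINE)
		buf[i] = 0;
}

/*
 * Branch before store: with the buffer pre-filled non-zero the store
 * still happens every time, so both loops do the same writes - but the
 * extra load can act as a prefetch that helps the store buffer drain.
 */
static void store_branchy(long *buf, size_t nwords)
{
	size_t i;

	for (i = 0; i < nwords; i += WORDS_PER_LINE)
		if (buf[i] != 0)
			buf[i] = 0;
}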

And as you say, there are no cacheline bouncing issues, and the working
set presumably fits in the caches - even if it doesn't fit in the L1.

But that's all just a wild guess.

It could equally easily be some very specific microarchitectural store
buffer stall due to the simpler loop hitting just the right cycle
count between stores.  There are all kinds of random odd small
corner-cases that are generally very very rare and hidden in the
noise, but then a loop with just the right strides can happen to just
hit them. It used to be *trivial* to hit things like address
generation stalls, and even though modern Intel CPUs tend to be quite
robust performance-wise, it's not true that they always handle any
code sequence "perfectly".

Linus


Re: [RFC V2] test_bit before clear files_struct bits

2015-02-10 Thread Kirill A. Shutemov
On Tue, Feb 10, 2015 at 12:49:46PM -0800, Linus Torvalds wrote:
> On Tue, Feb 10, 2015 at 12:22 PM, Andrew Morton
>  wrote:
> >
> > The patch is good but I'm still wondering if any CPUs can do this
> > speedup for us.  The CPU has to pull in the target word to modify the
> > bit and what it *could* do is to avoid dirtying the cacheline if it
> > sees that the bit is already in the desired state.
> 
> Sadly, no CPU I know of actually does this.  Probably because it would
> take a bit more core resources, and conditional writes to memory are
> not normally part of an x86 core (it might be more natural for
> something like old-style ARM that has conditional writes).
> 
> Also, even if the store were to be conditional, the cacheline would
> have been acquired in exclusive state, and in many cache protocols the
> state machine is from exclusive to dirty (since normally the only
> reason to get a cacheline for exclusive use is in order to write to
> it). So a "read, test, conditional write" ends up actually being more
> complicated than you'd think - because you *want* that
> exclusive->dirty state for the case where you really are going to
> change the bit, and to avoid extra cache protocol stages you don't
> generally want to read the cacheline into a shared read mode first
> (only to then have to turn it into exclusive/dirty as a second state).

That all sounds reasonable.

But I still fail to understand why my micro-benchmark is faster with a
branch before the store compared to a plain store.

http://article.gmane.org/gmane.linux.kernel.cross-arch/26254

In this case we would not have an intermediate shared state, because we
don't have anybody else to share the cache line with. So with the branch
we would have the same E->M transition and write-back to memory as
without the branch. But that doesn't explain why the branch makes the
code faster.

Any ideas?

-- 
 Kirill A. Shutemov


Re: [RFC V2] test_bit before clear files_struct bits

2015-02-10 Thread Linus Torvalds
On Tue, Feb 10, 2015 at 12:22 PM, Andrew Morton
 wrote:
>
> The patch is good but I'm still wondering if any CPUs can do this
> speedup for us.  The CPU has to pull in the target word to modify the
> bit and what it *could* do is to avoid dirtying the cacheline if it
> sees that the bit is already in the desired state.

Sadly, no CPU I know of actually does this.  Probably because it would
take a bit more core resources, and conditional writes to memory are
not normally part of an x86 core (it might be more natural for
something like old-style ARM that has conditional writes).

Also, even if the store were to be conditional, the cacheline would
have been acquired in exclusive state, and in many cache protocols the
state machine is from exclusive to dirty (since normally the only
reason to get a cacheline for exclusive use is in order to write to
it). So a "read, test, conditional write" ends up actually being more
complicated than you'd think - because you *want* that
exclusive->dirty state for the case where you really are going to
change the bit, and to avoid extra cache protocol stages you don't
generally want to read the cacheline into a shared read mode first
(only to then have to turn it into exclusive/dirty as a second state).

So at least on current x86 (and for reasons above, likely in the
future, including other architectures with read-modify-write memory
access models), the default assumption is that the bit operations will
actually change the bit, and unlikely bit setting/clearing for
cachelines that are very likely to otherwise stay clean should
probably be conditionalized in software. Like in this patch.
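
To illustrate with the pattern from the patch itself - a sketch that just
annotates the cache behaviour described above onto the patched helper:

static inline void __clear_close_on_exec(int fd, struct fdtable *fdt)
{
	/*
	 * test_bit() is a plain load: if the bit is already clear (the
	 * common case at this call site), the cacheline stays clean and
	 * never needs to be acquired for exclusive use.
	 */
	if (test_bit(fd, fdt->close_on_exec))
		/* Only here is the line taken exclusive and dirtied. */
		__clear_bit(fd, fdt->close_on_exec);
}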

 Linus


Re: [RFC V2] test_bit before clear files_struct bits

2015-02-10 Thread Andrew Morton

(cc Linus for CPU-fu)

On Tue, 10 Feb 2015 15:11:37 +0800 "Wang, Yalin"  
wrote:

> Add test_bit() before clearing the close_on_exec and open_fds bits:
> tracing __clear_bit() shows that at these two call sites the bit is
> already clear most of the time, so testing first lets us skip the
> clear and win in the common case.
> 
> ...
>
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -209,7 +209,8 @@ static inline void __set_close_on_exec(int fd, struct fdtable *fdt)
>  
>  static inline void __clear_close_on_exec(int fd, struct fdtable *fdt)
>  {
> - __clear_bit(fd, fdt->close_on_exec);
> + if (test_bit(fd, fdt->close_on_exec))
> + __clear_bit(fd, fdt->close_on_exec);
>  }
>  
>  static inline void __set_open_fd(int fd, struct fdtable *fdt)
> @@ -309,7 +310,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
>   struct file *f = *old_fds++;
>   if (f) {
>   get_file(f);
> - } else {
> + } else if (test_bit(open_files - i, new_fdt->open_fds)) {
>   /*
>* The fd may be claimed in the fd bitmap but not yet
>* instantiated in the files array if a sibling thread

The patch is good but I'm still wondering if any CPUs can do this
speedup for us.  The CPU has to pull in the target word to modify the
bit and what it *could* do is to avoid dirtying the cacheline if it
sees that the bit is already in the desired state.

However, some elapsed-time testing I did on a couple of Intel
machines indicates that these CPUs don't perform that optimisation. 
Perhaps there's some reason why they don't, dunno.


Still, I think we should encapsulate the above (common) pattern into
helper functions in include/linux/bitops.h (sketched below) because

- it's cleaner

- it's self-documenting

- it permits us to eliminate the if(test_bit) on any CPU which does
  perform the optimisation internally, if such exists.
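
Something like this, perhaps (a sketch only - the names are
hypothetical, not an existing bitops.h API):

/*
 * Test first, and only dirty the cacheline when the bit actually
 * needs to change.  An architecture whose CPUs elide redundant bit
 * writes internally could redefine these as the plain operations.
 */
static inline void __clear_bit_if_set(int nr, volatile unsigned long *addr)
{
	if (test_bit(nr, addr))
		__clear_bit(nr, addr);
}

static inline void __set_bit_if_clear(int nr, volatile unsigned long *addr)
{
	if (!test_bit(nr, addr))
		__set_bit(nr, addr);
}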


You actually have measurement results for these (and other)
set-bit-on-already-set-bit call sites.  Please include all of that info
in the changelog.

[RFC V2] test_bit before clear files_struct bits

2015-02-09 Thread Wang, Yalin
Add test_bit() before clearing the close_on_exec and open_fds bits:
tracing __clear_bit() shows that at these two call sites the bit is
already clear most of the time, so testing first lets us skip the
clear and win in the common case.

Signed-off-by: Yalin Wang 
---
 fs/file.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index ee738ea..b0e059c 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -209,7 +209,8 @@ static inline void __set_close_on_exec(int fd, struct fdtable *fdt)
 
 static inline void __clear_close_on_exec(int fd, struct fdtable *fdt)
 {
-   __clear_bit(fd, fdt->close_on_exec);
+   if (test_bit(fd, fdt->close_on_exec))
+   __clear_bit(fd, fdt->close_on_exec);
 }
 
 static inline void __set_open_fd(int fd, struct fdtable *fdt)
@@ -309,7 +310,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
struct file *f = *old_fds++;
if (f) {
get_file(f);
-   } else {
+   } else if (test_bit(open_files - i, new_fdt->open_fds)) {
/*
 * The fd may be claimed in the fd bitmap but not yet
 * instantiated in the files array if a sibling thread
-- 
2.2.2