Re: [RFC V2] test_bit before clear files_struct bits
On Tue, Feb 10, 2015 at 2:46 PM, Kirill A. Shutemov wrote:
>
> But I still fail to understand why my micro-benchmark is faster with
> branch before store comparing to plain store.

Very tight artificial loops like that tend to be horrible for performance analysis on modern cores, because you end up seeing mostly random microarchitectural details rather than any real performance.

At a guess, since you write just one word per cacheline, what happens is that the store buffer continually fills up faster than the stores get drained to cache. So then the stores start stalling.

The extra load - that you expect to slow things down - likely ends up effectively just prefetching the hot L2 cacheline into L1, so that the store buffer then drains more cleanly. And both the load and the branch are effectively free, because the branch predicts perfectly, and the load just prefetches a cacheline that will have to be fetched for the subsequent store buffer drain anyway.

And as you say, there are no cacheline bouncing issues, and the working set presumably fits in the caches - even if it doesn't fit in the L1.

But that's all just a wild guess. It could equally easily be some very specific microarchitectural store buffer stall due to the simpler loop hitting just the right cycle count between stores. There are all kinds of random odd small corner cases that are generally very rare and hidden in the noise, but then a loop with just the right strides can happen to hit them. It used to be *trivial* to hit things like address generation stalls, and even though modern Intel CPUs tend to be quite robust performance-wise, it's not true that they always handle any code sequence "perfectly".

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
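The benchmark itself isn't reproduced here, but the shape Linus is describing - one word written per cacheline, with and without a guarding load-and-branch - can be sketched roughly as follows. This is an illustrative reconstruction, not Kirill's actual code; the buffer size and function names are my own:

```c
#include <stddef.h>

enum { CACHELINE = 64, NLINES = 1 << 16 };
#define WORDS_PER_LINE (CACHELINE / sizeof(unsigned long))

/* One touched word per cacheline, ~4 MB total: fits in a large
 * last-level cache but not in L1. */
static unsigned long buf[NLINES * WORDS_PER_LINE];

/* Plain store: unconditionally write one word per cacheline,
 * dirtying every line the loop touches. */
static void stores_plain(void)
{
	for (size_t i = 0; i < NLINES; i++)
		buf[i * WORDS_PER_LINE] = 0;
}

/* Branch before store: the load can act as a prefetch into L1, and
 * when the word is already zero the store (and the cacheline
 * dirtying) is skipped entirely. */
static void stores_branched(void)
{
	for (size_t i = 0; i < NLINES; i++)
		if (buf[i * WORDS_PER_LINE] != 0)
			buf[i * WORDS_PER_LINE] = 0;
}
```

Timing loops like these is exactly the kind of micro-benchmark Linus warns about: the result is dominated by store-buffer drain behavior rather than instruction count.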
Re: [RFC V2] test_bit before clear files_struct bits
On Tue, Feb 10, 2015 at 12:49:46PM -0800, Linus Torvalds wrote:
> On Tue, Feb 10, 2015 at 12:22 PM, Andrew Morton wrote:
> >
> > The patch is good but I'm still wondering if any CPUs can do this
> > speedup for us. The CPU has to pull in the target word to modify the
> > bit and what it *could* do is to avoid dirtying the cacheline if it
> > sees that the bit is already in the desired state.
>
> Sadly, no CPU I know of actually does this. Probably because it would
> take a bit more core resources, and conditional writes to memory are
> not normally part of an x86 core (it might be more natural for
> something like old-style ARM that has conditional writes).
>
> Also, even if the store were to be conditional, the cacheline would
> have been acquired in exclusive state, and in many cache protocols the
> state machine is from exclusive to dirty (since normally the only
> reason to get a cacheline for exclusive use is in order to write to
> it). So a "read, test, conditional write" ends up actually being more
> complicated than you'd think - because you *want* that
> exclusive->dirty state for the case where you really are going to
> change the bit, and to avoid extra cache protocol stages you don't
> generally want to read the cacheline into a shared read mode first
> (only to then have to turn it into exclusive/dirty as a second state)

That all sounds reasonable. But I still fail to understand why my micro-benchmark is faster with a branch before the store compared to a plain store.

http://article.gmane.org/gmane.linux.kernel.cross-arch/26254

In this case we would not have an intermediate shared state, because we don't have anybody else to share the cache line with. So with the branch we would have the same E->M transition and write-back to memory as without the branch. But that doesn't explain why the branch makes the code faster.

Any ideas?

--
 Kirill A. Shutemov
Re: [RFC V2] test_bit before clear files_struct bits
On Tue, Feb 10, 2015 at 12:22 PM, Andrew Morton wrote:
>
> The patch is good but I'm still wondering if any CPUs can do this
> speedup for us. The CPU has to pull in the target word to modify the
> bit and what it *could* do is to avoid dirtying the cacheline if it
> sees that the bit is already in the desired state.

Sadly, no CPU I know of actually does this. Probably because it would take a bit more core resources, and conditional writes to memory are not normally part of an x86 core (it might be more natural for something like old-style ARM that has conditional writes).

Also, even if the store were to be conditional, the cacheline would have been acquired in exclusive state, and in many cache protocols the state machine goes from exclusive to dirty (since normally the only reason to get a cacheline for exclusive use is in order to write to it). So a "read, test, conditional write" ends up actually being more complicated than you'd think - because you *want* that exclusive->dirty transition for the case where you really are going to change the bit, and to avoid extra cache protocol stages you don't generally want to read the cacheline into a shared read mode first (only to then have to turn it into exclusive/dirty as a second step).

So at least on current x86 (and for the reasons above, likely in the future, including on other architectures with read-modify-write memory access models), the default assumption is that the bit operations will actually change the bit, and unlikely bit setting/clearing for cachelines that are very likely to otherwise stay clean should probably be conditionalized in software. Like in this patch.

Linus
Re: [RFC V2] test_bit before clear files_struct bits
(cc Linus for CPU-fu)

On Tue, 10 Feb 2015 15:11:37 +0800 "Wang, Yalin" wrote:

> Add test_bit() before clearing close_on_exec and open_fds. Tracing
> __clear_bit() shows these two bits are usually already clear, so
> testing first lets us skip the clear and win in the common case.
>
> ...
>
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -209,7 +209,8 @@ static inline void __set_close_on_exec(int fd, struct fdtable *fdt)
>
>  static inline void __clear_close_on_exec(int fd, struct fdtable *fdt)
>  {
> -	__clear_bit(fd, fdt->close_on_exec);
> +	if (test_bit(fd, fdt->close_on_exec))
> +		__clear_bit(fd, fdt->close_on_exec);
>  }
>
>  static inline void __set_open_fd(int fd, struct fdtable *fdt)
> @@ -309,7 +310,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
>  		struct file *f = *old_fds++;
>  		if (f) {
>  			get_file(f);
> -		} else {
> +		} else if (test_bit(open_files - i, new_fdt->open_fds)) {
>  			/*
>  			 * The fd may be claimed in the fd bitmap but not yet
>  			 * instantiated in the files array if a sibling thread

The patch is good but I'm still wondering if any CPUs can do this speedup for us. The CPU has to pull in the target word to modify the bit, and what it *could* do is avoid dirtying the cacheline if it sees that the bit is already in the desired state.

However, some elapsed-time testing I did on a couple of Intel machines indicates that these CPUs don't perform that optimisation. Perhaps there's some reason why they don't, dunno.

Still, I think we should encapsulate the above (common) pattern into helper functions in include/linux/bitops.h because

- it's cleaner
- it's self-documenting
- it permits us to eliminate the if (test_bit(...)) on any CPU which does perform the optimisation internally, if such exists.

You actually have measurement results for these (and other) set-bit-on-already-set-bit call sites. Please include all of that info in the changelog.
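The helpers Andrew is asking for could look something like the sketch below. This is a hypothetical user-space rendering of the idea, with minimal stand-ins for the kernel's non-atomic bitops; the `_if_set`/`_if_clear` names are illustrative, not existing kernel API:

```c
#include <stddef.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))
#define BIT_MASK(nr)  (1UL << ((nr) % BITS_PER_LONG))
#define BIT_WORD(nr)  ((nr) / BITS_PER_LONG)

/* Minimal stand-ins for the kernel's non-atomic bitops. */
static inline int test_bit(unsigned int nr, const unsigned long *addr)
{
	return (addr[BIT_WORD(nr)] >> (nr % BITS_PER_LONG)) & 1;
}

static inline void __clear_bit(unsigned int nr, unsigned long *addr)
{
	addr[BIT_WORD(nr)] &= ~BIT_MASK(nr);
}

static inline void __set_bit(unsigned int nr, unsigned long *addr)
{
	addr[BIT_WORD(nr)] |= BIT_MASK(nr);
}

/* Only touch the word (and dirty its cacheline) when the bit
 * actually needs to change - the pattern from the patch, wrapped
 * so call sites stay readable and a CPU that does this in hardware
 * could get a trivial implementation. */
static inline void __clear_bit_if_set(unsigned int nr, unsigned long *addr)
{
	if (test_bit(nr, addr))
		__clear_bit(nr, addr);
}

static inline void __set_bit_if_clear(unsigned int nr, unsigned long *addr)
{
	if (!test_bit(nr, addr))
		__set_bit(nr, addr);
}
```

With such helpers, `__clear_close_on_exec()` would become a one-line call again instead of an open-coded test-and-clear.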
[RFC V2] test_bit before clear files_struct bits
Add test_bit() before clearing close_on_exec and open_fds. Tracing __clear_bit() shows these two bits are usually already clear, so testing first lets us skip the clear and win in the common case.

Signed-off-by: Yalin Wang
---
 fs/file.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index ee738ea..b0e059c 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -209,7 +209,8 @@ static inline void __set_close_on_exec(int fd, struct fdtable *fdt)
 
 static inline void __clear_close_on_exec(int fd, struct fdtable *fdt)
 {
-	__clear_bit(fd, fdt->close_on_exec);
+	if (test_bit(fd, fdt->close_on_exec))
+		__clear_bit(fd, fdt->close_on_exec);
 }
 
 static inline void __set_open_fd(int fd, struct fdtable *fdt)
@@ -309,7 +310,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 		struct file *f = *old_fds++;
 		if (f) {
 			get_file(f);
-		} else {
+		} else if (test_bit(open_files - i, new_fdt->open_fds)) {
 			/*
 			 * The fd may be claimed in the fd bitmap but not yet
 			 * instantiated in the files array if a sibling thread
-- 
2.2.2