Re: svn commit: r358733 - head/sys/sys

2020-03-12 Thread Andriy Gapon
On 12/03/2020 09:48, Alan Cox wrote:
> Just to be clear, this is not just a matter of whether our release fences are
> poorly or inefficiently implemented.  Release fences (as defined by the C/C++
> standards that inspire our atomics) have *different* semantics than release
> stores. Specifically, a release fence places more ordering restrictions on the
> nearby store accesses than a release store does.  (In fact, "man 9 atomic"
> describes this.)  Thus, it is almost inevitable that a release fence will
> have a higher cost than the best possible implementation of a release store.

For people like me who are new to these things, here is a good explanation
(perhaps a bad title, though):
https://preshing.com/20131125/acquire-and-release-fences-dont-work-the-way-youd-expect/

-- 
Andriy Gapon


Re: svn commit: r358733 - head/sys/sys

2020-03-12 Thread Alan Cox


On 3/11/20 9:16 PM, Mateusz Guzik wrote:

To reiterate, if atomic_thread_fence_rel is indeed very expensive on
certain architectures compared to other _rel variants, the kernel is
already heavily pessimized because of the use of said standalone fence in
refcount(9) and a few other places, and this needs to be fixed.



Just to be clear, this is not just a matter of whether our release 
fences are poorly or inefficiently implemented.  Release fences (as 
defined by the C/C++ standards that inspire our atomics) have 
*different* semantics than release stores. Specifically, a release fence 
places more ordering restrictions on the nearby store accesses than a 
release store does.  (In fact, "man 9 atomic" describes this.)  Thus, it 
is almost inevitable that a release fence will have a higher cost than 
the best possible implementation of a release store.
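
To make the distinction concrete, here is a sketch in standard C11 atomics
(stdatomic.h names, not the kernel's atomic(9) API; data, flag and other
are made-up variables):

#include <stdatomic.h>

int data, other;
_Atomic int flag;

void
publish_with_release_store(void)
{
        data = 1;
        /*
         * Orders "data = 1" before this one store only; the plain
         * store to "other" below may still be performed before "flag".
         */
        atomic_store_explicit(&flag, 1, memory_order_release);
        other = 2;
}

void
publish_with_release_fence(void)
{
        data = 1;
        /*
         * Orders every prior load and store before *every* subsequent
         * store, i.e. before both "flag" and "other".
         */
        atomic_thread_fence(memory_order_release);
        atomic_store_explicit(&flag, 1, memory_order_relaxed);
        other = 2;
}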




Re: svn commit: r358733 - head/sys/sys

2020-03-11 Thread Mateusz Guzik
On 3/10/20, Konstantin Belousov  wrote:
>> But perhaps more importantly
>> there are significantly more frequent users of
>> atomic_thread_fence_rel, like refcount(9). If the issue is real that
>> should be looked at.
> For refcount(9) use of release fence removal, we would need
>   atomic_fetchadd_rel_int().
>

To reiterate, if atomic_thread_fence_rel is indeed very expensive on
certain architectures compared to other _rel variants, the kernel is
already heavily pessimized because of the use of said standalone fence in
refcount(9) and a few other places, and this needs to be fixed.

If the cost can't be reduced (e.g., perhaps this can do a no-op store
+ fence somewhere akin to how the lock xadd 0,(%esp) trickery used to
work), then the atomic APIs do indeed need to be completed w.r.t.
adding all the missing acq/rel variants (that's at least fetchadd and
swap).

We can also extend the API to do relevant addition/subtraction without

Re: svn commit: r358733 - head/sys/sys

2020-03-10 Thread Konstantin Belousov
On Tue, Mar 10, 2020 at 12:22:09AM +0100, Mateusz Guzik wrote:
> 
> If an arch performs something significantly more expensive here it's
> probably a performance bug in the port.
It is a question of how much hardware support there is for fine-grained
fences.  Relaxed architectures already add some optimized operations, but
have not (yet) implemented the complete C11 optimized set.
I think they will eventually converge.

Until they do, I think it is more important not to pessimize
to-be Tier 1 arches than to collapse three fast integer instructions
on amd64 into one read-modify-write op.

> But perhaps more importantly
> there are significantly more frequent users of
> atomic_thread_fence_rel, like refcount(9). If the issue is real that
> should be looked at.
For removal of the release fence use in refcount(9), we would need
atomic_fetchadd_rel_int().
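
To illustrate, a hedged sketch (not the verbatim sys/refcount.h code;
atomic_fetchadd_rel_int() is the hypothetical primitive named above, not
an existing atomic(9) function):

#include <sys/types.h>
#include <machine/atomic.h>

static __inline bool
refcount_release_sketch(volatile u_int *count)
{
        u_int old;

        /* Today: a standalone release fence plus a relaxed fetchadd. */
        atomic_thread_fence_rel();
        old = atomic_fetchadd_int(count, -1);

        /*
         * With the proposed primitive the two lines above could collapse
         * into:
         *      old = atomic_fetchadd_rel_int(count, -1);
         * letting relaxed arches emit a cheaper release-form atomic add
         * instead of a full barrier followed by a plain atomic add.
         */

        if (old > 1)
                return (false);
        /* Last reference; pair with an acquire before tearing down. */
        atomic_thread_fence_acq();
        return (true);
}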


Re: svn commit: r358733 - head/sys/sys

2020-03-09 Thread Mateusz Guzik
On 3/9/20, Konstantin Belousov  wrote:
> I am not sure that the new code on amd64 is much nicer.
>
> But the point was that on non-amd64, i.e. armv8 and potentially on
> powerpc 3.0, the patch substitutes a release store with a full barrier. The
> latter is (much) slower.
>

If an arch performs something significantly more expensive here it's
probably a performance bug in the port. But perhaps more importantly
there are significantly more frequent users of
atomic_thread_fence_rel, like refcount(9). If the issue is real that
should be looked at.


Re: svn commit: r358733 - head/sys/sys

2020-03-09 Thread Konstantin Belousov
On Mon, Mar 09, 2020 at 01:56:17AM +0100, Mateusz Guzik wrote:
> the win is: the previous code forces the compiler to compute an
> incremented value and then store it. On amd64 this translates to the
> following (with rdx holding the address of the counter):
> 
> mov     (%rdx),%eax    /* load the value */
> add     $0x1,%eax      /* add 1 */
> mov     %eax,(%rdx)    /* store it */
> 
> On patched kernel this is:
> addl   $0x1,(%rdx)
> 
> which is clearly much nicer.
I am not sure that the new code on amd64 is much nicer.

But the point was that on non-amd64, i.e. armv8 and potentially on
powerpc 3.0, the patch substitutes a release store with a full barrier. The
latter is (much) slower.
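
For illustration, a hedged sketch of the difference being described (the
instruction choices in the comments are the typical armv8 mappings, not
output captured from an actual build):

#include <sys/types.h>
#include <machine/atomic.h>

void
write_end_release_store(volatile u_int *seqcp)
{
        /* Old code: a release store; armv8 can use a single stlr. */
        atomic_store_rel_int(seqcp, *seqcp + 1);
}

void
write_end_release_fence(volatile u_int *seqcp)
{
        /*
         * New code: a standalone release fence (a dmb-class barrier on
         * armv8) followed by a plain increment.  The full barrier is
         * what makes this variant more expensive there, while on amd64
         * the standalone release fence amounts to just a compiler
         * barrier.
         */
        atomic_thread_fence_rel();
        *seqcp += 1;
}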



Re: svn commit: r358733 - head/sys/sys

2020-03-08 Thread Mateusz Guzik
On 3/8/20, Mateusz Guzik  wrote:
> Author: mjg
> Date: Sun Mar  8 00:22:32 2020
> New Revision: 358733
> URL: https://svnweb.freebsd.org/changeset/base/358733
>
> Log:
>   seqc: tidy up
>
>   - avoid hand-rolled read
>   - match begin/end in terms of fence style
>

There were off-list questions about this, so let me clarify.

The first bit is a cosmetic change, but the second one is not.

> Modified:
>   head/sys/sys/seqc.h
>
> Modified: head/sys/sys/seqc.h
> ==
> --- head/sys/sys/seqc.h   Sat Mar  7 15:37:23 2020(r358732)
> +++ head/sys/sys/seqc.h   Sun Mar  8 00:22:32 2020(r358733)
> @@ -66,7 +66,8 @@ static __inline void
>  seqc_write_end(seqc_t *seqcp)
>  {
>
> - atomic_store_rel_int(seqcp, *seqcp + 1);
> + atomic_thread_fence_rel();
> + *seqcp += 1;
>   MPASS(!seqc_in_modify(*seqcp));
>   critical_exit();
>  }

For correct operation the counter has to be modified *before* any work
is done and *after* it is completed.

Consider a trivial case:
seqc++;
foo[0] = 0;
foo[1] = 1;
seqc++;

There are 2 ways in which this can be mucked with:
- the compiler can reorder or rework the increments (e.g., collapse
them into a single += 2 instead; see the sketch after this list). A
compiler barrier prevents that.
- even if the generated machine code increments in the correct places,
the CPU can reorder the operations (how much is heavily dependent on
the architecture); in particular it could end up issuing the second
seqc++ before foo[1] = 1, which would defeat the point. This is where
release fences come in.
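
To make the compiler point concrete, absent a barrier the transformation
below is perfectly legal (a sketch; seqc and foo are the same made-up
names as above):

unsigned int seqc;
int foo[2];

void
writer_as_written(void)
{
        seqc++;
        foo[0] = 0;
        foo[1] = 1;
        seqc++;
}

void
writer_as_the_compiler_may_emit(void)
{
        /*
         * The intermediate odd value of seqc is never stored, and the
         * stores to foo may be moved across the counter updates.
         */
        foo[0] = 0;
        foo[1] = 1;
        seqc += 2;
}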

With the patched code there is:
seqc++;
atomic_thread_fence_rel(); /* make sure seqc++ is issued before foo
starts getting modified */
foo[0] = 0;
foo[1] = 1;
atomic_thread_fence_rel(); /* make sure modifications to foo don't
leak past this spot */
seqc++;

In comparison, the previous code was:
seqc++;
atomic_thread_fence_rel(); /* make sure seqc++ is issued before foo
starts getting modified */
foo[0] = 0;
foo[1] = 1;
atomic_store_rel_int(seqcp, *seqcp + 1); /* make sure modifications to
foo don't leak past this spot and update seqc immediately */

There are 2 differences here -- one is an improvement and the second
one does not matter.

The win: the previous code forces the compiler to compute an
incremented value and then store it. On amd64 this translates to the
following (with rdx holding the address of the counter):

mov     (%rdx),%eax    /* load the value */
add     $0x1,%eax      /* add 1 */
mov     %eax,(%rdx)    /* store it */

On a patched kernel this is:
addl   $0x1,(%rdx)

which is clearly much nicer.

The not-a-problem: the code does not issue a release fence after
incrementing the counter the second time, meaning that in principle it
can happen arbitrarily late, possibly disturbing code which waits for
the counter to become even. This is not a problem because whoever
modifies it is supposed to hold a lock and release it shortly after,
which also posts a release fence and consequently makes sure the
counter update is visible. Also note these routines come with
preemption disablement around the modification -- if the thread gets
preempted as a result of critical_exit from seqc_write_end, it will
also publish the updated counter. If the updater never does anything
which flushes the store buffer, then in the absolute worst case it
will get preempted, at which point once more we are covered. Note that
code which wants more deterministic behavior will want to use
seqc_read_any(), test the counter once and, if it is caught odd,
resort to locking.
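
For readers new to seqc, here is a minimal usage sketch (assuming the
seqc_write_begin()/seqc_write_end()/seqc_read()/seqc_consistent()
helpers from sys/seqc.h; the obj structure and its mutex are made up):

#include <sys/types.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/seqc.h>

struct obj {
        struct mtx      lock;           /* serializes writers */
        seqc_t          seqc;
        int             data[2];
};

static void
obj_update(struct obj *o, int a, int b)
{

        mtx_lock(&o->lock);
        seqc_write_begin(&o->seqc);     /* counter becomes odd */
        o->data[0] = a;
        o->data[1] = b;
        seqc_write_end(&o->seqc);       /* counter becomes even again */
        mtx_unlock(&o->lock);
}

static void
obj_read(struct obj *o, int *a, int *b)
{
        seqc_t seqc;

        for (;;) {
                seqc = seqc_read(&o->seqc);     /* waits out odd values */
                *a = o->data[0];
                *b = o->data[1];
                if (seqc_consistent(&o->seqc, seqc))
                        break;
        }
}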

> @@ -84,7 +85,7 @@ seqc_read(const seqc_t *seqcp)
>   seqc_t ret;
>
>   for (;;) {
> - ret = atomic_load_acq_int(__DECONST(seqc_t *, seqcp));
> + ret = seqc_read_any(seqcp);
>   if (__predict_false(seqc_in_modify(ret))) {
>   cpu_spinwait();
>   continue;


-- 
Mateusz Guzik 


svn commit: r358733 - head/sys/sys

2020-03-07 Thread Mateusz Guzik
Author: mjg
Date: Sun Mar  8 00:22:32 2020
New Revision: 358733
URL: https://svnweb.freebsd.org/changeset/base/358733

Log:
  seqc: tidy up
  
  - avoid hand-rolled read
  - match begin/end in terms of fence style

Modified:
  head/sys/sys/seqc.h

Modified: head/sys/sys/seqc.h
==
--- head/sys/sys/seqc.h Sat Mar  7 15:37:23 2020(r358732)
+++ head/sys/sys/seqc.h Sun Mar  8 00:22:32 2020(r358733)
@@ -66,7 +66,8 @@ static __inline void
 seqc_write_end(seqc_t *seqcp)
 {
 
-   atomic_store_rel_int(seqcp, *seqcp + 1);
+   atomic_thread_fence_rel();
+   *seqcp += 1;
MPASS(!seqc_in_modify(*seqcp));
critical_exit();
 }
@@ -84,7 +85,7 @@ seqc_read(const seqc_t *seqcp)
seqc_t ret;
 
for (;;) {
-   ret = atomic_load_acq_int(__DECONST(seqc_t *, seqcp));
+   ret = seqc_read_any(seqcp);
if (__predict_false(seqc_in_modify(ret))) {
cpu_spinwait();
continue;
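
For context on the first log item: the seqc_read_any() now used above is,
per the same header, essentially the acquire load it replaces spelled
through a helper -- roughly:

static __inline seqc_t
seqc_read_any(const seqc_t *seqcp)
{

        return (atomic_load_acq_int(__DECONST(seqc_t *, seqcp)));
}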