RE: [PATCH v2 0/5] locking: Introduce local{,64}_try_cmpxchg

2023-04-11 Thread David Laight
From: Dave Hansen
> Sent: 11 April 2023 14:44
> 
> On 4/11/23 04:35, Mark Rutland wrote:
> > I agree it'd be nice to have performance figures, but I think those would 
> > only
> > need to demonstrate a lack of a regression rather than a performance
> > improvement, and I think it's fairly clear from eyeballing the generated
> > instructions that a regression isn't likely.
> 
> Thanks for the additional context.
> 
> I totally agree that there's zero burden here to show a performance
> increase.  If anyone can think of a quick way to do _some_ kind of
> benchmark on the code being changed and just show that it's free of
> brown paper bags, it would be appreciated.  Nothing crazy, just think of
> one workload (synthetic or not) that will stress the paths being changed
> and run it with and without these changes.  Make sure there are not
> surprises.
> 
> I also agree that it's unlikely to be brown paper bag material.

The only thing I can think of is that, on x86, the locked
variant may actually be faster!
Both require exclusive access to the cache line (the unlocked
variant always does the write! [1]).
So if the cache line is contended between cpu the unlocked
variant might ping-pong the cache line twice!
Of course, if the line is shared like that then performance
is horrid.

[1] I checked on an uncached PCIe address on which I can monitor
the TLP. The write always happens so you can use cmpxchg18b
with a 'known bad value' to do a 16 byte read as a single TLP
(without using an SSE register).

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


Re: [PATCH v2 0/5] locking: Introduce local{,64}_try_cmpxchg

2023-04-11 Thread Dave Hansen
On 4/11/23 04:35, Mark Rutland wrote:
> I agree it'd be nice to have performance figures, but I think those would only
> need to demonstrate a lack of a regression rather than a performance
> improvement, and I think it's fairly clear from eyeballing the generated
> instructions that a regression isn't likely.

Thanks for the additional context.

I totally agree that there's zero burden here to show a performance
increase.  If anyone can think of a quick way to do _some_ kind of
benchmark on the code being changed and just show that it's free of
brown paper bags, it would be appreciated.  Nothing crazy, just think of
one workload (synthetic or not) that will stress the paths being changed
and run it with and without these changes.  Make sure there are not
surprises.

I also agree that it's unlikely to be brown paper bag material.


Re: [PATCH v2 0/5] locking: Introduce local{,64}_try_cmpxchg

2023-04-11 Thread Mark Rutland
On Wed, Apr 05, 2023 at 09:37:04AM -0700, Dave Hansen wrote:
> On 4/5/23 07:17, Uros Bizjak wrote:
> > Add generic and target specific support for local{,64}_try_cmpxchg
> > and wire up support for all targets that use local_t infrastructure.
> 
> I feel like I'm missing some context.
> 
> What are the actual end user visible effects of this series?  Is there a
> measurable decrease in perf overhead?  Why go to all this trouble for
> perf?  Who else will use local_try_cmpxchg()?

Overall, the theory is that it can generate slightly better code (e.g. by
reusing the flags on x86). In practice, that might be in the noise, but as
demonstrated in prior postings the code generation is no worse than before.

>From my perspective, the more important part is that this aligns local_t with
the other atomic*_t APIs, which all have ${atomictype}_try_cmpxchg(), and for
consistency/legibility/maintainability it's nice to be able to use the same
code patterns, e.g.

${inttype} new, old = ${atomictype}_read(ptr);
do {
...
new = do_something_with(old);
} while (${atomictype}_try_cmpxvhg(ptr, , newval);

> I'm all for improving things, and perf is an important user.  But, if
> the goal here is improving performance, it would be nice to see at least
> a stab at quantifying the performance delta.

IIUC, Steve's original request for local_try_cmpxchg() was a combination of a
theoretical performance benefit and a more general preference to use
try_cmpxchg() for consistency / better structure of the source code:

  https://lore.kernel.org/lkml/20230301131831.6c8d4...@gandalf.local.home/

I agree it'd be nice to have performance figures, but I think those would only
need to demonstrate a lack of a regression rather than a performance
improvement, and I think it's fairly clear from eyeballing the generated
instructions that a regression isn't likely.

Thanks,
Mark.


RE: [PATCH v2 0/5] locking: Introduce local{,64}_try_cmpxchg

2023-04-06 Thread David Laight
From: Uros Bizjak
> Sent: 06 April 2023 09:39
> 
> On Thu, Apr 6, 2023 at 10:26 AM David Laight  wrote:
> >
> > From: Dave Hansen
> > > Sent: 05 April 2023 17:37
> > >
> > > On 4/5/23 07:17, Uros Bizjak wrote:
> > > > Add generic and target specific support for local{,64}_try_cmpxchg
> > > > and wire up support for all targets that use local_t infrastructure.
> > >
> > > I feel like I'm missing some context.
> > >
> > > What are the actual end user visible effects of this series?  Is there a
> > > measurable decrease in perf overhead?  Why go to all this trouble for
> > > perf?  Who else will use local_try_cmpxchg()?
> >
> > I'm assuming the local_xxx operations only have to be save wrt interrupts?
> > On x86 it is possible that an alternate instruction sequence
> > that doesn't use a locked instruction may actually be faster!
> 
> Please note that "local" functions do not use lock prefix. Only atomic
> properties of cmpxchg instruction are exploited since it only needs to
> be safe wrt interrupts.

Gah, I was assuming that LOCK was implied - like it is for xchg
and all the bit instructions.

In any case I suspect it makes little difference unless the
locked variant affects the instruction pipeline.
In fact, you may want to stop the cacheline being invalidated
between the read and write in order to avoid an extra cache
line bounce.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


Re: [PATCH v2 0/5] locking: Introduce local{,64}_try_cmpxchg

2023-04-06 Thread Uros Bizjak
On Thu, Apr 6, 2023 at 10:26 AM David Laight  wrote:
>
> From: Dave Hansen
> > Sent: 05 April 2023 17:37
> >
> > On 4/5/23 07:17, Uros Bizjak wrote:
> > > Add generic and target specific support for local{,64}_try_cmpxchg
> > > and wire up support for all targets that use local_t infrastructure.
> >
> > I feel like I'm missing some context.
> >
> > What are the actual end user visible effects of this series?  Is there a
> > measurable decrease in perf overhead?  Why go to all this trouble for
> > perf?  Who else will use local_try_cmpxchg()?
>
> I'm assuming the local_xxx operations only have to be save wrt interrupts?
> On x86 it is possible that an alternate instruction sequence
> that doesn't use a locked instruction may actually be faster!

Please note that "local" functions do not use lock prefix. Only atomic
properties of cmpxchg instruction are exploited since it only needs to
be safe wrt interrupts.

Uros.

> Although, maybe, any kind of locked cmpxchg just needs to ensure
> the cache line isn't 'stolen', so apart from possible slight
> delays on another cpu that gets a cache miss for the line in
> all makes little difference.
> The cache line miss costs a lot anyway, line bouncing more
> and is best avoided.
> So is there actually much of a benefit at all?
>
> Clearly the try_cmpxchg help - but that is a different issue.
>
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 
> 1PT, UK
> Registration No: 1397386 (Wales)


RE: [PATCH v2 0/5] locking: Introduce local{,64}_try_cmpxchg

2023-04-06 Thread David Laight
From: Dave Hansen
> Sent: 05 April 2023 17:37
> 
> On 4/5/23 07:17, Uros Bizjak wrote:
> > Add generic and target specific support for local{,64}_try_cmpxchg
> > and wire up support for all targets that use local_t infrastructure.
> 
> I feel like I'm missing some context.
> 
> What are the actual end user visible effects of this series?  Is there a
> measurable decrease in perf overhead?  Why go to all this trouble for
> perf?  Who else will use local_try_cmpxchg()?

I'm assuming the local_xxx operations only have to be save wrt interrupts?
On x86 it is possible that an alternate instruction sequence
that doesn't use a locked instruction may actually be faster!

Although, maybe, any kind of locked cmpxchg just needs to ensure
the cache line isn't 'stolen', so apart from possible slight
delays on another cpu that gets a cache miss for the line in
all makes little difference.
The cache line miss costs a lot anyway, line bouncing more
and is best avoided.
So is there actually much of a benefit at all?

Clearly the try_cmpxchg help - but that is a different issue.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


Re: [PATCH v2 0/5] locking: Introduce local{,64}_try_cmpxchg

2023-04-05 Thread Uros Bizjak
On Wed, Apr 5, 2023 at 6:37 PM Dave Hansen  wrote:
>
> On 4/5/23 07:17, Uros Bizjak wrote:
> > Add generic and target specific support for local{,64}_try_cmpxchg
> > and wire up support for all targets that use local_t infrastructure.
>
> I feel like I'm missing some context.
>
> What are the actual end user visible effects of this series?  Is there a
> measurable decrease in perf overhead?  Why go to all this trouble for
> perf?  Who else will use local_try_cmpxchg()?

This functionality was requested by perf people [1], so perhaps Steven
can give us some concrete examples. In general, apart from the removal
of unneeded compare instruction on x86, usage of try_cmpxchg also
results in slightly better code on non-x86 targets [2], since the code
now correctly identifies fast-path through the cmpxchg loop.

Also important is that try_cmpxchg code reuses the result of cmpxchg
instruction in the loop, so a read from the memory in the loop is
eliminated. When reviewing the cmpxchg usage sites, I found numerous
places where unnecessary read from memory was present in the loop, two
examples can be seen in the last patch of this series.

Also, using try_cmpxchg prevents inconsistencies of the cmpxchg loop,
where the result of the cmpxchg is compared with the wrong "old" value
- one such bug is still lurking in x86 APIC code, please see [3].

Please note that apart from perf subsystem, event subsystem can also
be improved by using local_try_cmpxchg. This is the reason that the
last patch includes a change in events/core.c.

> I'm all for improving things, and perf is an important user.  But, if
> the goal here is improving performance, it would be nice to see at least
> a stab at quantifying the performance delta.

[1] https://lore.kernel.org/lkml/20230301131831.6c8d4...@gandalf.local.home/
[2] 
https://lore.kernel.org/lkml/yo91omfdzttgx...@fvff77s0q05n.cambridge.arm.com/
[3] https://lore.kernel.org/lkml/20230227160917.107820-1-ubiz...@gmail.com/

Uros.


Re: [PATCH v2 0/5] locking: Introduce local{,64}_try_cmpxchg

2023-04-05 Thread Dave Hansen
On 4/5/23 07:17, Uros Bizjak wrote:
> Add generic and target specific support for local{,64}_try_cmpxchg
> and wire up support for all targets that use local_t infrastructure.

I feel like I'm missing some context.

What are the actual end user visible effects of this series?  Is there a
measurable decrease in perf overhead?  Why go to all this trouble for
perf?  Who else will use local_try_cmpxchg()?

I'm all for improving things, and perf is an important user.  But, if
the goal here is improving performance, it would be nice to see at least
a stab at quantifying the performance delta.