Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-21 Thread Nick Piggin
On Wed, Jan 21, 2009 at 10:54:18AM +0100, Andi Kleen wrote:
> > The point is that the compiler is then free to do it. If things
> > slow down after the compiler gets *more* information, then that
> > is a problem with the compiler heuristics rather than the
> > information we give it.
> 
> The point was the -Os disables it typically then.
> (not always, compiler heuristics are far from perfect)

That'd be just another gcc failing. If it can make the code
faster without a size increase, then it should (of course
if it has to start spilling registers etc. then that's a
different matter, but we're not talking about only 32-bit
x86 here).


> > > Then x86s tend to have very very fast L1 caches and
> > > if something is not in L1 on reads then the cost of fetching
> > > something for a read dwarfs the few cycles you can typically
> > > get out of this.
> > 
> > Well most architectures have L1 caches of several cycles. And
> > L2 miss typically means going to L2 which in some cases the
> > compiler is expected to attempt to cover as much as possible
> > (eg in-order architectures).
> 
> L2 cache is so much slower that scheduling a few instructions
> more doesn't help much.

I think on a lot of CPUs that is actually not the case. Including
on Nehalem and Montecito CPUs where it is what, like under 15
cycles?

Even in cases where you have a high latency LLC or go to memory,
you want to be able to start as many loads as possible before
stalling. Especially for in-order architectures, but even OOOE
can stall if it can't resolve store addresses early enough or
speculate them.

 
> > stall, so you still want to get loads out early if possible.
> > 
> > Even a lot of OOOE CPUs I think won't have the best alias
> > anaysis, so all else being equal, it wouldn't hurt them to
> > move loads earlier.
> 
> Hmm, but if the load is nearby it won't matter if a 
> store is in the middle, because the CPU will just execute
> over it.

If the address is not known or the store buffer fills up etc
then it may not be able to. It could even be hundreds of
instructions here too much for an OOOE processor window. We
have a lot of huge functions (although granted they'll often
contain barriers for other reasons like locks or function
calls anyway).
 

> The real big win is if you do some computation inbetween,
> but at least for typical list manipulation there isn't 
> really any.

Well, I have a feeling that the MLP side of it could be more
significant than ILP. But no data.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-21 Thread Andi Kleen
> The point is that the compiler is then free to do it. If things
> slow down after the compiler gets *more* information, then that
> is a problem with the compiler heuristics rather than the
> information we give it.

The point was the -Os disables it typically then.
(not always, compiler heuristics are far from perfect)

> 
>  
> > Then x86s tend to have very very fast L1 caches and
> > if something is not in L1 on reads then the cost of fetching
> > something for a read dwarfs the few cycles you can typically
> > get out of this.
> 
> Well most architectures have L1 caches of several cycles. And
> L2 miss typically means going to L2 which in some cases the
> compiler is expected to attempt to cover as much as possible
> (eg in-order architectures).

L2 cache is so much slower that scheduling a few instructions
more doesn't help much.

> stall, so you still want to get loads out early if possible.
> 
> Even a lot of OOOE CPUs I think won't have the best alias
> anaysis, so all else being equal, it wouldn't hurt them to
> move loads earlier.

Hmm, but if the load is nearby it won't matter if a 
store is in the middle, because the CPU will just execute
over it.

The real big win is if you do some computation inbetween,
but at least for typical list manipulation there isn't 
really any.

> > Also at least x86 gcc normally doesn't do scheduling 
> > beyond basic blocks, so any if () shuts it up.
> 
> I don't think any of this is a reason not to use restrict, though.
> But... there are so many places we could add it to the kernel, and
> probably so few where it makes much difference. Maybe it should be
> able to help some critical core code, though.

Frankly I think it would be another unlikely().

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-21 Thread Nick Piggin
On Wed, Jan 21, 2009 at 10:20:49AM +0100, Andi Kleen wrote:
> On Wed, Jan 21, 2009 at 09:52:08AM +0100, Nick Piggin wrote:
> > On Wed, Jan 21, 2009 at 09:54:02AM +0100, Andi Kleen wrote:
> > > > GCC 4.3.2. Maybe i missed something obvious?
> > > 
> > > The typical use case of restrict is to tell it that multiple given
> > > arrays are independent and then give the loop optimizer 
> > > more freedom to handle expressions in the loop that
> > > accesses these arrays.
> > > 
> > > Since there are no loops in the list functions nothing changed.
> > > 
> > > Ok presumably there are some other optimizations which 
> > > rely on that alias information too, but again the list_*
> > > stuff is probably too simple to trigger any of them.
> > 
> > Any function that does several interleaved loads and stores
> > through different pointers could have much more freedom to
> > move loads early and stores late. 
> 
> For once that would require more live registers. It's not
> a clear and obvious win. Especially not if you have
> only very little registers, like on 32bit x86.
> 
> Then it would typically increase code size.

The point is that the compiler is then free to do it. If things
slow down after the compiler gets *more* information, then that
is a problem with the compiler heuristics rather than the
information we give it.

 
> Then x86s tend to have very very fast L1 caches and
> if something is not in L1 on reads then the cost of fetching
> something for a read dwarfs the few cycles you can typically
> get out of this.

Well most architectures have L1 caches of several cycles. And
L2 miss typically means going to L2 which in some cases the
compiler is expected to attempt to cover as much as possible
(eg in-order architectures).

If the caches are missed completely, then especially with an
in-order architecture, you want to issue as many parallel loads
as possible during the stall. If the compiler can't resolve
aliases, then it simply won't be able to bring some of those
loads forward.


> And lastly even on a in order system stores can 
> be typically queued without stalling, so it doesn't
> hurt to do them early.

Store queues are, what? On the order of tens of entries for
big power hungry x86? I'd guess much smaller for low power
in-order x86 and ARM etc. These can definitely fill up and
stall, so you still want to get loads out early if possible.

Even a lot of OOOE CPUs I think won't have the best alias
anaysis, so all else being equal, it wouldn't hurt them to
move loads earlier.


> Also at least x86 gcc normally doesn't do scheduling 
> beyond basic blocks, so any if () shuts it up.

I don't think any of this is a reason not to use restrict, though.
But... there are so many places we could add it to the kernel, and
probably so few where it makes much difference. Maybe it should be
able to help some critical core code, though.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-21 Thread Andi Kleen
On Wed, Jan 21, 2009 at 09:52:08AM +0100, Nick Piggin wrote:
> On Wed, Jan 21, 2009 at 09:54:02AM +0100, Andi Kleen wrote:
> > > GCC 4.3.2. Maybe i missed something obvious?
> > 
> > The typical use case of restrict is to tell it that multiple given
> > arrays are independent and then give the loop optimizer 
> > more freedom to handle expressions in the loop that
> > accesses these arrays.
> > 
> > Since there are no loops in the list functions nothing changed.
> > 
> > Ok presumably there are some other optimizations which 
> > rely on that alias information too, but again the list_*
> > stuff is probably too simple to trigger any of them.
> 
> Any function that does several interleaved loads and stores
> through different pointers could have much more freedom to
> move loads early and stores late. 

For once that would require more live registers. It's not
a clear and obvious win. Especially not if you have
only very little registers, like on 32bit x86.

Then it would typically increase code size.

Then x86s tend to have very very fast L1 caches and
if something is not in L1 on reads then the cost of fetching
something for a read dwarfs the few cycles you can typically
get out of this.

And lastly even on a in order system stores can 
be typically queued without stalling, so it doesn't
hurt to do them early.

Also at least x86 gcc normally doesn't do scheduling 
beyond basic blocks, so any if () shuts it up.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-21 Thread Nick Piggin
On Wed, Jan 21, 2009 at 09:54:02AM +0100, Andi Kleen wrote:
> > GCC 4.3.2. Maybe i missed something obvious?
> 
> The typical use case of restrict is to tell it that multiple given
> arrays are independent and then give the loop optimizer 
> more freedom to handle expressions in the loop that
> accesses these arrays.
> 
> Since there are no loops in the list functions nothing changed.
> 
> Ok presumably there are some other optimizations which 
> rely on that alias information too, but again the list_*
> stuff is probably too simple to trigger any of them.

Any function that does several interleaved loads and stores
through different pointers could have much more freedom to
move loads early and stores late. Big OOOE CPUs won't care
so much, but embedded and things (including in-order x86)
are very important users of the kernel.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-21 Thread Andi Kleen
> GCC 4.3.2. Maybe i missed something obvious?

The typical use case of restrict is to tell it that multiple given
arrays are independent and then give the loop optimizer 
more freedom to handle expressions in the loop that
accesses these arrays.

Since there are no loops in the list functions nothing changed.

Ok presumably there are some other optimizations which 
rely on that alias information too, but again the list_*
stuff is probably too simple to trigger any of them.

-Andi

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-20 Thread H. Peter Anvin

Ingo Molnar wrote:


Hm, GCC uses __restrict__, right?

I'm wondering whether there's any internal tie-up between alias analysis 
and the __restrict__ keyword - so if we turn off aliasing optimizations 
the __restrict__ keyword's optimizations are turned off as well.




Actually I suspect that "restrict" makes little difference for inlines 
or even statics, since gcc generally can do alias analysis fine there. 
However, in the presence of an intermodule function call, all alias 
analysis is off.  This is presumably why type-based analysis is used at 
all ... to at least be able to a modicum of, say, loop invariant removal 
in the presence of a library call.  This is also where "restrict" comes 
into play.


-hpa

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-20 Thread Ingo Molnar

* Linus Torvalds  wrote:

> On Tue, 20 Jan 2009, Ingo Molnar wrote:
> > 
> > (Different-type pointer uses are a common pattern: we have a lot of 
> > places where we have pointers to structures with different types so 
> > strict-aliasing optimization opportunities apply quite broadly 
> > already.)
> 
> Yes and no.
> 
> It's true that the kernel in general uses mostly pointers through 
> structures that can help the type-based thing.
> 
> However, the most common and important cases are actually the very same 
> structures. In particular, things like . Same "struct 
> list", often embedded into another case of the same struct.
> 
> And that's where "restrict" can actually help. It might be interesting 
> to see, for example, if it makes any difference to add a "restrict" 
> qualifier to the "new" pointer in __list_add(). That might give the 
> compiler the ability to schedule the stores to next->prev and prev->next 
> differently, and maybe it could matter?
> 
> It probably is not noticeable. The big reason for wanting to do alias 
> analysis tends to not be thatt kind of code at all, but the cases where 
> you can do much bigger simplifications, or on in-order machines where 
> you really want to hoist things like FP loads early and FP stores late, 
> and alias analysis (and here type-based is more reasonable) shows that 
> the FP accesses cannot alias with the integer accesses around it.

Hm, GCC uses __restrict__, right?

The patch below makes no difference at all on an x86 defconfig:

  vmlinux:
 text  data bss dec hex filename
  7253544   1641812 1324296 10219652 9bf084 vmlinux.before
  7253544   1641812 1324296 10219652 9bf084 vmlinux.after

not a single instruction was changed:

 --- vmlinux.before.asm
 +++ vmlinux.after.asm
 @@ -1,5 +1,5 @@
 
 -vmlinux.before: file format elf64-x86-64
 +vmlinux.after: file format elf64-x86-64

I'm wondering whether there's any internal tie-up between alias analysis 
and the __restrict__ keyword - so if we turn off aliasing optimizations 
the __restrict__ keyword's optimizations are turned off as well.

Nope, with aliasing optimizations turned back on there's still no change 
on the x86 defconfig:

 text  data bss dec hex filename
  7240893   1641796 1324296 10206985 9bbf09 vmlinux.before
  7240893   1641796 1324296 10206985 9bbf09 vmlinux.after

GCC 4.3.2. Maybe i missed something obvious?

Ingo

---
 include/linux/list.h |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux/include/linux/list.h
===
--- linux.orig/include/linux/list.h
+++ linux/include/linux/list.h
@@ -38,7 +38,7 @@ static inline void INIT_LIST_HEAD(struct
  * the prev/next entries already!
  */
 #ifndef CONFIG_DEBUG_LIST
-static inline void __list_add(struct list_head *new,
+static inline void __list_add(struct list_head * __restrict__ new,
  struct list_head *prev,
  struct list_head *next)
 {
@@ -48,7 +48,7 @@ static inline void __list_add(struct lis
prev->next = new;
 }
 #else
-extern void __list_add(struct list_head *new,
+extern void __list_add(struct list_head * __restrict__ new,
  struct list_head *prev,
  struct list_head *next);
 #endif
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-20 Thread Linus Torvalds


On Tue, 20 Jan 2009, Ingo Molnar wrote:
> 
> (Different-type pointer uses are a common pattern: we have a lot of places 
> where we have pointers to structures with different types so 
> strict-aliasing optimization opportunities apply quite broadly already.)

Yes and no.

It's true that the kernel in general uses mostly pointers through 
structures that can help the type-based thing.

However, the most common and important cases are actually the very same 
structures. In particular, things like . Same "struct list", 
often embedded into another case of the same struct.

And that's where "restrict" can actually help. It might be interesting to 
see, for example, if it makes any difference to add a "restrict" qualifier 
to the "new" pointer in __list_add(). That might give the compiler the 
ability to schedule the stores to next->prev and prev->next differently, 
and maybe it could matter? 

It probably is not noticeable. The big reason for wanting to do alias 
analysis tends to not be thatt kind of code at all, but the cases where 
you can do much bigger simplifications, or on in-order machines where you 
really want to hoist things like FP loads early and FP stores late, and 
alias analysis (and here type-based is more reasonable) shows that the FP 
accesses cannot alias with the integer accesses around it.

In x86, I doubt _any_ amount of alias analysis makes a hug difference (as 
long as the compiler at least doesn't think that local variable spills can 
alias with anything else). Not enough registers, and generally pretty 
aggressively OoO (with alias analysis in hardware) makes for a much less 
sensitive platform.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-20 Thread Ingo Molnar

* David Woodhouse  wrote:

> On Tue, 2009-01-20 at 13:38 +0100, Ingo Molnar wrote:
> > 
> > * Nick Piggin  wrote:
> > 
> > > > > it seems like a nice opt-in thing that can be used where the aliases 
> > > > > are verified and the code is particularly performance critical...
> > > > 
> > > > Yes. I think we could use it in the kernel, although I'm not sure how 
> > > > many cases we would ever find where we really care.
> > > 
> > > Yeah, we don't tend to do a lot of intensive data processing, so it is 
> > > normally the cache misses that hurt most as you noted earlier.
> > > 
> > > Some places it might be appropriate, though. It might be nice if it can 
> > > bring code size down too...
> > 
> > I checked, its size effects were miniscule [0.17%] on the x86 defconfig 
> > kernel and it seems to be a clear loss in total cost as there would be an 
> > ongoing maintenance cost
> 
> They were talking about 'restrict', not strict-aliasing. Where it can be 
> used, it's going to give you optimisations that strict-aliasing can't.

the two are obviously related (just that the 'restrict' keyword can be 
used for same-type pointers so it gives even broader leeway) so i used the 
0.17% figure i already had to give a ballpark figure about what such type 
of optimizations can bring us in general.

(Different-type pointer uses are a common pattern: we have a lot of places 
where we have pointers to structures with different types so 
strict-aliasing optimization opportunities apply quite broadly already.)

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-20 Thread David Woodhouse
On Tue, 2009-01-20 at 13:38 +0100, Ingo Molnar wrote:
> 
> * Nick Piggin  wrote:
> 
> > > > it seems like a nice opt-in thing that can be used where the aliases 
> > > > are verified and the code is particularly performance critical...
> > > 
> > > Yes. I think we could use it in the kernel, although I'm not sure how 
> > > many cases we would ever find where we really care.
> > 
> > Yeah, we don't tend to do a lot of intensive data processing, so it is 
> > normally the cache misses that hurt most as you noted earlier.
> > 
> > Some places it might be appropriate, though. It might be nice if it can 
> > bring code size down too...
> 
> I checked, its size effects were miniscule [0.17%] on the x86 defconfig 
> kernel and it seems to be a clear loss in total cost as there would be an 
> ongoing maintenance cost

They were talking about 'restrict', not strict-aliasing. Where it can be
used, it's going to give you optimisations that strict-aliasing can't.

-- 
David WoodhouseOpen Source Technology Centre
david.woodho...@intel.com  Intel Corporation

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-20 Thread Miguel F Mascarenhas Sousa Filipe
On Mon, Jan 19, 2009 at 9:01 PM, Linus Torvalds
 wrote:
>
>
> On Mon, 19 Jan 2009, Nick Piggin wrote:
>>
>> I want to know what is the problem with the restrict keyword?
>> I'm sure I've read Linus ranting about how bad it is in the
>> past...
>
> No, I don't think I've ranted about 'restrict'. I think it's a reasonable
> solution for performance-critical code, and unlike the whole type-aliasing
> model, it actually works for the _sane_ cases (ie doing some operation
> over two arrays of the same type, and letting the compiler know that it
> can access the arrays without fearing that writing to one would affect
> reading from the other).
>
> The problem with 'restrict' is that almost nobody uses it, and it does
> obviously require programmer input rather than the compiler doing it
> automatically. But it should work well as a way to get Fortran-like
> performance from HPC workloads written in C - which is where most of the
> people are who really want the alias analysis.

while working in my college thesis, a fortran HPC workload (10k lines
of fortran), converted to C resulted in performance speedups. this was
with gcc 3.4.
A simple f2c conversion + adaptations, resulted in a considerable
speedup. (20% IIRC).
The conversion was not done by performance reasons, the extra
performance was simply a unexpected, but quite nice outcome.
Just to let you guys know, that even with gcc3.4, gcc of C code ran
faster than gfortran of the equiv. fortran code.


... pushing the optimization engine further (-march tunning and -O3)
resulted in bad data.
But I can't swear by the correctness of some of the computations with
REAL's made in the original fortran code.



>
>> it seems like a nice opt-in thing that can be used where the aliases are
>> verified and the code is particularly performance critical...
>
> Yes. I think we could use it in the kernel, although I'm not sure how many
> cases we would ever find where we really care.
>
>Linus
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



-- 
Miguel Sousa Filipe
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-20 Thread Ingo Molnar

* Nick Piggin  wrote:

> > > it seems like a nice opt-in thing that can be used where the aliases 
> > > are verified and the code is particularly performance critical...
> > 
> > Yes. I think we could use it in the kernel, although I'm not sure how 
> > many cases we would ever find where we really care.
> 
> Yeah, we don't tend to do a lot of intensive data processing, so it is 
> normally the cache misses that hurt most as you noted earlier.
> 
> Some places it might be appropriate, though. It might be nice if it can 
> bring code size down too...

I checked, its size effects were miniscule [0.17%] on the x86 defconfig 
kernel and it seems to be a clear loss in total cost as there would be an 
ongoing maintenance cost of this weird new variant of C that language 
lawyers legislated out of thin air and which departs so significantly from 
time-tested C coding concepts and practices.

We'd have to work around aliasing warnings of the compiler again and again 
with no upside and in fact i'd argue that the resulting code is _less_ 
clean.

The lack of data processing complexity in the kernel is not a surprise: 
the kernel is really just a conduit/abstractor between hw and apps, and 
rarely generates genuinely new information. (In fact it can be generally 
considered a broken system call concept if such data processing _has_ to 
be conducted somewhere in the kernel.)

( Notable exceptions would be the crypto code and the RAID5 [XOR checksum]
  and RAID6 [polinomial checksums] code - but those tend to be seriously
  hand-optimized already, with the most critical bits written in assembly. )

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-19 Thread Andi Kleen
> The problem with 'restrict' is that almost nobody uses it, and it does 

Also gcc traditionally didn't do a very good job using it (this
might be better in the very latest versions). At least some of the 3.x
often discarded this information. 

> automatically. But it should work well as a way to get Fortran-like 
> performance from HPC workloads written in C - which is where most of the 
> people are who really want the alias analysis.

It's more than just HPC  -- a lot of code has critical loops.

> > it seems like a nice opt-in thing that can be used where the aliases are 
> > verified and the code is particularly performance critical...
> 
> Yes. I think we could use it in the kernel, although I'm not sure how many 
> cases we would ever find where we really care. 

Very little I suspect. Also the optimizations that gcc does with this
often increase the code size. While that can be a win, with people
judging gcc's output apparently *ONLY* on the code size as seen
in this thread[1] it would obviously not compete well.

-Andi 

[1] although there are compilers around that generate smaller code
than gcc at its best.

-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-19 Thread Nick Piggin
On Tue, Jan 20, 2009 at 08:01:52AM +1100, Linus Torvalds wrote:
> 
> 
> On Mon, 19 Jan 2009, Nick Piggin wrote:
> > 
> > I want to know what is the problem with the restrict keyword?
> > I'm sure I've read Linus ranting about how bad it is in the
> > past...
> 
> No, I don't think I've ranted about 'restrict'. I think it's a reasonable 
> solution for performance-critical code, and unlike the whole type-aliasing 
> model, it actually works for the _sane_ cases (ie doing some operation 
> over two arrays of the same type, and letting the compiler know that it 
> can access the arrays without fearing that writing to one would affect 
> reading from the other).
> 
> The problem with 'restrict' is that almost nobody uses it, and it does 
> obviously require programmer input rather than the compiler doing it 
> automatically. But it should work well as a way to get Fortran-like 
> performance from HPC workloads written in C - which is where most of the 
> people are who really want the alias analysis.

OK, that makes sense. I just had a vague feeling that you disliked
it.

 
> > it seems like a nice opt-in thing that can be used where the aliases are 
> > verified and the code is particularly performance critical...
> 
> Yes. I think we could use it in the kernel, although I'm not sure how many 
> cases we would ever find where we really care. 

Yeah, we don't tend to do a lot of intensive data processing, so it is
normally the cache misses that hurt most as you noted earlier.

Some places it might be appropriate, though. It might be nice if it can
bring code size down too...

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-19 Thread Linus Torvalds


On Mon, 19 Jan 2009, Nick Piggin wrote:
> 
> I want to know what is the problem with the restrict keyword?
> I'm sure I've read Linus ranting about how bad it is in the
> past...

No, I don't think I've ranted about 'restrict'. I think it's a reasonable 
solution for performance-critical code, and unlike the whole type-aliasing 
model, it actually works for the _sane_ cases (ie doing some operation 
over two arrays of the same type, and letting the compiler know that it 
can access the arrays without fearing that writing to one would affect 
reading from the other).

The problem with 'restrict' is that almost nobody uses it, and it does 
obviously require programmer input rather than the compiler doing it 
automatically. But it should work well as a way to get Fortran-like 
performance from HPC workloads written in C - which is where most of the 
people are who really want the alias analysis.

> it seems like a nice opt-in thing that can be used where the aliases are 
> verified and the code is particularly performance critical...

Yes. I think we could use it in the kernel, although I'm not sure how many 
cases we would ever find where we really care. 

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-18 Thread Nick Piggin
On Mon, Jan 19, 2009 at 01:13:45AM +0100, Ingo Molnar wrote:
> 
> * Bernd Schmidt  wrote:
> 
> > Linus Torvalds wrote:
> > > But you'll need some background to it:
> > 
> > You paint a somewhat one-sided picture bordering on FUD.
> 
> Type based aliasing should at most have been an opt-in for code that 
> cares, not a turned-on-by-default feature for everyone.

I want to know what is the problem with the restrict keyword?
I'm sure I've read Linus ranting about how bad it is in the
past... it seems like a nice opt-in thing that can be used where
the aliases are verified and the code is particularly performance
critical...

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-18 Thread Ingo Molnar

* Bernd Schmidt  wrote:

> Linus Torvalds wrote:
> > But you'll need some background to it:
> 
> You paint a somewhat one-sided picture bordering on FUD.
> 
> > Type-based aliasing is _stupid_.
> 
> Type-based aliasing is simply an application of the language definition, 
> and depending on the compiled application and/or target architecture, it 
> can be essential for performance.
> [...]
>
> So, to summarize: strict aliasing works for nearly every application 
> these days, there's a compiler switch for the rest to turn it off, it 
> can be a serious performance improvement, and the compiler warns about 
> dangerous constructs.  That makes the issue a little less black and 
> white than "type based aliasing is stupid".

i did some size measurements. Latest kernel, gcc 4.3.2, x86 defconfig, 
vmlinux built with strict-aliasing optimizations and without.

The image sizes are:

 textdata bss dec hex filename
  6997984 1635900 1322376 9956260  97eba4 vmlinux.gcc.-fno-strict-aliasing
  6985962 1635884 1322376 9944222  97bc9e vmlinux.gcc.strict-aliasing

that's 0.17% of a size win only.

The cost? More than _300_ new "type-punned" warnings during the kernel 
build (3000 altogether, including duplicates that trigger in multiple 
object files) - more than _1000_ new warnings (more than 10,000 total) in 
an allyesconfig build.

That is a _ton_ of effort for a ~0.1% category win. Often in historic code 
that has been working well and now got broken by gcc.

I think this feature has been over-sold while it under-performs. You also 
significantly weakened the utility of the C language for lowlevel 
hardware-conscious programming as a result (which is the strongest side of 
C by far).

Type based aliasing should at most have been an opt-in for code that 
cares, not a turned-on-by-default feature for everyone.

You already dismissed concerns in this thread by suggesting that Linux is 
just one of many projects, but that way you dismiss 1) the largest OSS 
project by linecount 2) one of the most performance-optimized pieces of 
OSS software in existence.

I.e. you dismiss what should have been GCC's strongest supporter, ally, 
test-field and early prototype tester. A lot of folks in the kernel look 
at the assembly level on a daily basis, they run the latest hardware and 
they see where GCC messes up. By dismissing Linux you cut yourself off 
from a lot of development power, you cut yourself off from a lot of 
enthusiasm and you miss out on a lot of dynamism that would naturally help 
GCC too.

I.e. almost by definition you cannot possibly be interested in writing the 
best compiler on the planet.

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-12 Thread Steven Rostedt



On Mon, 12 Jan 2009, Linus Torvalds wrote:

>   code tends to get a lot more I$ and D$ misses. Deep call-chains _will_ 

I feel like an idiot that I never realized that "I$" meant
"instruction cache" until now :-p



-- Steve


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-12 Thread Linus Torvalds


On Mon, 12 Jan 2009, Bernd Schmidt wrote:
>
> Too lazy to construct one myself, I googled for examples, and here's a 
> trivial one that shows how it affects the ability of the compiler to 
> eliminate memory references:

Do you really think this is realistic or even relevant?

The fact is

 (a) most people use similar types, so your example of "short" vs "int" is 
 actually not very common. Type-based alias analysis is wonderful for 
 finding specific examples of something you can optimize, but it's not 
 actually all that wonderful in general. It _particularly_ isn't 
 wonderful once you start looking at the downsides.

 When you're adding arrays of integers, you're usually adding 
 integers. Not "short"s. The shorts may be a great example of a 
 special case, but it's a special case!

 (b) instructions with memory accesses aren't the problem - instructions 
 that take cache misses are. Your example is an excellent example of 
 that - eliding the simple load out of the loop makes just about 
 absolutely _zero_ difference in any somewhat more realistic scenario, 
 because that one isn't the one that is going to make any real 
 difference anyway.

The thing is, the way to optimize for modern CPU's isn't to worry 
over-much about instruction scheduling. Yes, it matters for the broken 
ones, but it matters in the embedded world where you still find in-order 
CPU's, and there the size of code etc matters even more.

> I'll grant you that if you're writing a kernel or maybe a malloc
> library, you have reason to be unhappy about it.  But that's what
> compiler switches are for: -fno-strict-aliasing allows you to write code
> in a superset of C.

Oh, I'd use that flag regardless yes. But what you didn't seem to react to 
was that gcc - for no valid reason what-so-ever - actually trusts (or at 
least trusted: I haven't looked at that code for years) provably true 
static alias information _less_ than the idiotic weaker type-based one.

You make all this noise about how type-based alias analysis improves code, 
but then you can't seem to just look at the example I gave you. Type-based 
alias analysis didn't improve code. It just made things worse, for no 
actual gain. Moving those accesses to the stack around just causes worse 
behavior, and a bigger stack frame, which causes more cache misses.

[ Again, I do admit that kernel code is "different": we tend to have a 
  cold stack, in ways that many other code sequences do not have. System 
  code tends to get a lot more I$ and D$ misses. Deep call-chains _will_ 
  take cache misses on the stack, simply because the user will do things 
  between system calls or page faults that almost guarantees that things 
  are not in L1, and often not in L2 either.

  Also, sadly, microbenchmarks often hide this, since they are often 
  exactly the unrealistic kinds of back-to-back system calls that almost 
  no real program ever has, since real programs actually _do_ something 
  with the data. ]

My point is, you're making all these arguments and avoiding looking at the 
downsides of what you are arguing for.

So we use -Os - because it generally generates better (and simpler) code. 
We use -fno-strict-alias for the same reason. 

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-12 Thread Linus Torvalds


On Mon, 12 Jan 2009, Jamie Lokier wrote:
> 
> Sometimes code motion makes code faster and/or smaller but use more
> stack space.  If you want to keep the stack use down, it blocks some
> other optimisations.

Uhh. Yes. Compiling is an exercise in trade-offs.

That doesn't mean that you should try to find the STUPID trade-offs, 
though.

The thing is, there is no excuse for gcc's stupid alias analysis. Other 
compilers actually take advantage of things like the C standards type 
alias ambiguity by (a) realizing that it's insane as a general thing and 
(b) limiting it to the real special cases, like assuming that pointers to 
floats and pointers to integers do not alias.

That, btw, is where the whole concept comes from. It should be passed off 
as an "unsafe FP optimization", where it actually makes sense, exactly 
like a lot of other unsafe FP optimizations.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-12 Thread Bernd Schmidt
Linus Torvalds wrote:
> But you'll need some background to it:

You paint a somewhat one-sided picture bordering on FUD.

> Type-based aliasing is _stupid_.

Type-based aliasing is simply an application of the language definition,
and depending on the compiled application and/or target architecture, it
can be essential for performance.

It's _hard_ to tell whether two memory accesses can possibly conflict,
and the ability to decide based on type makes a vast difference.  This
is not, as you suggest in another post, simply a mild inconvenience for
the compiler that restricts scheduling a bit and forces the hardware
sort it out at run-time.  Too lazy to construct one myself, I googled
for examples, and here's a trivial one that shows how it affects the
ability of the compiler to eliminate memory references:

  typedef struct
  {
short a, b, c;
  } Sample;

  void test(int* values, Sample *uniform, int count)
  {
   int i;

   for (i=0;ib;
   }
  }

Type-based aliasing is what allows you to eliminate a load from the
loop.  Most users probably expect this kind of optimization from their
compiler, and it'll make a difference not just on Itanium.

I'll grant you that if you're writing a kernel or maybe a malloc
library, you have reason to be unhappy about it.  But that's what
compiler switches are for: -fno-strict-aliasing allows you to write code
in a superset of C.

> So gcc did. I know for a _fact_ that gcc would re-order write accesses 
> that were clearly to (statically) the same address. Gcc would suddenly 
> think that
> 
>   unsigned long a;
> 
>   a = 5;
>   *(unsigned short *)&a = 4;
> 
> could be re-ordered to set it to 4 first (because clearly they don't alias 
> - by reading the standard),

To be precise, what the standard says is that your example is not C, and
therefore has no meaning.  While this kind of thing does occur in the
wild, it is infrequent, and the programs that used this kind of code
have been fixed over the years.

gcc even warns about code such as the above with -Wall, which makes this
even more of a non-issue.

linus2.c: In function 'foo':
linus2.c:6: warning: dereferencing type-punned pointer will break
strict-aliasing rules

> and then because now the assignment of 'a=5' 
> was later, the assignment of 4 could be elided entirely! And if somebody 
> complains that the compiler is insane, the compiler people would say 
> "nyaah, nyaah, the standards people said we can do this", with absolutely 
> no introspection to ask whether it made any SENSE.

The thing is, yours is a trivial example, but try to think further: in
the general case the compiler can't tell whether two accesses can go to
the same address at runtime.  If it could, we wouldn't be having this
discussion; I'm pretty sure this question reduces to the halting
problem.  That's why the compiler must have a set of conservative rules
that allow it to decide that two accesses definitely _can't_ conflict.
For all standards conforming programs, type based aliasing is such a
rule.  You could add code to weaken it by also checking against the
address, but since that cannot be a reliable test that catches all
problematic cases, what would be the point?

So, in effect, if you're arguing that the compiler should detect the
above case and override the type-based aliasing based on the known
address, you're arguing that only subtle bugs in the application should
be exposed, not the obvious ones.  If you're arguing we should do away
with type-based aliasing altogether, you're ignoring the fact that there
are (a majority of) other users of gcc than the Linux kernel, they write
standards-conforming C, and they tend to worry about performance of
compiled code.

> The fact is, Linux uses -fno-strict-aliasing for a damn good reason: 
> because the gcc notion of "strict aliasing" is one huge stinking pile of 
> sh*t. Linux doesn't use that flag because Linux is playing fast and loose, 
> it uses that flag because _not_ using that flag is insane.

Not using this flag works for pretty much all user space applications
these days.

> Type-based aliasing is unacceptably stupid to begin with, and gcc took 
> that stupidity to totally new heights by making it actually more important 
> than even statically visible aliasing.

gcc makes use of statically visible aliasing if it can use it to prove
that two accesses can't conflict even if they have the same type, but
it's vastly less powerful than type based analysis.  Since it's
impossible in general to decide that two accesses must conflict, trying
to avoid transformations based on such an attempt is completely
senseless.  Trying to do so would have no effect for conforming C
programs, and avoid only a subset of the problematic cases for other
programs, so it's a waste of time.

So, to summarize: strict aliasing works for nearly every application
these days, there's a compiler switch for the rest to turn it off, it
can be a serious performance improvement, and the compiler warns about

Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-12 Thread Jamie Lokier
Linus Torvalds wrote:
> > This is about storage allocation, not aliases.  Storage allocation only
> > depends on lifetime.
> 
> Well, the thing is, code motion does extend life-times, and if you think 
> you can move stores across each other (even when you can see that they 
> alias statically) due to type-based alias decisions, that does essentially 
> end up making what _used_ to be disjoint lifetimes now be potentially 
> overlapping.

Sometimes code motion makes code faster and/or smaller but use more
stack space.  If you want to keep the stack use down, it blocks some
other optimisations.

Register allocation is similar: code motion optimisations may use more
registers due to overlapping lifetimes, which causes more register
spills and changes the code.  The two interact; it's not trivial to
optimise fully.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-12 Thread Linus Torvalds


On Mon, 12 Jan 2009, Bernd Schmidt wrote:
> 
> However, if the compiler chooses to put them into the same stack 
> location, the RTL-based alias analysis will happily conclude (based on 
> the differing types) that the reads from A and the writes to B can't 
> possibly conflict, and some passes may end up reordering them. End 
> result: overlapping lifetimes and overlapping stack slots.  Oops.

Yes, I came to the same conclusion.

Of course, I knew a-priori that the real bug was using type-based alais 
analysis to make (statically visible) aliasing decisions, but I realize 
that there are people who never understood things like that. Sadly, some 
of them worked on gcc.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-12 Thread Linus Torvalds


On Mon, 12 Jan 2009, Linus Torvalds wrote:
> 
> Type-based aliasing is unacceptably stupid to begin with, and gcc took 
> that stupidity to totally new heights by making it actually more important 
> than even statically visible aliasing.

Btw, there are good forms of type-based aliasing.

The 'restrict' keyword actually makes sense as a way to say "this pointer 
points to data that you cannot reach any other way". Of course, almost 
nobody uses it, and quite frankly, inlining can destroy that one too (a 
pointer that is restricted in the callEE is not necessarily restricted at 
all in the callER, and an inliner that doesn't track that subtle 
distinction will be very unhappy).

So compiler people usually don't much like 'restrict' - because it i very 
limited (you might even say restricted) in its meaning, and doesn't allow 
for nearly the same kind of wild optimizations than the insane standard C 
type-aliasing allows. 

The best option, of course, is for a compiler to handle just _static_ 
alias information that it can prove (whether by use of 'restrict' or by 
actually doing some fancy real analysis of its own allocations), and 
letting the hardware do run-time dynamic alias analysis.

I suspect gcc people were a bit stressed out by Itanium support - it's an 
insane architecture that basically requires an insane compiler for 
reasonable performance, and I think the Itanium people ended up 
brain-washing a lot of people who might otherwise have been sane.

So maybe I should blame Intel. Or HP. Because they almost certainly were 
at least a _part_ reason for bad compiler decisions.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-12 Thread Bernd Schmidt
Linus Torvalds wrote:
> 
> On Mon, 12 Jan 2009, Bernd Schmidt wrote:
>> Something at the back of my mind said "aliasing".
>>
>> $ gcc linus.c -O2 -S ; grep subl linus.s
>> subl$1624, %esp
>> $ gcc linus.c -O2 -S -fno-strict-aliasing; grep subl linus.s
>> subl$824, %esp
>>
>> That's with 4.3.2.
> 
> Interesting. 
> 
> Nonsensical, but interesting.
> 
> Since they have no overlap in lifetime, confusing this with aliasing is 
> really really broken (if the functions _hadn't_ been inlined, you'd have 
> gotten the same address for the two variables anyway! So anybody who 
> thinks that they need different addresses because they are different types 
> is really really fundmantally confused!).

I've never really looked at the stack slot sharing code.  But I think
it's not hard to see what's going on: "no overlap in lifetime" may be a
temporary state.  Let's say you have

 {
   variable_of_some_type a;
   writes to a;
   other stuff;
   reads from a;
 }
 {
   variable_of_some_other_type b;
   writes to b;
   other stuff;
   reads from b;
 }

At the point where the compiler generates RTL, stack space has to be
allocated for variables A and B.  At this point the lifetimes are
non-overlapping.  However, if the compiler chooses to put them into the
same stack location, the RTL-based alias analysis will happily conclude
(based on the differing types) that the reads from A and the writes to B
can't possibly conflict, and some passes may end up reordering them.
End result: overlapping lifetimes and overlapping stack slots.  Oops.


Bernd
-- 
This footer brought to you by insane German lawmakers.
Analog Devices GmbH  Wilhelm-Wagenfeld-Str. 6  80807 Muenchen
Sitz der Gesellschaft Muenchen, Registergericht Muenchen HRB 40368
Geschaeftsfuehrer Thomas Wessel, William A. Martin, Margaret Seif
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-12 Thread Linus Torvalds


On Mon, 12 Jan 2009, H. Peter Anvin wrote:
> 
> This is about storage allocation, not aliases.  Storage allocation only
> depends on lifetime.

Well, the thing is, code motion does extend life-times, and if you think 
you can move stores across each other (even when you can see that they 
alias statically) due to type-based alias decisions, that does essentially 
end up making what _used_ to be disjoint lifetimes now be potentially 
overlapping.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-12 Thread Linus Torvalds


On Mon, 12 Jan 2009, Andi Kleen wrote:
>
> What I find nonsensical is that -fno-strict-aliasing generates
> better code here. Normally one would expect the compiler seeing
> more aliases with that option and then be more conservative
> regarding any sharing. But it seems to be the other way round
> here.

No, that's not the surprising part. And in fact, now that you mention it, 
I can even tell you why gcc does what it does.

But you'll need some background to it:

Type-based aliasing is _stupid_. It's so incredibly stupid that it's not 
even funny. It's broken. And gcc took the broken notion, and made it more 
so by making it a "by-the-letter-of-the-law" thing that makes no sense.

What happens (well, maybe it's fixed, but this was _literally_ what gcc 
used to do) is that the type-based aliasing overrode everything else, so 
if two accesses were to different types (and not in a union, and none of 
the types were "char"), then gcc "knew" that they clearly could not alias, 
and could thus wildly re-order accesses.

That's INSANE. It's so incredibly insane that people who do that should 
just be put out of their misery before they can reproduce. But real gcc 
developers really thought that it makes sense, because the standard allows 
it, and it gives the compiler the maximal freedom - because it can now do 
things that are CLEARLY NONSENSICAL.

And to compiler people, being able to do things that are clearly 
nonsensical seems to often be seen as a really good thing, because it 
means that they no longer have to worry about whether the end result works 
or not - they just got permission to do stupid things in the name of 
optimization.

So gcc did. I know for a _fact_ that gcc would re-order write accesses 
that were clearly to (statically) the same address. Gcc would suddenly 
think that

unsigned long a;

a = 5;
*(unsigned short *)&a = 4;

could be re-ordered to set it to 4 first (because clearly they don't alias 
- by reading the standard), and then because now the assignment of 'a=5' 
was later, the assignment of 4 could be elided entirely! And if somebody 
complains that the compiler is insane, the compiler people would say 
"nyaah, nyaah, the standards people said we can do this", with absolutely 
no introspection to ask whether it made any SENSE.

Anyway, once you start doing stupid things like that, and once you start 
thinking that the standard makes more sense than a human being using his 
brain for 5 seconds, suddenly you end up in a situation where you can move 
stores around wildly, and it's all 'correct'.

Now, take my stupid example, and make "fn1()" do "a.a = 1" and make 
"fn2()" do "b.b = 2", and think about what a compiler that thinks it can 
re-order the two writes willy-nilly will do?

Right. It will say "ok, a.a and b.b can not alias EVEN IF THEY HAVE 
STATICALLY THE SAME ADDFRESS ON THE STACK", because they are in two 
different structres. So we can then re-order the accesses, and move the 
stores around.

Guess what happens if you have that kind of insane mentality, and you then 
try to make sure that they really don't alias, so you allocate extra stack 
space.

The fact is, Linux uses -fno-strict-aliasing for a damn good reason: 
because the gcc notion of "strict aliasing" is one huge stinking pile of 
sh*t. Linux doesn't use that flag because Linux is playing fast and loose, 
it uses that flag because _not_ using that flag is insane.

Type-based aliasing is unacceptably stupid to begin with, and gcc took 
that stupidity to totally new heights by making it actually more important 
than even statically visible aliasing.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-12 Thread H. Peter Anvin

Andi Kleen wrote:

On Mon, Jan 12, 2009 at 11:02:17AM -0800, Linus Torvalds wrote:

Something at the back of my mind said "aliasing".

$ gcc linus.c -O2 -S ; grep subl linus.s
subl$1624, %esp
$ gcc linus.c -O2 -S -fno-strict-aliasing; grep subl linus.s
subl$824, %esp

That's with 4.3.2.
Interesting. 


Nonsensical, but interesting.


What I find nonsensical is that -fno-strict-aliasing generates
better code here. Normally one would expect the compiler seeing
more aliases with that option and then be more conservative
regarding any sharing. But it seems to be the other way round
here.


For this to be convolved with aliasing *AT ALL* indicates this is done 
incorrectly.


This is about storage allocation, not aliases.  Storage allocation only 
depends on lifetime.


-hpa

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-12 Thread Andi Kleen
On Mon, Jan 12, 2009 at 11:02:17AM -0800, Linus Torvalds wrote:
> > Something at the back of my mind said "aliasing".
> > 
> > $ gcc linus.c -O2 -S ; grep subl linus.s
> > subl$1624, %esp
> > $ gcc linus.c -O2 -S -fno-strict-aliasing; grep subl linus.s
> > subl$824, %esp
> > 
> > That's with 4.3.2.
> 
> Interesting. 
> 
> Nonsensical, but interesting.

What I find nonsensical is that -fno-strict-aliasing generates
better code here. Normally one would expect the compiler seeing
more aliases with that option and then be more conservative
regarding any sharing. But it seems to be the other way round
here.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-12 Thread Linus Torvalds


On Mon, 12 Jan 2009, Bernd Schmidt wrote:
> 
> Something at the back of my mind said "aliasing".
> 
> $ gcc linus.c -O2 -S ; grep subl linus.s
> subl$1624, %esp
> $ gcc linus.c -O2 -S -fno-strict-aliasing; grep subl linus.s
> subl$824, %esp
> 
> That's with 4.3.2.

Interesting. 

Nonsensical, but interesting.

Since they have no overlap in lifetime, confusing this with aliasing is 
really really broken (if the functions _hadn't_ been inlined, you'd have 
gotten the same address for the two variables anyway! So anybody who 
thinks that they need different addresses because they are different types 
is really really fundmantally confused!).

But your numbers are unambiguous, and I can see the effect of that 
compiler flag myself.

The good news is that the kernel obviously already uses 
-fno-strict-aliasing for other reasonds, so we should see this effect 
already, _despite_ it making no sense. And the stack usage still causes 
problems.

Oh, and I see why. This test-case shows it clearly.

Note how the max stack usage _should_ be "struct b" + "struct c". Note how 
it isn't (it's "struct a" + "struct b/c").

So what seems to be going on is that gcc is able to do some per-slot 
sharing, but if you have one function with a single large entity, and 
another with a couple of different ones, gcc can't do any smart 
allocation.

Put another way: gcc doesn't create a "union of the set of different stack 
usages" (which would be optimal given a single frame, and generate the 
stack layout of just the maximum possible size), it creates a "set of 
unions of different stack usages" (which can be optimal in the trivial 
cases, but not nearly optimal in practical cases).

That explains the ioctl behavior - the structure use is usually pretty 
complicated (ie it's almost never about just _one_ large stack slot, but 
the ioctl cases tend to do random stuff with multiple slots).

So it doesn't add up to some horrible maximum of all sizes, but it also 
doesn't end up coalescing stack usage very well.

Linus
---
struct a {
int a;
unsigned long array[200];
};

struct b {
int b;
unsigned long array[100];
};

struct c {
int c;
unsigned long array[100];
};

extern int fn3(int, void *);
extern int fn4(int, void *);

static inline __attribute__ ((always_inline))
int fn1(int flag)
{
struct a a;
return fn3(flag, &a);
}

static inline __attribute__ ((always_inline))
int fn2(int flag)
{
struct b b;
struct c c; 
return fn4(flag, &b) + fn4(flag, &c);
}

int fn(int flag)
{
fn1(flag);
if (flag & 1)
return 0;
return fn2(flag);
}
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-12 Thread Bernd Schmidt
Andi Kleen wrote:
> On Sun, Jan 11, 2009 at 04:21:03PM -0800, Linus Torvalds wrote:
>>
>> On Mon, 12 Jan 2009, Andi Kleen wrote:
>>> so at least least for this case it works. Your case also doesn't work 
>>> for me. So it looks like gcc didn't like something you did in your test 
>>> program.
>> I very intentionally used _different_ types.
>>
>> If you use the same type, gcc will apparenrly happily say "hey, I can 
>> combine two variables of the same type with different liveness into the 
>> same variable".
> 
> Confirmed.
> 
>> But that's not the interesting case.
> 
> Weird. I wonder where this strange restriction comes from.

Something at the back of my mind said "aliasing".

$ gcc linus.c -O2 -S ; grep subl linus.s
subl$1624, %esp
$ gcc linus.c -O2 -S -fno-strict-aliasing; grep subl linus.s
subl$824, %esp

That's with 4.3.2.


Bernd
-- 
This footer brought to you by insane German lawmakers.
Analog Devices GmbH  Wilhelm-Wagenfeld-Str. 6  80807 Muenchen
Sitz der Gesellschaft Muenchen, Registergericht Muenchen HRB 40368
Geschaeftsfuehrer Thomas Wessel, William A. Martin, Margaret Seif
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-12 Thread Ingo Molnar

* Jamie Lokier  wrote:

> Ingo Molnar wrote:
> > If it's used once in a single .c file it should be inlined even if
> > it's large.
> 
> As Linus has pointed out, because of GCC not sharing stack among 
> different inlined functions, the above is surprisingly not true.

Yes, but note that this has no relevance to the specific case of 
CONFIG_OPTIMIZE_INLINING: GCC can at most decide to inline _less_, not 
more. I.e. under CONFIG_OPTIMIZE_INLINING we can only end up having less 
stack sharing trouble.

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Hard to debug kernel issues (was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning)

2009-01-11 Thread Chris Samuel
On Sun, 11 Jan 2009 11:26:41 pm David Woodhouse wrote:

> Sometimes you weren't going to get a backtrace if something goes wrong
> _anyway_.

Case in point - we've been struggling with some of our SuperMicro based 
systems with AMD Barcelona B3 k10h CPUs *turning themselves off* when running 
various HPC applications.

Nothing in the kernel logs, nothing in the IPMI controller logs. It's just 
like someone has wandered in and held the power button down (and no, it's not 
that).

It's been driving us up the wall.

We'd assumed it was a hardware issue as it was happening with all sorts of 
kernels but today we tried 2.6.29-rc1 "just in case" and I've not been able to 
reproduce the crash (yet) on a node I can crash in about 30 seconds, and 
rebooting back into 2.6.28 makes it crash again.

If the test boxes are still alive tomorrow I might see if we can attempt some 
form of a reverse bisect to track down what commit fixed it (git doesn't seem 
to support that so we've going to have to invert the good/bad commands).

cheers,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic.
For more info see: http://en.wikipedia.org/wiki/OpenPGP



signature.asc
Description: This is a digitally signed message part.


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-11 Thread Jamie Lokier
Ingo Molnar wrote:
> If it's used once in a single .c file it should be inlined even if
> it's large.

As Linus has pointed out, because of GCC not sharing stack among
different inlined functions, the above is surprisingly not true.

In kernel it's a problem due to raw stack usage.

In userspace apps (where stack used is larger), inlining single-call
functions could, paradoxically, run slower due to increased stack
dcache pressure for some larger functions.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-11 Thread H. Peter Anvin
Andi Kleen wrote:
> 
> Weird. I wonder where this strange restriction comes from. It indeed
> makes this much less useful than it could be :/
> 

Most likely they're collapsing at too early of a stage.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-11 Thread Andi Kleen
On Sun, Jan 11, 2009 at 04:21:03PM -0800, Linus Torvalds wrote:
> 
> 
> On Mon, 12 Jan 2009, Andi Kleen wrote:
> > 
> > so at least least for this case it works. Your case also doesn't work 
> > for me. So it looks like gcc didn't like something you did in your test 
> > program.
> 
> I very intentionally used _different_ types.
> 
> If you use the same type, gcc will apparenrly happily say "hey, I can 
> combine two variables of the same type with different liveness into the 
> same variable".

Confirmed.

> But that's not the interesting case.

Weird. I wonder where this strange restriction comes from. It indeed
makes this much less useful than it could be :/

-Andi
-- 
a...@linux.intel.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-11 Thread Linus Torvalds


On Mon, 12 Jan 2009, Andi Kleen wrote:
> 
> so at least least for this case it works. Your case also doesn't work 
> for me. So it looks like gcc didn't like something you did in your test 
> program.

I very intentionally used _different_ types.

If you use the same type, gcc will apparenrly happily say "hey, I can 
combine two variables of the same type with different liveness into the 
same variable".

But that's not the interesting case.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-11 Thread Andi Kleen
On Sun, Jan 11, 2009 at 03:05:53PM -0800, Linus Torvalds wrote:
> 
> 
> On Sun, 11 Jan 2009, Linus Torvalds wrote:
> > On Sun, 11 Jan 2009, Andi Kleen wrote:
> > > 
> > > Was -- i think that got fixed in gcc. But again only in newer versions.
> > 
> > I doubt it. People have said that about a million times, it has never 
> > gotten fixed, and I've never seen any actual proof.
> 
> In fact, I just double-checked.
> Try this:

Hmm, I actually had tested it some time ago with my own program
(supposed to emulate an ioctl)

extern void f5(char *);

static void f3(void)
{
char y[100];
f5(y);
}

static void f2(void)
{
char x[100];
f5(x);
}

int f(int cmd)
{
switch (cmd) { 
case 1: f3(); break;
case 2: f2(); break;
}
return 1;
}

and with gcc 4.3.1 I get:

.globl f
.type   f, @function
f:
.LFB4:
subq$120, %rsp  < not 200 bytes, stack gets 
reused; dunno where the 20 comes from
.LCFI0:
cmpl$1, %edi
je  .L4
cmpl$2, %edi
je  .L4
movl$1, %eax
addq$120, %rsp
ret
.p2align 4,,10
.p2align 3
.L4:
movq%rsp, %rdi
callf5
movl$1, %eax
addq$120, %rsp
ret
.LFE4:

so at least least for this case it works. Your case also doesn't work for me.
So it looks like gcc didn't like something you did in your test program.
Could be a pointer aliasing problem of some sort.

But yes it doesn't work as well as we hoped.

-Andi
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-11 Thread Linus Torvalds


On Sun, 11 Jan 2009, Linus Torvalds wrote:
> On Sun, 11 Jan 2009, Andi Kleen wrote:
> > 
> > Was -- i think that got fixed in gcc. But again only in newer versions.
> 
> I doubt it. People have said that about a million times, it has never 
> gotten fixed, and I've never seen any actual proof.

In fact, I just double-checked.

Try this:

struct a {
unsigned long array[200];
int a;
};

struct b {
int b;
unsigned long array[200];
};

extern int fn3(int, void *);
extern int fn4(int, void *);

static inline __attribute__((always_inline)) int fn1(int flag)
{
struct a a;
return fn3(flag, &a);
}

static inline __attribute__((always_inline)) int fn2(int flag)
{
struct b b;
return fn4(flag, &b);
}

int fn(int flag)
{
if (flag & 1)
return fn1(flag);
return fn2(flag);
}

(yeah, I made sure it would inline with "always_inline" just so that the 
issue wouldn't be hidden by any "avoid stack frames" flags).

Gcc creates a big stack frame that contains _both_ 'a' and 'b', and does 
not merge the allocations together even though they clearly have no 
overlap in usage. Both 'a' and 'b' get 201 long-words (1608 bytes) of 
stack, causing the inlined version to have 3kB+ of stack, even though the 
non-inlined one would never use more than half of it.

So please stop claiming this is fixed. It's not fixed, never has been, and 
quite frankly, probably never will be because the lifetime analysis is 
hard enough (ie once you inline and there is any complex usage, CSE etc 
will quite possibly mix up the lifetimes - the above is clearly not any 
_realistic_ example).

So even if the above trivial case could be fixed, I suspect a more complex 
real-life case would still keep the allocations separate. Because merging 
the allocations and re-using the same stack for both really is pretty 
non-trivial, and the best solution is to simply not inline.

(And yeah, the above is such an extreme case that gcc seems to realize 
that it makes no sense to inline because the stack frame is _so_ big. I 
don't know what the default stack frame limit is, but it's apparently 
smaller than 1.5kB ;)

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-11 Thread Linus Torvalds


On Sun, 11 Jan 2009, Andi Kleen wrote:
> 
> Was -- i think that got fixed in gcc. But again only in newer versions.

I doubt it. People have said that about a million times, it has never 
gotten fixed, and I've never seen any actual proof.

I think that what got fixed was that gcc now at least re-uses stack slots 
for temporary spills. But only for things that fit in registers - not if 
you actually had variables that are big enough to be of type MEM. And the 
latter is what tends to eat stack-space (ie structures etc on stack).

But hey, maybe it really did get fixed. But the last big stack user wasn't 
that long ago, and I saw it and I have a pretty recent gcc (gcc-4.3.2 
right now, it could obviously have been slightly older back a few months 
ago).

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-11 Thread Andi Kleen
> Isn't the ioctl switch stack issue a separate GCC bug?
> 
> It was/is assigning assigning separate space for local variables which

Was -- i think that got fixed in gcc. But again only in newer versions.

> are mutually exclusive. So instead of the stack footprint of the
> function with the switch() being equal to the largest individual stack
> size of all the subfunctions, it's equal to the _sum_ of the stack sizes
> of the subfunctions. Even though it'll never use them all at the same
> time.
> 
> Without that bug, it would have been harmless to inline them all.

True.

-Andi
-- 
a...@linux.intel.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-11 Thread David Woodhouse
On Sun, 2009-01-11 at 21:14 +0100, Andi Kleen wrote:
> 
> On the other hand (my personal opinion, not shared by everyone) is 
> that the ioctl switch stack issue is mostly only a problem with 4K 
> stacks and in the rare cases when I still run 32bit kernels
> I never set that option because I consider it russian roulette
> (because there undoutedly dangerous dynamic stack growth cases that 
> checkstack.pl doesn't flag) 

Isn't the ioctl switch stack issue a separate GCC bug?

It was/is assigning assigning separate space for local variables which
are mutually exclusive. So instead of the stack footprint of the
function with the switch() being equal to the largest individual stack
size of all the subfunctions, it's equal to the _sum_ of the stack sizes
of the subfunctions. Even though it'll never use them all at the same
time.

Without that bug, it would have been harmless to inline them all.

-- 
David WoodhouseOpen Source Technology Centre
david.woodho...@intel.com  Intel Corporation

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-11 Thread Andi Kleen
On Sun, Jan 11, 2009 at 11:25:32AM -0800, Linus Torvalds wrote:
> 
> 
> On Sun, 11 Jan 2009, Andi Kleen wrote:
> > 
> > The proposal was to use -fno-inline-functions-called-once (but 
> > the resulting numbers were not promising)
> 
> Well, the _optimal_ situation would be to not need it, because gcc does a 
> good job without it. That implies trying to find a better balance between 
> "worth it" and "causes problems". 
> 
> Rigth now, it does sound like gcc simply doesn't try to balance AT ALL, or 
> balances only when we add some very version-specific random options (ie 
> the stack-usage one). 

The gcc 4.3 inliner takes stack growth into account by default (without
any special options). I experimented a bit with it when that
was introduced and found the default thresholds are too large for the kernel 
and don't change the checkstack.pl picture much.

I asked back then and was told --param large-stack-frame
is expected to be a reasonable stable --param (as much as these can
be) and I did a patch to lower it, but I couldn't get myself
to actually submit it [if you really want it I can send it]. 

But of course that only helps for gcc 4.3+, older gccs would need
a different workaround.

On the other hand (my personal opinion, not shared by everyone) is 
that the ioctl switch stack issue is mostly only a problem with 4K 
stacks and in the rare cases when I still run 32bit kernels
I never set that option because I consider it russian roulette
(because there undoutedly dangerous dynamic stack growth cases that 
checkstack.pl doesn't flag) 

> And even those options don't actually make much 
> sense - yes, they "balance" things, but they don't do it in a sensible 
> manner.
> 
> For example: stack usage is undeniably a problem (we've hit it over and 
> over again), but it's not about "stack must not be larger than X bytes". 
> 
> If the call is done unconditionally, then inlining _one_ function will 
> grow the static stack usage of the function we inline into, but it will 
> _not_ grow the dynamic stack usage one whit - so deciding to not inline 
> because of stack usage is pointless.

Don't think the current inliner takes that into account from
a quick look at the sources, although it probably could.  
Maybe Honza can comment.

But even if it did it could only do that for a single file,
but if the function is in a different file gcc doesn't
have the information (unless you run with David's --combine hack).
This means the kernel developers have to do it anyways.

On the other hand I'm not sure it's that big a problem. Just someone
should run make checkstack occasionally and add noinlines to large
offenders.

-Andi

[keep quote for Honza's benefit]

> 
> See? So "stop inlining when you hit a stack limit" IS THE WRONG THING TO 
> DO TOO! Because it just means that the compiler continues to do bad 
> inlining decisions until it hits some magical limit - but since the 
> problem isn't the static stack size of any _single_ function, but the 
> combined stack size of a dynamic chain of them, that's totally idiotic. 
> You still grew the dynamic stack, and you have no way of knowing by how 
> much - the limit on the static one simply has zero bearing what-so-ever on 
> the dynamic one.
> 
> So no, "limit static stack usage" is not a good option, because it stops 
> inlining when it doesn't matter (single unconditional call), and doesn't 
> stop inlining when it might (lots of sequential calls, in a deep chain).
> 
> The other alternative is to let gcc do what it does, but 
> 
>  (a) remove lots of unnecessary 'inline's. And we should likely do this 
>  regardless of any "-fno-inline-functions-called-once" issues.
> 
>  (b) add lots of 'noinline's to avoid all the cases where gcc screws up so 
>  badly that it's either a debugging disaster or an actual correctness 
>  issue.
> 
> The problem with (b) is that it's a lot of hard thinking, and debugging 
> disasters always happen in code that you didn't realize would be a problem 
> (because if you had, it simply wouldn't be the debugging issue it is).
> 
>   Linus
> 

-- 
a...@linux.intel.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-11 Thread Linus Torvalds


On Sun, 11 Jan 2009, Andi Kleen wrote:
> 
> The proposal was to use -fno-inline-functions-called-once (but 
> the resulting numbers were not promising)

Well, the _optimal_ situation would be to not need it, because gcc does a 
good job without it. That implies trying to find a better balance between 
"worth it" and "causes problems". 

Rigth now, it does sound like gcc simply doesn't try to balance AT ALL, or 
balances only when we add some very version-specific random options (ie 
the stack-usage one). And even those options don't actually make much 
sense - yes, they "balance" things, but they don't do it in a sensible 
manner.

For example: stack usage is undeniably a problem (we've hit it over and 
over again), but it's not about "stack must not be larger than X bytes". 

If the call is done unconditionally, then inlining _one_ function will 
grow the static stack usage of the function we inline into, but it will 
_not_ grow the dynamic stack usage one whit - so deciding to not inline 
because of stack usage is pointless.

See? So "stop inlining when you hit a stack limit" IS THE WRONG THING TO 
DO TOO! Because it just means that the compiler continues to do bad 
inlining decisions until it hits some magical limit - but since the 
problem isn't the static stack size of any _single_ function, but the 
combined stack size of a dynamic chain of them, that's totally idiotic. 
You still grew the dynamic stack, and you have no way of knowing by how 
much - the limit on the static one simply has zero bearing what-so-ever on 
the dynamic one.

So no, "limit static stack usage" is not a good option, because it stops 
inlining when it doesn't matter (single unconditional call), and doesn't 
stop inlining when it might (lots of sequential calls, in a deep chain).

The other alternative is to let gcc do what it does, but 

 (a) remove lots of unnecessary 'inline's. And we should likely do this 
 regardless of any "-fno-inline-functions-called-once" issues.

 (b) add lots of 'noinline's to avoid all the cases where gcc screws up so 
 badly that it's either a debugging disaster or an actual correctness 
 issue.

The problem with (b) is that it's a lot of hard thinking, and debugging 
disasters always happen in code that you didn't realize would be a problem 
(because if you had, it simply wouldn't be the debugging issue it is).

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-11 Thread Andi Kleen
On Sun, Jan 11, 2009 at 12:26:41PM +, David Woodhouse wrote:
>   - Unconditionally have 'inline' meaning 'always_inline'. If we say it,
> we should mean it.
> 
>   - Resist the temptation to use -fno-inline-functions. Allow GCC to
> inline other things if it wants to.

The proposal was to use -fno-inline-functions-called-once (but 
the resulting numbers were not promising)

We've never allowed gcc to inline any other functions not marked
inline explicitely because that's not included in -O2. 

>   - Reduce the number of unnecessary 'inline' markers, and have a policy
> that the use of 'inline' should be accompanied by either a GCC PR#
> or an explanation of why we couldn't reasonably have expected GCC to
> get this particular case right.
> 
>   - Have a similar policy of PR# or explanation for 'uninline' too.
> 
> I don't think we should just give up on GCC ever getting it right. That
> way lies madness. As we've often found in the past. 

It sounds like you're advocating to set -O3/-finline-functions
by default.   Not sure that's a good idea.

-Andi

-- 
a...@linux.intel.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-11 Thread David Woodhouse
On Sat, 2009-01-10 at 04:02 +0100, Andi Kleen wrote:
> Long term that problem will hopefully disappear, as gcc learns to do cross
> source file inlining (like a lot of other compilers already do)

We've already been able to get GCC doing this for the kernel, in fact
(the --combine -fwhole-program stuff I was working on a while back).

It gives an interesting size reduction, especially in file systems and
other places where we tend to have functions with a single call site...
but in a different file.

Linus argues that we don't want that kind of inlining because it harms
debuggability, but that isn't _always_ true. Sometimes you weren't going
to get a backtrace if something goes wrong _anyway_. And even if the
size reduction doesn't necessarily give a corresponding performance
improvement, we might not care. In the embedded world, size does matter
too. And the numbers are such that you can happily keep debuginfo for
the shipped kernel builds and postprocess any backtraces you get. Just
as we can for distros.


In general, I would much prefer being able to trust the compiler, rather
than disabling its heuristics completely. We might not be able to trust
it right now, but we should be working towards that state. Not just
declaring that we know best, even though _sometimes_ we do.

I think we should:

  - Unconditionally have 'inline' meaning 'always_inline'. If we say it,
we should mean it.

  - Resist the temptation to use -fno-inline-functions. Allow GCC to
inline other things if it wants to.

  - Reduce the number of unnecessary 'inline' markers, and have a policy
that the use of 'inline' should be accompanied by either a GCC PR#
or an explanation of why we couldn't reasonably have expected GCC to
get this particular case right.

  - Have a similar policy of PR# or explanation for 'uninline' too.

I don't think we should just give up on GCC ever getting it right. That
way lies madness. As we've often found in the past. 

-- 
David WoodhouseOpen Source Technology Centre
david.woodho...@intel.com  Intel Corporation

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-10 Thread Ingo Molnar

* Linus Torvalds  wrote:

> On Fri, 9 Jan 2009, H. Peter Anvin wrote:
> > 
> > I was thinking about experimenting with this, to see what level of 
> > upside it might add.  Ingo showed me numbers which indicate that a 
> > fairly significant fraction of the cases where removing inline helps 
> > is in .h files, which would require code movement to fix.  Hence to 
> > see if it can be automated.
> 
> We _definitely_ have too many inline functions in headers. They usually 
> start out small, and then they grow. And even after they've grown big, 
> it's usually not at all clear exactly where else they should go, so even 
> when you realize that "that shouldn't be inlined", moving them and 
> making them uninlined is not obvious.
> 
> And quite often, some of them go away - or at least shrink a lot - when 
> some config option or other isn't set. So sometimes it's an inline 
> because a certain class of people really want it inlined, simply because 
> for _them_ it makes sense, but when you enable debugging or something, 
> it absolutely explodes.

IMO it's all quite dynamic when it comes to inlining.

Beyond the .config variances (which alone is enough degrees of freedom to 
make this non-static, it at least is a complexity we can control in the 
kernel to a certain degree) it also depends on the platform, the CPU type, 
the compiler version - factors which we dont (and probably dont want to) 
control.

There's also the in-source variance of how many times an inline function 
is used within a .c file, and that factor is not easily tracked. If it's 
used once in a single .c file it should be inlined even if it's large. If 
it's used twice in a .c file it might be put out of line. Transition 
between those states is not obvious in all cases.

There's certainly clear-cut cases: the very small/constant ones that must 
be short and inlined in any environment, and the very large/complex ones 
that must not be inlined under any circumstance.

But there's a lot of shades of grey inbetween - and that's where the size 
wins come from. I'm not sure we can (or should) generally expect kernel 
coders to continuously maintain the 30,000+ inline attributes in the 
kernel that involve 100,000+ functions:

 - Nothing breaks if it's there, nothing breaks if it's not there.
   It's a completely opaque, transparent entity that never pushes itself 
   to the foreground of human attention.

 - It's so easy to add an inline function call site to a .c file without 
   noticing that it should not be inlined anymore.

 - It's so easy to _remove_ a usage site from a .c file without noticing
   that something should be inlined. I.e. local changes will have an 
   effect on the inline attribute _elsewhere_ - and this link is not 
   obvious and not tooled when editing the code.

 - The mapping from C statements to assembly can be non-obvious even to
   experienced developers. Data type details (signed/unsigned, width,
   etc.) can push an inline function over the (very hard to define)
   boundary.

I.e. IMO it's all very dynamic, it's opaque, it's not visualized and it's 
hard to track - so it's very fundamentally not for humans to maintain 
[except for the really clear-cut cases].

Add to that that in _theory_ the decision to inline or not is boringly 
mechanic and tools ought to be able to do a near-perfect job with it and 
just adopt to whatever environment the kernel is in at a given moment when 
it's built.

GCC limps along with its annoyingly mis-designed inlining heuristics, 
hopefully LLVC will become a real compiler that is aware of little details 
like instruction size and has a built-in assembler ...

So IMO all the basic psychological mechanics are missing from the picture 
that would result in really good, "self-maintained" inline attributes.

We can try to inject the requirement to have good inline attributes as an 
external rule, as a principle we want to see met - but we cannot expect it 
to be followed really in its current form, as it goes subtly against the 
human nature on various levels.

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-10 Thread Jeremy Fitzhardinge

Linus Torvalds wrote:

Actually, the real spin locks are now fair. We use ticket locks on x86.

Well, at least we do unless you enable that broken paravirt support. I'm 
not at all clear on why CONFIG_PARAVIRT wants to use inferior locks, but I 
don't much care.
  


No, it will continue to use ticket locks, but there's the option to 
switch to byte locks or something else.  Ticket locks are awesomely bad 
when the VCPU scheduler fights with the run-order required by the ticket 
order.


   J
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread H. Peter Anvin
Ingo Molnar wrote:
> 
>  - Headers could probably go back to 'extern inline' again. At not small 
>expense - we just finished moving to 'static inline'. We'd need to 
>guarantee a library instantiation for every header include file - this 
>is an additional mechanism with additional introduction complexities 
>and an ongoing maintenance cost.
> 

I think I have a pretty clean idea for how to do this.  I'm going to
experiment with it over the next few days.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread H. Peter Anvin
Linus Torvalds wrote:
> 
> And quite often, some of them go away - or at least shrink a lot - when 
> some config option or other isn't set. So sometimes it's an inline because 
> a certain class of people really want it inlined, simply because for 
> _them_ it makes sense, but when you enable debugging or something, it 
> absolutely explodes.
> 

And this is really why getting static inline annotations right is really
hard if not impossible in the general case (especially when considering
the sheer number of architectures we compile on.)  So making it possible
for the compiler to do the right thing for at least this class of
functions really does seem like a good idea.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Linus Torvalds


On Fri, 9 Jan 2009, H. Peter Anvin wrote:
> 
> I was thinking about experimenting with this, to see what level of
> upside it might add.  Ingo showed me numbers which indicate that a
> fairly significant fraction of the cases where removing inline helps is
> in .h files, which would require code movement to fix.  Hence to see if
> it can be automated.

We _definitely_ have too many inline functions in headers. They usually 
start out small, and then they grow. And even after they've grown big, 
it's usually not at all clear exactly where else they should go, so even 
when you realize that "that shouldn't be inlined", moving them and making 
them uninlined is not obvious.

And quite often, some of them go away - or at least shrink a lot - when 
some config option or other isn't set. So sometimes it's an inline because 
a certain class of people really want it inlined, simply because for 
_them_ it makes sense, but when you enable debugging or something, it 
absolutely explodes.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread H. Peter Anvin
Linus Torvalds wrote:
> 
> There's none. In fact, it's wrong, unless you _also_ have an extern 
> definition (according to the "new" gcc rules as of back in the days).
> 
> Of course, as long as "inline" really means _always_ inline, it won't 
> matter. So in that sense Ingo is right - we _could_. Which has no bearing 
> on whether we _should_, of course.
> 

I was thinking about experimenting with this, to see what level of
upside it might add.  Ingo showed me numbers which indicate that a
fairly significant fraction of the cases where removing inline helps is
in .h files, which would require code movement to fix.  Hence to see if
it can be automated.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Andi Kleen
> A side-effect of the inline fetish is that a lot of it goes into header
> files, thus requiring that those header files #include lots of other
> headers, thus leading to, well, the current mess.

I personally also always found it annoying while grepping that
part of the code is in a completely different directory.
e.g. try to go through TCP code without jumping to .hs all the time.

Long term that problem will hopefully disappear, as gcc learns to do cross
source file inlining (like a lot of other compilers already do)

-Andi

-- 
a...@linux.intel.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Andrew Morton
On Sat, 10 Jan 2009 02:01:25 +0100 Ingo Molnar  wrote:

> 
> * Linus Torvalds  wrote:
> 
> > On Sat, 10 Jan 2009, Ingo Molnar wrote:
> > > 
> > > may_inline/inline_hint is a longer, less known and uglier keyword.
> > 
> > Hey, your choice, should you decide to accept it, is to just get rid of 
> > them entirely.
> > 
> > You claim that we're back to square one, but that's simply the way 
> > things are. Either "inline" means something, or it doesn't. You argue 
> > for it meaning nothing. I argue for it meaning something.
> > 
> > If you want to argue for it meaning nothing, then REMOVE it, instead of 
> > breaking it.
> > 
> > It really is that simple. Remove the inlines you think are wrong. 
> > Instead of trying to change the meaning of them.
> 
> Well, it's not totally meaningless. To begin with, defining 'inline' to 
> mean 'always inline' is a Linux kernel definition. So we already changed 
> the behavior - in the hope of getting it right most of the time and in the 
> hope of thus improving the kernel.
> 
> And now it appears that in our quest of improving the kernel we can 
> further tweak that (already non-standard) meaning to a weak "inline if the 
> compiler agrees too" hint. That gives us an even more compact kernel. It 
> also moves the meaning of 'inline' closer to what the typical programmer 
> expects it to be - for better or worse.
> 
> We could remove them completely, but there are a couple of practical 
> problems with that:
> 
>  - In this cycle alone, in the past ~2 weeks we added another 1300 inlines
>to the kernel.

Who "reviewed" all that?

> Do we really want periodic postings of:
> 
>   [PATCH 0/135] inline removal cleanups
> 
>... in the next 10 years? We have about 20% of all functions in the 
>kernel marked with 'inline'. It is a _very_ strong habit. Is it worth 
>fighting against it?

A side-effect of the inline fetish is that a lot of it goes into header
files, thus requiring that those header files #include lots of other
headers, thus leading to, well, the current mess.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Linus Torvalds


On Sat, 10 Jan 2009, Ingo Molnar wrote:
> 
>  - 'static inline' functions in .c files that are not used cause no build 
>warnings - while if we change them to 'static', we get a 'defined but
>not used' warning. Hundreds of new warnings in the allyesconfig builds.

Well, duh. Maybe they shouldn't be marked "inline", and maybe they should 
be marked with "__maybe_unused" instead. 

I do not think it makes sense to use "inline" as a way to say "maybe I 
won't use this function".

Yes, it's true that "static inline" won't warn, but hey, as a way to avoid 
a warning it's a pretty bad one.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Linus Torvalds


On Sat, 10 Jan 2009, Ingo Molnar wrote:

> 
> * Linus Torvalds  wrote:
> 
> > On Sat, 10 Jan 2009, Ingo Molnar wrote:
> > > 
> > > Well, it's not totally meaningless. To begin with, defining 'inline' to 
> > > mean 'always inline' is a Linux kernel definition. So we already changed 
> > > the behavior - in the hope of getting it right most of the time and in 
> > > the 
> > > hope of thus improving the kernel.
> > 
> > Umm. No we didn't. We've never changed it. It was "always inline" back in 
> 
> s/changed the behavior/overrode the behavior

The point is, as far as the kenrel has been concerned, "inline" has always 
meant the same thing. Not just for the last few weeks, but for the last 18 
_years_. It's always meant "always inline".

> But i'd definitely argue that the Linux kernel definition of 'inline' was 
> always more consistent than GCC's. That was rather easy as well: it doesnt 
> get any more clear-cut than 'always inline'.

Exactly. And I argue that we shouldn't change it.

If we have too many "inline"s, and if gcc inlines for us _anyway_, then 
the answer is not to change the meaning of "inline", but simply say "ok, 
we have too many inlines. Let's remove the ones we don't care about".

> I think it is axiomatic that improving the kernel means changing it - 
> sometimes that means changing deep details. (And if you see us ignoring 
> complaints, let us know, it must not happen.)

And I'm agreeing with you.

What I'm _not_ agreeing with is how you want to change the semantics of 
something we've had for 18 years.

YOU are the one who want to make "inline" mean "maybe". I'm against it. 
I'm against it because it makes no sense. It's not what we've ever done.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Linus Torvalds


On Fri, 9 Jan 2009, Harvey Harrison wrote:

> On Sat, 2009-01-10 at 02:01 +0100, Ingo Molnar wrote:
> 
> >  - Headers could probably go back to 'extern inline' again. At not small 
> >expense - we just finished moving to 'static inline'. We'd need to 
> >guarantee a library instantiation for every header include file - this 
> >is an additional mechanism with additional introduction complexities 
> >and an ongoing maintenance cost.
> 
> Puzzled?  What benefit is there to going back to extern inline in headers?

There's none. In fact, it's wrong, unless you _also_ have an extern 
definition (according to the "new" gcc rules as of back in the days).

Of course, as long as "inline" really means _always_ inline, it won't 
matter. So in that sense Ingo is right - we _could_. Which has no bearing 
on whether we _should_, of course.

In fact, the whole mess with "extern inline" is a perfect example of why a 
inlining hit should be called "may_inline" or "inline_hint" or something 
like that.

Because then it actually makes sense to have "extern may_inline" with one 
definition, and another definition for the non-inline version.  And it's 
very clear what the deal is about, and why we literally have two versions 
of the same function.

But again, that's very much not a "let's use 'extern' instead of 
'static'". It's a totally different issue.

Linus

[ A third reason to use "extern inline" is actually a really evil one: we 
  could do it for our unrelated issue with system call definitions on 
  architectures that require the caller to sign-extend the arguments. 

  Since we don't control the callers of system calls, we can't do that, 
  and architectures like s390 actually have potential security holes due 
  to callers that don't "follow the rules". So there are different needs 
  for trusted - in-kernel - system call users that we know do the sign 
  extension correctly, and untrusted - user-mode callers that just call 
  through the system call function table.

  What we _could_ do is for the wrappers to use

extern inline int sys_open(const char *pathname, int flags, mode_t mode)
{
return SYS_open(pathname, mode);
}

  which gives the C callers the right interface without any unnecessary 
  wrapping, and then

long WRAP_open(const char *pathname, long flags, long mode)
{
return SYS_open(pathname, flags, mode);
}
asm ("\t.globl sys_alias\n\t.set WRAP_open");

  which is the one that gets linked from any asm code. So now asm code 
  and C code gets two different functions, even though they use the same 
  system call name - one with inline expansion, one with linker games. 

  Whee. The games we can play (and the odd reasons we must play them). ]
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Ingo Molnar

* Linus Torvalds  wrote:

> On Sat, 10 Jan 2009, Ingo Molnar wrote:
> > 
> > Well, it's not totally meaningless. To begin with, defining 'inline' to 
> > mean 'always inline' is a Linux kernel definition. So we already changed 
> > the behavior - in the hope of getting it right most of the time and in the 
> > hope of thus improving the kernel.
> 
> Umm. No we didn't. We've never changed it. It was "always inline" back in 

s/changed the behavior/overrode the behavior

> the old days, and then we had to keep it "always inline", which is why 
> we override the default gcc meaning with the preprocessor.
> 
> Now, OPTIMIZE_INLINING _tries_ to change the semantics, and people are 
> complaining..

But i'd definitely argue that the Linux kernel definition of 'inline' was 
always more consistent than GCC's. That was rather easy as well: it doesnt 
get any more clear-cut than 'always inline'.

Nevertheless the question remains: is 3% on allyesconfig and 1% on 
defconfig (7.5% in kernel/built-in.o) worth changing the kernel definition 
for?

I think it is axiomatic that improving the kernel means changing it - 
sometimes that means changing deep details. (And if you see us ignoring 
complaints, let us know, it must not happen.)

So the question isnt whether to change, the question is: does the kernel 
get 'better' after that change - and could the same be achieved 
realistically via other means?

If you accept us turning all 30,000 inlines in the kernel upside down, we 
might be able to get the same end result differently. You can definitely 
be sure if people complained about this ~5 lines feature they will 
complain about a tens of thousand of lines patch (and the followup changed 
regime) ten or hundred times more fiercely.

And just in case it was not clear, i'm not a GCC apologist - to the 
contrary. I dont really care which tool makes the kernel better, and i 
wont stop looking at a quantified possibility to improve the kernel just 
because it happens to be GCC that offers a solution.

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Steven Rostedt

>  - Headers could probably go back to 'extern inline' again. At not small 
>expense - we just finished moving to 'static inline'. We'd need to 
>guarantee a library instantiation for every header include file - this 
>is an additional mechanism with additional introduction complexities 
>and an ongoing maintenance cost.

I thought the "static inline" in headers should be more of a "always 
inline". As Andrew Morton keeps yelling at me to use static inline instead 
of macros ;-)

I do not see the point in the functions in the headers needing to have 
their "inlines" removed.

> 
>  - 'static inline' functions in .c files that are not used cause no build 
>warnings - while if we change them to 'static', we get a 'defined but
>not used' warning. Hundreds of new warnings in the allyesconfig builds.

Perhaps that's a good thing to see what functions are unused in the 
source.

> 
> I know that because i just have removed all variants of 'inline' from all 
> .c files of the kernel, it's a 3.5MB patch:
> 
>3263 files changed, 12409 insertions(+), 12409 deletions(-)
> 
> x86 defconfig comparisons:
> 
>   textfilename
>6875817vmlinux.always-inline   (  0.000% )
>6838290vmlinux.always-inline+remove-c-inlines  ( -0.548% )
>6794474vmlinux.optimize-inlining   ( -1.197% )
> 
> So the kernel's size improved by half a percent. Should i submit it?

Are there cases that are "must inline" in that patch? Also, what is the 
difference if you do vmlinux.optimize-remove-c-inlines? Is there a 
difference there?

-- Steve

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Harvey Harrison
On Sat, 2009-01-10 at 02:01 +0100, Ingo Molnar wrote:
> * Linus Torvalds  wrote:

>  - Headers could probably go back to 'extern inline' again. At not small 
>expense - we just finished moving to 'static inline'. We'd need to 
>guarantee a library instantiation for every header include file - this 
>is an additional mechanism with additional introduction complexities 
>and an ongoing maintenance cost.

Puzzled?  What benefit is there to going back to extern inline in headers?

Harvey

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Linus Torvalds


On Sat, 10 Jan 2009, Ingo Molnar wrote:
> 
> Well, it's not totally meaningless. To begin with, defining 'inline' to 
> mean 'always inline' is a Linux kernel definition. So we already changed 
> the behavior - in the hope of getting it right most of the time and in the 
> hope of thus improving the kernel.

Umm. No we didn't. We've never changed it. It was "always inline" back in 
the old days, and then we had to keep it "always inline", which is why we 
override the default gcc meaning with the preprocessor.

Now, OPTIMIZE_INLINING _tries_ to change the semantics, and people are 
complaining..

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Linus Torvalds


On Sat, 10 Jan 2009, Jamie Lokier wrote:
> 
> Does "always_inline" complain if the function isn't inlinable, while
> "inline" allows it?  That would explain the alpha comment.

I suspect it dates back to gcc-3.1 days. It's from 2004. And the author of 
that comment is a part-time gcc hacker who was probably offended by the 
fact that we thought (correctly) that a lot of gcc inlining was totally 
broken.

Since he was the main alpha maintainer, he got to do things his way 
there..

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Ingo Molnar

* Linus Torvalds  wrote:

> On Sat, 10 Jan 2009, Ingo Molnar wrote:
> > 
> > may_inline/inline_hint is a longer, less known and uglier keyword.
> 
> Hey, your choice, should you decide to accept it, is to just get rid of 
> them entirely.
> 
> You claim that we're back to square one, but that's simply the way 
> things are. Either "inline" means something, or it doesn't. You argue 
> for it meaning nothing. I argue for it meaning something.
> 
> If you want to argue for it meaning nothing, then REMOVE it, instead of 
> breaking it.
> 
> It really is that simple. Remove the inlines you think are wrong. 
> Instead of trying to change the meaning of them.

Well, it's not totally meaningless. To begin with, defining 'inline' to 
mean 'always inline' is a Linux kernel definition. So we already changed 
the behavior - in the hope of getting it right most of the time and in the 
hope of thus improving the kernel.

And now it appears that in our quest of improving the kernel we can 
further tweak that (already non-standard) meaning to a weak "inline if the 
compiler agrees too" hint. That gives us an even more compact kernel. It 
also moves the meaning of 'inline' closer to what the typical programmer 
expects it to be - for better or worse.

We could remove them completely, but there are a couple of practical 
problems with that:

 - In this cycle alone, in the past ~2 weeks we added another 1300 inlines
   to the kernel. Do we really want periodic postings of:

  [PATCH 0/135] inline removal cleanups

   ... in the next 10 years? We have about 20% of all functions in the 
   kernel marked with 'inline'. It is a _very_ strong habit. Is it worth 
   fighting against it?

 - Headers could probably go back to 'extern inline' again. At not small 
   expense - we just finished moving to 'static inline'. We'd need to 
   guarantee a library instantiation for every header include file - this 
   is an additional mechanism with additional introduction complexities 
   and an ongoing maintenance cost.

 - 'static inline' functions in .c files that are not used cause no build 
   warnings - while if we change them to 'static', we get a 'defined but
   not used' warning. Hundreds of new warnings in the allyesconfig builds.

I know that because i just have removed all variants of 'inline' from all 
.c files of the kernel, it's a 3.5MB patch:

   3263 files changed, 12409 insertions(+), 12409 deletions(-)

x86 defconfig comparisons:

  textfilename
   6875817vmlinux.always-inline   (  0.000% )
   6838290vmlinux.always-inline+remove-c-inlines  ( -0.548% )
   6794474vmlinux.optimize-inlining   ( -1.197% )

So the kernel's size improved by half a percent. Should i submit it?

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Andi Kleen
On Sat, Jan 10, 2009 at 12:53:42AM +, Jamie Lokier wrote:
> Harvey Harrison wrote:
> > Oh yeah, and figure out what actually breaks on alpha such that they added
> > the following (arch/alpha/include/asm/compiler.h)
> > 
> > #ifdef __KERNEL__
> > /* Some idiots over in  thought inline should imply
> >always_inline.  This breaks stuff.  We'll include this file whenever
> >we run into such problems.  */
> 
> Does "always_inline" complain if the function isn't inlinable, while

Yes it does.

> "inline" allows it? 

(unless you set -Winline which the kernel doesn't) 

-Andi

-- 
a...@linux.intel.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Jamie Lokier
Harvey Harrison wrote:
> Oh yeah, and figure out what actually breaks on alpha such that they added
> the following (arch/alpha/include/asm/compiler.h)
> 
> #ifdef __KERNEL__
> /* Some idiots over in  thought inline should imply
>always_inline.  This breaks stuff.  We'll include this file whenever
>we run into such problems.  */

Does "always_inline" complain if the function isn't inlinable, while
"inline" allows it?  That would explain the alpha comment.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Linus Torvalds


On Sat, 10 Jan 2009, Ingo Molnar wrote:
> 
> may_inline/inline_hint is a longer, less known and uglier keyword. 

Hey, your choice, should you decide to accept it, is to just get rid of 
them entirely.

You claim that we're back to square one, but that's simply the way things 
are. Either "inline" means something, or it doesn't. You argue for it 
meaning nothing. I argue for it meaning something.

If you want to argue for it meaning nothing, then REMOVE it, instead of 
breaking it.

It really is that simple. Remove the inlines you think are wrong. Instead 
of trying to change the meaning of them.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Ingo Molnar

* Linus Torvalds  wrote:

> On Fri, 9 Jan 2009, Harvey Harrison wrote:
> > 
> > __needs_inline?  That would imply that it's for correctness reasons.
> 
> .. but the point is, we have _thousands_ of inlines, and do you know 
> which is which? We've historically forced them to be inlined, and every 
> time somebody does that "OPTIMIZE_INLINE=y", something simply _breaks_.

Having watched all the inline and anti-inline activites and patches of the 
past decade (and having participated in many of them) my strong impression 
is that any non-automated way is a fundamentally inhuman Don Quijote 
fight.

The inlining numbers me and others posted seem to support that impression.

Today we have in excess of thirty thousand 'inline' keyword uses in the 
kernel, and in excess of one hundred thousand kernel functions. We had a 
decade of hundreds of inline-tuning patches that flipped inline attributes 
on and off, with the goal of doing that job better than the compiler.

Still a sucky compiler who was never faced with this level of inlining 
complexity before (up to a few short months ago when we released the first 
kernel with non-CONFIG_BROKEN-marked CONFIG_OPTIMIZE_INLINING feature in 
it) manages to do a better job at judging inlining than a decade of human 
optimizations managed to do. (If you accept that 1% - 3% - 7.5% code size 
reduction in important areas of the kernel is an improvement.)

That improvement is systematic: it happens regardless whether it's core 
kernel developers who wrote the code, with years of kernel experience - or 
driver developers who came from Windows and might be inexperienced about 
it all and might slap 'inline' on every second random function.

And it's not like the compiler was not allowed to inline important 
functions before: all static functions in .c it can (and do) inline if it 
sees fit. Tens of thousands of them.

If we change 'inline' back to mean 'must inline' again, we have not 
changed the human dynamics of inlines at all and are back on square one. 
'should_inline' or 'may_inline' will be an opt-in hint that will be 
subject to the same kind of misjudgements that resulted in the inlining 
situation to begin with. In .c files it's already possible to do that: by 
not placing an 'inline' keyword at all, just leaving the function 
'static'.

may_inline/inline_hint is a longer, less known and uglier keyword. So all 
the cards are stacked up against this new 'may inline' mechanism, and by 
all likelyhood it will fizzle and never reach any sort of critical mass to 
truly matter. Nor should it - why should humans do this if a silly tool 
can achieve something rather acceptable?

So such a change will in essence amount to the effective removal of 
CONFIG_OPTIMIZE_INLINING. If we want to do that then we should do that 
honestly - and remove it altogether and not pretend to care.

Fedora has CONFIG_OPTIMIZE_INLINING=y enabled today - distros are always 
on the lookout for kernel image reductor features. As of today i'm not 
aware of a single Fedora bugzilla that was caused by that.

The upstream kernel did have bugs due to it - we had the UML breakage for 
example, and an older 3.x gcc threw an internal error on one of the 
(stale) isdn telephony drivers. Was Chris's crash actually caused by gcc's 
inlining decisions? I dont think it was.

Historically we had far more compiler problems with 
CONFIG_CC_OPTIMIZE_SIZE=y - optimizing for size is a subtly complex and 
non-trivial compiler pass.

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Harvey Harrison
On Fri, 2009-01-09 at 14:09 -0800, Linus Torvalds wrote:
> Actually, the nice part about "inline_hint" would be that then we could 
> have some nice config option like
> 
>   #ifdef CONFIG_FULL_CALL_TRACE
>#define inline_hint noinline
>   #elif defined(CONFIG_TRUST_COMPILER)
>#define inline_hint /* */
>   #else
>#define inline_hint __inline
>   #endif
> 
> and now the _only_ thing we need to do is to remove the
> 
>   #define __inline__force_inline
> 
> thing, and just agree that "__inline" is the "native compiler meaning". 
> 
> We have a few users of "__inline", but not very many. We can leave them 
> alone, or just convert them to __inline__ or inline.
> 

Oh yeah, and figure out what actually breaks on alpha such that they added
the following (arch/alpha/include/asm/compiler.h)

#ifdef __KERNEL__
/* Some idiots over in  thought inline should imply
   always_inline.  This breaks stuff.  We'll include this file whenever
   we run into such problems.  */

#include 
#undef inline
#undef __inline__
#undef __inline
#undef __always_inline
#define __always_inline inline __attribute__((always_inline))

#endif /* __KERNEL__ */

Cheers,

Harvey

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Harvey Harrison
On Fri, 2009-01-09 at 14:09 -0800, Linus Torvalds wrote:

> We have a few users of "__inline", but not very many. We can leave them 
> alone, or just convert them to __inline__ or inline.

Actually I sent out a series of patches which mostly went in 2.6.27-28
timeframe, that's why there's a lot fewer __inline/__inline__

Other than one more block in scsi which has been hanging out in -mm for
awhile, eliminating them should be pretty easy now.

Harvey

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Linus Torvalds


On Fri, 9 Jan 2009, Harvey Harrison wrote:
> > 
> > Of course, at that point you might as well argue that the thing should not 
> > exist at all, and that such a flag should just be removed entirely. Which 
> > I certainly agree with - I think the only flag we need is "inline", and I 
> > think it should mean what it damn well says.
> 
> Also agreed, but there needs to start being some education about _not_ using
> inline so much in the kernel.

Actually, the nice part about "inline_hint" would be that then we could 
have some nice config option like

  #ifdef CONFIG_FULL_CALL_TRACE
   #define inline_hint noinline
  #elif defined(CONFIG_TRUST_COMPILER)
   #define inline_hint /* */
  #else
   #define inline_hint __inline
  #endif

and now the _only_ thing we need to do is to remove the

#define __inline__force_inline

thing, and just agree that "__inline" is the "native compiler meaning". 

We have a few users of "__inline", but not very many. We can leave them 
alone, or just convert them to __inline__ or inline.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Harvey Harrison
On Fri, 2009-01-09 at 13:50 -0800, Linus Torvalds wrote:
> 
> On Fri, 9 Jan 2009, Harvey Harrison wrote:
> > 
> > __needs_inline?  That would imply that it's for correctness reasons.
> 
> .. but the point is, we have _thousands_ of inlines, and do you know which 
> is which? We've historically forced them to be inlined, and every time 
> somebody does that "OPTIMIZE_INLINE=y", something simply _breaks_.
> 

My suggestion was just an alternative to __force_inline as a naming...I agree 
that
inline should mean __always_inline.always.

> So instead of just continually hitting our head against this wall because 
> some people seem to be convinced that gcc can do a good job, just do it 
> the other way around. Make the new one be "inline_hint" (no underscores 
> needed, btw), and there is ansolutely ZERO confusion about what it means. 

agreed.

> At that point, everybody knows why it's there, and it's clearly not a 
> correctness issue or anything else.
> 
> Of course, at that point you might as well argue that the thing should not 
> exist at all, and that such a flag should just be removed entirely. Which 
> I certainly agree with - I think the only flag we need is "inline", and I 
> think it should mean what it damn well says.

Also agreed, but there needs to start being some education about _not_ using
inline so much in the kernel.

Harvey

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Ingo Molnar

* Linus Torvalds  wrote:

> On Fri, 9 Jan 2009, Ingo Molnar wrote:
> > 
> > So, should we not remove CONFIG_OPTIMIZE_INLINING, then the correct one 
> > would be to mark it __always_inline [__asm_inline is senseless there], or 
> > the second patch below that changes the bit parameter to unsigned int.
> 
> Well, I certainly don't want to _remove_ the "inline" like your patch did. 

hm, that was a bug that i noticed and fixed in the second, fuller version 
of the patch i sent - which converts all the 'int nr' instances in 
bitops.h to 'unsigned int nr'.

This is the only instance where the integer type of 'nr' matters in 
practice though, due to the modulo arithmetics. But for cleanliness 
reasons we want to do the full patch, to have a standard type signature 
for these bitop methods.

> Other gcc versions will care. But I committed the pure "change to 
> unsigned" part.

thanks! I'll clean up the rest - the second patch will now conflict 
(trivially). I also wanted to check the whole file more fully, there might 
be other details. [ So many files, so few nights ;-) ]

We also might need more __always_inline's here and in other places, to 
solve the nonsensical inlining problems that Chris's case showed for 
example, with earlier GCCs.

Another option would be to not trust earlier GCCs at all with this feature 
- to only define inline to not-__always_inline only on latest 4.3.x GCC - 
the only one that seems to at least not mess up royally.

Thus CONFIG_OPTIMIZE_INLINING=y would have no effect on older GCCs. That 
would quarantine the problem (and the impact) sufficiently i think. And if 
future GCCs start messing up in this area we could zap the whole feature 
in a heartbeat.

( Although now that we have this feature this gives an incentive to
  compiler folks to tune their inliner on to the Linux kernel - for a 
  decade we never allowed them to do that. The kernel clearly has one of
  the trickiest (and ugliest) inlining smarts in headers - and we never 
  really exposed compilers to those things, so i'm not surprised at all 
  that they mess up in cases.

  Unfortunately the version lifecycle of most compiler projects is
  measured in years, not in months like that of the kernel. There's many
  reasons for that - and not all of those reasons are strictly their
  fault. )

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Linus Torvalds


On Fri, 9 Jan 2009, Harvey Harrison wrote:
> 
> __needs_inline?  That would imply that it's for correctness reasons.

.. but the point is, we have _thousands_ of inlines, and do you know which 
is which? We've historically forced them to be inlined, and every time 
somebody does that "OPTIMIZE_INLINE=y", something simply _breaks_.

So instead of just continually hitting our head against this wall because 
some people seem to be convinced that gcc can do a good job, just do it 
the other way around. Make the new one be "inline_hint" (no underscores 
needed, btw), and there is ansolutely ZERO confusion about what it means. 

At that point, everybody knows why it's there, and it's clearly not a 
correctness issue or anything else.

Of course, at that point you might as well argue that the thing should not 
exist at all, and that such a flag should just be removed entirely. Which 
I certainly agree with - I think the only flag we need is "inline", and I 
think it should mean what it damn well says.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Linus Torvalds


On Fri, 9 Jan 2009, Ingo Molnar wrote:
> 
> - Perhaps we could introduce a name for the first category: __must_inline? 
>   __should_inline? Not because it wouldnt mean 'always', but because it is 
>   'always inline' for another reason than the correctless __always_inline.

I think you're thinking about this the wrong way.

"inline" is a pretty damn strong hint already.

If you want a weaker one, make it _weaker_ instead of trying to use 
superlatives like "super_inline" or "must_inline" or whatever.

So I'd suggest:

 - keep "inline" as being a strong hint. In fact, I'd suggest it not be a 
   hint at all - when we say "inline", we mean it. No ambiguity 
   _anywhere_, and no need for idiotic "I really really REALLY mean it" 
   versions.

 - add a "maybe_inline" or "inline_hint" to mean that "ok, compiler, maybe 
   this is worth inlining, but I'll leave the final choice to you".

That would get rid of the whole rationale for OPTIMIZE_INLINING=y, because 
at that point, it's no longer potentially a correctness issue. At that 
point, if we let gcc optimize things, it was a per-call-site conscious 
decision.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Harvey Harrison
On Fri, 2009-01-09 at 22:34 +0100, Ingo Molnar wrote:
> The naming problem remains though:
> 
> - Perhaps we could introduce a name for the first category: __must_inline? 
>   __should_inline? Not because it wouldnt mean 'always', but because it is 
>   'always inline' for another reason than the correctless __always_inline.
> 
> - Another possible approach wuld be to rename the second category to 
>   __force_inline. That would signal it rather forcefully that the inlining 
>   there is an absolute correctness issue.

__needs_inline?  That would imply that it's for correctness reasons.

Then __always_inline is left to mean that it doesn't _need_ to be inline
but we _want_ it inline regardless of what gcc thinks?

$0.02

Harvey

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Ingo Molnar

* Linus Torvalds  wrote:

> On Fri, 9 Jan 2009, Ingo Molnar wrote:
> > 
> > So, should we not remove CONFIG_OPTIMIZE_INLINING, then the correct 
> > one would be to mark it __always_inline [__asm_inline is senseless 
> > there], or the second patch below that changes the bit parameter to 
> > unsigned int.
> 
> Well, I certainly don't want to _remove_ the "inline" like your patch 
> did. Other gcc versions will care. But I committed the pure "change to 
> unsigned" part.
> 
> But we should fix the cmpxchg (and perhaps plain xchg too), shouldn't 
> we?
> 
> That your gcc version gets it right doesn't change the fact that Chris' 
> gcc version didn't, and out-of-lined it all. So we'll need some 
> __always_inlines there too..

Yeah. I'll dig out an older version of gcc (latest distros are all 4.3.x 
based) and run the checks to see which inlines make a difference.

> And no, I don't think it makes any sense to call them "__asm_inline". 
> Even when there are asms hidden in between the C statements, what's the 
> difference between "always" and "asm"? None, really.

Well, the difference is small, nitpicky and insignificant: the thing is 
there are two logically separate categories of __always_inline:

 1) the places where __always_inline means that in this universe no sane 
compiler ever can end up thinking to move that function out of line.

 2) inlining for_correctness_ reasons: things like vreads or certain 
paravirt items. Stuff where the kernel actually crashes if we dont 
inline. Here if we do not inline we've got a materially crashy kernel.

The original intention of __always_inline was to only cover the second 
category above - and thus self-document all the 'correctness inlines'. 

This notion has become bitrotten somewhat: we do use __always_inline in a 
few other places like the ticket spinlock inlines for non-correctness 
reasons. That bitrot happened because we simply have no separate symbol 
for the first category.

So hpa suggested __asm_inline (yesterday, well before all the analysis was 
conducted) under the assumption that there would be many such annotations 
needed and that they would be all about cases where GCC's inliner gets 
confused by inline assembly.

This theory turned out to be a red herring today - asm()s do not seem to 
confuse latest GCC. (although they certain confuse earlier versions, so 
it's still a practical issue, so i agree that we do need to annotate a few 
more places.)

In any case, the __asm_inline name - even if it made some marginal sense 
originally - is totally moot now, no argument about that.

The naming problem remains though:

- Perhaps we could introduce a name for the first category: __must_inline? 
  __should_inline? Not because it wouldnt mean 'always', but because it is 
  'always inline' for another reason than the correctless __always_inline.

- Another possible approach wuld be to rename the second category to 
  __force_inline. That would signal it rather forcefully that the inlining 
  there is an absolute correctness issue.

- Or we could go with the status quo and just conflate those two 
  categories (as it is happening currently) and document the correctness 
  inlines via in-source comments?

But these are really nuances that pale in comparison to the fundamental 
questions that were asked in this thread, about the pure existence of this 
feature.

If the optimize-inlining feature looks worthwile and maintainable to 
remain upstream then i'd simply like to see the information of these two 
categories preserved in a structured way (in 5 years i'm not sure i'd 
remember all the paravirt inlining details), and i dont feel too strongly 
about the style how we preserve that information.

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Linus Torvalds


On Fri, 9 Jan 2009, Ingo Molnar wrote:
> 
> So, should we not remove CONFIG_OPTIMIZE_INLINING, then the correct one 
> would be to mark it __always_inline [__asm_inline is senseless there], or 
> the second patch below that changes the bit parameter to unsigned int.

Well, I certainly don't want to _remove_ the "inline" like your patch did. 
Other gcc versions will care. But I committed the pure "change to 
unsigned" part.

But we should fix the cmpxchg (and perhaps plain xchg too), shouldn't we?

That your gcc version gets it right doesn't change the fact that Chris' 
gcc version didn't, and out-of-lined it all. So we'll need some 
__always_inlines there too..

And no, I don't think it makes any sense to call them "__asm_inline". Even 
when there are asms hidden in between the C statements, what's the 
difference between "always" and "asm"? None, really.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Andi Kleen
> I've done a finegrained size analysis today (see my other mail in this 
> thread), and it turns out that on gcc 4.3.x the main (and pretty much 
> only) inlining annotation that matters in arch/x86/include/asm/*.h is the 
> onliner patch attached below, annotating constant_test_bit().

That's pretty cool.

Should definitely file a gcc bug report for that though so that
they can fix gcc. Did you already do that or should I?

-Andi
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Ingo Molnar

* Ingo Molnar  wrote:

> Note that meanwhile i also figured out why gcc got the inlining wrong 
> there: the 'int nr' combined with the '% BITS_PER_LONG' signed 
> arithmetics was too much for it to figure out at the inlining stage - it 
> generated IDIV instructions, etc. With forced inlining later 
> optimization stages managed to prove that the expression can be 
> simplified.
> 
> The second patch below that changes 'int nr' to 'unsigned nr' solves 
> that problem, without the need to mark the function __always_inline.

The patch below that changes all the 'int nr' arguments to 'unsigned int 
nr' in bitops.h and gives us a 0.3% size win (and all the right inlining 
behavior) on x86 defconfig:

text   data bss dec hex filename
 68134701453188  801096 9067754  8a5cea vmlinux.before
 67926021453188  801096 9046886  8a0b66 vmlinux.after

i checked other architectures and i can see many cases where the bitops 
'nr' parameter is defined as unsigned - maybe they noticed this.

This change makes some sense anyway as a cleanup: a negative 'nr' bitop 
argument does not make much sense IMO.

Ingo

---
 arch/x86/include/asm/bitops.h |   31 ---
 1 file changed, 16 insertions(+), 15 deletions(-)

Index: linux/arch/x86/include/asm/bitops.h
===
--- linux.orig/arch/x86/include/asm/bitops.h
+++ linux/arch/x86/include/asm/bitops.h
@@ -75,7 +75,7 @@ static inline void set_bit(unsigned int 
  * If it's called on the same region of memory simultaneously, the effect
  * may be that only one operation succeeds.
  */
-static inline void __set_bit(int nr, volatile unsigned long *addr)
+static inline void __set_bit(unsigned int nr, volatile unsigned long *addr)
 {
asm volatile("bts %1,%0" : ADDR : "Ir" (nr) : "memory");
 }
@@ -90,7 +90,7 @@ static inline void __set_bit(int nr, vol
  * you should call smp_mb__before_clear_bit() and/or smp_mb__after_clear_bit()
  * in order to ensure changes are visible on other processors.
  */
-static inline void clear_bit(int nr, volatile unsigned long *addr)
+static inline void clear_bit(unsigned int nr, volatile unsigned long *addr)
 {
if (IS_IMMEDIATE(nr)) {
asm volatile(LOCK_PREFIX "andb %1,%0"
@@ -117,7 +117,7 @@ static inline void clear_bit_unlock(unsi
clear_bit(nr, addr);
 }
 
-static inline void __clear_bit(int nr, volatile unsigned long *addr)
+static inline void __clear_bit(unsigned int nr, volatile unsigned long *addr)
 {
asm volatile("btr %1,%0" : ADDR : "Ir" (nr));
 }
@@ -152,7 +152,7 @@ static inline void __clear_bit_unlock(un
  * If it's called on the same region of memory simultaneously, the effect
  * may be that only one operation succeeds.
  */
-static inline void __change_bit(int nr, volatile unsigned long *addr)
+static inline void __change_bit(unsigned int nr, volatile unsigned long *addr)
 {
asm volatile("btc %1,%0" : ADDR : "Ir" (nr));
 }
@@ -166,7 +166,7 @@ static inline void __change_bit(int nr, 
  * Note that @nr may be almost arbitrarily large; this function is not
  * restricted to acting on a single-word quantity.
  */
-static inline void change_bit(int nr, volatile unsigned long *addr)
+static inline void change_bit(unsigned int nr, volatile unsigned long *addr)
 {
if (IS_IMMEDIATE(nr)) {
asm volatile(LOCK_PREFIX "xorb %1,%0"
@@ -187,7 +187,7 @@ static inline void change_bit(int nr, vo
  * This operation is atomic and cannot be reordered.
  * It also implies a memory barrier.
  */
-static inline int test_and_set_bit(int nr, volatile unsigned long *addr)
+static inline int test_and_set_bit(unsigned int nr, volatile unsigned long 
*addr)
 {
int oldbit;
 
@@ -204,7 +204,7 @@ static inline int test_and_set_bit(int n
  *
  * This is the same as test_and_set_bit on x86.
  */
-static inline int test_and_set_bit_lock(int nr, volatile unsigned long *addr)
+static inline int test_and_set_bit_lock(unsigned int nr, volatile unsigned 
long *addr)
 {
return test_and_set_bit(nr, addr);
 }
@@ -218,7 +218,7 @@ static inline int test_and_set_bit_lock(
  * If two examples of this operation race, one can appear to succeed
  * but actually fail.  You must protect multiple accesses with a lock.
  */
-static inline int __test_and_set_bit(int nr, volatile unsigned long *addr)
+static inline int __test_and_set_bit(unsigned int nr, volatile unsigned long 
*addr)
 {
int oldbit;
 
@@ -237,7 +237,7 @@ static inline int __test_and_set_bit(int
  * This operation is atomic and cannot be reordered.
  * It also implies a memory barrier.
  */
-static inline int test_and_clear_bit(int nr, volatile unsigned long *addr)
+static inline int test_and_clear_bit(unsigned int nr, volatile unsigned long 
*addr)
 {
int oldbit;
 
@@ -257,7 +257,7 @@ static inline int test_and_clear_bit(int
  * If two examples of this operation race, one can appear to succeed
  *

Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Ingo Molnar

* Linus Torvalds  wrote:

> On Fri, 9 Jan 2009, Ingo Molnar wrote:
> >  
> > -static inline int constant_test_bit(int nr, const volatile unsigned long 
> > *addr)
> > +static __asm_inline int
> > +constant_test_bit(int nr, const volatile unsigned long *addr)
> >  {
> > return ((1UL << (nr % BITS_PER_LONG)) &
> > (((unsigned long *)addr)[nr / BITS_PER_LONG])) != 0;
> 
> Thios makes absolutely no sense.
> 
> It's called "__always_inline", not __asm_inline.

yeah.

Note that meanwhile i also figured out why gcc got the inlining wrong 
there: the 'int nr' combined with the '% BITS_PER_LONG' signed arithmetics 
was too much for it to figure out at the inlining stage - it generated 
IDIV instructions, etc. With forced inlining later optimization stages 
managed to prove that the expression can be simplified.

The second patch below that changes 'int nr' to 'unsigned nr' solves that 
problem, without the need to mark the function __always_inline.


How did i end up with __asm_inline? The thing is, i started the day under 
the assumption that there's some big practical problem here. I expected to 
find a lot of places in need of annotation, so i introduced hpa's 
suggestion and added the __asm_inline (via the patch attached below).

I wrote 40 patches that annotated 200+ asm inline functions, and i was 
fully expected to find that GCC made a mess, and i also wrote a patch to 
disable CONFIG_OPTIMIZE_INLINING on those grounds.

The irony is that indeed pretty much the _only_ annotation that made a 
difference was the one that isnt even an asm() inline (as you noted).

So, should we not remove CONFIG_OPTIMIZE_INLINING, then the correct one 
would be to mark it __always_inline [__asm_inline is senseless there], or 
the second patch below that changes the bit parameter to unsigned int.

Ingo

---
 include/linux/compiler.h |6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

Index: linux/include/linux/compiler.h
===
--- linux.orig/include/linux/compiler.h
+++ linux/include/linux/compiler.h
@@ -223,7 +223,11 @@ void ftrace_likely_update(struct ftrace_
 #define noinline_for_stack noinline
 
 #ifndef __always_inline
-#define __always_inline inline
+# define __always_inline inline
+#endif
+
+#ifndef __asm_inline
+# define __asm_inline __always_inline
 #endif
 
 #endif /* __KERNEL__ */

Index: linux/arch/x86/include/asm/bitops.h
===
--- linux.orig/arch/x86/include/asm/bitops.h
+++ linux/arch/x86/include/asm/bitops.h
@@ -300,7 +300,7 @@ static inline int test_and_change_bit(in
return oldbit;
 }
 
-static inline int constant_test_bit(int nr, const volatile unsigned long *addr)
+static int constant_test_bit(unsigned int nr, const volatile unsigned long 
*addr)
 {
return ((1UL << (nr % BITS_PER_LONG)) &
(((unsigned long *)addr)[nr / BITS_PER_LONG])) != 0;
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Linus Torvalds


On Fri, 9 Jan 2009, Peter Zijlstra wrote:
> 
> On that note:
> 
> Index: linux-2.6/kernel/mutex.c
> ===
> --- linux-2.6.orig/kernel/mutex.c
> +++ linux-2.6/kernel/mutex.c
> @@ -220,7 +220,9 @@ __mutex_lock_common(struct mutex *lock, 
>   __set_task_state(task, state);
>  
>   /* didnt get the lock, go to sleep: */
> + preempt_disable();
>   spin_unlock_mutex(&lock->wait_lock, flags);
> + preempt_enable_no_resched();
>   schedule();

Yes. I think this is a generic issue independently of the whole adaptive 
thing.

In fact, I think wee could make the mutex code use explicit preemption and 
then the __raw spinlocks to make this more obvious. Because now there's a 
hidden "preempt_enable()" in that spin_unlock_mutex, and anybody looking 
at the code and not realizing it is going to just say "Whaa? Who is this 
crazy Peter Zijlstra guy, and what drugs is he on? I want me some!".

Because your patch really doesn't make much sense unless you know how 
spinlocks work, and if you _do_ know how spinlocks work, you go "eww, 
that's doing extra preemption crud in order to just disable the 
_automatic_ preemption crud".

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Peter Zijlstra
On Fri, 2009-01-09 at 11:44 -0500, Steven Rostedt wrote:

> When we get to the schedule() it then needs to be a:
> 
>   preempt_enable_no_resched();
>   schedule();

On that note:

Index: linux-2.6/kernel/mutex.c
===
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -220,7 +220,9 @@ __mutex_lock_common(struct mutex *lock, 
__set_task_state(task, state);
 
/* didnt get the lock, go to sleep: */
+   preempt_disable();
spin_unlock_mutex(&lock->wait_lock, flags);
+   preempt_enable_no_resched();
schedule();
spin_lock_mutex(&lock->wait_lock, flags);
}


actually improves mutex performance on PREEMPT=y
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Jiri Kosina
On Thu, 8 Jan 2009, Peter Zijlstra wrote:

> > Well, at least we do unless you enable that broken paravirt support. 
> > I'm not at all clear on why CONFIG_PARAVIRT wants to use inferior 
> > locks, but I don't much care.
> Because the virtual cpu that has the ticket might not get scheduled for
> a while, even though another vcpu with a spinner is scheduled.
> The whole (para)virt is a nightmare in that respect.

Hmm, are we in fact really using byte locks in CONFIG_PARAVIRT situation? 
Where are we actually setting pv_lock_ops.spin_lock pointer to point to 
__byte_spin_lock?

Such initialization seems to happen only in paravirt_use_bytelocks() 
function, but my blind eyes prevent me from finding a callsite from which 
this function would eventually get called.

-- 
Jiri Kosina
SUSE Labs

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Steven Rostedt

On Fri, 9 Jan 2009, Linus Torvalds wrote:

> 
> 
> On Fri, 9 Jan 2009, Steven Rostedt wrote:
> > 
> > I was going to say a while ago...
> > In PREEMPT=y the need_resched() is not needed at all. If you have 
> > preemption enabled, you will get preempted in that loop. No need for the 
> > need_resched() in the outer loop. Although I'm not sure how it would even 
> > hit the "need_resched". If it was set, then it is most likely going to be 
> > cleared when coming back from being preempted.
> 
> No, no, you miss the point entirely.

No I did not miss your point. I was commenting on the current code ;-)

> So quite frankly, if you have CONFIG_PREEMPT, then the spinning really is 
> the wrong thing to do, or the whole mutex slow-path thing should be done 
> with preemption disabled so that we only schedule where we _should_ be 
> scheduling.

I agree here. I was going to recommend to add a preempt_disable in the 
spinner. And keep the need_resched test. Then we should not allow 
preemption until we get all the way to the point of the schedule in the 
contention case, or when we get the lock.

When we get to the schedule() it then needs to be a:

preempt_enable_no_resched();
schedule();

-- Steve

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Linus Torvalds


On Fri, 9 Jan 2009, H. Peter Anvin wrote:
> 
> __asm_inline was my suggestion, to distinguish "inline this
> unconditionally because gcc screws up in the presence of asm()"

THERE IS NO ASM IN THERE!

Guys, look at the code. No asm. The whole notion that gcc gets confused by 
inline asms IS BOGUS. It's simply not TRUE. Gcc gets confused because gcc 
is confused, and it has NOTHING to do with inline asms.

So please don't confuse things further. 

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Linus Torvalds


On Fri, 9 Jan 2009, Steven Rostedt wrote:
> 
> I was going to say a while ago...
> In PREEMPT=y the need_resched() is not needed at all. If you have 
> preemption enabled, you will get preempted in that loop. No need for the 
> need_resched() in the outer loop. Although I'm not sure how it would even 
> hit the "need_resched". If it was set, then it is most likely going to be 
> cleared when coming back from being preempted.

No, no, you miss the point entirely.

It's not about correctness.

Remember: the whole (and only) point of spinning is about performance.

And the thing is, we should only spin if it makes sense. So the

if (need_resched())
break;

is not there because of any "ok, I need to sleep now", it's there because 
of something TOTALLY DIFFERENT, namely "ok, it makes no sense to spin now, 
since I should be sleeping".

See? WE DO NOT WANT TO BE PREEMPTED in this region, because that totally 
destroys the whole point of the spinning. If we go through the scheduler, 
then we should go through the scheduler AND GO TO SLEEP, so that we don't 
go through the scheduler any more than absolutely necessary.

So this code - by design - is always only going to get worse if you have 
involuntary preemption. The preemption is going to do _two_ bad things:

 - it's going to call the scheduler at the wrong point, meaning that we 
   now scheduler _more_ (or at least not less) than if we didn't have that 
   spin-loop in the first place.

 - .. and to make things worse, since it scheduled "for us", it is going 
   to clear that "need_resched()" flag, so we'll _stay_ in the bad 
   spinning loop too long!

So quite frankly, if you have CONFIG_PREEMPT, then the spinning really is 
the wrong thing to do, or the whole mutex slow-path thing should be done 
with preemption disabled so that we only schedule where we _should_ be 
scheduling.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread H. Peter Anvin
Linus Torvalds wrote:
> 
> On Fri, 9 Jan 2009, Ingo Molnar wrote:
>>  
>> -static inline int constant_test_bit(int nr, const volatile unsigned long 
>> *addr)
>> +static __asm_inline int
>> +constant_test_bit(int nr, const volatile unsigned long *addr)
>>  {
>>  return ((1UL << (nr % BITS_PER_LONG)) &
>>  (((unsigned long *)addr)[nr / BITS_PER_LONG])) != 0;
> 
> Thios makes absolutely no sense.
> 
> It's called "__always_inline", not __asm_inline.
> 
> Why add a new nonsensical annotations like that?
> 

__asm_inline was my suggestion, to distinguish "inline this
unconditionally because gcc screws up in the presence of asm()" versus
"inline this unconditionally because the world ends if it isn't" -- to
tell the human reader, not gcc.  I guess the above is a good indicator
that the __asm_inline might have been a bad name.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Linus Torvalds


On Fri, 9 Jan 2009, Ingo Molnar wrote:
>  
> -static inline int constant_test_bit(int nr, const volatile unsigned long 
> *addr)
> +static __asm_inline int
> +constant_test_bit(int nr, const volatile unsigned long *addr)
>  {
>   return ((1UL << (nr % BITS_PER_LONG)) &
>   (((unsigned long *)addr)[nr / BITS_PER_LONG])) != 0;

Thios makes absolutely no sense.

It's called "__always_inline", not __asm_inline.

Why add a new nonsensical annotations like that?

Also, the very fact that gcc gets that function wrong WHEN 'nr' IS 
CONSTANT (which is when it is called) just shows what kind of crap gcc is!

Ingo, the fact is, I care about size, but I care about debuggability and 
sanity more. I don't care one _whit_ about 3% size differences, if they 
are insane and cause idiotic per-compiler differences.

And you haven't done any interesting analysis per-file etc. It shoul be 
almost _trivial_ to do CONFIG_OPTIMIZE_INLINING on/off tests for the whole 
tree, and then comparing sizes of individual object files, and see if we 
find some obvious _bug_ where we just inline too much.

In fact, we shouldn't even do that - we should try to find a mode where 
gcc simply refuses to inline at all, and compare that to one where it 
_only_ inlines the things we ask it to. Because that's the more relevant 
test. The problem with gcc inlining is actually two-fold:

 - gcc doesn't inline things we ask for

   Here the sub-problem is that we ask for this too much, but see above on 
   how to figure -that- out!

 - gcc _does_ inline things that we haven't marked at all, causing too 
   much stack-space to be used, and causing debugging problems.

   And here the problem is that gcc should damn well not do that, at least 
   not as aggressively as it does!

IT DOES NOT MATTER if something is called in just one place and inlining 
makes things smaller! If it's not a clear performance win (and it almost 
never is, unless the function is really small), the inlining of especially 
functions that aren't even hot in the cache is ONLY a negative thing.

Linus

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Peter Zijlstra
On Fri, 2009-01-09 at 10:59 -0500, Steven Rostedt wrote:

> > 
> > Adding that blocking on !owner utterly destroys everything.
> 
> I was going to warn you about that ;-)
> 
> Without the check for !owner, you are almost guaranteed to go to sleep 
> every time. Here's why:
> 
> You are spinning and thus have a hot cache on that CPU.
> 
> The owner goes to unlock but will be in a cold cache. It sets lock->owner 
> to NULL, but is still in cold cache so it is a bit slower.
> 
> Once the spinner sees the NULL, it shoots out of the spin but sees the 
> lock is still not available then goes to sleep. All before the owner could 
> release it. This could probably happen at every contention. Thus, you lose 
> the benefit of spinning. You probably make things worse because you add a 
> spin before every sleep.

Which is why I changed the inner loop to:

  l_owner = ACCESS_ONCE(lock->owner)
  if (l_owner && l_owner != owner)
break

So that that would continue spinning.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Steven Rostedt

On Fri, 9 Jan 2009, Peter Zijlstra wrote:

> On Fri, 2009-01-09 at 11:47 +0100, Peter Zijlstra wrote:
> 
> > > So I think the bug is still there, we just hid it better by breaking out 
> > > of the loop with that "if (need_resched())" always eventually triggering. 
> > > And it would be ok if it really is guaranteed to _eventually_ trigger, 
> > > and 
> > > I guess with timeslices it eventually always will, but I suspect we could 
> > > have some serious latency spikes.
> > 
> > Yes, the owner getting preempted after acquiring the lock, but before
> > setting the owner can give some nasties :-(
> > 
> > I initially did that preempt_disable/enable around the fast path, but I
> > agree that slowing down the fast path is unwelcome.
> > 
> > Alternatively we could go back to block on !owner, with the added
> > complexity of not breaking out of the spin on lock->owner != owner
> > when !lock->owner, so that the premature owner clearing of the unlock
> > fast path will not force a schedule right before we get a chance to
> > acquire the lock.
> > 
> > Let me do that..
> 
> Ok a few observations..
> 
> Adding that need_resched() in the outer loop utterly destroys the
> performance gain for PREEMPT=y. Voluntary preemption is mostly good, but
> somewhat unstable results.

I was going to say a while ago...
In PREEMPT=y the need_resched() is not needed at all. If you have 
preemption enabled, you will get preempted in that loop. No need for the 
need_resched() in the outer loop. Although I'm not sure how it would even 
hit the "need_resched". If it was set, then it is most likely going to be 
cleared when coming back from being preempted.

> 
> Adding that blocking on !owner utterly destroys everything.

I was going to warn you about that ;-)

Without the check for !owner, you are almost guaranteed to go to sleep 
every time. Here's why:

You are spinning and thus have a hot cache on that CPU.

The owner goes to unlock but will be in a cold cache. It sets lock->owner 
to NULL, but is still in cold cache so it is a bit slower.

Once the spinner sees the NULL, it shoots out of the spin but sees the 
lock is still not available then goes to sleep. All before the owner could 
release it. This could probably happen at every contention. Thus, you lose 
the benefit of spinning. You probably make things worse because you add a 
spin before every sleep.

-- Steve

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Chris Mason
On Fri, 2009-01-09 at 16:06 +0100, Peter Zijlstra wrote:
> On Fri, 2009-01-09 at 11:47 +0100, Peter Zijlstra wrote:
> 
> > > So I think the bug is still there, we just hid it better by breaking out 
> > > of the loop with that "if (need_resched())" always eventually triggering. 
> > > And it would be ok if it really is guaranteed to _eventually_ trigger, 
> > > and 
> > > I guess with timeslices it eventually always will, but I suspect we could 
> > > have some serious latency spikes.
> > 
> > Yes, the owner getting preempted after acquiring the lock, but before
> > setting the owner can give some nasties :-(
> > 
> > I initially did that preempt_disable/enable around the fast path, but I
> > agree that slowing down the fast path is unwelcome.
> > 
> > Alternatively we could go back to block on !owner, with the added
> > complexity of not breaking out of the spin on lock->owner != owner
> > when !lock->owner, so that the premature owner clearing of the unlock
> > fast path will not force a schedule right before we get a chance to
> > acquire the lock.
> > 
> > Let me do that..
> 
> Ok a few observations..
> 
> Adding that need_resched() in the outer loop utterly destroys the
> performance gain for PREEMPT=y. Voluntary preemption is mostly good, but
> somewhat unstable results.

How about if (!owner && need_resched()) break; instead of the
unconditional need_resched().  That should solve the race that Linus saw
without and hurt PREEMPT less.

> 
> Adding that blocking on !owner utterly destroys everything.
> 
> Going to look into where that extra preemption comes from.

-chris


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Peter Zijlstra
On Fri, 2009-01-09 at 11:47 +0100, Peter Zijlstra wrote:

> > So I think the bug is still there, we just hid it better by breaking out 
> > of the loop with that "if (need_resched())" always eventually triggering. 
> > And it would be ok if it really is guaranteed to _eventually_ trigger, and 
> > I guess with timeslices it eventually always will, but I suspect we could 
> > have some serious latency spikes.
> 
> Yes, the owner getting preempted after acquiring the lock, but before
> setting the owner can give some nasties :-(
> 
> I initially did that preempt_disable/enable around the fast path, but I
> agree that slowing down the fast path is unwelcome.
> 
> Alternatively we could go back to block on !owner, with the added
> complexity of not breaking out of the spin on lock->owner != owner
> when !lock->owner, so that the premature owner clearing of the unlock
> fast path will not force a schedule right before we get a chance to
> acquire the lock.
> 
> Let me do that..

Ok a few observations..

Adding that need_resched() in the outer loop utterly destroys the
performance gain for PREEMPT=y. Voluntary preemption is mostly good, but
somewhat unstable results.

Adding that blocking on !owner utterly destroys everything.

Going to look into where that extra preemption comes from.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Ingo Molnar

* H. Peter Anvin  wrote:

> Andi Kleen wrote:
> >> I'll try to annotate the inline asms (there's not _that_ many of them), 
> >> and measure what the size impact is.
> > 
> > You can just use the patch I submitted and that you rejected for
> > most of them :)
> 
> I just ran a sample build for x86-64 with gcc 4.3.0, these all
> allyesconfig builds (modulo the inlining option):
> 
> : voreg 64 ; size o.*/vmlinux
>textdata bss dec hex filename
> 59421552 24912223 15560504 99894279 5f44407 o.noopt/vmlinux
> 57700527 24950719 15560504 98211750 5da97a6 o.opty/vmlinux
> 57590217 24940519 15560504 98091240 5d8c0e8 o.andi/vmlinux
> 
> A 3% code size difference even on allyesconfig (1.8 MB!) is nothing to 
> sneeze at.  As shown by the delta from Andi's patch, these small 
> assembly stubs really do need to be annotated, since gcc simply has no 
> way to do anything sane with them -- it just doesn't know.

I've done a finegrained size analysis today (see my other mail in this 
thread), and it turns out that on gcc 4.3.x the main (and pretty much 
only) inlining annotation that matters in arch/x86/include/asm/*.h is the 
onliner patch attached below, annotating constant_test_bit().

That change is included in Andi's patch too AFAICS - i.e. just that single 
hunk from Andi's patch would have given you 90% of the size win - an 
additional 0.17% size win to the 3.00% that CONFIG_OPTIMIZE_INLINING=y 
already brings.

The second patch below had some (much smaller, 0.01% ) impact too. All the 
other annotations i did to hundreds of inlined asm()s had no measurable 
effect on GCC 4.3.2. (i.e. gcc appears to inline single-statement asms 
correctly)

[ On older GCC it might matter more, but there we can/should turn off
  CONFIG_OPTIMIZE_INLINING. ]

> Personally, I'd like to see __asm_inline as opposed to __always_inline 
> for these, though, as a documentation issue: __always_inline implies to 
> me that this function needs to be inlined for correctness, and this 
> could be highly relevant if someone, for example, recodes the routine in 
> C or decides to bloat it out (e.g. paravirt_ops).

Yeah. I've implemented __asm_inline today. It indeed documents the reason 
for the annotation in a cleaner way than slapping __always_inline around 
and diluting the quality of __always_inline annotations.

> It's not a perfect solution even then, because gcc may choose to not 
> inline a higher level of inline functions for the same bogus reason. 
> There isn't much we can do about that, though, unless gcc either 
> integrates the assembler, or gives us some way of injecting the actual 
> weight of the asm statement...

Yeah.

Ingo

---
 arch/x86/include/asm/bitops.h |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: linux/arch/x86/include/asm/bitops.h
===
--- linux.orig/arch/x86/include/asm/bitops.h
+++ linux/arch/x86/include/asm/bitops.h
@@ -300,7 +300,8 @@ static inline int test_and_change_bit(in
return oldbit;
 }
 
-static inline int constant_test_bit(int nr, const volatile unsigned long *addr)
+static __asm_inline int
+constant_test_bit(int nr, const volatile unsigned long *addr)
 {
return ((1UL << (nr % BITS_PER_LONG)) &
(((unsigned long *)addr)[nr / BITS_PER_LONG])) != 0;



 arch/x86/include/asm/bitops.h |   12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

Index: linux/arch/x86/include/asm/bitops.h
===
--- linux.orig/arch/x86/include/asm/bitops.h
+++ linux/arch/x86/include/asm/bitops.h
@@ -53,7 +53,7 @@
  * Note that @nr may be almost arbitrarily large; this function is not
  * restricted to acting on a single-word quantity.
  */
-static inline void set_bit(unsigned int nr, volatile unsigned long *addr)
+static __asm_inline void set_bit(unsigned int nr, volatile unsigned long *addr)
 {
if (IS_IMMEDIATE(nr)) {
asm volatile(LOCK_PREFIX "orb %1,%0"
@@ -75,7 +75,7 @@ static inline void set_bit(unsigned int 
  * If it's called on the same region of memory simultaneously, the effect
  * may be that only one operation succeeds.
  */
-static inline void __set_bit(int nr, volatile unsigned long *addr)
+static __asm_inline void __set_bit(int nr, volatile unsigned long *addr)
 {
asm volatile("bts %1,%0" : ADDR : "Ir" (nr) : "memory");
 }
@@ -90,7 +90,7 @@ static inline void __set_bit(int nr, vol
  * you should call smp_mb__before_clear_bit() and/or smp_mb__after_clear_bit()
  * in order to ensure changes are visible on other processors.
  */
-static inline void clear_bit(int nr, volatile unsigned long *addr)
+static __asm_inline void clear_bit(int nr, volatile unsigned long *addr)
 {
if (IS_IMMEDIATE(nr)) {
asm volatile(LOCK_PREFIX "andb %1,%0"
@@ -117,7 +117,7 @@ static inline void clear_bit_unlock(unsi
clear_bit(nr, addr);
 }
 
-static inline void __cle

Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Chris Mason
On Fri, 2009-01-09 at 04:35 +0100, Andi Kleen wrote:
> On Thu, Jan 08, 2009 at 05:44:25PM -0800, H. Peter Anvin wrote:
> > Harvey Harrison wrote:
> > >>
> > >> We might still try the second or third options, as i think we shouldnt 
> > >> go 
> > >> back into the business of managing the inline attributes of ~100,000 
> > >> kernel functions.
> > > 
> > > Or just make it clear that inline shouldn't (unless for a very good 
> > > reason)
> > > _ever_ be used in a .c file.
> > > 
> > 
> > The question is if that would produce acceptable quality code.  In
> > theory it should, but I'm more than wondering if it really will.
> 
> I actually often use noinline when developing code simply because it 
> makes it easier to read oopses when gcc doesn't inline ever static
> (which it normally does if it only has a single caller). You know
> roughly where it crashed without having to decode the line number.
> 
> I believe others do that too, I notice it's all over btrfs for example.

For btrfs it was mostly about stack size at first.  I'd use
checkstack.pl and then run through the big funcs and figure out how they
got so huge.  It was almost always because gcc was inlining something it
shouldn't, so I started using it on most funcs.

-chris


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Peter Zijlstra
On Thu, 2009-01-08 at 11:54 -0800, Linus Torvalds wrote:

> I was pretty sure that adding the unlocked loop should provably not change 
> the mutex lock semantics. Why? Because it's just basically equivalent to 
> just doing the mutex_trylock() without really changing anything really 
> fundamental in the mutex logic.
> 
> And that argument is sadly totally bogus. 

It fails for the RT case, yes. It should still be true for regular tasks
- if the owner tracking was accurate.

> The thing is, we used to have this guarantee that any contention would 
> always go into the slowpath, and then in the slow-path we serialize using 
> the spinlock. 
> 
> So I think the bug is still there, we just hid it better by breaking out 
> of the loop with that "if (need_resched())" always eventually triggering. 
> And it would be ok if it really is guaranteed to _eventually_ trigger, and 
> I guess with timeslices it eventually always will, but I suspect we could 
> have some serious latency spikes.

Yes, the owner getting preempted after acquiring the lock, but before
setting the owner can give some nasties :-(

I initially did that preempt_disable/enable around the fast path, but I
agree that slowing down the fast path is unwelcome.

Alternatively we could go back to block on !owner, with the added
complexity of not breaking out of the spin on lock->owner != owner
when !lock->owner, so that the premature owner clearing of the unlock
fast path will not force a schedule right before we get a chance to
acquire the lock.

Let me do that..

> The problem? Setting "lock->count" to 0. That will mean that the next 
> "mutex_unlock()" will not necessarily enter the slowpath at all, and won't 
> necessarily wake things up like it should.

That's exactly what __mutex_fastpath_trylock() does (or can do,
depending on the implementation), so if regular mutexes are correct in
the face of a trylock stealing the lock in front of a woken up waiter,
then we're still good.

That said, I'm not seeing how mutexes aren't broken already.

say A locks it: counter 1->0
then B contends: counter 0->-1, added to wait list
then C contentds: counter -1, added to wait list
then A releases: counter -1->1, wake someone up, say B
then D trylocks: counter 1->0
so B is back to wait list
then D releases, 0->1, no wakeup

Aaah, B going back to sleep sets it to -1

Therefore, I think we're good.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Peter Zijlstra
On Thu, 2009-01-08 at 11:13 -0800, Linus Torvalds wrote:
> 
> On Thu, 8 Jan 2009, Chris Mason wrote:
> > 
> > It is less fair though, the 50 proc parallel creates had a much bigger
> > span between the first and last proc's exit time.  This isn't a huge
> > shock, I think it shows the hot path is closer to a real spin lock.
> 
> Actually, the real spin locks are now fair. We use ticket locks on x86.

> We _could_ certainly aim for using ticket locks for mutexes too, that 
> might be quite nice.

Not really, ticket locks cannot handle a spinner going away - and we
need that here.

I've googled around a bit and MCS locks
(http://www.cs.rice.edu/~johnmc/papers/asplos91.pdf) look like a viable
way to gain fairness in our situation.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-08 Thread Andi Kleen
> foo()
> {
>   char a[1000];
> }
> 
> bar()
> {
>   char a[1000];
> }
> 
> zot()
> {
>   foo();
>   bar();
> }
> 
> uses 2000 bytes of stack.
> And with the right compiler version.

I believe that's fixed in newer gcc versions.

For old gccs we might indeed need to add noinlines though.

-Andi

-- 
a...@linux.intel.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-08 Thread Andi Kleen
On Thu, Jan 08, 2009 at 07:42:48PM -0800, Linus Torvalds wrote:
> > I actually often use noinline when developing code simply because it 
> > makes it easier to read oopses when gcc doesn't inline ever static
> > (which it normally does if it only has a single caller)
> 
> Yes. Gcc inlining is a total piece of sh*t.

The static inlining by default (unfortunately) saves a lot of text size.

For testing I built an x86-64 allyesconfig kernel with 
-fno-inline-functions-called-once (which disables the default static
inlining), and it increased text size by ~4.1% (over 2MB for a allyesconfig 
kernel). So I think we have to keep that, dropping it would cost too
much :/

-Andi

-- 
a...@linux.intel.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-08 Thread Andrew Morton
On Fri, 9 Jan 2009 04:35:31 +0100 Andi Kleen  wrote:

> On Thu, Jan 08, 2009 at 05:44:25PM -0800, H. Peter Anvin wrote:
> > Harvey Harrison wrote:
> > >>
> > >> We might still try the second or third options, as i think we shouldnt 
> > >> go 
> > >> back into the business of managing the inline attributes of ~100,000 
> > >> kernel functions.
> > > 
> > > Or just make it clear that inline shouldn't (unless for a very good 
> > > reason)
> > > _ever_ be used in a .c file.
> > > 
> > 
> > The question is if that would produce acceptable quality code.  In
> > theory it should, but I'm more than wondering if it really will.
> 
> I actually often use noinline when developing code simply because it 
> makes it easier to read oopses when gcc doesn't inline ever static
> (which it normally does if it only has a single caller). You know
> roughly where it crashed without having to decode the line number.
> 
> I believe others do that too, I notice it's all over btrfs for example.
> 

Plus there is the problem where

foo()
{
char a[1000];
}

bar()
{
char a[1000];
}

zot()
{
foo();
bar();
}

uses 2000 bytes of stack.

Fortunately scripts/checkstack.pl can find these.

If someone runs it.

With the right kconfig settings.

And with the right compiler version.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


  1   2   >