Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-11 Thread David Howells
Russell King <[EMAIL PROTECTED]> wrote:

> Yes you can.  Well, you can on ARM at least.  Between the load exclusive
> you can do anything you like until you hit the store exclusive.

How come atomic_set() on arm6 is implemented as:

static inline void atomic_set(atomic_t *v, int i)
{
unsigned long tmp;

__asm__ __volatile__("@ atomic_set\n"
"1: ldrex   %0, [%1]\n"
"   strex   %0, %2, [%1]\n"
"   teq %0, #0\n"
"   bne 1b"
: "=" (tmp)
: "r" (>counter), "r" (i)
: "cc");
}

Why LDREX/STREX and not direct assignment?

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-11 Thread David Howells
Russell King [EMAIL PROTECTED] wrote:

 Yes you can.  Well, you can on ARM at least.  Between the load exclusive
 you can do anything you like until you hit the store exclusive.

How come atomic_set() on arm6 is implemented as:

static inline void atomic_set(atomic_t *v, int i)
{
unsigned long tmp;

__asm__ __volatile__(@ atomic_set\n
1: ldrex   %0, [%1]\n
   strex   %0, %2, [%1]\n
   teq %0, #0\n
   bne 1b
: =r (tmp)
: r (v-counter), r (i)
: cc);
}

Why LDREX/STREX and not direct assignment?

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Nick Piggin

Russell King wrote:

On Fri, Dec 08, 2006 at 12:18:52PM +1100, Nick Piggin wrote:


Russell King wrote:


On Thu, Dec 07, 2006 at 08:31:08PM +1100, Nick Piggin wrote:



Implementing ll/sc based accessor macros allows both ll/sc _and_ cmpxchg
architectures to produce optimal code.

Implementing an cmpxchg based accessor macro allows cmpxchg architectures
to produce optimal code and ll/sc non-optimal code.

See my point?


Wrong. Your ll/sc implementation with cmpxchg is buggy. The cmpxchg
load_locked is not locked at all,



Intentional - cmpxchg architectures don't generally have a load locked.


Exactly, so it is wrong -- you can't implement that behaviour with
load + cmpxchg.



I disagree.  I _have_ implemented the required behaviour.  I really
don't understand your point saying that it is wrong.



and there can be interleaving writes
between the load and cmpxchg which do not cause the store_conditional
to fail.



In which case the cmpxchg fails and we do the atomic operation again,
in exactly the same way that we do the operation again if the 'sc'
fails in the ll/sc case.


Not if cmpxchg sees the same value, it won't fail, regardless of how
many writes have hit that memory address.



Don't see anything wrong with that.  If that was a problem, atomic
implementations using cmpxchg on x86 would be impossible.

I think you're trying to implement ll/sc semantics on CPUs without
ll/sc which is exactly not what I'm trying to do.  I'd argue that's
impossible.


Yes, I did think that from reading your emails. It is not a problem
as such, but it is important to be clear on semantics.


I'm trying to suggest a better implementation for atomic ops rather
than just bowing to this x86-centric "cmpxchg is the best, everyone
must implement it" mentality.


Even if ARM is able to handle any arbitrary C code between the
"load locked" and store conditional API, other architectures can not
by definition.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Linus Torvalds


On Fri, 8 Dec 2006, Russell King wrote:
>> 
> The only instructions which affect the exclusive access state are the
> exclusive access instructions themselves.

Not according to the docs I found.

The ARM1136 manual explicitly states that any attempt to modify that 
address clears the tag (for shared memory regions, by _any_ CPU, and for 
nonshared regions by _that_ CPU).

And btw, that _has_ to be true, because otherwise the whole ldrex/strex 
sequence would be totally unusable as a way to do atomic bit operations on 
UP in the presense of interrupts (well, you could have a clrex instruction 
in the interrupt handler, but as far as I know you don't, so that seems to 
be a moot point - you only seem to do it in __switch_to).

In other words, I _really_ think you're wrong. 

So the very code sequence you quote MUST NOT WORK the way you claim it 
does. And not only that, since the granularity of the mark is not just for 
the bytes in question, but potentially apparently up to 128 bytes, any 
store even _close_ to the memory you had xclusive access to will break the 
exclusive access.

Really, Russell. Your stance makes no sense. It doesn't make any sense 
from a microarchitectural standpoint (it's not how you'd normally 
implement these things), but it ALSO makes no sense from the way you 
already use those instructions (as a way to protect against other 
processors - including your own - touching that word).

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Linus Torvalds


On Fri, 8 Dec 2006, Russell King wrote:
> 
> No such restriction on ARM.
> 
> Also not true.  The architectural implementation is:

I checked the ARM manuals, and quite fankly, they don't back you up.

They do not claim that the physical address tag is byte-granular, and in 
fact they make it pretty clear that the same tag is used for all the 
sizes, which implies that the tag granularity is NOT byte granular, but 
likely something else. 

Both the manuals I checked also say: "Other events might cause the tag to 
be cleared", without going into particular details other than saying that 
a region that is marked non-shared might still clear the tag on access by 
other CPU's - but they leave it open whether that's by design or not.

In other words, if there actually is an architectural guarantee that 
ldrex/strex are really as strong as you imply, it's not in the standard 
architecture manuals from ARM at least for the ARM11562 or the ARM1136.

So I suspect you're wrong, and that the ldrex/strex tags actually are not 
all that different from other archtiectures which tend to have cacheline 
granularities or more (I _think_ the original alpha granularity was the 
whole address space, and any cache traffic would clear it. That's _really_ 
pathetically weak, but hey, I might remember wrong, and it was the very 
first implementation. I doubt ARM is _that_ weak, but I doubt it's as 
strong as you claim).

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Russell King
On Fri, Dec 08, 2006 at 11:35:39AM -0800, Linus Torvalds wrote:
> 
> 
> On Fri, 8 Dec 2006, Russell King wrote:
> >
> > Yes you can.  Well, you can on ARM at least.  Between the load exclusive
> > you can do anything you like until you hit the store exclusive.  If you
> > haven't touched the location (with anything other than another load
> > exclusive) and no other CPU has issued a load exclusive, your store
> > exclusive succeeds.
> 
> Is that actually true?
> 
> Almost all LL/SC implementations have granularity rules, where "touch the 
> location" is not a byte-granular thing, but actually ends up being 
> something like "touch the same cachline".

ARM's implementation works by using some CPU core peripheral hardware to hold
two pieces of information:

 1. the physical address being accessed by the load exclusive instruction.
 2. the cpu number.

When an exclusive load is performed, this hardware "remembers" bits
[31:N] (7 >= N >= 2 *) of the physical address and the CPU number.

When an exclusive store is performed, the address and CPU number are
compared to the stored versions, and the store is only allowed to succeed
if they match.  Hence, it is independent of the actual cache state,
and is relatively cheap to implement.

The only instructions which affect the exclusive access state are the
exclusive access instructions themselves.  Hence, to the same memory
location with the following sequence:

load exclusive
load normal
modify
store normal
store exclusive

the store exclusive _will_ succeed even though you've modified the value
in that memory location via standard loads/stores.

(* - while it is true that the physical address is stored,
implementations may themselves choose not to store between the
lower 7 to 2 bits inclusive - which incidentally means putting
two atomic_t's next together is probably a Bad Thing(tm)... and
rather sucks.  However, non-exclusive accesses within that address
region do not affect the exclusive access state.)

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Russell King
On Fri, Dec 08, 2006 at 11:37:45AM -0800, Linus Torvalds wrote:
> 
> 
> On Fri, 8 Dec 2006, Russell King wrote:
> > 
> > I utterly disagree.  I could code atomic_add() as:
> 
> Sure. And Alpha could do that too. If you write the C code a specific way, 
> you can make it work. That does NOT mean that you can expose it widely as 
> a portable interface - it's still just a very _nonportable_ interface that 
> you use internally within one architecture to implement other interfaces.

However, nothing stops you wrapping the non-portable nature of ll/sc up
into the store part though.

If you can efficiently implement cmpxchg inside an ll/sc based portable
interface (yes you can) and you can implement problematical ll/sc
structures inside a cmpxchg() interface, you can do it either way around.
Only one way doesn't penalise broken ll/sc based implementations though.

That is the essence of my argument.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Linus Torvalds


On Fri, 8 Dec 2006, Russell King wrote:
>
> Yes you can.  Well, you can on ARM at least.  Between the load exclusive
> you can do anything you like until you hit the store exclusive.  If you
> haven't touched the location (with anything other than another load
> exclusive) and no other CPU has issued a load exclusive, your store
> exclusive succeeds.

Is that actually true?

Almost all LL/SC implementations have granularity rules, where "touch the 
location" is not a byte-granular thing, but actually ends up being 
something like "touch the same cachline".

They also often have _other_ rules like: "the cacheline has to stay in the 
L1 in exclusive state" etc. Which means that in a direct-mapped L1 cache, 
you can't even load anything that might be in the same way, because it 
would cause a cache eviction that invalidates the SC.

It's possible that ARM has really strong LL/SC, but quite frankly, that 
sounds unlikely. I've never heard of anybody ever _architecturally_ saying 
that they support that strong requirements, even if certain micro- 
architectures might actually support stronger semantics than the ones 
guaranteed by the architectural rules.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Linus Torvalds


On Fri, 8 Dec 2006, Russell King wrote:
> 
> I utterly disagree.  I could code atomic_add() as:

Sure. And Alpha could do that too. If you write the C code a specific way, 
you can make it work. That does NOT mean that you can expose it widely as 
a portable interface - it's still just a very _nonportable_ interface that 
you use internally within one architecture to implement other interfaces.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Russell King
On Fri, Dec 08, 2006 at 11:15:58AM -0800, Linus Torvalds wrote:
> On Fri, 8 Dec 2006, Christoph Lameter wrote:
> > 
> > As also shown in this thread: There are restrictions on what you can do 
> > between ll/sc
> 
> This, btw, is almost certainly true on ARM too.
> 
> There are three major reasons for restrictions on ll/sc:
> 
>  - bus-cycle induced things (eg variations of "you cannot do a store in 
>between the ll and the sc, because it will touch the cache and clear 
>the bit", where "the store" might be a load too, and "the cache" might
>be just "the bus interface")

No such restriction on ARM.

>  - trap handling usually clears the internal lock bit too, which means 
>that depending on the micro-architecture, even internal microtraps 
>(like even just branch misprediction, but more commonly things like TLB 
>misses etc) can cause a sc to always fail.

Also not true.  The architectural implementation is:

ldrex: tags the _physical_ address + cpu number,
transitions to exclusive access.

strex: if in exclusive access state and matches the previous
tagged physical address + cpu number pair, store
succeeds.

This is typically implemented in hardware, and in the case of a SMP
system, external to the CPU cores themselves.  So all it's doing is
looking at the exclusive accesses.  It is not embedded into the CPU
core so that micro-architectural stuff affects it.

>  - timing. Livelock in particular.

That is a problem that we're facing, and the solution is rather simple.
You need to introduce a CPU-number specific number of cycles before
retrying the operation on failure.

> All of which means that _nobody_ can really do this reliably in C.

I utterly disagree.  I could code atomic_add() as:

extern void cpu_specific_delay(void);

static inline int atomic_add_return(int i, atomic_t *v)
{
do {
asm("ldrex %0, [%1]" : "=r" (val) : "r" (v));
val += i;
asm("strex %0, %1, [%2]" : "=r" (res) : "r" (val), "r" (v));
if (res)
cpu_specific_delay();
} while (res);

return val;
}

and actually we /are/ going to have to go down this path to break the
livelock problem, like it or not.  Ditto for the ARM bitops operations,
so we might as well have ARM at least implement a generic ll/sc thing.

Coding the cpu specific delays into every strex site is going to make
them far too heavy to put inline.

> So right now, I think the "cmpxchg" or the "bitmask set" approach are the 
> alternatives. Russell - LL/SC simply isn't on the table as an interface, 
> whether you like it or not.

Buggerit then.  cmpxchg sucks as an interface, and I am going to
continue to assert that.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Linus Torvalds


On Fri, 8 Dec 2006, Christoph Lameter wrote:
> 
> As also shown in this thread: There are restrictions on what you can do 
> between ll/sc

This, btw, is almost certainly true on ARM too.

There are three major reasons for restrictions on ll/sc:

 - bus-cycle induced things (eg variations of "you cannot do a store in 
   between the ll and the sc, because it will touch the cache and clear 
   the bit", where "the store" might be a load too, and "the cache" might
   be just "the bus interface")

 - trap handling usually clears the internal lock bit too, which means 
   that depending on the micro-architecture, even internal microtraps 
   (like even just branch misprediction, but more commonly things like TLB 
   misses etc) can cause a sc to always fail.

 - timing. Livelock in particular.

The last one is the one that hits everybody, regardless of 
microarchitecture. The rule may be that the LL/SC need to be within a 
certain number of cycles (which can be very small - like ten) in order to 
guarantee that the cacheline can't be stolen. 

All of which means that _nobody_ can really do this reliably in C. Even if 
there are no other microarchitectural rules (and it sounds like that might 
be true on ARM), the timing issue means that you can _still_ only use it 
for very specific and simple sequences, and trying to expose it as a 
higher-level thing is not going to work in general for anything even 
remotely complicated.

(The timing may also mean that you end up having to do random back-off 
etc, just to make sure _somebody_ makes progress. Ie it might not be a 
matter of "within ten cycles", but "you need to randomize the timing").

In other words, it's simply not an option to expose LL/SC as an interface. 
It would be VERY convenient to do, since cmpxchg can emulate ll/sc (the 
"ll" part is a normal load, the "sc" part is a "compare that the old value 
still matches, and store the new one if so"). But because you can't expose 
LL/SC anyway in any reasonably portable way, that just doesn't work.

So, you really do end up with three possibilities:

 - do things with TRULY PORTABLE interfaces. And like it or not, cmpxchg 
   is the closest thing you can get to that. It's trivial to do cmpxchg 
   using ll/sc (modulo the "random backoff part" if you need it, which is 
   still pretty simple, but no longer totally trivial), and architectures 
   that have neither ll/sc _nor_ a native cmpxchg can just go screw 
   themselves with spinlocks - they really aren't worth worrying about in 
   SMP. At some point you have to tell hardware designers that their 
   hardware just sucks.

 - have ugly conditional code in generic code. I personally think this is 
   a _much_ worse option in most cases.

 - have a much higher-level interface and make it _all_ architecture- 
   dependent (possibly with a "generic" version for sane architectures). 
   This works, but the more high-level it is, the more you end up having 
   the same thign written in many different ways, and nasty maintenance.

   So we generally set the bar pretty low. Things like semaphore locking 
   primitives are high-level enough already that we prefer to try to make 
   them use common lower-level interfaces (spinlocks, cmpxchg etc). 
   Something like kernel/workqueue.c is _way_ too high a level to do 
   arch-specific.

So right now, I think the "cmpxchg" or the "bitmask set" approach are the 
alternatives. Russell - LL/SC simply isn't on the table as an interface, 
whether you like it or not.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Russell King
On Fri, Dec 08, 2006 at 10:46:17AM -0800, Linus Torvalds wrote:
> 
> 
> On Fri, 8 Dec 2006, David Howells wrote:
> > 
> > In fact I think more things have LL/SC than have CMPXCHG.
> 
> But you cannot expose ll/sc to C, so that's a bogus argument.

Yes you can.  Well, you can on ARM at least.  Between the load exclusive
you can do anything you like until you hit the store exclusive.  If you
haven't touched the location (with anything other than another load
exclusive) and no other CPU has issued a load exclusive, your store
exclusive succeeds.  Whether you have branches, loads or stores to
other locations, etc.  None of that matters.

Not so bogus as you make out.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Linus Torvalds


On Fri, 8 Dec 2006, David Howells wrote:
> 
> In fact I think more things have LL/SC than have CMPXCHG.

But you cannot expose ll/sc to C, so that's a bogus argument.

If you do ll/sc, you need to program in assembly language.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Christoph Lameter
On Fri, 8 Dec 2006, Russell King wrote:

> As proven previously the reverse is also true.  And as shown previously
> the cheaper out of the two for all platforms is the LL/SC based
> implementation, where the architecture specific implementation can
> be _either_ LL/SC based or cmpxchg based depending on what is
> supported in their hardware.

As also shown in this thread: There are restrictions on what you can do 
between ll/sc. You would not want to use C code there. ll/sc is an thing 
that needs to be restricted to asm code. So this is not a viable proposal 
at all. ll/sc is useful to construct various atomic functions but cannot 
be directly used in C code. cmpxchg can be effectively realized using 
ll/sc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Russell King
On Fri, Dec 08, 2006 at 09:06:00AM -0800, Christoph Lameter wrote:
> On Fri, 8 Dec 2006, David Howells wrote:
> 
> > > It is the most universal atomic instruction that I know of.
> > 
> > I think TAS-type things and XCHG-type things are more common.
> 
> Huh? The most popular architectures are i386 x86_64 sparc ia64 etc which 
> all have one or the other form of cmpxchg (some issues with early sparc 
> and i386).

According to the latest figures, 621 million ARM processors were
shipped in Q3, which equates to about 78 processors per second.
(taken from http://www.arm.com/news/15300.html).

I'd like to know the figures for these other so-called "popular
architectures".

But in any case that's an utterly irrelevant point to the argument
at hand.

> > In fact I think more things have LL/SC than have CMPXCHG.
> 
> LL/SC can be easily used to come up with a cmpxchg equivalent.

As proven previously the reverse is also true.  And as shown previously
the cheaper out of the two for all platforms is the LL/SC based
implementation, where the architecture specific implementation can
be _either_ LL/SC based or cmpxchg based depending on what is
supported in their hardware.

I'm going to keep saying this until people get it.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Christoph Lameter
On Fri, 8 Dec 2006, David Howells wrote:

> > It is the most universal atomic instruction that I know of.
> 
> I think TAS-type things and XCHG-type things are more common.

Huh? The most popular architectures are i386 x86_64 sparc ia64 etc which 
all have one or the other form of cmpxchg (some issues with early sparc 
and i386).

And yes the xchg was the first multiprocessor instruction and therefore 
is also available on very old processors.

> In fact I think more things have LL/SC than have CMPXCHG.

LL/SC can be easily used to come up with a cmpxchg equivalent.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Russell King
On Fri, Dec 08, 2006 at 08:53:22AM -0800, Christoph Lameter wrote:
> On Fri, 8 Dec 2006, Russell King wrote:
> 
> > > Not having cmpxchg is even worse because it requires the introduction and 
> > > maintenance of large sets of arch specific operations. Much more complex.
> > 
> > And which bit of "not available on many architectures" have you not grasped
> > yet?
> 
> We discussed various forms of emulating that functionality on this thread. 
> Seems to work satisfactorily. You can discover the information you skipped 
> by going back to some earlier messages of this thread.

Oh for god sake.  I've put forward a coherent argument.  I've disproved
every point put forward by people for my approach.

ARM continues to be broken because people like you are stuck with doing
things only one way.  Grow up and open your mind to other possibilities.

(It should be noted that LDREX/STREX is broken on some ARM implementations
at the moment - they will cause a livelock due to silicon bugs if both
CPUs have exactly the same number of cycles between LDREX and STREX.
This makes loops-within-loops implementations using cmpxchg _extremely_
expensive on ARM.)

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread David Howells
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> cmpxchg is the simplest solution to realize many other atomic operations and
> its widely available on a wide variety of platforms.

But by no means all.  Where it doesn't exist it can only be properly emulated
by something such as LL/SC if they are there.  If not, then you can't do a
safe cmpxchg on SMP.

> It is the most universal atomic instruction that I know of.

I think TAS-type things and XCHG-type things are more common.

In fact I think more things have LL/SC than have CMPXCHG.

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Christoph Lameter
On Fri, 8 Dec 2006, Russell King wrote:

> > Not having cmpxchg is even worse because it requires the introduction and 
> > maintenance of large sets of arch specific operations. Much more complex.
> 
> And which bit of "not available on many architectures" have you not grasped
> yet?

We discussed various forms of emulating that functionality on this thread. 
Seems to work satisfactorily. You can discover the information you skipped 
by going back to some earlier messages of this thread.




-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Russell King
On Fri, Dec 08, 2006 at 08:43:09AM -0800, Christoph Lameter wrote:
> On Fri, 8 Dec 2006, Russell King wrote:
> 
> > You're advocating cmpxchg is adopted by all architectures.  It isn't
> > available on many architectures, and those which it can be requires
> > unnecessarily complicated coding.
> 
> Not having cmpxchg is even worse because it requires the introduction and 
> maintenance of large sets of arch specific operations. Much more complex.

And which bit of "not available on many architectures" have you not grasped
yet?

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Christoph Lameter
On Fri, 8 Dec 2006, Russell King wrote:

> You're advocating cmpxchg is adopted by all architectures.  It isn't
> available on many architectures, and those which it can be requires
> unnecessarily complicated coding.

Not having cmpxchg is even worse because it requires the introduction and 
maintenance of large sets of arch specific operations. Much more complex.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Russell King
On Fri, Dec 08, 2006 at 08:06:23AM -0800, Christoph Lameter wrote:
> On Fri, 8 Dec 2006, Russell King wrote:
> 
> > I'm trying to suggest a better implementation for atomic ops rather
> > than just bowing to this x86-centric "cmpxchg is the best, everyone
> > must implement it" mentality.
> 
> cmpxchg is the simplest solution to realize many other atomic operations 
> and its widely available on a wide variety of platforms. It is the most 
> universal atomic instruction that I know of. Other atomic operations may 
> be more efficient but certainly cmpxchg is the most universal.
> 
> Having multiple instructions with restrictions of what can be done in 
> between just complicates the use and seems to be arch specific. I have not 
> seen a better solution. Are you really advocating the weirdly complex 
> ll/sc be adopted by other architectures?

You're advocating cmpxchg is adopted by all architectures.  It isn't
available on many architectures, and those which it can be requires
unnecessarily complicated coding.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Christoph Lameter
On Fri, 8 Dec 2006, Russell King wrote:

> I'm trying to suggest a better implementation for atomic ops rather
> than just bowing to this x86-centric "cmpxchg is the best, everyone
> must implement it" mentality.

cmpxchg is the simplest solution to realize many other atomic operations 
and its widely available on a wide variety of platforms. It is the most 
universal atomic instruction that I know of. Other atomic operations may 
be more efficient but certainly cmpxchg is the most universal.

Having multiple instructions with restrictions of what can be done in 
between just complicates the use and seems to be arch specific. I have not 
seen a better solution. Are you really advocating the weirdly complex 
ll/sc be adopted by other architectures?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Russell King
On Thu, Dec 07, 2006 at 03:06:06PM +, Russell King wrote:
> On Wed, Dec 06, 2006 at 11:05:22AM -0800, Linus Torvalds wrote:
> > 
> > 
> > On Wed, 6 Dec 2006, Christoph Lameter wrote:
> > >
> > > I'd really appreciate a cmpxchg that is generically available for 
> > > all arches. It will allow lockless implementation for various performance 
> > > criticial portions of the kernel.
> > 
> > I suspect ARM may have been the last one without one, no?
> > 
> > That said, cmpxchg won't necessarily be "high-performance" unless the hw 
> > supports it naturally in hardware, so..
> > 
> > Russell, are you ok with the code DavidH posted (the "try 2" one)? I'd 
> > like to get an ack from the ARM maintainer before applying it, but it 
> > looked ok.
> 
> Things seem to have moved on since this request.  Do I need to do
> anything?  I dunno.

ARM's still broken.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Russell King
On Fri, Dec 08, 2006 at 12:18:52PM +1100, Nick Piggin wrote:
> Russell King wrote:
> >On Thu, Dec 07, 2006 at 08:31:08PM +1100, Nick Piggin wrote:
> 
> >>>Implementing ll/sc based accessor macros allows both ll/sc _and_ cmpxchg
> >>>architectures to produce optimal code.
> >>>
> >>>Implementing an cmpxchg based accessor macro allows cmpxchg architectures
> >>>to produce optimal code and ll/sc non-optimal code.
> >>>
> >>>See my point?
> >>
> >>Wrong. Your ll/sc implementation with cmpxchg is buggy. The cmpxchg
> >>load_locked is not locked at all,
> >
> >
> >Intentional - cmpxchg architectures don't generally have a load locked.
> 
> Exactly, so it is wrong -- you can't implement that behaviour with
> load + cmpxchg.

I disagree.  I _have_ implemented the required behaviour.  I really
don't understand your point saying that it is wrong.

> >>and there can be interleaving writes
> >>between the load and cmpxchg which do not cause the store_conditional
> >>to fail.
> >
> >
> >In which case the cmpxchg fails and we do the atomic operation again,
> >in exactly the same way that we do the operation again if the 'sc'
> >fails in the ll/sc case.
> 
> Not if cmpxchg sees the same value, it won't fail, regardless of how
> many writes have hit that memory address.

Don't see anything wrong with that.  If that was a problem, atomic
implementations using cmpxchg on x86 would be impossible.

I think you're trying to implement ll/sc semantics on CPUs without
ll/sc which is exactly not what I'm trying to do.  I'd argue that's
impossible.

I'm trying to suggest a better implementation for atomic ops rather
than just bowing to this x86-centric "cmpxchg is the best, everyone
must implement it" mentality.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Russell King
On Fri, Dec 08, 2006 at 12:18:52PM +1100, Nick Piggin wrote:
 Russell King wrote:
 On Thu, Dec 07, 2006 at 08:31:08PM +1100, Nick Piggin wrote:
 
 Implementing ll/sc based accessor macros allows both ll/sc _and_ cmpxchg
 architectures to produce optimal code.
 
 Implementing an cmpxchg based accessor macro allows cmpxchg architectures
 to produce optimal code and ll/sc non-optimal code.
 
 See my point?
 
 Wrong. Your ll/sc implementation with cmpxchg is buggy. The cmpxchg
 load_locked is not locked at all,
 
 
 Intentional - cmpxchg architectures don't generally have a load locked.
 
 Exactly, so it is wrong -- you can't implement that behaviour with
 load + cmpxchg.

I disagree.  I _have_ implemented the required behaviour.  I really
don't understand your point saying that it is wrong.

 and there can be interleaving writes
 between the load and cmpxchg which do not cause the store_conditional
 to fail.
 
 
 In which case the cmpxchg fails and we do the atomic operation again,
 in exactly the same way that we do the operation again if the 'sc'
 fails in the ll/sc case.
 
 Not if cmpxchg sees the same value, it won't fail, regardless of how
 many writes have hit that memory address.

Don't see anything wrong with that.  If that was a problem, atomic
implementations using cmpxchg on x86 would be impossible.

I think you're trying to implement ll/sc semantics on CPUs without
ll/sc which is exactly not what I'm trying to do.  I'd argue that's
impossible.

I'm trying to suggest a better implementation for atomic ops rather
than just bowing to this x86-centric cmpxchg is the best, everyone
must implement it mentality.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Russell King
On Thu, Dec 07, 2006 at 03:06:06PM +, Russell King wrote:
 On Wed, Dec 06, 2006 at 11:05:22AM -0800, Linus Torvalds wrote:
  
  
  On Wed, 6 Dec 2006, Christoph Lameter wrote:
  
   I'd really appreciate a cmpxchg that is generically available for 
   all arches. It will allow lockless implementation for various performance 
   criticial portions of the kernel.
  
  I suspect ARM may have been the last one without one, no?
  
  That said, cmpxchg won't necessarily be high-performance unless the hw 
  supports it naturally in hardware, so..
  
  Russell, are you ok with the code DavidH posted (the try 2 one)? I'd 
  like to get an ack from the ARM maintainer before applying it, but it 
  looked ok.
 
 Things seem to have moved on since this request.  Do I need to do
 anything?  I dunno.

ARM's still broken.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Christoph Lameter
On Fri, 8 Dec 2006, Russell King wrote:

 I'm trying to suggest a better implementation for atomic ops rather
 than just bowing to this x86-centric cmpxchg is the best, everyone
 must implement it mentality.

cmpxchg is the simplest solution to realize many other atomic operations 
and its widely available on a wide variety of platforms. It is the most 
universal atomic instruction that I know of. Other atomic operations may 
be more efficient but certainly cmpxchg is the most universal.

Having multiple instructions with restrictions of what can be done in 
between just complicates the use and seems to be arch specific. I have not 
seen a better solution. Are you really advocating the weirdly complex 
ll/sc be adopted by other architectures?
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Russell King
On Fri, Dec 08, 2006 at 08:06:23AM -0800, Christoph Lameter wrote:
 On Fri, 8 Dec 2006, Russell King wrote:
 
  I'm trying to suggest a better implementation for atomic ops rather
  than just bowing to this x86-centric cmpxchg is the best, everyone
  must implement it mentality.
 
 cmpxchg is the simplest solution to realize many other atomic operations 
 and its widely available on a wide variety of platforms. It is the most 
 universal atomic instruction that I know of. Other atomic operations may 
 be more efficient but certainly cmpxchg is the most universal.
 
 Having multiple instructions with restrictions of what can be done in 
 between just complicates the use and seems to be arch specific. I have not 
 seen a better solution. Are you really advocating the weirdly complex 
 ll/sc be adopted by other architectures?

You're advocating cmpxchg is adopted by all architectures.  It isn't
available on many architectures, and those which it can be requires
unnecessarily complicated coding.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Christoph Lameter
On Fri, 8 Dec 2006, Russell King wrote:

 You're advocating cmpxchg is adopted by all architectures.  It isn't
 available on many architectures, and those which it can be requires
 unnecessarily complicated coding.

Not having cmpxchg is even worse because it requires the introduction and 
maintenance of large sets of arch specific operations. Much more complex.


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Russell King
On Fri, Dec 08, 2006 at 08:43:09AM -0800, Christoph Lameter wrote:
 On Fri, 8 Dec 2006, Russell King wrote:
 
  You're advocating cmpxchg is adopted by all architectures.  It isn't
  available on many architectures, and those which it can be requires
  unnecessarily complicated coding.
 
 Not having cmpxchg is even worse because it requires the introduction and 
 maintenance of large sets of arch specific operations. Much more complex.

And which bit of not available on many architectures have you not grasped
yet?

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread David Howells
Christoph Lameter [EMAIL PROTECTED] wrote:

 cmpxchg is the simplest solution to realize many other atomic operations and
 its widely available on a wide variety of platforms.

But by no means all.  Where it doesn't exist it can only be properly emulated
by something such as LL/SC if they are there.  If not, then you can't do a
safe cmpxchg on SMP.

 It is the most universal atomic instruction that I know of.

I think TAS-type things and XCHG-type things are more common.

In fact I think more things have LL/SC than have CMPXCHG.

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Christoph Lameter
On Fri, 8 Dec 2006, Russell King wrote:

  Not having cmpxchg is even worse because it requires the introduction and 
  maintenance of large sets of arch specific operations. Much more complex.
 
 And which bit of not available on many architectures have you not grasped
 yet?

We discussed various forms of emulating that functionality on this thread. 
Seems to work satisfactorily. You can discover the information you skipped 
by going back to some earlier messages of this thread.




-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Russell King
On Fri, Dec 08, 2006 at 08:53:22AM -0800, Christoph Lameter wrote:
 On Fri, 8 Dec 2006, Russell King wrote:
 
   Not having cmpxchg is even worse because it requires the introduction and 
   maintenance of large sets of arch specific operations. Much more complex.
  
  And which bit of not available on many architectures have you not grasped
  yet?
 
 We discussed various forms of emulating that functionality on this thread. 
 Seems to work satisfactorily. You can discover the information you skipped 
 by going back to some earlier messages of this thread.

Oh for god sake.  I've put forward a coherent argument.  I've disproved
every point put forward by people for my approach.

ARM continues to be broken because people like you are stuck with doing
things only one way.  Grow up and open your mind to other possibilities.

(It should be noted that LDREX/STREX is broken on some ARM implementations
at the moment - they will cause a livelock due to silicon bugs if both
CPUs have exactly the same number of cycles between LDREX and STREX.
This makes loops-within-loops implementations using cmpxchg _extremely_
expensive on ARM.)

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Christoph Lameter
On Fri, 8 Dec 2006, David Howells wrote:

  It is the most universal atomic instruction that I know of.
 
 I think TAS-type things and XCHG-type things are more common.

Huh? The most popular architectures are i386 x86_64 sparc ia64 etc which 
all have one or the other form of cmpxchg (some issues with early sparc 
and i386).

And yes the xchg was the first multiprocessor instruction and therefore 
is also available on very old processors.

 In fact I think more things have LL/SC than have CMPXCHG.

LL/SC can be easily used to come up with a cmpxchg equivalent.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Russell King
On Fri, Dec 08, 2006 at 09:06:00AM -0800, Christoph Lameter wrote:
 On Fri, 8 Dec 2006, David Howells wrote:
 
   It is the most universal atomic instruction that I know of.
  
  I think TAS-type things and XCHG-type things are more common.
 
 Huh? The most popular architectures are i386 x86_64 sparc ia64 etc which 
 all have one or the other form of cmpxchg (some issues with early sparc 
 and i386).

According to the latest figures, 621 million ARM processors were
shipped in Q3, which equates to about 78 processors per second.
(taken from http://www.arm.com/news/15300.html).

I'd like to know the figures for these other so-called popular
architectures.

But in any case that's an utterly irrelevant point to the argument
at hand.

  In fact I think more things have LL/SC than have CMPXCHG.
 
 LL/SC can be easily used to come up with a cmpxchg equivalent.

As proven previously the reverse is also true.  And as shown previously
the cheaper out of the two for all platforms is the LL/SC based
implementation, where the architecture specific implementation can
be _either_ LL/SC based or cmpxchg based depending on what is
supported in their hardware.

I'm going to keep saying this until people get it.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Christoph Lameter
On Fri, 8 Dec 2006, Russell King wrote:

 As proven previously the reverse is also true.  And as shown previously
 the cheaper out of the two for all platforms is the LL/SC based
 implementation, where the architecture specific implementation can
 be _either_ LL/SC based or cmpxchg based depending on what is
 supported in their hardware.

As also shown in this thread: There are restrictions on what you can do 
between ll/sc. You would not want to use C code there. ll/sc is an thing 
that needs to be restricted to asm code. So this is not a viable proposal 
at all. ll/sc is useful to construct various atomic functions but cannot 
be directly used in C code. cmpxchg can be effectively realized using 
ll/sc.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Linus Torvalds


On Fri, 8 Dec 2006, David Howells wrote:
 
 In fact I think more things have LL/SC than have CMPXCHG.

But you cannot expose ll/sc to C, so that's a bogus argument.

If you do ll/sc, you need to program in assembly language.

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Russell King
On Fri, Dec 08, 2006 at 10:46:17AM -0800, Linus Torvalds wrote:
 
 
 On Fri, 8 Dec 2006, David Howells wrote:
  
  In fact I think more things have LL/SC than have CMPXCHG.
 
 But you cannot expose ll/sc to C, so that's a bogus argument.

Yes you can.  Well, you can on ARM at least.  Between the load exclusive
you can do anything you like until you hit the store exclusive.  If you
haven't touched the location (with anything other than another load
exclusive) and no other CPU has issued a load exclusive, your store
exclusive succeeds.  Whether you have branches, loads or stores to
other locations, etc.  None of that matters.

Not so bogus as you make out.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Linus Torvalds


On Fri, 8 Dec 2006, Christoph Lameter wrote:
 
 As also shown in this thread: There are restrictions on what you can do 
 between ll/sc

This, btw, is almost certainly true on ARM too.

There are three major reasons for restrictions on ll/sc:

 - bus-cycle induced things (eg variations of you cannot do a store in 
   between the ll and the sc, because it will touch the cache and clear 
   the bit, where the store might be a load too, and the cache might
   be just the bus interface)

 - trap handling usually clears the internal lock bit too, which means 
   that depending on the micro-architecture, even internal microtraps 
   (like even just branch misprediction, but more commonly things like TLB 
   misses etc) can cause a sc to always fail.

 - timing. Livelock in particular.

The last one is the one that hits everybody, regardless of 
microarchitecture. The rule may be that the LL/SC need to be within a 
certain number of cycles (which can be very small - like ten) in order to 
guarantee that the cacheline can't be stolen. 

All of which means that _nobody_ can really do this reliably in C. Even if 
there are no other microarchitectural rules (and it sounds like that might 
be true on ARM), the timing issue means that you can _still_ only use it 
for very specific and simple sequences, and trying to expose it as a 
higher-level thing is not going to work in general for anything even 
remotely complicated.

(The timing may also mean that you end up having to do random back-off 
etc, just to make sure _somebody_ makes progress. Ie it might not be a 
matter of within ten cycles, but you need to randomize the timing).

In other words, it's simply not an option to expose LL/SC as an interface. 
It would be VERY convenient to do, since cmpxchg can emulate ll/sc (the 
ll part is a normal load, the sc part is a compare that the old value 
still matches, and store the new one if so). But because you can't expose 
LL/SC anyway in any reasonably portable way, that just doesn't work.

So, you really do end up with three possibilities:

 - do things with TRULY PORTABLE interfaces. And like it or not, cmpxchg 
   is the closest thing you can get to that. It's trivial to do cmpxchg 
   using ll/sc (modulo the random backoff part if you need it, which is 
   still pretty simple, but no longer totally trivial), and architectures 
   that have neither ll/sc _nor_ a native cmpxchg can just go screw 
   themselves with spinlocks - they really aren't worth worrying about in 
   SMP. At some point you have to tell hardware designers that their 
   hardware just sucks.

 - have ugly conditional code in generic code. I personally think this is 
   a _much_ worse option in most cases.

 - have a much higher-level interface and make it _all_ architecture- 
   dependent (possibly with a generic version for sane architectures). 
   This works, but the more high-level it is, the more you end up having 
   the same thign written in many different ways, and nasty maintenance.

   So we generally set the bar pretty low. Things like semaphore locking 
   primitives are high-level enough already that we prefer to try to make 
   them use common lower-level interfaces (spinlocks, cmpxchg etc). 
   Something like kernel/workqueue.c is _way_ too high a level to do 
   arch-specific.

So right now, I think the cmpxchg or the bitmask set approach are the 
alternatives. Russell - LL/SC simply isn't on the table as an interface, 
whether you like it or not.

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Russell King
On Fri, Dec 08, 2006 at 11:15:58AM -0800, Linus Torvalds wrote:
 On Fri, 8 Dec 2006, Christoph Lameter wrote:
  
  As also shown in this thread: There are restrictions on what you can do 
  between ll/sc
 
 This, btw, is almost certainly true on ARM too.
 
 There are three major reasons for restrictions on ll/sc:
 
  - bus-cycle induced things (eg variations of you cannot do a store in 
between the ll and the sc, because it will touch the cache and clear 
the bit, where the store might be a load too, and the cache might
be just the bus interface)

No such restriction on ARM.

  - trap handling usually clears the internal lock bit too, which means 
that depending on the micro-architecture, even internal microtraps 
(like even just branch misprediction, but more commonly things like TLB 
misses etc) can cause a sc to always fail.

Also not true.  The architectural implementation is:

ldrex: tags the _physical_ address + cpu number,
transitions to exclusive access.

strex: if in exclusive access state and matches the previous
tagged physical address + cpu number pair, store
succeeds.

This is typically implemented in hardware, and in the case of a SMP
system, external to the CPU cores themselves.  So all it's doing is
looking at the exclusive accesses.  It is not embedded into the CPU
core so that micro-architectural stuff affects it.

  - timing. Livelock in particular.

That is a problem that we're facing, and the solution is rather simple.
You need to introduce a CPU-number specific number of cycles before
retrying the operation on failure.

 All of which means that _nobody_ can really do this reliably in C.

I utterly disagree.  I could code atomic_add() as:

extern void cpu_specific_delay(void);

static inline int atomic_add_return(int i, atomic_t *v)
{
do {
asm(ldrex %0, [%1] : =r (val) : r (v));
val += i;
asm(strex %0, %1, [%2] : =r (res) : r (val), r (v));
if (res)
cpu_specific_delay();
} while (res);

return val;
}

and actually we /are/ going to have to go down this path to break the
livelock problem, like it or not.  Ditto for the ARM bitops operations,
so we might as well have ARM at least implement a generic ll/sc thing.

Coding the cpu specific delays into every strex site is going to make
them far too heavy to put inline.

 So right now, I think the cmpxchg or the bitmask set approach are the 
 alternatives. Russell - LL/SC simply isn't on the table as an interface, 
 whether you like it or not.

Buggerit then.  cmpxchg sucks as an interface, and I am going to
continue to assert that.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Linus Torvalds


On Fri, 8 Dec 2006, Russell King wrote:
 
 I utterly disagree.  I could code atomic_add() as:

Sure. And Alpha could do that too. If you write the C code a specific way, 
you can make it work. That does NOT mean that you can expose it widely as 
a portable interface - it's still just a very _nonportable_ interface that 
you use internally within one architecture to implement other interfaces.

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Linus Torvalds


On Fri, 8 Dec 2006, Russell King wrote:

 Yes you can.  Well, you can on ARM at least.  Between the load exclusive
 you can do anything you like until you hit the store exclusive.  If you
 haven't touched the location (with anything other than another load
 exclusive) and no other CPU has issued a load exclusive, your store
 exclusive succeeds.

Is that actually true?

Almost all LL/SC implementations have granularity rules, where touch the 
location is not a byte-granular thing, but actually ends up being 
something like touch the same cachline.

They also often have _other_ rules like: the cacheline has to stay in the 
L1 in exclusive state etc. Which means that in a direct-mapped L1 cache, 
you can't even load anything that might be in the same way, because it 
would cause a cache eviction that invalidates the SC.

It's possible that ARM has really strong LL/SC, but quite frankly, that 
sounds unlikely. I've never heard of anybody ever _architecturally_ saying 
that they support that strong requirements, even if certain micro- 
architectures might actually support stronger semantics than the ones 
guaranteed by the architectural rules.

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Russell King
On Fri, Dec 08, 2006 at 11:37:45AM -0800, Linus Torvalds wrote:
 
 
 On Fri, 8 Dec 2006, Russell King wrote:
  
  I utterly disagree.  I could code atomic_add() as:
 
 Sure. And Alpha could do that too. If you write the C code a specific way, 
 you can make it work. That does NOT mean that you can expose it widely as 
 a portable interface - it's still just a very _nonportable_ interface that 
 you use internally within one architecture to implement other interfaces.

However, nothing stops you wrapping the non-portable nature of ll/sc up
into the store part though.

If you can efficiently implement cmpxchg inside an ll/sc based portable
interface (yes you can) and you can implement problematical ll/sc
structures inside a cmpxchg() interface, you can do it either way around.
Only one way doesn't penalise broken ll/sc based implementations though.

That is the essence of my argument.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Russell King
On Fri, Dec 08, 2006 at 11:35:39AM -0800, Linus Torvalds wrote:
 
 
 On Fri, 8 Dec 2006, Russell King wrote:
 
  Yes you can.  Well, you can on ARM at least.  Between the load exclusive
  you can do anything you like until you hit the store exclusive.  If you
  haven't touched the location (with anything other than another load
  exclusive) and no other CPU has issued a load exclusive, your store
  exclusive succeeds.
 
 Is that actually true?
 
 Almost all LL/SC implementations have granularity rules, where touch the 
 location is not a byte-granular thing, but actually ends up being 
 something like touch the same cachline.

ARM's implementation works by using some CPU core peripheral hardware to hold
two pieces of information:

 1. the physical address being accessed by the load exclusive instruction.
 2. the cpu number.

When an exclusive load is performed, this hardware remembers bits
[31:N] (7 = N = 2 *) of the physical address and the CPU number.

When an exclusive store is performed, the address and CPU number are
compared to the stored versions, and the store is only allowed to succeed
if they match.  Hence, it is independent of the actual cache state,
and is relatively cheap to implement.

The only instructions which affect the exclusive access state are the
exclusive access instructions themselves.  Hence, to the same memory
location with the following sequence:

load exclusive
load normal
modify
store normal
store exclusive

the store exclusive _will_ succeed even though you've modified the value
in that memory location via standard loads/stores.

(* - while it is true that the physical address is stored,
implementations may themselves choose not to store between the
lower 7 to 2 bits inclusive - which incidentally means putting
two atomic_t's next together is probably a Bad Thing(tm)... and
rather sucks.  However, non-exclusive accesses within that address
region do not affect the exclusive access state.)

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Linus Torvalds


On Fri, 8 Dec 2006, Russell King wrote:
 
 No such restriction on ARM.
 
 Also not true.  The architectural implementation is:

I checked the ARM manuals, and quite fankly, they don't back you up.

They do not claim that the physical address tag is byte-granular, and in 
fact they make it pretty clear that the same tag is used for all the 
sizes, which implies that the tag granularity is NOT byte granular, but 
likely something else. 

Both the manuals I checked also say: Other events might cause the tag to 
be cleared, without going into particular details other than saying that 
a region that is marked non-shared might still clear the tag on access by 
other CPU's - but they leave it open whether that's by design or not.

In other words, if there actually is an architectural guarantee that 
ldrex/strex are really as strong as you imply, it's not in the standard 
architecture manuals from ARM at least for the ARM11562 or the ARM1136.

So I suspect you're wrong, and that the ldrex/strex tags actually are not 
all that different from other archtiectures which tend to have cacheline 
granularities or more (I _think_ the original alpha granularity was the 
whole address space, and any cache traffic would clear it. That's _really_ 
pathetically weak, but hey, I might remember wrong, and it was the very 
first implementation. I doubt ARM is _that_ weak, but I doubt it's as 
strong as you claim).

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Linus Torvalds


On Fri, 8 Dec 2006, Russell King wrote:
 
 The only instructions which affect the exclusive access state are the
 exclusive access instructions themselves.

Not according to the docs I found.

The ARM1136 manual explicitly states that any attempt to modify that 
address clears the tag (for shared memory regions, by _any_ CPU, and for 
nonshared regions by _that_ CPU).

And btw, that _has_ to be true, because otherwise the whole ldrex/strex 
sequence would be totally unusable as a way to do atomic bit operations on 
UP in the presense of interrupts (well, you could have a clrex instruction 
in the interrupt handler, but as far as I know you don't, so that seems to 
be a moot point - you only seem to do it in __switch_to).

In other words, I _really_ think you're wrong. 

So the very code sequence you quote MUST NOT WORK the way you claim it 
does. And not only that, since the granularity of the mark is not just for 
the bytes in question, but potentially apparently up to 128 bytes, any 
store even _close_ to the memory you had xclusive access to will break the 
exclusive access.

Really, Russell. Your stance makes no sense. It doesn't make any sense 
from a microarchitectural standpoint (it's not how you'd normally 
implement these things), but it ALSO makes no sense from the way you 
already use those instructions (as a way to protect against other 
processors - including your own - touching that word).

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-08 Thread Nick Piggin

Russell King wrote:

On Fri, Dec 08, 2006 at 12:18:52PM +1100, Nick Piggin wrote:


Russell King wrote:


On Thu, Dec 07, 2006 at 08:31:08PM +1100, Nick Piggin wrote:



Implementing ll/sc based accessor macros allows both ll/sc _and_ cmpxchg
architectures to produce optimal code.

Implementing an cmpxchg based accessor macro allows cmpxchg architectures
to produce optimal code and ll/sc non-optimal code.

See my point?


Wrong. Your ll/sc implementation with cmpxchg is buggy. The cmpxchg
load_locked is not locked at all,



Intentional - cmpxchg architectures don't generally have a load locked.


Exactly, so it is wrong -- you can't implement that behaviour with
load + cmpxchg.



I disagree.  I _have_ implemented the required behaviour.  I really
don't understand your point saying that it is wrong.



and there can be interleaving writes
between the load and cmpxchg which do not cause the store_conditional
to fail.



In which case the cmpxchg fails and we do the atomic operation again,
in exactly the same way that we do the operation again if the 'sc'
fails in the ll/sc case.


Not if cmpxchg sees the same value, it won't fail, regardless of how
many writes have hit that memory address.



Don't see anything wrong with that.  If that was a problem, atomic
implementations using cmpxchg on x86 would be impossible.

I think you're trying to implement ll/sc semantics on CPUs without
ll/sc which is exactly not what I'm trying to do.  I'd argue that's
impossible.


Yes, I did think that from reading your emails. It is not a problem
as such, but it is important to be clear on semantics.


I'm trying to suggest a better implementation for atomic ops rather
than just bowing to this x86-centric cmpxchg is the best, everyone
must implement it mentality.


Even if ARM is able to handle any arbitrary C code between the
load locked and store conditional API, other architectures can not
by definition.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-07 Thread Nick Piggin

Russell King wrote:

On Thu, Dec 07, 2006 at 08:31:08PM +1100, Nick Piggin wrote:



Implementing ll/sc based accessor macros allows both ll/sc _and_ cmpxchg
architectures to produce optimal code.

Implementing an cmpxchg based accessor macro allows cmpxchg architectures
to produce optimal code and ll/sc non-optimal code.

See my point?


Wrong. Your ll/sc implementation with cmpxchg is buggy. The cmpxchg
load_locked is not locked at all,



Intentional - cmpxchg architectures don't generally have a load locked.


Exactly, so it is wrong -- you can't implement that behaviour with
load + cmpxchg.


and there can be interleaving writes
between the load and cmpxchg which do not cause the store_conditional
to fail.



In which case the cmpxchg fails and we do the atomic operation again,
in exactly the same way that we do the operation again if the 'sc'
fails in the ll/sc case.


Not if cmpxchg sees the same value, it won't fail, regardless of how
many writes have hit that memory address.


I do not see any problem.


This was not the big problem -- as I said, if this was the only problem
we could opt for a "watered down" version that doesn't actually load
locked [the ll/sc interface would be much cooler than cmpxchg, IMO :)]

The main problem is the restrictions between the ll and sc. This is why
I implemented atomic_cmpxchg rather than atomic_ll/sc. However I agree
that in critical code, a higher level API should be used instead (eg.
see atomic_add_unless, which can be implemented optimally on RISCs).

Nick

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-07 Thread Ralf Baechle
On Thu, Dec 07, 2006 at 08:31:08PM +1100, Nick Piggin wrote:

> Wrong. Your ll/sc implementation with cmpxchg is buggy. The cmpxchg
> load_locked is not locked at all, and there can be interleaving writes
> between the load and cmpxchg which do not cause the store_conditional
> to fail.
> 
> It might be reasonable to implement this watered down version, but:
> don't some architectures have restrictions on what instructions can
> be issued between the ll and the sc?

On MIPS the restriction is no loads or stores or sync instructions between
ll/sc.  Also there may be at most 2048 bytes between the address of the
ll and sc instructions.  Which means ll/sc sequences should better be
written in assembler or gcc might do a bit too creative things ...

  Ralf
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-07 Thread Linus Torvalds


On Thu, 7 Dec 2006, Nick Piggin wrote:
> 
> It might be reasonable to implement this watered down version, but:
> don't some architectures have restrictions on what instructions can
> be issued between the ll and the sc?

Yes. You really probably do not want to expose ll/sc on a C level because 
of this.

On alpha, the architecture manual says (I didn't go back and check, but 
I'm pretty sure) that a ld.l and st.c cannot have a taken branch in 
between then, for example. That basically means that you can't allow the 
compiler to reorder the basic blocks (which it often will with a 
while-loop).

Now, I actually suspect that this was not a microarchitectural flaw, and 
that a branch would _work_ there, and that the architecture manual was 
just being anal, but strictly speaking, it means that these things had 
better always be in assembly, and you can sadly not expose them (on alpha, 
at least) as higher-level primitives as such - you can only expose the 
end result ("cmpxchg" or similar).

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-07 Thread Russell King
On Wed, Dec 06, 2006 at 11:05:22AM -0800, Linus Torvalds wrote:
> 
> 
> On Wed, 6 Dec 2006, Christoph Lameter wrote:
> >
> > I'd really appreciate a cmpxchg that is generically available for 
> > all arches. It will allow lockless implementation for various performance 
> > criticial portions of the kernel.
> 
> I suspect ARM may have been the last one without one, no?
> 
> That said, cmpxchg won't necessarily be "high-performance" unless the hw 
> supports it naturally in hardware, so..
> 
> Russell, are you ok with the code DavidH posted (the "try 2" one)? I'd 
> like to get an ack from the ARM maintainer before applying it, but it 
> looked ok.

Things seem to have moved on since this request.  Do I need to do
anything?  I dunno.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-07 Thread Russell King
On Thu, Dec 07, 2006 at 08:31:08PM +1100, Nick Piggin wrote:
> Russell King wrote:
> >On Wed, Dec 06, 2006 at 11:16:55AM -0800, Christoph Lameter wrote:
> 
> >No.  If you read what I said, you'll see that you can _cheaply_ use
> >cmpxchg in a ll/sc based implementation.  Take an atomic increment
> >operation.
> >
> > do {
> > old = load_locked(addr);
> > } while (store_exclusive(old, old + 1, addr);
> >
> >On a cmpxchg, that "store_exclusive" (loosely) becomes your cmpxchg
> >instruction, comparing the first arg, and if equal storing the second.
> >The "load_locked" macro becomes a standard pointer deref.  Ergo, x86
> >becomes:
> >
> > do {
> > load value
> > manipulate it
> > conditional store
> > } while not stored
> >
> >On ll/sc, the load_locked() macro is the load locked instruction.  The
> >store_exclusive() macro is the exclusive store and it doesn't need to
> >use the first parameter at all.  Ergo, ARM becomes:
> >
> > do {
> > ldrex r1, [r2]
> > manipulate r1
> > strex r0, r1, [r2]
> > } while failed
> >
> >Notice that both are optimal.
> >
> >Now let's consider the cmpxchg case.
> >
> > do {
> > val = *addr;
> > } while (cmpxchg(val, val + 1, addr);
> >
> >The x86 case is _identical_ to the ll/sc based implementation.  Absolutely
> >entirely.  No impact what so ever.
> >
> >Let's look at the ll/sc case.  The cmpxchg code implemented on this has
> >to reload the original value, compare it, if equal store the new value.
> >So:
> >
> > do {
> > val = *addr;
> > (r2 = addr, 
> > ldrex r1, [r2]
> > compare r1, r0
> > strexeq r4, r3, [r2] (store exclusive if equal)
> > } while store failed or comparecondition failed
> >
> >Note how the cmpxchg has _forced_ the ll/sc implementation to become
> >more complex.
> >
> >So, let's recap.
> >
> >Implementing ll/sc based accessor macros allows both ll/sc _and_ cmpxchg
> >architectures to produce optimal code.
> >
> >Implementing an cmpxchg based accessor macro allows cmpxchg architectures
> >to produce optimal code and ll/sc non-optimal code.
> >
> >See my point?
> 
> Wrong. Your ll/sc implementation with cmpxchg is buggy. The cmpxchg
> load_locked is not locked at all,

Intentional - cmpxchg architectures don't generally have a load locked.

> and there can be interleaving writes
> between the load and cmpxchg which do not cause the store_conditional
> to fail.

In which case the cmpxchg fails and we do the atomic operation again,
in exactly the same way that we do the operation again if the 'sc'
fails in the ll/sc case.

I do not see any problem.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-07 Thread Ivan Kokshaysky
On Thu, Dec 07, 2006 at 08:31:08PM +1100, Nick Piggin wrote:
> It might be reasonable to implement this watered down version, but:
> don't some architectures have restrictions on what instructions can
> be issued between the ll and the sc?

Yes. On Alpha you cannot execute other load/stores, taken branches etc.
So in general case the code between ll and sc cannot be written in C...

Ivan.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-07 Thread Nick Piggin

Russell King wrote:

On Wed, Dec 06, 2006 at 11:16:55AM -0800, Christoph Lameter wrote:



No.  If you read what I said, you'll see that you can _cheaply_ use
cmpxchg in a ll/sc based implementation.  Take an atomic increment
operation.

do {
old = load_locked(addr);
} while (store_exclusive(old, old + 1, addr);

On a cmpxchg, that "store_exclusive" (loosely) becomes your cmpxchg
instruction, comparing the first arg, and if equal storing the second.
The "load_locked" macro becomes a standard pointer deref.  Ergo, x86
becomes:

do {
load value
manipulate it
conditional store
} while not stored

On ll/sc, the load_locked() macro is the load locked instruction.  The
store_exclusive() macro is the exclusive store and it doesn't need to
use the first parameter at all.  Ergo, ARM becomes:

do {
ldrex r1, [r2]
manipulate r1
strex r0, r1, [r2]
} while failed

Notice that both are optimal.

Now let's consider the cmpxchg case.

do {
val = *addr;
} while (cmpxchg(val, val + 1, addr);

The x86 case is _identical_ to the ll/sc based implementation.  Absolutely
entirely.  No impact what so ever.

Let's look at the ll/sc case.  The cmpxchg code implemented on this has
to reload the original value, compare it, if equal store the new value.
So:

do {
val = *addr;
		(r2 = addr, 
		ldrex r1, [r2]

compare r1, r0
strexeq r4, r3, [r2] (store exclusive if equal)
} while store failed or comparecondition failed

Note how the cmpxchg has _forced_ the ll/sc implementation to become
more complex.

So, let's recap.

Implementing ll/sc based accessor macros allows both ll/sc _and_ cmpxchg
architectures to produce optimal code.

Implementing an cmpxchg based accessor macro allows cmpxchg architectures
to produce optimal code and ll/sc non-optimal code.

See my point?


Wrong. Your ll/sc implementation with cmpxchg is buggy. The cmpxchg
load_locked is not locked at all, and there can be interleaving writes
between the load and cmpxchg which do not cause the store_conditional
to fail.

It might be reasonable to implement this watered down version, but:
don't some architectures have restrictions on what instructions can
be issued between the ll and the sc?

But in general I agree with you, in that a higher level primitive is
preferable (eg. atomic_add_unless).

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-07 Thread Nick Piggin

Linus Torvalds wrote:


On Thu, 7 Dec 2006, Roman Zippel wrote:


On Wed, 6 Dec 2006, Matthew Wilcox wrote:



To be honest, it'd be much easier if we only defined these operations on
atomic_t's.  We have all the infrastructure in place for them, and
they're fairly well understood.  If you need different sizes, I'm OK
with an atomic_pointer_t, or whatever.


FWIW Seconded.



I disagree.

Any _real_ CPU will simply never care about _anything_ else than just the 
size of the datum in question. There's absolutely no point to only allow 
it on certain types, especially as we _know_ those certain types are 
already going to be more than one, and we also know that they are going to 
be different sizes. In other words, in reality, we have to handle a 
sizeable subset of the whole generic situation, and the "on certain types 
only" situation is only going to be awkward and irritating.


For example, would we have a different "cmpxchg_ptr()" function for the 
atomic pointer thing? With any reasonable CPU just depending on the _size_ 
of the type, I don't see what the problem is with just doing the 
bog-standard "cmpxchg_8/16/32/64" and having the bog-standard case- 
statement in a header file to do it all automatically for you, and then we 
don't need to worry about it.


What's wrong with using atomic_cmpxchg? For unsigned long / pointers,
there is a patch to implement atomic_long_cmpxchg.

Some architectures simply can't implement cmpxchg on memory that may
be modified in arbitrary ways. If you add some more conditions to use
cmpxchg, then you weaken it (ie. can't use it for synchronisation
with userspace) and on top of that you don't get the easy static
checking that atomic_t gives you.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-07 Thread Nick Piggin

Linus Torvalds wrote:


On Thu, 7 Dec 2006, Roman Zippel wrote:


On Wed, 6 Dec 2006, Matthew Wilcox wrote:



To be honest, it'd be much easier if we only defined these operations on
atomic_t's.  We have all the infrastructure in place for them, and
they're fairly well understood.  If you need different sizes, I'm OK
with an atomic_pointer_t, or whatever.


FWIW Seconded.



I disagree.

Any _real_ CPU will simply never care about _anything_ else than just the 
size of the datum in question. There's absolutely no point to only allow 
it on certain types, especially as we _know_ those certain types are 
already going to be more than one, and we also know that they are going to 
be different sizes. In other words, in reality, we have to handle a 
sizeable subset of the whole generic situation, and the on certain types 
only situation is only going to be awkward and irritating.


For example, would we have a different cmpxchg_ptr() function for the 
atomic pointer thing? With any reasonable CPU just depending on the _size_ 
of the type, I don't see what the problem is with just doing the 
bog-standard cmpxchg_8/16/32/64 and having the bog-standard case- 
statement in a header file to do it all automatically for you, and then we 
don't need to worry about it.


What's wrong with using atomic_cmpxchg? For unsigned long / pointers,
there is a patch to implement atomic_long_cmpxchg.

Some architectures simply can't implement cmpxchg on memory that may
be modified in arbitrary ways. If you add some more conditions to use
cmpxchg, then you weaken it (ie. can't use it for synchronisation
with userspace) and on top of that you don't get the easy static
checking that atomic_t gives you.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-07 Thread Nick Piggin

Russell King wrote:

On Wed, Dec 06, 2006 at 11:16:55AM -0800, Christoph Lameter wrote:



No.  If you read what I said, you'll see that you can _cheaply_ use
cmpxchg in a ll/sc based implementation.  Take an atomic increment
operation.

do {
old = load_locked(addr);
} while (store_exclusive(old, old + 1, addr);

On a cmpxchg, that store_exclusive (loosely) becomes your cmpxchg
instruction, comparing the first arg, and if equal storing the second.
The load_locked macro becomes a standard pointer deref.  Ergo, x86
becomes:

do {
load value
manipulate it
conditional store
} while not stored

On ll/sc, the load_locked() macro is the load locked instruction.  The
store_exclusive() macro is the exclusive store and it doesn't need to
use the first parameter at all.  Ergo, ARM becomes:

do {
ldrex r1, [r2]
manipulate r1
strex r0, r1, [r2]
} while failed

Notice that both are optimal.

Now let's consider the cmpxchg case.

do {
val = *addr;
} while (cmpxchg(val, val + 1, addr);

The x86 case is _identical_ to the ll/sc based implementation.  Absolutely
entirely.  No impact what so ever.

Let's look at the ll/sc case.  The cmpxchg code implemented on this has
to reload the original value, compare it, if equal store the new value.
So:

do {
val = *addr;
		(r2 = addr, 
		ldrex r1, [r2]

compare r1, r0
strexeq r4, r3, [r2] (store exclusive if equal)
} while store failed or comparecondition failed

Note how the cmpxchg has _forced_ the ll/sc implementation to become
more complex.

So, let's recap.

Implementing ll/sc based accessor macros allows both ll/sc _and_ cmpxchg
architectures to produce optimal code.

Implementing an cmpxchg based accessor macro allows cmpxchg architectures
to produce optimal code and ll/sc non-optimal code.

See my point?


Wrong. Your ll/sc implementation with cmpxchg is buggy. The cmpxchg
load_locked is not locked at all, and there can be interleaving writes
between the load and cmpxchg which do not cause the store_conditional
to fail.

It might be reasonable to implement this watered down version, but:
don't some architectures have restrictions on what instructions can
be issued between the ll and the sc?

But in general I agree with you, in that a higher level primitive is
preferable (eg. atomic_add_unless).

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-07 Thread Ivan Kokshaysky
On Thu, Dec 07, 2006 at 08:31:08PM +1100, Nick Piggin wrote:
 It might be reasonable to implement this watered down version, but:
 don't some architectures have restrictions on what instructions can
 be issued between the ll and the sc?

Yes. On Alpha you cannot execute other load/stores, taken branches etc.
So in general case the code between ll and sc cannot be written in C...

Ivan.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-07 Thread Russell King
On Thu, Dec 07, 2006 at 08:31:08PM +1100, Nick Piggin wrote:
 Russell King wrote:
 On Wed, Dec 06, 2006 at 11:16:55AM -0800, Christoph Lameter wrote:
 
 No.  If you read what I said, you'll see that you can _cheaply_ use
 cmpxchg in a ll/sc based implementation.  Take an atomic increment
 operation.
 
  do {
  old = load_locked(addr);
  } while (store_exclusive(old, old + 1, addr);
 
 On a cmpxchg, that store_exclusive (loosely) becomes your cmpxchg
 instruction, comparing the first arg, and if equal storing the second.
 The load_locked macro becomes a standard pointer deref.  Ergo, x86
 becomes:
 
  do {
  load value
  manipulate it
  conditional store
  } while not stored
 
 On ll/sc, the load_locked() macro is the load locked instruction.  The
 store_exclusive() macro is the exclusive store and it doesn't need to
 use the first parameter at all.  Ergo, ARM becomes:
 
  do {
  ldrex r1, [r2]
  manipulate r1
  strex r0, r1, [r2]
  } while failed
 
 Notice that both are optimal.
 
 Now let's consider the cmpxchg case.
 
  do {
  val = *addr;
  } while (cmpxchg(val, val + 1, addr);
 
 The x86 case is _identical_ to the ll/sc based implementation.  Absolutely
 entirely.  No impact what so ever.
 
 Let's look at the ll/sc case.  The cmpxchg code implemented on this has
 to reload the original value, compare it, if equal store the new value.
 So:
 
  do {
  val = *addr;
  (r2 = addr, 
  ldrex r1, [r2]
  compare r1, r0
  strexeq r4, r3, [r2] (store exclusive if equal)
  } while store failed or comparecondition failed
 
 Note how the cmpxchg has _forced_ the ll/sc implementation to become
 more complex.
 
 So, let's recap.
 
 Implementing ll/sc based accessor macros allows both ll/sc _and_ cmpxchg
 architectures to produce optimal code.
 
 Implementing an cmpxchg based accessor macro allows cmpxchg architectures
 to produce optimal code and ll/sc non-optimal code.
 
 See my point?
 
 Wrong. Your ll/sc implementation with cmpxchg is buggy. The cmpxchg
 load_locked is not locked at all,

Intentional - cmpxchg architectures don't generally have a load locked.

 and there can be interleaving writes
 between the load and cmpxchg which do not cause the store_conditional
 to fail.

In which case the cmpxchg fails and we do the atomic operation again,
in exactly the same way that we do the operation again if the 'sc'
fails in the ll/sc case.

I do not see any problem.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-07 Thread Russell King
On Wed, Dec 06, 2006 at 11:05:22AM -0800, Linus Torvalds wrote:
 
 
 On Wed, 6 Dec 2006, Christoph Lameter wrote:
 
  I'd really appreciate a cmpxchg that is generically available for 
  all arches. It will allow lockless implementation for various performance 
  criticial portions of the kernel.
 
 I suspect ARM may have been the last one without one, no?
 
 That said, cmpxchg won't necessarily be high-performance unless the hw 
 supports it naturally in hardware, so..
 
 Russell, are you ok with the code DavidH posted (the try 2 one)? I'd 
 like to get an ack from the ARM maintainer before applying it, but it 
 looked ok.

Things seem to have moved on since this request.  Do I need to do
anything?  I dunno.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-07 Thread Linus Torvalds


On Thu, 7 Dec 2006, Nick Piggin wrote:
 
 It might be reasonable to implement this watered down version, but:
 don't some architectures have restrictions on what instructions can
 be issued between the ll and the sc?

Yes. You really probably do not want to expose ll/sc on a C level because 
of this.

On alpha, the architecture manual says (I didn't go back and check, but 
I'm pretty sure) that a ld.l and st.c cannot have a taken branch in 
between then, for example. That basically means that you can't allow the 
compiler to reorder the basic blocks (which it often will with a 
while-loop).

Now, I actually suspect that this was not a microarchitectural flaw, and 
that a branch would _work_ there, and that the architecture manual was 
just being anal, but strictly speaking, it means that these things had 
better always be in assembly, and you can sadly not expose them (on alpha, 
at least) as higher-level primitives as such - you can only expose the 
end result (cmpxchg or similar).

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-07 Thread Ralf Baechle
On Thu, Dec 07, 2006 at 08:31:08PM +1100, Nick Piggin wrote:

 Wrong. Your ll/sc implementation with cmpxchg is buggy. The cmpxchg
 load_locked is not locked at all, and there can be interleaving writes
 between the load and cmpxchg which do not cause the store_conditional
 to fail.
 
 It might be reasonable to implement this watered down version, but:
 don't some architectures have restrictions on what instructions can
 be issued between the ll and the sc?

On MIPS the restriction is no loads or stores or sync instructions between
ll/sc.  Also there may be at most 2048 bytes between the address of the
ll and sc instructions.  Which means ll/sc sequences should better be
written in assembler or gcc might do a bit too creative things ...

  Ralf
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-07 Thread Nick Piggin

Russell King wrote:

On Thu, Dec 07, 2006 at 08:31:08PM +1100, Nick Piggin wrote:



Implementing ll/sc based accessor macros allows both ll/sc _and_ cmpxchg
architectures to produce optimal code.

Implementing an cmpxchg based accessor macro allows cmpxchg architectures
to produce optimal code and ll/sc non-optimal code.

See my point?


Wrong. Your ll/sc implementation with cmpxchg is buggy. The cmpxchg
load_locked is not locked at all,



Intentional - cmpxchg architectures don't generally have a load locked.


Exactly, so it is wrong -- you can't implement that behaviour with
load + cmpxchg.


and there can be interleaving writes
between the load and cmpxchg which do not cause the store_conditional
to fail.



In which case the cmpxchg fails and we do the atomic operation again,
in exactly the same way that we do the operation again if the 'sc'
fails in the ll/sc case.


Not if cmpxchg sees the same value, it won't fail, regardless of how
many writes have hit that memory address.


I do not see any problem.


This was not the big problem -- as I said, if this was the only problem
we could opt for a watered down version that doesn't actually load
locked [the ll/sc interface would be much cooler than cmpxchg, IMO :)]

The main problem is the restrictions between the ll and sc. This is why
I implemented atomic_cmpxchg rather than atomic_ll/sc. However I agree
that in critical code, a higher level API should be used instead (eg.
see atomic_add_unless, which can be implemented optimally on RISCs).

Nick

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-06 Thread Douglas McNaught
Matthew Wilcox <[EMAIL PROTECTED]> writes:

> On Wed, Dec 06, 2006 at 05:36:29PM -0800, Linus Torvalds wrote:
>> Or are you saying that gcc aligns normal 32-bit entities at 
>> 16-bit alignment? Neither of those sound very likely.
>
> alignof(u32) is 2 on m68k.  Crazy, huh?

The original 68000 had a 16-bit bus (but 32-bit registers), which is
probably why it's that way.

-Doug
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-06 Thread Roman Zippel
Hi,

On Wed, 6 Dec 2006, Linus Torvalds wrote:

> > m68060 produces a trap for unaligned atomic access, unfortunately standard 
> > alignment is smaller than this.
> 
> Umm. on 68060, since it's 32-bit, you'd only have the 32-bit case. Are you 
> saying that you can't do a 32-bit atomic access at any 32-bit aligned 
> boundary? Or are you saying that gcc aligns normal 32-bit entities at 
> 16-bit alignment? Neither of those sound very likely.

The latter.
And yes, I'm starting to hate it too, especially after I've seen all the 
weird bugs this causes in userspace, which then only trigger on m68k.
I'm thinking of changing it when switching to proper TLS support, but it's 
not there yet.
BTW there is another reason I actually like the atomic type - 
documentation. It makes it more clear that this intended to be used as 
atomic value, so an unhealthy mixture of different accesses is less likely 
and if they pile up at some place it's a good indicator to maybe switch 
back to spinlocks.

bye, Roman
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-06 Thread Matthew Wilcox
On Wed, Dec 06, 2006 at 05:36:29PM -0800, Linus Torvalds wrote:
> Or are you saying that gcc aligns normal 32-bit entities at 
> 16-bit alignment? Neither of those sound very likely.

alignof(u32) is 2 on m68k.  Crazy, huh?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-06 Thread Linus Torvalds


On Thu, 7 Dec 2006, Roman Zippel wrote:
>
> m68060 produces a trap for unaligned atomic access, unfortunately standard 
> alignment is smaller than this.

Umm. on 68060, since it's 32-bit, you'd only have the 32-bit case. Are you 
saying that you can't do a 32-bit atomic access at any 32-bit aligned 
boundary? Or are you saying that gcc aligns normal 32-bit entities at 
16-bit alignment? Neither of those sound very likely.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-06 Thread Roman Zippel
Hi,

On Wed, 6 Dec 2006, Linus Torvalds wrote:

> > > Any _real_ CPU will simply never care about _anything_ else than just the 
> > > size of the datum in question.
> > 
> > ..or alignment which a dedicated atomic type would allow to be attached.
> 
> Can you give any example of a real CPU where alignment matters?
> 
> Sure, it needs to be naturally aligned, but that's true of _any_ type in 
> the kernel. We don't do unaligneds without "get_unaligned()" and friends.

m68060 produces a trap for unaligned atomic access, unfortunately standard 
alignment is smaller than this.

bye, Roman
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-06 Thread Linus Torvalds


On Thu, 7 Dec 2006, Roman Zippel wrote:
> > 
> > Any _real_ CPU will simply never care about _anything_ else than just the 
> > size of the datum in question.
> 
> ..or alignment which a dedicated atomic type would allow to be attached.

Can you give any example of a real CPU where alignment matters?

Sure, it needs to be naturally aligned, but that's true of _any_ type in 
the kernel. We don't do unaligneds without "get_unaligned()" and friends.

Btw, if you want to leave out 8-bit and 16-bit data, that's fine. So 
generally you'd only need to handle 32-bit (and 64-bit on a 64-bit 
architecture) accesses anyway.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-06 Thread David Miller
From: Al Viro <[EMAIL PROTECTED]>
Date: Wed, 6 Dec 2006 19:08:28 +

> On Wed, Dec 06, 2006 at 11:05:22AM -0800, Linus Torvalds wrote:
> > 
> > 
> > On Wed, 6 Dec 2006, Christoph Lameter wrote:
> > >
> > > I'd really appreciate a cmpxchg that is generically available for 
> > > all arches. It will allow lockless implementation for various performance 
> > > criticial portions of the kernel.
> > 
> > I suspect ARM may have been the last one without one, no?
> 
> No.  sparc32 doesn't have one, for instance.

That's correct.  It has an atomic swap, but not a cmpxchg.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-06 Thread Roman Zippel
Hi,

On Wed, 6 Dec 2006, Linus Torvalds wrote:

> > > To be honest, it'd be much easier if we only defined these operations on
> > > atomic_t's.  We have all the infrastructure in place for them, and
> > > they're fairly well understood.  If you need different sizes, I'm OK
> > > with an atomic_pointer_t, or whatever.
> > 
> > FWIW Seconded.
> 
> I disagree.
> 
> Any _real_ CPU will simply never care about _anything_ else than just the 
> size of the datum in question.

..or alignment which a dedicated atomic type would allow to be attached.

bye, Roman
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-06 Thread Linus Torvalds


On Thu, 7 Dec 2006, Roman Zippel wrote:
> On Wed, 6 Dec 2006, Matthew Wilcox wrote:
> 
> > To be honest, it'd be much easier if we only defined these operations on
> > atomic_t's.  We have all the infrastructure in place for them, and
> > they're fairly well understood.  If you need different sizes, I'm OK
> > with an atomic_pointer_t, or whatever.
> 
> FWIW Seconded.

I disagree.

Any _real_ CPU will simply never care about _anything_ else than just the 
size of the datum in question. There's absolutely no point to only allow 
it on certain types, especially as we _know_ those certain types are 
already going to be more than one, and we also know that they are going to 
be different sizes. In other words, in reality, we have to handle a 
sizeable subset of the whole generic situation, and the "on certain types 
only" situation is only going to be awkward and irritating.

For example, would we have a different "cmpxchg_ptr()" function for the 
atomic pointer thing? With any reasonable CPU just depending on the _size_ 
of the type, I don't see what the problem is with just doing the 
bog-standard "cmpxchg_8/16/32/64" and having the bog-standard case- 
statement in a header file to do it all automatically for you, and then we 
don't need to worry about it.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-06 Thread Ralf Baechle
On Wed, Dec 06, 2006 at 11:16:55AM -0800, Christoph Lameter wrote:

> But then its also just requires disable/enable interrupts on UP which may 
> be cheaper than an atomic operation.
> 
> > For CPUs with load locked + store conditional, it is expensive.
> 
> Because it locks the bus? I am not that familiar with those architectures 
> but it seems that those will have a general problem anyways.

On a decent implementation ll/sc will have the same cost as ordinary
non-atomic load and store instructions.  A likely uniprocessor
implementation uses a single flip-flop ("llbit") in the CPU which is set
by ll and cleared by any exception handler, especially interrupt.  A later
store conditional will then simply fail if that bit is cleared.  That
is extremly trivial stuff.  On SMP it's somewhat more complex; A
processor will have to remember the address used with ll and start
snooping the bus for writes to it.  The store conditional will then
go and upgrade the cache line to exclusive state if the llbit is still
set and perform the store.  The llbit would be cleared if the processor
has snooped any other write to the cacheline.  Details are fun but that's
bascially how it's implemented.

Of course load linked / store conditional are typically used in loops
so there is a little extra overhead from that especially where when the
branch is misspredicted.

Also note there is no locked cycle required to implement load linked /
store conditional.

  Ralf
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-06 Thread Roman Zippel
Hi,

On Wed, 6 Dec 2006, Matthew Wilcox wrote:

> To be honest, it'd be much easier if we only defined these operations on
> atomic_t's.  We have all the infrastructure in place for them, and
> they're fairly well understood.  If you need different sizes, I'm OK
> with an atomic_pointer_t, or whatever.

FWIW Seconded.

bye, Roman
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-06 Thread Linus Torvalds


On Wed, 6 Dec 2006, Christoph Lameter wrote:
> 
> This means we tolerate the assignment race for SMP that was pointed out 
> earlier?

The assignment "race" doesn't really exist in real code, I suspect.

There's two cases of assignment:

 - pure initialization. This one isn't racy, since it's literally the 
   initial setting of some data structure, and if it's visible in 
   uninitialized state on other CPU's before being initialized, you have 
   bigger problems than some silly assignment race ;)

   So the race doesn't exist for this class of assignment at all, and 
   that's the _common_ case of a pure assignment (ie the "we start out 
   with this field having value X")

 - mixing assignment on one CPU (and not using "atomic_set()") and using 
   "cmpxchg()". 

   This would be a problem on architectures that use an external spinlock 
   or something, but I don't really see how it could be a valid code 
   sequence ANYWAY.

   Any code that does this had better have some _higherlevel_ 
   serialization around that data structure, since while it's not a _bug_ 
   on architectures that do "cmpxchg()" in hardware (as far as the cmpxchg 
   itself is concerned), the fact is, in the absense of any other locking, 
   what the hell is such code supposed to _mean_?

In other words - the second case would be a bug in that it bypasses the 
serialization of the spinlock, but why would you ever do that anyway? What 
could _possibly_ be a valid use of such a "set value blindly" kind of 
sequence? Since another CPU is clearly doing something to the value (ie 
the cmpxchg), the "set it to value X" is just not a well-defined 
operation in the first place, since you don't know what the end result is 
(will it be "X", or will the "cmpxchg" on the other CPU have set it to 
something else?).

So I don't think the race really exists for any sane code anyway. Even if 
people don't use "atomic_set()".

But hey, maybe I'm just ignoring some really odd usage case.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-06 Thread Christoph Lameter
On Wed, 6 Dec 2006, Matthew Wilcox wrote:

> To be honest, it'd be much easier if we only defined these operations on
> atomic_t's.  We have all the infrastructure in place for them, and
> they're fairly well understood.  If you need different sizes, I'm OK
> with an atomic_pointer_t, or whatever.

An pointer is probably one of the most important entities to use with 
cmpxchg (aside from the ints and longs that we already support).

A pointer can be defined to point to any other type. So we would need
atomic_pointer_t()?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-06 Thread Matthew Wilcox
On Wed, Dec 06, 2006 at 01:52:20PM -0800, Christoph Lameter wrote:
> On Wed, 6 Dec 2006, Matthew Wilcox wrote:
> 
> > And for those of us with only load-and-zero, that's simply:
> > 
> > #define load_locked(addr) spin_lock(hash(addr)), *addr
> > #define store_exclusive(addr, old, new) \
> > *addr = new, spin_unlock(hash(addr)), 0
> > 
> > which is also optimal for us.
> 
> This means we tolerate the assignment race for SMP that was pointed out 
> earlier?

What gave you that impression?  It simply wasn't part of this example.

To be honest, it'd be much easier if we only defined these operations on
atomic_t's.  We have all the infrastructure in place for them, and
they're fairly well understood.  If you need different sizes, I'm OK
with an atomic_pointer_t, or whatever.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-06 Thread Christoph Lameter
On Wed, 6 Dec 2006, Matthew Wilcox wrote:

> And for those of us with only load-and-zero, that's simply:
> 
> #define load_locked(addr) spin_lock(hash(addr)), *addr
> #define store_exclusive(addr, old, new) \
>   *addr = new, spin_unlock(hash(addr)), 0
> 
> which is also optimal for us.

This means we tolerate the assignment race for SMP that was pointed out 
earlier?

cmpxchg emulation may then also be tolerable just replace the irq 
enable/disable in David's implementation with taking a spin lock?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-06 Thread Matthew Wilcox
On Wed, Dec 06, 2006 at 07:58:20PM +, Russell King wrote:
> No.  If you read what I said, you'll see that you can _cheaply_ use
> cmpxchg in a ll/sc based implementation.  Take an atomic increment
> operation.
> 
>   do {
>   old = load_locked(addr);
>   } while (store_exclusive(old, old + 1, addr);

[...]

> Implementing ll/sc based accessor macros allows both ll/sc _and_ cmpxchg
> architectures to produce optimal code.
> 
> Implementing an cmpxchg based accessor macro allows cmpxchg architectures
> to produce optimal code and ll/sc non-optimal code.

And for those of us with only load-and-zero, that's simply:

#define load_locked(addr) spin_lock(hash(addr)), *addr
#define store_exclusive(addr, old, new) \
*addr = new, spin_unlock(hash(addr)), 0

which is also optimal for us.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-06 Thread Matthew Wilcox
On Wed, Dec 06, 2006 at 12:11:01PM -0800, Christoph Lameter wrote:
> On Wed, 6 Dec 2006, Matthew Wilcox wrote:
> > Follow the thread back.  I'm talking about parisc.  Machines exist (are
> > still being sold) with up to 128 cores.  I think the largest we've
> > confirmed work are 8 CPUs.
> 
> And you only have an atomic get and clear atomic op? Amazing. Is this 
> really still in production?

Yes.  And yes.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-06 Thread Lennert Buytenhek
On Wed, Dec 06, 2006 at 07:47:22PM +, David Howells wrote:

> > > Pre-v6 ARM doesn't support SMP according to ARM's atomic.h,
> > 
> > That's not quite true, there exist ARMv5 processors that in theory
> > can support SMP.
> 
> I meant that the Linux ARM arch doesn't support it:

Ah, yes, not currently.  That's true.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-06 Thread Christoph Lameter
On Wed, 6 Dec 2006, Matthew Wilcox wrote:

> On Wed, Dec 06, 2006 at 11:47:40AM -0800, Christoph Lameter wrote:
> > Nope this is a UP implementation. There is no cpu B.
> 
> Follow the thread back.  I'm talking about parisc.  Machines exist (are
> still being sold) with up to 128 cores.  I think the largest we've
> confirmed work are 8 CPUs.

And you only have an atomic get and clear atomic op? Amazing. Is this 
really still in production?


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-06 Thread Russell King
On Wed, Dec 06, 2006 at 11:16:55AM -0800, Christoph Lameter wrote:
> On Wed, 6 Dec 2006, Russell King wrote:
> 
> > On Wed, Dec 06, 2006 at 10:56:14AM -0800, Christoph Lameter wrote:
> > > I'd really appreciate a cmpxchg that is generically available for 
> > > all arches. It will allow lockless implementation for various performance 
> > > criticial portions of the kernel.
> > 
> > Let's recap on cmpxchg.
> > 
> > For CPUs with no atomic operation other than SWP, it is not lockless.
> 
> But then its also just requires disable/enable interrupts on UP which may 
> be cheaper than an atomic operation.

No.  SWP is atomic on the CPU it's being issued on, especially wrt
interrupts.  Only on one ARM CPU (which is UP) does it have a
questionable use, and there we do it via interrupt disable/enable.

> > For CPUs with load locked + store conditional, it is expensive.
> 
> Because it locks the bus? I am not that familiar with those architectures 
> but it seems that those will have a general problem anyways.

No.  That certainly would be bad for performance.  I can talk
authoritively from the ARM implementation.

When you use a special "ldrex" (load exclusive) instruction, the
CPU remembers the "address + cpu" pair.  If another access occurs
to the same address, this state is reset.

Only if this state is preserved will a "strex" (store exclusive)
instruction succeed.  This instruction returns status indicating
whether it succeeded.

So, to implement cmpxchg, you need to do this:

; r1 = temporary register
; r2 = address
; r4 = new value
; r3 = returned status
ldrex   r1, [r2]
cmp r1, old_value
streqex r3, r4, [r2]

> > If you want an operation for performance critical portions of the
> > kernel, please allow architecture maintainers the freedom to use their
> > best performance enhancements.
> 
> And thereby denying the kernel developers to use a simple atomic SMP 
> operation? Adding additional defines for each arch and each performance 
> critical piece of kernel logic?

No.  If you read what I said, you'll see that you can _cheaply_ use
cmpxchg in a ll/sc based implementation.  Take an atomic increment
operation.

do {
old = load_locked(addr);
} while (store_exclusive(old, old + 1, addr);

On a cmpxchg, that "store_exclusive" (loosely) becomes your cmpxchg
instruction, comparing the first arg, and if equal storing the second.
The "load_locked" macro becomes a standard pointer deref.  Ergo, x86
becomes:

do {
load value
manipulate it
conditional store
} while not stored

On ll/sc, the load_locked() macro is the load locked instruction.  The
store_exclusive() macro is the exclusive store and it doesn't need to
use the first parameter at all.  Ergo, ARM becomes:

do {
ldrex r1, [r2]
manipulate r1
strex r0, r1, [r2]
} while failed

Notice that both are optimal.

Now let's consider the cmpxchg case.

do {
val = *addr;
} while (cmpxchg(val, val + 1, addr);

The x86 case is _identical_ to the ll/sc based implementation.  Absolutely
entirely.  No impact what so ever.

Let's look at the ll/sc case.  The cmpxchg code implemented on this has
to reload the original value, compare it, if equal store the new value.
So:

do {
val = *addr;
(r2 = addr, 
ldrex r1, [r2]
compare r1, r0
strexeq r4, r3, [r2] (store exclusive if equal)
} while store failed or comparecondition failed

Note how the cmpxchg has _forced_ the ll/sc implementation to become
more complex.

So, let's recap.

Implementing ll/sc based accessor macros allows both ll/sc _and_ cmpxchg
architectures to produce optimal code.

Implementing an cmpxchg based accessor macro allows cmpxchg architectures
to produce optimal code and ll/sc non-optimal code.

See my point?

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-06 Thread Russell King
On Wed, Dec 06, 2006 at 11:05:22AM -0800, Linus Torvalds wrote:
> 
> 
> On Wed, 6 Dec 2006, Christoph Lameter wrote:
> >
> > I'd really appreciate a cmpxchg that is generically available for 
> > all arches. It will allow lockless implementation for various performance 
> > criticial portions of the kernel.
> 
> I suspect ARM may have been the last one without one, no?
> 
> That said, cmpxchg won't necessarily be "high-performance" unless the hw 
> supports it naturally in hardware, so..
> 
> Russell, are you ok with the code DavidH posted (the "try 2" one)? I'd 
> like to get an ack from the ARM maintainer before applying it, but it 
> looked ok.

I've only recently got in today, so it'll be a while (== tomorrow) before
I can get around to doing any tests.  (It's evening this side of the
pond.)

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-06 Thread Linus Torvalds


On Wed, 6 Dec 2006, Linus Torvalds wrote:
> 
> Your problem will be, of course, that any architecture that does this in 
> hardware will just DoTheRightThing, and as such, broken architectures with 
> bad locking primitives will have to test and do source-level analysis 
> more.

(The underlying thread here being that the workqueue stuff _should_ be 
safe in this regard. But the "testing will not catch bugs like these is 
certainly true in general)

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-06 Thread Matthew Wilcox
On Wed, Dec 06, 2006 at 11:47:40AM -0800, Christoph Lameter wrote:
> Nope this is a UP implementation. There is no cpu B.

Follow the thread back.  I'm talking about parisc.  Machines exist (are
still being sold) with up to 128 cores.  I think the largest we've
confirmed work are 8 CPUs.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-06 Thread Linus Torvalds


On Wed, 6 Dec 2006, Matthew Wilcox wrote:
> 
> That doesn't help, since assignment can't be guarded by any lock.

True. Pure assignment will be lost, and is only ok for the case of a pure 
initializer (where nobody can see the old state).

Your problem will be, of course, that any architecture that does this in 
hardware will just DoTheRightThing, and as such, broken architectures with 
bad locking primitives will have to test and do source-level analysis 
more.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-06 Thread Christoph Lameter
On Wed, 6 Dec 2006, Matthew Wilcox wrote:

> On Wed, Dec 06, 2006 at 11:29:42AM -0800, Christoph Lameter wrote:
> > On Wed, 6 Dec 2006, Matthew Wilcox wrote:
> > 
> > > It's just been pointed out to me that the parisc one isn't safe.
> > > 
> > >  imagine variable X is set to 3
> > >  CPU A issues cmpxchg(, 3, 5)
> > >  you'd expect that to change X to 5
> > >  but what if CPU B assigns 6 to X between cmpxchg reading X
> > > and it setting X?
> > 
> > The same could happen with a regular cmpxchg. Cmpxchg changes it to 5 and 
> > then other cpu performs a store before the next instruction.
> 
> For someone who's advocating use of cmpxchg, it seems you don't
> understand its semantics!  In the scenario dhowells pointed out, X would
> be left set to 5.  X should have the value 6 under any legitimate
> implementation:

Nope this is a UP implementation. There is no cpu B.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-06 Thread David Howells
Linus Torvalds <[EMAIL PROTECTED]> wrote:

> Also, I don't see quite why you think "cmpxchg()" and "atomic_cmpxchg()" 
> would be different. ANY cmpxchg() needs to be atomic - if it's not, 
> there's no point to the operation at all, since you'd just write it as

It's not that atomic_cmpxchg() is different to cmpxchg(), it's that
atomic_set() is different to direct assignment.

atomic_set() on PA-RISC, for example, has spinlocks in it.

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-06 Thread David Howells
Matthew Wilcox <[EMAIL PROTECTED]> wrote:

> > Ok. For SMP-safety, it's important that any architecture that can't do 
> > this needs to _share_ the same spinlock (on SMP only, of course) that it 
> > uses for the bitops. 
> 
> That doesn't help, since assignment can't be guarded by any lock.

It's not a problem for workqueues, since the only direct assignment to the
management member variable is during initialisation.

But in general cmpxchg() might be a problem with respect to assignment.

atomic_cmpxchg() should be safe wrt atomic_set().

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-06 Thread Matthew Wilcox
On Wed, Dec 06, 2006 at 11:34:52AM -0800, Linus Torvalds wrote:
> On Wed, 6 Dec 2006, Matthew Wilcox wrote:
> > Given parisc's paucity of atomic operations (load-and-zero-32bit and
> > load-and-zero-64bit), cmpxchg() is impossible to implement safely.
> > There has to be something we can hook to exclude another processor
> > modifying the variable.  I'm OK with using atomic_cmpxchg(); we have
> > atomic_set locked against it.
> 
> How do you to the atomic bitops?

The same way we do atomic_t.

What I hadn't realised (because I hadn't read dhowell's implementation
... because it hasn't shown up on git2.kernel.org yet) is that he
doesn't actually *use* this unlocked-assignment that would cause the
problem.  He uses bitops which use the same locks.

> Also, I don't see quite why you think "cmpxchg()" and "atomic_cmpxchg()" 
> would be different. ANY cmpxchg() needs to be atomic - if it's not, 
> there's no point to the operation at all, since you'd just write it as
> 
>   if (*p == x)
>   *p = y;
> 
> instead, and not use cmpxchg(). 

The difference is that we can (and do) acquire a lock for atomic_set.
We can't acquire one for X = 6.

>  - on SMP, use a spinlock (with interrupts disabled), and share that 
>spinlock with all the other atomic ops (bitops at a minimum - the 
>"atomic_t" operations have traditionally been in another "locking 
>space" because of sparc32 historic braindamage, but I'd suggest 
>sharing the same spinlock between them all).

Yep, we agree.

> And yeah, it sucks. You _can_ (if you really want to) make the spinlock be 
> hashed based on the address of the atomic data structure. That at least 
> allows you to do _multiple_ spinlocks, but let's face it, your real 
> problem is _likely_ going to be cacheline bouncing, not contention, and 
> then using a hashed lock won't be likely to buy you all that much.

We do hash based on the address -- and we try to arrange things such
that different spinlocks are acquired for different cachelines.  I don't
know if anyone's benchmarked it recently to see how well it performs.
It's a bit of a waltzing bear [1] at times ;-)

[1] The wonder is not how well it waltzes, but that it waltzes at all
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-06 Thread Matthew Wilcox
On Wed, Dec 06, 2006 at 11:29:42AM -0800, Christoph Lameter wrote:
> On Wed, 6 Dec 2006, Matthew Wilcox wrote:
> 
> > It's just been pointed out to me that the parisc one isn't safe.
> > 
> >  imagine variable X is set to 3
> >  CPU A issues cmpxchg(, 3, 5)
> >  you'd expect that to change X to 5
> >  but what if CPU B assigns 6 to X between cmpxchg reading X
> > and it setting X?
> 
> The same could happen with a regular cmpxchg. Cmpxchg changes it to 5 and 
> then other cpu performs a store before the next instruction.

For someone who's advocating use of cmpxchg, it seems you don't
understand its semantics!  In the scenario dhowells pointed out, X would
be left set to 5.  X should have the value 6 under any legitimate
implementation:

CPU A   CPU B
cmpxchg(3,5)
X = 6


CPU A   CPU B
X = 6
cmpxhcg(3,5)


CPU A
cmpxchg(3,
X = 6
5)


Given that even yourself got confused about how to use it, perhaps it's
not a good idea to expose this primitive to most programmers anyway?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-06 Thread Linus Torvalds


On Wed, 6 Dec 2006, Matthew Wilcox wrote:
> 
> Given parisc's paucity of atomic operations (load-and-zero-32bit and
> load-and-zero-64bit), cmpxchg() is impossible to implement safely.
> There has to be something we can hook to exclude another processor
> modifying the variable.  I'm OK with using atomic_cmpxchg(); we have
> atomic_set locked against it.

How do you to the atomic bitops?

Also, I don't see quite why you think "cmpxchg()" and "atomic_cmpxchg()" 
would be different. ANY cmpxchg() needs to be atomic - if it's not, 
there's no point to the operation at all, since you'd just write it as

if (*p == x)
*p = y;

instead, and not use cmpxchg(). 

So yes, architectures without native support (where "native" includes 
load-locked + store-conditional) always need to

 - on UP, just disable interrupts
 - on SMP, use a spinlock (with interrupts disabled), and share that 
   spinlock with all the other atomic ops (bitops at a minimum - the 
   "atomic_t" operations have traditionally been in another "locking 
   space" because of sparc32 historic braindamage, but I'd suggest 
   sharing the same spinlock between them all).

And yeah, it sucks. You _can_ (if you really want to) make the spinlock be 
hashed based on the address of the atomic data structure. That at least 
allows you to do _multiple_ spinlocks, but let's face it, your real 
problem is _likely_ going to be cacheline bouncing, not contention, and 
then using a hashed lock won't be likely to buy you all that much.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-06 Thread Linus Torvalds


On Wed, 6 Dec 2006, Christoph Lameter wrote:
> 
> > For CPUs with load locked + store conditional, it is expensive.
> 
> Because it locks the bus? I am not that familiar with those architectures 
> but it seems that those will have a general problem anyways.

load_locked + store_conditional should _not_ be any more expensive than 
any atomic sequence will always be.

Atomic sequences in SMP are obviously never "cheap". But cmpxchg shouldn't 
be any more expensive than any other atomic sequence if you have 
load-locked and store-conditional.

There are obviously _implementation_ bugs. The early alpha's had such an 
atrocious ldl/stc that it wasn't even funny. That might be true in other 
implementations too, but it's definitely not cmpxchg-specific if so. It 
will affect _any_ atomic ops on such an architecture (atomic_inc() and 
friends)

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-06 Thread Matthew Wilcox
On Wed, Dec 06, 2006 at 11:05:22AM -0800, Linus Torvalds wrote:
> On Wed, 6 Dec 2006, Christoph Lameter wrote:
> >
> > I'd really appreciate a cmpxchg that is generically available for 
> > all arches. It will allow lockless implementation for various performance 
> > criticial portions of the kernel.
> 
> I suspect ARM may have been the last one without one, no?

It's just been pointed out to me that the parisc one isn't safe.

 imagine variable X is set to 3
 CPU A issues cmpxchg(, 3, 5)
 you'd expect that to change X to 5
 but what if CPU B assigns 6 to X between cmpxchg reading X
and it setting X?

Given parisc's paucity of atomic operations (load-and-zero-32bit and
load-and-zero-64bit), cmpxchg() is impossible to implement safely.
There has to be something we can hook to exclude another processor
modifying the variable.  I'm OK with using atomic_cmpxchg(); we have
atomic_set locked against it.

Of course, using cmpxchg() isn't really lockless.  It's just hidden
locking.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-06 Thread Christoph Lameter
On Wed, 6 Dec 2006, Matthew Wilcox wrote:

> It's just been pointed out to me that the parisc one isn't safe.
> 
>  imagine variable X is set to 3
>  CPU A issues cmpxchg(, 3, 5)
>  you'd expect that to change X to 5
>  but what if CPU B assigns 6 to X between cmpxchg reading X
> and it setting X?

The same could happen with a regular cmpxchg. Cmpxchg changes it to 5 and 
then other cpu performs a store before the next instruction.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-06 Thread Matthew Wilcox
On Wed, Dec 06, 2006 at 11:25:35AM -0800, Linus Torvalds wrote:
> Ok. For SMP-safety, it's important that any architecture that can't do 
> this needs to _share_ the same spinlock (on SMP only, of course) that it 
> uses for the bitops. 

That doesn't help, since assignment can't be guarded by any lock.

> It would be good (but perhaps not as strict a requirement) if the atomic 
> counters also use the same lock. But that is probably impossible on 
> sparc32 (since it has a per-counter "lock"-like thing, iirc). So doing a 
> cmpxchg() on an atomic_t would be a bug.

sparc32 switched over to the parisc way of doing things, so they could
expand their atomic_t to a full 32 bits.  They still have the old
atomic_24_t lying around for their arch-private use, but atomic_t uses a
hashed spinlock.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it

2006-12-06 Thread Linus Torvalds


On Wed, 6 Dec 2006, Al Viro wrote:
> 
> No.  sparc32 doesn't have one, for instance.

Ok. For SMP-safety, it's important that any architecture that can't do 
this needs to _share_ the same spinlock (on SMP only, of course) that it 
uses for the bitops. 

It would be good (but perhaps not as strict a requirement) if the atomic 
counters also use the same lock. But that is probably impossible on 
sparc32 (since it has a per-counter "lock"-like thing, iirc). So doing a 
cmpxchg() on an atomic_t would be a bug.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch doesn't support it [try #2]

2006-12-06 Thread Christoph Lameter
Have a look at arch/i386/kernel/cpu/intel.c. You can probably replace my 
code that simulates cmpxchg for 386s

arch/i386/kernel/cpu/intel.c:

#ifndef CONFIG_X86_CMPXCHG
unsigned long cmpxchg_386_u8(volatile void *ptr, u8 old, u8 new)
{
u8 prev;
unsigned long flags;

/* Poor man's cmpxchg for 386. Unsuitable for SMP */
local_irq_save(flags);
prev = *(u8 *)ptr;
if (prev == old)
*(u8 *)ptr = new;
local_irq_restore(flags);
return prev;
}
EXPORT_SYMBOL(cmpxchg_386_u8);

unsigned long cmpxchg_386_u16(volatile void *ptr, u16 old, u16 new)
{
u16 prev;
unsigned long flags;

/* Poor man's cmpxchg for 386. Unsuitable for SMP */
local_irq_save(flags);
prev = *(u16 *)ptr;
if (prev == old)
*(u16 *)ptr = new;
local_irq_restore(flags);
return prev;
}
EXPORT_SYMBOL(cmpxchg_386_u16);

unsigned long cmpxchg_386_u32(volatile void *ptr, u32 old, u32 new)
{
u32 prev;
unsigned long flags;

/* Poor man's cmpxchg for 386. Unsuitable for SMP */
local_irq_save(flags);
prev = *(u32 *)ptr;
if (prev == old)
*(u32 *)ptr = new;
local_irq_restore(flags);
return prev;
}
EXPORT_SYMBOL(cmpxchg_386_u32);
#endif


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


  1   2   >