Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Segher Boessenkool

"volatile" has nothing to do with reordering.  atomic_dec() writes
to memory, so it _does_ have "volatile semantics", implicitly, as
long as the compiler cannot optimise the atomic variable away
completely -- any store counts as a side effect.


Stores can be reordered. Only x86 has (mostly) implicit write ordering.
So no atomic_dec has no volatile semantics


Read again: I said the C "volatile" construct has nothing to do
with CPU memory access reordering.


and may be reordered on a variety
of processors. Writes to memory may not follow code order on several
processors.


The _compiler_ isn't allowed to reorder things here.  Yes, of course
you do need stronger barriers for many purposes, volatile isn't all
that useful you know.


Segher

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Paul E. McKenney
On Mon, Sep 10, 2007 at 02:36:26PM -0700, Christoph Lameter wrote:
> On Mon, 10 Sep 2007, Paul E. McKenney wrote:
> 
> > The one exception to this being the case where process-level code is
> > communicating to an interrupt handler running on that same CPU -- on
> > all CPUs that I am aware of, a given CPU always sees its own writes
> > in order.
> 
> Yes but that is due to the code path effectively continuing in the 
> interrupt handler. The cpu makes sure that op codes being executed always 
> see memory in a consistent way. The basic ordering problem with out of 
> order writes is therefore coming from other processors concurrently 
> executing code and holding variables in registers that are modified 
> elsewhere. The only solution that I know of is one or another form of 
> barrier.

So we are agreed then -- volatile accesses may be of some assistance when
interacting with interrupt handlers running on the same CPU (presumably
when using per-CPU variables), but are generally useless when sharing
variables among CPUs.  Correct?

Thanx, Paul


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Christoph Lameter
On Mon, 10 Sep 2007, Paul E. McKenney wrote:

> The one exception to this being the case where process-level code is
> communicating to an interrupt handler running on that same CPU -- on
> all CPUs that I am aware of, a given CPU always sees its own writes
> in order.

Yes but that is due to the code path effectively continuing in the 
interrupt handler. The cpu makes sure that op codes being executed always 
see memory in a consistent way. The basic ordering problem with out of 
order writes is therefore coming from other processors concurrently 
executing code and holding variables in registers that are modified 
elsewhere. The only solution that I know of is one or another form of 
barrier.


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Paul E. McKenney
On Mon, Sep 10, 2007 at 11:59:29AM -0700, Christoph Lameter wrote:
> On Fri, 17 Aug 2007, Segher Boessenkool wrote:
> 
> > "volatile" has nothing to do with reordering.  atomic_dec() writes
> > to memory, so it _does_ have "volatile semantics", implicitly, as
> > long as the compiler cannot optimise the atomic variable away
> > completely -- any store counts as a side effect.
> 
> Stores can be reordered. Only x86 has (mostly) implicit write ordering. So 
> no, atomic_dec has no volatile semantics and may be reordered on a variety 
> of processors. Writes to memory may not follow code order on several 
> processors.

The one exception to this being the case where process-level code is
communicating to an interrupt handler running on that same CPU -- on
all CPUs that I am aware of, a given CPU always sees its own writes
in order.

Thanx, Paul


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Kyle Moffett

On Sep 10, 2007, at 12:46:33, Denys Vlasenko wrote:
> My point is that people are confused as to what atomic_read()
> exactly means, and this is bad. Same for cpu_relax().  First one
> says "read", and second one doesn't say "barrier".


Q:  When is it OK to use atomic_read()?
A:  You are asking the question, so never.

Q:  But I need to check the value of the atomic at this point in time...
A:  Your code is buggy if it needs to do that on an atomic_t for  
anything other than debugging or optimization.  Use either  
atomic_*_return() or a lock and some normal integers.


Q:  "So why can't the atomic_read DTRT magically?"
A:  Because "the right thing" depends on the situation and is usually  
best done with something other than atomic_t.


If somebody can post some non-buggy code which is correctly using
atomic_read() *and* depends on the compiler generating extra
nonsensical loads due to "volatile" then the issue *might* be
reconsidered.  This also includes samples of code which uses
atomic_read() and needs memory barriers (so that we can fix the buggy
code, not so we can change atomic_read()).  So far the only code
samples anybody has posted are buggy regardless of whether or not the
value and/or accessors are flagged "volatile".  And hey, maybe
the volatile ops *should* be implemented in inline ASM for
future-proof-ness, but that's a separate issue.


Cheers,
Kyle Moffett



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Christoph Lameter
On Mon, 10 Sep 2007, Linus Torvalds wrote:

> The fact is, "volatile" *only* makes things worse. It generates worse 
> code, and never fixes any real bugs. This is a *fact*.

Yes, let's just drop the volatiles now! We need a patch that gets rid of 
them. Volunteers?




Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Christoph Lameter
On Fri, 17 Aug 2007, Segher Boessenkool wrote:

> "volatile" has nothing to do with reordering.  atomic_dec() writes
> to memory, so it _does_ have "volatile semantics", implicitly, as
> long as the compiler cannot optimise the atomic variable away
> completely -- any store counts as a side effect.

Stores can be reordered. Only x86 has (mostly) implicit write ordering. So 
no, atomic_dec has no volatile semantics and may be reordered on a variety 
of processors. Writes to memory may not follow code order on several 
processors.




Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Denys Vlasenko
On Monday 10 September 2007 16:09, Linus Torvalds wrote:
> On Mon, 10 Sep 2007, Denys Vlasenko wrote:
> > static inline int
> > qla2x00_wait_for_loop_ready(scsi_qla_host_t *ha)
> > {
> > int  return_status = QLA_SUCCESS;
> > unsigned long loop_timeout ;
> > scsi_qla_host_t *pha = to_qla_parent(ha);
> > 
> > /* wait for 5 min at the max for loop to be ready */
> > loop_timeout = jiffies + (MAX_LOOP_TIMEOUT * HZ);
> > 
> > while ((!atomic_read(&pha->loop_down_timer) &&
> > atomic_read(&pha->loop_state) == LOOP_DOWN) ||
> > atomic_read(&pha->loop_state) != LOOP_READY) {
> > if (atomic_read(&pha->loop_state) == LOOP_DEAD) {
> ...
> > Is above correct or buggy? Correct, because msleep is a barrier.
> > Is it obvious? No.
> 
> It's *buggy*. But it has nothing to do with any msleep() in the loop, or 
> anything else.
> 
> And more importantly, it would be equally buggy even *with* a "volatile" 
> atomic_read().

I am not saying that this code is okay, this isn't the point.
(The code is in fact awful for several more reasons).

My point is that people are confused as to what atomic_read()
exactly means, and this is bad. Same for cpu_relax().
First one says "read", and second one doesn't say "barrier".

This is real code from current kernel which demonstrates this:

"I don't know that cpu_relax() is a barrier already":

drivers/kvm/kvm_main.c
        while (atomic_read(&completed) != needed) {
                cpu_relax();
                barrier();
        }

"I think that atomic_read() is a read from memory and therefore
I don't need a barrier":

arch/x86_64/kernel/crash.c
        msecs = 1000; /* Wait at most a second for the other cpus to stop */
        while ((atomic_read(&waiting_for_crash_ipi) > 0) && msecs) {
                mdelay(1);
                msecs--;
        }

Since neither camp seems to give up, I am proposing renaming
them to something less confusing, and make everybody happy.

cpu_relax_barrier()
atomic_value()
atomic_fetch()

I'm not a native English speaker; do these sound better?
--
vda


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Arjan van de Ven
On Mon, 10 Sep 2007 15:38:23 +0100
Denys Vlasenko <[EMAIL PROTECTED]> wrote:

> On Monday 10 September 2007 15:51, Arjan van de Ven wrote:
> > On Mon, 10 Sep 2007 11:56:29 +0100
> > Denys Vlasenko <[EMAIL PROTECTED]> wrote:
> > 
> > > 
> > > Well, if you insist on having it again:
> > > 
> > > Waiting for atomic value to be zero:
> > > 
> > > while (atomic_read(&x))
> > > continue;
> > > 
> > 
> > and this I would say is buggy code all the way.
> >
> > Not from a pure C level semantics, but from a "busy waiting is
> > buggy" semantics level and a "I'm inventing my own locking"
> > semantics level.
> 
> After inspecting arch/*, I cannot agree with you.

the arch/ people obviously are allowed to do their own locking stuff...
BECAUSE THEY HAVE TO IMPLEMENT THAT!


the arch maintainers know EXACTLY how their hw behaves (well, we hope)
so they tend to be the exception to many rules in the kernel


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Linus Torvalds

On Mon, 10 Sep 2007, Denys Vlasenko wrote:
> 
> static inline int
> qla2x00_wait_for_loop_ready(scsi_qla_host_t *ha)
> {
> int  return_status = QLA_SUCCESS;
> unsigned long loop_timeout ;
> scsi_qla_host_t *pha = to_qla_parent(ha);
> 
> /* wait for 5 min at the max for loop to be ready */
> loop_timeout = jiffies + (MAX_LOOP_TIMEOUT * HZ);
> 
> while ((!atomic_read(&pha->loop_down_timer) &&
> atomic_read(&pha->loop_state) == LOOP_DOWN) ||
> atomic_read(&pha->loop_state) != LOOP_READY) {
> if (atomic_read(&pha->loop_state) == LOOP_DEAD) {
...
> Is above correct or buggy? Correct, because msleep is a barrier.
> Is it obvious? No.

It's *buggy*. But it has nothing to do with any msleep() in the loop, or 
anything else.

And more importantly, it would be equally buggy even *with* a "volatile" 
atomic_read().

Why is this so hard for people to understand? You're all acting like 
morons.

The reason it is buggy has absolutely nothing to do with whether the read 
is done or not, it has to do with the fact that the CPU may re-order the 
reads *regardless* of whether the read is done in some specific order by 
the compiler or not! In effect, there is zero ordering between all those 
three reads, and if you don't have memory barriers (or a lock or other 
serialization), that code is buggy.

So stop this idiotic discussion thread already. The above kind of code 
needs memory barriers to be non-buggy. The whole "volatile or not" 
discussion is totally idiotic, and pointless, and anybody who doesn't 
understand that by now needs to just shut up and think about it more, 
rather than make this discussion drag out even further.

The fact is, "volatile" *only* makes things worse. It generates worse 
code, and never fixes any real bugs. This is a *fact*.

Linus


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Denys Vlasenko
On Monday 10 September 2007 15:51, Arjan van de Ven wrote:
> On Mon, 10 Sep 2007 11:56:29 +0100
> Denys Vlasenko <[EMAIL PROTECTED]> wrote:
> 
> > 
> > Well, if you insist on having it again:
> > 
> > Waiting for atomic value to be zero:
> > 
> > while (atomic_read(&x))
> > continue;
> > 
> 
> and this I would say is buggy code all the way.
>
> Not from a pure C level semantics, but from a "busy waiting is buggy"
> semantics level and a "I'm inventing my own locking" semantics level.

After inspecting arch/*, I cannot agree with you.
Otherwise almost all major architectures use
"conceptually buggy busy-waiting":

arch/alpha
arch/i386
arch/ia64
arch/m32r
arch/mips
arch/parisc
arch/powerpc
arch/sh
arch/sparc64
arch/um
arch/x86_64

All of the above contain busy-waiting on atomic_read.

Including these loops without barriers:

arch/mips/kernel/smtc.c
while (atomic_read(&idle_hook_initialized) < 1000)
;
arch/mips/sgi-ip27/ip27-nmi.c
while (atomic_read(&nmied_cpus) != num_online_cpus());

[Well maybe num_online_cpus() is a barrier, I didn't check]

arch/sh/kernel/smp.c
if (wait)
while (atomic_read(&smp_fn_call.finished) != (nr_cpus - 1));

Bugs?
--
vda


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Denys Vlasenko
On Monday 10 September 2007 14:38, Denys Vlasenko wrote:
> You are basically trying to educate me how to use atomic properly.
> You don't need to do it, as I am (currently) not a driver author.
> 
> I am saying that people who are already using atomic_read()
> (and who unfortunately did not read your explanation above)
> will still sometimes use atomic_read() as a way to read atomic value
> *from memory*, and will create nasty heisenbugs for you to debug.

static inline int
qla2x00_wait_for_loop_ready(scsi_qla_host_t *ha)
{
int  return_status = QLA_SUCCESS;
unsigned long loop_timeout ;
scsi_qla_host_t *pha = to_qla_parent(ha);

/* wait for 5 min at the max for loop to be ready */
loop_timeout = jiffies + (MAX_LOOP_TIMEOUT * HZ);

while ((!atomic_read(&pha->loop_down_timer) &&
atomic_read(&pha->loop_state) == LOOP_DOWN) ||
atomic_read(&pha->loop_state) != LOOP_READY) {
if (atomic_read(&pha->loop_state) == LOOP_DEAD) {
return_status = QLA_FUNCTION_FAILED;
break;
}
msleep(1000);
if (time_after_eq(jiffies, loop_timeout)) {
return_status = QLA_FUNCTION_FAILED;
break;
}
}
return (return_status);
}

Is above correct or buggy? Correct, because msleep is a barrier.
Is it obvious? No.

static void
qla2x00_rst_aen(scsi_qla_host_t *ha)
{
if (ha->flags.online && !ha->flags.reset_active &&
!atomic_read(&ha->loop_down_timer) &&
!(test_bit(ABORT_ISP_ACTIVE, &ha->dpc_flags))) {
do {
clear_bit(RESET_MARKER_NEEDED, &ha->dpc_flags);

/*
 * Issue marker command only when we are going to start
 * the I/O.
 */
ha->marker_needed = 1;
} while (!atomic_read(&ha->loop_down_timer) &&
(test_bit(RESET_MARKER_NEEDED, &ha->dpc_flags)));
}
}

Is above correct? I honestly don't know. Correct, because set_bit is
a barrier on _all _memory_? Will it break if set_bit will be changed
to be a barrier only on its operand? Probably yes.

drivers/kvm/kvm_main.c

while (atomic_read(&completed) != needed) {
cpu_relax();
barrier();
}

Obviously the author did not know that cpu_relax() is already a barrier.
See why I think driver authors will be confused?

arch/x86_64/kernel/crash.c

static void nmi_shootdown_cpus(void)
{
...
msecs = 1000; /* Wait at most a second for the other cpus to stop */
while ((atomic_read(&waiting_for_crash_ipi) > 0) && msecs) {
mdelay(1);
msecs--;
}
...
}

Is mdelay(1) a barrier? Yes, because it is a function on x86_64.
Absolutely the same code will be buggy on an arch where
mdelay(1) == udelay(1000), and udelay is implemented
as inline busy-wait.

arch/sparc64/kernel/smp.c

/* Wait for response */
while (atomic_read(&data.finished) != cpus)
cpu_relax();
...later in the same file...
while (atomic_read(&smp_capture_registry) != ncpus)
rmb();

I'm confused. Do we need cpu_relax() or rmb()? Does cpu_relax() imply rmb()?
(No it doesn't). Which of those two while loops needs correcting?
--
vda


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Arjan van de Ven
On Mon, 10 Sep 2007 11:56:29 +0100
Denys Vlasenko <[EMAIL PROTECTED]> wrote:

> 
> Well, if you insist on having it again:
> 
> Waiting for atomic value to be zero:
> 
> while (atomic_read(&x))
> continue;
> 

and this I would say is buggy code all the way.

Not from a pure C level semantics, but from a "busy waiting is buggy"
semantics level and a "I'm inventing my own locking" semantics level.



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Denys Vlasenko
On Monday 10 September 2007 13:22, Kyle Moffett wrote:
> On Sep 10, 2007, at 06:56:29, Denys Vlasenko wrote:
> > On Sunday 09 September 2007 19:18, Arjan van de Ven wrote:
> >> On Sun, 9 Sep 2007 19:02:54 +0100
> >> Denys Vlasenko <[EMAIL PROTECTED]> wrote:
> >>
> >>> Why is all this fixation on "volatile"? I don't think people want  
> >>> "volatile" keyword per se, they want atomic_read() to _always_  
> > > compile into a memory-accessing instruction, not register access.
> >>
> >> and ... why is that?  is there any valid, non-buggy code sequence  
> >> that makes that a reasonable requirement?
> >
> > Well, if you insist on having it again:
> >
> > Waiting for atomic value to be zero:
> >
> > while (atomic_read(&x))
> > continue;
> >
> > gcc may happily convert it into:
> >
> > reg = atomic_read(&x);
> > while (reg)
> > continue;
> 
> Bzzt.  Even if you fixed gcc to actually convert it to a busy loop on  
> a memory variable, you STILL HAVE A BUG as it may *NOT* be gcc that  
> does the conversion, it may be that the CPU does the caching of the  
> memory value.  GCC has no mechanism to do cache-flushes or
> memory-barriers except through our custom inline assembly.

CPU can cache the value all right, but it cannot use that cached value
*forever*, it has to react to invalidate cycles on the shared bus
and re-fetch new data.

IOW: atomic_read() which compiles down to a memory accessor
will work properly.

> the CPU.  Thirdly, on a large system it may take some arbitrarily  
> large amount of time for cache-propagation to update the value of the  
> variable in your local CPU cache.

Yes, but "arbitrarily large amount of time" is actually measured
in nanoseconds here. Let's say 1000ns max for hundreds of CPUs?

> Also, you   
> probably want a cpu_relax() in there somewhere to avoid overheating  
> the CPU.

Yes, but 
1. CPU shouldn't overheat (in a sense that it gets damaged),
   it will only use more power than needed.
2. cpu_relax() just throttles down my CPU, so it's performance
   optimization only. Wait, it isn't, it's a barrier too.
   Wow, "cpu_relax" is a barrier? How am I supposed to know
   that without reading lkml flamewars and/or header files?

Let's try reading headers. asm-x86_64/processor.h:

#define cpu_relax()   rep_nop()

So, is it a barrier? No clue yet.

/* REP NOP (PAUSE) is a good thing to insert into busy-wait loops. */
static inline void rep_nop(void)
{
__asm__ __volatile__("rep;nop": : :"memory");
}

Comment explicitly says that it is "a good thing" (doesn't say
that it is mandatory) and says NOTHING about barriers!

Barrier-ness is not mentioned and is hidden in "memory" clobber.

Do you think it's obvious enough for the average driver writer?
I think not, especially that it's unlikely for him to even start
suspecting that it is a memory barrier based on the "cpu_relax"
name.

> You simply CANNOT use an atomic_t as your sole synchronizing
> primitive, it doesn't work!  You virtually ALWAYS want to use an  
> atomic_t in the following types of situations:
> 
> (A) As an object refcount.  The value is never read except as part of  
> an atomic_dec_return().  Why aren't you using "struct kref"?
> 
> (B) As an atomic value counter (number of processes, for example).   
> Just "reading" the value is racy anyways, if you want to enforce a  
> limit or something then use atomic_inc_return(), check the result,  
> and use atomic_dec() if it's too big.  If you just want to return the  
> statistics then you are going to be instantaneous-point-in-time anyways.
> 
> (C) As an optimization value (statistics-like, but exact accuracy  
> isn't important).
> 
> Atomics are NOT A REPLACEMENT for the proper kernel subsystem, like  
> completions, mutexes, semaphores, spinlocks, krefs, etc.  It's not  
> useful for synchronization, only for keeping track of simple integer  
> RMW values.  Note that atomic_read() and atomic_set() aren't very  
> useful RMW primitives (read-nomodify-nowrite and read-set-zero-write).
> Code which assumes anything else is probably buggy in other
> ways too.

You are basically trying to educate me how to use atomic properly.
You don't need to do it, as I am (currently) not a driver author.

I am saying that people who are already using atomic_read()
(and who unfortunately did not read your explanation above)
will still sometimes use atomic_read() as a way to read atomic value
*from memory*, and will create nasty heisenbugs for you to debug.
--
vda


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Kyle Moffett

On Sep 10, 2007, at 06:56:29, Denys Vlasenko wrote:
> On Sunday 09 September 2007 19:18, Arjan van de Ven wrote:
>> On Sun, 9 Sep 2007 19:02:54 +0100
>> Denys Vlasenko <[EMAIL PROTECTED]> wrote:
>>
>>> Why is all this fixation on "volatile"? I don't think people want
>>> "volatile" keyword per se, they want atomic_read() to _always_
>>> compile into a memory-accessing instruction, not register access.
>>
>> and ... why is that?  is there any valid, non-buggy code sequence
>> that makes that a reasonable requirement?
>
> Well, if you insist on having it again:
>
> Waiting for atomic value to be zero:
>
> while (atomic_read(&x))
> continue;
>
> gcc may happily convert it into:
>
> reg = atomic_read(&x);
> while (reg)
> continue;


Bzzt.  Even if you fixed gcc to actually convert it to a busy loop on
a memory variable, you STILL HAVE A BUG as it may *NOT* be gcc that
does the conversion, it may be that the CPU does the caching of the
memory value.  GCC has no mechanism to do cache-flushes or
memory-barriers except through our custom inline assembly.  Also, you
probably want a cpu_relax() in there somewhere to avoid overheating
the CPU.  Thirdly, on a large system it may take some arbitrarily
large amount of time for cache-propagation to update the value of the
variable in your local CPU cache.  Finally, if atomics are based on
spinlock+interrupt-disable then you will sit in a tight busy-loop of
spin_lock_irqsave()->spin_unlock_irqrestore().  Depending on
your system's internal model this may practically lock up your core
because the spin_lock() will take the cacheline for exclusive access
and doing that in a loop can prevent any other CPU from doing any
operation on it!  Since your IRQs are disabled you don't even have a
very small window in which an IRQ will come along and free it up long
enough for the update to take place.


The earlier code segment of:

while(atomic_read() > 0)
atomic_dec();
is *completely* buggy because you could very easily have 4 CPUs doing  
this on an atomic variable with a value of 1 and end up with it at  
negative 3 by the time you are done.  Moreover all the alternatives  
are also buggy, with the sole exception of this rather obvious- 
seeming one:

atomic_set(, 0);


You simply CANNOT use an atomic_t as your sole synchronizing  
primitive, it doesn't work!  You virtually ALWAYS want to use an  
atomic_t in the following types of situations:


(A) As an object refcount.  The value is never read except as part of  
an atomic_dec_return().  Why aren't you using "struct kref"?


(B) As an atomic value counter (number of processes, for example).   
Just "reading" the value is racy anyways, if you want to enforce a  
limit or something then use atomic_inc_return(), check the result,  
and use atomic_dec() if it's too big.  If you just want to return the  
statistics then you are going to be instantaneous-point-in-time anyways.


(C) As an optimization value (statistics-like, but exact accuracy  
isn't important).


Atomics are NOT A REPLACEMENT for the proper kernel subsystem, like  
completions, mutexes, semaphores, spinlocks, krefs, etc.  It's not  
useful for synchronization, only for keeping track of simple integer  
RMW values.  Note that atomic_read() and atomic_set() aren't very  
useful RMW primitives (read-nomodify-nowrite and read-set-zero-write).
Code which assumes anything else is probably buggy in other
ways too.


So while I see no real reason for the "volatile" on the atomics, I  
also see no real reason why it's terribly harmful.  Regardless of the  
"volatile" on the operation the CPU is perfectly happy to cache it  
anyways so it doesn't buy you any actual "always-access-memory"  
guarantees.  If you are just interested in it as an optimization you  
could probably just read the properly-aligned integer counter  
directly, an atomic read on most CPUs.


If you really need it to hit main memory *every* *single* *time*  
(Why?  Are you using it instead of the proper kernel subsystem?)   
then you probably need a custom inline assembly helper anyways.


Cheers,
Kyle Moffett



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Herbert Xu
On Mon, Sep 10, 2007 at 11:56:29AM +0100, Denys Vlasenko wrote:
> 
> Expecting every driver writer to remember that atomic_read is not in fact
> a "read from memory" is naive. That won't happen. Face it, majority of
> driver authors are a bit less talented than Ingo Molnar or Arjan van de Ven ;)
> The name of the macro is saying that it's a read.
> We are confusing users here.

For driver authors who're too busy to learn the intricacies
of atomic operations, we have the plain old spin lock which
then lets you use normal data structures such as u32 safely.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Denys Vlasenko
On Sunday 09 September 2007 19:18, Arjan van de Ven wrote:
> On Sun, 9 Sep 2007 19:02:54 +0100
> Denys Vlasenko <[EMAIL PROTECTED]> wrote:
> 
> > Why is all this fixation on "volatile"? I don't think
> > people want "volatile" keyword per se, they want atomic_read() to
> > _always_ compile into a memory-accessing instruction, not register
> > access.
> 
> and ... why is that?
> is there any valid, non-buggy code sequence that makes that a
> reasonable requirement?

Well, if you insist on having it again:

Waiting for atomic value to be zero:

while (atomic_read(&x))
continue;

gcc may happily convert it into:

reg = atomic_read(&x);
while (reg)
continue;

Expecting every driver writer to remember that atomic_read is not in fact
a "read from memory" is naive. That won't happen. Face it, majority of
driver authors are a bit less talented than Ingo Molnar or Arjan van de Ven ;)
The name of the macro is saying that it's a read.
We are confusing users here.

It's doubly confusing that cpu_relax(), which says _nothing_ about barriers
in its name, is actually a barrier you need to insert here.
--
vda


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Denys Vlasenko
On Sunday 09 September 2007 19:18, Arjan van de Ven wrote:
 On Sun, 9 Sep 2007 19:02:54 +0100
 Denys Vlasenko [EMAIL PROTECTED] wrote:
 
  Why is all this fixation on volatile? I don't think
  people want volatile keyword per se, they want atomic_read(x) to
  _always_ compile into an memory-accessing instruction, not register
  access.
 
 and ... why is that?
 is there any valid, non-buggy code sequence that makes that a
 reasonable requirement?

Well, if you insist on having it again:

Waiting for atomic value to be zero:

while (atomic_read(x))
continue;

gcc may happily convert it into:

reg = atomic_read(x);
while (reg)
continue;

Expecting every driver writer to remember that atomic_read is not in fact
a read from memory is naive. That won't happen. Face it, majority of
driver authors are a bit less talented than Ingo Molnar or Arjan van de Ven ;)
The name of the macro is saying that it's a read.
We are confusing users here.
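For reference, the only thing a compiler can guarantee here is a fresh load on every call, which is done by reading through a volatile-qualified lvalue. A minimal userspace sketch of that technique (the `my_*` names are invented for illustration, and this says nothing about CPU-level ordering):

```c
#include <assert.h>

/* Hypothetical userspace sketch, not kernel code: reading through a
 * volatile-qualified pointer forbids the compiler from caching the value
 * in a register, so every call compiles down to a memory load.  This is
 * a compiler-only property; it implies no CPU memory barrier. */
typedef struct { int counter; } my_atomic_t;

static int my_atomic_read(const my_atomic_t *v)
{
	return *(const volatile int *)&v->counter;	/* forced fresh load */
}

static void my_atomic_set(my_atomic_t *v, int i)
{
	*(volatile int *)&v->counter = i;
}
```

With such a definition, gcc cannot hoist the read out of the wait loop above, which is precisely the behaviour the name "atomic_read" leads people to expect.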

It's doubly confusing that cpu_relax(), which says _nothing_ about barriers
in its name, is actually a barrier you need to insert here.
--
vda
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Herbert Xu
On Mon, Sep 10, 2007 at 11:56:29AM +0100, Denys Vlasenko wrote:
 
 Expecting every driver writer to remember that atomic_read is not in fact
 a "read from memory" is naive. That won't happen. Face it, majority of
 driver authors are a bit less talented than Ingo Molnar or Arjan van de Ven ;)
 The name of the macro is saying that it's a read.
 We are confusing users here.

For driver authors who're too busy to learn the intricacies
of atomic operations, we have the plain old spin lock which
then lets you use normal data structures such as u32 safely.
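A hedged userspace sketch of that advice, with a pthread mutex standing in for a kernel spinlock (the `count_*` names are invented for this example):

```c
#include <assert.h>
#include <pthread.h>

/* The "plain old lock" approach: a normal integer guarded by a lock.
 * Both reads and updates are atomic and fully ordered by the lock,
 * with none of the atomic_t/volatile subtleties. */
static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned int count;

static unsigned int count_inc(void)
{
	pthread_mutex_lock(&count_lock);
	unsigned int n = ++count;	/* plain u32 arithmetic under the lock */
	pthread_mutex_unlock(&count_lock);
	return n;
}

static unsigned int count_get(void)
{
	pthread_mutex_lock(&count_lock);
	unsigned int n = count;
	pthread_mutex_unlock(&count_lock);
	return n;
}
```

The trade-off is the cost of the lock on every access, but in exchange every read-modify-write sequence under the lock is trivially correct.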

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Kyle Moffett

On Sep 10, 2007, at 06:56:29, Denys Vlasenko wrote:

On Sunday 09 September 2007 19:18, Arjan van de Ven wrote:

On Sun, 9 Sep 2007 19:02:54 +0100
Denys Vlasenko [EMAIL PROTECTED] wrote:

Why is all this fixation on volatile? I don't think people want  
volatile keyword per se, they want atomic_read(x) to _always_  
compile into a memory-accessing instruction, not register access.


and ... why is that?  is there any valid, non-buggy code sequence  
that makes that a reasonable requirement?


Well, if you insist on having it again:

Waiting for atomic value to be zero:

while (atomic_read(&x))
continue;

gcc may happily convert it into:

reg = atomic_read(&x);
while (reg)
continue;


Bzzt.  Even if you fixed gcc to actually convert it to a busy loop on  
a memory variable, you STILL HAVE A BUG as it may *NOT* be gcc that  
does the conversion, it may be that the CPU does the caching of the  
memory value.  GCC has no mechanism to do cache-flushes or memory- 
barriers except through our custom inline assembly.  Also, you  
probably want a cpu_relax() in there somewhere to avoid overheating  
the CPU.  Thirdly, on a large system it may take some arbitrarily  
large amount of time for cache-propagation to update the value of the  
variable in your local CPU cache.  Finally, if atomics are based  
on spinlock+interrupt-disable then you will sit in a tight busy- 
loop of spin_lock_irqsave()-spin_unlock_irqrestore().  Depending on  
your system's internal model this may practically lock up your core  
because the spin_lock() will take the cacheline for exclusive access  
and doing that in a loop can prevent any other CPU from doing any  
operation on it!  Since your IRQs are disabled you don't even have a very  
small window that an IRQ will come along and free it up long enough  
for the update to take place.


The earlier code segment of:

while (atomic_read(&x) > 0)
	atomic_dec(&x);
is *completely* buggy because you could very easily have 4 CPUs doing  
this on an atomic variable with a value of 1 and end up with it at  
negative 3 by the time you are done.  Moreover all the alternatives  
are also buggy, with the sole exception of this rather obvious- 
seeming one:

atomic_set(&x, 0);


You simply CANNOT use an atomic_t as your sole synchronizing  
primitive, it doesn't work!  You virtually ALWAYS want to use an  
atomic_t in the following types of situations:


(A) As an object refcount.  The value is never read except as part of  
an atomic_dec_return().  Why aren't you using struct kref?


(B) As an atomic value counter (number of processes, for example).   
Just reading the value is racy anyways, if you want to enforce a  
limit or something then use atomic_inc_return(), check the result,  
and use atomic_dec() if it's too big.  If you just want to return the  
statistics then you are going to be instantaneous-point-in-time anyways.


(C) As an optimization value (statistics-like, but exact accuracy  
isn't important).
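As a hedged illustration of pattern (A), here is a minimal userspace sketch using GCC `__atomic` builtins in place of the kernel's atomic_inc()/atomic_dec_return(); the `obj_*` names are invented for this example, and in real kernel code you would use struct kref as suggested above:

```c
#include <assert.h>

/* Pattern (A) sketch: the refcount's value is only ever consumed via the
 * return value of the atomic decrement, never via a bare read. */
struct obj {
	int refcount;
	int freed;	/* stands in for the object actually being freed */
};

static void obj_get(struct obj *o)
{
	__atomic_add_fetch(&o->refcount, 1, __ATOMIC_SEQ_CST);
}

static void obj_put(struct obj *o)
{
	/* Exactly one caller observes the decrement reaching zero. */
	if (__atomic_sub_fetch(&o->refcount, 1, __ATOMIC_SEQ_CST) == 0)
		o->freed = 1;	/* stand-in for kfree(o) */
}
```

Because the decision to free is made from the decrement's return value, there is no window where two droppers both read 1 and both free.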
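Pattern (B) can be sketched the same way; this is a userspace approximation with GCC `__atomic` builtins standing in for atomic_inc_return()/atomic_dec(), and `try_get_thing` is an invented name, not a kernel API:

```c
#include <assert.h>

/* Pattern (B) sketch: enforce a limit without a racy read-then-increment.
 * Increment first, check the *returned* value, and back out on overshoot. */
static int nr_things;	/* would be an atomic_t in the kernel */

static int try_get_thing(int limit)
{
	int n = __atomic_add_fetch(&nr_things, 1, __ATOMIC_SEQ_CST);

	if (n > limit) {
		/* Overshot: undo our own increment and report failure. */
		__atomic_sub_fetch(&nr_things, 1, __ATOMIC_SEQ_CST);
		return 0;
	}
	return 1;	/* we now hold one of the 'limit' slots */
}
```

Note that a plain "if (atomic_read(&nr_things) < limit) atomic_inc(&nr_things);" would let several CPUs pass the check simultaneously, which is exactly the race the increment-then-check ordering avoids.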


Atomics are NOT A REPLACEMENT for the proper kernel subsystem, like  
completions, mutexes, semaphores, spinlocks, krefs, etc.  It's not  
useful for synchronization, only for keeping track of simple integer  
RMW values.  Note that atomic_read() and atomic_set() aren't very  
useful RMW primitives (read-nomodify-nowrite and read-set-zero- 
write).  Code which assumes anything else is probably buggy in other  
ways too.


So while I see no real reason for the volatile on the atomics, I  
also see no real reason why it's terribly harmful.  Regardless of the  
volatile on the operation the CPU is perfectly happy to cache it  
anyways so it doesn't buy you any actual always-access-memory  
guarantees.  If you are just interested in it as an optimization you  
could probably just read the properly-aligned integer counter  
directly, an atomic read on most CPUs.


If you really need it to hit main memory *every* *single* *time*  
(Why?  Are you using it instead of the proper kernel subsystem?)   
then you probably need a custom inline assembly helper anyways.


Cheers,
Kyle Moffett



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Denys Vlasenko
On Monday 10 September 2007 13:22, Kyle Moffett wrote:
 On Sep 10, 2007, at 06:56:29, Denys Vlasenko wrote:
  On Sunday 09 September 2007 19:18, Arjan van de Ven wrote:
  On Sun, 9 Sep 2007 19:02:54 +0100
  Denys Vlasenko [EMAIL PROTECTED] wrote:
 
  Why is all this fixation on volatile? I don't think people want  
  volatile keyword per se, they want atomic_read(x) to _always_  
  compile into a memory-accessing instruction, not register access.
 
  and ... why is that?  is there any valid, non-buggy code sequence  
  that makes that a reasonable requirement?
 
  Well, if you insist on having it again:
 
  Waiting for atomic value to be zero:
 
  while (atomic_read(&x))
  continue;
 
  gcc may happily convert it into:
 
  reg = atomic_read(&x);
  while (reg)
  continue;
 
 Bzzt.  Even if you fixed gcc to actually convert it to a busy loop on  
 a memory variable, you STILL HAVE A BUG as it may *NOT* be gcc that  
 does the conversion, it may be that the CPU does the caching of the  
 memory value.  GCC has no mechanism to do cache-flushes or memory- 
 barriers except through our custom inline assembly.

CPU can cache the value all right, but it cannot use that cached value
*forever*, it has to react to invalidate cycles on the shared bus
and re-fetch new data.

IOW: atomic_read(x) which compiles down to memory accessor
will work properly.

 the CPU.  Thirdly, on a large system it may take some arbitrarily  
 large amount of time for cache-propagation to update the value of the  
 variable in your local CPU cache.

Yes, but arbitrarily large amount of time is actually measured
in nanoseconds here. Let's say 1000ns max for hundreds of CPUs?

 Also, you   
 probably want a cpu_relax() in there somewhere to avoid overheating  
 the CPU.

Yes, but 
1. CPU shouldn't overheat (in a sense that it gets damaged),
   it will only use more power than needed.
2. cpu_relax() just throttles down my CPU, so it's performance
   optimization only. Wait, it isn't, it's a barrier too.
   Wow, cpu_relax is a barrier? How am I supposed to know
   that without reading lkml flamewars and/or header files?

Let's try reading headers. asm-x86_64/processor.h:

#define cpu_relax()   rep_nop()

So, is it a barrier? No clue yet.

/* REP NOP (PAUSE) is a good thing to insert into busy-wait loops. */
static inline void rep_nop(void)
{
__asm__ __volatile__("rep;nop": : :"memory");
}

Comment explicitly says that it is a good thing (doesn't say
that it is mandatory) and says NOTHING about barriers!

Barrier-ness is not mentioned and is hidden in the "memory" clobber.

Do you think it's obvious enough for average driver writer?
I think not, especially that it's unlikely for him to even start
suspecting that it is a memory barrier based on the cpu_relax
name.
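To make the hidden property concrete, here is a hedged userspace sketch: the barrier effect of cpu_relax() comes entirely from the "memory" clobber, and an empty asm with that clobber is a pure compiler barrier (the kernel's x86 version additionally emits "rep; nop", the PAUSE hint). The names below are invented for illustration:

```c
#include <assert.h>

/* A pure compiler barrier: the empty asm with a "memory" clobber tells
 * GCC that memory may have changed, so values cached in registers must
 * be re-read afterwards.  It emits no instruction and orders nothing at
 * the CPU level. */
static void compiler_barrier(void)
{
	__asm__ __volatile__("" ::: "memory");
}

/* A bounded wait loop; the volatile-qualified pointer plus the barrier
 * keep the compiler from hoisting the load out of the loop.  Someone
 * else still has to write *flag, and CPU-level ordering still needs
 * real memory barriers. */
static int poll_flag(const volatile int *flag, int max_spins)
{
	while (!*flag && max_spins-- > 0)
		compiler_barrier();
	return *flag;
}
```

This is only a compiler-visibility fix, which is Denys's point: nothing in the name cpu_relax() suggests it carries this clobber.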

 You simply CANNOT use an atomic_t as your sole synchronizing
 primitive, it doesn't work!  You virtually ALWAYS want to use an  
 atomic_t in the following types of situations:
 
 (A) As an object refcount.  The value is never read except as part of  
 an atomic_dec_return().  Why aren't you using struct kref?
 
 (B) As an atomic value counter (number of processes, for example).   
 Just reading the value is racy anyways, if you want to enforce a  
 limit or something then use atomic_inc_return(), check the result,  
 and use atomic_dec() if it's too big.  If you just want to return the  
 statistics then you are going to be instantaneous-point-in-time anyways.
 
 (C) As an optimization value (statistics-like, but exact accuracy  
 isn't important).
 
 Atomics are NOT A REPLACEMENT for the proper kernel subsystem, like  
 completions, mutexes, semaphores, spinlocks, krefs, etc.  It's not  
 useful for synchronization, only for keeping track of simple integer  
 RMW values.  Note that atomic_read() and atomic_set() aren't very  
 useful RMW primitives (read-nomodify-nowrite and read-set-zero- 
 write).  Code which assumes anything else is probably buggy in other  
 ways too.

You are basically trying to educate me how to use atomic properly.
You don't need to do it, as I am (currently) not a driver author.

I am saying that people who are already using atomic_read()
(and who unfortunately did not read your explanation above)
will still sometimes use atomic_read() as a way to read atomic value
*from memory*, and will create nasty heisenbugs for you to debug.
--
vda


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Arjan van de Ven
On Mon, 10 Sep 2007 11:56:29 +0100
Denys Vlasenko [EMAIL PROTECTED] wrote:

 
 Well, if you insist on having it again:
 
 Waiting for atomic value to be zero:
 
 while (atomic_read(&x))
 continue;
 

and this I would say is buggy code all the way.

Not from a pure C level semantics, but from a "busy waiting is buggy"
semantics level and a "I'm inventing my own locking" semantics level.



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Denys Vlasenko
On Monday 10 September 2007 14:38, Denys Vlasenko wrote:
 You are basically trying to educate me how to use atomic properly.
 You don't need to do it, as I am (currently) not a driver author.
 
 I am saying that people who are already using atomic_read()
 (and who unfortunately did not read your explanation above)
 will still sometimes use atomic_read() as a way to read atomic value
 *from memory*, and will create nasty heisenbugs for you to debug.

static inline int
qla2x00_wait_for_loop_ready(scsi_qla_host_t *ha)
{
	int return_status = QLA_SUCCESS;
	unsigned long loop_timeout;
	scsi_qla_host_t *pha = to_qla_parent(ha);

	/* wait for 5 min at the max for loop to be ready */
	loop_timeout = jiffies + (MAX_LOOP_TIMEOUT * HZ);

	while ((!atomic_read(&pha->loop_down_timer) &&
	    atomic_read(&pha->loop_state) == LOOP_DOWN) ||
	    atomic_read(&pha->loop_state) != LOOP_READY) {
		if (atomic_read(&pha->loop_state) == LOOP_DEAD) {
			return_status = QLA_FUNCTION_FAILED;
			break;
		}
		msleep(1000);
		if (time_after_eq(jiffies, loop_timeout)) {
			return_status = QLA_FUNCTION_FAILED;
			break;
		}
	}
	return (return_status);
}

Is above correct or buggy? Correct, because msleep is a barrier.
Is it obvious? No.

static void
qla2x00_rst_aen(scsi_qla_host_t *ha)
{
	if (ha->flags.online && !ha->flags.reset_active &&
	    !atomic_read(&ha->loop_down_timer) &&
	    !(test_bit(ABORT_ISP_ACTIVE, &ha->dpc_flags))) {
		do {
			clear_bit(RESET_MARKER_NEEDED, &ha->dpc_flags);

			/*
			 * Issue marker command only when we are going to start
			 * the I/O.
			 */
			ha->marker_needed = 1;
		} while (!atomic_read(&ha->loop_down_timer) &&
		    (test_bit(RESET_MARKER_NEEDED, &ha->dpc_flags)));
	}
}

Is above correct? I honestly don't know. Correct, because set_bit is
a barrier on _all _memory_? Will it break if set_bit will be changed
to be a barrier only on its operand? Probably yes.

drivers/kvm/kvm_main.c

while (atomic_read(&completed) != needed) {
cpu_relax();
barrier();
}

Obviously author did not know that cpu_relax is already a barrier.
See why I think driver authors will be confused?

arch/x86_64/kernel/crash.c

static void nmi_shootdown_cpus(void)
{
...
msecs = 1000; /* Wait at most a second for the other cpus to stop */
while ((atomic_read(&waiting_for_crash_ipi) > 0) && msecs) {
mdelay(1);
msecs--;
}
...
}

Is mdelay(1) a barrier? Yes, because it is a function on x86_64.
Absolutely the same code will be buggy on an arch where
mdelay(1) == udelay(1000), and udelay is implemented
as inline busy-wait.

arch/sparc64/kernel/smp.c

/* Wait for response */
while (atomic_read(&data.finished) != cpus)
cpu_relax();
...later in the same file...
while (atomic_read(&smp_capture_registry) != ncpus)
rmb();

I'm confused. Do we need cpu_relax() or rmb()? Does cpu_relax() imply rmb()?
(No it doesn't). Which of those two while loops needs correcting?
--
vda


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Denys Vlasenko
On Monday 10 September 2007 15:51, Arjan van de Ven wrote:
 On Mon, 10 Sep 2007 11:56:29 +0100
 Denys Vlasenko [EMAIL PROTECTED] wrote:
 
  
  Well, if you insist on having it again:
  
  Waiting for atomic value to be zero:
  
  while (atomic_read(&x))
  continue;
  
 
 and this I would say is buggy code all the way.

 Not from a pure C level semantics, but from a "busy waiting is buggy"
 semantics level and a "I'm inventing my own locking" semantics level.

After inspecting arch/*, I cannot agree with you.
Otherwise almost all major architectures use
conceptually buggy busy-waiting:

arch/alpha
arch/i386
arch/ia64
arch/m32r
arch/mips
arch/parisc
arch/powerpc
arch/sh
arch/sparc64
arch/um
arch/x86_64

All of the above contain busy-waiting on atomic_read.

Including these loops without barriers:

arch/mips/kernel/smtc.c
while (atomic_read(&idle_hook_initialized) < 1000)
;
arch/mips/sgi-ip27/ip27-nmi.c
while (atomic_read(&nmied_cpus) != num_online_cpus());

[Well maybe num_online_cpus() is a barrier, I didn't check]

arch/sh/kernel/smp.c
if (wait)
while (atomic_read(&smp_fn_call.finished) != (nr_cpus - 1));

Bugs?
--
vda


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Linus Torvalds

On Mon, 10 Sep 2007, Denys Vlasenko wrote:
 
 static inline int
 qla2x00_wait_for_loop_ready(scsi_qla_host_t *ha)
 {
 int  return_status = QLA_SUCCESS;
 unsigned long loop_timeout ;
 scsi_qla_host_t *pha = to_qla_parent(ha);
 
 /* wait for 5 min at the max for loop to be ready */
 loop_timeout = jiffies + (MAX_LOOP_TIMEOUT * HZ);
 
 while ((!atomic_read(&pha->loop_down_timer) &&
 atomic_read(&pha->loop_state) == LOOP_DOWN) ||
 atomic_read(&pha->loop_state) != LOOP_READY) {
 if (atomic_read(&pha->loop_state) == LOOP_DEAD) {
...
 Is above correct or buggy? Correct, because msleep is a barrier.
 Is it obvious? No.

It's *buggy*. But it has nothing to do with any msleep() in the loop, or 
anything else.

And more importantly, it would be equally buggy even *with* a volatile 
atomic_read().

Why is this so hard for people to understand? You're all acting like 
morons.

The reason it is buggy has absolutely nothing to do with whether the read 
is done or not, it has to do with the fact that the CPU may re-order the 
reads *regardless* of whether the read is done in some specific order by 
the compiler or not! In effect, there is zero ordering between all those 
three reads, and if you don't have memory barriers (or a lock or other 
serialization), that code is buggy.
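A hedged userspace sketch of the non-buggy shape of such code, with GCC `__atomic` release/acquire operations standing in for the kernel's write/read barriers or a lock (the `publish`/`consume` names are invented for this example):

```c
#include <assert.h>

/* The producer's payload store must be ordered before the flag store,
 * and the consumer's flag load before the payload load; without both
 * halves, the reads and writes have no ordering at all. */
static int payload;
static int ready;

static void publish(int value)
{
	payload = value;
	/* release: orders the payload store before the flag store */
	__atomic_store_n(&ready, 1, __ATOMIC_RELEASE);
}

static int consume(void)
{
	/* acquire: orders the flag load before the payload load */
	while (!__atomic_load_n(&ready, __ATOMIC_ACQUIRE))
		;	/* spin; a real loop would also relax/yield */
	return payload;
}
```

The point of the sketch is that the ordering guarantee lives in the paired barriers, not in any volatile qualifier on the flag.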

So stop this idiotic discussion thread already. The above kind of code 
needs memory barriers to be non-buggy. The whole "volatile or not" 
discussion is totally idiotic, and pointless, and anybody who doesn't 
understand that by now needs to just shut up and think about it more, 
rather than make this discussion drag out even further.

The fact is, "volatile" *only* makes things worse. It generates worse 
code, and never fixes any real bugs. This is a *fact*.

Linus


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Arjan van de Ven
On Mon, 10 Sep 2007 15:38:23 +0100
Denys Vlasenko [EMAIL PROTECTED] wrote:

 On Monday 10 September 2007 15:51, Arjan van de Ven wrote:
  On Mon, 10 Sep 2007 11:56:29 +0100
  Denys Vlasenko [EMAIL PROTECTED] wrote:
  
   
   Well, if you insist on having it again:
   
   Waiting for atomic value to be zero:
   
   while (atomic_read(&x))
   continue;
   
  
  and this I would say is buggy code all the way.
 
  Not from a pure C level semantics, but from a "busy waiting is
  buggy" semantics level and a "I'm inventing my own locking"
  semantics level.
 
 After inspecting arch/*, I cannot agree with you.

the arch/ people obviously are allowed to do their own locking stuff...
BECAUSE THEY HAVE TO IMPLEMENT THAT!


the arch maintainers know EXACTLY how their hw behaves (well, we hope)
so they tend to be the exception to many rules in the kernel


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Denys Vlasenko
On Monday 10 September 2007 16:09, Linus Torvalds wrote:
 On Mon, 10 Sep 2007, Denys Vlasenko wrote:
  static inline int
  qla2x00_wait_for_loop_ready(scsi_qla_host_t *ha)
  {
  int  return_status = QLA_SUCCESS;
  unsigned long loop_timeout ;
  scsi_qla_host_t *pha = to_qla_parent(ha);
  
  /* wait for 5 min at the max for loop to be ready */
  loop_timeout = jiffies + (MAX_LOOP_TIMEOUT * HZ);
  
  while ((!atomic_read(&pha->loop_down_timer) &&
  atomic_read(&pha->loop_state) == LOOP_DOWN) ||
  atomic_read(&pha->loop_state) != LOOP_READY) {
  if (atomic_read(&pha->loop_state) == LOOP_DEAD) {
 ...
  Is above correct or buggy? Correct, because msleep is a barrier.
  Is it obvious? No.
 
 It's *buggy*. But it has nothing to do with any msleep() in the loop, or 
 anything else.
 
 And more importantly, it would be equally buggy even *with* a volatile 
 atomic_read().

I am not saying that this code is okay, this isn't the point.
(The code is in fact awful for several more reasons).

My point is that people are confused as to what atomic_read()
exactly means, and this is bad. Same for cpu_relax().
First one says "read", and second one doesn't say "barrier".

This is real code from current kernel which demonstrates this:

I don't know that cpu_relax() is a barrier already:

drivers/kvm/kvm_main.c
        while (atomic_read(&completed) != needed) {
                cpu_relax();
                barrier();
        }

I think that atomic_read() is a read from memory and therefore
I don't need a barrier:

arch/x86_64/kernel/crash.c
        msecs = 1000; /* Wait at most a second for the other cpus to stop */
        while ((atomic_read(&waiting_for_crash_ipi) > 0) && msecs) {
                mdelay(1);
                msecs--;
        }

Since neither camp seems to give up, I am proposing renaming
them to something less confusing, and make everybody happy.

cpu_relax_barrier()
atomic_value(x)
atomic_fetch(x)

I'm not native English speaker, do these sound better?
--
vda


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Christoph Lameter
On Fri, 17 Aug 2007, Segher Boessenkool wrote:

 "volatile" has nothing to do with reordering.  atomic_dec() writes
 to memory, so it _does_ have "volatile semantics", implicitly, as
 long as the compiler cannot optimise the atomic variable away
 completely -- any store counts as a side effect.

Stores can be reordered. Only x86 has (mostly) implicit write ordering. So 
no, atomic_dec has no "volatile semantics" and may be reordered on a variety 
of processors. Writes to memory may not follow code order on several 
processors.




Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Christoph Lameter
On Mon, 10 Sep 2007, Linus Torvalds wrote:

  The fact is, "volatile" *only* makes things worse. It generates worse 
 code, and never fixes any real bugs. This is a *fact*.

Yes, let's just drop the volatiles now! We need a patch that gets rid of 
them. Volunteers?




Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Kyle Moffett

On Sep 10, 2007, at 12:46:33, Denys Vlasenko wrote:
My point is that people are confused as to what atomic_read()   
exactly means, and this is bad. Same for cpu_relax().  First one  
says "read", and second one doesn't say "barrier".


Q&A:

Q:  When is it OK to use atomic_read()?
A:  You are asking the question, so never.

Q:  But I need to check the value of the atomic at this point in time...
A:  Your code is buggy if it needs to do that on an atomic_t for  
anything other than debugging or optimization.  Use either  
atomic_*_return() or a lock and some normal integers.


Q:  So why can't the atomic_read DTRT magically?
A:  Because the right thing depends on the situation and is usually  
best done with something other than atomic_t.


If somebody can post some non-buggy code which is correctly using  
atomic_read() *and* depends on the compiler generating extra  
nonsensical loads due to volatile then the issue *might* be  
reconsidered.  This also includes samples of code which uses  
atomic_read() and needs memory barriers (so that we can fix the buggy  
code, not so we can change atomic_read()).  So far the only code  
samples anybody has posted are buggy regardless of whether or not the  
value and/or accessors are flagged volatile or not.  And hey, maybe  
the volatile ops *should* be implemented in inline ASM for future- 
proof-ness, but that's a separate issue.


Cheers,
Kyle Moffett



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Paul E. McKenney
On Mon, Sep 10, 2007 at 11:59:29AM -0700, Christoph Lameter wrote:
 On Fri, 17 Aug 2007, Segher Boessenkool wrote:
 
  "volatile" has nothing to do with reordering.  atomic_dec() writes
  to memory, so it _does_ have "volatile semantics", implicitly, as
  long as the compiler cannot optimise the atomic variable away
  completely -- any store counts as a side effect.
 
  Stores can be reordered. Only x86 has (mostly) implicit write ordering. So 
  no, atomic_dec has no "volatile semantics" and may be reordered on a variety 
 of processors. Writes to memory may not follow code order on several 
 processors.

The one exception to this being the case where process-level code is
communicating to an interrupt handler running on that same CPU -- on
all CPUs that I am aware of, a given CPU always sees its own writes
in order.

Thanx, Paul


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Christoph Lameter
On Mon, 10 Sep 2007, Paul E. McKenney wrote:

 The one exception to this being the case where process-level code is
 communicating to an interrupt handler running on that same CPU -- on
 all CPUs that I am aware of, a given CPU always sees its own writes
 in order.

Yes but that is due to the code path effectively continuing in the 
interrupt handler. The cpu makes sure that op codes being executed always 
see memory in a consistent way. The basic ordering problem with out of 
order writes is therefore coming from other processors concurrently 
executing code and holding variables in registers that are modified 
elsewhere. The only solution that I know of are one or the other form of 
barrier.


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Paul E. McKenney
On Mon, Sep 10, 2007 at 02:36:26PM -0700, Christoph Lameter wrote:
 On Mon, 10 Sep 2007, Paul E. McKenney wrote:
 
  The one exception to this being the case where process-level code is
  communicating to an interrupt handler running on that same CPU -- on
  all CPUs that I am aware of, a given CPU always sees its own writes
  in order.
 
 Yes but that is due to the code path effectively continuing in the 
 interrupt handler. The cpu makes sure that op codes being executed always 
 see memory in a consistent way. The basic ordering problem with out of 
 order writes is therefore coming from other processors concurrently 
 executing code and holding variables in registers that are modified 
 elsewhere. The only solution that I know of are one or the other form of 
 barrier.

So we are agreed then -- volatile accesses may be of some assistance when
interacting with interrupt handlers running on the same CPU (presumably
when using per-CPU variables), but are generally useless when sharing
variables among CPUs.  Correct?

Thanx, Paul


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Segher Boessenkool

"volatile" has nothing to do with reordering.  atomic_dec() writes
to memory, so it _does_ have "volatile semantics", implicitly, as
long as the compiler cannot optimise the atomic variable away
completely -- any store counts as a side effect.


Stores can be reordered. Only x86 has (mostly) implicit write ordering.
So no, atomic_dec has no "volatile semantics"


Read again: I said the C "volatile" construct has nothing to do
with CPU memory access reordering.


and may be reordered on a variety
of processors. Writes to memory may not follow code order on several
processors.


The _compiler_ isn't allowed to reorder things here.  Yes, of course
you do need stronger barriers for many purposes, volatile isn't all
that useful you know.


Segher



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-09 Thread Arjan van de Ven
On Sun, 9 Sep 2007 19:02:54 +0100
Denys Vlasenko <[EMAIL PROTECTED]> wrote:

> Why is all this fixation on "volatile"? I don't think
> people want "volatile" keyword per se, they want atomic_read() to
> _always_ compile into a memory-accessing instruction, not register
> access.

and ... why is that?
is there any valid, non-buggy code sequence that makes that a
reasonable requirement?


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-09 Thread Denys Vlasenko
On Friday 17 August 2007 17:48, Linus Torvalds wrote:
> 
> On Fri, 17 Aug 2007, Nick Piggin wrote:
> > 
> > That's not obviously just taste to me. Not when the primitive has many
> > (perhaps, the majority) of uses that do not require said barriers. And
> > this is not solely about the code generation (which, as Paul says, is
> > relatively minor even on x86). I prefer people to think explicitly
> > about barriers in their lockless code.
> 
> Indeed.
> 
> I think the important issues are:
> 
>  - "volatile" itself is simply a badly/weakly defined issue. The semantics 
>of it as far as the compiler is concerned are really not very good, and 
>in practice tends to boil down to "I will generate so bad code that 
>nobody can accuse me of optimizing anything away".
> 
>  - "volatile" - regardless of how well or badly defined it is - is purely 
>a compiler thing. It has absolutely no meaning for the CPU itself, so 
>it at no point implies any CPU barriers. As a result, even if the 
>compiler generates crap code and doesn't re-order anything, there's 
>nothing that says what the CPU will do.
> 
>  - in other words, the *only* possible meaning for "volatile" is a purely 
>single-CPU meaning. And if you only have a single CPU involved in the 
>process, the "volatile" is by definition pointless (because even 
>without a volatile, the compiler is required to make the C code appear 
>consistent as far as a single CPU is concerned).
> 
> So, let's take the example *buggy* code where we use "volatile" to wait 
> for other CPU's:
> 
>   atomic_set(&var, 0);
>   while (!atomic_read(&var))
>           /* nothing */;
> 
> 
> which generates an endless loop if we don't have atomic_read() imply 
> volatile.
> 
> The point here is that it's buggy whether the volatile is there or not! 
> Exactly because the user expects multi-processing behaviour, but 
> "volatile" doesn't actually give any real guarantees about it. Another CPU 
> may have done:
> 
>   external_ptr = kmalloc(..);
>   /* Setup is now complete, inform the waiter */
>   atomic_inc(&var);
> 
> but the fact is, since the other CPU isn't serialized in any way, the 
> "while-loop" (even in the presense of "volatile") doesn't actually work 
> right! Whatever the "atomic_read()" was waiting for may not have 
> completed, because we have no barriers!

Why is all this fixation on "volatile"? I don't think
people want the "volatile" keyword per se, they want atomic_read(&x) to
_always_ compile into a memory-accessing instruction, not register access.
--
vda


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-24 Thread Denys Vlasenko
On Friday 24 August 2007 18:15, Christoph Lameter wrote:
> On Fri, 24 Aug 2007, Denys Vlasenko wrote:
> > On Thursday 16 August 2007 00:22, Paul Mackerras wrote:
> > > Satyam Sharma writes:
> > > In the kernel we use atomic variables in precisely those situations
> > > where a variable is potentially accessed concurrently by multiple
> > > CPUs, and where each CPU needs to see updates done by other CPUs in a
> > > timely fashion.  That is what they are for.  Therefore the compiler
> > > must not cache values of atomic variables in registers; each
> > > atomic_read must result in a load and each atomic_set must result in a
> > > store.  Anything else will just lead to subtle bugs.
> >
> > Amen.
>
> A "timely" fashion? One cannot rely on something like that when coding.
> The visibility of updates is ensured by barriers and not by some fuzzy
> notion of "timeliness".

But here you do have some notion of time:

while (atomic_read(&x))
        continue;

"continue when other CPU(s) decrement it down to zero".
If "read" includes an insn which accesses RAM, you will
see "new" value sometime after other CPU decrements it.
"Sometime after" is on the order of nanoseconds here.
It is a valid concept of time, right?

The whole confusion is about whether atomic_read implies
"read from RAM" or not. I am in a camp which thinks it does.
You are in an opposite one.

We just need a less ambiguous name.

What about this:

/**
 * atomic_read - read atomic variable
 * @v: pointer of type atomic_t
 *
 * Atomically reads the value of @v.
 * No compiler barrier implied.
 */
#define atomic_read(v)  ((v)->counter)

+/**
+ * atomic_read_uncached - read atomic variable from memory
+ * @v: pointer of type atomic_t
+ *
+ * Atomically reads the value of @v. This is guaranteed to emit an insn
+ * which accesses memory, atomically. No ordering guarantees!
+ */
+#define atomic_read_uncached(v)  asm_or_volatile_ptr_magic(v)

I was thinking of s/atomic_read/atomic_get/ too, but it implies "taking"
atomic a-la get_cpu()...
--
vda


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-24 Thread Linus Torvalds


On Fri, 24 Aug 2007, Denys Vlasenko wrote:
>
> > No, you don't use "x.counter++". But you *do* use
> >
> > if (atomic_read(&x) <= 1)
> >
> > and loading into a register is stupid and pointless, when you could just
> > do it as a regular memory-operand to the cmp instruction.
> 
> It doesn't mean that (volatile int*) cast is bad, it means that current gcc
> is bad (or "not good enough"). IOW: instead of avoiding volatile cast,
> it's better to fix the compiler.

I would agree that fixing the compiler in this case would be a good thing, 
even quite regardless of any "atomic_read()" discussion.

I just have a strong suspicion that "volatile" performance is so low down 
the list of any C compiler persons interest, that it's never going to 
happen. And quite frankly, I cannot blame the gcc guys for it.

That's especially as "volatile" really isn't a very good feature of the C 
language, and is likely to get *less* interesting rather than more (as 
user space starts to be more and more threaded, "volatile" gets less and 
less useful).

[ Ie, currently, I think you can validly use "volatile" in a "sig_atomic_t" 
  kind of way, where there is a single thread, but with asynchronous 
  events. In that kind of situation, I think it's probably useful. But 
  once you get multiple threads, it gets pointless.

  Sure: you could use "volatile" together with something like Dekker's or 
  Peterson's algorithm that doesn't depend on cache coherency (that's 
  basically what the C "volatile" keyword approximates: not atomic 
  accesses, but *uncached* accesses!) But let's face it, that's way past 
  insane. ]

So I wouldn't expect "volatile" to ever really generate better code. It 
might happen as a side effect of other improvements (eg, I might hope that 
the SSA work would eventually lead to gcc having a much better defined 
model of valid optimizations, and maybe better code generation for 
volatile accesses fall out cleanly out of that), but in the end, it's such 
an ugly special case in C, and so seldom used, that I wouldn't depend on 
it.

> Linus, in all honesty gcc has many more cases of suboptimal code,
> case of "volatile" is just one of many.

Well, the thing is, quite often, many of those "suboptimal code" 
generations fall into two distinct classes:

 - complex C code. I can't really blame the compiler too much for this. 
   Some things are *hard* to optimize, and for various scalability 
   reasons, you often end up having limits in the compiler where it 
   doesn't even _try_ doing certain optimizations if you have excessive 
   complexity.

 - bad register allocation. Register allocation really is hard, and 
   sometimes gcc just does the "obviously wrong" thing, and you end up 
   having totally unnecessary spills.

> Off the top of my head:

Yes, "unsigned long long" with x86 has always generated atrocious code. In 
fact, I would say that historically it was really *really* bad. These 
days, gcc actually does a pretty good job, but I'm not surprised that it's 
still quite possible to find cases where it did some optimization (in this 
case, apparently noticing that "shift by >= 32 bits" causes the low 
register to be pointless) and then missed *another* optimization (better 
register use) because that optimization had been done *before* the first 
optimization was done.

That's a *classic* example of compiler code generation issues, and quite 
frankly, I think that's very different from the issue of "volatile".

Quite frankly, I'd like there to be more competition in the open source 
compiler game, and that might cause some upheavals, but on the whole, gcc 
actually does a pretty damn good job. 

Linus


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-24 Thread Christoph Lameter
On Fri, 24 Aug 2007, Denys Vlasenko wrote:

> On Thursday 16 August 2007 00:22, Paul Mackerras wrote:
> > Satyam Sharma writes:
> > In the kernel we use atomic variables in precisely those situations
> > where a variable is potentially accessed concurrently by multiple
> > CPUs, and where each CPU needs to see updates done by other CPUs in a
> > timely fashion.  That is what they are for.  Therefore the compiler
> > must not cache values of atomic variables in registers; each
> > atomic_read must result in a load and each atomic_set must result in a
> > store.  Anything else will just lead to subtle bugs.
> 
> Amen.

A "timely" fashion? One cannot rely on something like that when coding. 
The visibility of updates is ensured by barriers and not by some fuzzy 
notion of "timeliness".


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-24 Thread Denys Vlasenko
On Thursday 16 August 2007 00:22, Paul Mackerras wrote:
> Satyam Sharma writes:
> In the kernel we use atomic variables in precisely those situations
> where a variable is potentially accessed concurrently by multiple
> CPUs, and where each CPU needs to see updates done by other CPUs in a
> timely fashion.  That is what they are for.  Therefore the compiler
> must not cache values of atomic variables in registers; each
> atomic_read must result in a load and each atomic_set must result in a
> store.  Anything else will just lead to subtle bugs.

Amen.
--
vda


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-24 Thread Denys Vlasenko
On Saturday 18 August 2007 05:13, Linus Torvalds wrote:
> On Sat, 18 Aug 2007, Satyam Sharma wrote:
> > No code does (or would do, or should do):
> >
> > x.counter++;
> >
> > on an "atomic_t x;" anyway.
>
> That's just an example of a general problem.
>
> No, you don't use "x.counter++". But you *do* use
>
>   if (atomic_read(&x) <= 1)
>
> and loading into a register is stupid and pointless, when you could just
> do it as a regular memory-operand to the cmp instruction.

It doesn't mean that (volatile int*) cast is bad, it means that current gcc
is bad (or "not good enough"). IOW: instead of avoiding volatile cast,
it's better to fix the compiler.

> And as far as the compiler is concerned, the problem is the 100% same:
> combining operations with the volatile memop.
>
> The fact is, a compiler that thinks that
>
>   movl mem,reg
>   cmpl $val,reg
>
> is any better than
>
>   cmpl $val,mem
>
> is just not a very good compiler.

Linus, in all honesty gcc has many more cases of suboptimal code,
case of "volatile" is just one of many.

Off the top of my head:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=28417

unsigned v;
void f(unsigned A) { v = ((unsigned long long)A) * 365384439 >> (27+32); }

gcc-4.1.1 -S -Os -fomit-frame-pointer t.c

f:
        movl    $365384439, %eax
        mull    4(%esp)
        movl    %edx, %eax      <= ?
        shrl    $27, %eax
        movl    %eax, v
        ret

Why is it moving %edx to %eax?

gcc-4.2.1 -S -Os -fomit-frame-pointer t.c

f:
        movl    $365384439, %eax
        mull    4(%esp)
        movl    %edx, %eax      <= ?
        xorl    %edx, %edx      <= ??!
        shrl    $27, %eax
        movl    %eax, v
        ret

Progress... Now we also zero out %edx afterwards for no apparent reason.
--
vda


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-22 Thread Adrian Bunk
On Tue, Aug 21, 2007 at 06:51:16PM -0400, [EMAIL PROTECTED] wrote:
> On Tue, 21 Aug 2007 09:16:43 PDT, "Paul E. McKenney" said:
> 
> > I agree that instant gratification is hard to come by when synching
> > up compiler and kernel versions.  Nonetheless, it should be possible
> > to create APIs that are conditioned on the compiler version.
> 
> We've tried that, sort of.  See the mess surrounding the whole
> extern/static/inline/__whatever boondoggle, which seems to have
> changed semantics in every single gcc release since 2.95 or so.
>...

There is exactly one semantics change in gcc in this area, and that is 
the change of the "extern inline" semantics in gcc 4.3 to the
C99 semantics.

cu
Adrian

-- 

   "Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   "Only a promise," Lao Er said.
   Pearl S. Buck - Dragon Seed



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Paul E. McKenney
On Tue, Aug 21, 2007 at 06:51:16PM -0400, [EMAIL PROTECTED] wrote:
> On Tue, 21 Aug 2007 09:16:43 PDT, "Paul E. McKenney" said:
> 
> > I agree that instant gratification is hard to come by when synching
> > up compiler and kernel versions.  Nonetheless, it should be possible
> > to create APIs that are conditioned on the compiler version.
> 
> We've tried that, sort of.  See the mess surrounding the whole
> extern/static/inline/__whatever boondoggle, which seems to have
> changed semantics in every single gcc release since 2.95 or so.
> 
> And recently mention was made that gcc4.4 will have *new* semantics
> in this area. Yee. Hah.

;-)

Thanx, Paul


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Valdis . Kletnieks
On Tue, 21 Aug 2007 09:16:43 PDT, "Paul E. McKenney" said:

> I agree that instant gratification is hard to come by when synching
> up compiler and kernel versions.  Nonetheless, it should be possible
> to create APIs that are conditioned on the compiler version.

We've tried that, sort of.  See the mess surrounding the whole
extern/static/inline/__whatever boondoggle, which seems to have
changed semantics in every single gcc release since 2.95 or so.

And recently mention was made that gcc4.4 will have *new* semantics
in this area. Yee. Hah.









Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Linus Torvalds


On Tue, 21 Aug 2007, Chris Snook wrote:
> 
> Moore's law is definitely working against us here.  Register counts, pipeline
> depths, core counts, and clock multipliers are all increasing in the long run.
> At some point in the future, barrier() will be universally regarded as a
> hammer too big for most purposes.

Note that "barrier()" is purely a compiler barrier. It has zero impact on 
the CPU pipeline itself, and also has zero impact on anything that gcc 
knows isn't visible in memory (ie local variables that don't have their 
address taken), so barrier() really is pretty cheap.

Now, it's possible that gcc messes up in some circumstances, and that the 
memory clobber will cause gcc to also do things like flush local registers 
unnecessarily to their stack slots, but quite frankly, if that happens, 
it's a gcc problem, and I also have to say that I've not seen that myself.

So in a very real sense, "barrier()" will just make sure that there is a 
stronger sequence point for the compiler where things are stable. In most 
cases it has absolutely zero performance impact - apart from the 
-intended- impact of making sure that the compiler doesn't re-order or 
cache stuff around it.

And sure, we could make it more finegrained, and also introduce a 
per-variable barrier, but the fact is, people _already_ have problems with 
thinking about these kinds of things, and adding new abstraction issues 
with subtle semantics is the last thing we want.

So I really think you'd want to show a real example of real code that 
actually gets noticeably slower or bigger.

In removing "volatile", we have shown that. It may not have made a big 
difference on powerpc, but it makes a real difference on x86 - and more 
importantly, it removes something that people clearly don't know how it 
works, and incorrectly expect it to just fix bugs.

[ There are *other* barriers - the ones that actually add memory barriers 
  to the CPU - that really can be quite expensive. The good news is that 
  the expense is going down rather than up: both Intel and AMD are not 
  only removing the need for some of them (ie "smp_rmb()" will become a 
  compiler-only barrier), but we're _also_ seeing the whole "pipeline 
  flush" approach go away, and be replaced by the CPU itself actually 
  being better - so even the actual CPU pipeline barriers are getting
  cheaper, not more expensive. ]

For example, did anybody even _test_ how expensive "barrier()" is? Just 
as a lark, I did

#undef barrier
#define barrier() do { } while (0)

in kernel/sched.c (which only has three of them in it, but hey, that's 
more than most files), and there were _zero_ code generation downsides. 
One instruction was moved (and a few line numbers changed), so it wasn't 
like the assembly language was identical, but the point is, barrier() 
simply doesn't have the same kinds of downsides that "volatile" has.

(That may not be true on other architectures or in other source files, of 
course. This *does* depend on code generation details. But anybody who 
thinks that "barrier()" is fundamentally expensive is simply incorrect. It 
is *fundamentally* a no-op.)

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Satyam Sharma


On Tue, 21 Aug 2007, Chris Snook wrote:

> David Miller wrote:
> > From: Linus Torvalds <[EMAIL PROTECTED]>
> > Date: Mon, 20 Aug 2007 22:46:47 -0700 (PDT)
> > 
> > > Ie a "barrier()" is likely _cheaper_ than the code generation downside
> > > from using "volatile".
> > 
> > Assuming GCC were ever better about the code generation badness
> > with volatile that has been discussed here, I much prefer
> > we tell GCC "this memory piece changed" rather than "every
> > piece of memory has changed" which is what the barrier() does.
> > 
> > I happened to have been scanning a lot of assembler lately to
> > track down a gcc-4.2 miscompilation on sparc64, and the barriers
> > do hurt quite a bit in some places.  Instead of keeping unrelated
> > variables around cached in local registers, it reloads everything.
> 
> Moore's law is definitely working against us here.  Register counts, pipeline
> depths, core counts, and clock multipliers are all increasing in the long run.
> At some point in the future, barrier() will be universally regarded as a
> hammer too big for most purposes.

I do agree, and the important point to note is that the benefits of a
/lighter/ compiler barrier, such as what David referred to above, _can_
be had without having to do anything with the "volatile" keyword at all.
And such a primitive has already been mentioned/proposed on this thread.


But this is all tangential to the core question at hand -- whether to have
implicit (compiler, possibly "light-weight" of the kind referred above)
barrier semantics in atomic ops that do not have them, or not.

I was lately looking in the kernel for _actual_ code that uses atomic_t
and benefits from the lack of any implicit barrier, with the compiler
being free to cache the atomic_t in a register. Now that often does _not_
happen, because all other ops (implemented in asm with LOCK prefix on x86)
_must_ therefore constrain the atomic_t to memory anyway. So typically all
atomic ops code sequences end up operating on memory.

Then I did locate sched.c:select_nohz_load_balancer() -- it repeatedly
references the same atomic_t object, and the code that I saw generated
(with CC_OPTIMIZE_FOR_SIZE=y) did cache it in a register for a sequence of
instructions. It uses atomic_cmpxchg, thereby not requiring explicit
memory barriers anywhere in the code, and is an example of an atomic_t
user that is safe, and yet benefits from its memory loads/stores being
elided/coalesced by the compiler.


# at this point, %eax holds num_online_cpus() and
# %ebx holds cpus_weight(nohz.cpu_mask)
# the variable "cpu" is in %esi

0xc1018e1d:  cmp    %eax,%ebx               # if No.A.
0xc1018e1f:  mov    0xc134d900,%eax         # first atomic_read()
0xc1018e24:  jne    0xc1018e36
0xc1018e26:  cmp    %esi,%eax               # if No.B.
0xc1018e28:  jne    0xc1018e80              # returns with 0
0xc1018e2a:  movl   $0xffffffff,0xc134d900  # atomic_set(-1), and ...
0xc1018e34:  jmp    0xc1018e80              # ... returns with 0
0xc1018e36:  cmp    $0xffffffff,%eax        # if No.C. (NOTE!)
0xc1018e39:  jne    0xc1018e46
0xc1018e3b:  lock cmpxchg %esi,0xc134d900   # atomic_cmpxchg()
0xc1018e43:  inc    %eax
0xc1018e44:  jmp    0xc1018e48
0xc1018e46:  cmp    %esi,%eax               # if No.D. (NOTE!)
0xc1018e48:  jne    0xc1018e80              # if !=, default return 0 (if No.E.)
0xc1018e4a:  jmp    0xc1018e84              # otherwise (==) returns with 1

The above is:

if (cpus_weight(nohz.cpu_mask) == num_online_cpus()) {          /* if No.A. */
        if (atomic_read(&nohz.load_balancer) == cpu)            /* if No.B. */
                atomic_set(&nohz.load_balancer, -1);            /* XXX */
        return 0;
}
if (atomic_read(&nohz.load_balancer) == -1) {                   /* if No.C. */
        /* make me the ilb owner */
        if (atomic_cmpxchg(&nohz.load_balancer, -1, cpu) == -1) /* if No.E. */
                return 1;
} else if (atomic_read(&nohz.load_balancer) == cpu)             /* if No.D. */
        return 1;
...
...
return 0; /* default return from function */

As you can see, the atomic_read()s of "if"s Nos. B, C, and D were _all_
coalesced into a single memory reference "mov 0xc134d900,%eax" at the
top of the function, and then "if"s Nos. C and D simply used the value
from %eax itself. But that's perfectly safe, such is the logic of this
function. It uses cmpxchg _whenever_ updating the value in the memory
atomic_t and then returns appropriately. The _only_ point that a casual
reader may find racy is that marked /* XXX */ above -- atomic_read()
followed by atomic_set() with no barrier in between. But even that is ok,
because if one thread ever finds that condition to succeed, it is 100%
guaranteed no other thread on any other CPU will find _any_ condition
to be true, thereby avoiding any race in the modification of that value.


BTW it does sound reasonable that a lot of 

Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Paul E. McKenney
On Tue, Aug 21, 2007 at 04:48:51PM +0200, Segher Boessenkool wrote:
> >>Let me say it more clearly: On ARM, it is impossible to perform atomic
> >>operations on MMIO space.
> >
> >Actually, no one is suggesting that we try to do that at all.
> >
> >The discussion about RMW ops on MMIO space started with a comment
> >attributed to the gcc developers that one reason why gcc on x86
> >doesn't use instructions that do RMW ops on volatile variables is that
> >volatile is used to mark MMIO addresses, and there was some
> >uncertainty about whether (non-atomic) RMW ops on x86 could be used on
> >MMIO.  This is in regard to the question about why gcc on x86 always
> >moves a volatile variable into a register before doing anything to it.
> 
> This question is GCC PR33102, which was incorrectly closed as a duplicate
> of PR3506 -- and *that* PR was closed because its reporter seemed to
> claim the GCC generated code for an increment on a volatile (namely,
> three machine instructions: load, modify, store) was incorrect, and it
> has to be one machine instruction.
> 
> >So the whole discussion is irrelevant to ARM, PowerPC and any other
> >architecture except x86[-64].
> 
> And even there, it's not something the kernel can take advantage of
> before GCC 4.4 is in widespread use, if then.  Let's move on.

I agree that instant gratification is hard to come by when synching
up compiler and kernel versions.  Nonetheless, it should be possible
to create APIs that are conditioned on the compiler version.

Thanx, Paul


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Segher Boessenkool
At some point in the future, barrier() will be universally regarded as 
a hammer too big for most purposes.  Whether or not removing it now


You can't just remove it, it is needed in some places; you want to
replace it in most places with a more fine-grained "compiler barrier",
I presume?

constitutes premature optimization is arguable, but I think we should 
allow such optimization to happen (or not happen) in 
architecture-dependent code, and provide a consistent API that doesn't 
require the use of such things in arch-independent code where it might 
turn into a totally superfluous performance killer depending on what 
hardware it gets compiled for.


Explicit barrier()s won't be too hard to replace -- but what to do
about the implicit barrier()s in rmb() etc. etc. -- *those* will be
hard to get rid of, if only because it is hard enough to teach driver
authors about how to use those primitives *already*.  It is far from
clear what a good interface like that would look like, anyway.

Probably we should first start experimenting with a forget()-style
micro-barrier (but please, find a better name), and see if a nice
usage pattern shows up that can be turned into an API.


Segher



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Segher Boessenkool

Let me say it more clearly: On ARM, it is impossible to perform atomic
operations on MMIO space.


Actually, no one is suggesting that we try to do that at all.

The discussion about RMW ops on MMIO space started with a comment
attributed to the gcc developers that one reason why gcc on x86
doesn't use instructions that do RMW ops on volatile variables is that
volatile is used to mark MMIO addresses, and there was some
uncertainty about whether (non-atomic) RMW ops on x86 could be used on
MMIO.  This is in regard to the question about why gcc on x86 always
moves a volatile variable into a register before doing anything to it.


This question is GCC PR33102, which was incorrectly closed as a duplicate
of PR3506 -- and *that* PR was closed because its reporter seemed to
claim the GCC generated code for an increment on a volatile (namely,
three machine instructions: load, modify, store) was incorrect, and it
has to be one machine instruction.


So the whole discussion is irrelevant to ARM, PowerPC and any other
architecture except x86[-64].


And even there, it's not something the kernel can take advantage of
before GCC 4.4 is in widespread use, if then.  Let's move on.


Segher



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Segher Boessenkool

And no, RMW on MMIO isn't "problematic" at all, either.

An RMW op is a read op, a modify op, and a write op, all rolled
into one opcode.  But three actual operations.


Maybe for some CPUs, but not all.  ARM for instance can't use the
load exclusive and store exclusive instructions to MMIO space.


Sure, your CPU doesn't have RMW instructions -- how to emulate
those if you don't have them is a totally different thing.


Let me say it more clearly: On ARM, it is impossible to perform atomic
operations on MMIO space.


It's all completely beside the point, see the other subthread, but...

Yeah, you can't do LL/SC to MMIO space; ARM isn't alone in that.
You could still implement atomic operations on MMIO space by taking
a lock elsewhere, in normal cacheable memory space.  Why you would
do this is a separate question, you probably don't want it :-)


Segher



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Chris Snook

David Miller wrote:

From: Linus Torvalds <[EMAIL PROTECTED]>
Date: Mon, 20 Aug 2007 22:46:47 -0700 (PDT)

Ie a "barrier()" is likely _cheaper_ than the code generation downside 
from using "volatile".


Assuming GCC were ever better about the code generation badness
with volatile that has been discussed here, I much prefer
we tell GCC "this memory piece changed" rather than "every
piece of memory has changed" which is what the barrier() does.

I happened to have been scanning a lot of assembler lately to
track down a gcc-4.2 miscompilation on sparc64, and the barriers
do hurt quite a bit in some places.  Instead of keeping unrelated
variables around cached in local registers, it reloads everything.


Moore's law is definitely working against us here.  Register counts, 
pipeline depths, core counts, and clock multipliers are all increasing 
in the long run.  At some point in the future, barrier() will be 
universally regarded as a hammer too big for most purposes.  Whether or 
not removing it now constitutes premature optimization is arguable, but 
I think we should allow such optimization to happen (or not happen) in 
architecture-dependent code, and provide a consistent API that doesn't 
require the use of such things in arch-independent code where it might 
turn into a totally superfluous performance killer depending on what 
hardware it gets compiled for.


-- Chris


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Andi Kleen
On Tue, Aug 21, 2007 at 07:33:49PM +1000, Paul Mackerras wrote:
> So the whole discussion is irrelevant to ARM, PowerPC and any other
> architecture except x86[-64].

It's even irrelevant on x86, because all modifying operations on atomic_t 
are coded in inline assembler and will always be RMW, no matter
whether atomic_t is volatile or not.

[ignoring atomic_set(x, atomic_read(x) + 1), which nobody should do]

The only issue is whether atomic_t should have an implicit barrier or not.
My personal opinion is yes -- better safe than sorry. And any code
impact it may have is typically dwarfed by the next cache miss anyway,
so it doesn't matter much.

-Andi



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Paul Mackerras
Russell King writes:

> Let me say it more clearly: On ARM, it is impossible to perform atomic
> operations on MMIO space.

Actually, no one is suggesting that we try to do that at all.

The discussion about RMW ops on MMIO space started with a comment
attributed to the gcc developers that one reason why gcc on x86
doesn't use instructions that do RMW ops on volatile variables is that
volatile is used to mark MMIO addresses, and there was some
uncertainty about whether (non-atomic) RMW ops on x86 could be used on
MMIO.  This is in regard to the question about why gcc on x86 always
moves a volatile variable into a register before doing anything to it.

So the whole discussion is irrelevant to ARM, PowerPC and any other
architecture except x86[-64].

Paul.


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Russell King
On Mon, Aug 20, 2007 at 05:05:18PM -0700, Paul E. McKenney wrote:
> On Tue, Aug 21, 2007 at 01:02:01AM +0200, Segher Boessenkool wrote:
> > >>And no, RMW on MMIO isn't "problematic" at all, either.
> > >>
> > >>An RMW op is a read op, a modify op, and a write op, all rolled
> > >>into one opcode.  But three actual operations.
> > >
> > >Maybe for some CPUs, but not all.  ARM for instance can't use the
> > >load exclusive and store exclusive instructions to MMIO space.
> > 
> > Sure, your CPU doesn't have RMW instructions -- how to emulate
> > those if you don't have them is a totally different thing.
> 
> I thought that ARM's load exclusive and store exclusive instructions
> were its equivalent of LL and SC, which RISC machines typically use to
> build atomic sequences of instructions -- and which normally cannot be
> applied to MMIO space.

Absolutely correct.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Russell King
On Tue, Aug 21, 2007 at 01:02:01AM +0200, Segher Boessenkool wrote:
> >>And no, RMW on MMIO isn't "problematic" at all, either.
> >>
> >>An RMW op is a read op, a modify op, and a write op, all rolled
> >>into one opcode.  But three actual operations.
> >
> >Maybe for some CPUs, but not all.  ARM for instance can't use the
> >load exclusive and store exclusive instructions to MMIO space.
> 
> Sure, your CPU doesn't have RMW instructions -- how to emulate
> those if you don't have them is a totally different thing.

Let me say it more clearly: On ARM, it is impossible to perform atomic
operations on MMIO space.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread David Miller
From: Linus Torvalds <[EMAIL PROTECTED]>
Date: Mon, 20 Aug 2007 22:46:47 -0700 (PDT)

> Ie a "barrier()" is likely _cheaper_ than the code generation downside 
> from using "volatile".

Assuming GCC were ever better about the code generation badness
with volatile that has been discussed here, I much prefer
we tell GCC "this memory piece changed" rather than "every
piece of memory has changed" which is what the barrier() does.

I happened to have been scanning a lot of assembler lately to
track down a gcc-4.2 miscompilation on sparc64, and the barriers
do hurt quite a bit in some places.  Instead of keeping unrelated
variables around cached in local registers, it reloads everything.


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread David Miller
From: Linus Torvalds [EMAIL PROTECTED]
Date: Mon, 20 Aug 2007 22:46:47 -0700 (PDT)

 Ie a barrier() is likely _cheaper_ than the code generation downside 
 from using volatile.

Assuming GCC were ever better about the code generation badness
with volatile that has been discussed here, I much prefer
we tell GCC this memory piece changed rather than every
piece of memory has changed which is what the barrier() does.

I happened to have been scanning a lot of assembler lately to
track down a gcc-4.2 miscompilation on sparc64, and the barriers
do hurt quite a bit in some places.  Instead of keeping unrelated
variables around cached in local registers, it reloads everything.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Russell King
On Tue, Aug 21, 2007 at 01:02:01AM +0200, Segher Boessenkool wrote:
 And no, RMW on MMIO isn't problematic at all, either.
 
 An RMW op is a read op, a modify op, and a write op, all rolled
 into one opcode.  But three actual operations.
 
 Maybe for some CPUs, but not all.  ARM for instance can't use the
 load exclusive and store exclusive instructions to MMIO space.
 
 Sure, your CPU doesn't have RMW instructions -- how to emulate
 those if you don't have them is a totally different thing.

Let me say it more clearly: On ARM, it is impossible to perform atomic
operations on MMIO space.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Russell King
On Mon, Aug 20, 2007 at 05:05:18PM -0700, Paul E. McKenney wrote:
 On Tue, Aug 21, 2007 at 01:02:01AM +0200, Segher Boessenkool wrote:
  And no, RMW on MMIO isn't problematic at all, either.
  
  An RMW op is a read op, a modify op, and a write op, all rolled
  into one opcode.  But three actual operations.
  
  Maybe for some CPUs, but not all.  ARM for instance can't use the
  load exclusive and store exclusive instructions to MMIO space.
  
  Sure, your CPU doesn't have RMW instructions -- how to emulate
  those if you don't have them is a totally different thing.
 
 I thought that ARM's load exclusive and store exclusive instructions
 were its equivalent of LL and SC, which RISC machines typically use to
 build atomic sequences of instructions -- and which normally cannot be
 applied to MMIO space.

Absolutely correct.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Paul Mackerras
Russell King writes:

 Let me say it more clearly: On ARM, it is impossible to perform atomic
 operations on MMIO space.

Actually, no one is suggesting that we try to do that at all.

The discussion about RMW ops on MMIO space started with a comment
attributed to the gcc developers that one reason why gcc on x86
doesn't use instructions that do RMW ops on volatile variables is that
volatile is used to mark MMIO addresses, and there was some
uncertainty about whether (non-atomic) RMW ops on x86 could be used on
MMIO.  This is in regard to the question about why gcc on x86 always
moves a volatile variable into a register before doing anything to it.

So the whole discussion is irrelevant to ARM, PowerPC and any other
architecture except x86[-64].

Paul.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Andi Kleen
On Tue, Aug 21, 2007 at 07:33:49PM +1000, Paul Mackerras wrote:
 So the whole discussion is irrelevant to ARM, PowerPC and any other
 architecture except x86[-64].

It's even irrelevant on x86 because all modifying operations on atomic_t 
are coded in inline assembler and will always be RMW no matter
if atomic_t is volatile or not.

[ignoring atomic_set(x, atomic_read(x) + 1) which nobody should do]

The only issue is if atomic_t should have a implicit barrier or not.
My personal opinion is yes -- better safe than sorry. And any code
impact it may have is typically dwarved by the next cache miss anyways,
so it doesn't matter much.

-Andi

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Chris Snook

David Miller wrote:

From: Linus Torvalds [EMAIL PROTECTED]
Date: Mon, 20 Aug 2007 22:46:47 -0700 (PDT)

Ie a barrier() is likely _cheaper_ than the code generation downside 
from using volatile.


Assuming GCC were ever better about the code generation badness
with volatile that has been discussed here, I much prefer
we tell GCC this memory piece changed rather than every
piece of memory has changed which is what the barrier() does.

I happened to have been scanning a lot of assembler lately to
track down a gcc-4.2 miscompilation on sparc64, and the barriers
do hurt quite a bit in some places.  Instead of keeping unrelated
variables around cached in local registers, it reloads everything.


Moore's law is definitely working against us here.  Register counts, 
pipeline depths, core counts, and clock multipliers are all increasing 
in the long run.  At some point in the future, barrier() will be 
universally regarded as a hammer too big for most purposes.  Whether or 
not removing it now constitutes premature optimization is arguable, but 
I think we should allow such optimization to happen (or not happen) in 
architecture-dependent code, and provide a consistent API that doesn't 
require the use of such things in arch-independent code where it might 
turn into a totally superfluous performance killer depending on what 
hardware it gets compiled for.


-- Chris
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Segher Boessenkool

And no, RMW on MMIO isn't problematic at all, either.

An RMW op is a read op, a modify op, and a write op, all rolled
into one opcode.  But three actual operations.


Maybe for some CPUs, but not all.  ARM for instance can't use the
load exclusive and store exclusive instructions to MMIO space.


Sure, your CPU doesn't have RMW instructions -- how to emulate
those if you don't have them is a totally different thing.


Let me say it more clearly: On ARM, it is impossible to perform atomic
operations on MMIO space.


It's all completely beside the point, see the other subthread, but...

Yeah, you can't do LL/SC to MMIO space; ARM isn't alone in that.
You could still implement atomic operations on MMIO space by taking
a lock elsewhere, in normal cacheable memory space.  Why you would
do this is a separate question, you probably don't want it :-)


Segher

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Segher Boessenkool

Let me say it more clearly: On ARM, it is impossible to perform atomic
operations on MMIO space.


Actually, no one is suggesting that we try to do that at all.

The discussion about RMW ops on MMIO space started with a comment
attributed to the gcc developers that one reason why gcc on x86
doesn't use instructions that do RMW ops on volatile variables is that
volatile is used to mark MMIO addresses, and there was some
uncertainty about whether (non-atomic) RMW ops on x86 could be used on
MMIO.  This is in regard to the question about why gcc on x86 always
moves a volatile variable into a register before doing anything to it.


This question is GCC PR33102, which was incorrectly closed as a 
duplicate

of PR3506 -- and *that* PR was closed because its reporter seemed to
claim the GCC generated code for an increment on a volatile (namely, 
three

machine instructions: load, modify, store) was incorrect, and it has to
be one machine instruction.


So the whole discussion is irrelevant to ARM, PowerPC and any other
architecture except x86[-64].


And even there, it's not something the kernel can take advantage of
before GCC 4.4 is in widespread use, if then.  Let's move on.


Segher

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Segher Boessenkool
At some point in the future, barrier() will be universally regarded as 
a hammer too big for most purposes.  Whether or not removing it now


You can't just remove it, it is needed in some places; you want to
replace it in most places with a more fine-grained compiler barrier,
I presume?

constitutes premature optimization is arguable, but I think we should 
allow such optimization to happen (or not happen) in 
architecture-dependent code, and provide a consistent API that doesn't 
require the use of such things in arch-independent code where it might 
turn into a totally superfluous performance killer depending on what 
hardware it gets compiled for.


Explicit barrier()s won't be too hard to replace -- but what to do
about the implicit barrier()s in rmb() etc. etc. -- *those* will be
hard to get rid of, if only because it is hard enough to teach driver
authors about how to use those primitives *already*.  It is far from
clear what a good interface like that would look like, anyway.

Probably we should first start experimenting with a forget()-style
micro-barrier (but please, find a better name), and see if a nice
usage pattern shows up that can be turned into an API.


Segher

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Satyam Sharma


On Tue, 21 Aug 2007, Chris Snook wrote:

 David Miller wrote:
  From: Linus Torvalds [EMAIL PROTECTED]
  Date: Mon, 20 Aug 2007 22:46:47 -0700 (PDT)
  
   Ie a barrier() is likely _cheaper_ than the code generation downside
   from using volatile.
  
  Assuming GCC were ever better about the code generation badness
  with volatile that has been discussed here, I much prefer
  we tell GCC this memory piece changed rather than every
  piece of memory has changed which is what the barrier() does.
  
  I happened to have been scanning a lot of assembler lately to
  track down a gcc-4.2 miscompilation on sparc64, and the barriers
  do hurt quite a bit in some places.  Instead of keeping unrelated
  variables around cached in local registers, it reloads everything.
 
 Moore's law is definitely working against us here.  Register counts, pipeline
 depths, core counts, and clock multipliers are all increasing in the long run.
 At some point in the future, barrier() will be universally regarded as a
 hammer too big for most purposes.

I do agree, and the important point to note is that the benefits of a
/lighter/ compiler barrier, such as what David referred to above, _can_
be had without having to do anything with the volatile keyword at all.
And such a primitive has already been mentioned/proposed on this thread.


But this is all tangential to the core question at hand -- whether to have
implicit (compiler, possibly light-weight of the kind referred above)
barrier semantics in atomic ops that do not have them, or not.

I was lately looking in the kernel for _actual_ code that uses atomic_t
and benefits from the lack of any implicit barrier, with the compiler
being free to cache the atomic_t in a register. Now that often does _not_
happen, because all other ops (implemented in asm with LOCK prefix on x86)
_must_ therefore constrain the atomic_t to memory anyway. So typically all
atomic ops code sequences end up operating on memory.

Then I did locate sched.c:select_nohz_load_balancer() -- it repeatedly
references the same atomic_t object, and the code that I saw generated
(with CC_OPTIMIZE_FOR_SIZE=y) did cache it in a register for a sequence of
instructions. It uses atomic_cmpxchg, thereby not requiring explicit
memory barriers anywhere in the code, and is an example of an atomic_t
user that is safe, and yet benefits from its memory loads/stores being
elided/coalesced by the compiler.


# at this point, %eax holds num_online_cpus() and
# %ebx holds cpus_weight(nohz.cpu_mask)
# the variable cpu is in %esi

0xc1018e1d:  cmp    %eax,%ebx              # if No.A.
0xc1018e1f:  mov    0xc134d900,%eax        # first atomic_read()
0xc1018e24:  jne    0xc1018e36
0xc1018e26:  cmp    %esi,%eax              # if No.B.
0xc1018e28:  jne    0xc1018e80             # returns with 0
0xc1018e2a:  movl   $0xffffffff,0xc134d900 # atomic_set(-1), and ...
0xc1018e34:  jmp    0xc1018e80             # ... returns with 0
0xc1018e36:  cmp    $0xffffffff,%eax       # if No.C. (NOTE!)
0xc1018e39:  jne    0xc1018e46
0xc1018e3b:  lock cmpxchg %esi,0xc134d900  # atomic_cmpxchg()
0xc1018e43:  inc    %eax
0xc1018e44:  jmp    0xc1018e48
0xc1018e46:  cmp    %esi,%eax              # if No.D. (NOTE!)
0xc1018e48:  jne    0xc1018e80             # if !=, default return 0 (if No.E.)
0xc1018e4a:  jmp    0xc1018e84             # otherwise (==) returns with 1

The above is:

if (cpus_weight(nohz.cpu_mask) == num_online_cpus()) {          /* if No.A. */
        if (atomic_read(&nohz.load_balancer) == cpu)            /* if No.B. */
                atomic_set(&nohz.load_balancer, -1);            /* XXX */
        return 0;
}
if (atomic_read(&nohz.load_balancer) == -1) {                   /* if No.C. */
        /* make me the ilb owner */
        if (atomic_cmpxchg(&nohz.load_balancer, -1, cpu) == -1) /* if No.E. */
                return 1;
} else if (atomic_read(&nohz.load_balancer) == cpu)             /* if No.D. */
        return 1;
...
...
return 0;       /* default return from function */

As you can see, the atomic_read()'s of ifs Nos. B, C, and D were _all_
coalesced into a single memory reference, "mov 0xc134d900,%eax", at the
top of the function, and then ifs Nos. C and D simply used the value
from %eax itself. But that's perfectly safe, such is the logic of this
function. It uses cmpxchg _whenever_ updating the value in the memory
atomic_t and then returns appropriately. The _only_ point that a casual
reader may find racy is that marked /* XXX */ above -- atomic_read()
followed by atomic_set() with no barrier in between. But even that is ok,
because if one thread ever finds that condition to succeed, it is 100%
guaranteed no other thread on any other CPU will find _any_ condition
to be true, thereby avoiding any race in the modification of that value.


BTW it does sound reasonable that a lot of atomic_t users 

Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Paul E. McKenney
On Tue, Aug 21, 2007 at 04:48:51PM +0200, Segher Boessenkool wrote:
 Let me say it more clearly: On ARM, it is impossible to perform atomic
 operations on MMIO space.
 
 Actually, no one is suggesting that we try to do that at all.
 
 The discussion about RMW ops on MMIO space started with a comment
 attributed to the gcc developers that one reason why gcc on x86
 doesn't use instructions that do RMW ops on volatile variables is that
 volatile is used to mark MMIO addresses, and there was some
 uncertainty about whether (non-atomic) RMW ops on x86 could be used on
 MMIO.  This is in regard to the question about why gcc on x86 always
 moves a volatile variable into a register before doing anything to it.
 
 This question is GCC PR33102, which was incorrectly closed as a 
 duplicate
 of PR3506 -- and *that* PR was closed because its reporter seemed to
 claim the GCC generated code for an increment on a volatile (namely, 
 three
 machine instructions: load, modify, store) was incorrect, and it has to
 be one machine instruction.
 
 So the whole discussion is irrelevant to ARM, PowerPC and any other
 architecture except x86[-64].
 
 And even there, it's not something the kernel can take advantage of
 before GCC 4.4 is in widespread use, if then.  Let's move on.

I agree that instant gratification is hard to come by when synching
up compiler and kernel versions.  Nonetheless, it should be possible
to create APIs that are conditioned on the compiler version.

Thanx, Paul


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Linus Torvalds


On Tue, 21 Aug 2007, Chris Snook wrote:
 
 Moore's law is definitely working against us here.  Register counts, pipeline
 depths, core counts, and clock multipliers are all increasing in the long run.
 At some point in the future, barrier() will be universally regarded as a
 hammer too big for most purposes.

Note that barrier() is purely a compiler barrier. It has zero impact on 
the CPU pipeline itself, and also has zero impact on anything that gcc 
knows isn't visible in memory (ie local variables that don't have their 
address taken), so barrier() really is pretty cheap.

Now, it's possible that gcc messes up in some circumstances, and that the 
memory clobber will cause gcc to also do things like flush local registers 
unnecessarily to their stack slots, but quite frankly, if that happens, 
it's a gcc problem, and I also have to say that I've not seen that myself.

So in a very real sense, barrier() will just make sure that there is a 
stronger sequence point for the compiler where things are stable. In most 
cases it has absolutely zero performance impact - apart from the 
-intended- impact of making sure that the compiler doesn't re-order or 
cache stuff around it.

And sure, we could make it more finegrained, and also introduce a 
per-variable barrier, but the fact is, people _already_ have problems with 
thinking about these kinds of things, and adding new abstraction issues 
with subtle semantics is the last thing we want.

So I really think you'd want to show a real example of real code that 
actually gets noticeably slower or bigger.

In removing volatile, we have shown that. It may not have made a big 
difference on powerpc, but it makes a real difference on x86 - and more 
importantly, it removes something that people clearly don't know how it 
works, and incorrectly expect to just fix bugs.

[ There are *other* barriers - the ones that actually add memory barriers 
  to the CPU - that really can be quite expensive. The good news is that 
  the expense is going down rather than up: both Intel and AMD are not 
  only removing the need for some of them (ie smp_rmb() will become a 
  compiler-only barrier), but we're _also_ seeing the whole pipeline 
  flush approach go away, and be replaced by the CPU itself actually 
  being better - so even the actual CPU pipeline barriers are getting
  cheaper, not more expensive. ]

For example, did anybody even _test_ how expensive barrier() is? Just 
as a lark, I did

#undef barrier
#define barrier() do { } while (0)

in kernel/sched.c (which only has three of them in it, but hey, that's 
more than most files), and there were _zero_ code generation downsides. 
One instruction was moved (and a few line numbers changed), so it wasn't 
like the assembly language was identical, but the point is, barrier() 
simply doesn't have the same kinds of downsides that volatile has.

(That may not be true on other architectures or in other source files, of 
course. This *does* depend on code generation details. But anybody who 
thinks that barrier() is fundamentally expensive is simply incorrect. It 
is *fundamentally* a no-op).

Linus


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Valdis . Kletnieks
On Tue, 21 Aug 2007 09:16:43 PDT, Paul E. McKenney said:

 I agree that instant gratification is hard to come by when synching
 up compiler and kernel versions.  Nonetheless, it should be possible
 to create APIs that are conditioned on the compiler version.

We've tried that, sort of.  See the mess surrounding the whole
extern/static/inline/__whatever boondoggle, which seems to have
changed semantics in every single gcc release since 2.95 or so.

And recently mention was made that gcc4.4 will have *new* semantics
in this area. Yee. Hah.







pgpGx7YTiWc5V.pgp
Description: PGP signature


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Paul E. McKenney
On Tue, Aug 21, 2007 at 06:51:16PM -0400, [EMAIL PROTECTED] wrote:
 On Tue, 21 Aug 2007 09:16:43 PDT, Paul E. McKenney said:
 
  I agree that instant gratification is hard to come by when synching
  up compiler and kernel versions.  Nonetheless, it should be possible
  to create APIs that are conditioned on the compiler version.
 
 We've tried that, sort of.  See the mess surrounding the whole
 extern/static/inline/__whatever boondoggle, which seems to have
 changed semantics in every single gcc release since 2.95 or so.
 
 And recently mention was made that gcc4.4 will have *new* semantics
 in this area. Yee. Hah.

;-)

Thanx, Paul


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-20 Thread Linus Torvalds


On Mon, 20 Aug 2007, Chris Snook wrote:
>
> What about barrier removal?  With consistent semantics we could optimize a
> fair amount of code.  Whether or not that constitutes "premature" optimization
> is open to debate, but there's no question we could reduce our register wiping
> in some places.

Why do people think that barriers are expensive? They really aren't. 
Especially the regular compiler barrier is basically zero cost. Any 
reasonable compiler will just flush the stuff it holds in registers that 
isn't already automatic local variables, and for regular kernel code, that 
tends to basically be nothing at all.

Ie a "barrier()" is likely _cheaper_ than the code generation downside 
from using "volatile".

Linus


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-20 Thread Paul E. McKenney
On Tue, Aug 21, 2007 at 01:02:01AM +0200, Segher Boessenkool wrote:
> >>And no, RMW on MMIO isn't "problematic" at all, either.
> >>
> >>An RMW op is a read op, a modify op, and a write op, all rolled
> >>into one opcode.  But three actual operations.
> >
> >Maybe for some CPUs, but not all.  ARM for instance can't use the
> >load exclusive and store exclusive instructions to MMIO space.
> 
> Sure, your CPU doesn't have RMW instructions -- how to emulate
> those if you don't have them is a totally different thing.

I thought that ARM's load exclusive and store exclusive instructions
were its equivalent of LL and SC, which RISC machines typically use to
build atomic sequences of instructions -- and which normally cannot be
applied to MMIO space.

Thanx, Paul


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-20 Thread Segher Boessenkool

> > And no, RMW on MMIO isn't "problematic" at all, either.
> > 
> > An RMW op is a read op, a modify op, and a write op, all rolled
> > into one opcode.  But three actual operations.
> 
> Maybe for some CPUs, but not all.  ARM for instance can't use the
> load exclusive and store exclusive instructions to MMIO space.

Sure, your CPU doesn't have RMW instructions -- how to emulate
those if you don't have them is a totally different thing.


Segher



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-20 Thread Russell King
On Tue, Aug 21, 2007 at 12:04:17AM +0200, Segher Boessenkool wrote:
> And no, RMW on MMIO isn't "problematic" at all, either.
> 
> An RMW op is a read op, a modify op, and a write op, all rolled
> into one opcode.  But three actual operations.

Maybe for some CPUs, but not all.  ARM for instance can't use the
load exclusive and store exclusive instructions to MMIO space.

This means placing atomic_t or bitops into MMIO space is a definite
no-go on ARM.  It breaks.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-20 Thread Segher Boessenkool
> Such code generally doesn't care precisely when it gets the update,
> just that the update is atomic, and it doesn't loop forever.

Yes, it _does_ care that it gets the update _at all_, and preferably
as early as possible.

> Regardless, I'm convinced we just need to do it all in assembly.

So do you want "volatile asm" or "plain asm", for atomic_read()?
The asm version has two ways to go about it too...


Segher



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-20 Thread Segher Boessenkool
> > Right. ROTFL... volatile actually breaks atomic_t instead of making
> > it safe. x++ becomes a register load, increment and a register store.
> > Without volatile we can increment the memory directly. It seems that
> > volatile requires that the variable is loaded into a register first
> > and then operated upon. Understandable when you think about volatile
> > being used to access memory mapped I/O registers where a RMW
> > operation could be problematic.
> 
> So, if we want consistent behavior, we're pretty much screwed unless
> we use inline assembler everywhere?


Nah, this whole argument is flawed -- "without volatile" we still
*cannot* "increment the memory directly".  On x86, you need a lock
prefix; on other archs, some other mechanism to make the memory
increment an *atomic* memory increment.

And no, RMW on MMIO isn't "problematic" at all, either.

An RMW op is a read op, a modify op, and a write op, all rolled
into one opcode.  But three actual operations.


The advantages of asm code for atomic_{read,set} are:
1) all the other atomic ops are implemented that way already;
2) you have full control over the asm insns selected, in particular,
   you can guarantee you *do* get an atomic op;
3) you don't need to use "volatile <data>" which generates
   not-all-that-good code on archs like x86, and we want to get rid
   of it anyway since it is problematic in many ways;
4) you don't need to use *(volatile <type> *)&<data>, which a) doesn't
   exist in C; b) isn't documented or supported in GCC; c) has a recent
   history of bugginess; d) _still uses volatile objects_; e) _still_
   is problematic in almost all those same ways as in 3);
5) you can mix atomic and non-atomic accesses to the atomic_t, which
   you cannot with the other alternatives.

The only disadvantage I know of is potentially slightly worse
instruction scheduling.  This is a generic asm() problem: GCC
cannot see what actual insns are inside the asm() block.


Segher



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-20 Thread Chris Snook

Herbert Xu wrote:

On Mon, Aug 20, 2007 at 09:15:11AM -0400, Chris Snook wrote:

Linus Torvalds wrote:
So the only reason to add back "volatile" to the atomic_read() sequence is 
not to fix bugs, but to _hide_ the bugs better. They're still there, they 
are just a lot harder to trigger, and tend to be a lot subtler.
What about barrier removal?  With consistent semantics we could optimize a 
fair amount of code.  Whether or not that constitutes "premature" 
optimization is open to debate, but there's no question we could reduce our 
register wiping in some places.


If you've been reading all of Linus's emails you should be
thinking about adding memory barriers, and not removing
compiler barriers.

He's just told you that code of the kind

while (!atomic_read(cond))
;

do_something()

probably needs a memory barrier (not just compiler) so that
do_something() doesn't see stale cache content that occurred
before cond flipped.


Such code generally doesn't care precisely when it gets the update, just that 
the update is atomic, and it doesn't loop forever.  Regardless, I'm convinced we 
just need to do it all in assembly.


-- Chris


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-20 Thread Chris Snook

Christoph Lameter wrote:

On Fri, 17 Aug 2007, Paul E. McKenney wrote:


On Sat, Aug 18, 2007 at 08:09:13AM +0800, Herbert Xu wrote:

On Fri, Aug 17, 2007 at 04:59:12PM -0700, Paul E. McKenney wrote:

gcc bugzilla bug #33102, for whatever that ends up being worth.  ;-)

I had totally forgotten that I'd already filed that bug more
than six years ago until they just closed yours as a duplicate
of mine :)

Good luck in getting it fixed!

Well, just got done re-opening it for the third time.  And a local
gcc community member advised me not to give up too easily.  But I
must admit that I am impressed with the speed that it was identified
as duplicate.

Should be entertaining!  ;-)


Right. ROTFL... volatile actually breaks atomic_t instead of making it 
safe. x++ becomes a register load, increment and a register store. Without 
volatile we can increment the memory directly. It seems that volatile 
requires that the variable is loaded into a register first and then 
operated upon. Understandable when you think about volatile being used to 
access memory mapped I/O registers where a RMW operation could be 
problematic.


So, if we want consistent behavior, we're pretty much screwed unless we use 
inline assembler everywhere?


-- Chris


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-20 Thread Herbert Xu
On Mon, Aug 20, 2007 at 09:15:11AM -0400, Chris Snook wrote:
> Linus Torvalds wrote:
> >So the only reason to add back "volatile" to the atomic_read() sequence is 
> >not to fix bugs, but to _hide_ the bugs better. They're still there, they 
> >are just a lot harder to trigger, and tend to be a lot subtler.
> 
> What about barrier removal?  With consistent semantics we could optimize a 
> fair amount of code.  Whether or not that constitutes "premature" 
> optimization is open to debate, but there's no question we could reduce our 
> register wiping in some places.

If you've been reading all of Linus's emails you should be
thinking about adding memory barriers, and not removing
compiler barriers.

He's just told you that code of the kind

while (!atomic_read(cond))
;

do_something()

probably needs a memory barrier (not just compiler) so that
do_something() doesn't see stale cache content that occurred
before cond flipped.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: LDD3 pitfalls (was Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures)

2007-08-20 Thread Chris Snook

Stefan Richter wrote:

Nick Piggin wrote:

Stefan Richter wrote:

Nick Piggin wrote:


I don't know why people would assume volatile of atomics. AFAIK, most
of the documentation is pretty clear that all the atomic stuff can be
reordered etc. except for those that modify and return a value.


Which documentation is there?

Documentation/atomic_ops.txt



For driver authors, there is LDD3.  It doesn't specifically cover
effects of optimization on accesses to atomic_t.

For architecture port authors, there is Documentation/atomic_ops.txt.
Driver authors also can learn something from that document, as it
indirectly documents the atomic_t and bitops APIs.


"Semantics and Behavior of Atomic and Bitmask Operations" is
pretty direct :)

Sure, it says that it's for arch maintainers, but there is no
reason why users can't make use of it.



Note, LDD3 page 238 says:  "It is worth noting that most of the other
kernel primitives dealing with synchronization, such as spinlock and
atomic_t operations, also function as memory barriers."

I don't know about Linux 2.6.10 against which LDD3 was written, but
currently only _some_ atomic_t operations function as memory barriers.

Besides, judging from some posts in this thread, saying that atomic_t
operations dealt with synchronization may not be entirely precise.


atomic_t is often used as the basis for implementing more sophisticated 
synchronization mechanisms, such as rwlocks.  Whether or not they are designed 
for that purpose, the atomic_* operations are de facto synchronization primitives.


-- Chris


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-20 Thread Chris Snook

Linus Torvalds wrote:
So the only reason to add back "volatile" to the atomic_read() sequence is 
not to fix bugs, but to _hide_ the bugs better. They're still there, they 
are just a lot harder to trigger, and tend to be a lot subtler.


What about barrier removal?  With consistent semantics we could optimize a fair 
amount of code.  Whether or not that constitutes "premature" optimization is 
open to debate, but there's no question we could reduce our register wiping in 
some places.


-- Chris


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-20 Thread Chris Snook

Linus Torvalds wrote:
So the only reason to add back volatile to the atomic_read() sequence is 
not to fix bugs, but to _hide_ the bugs better. They're still there, they 
are just a lot harder to trigger, and tend to be a lot subtler.


What about barrier removal?  With consistent semantics we could optimize a fair 
amount of code.  Whether or not that constitutes premature optimization is 
open to debate, but there's no question we could reduce our register wiping in 
some places.


-- Chris
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: LDD3 pitfalls (was Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures)

2007-08-20 Thread Chris Snook

Stefan Richter wrote:

Nick Piggin wrote:

Stefan Richter wrote:

Nick Piggin wrote:


I don't know why people would assume volatile of atomics. AFAIK, most
of the documentation is pretty clear that all the atomic stuff can be
reordered etc. except for those that modify and return a value.


Which documentation is there?

Documentation/atomic_ops.txt



For driver authors, there is LDD3.  It doesn't specifically cover
effects of optimization on accesses to atomic_t.

For architecture port authors, there is Documentation/atomic_ops.txt.
Driver authors also can learn something from that document, as it
indirectly documents the atomic_t and bitops APIs.


Semantics and Behavior of Atomic and Bitmask Operations is
pretty direct :)

Sure, it says that it's for arch maintainers, but there is no
reason why users can't make use of it.



Note, LDD3 page 238 says:  It is worth noting that most of the other
kernel primitives dealing with synchronization, such as spinlock and
atomic_t operations, also function as memory barriers.

I don't know about Linux 2.6.10 against which LDD3 was written, but
currently only _some_ atomic_t operations function as memory barriers.

Besides, judging from some posts in this thread, saying that atomic_t
operations dealt with synchronization may not be entirely precise.


atomic_t is often used as the basis for implementing more sophisticated 
synchronization mechanisms, such as rwlocks.  Whether or not they are designed 
for that purpose, the atomic_* operations are de facto synchronization primitives.


-- Chris
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-20 Thread Herbert Xu
On Mon, Aug 20, 2007 at 09:15:11AM -0400, Chris Snook wrote:
 Linus Torvalds wrote:
 So the only reason to add back volatile to the atomic_read() sequence is 
 not to fix bugs, but to _hide_ the bugs better. They're still there, they 
 are just a lot harder to trigger, and tend to be a lot subtler.
 
 What about barrier removal?  With consistent semantics we could optimize a 
 fair amount of code.  Whether or not that constitutes premature 
 optimization is open to debate, but there's no question we could reduce our 
 register wiping in some places.

If you've been reading all of Linus's emails you should be
thinking about adding memory barriers, and not removing
compiler barriers.

He's just told you that code of the kind

while (!atomic_read(cond))
;

do_something()

probably needs a memory barrier (not just compiler) so that
do_something() doesn't see stale cache content that occured
before cond flipped.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-20 Thread Chris Snook

Christoph Lameter wrote:

On Fri, 17 Aug 2007, Paul E. McKenney wrote:


On Sat, Aug 18, 2007 at 08:09:13AM +0800, Herbert Xu wrote:

On Fri, Aug 17, 2007 at 04:59:12PM -0700, Paul E. McKenney wrote:

gcc bugzilla bug #33102, for whatever that ends up being worth.  ;-)

I had totally forgotten that I'd already filed that bug more
than six years ago until they just closed yours as a duplicate
of mine :)

Good luck in getting it fixed!

Well, just got done re-opening it for the third time.  And a local
gcc community member advised me not to give up too easily.  But I
must admit that I am impressed with the speed that it was identified
as duplicate.

Should be entertaining!  ;-)


Right. ROTFL... volatile actually breaks atomic_t instead of making it 
safe. x++ becomes a register load, increment and a register store. Without 
volatile we can increment the memory directly. It seems that volatile 
requires that the variable is loaded into a register first and then 
operated upon. Understandable when you think about volatile being used to 
access memory mapped I/O registers where a RMW operation could be 
problematic.


So, if we want consistent behavior, we're pretty much screwed unless we use 
inline assembler everywhere?


-- Chris
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-20 Thread Chris Snook

Herbert Xu wrote:

On Mon, Aug 20, 2007 at 09:15:11AM -0400, Chris Snook wrote:

Linus Torvalds wrote:
So the only reason to add back volatile to the atomic_read() sequence is 
not to fix bugs, but to _hide_ the bugs better. They're still there, they 
are just a lot harder to trigger, and tend to be a lot subtler.
What about barrier removal?  With consistent semantics we could optimize a 
fair amount of code.  Whether or not that constitutes premature 
optimization is open to debate, but there's no question we could reduce our 
register wiping in some places.


If you've been reading all of Linus's emails you should be
thinking about adding memory barriers, and not removing
compiler barriers.

He's just told you that code of the kind

while (!atomic_read(cond))
;

do_something()

probably needs a memory barrier (not just compiler) so that
do_something() doesn't see stale cache content that occured
before cond flipped.


Such code generally doesn't care precisely when it gets the update, just that 
the update is atomic, and it doesn't loop forever.  Regardless, I'm convinced we 
just need to do it all in assembly.


-- Chris


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-20 Thread Segher Boessenkool
Right. ROTFL... volatile actually breaks atomic_t instead of making 
it safe. x++ becomes a register load, increment and a register store. 
Without volatile we can increment the memory directly. It seems that 
volatile requires that the variable is loaded into a register first 
and then operated upon. Understandable when you think about volatile 
being used to access memory-mapped I/O registers, where an RMW 
operation could be problematic.


So, if we want consistent behavior, we're pretty much screwed unless 
we use inline assembler everywhere?


Nah, this whole argument is flawed -- without volatile we still
*cannot* increment the memory directly.  On x86, you need a lock
prefix; on other archs, some other mechanism to make the memory
increment an *atomic* memory increment.

And no, RMW on MMIO isn't problematic at all, either.

An RMW op is a read op, a modify op, and a write op, all rolled
into one opcode.  But three actual operations.


The advantages of asm code for atomic_{read,set} are:
1) all the other atomic ops are implemented that way already;
2) you have full control over the asm insns selected, in particular,
   you can guarantee you *do* get an atomic op;
3) you don't need to use volatile data which generates
   not-all-that-good code on archs like x86, and we want to get rid
   of it anyway since it is problematic in many ways;
4) you don't need to use *(volatile type*)data, which a) doesn't
   exist in C; b) isn't documented or supported in GCC; c) has a recent
   history of bugginess; d) _still uses volatile objects_; e) _still_
   is problematic in almost all those same ways as in 3);
5) you can mix atomic and non-atomic accesses to the atomic_t, which
   you cannot with the other alternatives.

The only disadvantage I know of is potentially slightly worse
instruction scheduling.  This is a generic asm() problem: GCC
cannot see what actual insns are inside the asm() block.


Segher



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-20 Thread Segher Boessenkool
Such code generally doesn't care precisely when it gets the update, 
just that the update is atomic, and it doesn't loop forever.


Yes, it _does_ care that it gets the update _at all_, and preferably
as early as possible.


Regardless, I'm convinced we just need to do it all in assembly.


So do you want volatile asm or plain asm, for atomic_read()?
The asm version has two ways to go about it too...


Segher



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-20 Thread Russell King
On Tue, Aug 21, 2007 at 12:04:17AM +0200, Segher Boessenkool wrote:
> And no, RMW on MMIO isn't problematic at all, either.
> 
> An RMW op is a read op, a modify op, and a write op, all rolled
> into one opcode.  But three actual operations.

Maybe for some CPUs, but not all.  ARM, for instance, can't use the
load exclusive and store exclusive instructions on MMIO space.

This means placing atomic_t or bitops into MMIO space is a definite
no-go on ARM.  It breaks.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-20 Thread Segher Boessenkool

And no, RMW on MMIO isn't problematic at all, either.

An RMW op is a read op, a modify op, and a write op, all rolled
into one opcode.  But three actual operations.


Maybe for some CPUs, but not all.  ARM for instance can't use the
load exclusive and store exclusive instructions to MMIO space.


Sure, your CPU doesn't have RMW instructions -- how to emulate
those if you don't have them is a totally different thing.


Segher



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-20 Thread Paul E. McKenney
On Tue, Aug 21, 2007 at 01:02:01AM +0200, Segher Boessenkool wrote:
> > > And no, RMW on MMIO isn't problematic at all, either.
> > > 
> > > An RMW op is a read op, a modify op, and a write op, all rolled
> > > into one opcode.  But three actual operations.
> > 
> > Maybe for some CPUs, but not all.  ARM for instance can't use the
> > load exclusive and store exclusive instructions to MMIO space.
> 
> Sure, your CPU doesn't have RMW instructions -- how to emulate
> those if you don't have them is a totally different thing.

I thought that ARM's load exclusive and store exclusive instructions
were its equivalent of LL and SC, which RISC machines typically use to
build atomic sequences of instructions -- and which normally cannot be
applied to MMIO space.

Thanx, Paul


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-20 Thread Linus Torvalds


On Mon, 20 Aug 2007, Chris Snook wrote:

 What about barrier removal?  With consistent semantics we could optimize a
 fair amount of code.  Whether or not that constitutes premature optimization
 is open to debate, but there's no question we could reduce our register wiping
 in some places.

Why do people think that barriers are expensive? They really aren't. 
Especially the regular compiler barrier is basically zero cost. Any 
reasonable compiler will just flush the stuff it holds in registers that 
isn't already automatic local variables, and for regular kernel code, that 
tends to basically be nothing at all.

I.e. a barrier() is likely _cheaper_ than the code generation downside 
from using volatile.

Linus

