Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Kyle Moffett

On Sep 10, 2007, at 06:56:29, Denys Vlasenko wrote:

On Sunday 09 September 2007 19:18, Arjan van de Ven wrote:

On Sun, 9 Sep 2007 19:02:54 +0100
Denys Vlasenko [EMAIL PROTECTED] wrote:

Why is all this fixation on volatile? I don't think people want
volatile keyword per se, they want atomic_read(x) to _always_
compile into a memory-accessing instruction, not register access.


and ... why is that?  is there any valid, non-buggy code sequence  
that makes that a reasonable requirement?


Well, if you insist on having it again:

Waiting for atomic value to be zero:

while (atomic_read(&x))
continue;

gcc may happily convert it into:

reg = atomic_read(&x);
while (reg)
continue;


Bzzt.  Even if you fixed gcc to actually convert it to a busy loop on
a memory variable, you STILL HAVE A BUG, as it may *NOT* be gcc that
does the conversion; it may be that the CPU does the caching of the
memory value.  GCC has no mechanism to do cache-flushes or
memory-barriers except through our custom inline assembly.  Also, you
probably want a cpu_relax() in there somewhere to avoid overheating
the CPU.  Thirdly, on a large system it may take some arbitrarily
large amount of time for cache-propagation to update the value of the
variable in your local CPU cache.  Finally, if atomics are based on
spinlock+interrupt-disable then you will sit in a tight busy-loop of
spin_lock_irqsave()-spin_unlock_irqrestore().  Depending on
your system's internal model this may practically lock up your core,
because the spin_lock() will take the cacheline for exclusive access,
and doing that in a loop can prevent any other CPU from doing any
operation on it!  And since your IRQs are disabled, there is only a
very small window in which an IRQ can come along and free it up long
enough for the update to take place.


The earlier code segment of:

while(atomic_read(x)  0)
atomic_dec(x);
is *completely* buggy because you could very easily have 4 CPUs doing  
this on an atomic variable with a value of 1 and end up with it at  
negative 3 by the time you are done.  Moreover all the alternatives  
are also buggy, with the sole exception of this rather obvious- 
seeming one:

atomic_set(x, 0);


You simply CANNOT use an atomic_t as your sole synchronizing
primitive; it doesn't work!  You virtually ALWAYS want to use an
atomic_t in the following types of situations:


(A) As an object refcount.  The value is never read except as part of  
an atomic_dec_return().  Why aren't you using struct kref?


(B) As an atomic value counter (number of processes, for example).
Just reading the value is racy anyways; if you want to enforce a
limit or something then use atomic_inc_return(), check the result,
and use atomic_dec() if it's too big.  If you just want to return the
statistics then you are going to be instantaneous-point-in-time anyways.


(C) As an optimization value (statistics-like, but exact accuracy  
isn't important).


Atomics are NOT A REPLACEMENT for proper kernel subsystems like
completions, mutexes, semaphores, spinlocks, krefs, etc.  They're not
useful for synchronization, only for keeping track of simple integer
RMW values.  Note that atomic_read() and atomic_set() aren't very
useful RMW primitives (read-nomodify-nowrite and read-set-zero-write).
Code which assumes anything else is probably buggy in other ways too.


So while I see no real reason for the volatile on the atomics, I
also see no real reason why it's terribly harmful.  Regardless of the
volatile on the operation, the CPU is perfectly happy to cache it
anyways, so it doesn't buy you any actual always-access-memory
guarantees.  If you are just interested in it as an optimization you
could probably just read the properly-aligned integer counter
directly, which is an atomic read on most CPUs.


If you really need it to hit main memory *every* *single* *time*  
(Why?  Are you using it instead of the proper kernel subsystem?)   
then you probably need a custom inline assembly helper anyways.


Cheers,
Kyle Moffett

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Denys Vlasenko
On Monday 10 September 2007 13:22, Kyle Moffett wrote:
 On Sep 10, 2007, at 06:56:29, Denys Vlasenko wrote:
  On Sunday 09 September 2007 19:18, Arjan van de Ven wrote:
  On Sun, 9 Sep 2007 19:02:54 +0100
  Denys Vlasenko [EMAIL PROTECTED] wrote:
 
  Why is all this fixation on volatile? I don't think people want
  volatile keyword per se, they want atomic_read(x) to _always_
  compile into a memory-accessing instruction, not register access.
 
  and ... why is that?  is there any valid, non-buggy code sequence  
  that makes that a reasonable requirement?
 
  Well, if you insist on having it again:
 
  Waiting for atomic value to be zero:
 
  while (atomic_read(&x))
  continue;
 
  gcc may happily convert it into:
 
  reg = atomic_read(&x);
  while (reg)
  continue;
 
 Bzzt.  Even if you fixed gcc to actually convert it to a busy loop on  
 a memory variable, you STILL HAVE A BUG as it may *NOT* be gcc that  
 does the conversion, it may be that the CPU does the caching of the  
 memory value.  GCC has no mechanism to do cache-flushes or
 memory-barriers except through our custom inline assembly.

The CPU can cache the value all right, but it cannot use that cached
value *forever*; it has to react to invalidate cycles on the shared bus
and re-fetch new data.

IOW: an atomic_read(x) which compiles down to a memory accessor
will work properly.

 the CPU.  Thirdly, on a large system it may take some arbitrarily  
 large amount of time for cache-propagation to update the value of the  
 variable in your local CPU cache.

Yes, but "arbitrarily large amount of time" is actually measured
in nanoseconds here. Let's say 1000 ns max for hundreds of CPUs?

 Also, you   
 probably want a cpu_relax() in there somewhere to avoid overheating  
 the CPU.

Yes, but:
1. CPU shouldn't overheat (in a sense that it gets damaged),
   it will only use more power than needed.
2. cpu_relax() just throttles down my CPU, so it's a performance
   optimization only. Wait, it isn't, it's a barrier too.
   Wow, cpu_relax() is a barrier? How am I supposed to know
   that without reading lkml flamewars and/or header files?

Let's try reading headers. asm-x86_64/processor.h:

#define cpu_relax()   rep_nop()

So, is it a barrier? No clue yet.

/* REP NOP (PAUSE) is a good thing to insert into busy-wait loops. */
static inline void rep_nop(void)
{
__asm__ __volatile__("rep;nop": : :"memory");
}

The comment explicitly says that it is a good thing (it doesn't say
that it is mandatory) and says NOTHING about barriers!

Barrier-ness is not mentioned and is hidden in the "memory" clobber.

Do you think it's obvious enough for the average driver writer?
I think not, especially since it's unlikely he would even start
suspecting that it is a memory barrier, based on the cpu_relax
name.

 You simply CANNOT use an atomic_t as your sole synchronizing
 primitive, it doesn't work!  You virtually ALWAYS want to use an  
 atomic_t in the following types of situations:
 
 (A) As an object refcount.  The value is never read except as part of  
 an atomic_dec_return().  Why aren't you using struct kref?
 
 (B) As an atomic value counter (number of processes, for example).   
 Just reading the value is racy anyways, if you want to enforce a  
 limit or something then use atomic_inc_return(), check the result,  
 and use atomic_dec() if it's too big.  If you just want to return the  
 statistics then you are going to be instantaneous-point-in-time anyways.
 
 (C) As an optimization value (statistics-like, but exact accuracy  
 isn't important).
 
 Atomics are NOT A REPLACEMENT for the proper kernel subsystem, like  
 completions, mutexes, semaphores, spinlocks, krefs, etc.  It's not  
 useful for synchronization, only for keeping track of simple integer  
 RMW values.  Note that atomic_read() and atomic_set() aren't very  
 useful RMW primitives (read-nomodify-nowrite and read-set-zero-write).
 Code which assumes anything else is probably buggy in other
 ways too.

You are basically trying to educate me on how to use atomics properly.
You don't need to do it, as I am (currently) not a driver author.

I am saying that people who are already using atomic_read()
(and who unfortunately did not read your explanation above)
will still sometimes use atomic_read() as a way to read an atomic value
*from memory*, and will create nasty heisenbugs for you to debug.
--
vda


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Arjan van de Ven
On Mon, 10 Sep 2007 11:56:29 +0100
Denys Vlasenko [EMAIL PROTECTED] wrote:

 
 Well, if you insist on having it again:
 
 Waiting for atomic value to be zero:
 
 while (atomic_read(&x))
 continue;
 

and this I would say is buggy code all the way.

Not from a pure C level semantics, but from a "busy waiting is buggy"
semantics level and an "I'm inventing my own locking" semantics level.



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Denys Vlasenko
On Monday 10 September 2007 14:38, Denys Vlasenko wrote:
 You are basically trying to educate me on how to use atomics properly.
 You don't need to do it, as I am (currently) not a driver author.
 
 I am saying that people who are already using atomic_read()
 (and who unfortunately did not read your explanation above)
 will still sometimes use atomic_read() as a way to read an atomic value
 *from memory*, and will create nasty heisenbugs for you to debug.

static inline int
qla2x00_wait_for_loop_ready(scsi_qla_host_t *ha)
{
int  return_status = QLA_SUCCESS;
unsigned long loop_timeout ;
scsi_qla_host_t *pha = to_qla_parent(ha);

/* wait for 5 min at the max for loop to be ready */
loop_timeout = jiffies + (MAX_LOOP_TIMEOUT * HZ);

while ((!atomic_read(&pha->loop_down_timer) &&
atomic_read(&pha->loop_state) == LOOP_DOWN) ||
atomic_read(&pha->loop_state) != LOOP_READY) {
if (atomic_read(&pha->loop_state) == LOOP_DEAD) {
return_status = QLA_FUNCTION_FAILED;
break;
}
msleep(1000);
if (time_after_eq(jiffies, loop_timeout)) {
return_status = QLA_FUNCTION_FAILED;
break;
}
}
return (return_status);
}

Is the above correct or buggy? Correct, because msleep() is a barrier.
Is it obvious? No.

static void
qla2x00_rst_aen(scsi_qla_host_t *ha)
{
if (ha->flags.online && !ha->flags.reset_active &&
!atomic_read(&ha->loop_down_timer) &&
!(test_bit(ABORT_ISP_ACTIVE, &ha->dpc_flags))) {
do {
clear_bit(RESET_MARKER_NEEDED, &ha->dpc_flags);

/*
 * Issue marker command only when we are going to start
 * the I/O.
 */
ha->marker_needed = 1;
} while (!atomic_read(&ha->loop_down_timer) &&
(test_bit(RESET_MARKER_NEEDED, &ha->dpc_flags)));
}
}

Is the above correct? I honestly don't know. Correct, because set_bit is
a barrier on _all_ memory? Will it break if set_bit is changed
to be a barrier only on its operand? Probably yes.

drivers/kvm/kvm_main.c

while (atomic_read(&completed) != needed) {
cpu_relax();
barrier();
}

Obviously the author did not know that cpu_relax() is already a barrier.
See why I think driver authors will be confused?

arch/x86_64/kernel/crash.c

static void nmi_shootdown_cpus(void)
{
...
msecs = 1000; /* Wait at most a second for the other cpus to stop */
while ((atomic_read(&waiting_for_crash_ipi) > 0) && msecs) {
mdelay(1);
msecs--;
}
...
}

Is mdelay(1) a barrier? Yes, because it is a function on x86_64.
Exactly the same code will be buggy on an arch where
mdelay(1) == udelay(1000), and udelay is implemented
as an inline busy-wait.

arch/sparc64/kernel/smp.c

/* Wait for response */
while (atomic_read(&data.finished) != cpus)
cpu_relax();
...later in the same file...
while (atomic_read(&smp_capture_registry) != ncpus)
rmb();

I'm confused. Do we need cpu_relax() or rmb()? Does cpu_relax() imply rmb()?
(No, it doesn't.) Which of those two while loops needs correcting?
--
vda


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Denys Vlasenko
On Monday 10 September 2007 15:51, Arjan van de Ven wrote:
 On Mon, 10 Sep 2007 11:56:29 +0100
 Denys Vlasenko [EMAIL PROTECTED] wrote:
 
  
  Well, if you insist on having it again:
  
  Waiting for atomic value to be zero:
  
   while (atomic_read(&x))
  continue;
  
 
 and this I would say is buggy code all the way.

 Not from a pure C level semantics, but from a busy waiting is buggy
 semantics level and a I'm inventing my own locking semantics level.

After inspecting arch/*, I cannot agree with you.
Otherwise, almost all major architectures use
"conceptually buggy" busy-waiting:

arch/alpha
arch/i386
arch/ia64
arch/m32r
arch/mips
arch/parisc
arch/powerpc
arch/sh
arch/sparc64
arch/um
arch/x86_64

All of the above contain busy-waiting on atomic_read.

Including these loops without barriers:

arch/mips/kernel/smtc.c
while (atomic_read(&idle_hook_initialized) < 1000)
;
arch/mips/sgi-ip27/ip27-nmi.c
while (atomic_read(&nmied_cpus) != num_online_cpus());

[Well maybe num_online_cpus() is a barrier, I didn't check]

arch/sh/kernel/smp.c
if (wait)
while (atomic_read(&smp_fn_call.finished) != (nr_cpus - 1));

Bugs?
--
vda


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Linus Torvalds

On Mon, 10 Sep 2007, Denys Vlasenko wrote:
 
 static inline int
 qla2x00_wait_for_loop_ready(scsi_qla_host_t *ha)
 {
 int  return_status = QLA_SUCCESS;
 unsigned long loop_timeout ;
 scsi_qla_host_t *pha = to_qla_parent(ha);
 
 /* wait for 5 min at the max for loop to be ready */
 loop_timeout = jiffies + (MAX_LOOP_TIMEOUT * HZ);
 
 while ((!atomic_read(&pha->loop_down_timer) &&
 atomic_read(&pha->loop_state) == LOOP_DOWN) ||
 atomic_read(&pha->loop_state) != LOOP_READY) {
 if (atomic_read(&pha->loop_state) == LOOP_DEAD) {
...
 Is above correct or buggy? Correct, because msleep is a barrier.
 Is it obvious? No.

It's *buggy*. But it has nothing to do with any msleep() in the loop, or 
anything else.

And more importantly, it would be equally buggy even *with* a volatile 
atomic_read().

Why is this so hard for people to understand? You're all acting like 
morons.

The reason it is buggy has absolutely nothing to do with whether the read 
is done or not, it has to do with the fact that the CPU may re-order the 
reads *regardless* of whether the read is done in some specific order by 
the compiler or not! In effect, there is zero ordering between all those 
three reads, and if you don't have memory barriers (or a lock or other 
serialization), that code is buggy.

So stop this idiotic discussion thread already. The above kind of code 
needs memory barriers to be non-buggy. The whole volatile or not 
discussion is totally idiotic, and pointless, and anybody who doesn't 
understand that by now needs to just shut up and think about it more, 
rather than make this discussion drag out even further.

The fact is, volatile *only* makes things worse. It generates worse 
code, and never fixes any real bugs. This is a *fact*.

Linus


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Arjan van de Ven
On Mon, 10 Sep 2007 15:38:23 +0100
Denys Vlasenko [EMAIL PROTECTED] wrote:

 On Monday 10 September 2007 15:51, Arjan van de Ven wrote:
  On Mon, 10 Sep 2007 11:56:29 +0100
  Denys Vlasenko [EMAIL PROTECTED] wrote:
  
   
   Well, if you insist on having it again:
   
   Waiting for atomic value to be zero:
   
   while (atomic_read(&x))
   continue;
   
  
  and this I would say is buggy code all the way.
 
  Not from a pure C level semantics, but from a "busy waiting is
  buggy" semantics level and an "I'm inventing my own locking"
  semantics level.
 
 After inspecting arch/*, I cannot agree with you.

the arch/ people obviously are allowed to do their own locking stuff...
BECAUSE THEY HAVE TO IMPLEMENT THAT!


the arch maintainers know EXACTLY how their hw behaves (well, we hope)
so they tend to be the exception to many rules in the kernel


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Denys Vlasenko
On Monday 10 September 2007 16:09, Linus Torvalds wrote:
 On Mon, 10 Sep 2007, Denys Vlasenko wrote:
  static inline int
  qla2x00_wait_for_loop_ready(scsi_qla_host_t *ha)
  {
  int  return_status = QLA_SUCCESS;
  unsigned long loop_timeout ;
  scsi_qla_host_t *pha = to_qla_parent(ha);
  
  /* wait for 5 min at the max for loop to be ready */
  loop_timeout = jiffies + (MAX_LOOP_TIMEOUT * HZ);
  
  while ((!atomic_read(&pha->loop_down_timer) &&
  atomic_read(&pha->loop_state) == LOOP_DOWN) ||
  atomic_read(&pha->loop_state) != LOOP_READY) {
  if (atomic_read(&pha->loop_state) == LOOP_DEAD) {
 ...
  Is above correct or buggy? Correct, because msleep is a barrier.
  Is it obvious? No.
 
 It's *buggy*. But it has nothing to do with any msleep() in the loop, or 
 anything else.
 
 And more importantly, it would be equally buggy even *with* a volatile 
 atomic_read().

I am not saying that this code is okay, this isn't the point.
(The code is in fact awful for several more reasons).

My point is that people are confused as to what atomic_read()
exactly means, and this is bad. Same for cpu_relax().
The first one says "read", and the second one doesn't say "barrier".

This is real code from current kernel which demonstrates this:

"I don't know that cpu_relax() is a barrier already":

drivers/kvm/kvm_main.c
        while (atomic_read(&completed) != needed) {
                cpu_relax();
                barrier();
        }

"I think that atomic_read() is a read from memory and therefore
I don't need a barrier":

arch/x86_64/kernel/crash.c
        msecs = 1000; /* Wait at most a second for the other cpus to stop */
        while ((atomic_read(&waiting_for_crash_ipi) > 0) && msecs) {
                mdelay(1);
                msecs--;
        }

Since neither camp seems to give up, I am proposing renaming
them to something less confusing, to make everybody happy.

cpu_relax_barrier()
atomic_value(x)
atomic_fetch(x)

I'm not a native English speaker; do these sound better?
--
vda


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Christoph Lameter
On Fri, 17 Aug 2007, Segher Boessenkool wrote:

 volatile has nothing to do with reordering.  atomic_dec() writes
 to memory, so it _does_ have volatile semantics, implicitly, as
 long as the compiler cannot optimise the atomic variable away
 completely -- any store counts as a side effect.

Stores can be reordered. Only x86 has (mostly) implicit write ordering. So 
no, atomic_dec has no volatile semantics and may be reordered on a variety 
of processors. Writes to memory may not follow code order on several 
processors.




Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Christoph Lameter
On Mon, 10 Sep 2007, Linus Torvalds wrote:

 The fact is, volatile *only* makes things worse. It generates worse 
 code, and never fixes any real bugs. This is a *fact*.

Yes, let's just drop the volatiles now! We need a patch that gets rid of 
them. Volunteers?




Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Kyle Moffett

On Sep 10, 2007, at 12:46:33, Denys Vlasenko wrote:
My point is that people are confused as to what atomic_read()
exactly means, and this is bad. Same for cpu_relax().  The first one
says "read", and the second one doesn't say "barrier".


Q&A:

Q:  When is it OK to use atomic_read()?
A:  You are asking the question, so never.

Q:  But I need to check the value of the atomic at this point in time...
A:  Your code is buggy if it needs to do that on an atomic_t for  
anything other than debugging or optimization.  Use either  
atomic_*_return() or a lock and some normal integers.


Q:  So why can't the atomic_read DTRT magically?
A:  Because the right thing depends on the situation and is usually  
best done with something other than atomic_t.


If somebody can post some non-buggy code which is correctly using  
atomic_read() *and* depends on the compiler generating extra  
nonsensical loads due to volatile then the issue *might* be  
reconsidered.  This also includes samples of code which uses  
atomic_read() and needs memory barriers (so that we can fix the buggy  
code, not so we can change atomic_read()).  So far the only code  
samples anybody has posted are buggy regardless of whether or not the  
value and/or accessors are flagged volatile or not.  And hey, maybe  
the volatile ops *should* be implemented in inline ASM for
future-proof-ness, but that's a separate issue.


Cheers,
Kyle Moffett



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Paul E. McKenney
On Mon, Sep 10, 2007 at 11:59:29AM -0700, Christoph Lameter wrote:
 On Fri, 17 Aug 2007, Segher Boessenkool wrote:
 
  volatile has nothing to do with reordering.  atomic_dec() writes
  to memory, so it _does_ have volatile semantics, implicitly, as
  long as the compiler cannot optimise the atomic variable away
  completely -- any store counts as a side effect.
 
 Stores can be reordered. Only x86 has (mostly) implicit write ordering. So 
 no, atomic_dec has no volatile semantics and may be reordered on a variety 
 of processors. Writes to memory may not follow code order on several 
 processors.

The one exception to this being the case where process-level code is
communicating to an interrupt handler running on that same CPU -- on
all CPUs that I am aware of, a given CPU always sees its own writes
in order.

Thanx, Paul


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Christoph Lameter
On Mon, 10 Sep 2007, Paul E. McKenney wrote:

 The one exception to this being the case where process-level code is
 communicating to an interrupt handler running on that same CPU -- on
 all CPUs that I am aware of, a given CPU always sees its own writes
 in order.

Yes but that is due to the code path effectively continuing in the 
interrupt handler. The cpu makes sure that op codes being executed always 
see memory in a consistent way. The basic ordering problem with out of 
order writes is therefore coming from other processors concurrently 
executing code and holding variables in registers that are modified 
elsewhere. The only solution that I know of is one or the other form of 
barrier.


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-10 Thread Paul E. McKenney
On Mon, Sep 10, 2007 at 02:36:26PM -0700, Christoph Lameter wrote:
 On Mon, 10 Sep 2007, Paul E. McKenney wrote:
 
  The one exception to this being the case where process-level code is
  communicating to an interrupt handler running on that same CPU -- on
  all CPUs that I am aware of, a given CPU always sees its own writes
  in order.
 
 Yes but that is due to the code path effectively continuing in the 
 interrupt handler. The cpu makes sure that op codes being executed always 
 see memory in a consistent way. The basic ordering problem with out of 
 order writes is therefore coming from other processors concurrently 
 executing code and holding variables in registers that are modified 
 elsewhere. The only solution that I know of is one or the other form of 
 barrier.

So we are agreed then -- volatile accesses may be of some assistance when
interacting with interrupt handlers running on the same CPU (presumably
when using per-CPU variables), but are generally useless when sharing
variables among CPUs.  Correct?

Thanx, Paul


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-09 Thread Denys Vlasenko
On Friday 17 August 2007 17:48, Linus Torvalds wrote:
 
 On Fri, 17 Aug 2007, Nick Piggin wrote:
  
  That's not obviously just taste to me. Not when the primitive has many
  (perhaps, the majority) of uses that do not require said barriers. And
  this is not solely about the code generation (which, as Paul says, is
  relatively minor even on x86). I prefer people to think explicitly
  about barriers in their lockless code.
 
 Indeed.
 
 I think the important issues are:
 
  - volatile itself is simply a badly/weakly defined issue. The semantics 
of it as far as the compiler is concerned are really not very good, and 
in practice tends to boil down to "I will generate so bad code that 
nobody can accuse me of optimizing anything away".
 
  - volatile - regardless of how well or badly defined it is - is purely 
a compiler thing. It has absolutely no meaning for the CPU itself, so 
it at no point implies any CPU barriers. As a result, even if the 
compiler generates crap code and doesn't re-order anything, there's 
nothing that says what the CPU will do.
 
  - in other words, the *only* possible meaning for volatile is a purely 
single-CPU meaning. And if you only have a single CPU involved in the 
process, the volatile is by definition pointless (because even 
without a volatile, the compiler is required to make the C code appear 
consistent as far as a single CPU is concerned).
 
 So, let's take the example *buggy* code where we use volatile to wait 
 for other CPU's:
 
   atomic_set(&var, 0);
   while (!atomic_read(&var))
   /* nothing */;
 
 
 which generates an endless loop if we don't have atomic_read() imply 
 volatile.
 
 The point here is that it's buggy whether the volatile is there or not! 
 Exactly because the user expects multi-processing behaviour, but 
 volatile doesn't actually give any real guarantees about it. Another CPU 
 may have done:
 
   external_ptr = kmalloc(..);
   /* Setup is now complete, inform the waiter */
   atomic_inc(&var);
 
 but the fact is, since the other CPU isn't serialized in any way, the 
 while-loop (even in the presense of volatile) doesn't actually work 
 right! Whatever the atomic_read() was waiting for may not have 
 completed, because we have no barriers!

Why is all this fixation on volatile? I don't think
people want the volatile keyword per se, they want atomic_read(x) to
_always_ compile into a memory-accessing instruction, not register access.
--
vda


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-09-09 Thread Arjan van de Ven
On Sun, 9 Sep 2007 19:02:54 +0100
Denys Vlasenko [EMAIL PROTECTED] wrote:

 Why is all this fixation on volatile? I don't think
 people want volatile keyword per se, they want atomic_read(x) to
 _always_ compile into a memory-accessing instruction, not register
 access.

and ... why is that?
is there any valid, non-buggy code sequence that makes that a
reasonable requirement?


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-24 Thread Denys Vlasenko
On Saturday 18 August 2007 05:13, Linus Torvalds wrote:
 On Sat, 18 Aug 2007, Satyam Sharma wrote:
  No code does (or would do, or should do):
 
  x.counter++;
 
  on an atomic_t x; anyway.

 That's just an example of a general problem.

 No, you don't use x.counter++. But you *do* use

   if (atomic_read(&x) >= 1)

 and loading into a register is stupid and pointless, when you could just
 do it as a regular memory-operand to the cmp instruction.

It doesn't mean that the (volatile int*) cast is bad; it means that current gcc
is bad (or not good enough). IOW: instead of avoiding the volatile cast,
it's better to fix the compiler.

 And as far as the compiler is concerned, the problem is the 100% same:
 combining operations with the volatile memop.

 The fact is, a compiler that thinks that

   movl mem,reg
   cmpl $val,reg

 is any better than

   cmpl $val,mem

 is just not a very good compiler.

Linus, in all honesty gcc has many more cases of suboptimal code;
the case of volatile is just one of many.

Off the top of my head:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=28417

unsigned v;
void f(unsigned A) { v = ((unsigned long long)A) * 365384439 >> (27+32); }

gcc-4.1.1 -S -Os -fomit-frame-pointer t.c

f:
	movl	$365384439, %eax
	mull	4(%esp)
	movl	%edx, %eax	<= ?
	shrl	$27, %eax
	movl	%eax, v
	ret

Why is it moving %edx to %eax?

gcc-4.2.1 -S -Os -fomit-frame-pointer t.c

f:
	movl	$365384439, %eax
	mull	4(%esp)
	movl	%edx, %eax	<= ?
	xorl	%edx, %edx	<= ??!
	shrl	$27, %eax
	movl	%eax, v
	ret

Progress... Now we also zero out %edx afterwards for no apparent reason.
--
vda


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-24 Thread Denys Vlasenko
On Thursday 16 August 2007 00:22, Paul Mackerras wrote:
 Satyam Sharma writes:
 In the kernel we use atomic variables in precisely those situations
 where a variable is potentially accessed concurrently by multiple
 CPUs, and where each CPU needs to see updates done by other CPUs in a
 timely fashion.  That is what they are for.  Therefore the compiler
 must not cache values of atomic variables in registers; each
 atomic_read must result in a load and each atomic_set must result in a
 store.  Anything else will just lead to subtle bugs.

Amen.
--
vda


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-24 Thread Linus Torvalds


On Fri, 24 Aug 2007, Denys Vlasenko wrote:

  No, you don't use "x.counter++". But you *do* use

  if (atomic_read(&x) <= 1)
 
  and loading into a register is stupid and pointless, when you could just
  do it as a regular memory-operand to the cmp instruction.
 
 It doesn't mean that (volatile int*) cast is bad, it means that current gcc
 is bad (or not good enough). IOW: instead of avoiding volatile cast,
 it's better to fix the compiler.

I would agree that fixing the compiler in this case would be a good thing, 
even quite regardless of any atomic_read() discussion.

I just have a strong suspicion that volatile performance is so low down 
the list of any C compiler persons interest, that it's never going to 
happen. And quite frankly, I cannot blame the gcc guys for it.

That's especially as volatile really isn't a very good feature of the C 
language, and is likely to get *less* interesting rather than more (as 
user space starts to be more and more threaded, volatile gets less and 
less useful).

[ Ie, currently, I think you can validly use volatile in a sig_atomic_t 
  kind of way, where there is a single thread, but with asynchronous 
  events. In that kind of situation, I think it's probably useful. But 
  once you get multiple threads, it gets pointless.

  Sure: you could use volatile together with something like Dekker's or 
  Peterson's algorithm that doesn't depend on cache coherency (that's 
  basically what the C volatile keyword approximates: not atomic 
  accesses, but *uncached* accesses!) But let's face it, that's way past 
  insane. ]

So I wouldn't expect volatile to ever really generate better code. It 
might happen as a side effect of other improvements (eg, I might hope that 
the SSA work would eventually lead to gcc having a much better defined 
model of valid optimizations, and maybe better code generation for 
volatile accesses fall out cleanly out of that), but in the end, it's such 
an ugly special case in C, and so seldom used, that I wouldn't depend on 
it.

 Linus, in all honesty gcc has many more cases of suboptimal code,
 case of volatile is just one of many.

Well, the thing is, quite often, many of those suboptimal code 
generations fall into two distinct classes:

 - complex C code. I can't really blame the compiler too much for this. 
   Some things are *hard* to optimize, and for various scalability 
   reasons, you often end up having limits in the compiler where it 
   doesn't even _try_ doing certain optimizations if you have excessive 
   complexity.

 - bad register allocation. Register allocation really is hard, and 
   sometimes gcc just does the obviously wrong thing, and you end up 
   having totally unnecessary spills.

 Off the top of my head:

Yes, unsigned long long with x86 has always generated atrocious code. In 
fact, I would say that historically it was really *really* bad. These 
days, gcc actually does a pretty good job, but I'm not surprised that it's 
still quite possible to find cases where it did some optimization (in this 
case, apparently noticing that shift by >= 32 bits causes the low 
register to be pointless) and then missed *another* optimization (better 
register use) because that optimization had been done *before* the first 
optimization was done.

That's a *classic* example of compiler code generation issues, and quite 
frankly, I think that's very different from the issue of volatile.

Quite frankly, I'd like there to be more competition in the open source 
compiler game, and that might cause some upheavals, but on the whole, gcc 
actually does a pretty damn good job. 

Linus


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-24 Thread Denys Vlasenko
On Friday 24 August 2007 18:15, Christoph Lameter wrote:
 On Fri, 24 Aug 2007, Denys Vlasenko wrote:
  On Thursday 16 August 2007 00:22, Paul Mackerras wrote:
   Satyam Sharma writes:
   In the kernel we use atomic variables in precisely those situations
   where a variable is potentially accessed concurrently by multiple
   CPUs, and where each CPU needs to see updates done by other CPUs in a
   timely fashion.  That is what they are for.  Therefore the compiler
   must not cache values of atomic variables in registers; each
   atomic_read must result in a load and each atomic_set must result in a
   store.  Anything else will just lead to subtle bugs.
 
  Amen.

 "A timely fashion"? One cannot rely on something like that when coding.
 The visibility of updates is insured by barriers and not by some fuzzy
 notion of timeliness.

But here you do have some notion of time:

while (atomic_read(&x))
continue;

continue when other CPU(s) decrement it down to zero.
If read includes an insn which accesses RAM, you will
see new value sometime after other CPU decrements it.
"Sometime after" is on the order of nanoseconds here.
It is a valid concept of time, right?
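A userspace sketch of the distinction under discussion (illustrative only: read_mem() and spin_demo() are made-up names, and the volatile cast stands in for whatever mechanism an always-load atomic_read would use):

```c
#include <assert.h>

/* Reading through a volatile-qualified pointer forces the compiler
 * to emit a fresh load on every call; a plain int read inside a loop
 * could legally be hoisted into a register before the loop starts. */
static inline int read_mem(int *p)
{
    return *(volatile int *)p;
}

int spin_demo(int *flag)
{
    int spins = 0;
    while (read_mem(flag) && spins < 100) {
        spins++;
        if (spins == 3)
            *flag = 0;   /* stand-in for another CPU's decrement */
    }
    return spins;
}
```

This is a single-threaded simulation; in real kernel code the store would come from another CPU, and a cpu_relax() would belong in the loop body.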

The whole confusion is about whether atomic_read implies
read from RAM or not. I am in a camp which thinks it does.
You are in an opposite one.

We just need a less ambiguous name.

What about this:

/**
 * atomic_read - read atomic variable
 * @v: pointer of type atomic_t
 *
 * Atomically reads the value of @v.
 * No compiler barrier implied.
 */
#define atomic_read(v)  ((v)->counter)

+/**
+ * atomic_read_uncached - read atomic variable from memory
+ * @v: pointer of type atomic_t
+ *
+ * Atomically reads the value of @v. This is guaranteed to emit an insn
+ * which accesses memory, atomically. No ordering guarantees!
+ */
+#define atomic_read_uncached(v)  asm_or_volatile_ptr_magic(v)

I was thinking of s/atomic_read/atomic_get/ too, but it implies taking
atomic a-la get_cpu()...
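For comparison, a volatile-cast variant along the lines of what the i386 tree used at the time can be sketched like this (the type layout and macro names here are illustrative, not the proposed API):

```c
typedef struct { int counter; } atomic_t;

/* Plain read: the compiler is free to cache the result in a register. */
#define atomic_read_plain(v)     ((v)->counter)

/* Forced memory read: the volatile cast makes every expansion emit a
 * real load instruction -- but, as the kernel-doc comment above says,
 * it implies no ordering or barrier semantics whatsoever. */
#define atomic_read_volatile(v)  (*(volatile int *)&(v)->counter)
```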
--
vda


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-22 Thread Adrian Bunk
On Tue, Aug 21, 2007 at 06:51:16PM -0400, [EMAIL PROTECTED] wrote:
 On Tue, 21 Aug 2007 09:16:43 PDT, Paul E. McKenney said:
 
  I agree that instant gratification is hard to come by when synching
  up compiler and kernel versions.  Nonetheless, it should be possible
  to create APIs that are are conditioned on the compiler version.
 
 We've tried that, sort of.  See the mess surrounding the whole
 extern/static/inline/__whatever boondoggle, which seems to have
 changed semantics in every single gcc release since 2.95 or so.
...

There is exactly one semantics change in gcc in this area, and that is 
the change of the extern inline semantics in gcc 4.3 to the
C99 semantics.

cu
Adrian

-- 

   "Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   "Only a promise," Lao Er said.
   Pearl S. Buck - Dragon Seed



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread David Miller
From: Linus Torvalds [EMAIL PROTECTED]
Date: Mon, 20 Aug 2007 22:46:47 -0700 (PDT)

 Ie a barrier() is likely _cheaper_ than the code generation downside 
 from using volatile.

Assuming GCC were ever better about the code generation badness
with volatile that has been discussed here, I much prefer
we tell GCC "this memory piece changed" rather than "every
piece of memory has changed" which is what the barrier() does.

I happened to have been scanning a lot of assembler lately to
track down a gcc-4.2 miscompilation on sparc64, and the barriers
do hurt quite a bit in some places.  Instead of keeping unrelated
variables around cached in local registers, it reloads everything.


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Russell King
On Tue, Aug 21, 2007 at 01:02:01AM +0200, Segher Boessenkool wrote:
 And no, RMW on MMIO isn't problematic at all, either.
 
 An RMW op is a read op, a modify op, and a write op, all rolled
 into one opcode.  But three actual operations.
 
 Maybe for some CPUs, but not all.  ARM for instance can't use the
 load exclusive and store exclusive instructions to MMIO space.
 
 Sure, your CPU doesn't have RMW instructions -- how to emulate
 those if you don't have them is a totally different thing.

Let me say it more clearly: On ARM, it is impossible to perform atomic
operations on MMIO space.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Russell King
On Mon, Aug 20, 2007 at 05:05:18PM -0700, Paul E. McKenney wrote:
 On Tue, Aug 21, 2007 at 01:02:01AM +0200, Segher Boessenkool wrote:
  And no, RMW on MMIO isn't problematic at all, either.
  
  An RMW op is a read op, a modify op, and a write op, all rolled
  into one opcode.  But three actual operations.
  
  Maybe for some CPUs, but not all.  ARM for instance can't use the
  load exclusive and store exclusive instructions to MMIO space.
  
  Sure, your CPU doesn't have RMW instructions -- how to emulate
  those if you don't have them is a totally different thing.
 
 I thought that ARM's load exclusive and store exclusive instructions
 were its equivalent of LL and SC, which RISC machines typically use to
 build atomic sequences of instructions -- and which normally cannot be
 applied to MMIO space.

Absolutely correct.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Paul Mackerras
Russell King writes:

 Let me say it more clearly: On ARM, it is impossible to perform atomic
 operations on MMIO space.

Actually, no one is suggesting that we try to do that at all.

The discussion about RMW ops on MMIO space started with a comment
attributed to the gcc developers that one reason why gcc on x86
doesn't use instructions that do RMW ops on volatile variables is that
volatile is used to mark MMIO addresses, and there was some
uncertainty about whether (non-atomic) RMW ops on x86 could be used on
MMIO.  This is in regard to the question about why gcc on x86 always
moves a volatile variable into a register before doing anything to it.

So the whole discussion is irrelevant to ARM, PowerPC and any other
architecture except x86[-64].

Paul.


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Andi Kleen
On Tue, Aug 21, 2007 at 07:33:49PM +1000, Paul Mackerras wrote:
 So the whole discussion is irrelevant to ARM, PowerPC and any other
 architecture except x86[-64].

It's even irrelevant on x86 because all modifying operations on atomic_t 
are coded in inline assembler and will always be RMW no matter
if atomic_t is volatile or not.

[ignoring atomic_set(x, atomic_read(x) + 1) which nobody should do]

The only issue is if atomic_t should have a implicit barrier or not.
My personal opinion is yes -- better safe than sorry. And any code
impact it may have is typically dwarfed by the next cache miss anyway,
so it doesn't matter much.

-Andi



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Chris Snook

David Miller wrote:

From: Linus Torvalds [EMAIL PROTECTED]
Date: Mon, 20 Aug 2007 22:46:47 -0700 (PDT)

Ie a barrier() is likely _cheaper_ than the code generation downside 
from using volatile.


Assuming GCC were ever better about the code generation badness
with volatile that has been discussed here, I much prefer
we tell GCC this memory piece changed rather than every
piece of memory has changed which is what the barrier() does.

I happened to have been scanning a lot of assembler lately to
track down a gcc-4.2 miscompilation on sparc64, and the barriers
do hurt quite a bit in some places.  Instead of keeping unrelated
variables around cached in local registers, it reloads everything.


Moore's law is definitely working against us here.  Register counts, 
pipeline depths, core counts, and clock multipliers are all increasing 
in the long run.  At some point in the future, barrier() will be 
universally regarded as a hammer too big for most purposes.  Whether or 
not removing it now constitutes premature optimization is arguable, but 
I think we should allow such optimization to happen (or not happen) in 
architecture-dependent code, and provide a consistent API that doesn't 
require the use of such things in arch-independent code where it might 
turn into a totally superfluous performance killer depending on what 
hardware it gets compiled for.


-- Chris


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Segher Boessenkool

And no, RMW on MMIO isn't problematic at all, either.

An RMW op is a read op, a modify op, and a write op, all rolled
into one opcode.  But three actual operations.


Maybe for some CPUs, but not all.  ARM for instance can't use the
load exclusive and store exclusive instructions to MMIO space.


Sure, your CPU doesn't have RMW instructions -- how to emulate
those if you don't have them is a totally different thing.


Let me say it more clearly: On ARM, it is impossible to perform atomic
operations on MMIO space.


It's all completely beside the point, see the other subthread, but...

Yeah, you can't do LL/SC to MMIO space; ARM isn't alone in that.
You could still implement atomic operations on MMIO space by taking
a lock elsewhere, in normal cacheable memory space.  Why you would
do this is a separate question, you probably don't want it :-)


Segher



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Segher Boessenkool

Let me say it more clearly: On ARM, it is impossible to perform atomic
operations on MMIO space.


Actually, no one is suggesting that we try to do that at all.

The discussion about RMW ops on MMIO space started with a comment
attributed to the gcc developers that one reason why gcc on x86
doesn't use instructions that do RMW ops on volatile variables is that
volatile is used to mark MMIO addresses, and there was some
uncertainty about whether (non-atomic) RMW ops on x86 could be used on
MMIO.  This is in regard to the question about why gcc on x86 always
moves a volatile variable into a register before doing anything to it.


This question is GCC PR33102, which was incorrectly closed as a
duplicate of PR3506 -- and *that* PR was closed because its reporter
seemed to claim the GCC generated code for an increment on a volatile
(namely, three machine instructions: load, modify, store) was incorrect,
and it has to be one machine instruction.


So the whole discussion is irrelevant to ARM, PowerPC and any other
architecture except x86[-64].


And even there, it's not something the kernel can take advantage of
before GCC 4.4 is in widespread use, if then.  Let's move on.


Segher



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Segher Boessenkool
At some point in the future, barrier() will be universally regarded as 
a hammer too big for most purposes.  Whether or not removing it now


You can't just remove it, it is needed in some places; you want to
replace it in most places with a more fine-grained compiler barrier,
I presume?

constitutes premature optimization is arguable, but I think we should 
allow such optimization to happen (or not happen) in 
architecture-dependent code, and provide a consistent API that doesn't 
require the use of such things in arch-independent code where it might 
turn into a totally superfluous performance killer depending on what 
hardware it gets compiled for.


Explicit barrier()s won't be too hard to replace -- but what to do
about the implicit barrier()s in rmb() etc. etc. -- *those* will be
hard to get rid of, if only because it is hard enough to teach driver
authors about how to use those primitives *already*.  It is far from
clear what a good interface like that would look like, anyway.

Probably we should first start experimenting with a forget()-style
micro-barrier (but please, find a better name), and see if a nice
usage pattern shows up that can be turned into an API.
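One possible shape for such a micro-barrier, as floated earlier in this thread (GCC inline asm; the name forget() follows that proposal, and the demo function is illustrative):

```c
/* Invalidate the compiler's cached knowledge of one object only,
 * instead of clobbering all of memory the way barrier() does: the
 * empty asm claims to rewrite x in place, so GCC must re-load it. */
#define forget(x) __asm__ __volatile__ ("" : "=m" (x) : "m" (x))

int wait_demo_flag;

int wait_demo(void)
{
    int iters = 0;
    while (!wait_demo_flag && iters < 100) {
        forget(wait_demo_flag);  /* force a fresh load next iteration */
        iters++;
        if (iters == 2)
            wait_demo_flag = 1;  /* stand-in for another thread's store */
    }
    return iters;
}
```

Unlike barrier(), this leaves unrelated variables cached in registers, which is exactly the code-generation cost David measured on sparc64.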


Segher



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Satyam Sharma


On Tue, 21 Aug 2007, Chris Snook wrote:

 David Miller wrote:
  From: Linus Torvalds [EMAIL PROTECTED]
  Date: Mon, 20 Aug 2007 22:46:47 -0700 (PDT)
  
   Ie a barrier() is likely _cheaper_ than the code generation downside
   from using volatile.
  
  Assuming GCC were ever better about the code generation badness
  with volatile that has been discussed here, I much prefer
  we tell GCC this memory piece changed rather than every
  piece of memory has changed which is what the barrier() does.
  
  I happened to have been scanning a lot of assembler lately to
  track down a gcc-4.2 miscompilation on sparc64, and the barriers
  do hurt quite a bit in some places.  Instead of keeping unrelated
  variables around cached in local registers, it reloads everything.
 
 Moore's law is definitely working against us here.  Register counts, pipeline
 depths, core counts, and clock multipliers are all increasing in the long run.
 At some point in the future, barrier() will be universally regarded as a
 hammer too big for most purposes.

I do agree, and the important point to note is that the benefits of a
/lighter/ compiler barrier, such as what David referred to above, _can_
be had without having to do anything with the volatile keyword at all.
And such a primitive has already been mentioned/proposed on this thread.


But this is all tangential to the core question at hand -- whether to have
implicit (compiler, possibly light-weight of the kind referred above)
barrier semantics in atomic ops that do not have them, or not.

I was lately looking in the kernel for _actual_ code that uses atomic_t
and benefits from the lack of any implicit barrier, with the compiler
being free to cache the atomic_t in a register. Now that often does _not_
happen, because all other ops (implemented in asm with LOCK prefix on x86)
_must_ therefore constrain the atomic_t to memory anyway. So typically all
atomic ops code sequences end up operating on memory.

Then I did locate sched.c:select_nohz_load_balancer() -- it repeatedly
references the same atomic_t object, and the code that I saw generated
(with CC_OPTIMIZE_FOR_SIZE=y) did cache it in a register for a sequence of
instructions. It uses atomic_cmpxchg, thereby not requiring explicit
memory barriers anywhere in the code, and is an example of an atomic_t
user that is safe, and yet benefits from its memory loads/stores being
elided/coalesced by the compiler.


# at this point, %eax holds num_online_cpus() and
# %ebx holds cpus_weight(nohz.cpu_mask)
# the variable cpu is in %esi

0xc1018e1d:  cmp    %eax,%ebx               # if No.A.
0xc1018e1f:  mov    0xc134d900,%eax         # first atomic_read()
0xc1018e24:  jne    0xc1018e36
0xc1018e26:  cmp    %esi,%eax               # if No.B.
0xc1018e28:  jne    0xc1018e80              # returns with 0
0xc1018e2a:  movl   $0xffffffff,0xc134d900  # atomic_set(-1), and ...
0xc1018e34:  jmp    0xc1018e80              # ... returns with 0
0xc1018e36:  cmp    $0xffffffff,%eax        # if No.C. (NOTE!)
0xc1018e39:  jne    0xc1018e46
0xc1018e3b:  lock cmpxchg %esi,0xc134d900   # atomic_cmpxchg()
0xc1018e43:  inc    %eax
0xc1018e44:  jmp    0xc1018e48
0xc1018e46:  cmp    %esi,%eax               # if No.D. (NOTE!)
0xc1018e48:  jne    0xc1018e80              # if !=, default return 0 (if No.E.)
0xc1018e4a:  jmp    0xc1018e84              # otherwise (==) returns with 1

The above is:

if (cpus_weight(nohz.cpu_mask) == num_online_cpus()) {     /* if No.A. */
        if (atomic_read(&nohz.load_balancer) == cpu)       /* if No.B. */
                atomic_set(&nohz.load_balancer, -1);       /* XXX */
        return 0;
}
if (atomic_read(&nohz.load_balancer) == -1) {              /* if No.C. */
        /* make me the ilb owner */
        if (atomic_cmpxchg(&nohz.load_balancer, -1, cpu) == -1)  /* if No.E. */
                return 1;
} else if (atomic_read(&nohz.load_balancer) == cpu)        /* if No.D. */
        return 1;
...
...
return 0;  /* default return from function */

As you can see, the atomic_read()'s of ifs Nos. B, C, and D were _all_
coalesced into a single memory reference "mov 0xc134d900,%eax" at the
top of the function, and then ifs Nos. C and D simply used the value
from %eax itself. But that's perfectly safe, such is the logic of this
function. It uses cmpxchg _whenever_ updating the value in the memory
atomic_t and then returns appropriately. The _only_ point that a casual
reader may find racy is that marked /* XXX */ above -- atomic_read()
followed by atomic_set() with no barrier in between. But even that is ok,
because if one thread ever finds that condition to succeed, it is 100%
guaranteed no other thread on any other CPU will find _any_ condition
to be true, thereby avoiding any race in the modification of that value.


BTW it does sound reasonable that a lot of atomic_t users 

Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Paul E. McKenney
On Tue, Aug 21, 2007 at 04:48:51PM +0200, Segher Boessenkool wrote:
 Let me say it more clearly: On ARM, it is impossible to perform atomic
 operations on MMIO space.
 
 Actually, no one is suggesting that we try to do that at all.
 
 The discussion about RMW ops on MMIO space started with a comment
 attributed to the gcc developers that one reason why gcc on x86
 doesn't use instructions that do RMW ops on volatile variables is that
 volatile is used to mark MMIO addresses, and there was some
 uncertainty about whether (non-atomic) RMW ops on x86 could be used on
 MMIO.  This is in regard to the question about why gcc on x86 always
 moves a volatile variable into a register before doing anything to it.
 
 This question is GCC PR33102, which was incorrectly closed as a 
 duplicate
 of PR3506 -- and *that* PR was closed because its reporter seemed to
 claim the GCC generated code for an increment on a volatile (namely, 
 three
 machine instructions: load, modify, store) was incorrect, and it has to
 be one machine instruction.
 
 So the whole discussion is irrelevant to ARM, PowerPC and any other
 architecture except x86[-64].
 
 And even there, it's not something the kernel can take advantage of
 before GCC 4.4 is in widespread use, if then.  Let's move on.

I agree that instant gratification is hard to come by when synching
up compiler and kernel versions.  Nonetheless, it should be possible
to create APIs that are are conditioned on the compiler version.

Thanx, Paul


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Linus Torvalds


On Tue, 21 Aug 2007, Chris Snook wrote:
 
 Moore's law is definitely working against us here.  Register counts, pipeline
 depths, core counts, and clock multipliers are all increasing in the long run.
 At some point in the future, barrier() will be universally regarded as a
 hammer too big for most purposes.

Note that barrier() is purely a compiler barrier. It has zero impact on 
the CPU pipeline itself, and also has zero impact on anything that gcc 
knows isn't visible in memory (ie local variables that don't have their 
address taken), so barrier() really is pretty cheap.

Now, it's possible that gcc messes up in some circumstances, and that the 
memory clobber will cause gcc to also do things like flush local registers 
unnecessarily to their stack slots, but quite frankly, if that happens, 
it's a gcc problem, and I also have to say that I've not seen that myself.

So in a very real sense, barrier() will just make sure that there is a 
stronger sequence point for the compiler where things are stable. In most 
cases it has absolutely zero performance impact - apart from the 
-intended- impact of making sure that the compiler doesn't re-order or 
cache stuff around it.
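For reference, the classic GCC definition of barrier() and its effect can be sketched as follows (the demo function is illustrative):

```c
/* Compiler-only barrier: an empty asm with a "memory" clobber.  GCC
 * must assume arbitrary memory was touched, so it cannot cache memory
 * values in registers or reorder memory accesses across this point --
 * yet no CPU instruction is emitted for it. */
#define barrier() __asm__ __volatile__("" ::: "memory")

int double_read(int *p)
{
    int a = *p;   /* first load */
    barrier();    /* *p must be re-loaded below, not reused from a register */
    int b = *p;   /* second, distinct load */
    return a + b;
}
```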

And sure, we could make it more finegrained, and also introduce a 
per-variable barrier, but the fact is, people _already_ have problems with 
thinking about these kinds of things, and adding new abstraction issues 
with subtle semantics is the last thing we want.

So I really think you'd want to show a real example of real code that 
actually gets noticeably slower or bigger.

In removing volatile, we have shown that. It may not have made a big 
difference on powerpc, but it makes a real difference on x86 - and more 
importantly, it removes something that people clearly don't know how it 
works, and incorrectly expect to just fix bugs.

[ There are *other* barriers - the ones that actually add memory barriers 
  to the CPU - that really can be quite expensive. The good news is that 
  the expense is going down rather than up: both Intel and AMD are not 
  only removing the need for some of them (ie smp_rmb() will become a 
  compiler-only barrier), but we're _also_ seeing the whole pipeline 
  flush approach go away, and be replaced by the CPU itself actually 
  being better - so even the actual CPU pipeline barriers are getting
  cheaper, not more expensive. ]

For example, did anybody even _test_ how expensive barrier() is? Just 
as a lark, I did

#undef barrier
#define barrier() do { } while (0)

in kernel/sched.c (which only has three of them in it, but hey, that's 
more than most files), and there were _zero_ code generation downsides. 
One instruction was moved (and a few line numbers changed), so it wasn't 
like the assembly language was identical, but the point is, barrier() 
simply doesn't have the same kinds of downsides that volatile has.

(That may not be true on other architectures or in other source files, of 
course. This *does* depend on code generation details. But anybody who 
thinks that barrier() is fundamentally expensive is simply incorrect. It 
is *fundamentally* a no-op).

Linus


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Valdis . Kletnieks
On Tue, 21 Aug 2007 09:16:43 PDT, Paul E. McKenney said:

 I agree that instant gratification is hard to come by when synching
 up compiler and kernel versions.  Nonetheless, it should be possible
 to create APIs that are are conditioned on the compiler version.

We've tried that, sort of.  See the mess surrounding the whole
extern/static/inline/__whatever boondoggle, which seems to have
changed semantics in every single gcc release since 2.95 or so.

And recently mention was made that gcc4.4 will have *new* semantics
in this area. Yee. Hah.



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-21 Thread Paul E. McKenney
On Tue, Aug 21, 2007 at 06:51:16PM -0400, [EMAIL PROTECTED] wrote:
 On Tue, 21 Aug 2007 09:16:43 PDT, Paul E. McKenney said:
 
  I agree that instant gratification is hard to come by when synching
  up compiler and kernel versions.  Nonetheless, it should be possible
  to create APIs that are are conditioned on the compiler version.
 
 We've tried that, sort of.  See the mess surrounding the whole
 extern/static/inline/__whatever boondoggle, which seems to have
 changed semantics in every single gcc release since 2.95 or so.
 
 And recently mention was made that gcc4.4 will have *new* semantics
 in this area. Yee. Hah.

;-)

Thanx, Paul


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-20 Thread Chris Snook

Linus Torvalds wrote:
So the only reason to add back volatile to the atomic_read() sequence is 
not to fix bugs, but to _hide_ the bugs better. They're still there, they 
are just a lot harder to trigger, and tend to be a lot subtler.


What about barrier removal?  With consistent semantics we could optimize a fair 
amount of code.  Whether or not that constitutes premature optimization is 
open to debate, but there's no question we could reduce our register wiping in 
some places.


-- Chris


Re: LDD3 pitfalls (was Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures)

2007-08-20 Thread Chris Snook

Stefan Richter wrote:

Nick Piggin wrote:

Stefan Richter wrote:

Nick Piggin wrote:


I don't know why people would assume volatile of atomics. AFAIK, most
of the documentation is pretty clear that all the atomic stuff can be
reordered etc. except for those that modify and return a value.


Which documentation is there?

Documentation/atomic_ops.txt



For driver authors, there is LDD3.  It doesn't specifically cover
effects of optimization on accesses to atomic_t.

For architecture port authors, there is Documentation/atomic_ops.txt.
Driver authors also can learn something from that document, as it
indirectly documents the atomic_t and bitops APIs.


Semantics and Behavior of Atomic and Bitmask Operations is
pretty direct :)

Sure, it says that it's for arch maintainers, but there is no
reason why users can't make use of it.



Note, LDD3 page 238 says:  It is worth noting that most of the other
kernel primitives dealing with synchronization, such as spinlock and
atomic_t operations, also function as memory barriers.

I don't know about Linux 2.6.10 against which LDD3 was written, but
currently only _some_ atomic_t operations function as memory barriers.

Besides, judging from some posts in this thread, saying that atomic_t
operations dealt with synchronization may not be entirely precise.


atomic_t is often used as the basis for implementing more sophisticated 
synchronization mechanisms, such as rwlocks.  Whether or not they are designed 
for that purpose, the atomic_* operations are de facto synchronization primitives.


-- Chris


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-20 Thread Herbert Xu
On Mon, Aug 20, 2007 at 09:15:11AM -0400, Chris Snook wrote:
 Linus Torvalds wrote:
 So the only reason to add back volatile to the atomic_read() sequence is 
 not to fix bugs, but to _hide_ the bugs better. They're still there, they 
 are just a lot harder to trigger, and tend to be a lot subtler.
 
 What about barrier removal?  With consistent semantics we could optimize a 
 fair amount of code.  Whether or not that constitutes premature 
 optimization is open to debate, but there's no question we could reduce our 
 register wiping in some places.

If you've been reading all of Linus's emails you should be
thinking about adding memory barriers, and not removing
compiler barriers.

He's just told you that code of the kind

while (!atomic_read(cond))
;

do_something()

probably needs a memory barrier (not just compiler) so that
do_something() doesn't see stale cache content that occurred
before cond flipped.
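[A userspace analogue of the pattern Herbert describes, for illustration only -- the names are made up, and C11 acquire/release stands in for the kernel's smp_rmb()/smp_wmb() pairing:]

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int cond_flag;
static int payload;

static void *producer(void *arg)
{
    (void)arg;
    payload = 42;   /* plain store, published by the release below */
    /* the release pairs with the acquire in the wait loop, playing
     * the role of the memory barrier discussed above */
    atomic_store_explicit(&cond_flag, 1, memory_order_release);
    return NULL;
}

/* spin until cond flips, then read the data written before it did */
int wait_and_read(void)
{
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    while (!atomic_load_explicit(&cond_flag, memory_order_acquire))
        ;   /* a real kernel loop would also cpu_relax() here */
    pthread_join(t, NULL);
    return payload;  /* 42 is guaranteed visible, not stale */
}
```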

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-20 Thread Chris Snook

Christoph Lameter wrote:

On Fri, 17 Aug 2007, Paul E. McKenney wrote:


On Sat, Aug 18, 2007 at 08:09:13AM +0800, Herbert Xu wrote:

On Fri, Aug 17, 2007 at 04:59:12PM -0700, Paul E. McKenney wrote:

gcc bugzilla bug #33102, for whatever that ends up being worth.  ;-)

I had totally forgotten that I'd already filed that bug more
than six years ago until they just closed yours as a duplicate
of mine :)

Good luck in getting it fixed!

Well, just got done re-opening it for the third time.  And a local
gcc community member advised me not to give up too easily.  But I
must admit that I am impressed with the speed that it was identified
as duplicate.

Should be entertaining!  ;-)


Right. ROTFL... volatile actually breaks atomic_t instead of making it 
safe. x++ becomes a register load, increment and a register store. Without 
volatile we can increment the memory directly. It seems that volatile 
requires that the variable is loaded into a register first and then 
operated upon. Understandable when you think about volatile being used to 
access memory mapped I/O registers where a RMW operation could be 
problematic.


So, if we want consistent behavior, we're pretty much screwed unless we use 
inline assembler everywhere?


-- Chris


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-20 Thread Chris Snook

Herbert Xu wrote:

On Mon, Aug 20, 2007 at 09:15:11AM -0400, Chris Snook wrote:

Linus Torvalds wrote:
So the only reason to add back volatile to the atomic_read() sequence is 
not to fix bugs, but to _hide_ the bugs better. They're still there, they 
are just a lot harder to trigger, and tend to be a lot subtler.
What about barrier removal?  With consistent semantics we could optimize a 
fair amount of code.  Whether or not that constitutes premature 
optimization is open to debate, but there's no question we could reduce our 
register wiping in some places.


If you've been reading all of Linus's emails you should be
thinking about adding memory barriers, and not removing
compiler barriers.

He's just told you that code of the kind

while (!atomic_read(cond))
;

do_something()

probably needs a memory barrier (not just compiler) so that
do_something() doesn't see stale cache content that occurred
before cond flipped.


Such code generally doesn't care precisely when it gets the update, just that 
the update is atomic, and it doesn't loop forever.  Regardless, I'm convinced we 
just need to do it all in assembly.


-- Chris


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-20 Thread Segher Boessenkool
Right. ROTFL... volatile actually breaks atomic_t instead of making 
it safe. x++ becomes a register load, increment and a register store. 
Without volatile we can increment the memory directly. It seems that 
volatile requires that the variable is loaded into a register first 
and then operated upon. Understandable when you think about volatile 
being used to access memory mapped I/O registers where a RMW 
operation could be problematic.


So, if we want consistent behavior, we're pretty much screwed unless 
we use inline assembler everywhere?


Nah, this whole argument is flawed -- without volatile we still
*cannot* increment the memory directly.  On x86, you need a lock
prefix; on other archs, some other mechanism to make the memory
increment an *atomic* memory increment.

And no, RMW on MMIO isn't problematic at all, either.

An RMW op is a read op, a modify op, and a write op, all rolled
into one opcode.  But three actual operations.


The advantages of asm code for atomic_{read,set} are:
1) all the other atomic ops are implemented that way already;
2) you have full control over the asm insns selected, in particular,
   you can guarantee you *do* get an atomic op;
3) you don't need to use volatile data which generates
   not-all-that-good code on archs like x86, and we want to get rid
   of it anyway since it is problematic in many ways;
4) you don't need to use *(volatile type*)data, which a) doesn't
   exist in C; b) isn't documented or supported in GCC; c) has a recent
   history of bugginess; d) _still uses volatile objects_; e) _still_
   is problematic in almost all those same ways as in 3);
5) you can mix atomic and non-atomic accesses to the atomic_t, which
   you cannot with the other alternatives.

The only disadvantage I know of is potentially slightly worse
instruction scheduling.  This is a generic asm() problem: GCC
cannot see what actual insns are inside the asm() block.
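[A concrete sketch of the asm approach, purely illustrative -- the constraint choices are one possible reading of points 2 and 4 above, with a non-x86 fallback added so it compiles anywhere:]

```c
typedef struct { int counter; } atomic_t;

static inline int atomic_read_asm(const atomic_t *v)
{
    int ret;
#if defined(__x86_64__) || defined(__i386__)
    /* the "m" operand forces a genuine load from the atomic_t's
     * memory; GCC cannot satisfy it from a stale register copy */
    __asm__ __volatile__("movl %1, %0" : "=r" (ret) : "m" (v->counter));
#else
    /* other archs: a volatile-cast load as the stand-in, with its
     * known downsides (point 4 above) */
    ret = *(const volatile int *)&v->counter;
#endif
    return ret;
}
```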


Segher



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-20 Thread Segher Boessenkool
Such code generally doesn't care precisely when it gets the update, 
just that the update is atomic, and it doesn't loop forever.


Yes, it _does_ care that it gets the update _at all_, and preferably
as early as possible.


Regardless, I'm convinced we just need to do it all in assembly.


So do you want volatile asm or plain asm, for atomic_read()?
The asm version has two ways to go about it too...


Segher



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-20 Thread Russell King
On Tue, Aug 21, 2007 at 12:04:17AM +0200, Segher Boessenkool wrote:
 And no, RMW on MMIO isn't problematic at all, either.
 
 An RMW op is a read op, a modify op, and a write op, all rolled
 into one opcode.  But three actual operations.

Maybe for some CPUs, but not all.  ARM for instance can't use the
load exclusive and store exclusive instructions to MMIO space.

This means placing atomic_t or bitops into MMIO space is a definite
no-go on ARM.  It breaks.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-20 Thread Segher Boessenkool

And no, RMW on MMIO isn't problematic at all, either.

An RMW op is a read op, a modify op, and a write op, all rolled
into one opcode.  But three actual operations.


Maybe for some CPUs, but not all.  ARM for instance can't use the
load exclusive and store exclusive instructions to MMIO space.


Sure, your CPU doesn't have RMW instructions -- how to emulate
those if you don't have them is a totally different thing.


Segher



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-20 Thread Paul E. McKenney
On Tue, Aug 21, 2007 at 01:02:01AM +0200, Segher Boessenkool wrote:
 And no, RMW on MMIO isn't problematic at all, either.
 
 An RMW op is a read op, a modify op, and a write op, all rolled
 into one opcode.  But three actual operations.
 
 Maybe for some CPUs, but not all.  ARM for instance can't use the
 load exclusive and store exclusive instructions to MMIO space.
 
 Sure, your CPU doesn't have RMW instructions -- how to emulate
 those if you don't have them is a totally different thing.

I thought that ARM's load exclusive and store exclusive instructions
were its equivalent of LL and SC, which RISC machines typically use to
build atomic sequences of instructions -- and which normally cannot be
applied to MMIO space.

Thanx, Paul


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-20 Thread Linus Torvalds


On Mon, 20 Aug 2007, Chris Snook wrote:

 What about barrier removal?  With consistent semantics we could optimize a
 fair amount of code.  Whether or not that constitutes premature optimization
 is open to debate, but there's no question we could reduce our register wiping
 in some places.

Why do people think that barriers are expensive? They really aren't. 
Especially the regular compiler barrier is basically zero cost. Any 
reasonable compiler will just flush the stuff it holds in registers that 
isn't already automatic local variables, and for regular kernel code, that 
tends to basically be nothing at all.

Ie a barrier() is likely _cheaper_ than the code generation downside 
from using volatile.
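[To make the zero-cost claim concrete -- a standalone sketch of the kernel's compiler barrier (as in include/linux/compiler-gcc.h); it emits no instructions at all, it merely tells GCC that memory may have changed so cached register copies must be reloaded:]

```c
#define barrier() __asm__ __volatile__("" ::: "memory")

int flag;   /* set elsewhere, e.g. by an interrupt handler */

/* Re-reads "flag" every iteration: barrier() forbids GCC from
 * hoisting the load out of the loop, yet generates no code itself.
 * Bounded here so the sketch terminates either way. */
int spin_bounded(int max_spins)
{
    int n = 0;
    while (!flag && n < max_spins) {
        barrier();
        n++;
    }
    return n;
}
```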

Linus


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-18 Thread Satyam Sharma
[ LOL, you _are_ shockingly petty! ]


On Sat, 18 Aug 2007, Segher Boessenkool wrote:

   The documentation simply doesn't say +m is allowed.  The code to
   allow it was added for the benefit of people who do not read the
   documentation.  Documentation for +m might get added later if it
   is decided this [the code, not the documentation] is a sane thing
   to have (which isn't directly obvious).
  
  Huh?
  
  If the (current) documentation doesn't match up with the (current)
  code, then _at least one_ of them has to be (as of current) wrong.
  
  I wonder how could you even try to disagree with that.
 
 Easy.
 
 The GCC documentation you're referring to is the user's manual.
 See the blurb on the first page:
 
 This manual documents how to use the GNU compilers, as well as their
 features and incompatibilities, and how to report bugs.  It corresponds
 to GCC version 4.3.0.  The internals of the GNU compilers, including
 how to port them to new targets and some information about how to write
 front ends for new languages, are documented in a separate manual.
 
 _How to use_.  This documentation doesn't describe in minute detail
 everything the compiler does (see the source code for that -- no, it
 isn't described in the internals manual either).

Wow, now that's a nice disclaimer. By your (poor) standards of writing
documentation, one can as well write any factually incorrect stuff that
one wants in a document once you've got such a blurb in place :-)


 If it doesn't tell you how to use +m, and even tells you _not_ to
 use it, maybe that is what it means to say?  It doesn't mean +m
 doesn't actually do something.  It also doesn't mean it does what
 you think it should do.  It might do just that of course.  But treating
 writing C code as an empirical science isn't such a smart idea.

Oh, really? Considering how much is (left out of being) documented, often
one would reasonably have to experimentally see (with testcases) how the
compiler behaves for some given code. Well, at least _I_ do it often
(several others on this list do as well), and I think there's everything
smart about it rather than having to read gcc sources -- I'd be surprised
(unless you have infinite free time on your hands, which does look like
the case actually) if someone actually prefers reading gcc sources first
to know what/how gcc does something for some given code, rather than
simply write it out, compile and look at the generated code (saves time
for those who don't have an infinite amount of it).


  And I didn't go whining about this ... you asked me. (I think I'd said
  something to the effect of GCC docs are often wrong,
 
 No need to guess at what you said, even if you managed to delete
 your own mail already, there are plenty of free web-based archives
 around.  You said:
 
  See, volatile C keyword, for all it's ill-definition and dodgy
  semantics, is still at least given somewhat of a treatment in the C
  standard (whose quality is ... ummm, sadly not always good and clear,
  but unsurprisingly, still about 5,482 orders-of-magnitude times
  better than GCC docs).

Try _reading_ what I said there, for a change, dude. I'd originally only
said unless GCC's docs is yet again wrong ... then _you_ asked me what,
after which this discussion began and I wrote the above [which I fully
agree with -- so what if I used hyperbole in my sentence (yup, that was
intended, and obviously, exaggeration), am I not even allowed to do that?
Man, you're a Nazi or what ...] I didn't go whining about on my own as
you'd had earlier suggested, until _you_ asked me.

[ Ick, I somehow managed to reply to this ... this is such a ...
  *disgustingly* petty argument you made here. ]


  which is true,
 
 Yes, documentation of that size often has shortcomings.  No surprise
 there.  However, great effort is made to make it better documentation,
 and especially to keep it up to date; if you find any errors or
 omissions, please report them.  There are many ways how to do that,
 see the GCC homepage./end-of-marketing-blurb
 ^^

Looks like you even get paid :-)


  but probably you feel saying that is not allowed on non-gcc lists?)
 
 [amazingly pointless stuff snipped]
 
  As for the PR
  you're requesting me to file with GCC for this, that
  gcc-patches@ thread did precisely that
 
 [more amazingly pointless stuff snipped]
 
  and more (submitted a patch to
  said documentation -- and no, saying documentation might get added
  later is totally bogus and nonsensical -- documentation exists to
  document current behaviour, not past).
 
 When code like you want to write becomes a supported feature, that
 will be reflected in the user manual.  It is completely nonsensical
 to expect everything that is *not* a supported feature to be mentioned
 there.

What crap. It is _perfectly reasonable_ to expect (current) documentation
to keep up with (current) code behaviour. In fact trying to justify such
a state is completely 

Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-18 Thread Satyam Sharma


On Fri, 17 Aug 2007, Linus Torvalds wrote:

 On Sat, 18 Aug 2007, Satyam Sharma wrote:
  
  No code does (or would do, or should do):
  
  x.counter++;
  
  on an atomic_t x; anyway.
 
 That's just an example of a general problem.
 
 No, you don't use x.counter++. But you *do* use
 
   if (atomic_read(&x) >= 1)
 
 and loading into a register is stupid and pointless, when you could just 
 do it as a regular memory-operand to the cmp instruction.

True, but that makes this a bad/poor code generation issue with the
compiler, not something that affects the _correctness_ of atomic ops if
volatile is used for that counter object (as was suggested), because
we'd always use the atomic_inc() etc primitives to do increments, which
are always (should be!) implemented to be atomic.


 And as far as the compiler is concerned, the problem is the 100% same: 
 combining operations with the volatile memop.
 
 The fact is, a compiler that thinks that
 
   movl mem,reg
   cmpl $val,reg
 
 is any better than
 
   cmpl $val,mem
 
 is just not a very good compiler.

Absolutely, this is definitely a bug report worth opening with gcc. And
what you've said to explain this previously sounds definitely correct --
seeing volatile for any access does appear to just scare the hell out
of gcc and makes it generate such (poor) code.


Satyam


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-18 Thread Satyam Sharma


On Sat, 18 Aug 2007, Segher Boessenkool wrote:

   GCC manual, section 6.1, When
  ^^
   is a Volatile Object Accessed? doesn't say anything of the
  ^^^
   kind.
  ^

  True, implementation-defined as per the C standard _is_ supposed to mean
^

  unspecified behaviour where each implementation documents how the choice
  is made. So ok, probably GCC isn't documenting this
  

  implementation-defined behaviour which it is supposed to, but can't really
  fault them much for this, probably.
 
 GCC _is_ documenting this, namely in this section 6.1.

(Again totally petty, but) Yes, but ...

 It doesn't
  ^^
 mention volatile-casted stuff.  Draw your own conclusions.
  ^^

... exactly. So that's why I said GCC isn't documenting _this_.

Man, try _reading_ mails before replying to them ...


LDD3 pitfalls (was Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures)

2007-08-18 Thread Stefan Richter
Nick Piggin wrote:
 Stefan Richter wrote:
 Nick Piggin wrote:

 I don't know why people would assume volatile of atomics. AFAIK, most
 of the documentation is pretty clear that all the atomic stuff can be
 reordered etc. except for those that modify and return a value.


 Which documentation is there?
 
 Documentation/atomic_ops.txt
 
 
 For driver authors, there is LDD3.  It doesn't specifically cover
 effects of optimization on accesses to atomic_t.

 For architecture port authors, there is Documentation/atomic_ops.txt.
 Driver authors also can learn something from that document, as it
 indirectly documents the atomic_t and bitops APIs.

 
 Semantics and Behavior of Atomic and Bitmask Operations is
 pretty direct :)
 
 Sure, it says that it's for arch maintainers, but there is no
 reason why users can't make use of it.


Note, LDD3 page 238 says:  It is worth noting that most of the other
kernel primitives dealing with synchronization, such as spinlock and
atomic_t operations, also function as memory barriers.

I don't know about Linux 2.6.10 against which LDD3 was written, but
currently only _some_ atomic_t operations function as memory barriers.

Besides, judging from some posts in this thread, saying that atomic_t
operations dealt with synchronization may not be entirely precise.
-- 
Stefan Richter
-=-=-=== =--- =--=-
http://arcgraph.de/sr/


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-18 Thread Paul E. McKenney
On Fri, Aug 17, 2007 at 09:13:35PM -0700, Linus Torvalds wrote:
 
 
 On Sat, 18 Aug 2007, Satyam Sharma wrote:
  
  No code does (or would do, or should do):
  
  x.counter++;
  
  on an atomic_t x; anyway.
 
 That's just an example of a general problem.
 
 No, you don't use x.counter++. But you *do* use
 
   if (atomic_read(&x) >= 1)
 
 and loading into a register is stupid and pointless, when you could just 
 do it as a regular memory-operand to the cmp instruction.
 
 And as far as the compiler is concerned, the problem is the 100% same: 
 combining operations with the volatile memop.
 
 The fact is, a compiler that thinks that
 
   movl mem,reg
   cmpl $val,reg
 
 is any better than
 
   cmpl $val,mem
 
 is just not a very good compiler. But when talking about volatile, 
 that's exactly what you always get (and always have gotten - this is 
 not a regression, and I doubt gcc is alone in this).

One of the gcc guys claimed that he thought that the two-instruction
sequence would be faster on some x86 machines.  I pointed out that
there might be a concern about code size.  I chose not to point out
that people might also care about the other x86 machines.  ;-)

Thanx, Paul


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-18 Thread Paul E. McKenney
On Fri, Aug 17, 2007 at 06:24:15PM -0700, Christoph Lameter wrote:
 On Fri, 17 Aug 2007, Paul E. McKenney wrote:
 
  On Sat, Aug 18, 2007 at 08:09:13AM +0800, Herbert Xu wrote:
   On Fri, Aug 17, 2007 at 04:59:12PM -0700, Paul E. McKenney wrote:
   
gcc bugzilla bug #33102, for whatever that ends up being worth.  ;-)
   
   I had totally forgotten that I'd already filed that bug more
   than six years ago until they just closed yours as a duplicate
   of mine :)
   
   Good luck in getting it fixed!
  
  Well, just got done re-opening it for the third time.  And a local
  gcc community member advised me not to give up too easily.  But I
  must admit that I am impressed with the speed that it was identified
  as duplicate.
  
  Should be entertaining!  ;-)
 
 Right. ROTFL... volatile actually breaks atomic_t instead of making it 
 safe. x++ becomes a register load, increment and a register store. Without 
 volatile we can increment the memory directly. It seems that volatile 
 requires that the variable is loaded into a register first and then 
 operated upon. Understandable when you think about volatile being used to 
 access memory mapped I/O registers where a RMW operation could be 
 problematic.
 
 See http://gcc.gnu.org/bugzilla/show_bug.cgi?id=3506

Yep.  The initial reaction was in fact to close my bug as a duplicate
of 3506.  But I was not asking for atomicity, but rather for smaller
code to be generated, so I reopened it.

Thanx, Paul


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-18 Thread Linus Torvalds


On Sat, 18 Aug 2007, Paul E. McKenney wrote:
 
 One of the gcc guys claimed that he thought that the two-instruction
 sequence would be faster on some x86 machines.  I pointed out that
 there might be a concern about code size.  I chose not to point out
 that people might also care about the other x86 machines.  ;-)

Some (very few) x86 uarchs do tend to prefer load-store like code 
generation, and doing a mov [mem],reg + op reg instead of op [mem] can 
actually be faster on some of them. Not any that are relevant today, 
though.

Also, that has nothing to do with volatile, and should be controlled by 
optimization flags (like -mtune). In fact, I thought there was a separate 
flag to do that (ie something like -mload-store), but I can't find it, 
so maybe that's just my fevered brain..

Linus


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-18 Thread Paul E. McKenney
On Sat, Aug 18, 2007 at 03:41:13PM -0700, Linus Torvalds wrote:
 
 
 On Sat, 18 Aug 2007, Paul E. McKenney wrote:
  
  One of the gcc guys claimed that he thought that the two-instruction
  sequence would be faster on some x86 machines.  I pointed out that
  there might be a concern about code size.  I chose not to point out
  that people might also care about the other x86 machines.  ;-)
 
 Some (very few) x86 uarchs do tend to prefer load-store like code 
 generation, and doing a mov [mem],reg + op reg instead of op [mem] can 
 actually be faster on some of them. Not any that are relevant today, 
 though.

;-)

 Also, that has nothing to do with volatile, and should be controlled by 
 optimization flags (like -mtune). In fact, I thought there was a separate 
 flag to do that (ie something like -mload-store), but I can't find it, 
 so maybe that's just my fevered brain..

Good point, will suggest this if the need arises.

Thanx, Paul


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Satyam Sharma


On Fri, 17 Aug 2007, Herbert Xu wrote:

 On Fri, Aug 17, 2007 at 01:43:27PM +1000, Paul Mackerras wrote:
 
  The cost of doing so seems to me to be well down in the noise - 44
  bytes of extra kernel text on a ppc64 G5 config, and I don't believe
  the extra few cycles for the occasional extra load would be measurable
  (they should all hit in the L1 dcache).  I don't mind if x86[-64] have
  atomic_read/set be nonvolatile and find all the missing barriers, but
  for now on powerpc, I think that not having to find those missing
  barriers is worth the 0.00076% increase in kernel text size.
 
 BTW, the sort of missing barriers that triggered this thread
 aren't that subtle.  It'll result in a simple lock-up if the
 loop condition holds upon entry.  At which point it's fairly
 straightforward to find the culprit.

Not necessarily. Barrier-less buggy code such as the following:

atomic_set(&v, 0);

... /* some initial code */

while (atomic_read(&v))
;

... /* code that MUST NOT be executed unless v becomes non-zero */

(where v->counter has no volatile access semantics)

could be generated by the compiler to simply *elide* or *do away* with
the loop itself, thereby making the:

/* code that MUST NOT be executed unless v becomes non-zero */

to be executed even when v is zero! That is subtle indeed, and causes
no hard lockups.

Granted, the above IS buggy code. But, the stated objective is to avoid
heisenbugs. And we have driver / subsystem maintainers such as Stefan
coming up and admitting that often a lot of code that's written to use
atomic_read() does assume the read will not be elided by the compiler.

See, I agree, volatility semantics != what we often want. However, if
what we want is compiler barrier, for only the object under consideration,
volatility semantics aren't really nonsensical or anything.


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Geert Uytterhoeven
On Thu, 16 Aug 2007, Linus Torvalds wrote:
 On Fri, 17 Aug 2007, Paul Mackerras wrote:
  I'm really surprised it's as much as a few K.  I tried it on powerpc
  and it only saved 40 bytes (10 instructions) for a G5 config.
 
 One of the things that volatile generally screws up is a simple
 
   volatile int i;
 
   i++;
 
 which a compiler will generally get horribly, horribly wrong.
 
 In a reasonable world, gcc should just make that be (on x86)
 
   addl $1,i(%rip)
 
 on x86-64, which is indeed what it does without the volatile. But with the 
 volatile, the compiler gets really nervous, and doesn't dare do it in one 
 instruction, and thus generates crap like
 
 movl    i(%rip), %eax
 addl    $1, %eax
 movl    %eax, i(%rip)
 
 instead. For no good reason, except that volatile just doesn't have any 
 good/clear semantics for the compiler, so most compilers will just make it 
 be I will not touch this access in any way, shape, or form. Including 
 even trivially correct instruction optimization/combination.

Apart from having to fetch more bytes for the instructions (which does
matter), execution time is probably the same on modern processors, as they
convert the single instruction to RISC-style load, modify, store anyway.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [EMAIL PROTECTED]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say programmer or something like that.
-- Linus Torvalds


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Nick Piggin

Satyam Sharma wrote:


#define atomic_read_volatile(v) \
({  \
forget((v)->counter);\
((v)->counter);  \
})

where:


*vomit* :)

Not only do I hate the keyword volatile, but the barrier is only a
one-sided affair so its probable this is going to have slightly
different allowed reorderings than a real volatile access.

Also, why would you want to make these insane accessors for atomic_t
types? Just make sure everybody knows the basics of barriers, and they
can apply that knowledge to atomic_t and all other lockless memory
accesses as well.



#define forget(a)   __asm__ __volatile__ ("" : "=m" (a) : "m" (a))


I like order(x) better, but it's not the most perfect name either.
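[For reference, a self-contained version of the quoted accessors with the asm constraints written out in full -- a sketch of the proposal under discussion, not kernel code; the type name is made up:]

```c
typedef struct { int counter; } atomic_sketch_t;

/* Tell GCC the object may have changed: the "=m" output invalidates
 * any cached register copy, so the next read must hit memory.
 * As Nick notes, this is only a one-sided ordering constraint. */
#define forget(a) __asm__ __volatile__("" : "=m" (a) : "m" (a))

#define atomic_read_volatile(v)     \
({                                  \
    forget((v)->counter);           \
    ((v)->counter);                 \
})
```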

--
SUSE Labs, Novell Inc.


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Satyam Sharma


On Thu, 16 Aug 2007, Paul E. McKenney wrote:

 On Fri, Aug 17, 2007 at 07:59:02AM +0800, Herbert Xu wrote:
  On Thu, Aug 16, 2007 at 09:34:41AM -0700, Paul E. McKenney wrote:
  
   The compiler can also reorder non-volatile accesses.  For an example
   patch that cares about this, please see:
   
 http://lkml.org/lkml/2007/8/7/280
   
   This patch uses an ORDERED_WRT_IRQ() in rcu_read_lock() and
   rcu_read_unlock() to ensure that accesses aren't reordered with respect
   to interrupt handlers and NMIs/SMIs running on that same CPU.
  
  Good, finally we have some code to discuss (even though it's
  not actually in the kernel yet).
 
 There was some earlier in this thread as well.

Hmm, I never quite got what all this interrupt/NMI/SMI handling and
RCU business you mentioned earlier was all about, but now that you've
pointed to the actual code and issues with it ...


  First of all, I think this illustrates that what you want
  here has nothing to do with atomic ops.  The ORDERED_WRT_IRQ
  macro occurs a lot more times in your patch than atomic
  reads/sets.  So *assuming* that it was necessary at all,
  then having an ordered variant of the atomic_read/atomic_set
  ops could do just as well.
 
 Indeed.  If I could trust atomic_read()/atomic_set() to cause the compiler
 to maintain ordering, then I could just use them instead of having to
 create an  ORDERED_WRT_IRQ().  (Or ACCESS_ONCE(), as it is called in a
 different patch.)

+#define WHATEVER(x)(*(volatile typeof(x) *)&(x))

I suppose one could want volatile access semantics for stuff that's
a bit-field too, no?

Also, this gives *zero* re-ordering guarantees that your code wants
(as you've explained it below) -- neither w.r.t. CPU re-ordering (which
probably you don't care about) *nor* w.r.t. compiler re-ordering
(which you definitely _do_ care about).


  However, I still don't know which atomic_read/atomic_set in
  your patch would be broken if there were no volatile.  Could
  you please point them out?
 
 Suppose I tried replacing the ORDERED_WRT_IRQ() calls with
 atomic_read() and atomic_set().  Starting with __rcu_read_lock():
 
 o If ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])++
   was ordered by the compiler after
   ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting + 1, then
   suppose an NMI/SMI happened after the rcu_read_lock_nesting but
   before the rcu_flipctr.
 
   Then if there was an rcu_read_lock() in the SMI/NMI
   handler (which is perfectly legal), the nested rcu_read_lock()
   would believe that it could take the then-clause of the
   enclosing if statement.  But because the rcu_flipctr per-CPU
   variable had not yet been incremented, an RCU updater would
   be within its rights to assume that there were no RCU reads
   in progress, thus possibly yanking a data structure out from
   under the reader in the SMI/NMI function.
 
   Fatal outcome.  Note that only one CPU is involved here
   because these are all either per-CPU or per-task variables.

Ok, so you don't care about CPU re-ordering. Still, I should let you know
that your ORDERED_WRT_IRQ() -- bad name, btw -- is still buggy. What you
want is a full compiler optimization barrier().

[ Your code probably works now, and emits correct code, but that's
  just because of gcc did what it did. Nothing in any standard,
  or in any documented behaviour of gcc, or anything about the real
  (or expected) semantics of volatile is protecting the code here. ]


 o If ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting + 1
   was ordered by the compiler to follow the
   ORDERED_WRT_IRQ(me->rcu_flipctr_idx) = idx, and an NMI/SMI
   happened between the two, then an __rcu_read_lock() in the NMI/SMI
   would incorrectly take the else clause of the enclosing if
   statement.  If some other CPU flipped the rcu_ctrlblk.completed
   in the meantime, then the __rcu_read_lock() would (correctly)
   write the new value into rcu_flipctr_idx.
 
   Well and good so far.  But the problem arises in
   __rcu_read_unlock(), which then decrements the wrong counter.
   Depending on exactly how subsequent events played out, this could
   result in either prematurely ending grace periods or never-ending
   grace periods, both of which are fatal outcomes.
 
 And the following are not needed in the current version of the
 patch, but will be in a future version that either avoids disabling
 irqs or that dispenses with the smp_read_barrier_depends() that I
 have 99% convinced myself is unneeded:
 
 o nesting = ORDERED_WRT_IRQ(me->rcu_read_lock_nesting);
 
 o idx = ORDERED_WRT_IRQ(rcu_ctrlblk.completed) & 0x1;
 
 Furthermore, in that future version, irq handlers can cause the same
 mischief that SMI/NMI handlers can in this version.
 
 Next, looking at __rcu_read_unlock():
 
 o If ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting - 1
   was reordered by the compiler to 

Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Stefan Richter
Nick Piggin wrote:
 I don't know why people would assume volatile of atomics. AFAIK, most
 of the documentation is pretty clear that all the atomic stuff can be
 reordered etc. except for those that modify and return a value.

Which documentation is there?

For driver authors, there is LDD3.  It doesn't specifically cover
effects of optimization on accesses to atomic_t.

For architecture port authors, there is Documentation/atomic_ops.txt.
Driver authors also can learn something from that document, as it
indirectly documents the atomic_t and bitops APIs.

Prompted by this thread, I reread this document, and indeed, the
sentence "Unlike the above routines, it is required that explicit memory
barriers are performed before and after [atomic_{inc,dec}_return]"
indicates that atomic_read (one of "the above routines") is very
different from all other atomic_t accessors that return values.

This is strange.  Why is it that atomic_read stands out that way?  IMO
this API imbalance is quite unexpected by many people.  Wouldn't it be
beneficial to change the atomic_read API to behave the same like all
other atomic_t accessors that return values?

OK, it is also different from the other accessors that return data in so
far as it doesn't modify the data.  But as driver author, i.e. user of
the API, I can't see much use of an atomic_read that can be reordered
and, more importantly, can be optimized away by the compiler.  Sure, now
that I learned of these properties I can start to audit code and insert
barriers where I believe they are needed, but this simply means that
almost all occurrences of atomic_read will get barriers (unless there
already are implicit but more or less obvious barriers like msleep).
-- 
Stefan Richter
-=-=-=== =--- =---=
http://arcgraph.de/sr/


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Nick Piggin

Stefan Richter wrote:

Nick Piggin wrote:


I don't know why people would assume volatile of atomics. AFAIK, most
of the documentation is pretty clear that all the atomic stuff can be
reordered etc. except for those that modify and return a value.



Which documentation is there?


Documentation/atomic_ops.txt



For driver authors, there is LDD3.  It doesn't specifically cover
effects of optimization on accesses to atomic_t.

For architecture port authors, there is Documentation/atomic_ops.txt.
Driver authors also can learn something from that document, as it
indirectly documents the atomic_t and bitops APIs.



"Semantics and Behavior of Atomic and Bitmask Operations" is
pretty direct :)

Sure, it says that it's for arch maintainers, but there is no
reason why users can't make use of it.



Prompted by this thread, I reread this document, and indeed, the
sentence "Unlike the above routines, it is required that explicit memory
barriers are performed before and after [atomic_{inc,dec}_return]"
indicates that atomic_read (one of the above routines) is very
different from all other atomic_t accessors that return values.

This is strange.  Why is it that atomic_read stands out that way?  IMO


It is not just atomic_read of course. It is atomic_add,sub,inc,dec,set.



this API imbalance is quite unexpected by many people.  Wouldn't it be
beneficial to change the atomic_read API to behave the same like all
other atomic_t accessors that return values?


It is very consistent and well defined. Operations which both modify
the data _and_ return something are defined to have full barriers
before and after.
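The rule described here can be sketched in userspace C. This is an illustration only: the name atomic_inc_return_ish is hypothetical, the real implementations are per-arch assembly, and GCC's __sync_add_and_fetch already implies a full barrier, so the explicit fences merely mark where the documented barriers sit.

```c
#include <assert.h>

typedef struct { volatile int counter; } atomic_t;

/* Userspace stand-in for the kernel's full memory barrier smp_mb(). */
#define smp_mb() __sync_synchronize()

/* Ops that both modify the data and return a value are documented to
 * imply full memory barriers before and after the operation.
 * (__sync_add_and_fetch is itself a full barrier on GCC, so the two
 * smp_mb() calls here are redundant and purely illustrative.) */
static int atomic_inc_return_ish(atomic_t *v)
{
	int ret;

	smp_mb();                                   /* barrier before */
	ret = __sync_add_and_fetch(&v->counter, 1);
	smp_mb();                                   /* barrier after  */
	return ret;
}
```

By contrast, plain atomic_read()/atomic_set() carry no such implied barriers, which is exactly the asymmetry being discussed.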

What do you want to add to the other atomic accessors? Full memory
barriers? Only compiler barriers? It's quite likely that if you think
some barriers will fix bugs, then there are other bugs lurking there
anyway.

Just use spinlocks if you're not absolutely clear about potential
races and memory ordering issues -- they're pretty cheap and simple.



OK, it is also different from the other accessors that return data in so
far as it doesn't modify the data.  But as driver author, i.e. user of
the API, I can't see much use of an atomic_read that can be reordered
and, more importantly, can be optimized away by the compiler.


It will return to you an atomic snapshot of the data (loaded from
memory at some point since the last compiler barrier). All you have
to be aware of is compiler barriers and the Linux SMP memory ordering
model, which should be a given if you are writing lockless code.



Sure, now
that I learned of these properties I can start to audit code and insert
barriers where I believe they are needed, but this simply means that
almost all occurrences of atomic_read will get barriers (unless there
already are implicit but more or less obvious barriers like msleep).


You might find that these places that appear to need barriers are
buggy for other reasons anyway. Can you point to some in-tree code
we can have a look at?

--
SUSE Labs, Novell Inc.


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Satyam Sharma


On Fri, 17 Aug 2007, Paul Mackerras wrote:

 Herbert Xu writes:
 
  On Fri, Aug 17, 2007 at 03:09:57PM +1000, Paul Mackerras wrote:
   Herbert Xu writes:
   
Can you find an actual atomic_read code snippet there that is
broken without the volatile modifier?
   
   There are some in arch-specific code, for example line 1073 of
   arch/mips/kernel/smtc.c.  On mips, cpu_relax() is just barrier(), so
   the empty loop body is ok provided that atomic_read actually does the
   load each time around the loop.
  
  A barrier() is all you need to force the compiler to reread
  the value.
  
  The people advocating volatile in this thread are talking
  about code that doesn't use barrier()/cpu_relax().
 
 Did you look at it?  Here it is:
 
   /* Someone else is initializing in parallel - let 'em finish */
   while (atomic_read(&idle_hook_initialized) < 1000)
   ;


Honestly, this thread is suffering from HUGE communication gaps.

What Herbert (obviously) meant there was that this loop could've
been okay _without_ using volatile-semantics-atomic_read() also, if
only it used cpu_relax().

That does work, because cpu_relax() is _at least_ barrier() on all
archs (on some it also emits some arch-dependent "pause" kind of
instruction).
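As a userspace sketch of that point (the names init_thread and wait_for_init are illustrative, not kernel API; cpu_relax() is modeled here as a compiler barrier, plus "pause" on x86, which is roughly what the kernel's per-arch definitions do):

```c
#include <assert.h>
#include <pthread.h>
#include <unistd.h>

#define barrier()    __asm__ __volatile__("" ::: "memory")
#if defined(__x86_64__) || defined(__i386__)
#define cpu_relax()  __asm__ __volatile__("pause" ::: "memory")
#else
#define cpu_relax()  barrier()
#endif

static int idle_hook_initialized;   /* stand-in for the smtc.c counter */

static void *init_thread(void *arg)
{
	(void)arg;
	usleep(1000);
	idle_hook_initialized = 1000;   /* "initialization finished" */
	return NULL;
}

/* Someone else is initializing in parallel - let 'em finish.  The
 * memory clobber in cpu_relax() forces the compiler to re-load the
 * variable on every iteration, so no volatile is needed on it. */
int wait_for_init(void)
{
	pthread_t t;
	pthread_create(&t, NULL, init_thread, NULL);
	while (idle_hook_initialized < 1000)
		cpu_relax();
	pthread_join(t, NULL);
	return idle_hook_initialized;
}
```

Remove the cpu_relax() and, at -O2, the compiler is entitled to hoist the load out of the loop and spin forever on a register copy.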

Now, saying that "MIPS does not have such an instruction so I won't
use cpu_relax() for arch-dependent-busy-while-loops in arch/mips/"
sounds like a wrong argument, because: tomorrow, such arch's _may_
introduce such an instruction, so naturally, at that time we'd
change cpu_relax() appropriately (in reality, we would actually
*re-define* cpu_relax() and ensure that the correct version gets
pulled in depending on whether the callsite code is legacy or only
for the newer such CPUs of said arch, whatever), but loops such as
this would remain un-changed, because they never used cpu_relax()!

OTOH an argument that said the following would've made a stronger case:

"I don't want to use cpu_relax() because that's a full memory
clobber barrier() and I have loop-invariants / other variables
around in that code that I *don't* want the compiler to forget
just because it used cpu_relax(), and hence I will not use
cpu_relax() but instead make my atomic_read() itself have
volatility semantics. Not just that, but I will introduce a
cpu_relax_no_barrier() on MIPS, that would be a no-op #define
for now, but which may not be so forever, and continue to use
that in such busy loops."

In general, please read the thread-summary I've tried to do at:
http://lkml.org/lkml/2007/8/17/25
Feel free to continue / comment / correct stuff from there, there's
too much confusion and circular-arguments happening on this thread
otherwise.

[ I might've made an incorrect statement there about
  volatile w.r.t. cache on non-x86 archs, I think. ]


Satyam


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Satyam Sharma


On Fri, 17 Aug 2007, Nick Piggin wrote:

 Satyam Sharma wrote:
 
  #define atomic_read_volatile(v) \
  ({  \
  forget((v)->counter);   \
  ((v)->counter); \
  })
  
  where:
 
 *vomit* :)

I wonder if this'll generate smaller and better code than _both_ the
other atomic_read_volatile() variants. Would need to build allyesconfig
on lots of diff arch's etc to test the theory though.


 Not only do I hate the keyword volatile, but the barrier is only a
 one-sided affair so its probable this is going to have slightly
 different allowed reorderings than a real volatile access.

True ...


 Also, why would you want to make these insane accessors for atomic_t
 types? Just make sure everybody knows the basics of barriers, and they
 can apply that knowledge to atomic_t and all other lockless memory
 accesses as well.

Code that looks like:

while (!atomic_read(&v)) {
...
cpu_relax_no_barrier();
forget(v.counter);

}

would be uglier. Also think about code such as:

a = atomic_read();
if (!a)
do_something();

forget();
a = atomic_read();
... /* some code that depends on value of a, obviously */

forget();
a = atomic_read();
...

So much explicit sprinkling of forget() looks ugly.

atomic_read_volatile()

on the other hand, looks neater. The _volatile() suffix also makes it
no less explicit than an explicit barrier-like macro that this primitive
is something special, for code clarity purposes.


  #define forget(a)   __asm__ __volatile__ ("" : "=m" (a) : "m" (a))
 
 I like order(x) better, but it's not the most perfect name either.

forget(x) is just a stupid-placeholder-for-a-better-name. order(x) sounds
good but we could leave quibbling about function or macro names for later,
this thread is noisy as it is :-)


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Nick Piggin

Satyam Sharma wrote:


On Fri, 17 Aug 2007, Herbert Xu wrote:



On Fri, Aug 17, 2007 at 01:43:27PM +1000, Paul Mackerras wrote:

BTW, the sort of missing barriers that triggered this thread
aren't that subtle.  It'll result in a simple lock-up if the
loop condition holds upon entry.  At which point it's fairly
straightforward to find the culprit.



Not necessarily. A barrier-less buggy code such as below:

atomic_set(&v, 0);

... /* some initial code */

while (atomic_read(&v))
;

... /* code that MUST NOT be executed unless v becomes non-zero */

(where v->counter has no volatile access semantics)

could be generated by the compiler to simply *elide* or *do away* with
the loop itself, thereby making the:

/* code that MUST NOT be executed unless v becomes non-zero */

to be executed even when v is zero! That is subtle indeed, and causes
no hard lockups.


Then I presume you mean

while (!atomic_read(&v))
;

Which is just the same old infinite loop bug solved with cpu_relax().
These are pretty trivial to audit and fix, and also to debug, I would
think.



Granted, the above IS buggy code. But, the stated objective is to avoid
heisenbugs.


Anyway, why are you making up code snippets that are buggy in other
ways in order to support this assertion being made that lots of kernel
code supposedly depends on volatile semantics. Just reference the
actual code.



And we have driver / subsystem maintainers such as Stefan
coming up and admitting that often a lot of code that's written to use
atomic_read() does assume the read will not be elided by the compiler.


So these are broken on i386 and x86-64?

Are they definitely safe on SMP and weakly ordered machines with
just a simple compiler barrier there? Because I would not be
surprised if there are a lot of developers who don't really know
what to assume when it comes to memory ordering issues.

This is not a dig at driver writers: we still have memory ordering
problems in the VM too (and probably most of the subtle bugs in
lockless VM code are memory ordering ones). Let's not make up a
false sense of security and hope that sprinkling volatile around
will allow people to write bug-free lockless code. If a writer
can't be bothered reading API documentation and learning the Linux
memory model, they can still be productive writing safely locked
code.

--
SUSE Labs, Novell Inc.


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Satyam Sharma


On Fri, 17 Aug 2007, Nick Piggin wrote:

 Stefan Richter wrote:
 [...]
 Just use spinlocks if you're not absolutely clear about potential
 races and memory ordering issues -- they're pretty cheap and simple.

I fully agree with this. As Paul Mackerras mentioned elsewhere,
a lot of authors sprinkle atomic_t in code thinking they're somehow
done with *locking*. This is sad, and I wonder if it's time for a
Documentation/atomic-considered-dodgy.txt kind of document :-)


  Sure, now
  that I learned of these properties I can start to audit code and insert
  barriers where I believe they are needed, but this simply means that
  almost all occurrences of atomic_read will get barriers (unless there
  already are implicit but more or less obvious barriers like msleep).
 
 You might find that these places that appear to need barriers are
 buggy for other reasons anyway. Can you point to some in-tree code
 we can have a look at?

Such code was mentioned elsewhere (query nodemgr_host_thread in cscope)
that managed to escape the requirement for a barrier only because of
some completely un-obvious compilation-unit-scope thing. But I find such
a non-explicit barrier quite bad taste. Stefan, do consider plunking an
explicit call to barrier() there.


Satyam


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Andi Kleen
On Friday 17 August 2007 05:42, Linus Torvalds wrote:
 On Fri, 17 Aug 2007, Paul Mackerras wrote:
  I'm really surprised it's as much as a few K.  I tried it on powerpc
  and it only saved 40 bytes (10 instructions) for a G5 config.

 One of the things that volatile generally screws up is a simple

   volatile int i;

   i++;

But for atomic_t people use atomic_inc() anyways which does this correctly.
It shouldn't really matter for atomic_t.

I'm worrying a bit that the volatile atomic_t change caused subtle code 
breakage like these delay read loops people here pointed out.
Wouldn't it be safer to just re-add the volatile to atomic_read() 
for 2.6.23? Or alternatively make it asm(), but volatile seems more
proven.

-Andi


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Nick Piggin

Satyam Sharma wrote:


On Fri, 17 Aug 2007, Nick Piggin wrote:



Sure, now
that I learned of these properties I can start to audit code and insert
barriers where I believe they are needed, but this simply means that
almost all occurrences of atomic_read will get barriers (unless there
already are implicit but more or less obvious barriers like msleep).


You might find that these places that appear to need barriers are
buggy for other reasons anyway. Can you point to some in-tree code
we can have a look at?



Such code was mentioned elsewhere (query nodemgr_host_thread in cscope)
that managed to escape the requirement for a barrier only because of
some completely un-obvious compilation-unit-scope thing. But I find such
an non-explicit barrier quite bad taste. Stefan, do consider plunking an
explicit call to barrier() there.


It is very obvious. msleep calls schedule() (ie. sleeps), which is
always a barrier.

The unobvious thing is that you wanted to know how the compiler knows
a function is a barrier -- answer is that if it does not *know* it is not
a barrier, it must assume it is a barrier. If the whole msleep call chain
including the scheduler were defined static in the current compilation
unit, then it would still be a barrier because it would actually be able
to see the barriers in schedule(void), if nothing else.
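
A minimal sketch of that compiler-visibility point (the function names here are illustrative; in real code opaque() would live in another translation unit, and it is defined in this file only so the example links and runs):

```c
#include <assert.h>

static int shared = 1;

/* Pretend this body is invisible (another translation unit): the
 * compiler must then assume the call can read or write any memory,
 * i.e. the call acts as a compiler barrier -- exactly the
 * msleep()/schedule() situation described above. */
void opaque(void)
{
	shared = 2;
}

int demo(void)
{
	int a = shared;    /* the compiler may cache this in a register */
	opaque();          /* ...but must re-load 'shared' after the call */
	return a + shared; /* 1 + 2 */
}
```

With link-time optimization or whole-program visibility this guarantee weakens, which is why the kernel's explicit barrier() remains the dependable tool.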


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Nick Piggin

Satyam Sharma wrote:


On Fri, 17 Aug 2007, Nick Piggin wrote:



Also, why would you want to make these insane accessors for atomic_t
types? Just make sure everybody knows the basics of barriers, and they
can apply that knowledge to atomic_t and all other lockless memory
accesses as well.



Code that looks like:

while (!atomic_read(v)) {
...
cpu_relax_no_barrier();
forget(v.counter);

}

would be uglier. Also think about code such as:


I think they would both be equally ugly, but the atomic_read_volatile
variant would be more prone to subtle bugs because of the weird
implementation.

And it would be more ugly than introducing an order(x) statement for
all memory operations, and adding an order_atomic() wrapper for it
for atomic types.



a = atomic_read();
if (!a)
do_something();

forget();
a = atomic_read();
... /* some code that depends on value of a, obviously */

forget();
a = atomic_read();
...

So much explicit sprinkling of forget() looks ugly.


Firstly, why is it ugly? It's nice because of those nice explicit
statements there that give us a good heads up and would have some
comments attached to them (also, lack of the word volatile is
always a plus).

Secondly, what sort of code would do such a thing? In most cases,
it is probably riddled with bugs anyway (unless it is doing a
really specific sequence of interrupts or something, but in that
case it is very likely to either require locking or busy waits
anyway - ie. barriers).



atomic_read_volatile()

on the other hand, looks neater. The _volatile() suffix makes it also
no less explicit than an explicit barrier-like macro that this primitive
is something special, for code clarity purposes.


Just don't use the word volatile, and have barriers both before
and after the memory operation, and I'm OK with it. I don't see
the point though, when you could just have a single barrier(x)
barrier function defined for all memory locations, rather than
this odd thing that only works for atomics (and would have to
be duplicated for atomic_set).


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Paul Mackerras
Satyam Sharma writes:

 I wonder if this'll generate smaller and better code than _both_ the
 other atomic_read_volatile() variants. Would need to build allyesconfig
 on lots of diff arch's etc to test the theory though.

I'm sure it would be a tiny effect.

This whole thread is arguing about effects that are quite
insignificant.  On the one hand we have the non-volatile proponents,
who want to let the compiler do extra optimizations - which amounts to
letting it elide maybe a dozen loads in the whole kernel, loads which
would almost always be L1 cache hits.

On the other hand we have the volatile proponents, who are concerned
that some code somewhere in the kernel might be buggy without the
volatile behaviour, and who also want to be able to remove some
barriers and thus save a few bytes of code and a few loads here and
there (and possibly some stores too).

Either way the effect on code size and execution time is miniscule.

In the end the strongest argument is actually that gcc generates
unnecessarily verbose code on x86[-64] for volatile accesses.  Even
then we're only talking about ~2000 bytes, or less than 1 byte per
instance of atomic_read on average, about 0.06% of the kernel text
size.

The x86[-64] developers seem to be willing to bear the debugging cost
involved in having the non-volatile behaviour for atomic_read, in
order to save the 2kB.  That's fine with me.  Either way I think
somebody should audit all the uses of atomic_read, not just for
missing barriers, but also to find the places where it's used in a
racy manner.  Then we can work out where the races matter and fix them
if they do.

Paul.


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Satyam Sharma


On Fri, 17 Aug 2007, Nick Piggin wrote:

 Satyam Sharma wrote:
  
  On Fri, 17 Aug 2007, Nick Piggin wrote:
 
Sure, now
that I learned of these properties I can start to audit code and insert
barriers where I believe they are needed, but this simply means that
almost all occurrences of atomic_read will get barriers (unless there
already are implicit but more or less obvious barriers like msleep).
   
   You might find that these places that appear to need barriers are
   buggy for other reasons anyway. Can you point to some in-tree code
   we can have a look at?
  
  
  Such code was mentioned elsewhere (query nodemgr_host_thread in cscope)
  that managed to escape the requirement for a barrier only because of
  some completely un-obvious compilation-unit-scope thing. But I find such
  a non-explicit barrier quite bad taste. Stefan, do consider plunking an
  explicit call to barrier() there.
 
 It is very obvious. msleep calls schedule() (ie. sleeps), which is
 always a barrier.

Probably you didn't mean that, but no, schedule() is not a barrier because
it sleeps. It's a barrier because it's invisible.

 The unobvious thing is that you wanted to know how the compiler knows
 a function is a barrier -- answer is that if it does not *know* it is not
 a barrier, it must assume it is a barrier.

True, that's clearly what happens here. But you're definitely joking
that this is obvious in terms of code-clarity, right?

Just 5 minutes back you mentioned elsewhere you like seeing lots of
explicit calls to barrier() (with comments, no less, hmm? :-)


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Satyam Sharma


On Fri, 17 Aug 2007, Andi Kleen wrote:

 On Friday 17 August 2007 05:42, Linus Torvalds wrote:
  On Fri, 17 Aug 2007, Paul Mackerras wrote:
   I'm really surprised it's as much as a few K.  I tried it on powerpc
   and it only saved 40 bytes (10 instructions) for a G5 config.
 
  One of the things that volatile generally screws up is a simple
 
  volatile int i;
 
  i++;
 
 But for atomic_t people use atomic_inc() anyways which does this correctly.
 It shouldn't really matter for atomic_t.
 
 I'm worrying a bit that the volatile atomic_t change caused subtle code 
 breakage like these delay read loops people here pointed out.

Umm, I followed most of the thread, but which breakage is this?

 Wouldn't it be safer to just re-add the volatile to atomic_read() 
 for 2.6.23? Or alternatively make it asm(), but volatile seems more
 proven.

The problem with volatile is not just trashy code generation (which also
definitely is a major problem), but definition holes, and implementation
inconsistencies. Making it asm() is not the only other alternative to
volatile either (read another reply to this mail), but considering most
of the thread has been about people not wanting even an
atomic_read_volatile() variant, making atomic_read() itself have volatile
semantics sounds ... strange :-)


PS: http://lkml.org/lkml/2007/8/15/407 was submitted a couple days back,
any word if you saw that?

I have another one for you:


[PATCH] i386, x86_64: __const_udelay() should not be marked inline

Because it can never get inlined in any callsite (each translation unit
is compiled separately for the kernel and so the implementation of
__const_udelay() would be invisible to all other callsites). In fact it
turns out, the correctness of callsites at arch/x86_64/kernel/crash.c:97
and arch/i386/kernel/crash.c:101 explicitly _depends_ upon it not being
inlined, and also it's an exported symbol (modules may want to call
mdelay() and udelay() that often becomes __const_udelay() after some
macro-ing in various headers). So let's not mark it as inline either.

Signed-off-by: Satyam Sharma [EMAIL PROTECTED]

---

 arch/i386/lib/delay.c   |2 +-
 arch/x86_64/lib/delay.c |2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/i386/lib/delay.c b/arch/i386/lib/delay.c
index f6edb11..0082c99 100644
--- a/arch/i386/lib/delay.c
+++ b/arch/i386/lib/delay.c
@@ -74,7 +74,7 @@ void __delay(unsigned long loops)
delay_fn(loops);
 }
 
-inline void __const_udelay(unsigned long xloops)
+void __const_udelay(unsigned long xloops)
 {
int d0;
 
diff --git a/arch/x86_64/lib/delay.c b/arch/x86_64/lib/delay.c
index 2dbebd3..d0cd9cd 100644
--- a/arch/x86_64/lib/delay.c
+++ b/arch/x86_64/lib/delay.c
@@ -38,7 +38,7 @@ void __delay(unsigned long loops)
 }
 EXPORT_SYMBOL(__delay);
 
-inline void __const_udelay(unsigned long xloops)
+void __const_udelay(unsigned long xloops)
 {
__delay(((xloops * HZ * 
cpu_data[raw_smp_processor_id()].loops_per_jiffy) >> 32) + 1);
 }


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Stefan Richter
Nick Piggin wrote:
 Satyam Sharma wrote:
 And we have driver / subsystem maintainers such as Stefan
 coming up and admitting that often a lot of code that's written to use
 atomic_read() does assume the read will not be elided by the compiler.
 
 So these are broken on i386 and x86-64?

The ieee1394 and firewire subsystems have open, undiagnosed bugs, also
on i386 and x86-64.  But whether there is any bug because of wrong
assumptions about atomic_read among them, I don't know.  I don't know
which assumptions the authors made, I only know that I wasn't aware of
all the properties of atomic_read until now.

 Are they definitely safe on SMP and weakly ordered machines with
 just a simple compiler barrier there? Because I would not be
 surprised if there are a lot of developers who don't really know
 what to assume when it comes to memory ordering issues.
 
 This is not a dig at driver writers: we still have memory ordering
 problems in the VM too (and probably most of the subtle bugs in
 lockless VM code are memory ordering ones). Let's not make up a
 false sense of security and hope that sprinkling volatile around
 will allow people to write bug-free lockless code. If a writer
 can't be bothered reading API documentation

...or, if there is none, the implementation specification (as in case of
the atomic ops), or, if there is none, the implementation (as in case of
a some infrastructure code here and there)...

 and learning the Linux memory model, they can still be productive
 writing safely locked code.

Provided they are aware that they might not have the full picture of the
lockless primitives.  :-)
-- 
Stefan Richter
-=-=-=== =--- =---=
http://arcgraph.de/sr/


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Nick Piggin

Satyam Sharma wrote:



On Fri, 17 Aug 2007, Nick Piggin wrote:



Satyam Sharma wrote:

It is very obvious. msleep calls schedule() (ie. sleeps), which is
always a barrier.



Probably you didn't mean that, but no, schedule() is not barrier because
it sleeps. It's a barrier because it's invisible.



Where did I say it is a barrier because it sleeps?

It is always a barrier because, at the lowest level, schedule() (and thus
anything that sleeps) is defined to always be a barrier. Regardless of
whatever obscure means the compiler might need to infer the barrier.

In other words, you can ignore those obscure details because schedule() is
always going to have an explicit barrier in it.



The unobvious thing is that you wanted to know how the compiler knows
a function is a barrier -- answer is that if it does not *know* it is not
a barrier, it must assume it is a barrier.



True, that's clearly what happens here. But you're definitely joking
that this is obvious in terms of code-clarity, right?



No. If you accept that barrier() is implemented correctly, and you know
that sleeping is defined to be a barrier, then its perfectly clear. You
don't have to know how the compiler knows that some function contains
a barrier.



Just 5 minutes back you mentioned elsewhere you like seeing lots of
explicit calls to barrier() (with comments, no less, hmm? :-)



Sure, but there are well known primitives which contain barriers, and
trivial recognisable code sequences for which you don't need comments.
waiting-loops using sleeps or cpu_relax() are prime examples.


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Nick Piggin

Satyam Sharma wrote:



On Fri, 17 Aug 2007, Nick Piggin wrote:


I think they would both be equally ugly,



You think both these are equivalent in terms of looks:

					|
	while (!atomic_read(&v)) {	|	while (!atomic_read_xxx(&v)) {
		...			|		...
		cpu_relax_no_barrier();	|		cpu_relax_no_barrier();
		order_atomic(&v);	|	}
	}				|

(where order_atomic() is an atomic_t
specific wrapper as you mentioned below)

?



I think the LHS is better if your atomic_read_xxx primitive is using the
crazy one-sided barrier, because with the LHS code you immediately know
what barriers are happening, and with the RHS you have to look at the
atomic_read_xxx definition.

If your atomic_read_xxx implementation was more intuitive, then both are
pretty well equal. More lines != ugly code.



but the atomic_read_volatile
variant would be more prone to subtle bugs because of the weird
implementation.



What bugs?



You can't think for yourself? Your atomic_read_volatile contains a compiler
barrier to the atomic variable before the load. 2 such reads from different
locations look like this:

asm volatile("" : "+m" (v1));
atomic_read(&v1);
asm volatile("" : "+m" (v2));
atomic_read(&v2);

Which implies that the load of v1 can be reordered to occur after the load
of v2. Bet you didn't expect that?


Secondly, what sort of code would do such a thing?



See the nodemgr_host_thread() that does something similar, though not
exactly same.



I'm sorry, all this waffling about made up code which might do this and
that is just a waste of time. Seriously, the thread is bloated enough
and never going to get anywhere with all this handwaving. If someone is
saving up all the really real and actually good arguments for why we
must have a volatile here, now is the time to use them.


and have barriers both before and after the memory operation,



How could that lead to bugs? (if you can point to existing code,
but just some testcase / sample code would be fine as well).



See above.


As I said, barrier() is too heavy-handed.



Typo. I meant: defined for a single memory location (ie. order(x)).



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Nick Piggin

Satyam Sharma wrote:



On Fri, 17 Aug 2007, Nick Piggin wrote:



Because they should be thinking about them in terms of barriers, over
which the compiler / CPU is not to reorder accesses or cache memory
operations, rather than special volatile accesses.



This is obviously just a taste thing. Whether to have that forget(x)
barrier as something author should explicitly sprinkle appropriately
in appropriate places in the code by himself or use a primitive that
includes it itself.



That's not obviously just taste to me. Not when the primitive has many
(perhaps, the majority) of uses that do not require said barriers. And
this is not solely about the code generation (which, as Paul says, is
relatively minor even on x86). I prefer people to think explicitly
about barriers in their lockless code.



I'm not saying taste matters aren't important (they are), but I'm really
skeptical if most folks would find the former tasteful.



So I /do/ have better taste than most folks? Thanks! :-)



And by the way, the point is *also* about the fact that cpu_relax(), as
of today, implies a full memory clobber, which is not what a lot of such
loops want. (due to stuff mentioned elsewhere, summarized in that summary)


That's not the point,



That's definitely the point, why not. This is why barrier(), being
heavy-handed, is not the best option.



That is _not_ the point (of why a volatile atomic_read is good) because
there has already been an alternative posted that better conforms with
the Linux barrier API and is much more widely useful and more usable. If
you are so worried about barrier() being too heavyweight, then you're
off to a poor start by wanting to add a few K of kernel text by making
atomic_read volatile.



because as I also mentioned, the logical extention
to Linux's barrier API to handle this is the order(x) macro. Again, not
special volatile accessors.



Sure, that forget(x) macro _is_ proposed to be made part of the generic
API. Doesn't explain why not to define/use primitives that has volatility
semantics in itself, though (taste matters apart).



If you follow the discussion: you were thinking of a reason why the
semantics *should* be changed or added, and I was rebutting your argument
that it must be used when a full barrier() is too heavy (ie. by pointing
out that order() has superior semantics anyway).

Why do I keep repeating the same things? I'll not continue bloating this
thread until a new valid point comes up...


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Satyam Sharma


On Fri, 17 Aug 2007, Nick Piggin wrote:

 Satyam Sharma wrote:
  On Fri, 17 Aug 2007, Nick Piggin wrote:
   Satyam Sharma wrote:
   
   It is very obvious. msleep calls schedule() (ie. sleeps), which is
   always a barrier.
  
  Probably you didn't mean that, but no, schedule() is not barrier because
  it sleeps. It's a barrier because it's invisible.
 
 Where did I say it is a barrier because it sleeps?

Just below. What you wrote:

 It is always a barrier because, at the lowest level, schedule() (and thus
 anything that sleeps) is defined to always be a barrier.

It is always a barrier because, at the lowest level, anything that sleeps
is defined to always be a barrier.


 Regardless of
 whatever obscure means the compiler might need to infer the barrier.
 
 In other words, you can ignore those obscure details because schedule() is
 always going to have an explicit barrier in it.

I didn't quite understand what you said here, so I'll tell what I think:

* foo() is a compiler barrier if the definition of foo() is invisible to
  the compiler at a callsite.

* foo() is also a compiler barrier if the definition of foo() includes
  a barrier, and it is inlined at the callsite.

If the above is wrong, or if there's something else at play as well,
do let me know.

   The unobvious thing is that you wanted to know how the compiler knows
   a function is a barrier -- answer is that if it does not *know* it is not
   a barrier, it must assume it is a barrier.
  
  True, that's clearly what happens here. But you're definitely joking
  that this is obvious in terms of code-clarity, right?
 
 No. If you accept that barrier() is implemented correctly, and you know
 that sleeping is defined to be a barrier,

Curiously, that's the second time you've said sleeping is defined to
be a (compiler) barrier. How does the compiler even know if foo() is
a function that sleeps? Do compilers have some notion of sleeping
to ensure they automatically assume a compiler barrier whenever such
a function is called? Or are you saying that the compiler can see the
barrier() inside said function ... nopes, you're saying quite the
opposite below.


 then its perfectly clear. You
 don't have to know how the compiler knows that some function contains
 a barrier.

I think I do, why not? Would appreciate if you could elaborate on this.


Satyam


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Satyam Sharma


On Fri, 17 Aug 2007, Nick Piggin wrote:

 Satyam Sharma wrote:
 [...]
  You think both these are equivalent in terms of looks:
  
  					|
  	while (!atomic_read(&v)) {	|	while (!atomic_read_xxx(&v)) {
  		...			|		...
  		cpu_relax_no_barrier();	|		cpu_relax_no_barrier();
  		order_atomic(&v);	|	}
  	}				|
  
  (where order_atomic() is an atomic_t
  specific wrapper as you mentioned below)
  
  ?
 
 I think the LHS is better if your atomic_read_xxx primitive is using the
 crazy one-sided barrier,
  ^

I'd say it's purposefully one-sided.

 because the LHS code you immediately know what
 barriers are happening, and with the RHS you have to look at the
 atomic_read_xxx
 definition.

No. As I said, the _xxx (whatever the heck you want to name it as) should
give the same heads-up that your order_atomic thing is supposed to give.


 If your atomic_read_xxx implementation was more intuitive, then both are
 pretty well equal. More lines != ugly code.
 
  [...]
  What bugs?
 
 You can't think for yourself? Your atomic_read_volatile contains a compiler
 barrier to the atomic variable before the load. 2 such reads from different
 locations look like this:
 
 asm volatile("" : "+m" (v1));
 atomic_read(&v1);
 asm volatile("" : "+m" (v2));
 atomic_read(&v2);
 
 Which implies that the load of v1 can be reordered to occur after the load
 of v2.

And how would that be a bug? (sorry, I really can't think for myself)


   Secondly, what sort of code would do such a thing?
  
  See the nodemgr_host_thread() that does something similar, though not
  exactly same.
 
 I'm sorry, all this waffling about made up code which might do this and
 that is just a waste of time.

First, you could try looking at the code.

And by the way, as I've already said (why do *require* people to have to
repeat things to you?) this isn't even about only existing code.


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Nick Piggin

Satyam Sharma wrote:



On Fri, 17 Aug 2007, Nick Piggin wrote:



Satyam Sharma wrote:


On Fri, 17 Aug 2007, Nick Piggin wrote:


Satyam Sharma wrote:

It is very obvious. msleep calls schedule() (ie. sleeps), which is
always a barrier.


Probably you didn't mean that, but no, schedule() is not barrier because
it sleeps. It's a barrier because it's invisible.


Where did I say it is a barrier because it sleeps?



Just below. What you wrote:



It is always a barrier because, at the lowest level, schedule() (and thus
anything that sleeps) is defined to always be a barrier.



It is always a barrier because, at the lowest level, anything that sleeps
is defined to always be a barrier.



... because it must call schedule and schedule is a barrier.



Regardless of
whatever obscure means the compiler might need to infer the barrier.

In other words, you can ignore those obscure details because schedule() is
always going to have an explicit barrier in it.



I didn't quite understand what you said here, so I'll tell what I think:

* foo() is a compiler barrier if the definition of foo() is invisible to
 the compiler at a callsite.

* foo() is also a compiler barrier if the definition of foo() includes
 a barrier, and it is inlined at the callsite.

If the above is wrong, or if there's something else at play as well,
do let me know.



Right.



The unobvious thing is that you wanted to know how the compiler knows
a function is a barrier -- answer is that if it does not *know* it is not
a barrier, it must assume it is a barrier.


True, that's clearly what happens here. But you're definitely joking
that this is obvious in terms of code-clarity, right?


No. If you accept that barrier() is implemented correctly, and you know
that sleeping is defined to be a barrier,



Curiously, that's the second time you've said sleeping is defined to
be a (compiler) barrier.



_In Linux,_ sleeping is defined to be a compiler barrier.


How does the compiler even know if foo() is
a function that sleeps? Do compilers have some notion of sleeping
to ensure they automatically assume a compiler barrier whenever such
a function is called? Or are you saying that the compiler can see the
barrier() inside said function ... nopes, you're saying quite the
opposite below.



You're getting too worried about the compiler implementation. Start
by assuming that it does work ;)



then its perfectly clear. You
don't have to know how the compiler knows that some function contains
a barrier.



I think I do, why not? Would appreciate if you could elaborate on this.



If a function is not completely visible to the compiler (so it can't
determine whether a barrier could be in it or not), then it must always
assume it will contain a barrier so it always does the right thing.


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Satyam Sharma


On Fri, 17 Aug 2007, Nick Piggin wrote:

 Satyam Sharma wrote:
 
  On Fri, 17 Aug 2007, Nick Piggin wrote:
  
   Because they should be thinking about them in terms of barriers, over
   which the compiler / CPU is not to reorder accesses or cache memory
   operations, rather than special volatile accesses.
  
  This is obviously just a taste thing. Whether to have that forget(x)
  barrier as something author should explicitly sprinkle appropriately
  in appropriate places in the code by himself or use a primitive that
  includes it itself.
 
 That's not obviously just taste to me. Not when the primitive has many
 (perhaps, the majority) of uses that do not require said barriers. And
 this is not solely about the code generation (which, as Paul says, is
 relatively minor even on x86).

See, you do *require* people to have to repeat the same things to you!

As has been written about enough times already, and if you followed the
discussion on this thread, I am *not* proposing that atomic_read()'s
semantics be changed to have any extra barriers. What is proposed is a
different atomic_read_xxx() variant thereof, that those can use who do
want that.

Now whether to have a kind of barrier (volatile, whatever) in the
atomic_read_xxx() itself, or whether to make the code writer himself to
explicitly write the order(x) appropriately in appropriate places in the
code _is_ a matter of taste.


  That's definitely the point, why not. This is why barrier(), being
  heavy-handed, is not the best option.
 
 That is _not_ the point [...]

Again, you're requiring me to repeat things that were already made evident
on this thread (if you follow it).

This _is_ the point, because a lot of loops out there (too many of them,
I WILL NOT bother citing file_name:line_number) end up having to use a
barrier just because they're using a loop-exit-condition that depends
on a value returned by atomic_read(). It would be good for them if they
used an atomic_read_xxx() primitive that gave these volatility semantics
without junking compiler optimizations for other memory references.

 because there has already been an alternative posted

Whether that alternative (explicitly using forget(x), or wrappers thereof,
such as the order_atomic you proposed) is better than other alternatives
(such as atomic_read_xxx() which includes the volatility behaviour in
itself) is still open, and precisely what we started discussing just one
mail back.

(The above was also mostly stuff I had to repeated for you, sadly.)

 that better conforms with Linux barrier
 API and is much more widely useful and more usable.

I don't think so.

(Now *this* _is_ the taste-dependent matter that I mentioned earlier.)

 If you are so worried
 about
 barrier() being too heavyweight, then you're off to a poor start by wanting to
 add a few K of kernel text by making atomic_read volatile.

Repeating myself, for the N'th time, NO, I DON'T want to make atomic_read
have volatile semantics.

   because as I also mentioned, the logical extention
   to Linux's barrier API to handle this is the order(x) macro. Again, not
   special volatile accessors.
  
  Sure, that forget(x) macro _is_ proposed to be made part of the generic
  API. Doesn't explain why not to define/use primitives that has volatility
 
  semantics in itself, though (taste matters apart).
^
 If you follow the discussion: you were thinking of a reason why the
 semantics *should* be changed or added, and I was rebutting your argument
 that it must be used when a full barrier() is too heavy (ie. by pointing
 out that order() has superior semantics anyway).

Amazing. Either you have reading comprehension problems, or else, please
try reading this thread (or at least this sub-thread) again. I don't want
_you_ blaming _me_ for having to repeat things to you all over again.


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Paul E. McKenney
On Fri, Aug 17, 2007 at 01:09:08PM +0530, Satyam Sharma wrote:
 
 
 On Thu, 16 Aug 2007, Paul E. McKenney wrote:
 
  On Fri, Aug 17, 2007 at 07:59:02AM +0800, Herbert Xu wrote:
   On Thu, Aug 16, 2007 at 09:34:41AM -0700, Paul E. McKenney wrote:
   
The compiler can also reorder non-volatile accesses.  For an example
patch that cares about this, please see:

http://lkml.org/lkml/2007/8/7/280

This patch uses an ORDERED_WRT_IRQ() in rcu_read_lock() and
rcu_read_unlock() to ensure that accesses aren't reordered with respect
to interrupt handlers and NMIs/SMIs running on that same CPU.
   
   Good, finally we have some code to discuss (even though it's
   not actually in the kernel yet).
  
  There was some earlier in this thread as well.
 
 Hmm, I never quite got what all this interrupt/NMI/SMI handling and
 RCU business you mentioned earlier was all about, but now that you've
 pointed to the actual code and issues with it ...

Glad to help...

   First of all, I think this illustrates that what you want
   here has nothing to do with atomic ops.  The ORDERED_WRT_IRQ
   macro occurs a lot more times in your patch than atomic
   reads/sets.  So *assuming* that it was necessary at all,
   then having an ordered variant of the atomic_read/atomic_set
   ops could do just as well.
  
  Indeed.  If I could trust atomic_read()/atomic_set() to cause the compiler
  to maintain ordering, then I could just use them instead of having to
  create an  ORDERED_WRT_IRQ().  (Or ACCESS_ONCE(), as it is called in a
  different patch.)
 
 +#define WHATEVER(x)  (*(volatile typeof(x) *)&(x))
 
 I suppose one could want volatile access semantics for stuff that's
 a bit-field too, no?

One could, but this is not supported in general.  So if you want that,
you need to use the usual bit-mask tricks and (for setting) atomic
operations.

 Also, this gives *zero* re-ordering guarantees that your code wants
 as you've explained it below) -- neither w.r.t. CPU re-ordering (which
 probably you don't care about) *nor* w.r.t. compiler re-ordering
 (which you definitely _do_ care about).

You are correct about CPU re-ordering (and about the fact that this
example doesn't care about it), but not about compiler re-ordering.

The compiler is prohibited from moving a volatile access across a sequence
point.  One example of a sequence point is a statement boundary.  Because
all of the volatile accesses in this code are separated by statement
boundaries, a conforming compiler is prohibited from reordering them.

   However, I still don't know which atomic_read/atomic_set in
   your patch would be broken if there were no volatile.  Could
   you please point them out?
  
  Suppose I tried replacing the ORDERED_WRT_IRQ() calls with
  atomic_read() and atomic_set().  Starting with __rcu_read_lock():
  
  o   If ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])++
  was ordered by the compiler after
  ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting + 1, then
  suppose an NMI/SMI happened after the rcu_read_lock_nesting but
  before the rcu_flipctr.
  
  Then if there was an rcu_read_lock() in the SMI/NMI
  handler (which is perfectly legal), the nested rcu_read_lock()
  would believe that it could take the then-clause of the
  enclosing if statement.  But because the rcu_flipctr per-CPU
  variable had not yet been incremented, an RCU updater would
  be within its rights to assume that there were no RCU reads
  in progress, thus possibly yanking a data structure out from
  under the reader in the SMI/NMI function.
  
  Fatal outcome.  Note that only one CPU is involved here
  because these are all either per-CPU or per-task variables.
 
 Ok, so you don't care about CPU re-ordering. Still, I should let you know
 that your ORDERED_WRT_IRQ() -- bad name, btw -- is still buggy. What you
 want is a full compiler optimization barrier().

No.  See above.

 [ Your code probably works now, and emits correct code, but that's
   just because of gcc did what it did. Nothing in any standard,
   or in any documented behaviour of gcc, or anything about the real
   (or expected) semantics of volatile is protecting the code here. ]

Really?  Why doesn't the prohibition against moving volatile accesses
across sequence points take care of this?

  o   If ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting + 1
  was ordered by the compiler to follow the
  ORDERED_WRT_IRQ(me->rcu_flipctr_idx) = idx, and an NMI/SMI
  happened between the two, then an __rcu_read_lock() in the NMI/SMI
  would incorrectly take the else clause of the enclosing if
  statement.  If some other CPU flipped the rcu_ctrlblk.completed
  in the meantime, then the __rcu_read_lock() would (correctly)
  write the new value into rcu_flipctr_idx.
  
  Well and good so far.  But the problem arises in
  __rcu_read_unlock(), which then decrements the wrong counter.
  Depending 

Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Linus Torvalds


On Fri, 17 Aug 2007, Nick Piggin wrote:
 
 That's not obviously just taste to me. Not when the primitive has many
 (perhaps, the majority) of uses that do not require said barriers. And
 this is not solely about the code generation (which, as Paul says, is
 relatively minor even on x86). I prefer people to think explicitly
 about barriers in their lockless code.

Indeed.

I think the important issues are:

 - "volatile" itself is simply a badly/weakly defined issue. The semantics 
   of it as far as the compiler is concerned are really not very good, and 
   in practice tends to boil down to "I will generate so bad code that 
   nobody can accuse me of optimizing anything away".

 - volatile - regardless of how well or badly defined it is - is purely 
   a compiler thing. It has absolutely no meaning for the CPU itself, so 
   it at no point implies any CPU barriers. As a result, even if the 
   compiler generates crap code and doesn't re-order anything, there's 
   nothing that says what the CPU will do.

 - in other words, the *only* possible meaning for volatile is a purely 
   single-CPU meaning. And if you only have a single CPU involved in the 
   process, the volatile is by definition pointless (because even 
   without a volatile, the compiler is required to make the C code appear 
   consistent as far as a single CPU is concerned).

So, let's take the example *buggy* code where we use volatile to wait 
for other CPU's:

	atomic_set(&var, 0);
	while (!atomic_read(&var))
		/* nothing */;


which generates an endless loop if we don't have atomic_read() imply 
volatile.

The point here is that it's buggy whether the volatile is there or not! 
Exactly because the user expects multi-processing behaviour, but 
volatile doesn't actually give any real guarantees about it. Another CPU 
may have done:

external_ptr = kmalloc(..);
/* Setup is now complete, inform the waiter */
	atomic_inc(&var);

but the fact is, since the other CPU isn't serialized in any way, the 
while-loop (even in the presense of volatile) doesn't actually work 
right! Whatever the atomic_read() was waiting for may not have 
completed, because we have no barriers!

So if "volatile" makes a difference, it is invariably a sign of a bug in 
serialization (the one exception is for IO - we use "volatile" to avoid 
having to use inline asm for IO on x86 - and for "random" values like 
jiffies).

So the question should *not* be whether volatile actually fixes bugs. It 
*never* fixes a bug. But what it can do is to hide the obvious ones. In 
other words, adding a volaile in the above kind of situation of 
atomic_read() will certainly turn an obvious bug into something that 
works practically all of the time).

So anybody who argues for "volatile" fixing bugs is fundamentally 
incorrect. It does NO SUCH THING. By arguing that, such people only show 
that they have no idea what they are talking about.

So the only reason to add back volatile to the atomic_read() sequence is 
not to fix bugs, but to _hide_ the bugs better. They're still there, they 
are just a lot harder to trigger, and tend to be a lot subtler.

And hey, sometimes hiding bugs well enough is ok. In this case, I'd 
argue that we've successfully *not* had the volatile there for eight 
months on x86-64, and that should tell people something. 

(Does _removing_ the volatile fix bugs? No - callers still need to think 
about barriers etc, and lots of people don't. So I'm not claiming that 
removing volatile fixes any bugs either, but I *am* claiming that:

 - removing volatile makes some bugs easier to see (which is mostly a good 
   thing: they were there before, anyway).

 - removing volatile generates better code (which is a good thing, even if 
   it's just 0.1%)

 - removing volatile removes a huge mental *bug* that lots of people seem 
   to have, as shown by this whole thread. Anybody who thinks that 
   volatile actually fixes anything has a gaping hole in their head, and 
   we should remove volatile just to make sure that nobody thinks that it 
   means something that it doesn't mean!

In other words, this whole discussion has just convinced me that we should 
*not* add back volatile to atomic_read() - I was willing to do it for 
practical and hide the bugs reasons, but having seen people argue for 
it, thinking that it actually fixes something, I'm now convinced that the 
*last* thing we should do is to encourage that kind of superstitious 
thinking.

volatile is like a black cat crossing the road. Sure, it affects 
*something* (at a minimum: before, the black cat was on one side of the 
road, afterwards it is on the other side of the road), but it has no 
bigger and longer-lasting direct effects. 

People who think volatile really matters are just fooling themselves.

Linus

Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Segher Boessenkool
Part of the motivation here is to fix heisenbugs.  If I knew 
where they


By the same token we should probably disable optimisations
altogether since that too can create heisenbugs.


Almost everything is a tradeoff; and so is this.  I don't
believe most people would find disabling all compiler
optimisations an acceptable price to pay for some peace
of mind.



So why is this a good tradeoff?

It certainly is better than disabling all compiler optimisations!


It's easy to be better than something really stupid :)


Sure, it wasn't me who made the comparison though.


So i386 and x86-64 don't have volatiles there, and it saves them a
few K of kernel text.


Which has to be investigated.  A few kB is a lot more than expected.


What you need to justify is why it is a good
tradeoff to make them volatile (which btw, is much harder to go
the other way after we let people make those assumptions).


My point is that people *already* made those assumptions.  There
are two ways to clean up this mess:

1) Have the volatile semantics by default, change the users
   that don't need it;
2) Have non-volatile semantics by default, change the users
   that do need it.

Option 2) randomly breaks stuff all over the place, option 1)
doesn't.  Yeah 1) could cause some extremely minor speed or
code size regression, but only temporarily until everything has
been audited.


I also think that just adding things to APIs in the hope it might fix
up some bugs isn't really a good road to go down. Where do you stop?

I look at it the other way: keeping the volatile semantics in
atomic_XXX() (or adding them to it, whatever) helps _prevent_ bugs;


Yeah, but we could add lots of things to help prevent bugs and
they would never be included. I would also contend that it helps _hide_
bugs and encourages people to be lazy when thinking about these
things.


Sure.  We aren't _adding_ anything here though, not on the platforms
where it is most likely to show up, anyway.


Also, you dismiss the fact that we'd actually be *adding* volatile
semantics back to the 2 most widely tested architectures (in terms
of test time, number of testers, variety of configurations, and
coverage of driver code).


I'm not dismissing that.  x86 however is one of the few architectures
where mistakenly leaving out a volatile will not easily show up on
user testing, since the compiler will very often produce a memory
reference anyway because it has no registers to play with.


This is a very important difference from
just keeping volatile semantics because it is basically a one-way
API change.


That's a good point.  Maybe we should create _two_ new APIs, one
explicitly going each way.


certainly most people expect that behaviour, and also that behaviour
is *needed* in some places and no other interface provides that
functionality.


I don't know that most people would expect that behaviour.


I didn't conduct any formal poll either :-)


Is there any documentation anywhere that would suggest this?


Not really I think, no.  But not the other way around, either.
Most uses of it seem to expect it though.


[some confusion about barriers wrt atomics snipped]


What were you confused about?


Me?  Not much.


Segher



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Satyam Sharma


On Fri, 17 Aug 2007, Segher Boessenkool wrote:

    atomic_dec() already has volatile behavior everywhere, so this is
    semantically okay, but this code (and any like it) should be calling
    cpu_relax() each iteration through the loop, unless there's a
    compelling reason not to.  I'll allow that for some hardware drivers
    (possibly this one) such a compelling reason may exist, but
    hardware-independent core subsystems probably have no excuse.
  
  No it does not have any volatile semantics. atomic_dec() can be reordered
  at will by the compiler within the current basic unit if you do not add a
  barrier.
 
 volatile has nothing to do with reordering.

If you're talking of volatile the type-qualifier keyword, then
http://lkml.org/lkml/2007/8/16/231 (and sub-thread below it) shows
otherwise.

 atomic_dec() writes
 to memory, so it _does_ have volatile semantics, implicitly, as
 long as the compiler cannot optimise the atomic variable away
 completely -- any store counts as a side effect.

I don't think an atomic_dec() implemented as an inline asm volatile
or one that uses a forget macro would have the same re-ordering
guarantees as an atomic_dec() that uses a volatile access cast.


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Chris Friesen

Linus Torvalds wrote:

 - in other words, the *only* possible meaning for volatile is a purely 
   single-CPU meaning. And if you only have a single CPU involved in the 
   process, the volatile is by definition pointless (because even 
   without a volatile, the compiler is required to make the C code appear 
   consistent as far as a single CPU is concerned).


I assume you mean except for IO-related code and 'random' values like 
jiffies as you mention later on?  I assume other values set in 
interrupt handlers would count as random from a volatility perspective?


So anybody who argues for volatile fixing bugs is fundamentally 
incorrect. It does NO SUCH THING. By arguing that, such people only show 
that you have no idea what they are talking about.


What about reading values modified in interrupt handlers, as in your 
random case?  Or is this a bug where the user of atomic_read() is 
invalidly expecting a read each time it is called?


Chris


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Satyam Sharma


On Fri, 17 Aug 2007, Paul E. McKenney wrote:

 On Fri, Aug 17, 2007 at 01:09:08PM +0530, Satyam Sharma wrote:
  
  On Thu, 16 Aug 2007, Paul E. McKenney wrote:
  
   On Fri, Aug 17, 2007 at 07:59:02AM +0800, Herbert Xu wrote:

First of all, I think this illustrates that what you want
here has nothing to do with atomic ops.  The ORDERED_WRT_IRQ
macro occurs a lot more times in your patch than atomic
reads/sets.  So *assuming* that it was necessary at all,
then having an ordered variant of the atomic_read/atomic_set
ops could do just as well.
   
   Indeed.  If I could trust atomic_read()/atomic_set() to cause the compiler
   to maintain ordering, then I could just use them instead of having to
   create an  ORDERED_WRT_IRQ().  (Or ACCESS_ONCE(), as it is called in a
   different patch.)
  
  +#define WHATEVER(x) (*(volatile typeof(x) *)&(x))
  [...]
  Also, this gives *zero* re-ordering guarantees that your code wants
  as you've explained it below) -- neither w.r.t. CPU re-ordering (which
  probably you don't care about) *nor* w.r.t. compiler re-ordering
  (which you definitely _do_ care about).
 
 You are correct about CPU re-ordering (and about the fact that this
 example doesn't care about it), but not about compiler re-ordering.
 
 The compiler is prohibited from moving a volatile access across a sequence
 point.  One example of a sequence point is a statement boundary.  Because
 all of the volatile accesses in this code are separated by statement
 boundaries, a conforming compiler is prohibited from reordering them.

Yes, you're right, and I believe precisely this was discussed elsewhere
as well today.

But I'd call attention to what Herbert mentioned there. You're using
ORDERED_WRT_IRQ() on stuff that is _not_ defined to be an atomic_t at all:

* Member completed of struct rcu_ctrlblk is a long.
* Per-cpu variable rcu_flipctr is an array of ints.
* Members rcu_read_lock_nesting and rcu_flipctr_idx of
  struct task_struct are ints.

So are you saying you're having to use this volatile-access macro
because you *couldn't* declare all the above as atomic_t and thus just
expect the right thing to happen by using the atomic ops API by default,
because it lacks volatile access semantics (on x86)?

If so, then I wonder if using the volatile access cast is really the
best way to achieve (at least in terms of code clarity) the kind of
re-ordering guarantees it wants there. (there could be alternative
solutions, such as using barrier(), or that at bottom of this mail)

What I mean is this: If you switch to atomic_t, and x86 switched to
make atomic_t have volatile semantics by default, the statements
would be simply a string of: atomic_inc(), atomic_add(), atomic_set(),
and atomic_read() statements, and nothing in there that clearly makes
it *explicit* that the code is correct (and not buggy) simply because
of the re-ordering guarantees that the C volatile type-qualifier
keyword gives us as per the standard. But now we're firmly in
subjective territory, so you or anybody could legitimately disagree.


   Suppose I tried replacing the ORDERED_WRT_IRQ() calls with
   atomic_read() and atomic_set().  Starting with __rcu_read_lock():
   
   o If ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])++
 was ordered by the compiler after
 ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting + 1, then
 suppose an NMI/SMI happened after the rcu_read_lock_nesting but
 before the rcu_flipctr.
   
 Then if there was an rcu_read_lock() in the SMI/NMI
 handler (which is perfectly legal), the nested rcu_read_lock()
 would believe that it could take the then-clause of the
 enclosing if statement.  But because the rcu_flipctr per-CPU
 variable had not yet been incremented, an RCU updater would
 be within its rights to assume that there were no RCU reads
 in progress, thus possibly yanking a data structure out from
 under the reader in the SMI/NMI function.
   
 Fatal outcome.  Note that only one CPU is involved here
 because these are all either per-CPU or per-task variables.
  
  Ok, so you don't care about CPU re-ordering. Still, I should let you know
  that your ORDERED_WRT_IRQ() -- bad name, btw -- is still buggy. What you
  want is a full compiler optimization barrier().
 
 No.  See above.

True, *(volatile foo *) _will_ work for this case.

But multiple calls to barrier() (granted, would invalidate all other
optimizations also) would work as well, would it not?

[ Interestingly, if you declared all those objects mentioned earlier as
  atomic_t, and x86(-64) switched to an __asm__ __volatile__ based variant
  for atomic_{read,set}_volatile(), the bugs you want to avoid would still
  be there. volatile the C language type-qualifier does have compiler
  re-ordering semantics you mentioned earlier, but the volatile that
  applies to inline asm()s gives no re-ordering guarantees. ]


   o If ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting 

Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Paul E. McKenney
On Sat, Aug 18, 2007 at 12:01:38AM +0530, Satyam Sharma wrote:
 
 
 On Fri, 17 Aug 2007, Paul E. McKenney wrote:
 
  On Fri, Aug 17, 2007 at 01:09:08PM +0530, Satyam Sharma wrote:
   
   On Thu, 16 Aug 2007, Paul E. McKenney wrote:
   
On Fri, Aug 17, 2007 at 07:59:02AM +0800, Herbert Xu wrote:
 
 First of all, I think this illustrates that what you want
 here has nothing to do with atomic ops.  The ORDERED_WRT_IRQ
 macro occurs a lot more times in your patch than atomic
 reads/sets.  So *assuming* that it was necessary at all,
 then having an ordered variant of the atomic_read/atomic_set
 ops could do just as well.

Indeed.  If I could trust atomic_read()/atomic_set() to cause the 
compiler
to maintain ordering, then I could just use them instead of having to
create an  ORDERED_WRT_IRQ().  (Or ACCESS_ONCE(), as it is called in a
different patch.)
   
   +#define WHATEVER(x)  (*(volatile typeof(x) *)&(x))
   [...]
   Also, this gives *zero* re-ordering guarantees that your code wants
   as you've explained it below) -- neither w.r.t. CPU re-ordering (which
   probably you don't care about) *nor* w.r.t. compiler re-ordering
   (which you definitely _do_ care about).
  
  You are correct about CPU re-ordering (and about the fact that this
  example doesn't care about it), but not about compiler re-ordering.
  
  The compiler is prohibited from moving a volatile access across a sequence
  point.  One example of a sequence point is a statement boundary.  Because
  all of the volatile accesses in this code are separated by statement
  boundaries, a conforming compiler is prohibited from reordering them.
 
 Yes, you're right, and I believe precisely this was discussed elsewhere
 as well today.
 
 But I'd call attention to what Herbert mentioned there. You're using
 ORDERED_WRT_IRQ() on stuff that is _not_ defined to be an atomic_t at all:
 
 * Member completed of struct rcu_ctrlblk is a long.
 * Per-cpu variable rcu_flipctr is an array of ints.
 * Members rcu_read_lock_nesting and rcu_flipctr_idx of
   struct task_struct are ints.
 
 So are you saying you're having to use this volatile-access macro
 because you *couldn't* declare all the above as atomic_t and thus just
 expect the right thing to happen by using the atomic ops API by default,
 because it lacks volatile access semantics (on x86)?
 
 If so, then I wonder if using the volatile access cast is really the
 best way to achieve (at least in terms of code clarity) the kind of
 re-ordering guarantees it wants there. (there could be alternative
 solutions, such as using barrier(), or that at bottom of this mail)
 
 What I mean is this: If you switch to atomic_t, and x86 switched to
 make atomic_t have volatile semantics by default, the statements
 would be simply a string of: atomic_inc(), atomic_add(), atomic_set(),
 and atomic_read() statements, and nothing in there that clearly makes
 it *explicit* that the code is correct (and not buggy) simply because
 of the re-ordering guarantees that the C volatile type-qualifier
 keyword gives us as per the standard. But now we're firmly in
 subjective territory, so you or anybody could legitimately disagree.

In any case, given Linus's note, it appears that atomic_read() and
atomic_set() won't consistently have volatile semantics, at least
not while the compiler generates such ugly code for volatile accesses.
So I will continue with my current approach.

In any case, I will not be using atomic_inc() or atomic_add() in this
code, as doing so would more than double the overhead, even on machines
that are the most efficient at implementing atomic operations.

Suppose I tried replacing the ORDERED_WRT_IRQ() calls with
atomic_read() and atomic_set().  Starting with __rcu_read_lock():

o   If ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])++
was ordered by the compiler after
ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting + 1, then
suppose an NMI/SMI happened after the rcu_read_lock_nesting but
before the rcu_flipctr.

Then if there was an rcu_read_lock() in the SMI/NMI
handler (which is perfectly legal), the nested rcu_read_lock()
would believe that it could take the then-clause of the
enclosing if statement.  But because the rcu_flipctr per-CPU
variable had not yet been incremented, an RCU updater would
be within its rights to assume that there were no RCU reads
in progress, thus possibly yanking a data structure out from
under the reader in the SMI/NMI function.

Fatal outcome.  Note that only one CPU is involved here
because these are all either per-CPU or per-task variables.
   
   Ok, so you don't care about CPU re-ordering. Still, I should let you know
   that your ORDERED_WRT_IRQ() -- bad name, btw -- is still buggy. What you
    want is a full compiler optimization barrier().

Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Arjan van de Ven

On Fri, 2007-08-17 at 12:50 -0600, Chris Friesen wrote:
 Linus Torvalds wrote:
 
   - in other words, the *only* possible meaning for volatile is a purely 
 single-CPU meaning. And if you only have a single CPU involved in the 
 process, the volatile is by definition pointless (because even 
 without a volatile, the compiler is required to make the C code appear 
 consistent as far as a single CPU is concerned).
 
 I assume you mean except for IO-related code and 'random' values like 
 jiffies as you mention later on?  I assume other values set in 
 interrupt handlers would count as random from a volatility perspective?
 
  So anybody who argues for volatile fixing bugs is fundamentally 
  incorrect. It does NO SUCH THING. By arguing that, such people only show 
  that you have no idea what they are talking about.
 
 What about reading values modified in interrupt handlers, as in your 
 random case?  Or is this a bug where the user of atomic_read() is 
 invalidly expecting a read each time it is called?

the interrupt handler case is an SMP case since you do not know
beforehand what cpu your interrupt handler will run on.





Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Paul E. McKenney
On Fri, Aug 17, 2007 at 11:54:33AM -0700, Arjan van de Ven wrote:
 
 On Fri, 2007-08-17 at 12:50 -0600, Chris Friesen wrote:
  Linus Torvalds wrote:
  
- in other words, the *only* possible meaning for volatile is a purely 
  single-CPU meaning. And if you only have a single CPU involved in the 
  process, the volatile is by definition pointless (because even 
  without a volatile, the compiler is required to make the C code appear 
  consistent as far as a single CPU is concerned).
  
  I assume you mean except for IO-related code and 'random' values like 
  jiffies as you mention later on?  I assume other values set in 
  interrupt handlers would count as random from a volatility perspective?
  
   So anybody who argues for volatile fixing bugs is fundamentally 
   incorrect. It does NO SUCH THING. By arguing that, such people only show 
   that you have no idea what they are talking about.
  
  What about reading values modified in interrupt handlers, as in your 
  random case?  Or is this a bug where the user of atomic_read() is 
  invalidly expecting a read each time it is called?
 
 the interrupt handler case is an SMP case since you do not know
 beforehand what cpu your interrupt handler will run on.

With the exception of per-CPU variables, yes.

Thanx, Paul


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Arjan van de Ven

On Fri, 2007-08-17 at 12:49 -0700, Paul E. McKenney wrote:
   What about reading values modified in interrupt handlers, as in your 
   random case?  Or is this a bug where the user of atomic_read() is 
   invalidly expecting a read each time it is called?
  
  the interrupt handler case is an SMP case since you do not know
  beforehand what cpu your interrupt handler will run on.
 
 With the exception of per-CPU variables, yes.

if you're spinning waiting for a per-CPU variable to get changed by an
interrupt handler... you have bigger problems than volatile ;-)

-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via 
http://www.linuxfirmwarekit.org



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Segher Boessenkool

Of course, since *normal* accesses aren't necessarily limited wrt
re-ordering, the question then becomes one of "with regard to *what*
does it limit re-ordering?".

A C compiler that re-orders two different volatile accesses that have a
sequence point in between them is pretty clearly a buggy compiler. So at
a minimum, it limits re-ordering wrt other volatiles (assuming sequence
points exists). It also means that the compiler cannot move it
speculatively across conditionals, but other than that it's starting to
get fuzzy.


This is actually really well-defined in C, not fuzzy at all.
Volatile accesses are a side effect, and no side effects can
be reordered with respect to sequence points.  The side effects
that matter in the kernel environment are: 1) accessing a volatile
object; 2) modifying an object; 3) volatile asm(); 4) calling a
function that does any of these.

We certainly should avoid volatile whenever possible, but because
it's fuzzy wrt reordering is not a reason -- all alternatives have
exactly the same issues.


Segher



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Segher Boessenkool

(and yes, it is perfectly legitimate to
want a non-volatile read for a data type that you also want to do
atomic RMW operations on)


...which is undefined behaviour in C (and GCC) when that data is
declared volatile, which is a good argument against implementing
atomics that way in itself.


Segher



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Segher Boessenkool

In a reasonable world, gcc should just make that be (on x86)

addl $1,i(%rip)

on x86-64, which is indeed what it does without the volatile. But with
the volatile, the compiler gets really nervous, and doesn't dare do it
in one instruction, and thus generates crap like

	movl	i(%rip), %eax
	addl	$1, %eax
	movl	%eax, i(%rip)

instead. For no good reason, except that volatile just doesn't have any
good/clear semantics for the compiler, so most compilers will just make
it be "I will not touch this access in any way, shape, or form".
Including even trivially correct instruction optimization/combination.


It's just a (target-specific, perhaps) missed-optimisation kind
of bug in GCC.  Care to file a bug report?


but is
(again) something that gcc doesn't dare do, since i is volatile.


Just nobody taught it it can do this; perhaps no one wanted to
add optimisations like that, maybe with a reasoning like "people
who hit the go-slow-in-unspecified-ways button should get what
they deserve" ;-)


Segher



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Segher Boessenkool

Now the second wording *IS* technically correct, but come on, it's
24 words long whereas the original one was 3 -- and hopefully anybody
reading the shorter phrase *would* have known anyway what was meant,
without having to be pedantic about it :-)


Well you were talking pretty formal (and detailed) stuff, so
IMHO it's good to have that exactly correct.  Sure it's nicer
to use small words most of the time :-)


Segher



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Segher Boessenkool

#define forget(a)	__asm__ __volatile__ ("" : "=m" (a) : "m" (a))

[ This is exactly equivalent to using "+m" in the constraints, as
  recently explained on a GCC list somewhere, in response to the patch
  in my bitops series a few weeks back where I thought "+m" was bogus. ]


[It wasn't explained on a GCC list in response to your patch, as
far as I can see -- if I missed it, please point me to an archived
version of it].

One last time: it isn't equivalent on older (but still supported
by Linux) versions of GCC.  Current versions of GCC allow it, but
it has no documented behaviour at all, so use it at your own risk.


Segher



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Segher Boessenkool
Here, I should obviously admit that the semantics of *(volatile int *)
aren't any neater or well-defined in the _language standard_ at all. The
standard does say (verbatim) "what constitutes an access to an object
that has volatile-qualified type is implementation-defined", but GCC
does help us out here by doing the right thing.

Where do you get that idea?


Try a testcase (experimentally verify).


That doesn't prove anything.  Experiments can only disprove
things.


GCC manual, section 6.1, "When
is a Volatile Object Accessed?" doesn't say anything of the
kind.


True, "implementation-defined" as per the C standard _is_ supposed to
mean "unspecified behaviour where each implementation documents how the
choice is made". So ok, probably GCC isn't documenting this
implementation-defined behaviour which it is supposed to, but can't
really fault them much for this, probably.


GCC _is_ documenting this, namely in this section 6.1.  It doesn't
mention volatile-casted stuff.  Draw your own conclusions.


Segher



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Satyam Sharma


On Sat, 18 Aug 2007, Segher Boessenkool wrote:

  #define forget(a)	__asm__ __volatile__ ("" : "=m" (a) : "m" (a))
  
  [ This is exactly equivalent to using "+m" in the constraints, as recently
    explained on a GCC list somewhere, in response to the patch in my bitops
    series a few weeks back where I thought "+m" was bogus. ]
 
 [It wasn't explained on a GCC list in response to your patch, as
 far as I can see -- if I missed it, please point me to an archived
 version of it].

http://gcc.gnu.org/ml/gcc-patches/2007-07/msg01758.html

is a follow-up in the thread on the [EMAIL PROTECTED] mailing list,
which began with:

http://gcc.gnu.org/ml/gcc-patches/2007-07/msg01677.html

that was posted by Jan Hubicka, as he quotes in that initial posting,
after I had submitted:

http://lkml.org/lkml/2007/7/23/252

which was a (wrong) patch to rectify what I thought was the bogus
+m constraint, as per the quoted extract from gcc docs (that was
given in that (wrong) patch's changelog).

That's when _I_ came to know how GCC interprets +m, but probably
this has been explained on those lists multiple times. Who cares,
anyway?


 One last time: it isn't equivalent on older (but still supported
 by Linux) versions of GCC.  Current versions of GCC allow it, but
 it has no documented behaviour at all, so use it at your own risk.

True.


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Satyam Sharma


On Sat, 18 Aug 2007, Segher Boessenkool wrote:

No it does not have any volatile semantics. atomic_dec() can be
reordered at will by the compiler within the current basic unit
if you do not add a barrier.
   
   volatile has nothing to do with reordering.
  
  If you're talking of volatile the type-qualifier keyword, then
  http://lkml.org/lkml/2007/8/16/231 (and sub-thread below it) shows
  otherwise.
 
 I'm not sure what in that mail you mean, but anyway...
 
 Yes, of course, the fact that volatile creates a side effect
 prevents certain things from being reordered wrt the atomic_dec();
 but the atomic_dec() has a side effect *already* so the volatile
 doesn't change anything.

That's precisely what that sub-thread (read down to the last mail
there, and not the first mail only) shows. So yes, volatile does
have something to do with re-ordering (as guaranteed by the C
standard).


   atomic_dec() writes
   to memory, so it _does_ have volatile semantics, implicitly, as
   long as the compiler cannot optimise the atomic variable away
   completely -- any store counts as a side effect.
  
  I don't think an atomic_dec() implemented as an inline asm volatile
  or one that uses a forget macro would have the same re-ordering
  guarantees as an atomic_dec() that uses a volatile access cast.
 
 The asm volatile implementation does have exactly the same
 reordering guarantees as the volatile cast thing,

I don't think so.

 if that is
 implemented by GCC in the obvious way.  Even a plain asm()
 will do the same.

Read the relevant GCC documentation.

[ of course, if the (latest) GCC documentation is *yet again*
  wrong, then alright, not much I can do about it, is there. ]


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Segher Boessenkool

#define forget(a)	__asm__ __volatile__ ("" : "=m" (a) : "m" (a))

[ This is exactly equivalent to using "+m" in the constraints, as
  recently explained on a GCC list somewhere, in response to the patch
  in my bitops series a few weeks back where I thought "+m" was bogus. ]


[It wasn't explained on a GCC list in response to your patch, as
far as I can see -- if I missed it, please point me to an archived
version of it].


http://gcc.gnu.org/ml/gcc-patches/2007-07/msg01758.html


Ah yes, that old thread, thank you.


That's when _I_ came to know how GCC interprets +m, but probably
this has been explained on those lists multiple times. Who cares,
anyway?


I just couldn't find the thread you meant, I thought I might
have missed it, that's all :-)


Segher



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Paul E. McKenney
On Thu, Aug 16, 2007 at 08:50:30PM -0700, Linus Torvalds wrote:
 Just try it yourself:
 
   volatile int i;
   int j;
 
   int testme(void)
   {
    return i <= 1;
   }
 
   int testme2(void)
   {
    return j <= 1;
   }
 
 and compile with all the optimizations you can.
 
 I get:
 
   testme:
   	movl	i(%rip), %eax
   	subl	$1, %eax
   	setle	%al
   	movzbl	%al, %eax
   	ret
 
 vs
 
   testme2:
   	xorl	%eax, %eax
   	cmpl	$1, j(%rip)
   	setle	%al
   	ret
 
 (now, whether that xorl + setle is better than setle + movzbl, I don't 
 really know - maybe it is. But that's not the point. The point is the 
 difference between
 
 	movl	i(%rip), %eax
 	subl	$1, %eax
 
 and
 
 	cmpl	$1, j(%rip)

gcc bugzilla bug #33102, for whatever that ends up being worth.  ;-)

Thanx, Paul


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Segher Boessenkool

atomic_dec() writes
to memory, so it _does_ have volatile semantics, implicitly, as
long as the compiler cannot optimise the atomic variable away
completely -- any store counts as a side effect.


I don't think an atomic_dec() implemented as an inline asm volatile
or one that uses a forget macro would have the same re-ordering
guarantees as an atomic_dec() that uses a volatile access cast.


The asm volatile implementation does have exactly the same
reordering guarantees as the volatile cast thing,


I don't think so.


asm volatile creates a side effect.  Side effects aren't
allowed to be reordered wrt sequence points.  This is exactly
the same reason as why volatile accesses cannot be reordered.


if that is
implemented by GCC in the obvious way.  Even a plain asm()
will do the same.


Read the relevant GCC documentation.


I did, yes.


[ of course, if the (latest) GCC documentation is *yet again*
  wrong, then alright, not much I can do about it, is there. ]


There was (and is) nothing wrong about the "+m" documentation, if
that is what you are talking about.  It could be extended now, to
allow "+m" -- but that takes more than just fixing the documentation.


Segher



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Herbert Xu
On Fri, Aug 17, 2007 at 04:59:12PM -0700, Paul E. McKenney wrote:

 gcc bugzilla bug #33102, for whatever that ends up being worth.  ;-)

I had totally forgotten that I'd already filed that bug more
than six years ago until they just closed yours as a duplicate
of mine :)

Good luck in getting it fixed!

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

