Re: [lttng-dev] Xeon Phi memory barriers

2013-12-06 Thread Simon Marchi
ping.

On 19 November 2013 10:26, Simon Marchi simon.mar...@polymtl.ca wrote:
 Hello there,

 liburcu does not build on the Intel Xeon Phi, because the chip is
 recognized as x86_64, but lacks the {s,l,m}fence instructions found on
 usual x86_64 processors. The following is taken from the Xeon Phi dev
 guide:

 The Intel® Xeon Phi™ coprocessor memory model is the same as that of
 the Intel® Pentium processor. The reads and writes always appear in
 programmed order at the system bus (or the ring interconnect in the
 case of the Intel® Xeon Phi™ coprocessor); the exception being that
 read misses are permitted to go ahead of buffered writes on the system
 bus when all the buffered writes are cached hits and are, therefore,
 not directed to the same address being accessed by the read miss.

 As a consequence of its stricter memory ordering model, the Intel®
 Xeon Phi™ coprocessor does not support the SFENCE, LFENCE, and MFENCE
 instructions that provide a more efficient way of controlling memory
 ordering on other Intel processors.

 While reads and writes from an Intel® Xeon Phi™ coprocessor appear in
 program order on the system bus, the compiler can still reorder
 unrelated memory operations while maintaining program order on a
 single Intel® Xeon Phi™ coprocessor (hardware thread). If software
 running on an Intel® Xeon Phi™ coprocessor is dependent on the order
 of memory operations on another Intel® Xeon Phi™ coprocessor then a
 serializing instruction (e.g., CPUID, instruction with a LOCK prefix)
 between the memory operations is required to guarantee completion of
 all memory accesses issued prior to the serializing instruction before
 any subsequent memory operations are started.

 (end of quote)

 From what I understand, it is safe to leave out any run-time memory
 barriers, but we still need barriers that prevent the compiler from
 reordering (using __asm__ __volatile__ ("":::"memory")). In
 urcu/arch/x86.h, I see that when CONFIG_RCU_HAVE_FENCE is false,
 memory barriers result in both compile-time and run-time memory
 barriers:  __asm__ __volatile__ ("lock; addl $0,0(%%esp)":::"memory").
 I guess this would work for the Phi, but the lock instruction does not
 seem necessary.

 So, should we just set CONFIG_RCU_HAVE_FENCE to false when compiling
 for the Phi and go on with our lives, or should we add a specific
 config for this case?

 Simon

___
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [lttng-dev] Xeon Phi memory barriers

2013-12-06 Thread Mathieu Desnoyers
- Original Message -
 From: Simon Marchi simon.mar...@polymtl.ca
 To: lttng-dev@lists.lttng.org
 Sent: Tuesday, November 19, 2013 4:26:06 PM
 Subject: [lttng-dev] Xeon Phi memory barriers
 
 Hello there,

Hi Simon,

While reading this reply, please keep in mind that I'm in a
mindset where I've been in a full week of meetings, and it's late on
Friday evening here. So YMMV ;-) I'm CCing Paul E. McKenney, so he can
debunk my answer :)

 
 liburcu does not build on the Intel Xeon Phi, because the chip is
 recognized as x86_64, but lacks the {s,l,m}fence instructions found on
 usual x86_64 processors. The following is taken from the Xeon Phi dev
 guide:

Let's have a look:

 
 The Intel® Xeon Phi™ coprocessor memory model is the same as that of
 the Intel® Pentium processor. The reads and writes always appear in
 programmed order at the system bus (or the ring interconnect in the
 case of the Intel® Xeon Phi™ coprocessor); the exception being that
 read misses are permitted to go ahead of buffered writes on the system
 bus when all the buffered writes are cached hits and are, therefore,
 not directed to the same address being accessed by the read miss.

OK, so reads can be reordered with respect to following writes.

 
 As a consequence of its stricter memory ordering model, the Intel®
 Xeon Phi™ coprocessor does not support the SFENCE, LFENCE, and MFENCE
 instructions that provide a more efficient way of controlling memory
 ordering on other Intel processors.

I guess sfence and lfence are indeed completely useless, because we can
only ever care about ordering reads vs writes (mfence). But even mfence
is not there.

 
 While reads and writes from an Intel® Xeon Phi™ coprocessor appear in
 program order on the system bus,

This part of the sentence seems misleading to me. Didn't the first
sentence state the opposite? "the exception being that
read misses are permitted to go ahead of buffered writes on the system
bus when all the buffered writes are cached hits and are, therefore,
not directed to the same address being accessed by the read miss."

I'm probably missing something.

 the compiler can still reorder
 unrelated memory operations while maintaining program order on a
 single Intel® Xeon Phi™ coprocessor (hardware thread). If software
 running on an Intel® Xeon Phi™ coprocessor is dependent on the order
 of memory operations on another Intel® Xeon Phi™ coprocessor then a
 serializing instruction (e.g., CPUID, instruction with a LOCK prefix)
 between the memory operations is required to guarantee completion of
 all memory accesses issued prior to the serializing instruction before
 any subsequent memory operations are started.
 
 (end of quote)
 
 From what I understand, it is safe to leave out any run-time memory
 barriers, but we still need barriers that prevent the compiler from
 reordering (using __asm__ __volatile__ ("":::"memory")). In
 urcu/arch/x86.h, I see that when CONFIG_RCU_HAVE_FENCE is false,
 memory barriers result in both compile-time and run-time memory
 barriers:  __asm__ __volatile__ ("lock; addl $0,0(%%esp)":::"memory").
 I guess this would work for the Phi, but the lock instruction does not
 seem necessary.

Actually, either a cpuid (core serializing) instruction or a lock-prefixed
instruction (which serializes memory accesses as a side effect) seems required.

 
 So, should we just set CONFIG_RCU_HAVE_FENCE to false when compiling
 for the Phi and go on with our lives, or should we add a specific
 config for this case?

I _think_ we could get away with this mapping:

smp_wmb() -> barrier()
  reasoning: write vs write are not reordered by the processor.

smp_rmb() -> barrier()
  reasoning: read vs read are not reordered by the processor.

smp_mb() -> __asm__ __volatile__ ("lock; addl $0,0(%%esp)":::"memory")
   or a cpuid instruction
  reasoning: the cpu can reorder reads vs earlier writes.

smp_read_barrier_depends() -> nothing at all (not needed at any level).

Interestingly enough, AFAIU, this seems to map to x86-TSO. Maybe, instead
of defining a compile option specifically for Xeon Phi, we could define an
x86-tso.h header variant in userspace RCU and use it on all Intel
processors that map to TSO (hint: the vast majority). The only exceptions
seem to be the Pentium Pro (needing smp_rmb() -> lfence) and some WinChip
processors, which could reorder stores (thus needing smp_wmb() -> sfence).
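
To make the mapping above concrete, here is a minimal sketch of what such a
TSO-style header could contain. It assumes GCC-style inline asm; the
x86-64 branch uses %rsp rather than %esp, the non-x86 fallback exists only
so the sketch builds elsewhere, and payload, ready, publish() and
consume() are hypothetical names for illustration, not liburcu API.

```c
#include <assert.h>

/* Compiler-only barrier: forbids the compiler from moving memory accesses
 * across this point, but emits no instruction. */
#define barrier()	__asm__ __volatile__ ("" ::: "memory")

/* Full barrier without mfence: a lock-prefixed no-op add on the top of the
 * stack. Assumption: GCC-style inline asm. */
#if defined(__x86_64__)
# define smp_mb()	__asm__ __volatile__ ("lock; addl $0,0(%%rsp)" ::: "memory", "cc")
#elif defined(__i386__)
# define smp_mb()	__asm__ __volatile__ ("lock; addl $0,0(%%esp)" ::: "memory", "cc")
#else
# define smp_mb()	__sync_synchronize()	/* fallback so the sketch builds elsewhere */
#endif

/* On a TSO machine, store-store and load-load order is kept by the CPU,
 * so only the compiler has to be restrained. */
#define smp_wmb()	barrier()
#define smp_rmb()	barrier()
#define smp_read_barrier_depends()	do { } while (0)

/* Hypothetical message-passing pair. Real concurrent code would also use
 * volatile or atomic accesses for the shared variables. */
static int payload;
static int ready;

static void publish(int value)
{
	assert(value >= 0);	/* -1 is reserved for "nothing published yet" */
	payload = value;
	smp_wmb();		/* order the payload store before the flag store */
	ready = 1;
}

static int consume(void)
{
	if (!ready)
		return -1;
	smp_rmb();		/* order the flag load before the payload load */
	return payload;
}
```

Only smp_mb() costs an instruction here; the other three macros constrain
the compiler alone, which is the whole point of the TSO variant.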

Thoughts ?

Thanks,

Mathieu

 
 Simon
 
 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com



Re: [lttng-dev] Xeon Phi memory barriers

2013-12-06 Thread Paul E. McKenney
On Fri, Dec 06, 2013 at 08:15:38PM +, Mathieu Desnoyers wrote:
 - Original Message -
  From: Simon Marchi simon.mar...@polymtl.ca
  To: lttng-dev@lists.lttng.org
  Sent: Tuesday, November 19, 2013 4:26:06 PM
  Subject: [lttng-dev] Xeon Phi memory barriers
  
  Hello there,
 
 Hi Simon,
 
 While reading this reply, please keep in mind that I'm in a
 mindset where I've been in a full week of meetings, and it's late on
 Friday evening here. So YMMV ;-) I'm CCing Paul E. McKenney, so he can
 debunk my answer :)
 
  
  liburcu does not build on the Intel Xeon Phi, because the chip is
  recognized as x86_64, but lacks the {s,l,m}fence instructions found on
  usual x86_64 processors. The following is taken from the Xeon Phi dev
  guide:
 
 Let's have a look:
 
  
  The Intel® Xeon Phi™ coprocessor memory model is the same as that of
  the Intel® Pentium processor. The reads and writes always appear in
  programmed order at the system bus (or the ring interconnect in the
  case of the Intel® Xeon Phi™ coprocessor); the exception being that
  read misses are permitted to go ahead of buffered writes on the system
  bus when all the buffered writes are cached hits and are, therefore,
  not directed to the same address being accessed by the read miss.
 
 OK, so reads can be reordered with respect to following writes.

That would be -preceding- writes, correct?

  As a consequence of its stricter memory ordering model, the Intel®
  Xeon Phi™ coprocessor does not support the SFENCE, LFENCE, and MFENCE
  instructions that provide a more efficient way of controlling memory
  ordering on other Intel processors.
 
 I guess sfence and lfence are indeed completely useless, because we can
 only ever care about ordering reads vs writes (mfence). But even mfence
 is not there.

The usual approach is an atomic operation to a dummy location on the
stack.  Is that the recommendation for Xeon Phi?

Either way, what should userspace RCU do to detect that it is being built
on a Xeon Phi?  I am sure that Mathieu would welcome the relevant patches
for this.  ;-)
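
One possible shape for such a build-time check, as a rough sketch only: it
assumes the toolchain predefines __MIC__ (Intel compiler, native
coprocessor build) or __k1om__ (the k1om GCC port), which should be
verified against the toolchain documentation. The SKETCH_* macros and
sketch_have_fence() are hypothetical names standing in for
CONFIG_RCU_HAVE_FENCE, not existing liburcu symbols.

```c
#include <assert.h>

/* Assumption: one of these predefined macros identifies a Xeon Phi
 * target. Verify against your toolchain's documentation. */
#if defined(__MIC__) || defined(__k1om__)
# define SKETCH_TARGET_IS_XEON_PHI	1
#else
# define SKETCH_TARGET_IS_XEON_PHI	0
#endif

/* Hypothetical stand-in for CONFIG_RCU_HAVE_FENCE: only claim the fence
 * instructions on x86-64 targets that are not a Xeon Phi. */
#if defined(__x86_64__) && !SKETCH_TARGET_IS_XEON_PHI
# define SKETCH_HAVE_FENCE	1
#else
# define SKETCH_HAVE_FENCE	0
#endif

static int sketch_have_fence(void)
{
	/* A Xeon Phi build must never select the fence-based barriers. */
	assert(!(SKETCH_TARGET_IS_XEON_PHI && SKETCH_HAVE_FENCE));
	return SKETCH_HAVE_FENCE;
}
```

A configure-time test that tries to assemble an mfence would be an
alternative that does not depend on compiler-specific macros.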

  While reads and writes from an Intel® Xeon Phi™ coprocessor appear in
  program order on the system bus,
 
 This part of the sentence seems misleading to me. Didn't the first
 sentence state the opposite? "the exception being that
 read misses are permitted to go ahead of buffered writes on the system
 bus when all the buffered writes are cached hits and are, therefore,
 not directed to the same address being accessed by the read miss."
 
 I'm probably missing something.

The trick might be that read misses are only allowed to pass write
-hits-, which would mean that the system bus would have already seen
the invalidate corresponding to the delayed write, and thus would
have no evidence of any misordering.

  the compiler can still reorder
  unrelated memory operations while maintaining program order on a
  single Intel® Xeon Phi™ coprocessor (hardware thread). If software
  running on an Intel® Xeon Phi™ coprocessor is dependent on the order
  of memory operations on another Intel® Xeon Phi™ coprocessor then a
  serializing instruction (e.g., CPUID, instruction with a LOCK prefix)
  between the memory operations is required to guarantee completion of
  all memory accesses issued prior to the serializing instruction before
  any subsequent memory operations are started.

OK, sounds like my guess of atomic instruction to dummy stack location
is correct, or perhaps carrying out a nearby assignment using an
xchg instruction.
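
For the xchg variant, a minimal sketch might look like the following. It
relies on the GCC/Clang __atomic_exchange_n builtin, which on x86 is
typically compiled to an xchg with a memory operand (implicitly locked);
store_mb() is a hypothetical helper name, not an existing liburcu
function.

```c
#include <assert.h>
#include <stddef.h>

/* Store `value` to `*addr` with full-barrier semantics by performing the
 * assignment as an atomic exchange and discarding the old value. */
static void store_mb(int *addr, int value)
{
	assert(addr != NULL);
	(void)__atomic_exchange_n(addr, value, __ATOMIC_SEQ_CST);
}
```

This folds the barrier into a store the code had to do anyway, instead of
adding a separate dummy operation on the stack.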

  (end of quote)
  
  From what I understand, it is safe to leave out any run-time memory
  barriers, but we still need barriers that prevent the compiler from
  reordering (using __asm__ __volatile__ ("":::"memory")). In
  urcu/arch/x86.h, I see that when CONFIG_RCU_HAVE_FENCE is false,
  memory barriers result in both compile-time and run-time memory
  barriers:  __asm__ __volatile__ ("lock; addl $0,0(%%esp)":::"memory").
  I guess this would work for the Phi, but the lock instruction does not
  seem necessary.
 
 Actually, either a cpuid (core serializing) instruction or a lock-prefixed
 instruction (which serializes memory accesses as a side effect) seems required.

It would certainly be safe.  One approach would be to keep it that way
unless/until someone showed it to be unnecessary.

  So, should we just set CONFIG_RCU_HAVE_FENCE to false when compiling
  for the Phi and go on with our lives, or should we add a specific
  config for this case?
 
 I _think_ we could get away with this mapping:
 
 smp_wmb() -> barrier()
   reasoning: write vs write are not reordered by the processor.
 
 smp_rmb() -> barrier()
   reasoning: read vs read are not reordered by the processor.
 
 smp_mb() -> __asm__ __volatile__ ("lock; addl $0,0(%%esp)":::"memory")
    or a cpuid instruction
   reasoning: the cpu can reorder reads vs earlier writes.
 
 smp_read_barrier_depends() -> nothing at all (not needed at any level).

This should be safe, though I would argue for do { } while (0) for

Re: [lttng-dev] Xeon Phi memory barriers

2013-12-06 Thread Mathieu Desnoyers
- Original Message -
 From: Paul E. McKenney paul...@linux.vnet.ibm.com
 To: Mathieu Desnoyers mathieu.desnoy...@efficios.com
 Cc: Simon Marchi simon.mar...@polymtl.ca, lttng-dev@lists.lttng.org
 Sent: Friday, December 6, 2013 10:40:45 PM
 Subject: Re: [lttng-dev] Xeon Phi memory barriers
 
 On Fri, Dec 06, 2013 at 08:15:38PM +, Mathieu Desnoyers wrote:
  - Original Message -
   From: Simon Marchi simon.mar...@polymtl.ca
   To: lttng-dev@lists.lttng.org
   Sent: Tuesday, November 19, 2013 4:26:06 PM
   Subject: [lttng-dev] Xeon Phi memory barriers
   
   Hello there,
  
  Hi Simon,
  
  While reading this reply, please keep in mind that I'm in a
  mindset where I've been in a full week of meetings, and it's late on
  Friday evening here. So YMMV ;-) I'm CCing Paul E. McKenney, so he can
  debunk my answer :)
  
   
   liburcu does not build on the Intel Xeon Phi, because the chip is
   recognized as x86_64, but lacks the {s,l,m}fence instructions found on
   usual x86_64 processors. The following is taken from the Xeon Phi dev
   guide:
  
  Let's have a look:
  
   
   The Intel® Xeon Phi™ coprocessor memory model is the same as that of
   the Intel® Pentium processor. The reads and writes always appear in
   programmed order at the system bus (or the ring interconnect in the
   case of the Intel® Xeon Phi™ coprocessor); the exception being that
   read misses are permitted to go ahead of buffered writes on the system
   bus when all the buffered writes are cached hits and are, therefore,
   not directed to the same address being accessed by the read miss.
  
  OK, so reads can be reordered with respect to following writes.
 
 That would be -preceding- writes, correct?

Oh, yes, I got it reversed.

 
   As a consequence of its stricter memory ordering model, the Intel®
   Xeon Phi™ coprocessor does not support the SFENCE, LFENCE, and MFENCE
   instructions that provide a more efficient way of controlling memory
   ordering on other Intel processors.
  
  I guess sfence and lfence are indeed completely useless, because we can
  only ever care about ordering reads vs writes (mfence). But even mfence
  is not there.
 
 The usual approach is an atomic operation to a dummy location on the
 stack.  Is that the recommendation for Xeon Phi?

Yes, see below,

 
 Either way, what should userspace RCU do to detect that it is being built
 on a Xeon Phi?  I am sure that Mathieu would welcome the relevant patches
 for this.  ;-)
 
   While reads and writes from an Intel® Xeon Phi™ coprocessor appear in
   program order on the system bus,
  
  This part of the sentence seems misleading to me. Didn't the first
  sentence state the opposite? "the exception being that
  read misses are permitted to go ahead of buffered writes on the system
  bus when all the buffered writes are cached hits and are, therefore,
  not directed to the same address being accessed by the read miss."
  
  I'm probably missing something.
 
 The trick might be that read misses are only allowed to pass write
 -hits-, which would mean that the system bus would have already seen
 the invalidate corresponding to the delayed write, and thus would
 have no evidence of any misordering.
 
   the compiler can still reorder
   unrelated memory operations while maintaining program order on a
   single Intel® Xeon Phi™ coprocessor (hardware thread). If software
   running on an Intel® Xeon Phi™ coprocessor is dependent on the order
   of memory operations on another Intel® Xeon Phi™ coprocessor then a
   serializing instruction (e.g., CPUID, instruction with a LOCK prefix)
   between the memory operations is required to guarantee completion of
   all memory accesses issued prior to the serializing instruction before
   any subsequent memory operations are started.
 
 OK, sounds like my guess of atomic instruction to dummy stack location
 is correct, or perhaps carrying out a nearby assignment using an
 xchg instruction.

Yes, or a CPUID instruction seems OK too. We already use lock; addl on the
stack in URCU for cases where fence instructions may not be available (x86-32).

 
   (end of quote)
   
   From what I understand, it is safe to leave out any run-time memory
   barriers, but we still need barriers that prevent the compiler from
   reordering (using __asm__ __volatile__ ("":::"memory")). In
   urcu/arch/x86.h, I see that when CONFIG_RCU_HAVE_FENCE is false,
   memory barriers result in both compile-time and run-time memory
   barriers:  __asm__ __volatile__ ("lock; addl $0,0(%%esp)":::"memory").
   I guess this would work for the Phi, but the lock instruction does not
   seem necessary.
  
  Actually, either a cpuid (core serializing) instruction or a lock-prefixed
  instruction (which serializes memory accesses as a side effect) seems required.
 
 It would certainly be safe.  One approach would be to keep it that way
 unless/until someone showed it to be unnecessary.
 
   So, should we just set CONFIG_RCU_HAVE_FENCE to false when compiling