Re: [lttng-dev] Xeon Phi memory barriers
ping.

On 19 November 2013 10:26, Simon Marchi simon.mar...@polymtl.ca wrote:
> Hello there,
>
> liburcu does not build on the Intel Xeon Phi, because the chip is recognized as x86_64, but lacks the {s,l,m}fence instructions found on usual x86_64 processors. The following is taken from the Xeon Phi dev guide:
>
> The Intel® Xeon Phi™ coprocessor memory model is the same as that of the Intel® Pentium processor. The reads and writes always appear in programmed order at the system bus (or the ring interconnect in the case of the Intel® Xeon Phi™ coprocessor); the exception being that read misses are permitted to go ahead of buffered writes on the system bus when all the buffered writes are cached hits and are, therefore, not directed to the same address being accessed by the read miss.
>
> As a consequence of its stricter memory ordering model, the Intel® Xeon Phi™ coprocessor does not support the SFENCE, LFENCE, and MFENCE instructions that provide a more efficient way of controlling memory ordering on other Intel processors.
>
> While reads and writes from an Intel® Xeon Phi™ coprocessor appear in program order on the system bus, the compiler can still reorder unrelated memory operations while maintaining program order on a single Intel® Xeon Phi™ coprocessor (hardware thread). If software running on an Intel® Xeon Phi™ coprocessor is dependent on the order of memory operations on another Intel® Xeon Phi™ coprocessor then a serializing instruction (e.g., CPUID, instruction with a LOCK prefix) between the memory operations is required to guarantee completion of all memory accesses issued prior to the serializing instruction before any subsequent memory operations are started.
>
> (end of quote)
>
> From what I understand, it is safe to leave out any run-time memory barriers, but we still need barriers that prevent the compiler from reordering (using __asm__ __volatile__ ("":::"memory")).
>
> In urcu/arch/x86.h, I see that when CONFIG_RCU_HAVE_FENCE is false, memory barriers result in both compile-time and run-time memory barriers: __asm__ __volatile__ ("lock; addl $0,0(%%esp)":::"memory"). I guess this would work for the Phi, but the lock instruction does not seem necessary.
>
> So, should we just set CONFIG_RCU_HAVE_FENCE to false when compiling for the Phi and go on with our lives, or should we add a specific config for this case?
>
> Simon

___
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
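To make the distinction in the question above concrete, here is a minimal sketch of the two kinds of barrier being compared, assuming a GCC-compatible compiler. The macro and function names (`compiler_barrier`, `locked_barrier`, `publish`, `store_then_load`) are hypothetical and are not taken from urcu/arch/x86.h; the 64-bit variant uses `%rsp` rather than the `%esp` form quoted above, which is an assumption about what a 64-bit build would want.

```c
#include <assert.h>

/* Compile-time only: forbids the compiler from moving memory accesses
 * across this point. No instruction is emitted. */
#define compiler_barrier() __asm__ __volatile__ ("" : : : "memory")

/* Compile-time and run-time: a lock-prefixed add of 0 to the top of the
 * stack. It changes no data and needs no SFENCE/LFENCE/MFENCE support. */
#if defined(__x86_64__)
#define locked_barrier() \
	__asm__ __volatile__ ("lock; addl $0,0(%%rsp)" : : : "memory")
#elif defined(__i386__)
#define locked_barrier() \
	__asm__ __volatile__ ("lock; addl $0,0(%%esp)" : : : "memory")
#else
/* Assumption: fall back to the GCC builtin on other architectures. */
#define locked_barrier() __sync_synchronize()
#endif

static int payload;
static int ready;

/* Store followed by store: only the compiler has to be restrained. */
static void publish(int value)
{
	payload = value;
	compiler_barrier();
	ready = 1;
}

/* Store followed by load: the case the quoted guide singles out, where a
 * read miss may pass a buffered write, so a real instruction is used. */
static int store_then_load(int *mine, const int *other)
{
	*mine = 1;
	locked_barrier();
	return *other;
}
```

Whether the lock prefix can be dropped on the Phi is exactly the open question; the sketch only shows what each option costs in generated code.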
Re: [lttng-dev] Xeon Phi memory barriers
----- Original Message -----
> From: Simon Marchi simon.mar...@polymtl.ca
> To: lttng-dev@lists.lttng.org
> Sent: Tuesday, November 19, 2013 4:26:06 PM
> Subject: [lttng-dev] Xeon Phi memory barriers
>
> Hello there,

Hi Simon,

While reading this reply, please keep in mind that I'm in a mindset where I've been in a full week of meetings, and it's late on Friday evening here. So YMMV ;-) I'm CCing Paul E. McKenney, so he can debunk my answer :)

> liburcu does not build on the Intel Xeon Phi, because the chip is recognized as x86_64, but lacks the {s,l,m}fence instructions found on usual x86_64 processors. The following is taken from the Xeon Phi dev guide:

Let's have a look:

> The Intel® Xeon Phi™ coprocessor memory model is the same as that of the Intel® Pentium processor. The reads and writes always appear in programmed order at the system bus (or the ring interconnect in the case of the Intel® Xeon Phi™ coprocessor); the exception being that read misses are permitted to go ahead of buffered writes on the system bus when all the buffered writes are cached hits and are, therefore, not directed to the same address being accessed by the read miss.

OK, so reads can be reordered with respect to following writes.

> As a consequence of its stricter memory ordering model, the Intel® Xeon Phi™ coprocessor does not support the SFENCE, LFENCE, and MFENCE instructions that provide a more efficient way of controlling memory ordering on other Intel processors.

I guess sfence and lfence are indeed completely useless, because we only can ever care about ordering reads vs writes (mfence). But even the mfence is not there.

> While reads and writes from an Intel® Xeon Phi™ coprocessor appear in program order on the system bus,

This part of the sentence seems misleading to me. Didn't the first sentence state the opposite ?

"the exception being that read misses are permitted to go ahead of buffered writes on the system bus when all the buffered writes are cached hits and are, therefore, not directed to the same address being accessed by the read miss."

I'm probably missing something.

> the compiler can still reorder unrelated memory operations while maintaining program order on a single Intel® Xeon Phi™ coprocessor (hardware thread). If software running on an Intel® Xeon Phi™ coprocessor is dependent on the order of memory operations on another Intel® Xeon Phi™ coprocessor then a serializing instruction (e.g., CPUID, instruction with a LOCK prefix) between the memory operations is required to guarantee completion of all memory accesses issued prior to the serializing instruction before any subsequent memory operations are started.
>
> (end of quote)
>
> From what I understand, it is safe to leave out any run-time memory barriers, but we still need barriers that prevent the compiler from reordering (using __asm__ __volatile__ ("":::"memory")).
>
> In urcu/arch/x86.h, I see that when CONFIG_RCU_HAVE_FENCE is false, memory barriers result in both compile-time and run-time memory barriers: __asm__ __volatile__ ("lock; addl $0,0(%%esp)":::"memory"). I guess this would work for the Phi, but the lock instruction does not seem necessary.

Actually, either a cpuid (core serializing) instruction or lock-prefixed instruction (serializing as a side-effect memory accesses) seems required.

> So, should we just set CONFIG_RCU_HAVE_FENCE to false when compiling for the Phi and go on with our lives, or should we add a specific config for this case?

I _think_ we could get away with this mapping:

smp_wmb() -> barrier()
  reasoning: write vs write are not reordered by the processor.

smp_rmb() -> barrier()
  reasoning: read vs read not reordered by processor.

smp_mb() -> __asm__ __volatile__ ("lock; addl $0,0(%%esp)":::"memory") or a cpuid instruction
  reasoning: cpu can reorder reads vs later writes.

smp_read_barrier_depends() -> nothing at all (not needed at any level).

Interestingly enough, AFAIU, this seems to map to x86-TSO. Maybe instead of defining a compile option specifically for Xeon Phi, we could define an x86-tso.h header variant in userspace RCU and use it on all Intel processors that map to TSO (hint: the vast majority). The only exceptions seem to be Pentium Pro (needing smp_rmb() -> lfence) and some WinChip processors which could reorder stores (thus needing smp_wmb() -> sfence).

Thoughts ?

Thanks,

Mathieu

> Simon

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
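The mapping proposed above could be written down roughly as follows. This is only a sketch under stated assumptions: a GCC-compatible compiler, a TSO-like x86 target, and `sketch_`-prefixed names that are hypothetical and are not liburcu's own macros. The empty `do { } while (0)` body is simply a common C idiom for a macro that expands to a harmless statement.

```c
#include <assert.h>

/* Compiler-only barrier: no instruction is emitted. */
#define sketch_barrier() __asm__ __volatile__ ("" : : : "memory")

/* Store/store and load/load ordering: assumed to be kept by the processor,
 * so only the compiler needs to be constrained. */
#define sketch_smp_wmb() sketch_barrier()
#define sketch_smp_rmb() sketch_barrier()

/* Full barrier: a lock-prefixed no-op add on the stack, since a load may
 * be satisfied ahead of an earlier buffered store. */
#if defined(__x86_64__)
#define sketch_smp_mb() \
	__asm__ __volatile__ ("lock; addl $0,0(%%rsp)" : : : "memory")
#elif defined(__i386__)
#define sketch_smp_mb() \
	__asm__ __volatile__ ("lock; addl $0,0(%%esp)" : : : "memory")
#else
/* Assumption: GCC builtin as a fallback on other architectures. */
#define sketch_smp_mb() __sync_synchronize()
#endif

/* Dependent loads are ordered on x86, so this expands to nothing. */
#define sketch_smp_read_barrier_depends() do { } while (0)

struct msg {
	int data;
	int ready;
};

/* Writer side: fill in the data, then set the flag. */
static void producer(struct msg *m, int value)
{
	m->data = value;
	sketch_smp_wmb();
	m->ready = 1;
}

/* Reader side: returns 1 and fills *out once the flag is set, else 0. */
static int consumer(const struct msg *m, int *out)
{
	if (!m->ready)
		return 0;
	sketch_smp_rmb();
	*out = m->data;
	return 1;
}
```

Code shared between real threads would also need volatile or atomic accesses to `data` and `ready`; the plain accesses here keep the example short and are only exercised from a single thread.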
Re: [lttng-dev] Xeon Phi memory barriers
On Fri, Dec 06, 2013 at 08:15:38PM +, Mathieu Desnoyers wrote:
> ----- Original Message -----
> > From: Simon Marchi simon.mar...@polymtl.ca
> > To: lttng-dev@lists.lttng.org
> > Sent: Tuesday, November 19, 2013 4:26:06 PM
> > Subject: [lttng-dev] Xeon Phi memory barriers
> >
> > Hello there,
>
> Hi Simon,
>
> While reading this reply, please keep in mind that I'm in a mindset where I've been in a full week of meetings, and it's late on Friday evening here. So YMMV ;-) I'm CCing Paul E. McKenney, so he can debunk my answer :)
>
> > liburcu does not build on the Intel Xeon Phi, because the chip is recognized as x86_64, but lacks the {s,l,m}fence instructions found on usual x86_64 processors. The following is taken from the Xeon Phi dev guide:
>
> Let's have a look:
>
> > The Intel® Xeon Phi™ coprocessor memory model is the same as that of the Intel® Pentium processor. The reads and writes always appear in programmed order at the system bus (or the ring interconnect in the case of the Intel® Xeon Phi™ coprocessor); the exception being that read misses are permitted to go ahead of buffered writes on the system bus when all the buffered writes are cached hits and are, therefore, not directed to the same address being accessed by the read miss.
>
> OK, so reads can be reordered with respect to following writes.

That would be -preceding- writes, correct?

> > As a consequence of its stricter memory ordering model, the Intel® Xeon Phi™ coprocessor does not support the SFENCE, LFENCE, and MFENCE instructions that provide a more efficient way of controlling memory ordering on other Intel processors.
>
> I guess sfence and lfence are indeed completely useless, because we only can ever care about ordering reads vs writes (mfence). But even the mfence is not there.

The usual approach is an atomic operation to a dummy location on the stack. Is that the recommendation for Xeon Phi?

Either way, what should userspace RCU do to detect that it is being built on a Xeon Phi? I am sure that Mathieu would welcome the relevant patches for this. ;-)

> > While reads and writes from an Intel® Xeon Phi™ coprocessor appear in program order on the system bus,
>
> This part of the sentence seems misleading to me. Didn't the first sentence state the opposite ?
>
> "the exception being that read misses are permitted to go ahead of buffered writes on the system bus when all the buffered writes are cached hits and are, therefore, not directed to the same address being accessed by the read miss."
>
> I'm probably missing something.

The trick might be that read misses are only allowed to pass write -hits-, which would mean that the system bus would have already seen the invalidate corresponding to the delayed write, and thus would have no evidence of any misordering.

> > the compiler can still reorder unrelated memory operations while maintaining program order on a single Intel® Xeon Phi™ coprocessor (hardware thread). If software running on an Intel® Xeon Phi™ coprocessor is dependent on the order of memory operations on another Intel® Xeon Phi™ coprocessor then a serializing instruction (e.g., CPUID, instruction with a LOCK prefix) between the memory operations is required to guarantee completion of all memory accesses issued prior to the serializing instruction before any subsequent memory operations are started.

OK, sounds like my guess of atomic instruction to dummy stack location is correct, or perhaps carrying out a nearby assignment using an xchg instruction.

> > (end of quote)
> >
> > From what I understand, it is safe to leave out any run-time memory barriers, but we still need barriers that prevent the compiler from reordering (using __asm__ __volatile__ ("":::"memory")).
> >
> > In urcu/arch/x86.h, I see that when CONFIG_RCU_HAVE_FENCE is false, memory barriers result in both compile-time and run-time memory barriers: __asm__ __volatile__ ("lock; addl $0,0(%%esp)":::"memory"). I guess this would work for the Phi, but the lock instruction does not seem necessary.
>
> Actually, either a cpuid (core serializing) instruction or lock-prefixed instruction (serializing as a side-effect memory accesses) seems required.

It would certainly be safe. One approach would be to keep it that way unless/until someone showed it to be unnecessary.

> > So, should we just set CONFIG_RCU_HAVE_FENCE to false when compiling for the Phi and go on with our lives, or should we add a specific config for this case?
>
> I _think_ we could get away with this mapping:
>
> smp_wmb() -> barrier()
>   reasoning: write vs write are not reordered by the processor.
>
> smp_rmb() -> barrier()
>   reasoning: read vs read not reordered by processor.
>
> smp_mb() -> __asm__ __volatile__ ("lock; addl $0,0(%%esp)":::"memory") or a cpuid instruction
>   reasoning: cpu can reorder reads vs later writes.
>
> smp_read_barrier_depends() -> nothing at all (not needed at any level).

This should be safe, though I would argue for do { } while (0) for
Re: [lttng-dev] Xeon Phi memory barriers
----- Original Message -----
> From: Paul E. McKenney paul...@linux.vnet.ibm.com
> To: Mathieu Desnoyers mathieu.desnoy...@efficios.com
> Cc: Simon Marchi simon.mar...@polymtl.ca, lttng-dev@lists.lttng.org
> Sent: Friday, December 6, 2013 10:40:45 PM
> Subject: Re: [lttng-dev] Xeon Phi memory barriers
>
> On Fri, Dec 06, 2013 at 08:15:38PM +, Mathieu Desnoyers wrote:
> > ----- Original Message -----
> > > From: Simon Marchi simon.mar...@polymtl.ca
> > > To: lttng-dev@lists.lttng.org
> > > Sent: Tuesday, November 19, 2013 4:26:06 PM
> > > Subject: [lttng-dev] Xeon Phi memory barriers
> > >
> > > Hello there,
> >
> > Hi Simon,
> >
> > While reading this reply, please keep in mind that I'm in a mindset where I've been in a full week of meetings, and it's late on Friday evening here. So YMMV ;-) I'm CCing Paul E. McKenney, so he can debunk my answer :)
> >
> > > liburcu does not build on the Intel Xeon Phi, because the chip is recognized as x86_64, but lacks the {s,l,m}fence instructions found on usual x86_64 processors. The following is taken from the Xeon Phi dev guide:
> >
> > Let's have a look:
> >
> > > The Intel® Xeon Phi™ coprocessor memory model is the same as that of the Intel® Pentium processor. The reads and writes always appear in programmed order at the system bus (or the ring interconnect in the case of the Intel® Xeon Phi™ coprocessor); the exception being that read misses are permitted to go ahead of buffered writes on the system bus when all the buffered writes are cached hits and are, therefore, not directed to the same address being accessed by the read miss.
> >
> > OK, so reads can be reordered with respect to following writes.
>
> That would be -preceding- writes, correct?

Oh, yes, I got it reversed.

> > > As a consequence of its stricter memory ordering model, the Intel® Xeon Phi™ coprocessor does not support the SFENCE, LFENCE, and MFENCE instructions that provide a more efficient way of controlling memory ordering on other Intel processors.
> >
> > I guess sfence and lfence are indeed completely useless, because we only can ever care about ordering reads vs writes (mfence). But even the mfence is not there.
>
> The usual approach is an atomic operation to a dummy location on the stack. Is that the recommendation for Xeon Phi?

Yes, see below,

> Either way, what should userspace RCU do to detect that it is being built on a Xeon Phi? I am sure that Mathieu would welcome the relevant patches for this. ;-)
>
> > > While reads and writes from an Intel® Xeon Phi™ coprocessor appear in program order on the system bus,
> >
> > This part of the sentence seems misleading to me. Didn't the first sentence state the opposite ?
> >
> > "the exception being that read misses are permitted to go ahead of buffered writes on the system bus when all the buffered writes are cached hits and are, therefore, not directed to the same address being accessed by the read miss."
> >
> > I'm probably missing something.
>
> The trick might be that read misses are only allowed to pass write -hits-, which would mean that the system bus would have already seen the invalidate corresponding to the delayed write, and thus would have no evidence of any misordering.
>
> > > the compiler can still reorder unrelated memory operations while maintaining program order on a single Intel® Xeon Phi™ coprocessor (hardware thread). If software running on an Intel® Xeon Phi™ coprocessor is dependent on the order of memory operations on another Intel® Xeon Phi™ coprocessor then a serializing instruction (e.g., CPUID, instruction with a LOCK prefix) between the memory operations is required to guarantee completion of all memory accesses issued prior to the serializing instruction before any subsequent memory operations are started.
>
> OK, sounds like my guess of atomic instruction to dummy stack location is correct, or perhaps carrying out a nearby assignment using an xchg instruction.

Yes, or CPUID instruction seems OK too. We already use lock; addl on stack in URCU for cases where fence instructions may not be available (x86-32).

> > > (end of quote)
> > >
> > > From what I understand, it is safe to leave out any run-time memory barriers, but we still need barriers that prevent the compiler from reordering (using __asm__ __volatile__ ("":::"memory")).
> > >
> > > In urcu/arch/x86.h, I see that when CONFIG_RCU_HAVE_FENCE is false, memory barriers result in both compile-time and run-time memory barriers: __asm__ __volatile__ ("lock; addl $0,0(%%esp)":::"memory"). I guess this would work for the Phi, but the lock instruction does not seem necessary.
> >
> > Actually, either a cpuid (core serializing) instruction or lock-prefixed instruction (serializing as a side-effect memory accesses) seems required.
>
> It would certainly be safe. One approach would be to keep it that way unless/until someone showed it to be unnecessary.
>
> > > So, should we just set CONFIG_RCU_HAVE_FENCE to false when compiling