Re: [osol-discuss] [perf-discuss] how to disable sun cc optimization partly in program?
On 2007-12-22, at 9:17 PM, 陶捷 TaoJie wrote:

> Hi Bart,
>
> I noticed this email just now :( Thank you for your advice.
>
> Are there any barrier instructions on x86/x64 that could force rdtsc to
> behave synchronously?

iret, xchg, cpuid, sfence, lock, etc. But cpuid changes eax etc., and sfence is not available on all Pentiums (PIII ???).

> My concern is:
>
>     rdtsc
>     [barrier]
>     AA
>     BB
>     CC
>     XX
>     [barrier]
>     rdtsc
>
> (2nd rdtsc - 1st rdtsc) should be the time cost of these inner
> instructions/functions, and it should be equal to or greater than the
> actual cost. Are there any barrier instructions that force the 1st rdtsc
> to execute before AA and the 2nd rdtsc to execute after XX? Using some
> consecutive nops? Or some other instruction?

    sfence
    rdtsc
    x
    sfence
    rdtsc

Maybe cpuid is usable if eax can be corrupted, or you can save it somewhere before cpuid.

-minskey

> Btw, you mentioned you had experience with performance measuring. Are
> there any recommended articles about performance measuring on the x86/x64
> platform? Are there any recommended articles about measuring instruction
> cost? For example, some books say nop costs 1 cycle on the Pentium and
> 3 cycles on the 386. How does one get these precise costs? Thank you :)
>
> Another question: on an SMP or multi-core (or say CMT) platform, each
> processor/core has its own tsc register on its chip, doesn't it? Then how
> can gethrtime() guarantee to provide system-wide time? I mean, if a
> program runs on CPU1 for a while and then runs on CPU2, would
> gethrtime() - gethrtime() be the precise time cost? Does gethrtime() read
> ticks from the CPU's tsc register or from a system-wide timer (e.g. the
> 8253 chip on x86)?
>
> I'm not familiar with timers... sorry for these stupid questions :-(
>
> Kind Regards,
> TJ
>
> 2007/10/30, Bart Smaalders [EMAIL PROTECTED]:
>
>> 陶捷 TaoJie wrote:
>>
>>> Dear all:
>>>
>>> My platform is:
>>>     Intel Pentium 4 CPU
>>>     OpenSolaris B74, built by myself
>>>     Sun Studio 11
>>>
>>> In my program, I use asm("rdtsc") to measure the time cost between two
>>> rdtsc instructions. For example:
>>>
>>>     int some_func(...)
>>>     {
>>>         long long time1, time2;
>>>         int i = 3198, j = 324;
>>>         asm volatile("rdtsc" : "=A" (time1));
>>>         i = i + j * i / j;
>>>         asm volatile("rdtsc" : "=A" (time2));
>>>         return i;
>>>     }
>>>
>>>     int main(...)
>>>     {
>>>         some_func();
>>>     }
>>>
>>> When I compile this program using cc example.c and disassemble a.out
>>> with dis, the program logic is OK. The output is:
>>>
>>>     some_func()
>>>     main+0x36: 0f 31        rdtsc
>>>     main+0x38: 89 45 f4     movl   %eax,-0xc(%ebp)
>>>     main+0x3b: 89 55 f8     movl   %edx,-0x8(%ebp)
>>>     main+0x3e: 8b 45 e8     movl   -0x18(%ebp),%eax
>>>     main+0x41: 03 45 e4     addl   -0x1c(%ebp),%eax
>>>     main+0x44: 89 45 e8     movl   %eax,-0x18(%ebp)
>>>     main+0x47: 8b 45 e8     movl   -0x18(%ebp),%eax
>>>     main+0x4a: 0f af 45 e4  imull  -0x1c(%ebp),%eax
>>>     main+0x4e: 89 45 e8     movl   %eax,-0x18(%ebp)
>>>     main+0x51: 8b 45 e8     movl   -0x18(%ebp),%eax
>>>     main+0x54: 99           cltd
>>>     main+0x55: f7 7d e4     idivl  -0x1c(%ebp)
>>>     main+0x58: 8b d0        movl   %eax,%edx
>>>     main+0x5a: 89 55 e8     movl   %edx,-0x18(%ebp)
>>>     main+0x5d: 0f 31        rdtsc
>>>     main+0x5f: 89 45 ec     movl   %eax,-0x14(%ebp)
>>>     main+0x62: 89 55 f0     movl   %edx,-0x10(%ebp)
>>>
>>> When I compile this program using cc -xO5, the dis output is:
>>>
>>>     some_func()
>>>     main+0x7:  0f 31        rdtsc
>>>     main+0x9:  89 45 e8     movl   %eax,-0x18(%ebp)
>>>     main+0xc:  89 55 ec     movl   %edx,-0x14(%ebp)
>>>     main+0xf:  0f 31        rdtsc
>>>     main+0x11: 89 45 f0     movl   %eax,-0x10(%ebp)
>>>     main+0x14: 89 55 f4     movl   %edx,-0xc(%ebp)
>>>     main+0x17: 8b 5d f0     movl   -0x10(%ebp),%ebx
>>>     main+0x1a: 8b 45 f4     movl   -0xc(%ebp),%eax
>>>     main+0x1d: 8b 4d e8     movl   -0x18(%ebp),%ecx
>>>     main+0x20: 8b 55 ec     movl   -0x14(%ebp),%edx
>>>     main+0x23: 2b d9        subl   %ecx,%ebx
>>>     main+0x25: 1b c2        sbbl   %edx,%eax
>>>     main+0x27: 89 5d e0     movl   %ebx,-0x20(%ebp)
>>>     main+0x2a: 89 45 e4     movl   %eax,-0x1c(%ebp)
>>>
>>> Now the program logic is wrong! Sun cc thinks the rdtscs are unrelated
>>> to the other parts of some_func, so it moves the second asm("rdtsc")
>>> forward! In this case, I can't measure the time cost.
>>>
>>> Then how can I partly disable Sun cc optimization between these two asm
>>> statements when using -xO5?
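A minimal sketch of one compiler-side way to keep the timed arithmetic between the two rdtsc reads, offered as an illustration rather than as the fix anyone in the thread proposed. It assumes GCC-style extended asm on x86/x64 (whether Sun Studio 11 accepts exactly this syntax is an assumption), and the names read_tsc and time_some_func are hypothetical. The "memory" clobber and the volatile locals constrain only the compiler; they do nothing about the CPU executing rdtsc out of order, which is the separate barrier question raised above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if !(defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__)))
#error "this sketch assumes GCC-style inline asm on x86/x64"
#endif

/* Read the 64-bit time-stamp counter. The "memory" clobber is a
 * compiler-only barrier: the optimizer may not move memory accesses
 * across this asm statement. */
static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi) : : "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical helper: time the arithmetic from the original post and
 * return the elapsed tick count; the computed value goes to *result. */
static uint64_t time_some_func(int *result)
{
    assert(result != NULL);

    /* volatile forces real loads and stores, so the expression cannot be
     * constant-folded away or hoisted out of the timed region. */
    volatile int i = 3198, j = 324;

    uint64_t t1 = read_tsc();
    i = i + j * i / j;
    uint64_t t2 = read_tsc();

    *result = i;
    return t2 - t1;
}
```

Calling time_some_func(&r) leaves the computed value in r and returns the tick difference. The difference still includes the cost of the volatile loads and stores, so it is an upper bound rather than the cost of the arithmetic alone.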
Re: [osol-discuss] [perf-discuss] how to disable sun cc optimization partly in program?
On Sat, 22 Dec 2007, Minskey Guo wrote:

> On 2007-12-22, at 9:17 PM, 陶捷 TaoJie wrote:
>
>> Hi Bart,
>>
>> I noticed this email just now :( Thank you for your advice.
>>
>> Are there any barrier instructions on x86/x64 that could force rdtsc to
>> behave synchronously?
>
> iret, xchg, cpuid, sfence, lock, etc. But cpuid changes eax etc., and
> sfence is not available on all Pentiums (PIII ???).

We had this discussion with AMD a while ago; if I remember correctly (but Bart may well step in here), the only thing that's guaranteed in all situations and fully vendor/chip-rev independent is CPUID. Which is sort of a barrier sledgehammer. Takes thousands of cycles.

Wondering - what _exactly_ are you planning to do? Instruction-based sampling can be done via CPU performance monitoring counters; the old "sample time, do something, sample time again" is sort-of superseded by those. High-level access in Solaris would be via the cpc(7d) driver.

FrankH.
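A hedged sketch of the cpuid-then-rdtsc pattern discussed above, again assuming GCC-style extended asm on x86/x64. The function names are hypothetical. Because cpuid overwrites eax, ebx, ecx and edx, the sketch declares ebx and ecx as clobbers and reuses eax/edx for the rdtsc result, which is one way of handling the "cpuid changes eax" concern. The second helper estimates what the barrier itself adds, so the cost can be measured on the machine at hand instead of assumed.

```c
#include <assert.h>
#include <stdint.h>

#if !(defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__)))
#error "this sketch assumes GCC-style inline asm on x86/x64"
#endif

/* Execute CPUID (leaf 0, used only for its serializing effect) and then
 * RDTSC. The compiler is told that ebx and ecx are destroyed; eax and edx
 * are outputs, so it will not keep live values in any of the four. */
static inline uint64_t read_tsc_serialized(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("cpuid\n\t"
                         "rdtsc"
                         : "=a"(lo), "=d"(hi)
                         : "a"(0)
                         : "ebx", "ecx", "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical helper: smallest gap between two back-to-back serialized
 * reads over `runs` attempts, i.e. roughly the cost the barrier adds to
 * every measurement that uses it. */
static uint64_t serialized_read_overhead(int runs)
{
    assert(runs > 0);
    uint64_t best = UINT64_MAX;
    for (int r = 0; r < runs; r++) {
        uint64_t t1 = read_tsc_serialized();
        uint64_t t2 = read_tsc_serialized();
        if (t2 - t1 < best)
            best = t2 - t1;
    }
    return best;
}
```

Taking the minimum over several runs discards samples inflated by interrupts or by a migration to another core. The remaining value can be subtracted from later measurements that use the same barrier.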
Re: [osol-discuss] [perf-discuss] how to disable sun cc optimization partly in program?
2007/12/22, Frank Hofmann [EMAIL PROTECTED]:

> On Sat, 22 Dec 2007, Minskey Guo wrote:
>
>> On 2007-12-22, at 9:17 PM, 陶捷 TaoJie wrote:
>>
>>> Hi Bart,
>>>
>>> I noticed this email just now :( Thank you for your advice.
>>>
>>> Are there any barrier instructions on x86/x64 that could force rdtsc to
>>> behave synchronously?
>>
>> iret, xchg, cpuid, sfence, lock, etc. But cpuid changes eax etc., and
>> sfence is not available on all Pentiums (PIII ???).
>
> We had this discussion with AMD a while ago; if I remember correctly, but

Do you remember the topic of that discussion?

> Bart may well step in here, is that the only thing that's guaranteed in
> all situations and fully vendor/chip-rev independent is CPUID. Which is
> sort of a barrier sledgehammer. Takes thousands of cycles.

Because it takes thousands of cycles?

If it takes thousands of cycles, it will affect the test result a bit. But it seems a good generic solution.

Btw, on P4 and later Intel platforms, which instruction is the best barrier? And on AMD Opteron and later AMD platforms?

> Wondering - what _exactly_ are you planning to do? Instruction-based
> sampling can be done via CPU performance monitoring counters; the old
> "sample time, do something, sample time again" is sort-of superseded by
> those. High-level access in Solaris would be via the cpc(7d) driver.

OK, I'll try to find some articles about performance monitoring counters and the cpc driver in Solaris to read.

I have a program whose behavior I want to analyze in detail. In short, I want to know the time cost of any sub-flow within the whole program flow.

Suppose the program flow is a long vertical line like:

    main
    func1
    func2
    func3
    some key instructions in func3 (record it as #1)
    func4
    func3
    func2
    func1
    some key instructions in func1 (record it as #2)
    exit in main

I'm interested in:

    How much time does func4 take?
    How much time does #1 take?
    How much time does #2 take?
    How much time does the transfer of control from func2 to func3 (a function call) take?
    During func4, the program may be interrupted by some event. If so, how much time does that take, and how long does it take to regain the CPU? If not, that's all right.

Are there any good suggestions for this problem?

Kind Regards,
TJ
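The per-sub-flow timing asked about above can be sketched as manual instrumentation: take a timestamp on entry to a section, take another on exit, and accumulate the difference per section. This is only an illustration under stated assumptions. It uses POSIX clock_gettime(CLOCK_MONOTONIC) as a portable stand-in for the gethrtime() mentioned earlier in the thread, and the section ids, struct section_stat, section_enter, section_leave and the body of func4 are all hypothetical.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical section ids, one per sub-flow of interest. */
enum { SEC_FUNC4, SEC_KEY1, SEC_KEY2, SEC_COUNT };

struct section_stat {
    uint64_t total_ns;  /* accumulated elapsed time */
    uint64_t calls;     /* number of times the section was left */
};

static struct section_stat stats[SEC_COUNT];

/* Monotonic timestamp in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Mark the start of a section; the caller keeps the returned timestamp. */
static uint64_t section_enter(void)
{
    return now_ns();
}

/* Mark the end of a section and add the elapsed time to its totals. */
static void section_leave(int id, uint64_t start)
{
    assert(id >= 0 && id < SEC_COUNT);
    stats[id].total_ns += now_ns() - start;
    stats[id].calls++;
}

/* Stand-in for the func4 of the thread, with instrumentation added. */
static int func4(int x)
{
    uint64_t t = section_enter();
    volatile int acc = x;          /* volatile keeps the loop from vanishing */
    for (int k = 0; k < 1000; k++)
        acc += k;
    section_leave(SEC_FUNC4, t);
    return acc;
}
```

Because the clock measures elapsed time, any interval in which the thread was interrupted or descheduled inside func4 is included in stats[SEC_FUNC4].total_ns. Separating that from on-CPU time needs a different facility, which is where the tools suggested in the replies come in.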
Re: [osol-discuss] [perf-discuss] how to disable sun cc optimization partly in program?
On Sat, 22 Dec 2007, 陶捷 TaoJie wrote:

> 2007/12/22, Frank Hofmann [EMAIL PROTECTED]:
>
>> On Sat, 22 Dec 2007, Minskey Guo wrote:
>>
>>> On 2007-12-22, at 9:17 PM, 陶捷 TaoJie wrote:
>>>
>>>> Hi Bart,
>>>>
>>>> I noticed this email just now :( Thank you for your advice.
>>>>
>>>> Are there any barrier instructions on x86/x64 that could force rdtsc
>>>> to behave synchronously?
>>>
>>> iret, xchg, cpuid, sfence, lock, etc. But cpuid changes eax etc., and
>>> sfence is not available on all Pentiums (PIII ???).
>>
>> We had this discussion with AMD a while ago; if I remember correctly, but
>
> Do you remember the topic of that discussion?

Was related to:

http://src.opensolaris.org/source/diff/onnv/onnv-gate/usr/src/uts/intel/ia32/ml/i86_subr.s?r1=5322r2=5084
http://src.opensolaris.org/source/diff/onnv/onnv-gate/usr/src/uts/i86pc/os/mlsetup.c?r1=5338r2=5084

>> Bart may well step in here, is that the only thing that's guaranteed in
>> all situations and fully vendor/chip-rev independent is CPUID. Which is
>> sort of a barrier sledgehammer. Takes thousands of cycles.
>
> Because it takes thousands of cycles?

Yes.

> If it takes thousands of cycles, it will affect the test result a bit.
> But it seems a good generic solution.
>
> Btw, on P4 and later Intel platforms, which instruction is the best
> barrier? And on AMD Opteron and later AMD platforms?

The only one that allows serialization even for instructions that do not access memory is cpuid. Everything else (mfence and its variants, iret) has corner cases where it may not serialize [on all CPU varieties]. The source above should give you some more details.

>> Wondering - what _exactly_ are you planning to do? Instruction-based
>> sampling can be done via CPU performance monitoring counters; the old
>> "sample time, do something, sample time again" is sort-of superseded by
>> those. High-level access in Solaris would be via the cpc(7d) driver.
>
> OK, I'll try to find some articles about performance monitoring counters
> and the cpc driver in Solaris to read.

Start with the source and with this one:

http://docs.sun.com/app/docs/doc/816-5172/6mbb7btcs?a=view

> I have a program whose behavior I want to analyze in detail. In short, I
> want to know the time cost of any sub-flow within the whole program flow.
>
> Suppose the program flow is a long vertical line like:
>
>     main
>     func1
>     func2
>     func3
>     some key instructions in func3 (record it as #1)
>     func4
>     func3
>     func2
>     func1
>     some key instructions in func1 (record it as #2)
>     exit in main

Again, see source above. Or, rather, try using CPC, and/or DTrace's timestampers. If it's your own source code, userland SDT probes might give you the necessary sampling hooks as well.

> I'm interested in:
>
>     How much time does func4 take?
>     How much time does #1 take?
>     How much time does #2 take?
>     How much time does the transfer of control from func2 to func3 (a
>     function call) take?
>     During func4, the program may be interrupted by some event. If so,
>     how much time does that take, and how long does it take to regain
>     the CPU? If not, that's all right.
>
> Are there any good suggestions for this problem?

If the time involved in all these samples is _not_ microscopic, then DTrace's sampling might well tell you. If it is microscopic, though, then CPC (or even your own use of CPU performance monitoring facilities, to avoid a kernel driver overhead) might become necessary. Correctly sampling micro-events is a hard task, and I'm not aware of a generic good suggestion.

Have a great weekend,
FrankH.
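To illustrate why very short events are hard to time, here is a small sketch of one common mitigation: repeat the measurement many times, keep the minimum, and subtract the cost of timing an empty region. It is not a technique anyone in the thread recommended and it does not replace CPC or DTrace. It assumes POSIX clock_gettime(CLOCK_MONOTONIC) as the clock, and mono_ns, min_elapsed_ns, net_cost_ns, empty_work and busy_work are hypothetical names.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

typedef void (*work_fn)(void);

/* Monotonic timestamp in nanoseconds. */
static uint64_t mono_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Empty region: timing it yields the cost of the measurement itself. */
static void empty_work(void)
{
}

/* Example workload; volatile keeps the loop from being optimized away. */
static void busy_work(void)
{
    volatile int x = 0;
    for (int k = 0; k < 1000; k++)
        x += k;
}

/* Smallest elapsed time for fn over `runs` repetitions. The minimum
 * discards runs that were inflated by interrupts or cache misses. */
static uint64_t min_elapsed_ns(work_fn fn, int runs)
{
    assert(fn != 0 && runs > 0);
    uint64_t best = UINT64_MAX;
    for (int r = 0; r < runs; r++) {
        uint64_t t0 = mono_ns();
        fn();
        uint64_t t1 = mono_ns();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}

/* Estimated cost of fn with the measurement overhead subtracted,
 * clamped at zero when the event is below what the clock can resolve. */
static uint64_t net_cost_ns(work_fn fn, int runs)
{
    uint64_t overhead = min_elapsed_ns(empty_work, runs);
    uint64_t gross = min_elapsed_ns(fn, runs);
    return gross > overhead ? gross - overhead : 0;
}
```

A call such as net_cost_ns(busy_work, 1000) gives an estimate for the loop. A result of 0 means the event is smaller than the timer overhead or resolution, which is the regime where hardware performance counters become the more appropriate tool.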