Re: [osol-discuss] [perf-discuss] how to disable sun cc optimization partly in program?

2007-12-22 Thread Minskey Guo


On 2007-12-22, at 9:17 PM, 陶捷 TaoJie wrote:


Hi Bart,

I noticed this email just now :(
Thank you for your advice.

Are there any barrier instructions on x86/x64 that could force rdtsc
to behave synchronously?


iret, xchg, cpuid, sfence, lock, etc. But cpuid overwrites eax, ebx, ecx and edx,
and sfence is not available on all Pentiums (PIII ???).



My concern is
---
rdtsc
[barrier]
AA
BB
CC

XX
[barrier]
rdtsc

(2nd rdtsc - 1st rdtsc) should be the time cost of these inner
instructions/functions.

It should be equal to or greater than the actual cost.

Are there any barrier instructions that force the 1st rdtsc to execute
before AA and the 2nd rdtsc to execute after XX?

Would a run of consecutive nops work? Or some other instruction?


sfence
rdtsc
x

sfence
rdtsc


Maybe cpuid can be used if eax may be clobbered, or you can save eax
somewhere before the cpuid.
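
A minimal sketch of the barrier-then-rdtsc layout above, with cpuid substituted for sfence. It assumes GCC-style extended inline assembly (GCC or Clang) on x86/x64, which is not confirmed to behave the same way under Sun cc, and the name `serialized_ticks` is made up for illustration. Listing ebx and ecx as clobbers lets the compiler preserve those registers itself, so nothing has to be saved by hand around cpuid. On other targets the sketch falls back to a POSIX nanosecond clock so that it still compiles.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: run a serializing cpuid, then read the time-stamp counter. */
static uint64_t serialized_ticks(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    uint32_t lo, hi;

    /* cpuid (leaf 0) overwrites eax, ebx, ecx and edx; rdtsc then
     * returns the counter in edx:eax. */
    __asm__ __volatile__(
        "cpuid\n\t"
        "rdtsc"
        : "=a"(lo), "=d"(hi)
        : "a"(0)
        : "ebx", "ecx", "memory");
    return ((uint64_t)hi << 32) | lo;
#else
    /* Assumption: no rdtsc available, so use a monotonic nanosecond clock. */
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
#endif
}
```

Calling `serialized_ticks()` once before and once after the code under test gives a difference that also contains the cost of one cpuid, so the difference for an empty region should be measured and subtracted.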


-minskey





Btw, you mentioned that you have experience with performance measurement.
Are there any recommended articles about performance measurement on the
x86/x64 platform?

Are there any recommended articles about measuring instruction cost?
For example, some books say that nop costs 1 cycle on the Pentium and
3 cycles on the 386. How are these precise costs obtained?


Thank you :)

Another question:
On an SMP or multi-core (or CMT) platform, each processor/core has
its own TSC register on chip, doesn't it?
Then how can gethrtime() guarantee a system-wide time? I mean, if a
program runs on CPU1 for a while and then runs on CPU2, would the
difference between two gethrtime() calls be the precise elapsed time?
Does gethrtime() read ticks from the CPU's TSC register, or from a
system-wide timer (e.g. the 8253 chip on x86)?
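
The replies in this thread do not say how gethrtime() is implemented. As a point of comparison only, here is a minimal sketch of interval timing with a monotonic nanosecond clock. It assumes a POSIX system and uses clock_gettime(CLOCK_MONOTONIC) as a portable stand-in for gethrtime(); the names `monotonic_ns`, `elapsed_ns` and `spin` are made up for illustration.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for gethrtime(): nanoseconds since an arbitrary origin. */
static int64_t monotonic_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (int64_t)ts.tv_sec * 1000000000LL + (int64_t)ts.tv_nsec;
}

/* Time one call of work() as the difference of two clock reads. */
static int64_t elapsed_ns(void (*work)(void))
{
    int64_t start = monotonic_ns();
    work();
    return monotonic_ns() - start;
}

/* Example workload; volatile keeps the loop from being optimized away. */
static void spin(void)
{
    volatile unsigned sink = 0;
    for (unsigned k = 0; k < 100000; k++)
        sink += k;
}
```

`elapsed_ns(spin)` returns the wall-clock nanoseconds between the two reads, including any time the thread spent descheduled.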


I'm not familiar with timers... sorry for these stupid questions :-(


Kind Regards,
TJ


2007/10/30, Bart Smaalders [EMAIL PROTECTED]:
陶捷 TaoJie wrote:
 Dear all:

 My platform is:
 Intel Pentium 4 CPU
 OpenSolaris B74, built by myself
 Sun Studio 11

 In my program, I use asm("rdtsc") to measure the time cost between
 two rdtsc instructions.

 For example:
 int some_func(void)
 {
     long long time1, time2;
     int i = 3198, j = 324;

     /* "=A" returns the 64-bit counter in edx:eax (32-bit x86) */
     asm volatile("rdtsc" : "=A" (time1));

     i = i + j * i / j;

     asm volatile("rdtsc" : "=A" (time2));

     return i;
 }

 int main(void)
 {
     some_func();
     return 0;
 }

 When I compile this program using cc example.c and disassemble a.out
 with dis, the program logic is OK. The output is
 some_func()
 main+0x36:  0f 31        rdtsc
 main+0x38:  89 45 f4     movl   %eax,-0xc(%ebp)
 main+0x3b:  89 55 f8     movl   %edx,-0x8(%ebp)
 main+0x3e:  8b 45 e8     movl   -0x18(%ebp),%eax
 main+0x41:  03 45 e4     addl   -0x1c(%ebp),%eax
 main+0x44:  89 45 e8     movl   %eax,-0x18(%ebp)
 main+0x47:  8b 45 e8     movl   -0x18(%ebp),%eax
 main+0x4a:  0f af 45 e4  imull  -0x1c(%ebp),%eax
 main+0x4e:  89 45 e8     movl   %eax,-0x18(%ebp)
 main+0x51:  8b 45 e8     movl   -0x18(%ebp),%eax

 main+0x54:  99           cltd
 main+0x55:  f7 7d e4     idivl  -0x1c(%ebp)
 main+0x58:  8b d0        movl   %eax,%edx
 main+0x5a:  89 55 e8     movl   %edx,-0x18(%ebp)

 main+0x5d:  0f 31        rdtsc
 main+0x5f:  89 45 ec     movl   %eax,-0x14(%ebp)
 main+0x62:  89 55 f0     movl   %edx,-0x10(%ebp)


 When I compile this program using cc -xO5, the dis output is
 some_func()
 main+0x7:   0f 31        rdtsc
 main+0x9:   89 45 e8     movl   %eax,-0x18(%ebp)
 main+0xc:   89 55 ec     movl   %edx,-0x14(%ebp)

 main+0xf:   0f 31        rdtsc
 main+0x11:  89 45 f0     movl   %eax,-0x10(%ebp)
 main+0x14:  89 55 f4     movl   %edx,-0xc(%ebp)
 main+0x17:  8b 5d f0     movl   -0x10(%ebp),%ebx
 main+0x1a:  8b 45 f4     movl   -0xc(%ebp),%eax
 main+0x1d:  8b 4d e8     movl   -0x18(%ebp),%ecx
 main+0x20:  8b 55 ec     movl   -0x14(%ebp),%edx

 main+0x23:  2b d9        subl   %ecx,%ebx
 main+0x25:  1b c2        sbbl   %edx,%eax
 main+0x27:  89 5d e0     movl   %ebx,-0x20(%ebp)
 main+0x2a:  89 45 e4     movl   %eax,-0x1c(%ebp)


 Now the program logic is wrong! Sun cc thinks the rdtscs are unrelated
 to the other parts of some_func, so it moves the second
 asm("rdtsc") up!
 In this case, I can't measure the time cost.

 Then how can I partly disable Sun cc optimization between these two asm
 statements when using -xO5
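
One hedged reading of the two listings: the -xO5 output shown contains no addl, imull or idivl at all, which suggests that the arithmetic on the constants 3198 and 324 was evaluated at compile time rather than moved past the second rdtsc. Below is a minimal sketch of one way to keep the work between the two reads. It assumes GCC-style extended asm (whether Sun Studio 11 honours the "memory" clobber is not confirmed here), and the names `read_ticks` and `timed_expr` are made up for illustration. The operands are loaded from volatile objects so they cannot be folded, and each asm statement carries a "memory" clobber that asks the compiler not to move memory accesses across it.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: rdtsc on x86 with GCC/Clang, a nanosecond clock elsewhere. */
static uint64_t read_ticks(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    uint32_t lo, hi;
    /* "memory" clobber: a compiler-level barrier, not a CPU-level one */
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi) : : "memory");
    return ((uint64_t)hi << 32) | lo;
#else
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
#endif
}

/* Times i + j * i / j with operands the compiler cannot fold; j_init must be nonzero. */
static int timed_expr(int i_init, int j_init, uint64_t *elapsed)
{
    volatile int i = i_init;   /* volatile forces real loads and stores */
    volatile int j = j_init;
    uint64_t t1, t2;
    int a, b;

    t1 = read_ticks();
    a = i;                     /* one volatile load per operand */
    b = j;
    i = a + b * a / b;         /* volatile store keeps the result live */
    t2 = read_ticks();

    *elapsed = t2 - t1;
    return i;
}
```

With the values from the original example, `timed_expr(3198, 324, &elapsed)` returns 6396 and stores the tick difference in `elapsed`. This only constrains the compiler; out-of-order execution in the CPU is a separate question.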

Re: [osol-discuss] [perf-discuss] how to disable sun cc optimization partly in program?

2007-12-22 Thread Frank Hofmann



On Sat, 22 Dec 2007, Minskey Guo wrote:



On 2007-12-22, at 9:17 PM, 陶捷 TaoJie wrote:


Hi Bart,

I noticed this email just now :(
Thank you for your advice.

Are there any barrier instructions on x86/x64 that could force rdtsc to
behave synchronously?


iret, xchg, cpuid, sfence, lock, etc. But cpuid overwrites eax, ebx, ecx and edx,
and sfence is not available on all Pentiums (PIII ???).


We had this discussion with AMD a while ago. If I remember correctly (but
Bart may well step in here), the only thing that's guaranteed in all
situations and is fully vendor/chip-rev independent is CPUID, which is
sort of a barrier sledgehammer. It takes thousands of cycles.


Wondering - what _exactly_ are you planning to do? Instruction-based
sampling can be done via the CPU performance monitoring counters; the old
"sample time, do something, sample time again" approach is sort of
superseded by those. High-level access in Solaris would be via the cpc(7d) driver.


FrankH.







Re: [osol-discuss] [perf-discuss] how to disable sun cc optimization partly in program?

2007-12-22 Thread 陶捷 TaoJie
2007/12/22, Frank Hofmann [EMAIL PROTECTED]:



 On Sat, 22 Dec 2007, Minskey Guo wrote:

 
  On 2007-12-22, at 9:17 PM, 陶捷 TaoJie wrote:
 
  Hi Bart,
 
  I noticed this email just now :(
  Thank you for your advice.
 
  Are there any barrier instructions on x86/x64 that could force rdtsc to
  behave synchronously?
 
  iret, xchg, cpuid, sfence, lock, etc. But cpuid overwrites eax, ebx, ecx and edx,
  and sfence is not available on all Pentiums (PIII ???).

 We had this discussion with AMD a while ago; if I remember correctly, but


Do you remember the topic of that discussion?



 Bart may well step in here, is that the only thing that's guaranteed in
 all situations and fully vendor/chip-rev independent is CPUID. Which is
 sort of a barrier sledgehammer. Takes thousands of cycles.

Because it takes thousands of cycles?

If it takes thousands of cycles, it will affect the test results a bit.
But it seems a good generic solution.

Btw, on the P4 and later Intel platforms, which instruction is the best
barrier?
And on the AMD Opteron and later AMD platforms?



 Wondering - what _exactly_ are you planning to do ? Instruction-based
 sampling can be done via CPU performance monitoring counters, the old
 sample time, do something, sample time again is sort-of superseded by
 those. High-level access in Solaris would be via the cpc(7d) driver.

OK, I'll try to find some articles about performance monitoring counters and
the cpc driver in Solaris to read.

I have a program whose behavior I want to analyze in detail.
In short, I want to know the time cost of any sub-flow within the whole program
flow.
Suppose the program flow is a long vertical line like
main
func1
func2
func3
some key instructions in func3  (record it as #1)
func4
func3
func2
func1
some key instructions in func1  (record it as #2)
exit in main

I'm interested in:
How much time does func4 take?
How much time does #1 take?
How much time does #2 take?
How much time does the control transfer from func2 to func3 (a function call)
take?
During func4, the program may be interrupted by some event. If so, how much
time does that take, and how much time does the program need to regain the CPU?
If not, that's all right.

Are there any good suggestions for this problem?


Kind Regards,
TJ




Re: [osol-discuss] [perf-discuss] how to disable sun cc optimization partly in program?

2007-12-22 Thread Frank Hofmann



On Sat, 22 Dec 2007, 陶捷 TaoJie wrote:


2007/12/22, Frank Hofmann [EMAIL PROTECTED]:




On Sat, 22 Dec 2007, Minskey Guo wrote:



On 2007-12-22, at 9:17 PM, 陶捷 TaoJie wrote:


Hi Bart,

I noticed this email just now :(
Thank you for your advice.

Are there any barrier instructions on x86/x64 that could force rdtsc to
behave synchronously?


iret, xchg, cpuid, sfence, lock, etc. But cpuid overwrites eax, ebx, ecx and edx,
and sfence is not available on all Pentiums (PIII ???).


We had this discussion with AMD a while ago; if I remember correctly, but



Do you remember the topic of that discussion?


It was related to:

http://src.opensolaris.org/source/diff/onnv/onnv-gate/usr/src/uts/intel/ia32/ml/i86_subr.s?r1=5322r2=5084
http://src.opensolaris.org/source/diff/onnv/onnv-gate/usr/src/uts/i86pc/os/mlsetup.c?r1=5338r2=5084






Bart may well step in here, is that the only thing that's guaranteed in
all situations and fully vendor/chip-rev independent is CPUID. Which is
sort of a barrier sledgehammer. Takes thousands of cycles.


Because it takes thousands of cycles?


Yes.



If it takes thousands of cycles, it will affect the test results a bit.
But it seems a good generic solution.

Btw, on the P4 and later Intel platforms, which instruction is the best
barrier?
And on the AMD Opteron and later AMD platforms?


The only one that serializes even instructions that do not access memory
is cpuid. All the others (mfence and its variants, iret) have corner cases
where they may not serialize [on all CPU varieties]. The source above
should give you some more details.







Wondering - what _exactly_ are you planning to do ? Instruction-based
sampling can be done via CPU performance monitoring counters, the old
sample time, do something, sample time again is sort-of superseded by
those. High-level access in Solaris would be via the cpc(7d) driver.


OK, I'll try to find some articles about performance monitoring counters and
the cpc driver in Solaris to read.


Start with the source and with this one:

http://docs.sun.com/app/docs/doc/816-5172/6mbb7btcs?a=view



I have a program whose behavior I want to analyze in detail.
In short, I want to know the time cost of any sub-flow within the whole program
flow.
Suppose the program flow is a long vertical line like
main
func1
func2
func3
some key instructions in func3  (record it as #1)
func4
func3
func2
func1
some key instructions in func1  (record it as #2)
exit in main


Again, see the source above. Or, rather, try using CPC and/or DTrace's
timestamps. If it's your own source code, userland SDT probes might give
you the necessary sampling hooks as well.





I'm interested in:
How much time does func4 take?
How much time does #1 take?
How much time does #2 take?
How much time does the control transfer from func2 to func3 (a function call)
take?
During func4, the program may be interrupted by some event. If so, how much
time does that take, and how much time does the program need to regain the CPU?
If not, that's all right.

Are there any good suggestions for this problem?


If the time involved in all these samples is _not_ microscopic, then
DTrace's sampling might well tell you. If it is microscopic, though, then
CPC (or even your own use of the CPU performance monitoring facilities, to
avoid kernel driver overhead) might become necessary.


Correctly sampling micro-events is a hard task, and I'm not aware of a
good generic suggestion.
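
Not a generic answer, but one common mitigation can be sketched, under stated assumptions: repeat the measurement many times, keep the smallest interval (interrupts and cache misses only ever add time), and subtract the smallest interval seen for an empty region. The sketch assumes a POSIX clock_gettime(CLOCK_MONOTONIC) timer rather than CPC or rdtsc, and the names `now_ns`, `min_interval_ns`, `net_cost_ns` and `sample_work` are made up for illustration.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Monotonic time in nanoseconds since an arbitrary origin. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Smallest interval over `samples` runs (samples > 0); work == NULL times the timer alone. */
static uint64_t min_interval_ns(void (*work)(void), int samples)
{
    uint64_t best = UINT64_MAX;
    for (int s = 0; s < samples; s++) {
        uint64_t t1 = now_ns();
        if (work != NULL)
            work();
        uint64_t t2 = now_ns();
        if (t2 - t1 < best)
            best = t2 - t1;
    }
    return best;
}

/* Best-case cost of work() with the timer's own best-case cost subtracted. */
static uint64_t net_cost_ns(void (*work)(void), int samples)
{
    uint64_t overhead = min_interval_ns(NULL, samples);
    uint64_t gross = min_interval_ns(work, samples);
    return gross > overhead ? gross - overhead : 0;
}

/* Example workload; volatile keeps the loop from being optimized away. */
static void sample_work(void)
{
    volatile unsigned sink = 0;
    for (unsigned k = 0; k < 1000; k++)
        sink += k;
}
```

`net_cost_ns(sample_work, 1000)` estimates the best-case cost of one call. The minimum hides rather than measures interruptions, so it does not answer the question about time lost to events during func4.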


Have a great weekend,
FrankH.


