On 5/12/2012 8:21 AM, Michael Andronov wrote:
David -
Thank you for the speedy reply.
...like 11.5% of the time, the value is not in the cache and must be
retrieved from main memory.
That is my understanding too. I do not know how much gain I can get by
removing that 11.5%, but I would like to do that.
As I said, without restructuring the code to make every data structure
fit into cache, you're going to have that 11.5% cache miss rate. The
only question is how much it slows down your program. Restructuring the
code or data can reduce the miss rate, likely improving performance too.
...dereferencing the variable named /right/ immediately after the
prefetch, so the CPU will stall 11.5% of the time.
Ok…
Do I understand correctly that the 11.5% of misses at the point of
dereferencing the variable named /right/ means UNDOUBTEDLY that the
variable /right/ itself is not in L1 at that moment?
(In other words, that the 64 bytes at $1 = (IndexValue &) @0x7ffff5ad13e0:
are not prefetched?)
Also, do I understand correctly that, if /right/ were already
prefetched, putting an additional prefetch command would NOT have
stalled the CPU?
Strictly speaking, the variable /right/ is a pointer to a data structure
(the "&" is syntactic sugar). 11.5% of the time, the specific data
structure you want at that moment (referenced by /right/) is not in the
L1 cache. It is very likely that the pointer itself is in the L1 cache,
because it had to be pushed onto the stack or loaded into a register.
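To make that concrete, here is a minimal sketch of what the two prefetch
lines actually do, assuming a 64-bit target. The IndexValue declaration
is inferred from your gdb dump and is illustrative only, not the real one:
"
// Illustrative layout only: an object whose first 8 bytes are the
// embedded pointer seen in the gdb dump (key.ptr.p).
struct IndexValue {
    struct { struct { const void *p; unsigned long l; } ptr; /* ... */ } key;
};

static void prefetch_both(IndexValue& right) {
    // Computing &right needs no data load beyond the reference itself,
    // which is almost certainly in a register or on the stack (and thus
    // in D1), so this line is very unlikely to miss.
    __builtin_prefetch(&right);

    // The argument here is an ordinary load: it reads the first 8 bytes
    // of the IndexValue object (i.e. right.key.ptr.p) to compute the
    // prefetch address. If the object is not yet in D1 - for example,
    // because the prefetch above has not completed - cachegrind charges
    // a D1mr miss to this line.
    __builtin_prefetch(*((void **)&right));
}
"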
Also, if - for testing purposes - I put some code between
__builtin_prefetch(&right);
and
__builtin_prefetch(*((void **)&right)); //?!! Why are D1mr misses observed here?
it should eliminate the stall. (Though it would not improve overall
performance, just eliminate the stall, i.e. the D1mr misses…)
I tried that, inserting code of various lengths… But I still get the
same 11.5% at the same dereferencing point…
That is where I get puzzled… ;)
Depending on the number of data structures you have and their layout, an
11.5% rate may not be too bad. Cache memories (especially L1) are
small. It could be very hard to improve that rate.
Putting instructions between the prefetch and the usage will not reduce
the miss rate; only a change in code or data layout can do that. But
you can reduce the stall time by performing useful work while the miss
is being resolved. See
http://gcc.gnu.org/onlinedocs/gcc-3.3.6/gcc/Other-Builtins.html - "if
the prefetch is done early enough before the access then the data will
be in the cache by the time it is accessed." If you can perform useful
work during this period, overall performance will increase. If not,
work to reduce the miss rate. Prefetching does not reduce cache misses.
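For example, in a linear traversal the prefetch can be issued several
elements ahead, so the line has time to arrive while you work on the
current element. A minimal sketch (the array, the distance, and the
summing work are all illustrative, not from your code):
"
// Prefetch-distance sketch: while element i is processed, element
// i + DIST is requested, so its cache line can arrive before it is
// needed. DIST must be tuned against memory latency and the amount
// of per-element work.
long process_all(const int *data, unsigned long n) {
    const unsigned long DIST = 8;            // illustrative distance
    long sum = 0;
    for (unsigned long i = 0; i < n; ++i) {
        if (i + DIST < n)
            __builtin_prefetch(&data[i + DIST]);
        sum += data[i];                      // the "useful work" on element i
    }
    return sum;
}
"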
Finally, I modified the code like this:
"
…
static SListOp compare(const IndexValue& left, IndexValue& right, ulong ity) {
    __builtin_prefetch(*((void **)&right)); //?!! Why are D1mr misses observed here?
    int cmp = left.key.cmp(right.key, (TREE_KT)ity);
    return cmp<0 ? SLO_LT : cmp>0 ? SLO_GT : SLO_NOOP;
}
…
"
and put /__builtin_prefetch(&right)/ before the call to /compare()/,
higher up the call stack. I even tried different `distances` from the
compare() call… But I am still stuck with 11.5% at the same point.
And that is where I really get lost. ;)
Not knowing how the IndexValue objects are organized, I can't really
help here. To reduce the miss percentage, you somehow have to organize
the accessed set so that it fits into the L1 cache. If they are not in
an array, or if the size of that array is larger than the L1 cache, it
will be difficult or impossible to reduce the rate further.
Only a cache-friendly reorganization of your data or access patterns
can reduce this…
I understand that. The problem is that I do not understand what is
happening with the cache at the point of dereferencing. That is why it
is difficult to adjust the code.
The cache has a set of recently used memory lines. A miss means that
that block of data hasn't been accessed recently; nothing more. If you group the
data so that it is all close together, or access pieces that are very
close together (moving on to a new section only rarely), then you have a
cache-friendly organization. This may not be possible in your program.
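As an illustration of the two extremes (the Node type and the functions
here are hypothetical, not your IndexValue):
"
// Pointer-chasing layout: every step is a dependent load, and each
// node can live on a different cache line.
struct Node { int key; Node *next; };

int sum_list(const Node *n) {
    int s = 0;
    for (; n; n = n->next) s += n->key;   // one potential D1 miss per node
    return s;
}

// Cache-friendly layout: the same keys packed contiguously, so one
// 64-byte line holds 16 of them and accesses are sequential.
int sum_packed(const int *keys, unsigned long n) {
    int s = 0;
    for (unsigned long i = 0; i < n; ++i) s += keys[i];
    return s;
}
"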
...if your working set size exceeds the cache size,
I do not think it is the case with that program.
Thanks for your help. I'm still trying to 'get control' over those D1mr.
Michael.
On 2012-05-12, at 12:43 AM, David Chapman <dcchap...@acm.org> wrote:
On 5/11/2012 8:04 PM, Michael Andronov wrote:
I'm looking for some help/hints to explain the results I'm observing with the
following piece of code:
"
…
struct CmpIndexValue {
    static SListOp compare(const IndexValue& left, IndexValue& right, ulong ity) {
        __builtin_prefetch(&right);
        __builtin_prefetch(*((void **)&right)); //?!! Why are D1mr misses observed here?
        int cmp = left.key.cmp(right.key, (TREE_KT)ity);
        return cmp<0 ? SLO_LT : cmp>0 ? SLO_GT : SLO_NOOP;
    }
};
…
"
where `right` is a reference to a structure like:
(gdb) print right
$1 = (IndexValue&) @0x7ffff5ad13e0: {key = {ptr = {p = 0x7ffff5ad13d8, l = 5},
... }
(gdb) x /4x (void **)&right
0x7ffff5ad13e0: 0xf5ad13d8 0x00007fff 0x00000005 0x00000000
(gdb) x /4x *(void **)&right
0x7ffff5ad13d8: 0x048056a1 0x01000016 0xf5ad13d8 0x00007fff
(gdb)
The expected behaviour:
- the line __builtin_prefetch(*((void **)&right)); should trigger a
prefetch (64 bytes on my machine) from the address {key = {ptr = {p = 0x7ffff5ad13d8… ;
- when the actual access to right.key.ptr.p occurs - within left.key.cmp()
- the values should already be in D1…
The observed behaviour is different, however:
- kcachegrind reports a significant 11.5% of D1mr misses at the
__builtin_prefetch(*((void **)&right)); line, which looks a bit strange
and unexpected to me...
Actually, if that line is commented out, then approximately the same number
of D1mr misses shows up later, within the left.key.cmp() function, at the
attempt to access the right.key.ptr.p values.
So, to a certain degree, the __builtin_prefetch() call is doing its job and
performing the prefetch…
But the idea was to eliminate the D1mr misses, not to 'move' them around the
code. ;)
I would be very grateful for any hint or direction to help me understand,
and/or an explanation of where my expectations are wrong and/or what I'm
doing wrong.
The goal of prefetching is to get the data you are going to need
while your code is doing something else. You have told the compiler
that a specific value is going to be used very soon, so it should ask
the CPU to ensure that the value is in the cache. It sounds like
11.5% of the time, the value is not in the cache and must be
retrieved from main memory. Only a cache-friendly reorganization of
your data or access patterns can reduce this, and if your working set
size exceeds the cache size, you're more or less stuck with that miss
rate. You could move the misses around in the code, but you can't
get rid of them by prefetching.
Alternatively, you could rewrite the code to do more work between the
prefetch request and the actual use of the data. Right now you are
dereferencing the variable named /right/ immediately after the
prefetch, so the CPU will stall 11.5% of the time. If there were more
work to do, the CPU would not be idle while the cache line was being
loaded.
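Schematically, that looks like this (the helper function is a
hypothetical placeholder, not anything from your code):
"
// Overlapping a possible miss with independent work. The helper is a
// hypothetical stand-in for any work that does not touch *p.
static int compute_independent() { return 42; }

static int overlapped(const int *p) {
    __builtin_prefetch(p);            // start the (possible) miss early
    int x = compute_independent();    // CPU stays busy while it resolves
    return *p + x;                    // by now the line should be in D1
}
"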
--
David Chapman dcchap...@acm.org
Chapman Consulting -- San Jose, CA
Software Development Done Right.
www.chapman-consulting-sj.com