David -
Thank you for the detailed explanation. Once again, I reviewed your points, my
code, and the valgrind/callgrind output.
Let me summarize briefly:
a. The CPU the code is running on:
"
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
CPU socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
...
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 8192K
NUMA node0 CPU(s): 0-7
$ getconf LEVEL1_DCACHE_LINESIZE
64
"
So, the L1d cache is 32K, with each line 64 bytes long.
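(For reference, the same line size can also be read programmatically; a minimal
sketch, assuming glibc's sysconf name _SC_LEVEL1_DCACHE_LINESIZE is available:)
"
#include <stdio.h>
#include <unistd.h>

int main() {
    /* Same value `getconf LEVEL1_DCACHE_LINESIZE` reports (glibc extension). */
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    printf("L1d line size: %ld bytes\n", line);
    return 0;
}
"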
b.
> Strictly speaking, the variable right is a pointer to a data structure (the
> "&" is syntactic sugar). 11.5% of the time, the specific data structure you
> want at that moment (referenced by right) is not in the L1 cache. It is very
> likely that the pointer itself is in the L1 cache, because it had to be
> pushed onto the stack or loaded into a register.
That is my understanding too.
Plus, the structure in question is very 'simple' - a 64-bit pointer (&right) to
a 64-bit pointer. As I understand it, the CPU should (see the sketch below):
1. prefetch the L1d line starting at the address pointed to by the first
pointer (if that address is not already in L1d).
2. dereference the first pointer (getting the second pointer).
3. prefetch the L1d line starting at the address pointed to by the second
pointer (if that address is not already in L1d).
4. at this point, at least one L1d line's worth of bytes (64), at the address
the `pointer to the pointer` leads to, should be within the L1d cache.
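In code, the four steps above would look roughly like this (a minimal sketch;
the helper and its names are illustrative, not from the real code):
"
// Hypothetical illustration of steps 1-4 above.
static void touch_through(void **pp) {
    __builtin_prefetch(pp);  // 1. prefetch the line holding the first pointer
    void *p = *pp;           // 2. dereference it to get the second pointer
    __builtin_prefetch(p);   // 3. prefetch the line the second pointer targets
    (void)p;                 // 4. both lines should now be in (or on the way
                             //    into) L1d
}
"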
I understand everything you are saying about restructuring the code being the
way to gain performance… But my question at this point is slightly different -
I'm trying to understand what the CPU is actually doing at the point in
question and, secondly, whether the valgrind/callgrind output is 'correct' and
whether my interpretation of that output is correct.
In particular, I'm failing to understand the following:
c. I looked into the C++ code once again:
"
...
SListOp add(const T& obj,T **ret=NULL) {
Node *update[xHeight],*node=&node0,*prev=NULL; int i=level; if
(ret!=NULL) *ret=NULL;
if (node->ptrs[i] != NULL)
__builtin_prefetch((*(void **)&(node->ptrs[i]->obj))); //
NO MISSES THERE!!! point A
if (--i>=0) for (;;) {
if((i-1) >= 0){
if (node->ptrs[i-1]!=NULL&&node->ptrs[i-1]!=prev){
__builtin_prefetch((*(void
**)&(node->ptrs[i-1]->obj))); // NO MISSES THERE!!! point B
}
}
switch
(node->ptrs[i]!=NULL&&node->ptrs[i]!=prev?SL::compare(obj,node->ptrs[i]->obj,extra):SLO_LT)
{...}
…
}
"
and it is in SL::compare() where the misses are happening:
"
…
static SListOp compare(const IndexValue& left,IndexValue& right,ulong ity) {
// __builtin_prefetch(&right);
__builtin_prefetch(*((void **)&right)); // ?! ~11.5%
MISSES AT THIS POINT ?????!!!!!! point C
…
}
…
"
Basically, my understanding, as written in paragraph b., does not explain what
I see in paragraph c.
I would expect to see some miss rate at point A or point B (see the comments
within the code). But for some reason it is happening only at point C…
Even if I put in a line like:
"
…
void *pdb1;
if (node->ptrs[i]!=NULL)
    pdb1 = *(void **)&(node->ptrs[i]->obj);   // NO MISSES THERE!!! point D
switch (node->ptrs[i]!=NULL && node->ptrs[i]!=prev
        ? SL::compare(obj,node->ptrs[i]->obj,extra) : SLO_LT)
{...}
...
"
I would expect the misses to 'move to point D' and be reported there. But that
is not happening…
So, I would really like to understand better what I'm observing…
Thanks a lot for your help,
Michael.
On 2012-05-12, at 3:12 PM, David Chapman <dcchap...@acm.org> wrote:
> On 5/12/2012 8:21 AM, Michael Andronov wrote:
>>
>> David -
>>
>> Thank you for the speedy reply.
>>
>>> ...like 11.5% of the time, the value is not in the cache and must be
>>> retrieved from main memory.
>> That is my understanding too. I do not know how much gain I can get by
>> removing those 11.5%, but I would like to do that.
>
> As I said, without restructuring the code to make every data structure fit
> into cache, you're going to have that 11.5% cache miss rate. The only
> question is how much it slows down your program. Restructuring the code or
> data can reduce the miss rate, likely improving performance too.
>
>>
>>> ...dereferencing the variable named right immediately after the prefetch,
>>> so the CPU will stall 11.5% of the time.
>> Ok…
>> Do I understand correctly that the 11.5% misses at the point of
>> dereferencing the variable named right mean UNDOUBTEDLY that the variable
>> right itself is not in L1 at this moment?
>> (In other words, the 64 bytes at $1 = (IndexValue &) @0x7ffff5ad13e0: are
>> not prefetched?)
>> Also, do I understand correctly that, if right were prefetched, putting in
>> an additional prefetch command would NOT have stalled the CPU?
>
> Strictly speaking, the variable right is a pointer to a data structure (the
> "&" is syntactic sugar). 11.5% of the time, the specific data structure you
> want at that moment (referenced by right) is not in the L1 cache. It is very
> likely that the pointer itself is in the L1 cache, because it had to be
> pushed onto the stack or loaded into a register.
>
>> Also, if - for testing purposes - I put some code between
>>>> __builtin_prefetch(&right);
>> and
>>>> __builtin_prefetch(*((void **)&right)); // ?!! Why are D1mr observed there?
>>
>> it should eliminate the stall. (Though it is not going to improve overall
>> performance, just eliminate the stall (D1mr misses)…)
>> I tried that, putting in code of different lengths… But I am still getting
>> the same 11.5% at the same dereferencing point…
>> That is where I get puzzled… ;)
>
> Depending on the number of data structures you have and their layout, an
> 11.5% rate may not be too bad. Cache memories (especially L1) are small. It
> could be very hard to improve that rate.
>
> Putting instructions between the prefetch and the usage will not reduce the
> miss rate; only a change in code or data layout can do that. But you can
> reduce the stall time, performing useful work while the miss is resolved.
> See http://gcc.gnu.org/onlinedocs/gcc-3.3.6/gcc/Other-Builtins.html - "if the
> prefetch is done early enough before the access then the data will be in the
> cache by the time it is accessed." If you can perform useful work during
> this period, overall performance will increase. If not, work to reduce the
> miss rate. Prefetching does not reduce cache misses.
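(Noting the shape of this advice for myself - a minimal sketch with
hypothetical Node/next/obj names, prefetching one element ahead so the work on
the current element overlaps the fetch of the next:)
"
// Hypothetical traversal; Node is assumed to have next and obj members.
template <class Node, class Fn> static void walk(Node *n, Fn process) {
    while (n != NULL) {
        if (n->next != NULL)
            __builtin_prefetch(n->next->obj); // request the next obj's line early
        process(n->obj);                      // useful work hides the miss latency
        n = n->next;
    }
}
"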
>
>>
>> Finally, I modified the code like:
>> "
>> …
>> static SListOp compare(const IndexValue& left,IndexValue& right,ulong ity)
>> {
>> __builtin_prefetch(*((void **)&right)); //?!! Why D1mr are
>> observed there?
>> int cmp=left.key.cmp(right.key,(TREE_KT)ity); return
>> cmp<0?SLO_LT:cmp>0?SLO_GT:SLO_NOOP;
>> }
>> …
>> "
>> and put __builtin_prefetch(&right) before calling compare(), higher up the
>> calling stack. I even tried different `distances` from the compare() call…
>> But I am still stuck with 11.5% at the same point.
>>
>> And that is where I really get lost. ;)
>
> Not knowing how the IndexValue objects are organized, I can't really help
> here. To reduce the miss percentage, you somehow have to organize the
> accessed set so that it fits into the L1 cache. If they are not in an array,
> or if the size of that array is larger than the L1 cache, it will be
> difficult or impossible to reduce the rate further.
>
>>
>>
>>> Only a cache-friendly reorganization of your data or access patterns can
>>> reduce this…
>> I understand that. The problem is - I do not understand what is happening
>> with the cache at the point of dereferencing. That is why it is difficult
>> to adjust the code.
>
> The cache has a set of recently used memory lines. A miss means that block
> of data hasn't been accessed recently. No more. If you group the data so
> that it is all close together, or access pieces that are very close together
> (moving on to a new section only rarely), then you have a cache-friendly
> organization. This may not be possible in your program.
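(Again noting for myself - the contrast David describes, in sketch form; both
structs are illustrative. A sequential scan touches each 64-byte line once,
covering four 16-byte elements per line, while chasing heap pointers can touch
a new line on every hop:)
"
struct KV  { unsigned long key; unsigned long val; }; // 16 bytes; 4 per 64-byte line

// Cache-friendly: contiguous array, sequential access pattern.
static unsigned long sum_array(const KV *a, int n) {
    unsigned long s = 0;
    for (int i = 0; i < n; i++) s += a[i].key; // at worst one miss per 4 elements
    return s;
}

struct LKV { unsigned long key; LKV *next; }; // nodes scattered on the heap

// Cache-unfriendly: each hop may land on a different line.
static unsigned long sum_list(const LKV *p) {
    unsigned long s = 0;
    for (; p != NULL; p = p->next) s += p->key; // potentially one miss per element
    return s;
}
"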
>
>>
>>
>>> ...if your working set size exceeds the cache size,
>>
>> I do not think that is the case with this program.
>>
>> Thanks for your help. I'm still trying to 'get control' over those D1mr.
>> Michael.
>>
>> On 2012-05-12, at 12:43 AM, David Chapman <dcchap...@acm.org> wrote:
>>
>>> On 5/11/2012 8:04 PM, Michael Andronov wrote:
>>>>
>>>> I'm looking for some help/hints to explain the results I'm observing with
>>>> the following piece of code:
>>>> "
>>>> …
>>>> struct CmpIndexValue {
>>>> static SListOp compare(const IndexValue& left,IndexValue&
>>>> right,ulong ity) {
>>>> __builtin_prefetch(&right);
>>>> __builtin_prefetch(*((void **)&right)); //?!! Why D1mr are
>>>> observed there?
>>>> int cmp=left.key.cmp(right.key,(TREE_KT)ity); return
>>>> cmp<0?SLO_LT:cmp>0?SLO_GT:SLO_NOOP;
>>>> }
>>>> };
>>>> …
>>>> "
>>>> where `right` is a reference to some structure like:
>>>> (gdb) print right
>>>> $1 = (IndexValue &) @0x7ffff5ad13e0: {key = {ptr = {p = 0x7ffff5ad13d8, l = 5}, ... }
>>>> (gdb) x /4x (void **)&right
>>>> 0x7ffff5ad13e0: 0xf5ad13d8  0x00007fff  0x00000005  0x00000000
>>>> (gdb) x /4x *(void **)&right
>>>> 0x7ffff5ad13d8: 0x048056a1  0x01000016  0xf5ad13d8  0x00007fff
>>>> (gdb)
>>>>
>>>> The expected behaviour:
>>>> - the line __builtin_prefetch(*((void **)&right)); should trigger a
>>>> prefetch (64 bytes on my machine) from the address {key = {ptr = {p =
>>>> 0x7ffff5ad13d8… .
>>>> - when the actual access to right.key.ptr.p occurs - within
>>>> left.key.cmp() - the values should be within L1d…
>>>>
>>>> The observed behaviour is different, however:
>>>> - kcachegrind reports a significant 11.5% D1mr miss rate at the
>>>> __builtin_prefetch(*((void **)&right)); line, which looks a bit strange
>>>> and unexpected to me...
>>>>
>>>> Actually, if that line is commented out, then approximately the same
>>>> number of D1mr misses is shown later, within the left.key.cmp() function,
>>>> at the attempt to access the right.key.ptr.p values.
>>>> So, to a 'certain degree', putting in the __builtin_prefetch() call is
>>>> doing the job and performing the prefetching…
>>>> But the idea was to eliminate the D1mr misses, not 'to move' them around
>>>> the code. ;)
>>>>
>>>> I would be very grateful for any hint or direction to help me understand,
>>>> and/or an explanation of where my expectations are wrong and/or what I'm
>>>> doing wrong.
>>>>
>>>>
>>>
>>> The goal of prefetching is to get the data you are going to need while your
>>> code is doing something else. You have told the compiler that a specific
>>> value is going to be used very soon, so it should ask the CPU to ensure
>>> that the value is in the cache. It sounds like 11.5% of the time, the
>>> value is not in the cache and must be retrieved from main memory. Only a
>>> cache-friendly reorganization of your data or access patterns can reduce
>>> this, and if your working set size exceeds the cache size, you're more or
>>> less stuck with that miss rate. You could move the misses around in the
>>> code, but you can't get rid of them by prefetching.
>>>
>>> Alternatively, you could rewrite the code to do more work between the
>>> prefetch request and the actual use of the data. Right now you are
>>> dereferencing the variable named right immediately after the prefetch, so
>>> the CPU will stall 11.5% of the time. If there were more work to do, the
>>> CPU would not be idle while the cache line was being loaded.
>>> --
>>> David Chapman dcchap...@acm.org
>>> Chapman Consulting -- San Jose, CA
>>> Software Development Done Right.
>>> www.chapman-consulting-sj.com
>>
>
>
> --
> David Chapman dcchap...@acm.org
> Chapman Consulting -- San Jose, CA
> Software Development Done Right.
> www.chapman-consulting-sj.com