Re: [m5-dev] Profile Results for Mesh Network

2011-01-27 Thread Nilay Vaish
On Mon, 24 Jan 2011, Nilay Vaish wrote: On Mon, 24 Jan 2011, Steve Reinhardt wrote: Yes, that's right. So there's probably no big win in trying to further reduce the number of calls to lookup() in Ruby; the possibilities I see for improvement are: 1. Adding an instruction buffer to SimpleCPU
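
The instruction-buffer idea mentioned here could, very roughly, look like the sketch below: the CPU keeps the last cache line it fetched and serves sequential instruction fetches from it, so fewer timing requests (and hence fewer Ruby lookups) are issued. This is an illustration only, with hypothetical names such as FetchBuffer; it is not the actual SimpleCPU code.

// Hedged sketch of a per-CPU instruction fetch buffer (not m5 code).
// Assumes lineBytes is a power of two. Stores that may modify code
// (SMC) must invalidate the buffer.
#include <cstdint>
#include <cstring>
#include <vector>

class FetchBuffer {
  public:
    explicit FetchBuffer(unsigned lineBytes)
        : lineSize(lineBytes), lineAddr(~0ULL), data(lineBytes) {}

    // Returns true and copies the instruction bytes if the requested
    // address falls entirely inside the buffered line.
    bool fetch(uint64_t addr, unsigned size, uint8_t *dest) const {
        uint64_t base = addr & ~uint64_t(lineSize - 1);
        if (base != lineAddr || (addr - base) + size > lineSize)
            return false;
        std::memcpy(dest, data.data() + (addr - base), size);
        return true;
    }

    // Called after a real fetch brings a whole line back from the cache.
    void fill(uint64_t addr, const uint8_t *line) {
        lineAddr = addr & ~uint64_t(lineSize - 1);
        std::memcpy(data.data(), line, lineSize);
    }

    void invalidate() { lineAddr = ~0ULL; }   // e.g. on self-modifying code

  private:
    unsigned lineSize;
    uint64_t lineAddr;
    std::vector<uint8_t> data;
};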

Re: [m5-dev] Profile Results for Mesh Network

2011-01-27 Thread Steve Reinhardt
On Thu, Jan 27, 2011 at 4:36 AM, Nilay Vaish ni...@cs.wisc.edu wrote: I tried caching the index for the MRU block, so that the hash table need not be looked up. It is hard to tell whether there is a speedup or not. When I run m5.prof, profile results show that time taken by CacheMemory::lookup()
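
A minimal sketch of the MRU-index caching described here, assuming the tag store is a hash map from line address to a (set, way) pair; the class and member names are illustrative, not the actual CacheMemory interface.

#include <cstdint>
#include <unordered_map>

struct CacheLocation { int set; int way; };

class TagDirectory {
  public:
    // Fast path: check the most recently used block before touching the map.
    bool lookup(uint64_t lineAddr, CacheLocation &loc) {
        if (mruValid && mruAddr == lineAddr) {
            loc = mruLoc;
            return true;
        }
        auto it = tagMap.find(lineAddr);
        if (it == tagMap.end())
            return false;
        loc = it->second;
        mruAddr = lineAddr;   // remember for the next lookup
        mruLoc = loc;
        mruValid = true;
        return true;
    }

    void insert(uint64_t lineAddr, CacheLocation loc) { tagMap[lineAddr] = loc; }

    void invalidate(uint64_t lineAddr) {
        tagMap.erase(lineAddr);
        if (mruValid && mruAddr == lineAddr)
            mruValid = false;   // keep the cached index coherent
    }

  private:
    std::unordered_map<uint64_t, CacheLocation> tagMap;
    uint64_t mruAddr = 0;
    CacheLocation mruLoc{0, 0};
    bool mruValid = false;
};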

Re: [m5-dev] Profile Results for Mesh Network

2011-01-27 Thread Korey Sewell
From Steve's response, it looks like I'm jumping into the conversation on the wrong page. To be clear, Nilay, were you optimizing the lookup() calls or trying to reduce the number of times lookup gets called? My MRU comments and keeping things in the SimpleCPU were directed toward the latter.

Re: [m5-dev] Profile Results for Mesh Network

2011-01-27 Thread Nilay Vaish
On Thu, 27 Jan 2011, Korey Sewell wrote: From Steve's response, it looks like I'm jumping into the conversation on the wrong page. To be clear, Nilay, were you optimizing the lookup() calls or trying to reduce the number of times lookup gets called? My MRU comments and keeping things in the

Re: [m5-dev] Profile Results for Mesh Network

2011-01-27 Thread Nilay Vaish
On Thu, 27 Jan 2011, Steve Reinhardt wrote: On Thu, Jan 27, 2011 at 4:36 AM, Nilay Vaish ni...@cs.wisc.edu wrote: I tried caching the index for the MRU block, so that the hash table need not be looked up. It is hard to tell whether there is a speedup or not. When I run m5.prof, profile results

Re: [m5-dev] Profile Results for Mesh Network

2011-01-27 Thread Steve Reinhardt
On Thu, Jan 27, 2011 at 4:01 PM, Nilay Vaish ni...@cs.wisc.edu wrote: I also have an implementation that performs a linear search of cache sets instead of a hash table lookup. Again, I saw a small improvement. But as you mentioned, when the associativity goes up, linear search will perform
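
For reference, a sketch of the linear-search alternative being compared here (equivalently, the two-dimensional set/way array proposed elsewhere in this thread): compute the set index from the address and scan the ways for a matching tag. The cost grows with associativity, which is the concern raised above. Names are illustrative only.

#include <cstdint>
#include <vector>

struct CacheEntry { bool valid = false; uint64_t tag = 0; };

class SetAssocTags {
  public:
    SetAssocTags(int numSets_, int assoc_)
        : numSets(numSets_), assoc(assoc_), entries(numSets_ * assoc_) {}

    // Returns the way index on a hit, or -1 on a miss.
    int findWay(uint64_t lineAddr) const {
        int set = int(lineAddr % numSets);
        uint64_t tag = lineAddr / numSets;
        const CacheEntry *row = &entries[set * assoc];
        for (int way = 0; way < assoc; ++way)   // linear scan of one set
            if (row[way].valid && row[way].tag == tag)
                return way;
        return -1;
    }

  private:
    int numSets, assoc;
    std::vector<CacheEntry> entries;   // flattened [set][way] array
};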

Re: [m5-dev] Profile Results for Mesh Network

2011-01-24 Thread Steve Reinhardt
On Sun, Jan 23, 2011 at 4:08 PM, Nilay Vaish ni...@cs.wisc.edu wrote: On Sun, 23 Jan 2011, Korey Sewell wrote: In sendFetch(), it calls sendTiming(), which would then call recvTiming on the cache port since those two should be bound as peers. I'm a little unsure of how the RubyPort,

Re: [m5-dev] Profile Results for Mesh Network

2011-01-24 Thread Nilay Vaish
On Mon, 24 Jan 2011, Steve Reinhardt wrote: On Sun, Jan 23, 2011 at 4:08 PM, Nilay Vaish ni...@cs.wisc.edu wrote: On Sun, 23 Jan 2011, Korey Sewell wrote: In sendFetch(), it calls sendTiming(), which would then call recvTiming on the cache port since those two should be bound as peers.

Re: [m5-dev] Profile Results for Mesh Network

2011-01-24 Thread Korey Sewell
Steve, we can try caching the MRU cache block. We can also try replacing the hash table with a two-dimensional array indexed using cache set and cache way. This should at least show some decent speedup (depending on SMC code). The O3 caches the MRU and, ironically, I had just patched the InOrder model

Re: [m5-dev] Profile Results for Mesh Network

2011-01-24 Thread Gabriel Michael Black
Quoting Steve Reinhardt ste...@gmail.com: Gabe, how many bytes at a time does the x86 predecoder fetch? If it doesn't currently grab a cache line at a time, could it be made to do so, and do you know if that would cause any issues with SMC? All of the predecoders expect to receive one
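
A hedged illustration of the question being asked here: if the predecoder can accept an arbitrary span of bytes per call, the CPU could hand it a whole cache line and let it carve out several x86 instructions before another fetch is needed. The interface below is a stand-in with a fake fixed instruction length; it is not the real m5 predecoder.

#include <cstddef>
#include <cstdint>

struct Predecoder {
    // Pretend every instruction is 3 bytes; real x86 predecoding tracks
    // prefixes, opcodes, and operands to determine the true length.
    size_t consume(const uint8_t *bytes, size_t avail, size_t &instLen) {
        (void)bytes;
        if (avail < 3)
            return 0;        // need bytes from the next line
        instLen = 3;
        return instLen;
    }
};

// Decode as many instructions as fit in one line-sized buffer.
static size_t predecodeLine(Predecoder &pd, const uint8_t *line, size_t lineSize) {
    size_t offset = 0, count = 0, len = 0;
    while (pd.consume(line + offset, lineSize - offset, len) > 0) {
        offset += len;
        ++count;             // instructions straddling the line wait for more bytes
    }
    return count;
}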

Re: [m5-dev] Profile Results for Mesh Network

2011-01-23 Thread Nilay Vaish
I dug more into the code today. There are three paths along which calls are made to RubyPort::M5Port::recvTiming(), which eventually result in calls to CacheMemory::lookup(). 1. TimingSimpleCPU::sendFetch() - 140 million 2. TimingSimpleCPU::handleReadPacket() - 30 million 3.

Re: [m5-dev] Profile Results for Mesh Network

2011-01-23 Thread Nilay Vaish
On Sun, 23 Jan 2011, Korey Sewell wrote: In sendFetch(), it calls sendTiming(), which would then call recvTiming on the cache port since those two should be bound as peers. I'm a little unsure of how the RubyPort, Sequencer, CacheMemory, and CacheController (?) relationship is working

[m5-dev] Profile Results for Mesh Network

2011-01-19 Thread Nilay Vaish
I profiled m5 again, using the following command. ./build/ALPHA_FS_MOESI_hammer/m5.prof ./configs/example/ruby_fs.py --maxtick 2000 -n 8 --topology Mesh --mesh-rows 2 --num-l2cache 8 --num-dir 8 Results have been copied below. CacheMemory::lookup() still consumes some time but is

Re: [m5-dev] Profile Results for Mesh Network

2011-01-19 Thread Steve Reinhardt
What's the deal with Histogram::add()? Either it's too slow or it's being called too much, I'd say, unless we're tracking some incredibly vital statistics there. Can you use the call graph part of the profile to find where most of the calls are coming from? Also, can you look at the stats and
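
For illustration only, one way an add() like this can be kept cheap is to use buckets whose width is a power of two, so picking the bucket is a shift rather than a loop or a division. This is a sketch of the performance concern raised above, not the actual Ruby Histogram implementation.

#include <cstddef>
#include <cstdint>
#include <vector>

class FastHistogram {
  public:
    // Bucket width is 2^bucketShift; values past the last bucket are clamped.
    FastHistogram(std::size_t numBuckets, int bucketShift)
        : counts(numBuckets, 0), shift(bucketShift) {}

    void add(uint64_t value) {
        std::size_t idx = std::size_t(value >> shift);   // constant-time bucket pick
        if (idx >= counts.size())
            idx = counts.size() - 1;
        ++counts[idx];
        ++samples;
    }

    uint64_t total() const { return samples; }

  private:
    std::vector<uint64_t> counts;
    int shift;
    uint64_t samples = 0;
};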

Re: [m5-dev] Profile Results for Mesh Network

2011-01-19 Thread nathan binkert
What's the deal with Histogram::add()? Either it's too slow or it's being called too much, I'd say, unless we're tracking some incredibly vital statistics there. Can you use the call graph part of the profile to find where most of the calls are coming from? Don't spend too much time fixing

Re: [m5-dev] Profile Results for Mesh Network

2011-01-19 Thread Nilay Vaish
Some more data from the same simulation, this time from m5out/stats.txt. The number of memory references is slightly more than 30,000,000, whereas the number of lookups in the cache is about 256,000,000. So that would be a ratio of 1 : 8.5. I suspect that the reason for this might be that every
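
For the numbers quoted here, 256,000,000 lookups over roughly 30,000,000 references works out to about 256/30 ≈ 8.5 lookups per reference, which is close to the 8 caches in the simulated system; that arithmetic is what the follow-up question below, about whether every reference probes all 8 caches, is getting at.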

Re: [m5-dev] Profile Results for Mesh Network

2011-01-19 Thread Steve Reinhardt
On Wed, Jan 19, 2011 at 3:56 PM, Nilay Vaish ni...@cs.wisc.edu wrote: Some more data from the same simulation, this time from m5out/stats.txt. The number of memory references is slightly more than 30,000,000, whereas the number of lookups in the cache is about 256,000,000. So that would be a

Re: [m5-dev] Profile Results for Mesh Network

2011-01-19 Thread nathan binkert
Do you mean that I do a lookup in all 8 caches for each reference? Is this part of an assertion that's checking coherence invariants? It seems like that would be something that we'd want to only do in debug mode (or maybe only when explicitly enabled). Seems like something that we could
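
A minimal sketch of the guard being suggested here, assuming the all-caches probe exists only to verify a coherence invariant; the names, the runtime flag, and the example invariant are hypothetical, not the actual Ruby code.

#include <cstdint>
#include <cstdlib>
#include <vector>

struct Cache {
    bool isPresent(uint64_t lineAddr) const { (void)lineAddr; return false; }  // stub tag lookup
};

// Expensive: probes every cache for the line, so it should only run when
// invariant checking is explicitly enabled.
static int countHolders(const std::vector<Cache> &caches, uint64_t lineAddr) {
    int n = 0;
    for (const auto &c : caches)
        if (c.isPresent(lineAddr))
            ++n;
    return n;
}

static bool coherenceChecksEnabled = false;   // could be set from a debug flag

static void checkCoherenceInvariant(const std::vector<Cache> &caches, uint64_t lineAddr) {
    if (!coherenceChecksEnabled)
        return;                               // normal runs skip the all-caches probe
    // Example (placeholder) invariant: at most one cache holds the line.
    if (countHolders(caches, lineAddr) > 1)
        std::abort();
}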

Re: [m5-dev] Profile Results for Mesh Network

2011-01-19 Thread Nilay Vaish
On Wed, 19 Jan 2011, Steve Reinhardt wrote: On Wed, Jan 19, 2011 at 3:56 PM, Nilay Vaish ni...@cs.wisc.edu wrote: Some more data from the same simulation, this time from m5out/stats.txt. The number of memory references is slightly more than 30,000,000, whereas the number of lookups in the