[gem5-users] Discrepancy between Timing and O3 CPU models in X86

2017-08-31 Thread Sumanth Sridhar
Hi,

I have come across something odd while comparing the relative
performance of semaphores and mutex locks when 2 threads sum into a
shared variable. As you'd expect, a mutex lock performs better than a
semaphore when simulating on the O3 CPU with the X86 ISA, but
semaphores surprisingly do much better on the TimingSimpleCPU.

I tried the same experiment with the ARM ISA to see if it behaved the
same way. ARM, however, performs as expected: mutex locks beat
semaphores in both CPU models.

These are the steps I took:

1. scons -j 3 build/X86/gem5.opt PROTOCOL=MOESI_hammer
2. build/X86/gem5.opt configs/example/fs.py -n 4
--kernel=x86_64-vmlinux-2.6.22.9.smp --disk-image=linux-x86.img
--script=configs/boot/hack_back_ckpt.rcS
3. build/X86/gem5.opt configs/example/fs.py -r1 -n4
--kernel=x86_64-vmlinux-2.6.22.9.smp --disk-image=linux-x86.img
--script=mutex.rcS --ruby --l1d_size=32kB --l1i_size=32kB
--l2_size=256kB --l1d_assoc=8 --l1i_assoc=8 --l2_assoc=4
--cpu-type=TimingSimpleCPU --restore-with-cpu=TimingSimpleCPU

mutex.rcS:
m5 resetstats
./mutex-test 100
m5 dumpstats

m5 resetstats
m5 exit

Similarly for sem-test and the DerivO3CPU.

The program I am testing with is quite simple:
1. The main thread spawns 2 pthreads.
2. This is the thread function:

// the spawned thread increments the shared 'count' variable 'rounds' times
#include <pthread.h>
#include <semaphore.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
// static sem_t sem;          // semaphore variant: sem_init(&sem, 0, 1)
static volatile long count = 0;
static int rounds = 100000;   // 'rounds' has been set as 100,000 for this test

void pinThread(int cpuId);    // helper that pins the calling thread to a CPU

void* threadFn(void* arg)
{
    // ensure that each thread runs on a single CPU by pinning it
    int cpuId = (int)(long) arg;
    pinThread(cpuId);

    int i;
    for (i = 0; i < rounds; i++)
    {
        // use the lock (or the semaphore calls) to atomically increment 'count'
        pthread_mutex_lock(&lock);
        // sem_wait(&sem);
        count++;
        pthread_mutex_unlock(&lock);
        // sem_post(&sem);
    }
    return NULL;
}
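
For completeness, here is a minimal driver sketch (my reconstruction, not the
original mutex-test source; it takes the round count on the command line to
match the "./mutex-test 100" invocation in mutex.rcS):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv)
{
    if (argc > 1)
        rounds = atoi(argv[1]);   // e.g. "./mutex-test 100"

    // spawn the 2 threads, pinning thread i to CPU i via the arg
    pthread_t tid[2];
    long id;
    for (id = 0; id < 2; id++)
        pthread_create(&tid[id], NULL, threadFn, (void*) id);
    for (id = 0; id < 2; id++)
        pthread_join(tid[id], NULL);

    printf("count = %ld\n", count);  // expect 2 * rounds if locking works
    return 0;
}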

Here are the results after executing the above scripts, in sim_ticks
per round, for 10^5 rounds per thread:

+-----+--------+-------------+-------------+
| ISA | CPU    |    mutex    |     sem     |
+-----+--------+-------------+-------------+
| X86 | O3     |    117 K    |    240 K    |
|     | Timing |    582 K    |    449 K    |
+-----+--------+-------------+-------------+
| ARM | O3     |    149 K    |    184 K    |
|     | Timing |    370 K    |    680 K    |
+-----+--------+-------------+-------------+

(execution for ARM was done using "--caches" instead of "--ruby", but
was otherwise identical to the method used for X86)

These trends repeat for a larger number of 'rounds' as well.

Could someone please help me understand why this might be the case?

Thanks!
Sumanth Sridhar

Re: [gem5-users] Garnet 2.0 + How the dir_nodes are connected in Mesh_XY network

2017-08-31 Thread Tushar Krishna
Hi Faisal,

> On Aug 31, 2017, at 2:17 AM, F. A. Faisal  wrote:
> 
> Dear All,
> 
> 1. I cross-checked the code in the Mesh_XY.py and MeshDirCorners_XY.py
> files. MeshDirCorners_XY.py has a proper implementation of the connected
> dir_nodes. However, I couldn't find any code that connects the dir_nodes
> to each router inside the Mesh_XY.py file.

In Mesh_XY.py, *all* controllers (L1s, then L2s, then directories) are
connected one by one to each router.
This topology inherently assumes the same number of directories as CPUs, so
that each router is connected to both a CPU and a directory.
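
Concretely, the connection is made by a loop along these lines (paraphrased
from Mesh_XY.py; exact variable names may differ across gem5 versions).
Since the node list holds the L1s, then the L2s, then the directories, the
divmod deals one controller from each group to every router:

# paraphrased sketch of the external-link loop in Mesh_XY.py;
# network_nodes, routers, num_routers and link_count are defined
# earlier in that file
ext_links = []
for (i, n) in enumerate(network_nodes):
    # controller i attaches to router i % num_routers, so each router
    # ends up with one L1, one L2, and one directory when their counts
    # all equal the number of routers
    cntrl_level, router_id = divmod(i, num_routers)
    ext_links.append(ExtLink(link_id=link_count, ext_node=n,
                             int_node=routers[router_id],
                             latency=link_latency))
    link_count += 1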

MeshDirCorners_XY, on the other hand, has 4 directory controllers, which are
explicitly connected to the corner routers.

> 
> 2. Now, as shown in the figure of Mesh_XY on the Garnet 2.0 homepage, two
> links connect each pair of routers (one incoming and one outgoing).
> However, in the code below for Mesh_XY, both links are connected to the
> same port name. Hence, since the bandwidth factor is fixed, it seems only
> one of the two routers can send a packet at a time, not both.

I didn't understand your point.
What do you mean by the same port number?
Two links are created - they have different "src_node" and "dst_node"
routers.
For example, between routers 5 and 6, two links are created: one from 5 (src)
to 6 (dst), and one from 6 (src) to 5 (dst).
Internally, each link instantiates an output port at its source router and an
input port at its destination router, so the two links can operate in
parallel.
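
Concretely, the pair between routers 5 and 6 amounts to the following two
IntLinks (an illustrative sketch; the link_ids are placeholders):

# illustrative only: the two unidirectional links between routers 5 and 6
int_links.append(IntLink(link_id=10, src_node=routers[5],
                         dst_node=routers[6],
                         src_outport="East", dst_inport="West",
                         latency=link_latency, weight=1))
int_links.append(IntLink(link_id=11, src_node=routers[6],
                         dst_node=routers[5],
                         src_outport="West", dst_inport="East",
                         latency=link_latency, weight=1))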

> 
> # East output to West input links (weight = 1)
> for row in xrange(num_rows):
>     for col in xrange(num_columns):
>         if (col + 1 < num_columns):
>             east_out = col + (row * num_columns)
>             west_in = (col + 1) + (row * num_columns)
>             int_links.append(IntLink(link_id=link_count,
>                                      src_node=routers[east_out],
>                                      dst_node=routers[west_in],
>                                      src_outport="East",
>                                      dst_inport="West",
>                                      latency=link_latency,
>                                      weight=1))
>             link_count += 1
> 
> # West output to East input links (weight = 1)
> for row in xrange(num_rows):
>     for col in xrange(num_columns):
>         if (col + 1 < num_columns):
>             east_in = col + (row * num_columns)
>             west_out = (col + 1) + (row * num_columns)
>             int_links.append(IntLink(link_id=link_count,
>                                      src_node=routers[west_out],
>                                      dst_node=routers[east_in],
>                                      src_outport="West",
>                                      dst_inport="East",
>                                      latency=link_latency,
>                                      weight=1))
>             link_count += 1
> 
> 3. I would also like to know the minimum number of clock cycles required to
> transmit a complete packet between two interconnected routers.
> 
Each flit takes link_latency cycles to traverse a link.
The default value of link_latency is 1, but it can be overridden for each
link separately in the topology file.

With the default latency, an N-flit packet will thus take N cycles.
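
As a quick sanity check (assuming the default 16-byte flit width; the helper
below is just illustrative, not a Garnet API):

import math

def link_traversal_cycles(packet_bytes, flit_bytes=16, link_latency=1):
    """Cycles for one packet to cross a single link: the packet is
    serialized into N flits, each taking link_latency cycles."""
    n_flits = int(math.ceil(packet_bytes / float(flit_bytes)))
    return n_flits * link_latency

# e.g. a 72-byte data packet -> 5 flits -> 5 cycles on a 1-cycle link
print(link_traversal_cycles(72))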


> 4. And is it possible to do packet-level tracing in Garnet 2.0?
> 

Not yet.
I plan to upload a patch for that next week - keep an eye on the Garnet2.0
wiki page.



> Please let me know the details to improve my understanding.
> 
> Thanks a lot in advance for your kind help.
> 
> Best regards,
> 
> F. A. Faisal
> 

- Tushar

