[gem5-users] Re: ARM SVE ISA

2024-01-25 Thread Giacomo Travaglini via gem5-users

Hi Nazmus,


You should have a look at 
https://github.com/gem5/gem5/blob/stable/src/arch/arm/insts/sve_macromem.hh


To answer your question briefly: the micro-op cracking happens in the instruction definition, which is why you can't find anything in the cpu/pipeline code. If you want to change things, you should amend the instruction definition accordingly.
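
For intuition, here is a minimal, self-contained sketch of that pattern (illustrative only; the class names below are invented and are not gem5's actual API, which lives in sve_macromem.hh and the ISA description files): the macro-op's constructor cracks a gather into one micro-op per active element, and the pipeline then executes those micro-ops one by one, each touching a single address and a single word.

// Illustrative sketch, not gem5 code.
#include <cstdint>
#include <iostream>
#include <vector>

// Stand-in for a single micro-op: loads one element from one address.
struct GatherMicroOp {
    unsigned elemIdx;      // destination element this micro-op fills
    std::size_t addr;      // "effective address" (index into memory here)

    void execute(std::vector<std::uint32_t> &dest,
                 const std::uint32_t *memory) const {
        dest[elemIdx] = memory[addr];   // one word per micro-op
    }
};

// Stand-in for the macro-op: the cracking happens here, in the
// "instruction definition", not in the CPU pipeline code.
struct GatherMacroOp {
    std::vector<GatherMicroOp> microops;

    GatherMacroOp(const std::vector<std::size_t> &indices,
                  const std::vector<bool> &pred) {
        for (unsigned i = 0; i < indices.size(); ++i)
            if (pred[i])                        // only active elements
                microops.push_back({i, indices[i]});
    }
};

int main() {
    std::uint32_t memory[8] = {10, 11, 12, 13, 14, 15, 16, 17};
    GatherMacroOp gather({7, 2, 5, 0}, {true, true, false, true});

    std::vector<std::uint32_t> z0(4, 0);        // destination vector register
    for (const auto &uop : gather.microops)     // the pipeline sees these individually
        uop.execute(z0, memory);

    for (auto v : z0) std::cout << v << ' ';    // prints: 17 12 0 10
    std::cout << '\n';
}

In the real code the micro-ops are emitted from the ISA description and handed to the CPU through the macro-op (gem5's StaticInst fetchMicroop() hook), which is why changing the cracking policy means editing the instruction definition rather than the LSQ or decode code.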


Kind Regards


Giacomo



On 25/01/2024 16:08, Nazmus Sakib wrote:
Hello. I had a follow-up question.
As I understand from a list of enums, gem5 has different implementations for SVE loads/stores, such as unit stride, non-unit stride, and indexed (gather/scatter).
In the case of a gather, I can see from ExecAll and other debug output that one load is separated into micro-ops. Each micro-op is responsible for one address and one word.
I know part of the memory traffic is controlled/determined by the ISA, e.g. in initiateAcc() or completeAcc().
But which files should I look into if I wanted to find:
1. Exactly where in the CPU pipeline does this process take place? I mean the process of one load instruction breaking down into several micro-ops, and also how they are fed into the registers. This is an ISA-specific problem, and in the LSQUnit they just get committed, so I am assuming the decoder has a part here. I am just curious how (or at least where) this interface between the ISA and the pipeline works, and also, if I wanted to change the ISA implementation of this (suppose I decide not to break it into micro-ops), where should I look?
2. In the stats files I do not see any statistics that point to this. So basically, even though it is a single load, are all the statistics like cache hits and misses compiled for each individual micro-op, as if they were scalar instructions rather than a vector one?


From: Giacomo Travaglini 

Sent: 15 January 2024 07:55
To: Nazmus Sakib ; The gem5 Users mailing list 

Cc: Jason Lowe-Power 
Subject: Re: ARM SVE ISA

Hi Nazmus


On 15/01/2024 14:32, Nazmus Sakib wrote:
Hello. Thanks for your response.
I am running the O3 CPU (ArmO3CPU), not the MinorCPU.


It's the same:

https://github.com/gem5/gem5/blob/stable/src/cpu/o3/lsq.cc#L816


Also, I get that the LSQ unit can do this.
But must a cache have separate logic for scalar and vector reads/writes, since the events scheduled to model the timing of vector loads/stores must be different?


A gem5 cache only reasons in terms of cachelines (64 bytes), and the same goes for a coherent interconnect, regardless of vector vs. scalar.



Also, must the interconnect (bus, crossbar, or whatever) be wide enough to support vector reads/writes?


As I mentioned earlier, memory requests bigger than a cacheline will be split into fragments at the LSQ. To give you a more concrete example: say you have a 1024-bit vector (128 bytes). A single vector load will be split into two 64-byte memory requests. The D-cache will see two requests to two consecutive cachelines. It will issue two GetS requests if they miss, or return the data if the lines are present.

The LSQ will wait for both requests to return with data and will coalesce them before writing the data back to the destination vector register.
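
A rough sketch of that splitting arithmetic (illustrative only; the helper below is invented and is not gem5's actual LSQ code, which is in the lsq.cc file linked above): a request covering [addr, addr + size) is cut at every 64-byte cacheline boundary, and the fragments are reassembled into the destination register once they all return.

// Illustrative sketch of cacheline splitting, not gem5 code.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr std::uint64_t kLineSize = 64;   // gem5's default cacheline size

struct Fragment { std::uint64_t addr; std::uint64_t size; };

// Split a request into cacheline-bounded fragments.
std::vector<Fragment> splitByCacheline(std::uint64_t addr, std::uint64_t size)
{
    std::vector<Fragment> frags;
    const std::uint64_t end = addr + size;
    while (addr < end) {
        std::uint64_t lineEnd = (addr / kLineSize + 1) * kLineSize;
        std::uint64_t chunk = std::min(end, lineEnd) - addr;
        frags.push_back({addr, chunk});
        addr += chunk;
    }
    return frags;
}

int main()
{
    // A 1024-bit (128-byte) vector load starting on a line boundary:
    for (auto f : splitByCacheline(0x81000, 128))
        std::printf("fragment [%#llx:%#llx]\n",
                    (unsigned long long)f.addr,
                    (unsigned long long)(f.addr + f.size - 1));
    // Prints two 64-byte fragments; the LSQ waits for both replies and
    // copies each one into its slice of the destination vector register.
}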


I hope this helps


Giacomo




From: Giacomo Travaglini 

Sent: 15 January 2024 03:30
To: Nazmus Sakib ; The gem5 Users mailing list 

Cc: Jason Lowe-Power 
Subject: Re: ARM SVE ISA

Hi Nazmus,


On 15/01/2024 02:41, Nazmus Sakib wrote:
Thank you. I will try to switch to starter_se.py.
I still had some questions regarding SVE.
1. When I compile with -msve-vector-bits set to 512, I can see the PTRUE instruction, which is replaced by whilelo when I compile without setting the vector-bits value. Now, on gem5 it seems that whilelo and the corresponding incw instructions work fine, because when I keep sve_vl=1 in gem5, incw increments by 0x4 (128 bits), and when I set sve_vl=4, incw increments by 0x10 (512 bits). But what I am curious about is whether there is anything wrong with the implementation of the PTRUE instruction in gem5.


Without inspecting the disassembled program, I would simply guess that using -msve-vector-bits=512 forces the code to not be VL-agnostic and hardcodes the vector length to 512 bits. So there is nothing surprising about the run failing on non-matching hardware.

I believe the proof that there is nothing inherently wrong with ptrue in gem5 comes from the fact that, keeping the 512-bit binary untouched (with ptrue) and only setting VL=4, you have a successful run.

[gem5-users] Re: ARM SVE ISA

2024-01-25 Thread Nazmus Sakib via gem5-users
Hello. I had a follow-up question.
As I understand from a list of enums, gem5 has different implementations for SVE loads/stores, such as unit stride, non-unit stride, and indexed (gather/scatter).
In the case of a gather, I can see from ExecAll and other debug output that one load is separated into micro-ops. Each micro-op is responsible for one address and one word.
I know part of the memory traffic is controlled/determined by the ISA, e.g. in initiateAcc() or completeAcc().
But which files should I look into if I wanted to find:
1. Exactly where in the CPU pipeline does this process take place? I mean the process of one load instruction breaking down into several micro-ops, and also how they are fed into the registers. This is an ISA-specific problem, and in the LSQUnit they just get committed, so I am assuming the decoder has a part here. I am just curious how (or at least where) this interface between the ISA and the pipeline works, and also, if I wanted to change the ISA implementation of this (suppose I decide not to break it into micro-ops), where should I look?
2. In the stats files I do not see any statistics that point to this. So basically, even though it is a single load, are all the statistics like cache hits and misses compiled for each individual micro-op, as if they were scalar instructions rather than a vector one?


From: Giacomo Travaglini 
Sent: 15 January 2024 07:55
To: Nazmus Sakib ; The gem5 Users mailing list 

Cc: Jason Lowe-Power 
Subject: Re: ARM SVE ISA

Hi Nazmus


On 15/01/2024 14:32, Nazmus Sakib wrote:
Hello. Thanks for your response.
I am running the O3 CPU (ArmO3CPU), not the MinorCPU.


It's the same:

https://github.com/gem5/gem5/blob/stable/src/cpu/o3/lsq.cc#L816


Also, I get that the LSQ unit can do this.
But must a cache have separate logic for scalar and vector reads/writes, since the events scheduled to model the timing of vector loads/stores must be different?


A gem5 cache only reasons in terms of cachelines (64 bytes), and the same goes for a coherent interconnect, regardless of vector vs. scalar.



Also, must the interconnect (bus, crossbar, or whatever) be wide enough to support vector reads/writes?


As I mentioned earlier, memory requests bigger than a cacheline will be split into fragments at the LSQ. To give you a more concrete example: say you have a 1024-bit vector (128 bytes). A single vector load will be split into two 64-byte memory requests. The D-cache will see two requests to two consecutive cachelines. It will issue two GetS requests if they miss, or return the data if the lines are present.

The LSQ will wait for both requests to return with data and will coalesce them before writing the data back to the destination vector register.


I hope this helps


Giacomo




From: Giacomo Travaglini 

Sent: 15 January 2024 03:30
To: Nazmus Sakib ; The gem5 Users 
mailing list 
Cc: Jason Lowe-Power 
Subject: Re: ARM SVE ISA

Hi Nazmus,


On 15/01/2024 02:41, Nazmus Sakib wrote:
Thank you. I will try to switch to starter_se.py.
I still had some questions regarding SVE.
1. When I compile with -msve-vector-bits set to 512, I can see the PTRUE instruction, which is replaced by whilelo when I compile without setting the vector-bits value. Now, on gem5 it seems that whilelo and the corresponding incw instructions work fine, because when I keep sve_vl=1 in gem5, incw increments by 0x4 (128 bits), and when I set sve_vl=4, incw increments by 0x10 (512 bits). But what I am curious about is whether there is anything wrong with the implementation of the PTRUE instruction in gem5.


Without inspecting the disassembled program, I would simply guess that using -msve-vector-bits=512 forces the code to not be VL-agnostic and hardcodes the vector length to 512 bits. So there is nothing surprising about the run failing on non-matching hardware.

I believe the proof that there is nothing inherently wrong with ptrue in gem5 comes from the fact that, keeping the 512-bit binary untouched (with ptrue) and only setting VL=4, you have a successful run.


2. As shown in my first email, my data arrays are 64 bytes in size. An SVE load instruction with sve_vl=4 should allow all 64 bytes to be loaded by one ld1w instruction (theoretically, at least on an actual CPU). I can see from the output generated by the LSQUnit and CacheAll debug flags that indeed all 64 bytes are accessed by one instruction. For example:
system.cpu.dcache: access for WriteReq [81010:8104f]
The address range here covers 64 bytes (16 four-byte integers in my test code).
But, without support in the bus/interconnection
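
To connect the WriteReq [81010:8104f] line above with the cacheline discussion earlier in the thread, here is the plain address arithmetic (a worked sketch, not gem5 code): the access is 64 bytes long, but it starts at 0x81010, which is not 64-byte aligned, so at cacheline granularity it touches the two lines at 0x81000 and 0x81040.

// Worked example for the WriteReq [81010:8104f] access, not gem5 code.
#include <cstdint>
#include <cstdio>

int main()
{
    const std::uint64_t lineSize = 64;     // gem5's default cacheline size
    const std::uint64_t start = 0x81010;   // first byte of the access
    const std::uint64_t end   = 0x8104f;   // last byte of the access

    std::uint64_t firstLine = (start / lineSize) * lineSize;   // 0x81000
    std::uint64_t lastLine  = (end   / lineSize) * lineSize;   // 0x81040
    std::uint64_t nLines    = (lastLine - firstLine) / lineSize + 1;

    std::printf("%llu-byte access touches %llu cacheline(s), starting at %#llx and %#llx\n",
                (unsigned long long)(end - start + 1),
                (unsigned long long)nLines,
                (unsigned long long)firstLine,
                (unsigned long long)lastLine);
    // Because 0x81010 is not 64-byte aligned, the 64-byte store spans two
    // cachelines, just like the 128-byte example discussed above.
}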

[gem5-users] 'SConsEnvironment' object has no attribute 'M4': when building

2024-01-25 Thread Ioannis Constantinou via gem5-users
Hello all,

So I’m trying to build gem5 on a new machine and I get the following error.



scons: Reading SConscript files ...

Checking for linker -Wl,--as-needed support... yes

Checking for compiler -gz support... yes

Checking for linker -gz support... yes

Info: Using Python config: python3-config

Checking for C header file Python.h... yes

Checking Python version... 3.10.4

Checking for accept(0,0,0) in C++ library None... yes

Checking for zlibVersion() in C++ library z... yes

Checking for C library tcmalloc... no

Checking for C library tcmalloc_minimal... no

Warning: You can get a 12% performance improvement by installing tcmalloc 
(libgoogle-perftools-dev package on Ubuntu or RedHat).

Checking for char temp; backtrace_symbols_fd((void *)&temp, 0, 0) in C library 
None... yes

Checking for C header file fenv.h... yes

Checking for C header file png.h... yes

Checking for clock_nanosleep(0,0,NULL,NULL) in C library None... yes

Checking for C header file valgrind/valgrind.h... no

Checking for pkg-config package hdf5-serial... no

Checking for pkg-config package hdf5... no

Checking for H5Fcreate("", 0, 0, 0) in C library hdf5... no

Warning: Couldn't find HDF5 C++ libraries. Disabling HDF5 support.

Checking for C header file linux/if_tun.h... yes

Checking for shm_open("/test", 0, 0) in C library None... no

Checking for shm_open("/test", 0, 0) in C library rt... yes

Checking for C header file linux/kvm.h... yes

Checking for timer_create(CLOCK_MONOTONIC, NULL, NULL) in C library None... yes

Checking size of struct kvm_xsave ... yes

Checking for member exclude_host in struct perf_event_attr...yes

Checking for pkg-config package protobuf... yes

Checking for GOOGLE_PROTOBUF_VERIFY_VERSION in C++ library protobuf... yes

AttributeError: 'SConsEnvironment' object has no attribute 'M4':

  File "/onyx/data/p182/GEM5_STABLE_VERSION/gem5_qemu_virt/SConstruct", line 602:

    main.SConscript(os.path.join(root, 'SConscript'),

  File "/nvme/h/buildsets/eb_cyclone_rl/software/SCons/4.4.0-GCCcore-11.3.0/lib/python3.10/site-packages/SCons/Script/SConscript.py", line 597:

    return _SConscript(self.fs, *files, **subst_kw)

  File "/nvme/h/buildsets/eb_cyclone_rl/software/SCons/4.4.0-GCCcore-11.3.0/lib/python3.10/site-packages/SCons/Script/SConscript.py", line 285:

    exec(compile(scriptdata, scriptname, 'exec'), call_stack[-1].globals)

  File "/onyx/data/p182/GEM5_STABLE_VERSION/gem5_qemu_virt/build/libelf/SConscript", line 121:

    m4env.M4(target=File('libelf_convert.c'),




I have experience building gem5, but on this machine I can't figure out what the problem is. Has anyone faced the same issue before?

The gem5 version I use is 21.1.0.2

GCC version is 11.3

Python 3.10.4

Thank you in advance,
Ioannis Constantinou.




[gem5-users] X86 multi-core full system simulation with kvm

2024-01-25 Thread 张聪武 via gem5-users
Hi,




I'm simulating a multi-core architecture in full-system mode. I want to accelerate my simulation with KVM; however, it is slow when bringing up the other cores and shows [Firmware Bug] messages. I was wondering if there is a way to solve this.

The following image is my running log. The simulation can continue, but it may take several minutes before the next [Firmware Bug] message appears.

Thanks,

Congwu Zhang