[gem5-users] Re: ARM SVE ISA

2024-01-31 Thread Nazmus Sakib via gem5-users
Hello Mr. Travaglini.
I had some more questions, although I am not sure whether they are specific to
this thread.
1. When I try to run a simulation with a cacheline size smaller than the fetch
buffer size, I get an error: as it turns out, the cacheline size has to be at
least as large as the fetch buffer size. I know how to change the fetch buffer
size, but my question is why? If the fetch buffer is 64 bytes and the cacheline
is 32 bytes, why can't the fetch unit take two cachelines (from the instruction
cache, I am guessing)?
2. I can change the cacheline size to 8 bytes, but not smaller. When I run the
simulation with a 4-byte cacheline, I get a page fault. I did some debugging,
and I can see that some addresses are mapped in the page table via
allocateMem() in sim/process and map() in pagetable.cc down to an 8-byte
cacheline, but not for 4 bytes. Note that some allocation and mapping is done
at system initialization time, and some happens after the "Real Simulation"
message in gem5. For both 4-byte and larger cachelines, the initial mappings
for my test binary are done the same way, but the runtime mapping does not
happen for 4 bytes. Is this a 32-bit/64-bit thing, since AArch64 is 64-bit and
my cachelines are 4 bytes? But in that case, why a page fault? I mean, you can
zero-extend a 32-bit address to a 64-bit address, right?
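For intuition on question 1, here is a small sketch (plain Python, not gem5 code; the function name and structure are invented for illustration) of which cacheline-aligned addresses a single fetch-buffer fill would have to touch if the buffer were wider than a line:

```python
def cachelines_for_fetch(fetch_addr, fetch_buffer_size, cacheline_size):
    """List the cacheline-aligned addresses one fetch-buffer fill would
    touch (hypothetical helper, not part of gem5)."""
    start = fetch_addr - (fetch_addr % cacheline_size)  # align down
    end = fetch_addr + fetch_buffer_size
    lines = []
    addr = start
    while addr < end:
        lines.append(addr)
        addr += cacheline_size
    return lines

# A 64-byte fetch buffer over 32-byte cachelines would need two lines,
# which is exactly the case gem5's fetch stage rules out:
print([hex(a) for a in cachelines_for_fetch(0x1000, 64, 32)])  # ['0x1000', '0x1020']
```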



[gem5-users] Re: ARM SVE ISA

2024-01-25 Thread Giacomo Travaglini via gem5-users

Hi Nazmus,


You should have a look at 
https://github.com/gem5/gem5/blob/stable/src/arch/arm/insts/sve_macromem.hh


To answer your question simply: the micro-op cracking happens in the
instruction definition, and that is why you cannot find anything in the
cpu/pipeline code. If you want to change things, you should amend the
instruction definition accordingly.
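As a rough illustration of what such cracking amounts to (a hypothetical Python sketch; the real logic is generated from the SVE instruction definitions such as sve_macromem.hh, and these names are invented):

```python
def crack_gather(base, index_vector, elem_size=4):
    """Crack one SVE gather load into per-element micro-ops, each
    covering one address and one word (invented names; the real
    cracking is generated from the SVE instruction definitions)."""
    return [
        {"uop": i, "addr": base + idx * elem_size, "bytes": elem_size}
        for i, idx in enumerate(index_vector)
    ]

# A 4-element gather yields four single-word micro-ops:
for uop in crack_gather(0x81000, [0, 7, 3, 12]):
    print(uop)
```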


Kind Regards


Giacomo



[gem5-users] Re: ARM SVE ISA

2024-01-25 Thread Nazmus Sakib via gem5-users
Hello. I had a follow-up question.
As I understand from a list of enums, gem5 has different implementations for
SVE loads/stores: unit stride, non-unit stride, and indexed (gather/scatter).
In the case of a gather, I can see from ExecAll and other debug output that one
load is separated into micro-ops, each responsible for one address and one
word.
I know part of the memory traffic is controlled/determined by the ISA, e.g.
initiateAcc() and completeAcc().
But which files should I look into if I wanted to find:
1. Exactly where in the CPU pipeline this takes place? I mean the process of
one load instruction breaking down into several micro-ops, and how their
results are fed into the registers. This is an ISA-specific problem, and the
LSQ unit just commits them, so I am assuming the decoder has a part here. I am
just curious how (or at least where) this interface between the ISA and the
pipeline works, and also, if I wanted to change the ISA implementation
(suppose I decide not to break this into micro-ops), where should I look?
2. In the stats files, I do not see any statistics that point to this. So
basically, even though it is a single load, all the statistics (cache hits,
misses, etc.) are compiled for each individual micro-op, as if they were
scalar instructions, not vector?


[gem5-users] Re: ARM SVE ISA

2024-01-15 Thread Giacomo Travaglini via gem5-users

Hi Nazmus


On 15/01/2024 14:32, Nazmus Sakib wrote:
Hello. Thanks for your response.
I am running the O3 CPU (ArmO3CPU), not Minor.


It's the same:

https://github.com/gem5/gem5/blob/stable/src/cpu/o3/lsq.cc#L816


Also, I get that the LSQ unit can do this. But doesn't a cache need separate
logic for scalar and vector reads/writes, since the event scheduling for a
timing model of vector loads/stores must differ?


A gem5 cache only reasons in terms of cachelines (64 bytes by default), and the
same goes for a coherent interconnect, regardless of vector vs. scalar.



Also, doesn't the interconnect (bus, crossbar, or whatever) have to be wide
enough to support vector reads/writes?


As I mentioned earlier, memory requests bigger than a cacheline will be split
into fragments at the LSQ. To give you a more concrete example: say you have a
1024-bit vector (128 bytes). A single vector load will be split into two
64-byte memory requests. The D-cache will see two requests to two consecutive
cachelines; it will issue two GetS requests on a miss, or return the lines if
they are present.

The LSQ will wait for both requests to return with data and will coalesce them
before writing back to the vector register.
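The splitting and coalescing described above can be sketched as follows (illustrative Python; the function names are invented and this is not the actual LSQ code):

```python
CACHELINE = 64  # bytes, gem5's default

def split_request(addr, size, line=CACHELINE):
    """Split one memory request into cacheline-bounded fragments,
    mirroring in spirit what the LSQ does for wide vector accesses."""
    frags = []
    while size > 0:
        chunk = min(size, line - addr % line)  # bytes left in this line
        frags.append((addr, chunk))
        addr += chunk
        size -= chunk
    return frags

def coalesce(frag_data):
    """Stitch returned fragment payloads back together, as the LSQ
    does before writing back to the vector register."""
    return b"".join(data for _, data in sorted(frag_data))

# An aligned 1024-bit (128-byte) vector load becomes two 64-byte requests:
print(split_request(0x81000, 128))  # [(528384, 64), (528448, 64)]
```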


I hope this helps


Giacomo





[gem5-users] Re: ARM SVE ISA

2024-01-15 Thread Nazmus Sakib via gem5-users
Hello. Thanks for your response.
I am running the O3 CPU (ArmO3CPU), not Minor.
Also, I get that the LSQ unit can do this. But doesn't a cache need separate
logic for scalar and vector reads/writes, since the event scheduling for a
timing model of vector loads/stores must differ?
Also, doesn't the interconnect (bus, crossbar, or whatever) have to be wide
enough to support vector reads/writes?


[gem5-users] Re: ARM SVE ISA

2024-01-15 Thread Giacomo Travaglini via gem5-users

Hi Nazmus,


On 15/01/2024 02:41, Nazmus Sakib wrote:
Thank you. I will try to switch to starter_se.py.
I still had some questions regarding SVE.
1. When I compile with -msve-vector-bits set to 512, I can see the ptrue
instruction, which is replaced by whilelo when I compile without setting the
vector width. On gem5, whilelo and the corresponding incw instructions seem to
work fine: when I keep sve_vl=1 in gem5, incw increments by 0x4 (128 bits),
and when I set sve_vl=4, incw increments by 0x10 (512 bits). What I am curious
about is whether there is anything wrong with the implementation of the ptrue
instruction in gem5.


Without inspecting the disassembled program, my guess is that
-msve-vector-bits=512 forces the code to not be VL-agnostic and hardcodes it
to 512. So there is nothing surprising about the run failing on non-matching
hardware.

I believe the proof that there is nothing inherently wrong with ptrue in gem5
is that, keeping the 512-bit binary untouched (with ptrue) and only setting
VL=4, you get a successful run.
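As a side note on the incw increments mentioned in the question: since sve_vl counts 128-bit granules, the increment is simply the number of 32-bit elements, so a 512-bit VL gives 16 = 0x10 words. A quick arithmetic sketch (plain Python, not gem5 code; the helper name is invented):

```python
def incw_increment(sve_vl):
    """Number of 32-bit elements in a vector of sve_vl 128-bit
    granules, i.e. the amount incw adds to a scalar counter."""
    vector_bits = sve_vl * 128
    return vector_bits // 32

print(hex(incw_increment(1)))  # 0x4  (128-bit VL)
print(hex(incw_increment(4)))  # 0x10 (512-bit VL: 16 words)
```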


2. As shown in my first email, my data arrays are 64 bytes in size. An SVE
load with sve_vl=4 allows all 64 bytes to be loaded by one ld1w instruction
(theoretically, at least, in an actual CPU). I can see from the output
generated by the debug flags LSQUnit and CacheAll that all 64 bytes are
indeed accessed by one instruction. For example:
system.cpu.dcache: access for WriteReq [81010:8104f]
The address range here covers 64 bytes (16 integers of 4 bytes each in my test
code).
But without support in the bus/interconnect connected to the CPU for 64-byte
transfers (or whatever the vector length is), and additional code in gem5 to
support multi-word reads/writes, shouldn't only one word (I am guessing 4
bytes in gem5 for ARM) be readable from cache to CPU at a time? In that case,
how are all 64 bytes requested and read from cache to CPU in gem5 with one
instruction? Is there some underlying mechanism, like micro-ops or some
architectural feature, taking place transparently? Or maybe a simple loop that
is not part of the debug-flag output? I tried looking in
src/mem/cache/base.cc and cache.cc but could not find an answer.


Simply put, the O3/Minor LSQ will allow any request that does not cross a
cacheline boundary. If a memory request spans two cachelines, it will be
split into two (or more) fragments [1].


Hope this helps


Giacomo


[1]: https://github.com/gem5/gem5/blob/stable/src/cpu/minor/lsq.cc#L1632
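The rule in [1], splitting only when a request crosses a cacheline boundary, can be expressed as a one-line check (illustrative Python; the helper name is invented, not gem5 code):

```python
def crosses_line(addr, size, line=64):
    """True if [addr, addr + size) spans a cacheline boundary, the
    case in which the O3/Minor LSQ splits the request."""
    return addr // line != (addr + size - 1) // line

print(crosses_line(0x81000, 64))  # False: fits in one 64-byte line
print(crosses_line(0x81020, 64))  # True: straddles a line boundary
```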





[gem5-users] Re: ARM SVE ISA

2024-01-14 Thread Nazmus Sakib via gem5-users
Thank you. I will try to switch to starter_se.py.
I still had some questions regarding SVE.
1. When I compile with -msve-vector-bits set to 512, I can see the ptrue
instruction, which is replaced by whilelo when I compile without setting the
vector width. On gem5, whilelo and the corresponding incw instructions seem to
work fine: when I keep sve_vl=1 in gem5, incw increments by 0x4 (128 bits),
and when I set sve_vl=4, incw increments by 0x10 (512 bits). What I am curious
about is whether there is anything wrong with the implementation of the ptrue
instruction in gem5.
2. As shown in my first email, my data arrays are 64 bytes in size. An SVE
load with sve_vl=4 allows all 64 bytes to be loaded by one ld1w instruction
(theoretically, at least, in an actual CPU). I can see from the output
generated by the debug flags LSQUnit and CacheAll that all 64 bytes are
indeed accessed by one instruction. For example:
system.cpu.dcache: access for WriteReq [81010:8104f]
The address range here covers 64 bytes (16 integers of 4 bytes each in my test
code).
But without support in the bus/interconnect connected to the CPU for 64-byte
transfers (or whatever the vector length is), and additional code in gem5 to
support multi-word reads/writes, shouldn't only one word (I am guessing 4
bytes in gem5 for ARM) be readable from cache to CPU at a time? In that case,
how are all 64 bytes requested and read from cache to CPU in gem5 with one
instruction? Is there some underlying mechanism, like micro-ops or some
architectural feature, taking place transparently? Or maybe a simple loop that
is not part of the debug-flag output? I tried looking in
src/mem/cache/base.cc and cache.cc but could not find an answer.
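Regarding the trace line quoted above, the 64-byte figure follows directly from the inclusive address range (plain arithmetic, not gem5 code):

```python
# The dcache trace line "access for WriteReq [81010:8104f]" gives an
# inclusive address range; its size confirms the full 64-byte store:
start, end = 0x81010, 0x8104F
size = end - start + 1
print(size)        # 64 bytes
print(size // 4)   # 16 four-byte integers
```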

From: Nazmus Sakib 
Date: Thursday, 11 January 2024 at 19:34
To: Giacomo Travaglini , The gem5 Users mailing 
list 
Cc: Jason Lowe-Power 
Subject: Re: ARM SVE ISA

Not compiling with -msve-vector-bits did the trick. It runs perfectly whether
I set cpu[0].isa[0].sve_vl_se to 4 or keep it at 1.
Thank you for the suggestions!
One last thing: starter_se.py does not seem to support --cpu-type=ArmO3CPU
(or am I missing something)?



From: Giacomo Travaglini 
Sent: 11 January 2024 12:16
To: The gem5 Users mailing list 
Cc: Jason Lowe-Power ; Nazmus Sakib 
Subject: Re: ARM SVE ISA




Hi Nazmus,



I can see from what you posted that you are compiling the test case with a
512-bit vector width. I believe you should amend the gem5 VL accordingly,
basically writing in the gem5 config:

cpu.isa[0].sve_vl_se = 4

according to [1]. This should fix your problem. Another solution, I believe,
would be to compile without specifying the VL; then the code should be
VL-agnostic, I presume.

Anyway, I also recommend you use configs/example/arm/starter_se.py, as se.py
is per se deprecated.



Kind Regards



Giacomo



[1]: https://github.com/gem5/gem5/blob/stable/src/arch/arm/ArmISA.py#L179



From: Nazmus Sakib via gem5-users 
Date: Thursday, 11 January 2024 at 17:54
To: gem5-users@gem5.org 
Cc: Jason Lowe-Power , Nazmus Sakib 
Subject: [gem5-users] ARM SVE ISA

Hello.
I am trying to run a simple program with SVE instructions on gem5. However, the 
output with the debug flag ExecAll suggests there is an issue with the decoder.
Here is the test code:

#define STREAM_ARRAY_SIZE 16
void main()
{
for (int j=0; j
___
gem5-users mailing list -- gem5-users@gem5.org
To unsubscribe send an email to gem5-users-le...@gem5.org


[gem5-users] Re: ARM SVE ISA

2024-01-12 Thread Giacomo Travaglini via gem5-users
You are right, I created a PR to fix this:

https://github.com/gem5/gem5/pull/764

Kind Regards

Giacomo



[gem5-users] Re: ARM SVE ISA

2024-01-11 Thread Nazmus Sakib via gem5-users
Not compiling with -msve-vector-bits did the trick. It runs perfectly, whether 
I set cpu[0].isa[0].sve_vl_se to 4 or keep it at 1.
Thank you for the suggestions!
One last thing: starter_se.py does not seem to support 
--cpu-type=ArmO3CPU (or am I missing something)?




[gem5-users] Re: ARM SVE ISA

2024-01-11 Thread Giacomo Travaglini via gem5-users
Hi Nazmus,

I can see from what you posted that you are compiling the testcase with a 512-bit 
vector width. I believe you should amend the gem5 VL accordingly, basically by 
writing the following in the gem5 config:

cpu.isa[0].sve_vl_se = 4

According to [1].
This should fix your problem. Another solution, I believe, would be to compile 
without specifying the VL; then the generated code should be VL-agnostic.

Anyway, I also recommend you use configs/example/arm/starter_se.py, as se.py is 
per se deprecated.

Kind Regards

Giacomo

[1]: https://github.com/gem5/gem5/blob/stable/src/arch/arm/ArmISA.py#L179
