[Bug target/95285] AArch64:aarch64 medium code model proposal

2021-09-10 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

Andrew Pinski  changed:

   What    |Removed |Added
   --------+--------+------------------
   Severity|normal  |enhancement
   Keywords|        |assemble-failure,
           |        |link-failure

[Bug target/95285] AArch64:aarch64 medium code model proposal

2021-09-10 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

--- Comment #17 from Wilco  ---
Here is the current medium code model proposal:
https://github.com/ARM-software/abi-aa/pull/107/files

[Bug target/95285] AArch64:aarch64 medium code model proposal

2020-12-10 Thread wdijkstr at arm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

Wilco  changed:

   What    |Removed |Added
   --------+--------+------------------------
   CC      |        |wdijkstr at arm dot com

--- Comment #16 from Wilco  ---
Note there is an early writeup of the current code models here:
https://github.com/ARM-software/abi-aa/pull/57/files (I've added the issues
with the current large model in review comments).

[Bug target/95285] AArch64:aarch64 medium code model proposal

2020-05-28 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

--- Comment #15 from Wilco  ---
(In reply to Bu Le from comment #14)
> > > Anyway, my point is that the size of a single data object doesn't affect
> > > the fact that the medium code model is missing in aarch64 and that
> > > aarch64 lacks a PIC large code model.
> > 
> > What is missing is efficient support for >4GB of data, right? How that is
> > implemented is a different question - my point is that it does not require a
> > new code model. It would be much better if it just worked without users even
> > needing to think about code models.
> > 
> > Also, what is the purpose of a large fpic model? Are there any applications
> > that use shared libraries larger than 4GB?
> 
> Yes, I understand, and I am grateful for your suggestion. I have to say it is
> not a critical problem. After all, most applications work fine with the
> current code models.
> 
> But there are some cases, like CESM with certain configurations, or my test
> case, which cannot be compiled with the current gcc compiler on aarch64.
> Unfortunately, applications larger than 4GB are quite normal in the HPC
> field. In the meantime, x86 and llvm-aarch64 can compile it, with the medium
> or large-pic code model. That is why I am proposing it. By adding this
> feature, we can make a step forward for the aarch64 gcc compiler, making it
> more powerful and robust.
> 
> Clear enough for your concern? 

Yes but such a feature needs to be defined in an ABI and well specified. This
is why I'm trying to get the underlying requirements first. Note that while
LLVM allows -fpic in large model, it doesn't correctly implement it. The large
model shouldn't ever be needed by actual applications.

> As for the implementation you suggested, I believe it is a promising plan.
> I would like to try to implement it first. It might take weeks of development.
> I will see what I can get and will update you on progress.
> 
> Thanks for the suggestion again.

As discussed, there are many different ways of supporting the requirement of
>4GB of data, so I wouldn't start on the implementation before there is a good
specification. GCC and LLVM would need to implement it in the same way after
all.

[Bug target/95285] AArch64:aarch64 medium code model proposal

2020-05-27 Thread bule1 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

--- Comment #14 from Bu Le  ---
> > Anyway, my point is that the size of a single data object doesn't affect
> > the fact that the medium code model is missing in aarch64 and that aarch64
> > lacks a PIC large code model.
> 
> What is missing is efficient support for >4GB of data, right? How that is
> implemented is a different question - my point is that it does not require a
> new code model. It would be much better if it just worked without users even
> needing to think about code models.
> 
> Also, what is the purpose of a large fpic model? Are there any applications
> that use shared libraries larger than 4GB?

Yes, I understand, and I am grateful for your suggestion. I have to say it is
not a critical problem. After all, most applications work fine with the current
code models.

But there are some cases, like CESM with certain configurations, or my test
case, which cannot be compiled with the current gcc compiler on aarch64.
Unfortunately, applications larger than 4GB are quite normal in the HPC field.
In the meantime, x86 and llvm-aarch64 can compile it, with the medium or
large-pic code model. That is why I am proposing it. By adding this feature, we
can make a step forward for the aarch64 gcc compiler, making it more powerful
and robust.

Clear enough for your concern? 

As for the implementation you suggested, I believe it is a promising plan. I
would like to try to implement it first. It might take weeks of development. I
will see what I can get and will update you on progress.

Thanks for the suggestion again.

[Bug target/95285] AArch64:aarch64 medium code model proposal

2020-05-27 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

--- Comment #13 from Wilco  ---
(In reply to Bu Le from comment #11)
>  
> > You're right, we need an extra add, so it's like this:
> > 
> > adrp x0, bar1.2782
> > movk x1, :high32_47:bar1.2782
> > add x0, x0, x1
> > add x0, x0, :lo12:bar1.2782
> > 
> > > (By the way, the high32_47 relocation you suggested is the prel_g2 in the
> > > officially released aarch64 ABI)
> > 
> > It needs a new relocation because of the ADRP. ADR could be used so the
> > existing R_AARCH64_MOVW_PREL_G0-3 work, but then you need 5 instructions.
> 
> So you suggest a new relocation type "high32_47" to calculate the offset
> between ADRP and bar1. Am I right?

Yes. It needs to have an offset to the adrp instruction so it can compute the
correct ADRP offset and then extract bits 32-47.
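
The address arithmetic of the four-instruction sequence can be modelled in a few lines of Python. This is only a toy sketch of one possible reading of the idea, not a specification: `high32_47` does not exist as a relocation, and the definition used here (the part of the page delta that the non-checking ADRP immediate cannot express) is an assumption, as are the helper names `sign_extend` and `resolve`. It also assumes x1 is zero apart from bits 32-47 and only covers targets that lie above the code or within 4GB below it; how far-negative offsets would be handled is not discussed in the thread.

```python
MASK64 = (1 << 64) - 1
PAGE_MASK = 0xFFF


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def resolve(p_adrp: int, symbol: int) -> int:
    """Model adrp / movk / add / add and return the final value of x0."""
    page_delta = (symbol & ~PAGE_MASK) - (p_adrp & ~PAGE_MASK)

    # adrp x0, sym  (non-checking relocation: keep only bits [32:12])
    imm21 = (page_delta >> 12) & 0x1FFFFF
    adrp_part = sign_extend(imm21, 21) << 12
    x0 = ((p_adrp & ~PAGE_MASK) + adrp_part) & MASK64

    # movk x1, :high32_47:sym  (hypothetical relocation, relative to the adrp)
    remainder = page_delta - adrp_part  # always a multiple of 2^33
    if not 0 <= remainder < (1 << 48):
        raise ValueError("offset outside what this sketch models")
    x1 = remainder & (0xFFFF << 32)     # assumes the rest of x1 is zero

    # add x0, x0, x1
    x0 = (x0 + x1) & MASK64
    # add x0, x0, :lo12:sym
    return (x0 + (symbol & PAGE_MASK)) & MASK64


if __name__ == "__main__":
    code = 0x400000
    target = code + (5 << 32) + 0x1234
    print(hex(resolve(code, target)))
```

Note that the value placed in x1 is not simply bits 32-47 of the raw page delta: because the ADRP immediate is sign-extended from bit 32, the relocation has to account for what the ADRP actually produces, which is why it must be computed relative to the adrp instruction.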

[Bug target/95285] AArch64:aarch64 medium code model proposal

2020-05-27 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

--- Comment #12 from Wilco  ---
(In reply to Bu Le from comment #10)
> > Fortran already has -fstack-arrays to decide between allocating arrays on
> > the heap or on the stack.
> 
> I tried the flag with my example. -fstack-arrays does not seem able to move
> the array from the bss to the heap. The problem is still there.

It is an existing feature that chooses between malloc and stack. It would need
modification to do the same for large data/bss objects.

> Anyway, my point is that the size of a single data object doesn't affect the
> fact that the medium code model is missing in aarch64 and that aarch64 lacks
> a PIC large code model.

What is missing is efficient support for >4GB of data, right? How that is
implemented is a different question - my point is that it does not require a
new code model. It would be much better if it just worked without users even
needing to think about code models.

Also, what is the purpose of a large fpic model? Are there any applications
that use shared libraries larger than 4GB?

[Bug target/95285] AArch64:aarch64 medium code model proposal

2020-05-27 Thread bule1 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

--- Comment #11 from Bu Le  ---

> You're right, we need an extra add, so it's like this:
> 
> adrp x0, bar1.2782
> movk x1, :high32_47:bar1.2782
> add x0, x0, x1
> add x0, x0, :lo12:bar1.2782
> 
> > (By the way, the high32_47 relocation you suggested is the prel_g2 in the
> > officially released aarch64 ABI)
> 
> It needs a new relocation because of the ADRP. ADR could be used so the
> existing R_AARCH64_MOVW_PREL_G0-3 work, but then you need 5 instructions.

So you suggest a new relocation type "high32_47" to calculate the offset
between ADRP and bar1. Am I right?

> > And in terms of engineering, your idea saves the trouble of modifying the
> > linker to calculate the offset for the 3 movks. But we still need to make a
> > new relocation type for ADRP, because it currently checks for address
> > overflow and gives the "relocation truncated to fit" error. Therefore, both
> > ideas need work in binutils, which makes them equivalent in that respect too.
> 
> There is relocation 276 (R_AARCH64_ADR_PREL_PG_HI21_NC).

Yes, though we still need to change the compiler so that, for the medium code
model, ADRP can use the R_AARCH64_ADR_PREL_PG_HI21_NC relocation.

[Bug target/95285] AArch64:aarch64 medium code model proposal

2020-05-27 Thread bule1 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

--- Comment #10 from Bu Le  ---

> Fortran already has -fstack-arrays to decide between allocating arrays on
> the heap or on the stack.

I tried the flag with my example. -fstack-arrays does not seem able to move the
array from the bss to the heap. The problem is still there.

Anyway, my point is that the size of a single data object doesn't affect the
fact that the medium code model is missing in aarch64 and that aarch64 lacks a
PIC large code model.

[Bug target/95285] AArch64:aarch64 medium code model proposal

2020-05-27 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

--- Comment #9 from Wilco  ---
(In reply to Bu Le from comment #7)
> (In reply to Wilco from comment #5)
> > (In reply to Bu Le from comment #0)
> > 
> > Also it would be much more efficient to have a relocation like this if you
> > wanted a 48-bit PC-relative offset:
> > 
> > adrp x0, bar1.2782
> > add x0, x0, :lo12:bar1.2782
> > movk x0, :high32_47:bar1.2782
> 
> I am afraid that putting the PC-relative offset into x0 is not correct,
> because x0 is supposed to hold the final address of bar1 rather than a PC
> offset. Therefore an extra register is needed to hold the offset temporarily.

You're right, we need an extra add, so it's like this:

adrp x0, bar1.2782
movk x1, :high32_47:bar1.2782
add x0, x0, x1
add x0, x0, :lo12:bar1.2782

> (By the way, the high32_47 relocation you suggested is the prel_g2 in the
> officially released aarch64 ABI)

It needs a new relocation because of the ADRP. ADR could be used so the
existing R_AARCH64_MOVW_PREL_G0-3 work, but then you need 5 instructions.

> And in terms of engineering, your idea saves the trouble of modifying the
> linker to calculate the offset for the 3 movks. But we still need to make a
> new relocation type for ADRP, because it currently checks for address
> overflow and gives the "relocation truncated to fit" error. Therefore, both
> ideas need work in binutils, which makes them equivalent in that respect too.

There is relocation 276 (R_AARCH64_ADR_PREL_PG_HI21_NC).

[Bug target/95285] AArch64:aarch64 medium code model proposal

2020-05-27 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

--- Comment #8 from Wilco  ---
(In reply to Bu Le from comment #6)
> (In reply to Wilco from comment #4)
> > (In reply to Bu Le from comment #3)
> > > (In reply to Wilco from comment #2)
> 
> > Well the question is whether we're talking about more than 4GB of code or
> > more than 4GB of data. With >4GB code you're indeed stuck with the large
> > model. With data it is feasible to automatically use malloc for arrays when
> > larger than a certain size, so there is no need to change the application at
> > all. Something like that could be the default in the small model so that you
> > don't have any extra overhead unless you have huge arrays. Making the
> > threshold configurable means you can tune it for a specific application.
> 
> 
> Is this automatic malloc already available on some target? I haven't found an
> example that works that way. Would you mind providing an example?

Fortran already has -fstack-arrays to decide between allocating arrays on the
heap or on the stack.

[Bug target/95285] AArch64:aarch64 medium code model proposal

2020-05-27 Thread bule1 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

--- Comment #7 from Bu Le  ---
(In reply to Wilco from comment #5)
> (In reply to Bu Le from comment #0)
> 
> Also it would be much more efficient to have a relocation like this if you
> wanted a 48-bit PC-relative offset:
> 
> adrp x0, bar1.2782
> add x0, x0, :lo12:bar1.2782
> movk x0, :high32_47:bar1.2782

I am afraid that putting the PC-relative offset into x0 is not correct, because
x0 is supposed to hold the final address of bar1 rather than a PC offset.
Therefore an extra register is needed to hold the offset temporarily. Later, we
need to add the PC address of the movk to the offset to calculate bits 32:47 of
the final address of bar1. Finally, we add this part of the address to x0 to
compute the entire 48-bit final address. So the code should be the following
sequence:

adrp x0, bar1.2782
add  x0, x0, :lo12:bar1.2782  // x0 here holds bits 0:31 of the final addr
movk x4, :prel_g2:bar1.2782
adr  x1, .
sub  x1, x1, 0x4
add  x4, x4, x1   // x4 here holds bits 32:47 of the final addr
add  x0, x4, x0

(By the way, the high32_47 relocation you suggested is the prel_g2 in the
officially released aarch64 ABI.)

So actually, if we just want a 48-bit PC-relative relocation, your idea and
mine both need 6-7 instructions to get the symbol. In terms of efficiency, they
would be similar.

And in terms of engineering, your idea saves the trouble of modifying the
linker to calculate the offset for the 3 movks. But we still need to make a new
relocation type for ADRP, because it currently checks for address overflow and
gives the "relocation truncated to fit" error. Therefore, both ideas need work
in binutils, which makes them equivalent in that respect too.

[Bug target/95285] AArch64:aarch64 medium code model proposal

2020-05-27 Thread bule1 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

--- Comment #6 from Bu Le  ---
(In reply to Wilco from comment #4)
> (In reply to Bu Le from comment #3)
> > (In reply to Wilco from comment #2)

> Well the question is whether we're talking about more than 4GB of code or
> more than 4GB of data. With >4GB code you're indeed stuck with the large
> model. With data it is feasible to automatically use malloc for arrays when
> larger than a certain size, so there is no need to change the application at
> all. Something like that could be the default in the small model so that you
> don't have any extra overhead unless you have huge arrays. Making the
> threshold configurable means you can tune it for a specific application.


Is this automatic malloc already available on some target? I haven't found an
example that works that way. Would you mind providing an example?

[Bug target/95285] AArch64:aarch64 medium code model proposal

2020-05-26 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

--- Comment #5 from Wilco  ---
(In reply to Bu Le from comment #0)

Also it would be much more efficient to have a relocation like this if you
wanted a 48-bit PC-relative offset:

adrp x0, bar1.2782
add x0, x0, :lo12:bar1.2782
movk x0, :high32_47:bar1.2782

[Bug target/95285] AArch64:aarch64 medium code model proposal

2020-05-26 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

--- Comment #4 from Wilco  ---
(In reply to Bu Le from comment #3)
> (In reply to Wilco from comment #2)
> 
> > Is the main usage scenario huge arrays? If so, these could easily be
> > allocated via malloc at startup rather than using bss. It means an extra
> > indirection in some cases (to load the pointer), but it should be much more
> > efficient than using a large code model with all the overheads.
> 
> Thanks for the reply. 
> 
> The large array is just used to construct the test case. It is not a
> necessary condition for this scenario. The common scenario is that the
> symbol is too far away for the small code model to reach, which could also
> result from a large number of small arrays, structures, etc. Meanwhile, the
> large code model is able to reach the symbol but cannot be position
> independent, which causes the problem.
> 
> Besides, the code in CESM is quite complicated to restructure with malloc,
> which is also not an acceptable option for my customer.
> 
> Clear enough for your concern?

Well the question is whether we're talking about more than 4GB of code or more
than 4GB of data. With >4GB code you're indeed stuck with the large model. With
data it is feasible to automatically use malloc for arrays when larger than a
certain size, so there is no need to change the application at all. Something
like that could be the default in the small model so that you don't have any
extra overhead unless you have huge arrays. Making the threshold configurable
means you can tune it for a specific application.
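
As a rough illustration of the threshold idea, the sketch below plans where each static object would live. It is only a model under stated assumptions: the function name `plan_layout`, the 8-byte pointer left behind for each heap-allocated array, and the example threshold are all hypothetical, and no such compiler option is named in this thread.

```python
POINTER_SIZE = 8          # assumed size of the pointer kept in static data
ADRP_REACH = 1 << 32      # ADRP covers a +/-4GB window around the code


def plan_layout(objects, threshold):
    """objects: list of (name, size_bytes) pairs.

    Returns (static_bytes, heap_names): the static data left in bss and the
    names of the objects that would be allocated with malloc at startup.
    """
    static_bytes = 0
    heap_names = []
    for name, size in objects:
        if size > threshold:
            heap_names.append(name)
            static_bytes += POINTER_SIZE  # only a pointer stays in bss
        else:
            static_bytes += size
    return static_bytes, heap_names


if __name__ == "__main__":
    # Sizes loosely based on the test case: two 4GB arrays and a small common block.
    objects = [("bar", 1 << 32), ("bar1", 1 << 32), ("baz", 12)]
    static_bytes, heap_names = plan_layout(objects, threshold=1 << 20)
    print(static_bytes, heap_names, static_bytes < ADRP_REACH)
```

With the huge arrays moved out, the remaining static data is tiny and stays within reach of the small model; code that touches a moved array pays one extra load to fetch its pointer.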

[Bug target/95285] AArch64:aarch64 medium code model proposal

2020-05-26 Thread bule1 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

--- Comment #3 from Bu Le  ---
(In reply to Wilco from comment #2)

> Is the main usage scenario huge arrays? If so, these could easily be
> allocated via malloc at startup rather than using bss. It means an extra
> indirection in some cases (to load the pointer), but it should be much more
> efficient than using a large code model with all the overheads.

Thanks for the reply. 

The large array is just used to construct the test case. It is not a necessary
condition for this scenario. The common scenario is that the symbol is too far
away for the small code model to reach, which could also result from a large
number of small arrays, structures, etc. Meanwhile, the large code model is
able to reach the symbol but cannot be position independent, which causes the
problem.

Besides, the code in CESM is quite complicated to restructure with malloc,
which is also not an acceptable option for my customer.

Clear enough for your concern?
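
To make "too far away for the small code model to reach" concrete, here is a small Python check that mirrors the overflow rule of R_AARCH64_ADR_PREL_PG_HI21 (the page delta must satisfy -2^32 <= delta < 2^32). The addresses are a made-up layout for illustration only, and a 4-byte default REAL is assumed for the array size.

```python
PAGE_SIZE = 4096
ADRP_LIMIT = 1 << 32   # page delta must satisfy -2^32 <= delta < 2^32


def page(addr: int) -> int:
    """Round an address down to its 4KB page."""
    return addr & ~(PAGE_SIZE - 1)


def adrp_reaches(pc: int, target: int) -> bool:
    """True if an ADRP at pc can form the page address of target."""
    delta = page(target) - page(pc)
    return -ADRP_LIMIT <= delta < ADRP_LIMIT


# real :: bar(1, 1024*1024*1024) with 4-byte reals (assumed default kind)
BAR_BYTES = 1 * 1024 * 1024 * 1024 * 4

# Hypothetical layout: code at 0x400000, bar at the start of .bss,
# bar1 placed directly after bar.
TEXT_PC = 0x400000
BAR_ADDR = 0x420000
BAR1_ADDR = BAR_ADDR + BAR_BYTES

if __name__ == "__main__":
    print("bar reachable: ", adrp_reaches(TEXT_PC, BAR_ADDR))
    print("bar1 reachable:", adrp_reaches(TEXT_PC, BAR1_ADDR))
```

The same check fails for any symbol that ends up more than 4GB from the code, whether the distance comes from one huge array or from many small objects that add up.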

[Bug target/95285] AArch64:aarch64 medium code model proposal

2020-05-26 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

Wilco  changed:

   What    |Removed |Added
   --------+--------+------------------------
   CC      |        |wilco at gcc dot gnu.org

--- Comment #2 from Wilco  ---
(In reply to Bu Le from comment #0)
> Created attachment 48584 [details]
> proposed patch
> 
> I would like to propose an implementation of the medium code model in
> aarch64. A prototype is attached; it passed bootstrap and the regression
> tests.
> 
> -mcmodel=medium is a code model that is missing from the aarch64
> architecture but supported on x86. It describes a situation in which small
> data is relocated using the small code model while large data is relocated
> using the large code model. The official description of the medium code
> model is on page 34 of the x86 ABI document:
> https://refspecs.linuxbase.org/elf/x86_64-abi-0.99.pdf
> 
> The key difference between x86 and aarch64 is that x86 can use lea+movabs
> instructions to implement a dynamically relocatable large code model.
> Currently, the large code model on AArch64 relocates symbols using an ldr
> instruction, which can only be statically linked. However, the small code
> model uses adrp + ldr instructions, which can be dynamically linked.
> Therefore, the medium code model cannot be implemented simply by setting a
> threshold. As a result, a dynamically relocatable large code model is needed
> first for a functional medium code model.
> 
> I met this problem when compiling CESM, a climate forecasting package that
> is widely used in the HPC field. In some configurations, when manipulating
> large arrays, a large code model with dynamic relocation is needed. The
> following case is abstracted from CESM for this scenario.
> 
> program main
>  common/baz/a,b,c
>  real a,b,c
>  b = 1.0
>  call foo()
>  print*, b
>  end
> 
>  subroutine foo()
>  common/baz/a,b,c
>  real a,b,c
> 
>  integer, parameter :: nx = 1024
>  integer, parameter :: ny = 1024
>  integer, parameter :: nz = 1024
>  integer, parameter :: nf = 1
>  real :: bar(nf,nx*ny*nz)
>  real :: bar1(nf,nx*ny*nz)
>  bar = 0.0
>  bar1 =0.0
>  b = bar(1,1024*1024*100)
>  b = bar1(1,1)
> 
>  return
>  end
> 
> Compiling with -mcmodel=small -fPIC gives the following error due to the
> access to the bar1 array:
> test.f90:(.text+0x28): relocation truncated to fit:
> R_AARCH64_ADR_PREL_PG_HI21 against `.bss'
> test.f90:(.text+0x6c): relocation truncated to fit:
> R_AARCH64_ADR_PREL_PG_HI21 against `.bss'
> 
> Compiling with -mcmodel=large -fPIC gives an unsupported error:
> f951: sorry, unimplemented: code model ‘large’ with ‘-fPIC’
> 
> As discussed at the beginning, to tackle this problem we have to solve the
> static large code model problem. My solution here is to use the
> R_AARCH64_MOVW_PREL_Gx group relocations with instructions to calculate the
> current PC value.
> 
> Before change (mcmodel=small) :
> adrp x0, bar1.2782
> add x0, x0, :lo12:bar1.2782
> 
> After change:(mcmodel = medium proposed):
> movz  x0, :prel_g3:bar1.2782
> movk  x0, :prel_g2_nc:bar1.2782
> movk  x0, :prel_g1_nc:bar1.2782
> movk  x0, :prel_g0_nc:bar1.2782
> adr   x1, .
> sub   x1, x1, 0x4
> add   x0, x0, x1
> 
> The first 4 mov instructions (the movz and three movk) calculate the 64-bit
> offset between bar1 and the last movk instruction, which fulfils the
> requirement of the large code model (64-bit relocation).
> The adr+sub instructions calculate the PC address of the last movk
> instruction. By adding the offset to the PC address, bar1 can be
> dynamically located.
> 
> Because this relocation is time consuming, a threshold is set to classify
> the size of the data to be relocated, as on x86. The default value of the
> threshold is set to 65536, which is the maximum relocation capability of the
> small code model.
> This implementation also requires amending the linker in binutils so that
> the 4 mov instructions all calculate the same PC offset, relative to the last
> movk instruction.
> 
> The good side of this implementation is that it can use existing relocation
> types to prototype a medium code model.
> 
> This implementation also has drawbacks.
> First, the 4 mov instructions and the adr instruction must be emitted in
> this order. No other instruction may be inserted into the sequence, as that
> would lead to a wrong symbol address. This might impede instruction
> scheduling optimizations.
> Secondly, the linker needs to change correspondingly so that every mov
> instruction calculates the same PC offset. For example, in my
> implementation, the first movz instruction needs to add 12 to the result
> of ":prel_g3:bar1.2782" to make up the PC offset.
> 
> I haven't figured out a suitable solution for these problems yet. You are
> most welcome to leave your suggestions regarding these issues.
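
The arithmetic behind the proposed movz/movk/adr/sub/add sequence can be sketched in Python as follows. This is a simplified model rather than the attached patch: it assumes every relocation in the group yields the same 64-bit offset, measured from the last movk, and it does not model how the linker arrives at that common offset. The helper names `split_groups`, `movk` and `run_sequence` are made up for illustration.

```python
MASK64 = (1 << 64) - 1


def split_groups(value: int):
    """Split a 64-bit value into 16-bit groups [G0, G1, G2, G3], G0 lowest."""
    value &= MASK64
    return [(value >> (16 * i)) & 0xFFFF for i in range(4)]


def movk(reg: int, imm16: int, shift: int) -> int:
    """Replace 16 bits of reg at the given shift, keeping all other bits."""
    return (reg & ~(0xFFFF << shift) & MASK64) | (imm16 << shift)


def run_sequence(p_movz: int, symbol: int) -> int:
    """Model the 7-instruction sequence; p_movz is the address of the movz."""
    p_last_movk = p_movz + 12   # movz, movk, movk, movk: 4 bytes each
    p_adr = p_movz + 16

    # The offset all four relocations are assumed to agree on.
    offset = (symbol - p_last_movk) & MASK64
    g0, g1, g2, g3 = split_groups(offset)

    x0 = g3 << 48               # movz x0, :prel_g3:sym
    x0 = movk(x0, g2, 32)       # movk x0, :prel_g2_nc:sym
    x0 = movk(x0, g1, 16)       # movk x0, :prel_g1_nc:sym
    x0 = movk(x0, g0, 0)        # movk x0, :prel_g0_nc:sym
    x1 = p_adr                  # adr  x1, .
    x1 = (x1 - 4) & MASK64      # sub  x1, x1, 0x4  -> address of last movk
    return (x0 + x1) & MASK64   # add  x0, x0, x1


if __name__ == "__main__":
    code = 0x400000
    target = code + (5 << 32) + 0x1234
    print(hex(run_sequence(code, target)))
```

The model also shows why the sequence cannot be split up: `p_last_movk` and `p_adr` are fixed distances from the movz, so moving any of the instructions changes the reference point and produces a wrong address.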

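Is the main usage scenario huge arrays? If so, these could easily be allocated
via malloc at startup rather than using bss. It means an extra indirection in
some cases (to load the pointer), but it should be much more efficient than
using a large code model with all the overheads.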

[Bug target/95285] AArch64:aarch64 medium code model proposal

2020-05-23 Thread bule1 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95285

--- Comment #1 from Bu Le  ---
Created attachment 48585
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48585&action=edit
patch for binutils