Re: [PING][PATCH v3] Disable reg offset in quad-word store for Falkor

2018-02-19 Thread Wilco Dijkstra
Siddhesh Poyarekar wrote:
> On Thursday 15 February 2018 07:50 PM, Wilco Dijkstra wrote:
>> So it seems to me using existing cost mechanisms is always preferable, even
>> if you currently can't differentiate between loads and stores.
>
> Luis is working on address cost adjustments among other things, so I
> guess the path of least resistance for gcc8 is to have those adjustments
> go in and then figure out how much improvement this patch (or separating
> loads and stores) would get on top of that.  Would that be acceptable?

Yes, adjusting costs is not an issue as that's clearly something we need to
improve anyway. Having numbers for both approaches would be useful; however, I
think it's still best to go with the cost approach for GCC8 as that should get
most of the gain.

Wilco

Re: [PING][PATCH v3] Disable reg offset in quad-word store for Falkor

2018-02-15 Thread Siddhesh Poyarekar
On Thursday 15 February 2018 07:50 PM, Wilco Dijkstra wrote:
> So it seems to me using existing cost mechanisms is always preferable, even
> if you currently can't differentiate between loads and stores.

Luis is working on address cost adjustments among other things, so I
guess the path of least resistance for gcc8 is to have those adjustments
go in and then figure out how much improvement this patch (or separating
loads and stores) would get on top of that.  Would that be acceptable?

Siddhesh


Re: [PING][PATCH v3] Disable reg offset in quad-word store for Falkor

2018-02-15 Thread Wilco Dijkstra
Hi Siddhesh,

I still don't like the idea of disabling a whole class of instructions in the
md file. It seems much better to adjust the costs here so that you get most of
the improvement now, and fine-tune it once we can differentiate between
loads and stores.

Taking your example, adding -funroll-loops generates this for Falkor:

.L25:
ldr q7, [x2, x18]
add x5, x18, 16
add x4, x1, x18
add x10, x18, 32
add x11, x1, x5
add x3, x18, 48
add x12, x1, x10
add x9, x18, 64
add x14, x1, x3
add x8, x18, 80
add x15, x1, x9
add x7, x18, 96
add x16, x1, x8
str q7, [x4]
ldr q16, [x2, x5]
add x6, x18, 112
add x17, x1, x7
add x18, x18, 128
add x5, x1, x6
cmp x18, x13
str q16, [x11]
ldr q17, [x2, x10]
str q17, [x12]
ldr q18, [x2, x3]
str q18, [x14]
ldr q19, [x2, x9]
str q19, [x15]
ldr q20, [x2, x8]
str q20, [x16]
ldr q21, [x2, x7]
str q21, [x17]
ldr q22, [x2, x6]
str q22, [x5]
bne .L25

If you adjust costs however you'd get this:

.L25:
ldr q7, [x14]
add x14, x14, 128
add x4, x4, 128
str q7, [x4, -128]
ldr q16, [x14, -112]
str q16, [x4, -112]
ldr q17, [x14, -96]
str q17, [x4, -96]
ldr q18, [x14, -80]
str q18, [x4, -80]
ldr q19, [x14, -64]
str q19, [x4, -64]
ldr q20, [x14, -48]
str q20, [x4, -48]
ldr q21, [x14, -32]
str q21, [x4, -32]
ldr q22, [x14, -16]
cmp x14, x9
str q22, [x4, -16]
bne .L25

So it seems to me using existing cost mechanisms is always preferable, even if
you currently can't differentiate between loads and stores.

Wilco
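
For readers following the two listings above: the first loop body is 34
instructions, 16 of them adds that form each store address separately, while
the cost-adjusted loop is 20 instructions with only 2 adds, because every
access uses a constant offset from one of two bumped pointers. The testcase
itself is not quoted in this thread, so the C below is only a hypothetical
sketch of the two loop shapes. The names qword, copy_indexed and copy_bumped
are made up for illustration, and the compiler's cost model, not the source
form, decides which addressing mode is finally emitted.

```c
#include <stddef.h>

/* Hypothetical 16-byte element standing in for one quad-word (q register). */
typedef struct { unsigned char b[16]; } qword;

/* Shape 1: a single shared index; every access is base + index. */
void copy_indexed(qword *dst, const qword *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Shape 2: bump both pointers once per 8 elements, then address each
   element with a constant negative offset, as in the second listing. */
void copy_bumped(qword *dst, const qword *src, size_t n)
{
    const qword *end = src + (n - n % 8);

    while (src != end) {
        src += 8;
        dst += 8;
        for (int k = -8; k < 0; k++)
            dst[k] = src[k];
    }

    /* Copy the remaining n % 8 elements. */
    for (size_t i = 0; i < n % 8; i++)
        dst[i] = src[i];
}
```

Both functions copy the same data; they differ only in how the addresses are
formed, which is the property the address costs are meant to steer.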