Re: [PING][PATCH v3] Disable reg offset in quad-word store for Falkor
Siddhesh Poyarekar wrote:
> On Thursday 15 February 2018 07:50 PM, Wilco Dijkstra wrote:
>> So it seems to me using existing cost mechanisms is always preferable, even
>> if you currently can't differentiate between loads and stores.
>
> Luis is working on address cost adjustments among other things, so I
> guess the path of least resistance for gcc8 is to have those adjustments
> go in and then figure out how much improvement this patch (or separating
> loads and stores) would get on top of that. Would that be acceptable?

Yes adjusting costs is not an issue as that's clearly something we need to
improve anyway. Having numbers for both approaches would be useful, however
I think it's still best to go with the cost approach for GCC8 as that should
get most of the gain.

Wilco

Re: [PING][PATCH v3] Disable reg offset in quad-word store for Falkor
On Thursday 15 February 2018 07:50 PM, Wilco Dijkstra wrote:
> So it seems to me using existing cost mechanisms is always preferable, even
> if you currently can't differentiate between loads and stores.

Luis is working on address cost adjustments among other things, so I
guess the path of least resistance for gcc8 is to have those adjustments
go in and then figure out how much improvement this patch (or separating
loads and stores) would get on top of that. Would that be acceptable?

Siddhesh

Re: [PING][PATCH v3] Disable reg offset in quad-word store for Falkor
Hi Siddhesh,

I still don't like the idea of disabling a whole class of instructions in the
md file. It seems much better to adjust the costs here so that you get most of
the improvement now, and fine tune it once we can differentiate between loads
and stores.

Taking your example, adding -funroll-loops generates this for Falkor:

        ldr     q7, [x2, x18]
        add     x5, x18, 16
        add     x4, x1, x18
        add     x10, x18, 32
        add     x11, x1, x5
        add     x3, x18, 48
        add     x12, x1, x10
        add     x9, x18, 64
        add     x14, x1, x3
        add     x8, x18, 80
        add     x15, x1, x9
        add     x7, x18, 96
        add     x16, x1, x8
        str     q7, [x4]
        ldr     q16, [x2, x5]
        add     x6, x18, 112
        add     x17, x1, x7
        add     x18, x18, 128
        add     x5, x1, x6
        cmp     x18, x13
        str     q16, [x11]
        ldr     q17, [x2, x10]
        str     q17, [x12]
        ldr     q18, [x2, x3]
        str     q18, [x14]
        ldr     q19, [x2, x9]
        str     q19, [x15]
        ldr     q20, [x2, x8]
        str     q20, [x16]
        ldr     q21, [x2, x7]
        str     q21, [x17]
        ldr     q22, [x2, x6]
        str     q22, [x5]
        bne     .L25

If you adjust costs however you'd get this:

.L25:
        ldr     q7, [x14]
        add     x14, x14, 128
        add     x4, x4, 128
        str     q7, [x4, -128]
        ldr     q16, [x14, -112]
        str     q16, [x4, -112]
        ldr     q17, [x14, -96]
        str     q17, [x4, -96]
        ldr     q18, [x14, -80]
        str     q18, [x4, -80]
        ldr     q19, [x14, -64]
        str     q19, [x4, -64]
        ldr     q20, [x14, -48]
        str     q20, [x4, -48]
        ldr     q21, [x14, -32]
        str     q21, [x4, -32]
        ldr     q22, [x14, -16]
        cmp     x14, x9
        str     q22, [x4, -16]
        bne     .L25

So it seems to me using existing cost mechanisms is always preferable, even if
you currently can't differentiate between loads and stores.

Wilco
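
For readers following along: the two loops quoted in the message above differ mainly in how addresses are formed. In the first, loads use base plus register offset and each store address is computed with its own add; in the second, two pointers are advanced once per iteration and every access uses an immediate offset. The C below is only a rough sketch of those two shapes at source level. It is not the example Siddhesh posted (that is not quoted in this thread), the names copy_indexed, copy_bumped and QWORD are hypothetical, and a compiler is free to pick either addressing form for either function depending on its cost model.

```c
#include <stddef.h>
#include <string.h>

enum { QWORD = 16 };  /* bytes in one quad-word (one q register); illustrative name */

/* Shape 1 (hypothetical): a single shared byte offset, so every access is
   "base + register offset". Copies whole quad-words only. */
void copy_indexed(unsigned char *dst, const unsigned char *src, size_t nbytes)
{
    for (size_t off = 0; off + QWORD <= nbytes; off += QWORD)
        memcpy(dst + off, src + off, QWORD);
}

/* Shape 2 (hypothetical): both pointers are bumped once per block of eight
   quad-words, and accesses inside the block sit at fixed immediate offsets. */
void copy_bumped(unsigned char *dst, const unsigned char *src, size_t nbytes)
{
    size_t blocks = nbytes / (8 * QWORD);

    while (blocks--) {
        for (int k = 0; k < 8; k++)
            memcpy(dst + k * QWORD, src + k * QWORD, QWORD);
        dst += 8 * QWORD;
        src += 8 * QWORD;
    }

    /* Remaining whole quad-words that did not fill a block of eight. */
    for (size_t rem = nbytes % (8 * QWORD); rem >= QWORD; rem -= QWORD) {
        memcpy(dst, src, QWORD);
        dst += QWORD;
        src += QWORD;
    }
}
```

Both functions copy the same bytes; comparing their output under -O2 -funroll-loops with different -mtune settings is one way to see which addressing form the cost model picks, though the result will depend on the GCC version and flags used.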