Re: Unaligned NEON memory access on ARMv7 phones
On Thu, Mar 29, 2018 at 8:38 PM, Henri Sivonen wrote:
> To my surprise, clang 6.0 is willing to generate vld1.8 when no
> particular CPU model is specified:
> https://godbolt.org/g/i5PqcQ

This vld1.8 sample is valid because the access is aligned to the element
size. Also, although this generator targets the hard-float ABI by default,
with gcc 7 (-march=armv7-a -mfpu=neon -mfloat-abi=hard -O3
-mno-unaligned-access -mthumb) the generated code is the following:

     :
       0:   f920 0adf   vld1.64 {d0-d1}, [r0 :64]
       4:   4770        bx      lr
       6:   bf00        nop

    0008 :
       8:   b500        push    {lr}
       a:   b085        sub     sp, #20
       c:   4601        mov     r1, r0
       e:   2210        movs    r2, #16
      10:   4668        mov     r0, sp
      12:   f7ff fffe   bl      0
      16:   f92d 0adf   vld1.64 {d0-d1}, [sp :64]
      1a:   b005        add     sp, #20
      1c:   f85d fb04   ldr.w   pc, [sp], #4

Although gcc doesn't optimize the memcpy with -mno-unaligned-access, with
-munaligned-access it uses ldr, stmia and vld1.64, not vld1.8.

ARM also supports big endian, so *.16/*.32/*.64 cannot be replaced with
*.8 in every case if both endiannesses are to be supported.

> Is unaligned NEON allowed on any ARMv7 CPU without trapping after all
> even if unaligned ALU loads/stores might not be?

An unaligned access made with an alignment identifier will trap. Whether
it traps also depends on the element size.

-- Makoto Kato

___
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
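The endianness point can be illustrated without NEON. The following is a minimal Rust sketch, not code from the thread; the function names are hypothetical. It shows that the same four bytes in memory form different 32-bit element values depending on byte order, which is why an element-sized load (*.32) and a byte-sized load (*.8) are only interchangeable when the byte order is known:

```rust
/// The value a little-endian CPU sees when it loads these four
/// bytes from memory as one 32-bit element.
fn element_on_little_endian(bytes: [u8; 4]) -> u32 {
    u32::from_le_bytes(bytes)
}

/// The value a big-endian CPU sees for the same four bytes.
fn element_on_big_endian(bytes: [u8; 4]) -> u32 {
    u32::from_be_bytes(bytes)
}
```

For the bytes 0x11 0x22 0x33 0x44, the first function returns 0x44332211 and the second returns 0x11223344, while a byte-wise view of the data is the same on both.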
Re: Unaligned NEON memory access on ARMv7 phones
On Thu, Mar 29, 2018 at 8:38 PM, Henri Sivonen wrote:
> Thank you. Was [*2] meant to be a different URL?

Ah, the correct ones are the following:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.100511_0401_10_en/ric1447333721072.html
and
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0409i/BABDGHIB.html

-- Makoto
Re: Unaligned NEON memory access on ARMv7 phones
On Thu, Mar 29, 2018 at 4:09 AM, Makoto Kato wrote:
> Since SCTLR isn't accessible from userland, there is no way to detect
> unaligned access support without trapping. Generally, an unaligned access
> causes SIGBUS, so we might get data from the crash reporter. The Android
> armv7-a ABI doesn't define whether the hardware configuration has to set
> the alignment bit of SCTLR, so unfortunately we should consider both
> cases.

To my surprise, clang 6.0 is willing to generate vld1.8 when no
particular CPU model is specified:
https://godbolt.org/g/i5PqcQ

Is unaligned NEON allowed on any ARMv7 CPU without trapping after all
even if unaligned ALU loads/stores might not be?

> The ARM documentation for Cortex-A8 says [*1] that when the alignment
> identifier is 64 (VLD2.16@64) the load requires 2 cycles, but when it is
> 128 (VLD2.16@128) it requires 1 cycle. And on Cortex-A9, unaligned access
> requires additional cycles [*2].
...
> [*1]
> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344h/ch16s06s07.html
> [*2]
> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344h/ch16s06s07.html

Thank you. Was [*2] meant to be a different URL?

On Wed, Mar 28, 2018 at 6:36 PM, Gregory Szorc wrote:
> Is
> http://fastcompression.blogspot.fr/2015/08/accessing-unaligned-memory.html
> and/or the comments for MEM_FORCE_MEMORY_ACCESS at
> https://github.com/facebook/zstd/blob/dev/lib/common/mem.h useful?

Thanks, but unfortunately these don't address my issue. These are
about getting GCC to perform an unaligned load efficiently when the
programmer has already decided to want an unaligned load.

I'm trying to figure out whether it's worthwhile to spend cycles to
move pointers to alignment if possible or whether it makes sense to
just use unaligned operations unconditionally. (Also, GCC doesn't
matter in my case, since I'm planning Rust code.)

In non-ARMv7 cases my findings are that moving to alignment doesn't
look empirically worthwhile on aarch64 (tested RPi3 and ThunderX,
which both have in-order cores; should test an out-of-order core, but
documentation supports the empirical results) or on Haswell
(documentation indicates that the key is Nehalem or newer). On Core 2
Duo, moving to alignment is worthwhile.

--
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/
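The two strategies being weighed here can be written side by side. The following Rust sketch is an illustration only, not code from this thread: it uses an ASCII check as a stand-in workload, u128 as a stand-in for a 128-bit NEON register, and hypothetical function names. Whether the first variant really ends up issuing aligned loads, and what alignment u128 has, depends on the target and the compiler.

```rust
/// Mask with the high bit of each of the 16 bytes set.
const HIGH_BITS: u128 = 0x8080_8080_8080_8080_8080_8080_8080_8080;

/// Variant 1: move to alignment first. The unaligned head and tail are
/// handled byte by byte, the aligned middle one u128 at a time.
fn is_ascii_aligning(bytes: &[u8]) -> bool {
    // SAFETY: every bit pattern is a valid u128, so viewing initialized
    // bytes as u128 values is sound.
    let (head, body, tail) = unsafe { bytes.align_to::<u128>() };
    head.iter().all(|b| *b < 0x80)
        && body.iter().all(|w| (*w & HIGH_BITS) == 0)
        && tail.iter().all(|b| *b < 0x80)
}

/// Variant 2: use 16-byte loads unconditionally, whatever the address.
fn is_ascii_unaligned(bytes: &[u8]) -> bool {
    let mut chunks = bytes.chunks_exact(16);
    for chunk in chunks.by_ref() {
        // Copying into a local array makes no alignment assumption.
        let mut buf = [0u8; 16];
        buf.copy_from_slice(chunk);
        if (u128::from_ne_bytes(buf) & HIGH_BITS) != 0 {
            return false;
        }
    }
    // Fewer than 16 bytes are left over; check them one by one.
    chunks.remainder().iter().all(|b| *b < 0x80)
}
```

Both functions return the same answer for any input, so the only open question is which is faster on a given core, and that has to be measured on the device.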
Re: Unaligned NEON memory access on ARMv7 phones
Since SCTLR isn't accessible from userland, there is no way to detect
unaligned access support without trapping. Generally, an unaligned access
causes SIGBUS, so we might get data from the crash reporter. The Android
armv7-a ABI doesn't define whether the hardware configuration has to set
the alignment bit of SCTLR, so unfortunately we should consider both
cases.

The ARM documentation for Cortex-A8 says [*1] that when the alignment
identifier is 64 (VLD2.16@64) the load requires 2 cycles, but when it is
128 (VLD2.16@128) it requires 1 cycle. And on Cortex-A9, unaligned access
requires additional cycles [*2].

When I was investigating a string issue, I produced benchmark results for
an optimization on Cortex-A7, so I will look for that data.

-- Makoto

[*1]
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344h/ch16s06s07.html
[*2]
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344h/ch16s06s07.html
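Since the cheaper 128-bit alignment identifier can only be used when the address actually meets it, code that wants it has to check the address first. A minimal Rust sketch of that check, assuming only the 64- and 128-bit cases from the Cortex-A8 example matter; the function name is hypothetical, and nothing here reads SCTLR or says which identifiers a particular instruction accepts:

```rust
/// Returns the strongest alignment, in bits, that `ptr` satisfies among
/// the 64- and 128-bit cases, or None if the address is not even
/// 8-byte aligned.
fn strongest_alignment_bits(ptr: *const u8) -> Option<u32> {
    let addr = ptr as usize;
    if addr % 16 == 0 {
        Some(128) // address is a multiple of 16 bytes
    } else if addr % 8 == 0 {
        Some(64) // address is a multiple of 8 bytes only
    } else {
        None // no 64- or 128-bit identifier may be used here
    }
}
```

A caller would fall back to a load without an alignment identifier whenever this returns None.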
Re: Unaligned NEON memory access on ARMv7 phones
On Tue, Mar 27, 2018 at 3:44 AM, Henri Sivonen wrote:
> I'm having trouble finding reliable information about the performance
> of unaligned NEON memory access on ARMv7 phones.
>
> What I can find is:
>
>  * ARMv7 seems to allow unaligned access to be a trap-to-kernel kind
>    of performance disaster, but it's hard to find information about
>    whether the phone SoCs we care about are actually disastrous like
>    that.
>
>  * On aarch64, unaligned access is the same instruction as aligned
>    access and gets dynamically penalized, but only minimally, if the
>    access crosses a cache line boundary. *Presumably* ARMv7 code running
>    on an ARMv8 core gets the same benefit.
>
> Do we know what performance characteristics we can assume for
> unaligned NEON loads/stores on Android phones that have ARMv7 cores
> and recent enough Android that Fennec runs in the first place?

Is
http://fastcompression.blogspot.fr/2015/08/accessing-unaligned-memory.html
and/or the comments for MEM_FORCE_MEMORY_ACCESS at
https://github.com/facebook/zstd/blob/dev/lib/common/mem.h useful?

I could also introduce you to the zstandard developers if you think it
would be useful (compression often spends a large portion of its execution
time accessing and moving memory and I'm pretty sure they know arcane
memory access details like this). Reply privately if you want that
introduction.
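For readers working in Rust rather than C, the kind of unaligned load those links are concerned with can be expressed with the standard library alone. This is a minimal sketch with hypothetical function names, not code taken from zstd or the linked post; which instructions the compiler emits for either form depends on the target and flags.

```rust
use std::ptr;

/// Unaligned 64-bit load done by copying bytes into a local, the Rust
/// analogue of a memcpy-based load. Panics if the range is out of bounds.
fn load_u64_by_copy(bytes: &[u8], offset: usize) -> u64 {
    let src = &bytes[offset..offset + 8]; // bounds-checked 8-byte window
    let mut value = 0u64;
    // SAFETY: `src` is exactly 8 readable bytes, `value` is 8 writable
    // bytes, and a fresh local cannot overlap the input slice.
    unsafe {
        ptr::copy_nonoverlapping(src.as_ptr(), &mut value as *mut u64 as *mut u8, 8);
    }
    value
}

/// The same load expressed with ptr::read_unaligned, which places no
/// alignment requirement on the source pointer.
fn load_u64_unaligned(bytes: &[u8], offset: usize) -> u64 {
    let src = &bytes[offset..offset + 8]; // bounds-checked 8-byte window
    // SAFETY: `src` is exactly 8 readable bytes.
    unsafe { ptr::read_unaligned(src.as_ptr() as *const u64) }
}
```

Both functions return the bytes interpreted in native byte order, so they agree with each other at every offset.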