Re: Unaligned NEON memory access on ARMv7 phones

2018-04-05 Thread Makoto Kato
> To my surprise, clang 6.0 is willing to generate vld1.8 when no
> particular CPU model is specified:
> https://godbolt.org/g/i5PqcQ

The vld1.8 in this sample is valid because the access is aligned to the
element size (one byte).

Also, although this generator uses the hardfp ABI by default, with gcc 7
(-march=armv7-a -mfpu=neon -mfloat-abi=hard -O3 -mno-unaligned-access
-mthumb) the generated code is as follows.

 :
   0:   f920 0adf   vld1.64 {d0-d1}, [r0 :64]
   4:   4770        bx      lr
   6:   bf00        nop

0008 :
   8:   b500        push    {lr}
   a:   b085        sub     sp, #20
   c:   4601        mov     r1, r0
   e:   2210        movs    r2, #16
  10:   4668        mov     r0, sp
  12:   f7ff fffe   bl      0 
  16:   f92d 0adf   vld1.64 {d0-d1}, [sp :64]
  1a:   b005        add     sp, #20
  1c:   f85d fb04   ldr.w   pc, [sp], #4

Although gcc doesn't optimize the memcpy away with -mno-unaligned-access, with
-munaligned-access it uses ldr, stmia and vld1.64, not vld1.8.
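
For comparison, here is a minimal Rust sketch of the two shapes the listing above shows: copying 16 bytes into an aligned local buffer before loading them, versus loading them directly from a possibly unaligned address. This is not the source behind the godbolt link; the names Aligned16, load16_via_copy and load16_unaligned are made up for illustration, and what a given compiler emits for either function is not claimed here.

```rust
/// 16-byte scratch buffer with a fixed, generous alignment (hypothetical helper type).
#[repr(C, align(16))]
struct Aligned16([u8; 16]);

/// Copy 16 bytes from `src[offset..]` into an aligned local buffer first,
/// then read them from there. Works for any alignment of the source address.
fn load16_via_copy(src: &[u8], offset: usize) -> u128 {
    let mut tmp = Aligned16([0u8; 16]);
    tmp.0.copy_from_slice(&src[offset..offset + 16]);
    // `tmp` is aligned, so this read never touches an unaligned address.
    u128::from_ne_bytes(tmp.0)
}

/// Read the same 16 bytes directly, telling the compiler that the
/// address may be unaligned.
fn load16_unaligned(src: &[u8], offset: usize) -> u128 {
    let bytes = &src[offset..offset + 16];
    // SAFETY: `bytes` covers exactly 16 readable bytes, and
    // `read_unaligned` has no alignment requirement.
    unsafe { std::ptr::read_unaligned(bytes.as_ptr() as *const u128) }
}
```

Both functions return the same value for every offset; they differ only in whether the load itself may hit an unaligned address.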

ARM also supports big endian, so *.16/*.32/*.64 cannot be replaced with *.8
in all cases if both endiannesses have to be supported.
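
The endianness point can be illustrated with a simplified Rust model, under the assumption that a byte-element load places bytes into the register in memory order while a wider-element load converts each element according to the current byte order. The function names are hypothetical and the model covers a single 32-bit lane only.

```rust
/// Value of a 32-bit lane after a 32-bit-element load (modelling `*.32`):
/// the four memory bytes are interpreted in the core's current byte order.
fn lane_after_load32(mem: [u8; 4], big_endian: bool) -> u32 {
    if big_endian {
        u32::from_be_bytes(mem)
    } else {
        u32::from_le_bytes(mem)
    }
}

/// Value of the same 32-bit lane after a byte-element load (modelling `*.8`):
/// bytes land in the register in memory order regardless of endianness,
/// so the lane always looks like a little-endian interpretation.
fn lane_after_load8(mem: [u8; 4]) -> u32 {
    u32::from_le_bytes(mem)
}
```

In this model the two loads agree on a little-endian core and disagree on a big-endian one, which is why the substitution is not safe in general.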


> Is unaligned NEON allowed on any ARMv7 CPU without trapping after all
> even if unaligned ALU loads/stores might not be?

If an access that uses an alignment identifier is unaligned, it will trap.
Whether it traps also depends on the element size.
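
As a rough sketch of that rule in Rust: the function below assumes, based on one reading of the ARMv7-A architecture manual rather than on anything measured in this thread, that an explicit alignment identifier is always enforced and that without one the element size is enforced only when the SCTLR alignment-check bit is set. The function name and parameters are hypothetical.

```rust
/// Simplified model of whether a NEON element load/store raises an
/// alignment fault on ARMv7 (an assumption, not a measured result).
///
/// * `addr`: address being accessed
/// * `element_bytes`: element size in bytes (1 for `.8`, 8 for `.64`, ...)
/// * `align_bytes`: `Some(n)` if the instruction carries an alignment
///   identifier (`:64` is `Some(8)`, `:128` is `Some(16)`), else `None`
/// * `strict`: whether the SCTLR alignment-check bit is set
fn neon_access_faults(
    addr: usize,
    element_bytes: usize,
    align_bytes: Option<usize>,
    strict: bool,
) -> bool {
    match align_bytes {
        // An explicit alignment identifier is enforced unconditionally.
        Some(n) => addr % n != 0,
        // Otherwise only element-size alignment matters, and only in
        // strict mode. For `.8` the element size is 1, so this is never true.
        None => strict && addr % element_bytes != 0,
    }
}
```

Under this model vld1.8 without an alignment identifier can never fault, which matches clang's willingness to emit it without knowing the CPU model.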


-- Makoto Kato


___
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform


Re: Unaligned NEON memory access on ARMv7 phones

2018-04-05 Thread Makoto Kato
> Thank you. Was [*2] meant to be a different URL?

Ah, the correct links are the following.

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.100511_0401_10_en/ric1447333721072.html
and
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0409i/BABDGHIB.html


-- Makoto



Re: Unaligned NEON memory access on ARMv7 phones

2018-03-31 Thread Henri Sivonen
On Thu, Mar 29, 2018 at 4:09 AM, Makoto Kato wrote:
> Since SCTLR isn't accessible from userland, there is no way to detect
> unaligned access support without triggering a trap.  Generally, an unaligned
> access causes SIGBUS, so we might get data from the crash reporter.  The
> Android armv7-a ABI doesn't specify how the hardware has to set the alignment
> bit of SCTLR, so unfortunately we should consider both cases.

To my surprise, clang 6.0 is willing to generate vld1.8 when no
particular CPU model is specified:
https://godbolt.org/g/i5PqcQ

Is unaligned NEON allowed on any ARMv7 CPU without trapping after all
even if unaligned ALU loads/stores might not be?

> The ARM documentation for Cortex-A8 says [*1] that when the alignment
> identifier is 64 (VLD2.16@64) it requires 2 cycles, but when the alignment
> identifier is 128 (VLD2.16@128) it requires 1 cycle.  And on Cortex-A9,
> unaligned access requires additional cycles [*2].
...
> [*1]
> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344h/ch16s06s07.html
> [*2]
> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344h/ch16s06s07.html

Thank you. Was [*2] meant to be a different URL?

On Wed, Mar 28, 2018 at 6:36 PM, Gregory Szorc wrote:
> Is
> http://fastcompression.blogspot.fr/2015/08/accessing-unaligned-memory.html
> and/or the comments for MEM_FORCE_MEMORY_ACCESS at
> https://github.com/facebook/zstd/blob/dev/lib/common/mem.h useful?

Thanks, but unfortunately these don't address my issue. These are
about getting GCC to perform an unaligned load efficiently when the
programmer has already decided to want an unaligned load.

I'm trying to figure out whether it's worthwhile to spend cycles to
move pointers to alignment if possible or whether it makes sense to
just use unaligned operations unconditionally. (Also, GCC doesn't
matter in my case, since I'm planning Rust code.)

In non-ARMv7 cases my findings are that moving to alignment doesn't
look empirically worthwhile on aarch64 (tested RPi3 and ThunderX,
which both have in-order cores; should test an out-of-order core, but
documentation supports the empirical results) or on Haswell
(documentation indicates that the key is Nehalem or newer). On Core 2
Duo, moving to alignment is worthwhile.
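
To make the two strategies concrete, here is a minimal Rust sketch using a word-at-a-time ASCII check as a stand-in workload. The workload and the names HIGH_BITS, is_ascii_aligned and is_ascii_unaligned are assumptions chosen for illustration; this uses plain usize reads rather than NEON, and says nothing about which variant is faster on a given core.

```rust
/// Mask with the top bit of every byte set (0x8080...80 for any word size).
const HIGH_BITS: usize = usize::MAX / 0xFF * 0x80;

/// Strategy 1: spend some work up front to reach an aligned address,
/// then use aligned word reads for the bulk of the data.
fn is_ascii_aligned(bytes: &[u8]) -> bool {
    // SAFETY: every bit pattern is a valid usize, so viewing the
    // aligned middle of a byte slice as usize words is sound.
    let (head, words, tail) = unsafe { bytes.align_to::<usize>() };
    head.iter().all(|b| *b < 0x80)
        && words.iter().all(|w| w & HIGH_BITS == 0)
        && tail.iter().all(|b| *b < 0x80)
}

/// Strategy 2: skip the alignment step and use unaligned word reads
/// unconditionally, starting at the first byte.
fn is_ascii_unaligned(bytes: &[u8]) -> bool {
    const W: usize = std::mem::size_of::<usize>();
    let mut chunks = bytes.chunks_exact(W);
    let words_ok = chunks.by_ref().all(|c| {
        // SAFETY: `c` is exactly W readable bytes, and
        // `read_unaligned` has no alignment requirement.
        let w = unsafe { std::ptr::read_unaligned(c.as_ptr() as *const usize) };
        w & HIGH_BITS == 0
    });
    words_ok && chunks.remainder().iter().all(|b| *b < 0x80)
}
```

Both functions return the same answer for any input; benchmarking them on slices that start at different offsets is one way to measure what moving to alignment costs or saves on a particular core.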

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/


Re: Unaligned NEON memory access on ARMv7 phones

2018-03-31 Thread Makoto Kato
Since SCTLR isn't accessible from userland, there is no way to detect
unaligned access support without triggering a trap.  Generally, an unaligned
access causes SIGBUS, so we might get data from the crash reporter.  The
Android armv7-a ABI doesn't specify how the hardware has to set the alignment
bit of SCTLR, so unfortunately we should consider both cases.

The ARM documentation for Cortex-A8 says [*1] that when the alignment
identifier is 64 (VLD2.16@64) it requires 2 cycles, but when the alignment
identifier is 128 (VLD2.16@128) it requires 1 cycle.  And on Cortex-A9,
unaligned access requires additional cycles [*2].

When I was investigating a string issue, I collected benchmark results for
optimizations on Cortex-A7, so I will look for that data.


-- Makoto

[*1]
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344h/ch16s06s07.html
[*2]
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344h/ch16s06s07.html

On Tue, Mar 27, 2018 at 7:44 PM, Henri Sivonen wrote:

> I'm having trouble finding reliable information about the performance
> of unaligned NEON memory access on ARMv7 phones.
>
> What I can find is:
>
>  * ARMv7 seems to allow unaligned access to be a trap-to-kernel kind
> of performance disaster, but it's hard to find information about
> whether the phone SoCs we care about are actually disastrous like
> that.
>
>  * On aarch64, unaligned access is the same instruction as aligned
> access and gets dynamically penalized, but only minimally, if the
> access crosses a cache line boundary. *Presumably* ARMv7 code running
> on an ARMv8 core gets the same benefit.
>
> Do we know what performance characteristics we can assume for
> unaligned NEON loads/stores on Android phones that have ARMv7 cores
> and recent enough Android that Fennec runs in the first place?


Re: Unaligned NEON memory access on ARMv7 phones

2018-03-28 Thread Gregory Szorc
On Tue, Mar 27, 2018 at 3:44 AM, Henri Sivonen wrote:

> I'm having trouble finding reliable information about the performance
> of unaligned NEON memory access on ARMv7 phones.
>
> What I can find is:
>
>  * ARMv7 seems to allow unaligned access to be a trap-to-kernel kind
> of performance disaster, but it's hard to find information about
> whether the phone SoCs we care about are actually disastrous like
> that.
>
>  * On aarch64, unaligned access is the same instruction as aligned
> access and gets dynamically penalized, but only minimally, if the
> access crosses a cache line boundary. *Presumably* ARMv7 code running
> on an ARMv8 core gets the same benefit.
>
> Do we know what performance characteristics we can assume for
> unaligned NEON loads/stores on Android phones that have ARMv7 cores
> and recent enough Android that Fennec runs in the first place?


Is
http://fastcompression.blogspot.fr/2015/08/accessing-unaligned-memory.html
and/or the comments for MEM_FORCE_MEMORY_ACCESS at
https://github.com/facebook/zstd/blob/dev/lib/common/mem.h useful?

I could also introduce you to the zstandard developers if you think it
would be useful (compression often spends a large portion of its execution
time accessing and moving memory and I'm pretty sure they know arcane
memory access details like this). Reply privately if you want that
introduction.