Re: [lttng-dev] lttng-consumerd crash on aarch64 due to x86 arch specific optimization

2023-01-31 Thread Mathieu Desnoyers via lttng-dev

On 2023-01-31 11:18, Mathieu Desnoyers wrote:

On 2023-01-31 11:08, Mathieu Desnoyers wrote:

On 2023-01-30 01:50, Beckius, Mikael via lttng-dev wrote:

Hello Mathieu!

I have looked at this in place of Anders, and as far as I can tell this is not an arm64 issue but an arm issue. Even on arm, __ARM_FEATURE_UNALIGNED is 1, so it seems the problem only occurs when size equals 8.


So for ARM, perhaps we should do the following in 
include/lttng/ust-arch.h:


#if defined(LTTNG_UST_ARCH_ARM) && defined(__ARM_FEATURE_UNALIGNED)
#define LTTNG_UST_ARCH_HAS_EFFICIENT_UNALIGNED_ACCESS 1
#endif

And refer to 
https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html#ARM-Options


Based on that documentation, it is possible to build with -mno-unaligned-access, and for all pre-ARMv6, all ARMv6-M and for ARMv8-M Baseline architectures, unaligned accesses are not enabled.

I would only push this kind of change into the master branch though, due to its impact and the fact that this is only a performance improvement.
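
As a quick sanity check of what a given ARM toolchain and flag combination assumes, the predefine can be inspected from C (illustration only, not part of the proposed change):

/* Illustration: __ARM_FEATURE_UNALIGNED is only predefined when the
 * compiler is allowed to emit unaligned accesses (the default on ARMv6+
 * unless -mno-unaligned-access is passed). */
#include <stdio.h>

int main(void)
{
#ifdef __ARM_FEATURE_UNALIGNED
	printf("__ARM_FEATURE_UNALIGNED = %d\n", __ARM_FEATURE_UNALIGNED);
#else
	printf("unaligned accesses not enabled by the compiler\n");
#endif
	return 0;
}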


But setting LTTNG_UST_ARCH_HAS_EFFICIENT_UNALIGNED_ACCESS to 1 for arm32 when __ARM_FEATURE_UNALIGNED is defined would still cause issues for 8-byte lttng_inline_memcpy with my proposed patch, right?

AFAIU, 32-bit arm with __ARM_FEATURE_UNALIGNED handles unaligned 2- and 4-byte accesses, but somehow traps on unaligned 8-byte accesses?
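
For context, here is a simplified sketch of the kind of size-dispatched copy we are discussing (the name and details are approximate, not the exact lttng-ust code):

#include <stdint.h>
#include <string.h>

/* Sketch only: dispatch small copies to single loads/stores and fall
 * back to memcpy() otherwise.  The 8-byte case is the one that traps on
 * arm32 when the buffers are unaligned, even though
 * __ARM_FEATURE_UNALIGNED is defined. */
static inline void inline_memcpy_sketch(void *dest, const void *src, size_t len)
{
	switch (len) {
	case 1:
		*(uint8_t *) dest = *(const uint8_t *) src;
		break;
	case 2:
		*(uint16_t *) dest = *(const uint16_t *) src;
		break;
	case 4:
		*(uint32_t *) dest = *(const uint32_t *) src;
		break;
	case 8:
		*(uint64_t *) dest = *(const uint64_t *) src;
		break;
	default:
		memcpy(dest, src, len);
	}
}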


Re-reading your analysis, I may have mistakenly concluded that using the lttng-ust ring buffer in "packed" mode would be faster than aligned mode on arm32 and aarch64, but that is not really what you benchmarked there.


So forget what I said about setting 
LTTNG_UST_ARCH_HAS_EFFICIENT_UNALIGNED_ACCESS to 1 for arm32 and aarch64.


There is a distinction between having efficient unaligned accesses and supporting unaligned accesses at all.

For aarch64, unaligned accesses appear to be supported, but AFAIU they may be slower than aligned accesses.

For arm32, unaligned accesses are supported for 2 and 4 bytes when __ARM_FEATURE_UNALIGNED is set, but not for 8 bytes (those trap). It is also not clear whether an unaligned 2- or 4-byte access is slower than an aligned one.


At the end of the day, it is a trade-off between the compactness of the generated trace data (added throughput overhead) and the CPU time required to perform unaligned rather than aligned accesses.
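
For reference, a common portable way to express a single unaligned access without risking a trap is to go through a fixed-size memcpy(), which the compiler lowers to whatever the target supports; this is only an illustration of the access-cost side of the trade-off, not something the proposed patch does:

#include <stdint.h>
#include <string.h>

/* Illustration: a fixed-size memcpy() compiles to a single unaligned
 * load on targets that support it, and to byte accesses (no trap) on
 * targets that do not. */
static inline uint64_t load_u64(const void *p)
{
	uint64_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}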


Thoughts ?

Thanks,

Mathieu








In addition, I did some performance testing of lttng_inline_memcpy by extracting it and adding it to a simple test program. It appears that overall performance increases on arm, arm64, arm on arm64 hardware, and x86-64. But it also appears that on arm, if you end up in memcpy, the old code that calls memcpy directly is actually slightly faster.


Nothing unexpected here. Just make sure that your test program does not call lttng_inline_memcpy with constant size values, which would let the compiler optimize away the branches. In the context where lttng_inline_memcpy is used, its arguments are not constants most of the time.
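
For example, a minimal way to keep the size opaque in such a micro-benchmark (sketch only, reusing the inline_memcpy_sketch shown earlier; the helper names are made up):

#include <stddef.h>

/* Hide the size value from the optimizer so the size-dispatch branches
 * are actually exercised, as they are in the real tracepoint paths. */
static inline size_t opaque_size(size_t len)
{
	__asm__ __volatile__ ("" : "+r" (len));
	return len;
}

void bench_copy(void *dst, const void *src, size_t len, unsigned long iters)
{
	unsigned long i;

	for (i = 0; i < iters; i++)
		inline_memcpy_sketch(dst, src, opaque_size(len));
}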



Skipping the memcpy fallback on arm for unaligned copies of sizes 2 
and 4 further improves the performance


This would happen naturally on your board if we conditionally set LTTNG_UST_ARCH_HAS_EFFICIENT_UNALIGNED_ACCESS to 1 when __ARM_FEATURE_UNALIGNED is defined, right?

and setting LTTNG_UST_ARCH_HAS_EFFICIENT_UNALIGNED_ACCESS 1 yields the 
best performance on arm64.


This could go into lttng-ust master branch as well, e.g.:

#if defined(LTTNG_UST_ARCH_AARCH64)
#define LTTNG_UST_ARCH_HAS_EFFICIENT_UNALIGNED_ACCESS 1
#endif

Thanks!

Mathieu



Micke






--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

___
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev



Re: [lttng-dev] [PATCH v2] Tests: select_poll_epoll: Add support for _time64

2023-01-31 Thread Alistair Francis via lttng-dev
On Thu, Dec 15, 2022 at 6:20 AM Jérémie Galarneau  wrote:
>
> Hi Alistair,
>
> The patch you submitted doesn't pass on x86 and x86-64.

Are you able to provide the failures? It should just be a simple fix.

>
> I have written an alternative patch that works on the 32/64 variants of ARM 
> and x86. I could only verify that it builds on RISC-V 64.
>
> Are you able to compile-test it on RISC-V 32?
>
> https://review.lttng.org/c/lttng-tools/+/8907

Thanks!

I am currently having some trouble building it. The requirement on
liburcu >= 0.14 is proving difficult to meet and the patch conflicts
with earlier versions of lttng.

I had a look at the patch though.

It seems like you still call SYS_ppoll, which won't work on 32-bit systems with a 64-bit time_t.

Changes like this:

+   #ifdef sys_pselect6_time64
+   test_pselect_time64();
+   #else
   test_pselect();
+   #endif /* sys_pselect6_time64 */

will mean that test_pselect() isn't called on 32-bit platforms with a 5.4+ kernel, which I thought is what you wanted to avoid.
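
Something along these lines (just a sketch, reusing the guard spelling from the hunk above) would keep the existing coverage and exercise the time64 variant on top where the kernel provides it:

	/* Sketch: always run the classic test, and additionally run the
	 * time64 variant when the syscall is available, instead of picking
	 * one or the other at compile time. */
	test_pselect();
#ifdef sys_pselect6_time64
	test_pselect_time64();
#endif /* sys_pselect6_time64 */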

Alistair
___
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev