Bug#1064629: libamd-comgr2: segfault in rocfft

2024-03-02 Thread Christian Kastner
Hey Cory,

On 2024-02-28 21:16, Cordell Bloor wrote:
> This segfault does seem to be caused by mixing clang-15 and clang-17 in
> the HIP RTC codepath. When libamdhip64 from ROCm 5.6.1 (built with the
> same clang-17 as rocm-compilersupport 6.0+git20231212.4510c28+dfsg-1) is
> used, the segfault disappeared [1].

I think that this also needs to be fixed in bin:hipcc. It currently has
an unversioned Depends on libamdhip64-dev, making it possible to use
clang-17 hipcc with clang-15 libamdhip64-5.

# should also work with s/podman/docker/, of course
$ podman run --rm -it debian:experimental sh -c 'apt update && apt install -s hipcc/experimental | grep "Inst.*libamdhip64"'
[...]
Inst libamdhip64-5 (5.2.3-13 Debian:unstable [amd64])
Inst libamdhip64-dev (5.2.3-13 Debian:unstable [amd64])
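For anyone who wants to script this check, here is a minimal sketch in Python that parses the "Inst" lines of a simulated apt run and lists the packages that would be pulled from a given suite. The helper names (parse_inst_lines, from_suite) are made up for illustration, the sample input is just the two lines shown above, and the regular expression only assumes the new-install line format seen there (upgrade lines, which also carry the old version, are not handled).

```python
import re

# Matches new-install lines as printed by `apt install -s`, for example:
#   Inst libamdhip64-5 (5.2.3-13 Debian:unstable [amd64])
# Assumption: upgrade lines (with the old version in brackets) are ignored.
INST_RE = re.compile(r"^Inst (\S+) \((\S+) (.+?) \[(\S+)\]\)$")


def parse_inst_lines(text):
    """Return (package, version, origin, arch) tuples, one per Inst line."""
    rows = []
    for line in text.splitlines():
        match = INST_RE.match(line.strip())
        if match:
            rows.append(match.groups())
    return rows


def from_suite(rows, suite):
    """Keep only the packages whose origin field ends with the given suite."""
    return [row for row in rows if row[2].endswith(":" + suite)]


# The two lines from the simulated install above.
sample = """\
Inst libamdhip64-5 (5.2.3-13 Debian:unstable [amd64])
Inst libamdhip64-dev (5.2.3-13 Debian:unstable [amd64])
"""

rows = parse_inst_lines(sample)
for package, version, origin, arch in from_suite(rows, "unstable"):
    print(f"{package} {version} would be installed from {origin}")
```

Any libamdhip64 package reported from unstable while hipcc is requested from experimental points at the mixed-toolchain situation described above.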

I'd file a bug and fix the dependency in rocm-hipamd myself, but I'm
only 90% confident that I'm not missing something, so I wanted to check
first.

If it's indeed missing from bin:hipcc, I guess it should be updated to
libamdhip64-dev (= ${binary:Version})
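
To illustrate why the strict form would help, here is a rough sketch in Python of how an unversioned dependency and an exact-version dependency behave against an installed libamdhip64-dev 5.2.3-13. This is not how dpkg or apt is implemented: it handles only unversioned and "=" relations, compares versions as plain strings, and the hipcc version "9.9.9-1" is a hypothetical placeholder, since the real one is not stated in this thread.

```python
import re

# Accepts "name" or "name (= version)"; other relations are out of scope.
DEP_RE = re.compile(r"^\s*(\S+)\s*(?:\(\s*(=)\s*(\S+)\s*\))?\s*$")


def expand(dep, binary_version):
    """Mimic the ${binary:Version} substitution done at package build time."""
    return dep.replace("${binary:Version}", binary_version)


def is_satisfied(dep, installed):
    """Check one dependency against a {package: version} mapping.

    Simplification: '=' is treated as exact string equality.
    """
    match = DEP_RE.match(dep)
    if match is None:
        raise ValueError(f"unsupported dependency syntax: {dep!r}")
    name, op, version = match.groups()
    if name not in installed:
        return False
    return op is None or installed[name] == version


# Hypothetical version of the clang-17 based hipcc; only a placeholder.
HYPOTHETICAL_HIPCC_VERSION = "9.9.9-1"

# The library version that the simulated install picked from unstable.
installed = {"libamdhip64-dev": "5.2.3-13"}

unversioned = "libamdhip64-dev"
strict = expand("libamdhip64-dev (= ${binary:Version})", HYPOTHETICAL_HIPCC_VERSION)

print(unversioned, "->", is_satisfied(unversioned, installed))
print(strict, "->", is_satisfied(strict, installed))
```

With the unversioned relation the old library satisfies hipcc, which is the mix that was observed; with the exact-version relation it does not, so the resolver would have to pick the matching libamdhip64-dev.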

Discovered when building the newer rocFFT, which only build-depends on
hipcc.

Best,
Christian

> [1]: https://ci.rocm.debian.net/packages/r/rocfft/unstable/amd64+gfx1030/7998/



Bug#1064629: libamd-comgr2: segfault in rocfft

2024-02-28 Thread Cordell Bloor
This segfault does seem to be caused by mixing clang-15 and clang-17 in 
the HIP RTC codepath. When libamdhip64 from ROCm 5.6.1 (built with the 
same clang-17 as rocm-compilersupport 6.0+git20231212.4510c28+dfsg-1) is 
used, the segfault disappeared [1].


Sincerely,
Cory Bloor

[1]: https://ci.rocm.debian.net/packages/r/rocfft/unstable/amd64+gfx1030/7998/




Bug#1064629: libamd-comgr2: segfault in rocfft

2024-02-25 Thread Cordell Bloor
The segfault on the very first rocfft test (and in no other library) is
probably a good indication that HIP runtime compilation (RTC) is broken.
rocFFT uses that feature for every function, and no other ROCm library
uses it.


Whether this bug arises because the HIP stack is currently built on a mix
of clang-15 and clang-17, or whether it is inherent to the clang-17
version of comgr, is an open question. My priority at the moment is to
complete the move to clang-17 so we can at least eliminate the clang-15
variable and re-stabilize the ROCm stack on clang-17.




Bug#1064629: libamd-comgr2: segfault in rocfft

2024-02-24 Thread Cordell Bloor
Package: libamd-comgr2
Version: 6.0+git20231212.4510c28+dfsg-1~exp2
Severity: important
X-Debbugs-Cc: c...@slerp.xyz

Dear Maintainer,

The rocfft tests began segfaulting on all architectures when
rocm-compilersupport 6.0+git20231212.4510c28+dfsg-1~exp2 was uploaded to
unstable. You can see it in the CI run for
6.0+git20231212.4510c28+dfsg-1~exp1 on experimental:
https://ci.rocm.debian.net/packages/r/rocfft/unstable/amd64+gfx1030/6343/
compared with the run for 5.2.3-2:
https://ci.rocm.debian.net/packages/r/rocfft/unstable/amd64+gfx1030/6466/

I've captured a backtrace, although it's very unclear to me what the
problem is:

root@b50a9fa13687:~# gdb /usr/libexec/rocm/librocfft0-tests/rocfft-test
GNU gdb (Debian 13.2-1) 13.2
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
.
Find the GDB manual and other documentation resources online at:
.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/libexec/rocm/librocfft0-tests/rocfft-test...

This GDB supports auto-downloading debuginfo from the following URLs:
  
Enable debuginfod for this session? (y or [n]) y
Debuginfod has been enabled.
To make this setting permanent, add 'set debuginfod enabled on' to .gdbinit.
Reading symbols from /root/.cache/debuginfod_client/b2eea099f3a928be0c9fb7ba45fbee4d9b157b43/debuginfo...
(gdb) r
Starting program: /usr/libexec/rocm/librocfft0-tests/rocfft-test
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
single epsilon: 3.75e-05 double epsilon: 1e-15
Random seed: 4218654847
rocFFT version: 1.0.21.
[==] Running 289668 tests from 43 test suites.
[--] Global test environment set-up.
[--] 1 test from manual
[ RUN  ] manual.vs_fftw
Manual test:
length: 8
istride: 1
idist: 8
ostride: 1
odist: 8
batch: 1
isize: 8
osize: 8
ioffset: 0 0
ooffset: 0 0
in-place
transform_type: fft_transform_type_complex_forward
fft_array_type_complex_interleaved -> fft_array_type_complex_interleaved
single-precision
ilength: 8
olength: 8
ibuffer_size: 64
obuffer_size: 64

Token: complex_forward_len_8_single_ip_batch_1_istride_1_CI_ostride_1_CI_idist_8_odist_8_ioffset_0_0_ooffset_0_0
[New Thread 0x733576c0 (LWP 4429)]
[New Thread 0x72b546c0 (LWP 4430)]
[Thread 0x72b546c0 (LWP 4430) exited]
[New Thread 0x7fffebbff6c0 (LWP 4431)]
[New Thread 0x7fffeb3fe6c0 (LWP 4432)]
[Detaching after vfork from child process 4433]
[Thread 0x7fffeb3fe6c0 (LWP 4432) exited]
[Thread 0x7fffebbff6c0 (LWP 4431) exited]
[New Thread 0x7fffeb3fe6c0 (LWP 4434)]
[New Thread 0x7fffebbff6c0 (LWP 4435)]
[Thread 0x7fffeb3fe6c0 (LWP 4434) exited]
[New Thread 0x7fffeb3fe6c0 (LWP 4436)]
[Thread 0x7fffeb3fe6c0 (LWP 4436) exited]
[Thread 0x7fffebbff6c0 (LWP 4435) exited]
[   OK ] manual.vs_fftw (2682 ms)
[--] 1 test from manual (2682 ms total)

[--] 26 tests from rocfft_UnitTest
[ RUN  ] rocfft_UnitTest.default_load_callback_complex_single
[New Thread 0x7fffeb3fe6c0 (LWP 4437)]
[New Thread 0x7fffebbff6c0 (LWP 4438)]
[Detaching after vfork from child process 4439]
[Thread 0x7fffebbff6c0 (LWP 4438) exited]
[Thread 0x7fffeb3fe6c0 (LWP 4437) exited]
[New Thread 0x7fffebbff6c0 (LWP 4440)]
[New Thread 0x703ff6c0 (LWP 4441)]
[Thread 0x7fffebbff6c0 (LWP 4440) exited]

Thread 1 "rocfft-test" received signal SIGSEGV, Segmentation fault.
0x740d6208 in ?? () from /lib/x86_64-linux-gnu/libamdhip64.so.5
(gdb) thread apply all bt

Thread 12 (Thread 0x703ff6c0 (LWP 4441) "rocfft-test"):
#0  __GI___ioctl (fd=fd@entry=3, request=request@entry=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
#1  0x7349ee90 in kmtIoctl (fd=3, request=request@entry=3222817548, arg=arg@entry=0x703fddc0) at ./src/libhsakmt.c:13
#2  0x73497ddc in hsaKmtWaitOnMultipleEvents_Ext (event_age=0x703fdea8, Milliseconds=3, WaitOnAll=, NumEvents=, Events=0x703fde78) at ./src/events.c:407
#3  hsaKmtWaitOnMultipleEvents_Ext (Events=0x703fde78, NumEvents=1, WaitOnAll=, Milliseconds=3, event_age=0x703fdea8) at ./src/events.c:378
#4  0x7349854b in hsaKmtWaitOnEvent_Ext (Event=, Milliseconds=, event_age=) at ./src/events.c:226
#5  0x73537640 in rocr::core::InterruptSignal::WaitRelaxed 
(this=0x895d6f00,