Public bug reported:

The built-in test suite for rocblas crashes reproducibly with memory
access errors, usually hipError 700, which translates to
hipErrorIllegalAddress.

## Reproducing

Not all of the test suite causes problems; the easiest way to get
this crash to manifest is to run './rocblas-test
--gtest_filter=*trsm_batched*'.
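
To narrow the filter down to a single failing case, each test can be run in its own process, so that an illegal-address error in one test cannot affect the tests that follow it. The script below is only a rough sketch of that idea, not something shipped with rocblas: the helper names `parse_gtest_list` and `first_failure` are made up for illustration, and it assumes that rocblas-test honours the standard GoogleTest `--gtest_list_tests` flag and prints the usual listing format (suite names in column 0 ending with a dot, test names indented below them).

```python
import subprocess
import sys


def parse_gtest_list(listing):
    """Turn `--gtest_list_tests` output into full `Suite.Test` names."""
    names = []
    suite = None
    for raw in listing.splitlines():
        # Drop trailing "# GetParam() = ..." / "# TypeParam = ..." comments.
        line = raw.split("#", 1)[0].rstrip()
        stripped = line.strip()
        if not stripped:
            continue
        if not line.startswith(" "):
            # Suite lines start in column 0, end with "." and contain no spaces.
            # Anything else (device banner, info messages) is ignored.
            is_suite = stripped.endswith(".") and " " not in stripped
            suite = stripped if is_suite else None
        elif suite is not None:
            names.append(suite + stripped)
    return names


def first_failure(binary, pattern):
    """Run every test matching `pattern` in a fresh process; stop at the first failure."""
    listing = subprocess.run(
        [binary, "--gtest_filter=" + pattern, "--gtest_list_tests"],
        capture_output=True,
        text=True,
    ).stdout
    for name in parse_gtest_list(listing):
        rc = subprocess.run([binary, "--gtest_filter=" + name]).returncode
        if rc != 0:  # a SIGABRT shows up as a negative return code on Linux
            return name, rc
    return None


if __name__ == "__main__" and len(sys.argv) == 3:
    print(first_failure(sys.argv[1], sys.argv[2]))
```

Saved under a hypothetical name such as isolate.py, it would be invoked as `python3 isolate.py ./rocblas-test '*trsm_batched*'`. Every process repeats the test data parsing step, so this is likely to be slow for a filter that matches thousands of tests; a narrower pattern keeps the run time manageable.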

## Debug information

When run under gdb, '/usr/libexec/rocm/librocblas5-tests/rocblas-test
--gtest_filter=*trsm_batched*' generates the following output and
backtrace:

Query device success: there are 1 devices
-------------------------------------------------------------------------------
Device ID 0 : AMD Radeon Pro W7900 gfx1100
with 48.3 GB memory, max. SCLK 1760 MHz, max. MCLK 1124 MHz, memoryBusWidth 48 Bytes, compute capability 11.0
maxGridDimX 2147483647, sharedMemPerBlock 65.5 KB, maxThreadsPerBlock 1024, warpSize 32
-------------------------------------------------------------------------------
info: parsing of test data may take a couple minutes before any test output appears...

Note: Google Test filter = *trsm_batched*
[==========] Running 13969 tests from 3 test suites.
[----------] Global test environment set-up.
[----------] 11916 tests from _/trsm_batched
[New Thread 0x7ffe641af6c0 (LWP 1407327)]
[New Thread 0x7ffcea7ff6c0 (LWP 1407328)]
[Thread 0x7ffcea7ff6c0 (LWP 1407328) exited]
Signal 0x7ffce2d05900 time stamps may be invalid.
clients/common/../include/blas3/testing_trsm_batched.hpp:546: Failure
Expected equality of these values:
  hXorB_1.transfer_from(dXorB)
    Which is: 700
  hipSuccess
    Which is: 0

Error: hipMemcpy post-guard copy failure.
clients/gtest/../include/d_vector.hpp:165: Failure
Expected equality of these values:
  memcmp(host.data(), m_guard, m_guard_len)
    Which is: -203
  0

clients/gtest/../include/d_vector.hpp:190: Failure
Expected equality of these values:
  (hipFree)(d)
    Which is: 700
  hipSuccess
    Which is: 0

clients/gtest/../include/device_batch_matrix.hpp:391: Failure
Expected equality of these values:
  (hipFree)(tmp_device_data)
    Which is: 700
  hipSuccess
    Which is: 0

Error: hipMemcpy post-guard copy failure.
clients/gtest/../include/d_vector.hpp:165: Failure
Expected equality of these values:
  memcmp(host.data(), m_guard, m_guard_len)
    Which is: -203
  0

Error: hipMemcpy pre-guard copy failure.
clients/gtest/../include/d_vector.hpp:175: Failure
Expected equality of these values:
  memcmp(host.data(), m_guard, m_guard_len)
    Which is: -203
  0

clients/gtest/../include/d_vector.hpp:190: Failure
Expected equality of these values:
  (hipFree)(d)
    Which is: 700
  hipSuccess
    Which is: 0

clients/gtest/../include/device_batch_matrix.hpp:391: Failure
Expected equality of these values:
  (hipFree)(tmp_device_data)
    Which is: 700
  hipSuccess
    Which is: 0

rocBLAS error retreiving the device (deviceID: 32767)

Thread 1 "rocblas-test" received signal SIGABRT, Aborted.
__pthread_kill_implementation (threadid=<optimized out>, signo=6, no_tid=0) at ./nptl/pthread_kill.c:44
warning: 44     ./nptl/pthread_kill.c: No such file or directory
(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=6, no_tid=0) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (threadid=<optimized out>, signo=6) at ./nptl/pthread_kill.c:89
#2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:100
#3  0x00007fffebc2fb7e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007fffebc128ec in __GI_abort () at ./stdlib/abort.c:77
#5  0x00007fffeead5477 in rocblas_abort () at /usr/src/rocblas-7.1.0-1ubuntu4/library/src/rocblas_ostream.cpp:81
#6  0x00007fffeeace40e in _rocblas_handle::~_rocblas_handle (this=<optimized out>) at /usr/src/rocblas-7.1.0-1ubuntu4/library/src/include/rocblas_ostream.hpp:537
#7  0x00007fffeead2cd3 in rocblas_destroy_handle (handle=0x5555d5bbcfd0) at /usr/src/rocblas-7.1.0-1ubuntu4/library/src/rocblas_auxiliary.cpp:230
#8  0x000055555818e5ad in rocblas_local_handle::~rocblas_local_handle (this=0x7fffffffdad8) at /usr/src/rocblas-7.1.0-1ubuntu4/clients/common/client_utility.cpp:519
#9  0x0000555557a2ec84 in testing_trsm_batched<double> (arg=...) at /usr/src/rocblas-7.1.0-1ubuntu4/clients/common/../include/blas3/testing_trsm_batched.hpp:775
#10 0x000055555819e7b5 in std::function<void()>::operator() (this=0x7fffffffdc90) at /usr/lib/gcc/x86_64-linux-gnu/16/../../../../include/c++/16/bits/std_function.h:581
#11 catch_signals_and_exceptions_as_failures (test=..., set_alarm=true) at /usr/src/rocblas-7.1.0-1ubuntu4/clients/common/gtest_helpers.cpp:199
#12 0x0000555555c374c5 in (anonymous namespace)::trsm_batched_blas3_tensile_Test::TestBody (this=<optimized out>) at /usr/src/rocblas-7.1.0-1ubuntu4/clients/gtest/blas3/trsm_gtest.cpp:201
#13 0x00005555581fb6c7 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ()
#14 0x00005555581e0dae in testing::Test::Run() ()
#15 0x00005555581e0f35 in testing::TestInfo::Run() ()
#16 0x00005555581ebd47 in testing::TestSuite::Run() ()
#17 0x00005555581f0ebc in testing::internal::UnitTestImpl::RunAllTests() ()
#18 0x00005555581fbd27 in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) ()
#19 0x00005555581e0fca in testing::UnitTest::Run() ()
#20 0x000055555587efc9 in RUN_ALL_TESTS () at /usr/include/gtest/gtest.h:2334
#21 main (argc=1, argv=0x7fffffffe458) at /usr/src/rocblas-7.1.0-1ubuntu4/clients/gtest/rocblas_gtest_main.cpp:344


## Versions and affected HW

Arch: amd64
Tested version: rocblas 7.1.0-1ubuntu4
AMD GPU ISAs tested: gfx1100, gfx1101, gfx1201

** Affects: rocblas (Ubuntu)
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/2147137

Title:
  rocblas-test crashes with hipErrorIllegalAddress(700)
