Public bug reported:

Investigation and testing on a machine with gfx1151 ROCm_ISA show that
some rocsolver autopkgtests are failing. Here is the log of one run:

```
[  FAILED  ] 29 tests, listed below:
[  FAILED  ] daily_lapack/POSV.strided_batched__float/23, where GetParam() = ({ 
1000, 2000, 2000, 0 }, { 524, 1 })
[  FAILED  ] daily_lapack/POTRF_64.strided_batched__float/9, where GetParam() = 
({ 2000, 2000, 0 }, U)
- [  FAILED  ] checkin_lapack/GESVDX.__double/162, where GetParam() = ({ 20, 
20, 0, 0, 0 }, { 0, 0, 0, 0, 0, 0, 0 })
- [  FAILED  ] checkin_lapack/GESVDX.batched__double/167, where GetParam() = ({ 
20, 20, 0, 0, 0 }, { 0, 1, 1, 5, 12, 0, 0 })
[  FAILED  ] daily_lapack/SYGV.strided_batched__float/0, where GetParam() = ({ 
192, 192, 192, 0 }, { 1, N, U })
[  FAILED  ] daily_lapack/SYGVJ.strided_batched__float/16, where GetParam() = 
({ 300, 300, 310, 0 }, { 2, V, U })
- [  FAILED  ] daily_lapack/HEEVX.__float_complex/2, where GetParam() = ({ 192, 
192, 192, 5, 15, 100, 170 }, { V, V, L })
- [  FAILED  ] daily_lapack/HEEVX.batched__float_complex/2, where GetParam() = 
({ 192, 192, 192, 5, 15, 100, 170 }, { V, V, L })
- [  FAILED  ] daily_lapack/HEEVX.batched__float_complex/3, where GetParam() = 
({ 192, 192, 192, 5, 15, 100, 170 }, { V, I, U })
- [  FAILED  ] daily_lapack/SYEVDX_INPLACE.__float/8, where GetParam() = ({ 
300, 300, 330, -15, -5, 200, 300 }, { N, V, L })
- [  FAILED  ] daily_lapack/SYGVX.batched__float/8, where GetParam() = ({ 256, 
270, 256, 260, -10, 10, 1, 100, 0 }, { 3, N, I, U })
- [  FAILED  ] daily_lapack/SYGVX.batched__float/10, where GetParam() = ({ 256, 
270, 256, 260, -10, 10, 1, 100, 0 }, { 2, V, I, U })
- [  FAILED  ] daily_lapack/SYGVX.strided_batched__float/14, where GetParam() = 
({ 300, 300, 310, 320, -15, -5, 200, 300, 0 }, { 3, N, I, U })
- [  FAILED  ] daily_lapack/HEGVX.__float_complex/6, where GetParam() = ({ 256, 
270, 256, 260, -10, 10, 1, 100, 0 }, { 1, N, A, U })
- [  FAILED  ] daily_lapack/HEGVX.__float_complex/12, where GetParam() = ({ 
300, 300, 310, 320, -15, -5, 200, 300, 0 }, { 1, N, A, U })
- [  FAILED  ] daily_lapack/HEGVX.batched__float_complex/1, where GetParam() = 
({ 192, 192, 192, 192, 5, 15, 100, 150, 0 }, { 2, N, V, L })
- [  FAILED  ] daily_lapack/HEGVX.batched__float_complex/2, where GetParam() = 
({ 192, 192, 192, 192, 5, 15, 100, 150, 0 }, { 3, N, I, U })
[  FAILED  ] daily_lapack/SYGVDX.batched__float/0, where GetParam() = ({ 192, 
192, 192, 192, 5, 10, 10, 15, 0 }, { 1, N, A, U })
[  FAILED  ] daily_lapack/SYGVDX.batched__float/6, where GetParam() = ({ 256, 
270, 256, 260, -10, 10, 1, 100, 0 }, { 1, N, A, U })
[  FAILED  ] daily_lapack/SYGVDX.strided_batched__float/6, where GetParam() = 
({ 256, 270, 256, 260, -10, 10, 1, 100, 0 }, { 1, N, A, U })
[  FAILED  ] daily_lapack/HEGVDX.batched__float_complex/17, where GetParam() = 
({ 300, 300, 310, 320, -15, -10, 20, 30, 0 }, { 3, V, A, L })
[  FAILED  ] daily_lapack/HEGVDX.strided_batched__float_complex/8, where 
GetParam() = ({ 256, 270, 256, 260, -10, 10, 1, 100, 0 }, { 3, N, I, U })
[  FAILED  ] checkin_lapack/SYGVDX_INPLACE.__float/41, where GetParam() = ({ 
35, 35, 35, 35, -10, 10, 3, 15, 0 }, { 3, V, A, L })
- [  FAILED  ] checkin_lapack/BDSVDX.__double/83, where GetParam() = (U, { 64, 
128, 0 }, { 2, 0, 0, 1, 5 })
- [  FAILED  ] checkin_lapack/BDSVDX.__double/85, where GetParam() = (U, { 64, 
128, 0 }, { 2, 0, 0, 7, 12 })
- [  FAILED  ] checkin_lapack/BDSVDX.__double/86, where GetParam() = (U, { 64, 
128, 0 }, { 0, 0, 0, 0, 0 })
- [  FAILED  ] checkin_lapack/BDSVDX.__double/87, where GetParam() = (U, { 64, 
128, 0 }, { 1, 5, 15, 0, 0 })
- [  FAILED  ] checkin_lapack/BDSVDX.__double/88, where GetParam() = (U, { 64, 
128, 0 }, { 1, 0, 15, 0, 0 })
- [  FAILED  ] checkin_lapack/BDSVDX.__double/179, where GetParam() = (L, { 64, 
128, 0 }, { 1, 0, 15, 0, 0 })

29 FAILED TESTS
```
(Expanded log: https://paste.ubuntu.com/p/Wq4Xjdmhy8/)

1) The failing GESVDX, BDSVDX, HEEVX, SYGVX, SYEVDX_INPLACE and HEGVX tests
(marked with a - prefix in the log above) have been fixed.
The fix can mostly be credited to the synchronization bug fix patches for STEBZ
(https://github.com/ROCm/rocm-libraries/pull/4735) and GETF2
(https://github.com/ROCm/rocm-libraries/pull/3743).
The patch that increases the hegvdx test tolerance
(https://github.com/ROCm/rocm-libraries/pull/2380) is also worth mentioning:
the tests above passed without it here, but they might not in some
environments.

2) The occasional SYGVDX_INPLACE failure seen in some test runs
```
[  FAILED  ] checkin_lapack/SYGVDX_INPLACE.__float/41, where GetParam() = ({ 
35, 35, 35, 35, -10, 10, 3, 15, 0 }, { 3, V, A, L })
```
has been resolved by introducing a patch that increases the sygvdx inplace test
tolerance (https://github.com/ROCm/rocm-libraries/pull/4436), changing the
bound from 8 * n to 10 * n.
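
As a rough illustration of what that change means numerically, the sketch
below evaluates a bound of the form (factor * n) * epsilon, the shape shown in
the gtest failure messages. It assumes single-precision epsilon is 2**-23
(consistent with the logged bounds) and that the first test parameter, 35, is
n. The names tolerance and FLOAT32_EPSILON are made up for this example and
are not rocsolver identifiers.

```python
# Single-precision machine epsilon. Assumption: this is what get_epsilon<T>()
# yields for float; the logged bounds agree, since 1000 * 2**-23 equals
# 0.00011920928955078125.
FLOAT32_EPSILON = 2.0 ** -23


def tolerance(n, factor, epsilon=FLOAT32_EPSILON):
    """Return a bound of the form (factor * n) * epsilon."""
    return (factor * n) * epsilon


# SYGVDX_INPLACE.__float/41, assuming the first parameter (35) is n.
old_bound = tolerance(35, 8)
new_bound = tolerance(35, 10)

print("old bound (8 * n):  ", old_bound)
print("new bound (10 * n): ", new_bound)
print("relaxation factor:  ", new_bound / old_bound)
```

The relaxation is a fixed 25%, so it can only absorb errors that were
marginally over the old bound.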
__

Additionally, to account for differing test environments, the following patches
have been applied:
- skip-test-if-vram-is-insufficient.patch 
(https://github.com/ROCm/rocm-libraries/pull/3886)
- fix-buffer-overflow-causing-test-fails.patch 
(https://github.com/ROCm/rocm-libraries/commit/5ecfb5741a1f0584f1d9b249d4a952e183803c90)
- fix-getri-in-rocsolver-failing.patch 
(https://github.com/ROCm/rocm-libraries/pull/1954)

The current status is that a few tests (2 to 5, depending on the run) are
still failing. Across multiple test runs, the failing tests are always batched
or strided_batched versions of some of the following:
- POSV
- POTRF_64
- SYGV
- SYGVDJ
- SYGVDX
- HEGVDX

Examples:
Run 1
```
[ RUN      ] daily_lapack/POSV.strided_batched__float_complex/22
clients/common/lapack/testing_posv.hpp:503: Failure
Expected: ((max_error)) <= ((n)*get_epsilon<T>()), actual: 0.025169280421464928 
vs 0.00011920928955078125

[  FAILED  ] daily_lapack/POSV.strided_batched__float_complex/22, where
GetParam() = ({ 1000, 2000, 2000, 0 }, { 200, 1 }) (126 ms)

[ RUN      ] daily_lapack/POTRF_64.strided_batched__float_complex/5
clients/common/lapack/testing_potf2_potrf.hpp:475: Failure
Expected: ((max_error)) <= ((n)*get_epsilon<T>()), actual: 
0.00026901860011981017 vs 0.00011920928955078125

[  FAILED  ] daily_lapack/POTRF_64.strided_batched__float_complex/5,
where GetParam() = ({ 1000, 1000, 0 }, U) (130 ms)

[ RUN      ] daily_lapack/SYGV.strided_batched__float/17
clients/common/lapack/testing_sygv_hegv.hpp:706: Failure
Expected: ((max_error)) <= ((n)*get_epsilon<T>()), actual: 0.012406964249268786 
vs 3.5762786865234375e-05

[  FAILED  ] daily_lapack/SYGV.strided_batched__float/17, where GetParam() = ({ 
300, 300, 310, 0 }, { 3, V, L }) (237 ms)
```

Run 2
```
[ RUN      ] checkin_lapack/POTRF_64.strided_batched__float_complex/11
clients/common/lapack/testing_potf2_potrf.hpp:475: Failure
Expected: ((max_error)) <= ((n)*get_epsilon<T>()), actual: 
0.0015920922572822285 vs 5.9604644775390625e-06

[  FAILED  ] checkin_lapack/POTRF_64.strided_batched__float_complex/11,
where GetParam() = ({ 50, 50, 1 }, U) (6 ms)

[ RUN      ] daily_lapack/SYGVDJ.strided_batched__float/16
clients/common/lapack/testing_sygvdj_hegvdj.hpp:668: Failure
Expected: ((max_error)) <= ((2 * n)*get_epsilon<T>()), actual: 
0.030438767109022009 vs 7.152557373046875e-05

[  FAILED  ] daily_lapack/SYGVDJ.strided_batched__float/16, where
GetParam() = ({ 300, 300, 310, 0 }, { 2, V, U }) (130 ms)

[ RUN      ] daily_lapack/SYGVDX.strided_batched__float/17
clients/common/lapack/testing_sygvdx_hegvdx.hpp:1109: Failure
Expected: ((max_error)) <= ((8 * n)*get_epsilon<T>()), actual: 
0.0072121107950806618 vs 0.000286102294921875

[  FAILED  ] daily_lapack/SYGVDX.strided_batched__float/17, where GetParam() = 
({ 300, 300, 310, 320, -15, -10, 20, 30, 0 }, { 3, V, A, L }) (266 ms)
```
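
One way to check the batched-only pattern across many runs is to tally the
"[  FAILED  ]" lines by routine and variant. This is a minimal sketch that
assumes the test names keep the suite/ROUTINE.variant__type/index shape seen
in the logs above; tally_failures and FAILED_RE are hypothetical helper names,
not part of the rocsolver test client.

```python
import re
from collections import Counter

# Matches names such as "daily_lapack/POSV.strided_batched__float_complex/22".
# An empty variant (as in "GESVDX.__double") is treated as the non-batched test.
FAILED_RE = re.compile(r"\[\s*FAILED\s*\]\s+(\w+)/(\w+)\.(\w*?)__(\w+)/(\d+)")


def tally_failures(log_text):
    """Count unique failed tests per (routine, variant) pair."""
    # A set drops duplicates, since gtest repeats failed names in its summary.
    unique = set(FAILED_RE.findall(log_text))
    return Counter(
        (routine, variant or "plain")
        for _suite, routine, variant, _dtype, _index in unique
    )


sample = """\
[  FAILED  ] daily_lapack/POSV.strided_batched__float_complex/22, where
[  FAILED  ] daily_lapack/POTRF_64.strided_batched__float_complex/5,
[  FAILED  ] daily_lapack/SYGV.strided_batched__float/17, where GetParam() = ({
"""

for (routine, variant), count in sorted(tally_failures(sample).items()):
    print(f"{routine:10s} {variant:16s} {count}")
```

Feeding the concatenated logs of several runs into tally_failures would show
whether any "plain" entries ever appear.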

The tests fail because the measured error is well above the tolerance (CPU vs
GPU calculation comparison), by a factor of roughly 2 to over 400 in the runs
above. Floating-point imprecision is one possible cause, perhaps combined with
lossy math optimizations.
Upstream issues have been opened by users [1 - 
https://github.com/ROCm/rocm-libraries/issues/3169, 2 - 
https://github.com/ROCm/rocm-libraries/issues/3171, 3 - 
https://github.com/ROCm/rocm-libraries/issues/3380] and have not been addressed
yet, so newer versions probably have the same problem.
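
To quantify how far each failure is from its bound, the "actual: X vs Y" pairs
in the gtest output can be turned into ratios. The sketch below assumes the
message format shown in the runs above, including the line wrapping introduced
by the mail client; error_ratios and CHECK_RE are hypothetical names chosen
for this example.

```python
import re

# Captures the measured error and the bound from lines such as
# "... actual: 0.025169280421464928 vs 0.00011920928955078125".
# \s also matches newlines, so wrapped messages are handled too.
CHECK_RE = re.compile(r"actual:\s*([0-9.eE+-]+)\s+vs\s+([0-9.eE+-]+)")


def error_ratios(log_text):
    """Return measured_error / bound for every failed check in the log."""
    return [float(actual) / float(bound)
            for actual, bound in CHECK_RE.findall(log_text)]


sample = """\
Expected: ((max_error)) <= ((n)*get_epsilon<T>()), actual: 0.025169280421464928
vs 0.00011920928955078125
Expected: ((max_error)) <= ((n)*get_epsilon<T>()), actual:
0.00026901860011981017 vs 0.00011920928955078125
"""

for ratio in error_ratios(sample):
    print(f"error is {ratio:.1f}x the bound")
```

Ratios in the hundreds are unlikely to be fixed by a modest tolerance bump,
which is why these failures are treated differently from the SYGVDX_INPLACE
case.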

As rocsolver depends on rocblas (rocblas = building blocks, rocsolver =
LAPACK algorithms assembled from those blocks), moving forward we might
want to see rocblas tests passing.

** Affects: rocsolver (Ubuntu)
     Importance: Undecided
     Assignee: Bojan Aleksovski (b0b0a)
         Status: In Progress

** Changed in: rocsolver (Ubuntu)
       Status: New => In Progress

** Changed in: rocsolver (Ubuntu)
     Assignee: (unassigned) => Bojan Aleksovski (b0b0a)

https://bugs.launchpad.net/bugs/2144027

Title:
  Fix tests for 7.1.0