Public bug reported:

== Comment: #0 - Michael Ranweiler <[email protected]> - 2018-08-16 09:58:02 ==

The Witherspoon cluster now has ATS enabled with driver 396.42 and CUDA
version 9.2.148. They are running the CORAL benchmark LULESH with and
without ATS, and they see a significant performance drop with ATS enabled.

========
Below is the run with ATS:

Run completed:
Problem size = 160
MPI tasks = 8
Iteration count = 100
Final Origin Energy = 1.605234e+09
Testing Plane 0 of Energy Array on rank 0:
MaxAbsDiff = 2.384186e-07
TotalAbsDiff = 5.300015e-07
MaxRelDiff = 1.631916e-12

Elapsed time = 153.00 (s)
Grind time (us/z/c) = 0.37352393 (per dom) (0.046690491 overall)
FOM = 21417.637 (z/s)

========
Here is the run without ATS:
Run completed:
Problem size = 160
MPI tasks = 8
Iteration count = 100
Final Origin Energy = 1.605234e+09
Testing Plane 0 of Energy Array on rank 0:
MaxAbsDiff = 2.384186e-07
TotalAbsDiff = 5.300015e-07
MaxRelDiff = 1.631916e-12


Elapsed time = 13.27 (s)
Grind time (us/z/c) = 0.032394027 (per dom) (0.0040492534 overall)
FOM = 246959.11 (z/s)
========

Using ATS on a single node slows down the OpenACC version by more than a
factor of 10; for the version with OpenMP 4.5 and managed memory, they
observe a 2x slowdown.

Last comment from NVIDIA (Javier Cabezas - 07/29/2018 11:30 AM):
We think we have found where the issue is.

This behavior reproduces for any two concurrent processes that create
CUDA contexts on the GPUs and heavily unmap memory (no need to launch
any work on the GPUs). When the problem repros, perf shows that most of
the time is spent in mmio_invalidate. However, this only happens when
processes register GPUs attached to the same NPU. Thus, if process A
initializes GPU 0 and/or 1, and process B initializes GPU 2 and/or 3,
we don't see the slowdown. This makes sense, because ATSDs on different
NPUs are issued independently.

After some code inspection in npu-dma.c (powerpc backend in the Linux
kernel), Mark noticed that the problem could be in the use of
test_and_set_bit_lock in get_mmio_atsd_reg. The implementation of
test_and_set_bit_lock on powerpc relies on the ldarx/stdcx. instructions
(PPC_LLARX/PPC_STLCX in the snippet below):

#define DEFINE_TESTOP(fn, op, prefix, postfix, eh)    \
static __inline__ unsigned long fn(                   \
        unsigned long mask,                           \
        volatile unsigned long *_p)                   \
{                                                     \
    unsigned long old, t;                             \
    unsigned long *p = (unsigned long *)_p;           \
    __asm__ __volatile__ (                            \
    prefix                                            \
"1:"    PPC_LLARX(%0,0,%3,eh) "\n"                    \
    stringify_in_c(op) "%1,%0,%2\n"                   \
    PPC405_ERR77(0,%3)                                \
    PPC_STLCX "%1,0,%3\n"                             \
    "bne- 1b\n"                                       \
    postfix                                           \
    : "=&r" (old), "=&r" (t)                          \
    : "r" (mask), "r" (p)                             \
    : "cc", "memory");                                \
    return (old & mask);                              \
}

According to the PowerPC manual, ldarx creates a memory reservation, and
a subsequent stdcx. from the same processor completes an atomic
read-modify-write operation only if that reservation is still held. The
reservation is lost if a different processor stores to the same address
in between. That is why "bne- 1b" checks whether stdcx. succeeded and
jumps back to retry if it did not. Since DEFINE_TESTOP doesn't implement
any back-off mechanism, two different processors trying to get an ATSD
register can starve each other.
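
To illustrate the retry loop described above, here is a minimal userspace
sketch using C11 atomics. It is not the kernel code: sketch_test_and_set
and its retries counter are hypothetical names, and
atomic_compare_exchange_weak stands in for the load-reserve /
store-conditional pair (how a compiler lowers it on POWER is an assumption
about the toolchain). Like the macro, it always performs a store, even
when the bit was already set, and it retries immediately with no back-off.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/*
 * Hypothetical userspace analogue of the test-and-set loop.
 * Returns the previous state of the masked bit (nonzero if it was set).
 * If `retries` is non-NULL, it counts how often the store had to be redone.
 */
static unsigned long sketch_test_and_set(unsigned long mask,
                                         atomic_ulong *p,
                                         unsigned long *retries)
{
    assert(p != NULL);

    /* Analogue of the load-reserve: read the current word. */
    unsigned long old = atomic_load_explicit(p, memory_order_relaxed);

    /*
     * Analogue of the store-conditional plus "bne- 1b": the exchange
     * fails if *p changed since `old` was read (or spuriously), in which
     * case `old` is refreshed and the loop retries at once.
     */
    while (!atomic_compare_exchange_weak_explicit(p, &old, old | mask,
                                                  memory_order_acquire,
                                                  memory_order_relaxed)) {
        if (retries != NULL)
            (*retries)++;
    }

    return old & mask;
}
```

With two threads hammering the same word, each successful store from one
thread forces the other to go around the loop again, which is the
mutual-starvation pattern suspected here.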

Mark compiled a custom kernel which surrounds the calls to
test_and_set_bit_lock in get_mmio_atsd_reg with a spinlock and I
verified that it solves the issue. These are the execution times for
LULESH:

ATS OFF
Elapsed time         =      16.87 (s)

ATS ON
Elapsed time         =     215.56 (s)

ATS ON + Spinlock
Elapsed time         =      18.14 (s)


Fixed with the following patch in the powerpc tree:
https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?id=9eab9901b015

== Comment: #1 - Michael Ranweiler <[email protected]> - 2018-08-20 14:56:52 ==
This is now in mainline, too:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/powerpc/platforms/powernv/npu-dma.c?id=9eab9901b015f489199105c470de1ffc337cfabb

It applies to 4.15.0.32-35 with some small fuzz:

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index 6c8e168e6571..18226895681e 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -434,8 +434,9 @@ static int get_mmio_atsd_reg(struct npu *npu)
        int i;
 
        for (i = 0; i < npu->mmio_atsd_count; i++) {
-               if (!test_and_set_bit_lock(i, &npu->mmio_atsd_usage))
-                       return i;
+               if (!test_bit(i, &npu->mmio_atsd_usage))
+                       if (!test_and_set_bit_lock(i, &npu->mmio_atsd_usage))
+                               return i;
        }
 
        return -ENOSPC;
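
For readers who want to experiment with the pattern in the patch outside
the kernel, the following is a rough userspace model using C11 atomics.
All names prefixed with model_ and struct npu_model are hypothetical
stand-ins, not kernel APIs, and the release helper is an assumption added
only so the model can be exercised. The point it shows is the one in the
diff: a plain read of the bit comes first, so the atomic read-modify-write
(and its store) is only attempted on registers that look free.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's struct npu; only the two
 * fields used by the allocation loop are modelled. */
struct npu_model {
    atomic_ulong mmio_atsd_usage;   /* bit i set => register i is busy */
    int mmio_atsd_count;            /* number of ATSD registers */
};

/* Plain read of bit i: performs no store to the shared word. */
static bool model_test_bit(int i, atomic_ulong *word)
{
    return (atomic_load_explicit(word, memory_order_relaxed) >> i) & 1UL;
}

/* Atomic read-modify-write with acquire ordering; returns the old bit. */
static bool model_test_and_set_bit_lock(int i, atomic_ulong *word)
{
    unsigned long mask = 1UL << i;
    return atomic_fetch_or_explicit(word, mask, memory_order_acquire) & mask;
}

/* Same loop shape as the patched get_mmio_atsd_reg: test, then
 * test-and-set only if the bit looked clear. */
static int model_get_atsd_reg(struct npu_model *npu)
{
    for (int i = 0; i < npu->mmio_atsd_count; i++) {
        if (!model_test_bit(i, &npu->mmio_atsd_usage))
            if (!model_test_and_set_bit_lock(i, &npu->mmio_atsd_usage))
                return i;
    }
    return -1;  /* the kernel code returns -ENOSPC here */
}

/* Assumed release path: clear the bit with release ordering. */
static void model_put_atsd_reg(struct npu_model *npu, int i)
{
    assert(i >= 0 && i < npu->mmio_atsd_count);
    atomic_fetch_and_explicit(&npu->mmio_atsd_usage, ~(1UL << i),
                              memory_order_release);
}
```

The second check is still required because another thread can claim the
register between the plain read and the atomic operation.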

** Affects: linux (Ubuntu)
     Importance: Undecided
     Assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
         Status: New


** Tags: architecture-ppc64le bugnameltc-170624 severity-high 
targetmilestone-inin---

** Tags added: architecture-ppc64le bugnameltc-170624 severity-high
targetmilestone-inin---

** Changed in: ubuntu
     Assignee: (unassigned) => Ubuntu on IBM Power Systems Bug Triage 
(ubuntu-power-triage)

** Package changed: ubuntu => linux (Ubuntu)

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1788097

Title:
  performance drop with ATS enabled

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1788097/+subscriptions
