Hi,

Table 2 reports negative latencies. This doesn't look right to me ;-)
If these values are the outcome of a parameter fit to the performance model, then use a parameter name (e.g. alpha) instead of the term 'latency'.

Figure 11 uses a very narrow range on the y-axis and thus greatly exaggerates the variation. The label "GPU performance" should be changed to something like "execution time" to clarify what the y-axis actually shows.

Page 12: The latency for VecDot is higher than for VecAXPY because VecDot requires the result to be copied back to the host, which is an additional operation.

Regarding performance measurements: Did you synchronize after each kernel launch? I.e. did you run (approach A)
 for (many times) {
   synchronize();
   start_timer();
   kernel_launch();
   synchronize();
   stop_timer();
 }
and then take averages over the timings obtained, or did you (approach B)
 synchronize();
 start_timer();
 for (many times) {
   kernel_launch();
 }
 synchronize();
 stop_timer();
and then divide the obtained time by the number of runs?

Approach A will report a much higher latency than approach B, because synchronizations are expensive (i.e. the measured latency consists of the kernel launch latency plus the device synchronization latency). Approach B is slightly over-optimistic, but I've found it to better match what one observes for an algorithm involving several kernel launches.

Best regards,
Karli



On 10/10/19 12:34 AM, Smith, Barry F. via petsc-dev wrote:

   We've prepared a short report on the performance of vector operations on Summit and would appreciate any feedback including: inconsistencies, lack of clarity, incorrect notation or terminology, etc.

    Thanks

     Barry, Hannah, and Richard




