Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Mills, Richard Tran via petsc-dev
Karl, that would be fantastic. Much obliged!

--Richard

On 9/23/19 8:09 PM, Karl Rupp wrote:
Hi,

`git grep cudaStreamCreate` reports that vectors, matrices and scatters create 
their own streams. This will almost inevitably create races (there is no 
synchronization mechanism implemented), unless one calls WaitForGPU() after 
each operation. Some of the non-deterministic tests can likely be explained by 
this.

I'll clean this up in the next few hours if there are no objections.

Best regards,
Karli
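
For reference, a minimal CUDA sketch of the ordering hazard described above -- not PETSc code, and the kernels, sizes, and names are made up -- showing that work queued on two different streams is unordered unless an event (or a WaitForGPU()-style device synchronization) is placed between producer and consumer:

  #include <cuda_runtime.h>

  __global__ void scale(double *x, int n)                  { int i = blockIdx.x*blockDim.x + threadIdx.x; if (i < n) x[i] *= 2.0; }
  __global__ void axpy(double *y, const double *x, int n)  { int i = blockIdx.x*blockDim.x + threadIdx.x; if (i < n) y[i] += x[i]; }

  void demo(double *d_x, double *d_y, int n)
  {
    cudaStream_t s_vec, s_mat;                   /* e.g. a vector's stream and a matrix's stream */
    cudaEvent_t  x_ready;
    cudaStreamCreate(&s_vec);
    cudaStreamCreate(&s_mat);
    cudaEventCreate(&x_ready);

    scale<<<(n+255)/256, 256, 0, s_vec>>>(d_x, n);   /* produces x on s_vec                      */
    cudaEventRecord(x_ready, s_vec);                 /* without these two lines, axpy() on s_mat */
    cudaStreamWaitEvent(s_mat, x_ready, 0);          /* may read x before scale() has finished   */
    axpy<<<(n+255)/256, 256, 0, s_mat>>>(d_y, d_x, n);

    cudaEventDestroy(x_ready);
    cudaStreamDestroy(s_vec);
    cudaStreamDestroy(s_mat);
  }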



On 9/24/19 1:05 AM, Mills, Richard Tran via petsc-dev wrote:
I'm no CUDA expert (not yet, anyway), but, from what I've read, the default 
stream (stream 0) is (mostly) synchronous to host and device, so WaitForGPU() 
is not needed in that case. I don't know if there is any performance penalty in 
explicitly calling it in that case, anyway.

In any case, it looks like there are still some cases where potentially 
asynchronous CUDA library calls are being "timed" without a WaitForGPU() to 
ensure that the calls actually complete. I will make a pass through the 
aijcusparse and aijviennacl code looking for these.
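
The pattern that would make those timings meaningful is roughly the following (a sketch using the names that appear elsewhere in this thread; the actual cuSPARSE SpMV launch is elided):

  ierr = PetscLogGpuTimeBegin();CHKERRQ(ierr);
  /* ... launch the asynchronous cuSPARSE SpMV, e.g. the cusparse_csr_spmv() call ... */
  ierr = WaitForGPU();CHKERRCUDA(ierr);       /* make sure the kernel has actually finished ... */
  ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);  /* ... so the logged GPU time is not just launch overhead */

The same pairing would apply around the scatter handling if that part also runs asynchronously.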

--Richard

On 9/23/19 3:28 PM, Zhang, Junchao wrote:
It looks like cusparsestruct->stream is always created (not NULL). I don't know 
the logic of the "if (!cusparsestruct->stream)" check.
--Junchao Zhang


On Mon, Sep 23, 2019 at 5:04 PM Mills, Richard Tran via petsc-dev 
<petsc-dev@mcs.anl.gov> wrote:

In MatMultAdd_SeqAIJCUSPARSE, before Junchao's changes, towards
the end of the function it had

  if (!yy) { /* MatMult */
if (!cusparsestruct->stream) {
  ierr = WaitForGPU();CHKERRCUDA(ierr);
}
  }

I assume we don't need the logic to do this only in the MatMult()
with no add case and should just do this all the time, for the
purposes of timing if no other reason. Is there some reason to NOT
do this because of worries about the effects that these
WaitForGPU() invocations might have on performance?

I notice other problems in aijcusparse.cu,
now that I look closer. In MatMultTransposeAdd_SeqAIJCUSPARSE(), I
see that we have GPU timing calls around the cusparse_csr_spmv()
(but no WaitForGPU() inside the timed region). I believe this is
another area in which we get a meaningless timing. It looks like
we need a WaitForGPU() there, and then maybe inside the timed
region handling the scatter. (I don't know if this stuff happens
asynchronously or not.) But do we potentially want two
WaitForGPU() calls in one function, just to help with getting
timings? I don't have a good idea of how much overhead this adds.

--Richard

On 9/21/19 12:03 PM, Zhang, Junchao via petsc-dev wrote:
I made the following changes:
1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at the end
  ierr = WaitForGPU();CHKERRCUDA(ierr);
  ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
  ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);
  PetscFunctionReturn(0);
2) In MatMult_MPIAIJCUSPARSE, use the following code sequence.
The old code swapped the first two lines. Since with
-log_view, MatMultAdd_SeqAIJCUSPARSE is blocking, I changed the
order to have better overlap.
  ierr =

VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
  ierr =

VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
3) Log time directly in the test code so we can also know
execution time without -log_view (hence cuda synchronization). I
manually calculated the Total Mflop/s for these cases for easy
comparison.
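
Item 3 could look something like the following in the test driver (a sketch only, assuming a CUDA build; A, x, y, and the loop count are placeholders, and cudaDeviceSynchronize() plays the role of WaitForGPU() so the host timer stops only after the GPU work is done, even without -log_view):

  PetscLogDouble t0, t1;
  cudaError_t    cerr;
  PetscInt       i;

  cerr = cudaDeviceSynchronize();CHKERRCUDA(cerr); /* drain any pending GPU work first     */
  t0   = MPI_Wtime();
  for (i = 0; i < 100; i++) {
    ierr = MatMult(A,x,y);CHKERRQ(ierr);
  }
  cerr = cudaDeviceSynchronize();CHKERRCUDA(cerr); /* wait for the last MatMult to finish  */
  t1   = MPI_Wtime();
  ierr = PetscPrintf(PETSC_COMM_WORLD,"100 MatMult calls: %g s\n",(double)(t1-t0));CHKERRQ(ierr);

On more than one rank one would normally also put an MPI_Barrier() just before starting the timer.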




Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage   Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---------------------------------------------------------------------------------------------------------------------------------------------------------------
6 MPI ranks,
MatMult              100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 0.0e+00 24 99 97 18  0 100100100100  0  4743       0      0 0.00e+00    0 0.00e+00  0
VecScatterBegin      100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0   0  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
VecScatterEnd        100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0  13  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Zhang, Junchao via petsc-dev
No objection. Thanks.
--Junchao Zhang


On Mon, Sep 23, 2019 at 10:09 PM Karl Rupp 
<r...@iue.tuwien.ac.at> wrote:
Hi,

`git grep cudaStreamCreate` reports that vectors, matrices and scatters
create their own streams. This will almost inevitably create races
(there is no synchronization mechanism implemented), unless one calls
WaitForGPU() after each operation. Some of the non-deterministic tests
can likely be explained by this.

I'll clean this up in the next few hours if there are no objections.

Best regards,
Karli



On 9/24/19 1:05 AM, Mills, Richard Tran via petsc-dev wrote:
> I'm no CUDA expert (not yet, anyway), but, from what I've read, the
> default stream (stream 0) is (mostly) synchronous to host and device, so
> WaitForGPU() is not needed in that case. I don't know if there is any
> performance penalty in explicitly calling it in that case, anyway.
>
> In any case, it looks like there are still some cases where potentially
> asynchronous CUDA library calls are being "timed" without a WaitForGPU()
> to ensure that the calls actually complete. I will make a pass through
> the aijcusparse and aijviennacl code looking for these.
>
> --Richard
>
> On 9/23/19 3:28 PM, Zhang, Junchao wrote:
>> It looks cusparsestruct->stream is always created (not NULL).  I don't
>> know logic of the "if (!cusparsestruct->stream)".
>> --Junchao Zhang
>>
>>
>> On Mon, Sep 23, 2019 at 5:04 PM Mills, Richard Tran via petsc-dev
>> <petsc-dev@mcs.anl.gov> wrote:
>>
>> In MatMultAdd_SeqAIJCUSPARSE, before Junchao's changes, towards
>> the end of the function it had
>>
>>   if (!yy) { /* MatMult */
>> if (!cusparsestruct->stream) {
>>   ierr = WaitForGPU();CHKERRCUDA(ierr);
>> }
>>   }
>>
>> I assume we don't need the logic to do this only in the MatMult()
>> with no add case and should just do this all the time, for the
>> purposes of timing if no other reason. Is there some reason to NOT
>> do this because of worries the about effects that these
>> WaitForGPU() invocations might have on performance?
>>
>> I notice other problems in aijcusparse.cu 
>> ,
>> now that I look closer. In MatMultTransposeAdd_SeqAIJCUSPARSE(), I
>> see that we have GPU timing calls around the cusparse_csr_spmv()
>> (but no WaitForGPU() inside the timed region). I believe this is
>> another area in which we get a meaningless timing. It looks like
>> we need a WaitForGPU() there, and then maybe inside the timed
>> region handling the scatter. (I don't know if this stuff happens
>> asynchronously or not.) But do we potentially want two
>> WaitForGPU() calls in one function, just to help with getting
>> timings? I don't have a good idea of how much overhead this adds.
>>
>> --Richard
>>
>> On 9/21/19 12:03 PM, Zhang, Junchao via petsc-dev wrote:
>>> I made the following changes:
>>> 1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at the end
>>>   ierr = WaitForGPU();CHKERRCUDA(ierr);
>>>   ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
>>>   ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);
>>>   PetscFunctionReturn(0);
>>> 2) In MatMult_MPIAIJCUSPARSE, use the following code sequence.
>>> The old code swapped the first two lines. Since with
>>> -log_view, MatMultAdd_SeqAIJCUSPARSE is blocking, I changed the
>>> order to have better overlap.
>>>   ierr =
>>> 
>>> VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>>>   ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
>>>   ierr =
>>> 
>>> VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>>>   ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
>>> 3) Log time directly in the test code so we can also know
>>> execution time without -log_view (hence cuda synchronization). I
>>> manually calculated the Total Mflop/s for these cases for easy
>>> comparison.
>>>
>>> <>
>>>
>>> 
>>> 
>>> EventCount  Time (sec) Flop
>>>--- Global ---  --- Stage   Total   GPU-
>>> CpuToGpu -   - GpuToCpu - GPU
>>>Max Ratio  Max Ratio   Max  Ratio  Mess
>>> AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s
>>> Count   Size   Count   Size  %F
>>> 
>>> ---
>>> 6 MPI ranks,
>>> MatMult  100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03
>>> 2.2e+05 0.0e+00 24 99 97 18  0 100100100100  0  4743   0
>>>  0 0.00e+000 0.00e+00  

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Karl Rupp via petsc-dev

Hi,

`git grep cudaStreamCreate` reports that vectors, matrices and scatters 
create their own streams. This will almost inevitably create races 
(there is no synchronization mechanism implemented), unless one calls 
WaitForGPU() after each operation. Some of the non-deterministic tests 
can likely be explained by this.


I'll clean this up in the next few hours if there are no objections.

Best regards,
Karli



On 9/24/19 1:05 AM, Mills, Richard Tran via petsc-dev wrote:
I'm no CUDA expert (not yet, anyway), but, from what I've read, the 
default stream (stream 0) is (mostly) synchronous to host and device, so 
WaitForGPU() is not needed in that case. I don't know if there is any 
performance penalty in explicitly calling it in that case, anyway.


In any case, it looks like there are still some cases where potentially 
asynchronous CUDA library calls are being "timed" without a WaitForGPU() 
to ensure that the calls actually complete. I will make a pass through 
the aijcusparse and aijviennacl code looking for these.


--Richard

On 9/23/19 3:28 PM, Zhang, Junchao wrote:
It looks cusparsestruct->stream is always created (not NULL).  I don't 
know logic of the "if (!cusparsestruct->stream)".

--Junchao Zhang


On Mon, Sep 23, 2019 at 5:04 PM Mills, Richard Tran via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:


In MatMultAdd_SeqAIJCUSPARSE, before Junchao's changes, towards
the end of the function it had

  if (!yy) { /* MatMult */
    if (!cusparsestruct->stream) {
  ierr = WaitForGPU();CHKERRCUDA(ierr);
    }
  }

I assume we don't need the logic to do this only in the MatMult()
with no add case and should just do this all the time, for the
purposes of timing if no other reason. Is there some reason to NOT
do this because of worries the about effects that these
WaitForGPU() invocations might have on performance?

I notice other problems in aijcusparse.cu ,
now that I look closer. In MatMultTransposeAdd_SeqAIJCUSPARSE(), I
see that we have GPU timing calls around the cusparse_csr_spmv()
(but no WaitForGPU() inside the timed region). I believe this is
another area in which we get a meaningless timing. It looks like
we need a WaitForGPU() there, and then maybe inside the timed
region handling the scatter. (I don't know if this stuff happens
asynchronously or not.) But do we potentially want two
WaitForGPU() calls in one function, just to help with getting
timings? I don't have a good idea of how much overhead this adds.

--Richard

On 9/21/19 12:03 PM, Zhang, Junchao via petsc-dev wrote:

I made the following changes:
1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at the end
  ierr = WaitForGPU();CHKERRCUDA(ierr);
  ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
  ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);
  PetscFunctionReturn(0);
2) In MatMult_MPIAIJCUSPARSE, use the following code sequence.
The old code swapped the first two lines. Since with
-log_view, MatMultAdd_SeqAIJCUSPARSE is blocking, I changed the
order to have better overlap.
  ierr =

VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
  ierr =

VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
3) Log time directly in the test code so we can also know
execution time without -log_view (hence cuda synchronization). I
manually calculated the Total Mflop/s for these cases for easy
comparison.




Event                Count      Time (sec)     Flop  
               --- Global ---  --- Stage   Total   GPU    -

CpuToGpu -   - GpuToCpu - GPU
                   Max Ratio  Max     Ratio   Max  Ratio  Mess  
AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s

Count   Size   Count   Size  %F

---
6 MPI ranks,
MatMult              100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03
2.2e+05 0.0e+00 24 99 97 18  0 100100100100  0  4743       0
 0 0.00e+00    0 0.00e+00  0

VecScatterBegin      100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03
2.2e+05 0.0e+00  0  0 97 18  0   0  0100100  0     0       0
 0 0.00e+00    0 0.00e+00  0

VecScatterEnd        100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00
0.0e+00 0.0e+00  3  0  0  0  0  13  0  0  0  0     0       0
 0 0.00e+00    0 0.00e+00  0


24 MPI ranks
MatMult              100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04
5.9e+04 

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Mills, Richard Tran via petsc-dev
I'm no CUDA expert (not yet, anyway), but, from what I've read, the default 
stream (stream 0) is (mostly) synchronous to host and device, so WaitForGPU() 
is not needed in that case. I don't know if there is any performance penalty in 
explicitly calling it in that case, anyway.

In any case, it looks like there are still some cases where potentially 
asynchronous CUDA library calls are being "timed" without a WaitForGPU() to 
ensure that the calls actually complete. I will make a pass through the 
aijcusparse and aijviennacl code looking for these.

--Richard

On 9/23/19 3:28 PM, Zhang, Junchao wrote:
It looks cusparsestruct->stream is always created (not NULL).  I don't know 
logic of the "if (!cusparsestruct->stream)".
--Junchao Zhang


On Mon, Sep 23, 2019 at 5:04 PM Mills, Richard Tran via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
In MatMultAdd_SeqAIJCUSPARSE, before Junchao's changes, towards the end of the 
function it had

  if (!yy) { /* MatMult */
if (!cusparsestruct->stream) {
  ierr = WaitForGPU();CHKERRCUDA(ierr);
}
  }

I assume we don't need the logic to do this only in the MatMult() with no add 
case and should just do this all the time, for the purposes of timing if no 
other reason. Is there some reason to NOT do this because of worries the about 
effects that these WaitForGPU() invocations might have on performance?

I notice other problems in aijcusparse.cu, now that I 
look closer. In MatMultTransposeAdd_SeqAIJCUSPARSE(), I see that we have GPU 
timing calls around the cusparse_csr_spmv() (but no WaitForGPU() inside the 
timed region). I believe this is another area in which we get a meaningless 
timing. It looks like we need a WaitForGPU() there, and then maybe inside the 
timed region handling the scatter. (I don't know if this stuff happens 
asynchronously or not.) But do we potentially want two WaitForGPU() calls in 
one function, just to help with getting timings? I don't have a good idea of 
how much overhead this adds.

--Richard

On 9/21/19 12:03 PM, Zhang, Junchao via petsc-dev wrote:
I made the following changes:
1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at the end
  ierr = WaitForGPU();CHKERRCUDA(ierr);
  ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
  ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);
  PetscFunctionReturn(0);
2) In MatMult_MPIAIJCUSPARSE, use the following code sequence. The old code 
swapped the first two lines. Since with -log_view, MatMultAdd_SeqAIJCUSPARSE is 
blocking, I changed the order to have better overlap.
  ierr = 
VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
  ierr = 
VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
3) Log time directly in the test code so we can also know execution time 
without -log_view (hence cuda synchronization). I manually calculated the Total 
Mflop/s for these cases for easy comparison.



Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage   Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---------------------------------------------------------------------------------------------------------------------------------------------------------------
6 MPI ranks,
MatMult              100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 0.0e+00 24 99 97 18  0 100100100100  0  4743       0      0 0.00e+00    0 0.00e+00  0
VecScatterBegin      100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0   0  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
VecScatterEnd        100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0  13  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0

24 MPI ranks
MatMult              100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  8 99 97 25  0 100100100100  0 17948       0      0 0.00e+00    0 0.00e+00  0
VecScatterBegin      100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0   0  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
VecScatterEnd        100 1.0 1.0639e+00 50.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0  19  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0

42 MPI ranks
MatMult              100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 0.0e+00 23 99 97 30  0 100100100100  0 27493       0      0 0.00e+00    0 0.00e+00  0
VecScatterBegin      100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 0.0e+00  0  0 97 30  0   1  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Mark Adams via petsc-dev
Note, the numerical problems that we have look a lot like a race condition of
some sort. They happen with empty processors and go away under cuda-memcheck
(a valgrind-like thing).

I did try adding WaitForGPU(), but maybe I did not do it right, or there are
other synchronization mechanisms.


On Mon, Sep 23, 2019 at 6:28 PM Zhang, Junchao via petsc-dev <
petsc-dev@mcs.anl.gov> wrote:

> It looks cusparsestruct->stream is always created (not NULL).  I don't
> know logic of the "if (!cusparsestruct->stream)".
> --Junchao Zhang
>
>
> On Mon, Sep 23, 2019 at 5:04 PM Mills, Richard Tran via petsc-dev <
> petsc-dev@mcs.anl.gov> wrote:
>
>> In MatMultAdd_SeqAIJCUSPARSE, before Junchao's changes, towards the end
>> of the function it had
>>
>>   if (!yy) { /* MatMult */
>> if (!cusparsestruct->stream) {
>>   ierr = WaitForGPU();CHKERRCUDA(ierr);
>> }
>>   }
>>
>> I assume we don't need the logic to do this only in the MatMult() with no
>> add case and should just do this all the time, for the purposes of timing
>> if no other reason. Is there some reason to NOT do this because of worries
>> the about effects that these WaitForGPU() invocations might have on
>> performance?
>>
>> I notice other problems in aijcusparse.cu, now that I look closer. In
>> MatMultTransposeAdd_SeqAIJCUSPARSE(), I see that we have GPU timing calls
>> around the cusparse_csr_spmv() (but no WaitForGPU() inside the timed
>> region). I believe this is another area in which we get a meaningless
>> timing. It looks like we need a WaitForGPU() there, and then maybe inside
>> the timed region handling the scatter. (I don't know if this stuff happens
>> asynchronously or not.) But do we potentially want two WaitForGPU() calls
>> in one function, just to help with getting timings? I don't have a good
>> idea of how much overhead this adds.
>>
>> --Richard
>>
>> On 9/21/19 12:03 PM, Zhang, Junchao via petsc-dev wrote:
>>
>> I made the following changes:
>> 1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at the end
>>   ierr = WaitForGPU();CHKERRCUDA(ierr);
>>   ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
>>   ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);
>>   PetscFunctionReturn(0);
>> 2) In MatMult_MPIAIJCUSPARSE, use the following code sequence. The old
>> code swapped the first two lines. Since with
>> -log_view, MatMultAdd_SeqAIJCUSPARSE is blocking, I changed the order to
>> have better overlap.
>>   ierr =
>> VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>>   ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
>>   ierr =
>> VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>>   ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
>> 3) Log time directly in the test code so we can also know execution
>> time without -log_view (hence cuda synchronization). I manually calculated
>> the Total Mflop/s for these cases for easy comparison.
>>
>> <>
>>
>>
>> 
>> EventCount  Time (sec) Flop
>>--- Global ---  --- Stage   Total   GPU- CpuToGpu -   -
>> GpuToCpu - GPU
>>Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen
>>  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size
>> Count   Size  %F
>>
>> ---
>> 6 MPI ranks,
>> MatMult  100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05
>> 0.0e+00 24 99 97 18  0 100100100100  0  4743   0  0 0.00e+000
>> 0.00e+00  0
>> VecScatterBegin  100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05
>> 0.0e+00  0  0 97 18  0   0  0100100  0 0   0  0 0.00e+000
>> 0.00e+00  0
>> VecScatterEnd100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  3  0  0  0  0  13  0  0  0  0 0   0  0 0.00e+000
>> 0.00e+00  0
>>
>> 24 MPI ranks
>> MatMult  100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04
>> 0.0e+00  8 99 97 25  0 100100100100  0 17948   0  0 0.00e+000
>> 0.00e+00  0
>> VecScatterBegin  100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04
>> 0.0e+00  0  0 97 25  0   0  0100100  0 0   0  0 0.00e+000
>> 0.00e+00  0
>> VecScatterEnd100 1.0 1.0639e+0050.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00  2  0  0  0  0  19  0  0  0  0 0   0  0 0.00e+000
>> 0.00e+00  0
>>
>> 42 MPI ranks
>> MatMult  100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04
>> 0.0e+00 23 99 97 30  0 100100100100  0 27493   0  0 0.00e+000
>> 0.00e+00  0
>> VecScatterBegin  100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04
>> 0.0e+00  0  0 97 30  0   1  0100100  0 0   0  0 0.00e+000
>> 0.00e+00  0
>> VecScatterEnd100 1.0 

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Zhang, Junchao via petsc-dev
It looks like cusparsestruct->stream is always created (not NULL). I don't know 
the logic of the "if (!cusparsestruct->stream)" check.
--Junchao Zhang


On Mon, Sep 23, 2019 at 5:04 PM Mills, Richard Tran via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
In MatMultAdd_SeqAIJCUSPARSE, before Junchao's changes, towards the end of the 
function it had

  if (!yy) { /* MatMult */
if (!cusparsestruct->stream) {
  ierr = WaitForGPU();CHKERRCUDA(ierr);
}
  }

I assume we don't need the logic to do this only in the MatMult() with no add 
case and should just do this all the time, for the purposes of timing if no 
other reason. Is there some reason to NOT do this because of worries the about 
effects that these WaitForGPU() invocations might have on performance?

I notice other problems in aijcusparse.cu, now that I 
look closer. In MatMultTransposeAdd_SeqAIJCUSPARSE(), I see that we have GPU 
timing calls around the cusparse_csr_spmv() (but no WaitForGPU() inside the 
timed region). I believe this is another area in which we get a meaningless 
timing. It looks like we need a WaitForGPU() there, and then maybe inside the 
timed region handling the scatter. (I don't know if this stuff happens 
asynchronously or not.) But do we potentially want two WaitForGPU() calls in 
one function, just to help with getting timings? I don't have a good idea of 
how much overhead this adds.

--Richard

On 9/21/19 12:03 PM, Zhang, Junchao via petsc-dev wrote:
I made the following changes:
1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at the end
  ierr = WaitForGPU();CHKERRCUDA(ierr);
  ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
  ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);
  PetscFunctionReturn(0);
2) In MatMult_MPIAIJCUSPARSE, use the following code sequence. The old code 
swapped the first two lines. Since with -log_view, MatMultAdd_SeqAIJCUSPARSE is 
blocking, I changed the order to have better overlap.
  ierr = 
VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
  ierr = 
VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
3) Log time directly in the test code so we can also know execution time 
without -log_view (hence cuda synchronization). I manually calculated the Total 
Mflop/s for these cases for easy comparison.



EventCount  Time (sec) Flop 
 --- Global ---  --- Stage   Total   GPU- CpuToGpu -   - GpuToCpu - GPU
   Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  Reduct 
 %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---
6 MPI ranks,
MatMult  100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 
0.0e+00 24 99 97 18  0 100100100100  0  4743   0  0 0.00e+000 
0.00e+00  0
VecScatterBegin  100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 
0.0e+00  0  0 97 18  0   0  0100100  0 0   0  0 0.00e+000 
0.00e+00  0
VecScatterEnd100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  3  0  0  0  0  13  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0

24 MPI ranks
MatMult  100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 
0.0e+00  8 99 97 25  0 100100100100  0 17948   0  0 0.00e+000 
0.00e+00  0
VecScatterBegin  100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 
0.0e+00  0  0 97 25  0   0  0100100  0 0   0  0 0.00e+000 
0.00e+00  0
VecScatterEnd100 1.0 1.0639e+0050.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  2  0  0  0  0  19  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0

42 MPI ranks
MatMult  100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 
0.0e+00 23 99 97 30  0 100100100100  0 27493   0  0 0.00e+000 
0.00e+00  0
VecScatterBegin  100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 
0.0e+00  0  0 97 30  0   1  0100100  0 0   0  0 0.00e+000 
0.00e+00  0
VecScatterEnd100 1.0 8.5184e-0162.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  6  0  0  0  0  24  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0

6 MPI ranks + 6 GPUs + regular SF + log_view
MatMult  100 1.0 1.6863e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 
0.0e+00  0 99 97 18  0 100100100100  0 335743   629278  100 1.02e+02  100 
2.69e+02 100
VecScatterBegin  100 1.0 5.0157e-02 1.6 0.00e+00 0.0 2.8e+03 2.2e+05 
0.0e+00  0  0 97 18  0  24  0100100  0 0   0  0 0.00e+00  100 
2.69e+02  0
VecScatterEnd100 1.0 4.9155e-02 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Mills, Richard Tran via petsc-dev
In MatMultAdd_SeqAIJCUSPARSE, before Junchao's changes, towards the end of the 
function it had

  if (!yy) { /* MatMult */
if (!cusparsestruct->stream) {
  ierr = WaitForGPU();CHKERRCUDA(ierr);
}
  }

I assume we don't need the logic to do this only in the MatMult() with no add 
case and should just do this all the time, for the purposes of timing if no 
other reason. Is there some reason to NOT do this because of worries about the 
effects that these WaitForGPU() invocations might have on performance?

I notice other problems in aijcusparse.cu, now that I look closer. In 
MatMultTransposeAdd_SeqAIJCUSPARSE(), I see that we have GPU timing calls 
around the cusparse_csr_spmv() (but no WaitForGPU() inside the timed region). I 
believe this is another area in which we get a meaningless timing. It looks 
like we need a WaitForGPU() there, and then maybe inside the timed region 
handling the scatter. (I don't know if this stuff happens asynchronously or 
not.) But do we potentially want two WaitForGPU() calls in one function, just 
to help with getting timings? I don't have a good idea of how much overhead 
this adds.

--Richard

On 9/21/19 12:03 PM, Zhang, Junchao via petsc-dev wrote:
I made the following changes:
1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at the end
  ierr = WaitForGPU();CHKERRCUDA(ierr);
  ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
  ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);
  PetscFunctionReturn(0);
2) In MatMult_MPIAIJCUSPARSE, use the following code sequence. The old code 
swapped the first two lines. Since with -log_view, MatMultAdd_SeqAIJCUSPARSE is 
blocking, I changed the order to have better overlap.
  ierr = 
VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
  ierr = 
VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
3) Log time directly in the test code so we can also know execution time 
without -log_view (hence cuda synchronization). I manually calculated the Total 
Mflop/s for these cases for easy comparison.



Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage   Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---------------------------------------------------------------------------------------------------------------------------------------------------------------
6 MPI ranks,
MatMult              100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 0.0e+00 24 99 97 18  0 100100100100  0  4743       0      0 0.00e+00    0 0.00e+00  0
VecScatterBegin      100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0   0  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
VecScatterEnd        100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0  13  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0

24 MPI ranks
MatMult              100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  8 99 97 25  0 100100100100  0 17948       0      0 0.00e+00    0 0.00e+00  0
VecScatterBegin      100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0   0  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
VecScatterEnd        100 1.0 1.0639e+00 50.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0  19  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0

42 MPI ranks
MatMult              100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 0.0e+00 23 99 97 30  0 100100100100  0 27493       0      0 0.00e+00    0 0.00e+00  0
VecScatterBegin      100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 0.0e+00  0  0 97 30  0   1  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
VecScatterEnd        100 1.0 8.5184e-01 62.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0  24  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0

6 MPI ranks + 6 GPUs + regular SF + log_view
MatMult              100 1.0 1.6863e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  0 99 97 18  0 100100100100  0 335743  629278    100 1.02e+02  100 2.69e+02 100
VecScatterBegin      100 1.0 5.0157e-02 1.6 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0  24  0100100  0     0       0      0 0.00e+00  100 2.69e+02  0
VecScatterEnd        100 1.0 4.9155e-02 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  20  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
VecCUDACopyTo        100 1.0 9.5078e-03 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   4  0  0  0  0     0       0    100 1.02e+02    0 0.00e+00  0
VecCopyFromSome

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Mills, Richard Tran via petsc-dev
OK, I wrote to the OLCF Consultants and they told me that

* Yes, the jsrun Visualizer numberings correspond to the 'lstopo' ones.

and, from this I can conclude that

* If I ask for 6 resource sets, each with 1 core and 1 GPU, then some of the 
cores in different resource sets will share L2/L3 cache.

* For the above case, in which I want 6 MPI ranks that don't share anything, I 
need to ask for 6 resource sets, each with *2 cores* and 1 GPU. When I ask for 
2 cores, each resource set consists of 2 cores that share L2/L3, so this is how 
you get resource sets that don't share L2/L3 between them (see the jsrun line below).
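
If I remember the jsrun options correctly, that request looks something like the following (6 resource sets; 1 MPI rank, 2 cores, and 1 GPU per set; the executable name and its options are placeholders):

  jsrun -n 6 -a 1 -c 2 -g 1 ./my_matmult_test -options_file opts.txt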

--Richard

On 9/23/19 11:10 AM, Mills, Richard Tran wrote:
To further muddy the waters, the OLCF Summit User Guide 
(https://www.olcf.ornl.gov/for-users/system-user-guides/summit/summit-user-guide)
 states that

"The POWER9 processor is built around IBM’s SIMD Multi-Core (SMC). The 
processor provides 22 SMCs with separate 32kB L1 data and instruction caches. 
Pairs of SMCs share a 512kB L2 cache and a 10MB L3 cache."

And there is some funny stuff in that lstopo output. On the first socket, I see 
one "SMC" that doesn't share L2/L3 with anyone. This may be because it actually 
shares this with a "service" node that is hidden to jsrun. But why are there 
three such SMCs on the second socket?!

I've written to the OLCF Consultants to see if they can provide any 
clarification on this. In particular, I want to know if the jsrun Visualizer 
hardware thread and core numberings correspond to the lstopo ones. I think 
that's the only way to tell if we are getting cores that don't share L2/L3 
resources or not.

--Richard


On 9/23/19 10:58 AM, Zhang, Junchao wrote:
The figure did not clearly say all cores share L3.  Instead, we should look at 
p.16 of https://www.redbooks.ibm.com/redpapers/pdfs/redp5472.pdf

"The POWER9 chip contains two memory controllers, PCIe Gen4 I/O controllers, 
and an interconnection system that connects all components within the chip at 7 
TBps. Each core has 256 KB of L2 cache, and all cores share 120 MB of L3 
embedded DRAM (eDRAM)."
--Junchao Zhang


On Mon, Sep 23, 2019 at 11:58 AM Mills, Richard Tran via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
L3 and L2 are shared between cores, actually. See the attached 'lstopo' PDF 
output from a Summit compute node to see an illustration of the node layout.

--Richard

On 9/23/19 9:01 AM, Zhang, Junchao via petsc-dev wrote:
I also did an OpenMP stream test and then found a mismatch between OpenMP and 
MPI. That reminded me of a subtle issue on Summit: pairs of cores share the L2 cache. 
One has to place MPI ranks on different pairs to get the best bandwidth. See the 
different bindings
https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b21l0= and 
https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b22l0=. Note each 
socket has 21 cores; I assume that means 11 pairs. The new results are below. 
They match what I got from OpenMP. The bandwidth is almost doubled 
from 1 to 2 cores per socket. The IBM document also says each socket has two memory 
controllers. I could not find the core-memory controller affinity info. I tried 
different bindings but did not find a huge difference.

#Ranks  Rate (MB/s)   Ratio over 2 ranks
 1        29229.8        -
 2        59091.0       1.0
 4       112260.7       1.9
 6       159852.8       2.7
 8       194351.7       3.3
10       215841.0       3.7
12       232316.6       3.9
14       244615.7       4.1
16       254450.8       4.3
18       262185.7       4.4
20       267181.0       4.5
22       270290.4       4.6
24       221944.9       3.8
26       238302.8       4.0


--Junchao Zhang


On Sun, Sep 22, 2019 at 6:04 PM Smith, Barry F. 
mailto:bsm...@mcs.anl.gov>> wrote:

  Junchao,

 For completeness could you please run with a single core? But leave the 
ratio as you have with over 2 ranks since that is the correct model.

   Thanks

 Barry


> On Sep 22, 2019, at 11:14 AM, Zhang, Junchao 
> mailto:jczh...@mcs.anl.gov>> wrote:
>
> I did stream test on Summit. I used the MPI version from petsc, but largely 
> increased the array size N since one socket of Summit has 120MB L3 cache. I 
> used MPI version since it was easy for me to distribute ranks evenly to the 
> two sockets.
> The result matches with data released by OLCF (see attached figure) and data 
> given by Jed. We can see the bandwidth saturates around 24 ranks.
>
> #Ranks Rate (MB/s) Ratio over 2 ranks
> --
> 2  59012.28341.00
> 4  70959.14751.20
> 6 106639.98371.81
> 8 138638.69292.35
> 10171125.08732.90
> 12196162.51973.32
> 14215272.78103.65
> 16229562.40403.89
> 18242587.49134.11
> 20251057.17314.25
> 22258569.77944.38
> 24265443.29244.50
> 26266562.78724.52
> 28267043.6367   

Re: [petsc-dev] Broken MatMatMult_MPIAIJ_MPIDense

2019-09-23 Thread Zhang, Hong via petsc-dev
Done. See 
https://gitlab.com/petsc/petsc/commit/85ec510f49531057ebfe1fb641fe93a36371878e
Hong

On Mon, Sep 23, 2019 at 11:32 AM Pierre Jolivet 
mailto:pierre.joli...@enseeiht.fr>> wrote:
Hong,
You should probably cherry pick 
https://gitlab.com/petsc/petsc/commit/93d7d1d6d29b0d66b5629a261178b832a925de80?merge_request_iid=2069
 (and remove the MatNest part).
This fixes a similar issue in MatTransposeMatMult with nontrivial LDAs.
Since this commit is part of a feature MR that is unlikely to be ready for 
tomorrow, this fix (as of now) is also unlikely to be in master for the release.

Thanks,
Pierre

On 23 Sep 2019, at 6:02 PM, Zhang, Hong 
mailto:hzh...@mcs.anl.gov>> wrote:

Barry:
As a hack for this release could you have the Numeric portion of the 
multiply routines check if the symbolic data is there and if not just call the 
symbolic and attach the needed data? You might need to have a utility function 
that does all the symbolic part except the allocation of the matrix and then 
call this from the numeric part as well as the real symbolic part.

I'm working on this now. I was not aware of MatSeqDenseSetLDA(), which changes 
the pattern of data access in a seqdense matrix.
Pierre's patch:
"change Bm here 
https://www.mcs.anl.gov/petsc/petsc-dev/src/mat/impls/aij/mpi/mpimatmatmult.c.html#line549
 to the LDA of B"
fixes this bug. I'll further test it and submit a pull request.
Then, I'll check slepc's bug report.
Hong
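
For readers not following the LDA details, a minimal sketch (illustrative only, not the actual patch) of how a MATSEQDENSE array is meant to be addressed: entries are stored column-major with a leading dimension lda >= m, so column j starts at v + j*lda, and using the row count m as the stride (the bug in question) reads the wrong entries whenever lda > m. B, m, n are placeholders; MatDenseGetLDA(), MatDenseGetArray(), and MatGetLocalSize() are the standard PETSc calls.

  PetscScalar *v;
  PetscInt     m, n, lda, i, j;

  ierr = MatGetLocalSize(B,&m,&n);CHKERRQ(ierr);
  ierr = MatDenseGetLDA(B,&lda);CHKERRQ(ierr);    /* column stride; may exceed m         */
  ierr = MatDenseGetArray(B,&v);CHKERRQ(ierr);
  for (j=0; j<n; j++) {
    for (i=0; i<m; i++) {
      PetscScalar bij = v[i + j*lda];             /* correct: the stride is lda, not m   */
      (void)bij;                                  /* (use bij here)                      */
    }
  }
  ierr = MatDenseRestoreArray(B,&v);CHKERRQ(ierr);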



Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Mills, Richard Tran via petsc-dev
To further muddy the waters, the OLCF Summit User Guide 
(https://www.olcf.ornl.gov/for-users/system-user-guides/summit/summit-user-guide)
 states that

"The POWER9 processor is built around IBM’s SIMD Multi-Core (SMC). The 
processor provides 22 SMCs with separate 32kB L1 data and instruction caches. 
Pairs of SMCs share a 512kB L2 cache and a 10MB L3 cache."

And there is some funny stuff in that lstopo output. On the first socket, I see 
one "SMC" that doesn't share L2/L3 with anyone. This may be because it actually 
shares this with a "service" node that is hidden to jsrun. But why are there 
three such SMCs on the second socket?!

I've written to the OLCF Consultants to see if they can provide any 
clarification on this. In particular, I want to know if the jsrun Visualizer 
hardware thread and core numberings correspond to the lstopo ones. I think 
that's the only way to tell if we are getting cores that don't share L2/L3 
resources or not.

--Richard


On 9/23/19 10:58 AM, Zhang, Junchao wrote:
The figure did not clearly say all cores share L3.  Instead, we should look at 
p.16 of https://www.redbooks.ibm.com/redpapers/pdfs/redp5472.pdf

"The POWER9 chip contains two memory controllers, PCIe Gen4 I/O controllers, 
and an interconnection system that connects all components within the chip at 7 
TBps. Each core has 256 KB of L2 cache, and all cores share 120 MB of L3 
embedded DRAM (eDRAM)."
--Junchao Zhang


On Mon, Sep 23, 2019 at 11:58 AM Mills, Richard Tran via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
L3 and L2 are shared between cores, actually. See the attached 'lstopo' PDF 
output from a Summit compute node to see an illustration of the node layout.

--Richard

On 9/23/19 9:01 AM, Zhang, Junchao via petsc-dev wrote:
I also did OpenMP stream test and then I found mismatch between OpenMPI and 
MPI.  That reminded me a subtle issue on summit: pair of cores share L2 cache.  
One has to place MPI ranks to different pairs to get best bandwidth. See 
different bindings
https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b21l0= and 
https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b22l0=. Note each 
node has 21 cores. I assume that means 11 pairs. The new results are below. 
They match with we what I got from OpenMPI. The bandwidth is almost doubled 
from 1 to 2 cores per socket. IBM document also says each socket has two memory 
controllers. I could not find the core-memory controller affinity info. I tried 
different bindings but did not find huge difference.

#Ranks  Rate (MB/s)Ratio over 2 ranks
1 29229.8   -
2 59091.0  1.0
4112260.7  1.9
6159852.8  2.7
8194351.7  3.3
10   215841.0  3.7
12   232316.6  3.9
14   244615.7  4.1
16   254450.8  4.3
18   262185.7  4.4
20   267181.0  4.5
22   270290.4  4.6
24   221944.9  3.8
26   238302.8  4.0


--Junchao Zhang


On Sun, Sep 22, 2019 at 6:04 PM Smith, Barry F. 
mailto:bsm...@mcs.anl.gov>> wrote:

  Junchao,

 For completeness could you please run with a single core? But leave the 
ratio as you have with over 2 ranks since that is the correct model.

   Thanks

 Barry


> On Sep 22, 2019, at 11:14 AM, Zhang, Junchao 
> mailto:jczh...@mcs.anl.gov>> wrote:
>
> I did stream test on Summit. I used the MPI version from petsc, but largely 
> increased the array size N since one socket of Summit has 120MB L3 cache. I 
> used MPI version since it was easy for me to distribute ranks evenly to the 
> two sockets.
> The result matches with data released by OLCF (see attached figure) and data 
> given by Jed. We can see the bandwidth saturates around 24 ranks.
>
> #Ranks  Rate (MB/s)   Ratio over 2 ranks
> ----------------------------------------
>  2       59012.2834      1.00
>  4       70959.1475      1.20
>  6      106639.9837      1.81
>  8      138638.6929      2.35
> 10      171125.0873      2.90
> 12      196162.5197      3.32
> 14      215272.7810      3.65
> 16      229562.4040      3.89
> 18      242587.4913      4.11
> 20      251057.1731      4.25
> 22      258569.7794      4.38
> 24      265443.2924      4.50
> 26      266562.7872      4.52
> 28      267043.6367      4.53
> 30      266833.7212      4.52
> 32      267183.8474      4.53
>
> On Sat, Sep 21, 2019 at 11:24 PM Smith, Barry F. 
> mailto:bsm...@mcs.anl.gov>> wrote:
>
>   Junchao could try the PETSc (and non-PETSc) streams tests on the machine.
>
>   There are a few differences, compiler, the reported results are with 
> OpenMP, different number of cores but yes the performance is a bit low. For 
> DOE that is great, makes GPUs look better :-)
>
>
> > On Sep 21, 2019, at 11:11 PM, Jed Brown 
> > mailto:j...@jedbrown.org>> wrote:
> >
> > For an AIJ matrix with 32-bit integers, this is 1 flops/6 bytes, or 165
> > GB/s for the node for the best case (42 ranks).
> >

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Zhang, Junchao via petsc-dev
The figure did not clearly say all cores share L3.  Instead, we should look at 
p.16 of https://www.redbooks.ibm.com/redpapers/pdfs/redp5472.pdf

"The POWER9 chip contains two memory controllers, PCIe Gen4 I/O controllers, 
and an interconnection system that connects all components within the chip at 7 
TBps. Each core has 256 KB of L2 cache, and all cores share 120 MB of L3 
embedded DRAM (eDRAM)."
--Junchao Zhang


On Mon, Sep 23, 2019 at 11:58 AM Mills, Richard Tran via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
L3 and L2 are shared between cores, actually. See the attached 'lstopo' PDF 
output from a Summit compute node to see an illustration of the node layout.

--Richard

On 9/23/19 9:01 AM, Zhang, Junchao via petsc-dev wrote:
I also did OpenMP stream test and then I found mismatch between OpenMPI and 
MPI.  That reminded me a subtle issue on summit: pair of cores share L2 cache.  
One has to place MPI ranks to different pairs to get best bandwidth. See 
different bindings
https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b21l0= and 
https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b22l0=. Note each 
node has 21 cores. I assume that means 11 pairs. The new results are below. 
They match with we what I got from OpenMPI. The bandwidth is almost doubled 
from 1 to 2 cores per socket. IBM document also says each socket has two memory 
controllers. I could not find the core-memory controller affinity info. I tried 
different bindings but did not find huge difference.

#Ranks  Rate (MB/s)Ratio over 2 ranks
1 29229.8   -
2 59091.0  1.0
4112260.7  1.9
6159852.8  2.7
8194351.7  3.3
10   215841.0  3.7
12   232316.6  3.9
14   244615.7  4.1
16   254450.8  4.3
18   262185.7  4.4
20   267181.0  4.5
22   270290.4  4.6
24   221944.9  3.8
26   238302.8  4.0


--Junchao Zhang


On Sun, Sep 22, 2019 at 6:04 PM Smith, Barry F. 
mailto:bsm...@mcs.anl.gov>> wrote:

  Junchao,

 For completeness could you please run with a single core? But leave the 
ratio as you have with over 2 ranks since that is the correct model.

   Thanks

 Barry


> On Sep 22, 2019, at 11:14 AM, Zhang, Junchao 
> mailto:jczh...@mcs.anl.gov>> wrote:
>
> I did stream test on Summit. I used the MPI version from petsc, but largely 
> increased the array size N since one socket of Summit has 120MB L3 cache. I 
> used MPI version since it was easy for me to distribute ranks evenly to the 
> two sockets.
> The result matches with data released by OLCF (see attached figure) and data 
> given by Jed. We can see the bandwidth saturates around 24 ranks.
>
> #Ranks Rate (MB/s) Ratio over 2 ranks
> --
> 2  59012.28341.00
> 4  70959.14751.20
> 6 106639.98371.81
> 8 138638.69292.35
> 10171125.08732.90
> 12196162.51973.32
> 14215272.78103.65
> 16229562.40403.89
> 18242587.49134.11
> 20251057.17314.25
> 22258569.77944.38
> 24265443.29244.50
> 26266562.78724.52
> 28267043.63674.53
> 30266833.72124.52
> 32267183.84744.53
>
> On Sat, Sep 21, 2019 at 11:24 PM Smith, Barry F. 
> mailto:bsm...@mcs.anl.gov>> wrote:
>
>   Junchao could try the PETSc (and non-PETSc) streams tests on the machine.
>
>   There are a few differences, compiler, the reported results are with 
> OpenMP, different number of cores but yes the performance is a bit low. For 
> DOE that is great, makes GPUs look better :-)
>
>
> > On Sep 21, 2019, at 11:11 PM, Jed Brown 
> > mailto:j...@jedbrown.org>> wrote:
> >
> > For an AIJ matrix with 32-bit integers, this is 1 flops/6 bytes, or 165
> > GB/s for the node for the best case (42 ranks).
> >
> > My understanding is that these systems have 8 channels of DDR4-2666 per
> > socket, which is ~340 GB/s of theoretical bandwidth on a 2-socket
> > system, and 270 GB/s STREAM Triad according to this post
> >
> >  
> > https://openpowerblog.wordpress.com/2018/07/19/epyc-skylake-vs-power9-stream-memory-bandwidth-comparison-via-zaius-barreleye-g2/
> >
> > Is this 60% of Triad the best we can get for SpMV?
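
For concreteness, the arithmetic behind those numbers: with double-precision values and 32-bit column indices, each AIJ nonzero costs 2 flops (one multiply, one add) and streams about 12 bytes (an 8-byte value plus a 4-byte index), ignoring vector and row-pointer traffic, i.e. 1 flop per 6 bytes. The best CPU rate above, 27493 Mflop/s on 42 ranks, therefore corresponds to roughly 27.5 Gflop/s × 6 bytes/flop ≈ 165 GB/s, which is about 61% of the 270 GB/s STREAM Triad figure.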
> >
> > "Zhang, Junchao via petsc-dev" 
> > mailto:petsc-dev@mcs.anl.gov>> writes:
> >
> >> 42 cores have better performance.
> >>
> >> 36 MPI ranks
> >> MatMult  100 1.0 2.2435e+00 1.0 1.75e+09 1.3 2.9e+04 4.5e+04 
> >> 0.0e+00  6 99 97 28  0 100100100100  0 25145   0  0 0.00e+000 
> >> 0.00e+00  0
> >> VecScatterBegin  100 1.0 2.1869e-02 3.3 0.00e+00 0.0 2.9e+04 4.5e+04 
> >> 0.0e+00  0  0 97 28  0   1  0100100  0 0   0  0 0.00e+000 
> >> 0.00e+00  0
> >> VecScatterEnd100 1.0 7.9205e-0152.6 0.00e+00 0.0 0.0e+00 0.0e+00 
> >> 0.0e+00  1  0  0  0  0  

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Mills, Richard Tran via petsc-dev
L3 and L2 are shared between cores, actually. See the attached 'lstopo' PDF 
output from a Summit compute node to see an illustration of the node layout.

--Richard

On 9/23/19 9:01 AM, Zhang, Junchao via petsc-dev wrote:
I also did OpenMP stream test and then I found mismatch between OpenMPI and 
MPI.  That reminded me a subtle issue on summit: pair of cores share L2 cache.  
One has to place MPI ranks to different pairs to get best bandwidth. See 
different bindings
https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b21l0= and 
https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b22l0=. Note each 
node has 21 cores. I assume that means 11 pairs. The new results are below. 
They match with we what I got from OpenMPI. The bandwidth is almost doubled 
from 1 to 2 cores per socket. IBM document also says each socket has two memory 
controllers. I could not find the core-memory controller affinity info. I tried 
different bindings but did not find huge difference.

#Ranks  Rate (MB/s)Ratio over 2 ranks
1 29229.8   -
2 59091.0  1.0
4112260.7  1.9
6159852.8  2.7
8194351.7  3.3
10   215841.0  3.7
12   232316.6  3.9
14   244615.7  4.1
16   254450.8  4.3
18   262185.7  4.4
20   267181.0  4.5
22   270290.4  4.6
24   221944.9  3.8
26   238302.8  4.0


--Junchao Zhang


On Sun, Sep 22, 2019 at 6:04 PM Smith, Barry F. 
mailto:bsm...@mcs.anl.gov>> wrote:

  Junchao,

 For completeness could you please run with a single core? But leave the 
ratio as you have with over 2 ranks since that is the correct model.

   Thanks

 Barry


> On Sep 22, 2019, at 11:14 AM, Zhang, Junchao 
> mailto:jczh...@mcs.anl.gov>> wrote:
>
> I did stream test on Summit. I used the MPI version from petsc, but largely 
> increased the array size N since one socket of Summit has 120MB L3 cache. I 
> used MPI version since it was easy for me to distribute ranks evenly to the 
> two sockets.
> The result matches with data released by OLCF (see attached figure) and data 
> given by Jed. We can see the bandwidth saturates around 24 ranks.
>
> #Ranks Rate (MB/s) Ratio over 2 ranks
> --
> 2  59012.28341.00
> 4  70959.14751.20
> 6 106639.98371.81
> 8 138638.69292.35
> 10171125.08732.90
> 12196162.51973.32
> 14215272.78103.65
> 16229562.40403.89
> 18242587.49134.11
> 20251057.17314.25
> 22258569.77944.38
> 24265443.29244.50
> 26266562.78724.52
> 28267043.63674.53
> 30266833.72124.52
> 32267183.84744.53
>
> On Sat, Sep 21, 2019 at 11:24 PM Smith, Barry F. 
> mailto:bsm...@mcs.anl.gov>> wrote:
>
>   Junchao could try the PETSc (and non-PETSc) streams tests on the machine.
>
>   There are a few differences, compiler, the reported results are with 
> OpenMP, different number of cores but yes the performance is a bit low. For 
> DOE that is great, makes GPUs look better :-)
>
>
> > On Sep 21, 2019, at 11:11 PM, Jed Brown 
> > mailto:j...@jedbrown.org>> wrote:
> >
> > For an AIJ matrix with 32-bit integers, this is 1 flops/6 bytes, or 165
> > GB/s for the node for the best case (42 ranks).
> >
> > My understanding is that these systems have 8 channels of DDR4-2666 per
> > socket, which is ~340 GB/s of theoretical bandwidth on a 2-socket
> > system, and 270 GB/s STREAM Triad according to this post
> >
> >  
> > https://openpowerblog.wordpress.com/2018/07/19/epyc-skylake-vs-power9-stream-memory-bandwidth-comparison-via-zaius-barreleye-g2/
> >
> > Is this 60% of Triad the best we can get for SpMV?
> >
> > "Zhang, Junchao via petsc-dev" 
> > mailto:petsc-dev@mcs.anl.gov>> writes:
> >
> >> 42 cores have better performance.
> >>
> >> 36 MPI ranks
> >> MatMult  100 1.0 2.2435e+00 1.0 1.75e+09 1.3 2.9e+04 4.5e+04 
> >> 0.0e+00  6 99 97 28  0 100100100100  0 25145   0  0 0.00e+000 
> >> 0.00e+00  0
> >> VecScatterBegin  100 1.0 2.1869e-02 3.3 0.00e+00 0.0 2.9e+04 4.5e+04 
> >> 0.0e+00  0  0 97 28  0   1  0100100  0 0   0  0 0.00e+000 
> >> 0.00e+00  0
> >> VecScatterEnd100 1.0 7.9205e-0152.6 0.00e+00 0.0 0.0e+00 0.0e+00 
> >> 0.0e+00  1  0  0  0  0  22  0  0  0  0 0   0  0 0.00e+000 
> >> 0.00e+00  0
> >>
> >> --Junchao Zhang
> >>
> >>
> >> On Sat, Sep 21, 2019 at 9:41 PM Smith, Barry F. 
> >> mailto:bsm...@mcs.anl.gov>>>
> >>  wrote:
> >>
> >>  Junchao,
> >>
> >>Mark has a good point; could you also try for completeness the CPU with 
> >> 36 cores and see if it is any better than the 42 core case?
> >>
> >>  Barry
> >>
> >>  So extrapolating about 20 nodes of the CPUs is equivalent to 1 node 

Re: [petsc-dev] Broken MatMatMult_MPIAIJ_MPIDense

2019-09-23 Thread Pierre Jolivet via petsc-dev
Hong,
You should probably cherry pick 
https://gitlab.com/petsc/petsc/commit/93d7d1d6d29b0d66b5629a261178b832a925de80?merge_request_iid=2069
(and remove the MatNest part).
This fixes a similar issue in MatTransposeMatMult with nontrivial LDAs.
Since this commit is part of a feature MR that is unlikely to be ready for 
tomorrow, this fix (as of now) is also unlikely to be in master for the release.

Thanks,
Pierre

> On 23 Sep 2019, at 6:02 PM, Zhang, Hong  wrote:
> 
> Barry:
> As a hack for this release could you have the Numeric portion of the 
> multiply routines check if the symbolic data is there and if not just call 
> the symbolic an attach the needed data? You might need to have a utility 
> function that does all the symbolic part except the allocation of the matrix 
> and then call this from the numeric part as well as the real symbolic part.
>  
> I'm working on this now.  I was not aware of MatSeqDenseSetLDA() which 
> changes pattern of data access in seqdense matrix. 
> Pierre's patch:
> "change Bm here 
> https://www.mcs.anl.gov/petsc/petsc-dev/src/mat/impls/aij/mpi/mpimatmatmult.c.html#line549
>  
> 
>  to the LDA of B" 
> fix this bug. I'll further test it and submit a pull request. 
> Then, I'll check slepc's bug report.
> Hong



Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Smith, Barry F. via petsc-dev



  Junchao,

Great, thanks

   Barry

  Eventually I think this should all go into an MR that includes these tests 
and the PetscSF ping-pongs so later someone can reproduce these numbers on 
Summit and on the new machines that come out.

> On Sep 23, 2019, at 11:01 AM, Zhang, Junchao  wrote:
> 
> I also did OpenMP stream test and then I found mismatch between OpenMPI and 
> MPI.  That reminded me a subtle issue on summit: pair of cores share L2 
> cache.  One has to place MPI ranks to different pairs to get best bandwidth. 
> See different bindings
> https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b21l0= and 
> https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b22l0=. Note each 
> node has 21 cores. I assume that means 11 pairs. The new results are below. 
> They match with we what I got from OpenMPI. The bandwidth is almost doubled 
> from 1 to 2 cores per socket. IBM document also says each socket has two 
> memory controllers. I could not find the core-memory controller affinity 
> info. I tried different bindings but did not find huge difference.
>   
> #Ranks  Rate (MB/s)Ratio over 2 ranks
> 1 29229.8   -
> 2 59091.0  1.0
> 4112260.7  1.9
> 6159852.8  2.7
> 8194351.7  3.3
> 10   215841.0  3.7
> 12   232316.6  3.9
> 14   244615.7  4.1
> 16   254450.8  4.3
> 18   262185.7  4.4
> 20   267181.0  4.5
> 22   270290.4  4.6
> 24   221944.9  3.8
> 26   238302.8  4.0
> 
> 
> --Junchao Zhang
> 
> 
> On Sun, Sep 22, 2019 at 6:04 PM Smith, Barry F.  wrote:
> 
>   Junchao,
> 
>  For completeness could you please run with a single core? But leave the 
> ratio as you have with over 2 ranks since that is the correct model.
> 
>Thanks
> 
>  Barry
> 
> 
> > On Sep 22, 2019, at 11:14 AM, Zhang, Junchao  wrote:
> > 
> > I did stream test on Summit. I used the MPI version from petsc, but largely 
> > increased the array size N since one socket of Summit has 120MB L3 cache. I 
> > used MPI version since it was easy for me to distribute ranks evenly to the 
> > two sockets. 
> > The result matches with data released by OLCF (see attached figure) and 
> > data given by Jed. We can see the bandwidth saturates around 24 ranks.
> > 
> > #Ranks Rate (MB/s) Ratio over 2 ranks
> > --
> > 2  59012.28341.00
> > 4  70959.14751.20
> > 6 106639.98371.81
> > 8 138638.69292.35
> > 10171125.08732.90
> > 12196162.51973.32
> > 14215272.78103.65
> > 16229562.40403.89
> > 18242587.49134.11
> > 20251057.17314.25
> > 22258569.77944.38
> > 24265443.29244.50
> > 26266562.78724.52
> > 28267043.63674.53
> > 30266833.72124.52
> > 32267183.84744.53
> > 
> > On Sat, Sep 21, 2019 at 11:24 PM Smith, Barry F.  wrote:
> > 
> >   Junchao could try the PETSc (and non-PETSc) streams tests on the machine. 
> > 
> >   There are a few differences, compiler, the reported results are with 
> > OpenMP, different number of cores but yes the performance is a bit low. For 
> > DOE that is great, makes GPUs look better :-)
> > 
> > 
> > > On Sep 21, 2019, at 11:11 PM, Jed Brown  wrote:
> > > 
> > > For an AIJ matrix with 32-bit integers, this is 1 flops/6 bytes, or 165
> > > GB/s for the node for the best case (42 ranks).
> > > 
> > > My understanding is that these systems have 8 channels of DDR4-2666 per
> > > socket, which is ~340 GB/s of theoretical bandwidth on a 2-socket
> > > system, and 270 GB/s STREAM Triad according to this post
> > > 
> > >  
> > > https://openpowerblog.wordpress.com/2018/07/19/epyc-skylake-vs-power9-stream-memory-bandwidth-comparison-via-zaius-barreleye-g2/
> > > 
> > > Is this 60% of Triad the best we can get for SpMV?
> > > 
> > > "Zhang, Junchao via petsc-dev"  writes:
> > > 
> > >> 42 cores have better performance.
> > >> 
> > >> 36 MPI ranks
> > >> MatMult  100 1.0 2.2435e+00 1.0 1.75e+09 1.3 2.9e+04 4.5e+04 
> > >> 0.0e+00  6 99 97 28  0 100100100100  0 25145   0  0 0.00e+00
> > >> 0 0.00e+00  0
> > >> VecScatterBegin  100 1.0 2.1869e-02 3.3 0.00e+00 0.0 2.9e+04 4.5e+04 
> > >> 0.0e+00  0  0 97 28  0   1  0100100  0 0   0  0 0.00e+00
> > >> 0 0.00e+00  0
> > >> VecScatterEnd100 1.0 7.9205e-0152.6 0.00e+00 0.0 0.0e+00 0.0e+00 
> > >> 0.0e+00  1  0  0  0  0  22  0  0  0  0 0   0  0 0.00e+00
> > >> 0 0.00e+00  0
> > >> 
> > >> --Junchao Zhang
> > >> 
> > >> 
> > >> On Sat, Sep 21, 2019 at 9:41 PM Smith, Barry F. 
> > >> mailto:bsm...@mcs.anl.gov>> wrote:
> > >> 
> > >>  Junchao,
> > >> 
> > >>Mark has a good point; could you also try for completeness the CPU 
> > >> with 36 cores and see if it is 

Re: [petsc-dev] Broken MatMatMult_MPIAIJ_MPIDense

2019-09-23 Thread Zhang, Hong via petsc-dev
Barry:
As a hack for this release, could you have the Numeric portion of the 
multiply routines check whether the symbolic data is there and, if not, call the 
symbolic part and attach the needed data? You might need a utility function 
that does all of the symbolic part except the allocation of the matrix, and then 
call it from the numeric part as well as from the real symbolic part.

I'm working on this now. I was not aware of MatSeqDenseSetLDA(), which changes 
the pattern of data access in a seqdense matrix.
Pierre's patch:
"change Bm here 
https://www.mcs.anl.gov/petsc/petsc-dev/src/mat/impls/aij/mpi/mpimatmatmult.c.html#line549
 to the LDA of B"
fixes this bug. I'll test it further and submit a pull request.
Then I'll check SLEPc's bug report.
Hong
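
(For context on the LDA issue: a user can hand PETSc a dense matrix whose columns 
are strided by a leading dimension larger than the local row count, in which case 
code that walks columns assuming a stride of m, as the old Bm-based code at the 
line referenced above did, reads the wrong entries. A minimal sketch; the sizes 
and array here are hypothetical, while MatCreateSeqDense() and MatSeqDenseSetLDA() 
are the calls in question:

  Mat            B;
  PetscScalar    *array;
  PetscInt       m = 100, n = 10, lda = 128;   /* hypothetical sizes, lda >= m */
  PetscErrorCode ierr;

  ierr = PetscMalloc1(lda*n,&array);CHKERRQ(ierr);       /* user-owned storage */
  ierr = MatCreateSeqDense(PETSC_COMM_SELF,m,n,array,&B);CHKERRQ(ierr);
  ierr = MatSeqDenseSetLDA(B,lda);CHKERRQ(ierr);  /* column j now starts at array[j*lda], not array[j*m] */
)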


Re: [petsc-dev] It would be really nice if you could run a single job on the pipeline with a branch

2019-09-23 Thread Smith, Barry F. via petsc-dev



> On Sep 23, 2019, at 10:43 AM, Jed Brown  wrote:
> 
> "Smith, Barry F. via petsc-dev"  writes:
> 
>>> On Sep 22, 2019, at 11:26 PM, Balay, Satish  wrote:
>>> 
>>> Even though a fix addresses a breakage in a single build, that change
>>> could break other things, so it's generally best to run a full test.
>> 
>>  Sure, before a merge we want everything tested, but when one is iterating on 
>> a single bug it would be a much better use of resources
> 
> FWIW, any containerized jobs can easily be run locally.
> 
> I think there would be complexity for GitLab to deploy this due to
> artifacts and other state that may create dependencies between jobs, but
> for our independent jobs, it should be possible.

  Yes, I have done this and it was straightforward to debug problems that arose 
in the cloud tests. 

 I have added the information to the wiki at 
https://gitlab.com/petsc/petsc/wikis/Home; feel free to add more details or 
corrections.




Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Zhang, Junchao via petsc-dev
I also ran the OpenMP stream test and found a mismatch between the OpenMP and 
MPI results. That reminded me of a subtle issue on Summit: pairs of cores share an 
L2 cache, so one has to place MPI ranks on different pairs to get the best 
bandwidth. See the two different bindings at
https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b21l0= and 
https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b22l0=. Note that each 
socket has 21 cores, which I assume means 11 pairs. The new results are below; 
they match what I got from OpenMP. The bandwidth almost doubles going from 1 to 2 
cores per socket. The IBM documentation also says each socket has two memory 
controllers, but I could not find the core-to-memory-controller affinity 
information; I tried different bindings and did not see a big difference.

#Ranks   Rate (MB/s)   Ratio over 2 ranks
 1        29229.8      -
 2        59091.0      1.0
 4       112260.7      1.9
 6       159852.8      2.7
 8       194351.7      3.3
10       215841.0      3.7
12       232316.6      3.9
14       244615.7      4.1
16       254450.8      4.3
18       262185.7      4.4
20       267181.0      4.5
22       270290.4      4.6
24       221944.9      3.8
26       238302.8      4.0


--Junchao Zhang


On Sun, Sep 22, 2019 at 6:04 PM Smith, Barry F. 
<bsm...@mcs.anl.gov> wrote:

  Junchao,

 For completeness, could you please run with a single core? But keep the 
ratio relative to 2 ranks as you have it, since that is the correct model.

   Thanks

 Barry


> On Sep 22, 2019, at 11:14 AM, Zhang, Junchao 
> <jczh...@mcs.anl.gov> wrote:
>
> I did a stream test on Summit using the MPI version from PETSc, but greatly 
> increased the array size N since one socket of Summit has 120 MB of L3 cache. 
> I used the MPI version since it was easy to distribute ranks evenly across the 
> two sockets.
> The results match the data released by OLCF (see attached figure) and the data 
> given by Jed. We can see that the bandwidth saturates around 24 ranks.
>
> #Ranks   Rate (MB/s)    Ratio over 2 ranks
> --------------------------------------------
>  2        59012.2834    1.00
>  4        70959.1475    1.20
>  6       106639.9837    1.81
>  8       138638.6929    2.35
> 10       171125.0873    2.90
> 12       196162.5197    3.32
> 14       215272.7810    3.65
> 16       229562.4040    3.89
> 18       242587.4913    4.11
> 20       251057.1731    4.25
> 22       258569.7794    4.38
> 24       265443.2924    4.50
> 26       266562.7872    4.52
> 28       267043.6367    4.53
> 30       266833.7212    4.52
> 32       267183.8474    4.53
>
> On Sat, Sep 21, 2019 at 11:24 PM Smith, Barry F. 
> <bsm...@mcs.anl.gov> wrote:
>
>   Junchao could try the PETSc (and non-PETSc) streams tests on the machine.
>
>   There are a few differences (compiler, the reported results are with 
> OpenMP, a different number of cores), but yes, the performance is a bit low. 
> For DOE that is great; it makes the GPUs look better :-)
>
>
> > On Sep 21, 2019, at 11:11 PM, Jed Brown 
> > <j...@jedbrown.org> wrote:
> >
> > For an AIJ matrix with 32-bit integers, this is 1 flops/6 bytes, or 165
> > GB/s for the node for the best case (42 ranks).
> >
> > My understanding is that these systems have 8 channels of DDR4-2666 per
> > socket, which is ~340 GB/s of theoretical bandwidth on a 2-socket
> > system, and 270 GB/s STREAM Triad according to this post
> >
> >  
> > https://openpowerblog.wordpress.com/2018/07/19/epyc-skylake-vs-power9-stream-memory-bandwidth-comparison-via-zaius-barreleye-g2/
> >
> > Is this 60% of Triad the best we can get for SpMV?
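
(For reference, the 1 flop / 6 bytes figure comes from counting 2 flops per stored 
nonzero of a double-precision AIJ matrix against the 12 bytes each nonzero streams, 
an 8-byte value plus a 4-byte column index, ignoring vector and row-pointer traffic: 
2 flops / 12 bytes = 1 flop / 6 bytes, so the implied bandwidth is simply the 
observed MatMult flop rate times 6 bytes per flop.)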
> >
> > "Zhang, Junchao via petsc-dev" 
> > mailto:petsc-dev@mcs.anl.gov>> writes:
> >
> >> 42 cores have better performance.
> >>
> >> 36 MPI ranks
> >> MatMult  100 1.0 2.2435e+00 1.0 1.75e+09 1.3 2.9e+04 4.5e+04 
> >> 0.0e+00  6 99 97 28  0 100100100100  0 25145   0  0 0.00e+000 
> >> 0.00e+00  0
> >> VecScatterBegin  100 1.0 2.1869e-02 3.3 0.00e+00 0.0 2.9e+04 4.5e+04 
> >> 0.0e+00  0  0 97 28  0   1  0100100  0 0   0  0 0.00e+000 
> >> 0.00e+00  0
> >> VecScatterEnd100 1.0 7.9205e-0152.6 0.00e+00 0.0 0.0e+00 0.0e+00 
> >> 0.0e+00  1  0  0  0  0  22  0  0  0  0 0   0  0 0.00e+000 
> >> 0.00e+00  0
> >>
> >> --Junchao Zhang
> >>
> >>
> >> On Sat, Sep 21, 2019 at 9:41 PM Smith, Barry F. 
> >> <bsm...@mcs.anl.gov> wrote:
> >>
> >>  Junchao,
> >>
> >>Mark has a good point; could you also try for completeness the CPU with 
> >> 36 cores and see if it is any better than the 42 core case?
> >>
> >>  Barry
> >>
> >>  So, extrapolating, about 20 nodes of CPUs are equivalent to 1 node of 
> >> GPUs for the multiply at this problem size.
> >>
> >>> On Sep 21, 2019, at 6:40 PM, Mark Adams 
> >>> <mfad...@lbl.gov> wrote:
> >>>
> >>> I came 

Re: [petsc-dev] It would be really nice if you could run a single job on the pipeline with a branch

2019-09-23 Thread Jed Brown via petsc-dev
"Smith, Barry F. via petsc-dev"  writes:

>> On Sep 22, 2019, at 11:26 PM, Balay, Satish  wrote:
>> 
>> Even though a fix addresses a breakage in a single build, that change
>> could break other things, so it's generally best to run a full test.
>
>   Sure, before a merge we want everything tested, but when one is iterating on 
> a single bug it would be a much better use of resources

FWIW, any containerized jobs can easily be run locally.

I think there would be complexity for GitLab to deploy this due to
artifacts and other state that may create dependencies between jobs, but
for our independent jobs, it should be possible.
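
(For reference, a single containerized job from the .gitlab-ci.yml can usually be 
run locally with the GitLab runner, e.g. gitlab-runner exec docker followed by the 
job name, invoked from the repository root; this runs just that one job in a local 
Docker container and, consistent with the caveat above, does not handle artifacts 
shared between jobs.)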


Re: [petsc-dev] Broken MatMatMult_MPIAIJ_MPIDense

2019-09-23 Thread Zhang, Hong via petsc-dev
Barry:

   We would like to avoid allocating a huge array for the matrix and then having 
the user place their own array on top of it.

   In the new paradigm there could be options, called on the resulting C of 
MatMatGetProduct(), that take effect before C is fully formed, so that a huge 
array is never allocated and freed while the user's array also exists; but for 
this release we have to work with the current API.
Allocation of C is done in the symbolic product, not in GetProduct(). PETSc gets 
the user's array before the symbolic product, so it will not allocate the C array.
Hong




Re: [petsc-dev] Broken MatMatMult_MPIAIJ_MPIDense

2019-09-23 Thread Smith, Barry F. via petsc-dev


   Hong,

As a hack for this release, could you have the Numeric portion of the 
multiply routines check whether the symbolic data is there and, if not, call the 
symbolic part and attach the needed data? You might need a utility function 
that does all of the symbolic part except the allocation of the matrix, and then 
call it from the numeric part as well as from the real symbolic part.

  Barry
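
(A minimal sketch of the kind of guard being suggested; the routine and struct 
names below, and the spot where the symbolic data is stashed, are hypothetical, 
not the actual PETSc symbols:

  typedef struct {
    PetscInt dummy;  /* placeholder for whatever the numeric phase needs */
  } MatProductData_Hack;

  static PetscErrorCode MatMatMultSymbolicSetUp_Hack(Mat,Mat,Mat,MatProductData_Hack**);

  static PetscErrorCode MatMatMultNumeric_Hack(Mat A,Mat B,Mat C)
  {
    MatProductData_Hack *sd = (MatProductData_Hack*)C->spptr; /* wherever the data lives */
    PetscErrorCode      ierr;

    PetscFunctionBegin;
    if (!sd) {
      /* The symbolic phase never ran (e.g. the user created or placed C themselves):
         do everything the symbolic phase does except allocate C, then attach the data. */
      ierr = MatMatMultSymbolicSetUp_Hack(A,B,C,&sd);CHKERRQ(ierr);
      C->spptr = (void*)sd;
    }
    /* ... the usual numeric product, driven by sd ... */
    PetscFunctionReturn(0);
  }
)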



> On Sep 23, 2019, at 9:59 AM, Zhang, Hong  wrote:
> 
> Yes, we should allow users to provide their own matrix array. 
> 
> We use MatDensePlaceArray() to plug an array into matrix C before 
> MatMatMult(). If we cannot do this, we will have to copy from the internal 
> array of the result C to our array.
> 
> Would the following sequence work?
> MatMatMultSymbolic()
> MatDensePlaceArray()
> MatMatMultNumeric()
> This seems a reasonable API, but it is not obvious to users when and where 
> MatDensePlaceArray() should be called.
> Currently, most users call  MatMatMult(A,B, reuse,) instead of 
> MatMatMultSymbolic/Numeric. 
> 
> We plan to add
> MatMatGetProduct(A,B,);
> Then,
> MatDensePlaceArray(C,array);
> 
> Hong



Re: [petsc-dev] Broken MatMatMult_MPIAIJ_MPIDense

2019-09-23 Thread Smith, Barry F. via petsc-dev


   We would like to avoid allocating a huge array for the matrix and then having 
the user place their own array on top of it. 

   In the new paradigm there could be options, called on the resulting C of 
MatMatGetProduct(), that take effect before C is fully formed, so that a huge 
array is never allocated and freed while the user's array also exists; but for 
this release we have to work with the current API.



> On Sep 23, 2019, at 9:59 AM, Zhang, Hong  wrote:
> 
> Yes, we should allow users to provide their own matrix array. 
> 
> We use MatDensePlaceArray() to plug an array into matrix C before 
> MatMatMult(). If we cannot do this, we will have to copy from the internal 
> array of the result C to our array.
> 
> Would the following sequence work?
> MatMatMultSymbolic()
> MatDensePlaceArray()
> MatMatMultNumeric()
> This seems a reasonable API, but it is not obvious to users when and where 
> MatDensePlaceArray() should be called.
> Currently, most users call  MatMatMult(A,B, reuse,) instead of 
> MatMatMultSymbolic/Numeric. 
> 
> We plan to add
> MatMatGetProduct(A,B,);
> Then,
> MatDensePlaceArray(C,array);
> 
> Hong



Re: [petsc-dev] Broken MatMatMult_MPIAIJ_MPIDense

2019-09-23 Thread Zhang, Hong via petsc-dev
Yes, we should allow users to provide their own matrix array.

We use MatDensePlaceArray() to plug an array into matrix C before MatMatMult(). 
If we cannot do this, we will have to copy from the internal array of the 
result C to our array.

Would the following sequence work?
MatMatMultSymbolic()
MatDensePlaceArray()
MatMatMultNumeric()
This seems a reasonable API, but it is not obvious to users when and where 
MatDensePlaceArray() should be called.
Currently, most users call  MatMatMult(A,B, reuse,) instead of 
MatMatMultSymbolic/Numeric.

We plan to add
MatMatGetProduct(A,B,);
Then,
MatDensePlaceArray(C,array);

Hong
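
(For concreteness, a sketch of the sequence being discussed. MatMatMultSymbolic(), 
MatMatMultNumeric(), MatDensePlaceArray() and MatDenseResetArray() are existing 
PETSc calls; the setup and sizes are illustrative only, and whether this exact 
sequence works through the MPIAIJ/MPIDense path is the open question of this thread:

  Mat            A,B,C;      /* A sparse, B dense, as in the SLEPc use case */
  PetscScalar    *myarray;   /* user-owned storage that should back the result C */
  PetscErrorCode ierr;

  /* ... create A and B, allocate myarray large enough for the local part of C ... */
  ierr = MatMatMultSymbolic(A,B,PETSC_DEFAULT,&C);CHKERRQ(ierr); /* sets C up (and allocates its array) */
  ierr = MatDensePlaceArray(C,myarray);CHKERRQ(ierr);            /* C now writes into myarray */
  ierr = MatMatMultNumeric(A,B,C);CHKERRQ(ierr);                 /* compute the values */
  /* ... use the result through C or directly through myarray ... */
  ierr = MatDenseResetArray(C);CHKERRQ(ierr);                    /* detach myarray again */
  ierr = MatDestroy(&C);CHKERRQ(ierr);
)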


Re: [petsc-dev] Broken MatMatMult_MPIAIJ_MPIDense

2019-09-23 Thread Jose E. Roman via petsc-dev



> On 22 Sep 2019, at 19:11, Smith, Barry F.  wrote:
> 
>   Jose,
> 
> Thanks for the pointer. 
> 
> Will this change dramatically affect the organization of SLEPc? As noted 
> in my previous email, eventually we need to switch to a new API, where 
> REUSE with a different matrix is even more problematic.
> 
>  If you folks have use cases that fundamentally require reusing a 
> previous matrix instead of destroying it and creating a new one, we will 
> need to think about additional features in the API that would allow this 
> reuse of an array. But it seems to me that destroying the old matrix and 
> using the initial call to create the matrix should be OK and would require only 
> relatively minor changes to your codes?
> 
>  Barry

We use MatDensePlaceArray() to plug an array into matrix C before MatMatMult(). 
If we cannot do this, we will have to copy from the internal array of the 
result C to our array.

Would the following sequence work?
MatMatMultSymbolic()
MatDensePlaceArray()
MatMatMultNumeric()

Jose