Re: [petsc-dev] -with-kokkos-cuda-arch=AMPERE80 nonsense

2021-04-05 Thread Mills, Richard Tran via petsc-dev
Hmm, OK, I found a table at

  https://sparta.sandia.gov/doc/accelerate_kokkos.html

and it tells me that "PASCAL60" refers to "NVIDIA Pascal generation CC 6.0 GPU" 
and "PASCAL61" refers to "NVIDIA Pascal generation CC 6.1 GPU". But I have no 
idea what those 6.0 vs 6.1 version numbers mean, and I can't seem to easily 
find any information from NVIDIA that connects anything in the output of 
"nvidia-smi -a" to these versions.

I think maybe what I want is an NVIDIA equivalent to Intel's ark.intel.com, 
which decodes the mysterious Intel version numbers to tell me what 
architectural features are present. But does anything like this exist for 
NVIDIA?

--Richard



On 4/5/21 1:10 PM, Mills, Richard Tran wrote:
You raise a good point, Barry. I've been completely mystified by what some of 
these names even mean. What does "PASCAL60" vs. "PASCAL61" even mean? Do you 
know of where this is even documented? I can't really find anything about it in 
the Kokkos documentation. The only thing I can really find is an issue or two 
about "hey, shouldn't our CMake stuff figure this out automatically" and then 
some posts about why it can't really do that. Not encouraging.

--Richard

On 4/3/21 8:42 PM, Barry Smith wrote:

  It would be very nice to NOT require PETSc users to provide this flag; how 
the heck will they know what it should be when we cannot automate it ourselves?

  Any ideas of how this can be determined based on the current system? NVIDIA 
does not help since these "advertising" names don't seem to trivially map to 
information you can get from a particular GPU when you are logged into it. For 
example nvidia-smi doesn't use these names directly. Is there some mapping from 
nvidia-smi to these names we could use? If we are serious about having a 
non-trivial number of users utilizing GPUs, which we need to be for the future, 
we cannot have these absurd demands in our installation process.

  Barry

Does spack have some magic for this we could use?






Re: [petsc-dev] -with-kokkos-cuda-arch=AMPERE80 nonsense

2021-04-05 Thread Mills, Richard Tran via petsc-dev
You raise a good point, Barry. I've been completely mystified by what some of 
these names even mean. What does "PASCAL60" vs. "PASCAL61" even mean? Do you 
know of where this is even documented? I can't really find anything about it in 
the Kokkos documentation. The only thing I can really find is an issue or two 
about "hey, shouldn't our CMake stuff figure this out automatically" and then 
some posts about why it can't really do that. Not encouraging.

--Richard

On 4/3/21 8:42 PM, Barry Smith wrote:


  It would be very nice to NOT require PETSc users to provide this flag; how 
the heck will they know what it should be when we cannot automate it ourselves?

  Any ideas of how this can be determined based on the current system? NVIDIA 
does not help since these "advertising" names don't seem to trivially map to 
information you can get from a particular GPU when you are logged into it. For 
example nvidia-smi doesn't use these names directly. Is there some mapping from 
nvidia-smi to these names we could use? If we are serious about having a 
non-trivial number of users utilizing GPUs, which we need to be for the future, 
we cannot have these absurd demands in our installation process.

  Barry

Does spack have some magic for this we could use?





Re: [petsc-dev] Job openings

2020-11-25 Thread Mills, Richard Tran via petsc-dev
petsc-announce is used so infrequently that I've often found it a little odd 
that petsc-users isn't just used for the announcements. Perhaps we could 
repurpose petsc-announce somewhat by making it something that users *can* post 
to, but posts are moderated. Then we can make sure that job posting 
announcements fit our guidelines.

--Richard

On 11/20/20 6:00 PM, Junchao Zhang wrote:
I think we can just send to both petsc-announce and petsc-users. First there 
are not many such emails.  Second, if there are, users should be happy to see 
that.
I receive 10+ ad emails daily and I don't mind receiving an extra 5 emails monthly 
:)

--Junchao Zhang

On Fri, Nov 20, 2020 at 7:27 PM Barry Smith 
mailto:bsm...@petsc.dev>> wrote:

  PETSc announce has more people than petsc-users but it is not clear that 
everyone on petsc-users is on petsc-announce. Everyone should join 
petsc-announce but they may not.

  We could send them to both with the same label but then many people will get 
two emails which is annoying.

  Maybe use the labels   [PETSc Job opening] and [PETSc Release] to give people 
an easier filter.


   An approach which is probably not simple is that anything sent to 
petsc-announce is also sent to everyone on petsc-users who IS NOT on 
petsc-announce so everyone gets only exactly one copy regardless of whether 
they are on both or either.

   1)  Maybe we could just manually remove everyone from announce who is in 
users and make sure that anything sent to announce also gets sent to users.
    2)  Or whenever anyone joins users we sign them up for announce 
automatically and then only send such messages to announce (the webpage could 
indicate you will automatically also be added to announce). This seems the 
least painful, but then someone now needs to add to announce everyone who is 
on users but not on announce.

   People could get fancy with filters to get only one copy but that is 
obnoxious to expect them to do that.

   Barry



On Nov 20, 2020, at 1:27 PM, Junchao Zhang 
mailto:junchao.zh...@gmail.com>> wrote:

The usefulness depends on how many users subscribe to petsc-announce.

Since there are not many such emails, I think it is fine to send to 
petsc-users. And in these emails, we can always add a link to a job section on 
the petsc website.  Once petsc users get used to this, they may go to the 
website later when they are finding jobs.

--Junchao Zhang


On Fri, Nov 20, 2020 at 1:04 PM Matthew Knepley 
mailto:knep...@gmail.com>> wrote:
That is a good idea. Anyone against this?

  Thanks,

Matt

On Fri, Nov 20, 2020 at 1:26 PM Barry Smith 
mailto:bsm...@petsc.dev>> wrote:

  Maybe something as simple for petsc-announce

 Subject:[Release] 
 Subject:[Job opening] 

   Then when you send out the most recent job opening you can include in the 
message something like

"The PETSc announce mailing list will continue to be low volume. We will 
now tag each message in the subject line with [Release], [Job opening],  or 
possibly other tags so you can have your mail program filter out messages you 
are not interested in.

Thanks for your continued support,"



On Nov 20, 2020, at 9:45 AM, Matthew Knepley 
mailto:knep...@gmail.com>> wrote:

I got the second email in less than one month about sending a job opening to 
the PETSc list.

1) Should we have some policy about this?

I think we should encourage it, but in a way that does not produce noise for 
people. I think there are no other good outlets for computational jobs.

2) Should we have a section of the website for this?

I would like something that just selected some petsc-users mail from the 
archive with a query in the URL.

3) If we encourage it, should we have a special header for job posts in the 
mailing list?

This would facilitate 2).

  Thanks,

 Matt

--
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/



--
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/




[petsc-dev] Postdoctoral position at Argonne: Numerical Solvers for Next Generation High Performance Computing Architectures

2020-08-24 Thread Mills, Richard Tran via petsc-dev
Dear PETSc Users and Developers,

The PETSc/TAO team at Argonne National Laboratory has an opening for a 
postdoctoral researcher to work on development of robust and efficient 
algebraic solvers and related technologies targeting exascale-class 
supercomputers -- such as the Aurora machine slated to be the first exascale 
computer in the United States and fielded at Argonne -- and other novel 
high-performance computing (HPC) architectures. For those interested, please 
see the job posting at https://bit.ly/3kPtY8L.

Best regards,
Richard


[petsc-dev] Argonne National Laboratory hiring for staff position in Numerical PDEs and Scientific Computing

2020-07-29 Thread Mills, Richard Tran via petsc-dev
Dear PETSc Users and Developers,

The Laboratory for Applied Mathematics, Numerical Software, and Statistics 
(LANS, https://www.anl.gov/mcs/lans) in the Mathematics and Computer Science 
Division at Argonne National Laboratory -- which has served as the "home" for 
PETSc development for over two decades -- is seeking a research staff member in 
the area of partial differential equations (PDEs), numerical software, and 
scientific computing. Because this position could provide an excellent 
opportunity for someone interested in pursuing a research program furthering 
the development and applications of PETSc, I want to make members of the PETSc 
user and developer community aware of it.

Applicants at junior or senior career levels will be considered. For further 
details and to apply, please visit the job listings posted at

Senior applicants: https://bit.ly/2Otr7De
Earlier-career applicants: https://bit.ly/2B1Rxch

Best regards,
Richard


Re: [petsc-dev] Meaning of PETSc matrices with zero rows but nonzero columns?

2020-05-30 Thread Mills, Richard Tran via petsc-dev
Thanks for the replies, everyone. I suppose it is not actually that hard for me 
to handle these dimensions properly -- I just hadn't personally encountered or 
thought much about when such operations with empty matrices might arise, and 
was initially puzzled about what multiplication by an "empty" matrix even 
means. I think I see now why I need to put in the work to handle these cases 
properly. (Sure wish that MKL could just do it, though!)

--Richard

On 5/30/20 4:09 PM, Stefano Zampini wrote:





On May 31, 2020, at 1:03 AM, Jed Brown 
 wrote:

Stefano Zampini  
writes:



If A is 0x8 and B is 8x5 then C is correct to be of size 0x5. The rows and 
columns of the resulting matrix have to follow the rules.



Right, I think if you said C is 0x0 (which seems like Richard's proposal), 
you'd need to relax shape compatibility logic in many places, including in ways 
that might produce confusing errors.




Richard

In the triple matrix product case, your code will break, because the operation 
will no longer be associative

A 3x0, B 0x8, C 8x7 -> (ABC) is a valid 3x7 matrix (empty)

If I understand you right, (AB)  would be  a 0x0 matrix, and it can no longer 
be multiplied against C



Richard, what is the hardship in preserving the shape relations?







[petsc-dev] Meaning of PETSc matrices with zero rows but nonzero columns?

2020-05-30 Thread Mills, Richard Tran via petsc-dev
All,

I'm working on adding support for matrix products to AIJMKL, and I'm uncertain 
about some issues surrounding empty matrices. PETSc will happily let me 
multiply an empty matrix with another (and this arises in the sequential 
matrix-matrix multiplication routines when running with multiple MPI ranks and 
using MPIAIJ), but MKL does not like empty matrices (or matrices with no 
nonzeros), so I've got code in various places in the existing AIJMKL routines 
to handle these cases without calling MKL.

I'm not quite sure what needs to be done in, say, 
MatMatMultSymbolic_SeqAIJMKL_SeqAIJMKL(). In the SeqAIJ version, if a matrix A 
is passed in that has zero rows (that is, A->rmap->N = 0), and matrix B has N 
columns (B->cmap->N = N), then a matrix C with zero rows and N columns is 
created. My question boils down to "Does it mean anything in PETSc to have a 
matrix with 0 rows but a nonzero number of columns"? It is less complicated if 
I handle the empty matrix cases by creating a matrix with 0 rows and 0 columns, 
but I am not sure if this breaks something. (I'm also not sure what a "matrix" 
with 0 rows even means.) Also not sure if there is any other info that I need 
to preserve in the result matrix C, or if it is OK to handle the case of any 
empty A or B by always producing the same empty matrix C.
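
(For anyone who wants to poke at this, here is a rough, self-contained sketch of 
the shape question, with made-up sizes; it just checks what MatMatMult() reports 
for an empty A:

  #include <petscmat.h>

  int main(int argc, char **argv)
  {
    Mat            A, B, C;
    PetscInt       m, n;
    PetscErrorCode ierr;

    ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
    /* A is 0x8 (zero rows), B is 8x5; no entries are inserted, only the shapes matter here */
    ierr = MatCreateSeqAIJ(PETSC_COMM_SELF, 0, 8, 0, NULL, &A);CHKERRQ(ierr);
    ierr = MatCreateSeqAIJ(PETSC_COMM_SELF, 8, 5, 1, NULL, &B);CHKERRQ(ierr);
    ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    ierr = MatAssemblyBegin(B, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    ierr = MatAssemblyEnd(B, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    ierr = MatMatMult(A, B, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &C);CHKERRQ(ierr);
    ierr = MatGetSize(C, &m, &n);CHKERRQ(ierr);
    /* If the shape rules are preserved, C comes back as 0x5 rather than 0x0 */
    ierr = PetscPrintf(PETSC_COMM_SELF, "C is %D x %D\n", m, n);CHKERRQ(ierr);
    ierr = MatDestroy(&A);CHKERRQ(ierr);
    ierr = MatDestroy(&B);CHKERRQ(ierr);
    ierr = MatDestroy(&C);CHKERRQ(ierr);
    ierr = PetscFinalize();
    return ierr;
  }
)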

--Richard


Re: [petsc-dev] Should we add something about GPU support to the user manual?

2019-10-29 Thread Mills, Richard Tran via petsc-dev
Hi Gautam,

Apologies for overlooking this. The slides are now available online:

  https://www.mcs.anl.gov/petsc/meetings/2019/slides/mills-petsc-2019.pdf

There isn't a whole lot in terms of GPU results in there, but GPU support is 
currently a very active area of development for the PETSc team and we should 
have a lot more to report soon. (We'll have several presentations at SIAM-PP in 
Seattle on this topic, for instance.)

Best regards,
Richard

On 9/20/19 12:37 PM, Bisht, Gautam wrote:
Hi Richard,

Information about PETSc’s support for GPU would be super helpful. Btw, I 
noticed that in PETSc User 2019 
meeting<https://www.mcs.anl.gov/petsc/meetings/2019/index.html> you gave a talk 
on "Progress with PETSc on Manycore and GPU-based Systems on the Path to 
Exascale”, but the slides for the talk were not up on the website. Is it 
possible for you to share those slides or post them online?

Thanks,
-Gautam

On Sep 12, 2019, at 10:18 AM, Mills, Richard Tran via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:

Fellow PETSc developers,

I've had a few people recently ask me something along the lines of "Where do I 
look in the user manual for information about how to use GPUs with PETSc?", and 
then I have to give them the slightly embarrassing answer that there is nothing 
in there. Since we officially added GPU support a few releases ago, it might be 
appropriate to put something in the manual (even though our GPU support is 
still a moving target). I think I can draft something based on the existing 
tutorial material that Karl and I have been presenting. Do others think this 
would be worthwhile, or is our GPU support still too immature to belong in the 
manual? And are there any thoughts on where this belongs in the manual?

--Richard




Re: [petsc-dev] MR seemed to vanish

2019-10-10 Thread Mills, Richard Tran via petsc-dev
Mark, I see it at https://gitlab.com/petsc/petsc/merge_requests/2130? Is that 
not the right MR?

--Richard

On 10/10/19 3:51 PM, Mark Adams via petsc-dev wrote:
My MR mark/gamg-eigest-sa-cheby seems to have vanished. It does not seem to be 
in master. Anyone know where it is?
Thanks,
Mark



Re: [petsc-dev] Why no SpGEMM support in AIJCUSPARSE and AIJVIENNACL?

2019-10-03 Thread Mills, Richard Tran via petsc-dev


On 10/3/19 1:12 AM, Karl Rupp wrote:
Do you have any experience with nsparse?

https://github.com/EBD-CREST/nsparse

I've seen claims that it is much faster than cuSPARSE for sparse
matrix-matrix products.

I haven't tried nsparse, no.

But since the performance comes from a hardware feature (cache), I would be 
surprised if there is a big performance leap over ViennaCL. (There's certainly 
some potential for some tweaking of ViennaCL's kernels; but note that even 
ViennaCL is much faster than cuSPARSE's spGEMM on average).

With the libaxb-wrapper we can just add nsparse as an operations backend and 
then easily try it out and compare against the other packages. In the end it 
doesn't matter which package provides the best performance; we just want to 
leverage it :-)
I'd be happy to add support for this (though I suppose I should play with it 
first to verify that it is, in fact, worthwhile). Karl, is your branch with 
libaxb ready for people to start using it, or should we wait for you to do more 
with it? (Or, would you like any help with it?)

I'd like to try to add support for a few things like cuSPARSE SpGEMM before I 
go to the Summit hackathon, but I don't want to write a bunch of code that will 
be thrown away once your libaxb approach is in place.

--Richard

Best regards,
Karli




Karl Rupp via petsc-dev <mailto:petsc-dev@mcs.anl.gov> 
writes:

Hi Richard,

CPU spGEMM is about twice as fast even on the GPU-friendly case of a
single rank: http://viennacl.sourceforge.net/viennacl-benchmarks-spmm.html

I agree that it would be good to have a GPU-MatMatMult for the sake of
experiments. Under these performance constraints it's not top priority,
though.

Best regards,
Karli


On 10/3/19 12:00 AM, Mills, Richard Tran via petsc-dev wrote:
Fellow PETSc developers,

I am wondering why the AIJCUSPARSE and AIJVIENNACL matrix types do not
support the sparse matrix-matrix multiplication (SpGEMM, or MatMatMult()
in PETSc parlance) routines provided by cuSPARSE and ViennaCL,
respectively. Is there a good reason that I shouldn't add those? My
guess is that support was not added because SpGEMM is hard to do well on
a GPU compared to many CPUs (it is hard to compete with, say, Intel Xeon
CPUs with their huge caches) and it has been the case that one would
generally be better off doing these operations on the CPU. Since the
trend at the big supercomputing centers seems to be to put more and more
of the computational power into GPUs, I'm thinking that I should add the
option to use the GPU library routines for SpGEMM, though. Is there some
good reason to *not* do this that I am not aware of? (Maybe the CPUs are
better for this even on a machine like Summit, but I think we're at the
point that we should at least be able to experimentally verify this.)

--Richard



[petsc-dev] Why no SpGEMM support in AIJCUSPARSE and AIJVIENNACL?

2019-10-02 Thread Mills, Richard Tran via petsc-dev
Fellow PETSc developers,

I am wondering why the AIJCUSPARSE and AIJVIENNACL matrix types do not support 
the sparse matrix-matrix multiplication (SpGEMM, or MatMatMult() in PETSc 
parlance) routines provided by cuSPARSE and ViennaCL, respectively. Is there a 
good reason that I shouldn't add those? My guess is that support was not added 
because SpGEMM is hard to do well on a GPU compared to many CPUs (it is hard to 
compete with, say, Intel Xeon CPUs with their huge caches) and it has been the 
case that one would generally be better off doing these operations on the CPU. 
Since the trend at the big supercomputing centers seems to be to put more and 
more of the computational power into GPUs, I'm thinking that I should add the 
option to use the GPU library routines for SpGEMM, though. Is there some good 
reason to *not* do this that I am not aware of? (Maybe the CPUs are better for 
this even on a machine like Summit, but I think we're at the point that we 
should at least be able to experimentally verify this.)
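
(To be clear about the user-facing side: nothing about the MatMatMult() interface 
would change; only the types involved differ. For example, one could run an 
existing MatMatMult() test with the GPU matrix and vector types selected from the 
options database -- the executable name below is just a placeholder:

  mpiexec -n 1 ./your_matmatmult_test -mat_type aijcusparse -vec_type cuda -log_view

and, if the GPU SpGEMM path were hooked up, the product would be formed by 
cuSPARSE rather than falling back to the CPU kernels.)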

--Richard


Re: [petsc-dev] CUDA STREAMS

2019-10-02 Thread Mills, Richard Tran via petsc-dev
Mark,

It looks like you are missing some critical CUDA library (or libraries) in your 
link line. I know you will at least need the CUDA runtime "-lcudart". Look at 
something like PETSC_WITH_EXTERNAL_LIB for one of your CUDA-enabled PETSc 
builds in $PETSC_ARCH/lib/petsc/conf/petscvariables to see what else you might 
need.
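
(Concretely, it is probably just a matter of appending something like

  -Wl,-rpath,${CUDA_DIR}/lib64 -L${CUDA_DIR}/lib64 -lcudart

to the link line, where CUDA_DIR is a placeholder for wherever the loaded CUDA 
module lives on Summit; cudaGetDeviceCount() and cudaSetDevice() both live in the 
CUDA runtime library.)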

--Richard

On 10/2/19 7:20 AM, Mark Adams via petsc-dev wrote:

I found a CUDAVersion.cu of STREAMS and tried to build it. I got it to compile 
manually with:

nvcc -o CUDAVersion.o -ccbin pgc++ 
-I/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/include
 -Wno-deprecated-gpu-targets -c --compiler-options="-g 
-I/ccs/home/adams/petsc/include 
-I/ccs/home/adams/petsc/arch-summit-opt64-pgi-cuda/include   " 
`pwd`/CUDAVersion.cu
/gpfs/alpine/geo127/scratch/adams/CUDAVersion.cu(22): warning: conversion from 
a string literal to "char *" is deprecated
 

And this did produce a .o file. But I get this when I try to link.

make -f makestreams CUDAVersion
mpicc -g -fast  -o CUDAVersion CUDAVersion.o 
-Wl,-rpath,/ccs/home/adams/petsc/arch-summit-opt64-pgi-cuda/lib 
-L/ccs/home/adams/petsc/arch-summit-opt64-pgi-cuda/lib 
-Wl,-rpath,/ccs/home/adams/petsc/arch-summit-opt64-pgi-cuda/lib 
-L/ccs/home/adams/petsc/arch-summit-opt64-pgi-cuda/lib 
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib/pgi.ld
 
-Wl,-rpath,/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib
 
-L/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-19.4/spectrum-mpi-10.3.0.1-20190611-4ymaahbai7ehhw4rves5jjiwon2laz3a/lib
 
-Wl,-rpath,/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib
 
-L/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-19.4-6acz4xyqjlpoaonjiiqjme2aknrfnzoy/linuxpower/19.4/lib
 -Wl,-rpath,/usr/lib/gcc/ppc64le-redhat-linux/4.8.5 
-L/usr/lib/gcc/ppc64le-redhat-linux/4.8.5 -lpetsc -llapack -lblas -lparmetis 
-lmetis -lstdc++ -ldl -lpthread -lmpiprofilesupport -lmpi_ibm_usempi 
-lmpi_ibm_mpifh -lmpi_ibm -lpgf90rtl -lpgf90 -lpgf90_rpm1 -lpgf902 -lpgftnrtl 
-latomic -lpgkomp -lomp -lomptarget -lpgmath -lpgc -lrt -lmass_simdp9 -lmassvp9 
-lmassp9 -lm -lgcc_s -lstdc++ -ldl
CUDAVersion.o: In function `setupStream(long, PetscBool, PetscBool)':
/gpfs/alpine/geo127/scratch/adams/CUDAVersion.cu:394: undefined reference to 
`cudaGetDeviceCount'
/gpfs/alpine/geo127/scratch/adams/CUDAVersion.cu:406: undefined reference to 
`cudaSetDevice'
 

I have compared this link line with working examples and it looks the same. 
There is no .c file here -- main is in the .cu file. I assume that is the 
difference.

Any ideas?
Thanks,
Mark



Re: [petsc-dev] error with karlrupp/fix-cuda-streams

2019-09-25 Thread Mills, Richard Tran via petsc-dev
On 9/25/19 11:38 AM, Mark Adams via petsc-dev wrote:
[...]
> jsrun does take -n. It just has other args. I am trying to check if it
> requires other args. I thought it did but let me check.

https://www.olcf.ornl.gov/for-users/system-user-guides/summitdev-quickstart-guide/

-n  --nrs   Number of resource sets


-n is still supported. There are two versions of everything. One letter ones 
and more explanatory ones.
Yes, it's supported, but it's a little different than what "-n" usually does in 
mpiexec, where it means the number of processes. For 'jsrun', it means the 
number of resource sets, which is multiplied by the "tasks per resource set" 
specified by "-a" to get the MPI process count. I think if we can specify that 
"-a 1" is part of our "mpiexec", then we should be OK with using -n as PETSc 
normally does.
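
(In other words, something along the lines of the following untested sketch for 
configure, using the flags already discussed in this thread:

  ./configure --with-mpiexec='jsrun -a 1 -c 1 -g 1' ...

so that the "-n <N>" appended by configure and the test harness counts N resource 
sets of one task each, which behaves like "mpiexec -n <N>".)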

--Richard

In fact they have a nice little tool to viz layouts and they give you the 
command line with this short form, eg,

https://jsrunvisualizer.olcf.ornl.gov/?s1f0o01n6c4g1r14d1b21l0=


Beta2 Change (October 17):
-n was replaced by -nnodes

So it's not the same functionality as 'mpiexec -n'

I am still waiting for an interactive shell to test just -n. That really should 
run


Either way - please try the above branch

Satish

>
>
> >
> > And then configure needs to run some binaries for some checks - here
> > perhaps '-n 1' doesn't matter. [MPICH defaults to 1, OpenMPI defaults
> > to ncore]. So perhaps mpiexec is required for this purpose on summit?
> >
> > And then there is this code to escape spaces in path - for
> > windows. [but we have to make sure this is not in code-path for user
> > specified --with-mpiexec="jsrun -g 1"
> >
> > Satish
> >
> > On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:
> >
> > > No luck,
> > >
> > > On Wed, Sep 25, 2019 at 10:01 AM Balay, Satish 
> > > mailto:ba...@mcs.anl.gov>>
> > wrote:
> > >
> > > > Mark,
> > > >
> > > > Can you try the fix in branch balay/fix-mpiexec-shell-escape and see
> > if it
> > > > works?
> > > >
> > > > Satish
> > > >
> > > > On Wed, 25 Sep 2019, Balay, Satish via petsc-dev wrote:
> > > >
> > > > > Mark,
> > > > >
> > > > > Can you send configure.log from mark/fix-cuda-with-gamg-pintocpu
> > branch?
> > > > >
> > > > > Satish
> > > > >
> > > > > On Wed, 25 Sep 2019, Mark Adams via petsc-dev wrote:
> > > > >
> > > > > > I double checked that a clean build of your (master) branch has
> > this
> > > > error
> > > > > > by my branch (mark/fix-cuda-with-gamg-pintocpu), which may include
> > > > stuff
> > > > > > from Barry that is not yet in master, works.
> > > > > >
> > > > > > On Wed, Sep 25, 2019 at 5:26 AM Karl Rupp via petsc-dev <
> > > > > > petsc-dev@mcs.anl.gov> wrote:
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On 9/25/19 11:12 AM, Mark Adams via petsc-dev wrote:
> > > > > > > > I am using karlrupp/fix-cuda-streams, merged with master, and I
> > > > get this
> > > > > > > > error:
> > > > > > > >
> > > > > > > > Could not execute "['jsrun -g\\ 1 -c\\ 1 -a\\ 1
> > --oversubscribe -n
> > > > 1
> > > > > > > > printenv']":
> > > > > > > > Error, invalid argument:  1
> > > > > > > >
> > > > > > > > My branch mark/fix-cuda-with-gamg-pintocpu seems to work but I
> > did
> > > > edit
> > > > > > > > the jsrun command but Karl's branch still fails. (SUMMIT was
> > down
> > > > today
> > > > > > > > so there could have been updates).
> > > > > > > >
> > > > > > > > Any suggestions?
> > > > > > >
> > > > > > > Looks very much like a systems issue to me.
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Karli
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > >
> >
> >
>




Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Mills, Richard Tran via petsc-dev
Karl, that would be fantastic. Much obliged!

--Richard

On 9/23/19 8:09 PM, Karl Rupp wrote:
Hi,

`git grep cudaStreamCreate` reports that vectors, matrices and scatters create 
their own streams. This will almost inevitably create races (there is no 
synchronization mechanism implemented), unless one calls WaitForGPU() after 
each operation. Some of the non-deterministic tests can likely be explained by 
this.

I'll clean this up in the next few hours if there are no objections.

Best regards,
Karli



On 9/24/19 1:05 AM, Mills, Richard Tran via petsc-dev wrote:
I'm no CUDA expert (not yet, anyway), but, from what I've read, the default 
stream (stream 0) is (mostly) synchronous to host and device, so WaitForGPU() 
is not needed in that case. I don't know if there is any performance penalty in 
explicitly calling it in that case, anyway.

In any case, it looks like there are still some cases where potentially 
asynchronous CUDA library calls are being "timed" without a WaitForGPU() to 
ensure that the calls actually complete. I will make a pass through the 
aijcusparse and aijviennacl code looking for these.

--Richard

On 9/23/19 3:28 PM, Zhang, Junchao wrote:
It looks like cusparsestruct->stream is always created (not NULL).  I don't know 
the logic of the "if (!cusparsestruct->stream)".
--Junchao Zhang


On Mon, Sep 23, 2019 at 5:04 PM Mills, Richard Tran via petsc-dev 
mailto:petsc-dev@mcs.anl.gov> 
<mailto:petsc-dev@mcs.anl.gov><mailto:petsc-dev@mcs.anl.gov>> wrote:

In MatMultAdd_SeqAIJCUSPARSE, before Junchao's changes, towards
the end of the function it had

  if (!yy) { /* MatMult */
if (!cusparsestruct->stream) {
  ierr = WaitForGPU();CHKERRCUDA(ierr);
}
  }

I assume we don't need the logic to do this only in the MatMult()
with no add case and should just do this all the time, for the
purposes of timing if no other reason. Is there some reason to NOT
do this because of worries about the effects that these
WaitForGPU() invocations might have on performance?

I notice other problems in aijcusparse.cu,
now that I look closer. In MatMultTransposeAdd_SeqAIJCUSPARSE(), I
see that we have GPU timing calls around the cusparse_csr_spmv()
(but no WaitForGPU() inside the timed region). I believe this is
another area in which we get a meaningless timing. It looks like
we need a WaitForGPU() there, and then maybe inside the timed
region handling the scatter. (I don't know if this stuff happens
asynchronously or not.) But do we potentially want two
WaitForGPU() calls in one function, just to help with getting
timings? I don't have a good idea of how much overhead this adds.

--Richard

On 9/21/19 12:03 PM, Zhang, Junchao via petsc-dev wrote:
I made the following changes:
1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at the end
  ierr = WaitForGPU();CHKERRCUDA(ierr);
  ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
  ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);
  PetscFunctionReturn(0);
2) In MatMult_MPIAIJCUSPARSE, use the following code sequence.
The old code swapped the first two lines. Since with
-log_view, MatMultAdd_SeqAIJCUSPARSE is blocking, I changed the
order to have better overlap.
  ierr =

VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
  ierr =

VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
3) Log time directly in the test code so we can also know
execution time without -log_view (hence cuda synchronization). I
manually calculated the Total Mflop/s for these cases for easy
comparison.

<>



EventCount  Time (sec) Flop 
--- Global ---  --- Stage   Total   GPU-
CpuToGpu -   - GpuToCpu - GPU
   Max Ratio  Max Ratio   Max  Ratio  Mess  AvgLen  
Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s
Count   Size   Count   Size  %F

---
6 MPI ranks,
MatMult  100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03
2.2e+05 0.0e+00 24 99 97 18  0 100100100100  0  4743   0 0 
0.00e+000 0.00e+00  0
VecScatterBegin  100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03
2.2e+05 0.0e+00  0  0 97 18  0   0  0100100  0 0   0 0 
0.00e+000 0.00e+00  0

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Mills, Richard Tran via petsc-dev
I'm no CUDA expert (not yet, anyway), but, from what I've read, the default 
stream (stream 0) is (mostly) synchronous to host and device, so WaitForGPU() 
is not needed in that case. I don't know if there is any performance penalty in 
explicitly calling it in that case, anyway.

In any case, it looks like there are still some cases where potentially 
asynchronous CUDA library calls are being "timed" without a WaitForGPU() to 
ensure that the calls actually complete. I will make a pass through the 
aijcusparse and aijviennacl code looking for these.

--Richard

On 9/23/19 3:28 PM, Zhang, Junchao wrote:
It looks like cusparsestruct->stream is always created (not NULL).  I don't know 
the logic of the "if (!cusparsestruct->stream)".
--Junchao Zhang


On Mon, Sep 23, 2019 at 5:04 PM Mills, Richard Tran via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
In MatMultAdd_SeqAIJCUSPARSE, before Junchao's changes, towards the end of the 
function it had

  if (!yy) { /* MatMult */
if (!cusparsestruct->stream) {
  ierr = WaitForGPU();CHKERRCUDA(ierr);
}
  }

I assume we don't need the logic to do this only in the MatMult() with no add 
case and should just do this all the time, for the purposes of timing if no 
other reason. Is there some reason to NOT do this because of worries about the 
effects that these WaitForGPU() invocations might have on performance?

I notice other problems in aijcusparse.cu, now that I 
look closer. In MatMultTransposeAdd_SeqAIJCUSPARSE(), I see that we have GPU 
timing calls around the cusparse_csr_spmv() (but no WaitForGPU() inside the 
timed region). I believe this is another area in which we get a meaningless 
timing. It looks like we need a WaitForGPU() there, and then maybe inside the 
timed region handling the scatter. (I don't know if this stuff happens 
asynchronously or not.) But do we potentially want two WaitForGPU() calls in 
one function, just to help with getting timings? I don't have a good idea of 
how much overhead this adds.

--Richard

On 9/21/19 12:03 PM, Zhang, Junchao via petsc-dev wrote:
I made the following changes:
1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at the end
  ierr = WaitForGPU();CHKERRCUDA(ierr);
  ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
  ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);
  PetscFunctionReturn(0);
2) In MatMult_MPIAIJCUSPARSE, use the following code sequence. The old code 
swapped the first two lines. Since with -log_view, MatMultAdd_SeqAIJCUSPARSE is 
blocking, I changed the order to have better overlap.
  ierr = 
VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
  ierr = 
VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
3) Log time directly in the test code so we can also know execution time 
without -log_view (hence cuda synchronization). I manually calculated the Total 
Mflop/s for these cases for easy comparison.

<>


EventCount  Time (sec) Flop 
 --- Global ---  --- Stage   Total   GPU- CpuToGpu -   - GpuToCpu - GPU
   Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  Reduct 
 %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---
6 MPI ranks,
MatMult  100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 
0.0e+00 24 99 97 18  0 100100100100  0  4743   0  0 0.00e+000 
0.00e+00  0
VecScatterBegin  100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 
0.0e+00  0  0 97 18  0   0  0100100  0 0   0  0 0.00e+000 
0.00e+00  0
VecScatterEnd100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  3  0  0  0  0  13  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0

24 MPI ranks
MatMult  100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 
0.0e+00  8 99 97 25  0 100100100100  0 17948   0  0 0.00e+000 
0.00e+00  0
VecScatterBegin  100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 
0.0e+00  0  0 97 25  0   0  0100100  0 0   0  0 0.00e+000 
0.00e+00  0
VecScatterEnd100 1.0 1.0639e+0050.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  2  0  0  0  0  19  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0

42 MPI ranks
MatMult  100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 
0.0e+00 23 99 97 30  0 100100100100  0 27493   0  0 0.00e+000 
0.00e+00  0
VecScatterBegin  100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Mills, Richard Tran via petsc-dev
In MatMultAdd_SeqAIJCUSPARSE, before Junchao's changes, towards the end of the 
function it had

  if (!yy) { /* MatMult */
if (!cusparsestruct->stream) {
  ierr = WaitForGPU();CHKERRCUDA(ierr);
}
  }

I assume we don't need the logic to do this only in the MatMult() with no add 
case and should just do this all the time, for the purposes of timing if no 
other reason. Is there some reason to NOT do this because of worries about the 
effects that these WaitForGPU() invocations might have on performance?

I notice other problems in aijcusparse.cu, now that I look closer. In 
MatMultTransposeAdd_SeqAIJCUSPARSE(), I see that we have GPU timing calls 
around the cusparse_csr_spmv() (but no WaitForGPU() inside the timed region). I 
believe this is another area in which we get a meaningless timing. It looks 
like we need a WaitForGPU() there, and then maybe inside the timed region 
handling the scatter. (I don't know if this stuff happens asynchronously or 
not.) But do we potentially want two WaitForGPU() calls in one function, just 
to help with getting timings? I don't have a good idea of how much overhead 
this adds.
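
(Roughly, the pattern I have in mind for each timed region is the following 
sketch -- this is not the actual aijcusparse.cu code, just the shape of it:

  ierr = PetscLogGpuTimeBegin();CHKERRQ(ierr);
  /* ... potentially asynchronous cusparse_csr_spmv() call goes here ... */
  ierr = WaitForGPU();CHKERRCUDA(ierr);  /* make sure the kernel has finished before stopping the clock */
  ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);

so that the logged GPU time covers the actual work rather than just the kernel 
launch.)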

--Richard

On 9/21/19 12:03 PM, Zhang, Junchao via petsc-dev wrote:
I made the following changes:
1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at the end
  ierr = WaitForGPU();CHKERRCUDA(ierr);
  ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
  ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);
  PetscFunctionReturn(0);
2) In MatMult_MPIAIJCUSPARSE, use the following code sequence. The old code 
swapped the first two lines. Since with -log_view, MatMultAdd_SeqAIJCUSPARSE is 
blocking, I changed the order to have better overlap.
  ierr = 
VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
  ierr = 
VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
3) Log time directly in the test code so we can also know execution time 
without -log_view (hence cuda synchronization). I manually calculated the Total 
Mflop/s for these cases for easy comparison.

<>


EventCount  Time (sec) Flop 
 --- Global ---  --- Stage   Total   GPU- CpuToGpu -   - GpuToCpu - GPU
   Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  Reduct 
 %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---
6 MPI ranks,
MatMult  100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 
0.0e+00 24 99 97 18  0 100100100100  0  4743   0  0 0.00e+000 
0.00e+00  0
VecScatterBegin  100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 
0.0e+00  0  0 97 18  0   0  0100100  0 0   0  0 0.00e+000 
0.00e+00  0
VecScatterEnd100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  3  0  0  0  0  13  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0

24 MPI ranks
MatMult  100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 
0.0e+00  8 99 97 25  0 100100100100  0 17948   0  0 0.00e+000 
0.00e+00  0
VecScatterBegin  100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 
0.0e+00  0  0 97 25  0   0  0100100  0 0   0  0 0.00e+000 
0.00e+00  0
VecScatterEnd100 1.0 1.0639e+0050.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  2  0  0  0  0  19  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0

42 MPI ranks
MatMult  100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 
0.0e+00 23 99 97 30  0 100100100100  0 27493   0  0 0.00e+000 
0.00e+00  0
VecScatterBegin  100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 
0.0e+00  0  0 97 30  0   1  0100100  0 0   0  0 0.00e+000 
0.00e+00  0
VecScatterEnd100 1.0 8.5184e-0162.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  6  0  0  0  0  24  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0

6 MPI ranks + 6 GPUs + regular SF + log_view
MatMult  100 1.0 1.6863e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 
0.0e+00  0 99 97 18  0 100100100100  0 335743   629278  100 1.02e+02  100 
2.69e+02 100
VecScatterBegin  100 1.0 5.0157e-02 1.6 0.00e+00 0.0 2.8e+03 2.2e+05 
0.0e+00  0  0 97 18  0  24  0100100  0 0   0  0 0.00e+00  100 
2.69e+02  0
VecScatterEnd100 1.0 4.9155e-02 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0  20  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0
VecCUDACopyTo100 1.0 9.5078e-03 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   4  0  0  0  0 0   0100 1.02e+020 
0.00e+00  0
VecCopyFromSome  

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Mills, Richard Tran via petsc-dev
OK, I wrote to the OLCF Consultants and they told me that

* Yes, the jsrun Visualizer numberings correspond to the 'lstopo' ones.

and, from this I can conclude that

* If I ask for 6 resource sets, each with 1 core and 1 GPU, then some of the 
cores in different resource sets will share L2/L3 cache.

* For the above case, in which I want 6 MPI ranks that don't share anything, I 
need to ask for 6 resource sets, each with *2 cores* and 1 GPU. When I ask 
for 2 cores, each resource set will consist of 2 cores that share L2/L3, so 
this is how you can get resource sets that don't share L2/L3 between them.
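
(In jsrun terms, that would be something like the following untested sketch, 
analogous to the commands Junchao posted earlier but with 2 cores per resource set:

  jsrun -n 6 -a 1 -c 2 -g 1 -r 6 ./ex900 -f HV15R.aij -mat_type aijcusparse -vec_type cuda -log_view

i.e. 6 resource sets, each with 1 task, 2 cores, and 1 GPU, so no two resource 
sets land on cores that share L2/L3.)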

--Richard

On 9/23/19 11:10 AM, Mills, Richard Tran wrote:
To further muddy the waters, the OLCF Summit User Guide 
(https://www.olcf.ornl.gov/for-users/system-user-guides/summit/summit-user-guide)
 states that

"The POWER9 processor is built around IBM’s SIMD Multi-Core (SMC). The 
processor provides 22 SMCs with separate 32kB L1 data and instruction caches. 
Pairs of SMCs share a 512kB L2 cache and a 10MB L3 cache."

And there is some funny stuff in that lstopo output. On the first socket, I see 
one "SMC" that doesn't share L2/L3 with anyone. This may be because it actually 
shares this with a "service" node that is hidden to jsrun. But why are there 
three such SMCs on the second socket?!

I've written to the OLCF Consultants to see if they can provide any 
clarification on this. In particular, I want to know if the jsrun Visualizer 
hardware thread and core numberings correspond to the lstopo ones. I think 
that's the only way to tell if we are getting cores that don't share L2/L3 
resources or not.

--Richard


On 9/23/19 10:58 AM, Zhang, Junchao wrote:
The figure did not clearly say all cores share L3.  Instead, we should look at 
p.16 of https://www.redbooks.ibm.com/redpapers/pdfs/redp5472.pdf

"The POWER9 chip contains two memory controllers, PCIe Gen4 I/O controllers, 
and an interconnection system that connects all components within the chip at 7 
TBps. Each core has 256 KB of L2 cache, and all cores share 120 MB of L3 
embedded DRAM (eDRAM)."
--Junchao Zhang


On Mon, Sep 23, 2019 at 11:58 AM Mills, Richard Tran via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
L3 and L2 are shared between cores, actually. See the attached 'lstopo' PDF 
output from a Summit compute node to see an illustration of the node layout.

--Richard

On 9/23/19 9:01 AM, Zhang, Junchao via petsc-dev wrote:
I also did OpenMP stream test and then I found mismatch between OpenMPI and 
MPI.  That reminded me a subtle issue on summit: pair of cores share L2 cache.  
One has to place MPI ranks to different pairs to get best bandwidth. See 
different bindings
https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b21l0= and 
https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b22l0=. Note each 
node has 21 cores. I assume that means 11 pairs. The new results are below. 
They match with what I got from OpenMPI. The bandwidth is almost doubled 
from 1 to 2 cores per socket. IBM document also says each socket has two memory 
controllers. I could not find the core-memory controller affinity info. I tried 
different bindings but did not find huge difference.

#Ranks   Rate (MB/s)   Ratio over 2 ranks
 1        29229.8       -
 2        59091.0       1.0
 4       112260.7       1.9
 6       159852.8       2.7
 8       194351.7       3.3
10       215841.0       3.7
12       232316.6       3.9
14       244615.7       4.1
16       254450.8       4.3
18       262185.7       4.4
20       267181.0       4.5
22       270290.4       4.6
24       221944.9       3.8
26       238302.8       4.0


--Junchao Zhang


On Sun, Sep 22, 2019 at 6:04 PM Smith, Barry F. 
mailto:bsm...@mcs.anl.gov>> wrote:

  Junchao,

 For completeness could you please run with a single core? But leave the 
ratio as you have with over 2 ranks since that is the correct model.

   Thanks

 Barry


> On Sep 22, 2019, at 11:14 AM, Zhang, Junchao 
> mailto:jczh...@mcs.anl.gov>> wrote:
>
> I did stream test on Summit. I used the MPI version from petsc, but largely 
> increased the array size N since one socket of Summit has 120MB L3 cache. I 
> used MPI version since it was easy for me to distribute ranks evenly to the 
> two sockets.
> The result matches with data released by OLCF (see attached figure) and data 
> given by Jed. We can see the bandwidth saturates around 24 ranks.
>
> #Ranks   Rate (MB/s)   Ratio over 2 ranks
> ------------------------------------------
>  2        59012.2834    1.00
>  4        70959.1475    1.20
>  6       106639.9837    1.81
>  8       138638.6929    2.35
> 10       171125.0873    2.90
> 12       196162.5197    3.32
> 14       215272.7810    3.65
> 16       229562.4040    3.89
> 18       242587.4913    4.11
> 20       251057.1731    4.25
> 22

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Mills, Richard Tran via petsc-dev
To further muddy the waters, the OLCF Summit User Guide 
(https://www.olcf.ornl.gov/for-users/system-user-guides/summit/summit-user-guide)
 states that

"The POWER9 processor is built around IBM’s SIMD Multi-Core (SMC). The 
processor provides 22 SMCs with separate 32kB L1 data and instruction caches. 
Pairs of SMCs share a 512kB L2 cache and a 10MB L3 cache."

And there is some funny stuff in that lstopo output. On the first socket, I see 
one "SMC" that doesn't share L2/L3 with anyone. This may be because it actually 
shares this with a "service" node that is hidden to jsrun. But why are there 
three such SMCs on the second socket?!

I've written to the OLCF Consultants to see if they can provide any 
clarification on this. In particular, I want to know if the jsrun Visualizer 
hardware thread and core numberings correspond to the lstopo ones. I think 
that's the only way to tell if we are getting cores that don't share L2/L3 
resources or not.

--Richard


On 9/23/19 10:58 AM, Zhang, Junchao wrote:
The figure did not clearly say all cores share L3.  Instead, we should look at 
p.16 of https://www.redbooks.ibm.com/redpapers/pdfs/redp5472.pdf

"The POWER9 chip contains two memory controllers, PCIe Gen4 I/O controllers, 
and an interconnection system that connects all components within the chip at 7 
TBps. Each core has 256 KB of L2 cache, and all cores share 120 MB of L3 
embedded DRAM (eDRAM)."
--Junchao Zhang


On Mon, Sep 23, 2019 at 11:58 AM Mills, Richard Tran via petsc-dev 
mailto:petsc-dev@mcs.anl.gov>> wrote:
L3 and L2 are shared between cores, actually. See the attached 'lstopo' PDF 
output from a Summit compute node to see an illustration of the node layout.

--Richard

On 9/23/19 9:01 AM, Zhang, Junchao via petsc-dev wrote:
I also did OpenMP stream test and then I found mismatch between OpenMPI and 
MPI.  That reminded me a subtle issue on summit: pair of cores share L2 cache.  
One has to place MPI ranks to different pairs to get best bandwidth. See 
different bindings
https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b21l0= and 
https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n2c21g3r12d1b22l0=. Note each 
node has 21 cores. I assume that means 11 pairs. The new results are below. 
They match with what I got from OpenMPI. The bandwidth is almost doubled 
from 1 to 2 cores per socket. IBM document also says each socket has two memory 
controllers. I could not find the core-memory controller affinity info. I tried 
different bindings but did not find huge difference.

#Ranks   Rate (MB/s)   Ratio over 2 ranks
 1        29229.8       -
 2        59091.0       1.0
 4       112260.7       1.9
 6       159852.8       2.7
 8       194351.7       3.3
10       215841.0       3.7
12       232316.6       3.9
14       244615.7       4.1
16       254450.8       4.3
18       262185.7       4.4
20       267181.0       4.5
22       270290.4       4.6
24       221944.9       3.8
26       238302.8       4.0


--Junchao Zhang


On Sun, Sep 22, 2019 at 6:04 PM Smith, Barry F. 
mailto:bsm...@mcs.anl.gov>> wrote:

  Junchao,

 For completeness could you please run with a single core? But leave the 
ratio as you have with over 2 ranks since that is the correct model.

   Thanks

 Barry


> On Sep 22, 2019, at 11:14 AM, Zhang, Junchao 
> mailto:jczh...@mcs.anl.gov>> wrote:
>
> I did stream test on Summit. I used the MPI version from petsc, but largely 
> increased the array size N since one socket of Summit has 120MB L3 cache. I 
> used MPI version since it was easy for me to distribute ranks evenly to the 
> two sockets.
> The result matches with data released by OLCF (see attached figure) and data 
> given by Jed. We can see the bandwidth saturates around 24 ranks.
>
> #Ranks   Rate (MB/s)   Ratio over 2 ranks
> ------------------------------------------
>  2        59012.2834    1.00
>  4        70959.1475    1.20
>  6       106639.9837    1.81
>  8       138638.6929    2.35
> 10       171125.0873    2.90
> 12       196162.5197    3.32
> 14       215272.7810    3.65
> 16       229562.4040    3.89
> 18       242587.4913    4.11
> 20       251057.1731    4.25
> 22       258569.7794    4.38
> 24       265443.2924    4.50
> 26       266562.7872    4.52
> 28       267043.6367    4.53
> 30       266833.7212    4.52
> 32       267183.8474    4.53
>
> On Sat, Sep 21, 2019 at 11:24 PM Smith, Barry F. 
> mailto:bsm...@mcs.anl.gov>> wrote:
>
>   Junchao could try the PETSc (and non-PETSc) streams tests on the machine.
>
>   There are a few differences, compiler, the reported results are with 
> OpenMP, different number of cores but yes the performance is a bit low. For 
> DOE that is great, makes GPUs look better :-)
>
>
> > On Sep 21, 2019, at 1

Re: [petsc-dev] MatMult on Summit

2019-09-23 Thread Mills, Richard Tran via petsc-dev
> >> On Sat, Sep 21, 2019 at 9:41 PM Smith, Barry F. 
> >> mailto:bsm...@mcs.anl.gov><mailto:bsm...@mcs.anl.gov<mailto:bsm...@mcs.anl.gov>>>
> >>  wrote:
> >>
> >>  Junchao,
> >>
> >>Mark has a good point; could you also try for completeness the CPU with 
> >> 36 cores and see if it is any better than the 42 core case?
> >>
> >>  Barry
> >>
> >>  So extrapolating about 20 nodes of the CPUs is equivalent to 1 node of 
> >> the GPUs for the multiply for this problem size.
> >>
> >>> On Sep 21, 2019, at 6:40 PM, Mark Adams 
> >>> mailto:mfad...@lbl.gov><mailto:mfad...@lbl.gov<mailto:mfad...@lbl.gov>>>
> >>>  wrote:
> >>>
> >>> I came up with 36 cores/node for CPU GAMG runs. The memory bus is pretty 
> >>> saturated at that point.
> >>>
> >>> On Sat, Sep 21, 2019 at 1:44 AM Zhang, Junchao via petsc-dev 
> >>> mailto:petsc-dev@mcs.anl.gov><mailto:petsc-dev@mcs.anl.gov<mailto:petsc-dev@mcs.anl.gov>>>
> >>>  wrote:
> >>> Here are CPU version results on one node with 24 cores, 42 cores. Click 
> >>> the links for core layout.
> >>>
> >>> 24 MPI ranks, 
> >>> https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
> >>> MatMult  100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 
> >>> 0.0e+00  8 99 97 25  0 100100100100  0 17948   0  0 0.00e+000 
> >>> 0.00e+00  0
> >>> VecScatterBegin  100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 
> >>> 0.0e+00  0  0 97 25  0   0  0100100  0 0   0  0 0.00e+000 
> >>> 0.00e+00  0
> >>> VecScatterEnd100 1.0 1.0639e+0050.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> >>> 0.0e+00  2  0  0  0  0  19  0  0  0  0 0   0  0 0.00e+000 
> >>> 0.00e+00  0
> >>>
> >>> 42 MPI ranks, 
> >>> https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c7g1r17d1b21l0=
> >>> MatMult  100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 
> >>> 0.0e+00 23 99 97 30  0 100100100100  0 27493   0  0 0.00e+000 
> >>> 0.00e+00  0
> >>> VecScatterBegin  100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 
> >>> 0.0e+00  0  0 97 30  0   1  0100100  0 0   0  0 0.00e+000 
> >>> 0.00e+00  0
> >>> VecScatterEnd100 1.0 8.5184e-0162.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> >>> 0.0e+00  6  0  0  0  0  24  0  0  0  0 0   0  0 0.00e+000 
> >>> 0.00e+00  0
> >>>
> >>> --Junchao Zhang
> >>>
> >>>
> >>> On Fri, Sep 20, 2019 at 11:48 PM Smith, Barry F. 
> >>> mailto:bsm...@mcs.anl.gov><mailto:bsm...@mcs.anl.gov<mailto:bsm...@mcs.anl.gov>>>
> >>>  wrote:
> >>>
> >>>  Junchao,
> >>>
> >>>   Very interesting. For completeness please run also 24 and 42 CPUs 
> >>> without the GPUs. Note that the default layout for CPU cores is not good. 
> >>> You will want 3 cores on each socket then 12 on each.
> >>>
> >>>  Thanks
> >>>
> >>>   Barry
> >>>
> >>>  Since Tim is one of our reviewers next week this is a very good test 
> >>> matrix :-)
> >>>
> >>>
> >>>> On Sep 20, 2019, at 11:39 PM, Zhang, Junchao via petsc-dev 
> >>>> mailto:petsc-dev@mcs.anl.gov><mailto:petsc-dev@mcs.anl.gov<mailto:petsc-dev@mcs.anl.gov>>>
> >>>>  wrote:
> >>>>
> >>>> Click the links to visualize it.
> >>>>
> >>>> 6 ranks
> >>>> https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c1g1r11d1b21l0=
> >>>> jsrun -n 6 -a 1 -c 1 -g 1 -r 6 --latency_priority GPU-GPU 
> >>>> --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f 
> >>>> HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
> >>>>
> >>>> 24 ranks
> >>>> https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
> >>>> jsrun -n 6 -a 4 -c 4 -g 1 -r 6 --latency_priority GPU-GPU 
> >>>> --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f 
> >>>> HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
> >>>>
> >>>> --Junchao Zhang
> >

Re: [petsc-dev] MatMult on Summit

2019-09-20 Thread Mills, Richard Tran via petsc-dev
Junchao,

Can you share your 'jsrun' command so that we can see how you are mapping 
things to resource sets?

--Richard

On 9/20/19 11:22 PM, Zhang, Junchao via petsc-dev wrote:
I downloaded a sparse matrix (HV15R) 
from Florida Sparse Matrix Collection. Its size is about 2M x 2M. Then I ran 
the same MatMult 100 times on one node of Summit with -mat_type aijcusparse 
-vec_type cuda. I found MatMult was almost dominated by VecScatter in this 
simple test. Using 6 MPI ranks + 6 GPUs,  I found CUDA aware SF could improve 
performance. But if I enabled Multi-Process Service on Summit and used 24 ranks 
+ 6 GPUs, I found CUDA aware SF hurt performance. I don't know why and have to 
profile it. I will also collect  data with multiple nodes. Are the matrix and 
tests proper?


EventCount  Time (sec) Flop 
 --- Global ---  --- Stage   Total   GPU- CpuToGpu -   - GpuToCpu - GPU
   Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  Reduct 
 %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---
6 MPI ranks (CPU version)
MatMult  100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 
0.0e+00 24 99 97 18  0 100100100100  0  4743   0  0 0.00e+000 
0.00e+00  0
VecScatterBegin  100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 
0.0e+00  0  0 97 18  0   0  0100100  0 0   0  0 0.00e+000 
0.00e+00  0
VecScatterEnd100 1.0 2.9441e+00133  0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  3  0  0  0  0  13  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0

6 MPI ranks + 6 GPUs + regular SF
MatMult  100 1.0 1.7800e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 
0.0e+00  0 99 97 18  0 100100100100  0 318057   3084009 100 1.02e+02  100 
2.69e+02 100
VecScatterBegin  100 1.0 1.2786e-01 1.3 0.00e+00 0.0 2.8e+03 2.2e+05 
0.0e+00  0  0 97 18  0  64  0100100  0 0   0  0 0.00e+00  100 
2.69e+02  0
VecScatterEnd100 1.0 6.2196e-02 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0  22  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0
VecCUDACopyTo100 1.0 1.0850e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   5  0  0  0  0 0   0100 1.02e+020 
0.00e+00  0
VecCopyFromSome  100 1.0 1.0263e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0  54  0  0  0  0 0   0  0 0.00e+00  100 
2.69e+02  0

6 MPI ranks + 6 GPUs + CUDA-aware SF
MatMult  100 1.0 1.1112e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 
0.0e+00  1 99 97 18  0 100100100100  0 509496   3133521   0 0.00e+000 
0.00e+00 100
VecScatterBegin  100 1.0 7.9461e-02 1.1 0.00e+00 0.0 2.8e+03 2.2e+05 
0.0e+00  1  0 97 18  0  70  0100100  0 0   0  0 0.00e+000 
0.00e+00  0
VecScatterEnd100 1.0 2.2805e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0  17  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0

24 MPI ranks + 6 GPUs + regular SF
MatMult  100 1.0 1.1094e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 
0.0e+00  1 99 97 25  0 100100100100  0 510337   951558  100 4.61e+01  100 
6.72e+01 100
VecScatterBegin  100 1.0 4.8966e-02 1.8 0.00e+00 0.0 1.9e+04 5.9e+04 
0.0e+00  0  0 97 25  0  34  0100100  0 0   0  0 0.00e+00  100 
6.72e+01  0
VecScatterEnd100 1.0 7.2969e-02 4.9 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0  42  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0
VecCUDACopyTo100 1.0 4.4487e-03 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   3  0  0  0  0 0   0100 4.61e+010 
0.00e+00  0
VecCopyFromSome  100 1.0 4.3315e-02 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0  29  0  0  0  0 0   0  0 0.00e+00  100 
6.72e+01  0

24 MPI ranks + 6 GPUs + CUDA-aware SF
MatMult  100 1.0 1.4597e-01 1.2 2.63e+09 1.2 1.9e+04 5.9e+04 
0.0e+00  1 99 97 25  0 100100100100  0 387864   9733910 0.00e+000 
0.00e+00 100
VecScatterBegin  100 1.0 6.4899e-02 2.9 0.00e+00 0.0 1.9e+04 5.9e+04 
0.0e+00  1  0 97 25  0  35  0100100  0 0   0  0 0.00e+000 
0.00e+00  0
VecScatterEnd100 1.0 1.1179e-01 4.1 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0  48  0  0  0  0 0   0  0 0.00e+000 
0.00e+00  0


--Junchao Zhang



Re: [petsc-dev] Configure hangs on Summit

2019-09-20 Thread Mills, Richard Tran via petsc-dev
Everything that Barry says about '--with-batch' is valid, but let me point out 
one thing about Summit: You don't need "--with-batch" at all, because the 
Summit login/compile nodes run the same hardware (minus the GPUs) and software 
stack as the back-end compute nodes. This makes configuring and building 
software far, far easier than we are used to on the big LCF machines. I was 
actually shocked when I found this out -- I'd gotten so used to struggling with 
cross-compilers, etc.
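
(For reference, a stripped-down sketch of the sort of configure invocation that 
works there -- the options below are illustrative only; see 
config/examples/arch-olcf-summit-opt.py in the repo for the actual template:

  ./configure --with-cc=mpicc --with-cxx=mpicxx --with-fc=mpif90 \
              --with-cuda=1 --with-debugging=0

No --with-batch is needed, since the configure-time test executables run fine on 
the login nodes.)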

--Richard

On 9/20/19 9:28 PM, Smith, Barry F. wrote:


--with-batch is still there and should be used in such circumstances. The 
difference is that --with-branch does not generate a program that you need to 
submit to the batch system before continuing the configure. Instead 
--with-batch guesses at and skips some of the tests (with clear warnings on how 
you can adjust the guesses).

 Regarding the hanging. This happens because the thread monitoring of 
configure-started executables was removed years ago since it was slow and 
occasionally buggy (the default wait was an absurd 10 minutes too). Thus when 
configure tried to test an mpiexec that hung, the test would hang. There is 
code in one of my branches I've been struggling to get into master for a long 
time that puts back the thread monitoring for this one call with a small 
timeout so you should never see this hang again.

  Barry

   We could be a little clever and have configure detect it is on a Cray or 
other batch system and automatically add the batch option. That would be a nice 
little feature for someone to add. Probably just a few lines of code.




On Sep 20, 2019, at 8:59 PM, Mills, Richard Tran via petsc-dev 
<mailto:petsc-dev@mcs.anl.gov> wrote:

Hi Junchao,

Glad you've found a workaround, but I don't know why you are hitting this 
problem. The last time I built PETSc on Summit (just a couple days ago), I 
didn't have this problem. I'm working from the example template that's in the 
PETSc repo at config/examples/arch-olcf-summit-opt.py.

Can you point me to your configure script on Summit so I can try to reproduce 
your problem?

--Richard

On 9/20/19 4:25 PM, Zhang, Junchao via petsc-dev wrote:


Satish's trick --with-mpiexec=/bin/true solved the problem.  Thanks.
--Junchao Zhang


On Fri, Sep 20, 2019 at 3:50 PM Junchao Zhang 
<mailto:jczh...@mcs.anl.gov> wrote:
My configure hangs on Summit at
  TESTING: configureMPIEXEC from 
config.packages.MPI(config/BuildSystem/config/packages/MPI.py:170)

On the machine one has to use script to submit jobs. So why do we need 
configureMPIEXEC? Do I need to use --with-batch? I remember we removed that.

--Junchao Zhang










Re: [petsc-dev] Configure hangs on Summit

2019-09-20 Thread Mills, Richard Tran via petsc-dev
Hi Junchao,

Glad you've found a workaround, but I don't know why you are hitting this 
problem. The last time I built PETSc on Summit (just a couple days ago), I 
didn't have this problem. I'm working from the example template that's in the 
PETSc repo at config/examples/arch-olcf-summit-opt.py.

Can you point me to your configure script on Summit so I can try to reproduce 
your problem?

--Richard

On 9/20/19 4:25 PM, Zhang, Junchao via petsc-dev wrote:
Satish's trick --with-mpiexec=/bin/true solved the problem.  Thanks.
--Junchao Zhang


On Fri, Sep 20, 2019 at 3:50 PM Junchao Zhang 
<jczh...@mcs.anl.gov> wrote:
My configure hangs on Summit at
  TESTING: configureMPIEXEC from 
config.packages.MPI(config/BuildSystem/config/packages/MPI.py:170)

On the machine one has to use script to submit jobs. So why do we need 
configureMPIEXEC? Do I need to use --with-batch? I remember we removed that.

--Junchao Zhang



[petsc-dev] Should we add something about GPU support to the user manual?

2019-09-12 Thread Mills, Richard Tran via petsc-dev
Fellow PETSc developers,

I've had a few people recently ask me something along the lines of "Where do I 
look in the user manual for information about how to use GPUs with PETSc?", and 
then I have to give them the slightly embarrassing answer that there is nothing 
in there. Since we officially added GPU support a few releases ago, it might be 
appropriate to put something in the manual (even though our GPU support is 
still a moving target). I think I can draft something based on the existing 
tutorial material that Karl and I have been presenting. Do others think this 
would be worthwhile, or is our GPU support still too immature to belong in the 
manual? And are there any thoughts on where this belongs in the manual?

--Richard


Re: [petsc-dev] Gitlab notifications

2019-09-12 Thread Mills, Richard Tran via petsc-dev
On 9/12/19 6:33 AM, Jed Brown via petsc-dev wrote:
[...]

Can you explain CODEOWNERS to me? I cannot find it on the GItlab site. I
want to see every MR.



https://docs.gitlab.com/ee/user/project/code_owners.html

We currently require approval from Integration (of which you are a
member) and a code owner (as specified in the file).

We used to have optional approvals from any other developer, but Satish
just removed that due to this notification thing, which I guess means
that any other developer (non-integrator, non-owner) should just comment
their approval if they find time to review.

Alright, this CODEOWNERS thing is new to me. I assume that everyone should go 
and edit this file and add themselves as "code owners" for the relevant 
portions of PETSc that they've done significant development on?
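
For anyone else who hasn't looked at the format yet, the entries are just path patterns 
followed by GitLab usernames, along these lines (the paths and usernames here are made up):

  /src/ksp/            @some-username
  /src/mat/impls/aij/  @another-username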

--Richard


Re: [petsc-dev] alternatives to cygwin on Windows with PETSc

2019-07-01 Thread Mills, Richard Tran via petsc-dev
Ah, OK. I was originally just thinking that many people would be happy if they 
can get PETSc to simply work with GCC or Clang that they get from the package 
manager used in the WSL setup. I believe both the Microsoft and Intel compilers 
are both available for free these days, so I'll install them and see how hard 
(or easy) it is to get them to work. Unfortunately, I haven't done any 
development as a "regular" Windows user since the late 1990s, though, so I'm 
not sure what exactly such users need to do. Are there any folks in the 
petsc-dev list that do (or have done) regular Windows development that can 
chime in?

--Richard

On 7/1/19 2:16 PM, Smith, Barry F. wrote:


   Richard,

 Thanks. The important thing is to be able to build PETSc for Microsoft and 
Intel Windows compilers (so that users can use the libraries from the Microsoft 
development system as a "regular" Windows users).

   Barry




On Jul 1, 2019, at 3:59 PM, Mills, Richard Tran via petsc-dev 
<mailto:petsc-dev@mcs.anl.gov> wrote:

I played around with WSL1 quite some time ago and it seemed pretty promising. I 
have not tried WSL2, but I'm guessing that it may be the best option for 
building PETSc on a Windows 10 machine. I've got a Windows 10 machine (it 
basically just runs my television/media center) and I'll give it a try on there.
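
If it does work, I expect the steps under WSL to look just like a native Linux build -- 
roughly, and purely as a sketch:

  sudo apt install build-essential gfortran python
  ./configure --download-mpich --download-fblaslapack
  make all check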

--Richard

On 6/29/19 8:11 PM, Jed Brown via petsc-dev wrote:


"Smith, Barry F. via petsc-dev" 
<mailto:petsc-dev@mcs.anl.gov>
 writes:




  Does it make sense to recommend/suggest  git bash for Windows as an 
alternative/in addition to Cygwin?



I would love to be able to recommend git-bash and/or WSL2 (which now
includes a full Linux kernel).  I don't have a system on which to test,
but it should be possible to make it work (if it doesn't already).











Re: [petsc-dev] alternatives to cygwin on Windows with PETSc

2019-07-01 Thread Mills, Richard Tran via petsc-dev
I played around with WSL1 quite some time ago and it seemed pretty promising. I 
have not tried WSL2, but I'm guessing that it may be the best option for 
building PETSc on a Windows 10 machine. I've got a Windows 10 machine (it 
basically just runs my television/media center) and I'll give it a try on there.

--Richard

On 6/29/19 8:11 PM, Jed Brown via petsc-dev wrote:

"Smith, Barry F. via petsc-dev" 
 writes:



  Does it make sense to recommend/suggest  git bash for Windows as an 
alternative/in addition to Cygwin?



I would love to be able to recommend git-bash and/or WSL2 (which now
includes a full Linux kernel).  I don't have a system on which to test,
but it should be possible to make it work (if it doesn't already).




[petsc-dev] Controlling matrix type on different levels of multigrid hierarchy? (Motivation is GPUs)

2019-06-12 Thread Mills, Richard Tran via petsc-dev
Colleagues,

I think we ought to have a way to control which levels of a PETSc multigrid 
solve happen on the GPU vs. the CPU, as I'd like to keep coarse levels on the 
CPU, but run the calculations for finer levels on the GPU. Currently, for a 
code that is using a DM to manage its grid, one can use GPUs inside the 
application of PCMG by putting something like

  -dm_mat_type aijcusparse -dm_vec_type cuda

on the command line. What I'd like to be able to do is to also control which 
levels get plain AIJ matrices and which get a GPU type, maybe via something like

  -mg_levels_N_dm_mat_type aijcusparse -mg_levels_N_dm_vec_type cuda

for level N. (Being able to specify a range of levels would be even nicer, but 
let's start simple.)

Maybe doing the above is as simple as making sure that DMSetFromOptions() gets 
called for the DM for each level. But I think I may be not understanding some 
sort of additional complications. Can someone who knows the PCMG framework 
better chime in? Or do others have ideas for a more elegant way of giving this 
sort of control to the user?
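
For concreteness, here is an untested sketch of what I have in mind if one were to do it 
programmatically instead (the function name and calling context are hypothetical; the idea 
is just to reach the DM attached to a given level's smoother):

#include <petscksp.h>

/* Hypothetical helper (untested sketch): request a GPU matrix/vector type on one
   level of an existing PCMG hierarchy by going through that level's smoother KSP. */
static PetscErrorCode SetLevelGPUTypes(PC pc, PetscInt level)
{
  KSP            smoother;
  DM             dmlevel;
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = PCMGGetSmoother(pc,level,&smoother);CHKERRQ(ierr);
  ierr = KSPGetDM(smoother,&dmlevel);CHKERRQ(ierr);           /* the DM for this level, if one is attached */
  ierr = DMSetMatType(dmlevel,MATAIJCUSPARSE);CHKERRQ(ierr);  /* matrices later created from this DM use the GPU type */
  ierr = DMSetVecType(dmlevel,VECCUDA);CHKERRQ(ierr);         /* and its vectors live on the GPU as well */
  PetscFunctionReturn(0);
}

This would presumably have to run after the levels exist but before the level operators are 
created, which is part of what I am unsure about.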

Best regards,
Richard


[petsc-dev] User(s) manual sections field in manual pages?

2019-06-08 Thread Mills, Richard Tran via petsc-dev
Colleagues,

I have noticed that we have a "Users manual sections" section in the 
MatNullSpaceCreate() manual page, and an empty "User manual sections" section 
(which I suppose should be corrected to "Users manual sections", since it is 
officially the "PETSc Users Manual"). Those appear to be the only two manual 
pages that use these headings. Would we like to add these for other manual 
pages, or, since they appear to be unused, should we eliminate them?

--Richard


Re: [petsc-dev] https://www.dursi.ca/post/hpc-is-dying-and-mpi-is-killing-it.html

2019-05-16 Thread Mills, Richard Tran via petsc-dev
OK, so this thread is two months old but I saw some things recently that 
reminded me of it.

To answer Barry's first question: I think that "AI" was used more than "HPC" 
during the presentation because the HPC community's ridiculous focus on 
rankings in the TOP500 list has resulted in machines that aren't truly good for 
much other than xGEMM operations. And if you are looking around for something 
to justify your xGEMM machine, well, deep neural nets fit the bill pretty well. 
(Yes, the fact that GPUs are really good for, well, *graphics* and this is a 
huge market -- way, way bigger than HPC -- is a contributing factor.)

On the health of HPC sales, I, like Bill, was thinking of what I see in 
earnings reports from companies like Intel and NVIDIA. Yes, a lot of this is 
driven by AI applications in data centers, but the same hardware gets used for 
what I think of as more "traditional" HPC.

As for the increasing use of MPI in machine learning, two major examples are

* Uber's Horovod framework: https://eng.uber.com/horovod/
* Microsoft's Cognitive Toolkit (CNTK) uses MPI for parallel training: 
https://docs.microsoft.com/en-us/cognitive-toolkit/multiple-gpus-and-machines

There are other examples, too, but Uber and Microsoft are pretty big players. 
I'm seeing a lot of examples of people using Horovod, in particular.

--Richard

On 3/19/19 4:11 PM, Gropp, William D wrote:
There is a sort of citation for the increasing use of MPI in distributed ML/DL 
- Torsten has a recent paper on demystifying ML with a graph based on published 
papers. Not the same, but interesting.

On the health of HPC, sales figures are available (along with attendance at SC) 
and these show HPC is healthy if not growing at unsustainable rates :)

Bill


On Mar 19, 2019 3:45 PM, "Smith, Barry F. via petsc-dev" 
<mailto:petsc-dev@mcs.anl.gov> wrote:


> On Mar 19, 2019, at 12:27 AM, Mills, Richard Tran via petsc-dev 
> <mailto:petsc-dev@mcs.anl.gov> wrote:
>
> I've seen this quite some time ago. Others in this thread have already 
> articulated many of the same criticisms I have with the material in this blog 
> post, as well as some of the problems that I have with MPI, so I'll content 
> myself by asking the following:
>
> If HPC is as dying as this guy says it is, then
>
> * Why did DOE just announce today that they are spending $500 million on the 
> first (there are *more* coming?) US-based exascale computing system?

   Why was the acronym AI used more often than HPC during the presentation?

>
> * Why are companies like Intel, NVIDIA, Mellanox, etc., managing to sell so 
> much HPC hardware?

   Citation
>
> and if it is all the fault of MPI, then
>
> * Why have a bunch of the big machine-learning shops actually been moving 
> towards more use of MPI?

   Citation

>
> Yeah, MPI has plenty of warts. So does Fortran -- yet that hasn't killed 
> scientific computing.
>
> --Richard
>
> On 3/17/19 1:12 PM, Smith, Barry F. via petsc-dev wrote:
>>   I stubbled on this today; I should have seen it years ago.
>>
>>   Barry
>>
>>
>





Re: [petsc-dev] Is there a good reason that BuildSystem's cuda.py requires GNU compilers?

2019-03-25 Thread Mills, Richard Tran via petsc-dev
Folks,

I've spent a while looking at the BuildSystem code, and I think this is going 
to take me more time than I have available right now to figure it out on my 
own. Someone more familiar with BuildSystem needs to give me some hints -- 
soon, if possible, as I really think that building with non-GCC compilers and 
CUDA should be supported in the upcoming release.

What I want to do is to add a test inside cuda.py that checks to see if 
something like

  nvcc --compiler-option=<the flags configure passes to the C/C++ compiler> hello.c

will return successfully.

What I wasn't sure about was how to get at the values for a bunch of the above 
variables within the cuda.py code. After deciding I couldn't really follow 
everything that is happening in the code buy just looking at it, I used the 
'pdb' python debugger to stick a breakpoint in the configureLibrary() method in 
cuda.py so I could poke around.

 Aside: Looking at contents of configure objects? 
I had hoped I could look at everything that is stashed in the different objects 
by doing things like

(Pdb) p dir(self.compilers)

But this doesn't actually list everything in there. There is no 'CUDAC' 
attribute listed, for instance, but it is there for me to print:

(Pdb) p self.compilers.CUDAC
'nvcc'

Is there a good way for me to actually see all the attributes in something like 
the self.compilers object? Sorry, my Python skills are very rusty -- haven't 
written much Python in about a decade.
 End aside 

It appears that what I need to construct my command line is then available in

self.compilers.CUDAC -- The invocation for the CUDA compiler
self.compilers.CXXFLAGS -- The flags passed to the C++ compiler (our "host")
self.compilers.CUDAFLAGS -- The flags like "-ccbin pgc++" being passed to nvcc 
or whatever CUDAC is

I could use these to construct a command that I then pass to the command shell, 
and maybe I should just do this, but this doesn't seem to follow the 
BuildSystem paradigm. It seems like I should be able to run this test by doing 
something like

self.pushLanguage('CUDA')
self.checkCompile(cuda_test)

which is, in fact, invoked in checkCUDAVersion(). But the command put together by 
checkCompile() does not include "--compiler-option=<the host compiler flags>". Should I be 
modifying the code somewhere so that this argument goes into the compiler invocation 
constructed in self.checkCompile? If so, where should I be doing this?

--Richard



On 3/22/19 10:24 PM, Mills, Richard Tran wrote:


On 3/22/19 3:28 PM, Mills, Richard Tran wrote:
On 3/22/19 12:13 PM, Balay, Satish wrote:

Is there currently an existing check like this somewhere? Or will things just 
fail when running 'make' right now?



Most likely no. Its probably best to attempt the error case - and
figure-out how to add a check.

I gave things a try and verified that there is no check for this anywhere in 
configure -- things just fail at 'make' time. I think that all we need is a 
test that will try to compile any simple, valid C program using "nvcc 
--compiler-options=<the flags configure passes to the C/C++ compiler> <the include flags>". 
If the test fails, it should report something like "Compiler flags do not work with the 
CUDA compiler; perhaps you need to use -ccbin in CUDAFLAGS to specify the intended host 
compiler".

I'm not sure where this test should go. Does it make sense for this to go in 
cuda.py with the other checks like checkNVCCDoubleAlign()? If so, how do I get 
at the values of <the compiler flags> and <the include flags>? I'm 
not sure what modules I need to import from BuildSystem...
OK, answering part of my own question here: Re-familiarizing myself with how 
the configure packages work, and then looking through the makefiles, I see that 
the argument to "--compiler-options" is filled in by the makefile variables

${PCC_FLAGS} ${CFLAGS} ${CCPPFLAGS}

and it appears that this partly maps to self.compilers.CFLAGS in BuildSystem. 
But so far I've not managed to employ the right combination of find and grep to 
figure out where PCC_FLAGS and CCPPFLAGS come from.

--Richard

--Richard

Satish

On Fri, 22 Mar 2019, Mills, Richard Tran via petsc-dev wrote:



On 3/18/19 7:29 PM, Balay, Satish wrote:

On Tue, 19 Mar 2019, Mills, Richard Tran via petsc-dev wrote:



Colleagues,

It took me a while to get PETSc to build at all with anything on Summit other 
than the GNU compilers, but, once this was accomplished, editing out the 
isGNU() test and then passing something like

'--with-cuda=1',
'--with-cudac=nvcc -ccbin pgc++',



Does the following also work?

--with-cuda=1 --with-cudac=nvcc CUDAFLAGS='-ccbin pgc++'

Yes, using CUDAFLAGS as above also works, and that does seem to be a better way 
to do things.

After experimenting with a lot of different builds on Summit, and doing more 
reading about how CUDA compilation works on different platforms, I'm now 
thinking that perhaps configure.py should *avoid* doing anything clever to try to 
figure out what the value of "-ccbin" should be. For one, this is not anything 
that NVIDIA

Re: [petsc-dev] Is there a good reason that BuildSystem's cuda.py requires GNU compilers?

2019-03-22 Thread Mills, Richard Tran via petsc-dev


On 3/22/19 3:28 PM, Mills, Richard Tran wrote:
On 3/22/19 12:13 PM, Balay, Satish wrote:

Is there currently an existing check like this somewhere? Or will things just 
fail when running 'make' right now?



Most likely no. Its probably best to attempt the error case - and
figure-out how to add a check.

I gave things a try and verified that there is no check for this anywhere in 
configure -- things just fail at 'make' time. I think that all we need is a 
test that will try to compile any simple, valid C program using "nvcc 
--compiler-options=<the flags configure passes to the C/C++ compiler> <the include flags>". 
If the test fails, it should report something like "Compiler flags do not work with the 
CUDA compiler; perhaps you need to use -ccbin in CUDAFLAGS to specify the intended host 
compiler".

I'm not sure where this test should go. Does it make sense for this to go in 
cuda.py with the other checks like checkNVCCDoubleAlign()? If so, how do I get 
at the values of <the compiler flags> and <the include flags>? I'm 
not sure what modules I need to import from BuildSystem...
OK, answering part of my own question here: Re-familiarizing myself with how 
the configure packages work, and then looking through the makefiles, I see that 
the argument to "--compiler-options" is filled in by the makefile variables

${PCC_FLAGS} ${CFLAGS} ${CCPPFLAGS}

and it appears that this partly maps to self.compilers.CFLAGS in BuildSystem. 
But so far I've not managed to employ the right combination of find and grep to 
figure out where PCC_FLAGS and CCPPFLAGS come from.

--Richard

--Richard

Satish

On Fri, 22 Mar 2019, Mills, Richard Tran via petsc-dev wrote:



On 3/18/19 7:29 PM, Balay, Satish wrote:

On Tue, 19 Mar 2019, Mills, Richard Tran via petsc-dev wrote:



Colleagues,

It took me a while to get PETSc to build at all with anything on Summit other 
than the GNU compilers, but, once this was accomplished, editing out the 
isGNU() test and then passing something like

'--with-cuda=1',
'--with-cudac=nvcc -ccbin pgc++',



Does the following also work?

--with-cuda=1 --with-cudac=nvcc CUDAFLAGS='-ccbin pgc++'

Yes, using CUDAFLAGS as above also works, and that does seem to be a better way 
to do things.

After experimenting with a lot of different builds on Summit, and doing more 
reading about how CUDA compilation works on different platforms, I'm now 
thinking that perhaps configure.py should *avoid* doing anything clever to try to 
figure out what the value of "-ccbin" should be. For one, this is not anything 
that NVIDIA's toolchain does for the user in the first place: If you want to 
use nvcc with a host compiler that isn't whatever NVIDIA considers the default 
(g++ on Linux, clang on Mac OS, MSVC on Windows), NVIDIA expects you to provide 
the appropriate '-ccbin' argument. Second, nvcc isn't the only CUDA compiler 
that a user might want to use: some people use Clang directly to compile CUDA 
code. Third, which host compilers are supported appears to be platform 
dependent; for example, GCC is the default/preferred host compiler on Linux, 
but isn't even supported on Mac OS! Figuring out what is supported is very 
convoluted, and I think that trying to get configure to determine this may be 
more trouble than it is worth. I think we should instead let the user try 
whatever, and print out a helpful message how they "may need to specify host 
compiler to nvcc with -ccbin" if the CUDA compiler doesn't seem to work. Also, 
I'll put something about this in the CUDA configure examples. Any objections?
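
For the record, a complete configure line along these lines -- purely illustrative, and the 
PGI compiler names may differ from site to site -- might read

  ./configure --with-cc=pgcc --with-cxx=pgc++ --with-fc=pgfortran --with-cuda=1 --with-cudac=nvcc CUDAFLAGS='-ccbin pgc++'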




Sometimes we have extra options in configure for specific features for
ex: --with-pic --with-visibility etc.

But that gets messy. On the cuda side - we have --with-cuda-arch and at 
some point eliminated it [so CUDAFLAGS is now the interface for this 
flag].  We could add --with-cuda-internal-compiler option to petsc
configure - but it will again have similar drawbacks. I personally
think most users will gravitate towards specifying such option via
CUDAFLAGS




to configure works fine. So, I should make a change to the BuildSystem cuda.py 
along these lines. I'm wondering exactly how I should make this work. I could 
just remove the check,



sure



but I think that maybe the better thing to do is to check isGNU(), then if the 
compiler is *not* GNU, configure should add the appropriate '-ccbin' argument 
to "--with-cudac", unless the user has specified '-ccbin' in their 
'--with-cudac' already. Or do we need to get this fancy?



The check should be: do the --compiler-options=<flags> constructed by PETSc configure 
work with CUDAC

Is there currently an existing check like this somewhere? Or will things just 
fail when running 'make' right now?



[or perhaps we should - just trim the --compiler-options to only -I flags?]

I think we should avoid an explicit check for a compiler type [i.e. the isGNU() check] 
as much as possible.




CUDA is only supposed to work with certain compilers, but there doesn't seem to 
be a correct official list (for i

Re: [petsc-dev] Is there a good reason that BuildSystem's cuda.py requires GNU compilers?

2019-03-22 Thread Mills, Richard Tran via petsc-dev
On 3/22/19 12:13 PM, Balay, Satish wrote:

Is there currently an existing check like this somewhere? Or will things just 
fail when running 'make' right now?




Most likely no. Its probably best to attempt the error case - and
figure-out how to add a check.

I gave things a try and verified that there is no check for this anywhere in 
configure -- things just fail at 'make' time. I think that all we need is a 
test that will try to compile any simple, valid C program using "nvcc 
--compiler-options=<the flags configure passes to the C/C++ compiler> <the include flags>". 
If the test fails, it should report something like "Compiler flags do not work with the 
CUDA compiler; perhaps you need to use -ccbin in CUDAFLAGS to specify the intended host 
compiler".

I'm not sure where this test should go. Does it make sense for this to go in 
cuda.py with the other checks like checkNVCCDoubleAlign()? If so, how do I get 
at the values of <the compiler flags> and <the include flags>? I'm 
not sure what modules I need to import from BuildSystem...

--Richard

Satish

On Fri, 22 Mar 2019, Mills, Richard Tran via petsc-dev wrote:



On 3/18/19 7:29 PM, Balay, Satish wrote:

On Tue, 19 Mar 2019, Mills, Richard Tran via petsc-dev wrote:



Colleagues,

It took me a while to get PETSc to build at all with anything on Summit other 
than the GNU compilers, but, once this was accomplished, editing out the 
isGNU() test and then passing something like

'--with-cuda=1',
'--with-cudac=nvcc -ccbin pgc++',



Does the following also work?

--with-cuda=1 --with-cudac=nvcc CUDAFLAGS='-ccbin pgc++'

Yes, using CUDAFLAGS as above also works, and that does seem to be a better way 
to do things.

After experimenting with a lot of different builds on Summit, and doing more 
reading about how CUDA compilation works on different platforms, I'm now 
thinking that perhaps configure.py should *avoid* doing anything clever to try to 
figure out what the value of "-ccbin" should be. For one, this is not anything 
that NVIDIA's toolchain does for the user in the first place: If you want to 
use nvcc with a host compiler that isn't whatever NVIDIA considers the default 
(g++ on Linux, clang on Mac OS, MSVC on Windows), NVIDIA expects you to provide 
the appropriate '-ccbin' argument. Second, nvcc isn't the only CUDA compiler 
that a user might want to use: some people use Clang directly to compile CUDA 
code. Third, which host compilers are supported appears to be platform 
dependent; for example, GCC is the default/preferred host compiler on Linux, 
but isn't even supported on Mac OS! Figuring out what is supported is very 
convoluted, and I think that trying to get configure to determine this may be 
more trouble than it is worth. I think we should instead let the user try 
whatever, and print out a helpful message how they "may need to specify host 
compiler to nvcc with -ccbin" if the CUDA compiler doesn't seem to work. Also, 
I'll put something about this in the CUDA configure examples. Any objections?




Sometimes we have extra options in configure for specific features for
ex: --with-pic --with-visibility etc.

But that gets messy. On the cuda side - we have --with-cuda-arch and at 
some point eliminated it [so CUDAFLAGS is now the interface for this 
flag].  We could add --with-cuda-internal-compiler option to petsc
configure - but it will again have similar drawbacks. I personally
think most users will gravitate towards specifying such option via
CUDAFLAGS




to configure works fine. So, I should make a change to the BuildSystem cuda.py 
along these lines. I'm wondering exactly how I should make this work. I could 
just remove the check,



sure



but I think that maybe the better thing to do is to check isGNU(), then if the 
compiler is *not* GNU, configure should add the appropriate '-ccbin' argument 
to "--with-cudac", unless the user has specified '-ccbin' in their 
'--with-cudac' already. Or do we need to get this fancy?



The check should be: do the --compiler-options=<flags> constructed by PETSc configure 
work with CUDAC

Is there currently an existing check like this somewhere? Or will things just 
fail when running 'make' right now?



[or perhaps we should - just trim the --compiler-options to only -I flags?]

I think we should avoid an explicit check for a compiler type [i.e. the isGNU() check] 
as much as possible.




CUDA is only supposed to work with certain compilers, but there doesn't seem to 
be a correct official list (for instance, it supposedly won't work with the IBM 
XL compilers, but they certainly *are* actually supported on Summit). Heck, the 
latest GCC suite won't even work right now. Since what compilers are supported 
seems to be in flux, I suggest we just let the user try anything and then let 
things fail if it doesn't work.



I suspect the list is dependent on the install [for ex: linux vs Windows vs 
something else?] and version of cuda [for ex: each version of cuda supports only 
specific versions of gcc]

Yes, you are correct about this, as I detailed above.

Re: [petsc-dev] Is there a good reason that BuildSystem's cuda.py requires GNU compilers?

2019-03-22 Thread Mills, Richard Tran via petsc-dev
On 3/18/19 7:29 PM, Balay, Satish wrote:

On Tue, 19 Mar 2019, Mills, Richard Tran via petsc-dev wrote:



Colleagues,

It took me a while to get PETSc to build at all with anything on Summit other 
than the GNU compilers, but, once this was accomplished, editing out the 
isGNU() test and then passing something like

'--with-cuda=1',
'--with-cudac=nvcc -ccbin pgc++',



Does the following also work?

--with-cuda=1 --with-cudac=nvcc CUDAFLAGS='-ccbin pgc++'

Yes, using CUDAFLAGS as above also works, and that does seem to be a better way 
to do things.

After experimenting with a lot of different builds on Summit, and doing more 
reading about how CUDA compilation works on different platforms, I'm now 
thinking that perhaps configure.py should *avoid* doing anything clever to try to 
figure out what the value of "-ccbin" should be. For one, this is not anything 
that NVIDIA's toolchain does for the user in the first place: If you want to 
use nvcc with a host compiler that isn't whatever NVIDIA considers the default 
(g++ on Linux, clang on Mac OS, MSVC on Windows), NVIDIA expects you to provide 
the appropriate '-ccbin' argument. Second, nvcc isn't the only CUDA compiler 
that a user might want to use: some people use Clang directly to compile CUDA 
code. Third, which host compilers are supported appears to be platform 
dependent; for example, GCC is the default/preferred host compiler on Linux, 
but isn't even supported on Mac OS! Figuring out what is supported is very 
convoluted, and I think that trying to get configure to determine this may be 
more trouble than it is worth. I think we should instead let the user try 
whatever, and print out a helpful message how they "may need to specify host 
compiler to nvcc with -ccbin" if the CUDA compiler doesn't seem to work. Also, 
I'll put something about this in the CUDA configure examples. Any objections?




Sometimes we have extra options in configure for specific features for
ex: --with-pic --with-visibility etc.

But that gets messy. On the cuda side - we have --with-cuda-arch and at 
some point eliminated it [so CUDAFLAGS is now the interface for this 
flag].  We could add --with-cuda-internal-compiler option to petsc
configure - but it will again have similar drawbacks. I personally
think most users will gravitate towards specifying such option via
CUDAFLAGS




to configure works fine. So, I should make a change to the BuildSystem cuda.py 
along these lines. I'm wondering exactly how I should make this work. I could 
just remove the check,



sure



but I think that maybe the better thing to do is to check isGNU(), then if the 
compiler is *not* GNU, configure should add the appropriate '-ccbin' argument 
to "--with-cudac", unless the user has specified '-ccbin' in their 
'--with-cudac' already. Or do we need to get this fancy?



The check should be: do the --compiler-options=<flags> constructed by PETSc configure 
work with CUDAC

Is there currently an existing check like this somewhere? Or will things just 
fail when running 'make' right now?



[or perhaps we should - just trim the --compiler-options to only -I flags?]

I think we should avoid an explicit check for a compiler type [i.e. the isGNU() check] 
as much as possible.




CUDA is only supposed to work with certain compilers, but there doesn't seem to 
be a correct official list (for instance, it supposedly won't work with the IBM 
XL compilers, but they certainly *are* actually supported on Summit). Heck, the 
latest GCC suite won't even work right now. Since what compilers are supported 
seems to be in flux, I suggest we just let the user try anything and then let 
things fail if it doesn't work.



I suspect the list is dependent on the install [for ex: linux vs Windows vs 
something else?] and version of cuda [for ex: each version of cuda supports only 
specific versions of gcc]

Yes, you are correct about this, as I detailed above.



Satish




--Richard

On 3/12/19 8:45 PM, Smith, Barry F. wrote:


  Richard,

You need to remove the isGNU() test and then experiment with getting the 
Nvidia tools to use the compiler you want it to use.

 No one has made a serious effort to use any other compilers but Gnu (at 
least not publicly).

   Barry





On Mar 12, 2019, at 10:40 PM, Mills, Richard Tran via petsc-dev 
<petsc-dev@mcs.anl.gov> wrote:

Fellow PETSc developers,

If I try to configure PETSc with CUDA support on the ORNL Summit system using 
non-GNU compilers, I run into an error due to the following code in 
packages/cuda.py:

  def configureTypes(self):
import config.setCompilers
if not config.setCompilers.Configure.isGNU(self.setCompilers.CC, self.log):
  raise RuntimeError('Must use GNU compilers with CUDA')
  ...

Is this just because this code predates support for other host compilers with 
nvcc, or is there perhaps some more subtle reason that I, with my

Re: [petsc-dev] https://www.dursi.ca/post/hpc-is-dying-and-mpi-is-killing-it.html

2019-03-18 Thread Mills, Richard Tran via petsc-dev
I've seen this quite some time ago. Others in this thread have already 
articulated many of the same criticisms I have with the material in this blog 
post, as well as some of the problems that I have with MPI, so I'll content 
myself by asking the following:

If HPC is as dying as this guy says it is, then

* Why did DOE just announce today that they are spending $500 million on the 
first (there are *more* coming?) US-based exascale computing system?

* Why are companies like Intel, NVIDIA, Mellanox, etc., managing to sell so 
much HPC hardware?

and if it is all the fault of MPI, then

* Why have a bunch of the big machine-learning shops actually been moving 
towards more use of MPI?

Yeah, MPI has plenty of warts. So does Fortran -- yet that hasn't killed 
scientific computing.

--Richard

On 3/17/19 1:12 PM, Smith, Barry F. via petsc-dev wrote:


  I stubbled on this today; I should have seen it years ago.

  Barry





Re: [petsc-dev] Is there a good reason that BuildSystem's cuda.py requires GNU compilers?

2019-03-18 Thread Mills, Richard Tran via petsc-dev
Colleagues,

It took me a while to get PETSc to build at all with anything on Summit other 
than the GNU compilers, but, once this was accomplished, editing out the 
isGNU() test and then passing something like

'--with-cuda=1',
'--with-cudac=nvcc -ccbin pgc++',

to configure works fine. So, I should make a change to the BuildSystem cuda.py 
along these lines. I'm wondering exactly how I should make this work. I could 
just remove the check, but I think that maybe the better thing to do is to 
check isGNU(), then if the compiler is *not* GNU, configure should add the 
appropriate '-ccbin' argument to "--with-cudac", unless the user has specified 
'-ccbin' in their '--with-cudac' already. Or do we need to get this fancy?

CUDA is only supposed to work with certain compilers, but there doesn't seem to 
be a correct official list (for instance, it supposedly won't work with the IBM 
XL compilers, but they certainly *are* actually supported on Summit). Heck, the 
latest GCC suite won't even work right now. Since what compilers are supported 
seems to be in flux, I suggest we just let the user try anything and then let 
things fail if it doesn't work.

--Richard

On 3/12/19 8:45 PM, Smith, Barry F. wrote:


  Richard,

You need to remove the isGNU() test and then experiment with getting the 
Nvidia tools to use the compiler you want it to use.

 No one has made a serious effort to use any other compilers but Gnu (at 
least not publicly).

   Barry





On Mar 12, 2019, at 10:40 PM, Mills, Richard Tran via petsc-dev 
<mailto:petsc-dev@mcs.anl.gov> wrote:

Fellow PETSc developers,

If I try to configure PETSc with CUDA support on the ORNL Summit system using 
non-GNU compilers, I run into an error due to the following code in 
packages/cuda.py:

  def configureTypes(self):
import config.setCompilers
if not config.setCompilers.Configure.isGNU(self.setCompilers.CC, self.log):
  raise RuntimeError('Must use GNU compilers with CUDA')
  ...

Is this just because this code predates support for other host compilers with 
nvcc, or is there perhaps some more subtle reason that I, with my inexperience 
using CUDA, don't know about? I'm guessing that I just need to add support for 
using '-ccbin' appropriately to set the location of the non-GNU host compiler, 
but maybe there is something that I'm missing. I poked around in the petsc-dev 
mailing list archives and can find a few old threads on using non-GNU 
compilers, but I'm not sure what conclusions were reached.

Best regards,
Richard









Re: [petsc-dev] Is there a good reason that BuildSystem's cuda.py requires GNU compilers?

2019-03-12 Thread Mills, Richard Tran via petsc-dev


On 3/12/19 8:49 PM, Balay, Satish wrote:

1. You might want to check 'Feature-request: Must use GNU compilers
with CUDA' thread on petsc-maint.

Yes, I see this thread now. Thanks.



2. what does nvcc use internally on summit?

It defaults to the GCC suite but it is supposed to also support the PGI and XLF 
compilers on Summit.



3. On linux - nvcc defaults to using g++ internally (as far as I know)
- so we added that check. But you might want to remove that check - if
things work on summit

4. -ccbin is a weird option - which I don't understand.

--compiler-bindir(-ccbin)
Specify the directory in which the host compiler executable resides.  
The
host compiler executable name can be also specified to ensure that the 
correct
host compiler is selected.  In addition, driver prefix options 
('--input-drive-prefix',
'--dependency-drive-prefix', or '--drive-prefix') may need to be 
specified,
if nvcc is executed in a Cygwin shell or a MinGW shell on Windows.

i.e you specify a PATH. But what binary does it pick up from that
PATH? The way I understood is - it looks for an alternate 'g++' in the
specified path. [well - again this is based on the usual linux
installs of cuda]

This is a really kludgy option. You can specify a directory (and then I think 
it just looks for g++ in there?) but you can also specify a path to xlc++ or 
another compiler. (There appears to be no way to control this via an 
environment variable, strangely.) I will see if setting '-ccbin' to the path 
for whatever C++ compiler has been specified to configure.py works.
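
Illustrative only: per the documentation quoted above, '-ccbin' accepts either a directory 
or an executable name, so both of the following forms should be accepted (the file names 
here are made up):

  nvcc -ccbin /usr/bin -c t.cu
  nvcc -ccbin xlc++ -c t.cu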

--Richard



Satish


On Wed, 13 Mar 2019, Mills, Richard Tran via petsc-dev wrote:



Fellow PETSc developers,

If I try to configure PETSc with CUDA support on the ORNL Summit system using 
non-GNU compilers, I run into an error due to the following code in 
packages/cuda.py:

  def configureTypes(self):
import config.setCompilers
if not config.setCompilers.Configure.isGNU(self.setCompilers.CC, self.log):
  raise RuntimeError('Must use GNU compilers with CUDA')
  ...

Is this just because this code predates support for other host compilers with 
nvcc, or is there perhaps some more subtle reason that I, with my inexperience 
using CUDA, don't know about? I'm guessing that I just need to add support for 
using '-ccbin' appropriately to set the location of the non-GNU host compiler, 
but maybe there is something that I'm missing. I poked around in the petsc-dev 
mailing list archives and can find a few old threads on using non-GNU 
compilers, but I'm not sure what conclusions were reached.

Best regards,
Richard










Re: [petsc-dev] Is there a good reason that BuildSystem's cuda.py requires GNU compilers?

2019-03-12 Thread Mills, Richard Tran via petsc-dev
Jed, yes, I'm on petsc-maint and I see this thread now. Thanks; I'll take a 
look. (I've had a crazy day and have been more behind than usual on email.)

--Richard

On 3/12/19 8:48 PM, Jed Brown wrote:

Richard, are you not on petsc-maint?  There was a thread about this today.

PGI "community edition" is free now.  We could add it to our test suite.

  https://www.pgroup.com/products/community.htm

"Smith, Barry F. via petsc-dev" 
<mailto:petsc-dev@mcs.anl.gov> writes:



  Richard,

You need to remove the isGNU() test and then experiment with getting the 
Nvidia tools to use the compiler you want it to use.

 No one has made a serious effort to use any other compilers but Gnu (at 
least not publicly).

   Barry





On Mar 12, 2019, at 10:40 PM, Mills, Richard Tran via petsc-dev 
<mailto:petsc-dev@mcs.anl.gov> wrote:

Fellow PETSc developers,

If I try to configure PETSc with CUDA support on the ORNL Summit system using 
non-GNU compilers, I run into an error due to the following code in 
packages/cuda.py:

  def configureTypes(self):
import config.setCompilers
if not config.setCompilers.Configure.isGNU(self.setCompilers.CC, self.log):
  raise RuntimeError('Must use GNU compilers with CUDA')
  ...

Is this just because this code predates support for other host compilers with 
nvcc, or is there perhaps some more subtle reason that I, with my inexperience 
using CUDA, don't know about? I'm guessing that I just need to add support for 
using '-ccbin' appropriately to set the location of the non-GNU host compiler, 
but maybe there is something that I'm missing. I poked around in the petsc-dev 
mailing list archives and can find a few old threads on using non-GNU 
compilers, but I'm not sure what conclusions were reached.

Best regards,
Richard






[petsc-dev] Is there a good reason that BuildSystem's cuda.py requires GNU compilers?

2019-03-12 Thread Mills, Richard Tran via petsc-dev
Fellow PETSc developers,

If I try to configure PETSc with CUDA support on the ORNL Summit system using 
non-GNU compilers, I run into an error due to the following code in 
packages/cuda.py:

  def configureTypes(self):
import config.setCompilers
if not config.setCompilers.Configure.isGNU(self.setCompilers.CC, self.log):
  raise RuntimeError('Must use GNU compilers with CUDA')
  ...

Is this just because this code predates support for other host compilers with 
nvcc, or is there perhaps some more subtle reason that I, with my inexperience 
using CUDA, don't know about? I'm guessing that I just need to add support for 
using '-ccbin' appropriately to set the location of the non-GNU host compiler, 
but maybe there is something that I'm missing. I poked around in the petsc-dev 
mailing list archives and can find a few old threads on using non-GNU 
compilers, but I'm not sure what conclusions were reached.

Best regards,
Richard




Re: [petsc-dev] Note about CPR preconditioner in PCFIELDSPLIT man page?

2019-03-03 Thread Mills, Richard Tran via petsc-dev


On 3/3/19 8:21 PM, Jed Brown wrote:

"Mills, Richard Tran"  writes:



Alright, I see the source of my confusion: I've now seen three different formulations that 
are all called a "Constrained Pressure Residual" preconditioner. What I have 
generally seen is exactly what Matt describes: R is just a restriction to a 
subset of DoFs (pressure, and sometimes saturation), and P is just R' (using 
Matlab notation here). A CPR preconditioner in which P differs from R' is not 
something I've seen until now.



Please read the code again.  P = R' for this case, but R is restriction to the 
sum of field components, not a subset of the variables.

Oops -- yes, I see. Thanks, Jed. Elsewhere that I've seen a CPR preconditioner 
used, the restriction is only to a subset of the variables. In that case, the 
preconditioner can be constructed with only PCFIELDSPLIT and PCCOMPOSITE, but 
in the case described in the email messages to Barry, I do see that PCGALERKIN 
is needed.
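
For my own notes, a rough, untested sketch of how the PCGALERKIN piece would be wired up 
with shell restriction/interpolation operators, along the lines of Barry's example (the 
sizes m, n, M, N and the ApplyP routine are placeholders):

  Mat R, P;
  ierr = MatCreateShell(PETSC_COMM_WORLD,m,n,M,N,NULL,&R);CHKERRQ(ierr);
  ierr = MatShellSetOperation(R,MATOP_MULT,(void (*)(void))ApplyR);CHKERRQ(ierr);
  ierr = MatCreateShell(PETSC_COMM_WORLD,n,m,N,M,NULL,&P);CHKERRQ(ierr);
  ierr = MatShellSetOperation(P,MATOP_MULT,(void (*)(void))ApplyP);CHKERRQ(ierr);
  ierr = PCSetType(pc,PCGALERKIN);CHKERRQ(ierr);
  ierr = PCGalerkinSetRestriction(pc,R);CHKERRQ(ierr);    /* y = R x: sums the field components */
  ierr = PCGalerkinSetInterpolation(pc,P);CHKERRQ(ierr);  /* x = P y: spreads the result back   */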

--Richard





I'll see if I can think of a succinct change to the documentation for 
PCFIELDSPLIT to describe the two cases and submit a pull request.

--Richard

On 3/3/19 8:38 AM, Jed Brown wrote:

Matthew Knepley writes:



On Sun, Mar 3, 2019 at 12:58 AM Jed Brown via petsc-dev <
petsc-dev@mcs.anl.gov>
 wrote:



My take is that someone looking for CPR is more likely to end up on the
PCFIELDSPLIT manual page than the PCGALERKIN page.  The solver you have
configured in your mail is not the CPR we have been asked about here.
See Barry's message and example code below.




Thanks for retrieving this Jed. I am sure Richard and I both have the same
question. Perhaps I am being an idiot.
I am supposing that R is just a restriction to some subset of dofs, so its
just binary, so that R A P just selects that
submatrix.



Look at the source.  These are not subsets:

 /*
 Apply the restriction operator for the Galerkin problem
 */
 PetscErrorCode ApplyR(Mat A, Vec x,Vec y)
 {
   PetscErrorCode ierr;
   PetscInt   b;
   PetscFunctionBegin;
   ierr = VecGetBlockSize(x,&b);CHKERRQ(ierr);
   ierr = VecStrideGather(x,0,y,INSERT_VALUES);CHKERRQ(ierr);
   for (PetscInt k=1;k<b;k++) {ierr = VecStrideGather(x,k,y,ADD_VALUES);CHKERRQ(ierr);}
   PetscFunctionReturn(0);
 }

Re: [petsc-dev] Note about CPR preconditioner in PCFIELDSPLIT man page?

2019-03-03 Thread Mills, Richard Tran via petsc-dev
Alright, I see the source of my confusion: I've now seen three different formulations that 
are all called a "Constrained Pressure Residual" preconditioner. What I have 
generally seen is exactly what Matt describes: R is just a restriction to a 
subset of DoFs (pressure, and sometimes saturation), and P is just R' (using 
Matlab notation here). A CPR preconditioner in which P differs from R' is not 
something I've seen until now.

I'll see if I can think of a succinct change to the documentation for 
PCFIELDSPLIT to describe the two cases and submit a pull request.

--Richard

On 3/3/19 8:38 AM, Jed Brown wrote:

Matthew Knepley  writes:



On Sun, Mar 3, 2019 at 12:58 AM Jed Brown via petsc-dev <
petsc-dev@mcs.anl.gov> wrote:



My take is that someone looking for CPR is more likely to end up on the
PCFIELDSPLIT manual page than the PCGALERKIN page.  The solver you have
configured in your mail is not the CPR we have been asked about here.
See Barry's message and example code below.




Thanks for retrieving this Jed. I am sure Richard and I both have the same
question. Perhaps I am being an idiot.
I am supposing that R is just a restriction to some subset of dofs, so its
just binary, so that R A P just selects that
submatrix.



Look at the source.  These are not subsets:

 /*
 Apply the restriction operator for the Galerkin problem
 */
 PetscErrorCode ApplyR(Mat A, Vec x,Vec y)
 {
   PetscErrorCode ierr;
   PetscInt   b;
   PetscFunctionBegin;
   ierr = VecGetBlockSize(x,&b);CHKERRQ(ierr);
   ierr = VecStrideGather(x,0,y,INSERT_VALUES);CHKERRQ(ierr);
   for (PetscInt k=1;k<b;k++) {ierr = VecStrideGather(x,k,y,ADD_VALUES);CHKERRQ(ierr);}
   PetscFunctionReturn(0);
 }

Re: [petsc-dev] Errors in MatSetValuesBlockedLocal() when using AIJ, but not BAIJ!

2019-02-21 Thread Mills, Richard Tran via petsc-dev
On 2/21/19 1:48 PM, Smith, Barry F. wrote:





On Feb 20, 2019, at 12:50 PM, Mills, Richard Tran via petsc-dev 
<mailto:petsc-dev@mcs.anl.gov> wrote:

Folks,

I'm working with some PFLOTRAN examples, and I've been scratching my head for a 
while at errors that are being generated in MatSetValuesBlockedLocal(), related 
to new nonzeros requiring a malloc. E.g.,

[0]PETSC ERROR: Argument out of range
[0]PETSC ERROR: New nonzero at (2,0) caused a malloc

The strange thing is that all of these examples run fine when I use the 
PFLOTRAN default BAIJ matrix type (with blocks column-oriented) -- I only see 
the issue when I switch to using AIJ. If I set the matrix option 
MAT_NEW_NONZERO_ALLOCATION_ERR to PETSC_FALSE to allow things to run with AIJ, 
everything seems to work perfectly. (And once the Jacobian matrix is 
constructed the first time, the memory usage is not growing according to what I 
see from -malloc_log.)

I've spent some time poking around with a debugger in the variants of 
MatSetValues() and MatXAIJSetPreallocation(), and I haven't been able make much 
progress in determining why I'm seeing this in AIJ vs. BAIJ -- it's easy to get 
confused looking at this code, and it's going to take me more time to follow 
what is going on than I've had to spare so far. So, I've got a few questions. 
First, should it even be possible to see what I am seeing? That is, if the 
MatSetValuesBlocked() routine is not causing new allocations when using BAIJ, 
should this be possible with AIJ?




Sure, because the preallocation for BAIJ matrix is by block, while for AIJ 
it is by point (even if you provide a block size to AIJ). So long as the new 
entry is within an allocated block it will not trigger an error with BAIJ but 
may with AIJ.
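
(A small illustrative example of the difference, with hypothetical matrices A (BAIJ) and B 
(AIJ) and block size 2:

  MatSeqBAIJSetPreallocation(A,2,1,NULL);  /* 1 block per block row: all 4 scalar entries of that 2x2 block are covered */
  MatSetBlockSize(B,2);
  MatSeqAIJSetPreallocation(B,2,NULL);     /* 2 nonzeros per scalar row: every scalar column must be counted explicitly */

so an insertion that lands inside an allocated 2x2 block is always fine for A, but is only 
fine for B if the per-row counts happened to include those columns.)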

Hmm. It seems that I've misunderstood what MatXAIJSetPreallocation() is 
supposed to do. I thought that if I pass in a blocksize > 1, 
MatXAIJSetPreallocation() will turn the block-row preallocation into the 
equivalent scalar-row stuff. This seems to be what is happening in the code, 
unless I'm reading this wrong:

[...]
304: } else {    /* Convert block-row preallocation to scalar-row */
305:   PetscInt i,m,*sdnnz,*sonnz;
306:   MatGetLocalSize(A,&m,NULL);
307:   PetscMalloc2((!!dnnz)*m,&sdnnz,(!!onnz)*m,&sonnz);
308:   for (i=0; i<m; i++) {
[...]

The call chain goes through MatSetValuesBlocked() --> MatSetValues() --> 
MatSetValues_SeqAIJ(). I have trouble following all of the logic in 
matsetvaluesseqaij_(), though, so I guess I'm going to have to try harder to understand 
why MatSeqXAIJReallocateAIJ() is being hit there.

Any educated guesses about what might be going wrong would be appreciated. 
There is a lot of PETSc code I'm not familiar with involved here.

Cheers,
Richard





   Barry



(Clearly it *is* possible, but should it be?) I'm still trying to figure out if 
there is something wrong with what PFLOTRAN is doing, vs. something going wrong 
somewhere inside PETSc.

Any hints about what to look at from someone with more familiarity with this 
code would be appreciated.

--Richard







[petsc-dev] Note about CPR preconditioner in PCFIELDSPLIT man page?

2019-02-20 Thread Mills, Richard Tran via petsc-dev
Folks,

There is a note in the PCFIELDSPLIT manual page about implementing a so-called 
"Constrained Pressure Residual" (CPR) preconditioner:

"The Constrained Pressure Preconditioner (CPR) can be implemented using 
PCCOMPOSITE
 with 
PCGALERKIN.
 CPR first solves an R A P subsystem, updates the residual on all variables 
(PCCompositeSetType(pc,PC_COMPOSITE_MULTIPLICATIVE)),
 and then applies a simple ILU like preconditioner on all the variables."

I'm not sure why there is a reference to PCGALERKIN in there. Although CPR *is* 
using a Galerkin-type projection, I believe that this can all be set up on the 
command line by using PCFIELDSPLIT and PCCOMPOSITE. It seems that CPR 
preconditioners are defined in a few different ways, but I believe I can set up 
something like this in PFLOTRAN (where field 0 is pressure, 1 saturation, and 2 
is temperature) by doing something like this:

-flow_ksp_type fgmres -flow_pc_type composite \
-flow_pc_composite_type multiplicative -flow_pc_composite_pcs 
fieldsplit,bjacobi \
-flow_sub_0_ksp_type fgmres -flow_sub_0_pc_fieldsplit_type additive \
-flow_sub_0_pc_fieldsplit_0_fields 0 -flow_sub_0_pc_fieldsplit_1_fields 1,2 \
-flow_sub_0_fieldsplit_0_ksp_type richardson -flow_sub_0_fieldsplit_1_ksp_type 
preonly \
-flow_sub_0_fieldsplit_0_pc_type hypre -flow_sub_0_fieldsplit_0_pc_hypre_type 
boomeramg \
-flow_sub_0_fieldsplit_1_pc_type none -flow_sub_0_fieldsplit_1_sub_pc_type none 
\
-flow_sub_0_fieldsplit_1_ksp_max_it 10 -flow_sub_1_sub_pc_type ilu

Am I missing something? If the above setup is correct, it seems like the 
mention of PCGALERKIN is a bit confusing, since this is the PCFIELDSPLIT man 
page.

--Richard


[petsc-dev] Errors in MatSetValuesBlockedLocal() when using AIJ, but not BAIJ!

2019-02-20 Thread Mills, Richard Tran via petsc-dev
Folks,

I'm working with some PFLOTRAN examples, and I've been scratching my head for a 
while at errors that are being generated in MatSetValuesBlockedLocal(), related 
to new nonzeros requiring a malloc. E.g.,

[0]PETSC ERROR: Argument out of range
[0]PETSC ERROR: New nonzero at (2,0) caused a malloc

The strange thing is that all of these examples run fine when I use the 
PFLOTRAN default BAIJ matrix type (with blocks column-oriented) -- I only see 
the issue when I switch to using AIJ. If I set the matrix option 
MAT_NEW_NONZERO_ALLOCATION_ERR to PETSC_FALSE to allow things to run with AIJ, 
everything seems to work perfectly. (And once the Jacobian matrix is 
constructed the first time, the memory usage is not growing according to what I 
see from -malloc_log.)
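
(For reference, the workaround mentioned above is just the one-liner

  MatSetOption(J,MAT_NEW_NONZERO_ALLOCATION_ERR,PETSC_FALSE);

with J standing for the Jacobian matrix.)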

I've spent some time poking around with a debugger in the variants of 
MatSetValues() and MatXAIJSetPreallocation(), and I haven't been able make much 
progress in determining why I'm seeing this in AIJ vs. BAIJ -- it's easy to get 
confused looking at this code, and it's going to take me more time to follow 
what is going on than I've had to spare so far. So, I've got a few questions. 
First, should it even be possible to see what I am seeing? That is, if the 
MatSetValuesBlocked() routine is not causing new allocations when using BAIJ, 
should this be possible with AIJ? (Clearly it *is* possible, but should it be?) 
I'm still trying to figure out if there is something wrong with what PFLOTRAN 
is doing, vs. something going wrong somewhere inside PETSc.

Any hints about what to look at from someone with more familiarity with this 
code would be appreciated.

--Richard


Re: [petsc-dev] configure issues with new MKL and --with-mkl_sparse_optimize=0

2018-12-04 Thread Mills, Richard Tran via petsc-dev
On 12/3/18 5:43 PM, Matthew Knepley wrote:
On Mon, Dec 3, 2018 at 8:35 PM Mills, Richard Tran via petsc-dev 
<petsc-dev@mcs.anl.gov> wrote:
Sorry, everyone, for how long it took me to have time to get back to this. I 
tried the change I suggested in my previous message, but configure fails with

***
 UNABLE to CONFIGURE with GIVEN OPTIONS(see configure.log for 
details):
---
Did not find package MKL_SPARSE_OPTIMIZE needed by mkl_sparse_sp2m.
Enable the package using --with-mkl_sparse_optimize
***

Does this happen because I have specified that this is a "look for by default" 
package? What I'd like to happen is that this package be looked for by default, 
in fact, but if MKL_SPARSE_OPTIMIZE is not present or disabled, then I'd like 
the test for mkl_sparse_sp2m to fail gracefully and configure to continue. That 
is: I don't want the fact that I am looking for this package automatically to 
imply that it HAS to be there. Is there any easy way to get this behavior?

Put it in the optional packages list. Then check explicitly that it was found 
in the sp2m configure method.
Hmm. I'm not sure that I understand what you are suggesting, Matt. (Maybe 
because what I wrote wasn't clear to begin with.) The way things are set up 
right now, there is a "package", mkl_sparse_sp2m, that depends on the "package" 
mkl_sparse_optimize. I want configure to always check for mkl_sparse_optimize 
unless the user explicitly specifies not to, and I'd like the same thing for 
mkl_sparse_sp2m. But the latter cannot be used unless mkl_sparse_optimize is 
present and enabled. I'd think that the way things ought to work (not saying 
that it does work this way) is that I should be able to specify that 
mkl_sparse_sp2m requires mkl_sparse_optimize, and that I want to look for 
mkl_sparse_sp2m by default -- and configure should determine that 
mkl_sparse_sp2m should not be enabled if mkl_sparse_optimize is not 
present/enabled. I'm not sure how to make this work.

Of course, the above may be all academic, as I think it is better to try to put 
the mkl_sparse_sp2m stuff in mkl_sparse_optimize.py. mkl_sparse_sp2m is really 
just a feature associated with mkl_sparse_optimize in the more recent editions 
of MKL, and I don't think there should be two "packages" associated with this. 
I'm going to take a stab at rolling these into one package.

--Richard

  Thanks,

Matt

--Richard

On 11/27/18 2:15 PM, Smith, Barry F. wrote:

On Nov 26, 2018, at 6:12 PM, Mills, Richard Tran via petsc-dev 
<mailto:petsc-dev@mcs.anl.gov> wrote:

Hi Stefano,

Apologies for the slow reply; I was out for the US Thanksgiving Holiday.

You've found an error in the logic for the MKL configure tests: Your MKL does 
indeed have mkl_sparse_sp2m(), but since this depends on having 
mkl_sparse_optimize() and you have specified '--with-mkl_sparse_optimize=0', 
PETSC_HAVE_MKL_SPARSE_SP2M either ought to not be defined, or in my AIJMKL code 
I should only enable the SP2M code if PETSC_HAVE_MKL_SPARSE_OPTIMIZE is 
defined. I think I like the former option better. For those more familiar with 
BuildSystem than I am, I think I can do this by putting the following in 
setupDependencies() in mkl_sparse_sp2m.py:

    self.mkl_sparse_optimize = framework.require('config.packages.mkl_sparse_optimize', self)
    self.deps = [self.blasLapack,self.mkl_sparse_optimize]


Looks ok to me. Make a pull request with Satish as a reviewer.

   Barry



Is that all that is required?

--Richard

On 11/21/18 10:59 PM, Stefano Zampini wrote:


Richard,

I just noticed that PETSc master does not build with the options

'--with-mkl_pardiso-dir=/soft/com/packages/intel/18/u3/mkl',
'--with-mkl_sparse_optimize=0',

This is on frog at MCS, but it will be the same on other machines as the macros 
configuration

PETSC_HAVE_MKL_SPARSE_OPTIMIZE not defined
PETSC_HAVE_MKL_SPARSE_SP2M defined

does not seem to be supported.

/nfs2/szampini/src/petsc/src/mat/impls/aij/seq/aijmkl/aijmkl.c(797): error: 
struct "" has no field "csrA"
csrA = a->csrA;

Could you please take a look? Why is the mkl_sparse_sp2m package flagged as 
lookforbydefault in the configure scripts?

Thanks
--
Stefano




--
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/



Re: [petsc-dev] configure issues with new MKL and --with-mkl_sparse_optimize=0

2018-12-03 Thread Mills, Richard Tran via petsc-dev
Sorry, everyone, for how long it took me to have time to get back to this. I 
tried the change I suggested in my previous message, but configure fails with

***
         UNABLE to CONFIGURE with GIVEN OPTIONS (see configure.log for details):
---
Did not find package MKL_SPARSE_OPTIMIZE needed by mkl_sparse_sp2m.
Enable the package using --with-mkl_sparse_optimize
***

Does this happen because I have specified that this is a "look for by default" 
package? What I'd like to happen is that this package be looked for by default, 
in fact, but if MKL_SPARSE_OPTIMIZE is not present or disabled, then I'd like 
the test for mkl_sparse_sp2m to fail gracefully and configure to continue. That 
is: I don't want the fact that I am looking for this package automatically to 
imply that it HAS to be there. Is there any easy way to get this behavior?
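For what it is worth, my working assumption -- which may well be wrong -- is that 
lookforbydefault only controls whether configure probes for the package when the user 
says nothing on the command line, and that the hard "has to be there" behavior comes 
from listing mkl_sparse_optimize in self.deps. Something like this, where every name 
is just my reading of BuildSystem and not verified:

# Sketch of my (unverified) reading of the two knobs; not actual BuildSystem source.
import config.package

class Configure(config.package.Package):
  def __init__(self, framework):
    config.package.Package.__init__(self, framework)
    self.functions        = ['mkl_sparse_sp2m']
    self.lookforbydefault = 1  # probe even without --with-mkl_sparse_sp2m; a failed probe alone is not fatal
    self.required         = 0  # the package itself is not required ...

  def setupDependencies(self, framework):
    config.package.Package.setupDependencies(self, framework)
    self.mkl_sparse_optimize = framework.require('config.packages.mkl_sparse_optimize', self)
    self.deps = [self.mkl_sparse_optimize]  # ... but a hard dependency here is what triggers the abort above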

--Richard

On 11/27/18 2:15 PM, Smith, Barry F. wrote:





On Nov 26, 2018, at 6:12 PM, Mills, Richard Tran via petsc-dev 
<petsc-dev@mcs.anl.gov> wrote:

Hi Stefano,

Apologies for the slow reply; I was out for the US Thanksgiving Holiday.

You've found an error in the logic for the MKL configure tests: Your MKL does 
indeed have mkl_sparse_sp2m(), but since this depends on having 
mkl_sparse_optimize() and you have specified '--with-mkl_sparse_optimize=0', 
PETSC_HAVE_MKL_SPARSE_SP2M either ought to not be defined, or in my AIJMKL code 
I should only enable the SP2M code if PETSC_HAVE_MKL_SPARSE_OPTIMIZE is 
defined. I think I like the former option better. For those more familiar with 
BuildSystem than I am, I think I can do this by putting the following in 
setupDependencies() in mkl_sparse_sp2m.py:

self.mkl_sparse_optimize = framework.require('config.packages.mkl_sparse_optimize', self)
self.deps                = [self.blasLapack, self.mkl_sparse_optimize]



Looks ok to me. Make a pull request with Satish as a reviewer.

   Barry




Is that all that is required?

--Richard

On 11/21/18 10:59 PM, Stefano Zampini wrote:


Richard,

I just noticed that PETSc master does not build with the options

'--with-mkl_pardiso-dir=/soft/com/packages/intel/18/u3/mkl',
'--with-mkl_sparse_optimize=0',

This is on frog at MCS, but it will be the same on other machines, as the macro 
configuration

PETSC_HAVE_MKL_SPARSE_OPTIMIZE not defined
PETSC_HAVE_MKL_SPARSE_SP2M defined

does not seem to be supported.

/nfs2/szampini/src/petsc/src/mat/impls/aij/seq/aijmkl/aijmkl.c(797): error: 
struct "" has no field "csrA"
csrA = a->csrA;

Could you please take a look? Why is the mkl_sparse_sp2m package flagged as 
lookforbydefault in the configure scripts?

Thanks
--
Stefano










Re: [petsc-dev] configure issues with new MKL and --with-mkl_sparse_optimize=0

2018-11-26 Thread Mills, Richard Tran via petsc-dev
Oh, I should also have asked: Why are you telling PETSc to not use 
mkl_sparse_optimize()? Does it cause some sort of problem if you build with 
that on?

--Richard

On 11/21/18 10:59 PM, Stefano Zampini wrote:
Richard,

I just noticed that PETSc master does not build with the options

'--with-mkl_pardiso-dir=/soft/com/packages/intel/18/u3/mkl',
'--with-mkl_sparse_optimize=0',

This is on frog at MCS, but it will be the same on other machines, as the macro 
configuration

PETSC_HAVE_MKL_SPARSE_OPTIMIZE not defined
PETSC_HAVE_MKL_SPARSE_SP2M defined

does not seem to be supported.

/nfs2/szampini/src/petsc/src/mat/impls/aij/seq/aijmkl/aijmkl.c(797): error: 
struct "" has no field "csrA"
csrA = a->csrA;

Could you please take a look? Why is the mkl_sparse_sp2m package flagged as 
lookforbydefault in the configure scripts?

Thanks
--
Stefano



Re: [petsc-dev] configure issues with new MKL and --with-mkl_sparse_optimize=0

2018-11-26 Thread Mills, Richard Tran via petsc-dev
Hi Stefano,

Apologies for the slow reply; I was out for the US Thanksgiving Holiday.

You've found an error in the logic for the MKL configure tests: Your MKL does 
indeed have mkl_sparse_sp2m(), but since this depends on having 
mkl_sparse_optimize() and you have specified '--with-mkl_sparse_optimize=0', 
PETSC_HAVE_MKL_SPARSE_SP2M either ought to not be defined, or in my AIJMKL code 
I should only enable the SP2M code if PETSC_HAVE_MKL_SPARSE_OPTIMIZE is 
defined. I think I like the former option better. For those more familiar with 
BuildSystem than I am, I think I can do this by putting the following in 
setupDependencies() in mkl_sparse_sp2m.py:

self.mkl_sparse_optimize = framework.require('config.packages.mkl_sparse_optimize', self)
self.deps                = [self.blasLapack, self.mkl_sparse_optimize]

Is that all that is required?

--Richard

On 11/21/18 10:59 PM, Stefano Zampini wrote:
Richard,

I just noticed that PETSc master does not build with the options

'--with-mkl_pardiso-dir=/soft/com/packages/intel/18/u3/mkl',
'--with-mkl_sparse_optimize=0',

This is on frog at MCS, but it will be the same on other machines, as the macro 
configuration

PETSC_HAVE_MKL_SPARSE_OPTIMIZE not defined
PETSC_HAVE_MKL_SPARSE_SP2M defined

does not seem to be supported.

/nfs2/szampini/src/petsc/src/mat/impls/aij/seq/aijmkl/aijmkl.c(797): error: 
struct "" has no field "csrA"
csrA = a->csrA;

Could you please take a look? Why is the mkl_sparse_sp2m package flagged as 
lookforbydefault in the configure scripts?

Thanks
--
Stefano