Re: [petsc-users] GPU implementation of serial smoothers

2023-01-10 Thread Stefano Zampini
DILU in OpenFOAM is our block Jacobi with ILU subdomain solvers.
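
In PETSc options that mapping would look roughly like the following (a sketch, assuming one block per MPI rank and otherwise default subdomain settings):

  -pc_type bjacobi -sub_pc_type ilu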

On Tue, Jan 10, 2023, 23:45 Barry Smith  wrote:

>
>   The default is some kind of Jacobi plus Chebyshev; for a certain class
> of problems, it is quite good.
>
>
>
> On Jan 10, 2023, at 3:31 PM, Mark Lohry  wrote:
>
> So what are people using for GAMG configs on GPU? I was hoping petsc today
> would be performance competitive with AMGx but it sounds like that's not
> the case?
>
> On Tue, Jan 10, 2023 at 3:03 PM Jed Brown  wrote:
>
>> Mark Lohry  writes:
>>
>> > I definitely need multigrid. I was under the impression that GAMG was
>> > relatively cuda-complete, is that not the case? What functionality works
>> > fully on GPU and what doesn't, without any host transfers (aside from
>> > what's needed for MPI)?
>> >
>> > If I use -pc_type gamg -mg_levels_pc_type pbjacobi
>> -mg_levels_ksp_type
>> > richardson is that fully on device, but -mg_levels_pc_type ilu or
>> > -mg_levels_pc_type sor require transfers?
>>
>> You can do `-mg_levels_pc_type ilu`, but it'll be extremely slow (like
>> 20x slower than an operator apply). One can use Krylov smoothers, though
>> that's more synchronization. Automatic construction of operator-dependent
>> multistage smoothers for linear multigrid (because Chebyshev only works for
>> problems that have eigenvalues near the real axis) is something I've wanted
>> to develop for at least a decade, but time is always short. I might put
>> some effort into p-MG with such smoothers this year as we add DDES to our
>> scale-resolving compressible solver.
>>
>
>


Re: [petsc-users] GPU implementation of serial smoothers

2023-01-10 Thread Barry Smith

  The default is some kind of Jacobi plus Chebyshev; for a certain class of 
problems, it is quite good.
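
For concreteness, that default corresponds roughly to level-smoother options like these (a sketch; the exact GAMG defaults depend on the PETSc version):

  -mg_levels_ksp_type chebyshev -mg_levels_ksp_max_it 2 -mg_levels_pc_type jacobi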



> On Jan 10, 2023, at 3:31 PM, Mark Lohry  wrote:
> 
> So what are people using for GAMG configs on GPU? I was hoping petsc today 
> would be performance competitive with AMGx but it sounds like that's not the 
> case?
> 
> On Tue, Jan 10, 2023 at 3:03 PM Jed Brown wrote:
>> Mark Lohry writes:
>> 
>> > I definitely need multigrid. I was under the impression that GAMG was
>> > relatively cuda-complete, is that not the case? What functionality works
>> > fully on GPU and what doesn't, without any host transfers (aside from
>> > what's needed for MPI)?
>> >
>> > If I use -pc_type gamg -mg_levels_pc_type pbjacobi -mg_levels_ksp_type
>> > richardson is that fully on device, but -mg_levels_pc_type ilu or
>> > -mg_levels_pc_type sor require transfers?
>> 
>> You can do `-mg_levels_pc_type ilu`, but it'll be extremely slow (like 20x 
>> slower than an operator apply). One can use Krylov smoothers, though that's 
>> more synchronization. Automatic construction of operator-dependent 
>> multistage smoothers for linear multigrid (because Chebyshev only works for 
>> problems that have eigenvalues near the real axis) is something I've wanted 
>> to develop for at least a decade, but time is always short. I might put some 
>> effort into p-MG with such smoothers this year as we add DDES to our 
>> scale-resolving compressible solver.



Re: [petsc-users] GPU implementation of serial smoothers

2023-01-10 Thread Mark Lohry
So what are people using for GAMG configs on GPU? I was hoping petsc today
would be performance competitive with AMGx but it sounds like that's not
the case?

On Tue, Jan 10, 2023 at 3:03 PM Jed Brown  wrote:

> Mark Lohry  writes:
>
> > I definitely need multigrid. I was under the impression that GAMG was
> > relatively cuda-complete, is that not the case? What functionality works
> > fully on GPU and what doesn't, without any host transfers (aside from
> > what's needed for MPI)?
> >
> > If I use -pc_type gamg -mg_levels_pc_type pbjacobi
> -mg_levels_ksp_type
> > richardson is that fully on device, but -mg_levels_pc_type ilu or
> > -mg_levels_pc_type sor require transfers?
>
> You can do `-mg_levels_pc_type ilu`, but it'll be extremely slow (like 20x
> slower than an operator apply). One can use Krylov smoothers, though that's
> more synchronization. Automatic construction of operator-dependent
> multistage smoothers for linear multigrid (because Chebyshev only works for
> problems that have eigenvalues near the real axis) is something I've wanted
> to develop for at least a decade, but time is always short. I might put
> some effort into p-MG with such smoothers this year as we add DDES to our
> scale-resolving compressible solver.
>


Re: [petsc-users] GPU implementation of serial smoothers

2023-01-10 Thread Jed Brown
Mark Lohry  writes:

> I definitely need multigrid. I was under the impression that GAMG was
> relatively cuda-complete, is that not the case? What functionality works
> fully on GPU and what doesn't, without any host transfers (aside from
> what's needed for MPI)?
>
> If I use -pc_type gamg -mg_levels_pc_type pbjacobi -mg_levels_ksp_type
> richardson is that fully on device, but -mg_levels_pc_type ilu or
> -mg_levels_pc_type sor require transfers?

You can do `-mg_levels_pc_type ilu`, but it'll be extremely slow (like 20x 
slower than an operator apply). One can use Krylov smoothers, though that's 
more synchronization. Automatic construction of operator-dependent multistage 
smoothers for linear multigrid (because Chebyshev only works for problems that 
have eigenvalues near the real axis) is something I've wanted to develop for at 
least a decade, but time is always short. I might put some effort into p-MG 
with such smoothers this year as we add DDES to our scale-resolving 
compressible solver.
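
A Krylov smoother of the kind mentioned above can be requested with options along these lines (a sketch, not a tuned recommendation):

  -mg_levels_ksp_type gmres -mg_levels_ksp_max_it 3 -mg_levels_pc_type pbjacobi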


Re: [petsc-users] GPU implementation of serial smoothers

2023-01-10 Thread Mark Lohry
>
> BTW, on unstructured grids, coloring requires a lot of colors and thus
> many times more bandwidth (due to multiple passes) than the operator itself.


I've noticed -- in AMGx the multicolor GS was generally dramatically slower
than Jacobi because of the many colors, each with few elements.

You can use sparse triangular kernels like ILU (provided by cuBLAS), but
> they are so mindbogglingly slow that you'll go back to the drawing board
> and try to use a multigrid method of some sort with polynomial/point-block
> smoothing.
>

I definitely need multigrid. I was under the impression that GAMG was
relatively cuda-complete, is that not the case? What functionality works
fully on GPU and what doesn't, without any host transfers (aside from
what's needed for MPI)?

If I use -pc_type gamg -mg_levels_pc_type pbjacobi -mg_levels_ksp_type
richardson is that fully on device, but -mg_levels_pc_type ilu or
-mg_levels_pc_type sor require transfers?
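
One way to check empirically: with a GPU-enabled PETSc build, -log_view reports per-event CpuToGpu/GpuToCpu copy counts and sizes, so a run like the following (./app is a placeholder for the actual executable) shows whether a given smoother choice triggers host transfers:

  ./app -pc_type gamg -mg_levels_ksp_type richardson -mg_levels_pc_type pbjacobi -log_view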


On Tue, Jan 10, 2023 at 2:47 PM Jed Brown  wrote:

> The joy of GPUs. You can use sparse triangular kernels like ILU (provided
> by cuBLAS), but they are so mindbogglingly slow that you'll go back to the
> drawing board and try to use a multigrid method of some sort with
> polynomial/point-block smoothing.
>
> BTW, on unstructured grids, coloring requires a lot of colors and thus
> many times more bandwidth (due to multiple passes) than the operator itself.
>
> Mark Lohry  writes:
>
> > Well that's suboptimal. What are my options for 100% GPU solves with no
> > host transfers?
> >
> > On Tue, Jan 10, 2023, 2:23 PM Barry Smith  wrote:
> >
> >>
> >>
> >> On Jan 10, 2023, at 2:19 PM, Mark Lohry  wrote:
> >>
> >> Is DILU a point-block method? We have -pc_type pbjacobi (and vpbjacobi
> if
> >>> the node size is not uniform). These are good choices for
> scale-resolving CFD
> >>> on GPUs.
> >>>
> >>
> >> I was hoping you'd know :)  pbjacobi is underperforming ilu by a pretty
> >> wide margin on some of the systems I'm looking at.
> >>
> >> We don't have colored smoothers currently in PETSc.
> >>>
> >>
> >> So what happens under the hood when I run -mg_levels_pc_type sor on GPU?
> >> Are you actually decomposing the matrix into lower and computing updates
> >> with matrix multiplications? Or is it just the standard serial algorithm
> >> with thread safety ignored?
> >>
> >>
> >>   It is running the regular SOR on the CPU and needs to copy up the
> vector
> >> and copy down the result.
> >>
> >>
> >> On Tue, Jan 10, 2023 at 1:52 PM Barry Smith  wrote:
> >>
> >>>
> >>>   We don't have colored smoothers currently in PETSc.
> >>>
> >>> > On Jan 10, 2023, at 12:56 PM, Jed Brown  wrote:
> >>> >
> >>> > Is DILU a point-block method? We have -pc_type pbjacobi (and
> vpbjacobi
> >>> if the node size is not uniform). These are good choices for
> scale-resolving
> >>> CFD on GPUs.
> >>> >
> >>> > Mark Lohry  writes:
> >>> >
> >>> >> I'm running GAMG with CUDA, and I'm wondering how the nominally
> serial
> >>> >> smoother algorithms are implemented on GPU? Specifically SOR/GS and
> >>> ILU(0)
> >>> >> -- in e.g. AMGx these are applied by first creating a coloring, and
> the
> >>> >> smoother passes are done color by color. Is this how it's done in
> >>> petsc AMG?
> >>> >>
> >>> >> Tangential, AMGx and OpenFOAM offer something called "DILU",
> diagonal
> >>> ILU.
> >>> >> Is there an equivalent in petsc?
> >>> >>
> >>> >> Thanks,
> >>> >> Mark
> >>>
> >>>
> >>
>


Re: [petsc-users] GPU implementation of serial smoothers

2023-01-10 Thread Jed Brown
The joy of GPUs. You can use sparse triangular kernels like ILU (provided by 
cuBLAS), but they are so mindbogglingly slow that you'll go back to the drawing 
board and try to use a multigrid method of some sort with 
polynomial/point-block smoothing.

BTW, on unstructured grids, coloring requires a lot of colors and thus many 
times more bandwidth (due to multiple passes) than the operator itself.

Mark Lohry  writes:

> Well that's suboptimal. What are my options for 100% GPU solves with no
> host transfers?
>
> On Tue, Jan 10, 2023, 2:23 PM Barry Smith  wrote:
>
>>
>>
>> On Jan 10, 2023, at 2:19 PM, Mark Lohry  wrote:
>>
>> Is DILU a point-block method? We have -pc_type pbjacobi (and vpbjacobi if
>>> the node size is not uniform). These are good choices for scale-resolving CFD
>>> on GPUs.
>>>
>>
>> I was hoping you'd know :)  pbjacobi is underperforming ilu by a pretty
>> wide margin on some of the systems I'm looking at.
>>
>> We don't have colored smoothers currently in PETSc.
>>>
>>
>> So what happens under the hood when I run -mg_levels_pc_type sor on GPU?
>> Are you actually decomposing the matrix into lower and computing updates
>> with matrix multiplications? Or is it just the standard serial algorithm
>> with thread safety ignored?
>>
>>
>>   It is running the regular SOR on the CPU and needs to copy up the vector
>> and copy down the result.
>>
>>
>> On Tue, Jan 10, 2023 at 1:52 PM Barry Smith  wrote:
>>
>>>
>>>   We don't have colored smoothers currently in PETSc.
>>>
>>> > On Jan 10, 2023, at 12:56 PM, Jed Brown  wrote:
>>> >
>>> > Is DILU a point-block method? We have -pc_type pbjacobi (and vpbjacobi
>>> if the node size is not uniform). These are good choices for scale-resolving
>>> CFD on GPUs.
>>> >
>>> > Mark Lohry  writes:
>>> >
>>> >> I'm running GAMG with CUDA, and I'm wondering how the nominally serial
>>> >> smoother algorithms are implemented on GPU? Specifically SOR/GS and
>>> ILU(0)
>>> >> -- in e.g. AMGx these are applied by first creating a coloring, and the
>>> >> smoother passes are done color by color. Is this how it's done in
>>> petsc AMG?
>>> >>
>>> >> Tangential, AMGx and OpenFOAM offer something called "DILU", diagonal
>>> ILU.
>>> >> Is there an equivalent in petsc?
>>> >>
>>> >> Thanks,
>>> >> Mark
>>>
>>>
>>


Re: [petsc-users] GPU implementation of serial smoothers

2023-01-10 Thread Mark Lohry
Well that's suboptimal. What are my options for 100% GPU solves with no
host transfers?

On Tue, Jan 10, 2023, 2:23 PM Barry Smith  wrote:

>
>
> On Jan 10, 2023, at 2:19 PM, Mark Lohry  wrote:
>
> Is DILU a point-block method? We have -pc_type pbjacobi (and vpbjacobi if
>> the node size is not uniform). These are good choices for scale-resolving CFD
>> on GPUs.
>>
>
> I was hoping you'd know :)  pbjacobi is underperforming ilu by a pretty
> wide margin on some of the systems I'm looking at.
>
> We don't have colored smoothers currently in PETSc.
>>
>
> So what happens under the hood when I run -mg_levels_pc_type sor on GPU?
> Are you actually decomposing the matrix into lower and computing updates
> with matrix multiplications? Or is it just the standard serial algorithm
> with thread safety ignored?
>
>
>   It is running the regular SOR on the CPU and needs to copy up the vector
> and copy down the result.
>
>
> On Tue, Jan 10, 2023 at 1:52 PM Barry Smith  wrote:
>
>>
>>   We don't have colored smoothers currently in PETSc.
>>
>> > On Jan 10, 2023, at 12:56 PM, Jed Brown  wrote:
>> >
>> > Is DILU a point-block method? We have -pc_type pbjacobi (and vpbjacobi
>> if the node size is not uniform). These are good choices for scale-resolving
>> CFD on GPUs.
>> >
>> > Mark Lohry  writes:
>> >
>> >> I'm running GAMG with CUDA, and I'm wondering how the nominally serial
>> >> smoother algorithms are implemented on GPU? Specifically SOR/GS and
>> ILU(0)
>> >> -- in e.g. AMGx these are applied by first creating a coloring, and the
>> >> smoother passes are done color by color. Is this how it's done in
>> petsc AMG?
>> >>
>> >> Tangential, AMGx and OpenFOAM offer something called "DILU", diagonal
>> ILU.
>> >> Is there an equivalent in petsc?
>> >>
>> >> Thanks,
>> >> Mark
>>
>>
>


Re: [petsc-users] GPU implementation of serial smoothers

2023-01-10 Thread Barry Smith


> On Jan 10, 2023, at 2:19 PM, Mark Lohry  wrote:
> 
>> Is DILU a point-block method? We have -pc_type pbjacobi (and vpbjacobi if 
>> the node size is not uniform). These are good choices for scale-resolving CFD 
>> on GPUs.
> 
> I was hoping you'd know :)  pbjacobi is underperforming ilu by a pretty wide 
> margin on some of the systems I'm looking at.
> 
>> We don't have colored smoothers currently in PETSc.
> 
> So what happens under the hood when I run -mg_levels_pc_type sor on GPU? Are 
> you actually decomposing the matrix into lower and computing updates with 
> matrix multiplications? Or is it just the standard serial algorithm with 
> thread safety ignored?

  It is running the regular SOR on the CPU and needs to copy up the vector and 
copy down the result.
> 
> On Tue, Jan 10, 2023 at 1:52 PM Barry Smith wrote:
>> 
>>   We don't have colored smoothers currently in PETSc.
>> 
>> > On Jan 10, 2023, at 12:56 PM, Jed Brown wrote:
>> > 
>> > Is DILU a point-block method? We have -pc_type pbjacobi (and vpbjacobi if 
>> > the node size is not uniform). These are good choices for scale-resolving 
>> > CFD on GPUs.
>> > 
>> > Mark Lohry writes:
>> > 
>> >> I'm running GAMG with CUDA, and I'm wondering how the nominally serial
>> >> smoother algorithms are implemented on GPU? Specifically SOR/GS and ILU(0)
>> >> -- in e.g. AMGx these are applied by first creating a coloring, and the
>> >> smoother passes are done color by color. Is this how it's done in petsc 
>> >> AMG?
>> >> 
>> >> Tangential, AMGx and OpenFOAM offer something called "DILU", diagonal ILU.
>> >> Is there an equivalent in petsc?
>> >> 
>> >> Thanks,
>> >> Mark
>> 



Re: [petsc-users] GPU implementation of serial smoothers

2023-01-10 Thread Mark Lohry
>
> Is DILU a point-block method? We have -pc_type pbjacobi (and vpbjacobi if
> the node size is not uniform). These are good choices for scale-resolving CFD
> on GPUs.
>

I was hoping you'd know :)  pbjacobi is underperforming ilu by a pretty
wide margin on some of the systems I'm looking at.

We don't have colored smoothers currently in PETSc.
>

So what happens under the hood when I run -mg_levels_pc_type sor on GPU?
Are you actually decomposing the matrix into lower/upper triangular parts and computing updates
with matrix multiplications? Or is it just the standard serial algorithm
with thread safety ignored?

On Tue, Jan 10, 2023 at 1:52 PM Barry Smith  wrote:

>
>   We don't have colored smoothers currently in PETSc.
>
> > On Jan 10, 2023, at 12:56 PM, Jed Brown  wrote:
> >
> > Is DILU a point-block method? We have -pc_type pbjacobi (and vpbjacobi
> if the node size is not uniform). These are good choices for scale-resolving
> CFD on GPUs.
> >
> > Mark Lohry  writes:
> >
> >> I'm running GAMG with CUDA, and I'm wondering how the nominally serial
> >> smoother algorithms are implemented on GPU? Specifically SOR/GS and
> ILU(0)
> >> -- in e.g. AMGx these are applied by first creating a coloring, and the
> >> smoother passes are done color by color. Is this how it's done in petsc
> AMG?
> >>
> >> Tangential, AMGx and OpenFOAM offer something called "DILU", diagonal
> ILU.
> >> Is there an equivalent in petsc?
> >>
> >> Thanks,
> >> Mark
>
>


Re: [petsc-users] GPU implementation of serial smoothers

2023-01-10 Thread Barry Smith


  We don't have colored smoothers currently in PETSc.

> On Jan 10, 2023, at 12:56 PM, Jed Brown  wrote:
> 
> Is DILU a point-block method? We have -pc_type pbjacobi (and vpbjacobi if the 
> node size is not uniform). These are good choices for scale-resolving CFD on 
> GPUs.
> 
> Mark Lohry  writes:
> 
>> I'm running GAMG with CUDA, and I'm wondering how the nominally serial
>> smoother algorithms are implemented on GPU? Specifically SOR/GS and ILU(0)
>> -- in e.g. AMGx these are applied by first creating a coloring, and the
>> smoother passes are done color by color. Is this how it's done in petsc AMG?
>> 
>> Tangential, AMGx and OpenFOAM offer something called "DILU", diagonal ILU.
>> Is there an equivalent in petsc?
>> 
>> Thanks,
>> Mark



Re: [petsc-users] GPU implementation of serial smoothers

2023-01-10 Thread Jed Brown
Is DILU a point-block method? We have -pc_type pbjacobi (and vpbjacobi if the 
node size is not uniform). These are good choices for scale-resolving CFD on GPUs.
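
A minimal usage sketch of the two, assuming the block structure has already been attached to the matrix (MatSetBlockSize for a uniform node size, MatSetVariableBlockSizes for a non-uniform one):

  -pc_type pbjacobi    (uniform point blocks)
  -pc_type vpbjacobi   (variable-sized point blocks)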

Mark Lohry  writes:

> I'm running GAMG with CUDA, and I'm wondering how the nominally serial
> smoother algorithms are implemented on GPU? Specifically SOR/GS and ILU(0)
> -- in e.g. AMGx these are applied by first creating a coloring, and the
> smoother passes are done color by color. Is this how it's done in petsc AMG?
>
> Tangential, AMGx and OpenFOAM offer something called "DILU", diagonal ILU.
> Is there an equivalent in petsc?
>
> Thanks,
> Mark


[petsc-users] GPU implementation of serial smoothers

2023-01-10 Thread Mark Lohry
I'm running GAMG with CUDA, and I'm wondering how the nominally serial
smoother algorithms are implemented on GPU? Specifically SOR/GS and ILU(0)
-- in e.g. AMGx these are applied by first creating a coloring, and the
smoother passes are done color by color. Is this how it's done in petsc AMG?

Tangential, AMGx and OpenFOAM offer something called "DILU", diagonal ILU.
Is there an equivalent in petsc?

Thanks,
Mark
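
For readers unfamiliar with DILU: as I understand the OpenFOAM/AMGx variant, the factorization modifies only the diagonal, and the preconditioner applied is M = (D* + L) D*^{-1} (D* + U), where L and U are the strict lower/upper triangles of A and D* is the modified diagonal. A self-contained C sketch for a CSR matrix follows (illustrative only; it assumes a structurally symmetric nonzero pattern and nonzero pivots, and the function names are hypothetical):

/* Sketch of DILU (diagonal ILU) for a CSR matrix: only the diagonal is
 * modified, giving M = (D* + L) D*^{-1} (D* + U). Not PETSc/OpenFOAM code. */

/* Return A(row,col) from CSR storage, or 0 if the entry is absent. */
static double csr_entry(const int *rowptr, const int *colind,
                        const double *val, int row, int col)
{
    for (int k = rowptr[row]; k < rowptr[row + 1]; k++)
        if (colind[k] == col) return val[k];
    return 0.0;
}

/* Setup: modified diagonal d[i] = a_ii - sum_{j<i} a_ij * a_ji / d[j]. */
void dilu_setup(int n, const int *rowptr, const int *colind,
                const double *val, double *d)
{
    for (int i = 0; i < n; i++)
        d[i] = csr_entry(rowptr, colind, val, i, i);
    for (int i = 0; i < n; i++)
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++) {
            int j = colind[k];
            if (j < i)  /* lower entry a_ij, paired with a_ji */
                d[i] -= val[k] * csr_entry(rowptr, colind, val, j, i) / d[j];
        }
}

/* Apply w = M^{-1} r: forward solve (D* + L) t = r, then backward solve
 * (D* + U) w = D* t; the vector w holds t in place during the sweeps. */
void dilu_apply(int n, const int *rowptr, const int *colind,
                const double *val, const double *d,
                const double *r, double *w)
{
    for (int i = 0; i < n; i++) {
        double s = r[i];
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            if (colind[k] < i) s -= val[k] * w[colind[k]];
        w[i] = s / d[i];
    }
    for (int i = n - 1; i >= 0; i--) {
        double s = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            if (colind[k] > i) s += val[k] * w[colind[k]];
        w[i] -= s / d[i];
    }
}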