Dear Sara,
at least the error message is clear now: there's no memory left on the GPU.
You could have guessed this in advance by inspecting the first lines of
the output where the memory estimator reports:
Estimated static dynamical RAM per process > 652.61 MB
Estimated max dynamical RAM per process > 16.82 GB
Estimated total dynamical RAM > 1210.88 GB
The second entry is the important one: you have one process per GPU and
16 GB of memory on each card. Although the estimate is for RAM, it is
generally a good guess for the GPU memory as well, and here 16.82 GB
already exceeds the 16 GB available.
Try using fewer pools, so that the data for each k point is spread over
more processes (or use more nodes if you desperately need this to run
fast).
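
If you want to automate this check, here is a minimal sketch in Python. The
three report lines, the 16 GB per card and the one-process-per-GPU layout are
taken from above; the function names and the factor of 1024 between MB and GB
are my own assumptions, so treat it as an illustration rather than a tool
shipped with QE.

```python
import re

# Lines copied from the memory estimator report quoted above.
REPORT = """\
Estimated static dynamical RAM per process > 652.61 MB
Estimated max dynamical RAM per process > 16.82 GB
Estimated total dynamical RAM > 1210.88 GB
"""

# Assumption: 1 GB = 1024 MB in the report.
UNIT_TO_GB = {"MB": 1.0 / 1024.0, "GB": 1.0}

LINE_RE = re.compile(
    r"Estimated (static|max|total) dynamical RAM(?: per process)?"
    r"\s*>\s*([\d.]+)\s*(MB|GB)"
)


def parse_estimates(text):
    """Return the estimates found in `text`, in GB, keyed by static/max/total."""
    found = {}
    for kind, value, unit in LINE_RE.findall(text):
        found[kind] = float(value) * UNIT_TO_GB[unit]
    return found


def fits_on_gpu(max_per_process_gb, gpu_memory_gb, processes_per_gpu=1):
    """Rough test: does the per-process estimate fit in the GPU memory?"""
    return max_per_process_gb * processes_per_gpu <= gpu_memory_gb


est = parse_estimates(REPORT)
print("max per process (GB):", est["max"])
print("fits on a 16 GB card:", fits_on_gpu(est["max"], gpu_memory_gb=16.0))
# total / max gives back the number of MPI processes of the run.
print("implied processes:", round(est["total"] / est["max"]))
```

The last line is only a consistency check: the total divided by the
per-process maximum matches the 72 MPI processes reported in your first
message.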
Best,
Pietro
On 8/31/20 6:54 PM, Sara Postorino wrote:
Thank you for your response.
I ran it again with 6.5 (I couldn't install 6.6a1); it uses the serial
eigensolver.
Now I get:
Band Structure Calculation
Davidson diagonalization with overlap
Computing kpt #: 1 of 9 on this pool
Really copied g2kin H->D
Really copied evc H->D
Really copied et H->D
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Error in routine cegterg (1):
cannot allocate vc_d
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
stopping ...
I attach the input and output.
I'll put the rest on GitLab.
Thank you,
Sara
On Sun, 30 Aug 2020 at 23:18, Pietro Bonfa
<[email protected]> wrote:
Dear Sara,
I'd suggest checking the following:
1. verify that the serial eigensolver is used (it's written at the
beginning of the output);
2. use the latest version (6.6a1) that will correctly report problems
with memory allocations during the iterative diagonalization.
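
For point 1, a small Python sketch that scans the output text for the
eigensolver message. The phrases it searches for ("a serial algorithm will be
used", "distributed-memory algorithm") are what I recall pw.x printing near
the top of the output; treat the exact wording, the function name and the
sample text as assumptions and compare them with your own output file.

```python
def eigensolver_kind(output_text):
    """Classify the subspace eigensolver reported in a pw.x output.

    Returns "serial", "parallel" or "unknown". The searched phrases are
    assumed wordings, not taken from this thread.
    """
    lowered = output_text.lower()
    if "a serial algorithm will be used" in lowered:
        return "serial"
    if "distributed-memory algorithm" in lowered:
        return "parallel"
    return "unknown"


# Hypothetical excerpt, written by hand for illustration only.
sample = """
     Subspace diagonalization in iterative solution of the eigenvalue problem:
     a serial algorithm will be used
"""
print(eigensolver_kind(sample))
```

In practice you would read the real output file into a string and pass that
to the function.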
Could you please also open an issue at
https://gitlab.com/QEF/q-e-gpu/-/issues
and attach the input, the pseudopotentials and the job script that you
are using?
Thank you,
kind regards,
Pietro
On 8/29/20 6:33 PM, Sara Postorino wrote:
> Hi QE users,
>
> I am running PW on Marconi100 and experiencing problems during
> diagonalization. I am using version 6.5 (autoload of the modules on m100).
> My system is a MoTe2 bilayer with a 39x39x1 k mesh and many bands, because
> I will do a GW calculation on top of it. (The calculation
> works if I do not add many bands.)
> I tried with 4000 and 3000 bands using Davidson diagonalization, running
> on 18 nodes:
> Parallel version (MPI & OpenMP), running on 2304 processor cores
> Number of MPI processes: 72
> Threads/MPI process: 32
> When doing the calculation of the first k point I get:
>
> Really copied g2kin H->D
> Really copied evc H->D
> Really copied et H->D
> Really copied vrs H->D
> dp_memcpy_d2h_c2dinvalid pitch argument 12
>
> I also tried the conjugate gradient algorithm, but it gets stuck at
>
> Really copied evc H->D
> Really copied et H->D
> Really copied h_diag H->D
> Really copied becp%nc H->D
> Really copied g2kin H->D
> Really copied vrs H->D
>
> And here it takes forever. I left it running for more than 1 hour and it
> didn't finish one k point, and since I have 147 k points the computation
> would be very expensive even if it worked.
>
> I also tried to go down to 1000 bands (I need way more) and got:
> Really copied g2kin H->D
> Really copied evc H->D
> Really copied et H->D
> Really copied vrs H->D
> zhegvdx_gpu error: cusolverDnZpotrf failed!
>
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
> Error in routine cdiaghg_gpu (1):
> zhegvdx_gpu failed
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>
> Do you have any suggestion on how to fix this issue?
> Thanks
>
> Sara Postorino
> PhD student
> University of Rome Tor Vergata
>
> _______________________________________________
> Quantum ESPRESSO is supported by MaX (http://www.max-centre.eu/quantum-espresso)
> users mailing list [email protected]
> https://lists.quantum-espresso.org/mailman/listinfo/users
Sign over your 5 per mille to the University of Parma and help our
students who want to study abroad: enter 00308780345 in your tax
return.