Dear Quantum ESPRESSO Developers and Community,

I am writing to report a persistent runtime error in the GPU-accelerated 
version of ph.x (Quantum ESPRESSO v7.5) when calculating electron-phonon 
coefficients using the OpenACC port.

While the code successfully calculates the dynamical matrices and frequencies 
on the GPU, it consistently crashes during the final electron-phonon 
interaction step (routine elphon) with a file I/O error related to the 
temporary file a2Fsave.

1. System and Compilation Details:

Version: Quantum ESPRESSO v7.5 (GitLab release)

Compiler: NVIDIA HPC SDK v24.9

Configuration: ./configure --enable-openacc --with-cuda=yes --with-cuda-cc=89 
--with-cuda-runtime=12.6

Hardware: NVIDIA RTX 4090 (Ada Lovelace)

MPI: OpenMPI (via NVIDIA HPC SDK)

2. The Issue: When running ph.x with electron_phonon = 'interpolated' (or any 
mode that triggers elphon), the execution aborts immediately after 
diagonalizing the dynamical matrix for the first q-point. The crash occurs 
regardless of the MPI parallelization level (reproduced with both -np 1 and -np 
8).

3. Error Log: The crash points to a read error in elphon.f90 attempting to read 
a file that appears to be empty or not flushed to disk.

 FIO-F-217/list-directed read/unit=40/attempt to read past end of file.
 File name = './out/mgb2.a2Fsave',   formatted, sequential access   record = 1
 In source file /path/to/q-e/PHonon/PH/elphon.f90, at line number 847
 File name = './out/mgb2.a2Fsave',   formatted, sequential access   record = 1
 In source file /path/to/q-e/PHonon/PH/elphon.f90, at line number 847

4. Reproduction Case (MgB2): I reproduced this using a standard MgB2 test case.

Input snippet (ph.in):

&INPUTPH
  tr2_ph   = 1.0d-14,
  prefix   = 'mgb2',
  outdir   = './out',
  fildyn   = 'mgb2.dyn',
  fildvscf = 'mgb2.dvscf',
  electron_phonon = 'interpolated',  ! <--- Triggers the crash
  trans    = .true.,
  ldisp    = .true.,
  nq1=6, nq2=6, nq3=4
/
5. Observations:

Pure phonons work: If I comment out electron_phonon, the GPU run finishes 
successfully and writes the .dyn and .dvscf files.

CPU Works: The exact same input runs successfully on the CPU-only binary 
(gfortran compilation).

File Incompatibility: I attempted to run the heavy phonon calculation on the 
GPU and the final electron-phonon collection on the CPU (using recover=.true. 
or trans=.false.), but the CPU binary cannot read the GPU-generated 
.dvscf/binary files ("problems reading u" error), likely due to binary 
format/padding differences between nvfortran and gfortran.
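
One way to test the binary-format suspicion is to walk the record markers of a file written by each compiler and compare the results. The following is a minimal Python sketch, not part of Quantum ESPRESSO. It assumes the file is Fortran sequential unformatted with 4-byte little-endian record markers. If the file is direct access, it has no markers and this check does not apply. The helper name walk_records is made up for illustration.

```python
import os
import struct
import sys


def walk_records(path, marker_fmt="<i"):
    """Return the payload length of every record in a Fortran sequential
    unformatted file (hypothetical helper).

    Assumes each record is framed as: marker, payload, marker, where both
    markers hold the payload length in bytes. Raises ValueError as soon as
    the framing is inconsistent.
    """
    size = struct.calcsize(marker_fmt)
    total = os.path.getsize(path)
    lengths = []
    pos = 0
    with open(path, "rb") as fh:
        while pos < total:
            head = fh.read(size)
            if len(head) < size:
                raise ValueError(f"truncated leading marker at byte {pos}")
            (n,) = struct.unpack(marker_fmt, head)
            end = pos + size + n + size
            if n < 0 or end > total:
                raise ValueError(f"implausible record length {n} at byte {pos}")
            fh.seek(n, 1)  # skip the payload without loading it
            (m,) = struct.unpack(marker_fmt, fh.read(size))
            if m != n:
                raise ValueError(f"marker mismatch at byte {pos}: {n} vs {m}")
            lengths.append(n)
            pos = end
    return lengths


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the file to inspect on the command line.
    print(walk_records(sys.argv[1]))
```

If the GPU-built and CPU-built binaries produce different record lengths for the same file, that would support a layout difference. Identical lengths would suggest the problem lies elsewhere.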

It appears there is a race condition or file handling issue in the OpenACC 
implementation of the elphon routine where the a2Fsave file is read before it 
is successfully written/closed.
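
To narrow down this hypothesis, a quick check of the file's state right after the crash may help. The following is a minimal Python sketch, not part of Quantum ESPRESSO. The helper name describe_file is made up for illustration, and the default path is the one reported in the error log.

```python
import os
import sys
import time


def describe_file(path):
    """Report whether a file exists, its size, its first line, and when it
    was last modified (hypothetical helper for post-crash inspection)."""
    if not os.path.exists(path):
        return {"exists": False, "size": 0, "first_line": None, "mtime": None}
    st = os.stat(path)
    with open(path, "r", errors="replace") as fh:
        first = fh.readline().rstrip("\n")
    return {
        "exists": True,
        "size": st.st_size,
        "first_line": first or None,  # None when the file is empty
        "mtime": time.ctime(st.st_mtime),
    }


if __name__ == "__main__":
    # Default path taken from the FIO-F-217 message in the error log.
    target = sys.argv[1] if len(sys.argv) > 1 else "./out/mgb2.a2Fsave"
    for key, value in describe_file(target).items():
        print(f"{key}: {value}")
```

A missing file, a zero-byte file, and a file with content each point to a different cause, and the modification time indicates whether the file was touched during the ph.x run at all.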

Any advice on a workaround or a patch for elphon.f90 to stabilize the GPU I/O 
would be greatly appreciated.

Thank you for your time and for developing this software.

Best regards,

Dholon Kumar Paul

Research Assistant, BRAC University, Bangladesh
_______________________________________________________________________________
The Quantum ESPRESSO Foundation stands in solidarity with all civilians 
worldwide who are victims of terrorism, military aggression, and indiscriminate 
warfare.
--------------------------------------------------------------------------------
Quantum ESPRESSO is supported by MaX (www.max-centre.eu)
users mailing list [email protected]
https://lists.quantum-espresso.org/mailman/listinfo/users
