A careful look at the error message reveals that you are running out of space for MPI communicators, for which a fixed maximum number (16384) seems to be allowed. This hints at a problem somewhere in the code: communicators are created with MPI_Comm_split() (which MPI_Cart_sub() uses internally) and are not properly freed afterwards.
Axel.

On Mon, Dec 10, 2018 at 6:53 AM Alex.Durie <[email protected]> wrote:
>
> The problem seems to crop up when a minimum of 8 processors is used. As a
> quick and easily accessible test, I tried it on example08 of the Wannier90
> examples with the following command:
>
>     mpirun -np 8 pw.x -i iron.scf > scf.out
>
> and the same problem occurred. I am using PWSCF v.6.3 with the Intel
> Parallel Studio 2016 suite. PW was built using all Intel compilers,
> Intel MPI, and MKL.
>
> Many thanks,
>
> Alex
>
> Date: Sun, 9 Dec 2018 21:26:31 +0100
> From: Paolo Giannozzi <[email protected]>
> To: Quantum Espresso users Forum <[email protected]>
> Subject: Re: [QE-users] MPI error in pw.x
>
> If it is not a problem of your compiler or MPI libraries, it can only be
> the usual problem of irreproducibility of results on different processors.
> In order to figure this out, one needs as a strict minimum some information
> on which exact version exhibits the problem, under which exact
> circumstances (e.g. mpirun -np ...), and an input that can be run in a
> reasonable amount of time on a reasonably small machine.
> Paolo
>
> On Sat, Dec 8, 2018 at 9:55 PM Alex.Durie <[email protected]> wrote:
> >
> > Dear experts,
> >
> > I have been running pw.x with multiple processes quite successfully.
> > However, when the number of processes is high enough that the space
> > group has more than 7 processes, so that the subspace diagonalization
> > no longer uses a serial algorithm, the program crashes abruptly at
> > about the 10th iteration with the following errors:
> >
> >   Fatal error in PMPI_Cart_sub: Other MPI error, error stack:
> >   PMPI_Cart_sub(242)...................: MPI_Cart_sub(comm=0xc400fcf3,
> >       remain_dims=0x7ffe0b27a6e8, comm_new=0x7ffe0b27a640) failed
> >   PMPI_Cart_sub(178)...................:
> >   MPIR_Comm_split_impl(270)............:
> >   MPIR_Get_contextid_sparse_group(1330): Too many communicators (0/16384
> >       free on this process; ignore_id=0)
> >
> >   [the same "Too many communicators" error stack is printed by three
> >   further ranks, differing only in the pointer addresses]
> >
> >   forrtl: error (69): process interrupted (SIGINT)
> >   Image             PC                Routine             Line     Source
> >   pw.x              0000000000EAAC45  Unknown             Unknown  Unknown
> >   pw.x              0000000000EA8867  Unknown             Unknown  Unknown
> >   pw.x              0000000000E3DC64  Unknown             Unknown  Unknown
> >   pw.x              0000000000E3DA76  Unknown             Unknown  Unknown
> >   pw.x              0000000000DC41B6  Unknown             Unknown  Unknown
> >   pw.x              0000000000DCBB2E  Unknown             Unknown  Unknown
> >   libpthread.so.0   00002BA339B746D0  Unknown             Unknown  Unknown
> >   libmpi.so.12      00002BA3390A345F  Unknown             Unknown  Unknown
> >   libmpi.so.12      00002BA3391AEE39  Unknown             Unknown  Unknown
> >   libmpi.so.12      00002BA3391AEB32  Unknown             Unknown  Unknown
> >   libmpi.so.12      00002BA3390882F9  Unknown             Unknown  Unknown
> >   libmpi.so.12      00002BA339087D5D  Unknown             Unknown  Unknown
> >   libmpi.so.12      00002BA339087BDC  Unknown             Unknown  Unknown
> >   libmpi.so.12      00002BA339087B0C  Unknown             Unknown  Unknown
> >   libmpi.so.12      00002BA339089932  Unknown             Unknown  Unknown
> >   libmpifort.so.12  00002BA338C41B1C  Unknown             Unknown  Unknown
> >   pw.x              0000000000BCEE47  bcast_real_         37       mp_base.f90
> >   pw.x              0000000000BAF7E4  mp_mp_mp_bcast_rv   395      mp.f90
> >   pw.x              0000000000B6E881  pcdiaghg_           363      cdiaghg.f90
> >   pw.x              0000000000AF7304  protate_wfc_k_      256      rotate_wfc_k.f90
> >   pw.x              0000000000681E82  rotate_wfc_         64       rotate_wfc.f90
> >   pw.x              000000000064F519  diag_bands_         423      c_bands.f90
> >   pw.x              000000000064CAD4  c_bands_            99       c_bands.f90
> >   pw.x              000000000040C014  electrons_scf_      552      electrons.f90
> >   pw.x              0000000000408DBD  electrons_          146      electrons.f90
> >   pw.x              000000000057582B  run_pwscf_          132      run_pwscf.f90
> >   pw.x              0000000000406AC5  MAIN__              77       pwscf.f90
> >   pw.x              000000000040695E  Unknown             Unknown  Unknown
> >   libc.so.6         00002BA33A0A5445  Unknown             Unknown  Unknown
> >   pw.x              0000000000406869  Unknown             Unknown  Unknown
> >
> >   [three near-identical SIGINT tracebacks from the other interrupted
> >   ranks omitted; the pw.x frames are identical in all of them]
> >
> > Sample output below:
> >
> >   Parallel version (MPI), running on 16 processors
> >
> >   MPI processes distributed on 1 nodes
> >   R & G space division:  proc/nbgrp/npool/nimage = 16
> >
> >   Reading cobalt.scf
> >   Message from routine read_cards:
> >       DEPRECATED: no units specified in ATOMIC_POSITIONS card
> >   Message from routine read_cards:
> >       ATOMIC_POSITIONS: units set to alat
> >
> >   Current dimensions of program PWSCF are:
> >   Max number of different atomic species (ntypx) = 10
> >   Max number of k-points (npk) = 40000
> >   Max angular momentum in pseudopotentials (lmaxx) = 3
> >
> >   Presently no symmetry can be used with
> >   electric field
> >
> >   file Co.pz-n-kjpaw_psl.1.0.0.UPF: wavefunction(s) 4S 3D renormalized
> >
> >   Subspace diagonalization in iterative solution of the eigenvalue problem:
> >   one sub-group per band group will be used
> >   scalapack distributed-memory algorithm (size of sub-group: 2*2 procs)
> >
> >   Parallelization info
> >   --------------------
> >   sticks:   dense  smooth     PW     G-vecs:    dense   smooth      PW
> >   Min          13      13      4                2449     2449     462
> >   Max          14      14      5                2516     2516     527
> >   Sum         221     221     69               39945    39945    7777
> >
> > Many thanks,
> >
> > Alex Durie
> > PhD student
> > Open University
> > United Kingdom
> > _______________________________________________
> > users mailing list
> > [email protected]
> > https://lists.quantum-espresso.org/mailman/listinfo/users
>
> --
> Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
> Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
> Phone +39-0432-558216, fax +39-0432-558222

--
Dr. Axel Kohlmeyer  [email protected]  http://goo.gl/1wk0
College of Science & Technology, Temple University, Philadelphia PA, USA
International Centre for Theoretical Physics, Trieste, Italy.
