Have you checked case.output0000 or case.scf0?
Do they look ok?
Is there a reasonable line :DEN in scf0?
If yes, it seems that lapw0_mpi (and thus mpi + fftw2/3) works.
lapw1_mpi requires, besides mpi, also ScaLAPACK. This is included in Intel's MKL
with your ifort compiler.
The most crucial setting is the selection of the BLACS library: Intel
supplies special BLACS libraries for Intel MPI, Open MPI, or MVAPICH,
and you must be sure to have linked the correct one in lapw1.
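The "Invalid communicator" crashes further down in this thread are the typical symptom of such a BLACS mismatch. As a hedged illustration (the exact library names vary with the MKL version; check the files in your MKL lib directory), the ScaLAPACK/BLACS part of the link line set during siteconfig might look like:

```
# With Intel MPI (matching this cluster's setup):
-lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64
# With Open MPI the BLACS library must be swapped, otherwise MPI_Comm_size
# fails with "Invalid communicator":
-lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64
```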
PS: I assume you have been able to run this without mpi-parallelization
in sequential mode?
And I also assume you could run it in k-parallel mode?
PPS: 6 cores is not a good choice for lapw1! Try to use a square number of
cores like 16, 64, ....
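As a sketch (the hostname alcc92 is taken from this thread; adjust it to your node), a .machines file for one mpi-parallel lapw1 job on 16 cores would look like:

```
lapw0: alcc92:16
1:alcc92:16
granularity:1
extrafine:1
```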
On 26.11.2013 14:03, Natalia Pavlenko wrote:
Dear Prof. Blaha,
thanks a lot for your reply. I have corrected the .machines file
(the node with 6 cores is chosen automatically):
-----------------
lapw0: alcc92:6
1:alcc92:6
granularity:1
extrafine:1
-----------------
but nevertheless got the following output in case.dayfile:
-------case.dayfile--------
Calculating case in /alcc/gpfs1/home/exp6/pavlenna/work/laosto/ovac/case
on alcc92 with PID 9804
using WIEN2k_13.1 (Release 17/6/2013) in /alcc/gpfs1/home/exp6/pavlenna/wien
start (Tue Nov 26 13:41:14 CET 2013) with lapw0 (50/99 to go)
cycle 1 (Tue Nov 26 13:41:14 CET 2013) (50/99 to go)
lapw0 -p (13:41:15) starting parallel lapw0 at Tue Nov 26 13:41:15 CET 2013
-------- .machine0 : 6 processors
0.024u 0.024s 0:12.00 0.3% 0+0k 1632+8io 6pf+0w
lapw1 -up -p (13:41:27) starting parallel lapw1 at Tue Nov 26 13:41:27
CET 2013
-> starting parallel LAPW1 jobs at Tue Nov 26 13:41:27 CET 2013
running LAPW1 in parallel mode (using .machines)
1 number_of_parallel_jobs
alcc92 alcc92 alcc92 alcc92 alcc92 alcc92(6) 0.016u 0.004s 0:00.75 1.3%
0+0k 0+8io 0pf+0w
Summary of lapw1para:
alcc92 k=0 user=0 wallclock=0
0.068u 0.020s 0:02.19 3.6% 0+0k 0+104io 0pf+0w
lapw1 -dn -p (13:41:29) starting parallel lapw1 at Tue Nov 26 13:41:29
CET 2013
-> starting parallel LAPW1 jobs at Tue Nov 26 13:41:29 CET 2013
running LAPW1 in parallel mode (using .machines.help)
1 number_of_parallel_jobs
alcc92 alcc92 alcc92 alcc92 alcc92 alcc92(6) 0.020u 0.004s 0:00.42 4.7%
0+0k 0+8io 0pf+0w
Summary of lapw1para:
alcc92 k=0 user=0 wallclock=0
0.072u 0.028s 0:02.11 4.2% 0+0k 0+104io 0pf+0w
lapw2 -up -p (13:41:31) running LAPW2 in parallel mode
** LAPW2 crashed!
0.248u 0.012s 0:00.73 34.2% 0+0k 8+16io 0pf+0w
error: command /alcc/gpfs1/home/exp6/pavlenna/wien/lapw2para -up uplapw2.def
failed
stop error
---------------------------------
In the uplapw2.err I have the following error messages:
Error in LAPW2
'LAPW2' - can't open unit: 30
'LAPW2' - filename: case.energyup_1
** testerror: Error in Parallel LAPW2
-----------------
and the following error output messages:
------------------
starting on alcc92
LAPW0 END
LAPW0 END
LAPW0 END
LAPW0 END
LAPW0 END
LAPW0 END
Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
PMPI_Comm_size(76).: Invalid communicator
Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
PMPI_Comm_size(76).: Invalid communicator
Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
PMPI_Comm_size(76).: Invalid communicator
Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
PMPI_Comm_size(76).: Invalid communicator
Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
PMPI_Comm_size(76).: Invalid communicator
Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
PMPI_Comm_size(76).: Invalid communicator
case.scf1up_1: No such file or directory.
grep: No match.
Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
PMPI_Comm_size(76).: Invalid communicator
Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
PMPI_Comm_size(76).: Invalid communicator
Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
PMPI_Comm_size(76).: Invalid communicator
Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
PMPI_Comm_size(76).: Invalid communicator
Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
PMPI_Comm_size(76).: Invalid communicator
Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
PMPI_Comm_size(76).: Invalid communicator
case.scf1dn_1: No such file or directory.
grep: No match.
FERMI - Error
cp: cannot stat `.in.tmp': No such file or directory
stop error
-----------------------------------
Please let me know if perhaps something is wrong in the MPI configuration.
I have Intel MPI installed on the cluster.
Best regards, N.Pavlenko
On 2013-11-23 13:07, Peter Blaha wrote:
You completely misunderstand how parallelization in WIEN2k works.
Please read the UG carefully (parallelization), and also note the
k-parallel and mpi-parallel options and for which cases they are useful.
I'm not familiar with "Slurm", but it looks as if you ran the same
sequential job 6 times in parallel, overwriting the generated files all the time.
I have a problem with parallel run of Wien2k 13.1 on a cluster
with Slurm Environment+ Intel mpi.
In a test run for 1 node with 6 cpu cores, I
generated the following .machines file:
-------.machines
#
lapw0:alcc69
1:alcc69:6
granularity:1
extrafine:1
This is ok, except for the lapw0 line, which would not run in parallel. Use
lapw0:alcc69:6
and used the following command in the script:
srun -n 6 runsp_lapw -NI -cc 0.0001 -i 50
You are running "runsp_lapw ..." 6 times.
WIEN2k spawns its parallelization itself (provided you have properly
installed WIEN2k and specified the proper "mpirun ..." command during
siteconfig), but you must add the -p flag.
So the single command
runsp_lapw -NI -cc 0.0001 -i 50 -p
should itself start 6 parallel jobs (mpi-parallel, with your .machines file).
(You only need to have permission to do so.)
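As a hedged sketch (the #SBATCH resource lines are assumptions; adapt them to your cluster), the Slurm job script would then contain a single call:

```
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=6
# Do NOT use "srun -n 6 runsp_lapw ...": WIEN2k starts its own mpirun
# processes according to .machines. One call with -p is enough:
runsp_lapw -NI -cc 0.0001 -i 50 -p
```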
In the first cycle, lapw0, lapw1 and lapw2 finish successfully,
but after that lcore and mixer continue to run in parallel mode;
they intermix with lapw0 from the second cycle and cause a crash,
as can be seen from the output in case.dayfile:
--------------------------------------------------------------
cycle 1 (Fri Nov 22 15:32:51 CET 2013) (50/99 to go)
lapw0 (15:32:51) > lapw0 (15:32:51) > lapw0 (15:32:51) > lapw0
(15:32:51) > lapw0
(15:32:51) > lapw0 (15:32:51) 44.798u 0.244s 0:45.75 98.4% 0+0k 0+0io
0pf+0w
lapw1 -up (15:33:37)
lapw1 -up (15:33:38)
lapw1 -up (15:33:38)
lapw1 -up (15:33:39)
lapw1 -up (15:33:39)
lapw1 -up (15:33:39) _nb in dscgst.F 512 128
_nb in dscgst.F 512 128
_nb in dscgst.F 512 128
_nb in dscgst.F 512 128
_nb in dscgst.F 512 128
_nb in dscgst.F 512 128
lapw1 -dn (16:12:48)
lapw1 -dn (16:13:25)
lapw1 -dn (16:13:29)
lapw1 -dn (16:13:30)
lapw1 -dn (16:13:42)
lapw1 -dn (16:13:47) _nb in dscgst.F 512 128
_nb in dscgst.F 512 128
_nb in dscgst.F 512 128
_nb in dscgst.F 512 128
_nb in dscgst.F 512 128
_nb in dscgst.F 512 128
lapw2 -up (17:07:01)
lapw2 -up (17:07:57)
lapw2 -up (17:08:44)
lapw2 -up (17:08:52)
lapw2 -dn (17:09:00)
lapw2 -up (17:09:01)
lapw2 -up (17:09:02)
lapw2 -dn (17:09:52)
lapw2 -dn (17:10:40)
lapw2 -dn (17:10:56)
lapw2 -dn (17:11:03)
lapw2 -dn (17:11:13)
lcore -up (17:11:40) 0.124u 0.024s 0:00.33 42.4% 0+0k 0+0io 0pf+0w
lcore -dn (17:11:41) 0.120u 0.024s 0:00.30 46.6% 0+0k 0+0io 0pf+0w
mixer (17:11:42) 0.172u 0.092s 0:00.58 44.8% 0+0k 0+0io 0pf+0w
error: command /alcc/gpfs1/home/exp6/pavlenna/wien/mixer mixer.def failed
stop error
lcore -up (17:12:15) 0.132u 0.012s 0:00.20 70.0% 0+0k 0+0io 0pf+0w
lcore -dn (17:12:15) 0.128u 0.012s 0:00.20 65.0% 0+0k 0+0io 0pf+0w
mixer (17:11:42) 0.172u 0.092s 0:00.58 44.8% 0+0k 0+0io 0pf+0w
error: command /alcc/gpfs1/home/exp6/pavlenna/wien/mixer mixer.def failed
stop error
lcore -up (17:12:15) 0.132u 0.012s 0:00.20 70.0% 0+0k 0+0io 0pf+0w
lcore -dn (17:12:15) 0.128u 0.012s 0:00.20 65.0% 0+0k 0+0io 0pf+0w
mixer (17:12:16) 0.680u 0.132s 0:02.28 35.5% 0+0k 0+0io 0pf+0w
:ENERGY convergence: 0 0 0
:CHARGE convergence: 0 0.0001 0
cycle 2 (Fri Nov 22 17:12:18 CET 2013) (49/98 to go)
lapw0 (17:12:18)
lcore -up (17:12:58) 0.000u 0.008s 0:00.00 0.0% 0+0k 0+0io 0pf+0w
error: command /alcc/gpfs1/home/exp6/pavlenna/wien/lcore uplcore.def failed
stop error
lcore -up (17:13:02) 0.000u 0.008s 0:00.00 0.0% 0+0k 0+0io 0pf+0w
error: command /alcc/gpfs1/home/exp6/pavlenna/wien/lcore uplcore.def failed
stop error
------------------------------------------------------------------------------
It looks like the .machines file needs some additional details about
the calculation mode for lcore and mixer. How can I configure the
.machines properly in this case?
Best regards, N.Pavlenko
_______________________________________________
Wien mailing list
[email protected]
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:
http://www.mail-archive.com/[email protected]/index.html
--
-----------------------------------------
Peter Blaha
Inst. Materials Chemistry, TU Vienna
Getreidemarkt 9, A-1060 Vienna, Austria
Tel: +43-1-5880115671
Fax: +43-1-5880115698
email: [email protected]
-----------------------------------------