I'm trying to run a long phonon calculation using 3 nodes of a cluster, each of which has 16 processors. I'm using 15 on each node, with this command:

mpirun -np 45 --hostfile hostfile.txt ph.x < NiOHF_ordered1H7.PhG.in | tee NiOHF_ordered1H7.PhG.out 2> error.txt

The input files are attached. Before the ph.x calculation, I ran an SCF using the attached SCF input file.

My problem is that after 3 days of computation, the process gets killed with no indication of why. The dmesg -T command doesn't show any kills at the right date. How can I figure out what went wrong and what to do about it? Would running on fewer processors help?
        mam
phonons of Ni3F2(OH)4 at Gamma
 &inputph
  tr2_ph=1.0d-14,
  outdir = './out/'
  prefix = 'aiida',
  epsil=.true.,
  amass(1)=58.6934,
  amass(2)=58.6934,
  amass(3)=18.9984
  amass(4)=1.00794,
  amass(5)=15.9994,
  trans  = .true.
  fildyn='NiOHF_ordered1H7_.dynG',
 /
0.0 0.0 0.0
&CONTROL
  calculation = 'scf'
  etot_conv_thr =   1.3000000000d-04
  forc_conv_thr =   1.0000000000d-04
  outdir = './out/'
  prefix = 'aiida'
  pseudo_dir = './pseudo/'
  tprnfor = .true.
  tstress = .true.
  verbosity = 'low'
/
&SYSTEM
  degauss =   1.4699723600d-02
  ecutrho =   600
  ecutwfc =   100
  ibrav = 0
  nat = 26
  nspin = 2
  ntyp = 5
Hubbard_U(1) = 7.
Hubbard_U(2) = 7.
  occupations = 'fixed'
!  smearing = 'cold'
  lda_plus_u=.true.,
  tot_magnetization = 0.
  starting_magnetization(1) =   0.8
  starting_magnetization(2) =   -0.8
  starting_magnetization(3) =   0.
  starting_magnetization(4) =   0.
  starting_magnetization(5) =   0.
/
&ELECTRONS
  conv_thr =   1.0000000000d-07
  electron_maxstep = 3000
  mixing_beta =   2.0000000000d-01
/
ATOMIC_SPECIES
Ni1     58.6934 ni_pbesol_v1.4.uspp.F.UPF
Ni2     58.6934 ni_pbesol_v1.4.uspp.F.UPF
F      18.9984 F.oncvpsp.upf
H      1.00794 H_ONCV_PBEsol-1.0.upf
O      15.9994 O.pbesol-n-kjpaw_psl.0.1.UPF
ATOMIC_POSITIONS crystal
Ni1           0.9680183807       0.0263107459       0.9783108308
Ni1           0.3045746118       0.0209143866       0.0084271434
Ni1           0.6411601951       0.0326592337       0.0131468998
Ni2           0.0287555271       0.9790835341       0.4915771994
Ni2           0.3653117094       0.9736857834       0.5216841580
Ni2           0.6921696367       0.9673388130       0.4868535613
H            0.3664337110       0.5587673063       0.2207053022
H            0.7711523533       0.6673850125       0.2109430049
H            0.0847701418       0.5318553413       0.7085663827
H            0.4959331063       0.5608752024       0.7208617849
H            0.5621900948       0.3326319169       0.2890537903
H            0.9668980679       0.4412345344       0.2792980517
H            0.2485858688       0.4681576584       0.7914372862
H            0.8373976074       0.4391119548       0.7791359710
F            0.0594468841       0.6215010140       0.1059432366
F            0.7934945419       0.5779511484       0.6060003729
F            0.2738815299       0.3784930960       0.3940541285
F            0.5398339136       0.4220452355       0.8940008576
O            0.4029952628       0.6389325035       0.1163156174
O            0.7496761883       0.6657977104       0.1092517007
O            0.1215216453       0.5881897105       0.6060767314
O            0.4712462387       0.5970002502       0.6197252787
O            0.5836559299       0.3342035007       0.3907472769
O            0.9303359025       0.3610680558       0.3836873758
O            0.2118128870       0.4118165671       0.8939225762
O            0.8620813928       0.4029897812       0.8802737008
K_POINTS automatic
4 11 3 0 0 0
CELL_PARAMETERS angstrom
      9.2705774000       0.0000000000       0.0000000000
     -1.5637153840       2.7010480439       0.0000000000
      0.2403598005       1.2112911718       9.2695561023
n0000.phasis
n0000.phasis
n0000.phasis
n0000.phasis
n0000.phasis
n0000.phasis
n0000.phasis
n0000.phasis
n0000.phasis
n0000.phasis
n0000.phasis
n0000.phasis
n0000.phasis
n0000.phasis
n0000.phasis
n0001.phasis
n0001.phasis
n0001.phasis
n0001.phasis
n0001.phasis
n0001.phasis
n0001.phasis
n0001.phasis
n0001.phasis
n0001.phasis
n0001.phasis
n0001.phasis
n0001.phasis
n0001.phasis
n0001.phasis
n0004.phasis
n0004.phasis
n0004.phasis
n0004.phasis
n0004.phasis
n0004.phasis
n0004.phasis
n0004.phasis
n0004.phasis
n0004.phasis
n0004.phasis
n0004.phasis
n0004.phasis
n0004.phasis
n0004.phasis
_______________________________________________
Quantum ESPRESSO is supported by MaX (www.max-centre.eu/quantum-espresso)
users mailing list [email protected]
https://lists.quantum-espresso.org/mailman/listinfo/users

Reply via email to