[Wien] Error in lapw1para_lapw script causing errors when running parallel lapw2

"Paweł Leśniak, IFMPAN" Wed, 17 Jun 2009 18:17:33 +0200

On 2009-06-17 16:59, Peter Blaha wrote:
> Testing this in TiC with 47 k-points:
>
> 4 lines in .machines; granularity:1
>
> testpara_lapw   produces:
>
> 1 : homer(11) 11k
> 2 : homer(11) 11k
> 3 : homer(11) 11k
> 4 : homer(11) 11k
> 5 : homer(11) 3k
>
> and also x lapw1 -p / x lapw2 -p runs fine.
>
OK, let's take the TiC test case (it shows the same problem as the TiO2 
test case).


It depends on what you have on line 444 of lapw1para_lapw.

430         set kold = $kbegin
431         if ($loop > $multi && $?extrafine) then
432             @ head = $kbegin
433             set tail = 1
434             @ kbegin = $kbegin + 1
435         else
436             @ head = $kbegin + $weigh[$p] - 1
437             set tail = $weigh[$p]
438             @ kbegin = $kbegin + $weigh[$p]
439         endif
440
441
442         if ($head >= $klist) then
443             set head    = $klist
444             @ tail = $klist - $kold - 1 # here
445         endif

Generation of the 5th part of the klist goes as follows:
$klist = 47, $kbegin = 45, $kold = 45 once line 430 has run
line 436: head = 45 + 11 - 1 = 55
line 437: tail = 11
line 438: kbegin = 45 + 11 = 56

The problem shows up in lines 442-445:
line 442: head = 55 >= klist = 47, so lines 443-445 are executed.
line 443: head = 47     -> this is fine
line 444: tail = 47 - 45 - 1 = 1    (while it should be 47 - 45 + 1 = 3)
As a result, only the last line of the klist is produced, i.e. k-point 
number 47. K-points number 45 and 46 are missing.
This error looks quite obvious to me, and I am really surprised that 
you are getting correct results.
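
The arithmetic above can be replayed outside the shell. What follows is a 
minimal Python sketch, not the script itself: it transcribes only the 
"else" branch and lines 442-445 of the excerpt (the extrafine branch is 
ignored), and it assumes that head and tail are later used to take the 
first "head" lines of the klist and then keep the last "tail" of them. 
The function names last_chunk and select are hypothetical.

```python
def last_chunk(klist, kbegin, weight, fixed):
    """Transcription of lines 430, 436-437 and 442-445; returns (head, tail)."""
    kold = kbegin                      # line 430
    head = kbegin + weight - 1         # line 436
    tail = weight                      # line 437
    if head >= klist:                  # line 442
        head = klist                   # line 443
        if fixed:
            tail = klist - kold + 1    # proposed correction of line 444
        else:
            tail = klist - kold - 1    # line 444 as quoted, read left to right
    return head, tail


def select(head, tail, klist):
    """Assumed selection: first `head` k-points, then the last `tail` of those."""
    kpoints = list(range(1, klist + 1))
    return kpoints[:head][-tail:]


# Fifth part of the TiC klist: 47 k-points, chunk starts at 45, weight 11.
buggy = select(*last_chunk(47, 45, 11, fixed=False), 47)
fixed = select(*last_chunk(47, 45, 11, fixed=True), 47)
print(buggy)  # [47]
print(fixed)  # [45, 46, 47]
```

With the quoted line 444 the fifth part contains only k-point 47; with 
"+ 1" it contains 45, 46 and 47.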

So if you are getting a correct split of k-points in lapw1para_lapw, then either
1) you have @ tail = $klist - $kold + 1   on line 444  (and the same 
correction on line 248 of testpara_lapw),
or
2) your (t)csh evaluates expressions from right to left,
or
3) the code I downloaded from the wien2k site differs from the one 
you are testing.
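
Possibility 2 can be checked with plain arithmetic. The Python sketch 
below only compares the two ways of grouping the expression from line 444 
with the values of the fifth part; it makes no claim about how any 
particular (t)csh version actually groups it.

```python
from functools import reduce

operands = [47, 45, 1]  # $klist - $kold - 1 for the fifth part

# Left to right: (47 - 45) - 1
left_to_right = reduce(lambda acc, x: acc - x, operands)

# Right to left: 47 - (45 - 1)
right_to_left = reduce(lambda acc, x: x - acc, reversed(operands))

print(left_to_right)   # 1
print(right_to_left)   # 3
```

Read right to left, the unmodified line 444 would give tail = 3, which 
happens to coincide with the value expected from 47 - 45 + 1.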
> It produces 5 !!! klists, the latter one with the remaining 3 k-points 
> and the
> "fastest" cpu will get this junk.
>
> So from my point of view it works perfectly well.
That is very strange. Could you download a fresh copy of the code from the 
wien2k site and check it on a freshly downloaded TiC (or other) test case 
in k-parallel mode?

We have 3 different Linux environments here (different distributions) 
on x86_64, all producing the same errors.
> > Indeed I am using $SCRATCH variable. I've also checked -it switch 
> and it
> > works with k-points splitted 2/2/2/2/1  on 4 cpus.
>
> No, you cannot use iterative diag, because with 4 lines in .machines, but
> actually 5 !! junks, you don't know on which computer the 5th junk 
> will be executed
> (it will be the fastest, but that can change from iteration to 
> iteration and you will
> not have an old vector file.
Unless $SCRATCH is on a distributed filesystem, so that each node can see all 
parts of the vector file ($case.vector_*), I guess.


I'd be very grateful if you could check what I am describing on 
freshly downloaded code/data (available for download to registered users).


Pawel Lesniak
