Dear Laurence and Peter,

1) No, I did not run with OpenMP. Everything discussed in the thread above was 
run in sequential mode (no -p). I have, however, also tested dstart and lapw0 
in parallel mode: lapw0 hangs in the same way as in serial mode, while dstart 
runs fine in parallel. Just in case, below is one version of my .machines 
file, from a run with dstart in sequential mode and lapw0 in parallel mode on 
2 processors:
***********
#dstart:localhost localhost
speed:localhost localhost
lapw0:localhost localhost

1:localhost
1:localhost
granularity:1
extrafine:1

omp_global:16
***********
And of course, I never made it to lapw1, due to the lapw0 hanging issue.

2) By inserting a series of PRINT *, “BREAKPOINT1,2,3,…” statements, I have 
determined the exact line where the programme hangs. When running “time lapw0 
lapw0.def”, it hangs exactly at CALL XCPOT1(luse2,LM,…). The context in lapw0.F 
is:
***********
if (.not.xcpot1qq) then
  PRINT *, "BREAKPOINT13"
  CALL XCPOT1(luse2,LM,…)
  PRINT *, "BREAKPOINT14"
***********
BREAKPOINT13 is the last one printed; BREAKPOINT14 is not printed. Importantly, 
none of the BREAKPOINTs inside the subroutine XCPOT1 is printed either. The 
first one in XCPOT1 sits at the earliest legal position, right after all the 
USE, IMPLICIT NONE, and parameter declarations, and it does not get printed. 
This seems to indicate that XCPOT1 is called but never starts executing its 
body, so the code hangs after “BREAKPOINT13” and never prints the BREAKPOINTs 
in XCPOT1 or BREAKPOINT14.
I don’t understand why, considering that the XCPOT1 subroutine looks 
legitimate and compiled fine...
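
One thing the markers alone cannot exclude is output buffering: if standard 
output is buffered (for example when it is redirected to a file), a marker may 
have been written but not yet flushed when the hang occurs. A minimal sketch 
of flushing after every marker, assuming a Fortran 2003 compiler; the names 
mark and work are hypothetical stand-ins and not part of WIEN2k:

```fortran
! Standalone free-form sketch (save as e.g. demo.f90), not WIEN2k code.
! Each marker is flushed immediately, so buffered output cannot hide a
! marker that was written just before a hang.
program breakpoint_demo
  use, intrinsic :: iso_fortran_env, only: output_unit
  implicit none

  call mark('BREAKPOINT13')
  call work()                  ! hypothetical stand-in for CALL XCPOT1(...)
  call mark('BREAKPOINT14')

contains

  subroutine mark(label)
    character(len=*), intent(in) :: label
    write (output_unit, '(A)') label
    flush (output_unit)        ! Fortran 2003 FLUSH statement
  end subroutine mark

  subroutine work()
    ! First executable statement of the stand-in routine.
    call mark('BREAKPOINT inside work')
  end subroutine work

end program breakpoint_demo
```

If the first marker inside the called routine is still missing with explicit 
flushing, buffering can be excluded as the explanation.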

3) My last resort was to ask ChatGPT why subroutines can hang. It suggested 7 
possibilities, from the programming level to the system level. Below are my 
guesses and questions on each of them.
 a) Infinite loops. I have checked all DO loops in XCPOT1.f, and all of them 
are properly closed. An unclosed loop would have been caught by the compiler. 
So NO.
 b) Large memory allocation. XCPOT1 contains three dynamic allocations, but 
none of them is for a large array. So NOT likely.
 c) Recursion without proper termination. NO. XCPOT1 is not a recursive 
subroutine.
 d) Blocking I/O operations. NO. It was not waiting for user input or reading 
from a slow device.
 e) Incorrect use of pointers. NO. I didn’t find pointers in XCPOT1.
 f) Stack overflow. NO. Again, I didn’t see any recursion or large arrays. The 
three allocatable arrays seem small.
 g) Deadlocks. This is the one I don’t quite understand. I am not sure whether 
it could happen, but my guess is no. Even though I run lapw0 in sequential 
mode, could a circular dependency between tasks still occur when the programme 
runs on an Apple silicon Mac?

This is where I am stuck at the moment, unfortunately.


Best regards
Yichen