> On Jun 19, 2019, at 5:05 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
> Hi, Noam
> Can you try your original command line with the following addition:
> mpirun --mca pml ucx --mca btl ^vader,tcp,openib -mca osc ucx
> I think we're seeing some conflict between UCX PML and UCT OSC. 

I did this, although in the meantime I also did a clean compile (to add some 
debugging statements) and switched from running on 1 node (36 cores) to 2 
nodes. The problem is slightly different, but still similar.  Now the memory 
usage no longer keeps growing until the node runs out.  Instead, one node (the 
head node?) is using 55 GB, while the other is using only 23 GB.  The latter 
value (23 GB) is consistent with the usage reported by ps or top (36 procs * 
640 MB/proc).  When I kill the job, usage on the node that was at 55 GB drops 
to 34 GB (with nothing running), and the other node drops to about 1 GB. 
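
To make the arithmetic explicit, here is a minimal Python sketch of the comparison between node-level usage and what the ranks themselves account for. The function name unaccounted_gb is hypothetical, and it assumes 1 GB = 1000 MB and the same 640 MB per-process figure on both nodes; it only restates the numbers quoted in this message.

```python
def unaccounted_gb(node_used_gb, n_procs, rss_mb_per_proc):
    """Node memory in use that per-process usage does not explain, in GB.

    Assumes 1 GB = 1000 MB; all inputs are the rough figures quoted above.
    """
    expected_gb = n_procs * rss_mb_per_proc / 1000.0
    return node_used_gb - expected_gb


# 36 ranks per node at about 640 MB each, as reported by ps/top.
print(round(unaccounted_gb(23, 36, 640), 2))  # roughly 0: node usage matches the ranks
print(round(unaccounted_gb(55, 36, 640), 2))  # roughly 32: not attributable to the ranks
```

The excess of roughly 32 GB on the first node is in the same range as the 34 GB that stays in use there after the job is killed.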


Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628  F +1 202 404 7546
https://www.nrl.navy.mil