You might want to run some profiling / timing to see which parts of the
application start running slower over time.
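
As a rough illustration of the kind of timing I mean, here is a minimal sketch in Python (not your application's language or API). The names run_task, batch_size, and factor are hypothetical placeholders. It records how long each batch of tasks takes and reports the batches that are much slower than the first one:

```python
import time


def timed_batches(tasks, run_task, batch_size=1000):
    """Run every task; return a list of (tasks_completed, seconds_for_batch).

    run_task and batch_size are hypothetical names used for illustration.
    """
    records = []
    done = 0
    start = time.perf_counter()
    for task in tasks:
        run_task(task)
        done += 1
        if done % batch_size == 0:
            now = time.perf_counter()
            records.append((done, now - start))
            start = now  # restart the clock for the next batch
    return records


def slow_batches(records, factor=2.0):
    """Return the completion counts of batches slower than factor x the first batch."""
    if not records:
        return []
    baseline = records[0][1]
    return [done for done, secs in records if secs > factor * baseline]
```

Logging the same numbers per process would show whether every rank slows down together or only a few do, which may help narrow down where the time goes.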

Also check for memory leaks.
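
One cheap way to look for a leak, before reaching for a dedicated tool, is to sample the process's peak resident set size at intervals and see whether it keeps climbing. The sketch below is in Python for illustration only and assumes a Unix system; peak_rss_kb and rss_growth are hypothetical helper names, and ru_maxrss is reported in kilobytes on Linux:

```python
import resource


def peak_rss_kb():
    """Peak resident set size of the current process (kilobytes on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def rss_growth(samples):
    """Given peak-RSS samples taken at intervals, return the growth between
    each pair of consecutive samples."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]
```

If samples taken every few thousand tasks keep growing even though the tasks are all the same size, that would be consistent with a leak, though it is not proof of one.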


On Sep 22, 2011, at 5:44 PM, Tom Hilinski wrote:

> Hi, A job I am running slows down as it approaches the end. I'd
> appreciate any ideas you may have on possible cause or what else I can
> look at for diagnostic info.
> 
> Environment:
> * Linux cluster, very recent version of Fedora.
> * openmpi 1.5
> 
> Characteristics of job:
> * Tasks are all the same size and duration.
> * 56K tasks, but multiple tasks given to each process.
> * Typically run 120 processes.
> * Slowdown starts at ~52K completed, then rate of completion of each
> task declines geometrically from ~1k/minute to 4/minute at 54K.
> 
> Here are some queries done when the slowdown occurs:
> 
> * "ps" on master node - most processes in suspend state:
> F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
> 0 S  3348 27933 15675  0  80   0 - 13608 poll_s pts/0    00:00:00 mpiexec
> 0 S  3348 28009 27933 14  80   0 - 227632 epoll_ pts/0   00:08:13 C5MPI
> 0 S  3348 28011 27933 14  80   0 - 227672 epoll_ pts/0   00:08:17 C5MPI
> 0 S  3348 28013 27933 13  80   0 - 227713 epoll_ pts/0   00:08:06 C5MPI
> 0 S  3348 28015 27933 13  80   0 - 227844 epoll_ pts/0   00:08:02 C5MPI
> 0 S  3348 28017 27933 14  80   0 - 227849 epoll_ pts/0   00:08:13 C5MPI
> 0 S  3348 28019 27933 13  80   0 - 227892 epoll_ pts/0   00:08:07 C5MPI
> 
> * file handles (allocated handle count is ~constant):
> $ cat /proc/sys/fs/file-nr
> 3968    0       801014
> 
> * Processes in a suspend or run state (varies):
> $ orte-top -pid 27933 | grep ' S |' | wc -l
> 124
> $ orte-top -pid 27933 | grep ' R |'
> Rank |  Nodename | Command |   Pid | State |   Time | Pri | #threads |  Vsize |    RSS | Peak Vsize | Shr Size |
>    0 | rubel-001 |   C5MPI | 14700 |     R |   2.2H |  20 |        1 | 246208 |  12660 |     246208 |    17664 |
>    1 | rubel-001 |   C5MPI | 14702 |     R |   2.2H |  20 |        1 | 245360 |  44860 |     245360 |    17664 |
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

