On Mar 3, 2010, at 12:16 PM, Prentice Bisbal wrote:

> Eugene Loh wrote:
>> Prentice Bisbal wrote:
>>> Eugene Loh wrote:
>>> 
>>>> Prentice Bisbal wrote:
>>>> 
>>>>> Is there a limit on how many MPI processes can run on a single host?
>>>>> 
>> Depending on which OMPI release you're using, I think you need something
>> like 4*np up to 7*np (plus a few) descriptors.  So, with 256, you need
>> 1000+ descriptors.  You're quite possibly up against your limit, though
>> I don't know for sure that that's the problem here.
>> 
>> You say you're running 1.2.8.  That's "a while ago", so would you
>> consider updating as a first step?  Among other things, newer OMPIs will
>> generate a much clearer error message if the descriptor limit is the
>> problem.
> 
> While 1.2.8 might be "a while ago", upgrading software just because it's
> "old" is not a valid argument.
> 
> I can install the latest version of Open MPI, but it will take a little
> while.

Maybe not because it is "old", but Eugene is correct. The old versions of OMPI 
required more file descriptors than the newer versions.

That said, you'll still need a minimum of 4x the number of procs on the node 
even with the latest release. I suggest talking to your sys admin about getting 
the limit increased. It sounds like it has been set unrealistically low.
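
As a rough way to see whether the descriptor limit is the likely culprit, the arithmetic above can be sketched in a few lines of Python. This is only an illustration, not anything Open MPI ships: the function names and the `extra` allowance of 10 (standing in for "plus a few") are assumptions, and the 4x to 7x factors are just the estimates quoted in this thread.

```python
import resource


def descriptors_needed(nprocs, per_proc=4, extra=10):
    """Estimate descriptors for nprocs local MPI processes.

    per_proc is the per-process factor quoted in the thread (4 to 7);
    extra is an assumed stand-in for "plus a few".
    """
    return per_proc * nprocs + extra


def fits_in_limit(needed, limit):
    """Return True if `needed` descriptors fit under `limit`."""
    return limit == resource.RLIM_INFINITY or needed <= limit


if __name__ == "__main__":
    # Current soft and hard limits on open files for this process.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"soft limit: {soft}, hard limit: {hard}")
    for nprocs in (155, 256, 512):
        for factor in (4, 7):
            need = descriptors_needed(nprocs, per_proc=factor)
            verdict = "ok" if fits_in_limit(need, soft) else "over soft limit"
            print(f"np={nprocs} factor={factor}: ~{need} descriptors ({verdict})")
```

If the estimate for the failing process count sits above the soft limit while smaller counts fit, that is consistent with the descriptor explanation, though it does not prove it.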


> 
> 
>>>>> I have a user trying to test his code on the command-line on a single
>>>>> host before running it on our cluster like so:
>>>>> 
>>>>> mpirun -np X foo
>>>>> 
>>>>> When he tries to run it with a large number of processes (X = 256, 512), the
>>>>> program fails, and I can reproduce this with a simple "Hello, World"
>>>>> program:
>>>>> 
>>>>> $ mpirun -np 256 mpihello
>>>>> mpirun noticed that job rank 0 with PID 0 on node juno.sns.ias.edu
>>>>> exited on signal 15 (Terminated).
>>>>> 252 additional processes aborted (not shown)
>>>>> 
>>>>> I've done some testing and found that X must be less than 155 for this
>>>>> program to work. Is this a bug, part of the standard, or a
>>>>> design/implementation decision?
>>>>> 
>>>>> 
>>>>> 
>>>> One possible issue is the limit on the number of descriptors.  The error
>>>> message should be pretty helpful and descriptive, but perhaps you're
>>>> using an older version of OMPI.  If this is your problem, one workaround
>>>> is something like this:
>>>> 
>>>> unlimit descriptors
>>>> mpirun -np 256 mpihello
>>>> 
>>> 
>>> Looks like I'm not allowed to set that as a regular user:
>>> 
>>> $ ulimit -n 2048
>>> -bash: ulimit: open files: cannot modify limit: Operation not permitted
>>> 
>>> Since I am the admin, I could change that elsewhere, but I'd rather not
>>> do that system-wide unless absolutely necessary.
>>> 
>>>> though I guess the syntax depends on what shell you're running.  Another
>>>> is to set the MCA parameter opal_set_max_sys_limits to 1.
>>>> 
>>> That didn't work either:
>>> 
>>> $ mpirun -mca opal_set_max_sys_limits 1 -np 256 mpihello
>>> mpirun noticed that job rank 0 with PID 0 on node juno.sns.ias.edu
>>> exited on signal 15 (Terminated).
>>> 252 additional processes aborted (not shown)
>>> 
>>> 
>> 
>> 
>> ------------------------------------------------------------------------
>> 
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> -- 
> Prentice Bisbal
> Linux Software Support Specialist/System Administrator
> School of Natural Sciences
> Institute for Advanced Study
> Princeton, NJ

