[OMPI users] IB Memory Requirements, adjusting for reduced memory consumption

2012-01-12 Thread V. Ram
Open MPI IB Gurus, I have some slightly older InfiniBand-equipped nodes that have less RAM than we'd like, and on which we tend to run jobs that can span 16-32 nodes of this type. The jobs themselves tend to run on the heavy side in terms of their own memory requirements. When we used

Re: [OMPI users] Error launching w/ 1.5.3 on IB mthca nodes

2012-01-04 Thread V. Ram
responded to the firmware part of this earlier: http://www.open-mpi.org/community/lists/users/2011/12/18014.php Thank you, V. Ram -- http://www.fastmail.fm - Access your email from home and the web

Re: [OMPI users] Error launching w/ 1.5.3 on IB mthca nodes

2011-12-19 Thread V. Ram
Thank you. V. Ram > On Dec 15, 2011, at 7:24 PM, V. Ram wrote: > > > Hi Terry, > > > > Thanks so much for the response. My replies are in-line below. > > > > On Thu, Dec 15, 2011, at 07:00 AM, TERRY DONTJE wrote: > >> IIRC, RNR's are usu

Re: [OMPI users] Error launching w/ 1.5.3 on IB mthca nodes

2011-12-15 Thread V. Ram
ted number of observable parameters I'm aware of, to be dependent on the number of nodes involved. It is an intermittent problem, but when it happens, it happens at job launch, and it does occur most of the time. Thanks, V. Ram > --td > > > > Open MPI InfiniBand gurus and/or Mellanox:

Re: [OMPI users] Error launching w/ 1.5.3 on IB mthca nodes

2011-12-14 Thread V. Ram
Open MPI InfiniBand gurus and/or Mellanox: could I please get some assistance with this? Any suggestions on tunables or debugging parameters to try? Thank you very much. On Mon, Dec 12, 2011, at 10:42 AM, V. Ram wrote: > Hello, > > We are running a cluster that has a good number of ol

[OMPI users] Error launching w/ 1.5.3 on IB mthca nodes

2011-12-12 Thread V. Ram
use the same InfiniBand fabric continuously without any issue, so I don't think it's the fabric/switch. I'm at a loss for what to do next to try and find the root cause of the issue. I suspect something perhaps having to do with the mthca support/drivers, but how can I track this down further?

Re: [OMPI users] Processor/core selection/affinity for large shared memory systems

2008-12-05 Thread V. Ram
Terry Frankcombe wrote: > Isn't it up to the OS scheduler what gets run where? I was under the impression that the processor affinity API was designed to let the OS (at least Linux) know how a given task preferred to be bound in terms of the system topology. -- V. Ram v_r_...@fastmail
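As a rough illustration of the Linux affinity API mentioned above, here is a minimal sketch using the Python standard library's wrappers around sched_getaffinity/sched_setaffinity. This is not how Open MPI or PLPA binds ranks; the function name pin_current_process is made up for the example, and the calls are only available on Linux.

```python
import os


def pin_current_process(cpus):
    """Ask the Linux scheduler to run this process only on the given CPU ids.

    Hypothetical helper for illustration; not part of Open MPI or PLPA.
    """
    allowed = os.sched_getaffinity(0)   # CPUs this process may currently use
    requested = set(cpus) & allowed     # never request a CPU we are not allowed on
    if not requested:
        raise ValueError("none of the requested CPUs are available to this process")
    os.sched_setaffinity(0, requested)  # pid 0 means "the calling process"
    return os.sched_getaffinity(0)      # the mask now in effect


if __name__ == "__main__":
    # Pin to the lowest-numbered CPU we are allowed to use.
    first_cpu = min(os.sched_getaffinity(0))
    print(pin_current_process({first_cpu}))
```

The scheduler still decides when the task runs; the mask only restricts which CPUs it may be placed on.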

Re: [OMPI users] Processor/core selection/affinity for large shared memory systems

2008-12-05 Thread V. Ram
ve any easy way to tell that without a hostfile, etc. -- V. Ram v_r_...@fastmail.fm

Re: [OMPI users] Processor/core selection/affinity for large shared memory systems

2008-12-04 Thread V. Ram
e sockets (all 4 cores) active at a time on this job. Does this make more sense? -- V. Ram v_r_...@fastmail.fm

[OMPI users] Processor/core selection/affinity for large shared memory systems

2008-12-04 Thread V. Ram
such functionality is technically possible via PLPA. Is there in fact a way to specify such a thing with 1.2.8, and if not, will 1.3 support these kinds of arguments? Thank you. -- V. Ram v_r_...@fastmail.fm

Re: [OMPI users] Crashes over TCP/ethernet but not on shared memory

2008-10-24 Thread V. Ram
nyone else experiencing the same issues. Thanks Leonardo! OMPI devs: does this imply bug(s) in the e1000 driver/chip? Should I contact the driver authors? On Fri, 10 Oct 2008 12:42:19 -0400, "V. Ram" <v_r_...@fastmail.fm> said: > Leonardo, > > These nodes are all usi

Re: [OMPI users] Crashes over TCP/ethernet but not on shared memory

2008-10-10 Thread V. Ram
"eth0, eth1". You should > > try to restrict Open MPI to use only one of the available networks by > > using the --mca btl_tcp_if_include ethx parameter to mpirun, where x > > is the network interface that is always connected to the same logical > > and physical

Re: [OMPI users] Crashes over TCP/ethernet but not on shared memory

2008-10-10 Thread V. Ram
re there any suggestions on how to figure out if it's a problem with > > the code or the OMPI installation/software on the system? We have > > tried > > "--debug-daemons" with no new/interesting information being revealed. > > Is there a way to trap segfault messages or more detailed MPI > > transaction information or anything else that could help diagnose > > this? > > > > Thanks. > > -- > > V. Ram > > v_r_...@fastmail.fm -- V. Ram v_r_...@fastmail.fm

[OMPI users] Crashes over TCP/ethernet but not on shared memory

2008-10-01 Thread V. Ram
on the system? We have tried "--debug-daemons" with no new/interesting information being revealed. Is there a way to trap segfault messages or more detailed MPI transaction information or anything else that could help diagnose this? Thanks. -- V. Ram v_r_...@fastmail.fm

[OMPI users] Crash in code using OMPI 1.2.7 - Debugging assistance sought

2008-09-24 Thread V. Ram
installation/software on the system? We have tried "--debug-daemons" with no new/interesting information being revealed. Is there a way to trap segfault messages or more detailed MPI transaction information or anything else that could help diagnose this? Thanks. -- V. Ram v_r_...@f