Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Joshua Baker-LePain
On Wed, 14 Mar 2012 at 5:50pm, Ralph Castain wrote: On Mar 14, 2012, at 5:44 PM, Reuti wrote: (I was just typing when Ralph's message came in: I can confirm this. To avoid it, it would mean for Open MPI to collect all lines from the hostfile which are on the same machine. SGE creates entries
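For context: SGE writes one $pe_hostfile line per host/queue combination, so a host granted slots from two queues shows up on two lines. A purely illustrative example of such a file (hostnames, queue names, and slot counts are invented; the format is host, slot count, queue instance, processor range):

  node01 4 lab.q@node01 UNDEFINED
  node01 2 long.q@node01 UNDEFINED
  node02 6 lab.q@node02 UNDEFINED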

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Joshua Baker-LePain
On Thu, 15 Mar 2012 at 12:44am, Reuti wrote: Which version of SGE are you using? The traditional rsh startup was replaced by the builtin startup some time ago (although it should still work). We're currently running the rather ancient 6.1u4 (due to the "If it ain't broke..." philosophy).

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Rayson Ho
Hi Joshua, I don't think the new built-in rsh in later versions of Grid Engine is going to make any difference - the orted is the real starter of the MPI tasks and should have a greater influence on the task environment. However, it would help if you can record the nice values and resource

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Reuti
On 15.03.2012 at 05:22, Joshua Baker-LePain wrote: > On Wed, 14 Mar 2012 at 5:50pm, Ralph Castain wrote > >> On Mar 14, 2012, at 5:44 PM, Reuti wrote: > >>> (I was just typing when Ralph's message came in: I can confirm this. To >>> avoid it, it would mean for Open MPI to collect all lines

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Ralph Castain
Just to be clear: I take it that the first entry is the host name, and the second is the number of slots allocated on that host? FWIW: I see the problem. Our parser was apparently written assuming every line was a unique host, so it doesn't even check to see if there is duplication. Easy fix -
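The fix described here boils down to merging slot counts when a host name repeats while reading the allocation. A minimal self-contained sketch of that idea in C (this is not the actual Open MPI patch; the struct and function names are invented for illustration):

  #include <stdio.h>
  #include <string.h>

  #define MAX_HOSTS 256

  struct host_entry {
      char name[256];
      int  slots;
  };

  /* Record "name" with "slots" slots, merging into an existing entry if the
   * host was already seen -- the check the old parser reportedly skipped. */
  static void add_host(struct host_entry *hosts, int *nhosts,
                       const char *name, int slots)
  {
      for (int i = 0; i < *nhosts; i++) {
          if (strcmp(hosts[i].name, name) == 0) {
              hosts[i].slots += slots;   /* duplicate line: just add the slots */
              return;
          }
      }
      if (*nhosts >= MAX_HOSTS)
          return;                        /* table full; ignored in this sketch */
      strncpy(hosts[*nhosts].name, name, sizeof(hosts[*nhosts].name) - 1);
      hosts[*nhosts].name[sizeof(hosts[*nhosts].name) - 1] = '\0';
      hosts[*nhosts].slots = slots;
      (*nhosts)++;
  }

  int main(void)
  {
      struct host_entry hosts[MAX_HOSTS];
      int nhosts = 0;

      /* Two hostfile-style entries for the same host, e.g. from two queues. */
      add_host(hosts, &nhosts, "node01", 4);
      add_host(hosts, &nhosts, "node01", 2);
      add_host(hosts, &nhosts, "node02", 6);

      for (int i = 0; i < nhosts; i++)
          printf("%s slots=%d\n", hosts[i].name, hosts[i].slots);
      return 0;
  }

With the merge in place this prints node01 slots=6 and node02 slots=6 instead of listing node01 twice.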

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Reuti
On 15.03.2012 at 15:37, Ralph Castain wrote: > Just to be clear: I take it that the first entry is the host name, and the > second is the number of slots allocated on that host? This is correct. > FWIW: I see the problem. Our parser was apparently written assuming every > line was a unique

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Ralph Castain
On Mar 15, 2012, at 8:46 AM, Reuti wrote: > On 15.03.2012 at 15:37, Ralph Castain wrote: > >> Just to be clear: I take it that the first entry is the host name, and the >> second is the number of slots allocated on that host? > > This is correct. > > >> FWIW: I see the problem. Our parser

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Joshua Baker-LePain
On Thu, 15 Mar 2012 at 1:53pm, Reuti wrote: PS: In your example you also had the case of 2 slots in the low-priority queue; what is the actual setup in your cluster? Our actual setup is: o lab.q, slots=numprocs, load_thresholds=np_load_avg=1.5, labs (=SGE projects) limited by RQS to a number
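To make that setup concrete, a hypothetical excerpt of the corresponding SGE configuration (all values are invented; only the slots and load_thresholds settings mirror what is described above, and a real RQS rule would name the actual projects rather than a wildcard):

  # qconf -sq lab.q (excerpt)
  qname            lab.q
  slots            16
  load_thresholds  np_load_avg=1.5

  # qconf -srqs (excerpt): cap slots per lab/project
  {
     name     lab_slot_limit
     enabled  TRUE
     limit    projects {*} to slots=64
  }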

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Joshua Baker-LePain
On Thu, 15 Mar 2012 at 4:41pm, Reuti wrote: On 15.03.2012 at 15:50, Ralph Castain wrote: On Mar 15, 2012, at 8:46 AM, Reuti wrote: On 15.03.2012 at 15:37, Ralph Castain wrote: FWIW: I see the problem. Our parser was apparently written assuming every line was a unique host, so it doesn't

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Ralph Castain
No, I'll fix the parser as we should be able to run anyway. Just can't guarantee which queue the job will end up in, but at least it -will- run. On Mar 15, 2012, at 11:34 AM, Joshua Baker-LePain wrote: > On Thu, 15 Mar 2012 at 4:41pm, Reuti wrote > >> On 15.03.2012 at 15:50, Ralph

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Joshua Baker-LePain
On Thu, 15 Mar 2012 at 11:38am, Ralph Castain wrote: No, I'll fix the parser as we should be able to run anyway. Just can't guarantee which queue the job will end up in, but at least it -will- run. Makes sense to me. Thanks! -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Ralph Castain
Here's the patch: I've set it up to go into 1.5, but not 1.4 as that series is being closed out. Please let me know if this solves the problem for you. Modified: orte/mca/ras/gridengine/ras_gridengine_module.c == ---

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Reuti
On 15.03.2012 at 18:14, Joshua Baker-LePain wrote: > On Thu, 15 Mar 2012 at 1:53pm, Reuti wrote > >> PS: In your example you also had the case of 2 slots in the low-priority queue, >> what is the actual setup in your cluster? > > Our actual setup is: > > o lab.q, slots=numprocs,

Re: [OMPI users] MPI_Testsome with incount=0, NULL array_of_indices and array_of_statuses causes MPI_ERR_ARG

2012-03-15 Thread Eugene Loh
On 03/13/12 13:25, Jeffrey Squyres wrote: On Mar 9, 2012, at 5:17 PM, Jeremiah Willcock wrote: On Open MPI 1.5.1, when I call MPI_Testsome with incount=0 and the two output arrays NULL, I get an argument error (MPI_ERR_ARG). Is this the intended behavior? If incount=0, no requests can
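A minimal reproducer for the call being discussed (a hypothetical test, not taken from the original post; by default MPI errors abort, so the error handler is switched to MPI_ERRORS_RETURN to make the return code visible):

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rc, outcount = -1;

      MPI_Init(&argc, &argv);
      MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

      /* incount = 0 with NULL request/index/status arrays -- the case that
       * reportedly returns MPI_ERR_ARG on Open MPI 1.5.1. */
      rc = MPI_Testsome(0, NULL, &outcount, NULL, NULL);
      printf("rc=%d outcount=%d\n", rc, outcount);

      MPI_Finalize();
      return 0;
  }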

Re: [OMPI users] MPI_Testsome with incount=0, NULL array_of_indices and array_of_statuses causes MPI_ERR_ARG

2012-03-15 Thread Jeffrey Squyres
Many thanks for doing this, Eugene. On Mar 15, 2012, at 11:58 AM, Eugene Loh wrote: > On 03/13/12 13:25, Jeffrey Squyres wrote: >> On Mar 9, 2012, at 5:17 PM, Jeremiah Willcock wrote: >>> On Open MPI 1.5.1, when I call MPI_Testsome with incount=0 and the two >>> output arrays NULL, I get an

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Ralph Castain
Great - thanks! On Mar 15, 2012, at 2:55 PM, Joshua Baker-LePain wrote: > On Thu, 15 Mar 2012 at 11:49am, Ralph Castain wrote > >> Here's the patch: I've set it up to go into 1.5, but not 1.4 as that series >> is being closed out. Please let me know if this solves the problem for you. > > I

Re: [hwloc-users] Problems on SMP with 48 cores

2012-03-15 Thread Samuel Thibault
Samuel Thibault, on Thu 15 Mar 2012 07:42:40 +0100, wrote: > Brice Goglin, on Wed 14 Mar 2012 22:32:07 +0100, wrote: > > We debugged this in private emails with Hartmut. His 48-core platform is > > now detected properly. Everything got fixed with a patch > > functionally identical to what