Re: [OMPI users] Performance Issues on SMP Workstation

2017-02-01 Thread Andy Witzig
Thank you, Bennet.  From my testing, I’ve seen that the application usually 
performs better with far fewer ranks on the workstation.  I’ve tested on the 
cluster and do not see the same behavior (i.e., performance keeps improving 
up to -np 15 or 20).  The workstation is not shared and is not doing any 
other work.  I ran the application on the workstation with top and confirmed 
that all 20 processes were fully loaded.

I’ll look into the diagnostics you mentioned and get back with you.

Best regards,
Andy
  
On Feb 1, 2017, at 6:15 PM, Bennet Fauber wrote:

How do they compare if you run a much smaller number of ranks, say -np 2 or 4?

Is the workstation shared and doing any other work?

You could insert some diagnostics into your script, for example,
uptime and free, both before and after running your MPI program and
compare.
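A minimal sketch of that suggestion (the `sleep` is a stand-in for the real `mpirun` invocation, and `free` is Linux-specific):

```shell
# snapshot load average and memory just before and after the MPI run
uptime  >  sys_before.txt
free -m >> sys_before.txt 2>/dev/null || true   # free(1) is Linux-only

sleep 1   # stand-in for: mpirun -np 20 $EXECUTABLE $INPUT_FILE

uptime  >  sys_after.txt
free -m >> sys_after.txt 2>/dev/null || true

# a big jump in load average or a collapse in free memory between the
# two snapshots points at competing work or swapping
diff sys_before.txt sys_after.txt || true
```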

You could also run top in batch mode in the background for your own
username, then run your MPI program, and compare the results from top.
We've seen instances where the MPI ranks only get distributed to a
small number of processors, which you see if they all have small
percentages of CPU.
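The batch-mode top idea, sketched with procps top flags on Linux (again, the `sleep` stands in for the actual MPI run):

```shell
# sample top for the current user every 5 seconds while the job runs
me="${USER:-$(id -un)}"
top -b -d 5 -u "$me" > top.log 2>&1 &
TOP_PID=$!

sleep 2   # stand-in for: mpirun -np 20 $EXECUTABLE $INPUT_FILE

kill "$TOP_PID" 2>/dev/null || true
# 20 ranks each near 100% CPU is healthy; 20 ranks at small, even
# percentages suggests they are packed onto a few cores
tail -n 25 top.log || true
```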

Just flailing in the dark...

-- bennet



On Wed, Feb 1, 2017 at 6:36 PM, Andy Witzig wrote:
> Thanks for the idea.  I did the test and only get a single host.
> 
> Thanks,
> Andy
> 
> On Feb 1, 2017, at 5:04 PM, r...@open-mpi.org wrote:
> 
> Simple test: replace your executable with “hostname”. If you see multiple
> hosts come out on your cluster, then you know why the performance is
> different.
> 
> On Feb 1, 2017, at 2:46 PM, Andy Witzig wrote:
> 
> Honestly, I’m not exactly sure what scheme is being used.  I am using the
> default template from Penguin Computing for job submission.  It looks like:
> 
> #PBS -S /bin/bash
> #PBS -q T30
> #PBS -l walltime=24:00:00,nodes=1:ppn=20
> #PBS -j oe
> #PBS -N test
> #PBS -r n
> 
> mpirun $EXECUTABLE $INPUT_FILE
> 
> I’m not configuring OpenMPI anywhere else. It is possible the Penguin
> Computing folks have pre-configured my MPI environment.  I’ll see what I can
> find.
> 
> Best regards,
> Andy
> 
> On Feb 1, 2017, at 4:32 PM, Douglas L Reeder wrote:
> 
> Andy,
> 
> What allocation scheme are you using on the cluster?  For some codes we see
> noticeable differences using fill-up vs. round-robin, though not 4x.  Fill-up
> makes more use of shared memory, while round-robin uses more InfiniBand.
> 
> Doug
> 
> On Feb 1, 2017, at 3:25 PM, Andy Witzig wrote:
> 
> Hi Tom,
> 
> The cluster uses an Infiniband interconnect.  On the cluster I’m requesting:
> #PBS -l walltime=24:00:00,nodes=1:ppn=20.  So technically, the run on the
> cluster should be SMP on the node, since there are 20 cores/node.  On the
> workstation I’m just using the command: mpirun -np 20 …. I haven’t finished
> setting Torque/PBS up yet.
> 
> Best regards,
> Andy
> 
> On Feb 1, 2017, at 4:10 PM, Elken, Tom wrote:
> 
> For this case:  " a cluster system with 2.6GHz Intel Haswell with 20 cores /
> node and 128GB RAM/node.  "
> 
> are you running 5 ranks per node on 4 nodes?
> What interconnect are you using for the cluster?
> 
> -Tom
> 
> -----Original Message-----
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Andrew
> Witzig
> Sent: Wednesday, February 01, 2017 1:37 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] Performance Issues on SMP Workstation
> 
> By the way, the workstation has a total of 36 cores / 72 threads, so using
> mpirun
> -np 20 is possible (and should be equivalent) on both platforms.
> 
> Thanks,
> cap79
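One variable worth ruling out on a 36-core/72-thread box, as a hedged suggestion (flag names are from Open MPI 1.6-era mpirun, not something confirmed in the thread): without explicit binding, 20 unbound ranks can migrate or land on sibling hyperthreads of the same physical cores.

```shell
# candidate commands, with $EXECUTABLE and $INPUT_FILE as in the original post:
# one rank per physical core, printing the resulting binding map:
#   mpirun -np 20 --bind-to-core --report-bindings $EXECUTABLE $INPUT_FILE
# or spread ranks across both sockets first:
#   mpirun -np 20 --bysocket --bind-to-socket $EXECUTABLE $INPUT_FILE
# (runnable stand-in below, so the sketch can be exercised without MPI)
printf 'try: %s\n' '--bind-to-core --report-bindings' '--bysocket --bind-to-socket'
```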
> 
> On Feb 1, 2017, at 2:52 PM, Andy Witzig wrote:
> 
> Hi all,
> 
> I’m testing my application on an SMP workstation (dual Intel Xeon E5-2697 v4
> 2.3 GHz Broadwell processors, boost 2.8-3.1 GHz, 128 GB RAM) and am seeing a
> 4x performance drop compared to a cluster system with 2.6 GHz Intel Haswell
> nodes (20 cores and 128 GB RAM per node).  The application on both systems
> was compiled with OpenMPI 1.6.4.  I have tried running:
> 
> 
> mpirun -np 20 $EXECUTABLE $INPUT_FILE
> mpirun -np 20 --mca btl self,sm $EXECUTABLE $INPUT_FILE
> 
> and others, but cannot achieve the same performance on the workstation as is
> seen on the cluster.  The workstation outperforms the cluster on other
> non-MPI multi-threaded applications, so I don’t think it’s a hardware issue.
> 
> 
> Any help you can provide would be appreciated.
> 
> Thanks,
> cap79
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users

[OMPI users] OpenMPI not running any job on Mac OS X 10.12

2017-02-01 Thread Michel Lesoinne
I have compiled OpenMPI 2.0.2 on a new MacBook running OS X 10.12 and have
been trying to run a simple program.
I configured openmpi with
../configure --disable-shared --prefix ~/.local
make all install

Then I have a simple code containing only a call to MPI_Init.  I run it with:
mpirun -np 2 ./mpitest

The output is:

[Michels-MacBook-Pro.local:45101] mca_base_component_repository_open: unable to open mca_patcher_overwrite: File not found (ignored)
[Michels-MacBook-Pro.local:45101] mca_base_component_repository_open: unable to open mca_shmem_mmap: File not found (ignored)
[Michels-MacBook-Pro.local:45101] mca_base_component_repository_open: unable to open mca_shmem_posix: File not found (ignored)
[Michels-MacBook-Pro.local:45101] mca_base_component_repository_open: unable to open mca_shmem_sysv: File not found (ignored)
--------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_shmem_base_select failed
  --> Returned value -1 instead of OPAL_SUCCESS
--------------------------------------------------------------------------

Without the --disable-shared in the configuration, I instead get:

[Michels-MacBook-Pro.local:68818] [[53415,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../orte/orted/pmix/pmix_server.c at line 264
[Michels-MacBook-Pro.local:68818] [[53415,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../../orte/mca/ess/hnp/ess_hnp_module.c at line 666
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  pmix server init failed
  --> Returned value Bad parameter (-5) instead of ORTE_SUCCESS
--------------------------------------------------------------------------

Has anyone seen this? What am I missing?
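For reference, the reproducer sketched end to end (mpitest.c is reconstructed here as a minimal MPI_Init/MPI_Finalize program; the build commands are commented out so the sketch does not require an MPI install):

```shell
# recreate the one-call test program described above
cat > mpitest.c <<'EOF'
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);   /* the call that trips opal_init/orte_init */
    MPI_Finalize();
    return 0;
}
EOF

# with Open MPI installed under ~/.local:
#   ~/.local/bin/mpicc -o mpitest mpitest.c
#   ~/.local/bin/mpirun -np 2 ./mpitest
grep -n 'MPI_' mpitest.c
```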

Re: [OMPI users] Performance Issues on SMP Workstation

2017-02-01 Thread Bennet Fauber
How do they compare if you run a much smaller number of ranks, say -np 2 or 4?

Is the workstation shared and doing any other work?

You could insert some diagnostics into your script, for example,
uptime and free, both before and after running your MPI program and
compare.

You could also run top in batch mode in the background for your own
username, then run your MPI program, and compare the results from top.
We've seen instances where the MPI ranks only get distributed to a
small number of processors, which you see if they all have small
percentages of CPU.

Just flailing in the dark...

-- bennet



On Wed, Feb 1, 2017 at 6:36 PM, Andy Witzig wrote:
> Thanks for the idea.  I did the test and only get a single host.
>
> Thanks,
> Andy

Re: [OMPI users] Performance Issues on SMP Workstation

2017-02-01 Thread Andy Witzig
Thanks for the idea.  I did the test and only get a single host.

Thanks,
Andy

On Feb 1, 2017, at 5:04 PM, r...@open-mpi.org wrote:

Simple test: replace your executable with “hostname”. If you see multiple hosts 
come out on your cluster, then you know why the performance is different.



Re: [OMPI users] Performance Issues on SMP Workstation

2017-02-01 Thread Andy Witzig
Thanks, Bennet.  I made the modification to the Torque submission file and got 
“20 n388”, which confirms (like you said) that for my cluster runs I am 
requesting 20 cores on a single node.

Best regards,
Andy

On Feb 1, 2017, at 5:15 PM, Bennet Fauber wrote:

You may want to run this by Penguin support, too.

I believe that Penguin on Demand uses Torque, in which case the

   nodes=1:ppn=20

is requesting 20 cores on a single node.

If this is Torque, then you should get a host list, with counts by inserting

   uniq -c $PBS_NODEFILE

after the last #PBS directive.  That should print the host name and the
number 20.  MPI should resort to whatever it uses when it is on the same
node.

-- bennet


On Wed, Feb 1, 2017 at 6:04 PM, r...@open-mpi.org wrote:
> Simple test: replace your executable with “hostname”. If you see multiple
> hosts come out on your cluster, then you know why the performance is
> different.

Re: [OMPI users] Performance Issues on SMP Workstation

2017-02-01 Thread Bennet Fauber
You may want to run this by Penguin support, too.

I believe that Penguin on Demand uses Torque, in which case the

nodes=1:ppn=20

is requesting 20 cores on a single node.

If this is Torque, then you should get a host list, with counts by inserting

uniq -c $PBS_NODEFILE

after the last #PBS directive.  That should print the host name and the
number 20.  MPI should resort to whatever it uses when it is on the same
node.
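What that check looks like, simulated (Torque writes each host into $PBS_NODEFILE once per allocated core; the node name n388 is the one quoted elsewhere in the thread, so substitute your own):

```shell
# build a stand-in for $PBS_NODEFILE for a nodes=1:ppn=20 request:
# the same hostname repeated 20 times
for i in $(seq 20); do echo n388; done > nodefile.sim

# count consecutive duplicates; a single "20 <host>" line means all
# 20 cores came from one node
uniq -c nodefile.sim
```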

-- bennet


On Wed, Feb 1, 2017 at 6:04 PM, r...@open-mpi.org wrote:
> Simple test: replace your executable with “hostname”. If you see multiple
> hosts come out on your cluster, then you know why the performance is
> different.

Re: [OMPI users] Performance Issues on SMP Workstation

2017-02-01 Thread r...@open-mpi.org
Simple test: replace your executable with “hostname”. If you see multiple hosts 
come out on your cluster, then you know why the performance is different.
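What a healthy single-node result looks like (simulated here; on the cluster the first pipeline stage would be `mpirun -np 20 hostname`, and n388 is just a stand-in hostname):

```shell
# simulate 20 ranks all answering from the same host, then count hosts
for i in $(seq 20); do echo n388; done | sort | uniq -c
# more than one output line would mean the ranks were spread across
# nodes, which by itself explains a performance difference vs. one box
```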

> On Feb 1, 2017, at 2:46 PM, Andy Witzig wrote:
> 
> Honestly, I’m not exactly sure what scheme is being used.  I am using the 
> default template from Penguin Computing for job submission.  It looks like:
> 
> #PBS -S /bin/bash
> #PBS -q T30
> #PBS -l walltime=24:00:00,nodes=1:ppn=20
> #PBS -j oe
> #PBS -N test
> #PBS -r n
> 
> mpirun $EXECUTABLE $INPUT_FILE
> 
> I’m not configuring OpenMPI anywhere else. It is possible the Penguin 
> Computing folks have pre-configured my MPI environment.  I’ll see what I can 
> find.
> 
> Best regards,
> Andy

Re: [OMPI users] Performance Issues on SMP Workstation

2017-02-01 Thread Andy Witzig
Honestly, I’m not exactly sure what scheme is being used.  I am using the 
default template from Penguin Computing for job submission.  It looks like:

#PBS -S /bin/bash
#PBS -q T30
#PBS -l walltime=24:00:00,nodes=1:ppn=20
#PBS -j oe
#PBS -N test
#PBS -r n

mpirun $EXECUTABLE $INPUT_FILE

I’m not configuring OpenMPI anywhere else. It is possible the Penguin Computing 
folks have pre-configured my MPI environment.  I’ll see what I can find.

Best regards,
Andy

On Feb 1, 2017, at 4:32 PM, Douglas L Reeder wrote:

Andy,

What allocation scheme are you using on the cluster?  For some codes we see 
noticeable differences using fill-up vs. round-robin, though not 4x.  Fill-up 
makes more use of shared memory, while round-robin uses more InfiniBand.

Doug
> On Feb 1, 2017, at 3:25 PM, Andy Witzig  wrote:
> 
> Hi Tom,
> 
> The cluster uses an Infiniband interconnect.  On the cluster I’m requesting: 
> #PBS -l walltime=24:00:00,nodes=1:ppn=20.  So technically, the run on the 
> cluster should be SMP on the node, since there are 20 cores/node.  On the 
> workstation I’m just using the command: mpirun -np 20 …. I haven’t finished 
> setting Torque/PBS up yet.
> 
> Best regards,
> Andy
> 
> On Feb 1, 2017, at 4:10 PM, Elken, Tom  wrote:
> 
> For this case:  " a cluster system with 2.6GHz Intel Haswell with 20 cores / 
> node and 128GB RAM/node.  "
> 
> are you running 5 ranks per node on 4 nodes?
> What interconnect are you using for the cluster?
> 
> -Tom
> 

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] Performance Issues on SMP Workstation

2017-02-01 Thread Douglas L Reeder
Andy,

What allocation scheme are you using on the cluster? For some codes we see 
noticeable differences using fillup vs. round robin, though not 4x. Fillup 
makes more use of shared memory, while round robin uses more InfiniBand.
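To make the distinction concrete, here is a small sketch (plain Python, not Open MPI code) of how 20 ranks would land on a 4-node, 20-cores-per-node cluster under each scheme; only the node and core counts come from this thread, the helper names are mine:

```python
# Sketch: rank-to-node placement under "fillup" (pack one node first)
# vs. "round robin" (cycle ranks across nodes).

def fillup(nranks, nodes, cores_per_node):
    """Assign each rank to the first node that still has a free core."""
    return [min(r // cores_per_node, nodes - 1) for r in range(nranks)]

def round_robin(nranks, nodes):
    """Cycle ranks across nodes one at a time."""
    return [r % nodes for r in range(nranks)]

# 20 ranks, 4 nodes, 20 cores/node: fillup packs node 0, so all traffic
# stays in shared memory; round robin spreads 5 ranks per node, so most
# traffic crosses InfiniBand.
print(fillup(20, 4, 20).count(0))   # 20 -> every rank on node 0
print(round_robin(20, 4).count(0))  # 5  -> 5 ranks per node
```

That difference in shared-memory vs. interconnect traffic is why the two schemes can perform differently for the same code.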

Doug


Re: [OMPI users] Performance Issues on SMP Workstation

2017-02-01 Thread Andy Witzig
Hi Tom,

The cluster uses an Infiniband interconnect.  On the cluster I’m requesting: 
#PBS -l walltime=24:00:00,nodes=1:ppn=20.  So technically, the run on the 
cluster should be SMP on the node, since there are 20 cores/node.  On the 
workstation I’m just using the command: mpirun -np 20 …. I haven’t finished 
setting Torque/PBS up yet.

Best regards,
Andy
 


Re: [OMPI users] Performance Issues on SMP Workstation

2017-02-01 Thread Elken, Tom
For this case:  " a cluster system with 2.6GHz Intel Haswell with 20 cores / 
node and 128GB RAM/node.  "

are you running 5 ranks per node on 4 nodes?
What interconnect are you using for the cluster?

-Tom


Re: [OMPI users] Open MPI over RoCE using breakout cable and switch

2017-02-01 Thread Brendan Myers
Hello Howard,

I was wondering if you have been able to look at this issue at all, or if 
anyone has any ideas on what to try next.

 

Thank you,

Brendan

 

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Brendan Myers
Sent: Tuesday, January 24, 2017 11:11 AM
To: 'Open MPI Users' 
Subject: Re: [OMPI users] Open MPI over RoCE using breakout cable and switch

 

Hello Howard,

Here is the error output after building with debug enabled.  These CX4 Mellanox 
cards view each port as a separate device and I am using port 1 on the card 
which is device mlx5_0. 

 

Thank you,

Brendan

 

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Howard 
Pritchard
Sent: Tuesday, January 24, 2017 8:21 AM
To: Open MPI Users
Subject: Re: [OMPI users] Open MPI over RoCE using breakout cable and switch

 

Hello Brendan,

 

This helps some, but looks like we need more debug output.

 

Could you build a debug version of Open MPI by adding --enable-debug

to the config options and rerun the test with the breakout cable setup

and keeping the --mca btl_base_verbose 100 command line option?

 

Thanks

 

Howard

 

 

2017-01-23 8:23 GMT-07:00 Brendan Myers:

Hello Howard,

Thank you for looking into this. Attached is the output you requested.  Also, I 
am using Open MPI 2.0.1.

 

Thank you,

Brendan

 

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Howard Pritchard
Sent: Friday, January 20, 2017 6:35 PM
To: Open MPI Users
Subject: Re: [OMPI users] Open MPI over RoCE using breakout cable and switch

 

Hi Brendan

 

I doubt this kind of config has gotten any testing with OMPI.  Could you rerun 
with

 

--mca btl_base_verbose 100

 

added to the command line and post the output to the list?

 

Howard

 

 

Brendan Myers wrote on Fri, Jan 20, 2017 at 15:04:

Hello,

I am attempting to get Open MPI to run over 2 nodes using a switch and a single 
breakout cable with this design:

(100GbE)QSFP <> 2x (50GbE)QSFP   

 

Hardware Layout:

Breakout cable module A connects to switch (100GbE)

Breakout cable module B1 connects to node 1 RoCE NIC (50GbE)

Breakout cable module B2 connects to node 2 RoCE NIC (50GbE)

Switch is Mellanox SN 2700 100GbE RoCE switch

 

* I am able to pass RDMA traffic between the nodes with perftest 
(ib_write_bw) when using the breakout cable as the IC from both nodes to the 
switch.

* When attempting to run a job using the breakout cable as the IC, Open 
MPI aborts with "failure to initialize OpenFabrics device" errors.

* If I replace the breakout cable with 2 standard QSFP cables, the Open 
MPI job completes correctly.

 

 

This is the command I use; it works unless I attempt a run with the breakout 
cable as the IC:

mpirun --mca btl openib,self,sm --mca btl_openib_receive_queues 
P,65536,120,64,32 --mca btl_openib_cpc_include rdmacm  -hostfile mpi-hosts-ce 
/usr/local/bin/IMB-MPI1
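As an aside for readers puzzling over the --mca btl_openib_receive_queues string above: it is a colon-separated list of queue specifications, each a comma-separated list whose first field is the queue type. A hedged sketch of a parser follows; the meanings of the numeric fields are my reading of the openib BTL convention (buffer size, buffer count, low watermark, credit window), so verify them against the Open MPI FAQ before relying on them:

```python
# Sketch: split an openib BTL receive_queues value such as
# "P,65536,120,64,32".  "P" marks a per-peer queue ("S" shared, "X" XRC);
# the numbers are kept as an opaque parameter list here.

def parse_receive_queues(spec):
    queues = []
    for queue in spec.split(":"):
        fields = queue.split(",")
        queues.append({"type": fields[0],
                       "params": [int(f) for f in fields[1:]]})
    return queues

print(parse_receive_queues("P,65536,120,64,32"))
```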

 

If anyone has any idea as to why using a breakout cable is causing my jobs to 
fail, please let me know.

 

Thank you,

 

Brendan T. W. Myers

brendan.my...@soft-forge.com  

Software Forge Inc

 


 


Re: [OMPI users] Performance Issues on SMP Workstation

2017-02-01 Thread Andrew Witzig
By the way, the workstation has a total of 36 cores / 72 threads, so using 
mpirun -np 20 is possible (and should be equivalent) on both platforms. 
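To make that equivalence explicit, a tiny hypothetical helper using only the core counts quoted in this thread (neither machine is oversubscribed by -np 20):

```python
# The workstation has 36 physical cores (72 hyper-threads); the cluster
# node has 20 physical cores.  Count physical cores only, since
# hyper-threads rarely help compute-bound MPI ranks.

def oversubscribed(nranks, physical_cores):
    return nranks > physical_cores

print(oversubscribed(20, 36))  # workstation -> False
print(oversubscribed(20, 20))  # cluster node -> False
print(oversubscribed(40, 36))  # this would oversubscribe the workstation -> True
```

So any performance gap between the two runs has to come from placement, binding, or memory topology rather than from running more ranks than cores.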

Thanks,
cap79


Re: [OMPI users] Error using hpcc benchmark

2017-02-01 Thread Cabral, Matias A
Hi Wodel,

As you already figured out, mpirun -x 

[OMPI users] Performance Issues on SMP Workstation

2017-02-01 Thread Andy Witzig
Hi all,

I’m testing my application on a SMP workstation (dual Intel Xeon E5-2697 V4 2.3 
GHz Intel Broadwell (boost 2.8-3.1GHz) processors 128GB RAM) and am seeing a 4x 
performance drop compared to a cluster system with 2.6GHz Intel Haswell with 20 
cores / node and 128GB RAM/node.  Both applications have been compiled using 
OpenMPI 1.6.4.  I have tried running:

mpirun -np 20 $EXECUTABLE $INPUT_FILE
mpirun -np 20 --mca btl self,sm $EXECUTABLE $INPUT_FILE

and others, but cannot achieve the same performance on the workstation as is 
seen on the cluster.  The workstation outperforms on other non-MPI but 
multi-threaded applications, so I don’t think it’s a hardware issue.

Any help you can provide would be appreciated.

Thanks,
cap79

Re: [OMPI users] Error using hpcc benchmark

2017-02-01 Thread wodel youchi
Hi,

Thank you for your replies, but :-) it didn't work for me.

Using hpcc compiled with OpenMPI 2.0.1:
I tried to use export PSM_MQ_RECVREQS_MAX=1000 as mentioned by Howard, but
the job didn't take the export into account (I am starting the job from the
home directory of a user; the home directory is shared via NFS with all
compute nodes).
I tried using .bash_profile to export the variable, but the job didn't take
it into account either; I got the same error:

Exhausted 1048576 MQ irecv request descriptors, which usually indicates a
user program error or insufficient request descriptors
(PSM_MQ_RECVREQS_MAX=1048576)

And as I mentioned before, each time on different node(s).


From the help of the mpirun command, I read that to pass an environment
variable we have to use -x with the command, i.e.:
mpirun -np 512 -x PSM_MQ_RECVREQS_MAX=1000 --mca mtl psm --hostfile
hosts32 /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt
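(As an aside on -x semantics: conceptually it just injects the named variable into the environment each rank starts with, which is why it works where an NFS-homed .bash_profile did not. A rough Python sketch, hypothetical helper only, not Open MPI source:)

```python
import os

def env_for_rank(extra):
    """Copy the launcher's environment and apply the -x overrides on top."""
    env = dict(os.environ)
    env.update(extra)
    return env

child = env_for_rank({"PSM_MQ_RECVREQS_MAX": "2097152"})
print(child["PSM_MQ_RECVREQS_MAX"])   # -> 2097152
```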

But when tested, I get these errors:

PSM was unable to open an endpoint. Please make sure that the network link
is active on the node and the hardware is functioning.
Error: Ran out of memory

I tested with lower values; the only one that worked for me is 2097152,
which is 2 times the default value of PSM_MQ_RECVREQS_MAX, but even with
this value I get the same error (now reporting the new value), and then the
job exits.

Exhausted 2097152 MQ irecv request descriptors, which usually indicates a
user program error or insufficient request descriptors
(PSM_MQ_RECVREQS_MAX=2097152)

PS: for Cabral, I didn't find any way to know the default value of
PSM_MEMORY to be able to modify it.

Any ideas? Could this be a problem in the InfiniBand configuration?

Does the MTU have anything to do with this problem?

ibv_devinfo
hca_id: qib0
transport:  InfiniBand (0)
fw_ver: 0.0.0
node_guid:  0011:7500:0070:59a6
sys_image_guid: 0011:7500:0070:59a6
vendor_id:  0x1175
vendor_part_id: 29474
hw_ver: 0x2
board_id:   InfiniPath_QLE7340
phys_port_cnt:  1
port:   1
state:  PORT_ACTIVE (4)

max_mtu:4096 (5)
active_mtu: 2048 (4)
sm_lid: 1
port_lid:   1
port_lmc:   0x00
link_layer: InfiniBand
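(The "(5)" and "(4)" next to max_mtu/active_mtu are InfiniBand MTU enum codes; a small decoder, with the mapping stated from memory of the IBTA enumeration that libibverbs exposes, so double-check against the ibv_devinfo output itself:)

```python
# IB MTU enum code -> bytes, as printed by ibv_devinfo.
IB_MTU = {1: 256, 2: 512, 3: 1024, 4: 2048, 5: 4096}

print(IB_MTU[5], IB_MTU[4])   # max_mtu 4096, active_mtu 2048
```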



Regards.

2017-01-31 17:55 GMT+01:00 Cabral, Matias A :

> Hi Wodel,
>
> As Howard mentioned, this is probably because many ranks are sending to a
> single one and exhausting the receive requests MQ. You can individually
> enlarge the receive/send request queues with the specific variables
> (PSM_MQ_RECVREQS_MAX / PSM_MQ_SENDREQS_MAX) or increase both with
> PSM_MEMORY=max.  Note that the psm library will allocate more system memory
> for the queues.
>
> Thanks,
>
> _MAC
>
> *From:* users [mailto:users-boun...@lists.open-mpi.org] *On Behalf Of*
> Howard Pritchard
> *Sent:* Tuesday, January 31, 2017 6:38 AM
> *To:* Open MPI Users 
> *Subject:* Re: [OMPI users] Error using hpcc benchmark
>
> Hi Wodel
>
> Randomaccess part of HPCC is probably causing this.
>
> Perhaps set a PSM env. variable:
>
> export PSM_MQ_RECVREQS_MAX=1000
>
> or something like that.
>
> Alternatively launch the job using
>
> mpirun --mca pml ob1 --host 
>
> to avoid use of psm.  Performance will probably suffer with this option
> however.
>
> Howard
>
> wodel youchi wrote on Tue, Jan 31, 2017 at 08:27:
>
> Hi,
>
> I am a newbie in the HPC world.
>
> I am trying to execute the hpcc benchmark on our cluster, but every time I
> start the job, I get this error, then the job exits
>
>
> compute017.22840 Exhausted 1048576 MQ irecv request descriptors, which
> usually indicates a user program error or insufficient request descriptors
> (PSM_MQ_RECVREQS_MAX=1048576)
> compute024.22840 Exhausted 1048576 MQ irecv request descriptors, which
> usually indicates a user program error or insufficient request descriptors
> (PSM_MQ_RECVREQS_MAX=1048576)
> compute019.22847 Exhausted 1048576 MQ irecv request descriptors, which
> usually indicates a user program error or insufficient request descriptors
> (PSM_MQ_RECVREQS_MAX=1048576)
> ---
> Primary job terminated normally, but 1 process returned a non-zero exit
> code. Per user-direction, the job has been aborted.
> ---
> mpirun detected that one or more processes exited with non-zero status,
> thus 

Re: [OMPI users] MPI_Comm_spawn question

2017-02-01 Thread elistratovaa
I am using Open MPI version 2.0.1.