Re: [OMPI devel] btl/scif: SIGSEGV in MPI_Finalize()

2014-06-02 Thread Hjelm, Nathan T
I agree. I will take a look and fix the issue sometime this week. -Nathan From: devel [devel-boun...@open-mpi.org] on behalf of George Bosilca [bosi...@icl.utk.edu] Sent: Monday, June 02, 2014 2:56 PM To: Open MPI Developers Subject: Re: [OMPI devel]

Re: [OMPI devel] btl/scif: SIGSEGV in MPI_Finalize()

2014-06-02 Thread George Bosilca
If the scif BTL registered its own memory registration function, I would have expected it to deregister it upon finalize. Without this we run into circular dependencies that are not solvable at the library level. George. On Mon, Jun 2, 2014 at 12:39 AM, Gilles Gouaillardet

Re: [OMPI devel] trunk failure

2014-06-02 Thread Mike Dubman
with this fix - no failure. Thanks! On Mon, Jun 2, 2014 at 8:52 PM, Ralph Castain wrote: > Yep, that's the one. Should have fixed that problem > > > On Jun 2, 2014, at 10:30 AM, Mike Dubman wrote: > > This one? "Fix typo that would cause a segfault

Re: [OMPI devel] trunk failure

2014-06-02 Thread Ralph Castain
Yep, that's the one. Should have fixed that problem On Jun 2, 2014, at 10:30 AM, Mike Dubman wrote: > This one? "Fix typo that would cause a segfault if orte_startup_timeout was > set" > If so, it is still running. > > > On Mon, Jun 2, 2014 at 8:16 PM, Ralph

Re: [OMPI devel] trunk failure

2014-06-02 Thread Mike Dubman
This one? "Fix typo that would cause a segfault if orte_startup_timeout was set" If so, it is still running. On Mon, Jun 2, 2014 at 8:16 PM, Ralph Castain wrote: > You're still missing a commit that fixed this problem > > On Jun 2, 2014, at 9:44 AM, Mike Dubman

Re: [OMPI devel] trunk failure

2014-06-02 Thread Ralph Castain
You're still missing a commit that fixed this problem On Jun 2, 2014, at 9:44 AM, Mike Dubman wrote: > The jenkins still failed (hang and killed by timeout after 3m) as below. No > env. mca params were used. > > Changes: > Revert r31926 and replace it with a more

Re: [OMPI devel] trunk failure

2014-06-02 Thread Ralph Castain
I fixed this - key was that it only would happen if the MCA param orte_startup_timeout was set. It really does help, folks, if you include info on what MCA params were set when you get these failures. Otherwise, it is impossible to replicate the problem. On Jun 2, 2014, at 6:49 AM, Ralph

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-02 Thread Gilles Gouaillardet
Thanks Ralph, i will try this tomorrow Cheers, Gilles On Tue, Jun 3, 2014 at 12:03 AM, Ralph Castain wrote: > I think I have this fixed with r31928, but have no way to test it on my > machine. Please see if it works for you. > > > On Jun 2, 2014, at 7:09 AM, Ralph

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-02 Thread Ralph Castain
I think I have this fixed with r31928, but have no way to test it on my machine. Please see if it works for you. On Jun 2, 2014, at 7:09 AM, Ralph Castain wrote: > This is indeed the problem - we are trying to send a message and don't know > how to get it somewhere. I'll

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-02 Thread Gilles Gouaillardet
Thanks Jeff, from the FAQ, openmpi should work on nodes that have different numbers of IB ports (at least since v1.2). About IB ports on the same subnet, all i was able to find is an explanation of why i get this warning : WARNING: There are more than one active ports on host '%s', but the default

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-02 Thread Ralph Castain
This is indeed the problem - we are trying to send a message and don't know how to get it somewhere. I'll break the loop, and then ask that you run this again with -mca oob_base_verbose 10 so we can see the intended recipient. On Jun 2, 2014, at 3:55 AM, Gilles Gouaillardet

Re: [OMPI devel] trunk failure

2014-06-02 Thread Ralph Castain
Hi guys I'm awake now and will take a look at this - thanks Ralph On Jun 2, 2014, at 6:34 AM, Mike Dubman wrote: > /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun > -np 8 -mca btl sm,tcp --mca rtc_freq_priority 0 >

Re: [OMPI devel] trunk failure

2014-06-02 Thread Mike Dubman
/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun -np 8 -mca btl sm,tcp --mca rtc_freq_priority 0 /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/examples/hello_usempi
Program terminated with signal 11, Segmentation fault.

Re: [OMPI devel] trunk failure

2014-06-02 Thread Gilles Gouaillardet
OK, please send me a clean gdb backtrace :
ulimit -c unlimited /* this should generate a core */
mpirun ...
gdb mpirun core...
bt
if no core
gdb mpirun
r -np ... --mca ...
...
and after the crash
bt
then i can only review the code and hope i can find the root cause of the error i am unable to

Re: [OMPI devel] trunk failure

2014-06-02 Thread Mike Dubman
Hi, Jenkins picked up your commit and applied it automatically; I tried with the mca flag later. Also, we don't have /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor on our system; the cpuspeed daemon is off by default on all our nodes. Regards M On Mon, Jun 2, 2014 at 3:00 PM, Gilles

Re: [OMPI devel] trunk failure

2014-06-02 Thread Gilles Gouaillardet
Mike, did you apply the patch *and* mpirun --mca rtc_freq_priority 0 ? *both* are required (--mca rtc_freq_priority 0 is not enough without the patch) can you please confirm there is no /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor (pseudo) file on your system ? if this still does not

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-02 Thread Jeff Squyres (jsquyres)
I'm AFK but let me reply about the IB thing: double ports/multi rail is a good thing. It's not a good thing if they're on the same subnet. Check the FAQ - http://www.open-mpi.org/faq/?category=openfabrics - I can't see it well enough on the small screen of my phone, but I think there's a q on

Re: [OMPI devel] trunk failure

2014-06-02 Thread Mike Dubman
more info: specifying --mca rtc_freq_priority 0 explicitly generates a different kind of failure: $/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun -np 8 -mca btl sm,tcp --mca rtc_freq_priority 0

Re: [OMPI devel] trunk failure

2014-06-02 Thread Mike Dubman
Hi, This fix "orte_rtc_base_select: skip a RTC module if it has a zero priority" did not help and jenkins still fails as before. ompi was configured with: --with-platform=contrib/platform/mellanox/optimized --with-ompi-param-check --enable-picky --with-knem --with-mxm --with-fca The run was on
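The commit title quoted above describes skipping a module whose priority is zero during selection. A minimal sketch of that idea follows, under stated assumptions: example_module_t and select_modules() are hypothetical names invented for illustration, the priority values are made up, and treating negative priorities like zero is also an assumption. This is not the actual orte_rtc_base_select code.

```c
#include <stddef.h>

/* Hypothetical module descriptor, for illustration only. */
typedef struct {
    const char *name;
    int priority;
} example_module_t;

/* Keep only modules with a positive priority, compacting them to the
 * front of the array in their original order. Returns how many were kept. */
static size_t select_modules(example_module_t *mods, size_t n)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; ++i) {
        if (mods[i].priority <= 0) {
            continue;               /* zero priority: module is skipped */
        }
        mods[kept++] = mods[i];
    }
    return kept;
}
```

With this shape, setting a module's priority to 0 (for example through an MCA parameter such as rtc_freq_priority) would mean the module is never selected and so none of its code runs.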

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-02 Thread Gilles Gouaillardet
Jeff, On Mon, Jun 2, 2014 at 7:26 PM, Jeff Squyres (jsquyres) wrote: > On Jun 2, 2014, at 5:03 AM, Gilles Gouaillardet < > gilles.gouaillar...@gmail.com> wrote: > > > i faced a slightly different problem, but one that is 100% reproducible : > > - i launch mpirun (no batch manager)

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-02 Thread Jeff Squyres (jsquyres)
On Jun 2, 2014, at 5:03 AM, Gilles Gouaillardet wrote: > i faced a slightly different problem, but one that is 100% reproducible : > - i launch mpirun (no batch manager) from a node with one IB port > - i use -host node01,node02 where node01 and node02 both have two IB

[OMPI devel] already funded stay at your institution

2014-06-02 Thread Manuel Rodriguez Pascual
To whom it may concern, My name is Manuel Rodríguez. I am a postdoctoral researcher at CIEMAT, the leading Spanish research centre for Energy, Environment and Technology. The purpose of this mail is to inform you about a possible funded stay at your institution. I have been offered a research grant, fully funded

Re: [OMPI devel] trunk failure

2014-06-02 Thread Gilles Gouaillardet
Mike and Ralph, i could not find a simple workaround. for the time being, i committed r31926 and invite those who face a similar issue to use the following workaround : export OMPI_MCA_rtc_freq_priority=0 /* or mpirun --mca rtc_freq_priority 0 ... */ Cheers, Gilles On Mon, Jun 2, 2014 at

Re: [OMPI devel] trunk failure

2014-06-02 Thread Gilles Gouaillardet
Mike and Ralph, i got the very same error. in orte/mca/rtc/freq/rtc_freq.c at line 187 : fp = fopen(filename, "r"); and filename is "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor". there is no error check, so if fp is NULL, orte_getline() will call fgets(), which will crash. that can happen
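The failure mode described above (fopen() returning NULL and the result being passed on to fgets()) can be illustrated with a minimal sketch. It assumes a hypothetical helper, read_first_line(), written only for this example; it is not the actual rtc_freq.c code, nor the fix that was committed.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, for illustration only: read the first line of a
 * file into buf, without the trailing newline.
 * Returns 0 on success, -1 if the file cannot be opened or is empty. */
static int read_first_line(const char *path, char *buf, size_t len)
{
    FILE *fp;

    if (NULL == buf || 0 == len) {
        return -1;
    }
    buf[0] = '\0';

    fp = fopen(path, "r");
    if (NULL == fp) {
        /* e.g. no cpufreq support on this node: report it, do not crash */
        return -1;
    }
    if (NULL == fgets(buf, (int)len, fp)) {
        fclose(fp);
        buf[0] = '\0';
        return -1;
    }
    fclose(fp);

    buf[strcspn(buf, "\n")] = '\0';  /* strip the trailing newline, if any */
    return 0;
}
```

A caller would treat a -1 return as "this information is not available on this node" and skip the related logic, rather than assuming the (pseudo) file always exists.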

Re: [OMPI devel] trunk failure

2014-06-02 Thread Ralph Castain
It's merrily passing all my MTT tests, so it appears to be fine for me. It would help if you provided *some* information along with these reports - like how was this configured, what environment are you running under, how many nodes were you using, etc. Otherwise, it's a totally useless report.

Re: [OMPI devel] trunk failure

2014-06-02 Thread Ralph Castain
I'm afraid that tells me absolutely nothing. On Jun 1, 2014, at 8:50 PM, Mike Dubman wrote: > Hi, > The trunk hangs after following commits, seems 3-5,7 can be the ones. > Changes > Java-oshmem: update examples > Java: update javadoc's install locations > Replace

[OMPI devel] trunk failure

2014-06-02 Thread Mike Dubman
Hi, The trunk hangs after the following commits; it seems 3-5 and 7 can be the ones. Changes
1. Java-oshmem: update examples
2. Java: update javadoc's install locations
3. Replace the PML barrier with an RTE barrier for now until we can come up with a better solution for connectionless BTLs.