I agree. I will take a look and fix the issue sometime this week.
-Nathan
From: devel [devel-boun...@open-mpi.org] on behalf of George Bosilca
[bosi...@icl.utk.edu]
Sent: Monday, June 02, 2014 2:56 PM
To: Open MPI Developers
Subject: Re: [OMPI devel]
If the scif BTL registered its own memory registration function, I
would have expected it to deregister it upon finalize. Without that
we run into circular dependencies that are not solvable at the
library level.
George.
On Mon, Jun 2, 2014 at 12:39 AM, Gilles Gouaillardet
with this fix - no failure.
Thanks!
On Mon, Jun 2, 2014 at 8:52 PM, Ralph Castain wrote:
> Yep, that's the one. Should have fixed that problem
>
>
> On Jun 2, 2014, at 10:30 AM, Mike Dubman wrote:
>
> This one? "Fix typo that would cause a segfault
Yep, that's the one. Should have fixed that problem
On Jun 2, 2014, at 10:30 AM, Mike Dubman wrote:
> This one? "Fix typo that would cause a segfault if orte_startup_timeout was
> set"
> If so, it is still running.
>
>
> On Mon, Jun 2, 2014 at 8:16 PM, Ralph
This one? "Fix typo that would cause a segfault if orte_startup_timeout was
set"
If so, it is still running.
On Mon, Jun 2, 2014 at 8:16 PM, Ralph Castain wrote:
> You're still missing a commit that fixed this problem
>
> On Jun 2, 2014, at 9:44 AM, Mike Dubman
You're still missing a commit that fixed this problem
On Jun 2, 2014, at 9:44 AM, Mike Dubman wrote:
> Jenkins still failed (hung and was killed by the timeout after 3m), as below.
> No env. mca params were used.
>
> Changes:
> Revert r31926 and replace it with a more
I fixed this - key was that it only would happen if the MCA param
orte_startup_timeout was set.
It really does help, folks, if you include info on what MCA params were set
when you get these failures. Otherwise, it is impossible to replicate the
problem.
On Jun 2, 2014, at 6:49 AM, Ralph
Thanks Ralph,
i will try this tomorrow
Cheers,
Gilles
On Tue, Jun 3, 2014 at 12:03 AM, Ralph Castain wrote:
> I think I have this fixed with r31928, but have no way to test it on my
> machine. Please see if it works for you.
>
>
> On Jun 2, 2014, at 7:09 AM, Ralph
I think I have this fixed with r31928, but have no way to test it on my
machine. Please see if it works for you.
On Jun 2, 2014, at 7:09 AM, Ralph Castain wrote:
> This is indeed the problem - we are trying to send a message and don't know
> how to get it somewhere. I'll
Thanks Jeff,
from the FAQ, openmpi should work on nodes that have different numbers of IB
ports (at least since v1.2)
about IB ports on the same subnet, all i was able to find is an explanation
of why i get this warning :
WARNING: There are more than one active ports on host '%s', but the
default
This is indeed the problem - we are trying to send a message and don't know how
to get it somewhere. I'll break the loop, and then ask that you run this again
with -mca oob_base_verbose 10 so we can see the intended recipient.
On Jun 2, 2014, at 3:55 AM, Gilles Gouaillardet
Hi guys
I'm awake now and will take a look at this - thanks
Ralph
On Jun 2, 2014, at 6:34 AM, Mike Dubman wrote:
> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun
> -np 8 -mca btl sm,tcp --mca rtc_freq_priority 0
>
/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun
-np 8 -mca btl sm,tcp --mca rtc_freq_priority 0
/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/examples/hello_usempi
Program terminated with signal 11, Segmentation fault.
OK,
please send me a clean gdb backtrace :
ulimit -c unlimited
mpirun ...   /* this should generate a core */
gdb mpirun core...
bt
if there is no core :
gdb mpirun
r -np ... --mca ... ...
and after the crash :
bt
then i can only review the code and hope i can find the root cause of the
error i am unable to
Hi,
Jenkins took your commit and applied it automatically; I tried with the mca
flag later.
Also, we don't have /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
on our system; the cpuspeed daemon is off by default on all our nodes.
Regards
M
On Mon, Jun 2, 2014 at 3:00 PM, Gilles
Mike,
did you apply the patch *and* run mpirun with --mca rtc_freq_priority 0 ?
*both* are required (--mca rtc_freq_priority 0 is not enough without the
patch)
can you please confirm there is no
/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
(pseudo) file on your system ?
if this still does not
I'm AFK but let me reply about the IB thing: double ports/multi rail is a good
thing. It's not a good thing if they're on the same subnet.
Check the FAQ - http://www.open-mpi.org/faq/?category=openfabrics - I can't see
it well enough on the small screen of my phone, but I think there's a q on
More info: specifying --mca rtc_freq_priority 0 explicitly generates a
different kind of failure:
$/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun
-np 8 -mca btl sm,tcp --mca rtc_freq_priority 0
Hi,
This fix "orte_rtc_base_select: skip a RTC module if it has a zero
priority" did not help and jenkins still fails as before.
The ompi was configured:
--with-platform=contrib/platform/mellanox/optimized --with-ompi-param-check
--enable-picky --with-knem --with-mxm --with-fca
The run was on
Jeff,
On Mon, Jun 2, 2014 at 7:26 PM, Jeff Squyres (jsquyres)
wrote:
> On Jun 2, 2014, at 5:03 AM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
> > i faced a bit different problem, but that is 100% reproducible :
> > - i launch mpirun (no batch manager)
On Jun 2, 2014, at 5:03 AM, Gilles Gouaillardet
wrote:
> i faced a bit different problem, but that is 100% reproducible :
> - i launch mpirun (no batch manager) from a node with one IB port
> - i use -host node01,node02 where node01 and node02 both have two IB
To whom it may concern,
My name is Manuel Rodríguez. I am a postdoctoral researcher at CIEMAT, the
Spanish leading research centre for Energy, Environment and
Technology. This mail is to inform you about a possible
funded stay at your institution.
I have been offered a research grant, fully funded
Mike and Ralph,
i could not find a simple workaround.
for the time being, i committed r31926 and invite those who face a similar
issue to use the following workaround :
export OMPI_MCA_rtc_freq_priority=0
/* or mpirun --mca rtc_freq_priority 0 ... */
Cheers,
Gilles
On Mon, Jun 2, 2014 at
Mike and Ralph,
i got the very same error.
in orte/mca/rtc/freq/rtc_freq.c at line 187
fp = fopen(filename, "r");
and filename is "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"
there is no error check, so if fp is NULL, orte_getline() will call fgets(),
which will crash.
that can happen
It's merrily passing all my MTT tests, so it appears to be fine for me.
It would help if you provided *some* information along with these reports -
like how was this configured, what environment are you running under, how many
nodes were you using, etc. Otherwise, it's a totally useless report.
I'm afraid that tells me absolutely nothing.
On Jun 1, 2014, at 8:50 PM, Mike Dubman wrote:
> Hi,
> The trunk hangs after the following commits; it seems commits 3-5 and 7 could be the ones.
> Changes
> Java-oshmem: update examples
> Java: update javadoc's install locations
> Replace
Hi,
The trunk hangs after the following commits; it seems commits 3-5 and 7 could be the ones.
Changes
1. Java-oshmem: update examples
2. Java: update javadoc's install locations
3. Replace the PML barrier with an RTE barrier for now until we can come
up with a better solution for connectionless BTLs.