Folks,
several mtt tests (ompi-trunk r31963) failed (SIGSEGV in mpirun) with a
similar stack trace.
For example, you can refer to :
http://mtt.open-mpi.org/index.php?do_redir=2199
the issue is not related in any way to the init_thread_serialized test
(other tests failed with similar symptoms)
at the MPI_Abort hang as I'm having trouble replicating it.
>
>
> On Jun 5, 2014, at 7:16 PM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>
> > Jeff,
> >
> > as pointed out by Ralph, i do wish to use eth0 for oob messages.
> >
>
Jeff,
as pointed out by Ralph, i do wish to use eth0 for oob messages.
i work on a 4k+ nodes cluster with a very decent gigabit ethernet
network (reasonable oversubscription + switches
from a reputable vendor you are familiar with ;-) )
my experience is that IPoIB can be very slow at establishing a
Ralph,
the application still hangs, i attached new logs.
on slurm0, if i /sbin/ifconfig eth0:1 down
then the application does not hang any more
Cheers,
Gilles
On Wed, Jun 4, 2014 at 12:43 PM, Ralph Castain wrote:
> I appear to have this fixed now - please give the
Ralph,
slurm is installed and running on both nodes.
that being said, there is no running job on any node so unless
mpirun automagically detects slurm is up and running, i assume
i am running under rsh.
i can run the test again after i stop slurm if needed, but that will not
happen before
a btl tcp,self --mca oob_base_verbose 10
./abort
the oob logs are attached
Cheers,
Gilles
On Tue, Jun 3, 2014 at 12:10 AM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:
> Thanks Ralph,
>
> i will try this tomorrow
>
> Cheers,
>
> Gilles
>
>
>
> On Tue
recipient.
>
> On Jun 2, 2014, at 3:55 AM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
> #7 0x7fe8fab67ce3 in mca_oob_tcp_component_hop_unknown () from
> /.../local/ompi-trunk/lib/openmpi/mca_oob_tcp.so
>
>
>
>
Thanks Jeff,
from the FAQ, openmpi should work on nodes that have different numbers of IB
ports (at least since v1.2)
about IB ports on the same subnet, all i was able to find is explanation
about why i get this warning :
WARNING: There are more than one active ports on host '%s', but the
default
q/scaling_governor
> in our system, the cpuspeed daemon is off by default on all our nodes.
>
>
> Regards
> M
>
>
> On Mon, Jun 2, 2014 at 3:00 PM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
>> Mike,
>>
>> did you apply the
libc.so.6(__libc_start_main+0xfd)[0x393741ecdd]
>> [vegas12:13834] [10]
>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun[0x403909]
>> [vegas12:13834] *** End of error message ***
>> Segmentation fault (core dumped)
>>
>&
Jeff,
On Mon, Jun 2, 2014 at 7:26 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com>
wrote:
> On Jun 2, 2014, at 5:03 AM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
> > i faced a slightly different problem, but it is 100% reproducible :
> > -
:45 PM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:
> in orte/mca/rtc/freq/rtc_freq.c at line 187
> fp = fopen(filename, "r");
> and filename is "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"
>
> there is no error check, so if fp
Mike and Ralph,
i got the very same error.
in orte/mca/rtc/freq/rtc_freq.c at line 187
fp = fopen(filename, "r");
and filename is "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"
there is no error check, so if fp is NULL, orte_getline() will call fgets(),
which will crash.
that can happen
Artem,
thanks for the feedback.
i committed the patch to the trunk (r31922)
as i indicated in the commit log, this patch is likely suboptimal and has
room for improvement.
Jeff commented about the usnic related issue, so i will wait for a fix from
the Cisco folks.
Cheers,
Gilles
On Sun,
Artem,
this looks like the issue initially reported by Rolf
http://www.open-mpi.org/community/lists/devel/2014/05/14836.php
in http://www.open-mpi.org/community/lists/devel/2014/05/14839.php
i posted a patch and a workaround :
export OMPI_MCA_btl_openib_use_eager_rdma=0
i do not recall i
Folks,
i recently had to solve a tricky issue that involves alignment of fortran
types.
the attached program can be built and run on two tasks in order to
demonstrate the issue.
if gfortran is used (to build both openmpi and the test case), then the
test is successful
if ifort (Intel compiler) is
>
this looks like an up-to-date CentOS box.
i am unable to reproduce the warnings (may be uninitialized in this
function) with a similar box :-(
> On May 27, 2014, at 9:29 PM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
> so far, it seems this is a false positive
good to know !
how should we handle this within mtt ?
decrease nseconds to 570 ?
Cheers,
Gilles
On Thu, May 29, 2014 at 12:03 AM, Ralph Castain <r...@open-mpi.org> wrote:
> Ah, that satisfied it!
>
> Sorry for the chase - I'll update my test.
>
>
> On May 28,
Ralph,
On Wed, May 28, 2014 at 9:33 PM, Ralph Castain wrote:
> This is definitely what happens : only some tasks call MPI_Comm_free()
>
>
> Really? I don't see how that can happen in loop_spawn - every process is
> clearly calling comm_free. Or are you referring to the
Jeff,
On Wed, May 28, 2014 at 8:31 PM, Jeff Squyres (jsquyres)
> To be totally clear: MPI says it is erroneous for only some (not all)
processes in a communicator to call MPI_COMM_FREE. So if that's the real
problem, then the discussion about why the parent(s) is(are) trying to
contact the
wrote:
>
> > Hi Gilles
> >
> > I concur on the typo and fixed it - thanks for catching it. I'll have to
> look into the problem you reported as it has been fixed in the past, and
> was working last I checked it. The info required for this 3-way
> connect/accept is suppos
Ralph,
can you please describe your environment (at least compiler (and version) +
configure command line)
i checked osc_rdma_data_move.c only :
size_t incoming_length; is used to improve readability.
it is used only in an assert clause and in OPAL_OUTPUT_VERBOSE
one way to silence the warning
Ralph,
On 2014/05/28 12:10, Ralph Castain wrote:
> my understanding is that there are two ways of seeing things :
> a) the "R-way" : the problem is the parent should not try to communicate to
> already exited processes
> b) the "J-way" : the problem is the children should have waited either in
FINALIZE is allowed to block if it needs to, such that
> OMPI sending control messages to procs that are still "connected" (in the
> MPI sense) should never cause a race condition.
> >
> > As such, this sounds like an OMPI bug.
> >
> >
> >
> >
> > On May 27, 2014,
Thanks Jeff,
i can only speak for myself : i use OpenGrok on a daily basis and it is a
great help
Cheers,
Gilles
On Wed, May 28, 2014 at 8:21 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com
> wrote:
> I can ask IU to adjust the OpenGrok config.
>
>
> On May 27, 2014,
Folks,
while debugging the dynamic/intercomm_create from the ibm test suite, i
found something odd.
i ran *without* any batch manager on a VM (one socket and four cpus)
mpirun -np 1 ./dynamic/intercomm_create
it hangs by default
it works with --mca coll ^ml
basically :
- task 0 spawns task 1
-
Folks,
currently, the dynamic/intercomm_create test from the ibm test suite output
the following messages :
dpm_base_disconnect_init: error -12 in isend to process 1
the root cause is that task 0 tries to send messages to already exited tasks.
one way of seeing things is that this is an application
Folks,
OMPI Opengrok search (http://svn.open-mpi.org/source) currently returns
results for :
- trunk
- v1.6 branch
- v1.5 branch
- v1.3 branch
imho, it could/should return results for the following branches :
- trunk
- v1.8 branch
- v1.6 branch
and maybe the v1.4 branch (and the v1.9 branch when
Rolf,
the assert fails because the endpoint reference count is greater than one.
the root cause is the endpoint has been added to the list of
eager_rdma_buffers of the openib btl device (and hence OBJ_RETAIN'ed at
ompi/mca/btl/openib/btl_openib_endpoint.c:1009)
a simple workaround is not to use
(e.g. use btl_send)
- my suggested update of line 498 (e.g. use btl_send) was correct.
Cheers,
Gilles
On 2014/05/20 4:06, Nathan Hjelm wrote:
> On Mon, May 19, 2014 at 02:14:57PM +0900, Gilles Gouaillardet wrote:
>>Nathan,
>>
>>do you mean the bug/typo was not a
Thanks guys !
i committed r31816 (bfo: allocate the allocator in init rather than open)
and made a CMR
based on mtt results, i will push George's commit tomorrow.
and based on Rolf's recommendation, i will do the CMR by the end of the week
if everything
works fine
Gilles
...@open-mpi.org]
> Sent: Thursday, May 15, 2014 10:43 PM
> To: s...@open-mpi.org
> Subject: [OMPI svn] svn:open-mpi r31786 - trunk/ompi/mca/bml/r2
>
> Author: ggouaillardet (Gilles Gouaillardet)
> Date: 2014-05-16 00:43:18 EDT (Fri, 16 May 2014)
> New Revision: 31786
> URL:
Folks,
i was unable to compile trunk after svn update.
i use different directories (aka VPATH) for source and build
error message is related to the missing shmem/java directory
from the oshmem directory.
The attached patch fixed this.
/* that being said, i did not try to build java for oshmem,
Folks,
there is a small memory leak in ompi/mca/pml/bfo/pml_bfo_component.c
in my environment, this module is not used.
this means mca_pml_bfo_component_open() and mca_pml_bfo_component_close()
are invoked but
mca_pml_bfo_component_init() and mca_pml_bfo_component_fini() are *not*
invoked.
Folks,
a simple
mpirun -np 2 -host localhost --mca btl tcp,self mpi_helloworld
crashes after some of yesterday's commits (i would blame r31778 and/or
r31782,
but i am not 100% sure)
/* a list receives a negative value, so the program takes some time
before crashing,
symptom may vary from one system
Nathan,
this had no effect on my environment :-(
i am not sure you can reuse mca_btl_scif_module.scif_fd with connect()
i had to use a new scif fd for that.
then i ran into an other glitch : if the listen thread does not
scif_accept() the connection,
the scif_connect() will take 30 seconds
Folks,
since r31765 (opal/event: release the opal event context when closing
the event base)
mpirun crashes at the end of the job.
for example :
$ mpirun --mca btl tcp,self -n 4 `pwd`/src/MPI_Allreduce_user_c
MPITEST info (0): Starting MPI_Allreduce_user() test
MPITEST_results:
Nathan,
> Looks like this is a scif bug. From the documentation:
and from the source code, scif_poll(...) simply calls poll(...)
at least in MPSS 2.1
> Since that is not the case I will look through the documentation and see
if there is a way other than pthread_cancel.
what about :
- use a
o it is certainly doable.
>
> I don't know the specifics of why Nathan's code is having trouble exiting,
> but I suspect that a simple solution - not involving pthread_cancel - can be
> readily developed.
>
>
> On May 13, 2014, at 7:18 PM, Gilles Gouaillardet
> <gilles
i wrote this too early ...
the attached program produces incorrect results when ran with
--mca btl scif,vader,self
once the most up-to-date patch of #4610 has been applied, (at least) one
bug remains, and it is in the scif btl
the attached patch fixes it.
Gilles
On 2014/05/12 16:17, Gilles
Nathan,
On 2014/05/08 4:21, Hjelm, Nathan T wrote:
> c) that being said, that should work so there is a bug
> d) there is a regression in v1.8 and a bug that might have been always here
> This is probably not a regression. The SCIF btl has been part of the 1.7
> series for some time. The nightly
issue, i will investigate more next week
Gilles
On 2014/05/09 18:08, Gilles Gouaillardet wrote:
> I ran some more investigations with --mca btl scif,self
>
> i found that the previous patch i posted was complete crap and i
> apologize for it.
>
> on a brighter side, and imho, the
I ran some more investigations with --mca btl scif,self
i found that the previous patch i posted was complete crap and i
apologize for it.
on a brighter side, and imho, the issue only occurs if fragments are
received (and then processed) out of order.
/* i did not observe this with the tcp btl,
From: devel [devel-boun...@open-mpi.org] on behalf of Gilles Gouaillardet
> [gilles.gouaillar...@iferc.org]
> Sent: Thursday, May 08, 2014 1:32 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] regression with derived datatypes
>
> George,
>
> you do not need a
Nathan and George,
here are the output files of the original test_scif.c
the command line was
mpirun -np 2 -host localhost --mca btl scif,vader,self --mca
mpi_ddt_unpack_debug 1 --mca mpi_ddt_pack_debug 1 --mca
mpi_ddt_position_debug 1 a.out
this is a silent failure and there is no core file
George,
you do not need any hardware, just download MPSS from Intel and install it.
make sure the mic kernel module is loaded *and* you can read/write to the
newly created /dev/mic/* devices.
/* i am now running this on a virtual machine with no MIC whatsoever */
i was able to improve things a
On 2014/05/08 2:15, Ralph Castain wrote:
> I wonder if that might also explain the issue reported by Gilles regarding
> the scif BTL? In his example, the problem only occurred if the message was
> split across scif and vader. If so, then it might be that splitting messages
> in general is
Dear OpenMPI Folks,
i noticed some crashes when running OpenMPI (both latest v1.8 and trunk
from svn) on a single linux system where a MIC is available.
/* strictly speaking, MIC hardware is not needed: libscif.so, mic kernel
module and accessible /dev/mic/* are enough */
the attached test_scif
Joost,
i created #4581 and attached a patch (for the trunk) in order to solve
this leak (and two similar ones)
Cheers,
Gilles
On 2014/04/29 5:18, VandeVondele Joost wrote:
> Hi,
>
> I applied the patch from ticket #4569 (to 1.8.1), and things improved (in
> particular the reported issue is
Edgar and Christoph,
i do not think ROMIO supports this yet.
from ompi/mca/io/romio/romio/README
"This version of ROMIO includes everything defined in the MPI I/O
chapter except support for file interoperability [...]"
i also ran ompi/mca/io/romio/romio/test/external32.c :
on a x86_64 box
Lisandro,
i assume you are running OpenMPI 1.8
r31554 fixes this issue (and some others)
https://svn.open-mpi.org/trac/ompi/changeset/31554/branches/v1.8/ompi/communicator/comm_cid.c
the root cause was an uninitialized variable (rc in
ompi/communicator/comm_cid.c), and the issue only occurred when
homogeneous
> cluster, even with --enable-hetero. I've run it that way on my cluster.
>
> On Apr 27, 2014, at 7:50 PM, Gilles Gouaillardet
> <gilles.gouaillar...@iferc.org> wrote:
>
>> According to Jeff's comment, OpenMPI compiled with
>> --enable-heterogeneous is broken
According to Jeff's comment, OpenMPI compiled with
--enable-heterogeneous is broken even in a homogeneous cluster.
as a first step, MTT could be run with OpenMPI compiled with
--enable-heterogeneous and running on a homogeneous cluster
(ideally on both little and big endian) in order to identify
Folks,
Here is attached an oversimplified version of the MPI_Recv_init_null_c
test from the
intel test suite.
the test works fine with v1.6, v1.7 and v1.8 branches but fails with the
trunk.
i wonder whether the bug is in OpenMPI or in the test itself.
on one hand, we could consider there is a bug
my bad :-(
this has just been fixed
Gilles
On 2014/04/23 14:55, Nathan Hjelm wrote:
> The ompi_datatype_flatten.c file appears to be missing. Let me know once
> it is committed and I will take a look. I will see if I can write the
> RMA code using it over the next week or so.
>
George,
i am sorry i cannot see how the flattened datatype can be helpful here :-(
in this example, the master must broadcast a long vector. this datatype
is contiguous,
so the flattened datatype *is* the type provided by the MPI application.
how would pipelining happen in this case (e.g. who has to
Nathan,
i uploaded this part to github :
https://github.com/ggouaillardet/ompi-svn-mirror/tree/flatten-datatype
you really need to check the last commit :
https://github.com/ggouaillardet/ompi-svn-mirror/commit/a8d014c6f144fa5732bdd25f8b6b05b07ea8
please consider this as experimental and
Dear OpenMPI developers,
i just created #4531 in order to track this issue :
https://svn.open-mpi.org/trac/ompi/ticket/4531
Basically, the coll/tuned implementation of MPI_Bcast does not work when
two tasks
use datatypes of different sizes.
for example, if the root send two large vectors of