Re: [OMPI users] InfiniBand, different OpenFabrics transport types

2011-07-19 Thread Bill Johnstone
Yevgeny,

Sorry for the delay in replying -- I'd been out for a few days.



- Original Message -
> From: Yevgeny Kliteynik 
> Sent: Thursday, July 14, 2011 12:51 AM
> Subject: Re: [OMPI users] InfiniBand, different OpenFabrics transport types
 
> While I'm trying to find an old HCA somewhere, could you please
> post here the output of "ibv_devinfo -v" on mthca?

:~$ ibv_devinfo -v
hca_id:    mthca0
    transport:            InfiniBand (0)
    fw_ver:                4.8.917
    node_guid:            0005:ad00:000b:60c0
    sys_image_guid:            0005:ad00:0100:d050
    vendor_id:            0x05ad
    vendor_part_id:            25208
    hw_ver:                0xA0
    board_id:            MT_00A001
    phys_port_cnt:            2
    max_mr_size:            0x
    page_size_cap:            0xf000
    max_qp:                64512
    max_qp_wr:            65535
    device_cap_flags:        0x1c76
    max_sge:            59
    max_sge_rd:            0
    max_cq:                65408
    max_cqe:            131071
    max_mr:                131056
    max_pd:                32768
    max_qp_rd_atom:            4
    max_ee_rd_atom:            0
    max_res_rd_atom:        258048
    max_qp_init_rd_atom:        128
    max_ee_init_rd_atom:        0
    atomic_cap:            ATOMIC_HCA (1)
    max_ee:                0
    max_rdd:            0
    max_mw:                0
    max_raw_ipv6_qp:        0
    max_raw_ethy_qp:        0
    max_mcast_grp:            8192
    max_mcast_qp_attach:        56
    max_total_mcast_qp_attach:    458752
    max_ah:                0
    max_fmr:            0
    max_srq:            960
    max_srq_wr:            65535
    max_srq_sge:            31
    max_pkeys:            64
    local_ca_ack_delay:        15
        port:    1
            state:            PORT_ACTIVE (4)
            max_mtu:        2048 (4)
            active_mtu:        2048 (4)
            sm_lid:            2
            port_lid:        49
            port_lmc:        0x00
            link_layer:        IB
            max_msg_sz:        0x8000
            port_cap_flags:        0x02510a68
            max_vl_num:        8 (4)
            bad_pkey_cntr:        0x0
            qkey_viol_cntr:        0x0
            sm_sl:            0
            pkey_tbl_len:        64
            gid_tbl_len:        32
            subnet_timeout:        8
            init_type_reply:    0
            active_width:        4X (2)
            active_speed:        2.5 Gbps (1)
            phys_state:        LINK_UP (5)
            GID[  0]:        fe80::::0005:ad00:000b:60c1

        port:    2
            state:            PORT_DOWN (1)
            max_mtu:        2048 (4)
            active_mtu:        512 (2)
            sm_lid:            0
            port_lid:        0
            port_lmc:        0x00
            link_layer:        IB
            max_msg_sz:        0x8000
            port_cap_flags:        0x02510a68
            max_vl_num:        8 (4)
            bad_pkey_cntr:        0x0
            qkey_viol_cntr:        0x0
            sm_sl:            0
            pkey_tbl_len:        64
            gid_tbl_len:        32
            subnet_timeout:        0
            init_type_reply:    0
            active_width:        4X (2)
            active_speed:        2.5 Gbps (1)
            phys_state:        POLLING (2)
            GID[  0]:        fe80::::0005:ad00:000b:60c2



Re: [OMPI users] InfiniBand, different OpenFabrics transport types

2011-07-11 Thread Bill Johnstone
Hi Yevgeny and list,



- Original Message -

> From: Yevgeny Kliteynik 

> I'll check the MCA_BTL_OPENIB_TRANSPORT_UNKNOWN thing and get back to you.

Thank you.

> One question though, just to make sure we're on the same page: so the jobs 
> do run OK on
> the older HCAs, as long as they run *only* on the older HCAs, right?

Yes, correct.  The jobs run on the newer hosts using the newer (ConnectX) HCAs as 
long as they stay on that (newer) HCA type, and they run on the older (mthca) 
HCAs so long as they stay on that HCA type as well.  IOW, as long as the jobs 
run on homogeneous IB hardware, they run successfully to completion.  We've 
successfully done things like Checkpoint/Restart using the BLCR functionality, 
and it all works well and appears robust.
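
For anyone searching the archives later, the C/R workflow we use is roughly the 
following sketch -- the PID and snapshot name below are placeholders, not copied 
from a real session:

    # start the job with checkpoint/restart support enabled
    mpirun -am ft-enable-cr -np 16 ./my_mpi_app

    # from another shell, checkpoint it by handing ompi-checkpoint the PID of mpirun
    ompi-checkpoint <pid_of_mpirun>

    # later, restart from the global snapshot that was written
    ompi-restart ompi_global_snapshot_<pid_of_mpirun>.ckpt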

> Please make sure that the jobs are using only IB with "--mca btl 
> openib,self" parameters.

The system is in use right now, so I will have to test this and get back to you, 
but I can say with certainty that we don't specify any --mca parameters unless a 
user needs to run Ethernet-only (to avoid the IB errors we're discussing). 
Otherwise, everything runs with the Open MPI 1.5.3 default behavior.  The users 
are also all using the systemwide Open MPI installation, so this isn't an issue 
of an erroneous local configuration left over from multiple parallel installs, 
interfering copies of different builds, etc.
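
When I do get a test window, what I plan to run is along these lines (the 
hostfile name, process count, and application name are placeholders):

    mpirun --mca btl openib,self -np 8 --hostfile mixed_hca_hosts ./my_mpi_app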

Other than the mandatory iw_cm kernel module, we are not building/using any 
iWarp or DAPL/uDAPL functionality.  We are also not running IP on the IB 
network.



Re: [OMPI users] InfiniBand, different OpenFabrics transport types

2011-07-08 Thread Bill Johnstone
Hello, and thanks for the reply.



- Original Message -
> From: Jeff Squyres <jsquy...@cisco.com>
> Sent: Thursday, July 7, 2011 5:14 PM
> Subject: Re: [OMPI users] InfiniBand, different OpenFabrics transport types
> 
> On Jun 28, 2011, at 1:46 PM, Bill Johnstone wrote:
> 
>>  I have a heterogeneous network of InfiniBand-equipped hosts which are all 
> connected to the same backbone switch, an older SDR 10 Gb/s unit.
>> 
>>  One set of nodes uses the Mellanox "ib_mthca" driver, while the 
> other uses the "mlx4" driver.
>> 
>>  This is on Linux 2.6.32, with Open MPI 1.5.3 .
>> 
>>  When I run Open MPI across these node types, I get an error message of the 
> form:
>> 
>>  Open MPI detected two different OpenFabrics transport types in the same 
> Infiniband network. 
>>  Such mixed network trasport configuration is not supported by Open MPI.
>> 
>>  Local host: compute-chassis-1-node-01
>>  Local adapter: mthca0 (vendor 0x5ad, part ID 25208) 
>>  Local transport type: MCA_BTL_OPENIB_TRANSPORT_UNKNOWN 
> 
> Wow, that's cool ("UNKNOWN").  Are you using an old version of 
> OFED or something?

No, this is a clean local build of the OFED 1.5.3 packages, but I don't have the 
full complement of OFED packages installed, since our setup does not use IPoIB, 
SDP, etc.

ibdiagnet and all the usual suspects work as expected, and I'm able to do 
large-scale Open MPI runs just fine, so long as I don't cross Mellanox HCA 
types.
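
For what it's worth, the quick comparison I do between the two HCA types is just 
something like:

    ibv_devinfo | grep -E 'hca_id|transport|fw_ver|vendor_id|vendor_part_id'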


> Mellanox -- how can this happen?
> 
>>  Remote host: compute-chassis-3-node-01
>>  Remote Adapter: (vendor 0x2c9, part ID 26428) 
>>  Remote transport type: MCA_BTL_OPENIB_TRANSPORT_IB
>> 
>>  Two questions:
>> 
>>  1. Why is this occurring if both adapters have all the OpenIB software set 
> up?  Is it because Open MPI is trying to use functionality such as ConnectX 
> with 
> the newer hardware, which is incompatible with older hardware, or is it 
> something more mundane?
> 
> It's basically a mismatch of IB capabilities -- Open MPI is trying to use 
> more advanced features in some nodes and not in others.

I also tried looking at the adapter-specific settings in the .ini file under 
/etc, but the only difference I found was in the MTU, and I think that's 
configured on the switch.
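
For reference, the entries in that file are of roughly this shape -- the values 
below are illustrative only, not copied from my system:

    [Mellanox Tavor Infinihost]
    vendor_id = 0x5ad,0x2c9
    vendor_part_id = 23108,25208
    use_eager_rdma = 1
    mtu = 1024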
 
>>  2. How can I use IB amongst these heterogeneous nodes?
> 
> Mellanox will need to answer this question...  It might be able to be done, 
> but 
> I don't know how offhand.  The first issue is to figure out why you're 
> getting TRANSPORT_UNKNOWN on the one node.

OK, please let me know what other things to try or what other info I can 
provide.



[OMPI users] InfiniBand, different OpenFabrics transport types

2011-06-28 Thread Bill Johnstone
Hello all.

I have a heterogeneous network of InfiniBand-equipped hosts which are all 
connected to the same backbone switch, an older SDR 10 Gb/s unit.

One set of nodes uses the Mellanox "ib_mthca" driver, while the other uses the 
"mlx4" driver.


This is on Linux 2.6.32, with Open MPI 1.5.3 .


When I run Open MPI across these node types, I get an error message of the form:

Open MPI detected two different OpenFabrics transport types in the same 
Infiniband network. 
Such mixed network trasport configuration is not supported by Open MPI.

Local host: compute-chassis-1-node-01
Local adapter: mthca0 (vendor 0x5ad, part ID 25208) 
Local transport type: MCA_BTL_OPENIB_TRANSPORT_UNKNOWN 

Remote host: compute-chassis-3-node-01
Remote Adapter: (vendor 0x2c9, part ID 26428) 
Remote transport type: MCA_BTL_OPENIB_TRANSPORT_IB

Two questions:

1. Why is this occurring if both adapters have all the OpenIB software set up?  
Is it because Open MPI is trying to use functionality such as ConnectX with the 
newer hardware, which is incompatible with older hardware, or is it something 
more mundane?

2. How can I use IB amongst these heterogeneous nodes?

Thank you.




Re: [OMPI users] BLCR support not building on 1.5.3

2011-05-27 Thread Bill Johnstone
Hello,


Thank you very much for this.  I've replied further below:


- Original Message -
> From: Joshua Hursey 
[...]
> What other configure options are you passing to Open MPI? Specifically the 
> configure test will always fail if '--with-ft=cr' is not specified - by 
> default Open MPI will only build the BLCR component if C/R FT is requested by 
> the user.

This was it!  Now the BLCR support builds just fine.
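
For the archives, the configure invocation that works for me now is roughly of 
this form (the prefix and BLCR paths are specific to my setup):

    ./configure --prefix=/usr/local \
        --with-ft=cr \
        --with-blcr=/usr --with-blcr-libdir=/usr/lib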

If I may offer some feedback:

When I think "Checkpoint/Restart", I don't immediately think "Fault Tolerance"; 
rather, I'm interested in it for a better alternative to suspend/resume.  So I 
had *no* idea turning on the "ft" configure option this was a prerequisite for 
BLCR support to compile from just reading the configure help, configure output, 
docs, etc.

I'd like to request that this be made easier to spot.  At a minimum, the 
configure --help output could mention it where it talks about BLCR, or about C/R 
in general.

Additionally, when configuring components in general, it would be nice if there 
were a way to get more detail in the config logs about the tests (and why they 
failed) than just "can compile...no".  This may require more invasive changes -- 
not being super-knowledgeable about configure, I don't know how much work it 
would be.

Lastly, the standard Open MPI documentation (particularly the FAQ) could be 
updated in the C/R or BLCR sections to reflect the need for the "--with-ft=cr" 
argument.

Again, I really appreciate the assistance.



[OMPI users] BLCR support not building on 1.5.3

2011-05-26 Thread Bill Johnstone
Hello all.

I'm building 1.5.3 from source on a Debian Squeeze AMD64 system, and trying to 
get BLCR support built-in.  I've installed all the packages that I think should 
be relevant to BLCR support, including:

+blcr-dkms
+libcr0
+libcr-dev
+blcr-util

I've also installed blcr-testsuite.  I only run Open MPI's configure after 
loading the BLCR modules, and the tests in blcr-testsuite pass.  The relevant 
headers are in /usr/include and the relevant libraries in /usr/lib.

I've tried three different invocations of configure:

1. No BLCR-related arguments.

Output snippet from configure:
checking --with-blcr value... simple ok (unspecified)
checking --with-blcr-libdir value... simple ok (unspecified)
checking if MCA component crs:blcr can compile... no

2. With --with-blcr=/usr only

Output snippet from configure:
checking --with-blcr value... sanity check ok (/usr)
checking --with-blcr-libdir value... simple ok (unspecified)
configure: WARNING: BLCR support requested but not found.  Perhaps you need to 
specify the location of the BLCR libraries.
configure: error: Aborting.

3. With --with-blcr-libdir=/usr/lib only

Output snippet from configure:
checking --with-blcr value... simple ok (unspecified)
checking --with-blcr-libdir value... sanity check ok (/usr/lib)
checking if MCA component crs:blcr can compile... no


config.log only seems to contain the output of whatever tests were run to 
determine whether or not blcr support could be compiled, but I don't see any 
way to get details on what code and compile invocation actually failed, in 
order to get to the root of the problem.  I'm not a configure or m4 expert, so 
I'm not sure how to go further in troubleshooting this.

Help would be much appreciated.

Thanks!




Re: [OMPI users] Making RPM from source that respects --prefix

2009-10-07 Thread Bill Johnstone
Hello Jeff and Kiril,

Thank you for your responses.  Based on the information you both provided, I was 
able to get buildrpm to make the OMPI RPM the way I wanted.  I ended up having to 
define _prefix, _mandir, and _infodir.

Additionally, I found I had to use --define "shell_scripts_basename mpivars", 
because without it mpi-selector did not find the installation: it specifically 
seems to look for the shell scripts as mpivars.{sh,csh}, whereas the .spec file 
builds them as mpivars-1.3.3.{sh,csh}.  I think the .spec file should be changed 
to match what mpi-selector expects.
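
In case it helps anyone else searching the archives, expressed as plain rpmbuild 
defines what I ended up with is roughly equivalent to the following (the exact 
buildrpm plumbing differs, and the install prefix here is just an example):

    rpmbuild --rebuild openmpi-1.3.3-1.src.rpm \
        --define '_prefix /opt/openmpi-1.3.3' \
        --define '_mandir /opt/openmpi-1.3.3/share/man' \
        --define '_infodir /opt/openmpi-1.3.3/share/info' \
        --define 'shell_scripts_basename mpivars'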

Jeff, it might also be really useful to have a .spec build option to allow the 
RPM to register itself as the system default.  I hand-modified the .spec file 
to do this.  Please let me know if I should register a feature request 
somewhere more formally.

Thanks again to you both, and sorry for taking so long to reply.






[OMPI users] Making RPM from source that respects --prefix

2009-10-02 Thread Bill Johnstone
I'm trying to build an RPM of 1.3.3 from the SRPM.  Despite typical RPM 
practice, I need to build ompi so that it installs to a different directory 
from /usr or /opt, i.e. what I would get if I just built from source myself 
with a --prefix argument to configure.

When I invoke buildrpm with --define 'configure_options --prefix= ...', the 
options do get set when the build process kicks off.  However, when I query the 
final RPM, only vampirtrace has paid attention to the specified --prefix and 
wants to place its files accordingly.  How should I alter the .spec file (or 
something else?) to get the desired final file locations in the RPM?

Thank you for any help.






[OMPI users] mpirun (orte ?) not shutting down cleanly on job aborts

2008-06-09 Thread Bill Johnstone
Hello OMPI devs,

I'm currently running OMPI v1.2.4.  It didn't seem that any bugs affecting me or 
my users were fixed in 1.2.5 or 1.2.6, so I haven't upgraded yet.

When I was initially getting started with Open MPI, I had some problems which I 
was able to solve, but one still remains.  As I mentioned in 
http://www.open-mpi.org/community/lists/users/2007/07/3716.php

when any of the MPI processes exits non-gracefully, mpirun hangs.  As an 
example, I have a code which, when it hits a trivial runtime error (e.g., some 
small mistake in the input file), dies yielding messages to the screen like:

[node1.x86-64:28556] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD 
with errorcode 16

but mpirun never exits, and Ctrl+C won't kill it.  I have to resort to kill -9.

Now that I'm running under SLURM, this is worse, because there is no nice way to 
manually clear individual jobs off the controller.  So even if I manually kill 
mpirun for the failed job, slurmctld still thinks it's running.

Ralph Castain replied to the previously-linked message:
http://www.open-mpi.org/community/lists/users/2007/07/3718.php indicating that 
he thought he knew why this was happening and that it was or would likely be 
fixed in the trunk.

At this point, I just want to know: can I look forward to this being fixed in 
the upcoming v 1.3 series?

I don't mean that to sound ungrateful: *many thanks* to the OMPI devs for what 
you've already given the community at large.  I'm just a bit frustrated because 
we seem to run a lot of codes on our cluster that abort at one time or another.

Thank you.





[OMPI users] Documentation on running under slurm

2008-06-05 Thread Bill Johnstone
Hello all.

It would seem that the documentation, at least the FAQ page at 
http://www.open-mpi.org/faq/?category=slurm, is a little out of date with 
respect to running on newer versions of SLURM (I just got things working with 
version 1.3.3).

According to the SLURM documentation, srun -A is deprecated, and if you look in 
the manpage for salloc, -A is not directly mentioned; it's just discussed in the 
--no-shell section.

I was able to successfully submit/run using:
salloc -n <# procs> mpirun 

without needing an interactive shell.  So doesn't this seem like the more 
up-to-date way of doing things rather than srun -A?  Also, it would seem sbatch 
replaces srun -b, but I don't use this mode of operation, so I'm not sure.
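
Concretely, the salloc form I use looks like the first line below; the sbatch 
form is the batch-mode equivalent, which I have not tried myself (process 
counts, script, and executable names are placeholders):

    salloc -n 16 mpirun ./my_mpi_app
    sbatch -n 16 job.sh     # job.sh is a batch script whose last line is "mpirun ./my_mpi_app"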

Perhaps the Open MPI documentation should be updated accordingly?

Thanks.





[OMPI users] SLURM vs. Torque?

2007-10-22 Thread Bill Johnstone
Hello All.

We are starting to need resource/scheduling management for our small
cluster, and I was wondering if any of you could provide comments on
what you think about Torque vs. SLURM?  On the basis of the appearance
of active development as well as the documentation, SLURM seems to be
superior, but can anyone shed light on how they compare in use?

I realize the truth in the stock answer of "it depends on what you
need/want," but as yet we are not experienced enough with this kind
of thing to have a set of firm requirements.  At this point, we can
probably adapt our workflow/usage a little bit to accommodate the way
the resource manager works.  And of course we'll be using Open MPI with
whatever resource manager we go with.

Anyway, enough from me -- I'm looking to hear others' experiences and
viewpoints.

Thanks for any input!



Re: [OMPI users] mpirun hanging followup

2007-07-18 Thread Bill Johnstone

--- Ralph Castain  wrote:

> Unfortunately, we don't have more debug statements internal to that
> function. I'll have to create a patch for you that will add some so
> we can
> better understand why it is failing - will try to send it to you on
> Wed.

Thank you for the patch you sent.

I solved the problem.  It was a head-slapper of an error.  I had forgotten
that once a filesystem is mounted, the permissions of its root directory
override the permissions of the mount point.  As I mentioned, these machines
have an NFS root filesystem, and in that filesystem /tmp has permissions 1777.
However, when each node mounts its local scratch partition on /tmp, the
permissions of that filesystem's root are what /tmp takes on.

In this case, I had forgotten to apply permissions 1777 to /tmp after mounting
it on each machine.  As a result, /tmp did not have the permissions mpirun
needs in order to write there.
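
For completeness, the per-node fix amounts to nothing more than this (the
device name is just an example):

    mount /dev/sda2 /tmp       # mount the node's local scratch partition
    chmod 1777 /tmp            # restore sticky, world-writable permissions
    stat -c '%a %n' /tmp       # should now report "1777 /tmp"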

Your patch helped me figure this out.  Technically, I should have been
able to figure it out from the messages you'd already sent to the
mailing list, but it wasn't until I saw the line in session_dir.c where
the error was occurring that I realized it had to be some kind of
permissions error.

I've attached the new debug output below:

[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file
util/session_dir.c at line 108
[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file
util/session_dir.c at line 391
[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file
runtime/orte_init_stage1.c at line 626
--
It looks like orte_init failed for some reason; your parallel process
is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_session_dir failed
  --> Returned value -1 instead of ORTE_SUCCESS

--
[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file
runtime/orte_system_init.c at line 42
[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file
runtime/orte_init.c at line 52
Open RTE was unable to initialize properly.  The error occured while
attempting to orte_init().  Returned value -1 instead of ORTE_SUCCESS.

Starting at line 108 of session_dir.c is:

    if (ORTE_SUCCESS != (ret = opal_os_dirpath_create(directory, my_mode))) {
        ORTE_ERROR_LOG(ret);
    }

Three further points:

-Is there some reason ORTE can't bail out gracefully upon this error,
instead of hanging like it was doing for me?

-I think it would be a good idea to keep the extra debug logging from your
patch in future Open MPI versions, to help troubleshoot problems like this.

-It would be nice to see "--debug-daemons" added to the Troubleshooting
section of the FAQ on the web site.

Thank you very much for your help, Ralph, and to everyone else who replied.






Re: [OMPI users] mpirun hanging followup

2007-07-18 Thread Bill Johnstone
--- Ralph Castain  wrote:

> No, the session directory is created in the tmpdir - we don't create
> anything anywhere else, nor do we write any executables anywhere.

In the case where the TMPDIR env variable isn't specified, what is the
default assumed by Open MPI/orte?

> Just out of curiosity: although I know you have different arch's on
> your
> nodes, the tests you are running are all executing on the same arch,
> correct???

Yes, the tests all execute on the same arch, although this leads me to another
question.  Can I launch from a headnode of one arch while my mpirun hostfile
specifies only nodes of another arch?  In other words, no computation is done
on the headnode of arch A, all computation is done on nodes of arch B, but the
job is launched from the headnode -- would that be acceptable?

I should be clear that for the problem you are helping me with, *all*
the nodes involved are running the same arch, OS, compiler, system
libraries, etc.  The multiple arch question is for edification for the
future.







Re: [OMPI users] mpirun hanging followup

2007-07-17 Thread Bill Johnstone
I made sure the TMPDIR environment variable was set to /tmp for
non-interactive logins, and got the same result as before.

Specifying the "-mca tmpdir_base /tmp" command-line option also gave the
same result.
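
For the record, the two variants I tried look like this -- the hostfile and
application names are placeholders, and the job arguments are trimmed:

    # via the environment (set in the shell init for non-interactive logins)
    export TMPDIR=/tmp
    mpirun --hostfile hosts -np 2 ./my_mpi_app

    # via the MCA parameter on the command line
    mpirun -mca tmpdir_base /tmp --hostfile hosts -np 2 ./my_mpi_app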

I made a mistake in my previous e-mail however -- the user home
directories are also writable by each node (again, via NFS).  /var and
/tmp are the only unique-per-node writable directories.  I'm assuming
that by default, the session directory structure is created in the run
directory, or the user's home directory, or something similar?

/tmp and the home directories are both mounted nosuid, but are mounted
exec.  Does mpirun write/run a suid executable in any of these
directories?

Thank you.

--- Ralph Castain <r...@lanl.gov> wrote:

> Open MPI needs to create a temporary directory structure that we call
> the
> "session directory". This error is telling you that Open MPI was
> unable to
> create that directory, probably due to a permission issue.
> 
> We decide on the root directory for the session directory using a
> progression. You can direct where you want it to go by setting the
> TMPDIR
> environment variable, or (to set it just for us) using -mca
> tmpdir_base foo
> on the mpirun command (or you can set OMPI_MCA_tmpdir_base=foo in
> your
> environment), where "foo" is the root of your tmp directory you want
> us to
> use (e.g., /tmp).
> 
> Hope that helps
> Ralph
> 
> 
> 
> On 7/17/07 3:09 PM, "Bill Johnstone" <beejsto...@yahoo.com> wrote:
> 
> > When I run with --debug-daemons, I get:
> > 
> > 
> > 
> > [node5.x86-64:09920] [0,0,1] ORTE_ERROR_LOG: Error in file
> > runtime/orte_init_stage1.c at line 626
> >
>
--
> > It looks like orte_init failed for some reason; your parallel
> process
> > is
> > likely to abort.  There are many reasons that a parallel process
> can
> > fail during orte_init; some of which are due to configuration or
> > environment problems.  This failure appears to be an internal
> failure;
> > here's some additional information (which may only be relevant to
> an
> > Open MPI developer):
> > 
> >   orte_session_dir failed
> >   --> Returned value -1 instead of ORTE_SUCCESS
> > 
> >
>
--
> > [node5.x86-64:09920] [0,0,1] ORTE_ERROR_LOG: Error in file
> > runtime/orte_system_init.c at line 42
> > [node5.x86-64:09920] [0,0,1] ORTE_ERROR_LOG: Error in file
> > runtime/orte_init.c at line 52
> > Open RTE was unable to initialize properly.  The error occured
> while
> > attempting to orte_init().  Returned value -1 instead of
> ORTE_SUCCESS.
> > 
> > 
> > 
> > Where would you suggest I look next?
> > 
> > Also, if it makes any difference, /usr/local is on a read-only
> NFSROOT.
> >  Only /tmp and /var are writeable per-node.
> > 
> > Thank you very much for your help so far.
> > 
> > --- George Bosilca <bosi...@cs.utk.edu> wrote:
> > 
> >> Sorry. The --debug was supposed to be --debug-devel. But I suspect
> >> that if you have a normal build then there will be not much
> output.
> >> However, --debug-daemons should give enough output so we can at
> least
> >>  
> >> have a starting point.
> >> 
> >>george.
> >> 
> >> On Jul 17, 2007, at 2:46 PM, Bill Johnstone wrote:
> >> 
> >>> George Bosilca wrote:
> >>> 
> >>>> You can start by adding --debug-daemons and --debug to your
> mpirun
> >>>> command line. This will generate a lot of output related to the
> >>>> operations done internally by the launcher. If you send this
> >> output
> >>>> to the list we might be able to help you a little bit more.
> >>> 
> >>> OK, I added those, but got a message about needing to supply a
> >>> suitable
> >>> debugger.  If I supply the "--debugger gdb" argument, I just get
> >>> dumped
> >>> into gdb.  I'm not sure what I need to do next to get the
> launcher
> >>> output you mentioned.  My knowledge of gdb is pretty rudimentary.
> >> 
> >>> Do I
> >>> need to set mpirun as the executable, and the use the gdb "run"
> >>> command
> >>> with the mpirun arguments?
> >>> 
> >>> Do I need to rebuild openmpi with --enable-debug?
> > 







Re: [OMPI users] mpirun hanging followup

2007-07-17 Thread Bill Johnstone
When I run with --debug-daemons, I get:



[node5.x86-64:09920] [0,0,1] ORTE_ERROR_LOG: Error in file
runtime/orte_init_stage1.c at line 626
--
It looks like orte_init failed for some reason; your parallel process
is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_session_dir failed
  --> Returned value -1 instead of ORTE_SUCCESS

--
[node5.x86-64:09920] [0,0,1] ORTE_ERROR_LOG: Error in file
runtime/orte_system_init.c at line 42
[node5.x86-64:09920] [0,0,1] ORTE_ERROR_LOG: Error in file
runtime/orte_init.c at line 52
Open RTE was unable to initialize properly.  The error occured while
attempting to orte_init().  Returned value -1 instead of ORTE_SUCCESS.



Where would you suggest I look next?

Also, if it makes any difference, /usr/local is on a read-only NFSROOT.
 Only /tmp and /var are writeable per-node.

Thank you very much for your help so far.

--- George Bosilca <bosi...@cs.utk.edu> wrote:

> Sorry. The --debug was supposed to be --debug-devel. But I suspect  
> that if you have a normal build then there will be not much output.  
> However, --debug-daemons should give enough output so we can at least
>  
> have a starting point.
> 
>george.
> 
> On Jul 17, 2007, at 2:46 PM, Bill Johnstone wrote:
> 
> > George Bosilca wrote:
> >
> >> You can start by adding --debug-daemons and --debug to your mpirun
> >> command line. This will generate a lot of output related to the
> >> operations done internally by the launcher. If you send this
> output
> >> to the list we might be able to help you a little bit more.
> >
> > OK, I added those, but got a message about needing to supply a  
> > suitable
> > debugger.  If I supply the "--debugger gdb" argument, I just get  
> > dumped
> > into gdb.  I'm not sure what I need to do next to get the launcher
> > output you mentioned.  My knowledge of gdb is pretty rudimentary.  
> 
> > Do I
> > need to set mpirun as the executable, and the use the gdb "run"  
> > command
> > with the mpirun arguments?
> >
> > Do I need to rebuild openmpi with --enable-debug?



  



Re: [OMPI users] mpirun hanging followup

2007-07-17 Thread Bill Johnstone
George Bosilca wrote:

> You can start by adding --debug-daemons and --debug to your mpirun
> command line. This will generate a lot of output related to the
> operations done internally by the launcher. If you send this output
> to the list we might be able to help you a little bit more.

OK, I added those, but got a message about needing to supply a suitable
debugger.  If I supply the "--debugger gdb" argument, I just get dumped
into gdb.  I'm not sure what I need to do next to get the launcher
output you mentioned.  My knowledge of gdb is pretty rudimentary.  Do I
need to set mpirun as the executable, and then use the gdb "run" command
with the mpirun arguments?

Do I need to rebuild openmpi with --enable-debug?






Re: [OMPI users] mpirun hanging followup

2007-07-17 Thread Bill Johnstone
Thanks for the help.  I've replied below.

--- "G.O."  wrote:

> 1- Check to make sure that there are no firewalls blocking
> traffic between the nodes.

There is no firewall between the nodes.  If I run jobs directly via
ssh, e.g. "ssh node4 env", they work.

> 2 - Check to make sure that all nodes have Open MPI installed,
> have the very same executable you are trying to run in the same
> path, and have all the permissions set correctly.

Yes, they are all installed to /usr/local, the permissions are the
same, and if I invoke mpirun on an individual node by logging into it,
it works.  In fact, even commands like "ssh node4 mpirun" (just to get
the mpirun help banner) work.

> 3- Check to make sure that all nodes have the same interface,
> i.e. eth0 .

They all do have the same interfaces.  In my configuration, eth1 is
the interface that corresponds to the cluster IP network.  I have tried
using "--mca btl_tcp_if_include eth1", but it seems to make no
difference.
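
For reference, the full invocation I tested with that option was roughly the
following (hostfile and executable names are placeholders; the tcp,self BTL
restriction is just to rule out other transports):

    mpirun --mca btl tcp,self --mca btl_tcp_if_include eth1 \
        --hostfile hosts -np 4 ./my_mpi_app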

> That's all I can think of for very quick checks for now. Hope it's
> one of these.

Thank you very much, but unfortunately it isn't any of these, as far as
I can tell.



  




[OMPI users] mpirun hanging followup

2007-07-17 Thread Bill Johnstone
Hello all.

I could really use help trying to figure out why mpirun is hanging as
detailed in my previous message yesterday, 16 July.  Since there's been
no response, please allow me to give a short summary.

-Open MPI 1.2.3 on GNU/Linux, 2.6.21 kernel, gcc 4.1.2, bash 3.2.15 is
default shell
-Open MPI installed to /usr/local, which is in non-interactive session
path
-Systems are AMD64, using ethernet as interconnect, on private IP
network

mpirun hangs whenever I invoke any process running on a remote node. 
It runs a job fine if I invoke it so that it only runs on the local
node.  Ctrl+C never successfully cancels an mpirun job -- I have to use
kill -9.

I'm asking for help figuring out what steps mpirun has taken, and where
things are getting stuck or crashing.  What could be happening on the
remote nodes?  What debugging steps can I take?

Without MPI running, the cluster is of no use, so I would really
appreciate some help here.




