Re: [OMPI users] InfiniBand, different OpenFabrics transport types
Yevgeny,

Sorry for the delay in replying -- I'd been out for a few days.

----- Original Message -----
> From: Yevgeny Kliteynik
> Sent: Thursday, July 14, 2011 12:51 AM
> Subject: Re: [OMPI users] InfiniBand, different OpenFabrics transport types
>
> While I'm trying to find an old HCA somewhere, could you please
> post here the output of "ibv_devinfo -v" on mthca?

:~$ ibv_devinfo -v
hca_id: mthca0
        transport:                      InfiniBand (0)
        fw_ver:                         4.8.917
        node_guid:                      0005:ad00:000b:60c0
        sys_image_guid:                 0005:ad00:0100:d050
        vendor_id:                      0x05ad
        vendor_part_id:                 25208
        hw_ver:                         0xA0
        board_id:                       MT_00A001
        phys_port_cnt:                  2
        max_mr_size:                    0x
        page_size_cap:                  0xf000
        max_qp:                         64512
        max_qp_wr:                      65535
        device_cap_flags:               0x1c76
        max_sge:                        59
        max_sge_rd:                     0
        max_cq:                         65408
        max_cqe:                        131071
        max_mr:                         131056
        max_pd:                         32768
        max_qp_rd_atom:                 4
        max_ee_rd_atom:                 0
        max_res_rd_atom:                258048
        max_qp_init_rd_atom:            128
        max_ee_init_rd_atom:            0
        atomic_cap:                     ATOMIC_HCA (1)
        max_ee:                         0
        max_rdd:                        0
        max_mw:                         0
        max_raw_ipv6_qp:                0
        max_raw_ethy_qp:                0
        max_mcast_grp:                  8192
        max_mcast_qp_attach:            56
        max_total_mcast_qp_attach:      458752
        max_ah:                         0
        max_fmr:                        0
        max_srq:                        960
        max_srq_wr:                     65535
        max_srq_sge:                    31
        max_pkeys:                      64
        local_ca_ack_delay:             15
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 2
                        port_lid:               49
                        port_lmc:               0x00
                        link_layer:             IB
                        max_msg_sz:             0x8000
                        port_cap_flags:         0x02510a68
                        max_vl_num:             8 (4)
                        bad_pkey_cntr:          0x0
                        qkey_viol_cntr:         0x0
                        sm_sl:                  0
                        pkey_tbl_len:           64
                        gid_tbl_len:            32
                        subnet_timeout:         8
                        init_type_reply:        0
                        active_width:           4X (2)
                        active_speed:           2.5 Gbps (1)
                        phys_state:             LINK_UP (5)
                        GID[  0]:               fe80::::0005:ad00:000b:60c1

                port:   2
                        state:                  PORT_DOWN (1)
                        max_mtu:                2048 (4)
                        active_mtu:             512 (2)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             IB
                        max_msg_sz:             0x8000
                        port_cap_flags:         0x02510a68
                        max_vl_num:             8 (4)
                        bad_pkey_cntr:          0x0
                        qkey_viol_cntr:         0x0
                        sm_sl:                  0
                        pkey_tbl_len:           64
                        gid_tbl_len:            32
                        subnet_timeout:         0
                        init_type_reply:        0
                        active_width:           4X (2)
                        active_speed:           2.5 Gbps (1)
                        phys_state:             POLLING (2)
                        GID[  0]:               fe80::::0005:ad00:000b:60c2
Re: [OMPI users] InfiniBand, different OpenFabrics transport types
Hi Yevgeny and list,

----- Original Message -----
> From: Yevgeny Kliteynik
>
> I'll check the MCA_BTL_OPENIB_TRANSPORT_UNKNOWN thing and get back to you.

Thank you.

> One question though, just to make sure we're on the same page: so the jobs
> do run OK on the older HCAs, as long as they run *only* on the older HCAs,
> right?

Yes, correct. They run on the newer hosts using the newer (ConnectX) HCAs as
long as the jobs stay on the same (newer) HCA type, and they run on the older
HCAs (mthca) as long as the jobs stay on the same HCA type as well. IOW, as
long as the jobs run on homogeneous IB hardware, they run successfully to
completion. We've successfully done things like checkpoint/restart using the
BLCR functionality, and it all seems to work well and in a seemingly robust
way.

> Please make sure that the jobs are using only IB with "--mca btl
> openib,self" parameters.

The system is in use right now, so I will have to test this and get back to
you, but I can also say with certainty that we don't specify --mca parameters
unless a user needs to run on Ethernet-only (to avoid the IB errors we're
discussing). Otherwise, it is at the Open MPI 1.5.3 default behavior. The
users are also all using the systemwide Open MPI installation, so this isn't
an issue of an erroneous local configuration lying around from multiple
parallel installs, or interfering copies of different builds, etc. Other than
the mandatory iw_cm kernel module, we are not building/using any iWARP or
DAPL/uDAPL functionality. We are also not running IP on the IB network.
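For when the system frees up, a minimal sketch of the test Yevgeny is asking
for; the hostfile name, process count, and program are placeholders, only the
--mca flag comes from his message:

    # Restrict Open MPI to the openib and self BTLs (no TCP fallback),
    # so any failure is unambiguously an IB failure:
    mpirun --mca btl openib,self -np 8 --hostfile mixed_hca_hosts ./mpi_app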
Re: [OMPI users] InfiniBand, different OpenFabrics transport types
Hello, and thanks for the reply.

----- Original Message -----
> From: Jeff Squyres <jsquy...@cisco.com>
> Sent: Thursday, July 7, 2011 5:14 PM
> Subject: Re: [OMPI users] InfiniBand, different OpenFabrics transport types
>
> On Jun 28, 2011, at 1:46 PM, Bill Johnstone wrote:
>
>> I have a heterogeneous network of InfiniBand-equipped hosts which are all
>> connected to the same backbone switch, an older SDR 10 Gb/s unit.
>>
>> One set of nodes uses the Mellanox "ib_mthca" driver, while the other uses
>> the "mlx4" driver.
>>
>> This is on Linux 2.6.32, with Open MPI 1.5.3.
>>
>> When I run Open MPI across these node types, I get an error message of the
>> form:
>>
>> Open MPI detected two different OpenFabrics transport types in the same
>> Infiniband network.
>> Such mixed network transport configuration is not supported by Open MPI.
>>
>>   Local host:             compute-chassis-1-node-01
>>   Local adapter:          mthca0 (vendor 0x5ad, part ID 25208)
>>   Local transport type:   MCA_BTL_OPENIB_TRANSPORT_UNKNOWN
>
> Wow, that's cool ("UNKNOWN"). Are you using an old version of OFED or
> something?

No, this is a clean local build of OFED 1.5.3 packages, though I don't have
the full huge complement of OFED packages installed, since our setup is not
using IPoIB, SDP, etc. ibdiagnet and all the usual suspects work as expected,
and I'm able to do large-scale Open MPI runs just fine, so long as I don't
cross Mellanox HCA types.

> Mellanox -- how can this happen?
>
>>   Remote host:            compute-chassis-3-node-01
>>   Remote adapter:         (vendor 0x2c9, part ID 26428)
>>   Remote transport type:  MCA_BTL_OPENIB_TRANSPORT_IB
>>
>> Two questions:
>>
>> 1. Why is this occurring if both adapters have all the OpenIB software set
>> up? Is it because Open MPI is trying to use functionality such as ConnectX
>> with the newer hardware, which is incompatible with older hardware, or is
>> it something more mundane?
>
> It's basically a mismatch of IB capabilities -- Open MPI is trying to use
> more advanced features in some nodes and not in others.

I also tried looking at the adapter-specific settings in the .ini file under
/etc, but the only difference I found was in MTU, and I think that's
configured on the switch.

>> 2. How can I use IB amongst these heterogeneous nodes?
>
> Mellanox will need to answer this question... It might be able to be done,
> but I don't know how offhand. The first issue is to figure out why you're
> getting TRANSPORT_UNKNOWN on the one node.

OK, please let me know what other things to try or what other info I can
provide.
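A hedged aside for anyone reproducing this: the openib BTL's runtime
parameters can be inspected without launching a job, using the ompi_info tool
that ships with Open MPI (the grep pattern below is just illustrative):

    # Dump the openib BTL parameters the 1.5.x runtime would actually use:
    ompi_info --param btl openib | grep -i -E 'receive_queues|mtu'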
[OMPI users] InfiniBand, different OpenFabrics transport types
Hello all.

I have a heterogeneous network of InfiniBand-equipped hosts which are all
connected to the same backbone switch, an older SDR 10 Gb/s unit.

One set of nodes uses the Mellanox "ib_mthca" driver, while the other uses
the "mlx4" driver.

This is on Linux 2.6.32, with Open MPI 1.5.3.

When I run Open MPI across these node types, I get an error message of the
form:

Open MPI detected two different OpenFabrics transport types in the same
Infiniband network.
Such mixed network transport configuration is not supported by Open MPI.

  Local host:             compute-chassis-1-node-01
  Local adapter:          mthca0 (vendor 0x5ad, part ID 25208)
  Local transport type:   MCA_BTL_OPENIB_TRANSPORT_UNKNOWN

  Remote host:            compute-chassis-3-node-01
  Remote adapter:         (vendor 0x2c9, part ID 26428)
  Remote transport type:  MCA_BTL_OPENIB_TRANSPORT_IB

Two questions:

1. Why is this occurring if both adapters have all the OpenIB software set
up? Is it because Open MPI is trying to use functionality such as ConnectX
with the newer hardware, which is incompatible with older hardware, or is it
something more mundane?

2. How can I use IB amongst these heterogeneous nodes?

Thank you.
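A quick way to compare what the verbs layer itself reports on the two node
types, before involving Open MPI at all (host names are the ones from the
error above; the loop is just a sketch):

    # Print the HCA name and transport type on one node of each kind:
    for h in compute-chassis-1-node-01 compute-chassis-3-node-01; do
        echo "== $h =="
        ssh $h 'ibv_devinfo | grep -E "hca_id|transport"'
    done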
Re: [OMPI users] BLCR support not building on 1.5.3
Hello,

Thank you very much for this. I've replied further below:

----- Original Message -----
> From: Joshua Hursey
> [...]
> What other configure options are you passing to Open MPI? Specifically, the
> configure test will always fail if '--with-ft=cr' is not specified - by
> default Open MPI will only build the BLCR component if C/R FT is requested
> by the user.

This was it! Now the BLCR support builds just fine.

If I may offer some feedback: when I think "checkpoint/restart", I don't
immediately think "fault tolerance"; rather, I'm interested in it as a better
alternative to suspend/resume. So from reading the configure help, configure
output, docs, etc., I had *no* idea that turning on the "ft" configure option
was a prerequisite for BLCR support to compile. I'd like to request that this
be made easier to spot. At a minimum, the "configure --help" output could
mention this when it gets to talking about BLCR, or C/R in general.

Additionally, in general when configuring components, it would be nice if
there were a way to get more details in the config logs about the tests (and
why they failed) than just "can compile... no". This may require more
invasive changes - not being super-knowledgeable about configure, I don't
know how much work this would be.

Lastly, the standard Open MPI documentation (particularly the FAQ) could be
updated in the C/R or BLCR sections to reflect the need for the
"--with-ft=cr" argument.

Again, I really appreciate the assistance.
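For the archives, a sketch of the configure invocation that should now work
on the Debian setup described below in this thread; the --prefix is a
placeholder, and only the --with-ft/--with-blcr flags are confirmed by this
exchange:

    ./configure --with-ft=cr --with-blcr=/usr --prefix=/usr/local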
[OMPI users] BLCR support not building on 1.5.3
Hello all.

I'm building 1.5.3 from source on a Debian Squeeze AMD64 system, and trying
to get BLCR support built in. I've installed all the packages that I think
should be relevant to BLCR support, including:

 + blcr-dkms
 + libcr0
 + libcr-dev
 + blcr-util

I've also installed blcr-testsuite. I only run Open MPI's configure after
loading the blcr modules, and the tests in blcr-testsuite pass. The relevant
headers seem to be in /usr/include and the relevant libraries in /usr/lib.

I've tried three different invocations of configure:

1. No BLCR-related arguments. Output snippet from configure:

checking --with-blcr value... simple ok (unspecified)
checking --with-blcr-libdir value... simple ok (unspecified)
checking if MCA component crs:blcr can compile... no

2. With --with-blcr=/usr only. Output snippet from configure:

checking --with-blcr value... sanity check ok (/usr)
checking --with-blcr-libdir value... simple ok (unspecified)
configure: WARNING: BLCR support requested but not found. Perhaps you need
to specify the location of the BLCR libraries.
configure: error: Aborting.

3. With --with-blcr-libdir=/usr/lib only. Output snippet from configure:

checking --with-blcr value... simple ok (unspecified)
checking --with-blcr-libdir value... sanity check ok (/usr/lib)
checking if MCA component crs:blcr can compile... no

config.log only seems to contain the output of whatever tests were run to
determine whether or not BLCR support could be compiled, but I don't see any
way to get details on what code and compile invocation actually failed, in
order to get to the root of the problem.

I'm not a configure or m4 expert, so I'm not sure how to go further in
troubleshooting this. Help would be much appreciated. Thanks!
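One way to dig the failing compile test out of configure's log, for anyone
hitting the same wall; the grep pattern is only a guess at how the component
check is labeled in config.log:

    # Show the BLCR component check and some surrounding context:
    grep -n -B 2 -A 20 'crs:blcr' config.log | less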
Re: [OMPI users] Making RPM from source that respects --prefix
Hello Jeff and Kiril,

Thank you for your responses. Based on the information you both provided, I
was able to get buildrpm to make the OMPI RPM the way I wanted. I ended up
having to define _prefix, _mandir, and _infodir.

Additionally, I found I had to use

    --define "shell_scripts_basename mpivars"

because without that, when I tried to use mpi-selector, mpi-selector did not
find the installation, since it specifically seems to look for the shell
scripts as mpivars.{sh,csh} rather than mpivars-1.3.3.{sh,csh} as the .spec
file builds them. I think the .spec file should be changed to match what
mpi-selector expects.

Jeff, it might also be really useful to have a .spec build option to allow
the RPM to register itself as the system default. I hand-modified the .spec
file to do this. Please let me know if I should register a feature request
somewhere more formally.

Thanks again to you both, and sorry for taking so long to reply.
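Piecing the thread together, the working invocation would look roughly like
this; only the defines named above are confirmed here, the /opt path is a
placeholder, and the trailing arguments (tarball/SRPM and any other options)
are elided because they're site-specific -- check buildrpm's usage text:

    buildrpm \
        --define '_prefix /opt/openmpi-1.3.3' \
        --define '_mandir /opt/openmpi-1.3.3/share/man' \
        --define '_infodir /opt/openmpi-1.3.3/share/info' \
        --define 'shell_scripts_basename mpivars' \
        ...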
[OMPI users] Making RPM from source that respects --prefix
I'm trying to build an RPM of 1.3.3 from the SRPM. Despite typical RPM
practice, I need to build OMPI so that it installs to a directory other than
/usr or /opt, i.e. what I would get if I just built from source myself with a
--prefix argument to configure.

When I invoke buildrpm with --define 'configure_options --prefix= ...', the
options do get set when the building process gets kicked off. However, when I
query the final RPM, only vampirtrace has paid attention to the specified
--prefix and wants to place its files accordingly.

How should I alter the .spec file (or something in some other place?) to get
the desired behavior for the final file locations in the RPM?

Thank you for any help.
[OMPI users] mpirun (orte ?) not shutting down cleanly on job aborts
Hello OMPI devs,

I'm currently running OMPI v1.2.4. It didn't seem that any bugs which affect
me or my users were fixed in 1.2.5 or 1.2.6, so I haven't upgraded yet.

When I was initially getting started with Open MPI, I had some problems which
I was able to solve, but one still remains. As I mentioned in

http://www.open-mpi.org/community/lists/users/2007/07/3716.php

when there is a non-graceful exit on any of the MPI jobs, mpirun hangs. As an
example, I have a code that I run which, when it hits a trivial runtime error
(e.g., some small mistake in the input file), dies yielding messages to the
screen like:

[node1.x86-64:28556] MPI_ABORT invoked on rank 0 in communicator
MPI_COMM_WORLD with errorcode 16

but mpirun never exits, and Ctrl+C won't kill it. I have to resort to
kill -9. Now that I'm running under SLURM, this is worse, because there is no
nice way to manually clear individual jobs off the controller. So even if I
manually kill mpirun on the failed job, slurmctld still thinks it's running.

Ralph Castain replied to the previously-linked message:

http://www.open-mpi.org/community/lists/users/2007/07/3718.php

indicating that he thought he knew why this was happening and that it was, or
would likely be, fixed in the trunk. At this point, I just want to know: can
I look forward to this being fixed in the upcoming v1.3 series?

I don't mean that to sound ungrateful: *many thanks* to the OMPI devs for
what you've already given the community at large. I'm just a bit frustrated
because we seem to run a lot of codes on our cluster that abort at one time
or another.

Thank you.
[OMPI users] Documentation on running under slurm
Hello all.

It would seem that the documentation, at least the FAQ page at

http://www.open-mpi.org/faq/?category=slurm

is a little out of date with respect to running on newer versions of SLURM (I
just got things working with version 1.3.3). According to the SLURM
documentation, "srun -A" is deprecated, and even if you look in the manpage
for salloc, -A is not directly mentioned; it's only discussed in the
--no-shell section.

I was able to successfully submit/run using:

salloc -n <# procs> mpirun 

without needing an interactive shell. So doesn't this seem like the more
up-to-date way of doing things than "srun -A"? Also, it would seem that
sbatch replaces "srun -b", but I don't use this mode of operation, so I'm not
sure.

Perhaps the Open MPI documentation should be updated accordingly? Thanks.
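Concretely, the two modes described above would look something like the
following; the process count, script name, and program name are placeholders:

    # Non-interactive allocation + launch, replacing the old "srun -A":
    salloc -n 16 mpirun ./my_mpi_app

    # Batch submission, replacing "srun -b" (untested here, per the above):
    sbatch -n 16 job.sh    # where job.sh simply runs: mpirun ./my_mpi_app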
[OMPI users] SLURM vs. Torque?
Hello All.

We are starting to need resource/scheduling management for our small cluster,
and I was wondering if any of you could provide comments on what you think
about Torque vs. SLURM? On the basis of the appearance of active development
as well as the documentation, SLURM seems to be superior, but can anyone shed
light on how they compare in use?

I realize the truth in the stock answer of "it depends on what you
need/want," but as of yet we are not experienced enough with this kind of
thing to have a set of firm requirements. At this point, we can probably
adapt our workflow/usage a little bit to accommodate the way the resource
manager works. And of course we'll be using Open MPI with whatever resource
manager we go with.

Anyway, enough from me -- I'm looking to hear others' experiences and
viewpoints. Thanks for any input!
Re: [OMPI users] mpirun hanging followup
--- Ralph Castain wrote:
> Unfortunately, we don't have more debug statements internal to that
> function. I'll have to create a patch for you that will add some so we can
> better understand why it is failing - will try to send it to you on Wed.

Thank you for the patch you sent. I solved the problem. It was a
head-slapper of an error. Turned out that I had forgotten that the
permissions on the filesystem override the permissions of the mount point.

As I mentioned, these machines have an NFS root filesystem. In that
filesystem, /tmp has permissions 1777. However, when each node mounts its
local temp partition on /tmp, the permissions of that filesystem are the
permissions the mount point takes on. In this case, I had forgotten to apply
permissions 1777 to /tmp after mounting on each machine. As a result, /tmp
really did not have the appropriate permissions for mpirun to write to it as
necessary.

Your patch helped me figure this out. Technically, I should have been able
to figure it out from the messages you'd already sent to the mailing list,
but it wasn't until I saw the line in session_dir.c where the error was
occurring that I realized it had to be some kind of permissions error. I've
attached the new debug output below:

[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file
util/session_dir.c at line 108
[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file
util/session_dir.c at line 391
[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file
runtime/orte_init_stage1.c at line 626
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_session_dir failed
  --> Returned value -1 instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file
runtime/orte_system_init.c at line 42
[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file
runtime/orte_init.c at line 52
Open RTE was unable to initialize properly. The error occurred while
attempting to orte_init(). Returned value -1 instead of ORTE_SUCCESS.

The code starting at line 108 of session_dir.c is:

    if (ORTE_SUCCESS !=
        (ret = opal_os_dirpath_create(directory, my_mode))) {
        ORTE_ERROR_LOG(ret);
    }

Three further points:

- Is there some reason ORTE can't bail out gracefully upon this error,
  instead of hanging like it was doing for me?
- I think leaving the extra debug logging code you sent me in the patch in
  future Open MPI versions would be a good idea, to help troubleshoot
  problems like this.
- It would be nice to see "--debug-daemons" added to the Troubleshooting
  section of the FAQ on the web site.

Thank you very very much for your help, Ralph, and everyone else who
replied.
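For anyone else bitten by this, the per-node fix amounts to re-applying the
permissions after the local mount; the device name below is illustrative,
only the chmod 1777 step comes from the message above:

    # On each node, after mounting the local scratch partition:
    mount /dev/sda3 /tmp     # device is site-specific
    chmod 1777 /tmp          # permissions live on the mounted fs, not the mount point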
Re: [OMPI users] mpirun hanging followup
--- Ralph Castain wrote:
> No, the session directory is created in the tmpdir - we don't create
> anything anywhere else, nor do we write any executables anywhere.

In the case where the TMPDIR env variable isn't specified, what is the
default assumed by Open MPI/ORTE?

> Just out of curiosity: although I know you have different arch's on your
> nodes, the tests you are running are all executing on the same arch,
> correct???

Yes, the tests all execute on the same arch, although I am led to another
question. Can I use a headnode of a particular arch, but in my mpirun
hostfile specify only nodes of another arch, and launch from the headnode?
In other words, no computation is done on the headnode of arch A, all
computation is done on nodes of arch B, but the job is launched from the
headnode -- would that be acceptable?

I should be clear that for the problem you are helping me with, *all* the
nodes involved are running the same arch, OS, compiler, system libraries,
etc. The multiple-arch question is for edification for the future.
Re: [OMPI users] mpirun hanging followup
I made sure the TMPDIR environment variable was set to /tmp for
non-interactive logins, and got the same result as before. Specifying the
"-mca tmpdir_base /tmp" command-line option gave the same result as well.

I made a mistake in my previous e-mail, however -- the user home directories
are also writable by each node (again, via NFS). /var and /tmp are the only
unique-per-node writable directories. I'm assuming that by default, the
session directory structure is created in the run directory, or the user's
home directory, or something similar?

/tmp and the home directories are both mounted nosuid, but are mounted exec.
Does mpirun write/run a suid executable in any of these directories?

Thank you.

--- Ralph Castain <r...@lanl.gov> wrote:
> Open MPI needs to create a temporary directory structure that we call the
> "session directory". This error is telling you that Open MPI was unable to
> create that directory, probably due to a permission issue.
>
> We decide on the root directory for the session directory using a
> progression. You can direct where you want it to go by setting the TMPDIR
> environment variable, or (to set it just for us) using -mca tmpdir_base foo
> on the mpirun command (or you can set OMPI_MCA_tmpdir_base=foo in your
> environment), where "foo" is the root of your tmp directory you want us to
> use (e.g., /tmp).
>
> Hope that helps
> Ralph
>
> On 7/17/07 3:09 PM, "Bill Johnstone" <beejsto...@yahoo.com> wrote:
>
>> When I run with --debug-daemons, I get:
>>
>> [node5.x86-64:09920] [0,0,1] ORTE_ERROR_LOG: Error in file
>> runtime/orte_init_stage1.c at line 626
>> --------------------------------------------------------------------
>> It looks like orte_init failed for some reason; your parallel process
>> is likely to abort. There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems. This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>>
>>   orte_session_dir failed
>>   --> Returned value -1 instead of ORTE_SUCCESS
>> --------------------------------------------------------------------
>> [node5.x86-64:09920] [0,0,1] ORTE_ERROR_LOG: Error in file
>> runtime/orte_system_init.c at line 42
>> [node5.x86-64:09920] [0,0,1] ORTE_ERROR_LOG: Error in file
>> runtime/orte_init.c at line 52
>> Open RTE was unable to initialize properly. The error occurred while
>> attempting to orte_init(). Returned value -1 instead of ORTE_SUCCESS.
>>
>> Where would you suggest I look next?
>>
>> Also, if it makes any difference, /usr/local is on a read-only NFSROOT.
>> Only /tmp and /var are writeable per-node.
>>
>> Thank you very much for your help so far.
>>
>> [...]
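In shell terms, the three equivalent knobs Ralph describes above would be
(the /tmp path is just the value being tested here):

    export TMPDIR=/tmp                   # generic, affects other tools too
    mpirun -mca tmpdir_base /tmp ...     # Open MPI only, per invocation
    export OMPI_MCA_tmpdir_base=/tmp     # Open MPI only, via the environment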
Re: [OMPI users] mpirun hanging followup
When I run with --debug-daemons, I get:

[node5.x86-64:09920] [0,0,1] ORTE_ERROR_LOG: Error in file
runtime/orte_init_stage1.c at line 626
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_session_dir failed
  --> Returned value -1 instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[node5.x86-64:09920] [0,0,1] ORTE_ERROR_LOG: Error in file
runtime/orte_system_init.c at line 42
[node5.x86-64:09920] [0,0,1] ORTE_ERROR_LOG: Error in file
runtime/orte_init.c at line 52
Open RTE was unable to initialize properly. The error occurred while
attempting to orte_init(). Returned value -1 instead of ORTE_SUCCESS.

Where would you suggest I look next?

Also, if it makes any difference, /usr/local is on a read-only NFSROOT.
Only /tmp and /var are writeable per-node.

Thank you very much for your help so far.

--- George Bosilca <bosi...@cs.utk.edu> wrote:
> Sorry. The --debug was supposed to be --debug-devel. But I suspect
> that if you have a normal build then there will be not much output.
> However, --debug-daemons should give enough output so we can at least
> have a starting point.
>
>    george.
>
> On Jul 17, 2007, at 2:46 PM, Bill Johnstone wrote:
>
>> George Bosilca wrote:
>>
>>> You can start by adding --debug-daemons and --debug to your mpirun
>>> command line. This will generate a lot of output related to the
>>> operations done internally by the launcher. If you send this output
>>> to the list we might be able to help you a little bit more.
>>
>> OK, I added those, but got a message about needing to supply a suitable
>> debugger. If I supply the "--debugger gdb" argument, I just get dumped
>> into gdb. I'm not sure what I need to do next to get the launcher output
>> you mentioned. My knowledge of gdb is pretty rudimentary. Do I need to
>> set mpirun as the executable, and then use the gdb "run" command with
>> the mpirun arguments?
>>
>> Do I need to rebuild openmpi with --enable-debug?
Re: [OMPI users] mpirun hanging followup
George Bosilca wrote:

> You can start by adding --debug-daemons and --debug to your mpirun
> command line. This will generate a lot of output related to the
> operations done internally by the launcher. If you send this output
> to the list we might be able to help you a little bit more.

OK, I added those, but got a message about needing to supply a suitable
debugger. If I supply the "--debugger gdb" argument, I just get dumped into
gdb. I'm not sure what I need to do next to get the launcher output you
mentioned. My knowledge of gdb is pretty rudimentary. Do I need to set
mpirun as the executable, and then use the gdb "run" command with the mpirun
arguments?

Do I need to rebuild openmpi with --enable-debug?
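For later readers: per George's correction further up this thread, --debug
launches a debugger, and the flag that prints launcher internals is
--debug-devel. The corrected invocation would look something like this (host
names and program are placeholders):

    mpirun --debug-daemons --debug-devel -np 2 --host node4,node5 ./hello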
Re: [OMPI users] mpirun hanging followup
Thanks for the help. I've replied below.

--- "G.O." wrote:
> 1- Check to make sure that there are no firewalls blocking
> traffic between the nodes.

There is no firewall between the nodes. If I run jobs directly via ssh,
e.g. "ssh node4 env", they work.

> 2- Check to make sure that all nodes have openmpi installed, have the
> very same executable you are trying to run on the same path, and have
> all permissions set correctly.

Yes, they are all installed to /usr/local, the permissions are the same, and
if I just invoke mpirun on an individual node by logging into it, it works.
In fact, even commands like "ssh node4 mpirun" (just to get the mpirun help
banner) work.

> 3- Check to make sure that all nodes have the same interface, i.e. eth0.

They all do have the same interfaces. In my configuration, eth1 is the
interface that corresponds to the cluster IP network. I have tried using
"--mca btl_tcp_if_include eth1" but it seems to make no difference.

> That's all I can think of for very quick checks for now. Hope it's one
> of these.

Thank you very much, but unfortunately it isn't any of these, as far as I
can tell.
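For completeness, the interface-pinning attempt mentioned above would have
looked something like this; only the --mca flag is from the message, the node
names and program are placeholders:

    mpirun --mca btl_tcp_if_include eth1 -np 4 --host node4,node5 ./mpi_test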
[OMPI users] mpirun hanging followup
Hello all.

I could really use help trying to figure out why mpirun is hanging, as
detailed in my previous message yesterday, 16 July. Since there's been no
response, please allow me to give a short summary:

- Open MPI 1.2.3 on GNU/Linux, 2.6.21 kernel, gcc 4.1.2, bash 3.2.15 is the
  default shell
- Open MPI installed to /usr/local, which is in the non-interactive session
  path
- Systems are AMD64, using Ethernet as the interconnect, on a private IP
  network

mpirun hangs whenever I invoke any process running on a remote node. It runs
a job fine if I invoke it so that it only runs on the local node. Ctrl+C
never successfully cancels an mpirun job -- I have to use kill -9.

I'm asking for help trying to figure out what steps have been taken by
mpirun, and how I can figure out where things are getting stuck / crashing.
What could be happening on the remote nodes? What debugging steps can I take?

Without MPI running, the cluster is of no use, so I would really appreciate
some help here.
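A minimal way to separate launcher problems from application problems, for
anyone triaging a hang like this (the host name is a placeholder):

    # No MPI code involved -- if this also hangs, the launcher itself
    # is failing on the remote node:
    mpirun -np 2 --host node4 hostname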