Re: [OMPI devel] Getting the number of nodes
I'm running this on my Mac, where I expected to get back only the localhost. I upgraded to 1.0.2 a little while back; I had been using one of the alphas (I think it was alpha 9, but I can't be sure) up until that point, when this function returned '1' on my Mac.

-- Nathan
Correspondence
-
Nathan DeBardeleben, Ph.D.
Los Alamos National Laboratory
Parallel Tools Team
High Performance Computing Environments
phone: 505-667-3428
email: ndeb...@lanl.gov
-

Ralph H Castain wrote:
> Rc=0 indicates that the "get" function was successful, so this means that
> there were no nodes on the NODE_SEGMENT. Were you running this in an
> environment where nodes had been allocated to you? Or were you expecting
> to find only "localhost" on the segment?
>
> I'm not entirely sure, but I don't believe there have been significant
> changes in 1.0.2 for some time. My guess is that something has changed on
> your system as opposed to in the Open MPI code you're using. Did you do an
> update recently and then begin seeing this behavior? Your revision level
> is 1000+ behind the current repository, so my guess is that you haven't
> updated for a while - since 1.0.2 is under maintenance for bugs only, that
> shouldn't be a problem. I'm just trying to understand why your function is
> doing something different if the Open MPI code you're using hasn't changed.
>
> Ralph
>
> On 7/5/06 2:40 PM, "Nathan DeBardeleben" wrote:
>> Open MPI: 1.0.2
>> Open MPI SVN revision: r9571
>> The rc value returned by the 'get' call is '0'.
>> All I'm doing is calling init with my own daemon name; it's coming up
>> fine, then I immediately call this to figure out how many nodes are
>> associated with this machine.
>>
>> -- Nathan
>>
>> Ralph H Castain wrote:
>>> Hi Nathan
>>>
>>> Could you tell us which version of the code you are using, and print
>>> out the rc value that was returned by the "get" call? I see nothing
>>> obviously wrong with the code, but much depends on what happened prior
>>> to this call too.
>>>
>>> BTW: you might want to release the memory stored in the returned
>>> values - it could represent a substantial memory leak.
>>>
>>> Ralph
>>>
>>> On 7/5/06 9:28 AM, "Nathan DeBardeleben" wrote:
>>>> I used to use this code to get the number of nodes in a cluster /
>>>> machine / whatever:
>>>>
>>>> int get_num_nodes(void)
>>>> {
>>>>     int rc;
>>>>     size_t cnt;
>>>>     orte_gpr_value_t **values;
>>>>
>>>>     rc = orte_gpr.get(ORTE_GPR_KEYS_OR|ORTE_GPR_TOKENS_OR,
>>>>                       ORTE_NODE_SEGMENT, NULL, NULL, &cnt, &values);
>>>>
>>>>     if (rc != ORTE_SUCCESS) {
>>>>         return 0;
>>>>     }
>>>>
>>>>     return cnt;
>>>> }
>>>>
>>>> This now returns '0' on my Mac when it used to return 1. Is this not
>>>> an acceptable way of doing this? Is there a cleaner / better way
>>>> these days?

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] Getting the number of nodes
Rc=0 indicates that the "get" function was successful, so this means that there were no nodes on the NODE_SEGMENT. Were you running this in an environment where nodes had been allocated to you? Or were you expecting to find only "localhost" on the segment?

I'm not entirely sure, but I don't believe there have been significant changes in 1.0.2 for some time. My guess is that something has changed on your system as opposed to in the Open MPI code you're using. Did you do an update recently and then begin seeing this behavior? Your revision level is 1000+ behind the current repository, so my guess is that you haven't updated for a while - since 1.0.2 is under maintenance for bugs only, that shouldn't be a problem. I'm just trying to understand why your function is doing something different if the Open MPI code you're using hasn't changed.

Ralph

On 7/5/06 2:40 PM, "Nathan DeBardeleben" wrote:
> Open MPI: 1.0.2
> Open MPI SVN revision: r9571
> The rc value returned by the 'get' call is '0'.
> All I'm doing is calling init with my own daemon name; it's coming up
> fine, then I immediately call this to figure out how many nodes are
> associated with this machine.
>
> -- Nathan
> Correspondence
> -
> Nathan DeBardeleben, Ph.D.
> Los Alamos National Laboratory
> Parallel Tools Team
> High Performance Computing Environments
> phone: 505-667-3428
> email: ndeb...@lanl.gov
> -
>
> Ralph H Castain wrote:
>> Hi Nathan
>>
>> Could you tell us which version of the code you are using, and print out the
>> rc value that was returned by the "get" call? I see nothing obviously wrong
>> with the code, but much depends on what happened prior to this call too.
>>
>> BTW: you might want to release the memory stored in the returned values - it
>> could represent a substantial memory leak.
>>
>> Ralph
>>
>> On 7/5/06 9:28 AM, "Nathan DeBardeleben" wrote:
>>
>>> I used to use this code to get the number of nodes in a cluster /
>>> machine / whatever:
>>>
>>> int get_num_nodes(void)
>>> {
>>>     int rc;
>>>     size_t cnt;
>>>     orte_gpr_value_t **values;
>>>
>>>     rc = orte_gpr.get(ORTE_GPR_KEYS_OR|ORTE_GPR_TOKENS_OR,
>>>                       ORTE_NODE_SEGMENT, NULL, NULL, &cnt, &values);
>>>
>>>     if (rc != ORTE_SUCCESS) {
>>>         return 0;
>>>     }
>>>
>>>     return cnt;
>>> }
>>>
>>> This now returns '0' on my Mac when it used to return 1. Is this not an
>>> acceptable way of doing this? Is there a cleaner / better way these days?
Re: [OMPI devel] Getting the number of nodes
Hi Nathan

Could you tell us which version of the code you are using, and print out the rc value that was returned by the "get" call? I see nothing obviously wrong with the code, but much depends on what happened prior to this call too.

BTW: you might want to release the memory stored in the returned values - it could represent a substantial memory leak.

Ralph

On 7/5/06 9:28 AM, "Nathan DeBardeleben" wrote:
> I used to use this code to get the number of nodes in a cluster /
> machine / whatever:
>> int
>> get_num_nodes(void)
>> {
>>     int rc;
>>     size_t cnt;
>>     orte_gpr_value_t **values;
>>
>>     rc = orte_gpr.get(ORTE_GPR_KEYS_OR|ORTE_GPR_TOKENS_OR,
>>                       ORTE_NODE_SEGMENT, NULL, NULL, &cnt, &values);
>>
>>     if (rc != ORTE_SUCCESS) {
>>         return 0;
>>     }
>>
>>     return cnt;
>> }
> This now returns '0' on my Mac when it used to return 1. Is this not an
> acceptable way of doing this? Is there a cleaner / better way these days?
[OMPI devel] Getting the number of nodes
I used to use this code to get the number of nodes in a cluster / machine / whatever:

int
get_num_nodes(void)
{
    int rc;
    size_t cnt;
    orte_gpr_value_t **values;

    rc = orte_gpr.get(ORTE_GPR_KEYS_OR|ORTE_GPR_TOKENS_OR,
                      ORTE_NODE_SEGMENT, NULL, NULL, &cnt, &values);

    if (rc != ORTE_SUCCESS) {
        return 0;
    }

    return cnt;
}

This now returns '0' on my Mac when it used to return 1. Is this not an acceptable way of doing this? Is there a cleaner / better way these days?

-- Nathan
Correspondence
-
Nathan DeBardeleben, Ph.D.
Los Alamos National Laboratory
Parallel Tools Team
High Performance Computing Environments
phone: 505-667-3428
email: ndeb...@lanl.gov
-
Re: [OMPI devel] orted problem
This has been around for a very long time (at least a year, if memory serves correctly). The problem is that the system "hangs" while trying to flush the I/O buffers through the RML because it loses connection to the head node process (for 1.x, that's basically mpirun) - but the "flush" procedure doesn't give up. What's needed is some tuneup of the entire I/O-RML system so that we can time out properly and, when receiving that error, exit instead of retrying. I thought someone was going to take a shot at that a while back (at least six months ago), but I don't recall it actually happening - too many higher priorities.

Ralph

On 7/4/06 3:05 PM, "Josh Hursey" wrote:
> I have been noticing this for a while (at least 2 months) as well,
> along with stale session directories. I filed a bug yesterday, #177:
> https://svn.open-mpi.org/trac/ompi/ticket/177
> I'll add this stack trace to it. I want to take a closer look
> tomorrow to see what's really going on here.
>
> When I left it yesterday, I found that if you CTRL-C the running
> mpirun and the orteds hang, then if you send another signal to
> mpirun, sometimes mpirun will die from SIGPIPE. This is a race
> condition due to the orteds leaving, but we should be masking that
> signal or doing something other than dying.
>
> So I think there is more than one race in this code, and it will need
> some serious looking at.
>
> --Josh
>
> On Jul 4, 2006, at 12:38 PM, George Bosilca wrote:
>
>> Starting a few days ago, I noticed that more and more orteds are
>> left over after my runs. Usually, if the job runs to completion they
>> disappear. But if I kill the job or it segfaults, they don't. I
>> attached to one of them and I get the following stack:
>>
>> #0 0x9001f7a8 in select ()
>> #1 0x00375d34 in select_dispatch (arg=0x39ec6c, tv=0xbfffe664) at ../../../ompi-trunk/opal/event/select.c:202
>> #2 0x00373b70 in opal_event_loop (flags=1) at ../../../ompi-trunk/opal/event/event.c:485
>> #3 0x00237ee0 in orte_iof_base_flush () at ../../../../ompi-trunk/orte/mca/iof/base/iof_base_flush.c:111
>> #4 0x004cbb38 in orte_pls_fork_wait_proc (pid=9045, status=9, cbdata=0x50c250) at ../../../../../ompi-trunk/orte/mca/pls/fork/pls_fork_module.c:175
>> #5 0x002111f0 in do_waitall (options=0) at ../../ompi-trunk/orte/runtime/orte_wait.c:500
>> #6 0x00210ac8 in orte_wait_signal_callback (fd=20, event=8, arg=0x26f3f8) at ../../ompi-trunk/orte/runtime/orte_wait.c:366
>> #7 0x003737f8 in opal_event_process_active () at ../../../ompi-trunk/opal/event/event.c:428
>> #8 0x00373ce8 in opal_event_loop (flags=1) at ../../../ompi-trunk/opal/event/event.c:513
>> #9 0x00368714 in opal_progress () at ../../ompi-trunk/opal/runtime/opal_progress.c:259
>> #10 0x004cdf48 in opal_condition_wait (c=0x4cf0f0, m=0x4cf0b0) at ../../../../../ompi-trunk/opal/threads/condition.h:81
>> #11 0x004cde60 in orte_pls_fork_finalize () at ../../../../../ompi-trunk/orte/mca/pls/fork/pls_fork_module.c:764
>> #12 0x002417d0 in orte_pls_base_finalize () at ../../../../ompi-trunk/orte/mca/pls/base/pls_base_close.c:42
>> #13 0x000ddf58 in orte_rmgr_urm_finalize () at ../../../../../ompi-trunk/orte/mca/rmgr/urm/rmgr_urm.c:521
>> #14 0x00254ec0 in orte_rmgr_base_close () at ../../../../ompi-trunk/orte/mca/rmgr/base/rmgr_base_close.c:39
>> #15 0x0020e574 in orte_system_finalize () at ../../ompi-trunk/orte/runtime/orte_system_finalize.c:65
>> #16 0x0020899c in orte_finalize () at ../../ompi-trunk/orte/runtime/orte_finalize.c:42
>> #17 0x2ac8 in main (argc=19, argv=0xb17c) at ../../../../ompi-trunk/orte/tools/orted/orted.c:377
>>
>> Somehow, it waits for pid 9045. But this was one of the kids, and
>> it got the SIGKILL signal (I checked with strace). I wonder if we
>> don't have a race condition somewhere in the wait_signal code.
>>
>> Hope that helps,
>> george.
>
> Josh Hursey
> jjhur...@open-mpi.org
> http://www.open-mpi.org/