Re: [OMPI devel] Getting the number of nodes
I agree with Ralph, this code should work fine (we do this internally in orte_ras_base_node_query()).

You may try adding a 'dump' of the GPR to make sure that the node segment has information on it. Add a call like the following to your function:

  orte_gpr.dup_segment(NULL);

or better yet

  orte_gpr.dump_segment(ORTE_NODE_SEGMENT);

that should print out the node segment that it would be reading from. This may be a problem elsewhere, and this will help us pinpoint it.

Cheers,
Josh
Re: [OMPI devel] Getting the number of nodes
I'm running this on my Mac, where I expected to get back only the localhost. I upgraded to 1.0.2 a little while back. Until then I had been using one of the alphas (I think it was alpha 9, but I can't be sure), and with that version this function returned '1' on my Mac.

-- Nathan
Correspondence
-
Nathan DeBardeleben, Ph.D.
Los Alamos National Laboratory
Parallel Tools Team
High Performance Computing Environments
phone: 505-667-3428
email: ndeb...@lanl.gov
-
Re: [OMPI devel] Getting the number of nodes
Rc=0 indicates that the "get" function was successful, so this means that there were no nodes on the NODE_SEGMENT. Were you running this in an environment where nodes had been allocated to you? Or were you expecting to find only "localhost" on the segment?

I'm not entirely sure, but I don't believe there have been significant changes in 1.0.2 for some time. My guess is that something has changed on your system as opposed to in the OpenMPI code you're using. Did you do an update recently and then begin seeing this behavior? Your revision level is 1000+ behind the current repository, so my guess is that you haven't updated for a while - since 1.0.2 is under maintenance for bugs only, that shouldn't be a problem.

I'm just trying to understand why your function is doing something different if the OpenMPI code you're using hasn't changed.

Ralph

On 7/5/06 2:40 PM, "Nathan DeBardeleben" wrote:

> Open MPI: 1.0.2
> Open MPI SVN revision: r9571
>
> The rc value returned by the 'get' call is '0'.
> All I'm doing is calling init with my own daemon name, it's coming up
> fine, then I immediately call this to figure out how many nodes are
> associated with this machine.
>
> -- Nathan
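Ralph's point is that a successful return code with a count of zero is a different outcome from a failed call, yet get_num_nodes() reports both as 0. A minimal sketch of keeping the two apart follows. It does not use the ORTE API: demo_lookup(), the DEMO_* codes, and node_query_result_t are hypothetical stand-ins invented for this illustration, so that the example compiles on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical return codes; stand-ins, not the real ORTE constants. */
enum { DEMO_SUCCESS = 0, DEMO_ERROR = -1 };

/* Hypothetical lookup that mimics a registry "get": it reports a status
 * code and, separately, how many entries it found. */
static int demo_lookup(int simulate_failure, size_t nodes_on_segment,
                       size_t *cnt)
{
    assert(cnt != NULL);
    if (simulate_failure) {
        *cnt = 0;
        return DEMO_ERROR;
    }
    *cnt = nodes_on_segment;
    return DEMO_SUCCESS;
}

/* The three outcomes a caller may want to tell apart. */
typedef enum { NODES_FOUND, SEGMENT_EMPTY, LOOKUP_FAILED } node_query_result_t;

/* Report the outcome as the return value and the count through an
 * out-parameter, so "failed" and "succeeded but empty" never collapse
 * into the same value. */
static node_query_result_t classify_node_query(int simulate_failure,
                                               size_t nodes_on_segment,
                                               size_t *num_nodes)
{
    size_t cnt = 0;
    int rc = demo_lookup(simulate_failure, nodes_on_segment, &cnt);

    *num_nodes = 0;
    if (rc != DEMO_SUCCESS) {
        return LOOKUP_FAILED;   /* the call itself went wrong */
    }
    if (cnt == 0) {
        return SEGMENT_EMPTY;   /* the call worked, nothing was stored */
    }
    *num_nodes = cnt;
    return NODES_FOUND;
}
```

With a result like this, the case in this thread (rc of 0, no nodes) would show up as SEGMENT_EMPTY without anyone having to print rc by hand.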
Re: [OMPI devel] Getting the number of nodes
Open MPI: 1.0.2
Open MPI SVN revision: r9571

The rc value returned by the 'get' call is '0'. All I'm doing is calling init with my own daemon name; it comes up fine, and then I immediately call this to figure out how many nodes are associated with this machine.

-- Nathan
Re: [OMPI devel] Getting the number of nodes
Hi Nathan

Could you tell us which version of the code you are using, and print out the rc value that was returned by the "get" call? I see nothing obviously wrong with the code, but much depends on what happened prior to this call too.

BTW: you might want to release the memory stored in the returned values - it could represent a substantial memory leak.

Ralph

On 7/5/06 9:28 AM, "Nathan DeBardeleben" wrote:

> I used to use this code to get the number of nodes in a cluster /
> machine / whatever:
>
>> int
>> get_num_nodes(void)
>> {
>>     int rc;
>>     size_t cnt;
>>     orte_gpr_value_t **values;
>>
>>     rc = orte_gpr.get(ORTE_GPR_KEYS_OR|ORTE_GPR_TOKENS_OR,
>>                       ORTE_NODE_SEGMENT, NULL, NULL, &cnt, &values);
>>
>>     if(rc != ORTE_SUCCESS) {
>>         return 0;
>>     }
>>
>>     return cnt;
>> }
>
> This now returns '0' on my MAC when it used to return 1. Is this not an
> acceptable way of doing this? Is there a cleaner / better way these days?
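The leak Ralph mentions comes from the lookup handing back an array of allocated values that the caller never frees. The sketch below shows the release pattern only. It is built on a mock: mock_value_t, mock_get(), the MOCK_* codes, and the mock_live_values counter are all hypothetical names made up for this example, not ORTE definitions. In a real Open MPI tree each value would presumably be released with the object system's own macro (OBJ_RELEASE, as far as I recall) rather than plain free(); please verify that against your source before relying on it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a registry value; NOT orte_gpr_value_t. */
typedef struct mock_value { int unused; } mock_value_t;

enum { MOCK_SUCCESS = 0, MOCK_ERR_OUT_OF_RESOURCE = -1 };

/* Number of values handed out and not yet released (for leak checking). */
static size_t mock_live_values = 0;

/* Mock "get": on success, returns a heap array of nnodes heap-allocated
 * values, mimicking a lookup that transfers ownership to the caller. */
static int mock_get(size_t nnodes, size_t *cnt, mock_value_t ***values)
{
    size_t i;

    assert(cnt != NULL && values != NULL);
    *cnt = 0;
    *values = NULL;
    if (nnodes == 0) {
        return MOCK_SUCCESS;            /* success, nothing found */
    }
    *values = malloc(nnodes * sizeof(**values));
    if (*values == NULL) {
        return MOCK_ERR_OUT_OF_RESOURCE;
    }
    for (i = 0; i < nnodes; i++) {
        (*values)[i] = malloc(sizeof(mock_value_t));
        if ((*values)[i] == NULL) {
            /* Undo the partial allocation before reporting failure. */
            while (i > 0) {
                free((*values)[--i]);
                mock_live_values--;
            }
            free(*values);
            *values = NULL;
            return MOCK_ERR_OUT_OF_RESOURCE;
        }
        mock_live_values++;
    }
    *cnt = nnodes;
    return MOCK_SUCCESS;
}

/* Same shape as get_num_nodes(), but every returned value and the array
 * holding them are released before the count is returned. */
static int count_nodes_and_release(size_t nnodes_in_registry)
{
    size_t cnt = 0, i;
    mock_value_t **values = NULL;
    int rc = mock_get(nnodes_in_registry, &cnt, &values);

    if (rc != MOCK_SUCCESS) {
        return 0;
    }
    for (i = 0; i < cnt; i++) {
        free(values[i]);                /* release each value */
        mock_live_values--;
    }
    free(values);                       /* then the array itself */
    return (int)cnt;
}
```

Because only the count is needed, nothing from the returned values has to be copied out before they are released.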