Hi Abhishek,

OK, so if I use

    nodes n0000,n0001

it works. The other two forms,

    nodes n000[0-1]
    nodes n0000-n0001

do NOT work. It appears that your current bjs accepts only a comma-delimited list of nodes, and it is not yet clear to me whether wildcards will work.

Now, to another question. If I submit a job using bjssub, it sets the environment variable "NODES" to a list of nodes that the owner of the job can submit work to. In the old bproc this was just a list of node numbers, which you passed to the bpsh command. Here I got a list of numbers that don't obviously correspond to anything... How would one run xrx with these? For example:

[r...@dgk3 ~]# bjssub -n 2 -i -s 10000 /bin/bash
Waiting for interactive job nodes.
(nodes 0 6443568 6441632)
Starting interactive job.
NODES=6443568,6441632
JOBID=0

So what do these numbers correspond to?

Typically in a batch environment you don't know in advance which nodes get assigned to you, so the script you use to run the job must be told which nodes are yours to use. The same goes for MPI programs. Admittedly xmvapich still has some problems, but it takes a list of nodes too, in just the same way as xrx. Also, something like "xrx -a" should now look not at the total list of nodes as defined by statfs, but rather at the locally defined list from, for example, an environment variable.

Could I suggest that the NODES variable be set to an "xcpu-aware" list of nodes, and that the command set (xrx, xmvapich, ...) consult it to resolve the "-a" option? It doesn't have to be NODES; any unique name would work.
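To make the suggestion concrete, here is a minimal sketch, written in Python purely for illustration. Nothing in this thread indicates that bjs or xrx works this way today. The names expand_nodes and allowed_nodes are hypothetical, and the sketch assumes NODES would hold xcpu node names such as n0000 rather than the numbers bjssub currently prints. It expands the three "nodes" forms discussed above and reports "no restriction" when the variable is unset.

```python
import os
import re

# "n000[0-1]": a literal prefix followed by a bracketed numeric range.
_BRACKET = re.compile(r"^(.*)\[(\d+)-(\d+)\]$")
# "n0000-n0001" (or "0-1"): two names that share a non-numeric prefix.
_DASH = re.compile(r"^(\D*)(\d+)-(\D*)(\d+)$")


def _span(prefix, lo, hi):
    """Return prefix+number for lo..hi inclusive, zero-padded to len(lo)."""
    width = len(lo)
    return [f"{prefix}{i:0{width}d}" for i in range(int(lo), int(hi) + 1)]


def expand_nodes(spec):
    """Expand a comma-separated node spec into a flat list of node names.

    Hypothetical helper: handles plain names, bracket ranges and
    dash ranges; anything else is passed through unchanged.
    """
    nodes = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        m = _BRACKET.match(item)
        if m:
            nodes.extend(_span(*m.groups()))
            continue
        m = _DASH.match(item)
        if m and m.group(1) == m.group(3):
            nodes.extend(_span(m.group(1), m.group(2), m.group(4)))
            continue
        nodes.append(item)
    return nodes


def allowed_nodes(env=None, var="NODES"):
    """Return the job's node list from the environment, or None if unset.

    Assumption: the scheduler exports node names (not opaque numbers)
    in the variable named by `var`.
    """
    env = os.environ if env is None else env
    value = env.get(var)
    if not value:
        return None
    return expand_nodes(value)
```

A tool handling "-a" could call allowed_nodes() first and only ask statfs for the full node list when it returns None.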

Thanks,
Daniel

On Sun, Dec 14, 2008 at 10:44 PM, Abhishek Kulkarni <[email protected]> wrote:
>
>
> On Sun, Dec 14, 2008 at 8:29 PM, Daniel Gruner <[email protected]> wrote:
>>
>> On Sun, Dec 14, 2008 at 10:20 PM, Abhishek Kulkarni <[email protected]> wrote:
>> >
>> >
>> > On Sun, Dec 14, 2008 at 7:04 PM, Daniel Gruner <[email protected]> wrote:
>> >>
>> >> Hi Abhishek,
>> >>
>> >> Well, I compiled it and installed it (the Makefile needs work...), and
>> >> it stays up as a daemon, but doesn't show any available nodes:
>> >>
>> >> [r...@dgk3 bjs]# bjsstat
>> >> Pool: default Nodes (total/up/free): 0/0/0
>> >> ID User Command Requirements
>> >>
>> >> Did you change anything in the format for the bjs.conf file?
>> >
>> > Yes I added an extra option (statfs) which can be specified as:
>> >
>> > statfs localhost!20003
>> >
>> > bjs would fetch the node information from statfs.
>> > Although, the 'nodes' parameter in bjs.conf remains -- an intersection set
>> > of the two dictates the total nodes for bjs.
>>
>> Well, here is my bjs.conf, and regardless of whether I specify the
>> nodes line or not, bjsstat does not appear to show any active nodes.
>> I have not modified statfs in any way, so the port 20003 should still
>> be fine.
>>
>> # Sample BJS configuration file
>> #
>> # $Id: bjs.conf,v 1.10 2003/11/10 19:40:22 mkdist Exp $
>>
>> spooldir /var/spool/bjs
>> policypath /usr/local/lib64/bjs:/usr/local/lib/bjs
>> socketpath /tmp/.bjs
>> #acctlog /tmp/acct.log
>> statfsaddr localhost!20003
>>
>> pool default
>>         policy filler
>>         # nodes 0-1
>>         maxsecs 20000000
>>
>> I have tried this with the nodes line like:
>>
>> nodes n0000-n0001
>
> The nodes line is not optional. I would probably make it
>
> nodes n000[0-1] or
> nodes n0000, n0001
>
> though what you specified should work too (I will check that out).
>
> And spawn bjs with -v switch to get a more verbose output.
> Thanks.
>
>
>>
>> but it doesn't work either. xstat seems totally normal:
>>
>> [r...@dgk3 ~]# xstat
>> n0000 tcp!10.10.0.10!6667 /Linux/x86_64 up 0
>> n0001 tcp!10.10.0.11!6667 /Linux/x86_64 up 0
>>
>>
>> Daniel
>>
>>
>> >
>> >>
>> >> Daniel
>> >>
>> >>
>> >> On Sun, Dec 14, 2008 at 10:10 AM, Abhishek Kulkarni <[email protected]> wrote:
>> >> >
>> >> >
>> >> > On Sat, Dec 13, 2008 at 9:37 PM, Daniel Gruner <[email protected]> wrote:
>> >> >>
>> >> >> Hi Abhishek,
>> >> >>
>> >> >> What is the status of your port of bjs? Is it part of the sxcpu tree
>> >> >> (or pulled when one checks out from the sxcpu svn repository)? I'd
>> >> >> really like to test it...
>> >> >
>> >> > Daniel,
>> >> >
>> >> > You probably missed the quick announcement, here it is again:
>> >> >
>> >> >
>> >> > http://groups.google.com/group/xcpu/browse_thread/thread/42ed613c72fe55ba#
>> >> >
>> >> > After syncing changes between the sxcpu and the xcpu2 tree, it could be
>> >> > used for either.
>> >> > Let me know how it works for you.
>> >> > Thanks
>> >> >
>> >> >
>> >> >>
>> >> >> Daniel
>> >> >>
>> >> >>
>> >> >> On Mon, Dec 8, 2008 at 3:08 PM, Abhishek Kulkarni <[email protected]> wrote:
>> >> >> >
>> >> >> > This patch makes bjs comply with the changed semantics of
>> >> >> > xp_nodeset_list_by_state to obtain the down nodes from statfs.
>> >> >> >
>> >> >> > Signed-off-by: Abhishek Kulkarni <[email protected]>
>> >> >> >
>> >> >> > Index: bjs.c
>> >> >> > ===================================================================
>> >> >> > --- bjs.c (revision 746)
>> >> >> > +++ bjs.c (working copy)
>> >> >> > @@ -2481,19 +2481,7 @@
>> >> >> >
>> >> >> >          if (r > 0) {
>> >> >> >              /* Check for machine status changes */
>> >> >> > -            /* TODO: Instead of jumping over these hoops, improve the
>> >> >> > -               way down nodes can be obtained from statfs */
>> >> >> > -
>> >> >> > -            down_nodeset = xp_nodeset_list_by_state(conf.statfsaddr, "down(initializing)");
>> >> >> > -            xp_nodeset_append(down_nodeset,
>> >> >> > -                xp_nodeset_list_by_state(conf.statfsaddr, "down(disconnected)"));
>> >> >> > -            xp_nodeset_append(down_nodeset,
>> >> >> > -                xp_nodeset_list_by_state(conf.statfsaddr, "down(connect_failed)"));
>> >> >> > -            xp_nodeset_append(down_nodeset,
>> >> >> > -                xp_nodeset_list_by_state(conf.statfsaddr, "down(read_failed)"));
>> >> >> > -            xp_nodeset_append(down_nodeset,
>> >> >> > -                xp_nodeset_list_by_state(conf.statfsaddr, "down(no_contact)"));
>> >> >> > -
>> >> >> > +            down_nodeset = xp_nodeset_list_by_state(conf.statfsaddr, 0);
>> >> >> >              if (down_nodeset->len != down_nodes) {
>> >> >> >                  if (verbose) syslog(LOG_INFO, "XCPU cluster status change.");
>> >> >> >                  chng = update_cluster_status(conf.statfsaddr);
>> >> >> > @@ -2505,9 +2493,10 @@
>> >> >> >                      p->policy->state_change(p);
>> >> >> >                  }
>> >> >> >              }
>> >> >> > +            down_nodes = down_nodeset->len;
>> >> >> >          }
>> >> >> > -        down_nodes = down_nodeset->len;
>> >> >> >
>> >> >> > +
>> >> >> >          /* Check for new clients */
>> >> >> >          if (FD_ISSET(conf.client_sockfd, &rset))
>> >> >> >              client_accept();
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >
>> >> >
>> >
>> >
>
>
