On Mon, Dec 15, 2008 at 6:17 AM, Daniel Gruner <[email protected]> wrote:
>
> Hi Abhishek,
>
> Ok, so if I use
>
> nodes n0000,n0001
>
> it works.  The other two forms
>
> nodes n000[0-1]
> nodes n0000-n0001
>
> do NOT work.

No, n000[0-1] works. n0000-n0001 won't work.
BJS supports node ranges similar to all the other xcpu utils (xrx, xgroupset etc.)
It is mandatory to supply a range enclosed in [ ]

> It appears that your current bjs only accepts a
> comma-delimited list of nodes,
> and it is not yet clear to me that wildcards will work.

Wildcards don't work for any other utils either. I have some changes ready
that separate out the node range parsing, but it's mostly untested.
Wildcards can be done if there's a compelling reason for it.

>
> Now, to another question.  If I submit a job using bjssub, it sets up
> the environment variable "NODES" to contain a list of nodes that can
> be used by the owner of the job to submit jobs to.  In the old bproc
> it was just a list of numbers, which you used in the bpsh command.
> Here I got a list of numbers that don't necessarily make sense...  How
> would one run xrx with these?

You can't. Because the numbers don't make sense. It's a bug. You should see
it as:
"NODES=n0000, n0001"
I can reproduce the bug with the interactive mode. It probably went
untested, I'll look into it.

> For example:
>
> [r...@dgk3 ~]# bjssub -n 2 -i -s 10000 /bin/bash
> Waiting for interactive job nodes.
> (nodes 0 6443568 6441632)
> Starting interactive job.
> NODES=6443568,6441632
> JOBID=0
>
> so what do these numbers correspond to?  Typically in a batch
> environment you don't know which nodes get assigned to you, so the
> script you use to run the jobs must be told which nodes are yours to
> use.  Similarly for mpi programs.  Admittedly xmvapich still has some
> problems, but it runs with a list of nodes too, in just the same way
> as xrx.
> Also, doing something like "xrx -a" should now look not at
> the total list of nodes as defined by statfs, but rather the locally
> defined list from, for example, an environment variable.  Could I
> suggest that the NODES variable be set to an "xcpu-aware" list of
> nodes, and then that the command set (xrx, xmvapich,...) look at it
> for resolution of the "-a" option?  It doesn't have to be NODES, but
> something unique could work.
>
> Thanks,
> Daniel
>
> On Sun, Dec 14, 2008 at 10:44 PM, Abhishek Kulkarni <[email protected]>
> wrote:
> >
> >
> > On Sun, Dec 14, 2008 at 8:29 PM, Daniel Gruner <[email protected]> wrote:
> >>
> >> On Sun, Dec 14, 2008 at 10:20 PM, Abhishek Kulkarni <[email protected]>
> >> wrote:
> >> >
> >> >
> >> > On Sun, Dec 14, 2008 at 7:04 PM, Daniel Gruner <[email protected]>
> >> > wrote:
> >> >>
> >> >> Hi Abhishek,
> >> >>
> >> >> Well, I compiled it and installed it (the Makefile needs work...), and
> >> >> it stays up as a daemon, but doesn't show any available nodes:
> >> >>
> >> >> [r...@dgk3 bjs]# bjsstat
> >> >> Pool: default Nodes (total/up/free): 0/0/0
> >> >> ID User Command Requirements
> >> >>
> >> >> Did you change anything in the format for the bjs.conf file?
> >> >
> >> > Yes I added an extra option (statfs) which can be specified as:
> >> >
> >> > statfs localhost!20003
> >> >
> >> > bjs would fetch the node information from statfs.
> >> > Although, the 'nodes' parameter in bjs.conf remains -- an intersection
> >> > set of the two dictates the total nodes for bjs.
> >>
> >> Well, here is my bjs.conf, and regardless of whether I specify the
> >> nodes line or not, bjsstat does not appear to show any active nodes.
> >> I have not modified statfs in any way, so the port 20003 should still
> >> be fine.
> >>
> >> # Sample BJS configuration file
> >> #
> >> # $Id: bjs.conf,v 1.10 2003/11/10 19:40:22 mkdist Exp $
> >>
> >> spooldir /var/spool/bjs
> >> policypath /usr/local/lib64/bjs:/usr/local/lib/bjs
> >> socketpath /tmp/.bjs
> >> #acctlog /tmp/acct.log
> >> statfsaddr localhost!20003
> >>
> >> pool default
> >> policy filler
> >> # nodes 0-1
> >> maxsecs 20000000
> >>
> >> I have tried this with the nodes line like:
> >>
> >> nodes n0000-n0001
> >
> > The nodes line is not optional. I would probably make it
> >
> > nodes n000[0-1] or
> > nodes n0000, n0001
> >
> > though what you specified should work too (I will check that out).
> >
> > And spawn bjs with -v switch to get a more verbose output.
> > Thanks.
> >
> >
> >>
> >> but it doesn't work either.  xstat seems totally normal:
> >>
> >> [r...@dgk3 ~]# xstat
> >> n0000 tcp!10.10.0.10!6667 /Linux/x86_64 up 0
> >> n0001 tcp!10.10.0.11!6667 /Linux/x86_64 up 0
> >>
> >>
> >> Daniel
> >>
> >>
> >> >
> >> >>
> >> >> Daniel
> >> >>
> >> >>
> >> >> On Sun, Dec 14, 2008 at 10:10 AM, Abhishek Kulkarni
> >> >> <[email protected]>
> >> >> wrote:
> >> >> >
> >> >> >
> >> >> > On Sat, Dec 13, 2008 at 9:37 PM, Daniel Gruner <[email protected]>
> >> >> > wrote:
> >> >> >>
> >> >> >> Hi Abhishek,
> >> >> >>
> >> >> >> What is the status of your port of bjs?  Is it part of the sxcpu
> >> >> >> tree (or pulled when one checks out from the sxcpu svn repository)?
> >> >> >> I'd really like to test it...
> >> >> >
> >> >> > Daniel,
> >> >> >
> >> >> > You probably missed the quick announcement, here it is again:
> >> >> >
> >> >> > http://groups.google.com/group/xcpu/browse_thread/thread/42ed613c72fe55ba#
> >> >> >
> >> >> > After syncing changes between the sxcpu and the xcpu2 tree, it could
> >> >> > be used for either.
> >> >> > Let me know how it works for you.
> >> >> > Thanks
> >> >> >
> >> >> >
> >> >> >>
> >> >> >> Daniel
> >> >> >>
> >> >> >>
> >> >> >> On Mon, Dec 8, 2008 at 3:08 PM, Abhishek Kulkarni
> >> >> >> <[email protected]>
> >> >> >> wrote:
> >> >> >> >
> >> >> >> > This patch makes bjs comply with the changed semantics of
> >> >> >> > xp_nodeset_list_by_state to obtain the down nodes from statfs.
> >> >> >> >
> >> >> >> > Signed-off-by: Abhishek Kulkarni <[email protected]>
> >> >> >> >
> >> >> >> > Index: bjs.c
> >> >> >> > ===================================================================
> >> >> >> > --- bjs.c (revision 746)
> >> >> >> > +++ bjs.c (working copy)
> >> >> >> > @@ -2481,19 +2481,7 @@
> >> >> >> >
> >> >> >> >          if (r > 0) {
> >> >> >> >              /* Check for machine status changes */
> >> >> >> > -            /* TODO: Instead of jumping over these hoops, improve the
> >> >> >> > -               way down nodes can be obtained from statfs */
> >> >> >> > -
> >> >> >> > -            down_nodeset = xp_nodeset_list_by_state(conf.statfsaddr, "down(initializing)");
> >> >> >> > -            xp_nodeset_append(down_nodeset,
> >> >> >> > -                xp_nodeset_list_by_state(conf.statfsaddr, "down(disconnected)"));
> >> >> >> > -            xp_nodeset_append(down_nodeset,
> >> >> >> > -                xp_nodeset_list_by_state(conf.statfsaddr, "down(connect_failed)"));
> >> >> >> > -            xp_nodeset_append(down_nodeset,
> >> >> >> > -                xp_nodeset_list_by_state(conf.statfsaddr, "down(read_failed)"));
> >> >> >> > -            xp_nodeset_append(down_nodeset,
> >> >> >> > -                xp_nodeset_list_by_state(conf.statfsaddr, "down(no_contact)"));
> >> >> >> > -
> >> >> >> > +            down_nodeset = xp_nodeset_list_by_state(conf.statfsaddr, 0);
> >> >> >> >              if (down_nodeset->len != down_nodes) {
> >> >> >> >                  if (verbose) syslog(LOG_INFO, "XCPU cluster status change.");
> >> >> >> >                  chng = update_cluster_status(conf.statfsaddr);
> >> >> >> > @@ -2505,9 +2493,10 @@
> >> >> >> >                      p->policy->state_change(p);
> >> >> >> >                  }
> >> >> >> >              }
> >> >> >> > +            down_nodes = down_nodeset->len;
> >> >> >> >          }
> >> >> >> > -        down_nodes = down_nodeset->len;
> >> >> >> >
> >> >> >> > +
> >> >> >> >          /* Check for new clients */
> >> >> >> >          if (FD_ISSET(conf.client_sockfd, &rset))
> >> >> >> >              client_accept();
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >
> >> >> >
> >> >
> >> >
> >
> >
>
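
For reference, the bracketed range form mentioned at the top of this message (n000[0-1]) could be expanded along the following lines. This is only a rough sketch and not the actual bjs or xcpu parser: the names expand_node_spec and NODE_NAME_LEN are made up for illustration, and it assumes a single [lo-hi] range per spec, with zero padding taken from the number of digits typed for the lower bound.

```c
#include <stdio.h>
#include <string.h>

#define NODE_NAME_LEN 64  /* hypothetical limit on one node name */

/*
 * Expand one spec such as "n000[0-1]" into "n0000", "n0001".
 * A spec without brackets is copied through as a single name.
 * Returns the number of names written to out, or -1 on error.
 * Illustrative helper only; not taken from bjs or libxcpu.
 */
int expand_node_spec(const char *spec, char out[][NODE_NAME_LEN], int max)
{
    const char *lb = strchr(spec, '[');
    int lo, hi, consumed = 0;

    if (lb == NULL) {
        /* No range: treat the whole spec as one literal node name. */
        if (max < 1 || spec[0] == '\0' || strlen(spec) >= NODE_NAME_LEN)
            return -1;
        strcpy(out[0], spec);
        return 1;
    }

    /* %n is only stored if the closing ']' was matched as well. */
    if (sscanf(lb, "[%d-%d]%n", &lo, &hi, &consumed) != 2 || consumed == 0)
        return -1;
    if (lo < 0 || hi < lo || hi - lo >= max)
        return -1;

    int prefix_len = (int)(lb - spec);
    int width = (int)strcspn(lb + 1, "-");  /* digits typed for lo */
    const char *suffix = lb + consumed;      /* text after ']' */
    int count = 0;

    for (int i = lo; i <= hi; i++) {
        int n = snprintf(out[count], NODE_NAME_LEN, "%.*s%0*d%s",
                         prefix_len, spec, width, i, suffix);
        if (n < 0 || n >= NODE_NAME_LEN)
            return -1;  /* name would not fit */
        count++;
    }
    return count;
}
```

With this sketch, "n000[0-1]" yields n0000 and n0001, while "n0000-n0001" has no brackets and is passed through as one literal name rather than a range.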

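
Similarly, once NODES carries node names instead of the bogus numbers (e.g. "n0000, n0001" as above), a job script or tool could split it roughly like this. Again just a sketch under assumptions: split_node_list and NODE_NAME_LEN are hypothetical names, the list is assumed to be comma-separated with optional blanks, and nothing here is taken from xrx or bjssub.

```c
#include <ctype.h>
#include <string.h>

#define NODE_NAME_LEN 64  /* hypothetical limit on one node name */

/*
 * Split a comma-separated list such as "n0000, n0001" into names,
 * dropping blanks around each entry and skipping empty entries.
 * Returns the number of names stored in out, or -1 if there are more
 * than max entries or one of them is too long.
 * Illustrative helper only; not part of bjs or the xcpu tools.
 */
int split_node_list(const char *list, char out[][NODE_NAME_LEN], int max)
{
    int count = 0;
    const char *p = list;

    while (*p != '\0') {
        /* Skip separators and leading blanks. */
        while (*p == ',' || isspace((unsigned char)*p))
            p++;
        if (*p == '\0')
            break;

        size_t len = strcspn(p, ",");  /* up to next comma or end */
        size_t end = len;
        /* Trim trailing blanks of this entry. */
        while (end > 0 && isspace((unsigned char)p[end - 1]))
            end--;

        if (count >= max || end >= NODE_NAME_LEN)
            return -1;
        memcpy(out[count], p, end);
        out[count][end] = '\0';
        count++;
        p += len;
    }
    return count;
}
```

A caller would typically pass the result of getenv("NODES") after checking it for NULL.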