On Mon, Dec 15, 2008 at 8:48 AM, Daniel Gruner <[email protected]> wrote:
>
> More issues with bjs (apparently related to my previous note on the
> available nodes and how to specify them to xrx from within the
> submitted script or interactive batch session):
>
> Say I get a single node using bjssub:
>
> [da...@dgk3 ~]$ bjssub -n 1 -s 10000 -i /bin/bash
> Waiting for interactive job nodes.
> (nodes 3 6443568)
> Starting interactive job.
> NODES=6443568
>

This won't work at all. The node name is getting mangled for some reason if
you submit an interactive job. Bproc used numbers for nodes, and bjs follows
the same convention. I have refrained from doing extensive changes to the
code, since many of the issues called for a complete redesign complying to
the xcpu way of doing things.

> JOBID=3
>
> [da...@dgk3 ~]$ bjsstat
> Pool: default   Nodes (total/up/free): 2/2/1
> ID      User      Command           Requirements
>    3 R  danny     (interactive)     nodes=1 secs=10000
>
> [da...@dgk3 ~]$ xgetent -a group
> xgetent: n0001: Error 5: unknown user
> xgetent: n0000: Error 5: error opening file grent
> [da...@dgk3 ~]$ xgetent -a passwd
> xgetent: n0001: Error 5: unknown user
> xgetent: n0000: Error 0: (null)
> [da...@dgk3 ~]$ xrx -a date
> Error: unknown user
> [da...@dgk3 ~]$ xrx n0000 date
> Error: Invalid argument:0.98.82.48
> [da...@dgk3 ~]$ xrx n0001 date
> Error: Invalid argument:0.98.82.48
> [da...@dgk3 ~]$ xrx date
> Error: Invalid argument:0.98.82.48

BJS should automatically add the required users and groups when you submit a
job.

>
>
> So, what you see here is that there are a few things that are not
> quite working... I am not sure what the permissions are for the
> nodes. The one assigned to me must have only my group and user
> defined, so that I "own" the node (apart from root and xcpu-admin, so
> that maintenance can be done by root).

Yes, the node permissions depend on the authorized users in the userpool.
When you own a node, it has your user (and group) apart from xcpu-admin.

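For anyone following the session quoted above: NODES there holds a bproc-style
numeric ID (6443568), while the nodes themselves are addressed by name (n0000,
n0001). A minimal Python sketch of that mismatch, assuming NODES is a
comma-separated list. The helpers job_nodes and usable_nodes are hypothetical
names for illustration and are not part of bjs or xcpu:

```python
import os


def job_nodes(environ=None):
    """Split the comma-separated NODES value into a list of entries."""
    env = os.environ if environ is None else environ
    raw = env.get("NODES", "")
    return [entry.strip() for entry in raw.split(",") if entry.strip()]


def usable_nodes(known_names, environ=None):
    """Keep only the NODES entries that match a known node name."""
    known = set(known_names)
    return [entry for entry in job_nodes(environ) if entry in known]


if __name__ == "__main__":
    cluster = ["n0000", "n0001"]  # node names as they appear in this thread
    # A numeric ID, as in the interactive session, matches no node name.
    print(usable_nodes(cluster, {"NODES": "6443568"}))
    # A name-based value, as requested in the thread, matches directly.
    print(usable_nodes(cluster, {"NODES": "n0001"}))
```

With a numeric ID nothing matches, which is consistent with the failing xrx
calls in the session; with a name-based NODES the job's node comes through
unchanged.
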
> Also, there must be a way to
> access the said node(s) by xrx, or whatever one decides to run inside
> the submitted script.
>

The NODES env variable is recognized by xrx and xmvapich.

>
> I have been using bjs for many years, and it is a wonderful little
> scheduler, so I have strong feelings for how it should work... :-)

Heh, sure. Your inputs and suggestions are really appreciated.

> It
> seems like there are only some kinks to iron out. Great work,
> Abhishek!
>
> Daniel
>
>
> On 12/15/08, Daniel Gruner <[email protected]> wrote:
> > Hi Abhishek,
> >
> > Ok, so if I use
> >
> > nodes n0000,n0001
> >
> > it works. The other two forms
> >
> > nodes n000[0-1]
> > nodes n0000-n0001
> >
> > do NOT work. It appears that your current bjs only accepts a
> > comma-delimited list of nodes,
> > and it is not yet clear to me that wildcards will work.
> >
> > Now, to another question. If I submit a job using bjssub, it sets up
> > the environment variable "NODES" to contain a list of nodes that can
> > be used by the owner of the job to submit jobs to. In the old bproc
> > it was just a list of numbers, which you used in the bpsh command.
> > Here I got a list of numbers that don't necessarily make sense... How
> > would one run xrx with these? For example:
> >
> > [r...@dgk3 ~]# bjssub -n 2 -i -s 10000 /bin/bash
> > Waiting for interactive job nodes.
> > (nodes 0 6443568 6441632)
> > Starting interactive job.
> > NODES=6443568,6441632
> > JOBID=0
> >
> > so what do these numbers correspond to? Typically in a batch
> > environment you don't know which nodes get assigned to you, so the
> > script you use to run the jobs must be told which nodes are yours to
> > use. Similarly for mpi programs. Admittedly xmvapich still has some
> > problems, but it runs with a list of nodes too, in just the same way
> > as xrx.
> > Also, doing something like "xrx -a" should now look not at
> > the total list of nodes as defined by statfs, but rather the locally
> > defined list from, for example, an environment variable. Could I
> > suggest that the NODES variable be set to an "xcpu-aware" list of
> > nodes, and then that the command set (xrx, xmvapich,...) look at it
> > for resolution of the "-a" option? It doesn't have to be NODES, but
> > something unique could work.
> >
> > Thanks,
> >
> > Daniel
> >
> >
> > On Sun, Dec 14, 2008 at 10:44 PM, Abhishek Kulkarni <[email protected]> wrote:
> > >
> > >
> > > On Sun, Dec 14, 2008 at 8:29 PM, Daniel Gruner <[email protected]> wrote:
> > >>
> > >> On Sun, Dec 14, 2008 at 10:20 PM, Abhishek Kulkarni <[email protected]>
> > >> wrote:
> > >> >
> > >> >
> > >> > On Sun, Dec 14, 2008 at 7:04 PM, Daniel Gruner <[email protected]>
> > >> > wrote:
> > >> >>
> > >> >> Hi Abhishek,
> > >> >>
> > >> >> Well, I compiled it and installed it (the Makefile needs work...), and
> > >> >> it stays up as a daemon, but doesn't show any available nodes:
> > >> >>
> > >> >> [r...@dgk3 bjs]# bjsstat
> > >> >> Pool: default Nodes (total/up/free): 0/0/0
> > >> >> ID User Command Requirements
> > >> >>
> > >> >> Did you change anything in the format for the bjs.conf file?
> > >> >
> > >> > Yes I added an extra option (statfs) which can be specified as:
> > >> >
> > >> > statfs localhost!20003
> > >> >
> > >> > bjs would fetch the node information from statfs.
> > >> > Although, the 'nodes' parameter in bjs.conf remains -- an intersection
> > >> > set of the two dictates the total nodes for bjs.
> > >>
> > >> Well, here is my bjs.conf, and regardless of whether I specify the
> > >> nodes line or not, bjsstat does not appear to show any active nodes.
> > >> I have not modified statfs in any way, so the port 20003 should still
> > >> be fine.
> > >>
> > >> # Sample BJS configuration file
> > >> #
> > >> # $Id: bjs.conf,v 1.10 2003/11/10 19:40:22 mkdist Exp $
> > >>
> > >> spooldir /var/spool/bjs
> > >> policypath /usr/local/lib64/bjs:/usr/local/lib/bjs
> > >> socketpath /tmp/.bjs
> > >> #acctlog /tmp/acct.log
> > >> statfsaddr localhost!20003
> > >>
> > >> pool default
> > >> policy filler
> > >> # nodes 0-1
> > >> maxsecs 20000000
> > >>
> > >> I have tried this with the nodes line like:
> > >>
> > >> nodes n0000-n0001
> > >
> > > The nodes line is not optional. I would probably make it
> > >
> > > nodes n000[0-1] or
> > > nodes n0000, n0001
> > >
> > > though what you specified should work too (I will check that out).
> > >
> > > And spawn bjs with -v switch to get a more verbose output.
> > > Thanks.
> > >
> > >
> > >>
> > >> but it doesn't work either. xstat seems totally normal:
> > >>
> > >> [r...@dgk3 ~]# xstat
> > >> n0000 tcp!10.10.0.10!6667 /Linux/x86_64 up 0
> > >> n0001 tcp!10.10.0.11!6667 /Linux/x86_64 up 0
> > >>
> > >>
> > >> Daniel
> > >>
> > >>
> > >> >
> > >> >>
> > >> >> Daniel
> > >> >>
> > >> >>
> > >> >> On Sun, Dec 14, 2008 at 10:10 AM, Abhishek Kulkarni
> > >> >> <[email protected]> wrote:
> > >> >> >
> > >> >> >
> > >> >> > On Sat, Dec 13, 2008 at 9:37 PM, Daniel Gruner <[email protected]>
> > >> >> > wrote:
> > >> >> >>
> > >> >> >> Hi Abhishek,
> > >> >> >>
> > >> >> >> What is the status of your port of bjs? Is it part of the sxcpu
> > >> >> >> tree (or pulled when one checks out from the sxcpu svn
> > >> >> >> repository)? I'd really like to test it...
> > >> >> >
> > >> >> > Daniel,
> > >> >> >
> > >> >> > You probably missed the quick announcement, here it is again:
> > >> >> >
> > >> >> > http://groups.google.com/group/xcpu/browse_thread/thread/42ed613c72fe55ba#
> > >> >> >
> > >> >> > After syncing changes between the sxcpu and the xcpu2 tree, it could
> > >> >> > be used for either.
> > >> >> > Let me know how it works for you.
> > >> >> > Thanks
> > >> >> >
> > >> >> >
> > >> >> >>
> > >> >> >> Daniel
> > >> >> >>
> > >> >> >>
> > >> >> >> On Mon, Dec 8, 2008 at 3:08 PM, Abhishek Kulkarni
> > >> >> >> <[email protected]> wrote:
> > >> >> >> >
> > >> >> >> > This patch makes bjs comply with the changed semantics of
> > >> >> >> > xp_nodeset_list_by_state to obtain the down nodes from statfs.
> > >> >> >> >
> > >> >> >> > Signed-off-by: Abhishek Kulkarni <[email protected]>
> > >> >> >> >
> > >> >> >> > Index: bjs.c
> > >> >> >> > ===================================================================
> > >> >> >> > --- bjs.c (revision 746)
> > >> >> >> > +++ bjs.c (working copy)
> > >> >> >> > @@ -2481,19 +2481,7 @@
> > >> >> >> >
> > >> >> >> >          if (r > 0) {
> > >> >> >> >              /* Check for machine status changes */
> > >> >> >> > -            /* TODO: Instead of jumping over these hoops, improve the
> > >> >> >> > -               way down nodes can be obtained from statfs */
> > >> >> >> > -
> > >> >> >> > -            down_nodeset = xp_nodeset_list_by_state(conf.statfsaddr, "down(initializing)");
> > >> >> >> > -            xp_nodeset_append(down_nodeset,
> > >> >> >> > -                xp_nodeset_list_by_state(conf.statfsaddr, "down(disconnected)"));
> > >> >> >> > -            xp_nodeset_append(down_nodeset,
> > >> >> >> > -                xp_nodeset_list_by_state(conf.statfsaddr, "down(connect_failed)"));
> > >> >> >> > -            xp_nodeset_append(down_nodeset,
> > >> >> >> > -                xp_nodeset_list_by_state(conf.statfsaddr, "down(read_failed)"));
> > >> >> >> > -            xp_nodeset_append(down_nodeset,
> > >> >> >> > -                xp_nodeset_list_by_state(conf.statfsaddr, "down(no_contact)"));
> > >> >> >> > -
> > >> >> >> > +            down_nodeset = xp_nodeset_list_by_state(conf.statfsaddr, 0);
> > >> >> >> >              if (down_nodeset->len != down_nodes) {
> > >> >> >> >                  if (verbose) syslog(LOG_INFO, "XCPU cluster status change.");
> > >> >> >> >                  chng = update_cluster_status(conf.statfsaddr);
> > >> >> >> > @@ -2505,9 +2493,10 @@
> > >> >> >> >                          p->policy->state_change(p);
> > >> >> >> >                      }
> > >> >> >> >                  }
> > >> >> >> > +                down_nodes = down_nodeset->len;
> > >> >> >> >              }
> > >> >> >> > -            down_nodes = down_nodeset->len;
> > >> >> >> >
> > >> >> >> > +
> > >> >> >> >              /* Check for new clients */
> > >> >> >> >              if (FD_ISSET(conf.client_sockfd, &rset))
> > >> >> >> >                  client_accept();
> > >> >> >> >
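
As a rough illustration of what the three "nodes" forms discussed in this
thread (n0000,n0001 / n000[0-1] / n0000-n0001) are meant to denote, here is a
small Python sketch that expands each form into an explicit list of names.
This is not the bjs parser: expand_nodes is a hypothetical helper, and it
assumes zero-padded numeric suffixes and that commas never appear inside
brackets.

```python
import re

# "n000[0-1]": a literal prefix followed by a bracketed numeric range.
BRACKET = re.compile(r"^(?P<prefix>[^\[\]]*)\[(?P<lo>\d+)-(?P<hi>\d+)\]$")
# "n0000-n0001": two full names that share the same non-numeric prefix.
DASH = re.compile(r"^(?P<prefix>\D*)(?P<lo>\d+)-(?P=prefix)(?P<hi>\d+)$")


def expand_nodes(spec):
    """Expand a comma-separated node specification into a list of names."""
    names = []
    for item in (part.strip() for part in spec.split(",")):
        if not item:
            continue
        match = BRACKET.match(item) or DASH.match(item)
        if match is None:
            # A plain name such as "n0000" is kept as it is.
            names.append(item)
            continue
        prefix, lo, hi = match.group("prefix", "lo", "hi")
        width = len(lo)  # preserve the zero padding of the lower bound
        for index in range(int(lo), int(hi) + 1):
            names.append(f"{prefix}{index:0{width}d}")
    return names


if __name__ == "__main__":
    for spec in ("n0000,n0001", "n000[0-1]", "n0000-n0001"):
        print(spec, "->", expand_nodes(spec))
```

Under these assumptions all three forms expand to the same two names, which
is the equivalence the thread expects from the "nodes" line.
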
