More issues with bjs (apparently related to my previous note on the
available nodes and how to specify them to xrx from within the
submitted script or interactive batch session):
Say I get a single node using bjssub:
[da...@dgk3 ~]$ bjssub -n 1 -s 10000 -i /bin/bash
Waiting for interactive job nodes.
(nodes 3 6443568)
Starting interactive job.
NODES=6443568
JOBID=3
[da...@dgk3 ~]$ bjsstat
Pool: default Nodes (total/up/free): 2/2/1
ID      User    Command          Requirements
3 R     danny   (interactive)    nodes=1 secs=10000
[da...@dgk3 ~]$ xgetent -a group
xgetent: n0001: Error 5: unknown user
xgetent: n0000: Error 5: error opening file grent
[da...@dgk3 ~]$ xgetent -a passwd
xgetent: n0001: Error 5: unknown user
xgetent: n0000: Error 0: (null)
[da...@dgk3 ~]$ xrx -a date
Error: unknown user
[da...@dgk3 ~]$ xrx n0000 date
Error: Invalid argument:0.98.82.48
[da...@dgk3 ~]$ xrx n0001 date
Error: Invalid argument:0.98.82.48
[da...@dgk3 ~]$ xrx date
Error: Invalid argument:0.98.82.48
So, as you can see, a few things are not quite working. I am not
sure what the permissions on the nodes are. The node assigned to me
should have only my user and group defined, so that I "own" it (apart
from root and xcpu-admin, so that root can still do maintenance).
Also, there must be a way to reach the assigned node(s) with xrx, or
with whatever else one decides to run inside the submitted script.
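To make the second point concrete, here is a rough sketch of what a
submitted script could do. It rests on an assumption that does not hold
today: that bjs exports NODES as a comma-separated list of xcpu node
names (e.g. NODES=n0000,n0001) rather than the numeric IDs shown above.
The function name run_on_nodes and the DRY_RUN switch are made up for
illustration and are not part of bjs or xcpu.

```shell
#!/bin/sh
# Sketch only. ASSUMPTION: NODES holds comma-separated xcpu node names
# such as "n0000,n0001"; the current bjs sets numeric IDs instead.

# run_on_nodes NODELIST COMMAND [ARGS...]
# Runs COMMAND on each node in NODELIST through xrx, one node at a time.
# DRY_RUN defaults to 1, so the xrx command lines are only printed.
run_on_nodes() {
    nodelist=$1
    shift
    # Split the comma-separated list into one node name per line.
    printf '%s\n' "$nodelist" | tr ',' '\n' | while IFS= read -r node; do
        # Skip empty entries such as those left by a trailing comma.
        [ -n "$node" ] || continue
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "xrx $node $*"
        else
            # Redirect stdin so xrx cannot swallow the rest of the list.
            xrx "$node" "$@" </dev/null
        fi
    done
}

# Example call from inside a job script:
#   run_on_nodes "$NODES" date
```

With NODES=n0000,n0001 and DRY_RUN left at its default, the example call
prints "xrx n0000 date" and "xrx n0001 date"; setting DRY_RUN=0 would
actually invoke xrx.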
I have been using bjs for many years, and it is a wonderful little
scheduler, so I have strong feelings for how it should work... :-) It
seems like there are only some kinks to iron out. Great work,
Abhishek!
Daniel
On 12/15/08, Daniel Gruner <[email protected]> wrote:
> Hi Abhishek,
>
> Ok, so if I use
>
> nodes n0000,n0001
>
> it works. The other two forms
>
> nodes n000[0-1]
> nodes n0000-n0001
>
> do NOT work. It appears that your current bjs only accepts a
> comma-delimited list of nodes, and it is not yet clear to me whether
> wildcards will work.
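
Until the range forms are handled, one workaround is to expand them into
the comma-delimited form before writing the nodes line. A minimal sketch
follows. expand_nodes and its two helpers are hypothetical names, not
part of bjs or xcpu, and the sketch only understands the two forms quoted
above (it would misread node names that themselves contain a hyphen).

```shell
#!/bin/sh
# Hypothetical helper, not part of bjs: expand "n000[0-1]" or
# "n0000-n0001" into the comma-delimited form "n0000,n0001".

# Strip leading zeros so the shell does not read "0008" as octal.
to_dec() {
    n=$(printf '%s' "$1" | sed 's/^0*//')
    echo "${n:-0}"
}

# emit_range PREFIX FIRST LAST
# Prints PREFIX followed by each number from FIRST to LAST, zero-padded
# to the width of FIRST as written, joined by commas.
emit_range() {
    prefix=$1
    width=${#2}
    i=$(to_dec "$2")
    last=$(to_dec "$3")
    out=
    while [ "$i" -le "$last" ]; do
        out="$out${out:+,}$(printf "%s%0${width}d" "$prefix" "$i")"
        i=$((i + 1))
    done
    echo "$out"
}

expand_nodes() {
    spec=$1
    case $spec in
        *\[*-*\])   # bracket form: n000[0-1]
            stem=${spec%%\[*}
            range=${spec#*\[}
            range=${range%\]}
            emit_range "$stem" "${range%%-*}" "${range#*-}"
            ;;
        *-*)        # dash form: n0000-n0001
            first_name=${spec%%-*}
            last_name=${spec#*-}
            stem=$(printf '%s' "$first_name" | sed 's/[0-9]*$//')
            emit_range "$stem" "${first_name#"$stem"}" "${last_name#"$stem"}"
            ;;
        *)          # single name or already a comma list: pass through
            echo "$spec"
            ;;
    esac
}
```

For example, `echo "nodes $(expand_nodes 'n000[0-1]')"` produces the line
"nodes n0000,n0001", which is the form that works in bjs.conf.
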
>
> Now, to another question. If I submit a job using bjssub, it sets
> the environment variable "NODES" to a list of nodes on which the owner
> of the job can run work. In the old bproc it was just a list of node
> numbers, which you used in the bpsh command. Here I got a list of
> numbers that don't obviously correspond to anything... How would one
> run xrx with these? For example:
>
> [r...@dgk3 ~]# bjssub -n 2 -i -s 10000 /bin/bash
> Waiting for interactive job nodes.
> (nodes 0 6443568 6441632)
> Starting interactive job.
> NODES=6443568,6441632
> JOBID=0
>
> So what do these numbers correspond to? Typically in a batch
> environment you don't know in advance which nodes get assigned to you,
> so the script that runs the job must be told which nodes are yours to
> use. The same goes for MPI programs. Admittedly xmvapich still has
> some problems, but it takes a list of nodes too, in just the same way
> as xrx. Also, something like "xrx -a" should now look not at the
> total list of nodes as reported by statfs, but rather at the locally
> defined list from, for example, an environment variable. Could I
> suggest that the NODES variable be set to an "xcpu-aware" list of
> nodes, and that the command set (xrx, xmvapich, ...) consult it to
> resolve the "-a" option? It doesn't have to be NODES; any unique
> name would work.
>
> Thanks,
>
> Daniel
>
>
> On Sun, Dec 14, 2008 at 10:44 PM, Abhishek Kulkarni <[email protected]>
> wrote:
> >
> >
> > On Sun, Dec 14, 2008 at 8:29 PM, Daniel Gruner <[email protected]> wrote:
> >>
> >> On Sun, Dec 14, 2008 at 10:20 PM, Abhishek Kulkarni <[email protected]>
> >> wrote:
> >> >
> >> >
> >> > On Sun, Dec 14, 2008 at 7:04 PM, Daniel Gruner <[email protected]>
> >> > wrote:
> >> >>
> >> >> Hi Abhishek,
> >> >>
> >> >> Well, I compiled it and installed it (the Makefile needs work...), and
> >> >> it stays up as a daemon, but doesn't show any available nodes:
> >> >>
> >> >> [r...@dgk3 bjs]# bjsstat
> >> >> Pool: default Nodes (total/up/free): 0/0/0
> >> >> ID User Command Requirements
> >> >>
> >> >> Did you change anything in the format for the bjs.conf file?
> >> >
> >> > Yes I added an extra option (statfs) which can be specified as:
> >> >
> >> > statfs localhost!20003
> >> >
> >> > bjs would fetch the node information from statfs.
> >> > The 'nodes' parameter in bjs.conf remains, though: the intersection
> >> > of the two sets determines the total nodes for bjs.
> >>
> >> Well, here is my bjs.conf, and regardless of whether I specify the
> >> nodes line or not, bjsstat does not appear to show any active nodes.
> >> I have not modified statfs in any way, so the port 20003 should still
> >> be fine.
> >>
> >> # Sample BJS configuration file
> >> #
> >> # $Id: bjs.conf,v 1.10 2003/11/10 19:40:22 mkdist Exp $
> >>
> >> spooldir /var/spool/bjs
> >> policypath /usr/local/lib64/bjs:/usr/local/lib/bjs
> >> socketpath /tmp/.bjs
> >> #acctlog /tmp/acct.log
> >> statfsaddr localhost!20003
> >>
> >> pool default
> >> policy filler
> >> # nodes 0-1
> >> maxsecs 20000000
> >>
> >> I have tried this with the nodes line like:
> >>
> >> nodes n0000-n0001
> >
> > The nodes line is not optional. I would probably make it
> >
> > nodes n000[0-1] or
> > nodes n0000, n0001
> >
> > though what you specified should work too (I will check that out).
> >
> > And start bjs with the -v switch to get more verbose output.
> > Thanks.
> >
> >
> >>
> >> but it doesn't work either. xstat seems totally normal:
> >>
> >> [r...@dgk3 ~]# xstat
> >> n0000 tcp!10.10.0.10!6667 /Linux/x86_64 up 0
> >> n0001 tcp!10.10.0.11!6667 /Linux/x86_64 up 0
> >>
> >>
> >> Daniel
> >>
> >>
> >> >
> >> >>
> >> >> Daniel
> >> >>
> >> >>
> >> >> On Sun, Dec 14, 2008 at 10:10 AM, Abhishek Kulkarni
> >> >> <[email protected]>
> >> >> wrote:
> >> >> >
> >> >> >
> >> >> > On Sat, Dec 13, 2008 at 9:37 PM, Daniel Gruner <[email protected]>
> >> >> > wrote:
> >> >> >>
> >> >> >> Hi Abhishek,
> >> >> >>
> >> >> >> What is the status of your port of bjs? Is it part of the sxcpu
> >> >> >> tree (or pulled when one checks out from the sxcpu svn repository)?
> >> >> >> I'd really like to test it...
> >> >> >
> >> >> > Daniel,
> >> >> >
> >> >> > You probably missed the quick announcement, here it is again:
> >> >> >
> >> >> >
> >> >> >
> >> >> > http://groups.google.com/group/xcpu/browse_thread/thread/42ed613c72fe55ba#
> >> >> >
> >> >> > After syncing changes between the sxcpu and the xcpu2 tree, it could
> >> >> > be used for either.
> >> >> > Let me know how it works for you.
> >> >> > Thanks
> >> >> >
> >> >> >
> >> >> >>
> >> >> >> Daniel
> >> >> >>
> >> >> >>
> >> >> >> On Mon, Dec 8, 2008 at 3:08 PM, Abhishek Kulkarni
> >> >> >> <[email protected]>
> >> >> >> wrote:
> >> >> >> >
> >> >> >> > This patch makes bjs comply with the changed semantics of
> >> >> >> > xp_nodeset_list_by_state to obtain the down nodes from statfs.
> >> >> >> >
> >> >> >> > Signed-off-by: Abhishek Kulkarni <[email protected]>
> >> >> >> >
> >> >> >> > Index: bjs.c
> >> >> >> > ===================================================================
> >> >> >> > --- bjs.c (revision 746)
> >> >> >> > +++ bjs.c (working copy)
> >> >> >> > @@ -2481,19 +2481,7 @@
> >> >> >> >
> >> >> >> > if (r > 0) {
> >> >> >> > /* Check for machine status changes */
> >> >> >> > - /* TODO: Instead of jumping over these hoops, improve the
> >> >> >> > - way down nodes can be obtained from statfs */
> >> >> >> > -
> >> >> >> > - down_nodeset = xp_nodeset_list_by_state(conf.statfsaddr, "down(initializing)");
> >> >> >> > - xp_nodeset_append(down_nodeset,
> >> >> >> > - xp_nodeset_list_by_state(conf.statfsaddr, "down(disconnected)"));
> >> >> >> > - xp_nodeset_append(down_nodeset,
> >> >> >> > - xp_nodeset_list_by_state(conf.statfsaddr, "down(connect_failed)"));
> >> >> >> > - xp_nodeset_append(down_nodeset,
> >> >> >> > - xp_nodeset_list_by_state(conf.statfsaddr, "down(read_failed)"));
> >> >> >> > - xp_nodeset_append(down_nodeset,
> >> >> >> > - xp_nodeset_list_by_state(conf.statfsaddr, "down(no_contact)"));
> >> >> >> > -
> >> >> >> > + down_nodeset = xp_nodeset_list_by_state(conf.statfsaddr, 0);
> >> >> >> > if (down_nodeset->len != down_nodes) {
> >> >> >> > if (verbose) syslog(LOG_INFO, "XCPU cluster status change.");
> >> >> >> > chng = update_cluster_status(conf.statfsaddr);
> >> >> >> > @@ -2505,9 +2493,10 @@
> >> >> >> > p->policy->state_change(p);
> >> >> >> > }
> >> >> >> > }
> >> >> >> > + down_nodes = down_nodeset->len;
> >> >> >> > }
> >> >> >> > - down_nodes = down_nodeset->len;
> >> >> >> >
> >> >> >> > +
> >> >> >> > /* Check for new clients */
> >> >> >> > if (FD_ISSET(conf.client_sockfd, &rset))
> >> >> >> > client_accept();
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >
> >> >> >
> >> >
> >> >
> >
> >
>