More issues with bjs (apparently related to my previous note on the
available nodes and how to specify them to xrx from within the
submitted script or interactive batch session):

Say I get a single node using bjssub:

[da...@dgk3 ~]$ bjssub -n 1 -s 10000 -i /bin/bash
Waiting for interactive job nodes.
(nodes 3 6443568)
Starting interactive job.
NODES=6443568
JOBID=3

[da...@dgk3 ~]$ bjsstat
Pool: default   Nodes (total/up/free): 2/2/1
ID      User     Command                        Requirements
    3 R danny    (interactive)                  nodes=1 secs=10000

[da...@dgk3 ~]$ xgetent -a group
xgetent: n0001: Error 5: unknown user
xgetent: n0000: Error 5: error opening file grent
[da...@dgk3 ~]$ xgetent -a passwd
xgetent: n0001: Error 5: unknown user
xgetent: n0000: Error 0: (null)
[da...@dgk3 ~]$ xrx -a date
Error: unknown user
[da...@dgk3 ~]$ xrx n0000 date
Error: Invalid argument:0.98.82.48
[da...@dgk3 ~]$ xrx n0001 date
Error: Invalid argument:0.98.82.48
[da...@dgk3 ~]$ xrx date
Error: Invalid argument:0.98.82.48

So, what you see here is that a few things are not quite working...
I am not sure how the permissions on the nodes are set up.  The node
assigned to me should have only my user and group defined, so that I
"own" the node (apart from root and xcpu-admin, so that maintenance
can still be done by root).  There also has to be a way to reach the
assigned node(s) with xrx, or with whatever else one decides to run
inside the submitted script.
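
For illustration, here is roughly what I would expect to be able to
do from inside a submitted script once the assigned nodes are
reachable.  This is only a sketch of the intended workflow, and it
assumes NODES ends up holding a comma-separated list of xcpu node
names (n0000, n0001, ...) rather than the opaque numbers shown above:

#!/bin/bash
# hypothetical job script: run a command on each node that bjs
# assigned to this job; assumes NODES=n0000,n0001 as exported by
# bjssub (not what it exports today)
for node in ${NODES//,/ }; do
    xrx $node date      # run 'date' on the assigned node via xrx
done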

I have been using bjs for many years, and it is a wonderful little
scheduler, so I have strong feelings about how it should work... :-)
It seems there are only a few kinks left to iron out.  Great work,
Abhishek!

Daniel


On 12/15/08, Daniel Gruner <[email protected]> wrote:
> Hi Abhishek,
>
>  Ok, so if I use
>
>  nodes n0000,n0001
>
>  it works.  The other two forms
>
>  nodes n000[0-1]
>  nodes n0000-n0001
>
>  do NOT work.  It appears that your current bjs only accepts a
>  comma-delimited list of nodes,
>  and it is not yet clear to me that wildcards will work.
>
>  Now, to another question.  If I submit a job using bjssub, it sets up
>  the environment variable "NODES" to contain a list of nodes that can
>  be used by the owner of the job to submit jobs to.  In the old bproc
>  it was just a list of numbers, which you used in the bpsh command.
>  Here I got a list of numbers that don't necessarily make sense...  How
>  would one run xrx with these?  For example:
>
>  [r...@dgk3 ~]# bjssub -n 2 -i -s 10000 /bin/bash
>  Waiting for interactive job nodes.
>  (nodes 0 6443568 6441632)
>  Starting interactive job.
>  NODES=6443568,6441632
>  JOBID=0
>
>  so what do these numbers correspond to?  Typically in a batch
>  environment you don't know which nodes get assigned to you, so the
>  script you use to run the jobs must be told which nodes are yours to
>  use.  Similarly for mpi programs.  Admittedly xmvapich still has some
>  problems, but it runs with a list of nodes too, in just the same way
>  as xrx.  Also, doing something like "xrx -a" should now look not at
>  the total list of nodes as defined by statfs, but rather the locally
>  defined list from, for example, an environment variable.  Could I
>  suggest that the NODES variable be set to an "xcpu-aware" list of
>  nodes, and then that the command set (xrx, xmvapich,...) look at it
>  for resolution of the "-a" option?  It doesn't have to be NODES, but
>  something unique could work.
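>
>  As a rough sketch of what I have in mind (the variable name and the
>  comma-separated format here are only my assumption, not current bjs
>  behaviour):
>
>  # hypothetical: bjssub exports NODES=n0000,n0001
>  for node in ${NODES//,/ }; do
>      xrx $node hostname    # address each assigned node by name
>  done
>  xrx -a date               # '-a' resolved against $NODES, not all of statfs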
>
>  Thanks,
>
> Daniel
>
>
>  On Sun, Dec 14, 2008 at 10:44 PM, Abhishek Kulkarni <[email protected]> wrote:
>  >
>  >
>  > On Sun, Dec 14, 2008 at 8:29 PM, Daniel Gruner <[email protected]> wrote:
>  >>
>  >> On Sun, Dec 14, 2008 at 10:20 PM, Abhishek Kulkarni <[email protected]>
>  >> wrote:
>  >> >
>  >> >
>  >> > On Sun, Dec 14, 2008 at 7:04 PM, Daniel Gruner <[email protected]>
>  >> > wrote:
>  >> >>
>  >> >> Hi Abhishek,
>  >> >>
>  >> >> Well, I compiled it and installed it (the Makefile needs work...), and
>  >> >> it stays up as a daemon, but doesn't show any available nodes:
>  >> >>
>  >> >> [r...@dgk3 bjs]# bjsstat
>  >> >> Pool: default   Nodes (total/up/free): 0/0/0
>  >> >> ID      User     Command                        Requirements
>  >> >>
>  >> >> Did you change anything in the format for the bjs.conf file?
>  >> >
>  >> > Yes I added an extra option (statfs) which can be specified as:
>  >> >
>  >> > statfs      localhost!20003
>  >> >
>  >> > bjs would fetch the node information from statfs.
>  >> > Although, the 'nodes' parameter in bjs.conf remains -- an intersection
>  >> > set
>  >> > of the two dictates the total nodes for bjs.
>  >>
>  >> Well, here is my bjs.conf, and regardless of whether I specify the
>  >> nodes line or not, bjsstat does not appear to show any active nodes.
>  >> I have not modified statfs in any way, so the port 20003 should still
>  >> be fine.
>  >>
>  >> # Sample BJS configuration file
>  >> #
>  >> # $Id: bjs.conf,v 1.10 2003/11/10 19:40:22 mkdist Exp $
>  >>
>  >> spooldir   /var/spool/bjs
>  >> policypath /usr/local/lib64/bjs:/usr/local/lib/bjs
>  >> socketpath /tmp/.bjs
>  >> #acctlog   /tmp/acct.log
>  >> statfsaddr localhost!20003
>  >>
>  >> pool default
>  >>        policy filler
>  >> #        nodes  0-1
>  >>        maxsecs 20000000
>  >>
>  >> I have tried this with the nodes line like:
>  >>
>  >> nodes n0000-n0001
>  >
>  > The nodes line is not optional. I would probably make it
>  >
>  > nodes n000[0-1] or
>  > nodes n0000, n0001
>  >
>  > though what you specified should work too (I will check that out).
>  >
>  > And spawn bjs with -v switch to get a more verbose output.
>  > Thanks.
>  >
>  >
>  >>
>  >> but it doesn't work either.  xstat seems totally normal:
>  >>
>  >> [r...@dgk3 ~]# xstat
>  >> n0000   tcp!10.10.0.10!6667     /Linux/x86_64   up      0
>  >> n0001   tcp!10.10.0.11!6667     /Linux/x86_64   up      0
>  >>
>  >>
>  >> Daniel
>  >>
>  >>
>  >> >
>  >> >>
>  >> >> Daniel
>  >> >>
>  >> >>
>  >> >> On Sun, Dec 14, 2008 at 10:10 AM, Abhishek Kulkarni
>  >> >> <[email protected]>
>  >> >> wrote:
>  >> >> >
>  >> >> >
>  >> >> > On Sat, Dec 13, 2008 at 9:37 PM, Daniel Gruner <[email protected]>
>  >> >> > wrote:
>  >> >> >>
>  >> >> >> Hi Abhishek,
>  >> >> >>
>  >> >> >> What is the status of your port of bjs?  Is it part of the sxcpu
>  >> >> >> tree
>  >> >> >> (or pulled when one checks out from the sxcpu svn repository)?  I'd
>  >> >> >> really like to test it...
>  >> >> >
>  >> >> > Daniel,
>  >> >> >
>  >> >> > You probably missed the quick announcement, here it is again:
>  >> >> >
>  >> >> >
>  >> >> > 
> http://groups.google.com/group/xcpu/browse_thread/thread/42ed613c72fe55ba#
>  >> >> >
>  >> >> > After syncing changes between the sxcpu and the xcpu2 tree, it could
>  >> >> > be
>  >> >> > used
>  >> >> > for either.
>  >> >> > Let me know how it works for you.
>  >> >> > Thanks
>  >> >> >
>  >> >> >
>  >> >> >>
>  >> >> >> Daniel
>  >> >> >>
>  >> >> >>
>  >> >> >> On Mon, Dec 8, 2008 at 3:08 PM, Abhishek Kulkarni
>  >> >> >> <[email protected]>
>  >> >> >> wrote:
>  >> >> >> >
>  >> >> >> > This patch makes bjs comply with the changed semantics of
>  >> >> >> > xp_nodeset_list_by_state to obtain the down nodes from statfs.
>  >> >> >> >
>  >> >> >> > Signed-off-by: Abhishek Kulkarni <[email protected]>
>  >> >> >> >
>  >> >> >> > Index: bjs.c
>  >> >> >> >
>  >> >> >> > ===================================================================
>  >> >> >> > --- bjs.c       (revision 746)
>  >> >> >> > +++ bjs.c       (working copy)
>  >> >> >> > @@ -2481,19 +2481,7 @@
>  >> >> >> >
>  >> >> >> >        if (r > 0) {
>  >> >> >> >            /* Check for machine status changes */
>  >> >> >> > -           /* TODO: Instead of jumping over these hoops, improve the
>  >> >> >> > -              way down nodes can be obtained from statfs */
>  >> >> >> > -
>  >> >> >> > -           down_nodeset = xp_nodeset_list_by_state(conf.statfsaddr, "down(initializing)");
>  >> >> >> > -           xp_nodeset_append(down_nodeset,
>  >> >> >> > -                             xp_nodeset_list_by_state(conf.statfsaddr, "down(disconnected)"));
>  >> >> >> > -           xp_nodeset_append(down_nodeset,
>  >> >> >> > -                             xp_nodeset_list_by_state(conf.statfsaddr, "down(connect_failed)"));
>  >> >> >> > -           xp_nodeset_append(down_nodeset,
>  >> >> >> > -                             xp_nodeset_list_by_state(conf.statfsaddr, "down(read_failed)"));
>  >> >> >> > -           xp_nodeset_append(down_nodeset,
>  >> >> >> > -                             xp_nodeset_list_by_state(conf.statfsaddr, "down(no_contact)"));
>  >> >> >> > -
>  >> >> >> > +           down_nodeset = xp_nodeset_list_by_state(conf.statfsaddr, 0);
>  >> >> >> >            if (down_nodeset->len != down_nodes) {
>  >> >> >> >                if (verbose) syslog(LOG_INFO, "XCPU cluster status change.");
>  >> >> >> >                chng = update_cluster_status(conf.statfsaddr);
>  >> >> >> > @@ -2505,9 +2493,10 @@
>  >> >> >> >                            p->policy->state_change(p);
>  >> >> >> >                    }
>  >> >> >> >                }
>  >> >> >> > +               down_nodes = down_nodeset->len;
>  >> >> >> >            }
>  >> >> >> > -           down_nodes = down_nodeset->len;
>  >> >> >> >
>  >> >> >> > +
>  >> >> >> >            /* Check for new clients */
>  >> >> >> >            if (FD_ISSET(conf.client_sockfd, &rset))
>  >> >> >> >                client_accept();
>  >> >> >> >
>  >> >> >> >
>  >> >> >> >
>  >> >> >
>  >> >> >
>  >> >
>  >> >
>  >
>  >
>
