On Mon, Dec 15, 2008 at 6:17 AM, Daniel Gruner <[email protected]> wrote:

>
> Hi Abhishek,
>
> Ok, so if I use
>
> nodes n0000,n0001
>
> it works.  The other two forms
>
> nodes n000[0-1]
> nodes n0000-n0001
>
> do NOT work.


No, n000[0-1] works. n0000-n0001 won't work. BJS supports node ranges in the
same form as the other xcpu utils (xrx, xgroupset, etc.). A range must be
enclosed in [ ].
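
To make the bracket form concrete, here is a minimal sketch in C of how a
spec like n000[0-1] expands into node names. It is an illustration only, not
the parser that bjs or the other xcpu utils use: expand_node_range and
NODE_NAME_LEN are hypothetical names, and keeping the zero-padded width of
the range bounds is an assumption.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NODE_NAME_LEN 64   /* hypothetical limit on a node name, incl. NUL */

/*
 * Expand a bracketed node range such as "n000[0-1]" into individual names.
 * Hypothetical helper for illustration; not taken from bjs or libxcpu.
 * Writes at most max names into out and returns how many were written,
 * or -1 if the spec is malformed or a name would not fit.
 */
int expand_node_range(const char *spec, char out[][NODE_NAME_LEN], int max)
{
    const char *lb = strchr(spec, '[');
    const char *rb, *dash;
    char *end;
    long lo, hi, i;
    int width, n = 0;

    if (lb == NULL) {
        /* No brackets: treat the spec as one literal node name. */
        if (max < 1 || spec[0] == '\0' || strlen(spec) >= NODE_NAME_LEN)
            return -1;
        strcpy(out[0], spec);
        return 1;
    }

    rb = strchr(lb, ']');
    dash = strchr(lb, '-');
    if (rb == NULL || dash == NULL || dash > rb)
        return -1;

    /* Parse "lo-hi" between the brackets; both bounds must be present. */
    lo = strtol(lb + 1, &end, 10);
    if (end != dash || end == lb + 1)
        return -1;
    hi = strtol(dash + 1, &end, 10);
    if (end != rb || end == dash + 1 || hi < lo)
        return -1;

    /* Assumption: pad each number to the width of the lower bound. */
    width = (int)(dash - (lb + 1));

    for (i = lo; i <= hi && n < max; i++) {
        int len = snprintf(out[n], NODE_NAME_LEN, "%.*s%0*ld%s",
                           (int)(lb - spec), spec, width, i, rb + 1);
        if (len < 0 || len >= NODE_NAME_LEN)
            return -1;
        n++;
    }
    return n;
}
```

Called with "n000[0-1]" this fills the array with n0000 and n0001 and
returns 2. A spec without brackets is passed through as a single literal
name, so "n0000-n0001" is not expanded as a range.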



> It appears that your current bjs only accepts a
> comma-delimited list of nodes,
> and it is not yet clear to me that wildcards will work.


Wildcards don't work for any of the other utils either. I have some changes
ready that separate out the node range parsing, but they're mostly untested.
Wildcards can be added if there's a compelling reason for them.


>
>
> Now, to another question.  If I submit a job using bjssub, it sets up
> the environment variable "NODES" to contain a list of nodes that can
> be used by the owner of the job to submit jobs to.  In the old bproc
> it was just a list of numbers, which you used in the bpsh command.
> Here I got a list of numbers that don't necessarily make sense...  How
> would one run xrx with these?


You can't, because the numbers don't make sense. It's a bug. You should see
something like:
"NODES=n0000, n0001"

I can reproduce the bug in interactive mode. It probably went untested;
I'll look into it.
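
Once NODES holds node names, a wrapper could split it into a node list for
xrx or xmvapich along these lines. This is a minimal sketch in C, assuming a
comma-separated list with optional spaces as in the example above;
split_node_list is a hypothetical helper, not part of bjs or xcpu.

```c
#include <ctype.h>
#include <string.h>

/*
 * Split a comma-separated node list such as "n0000, n0001" in place.
 * Hypothetical helper for illustration; not taken from bjs or xcpu.
 * Stores pointers into list in names[] (at most max of them), trims
 * surrounding whitespace, skips empty entries, and returns the count.
 */
int split_node_list(char *list, char *names[], int max)
{
    int n = 0;
    char *tok = strtok(list, ",");

    while (tok != NULL && n < max) {
        char *end;

        /* Trim leading whitespace. */
        while (isspace((unsigned char)*tok))
            tok++;

        /* Trim trailing whitespace. */
        end = tok + strlen(tok);
        while (end > tok && isspace((unsigned char)end[-1]))
            *--end = '\0';

        if (*tok != '\0')
            names[n++] = tok;

        tok = strtok(NULL, ",");
    }
    return n;
}
```

The caller would copy the value of getenv("NODES") into a writable buffer
first, since the helper modifies the string it is given.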



>  For example:
>
> [r...@dgk3 ~]# bjssub -n 2 -i -s 10000 /bin/bash
> Waiting for interactive job nodes.
> (nodes 0 6443568 6441632)
> Starting interactive job.
> NODES=6443568,6441632
> JOBID=0
>
> so what do these numbers correspond to?  Typically in a batch
> environment you don't know which nodes get assigned to you, so the
> script you use to run the jobs must be told which nodes are yours to
> use.  Similarly for mpi programs.  Admittedly xmvapich still has some
> problems, but it runs with a list of nodes too, in just the same way
> as xrx.  Also, doing something like "xrx -a" should now look not at
> the total list of nodes as defined by statfs, but rather the locally
> defined list from, for example, an environment variable.  Could I
> suggest that the NODES variable be set to an "xcpu-aware" list of
> nodes, and then that the command set (xrx, xmvapich,...) look at it
> for resolution of the "-a" option?  It doesn't have to be NODES, but
> something unique could work.
>
> Thanks,
> Daniel
>
> On Sun, Dec 14, 2008 at 10:44 PM, Abhishek Kulkarni <[email protected]>
> wrote:
> >
> >
> > On Sun, Dec 14, 2008 at 8:29 PM, Daniel Gruner <[email protected]> wrote:
> >>
> >> On Sun, Dec 14, 2008 at 10:20 PM, Abhishek Kulkarni <[email protected]>
> >> wrote:
> >> >
> >> >
> >> > On Sun, Dec 14, 2008 at 7:04 PM, Daniel Gruner <[email protected]>
> >> > wrote:
> >> >>
> >> >> Hi Abhishek,
> >> >>
> >> >> Well, I compiled it and installed it (the Makefile needs work...), and
> >> >> it stays up as a daemon, but doesn't show any available nodes:
> >> >>
> >> >> [r...@dgk3 bjs]# bjsstat
> >> >> Pool: default   Nodes (total/up/free): 0/0/0
> >> >> ID      User     Command                        Requirements
> >> >>
> >> >> Did you change anything in the format for the bjs.conf file?
> >> >
> >> > Yes I added an extra option (statfs) which can be specified as:
> >> >
> >> > statfs      localhost!20003
> >> >
> >> > bjs would fetch the node information from statfs.
> >> > Although, the 'nodes' parameter in bjs.conf remains -- an intersection set
> >> > of the two dictates the total nodes for bjs.
> >>
> >> Well, here is my bjs.conf, and regardless of whether I specify the
> >> nodes line or not, bjsstat does not appear to show any active nodes.
> >> I have not modified statfs in any way, so the port 20003 should still
> >> be fine.
> >>
> >> # Sample BJS configuration file
> >> #
> >> # $Id: bjs.conf,v 1.10 2003/11/10 19:40:22 mkdist Exp $
> >>
> >> spooldir   /var/spool/bjs
> >> policypath /usr/local/lib64/bjs:/usr/local/lib/bjs
> >> socketpath /tmp/.bjs
> >> #acctlog   /tmp/acct.log
> >> statfsaddr localhost!20003
> >>
> >> pool default
> >>        policy filler
> >> #        nodes  0-1
> >>        maxsecs 20000000
> >>
> >> I have tried this with the nodes line like:
> >>
> >> nodes n0000-n0001
> >
> > The nodes line is not optional. I would probably make it
> >
> > nodes n000[0-1] or
> > nodes n0000, n0001
> >
> > though what you specified should work too (I will check that out).
> >
> > And spawn bjs with -v switch to get a more verbose output.
> > Thanks.
> >
> >
> >>
> >> but it doesn't work either.  xstat seems totally normal:
> >>
> >> [r...@dgk3 ~]# xstat
> >> n0000   tcp!10.10.0.10!6667     /Linux/x86_64   up      0
> >> n0001   tcp!10.10.0.11!6667     /Linux/x86_64   up      0
> >>
> >>
> >> Daniel
> >>
> >>
> >> >
> >> >>
> >> >> Daniel
> >> >>
> >> >>
> >> >> On Sun, Dec 14, 2008 at 10:10 AM, Abhishek Kulkarni
> >> >> <[email protected]>
> >> >> wrote:
> >> >> >
> >> >> >
> >> >> > On Sat, Dec 13, 2008 at 9:37 PM, Daniel Gruner <[email protected]>
> >> >> > wrote:
> >> >> >>
> >> >> >> Hi Abhishek,
> >> >> >>
> >> >> >> What is the status of your port of bjs?  Is it part of the sxcpu tree
> >> >> >> (or pulled when one checks out from the sxcpu svn repository)?  I'd
> >> >> >> really like to test it...
> >> >> >
> >> >> > Daniel,
> >> >> >
> >> >> > You probably missed the quick announcement, here it is again:
> >> >> >
> >> >> >
> >> >> >
> >> >> > http://groups.google.com/group/xcpu/browse_thread/thread/42ed613c72fe55ba#
> >> >> >
> >> >> > After syncing changes between the sxcpu and the xcpu2 tree, it could be
> >> >> > used for either.
> >> >> > Let me know how it works for you.
> >> >> > Thanks
> >> >> >
> >> >> >
> >> >> >>
> >> >> >> Daniel
> >> >> >>
> >> >> >>
> >> >> >> On Mon, Dec 8, 2008 at 3:08 PM, Abhishek Kulkarni
> >> >> >> <[email protected]>
> >> >> >> wrote:
> >> >> >> >
> >> >> >> > This patch makes bjs comply with the changed semantics of
> >> >> >> > xp_nodeset_list_by_state to obtain the down nodes from statfs.
> >> >> >> >
> >> >> >> > Signed-off-by: Abhishek Kulkarni <[email protected]>
> >> >> >> >
> >> >> >> > Index: bjs.c
> >> >> >> > ===================================================================
> >> >> >> > --- bjs.c       (revision 746)
> >> >> >> > +++ bjs.c       (working copy)
> >> >> >> > @@ -2481,19 +2481,7 @@
> >> >> >> >
> >> >> >> >        if (r > 0) {
> >> >> >> >            /* Check for machine status changes */
> >> >> >> > -           /* TODO: Instead of jumping over these hoops, improve the
> >> >> >> > -              way down nodes can be obtained from statfs */
> >> >> >> > -
> >> >> >> > -           down_nodeset = xp_nodeset_list_by_state(conf.statfsaddr, "down(initializing)");
> >> >> >> > -           xp_nodeset_append(down_nodeset,
> >> >> >> > -                   xp_nodeset_list_by_state(conf.statfsaddr, "down(disconnected)"));
> >> >> >> > -           xp_nodeset_append(down_nodeset,
> >> >> >> > -                   xp_nodeset_list_by_state(conf.statfsaddr, "down(connect_failed)"));
> >> >> >> > -           xp_nodeset_append(down_nodeset,
> >> >> >> > -                   xp_nodeset_list_by_state(conf.statfsaddr, "down(read_failed)"));
> >> >> >> > -           xp_nodeset_append(down_nodeset,
> >> >> >> > -                   xp_nodeset_list_by_state(conf.statfsaddr, "down(no_contact)"));
> >> >> >> > -
> >> >> >> > +           down_nodeset = xp_nodeset_list_by_state(conf.statfsaddr, 0);
> >> >> >> >            if (down_nodeset->len != down_nodes) {
> >> >> >> >                if (verbose) syslog(LOG_INFO, "XCPU cluster status change.");
> >> >> >> >                chng = update_cluster_status(conf.statfsaddr);
> >> >> >> > @@ -2505,9 +2493,10 @@
> >> >> >> >                            p->policy->state_change(p);
> >> >> >> >                    }
> >> >> >> >                }
> >> >> >> > +               down_nodes = down_nodeset->len;
> >> >> >> >            }
> >> >> >> > -           down_nodes = down_nodeset->len;
> >> >> >> >
> >> >> >> > +
> >> >> >> >            /* Check for new clients */
> >> >> >> >            if (FD_ISSET(conf.client_sockfd, &rset))
> >> >> >> >                client_accept();
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >
> >> >> >
> >> >
> >> >
> >
> >
>
