Please see inline:

On Mon, Dec 15, 2008 at 7:12 PM, Abhishek Kulkarni <[email protected]> wrote:
>
>
> On Mon, Dec 15, 2008 at 6:17 AM, Daniel Gruner <[email protected]> wrote:
>>
>> Hi Abhishek,
>>
>> Ok, so if I use
>>
>> nodes n0000,n0001
>>
>> it works.  The other two forms
>>
>> nodes n000[0-1]
>> nodes n0000-n0001
>>
>> do NOT work.
>
> No, n000[0-1] works. n0000-n0001 won't work. BJS supports node ranges
> similar to all the other xcpu utils (xrx, xgroupset, etc.). It is mandatory to
> supply the range enclosed in [ ].
>

Well, it doesn't work for me:

[r...@dgk3 ~]# /usr/local/etc/init.d/bjs start
Starting bjs: /usr/local/sbin/bjs: /etc/xcpu/bjs.conf:13: Invalid node specification: n000[0-1]
/usr/local/sbin/bjs: Configuration load failed.  Exiting.
                                                           [FAILED]

The only way it works for me is specifying the comma-separated list of
nodes.  A parsing bug perhaps?
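
To make the expected behaviour concrete, here is a minimal sketch in C of the
expansion I would expect for a bracketed range such as n000[0-1]. This is not
the bjs or libxcpu parser; expand_node_range is a hypothetical name, and
keeping the zero padding of the lower bound is an assumption on my part.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not taken from bjs or libxcpu: expand a spec such as
 * "n000[0-1]" into the comma-separated list "n0000,n0001", written to out.
 * A spec without '[' is copied through as a single literal name.
 * Returns the number of names written, or -1 on a malformed spec or if
 * out is too small.
 */
static int expand_node_range(const char *spec, char *out, size_t outlen)
{
    assert(spec != NULL && out != NULL && outlen > 0);

    const char *open = strchr(spec, '[');
    if (open == NULL) {
        int n = snprintf(out, outlen, "%s", spec);
        return (n < 0 || (size_t)n >= outlen) ? -1 : 1;
    }

    const char *dash = strchr(open, '-');
    const char *close = strchr(open, ']');
    if (dash == NULL || close == NULL || dash > close || close[1] != '\0')
        return -1;
    if (dash == open + 1 || close == dash + 1)
        return -1; /* empty lower or upper bound */

    char *end;
    long lo = strtol(open + 1, &end, 10);
    if (end != dash)
        return -1;
    long hi = strtol(dash + 1, &end, 10);
    if (end != close || lo < 0 || hi < lo)
        return -1;

    int prefix_len = (int)(open - spec);
    int width = (int)(dash - open - 1); /* digits in the lower bound */
    size_t used = 0;
    int count = 0;

    out[0] = '\0';
    for (long i = lo; i <= hi; i++) {
        /* Emit: optional comma, the prefix before '[', zero-padded index. */
        int n = snprintf(out + used, outlen - used, "%s%.*s%0*ld",
                         count ? "," : "", prefix_len, spec, width, i);
        if (n < 0 || (size_t)n >= outlen - used)
            return -1;
        used += (size_t)n;
        count++;
    }
    return count;
}
```

With something along these lines, n000[0-1] and the comma-separated form
would describe the same two nodes, which is what I expected from the config.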

>
>>
>> It appears that your current bjs only accepts a comma-delimited list
>> of nodes, and it is not yet clear to me that wildcards will work.
>
> Wildcards don't work for any other utils either. I have some changes ready
> that separate out the node range parsing, but it's mostly untested.
> Wildcards can be done if there's a compelling reason for it.
>
>>
>> Now, to another question.  If I submit a job using bjssub, it sets up
>> the environment variable "NODES" to contain a list of nodes that can
>> be used by the owner of the job to submit jobs to.  In the old bproc
>> it was just a list of numbers, which you used in the bpsh command.
>> Here I got a list of numbers that don't necessarily make sense...  How
>> would one run xrx with these?
>
> You can't. Because the numbers don't make sense. It's a bug. You should see
> it as:
> "NODES=n0000, n0001"
>

Ok, that makes sense.  Do you mean that it works in batch
(non-interactive) mode?
I always test things interactively first, to make sure they work, and
then put them in scripts.
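
As an illustration of how a script or tool could consume a NODES value of the
form "n0000, n0001", here is a small sketch in C. The names split_node_list
and MAX_NAME are hypothetical and nothing here comes from xrx or bjs; it only
shows splitting a comma-separated list with optional spaces.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define MAX_NAME 64 /* assumed upper bound on a node name, including NUL */

/*
 * Hypothetical helper: split a list such as "n0000, n0001" into names[].
 * Returns the number of names found, or -1 if there are more than
 * max_names entries or a name does not fit in MAX_NAME bytes.
 */
static int split_node_list(const char *list, char names[][MAX_NAME],
                           int max_names)
{
    assert(list != NULL && names != NULL);

    int count = 0;
    const char *p = list;

    while (*p != '\0') {
        /* Skip commas and whitespace between names. */
        while (*p == ',' || isspace((unsigned char)*p))
            p++;
        if (*p == '\0')
            break;

        /* A name runs up to the next comma, space, tab or newline. */
        size_t len = strcspn(p, ", \t\n");
        if (count >= max_names || len >= MAX_NAME)
            return -1;

        memcpy(names[count], p, len);
        names[count][len] = '\0';
        count++;
        p += len;
    }
    return count;
}
```

A caller would pass in the string returned by getenv("NODES") after checking
it for NULL, and then hand the resulting names to whatever launches the job.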


> I can reproduce the bug with the interactive mode. It probably went
> untested, I'll look into it.
>
>
>>
>>  For example:
>>
>> [r...@dgk3 ~]# bjssub -n 2 -i -s 10000 /bin/bash
>> Waiting for interactive job nodes.
>> (nodes 0 6443568 6441632)
>> Starting interactive job.
>> NODES=6443568,6441632
>> JOBID=0
>>
>> so what do these numbers correspond to?  Typically in a batch
>> environment you don't know which nodes get assigned to you, so the
>> script you use to run the jobs must be told which nodes are yours to
>> use.  Similarly for mpi programs.  Admittedly xmvapich still has some
>> problems, but it runs with a list of nodes too, in just the same way
>> as xrx.  Also, doing something like "xrx -a" should now look not at
>> the total list of nodes as defined by statfs, but rather the locally
>> defined list from, for example, an environment variable.  Could I
>> suggest that the NODES variable be set to an "xcpu-aware" list of
>> nodes, and then that the command set (xrx, xmvapich,...) look at it
>> for resolution of the "-a" option?  It doesn't have to be NODES, but
>> something unique could work.
>>
>> Thanks,
>> Daniel
>>
>> On Sun, Dec 14, 2008 at 10:44 PM, Abhishek Kulkarni <[email protected]>
>> wrote:
>> >
>> >
>> > On Sun, Dec 14, 2008 at 8:29 PM, Daniel Gruner <[email protected]>
>> > wrote:
>> >>
>> >> On Sun, Dec 14, 2008 at 10:20 PM, Abhishek Kulkarni
>> >> <[email protected]>
>> >> wrote:
>> >> >
>> >> >
>> >> > On Sun, Dec 14, 2008 at 7:04 PM, Daniel Gruner <[email protected]>
>> >> > wrote:
>> >> >>
>> >> >> Hi Abhishek,
>> >> >>
>> >> >> Well, I compiled it and installed it (the Makefile needs work...), and
>> >> >> it stays up as a daemon, but doesn't show any available nodes:
>> >> >>
>> >> >> [r...@dgk3 bjs]# bjsstat
>> >> >> Pool: default   Nodes (total/up/free): 0/0/0
>> >> >> ID      User     Command                        Requirements
>> >> >>
>> >> >> Did you change anything in the format for the bjs.conf file?
>> >> >
>> >> > Yes I added an extra option (statfs) which can be specified as:
>> >> >
>> >> > statfs      localhost!20003
>> >> >
>> >> > bjs would fetch the node information from statfs.
>> >> > The 'nodes' parameter in bjs.conf remains, though: the intersection
>> >> > of the two sets determines the nodes available to bjs.
>> >>
>> >> Well, here is my bjs.conf, and regardless of whether I specify the
>> >> nodes line or not, bjsstat does not appear to show any active nodes.
>> >> I have not modified statfs in any way, so the port 20003 should still
>> >> be fine.
>> >>
>> >> # Sample BJS configuration file
>> >> #
>> >> # $Id: bjs.conf,v 1.10 2003/11/10 19:40:22 mkdist Exp $
>> >>
>> >> spooldir   /var/spool/bjs
>> >> policypath /usr/local/lib64/bjs:/usr/local/lib/bjs
>> >> socketpath /tmp/.bjs
>> >> #acctlog   /tmp/acct.log
>> >> statfsaddr localhost!20003
>> >>
>> >> pool default
>> >>        policy filler
>> >> #        nodes  0-1
>> >>        maxsecs 20000000
>> >>
>> >> I have tried this with the nodes line like:
>> >>
>> >> nodes n0000-n0001
>> >
>> > The nodes line is not optional. I would probably make it
>> >
>> > nodes n000[0-1] or
>> > nodes n0000, n0001
>> >
>> > though what you specified should work too (I will check that out).
>> >
>> > And spawn bjs with -v switch to get a more verbose output.
>> > Thanks.
>> >
>> >
>> >>
>> >> but it doesn't work either.  xstat seems totally normal:
>> >>
>> >> [r...@dgk3 ~]# xstat
>> >> n0000   tcp!10.10.0.10!6667     /Linux/x86_64   up      0
>> >> n0001   tcp!10.10.0.11!6667     /Linux/x86_64   up      0
>> >>
>> >>
>> >> Daniel
>> >>
>> >>
>> >> >
>> >> >>
>> >> >> Daniel
>> >> >>
>> >> >>
>> >> >> On Sun, Dec 14, 2008 at 10:10 AM, Abhishek Kulkarni
>> >> >> <[email protected]>
>> >> >> wrote:
>> >> >> >
>> >> >> >
>> >> >> > On Sat, Dec 13, 2008 at 9:37 PM, Daniel Gruner <[email protected]>
>> >> >> > wrote:
>> >> >> >>
>> >> >> >> Hi Abhishek,
>> >> >> >>
>> >> >> >> What is the status of your port of bjs?  Is it part of the sxcpu tree
>> >> >> >> (or pulled when one checks out from the sxcpu svn repository)?  I'd
>> >> >> >> really like to test it...
>> >> >> >
>> >> >> > Daniel,
>> >> >> >
>> >> >> > You probably missed the quick announcement, here it is again:
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > http://groups.google.com/group/xcpu/browse_thread/thread/42ed613c72fe55ba#
>> >> >> >
>> >> >> > After syncing changes between the sxcpu and the xcpu2 tree, it could be
>> >> >> > used for either.
>> >> >> > Let me know how it works for you.
>> >> >> > Thanks
>> >> >> >
>> >> >> >
>> >> >> >>
>> >> >> >> Daniel
>> >> >> >>
>> >> >> >>
>> >> >> >> On Mon, Dec 8, 2008 at 3:08 PM, Abhishek Kulkarni
>> >> >> >> <[email protected]>
>> >> >> >> wrote:
>> >> >> >> >
>> >> >> >> > This patch makes bjs comply with the changed semantics of
>> >> >> >> > xp_nodeset_list_by_state to obtain the down nodes from statfs.
>> >> >> >> >
>> >> >> >> > Signed-off-by: Abhishek Kulkarni <[email protected]>
>> >> >> >> >
>> >> >> >> > Index: bjs.c
>> >> >> >> > ===================================================================
>> >> >> >> > --- bjs.c       (revision 746)
>> >> >> >> > +++ bjs.c       (working copy)
>> >> >> >> > @@ -2481,19 +2481,7 @@
>> >> >> >> >
>> >> >> >> >        if (r > 0) {
>> >> >> >> >            /* Check for machine status changes */
>> >> >> >> > -           /* TODO: Instead of jumping over these hoops, improve the
>> >> >> >> > -              way down nodes can be obtained from statfs */
>> >> >> >> > -
>> >> >> >> > -           down_nodeset = xp_nodeset_list_by_state(conf.statfsaddr, "down(initializing)");
>> >> >> >> > -           xp_nodeset_append(down_nodeset,
>> >> >> >> > -               xp_nodeset_list_by_state(conf.statfsaddr, "down(disconnected)"));
>> >> >> >> > -           xp_nodeset_append(down_nodeset,
>> >> >> >> > -               xp_nodeset_list_by_state(conf.statfsaddr, "down(connect_failed)"));
>> >> >> >> > -           xp_nodeset_append(down_nodeset,
>> >> >> >> > -               xp_nodeset_list_by_state(conf.statfsaddr, "down(read_failed)"));
>> >> >> >> > -           xp_nodeset_append(down_nodeset,
>> >> >> >> > -               xp_nodeset_list_by_state(conf.statfsaddr, "down(no_contact)"));
>> >> >> >> > -
>> >> >> >> > +           down_nodeset = xp_nodeset_list_by_state(conf.statfsaddr, 0);
>> >> >> >> >            if (down_nodeset->len != down_nodes) {
>> >> >> >> >                if (verbose) syslog(LOG_INFO, "XCPU cluster status change.");
>> >> >> >> >                chng = update_cluster_status(conf.statfsaddr);
>> >> >> >> > @@ -2505,9 +2493,10 @@
>> >> >> >> >                            p->policy->state_change(p);
>> >> >> >> >                    }
>> >> >> >> >                }
>> >> >> >> > +               down_nodes = down_nodeset->len;
>> >> >> >> >            }
>> >> >> >> > -           down_nodes = down_nodeset->len;
>> >> >> >> >
>> >> >> >> > +
>> >> >> >> >            /* Check for new clients */
>> >> >> >> >            if (FD_ISSET(conf.client_sockfd, &rset))
>> >> >> >> >                client_accept();
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >
>> >> >> >
>> >> >> >
>> >> >
>> >> >
>> >
>> >
>
>
