Hi Abhishek,

Ok, so if I use

nodes n0000,n0001

it works.  The other two forms

nodes n000[0-1]
nodes n0000-n0001

do NOT work.  It appears that your current bjs only accepts a
comma-delimited list of nodes, and it is not yet clear to me whether
wildcards will work at all.
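
To make clear what I would expect, here is a minimal sketch of how the
three spellings could be expanded to the same list.  This is NOT the
bjs parser, just an illustration; expand_nodes is a made-up helper
name, and I am assuming the zero-padding width is taken from the lower
bound of the range.

```python
import re


def expand_nodes(spec):
    """Expand a 'nodes' value into a list of node names.

    Hypothetical helper, not part of bjs. Accepts comma-delimited
    names, bracket ranges (n000[0-1]) and hyphen ranges (n0000-n0001).
    """
    names = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue

        # Bracket form: prefix[lo-hi]
        m = re.fullmatch(r"(.*?)\[(\d+)-(\d+)\]", item)
        if m:
            prefix, lo, hi = m.groups()
            width = len(lo)  # assumption: pad to the width of the lower bound
            names.extend(
                f"{prefix}{i:0{width}d}" for i in range(int(lo), int(hi) + 1)
            )
            continue

        # Hyphen form: prefixNNNN-prefixMMMM, same prefix on both sides
        m = re.fullmatch(r"(\D*)(\d+)-(\D*)(\d+)", item)
        if m and m.group(1) == m.group(3):
            prefix, lo, _, hi = m.groups()
            width = len(lo)
            names.extend(
                f"{prefix}{i:0{width}d}" for i in range(int(lo), int(hi) + 1)
            )
            continue

        # Anything else is taken as a single node name
        names.append(item)
    return names
```

With this, "n0000,n0001", "n000[0-1]" and "n0000-n0001" all give
['n0000', 'n0001'].  Comma lists inside brackets (n000[0,1]) are not
handled by this sketch.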

Now, to another question.  If I submit a job using bjssub, it sets
the environment variable "NODES" to a list of nodes that the owner of
the job can run processes on.  In the old bproc this was just a list
of node numbers, which you passed to the bpsh command.  Here I get a
list of numbers that don't obviously correspond to anything...  How
would one run xrx with these?  For example:

[r...@dgk3 ~]# bjssub -n 2 -i -s 10000 /bin/bash
Waiting for interactive job nodes.
(nodes 0 6443568 6441632)
Starting interactive job.
NODES=6443568,6441632
JOBID=0

So what do these numbers correspond to?  Typically in a batch
environment you don't know which nodes get assigned to you, so the
script you use to run the jobs must be told which nodes are yours to
use.  Similarly for mpi programs.  Admittedly xmvapich still has some
problems, but it runs with a list of nodes too, in just the same way
as xrx.  Also, doing something like "xrx -a" should now look not at
the total list of nodes as defined by statfs, but rather the locally
defined list from, for example, an environment variable.  Could I
suggest that the NODES variable be set to an "xcpu-aware" list of
nodes, and then that the command set (xrx, xmvapich,...) look at it
for resolution of the "-a" option?  It doesn't have to be called
NODES; any unique variable name would work.
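
Roughly what I have in mind for "-a", as a sketch only.  None of this
is existing xrx or xmvapich code: nodes_for_all is a made-up name, the
variable name is just a parameter, and I am assuming the variable
holds a comma-delimited list of node names as statfs reports them.

```python
import os


def nodes_for_all(all_nodes, env=None, var="NODES"):
    """Return the nodes that '-a' should resolve to.

    Hypothetical helper. all_nodes is the full list known to statfs;
    if the job environment defines `var`, only the nodes listed there
    (and known to statfs) are returned.
    """
    env = os.environ if env is None else env
    value = env.get(var, "").strip()
    if not value:
        # Not running inside a job: keep today's behaviour (all nodes)
        return list(all_nodes)

    allowed = [n.strip() for n in value.split(",") if n.strip()]
    known = set(all_nodes)
    # Keep the job's ordering, drop anything statfs does not know about
    return [n for n in allowed if n in known]
```

With NODES=n0000,n0001 this returns just those two nodes.  With the
numeric values I get today (6443568,6441632) nothing matches a node
name, so the result is empty, which is exactly the problem.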

Thanks,
Daniel

On Sun, Dec 14, 2008 at 10:44 PM, Abhishek Kulkarni <[email protected]> wrote:
>
>
> On Sun, Dec 14, 2008 at 8:29 PM, Daniel Gruner <[email protected]> wrote:
>>
>> On Sun, Dec 14, 2008 at 10:20 PM, Abhishek Kulkarni <[email protected]>
>> wrote:
>> >
>> >
>> > On Sun, Dec 14, 2008 at 7:04 PM, Daniel Gruner <[email protected]>
>> > wrote:
>> >>
>> >> Hi Abhishek,
>> >>
>> >> Well, I compiled it and installed it (the Makefile needs work...), and
>> >> it stays up as a daemon, but doesn't show any available nodes:
>> >>
>> >> [r...@dgk3 bjs]# bjsstat
>> >> Pool: default   Nodes (total/up/free): 0/0/0
>> >> ID      User     Command                        Requirements
>> >>
>> >> Did you change anything in the format for the bjs.conf file?
>> >
>> > Yes I added an extra option (statfs) which can be specified as:
>> >
>> > statfs      localhost!20003
>> >
>> > bjs would fetch the node information from statfs.
>> > Although, the 'nodes' parameter in bjs.conf remains -- an
>> > intersection set of the two dictates the total nodes for bjs.
>>
>> Well, here is my bjs.conf, and regardless of whether I specify the
>> nodes line or not, bjsstat does not appear to show any active nodes.
>> I have not modified statfs in any way, so the port 20003 should still
>> be fine.
>>
>> # Sample BJS configuration file
>> #
>> # $Id: bjs.conf,v 1.10 2003/11/10 19:40:22 mkdist Exp $
>>
>> spooldir   /var/spool/bjs
>> policypath /usr/local/lib64/bjs:/usr/local/lib/bjs
>> socketpath /tmp/.bjs
>> #acctlog   /tmp/acct.log
>> statfsaddr localhost!20003
>>
>> pool default
>>        policy filler
>> #        nodes  0-1
>>        maxsecs 20000000
>>
>> I have tried this with the nodes line like:
>>
>> nodes n0000-n0001
>
> The nodes line is not optional. I would probably make it
>
> nodes n000[0-1] or
> nodes n0000, n0001
>
> though what you specified should work too (I will check that out).
>
> And spawn bjs with -v switch to get a more verbose output.
> Thanks.
>
>
>>
>> but it doesn't work either.  xstat seems totally normal:
>>
>> [r...@dgk3 ~]# xstat
>> n0000   tcp!10.10.0.10!6667     /Linux/x86_64   up      0
>> n0001   tcp!10.10.0.11!6667     /Linux/x86_64   up      0
>>
>>
>> Daniel
>>
>>
>> >
>> >>
>> >> Daniel
>> >>
>> >>
>> >> On Sun, Dec 14, 2008 at 10:10 AM, Abhishek Kulkarni
>> >> <[email protected]> wrote:
>> >> >
>> >> >
>> >> > On Sat, Dec 13, 2008 at 9:37 PM, Daniel Gruner <[email protected]>
>> >> > wrote:
>> >> >>
>> >> >> Hi Abhishek,
>> >> >>
>> >> >> What is the status of your port of bjs?  Is it part of the
>> >> >> sxcpu tree (or pulled when one checks out from the sxcpu svn
>> >> >> repository)?  I'd really like to test it...
>> >> >
>> >> > Daniel,
>> >> >
>> >> > You probably missed the quick announcement, here it is again:
>> >> >
>> >> >
>> >> > http://groups.google.com/group/xcpu/browse_thread/thread/42ed613c72fe55ba#
>> >> >
>> >> > After syncing changes between the sxcpu and the xcpu2 tree, it
>> >> > could be used for either.
>> >> > Let me know how it works for you.
>> >> > Thanks
>> >> >
>> >> >
>> >> >>
>> >> >> Daniel
>> >> >>
>> >> >>
>> >> >> On Mon, Dec 8, 2008 at 3:08 PM, Abhishek Kulkarni
>> >> >> <[email protected]> wrote:
>> >> >> >
>> >> >> > This patch makes bjs comply with the changed semantics of
>> >> >> > xp_nodeset_list_by_state to obtain the down nodes from statfs.
>> >> >> >
>> >> >> > Signed-off-by: Abhishek Kulkarni <[email protected]>
>> >> >> >
>> >> >> > Index: bjs.c
>> >> >> >
>> >> >> > ===================================================================
>> >> >> > --- bjs.c       (revision 746)
>> >> >> > +++ bjs.c       (working copy)
>> >> >> > @@ -2481,19 +2481,7 @@
>> >> >> >
>> >> >> >        if (r > 0) {
>> >> >> >            /* Check for machine status changes */
>> >> >> > -           /* TODO: Instead of jumping over these hoops, improve the
>> >> >> > -              way down nodes can be obtained from statfs */
>> >> >> > -
>> >> >> > -           down_nodeset = xp_nodeset_list_by_state(conf.statfsaddr, "down(initializing)");
>> >> >> > -           xp_nodeset_append(down_nodeset,
>> >> >> > -               xp_nodeset_list_by_state(conf.statfsaddr, "down(disconnected)"));
>> >> >> > -           xp_nodeset_append(down_nodeset,
>> >> >> > -               xp_nodeset_list_by_state(conf.statfsaddr, "down(connect_failed)"));
>> >> >> > -           xp_nodeset_append(down_nodeset,
>> >> >> > -               xp_nodeset_list_by_state(conf.statfsaddr, "down(read_failed)"));
>> >> >> > -           xp_nodeset_append(down_nodeset,
>> >> >> > -               xp_nodeset_list_by_state(conf.statfsaddr, "down(no_contact)"));
>> >> >> > -
>> >> >> > +           down_nodeset = xp_nodeset_list_by_state(conf.statfsaddr, 0);
>> >> >> >            if (down_nodeset->len != down_nodes) {
>> >> >> >                if (verbose) syslog(LOG_INFO, "XCPU cluster status change.");
>> >> >> >                chng = update_cluster_status(conf.statfsaddr);
>> >> >> > @@ -2505,9 +2493,10 @@
>> >> >> >                            p->policy->state_change(p);
>> >> >> >                    }
>> >> >> >                }
>> >> >> > +               down_nodes = down_nodeset->len;
>> >> >> >            }
>> >> >> > -           down_nodes = down_nodeset->len;
>> >> >> >
>> >> >> > +
>> >> >> >            /* Check for new clients */
>> >> >> >            if (FD_ISSET(conf.client_sockfd, &rset))
>> >> >> >                client_accept();
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >
>> >> >
>> >
>> >
>
>
