On Mon, 10 Sep 2012, Michael Gutteridge wrote:


On Sat, Sep 8, 2012 at 3:51 AM, Evren Yurtesen IB <[email protected]> wrote:

I realized the nodes still require access to the .Xauthority file;
otherwise they get authorization errors. So it appears people without
access to the .Xauthority file cannot connect anyway?


I believe that would be the case.


The plugin seems to make every node read the /tmp/slurm-spank-x11.*
files even when --x11=first is used. This causes errors to be printed
if execution uses 2 or more nodes. Don't you have this problem too? Any
solutions?

Works fine for us. I haven't worked with anything other than "first",
but this really looks like it may be a problem with the plugin not
being on the assigned node, or slurmd not "knowing" where the plugin
is.  Double-check plugstack.conf, etc.  I don't know whether a restart
of slurmd is required in these cases.


Strange, because the author, Matthieu Hautreux, said he has seen the same behavior. We then came up with a semi-tested patch (I tested it and it seems to fix the issue, at least for me). It is attached.

The problem appears only when the process spans multiple nodes; it won't show up if you are using many cores on the same node. Also, maybe you are not receiving any text output from the nodes and the message is eaten somewhere?

In either case, it is mostly a cosmetic problem, so there is not much to worry about :)

Thanks,
Evren
diff --git a/slurm-spank-x11-plug.c b/slurm-spank-x11-plug.c
index bda0a41..54ed7f9 100644
--- a/slurm-spank-x11-plug.c
+++ b/slurm-spank-x11-plug.c
@@ -363,8 +363,11 @@ exit:
 int slurm_spank_user_init (spank_t sp, int ac, char **av)
 {
 	int status=-1;
+	int do_init=0;
 	uint32_t jobid;
 	uint32_t stepid;
+	uint32_t nnodes;
+	uint32_t nodeid; 
 
 	if ( x11_mode == X11_MODE_NONE )
 		return 0;
@@ -381,7 +384,39 @@ int slurm_spank_user_init (spank_t sp, int ac, char **av)
 		return _x11_init_remote_batch(sp,jobid,stepid);
 	}
 	else if ( x11_mode != X11_MODE_BATCH ) {
-		return _x11_init_remote_inter(sp,jobid,stepid);
+
+		/* get the number of nodes */
+		if ( spank_get_item (sp, S_JOB_NNODES, &nnodes) != ESPANK_SUCCESS )
+			return status;
+		
+		/* get the local node ID */
+		if ( spank_get_item (sp, S_JOB_NODEID, &nodeid) != ESPANK_SUCCESS )
+			return status;
+
+		/* test if the local node has to go further */
+		switch ( x11_mode ) {
+		case X11_MODE_FIRST :
+			if ( nodeid == 0 ) {
+				do_init = 1;
+			}
+			break;
+		case X11_MODE_LAST :
+			if ( nodeid == (nnodes - 1) ) {
+				do_init = 1;
+			}
+			break;
+		case X11_MODE_ALL :
+			do_init = 1;
+			break;
+		default :
+			break;
+		}
+		
+		/* do the initialization of the X11 export if requested */
+		if ( do_init == 1 )
+			return _x11_init_remote_inter(sp,jobid,stepid);
+		else
+			return 0;
 	}
 
 }
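
For anyone who wants to check the node-selection rule without a cluster, the switch in the patch above can be sketched as a standalone function. This is only an illustration, not part of the plugin: the enum and the name node_should_init are hypothetical stand-ins for the plugin's X11_MODE_* constants and do_init flag, and nodeid/nnodes are assumed to be the values returned for S_JOB_NODEID and S_JOB_NNODES. Unlike the patch, the sketch also guards against nnodes being 0.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the plugin's X11_MODE_* constants. */
enum x11_mode_sketch {
	MODE_NONE,
	MODE_FIRST,
	MODE_LAST,
	MODE_ALL,
	MODE_BATCH
};

/*
 * Return 1 if the node with relative id `nodeid` (0-based) out of
 * `nnodes` should set up the X11 export for the given mode, else 0.
 * Mirrors the switch added by the patch; the nnodes > 0 guard is an
 * addition of this sketch to avoid unsigned wrap-around.
 */
static int node_should_init(enum x11_mode_sketch mode,
                            uint32_t nodeid, uint32_t nnodes)
{
	switch (mode) {
	case MODE_FIRST:
		return nodeid == 0;
	case MODE_LAST:
		return nnodes > 0 && nodeid == nnodes - 1;
	case MODE_ALL:
		return 1;
	default:
		return 0;
	}
}
```

With --x11=first on a 4-node step, only node 0 would go on to read the /tmp/slurm-spank-x11.* files under this rule, which is why the other nodes should stop printing errors.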
