Re: [Lustre-discuss] Lustre, automount and EIO

2010-03-25 Thread Andreas Dilger
On 2010-03-25, at 06:33, Stephen Willey wrote:
> We're using autofs, so the Lustre filesystem may be mounted
> immediately before the command that uses it is run.  We believe the
> errors occur because the client has not yet established connections
> to all the OSTs when mount returns and the following command is run.
>
> We've tried creating an automounter module based on mount_generic  
> that simply puts a 1s delay in the mount, and that's reduced the  
> number of errors, but they're very much still there.  Putting in a  
> larger delay is an option, but fairly obviously a pretty bad one.

I agree.  The reason that we return from mount before the OSC devices  
have established their connections is to avoid hanging the mount in  
case of an unavailable OST.  That said, if the OSCs are accessed  
before they have a chance to complete the connection the kernel should  
wait until the connection attempt has completed before returning an  
error.

> Mar 25 12:26:38 rr445 automount[6457]: open_mount: (mount):cannot  
> open mount module lustre (/usr/lib64/autofs/mount_lustre.so: cannot  
> open shared object file: No such file or directory)

Is this message itself always part of the problem?  This seems
autofs related, and makes me wonder if automount is expecting to
access a mount_lustre.so object INSTEAD of /sbin/mount.lustre.  If
that is the case it may not be doing the initial mount quite
correctly.  I'm not sure of that, but it seems unusual.

> Mar 25 12:26:38 rr445 kernel: Lustre: Client epsilon-client has started
> Mar 25 12:26:38 rr445 kernel: LustreError: 22600:0:(file.c:993:ll_glimpse_size())
> obd_enqueue returned rc -5, returning -EIO

It would be useful to look into the Lustre kernel debug logs for this  
failure.  If there was an RPC timeout during connection (e.g. if the  
OST is slow to respond) then that should have produced an earlier  
console error.  If the above operation is failing before trying to  
connect to the OST, then that should be fixed.
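
In the meantime, an application-side workaround is possible.  The
following is only a minimal sketch, assuming the EIO is transient and
goes away once the OSCs have connected.  The helper name
fopen_retry_eio() and its max_tries/delay_ms parameters are made up for
illustration and are not part of Lustre or autofs.  It retries fopen()
only when errno is EIO, sleeping briefly between attempts, and gives up
after a bounded number of tries so that a genuinely unavailable OST
still produces an error.

```c
#define _POSIX_C_SOURCE 200809L	/* for nanosleep() */

#include <errno.h>
#include <stdio.h>
#include <time.h>

/*
 * Hypothetical helper (not a Lustre or autofs API): open a file,
 * retrying only while fopen() fails with EIO.  Any other error is
 * returned to the caller immediately, with errno left as set by the
 * last fopen() call.
 */
static FILE *fopen_retry_eio(const char *path, const char *mode,
			     int max_tries, long delay_ms)
{
	FILE *f = NULL;
	int i;

	for (i = 0; i < max_tries; i++) {
		f = fopen(path, mode);
		if (f != NULL || errno != EIO)
			break;		/* success, or a non-transient error */

		/* Sleep before the next attempt, but not after the last one. */
		if (i + 1 < max_tries) {
			struct timespec ts;

			ts.tv_sec = delay_ms / 1000;
			ts.tv_nsec = (delay_ms % 1000) * 1000000L;
			nanosleep(&ts, NULL);
		}
	}

	return f;
}
```

A caller would replace fopen(path, "a") with something like
fopen_retry_eio(path, "a", 10, 100).  The retry count and delay are
guesses and would need tuning.  This only hides the symptom in the
application; it does not change what mount or the kernel does.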

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Lustre, automount and EIO

2010-03-25 Thread Stephen Willey
We're seeing errors which we believe are down to automount returning too early 
from a Lustre mount.

We're using autofs, so the Lustre filesystem may be mounted immediately before
the command that uses it is run.  We believe the errors occur because the
client has not yet established connections to all the OSTs when mount returns
and the following command is run.

We've tried creating an automounter module based on mount_generic that simply 
puts a 1s delay in the mount, and that's reduced the number of errors, but 
they're very much still there.  Putting in a larger delay is an option, but 
fairly obviously a pretty bad one.

Once the filesystem is actually mounted, things work properly until, of
course, the automounter drops the mount again.

Pasted below are two example log excerpts where we've automounted a filesystem 
called /net/epsilon, then immediately tried to fopen() a file on it which gives 
an I/O error.

I've attached a tiny C program that regularly reproduces the issue.  When run
with pdsh across a set of roughly 400 machines it failed on 16 of them, which
is fairly representative.

Any ideas or recommendations would be much appreciated,

Stephen



Mar 25 12:26:38 rr445 automount[6457]: open_mount: (mount):cannot open mount module lustre (/usr/lib64/autofs/mount_lustre.so: cannot open shared object file: No such file or directory)
Mar 25 12:26:38 rr445 kernel: Lustre: Client epsilon-client has started
Mar 25 12:26:38 rr445 kernel: LustreError: 22600:0:(file.c:993:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO

Mar 25 12:26:37 rr447 automount[6458]: open_mount: (mount):cannot open mount module lustre (/usr/lib64/autofs/mount_lustre.so: cannot open shared object file: No such file or directory)
Mar 25 12:26:37 rr447 kernel: Lustre: Client epsilon-client has started
Mar 25 12:26:37 rr447 kernel: LustreError: 2370:0:(file.c:993:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO

-- 
Stephen Willey
Senior Systems Engineer
Framestore
19-23 Wells Street, London W1T 3PQ
+44 207 344 8000
www.framestore.com 
/*
 * Immediately begin writing to a file on disk, to test Lustre
 */

#include <stdio.h>	/* fopen, fwrite, fclose, fprintf, perror */
#include <string.h>	/* strlen */

#define DATA "kjrlewkujriojfjvclsdjfoiewujfdkjljvoisjvowjfelkjelwkjvfljifwedse"

int main(int argc, char *argv[])
{
	FILE *f;
	unsigned int times = 512;

	if (argc != 2) {
		fprintf(stderr, "Usage: %s \n", argv[0]);
		return -1;
	}

	f = fopen(argv[1], "a"); /* creates if necessary */
	if (f == NULL) {
		perror("fopen");
		return -1;
	}

	while (times != 0) {
		if (fwrite(DATA, strlen(DATA), 1, f) != 1) {
			perror("fwrite");
			return -1;
		}
		times--;
	}

	if (fclose(f) != 0) {
		perror("fclose");
		return -1;
	}

	return 0;
}