Re: [Lustre-discuss] open() ENOENT bug

2008-11-02 Thread Robin Humble
On Thu, Oct 30, 2008 at 02:05:57PM +0100, Peter Kjellstrom wrote:
 On Thursday 30 October 2008, Brian J. Murrell wrote:
  On Thu, 2008-10-30 at 01:35 -0400, Robin Humble wrote:
   we have a user with simultaneously starting fortran runs that fail
   about 10% of the time because Lustre sometimes returns ENOENT instead
   of EACCES to an open() request on a read-only file.

  I can reproduce this on 1.6.6 as well using your reproducer.

 We have also seen this bug on our systems (reported by a user running a
 Fortran code). We have servers with both 1.4
 (2.6.9-55.0.9.EL_lustre.1.4.11.1smp) and 1.6
 (2.6.18-8.1.14.el5_lustre.1.6.4.2smp) lustre.

 The error is seen towards both server versions from a cluster with patchless
 1.6.5.1 clients running centos-5.2.x86_64 (2.6.18-92.1.13.el5).

 However the error is not seen from another cluster running _patched_ 1.6.5.1
 on centos-4.x86_64 (2.6.9-67.0.7.EL_lustre.1.6.5.1smp).

I dug up an old 2.6.9-67.0.7.EL_lustre.1.6.5.1 + IB kernel (who'd have
thought it'd boot with a RHEL5 userland!? :-) and you are right - my
openFileMinimal test case runs without problem. i.e. 2.6.9 seems a lot
more robust than 2.6.18 and onwards.

however, when running ~10 copies of the below fortran code with the
above RHEL4 + 1.6.5.1 kernel, several of the copies of the code always
die with:
  Fortran runtime error: Stale NFS file handle

  program blah
  implicit none
  integer i
  do i=1,1000
  open(3,file='file',status='old')
  close(3)
  enddo
  stop
  end

so although my cut-down C code reproducer doesn't trigger anything, it
seems Lustre still has issues with the real fortran code. the user's
jobs would probably run ok in this RHEL4 environment though as they
don't run 10 copies at once.
it's a slightly different variant of the bug as well (different error
code: ESTALE rather than ENOENT), or maybe it's just a totally different bug.
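
For illustration, a minimal C sketch of why the errno value matters to the
Fortran job. It assumes the Fortran runtime handles open(...,status='old')
the way libgfortran is understood to: try read-write first, and retry
read-only only when the error is EACCES, EPERM or EROFS. Under that
assumption a spurious ENOENT skips the retry and the open fails outright.
The function names open_with_fallback and count_open_failures are
hypothetical. This is not the openFileMinimal reproducer, which is not
shown in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper (assumption, not taken from the thread):
 * open read-write, and fall back to read-only only for
 * permission-type errors. The errno of the first attempt is
 * stored in *first_errno (0 if the first attempt succeeded).
 */
int open_with_fallback(const char *path, int *first_errno)
{
    assert(path != NULL && first_errno != NULL);

    int fd = open(path, O_RDWR);
    *first_errno = (fd < 0) ? errno : 0;
    if (fd >= 0)
        return fd;

    /* Anything other than a permission-type error (e.g. ENOENT)
     * is treated as final: no read-only retry is attempted. */
    if (*first_errno != EACCES && *first_errno != EPERM &&
        *first_errno != EROFS)
        return -1;

    return open(path, O_RDONLY);
}

/*
 * Open and close the file `count` times, like the do-loop in the
 * Fortran test, and return how many iterations failed to open it.
 */
int count_open_failures(const char *path, int count)
{
    int failures = 0;

    for (int i = 0; i < count; i++) {
        int first_errno;
        int fd = open_with_fallback(path, &first_errno);

        if (fd < 0)
            failures++;
        else
            close(fd);
    }
    return failures;
}
```

Calling count_open_failures("file", 1000) from ~10 processes started at the
same time would mirror the Fortran loop above. On a read-only file that
exists, any non-zero result would point at the first open() returning
something other than a permission error.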

cheers,
robin




 /Peter

  Can you file a bug in our bugzilla about it?  Please include your
  reproducer program.

  b.



___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] open() ENOENT bug

2008-10-30 Thread Brian J. Murrell
On Thu, 2008-10-30 at 01:35 -0400, Robin Humble wrote:
 Hi,

Hi.

 
 we have a user with simultaneously starting fortran runs that fail
 about 10% of the time because Lustre sometimes returns ENOENT instead
 of EACCES to an open() request on a read-only file.

I can reproduce this on 1.6.6 as well using your reproducer.

Can you file a bug in our bugzilla about it?  Please include your
reproducer program.

b.





Re: [Lustre-discuss] open() ENOENT bug

2008-10-30 Thread Peter Kjellstrom
On Thursday 30 October 2008, Brian J. Murrell wrote:
 On Thu, 2008-10-30 at 01:35 -0400, Robin Humble wrote:
  Hi,

 Hi.

  we have a user with simultaneously starting fortran runs that fail
  about 10% of the time because Lustre sometimes returns ENOENT instead
  of EACCES to an open() request on a read-only file.

 I can reproduce this on 1.6.6 as well using your reproducer.

We have also seen this bug on our systems (reported by a user running a 
Fortran code). We have servers with both 1.4 
(2.6.9-55.0.9.EL_lustre.1.4.11.1smp) and 1.6 
(2.6.18-8.1.14.el5_lustre.1.6.4.2smp) lustre.

The error is seen towards both server versions from a cluster with patchless 
1.6.5.1 clients running centos-5.2.x86_64 (2.6.18-92.1.13.el5).

However the error is not seen from another cluster running _patched_ 1.6.5.1 
on centos-4.x86_64 (2.6.9-67.0.7.EL_lustre.1.6.5.1smp).

/Peter

 Can you file a bug in our bugzilla about it?  Please include your
 reproducer program.

 b.




Re: [Lustre-discuss] open() ENOENT bug

2008-10-30 Thread Robin Humble
On Thu, Oct 30, 2008 at 08:28:05AM -0400, Brian J. Murrell wrote:
 On Thu, 2008-10-30 at 01:35 -0400, Robin Humble wrote:
  we have a user with simultaneously starting fortran runs that fail
  about 10% of the time because Lustre sometimes returns ENOENT instead
  of EACCES to an open() request on a read-only file.
 I can reproduce this on 1.6.6 as well using your reproducer.

thanks for looking into it so quickly.

 Can you file a bug in our bugzilla about it?  Please include your
 reproducer program.

https://bugzilla.lustre.org/show_bug.cgi?id=17545

cheers,
robin