Re: [Lustre-discuss] Time needed to enable quota
On Wed, Jun 15, 2011 at 01:30:08PM +0100, Guy Coates wrote:
> On 15/06/11 13:14, Frank Heckes wrote:
>> Hi all, we're planning to enable quota on our Lustre file systems running
>> with version 1.8.4. We'd like to estimate the downtime needed to run
>> quotacheck.
>
> Hi, we recently did a quotacheck on a filesystem with 40M inodes used /
> 160 TB used. It took ~20 mins (DDN 9900 back-end storage).

Hi, I can report similar values: with 20M inodes used / 100 TB used, lfs
quotacheck on a Lustre 1.8.4 filesystem took ~10 mins. The back-end storage
of this filesystem is an HP MSA2000.

Regards, Roland

--
Karlsruhe Institute of Technology (KIT)
Steinbuch Centre for Computing (SCC)
Roland Laifer, Scientific Computing Services (SCS)
Zirkel 2, Building 20.21, Room 209
76131 Karlsruhe, Germany
Phone: +49 721 608 44861
Fax: +49 721 32550
Email: roland.lai...@kit.edu
Web: http://www.scc.kit.edu
KIT – University of the State of Baden-Wuerttemberg and National Laboratory of the Helmholtz Association
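For anyone planning the same change on 1.8.x, the overall sequence looks roughly like the following (a sketch only; the quota_type parameter names and device paths are placeholders based on the 1.8 operations manual, so verify them against the manual for your exact version):

  # on each server, with the target unmounted (device paths are examples):
  tunefs.lustre --param mdt.quota_type=ug /dev/mdtdev   # on the MDS
  tunefs.lustre --param ost.quota_type=ug /dev/ostdev   # on each OSS, per OST
  # remount the servers, then build the quota files from any client:
  lfs quotacheck -ug /mnt/lustre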
Re: [Lustre-discuss] Error when mv from Lustre to system
It works. :) Thanks Christian and Andreas for your time.

----- Original Mail -----
From: Christian Becker christian.bec...@math.tu-dortmund.de
To: Andreas Dilger adil...@whamcloud.com
Cc: s...@free.fr, lustre-discuss@lists.lustre.org
Sent: Wednesday, 15 June 2011 19:53:54 GMT+01:00 Amsterdam / Berlin / Bern / Rome / Stockholm / Vienna
Subject: Re: [Lustre-discuss] Error when mv from Lustre to system

Andreas Dilger wrote:
> SLES cp and mv try to preserve xattrs, but I suspect they get an error when
> trying to copy the lustre.lov xattr to a non-Lustre filesystem. The message
> is just letting you know that some attributes were not copied, but hopefully
> this does not cause mv to return an error code. To disable this warning, add
> the following line to the file /etc/xattr.conf:
>
>   lustre.lov skip

Works fine for us.

best regards,
Christian

> Cheers, Andreas
>
> On 2011-06-15, at 4:17 AM, s...@free.fr wrote:
>> Hi Lustre subscribers, I have a problem which seems related to the discussion
>> http://groups.google.com/group/lustre-discuss-list/browse_thread/thread/1092ff06ae1fb58f/82695528cc76ced6?lnk=gstq=error+mv#82695528cc76ced6
>> but it happens on SLES nodes with no SELinux. When I try to mv a file from
>> Lustre to the local filesystem of a node, I get this error:
>>
>>   mv: setting attributes for `/tmp/foo': Operation not supported
>>
>> But the file or directory is correctly moved from Lustre to /tmp/foo.
>> I'm using Lustre 1.8.5 on SLES 11 SP1 nodes, and SLES 10 OSSes and MDS.
>> Do you have any clue? Thanks,
>> --
>> Jay N.
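For anyone else hitting this, the whole fix is the one line quoted above (if your distribution already ships a stock /etc/xattr.conf, back it up first):

  # tell the xattr-preserving code in cp/mv to silently skip the Lustre striping attribute
  echo "lustre.lov skip" >> /etc/xattr.conf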
[Lustre-discuss] Path lost when accessing files
Hi Lustre users, we currently have a few problems with jobs running on our cluster and using Lustre. Sometimes we see these errors:

  forrtl: No such file or directory
  forrtl: severe (29): file not found, unit 213, file �@/suivi.d000

It does not only happen with forrtl, but sometimes with other files as well. The job tries to access a file located at �@/suivi.d000. We have also had errors where it tried to access files as if they were at the root of the FS, in this example /suivi.d000. It's as if the PWD environment variable were being lost or corrupted. The funny thing is that when we execute the same job again, it works perfectly. We haven't succeeded in reproducing the errors, but they still happen from time to time. I didn't find any Lustre errors in my logs related to these problems.

We're using Lustre 1.8.5 on SLES 11 SP1 nodes, and SLES 10 OSSes and MDS. Do you have any clue?

Thanks,
Jay N.
[Lustre-discuss] lfs quotacheck -ug /lfs01/ sleeps
Dear all, I'm trying to enable quota on my Lustre file system. Issuing the lfs quotacheck -ug /lfs01/ command doesn't produce any output, and the ps aux | grep lfs command shows that the process is sleeping. I don't know where to go from here. Any idea how to discover what went wrong?

thanks in advance,
M.Adel
Re: [Lustre-discuss] lfs quotacheck -ug /lfs01/ sleeps
On 16 Jun 2011, at 11:33, Mohamed Adel wrote:
> Dear all, I'm trying to enable quota on my Lustre file system. Issuing the
> lfs quotacheck -ug /lfs01/ command doesn't produce any output, and the
> ps aux | grep lfs command shows that the process is sleeping. I don't know
> where to go from here. Any idea how to discover what went wrong?

This is correct; look for a kernel process called quotacheck on the Lustre servers. When all those threads have exited, lfs should also exit. As came up yesterday, this could take a few tens of minutes.

Ashley.
Re: [Lustre-discuss] lfs quotacheck -ug /lfs01/ sleeps
Dear Ashley, thanks for your quick response.

> This is correct; look for a kernel process called quotacheck on the Lustre
> servers. When all those threads have exited, lfs should also exit. As came
> up yesterday, this could take a few tens of minutes.

Issuing the ps aux | grep quotacheck command on all Lustre servers (MDS and OSSes) didn't show any quotacheck process running, though lfs quotacheck is still sleeping on the client from which I issued it. Does that mean the process has finished, or did something else go wrong?

thanks in advance,
M.Adel
Re: [Lustre-discuss] lfs quotacheck -ug /lfs01/ sleeps
On 16 Jun 2011, at 11:54, Mohamed Adel wrote:
> Dear Ashley, thanks for your quick response.
> Issuing the ps aux | grep quotacheck command on all Lustre servers (MDS and
> OSSes) didn't show any quotacheck process running, though lfs quotacheck is
> still sleeping on the client from which I issued it. Does that mean the
> process has finished, or did something else go wrong?

This is what I see when I test it on our cluster here. This is a demo filesystem, so it is very small and the quotacheck happens very quickly; on a real fs you should see more processes.

  $ pdsh -a ps auwx | grep quota
  sabina-client0: root  6621  0.0  0.0  4280  464 pts/1 S+ 13:36 0:00 lfs quotacheck /lustre/sab/client
  sabina-oss1:    root 30278  0.0  0.0     0    0 ?     D  13:36 0:00 [quotacheck]
  sabina-mds1:    root 12310  0.0  0.0 61160  776 pts/0 S+ 13:36 0:00 grep quota

This is on a 1.8 filesystem; I believe it used to work differently on 1.6.

Ashley.
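Once lfs has returned on the client, you can sanity-check the result with something like the following (the username here is only an example):

  $ lfs quotaon -ug /lustre/sab/client        # enable enforcement, if it is not already on
  $ lfs quota -u someuser /lustre/sab/client  # should now report that user's usage and limits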
Re: [Lustre-discuss] $MOUNT2 in acc-sm
On 11-06-15 05:58 PM, Jay Lan wrote:
> I found my problem! I defined MOUNT=/mnt/nbp0 and MOUNT2=/mnt/nbp0-2. Bad
> idea!!! The sanity_mount_check* scripts use `grep` to search for $MOUNT and
> $MOUNT2. Since $MOUNT is a substring of $MOUNT2, grep in some situations
> returns the wrong count!

That sounds like a bug. Can you please file a ticket at http://jira.whamcould.com/ detailing your problem and solution?

Thanx,
b.

--
Brian J. Murrell
Senior Software Engineer
Whamcloud, Inc.
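The substring trap is easy to demonstrate and to avoid (a sketch; /proc/mounts stands in here for whatever file the scripts actually grep):

  # counts lines for both /mnt/nbp0 and /mnt/nbp0-2, since one is a substring of the other:
  grep -c "$MOUNT" /proc/mounts
  # matching the mount point as an exact field avoids that:
  awk -v m="$MOUNT" '$2 == m' /proc/mounts | wc -l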
Re: [Lustre-discuss] $MOUNT2 in acc-sm
A slight typo - http://jira.whamcloud.com

On 11-06-16 5:07 AM, Brian J. Murrell wrote:
> [snip]
> That sounds like a bug. Can you please file a ticket at
> http://jira.whamcould.com/ detailing your problem and solution?

--
Peter Jones
Whamcloud, Inc.
www.whamcloud.com
Re: [Lustre-discuss] $MOUNT2 in acc-sm
On 11-06-16 10:15 AM, Peter Jones wrote:
> A slight typo - http://jira.whamcloud.com

Thanks Peter.

> On 11-06-16 5:07 AM, Brian J. Murrell wrote:
> [snip]
>> That sounds like a bug. Can you please file a ticket at
>> http://jira.whamcould.com/ detailing your problem and solution?
                   ^
LOL.

b.

--
Brian J. Murrell
Senior Software Engineer
Whamcloud, Inc.
[Lustre-discuss] LustreError: 26019:0:(file.c:3143:ll_inode_revalidate_fini()) failure -2 inode
Hi, Lustre 1.8. A lot of LustreErrors on a client:

  LustreError: 8747:0:(file.c:3143:ll_inode_revalidate_fini()) Skipped 6 previous similar messages
  LustreError: 8747:0:(file.c:3143:ll_inode_revalidate_fini()) failure -2 inode 63486047
  LustreError: 8747:0:(file.c:3143:ll_inode_revalidate_fini()) Skipped 4 previous similar messages
  LustreError: 26019:0:(file.c:3143:ll_inode_revalidate_fini()) failure -2 inode 54366423
  LustreError: 26019:0:(file.c:3143:ll_inode_revalidate_fini()) Skipped 7 previous similar messages
  LustreError: 26019:0:(file.c:3143:ll_inode_revalidate_fini()) failure -2 inode 43338388
  LustreError: 26019:0:(file.c:3143:ll_inode_revalidate_fini()) Skipped 3 previous similar messages
  LustreError: 26019:0:(file.c:3143:ll_inode_revalidate_fini()) failure -2 inode 14273218
  LustreError: 26019:0:(file.c:3143:ll_inode_revalidate_fini()) Skipped 10 previous similar messages
  LustreError: 26019:0:(file.c:3143:ll_inode_revalidate_fini()) failure -2 inode 10272497
  LustreError: 26019:0:(file.c:3143:ll_inode_revalidate_fini()) failure -2 inode 32001327
  LustreError: 26019:0:(file.c:3143:ll_inode_revalidate_fini()) Skipped 1 previous similar message
  LustreError: 8747:0:(file.c:3143:ll_inode_revalidate_fini()) failure -2 inode 50378921
  LustreError: 8747:0:(file.c:3143:ll_inode_revalidate_fini()) Skipped 1 previous similar message

What does failure -2 mean? Everything seems to be working correctly, and there are no errors on the OSSes or the MDS.

Thanks
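For what it's worth, the return codes in Lustre console messages are negated Linux errno values, so failure -2 is -ENOENT ("No such file or directory"); on a client this is often just a file that another node unlinked while this client was revalidating its cached inode. One quick way to decode such codes (a convenience one-liner, nothing Lustre-specific):

  $ python -c "import errno, os; print errno.errorcode[2], os.strerror(2)"
  ENOENT No such file or directory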
[Lustre-discuss] Unexpect file system error during normal system works
We have a problem with Lustre, and in connection with this I wanted to ask whether you can help us. We got an unexpected file system error during normal operation:

  Jun 13 15:00:30 ossw12 kernel: LDISKFS-fs error (device dm-9): mb_free_blocks: double-free of inode 82041293's block 346591170(bit 4034 in group 10577)
  Jun 13 15:00:30 ossw12 kernel: Aborting journal on device dm-9.
  Jun 13 15:00:30 ossw12 kernel: Remounting filesystem read-only
  Jun 13 15:00:30 ossw12 kernel: LDISKFS-fs error (device dm-9): mb_free_blocks: 3LustreError: 4026:0:(fsfilt-ldiskfs.c:280:fsfilt_ldiskfs_start()) error starting handle for op 8 (106 credits): rc -30
  Jun 13 15:00:30 ossw12 kernel: double-free of inode 82041293's block 346591171(bit 4035 in group 10577)

  Jun 13 15:06:53 ossw12 kernel: LDISKFS-fs error (device dm-12): mb_free_blocks: double-free of inode 90143054's block 125314561(bit 9729 in group 3824)
  Jun 13 15:06:53 ossw12 kernel: Aborting journal on device dm-12.
  Jun 13 15:06:53 ossw12 kernel: Remounting filesystem read-only
  Jun 13 15:06:53 ossw12 kernel: ldiskfs_abort called.
  Jun 13 15:06:53 ossw12 kernel: LDISKFS-fs error (device dm-12): ldiskfs_journal_start_sb: Detected aborted journal
  Jun 13 15:06:53 ossw12 kernel: Remounting filesystem read-only

Another try to mount the file system:

  Jun 13 15:12:24 ossw12 kernel: kjournald starting. Commit interval 5 seconds
  Jun 13 15:12:24 ossw12 kernel: LDISKFS-fs warning (device dm-9): ldiskfs_clear_journal_err: Filesystem error recorded from previous mount: IO failure
  Jun 13 15:12:24 ossw12 kernel: LDISKFS-fs warning (device dm-9): ldiskfs_clear_journal_err: Marking fs in need of filesystem check.
  Jun 13 15:12:24 ossw12 kernel: LDISKFS-fs warning: mounting fs with errors, running e2fsck is recommended
  Jun 13 15:12:24 ossw12 kernel: LDISKFS FS on dm-9, internal journal
  Jun 13 15:12:24 ossw12 kernel: LDISKFS-fs: recovery complete.
  Jun 13 15:12:24 ossw12 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.

  Jun 13 15:16:48 ossw12 kernel: kjournald starting. Commit interval 5 seconds
  Jun 13 15:16:48 ossw12 kernel: LDISKFS-fs warning (device dm-12): ldiskfs_clear_journal_err: Filesystem error recorded from previous mount: IO failure
  Jun 13 15:16:48 ossw12 kernel: LDISKFS-fs warning (device dm-12): ldiskfs_clear_journal_err: Marking fs in need of filesystem check.
  Jun 13 15:16:48 ossw12 kernel: LDISKFS-fs warning: mounting fs with errors, running e2fsck is recommended
  Jun 13 15:16:48 ossw12 kernel: LDISKFS FS on dm-12, internal journal
  Jun 13 15:16:48 ossw12 kernel: LDISKFS-fs: recovery complete.
  Jun 13 15:16:48 ossw12 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.

How can we recover or repair the data on these devices? fsck repaired some errors, but when we try to mount the file system we get these errors:

  Jun 13 18:39:17 ossw12 kernel: LDISKFS-fs error (device dm-9): mb_free_blocks: double-free of inode 82041293's block 346591170(bit 4034 in group 10577)
  Jun 13 18:39:17 ossw12 kernel: Aborting journal on device dm-9.
  Jun 13 18:39:17 ossw12 kernel: Remounting filesystem read-only
  Jun 13 18:39:17 ossw12 kernel: LDISKFS-fs error (device dm-9): mb_free_blocks: double-free of inode 82041293's block 346591171(bit 4035 in group 10577)
  Jun 13 18:39:17 ossw12 kernel: LDISKFS-fs error (device dm-9): mb_free_blocks: double-free of inode 82041293's block 346591172(bit 4036 in group 10577)

The hardware doesn't report any problems.

--
Regards
Piotr Przybylo | Technical Support Engineer | Polcom Sp z o.o.
ul. Krakowska 43 | 32-050 Skawina, Poland
mobile: +48609539945 | tel: +48 12 652 8682
Re: [Lustre-discuss] Unexpect file system error during normal system works
Hi Piotr,

Which Lustre version is this? Also, which version of e2fsprogs are you using? Is the back-end disk software RAID or hardware RAID?

If you cannot see any errors on your hardware, I would recommend running fsck a few times, until it no longer finds any problems. I also highly recommend collecting the logs from each fsck run in case they are needed for further debugging. If you are not sure that your hardware is OK, then you may want to run fsck with the -n switch and send the output to the mailing list.

Best regards,
Wojciech

On 16 June 2011 13:33, Piotr Przybylo piotr_przyb...@polcom.com.pl wrote:
> [original report and LDISKFS logs snipped]

--
Wojciech Turek
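In case it helps, a typical sequence per affected target looks like this (a sketch; the device names are taken from the logs above, the target must be unmounted first, and you should be using the Lustre-patched e2fsprogs):

  # read-only pass first, to gauge the damage and capture a log:
  e2fsck -fn /dev/dm-9 2>&1 | tee /root/e2fsck-dm-9-pass1.log
  # then repair passes, repeated until e2fsck comes back clean:
  e2fsck -fy /dev/dm-9 2>&1 | tee /root/e2fsck-dm-9-pass2.log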
[Lustre-discuss] What exactly is punch statistic?
Hi, I have been covertly trying for a long time to find out what punch means as far as Lustre llobdstat output is concerned, but have not really found anything definitive. Can someone answer that for me? (BTW: I am not alone in my ignorance... :) )

Thanks.

Joe Mervini
Sandia National Laboratories
High Performance Computing
505.844.6770
jame...@sandia.gov
Re: [Lustre-discuss] What exactly is punch statistic?
It is called when truncating a file - AFAIK it is showing you the number of truncates, more or less.

cliffw

On Thu, Jun 16, 2011 at 10:52 AM, Mervini, Joseph A jame...@sandia.gov wrote:
> Hi, I have been covertly trying for a long time to find out what punch means
> as far as Lustre llobdstat output is concerned, but have not really found
> anything definitive. Can someone answer that for me? (BTW: I am not alone in
> my ignorance... :) )

--
cliffw
Support Guy
WhamCloud, Inc.
www.whamcloud.com
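You can see it for yourself with something like the following (the file path and OST name are placeholders):

  # on a client: truncating a file generates a punch on the OST(s) that back it
  $ truncate -s 0 /mnt/lustre/somefile
  # on the OSS: watch the punch counter move in llobdstat
  $ llobdstat lustre-OST0000 1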
Re: [Lustre-discuss] Path lost when accessing files
On Thursday, June 16, 2011 03:30:38 PM Sebastien Piechurski wrote:
> Hi, this problem is documented in bug 23978
> (http://bugzilla.lustre.org/show_bug.cgi?id=23978). To summarize: the
> Fortran runtime makes a call to getcwd() to get the full path to a file
> which was given as a relative path. Lustre sometimes fails to answer this
> syscall, which returns an uninitialized buffer and an error code, BUT the
> Fortran runtime does not test the getcwd() return code and uses the buffer
> as-is. The uninitialized buffer is what you see as @, followed by the
> relative path. A patch is currently under inspection.

Perfectly summarized. I'll just add two things.

1) The patch didn't help :-(
2) There are two work-arounds listed in the bz: patch the kernel to retry the getcwd, or build and use an LD_PRELOAD wrapper to retry the getcwd.

/Peter

> From: lustre-discuss-boun...@lists.lustre.org
>> we currently have a few problems with jobs running on our cluster and
>> using Lustre. Sometimes we see these errors:
>>   forrtl: No such file or directory
>>   forrtl: severe (29): file not found, unit 213, file @/suivi.d000
>> ...
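For anyone who wants to try the LD_PRELOAD route, a minimal sketch of such a wrapper is below (untested here; it assumes the transient failure surfaces as ENOENT, as the bz discussion suggests, and the retry count and sleep are arbitrary):

  $ cat > getcwd_retry.c <<'EOF'
  #define _GNU_SOURCE
  #include <dlfcn.h>
  #include <errno.h>
  #include <stddef.h>
  #include <unistd.h>

  /* Retry a transiently failing getcwd() so callers that ignore its
   * return code (e.g. some Fortran runtimes) still get a valid buffer. */
  char *getcwd(char *buf, size_t size)
  {
          static char *(*real_getcwd)(char *, size_t);
          char *ret;
          int i;

          if (!real_getcwd)
                  real_getcwd = (char *(*)(char *, size_t))
                          dlsym(RTLD_NEXT, "getcwd");

          for (i = 0; i < 10; i++) {
                  ret = real_getcwd(buf, size);
                  if (ret != NULL || errno != ENOENT)
                          break;  /* success, or a real error such as ERANGE */
                  usleep(1000);   /* short pause, then retry */
          }
          return ret;
  }
  EOF
  $ gcc -shared -fPIC -o getcwd_retry.so getcwd_retry.c -ldl
  $ LD_PRELOAD=$PWD/getcwd_retry.so ./my_fortran_job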