Re: [lustre-discuss] LustreError and client crashes

2017-06-25 Thread Dilger, Andreas
These errors on the OSS are just from rebooting the clients in the middle of data transfer. Cheers, Andreas On Jun 25, 2017, at 09:09, Riccardo Veraldi wrote: Hello, I have a high volume data transfer between my Lustre

[lustre-discuss] LustreError and client crashes

2017-06-25 Thread Riccardo Veraldi
Hello, I have a high volume data transfer between my Lustre filesystems. I upgraded to Lustre 2.9.0 on the server side and Lustre 2.9.59 on the client side (because of a corruption bug). My clients running 2.9.59 hang and I need to reboot them, and at about the same time these are the kind of the

Re: [lustre-discuss] LustreError on ZFS volumes

2016-12-13 Thread Alexander I Kulyavtsev
We discussed a course of action this morning and decided that we'd start by migrating the files off of the OST. Testing suggests files that cannot be completely read will be left on OST0002. Due to the nature of the corruption

Re: [lustre-discuss] LustreError on ZFS volumes

2016-12-13 Thread Jesse Stroik
We discussed a course of action this morning and decided that we'd start by migrating the files off of the OST. Testing suggests files that cannot be completely read will be left on OST0002. Due to the nature of the corruption - faulty hardware raid controller - it seems unlikely we'll be

Re: [lustre-discuss] LustreError on ZFS volumes

2016-12-12 Thread Crowe, Tom
Hi Jessie, Regarding your seeing 370 objects with errors from ‘zpool status’ but over 400 files with “access issues”, I would suggest running ‘zpool scrub’ to identify all the ZFS objects in the pool that are reporting permanent errors. It would be very important to have a
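A minimal sketch of the scrub-and-inspect sequence suggested above, assuming root access on the OSS; the pool name ost0pool is a placeholder:

    # Start a full scrub so ZFS re-reads every block and records permanent errors
    zpool scrub ost0pool
    # Check progress; the scrub runs in the background
    zpool status ost0pool
    # Once complete, -v lists the damaged files by path or ZFS object ID
    zpool status -v ost0pool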

Re: [lustre-discuss] LustreError on ZFS volumes

2016-12-12 Thread Jesse Stroik
Thanks for taking the time to respond, Tom. For clarification, it sounds like you are using hardware-based RAID-6, and not ZFS RAID? Is this correct? Or was the faulty card simply an HBA? You are correct. This particular file system is still using hardware RAID6. At the bottom of the

Re: [lustre-discuss] LustreError on ZFS volumes

2016-12-12 Thread Crowe, Tom
Hi Jessie, For clarification, it sounds like you are using hardware-based RAID-6, and not ZFS RAID? Is this correct? Or was the faulty card simply an HBA? At the bottom of the ‘zpool status -v pool_name’ output, you may see paths and/or ZFS object IDs of the damaged/impacted files. This would

[lustre-discuss] LustreError on ZFS volumes

2016-12-12 Thread Jesse Stroik
One of our Lustre file systems, still running Lustre 2.5.3 and ZFS 0.6.3, experienced corruption due to a bad RAID controller. The OST in question was a RAID6 volume, which we've marked inactive. Most of our Lustre clients are 2.8.0. zpool status reports corruption and checksum errors. I have not

Re: [lustre-discuss] Lustreerror: Error -2 syncing data on lock cancel

2016-03-22 Thread Manno, Dominic Anthony
Hi Kurt, Have a look at https://jira.hpdd.intel.com/browse/LU-6664. Andreas gives a good explanation as to what is going on in his last comment. If you need more clarification, post back to the list. We have experienced this here at LANL with multiple 2.5.x filesystems. Some use ldiskfs, while

[lustre-discuss] Lustreerror: Error -2 syncing data on lock cancel

2016-03-22 Thread Kurt Strosahl
Good morning, Recently in my test environment I've seen the following error on the OSS: Error -2 syncing data on lock cancel. At the time there was only one client mounting the test Lustre file system, and the only process running was a compilation of gcc, so there was virtually no activity

Re: [Lustre-discuss] LustreError

2014-06-19 Thread Mohr Jr, Richard Frank (Rick Mohr)
(I have re-added the lustre-discuss mailing list to the reply.) I am not familiar with that error message. A quick Google search turned up a couple of links that may be helpful to you: http://lists.lustre.org/pipermail/lustre-discuss/2010-September/014035.html

[Lustre-discuss] LustreError

2014-06-16 Thread
Dear Experts, We are running Lustre 2.4.1 with a combined MDT/MGS disk server, with 4 device-mapper devices mounted as 4 OSTs. Recently the setup suffered from high system load and long hangs when trying lfs df -h from a client. Could someone shed light on the situation? Any help would be greatly

Re: [Lustre-discuss] LustreError

2014-06-16 Thread Mohr Jr, Richard Frank (Rick Mohr)
On Jun 5, 2014, at 11:48 AM, curiojus...@gmail.com wrote: [user@disk-server]$ lctl dl <snip> 12 ST obdfilter lustre-OST lustre-OST_UUID 5 One of your OSTs appears to be down, which would explain why lfs df was hanging. Have you been able to troubleshoot this problem to determine the
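A hedged sketch of the check being described, assuming shell access on the servers and a client mount at /mnt/lustre (placeholder path):

    # On the OSS/MDS: list local Lustre devices; a healthy target shows UP,
    # and an OST stuck in another state (e.g. ST) is the likely culprit
    lctl dl
    # On a client: per-OST usage; a down OST shows up as a stalled or missing row
    lfs df -h /mnt/lustre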

[Lustre-discuss] LustreError: 3193:0:(osc_request.c:1689:osc_brw_redo_request()) @@@ redo for recoverable error -5

2014-04-11 Thread Vijay Amirtharaj A
Hi, We have 50 TB of storage on Lustre; we are using lustre 2.3.0-2.6.32_279.5.1.el6.x86_64.x86_64. OS: CentOS 6.3. We have 31 compute nodes. My issue is: when we restart the storage my jobs run fine, that is, they write without any issue. After some time, my jobs come out with this

Re: [Lustre-discuss] LustreError: 3193:0:(osc_request.c:1689:osc_brw_redo_request()) @@@ redo for recoverable error -5

2014-04-11 Thread Parinay Kondekar
Apr 11 04:31:19 node16 kernel: LustreError: 3185:0:(osc_request.c:1689:osc_brw_redo_request()) @@@ redo for recoverable error -5 req@8802d1826400 x1464726686245296/t0(0) o4->lustre-OST0002-osc-88106ab4dc00@192.168.1.46@o2ib:6/4 lens 488/416 e 0 to 0 dl 1397170923 ref 2 fl Interpret:R/0/0 rc

Re: [Lustre-discuss] LustreError: 3193:0:(osc_request.c:1689:osc_brw_redo_request()) @@@ redo for recoverable error -5

2014-04-11 Thread Vijay Amirtharaj A
Hi Parinay Kondekar, Thanks for your reply. I am new to Lustre; please explain how to gather the information. Regards, Vijay Amirtharaj A On Fri, Apr 11, 2014 at 2:55 PM, Parinay Kondekar parinay_konde...@xyratex.com wrote: Apr 11 04:31:19 node16 kernel: LustreError:

Re: [Lustre-discuss] LustreError: 3193:0:(osc_request.c:1689:osc_brw_redo_request()) @@@ redo for recoverable error -5

2014-04-11 Thread Parinay Kondekar
I would check what the manual says in this case: http://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#idp554480 e.g. If the reported error is anything else (such as -5, I/O error), it likely indicates a storage failure. The low-level file system returns
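The manual's point is that -5 (EIO) usually comes from the backend storage rather than Lustre itself. A rough sketch of the low-level checks that implies, run on the affected OSS; the device name is a placeholder and smartctl requires smartmontools:

    # Kernel log usually shows the underlying SCSI/RAID errors behind an EIO
    dmesg | grep -iE 'i/o error|scsi|raid'
    # Quick health check of a suspect disk (if smartmontools is installed)
    smartctl -H /dev/sdX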

[Lustre-discuss] LustreError codes -114 and -16 (ldlm_lib.c:1919:target_send_reply_msg())

2012-02-15 Thread Marina Cacciagrano
Hello, On all the nodes of a lustre 1.8.2 , I often see messages similar to the following in /var/log/syslog: LustreError: 8862:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-114) req@8103f97dc850 x1393780295030087/t0

Re: [Lustre-discuss] LustreError codes -114 and -16 (ldlm_lib.c:1919:target_send_reply_msg())

2012-02-15 Thread Chris Horn
errno 16 is EBUSY (device or resource busy) and errno 114 is EALREADY (Operation already in progress). Chris Horn On Feb 15, 2012, at 10:52 AM, Marina Cacciagrano wrote: Hello, On all the nodes of a lustre 1.8.2 , I often see messages similar to the following in /var/log/syslog: LustreError:
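For reference, the positive errno values can be decoded on any Linux box (Lustre logs print them negated, e.g. -114); an illustrative one-liner:

    python3 -c 'import errno, os; [print(e, errno.errorcode[e], os.strerror(e)) for e in (16, 114)]'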

Re: [Lustre-discuss] LustreError codes -114 and -16 (ldlm_lib.c:1919:target_send_reply_msg())

2012-02-15 Thread Marina Cacciagrano

Re: [Lustre-discuss] LustreError codes -114 and -16 (ldlm_lib.c:1919:target_send_reply_msg())

2012-02-15 Thread Chris Horn
errno 16 is EBUSY (device or resource busy) and errno 114 is EALREADY (Operation already in progress). Chris Horn On Feb 15, 2012, at 10:52 AM, Marina Cacciagrano wrote: Hello, On all the nodes

Re: [Lustre-discuss] LustreError: 26019:0:(file.c:3143:ll_inode_revalidate_fini()) failure -2 inode

2012-01-23 Thread Erich Focht
Hello, we are seeing this error a lot since we updated the NFS-exporting clients to 1.8.7 (Oracle version). Jan 16 15:19:16 xxfs1 kernel: LustreError: 7312:0:(file.c:3329:ll_inode_revalidate_fini()) failure -2 inode 87425037 Jan 16 15:19:16 xxfs1 kernel: LustreError:

[Lustre-discuss] LustreError: 26019:0:(file.c:3143:ll_inode_revalidate_fini()) failure -2 inode

2011-06-16 Thread fenix . serega
Hi, Lustre 1.8. A lot of LustreErrors on the client: LustreError: 8747:0:(file.c:3143:ll_inode_revalidate_fini()) Skipped 6 previous similar messages LustreError: 8747:0:(file.c:3143:ll_inode_revalidate_fini()) failure -2 inode 63486047 LustreError: 8747:0:(file.c:3143:ll_inode_revalidate_fini())

[Lustre-discuss] LustreError

2011-01-25 Thread Ronald K Long
I have one lustre client that keeps reporting the following error. Jan 25 09:43:07 edclxs11 kernel: LustreError: 6886:0:(file.c:3287:ll_inode_revalidate_fini()) failure -2 inode 202278709 Jan 25 09:43:07 edclxs11 kernel: LustreError: 6886:0:(file.c:3287:ll_inode_revalidate_fini()) Skipped 7

Re: [Lustre-discuss] LustreError

2011-01-25 Thread Oleg Drokin
Hello! It's not necessarily missing; some other factors might be in play. E.g. if you have a somewhat older version of Lustre and export it via NFS from this node, I think there was a bug leading to such messages. If it is indeed missing, e2fsck should fix a case where a directory entry
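A conservative sketch of the fsck being referred to, for an ldiskfs MDT; the device path is a placeholder, and the target must be unmounted (or failed over) before running it:

    # Read-only pass first: reports problems without changing anything
    e2fsck -fn /dev/mapper/mdt_device
    # If the report looks reasonable, repair using the Lustre-patched e2fsprogs
    e2fsck -fp /dev/mapper/mdt_device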

Re: [Lustre-discuss] LustreError: 5920:0:(ldlm_lib.c:1643:target_send_reply_msg()) @@@ processing error (-19)

2010-08-09 Thread Alexey Lyashkov
Hi Roger, the command line and output look correct, but On Aug 8, 2010, at 22:19, Roger Spellman wrote: tunefs.lustre --verbose --erase-param --mgsnode=192.168.2...@o2ib --mgsnode=192.168.2...@o2ib --writeconf --fsname=tslstr --ost --index=1 /dev/mapper/map0 I then ran tunefs.lustre on this
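For context, a writeconf like the one quoted is normally applied to the whole filesystem with everything unmounted; a hedged outline of that procedure, with placeholder device paths:

    # 1. Unmount clients, then OSTs, then the MDT/MGS
    # 2. Regenerate the configuration logs on every target, e.g.:
    tunefs.lustre --writeconf /dev/mapper/mdt_device
    tunefs.lustre --writeconf /dev/mapper/map0
    # 3. Remount in order: MGS/MDT first, then the OSTs, then the clients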

[Lustre-discuss] LustreError: lookup/take lock error -13

2010-07-16 Thread Gregory Matthews
Does anyone recognise the following logs: Jul 15 18:24:40 cs04r-sc-mds01-01 kernel: LustreError: 17241:0:(mds_open.c:1053:mds_open()) parent 121938276/854916762 lookup/take lock error -13 Jul 15 18:24:40 cs04r-sc-mds01-01 kernel: LustreError: 17241:0:(mds_open.c:1053:mds_open()) Skipped 5

[Lustre-discuss] LustreError: error creating objects !!!!

2009-09-25 Thread Pe.Herb
Hi all, I'm using Lustre 1.8.0 on CentOS 5.0. I have a problem when the MDS is restarted (forced). After I mount Lustre on the MDS, the following logs appear on the MDS: Sep 26 01:43:09 MDS2 kernel: LustreError: 14452:0:(mds_open.c:432:mds_create_objects()) error creating objects for inode 20945086: rc = -5 Sep

Re: [Lustre-discuss] LustreError: error creating objects !!!!

2009-09-25 Thread Andreas Dilger
On Sep 26, 2009 02:04 +0700, Pe.Herb wrote: Hi all, I'm using lustre 1.8.0 on CentOS 5.0. I have a problem when MDS is restarted (forced). After, I mount lustre on MDS, the following are logs on MDS: Sep 26 01:43:09 MDS2 kernel: LustreError: 14452:0:(mds_open.c:432:mds_create_objects())

[Lustre-discuss] LustreError: ptlrpc body, buffer size, message magic

2009-09-21 Thread Thomas Roth
Hi all, on our 1.6.7.2 system the MDT is quite busy writing the following type of messages to the log, and I would just like to ask if somebody has an idea what they mean and whether they do any harm: Sep 21 19:50:30 mds1 kernel: LustreError: 6009:0:(pack_generic.c:566:lustre_msg_buf_v2()) msg

Re: [Lustre-discuss] LustreError: ptlrpc body, buffer size, message magic

2009-09-21 Thread Andreas Dilger
On Sep 21, 2009 20:14 +0200, Thomas Roth wrote: on our 1.6.7.2 system, the MDT is quite busy writing the following type of messages to the log, and I would just like to ask if somebody has an idea what they mean and if they mean harm: Sep 21 19:50:30 mds1 kernel: LustreError:

Re: [Lustre-discuss] LustreError: 3429:0:(llog_obd.c:226:llog_add()) No ctxt

2009-06-05 Thread Timh Bergström
I've verified that we run 1.6.7.1. We still get errors similar to the ones I posted: Jun 5 07:55:11 mdt1 kernel: LustreError: 3420:0:(llog_obd.c:226:llog_add()) Skipped 261 previous similar messages Jun 5 07:55:11 mdt1 kernel: LustreError: 3420:0:(lov_log.c:118:lov_llog_origin_add()) Can't add

[Lustre-discuss] LustreError: 3429:0:(llog_obd.c:226:llog_add()) No ctxt

2009-06-03 Thread Timh Bergström
Hi all, After an MDT server crash we decided to upgrade to 2.6.22+1.6.7 (to solve some other problems we've had before) from 2.6.18+1.6.6.1, and we got these errors in dmesg on the MDT: LustreError: 3429:0:(llog_obd.c:226:llog_add()) No ctxt LustreError: 3429:0:(lov_log.c:118:lov_llog_origin_add()) Can't

Re: [Lustre-discuss] LustreError: 3429:0:(llog_obd.c:226:llog_add()) No ctxt

2009-06-03 Thread Kevin Van Maren
1.6.7 is known to corrupt the MDT and was pulled from the download site. Please make sure you are using 1.6.7.1 and not 1.6.7. Kevin Timh Bergström wrote: Hi all, After a mdt-server-crash we decided to upgrade to 2.6.22+1.6.7 ( to solve some other problems we've had before ) from

Re: [Lustre-discuss] LustreError: 3429:0:(llog_obd.c:226:llog_add()) No ctxt

2009-06-03 Thread Timh Bergström
Hello and thanks for the reply. I'm 99% sure we are running 1.6.7.1 (when was it released, by the way?). I've mailed the package maintainer to be sure. Provided we run 1.6.7.1 and still get these errors, what should we do to get rid of them? Does it indicate some serious error(s)? Or would a simple fsck

Re: [Lustre-discuss] LustreError: lock callback timer expired after

2009-03-30 Thread Simon Latapie
Simon Latapie wrote: Greetings, I currently have a Lustre system with 1 MDS, 2 OSS with 2 OSTs each, and 37 Lustre clients (1 login and 36 compute nodes), all using InfiniBand as the Lustre network (o2ib). All nodes are on a 1.6.5.1 patched kernel. There is network error (no packet loss

Re: [Lustre-discuss] LustreError: lock callback timer expired after

2009-03-30 Thread Oleg Drokin
Hello! On Mar 30, 2009, at 7:06 AM, Simon Latapie wrote: I currently have a lustre system with 1 MDS, 2 OSS with 2 OSTs each, and 37 lustre clients (1 login and 36 compute nodes), all using infiniband as lustre network (o2ib). All nodes are on 1.6.5.1 patched kernel. For the past two

Re: [Lustre-discuss] LustreError: 11-0: an error occurred while communicating with 192.168.16...@o2ib. The ost_connect operation failed with -19

2009-03-25 Thread Kevin Van Maren
Dennis, You haven't provided enough context for people to help. What have you done to determine whether the IB fabric is working properly? What are the hostnames and NIDs for the 10 servers (lctl list_nids)? Which OSTs are on which servers? OST4 is on a machine at 192.168.16.23. What machine is
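A small sketch of the information being asked for, runnable on each node; the NID below is only illustrative (taken from the OST4 host mentioned above):

    # On every server: map hostname to LNet NID(s)
    hostname; lctl list_nids
    # From a client or the MDS: verify basic LNet reachability of an OSS
    lctl ping 192.168.16.23@o2ib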

Re: [Lustre-discuss] LustreError: 11-0: an error occurred while communicating with 192.168.16...@o2ib. The ost_connect operation failed with -19

2009-03-25 Thread Dennis Nelson
On 3/25/09 11:12 AM, Kevin Van Maren kevin.vanma...@sun.com wrote: Dennis, You haven't provided enough context for people to help. What have you done to determine if the IB fabric is working properly? Basic functionality appears to be there. I can lctl ping between all servers. I have

[Lustre-discuss] LustreError: 11-0: an error occurred while communicating with 192.168.16...@o2ib. The ost_connect operation failed with -19

2009-03-24 Thread Dennis Nelson
Hi, I have encountered an issue with Lustre that has happened a couple of times now. I am beginning to suspect an issue with the IB fabric but wanted to reach out to the list to confirm my suspicions. The odd part is that even when the MDS complains that it cannot connect to a given OST, lctl

Re: [Lustre-discuss] LustreError: The mds_getxattr operation failed with -43

2009-01-13 Thread Andreas Dilger
On Jan 12, 2009 18:21, Gonçalo Borges wrote: - It seems my clients are not able to reach my MDT. If you do a dmesg on a client Linux machine, you will get: ---*--- LustreError: 11-0: an error occurred while communicating with 172.30.1@tcp. The mds_getxattr operation failed with

[Lustre-discuss] LustreError: The mds_getxattr operation failed with -43

2009-01-12 Thread Gonçalo Borges
Hi all... I'm having the following problems: - It seems my clients are not able to reach my MDT. If you do a dmesg on a client Linux machine, you will get: ---*--- LustreError: 11-0: an error occurred while communicating with 172.30.1@tcp. The mds_getxattr operation failed with -43

Re: [Lustre-discuss] LustreError: server_bulk_callback

2008-09-30 Thread Isaac Huang
On Wed, Sep 24, 2008 at 05:22:55PM -0600, Nathan Dauchy wrote: Can anyone direct me to documentation to decipher these messages? What does server_bulk_callback do, and does status -103 indicate a severe problem for event types 2 and 4? server_bulk_callback signals the completion of bulk data

Re: [Lustre-discuss] LustreError: server_bulk_callback

2008-09-26 Thread Andreas Dilger
On Sep 24, 2008 17:22 -0600, Nathan Dauchy wrote: We have 4 OSS nodes and 2 MDS nodes configured in HA pairs, running 2.6.18-53.1.14.el5_lustre.1.6.5smp, and using the o2ib network transport. We had multiple failovers recently (possibly due to hardware problems, but no root cause yet) and

[Lustre-discuss] LustreError: server_bulk_callback

2008-09-24 Thread Nathan Dauchy
Greetings, We have 4 OSS nodes and 2 MDS nodes configured in HA pairs, running 2.6.18-53.1.14.el5_lustre.1.6.5smp, and using the o2ib network transport. We had multiple failovers recently (possibly due to hardware problems, but no root cause yet) and managed to get things back again to what I

Re: [Lustre-discuss] LustreError: acquire timeout exceeded

2008-07-30 Thread Andreas Dilger
On Jul 29, 2008 18:51 +0200, Thomas Roth wrote: kern.log.1:Jul 20 06:47:19 kernel: LustreError: 27696:0:(upcall_cache.c:326:upcall_cache_get_entry()) acquire timeout exceeded for key 0 kern.log.1:Jul 20 06:47:41 kernel: LustreError: 27713:0:(upcall_cache.c:326:upcall_cache_get_entry())

[Lustre-discuss] LustreError: acquire timeout exceeded

2008-07-29 Thread Thomas Roth
Hi all, I've encountered a LustreError that might have triggered an unwanted failover of an MGS/MDS HA pair of servers. I'm not sure about the latter, but at least I have not found a trace of that error via Google, so it might be worth considering. And it occurred in this form only the two