These errors on the OSS are just from rebooting the clients in the middle of
data transfer.
Cheers, Andreas
On Jun 25, 2017, at 09:09, Riccardo Veraldi wrote:
Hello,
I have a high volume data transfer between my Lustre filesystems.
I upgraded to Lustre 2.9.0 on server side and Lustre 2.9.59 on client
side (because of a corruption problem bug).
My clients running 2.9.59 hang and I need to reboot them, and at about
the same time these are the kind of the
lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] LustreError on ZFS volumes
We discussed a course of action this morning and decided that we'd start
by migrating the files off of the OST. Testing suggests files that
cannot be completely read will be left on OST0002.
Due to the nature of the corruption - faulty hardware raid controller -
it seems unlikely we'll be
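For readers unfamiliar with the migration step described above, here is a dry-run sketch (my addition, not from the thread) of the command pipeline the Lustre manual suggests for draining a failing OST. The mount point /mnt/lustre is a hypothetical example; the OST name is taken from the thread.

```python
# Dry-run sketch: build (but do not execute) the commands typically used
# to migrate files off a failing OST. Command syntax follows the Lustre
# manual's `lfs find ... | lfs_migrate -y` pattern; adapt the mount point
# and OST name to your system.

def migration_commands(mountpoint, ost_name):
    """Return (find_cmd, migrate_cmd) as argv lists for a pipeline."""
    # List every file with at least one object on the given OST.
    find_cmd = ["lfs", "find", mountpoint, "--obd", f"{ost_name}_UUID"]
    # lfs_migrate reads file names on stdin; -y skips the confirmation prompt.
    migrate_cmd = ["lfs_migrate", "-y"]
    return find_cmd, migrate_cmd

find_cmd, migrate_cmd = migration_commands("/mnt/lustre", "lustre-OST0002")
print(" ".join(find_cmd), "|", " ".join(migrate_cmd))
```

As the posters note, files whose objects cannot be completely read will fail to migrate and be left behind on OST0002.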
Hi Jessie,
Regarding your seeing 370 objects with errors from ‘zpool status’, but
having over 400 files with “access issues”: I would suggest running ‘zpool
scrub’ to identify all the ZFS objects in the pool that are reporting permanent
errors.
It would be very important to have a
Thanks for taking the time to respond, Tom,
For clarification, it sounds like you are using hardware based RAID-6, and not
ZFS raid? Is this correct? Or was the faulty card simply an HBA?
You are correct. This particular file system is still using hardware RAID6.
At the bottom of the
Hi Jessie,
For clarification, it sounds like you are using hardware based RAID-6, and not
ZFS raid? Is this correct? Or was the faulty card simply an HBA?
At the bottom of the ‘zpool status -v pool_name’ output, you may see paths
and/or zfs object ID’s of the damaged/impacted files. This would
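To make that concrete, here is a small Python sketch (mine, not from the thread) that pulls the damaged entries out of ‘zpool status -v’ output. The sample text and the parse_permanent_errors helper are invented for illustration; real output will differ in detail.

```python
# Minimal sketch: extract the damaged paths / object IDs listed at the
# bottom of `zpool status -v` output. SAMPLE is illustrative only.

SAMPLE = """\
  pool: ost0002
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
errors: Permanent errors have been detected in the following files:

        /ost0002/O/0/d3/1234567
        /ost0002/O/0/d7/7654321
        ost0002:<0x1a2b3>
"""

def parse_permanent_errors(text):
    """Return the entries listed after the 'errors:' header."""
    entries, collecting = [], False
    for line in text.splitlines():
        if line.startswith("errors:"):
            collecting = True        # entries follow this header line
            continue
        if collecting:
            stripped = line.strip()
            if stripped:             # skip the blank separator line
                entries.append(stripped)
    return entries

print(parse_permanent_errors(SAMPLE))
```

Entries of the `pool:<0x...>` form are ZFS object IDs rather than resolvable paths, which is why a scrub plus `zfs_obj_to_path` style translation may still be needed to map them back to Lustre files.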
One of our lustre file systems still running lustre 2.5.3 and zfs 0.6.3
experienced corruption due to a bad RAID controller. The OST in question
was a RAID6 volume which we've marked inactive. Most of our lustre
clients are 2.8.0.
zpool status reports corruption and checksum errors. I have not
Hi Kurt,
Have a look at https://jira.hpdd.intel.com/browse/LU-6664. Andreas gives a good
explanation as to what is going on in his last comment. If you need more
clarification, post back to the list. We have experienced this here at LANL
with multiple 2.5.x filesystems. Some use ldiskfs, while
Good Morning,
Recently in my test environment I've seen the following error on the oss:
Error -2 syncing data on lock cancel. At the time there was only one client
mounting the test lustre file system, and the only process running was a
compilation of gcc, so there was virtually no activity
(I have re-added the lustre-discuss mailing list to the reply.)
I am not familiar with that error message. A quick google turned up a couple
of links that may be helpful to you:
http://lists.lustre.org/pipermail/lustre-discuss/2010-September/014035.html
Dear Experts,
We are running lustre 2.4.1 with a combined MDT/MGS disk-server mounted
with 4 device-mapper devices as 4 OSTs. Recently the setup suffered from high
system load and a long hang when trying 'lfs df -h' from a client.
could someone shed light on the situation?
Any help would be greatly
On Jun 5, 2014, at 11:48 AM, curiojus...@gmail.com
wrote:
[user@disk-server]$ lctl dl
snip
12 ST obdfilter lustre-OST lustre-OST_UUID 5
One of your OSTs appears to be down which would explain why lfs df was
hanging. Have you been able to troubleshoot this problem to determine the
Hi,
We have 50 TB storage on lustre, we are using lustre
2.3.0-2.6.32_279.5.1.el6.x86_64.x86_64 OS: Centos 6.3
We have 31 compute nodes.
My issue is:
When we restart the storage my jobs run fine, that is, writing
without any issue.
After some time, my jobs fail with this
Apr 11 04:31:19 node16 kernel: LustreError:
3185:0:(osc_request.c:1689:osc_brw_redo_request())
@@@ redo for recoverable error -5
req@8802d1826400x1464726686245296/t0(0) o4-lustre-OST0002-osc-
88106ab4dc00@192.168.1.46@o2ib:6/4 lens 488/416 e 0 to 0 dl 1397170923
ref 2 fl Interpret:R/0/0 rc
Hi Parinay Kondekar,
Thanks for your reply.
I am new to Lustre; please explain to me how to gather the information.
Regards,
Vijay Amirtharaj A
On Fri, Apr 11, 2014 at 2:55 PM, Parinay Kondekar
parinay_konde...@xyratex.com wrote:
Apr 11 04:31:19 node16 kernel: LustreError:
I would go and see what's in the man pages in this case.
http://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#idp554480
e.g.
If the reported error is anything else (such as -5, I/O error), it likely
indicates a storage failure. The low-level file system returns
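(My addition, not part of the manual excerpt.) These negative return codes in Lustre logs are just negated errno values, which Python's standard library can decode:

```python
import errno
import os

# Lustre (like the kernel generally) reports failures as negative errno
# values, so "rc = -5" means errno 5. Negate the code and look it up:
for rc in (-2, -5, -13, -16, -114):
    code = -rc
    name = errno.errorcode.get(code, "?")
    print(f"rc = {rc:4d} -> {name}: {os.strerror(code)}")
```

On Linux this maps -2 to ENOENT, -5 to EIO, -13 to EACCES, and so on, matching the decodings given elsewhere in these threads.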
Hello,
On all the nodes of a lustre 1.8.2, I often see messages similar to the
following in /var/log/syslog:
LustreError: 8862:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing
error (-114) req@8103f97dc850 x1393780295030087/t0
errno 16 is EBUSY (device or resource busy) and errno 114 is EALREADY
(Operation already in progress).
Chris Horn
On Feb 15, 2012, at 10:52 AM, Marina Cacciagrano wrote:
Hello,
On all the nodes of a lustre 1.8.2, I often see messages similar to the
following in /var/log/syslog:
LustreError:
...@cray.com
To: Marina Cacciagrano marina.cacciagr...@framestore.com
Cc: lustre-discuss@lists.lustre.org lustre-discuss@lists.lustre.org
Sent: Wednesday, 15 February, 2012 5:21:33 PM
Subject: Re: [Lustre-discuss] LustreError codes -114 and -16
(ldlm_lib.c:1919:target_send_reply_msg())
errno
Hello,
we are seeing this error a lot since we updated the NFS-exporting clients to
1.8.7 (Oracle version).
Jan 16 15:19:16 xxfs1 kernel: LustreError:
7312:0:(file.c:3329:ll_inode_revalidate_fini()) failure -2 inode 87425037
Jan 16 15:19:16 xxfs1 kernel: LustreError:
Hi
Lustre 1.8
A lot of LustreErrors on client:
LustreError: 8747:0:(file.c:3143:ll_inode_revalidate_fini()) Skipped 6
previous similar messages
LustreError: 8747:0:(file.c:3143:ll_inode_revalidate_fini()) failure -2
inode 63486047
LustreError: 8747:0:(file.c:3143:ll_inode_revalidate_fini())
I have one lustre client that keeps reporting the following error.
Jan 25 09:43:07 edclxs11 kernel: LustreError:
6886:0:(file.c:3287:ll_inode_revalidate_fini()) failure -2 inode 202278709
Jan 25 09:43:07 edclxs11 kernel: LustreError:
6886:0:(file.c:3287:ll_inode_revalidate_fini()) Skipped 7
Hello!
It's not necessarily missing; some other factors might be in play. E.g. if you
have a somewhat older version of Lustre and export it via NFS from this node,
I think there was a bug leading to such messages.
If it is indeed missing, e2fsck should fix a case where a directory entry
Hi Roger,
command line and output looks correct, but
On Aug 8, 2010, at 22:19, Roger Spellman wrote:
tunefs.lustre --verbose --erase-param --mgsnode=192.168.2...@o2ib
--mgsnode=192.168.2...@o2ib
--writeconf --fsname=tslstr --ost --index=1 /dev/mapper/map0
I then ran tunefs.lustre on this
Does anyone recognise the following logs:
Jul 15 18:24:40 cs04r-sc-mds01-01 kernel: LustreError:
17241:0:(mds_open.c:1053:mds_open()) parent 121938276/854916762
lookup/take lock error -13
Jul 15 18:24:40 cs04r-sc-mds01-01 kernel: LustreError:
17241:0:(mds_open.c:1053:mds_open()) Skipped 5
Hi all,
I'm using lustre 1.8.0 on CentOS 5.0.
I have a problem when MDS is restarted (forced).
After, I mount lustre on MDS, the following are logs on MDS:
Sep 26 01:43:09 MDS2 kernel: LustreError:
14452:0:(mds_open.c:432:mds_create_objects()) error creating objects for
inode 20945086: rc = -5
Sep
On Sep 26, 2009 02:04 +0700, Pe.Herb wrote:
Hi all,
I'm using lustre 1.8.0 on CentOS 5.0.
I have a problem when MDS is restarted (forced).
After, I mount lustre on MDS, the following are logs on MDS:
Sep 26 01:43:09 MDS2 kernel: LustreError:
14452:0:(mds_open.c:432:mds_create_objects())
Hi all,
on our 1.6.7.2 system, the MDT is quite busy writing the following type
of messages to the log, and I would just like to ask if somebody has an
idea what they mean and if they mean harm:
Sep 21 19:50:30 mds1 kernel: LustreError:
6009:0:(pack_generic.c:566:lustre_msg_buf_v2()) msg
On Sep 21, 2009 20:14 +0200, Thomas Roth wrote:
on our 1.6.7.2 system, the MDT is quite busy writing the following type
of messages to the log, and I would just like to ask if somebody has an
idea what they mean and if they mean harm:
Sep 21 19:50:30 mds1 kernel: LustreError:
I've verified that we run 1.6.7.1.
We still get errors similar to the ones i posted;
Jun 5 07:55:11 mdt1 kernel: LustreError:
3420:0:(llog_obd.c:226:llog_add()) Skipped 261 previous similar
messages
Jun 5 07:55:11 mdt1 kernel: LustreError:
3420:0:(lov_log.c:118:lov_llog_origin_add()) Can't add
Hi all,
After a mdt-server-crash we decided to upgrade to 2.6.22+1.6.7 ( to
solve some other problems we've had before ) from 2.6.18+1.6.6.1 we
got this errors in dmesg on MDT:
LustreError: 3429:0:(llog_obd.c:226:llog_add()) No ctxt
LustreError: 3429:0:(lov_log.c:118:lov_llog_origin_add()) Can't
1.6.7 is known to corrupt the MDT and was pulled from the download
site. Please make sure you are using 1.6.7.1 and not 1.6.7.
Kevin
Timh Bergström wrote:
Hi all,
After a mdt-server-crash we decided to upgrade to 2.6.22+1.6.7 ( to
solve some other problems we've had before ) from
Hello and thanks for the reply,
I'm 99% sure we are running 1.6.7.1; when was it released, btw? I've
mailed the package maintainer to be sure.
Provided we run 1.6.7.1 and still get these errors, what should we
do to get rid of them? Do they indicate some serious error(s)? Or
would a simple fsck
Simon Latapie wrote:
Greetings,
I currently have a lustre system with 1 MDS, 2 OSS with 2 OSTs each, and
37 lustre clients (1 login and 36 compute nodes), all using infiniband
as lustre network (o2ib). All nodes are on 1.6.5.1 patched kernel.
as lustre network (o2ib). All nodes are on 1.6.5.1 patched kernel.
There is a network error (no packet loss
Hello!
On Mar 30, 2009, at 7:06 AM, Simon Latapie wrote:
I currently have a lustre system with 1 MDS, 2 OSS with 2 OSTs each,
and
37 lustre clients (1 login and 36 compute nodes), all using infiniband
as lustre network (o2ib). All nodes are on 1.6.5.1 patched kernel.
For the past two
Dennis,
You haven't provided enough context for people to help.
What have you done to determine if the IB fabric is working properly?
What are hostnames and NIDs for the 10 servers (lctl list_nids)?
Which OSTs are on which servers?
OST4 is on a machine at 192.168.16.23
What machine is
On 3/25/09 11:12 AM, Kevin Van Maren kevin.vanma...@sun.com wrote:
Dennis,
You haven't provided enough context for people to help.
What have you done to determine if the IB fabric is working properly?
Basic functionality appears to be there. I can lctl ping between all
servers. I have
Hi,
I have encountered an issue with Lustre that has happened a couple of times
now. I am beginning to suspect an issue with the IB fabric but wanted to
reach out to the list to confirm my suspicions. The odd part is that even
when the MDS complains that it cannot connect to a given ost, lctl
On Jan 12, 2009 18:21 +, Gonçalo Borges wrote:
- It seems my clients are not able to reach my mdt. If you do a dmesg
in a client linux machine, you will get:
---*---
LustreError: 11-0: an error occurred while communicating with
172.30.1@tcp. The mds_getxattr operation failed with
Hi All...
I'm having the following problems:
- It seems my clients are not able to reach my mdt. If you do a dmesg
in a client linux machine, you will get:
---*---
LustreError: 11-0: an error occurred while communicating with
172.30.1@tcp. The mds_getxattr operation failed with -43
On Wed, Sep 24, 2008 at 05:22:55PM -0600, Nathan Dauchy wrote:
Can anyone direct me to documentation to decipher these messages?
What does server_bulk_callback do, and does status -103 indicate a
severe problem for event types 2 and 4?
server_bulk_callback signals the completion of bulk data
On Sep 24, 2008 17:22 -0600, Nathan Dauchy wrote:
We have 4 OSS nodes and 2 MDS nodes configured in HA pairs, running
2.6.18-53.1.14.el5_lustre.1.6.5smp, and using the o2ib network
transport. We had multiple failovers recently (possibly due to hardware
problems, but no root cause yet) and
Greetings,
We have 4 OSS nodes and 2 MDS nodes configured in HA pairs, running
2.6.18-53.1.14.el5_lustre.1.6.5smp, and using the o2ib network
transport. We had multiple failovers recently (possibly due to hardware
problems, but no root cause yet) and managed to get things back again to
what I
On Jul 29, 2008 18:51 +0200, Thomas Roth wrote:
kern.log.1:Jul 20 06:47:19 kernel: LustreError:
27696:0:(upcall_cache.c:326:upcall_cache_get_entry()) acquire timeout
exceeded for key 0
kern.log.1:Jul 20 06:47:41 kernel: LustreError:
27713:0:(upcall_cache.c:326:upcall_cache_get_entry())
Hi all,
I've encountered a LustreError that might have triggered an unwanted
failover of a MGS/MGD -HA-pair of servers. I'm not sure about the
latter, but at least I have not found a trace of that error via Google,
so it might be worth considering.
And it occurred in this form only the two