Re: [lustre-discuss] Error on a zpool underlying an OST

2016-07-15 Thread Kevin Abbey

Hi Bob,

Thank you for the notes.  I began examining the zpool before obtaining 
the new LSI card, but I was unable to start Lustre without the new 
card.  Once I installed the replacement and re-examined the zpools, the 
resilvered pool was re-scrubbed, exported and re-imported, and, to my 
surprise, repaired.  As a further test, I removed the spare disk that 
had replaced the "apparent" bad disk and re-added the disk that had 
been removed.  The zpool resilvered OK and scrubbed clean.  Lustre 
mounted and cleaned a few orphaned blocks, but appeared fully 
functional from the client side.  However, without a "snapshot" (file 
list, md5sums - though ZFS does internal checksums) of the prior status 
I cannot be sure whether a data file was lost.  This is something I'll 
need to address.  Maybe Robinhood can help with this?


Thanks again for the notes.  They will likely be useful in a similar 
scenario.


Kevin



Re: [lustre-discuss] Error on a zpool underlying an OST

2016-07-12 Thread Bob Ball
The answer came offline, and I guess I never replied back to the 
original posting.  This is what I learned.  It deals with only a single 
file, not 1000's.  --bob

---

On Mon, 14 Mar 2016, Bob Ball wrote:

OK, it would seem the affected user has already deleted this file, as 
the "lfs fid2path" returns:

[root@umt3int01 ~]# lfs fid2path /lustre/umt3 [0x22582:0xb5c0:0x0]
fid2path: error on FID [0x22582:0xb5c0:0x0]: No such file or directory

I verified I could do it back and forth using a different file.

I am making one last check, with the OST re-activated (I had set it 
inactive on our MDT/MGS to keep new files off while figuring this out).


Nope, gone.  Time to do the clear and remove the snapshot.

Thanks for your help on this.

bob

On 3/14/2016 10:45 AM, Don Holmgren wrote:

 No, no downside.  The snapshot really is just used so that I can do this
 sort of repair live.

 Once you've found the Lustre OID with "find", for ll_decode_filter_fid to
 work you'll have to then umount the OST and remount as type lustre.

 Good luck!

 Don

Thank you!  This is very helpful.

I have no space to make a snapshot, so I will just umount this OST for a 
bit and remount it zfs.  Our users can take some off-time if we are not 
busy just then.
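(A sketch of that remount cycle, assuming the OST dataset is 
ost-007/ost0030 and its usual mountpoint is /mnt/ost-007/ost0030 - 
both names assumed here:)

  umount /mnt/ost-007/ost0030                 # take the OST out of service
  mount -t zfs ost-007/ost0030 /mnt/snapshot  # browse the OST objects directly
  find /mnt/snapshot/O -inum 182543           # step 1 from Don's notes
  umount /mnt/snapshot
  mount -t lustre ost-007/ost0030 /mnt/ost-007/ost0030   # back in service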


It will be an interesting process.  I'm all set to drain and remake, 
though, should this method not work.  I was putting off starting that 
until later today, as I've other issues just now.  Since it would take 
me 2-3 days total to drain, remake and refill, your detailed method is 
far preferable for me.


Just to be certain, other than the temporary unavailability of the 
Lustre file system, do you see any downside to not working from a snapshot?


bob


On 3/14/2016 10:21 AM, Don Holmgren wrote:

 Hi Bob -

 I only get the lustre-discuss digest, so am not sure how to reply to that
 whole list.  But I can reply directly to you regarding your posting
 (copied at the bottom).

 In the ZFS error message

errors: Permanent errors have been detected in the following files:
  ost-007/ost0030:<0x2c90f>

 0x2c90f is the ZFS inode number of the damaged item.  To turn this into a
 Lustre filename, do the following:

 1. First, you have to use "find" with that inode number to get the
    corresponding Lustre object ID.  I do this via a ZFS snapshot,
    something like:

zfs snapshot ost-007/ost0030@mar14
mount -t zfs ost-007/ost0030@mar14 /mnt/snapshot
find /mnt/snapshot/O -inum 182543

 (note 0x2c90f = 182543 decimal).  This may return something like

/mnt/snapshot/O/0/d22/54

 if indeed the damaged item is a file object.
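
 (The hex-to-decimal conversion can be done directly in the shell:)

printf '%d\n' 0x2c90f    # prints 182543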


 2. OK, assuming the "find" did return a file object like above (in
    this case the Lustre OID of the object is 54), you need to find the
    parent "FID" of that OID.  Do this as follows on the OSS where
    you've mounted the snapshot:

[root@lustrenew3 ~]# ll_decode_filter_fid /mnt/snapshot/O/0/d22/54
/mnt/snapshot/O/0/d22/54: parent=[0x204010a:0x0:0x0] stripe=0


 3. That string "0x204010a:0x0:0x0" is related to the Lustre FID.  You
    can use "lfs fid2path" to convert this to a filename.  "lfs
    fid2path" must be executed on a client of your Lustre filesystem.
    And, on our Lustre, the return string must be slightly altered
    (chopped up differently):

 [root@client ~]# lfs fid2path /djhzlus [0x20400:0x10a:0x0]
 /djhzlus/test/copy1/l6496f21b7075m00155m031/gauge/Coulomb/l6496f21b7075m00155m031-Coul_002

    Here /djhzlus was where the Lustre filesystem was mounted on my
    client.  fid2path takes three numbers; in my case the first was the
    first 9 hex digits of the return from ll_decode_filter_fid, the
    second was the last 5 hex digits (I suppressed the leading zeros),
    and the third was 0x0 (not sure whether this was the 2nd or 3rd
    field from ll_decode_filter_fid).

    You can always use "lfs path2fid" on your Lustre client against
    another file in your filesystem to find the pattern for your FID.

To check that you've indeed found the correct file, you can do
"lfs getstripe" to confirm that the objid matches the Lustre OID you
got with the find.
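
(Something like the following, where the path placeholder stands for 
whatever fid2path returned:)

lfs getstripe <path-returned-by-fid2path>   # the objid column should match the OID (54 above)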


 Once you figure out the bad file, you can delete it from Lustre, and then
 use "zpool clear ost-007" to clear the reporting of
  ost-007/ost0030:<0x2c90f>
 Don't forget to umount and delete your ZFS snapshot of the OST with the
 bad file.
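
 (Pulled together, the cleanup might look like this - the file path is
 hypothetical, the snapshot name is from step 1:)

rm /djhzlus/path/to/damaged/file     # on a Lustre client
zpool clear ost-007                  # on the OSS: clear the error report
umount /mnt/snapshot
zfs destroy ost-007/ost0030@mar14    # drop the snapshot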


 I should mention that I found a Python script ("zfsobj2fid") somewhere
 that returns the FID directly, using the ZFS debugger ("zdb") against
 the mounted OST.  You can probably google for zfsobj2fid; if you
 can't find it let me know and I'll dig around to see if I still have a
 copy.  Here's how I used it to get the FID for "lfs fid2path":

 [root@lustrenew3 ~]# ./zfsobj2fid zp2/ost2 0x113
 [0x204010a:0x0:0x0]

 (my OID was 0x113, my pool was "zp2" and the ZFS OST was "ost2"). But,
 note that the 

[lustre-discuss] Error on a zpool underlying an OST

2016-07-11 Thread Kevin Abbey

Hi,

Can anyone advise how to clean up 1000s of ZFS-level permanent errors, 
and the corresponding Lustre-level damage too?


A similar question was presented on the list but I did not see an answer.
https://www.mail-archive.com/lustre-discuss@lists.lustre.org/msg12454.html

As I was testing new hardware, I discovered an LSI HBA was bad.  On a 
single combined MDS/OSS there were 8 OSTs split across 2 JBODs and 2 
LSI HBAs.  The MDT was on a 3rd JBOD, downlinked from the JBOD 
connected to the bad controller.  The zpools connected to the good HBA 
scrubbed clean after unmounting and stopping Lustre.  The zpools on the 
bad controller continued to have errors while connected to it.  One of 
these OSTs reported a disk failure during the scrub and began 
resilvering, even though autoreplace was off.  This is a very bad event 
considering the card was causing all of the errors.  Neither a scrub 
nor a resilver would ever complete.

I stopped the scrub on the 3 other OSTs and detached the spare from the 
OST that was resilvering.  After narrowing down the bad HBA (initially 
it was not clear whether cables or JBOD backplanes were bad), I used 
the good HBA to scrub JBOD 1 again, then shut down and disconnected 
JBOD 1.  I then connected JBOD 2 to the good controller to scrub the 
JBOD 2 zpools, which had previously been attached to the bad LSI 
controller.  The 3 zpools whose scrubs had been stopped earlier 
completed successfully.  The one that had begun resilvering resilvered 
again after I initiated a replace of the failed disk with the spare.  
The resilver completed, but many permanent errors were discovered on 
the zpool.  Since this is a test pool, I was interested to know whether 
ZFS would recover.  In a real scenario with HW problems, I'll shut down 
and disconnect the data drives prior to HW testing.


The status listed below shows a new scrub in progress after the 
resilver completed.  The cache drive is missing because the 3rd JBOD is 
temporarily disconnected.



===

ZFS:    v0.6.5.7-1
Lustre: 2.8.55
kernel: 2.6.32-642.1.1.el6.x86_64
CentOS 6.8


===
  ~]# zpool status -v test-ost4
  pool: test-ost4
 state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub in progress since Mon Jul 11 22:29:09 2016
689G scanned out of 12.4T at 711M/s, 4h49m to go
40K repaired, 5.41% done
config:

        NAME                                   STATE     READ WRITE CKSUM
        test-ost4                              ONLINE       0     0   180
          raidz2-0                             ONLINE       0     0   360
            ata-ST4000NM0033-9ZM170_Z1Z7GYXY   ONLINE       0     0     2  (repairing)
            ata-ST4000NM0033-9ZM170_Z1Z7KKPQ   ONLINE       0     0     3  (repairing)
            ata-ST4000NM0033-9ZM170_Z1Z7L5E7   ONLINE       0     0     3  (repairing)
            ata-ST4000NM0033-9ZM170_Z1Z7KGQT   ONLINE       0     0     0  (repairing)
            ata-ST4000NM0033-9ZM170_Z1Z7LA8K   ONLINE       0     0     4  (repairing)
            ata-ST4000NM0033-9ZM170_Z1Z7KB0X   ONLINE       0     0     3  (repairing)
            ata-ST4000NM0033-9ZM170_Z1Z7JSMN   ONLINE       0     0     2  (repairing)
            ata-ST4000NM0033-9ZM170_Z1Z7KXRA   ONLINE       0     0     2  (repairing)
            ata-ST4000NM0033-9ZM170_Z1Z7MLSN   ONLINE       0     0     2  (repairing)
            ata-ST4000NM0033-9ZM170_Z1Z7L4DT   ONLINE       0     0     7  (repairing)

        cache
          ata-D2CSTK251M20-0240_A19CV01122792  UNAVAIL      0     0     0

errors: Permanent errors have been detected in the following files:

test-ost4/test-ost4:<0xe00>
test-ost4/test-ost4:<0xe01>
test-ost4/test-ost4:<0xe02>
test-ost4/test-ost4:<0xe03>
test-ost4/test-ost4:<0xe04>
test-ost4/test-ost4:<0xe05>
test-ost4/test-ost4:<0xe06>...
...
...continues..
...
...
test-ost4/test-ost4:<0xdfe>
test-ost4/test-ost4:<0xdff>
===

Follow-up questions:

Is it better not to have a spare attached to the pool, to prevent 
resilvering in this scenario?  (Bad HBA, disk failed during scrub, 
resilver began, yet autoreplace was off.  The spare was assigned to 
the zpool.)
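
(Note that the autoreplace property only governs automatic replacement 
of a device in the same physical slot; the ZFS event daemon can still 
activate an assigned hot spare when a disk faults, which would explain 
a resilver starting with autoreplace=off.  To check, and to detach a 
spare - spare name assumed:)

zpool get autoreplace test-ost4        # off by default
zpool remove test-ost4 <spare-disk>    # remove the assigned hot spare from the pool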


With dual paths to the JBOD, would the bad HBA card be disabled 
automatically to prevent I/O errors from reaching the disks?  The 
current setup is single-path only.
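
(A minimal dm-multipath setup sketch for EL6, as one point of 
comparison; multipathd fails over to the surviving path when one path 
errors out, though it cannot catch an HBA that silently corrupts data 
it successfully returns:)

yum install device-mapper-multipath
mpathconf --enable          # writes a default /etc/multipath.conf
service multipathd start
multipath -ll               # verify both paths to each disk are visible

The zpools would then be built on the /dev/mapper/mpath* devices rather 
than on single-path /dev/disk/by-path names.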



Thank you for any notes in advance,
Kevin

--
Kevin Abbey
Systems Administrator
Center for Computational and Integrative Biology (CCIB)
http://ccib.camden.rutgers.edu/

Rutgers University - Science Building
315 Penn St.
Camden, NJ 08102
Telephone: (856) 225-6770
Fax:(856) 225-6312
Email: kevin.ab...@rutgers.edu


Re: [lustre-discuss] Error on a zpool underlying an OST

2016-03-11 Thread Alexander I Kulyavtsev

You lost one "file" only:
0x2c90f

I would take a ZFS snapshot on the OST, mount it as ZFS, and try to find 
the Lustre FID of the file.

If that does not work, I guess zdb with a high verbosity level can help 
to pinpoint the broken ZFS object, as in "zdb: Examining ZFS At 
Point-Blank Range," and see what it is (a plain ZFS file or something 
else).
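
(For example - dataset name taken from Bob's zpool output, object 
number given in decimal:)

zdb -dddd ost-007/ost0030 182543    # 182543 = 0x2c90f; -dddd dumps the object's dnode and attributes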

Knowing the ZFS version would be helpful.

Alex

On Mar 11, 2016, at 7:19 PM, Bob Ball wrote:

errors: Permanent errors have been detected in the following files:

   ost-007/ost0030:<0x2c90f>


Re: [lustre-discuss] Error on a zpool underlying an OST

2016-03-11 Thread Fred Liu
You may try the recovery options of "zpool import" (they rarely help), 
but rebuilding the zpool is very likely necessary.
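
(The rewind recovery option, for reference - it discards the last few 
seconds of transactions, and is unlikely to help with checksum-level 
damage inside files:)

zpool export ost-007
zpool import -F ost-007    # -F: rewind to the last importable transaction group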

Thanks.

Fred








[lustre-discuss] Error on a zpool underlying an OST

2016-03-11 Thread Bob Ball
Hi, we have Lustre 2.7.58 in place on our OST and MDT/MGS (combined).  
Underlying the Lustre file system is a raidz2 ZFS pool.

A few days ago, we lost 2 disks at once from the raidz2.  I replaced 
one and a resilver started, but that seemed to choke.  So, I put back 
both disks with replacements, and the new resilver now shows the 
following.


[root@umdist03 ~]# zpool status -v ost-007
  pool: ost-007
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: resilvered 972G in 9h25m with 1 errors on Fri Mar 11 19:12:37 2016
config:

        NAME                                  STATE     READ WRITE CKSUM
        ost-007                               DEGRADED     0     0     1
          raidz2-0                            DEGRADED     0     0     4
            replacing-0                       DEGRADED     0     0     0
              18280868502819750645            UNAVAIL      0     0     0  was /dev/disk/by-path/pci-0000:0c:00.0-scsi-0:2:20:0-part1/old
              pci-0000:0c:00.0-scsi-0:2:20:0  ONLINE       0     0     0
            pci-0000:0c:00.0-scsi-0:2:21:0    ONLINE       0     0     0
            pci-0000:0c:00.0-scsi-0:2:22:0    ONLINE       0     0     0
            pci-0000:0c:00.0-scsi-0:2:23:0    ONLINE       0     0     0
            pci-0000:0c:00.0-scsi-0:2:24:0    ONLINE       0     0     0
            pci-0000:0c:00.0-scsi-0:2:35:0    ONLINE       0     0     0
            pci-0000:0c:00.0-scsi-0:2:36:0    ONLINE       1     0     0
            pci-0000:0c:00.0-scsi-0:2:37:0    ONLINE       0     0     0
            pci-0000:0c:00.0-scsi-0:2:38:0    ONLINE       0     0     0
            replacing-9                       UNAVAIL      0     0     0
              14369532488179106769            UNAVAIL      0     0     0  was /dev/disk/by-path/pci-0000:0c:00.0-scsi-0:2:39:0-part1/old
              pci-0000:0c:00.0-scsi-0:2:39:0  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        ost-007/ost0030:<0x2c90f>

What are my options here?  If I don't care about the file, can I 
identify it and then just delete it?  Or is my only real option to 
drain the pool and rebuild it cleanly?


Thanks for any help/advice.

bob