Re: [ceph-users] Power outages!!! help!

2017-09-28 Thread Ronny Aasen
On 28. sep. 2017 18:53, hjcho616 wrote: Yay! Finally, after almost exactly one month, I am able to mount the drive! Now it's time to see how my data is doing. =P Doesn't look too bad though. Got to love open source. =) I downloaded the ceph source code, built it, then tried to run

Re: [ceph-users] Power outages!!! help!

2017-09-28 Thread hjcho616
Yay! Finally, after almost exactly one month, I am able to mount the drive!  Now it's time to see how my data is doing. =P  Doesn't look too bad though. Got to love open source. =)  I downloaded the ceph source code, built it, then tried to run ceph-objectstore-tool export on that osd.4.
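
A sketch of that step for reference, assuming the rescued XFS partition shows up as /dev/sdc1 (a placeholder) and is mounted at the old OSD path while the OSD daemon is stopped; pg 1.28 is one of the PGs discussed elsewhere in the thread:

  mount /dev/sdc1 /var/lib/ceph/osd/ceph-4            # mount the rescued filestore partition
  ceph-objectstore-tool --op export --pgid 1.28 \
      --data-path /var/lib/ceph/osd/ceph-4 \
      --journal-path /var/lib/ceph/osd/ceph-4/journal \
      --skip-journal-replay --file 1.28.export        # dump the PG to a portable file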

Re: [ceph-users] Power outages!!! help!

2017-09-22 Thread hjcho616
Ronny, Could you help me with this log?  I got this with debug osd=20 filestore=20 ms=20.  This one is from running "ceph pg repair 2.7".  This is one of the smaller PGs, so the log is smaller.  Others have similar errors.  I can see the lines with ERR, but other than that is there something I should
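
For reference, a minimal sketch of raising those debug levels on one running OSD before the repair (osd.4 is just an example id; the same values can be set under [osd] in ceph.conf):

  ceph tell osd.4 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 20'
  ceph pg repair 2.7
  grep ERR /var/log/ceph/ceph-osd.4.log    # the ERR lines name the objects the repair could not reconcile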

Re: [ceph-users] Power outages!!! help!

2017-09-20 Thread hjcho616
# rados list-inconsistent-pg data
["0.0","0.5","0.a","0.e","0.1c","0.29","0.2c"]
# rados list-inconsistent-pg metadata
["1.d","1.3d"]
# rados list-inconsistent-pg rbd
["2.7"]
# rados list-inconsistent-obj 0.0 --format=json-pretty
{
    "epoch": 23112,
    "inconsistents": []
}
# rados
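
The usual follow-up for each PG flagged above is list-inconsistent-obj, which names the objects and the shard/OSD holding the bad copy (Jewel-era syntax, matching the output above):

  rados list-inconsistent-obj 2.7 --format=json-pretty     # the inconsistent PG in the rbd pool
  rados list-inconsistent-obj 1.d --format=json-pretty     # repeat for each PG listed per pool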

Re: [ceph-users] Power outages!!! help!

2017-09-20 Thread hjcho616
Thanks Ronny.  I'll try that inconsistent issue soon.  I think the OSD drive that PG 1.28 is sitting on is still ok... just some file corruption from when the power outage happened.. =P  As you suggested:
cd /var/lib/ceph/osd/ceph-4/current/
tar --xattrs --preserve-permissions -zcvf osd.4.tar.gz

Re: [ceph-users] Power outages!!! help!

2017-09-20 Thread Ronny Aasen
I would only tar the PG you have missing objects from; trying to inject older objects when the PG is correct cannot be good. Scrub errors are kind of the issue with only 2 replicas: when you have 2 different objects, how do you know which one is correct and which one is bad.. and as you have read
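
In filestore terms that means archiving only that PG's directory under current/, e.g. for pg 1.28 (a sketch; the 1.28_head name follows the usual filestore layout rather than being copied from the thread):

  cd /var/lib/ceph/osd/ceph-4/current/
  tar --xattrs --preserve-permissions -zcvf pg1.28.tar.gz 1.28_head    # only the PG with missing objects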

Re: [ceph-users] Power outages!!! help!

2017-09-20 Thread hjcho616
Thanks Ronny. I decided to try to tar everything under the current directory.  Is this the correct command for it?  Are there any directories we do not want on the new drive?  commit_op_seq, meta, nosnap, omap?
tar --xattrs --preserve-permissions -zcvf osd.4.tar.gz .
As far as inconsistent PGs... I am

Re: [ceph-users] Power outages!!! help!

2017-09-20 Thread Ronny Aasen
On 20.09.2017 16:49, hjcho616 wrote: Anyone?  Can this PG be saved?  If not, what are my options? Regards, Hong On Saturday, September 16, 2017 1:55 AM, hjcho616 wrote: Looking better... working on scrubbing.. HEALTH_ERR 1 pgs are stuck inactive for more than 300

Re: [ceph-users] Power outages!!! help!

2017-09-20 Thread hjcho616
Anyone?  Can this PG be saved?  If not, what are my options? Regards, Hong
On Saturday, September 16, 2017 1:55 AM, hjcho616 wrote: Looking better... working on scrubbing..
HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds; 1 pgs incomplete; 12 pgs

Re: [ceph-users] Power outages!!! help!

2017-09-16 Thread hjcho616
Looking better... working on scrubbing..
HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds; 1 pgs incomplete; 12 pgs inconsistent; 2 pgs repair; 1 pgs stuck inactive; 1 pgs stuck unclean; 109 scrub errors; too few PGs per OSD (29 < min 30); mds rank 0 has failed; mds cluster is

Re: [ceph-users] Power outages!!! help!

2017-09-15 Thread hjcho616
After running ceph osd lost osd.0, it started backfilling... I figured that was supposed to happen earlier when I added those missing PGs.  Running into "too few PGs per OSD": I removed OSDs after the cluster stopped working when I added OSDs, but I guess I still needed them.  Currently I see

Re: [ceph-users] Power outages!!! help!

2017-09-15 Thread Ronny Aasen
You write you had all PGs exported except one, so I assume you have injected those PGs into the cluster again using the method linked a few times in this thread. How did that go, were you successful in recovering those PGs? Kind regards, Ronny Aasen On 15. sep. 2017 07:52, hjcho616

Re: [ceph-users] Power outages!!! help!

2017-09-14 Thread hjcho616
I just did this and backfilling started.  Let's see where this takes me.
ceph osd lost 0 --yes-i-really-mean-it
Regards, Hong
On Friday, September 15, 2017 12:44 AM, hjcho616 wrote: Ronny, Working with all of the pgs shown in the "ceph health detail", I ran below for

Re: [ceph-users] Power outages!!! help!

2017-09-14 Thread hjcho616
Ronny, Working with all of the PGs shown in "ceph health detail", I ran the command below for each PG to export it:
ceph-objectstore-tool --op export --pgid 0.1c --data-path /var/lib/ceph/osd/ceph-0 --journal-path /var/lib/ceph/osd/ceph-0/journal --skip-journal-replay --file 0.1c.export
I have all PGs
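
The matching import step on the receiving OSD would look roughly like this (a sketch, not taken from the thread; the destination OSD has to be stopped first and the file is the export produced above):

  systemctl stop ceph-osd@4
  ceph-objectstore-tool --op import --data-path /var/lib/ceph/osd/ceph-4 \
      --journal-path /var/lib/ceph/osd/ceph-4/journal --file 0.1c.export
  systemctl start ceph-osd@4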

Re: [ceph-users] Power outages!!! help!

2017-09-13 Thread hjcho616
Ronny, Just tried hooking osd.0 back up.  osd.0 seems to be in better shape, as I was able to run ceph-objectstore-tool export on it, so I decided to try hooking it up.  Looks like the journal is not happy.  Is there any way to get this running?  Or do I need to start getting data out using ceph-objectstore-tool?

Re: [ceph-users] Power outages!!! help!

2017-09-13 Thread Ronny Aasen
On 13. sep. 2017 07:04, hjcho616 wrote: Ronny, Did a bunch of "ceph pg repair <pg#>" and got the scrub errors down to 10... well, it was 9; trying to fix one made it 10.. waiting for it to fix (I did that noout trick as I only have two copies). 8 of those scrub errors look like they would need data

Re: [ceph-users] Power outages!!! help!

2017-09-12 Thread hjcho616
Ronny, Did a bunch of "ceph pg repair <pg#>" and got the scrub errors down to 10... well, it was 9; trying to fix one made it 10.. waiting for it to fix (I did that noout trick as I only have two copies).  8 of those scrub errors look like they would need data from osd.0. HEALTH_ERR 22 pgs are stuck
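
The "noout trick" referred to above is just the cluster flag that stops down OSDs from being marked out and their data re-replicated while you work on them:

  ceph osd set noout      # before taking an OSD down for repair
  ceph osd unset noout    # once the OSD is back up and in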

Re: [ceph-users] Power outages!!! help!

2017-09-12 Thread hjcho616
Thank you for those references!  I'll have to go study some more.  A good portion of the inconsistencies seems to be from missing data on osd.0. =P  There appear to be some from okay drives too. =P  Kicked off "ceph pg repair <pg#>" a few times, but it doesn't seem to change much yet. =P  As far as smart

Re: [ceph-users] Power outages!!! help!

2017-09-12 Thread Ronny Aasen
You can start by posting more details, at least "ceph osd tree", "cat ceph.conf" and "ceph osd df", so we can see what settings you are running and how your cluster is balanced at the moment. Generally: inconsistent PGs are PGs that have scrub errors. Use rados list-inconsistent-pg [pool]
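
For anyone following along, these are the commands being asked for:

  ceph osd tree             # hosts/OSDs, weights, and up/down state
  ceph osd df               # per-OSD utilisation, to spot nearly-full OSDs
  cat /etc/ceph/ceph.conf   # the settings the cluster was deployed with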

Re: [ceph-users] Power outages!!! help!

2017-09-10 Thread hjcho616
It took a while.  It appears to have cleaned up quite a bit... but it still has issues.  I've been seeing the message below for more than a day, and CPU utilization and IO utilization are low... looks like something is stuck.  I rebooted OSDs several times when it looked like it was stuck earlier and

Re: [ceph-users] Power outages!!! help!

2017-09-04 Thread hjcho616
Hmm.. I hope I don't really need anything from osd.0. =P
# ceph-objectstore-tool --op export --pgid 2.35 --data-path /var/lib/ceph/osd/ceph-0 --journal-path /var/lib/ceph/osd/ceph-0/journal --file 2.35.export
Failure to read OSD superblock: (2) No such file or directory
# ceph-objectstore-tool

Re: [ceph-users] Power outages!!! help!

2017-09-04 Thread hjcho616
Ronny, While letting the cluster replicate (looks like this might take a while), I decided to look into where those PGs are missing.. From "ceph health detail" I found the PGs that are unfound, then found the directories that contain those PGs, pasted to the right of the detail message below.. pg 2.35 is
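
A sketch of the usual unfound-object workflow for one of those PGs (pg 2.35 from the message above); marking objects lost is a last resort, only once no surviving OSD can supply them:

  ceph health detail | grep unfound
  ceph pg 2.35 list_unfound                  # which objects are missing and which OSDs were queried
  ceph pg 2.35 mark_unfound_lost revert      # or 'delete'; only after exhausting recovery options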

Re: [ceph-users] Power outages!!! help!

2017-09-04 Thread hjcho616
Thank you Ronny.  I've added two OSDs to OSD2, 2TB each.  I hope that will be enough. =)  I've changed min_size and size to 2.  The OSDs are busy rebalancing again.  I'll try what you recommended and will get back to you with more questions! =)
# ceph osd tree
ID WEIGHT   TYPE NAME      UP/DOWN

Re: [ceph-users] Power outages!!! help!

2017-09-03 Thread Ronny Aasen
I would not even attempt to connect a recovered drive to ceph, especially not one that has had xfs errors and corruption. Your undersized PGs lead me to believe you still need to either expand, with more disks or nodes, or set "osd crush chooseleaf type = 0" to
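
That setting changes the default CRUSH failure domain from host to OSD, so both replicas may land on the same host; note that on an already-deployed cluster the equivalent change normally has to be made in the CRUSH rule itself, not just ceph.conf. A sketch of the config form:

  # ceph.conf, [global] section
  osd crush chooseleaf type = 0    # 0 = osd, 1 = host (the default)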

Re: [ceph-users] Power outages!!! help!

2017-09-02 Thread hjcho616
I checked with ceph-2, 3, 4, 5, so I figured it was safe to assume that the superblock file is the same.  I copied it over and started the OSD.  It still fails with the same error message.  Looks like when I updated to 10.2.9, some OSDs needed to be updated and that process is not finding the data it

Re: [ceph-users] Power outages!!! help!

2017-09-01 Thread hjcho616
Just realized there is a file called superblock in the ceph directory.  ceph-1 and ceph-2's superblock files are identical, and ceph-6 and ceph-7's are identical, but not between the two groups.  When I originally created the OSDs, I created ceph-0 through 5.  Can the superblock file be copied over from

Re: [ceph-users] Power outages!!! help!

2017-09-01 Thread hjcho616
Tried connecting the recovered osd.  Looks like some of the files in lost+found are superblocks.  Below is the log.  What can I do about this?
2017-09-01 22:27:27.634228 7f68837e5800  0 set uid:gid to 1001:1001 (ceph:ceph)
2017-09-01 22:27:27.634245 7f68837e5800  0 ceph version 10.2.9

Re: [ceph-users] Power outages!!! help!

2017-09-01 Thread hjcho616
Found the partition, but wasn't able to mount it right away... Did an xfs_repair on that drive.  Got a bunch of messages like this.. =(
entry "10a89fd.__head_AE319A25__0" in shortform directory 845908970 references non-existent inode 605294241
junking entry

Re: [ceph-users] Power outages!!! help!

2017-09-01 Thread hjcho616
Looks like it has been rescued... Only 1 error, as we saw before in the SMART log!
# ddrescue -f /dev/sda /dev/sdc ./rescue.log
GNU ddrescue 1.21
Press Ctrl-C to interrupt
     ipos:    1508 GB, non-trimmed:        0 B,  current rate:       0 B/s
     opos:    1508 GB, non-scraped:        0 B,
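
The point of keeping ./rescue.log is that ddrescue can be re-run against the same devices and will use that map to retry only the bad areas; a typical follow-up pass looks like:

  ddrescue -f -d -r3 /dev/sda /dev/sdc ./rescue.log    # -d direct I/O, -r3 retry bad sectors three times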

Re: [ceph-users] Power outages!!! help!

2017-08-30 Thread Ronny Aasen
On 30.08.2017 15:32, Steve Taylor wrote: I'm not familiar with dd_rescue, but I've just been reading about it. I'm not seeing any features that would be beneficial in this scenario that aren't also available in dd. What specific features give it "really a far better chance of restoring a copy

Re: [ceph-users] Power outages!!! help!

2017-08-30 Thread Steve Taylor
I'm not familiar with dd_rescue, but I've just been reading about it. I'm not seeing any features that would be beneficial in this scenario that aren't also available in dd. What specific features give it "really a far better chance of restoring a copy of your disk" than dd? I'm always

Re: [ceph-users] Power outages!!! help!

2017-08-30 Thread Steve Taylor
Yes, if I had created the RBD in the same cluster I was trying to repair then I would have used rbd-fuse to "map" the RBD in order to avoid potential deadlock issues with the kernel client. I had another cluster available, so I copied its config file to the osd node, created the RBD in the
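
A rough sketch of that arrangement, assuming the second cluster's config was copied to /etc/ceph/remote.conf and using made-up pool/image names rather than the ones actually used:

  rbd -c /etc/ceph/remote.conf create rbd/osd0-rescue --size 2097152   # size in MB, roughly 2 TB
  rbd -c /etc/ceph/remote.conf map rbd/osd0-rescue                     # kernel-maps the image from the other cluster
  mkfs.xfs /dev/rbd0
  mount /dev/rbd0 /var/lib/ceph/osd/ceph-0                             # then rebuild the failed OSD on top of it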

Re: [ceph-users] Power outages!!! help!

2017-08-30 Thread Ronny Aasen
[snip]
I'm not sure I like what I see from fdisk... it doesn't show sdb1. I hope it shows up when I run dd_rescue to the other drive... =P
# fdisk /dev/sdb
Welcome to fdisk (util-linux 2.25.2).
Changes will remain in memory only, until you decide to write them.
Be careful before using

Re: [ceph-users] Power outages!!! help!

2017-08-29 Thread hjcho616
This is what it looks like today.  Seems like the ceph-osds are sitting at 0% CPU, so all the migrations appear to be done.  Does this look OK to shut down and continue when I get the HDD on Thursday?
# ceph health
HEALTH_ERR 22 pgs are stuck inactive for more than 300 seconds; 20 pgs

Re: [ceph-users] Power outages!!! help!

2017-08-29 Thread Tomasz Kusmierz
Maged, on the second host he has 4 out of 5 OSDs failed on him … I think he’s past trying to increase the backfill threshold :) Of course he could try to degrade the cluster by letting it mirror within the same host :)
> On 29 Aug 2017, at 21:26, Maged Mokhtar wrote:
>
> One of the

Re: [ceph-users] Power outages!!! help!

2017-08-29 Thread Maged Mokhtar
One of the things to watch out for in small clusters is that OSDs can get full rather unexpectedly during recovery/backfill: in your case you have 2 OSD nodes with 5 disks each. Since you have a replica count of 2, each PG will have 1 copy on each host, so if an OSD fails, all its PGs will have to be

Re: [ceph-users] Power outages!!! help!

2017-08-29 Thread Tomasz Kusmierz
Just FYI, setting size and min_size to 1 is a last resort in my mind - to get you out of dodge !! Before setting that, you should have made yourself 105% certain that all the OSDs you leave ON have NO bad sectors, no pending sectors, and no errors of any kind. Once you can mount the

Re: [ceph-users] Power outages!!! help!

2017-08-29 Thread Willem Jan Withagen
On 29-8-2017 19:12, Steve Taylor wrote:
> Hong,
>
> Probably your best chance at recovering any data without special,
> expensive, forensic procedures is to perform a dd from /dev/sdb to
> somewhere else large enough to hold a full disk image and attempt to
> repair that. You'll want to use

Re: [ceph-users] Power outages!!! help!

2017-08-29 Thread hjcho616
Nice!  Thank you for the explanation!  I feel like I can revive that OSD. =)  That does sound great.  I don't quite have another cluster, so I'm waiting for a drive to arrive! =)  After setting size and min_size to 1, it looks like the toofull flag is gone... Maybe when I was making that video copy OSDs

Re: [ceph-users] Power outages!!! help!

2017-08-29 Thread David Turner
But it was absolutely awesome to run an OSD off of an RBD after the disk failed.
On Tue, Aug 29, 2017, 1:42 PM David Turner wrote:
> To add to Steve's success, the rbd was created in a second cluster in the
> same datacenter so it didn't run the risk of deadlocking that

Re: [ceph-users] Power outages!!! help!

2017-08-29 Thread David Turner
To add to Steve's success: the RBD was created in a second cluster in the same datacenter, so it didn't run the risk of the deadlock that mapping RBDs on machines running OSDs has. It should still work in theory on the same cluster, but it is inherently more dangerous for a few reasons. On Tue, Aug 29,

Re: [ceph-users] Power outages!!! help!

2017-08-29 Thread Steve Taylor
Hong, Probably your best chance at recovering any data without special, expensive, forensic procedures is to perform a dd from /dev/sdb to somewhere else large enough to hold a full disk image and attempt to repair that. You'll want to use 'conv=noerror' with your dd command since your disk is
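
A minimal form of that dd invocation (the destination path is illustrative); conv=sync is worth adding so unreadable blocks are padded and offsets in the image stay aligned with the source disk:

  dd if=/dev/sdb of=/mnt/backup/sdb.img bs=64K conv=noerror,sync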

Re: [ceph-users] Power outages!!! help!

2017-08-28 Thread Tomasz Kusmierz
Rule of thumb with batteries is:
- the closer to the “proper temperature” you run them at, the more life you get out of them
- the more the battery is overpowered for your application, the longer it will survive.
Get yourself an LSI 94** controller and use it as an HBA and you will be fine. But get MORE DRIVES ! …

Re: [ceph-users] Power outages!!! help!

2017-08-28 Thread hjcho616
Thank you Tomasz and Ronny.  I'll have to order some HDDs soon and try these out.  The car battery idea is nice!  I may try that.. =)  Do they last longer?  The ones that fit the original UPS battery spec didn't last very long... part of the reason why I gave up on them.. =P  My wife probably won't like

Re: [ceph-users] Power outages!!! help!

2017-08-28 Thread Tomasz Kusmierz
Sorry for being brutal … anyway:
1. get the battery for the UPS (a car battery will do as well; I’ve modded a UPS in the past with a truck battery and it was working like a charm :D)
2. get spare drives and put those in, because your cluster CAN NOT get out of error due to lack of space
3. Follow

Re: [ceph-users] Power outages!!! help!

2017-08-28 Thread hjcho616
Tomasz, Those machines are behind a surge protector.  Doesn't appear to be a good one!   I do have a UPS... but it is my fault... no battery.  Power was pretty reliable for a while... and UPS was just beeping every chance it had, disrupting some sleep.. =P  So running on surge protector only.  I

Re: [ceph-users] Power outages!!! help!

2017-08-28 Thread Ronny Aasen
> [SNIP - bad drives]
Generally, when a disk is displaying bad blocks to the OS, the drive has been remapping blocks for ages in the background and the disk is really on its last legs. It's a bit unlikely that you get so many disks dying at the same time, though, but the problem could have been

Re: [ceph-users] Power outages!!! help!

2017-08-28 Thread Tomasz Kusmierz
So to decode a few things about your disk:
  1 Raw_Read_Error_Rate      0x002f   100   100   051   Pre-fail   Always   -   37
37 read errors and only one sector marked as pending - fun disk :/
181 Program_Fail_Cnt_Total   0x0022   099   099   000   Old_age    Always   -   35325174
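
Those attribute lines come straight from smartctl; a quick way to pull the interesting counters and start a self-test on a suspect disk (device name is a placeholder):

  smartctl -a /dev/sdb | grep -E 'Raw_Read_Error|Reallocated|Current_Pending|Offline_Uncorrectable'
  smartctl -t long /dev/sdb      # long self-test; check the result later with: smartctl -l selftest /dev/sdb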

Re: [ceph-users] Power outages!!! help!

2017-08-28 Thread Tomasz Kusmierz
I think you are looking at something more like this :

Re: [ceph-users] Power outages!!! help!

2017-08-28 Thread hjcho616
So.. could doing something like this potentially bring it back to life? =)
Analyzing a Faulty Hard Disk using Smartctl - Thomas-Krenn-Wiki
On

Re: [ceph-users] Power outages!!! help!

2017-08-28 Thread Tomasz Kusmierz
I think you’ve got your answer:
197 Current_Pending_Sector   0x0032   100   100   000   Old_age   Always   -   1
> On 28 Aug 2017, at 21:22, hjcho616 wrote:
>
> Steve,
>
> I thought that was odd too..
>
> Below is from the log. This captures the transition from good

Re: [ceph-users] Power outages!!! help!

2017-08-28 Thread hjcho616
Steve, I thought that was odd too..  Below is from the log; this captures the transition from good to bad. Looks like there is "Device: /dev/sdb [SAT], 1 Currently unreadable (pending) sectors".  And it looks like I did a repair with /dev/sdb1... =P
# grep sdb syslog.1
Aug 27 06:27:22 OSD1 smartd[1031]:

Re: [ceph-users] Power outages!!! help!

2017-08-28 Thread Steve Taylor
I'm jumping in a little late here, but running xfs_repair on your partition can't frag your partition table. The partition table lives outside the partition block device and xfs_repair doesn't have access to it when run against /dev/sdb1. I haven't actually tested it, but it seems unlikely that
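
When in doubt, the safe ordering with xfs_repair is a dry run first, with -L (zero the log) only as a last resort since it discards unreplayed metadata updates:

  xfs_repair -n /dev/sdb1    # no-modify mode: only report what would be fixed
  xfs_repair /dev/sdb1       # normal repair, replays the log if it can
  xfs_repair -L /dev/sdb1    # last resort: zeroes the log, may lose recent metadata changes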

Re: [ceph-users] Power outages!!! help!

2017-08-28 Thread hjcho616
Tomasz, Looks like when I did xfs_repair -L /dev/sdb1 it did something to the partition table and I don't see /dev/sdb1 anymore... or maybe I missed the 1 in /dev/sdb1? =(  Yes.. that extra power outage did pretty good damage... =P  I am hoping 0.007% is very small... =P  Any recommendations on

Re: [ceph-users] Power outages!!! help!

2017-08-28 Thread Ronny Aasen
comments inline
On 28.08.2017 18:31, hjcho616 wrote: I'll see what I can do on that... Looks like I may have to add another OSD host as I utilized all of the SATA ports on those boards. =P Ronny, I am running with size=2 min_size=1. I created everything with ceph-deploy and didn't touch

Re: [ceph-users] Power outages!!! help!

2017-08-28 Thread Tomasz Kusmierz
Sorry mate, I’ve just noticed the "unfound (0.007%)”. I think that your main culprit here is osd.0. You need to have all the OSDs on one host to get all the data back. Also, for the time being, I would just change size and min_size down to 1 and try to figure out which OSDs you actually need to get all

Re: [ceph-users] Power outages!!! help!

2017-08-28 Thread hjcho616
Thank you all for the suggestions! Maged, I'll see what I can do on that... Looks like I may have to add another OSD host, as I utilized all of the SATA ports on those boards. =P Ronny, I am running with size=2 min_size=1.  I created everything with ceph-deploy and didn't touch much of that pool

Re: [ceph-users] Power outages!!! help!

2017-08-28 Thread Ronny Aasen
On 28. aug. 2017 08:01, hjcho616 wrote: Hello! I've been using Ceph for a long time, mostly for network CephFS storage, even before the Argonaut release! It's been working very well for me. Yes, I had some power outages before, asked a few questions on this list, and they got resolved happily!

Re: [ceph-users] Power outages!!! help!

2017-08-28 Thread Maged Mokhtar
I would suggest either adding 1 new disk on each of the 2 machines or increasing the osd_backfill_full_ratio to something like 90 or 92 from the default 85. /Maged
On 2017-08-28 08:01, hjcho616 wrote:
> Hello!
>
> I've been using ceph for long time mostly for network CephFS storage, even
>
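
On a Jewel-era cluster that ratio can be bumped at runtime, roughly like this (0.92 mirrors the suggestion above; revert it once backfill completes):

  ceph tell 'osd.*' injectargs '--osd-backfill-full-ratio 0.92'
  # to persist it, add under [osd] in ceph.conf:
  #   osd backfill full ratio = 0.92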