Re: [ceph-users] Power outages!!! help!

2017-09-01 Thread hjcho616
Just realized there is a file called superblock in the ceph directory. ceph-1 and ceph-2's superblock files are identical, and ceph-6 and ceph-7's are identical, but not between the two groups. When I originally created the OSDs, I created ceph-0 through 5. Can the superblock file be copied over from
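
A quick way to compare the superblock files across OSDs is to checksum them; the paths below assume the default OSD data directory layout:

    # checksum the superblock file of each OSD (default data path assumed)
    md5sum /var/lib/ceph/osd/ceph-{1,2,6,7}/superblock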

Re: [ceph-users] Power outages!!! help!

2017-09-01 Thread hjcho616
Tried connecting the recovered osd. Looks like some of the files in lost+found are superblocks. Below is the log. What can I do about this?
2017-09-01 22:27:27.634228 7f68837e5800  0 set uid:gid to 1001:1001 (ceph:ceph)
2017-09-01 22:27:27.634245 7f68837e5800  0 ceph version 10.2.9
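
Before deciding what to do with them, it can help to look at what xfs_repair actually placed in lost+found; a minimal check, with the mount point being an assumption:

    # inspect what ended up in lost+found on the recovered OSD mount (path assumed)
    ls -l /var/lib/ceph/osd/ceph-0/lost+found | head
    file /var/lib/ceph/osd/ceph-0/lost+found/* | head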

Re: [ceph-users] Power outages!!! help!

2017-09-01 Thread hjcho616
Found the partition, but wasn't able to mount it right away... Did an xfs_repair on that drive. Got a bunch of messages like this.. =(
entry "10a89fd.__head_AE319A25__0" in shortform directory 845908970 references non-existent inode 605294241
junking entry
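
When a partition will not mount, it is generally safer to run xfs_repair in no-modify mode first to see what it would change; a sketch, with the device name being an assumption:

    # dry run: report problems without writing anything
    xfs_repair -n /dev/sdb1
    # only then repair for real
    xfs_repair /dev/sdb1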

Re: [ceph-users] Changing the failure domain

2017-09-01 Thread Laszlo Budai
Hello, We have checked all the drives, and there is no problem with them. If there were a failing drive, then I think the slow requests should also appear in the normal traffic, since the ceph cluster is using all the OSDs as primaries for some PGs. But these slow requests are appearing
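
To narrow down which OSDs the slow requests are coming from, the cluster itself can be queried; a hedged sketch using standard commands (the OSD id is only an example, and the admin socket command has to be run on the host where that OSD lives):

    # list current slow/blocked requests and the OSDs reporting them
    ceph health detail | grep -i slow
    # inspect the most recent slow operations on a suspect OSD
    ceph daemon osd.12 dump_historic_ops | less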

Re: [ceph-users] v12.2.0 Luminous released

2017-09-01 Thread Sage Weil
On Fri, 1 Sep 2017, Felix, Evan J wrote:
> Is there documentation about how to deal with a pool application
> association that is not one of cephfs, rbd, or rgw? We have multiple
> pools that have nothing to do with those applications, we just use the
> objects in them directly using the
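
For pools used directly, Luminous allows tagging them with an arbitrary application name so the health warning goes away; a sketch, with the pool and application names as placeholders:

    # associate a custom application name with a pool (names are examples)
    ceph osd pool application enable mypool custom-app
    # confirm the association
    ceph osd pool application get mypool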

Re: [ceph-users] Changing the failure domain

2017-09-01 Thread David Turner
Don't discount failing drives. You can have drives in a "ready-to-fail" state that doesn't show up in SMART or anywhere else that's easy to track. When backfilling, the drive is using sectors it may not normally use. I managed a 1400 osd cluster that would lose 1-3 drives in random nodes when I added new
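
A long SMART self-test will sometimes surface marginal drives that still look clean in the attribute table; a quick sketch, with the device name as a placeholder:

    # current SMART attributes and error log
    smartctl -a /dev/sdb
    # start an extended self-test, then check the result once it finishes
    smartctl -t long /dev/sdb
    smartctl -l selftest /dev/sdb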

Re: [ceph-users] Changing the failure domain

2017-09-01 Thread Laszlo Budai
Hi David, Well, most probably the larger part of our PGs will have to be reorganized, as we are moving from 9 hosts to 3 chassis. But I was hoping to be able to throttle the backfilling to an extent where it has minimal impact on our user traffic. Unfortunately I wasn't able to do that. I saw
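
The usual runtime knobs for throttling backfill are osd_max_backfills and the recovery settings; a hedged example (the values are only a conservative starting point, not a recommendation):

    # throttle backfill/recovery on all OSDs at runtime
    ceph tell 'osd.*' injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'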

[ceph-users] a question about use of CEPH_IOC_SYNCIO in write

2017-09-01 Thread sa514164
Hi: I want to ask a question about the CEPH_IOC_SYNCIO flag. I know that when using the O_SYNC flag or the O_DIRECT flag, the write call executes along two other code paths, different from the one used with the CEPH_IOC_SYNCIO flag. And I found the comments about CEPH_IOC_SYNCIO here: /* *

Re: [ceph-users] Power outages!!! help!

2017-09-01 Thread hjcho616
Looks like it has been rescued... Only 1 error, as we saw before in the SMART log!
# ddrescue -f /dev/sda /dev/sdc ./rescue.log
GNU ddrescue 1.21
Press Ctrl-C to interrupt
     ipos:    1508 GB, non-trimmed:        0 B,  current rate:       0 B/s
     opos:    1508 GB, non-scraped:        0 B,
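
Before touching the original disk again, a read-only sanity check of the clone is cheap; a small sketch, with the partition and mount point being assumptions:

    # read-only check that the cloned filesystem is readable (device/mount point assumed)
    mkdir -p /mnt/rescue
    mount -o ro /dev/sdc1 /mnt/rescue
    ls /mnt/rescue | head
    umount /mnt/rescue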

Re: [ceph-users] Changing the failure domain

2017-09-01 Thread David Turner
It is normal to have backfilling because the crush map did change. The host and the chassis each have a CRUSH id and their own weight, which is the sum of the OSDs under them. By moving the host into the chassis you changed the weight of the chassis, and that affects PG placement even though
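
The weights involved are visible directly in the CRUSH tree, which makes it easy to confirm that a chassis weight is simply the sum of the buckets beneath it; a sketch, with the host and chassis names as examples:

    # show the CRUSH hierarchy with the weight of every bucket
    ceph osd tree
    # moving a host under a chassis (names are placeholders) changes the chassis weight
    ceph osd crush move node1 chassis=chassis1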

Re: [ceph-users] PGs in peered state?

2017-09-01 Thread Yuri Gorshkov
Hi All, Is there a known procedure to debug the PG state in case of problems like this? Best regards, Yuri.
2017-08-28 14:05 GMT+03:00 Yuri Gorshkov :
> Hi.
>
> When trying to take down a host for maintenance purposes I encountered an
> I/O stall along with some PGs
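
A common starting point for debugging PG state is to list the stuck PGs and then query one of them for its full peering and recovery state; a hedged sketch (the PG id is an example):

    # overall health plus the PGs currently stuck
    ceph health detail
    ceph pg dump_stuck
    # full state, acting set and recovery info for a single PG (id is an example)
    ceph pg 3.1a5 query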

Re: [ceph-users] Very slow start of osds after reboot

2017-09-01 Thread Piotr Dzionek
Hi, I have RAID0 for each disk; unfortunately my RAID controller doesn't support JBOD. Apart from this, I also run a separate cluster with Jewel 10.2.9 on RAID0 and there is no such problem (I just tested it). Moreover, the cluster that has this issue used to run Firefly with RAID0 and everything was fine.

Re: [ceph-users] osd heartbeat protocol issue on upgrade v12.1.0 ->v12.2.0

2017-09-01 Thread Thomas Gebhardt
Hello, thank you very much for the hint, you are right! Kind regards, Thomas
Marc Roos wrote on 30.08.2017 at 14:26:
>
> I had this also once. If you update all nodes and then systemctl restart
> 'ceph-osd@*' on all nodes, you should be fine. But first the monitors of
> course
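
For reference, the restart order described above maps onto the packaged systemd units; a minimal sketch, to be run per node and assuming the standard targets:

    # on the monitor nodes first
    systemctl restart ceph-mon.target
    # then on each OSD node
    systemctl restart 'ceph-osd@*'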