Re: [ceph-users] Help Basically..

2018-09-02 Thread David Turner
Agreed on not zapping the disks until your cluster is healthy again. Marking them out and seeing how healthy you can get in the meantime is a good idea. On Sun, Sep 2, 2018, 1:18 PM Ronny Aasen wrote: > On 02.09.2018 17:12, Lee wrote: > > Should I just out the OSDs first or completely zap them
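For context, marking an OSD out only shifts data off it and leaves the on-disk contents intact, while zapping destroys them. A minimal sketch of the "out first" step, with the OSD ID as a placeholder:
$ ceph osd out 5    # reweights osd.5 to 0 so PGs migrate off it; data stays on disk
$ ceph -s           # watch the misplaced/degraded percentages fall before removing anything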

Re: [ceph-users] Help Basically..

2018-09-02 Thread Ronny Aasen
On 02.09.2018 17:12, Lee wrote: Should I just out the OSDs first or completely zap them and recreate? Or delete and let the cluster repair itself? On the second node, when it started back up I had problems with the journals for IDs 5 and 7; they were also recreated, all the rest are still the

Re: [ceph-users] Help Basically..

2018-09-02 Thread Lee
OK, rather than going gung-ho at this.. 1. I have set out 31, 24, 21, 18, 15, 14, 13, 6 and 7, 5 (10 is a new OSD), which gives me:
ID WEIGHT   TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 23.65970 root default
-5  8.18990     host data33-a4
13  0.90999         osd.13          up        0
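A hedged sketch of how a batch of OSDs like this is typically marked out and then checked; the loop is illustrative and uses the IDs listed above:
$ for id in 31 24 21 18 15 14 13 7 6 5; do ceph osd out $id; done
$ ceph osd tree    # out OSDs show REWEIGHT 0 while the UP/DOWN column still reflects the daemon state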

Re: [ceph-users] Help Basically..

2018-09-02 Thread Lee
Should I just out the OSDs first or completely zap them and recreate? Or delete and let the cluster repair itself? On the second node, when it started back up I had problems with the journals for IDs 5 and 7; they were also recreated, all the rest are still the originals. I know that some PGs are

Re: [ceph-users] Help Basically..

2018-09-02 Thread David Turner
The problem is that you never got a successful run of `ceph-osd --flush-journal` on the old SSD journal drive. All of the OSDs that used the dead journal need to be removed from the cluster, wiped, and added back in. The data on them is not 100% consistent because the old journal died. Any word
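A sketch of the remove-wipe-readd cycle on a Hammer (0.94.x) cluster, assuming the ceph-disk workflow; the OSD ID and device names are placeholders:
$ ceph osd out 5                       # start data migration away from the OSD
$ sudo stop ceph-osd id=5              # or: sudo service ceph stop osd.5, depending on init system
$ ceph osd crush remove osd.5          # drop it from the CRUSH map
$ ceph auth del osd.5                  # remove its cephx key
$ ceph osd rm 5                        # remove it from the OSD map
$ sudo ceph-disk zap /dev/sdX          # wipe the data disk (placeholder device)
$ sudo ceph-disk prepare /dev/sdX /dev/sdY   # recreate the OSD with its journal on the new SSD /dev/sdY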

Re: [ceph-users] Help Basically..

2018-09-02 Thread Lee
I followed:
$ journal_uuid=$(sudo cat /var/lib/ceph/osd/ceph-0/journal_uuid)
$ sudo sgdisk --new=1:0:+20480M --change-name=1:'ceph journal' --partition-guid=1:$journal_uuid --typecode=1:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt -- /dev/sdk
Then
$ sudo ceph-osd --mkjournal -i 20
$ sudo
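The usual continuation, hedged since the exact init commands depend on the distro, is simply to start the OSD again and confirm it registers:
$ sudo start ceph-osd id=20        # or: sudo service ceph start osd.20
$ ceph osd tree | grep osd.20      # confirm it comes back up with the new journal
Note that, as pointed out elsewhere in the thread, a journal recreated with --mkjournal without a prior --flush-journal cannot recover writes that were still sitting in the dead journal.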

Re: [ceph-users] Help Basically..

2018-09-02 Thread Lee
> > > Hi David, > > Yes, health detail outputs all the errors etc. and recovery / backfill is > going on, just taking time: 25% misplaced and 1.5% degraded. > > I can list out the pools and see sizes etc.. > > My main problem is I have no client IO from a read perspective, I cannot > start VMs in

Re: [ceph-users] Help Basically..

2018-09-02 Thread Lee
Hi David, Yes, health detail outputs all the errors etc. and recovery / backfill is going on, just taking time: 25% misplaced and 1.5% degraded. I can list out the pools and see sizes etc. My main problem is I have no client IO from a read perspective; I cannot start VMs in OpenStack, and ceph -w
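When reads hang like this it is usually inactive PGs (down, incomplete, or stuck peering) rather than the backfill itself; a hedged way to check with standard commands of that era:
$ ceph health detail | grep -Ei 'down|incomplete|peering|stale'
$ ceph pg dump_stuck inactive      # PGs that cannot serve IO until their OSDs return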

Re: [ceph-users] Help Basically..

2018-09-02 Thread David Turner
When the first node went offline with a dead SSD journal, all of the data on the OSDs was useless. Unless you could flush the journals, you can't guarantee that a write the cluster thinks happened actually made it to the disk. The proper procedure here is to remove those OSDs and add them again as
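The flush referred to here is run with the OSD stopped, and is only possible while the original journal device is still readable; a minimal sketch, OSD ID as a placeholder:
$ sudo stop ceph-osd id=5                # or: sudo service ceph stop osd.5
$ sudo ceph-osd -i 5 --flush-journal     # replay any pending journal entries into the OSD's filestore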

Re: [ceph-users] Help Basically..

2018-09-02 Thread David C
Does "ceph health detail" work? Have you manually confirmed the OSDs on the nodes are working? What was the replica size of the pools? Are you seeing any progress with the recovery? On Sun, Sep 2, 2018 at 9:42 AM Lee wrote: > Running 0.94.5 as part of a Openstack enviroment, our ceph setup is

[ceph-users] Help Basically..

2018-09-02 Thread Lee
Running 0.94.5 as part of an OpenStack environment, our Ceph setup is 3x OSD nodes and 3x MON nodes. Yesterday we had an aircon outage in our hosting environment; 1 OSD node failed (offline with the journal SSD dead), leaving us with 2 nodes running correctly. 2 hours later a second OSD node failed, complaining