Re: [ceph-users] pgs stuck inactive

2017-03-17 Thread Brad Hubbard
On Fri, Mar 17, 2017 at 7:43 PM, Laszlo Budai wrote: > Hello Brad, > > I've found the reason for the segfault. On the OSD servers the > /etc/ceph/ceph.client.admin.keyring file was missing. This showed up when > I set the debugging parameters you suggested. That

Re: [ceph-users] pgs stuck inactive

2017-03-17 Thread Laszlo Budai
Hello Brad, I've found the reason for the segfault. On the OSD servers the /etc/ceph/ceph.client.admin.keyring file was missing. This showed up when I set the debugging parameters you suggested. Once I copied the file from the monitor, the import-rados worked. Now the cluster
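
For reference, copying the admin keyring from a monitor to an OSD node might look like the sketch below (the hostname mon1 is hypothetical; the standard path is /etc/ceph/ceph.client.admin.keyring):

  # scp mon1:/etc/ceph/ceph.client.admin.keyring /etc/ceph/ceph.client.admin.keyring
  # chmod 600 /etc/ceph/ceph.client.admin.keyring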

Re: [ceph-users] pgs stuck inactive

2017-03-16 Thread Brad Hubbard
So I've tested this procedure locally and it works successfully for me.
$ ./ceph -v
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
ceph version 0.94.10 (b1e0532418e4631af01acbc0cedd426f1905f4af)
$ ./ceph-objectstore-tool import-rados rbd 0.3.export
Importing from pgid 0.3
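
For context, an export file like 0.3.export above would have been produced earlier with the tool's export op against a stopped OSD; a hedged sketch for a Hammer-era FileStore OSD (data and journal paths are illustrative):

  # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
      --journal-path /var/lib/ceph/osd/ceph-0/journal \
      --pgid 0.3 --op export --file 0.3.export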

Re: [ceph-users] pgs stuck inactive

2017-03-16 Thread Laszlo Budai
My mistake, I ran it on the wrong system ... I've attached the terminal output. I ran this on a test system where I was getting the same segfault when trying import-rados. Kind regards, Laszlo On 16.03.2017 07:41, Laszlo Budai wrote: [root@storage2 ~]# gdb -ex 'r' -ex 't a a bt full'

Re: [ceph-users] pgs stuck inactive

2017-03-15 Thread Laszlo Budai
[root@storage2 ~]# gdb -ex 'r' -ex 't a a bt full' -ex 'q' --args ceph-objectstore-tool import-rados volumes pg.3.367.export.OSD.35
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later

Re: [ceph-users] pgs stuck inactive

2017-03-15 Thread Brad Hubbard
Can you install the debuginfo for ceph (how this works depends on your distro) and run the following? # gdb -ex 'r' -ex 't a a bt full' -ex 'q' --args ceph-objectstore-tool import-rados volumes pg.3.367.export.OSD.35 On Thu, Mar 16, 2017 at 12:02 AM, Laszlo Budai wrote:
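
How the debuginfo is installed depends on the distro; two common variants (package names assumed, verify against your repos):

  # debuginfo-install ceph        # RHEL/CentOS, via yum-utils
  # apt-get install ceph-dbg      # Debian/Ubuntu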

Re: [ceph-users] pgs stuck inactive

2017-03-15 Thread Laszlo Budai
Hello, the ceph-objectstore-tool import-rados volumes pg.3.367.export.OSD.35 command crashes.
~# ceph-objectstore-tool import-rados volumes pg.3.367.export.OSD.35
*** Caught signal (Segmentation fault) ** in thread 7f85b60e28c0
ceph version 0.94.10 (b1e0532418e4631af01acbc0cedd426f1905f4af)

Re: [ceph-users] pgs stuck inactive

2017-03-15 Thread Laszlo Budai
Ok. Delete the dirs using the ceph-objectstore-tool. DONE. ceph pg force_create_pg 3.367 led me to this state: HEALTH_WARN 1 pgs stuck inactive; 1 pgs stuck unclean; 16 requests are blocked > 32 sec; 2 osds have slow requests; noout flag(s) set pg 3.367 is stuck inactive since forever, current

Re: [ceph-users] pgs stuck inactive

2017-03-15 Thread Laszlo Budai
Hello, So, I've done the following steps: 1. set noout 2. stop osd2 3. ceph-objectstore-tool remove 4. start osd2 5. repeat steps 2-4 on osd 28 and 35. Then I ran ceph pg force_create_pg 3.367. This left the PG in creating state:
# ceph -s
cluster
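
A sketch of what steps 1-5 plus the force_create_pg might look like for one OSD (the paths and the systemd unit name are assumptions for a typical FileStore deployment):

  # ceph osd set noout
  # systemctl stop ceph-osd@2        # or your distro's init equivalent
  # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 \
      --journal-path /var/lib/ceph/osd/ceph-2/journal \
      --pgid 3.367 --op remove
  # systemctl start ceph-osd@2
  # ceph pg force_create_pg 3.367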

Re: [ceph-users] pgs stuck inactive

2017-03-14 Thread Brad Hubbard
Decide which copy you want to keep and export that with ceph-objectstore-tool. Delete all copies on all OSDs with ceph-objectstore-tool (not by deleting the directory on the disk). Use force_create_pg to recreate the pg empty. Use ceph-objectstore-tool to do a rados import on the exported pg
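
The final import step is run against the cluster once the recreated pg is active, using the import-rados invocation seen elsewhere in this thread:

  # ceph-objectstore-tool import-rados volumes pg.3.367.export.OSD.35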

Re: [ceph-users] pgs stuck inactive

2017-03-14 Thread Laszlo Budai
Hello, I have tried to recover the pg using the following steps: Preparation: 1. set noout 2. stop osd.2 3. use ceph-objectstore-tool to export from osd.2 4. start osd.2 5. repeat steps 2-4 on osd 35, 28, 63 (I've done these hoping to be able to use one of those exports to recover the PG) First

Re: [ceph-users] pgs stuck inactive

2017-03-12 Thread Brad Hubbard
On Sun, Mar 12, 2017 at 7:51 PM, Laszlo Budai wrote: > Hello, > > I have already done the export with ceph-objectstore-tool. I just have to > decide which OSDs to keep. > Can you tell me why the directory structure in the OSDs is different for the > same PG when checking

Re: [ceph-users] pgs stuck inactive

2017-03-12 Thread Laszlo Budai
Hello, I have already done the export with ceph-objectstore-tool. I just have to decide which OSDs to keep. Can you tell me why the directory structure in the OSDs is different for the same PG when checking on different OSDs? For instance, in OSD 2 and 63 there are NO subdirectories in the

Re: [ceph-users] pgs stuck inactive

2017-03-11 Thread Brad Hubbard
On Sat, Mar 11, 2017 at 7:43 PM, Laszlo Budai wrote:
> Hello,
> Thank you for your answer.
> Indeed the min_size is 1:
> # ceph osd pool get volumes size
> size: 3
> # ceph osd pool get volumes min_size
> min_size: 1
> I'm gonna try to find the mentioned

Re: [ceph-users] pgs stuck inactive

2017-03-10 Thread Brad Hubbard
So this is why it happened, I guess: pool 3 'volumes' replicated size 3 min_size 1. min_size = 1 is a recipe for disasters like this and there are plenty of ML threads about not setting it below 2. The past intervals in the pg query show several intervals where a single OSD may have gone rw. How
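
Raising it back to a safer value is a one-liner; a sketch for the pool named in this thread:

  # ceph osd pool set volumes min_size 2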

Re: [ceph-users] pgs stuck inactive

2017-03-10 Thread Laszlo Budai
The OSDs are all there.
$ sudo ceph osd stat
osdmap e60609: 72 osds: 72 up, 72 in
and I have attached the results of the ceph osd tree and ceph osd dump commands. I got some extra info about the network problem. A faulty network device flooded the network, eating up all the bandwidth, so the

Re: [ceph-users] pgs stuck inactive

2017-03-10 Thread Brad Hubbard
To me it looks like someone may have done an "rm" on these OSDs but not removed them from the crushmap. This does not happen automatically. Do these OSDs show up in "ceph osd tree" and "ceph osd dump" ? If so, paste the output. Without knowing what exactly happened here it may be difficult to

Re: [ceph-users] pgs stuck inactive

2017-03-10 Thread Laszlo Budai
Hello, I was informed that due to a networking issue the ceph cluster network was affected. There was huge packet loss, and network interfaces were flapping. That's all I got. This outage lasted a long period of time. So I assume that some OSDs may have been considered dead and the

Re: [ceph-users] pgs stuck inactive

2017-03-09 Thread Brad Hubbard
Can you explain more about what happened? The query shows progress is blocked by the following OSDs: "blocked_by": [14, 17, 51, 58, 63, 64,
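
The recovery state quoted here, including the blocked_by list, comes from querying the pg (pgid taken from this thread):

  # ceph pg 3.367 query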

Re: [ceph-users] pgs stuck inactive and unclean, too few PGs per OSD

2015-10-07 Thread Christian Balzer
Hello, On Thu, 8 Oct 2015 11:27:46 +0800 (CST) wikison wrote: > Hi, > I've removed the rbd pool and created it again. It picked up my > default settings but there are still some problems. After running "sudo > ceph -s", the output is as follows: > cluster

Re: [ceph-users] pgs stuck inactive and unclean, too few PGs per OSD

2015-10-07 Thread wikison
Here, like this:
esta@monitorOne:~$ sudo ceph osd tree
ID WEIGHT  TYPE NAME            UP/DOWN REWEIGHT PRIMARY-AFFINITY
-3 4.39996 root defualt
-2 1.0         host storageTwo
 0 0.0             osd.0        up      1.0      1.0
 1 1.0             osd.1        up      1.0

Re: [ceph-users] pgs stuck inactive and unclean, too few PGs per OSD

2015-10-07 Thread Christian Balzer
Hello, On Thu, 8 Oct 2015 12:21:40 +0800 (CST) wikison wrote:
> Here, like this:
> esta@monitorOne:~$ sudo ceph osd tree
> ID WEIGHT  TYPE NAME    UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -3 4.39996 root defualt
That's your problem. It should be "default". Your
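
A minimal sketch of fixing such a typo in the crushmap (decompile, edit, recompile, inject; the sed line is just one way to make the edit):

  # ceph osd getcrushmap -o crushmap.bin
  # crushtool -d crushmap.bin -o crushmap.txt
  # sed -i 's/defualt/default/g' crushmap.txt
  # crushtool -c crushmap.txt -o crushmap.new
  # ceph osd setcrushmap -i crushmap.new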

Re: [ceph-users] pgs stuck inactive and unclean, too few PGs per OSD

2015-10-07 Thread wikison
Hi, I've removed the rbd pool and created it again. It picked up my default settings but there are still some problems. After running "sudo ceph -s", the output is as follows:
cluster 0b9b05db-98fe-49e6-b12b-1cce0645c015
health HEALTH_WARN 512 pgs stuck
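
Recreating the rbd pool with an explicit placement-group count might look like this (512 matches the pg count in the warning; a hedged sketch, destructive on a pool holding data):

  # ceph osd pool delete rbd rbd --yes-i-really-really-mean-it
  # ceph osd pool create rbd 512 512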

Re: [ceph-users] pgs stuck inactive and unclean, too few PGs per OSD

2015-10-07 Thread Chris Jones
One possibility: it may be that the crush map entries are not being created. Look at your /etc/ceph/ceph.conf file and see if you have something under the OSD section (it could actually be in global too) that looks like the following:
osd crush update on start = false
If that line is there and if you're not
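
The ceph.conf fragment being described would look roughly like this; removing the line or setting it to true lets OSDs place themselves in the crushmap at startup:

  [osd]
  osd crush update on start = true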

Re: [ceph-users] pgs stuck inactive and unclean, too few PGs per OSD

2015-10-06 Thread Christian Balzer
Hello, On Wed, 7 Oct 2015 12:57:58 +0800 (CST) wikison wrote: This is a very old bug/misfeature, and it creeps up every week or so here; google is your friend. > Hi, > I have a cluster of one monitor and eight OSDs. These OSDs are running > on four hosts (each host has two OSDs). When I set up
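
For scale, the usual rule of thumb from that era: aim for on the order of 100 PGs per OSD, i.e. pg_num ≈ (OSD count × 100) / replica count, rounded to a power of two. With 8 OSDs and 3 replicas that gives (8 × 100) / 3 ≈ 267, so 256 or 512. Raising an existing pool would be something like (pg_num can only be increased, and pgp_num must follow):

  # ceph osd pool set rbd pg_num 512
  # ceph osd pool set rbd pgp_num 512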