Re: [ceph-users] Recovering incomplete PGs with ceph_objectstore_tool

2015-04-09 Thread Chris Kitzmiller
Success! Hopefully my notes from the process will help: In the event of multiple disk failures the cluster could lose PGs. Should this occur it is best to attempt to restart the OSD process and have the drive marked as up+out. Marking the drive as out will cause data to flow off the drive to

Re: [ceph-users] What are you doing to locate performance issues in a Ceph cluster?

2015-04-08 Thread Chris Kitzmiller
On Apr 7, 2015, at 7:44 PM, Francois Lafont wrote: Chris Kitzmiller wrote: I graph aggregate stats for `ceph --admin-daemon /var/run/ceph/ceph-osd.$osdid.asok perf dump`. If the max latency strays too far outside of my mean latency I know to go look for the troublemaker. My graphs look
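
The max-versus-mean check described in this message could be sketched roughly as below. This is only an illustration, not the poster's actual graphing setup: it assumes the `perf dump` JSON contains an `osd` section whose `op_latency` counter has `avgcount` and `sum` fields, and the function names and the 3x threshold are hypothetical.

```python
import json
import statistics


def mean_op_latency(perf_dump_json):
    """Return the mean op latency (seconds) from one OSD's perf dump JSON text.

    Assumed layout: {"osd": {"op_latency": {"avgcount": N, "sum": S}}}.
    """
    counter = json.loads(perf_dump_json)["osd"]["op_latency"]
    if counter["avgcount"] == 0:
        return 0.0  # no ops recorded yet, avoid dividing by zero
    return counter["sum"] / counter["avgcount"]


def find_troublemakers(latency_by_osd, factor=3.0):
    """Return the OSD ids whose latency exceeds `factor` times the mean.

    `latency_by_osd` maps an OSD id to its mean latency; `factor` is an
    arbitrary illustrative threshold, not a recommended value.
    """
    if not latency_by_osd:
        return []
    mean = statistics.mean(latency_by_osd.values())
    return sorted(osd for osd, lat in latency_by_osd.items() if lat > factor * mean)
```

Feeding each OSD's `perf dump` output through `mean_op_latency` and collecting the results into a dict gives the input for `find_troublemakers`. As far as I know these counters accumulate from daemon start, so a real monitor would more likely compare the change between two samples.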

Re: [ceph-users] Recovering incomplete PGs with ceph_objectstore_tool

2015-04-07 Thread Chris Kitzmiller
I'm not having much luck here. Is there a possibility that the imported PGs aren't being picked up because the MONs think that they're older than the empty PGs I find on the up OSDs? I feel that I'm so close to *not* losing my RBD volume because I only have two bad PGs and I've successfully

Re: [ceph-users] Recovering incomplete PGs with ceph_objectstore_tool

2015-04-06 Thread Chris Kitzmiller
and size = 2. On Thu, Apr 2, 2015 at 10:20 PM, Chris Kitzmiller ca...@hampshire.edu wrote: On Apr 3, 2015, at 12:37 AM, LOPEZ Jean-Charles jelo...@redhat.com wrote: according to your ceph osd tree capture, although the OSD reweight is set to 1, the OSD CRUSH weight is set to 0 (2nd column

Re: [ceph-users] What are you doing to locate performance issues in a Ceph cluster?

2015-04-06 Thread Chris Kitzmiller
On Apr 6, 2015, at 7:04 PM, Robert LeBlanc rob...@leblancnet.us wrote: I see that ceph has 'ceph osd perf' that gets the latency of the OSDs. Is there a similar command that would provide some performance data about RBDs in use? I'm concerned about our ability to determine which RBD(s) may be

Re: [ceph-users] Recovering incomplete PGs with ceph_objectstore_tool

2015-04-06 Thread Chris Kitzmiller
On Apr 3, 2015, at 12:37 AM, LOPEZ Jean-Charles jelo...@redhat.com wrote: according to your ceph osd tree capture, although the OSD reweight is set to 1, the OSD CRUSH weight is set to 0 (2nd column). You need to assign the OSD a CRUSH weight so that it can be selected by CRUSH: ceph osd
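
The condition described in this reply (reweight set to 1 but CRUSH weight set to 0) can be checked for mechanically. The sketch below assumes the JSON form of the OSD tree (`ceph osd tree --format json`) exposes a `nodes` list whose entries carry `type`, `name`, `crush_weight`, and `reweight` fields; the function name is hypothetical and the field layout should be verified against your Ceph version.

```python
import json


def zero_crush_weight_osds(osd_tree_json):
    """Return names of OSDs that are reweighted in but have CRUSH weight 0.

    Such OSDs can never be selected by CRUSH, whatever their reweight is.
    The field names used here are an assumption about the JSON layout.
    """
    tree = json.loads(osd_tree_json)
    return [
        node["name"]
        for node in tree.get("nodes", [])
        if node.get("type") == "osd"
        and node.get("crush_weight", 0) == 0
        and node.get("reweight", 0) > 0
    ]
```

An empty result means every OSD that is reweighted in also has a non-zero CRUSH weight.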

Re: [ceph-users] Recovering incomplete PGs with ceph_objectstore_tool

2015-04-04 Thread Chris Kitzmiller
On Apr 3, 2015, at 12:37 AM, LOPEZ Jean-Charles jelo...@redhat.com wrote: according to your ceph osd tree capture, although the OSD reweight is set to 1, the OSD CRUSH weight is set to 0 (2nd column). You need to assign the OSD a CRUSH weight so that it can be selected by CRUSH: ceph osd

Re: [ceph-users] Troubleshooting Incomplete PGs

2014-10-28 Thread Chris Kitzmiller
On Oct 28, 2014, at 5:20 PM, Lincoln Bryant wrote: Hi Greg, Loic, I think we have seen this as well (sent a mail to the list a week or so ago about incomplete pgs). I ended up giving up on the data and doing a force_create_pgs after doing a find on my OSDs and deleting the relevant pg

[ceph-users] How to recover Incomplete PGs from lost time symptom?

2014-10-24 Thread Chris Kitzmiller
I have a number of PGs which are marked as incomplete. I'm at a loss for how to go about recovering these PGs and believe they're suffering from the lost time symptom. How do I recover these PGs? I'd settle for sacrificing the lost time and just going with what I've got. I've lost the ability

Re: [ceph-users] Troubleshooting Incomplete PGs

2014-10-23 Thread Chris Kitzmiller
On Oct 22, 2014, at 8:22 PM, Craig Lewis wrote: Shot in the dark: try manually deep-scrubbing the PG. You could also try marking various osd's OUT, in an attempt to get the acting set to include osd.25 again, then do the deep-scrub again. That probably won't help though, because the pg

Re: [ceph-users] Troubleshooting Incomplete PGs

2014-10-22 Thread Chris Kitzmiller
, what's up with osd.-1? On Tue, Oct 21, 2014 at 7:04 PM, Chris Kitzmiller ckitzmil...@hampshire.edu wrote: I've gotten myself into the position of having ~100 incomplete PGs. All of my OSDs are up+in (and I've restarted them all one by one). I was in the process of rebalancing after

Re: [ceph-users] Troubleshooting Incomplete PGs

2014-10-22 Thread Chris Kitzmiller
On Oct 22, 2014, at 7:51 PM, Craig Lewis wrote: On Wed, Oct 22, 2014 at 3:09 PM, Chris Kitzmiller ckitzmil...@hampshire.edu wrote: On Oct 22, 2014, at 1:50 PM, Craig Lewis wrote: Incomplete means Ceph detects that a placement group is missing a necessary period of history from its log

Re: [ceph-users] Ceph writes stall for long perioids with no disk/network activity

2014-08-06 Thread Chris Kitzmiller
On Aug 5, 2014, at 12:43 PM, Mark Nelson wrote: On 08/05/2014 08:42 AM, Mariusz Gronczewski wrote: On Mon, 04 Aug 2014 15:32:50 -0500, Mark Nelson mark.nel...@inktank.com wrote: On 08/04/2014 03:28 PM, Chris Kitzmiller wrote: On Aug 1, 2014, at 1:31 PM, Mariusz Gronczewski wrote: I got

Re: [ceph-users] Ceph runs great then falters

2014-08-04 Thread Chris Kitzmiller
On Aug 2, 2014, at 12:03 AM, Christian Balzer wrote: On Fri, 1 Aug 2014 14:23:28 -0400 Chris Kitzmiller wrote: I have 3 nodes each running a MON and 30 OSDs. Given the HW you list below, that might be a tall order, particularly CPU-wise in certain situations. I'm not seeing any dramatic

Re: [ceph-users] Ceph writes stall for long perioids with no disk/network activity

2014-08-04 Thread Chris Kitzmiller
On Aug 1, 2014, at 1:31 PM, Mariusz Gronczewski wrote: I got weird stalling during writes, sometimes I got same write speed for few minutes and after some time it starts stalling with 0 MB/s for minutes I'm getting very similar behavior on my cluster. My writes start well but then just kinda

[ceph-users] Ceph runs great then falters

2014-08-01 Thread Chris Kitzmiller
I have 3 nodes each running a MON and 30 OSDs. When I test my cluster with either rados bench or with fio via a 10GbE client using RBD I get great initial speeds 900MBps and I max out my 10GbE links for a while. Then something goes wrong: the performance falters and the cluster stops responding

Re: [ceph-users] HW recommendations for OSD journals?

2014-07-24 Thread Chris Kitzmiller
I found this article very interesting: http://techreport.com/review/26523/the-ssd-endurance-experiment-casualties-on-the-way-to-a-petabyte I've got Samsung 840 Pros and while I'm thinking that I wouldn't go with them again I am interested in the fact that (in this anecdotal experiment) it

[ceph-users] High fs_apply_latency on one node

2014-03-03 Thread Chris Kitzmiller
I've got a 3-node cluster where ceph osd perf reports reasonable fs_apply_latency for 2 out of 3 of my nodes (~30ms). But on the third node I've got latencies averaging 15000+ms for all OSDs. Running Ceph 0.72.2 on Ubuntu 13.10. Each node has 30 HDDs with 6 SSDs for journals. iperf reports full