Re: [ceph-users] Deadly slow Ceph cluster revisited

2015-07-17 Thread Quentin Hartman
What does ceph status say? I had a problem with similar symptoms some months ago that was accompanied by OSDs getting marked out for no apparent reason and the cluster going into a HEALTH_WARN state intermittently. Ultimately the root of the problem ended up being a faulty NIC. Once I took that
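For anyone following along, the basic health checks being suggested here are (assuming a standard admin keyring on the node):

    ceph status            # one-shot cluster summary
    ceph -w                # watch cluster events live, e.g. OSDs flapping in/out
    ceph health detail     # expand a HEALTH_WARN into its specific causes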

[ceph-users] OSD RAM usage values

2015-07-17 Thread Kenneth Waegeman
Hi all, I've read in the documentation that OSDs use around 512MB on a healthy cluster (http://ceph.com/docs/master/start/hardware-recommendations/#ram). Now, our OSDs are all using around 2GB of RAM while the cluster is healthy. PID USER PR NI VIRT RES SHR S %CPU
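To compare per-OSD resident memory against the documented guideline, something like this works on each OSD host (standard tools, nothing Ceph-specific assumed):

    # resident set size (RSS, in KB) of every ceph-osd process
    ps -C ceph-osd -o pid,rss,vsz,args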

Re: [ceph-users] RGW Malformed Headers

2015-07-17 Thread Simon Murray
Test complete. Civet still shows the same problem: https://gist.github.com/spjmurray/88203f564389294b3774 /admin/user?uid=admin is fine; /admin/user?quota&uid=admin&quota-type=user is not so good. Upgrading to 0.94.2 didn't solve the problem, nor did 9.0.2. Unless anyone knows anything more I'll go a
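For context, the two requests being compared differ only in the quota sub-resource; roughly (URL shapes as reported, required S3 auth signature omitted):

    GET /admin/user?uid=admin                          # works
    GET /admin/user?quota&uid=admin&quota-type=user    # returns the malformed-header error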

Re: [ceph-users] OSD RAM usage values

2015-07-17 Thread Gregory Farnum
On Fri, Jul 17, 2015 at 1:13 PM, Kenneth Waegeman kenneth.waege...@ugent.be wrote: Hi all, I've read in the documentation that OSDs use around 512MB on a healthy cluster (http://ceph.com/docs/master/start/hardware-recommendations/#ram). Now, our OSDs are all using around 2GB of RAM

Re: [ceph-users] Deadly slow Ceph cluster revisited

2015-07-17 Thread Mark Nelson
On 07/17/2015 08:38 AM, J David wrote: This is the same cluster I posted about back in April. Since then, the situation has gotten significantly worse. Here is what iostat looks like for the one active RBD image on this cluster: Device: rrqm/s wrqm/s r/s w/s rkB/s

Re: [ceph-users] 10d

2015-07-17 Thread Dan van der Ster
A bit of progress: rm'ing everything from inside current/36.10d_head/ actually let the OSD start and continue deleting other PGs. Cheers, Dan On Fri, Jul 17, 2015 at 3:26 PM, Dan van der Ster d...@vanderster.com wrote: Thanks for the quick reply. We /could/ just wipe these OSDs and start from

[ceph-users] Deadly slow Ceph cluster revisited

2015-07-17 Thread J David
This is the same cluster I posted about back in April. Since then, the situation has gotten significantly worse. Here is what iostat looks like for the one active RBD image on this cluster: Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await
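For reference, the extended device statistics quoted above come from sysstat's iostat in extended mode, e.g.:

    iostat -x 1    # per-device r/s, w/s, rkB/s, wkB/s, await, %util, refreshed every second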

Re: [ceph-users] 10d

2015-07-17 Thread Dan van der Ster
Hi Greg + list, Sorry to reply to this old'ish thread, but today one of these PGs bit us in the ass. Running hammer 0.94.2, we are deleting pool 36 and the OSDs 30, 171, and 69 all crash when trying to delete pg 36.10d. They all crash with "ENOTEMPTY suggests garbage data in osd data dir"

Re: [ceph-users] 10d

2015-07-17 Thread Gregory Farnum
I think you'll need to use the ceph-objectstore-tool to remove the PG/data consistently, but I've not done this — David or Sam will need to chime in. -Greg On Fri, Jul 17, 2015 at 2:15 PM, Dan van der Ster d...@vanderster.com wrote: Hi Greg + list, Sorry to reply to this old'ish thread, but
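A sketch of what that might look like, using the paths and IDs from this thread (run only with the OSD stopped, and treat the exact flags as an assumption to verify against your ceph-objectstore-tool version):

    service ceph stop osd.30
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-30 \
        --journal-path /var/lib/ceph/osd/ceph-30/journal \
        --op remove --pgid 36.10d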

Re: [ceph-users] 10d

2015-07-17 Thread Dan van der Ster
Thanks for the quick reply. We /could/ just wipe these OSDs and start from scratch (the only other pools were 4+2 ec and recovery already brought us to 100% active+clean). But it'd be good to understand and prevent this kind of crash... Cheers, Dan On Fri, Jul 17, 2015 at 3:18 PM, Gregory

[ceph-users] Problem re-running dpkg-buildpackages with '-nc' option

2015-07-17 Thread Bartłomiej Święcki
Hi all, I'm trying to rebuild ceph deb packages using 'dpkg-buildpackages -nc'. Without '-nc' the compilation works fine but obviously takes a long time. When I add the '-nc' option, I end up with the following issues: .. ./check_version ./.git_version ./.git_version is up to date. CXXLD
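For anyone reproducing this, the incremental rebuild being attempted is along these lines (note the Debian tool itself is named dpkg-buildpackage; -nc skips the clean step so prior build output is reused):

    cd ceph-<version>
    dpkg-buildpackage -us -uc -b        # full build, unsigned, binary packages only
    dpkg-buildpackage -us -uc -b -nc    # re-run without cleaning first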

Re: [ceph-users] Deadly slow Ceph cluster revisited

2015-07-17 Thread J David
On Fri, Jul 17, 2015 at 10:47 AM, Quentin Hartman qhart...@direwolfdigital.com wrote: What does ceph status say? Usually it says everything is cool. However just now it gave this: cluster e9c32e63-f3eb-4c25-b172-4815ed566ec7 health HEALTH_WARN 2 requests are blocked > 32 sec
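When ceph status shows blocked requests, the usual next step is to find out which OSDs are holding them:

    ceph health detail    # lists the specific OSDs with requests blocked > 32 sec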

Re: [ceph-users] Deadly slow Ceph cluster revisited

2015-07-17 Thread Quentin Hartman
That looks a lot like what I was seeing initially. The OSDs getting marked out was relatively rare and it took a bit before I saw it. I ended up digging into the logs on the OSDs themselves to discover that they were getting marked out. The messages were like "so-and-so incorrectly marked us out"

[ceph-users] RGW Malformed Headers

2015-07-17 Thread Simon Murray
The basics are, 14.04, giant, apache with the ceph version of fastCGI. I'll spin up a test system in openstack with Civet and see if it misbehaves the same way or I need to narrow it down further. Chances are if you haven't heard of it I'll need to crack out g++ and get my hands dirty --

Re: [ceph-users] Deadly slow Ceph cluster revisited

2015-07-17 Thread Mark Nelson
On 07/17/2015 09:55 AM, J David wrote: On Fri, Jul 17, 2015 at 10:21 AM, Mark Nelson mnel...@redhat.com wrote: rados -p pool bench 30 write, just to see how it handles 4MB object writes. Here's that, from the VM host: Total time run: 52.062639 Total writes made: 66 Write
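For reference, the suggested benchmark in full (pool name is a placeholder; --no-cleanup keeps the written objects around so a sequential read test can follow):

    rados -p <pool> bench 30 write --no-cleanup
    rados -p <pool> bench 30 seq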

Re: [ceph-users] Deadly slow Ceph cluster revisited

2015-07-17 Thread Shane Gibson
David - I'm new to Ceph myself, so can't point out any smoking guns - but your problem feels like a network issue. I suggest you check all of your OSD/Mon/Clients network interfaces. Check for errors, check that they are negotiating the same link speed/type with your switches (if you have LLDP
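A minimal first pass over each interface, per this suggestion (eth0 as a stand-in for your interface name):

    ethtool eth0            # negotiated link speed and duplex
    ip -s link show eth0    # RX/TX error and drop counters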

Re: [ceph-users] Dont used fqdns in monmaptool and ceph-mon --mkfs

2015-07-17 Thread Shane Gibson
On 7/16/15, 9:51 PM, Goncalo Borges gonc...@physics.usyd.edu.au wrote: Once I substituted the fqdn by simply the hostname (without the domain) it worked. Goncalo, I ran into the same problems too - and ended up bailing on
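For illustration, the working form uses the bare hostname rather than the FQDN; a sketch with made-up names and addresses:

    monmaptool --create --add mon01 192.168.0.10:6789 --fsid $(uuidgen) /tmp/monmap
    ceph-mon --mkfs -i mon01 --monmap /tmp/monmap --keyring /tmp/ceph.mon.keyring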

Re: [ceph-users] Deadly slow Ceph cluster revisited

2015-07-17 Thread Steve Dainard
Disclaimer: I'm relatively new to ceph, and haven't moved into production with it. Did you run your bench for 30 seconds? For reference, my 30-second bench from a VM bridged to a 10Gig card, on a cluster with 90x4TB drives, is: Total time run: 30.766596 Total writes made: 1979 Write size:

Re: [ceph-users] Deadly slow Ceph cluster revisited

2015-07-17 Thread Somnath Roy
I would say use the admin socket to find out which part is causing most of the latency; don't rule out disk anomalies. Thanks & Regards, Somnath -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of J David Sent: Friday, July 17, 2015 8:07 AM To:
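The admin-socket queries being suggested look roughly like this on an OSD host (socket path assumes the default location):

    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump           # latency counters per subsystem
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops   # slowest recent ops with per-stage timings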

Re: [ceph-users] OSD latency inaccurate reports?

2015-07-17 Thread Kostis Fardelas
Also, by running ceph osd perf, I see that fs_apply_latency is larger than fs_commit_latency. Shouldn't that be the opposite? Apply latency is, afaik, the time it takes to apply updates to the file system in page cache. Commitcycle latency is the time it takes to flush the cache to disk, right?
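For reference, the numbers under discussion come from:

    ceph osd perf    # per-OSD fs_commit_latency and fs_apply_latency, in ms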

Re: [ceph-users] Unsetting osd_crush_chooseleaf_type = 0

2015-07-17 Thread Robert LeBlanc
Yes, you will need to change "osd" to "host" as you thought, so that copies will be separated between hosts. You will run into the problems you see until that is changed. It will cause data movement. - Robert LeBlanc PGP Fingerprint 79A2 9CA4
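The usual way to make that change is to edit the chooseleaf step in the CRUSH rule; a sketch (expect data movement, as noted):

    ceph osd getcrushmap -o crush.bin
    crushtool -d crush.bin -o crush.txt
    # in crush.txt, change:  step chooseleaf firstn 0 type osd
    #                   to:  step chooseleaf firstn 0 type host
    crushtool -c crush.txt -o crush.new
    ceph osd setcrushmap -i crush.new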

Re: [ceph-users] Slow requests during ceph osd boot

2015-07-17 Thread Kostis Fardelas
Thanks for your answers, we will also experiment with osd recovery max active / threads and will come back to you Regards, Kostis On 16 July 2015 at 12:29, Jan Schermer j...@schermer.cz wrote: For me setting recovery_delay_start helps during the OSD bootup _sometimes_, but it clearly does
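The knobs being discussed can be adjusted at runtime, e.g. (values are illustrative, not recommendations):

    ceph tell osd.* injectargs '--osd-recovery-max-active 1 --osd-max-backfills 1'
    ceph tell osd.* injectargs '--osd-recovery-delay-start 20'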

Re: [ceph-users] Deadly slow Ceph cluster revisited

2015-07-17 Thread J David
On Fri, Jul 17, 2015 at 11:15 AM, Quentin Hartman qhart...@direwolfdigital.com wrote: That looks a lot like what I was seeing initially. The OSDs getting marked out was relatively rare and it took a bit before I saw it. Our problem occurs most of the time and does not appear confined to a specific

Re: [ceph-users] Deadly slow Ceph cluster revisited

2015-07-17 Thread J David
On Fri, Jul 17, 2015 at 12:19 PM, Mark Nelson mnel...@redhat.com wrote: Maybe try some iperf tests between the different OSD nodes in your cluster and also the client to the OSDs. This proved to be an excellent suggestion. One of these is not like the others: f16 inbound: 6Gbps f16 outbound:
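For anyone reproducing the test matrix, with iperf2 it is simply (f18 here standing in for each node in turn):

    iperf -s               # on the node under test
    iperf -c f18 -t 30     # from each peer; repeat in both directions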

Re: [ceph-users] Workaround for RHEL/CentOS 7.1 rbdmap service start warnings?

2015-07-17 Thread Steve Dainard
Other than those errors, do you find RBDs will not be unmapped on system restart/shutdown on a machine using systemd, leaving the system hanging, trying to unmap RBDs after the network is gone? That's been my experience thus far, so I wrote an (overly simple) systemd file to handle this on a
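A minimal sketch of such a unit, assuming an rbdmap-style map/unmap helper exists at the given path (ordering After=network-online.target means systemd stops the unit, and thus unmaps, before taking the network down at shutdown):

    [Unit]
    Description=Map RBD devices at boot, unmap before network shutdown
    After=network-online.target ceph.target
    Wants=network-online.target

    [Service]
    Type=oneshot
    RemainAfterExit=yes
    ExecStart=/usr/bin/rbdmap map
    ExecStop=/usr/bin/rbdmap unmap

    [Install]
    WantedBy=multi-user.target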

Re: [ceph-users] Deadly slow Ceph cluster revisited

2015-07-17 Thread Alex Gorbachev
May I suggest also checking the error counters on your network switch? Check speed and duplex. Is bonding in use? Is flow control on? Can you swap the network cable? Can you swap a NIC with another node, and does the problem follow? Hth, Alex On Friday, July 17, 2015, Steve Thompson

Re: [ceph-users] backing Hadoop with Ceph ??

2015-07-17 Thread Josh Durgin
On 07/15/2015 11:48 AM, Shane Gibson wrote: Somnath - thanks for the reply ... :-) Haven't tried anything yet - just starting to gather info/input/direction for this solution. Looking at the S3 API info [2] - there is no mention of support for the S3a API extensions - namely rename support.

Re: [ceph-users] Deadly slow Ceph cluster revisited

2015-07-17 Thread Quentin Hartman
Glad we were able to point you in the right direction! I would suspect a borderline cable at this point. Did you happen to notice if the interface had negotiated down to some dumb speed? If it had, I've seen cases where a dodgy cable has caused an intermittent problem that causes it to negotiate

Re: [ceph-users] Deadly slow Ceph cluster revisited

2015-07-17 Thread Steve Thompson
On Fri, 17 Jul 2015, J David wrote: f16 inbound: 6Gbps f16 outbound: 6Gbps f17 inbound: 6Gbps f17 outbound: 6Gbps f18 inbound: 6Gbps f18 outbound: 1.2Mbps Unless the network was very busy when you did this, I think that 6 Gb/s may not be very good either. Usually iperf will give you much

Re: [ceph-users] Workaround for RHEL/CentOS 7.1 rbdmap service start warnings?

2015-07-17 Thread Bruce McFarland
Yes, the RBDs are not remapped at system boot time. I haven't run into a VM or system hang because of this, since I ran into it as part of investigating using RHEL 7.1 as a client distro. Yes, remapping the RBDs in a startup script worked around the issue. -Original Message- From: Steve