Re: [ceph-users] Instance filesystem corrupt

2016-10-26 Thread Ahmed Mostafa
This is more or less the same behaviour I have in my environment. By any chance, is anyone running their OSDs and their hypervisors on the same machine? And could a high workload, like starting 40-60 or more virtual machines, have an effect on this problem? On Thursday, 27 October 2016, wrote:

Re: [ceph-users] Surviving a ceph cluster outage: the hard way

2016-10-26 Thread Kostis Fardelas
It is not more than a three-line script. You will also need leveldb's code in your working directory:

```
#!/usr/bin/python2
import leveldb
leveldb.RepairDB('./omap')
```

I totally agree that we need more repair tools to be officially available and also tools that provide better insight to compo
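A slightly more defensive variant of the same idea, as a sketch only (not an official tool): copy the omap directory aside before letting LevelDB rewrite it, so a failed repair can be rolled back. The `./omap` path is just an example; point it at the actual omap directory of the affected (stopped) OSD.

```
#!/usr/bin/python2
# Sketch: back up the omap LevelDB directory, then attempt a repair.
# Run this only with the OSD stopped; OMAP_DIR is an assumed example path.
import shutil
import leveldb

OMAP_DIR = './omap'
BACKUP_DIR = './omap.bak'

# Keep an untouched copy so the repair can be undone if it makes things worse.
shutil.copytree(OMAP_DIR, BACKUP_DIR)

# Ask LevelDB to rebuild whatever it can salvage from the store.
leveldb.RepairDB(OMAP_DIR)
print('repair finished; backup kept in %s' % BACKUP_DIR)
```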

Re: [ceph-users] [EXTERNAL] Re: Instance filesystem corrupt

2016-10-26 Thread Will . Boege
Strangely enough, I’m also seeing similar user issues – a surprisingly high volume of corrupt instance boot disks. At this point I’m attributing it to the fact that our Ceph cluster is patched nine months ahead of our Red Hat OSP Kilo environment. However, that’s a total guess at this point... From:

Re: [ceph-users] SSS Caching

2016-10-26 Thread Christian Balzer
Hello, On Wed, 26 Oct 2016 15:40:00 + Ashley Merrick wrote: > Hello All, > > Currently running a CEPH cluster connected to KVM via the KRBD and used only > for this purpose. > > Is working perfectly fine, however would like to look at increasing / helping > with random write performance

Re: [ceph-users] [EXTERNAL] Instance filesystem corrupt

2016-10-26 Thread Keynes_Lee
Most filesystem corruption causes instances to crash; we saw that after a shutdown / restart (triggered by OpenStack portal buttons or by OS commands inside the instances). Some are detected early: we see filesystem errors in the OS logs on the instances. Then we run a filesystem check (FSCK / chk

Re: [ceph-users] Instance filesystem corrupt

2016-10-26 Thread Keynes_Lee
Hmm, seems we have something in common. We use rbd snap create to make snapshots of instance volumes, and the rbd export and rbd export-diff commands to make daily backups. Now we have 29 instances and 33 volumes.
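For reference, a minimal sketch of that kind of snapshot-plus-export-diff daily backup, driving the rbd CLI from Python. The pool name, image name, backup path, and snapshot naming scheme below are made-up examples, not the poster's actual script.

```
#!/usr/bin/env python
# Sketch of a daily RBD backup: create today's snapshot, then export only the
# delta since yesterday's snapshot with `rbd export-diff`.
# Pool/image names and paths are hypothetical examples.
import datetime
import subprocess

POOL = 'volumes'
IMAGE = 'volume-0001'
BACKUP_DIR = '/backup/rbd'

today = datetime.date.today()
yesterday = today - datetime.timedelta(days=1)
snap_today = 'daily-%s' % today.isoformat()
snap_prev = 'daily-%s' % yesterday.isoformat()
spec = '%s/%s' % (POOL, IMAGE)

# 1) Take today's snapshot.
subprocess.check_call(['rbd', 'snap', 'create', '%s@%s' % (spec, snap_today)])

# 2) Export only the changes between yesterday's and today's snapshots.
#    (The very first run would instead take a full copy with `rbd export`.)
diff_file = '%s/%s-%s.diff' % (BACKUP_DIR, IMAGE, snap_today)
subprocess.check_call(['rbd', 'export-diff', '--from-snap', snap_prev,
                       '%s@%s' % (spec, snap_today), diff_file])
```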

Re: [ceph-users] Monitoring Overhead

2016-10-26 Thread Anthony D'Atri
> Collectd and graphite look really nice. Also look into Grafana, and of course RHSC.

Re: [ceph-users] Significantly increased CPU footprint on OSDs after Hammer -> Jewel upgrade, OSDs occasionally wrongly marked as down

2016-10-26 Thread Trygve Vea
- On 26 Oct 2016 at 21:25, Haomai Wang hao...@xsky.com wrote: > On Thu, Oct 27, 2016 at 2:10 AM, Trygve Vea > wrote: >> - On 26 Oct 2016 at 16:37, Sage Weil s...@newdream.net wrote: >>> On Wed, 26 Oct 2016, Trygve Vea wrote: - On 26 Oct 2016 at 14:41, Sage Weil s...@newdream.net wrote: >>>

Re: [ceph-users] Significantly increased CPU footprint on OSDs after Hammer -> Jewel upgrade, OSDs occasionally wrongly marked as down

2016-10-26 Thread Haomai Wang
On Thu, Oct 27, 2016 at 2:10 AM, Trygve Vea wrote: > - On 26 Oct 2016 at 16:37, Sage Weil s...@newdream.net wrote: >> On Wed, 26 Oct 2016, Trygve Vea wrote: >>> - On 26 Oct 2016 at 14:41, Sage Weil s...@newdream.net wrote: >>> > On Wed, 26 Oct 2016, Trygve Vea wrote: >>> >> Hi, >>> >> >>> >> We

Re: [ceph-users] pg remapped+peering forever and MDS trimming behind

2016-10-26 Thread Brady Deetz
Just before your response, I decided to take the chance of restarting the primary osd for the pg (153). At this point, the MDS trimming error is gone and I'm in a warning state now. The pg has moved from peering+remapped to active+degraded+remapped+backfilling. I'd say we're probably nearly back
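For anyone hitting the same thing, one quick way to see which OSD is the acting primary of a stuck PG (the OSD one might consider restarting, as above) is to parse `ceph pg dump pgs_brief`. A rough sketch follows; the JSON field names are assumptions based on Jewel-era output, so verify them against your own release.

```
#!/usr/bin/env python
# Sketch: list PGs that are not active+clean together with their acting
# primary OSD, by parsing `ceph pg dump pgs_brief` JSON output.
# Field names are assumptions; check the output format of your Ceph release.
import json
import subprocess

out = subprocess.check_output(['ceph', 'pg', 'dump', 'pgs_brief',
                               '--format', 'json'])
pgs = json.loads(out)

# Some releases wrap the list in an object; handle both shapes.
if isinstance(pgs, dict):
    pgs = pgs.get('pg_stats', [])

for pg in pgs:
    if pg.get('state') != 'active+clean':
        print('%s %s acting_primary=osd.%s' % (
            pg.get('pgid'), pg.get('state'), pg.get('acting_primary')))
```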

Re: [ceph-users] pg remapped+peering forever and MDS trimming behind

2016-10-26 Thread Wido den Hollander
> On 26 October 2016 at 20:44, Brady Deetz wrote: > > > Summary: > This is a production CephFS cluster. I had an OSD node crash. The cluster > rebalanced successfully. I brought the down node back online. Everything > has rebalanced except 1 hung pg and MDS trimming is now behind. No hardware

[ceph-users] pg remapped+peering forever and MDS trimming behind

2016-10-26 Thread Brady Deetz
Summary: This is a production CephFS cluster. I had an OSD node crash. The cluster rebalanced successfully. I brought the down node back online. Everything has rebalanced except 1 hung pg and MDS trimming is now behind. No hardware failures have become apparent yet. Questions: 1) Is there a way to

Re: [ceph-users] Significantly increased CPU footprint on OSDs after Hammer -> Jewel upgrade, OSDs occasionally wrongly marked as down

2016-10-26 Thread Trygve Vea
- On 26 Oct 2016 at 16:37, Sage Weil s...@newdream.net wrote: > On Wed, 26 Oct 2016, Trygve Vea wrote: >> - On 26 Oct 2016 at 14:41, Sage Weil s...@newdream.net wrote: >> > On Wed, 26 Oct 2016, Trygve Vea wrote: >> >> Hi, >> >> >> >> We have two Ceph-clusters, one exposing pools both for RGW and

[ceph-users] SSS Caching

2016-10-26 Thread Ashley Merrick
Hello All, Currently running a Ceph cluster connected to KVM via KRBD and used only for this purpose. It is working perfectly fine; however, I would like to look at improving random write performance and latency, especially from multiple VMs hitting the spinning disks at the same tim

Re: [ceph-users] Significantly increased CPU footprint on OSDs after Hammer -> Jewel upgrade, OSDs occasionally wrongly marked as down

2016-10-26 Thread Haomai Wang
On Wed, Oct 26, 2016 at 9:57 PM, Trygve Vea wrote: > - On 26 Oct 2016 at 15:36, Haomai Wang hao...@xsky.com wrote: >> On Wed, Oct 26, 2016 at 9:09 PM, Trygve Vea >> wrote: >>> >>> - On 26 Oct 2016 at 14:41, Sage Weil s...@newdream.net wrote: >>> > On Wed, 26 Oct 2016, Trygve Vea wrote: >>> >> H

Re: [ceph-users] Significantly increased CPU footprint on OSDs after Hammer -> Jewel upgrade, OSDs occasionally wrongly marked as down

2016-10-26 Thread Sage Weil
On Wed, 26 Oct 2016, Trygve Vea wrote: > - On 26 Oct 2016 at 14:41, Sage Weil s...@newdream.net wrote: > > On Wed, 26 Oct 2016, Trygve Vea wrote: > >> Hi, > >> > >> We have two Ceph-clusters, one exposing pools both for RGW and RBD > >> (OpenStack/KVM) pools - and one only for RBD. > >> > >> Aft

Re: [ceph-users] How is split brain situations handled in ceph?

2016-10-26 Thread Wido den Hollander
> On 26 October 2016 at 15:51, J David wrote: > > > On Wed, Oct 26, 2016 at 8:55 AM, Andreas Davour wrote: > > If there is 1 MON in B, that cluster will have quorum within itself and > > keep running, and in A the MON cluster will vote and reach quorum again. > > Quorum requires a majority

Re: [ceph-users] Instance filesystem corrupt

2016-10-26 Thread Ahmed Mostafa
Actually, I have the same problem when starting instances backed by librbd. But this only happens when trying to start 60+ instances. I decided that this is due to the fact that we are using old hardware that is not able to respond to high demand. Could that be the same issue that you are fa

Re: [ceph-users] Significantly increased CPU footprint on OSDs after Hammer -> Jewel upgrade, OSDs occasionally wrongly marked as down

2016-10-26 Thread Trygve Vea
- On 26 Oct 2016 at 15:36, Haomai Wang hao...@xsky.com wrote: > On Wed, Oct 26, 2016 at 9:09 PM, Trygve Vea > wrote: >> >> - On 26 Oct 2016 at 14:41, Sage Weil s...@newdream.net wrote: >> > On Wed, 26 Oct 2016, Trygve Vea wrote: >> >> Hi, >> >> >> >> We have two Ceph-clusters, one exposing pools

Re: [ceph-users] How is split brain situations handled in ceph?

2016-10-26 Thread J David
On Wed, Oct 26, 2016 at 8:55 AM, Andreas Davour wrote: > If there is 1 MON in B, that cluster will have quorum within itself and > keep running, and in A the MON cluster will vote and reach quorum again. Quorum requires a majority of all monitors. One monitor by itself (in a cluster with at lea
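That majority rule is easy to check numerically; a small sketch of just the arithmetic (not Ceph code):

```
# Monitor quorum needs a strict majority of ALL monitors in the monmap,
# not just of the monitors that can currently see each other.
def quorum_size(total_mons):
    return total_mons // 2 + 1

# Example: 3 monitors, partitioned 2 + 1.
total = 3
print(quorum_size(total))        # 2 -> the side with 2 mons keeps quorum
print(1 >= quorum_size(total))   # False -> a lone mon cannot form quorum
```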

Re: [ceph-users] [EXTERNAL] Instance filesystem corrupt

2016-10-26 Thread Jason Dillaman
I am not aware of any similar reports against librbd on Firefly. Do you use any configuration overrides? Does the filesystem corruption appear while the instances are running, or only after a shutdown / restart of the instance? On Wed, Oct 26, 2016 at 12:46 AM, wrote: > No, we are using Firefly

Re: [ceph-users] Significantly increased CPU footprint on OSDs after Hammer -> Jewel upgrade, OSDs occasionally wrongly marked as down

2016-10-26 Thread Haomai Wang
On Wed, Oct 26, 2016 at 9:09 PM, Trygve Vea wrote: > > - On 26 Oct 2016 at 14:41, Sage Weil s...@newdream.net wrote: > > On Wed, 26 Oct 2016, Trygve Vea wrote: > >> Hi, > >> > >> We have two Ceph-clusters, one exposing pools both for RGW and RBD > >> (OpenStack/KVM) pools - and one only for RBD.

Re: [ceph-users] How is split brain situations handled in ceph?

2016-10-26 Thread Robert Sanders
If you're clustering something important, do it at the application level. For example, financial transactions are replicated at the application level just for this reason. As far as Ceph goes, I'm not an expert yet. Even with all the filesystem wizardry in the world, some things need to be handled outsi

Re: [ceph-users] Significantly increased CPU footprint on OSDs after Hammer -> Jewel upgrade, OSDs occasionally wrongly marked as down

2016-10-26 Thread Trygve Vea
- On 26 Oct 2016 at 14:41, Sage Weil s...@newdream.net wrote: > On Wed, 26 Oct 2016, Trygve Vea wrote: >> Hi, >> >> We have two Ceph-clusters, one exposing pools both for RGW and RBD >> (OpenStack/KVM) pools - and one only for RBD. >> >> After upgrading both to Jewel, we have seen a significantl

Re: [ceph-users] Significantly increased CPU footprint on OSDs after Hammer -> Jewel upgrade, OSDs occasionally wrongly marked as down

2016-10-26 Thread Sage Weil
On Wed, 26 Oct 2016, Trygve Vea wrote: > Hi, > > We have two Ceph-clusters, one exposing pools both for RGW and RBD > (OpenStack/KVM) pools - and one only for RBD. > > After upgrading both to Jewel, we have seen a significantly increased CPU > footprint on the OSDs that are a part of the cluste

[ceph-users] Significantly increased CPU footprint on OSDs after Hammer -> Jewel upgrade, OSDs occasionally wrongly marked as down

2016-10-26 Thread Trygve Vea
Hi, We have two Ceph clusters, one exposing pools for both RGW and RBD (OpenStack/KVM) - and one only for RBD. After upgrading both to Jewel, we have seen a significantly increased CPU footprint on the OSDs that are part of the cluster which includes RGW. This graph illustrates this: h

Re: [ceph-users] rgw / s3website, MethodNotAllowed on Jewel 10.2.3

2016-10-26 Thread Trygve Vea
served with 405 >> Method Not Allowed >> >> DEBUG: Sending request method_string='PUT', uri='/?website', >> headers={'x-amz-content-sha256': >> '3fcf37205b114f03a910d11d74206358f1681381f0f9498
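For what it's worth, the same PUT ?website request can be reproduced from Python with boto3 to narrow down where the 405 comes from. A sketch only; the endpoint, bucket name, and credentials below are placeholders, not the reporter's actual setup.

```
#!/usr/bin/env python
# Sketch: reproduce the failing `PUT /?website` call against RGW with boto3.
# Endpoint URL, bucket name, and credentials are placeholders.
import boto3

s3 = boto3.client(
    's3',
    endpoint_url='http://rgw.example.com',
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
)

# This issues PUT /<bucket>?website, the request the gateway is answering
# with 405 MethodNotAllowed in the debug output above.
s3.put_bucket_website(
    Bucket='my-bucket',
    WebsiteConfiguration={
        'IndexDocument': {'Suffix': 'index.html'},
        'ErrorDocument': {'Key': 'error.html'},
    },
)
```

If I recall correctly, the gateway also needs rgw_enable_static_website turned on before it will accept these requests, though whether that is the cause here is only a guess.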

Re: [ceph-users] rgw / s3website, MethodNotAllowed on Jewel 10.2.3

2016-10-26 Thread Yoann Moulin
'x-amz-content-sha256': > '3fcf37205b114f03a910d11d74206358f1681381f0f9498b25aa1cc65e168937', > 'Authorization': 'AWS4-HMAC-SHA256 > Credential=V4NZ37SLP3VOPR2BI5UW/20161026/US/s3/aws4_request,SignedHeaders=host;x-amz-content-sha256;x-amz-date,Signature=4cbd6a7

Re: [ceph-users] running xfs_fsr on ceph OSDs

2016-10-26 Thread mj
Hi Christian, Thanks for the reply / suggestion! MJ On 10/24/2016 10:02 AM, Christian Balzer wrote: Hello, On Mon, 24 Oct 2016 09:41:37 +0200 mj wrote: Hi, We have been running xfs on our servers for many years, and we are used to running a scheduled xfs_fsr during the weekend. Lately we hav

[ceph-users] rgw / s3website, MethodNotAllowed on Jewel 10.2.3

2016-10-26 Thread Trygve Vea
f0f9498b25aa1cc65e168937', 'Authorization': 'AWS4-HMAC-SHA256 Credential=V4NZ37SLP3VOPR2BI5UW/20161026/US/s3/aws4_request,SignedHeaders=host;x-amz-content-sha256;x-amz-date,Signature=4cbd6a7c26dc149fc8fb352dae2d42c27e9bdc254cecc467802941cfc0e200a2', 'x-amz-d

Re: [ceph-users] 6 Node cluster with 24 SSD per node: Hardwareplanning/ agreement

2016-10-26 Thread Gandalf Corvotempesta
2016-10-11 9:20 GMT+02:00 Дробышевский, Владимир : > It may look like a boys' club, but I believe that sometimes for > proof-of-concept projects, or in the beginning of a commercial project > without a lot of investment, it is worth considering used hardware. For example, > it's possible to find us

Re: [ceph-users] All PGs are active+clean, still remapped PGs

2016-10-26 Thread Wido den Hollander
> On 26 October 2016 at 10:44, Sage Weil wrote: > > > On Wed, 26 Oct 2016, Dan van der Ster wrote: > > On Tue, Oct 25, 2016 at 7:06 AM, Wido den Hollander wrote: > > > > > >> On 24 October 2016 at 22:29, Dan van der Ster > > >> wrote: > > >> > > >> > > >> Hi Wido, > > >> > > >> This seems

Re: [ceph-users] All PGs are active+clean, still remapped PGs

2016-10-26 Thread Sage Weil
On Wed, 26 Oct 2016, Dan van der Ster wrote: > On Tue, Oct 25, 2016 at 7:06 AM, Wido den Hollander wrote: > > > >> On 24 October 2016 at 22:29, Dan van der Ster wrote: > >> > >> > >> Hi Wido, > >> > >> This seems similar to what our dumpling tunables cluster does when a few > >> particular osds go down... Though in our case the remapped pgs are > >> correctly

Re: [ceph-users] All PGs are active+clean, still remapped PGs

2016-10-26 Thread Dan van der Ster
On Tue, Oct 25, 2016 at 7:06 AM, Wido den Hollander wrote: > >> On 24 October 2016 at 22:29, Dan van der Ster wrote: >> >> >> Hi Wido, >> >> This seems similar to what our dumpling tunables cluster does when a few >> particular osds go down... Though in our case the remapped pgs are >> correctly