Re: [ceph-users] Per pool or per image RBD copy on read

2017-08-16 Thread Jason Dillaman
You should be able to utilize image-meta to override the configuration
on a particular image:

# rbd image-meta set <image-spec> conf_rbd_clone_copy_on_read true
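
For example (pool and image names below are hypothetical), to enable it on a single
clone and confirm the setting:

# rbd image-meta set ssd-pool/vm-disk-1 conf_rbd_clone_copy_on_read true
# rbd image-meta list ssd-pool/vm-disk-1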



On Wed, Aug 16, 2017 at 8:36 PM, Xavier Trilla
 wrote:
> Hi,
>
>
>
> Is it possible to enable copy on read for a rbd child image? I’ve been
> checking around and looks like the only way to enable copy-on-read is
> enabling it for the whole cluster using:
>
>
>
> rbd_clone_copy_on_read = true
>
>
>
> Can it be enabled just for specific images or pools?
>
>
>
> We keep some parent images in SSDs -the ones being used often- and plenty
> other images on HDD -as they aren’t used so often- and we would like to
> enable copy-on-read  just for the children images of the ones stored on
> HDDs.
>
>
>
> Thanks!
>
>



-- 
Jason


Re: [ceph-users] VMware + Ceph using NFS sync/async ?

2017-08-16 Thread Adrian Saul
> I'd be interested in details of this small versus large bit.

The smaller shares are simply to distribute the workload over more RBDs so that
the bottleneck doesn't become the RBD device. The size itself doesn't
particularly matter; the idea is just to distribute VMs across many shares
rather than a few large datastores.

We originally started with 10TB shares, just because we had the space - but we
found performance was running out before capacity did.  It has become apparent
that the limitation is at the RBD level, particularly with writes.  So under
heavy usage, say during VMware snapshot backups, VMs get impacted by higher
latency to the point that some VMs become unresponsive for short periods.  The
Ceph cluster itself has plenty of performance available and handles far higher
workload periods, but individual RBD devices just seem to hit the wall.

For example, one of our shares will sit there all day happily doing 300-400
read IOPS at very low latencies.  During the backup period we get heavier
writes as snapshots are created and cleaned up.  That increased write activity
pushes the RBD to 100% busy and read latencies go up from 1-2ms to 20-30ms,
even though the number of reads doesn't change that much.  The devices can
handle more, though - I can see periods of up to 1800 read IOPS and 800 write.

There is probably more tuning that can be applied at the XFS/NFS level, but for 
the moment that’s the direction we are taking - creating more shares.
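
(If you want to reproduce the per-RBD sync-write ceiling, a rough fio sketch - the
file path, block size and runtime below are only assumptions, adjust for your share:

fio --name=syncwrite --filename=/srv/nfs/share01/fio.test --size=1G \
    --rw=randwrite --bs=64k --ioengine=libaio --iodepth=32 \
    --direct=1 --sync=1 --time_based --runtime=60

Watching read latency on the same share while that runs shows the contention quite
clearly.)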

>
> Would you say that the IOPS starvation is more an issue of the large
> filesystem than the underlying Ceph/RBD?

As above - I think it's more to do with an IOPS limitation at the RBD device
level - likely due to sync write latency limiting the number of effective IOs.
That might be XFS as well, but I have not had the chance to dial that in more.

> With a cache-tier in place I'd expect all hot FS objects (inodes, etc) to be
> there and thus be as fast as it gets from a Ceph perspective.

Yeah - the cache tier takes a fair bit of the heat and improves the response
considerably for the SATA environments - it makes a significant difference.
The SSD-only pool images behave in a similar way but operate at a much higher
performance level before they start showing issues.

> OTOH lots of competing accesses to same journal, inodes would be a
> limitation inherent to the FS.

It's likely there is tuning there to improve the XFS performance, but the stats
of the RBD device show the latencies going up.  There might be more impact
further up the stack, but the underlying device is where the change in
performance shows.

>
> Christian
>
> >
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Osama Hasebou
> > Sent: Wednesday, 16 August 2017 10:34 PM
> > To: n...@fisk.me.uk
> > Cc: ceph-users 
> > Subject: Re: [ceph-users] VMware + Ceph using NFS sync/async ?
> >
> > Hi Nick,
> >
> > Thanks for replying! If Ceph is combined with Openstack then, does that
> mean that actually when openstack writes are happening, it is not fully sync'd
> (as in written to disks) before it starts receiving more data, so acting as 
> async
> ? In that scenario there is a chance for data loss if things go bad, i.e power
> outage or something like that ?
> >
> > As for the slow operations, reading is quite fine when I compare it to a SAN
> storage system connected to VMware. It is writing data, small chunks or big
> ones, that suffer when trying to use the sync option with FIO for
> benchmarking.
> >
> > In that case, I wonder, is no one using CEPH with VMware in a production
> environment ?
> >
> > Cheers.
> >
> > Regards,
> > Ossi
> >
> >
> >
> > Hi Osama,
> >
> > This is a known problem with many software defined storage stacks, but
> potentially slightly worse with Ceph due to extra overheads. Sync writes
> have to wait until all copies of the data are written to disk by the OSD and
> acknowledged back to the client. The extra network hops for replication and
> NFS gateways add significant latency which impacts the time it takes to carry
> out small writes. The Ceph code also takes time to process each IO request.
> >
> > What particular operations are you finding slow? Storage vmotions are just
> bad, and I don’t think there is much that can be done about them as they are
> split into lots of 64kb IO’s.
> >
> > One thing you can try is to force the CPU’s on your OSD nodes to run at C1
> cstate and force their minimum frequency to 100%. This can have quite a
> large impact on latency. Also you don’t specify your network, but 10G is a
> must.
> >
> > Nick
> >
> >
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Osama Hasebou
> > Sent: 14 August 2017 12:27
> > To: ceph-users
> > >
> > Subject: [ceph-users] VMware + Ceph using NFS sync/async ?
> >
> > Hi Everyone,
> >
> > We started testing the idea of 

[ceph-users] Per pool or per image RBD copy on read

2017-08-16 Thread Xavier Trilla
Hi,

Is it possible to enable copy-on-read for an RBD child image? I've been checking
around and it looks like the only way to enable copy-on-read is enabling it for
the whole cluster using:

rbd_clone_copy_on_read = true

Can it be enabled just for specific images or pools?

We keep some parent images on SSDs - the ones being used often - and plenty of
other images on HDD - as they aren't used so often - and we would like to enable
copy-on-read just for the child images of the ones stored on HDDs.

Thanks!


Re: [ceph-users] VMware + Ceph using NFS sync/async ?

2017-08-16 Thread Christian Balzer

Hello,

On Thu, 17 Aug 2017 00:13:24 + Adrian Saul wrote:

> We are using Ceph on NFS for VMWare – we are using SSD tiers in front of SATA 
> and some direct SSD pools.  The datastores are just XFS file systems on RBD 
> managed by a pacemaker cluster for failover.
> 
> Lessons so far are that large datastores quickly run out of IOPS and compete 
> for performance – you are better off with many smaller RBDs (say 1TB) to 
> spread out workloads.  Also tuning up NFS threads seems to help.
> 
I'd be interested in details of this small versus large bit.

Would you say that the IOPS starvation is more an issue of the large
filesystem than the underlying Ceph/RBD?

With a cache-tier in place I'd expect all hot FS objects (inodes, etc) to
be there and thus be as fast as it gets from a Ceph perspective. 

OTOH lots of competing accesses to same journal, inodes would be a
limitation inherent to the FS.

Christian

> 
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Osama Hasebou
> Sent: Wednesday, 16 August 2017 10:34 PM
> To: n...@fisk.me.uk
> Cc: ceph-users 
> Subject: Re: [ceph-users] VMware + Ceph using NFS sync/async ?
> 
> Hi Nick,
> 
> Thanks for replying! If Ceph is combined with Openstack then, does that mean 
> that actually when openstack writes are happening, it is not fully sync'd (as 
> in written to disks) before it starts receiving more data, so acting as async 
> ? In that scenario there is a chance for data loss if things go bad, i.e 
> power outage or something like that ?
> 
> As for the slow operations, reading is quite fine when I compare it to a SAN 
> storage system connected to VMware. It is writing data, small chunks or big 
> ones, that suffer when trying to use the sync option with FIO for 
> benchmarking.
> 
> In that case, I wonder, is no one using CEPH with VMware in a production 
> environment ?
> 
> Cheers.
> 
> Regards,
> Ossi
> 
> 
> 
> Hi Osama,
> 
> This is a known problem with many software defined storage stacks, but 
> potentially slightly worse with Ceph due to extra overheads. Sync writes have 
> to wait until all copies of the data are written to disk by the OSD and 
> acknowledged back to the client. The extra network hops for replication and 
> NFS gateways add significant latency which impacts the time it takes to carry 
> out small writes. The Ceph code also takes time to process each IO request.
> 
> What particular operations are you finding slow? Storage vmotions are just 
> bad, and I don’t think there is much that can be done about them as they are 
> split into lots of 64kb IO’s.
> 
> One thing you can try is to force the CPU’s on your OSD nodes to run at C1 
> cstate and force their minimum frequency to 100%. This can have quite a large 
> impact on latency. Also you don’t specify your network, but 10G is a must.
> 
> Nick
> 
> 
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Osama Hasebou
> Sent: 14 August 2017 12:27
> To: ceph-users >
> Subject: [ceph-users] VMware + Ceph using NFS sync/async ?
> 
> Hi Everyone,
> 
> We started testing the idea of using Ceph storage with VMware, the idea was 
> to provide Ceph storage through open stack to VMware, by creating a virtual 
> machine coming from Ceph + Openstack , which acts as an NFS gateway, then 
> mount that storage on top of VMware cluster.
> 
> When mounting the NFS exports using the sync option, we noticed a huge 
> degradation in performance which makes it very slow to use it in production, 
> the async option makes it much better but then there is the risk of it being 
> risky that in case a failure shall happen, some data might be lost in that 
> Scenario.
> 
> Now I understand that some people in the ceph community are using Ceph with 
> VMware using NFS gateways, so if you can kindly shed some light on your 
> experience, and if you do use it in production purpose, that would be great 
> and how did you mitigate the sync/async options and keep write performance.
> 
> 
> Thanks you!!!
> 
> Regards,
> Ossi
> 
> 




-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Rakuten Communications

Re: [ceph-users] VMware + Ceph using NFS sync/async ?

2017-08-16 Thread Adrian Saul

We are using Ceph on NFS for VMWare – we are using SSD tiers in front of SATA 
and some direct SSD pools.  The datastores are just XFS file systems on RBD 
managed by a pacemaker cluster for failover.

Lessons so far are that large datastores quickly run out of IOPS and compete 
for performance – you are better off with many smaller RBDs (say 1TB) to spread 
out workloads.  Also tuning up NFS threads seems to help.
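
(As an aside, the NFS thread bump is just the standard knob - e.g. on distributions
that still use /etc/sysconfig/nfs, something like:

RPCNFSDCOUNT=64

followed by a restart of the NFS server.  The value 64 is only an example, not a
tested recommendation.)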


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Osama 
Hasebou
Sent: Wednesday, 16 August 2017 10:34 PM
To: n...@fisk.me.uk
Cc: ceph-users 
Subject: Re: [ceph-users] VMware + Ceph using NFS sync/async ?

Hi Nick,

Thanks for replying! If Ceph is combined with OpenStack, does that mean that
when OpenStack writes are happening, they are not fully synced (as in written
to disk) before it starts receiving more data, so it is acting as async?  In
that scenario there is a chance of data loss if things go bad, e.g. a power
outage or something like that?

As for the slow operations, reading is quite fine when I compare it to a SAN 
storage system connected to VMware. It is writing data, small chunks or big 
ones, that suffer when trying to use the sync option with FIO for benchmarking.

In that case, I wonder, is no one using CEPH with VMware in a production 
environment ?

Cheers.

Regards,
Ossi



Hi Osama,

This is a known problem with many software defined storage stacks, but 
potentially slightly worse with Ceph due to extra overheads. Sync writes have 
to wait until all copies of the data are written to disk by the OSD and 
acknowledged back to the client. The extra network hops for replication and NFS 
gateways add significant latency which impacts the time it takes to carry out 
small writes. The Ceph code also takes time to process each IO request.

What particular operations are you finding slow? Storage vmotions are just bad, 
and I don’t think there is much that can be done about them as they are split 
into lots of 64kb IO’s.

One thing you can try is to force the CPU’s on your OSD nodes to run at C1 
cstate and force their minimum frequency to 100%. This can have quite a large 
impact on latency. Also you don’t specify your network, but 10G is a must.
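
(A hypothetical way to do that on Linux, assuming Intel CPUs with the intel_idle and
intel_pstate drivers - verify against your own platform first:

# limit C-states via the kernel command line, then reboot
grubby --update-kernel=ALL --args="intel_idle.max_cstate=1 processor.max_cstate=1"
# pin the minimum frequency to 100%
echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct

The same effect can also be had via BIOS settings or a tuned latency-performance
profile.)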

Nick


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Osama 
Hasebou
Sent: 14 August 2017 12:27
To: ceph-users >
Subject: [ceph-users] VMware + Ceph using NFS sync/async ?

Hi Everyone,

We started testing the idea of using Ceph storage with VMware, the idea was to 
provide Ceph storage through open stack to VMware, by creating a virtual 
machine coming from Ceph + Openstack , which acts as an NFS gateway, then mount 
that storage on top of VMware cluster.

> When mounting the NFS exports using the sync option, we noticed a huge
> degradation in performance which makes it too slow to use in production.  The
> async option performs much better, but then there is the risk that if a
> failure happens, some data might be lost in that scenario.

> Now I understand that some people in the Ceph community are using Ceph with
> VMware via NFS gateways, so if you could kindly shed some light on your
> experience - and, if you do use it in production, how you handled the
> sync/async options and kept write performance - that would be great.
> 
> 
> Thank you!!!

Regards,
Ossi




Re: [ceph-users] cluster unavailable for 20 mins when downed server was reintroduced

2017-08-16 Thread Gregory Farnum
On Wed, Aug 16, 2017 at 4:04 AM Sean Purdy  wrote:

> On Tue, 15 Aug 2017, Gregory Farnum said:
> > On Tue, Aug 15, 2017 at 4:23 AM Sean Purdy 
> wrote:
> > > I have a three node cluster with 6 OSD and 1 mon per node.
> > >
> > > I had to turn off one node for rack reasons.  While the node was down, the
> > > cluster was still running and accepting files via radosgw.  However, when I
> > > turned the machine back on, radosgw uploads stopped working and things like
> > > "ceph status" started timing out.  It took 20 minutes for "ceph status" to
> > > be OK.
>
> > > 2017-08-15 11:28:29.835943 7fdf2d74b700  0 monclient(hunting):
> > > authenticate timed out after 300
> > > 2017-08-15 11:28:29.835993 7fdf2d74b700  0 librados: client.admin authentication error
> > > (110) Connection timed out
> > >
> >
> > That just means the client couldn't connect to an in-quorum monitor. It
> > should have tried them all in sequence though — did you check if you had
> > *any* functioning quorum?
>
> There was a functioning quorum - I checked with "ceph --admin-daemon
> /var/run/ceph/ceph-mon.xxx.asok quorum_status".  Well - I interpreted the
> output as functioning.  There was a nominated leader.
>

Did you try running "ceph -s" from more than one location? If you had a
functioning quorum that should have worked. And any live clients should
have been able to keep working.


>
>
> > > 2017-08-15 11:23:07.180123 7f11c0fcc700  0 -- 172.16.0.43:0/2471 >>
> > > 172.16.0.45:6812/1904 conn(0x556eeaf4d000 :-1
> > > s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0
> > > l=0).handle_connect_reply connect got BADAUTHORIZER
> > >
> >
> > This one's odd. We did get one report of seeing something like that, but I
> > tend to think it's a clock sync issue.
>
> I saw some messages about clock sync, but ntpq -p looked OK on each
> server.  Will investigate further.
>
>      remote           refid      st t when poll reach   delay   offset  jitter
> ==============================================================================
> +172.16.0.16     129.250.35.250   3 u  847 1024  377    0.289    1.103   0.376
> +172.16.0.18     80.82.244.120    3 u   93 1024  377    0.397   -0.653   1.040
> *172.16.0.19     158.43.128.33    2 u  279 1024  377    0.244    0.262   0.158
>
>
> > > ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-9: (2) No
> > > such file or directory
> > >
> > And that would appear to be something happening underneath Ceph, wherein
> > your data wasn't actually all the way mounted or something?
>
> It's the machine mounting the disks at boot time - udev or ceph-osd.target
> keeps retrying until eventually the disk/OSD is mounted.  Or eventually it
> gives up.  Do the OSDs need a monitor quorum at startup?  It kept
> restarting OSDs for 20 mins.
>

I think they'll keep trying to connect but they may eventually time out; or
if they get a sufficiently mean response (such as BADAUTHORIZER) they may
shut down on their own.


>
> Timing went like this:
>
> 11:22 node boot
> 11:22 ceph-mon starts, recovers logs, compaction, first BADAUTHORIZER
> message
> 11:22 starting disk activation for 18 partitions (3 per bluestore)
> 11:23 mgr on other node can't find secret_id
> 11:43 bluefs mount succeeded on OSDs, ceph-osds go live
> 11:45 last BADAUTHORIZER message in monitor log
> 11:45 this host calls and wins a monitor election, mon_down health check
> clears
> 11:45 mgr happy
>

The timing there on the mounting (how does it take 20 minutes?!?!?) and
everything working again certainly is suspicious. It's not the direct cause
of the issue, but there may be something else going on which is causing
both of them.

All in all; I'm confused. The monitor being on ext4 can't influence this in
any way I can imagine.
-Greg


>
>
> > Anyway, it should have survived that transition without any noticeable
> > impact (unless you are running so close to capacity that merely getting the
> > downed node up-to-date overwhelmed your disks/cpu). But without some basic
> > information about what the cluster as a whole was doing I couldn't
> > speculate.
>
> This is a brand new 3 node cluster.  Dell R720 running Debian 9 with 2x
> SSD for OS and ceph-mon, 6x 2Tb SATA for ceph-osd using bluestore, per
> node.  Running radosgw as object store layer.  Only activity is a
> single-threaded test job uploading millions of small files over S3.  There
> are about 5.5million test objects so far (additionally 3x replication).
> This job was fine when the machine was down, stalled when machine booted.
>
> Looking at activity graphs at the time, there didn't seem to be a network
> bottleneck or CPU issue or disk throughput bottleneck.  But I'll look a bit
> closer.
>
> ceph-mon is on an ext4 filesystem though.   Perhaps I should move this to
> xfs?  Bluestore is xfs+bluestore.
>
> I presume it's a monitor issue somehow.
>
>
> > -Greg
>
> Thanks for your input.

Re: [ceph-users] Optimise Setup with Bluestore

2017-08-16 Thread Mark Nelson

Hi Mehmet!

On 08/16/2017 11:12 AM, Mehmet wrote:

:( no suggestions or recommendations on this?

On 14 August 2017 at 16:50:15 CEST, Mehmet wrote:

Hi friends,

my actual hardware setup per OSD-node is as follow:

# 3 OSD-Nodes with
- 2x Intel(R) Xeon(R) CPU E5-2603 v3 @ 1.60GHz ==> 12 Cores, no
Hyper-Threading
- 64GB RAM
- 12x 4TB HGST 7K4000 SAS2 (6GB/s) Disks as OSDs
- 1x INTEL SSDPEDMD400G4 (Intel DC P3700 NVMe) as Journaling Device for
12 Disks (20G Journal size)
- 1x Samsung SSD 840/850 Pro only for the OS

# and 1x OSD Node with
- 1x Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz (10 Cores 20 Threads)
- 64GB RAM
- 23x 2TB TOSHIBA MK2001TRKB SAS2 (6GB/s) Disks as OSDs
- 1x SEAGATE ST32000445SS SAS2 (6GB/s) Disk as OSDs
- 1x INTEL SSDPEDMD400G4 (Intel DC P3700 NVMe) as Journaling Device for
24 Disks (15G Journal size)
- 1x Samsung SSD 850 Pro only for the OS


The single P3700 for 23 spinning disks is pushing it.  They have high
write durability, but based on the model that is the 400GB version?  If
you are doing a lot of writes you might wear it out pretty fast, and it's
a single point of failure for the entire node (if it dies you have a lot
of data dying with it).  Unbalanced setups like this are generally also
trickier to get performing well.




As you can see, I am using 1 (one) NVMe (Intel DC P3700 NVMe – 400G)
device for all spinning disks (partitioned) on each OSD node.

When "Luminous" is available (as the next LTS) I plan to switch from
"filestore" to "bluestore".

As far as i have read bluestore consists of
- „the device“
- „block-DB“: device that store RocksDB metadata
- „block-WAL“: device that stores RocksDB „write-ahead journal“

Which setup would be useful in my case?
I would set up the disks via "ceph-deploy".


So typically we recommend something like a 1-2GB WAL partition on the 
NVMe drive per OSD and use the remaining space for DB.  If you run out 
of DB space, bluestore will start using the spinning disks to store KV 
data instead.  I suspect this will still be the advice you will want to 
follow, though at some point having so many WAL and DB partitions on the 
NVMe may start becoming a bottleneck.  Something like 63K sequential 
writes to heavily fragmented objects might be worth testing, but in most 
cases I suspect DB and WAL on NVMe is still going to be faster.
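
As a very rough sketch (the sizes below are illustrative assumptions, not tested
recommendations), that per-OSD sizing can be expressed in ceph.conf before the
bluestore OSDs are created:

[osd]
bluestore_block_wal_size = 2147483648     # ~2 GB WAL per OSD
bluestore_block_db_size  = 32212254720    # ~30 GB DB per OSD, i.e. roughly what is
                                          # left of the NVMe divided by the OSD count

ceph-disk prepare (or ceph-deploy on top of it) can then be pointed at the NVMe via
its --block.db / --block.wal arguments so the partitions get created at those sizes.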




Thanks in advance for your suggestions!
- Mehmet









Re: [ceph-users] CephFS billions of files and inline_data?

2017-08-16 Thread Michael Metz-Martini | SpeedPartner GmbH
Hi,

On 16.08.2017 at 19:31, Henrik Korkuc wrote:
> On 17-08-16 19:40, John Spray wrote:
>> On Wed, Aug 16, 2017 at 3:27 PM, Henrik Korkuc  wrote:
> maybe you can suggest any recommendations how to scale Ceph for billions
> of objects? More PGs per OSD, more OSDs, more pools? Somewhere in the
> list it was mentioned that OSDs need to keep object list in memory, is
> it still valid for bluestore?
We started using CephFS in 2014 and scaled to 4 billion small files in a
separate pool plus 500 million in a second pool - "only" 225 TB of data.

Unfortunately every object creates another object in the data pool, so
(due to size, with a replication of 2, which is a real pain in the a*)
we're now at about 16 billion inodes distributed over 136 spinning
disks. XFS performed very badly with such a huge number of files, so we
switched all OSDs to ext4 one by one, which helped a lot (but keep an
eye on your total number of inodes).
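
(For example - the path being wherever your OSDs are mounted - something like
"df -i /var/lib/ceph/osd/*" gives a quick per-OSD view of inode usage.)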

I'm quite sure we made many configuration mistakes (replication of 2;
too few PGs in the beginning) and had to learn a lot the hard way while
keeping the site up & running.

As our disks are filling up we would have to expand our storage - which
needs a rebalance that takes several months(!) - so we decided to leave
the Ceph train and migrate to a more filesystem-like setup. We don't
really need object stores, and it seems CephFS can't manage such a huge
number of files (or we're unable to optimize it for that use case). We
will give GlusterFS with RAID 6 underneath and NFS a try - more "basic"
and hopefully more robust.

-- 
Kind regards
 Michael Metz-Martini


[ceph-users] ceph luminous: error in manual installation when security enabled

2017-08-16 Thread Oscar Segarra
Hi,

As the ceph-deploy utility does not work properly with named clusters (other
than the default "ceph"), I have created the monitor for my named cluster
using the manual procedure:

http://docs.ceph.com/docs/master/install/manual-deployment/#monitor-bootstrapping

In the end, it starts up perfectly when security is disabled:

auth_cluster_required = none
auth_service_required = none
auth_client_required = none

But when I enable security in order to deploy mgr:

auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx

Monitor does not start... it just crashes

[root@vdicnode01 vdicmgmtcl]# /usr/bin/ceph-mon -d --cluster vdicmgmtcl
--id vdicnode01 --setuser ceph --setgroup ceph
2017-08-16 19:27:58.668063 7fed10697e40  0 set uid:gid to 167:167
(ceph:ceph)
2017-08-16 19:27:58.668078 7fed10697e40  0 ceph version 12.1.4
(a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous (rc), process
(unknown), pid 5049
2017-08-16 19:27:58.668107 7fed10697e40  0 pidfile_write: ignore empty
--pid-file
2017-08-16 19:27:58.687811 7fed10697e40  0 load: jerasure load: lrc load:
isa
2017-08-16 19:27:58.687893 7fed10697e40  0  set rocksdb option compression
= kNoCompression
2017-08-16 19:27:58.687898 7fed10697e40  0  set rocksdb option
write_buffer_size = 33554432
2017-08-16 19:27:58.687910 7fed10697e40  0  set rocksdb option compression
= kNoCompression
2017-08-16 19:27:58.687912 7fed10697e40  0  set rocksdb option
write_buffer_size = 33554432
2017-08-16 19:27:58.688000 7fed10697e40  4 rocksdb: RocksDB version: 5.4.0

2017-08-16 19:27:58.688004 7fed10697e40  4 rocksdb: Git sha
rocksdb_build_git_sha:@0@
2017-08-16 19:27:58.688005 7fed10697e40  4 rocksdb: Compile date Aug 15 2017
2017-08-16 19:27:58.688007 7fed10697e40  4 rocksdb: DB SUMMARY

2017-08-16 19:27:58.688041 7fed10697e40  4 rocksdb: CURRENT file:  CURRENT

2017-08-16 19:27:58.688043 7fed10697e40  4 rocksdb: IDENTITY file:  IDENTITY

2017-08-16 19:27:58.688045 7fed10697e40  4 rocksdb: MANIFEST file:
 MANIFEST-000242 size: 264 Bytes

2017-08-16 19:27:58.688047 7fed10697e40  4 rocksdb: SST files in
/var/lib/ceph/mon/vdicmgmtcl-vdicnode01/store.db dir, Total Num: 4, files:
78.sst 91.sst 000166.sst 000169.sst

2017-08-16 19:27:58.688048 7fed10697e40  4 rocksdb: Write Ahead Log file in
/var/lib/ceph/mon/vdicmgmtcl-vdicnode01/store.db: 000243.log size: 0 ;

2017-08-16 19:27:58.688050 7fed10697e40  4 rocksdb:
Options.error_if_exists: 0
2017-08-16 19:27:58.688050 7fed10697e40  4 rocksdb:
Options.create_if_missing: 0
2017-08-16 19:27:58.688050 7fed10697e40  4 rocksdb:
Options.paranoid_checks: 1
2017-08-16 19:27:58.688051 7fed10697e40  4 rocksdb:
Options.env: 0x7fed197a5de0
2017-08-16 19:27:58.688051 7fed10697e40  4 rocksdb:
   Options.info_log: 0x7fed1a8393c0
2017-08-16 19:27:58.688052 7fed10697e40  4 rocksdb:
 Options.max_open_files: -1
2017-08-16 19:27:58.688052 7fed10697e40  4 rocksdb:
 Options.max_file_opening_threads: 16
2017-08-16 19:27:58.688053 7fed10697e40  4 rocksdb:
  Options.use_fsync: 0
2017-08-16 19:27:58.688053 7fed10697e40  4 rocksdb:
Options.max_log_file_size: 0
2017-08-16 19:27:58.688054 7fed10697e40  4 rocksdb:
 Options.max_manifest_file_size: 18446744073709551615
2017-08-16 19:27:58.688054 7fed10697e40  4 rocksdb:
Options.log_file_time_to_roll: 0
2017-08-16 19:27:58.688055 7fed10697e40  4 rocksdb:
Options.keep_log_file_num: 1000
2017-08-16 19:27:58.688055 7fed10697e40  4 rocksdb:
 Options.recycle_log_file_num: 0
2017-08-16 19:27:58.688055 7fed10697e40  4 rocksdb:
Options.allow_fallocate: 1
2017-08-16 19:27:58.688056 7fed10697e40  4 rocksdb:
 Options.allow_mmap_reads: 0
2017-08-16 19:27:58.688056 7fed10697e40  4 rocksdb:
Options.allow_mmap_writes: 0
2017-08-16 19:27:58.688057 7fed10697e40  4 rocksdb:
 Options.use_direct_reads: 0
2017-08-16 19:27:58.688057 7fed10697e40  4 rocksdb:
 Options.use_direct_io_for_flush_and_compaction: 0
2017-08-16 19:27:58.688057 7fed10697e40  4 rocksdb:
 Options.create_missing_column_families: 0
2017-08-16 19:27:58.688063 7fed10697e40  4 rocksdb:
 Options.db_log_dir:
2017-08-16 19:27:58.688063 7fed10697e40  4 rocksdb:
Options.wal_dir: /var/lib/ceph/mon/vdicmgmtcl-vdicnode01/store.db
2017-08-16 19:27:58.688064 7fed10697e40  4 rocksdb:
 Options.table_cache_numshardbits: 6
2017-08-16 19:27:58.688064 7fed10697e40  4 rocksdb:
 Options.max_subcompactions: 1
2017-08-16 19:27:58.688065 7fed10697e40  4 rocksdb:
 Options.max_background_flushes: 1
2017-08-16 19:27:58.688065 7fed10697e40  4 rocksdb:
Options.WAL_ttl_seconds: 0
2017-08-16 19:27:58.688066 7fed10697e40  4 rocksdb:
Options.WAL_size_limit_MB: 0
2017-08-16 19:27:58.688066 7fed10697e40  4 rocksdb:
Options.manifest_preallocation_size: 4194304
2017-08-16 19:27:58.688066 7fed10697e40  4 rocksdb:
Options.is_fd_close_on_exec: 1
2017-08-16 19:27:58.688067 7fed10697e40  4 rocksdb:
Options.advise_random_on_open: 1
2017-08-16 19:27:58.688067 7fed10697e40  4 rocksdb:
 Options.db_write_buffer_size: 0
2017-08-16 

Re: [ceph-users] CephFS billions of files and inline_data?

2017-08-16 Thread Henrik Korkuc

On 17-08-16 19:40, John Spray wrote:

On Wed, Aug 16, 2017 at 3:27 PM, Henrik Korkuc  wrote:

Hello,

I have use case for billions of small files (~1KB) on CephFS and as to my
experience having billions of objects in a pool is not very good idea (ops
slow down, large memory usage, etc) I decided to test CephFS inline_data.
After activating this feature and starting copy process I noticed that
objects are still created on data pool, but their size is 0. Is this
expected behavior? Maybe someone can share tips on using large amount of
small objects? I am on 12.1.3, already using decreased min block size for
bluestore.

Couple of thoughts:
  - Frequently when someone has a "billions of small files" workload
they really want an object store, not a filesystem

in this case I need POSIX, to replace current system.


  - In many cases the major per-file overhead is MDS CPU req/s rather
than the OSD ops, so inline data may be efficient but not result in
overall speedup
  - If you do need to get rid of the overhead of writing objects to the
data pool, you could work on creating a special backtraceless flag
(per-filesystem), where the filesystem cannot do lookups by inode (no
NFS, no hardlinks, limited disaster recovery), but it doesn't write
backtraces either.
It looks like I may need backtraces, so will need to put bunch of 
objects into pools.


Maybe you can suggest some recommendations on how to scale Ceph for billions
of objects? More PGs per OSD, more OSDs, more pools? Somewhere on the
list it was mentioned that OSDs need to keep the object list in memory - is
that still valid for bluestore?


Also setting bluestore_min_alloc_size_* to 1024 results in 
"/tmp/buildd/ceph-12.1.3/src/common/Checksummer.h: 219: FAILED 
assert(csum_data->length() >= (offset + length) / csum_block_size * 
sizeof(typename Alg::value_t))" during OSD start right after ceph-disk 
prepare.



John




Re: [ceph-users] Optimise Setup with Bluestore

2017-08-16 Thread David Turner
Honestly there isn't enough information about your use case.  RBD usage
with small IO vs ObjectStore with large files vs ObjectStore with small
files vs any number of things.  The answer to your question might be that
for your needs you should look at having a completely different hardware
configuration than what you're running.  There is no correct way to
configure your cluster based on what hardware you have.  What hardware you
use and what configuration settings you use should be based on your needs
and use case.

On Wed, Aug 16, 2017 at 12:13 PM Mehmet  wrote:

> :( no suggestions or recommendations on this?
>
> On 14 August 2017 at 16:50:15 CEST, Mehmet wrote:
>
>> Hi friends,
>>
>> my actual hardware setup per OSD-node is as follow:
>>
>> # 3 OSD-Nodes with
>> - 2x Intel(R) Xeon(R) CPU E5-2603 v3 @ 1.60GHz ==> 12 Cores, no
>> Hyper-Threading
>> - 64GB RAM
>> - 12x 4TB HGST 7K4000 SAS2 (6GB/s) Disks as OSDs
>> - 1x INTEL SSDPEDMD400G4 (Intel DC P3700 NVMe) as Journaling Device for
>> 12 Disks (20G Journal size)
>> - 1x Samsung SSD 840/850 Pro only for the OS
>>
>> # and 1x OSD Node with
>> - 1x Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz (10 Cores 20 Threads)
>> - 64GB RAM
>> - 23x 2TB TOSHIBA MK2001TRKB SAS2 (6GB/s) Disks as OSDs
>> - 1x SEAGATE ST32000445SS SAS2 (6GB/s) Disk as OSDs
>> - 1x INTEL SSDPEDMD400G4 (Intel DC P3700 NVMe) as Journaling Device for
>> 24 Disks (15G Journal size)
>> - 1x Samsung SSD 850 Pro only for the OS
>>
>> As you can see, i am using 1 (one) NVMe (Intel DC P3700 NVMe – 400G)
>> Device for whole Spinning Disks (partitioned) on each OSD-node.
>>
>> When "Luminous" is available (as the next LTS) I plan to switch from
>> "filestore" to "bluestore".
>>
>> As far as i have read bluestore consists of
>> - „the device“
>> - „block-DB“: device that store RocksDB metadata
>> - „block-WAL“: device that stores RocksDB „write-ahead journal“
>>
>> Which setup would be useful in my case?
>> I would set up the disks via "ceph-deploy".
>>
>> Thanks in advance for your suggestions!
>> - Mehmet


Re: [ceph-users] CephFS billions of files and inline_data?

2017-08-16 Thread John Spray
On Wed, Aug 16, 2017 at 3:27 PM, Henrik Korkuc  wrote:
> Hello,
>
> I have use case for billions of small files (~1KB) on CephFS and as to my
> experience having billions of objects in a pool is not very good idea (ops
> slow down, large memory usage, etc) I decided to test CephFS inline_data.
> After activating this feature and starting copy process I noticed that
> objects are still created on data pool, but their size is 0. Is this
> expected behavior? Maybe someone can share tips on using large amount of
> small objects? I am on 12.1.3, already using decreased min block size for
> bluestore.

Couple of thoughts:
 - Frequently when someone has a "billions of small files" workload
they really want an object store, not a filesystem
 - In many cases the major per-file overhead is MDS CPU req/s rather
than the OSD ops, so inline data may be efficient but not result in
overall speedup
 - If you do need to get rid of the overhead of writing objects to the
data pool, you could work on creating a special backtraceless flag
(per-filesystem), where the filesystem cannot do lookups by inode (no
NFS, no hardlinks, limited disaster recovery), but it doesn't write
backtraces either.

John

>


Re: [ceph-users] Optimise Setup with Bluestore

2017-08-16 Thread Mehmet
:( no suggestions or recommendations on this? 

On 14 August 2017 at 16:50:15 CEST, Mehmet wrote:
>Hi friends,
>
>my actual hardware setup per OSD-node is as follow:
>
># 3 OSD-Nodes with
>- 2x Intel(R) Xeon(R) CPU E5-2603 v3 @ 1.60GHz ==> 12 Cores, no 
>Hyper-Threading
>- 64GB RAM
>- 12x 4TB HGST 7K4000 SAS2 (6GB/s) Disks as OSDs
>- 1x INTEL SSDPEDMD400G4 (Intel DC P3700 NVMe) as Journaling Device for
>
>12 Disks (20G Journal size)
>- 1x Samsung SSD 840/850 Pro only for the OS
>
># and 1x OSD Node with
>- 1x Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz (10 Cores 20 Threads)
>- 64GB RAM
>- 23x 2TB TOSHIBA MK2001TRKB SAS2 (6GB/s) Disks as OSDs
>- 1x SEAGATE ST32000445SS SAS2 (6GB/s) Disk as OSDs
>- 1x INTEL SSDPEDMD400G4 (Intel DC P3700 NVMe) as Journaling Device for
>
>24 Disks (15G Journal size)
>- 1x Samsung SSD 850 Pro only for the OS
>
>As you can see, i am using 1 (one) NVMe (Intel DC P3700 NVMe – 400G) 
>Device for whole Spinning Disks (partitioned) on each OSD-node.
>
>When "Luminous" is available (as the next LTS) I plan to switch from 
>"filestore" to "bluestore".
>
>As far as i have read bluestore consists of
>-  „the device“
>-  „block-DB“: device that store RocksDB metadata
>-  „block-WAL“: device that stores RocksDB „write-ahead journal“
>
>Which setup would be useful in my case?
>I would set up the disks via "ceph-deploy".
>
>Thanks in advance for your suggestions!
>- Mehmet


Re: [ceph-users] Ceph cluster in error state (full) with raw usage 32% of total capacity

2017-08-16 Thread Mandar Naik
Thanks a lot for the reply. To eliminate the issues of the root not being
present and duplicate entries in the CRUSH map, I have updated my CRUSH map.
Now I have a default root and a CRUSH hierarchy without duplicate entries.

I have now created one pool local to host "ip-10-0-9-233" and another pool
local to host "ip-10-0-9-126", using the respective CRUSH rules as pasted
below. After host "ip-10-0-9-233" gets full, requests to write new keys to
the pool local to host "ip-10-0-9-126" time out.  From the "ceph pg dump"
output I see PGs only getting stored on the respective hosts, so PG
interference across pools does not seem to be an issue, to me at least.

The purpose of keeping one pool local to a host is not locality. The use case
is a single point of solution for both local and replicated data, so clients
need to know only the pool name during read/write operations.

I am not sure if this use case fits with Ceph. So I am trying to determine
whether there is any option in Ceph to make it understand that only one host
is full, and that it could still serve new write requests as long as they do
not touch the OSD that is full.


Test output:


#ceph osd dump


epoch 93

fsid 7a238d99-67ed-4610-540a-449043b3c24e

created 2017-08-16 09:34:15.580112

modified 2017-08-16 11:55:40.676234

flags sortbitwise,require_jewel_osds

pool 7 'ip-10-0-9-233-pool' replicated size 1 min_size 1 crush_ruleset 1
object_hash rjenkins pg_num 128 pgp_num 128 last_change 87 flags hashpspool
stripe_width 0

pool 8 'ip-10-0-9-126-pool' replicated size 1 min_size 1 crush_ruleset 2
object_hash rjenkins pg_num 128 pgp_num 128 last_change 92 flags hashpspool
stripe_width 0

max_osd 3


# ceph -s

cluster 7a238d99-67ed-4610-540a-449043b3c24e

health HEALTH_OK

monmap e3: 3 mons at {ip-10-0-9-126=10.0.9.126:6789/0,ip-10-0-9-233=10.0.9.
233:6789/0,ip-10-0-9-250=10.0.9.250:6789/0}

election epoch 8, quorum 0,1,2 ip-10-0-9-126,ip-10-0-9-233,
ip-10-0-9-250

osdmap e93: 3 osds: 3 up, 3 in

flags sortbitwise,require_jewel_osds

  pgmap v679: 256 pgs, 2 pools, 0 bytes data, 0 objects

106 MB used, 134 GB / 134 GB avail

 256 active+clean


# ceph osd tree

ID WEIGHT  TYPE NAMEUP/DOWN REWEIGHT PRIMARY-AFFINITY

-1 0.13197 root default

-5 0.04399 rack ip-10-0-9-233-rack

-3 0.04399  host ip-10-0-9-233

0 0.04399  osd.0 up  1.0   1.0

-7 0.04399 rack ip-10-0-9-126-rack

-6 0.04399  host ip-10-0-9-126

1 0.04399  osd.1 up  1.0   1.0

-9 0.04399 rack ip-10-0-9-250-rack

-8 0.04399  host ip-10-0-9-250

2 0.04399  osd.2 up  1.0   1.0


# ceph osd crush rule list

[

"ip-10-0-9-233_ruleset",

"ip-10-0-9-126_ruleset",

"ip-10-0-9-250_ruleset",

"replicated_ruleset"

]


# ceph osd crush rule dump ip-10-0-9-233_ruleset

{

"rule_id": 0,

"rule_name": "ip-10-0-9-233_ruleset",

"ruleset": 1,

"type": 1,

"min_size": 1,

"max_size": 10,

"steps": [

{

"op": "take",

"item": -5,

"item_name": "ip-10-0-9-233-rack"

},

{

"op": "chooseleaf_firstn",

"num": 0,

"type": "host"

},

{

"op": "emit"

}

]

}



# ceph osd crush rule dump ip-10-0-9-126_ruleset

{

"rule_id": 1,

"rule_name": "ip-10-0-9-126_ruleset",

"ruleset": 2,

"type": 1,

"min_size": 1,

"max_size": 10,

"steps": [

{

"op": "take",

"item": -7,

"item_name": "ip-10-0-9-126-rack"

},

{

"op": "chooseleaf_firstn",

"num": 0,

"type": "host"

},

{

"op": "emit"

}

]

}


# ceph osd crush rule dump replicated_ruleset

{

"rule_id": 4,

"rule_name": "replicated_ruleset",

"ruleset": 4,

"type": 1,

"min_size": 1,

"max_size": 10,

"steps": [

{

"op": "take",

"item": -1,

"item_name": "default"

},

{

"op": "chooseleaf_firstn",

"num": 0,

"type": "host"

},

{

"op": "emit"

}

]


# ceph -s

cluster 7a238d99-67ed-4610-540a-449043b3c24e

health HEALTH_ERR

1 full osd(s)

full,sortbitwise,require_jewel_osds flag(s) set

monmap e3: 3 mons at {ip-10-0-9-126=10.0.9.126:6789/0,ip-10-0-9-233=10.0.9.
233:6789/0,ip-10-0-9-250=10.0.9.250:6789/0}

election epoch 8, quorum 0,1,2 ip-10-0-9-126,ip-10-0-9-233,
ip-10-0-9-250

osdmap e99: 3 osds: 3 up, 3 in

flags full,sortbitwise,require_jewel_osds

  pgmap v920: 256 pgs, 2 pools, 42696 MB data, 2 objects

44844 MB used, 93324 MB / 134 GB avail

 256 active+clean


# ceph osd df

ID WEIGHT  REWEIGHT SIZE   USE AVAIL  %USE  VAR  PGS

0 0.04399  1.0 46056M 43801M  2255M 95.10 3.00 128

1 0.04399  1.0 46056M 36708k 46020M  0.08 0.00 128

2 0.04399  1.0 46056M 34472k 46022M  0.07 0.00   0

  TOTAL   134G 43870M 94298M 31.75

MIN/MAX VAR: 0.00/3.00  STDDEV: 44.80


# ceph df

GLOBAL:

SIZE AVAIL   RAW USED %RAW USED


Re: [ceph-users] CephFS billions of files and inline_data?

2017-08-16 Thread Gregory Farnum
On Wed, Aug 16, 2017 at 7:28 AM Henrik Korkuc  wrote:

> Hello,
>
> I have use case for billions of small files (~1KB) on CephFS and as to
> my experience having billions of objects in a pool is not very good idea
> (ops slow down, large memory usage, etc) I decided to test CephFS
> inline_data. After activating this feature and starting copy process I
> noticed that objects are still created on data pool, but their size is
> 0. Is this expected behavior? Maybe someone can share tips on using
> large amount of small objects? I am on 12.1.3, already using decreased
> min block size for bluestore.
>

This is expected. It still creates the objects because it needs them to do
inode-based lookups of the file path, and because if the file grows larger
it will need to move it from inline to regular storage anyway.

Do keep in mind that while the inline data feature is pretty well-tested in
our nightlies, it sounds like you'll be exercising it a lot more than
anybody has in the field yet. :)
-Greg


Re: [ceph-users] Radosgw returns 404 Not Found

2017-08-16 Thread Martin Emrich
Thanks for the hint!

If I use the IP address of the rados gateway or the DNS name configured under 
„rgw dns name”, I get a 403 instead of 404. And that could be remedied by using 
the user that initially created the bucket.

Cheers,

Martin


From: David Turner
Date: Wednesday, 16 August 2017 at 16:31
To: Martin Emrich, "ceph-us...@ceph.com"
Subject: Re: [ceph-users] Radosgw returns 404 Not Found


You need to fix your endpoint URLs. The line that shows this is.

2017-08-16 14:02:21.725967 7fc7f5317700 10 s->object=s3testbucket-1 
s->bucket=ceph-kl-mon1.de.empolis.com

It thinks your bucket is your domain name and your object is your bucket
name. If you did this using an IP instead of a URL it would work. Look up how
to configure your endpoints correctly to fix this issue.

On Wed, Aug 16, 2017, 9:22 AM Martin Emrich wrote:
Hi!

I have the following issue: While “radosgw bucket list” shows me my buckets, S3 
API clients only get a “404 Not Found”. With debug level 20, I see the 
following output of the radosgw service:

2017-08-16 14:02:21.725959 7fc7f5317700 20 rgw::auth::s3::LocalEngine granted 
access
2017-08-16 14:02:21.725960 7fc7f5317700 20 rgw::auth::s3::AWSAuthStrategy 
granted access
2017-08-16 14:02:21.725963 7fc7f5317700  2 req 1:0.004722:s3:GET 
/s3testbucket-1:get_obj:normalizing buckets and tenants
2017-08-16 14:02:21.725967 7fc7f5317700 10 s->object=s3testbucket-1 
s->bucket=ceph-kl-mon1.de.empolis.com
2017-08-16 14:02:21.725974 7fc7f5317700  2 req 1:0.004734:s3:GET 
/s3testbucket-1:get_obj:init permissions
2017-08-16 14:02:21.725986 7fc7f5317700 20 get_system_obj_state: 
rctx=0x7fc7f530fe50 
obj=default.rgw.data.root:ceph-kl-mon1.de.empolis.com
 state=0x7fc915db1e00 s->prefetch_data=0
2017-08-16 14:02:21.725990 7fc7f5317700 10 cache get: 
name=default.rgw.data.root++ceph-kl-mon1.de.empolis.com
 : miss
2017-08-16 14:02:21.728248 7fc7f5317700 10 cache put: 
name=default.rgw.data.root++ceph-kl-mon1.de.empolis.com
 info.flags=0x0
2017-08-16 14:02:21.728253 7fc7f5317700 10 adding 
default.rgw.data.root++ceph-kl-mon1.de.empolis.com
 to cache LRU end
2017-08-16 14:02:21.728285 7fc7f5317700 10 read_permissions on 
ceph-kl-mon1.de.empolis.com[]) ret=-2002
2017-08-16 14:02:21.728287 7fc7f5317700 20 op->ERRORHANDLER: err_no=-2002 
new_err_no=-2002
2017-08-16 14:02:21.728368 7fc7f5317700  2 req 1:0.007127:s3:GET 
/s3testbucket-1:get_obj:op status=0
2017-08-16 14:02:21.728371 7fc7f5317700  2 req 1:0.007130:s3:GET 
/s3testbucket-1:get_obj:http status=404
2017-08-16 14:02:21.728380 7fc7f5317700  1 == req done req=0x7fc7f5311190 
op status=0 http_status=404 ==

I see the line with “err_no=-2002”. What does that mean?

Thanks

Martin


Re: [ceph-users] Radosgw returns 404 Not Found

2017-08-16 Thread David Turner
You need to fix your endpoint URLs. The line that shows this is.

2017-08-16 14:02:21.725967 7fc7f5317700 10 s->object=s3testbucket-1
s->bucket=ceph-kl-mon1.de.empolis.com

It thinks your bucket is your domain name and your object is your bucket
name. If you did this using an IP instead of a URL it would work. Look up how
to configure your endpoints correctly to fix this issue.
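
For reference, the usual fix is telling the gateway which hostname to treat as its
own, e.g. in ceph.conf (the section name below is an assumption, use whatever your
rgw instance is called):

[client.rgw.ceph-kl-mon1]
rgw dns name = ceph-kl-mon1.de.empolis.com

and then restarting the radosgw service; alternatively, point the S3 client at the
IP/hostname configured there using path-style requests.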

On Wed, Aug 16, 2017, 9:22 AM Martin Emrich 
wrote:

> Hi!
>
>
>
> I have the following issue: While “radosgw bucket list” shows me my
> buckets, S3 API clients only get a “404 Not Found”. With debug level 20, I
> see the following output of the radosgw service:
>
>
>
> 2017-08-16 14:02:21.725959 7fc7f5317700 20 rgw::auth::s3::LocalEngine
> granted access
>
> 2017-08-16 14:02:21.725960 7fc7f5317700 20 rgw::auth::s3::AWSAuthStrategy
> granted access
>
> 2017-08-16 14:02:21.725963 7fc7f5317700  2 req 1:0.004722:s3:GET
> /s3testbucket-1:get_obj:normalizing buckets and tenants
>
> 2017-08-16 14:02:21.725967 7fc7f5317700 10 s->object=s3testbucket-1
> s->bucket=ceph-kl-mon1.de.empolis.com
>
> 2017-08-16 14:02:21.725974 7fc7f5317700  2 req 1:0.004734:s3:GET
> /s3testbucket-1:get_obj:init permissions
>
> 2017-08-16 14:02:21.725986 7fc7f5317700 20 get_system_obj_state:
> rctx=0x7fc7f530fe50 obj=default.rgw.data.root:ceph-kl-mon1.de.empolis.com
> state=0x7fc915db1e00 s->prefetch_data=0
>
> 2017-08-16 14:02:21.725990 7fc7f5317700 10 cache get:
> name=default.rgw.data.root++ceph-kl-mon1.de.empolis.com : miss
>
> 2017-08-16 14:02:21.728248 7fc7f5317700 10 cache put:
> name=default.rgw.data.root++ceph-kl-mon1.de.empolis.com info.flags=0x0
>
> 2017-08-16 14:02:21.728253 7fc7f5317700 10 adding default.rgw.data.root++
> ceph-kl-mon1.de.empolis.com to cache LRU end
>
> 2017-08-16 14:02:21.728285 7fc7f5317700 10 read_permissions on
> ceph-kl-mon1.de.empolis.com[]) ret=-2002
>
> 2017-08-16 14:02:21.728287 7fc7f5317700 20 op->ERRORHANDLER: err_no=-2002
> new_err_no=-2002
>
> 2017-08-16 14:02:21.728368 7fc7f5317700  2 req 1:0.007127:s3:GET
> /s3testbucket-1:get_obj:op status=0
>
> 2017-08-16 14:02:21.728371 7fc7f5317700  2 req 1:0.007130:s3:GET
> /s3testbucket-1:get_obj:http status=404
>
> 2017-08-16 14:02:21.728380 7fc7f5317700  1 == req done
> req=0x7fc7f5311190 op status=0 http_status=404 ==
>
>
>
> I see the line with “err_no=-2002”. What does that mean?
>
>
>
> Thanks
>
>
>
> Martin


[ceph-users] CephFS billions of files and inline_data?

2017-08-16 Thread Henrik Korkuc

Hello,

I have a use case for billions of small files (~1KB) on CephFS, and as in
my experience having billions of objects in a pool is not a very good idea
(ops slow down, large memory usage, etc.) I decided to test CephFS
inline_data. After activating this feature and starting the copy process I
noticed that objects are still created in the data pool, but their size is
0. Is this expected behavior? Maybe someone can share tips on using a
large amount of small objects? I am on 12.1.3, already using a decreased
min block size for bluestore.
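
(For context, the "decreased min block size" referred to above is the bluestore
allocation size, set before OSD creation - the 4 KB values below are just an
illustration of the kind of setting meant, not a recommendation.)

[osd]
bluestore_min_alloc_size_hdd = 4096
bluestore_min_alloc_size_ssd = 4096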




Re: [ceph-users] BlueStore WAL or DB devices on a distant SSD ?

2017-08-16 Thread David Turner
Would reads and writes to the SSD on another server be faster than reads
and writes to HDD on the local server? If the answer is no, then even if
this was possible it would be worse than just putting your WAL and DB on
the same HDD locally.  I don't think this is a use case the devs planned
for.

You can definitely put multiple WAL land DB partitions on a single SSD.

On Wed, Aug 16, 2017, 6:04 AM Hervé Ballans 
wrote:

> Hi,
>
> We are currently running two Proxmox/ceph clusters that work perfectly
> (since 2014) and thank to this succesful experience, we plan to install
> a new Ceph cluster for storage of our computing cluster.
>
> Until now, we only used RBD (virtualization context) but now we want to
> use CephFS for this new cluster (separated from the other two, hardware
> is different and dedicated for this new clusters).
>
> I'm interested in testing a CephFS cluster with BlueStore as a backend
> storage.
>
> I have several OSDs servers (with a dozen SATA HDDs on each) but some do
> not have an additional SSD disk (only 3 of the servers have an
> additional SSD).
>
> My question is about BlueStore WAL/DB devices. When I read the
> documentation, it seems that adding both WAL and DB devices improve
> BlueStore performances.
>
> But, can we configure these devices on a distant SSD (I mean on a SSD
> which is not on the local OSDs server but on an another machine which is
> on the same Ceph cluster) ?
>
> If yes, can I configure mulitple WAL or DB devices on the same SSD ?
>
> And finally, is it relevant to do that (I mean in term of performance) ?
>
> Hoping to have been clear on my context, thanks in advance for your
> reply  or your reflection.
>
> Regards,
>
> Hervé
>
>


Re: [ceph-users] VMware + Ceph using NFS sync/async ?

2017-08-16 Thread Nick Fisk
Hi Matt,

 

Well-behaved applications are the problem here. ESXi sends all writes as sync
writes. So although OSes will still do their own buffering, any ESXi-level
operation is all done as sync. This is probably seen the most when migrating
VMs between datastores: everything gets done as sync 64KB IOs, meaning copying
a 1TB VM can often take nearly 24 hours.

 

Osama, can you describe the difference in performance you see between OpenStack
and ESXi, and what type of operations these are? Sync writes should be the same
no matter the client, except that in the NFS case you will have an extra network
hop and potentially a little bit of PG congestion around the FS journal on the
RBD device.

 

Osama, you can't compare Ceph to a SAN. Just in terms of network latency you
have an extra 2 hops. In an ideal scenario you might be able to get Ceph write
latency down to 0.5-1ms for a 4KB IO, compared to about 0.1-0.3ms for a
storage array. However, what you will find with Ceph is that other things start
to increase this average long before you would start to see that on storage
arrays.

 

The migration is a good example of this. As I said, ESXi migrates a VM in 64KB
IOs, but does 32 of these blocks in parallel at a time. On storage arrays,
these 64KB IOs are coalesced in the battery-protected write cache into bigger
IOs before being persisted to disk. The storage array can also accept all 32
of these requests at once.
 

A similar thing happens in Ceph/RBD/NFS via the Ceph filestore journal, but
that coalescing is now an extra 2 hops away, and with the bit of extra latency
introduced by the Ceph code we are already a bit slower. But here's the
killer: PG locking!!! You can't write 32 IOs in parallel to the same
object/PG; each one has to be processed sequentially because of the locks.
(Please someone correct me if I'm wrong here.) If your 64KB write latency is
2ms, then you can only do 500 64KB IOs a second. 64KB * 500 = ~30MB/s, vs. a
storage array which would be doing the operation in the hundreds of MB/s range.

 

Note: When proper iSCSI for RBD support is finished, you might be able to use 
the VAAI offloads, which would dramatically increase performance for migrations 
as well.

 

Also once persistent SSD write caching for librbd becomes available, a lot of 
these problems will go away, as the SSD will behave like a storage array’s 
write cache and will only be 1 hop away from the client as well.

 

From: Matt Benjamin [mailto:mbenj...@redhat.com] 
Sent: 16 August 2017 14:49
To: Osama Hasebou 
Cc: n...@fisk.me.uk; ceph-users 
Subject: Re: [ceph-users] VMware + Ceph using NFS sync/async ?

 

Hi Osama,

I don't have a clear sense of the the application workflow here--and Nick 
appears to--but I thought it worth noting that NFSv3 and NFSv4 clients 
shouldn't normally need the sync mount option to achieve i/o stability with 
well-behaved applications.  In both versions of the protocol, an application 
write that is synchronous (or, more typically, the equivalent application sync 
barrier) should not succeed until an NFS-protocol COMMIT (or in some cases 
w/NFSv4, WRITE w/stable flag set) has been acknowledged by the NFS server.  If 
the NFS I/O stability model is insufficient for your workflow, moreover, I'd
be worried that -o sync writes (which might be incompletely applied during a
failure event) may not be correctly enforcing your invariant, either.
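
For completeness, the sync/async knob being debated shows up both as a client mount
option and as a server-side export option; a hypothetical /etc/exports line for the
latter (path and network are placeholders) would be:

/srv/nfs/datastore01  10.0.0.0/24(rw,sync,no_wdelay,no_subtree_check)

with "async" in place of "sync" being the faster but riskier variant.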

 

Matt

 

On Wed, Aug 16, 2017 at 8:33 AM, Osama Hasebou wrote:

Hi Nick,

 

Thanks for replying! If Ceph is combined with Openstack then, does that mean 
that actually when openstack writes are happening, it is not fully sync'd (as 
in written to disks) before it starts receiving more data, so acting as async ? 
In that scenario there is a chance for data loss if things go bad, i.e power 
outage or something like that ?

 

As for the slow operations, reading is quite fine when I compare it to a SAN 
storage system connected to VMware. It is writing data, small chunks or big 
ones, that suffer when trying to use the sync option with FIO for benchmarking.

 

In that case, I wonder, is no one using CEPH with VMware in a production 
environment ?

 

Cheers.

 

Regards,
Ossi

 

 

 

Hi Osama,

 

This is a known problem with many software defined storage stacks, but 
potentially slightly worse with Ceph due to extra overheads. Sync writes have 
to wait until all copies of the data are written to disk by the OSD and 
acknowledged back to the client. The extra network hops for replication and NFS 
gateways add significant latency which impacts the time it takes to carry out 
small writes. The Ceph code also takes time to process each IO request.

 

What particular operations are you finding slow? Storage vmotions are just bad, 
and I don’t think there is much that can be done about them as they are split 
into lots of 64kb IO’s.

[ceph-users] Radosgw returns 404 Not Found

2017-08-16 Thread Martin Emrich
Hi!

I have the following issue: While “radosgw bucket list” shows me my buckets, S3 
API clients only get a “404 Not Found”. With debug level 20, I see the 
following output of the radosgw service:

2017-08-16 14:02:21.725959 7fc7f5317700 20 rgw::auth::s3::LocalEngine granted 
access
2017-08-16 14:02:21.725960 7fc7f5317700 20 rgw::auth::s3::AWSAuthStrategy 
granted access
2017-08-16 14:02:21.725963 7fc7f5317700  2 req 1:0.004722:s3:GET 
/s3testbucket-1:get_obj:normalizing buckets and tenants
2017-08-16 14:02:21.725967 7fc7f5317700 10 s->object=s3testbucket-1 
s->bucket=ceph-kl-mon1.de.empolis.com
2017-08-16 14:02:21.725974 7fc7f5317700  2 req 1:0.004734:s3:GET 
/s3testbucket-1:get_obj:init permissions
2017-08-16 14:02:21.725986 7fc7f5317700 20 get_system_obj_state: 
rctx=0x7fc7f530fe50 obj=default.rgw.data.root:ceph-kl-mon1.de.empolis.com 
state=0x7fc915db1e00 s->prefetch_data=0
2017-08-16 14:02:21.725990 7fc7f5317700 10 cache get: 
name=default.rgw.data.root++ceph-kl-mon1.de.empolis.com : miss
2017-08-16 14:02:21.728248 7fc7f5317700 10 cache put: 
name=default.rgw.data.root++ceph-kl-mon1.de.empolis.com info.flags=0x0
2017-08-16 14:02:21.728253 7fc7f5317700 10 adding 
default.rgw.data.root++ceph-kl-mon1.de.empolis.com to cache LRU end
2017-08-16 14:02:21.728285 7fc7f5317700 10 read_permissions on 
ceph-kl-mon1.de.empolis.com[]) ret=-2002
2017-08-16 14:02:21.728287 7fc7f5317700 20 op->ERRORHANDLER: err_no=-2002 
new_err_no=-2002
2017-08-16 14:02:21.728368 7fc7f5317700  2 req 1:0.007127:s3:GET 
/s3testbucket-1:get_obj:op status=0
2017-08-16 14:02:21.728371 7fc7f5317700  2 req 1:0.007130:s3:GET 
/s3testbucket-1:get_obj:http status=404
2017-08-16 14:02:21.728380 7fc7f5317700  1 == req done req=0x7fc7f5311190 
op status=0 http_status=404 ==

I see the line with “err_no=-2002”. What does that mean?

Thanks

Martin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v12.1.4 Luminous (RC) released

2017-08-16 Thread Abhishek Lekshmanan
Alfredo Deza  writes:

> On Tue, Aug 15, 2017 at 10:35 PM, Matt Benjamin  wrote:
>> I think we need a v12.1.5 including #17040
We discussed this in the RGW standup today and we may not need one more RC
for the bug above; we should be fine as long as the fix is in 12.2.0.

>
> *I* think that this is getting to a point where we should just have
> nightly development releases.
>
> What is the benefit of waiting for each RC every two weeks (or so) otherwise?

We could consider something of this sort for M maybe?

Abhishek
> On one side we are treating the RC releases somewhat like normal
> releases, with proper announcements, waiting for
> QA suites to complete, and have leads "ack" when their components are
> good enough. But on the other side of things
> we then try to cut releases to include fixes as immediate as possible
> (and as often as that means)
>
> We've had 3 releases already in August and this would mean discussing
> a *fourth*. 
>
>>
>> Matt
>>
>> On Tue, Aug 15, 2017 at 5:16 PM, Gregory Farnum  wrote:
>>> On Tue, Aug 15, 2017 at 2:05 PM, Abhishek  wrote:
 This is the fifth release candidate for Luminous, the next long term
 stable release. We’ve had to do this release as there was a bug in
 the previous RC, which affected upgrades to Luminous.[1]
>>>
>>> In particular, this will fix things for those of you who upgraded from
>>> Jewel or a previous RC and saw OSDs crash instantly on boot. We had an
>>> oversight in dealing with another bug. (Standard disclaimer: this was
>>> a logic error that resulted in no data changes. There were no
>>> durability implications — not that that helps much when you can't read
>>> your data out again.)
>>>
>>> Sorry guys!
>>> -Greg
>>>

 Please note that this is still a *release candidate* and
 not the final release, we're expecting the final Luminous release in
 a week's time, meanwhile, testing and feedback is very much welcome.

 Ceph Luminous (v12.2.0) will be the foundation for the next long-term
 stable release series. There have been major changes since Kraken
 (v11.2.z) and Jewel (v10.2.z), and the upgrade process is non-trivial.
 Please read these release notes carefully. Full details and changelog at
 http://ceph.com/releases/v12-1-4-luminous-rc-released/

 Notable Changes from 12.1.3
 ---
 * core: Wip 20985 divergent handling luminous (issue#20985, pr#17001, Greg
 Farnum)
 * qa/tasks/thrashosds-health.yaml: ignore MON_DOWN (issue#20910, pr#17003,
 Sage Weil)
 * crush, mon: fix weight set vs crush device classes (issue#20939, Sage
 Weil)


 Getting Ceph
 
 * Git at git://github.com/ceph/ceph.git
 * Tarball at http://download.ceph.com/tarballs/ceph-12.1.4.tar.gz
 * For packages, see http://docs.ceph.com/docs/master/install/get-packages/
 * For ceph-deploy, see
 http://docs.ceph.com/docs/master/install/install-ceph-deploy
 * Release sha1: a5f84b37668fc8e03165aaf5cbb380c78e4deba4

 [1]: http://tracker.ceph.com/issues/20985


 Best Regards
 Abhishek

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] VMware + Ceph using NFS sync/async ?

2017-08-16 Thread Osama Hasebou
Hi Nick, 

Thanks for replying! If Ceph is combined with Openstack then, does that mean 
that actually when openstack writes are happening, it is not fully sync'd (as 
in written to disks) before it starts receiving more data, so acting as async ? 
In that scenario there is a chance for data loss if things go bad, i.e power 
outage or something like that ? 

As for the slow operations, reading is quite fine when I compare it to a SAN 
storage system connected to VMware. It is writing data, small chunks or big 
ones, that suffer when trying to use the sync option with FIO for benchmarking. 

In that case, I wonder, is no one using CEPH with VMware in a production 
environment ? 

Cheers. 

Regards, 
Ossi 







Hi Osama, 



This is a known problem with many software defined storage stacks, but 
potentially slightly worse with Ceph due to extra overheads. Sync writes have 
to wait until all copies of the data are written to disk by the OSD and 
acknowledged back to the client. The extra network hops for replication and NFS 
gateways add significant latency which impacts the time it takes to carry out 
small writes. The Ceph code also takes time to process each IO request. 



What particular operations are you finding slow? Storage vmotions are just bad, 
and I don’t think there is much that can be done about them as they are split 
into lots of 64kb IO’s. 



One thing you can try is to force the CPUs on your OSD nodes to run at the C1 
C-state and force their minimum frequency to 100%. This can have quite a large 
impact on latency. Also, you don’t specify your network, but 10G is a must. 
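
For reference, a minimal sketch of that tuning on a Linux OSD node (package names and the exact C-state knobs vary by distro and CPU, so treat this as an outline rather than a recipe):

  # pin the cpufreq governor to performance on all cores
  cpupower frequency-set --governor performance

  # limit deep C-states via kernel boot parameters, then update grub and reboot:
  #   intel_idle.max_cstate=1 processor.max_cstate=1

  # verify what the cores are actually doing
  cpupower frequency-info
  cpupower idle-info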



Nick 




From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Osama 
Hasebou 
Sent: 14 August 2017 12:27 
To: ceph-users  
Subject: [ceph-users] VMware + Ceph using NFS sync/async ? 




Hi Everyone, 





We started testing the idea of using Ceph storage with VMware. The idea is to 
provide Ceph storage to VMware through OpenStack, by creating a virtual 
machine backed by Ceph + OpenStack which acts as an NFS gateway, and then mounting 
that storage on top of the VMware cluster. 





When mounting the NFS exports using the sync option, we noticed a huge 
degradation in performance, which makes it too slow to use in production. The 
async option makes it much better, but then there is the risk that, in case a 
failure happens (e.g. a power outage or something like that), some data might be 
lost in that scenario. 
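
(For context, the sync/async choice here is just the export option on the gateway VM; a sketch with placeholder paths and networks:)

  # /etc/exports on the NFS gateway VM
  /export/vmware-ds1  10.0.0.0/24(rw,sync,no_subtree_check)     # honours COMMIT/stable writes
  # /export/vmware-ds1  10.0.0.0/24(rw,async,no_subtree_check)  # faster, but acknowledged writes can be lost in a crash

  # reload the export table after editing
  exportfs -ra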





Now I understand that some people in the Ceph community are using Ceph with 
VMware via NFS gateways. If you do use it for production purposes, it would be 
great if you could kindly shed some light on your experience, and on how you 
handled the sync/async trade-off while keeping write performance acceptable. 








Thank you!!! 





Regards, 
Ossi 




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Running commands on Mon or OSD nodes

2017-08-16 Thread Osama Hasebou
Hi David, 

We are running 10.2.7, but it seems it is ok now and it reflected all the 
changes. 

Thank you! 

Regards, 
Ossi 



From: "David Turner"  
To: "Osama Hasebou" , "ceph-users" 
 
Sent: Tuesday, 8 August, 2017 23:31:17 
Subject: Re: [ceph-users] Running commands on Mon or OSD nodes 

Regardless of which node you run that command on, the command is talking to the 
mons. If you are getting different values between different nodes, double check 
their configs and make sure your mon quorum isn't somehow in a split-brain 
scenario. Which version of Ceph are you running? 
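
In practice that means something like the following, run from any host that has an admin keyring (osd.12 is just an example ID):

  # the mons apply the change cluster-wide, regardless of where you run it
  ceph osd crush reweight osd.12 0

  # verify from any other node -- the weight should be the same everywhere
  ceph osd tree | grep -w osd.12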

On Tue, Aug 8, 2017 at 4:13 AM Osama Hasebou <osama.hase...@csc.fi> wrote: 



Hi Everyone, 

I was trying to run the ceph osd crush reweight command to move data out of one 
node that has hardware failures and I noticed that as I set the crush reweight 
to 0, some nodes would reflect it when I do ceph osd tree and some wouldn't. 

What is the proper way to run commands against the cluster? Does one need to run the 
same *ceph osd crush reweight* command on all mon nodes so that it is pushed down 
to the whole osd tree and updates the crush map, or is it also ok to run it once on an osd 
node and have it propagate across the other nodes and update the crush map? 

Thank you! 

Regards, 
Ossi 

___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v12.1.4 Luminous (RC) released

2017-08-16 Thread Alfredo Deza
On Tue, Aug 15, 2017 at 10:35 PM, Matt Benjamin  wrote:
> I think we need a v12.1.5 including #17040

*I* think that this is getting to a point where we should just have
nightly development releases.

What is the benefit of waiting for each RC every two weeks (or so) otherwise?

On one side we are treating the RC releases somewhat like normal
releases, with proper announcements, waiting for
QA suites to complete, and have leads "ack" when their components are
good enough. But on the other side of things
we then try to cut releases to include fixes as immediate as possible
(and as often as that means)

We've had 3 releases already in August and this would mean discussing
a *fourth*.

>
> Matt
>
> On Tue, Aug 15, 2017 at 5:16 PM, Gregory Farnum  wrote:
>> On Tue, Aug 15, 2017 at 2:05 PM, Abhishek  wrote:
>>> This is the fifth release candidate for Luminous, the next long term
>>> stable release. We’ve had to do this release as there was a bug in
>>> the previous RC, which affected upgrades to Luminous.[1]
>>
>> In particular, this will fix things for those of you who upgraded from
>> Jewel or a previous RC and saw OSDs crash instantly on boot. We had an
>> oversight in dealing with another bug. (Standard disclaimer: this was
>> a logic error that resulted in no data changes. There were no
>> durability implications — not that that helps much when you can't read
>> your data out again.)
>>
>> Sorry guys!
>> -Greg
>>
>>>
>>> Please note that this is still a *release candidate* and
>>> not the final release, we're expecting the final Luminous release in
>>> a week's time, meanwhile, testing and feedback is very much welcome.
>>>
>>> Ceph Luminous (v12.2.0) will be the foundation for the next long-term
>>> stable release series. There have been major changes since Kraken
>>> (v11.2.z) and Jewel (v10.2.z), and the upgrade process is non-trivial.
>>> Please read these release notes carefully. Full details and changelog at
>>> http://ceph.com/releases/v12-1-4-luminous-rc-released/
>>>
>>> Notable Changes from 12.1.3
>>> ---
>>> * core: Wip 20985 divergent handling luminous (issue#20985, pr#17001, Greg
>>> Farnum)
>>> * qa/tasks/thrashosds-health.yaml: ignore MON_DOWN (issue#20910, pr#17003,
>>> Sage Weil)
>>> * crush, mon: fix weight set vs crush device classes (issue#20939, Sage
>>> Weil)
>>>
>>>
>>> Getting Ceph
>>> 
>>> * Git at git://github.com/ceph/ceph.git
>>> * Tarball at http://download.ceph.com/tarballs/ceph-12.1.4.tar.gz
>>> * For packages, see http://docs.ceph.com/docs/master/install/get-packages/
>>> * For ceph-deploy, see
>>> http://docs.ceph.com/docs/master/install/install-ceph-deploy
>>> * Release sha1: a5f84b37668fc8e03165aaf5cbb380c78e4deba4
>>>
>>> [1]: http://tracker.ceph.com/issues/20985
>>>
>>>
>>> Best Regards
>>> Abhishek
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cluster unavailable for 20 mins when downed server was reintroduced

2017-08-16 Thread Sean Purdy
On Tue, 15 Aug 2017, Gregory Farnum said:
> On Tue, Aug 15, 2017 at 4:23 AM Sean Purdy  wrote:
> > I have a three node cluster with 6 OSD and 1 mon per node.
> >
> > I had to turn off one node for rack reasons.  While the node was down, the
> > cluster was still running and accepting files via radosgw.  However, when I
> > turned the machine back on, radosgw uploads stopped working and things like
> > "ceph status" starting timed out.  It took 20 minutes for "ceph status" to
> > be OK.

> > 2017-08-15 11:28:29.835943 7fdf2d74b700  0 monclient(hunting):
> > authenticate timed out after 300
> > 2017-08-15 11:28:29.835993 7fdf2d74b700  0 librados: client.admin authentication error
> > (110) Connection timed out
> >
> 
> That just means the client couldn't connect to an in-quorum monitor. It
> should have tried them all in sequence though — did you check if you had
> *any* functioning quorum?

There was a functioning quorum - I checked with "ceph --admin-daemon 
/var/run/ceph/ceph-mon.xxx.asok quorum_status".  Well - I interpreted the 
output as functioning.  There was a nominated leader.
 

> > 2017-08-15 11:23:07.180123 7f11c0fcc700  0 -- 172.16.0.43:0/2471 >>
> > 172.16.0.45:6812/1904 conn(0x556eeaf4d000 :-1
> > s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0
> > l=0).handle_connect_reply connect got BADAUTHORIZER
> >
> 
> This one's odd. We did get one report of seeing something like that, but I
> tend to think it's a clock sync issue.

I saw some messages about clock sync, but ntpq -p looked OK on each server.  
Will investigate further.

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+172.16.0.16     129.250.35.250   3 u  847 1024  377    0.289    1.103   0.376
+172.16.0.18     80.82.244.120    3 u   93 1024  377    0.397   -0.653   1.040
*172.16.0.19     158.43.128.33    2 u  279 1024  377    0.244    0.262   0.158
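
A quick, admittedly crude way to double-check skew across the mon hosts (mon1/mon2/mon3 are placeholder hostnames):

  # wall-clock comparison across the three mons
  for h in mon1 mon2 mon3; do echo -n "$h "; ssh $h date +%s.%N; done

  # NTP's own view of the local offset on each host
  for h in mon1 mon2 mon3; do echo "== $h"; ssh $h "ntpq -c rv | grep -o 'offset=[^,]*'"; done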
 

> > ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-9: (2) No
> > such file or directory
> >
> And that would appear to be something happening underneath Ceph, wherein
> your data wasn't actually all the way mounted or something?

It's the machine mounting the disks at boot time - udev or ceph-osd.target 
keeps retrying until eventually the disk/OSD is mounted.  Or eventually it 
gives up.  Do the OSDs need a monitor quorum at startup?  It kept restarting 
OSDs for 20 mins.

Timing went like this:

11:22 node boot
11:22 ceph-mon starts, recovers logs, compaction, first BADAUTHORIZER message
11:22 starting disk activation for 18 partitions (3 per bluestore)
11:23 mgr on other node can't find secret_id
11:43 bluefs mount succeeded on OSDs, ceph-osds go live
11:45 last BADAUTHORIZER message in monitor log
11:45 this host calls and wins a monitor election, mon_down health check clears
11:45 mgr happy
 
 
> Anyway, it should have survived that transition without any noticeable
> impact (unless you are running so close to capacity that merely getting the
> downed node up-to-date overwhelmed your disks/cpu). But without some basic
> information about what the cluster as a whole was doing I couldn't
> speculate.

This is a brand new 3 node cluster.  Dell R720 running Debian 9 with 2x SSD for 
OS and ceph-mon, 6x 2Tb SATA for ceph-osd using bluestore, per node.  Running 
radosgw as object store layer.  Only activity is a single-threaded test job 
uploading millions of small files over S3.  There are about 5.5million test 
objects so far (additionally 3x replication).  This job was fine when the 
machine was down, stalled when machine booted.

Looking at activity graphs at the time, there didn't seem to be a network 
bottleneck or CPU issue or disk throughput bottleneck.  But I'll look a bit 
closer.

ceph-mon is on an ext4 filesystem though.   Perhaps I should move this to xfs?  
Bluestore is xfs+bluestore.

I presume it's a monitor issue somehow.


> -Greg

Thanks for your input.

Sean
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] BlueStore WAL or DB devices on a distant SSD ?

2017-08-16 Thread Hervé Ballans

Hi,

We are currently running two Proxmox/Ceph clusters that have worked perfectly 
since 2014, and thanks to this successful experience, we plan to install 
a new Ceph cluster for the storage of our computing cluster.


Until now, we have only used RBD (virtualization context), but now we want to 
use CephFS for this new cluster (separate from the other two; the hardware 
is different and dedicated to this new cluster).


I'm interested in testing a CephFS cluster with BlueStore as a backend 
storage.


I have several OSD servers (with a dozen SATA HDDs each), but some do 
not have an additional SSD disk (only 3 of the servers have an 
additional SSD).


My question is about BlueStore WAL/DB devices. When I read the 
documentation, it seems that adding both WAL and DB devices improves 
BlueStore performance.


But can we configure these devices on a distant SSD? I mean on an SSD 
which is not in the local OSD server, but on another machine which is 
part of the same Ceph cluster.


If yes, can I configure multiple WAL or DB devices on the same SSD?

And finally, is it relevant to do that (I mean in terms of performance)?
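
(For reference, the conventional local layout gives each OSD its own small DB/WAL partition on a node-local SSD, roughly as sketched below; the device names are examples and the exact ceph-disk/ceph-volume flags should be checked against your release:)

  # HDD holds the data; two partitions on the node-local SSD hold DB and WAL
  ceph-disk prepare --bluestore /dev/sdb \
      --block.db /dev/nvme0n1p1 --block.wal /dev/nvme0n1p2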

Hoping to have been clear about my context; thanks in advance for your 
replies or thoughts.


Regards,

Hervé


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] which kernel version support object-map feature from rbd kernel client

2017-08-16 Thread TYLin
You can check the Linux source code to see the features supported by the kernel client.

e.g. linux 4.13-rc5 
(https://github.com/torvalds/linux/blob/v4.13-rc5/drivers/block/rbd.c)
in drivers/block/rbd.c:

/* Feature bits */

#define RBD_FEATURE_LAYERING(1ULL<<0)
#define RBD_FEATURE_STRIPINGV2  (1ULL<<1)
#define RBD_FEATURE_EXCLUSIVE_LOCK  (1ULL<<2)
#define RBD_FEATURE_DATA_POOL   (1ULL<<7)

#define RBD_FEATURES_ALL(RBD_FEATURE_LAYERING | \
 RBD_FEATURE_STRIPINGV2 |   \
 RBD_FEATURE_EXCLUSIVE_LOCK |   \
 RBD_FEATURE_DATA_POOL)

So far it only supports layering, striping v2, exclusive lock, and data pool.
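
In practice that means images created with the Jewel/Luminous default feature set need the extra features stripped before krbd will map them; the pool/image names below are only examples:

  # disable what the kernel client doesn't understand, then map
  rbd feature disable rbd/myimage fast-diff object-map deep-flatten
  rbd map rbd/myimage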


> On Aug 15, 2017, at 9:34 PM, moftah moftah  wrote:
> 
> I don't think so,
> I tested with kernel-4.10.17-1-pve, which is the Proxmox 5 kernel, and that one 
> didn't have object-map support
> 
> I had to disable the feature from the rbd image in order for the krbd 
> module to deal with it and not complain about features
> 
> 
> Thanks
> 
> 
> 
> On Tue, Aug 15, 2017 at 9:25 AM, David Turner wrote:
> I thought that object-map, introduced with Jewel, was included with the 4.9 
> kernel and every kernel since then.
> 
> 
> On Tue, Aug 15, 2017, 7:26 AM Shinobu Kinjo wrote:
> It would be much better to explain why as of today, object-map feature
> is not supported by the kernel client, or document it.
> 
> On Tue, Aug 15, 2017 at 8:08 PM, Ilya Dryomov wrote:
> > On Tue, Aug 15, 2017 at 11:34 AM, moftah moftah wrote:
> >> Hi All,
> >>
> >> I have search everywhere for some sort of table that show kernel version to
> >> what rbd image features supported and didnt find any.
> >>
> >> basically I am looking at latest kernels from kernel.org, and i am thinking
> >> of upgrading to 4.12 since it is stable but i want to make sure i can get
> >> rbd images with object-map features working with rbd.ko
> >>
> >> if anyone know please let me know what kernel version i have to upgrade to
> >> to get that feature supported by kernel client
> >
> > As of today, object-map feature is not supported by the kernel client.
> >
> > Thanks,
> >
> > Ilya
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> > 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph Delete PG because ceph pg force_create_pg doesnt help

2017-08-16 Thread Hauke Homburg
Hello,


How can I delete a PG completely from a Ceph server? I think I have deleted all
the data manually from the server, but a ceph pg  query
still shows the PG, and a ceph pg force_create_pg doesn't create the PG.

Ceph says it has created the PG, and the PG has been stuck for more than 300 sec.

Thanks for your help.

-- 
www.w3-creative.de

www.westchat.de

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster in error state (full) with raw usage 32% of total capacity

2017-08-16 Thread Etienne Menguy
Hi,


Your crushmap has issues.

You don't have any root and you have duplicate entries. Currently you store 
data on a single OSD.


You can manually fix the crushmap by decompiling, editing and recompiling it.

http://docs.ceph.com/docs/hammer/rados/operations/crush-map/#editing-a-crush-map

(if you have some production data, do a backup first)
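
For reference, the round trip is roughly:

  # back up and decompile the current map
  ceph osd getcrushmap -o crush.bin
  crushtool -d crush.bin -o crush.txt

  # edit crush.txt (re-add a single root, drop the duplicated buckets),
  # then recompile and inject it
  crushtool -c crush.txt -o crush.new
  ceph osd setcrushmap -i crush.new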


Étienne



From: ceph-users  on behalf of Mandar Naik 

Sent: Wednesday, August 16, 2017 09:39
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph cluster in error state (full) with raw usage 32% 
of total capacity

Hi,
I just wanted to give a friendly reminder for this issue. I would appreciate if 
someone
can help me out here. Also, please do let me know in case some more information 
is
required here.

On Thu, Aug 10, 2017 at 2:41 PM, Mandar Naik 
> wrote:
Hi Peter,
Thanks a lot for the reply. Please find 'ceph osd df' output here -

# ceph osd df
ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
 2 0.04399  1.0 46056M 35576k 46021M  0.08 0.00   0
 1 0.04399  1.0 46056M 40148k 46017M  0.09 0.00 384
 0 0.04399  1.0 46056M 43851M  2205M 95.21 2.99 192
 0 0.04399  1.0 46056M 43851M  2205M 95.21 2.99 192
 1 0.04399  1.0 46056M 40148k 46017M  0.09 0.00 384
 2 0.04399  1.0 46056M 35576k 46021M  0.08 0.00   0
  TOTAL   134G 43925M 94244M 31.79
MIN/MAX VAR: 0.00/2.99  STDDEV: 44.85

I setup this cluster by manipulating CRUSH map using CLI. I had a default root
before but it gave me an impression that since every rack is under a single
root bucket its marking entire cluster down in case one of the osd is 95% full. 
So I
removed root bucket but that still did not help me. No crush rule is referring
to root bucket in the above mentioned case.

Yes, I added one osd under two racks by linking host bucket from one rack to 
another
using following command -

"osd crush link   [...] :  link existing entry for  
under location "


On Thu, Aug 10, 2017 at 1:40 PM, Peter Maloney 
> 
wrote:
I think a `ceph osd df` would be useful.

And how did you set up such a cluster? I don't see a root, and you have each 
osd in there more than once...is that even possible?



On 08/10/17 08:46, Mandar Naik wrote:

Hi,

I am evaluating ceph cluster for a solution where ceph could be used for 
provisioning

pools which could be either stored local to a node or replicated across a 
cluster.  This

way ceph could be used as single point of solution for writing both local as 
well as replicated

data. Local storage helps avoid possible storage cost that comes with 
replication factor of more

than one and also provide availability as long as the data host is alive.


So I tried an experiment with Ceph cluster where there is one crush rule which 
replicates data across

nodes and other one only points to a crush bucket that has local ceph osd. 
Cluster configuration

is pasted below.


Here I observed that if one of the disk is full (95%) entire cluster goes into 
error state and stops

accepting new writes from/to other nodes. So ceph cluster became unusable even 
though it’s only

32% full. The writes are blocked even for pools which are not touching the full 
osd.


I have tried playing around crush hierarchy but it did not help. So is it 
possible to store data in the above

manner with Ceph ? If yes could we get cluster state in usable state after one 
of the node is full ?



# ceph df


GLOBAL:

   SIZE AVAIL  RAW USED %RAW USED

   134G 94247M   43922M 31.79


# ceph -s


   cluster ba658a02-757d-4e3c-7fb3-dc4bf944322f

health HEALTH_ERR

   1 full osd(s)

   full,sortbitwise,require_jewel_osds flag(s) set

monmap e3: 3 mons at 
{ip-10-0-9-122=10.0.9.122:6789/0,ip-10-0-9-146=10.0.9.146:6789/0,ip-10-0-9-210=10.0.9.210:6789/0}

   election epoch 14, quorum 0,1,2 
ip-10-0-9-122,ip-10-0-9-146,ip-10-0-9-210

osdmap e93: 3 osds: 3 up, 3 in

   flags full,sortbitwise,require_jewel_osds

 pgmap v630: 384 pgs, 6 pools, 43772 MB data, 18640 objects

   43922 MB used, 94247 MB / 134 GB avail

384 active+clean


# ceph osd tree


ID WEIGHT  TYPE NAME   UP/DOWN REWEIGHT PRIMARY-AFFINITY

-9 0.04399 rack ip-10-0-9-146-rack

-8 0.04399 host ip-10-0-9-146

2 0.04399 osd.2    up  1.0  1.0

-7 0.04399 rack ip-10-0-9-210-rack

-6 0.04399 host ip-10-0-9-210

1 0.04399 osd.1    up  1.0  1.0

-5 0.04399 rack ip-10-0-9-122-rack

-3 0.04399 host ip-10-0-9-122

0 0.04399 osd.0    up  1.0  1.0

-4 0.13197 rack rep-rack

-3 0.04399 

Re: [ceph-users] Ceph cluster in error state (full) with raw usage 32% of total capacity

2017-08-16 Thread Luis Periquito
Not going into the obvious, that crush map just doesn't look correct or
even sane... and the policy itself doesn't sound very sane either - but
I'm sure you'll understand the caveats and issues it may present...

What's most probably happening is that a pool (or several) is using
those same OSDs, and the requests to those PGs are also getting blocked
because of the full disk. The result is that some (or all) of the
remaining OSDs are waiting for that one to complete some IO, and
while those OSDs have IOs waiting to complete they also stop
responding to the IO that was only local.
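
The blocked-request side of this is visible directly from the cluster, e.g.:

  # shows which OSD is full/near-full and any blocked requests
  ceph health detail

  # on the host of the full OSD (osd.0 in the 'ceph osd df' output below),
  # look at what its in-flight ops are waiting on
  ceph daemon osd.0 dump_ops_in_flight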

Adding more insanity to your architecture, what should work (the keyword
here is 'should', as I have never tested, seen or even thought of such a
scenario) would be to have some OSDs for local storage and other OSDs for
distributed storage.

As for the architecture itself, and not knowing much about your use-case,
it may make sense to have local storage in something other than Ceph -
you're not using any of the facilities it provides you, and you're paying
some overheads - or to use a different strategy for it. IIRC there was
a way to hint data locality to Ceph...


On Wed, Aug 16, 2017 at 8:39 AM, Mandar Naik  wrote:
> Hi,
> I just wanted to give a friendly reminder for this issue. I would appreciate
> if someone
> can help me out here. Also, please do let me know in case some more
> information is
> required here.
>
> On Thu, Aug 10, 2017 at 2:41 PM, Mandar Naik  wrote:
>>
>> Hi Peter,
>> Thanks a lot for the reply. Please find 'ceph osd df' output here -
>>
>> # ceph osd df
>> ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
>>  2 0.04399  1.0 46056M 35576k 46021M  0.08 0.00   0
>>  1 0.04399  1.0 46056M 40148k 46017M  0.09 0.00 384
>>  0 0.04399  1.0 46056M 43851M  2205M 95.21 2.99 192
>>  0 0.04399  1.0 46056M 43851M  2205M 95.21 2.99 192
>>  1 0.04399  1.0 46056M 40148k 46017M  0.09 0.00 384
>>  2 0.04399  1.0 46056M 35576k 46021M  0.08 0.00   0
>>   TOTAL   134G 43925M 94244M 31.79
>> MIN/MAX VAR: 0.00/2.99  STDDEV: 44.85
>>
>> I setup this cluster by manipulating CRUSH map using CLI. I had a default
>> root
>> before but it gave me an impression that since every rack is under a
>> single
>> root bucket its marking entire cluster down in case one of the osd is 95%
>> full. So I
>> removed root bucket but that still did not help me. No crush rule is
>> referring
>> to root bucket in the above mentioned case.
>>
>> Yes, I added one osd under two racks by linking host bucket from one rack
>> to another
>> using following command -
>>
>> "osd crush link   [...] :  link existing entry for
>>  under location "
>>
>>
>> On Thu, Aug 10, 2017 at 1:40 PM, Peter Maloney
>>  wrote:
>>>
>>> I think a `ceph osd df` would be useful.
>>>
>>> And how did you set up such a cluster? I don't see a root, and you have
>>> each osd in there more than once...is that even possible?
>>>
>>>
>>>
>>> On 08/10/17 08:46, Mandar Naik wrote:
>>>
>>> Hi,
>>>
>>> I am evaluating ceph cluster for a solution where ceph could be used for
>>> provisioning
>>>
>>> pools which could be either stored local to a node or replicated across a
>>> cluster.  This
>>>
>>> way ceph could be used as single point of solution for writing both local
>>> as well as replicated
>>>
>>> data. Local storage helps avoid possible storage cost that comes with
>>> replication factor of more
>>>
>>> than one and also provide availability as long as the data host is alive.
>>>
>>>
>>> So I tried an experiment with Ceph cluster where there is one crush rule
>>> which replicates data across
>>>
>>> nodes and other one only points to a crush bucket that has local ceph
>>> osd. Cluster configuration
>>>
>>> is pasted below.
>>>
>>>
>>> Here I observed that if one of the disk is full (95%) entire cluster goes
>>> into error state and stops
>>>
>>> accepting new writes from/to other nodes. So ceph cluster became unusable
>>> even though it’s only
>>>
>>> 32% full. The writes are blocked even for pools which are not touching
>>> the full osd.
>>>
>>>
>>> I have tried playing around crush hierarchy but it did not help. So is it
>>> possible to store data in the above
>>>
>>> manner with Ceph ? If yes could we get cluster state in usable state
>>> after one of the node is full ?
>>>
>>>
>>>
>>> # ceph df
>>>
>>>
>>> GLOBAL:
>>>
>>>SIZE AVAIL  RAW USED %RAW USED
>>>
>>>134G 94247M   43922M 31.79
>>>
>>>
>>> # ceph –s
>>>
>>>
>>>cluster ba658a02-757d-4e3c-7fb3-dc4bf944322f
>>>
>>> health HEALTH_ERR
>>>
>>>1 full osd(s)
>>>
>>>full,sortbitwise,require_jewel_osds flag(s) set
>>>
>>> monmap e3: 3 mons at
>>> {ip-10-0-9-122=10.0.9.122:6789/0,ip-10-0-9-146=10.0.9.146:6789/0,ip-10-0-9-210=10.0.9.210:6789/0}
>>>
>>>election epoch 14, quorum 0,1,2
>>> ip-10-0-9-122,ip-10-0-9-146,ip-10-0-9-210
>>>
>>> osdmap e93: 3 osds: 3 up, 3 

Re: [ceph-users] Ceph cluster in error state (full) with raw usage 32% of total capacity

2017-08-16 Thread Mandar Naik
Hi,
I just wanted to give a friendly reminder for this issue. I would
appreciate if someone
can help me out here. Also, please do let me know in case some more
information is
required here.

On Thu, Aug 10, 2017 at 2:41 PM, Mandar Naik  wrote:

> Hi Peter,
> Thanks a lot for the reply. Please find 'ceph osd df' output here -
>
> # ceph osd df
> ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
>  2 0.04399  1.0 46056M 35576k 46021M  0.08 0.00   0
>  1 0.04399  1.0 46056M 40148k 46017M  0.09 0.00 384
>  0 0.04399  1.0 46056M 43851M  2205M 95.21 2.99 192
>  0 0.04399  1.0 46056M 43851M  2205M 95.21 2.99 192
>  1 0.04399  1.0 46056M 40148k 46017M  0.09 0.00 384
>  2 0.04399  1.0 46056M 35576k 46021M  0.08 0.00   0
>   TOTAL   134G 43925M 94244M 31.79
> MIN/MAX VAR: 0.00/2.99  STDDEV: 44.85
>
> I setup this cluster by manipulating CRUSH map using CLI. I had a default
> root
> before but it gave me an impression that since every rack is under a single
> root bucket its marking entire cluster down in case one of the osd is 95%
> full. So I
> removed root bucket but that still did not help me. No crush rule is
> referring
> to root bucket in the above mentioned case.
>
> Yes, I added one osd under two racks by linking host bucket from one rack
> to another
> using following command -
>
> "osd crush link   [...] :  link existing entry for
>  under location "
>
>
> On Thu, Aug 10, 2017 at 1:40 PM, Peter Maloney  consult.de> wrote:
>
>> I think a `ceph osd df` would be useful.
>>
>> And how did you set up such a cluster? I don't see a root, and you have
>> each osd in there more than once...is that even possible?
>>
>>
>>
>> On 08/10/17 08:46, Mandar Naik wrote:
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> Hi,
>>
>> I am evaluating ceph cluster for a solution where ceph could be used for
>> provisioning pools which could be either stored local to a node or
>> replicated across a cluster.  This way ceph could be used as single point
>> of solution for writing both local as well as replicated data. Local
>> storage helps avoid possible storage cost that comes with replication
>> factor of more than one and also provide availability as long as the data
>> host is alive.
>>
>> So I tried an experiment with Ceph cluster where there is one crush rule
>> which replicates data across nodes and other one only points to a crush
>> bucket that has local ceph osd. Cluster configuration is pasted below.
>> Here I observed that if one of the disk is full (95%) entire cluster
>> goes into error state and stops accepting new writes from/to other nodes.
>> So ceph cluster became unusable even though it’s only 32% full. The writes
>> are blocked even for pools which are not touching the full osd. I have
>> tried playing around crush hierarchy but it did not help. So is it possible
>> to store data in the above manner with Ceph ? If yes could we get cluster
>> state in usable state after one of the node is full ?
>>
>> # ceph df
>> GLOBAL:
>>     SIZE AVAIL  RAW USED %RAW USED
>>     134G 94247M   43922M     31.79
>>
>> # ceph -s
>>     cluster ba658a02-757d-4e3c-7fb3-dc4bf944322f
>>      health HEALTH_ERR
>>             1 full osd(s)
>>             full,sortbitwise,require_jewel_osds flag(s) set
>>      monmap e3: 3 mons at
>> {ip-10-0-9-122=10.0.9.122:6789/0,ip-10-0-9-146=10.0.9.146:6789/0,ip-10-0-9-210=10.0.9.210:6789/0}
>>             election epoch 14, quorum 0,1,2 ip-10-0-9-122,ip-10-0-9-146,ip-10-0-9-210
>>      osdmap e93: 3 osds: 3 up, 3 in
>>             flags full,sortbitwise,require_jewel_osds
>>       pgmap v630: 384 pgs, 6 pools, 43772 MB data, 18640 objects
>>             43922 MB used, 94247 MB / 134 GB avail
>>                  384 active+clean
>>
>> # ceph osd tree
>> ID WEIGHT  TYPE NAME               UP/DOWN REWEIGHT PRIMARY-AFFINITY
>> -9 0.04399 rack ip-10-0-9-146-rack
>> -8 0.04399     host ip-10-0-9-146
>>  2 0.04399         osd.2                up  1.0  1.0
>> -7 0.04399 rack ip-10-0-9-210-rack
>> -6 0.04399     host ip-10-0-9-210
>>  1 0.04399         osd.1                up  1.0  1.0
>> -5 0.04399 rack ip-10-0-9-122-rack
>> -3 0.04399     host ip-10-0-9-122
>>  0 0.04399         osd.0                up  1.0  1.0
>> -4 0.13197 rack rep-rack
>> -3 0.04399     host ip-10-0-9-122
>>  0 0.04399         osd.0                up  1.0  1.0
>> -6 0.04399     host ip-10-0-9-210
>>  1 0.04399         osd.1                up  1.0  1.0
>> -8 0.04399     host ip-10-0-9-146
>>  2 0.04399         osd.2                up  1.0  1.0
>>
>> # ceph osd crush rule list
>> [ "rep_ruleset", "ip-10-0-9-122_ruleset", "ip-10-0-9-210_ruleset", "ip-10-0-9-146_ruleset" ]
>>
>> # ceph osd crush rule dump rep_ruleset
>> {
>>     "rule_id": 0,
>>     "rule_name": "rep_ruleset",
>>     "ruleset": 0,
>>     "type": 1,