Re: [ceph-users] Monitor handle_auth_bad_method

2020-01-18 Thread Paul Emmerich
Check whether the mons have the same keyring file and the same config file.
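
A quick way to compare them across the mon hosts (a sketch; hostnames and
paths are placeholders):

    for host in mon1 mon2 mon3; do
        ssh "$host" 'md5sum /etc/ceph/ceph.conf /var/lib/ceph/mon/ceph-$(hostname -s)/keyring'
    done

If the checksums differ, copy the known-good keyring/config over and restart
the affected mon.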
-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Sat, Jan 18, 2020 at 12:39 AM Justin Engwer  wrote:
>
> Hi,
> I'm a home user of ceph. Most of the time I can look at the email lists and 
> articles and figure things out on my own. I've unfortunately run into an 
> issue I can't troubleshoot myself.
>
> Starting one of my monitors yields this error:
>
> 2020-01-17 15:34:13.497 7fca3d006040  0 mon.kvm2@-1(probing) e11  my rank is 
> now 2 (was -1)
> 2020-01-17 15:34:13.696 7fca2909b700 -1 mon.kvm2@2(probing) e11 
> handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
> 2020-01-17 15:34:14.098 7fca2909b700 -1 mon.kvm2@2(probing) e11 
> handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
> 2020-01-17 15:34:14.899 7fca2909b700 -1 mon.kvm2@2(probing) e11 
> handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
>
>
> I've grabbed a good monmap from other monitors and double checked permissions 
> on /var/lib/ceph/mon/ceph-kvm2/ to make sure that it's not a filesystem error 
> and everything looks good to me.
>
> /var/lib/ceph/mon/ceph-kvm2/:
> total 32
> drwxr-xr-x  4 ceph ceph  4096 May  7  2018 .
> drwxr-x---. 4 ceph ceph    44 Jan  8 11:44 ..
> -rw-r--r--. 1 ceph ceph     0 Apr 25  2018 done
> -rw-------. 1 ceph ceph    77 Apr 25  2018 keyring
> -rw-r--r--. 1 ceph ceph     8 Apr 25  2018 kv_backend
> drwx------  2 ceph ceph 16384 May  7  2018 lost+found
> drwxr-xr-x. 2 ceph ceph  4096 Jan 17 15:27 store.db
> -rw-r--r--. 1 ceph ceph     0 Apr 25  2018 systemd
>
> /var/lib/ceph/mon/ceph-kvm2/lost+found:
> total 20
> drwx------ 2 ceph ceph 16384 May  7  2018 .
> drwxr-xr-x 4 ceph ceph  4096 May  7  2018 ..
>
> /var/lib/ceph/mon/ceph-kvm2/store.db:
> total 68424
> drwxr-xr-x. 2 ceph ceph 4096 Jan 17 15:27 .
> drwxr-xr-x  4 ceph ceph 4096 May  7  2018 ..
> -rw-------  1 ceph ceph 65834705 Jan 17 14:57 1557088.sst
> -rw-------  1 ceph ceph     1833 Jan 17 15:27 1557090.sst
> -rw-------  1 ceph ceph        0 Jan 17 15:27 1557092.log
> -rw-------  1 ceph ceph       17 Jan 17 15:27 CURRENT
> -rw-r--r--. 1 ceph ceph       37 Apr 25  2018 IDENTITY
> -rw-r--r--. 1 ceph ceph        0 Apr 25  2018 LOCK
> -rw-------  1 ceph ceph      185 Jan 17 15:27 MANIFEST-1557091
> -rw-------  1 ceph ceph     4941 Jan 17 14:57 OPTIONS-1557087
> -rw-------  1 ceph ceph     4941 Jan 17 15:27 OPTIONS-1557094
>
>
> Any help would be appreciated.
>
>
>
> --
>
> Justin Engwer
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Default Pools

2020-01-18 Thread Paul Emmerich
RGW tools will automatically deploy these pools, for example, running
radosgw-admin will create them if they don't exist.
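
A quick way to see this in action (a sketch; only try it on a cluster where
the RGW metadata pools contain nothing you care about):

    ceph osd pool ls | grep rgw     # pools are gone after removal
    radosgw-admin zone get          # almost any radosgw-admin call reads .rgw.root
    ceph osd pool ls | grep rgw     # .rgw.root is back

So if anything on the cluster still invokes radosgw-admin periodically (e.g.
monitoring or deployment scripts), the pools will keep coming back.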


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Sat, Jan 18, 2020 at 2:48 AM Daniele Riccucci  wrote:
>
> Hello,
> I'm still a bit confused by the .rgw.root and the
> default.rgw.{control,meta,log} pools.
> I recently removed the RGW daemon I had running and the aforementioned
> pools, however after a rebalance I suddenly find them again in the
> output of:
>
> $ ceph osd pool ls
> cephfs_data
> cephfs_metadata
> .rgw.root
> default.rgw.control
> default.rgw.meta
> default.rgw.log
>
> Each has 8 pgs but zero usage.
> I was unable to find logs or indications as to which daemon or action
> recreated them or whether it is safe to remove them again, where should
> I look?
> I'm on Nautilus 14.2.5, container deployment.
> Thank you.
>
> Regards,
> Daniele
>
> Il 23/04/19 22:14, David Turner ha scritto:
> > You should be able to see all pools in use in a RGW zone from the
> > radosgw-admin command. This [1] is probably overkill for most, but I
> > deal with multi-realm clusters so I generally think like this when
> > dealing with RGW.  Running this as is will create a file in your current
> > directory for each zone in your deployment (likely to be just one
> > file).  My rough guess for what you would find in that file based on
> > your pool names would be this [2].
> >
> > If you identify any pools not listed from the zone get command, then you
> > can rename [3] the pool to see if it is being created and/or used by rgw
> > currently.  The process here would be to stop all RGW daemons, rename
> > the pools, start a RGW daemon, stop it again, and see which pools were
> > recreated.  Clean up the pools that were freshly made and rename the
> > original pools back into place before starting your RGW daemons again.
> > Please note that .rgw.root is a required pool in every RGW deployment
> > and will not be listed in the zones themselves.
> >
> >
> > [1]
> > for realm in $(radosgw-admin realm list --format=json | jq '.realms[]' -r); do
> >   for zonegroup in $(radosgw-admin --rgw-realm=$realm zonegroup list --format=json | jq '.zonegroups[]' -r); do
> >     for zone in $(radosgw-admin --rgw-realm=$realm --rgw-zonegroup=$zonegroup zone list --format=json | jq '.zones[]' -r); do
> >       echo $realm.$zonegroup.$zone.json
> >       radosgw-admin --rgw-realm=$realm --rgw-zonegroup=$zonegroup --rgw-zone=$zone zone get > $realm.$zonegroup.$zone.json
> >     done
> >   done
> > done
> >
> > [2] default.default.default.json
> > {
> >  "id": "{{ UUID }}",
> >  "name": "default",
> >  "domain_root": "default.rgw.meta",
> >  "control_pool": "default.rgw.control",
> >  "gc_pool": ".rgw.gc",
> >  "log_pool": "default.rgw.log",
> >  "user_email_pool": ".users.email",
> >  "user_uid_pool": ".users.uid",
> >  "system_key": {
> >  },
> >  "placement_pools": [
> >  {
> >  "key": "default-placement",
> >  "val": {
> >  "index_pool": "default.rgw.buckets.index",
> >  "data_pool": "default.rgw.buckets.data",
> >  "data_extra_pool": "default.rgw.buckets.non-ec",
> >  "index_type": 0,
> >  "compression": ""
> >  }
> >  }
> >  ],
> >  "metadata_heap": "",
> >  "tier_config": [],
> >  "realm_id": "{{ UUID }}"
> > }
> >
> > [3] ceph osd pool rename <current-pool-name> <new-pool-name>
> >
> > On Thu, Apr 18, 2019 at 10:46 AM Brent Kennedy  wrote:
> >
> > Yea, that was a cluster created during firefly...
> >
> > Wish there was a good article on the naming and use of these, or
> > perhaps a way I could make sure they are not used before deleting
> > them.  I know RGW will recreate anything it uses, but I don’t want
> > to lose data because I wanted a clean syst

Re: [ceph-users] Slow Performance - Sequential IO

2020-01-18 Thread Paul Emmerich
Benchmarking is hard.

It's expected that random IO will be faster than sequential IO with this pattern.
The reason is that random IO hits different blocks/PGs/OSDs with every 4k
block, while sequential IO is stuck on the same 4 MB object for more IOs
than your queue is deep.

Question is: is writing 4k blocks sequentially in any way
representative of your workload? Probably not, so don't test that.

You mentioned VMware, so writing 64k blocks with queue depth 64 might be
relevant for you (e.g., vMotion). In that case, try configuring
striping with a 64 kB stripe size for the image and test
sequential writes of 64 kB blocks.
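
For example, something along these lines (a sketch; pool/image names, size and
stripe count are placeholders):

    rbd create rbd/esx-lun1 --size 2T --object-size 4M \
        --stripe-unit 64K --stripe-count 16
    rbd info rbd/esx-lun1    # check stripe_unit / stripe_count

With that layout a sequential 64k stream is spread over 16 objects at a time
instead of being funneled into a single 4 MB object.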


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Sat, Jan 18, 2020 at 5:14 AM Christian Balzer  wrote:
>
>
> Hello,
>
> I had very odd results in the past with the fio rbd engine and would
> suggest testing things in the environment you're going to deploy in, end
> to end.
>
> That said, without any caching and coalescing of writes, sequential 4k
> writes will hit the same set of OSDs for 4MB worth of data, thus limiting
> things to whatever the overall latency (network, 3x write) is here.
> With random writes you will engage more or less all OSDs that hold your
> fio file, thus spreading things out.
> This becomes more and more visible with increasing number of OSDs and
> nodes.
>
> Regards,
>
> Christian
> On Fri, 17 Jan 2020 23:01:09 + Anthony Brandelli (abrandel) wrote:
>
> > Not been able to make any headway on this after some significant effort.
> >
> > -Tested all 48 SSDs with FIO directly, all tested with 10% of each other 
> > for 4k iops in rand|seq read|write.
> > -Disabled all CPU power save.
> > -Tested with both rbd cache enabled and disabled on the client.
> > -Tested with drive caches enabled and disabled (hdparm)
> > -Minimal TCP retransmissions under load (<10 for a 2 minute duration).
> > -No drops/pause frames noted on upstream switches.
> > -CPU load on OSD nodes peaks at 6~.
> > -iostat shows a peak of 15ms under read/write workloads, %util peaks at 
> > about 10%.
> > -Swapped out the RBD client for a bigger box, since the load was peaking at 
> > 16. Now a 24 core box, load still peaks at 16.
> > -Disabled cephx signatures
> > -Verified hardware health (nothing in dmesg, nothing in CIMC fault logs, 
> > storage controller logs)
> > -Test multiple SSDs at once to find the controllers iops limit, which is 
> > apparently 650k @ 4k.
> >
> > Nothing has made a noticeable difference here. I'm pretty baffled as to 
> > what would be causing the awful sequential read and write performance, but 
> > allowing good random r/w speeds.
> >
> > I switched up fio testing methodologies to use more threads, but this 
> > didn't seem to help either:
> >
> > [global]
> > bs=4k
> > ioengine=rbd
> > iodepth=32
> > size=5g
> > runtime=120
> > numjobs=4
> > group_reporting=1
> > pool=rbd_af1
> > rbdname=image1
> >
> > [seq-read]
> > rw=read
> > stonewall
> >
> > [rand-read]
> > rw=randread
> > stonewall
> >
> > [seq-write]
> > rw=write
> > stonewall
> >
> > [rand-write]
> > rw=randwrite
> > stonewall
> >
> > Any pointers are appreciated at this point. I've been following other 
> > threads on the mailing list, and looked at the archives, related to RBD 
> > performance but none of the solutions that worked for others seem to have 
> > helped this setup.
> >
> > Thanks,
> > Anthony
> >
> > 
> > From: Anthony Brandelli (abrandel) 
> > Sent: Tuesday, January 14, 2020 12:43 AM
> > To: ceph-users@lists.ceph.com 
> > Subject: Slow Performance - Sequential IO
> >
> >
> > I have a newly setup test cluster that is giving some surprising numbers 
> > when running fio against an RBD. The end goal here is to see how viable a 
> > Ceph based iSCSI SAN of sorts is for VMware clusters, which require a bunch 
> > of random IO.
> >
> >
> >
> > Hardware:
> >
> > 2x E5-2630L v2 (2.4GHz, 6 core)
> >
> > 256GB RAM
> >
> > 2x 10gbps bonded network, Intel X520
> >
> > LSI 9271-8i, SSDs used for OSDs in JBOD mode
> >
> > Mons: 2x 1.2TB 10K SAS in RAID1
> >
> > OSDs: 12x Samsung MZ6ER800HAGL-3 800GB SAS SSDs, super cap/power loss 
> > protection
> >
> >
> >
> > Cluster setup:

Re: [ceph-users] [External Email] RE: Beginner questions

2020-01-16 Thread Paul Emmerich
Discussing DB size requirements without knowing the exact cluster
requirements doesn't work.

Here are some real-world examples:

cluster1: CephFS, mostly large files, replicated x3
0.2% used for metadata

cluster2: radosgw, mix between replicated and erasure, mixed file sizes
(lots of tiny files, though)
1.3% used for metadata

The 4%-10% quoted in the docs are *not based on any actual usage data*,
they are just an absolute worst case estimate.


A 30 GB DB partition for a 12 TiB disk is 0.25% if the disk is completely
full (which it won't be); that is sufficient for many use cases.
I think cluster2 with 1.3% is one of the highest metadata usages that I've
seen on an actual production cluster.
I can think of a setup that probably has more but I haven't ever explicitly
checked it.

The restriction to 3/30/300 is temporary and might be fixed in a future
release, so I'd just partition that disk into X DB devices.
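
As a rough sketch of that (device names and sizes are only illustrative):

    vgcreate ceph-db /dev/nvme0n1
    lvcreate -L 120G -n db-0 ceph-db
    ceph-volume lvm create --data /dev/sda --block.db ceph-db/db-0
    # repeat with db-1 ... db-n for the remaining OSDs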

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Thu, Jan 16, 2020 at 10:28 PM Bastiaan Visser  wrote:

> Dave made a good point: WAL + DB might end up a little over 60G, so I would
> probably go with ~70G partitions/LVs per OSD in your case (if the NVMe
> drive is smart enough to spread the writes over all available capacity,
> most recent NVMes are). I have not yet seen a WAL larger than, or even
> close to, a gigabyte.
>
> We don't even think about EC-coded pools on clusters with less than 6
> nodes (spindles; full SSD is another story).
> EC pools need more processing resources. We usually settle with 1 GB per
> TB of storage on replicated-only clusters, but when EC pools are involved,
> we add at least 50% to that. Also make sure your processors are up for it.
>
> Do not base your calculations on a healthy cluster -> build to fail.
> How long are you willing to be in a degraded state after a node failure?
> Especially when using many large spindles, recovery time might be way
> longer than you think. 12 * 12TB is 144TB of storage; on a 4+2 EC pool you
> might end up with over 200 TB of traffic, and on a 10Gig network that's roughly
> two and a half days to recover, assuming your processors are not a bottleneck
> due to EC parity calculations and all capacity is available for recovery (which is
> usually not the case; there is still production traffic that will eat up
> resources).
>
> Op do 16 jan. 2020 om 21:30 schreef :
>
>> Dave;
>>
>> I don't like reading inline responses, so...
>>
>> I have zero experience with EC pools, so I won't pretend to give advice
>> in that area.
>>
>> I would think that small NVMe for DB would be better than nothing, but I
>> don't know.
>>
>> Once I got the hang of building clusters, it was relatively easy to wipe
>> a cluster out and rebuild it.  Perhaps you could take some time, and
>> benchmark different configurations?
>>
>> Thank you,
>>
>> Dominic L. Hilsbos, MBA
>> Director – Information Technology
>> Perform Air International Inc.
>> dhils...@performair.com
>> www.PerformAir.com
>>
>>
>> -Original Message-
>> From: Dave Hall [mailto:kdh...@binghamton.edu]
>> Sent: Thursday, January 16, 2020 1:04 PM
>> To: Dominic Hilsbos; ceph-users@lists.ceph.com
>> Subject: Re: [External Email] RE: [ceph-users] Beginner questions
>>
>> Dominic,
>>
>> We ended up with a 1.6TB PCIe NVMe in each node.  For 8 drives this
>> worked out to a DB size of something like 163GB per OSD. Allowing for
>> expansion to 12 drives brings it down to 124GB. So maybe just put the
>> WALs on NVMe and leave the DBs on the platters?
>>
>> Understood that we will want to move to more nodes rather than more
>> drives per node, but our funding is grant and donation based, so we may
>> end up adding drives in the short term.  The long term plan is to get to
>> separate MON/MGR/MDS nodes and 10s of OSD nodes.
>>
>> Due to our current low node count, we are considering erasure-coded PGs
>> rather than replicated in order to maximize usable space.  Any
>> guidelines or suggestions on this?
>>
>> Also, sorry for not replying inline.  I haven't done this much in a
>> while - I'll figure it out.
>>
>> Thanks.
>>
>> -Dave
>>
>> On 1/16/2020 2:48 PM, dhils...@performair.com wrote:
>> > Dave;
>> >
>> > I'd like to expand on this answer, briefly...
>> >
>> > The information in the docs is wrong.  There have been many discussions
>> about changing it, but no good alternative has bee

Re: [ceph-users] Beginner questions

2020-01-16 Thread Paul Emmerich
Don't use Mimic; support for it is far worse than for Nautilus or Luminous. I
think we were the only company that built a product around Mimic; both
Red Hat and SUSE enterprise storage went from Luminous to Nautilus, skipping
Mimic entirely.

We only offered Mimic as a default for a limited time and immediately moved
to Nautilus as it became available and Nautilus + Debian 10 has been great
for us.
Mimic and Debian 9 was... well, hacked together, due to the gcc backport
issues. That's not to say that it doesn't work, in fact Mimic (> 13.2.2)
and Debian 9 worked perfectly fine for us.

Our Debian 10 and Nautilus packages are just so much better and more stable
than Debian 9 + Mimic because we don't need to do weird things with Debian.
Check the mailing list for old posts around the Mimic release by me to see
how we did that build. It's not pretty, but it was the only way to use Ceph
>= Mimic on Debian 9.
All that mess has been eliminated with Debian 10.

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Thu, Jan 16, 2020 at 6:55 PM Bastiaan Visser  wrote:

> I would definitely go for Nautilus. there are quite some
> optimizations that went in after mimic.
>
> BlueStore DB size usually ends up at either 30 or 60 GB.
> 30 GB is one of the sweet spots during normal operation, but during
> compaction, Ceph writes the new data before removing the old, hence the
> 60 GB.
> The next sweet spot is 300/600 GB; any size between 60 and 300 GB will never
> be fully used.
>
> DB Usage is also dependent on ceph usage, object storage is known to use a
> lot more db space than rbd images for example.
>
> Op do 16 jan. 2020 om 17:46 schreef Dave Hall :
>
>> Hello all.
>>
>> Sorry for the beginner questions...
>>
>> I am in the process of setting up a small (3 nodes, 288TB) Ceph cluster
>> to store some research data.  It is expected that this cluster will grow
>> significantly in the next year, possibly to multiple petabytes and 10s of
>> nodes.  At this time I'm expected a relatively small number of clients,
>> with only one or two actively writing collected data - albeit at a high
>> volume per day.
>>
>> Currently I'm deploying on Debian 9 via ceph-ansible.
>>
>> Before I put this cluster into production I have a couple questions based
>> on my experience to date:
>>
>> Luminous, Mimic, or Nautilus?  I need stability for this deployment, so I
>> am sticking with Debian 9 since Debian 10 is fairly new, and I have been
>> hesitant to go with Nautilus.  Yet Mimic seems to have had a hard road on
>> Debian but for the efforts at Croit.
>>
>>- Statements on the Releases page are now making more sense to me,
>>but I would like to confirm that Nautilus is the right choice at this 
>> time?
>>
>> Bluestore DB size:  My nodes currently have 8 x 12TB drives (plus 4 empty
>> bays) and a PCIe NVMe drive.  If I understand the suggested calculation
>> correctly, the DB size for a 12 TB Bluestore OSD would be 480GB.  If my
>> NVMe isn't big enough to provide this size, should I skip provisioning the
>> DBs on the NVMe, or should I give each OSD 1/12th of what I have
>> available?  Also, should I try to shift budget a bit to get more NVMe as
>> soon as I can, and redo the OSDs when sufficient NVMe is available?
>>
>> Thanks.
>>
>> -Dave
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow request and unresponsive kvm guests after upgrading ceph cluster and os, please help debugging

2020-01-06 Thread Paul Emmerich
We've also seen some problems with FileStore on newer kernels; 4.9 is the
last kernel that worked reliably with FileStore in my experience.

But I haven't seen problems with BlueStore related to the kernel version
(well, except for that scrub bug, but my work-around for that is in all
release versions).

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Mon, Jan 6, 2020 at 8:44 PM Jelle de Jong 
wrote:

> Hello everybody,
>
> I have issues with very slow requests on a simple three-node cluster here,
> four WDC enterprise disks and Intel Optane NVMe journal on identical
> high memory nodes, with 10GB networking.
>
> It was working all good with Ceph Hammer on Debian Wheezy, but I wanted
> to upgrade to a supported version and test out bluestore as well. So I
> upgraded to luminous on Debian Stretch and used ceph-volume to create
> bluestore osds, everything went downhill from there.
>
> I went back to filestore on all nodes but I still have slow requests and
> I can not pinpoint a good reason I tried to debug and gathered
> information to look at:
>
> https://paste.debian.net/hidden/acc5d204/
>
> First I thought it was the balancing that was making things slow, then I
> thought it might be the LVM layer, so I recreated the nodes without LVM
> by switching from ceph-volume to ceph-disk, but there was no difference, still slow
> requests. Then I changed back from bluestore to filestore but still have a
> very slow cluster. Then I thought it was a CPU scheduling issue and
> downgraded the 5.x kernel and CPU performance is full speed again. I
> thought maybe there is something weird with an osd and taking them out
> one by one, but slow request are still showing up and client performance
> from vms is really poor.
>
> I just feel a burst of small requests keeps blocking for a while then
> recovers again.
>
> Many thanks for helping out looking at the URL.
>
> If there are options which I should tune for a hdd with nvme journal
> setup please share.
>
> Jelle
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph luminous bluestore poor random write performances

2020-01-02 Thread Paul Emmerich
Why do you think that is slow? That's 4.5k write iops and 13.5k read iops
at the same time, that's amazing for a total of 30 HDDs.

It's actually way faster than you'd expect for 30 HDDs, so these DB devices
are really helping there :)


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Thu, Jan 2, 2020 at 12:14 PM Ignazio Cassano 
wrote:

> Hi Stefan, using fio with bs=64k I got very good performances.
> I am not skilled on storage, but linux file system block size is 4k.
> So, How can I modify the configuration on ceph to obtain best performances
> with bs=4k ?
> Regards
> Ignazio
>
>
>
> Il giorno gio 2 gen 2020 alle ore 10:59 Stefan Kooman  ha
> scritto:
>
>> Quoting Ignazio Cassano (ignaziocass...@gmail.com):
>> > Hello All,
>> > I installed ceph luminous with openstack, an using fio in a virtual
>> machine
>> > I got slow random writes:
>> >
>> > fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1
>> --name=test
>> > --filename=random_read_write.fio --bs=4k --iodepth=64 --size=4G
>> > --readwrite=randrw --rwmixread=75
>>
>> Do you use virtio-scsi with a SCSI queue per virtual CPU core? How many
>> cores do you have? I suspect that the queue depth is hampering
>> throughput here ... but is throughput performance really interesting
>> anyway for your use case? Low latency generally matters most.
>>
>> Gr. Stefan
>>
>>
>> --
>> | BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
>> | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bucket link tenanted to non-tenanted

2019-12-23 Thread Paul Emmerich
I think you need this pull request https://github.com/ceph/ceph/pull/28813
to do this; I don't think it was ever backported to any upstream release
branch.

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Mon, Dec 23, 2019 at 2:53 PM Marcelo Miziara  wrote:

> Hello all. I'm trying to link a bucket from a user that's not "tenanted"
> (user1) to a tenanted user (usert$usert), but i'm getting an error message.
>
> I'm using Luminous:
> # ceph version
> ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous
> (stable)
>
> The steps I took were:
> 1) create user1:
> # radosgw-admin user create --uid='user1' --display-name='user1'
>
> 2) create one bucket (bucket1) withe the user1 credentials:
> # s3cmd -c .s3cfg mb s3://bucket1
> Bucket 's3://bucket1/' created
>
> 3) create usert$usert:
> # radosgw-admin user create --uid='usert$usert' --display-name='usert'
>
> 4) Tried to link:
> # radosgw-admin bucket link --uid="usert\$usert" --bucket=bucket1
> failure: (2) No such file or directory: 2019-12-23 10:20:49.078908
> 7f8f4f3d9dc0  0 could not get bucket info for bucket=bucket1
>
> Am I doing something wrong?
>
> Thanks in advance, Marcelo.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Annoying PGs not deep-scrubbed in time messages in Nautilus.

2019-12-09 Thread Paul Emmerich
Solved it: the warning is of course generated by ceph-mgr and not ceph-mon.

So for my problem that means I should have injected the option into ceph-mgr.
That's why it obviously worked when setting it on the pool...

The solution for you is to simply put the option under [global] and restart
ceph-mgr (or use daemon config set; ceph-mgr doesn't support changing this
config via ceph tell for some reason).
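
Concretely, that could look like this (a sketch; the interval value is only an
example):

    [global]
    osd_deep_scrub_interval = 2419200    # 4 weeks, in seconds

and then restart the active mgr so it re-reads the option, e.g.:

    systemctl restart ceph-mgr.target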


Paul

On Mon, Dec 9, 2019 at 8:32 PM Paul Emmerich  wrote:

>
>
> On Mon, Dec 9, 2019 at 5:17 PM Robert LeBlanc 
> wrote:
>
>> I've increased the deep_scrub interval on the OSDs on our Nautilus
>> cluster with the following added to the [osd] section:
>>
>
> should have read the beginning of your email; you'll need to set the
> option on the mons as well because they generate the warning. So your
> problem might be completely different from what I'm seeing here
>


>
>
> Paul
>
>
>>
>> osd_deep_scrub_interval = 260
>>
>> And I started seeing
>>
>> 1518 pgs not deep-scrubbed in time
>>
>> in ceph -s. So I added
>>
>> mon_warn_pg_not_deep_scrubbed_ratio = 1
>>
>> since the default would start warning with a whole week left to scrub.
>> But the messages persist. The cluster has been running for a month with
>> these settings. Here is an example of the output. As you can see, some of
>> these are not even two weeks old, no where close to the 75% of 4 weeks.
>>
>> pg 6.1f49 not deep-scrubbed since 2019-11-09 23:04:55.370373
>>pg 6.1f47 not deep-scrubbed since 2019-11-18 16:10:52.561204
>>pg 6.1f44 not deep-scrubbed since 2019-11-18 15:48:16.825569
>>pg 6.1f36 not deep-scrubbed since 2019-11-20 05:39:00.309340
>>pg 6.1f31 not deep-scrubbed since 2019-11-27 02:48:45.347680
>>pg 6.1f30 not deep-scrubbed since 2019-11-11 21:34:15.795622
>>pg 6.1f2d not deep-scrubbed since 2019-11-24 11:37:39.502829
>>pg 6.1f27 not deep-scrubbed since 2019-11-25 07:38:58.689315
>>pg 6.1f25 not deep-scrubbed since 2019-11-20 00:13:43.048569
>>pg 6.1f1a not deep-scrubbed since 2019-11-09 15:08:43.51
>>pg 6.1f19 not deep-scrubbed since 2019-11-25 10:24:47.884332
>>1468 more pgs...
>> Mon Dec  9 08:12:01 PST 2019
>>
>> There is very little data on the cluster, so it's not a problem of
>> deep-scrubs taking too long:
>>
>> $ ceph df
>> RAW STORAGE:
>>CLASS SIZEAVAIL   USEDRAW USED %RAW USED
>>hdd   6.3 PiB 6.1 PiB 153 TiB  154 TiB  2.39
>>nvme  5.8 TiB 5.6 TiB 138 GiB  197 GiB  3.33
>>TOTAL 6.3 PiB 6.2 PiB 154 TiB  154 TiB  2.39
>>
>> POOLS:
>>POOL   ID STORED  OBJECTS USED
>>%USED MAX AVAIL
>>.rgw.root   1 3.0 KiB   7 3.0 KiB
>> 0   1.8 PiB
>>default.rgw.control 2 0 B   8 0 B
>> 0   1.8 PiB
>>default.rgw.meta3 7.4 KiB  24 7.4 KiB
>> 0   1.8 PiB
>>default.rgw.log 4  11 GiB 341  11 GiB
>> 0   1.8 PiB
>>default.rgw.buckets.data6 100 TiB  41.84M 100 TiB
>>  1.82   4.2 PiB
>>default.rgw.buckets.index   7  33 GiB 574  33 GiB
>> 0   1.8 PiB
>>default.rgw.buckets.non-ec  8 8.1 MiB  22 8.1 MiB
>> 0   1.8 PiB
>>
>> Please help me figure out what I'm doing wrong with these settings.
>>
>> Thanks,
>> Robert LeBlanc
>> 
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Annoying PGs not deep-scrubbed in time messages in Nautilus.

2019-12-09 Thread Paul Emmerich
On Mon, Dec 9, 2019 at 5:17 PM Robert LeBlanc  wrote:

> I've increased the deep_scrub interval on the OSDs on our Nautilus cluster
> with the following added to the [osd] section:
>

should have read the beginning of your email; you'll need to set the option
on the mons as well because they generate the warning. So your problem
might be completely different from what I'm seeing here


Paul


>
> osd_deep_scrub_interval = 260
>
> And I started seeing
>
> 1518 pgs not deep-scrubbed in time
>
> in ceph -s. So I added
>
> mon_warn_pg_not_deep_scrubbed_ratio = 1
>
> since the default would start warning with a whole week left to scrub. But
> the messages persist. The cluster has been running for a month with these
> settings. Here is an example of the output. As you can see, some of these
> are not even two weeks old, no where close to the 75% of 4 weeks.
>
> pg 6.1f49 not deep-scrubbed since 2019-11-09 23:04:55.370373
>pg 6.1f47 not deep-scrubbed since 2019-11-18 16:10:52.561204
>pg 6.1f44 not deep-scrubbed since 2019-11-18 15:48:16.825569
>pg 6.1f36 not deep-scrubbed since 2019-11-20 05:39:00.309340
>pg 6.1f31 not deep-scrubbed since 2019-11-27 02:48:45.347680
>pg 6.1f30 not deep-scrubbed since 2019-11-11 21:34:15.795622
>pg 6.1f2d not deep-scrubbed since 2019-11-24 11:37:39.502829
>pg 6.1f27 not deep-scrubbed since 2019-11-25 07:38:58.689315
>pg 6.1f25 not deep-scrubbed since 2019-11-20 00:13:43.048569
>pg 6.1f1a not deep-scrubbed since 2019-11-09 15:08:43.51
>pg 6.1f19 not deep-scrubbed since 2019-11-25 10:24:47.884332
>1468 more pgs...
> Mon Dec  9 08:12:01 PST 2019
>
> There is very little data on the cluster, so it's not a problem of
> deep-scrubs taking too long:
>
> $ ceph df
> RAW STORAGE:
>CLASS SIZEAVAIL   USEDRAW USED %RAW USED
>hdd   6.3 PiB 6.1 PiB 153 TiB  154 TiB  2.39
>nvme  5.8 TiB 5.6 TiB 138 GiB  197 GiB  3.33
>TOTAL 6.3 PiB 6.2 PiB 154 TiB  154 TiB  2.39
>
> POOLS:
>POOL   ID STORED  OBJECTS USED
>%USED MAX AVAIL
>.rgw.root   1 3.0 KiB   7 3.0 KiB
> 0   1.8 PiB
>default.rgw.control 2 0 B   8 0 B
> 0   1.8 PiB
>default.rgw.meta3 7.4 KiB  24 7.4 KiB
> 0   1.8 PiB
>default.rgw.log 4  11 GiB 341  11 GiB
> 0   1.8 PiB
>default.rgw.buckets.data6 100 TiB  41.84M 100 TiB
>  1.82   4.2 PiB
>default.rgw.buckets.index   7  33 GiB 574  33 GiB
> 0   1.8 PiB
>default.rgw.buckets.non-ec  8 8.1 MiB  22 8.1 MiB
> 0   1.8 PiB
>
> Please help me figure out what I'm doing wrong with these settings.
>
> Thanks,
> Robert LeBlanc
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Annoying PGs not deep-scrubbed in time messages in Nautilus.

2019-12-09 Thread Paul Emmerich
Hi,

nice coincidence that you mention that today; I've just debugged the exact
same problem on a setup where deep_scrub_interval was increased.

The solution was to set the deep_scrub_interval directly on all pools
instead (which was better for this particular setup anyways):

ceph osd pool set <pool> deep_scrub_interval <seconds>

Here's the code that generates the warning:
https://github.com/ceph/ceph/blob/v14.2.4/src/mon/PGMap.cc#L3058

* There's no obvious bug in the code, no reason why it shouldn't work with
the option unless "pool->opts.get(pool_opts_t::DEEP_SCRUB_INTERVAL, x)"
returns the wrong thing if it's not configured for a pool
* I've used "config diff" to check that all mons use the correct value for
deep_scrub_interval
* mon_warn_pg_not_deep_scrubbed_ratio is a little bit odd because the
warning triggers at (mon_warn_pg_not_deep_scrubbed_ratio + 1) *
deep_scrub_interval, which is somewhat unexpected; by default that is 125% of
the configured interval



Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Mon, Dec 9, 2019 at 5:17 PM Robert LeBlanc  wrote:

> I've increased the deep_scrub interval on the OSDs on our Nautilus cluster
> with the following added to the [osd] section:
>
> osd_deep_scrub_interval = 260
>
> And I started seeing
>
> 1518 pgs not deep-scrubbed in time
>
> in ceph -s. So I added
>
> mon_warn_pg_not_deep_scrubbed_ratio = 1
>
> since the default would start warning with a whole week left to scrub. But
> the messages persist. The cluster has been running for a month with these
> settings. Here is an example of the output. As you can see, some of these
> are not even two weeks old, no where close to the 75% of 4 weeks.
>
> pg 6.1f49 not deep-scrubbed since 2019-11-09 23:04:55.370373
>pg 6.1f47 not deep-scrubbed since 2019-11-18 16:10:52.561204
>pg 6.1f44 not deep-scrubbed since 2019-11-18 15:48:16.825569
>pg 6.1f36 not deep-scrubbed since 2019-11-20 05:39:00.309340
>pg 6.1f31 not deep-scrubbed since 2019-11-27 02:48:45.347680
>pg 6.1f30 not deep-scrubbed since 2019-11-11 21:34:15.795622
>pg 6.1f2d not deep-scrubbed since 2019-11-24 11:37:39.502829
>pg 6.1f27 not deep-scrubbed since 2019-11-25 07:38:58.689315
>pg 6.1f25 not deep-scrubbed since 2019-11-20 00:13:43.048569
>pg 6.1f1a not deep-scrubbed since 2019-11-09 15:08:43.51
>pg 6.1f19 not deep-scrubbed since 2019-11-25 10:24:47.884332
>1468 more pgs...
> Mon Dec  9 08:12:01 PST 2019
>
> There is very little data on the cluster, so it's not a problem of
> deep-scrubs taking too long:
>
> $ ceph df
> RAW STORAGE:
>CLASS SIZEAVAIL   USEDRAW USED %RAW USED
>hdd   6.3 PiB 6.1 PiB 153 TiB  154 TiB  2.39
>nvme  5.8 TiB 5.6 TiB 138 GiB  197 GiB  3.33
>TOTAL 6.3 PiB 6.2 PiB 154 TiB  154 TiB  2.39
>
> POOLS:
>POOL   ID STORED  OBJECTS USED
>%USED MAX AVAIL
>.rgw.root   1 3.0 KiB   7 3.0 KiB
> 0   1.8 PiB
>default.rgw.control 2 0 B   8 0 B
> 0   1.8 PiB
>default.rgw.meta3 7.4 KiB  24 7.4 KiB
> 0   1.8 PiB
>default.rgw.log 4  11 GiB 341  11 GiB
> 0   1.8 PiB
>default.rgw.buckets.data6 100 TiB  41.84M 100 TiB
>  1.82   4.2 PiB
>default.rgw.buckets.index   7  33 GiB 574  33 GiB
> 0   1.8 PiB
>default.rgw.buckets.non-ec  8 8.1 MiB  22 8.1 MiB
> 0   1.8 PiB
>
> Please help me figure out what I'm doing wrong with these settings.
>
> Thanks,
> Robert LeBlanc
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster in ERR status when rebalancing

2019-12-09 Thread Paul Emmerich
This is a (harmless) bug that has existed since Mimic and will be fixed in
14.2.5 (I think?). The health error will clear up without any intervention.


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Mon, Dec 9, 2019 at 12:03 PM Eugen Block  wrote:

> Hi,
>
> since we upgraded our cluster to Nautilus we also see those messages
> sometimes when it's rebalancing. There are several reports about this
> [1] [2], we didn't see it in Luminous. But eventually the rebalancing
> finished and the error message cleared, so I'd say there's (probably)
> nothing to worry about if there aren't any other issues.
>
> Regards,
> Eugen
>
>
> [1] https://tracker.ceph.com/issues/39555
> [2] https://tracker.ceph.com/issues/41255
>
>
> Zitat von Simone Lazzaris :
>
> > Hi all;
> > Long story short, I have a cluster of 26 OSDs in 3 nodes (8+9+9). One
> > of the disks is showing
> > some read errors, so I've added an OSD in the faulty node (OSD.26)
> > and set the (re)weight of
> > the faulty OSD (OSD.12) to zero.
> >
> > The cluster is now rebalancing, which is fine, but I have now 2 PG
> > in "backfill_toofull" state, so
> > the cluster health is "ERR":
> >
> >   cluster:
> > id: 9ec27b0f-acfd-40a3-b35d-db301ac5ce8c
> > health: HEALTH_ERR
> > Degraded data redundancy (low space): 2 pgs backfill_toofull
> >
> >   services:
> > mon: 3 daemons, quorum s1,s2,s3 (age 7d)
> > mgr: s1(active, since 7d), standbys: s2, s3
> > osd: 27 osds: 27 up (since 2h), 26 in (since 2h); 262 remapped pgs
> > rgw: 3 daemons active (s1, s2, s3)
> >
> >   data:
> > pools:   10 pools, 1200 pgs
> > objects: 11.72M objects, 37 TiB
> > usage:   57 TiB used, 42 TiB / 98 TiB avail
> > pgs: 2618510/35167194 objects misplaced (7.446%)
> >  938 active+clean
> >  216 active+remapped+backfill_wait
> >  44  active+remapped+backfilling
> >  2   active+remapped+backfill_wait+backfill_toofull
> >
> >   io:
> > recovery: 163 MiB/s, 50 objects/s
> >
> >   progress:
> > Rebalancing after osd.12 marked out
> >   [=.]
> >
> > As you can see, there is plenty of space and none of my OSD  is in
> > full or near full state:
> >
> >
> > +----+------+-------+-------+--------+---------+--------+---------+-----------+
> > | id | host | used  | avail | wr ops | wr data | rd ops | rd data | state     |
> > +----+------+-------+-------+--------+---------+--------+---------+-----------+
> > | 0  | s1   | 2415G | 1310G | 0      | 0       | 0      | 0       | exists,up |
> > | 1  | s2   | 2009G | 1716G | 0      | 0       | 0      | 0       | exists,up |
> > | 2  | s3   | 2183G | 1542G | 0      | 0       | 0      | 0       | exists,up |
> > | 3  | s1   | 2680G | 1045G | 0      | 0       | 0      | 0       | exists,up |
> > | 4  | s2   | 2063G | 1662G | 0      | 0       | 0      | 0       | exists,up |
> > | 5  | s3   | 2269G | 1456G | 0      | 0       | 0      | 0       | exists,up |
> > | 6  | s1   | 2523G | 1202G | 0      | 0       | 0      | 0       | exists,up |
> > | 7  | s2   | 1973G | 1752G | 0      | 0       | 0      | 0       | exists,up |
> > | 8  | s3   | 2007G | 1718G | 0      | 0       | 1      | 0       | exists,up |
> > | 9  | s1   | 2485G | 1240G | 0      | 0       | 0      | 0       | exists,up |
> > | 10 | s2   | 2385G | 1340G | 0      | 0       | 0      | 0       | exists,up |
> > | 11 | s3   | 2079G | 1646G | 0      | 0       | 0      | 0       | exists,up |
> > | 12 | s1   | 2272G | 1453G | 0      | 0       | 0      | 0       | exists,up |
> > | 13 | s2   | 2381G | 1344G | 0      | 0       | 0      | 0       | exists,up |
> > | 14 | s3   | 1923G | 1802G | 0      | 0       | 0      | 0       | exists,up |
> > | 15 | s1   | 2617G | 1108G | 0      | 0       | 0      | 0       | exists,up |
> > | 16 | s2   | 2099G | 1626G | 0      | 0       | 0      | 0       | exists,up |
> > | 17 | s3   | 2336G | 1389G | 0      | 0       | 0      | 0       | exists,up |
> > | 18 | s1   | 2435G | 1290G | 0      | 0       | 0      | 0       | exists,up |
> > | 19 | s2   | 2198G | 1527G | 0      | 0       | 0

Re: [ceph-users] best pool usage for vmware backing

2019-12-05 Thread Paul Emmerich
No, you obviously don't need multiple pools for load balancing.
-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Thu, Dec 5, 2019 at 6:46 PM Philip Brown  wrote:

> Hmm...
>
> I reread through the docs in and around
> https://docs.ceph.com/docs/master/rbd/iscsi-targets/
>
> and it mentions about iscsi multipathing through multiple CEPH storage
> gateways... but it doesnt seem to say anything about needing multiple POOLS.
>
> when you wrote,
> " 1 pool per storage class (e.g., SSD and HDD), at least one RBD per
> gateway per pool for load balancing (failover-only load balancing
> policy)."
>
>
> you seemed to imply that we needed to set up multiple pools to get "load
> balancing", but it is lacking some information.
>
> Let me see if I can infer the missing details to your original post.
>
> Perhaps you are suggesting that we use storage pools, to emulate the old
> dual-controller hardware raid array best practice of assigning half the
> LUNs to one controller, and half to the other, for "load balancing".
> Except in this case, we would tell half our vms to use (pool1) and half
> our vms to use (pool2). And then somehow (?) assign preference for pool1 to
> use ceph gateway1, and pool2 to prefer using ceph gateway2
>
> is that what you are saying?
>
> it would make sense...
> Except that I dont see anything in the docs that says how to make such an
> association between pools and a theoretical preferred iscsi gateway.
>
>
>
>
>
>
> - Original Message -
> From: "Paul Emmerich" 
> To: "Philip Brown" 
> Cc: "ceph-users" 
> Sent: Thursday, December 5, 2019 8:16:09 AM
> Subject: Re: [ceph-users] best pool usage for vmware backing
>
> ceph-iscsi doesn't support round-robin multi-pathing; so you need at least
> one LUN per gateway to utilize all of them.
>
> Please see https://docs.ceph.com for basics about RBDs and pools.
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
>
> On Thu, Dec 5, 2019 at 5:04 PM Philip Brown  wrote:
>
> > Interesting.
> > I thought when you defined a pool, and then defined an RBD within that
> > pool.. that any auto-replication stayed within that pool?
> > So what kind of "load balancing" do you mean?
> > I'm confused.
> >
> >
> >
> >
> > - Original Message -
> > From: "Paul Emmerich" 
> > To: "Philip Brown" 
> > Cc: "ceph-users" 
> > Sent: Wednesday, December 4, 2019 12:05:47 PM
> > Subject: Re: [ceph-users] best pool usage for vmware backing
> >
> > 1 pool per storage class (e.g., SSD and HDD), at least one RBD per
> > gateway per pool for load balancing (failover-only load balancing
> > policy).
> >
> > Paul
> >
> > --
> > Paul Emmerich
> >
> > Looking for help with your Ceph cluster? Contact us at https://croit.io
> >
> > croit GmbH
> > Freseniusstr. 31h
> > 81247 München
> > www.croit.io
> > Tel: +49 89 1896585 90
> >
> > On Wed, Dec 4, 2019 at 8:51 PM Philip Brown  wrote:
> > >
> > > Lets say that you had roughly 60 OSDs that you wanted to use to provide
> > storage for VMware, through RBDs served through iscsi.
> > >
> > > Target VM types are completely mixed. Web front ends, app tier.. a few
> > databases.. and the kitchen sink.
> > > Estimated number of VMs: 50-200
> > > b
> > >
> > > How would people recommend the storage be divided up?
> > >
> > > The big questions are:
> > >
> > > * 1 pool, or multiple,and why
> > >
> > > * many RBDs, few RBDs, or single  RBD per pool? why?
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > --
> > > Philip Brown| Sr. Linux System Administrator | Medata, Inc.
> > > 5 Peters Canyon Rd Suite 250
> > > Irvine CA 92606
> > > Office 714.918.1310| Fax 714.918.1325
> > > pbr...@medata.com| www.medata.com
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] best pool usage for vmware backing

2019-12-05 Thread Paul Emmerich
ceph-iscsi doesn't support round-robin multi-pathing; so you need at least
one LUN per gateway to utilize all of them.

Please see https://docs.ceph.com for basics about RBDs and pools.

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Thu, Dec 5, 2019 at 5:04 PM Philip Brown  wrote:

> Interesting.
> I thought when you defined a pool, and then defined an RBD within that
> pool.. that any auto-replication stayed within that pool?
> So what kind of "load balancing" do you mean?
> I'm confused.
>
>
>
>
> - Original Message -
> From: "Paul Emmerich" 
> To: "Philip Brown" 
> Cc: "ceph-users" 
> Sent: Wednesday, December 4, 2019 12:05:47 PM
> Subject: Re: [ceph-users] best pool usage for vmware backing
>
> 1 pool per storage class (e.g., SSD and HDD), at least one RBD per
> gateway per pool for load balancing (failover-only load balancing
> policy).
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
> On Wed, Dec 4, 2019 at 8:51 PM Philip Brown  wrote:
> >
> > Lets say that you had roughly 60 OSDs that you wanted to use to provide
> storage for VMware, through RBDs served through iscsi.
> >
> > Target VM types are completely mixed. Web front ends, app tier.. a few
> databases.. and the kitchen sink.
> > Estimated number of VMs: 50-200
> > b
> >
> > How would people recommend the storage be divided up?
> >
> > The big questions are:
> >
> > * 1 pool, or multiple,and why
> >
> > * many RBDs, few RBDs, or single  RBD per pool? why?
> >
> >
> >
> >
> >
> >
> >
> > --
> > Philip Brown| Sr. Linux System Administrator | Medata, Inc.
> > 5 Peters Canyon Rd Suite 250
> > Irvine CA 92606
> > Office 714.918.1310| Fax 714.918.1325
> > pbr...@medata.com| www.medata.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What does the ceph-volume@simple-crazyhexstuff SystemD service do? And what to do about oversized MDS cache?

2019-12-05 Thread Paul Emmerich
The ceph-volume services make sure that the right partitions are mounted at
/var/lib/ceph/osd/ceph-X

In "simple" mode the service gets the necessary information from a json
file (long-hex-string.json) in /etc/ceph

ceph-volume simple scan/activate create the json file and systemd unit.

ceph-disk used udev instead for the activation which was *very* messy and a
frequent cause of long startup delays (seen > 40 minutes on encrypted
ceph-disk OSDs)
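
If you want to see what was recorded for your OSD, the scan output normally
ends up under /etc/ceph/osd/ (a sketch, using the id and fsid from the service
name in your mail; the exact path may differ on your setup):

    cat /etc/ceph/osd/0-6585a10b-917f-4458-a464-b4dd729ef174.json

The unit itself only performs the activation at startup, so the "inactive
(dead)" state with exit status 0/SUCCESS once the OSD is mounted and running
is expected and nothing to worry about.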

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Thu, Dec 5, 2019 at 1:03 PM Ranjan Ghosh  wrote:

> Hi all,
>
> After upgrading to Ubuntu 19.10 and consequently from Mimic to Nautilus, I
> had a mini-shock when my OSDs didn't come up. Okay, I should have read the
> docs more closely, I had to do:
>
> # ceph-volume simple scan /dev/sdb1
>
> # ceph-volume simple activate --all
>
> Hooray. The OSDs came back to life. And I saw that some weird services
> were created. Didn't give that much thought at first, but later I noticed
> there is now a new service in town:
>
> ===
>
> root@yak1 ~ # systemctl status
> ceph-volume@simple-0-6585a10b-917f-4458-a464-b4dd729ef174.service
> ceph-volume@simple-0-6585a10b-917f-4458-a464-b4dd729ef174.service - Ceph
> Volume activation: simple-0-6585a10b-917f-4458-a464-b4dd729ef174
>Loaded: loaded (/lib/systemd/system/ceph-volume@.service; enabled;
> vendor preset: enabled)
>Active: inactive (dead) since Wed 2019-12-04 23:29:15 CET; 13h ago
>  Main PID: 10048 (code=exited, status=0/SUCCESS)
>
> ===
>
> Hmm. It's dead. But my cluster is alive & kicking, though. Everything is
> working. Why is this needed? Should I be worried? Or can I safely delete
> that service from /etc/systemd/... since it's not running anyway?
>
> Another, probably minor issue:
>
> I still get a HEALTH_WARN "1 MDSs report oversized cache". But it doesn't
> tell me any details and I cannot find anything in the logs. What should I
> do to resolve this? Set mds_cache_memory_limit? How do I determine an
> acceptable value?
>
>
> Thank you / Best regards
>
> Ranjan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] best pool usage for vmware backing

2019-12-04 Thread Paul Emmerich
1 pool per storage class (e.g., SSD and HDD), at least one RBD per
gateway per pool for load balancing (failover-only load balancing
policy).
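
A sketch of the pool-per-storage-class part (pool names, PG counts and rule
names are placeholders):

    ceph osd crush rule create-replicated rule-ssd default host ssd
    ceph osd crush rule create-replicated rule-hdd default host hdd
    ceph osd pool create vmware-ssd 128 128 replicated rule-ssd
    ceph osd pool create vmware-hdd 512 512 replicated rule-hdd
    rbd pool init vmware-ssd
    rbd pool init vmware-hdd

The RBDs used as iSCSI LUNs then go into whichever pool matches the storage
class you want for that LUN.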

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Dec 4, 2019 at 8:51 PM Philip Brown  wrote:
>
> Lets say that you had roughly 60 OSDs that you wanted to use to provide 
> storage for VMware, through RBDs served through iscsi.
>
> Target VM types are completely mixed. Web front ends, app tier.. a few 
> databases.. and the kitchen sink.
> Estimated number of VMs: 50-200
> b
>
> How would people recommend the storage be divided up?
>
> The big questions are:
>
> * 1 pool, or multiple,and why
>
> * many RBDs, few RBDs, or single  RBD per pool? why?
>
>
>
>
>
>
>
> --
> Philip Brown| Sr. Linux System Administrator | Medata, Inc.
> 5 Peters Canyon Rd Suite 250
> Irvine CA 92606
> Office 714.918.1310| Fax 714.918.1325
> pbr...@medata.com| www.medata.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW performance with low object sizes

2019-12-03 Thread Paul Emmerich
On Tue, Dec 3, 2019 at 6:43 PM Robert LeBlanc  wrote:
>
> On Tue, Dec 3, 2019 at 9:11 AM Ed Fisher  wrote:
>>
>>
>>
>> On Dec 3, 2019, at 10:28 AM, Robert LeBlanc  wrote:
>>
>> Did you make progress on this? We have a ton of < 64K objects as well and 
>> are struggling to get good performance out of our RGW. Sometimes we have RGW 
>> instances that are just gobbling up CPU even when there are no requests to 
>> them, so it seems like things are getting hung up somewhere. There is 
>> nothing in the logs and I haven't had time to do more troubleshooting.
>>
>>
>> There's a bug in the current stable Nautilus release that causes a loop 
>> and/or crash in get_obj_data::flush (you should be able to see it gobbling 
>> up CPU in perf top). This is the related issue: 
>> https://tracker.ceph.com/issues/39660 -- it should be fixed as soon as 
>> 14.2.5 is released (any day now, supposedly).
>
>
> We will try out the new version when it's released and see if it improves 
> things for us.

I can confirm that what you are describing sounds like the issue
linked above; yes, the issue talks mainly about crashes, but that's
the "good" version of this bug. The bad version just hangs the thread in an
infinite loop; I once debugged this in more detail, and the added
locks in the linked pull request fixed it.
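
A quick way to check whether you are hitting it (a sketch, assuming a single
radosgw process per host):

    perf top -p $(pidof radosgw)

A thread spinning in get_obj_data::flush while no requests are being served
is a strong hint.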


Paul

>
> Thanks,
> Robert LeBlanc
>
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Tuning Nautilus for flash only

2019-11-28 Thread Paul Emmerich
Can confirm that disabling power saving helps. I've also seen latency
improvements with sysctl -w net.ipv4.tcp_low_latency=1

Another thing that sometimes helps is disabling the write cache of
your SSDs (hdparm -W 0), depends on the disk model, though.

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Thu, Nov 28, 2019 at 7:59 PM David Majchrzak, ODERLAND Webbhotell
AB  wrote:
>
> Paul,
>
> Absolutely, I said I was looking at those settings and most didn't make
> any sense to me in a production environment (we've been running ceph
> since Dumpling).
>
> However we only have 1 cluster on Bluestore and I wanted to get some
> opinions if anything other than the defaults in ceph.conf or sysctl or
> things like Wido suggested with c-states would make any differences.
> (Thank you Wido!)
>
> Yes, running benchmarks is great, and we're already doing that
> ourselves.
>
> Cheers and have a nice evening!
>
> --
> David Majchrzak
>
>
> On tor, 2019-11-28 at 17:46 +0100, Paul Emmerich wrote:
> > Please don't run this config in production.
> > Disabling checksumming is a bad idea, disabling authentication is
> > also
> > pretty bad.
> >
> > There are also a few options in there that no longer exist (osd op
> > threads) or are no longer relevant (max open files), in general, you
> > should not blindly copy config files you find on the Internet. Only
> > set an option to its non-default value after carefully checking what
> > it does and whether it applies to your use case.
> >
> > Also, run benchmarks yourself. Use benchmarks that are relevant to
> > your use case.
> >
> > Paul
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Tuning Nautilus for flash only

2019-11-28 Thread Paul Emmerich
Please don't run this config in production.
Disabling checksumming is a bad idea, disabling authentication is also
pretty bad.

There are also a few options in there that no longer exist (osd op
threads) or are no longer relevant (max open files), in general, you
should not blindly copy config files you find on the Internet. Only
set an option to its non-default value after carefully checking what
it does and whether it applies to your use case.

Also, run benchmarks yourself. Use benchmarks that are relevant to
your use case.

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Thu, Nov 28, 2019 at 1:17 PM Wido den Hollander  wrote:
>
>
>
> On 11/28/19 12:56 PM, David Majchrzak, ODERLAND Webbhotell AB wrote:
> > Hi!
> >
> > We've deployed a new flash only ceph cluster running Nautilus and I'm
> > currently looking at any tunables we should set to get the most out of
> > our NVMe SSDs.
> >
> > I've been looking a bit at the options from the blog post here:
> >
> > https://ceph.io/community/bluestore-default-vs-tuned-performance-comparison/
> >
> > with the conf here:
> > https://gist.github.com/likid0/1b52631ff5d0d649a22a3f30106ccea7
> >
> > However some of them, like checksumming, is for testing speed only but
> > not really applicable in a real life scenario with critical data.
> >
> > Should we stick with defaults or is there anything that could help?
> >
> > We have 256GB of RAM on each OSD host, 8 OSD hosts with 10 SSDs on
> > each. 2 osd daemons on each SSD. Raise ssd bluestore cache to 8GB?
> >
> > Workload is about 50/50 r/w ops running qemu VMs through librbd. So
> > mixed block size.
>
> Pin the C-State of your CPUs to 1 and disable powersaving. That can
> reduce the latency vastly.
>
> Testing with rados bench -t 1 -b 4096 -o 4096 you should be able to get
> to a 0.8ms write latency with 3x replication.
>
> >
> > 3 replicas.
> >
> > Appreciate any advice!
> >
> > Kind Regards,
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dynamic bucket index resharding bug? - rgw.none showing unreal number of objects

2019-11-22 Thread Paul Emmerich
On Fri, Nov 22, 2019 at 9:09 PM J. Eric Ivancich  wrote:

> 2^64 (2 to the 64th power) is 18446744073709551616, which is 13 greater
> than your value of 18446744073709551603. So this likely represents the
> value of -13, but displayed in an unsigned format.

I've seen this with values between -2 and -10, see
https://tracker.ceph.com/issues/37942
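
The wrap-around Eric describes above is easy to verify (plain arithmetic,
python3 used just for illustration):

    python3 -c 'print(18446744073709551603 - 2**64)'
    # -13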


Paul

>
> Obviously it should not calculate a value of -13. I'm guessing it's a
> bug when bucket index entries that are categorized as rgw.none are
> found, we're not adding to the stats, but when they're removed they are
> being subtracted from the stats.
>
> Interestingly resharding recalculates these, so you'll likely have a
> much smaller value when you're done.
>
> It seems the operations that result in rgw.none bucket index entries are
> cancelled operations and removals.
>
> We're currently looking at how best to deal with rgw.none stats here:
>
> https://github.com/ceph/ceph/pull/29062
>
> Eric
>
> --
> J. Eric Ivancich
> he/him/his
> Red Hat Storage
> Ann Arbor, Michigan, USA
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dynamic bucket index resharding bug? - rgw.none showing unreal number of objects

2019-11-22 Thread Paul Emmerich
I originally reported the linked issue. I've seen this problem with
negative stats on several S3 setups, but I could never figure out
how to reproduce it.

But I haven't seen the resharder act on these stats; that seems like a
particularly bad case :(


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Nov 22, 2019 at 5:51 PM David Monschein  wrote:
>
> Hi all. Running an Object Storage cluster with Ceph Nautilus 14.2.4.
>
> We are running into what appears to be a serious bug that is affecting our 
> fairly new object storage cluster. While investigating some performance 
> issues -- seeing abnormally high IOPS, extremely slow bucket stat listings 
> (over 3 minutes) -- we noticed some dynamic bucket resharding jobs running. 
> Strangely enough they were resharding buckets that had very few objects. Even 
> more worrying was the number of new shards Ceph was planning: 65521
>
> [root@os1 ~]# radosgw-admin reshard list
> [
> {
> "time": "2019-11-22 00:12:40.192886Z",
> "tenant": "",
> "bucket_name": "redacteed",
> "bucket_id": "c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.7000639.20",
> "new_instance_id": 
> "redacted:c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.7552496.28",
> "old_num_shards": 1,
> "new_num_shards": 65521
> }
> ]
>
> Upon further inspection we noticed a seemingly impossible number of objects 
> (18446744073709551603) in rgw.none for the same bucket:
> [root@os1 ~]# radosgw-admin bucket stats --bucket=redacted
> {
> "bucket": "redacted",
> "tenant": "",
> "zonegroup": "dbb69c5b-b33f-4af2-950c-173d695a4d2c",
> "placement_rule": "default-placement",
> "explicit_placement": {
> "data_pool": "",
> "data_extra_pool": "",
> "index_pool": ""
> },
> "id": "c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.7000639.20",
> "marker": "c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.7000639.20",
> "index_type": "Normal",
> "owner": "d52cb8cc-1f92-47f5-86bf-fb28bc6b592c",
> "ver": "0#12623",
> "master_ver": "0#0",
> "mtime": "2019-11-22 00:18:41.753188Z",
> "max_marker": "0#",
> "usage": {
> "rgw.none": {
> "size": 0,
> "size_actual": 0,
> "size_utilized": 0,
> "size_kb": 0,
> "size_kb_actual": 0,
> "size_kb_utilized": 0,
> "num_objects": 18446744073709551603
> },
> "rgw.main": {
> "size": 63410030,
> "size_actual": 63516672,
> "size_utilized": 63410030,
> "size_kb": 61924,
> "size_kb_actual": 62028,
> "size_kb_utilized": 61924,
> "num_objects": 27
> },
> "rgw.multimeta": {
> "size": 0,
> "size_actual": 0,
> "size_utilized": 0,
> "size_kb": 0,
> "size_kb_actual": 0,
> "size_kb_utilized": 0,
> "num_objects": 0
> }
> },
> "bucket_quota": {
> "enabled": false,
> "check_on_raw": false,
> "max_size": -1,
> "max_size_kb": 0,
> "max_objects": -1
> }
> }
>
> It would seem that the unreal number of objects in rgw.none is driving the 
> resharding process, making ceph reshard the bucket 65521 times. I am assuming 
> 65521 is the limit.
>
> I have seen only a couple of references to this issue, none of which had a 
> resolution or much of a conversation around them:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-October/030791.html
> https://tracker.ceph.com/issues/37942
>
> For now we are cancelling these resharding jobs since they seem to be causing 
> performance issues with the cluster, but this is an untenable solution. Does 
> anyone know what is causing this? Or how to prevent it/fix it?
>
> Thanks,
> Dave Monschein
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot enable pg_autoscale_mode

2019-11-21 Thread Paul Emmerich
"ceph osd dump" shows you if the flag is set


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Thu, Nov 21, 2019 at 1:18 PM Thomas Schneider <74cmo...@gmail.com> wrote:
>
> Hello Paul,
>
> I didn't skip this step.
>
> Actually I'm sure that everything on Cluster is on Nautilus because I
> had issues with SLES 12SP2 Clients that failed to connect due to
> outdated client tools that could not connect to Nautilus.
>
> Would it make sense to execute
> ceph osd require-osd-release nautilus
> again?
>
> THX
>
> On 21.11.2019 at 12:17, Paul Emmerich wrote:
> > ceph osd require-osd-release nautilus
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Replace bad db for bluestore

2019-11-21 Thread Paul Emmerich
if it's no longer readable: no, the OSD is lost, you'll have to re-create it
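
A rough sketch of the usual replacement flow (OSD id and device names
are just examples; the data is then rebuilt from the remaining
replicas):

ceph osd destroy 42 --yes-i-really-mean-it
ceph-volume lvm zap /dev/sdX --destroy
ceph-volume lvm create --bluestore --data /dev/sdX --block.db /dev/nvme0n1p1 --osd-id 42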

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Thu, Nov 21, 2019 at 12:18 PM zhanrzh...@teamsun.com.cn
 wrote:
>
> Hi,all
>  Suppose the db of bluestore can't read/write,are there  some methods to 
> replace bad db with a new one,in luminous.
> If not, i dare not  deploy ceph with bluestore in my production.
>
> zhanrzh...@teamsun.com.cn
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot enable pg_autoscale_mode

2019-11-21 Thread Paul Emmerich
did you skip the last step of the OSD upgrade during the Nautilus upgrade?

ceph osd require-osd-release nautilus

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Thu, Nov 21, 2019 at 10:56 AM Thomas Schneider <74cmo...@gmail.com> wrote:
>
> Hi,
>
> I try to enable pg_autoscale_mode on a specific pool of my cluster,
> however this returns an error.
> root@ld3955:~# ceph osd pool set ssd pg_autoscale_mode on
> Error EINVAL: must set require_osd_release to nautilus or later before
> setting pg_autoscale_mode
>
> The error message is clear, but my cluster is running Ceph 14.2.4;
> please advise how to fix this.
> root@ld3955:~# ceph --version
> ceph version 14.2.4 (65249672c6e6d843510e7e01f8a4b976dcac3db1) nautilus
> (stable)
>
> THX
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Large OMAP Object

2019-11-20 Thread Paul Emmerich
On Wed, Nov 20, 2019 at 5:16 PM  wrote:
>
> All;
>
> Since I haven't heard otherwise, I have to assume that the only way to get 
> this to go away is to dump the contents of the RGW bucket(s), and  recreate 
> it (them)?

Things to try:

* check the bucket sharding status: radosgw-admin bucket limit check
* reshard the bucket if you aren't running multi-site and the shards
are just too big (did you disable resharding?)
* resharding is working/index is actually smaller: check if the index
shard is actually in use, compare the id to the actual current id (see
bucket stats); it could be a leftover/leaked object (that sometimes
happened during resharding in older versions)
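
A rough sketch of those checks (bucket name and shard count are
placeholders):

radosgw-admin bucket limit check
radosgw-admin bucket stats --bucket=<name> | grep '"id"'
radosgw-admin reshard add --bucket=<name> --num-shards=<n>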


Paul

>
> How did this get past release approval?  A change which makes a valid cluster 
> state invalid, with no mitigation other than downtime, in a minor release.
>
> Thank you,
>
> Dominic L. Hilsbos, MBA
> Director – Information Technology
> Perform Air International Inc.
> dhils...@performair.com
> www.PerformAir.com
>
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> dhils...@performair.com
> Sent: Friday, November 15, 2019 9:13 AM
> To: ceph-users@lists.ceph.com
> Cc: Stephen Self
> Subject: Re: [ceph-users] Large OMAP Object
>
> Wido;
>
> Ok, yes, I have tracked it down to the index for one of our buckets.  I 
> missed the ID in the ceph df output previously.  Next time I'll wait to read 
> replies until I've finished my morning coffee.
>
> How would I go about correcting this?
>
> The content for this bucket is basically just junk, as we're still doing 
> production qualification, and workflow planning.  Moving from Windows file 
> shares to self-hosted cloud storage is a significant undertaking.
>
> Thank you,
>
> Dominic L. Hilsbos, MBA
> Director – Information Technology
> Perform Air International Inc.
> dhils...@performair.com
> www.PerformAir.com
>
>
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Wido 
> den Hollander
> Sent: Friday, November 15, 2019 8:40 AM
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Large OMAP Object
>
>
>
> On 11/15/19 4:35 PM, dhils...@performair.com wrote:
> > All;
> >
> > Thank you for your help so far.  I have found the log entries from when the 
> > object was found, but don't see a reference to the pool.
> >
> > Here the logs:
> > 2019-11-14 03:10:16.508601 osd.1 (osd.1) 21 : cluster [DBG] 56.7 deep-scrub 
> > starts
> > 2019-11-14 03:10:18.325881 osd.1 (osd.1) 22 : cluster [WRN] Large omap 
> > object found. Object: 
> > 56:f7d15b13:::.dir.f91aeff8-a365-47b4-a1c8-928cd66134e8.44130.1:head Key 
> > count: 380425 Size (bytes): 82896978
> >
>
> In this case it's in pool 56, check 'ceph df' to see which pool that is.
>
> To me this seems like a RGW bucket which index grew too big.
>
> Use:
>
> $ radosgw-admin bucket list
> $ radosgw-admin metadata get bucket:
>
> And match that UUID back to the bucket.
>
> Wido
>
> > Thank you,
> >
> > Dominic L. Hilsbos, MBA
> > Director – Information Technology
> > Perform Air International Inc.
> > dhils...@performair.com
> > www.PerformAir.com
> >
> >
> >
> > -Original Message-
> > From: Wido den Hollander [mailto:w...@42on.com]
> > Sent: Friday, November 15, 2019 1:56 AM
> > To: Dominic Hilsbos; ceph-users@lists.ceph.com
> > Cc: Stephen Self
> > Subject: Re: [ceph-users] Large OMAP Object
> >
> > Did you check /var/log/ceph/ceph.log on one of the Monitors to see which
> > pool and Object the large Object is in?
> >
> > Wido
> >
> > On 11/15/19 12:23 AM, dhils...@performair.com wrote:
> >> All;
> >>
> >> We had a warning about a large OMAP object pop up in one of our clusters 
> >> overnight.  The cluster is configured for CephFS, but nothing mounts a 
> >> CephFS, at this time.
> >>
> >> The cluster mostly uses RGW.  I've checked the cluster log, the MON log, 
> >> and the MGR log on one of the mons, with no useful references to the pool 
> >> / pg where the large OMAP objects resides.
> >>
> >> Is my only option to find this large OMAP object to go through the OSD 
> >> logs for the individual OSDs in the cluster?
> >>
> >> Thank you,
> >>
> >> Dominic L. Hilsbos, MBA
> >> Director - Information Technology
> >> Perform Air International Inc.
> >> dhils...@performair.com
> >> www.PerformAir.com
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com

Re: [ceph-users] Migrating from block to lvm

2019-11-15 Thread Paul Emmerich
You'll have to tell LVM about multi-path, otherwise LVM gets confused.
But that should be the only thing
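
For example (only a sketch, the exact filter depends on your device
naming), in /etc/lvm/lvm.conf:

devices {
    multipath_component_detection = 1
    # or, more explicitly, only accept the multipath devices:
    # filter = [ "a|^/dev/mapper/mpath.*|", "r|.*|" ]
}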

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Nov 15, 2019 at 6:04 PM Mike Cave  wrote:
>
> Greetings all!
>
>
>
> I am looking at upgrading to Nautilus in the near future (currently on 
> Mimic). We have a cluster built on 480 OSDs all using multipath and simple 
> block devices. I see that the ceph-disk tool is now deprecated and the 
> ceph-volume tool doesn’t do everything that ceph-disk did for simple devices 
> (e.g. I’m unable to activate a new osd and set the location of wal/block.db, 
> so far as I have been able to figure out). So for disk replacements going 
> forward it could get ugly.
>
>
>
> We deploy/manage using Ceph Ansible.
>
>
>
> I’m okay with updating the OSDs to LVM and understand that it will require a 
> full rebuild of each OSD.
>
>
>
> I was thinking of going OSD by OSD through the cluster until they are all 
> completed. However, someone suggested doing an entire node at a time (that 
> would be 20 OSDs at a time in this case). Is one method going to be better 
> than the other?
>
>
>
> Also a question about setting-up LVM: given I’m using multipath devices, do I 
> have to preconfigure the LVM devices before running the ansible plays or will 
> ansible take care of the LVM setup (even though they are on multipath)?
>
>
>
> I would then do the upgrade to Nautilus from Mimic after all the OSDs were 
> converted.
>
>
>
> I’m looking for opinions on best practices to complete this as I’d like to 
> minimize impact to our clients.
>
>
>
> Cheers,
>
> Mike Cave
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks

2019-11-15 Thread Paul Emmerich
To clear up a few misconceptions here:

* RBD keyrings should use the "profile rbd" permissions, everything
else is *wrong* and should be fixed asap
* Manually adding the blacklist permission might work but isn't
future-proof, fix the keyring instead
* The suggestion to mount them elsewhere to fix this only works
because "elsewhere" probably has an admin keyring, this is a bad
work-around, fix the keyring instead
* This is unrelated to openstack and will happen with *any* reasonably
configured hypervisor that uses exclusive locking

This problem usually happens after upgrading to Luminous without
reading the change log. The change log tells you to adjust the keyring
permissions accordingly
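
Fixing the keyring usually boils down to something like this (client
name and pool are examples for a hypervisor key):

ceph auth caps client.libvirt \
    mon 'profile rbd' \
    osd 'profile rbd pool=vms'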

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Nov 15, 2019 at 4:56 PM Joshua M. Boniface  wrote:
>
> Thanks Simon! I've implemented it, I guess I'll test it out next time my 
> homelab's power dies :-)
>
> On 2019-11-15 10:54 a.m., Simon Ironside wrote:
>
> On 15/11/2019 15:44, Joshua M. Boniface wrote:
>
> Hey All:
>
> I've also quite frequently experienced this sort of issue with my Ceph
> RBD-backed QEMU/KVM cluster (not OpenStack specifically). Should this
> workaround of allowing the 'osd blacklist' command in the caps help in
> that scenario as well, or is this an OpenStack-specific functionality?
>
> Yes, my use case is RBD backed QEMU/KVM too, not Openstack. It's
> required for all RBD clients.
>
> Simon
>
> ___
>
> ceph-users mailing list
>
> ceph-users@lists.ceph.com
>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Large OMAP Object

2019-11-15 Thread Paul Emmerich
Note that the size limit changed from 2M keys to 200k keys recently
(14.2.3 or 14.2.2 or something), so that object is probably older and
that's just the first deep scrub with the reduced limit that triggered
the warning.
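
The threshold itself is a config option, so you can check (or, if you
really want the old behaviour back, raise) it per OSD, e.g. via the
admin socket (osd.0 is an example):

ceph daemon osd.0 config get osd_deep_scrub_large_omap_object_key_threshold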


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Nov 15, 2019 at 4:40 PM Wido den Hollander  wrote:
>
>
>
> On 11/15/19 4:35 PM, dhils...@performair.com wrote:
> > All;
> >
> > Thank you for your help so far.  I have found the log entries from when the 
> > object was found, but don't see a reference to the pool.
> >
> > Here the logs:
> > 2019-11-14 03:10:16.508601 osd.1 (osd.1) 21 : cluster [DBG] 56.7 deep-scrub 
> > starts
> > 2019-11-14 03:10:18.325881 osd.1 (osd.1) 22 : cluster [WRN] Large omap 
> > object found. Object: 
> > 56:f7d15b13:::.dir.f91aeff8-a365-47b4-a1c8-928cd66134e8.44130.1:head Key 
> > count: 380425 Size (bytes): 82896978
> >
>
> In this case it's in pool 56, check 'ceph df' to see which pool that is.
>
> To me this seems like a RGW bucket which index grew too big.
>
> Use:
>
> $ radosgw-admin bucket list
> $ radosgw-admin metadata get bucket:
>
> And match that UUID back to the bucket.
>
> Wido
>
> > Thank you,
> >
> > Dominic L. Hilsbos, MBA
> > Director – Information Technology
> > Perform Air International Inc.
> > dhils...@performair.com
> > www.PerformAir.com
> >
> >
> >
> > -Original Message-
> > From: Wido den Hollander [mailto:w...@42on.com]
> > Sent: Friday, November 15, 2019 1:56 AM
> > To: Dominic Hilsbos; ceph-users@lists.ceph.com
> > Cc: Stephen Self
> > Subject: Re: [ceph-users] Large OMAP Object
> >
> > Did you check /var/log/ceph/ceph.log on one of the Monitors to see which
> > pool and Object the large Object is in?
> >
> > Wido
> >
> > On 11/15/19 12:23 AM, dhils...@performair.com wrote:
> >> All;
> >>
> >> We had a warning about a large OMAP object pop up in one of our clusters 
> >> overnight.  The cluster is configured for CephFS, but nothing mounts a 
> >> CephFS, at this time.
> >>
> >> The cluster mostly uses RGW.  I've checked the cluster log, the MON log, 
> >> and the MGR log on one of the mons, with no useful references to the pool 
> >> / pg where the large OMAP objects resides.
> >>
> >> Is my only option to find this large OMAP object to go through the OSD 
> >> logs for the individual OSDs in the cluster?
> >>
> >> Thank you,
> >>
> >> Dominic L. Hilsbos, MBA
> >> Director - Information Technology
> >> Perform Air International Inc.
> >> dhils...@performair.com
> >> www.PerformAir.com
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NVMe disk - size

2019-11-15 Thread Paul Emmerich
On Fri, Nov 15, 2019 at 4:39 PM Wido den Hollander  wrote:
>
>
>
> On 11/15/19 4:25 PM, Paul Emmerich wrote:
> > On Fri, Nov 15, 2019 at 4:02 PM Wido den Hollander  wrote:
> >>
> >>  I normally use LVM on top
> >> of each device and create 2 LVs per OSD:
> >>
> >> - WAL: 1GB
> >> - DB: xx GB
> >
> > Why? I've seen this a few times and I can't figure out what the
> > advantage of doing this explicitly on the LVM level instead of relying
> > on BlueStore to handle this.
> >
>
> If the WAL+DB are on an external device you want the WAL to be there as
> well. That's why I specify the WAL separately.
>
> This might be an 'old habit' as well.

But the WAL will be placed onto the DB device if it isn't explicitly
specified, so there's no advantage to having a separate partition.
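
i.e. something like this is already enough, and the WAL ends up on the
same fast device (device names are examples):

ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1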


Paul

>
> Wido
>
> >
> > Paul
> >
> >>
> >>>
> >>>
> >>> The initial cluster is +1PB and we’re planning to expand it again with
> >>> 1PB in the near future to migrate our data.
> >>>
> >>> We’ll only use the system thru the RGW (No CephFS, nor block device),
> >>> and we’ll store “a lot” of small files on it… (Millions of files a day)
> >>>
> >>>
> >>>
> >>> The reason I’m asking it, is that I’ve been able to break the test
> >>> system (long story), causing OSDs to fail as they ran out of space…
> >>> Expanding the disks (the block DB device as well as the main block
> >>> device) failed with the ceph-bluestore-tool…
> >>>
> >>>
> >>>
> >>> Thanks for your answer!
> >>>
> >>>
> >>>
> >>> Kristof
> >>>
> >>>
> >>> ___
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NVMe disk - size

2019-11-15 Thread Paul Emmerich
On Fri, Nov 15, 2019 at 4:02 PM Wido den Hollander  wrote:
>
>  I normally use LVM on top
> of each device and create 2 LVs per OSD:
>
> - WAL: 1GB
> - DB: xx GB

Why? I've seen this a few times and I can't figure out what the
advantage of doing this explicitly on the LVM level is, instead of
relying on BlueStore to handle this.


Paul

>
> >
> >
> > The initial cluster is +1PB and we’re planning to expand it again with
> > 1PB in the near future to migrate our data.
> >
> > We’ll only use the system thru the RGW (No CephFS, nor block device),
> > and we’ll store “a lot” of small files on it… (Millions of files a day)
> >
> >
> >
> > The reason I’m asking it, is that I’ve been able to break the test
> > system (long story), causing OSDs to fail as they ran out of space…
> > Expanding the disks (the block DB device as well as the main block
> > device) failed with the ceph-bluestore-tool…
> >
> >
> >
> > Thanks for your answer!
> >
> >
> >
> > Kristof
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NVMe disk - size

2019-11-15 Thread Paul Emmerich
On Fri, Nov 15, 2019 at 4:04 PM Kristof Coucke  wrote:
>
> Hi Paul,
>
> Thank you for the answer.
> I hadn't thought of that approach... (using the NVMe for the metadata pool 
> of RGW).
>
> From where do you get the limitation of 1.3TB?

13 OSDs/Server * 10 Servers * 30 GB/OSD usable DB space / 3 (Replica)
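
(That is: 13 x 10 x 30 GB = 3,900 GB of raw DB space across the
cluster; divided by 3 for replication that is roughly 1,300 GB, i.e.
about 1.3 TB of usable metadata capacity.)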


>
> I don't get that one...
>
> Br,
>
> Kristof
>
> On Fri, Nov 15, 2019 at 15:26, Paul Emmerich  wrote:
>>
>> On Fri, Nov 15, 2019 at 3:16 PM Kristof Coucke  
>> wrote:
>> > We’ve configured a Ceph cluster with 10 nodes, each having 13 large disks 
>> > (14TB) and 2 NVMe disks (1,6TB).
>> > The recommendations I’ve read in the online documentation, state that the 
>> > db block device should be around 4%~5% of the slow device. So, the 
>> > block.db should be somewhere between 600GB and 700GB as a best practice.
>>
>> That recommendation is unfortunately not based on any facts :(
>> How much you really need depends on your actual usage.
>>
>> > However… I was thinking to only reserve 200GB per OSD as fast device… 
>> > Which is 1/3 of the recommendation…
>>
>> For various weird internal reason it'll only use ~30 GB in the steady
>> state during operation before spilling over at the moment, 300 GB
>> would be the next magical number
>> (search mailing list for details)
>>
>>
>> > Is it recommended to still use it as a block.db
>>
>> yes
>>
>> > or is it recommended to only use it as a WAL device?
>>
>> no, there is no advantage to that if it's that large
>>
>>
>> > Should I just split the NVMe in three and only configure 3 OSDs to use the 
>> > system? (This would mean that the performace shall be degraded to the 
>> > speed of the slowest device…)
>>
>> no
>>
>> > We’ll only use the system thru the RGW (No CephFS, nor block device), and 
>> > we’ll store “a lot” of small files on it… (Millions of files a day)
>>
>> the current setup gives you around ~1.3 TB of usable metadata space
>> which may or may not be enough, really depends on how much "a lot" is
>> and how small "small" is.
>>
>> It might be better to use the NVMe disks as dedicated OSDs and map all
>> metadata pools onto them directly, that allows you to fully utilize
>> the space for RGW metadata (but not Ceph metadata in the data pools)
>> without running into weird db size restrictions.
>> There are advantages and disadvantages to both approaches
>>
>> Paul
>>
>> >
>> >
>> >
>> > The reason I’m asking it, is that I’ve been able to break the test system 
>> > (long story), causing OSDs to fail as they ran out of space… Expanding the 
>> > disks (the block DB device as well as the main block device) failed with 
>> > the ceph-bluestore-tool…
>> >
>> >
>> >
>> > Thanks for your answer!
>> >
>> >
>> >
>> > Kristof
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NVMe disk - size

2019-11-15 Thread Paul Emmerich
On Fri, Nov 15, 2019 at 3:16 PM Kristof Coucke  wrote:
> We’ve configured a Ceph cluster with 10 nodes, each having 13 large disks 
> (14TB) and 2 NVMe disks (1,6TB).
> The recommendations I’ve read in the online documentation, state that the db 
> block device should be around 4%~5% of the slow device. So, the block.db 
> should be somewhere between 600GB and 700GB as a best practice.

That recommendation is unfortunately not based on any facts :(
How much you really need depends on your actual usage.

> However… I was thinking to only reserve 200GB per OSD as fast device… Which 
> is 1/3 of the recommendation…

For various weird internal reason it'll only use ~30 GB in the steady
state during operation before spilling over at the moment, 300 GB
would be the next magical number
(search mailing list for details)


> Is it recommended to still use it as a block.db

yes

> or is it recommended to only use it as a WAL device?

no, there is no advantage to that if it's that large


> Should I just split the NVMe in three and only configure 3 OSDs to use the 
> system? (This would mean that the performace shall be degraded to the speed 
> of the slowest device…)

no

> We’ll only use the system thru the RGW (No CephFS, nor block device), and 
> we’ll store “a lot” of small files on it… (Millions of files a day)

the current setup gives you around ~1.3 TB of usable metadata space
which may or may not be enough, really depends on how much "a lot" is
and how small "small" is.

It might be better to use the NVMe disks as dedicated OSDs and map all
metadata pools onto them directly, that allows you to fully utilize
the space for RGW metadata (but not Ceph metadata in the data pools)
without running into weird db size restrictions.
There are advantages and disadvantages to both approaches
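
If you go the dedicated-OSD route, a rough sketch of pinning the RGW
metadata pools to those disks (rule and pool names are examples, and
this assumes the NVMe OSDs carry their own device class, e.g. set via
"ceph osd crush set-device-class nvme osd.N"):

ceph osd crush rule create-replicated rgw-meta-nvme default host nvme
ceph osd pool set default.rgw.buckets.index crush_rule rgw-meta-nvme
ceph osd pool set default.rgw.meta crush_rule rgw-meta-nvme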

Paul

>
>
>
> The reason I’m asking it, is that I’ve been able to break the test system 
> (long story), causing OSDs to fail as they ran out of space… Expanding the 
> disks (the block DB device as well as the main block device) failed with the 
> ceph-bluestore-tool…
>
>
>
> Thanks for your answer!
>
>
>
> Kristof
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to find the lazy egg - poor performance - interesting observations [klartext]

2019-11-13 Thread Paul Emmerich
On Wed, Nov 13, 2019 at 10:13 AM Stefan Bauer  wrote:
>
> Paul,
>
>
> I would like to take the chance to thank you and to ask: could it be that
> subop_latency reports a high value (is that avgtime reported in seconds?)
> because the communication partner is slow in writing/committing?

no


Paul

>
>
> Don't want to follow the red herring :/
>
>
> We have the following times on our 11 osds. Attached image.
>
>
>
> -----Original Message-----
> From: Paul Emmerich 
> Sent: Thursday, 7 November 2019 19:04
> To: Stefan Bauer 
> CC: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] how to find the lazy egg - poor performance - 
> interesting observations [klartext]
>
> You can have a look at subop_latency in "ceph daemon osd.XX perf
> dump", it tells you how long an OSD took to reply to another OSD.
> That's usually a good indicator if an OSD is dragging down others.
> Or have a look at "ceph osd perf dump" which is basically disk
> latency; simpler to acquire but with less information
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
> On Thu, Nov 7, 2019 at 6:55 PM Stefan Bauer  wrote:
> >
> > Hi folks,
> >
> >
> > we are running a 3 node proxmox-cluster with - of course - ceph :)
> >
> > ceph version 12.2.12 (39cfebf25a7011204a9876d2950e4b28aba66d11) luminous 
> > (stable)
> >
> >
> > 10G network. iperf reports almost 10G between all nodes.
> >
> >
> > We are using mixed standard SSDs (crucial / samsung). We are aware that 
> > these disks cannot deliver high IOPS or great throughput, but we have 
> > several of these clusters and this one is showing very poor performance.
> >
> >
> > NOW the strange fact:
> >
> >
> > When a specific node is rebooting, the throughput is acceptable.
> >
> >
> > But when the specific node is back, the results dropped by almost 100%.
> >
> >
> > 2 NODES (one rebooting)
> >
> >
> > # rados bench -p scbench 10 write --no-cleanup
> > hints = 1
> > Maintaining 16 concurrent writes of 4194304 bytes to objects of size 
> > 4194304 for up to 10 seconds or 0 objects
> > Object prefix: benchmark_data_pve3_1767693
> >   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
> >     0       0         0         0         0         0           -           0
> >     1      16        55        39   155.992       156   0.0445665    0.257988
> >     2      16       110        94    187.98       220    0.087097    0.291173
> >     3      16       156       140   186.645       184    0.462171    0.286895
> >     4      16       184       168    167.98       112   0.0235336    0.358085
> >     5      16       210       194   155.181       104    0.112401    0.347883
> >     6      16       252       236   157.314       168    0.134099    0.382159
> >     7      16       287       271   154.838       140   0.0264864     0.40092
> >     8      16       329       313   156.481       168   0.0609964    0.394753
> >     9      16       364       348   154.649       140    0.244309    0.392331
> >    10      16       416       400   159.981       208    0.277489    0.387424
> > Total time run: 10.335496
> > Total writes made:  417
> > Write size: 4194304
> > Object size:4194304
> > Bandwidth (MB/sec): 161.386
> > Stddev Bandwidth:   37.8065
> > Max bandwidth (MB/sec): 220
> > Min bandwidth (MB/sec): 104
> > Average IOPS:   40
> > Stddev IOPS:9
> > Max IOPS:   55
> > Min IOPS:   26
> > Average Latency(s): 0.396434
> > Stddev Latency(s):  0.428527
> > Max latency(s): 1.86968
> > Min latency(s): 0.020558
> >
> >
> >
> > THIRD NODE ONLINE:
> >
> >
> >
> > root@pve3:/# rados bench -p scbench 10 write --no-cleanup
> > hints = 1
> > Maintaining 16 concurrent writes of 4194304 bytes to objects of size 
> > 4194304 for up to 10 seconds or 0 objects
> > Object prefix: benchmark_data_pve3_1771977
> >   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
> >     0       0         0         0         0         0           -           0

Re: [ceph-users] how to find the lazy egg - poor performance - interesting observations [klartext]

2019-11-07 Thread Paul Emmerich
You can have a look at subop_latency in "ceph daemon osd.XX perf
dump", it tells you how long an OSD took to reply to another OSD.
That's usually a good indicator if an OSD is dragging down others.
Or have a look at "ceph osd perf dump" which is basically disk
latency; simpler to acquire but with less information
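
A quick sketch of both (osd.12 is just an example; the admin socket
command has to run on the host that carries that OSD):

ceph daemon osd.12 perf dump | grep -A 4 subop_latency
ceph osd perf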

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Thu, Nov 7, 2019 at 6:55 PM Stefan Bauer  wrote:
>
> Hi folks,
>
>
> we are running a 3 node proxmox-cluster with - of course - ceph :)
>
> ceph version 12.2.12 (39cfebf25a7011204a9876d2950e4b28aba66d11) luminous 
> (stable)
>
>
> 10G network. iperf reports almost 10G between all nodes.
>
>
> We are using mixed standard SSDs (crucial / samsung). We are aware that 
> these disks cannot deliver high IOPS or great throughput, but we have 
> several of these clusters and this one is showing very poor performance.
>
>
> NOW the strange fact:
>
>
> When a specific node is rebooting, the throughput is acceptable.
>
>
> But when the specific node is back, the results dropped by almost 100%.
>
>
> 2 NODES (one rebooting)
>
>
> # rados bench -p scbench 10 write --no-cleanup
> hints = 1
> Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 
> for up to 10 seconds or 0 objects
> Object prefix: benchmark_data_pve3_1767693
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
>     0       0         0         0         0         0           -           0
>     1      16        55        39   155.992       156   0.0445665    0.257988
>     2      16       110        94    187.98       220    0.087097    0.291173
>     3      16       156       140   186.645       184    0.462171    0.286895
>     4      16       184       168    167.98       112   0.0235336    0.358085
>     5      16       210       194   155.181       104    0.112401    0.347883
>     6      16       252       236   157.314       168    0.134099    0.382159
>     7      16       287       271   154.838       140   0.0264864     0.40092
>     8      16       329       313   156.481       168   0.0609964    0.394753
>     9      16       364       348   154.649       140    0.244309    0.392331
>    10      16       416       400   159.981       208    0.277489    0.387424
> Total time run: 10.335496
> Total writes made:  417
> Write size: 4194304
> Object size:4194304
> Bandwidth (MB/sec): 161.386
> Stddev Bandwidth:   37.8065
> Max bandwidth (MB/sec): 220
> Min bandwidth (MB/sec): 104
> Average IOPS:   40
> Stddev IOPS:9
> Max IOPS:   55
> Min IOPS:   26
> Average Latency(s): 0.396434
> Stddev Latency(s):  0.428527
> Max latency(s): 1.86968
> Min latency(s): 0.020558
>
>
>
> THIRD NODE ONLINE:
>
>
>
> root@pve3:/# rados bench -p scbench 10 write --no-cleanup
> hints = 1
> Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 
> for up to 10 seconds or 0 objects
> Object prefix: benchmark_data_pve3_1771977
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
>     0       0         0         0         0         0           -           0
>     1      16        39        23   91.9943        92     0.21353    0.267249
>     2      16        46        30   59.9924        28     0.29527    0.268672
>     3      16        53        37   49.3271        28    0.122732    0.259731
>     4      16        53        37   36.9954         0           -    0.259731
>     5      16        53        37   29.5963         0           -    0.259731
>     6      16        87        71   47.3271       45.    0.241921     1.19831
>     7      16       106        90   51.4214        76    0.124821     1.07941
>     8      16       129       113    56.492        92   0.0314146    0.941378
>     9      16       142       126   55.9919        52    0.285536    0.871445
>    10      16       147       131   52.3925        20    0.354803    0.852074
> Total time run: 10.138312
> Total writes made:  148
> Write size: 4194304
> Object size:4194304
> Bandwidth (MB/sec): 58.3924
> Stddev Bandwidth:   34.405
> Max bandwidth (MB/sec): 92
> Min bandwidth (MB/sec): 0
> Average IOPS:   14
> Stddev IOPS:8
> Max IOPS:   23
> Min IOPS:   0
> Average Latency(s): 1.08818
> Stddev Latency(s):  1.55967
> Max latency(s): 5.02514
> Min latency(s): 0.0255947
&

Re: [ceph-users] Zombie OSD filesystems rise from the grave during bluestore conversion

2019-11-05 Thread Paul Emmerich
On Mon, Nov 4, 2019 at 11:04 PM J David  wrote:
>
> OK. Is there possibly a more surgical approach?  It's going to take a
> really long time to convert the cluster, so we don't want to do
> anything global that might cause weirdness if any of the OSD servers
> with unconverted OSD's need to be rebooted during the process.

FWIW we used a brute-force approach here in the past: a loop that
tries up to 10 times to kill all processes that show up in lsof for
that disk, unmounts it, and wipes it.
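
Roughly something along these lines (the mount point is an example, and
obviously only do this to an OSD you are about to wipe anyway):

for i in $(seq 1 10); do
    # kill whatever still holds the mount open
    for pid in $(lsof -t /var/lib/ceph/osd/ceph-68 2>/dev/null); do
        kill -9 "$pid"
    done
    umount /var/lib/ceph/osd/ceph-68 2>/dev/null && break
    sleep 2
done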

ceph-volume solved all of these problems


Paul
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Zombie OSD filesystems rise from the grave during bluestore conversion

2019-11-05 Thread Paul Emmerich
On Tue, Nov 5, 2019 at 2:43 AM J David  wrote:
> $ sudo ceph osd safe-to-destroy 42
> OSD(s) 42 are safe to destroy without reducing data durability.
> $ sudo ceph osd destroy 42
> Error EPERM: Are you SURE? This will mean real, permanent data loss,
> as well as cephx and lockbox keys. Pass --yes-i-really-mean-it if you
> really do.
>
> Is that a bug?  Or did we miss a step?

could be a new feature; I've only realized this exists/works since Nautilus.
You seem to be on a relatively old version since you still have ceph-disk installed


Paul

>
> Thanks!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Zombie OSD filesystems rise from the grave during bluestore conversion

2019-11-04 Thread Paul Emmerich
That's probably the ceph-disk udev script being triggered from
something somewhere (and a lot of things can trigger that script...)

Work-around: convert everything to ceph-volume simple first by running
"ceph-volume simple scan" and "ceph-volume simple activate"; that will
disable udev in the intended way.
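
A minimal sketch (the OSD path is an example; repeat the scan for each
OSD on the host):

ceph-volume simple scan /var/lib/ceph/osd/ceph-68
ceph-volume simple activate --all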

BTW: you can run destroy before stopping the OSD; you won't need the
--yes-i-really-mean-it if it's drained in this case

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Mon, Nov 4, 2019 at 6:33 PM J David  wrote:
>
> While converting a luminous cluster from filestore to bluestore, we
> are running into a weird race condition on a fairly regular basis.
>
> We have a master script that writes upgrade scripts for each OSD
> server.  The script for an OSD looks like this:
>
> ceph osd out 68
> while ! ceph osd safe-to-destroy 68 ; do sleep 10 ; done
> systemctl stop ceph-osd@68
> sleep 10
> systemctl kill ceph-osd@68
> sleep 10
> umount /var/lib/ceph/osd/ceph-68
> ceph osd destroy 68 --yes-i-really-mean-it
> ceph-volume lvm zap /dev/sda --destroy
> ceph-volume lvm create --bluestore --data /dev/sda --osd-id 68
> sleep 10
> while [ "`ceph health`" != "HEALTH_OK" ] ; do ceph health; sleep 10 ; done
>
> (It's run with sh -e so any error will cause an abort.)
>
> The problem we run into is that in about 1 out of 10 runs, when this
> gets to the "lvm zap" stage, and fails:
>
> --> Zapping: /dev/sda
> Running command: wipefs --all /dev/sda2
> Running command: dd if=/dev/zero of=/dev/sda2 bs=1M count=10
>  stderr: 10+0 records in
> 10+0 records out
> 10485760 bytes (10 MB, 10 MiB) copied, 0.00667608 s, 1.6 GB/s
> --> Destroying partition since --destroy was used: /dev/sda2
> Running command: parted /dev/sda --script -- rm 2
> --> Unmounting /dev/sda1
> Running command: umount -v /dev/sda1
>  stderr: umount: /var/lib/ceph/tmp/mnt.9k0GDx (/dev/sda1) unmounted
> Running command: wipefs --all /dev/sda1
>  stderr: wipefs: error: /dev/sda1: probing initialization failed:
>  stderr: Device or resource busy
> -->  RuntimeError: command returned non-zero exit status: 1
>
> And, lo and behold, it's right: /dev/sda1 has been remounted as
> /var/lib/ceph/osd/ceph-68.
>
> That's after the OSD has been stopped, killed, and destroyed; there
> *is no* osd.68.  It happens after the filesystem has been unmounted
> twice (once by an explicit umount and once by "lvm zap."  The "lvm
> zap" umount shown here with the path /var/lib/ceph/tmp/mnt.9k0GDx
> suggests that the remount is happening in the background somewhere
> while the lvm zap is running.
>
> If we do the zap before the osd destroy, the same thing happens but
> the (still-existing) OSD does not actually restart.  So it's just the
> filesystem that won't stay unmounted long enough to destroy it, not
> the whole OSD.
>
> What's causing this?  How do we keep the filesystem from lurching out
> of the grave in mid-conversion like this?
>
> This is on Debian Stretch with systemd, if that matters.
>
> Thanks!
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD fail to start - fsid problem with KVM

2019-11-04 Thread Paul Emmerich
Which is a different OSD.

Try running ceph-volume lvm activate --all


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Mon, Nov 4, 2019 at 7:44 AM Anton Aleksandrov  wrote:
>
> I think that yes, LVM is online - both drives are present and LVM operates.
>
> lvs -o lv_tags returns:
>
> ceph.block_device=/dev/osd_vg/osd_lv,ceph.block_uuid=EyUggo-Ja06-MptY-Rt4Q-l6iV-0pse-7Y8rPh,ceph.cephx_lockbox_secret=,ceph.cluster_fsid=31c48e79-7724-49db-a20f-dd86e195972f,ceph.cluster_name=ceph,ceph.crush_device_class=None,ceph.encrypted=0,ceph.osd_fsid=a0aa881c-aa2d-4462-9c2f-cd289810e9e7,ceph.osd_id=23,ceph.type=block,ceph.vdo=0
>
> We deployed everything using the ceph-deploy tool, pretty automatic and
> without doing anything extra special.
>
> Reason for this setup on this server is because all other OSDs have
> 1x8tb, but this host has 2x4tb and just 8gb of ram, therefore we did not
> want to make two separate OSDs.
>
> Anton
>
> On 04.11.2019 00:49, Paul Emmerich wrote:
> > On Sun, Nov 3, 2019 at 6:25 PM Anton Aleksandrov  
> > wrote:
> >> Hello community.
> >>
> >> We run Ceph on quite old hardware with quite low traffic. Yesterday we
> >> had to reboot one of the OSDs and after reboot it did not come up. The
> >> error message is:
> >>
> >> [2019-11-02 15:05:07,317][ceph_volume.process][INFO  ] Running command:
> >> /usr/sbin/ceph-volume lvm trigger 22-1c0b3fd7-7d80-4de9-9594-17ac5b2bf92f
> >> [2019-11-02 15:05:07,473][ceph_volume.process][INFO  ] stderr -->
> >> RuntimeError: could not find osd.22 with fsid
> >> 1c0b3fd7-7d80-4de9-9594-17ac5b2bf92f
> >>
> >> This OSD has 2 disks, which are put into one logical volume (basically
> >> raid0) and then used for osd storage.
> > please don't do that (unless you have very good reason to)
> >
> >> We are still beginners with Ceph and this error has us stuck. What should
> >> we do? Change fsid (where?)? Right now cluster is in repair-state.. As
> >> the last resort we would drop osd and rebuild it, but it would be very
> >> important for us to understand - what and why happened. Is it faulty
> >> config or did something bad happen with the disks?
> > Is the LV online? Do the LV tags look correct? Check lvs -o lv_tags
> >
> >
> >
> > Paul
> >
> >
> >> Regards,
> >> Anton.
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD fail to start - fsid problem with KVM

2019-11-03 Thread Paul Emmerich
On Sun, Nov 3, 2019 at 6:25 PM Anton Aleksandrov  wrote:
>
> Hello community.
>
> We run Ceph on quite old hardware with quite low traffic. Yesterday we
> had to reboot one of the OSDs and after reboot it did not come up. The
> error message is:
>
> [2019-11-02 15:05:07,317][ceph_volume.process][INFO  ] Running command:
> /usr/sbin/ceph-volume lvm trigger 22-1c0b3fd7-7d80-4de9-9594-17ac5b2bf92f
> [2019-11-02 15:05:07,473][ceph_volume.process][INFO  ] stderr -->
> RuntimeError: could not find osd.22 with fsid
> 1c0b3fd7-7d80-4de9-9594-17ac5b2bf92f
>
> This OSD has 2 disks, which are put into one logical volume (basically
> raid0) and then used for osd storage.

please don't do that (unless you have very good reason to)

> We are still beginners with Ceph and this error has us stuck. What should
> we do? Change fsid (where?)? Right now cluster is in repair-state.. As
> the last resort we would drop osd and rebuild it, but it would be very
> important for us to understand - what and why happened. Is it faulty
> config or did something bad happen with the disks?

Is the LV online? Do the LV tags look correct? Check lvs -o lv_tags



Paul


>
> Regards,
> Anton.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore runs out of space and dies

2019-10-31 Thread Paul Emmerich
BlueStore doesn't handle running out of space gracefully, because that
doesn't happen on a real disk: full_ratio (95%) and the
failsafe_full_ratio (97%? some obscure config option) kick in before
that happens.
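
You can check what your cluster is actually using with something like
(osd.0 is an example):

ceph osd dump | grep ratio
ceph daemon osd.0 config get osd_failsafe_full_ratio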

Yeah, I've also lost some test cluster with tiny disks to this. Usual
reason is keeping it in a degraded state for weeks at a time...

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Thu, Oct 31, 2019 at 10:50 AM George Shuklin
 wrote:
>
> Hello.
>
> In my lab a nautilus cluster with a bluestore suddenly went dark. As I
> found it had used 98% of the space and most of OSDs (small, 10G each)
> went offline. Any attempt to restart them failed with this message:
>
> # /usr/bin/ceph-osd -f --cluster  ceph --id 18 --setuser ceph --setgroup
> ceph
>
> 2019-10-31 09:44:37.591 7f73d54b3f80 -1 osd.18 271 log_to_monitors
> {default=true}
> 2019-10-31 09:44:37.615 7f73bff99700 -1
> bluestore(/var/lib/ceph/osd/ceph-18) _do_alloc_write failed to allocate
> 0x1 allocated 0x ffe4 min_alloc_size 0x1 available 0x 0
> 2019-10-31 09:44:37.615 7f73bff99700 -1
> bluestore(/var/lib/ceph/osd/ceph-18) _do_write _do_alloc_write failed
> with (28) No space left on device
> 2019-10-31 09:44:37.615 7f73bff99700 -1
> bluestore(/var/lib/ceph/osd/ceph-18) _txc_add_transaction error (28) No
> space left on device not handled on operation 10 (op 30, counting from 0)
> 2019-10-31 09:44:37.615 7f73bff99700 -1
> bluestore(/var/lib/ceph/osd/ceph-18) ENOSPC from bluestore,
> misconfigured cluster
> /build/ceph-14.2.4/src/os/bluestore/BlueStore.cc: In function 'void
> BlueStore::_txc_add_transaction(BlueStore::TransContext*,
> ObjectStore::Transaction*)' thread 7f73bff99700 time 2019-10-31
> 09:44:37.620694
> /build/ceph-14.2.4/src/os/bluestore/BlueStore.cc: 11455:
> ceph_abort_msg("unexpected error")
>
> I was able to recover cluster by adding some more space into VGs for
> some of OSDs and using this command:
>
> ceph-bluestore-tool --log-level 30 --path /var/lib/ceph/osd/ceph-xx
> --command bluefs-bdev-expand
>
> It worked but only because I added some space into OSD.
>
> I'm curious, is there a way to recover such OSD without growing it? On
> the old filestore I can just remove some objects to gain space, is this
> possible for bluestore? My main concern is that OSD daemon simply
> crashes at start, so I can't just add 'more OSD' to cluster - all data
> become unavailable, because OSDs are completely dead.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS client hanging and cache issues

2019-10-30 Thread Paul Emmerich
Kernel bug due to a bad backport, see recent posts here.


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Oct 30, 2019 at 10:42 AM Bob Farrell  wrote:
>
> Hi. We are experiencing a CephFS client issue on one of our servers.
>
> ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus 
> (stable)
>
> Trying to access, `umount`, or `umount -f` a mounted CephFS volumes causes my 
> shell to hang indefinitely.
>
> After a reboot I can remount the volumes cleanly but they drop out after < 1 
> hour of use.
>
> I see this log entry multiple times when I reboot the server:
> ```
> cache_from_obj: Wrong slab cache. inode_cache but object is from 
> ceph_inode_info
> ```
> The machine then reboots after approx. 30 minutes.
>
> All other Ceph/CephFS clients and servers seem perfectly happy. CephFS 
> cluster is HEALTH_OK.
>
> Any help appreciated. If I can provide any further details please let me know.
>
> Thanks in advance,
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS kernel module lockups in Ubuntu linux-image-5.0.0-32-generic?

2019-10-24 Thread Paul Emmerich
Could it be related to the broken backport as described in
https://tracker.ceph.com/issues/40102 ?

(It did affect 4.19, not sure about 5.0)

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Thu, Oct 24, 2019 at 4:23 PM Christopher Wieringa
 wrote:
>
> Hello all,
>
>
>
> I’ve been using the Ceph kernel modules in Ubuntu to load a CephFS filesystem 
> quite successfully for several months.  Yesterday, I went through a round of 
> updates on my Ubuntu 18.04 machines, which loaded 
> linux-image-5.0.0-32-generic as the kernel.  I’m noticing that while the 
> kernel boots and the CephFS filesystem is mounted successfully, any actual 
> file traffic between client/servers is causing the client to hard lock 
> without any kernel panics seen in the logs.  Downgrading to 
> linux-image-5.0.0-31-generic fixes the issue.
>
> I’m planning on trying to figure out how to get a crash kernel running and 
> see if I can generate a log file of some sort, but just wondering if anyone 
> else mounting Ubuntu 18.04 and CephFS via kernel modules could replicate this 
> crashing behavior in their environment.  I don’t have experience in setting 
> up crash kernels or even in filing Ubuntu kernel bugs, so I thought I’d ask 
> this list to see if someone else can replicate this.
>
> My Setup:
> Ubuntu 18.04.3 LTS amd64
>
> Kernel: linux-image-5.0.0-32-generic(installed through the 
> linux-generic-hwe-18.04 metapackage)
>
> Number of machines affected/replicated the lockup issues:  >100  (three 
> generations of desktop computers, and multiple virtual machines as well)
>
>
>
> Also, any suggestions for how to get a log for this are appreciated.
>
>
>
> Chris Wieringa
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Decreasing the impact of reweighting osds

2019-10-22 Thread Paul Emmerich
getting rid of filestore solves most latency spike issues during
recovery because they are often caused by random XFS hangs (splitting
dirs or just xfs having a bad day)


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Tue, Oct 22, 2019 at 6:02 AM Mark Kirkwood
 wrote:
>
> We recently needed to reweight a couple of OSDs on one of our clusters
> (luminous on Ubuntu,  8 hosts, 8 OSD/host). I (think) we reweighted by
> approx 0.2. This was perhaps too much, as IO latency on RBD drives
> spiked to several seconds at times.
>
> We'd like to lessen this effect as much as we can. So we are looking at
> priority and queue parameters (OSDs are Filestore based with S3700 SSD
> or similar NVME journals):
>
> # priorities
> osd_client_op_priority
> osd_recovery_op_priority
> osd_recovery_priority
> osd_scrub_priority
> osd_snap_trim_priority
>
> # queue tuning
> filestore_queue_max_ops
> filestore_queue_low_threshhold
> filestore_queue_high_threshhold
> filestore_expected_throughput_ops
> filestore_queue_high_delay_multiple
> filestore_queue_max_delay_multiple
>
> My first question is this - do these parameters require the CFQ
> scheduler (like osd_disk_thread_ioprio_priority does)? We are currently
> using deadline (we have not tweaked queue/iosched/write_expire down from
> 5000 to 1500 which might be good to do).
>
> My 2nd question is - should we consider increasing
> osd_disk_thread_ioprio_priority (and hence changing to CFQ scheduler)? I
> usually see this parameter discussed WRT scrubbing, and we are not
> having issues with that.
>
> regards
>
> Mark
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] krbd / kcephfs - jewel client features question

2019-10-19 Thread Paul Emmerich
bit 21 indicates whether upmap is supported, and it is not set in
0x7fddff8ee8cbffb, so no.
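
A quick way to check that yourself with shell arithmetic (the masks are
the ones reported above, bit 21 being the upmap feature bit):

printf '%d\n' $(( 0x7fddff8ee8cbffb & (1 << 21) ))    # 0, i.e. no upmap
printf '%d\n' $(( 0x3ffddff8eea4fffb & (1 << 21) ))   # non-zero, upmap ok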


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Sat, Oct 19, 2019 at 2:00 PM Lei Liu  wrote:
>
> Hello llya,
>
> After updated client kernel version to 3.10.0-862 , ceph features shows:
>
> "client": {
> "group": {
> "features": "0x7010fb86aa42ada",
> "release": "jewel",
> "num": 5
> },
> "group": {
> "features": "0x7fddff8ee8cbffb",
> "release": "jewel",
> "num": 1
> },
> "group": {
> "features": "0x3ffddff8eea4fffb",
> "release": "luminous",
> "num": 6
> },
> "group": {
> "features": "0x3ffddff8eeacfffb",
> "release": "luminous",
> "num": 1
> }
> }
>
> both 0x7fddff8ee8cbffb and 0x7010fb86aa42ada are reported by new kernel 
> client.
>
> Is it now possible to force set-require-min-compat-client to be luminous, if 
> not how to fix it?
>
> Thanks
>
>> On Thu, Oct 17, 2019 at 9:45 PM, Ilya Dryomov  wrote:
>>
>> On Thu, Oct 17, 2019 at 3:38 PM Lei Liu  wrote:
>> >
>> > Hi Cephers,
>> >
>> > We have some ceph clusters in 12.2.x version, now we want to use upmap 
>> > balancer,but when i set set-require-min-compat-client to luminous, it's 
>> > failed
>> >
>> > # ceph osd set-require-min-compat-client luminous
>> > Error EPERM: cannot set require_min_compat_client to luminous: 6 connected 
>> > client(s) look like jewel (missing 0xa20); 1 connected 
>> > client(s) look like jewel (missing 0x800); 1 connected 
>> > client(s) look like jewel (missing 0x820); add 
>> > --yes-i-really-mean-it to do it anyway
>> >
>> > ceph features
>> >
>> > "client": {
>> > "group": {
>> > "features": "0x40106b84a842a52",
>> > "release": "jewel",
>> > "num": 6
>> > },
>> > "group": {
>> > "features": "0x7010fb86aa42ada",
>> > "release": "jewel",
>> > "num": 1
>> > },
>> > "group": {
>> > "features": "0x7fddff8ee84bffb",
>> > "release": "jewel",
>> > "num": 1
>> > },
>> > "group": {
>> > "features": "0x3ffddff8eea4fffb",
>> > "release": "luminous",
>> > "num": 7
>> > }
>> > }
>> >
>> > and sessions
>> >
>> > "MonSession(unknown.0 10.10.100.6:0/1603916368 is open allow *, features 
>> > 0x40106b84a842a52 (jewel))",
>> > "MonSession(unknown.0 10.10.100.2:0/2484488531 is open allow *, features 
>> > 0x40106b84a842a52 (jewel))",
>> > "MonSession(client.? 10.10.100.6:0/657483412 is open allow *, features 
>> > 0x7fddff8ee84bffb (jewel))",
>> > "MonSession(unknown.0 10.10.14.67:0/500706582 is open allow *, features 
>> > 0x7010fb86aa42ada (jewel))"
>> >
>> > can i use --yes-i-really-mean-it to force enable it ?
>>
>> No.  0x40106b84a842a52 and 0x7fddff8ee84bffb are too old.
>>
>> Thanks,
>>
>> Ilya
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Can't create erasure coded pools with k+m greater than hosts?

2019-10-18 Thread Paul Emmerich
Default failure domain in Ceph is "host" (see ec profile), i.e., you
need at least k+m hosts (but at least k+m+1 is better for production
setups).
You can change that to OSD, but that's not a good idea for a
production setup for obvious reasons. It's slightly better to write a
crush rule that explicitly picks two disks on each of 3 different hosts.
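
A sketch of such a rule for a 4+2 profile spread over 3 hosts (rule
name/id are just examples, untested here):

rule ec42-3hosts {
    id 99
    type erasure
    min_size 3
    max_size 6
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default
    step choose indep 3 type host
    step chooseleaf indep 2 type osd
    step emit
}

Compile/inject it with crushtool and "ceph osd setcrushmap -i", then
point the pool at it with "ceph osd pool set <pool> crush_rule ec42-3hosts".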


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Oct 18, 2019 at 8:45 PM Salsa  wrote:
>
> I have probably misunterstood how to create erasure coded pools so I may be 
> in need of some theory and appreciate if you can point me to documentation 
> that may clarify my doubts.
>
> I have so far 1 cluster with 3 hosts and 30 OSDs (10 each host).
>
> I tried to create an erasure code profile like so:
>
> "
> # ceph osd erasure-code-profile get ec4x2rs
> crush-device-class=
> crush-failure-domain=host
> crush-root=default
> jerasure-per-chunk-alignment=false
> k=4
> m=2
> plugin=jerasure
> technique=reed_sol_van
> w=8
> "
>
> If I create a pool using this profile or any profile where K+M > hosts , then 
> the pool gets stuck.
>
> "
> # ceph -s
>   cluster:
> id: eb4aea44-0c63-4202-b826-e16ea60ed54d
> health: HEALTH_WARN
> Reduced data availability: 16 pgs inactive, 16 pgs incomplete
> 2 pools have too many placement groups
> too few PGs per OSD (4 < min 30)
>
>   services:
> mon: 3 daemons, quorum ceph01,ceph02,ceph03 (age 11d)
> mgr: ceph01(active, since 74m), standbys: ceph03, ceph02
> osd: 30 osds: 30 up (since 2w), 30 in (since 2w)
>
>   data:
> pools:   11 pools, 32 pgs
> objects: 0 objects, 0 B
> usage:   32 GiB used, 109 TiB / 109 TiB avail
> pgs: 50.000% pgs not active
>  16 active+clean
>  16 creating+incomplete
>
> # ceph osd pool ls
> test_ec
> test_ec2
> "
> The pool will never leave this "creating+incomplete" state.
>
> The pools were created like this:
> "
> # ceph osd pool create test_ec2 16 16 erasure ec4x2rs
> # ceph osd pool create test_ec 16 16 erasure
> "
> The default profile pool is created correctly.
>
> My profiles are like this:
> "
> # ceph osd erasure-code-profile get default
> k=2
> m=1
> plugin=jerasure
> technique=reed_sol_van
>
> # ceph osd erasure-code-profile get ec4x2rs
> crush-device-class=
> crush-failure-domain=host
> crush-root=default
> jerasure-per-chunk-alignment=false
> k=4
> m=2
> plugin=jerasure
> technique=reed_sol_van
> w=8
> "
>
> From what I've read it seems to be possible to create erasure code pools with 
> higher than hosts K+M. Is this not so?
> What am I doing wrong? Do I have to create any special crush map rule?
>
> --
> Salsa
>
> Sent with ProtonMail Secure Email.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] default.rgw.log contains large omap object

2019-10-14 Thread Paul Emmerich
Yeah, the number of shards is configurable ("rgw usage num shards"? or
something).

Are you sure you aren't using it? This feature is not enabled by
default, someone had to explicitly set "rgw enable usage log" for you
to run into this problem.
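
If you don't actually need the usage data, trimming it would look
roughly like this (dates are just an example):

# radosgw-admin usage trim --start-date=2018-01-01 --end-date=2019-10-01

The sharding options I mean are probably rgw_usage_max_shards and
rgw_usage_max_user_shards, but double-check the names.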


Paul
-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Tue, Oct 15, 2019 at 12:15 AM Troy Ablan  wrote:
>
> Paul,
>
> Apparently never.  Appears to (potentially) have every request from the
> beginning of time (late last year, in my case).  In our use case, we
> don't really need this data (not multi-tenant), so I might simply clear it.
>
> But in the case where this were an extremely high transaction cluster
> where we did care about historical data for longer, can this be spread
> or sharded across more than just this small handful of objects?
>
> Thanks for the insight.
>
> -Troy
>
>
> On 10/14/19 3:01 PM, Paul Emmerich wrote:
> > Looks like the usage log (radosgw-admin usage show), how often do you trim 
> > it?
> >
> >
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] default.rgw.log contains large omap object

2019-10-14 Thread Paul Emmerich
Looks like the usage log (radosgw-admin usage show), how often do you trim it?


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Mon, Oct 14, 2019 at 11:55 PM Troy Ablan  wrote:
>
> Hi folks,
>
> Mimic cluster here, RGW pool with only default zone.  I have a
> persistent error here
>
> LARGE_OMAP_OBJECTS 1 large omap objects
>
>  1 large objects found in pool 'default.rgw.log'
>
>  Search the cluster log for 'Large omap object found' for more
> details.
>
> I think I've narrowed it down to this namespace:
>
> -[~:#]- rados -p default.rgw.log --namespace usage ls
> usage.17
> usage.19
> usage.24
>
> -[~:#]- rados -p default.rgw.log --namespace usage listomapkeys usage.17 | wc -l
> 21284
>
> -[~:#]- rados -p default.rgw.log --namespace usage listomapkeys usage.19 | wc -l
> 1355968
>
> -[~:#]- rados -p default.rgw.log --namespace usage listomapkeys usage.24 | wc -l
> 0
>
>
> Now, this last one takes a long time to return -- minutes, even with a 0
> response, and listomapvals indeed returns a very large amount of data.
> I'm doing `wc -c` on listomapvals but this hasn't returned at the time
> of writing this message.  Is there anything I can do about this?
>
> Whenever PGs on the default.rgw.log are recovering or backfilling, my
> RGW cluster appears to block writes for almost two hours, and I think it
> points to this object, or at least this pool.
>
> I've been having trouble finding any documentation about how this log
> pool is used by RGW.  I have a feeling updating this object happens on
> every write to the cluster.  How would I remove this bottleneck?  Can I?
>
> Thanks
>
> -Troy
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] problem returning mon back to cluster

2019-10-14 Thread Paul Emmerich
How big is the mon's DB? As in, just the total size of the directory you copied.
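Something like this should show it (adjust the path to your setup):

# du -sh /var/lib/ceph/mon/*/store.db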

FWIW I recently had to perform mon surgery on a 14.2.4 (or was it
14.2.2?) cluster with 8 GB mon size and I encountered no such problems
while syncing a new mon which took 10 minutes or so.

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Mon, Oct 14, 2019 at 9:41 PM Nikola Ciprich
 wrote:
>
> On Mon, Oct 14, 2019 at 04:31:22PM +0200, Nikola Ciprich wrote:
> > On Mon, Oct 14, 2019 at 01:40:19PM +0200, Harald Staub wrote:
> > > Probably same problem here. When I try to add another MON, "ceph
> > > health" becomes mostly unresponsive. One of the existing ceph-mon
> > > processes uses 100% CPU for several minutes. Tried it on 2 test
> > > clusters (14.2.4, 3 MONs, 5 storage nodes with around 2 hdd osds
> > > each). To avoid errors like "lease timeout", I temporarily increase
> > > "mon lease", from 5 to 50 seconds.
> > >
> > > Not sure how bad it is from a customer PoV. But it is a problem by
> > > itself to be several minutes without "ceph health", when there is an
> > > increased risk of losing the quorum ...
> >
> > Hi Harry,
> >
> > thanks a lot for your reply! not sure we're experiencing the same issue,
> > i don't have it on any other cluster.. when this is happening to you, does
> > only ceph health stop working, or it also blocks all clients IO?
> >
> > BR
> >
> > nik
> >
> >
> > >
> > >  Harry
> > >
> > > On 13.10.19 20:26, Nikola Ciprich wrote:
> > > >dear ceph users and developers,
> > > >
> > > >on one of our production clusters, we got into pretty unpleasant 
> > > >situation.
> > > >
> > > >After rebooting one of the nodes, when trying to start monitor, whole 
> > > >cluster
> > > >seems to hang, including IO, ceph -s etc. When this mon is stopped again,
> > > >everything seems to continue. Traying to spawn new monitor leads to the 
> > > >same problem
> > > >(even on different node).
> > > >
> > > >I had to give up after minutes of outage, since it's unacceptable. I 
> > > >think we had this
> > > >problem once in the past on this cluster, but after some (but much 
> > > >shorter) time, monitor
> > > >joined and it worked fine since then.
> > > >
> > > >All cluster nodes are centos 7 machines, I have 3 monitors (so 2 are now 
> > > >running), I'm
> > > >using ceph 13.2.6
> > > >
> > > >Network connection seems to be fine.
> > > >
> > > >Anyone seen similar problem? I'd be very grateful for tips on how to 
> > > >debug and solve this..
> > > >
> > > >for those interested, here's log of one of running monitors with 
> > > >debug_mon set to 10/10:
> > > >
> > > >https://storage.lbox.cz/public/d258d0
> > > >
> > > >if I could provide more info, please let me know
> > > >
> > > >with best regards
> > > >
> > > >nikola ciprich
>
> just to add quick update, I was able to reproduce the issue by transferring 
> monitor
> directories to test environmen with same IP adressing, so I can safely play 
> with that
> now..
>
> increasing lease timeout didn't help me to fix the problem,
> but at least I seem to be able to use ceph -s now.
>
> few things I noticed in the meantime:
>
> - when I start problematic monitor, monitor slow ops start to appear for
> quorum leader and the count is slowly increasing:
>
> 44 slow ops, oldest one blocked for 130 sec, mon.nodev1c has slow 
> ops
>
> - removing and recreating monitor didn't help
>
> - checking mon_status of problematic monitor shows it remains in the 
> "synchronizing" state
>
> I tried increasing debug_ms and debug_paxos but didn't see anything usefull 
> there..
>
> will report further when I got something. I anyone has any idea in the 
> meantime, please
> let me know.
>
> BR
>
> nik
>
>
>
>
> --
> -
> Ing. Nikola CIPRICH
> LinuxBox.cz, s.r.o.
> 28. rijna 168, 709 00 Ostrava
>
> tel.:   +420 591 166 214
> fax:+420 596 621 273
> mobil:  +420 777 093 799
>
> www.linuxbox.cz
>
> mobil servis: +420 737 238 656
> email servis: ser...@linuxbox.cz
> -
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred

2019-10-10 Thread Paul Emmerich
I've also encountered this issue on a cluster yesterday; one CPU got
stuck in an infinite loop in get_obj_data::flush and it stopped
serving requests. I've updated the tracker issue accordingly.
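
In case someone wants to check for the same symptom, something like

# perf top -p "$(pgrep -d, radosgw)"

should show the busy threads spinning in get_obj_data::flush (the pgrep
call is just one way to grab the PIDs).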


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Aug 21, 2019 at 3:55 PM Vladimir Brik
 wrote:
>
> Hello
>
> I am running a Ceph 14.2.1 cluster with 3 rados gateways. Periodically,
> radosgw process on those machines starts consuming 100% of 5 CPU cores
> for days at a time, even though the machine is not being used for data
> transfers (nothing in radosgw logs, couple of KB/s of network).
>
> This situation can affect any number of our rados gateways, lasts from
> few hours to few days and stops if radosgw process is restarted or on
> its own.
>
> Does anybody have an idea what might be going on or how to debug it? I
> don't see anything obvious in the logs. Perf top is saying that CPU is
> consumed by radosgw shared object in symbol get_obj_data::flush, which,
> if I interpret things correctly, is called from a symbol with a long
> name that contains the substring "boost9intrusive9list_impl"
>
> This is our configuration:
> rgw_frontends = civetweb num_threads=5000 port=443s
> ssl_certificate=/etc/ceph/rgw.crt
> error_log_file=/var/log/ceph/civetweb.error.log
>
> (error log file doesn't exist)
>
>
> Thanks,
>
> Vlad
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs 1 large omap objects

2019-10-08 Thread Paul Emmerich
Hi,

the default for this warning changed recently (see other similar
threads on the mailing list); it was 2 million before 14.2.3.

I don't think the new default of 200k is a good choice, so increasing
it is a reasonable work-around.

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Mon, Oct 7, 2019 at 3:37 AM Nigel Williams
 wrote:
>
> I've adjusted the threshold:
>
> ceph config set osd osd_deep_scrub_large_omap_object_key_threshold 35
>
> Colleague suggested that this will take effect on the next deep-scrub.
>
> Is the default of 200,000 too small? will this be adjusted in future
> releases or is it meant to be adjusted in some use-cases?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rgw: multisite support

2019-10-03 Thread Paul Emmerich
On Thu, Oct 3, 2019 at 12:03 PM M Ranga Swami Reddy
 wrote:
>
> Below url says: "Switching from a standalone deployment to a multi-site 
> replicated deployment is not supported.
> https://docs.openstack.org/project-deploy-guide/charm-deployment-guide/latest/app-rgw-multisite.html

This is wrong; it might be a weird OpenStack-specific restriction.

Migrating single-site to multi-site is trivial: you just add the second site (rough outline below).
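
Rough outline (names/endpoints are placeholders, see the multisite docs
for the full procedure, including the system user for sync):

# on the existing cluster
radosgw-admin realm create --rgw-realm=myrealm --default
radosgw-admin zonegroup rename --rgw-zonegroup default --zonegroup-new-name=us
radosgw-admin zone rename --rgw-zone default --zone-new-name=us-east --rgw-zonegroup=us
radosgw-admin zonegroup modify --rgw-zonegroup=us --endpoints=http://rgw1:80 --master --default
radosgw-admin zone modify --rgw-zone=us-east --endpoints=http://rgw1:80 --master --default
radosgw-admin period update --commit
# then on the second cluster: realm pull / period pull, create the
# secondary zone, and commit the period again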


Paul

>
> Please advise.
>
>
> On Thu, Oct 3, 2019 at 3:28 PM M Ranga Swami Reddy  
> wrote:
>>
>> Hi,
>> Iam using the 2 ceph clusters in diff DCs (away by 500 KM) with ceph 12.2.11 
>> version.
>> Now, I want to setup rgw multisite using the above 2 ceph clusters.
>>
>> is it possible? if yes, please share good document to do the same.
>>
>> Thanks
>> Swami
>
> ___
> Dev mailing list -- d...@ceph.io
> To unsubscribe send an email to dev-le...@ceph.io
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Nautilus minor versions archive

2019-10-01 Thread Paul Emmerich
There's virtually no difference between 14.2.3 and 14.2.4; it's only a
bug fix for running ceph-volume without a terminal on stderr in some
environments.

Unrelated: Do you really want to run the RDMA messenger in production?
I wouldn't dare if the system is in any way critical.

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Tue, Oct 1, 2019 at 11:48 PM Volodymyr Litovka  wrote:
>
> It's quite pity. We tested solution Ceph 14.2.3 + RDMA in lab, got it
> working, but at the moment of moving it to the production there was
> 14.2.4 and we got few hours outage trying to launch.
>
> On 02.10.2019 00:28, Paul Emmerich wrote:
> > On Tue, Oct 1, 2019 at 11:21 PM Volodymyr Litovka  wrote:
> >> Dear colleagues,
> >>
> >> is archive of previos ceph minor versions  is available?
> > no, because reprepro doesn't support that
> >
> > (yeah, that's clearly a solvable problem, but the current workflow
> > just uses reprepro and one repository)
> >
> > Paul
> >
> >
> >> To be more specific - I'm looking for 14.2.1 or 14.2.2 for Ubuntu, but
> >> @download.ceph.com only 14.2.4 is available.
> >>
> >> Thank you.
> >>
> >> --
> >> Volodymyr Litovka
> >> "Vision without Execution is Hallucination." -- Thomas Edison
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> --
> Volodymyr Litovka
>"Vision without Execution is Hallucination." -- Thomas Edison
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Nautilus minor versions archive

2019-10-01 Thread Paul Emmerich
On Tue, Oct 1, 2019 at 11:21 PM Volodymyr Litovka  wrote:
>
> Dear colleagues,
>
> is archive of previos ceph minor versions  is available?

no, because reprepro doesn't support that

(yeah, that's clearly a solvable problem, but the current workflow
just uses reprepro and one repository)

Paul


>
> To be more specific - I'm looking for 14.2.1 or 14.2.2 for Ubuntu, but
> @download.ceph.com only 14.2.4 is available.
>
> Thank you.
>
> --
> Volodymyr Litovka
>"Vision without Execution is Hallucination." -- Thomas Edison
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Commit and Apply latency on nautilus

2019-09-30 Thread Paul Emmerich
BTW: commit and apply latency are the exact same thing since
BlueStore, so don't bother looking at both.

In fact, you should mostly be looking at the op_*_latency counters.
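
Those are in the OSD perf counters; avgtime there is in seconds, e.g.:

# ceph daemon osd.0 perf dump | jq '.osd | {op_r_latency, op_w_latency}'

(jq is optional, it just trims the output.)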


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Mon, Sep 30, 2019 at 8:46 PM Sasha Litvak
 wrote:
>
> In my case, I am using premade Prometheus sourced dashboards in grafana.
>
> For individual latency, the query looks like that
>
>  irate(ceph_osd_op_r_latency_sum{ceph_daemon=~"$osd"}[1m]) / on (ceph_daemon) 
> irate(ceph_osd_op_r_latency_count[1m])
> irate(ceph_osd_op_w_latency_sum{ceph_daemon=~"$osd"}[1m]) / on (ceph_daemon) 
> irate(ceph_osd_op_w_latency_count[1m])
>
> The other ones use
>
> ceph_osd_commit_latency_ms
> ceph_osd_apply_latency_ms
>
> and graph the distribution of it over time
>
> Also, average OSD op latency
>
> avg(rate(ceph_osd_op_r_latency_sum{cluster="$cluster"}[5m]) / 
> rate(ceph_osd_op_r_latency_count{cluster="$cluster"}[5m]) >= 0)
> avg(rate(ceph_osd_op_w_latency_sum{cluster="$cluster"}[5m]) / 
> rate(ceph_osd_op_w_latency_count{cluster="$cluster"}[5m]) >= 0)
>
> Average OSD apply + commit latency
> avg(ceph_osd_apply_latency_ms{cluster="$cluster"})
> avg(ceph_osd_commit_latency_ms{cluster="$cluster"})
>
>
> On Mon, Sep 30, 2019 at 11:13 AM Marc Roos  wrote:
>>
>>
>> What parameters are you exactly using? I want to do a similar test on
>> luminous, before I upgrade to Nautilus. I have quite a lot (74+)
>>
>> type_instance=Osd.opBeforeDequeueOpLat
>> type_instance=Osd.opBeforeQueueOpLat
>> type_instance=Osd.opLatency
>> type_instance=Osd.opPrepareLatency
>> type_instance=Osd.opProcessLatency
>> type_instance=Osd.opRLatency
>> type_instance=Osd.opRPrepareLatency
>> type_instance=Osd.opRProcessLatency
>> type_instance=Osd.opRwLatency
>> type_instance=Osd.opRwPrepareLatency
>> type_instance=Osd.opRwProcessLatency
>> type_instance=Osd.opWLatency
>> type_instance=Osd.opWPrepareLatency
>> type_instance=Osd.opWProcessLatency
>> type_instance=Osd.subopLatency
>> type_instance=Osd.subopWLatency
>> ...
>> ...
>>
>>
>>
>>
>>
>> -Original Message-
>> From: Alex Litvak [mailto:alexander.v.lit...@gmail.com]
>> Sent: zondag 29 september 2019 13:06
>> To: ceph-users@lists.ceph.com
>> Cc: ceph-de...@vger.kernel.org
>> Subject: [ceph-users] Commit and Apply latency on nautilus
>>
>> Hello everyone,
>>
>> I am running a number of parallel benchmark tests against the cluster
>> that should be ready to go to production.
>> I enabled prometheus to monitor various information and while cluster
>> stays healthy through the tests with no errors or slow requests,
>> I noticed an apply / commit latency jumping between 40 - 600 ms on
>> multiple SSDs.  At the same time op_read and op_write are on average
>> below 0.25 ms in the worth case scenario.
>>
>> I am running nautilus 14.2.2, all bluestore, no separate NVME devices
>> for WAL/DB, 6 SSDs per node(Dell PowerEdge R440) with all drives Seagate
>> Nytro 1551, osd spread across 6 nodes, running in
>> containers.  Each node has plenty of RAM with utilization ~ 25 GB during
>> the benchmark runs.
>>
>> Here are benchmarks being run from 6 client systems in parallel,
>> repeating the test for each block size in <4k,16k,128k,4M>.
>>
>> On rbd mapped partition local to each client:
>>
>> fio --name=randrw --ioengine=libaio --iodepth=4 --rw=randrw
>> --bs=<4k,16k,128k,4M> --direct=1 --size=2G --numjobs=8 --runtime=300
>> --group_reporting --time_based --rwmixread=70
>>
>> On mounted cephfs volume with each client storing test file(s) in own
>> sub-directory:
>>
>> fio --name=randrw --ioengine=libaio --iodepth=4 --rw=randrw
>> --bs=<4k,16k,128k,4M> --direct=1 --size=2G --numjobs=8 --runtime=300
>> --group_reporting --time_based --rwmixread=70
>>
>> dbench -t 30 30
>>
>> Could you please let me know if huge jump in applied and committed
>> latency is justified in my case and whether I can do anything to improve
>> / fix it.  Below is some additional cluster info.
>>
>> Thank you,
>>
>> root@storage2n2-la:~# podman exec -it ceph-mon-storage2n2-la ceph osd df
>> ID CLASS WEIGHT  REWEIGHT SIZERAW USE DATAOMAPMETA AVAIL
>>   %USE VAR  PGS STATUS
>>   6   ssd 1.74609  1.0 1.7 T

Re: [ceph-users] Nautilus Ceph Status Pools & Usage

2019-09-30 Thread Paul Emmerich
It's just a display bug in ceph -s:

https://tracker.ceph.com/issues/40011

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Sun, Sep 29, 2019 at 4:41 PM Lazuardi Nasution
 wrote:
>
> Hi,
>
> I'm starting with Nautilus and do create and delete some pools. When I
> check with "ceph status" I find something weird with "pools" number
> when tall pools have been deleted. I the meaning of "pools" number
> different than Luminous? As there is no pool and PG, why there is
> usage on "ceph status"?
>
> Best regards,
>
>   cluster:
> id: e53af8e4-8ef7-48ad-ae4e-3d0486ba0d72
> health: HEALTH_OK
>
>   services:
> mon: 3 daemons, quorum c08-ctrl,c09-ctrl,c10-ctrl (age 3m)
> mgr: c08-ctrl(active, since 7d), standbys: c09-ctrl, c10-ctrl
> osd: 88 osds: 88 up (since 7d), 88 in (since 7d)
>
>   data:
> pools:   7 pools, 0 pgs
> objects: 0 objects, 0 B
> usage:   1.6 TiB used, 262 TiB / 264 TiB avail
> pgs:
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD Object Size for BlueStore OSD

2019-09-30 Thread Paul Emmerich
It's sometimes faster if you reduce the object size, but I wouldn't go
below 1 MB. It depends on your hardware and use case; 4 MB is a very good
default, though.
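
If you want to experiment, the object size is a per-image setting at
creation time, e.g. (pool/image names are just examples):

# rbd create testpool/testimg --size 100G --object-size 1M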


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Mon, Sep 30, 2019 at 6:44 AM Lazuardi Nasution
 wrote:
>
> Hi,
>
> Is 4MB default RBD object size still relevant for BlueStore OSD? Any 
> guideline for best RBD object size for BlueStore OSD especially on high 
> performance media (SSD, NVME)?
>
> Best regards,
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs: apache locks up after parallel reloads on multiple nodes

2019-09-12 Thread Paul Emmerich
Yeah, CephFS is much closer to POSIX semantics for a filesystem than
NFS. There's an experimental relaxed mode called LazyIO but I'm not
sure if it's applicable here.

You can debug this by dumping slow requests from the MDS servers via
the admin socket.
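
For example (daemon name is a placeholder):

# ceph daemon mds.<name> dump_ops_in_flight
# ceph daemon mds.<name> dump_historic_ops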


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Thu, Sep 12, 2019 at 5:07 PM Stefan Kooman  wrote:
>
> Dear list,
>
> We recently switched the shared storage for our linux shared hosting
> platforms from "nfs" to "cephfs". Performance improvement are
> noticeable. It all works fine, however, there is one peculiar thing:
> when Apache reloads after a logrotate of the "error" logs all but one
> node will hang for ~ 15 minutes. The log rotates are scheduled with a
> cron, the nodes themselves synced with ntp. The first node that reloads
> apache will keep on working, all the others will hang, and after a
> period of ~ 15 minutes they will all recover almost simultaneously.
>
> Our setup looks like this: 10 webservers all sharing the same cephfs
> filesystem. Each webserver with around 100 apache threads has around
> 10.000 open file handles to "error" logs on cephfs. To be clear, all
> webservers have a file handle on _the same_ "error" logs. The logrotate
> takes around two seconds on the "surviving" node.
>
> What could be the reason for this? Does it have something to do with
> file locking, i.e. that it behaves differently on cephfs compared to nfs
> (more strict)? What would be a good way to find out what is the root
> cause? We have sysdig traces of different nodes, but on the nodes where
> apache hangs not a lot is going on ... until it all recovers.
>
> We remediated this by delaying the Apache reloads on all but one node.
> Then there is no issue at all, even as all the other web servers still
> reload almost at the same time.
>
> Any info / hints on how to investigate this issue further are highly
> appreciated.
>
> Gr. Stefan
>
> --
> | BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
> | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] units of metrics

2019-09-12 Thread Paul Emmerich
We use a custom script to collect these metrics in croit
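
The math is the same regardless of the collector, though: for a latency
counter, "sum" is total seconds and "avgcount" is the number of samples,
so the average over a collection interval is

  (sum_t2 - sum_t1) / (avgcount_t2 - avgcount_t1)   [seconds]

and yes, an avgtime of 0.000328972 is ~0.33 ms.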

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Thu, Sep 12, 2019 at 5:00 PM Stefan Kooman  wrote:
>
> Hi Paul,
>
> Quoting Paul Emmerich (paul.emmer...@croit.io):
> > https://static.croit.io/ceph-training-examples/ceph-training-example-admin-socket.pdf
>
> Thanks for the link. So, what tool do you use to gather the metrics? We
> are using telegraf module of the Ceph manager. However, this module only
> provides "sum" and not "avgtime" so I can't do the calculations. The
> influx and zabbix mgr modules also only provide "sum". The only metrics
> module that *does* send "avgtime" is the prometheus module:
>
> ceph_mds_reply_latency_sum
> ceph_mds_reply_latency_count
>
> All modules use "self.get_all_perf_counters()" though:
>
> ~/git/ceph/src/pybind/mgr/ > grep -Ri get_all_perf_counters *
> dashboard/controllers/perf_counters.py:return 
> mgr.get_all_perf_counters()
> diskprediction_cloud/agent/metrics/ceph_mon_osd.py:perf_data = 
> obj_api.module.get_all_perf_counters(services=('mon', 'osd'))
> influx/module.py:for daemon, counters in 
> six.iteritems(self.get_all_perf_counters()):
> mgr_module.py:def get_all_perf_counters(self, prio_limit=PRIO_USEFUL,
> prometheus/module.py:for daemon, counters in 
> self.get_all_perf_counters().items():
> restful/api/perf.py:counters = 
> context.instance.get_all_perf_counters()
> telegraf/module.py:for daemon, counters in 
> six.iteritems(self.get_all_perf_counters())
>
> Besides the *ceph* telegraf module we also use the ceph plugin for
> telegraf ... but that plugin does not (yet?) provide mds metrics though.
> Ideally we would *only* use the ceph mgr telegraf module to collect *all
> the things*.
>
> Not sure what's the difference in python code between the modules that could 
> explain this.
>
> Gr. Stefan
>
> --
> | BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
> | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph for "home lab" / hobbyist use?

2019-09-07 Thread Paul Emmerich
On Sat, Sep 7, 2019 at 12:55 AM William Ferrell  wrote:
>
> On Fri, Sep 6, 2019 at 6:37 PM Peter Woodman  wrote:
> >
> > 2GB ram is gonna be really tight, probably.
>
> Bummer. So it won't be enough for any Ceph component to run reliably?

2 GB is tough for an OSD. The main problem is recovery. If you attempt
this, the trick is to use fewer PGs per OSD than recommended.
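
Capping the OSD memory also helps on boxes that small, e.g. (value is
just an example, ~1 GiB):

# ceph config set osd osd_memory_target 1073741824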

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

>
> Drat. The nice thing about the HC2 is the fact that it can power the
> attached SATA disk and itself through one barrel connector and
> includes a stackable case to mount the disk and act as a heat sink.
> Would have been so easy to stack those up and use fewer cables/power
> adapters to do it. I wish I could convince the Hardkernel guys to drop
> a new model with this form factor that uses their 64-bit 4GB chipset.
>
> > However, I do something similar at home with a bunch of rock64 4gb boards, 
> > and it works well.
>
> Nice! Would you mind describing your setup in more detail? It's
> encouraging to hear this works on arm (I wasn't sure if that was still
> supported). I'm also curious how you manage all the power supplies,
> cables and disks. Assuming you're using external 3.5" USB 3 disks that
> require their own power, you're already at 8 power adapters for just 4
> nodes.
>
> > There are sometimes issues with the released ARM packages (frequently crc32 
> > doesn;'t work, which isn't great), so you may have to build your own on the 
> > board you're targeting or on something like scaleway, YMMV.
>
> Thanks, that's good to know.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] regurlary 'no space left on device' when deleting on cephfs

2019-09-06 Thread Paul Emmerich
Yeah, an ENOSPC error code on deletion is a little bit unintuitive,
but what it means is: the purge queue is full.
You've already told the MDS to purge faster.

Not sure how to tell it to increase the maximum backlog for
deletes/purges, but you should be able to find something with
the search term "purge queue". :)
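
To see how far behind it actually is, the purge queue has its own perf
counters (daemon name is a placeholder):

# ceph daemon mds.<name> perf dump purge_queue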


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Sep 6, 2019 at 4:00 PM Stefan Kooman  wrote:
>
> Quoting Kenneth Waegeman (kenneth.waege...@ugent.be):
> > The cluster is healthy at this moment, and we have certainly enough space
> > (see also osd df below)
>
> It's not well balanced though ... do you use ceph balancer (with
> balancer in upmap mode)?
>
> Gr. Stefan
>
>
> --
> | BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
> | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] units of metrics

2019-09-04 Thread Paul Emmerich
https://static.croit.io/ceph-training-examples/ceph-training-example-admin-socket.pdf

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Sep 4, 2019 at 5:38 PM Stefan Kooman  wrote:
>
> Hi,
>
> Just wondering, what are the units of the metrics logged by "perf dump"?
> For example mds perf dump:
>
> "reply_latency": {
> "avgcount": 560013520,
> "sum": 184229.305658729,
> "avgtime": 0.000328972
>
>
> is the 'avgtime' in seconds, with "avgtime": 0.000328972 representing
> 0.328972 ms?
>
> As far as I can see the logs collected by the telegraf manager plugin
> only sends "sum". So how would I calculate the average reply latency for
> mds requests?
>
> Thanks,
>
> Gr. Stefan
>
> --
> | BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
> | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs full, 2/3 Raw capacity used

2019-08-26 Thread Paul Emmerich
The balancer is unfortunately not that good when you have large k+m in
erasure coding profiles and relatively few servers; some manual
balancing will be required.


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Mon, Aug 26, 2019 at 12:33 PM Simon Oosthoek
 wrote:
>
> On 26-08-19 12:00, EDH - Manuel Rios Fernandez wrote:
> > Balancer just balance in Healthy mode.
> >
> > The problem is that data is distributed without be balanced in their first
> > write, that cause unproperly data balanced across osd.
>
> I suppose the crush algorithm doesn't take the fullness of the osds into
> account when placing objects...
>
> >
> > This problem only happens in CEPH, we are the same with 14.2.2, having to
> > change the weight manually.Because the balancer is a passive element of the
> > cluster.
> >
> > I hope in next version we get a more aggressive balancer, like enterprises
> > storages that allow to fill up 95% storage (raw).
>
> I'm thinking a cronjob with a script to parse the output of `ceph osd df
> tree` and reweight according to the percentage used would be relatively
> easy to write. But I'll concentrate on monitoring before I start
> tweaking there ;-)
>
> Cheers
>
> /Simon
>
> >
> > Regards
> >
> >
> > -Mensaje original-
> > De: ceph-users  En nombre de Simon
> > Oosthoek
> > Enviado el: lunes, 26 de agosto de 2019 11:52
> > Para: Dan van der Ster 
> > CC: ceph-users 
> > Asunto: Re: [ceph-users] cephfs full, 2/3 Raw capacity used
> >
> > On 26-08-19 11:37, Dan van der Ster wrote:
> >> Thanks. The version and balancer config look good.
> >>
> >> So you can try `ceph osd reweight osd.10 0.8` to see if it helps to
> >> get you out of this.
> >
> > I've done this and the next fullest 3 osds. This will take some time to
> > recover, I'll let you know when it's done.
> >
> > Thanks,
> >
> > /simon
> >
> >>
> >> -- dan
> >>
> >> On Mon, Aug 26, 2019 at 11:35 AM Simon Oosthoek
> >>  wrote:
> >>>
> >>> On 26-08-19 11:16, Dan van der Ster wrote:
> >>>> Hi,
> >>>>
> >>>> Which version of ceph are you using? Which balancer mode?
> >>>
> >>> Nautilus (14.2.2), balancer is in upmap mode.
> >>>
> >>>> The balancer score isn't a percent-error or anything humanly usable.
> >>>> `ceph osd df tree` can better show you exactly which osds are
> >>>> over/under utilized and by how much.
> >>>>
> >>>
> >>> Aha, I ran this and sorted on the %full column:
> >>>
> >>> 81   hdd   10.81149  1.0  11 TiB 5.2 TiB 5.1 TiB   4 KiB  14 GiB
> >>> 5.6 TiB 48.40 0.73  96 up osd.81
> >>> 48   hdd   10.81149  1.0  11 TiB 5.3 TiB 5.2 TiB  15 KiB  14 GiB
> >>> 5.5 TiB 49.08 0.74  95 up osd.48
> >>> 154   hdd   10.81149  1.0  11 TiB 5.5 TiB 5.4 TiB 2.6 GiB  15 GiB
> >>> 5.3 TiB 50.95 0.76  96 up osd.154
> >>> 129   hdd   10.81149  1.0  11 TiB 5.5 TiB 5.4 TiB 5.1 GiB  16 GiB
> >>> 5.3 TiB 51.33 0.77  96 up osd.129
> >>> 42   hdd   10.81149  1.0  11 TiB 5.6 TiB 5.5 TiB 2.6 GiB  14 GiB
> >>> 5.2 TiB 51.81 0.78  96 up osd.42
> >>> 122   hdd   10.81149  1.0  11 TiB 5.7 TiB 5.6 TiB  16 KiB  14 GiB
> >>> 5.1 TiB 52.47 0.79  96 up osd.122
> >>> 120   hdd   10.81149  1.0  11 TiB 5.7 TiB 5.6 TiB 2.6 GiB  15 GiB
> >>> 5.1 TiB 52.92 0.79  95 up osd.120
> >>> 96   hdd   10.81149  1.0  11 TiB 5.8 TiB 5.7 TiB 2.6 GiB  15 GiB
> >>> 5.0 TiB 53.58 0.80  96 up osd.96
> >>> 26   hdd   10.81149  1.0  11 TiB 5.8 TiB 5.7 TiB  20 KiB  15 GiB
> >>> 5.0 TiB 53.68 0.80  97 up osd.26
> >>> ...
> >>>  6   hdd   10.81149  1.0  11 TiB 8.3 TiB 8.2 TiB  88 KiB  18 GiB
> >>> 2.5 TiB 77.14 1.16  96 up osd.6
> >>> 16   hdd   10.81149  1.0  11 TiB 8.4 TiB 8.3 TiB  28 KiB  18 GiB
> >>> 2.4 TiB 77.56 1.16  95 up osd.16
> >>>  0   hdd   10.81149  1.0  11 TiB 8.6 TiB 8.4 TiB  48 KiB  17 GiB
> >>> 2.2 TiB 79.24 

Re: [ceph-users] BlueStore.cc: 11208: ceph_abort_msg("unexpected error")

2019-08-23 Thread Paul Emmerich
I've seen that before (but never on Nautilus); there's already an
issue at tracker.ceph.com, but I don't recall the id or title.


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Aug 23, 2019 at 1:47 PM Lars Täuber  wrote:
>
> Hi Paul,
>
> a result of fgrep is attached.
> Can you do something with it?
>
> I can't read it. Maybe this is the relevant part:
> " bluestore(/var/lib/ceph/osd/first-16) _txc_add_transaction error (39) 
> Directory not empty not handled on operation 21 (op 1, counting from 0)"
>
> Later I tried it again and the osd is working again.
>
> It feels like I hit a bug!?
>
> Huge thanks for your help.
>
> Cheers,
> Lars
>
> Fri, 23 Aug 2019 13:36:00 +0200
> Paul Emmerich  ==> Lars Täuber  :
> > Filter the log for "7f266bdc9700" which is the id of the crashed
> > thread, it should contain more information on the transaction that
> > caused the crash.
> >
> >
> > Paul
> >
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] BlueStore.cc: 11208: ceph_abort_msg("unexpected error")

2019-08-23 Thread Paul Emmerich
Filter the log for "7f266bdc9700", which is the id of the crashed
thread; it should contain more information on the transaction that
caused the crash.
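
e.g. something like (adjust the log path to your cluster/OSD):

# grep 7f266bdc9700 /var/log/ceph/*osd.16.log | less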


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Aug 23, 2019 at 9:29 AM Lars Täuber  wrote:
>
> Hi there!
>
> In our testcluster is an osd that won't start anymore.
>
> Here is a short part of the log:
>
> -1> 2019-08-23 08:56:13.316 7f266bdc9700 -1 
> /tmp/release/Debian/WORKDIR/ceph-14.2.2/src/os/bluestore/BlueStore.cc: In 
> function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, 
> ObjectStore::Transaction*)' thread 7f266bdc9700 time 2019-08-23 
> 08:56:13.318938
> /tmp/release/Debian/WORKDIR/ceph-14.2.2/src/os/bluestore/BlueStore.cc: 11208: 
> ceph_abort_msg("unexpected error")
>
>  ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus 
> (stable)
>  1: (ceph::__ceph_abort(char const*, int, char const*, 
> std::__cxx11::basic_string, std::allocator 
> > const&)+0xdf) [0x564406ac153a]
>  2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, 
> ObjectStore::Transaction*)+0x2830) [0x5644070e48d0]
>  3: 
> (BlueStore::queue_transactions(boost::intrusive_ptr&,
>  std::vector std::allocator >&, boost::intrusive_ptr, 
> ThreadPool::TPHandle*)+0x42a) [0x5644070ec33a]
>  4: 
> (ObjectStore::queue_transaction(boost::intrusive_ptr&,
>  ObjectStore::Transaction&&, boost::intrusive_ptr, 
> ThreadPool::TPHandle*)+0x7f) [0x564406cd620f]
>  5: (PG::_delete_some(ObjectStore::Transaction*)+0x945) [0x564406d32d85]
>  6: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x71) 
> [0x564406d337d1]
>  7: (boost::statechart::simple_state PG::RecoveryState::ToDelete, boost::mpl::list mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
> mpl_::na, mpl_::na, mpl_::na>, 
> (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base 
> const&, void const*)+0x109) [0x564406d81ec9]
>  8: (boost::statechart::state_machine PG::RecoveryState::Initial, std::allocator, 
> boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base
>  const&)+0x6b) [0x564406d4e7cb]
>  9: (PG::do_peering_event(std::shared_ptr, 
> PG::RecoveryCtx*)+0x2af) [0x564406d3f39f]
>  10: (OSD::dequeue_peering_evt(OSDShard*, PG*, 
> std::shared_ptr, ThreadPool::TPHandle&)+0x1b4) 
> [0x564406c7e644]
>  11: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, 
> ThreadPool::TPHandle&)+0xc4) [0x564406c7e8c4]
>  12: (OSD::ShardedOpWQ::_process(unsigned int, 
> ceph::heartbeat_handle_d*)+0x7d7) [0x564406c72667]
>  13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b4) 
> [0x56440724f7d4]
>  14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5644072521d0]
>  15: (()+0x7fa3) [0x7f26862f6fa3]
>  16: (clone()+0x3f) [0x7f2685ea64cf]
>
>
> The log is so huge that I don't know which part may be of interest. The cite 
> is the part I think is most useful.
> Is there anybody able to read and explain this?
>
>
> Thanks in advance,
> Lars
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred

2019-08-21 Thread Paul Emmerich
On Wed, Aug 21, 2019 at 3:55 PM Vladimir Brik
 wrote:
>
> Hello
>
> I am running a Ceph 14.2.1 cluster with 3 rados gateways. Periodically,
> radosgw process on those machines starts consuming 100% of 5 CPU cores
> for days at a time, even though the machine is not being used for data
> transfers (nothing in radosgw logs, couple of KB/s of network).
>
> This situation can affect any number of our rados gateways, lasts from
> few hours to few days and stops if radosgw process is restarted or on
> its own.
>
> Does anybody have an idea what might be going on or how to debug it? I
> don't see anything obvious in the logs. Perf top is saying that CPU is
> consumed by radosgw shared object in symbol get_obj_data::flush, which,
> if I interpret things correctly, is called from a symbol with a long
> name that contains the substring "boost9intrusive9list_impl"
>
> This is our configuration:
> rgw_frontends = civetweb num_threads=5000 port=443s
> ssl_certificate=/etc/ceph/rgw.crt
> error_log_file=/var/log/ceph/civetweb.error.log

Probably unrelated to your problem, but running with lots of threads
is usually an indicator that the async beast frontend would be a
better fit for your setup.
(But the code you see in perf should not be related to the frontend)
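
For reference, a beast equivalent of that line would look roughly like
this (untested here; add ssl_private_key if the key isn't bundled in the
same PEM):

rgw_frontends = beast ssl_port=443 ssl_certificate=/etc/ceph/rgw.crt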


Paul

>
> (error log file doesn't exist)
>
>
> Thanks,
>
> Vlad
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mon db change from rocksdb to leveldb

2019-08-21 Thread Paul Emmerich
You can't downgrade from Luminous to Kraken; well, officially at least.

I guess it could maybe somehow work, but you'd need to re-create all
the services. For the mon example: delete a mon, create a new one on
the old version, let it sync, etc.
Still a bad idea.
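
For completeness, re-creating a mon would go roughly like this
(names/paths are placeholders, and again: not recommended for a
downgrade):

# ceph mon remove mon2
# ceph auth get mon. -o /tmp/mon.keyring
# ceph mon getmap -o /tmp/monmap
# ceph-mon -i mon2 --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
# systemctl start ceph-mon@mon2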

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Aug 21, 2019 at 1:37 PM nokia ceph  wrote:
>
> Hi Team,
>
> One of our old customer had Kraken and they are going to upgrade to Luminous 
> . In the process they also requesting for downgrade procedure.
> Kraken used leveldb for ceph-mon data , from luminous it changed to rocksdb , 
> upgrade works without any issues.
>
> When we downgrade , the ceph-mon does not start and the mon kv_backend not 
> moving away from rocksdb .
>
> After downgrade , when kv_backend is rocksdb following error thrown by 
> ceph-mon , trying to load data from rocksdb and end up in this error,
>
> 2019-08-21 11:22:45.200188 7f1a0406f7c0  4 rocksdb: Recovered from manifest 
> file:/var/lib/ceph/mon/ceph-cn1/store.db/MANIFEST-000716 
> succeeded,manifest_file_number is 716, next_file_number is 718, last_sequence 
> is 311614, log_number is 0,prev_log_number is 0,max_column_family is 0
>
> 2019-08-21 11:22:45.200198 7f1a0406f7c0  4 rocksdb: Column family [default] 
> (ID 0), log number is 715
>
> 2019-08-21 11:22:45.200247 7f1a0406f7c0  4 rocksdb: EVENT_LOG_v1 
> {"time_micros": 1566386565200240, "job": 1, "event": "recovery_started", 
> "log_files": [717]}
> 2019-08-21 11:22:45.200252 7f1a0406f7c0  4 rocksdb: Recovering log #717 mode 2
> 2019-08-21 11:22:45.200282 7f1a0406f7c0  4 rocksdb: Creating manifest 719
>
> 2019-08-21 11:22:45.201222 7f1a0406f7c0  4 rocksdb: EVENT_LOG_v1 
> {"time_micros": 1566386565201218, "job": 1, "event": "recovery_finished"}
> 2019-08-21 11:22:45.202582 7f1a0406f7c0  4 rocksdb: DB pointer 0x55d4dacf
> 2019-08-21 11:22:45.202726 7f1a0406f7c0 -1 ERROR: on disk data includes 
> unsupported features: compat={},rocompat={},incompat={9=luminous ondisk 
> layout}
> 2019-08-21 11:22:45.202735 7f1a0406f7c0 -1 error checking features: (1) 
> Operation not permitted
>
> We changed the kv_backend file inside /var/lib/ceph/mon/ceph-cn1 to leveldb 
> and ceph-mon failed with following error,
>
> 2019-08-21 11:24:07.922978 7fc5a25de7c0 -1 WARNING: the following dangerous 
> and experimental features are enabled: bluestore,rocksdb
> 2019-08-21 11:24:07.922983 7fc5a25de7c0  0 set uid:gid to 167:167 (ceph:ceph)
> 2019-08-21 11:24:07.923009 7fc5a25de7c0  0 ceph version 11.2.0 
> (f223e27eeb35991352ebc1f67423d4ebc252adb7), process ceph-mon, pid 3509050
> 2019-08-21 11:24:07.923050 7fc5a25de7c0  0 pidfile_write: ignore empty 
> --pid-file
> 2019-08-21 11:24:07.944867 7fc5a25de7c0 -1 WARNING: the following dangerous 
> and experimental features are enabled: bluestore,rocksdb
> 2019-08-21 11:24:07.950304 7fc5a25de7c0  0 load: jerasure load: lrc load: isa
> 2019-08-21 11:24:07.950563 7fc5a25de7c0 -1 error opening mon data directory 
> at '/var/lib/ceph/mon/ceph-cn1': (22) Invalid argument
>
> Is there any possibility to toggle ceph-mon db between leveldb and rocksdb?
> Tried to add mon_keyvaluedb = leveldb and filestore_omap_backend = leveldb in 
> ceph.conf also not worked.
> thanks,
> Muthu
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDSs report damaged metadata - "return_code": -116

2019-08-19 Thread Paul Emmerich
Hi,

That error just says that the path is wrong. Off the top of my head, I
unfortunately don't know the correct way to instruct it to scrub a
stray path; you can always run a recursive scrub on / to go over
everything, though.
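
e.g. (if I recall the flag syntax correctly, scrub options are
comma-separated):

# ceph tell mds.mds3 scrub start / recursive,repair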


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Mon, Aug 19, 2019 at 12:55 PM Lars Täuber  wrote:
>
> Hi all!
>
> Where can I look up what the error number means?
> Or did I something wrong in my command line?
>
> Thanks in advance,
> Lars
>
> Fri, 16 Aug 2019 13:31:38 +0200
> Lars Täuber  ==> Paul Emmerich  :
> > Hi Paul,
> >
> > thank you for your help. But I get the following error:
> >
> > # ceph tell mds.mds3 scrub start 
> > "~mds0/stray7/15161f7/dovecot.index.backup" repair
> > 2019-08-16 13:29:40.208 7f7e927fc700  0 client.881878 ms_handle_reset on 
> > v2:192.168.16.23:6800/176704036
> > 2019-08-16 13:29:40.240 7f7e937fe700  0 client.867786 ms_handle_reset on 
> > v2:192.168.16.23:6800/176704036
> > {
> > "return_code": -116
> > }
> >
> >
> >
> > Lars
> >
> >
> > Fri, 16 Aug 2019 13:17:08 +0200
> > Paul Emmerich  ==> Lars Täuber  :
> > > Hi,
> > >
> > > damage_type backtrace is rather harmless and can indeed be repaired
> > > with the repair command, but it's called scrub_path.
> > > Also you need to pass the name and not the rank of the MDS as id, it 
> > > should be
> > >
> > > # (on the server where the MDS is actually running)
> > > ceph daemon mds.mds3 scrub_path ...
> > >
> > > But you should also be able to use ceph tell since nautilus which is a
> > > little bit easier because it can be run from any node:
> > >
> > > ceph tell mds.mds3 scrub start 'PATH' repair
> > >
> > >
> > > Paul
> > >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Correct number of pg

2019-08-19 Thread Paul Emmerich
On Mon, Aug 19, 2019 at 10:51 AM Jake Grimmett  wrote:
>
> Dear All,
>
> We have a new Nautilus cluster, used for cephfs, with pg_autoscaler in
> warn mode.
>
> Shortly after hitting 62% full, the autoscaler started warning that we
> have too few pg:
>
> *
> Pool ec82pool has 4096 placement groups, should have 16384
> *
>
> The pool is 62% full, we have 450 OSD, and are using 8 k=8 m=2 Erasure
> encoding.
>
> Does 16384 pg seem reasonable?

No, that would be a horrible value for a cluster of that size; 4096 is
perfect here.
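
Rough math: 4096 PGs x 10 shards (k=8+m=2) / 450 OSDs is ~91 PG shards
per OSD, which is right around the usual ~100 target; 16384 PGs would
put you near 364 per OSD.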


Paul

>
> The on-line pg calculator suggests 4096...
>
> https://ceph.io/pgcalc/
>
> (Size = 10, OSD=450, %Data=100, Target OSD 100)
>
> many thanks,
>
> Jake
>
> --
> MRC Laboratory of Molecular Biology
> Francis Crick Avenue,
> Cambridge CB2 0QH, UK.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] WAL/DB size

2019-08-16 Thread Paul Emmerich
Btw, the original discussion leading to the 4% recommendation is here:
https://github.com/ceph/ceph/pull/23210
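
(And regarding the WAL question below: if you only give ceph-volume a DB
device, the WAL ends up on it automatically, e.g. something like

ceph-volume lvm create --data /dev/sdX --block.db /dev/nvme0n1pY

so there's no separate WAL partition to size.)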


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Thu, Aug 15, 2019 at 11:23 AM Виталий Филиппов  wrote:
>
> 30gb already includes WAL, see 
> http://yourcmc.ru/wiki/Ceph_performance#About_block.db_sizing
>
> 15 августа 2019 г. 1:15:58 GMT+03:00, Anthony D'Atri  
> пишет:
>>
>> Good points in both posts, but I think there’s still some unclarity.
>>
>> Absolutely let’s talk about DB and WAL together.  By “bluestore goes on 
>> flash” I assume you mean WAL+DB?
>>
>> “Simply allocate DB and WAL will appear there automatically”
>>
>> Forgive me please if this is obvious, but I’d like to see a holistic 
>> explanation of WAL and DB sizing *together*, which I think would help folks 
>> put these concepts together and plan deployments with some sense of 
>> confidence.
>>
>> We’ve seen good explanations on the list of why only specific DB sizes, say 
>> 30GB, are actually used _for the DB_.
>> If the WAL goes along with the DB, shouldn’t we also explicitly determine an 
>> appropriate size N for the WAL, and make the partition (30+N) GB?
>> If so, how do we derive N?  Or is it a constant?
>>
>> Filestore was so much simpler, 10GB set+forget for the journal.  Not that I 
>> miss XFS, mind you.
>>
>>
>>>> Actually standalone WAL is required when you have either very small fast
>>>> device (and don't want db to use it) or three devices (different in
>>>> performance) behind OSD (e.g. hdd, ssd, nvme). So WAL is to be located
>>>> at the fastest one.
>>>>
>>>> For the given use case you just have HDD and NVMe and DB and WAL can
>>>> safely collocate. Which means you don't need to allocate specific volume
>>>> for WAL. Hence no need to answer the question how many space is needed
>>>> for WAL. Simply allocate DB and WAL will appear there automatically.
>>>>
>>>>
>>> Yes, i'm surprised how often people talk about the DB and WAL separately
>>> for no good reason.  In common setups bluestore goes on flash and the
>>> storage goes on the HDDs, simple.
>>>
>>> In the event flash is 100s of GB and would be wasted, is there anything
>>> that needs to be done to set rocksdb to use the highest level?  600 I
>>> believe
>>
>> 
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> --
> With best regards,
> Vitaliy Filippov
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDSs report damaged metadata

2019-08-16 Thread Paul Emmerich
Hi,

damage_type backtrace is rather harmless and can indeed be repaired
with the repair command, but it's called scrub_path.
Also, you need to pass the name and not the rank of the MDS as the id; it

# (on the server where the MDS is actually running)
ceph daemon mds.mds3 scrub_path ...

But you should also be able to use ceph tell since nautilus which is a
little bit easier because it can be run from any node:

ceph tell mds.mds3 scrub start 'PATH' repair


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Aug 16, 2019 at 8:40 AM Lars Täuber  wrote:
>
> Hi all!
>
> The mds of our ceph cluster produces a health_err state.
> It is a nautilus 14.2.2 on debian buster installed from the repo made by 
> croit.io with osds on bluestore.
>
> The symptom:
> # ceph -s
>   cluster:
> health: HEALTH_ERR
> 1 MDSs report damaged metadata
>
>   services:
> mon: 3 daemons, quorum mon1,mon2,mon3 (age 2d)
> mgr: mon3(active, since 2d), standbys: mon2, mon1
> mds: cephfs_1:1 {0=mds3=up:active} 2 up:standby
> osd: 30 osds: 30 up (since 17h), 29 in (since 19h)
>
>   data:
> pools:   3 pools, 1153 pgs
> objects: 435.21k objects, 806 GiB
> usage:   4.7 TiB used, 162 TiB / 167 TiB avail
> pgs: 1153 active+clean
>
>
> # ceph health detail
> HEALTH_ERR 1 MDSs report damaged metadata
> MDS_DAMAGE 1 MDSs report damaged metadata
> mdsmds3(mds.0): Metadata damage detected
>
> #ceph tell mds.0 damage ls
> 2019-08-16 07:20:09.415 7f1254ff9700  0 client.840758 ms_handle_reset on 
> v2:192.168.16.23:6800/176704036
> 2019-08-16 07:20:09.431 7f1255ffb700  0 client.840764 ms_handle_reset on 
> v2:192.168.16.23:6800/176704036
> [
> {
> "damage_type": "backtrace",
> "id": 3760765989,
> "ino": 1099518115802,
> "path": "~mds0/stray7/15161f7/dovecot.index.backup"
> }
> ]
>
>
>
> I tried this without much luck:
> # ceph daemon mds.0 "~mds0/stray7/15161f7/dovecot.index.backup" recursive 
> repair
> admin_socket: exception getting command descriptions: [Errno 2] No such file 
> or directory
>
>
> Is there a way out of this error?
>
> Thanks and best regards,
> Lars
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sudden loss of all SSD OSDs in a cluster, immedaite abort on restart [Mimic 13.2.6]

2019-08-14 Thread Paul Emmerich
Starting point to debug/fix this would be to extract the osdmap from
one of the dead OSDs:

ceph-objectstore-tool --op get-osdmap --data-path /var/lib/ceph/osd/...

Then try to run osdmaptool on that osdmap to see if it also crashes;
set some --debug options (I don't know which ones off the top of my
head).
Does it also crash? How does it differ from the map retrieved with
"ceph osd getmap"?

You can also set the osdmap with "--op set-osdmap"; does it help to
set the osdmap retrieved by "ceph osd getmap"?
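
Roughly, with the OSD stopped (paths/IDs are placeholders):

# ceph osd getmap -o /tmp/osdmap
# ceph-objectstore-tool --op set-osdmap --data-path /var/lib/ceph/osd/ceph-46 --file /tmp/osdmap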

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Aug 14, 2019 at 7:59 AM Troy Ablan  wrote:
>
> I've opened a tracker issue at https://tracker.ceph.com/issues/41240
>
> Background: Cluster of 13 hosts, 5 of which contain 14 SSD OSDs between
> them.  409 HDDs in as well.
>
> The SSDs contain the RGW index and log pools, and some smaller pools
> The HDDs contain all other pools, including the RGW data pool
>
> The RGW instance contains just over 1 billion objects across about 65k
> buckets.  I don't know of any action on the cluster that would have
> caused this.  There have been no changes to the crush map in months, but
> HDDs were added a couple weeks ago and backfilling is still in progress
> but in the home stretch.
>
> I don't know what I can do at this point, though something points to the
> osdmap on these being wrong and/or corrupted?  Log excerpt from crash
> included below.  All of the OSD logs I checked look very similar.
>
>
>
>
> 2019-08-13 18:09:52.913 7f76484e9d80  4 rocksdb:
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm
> /el7/BUILD/ceph-13.2.6/src/rocksdb/db/version_set.cc:3362] Recovered
> from manifest file:db/MANIFEST-245361 succeeded,manifest_file_number is
> 245361, next_file_number is 245364, last_sequence is 606668564
> 6, log_number is 0,prev_log_number is 0,max_column_family is
> 0,deleted_log_number is 245359
>
> 2019-08-13 18:09:52.913 7f76484e9d80  4 rocksdb:
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm
> /el7/BUILD/ceph-13.2.6/src/rocksdb/db/version_set.cc:3370] Column family
> [default] (ID 0), log number is 245360
>
> 2019-08-13 18:09:52.918 7f76484e9d80  4 rocksdb: EVENT_LOG_v1
> {"time_micros": 1565719792920682, "job": 1, "event": "recovery_started",
> "log_files": [245362]}
> 2019-08-13 18:09:52.918 7f76484e9d80  4 rocksdb:
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm
> /el7/BUILD/ceph-13.2.6/src/rocksdb/db/db_impl_open.cc:551] Recovering
> log #245362 mode 0
> 2019-08-13 18:09:52.919 7f76484e9d80  4 rocksdb:
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm
> /el7/BUILD/ceph-13.2.6/src/rocksdb/db/version_set.cc:2863] Creating
> manifest 245364
>
> 2019-08-13 18:09:52.933 7f76484e9d80  4 rocksdb: EVENT_LOG_v1
> {"time_micros": 1565719792935329, "job": 1, "event": "recovery_finished"}
> 2019-08-13 18:09:52.951 7f76484e9d80  4 rocksdb:
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm
> /el7/BUILD/ceph-13.2.6/src/rocksdb/db/db_impl_open.cc:1218] DB pointer
> 0x56445a6c8000
> 2019-08-13 18:09:52.951 7f76484e9d80  1
> bluestore(/var/lib/ceph/osd/ceph-46) _open_db opened rocksdb path db
> options
> compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=
> 1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152
> 2019-08-13 18:09:52.964 7f76484e9d80  1 freelist init
> 2019-08-13 18:09:52.976 7f76484e9d80  1
> bluestore(/var/lib/ceph/osd/ceph-46) _open_alloc opening allocation metadata
> 2019-08-13 18:09:53.119 7f76484e9d80  1
> bluestore(/var/lib/ceph/osd/ceph-46) _open_alloc loaded 926 GiB in 13292
> extents
> 2019-08-13 18:09:53.133 7f76484e9d80 -1 *** Caught signal (Aborted) **
>   in thread 7f76484e9d80 thread_name:ceph-osd
>
>   ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic
> (stable)
>   1: (()+0xf5d0) [0x7f763c4455d0]
>   2: (gsignal()+0x37) [0x7f763b466207]
>   3: (abort()+0x148) [0x7f763b4678f8]
>   4: (__gnu_cxx::__verbose

Re: [ceph-users] WAL/DB size

2019-08-13 Thread Paul Emmerich
On Tue, Aug 13, 2019 at 10:04 PM Wido den Hollander  wrote:
> I just checked an RGW-only setup. 6TB drive, 58% full, 11.2GB of DB in
> use. No slow db in use.

random rgw-only setup here: 12TB drive, 77% full, 48GB metadata and
10GB omap for index and whatever.

That's 0.5% + 0.1%. And that's a setup that's using mostly erasure
coding and small-ish objects.
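
(Numbers like these can be read per OSD from "ceph osd df tree", which
since Nautilus has separate OMAP and META columns.)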


> I've talked with many people from the community and I don't see an
> agreement for the 4% rule.

agreed, 4% isn't a reasonable default.
I've seen setups with even 10% metadata usage, but these are weird
edge cases with very small objects on NVMe-only setups (obviously
without a separate DB device).

Paul

>
> Wido
>
> >
> > Thank you,
> >
> > Dominic L. Hilsbos, MBA
> > Director – Information Technology
> > Perform Air International Inc.
> > dhils...@performair.com
> > www.PerformAir.com
> >
> >
> >
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> > Wido den Hollander
> > Sent: Tuesday, August 13, 2019 12:51 PM
> > To: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] WAL/DB size
> >
> >
> >
> > On 8/13/19 5:54 PM, Hemant Sonawane wrote:
> >> Hi All,
> >> I have 4 6TB of HDD and 2 450GB SSD and I am going to partition each
> >> disk to 220GB for rock.db. So my question is does it make sense to use
> >> wal for my configuration? if yes then what could be the size of it? help
> >> will be really appreciated.
> >
> > Yes, the WAL needs to be about 1GB in size. That should work in allmost
> > all configurations.
> >
> > 220GB is more then you need for the DB as well. It's doesn't hurt, but
> > it's not needed. For each 6TB drive you need about ~60GB of space for
> > the DB.
> >
> > Wido
> >
> >> --
> >> Thanks and Regards,
> >>
> >> Hemant Sonawane
> >>
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New CRUSH device class questions

2019-08-07 Thread Paul Emmerich
On Wed, Aug 7, 2019 at 9:30 AM Robert LeBlanc  wrote:

>> # ceph osd crush rule dump replicated_racks_nvme
>> {
>>  "rule_id": 0,
>>  "rule_name": "replicated_racks_nvme",
>>  "ruleset": 0,
>>  "type": 1,
>>  "min_size": 1,
>>  "max_size": 10,
>>  "steps": [
>>  {
>>  "op": "take",
>>  "item": -44,
>>  "item_name": "default~nvme"<
>>  },
>>  {
>>  "op": "chooseleaf_firstn",
>>  "num": 0,
>>  "type": "rack"
>>  },
>>  {
>>  "op": "emit"
>>  }
>>  ]
>> }
>> ```
>
>
> Yes, our HDD cluster is much like this, but not Luminous, so we created as 
> separate root with SSD OSD for the metadata and set up a CRUSH rule for the 
> metadata pool to be mapped to SSD. I understand that the CRUSH rule should 
> have a `step take default class ssd` which I don't see in your rule unless 
> the `~` in the item_name means device class.

~ is the internal implementation of device classes. Internally it's
still using separate roots; that's how it stays compatible with older
clients that don't know about device classes.
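
A rule like the one dumped above is normally created with something
along the lines of (names taken from that dump):

ceph osd crush rule create-replicated replicated_racks_nvme default rack nvme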

And since it wasn't mentioned here yet: consider upgrading to Nautilus
to benefit from the new and improved accounting for metadata space.
You'll be able to see how much space is used for metadata and quotas
should work properly for metadata usage.


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

>
> Thanks
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New CRUSH device class questions

2019-08-06 Thread Paul Emmerich
On Tue, Aug 6, 2019 at 7:45 PM Robert LeBlanc  wrote:
> We have a 12.2.8 luminous cluster with all NVMe and we want to take some of 
> the NVMe OSDs and allocate them strictly to metadata pools (we have a problem 
> with filling up this cluster and causing lingering metadata problems, and 
> this will guarantee space for metadata operations).

Depending on the workload and metadata size: this might be a bad idea
as it reduces parallelism.


> In the past, we have done this the old-school way of creating a separate 
> root, but I wanted to see if we could leverage the device class function 
> instead.
>
> Right now all our devices show as ssd rather than nvme, but that is the only 
> class in this cluster. None of the device classes were manually set, so is 
> there a reason they were not detected as nvme?

Ceph only distinguishes rotational vs. non-rotational when auto-detecting
device classes and for device-specific configuration options, so NVMe
devices simply end up in the "ssd" class.

>
> Is it possible to add a new device class like 'metadata'?

yes, it's just a string, you can use whatever you want. For example,
we run a few setups that distinguish 2.5" and 3.5" HDDs.
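
A rough sketch (osd.12 and the rule name are placeholders; an existing
class has to be removed before a new one can be set):

ceph osd crush rm-device-class osd.12
ceph osd crush set-device-class metadata osd.12
ceph osd crush rule create-replicated meta_rule default host metadata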


>
> If I set the device class manually, will it be overwritten when the OSD boots 
> up?

not sure about this; we run all of our setups with auto-updating on
start disabled
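
(If I remember correctly that's the osd_class_update_on_start option; a
minimal sketch, assuming the option name is unchanged in your release,
is to set "osd class update on start = false" in the [osd] section of
ceph.conf so an OSD doesn't reassign its device class on every boot.)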

> Is what I'm trying to accomplish better done by the old-school separate root 
> and the osd_crush_location_hook (potentially using a file with a list of 
> partition UUIDs that should be in the metadata pool).?

device classes are the way to go here

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


> Thank you,
> Robert LeBlanc
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore caching oddities, again

2019-08-04 Thread Paul Emmerich
On Sun, Aug 4, 2019 at 3:47 AM Christian Balzer  wrote:

> 2. Bluestore caching still broken
> When writing data with the fios below, it isn't cached on the OSDs.
> Worse, existing cached data that gets overwritten is removed from the
> cache, which while of course correct can't be free in terms of allocation
> overhead.
> Why not doing what any sensible person would expect from experience with
> any other cache there is, cache writes in case the data gets read again
> soon and in case of overwrites use existing allocations.

This is by design.
The BlueStore only populates its cache on reads, not on writes. The idea is
that a reasonable application does not read data it just wrote (and if it does
it's already cached at a higher layer like the page cache or a cache on the
hypervisor).
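
If you really want writes to populate the cache there is a knob for it;
a sketch, not a recommendation (it is off by default for a reason):

[osd]
bluestore default buffered write = true

and restart the OSDs afterwards.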



Paul
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Blacklisting during boot storm

2019-08-03 Thread Paul Emmerich
The usual reason for blacklisting RBD clients is breaking an exclusive
lock because the previous owner seemed to have crashed.
Blacklisting the old owner is necessary in case you had a network
partition and not a crash. Note that this is entirely normal and no
reason to worry.
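
For reference, you can inspect and clear the current entries manually,
roughly like this:

ceph osd blacklist ls
ceph osd blacklist rm <address:port/nonce as printed by ls>

but they also expire on their own after a while (an hour by default, if
I remember correctly).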

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Sat, Aug 3, 2019 at 1:15 PM Kees Meijs  wrote:
>
> Hi list,
>
> Yesterday afternoon we experienced a compute node outage in our OpenStack 
> (obviously Ceph backed) cluster.
>
> We tried to (re)start compute instances again as fast as possible, resulting 
> in some KVM/RBD clients getting blacklisted. The problem was spotted very 
> quickly so we could remove the listing at once while the cluster was coping 
> fine with the boot storm.
>
> Question: what can we do to prevent the blacklisting? Or, does it make sense 
> to completely disable the mechanism (doesn't feel right) or maybe configure 
> it differently?
>
> Thanks!
>
> Regards,
> Kees
>
> --
> https://nefos.nl/contact
>
> Nefos IT bv
> Ambachtsweg 25 (industrienummer 4217)
> 5627 BZ Eindhoven
> Nederland
>
> KvK 66494931
>
> Aanwezig op maandag, dinsdag, woensdag en vrijdag
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bluestore write iops calculation

2019-08-02 Thread Paul Emmerich
On Fri, Aug 2, 2019 at 2:51 PM  wrote:
>
> > 1. For 750 object write request , data written directly into data
> > partition and since we use EC 4+1 there will be 5 iops across the
> > cluster for each obejct write . This makes 750 * 5 = 3750 iops
>
> don't forget about the metadata and the deferring of small writes.
> deferred write queue + metadata, then data for each OSD. this is either
> 2 or 3 ops per an OSD. the deferred write queue is in the same RocksDB
> so deferred write queue + metadata should be 1 op, although a slightly
> bigger one (8-12 kb for 4 kb writes). so it's either 3*5*750 or 2*5*750,
> depending on how your final statistics is collected

where "small" means 32kb or smaller arriving at BlueStore, i.e. <= 128kb
writes from the client with 4+1 erasure coding.
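
(That 32kb figure is, as far as I remember, the
bluestore_prefer_deferred_size_hdd default of 32768; writes below that
threshold are deferred through the RocksDB WAL first.)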

Also: please don't do 4+1 erasure coding, see older discussions for details.


Paul

>
> > 2. For 750 attribute request , first it will be written into
> > rocksdb.WAL and then to rocks.db . So , 2 iops per disk for every
> > attribute request . This makes 750*2*5 = 7500 iops inside the cluster.
>
> rocksdb is LSM so it doesn't write to wal then to DB, it just writes to
> WAL then compacts it at some point and merges with L0->L1->L2->...
>
> so in theory without compaction it should be 1*5*750 iops
>
> however, there is a bug that makes bluestore do 2 writes+syncs instead
> of 1 per each journal write (not all the time though). the first write
> is the rocksdb's WAL and the second one is the bluefs's journal. this
> probably adds another 5*750 iops on top of each of (1) and (2).
>
> so 5*((2 or 3)+1+2)*750 = either 18750 or 22500. 18750/120 = 156.25,
> 22500/120 = 187.5
>
> the rest may be compaction or metadata reads if you update some objects.
> or maybe I'm missing something else. however this is already closer to
> your 200 iops :)
>
> --
> Vitaliy Filippov
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adventures with large RGW buckets

2019-08-02 Thread Paul Emmerich
On Thu, Aug 1, 2019 at 10:48 PM Gregory Farnum  wrote:
>
> On Thu, Aug 1, 2019 at 12:06 PM Eric Ivancich  wrote:
> I expect RGW could do this, but unfortunately deleting namespaces at
> the RADOS level is not practical. People keep asking and maybe in some
> future world it will be cheaper, but a namespace is effectively just
> part of the object name (and I don't think it's even the first thing
> they sort by for the key entries in metadata tracking!), so deleting a
> namespace would be equivalent to deleting a snapshot[1] but with the
> extra cost that namespaces can be created arbitrarily on every write
> operation (so our solutions for handling snapshots without it being
> ludicrously expensive wouldn't apply). Deleting a namespace from the
> OSD-side using map updates would require the OSD to iterate through
> just about all the objects they have and examine them for deletion.

Yes, I was thinking of something similar to snapshot deletion. I assumed
that objects were ordered by namespace internally and that listing a
namespace would be efficient.

>
> Is it cheaper than doing over the network? Sure. Is it cheap enough
> we're willing to let a single user request generate that kind of
> cluster IO on an unconstrained interface? Absolutely not.

Agreed, nothing wrong with doing it over the network in this case.


Paul


> -Greg
> [1]: Deleting snapshots is only feasible because every OSD maintains a
> sorted secondary index from snapid->set. This is only
> possible because snapids are issued by the monitors and clients
> cooperate in making sure they can't get reused after being deleted.
> Namespaces are generated by clients and there are no constraints on
> their use, reuse, or relationship to each other. We could maybe work
> around these problems, but it'd be building a fundamentally different
> interface than what namespaces currently are.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph+docker

2019-07-31 Thread Paul Emmerich
https://hub.docker.com/r/ceph/ceph/tags
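
(The tags there follow the release versions, so presumably something
like "docker pull ceph/ceph:v14.2.2" gets you current Nautilus.)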

> Am 01.08.2019 um 04:16 schrieb "zhanrzh...@teamsun.com.cn" 
> :
> 
> Hello Paul,
> Thank you a lot for the reply. Are there other versions of Ceph ready to run on Docker?
> 
> --
> zhanrzh...@teamsun.com.cn
>> Please don't start new deployments with Luminous, it's EOL since last
>> month: https://docs.ceph.com/docs/master/releases/schedule/
>> (but it'll probably still receive critical updates if anything happens
>> as there are lots of deployments out there)
>> 
>> But there's no good reason to not start with Nautilus at the moment.
>> 
>> Paul
>> 
>> --
>> Paul Emmerich
>> 
>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>> 
>> croit GmbH
>> Freseniusstr. 31h
>> 81247 München
>> www.croit.io
>> Tel: +49 89 1896585 90
>> 
>> On Wed, Jul 31, 2019 at 4:04 AM zhanrzh...@teamsun.com.cn
>>  wrote:
>>> 
>>> 
>>> Hello, cephers,
>>> We are researching Ceph (Luminous) and Docker, getting ready to use them in our
>>> future production environment.
>>> Is anyone using Ceph and Docker in production? Any suggestions would
>>> be appreciated.
>>> 
>>> --
>>> zhanrzh...@teamsun.com.cn
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Adventures with large RGW buckets

2019-07-31 Thread Paul Emmerich
Hi,

We are seeing a trend towards rather large RGW S3 buckets lately. We've
worked on several clusters with 100-500 million objects in a single
bucket, and we've been asked about the possibility of buckets with
several billion objects more than once.

From our experience: buckets with tens of millions of objects usually work
just fine with no big problems. Buckets with hundreds of millions of objects
require some attention. Buckets with billions of objects? "How about
indexless buckets?" - "No, we need to list them".


A few stories and some questions:


1. The recommended number of objects per shard is 100k. Why? How was this
default configuration derived?

It doesn't really match my experiences. We know a few clusters running with
larger shards because resharding isn't possible for various reasons at the
moment. They sometimes work better than buckets with lots of shards.

So we've been considering at least doubling that 100k target shard size
for large buckets, which would make the following point far less annoying.



2. Many shards + ordered object listing = lots of IO

Unfortunately telling people to not use ordered listings when they don't really
need them doesn't really work as their software usually just doesn't support
that :(

A listing request for X objects will retrieve up to X objects from each shard
for ordering them. That will lead to quite a lot of traffic between the OSDs
and the radosgw instances, even for relatively innocent simple queries as X
defaults to 1000 usually.

Simple example: just getting the first page of a bucket listing with 4096
shards fetches around 1 GB of data from the OSD to return ~300kb or so to the
S3 client.

I've got two clusters here that are only used for a relatively low-bandwidth
backup use case. However, there are a few buckets with hundreds of millions
of objects that are sometimes being listed by the backup system.

The result is that this cluster has an average read IO of 1-2 GB/s, all going
to the index pool. Not a big deal since that's coming from SSDs and goes over
80 Gbit/s LACP bonds. But it does pose the question about scalability
as the user-visible load created by the S3 clients is quite low.



3. Deleting large buckets

Someone accidentally put 450 million small objects into a bucket and only noticed
when the cluster ran full. The bucket isn't needed, so just delete it and case
closed?

Deleting is unfortunately far slower than adding objects; also,
radosgw-admin leaks memory during deletion:
https://tracker.ceph.com/issues/40700

Increasing --max-concurrent-ios helps with deletion speed (the option does
affect deletion concurrency; the documentation claims it's only for other
specific commands).

Since the deletion is going faster than new data is being added to that cluster
the "solution" was to run the deletion command in a memory-limited cgroup and
restart it automatically after it gets killed due to leaking.
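
Roughly like this (the memory limit, sleep and concurrency values are
arbitrary, and MemoryLimit= vs. MemoryMax= depends on your cgroup
version):

while ! systemd-run --scope -p MemoryLimit=8G -- \
    radosgw-admin bucket rm --bucket=<bucket> --bypass-gc \
    --purge-objects --max-concurrent-ios=128; do
  sleep 60
done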


What could the bucket deletion of the future look like? Would it be possible
to put all objects in buckets into RADOS namespaces and implement some kind
of efficient namespace deletion on the OSD level similar to how pool deletions
are handled at a lower level?



4. Common prefixes could be filtered in the rgw class on the OSD instead
of in radosgw

Consider a bucket with 100 folders with 1000 objects in each and only one shard

/p1/1, /p1/2, ..., /p1/1000, /p2/1, /p2/2, ..., /p2/1000, ... /p100/1000


Now a user wants to list /, aggregating common prefixes on the
delimiter /, and wants up to 1000 results.
So there'll be 100 results returned to the client: the common prefixes
p1 to p100.

How much data will be transferred between the OSDs and radosgw for this request?
How many omap entries does the OSD scan?

radosgw will ask the (single) index object to list the first 1000 objects. It'll
return 1000 objects in a quite unhelpful way: /p1/1, /p1/2, ..., /p1/1000

radosgw will discard 999 of these and detect one common prefix and continue the
iteration at /p1/\xFF to skip the remaining entries in /p1/ if there are any.
The OSD will then return everything in /p2/ in that next request and so on.

So it'll internally list every single object in that bucket. That will
be a problem for large buckets, and having lots of shards doesn't help either.


This shouldn't be too hard to fix: add an option "aggregate prefixes" to the
RGW class method and duplicate the fast-forward logic from radosgw in
cls_rgw. It doesn't even need to change the response type or anything; it
just needs to limit entries in common prefixes to one result.
Is this a good idea or am I missing something?

IO would be reduced by a factor of 100 for that particular pathological
case. I've unfortunately seen a real-world setup that I think hits a case
like that.


Paul

-- 
Paul Emmerich

Looking for help

Re: [ceph-users] ceph+docker

2019-07-31 Thread Paul Emmerich
Please don't start new deployments with Luminous, it's EOL since last
month: https://docs.ceph.com/docs/master/releases/schedule/
(but it'll probably still receive critical updates if anything happens
as there are lots of deployments out there)

But there's no good reason to not start with Nautilus at the moment.

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Jul 31, 2019 at 4:04 AM zhanrzh...@teamsun.com.cn
 wrote:
>
>
> Hello, cephers,
> We are researching Ceph (Luminous) and Docker, getting ready to use them in our
> future production environment.
> Is anyone using Ceph and Docker in production? Any suggestions would be
> appreciated.
>
> --
> zhanrzh...@teamsun.com.cn
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Urgent Help Needed (regarding rbd cache)

2019-07-31 Thread Paul Emmerich
Yes, this is power-failure safe. It behaves exactly the same as a real
disk's write cache.

It's really a question about semantics: what does it mean to write
data? Should the data still be guaranteed to be there after a power
failure?
The answer for most writes is: no, such a guarantee is neither
necessary nor helpful.

If you really want the data to be persisted then you'll have to specify that
during the write (sync or similar commands). It doesn't matter whether you
are using a networked file system with a cache or a real disk with a cache
underneath; the behavior will be the same. Data not flushed out explicitly
or written with a sync flag will be lost after a power outage. But that's
fine, because all reasonable file systems, applications, and operating
systems are designed to handle exactly this case (disk caches aren't a new
thing).
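
For illustration, the relevant librbd settings are (these are the
defaults, if I remember correctly):

[client]
rbd cache = true
rbd cache writethrough until flush = true

The second one keeps the cache in write-through mode until the guest
issues its first flush, precisely to protect old guests that never flush.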


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Jul 31, 2019 at 1:08 PM Janne Johansson  wrote:
>
> On Wed, Jul 31, 2019 at 06:55, Muhammad Junaid  wrote:
>>
>> The question is about RBD Cache in write-back mode using KVM/libvirt. If we 
>> enable this, it uses local KVM Host's RAM as cache for VM's write requests. 
>> And KVM Host immediately responds to VM's OS that data has been written to 
>> Disk (Actually it is still not on OSD's yet). Then how can be it power 
>> failure safe?
>>
> It is not. Nothing is power-failure safe. However you design things, you will 
> always have the possibility of some long write being almost done when the 
> power goes away, and that write (and perhaps others in caches) will be lost. 
> Different filesystems handle losses good or bad, databases running on those 
> filesystems will have their I/O interrupted and not acknowledged which may be 
> not-so-bad or very bad.
>
> The write-caches you have in the KVM guest and this KVM RBD cache will make 
> the guest I/Os faster, at the expense of higher risk of losing data in a 
> power outage, but there will be some kind of roll-back, some kind of 
> fsck/trash somewhere to clean up if a KVM host dies with guests running.
> In 99% of the cases, this is ok, the only lost things are "last lines to this 
> log file" or "those 3 temp files for the program that ran" and in the last 1% 
> you need to pull out your backups like when the physical disks die.
>
> If someone promises "power failure safe" then I think they are misguided. 
> Chances may be super small for bad things to happen, but it will just never 
> be 0%. Also, I think the guest VMs have the possibility to ask kvm-rbd to 
> flush data out, and as such take the "pain" of waiting for real completion 
> when it is actually needed, so that other operation can go fast (and slightly 
> less safe) and the IO that needs harder guarantees can call for flushing and 
> know when data actually is on disk for real.
>
> --
> May the most significant bit of your life be positive.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problems understanding 'ceph-features' output

2019-07-29 Thread Paul Emmerich
yes, that's good enough for "upmap".

Mapping client features to versions is somewhat unreliable by design: not
every new release adds a new feature, some features are backported to older
releases, and kernel clients are a completely independent implementation not
directly mappable to a Ceph release.
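
If all clients show up as luminous or newer, switching to the upmap
balancer is usually just:

ceph osd set-require-min-compat-client luminous
ceph balancer mode upmap
ceph balancer on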


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Mon, Jul 29, 2019 at 2:23 PM Massimo Sgaravatto <
massimo.sgarava...@gmail.com> wrote:

> I have a ceph cluster where mon, osd and mgr are running ceph luminous
>
> If I try running ceph features [*], I see that clients are grouped in 2
> sets:
>
> - the first one appears using luminous with features 0x3ffddff8eea4fffb
> - the second one appears using luminous too, but with
> features 0x3ffddff8eeacfffb
>
> If I try to check which are these clients (I use 'ceph daemon mon.xyz
> sessions' on the 3 mons) I see that the second group includes also some
> Openstack nodes which are actually using Nautilus.
> Is this normal/expected that they appear using luminous as release ?
>
>
> Can the 'ceph features' output be used to understand if I am ready to
> switch to upmap for the balancer ?
> I.e. are 0x3ffddff8eea4fffb and 0x3ffddff8eeacfffb good enough for upmap ?
>
> Thanks, Massimo
>
>
>
> # ceph features
> {
> "mon": {
> "group": {
> "features": "0x3ffddff8eeacfffb",
> "release": "luminous",
> "num": 3
> }
> },
> "osd": {
> "group": {
> "features": "0x3ffddff8eeacfffb",
> "release": "luminous",
> "num": 70
> }
> },
> "client": {
> "group": {
> "features": "0x3ffddff8eea4fffb",
> "release": "luminous",
> "num": 16
> },
> "group": {
> "features": "0x3ffddff8eeacfffb",
> "release": "luminous",
> "num": 74
> }
> }
> }
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Anybody using 4x (size=4) replication?

2019-07-24 Thread Paul Emmerich
We got a few size=4 pools, but most of them are metadata pools paired with
m=3 or m=4 erasure coded pools for the actual data.
Goal is to provide the same availability and durability guarantees for the
metadata as the data.

But we got some older odd setup with replicated size=4 for that reason
(setup predates nautilus so no ec overrides originally).

I'd prefer erasure coding over a size=4 setup for most scenarios nowadays.
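
Purely as a sketch (profile name, k/m values and pg counts are
placeholders):

ceph osd erasure-code-profile set k4m2 k=4 m=2 crush-failure-domain=host
ceph osd pool create mydata 1024 1024 erasure k4m2

plus a small replicated pool for the metadata/omap part, since EC pools
can't store omap data.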


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Wed, Jul 24, 2019 at 9:22 PM Wido den Hollander  wrote:

> Hi,
>
> Is anybody using 4x (size=4, min_size=2) replication with Ceph?
>
> The reason I'm asking is that a customer of mine asked me for a solution
> to prevent a situation which occurred:
>
> A cluster running with size=3 and replication over different racks was
> being upgraded from 13.2.5 to 13.2.6.
>
> During the upgrade, which involved patching the OS as well, they
> rebooted one of the nodes. During that reboot suddenly a node in a
> different rack rebooted. It was unclear why this happened, but the node
> was gone.
>
> While the upgraded node was rebooting and the other node crashed about
> 120 PGs were inactive due to min_size=2
>
> Waiting for the nodes to come back, recovery to finish it took about 15
> minutes before all VMs running inside OpenStack were back again.
>
> As you are upgraded or performing any maintenance with size=3 you can't
> tolerate a failure of a node as that will cause PGs to go inactive.
>
> This made me think about using size=4 and min_size=2 to prevent this
> situation.
>
> This obviously has implications on write latency and cost, but it would
> prevent such a situation.
>
> Is anybody here running a Ceph cluster with size=4 and min_size=2 for
> this reason?
>
> Thank you,
>
> Wido
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to add 100 new OSDs...

2019-07-24 Thread Paul Emmerich
+1 on adding them all at the same time.

All these methods that gradually increase the weight aren't really
necessary in newer releases of Ceph.
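
That is, roughly:

ceph osd set norebalance
ceph osd set nobackfill
(create all the new OSDs and wait for them to peer)
ceph osd unset nobackfill
ceph osd unset norebalance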

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Wed, Jul 24, 2019 at 8:59 PM Reed Dier  wrote:

> Just chiming in to say that this too has been my preferred method for
> adding [large numbers of] OSDs.
>
> Set the norebalance nobackfill flags.
> Create all the OSDs, and verify everything looks good.
> Make sure my max_backfills, recovery_max_active are as expected.
> Make sure everything has peered.
> Unset flags and let it run.
>
> One crush map change, one data movement.
>
> Reed
>
>
> That works, but with newer releases I've been doing this:
>
> - Make sure cluster is HEALTH_OK
> - Set the 'norebalance' flag (and usually nobackfill)
> - Add all the OSDs
> - Wait for the PGs to peer. I usually wait a few minutes
> - Remove the norebalance and nobackfill flag
> - Wait for HEALTH_OK
>
> Wido
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrading and lost OSDs

2019-07-24 Thread Paul Emmerich
On Wed, Jul 24, 2019 at 8:36 PM Peter Eisch 
wrote:

> # lsblk
> NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
> sda 8:0 0 1.7T 0 disk
> ├─sda1 8:1 0 100M 0 part
> ├─sda2 8:2 0 1.7T 0 part
> └─sda5 8:5 0 10M 0 part
> sdb 8:16 0 1.7T 0 disk
> ├─sdb1 8:17 0 100M 0 part
> ├─sdb2 8:18 0 1.7T 0 part
> └─sdb5 8:21 0 10M 0 part
> sdc 8:32 0 1.7T 0 disk
> ├─sdc1 8:33 0 100M 0 part
>

That's ceph-disk, which was removed; run "ceph-volume simple scan".
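
The rough sequence is (I don't recall offhand how well this handles
dmcrypt/lockbox OSDs like yours, so try it on a single host first):

ceph-volume simple scan
ceph-volume simple activate --all

The scan step writes one JSON file per OSD under /etc/ceph/osd/, which
the activate step then uses to mount and start them.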

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90



> ...
> I'm thinking the OSD would start (I can recreate the .service definitions
> in systemctl) if the above were mounted in a way like they are on another
> of my hosts:
> # lsblk
> NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
> sda 8:0 0 1.7T 0 disk
> ├─sda1 8:1 0 100M 0 part
> │ └─97712be4-1234-4acc-8102-2265769053a5 253:17 0 98M 0 crypt
> /var/lib/ceph/osd/ceph-16
> ├─sda2 8:2 0 1.7T 0 part
> │ └─049b7160-1234-4edd-a5dc-fe00faca8d89 253:16 0 1.7T 0 crypt
> └─sda5 8:5 0 10M 0 part
> /var/lib/ceph/osd-lockbox/97712be4-9674-4acc-1234-2265769053a5
> sdb 8:16 0 1.7T 0 disk
> ├─sdb1 8:17 0 100M 0 part
> │ └─f03f0298-1234-42e9-8b28-f3016e44d1e2 253:26 0 98M 0 crypt
> /var/lib/ceph/osd/ceph-17
> ├─sdb2 8:18 0 1.7T 0 part
> │ └─51177019-1234-4963-82d1-5006233f5ab2 253:30 0 1.7T 0 crypt
> └─sdb5 8:21 0 10M 0 part
> /var/lib/ceph/osd-lockbox/f03f0298-1234-42e9-8b28-f3016e44d1e2
> sdc 8:32 0 1.7T 0 disk
> ├─sdc1 8:33 0 100M 0 part
> │ └─0184df0c-1234-404d-92de-cb71b1047abf 253:8 0 98M 0 crypt
> /var/lib/ceph/osd/ceph-18
> ├─sdc2 8:34 0 1.7T 0 part
> │ └─fdad7618-1234-4021-a63e-40d973712e7b 253:13 0 1.7T 0 crypt
> ...
>
> Thank you for your time on this,
>
> peter
>
> Peter Eisch
> Senior Site Reliability Engineer
> T *1.612.659.3228* <1.612.659.3228>
>
> From: Xavier Trilla 
> Date: Wednesday, July 24, 2019 at 1:25 PM
> To: Peter Eisch 
> Cc: "ceph-users@lists.ceph.com" 
> Subject: Re: [ceph-users] Upgrading and lost OSDs
>
> Hi Peter,
>
> Im not sure but maybe after some changes the OSDs are not being
> recongnized by ceph scripts.
>
> Ceph used to use udev to detect the OSDs and then moved to lvm, which kind
> of OSDs are you running? Blustore or filestore? Which version did you use
> to create them?
>
> Cheers!
>
> El 24 jul 2019, a les 20:04, Peter Eisch  peter.ei...@virginpulse.com> va escriure:
> Hi,
>
> I’m working through updating from 12.2.12/luminious to 14.2.2/nautilus on
> centos 7.6. The managers are updated alright:
>
> # ceph -s
>   cluster:
> id: 2fdb5976-1234-4b29-ad9c-1ca74a9466ec
> health: HEALTH_WARN
> Degraded data redundancy: 24177/9555955 objects degraded
> (0.253%), 7 pgs degraded, 1285 pgs undersized
> 3 monitors have not enabled msgr2
>  ...
>
> I updated ceph on a OSD host with 'yum update' and then rebooted to grab
> the current kernel. Along the way, the contents of all the directories in
> /var/lib/ceph/osd/ceph-*/ were deleted. Thus I have 16 OSDs down from this.
> I can manage the undersized but I'd like to get these drives working again
> without deleting each OSD and recreating them.
>
> So far I've pulled the respective cephx key into the 'keyring' file and
> populated 'bluestore' into the '

Re: [ceph-users] Upgrading and lost OSDs

2019-07-24 Thread Paul Emmerich
Did you use ceph-disk before?

Support for ceph-disk was removed; see the Nautilus upgrade instructions.
You'll need to run "ceph-volume simple scan" to convert them to ceph-volume.

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Wed, Jul 24, 2019 at 8:25 PM Xavier Trilla 
wrote:

> Hi Peter,
>
> Im not sure but maybe after some changes the OSDs are not being
> recongnized by ceph scripts.
>
> Ceph used to use udev to detect the OSDs and then moved to lvm, which kind
> of OSDs are you running? Blustore or filestore? Which version did you use
> to create them?
>
>
> Cheers!
>
> El 24 jul 2019, a les 20:04, Peter Eisch  va
> escriure:
>
> Hi,
>
> I’m working through updating from 12.2.12/luminious to 14.2.2/nautilus on
> centos 7.6. The managers are updated alright:
>
> # ceph -s
>   cluster:
> id: 2fdb5976-1234-4b29-ad9c-1ca74a9466ec
> health: HEALTH_WARN
> Degraded data redundancy: 24177/9555955 objects degraded
> (0.253%), 7 pgs degraded, 1285 pgs undersized
> 3 monitors have not enabled msgr2
>  ...
>
> I updated ceph on a OSD host with 'yum update' and then rebooted to grab
> the current kernel. Along the way, the contents of all the directories in
> /var/lib/ceph/osd/ceph-*/ were deleted. Thus I have 16 OSDs down from this.
> I can manage the undersized but I'd like to get these drives working again
> without deleting each OSD and recreating them.
>
> So far I've pulled the respective cephx key into the 'keyring' file and
> populated 'bluestore' into the 'type' files but I'm unsure how to get the
> lockboxes mounted to where I can get the OSDs running. The osd-lockbox
> directory is otherwise untouched from when the OSDs were deployed.
>
> Is there a way to run ceph-deploy or some other tool to rebuild the mounts
> for the drives?
>
> peter
>
> Peter Eisch
> Senior Site Reliability Engineer
> T *1.612.659.3228* <1.612.659.3228>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Questions regarding backing up Ceph

2019-07-24 Thread Paul Emmerich
Note that enabling rbd mirroring means taking a hit on IOPS performance;
just think of it as a 2x overhead, mainly on IOPS.
But it does work very well for disaster recovery scenarios if you can take
the performance hit.
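
Very roughly, journal-based mirroring boils down to (pool, image and
cluster names are placeholders):

rbd mirror pool enable <pool> pool
rbd feature enable <pool>/<image> journaling
rbd mirror pool peer add <pool> client.rbd-mirror@<remote-cluster>

plus an rbd-mirror daemon running on the backup site; the journaling
feature is where the roughly 2x IOPS overhead comes from.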


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Wed, Jul 24, 2019 at 4:18 PM Tobias Gall 
wrote:

> Hallo,
>
> what about RGW Replication:
>
> https://ceph.com/geen-categorie/radosgw-simple-replication-example/
> http://docs.ceph.com/docs/master/radosgw/multisite/
>
> or rdb-mirroring:
>
> http://docs.ceph.com/docs/master/rbd/rbd-mirroring/
>
> Regards,
> Tobias
>
> Am 24.07.19 um 13:37 schrieb Fabian Niepelt:
> > Hello ceph-users,
> >
> > I am currently building a Ceph cluster that will serve as a backend for
> > Openstack and object storage using RGW. The cluster itself is finished
> and
> > integrated with Openstack and virtual machines for testing are being
> deployed.
> > Now I'm a bit stumped on how to effectively backup the Ceph pools.
> > My requirements are two weekly backups, of which one must be offline
> after
> > finishing backing up (systems turned powerless). We are expecting about
> 250TB to
> > 500TB of data for now. The backups must protect against accidental pool
> > deletion/corruption or widespread infection of a cryptovirus. In short:
> Complete
> > data loss in the production Ceph cluster.
> >
> > At the moment, I am facing two issues:
> >
> > 1. For the cinder pool, I looked into creating snapshots using the ceph
> CLI (so
> > they don't turn up in Openstack and cannot be accidentally deleted by
> users) and
> > exporting their diffs. But volumes with snapshots created this way
> cannot be
> > removed from Openstack. Does anyone have an idea how to do this better?
> > Alternatively, I could do a full export each week, but I am not sure if
> that
> > would be fast enough..
> >
> > 2. My search so far has only turned up backing up RBD pools, but how
> could I
> > backup the pools that are used for object storage?
> >
> > Of course, I'm also open to completely other ideas on how to backup Ceph
> and
> > would appreciate hearing how you people are doing your backups.
> >
> > Any help is much appreciated.
> >
> > Greetings
> > Fabian
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
> --
> Tobias Gall
> Facharbeitsgruppe Datenkommunikation
> Universitätsrechenzentrum
>
> Technische Universität Chemnitz
> Straße der Nationen 62 | R. B302A
> 09111 Chemnitz
> Germany
>
> Tel:+49 371 531-33617
> Fax:+49 371 531-833617
>
> tobias.g...@hrz.tu-chemnitz.de
> www.tu-chemnitz.de
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New best practices for osds???

2019-07-22 Thread Paul Emmerich
On Mon, Jul 22, 2019 at 12:52 PM Vitaliy Filippov 
wrote:

> It helps performance,


Not necessarily; I've seen several setups where disabling the cache
increases performance.


Paul


> but it can also lead to data loss if the raid
> controller is crap (not flushing data correctly)
>
> --
> With best regards,
>Vitaliy Filippov
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Iscsi in the nautilus Dashboard

2019-07-22 Thread Paul Emmerich
Version 9 is the fqdn stuff which was introduced in 3.1.
Use 3.2 as 3.1 is pretty broken.

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Mon, Jul 22, 2019 at 3:24 AM Brent Kennedy  wrote:

> I have a test cluster running centos 7.6 setup with two iscsi gateways (
> per the requirement ).  I have the dashboard setup in nautilus ( 14.2.2 )
> and I added the iscsi gateways via the command.  Both show down and when I
> go to the dashboard it states:
>
>
>
> “ Unsupported `ceph-iscsi` config version. Expected 9 but found 8.  “
>
>
>
> Both iscsi gateways were setup from scratch since the latest and greatest
> packages required for ceph iscsi install are not available in the centos
> repositories.  Is 3.0 not considered version 9?  ( did I do something
> wrong? ) Why is it called/detected as version 8 when its version 3?
>
>
>
> I also wondering, the package versions listed as required in the nautilus
> docs(http://docs.ceph.com/docs/nautilus/rbd/iscsi-target-cli/)  state
> x.x.x or NEWER package, but when I try to add a gateway gwcli complains
> about the tcmu-runner and targetcli versions and I have to use the
>
> Skipchecks=true option when adding them.
>
>
>
> Another thing came up, might be doing it wrong as well:
>
> Added a disk, then added the client, then tried to add the auth using the
> auth command and it states: “Failed to update the client's auth: Invalid
> password”
>
>
>
> Actual output:
>
> /iscsi-target...erpix:backup1> auth username=test password=test
>
> CMD: ../hosts/ auth *
>
> username=test, password=test, mutual_username=None, mutual_password=None
>
> CMD: ../hosts/ auth *
>
> auth to be set to username='test', password='test',
> mutual_username='None', mutual_password='None' for
> 'iqn.2019-07.com.somgthing:backup1'
>
> Failed to update the client's auth: Invalid username
>
>
>
> Did I miss something in the setup doc?
>
>
>
> Installed packages:
>
> rtslib:  wget
> https://github.com/open-iscsi/rtslib-fb/archive/v2.1.fb69.tar.gz
>
> target-cli: wget
> https://github.com/open-iscsi/targetcli-fb/archive/v2.1.fb49.tar.gz
>
> tcmu-runner: wget
> https://github.com/open-iscsi/tcmu-runner/archive/v1.4.1.tar.gz
>
> ceph-iscsi: wget https://github.com/ceph/ceph-iscsi/archive/3.0.tar.gz
>
> configshell: wget
> https://github.com/open-iscsi/configshell-fb/archive/v1.1.fb25.tar.gz
>
>
>
> Other bits I installed as part of this:
>
> yum install epel-release python-pip python-devel -y
>
> yum groupinstall "Development Tools" -y
>
> python -m pip install --upgrade pip setuptools wheel
>
> pip install netifaces cryptography flask
>
>
>
>
>
> Any helps or pointer would be greatly appreciated!
>
>
>
> -Brent
>
>
>
> Existing Clusters:
>
> Test: Nautilus 14.2.2 with 3 osd servers, 1 mon/man, 1 gateway, 2 iscsi
> gateways ( all virtual on nvme )
>
> US Production(HDD): Nautilus 14.2.1 with 11 osd servers, 3 mons, 4
> gateways behind haproxy LB
>
> UK Production(HDD): Luminous 12.2.11 with 25 osd servers, 3 mons/man, 3
> gateways behind haproxy LB
>
> US Production(SSD): Luminous 12.2.11 with 6 osd servers, 3 mons/man, 3
> gateways behind haproxy LB
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] memory usage of: radosgw-admin bucket rm

2019-07-21 Thread Paul Emmerich
For mailing list archive readers in the future:

On Tue, Jul 9, 2019 at 1:22 PM Paul Emmerich  wrote:

> Try to add "--inconsistent-index" (caution: will obviously leave your
> bucket in a broken state during the deletion, so don't try to use the
> bucket)
>

this was bad advice; as long as https://tracker.ceph.com/issues/40700 is not
fixed, don't do that.



>
> You can also speed up the deletion with "--max-concurrent-ios" (default
> 32). The documentation incorrectly claims that "--max-concurrent-ios" is
> only for other operations but that's wrong, it is used for most bucket
> operations including deleteion.
>

this, however, is a good idea to speed up deletion of large buckets.

Try to combine the deletion command with timeout or something similar so
it doesn't run into OOM all the time and affect other services
(or use cgroups to limit RAM).


Paul


>
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
>
> On Tue, Jul 9, 2019 at 1:11 PM Harald Staub 
> wrote:
>
>> Currently removing a bucket with a lot of objects:
>> radosgw-admin bucket rm --bucket=$BUCKET --bypass-gc --purge-objects
>>
>> This process was killed by the out-of-memory killer. Then looking at the
>> graphs, we see a continuous increase of memory usage for this process,
>> about +24 GB per day. Removal rate is about 3 M objects per day.
>>
>> It is not the fastest hardware, and this index pool is still without
>> SSDs. The bucket is sharded, 1024 shards. We are on Nautilus 14.2.1, now
>> about 500 OSDs.
>>
>> So with this bucket with 60 M objects, we would need about 480 GB of RAM
>> to come through. Or is there a workaround? Should I open a tracker issue?
>>
>> The killed remove command can just be called again, but it will be
>> killed again before it finishes. Also, it has to run some time until it
>> continues to actually remove objects. This "wait time" is also
>> increasing. Last time, after about 16 M objects already removed, the
>> wait time was nearly 9 hours. Also during this time, there is a memory
>> ramp, but not so steep.
>>
>> BTW it feels strange that the removal of objects is slower (about 3
>> times) than adding objects.
>>
>>   Harry
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multiple OSD crashes

2019-07-19 Thread Paul Emmerich
I've also encountered a crash just like that after upgrading to 14.2.2

Looks like this issue: http://tracker.ceph.com/issues/37282

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Fri, Jul 19, 2019 at 11:36 AM Daniel Aberger - Profihost AG <
d.aber...@profihost.ag> wrote:

> Hello,
>
> we are experiencing crashing OSDs in multiple independent Ceph clusters.
>
> Each OSD has very similar log entries regarding the crash as far as I
> can tell.
>
> Example log: https://pastebin.com/raw/vQ2AJ5ud
>
> I can provide you with more log files. They are too large for pastebin
> and I'm not aware of this mailing list's email attachment policy.
>
> Every log consists of the following entries:
>
> 2019-07-10 21:36:31.903886 7f322aeff700 -1 rocksdb: submit_transaction
> error: Corruption: block checksum mismatch code = 2 Rocksdb transaction:
> Put( Prefix = M key =
> 0x08c1'.461231.000125574325' Value size = 184)
> Put( Prefix = M key = 0x08c1'._fastinfo' Value size = 186)
> Put( Prefix = O key =
>
> 0x7f80015806b4'(!rbd_data.7c012a6b8b4567.004e!='0xfffe6f002f'x'
> Value size = 325)
> Put( Prefix = O key =
>
> 0x7f80015806b4'(!rbd_data.7c012a6b8b4567.004e!='0xfffe'o'
> Value size = 1608)
> Put( Prefix = L key = 0x0226dc7a Value size = 16440)
> 2019-07-10 21:36:31.913113 7f322aeff700 -1
> /build/ceph/src/os/bluestore/BlueStore.cc: In function 'void
> BlueStore::_kv_sync_thread()' thread 7f322aeff700 time 2019-07-10
> 21:36:31.903909
> /build/ceph/src/os/bluestore/BlueStore.cc: 8808: FAILED assert(r == 0)
>
>  ceph version 12.2.12-7-g1321c5e91f
> (1321c5e91f3d5d35dd5aa5a0029a54b9a8ab9498) luminous (stable)
>
>
> Unfortunately I'm unable to interpret the dumps. I hope you can help me
> with this issue.
>
> Regards,
> Daniel
>
>
>
> --
> Mit freundlichen Grüßen
>   Daniel Aberger
> Ihr Profihost Team
>
> ---
> Profihost AG
> Expo Plaza 1
> 30539 Hannover
> Deutschland
>
> Tel.: +49 (511) 5151 8181 | Fax.: +49 (511) 5151 8282
> URL: http://www.profihost.com | E-Mail: i...@profihost.com
>
> Sitz der Gesellschaft: Hannover, USt-IdNr. DE813460827
> Registergericht: Amtsgericht Hannover, Register-Nr.: HRB 202350
> Vorstand: Cristoph Bluhm, Sebastian Bluhm, Stefan Priebe
> Aufsichtsrat: Prof. Dr. iur. Winfried Huck (Vorsitzender)
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Legacy BlueStore stats reporting?

2019-07-19 Thread Paul Emmerich
On Fri, Jul 19, 2019 at 1:47 PM Stig Telfer 
wrote:

>
> On 19 Jul 2019, at 10:01, Konstantin Shalygin  wrote:
>
> Using Ceph-Ansible stable-4.0 I did a rolling update from latest Mimic to 
> Nautilus 14.2.2 on a cluster yesterday, and the update ran to completion 
> successfully.
>
> However, in ceph status I see a warning of the form "Legacy BlueStore stats 
> reporting detected” for all OSDs in the cluster.
>
> Can anyone help me with what has gone wrong, and what should be done to fix 
> it?
>
> I thin you should start to run repair for your OSD's - [1]
>
> [1]
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-July/035889.html
>
> k
>
> Thanks Konstantin - running the BlueStore repair as indicated appears to
> be working.
>
> One difference from Sage’s description of the scenario is that I did not
> explicitly create new OSDs during or after the Nautilus upgrade. Perhaps
> the Ceph-Ansible rolling update script did something to trigger this.
>

the known bug is that stats will show up incorrectly; the warning is
expected after the upgrade

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


>
> Best wishes,
> Stig
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Nautilus:14.2.2 Legacy BlueStore stats reporting detected

2019-07-19 Thread Paul Emmerich
bluestore warn on legacy statfs = false
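
e.g. via "ceph config set global bluestore_warn_on_legacy_statfs false".
Or do the actual conversion per OSD (with the OSD stopped), which should
also make the warning go away:

ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-N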

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Fri, Jul 19, 2019 at 1:35 PM nokia ceph  wrote:

> Hi Team,
>
> After upgrading our cluster from 14.2.1 to 14.2.2 , the cluster moved to
> warning state with following error
>
> cn1.chn6m1c1ru1c1.cdn ~# ceph status
>   cluster:
> id: e9afb5f3-4acf-421a-8ae6-caaf328ef888
> health: HEALTH_WARN
> Legacy BlueStore stats reporting detected on 335 OSD(s)
>
>   services:
> mon: 5 daemons, quorum cn1,cn2,cn3,cn4,cn5 (age 114m)
> mgr: cn4(active, since 2h), standbys: cn3, cn1, cn2, cn5
> osd: 335 osds: 335 up (since 112m), 335 in
>
>   data:
> pools:   1 pools, 8192 pgs
> objects: 129.01M objects, 849 TiB
> usage:   1.1 PiB used, 749 TiB / 1.8 PiB avail
> pgs: 8146 active+clean
>  46   active+clean+scrubbing
>
> Checked the bug list and found that this issue is solved, however it still
> exists,
>
> https://github.com/ceph/ceph/pull/28563
>
> How to disable this warning?
>
> Thanks,
> Muthu
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need to replace OSD. How do I find physical disk

2019-07-18 Thread Paul Emmerich
On Thu, Jul 18, 2019 at 8:10 PM John Petrini  wrote:

> Try ceph-disk list
>

No, this system is running ceph-volume, not ceph-disk, because the
mountpoints are in tmpfs.

ceph-volume lvm list

But it looks like the disk is just completely broken and has disappeared from
the system.
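
For reference, a rough sketch of mapping an OSD id to its backing device while
the disk is still visible to the kernel (the OSD id 7 below is just an example):

  ceph-volume lvm list                  # on the OSD host, lists the /dev device per OSD
  ceph osd metadata 7 | grep devices    # from an admin node

If the drive has already vanished from the bus, as it seems here, these may
only show stale or missing entries.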


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph OSD daemon causes network card issues

2019-07-18 Thread Paul Emmerich
Hi,

Intel 82576 is bad. I've seen quite a few problems with these older igb
family NICs, but losing the PCIe link is a new one.
I usually see them getting stuck with a message like "tx queue X hung,
resetting device..."

Try disabling offloading features using ethtool; that sometimes helps with
the problems I've seen. Maybe this is just another variant of the stuck-queue
problem?
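
A minimal example for one of the ports from the log above (ethtool changes do
not survive a reboot unless made persistent in the distro's network config):

  ethtool -k enp3s0f0                           # show current offload settings
  ethtool -K enp3s0f0 gro off gso off tso off   # turn off the common offloads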


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Thu, Jul 18, 2019 at 12:47 PM Geoffrey Rhodes 
wrote:

> Hi Cephers,
>
> I've been having an issue since upgrading my cluster to Mimic 6 months ago
> (previously installed with Luminous 12.2.1).
> All nodes that have the same PCIe network card seem to lose network
> connectivity randomly. (frequency ranges from a few days to weeks per host
> node)
> The affected nodes only have the Intel 82576 LAN Card in common, different
> motherboards, installed drives, RAM and even PSUs.
> Nodes that have the Intel I350 cards are not affected by the Mimic upgrade.
> Each host node has recommended RAM installed and has between 4 and 6 OSDs
> / sata hard drives installed.
> The cluster operated for over a year (Luminous) without a single issue,
> only after the Mimic upgrade did the issues begin with these nodes.
> The cluster is only used for CephFS (file storage, low intensity usage)
> and makes use of erasure data pool (K=4, M=2).
>
> I've tested many things, different kernel versions, different Ubuntu LTS
> releases, re-installation and even CENTOS 7, different releases of Mimic,
> different igb drivers.
> If I stop the ceph-osd daemons the issue does not occur.  If I swap out
> the Intel 82576 card with the Intel I350 the issue is resolved.
> I haven't any more ideas other than replacing the cards but I feel the
> issue is linked to the ceph-osd daemon and a change in the Mimic release.
> Below are the various software versions and drivers I've tried and a log
> extract from a node that lost network connectivity. - Any help or
> suggestions would be greatly appreciated.
>
> *OS:*  Ubuntu 16.04 / 18.04 and recently CENTOS 7
> *Ceph Version:*Mimic (currently 13.2.6)
> *Network card:*4-PORT 1GB INTEL 82576 LAN CARD (AOC-SG-I4)
> *Driver:  *   igb
> *Driver Versions:* 5.3.0-k / 5.3.5.22s / 5.4.0-k
> *Network Config:* 2 x bonded (LACP) 1GB nic for public net,   2 x
> bonded (LACP) 1GB nic for private net
> *Log errors:*
> Jun 27 12:10:28 cephnode5 kernel: [497346.638608] igb :03:00.0
> enp3s0f0: PCIe link lost, device now detached
> Jun 27 12:10:28 cephnode5 kernel: [497346.686752] igb :04:00.1
> enp4s0f1: PCIe link lost, device now detached
> Jun 27 12:10:29 cephnode5 kernel: [497347.550473] igb :03:00.1
> enp3s0f1: PCIe link lost, device now detached
> Jun 27 12:10:29 cephnode5 kernel: [497347.646785] igb :04:00.0
> enp4s0f0: PCIe link lost, device now detached
> Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793
> 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from
> 10.100.4.1:6809 osd.16 since back 2019-06
> -27 12:10:27.438961 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27
> 12:10:23.796726)
> Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793
> 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from
> 10.100.6.1:6804 osd.20 since back 2019-06
> -27 12:10:27.438961 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27
> 12:10:23.796726)
> Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793
> 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from
> 10.100.7.1:6803 osd.25 since back 2019-06
> -27 12:10:23.338012 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27
> 12:10:23.796726)
> Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793
> 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from
> 10.100.8.1:6803 osd.30 since back 2019-06
> -27 12:10:27.438961 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27
> 12:10:23.796726)
> Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793
> 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from
> 10.100.9.1:6808 osd.43 since back 2019-06
> -27 12:10:23.338012 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27
> 12:10:23.796726)
>
>
> Kind regards
> Geoffrey Rhodes
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Investigating Config Error, 300x reduction in IOPs performance on RGW layer

2019-07-18 Thread Paul Emmerich
On Thu, Jul 18, 2019 at 3:44 AM Robert LeBlanc  wrote:

> I'm pretty new to RGW, but I'm needing to get max performance as well.
> Have you tried moving your RGW metadata pools to nvme? Carve out a bit of
> NVMe space and then pin the pool to the SSD class in CRUSH, that way the
> small metadata ops aren't on slow media.
>

no, don't do that:

1) a performance difference of 130 vs. 48k IOPS is not due to SSD vs. NVMe
for metadata unless the SSD is absolute crap
2) the OSDs already have an NVMe DB device; it's much easier to use that
directly than to partition the NVMes and run separate NVMe-backed OSDs just
for the metadata pools


Assuming your NVMe disks are a reasonable size (30GB per OSD): put the
metadata pools on the HDDs. It's better to have 48 OSDs with 4 NVMes behind
them handling metadata than only 4 OSDs with SSDs.

Running mons in VMs with gigabit network is fine for small clusters and not
a performance problem


How are you benchmarking?
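
For a like-for-like comparison it also helps to point the RADOS-level bench at
the actual RGW data pool; a sketch, assuming the default pool name (write with
--no-cleanup first so the random-read pass has objects to read):

  rados -p default.rgw.buckets.data bench 60 write -b 4096 -t 64 --no-cleanup
  rados -p default.rgw.buckets.data bench 60 rand -t 64
  rados -p default.rgw.buckets.data cleanup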

Paul


> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Wed, Jul 17, 2019 at 5:59 PM Ravi Patel  wrote:
>
>> Hello,
>>
>> We have deployed ceph cluster and we are trying to debug a massive drop
>> in performance between the RADOS layer vs the RGW layer
>>
>> ## Cluster config
>> 4 OSD nodes (12 Drives each, NVME Journals, 1 SSD drive) 40GbE NIC
>> 2 RGW nodes ( DNS RR load balancing) 40GbE NIC
>> 3 MON nodes 1 GbE NIC
>>
>> ## Pool configuration
>> RGW data pool  - replicated 3x 4M stripe (HDD)
>> RGW metadata pool - replicated 3x (SSD) pool
>>
>> ## Benchmarks
>> 4K Read IOP/s performance using RADOS Bench 48,000~ IOP/s
>> 4K Read RGW performance via s3 interface ~ 130 IOP/s
>>
>> Really trying to understand how to debug this issue. all the nodes never
>> break 15% CPU utilization and there is plenty of RAM. The one pathological
>> issue in our cluster is that the MON nodes are currently on VMs that are
>> sitting behind a single 1 GbE NIC. (We are in the process of moving them,
>> but are unsure if that will fix the issue.
>>
>> What metrics should we be looking at to debug the RGW layer. Where do we
>> need to look?
>>
>> ---
>>
>> Ravi Patel, PhD
>> Machine Learning Systems Lead
>> Email: r...@kheironmed.com
>>
>>
>> *Kheiron Medical Technologies*
>>
>> kheironmed.com | supporting radiologists with deep learning
>>
>> Kheiron Medical Technologies Ltd. is a registered company in England and
>> Wales. This e-mail and its attachment(s) are intended for the above named
>> only and are confidential. If they have come to you in error then you must
>> take no action based upon them but contact us immediately. Any disclosure,
>> copying, distribution or any action taken or omitted to be taken in
>> reliance on it is prohibited and may be unlawful. Although this e-mail and
>> its attachments are believed to be free of any virus, it is the
>> responsibility of the recipient to ensure that they are virus free. If you
>> contact us by e-mail then we will store your name and address to facilitate
>> communications. Any statements contained herein are those of the individual
>> and not the organisation.
>>
>> Registered number: 10184103. Registered office: RocketSpace, 40
>> Islington High Street, London, N1 8EQ
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pools limit

2019-07-16 Thread Paul Emmerich
100+ pools work fine if you can get the PG count right (the auto-scaler helps;
there are some options you'll need to tune for small-ish pools).

But it's not a "nice" setup. Have you considered using namespaces instead?
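
A tiny sketch of what per-project separation with namespaces looks like (pool,
namespace and client names are made up; clients can be restricted to their
namespace via OSD caps):

  rados -p projects --namespace=project-a put report.pdf ./report.pdf
  rados -p projects --namespace=project-a ls
  ceph auth get-or-create client.project-a mon 'allow r' \
      osd 'allow rw pool=projects namespace=project-a'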


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Tue, Jul 16, 2019 at 4:17 PM M Ranga Swami Reddy 
wrote:

> Hello - I have created 10 nodes ceph cluster with 14.x version. Can you
> please confirm below:
>
> Q1 - Can I create 100+ pools (or more) on the cluster? (the reason is -
> creating a pool per project). Any limitation on pool creation?
>
> Q2 - In the above pool - I use 128 PG-NUM - to start with and enable
> autoscale for PG_NUM, so that based on the data in the pool, PG_NUM will
> increase by ceph itself.
>
> Let me know if any limitations for the above and any fore see issue?
>
> Thanks
> Swami
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New best practices for osds???

2019-07-16 Thread Paul Emmerich
On Tue, Jul 16, 2019 at 5:43 PM Stolte, Felix 
wrote:

>
> They told us that "best practices" for ceph would be to deploy disks as
> Raid 0 consisting of one disk using a raid controller with a big writeback
> cache.
>
> Since this "best practice" is new to me, I would like to hear your opinion
> on this topic.
>

We've got a few clusters running like that, usually because the raid
controllers don't support anything else.
Sometimes the raid controller caches help with performance in specific use
cases, but they can also hurt performance by doing stupid things.

But best practice would be to just use them in jbod mode or pass-through
mode.
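
For completeness, a minimal sketch of deploying an OSD on such a pass-through
disk (the device name is just an example):

  ceph-volume lvm create --bluestore --data /dev/sdc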


Paul


>
> Regards Felix
>
>
> -
>
> -
> Forschungszentrum Juelich GmbH
> 52425 Juelich
> Sitz der Gesellschaft: Juelich
> Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
> Vorsitzender des Aufsichtsrats: MinDir Dr. Karl Eugen Huthmacher
> Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
> Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
> Prof. Dr. Sebastian M. Schmidt
>
> -
>
> -
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What if etcd is lost

2019-07-16 Thread Paul Emmerich
All that etcd stuff is specific to ceph-docker, which supports storing config
options in etcd. Most of these features are obsolete with the mon config
store nowadays.

To answer the original question: all you need to recover your Ceph cluster
are the mon and OSD disks; they hold all the data needed to get Ceph up
and running.
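
That said, a small sketch of the kind of read-only state exports that are cheap
to keep around regardless of where the rest of your config lives:

  ceph mon getmap -o monmap.bin     # current monmap
  ceph config dump > config.txt     # centralized config options (Mimic and later)
  ceph auth ls > auth.txt           # auth entities and their caps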


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Tue, Jul 16, 2019 at 5:35 PM Janne Johansson  wrote:

> On Mon, 15 Jul 2019 at 23:05, Oscar Segarra  wrote:
>
>> Hi Frank,
>> Thanks a lot for your quick response.
>> Yes, the use case that concerns me is the following:
>> 1.- I bootstrap a complete cluster mons, osds, mgr, mds, nfs, etc using
>> etcd as a key store
>>
>
> as a key store ... for what? Are you stuffing anything ceph-related there?
> If so, please tell us what.
>
> As previously said, ceph has no concept of etcd, so unless you somehow pull
> stuff out of ceph and feed it into etcd, ceph will not care at all if you
> lose etcd data.
>
> --
> May the most significant bit of your life be positive.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


  1   2   3   4   5   >