I tried upgrading my home cluster to 15.2.7 (from 15.2.5) today and it appears
to be entering a loop when trying to match docker images for ceph:v15.2.7:
2020-12-01T16:47:26.761950-0700 mgr.aladdin.liknom [INF] Upgrade: Checking mgr
daemons...
2020-12-01T16:47:26.769581-0700 mgr.aladdin.liknom [
I'm trying to figure out a CRUSH rule that will spread data out across my
cluster as much as possible, but not more than 2 chunks per host.
If I use the default rule with an osd failure domain like this:
step take default
step choose indep 0 type osd
step emit
I get clustering of 3-4 chunks on
1 harrahs
1 mirage
2 mandalaybay
2 paris
...
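For what it's worth, the pattern that's usually suggested for capping chunks per host is to choose hosts first and then at most two OSDs within each host. A sketch in CRUSH rule syntax, assuming an 8-chunk (e.g. 6+2) erasure profile so 4 hosts x 2 OSDs covers it; adjust the host count to match your own k+m:

```
step take default
step choose indep 4 type host
step chooseleaf indep 2 type osd
step emit
```

The outer step picks 4 distinct hosts and the inner step picks 2 OSDs under each, so no host can ever hold more than 2 chunks.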
Hopefully someone else will find this useful.
Bryan
> On May 12, 2021, at 9:58 AM, Bryan Stillwell wrote:
>
> I'm trying to figure out a CRUSH rule that will spread data out across my
> cluster as much as possible,
I'm looking for help in figuring out why cephadm isn't making any progress
after I told it to redeploy an mds daemon with:
ceph orch daemon redeploy mds.cephfs.aladdin.kgokhr ceph/ceph:v15.2.12
The output from 'ceph -W cephadm' just says:
2021-05-14T16:24:46.628084+ mgr.paris.glbvov [INF]
> step choose indep 0 type host

> step chooseleaf indep 1 type osd
> step emit
>
> J.
>
> ‐‐‐ Original Message ‐‐‐
>
> On Wednesday, May 12th, 2021 at 17:58, Bryan Stillwell
> wrote:
>
>> I'm trying to figure out a CRUSH rule that will spread data out across my
[8,17,4,1,14,0,19,8]p8
2021-05-11T22:41:11.332885+
I'm now considering using device classes and assigning the OSDs to either hdd1
or hdd2... Unless someone has another idea?
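If you do go the device-class route, the assignment itself is straightforward. A sketch, where the OSD ids and the hdd1/hdd2 class names are just illustrative:

```shell
# An OSD's device class must be cleared before it can be reassigned
ceph osd crush rm-device-class osd.0 osd.1
ceph osd crush set-device-class hdd1 osd.0 osd.1
# Verify the assignment (shadow trees show the per-class hierarchy)
ceph osd crush tree --show-shadow
```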
Thanks,
Bryan
> On May 14, 2021, at 12:35 PM, Bryan Stillwell wrote:
This morning I tried adding a mon node to my home Ceph cluster with the
following command:
ceph orch daemon add mon ether
This seemed to work at first, but then it decided to remove it fairly quickly,
which broke the cluster because the mon. keyring was also removed:
2021-06-01T14:16:11.523210
ly complete any upgrades after that, which means the global container
image name was never changed.
Bryan
On Jun 1, 2021, at 9:38 AM, Bryan Stillwell wrote:
This morning I tried adding a mon node to my home Ceph cluster with the
following command:
ceph orch
There appears to be arm64 packages built for Ubuntu Bionic, but not for Focal.
Any chance Focal packages can be built as well?
Thanks,
Bryan
> On Jul 8, 2021, at 12:20 PM, David Galloway wrote:
>
I upgraded one of my clusters to v16.2.5 today and now I'm seeing these
messages from 'ceph -W cephadm':
2021-07-08T22:01:55.356953+ mgr.excalibur.kuumco [ERR] Failed to apply
alertmanager spec AlertManagerSpec({'placement': PlacementSpec(count=1),
'service_type': 'alertmanager', 'service_i
Thanks David! This looks good now. :)
> On Jul 8, 2021, at 6:28 PM, David Galloway wrote:
>
> Done!
>
> On 7/8/21 3:51 PM, Bryan Stillwell wrote:
>> There appears to be arm64 packages built for Ubuntu Bionic, but not for
>> Focal. Any chance Focal pa
One of the main limitations of using CephFS is the requirement to reduce the
number of active MDS daemons to one during upgrades. As far as I can tell this
has been a known problem since Luminous (~2017). This issue essentially
requires downtime during upgrades for any CephFS cluster that needs m
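For reference, the reduce-then-restore dance looks roughly like this (the filesystem name and rank count here are placeholders):

```shell
# Before upgrading, shrink to a single active MDS
ceph fs set cephfs max_mds 1
ceph fs status            # wait until only rank 0 is active
# ...upgrade and restart the MDS daemons...
# Afterwards, restore the original number of active ranks
ceph fs set cephfs max_mds 2
```

Any CephFS workload that needs the extra ranks for throughput effectively takes a performance (or availability) hit for the duration.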
The last two days we've experienced a couple short outages shortly after
setting both 'noscrub' and 'nodeep-scrub' on one of our largest Ceph clusters
(~2,200 OSDs). This cluster is running Nautilus (14.2.6) and setting/unsetting
these flags has been done many times in the past without a problem.
I have a cluster running Nautilus where the bucket instance (backups.190) has
gone missing:
# radosgw-admin metadata list bucket | grep 'backups.19[0-1]' | sort
"backups.190",
"backups.191",
# radosgw-admin metadata list bucket.instance | grep 'backups.19[0-1]' | sort
"backups.191:00
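One way to see what the bucket entry still points at is to dump its metadata and compare the instance id against the instance list (bucket name taken from the output above):

```shell
radosgw-admin metadata get bucket:backups.190
# The "bucket_id" in this output should match an entry in:
radosgw-admin metadata list bucket.instance
```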
We've run into a problem on our test cluster this afternoon which is running
Nautilus (14.2.2). It seems that any time PGs move on the cluster (from
marking an OSD down, setting the primary-affinity to 0, or by using the
balancer), a large number of the OSDs in the cluster peg the CPU cores the
Our test cluster is seeing a problem where peering is going incredibly slow
shortly after upgrading it to Nautilus (14.2.2) from Luminous (12.2.12).
From what I can tell it seems to be caused by "wait for new map" taking a long
time. When looking at dump_historic_slow_ops on pretty much any O
On Sep 4, 2019, at 11:55 AM, Guilherme Geronimo wrote:
Hey Bryan,
I suppose all nodes are using jumbo frames (MTU 9000), right?
I would suggest checking the OSD->MON communication.
Can you send the ou
lag
* Taking the fragile OSD out
* Restarting the "fragile" OSDs
* Checking their logs to make sure everything is OK
* Taking off the NOUP flag
* Taking a coffee and waiting till all the data is drained
[]'s
Arthur (aKa Guilherme Geronimo)
On 04/09/2019 15:32, Bryan Stillwell wrote:
We are
I'm wondering if it's possible to enable compression on existing RGW buckets?
The cluster is running Luminous 12.2.12 with FileStore as the backend (no
BlueStore compression then).
We have a cluster that recently started to rapidly fill up with compressible
content (qcow2 images) and I would l
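In case it helps others: RGW compression is normally enabled per placement target in the zone config, and it only applies to objects written after the change (existing objects stay uncompressed unless rewritten). A sketch using the default zone/placement ids; adjust for your setup:

```shell
radosgw-admin zone placement modify \
    --rgw-zone=default \
    --placement-id=default-placement \
    --compression=zlib
# Restart the radosgw daemons to pick up the change
```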
On Oct 29, 2019, at 9:44 AM, Thomas Schneider <74cmo...@gmail.com> wrote:
> in my unhealthy cluster I cannot run several ceph osd command because
> they hang, e.g.
> ceph osd df
> ceph osd pg dump
>
> Also, ceph balancer status hangs.
>
> How can I fix this issue?
Check the status of your ceph-m
3 7f0e16363700 0 mgr[dashboard]
> [29/Oct/2019:17:37:56] ENGINE Error in HTTPServer.tick
> Traceback (most recent call last):
> File
> "/usr/lib/python2.7/dist-packages/cherrypy/wsgiserver/__init__.py", line
> 2021, in start
>self.tick()
> File
> "/usr/li
lass to be used for new object uploads -
> just note that some 'helpful' s3 clients will insert a
> 'x-amz-storage-class: STANDARD' header to requests that don't specify
> one, and the presence of this header will override the user's default
> storage class.
This morning I noticed that on a new cluster the number of PGs for the
default.rgw.buckets.data pool was way too small (just 8 PGs), but when I try to
split the PGs the cluster doesn't do anything:
# ceph osd pool set default.rgw.buckets.data pg_num 16
set pool 13 pg_num to 16
It seems to set t
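A quick way to see whether a split is actually progressing is to compare the pool's current pg_num with the target recorded in the osdmap (pool name/id from the commands above):

```shell
ceph osd pool get default.rgw.buckets.data pg_num
# Nautilus tracks the requested value separately as pg_num_target,
# visible in the pool line of the osdmap dump:
ceph osd dump | grep 'pool 13'
```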
Responding to myself to follow up with what I found.
While going over the release notes for 14.2.3/14.2.4 I found this was a known
problem that has already been fixed. Upgrading the cluster to 14.2.4 fixed the
issue.
Bryan
> On Oct 30, 2019, at 10:33 AM, Bryan Stillwell wrote:
>
Today I tried enabling RGW compression on a Nautilus 14.2.4 test cluster and
found it wasn't doing any compression at all. I figure I must have missed
something in the docs, but I haven't been able to find out what that is and
could use some help.
This is the command I used to enable zlib-base
port to nautilus
> in https://tracker.ceph.com/issues/41981.
>
> On 11/6/19 5:54 PM, Bryan Stillwell wrote:
>> Today I tried enabling RGW compression on a Nautilus 14.2.4 test cluster and
>> found it wasn't doing any compression at all. I figure I must have missed
>> some
Thanks Casey!
Adding the following to my swiftclient put_object call caused it to start
compressing the data:
headers={'x-object-storage-class': 'STANDARD'}
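The same workaround from the command line, using the swift client's header option (the container and object names here are made up):

```shell
swift upload --header "X-Object-Storage-Class: STANDARD" mycontainer myimage.qcow2
```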
I appreciate the help!
Bryan
> On Nov 7, 2019, at 9:26 AM, Casey Bodley wrote:
>
> On 11/7/19 10
With FileStore you can get the number of OSD maps for an OSD by using a simple
find command:
# rpm -q ceph
ceph-12.2.12-0.el7.x86_64
# find /var/lib/ceph/osd/ceph-420/current/meta/ -name 'osdmap*' | wc -l
42486
Does anyone know of an equivalent command that can be used with BlueStore?
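I don't think there's a direct find equivalent for BlueStore, but two approaches that should get at the same number (OSD id is illustrative; the first requires the OSD to be stopped):

```shell
# Offline: list the meta objects and count the osdmaps
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-420 \
    --op meta-list | grep -c osdmap
# Online: derive the count from the map range the OSD holds
ceph daemon osd.420 status   # newest_map - oldest_map + 1
```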
Thanks,
B
There are some bad links to the mailing list subscribe/unsubscribe/archives on
this page that should get updated:
https://ceph.io/resources/
The subscribe/unsubscribe/archives links point to the old lists vger and
lists.ceph.com, and not the new lists on lists.ceph.io:
ceph-devel
subscribe
I've upgraded 7 of our clusters to Nautilus (14.2.4) and noticed that on some
of the clusters (3 out of 7) the OSDs aren't using msgr2 at all. Here's the
output for osd.0 on 2 clusters of each type:
### Cluster 1 (v1 only):
# ceph osd find 0 | jq -r '.addrs'
{
"addrvec": [
{
"type":
18 16:46:05.979 7f917becf700 1 -- 10.0.13.2:0/3084510 learned_addr
learned my addr 10.0.13.2:0/3084510 (peer_addr_for_me v1:10.0.13.2:0/0)
The learned address is v1:10.0.13.2:0/0. What else can I do to figure out why
it's deciding to use the legacy protocol only?
Thanks,
Bryan
> On Nov 15
as to track down, maybe a check should be added before
enabling msgr2 to make sure the require-osd-release is set to nautilus?
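For anyone hitting the same thing, the check and fix look like this (only set the release flag once every OSD is actually running Nautilus):

```shell
ceph osd dump | grep require_osd_release
ceph osd require-osd-release nautilus
# OSDs bind msgr2 after a restart; verify with:
ceph osd find 0 | jq -r '.addrs'
```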
Bryan
> On Nov 18, 2019, at 5:41 PM, Bryan Stillwell wrote:
>
> I cranked up debug_ms to 20 on two of these clusters today and I'm still not
> underst
> On Tue, Nov 19, 2019 at 8:42 PM Bryan Stillwell
> wrote:
>>
>> Closing the loop here. I
On multiple clusters we are seeing the mgr hang frequently when the balancer is
enabled. It seems that the balancer is getting caught in some kind of infinite
loop which chews up all the CPU for the mgr which causes problems with other
modules like prometheus (we don't have the devicehealth mod
e of a solution yet so I'll stick with disabled balancer
> for now since the current pg placement is fine.
>
> Regards,
> Eugen
>
>
> [1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg56994.html
> [2] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg5
On Nov 18, 2019, at 8:12 AM, Dan van der Ster wrote:
>
> On Fri, Nov 15, 2019 at 4:45 PM Joao Eduardo Luis wrote:
>>
>> On 19/11/14 11:04AM, Gregory Farnum wrote:
>>> On Thu, Nov 14, 2019 at 8:14 AM Dan van der Ster
>>> wrote:
Hi Joao,
I might have found the reason why s
Rich,
What's your failure domain (osd? host? chassis? rack?) and how big is each of
them?
For example I have a failure domain of type rack in one of my clusters with
mostly even rack sizes:
# ceph osd crush rule dump | jq -r '.[].steps'
[
{
"op": "take",
"item": -1,
"item_name":
On our test cluster after upgrading to 14.2.5 I'm having problems with the mons
pegging a CPU core while moving data around. I'm currently converting the OSDs
from FileStore to BlueStore by marking the OSDs out in multiple nodes,
destroying the OSDs, and then recreating them with ceph-volume lv
alFrameEx
0.55% [kernel] [k] _raw_spin_unlock_irqrestore
I increased mon debugging to 20 and nothing stuck out to me.
Bryan
> On Dec 12, 2019, at 4:46 PM, Bryan Stillwell wrote:
>
> On our test cluster after upgrading to 14.2.5 I'm having problems with th
roblem.
Bryan
On Dec 14, 2019, at 10:27 AM, Sasha Litvak wrote:
Bryan,
Were you able to resolve this? If yes, can you please share with the list?
On Fri, Dec 13, 2019 at 10:08 AM Bryan Stillwell
After upgrading one of our clusters from Nautilus 14.2.2 to Nautilus 14.2.5 I'm
seeing 100% CPU usage by a single ceph-mgr thread (found using 'top -H').
Attaching to the thread with strace shows a lot of mmap and munmap calls.
Here's the distribution after watching it for a few minutes:
48.7
On Dec 18, 2019, at 1:48 PM, e...@lapsus.org wrote:
>
> That sounds very similar to what I described there:
> https://tracker.ceph.com/issues/43364
I would agree that they're quite similar if not the same thing! Now that you
mention it I see the thread is named mgr-fin in 'top -H' as well. I
On Dec 18, 2019, at 11:58 AM, Sage Weil wrote:
On Wed, 18 Dec 2019, Bryan Stillwell wrote:
After upgrading one of our clusters from Nautilus 14.2.2 to Nautilus 14.2.5 I'm
seeing 100% CPU usage by a single ceph-mgr thread (found using 'top -H'
From: Bryan Stillwell
Sent: Wednesday, December 18, 2019 4:44:45 PM
To: Sage Weil
I was going to try adding an OSD to my home cluster using one of the 4GB
Raspberry Pis today, but it appears that the Ubuntu Bionic arm64 repo is
missing a bunch of packages:
$ sudo grep ^Package:
/var/lib/apt/lists/download.ceph.com_debian-nautilus_dists_bionic_main_binary-arm64_Packages
Packa
I just noticed that arm64 packages only exist for xenial. Is there a reason
why bionic packages aren't being built?
Thanks,
Bryan
> On Dec 20, 2019, at 4:22 PM, Bryan Stillwell wrote:
>
> I was going to try adding an OSD to my home cluster using one of the 4GB
> Raspberry
Great work! Thanks to everyone involved!
One minor thing I've noticed so far with the Ubuntu Bionic build is that it's
reporting the release as an RC instead of being 'stable':
$ ceph versions | grep octopus
"ceph version 15.2.0 (dc6a0b5c3cbf6a5e1d6d4f20b5ad466d76b96247) octopus
(rc)": 1
B
On Mar 24, 2020, at 5:38 AM, Abhishek Lekshmanan wrote:
> #. Upgrade monitors by installing the new packages and restarting the
> monitor daemons. For example, on each monitor host,::
>
> # systemctl restart ceph-mon.target
>
> Once all monitors are up, verify that the monitor upgrade i