[ceph-users] Re: 18.2.2 dashboard really messed up.

2024-03-13 Thread Harry G Coin
…read more about the new page you can check here <https://docs.ceph.com/en/latest/mgr/dashboard/#overview-of-the-dashboard-landing-page>. Regards, Nizam On Mon, Mar 11, 2024 at 11:47 PM Harry G Coin wrote: Looking at ceph -s, all is well.  Looking at the dashboard, 85% of my capacit…

[ceph-users] 18.2.2 dashboard really messed up.

2024-03-11 Thread Harry G Coin
Looking at ceph -s, all is well.  Looking at the dashboard, 85% of my capacity is 'warned', and 95% is 'in danger'.   There is no hint given as to the nature of the danger or reason for the warning.  Though apparently with merely 5% of my ceph world 'normal', the cluster reports 'ok'.  Which,

[ceph-users] Howto: 'one line patch' in deployed cluster?

2023-12-14 Thread Harry G Coin
Is there a 'Howto' or 'workflow' to implement a one-line patch in a running cluster?  With full understanding it will be gone on the next upgrade? Hopefully without having to set up an entire packaging/development environment? Thanks! To implement: * /Subject/: Re: Permanent KeyError:
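
If the cluster is container-based (cephadm), one low-effort approach is to layer the patched file over the stock image and point the cluster at the result; it is naturally discarded by the next regular upgrade. A minimal sketch, assuming cephadm and Quincy -- the image tags, registry and target path below are placeholders, not taken from the thread:

    # --- Dockerfile (image tag and target path are examples only) ---
    FROM quay.io/ceph/ceph:v17.2.7
    COPY patched_device.py /usr/lib/python3.6/site-packages/ceph_volume/util/device.py

    # --- build, push, and roll the cluster onto the hotfix image; the next
    # --- regular `ceph orch upgrade` replaces the custom image again
    docker build -t registry.local/ceph:v17.2.7-hotfix .
    docker push registry.local/ceph:v17.2.7-hotfix
    ceph orch upgrade start --image registry.local/ceph:v17.2.7-hotfix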

[ceph-users] Permanent KeyError: 'TYPE' ->17.2.7: return self.blkid_api['TYPE'] == 'part'

2023-11-07 Thread Harry G Coin
These repeat for every host, and only after upgrading from the previous Quincy release to 17.2.7.  As a result the cluster is permanently warned and never reports healthy. root@noc1:~# ceph health detail HEALTH_WARN failed to probe daemons or devices [WRN] CEPHADM_REFRESH_FAILED: failed to probe daemons or…
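
Until a fixed point release lands, the warning itself can be silenced so the cluster goes back to reporting healthy; a minimal sketch assuming the cephadm CLI (the TTL is an arbitrary example):

    # Mute the probe failure for a while so HEALTH_WARN clears.
    ceph health mute CEPHADM_REFRESH_FAILED 4w
    # After patching or upgrading, ask cephadm to re-probe the devices.
    ceph orch device ls --refresh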

[ceph-users] libcephfs init hangs, is there a 'timeout' argument?

2023-08-09 Thread Harry G Coin
Libcephfs's 'init' call hangs when passed arguments that once worked normally, but later refer to a cluster that's either broken, is on its way out of service, has too few mons, etc.  At least the python libcephfs wrapper hangs on init. Of course mount and session timeouts work, but is there
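
libcephfs does not appear to expose an init timeout directly, so the usual workaround is to bound the call from the caller's side. A minimal sketch using the python3-cephfs bindings; the 30-second budget and conf path are arbitrary examples:

    import threading
    import cephfs

    result = {}

    def open_fs():
        fs = cephfs.LibCephFS(conffile='/etc/ceph/ceph.conf')
        fs.conf_set('client_mount_timeout', '15')  # bounds mount; init itself may still block
        fs.init()
        fs.mount()
        result['fs'] = fs

    # Run init/mount in a daemon thread so a hung cluster cannot pin the whole process.
    worker = threading.Thread(target=open_fs, daemon=True)
    worker.start()
    worker.join(timeout=30)
    if 'fs' not in result:
        raise SystemExit('libcephfs init/mount did not return within 30s')
    fs = result['fs']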

[ceph-users] Puzzle re 'ceph: mds0 session blocklisted"

2023-08-08 Thread Harry G Coin
Can anyone help me understand seemingly contradictory cephfs error messages? I have a RHEL ceph client that mounts a cephfs file system via autofs.  Very typical.  After boot, when a user first uses the mount, for example 'ls /mountpoint' , all appears normal to the user.  But on the system

[ceph-users] RHEL / CephFS / Pacific / SELinux unavoidable "relabel inode" error?

2023-08-02 Thread Harry G Coin
Hi!  No matter what I try, using the latest cephfs on an all ceph-pacific setup, I've not been able to avoid this error message, always similar to this on RHEL family clients: SELinux: inode=1099954719159 on dev=ceph was found to have an invalid context=system_u:object_r:unlabeled_t:s0.  This
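
One workaround sometimes suggested for this class of problem (hedged, not verified against this exact setup) is to pin a single SELinux context on the whole mount via the generic context= option, so the client never tries to relabel individual inodes. The context type below is only an example and must exist in the client's policy:

    # Mon hosts, credentials and the context type are placeholders.
    mount -t ceph mon1,mon2,mon3:/ /mnt/cephfs \
          -o name=admin,secretfile=/etc/ceph/admin.secret,context="system_u:object_r:cephfs_t:s0"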

[ceph-users] ls: cannot access '/cephfs': Stale file handle

2023-05-17 Thread Harry G Coin
I have two autofs entries that mount the same cephfs file system to two different mountpoints.  Accessing the first of the two fails with 'stale file handle'.  The second works normally. Other than the name of the mount point, the lines in autofs are identical.   No amount of 'umount -f' or
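
Because two otherwise-identical mounts share one kernel client and superblock, one thing worth trying (hedged) is the kernel client's noshare mount option, which gives each mountpoint its own client instance. A sketch of what the map entries might look like -- adjust to the existing working entries; the key change is the added noshare option, everything else is a placeholder:

    # /etc/auto.direct -- example entries only
    /srv/ceph-a  -fstype=ceph,name=admin,secretfile=/etc/ceph/admin.secret,noshare  mon1:/
    /srv/ceph-b  -fstype=ceph,name=admin,secretfile=/etc/ceph/admin.secret,noshare  mon1:/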

[ceph-users] Re: 17.2.6 fs 'ls' ok, but 'cat' 'operation not permitted' puzzle

2023-05-02 Thread Harry G Coin
…vel much appreciated pointer from Curt here: On 5/2/23 14:21, Curt wrote: This thread might be of use, it's an older version of ceph 14, but might still apply, https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/23FDDSYBCDVMYGCUTALACPFAJYITLOHJ/ ? On Tue, May 2, 2023 at 11:06 PM…

[ceph-users] 17.2.6 fs 'ls' ok, but 'cat' 'operation not permitted' puzzle

2023-05-02 Thread Harry G Coin
In 17.2.6, is there a security requirement that the pools backing a CephFS file system be named <fsname>.data for the data pool and <fsname>.meta for the associated metadata pool? (Multiple file systems are enabled.) I have filesystems from older versions with the data pool name matching…
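
Listing usually works off the MDS caps alone, while reading file data needs OSD caps that actually cover the data pool, which is where non-standard pool names tend to bite. A hedged sketch of checking and regenerating filesystem-aware caps; the fs and client names are placeholders:

    # See which pools the client key is really allowed to touch.
    ceph auth get client.myclient
    # Regenerate caps tied to the filesystem rather than to pool names.
    ceph fs authorize myfs client.myclient / rw
    # The equivalent hand-written OSD cap looks like:
    #   osd 'allow rw tag cephfs data=myfs'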

[ceph-users] Grafana host overview -- "no data"?

2022-05-12 Thread Harry G. Coin
I've a 'healthy' cluster with a dashboard where Grafana correctly reports the number of osds on a host and the correct raw capacity -- and 'no data' for any time period, for any of the osd's (dockerized Quincy).  Meanwhile the top level dashboard cluster reports reasonable client throughput

[ceph-users] Re: The last 15 'degraded' items take as many hours as the first 15K?

2022-05-12 Thread Harry G. Coin
On 5/12/22 02:05, Janne Johansson wrote: Den tors 12 maj 2022 kl 00:03 skrev Harry G. Coin : Might someone explain why the count of degraded items can drop thousands, sometimes tens of thousands in the same number of hours it takes to go from 10 to 0? For example, when an OSD or a host

[ceph-users] Re: The last 15 'degraded' items take as many hours as the first 15K?

2022-05-11 Thread Harry G. Coin
…is not huge and is not RGW index omap, that slow of a single-object recovery would have me checking whether I have a bad disk that's presenting itself as significantly underperforming. Josh On Wed, May 11, 2022 at 4:03 PM Harry G. Coin wrote: Might someone explain why the count of degraded items can…

[ceph-users] Re: reinstalled node with OSD

2022-05-11 Thread Harry G. Coin
bbk, It did help!  Thank you. Here's a slightly more 'with the osd-fsid details filled in' procedure for moving a 'dockerized' / container-run OSD set of drives to a replacement server/motherboard (or the same server with blank/new/fresh reinstalled OS).  For occasions when the 'new setup'
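
For reference, a condensed version of that procedure as a hedged sketch, assuming a cephadm cluster on Pacific or later; hostnames, addresses and key paths are placeholders:

    # From a cephadm shell on a surviving node: trust the rebuilt host and re-add it.
    ceph cephadm get-pub-key > ceph.pub
    ssh-copy-id -f -i ceph.pub root@rebuilt-host
    ceph orch host add rebuilt-host 192.0.2.12
    # Re-activate the OSDs already present on the host's data drives.
    ceph cephadm osd activate rebuilt-host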

[ceph-users] [progress WARNING root] complete: ev ... does not exist, oh my!

2022-05-06 Thread Harry G. Coin
I tried searching for the meaning of a ceph Quincy all-caps WARNING message, and failed, so I need help.  Ceph tells me my cluster is 'healthy', yet emits a bunch of '[progress WARNING root] complete: ev ...' messages.  Which I score right up there with the helpful dmesg "yama, becoming…

[ceph-users] How to make ceph syslog items approximate ceph -w ?

2022-05-05 Thread Harry G. Coin
Using Quincy I'm getting a much worse lag owing to ceph syslog message volume, though without obvious system errors. In the usual case of no current/active hardware errors and no software crashes:  what config settings can I pick so that what appears in syslog is as close to what would appear
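
Since `ceph -w` follows the cluster log, one hedged starting point is to forward just the cluster log to syslog and keep per-daemon logging out of it; the levels and scopes below are examples:

    # Send the cluster log (what `ceph -w` shows) to syslog at info level...
    ceph config set mon mon_cluster_log_to_syslog true
    ceph config set mon mon_cluster_log_to_syslog_level info
    # ...and keep per-daemon chatter out of syslog so it doesn't swamp the journal.
    ceph config set global log_to_syslog false
    ceph config set global clog_to_syslog false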

[ceph-users] Re: v17.2.0 Quincy released

2022-04-19 Thread Harry G. Coin
Great news!  Any notion when the many pending bug fixes will show up in Pacific?  It's been a while. On 4/19/22 20:36, David Galloway wrote: We're very happy to announce the first stable release of the Quincy series. We encourage you to read the full release notes at

[ceph-users] How to avoid 'bad port / jabber flood' = ceph killer?

2022-01-27 Thread Harry G. Coin
I would really appreciate advice, because I bet many of you have 'seen this before' but I can't find a recipe. There must be a 'better way' to respond to this situation: it starts with a well-working small ceph cluster with 5 servers and no apparent change to the workflow, which suddenly starts…

[ceph-users] "Just works" no-typing drive placement howto?

2022-01-21 Thread Harry G. Coin
There's got to be some obvious way I haven't found for this common ceph use case, that happens at least once every couple weeks.   I hope someone on this list knows and can give a link.  The scenario goes like this, on a server with a drive providing boot capability, the rest osds: 1. First,
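
The closest thing to a no-typing answer is probably a stored OSD service spec describing which drives cephadm may claim, so every new or wiped disk matching the filter is picked up automatically. A hedged sketch -- the filters and service id are illustrative, not a recommendation:

    # osd-spec.yaml -- adjust the data_devices filters (size, rotational, model) to taste
    service_type: osd
    service_id: non_boot_drives
    placement:
      host_pattern: '*'
    spec:
      data_devices:
        size: '1TB:'       # drives of 1 TB and larger, which skips a small boot disk

Applied with `ceph orch apply -i osd-spec.yaml`, the spec persists and keeps matching drives as they are added or replaced.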

[ceph-users] A middle ground between containers and 'lts distros'?

2021-11-18 Thread Harry G. Coin
I sense the concern about ceph distributions via containers generally has to do with what you might call a feeling of 'opaqueness'.   The feeling is amplified as most folks who choose open source solutions prize being able promptly to address the particular concerns affecting them without

[ceph-users] Re: How to get ceph bug 'non-errors' off the dashboard?

2021-10-03 Thread Harry G. Coin
Worked very well!  Thank you. Harry Coin On 10/2/21 11:23 PM, 胡 玮文 wrote: Hi Harry, Please try these commands in CLI: ceph health mute MGR_MODULE_ERROR ceph health mute CEPHADM_CHECK_NETWORK_MISSING Weiwen Hu On Oct 3, 2021, at 05:37, Harry G. Coin wrote: I need help getting two 'non errors…

[ceph-users] How to get ceph bug 'non-errors' off the dashboard?

2021-10-02 Thread Harry G. Coin
I need help getting two 'non errors' off the ceph dashboard so it stops falsely scaring people with the dramatic read "HEALTH_ERR" --- and masking what could be actual errors of immediate importance. The first is a bug where the devs try to do date arithmetic between incompatible variables. 

[ceph-users] Re: Trying to understand what overlapped roots means in pg_autoscale's scale-down mode

2021-10-01 Thread Harry G. Coin
I asked as well, it seems nobody on the list knows so far. On 9/30/21 10:34 AM, Andrew Gunnerson wrote: Hello, I'm trying to figure out what overlapping roots entails with the default scale-down autoscaling profile in Ceph Pacific. My test setup involves a CRUSH map that looks like:

[ceph-users] Set some but not all drives as 'autoreplace'?

2021-09-28 Thread Harry G. Coin
Hi all, I know Ceph offers a way to 'automatically' cause blank drives it detects to be spun up into osds, but I think that's an 'all or nothing' situation if I read the docs properly. Is there a way to specify which slots, or even better, a way to specify not specific slots?  It sure would
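
One hedged way to get "some slots but not others" is to turn off the blanket behaviour with `ceph orch apply osd --all-available-devices --unmanaged=true` and then apply a spec that names only the wanted paths; the hosts and device paths below are placeholders:

    # osd-hotswap.yaml -- only these bays are auto-consumed
    service_type: osd
    service_id: hotswap_bays
    placement:
      hosts:
        - node1
    spec:
      data_devices:
        paths:
          - /dev/sdb
          - /dev/sdc

Applied with `ceph orch apply -i osd-hotswap.yaml`; drives outside the listed paths are left alone.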

[ceph-users] Is this really an 'error'? "pg_autoscaler... has overlapping roots"

2021-09-23 Thread Harry G. Coin
Is there anything to be done about groups of log messages like "[pg_autoscaler ERROR root] pool has overlapping roots"? The cluster reports it is healthy, and yet this is reported as an error, so -- is it an error that ought to have been reported, or is it not an error? Thanks Harry Coin

[ceph-users] "Remaining time" under-estimates by 100x....

2021-09-22 Thread Harry G. Coin
Is there a way to re-calibrate the various 'global recovery event' and related 'remaining time' estimators? For the last three days I've been assured that a 19h event will be over in under 3 hours... Previously I think Microsoft held the record for the most incorrect 'please wait' progress

[ceph-users] after upgrade: HEALTH ERR ...'devicehealth' has failed: can't subtract offset-naive and offset-aware datetimes

2021-09-21 Thread Harry G. Coin
A cluster reporting no errors running 16.2.5, immediately after upgrade to 16.2.6, features what seems to be an entirely bug-related, dramatic 'Health Err' on the dashboard: Module 'devicehealth' has failed: can't subtract offset-naive and offset-aware datetimes. Looking at the bug tracking…

[ceph-users] Bigger picture 'ceph web calculator', was Re: SATA vs SAS

2021-08-22 Thread Harry G. Coin
This topic comes up often enough, maybe it's time for one of those 'web calculators'.  One that accepts the user who knows their goals but not ceph-fu,  entering the importance of various factors (my suggested factors:  read freq/stored tb, write freq/stored tb, unreplicated tb needed, min target

[ceph-users] Docker container snapshots accumulate until disk full failure?

2021-08-11 Thread Harry G. Coin
Does ceph remove container subvolumes holding previous revisions of daemon images after upgrades? I have a couple servers using btrfs to hold the containers.   The number of docker related sub-volumes just keeps growing, way beyond the number of daemons running.  If I ignore this, I'll get
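
cephadm has not always pruned superseded images on its own, so a hedged interim measure is a periodic manual prune on each host; the age filter is an arbitrary example:

    docker image ls                                # see what has piled up
    docker image prune -a --filter "until=720h"    # drop unreferenced images older than ~30 days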

[ceph-users] Re: Did standby dashboards stop redirecting to the active one?

2021-07-26 Thread Harry G. Coin
…/latest/mgr/dashboard/#disable-the-redirection> > (e.g.: HAProxy)? No redirection, nothing. Just timeout on every manager other than the active one.  Adding a HAproxy would be easily done, but seems redundant to ceph internal capability -- that at one time worked, anyhow. > > Kind Regard…

[ceph-users] Did standby dashboards stop redirecting to the active one?

2021-07-26 Thread Harry G. Coin
Somewhere between Nautilus and Pacific the hosts running standby managers, which previously would redirect browsers to the currently active mgr/dashboard, seem to have stopped doing that.   Is that a switch somewhere?  Or was I just happily using an undocumented feature? Thanks Harry Coin
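
The redirect is governed by a dashboard setting, so it is worth checking whether it has drifted from its default; a hedged sketch:

    ceph config get mgr mgr/dashboard/standby_behaviour
    # "redirect" sends browsers to the active mgr; "error" returns a status code instead.
    ceph config set mgr mgr/dashboard/standby_behaviour "redirect"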

[ceph-users] Re: name alertmanager/node-exporter already in use with v16.2.5

2021-07-11 Thread Harry G. Coin
On 7/8/21 5:06 PM, Bryan Stillwell wrote: > I upgraded one of my clusters to v16.2.5 today and now I'm seeing these > messages from 'ceph -W cephadm': > > 2021-07-08T22:01:55.356953+ mgr.excalibur.kuumco [ERR] Failed to apply > alertmanager spec AlertManagerSpec({'placement':

[ceph-users] Re: name alertmanager/node-exporter already in use with v16.2.5

2021-07-10 Thread Harry G. Coin
Same problem here.  Hundreds of lines like '    Updating node-exporter deployment (+4 -4 -> 5) (0s)   [] ' And, similar to yours: ... 2021-07-10T16:26:30.432487-0500 mgr.noc4.tvhgac [ERR] Failed to apply node-exporter spec MonitoringSpec({'placement':

[ceph-users] Question re: replacing failed boot/os drive in cephadm / pacific cluster

2021-07-09 Thread Harry G. Coin
Hi In a Pacific/container/cephadm setup, when a server boot/os drive fails (unrelated to any osd actual storage):  Can the boot/OS drive be replaced with a 'fresh install OS install' then simply setting up the same networking addressing/ssh keys (assuming the necessary docker/non-ceph pkgs are

[ceph-users] Why does 'mute AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED 2w' expire in less than a day?

2021-07-07 Thread Harry G. Coin
Is this happening to anyone else?  After this command: ceph health mute AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED 2w The 'dashboard' shows 'Health OK',  then after a few hours (perhaps a mon leadership change), it's back to 'degraded' and 'AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED: mons are
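
A mute is dropped once the alert temporarily clears and then re-fires (for example across a mon election) unless it is made sticky, so a hedged pair of options:

    # Keep the mute across health-state flaps for the full two weeks.
    ceph health mute AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED 2w --sticky
    # Or remove the underlying condition once every client is patched.
    ceph config set mon auth_allow_insecure_global_id_reclaim false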

[ceph-users] Re: CephFS design

2021-06-14 Thread Harry G. Coin
On 6/11/21 3:52 AM, Szabo, Istvan (Agoda) wrote: > Hi, > > Can you suggest me what is a good cephfs design? I've never used it, only rgw > and rbd we have, but want to give a try. However in the mail list I saw a > huge amount of issues with cephfs so would like to go with some let's say >…

[ceph-users] Re: In theory - would 'cephfs root' out-perform 'rbd root'?

2021-06-13 Thread Harry G. Coin
…ing and network bandwidth than the 'known interesting only' parts of files. > > On Fri, Jun 11, 2021 at 12:31 PM Harry G. Coin wrote: >> On any given a properly sized ceph setup, for other than database end >> use) theoretically shouldn't a ceph-fs root out-perform any fs…

[ceph-users] Cephfs root/boot?

2021-06-07 Thread Harry G. Coin
Has anyone added the 'conf.d' modules (and on the centos/rhel/fedora world done the selinux work) so that initramfs/dracut can 'direct kernel boot' cephfs as a guest image root file system?  It took some work for the nfs folks to manage being the root filesystem. Harry

[ceph-users] Re: Why you might want packages not containers for Ceph deployments

2021-06-02 Thread Harry G. Coin
On 6/2/21 2:28 PM, Phil Regnauld wrote: > Dave Hall (kdhall) writes: >> But the developers aren't out in the field with their deployments >> when something weird impacts a cluster and the standard approaches don't >> resolve it. And let's face it: Ceph is a marvelously robust solution for >>

[ceph-users] mons assigned via orch label 'committing suicide' upon reboot.

2021-05-28 Thread Harry G. Coin
FYI, I'm getting monitors assigned via '... apply label:mon' with current and valid 'mon' tags:  'committing suicide' after surprise reboots in the  'Pacific' 16.2.4 release.  The tag indicating a monitor should be assigned to that host is present and never changed. Deleting the mon tag, waiting

[ceph-users] Re: orch apply mon assigns wrong IP address?

2021-05-21 Thread Harry G. Coin
On 5/21/21 9:49 AM, Eugen Block wrote: > You can define the public_network [1]: > > ceph config set mon public_network ** > > For example: > > ceph config set mon public_network 10.1.2.0/24 > > Or is that already defined and it happens anyway? The public network is defined, and it happens anyway

[ceph-users] orch apply mon assigns wrong IP address?

2021-05-21 Thread Harry G. Coin
Is there a way to force '.. orch apply  *' to limit ip address selection to addresses matching the hostname in dns or /etc/hosts, or to a specific address given at 'host add' time?   I've hit a bothersome problem: On v15, 'ceph orch apply mon ...' appears not to use the dns ip or /etc/hosts when
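
Two hedged workarounds that usually remove the guesswork: record an explicit address for the host in cephadm, or add that one mon daemon with an explicit IP. Hostnames and addresses below are placeholders:

    ceph orch host set-addr mon-host-1 192.168.10.21        # pin the address cephadm uses for the host
    ceph config set mon public_network 192.168.10.0/24
    ceph orch daemon add mon mon-host-1:192.168.10.21       # or place this one mon at an explicit IP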

[ceph-users] diskprediction_local to be retired or fixed or??

2020-12-11 Thread Harry G. Coin
Any idea whether 'diskprediction_local' will ever work in containers?  I'm running 15.2.7 which contains a dependency on scikit-learn v 0.19.2 which isn't in the container.  It's been throwing that error for a year now on all the octopus container versions I tried.  It used to be on the baremetal

[ceph-users] Switch docker image?

2020-10-22 Thread Harry G. Coin
This has got to be ceph/docker "101" but I can't find the answer in the docs and need help. The latest docker octopus images support using the ntpsec time daemon.  The default stable octopus image doesn't as yet. I want to add a mon to a cluster that needs to use ntpsec  (just go with it..), so
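
Assuming a cephadm-managed cluster, the image can be set per cluster or per daemon; a hedged sketch in which the image name is a placeholder and the exact redeploy syntax varies a little between releases:

    # Make the alternate image the default for newly deployed daemons...
    ceph config set global container_image quay.io/example/ceph:octopus-ntpsec
    # ...or redeploy just the one mon from it.
    ceph orch daemon redeploy mon.noc5 quay.io/example/ceph:octopus-ntpsec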

[ceph-users] Re: Are there 'tuned profiles' for various ceph scenarios?

2020-07-01 Thread Harry G. Coin
[Resent to correct title] Marc: Here's a template that works here.  You'll need to do some steps to create the 'secret' and make the block devs and so on: […] Glad I could contribute something.  Sure would appreciate leads for the suggested sysctls/etc either apart or as…
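
The template itself was stripped by the archive; for readers landing here, a hedged reconstruction of the usual libvirt/RBD secret steps it alluded to -- the UUID and client name are placeholders, not necessarily what was posted:

    <!-- secret.xml -->
    <secret ephemeral='no' private='no'>
      <uuid>6d48ea58-1234-4a3c-9f01-123456789abc</uuid>
      <usage type='ceph'>
        <name>client.libvirt secret</name>
      </usage>
    </secret>

    # define the secret and load the ceph key into it
    virsh secret-define --file secret.xml
    virsh secret-set-value --secret 6d48ea58-1234-4a3c-9f01-123456789abc \
          --base64 "$(ceph auth get-key client.libvirt)"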

[ceph-users] Are there 'tuned profiles' for various ceph scenarios?

2020-07-01 Thread Harry G. Coin
Hi Are there any 'official' or even 'works for us' pointers to 'tuned profiles' for such common uses as 'ceph baremetal osd host' 'ceph osd + libvirt host' 'ceph mon/mgr' 'guest vm based on a kernel-mounted rbd' 'guest vm based on a direct virtio->rados link' I suppose there are a few other
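
No official per-role profiles are referenced in the thread; as a hedged illustration only, a custom tuned profile for an OSD host can be as small as this, with the values below being arbitrary starting points rather than recommendations:

    # /etc/tuned/ceph-osd-host/tuned.conf
    [main]
    summary=Ceph OSD host: throughput bias, quiet swap
    include=throughput-performance

    [sysctl]
    vm.swappiness=10
    net.core.somaxconn=1024

Activated with `tuned-adm profile ceph-osd-host`; similar skeletons could be written for mon/mgr hosts or RBD-backed guests.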

[ceph-users] Re: Re layout help: need chassis local io to minimize net links

2020-06-29 Thread Harry G. Coin
Anthony asked about the 'use case'.  Well, I haven't gone into details because I worried it wouldn't help much.  From a 'ceph' perspective, the sandbox layout goes like this:  4 pretty much identical old servers, each with 6 drives, and a smaller server just running a mon to break ties.  Usual

[ceph-users] Re: Re layout help: need chassis local io to minimize net links

2020-06-29 Thread Harry G. Coin
…writes will have to hit the net > regardless of any machinations. > >> On Jun 29, 2020, at 7:31 PM, Harry G. Coin wrote: >> >> I need exactly what ceph is for a whole lot of work, that work just >> doesn't represent a large fraction of the total local traffic. Ceph is…

[ceph-users] Re layout help: need chassis local io to minimize net links

2020-06-29 Thread Harry G. Coin
I need exactly what ceph is for a whole lot of work, that work just doesn't represent a large fraction of the total local traffic.  Ceph is the right choice.  Plainly ceph has tremendous support for replication within a chassis, among chassis and among racks.  I just need intra-chassis traffic to

[ceph-users] layout help: need chassis local io to minimize net links

2020-06-29 Thread Harry G. Coin
Hi I have a few servers each with 6 or more disks, with a storage workload that's around 80% done entirely within each server.   From a work-to-be-done perspective there's no need for 80% of the load to traverse network interfaces, the rest needs what ceph is all about.   So I cooked up a set of
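
One hedged way to keep a pool's traffic inside a single chassis is per-chassis CRUSH buckets plus a rule rooted at that chassis; the bucket, host and pool names below are placeholders:

    # Build a chassis bucket and hang the host(s) under it.
    ceph osd crush add-bucket chassis1 chassis
    ceph osd crush move chassis1 root=default
    ceph osd crush move server1 chassis=chassis1
    # A rule rooted at one chassis keeps that pool's replicas, and its traffic, local.
    ceph osd crush rule create-replicated chassis1-local chassis1 osd
    ceph osd pool create local-pool-1 64 64 replicated chassis1-local

The obvious trade-off is that such a pool no longer survives the loss of that chassis, so it suits data that is genuinely chassis-local.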

[ceph-users] Recovery throughput inversely linked with rbd_cache_xyz?

2020-04-23 Thread Harry G. Coin
Hello, A couple days ago I increased the rbd cache size from the default to 256MB/osd on a small 4-node, 6-osd/node setup in a test/lab setting.  The rbd volumes are all vm images with writeback cache parameters and steady, if only a few MB/sec, writes going on; logging, mostly.  I noticed…

[ceph-users] Re: v14.2.3 Nautilus released

2019-09-04 Thread Harry G. Coin
Does anyone know if the change to disable spdk by default (so as to remove the corei7 dependency when running on intel platforms) made it into 14.2.3? The spdk version only required core2 in 14.2.1; the change to require corei7 in 14.2.2 killed all the osds on older systems flat. On…