Re: [ceph-users] monitor ghosted

2020-01-08 Thread Brad Hubbard
On Thu, Jan 9, 2020 at 5:48 AM Peter Eisch wrote: > Hi, > > This morning one of my three monitor hosts got booted from the Nautilus > 14.2.4 cluster and it won’t regain. There haven’t been any changes, or > events at this site at all. The conf file is the [unchanged] and the same > as the other

Re: [ceph-users] ceph-objectstore-tool crash when trying to recover pg from OSD

2019-11-07 Thread Brad Hubbard
I'd suggest you open a tracker under the Bluestore component so someone can take a look. I'd also suggest you include a log with 'debug_bluestore=20' added to the COT command line. On Thu, Nov 7, 2019 at 6:56 PM Eugene de Beste wrote: > > Hi, does anyone have any feedback for me regarding this?
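
A minimal sketch of such a COT invocation, assuming a bluestore OSD; the data path, PG id, and operation are placeholders, and if the tool does not accept the debug/log options directly they can be set in the [osd] section of ceph.conf instead:

  # re-run the failing ceph-objectstore-tool command with bluestore debugging
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-[id] \
      --pgid [pgid] --op export --file /tmp/[pgid].export \
      --debug-bluestore=20 --log-file=/tmp/cot-bluestore.log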

Re: [ceph-users] Inconsistents + FAILED assert(recovery_info.oi.legacy_snaps.size())

2019-10-30 Thread Brad Hubbard
up > ([27,30,38], p27) acting ([30,25], p30) > > I also checked the logs of all OSDs already done and got the same logs > about this object : > * osd.4, last time : 2019-10-10 16:15:20 > * osd.32, last time : 2019-10-14 01:54:56 > * osd.33, last time : 2019-10-11 06:24:01 >

Re: [ceph-users] Inconsistents + FAILED assert(recovery_info.oi.legacy_snaps.size())

2019-10-29 Thread Brad Hubbard
On Tue, Oct 29, 2019 at 9:09 PM Jérémy Gardais wrote: > > Thus spake Brad Hubbard (bhubb...@redhat.com) on mardi 29 octobre 2019 à > 08:20:31: > > Yes, try and get the pgs healthy, then you can just re-provision the down > > OSDs. > > > > Run a scrub

Re: [ceph-users] Inconsistents + FAILED assert(recovery_info.oi.legacy_snaps.size())

2019-10-28 Thread Brad Hubbard
Yes, try and get the pgs healthy, then you can just re-provision the down OSDs. Run a scrub on each of these pgs and then use the commands on the following page to find out more information for each case. https://docs.ceph.com/docs/luminous/rados/troubleshooting/troubleshooting-pg/ Focus on the
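
As a rough sketch of that workflow, using the Luminous-era commands from the linked troubleshooting page (PG id is a placeholder):

  # scrub the PG, wait for it to finish, then list what is inconsistent
  ceph pg deep-scrub [pgid]
  rados list-inconsistent-obj [pgid] --format=json-pretty
  rados list-inconsistent-snapset [pgid] --format=json-pretty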

Re: [ceph-users] lot of inconsistent+failed_repair - failed to pick suitable auth object (14.2.3)

2019-10-10 Thread Brad Hubbard
ashpspool stripe_width 0 application cephfs This looked like something min_size 1 could cause, but I guess that's not the cause here. > so inconsistens is empty, which is weird, no ? Try scrubbing the pg just before running the command. > > Thanks again! > > K > > > On 10/10/2019

Re: [ceph-users] lot of inconsistent+failed_repair - failed to pick suitable auth object (14.2.3)

2019-10-10 Thread Brad Hubbard
Does pool 6 have min_size = 1 set? https://tracker.ceph.com/issues/24994#note-5 would possibly be helpful here, depending on what the output of the following command looks like. # rados list-inconsistent-obj [pgid] --format=json-pretty On Thu, Oct 10, 2019 at 8:16 PM Kenneth Waegeman wrote: >
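
Checking min_size is a one-liner; the pool name below is a placeholder:

  ceph osd pool ls detail | grep 'pool 6 '
  # or query the single value directly
  ceph osd pool get [pool-name] min_size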

Re: [ceph-users] Ceph pg repair clone_missing?

2019-10-09 Thread Brad Hubbard
Awesome! Sorry it took so long. On Thu, Oct 10, 2019 at 12:44 AM Marc Roos wrote: > > > Brad, many thanks!!! My cluster has finally HEALTH_OK af 1,5 year or so! > :) > > > -Original Message- > Subject: Re: Ceph pg repair clone_missing? > > On Fri, Oct 4, 2019 at 6:09 PM Marc Roos >

Re: [ceph-users] Ceph pg repair clone_missing?

2019-10-08 Thread Brad Hubbard
On Fri, Oct 4, 2019 at 6:09 PM Marc Roos wrote: > > > > >Try something like the following on each OSD that holds a copy of > >rbd_data.1f114174b0dc51.0974 and see what output you get. > >Note that you can drop the bluestore flag if they are not bluestore > >osds and you will need

Re: [ceph-users] Ceph pg repair clone_missing?

2019-10-03 Thread Brad Hubbard
On Thu, Oct 3, 2019 at 6:46 PM Marc Roos wrote: > > > > >> > >> I was following the thread where you adviced on this pg repair > >> > >> I ran these rados 'list-inconsistent-obj'/'rados > >> list-inconsistent-snapset' and have output on the snapset. I tried > to > >> extrapolate your

Re: [ceph-users] Ceph pg repair clone_missing?

2019-10-02 Thread Brad Hubbard
On Wed, Oct 2, 2019 at 9:00 PM Marc Roos wrote: > > > > Hi Brad, > > I was following the thread where you adviced on this pg repair > > I ran these rados 'list-inconsistent-obj'/'rados > list-inconsistent-snapset' and have output on the snapset. I tried to > extrapolate your comment on the

Re: [ceph-users] OSD crashed during the fio test

2019-10-01 Thread Brad Hubbard
9 at 8:03 AM Sasha Litvak > wrote: >> >> It was hardware indeed. Dell server reported a disk being reset with power >> on. Checking the usual suspects i.e. controller firmware, controller event >> log (if I can get one), drive firmware. >> I will report more when I g

Re: [ceph-users] ceph pg repair fails...?

2019-10-01 Thread Brad Hubbard
On Wed, Oct 2, 2019 at 1:15 AM Mattia Belluco wrote: > > Hi Jake, > > I am curious to see if your problem is similar to ours (despite the fact > we are still on Luminous). > > Could you post the output of: > > rados list-inconsistent-obj > > and > > rados list-inconsistent-snapset Make sure

Re: [ceph-users] ceph-osd@n crash dumps

2019-10-01 Thread Brad Hubbard
On Tue, Oct 1, 2019 at 10:43 PM Del Monaco, Andrea < andrea.delmon...@atos.net> wrote: > Hi list, > > After the nodes ran OOM and after reboot, we are not able to restart the > ceph-osd@x services anymore. (Details about the setup at the end). > > I am trying to do this manually, so we can see

Re: [ceph-users] OSD crashed during the fio test

2019-10-01 Thread Brad Hubbard
Removed ceph-de...@vger.kernel.org and added d...@ceph.io On Tue, Oct 1, 2019 at 4:26 PM Alex Litvak wrote: > > Hellow everyone, > > Can you shed the line on the cause of the crash? Could actually client > request trigger it? > > Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]: 2019-09-30

Re: [ceph-users] ceph; pg scrub errors

2019-09-24 Thread Brad Hubbard
On Tue, Sep 24, 2019 at 10:51 PM M Ranga Swami Reddy wrote: > > Interestingly - "rados list-inconsistent-obj ${PG} --format=json" not > showing any objects inconsistent-obj. > And also "rados list-missing-obj ${PG} --format=json" also not showing any > missing or unfound objects. Complete a

Re: [ceph-users] ZeroDivisionError when running ceph osd status

2019-09-11 Thread Brad Hubbard
On Thu, Sep 12, 2019 at 1:52 AM Benjamin Tayehanpour wrote: > > Greetings! > > I had an OSD down, so I ran ceph osd status and got this: > > [root@ceph1 ~]# ceph osd status > Error EINVAL: Traceback (most recent call last): > File "/usr/lib64/ceph/mgr/status/module.py", line 313, in

Re: [ceph-users] ceph-fuse segfaults in 14.2.2

2019-09-06 Thread Brad Hubbard
On Wed, Sep 4, 2019 at 9:42 PM Andras Pataki wrote: > > Dear ceph users, > > After upgrading our ceph-fuse clients to 14.2.2, we've been seeing sporadic > segfaults with not super revealing stack traces: > > in thread 7fff5a7fc700 thread_name:ceph-fuse > > ceph version 14.2.2

Re: [ceph-users] BlueStore.cc: 11208: ceph_abort_msg("unexpected error")

2019-08-25 Thread Brad Hubbard
https://tracker.ceph.com/issues/38724 On Fri, Aug 23, 2019 at 10:18 PM Paul Emmerich wrote: > > I've seen that before (but never on Nautilus), there's already an > issue at tracker.ceph.com but I don't recall the id or title. > > > Paul > > -- > Paul Emmerich > > Looking for help with your Ceph

Re: [ceph-users] ceph status: pg backfill_toofull, but all OSDs have enough space

2019-08-22 Thread Brad Hubbard
https://tracker.ceph.com/issues/41255 is probably reporting the same issue. On Thu, Aug 22, 2019 at 6:31 PM Lars Täuber wrote: > > Hi there! > > We also experience this behaviour of our cluster while it is moving pgs. > > # ceph health detail > HEALTH_ERR 1 MDSs report slow metadata IOs; Reduced

Re: [ceph-users] Sudden loss of all SSD OSDs in a cluster, immedaite abort on restart [Mimic 13.2.6]

2019-08-18 Thread Brad Hubbard
On Thu, Aug 15, 2019 at 2:09 AM Troy Ablan wrote: > > Paul, > > Thanks for the reply. All of these seemed to fail except for pulling > the osdmap from the live cluster. > > -Troy > > -[~:#]- ceph-objectstore-tool --op get-osdmap --data-path > /var/lib/ceph/osd/ceph-45/ --file osdmap45 >

Re: [ceph-users] Sudden loss of all SSD OSDs in a cluster, immedaite abort on restart [Mimic 13.2.6]

2019-08-18 Thread Brad Hubbard
On Thu, Aug 15, 2019 at 2:09 AM Troy Ablan wrote: > > Paul, > > Thanks for the reply. All of these seemed to fail except for pulling > the osdmap from the live cluster. > > -Troy > > -[~:#]- ceph-objectstore-tool --op get-osdmap --data-path > /var/lib/ceph/osd/ceph-45/ --file osdmap45 >

Re: [ceph-users] Possibly a bug on rocksdb

2019-08-11 Thread Brad Hubbard
Could you create a tracker for this? Also, if you can reproduce this could you gather a log with debug_osd=20 ? That should show us the superblock it was trying to decode as well as additional details. On Mon, Aug 12, 2019 at 6:29 AM huxia...@horebdata.cn wrote: > > Dear folks, > > I had an OSD

Re: [ceph-users] 14.2.2 - OSD Crash

2019-08-06 Thread Brad Hubbard
-63> 2019-08-07 00:51:52.861 7fe987e49700 1 heartbeat_map clear_timeout 'OSD::osd_op_tp thread 0x7fe987e49700' had suicide timed out after 150 You hit a suicide timeout, that's fatal. On line 80 the process kills the thread based on the assumption it's hung. src/common/HeartbeatMap.cc: 66

Re: [ceph-users] set_mon_vals failed to set cluster_network Configuration option 'cluster_network' may not be modified at runtime

2019-07-02 Thread Brad Hubbard
I'd suggest creating a tracker similar to http://tracker.ceph.com/issues/40554 which was created for the issue in the thread you mentioned. On Wed, Jul 3, 2019 at 12:29 AM Vandeir Eduardo wrote: > > Hi, > > on client machines, when I use the command rbd, for example, rbd ls > poolname, this

Re: [ceph-users] details about cloning objects using librados

2019-07-02 Thread Brad Hubbard
> application is responsible for any locking needed. > -Greg > > On Tue, Jul 2, 2019 at 3:49 AM Brad Hubbard wrote: > > > > Yes, this should be possible using an object class which is also a > > RADOS client (via the RADOS API). You'll still have some client > >

Re: [ceph-users] details about cloning objects using librados

2019-07-02 Thread Brad Hubbard
>>> Thank you for your response , and we will check this video as well. >>> Our requirement is while writing an object into the cluster , if we can >>> provide number of copies to be made , the network consumption between >>> client and cluster will be only for one object write.

Re: [ceph-users] details about cloning objects using librados

2019-06-27 Thread Brad Hubbard
On Thu, Jun 27, 2019 at 8:58 PM nokia ceph wrote: > > Hi Team, > > We have a requirement to create multiple copies of an object and currently we > are handling it in client side to write as separate objects and this causes > huge network traffic between client and cluster. > Is there

Re: [ceph-users] obj_size_info_mismatch error handling

2019-06-17 Thread Brad Hubbard
relating to the clearing in mon, mgr, or osd logs. > > > > So, not entirely sure what fixed it, but it is resolved on its own. > > > > Thanks, > > > > Reed > > > > On Apr 30, 2019, at 8:01 PM, Brad Hubbard wrote: > > > > On Wed, May 1, 2019 at

Re: [ceph-users] obj_size_info_mismatch error handling

2019-04-30 Thread Brad Hubbard
On Wed, May 1, 2019 at 10:54 AM Brad Hubbard wrote: > > Which size is correct? Sorry, accidental discharge =D If the object info size is *incorrect* try forcing a write to the OI with something like the following. 1. rados -p [name_of_pool_17] setomapval 10008536718. tempora
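
The full workaround, as far as the truncated message shows, would look roughly like the following; pool, object, and PG names are placeholders, and the clean-up step is an assumption based on the usual pattern of dirtying the object so its object_info gets rewritten:

  # dirty the object so the primary rewrites its object_info on the next scrub
  rados -p [pool] setomapval [object-name] temporary-key anything
  ceph pg deep-scrub [pgid]
  # once the PG is clean again, remove the temporary key
  rados -p [pool] rmomapkey [object-name] temporary-key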

Re: [ceph-users] obj_size_info_mismatch error handling

2019-04-30 Thread Brad Hubbard
Which size is correct? On Tue, Apr 30, 2019 at 1:06 AM Reed Dier wrote: > > Hi list, > > Woke up this morning to two PG's reporting scrub errors, in a way that I > haven't seen before. > > $ ceph versions > { > "mon": { > "ceph version 13.2.5

Re: [ceph-users] Is it possible to run a standalone Bluestore instance?

2019-04-21 Thread Brad Hubbard
> > Best, > Can Zhang > > > On Fri, Apr 19, 2019 at 6:28 PM Brad Hubbard wrote: > > > > OK. So this works for me with master commit > > bdaac2d619d603f53a16c07f9d7bd47751137c4c on Centos 7.5.1804. > > > > I cloned the repo and ran './install-deps.sh'

Re: [ceph-users] Is it possible to run a standalone Bluestore instance?

2019-04-19 Thread Brad Hubbard
If you can give me specific steps so I can reproduce this from a freshly cloned tree I'd be happy to look further into it. Good luck. On Thu, Apr 18, 2019 at 7:00 PM Brad Hubbard wrote: > > Let me try to reproduce this on centos 7.5 with master and I'll let > you know how I go. > >

Re: [ceph-users] Is it possible to run a standalone Bluestore instance?

2019-04-18 Thread Brad Hubbard
> Notice the "U" and "V" from nm results. > > > > > Best, > Can Zhang > > On Thu, Apr 18, 2019 at 9:36 AM Brad Hubbard wrote: > > > > Does it define _ZTIN13PriorityCache8PriCacheE ? If it does, and all is > > as you say, then it

Re: [ceph-users] Is it possible to run a standalone Bluestore instance?

2019-04-17 Thread Brad Hubbard
:15 libceph-common.so -> > libceph-common.so.0 > -rwxr-xr-x. 1 root root 211853400 Apr 17 11:15 libceph-common.so.0 > > > > > Best, > Can Zhang > > On Thu, Apr 18, 2019 at 7:00 AM Brad Hubbard wrote: > > > > On Wed, Apr 17, 2019 at 1:37 PM Can Zhang w

Re: [ceph-users] Is it possible to run a standalone Bluestore instance?

2019-04-17 Thread Brad Hubbard
On Wed, Apr 17, 2019 at 1:37 PM Can Zhang wrote: > > Thanks for your suggestions. > > I tried to build libfio_ceph_objectstore.so, but it fails to load: > > ``` > $ LD_LIBRARY_PATH=./lib ./bin/fio --enghelp=libfio_ceph_objectstore.so > > fio: engine libfio_ceph_objectstore.so not loadable > IO

Re: [ceph-users] showing active config settings

2019-04-16 Thread Brad Hubbard
puzzled why it doesn't show any change when I run this no matter > what I set it to: > > # ceph -n osd.1 --show-config | grep osd_recovery_max_active > osd_recovery_max_active = 3 > > in fact it doesn't matter if I use an OSD number that doesn't exist, same > thing if I use c
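
The catch is that --show-config prints the configuration a new client named osd.1 would compute from defaults and ceph.conf, not what the running daemon is actually using, which is also why a non-existent OSD id "works". To see the live value, ask the daemon over its admin socket on the host where it runs, e.g.:

  ceph daemon osd.1 config get osd_recovery_max_active
  # or dump everything and filter
  ceph daemon osd.1 config show | grep osd_recovery_max_active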

Re: [ceph-users] showing active config settings

2019-04-16 Thread Brad Hubbard
On Tue, Apr 16, 2019 at 6:03 PM Paul Emmerich wrote: > > This works, it just says that it *might* require a restart, but this > particular option takes effect without a restart. We've already looked at changing the wording once to make it more palatable. http://tracker.ceph.com/issues/18424 >

Re: [ceph-users] showing active config settings

2019-04-15 Thread Brad Hubbard
On Tue, Apr 16, 2019 at 7:38 AM solarflow99 wrote: > > Then why doesn't this work? > > # ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4' > osd.0: osd_recovery_max_active = '4' (not observed, change may require > restart) > osd.1: osd_recovery_max_active = '4' (not observed, change may

Re: [ceph-users] VM management setup

2019-04-05 Thread Brad Hubbard
If you want to do containers at the same time, or transition some/all to containers at some point in future maybe something based on kubevirt [1] would be more futureproof? [1] http://kubevirt.io/ CNV is an example, https://www.redhat.com/en/resources/container-native-virtualization On Sat, Apr

Re: [ceph-users] scrub errors

2019-03-28 Thread Brad Hubbard
ed+inconsistent+peering, and the other peer is active+clean+inconsistent Per the document I linked previously if a pg remains remapped you likely have a problem with your configuration. Take a good look at your crushmap, pg distribution, pool configuration, etc. > > > On Wed, Mar 27, 2019 at 4:1

Re: [ceph-users] scrub errors

2019-03-27 Thread Brad Hubbard
{ > "osd": "7", > "status": "not queried" > }, > { > "osd": "8", > "status": "already probed" > }, >

Re: [ceph-users] Fedora 29 Issues.

2019-03-26 Thread Brad Hubbard
https://bugzilla.redhat.com/show_bug.cgi?id=1662496 On Wed, Mar 27, 2019 at 5:00 AM Andrew J. Hutton wrote: > > More or less followed the install instructions with modifications as > needed; but I'm suspecting that either a dependency was missed in the > F29 package or something else is up. I

Re: [ceph-users] scrub errors

2019-03-26 Thread Brad Hubbard
ther OSDs appear to be ok, I see > them up and in, why do you see something wrong? > > On Mon, Mar 25, 2019 at 4:00 PM Brad Hubbard wrote: >> >> Hammer is no longer supported. >> >> What's the status of osds 7 and 17? >> >> On Tue, Mar 26, 2019 at 8:56 A

Re: [ceph-users] scrub errors

2019-03-25 Thread Brad Hubbard
"last_epoch_clean": 20840, > "parent": "0.0", > "parent_split_bits": 0, > "last_scrub": "21395'11835365", > "last_scrub_stamp": "20

Re: [ceph-users] scrub errors

2019-03-25 Thread Brad Hubbard
It would help to know what version you are running but, to begin with, could you post the output of the following? $ sudo ceph pg 10.2a query $ sudo rados list-inconsistent-obj 10.2a --format=json-pretty Also, have a read of

Re: [ceph-users] OS Upgrade now monitor wont start

2019-03-24 Thread Brad Hubbard
Do a "ps auwwx" to see how a running monitor was started and use the equivalent command to try to start the MON that won't start. "ceph-mon --help" will show you what you need. Most important is to get the ID portion right and to add "-d" to get it to run in teh foreground and log to stdout. HTH

Re: [ceph-users] Slow OPS

2019-03-21 Thread Brad Hubbard
21 16:51:56.862447", > "age": 376.527241, > "duration": 1.331278, > > Kind regards, > Glen Baars > > -Original Message- > From: Brad Hubbard > Sent: Thursday, 21 March 2019 1:43 PM > To: Glen Baars > Cc: cep

Re: [ceph-users] Slow OPS

2019-03-20 Thread Brad Hubbard
Actually, the lag is between "sub_op_committed" and "commit_sent". Is there any pattern to these slow requests? Do they involve the same osd, or set of osds? On Thu, Mar 21, 2019 at 3:37 PM Brad Hubbard wrote: > > On Thu, Mar 21, 2019 at 3:20 PM Glen Baars > wrote:

Re: [ceph-users] Slow OPS

2019-03-20 Thread Brad Hubbard
> > Does anyone know what that section is waiting for? Hi Glen, These are documented, to some extent, here. http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/ It looks like it may be taking a long time to communicate the commit message back to the client? Are these sl

Re: [ceph-users] Slow OPS

2019-03-20 Thread Brad Hubbard
On Thu, Mar 21, 2019 at 12:11 AM Glen Baars wrote: > > Hello Ceph Users, > > > > Does anyone know what the flag point ‘Started’ is? Is that ceph osd daemon > waiting on the disk subsystem? This is set by "mark_started()" and is roughly set when the pg starts processing the op. Might want to
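
The per-op flag points (including 'Started') can be inspected through the OSD admin socket, which lists the timestamped events each op has passed; a sketch, with the OSD id a placeholder:

  ceph daemon osd.[id] dump_ops_in_flight
  ceph daemon osd.[id] dump_historic_ops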

Re: [ceph-users] leak memory when mount cephfs

2019-03-19 Thread Brad Hubbard
On Tue, Mar 19, 2019 at 7:54 PM Zhenshi Zhou wrote: > > Hi, > > I mount cephfs on my client servers. Some of the servers mount without any > error whereas others don't. > > The error: > # ceph-fuse -n client.kvm -m ceph.somedomain.com:6789 /mnt/kvm -r /kvm -d > 2019-03-19 17:03:29.136

Re: [ceph-users] Large OMAP Objects in default.rgw.log pool

2019-03-07 Thread Brad Hubbard
On Fri, Mar 8, 2019 at 4:46 AM Samuel Taylor Liston wrote: > > Hello All, > I have recently had 32 large map objects appear in my default.rgw.log > pool. Running luminous 12.2.8. > > Not sure what to think about these. I’ve done a lot of reading > about how when these

Re: [ceph-users] Failed to repair pg

2019-03-07 Thread Brad Hubbard
You could try reading the data from this object and writing it back using rados get, then rados put. On Fri, Mar 8, 2019 at 3:32 AM Herbert Alexander Faleiros wrote: > > On Thu, Mar 07, 2019 at 01:37:55PM -0300, Herbert Alexander Faleiros wrote: > > Hi, > > > > # ceph health detail > > HEALTH_ERR
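
A sketch of that read-then-rewrite approach; pool, object, and PG names are placeholders:

  rados -p [pool] get [object-name] /tmp/object.bin
  rados -p [pool] put [object-name] /tmp/object.bin
  # then re-check the PG
  ceph pg deep-scrub [pgid]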

Re: [ceph-users] http://tracker.ceph.com/issues/38122

2019-03-06 Thread Brad Hubbard
+Jos Collin On Thu, Mar 7, 2019 at 9:41 AM Milanov, Radoslav Nikiforov wrote: > Can someone elaborate on > > > > From http://tracker.ceph.com/issues/38122 > > > > Which exactly package is missing? > > And why is this happening ? In Mimic all dependencies are resolved by yum? > > - Rado > > >

Re: [ceph-users] OSD fails to start (fsck error, unable to read osd superblock)

2019-02-13 Thread Brad Hubbard
A single OSD should be expendable and you should be able to just "zap" it and recreate it. Was this not true in your case? On Wed, Feb 13, 2019 at 1:27 AM Ruben Rodriguez wrote: > > > > On 2/9/19 5:40 PM, Brad Hubbard wrote: > > On Sun, Feb 10, 2019 at 1:
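
For a Luminous bluestore OSD, zapping and recreating would look roughly like this (OSD id and device are placeholders, and ceph-volume flags vary a little across 12.2.x point releases); let the cluster finish backfilling before touching anything else:

  systemctl stop ceph-osd@[id]
  ceph osd out [id]
  ceph osd purge [id] --yes-i-really-mean-it
  ceph-volume lvm zap --destroy /dev/[device]
  ceph-volume lvm create --bluestore --data /dev/[device]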

Re: [ceph-users] Debugging 'slow requests' ...

2019-02-11 Thread Brad Hubbard
rong/misconfigured with the new switch: we > would try to replicate the problem, possibly without a ceph deployment ... > > Thanks again for your help ! > > Cheers, Massimo > > On Sun, Feb 10, 2019 at 12:07 AM Brad Hubbard wrote: >> >> The log ends at >> >>

Re: [ceph-users] Debugging 'slow requests' ...

2019-02-09 Thread Brad Hubbard
> > 2019-02-09 07:35:14.627462 7f99972cc700 1 -- 192.168.222.204:6804/4159520 > <== osd.5 192.168.222.202:6816/157436 2527 > osd_repop(client.171725953.0:404377591 8.9b e1205833/1205735) v2 > 1050+0+123635 (1225076790 0 171428115) 0x5610f5128a00 con 0x5610fc5bf000 > 2019-02-0

Re: [ceph-users] OSD fails to start (fsck error, unable to read osd superblock)

2019-02-09 Thread Brad Hubbard
On Sun, Feb 10, 2019 at 1:56 AM Ruben Rodriguez wrote: > > Hi there, > > Running 12.2.11-1xenial on a machine with 6 SSD OSD with bluestore. > > Today we had two disks fail out of the controller, and after a reboot > they both seemed to come back fine but ceph-osd was only able to start > in one

Re: [ceph-users] Debugging 'slow requests' ...

2019-02-08 Thread Brad Hubbard
Try capturing another log with debug_ms turned up. 1 or 5 should be Ok to start with. On Fri, Feb 8, 2019 at 8:37 PM Massimo Sgaravatto wrote: > > Our Luminous ceph cluster have been worked without problems for a while, but > in the last days we have been suffering from continuous slow
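
Something like this raises the message debugging at runtime across all OSDs (1 or 5, as suggested) and turns it back off once the slow requests have been reproduced and the logs collected:

  ceph tell osd.* injectargs '--debug_ms 5'
  # reproduce the slow requests, collect /var/log/ceph/ceph-osd.*.log, then
  ceph tell osd.* injectargs '--debug_ms 0'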

Re: [ceph-users] backfill_toofull after adding new OSDs

2019-02-06 Thread Brad Hubbard
Let's try to restrict discussion to the original thread "backfill_toofull while OSDs are not full" and get a tracker opened up for this issue. On Sat, Feb 2, 2019 at 11:52 AM Fyodor Ustinov wrote: > > Hi! > > Right now, after adding OSD: > > # ceph health detail > HEALTH_ERR 74197563/199392333

Re: [ceph-users] process stuck in D state on cephfs kernel mount

2019-01-21 Thread Brad Hubbard
http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html should still be current enough and makes good reading on the subject. On Mon, Jan 21, 2019 at 8:46 PM Stijn De Weirdt wrote: > > hi marc, > > > - how to prevent the D state process to accumulate so much load? > you can't. in

Re: [ceph-users] centos 7.6 kernel panic caused by osd

2019-01-11 Thread Brad Hubbard
On Fri, Jan 11, 2019 at 8:58 PM Rom Freiman wrote: > > Same kernel :) Not exactly the point I had in mind, but sure ;) > > > On Fri, Jan 11, 2019, 12:49 Brad Hubbard wrote: >> >> Haha, in the email thread he says CentOS but the bug is opened against RHEL >>

Re: [ceph-users] centos 7.6 kernel panic caused by osd

2019-01-11 Thread Brad Hubbard
Haha, in the email thread he says CentOS but the bug is opened against RHEL :P Is it worth recommending a fix in skb_can_coalesce() upstream so other modules don't hit this? On Fri, Jan 11, 2019 at 7:39 PM Ilya Dryomov wrote: > > On Fri, Jan 11, 2019 at 1:38 AM Brad Hubbard

Re: [ceph-users] centos 7.6 kernel panic caused by osd

2019-01-10 Thread Brad Hubbard
same setup, you might be hitting the same > bug. Thanks for that Jason, I wasn't aware of that bug. I'm interested to see the details. > > On Thu, Jan 10, 2019 at 6:46 PM Brad Hubbard wrote: > > > > On Fri, Jan 11, 2019 at 12:20 AM Rom Freiman wrote: > > > >

Re: [ceph-users] centos 7.6 kernel panic caused by osd

2019-01-10 Thread Brad Hubbard
On Fri, Jan 11, 2019 at 12:20 AM Rom Freiman wrote: > > Hey, > After upgrading to centos7.6, I started encountering the following kernel > panic > > [17845.147263] XFS (rbd4): Unmounting Filesystem > [17846.860221] rbd: rbd4: capacity 3221225472 features 0x1 > [17847.109887] XFS (rbd4): Mounting

Re: [ceph-users] Compacting omap data

2019-01-03 Thread Brad Hubbard
Nautilus will make this easier. https://github.com/ceph/ceph/pull/18096 On Thu, Jan 3, 2019 at 5:22 AM Bryan Stillwell wrote: > > Recently on one of our bigger clusters (~1,900 OSDs) running Luminous > (12.2.8), we had a problem where OSDs would frequently get restarted while >

Re: [ceph-users] Ceph OOM Killer Luminous

2018-12-21 Thread Brad Hubbard
Can you provide the complete OOM message from the dmesg log? On Sat, Dec 22, 2018 at 7:53 AM Pardhiv Karri wrote: > > > Thank You for the quick response Dyweni! > > We are using FileStore as this cluster is upgraded from > Hammer-->Jewel-->Luminous 12.2.8. 16x2TB HDD per node for all nodes.
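
The kernel's OOM-killer report usually survives in the kernel ring buffer or the journal, e.g.:

  dmesg -T | grep -iE 'out of memory|oom-killer|killed process'
  # or, on systemd hosts
  journalctl -k | grep -i oom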

Re: [ceph-users] Ceph 10.2.11 - Status not working

2018-12-17 Thread Brad Hubbard
On Tue, Dec 18, 2018 at 10:23 AM Mike O'Connor wrote: > > Hi All > > I have a ceph cluster which has been working with out issues for about 2 > years now, it was upgrade about 6 month ago to 10.2.11 > > root@blade3:/var/lib/ceph/mon# ceph status > 2018-12-18 10:42:39.242217 7ff770471700 0 --

Re: [ceph-users] Crush, data placement and randomness

2018-12-06 Thread Brad Hubbard
https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf On Thu, Dec 6, 2018 at 8:11 PM Leon Robinson wrote: > > The most important thing to remember about CRUSH is that the H stands for > hashing. > > If you hash the same object you're going to get the same result. > > e.g. cat
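
That determinism is easy to see from the command line: mapping the same object name twice returns the same PG and the same OSD set for an unchanged map (pool and object names are placeholders):

  ceph osd map [pool] [object-name]
  ceph osd map [pool] [object-name]   # identical output the second time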

Re: [ceph-users] How to repair active+clean+inconsistent?

2018-11-14 Thread Brad Hubbard
> Clearly, on osd.67, the “attrs” array is empty. The question is, > how do I fix this? > > Many thanks in advance, > > -kc > > K.C. Wong > kcw...@verseon.com > M: +1 (408) 769-8235 > > - > Confidentiality Notice:

Re: [ceph-users] How to repair active+clean+inconsistent?

2018-11-11 Thread Brad Hubbard
C. Wong >> kcw...@verseon.com >> M: +1 (408) 769-8235 >> >> - >> Confidentiality Notice: >> This message contains confidential information. If you are not the >> intended recipient and received this message

Re: [ceph-users] How to repair active+clean+inconsistent?

2018-11-11 Thread Brad Hubbard
What does "rados list-inconsistent-obj " say? Note that you may have to do a deep scrub to populate the output. On Mon, Nov 12, 2018 at 5:10 AM K.C. Wong wrote: > > Hi folks, > > I would appreciate any pointer as to how I can resolve a > PG stuck in “active+clean+inconsistent” state. This has >

Re: [ceph-users] How to subscribe to developers list

2018-11-11 Thread Brad Hubbard
What do you get if you send "help" (without quotes) to majord...@vger.kernel.org? On Sun, Nov 11, 2018 at 10:15 AM Cranage, Steve <scran...@deepspacestorage.com> wrote: > Can anyone tell me the secret? A colleague tried and failed many times so > I tried and got this: > > > > > > Steve

Re: [ceph-users] OSDs crashing

2018-09-25 Thread Brad Hubbard
On Tue, Sep 25, 2018 at 11:31 PM Josh Haft wrote: > > Hi cephers, > > I have a cluster of 7 storage nodes with 12 drives each and the OSD > processes are regularly crashing. All 84 have crashed at least once in > the past two days. Cluster is Luminous 12.2.2 on CentOS 7.4.1708, > kernel version

Re: [ceph-users] PG inconsistent, "pg repair" not working

2018-09-25 Thread Brad Hubbard
On Tue, Sep 25, 2018 at 7:50 PM Sergey Malinin wrote: > > # rados list-inconsistent-obj 1.92 > {"epoch":519,"inconsistents":[]} It's likely the epoch has changed since the last scrub and you'll need to run another scrub to repopulate this data. > > Septem

Re: [ceph-users] [RGWRados]librados: Objecter returned from getxattrs r=-36

2018-09-19 Thread Brad Hubbard
Are you using filestore or bluestore on the OSDs? If filestore what is the underlying filesystem? You could try setting debug_osd and debug_filestore to 20 and see if that gives some more info? On Wed, Sep 19, 2018 at 12:36 PM fatkun chan wrote: > > > ceph version 12.2.5
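
A sketch of bumping those options at runtime on the affected OSD (id is a placeholder); remember to turn them back down afterwards, since level-20 logs grow quickly:

  ceph tell osd.[id] injectargs '--debug_osd 20 --debug_filestore 20'
  # reproduce the getxattrs error, then restore whatever levels you had before
  ceph tell osd.[id] injectargs '--debug_osd 1/5 --debug_filestore 1/5'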

Re: [ceph-users] what is Implicated osds

2018-08-20 Thread Brad Hubbard
On Tue, Aug 21, 2018 at 2:37 AM, Satish Patel wrote: > Folks, > > Today i found ceph -s is really slow and just hanging for minute or 2 > minute to give me output also same with "ceph osd tree" output, > command just hanging long time to give me output.. > > This is what i am seeing output, one

Re: [ceph-users] [Jewel 10.2.11] OSD Segmentation fault

2018-08-13 Thread Brad Hubbard
Jewel is almost EOL. It looks similar to several related issues, one of which is http://tracker.ceph.com/issues/21826 On Mon, Aug 13, 2018 at 9:19 PM, Alexandru Cucu wrote: > Hi, > > Already tried zapping the disk. Unfortunaltely the same segfaults keep > me from adding the OSD back to the

Re: [ceph-users] OSD had suicide timed out

2018-08-08 Thread Brad Hubbard
.12.125.3:0/735946 22 osd_ping(ping e13589 stamp 2018-08-08 > 10:45:33.021217) v4 2004+0+0 (3639738084 0 0) 0x55bb63bb7200 con > 0x55bb65e79800 > > Regarding heartbeat messages, all i can see on the failing osd is "heartbeat > map is healthy" before the timeout mess

Re: [ceph-users] OSD had suicide timed out

2018-08-08 Thread Brad Hubbard
Do you see "internal heartbeat not healthy" messages in the log of the osd that suicides? On Wed, Aug 8, 2018 at 5:45 PM, Brad Hubbard wrote: > What is the load like on the osd host at the time and what does the > disk utilization look like? > > Also, what does the transact
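
Those messages land in the OSD's own log, so a quick check on the affected host looks like this (OSD id is a placeholder):

  grep -E 'internal heartbeat not healthy|had (timed out|suicide timed out)' \
      /var/log/ceph/ceph-osd.[id].log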

Re: [ceph-users] OSD had suicide timed out

2018-08-08 Thread Brad Hubbard
; 'OSD::peering_tp thread 0x7fe03f52f700' had suicide timed out after 150 > 0> 2018-08-08 09:14:00.970742 7fe03f52f700 -1 *** Caught signal > (Aborted) ** > > > Could it be that the suiciding OSDs are rejecting the ping somehow? I'm > quite confused as on what's really

Re: [ceph-users] OSD had suicide timed out

2018-08-07 Thread Brad Hubbard
Try to work out why the other osds are saying this one is down. Is it because this osd is too busy to respond or something else. debug_ms = 1 will show you some message debugging which may help. On Tue, Aug 7, 2018 at 10:34 PM, Josef Zelenka wrote: > To follow up, I did some further digging

Re: [ceph-users] Bluestore OSD Segfaults (12.2.5/12.2.7)

2018-08-07 Thread Brad Hubbard
Looks like https://tracker.ceph.com/issues/21826 which is a dup of https://tracker.ceph.com/issues/20557 On Wed, Aug 8, 2018 at 1:49 AM, Thomas White wrote: > Hi all, > > We have recently begun switching over to Bluestore on our Ceph cluster, > currently on 12.2.7. We first began encountering

Re: [ceph-users] Luminous OSD crashes every few seconds: FAILED assert(0 == "past_interval end mismatch")

2018-08-01 Thread Brad Hubbard
If you don't already know why, you should investigate why your cluster could not recover after the loss of a single osd. Your solution seems valid given your description. On Thu, Aug 2, 2018 at 12:15 PM, J David wrote: > On Wed, Aug 1, 2018 at 9:53 PM, Brad Hubbard wrote: >

Re: [ceph-users] Luminous OSD crashes every few seconds: FAILED assert(0 == "past_interval end mismatch")

2018-08-01 Thread Brad Hubbard
What is the status of the cluster with this osd down and out? On Thu, Aug 2, 2018 at 5:42 AM, J David wrote: > Hello all, > > On Luminous 12.2.7, during the course of recovering from a failed OSD, > one of the other OSDs started repeatedly crashing every few seconds > with an assertion failure:

Re: [ceph-users] fyi: Luminous 12.2.7 pulled wrong osd disk, resulted in node down

2018-08-01 Thread Brad Hubbard
On Wed, Aug 1, 2018 at 10:38 PM, Marc Roos wrote: > > > Today we pulled the wrong disk from a ceph node. And that made the whole > node go down/be unresponsive. Even to a simple ping. I cannot find to > much about this in the log files. But I expect that the > /usr/bin/ceph-osd process caused a

Re: [ceph-users] OMAP warning ( again )

2018-08-01 Thread Brad Hubbard
"swift_versioning": "false", > "swift_ver_location": "", > "index_type": 0, > "mdsearch_config": [], > "reshard_status": 0, > "new_bucket_instance_id&quo

Re: [ceph-users] OMAP warning ( again )

2018-07-31 Thread Brad Hubbard
Search the cluster log for 'Large omap object found' for more details. On Wed, Aug 1, 2018 at 3:50 AM, Brent Kennedy wrote: > Upgraded from 12.2.5 to 12.2.6, got a “1 large omap objects” warning > message, then upgraded to 12.2.7 and the message went away. I just added > four OSDs to balance
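
The warning details land in the cluster log on the monitors (and in the log of the OSD that found the object), so something like the following shows the pool, object, and key count; the log path may differ by deployment, and the radosgw-admin check only applies if the objects turn out to be RGW bucket index shards:

  grep 'Large omap object found' /var/log/ceph/ceph.log
  # if the objects are RGW bucket index shards, per-bucket shard stats help too
  radosgw-admin bucket limit check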

Re: [ceph-users] Self shutdown of 1 whole system (Derbian stretch/Ceph 12.2.7/bluestore)

2018-07-23 Thread Brad Hubbard
Ceph doesn't shut down systems as in kill or reboot the box if that's what you're saying? On Mon, Jul 23, 2018 at 5:04 PM, Nicolas Huillard wrote: > Le lundi 23 juillet 2018 à 11:07 +0700, Konstantin Shalygin a écrit : >> > I even have no fancy kernel or device, just real standard Debian. >> >

Re: [ceph-users] active+clean+inconsistent PGs after upgrade to 12.2.7

2018-07-19 Thread Brad Hubbard
I've updated the tracker. On Thu, Jul 19, 2018 at 7:51 PM, Robert Sander wrote: > On 19.07.2018 11:15, Ronny Aasen wrote: > >> Did you upgrade from 12.2.5 or 12.2.6 ? > > Yes. > >> sounds like you hit the reason for the 12.2.7 release >> >> read :

Re: [ceph-users] Omap warning in 12.2.6

2018-07-19 Thread Brad Hubbard
Search the cluster log for 'Large omap object found' for more details. On Fri, Jul 20, 2018 at 5:13 AM, Brent Kennedy wrote: > I just upgraded our cluster to 12.2.6 and now I see this warning about 1 > large omap object. I looked and it seems this warning was just added in > 12.2.6. I found a

Re: [ceph-users] Recovery from 12.2.5 (corruption) -> 12.2.6 (hair on fire) -> 13.2.0 (some objects inaccessible and CephFS damaged)

2018-07-18 Thread Brad Hubbard
On Thu, Jul 19, 2018 at 12:47 PM, Troy Ablan wrote: > > > On 07/18/2018 06:37 PM, Brad Hubbard wrote: >> On Thu, Jul 19, 2018 at 2:48 AM, Troy Ablan wrote: >>> >>> >>> On 07/17/2018 11:14 PM, Brad Hubbard wrote: >>>> >>>>

Re: [ceph-users] Recovery from 12.2.5 (corruption) -> 12.2.6 (hair on fire) -> 13.2.0 (some objects inaccessible and CephFS damaged)

2018-07-18 Thread Brad Hubbard
On Thu, Jul 19, 2018 at 2:48 AM, Troy Ablan wrote: > > > On 07/17/2018 11:14 PM, Brad Hubbard wrote: >> >> On Wed, Jul 18, 2018 at 2:57 AM, Troy Ablan wrote: >>> >>> I was on 12.2.5 for a couple weeks and started randomly seeing >>> corruption, m

Re: [ceph-users] Recovery from 12.2.5 (corruption) -> 12.2.6 (hair on fire) -> 13.2.0 (some objects inaccessible and CephFS damaged)

2018-07-18 Thread Brad Hubbard
On Wed, Jul 18, 2018 at 2:57 AM, Troy Ablan wrote: > I was on 12.2.5 for a couple weeks and started randomly seeing > corruption, moved to 12.2.6 via yum update on Sunday, and all hell broke > loose. I panicked and moved to Mimic, and when that didn't solve the > problem, only then did I start

Re: [ceph-users] Jewel PG stuck inconsistent with 3 0-size objects

2018-07-16 Thread Brad Hubbard
Your issue is different since not only do the omap digests of all replicas not match the omap digest from the auth object info but they are all different to each other. What is min_size of pool 67 and what can you tell us about the events leading up to this? On Mon, Jul 16, 2018 at 7:06 PM,

Re: [ceph-users] Slow requests

2018-07-09 Thread Brad Hubbard
rnel exhibiting the problem. > > kind regards > > Ben > >> Brad Hubbard wrote on 5 July 2018 at 01:16: >> >> >> On Wed, Jul 4, 2018 at 6:26 PM, Benjamin Naber >> wrote: >> > Hi @all, >> > >> > im currently in testing for

Re: [ceph-users] Slow requests

2018-07-04 Thread Brad Hubbard
On Wed, Jul 4, 2018 at 6:26 PM, Benjamin Naber wrote: > Hi @all, > > im currently in testing for setup an production environment based on the > following OSD Nodes: > > CEPH Version: luminous 12.2.5 > > 5x OSD Nodes with following specs: > > - 8 Core Intel Xeon 2,0 GHZ > > - 96GB Ram > > - 10x

Re: [ceph-users] fixing unrepairable inconsistent PG

2018-06-28 Thread Brad Hubbard
provide from the time leading up to when the issue was first seen? > > Cheers > > Andrei > - Original Message - >> From: "Brad Hubbard" >> To: "Andrei Mikhailovsky" >> Cc: "ceph-users" >> Sent: Thursday, 28 June, 2018 01:

Re: [ceph-users] fixing unrepairable inconsistent PG

2018-06-27 Thread Brad Hubbard
uot;key" : "", >"oid" : ".dir.default.80018061.2", >"namespace" : "", >"snapid" : -2, >"max" : 0 > }, > "truncate_size" : 0, > &qu
