On Thu, Jan 9, 2020 at 5:48 AM Peter Eisch
wrote:
> Hi,
>
> This morning one of my three monitor hosts got booted from the Nautilus
> 14.2.4 cluster and it won’t regain. There haven’t been any changes, or
> events at this site at all. The conf file is the [unchanged] and the same
> as the other
I'd suggest you open a tracker under the Bluestore component so
someone can take a look. I'd also suggest including a log with
'debug_bluestore=20' added to the COT (ceph-objectstore-tool) command line.
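A rough sketch of what that invocation might look like (the data path, op, and log path below are placeholders, not taken from the thread; ceph-objectstore-tool accepts the usual Ceph config overrides on its command line):

```shell
# sketch with placeholder paths: run ceph-objectstore-tool (COT) against the
# stopped OSD with bluestore debugging turned up, capturing output to a file
ceph-objectstore-tool \
  --data-path /var/lib/ceph/osd/ceph-0 \
  --op fsck \
  --debug_bluestore=20 \
  --log-file /tmp/cot-bluestore.log
```

The resulting log file is what would be attached to the tracker.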
On Thu, Nov 7, 2019 at 6:56 PM Eugene de Beste wrote:
>
> Hi, does anyone have any feedback for me regarding this?
> up ([27,30,38], p27) acting ([30,25], p30)
>
> I also checked the logs of all OSDs already done and got the same logs
> about this object :
> * osd.4, last time : 2019-10-10 16:15:20
> * osd.32, last time : 2019-10-14 01:54:56
> * osd.33, last time : 2019-10-11 06:24:01
>
On Tue, Oct 29, 2019 at 9:09 PM Jérémy Gardais
wrote:
>
> Thus spake Brad Hubbard (bhubb...@redhat.com) on Tuesday 29 October 2019 at
> 08:20:31:
> > Yes, try and get the pgs healthy, then you can just re-provision the down
> > OSDs.
> >
> > Run a scrub
Yes, try and get the pgs healthy, then you can just re-provision the down OSDs.
Run a scrub on each of these pgs and then use the commands on the
following page to find out more information for each case.
https://docs.ceph.com/docs/luminous/rados/troubleshooting/troubleshooting-pg/
Focus on the
ashpspool stripe_width 0 application cephfs
This looked like something min_size 1 could cause, but I guess that's
not the cause here.
> so inconsistents is empty, which is weird, no?
Try scrubbing the pg just before running the command.
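As a sketch of that scrub-then-inspect sequence (the pgid below is a placeholder, not one from the thread):

```shell
# placeholder pgid: scrub first so the inconsistency data is current
ceph pg deep-scrub 2.5
# wait for the scrub to finish (watch `ceph -s`), then inspect:
rados list-inconsistent-obj 2.5 --format=json-pretty
rados list-inconsistent-snapset 2.5 --format=json-pretty
```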
>
> Thanks again!
>
> K
>
>
> On 10/10/2019
Does pool 6 have min_size = 1 set?
https://tracker.ceph.com/issues/24994#note-5 would possibly be helpful
here, depending on what the output of the following command looks
like.
# rados list-inconsistent-obj [pgid] --format=json-pretty
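The JSON that command emits can be post-processed to get a quick summary; a minimal sketch (the sample document below is fabricated stand-in data, not real cluster output):

```shell
# count the entries in a saved list-inconsistent-obj dump;
# sample.json is a fabricated stand-in for real command output
cat > sample.json <<'EOF'
{"epoch": 519, "inconsistents": [{"object": {"name": "obj1", "snap": -2}}]}
EOF
python3 - <<'EOF'
import json

with open("sample.json") as f:
    doc = json.load(f)
# the epoch matters: if it is stale, re-scrub before trusting the list
print("epoch:", doc["epoch"])
print("inconsistent objects:", len(doc["inconsistents"]))
EOF
```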
On Thu, Oct 10, 2019 at 8:16 PM Kenneth Waegeman
wrote:
>
Awesome! Sorry it took so long.
On Thu, Oct 10, 2019 at 12:44 AM Marc Roos wrote:
>
>
> Brad, many thanks!!! My cluster has finally been HEALTH_OK after 1.5 years
> or so! :)
>
>
> -Original Message-
> Subject: Re: Ceph pg repair clone_missing?
>
> On Fri, Oct 4, 2019 at 6:09 PM Marc Roos
>
On Fri, Oct 4, 2019 at 6:09 PM Marc Roos wrote:
>
> >
> >Try something like the following on each OSD that holds a copy of
> >rbd_data.1f114174b0dc51.0974 and see what output you get.
> >Note that you can drop the bluestore flag if they are not bluestore
> >osds and you will need
On Thu, Oct 3, 2019 at 6:46 PM Marc Roos wrote:
>
> >
> >>
> >> I was following the thread where you advised on this pg repair
> >>
> >> I ran these rados 'list-inconsistent-obj'/'rados
> >> list-inconsistent-snapset' and have output on the snapset. I tried
> to
> >> extrapolate your
On Wed, Oct 2, 2019 at 9:00 PM Marc Roos wrote:
>
>
>
> Hi Brad,
>
> I was following the thread where you advised on this pg repair
>
> I ran these rados 'list-inconsistent-obj'/'rados
> list-inconsistent-snapset' and have output on the snapset. I tried to
> extrapolate your comment on the
9 at 8:03 AM Sasha Litvak
> wrote:
>>
>> It was hardware indeed. Dell server reported a disk being reset with power
>> on. Checking the usual suspects i.e. controller firmware, controller event
>> log (if I can get one), drive firmware.
>> I will report more when I g
On Wed, Oct 2, 2019 at 1:15 AM Mattia Belluco wrote:
>
> Hi Jake,
>
> I am curious to see if your problem is similar to ours (despite the fact
> we are still on Luminous).
>
> Could you post the output of:
>
> rados list-inconsistent-obj
>
> and
>
> rados list-inconsistent-snapset
Make sure
On Tue, Oct 1, 2019 at 10:43 PM Del Monaco, Andrea <
andrea.delmon...@atos.net> wrote:
> Hi list,
>
> After the nodes ran OOM and after reboot, we are not able to restart the
> ceph-osd@x services anymore. (Details about the setup at the end).
>
> I am trying to do this manually, so we can see
Removed ceph-de...@vger.kernel.org and added d...@ceph.io
On Tue, Oct 1, 2019 at 4:26 PM Alex Litvak wrote:
>
> Hello everyone,
>
> Can you shed some light on the cause of the crash? Could a client
> request actually trigger it?
>
> Sep 30 22:52:58 storage2n2-la ceph-osd-17[10770]: 2019-09-30
On Tue, Sep 24, 2019 at 10:51 PM M Ranga Swami Reddy
wrote:
>
> Interestingly - "rados list-inconsistent-obj ${PG} --format=json" is not
> showing any inconsistent objects.
> And "rados list-missing-obj ${PG} --format=json" is also not showing any
> missing or unfound objects.
Complete a
On Thu, Sep 12, 2019 at 1:52 AM Benjamin Tayehanpour
wrote:
>
> Greetings!
>
> I had an OSD down, so I ran ceph osd status and got this:
>
> [root@ceph1 ~]# ceph osd status
> Error EINVAL: Traceback (most recent call last):
> File "/usr/lib64/ceph/mgr/status/module.py", line 313, in
On Wed, Sep 4, 2019 at 9:42 PM Andras Pataki
wrote:
>
> Dear ceph users,
>
> After upgrading our ceph-fuse clients to 14.2.2, we've been seeing sporadic
> segfaults with not super revealing stack traces:
>
> in thread 7fff5a7fc700 thread_name:ceph-fuse
>
> ceph version 14.2.2
https://tracker.ceph.com/issues/38724
On Fri, Aug 23, 2019 at 10:18 PM Paul Emmerich wrote:
>
> I've seen that before (but never on Nautilus), there's already an
> issue at tracker.ceph.com but I don't recall the id or title.
>
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph
https://tracker.ceph.com/issues/41255 is probably reporting the same issue.
On Thu, Aug 22, 2019 at 6:31 PM Lars Täuber wrote:
>
> Hi there!
>
> We also experience this behaviour of our cluster while it is moving pgs.
>
> # ceph health detail
> HEALTH_ERR 1 MDSs report slow metadata IOs; Reduced
On Thu, Aug 15, 2019 at 2:09 AM Troy Ablan wrote:
>
> Paul,
>
> Thanks for the reply. All of these seemed to fail except for pulling
> the osdmap from the live cluster.
>
> -Troy
>
> -[~:#]- ceph-objectstore-tool --op get-osdmap --data-path
> /var/lib/ceph/osd/ceph-45/ --file osdmap45
>
On Thu, Aug 15, 2019 at 2:09 AM Troy Ablan wrote:
>
> Paul,
>
> Thanks for the reply. All of these seemed to fail except for pulling
> the osdmap from the live cluster.
>
> -Troy
>
> -[~:#]- ceph-objectstore-tool --op get-osdmap --data-path
> /var/lib/ceph/osd/ceph-45/ --file osdmap45
>
Could you create a tracker for this?
Also, if you can reproduce it, could you gather a log with
debug_osd=20? That should show us the superblock it was trying to
decode, as well as additional details.
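Something like the following would raise the debug level (the osd id is a placeholder); for a crash at startup the ceph.conf route is the one that matters, since injectargs only reaches a running daemon:

```shell
# raise debug_osd on a running daemon (placeholder osd id)
ceph tell osd.12 injectargs '--debug_osd 20/20'
# for a daemon that crashes at startup, persist it in ceph.conf
# under [osd] instead, then restart:
#   debug osd = 20/20
# the log usually lands in /var/log/ceph/ceph-osd.12.log
```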
On Mon, Aug 12, 2019 at 6:29 AM huxia...@horebdata.cn
wrote:
>
> Dear folks,
>
> I had an OSD
-63> 2019-08-07 00:51:52.861 7fe987e49700 1 heartbeat_map
clear_timeout 'OSD::osd_op_tp thread 0x7fe987e49700' had suicide timed
out after 150
You hit a suicide timeout, that's fatal. On line 80 the process kills
the thread based on the assumption it's hung.
src/common/HeartbeatMap.cc:
66
I'd suggest creating a tracker similar to
http://tracker.ceph.com/issues/40554 which was created for the issue
in the thread you mentioned.
On Wed, Jul 3, 2019 at 12:29 AM Vandeir Eduardo
wrote:
>
> Hi,
>
> on client machines, when I use the command rbd, for example, rbd ls
> poolname, this
> application is responsible for any locking needed.
> -Greg
>
> On Tue, Jul 2, 2019 at 3:49 AM Brad Hubbard wrote:
> >
> > Yes, this should be possible using an object class which is also a
> > RADOS client (via the RADOS API). You'll still have some client
> >
>>> Thank you for your response, and we will check this video as well.
>>> Our requirement is while writing an object into the cluster , if we can
>>> provide number of copies to be made , the network consumption between
>>> client and cluster will be only for one object write.
On Thu, Jun 27, 2019 at 8:58 PM nokia ceph wrote:
>
> Hi Team,
>
> We have a requirement to create multiple copies of an object and currently we
> are handling it in client side to write as separate objects and this causes
> huge network traffic between client and cluster.
> Is there
relating to the clearing in mon, mgr, or osd logs.
> >
> > So, not entirely sure what fixed it, but it is resolved on its own.
> >
> > Thanks,
> >
> > Reed
> >
> > On Apr 30, 2019, at 8:01 PM, Brad Hubbard wrote:
> >
> > On Wed, May 1, 2019 at
On Wed, May 1, 2019 at 10:54 AM Brad Hubbard wrote:
>
> Which size is correct?
Sorry, accidental discharge =D
If the object info size is *incorrect* try forcing a write to the OI
with something like the following.
1. rados -p [name_of_pool_17] setomapval 10008536718.
tempora
Which size is correct?
On Tue, Apr 30, 2019 at 1:06 AM Reed Dier wrote:
>
> Hi list,
>
> Woke up this morning to two PG's reporting scrub errors, in a way that I
> haven't seen before.
>
> $ ceph versions
> {
> "mon": {
> "ceph version 13.2.5
>
> Best,
> Can Zhang
>
>
> On Fri, Apr 19, 2019 at 6:28 PM Brad Hubbard wrote:
> >
> > OK. So this works for me with master commit
> > bdaac2d619d603f53a16c07f9d7bd47751137c4c on Centos 7.5.1804.
> >
> > I cloned the repo and ran './install-deps.sh'
If you can give me specific steps so I can reproduce this
from a freshly cloned tree I'd be happy to look further into it.
Good luck.
On Thu, Apr 18, 2019 at 7:00 PM Brad Hubbard wrote:
>
> Let me try to reproduce this on centos 7.5 with master and I'll let
> you know how I go.
>
>
> Notice the "U" and "V" from nm results.
>
>
>
>
> Best,
> Can Zhang
>
> On Thu, Apr 18, 2019 at 9:36 AM Brad Hubbard wrote:
> >
> > Does it define _ZTIN13PriorityCache8PriCacheE ? If it does, and all is
> > as you say, then it
:15 libceph-common.so ->
> libceph-common.so.0
> -rwxr-xr-x. 1 root root 211853400 Apr 17 11:15 libceph-common.so.0
>
>
>
>
> Best,
> Can Zhang
>
> On Thu, Apr 18, 2019 at 7:00 AM Brad Hubbard wrote:
> >
> > On Wed, Apr 17, 2019 at 1:37 PM Can Zhang w
On Wed, Apr 17, 2019 at 1:37 PM Can Zhang wrote:
>
> Thanks for your suggestions.
>
> I tried to build libfio_ceph_objectstore.so, but it fails to load:
>
> ```
> $ LD_LIBRARY_PATH=./lib ./bin/fio --enghelp=libfio_ceph_objectstore.so
>
> fio: engine libfio_ceph_objectstore.so not loadable
> IO
puzzled why it doesn't show any change when I run this no matter
> what I set it to:
>
> # ceph -n osd.1 --show-config | grep osd_recovery_max_active
> osd_recovery_max_active = 3
>
> in fact it doesn't matter if I use an OSD number that doesn't exist, same
> thing if I use c
On Tue, Apr 16, 2019 at 6:03 PM Paul Emmerich wrote:
>
> This works, it just says that it *might* require a restart, but this
> particular option takes effect without a restart.
We've already looked at changing the wording once to make it more palatable.
http://tracker.ceph.com/issues/18424
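To confirm whether the change actually took effect despite the "(not observed)" warning, the admin socket reports the live value; a sketch (osd.0 is a placeholder, and the command must run on the host where that daemon lives):

```shell
# query the running daemon directly instead of relying on --show-config
ceph daemon osd.0 config get osd_recovery_max_active
```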
>
On Tue, Apr 16, 2019 at 7:38 AM solarflow99 wrote:
>
> Then why doesn't this work?
>
> # ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4'
> osd.0: osd_recovery_max_active = '4' (not observed, change may require
> restart)
> osd.1: osd_recovery_max_active = '4' (not observed, change may
If you want to do containers at the same time, or transition some/all
to containers at some point in future maybe something based on
kubevirt [1] would be more futureproof?
[1] http://kubevirt.io/
CNV is an example,
https://www.redhat.com/en/resources/container-native-virtualization
On Sat, Apr
ed+inconsistent+peering, and the other peer is active+clean+inconsistent
Per the document I linked previously if a pg remains remapped you
likely have a problem with your configuration. Take a good look at
your crushmap, pg distribution, pool configuration, etc.
>
>
> On Wed, Mar 27, 2019 at 4:1
{
> "osd": "7",
> "status": "not queried"
> },
> {
> "osd": "8",
> "status": "already probed"
> },
>
https://bugzilla.redhat.com/show_bug.cgi?id=1662496
On Wed, Mar 27, 2019 at 5:00 AM Andrew J. Hutton
wrote:
>
> More or less followed the install instructions with modifications as
> needed; but I'm suspecting that either a dependency was missed in the
> F29 package or something else is up. I
ther OSDs appear to be ok, I see
> them up and in, why do you see something wrong?
>
> On Mon, Mar 25, 2019 at 4:00 PM Brad Hubbard wrote:
>>
>> Hammer is no longer supported.
>>
>> What's the status of osds 7 and 17?
>>
>> On Tue, Mar 26, 2019 at 8:56 A
"last_epoch_clean": 20840,
> "parent": "0.0",
> "parent_split_bits": 0,
> "last_scrub": "21395'11835365",
> "last_scrub_stamp": "20
It would help to know what version you are running but, to begin with,
could you post the output of the following?
$ sudo ceph pg 10.2a query
$ sudo rados list-inconsistent-obj 10.2a --format=json-pretty
Also, have a read of
Do a "ps auwwx" to see how a running monitor was started and use the
equivalent command to try to start the MON that won't start. "ceph-mon
--help" will show you what you need. Most important is to get the ID
portion right and to add "-d" to get it to run in the foreground and
log to stdout. HTH
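A sketch of that sequence (the mon id `a` and the conf path are placeholders; substitute whatever the `ps` output shows for your running monitors):

```shell
# see how the running monitors were started
ps auwwx | grep '[c]eph-mon'
# then start the broken one in the foreground, logging to stdout
ceph-mon -i a --conf /etc/ceph/ceph.conf -d
```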
21 16:51:56.862447",
> "age": 376.527241,
> "duration": 1.331278,
>
> Kind regards,
> Glen Baars
>
> -Original Message-
> From: Brad Hubbard
> Sent: Thursday, 21 March 2019 1:43 PM
> To: Glen Baars
> Cc: cep
Actually, the lag is between "sub_op_committed" and "commit_sent". Is
there any pattern to these slow requests? Do they involve the same
osd, or set of osds?
On Thu, Mar 21, 2019 at 3:37 PM Brad Hubbard wrote:
>
> On Thu, Mar 21, 2019 at 3:20 PM Glen Baars
> wrote:
>
> Does anyone know what that section is waiting for?
Hi Glen,
These are documented, to some extent, here.
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/
It looks like it may be taking a long time to communicate the commit
message back to the client? Are these sl
On Thu, Mar 21, 2019 at 12:11 AM Glen Baars wrote:
>
> Hello Ceph Users,
>
>
>
> Does anyone know what the flag point ‘Started’ is? Is that ceph osd daemon
> waiting on the disk subsystem?
This is set by "mark_started()", roughly when the pg starts
processing the op. Might want to
On Tue, Mar 19, 2019 at 7:54 PM Zhenshi Zhou wrote:
>
> Hi,
>
> I mount cephfs on my client servers. Some of the servers mount without any
> error whereas others don't.
>
> The error:
> # ceph-fuse -n client.kvm -m ceph.somedomain.com:6789 /mnt/kvm -r /kvm -d
> 2019-03-19 17:03:29.136
On Fri, Mar 8, 2019 at 4:46 AM Samuel Taylor Liston wrote:
>
> Hello All,
> I have recently had 32 large map objects appear in my default.rgw.log
> pool. Running luminous 12.2.8.
>
> Not sure what to think about these. I’ve done a lot of reading
> about how when these
you could try reading the data from this object and write it again
using rados get then rados put.
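As a sketch (the pool and object names below are placeholders, since the real object name in the thread is truncated):

```shell
# placeholder pool/object names: read the object out, then write it back
rados -p rbd get rbd_data.someprefix.0000000000000974 /tmp/obj
rados -p rbd put rbd_data.someprefix.0000000000000974 /tmp/obj
```

Rewriting the object this way regenerates its on-disk copies, which can clear some inconsistency states.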
On Fri, Mar 8, 2019 at 3:32 AM Herbert Alexander Faleiros
wrote:
>
> On Thu, Mar 07, 2019 at 01:37:55PM -0300, Herbert Alexander Faleiros wrote:
> > Hi,
> >
> > # ceph health detail
> > HEALTH_ERR
+Jos Collin
On Thu, Mar 7, 2019 at 9:41 AM Milanov, Radoslav Nikiforov
wrote:
> Can someone elaborate on
>
>
>
> From http://tracker.ceph.com/issues/38122
>
>
>
> Which exactly package is missing?
>
> And why is this happening ? In Mimic all dependencies are resolved by yum?
>
> - Rado
>
>
>
A single OSD should be expendable and you should be able to just "zap"
it and recreate it. Was this not true in your case?
On Wed, Feb 13, 2019 at 1:27 AM Ruben Rodriguez wrote:
>
>
>
> On 2/9/19 5:40 PM, Brad Hubbard wrote:
> > On Sun, Feb 10, 2019 at 1:
rong/misconfigured with the new switch: we
> would try to replicate the problem, possibly without a ceph deployment ...
>
> Thanks again for your help !
>
> Cheers, Massimo
>
> On Sun, Feb 10, 2019 at 12:07 AM Brad Hubbard wrote:
>>
>> The log ends at
>>
>>
>
> 2019-02-09 07:35:14.627462 7f99972cc700 1 -- 192.168.222.204:6804/4159520
> <== osd.5 192.168.222.202:6816/157436 2527
> osd_repop(client.171725953.0:404377591 8.9b e1205833/1205735) v2
> 1050+0+123635 (1225076790 0 171428115) 0x5610f5128a00 con 0x5610fc5bf000
> 2019-02-0
On Sun, Feb 10, 2019 at 1:56 AM Ruben Rodriguez wrote:
>
> Hi there,
>
> Running 12.2.11-1xenial on a machine with 6 SSD OSD with bluestore.
>
> Today we had two disks fail out of the controller, and after a reboot
> they both seemed to come back fine but ceph-osd was only able to start
> in one
Try capturing another log with debug_ms turned up. 1 or 5 should be Ok
to start with.
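A sketch of capturing that (placeholder osd id; these logs grow quickly, so turn the level back down once the problem has been reproduced):

```shell
# raise messenger debugging on the affected osd (placeholder id)
ceph tell osd.3 injectargs '--debug_ms 5'
# ...reproduce the problem, then lower it again:
ceph tell osd.3 injectargs '--debug_ms 0'
```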
On Fri, Feb 8, 2019 at 8:37 PM Massimo Sgaravatto
wrote:
>
> Our Luminous ceph cluster have been worked without problems for a while, but
> in the last days we have been suffering from continuous slow
Let's try to restrict discussion to the original thread
"backfill_toofull while OSDs are not full" and get a tracker opened up
for this issue.
On Sat, Feb 2, 2019 at 11:52 AM Fyodor Ustinov wrote:
>
> Hi!
>
> Right now, after adding OSD:
>
> # ceph health detail
> HEALTH_ERR 74197563/199392333
http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html
should still be current enough and makes good reading on the subject.
On Mon, Jan 21, 2019 at 8:46 PM Stijn De Weirdt wrote:
>
> hi marc,
>
> > - how to prevent the D state process to accumulate so much load?
> you can't. in
On Fri, Jan 11, 2019 at 8:58 PM Rom Freiman wrote:
>
> Same kernel :)
Not exactly the point I had in mind, but sure ;)
>
>
> On Fri, Jan 11, 2019, 12:49 Brad Hubbard wrote:
>>
>> Haha, in the email thread he says CentOS but the bug is opened against RHEL
>>
Haha, in the email thread he says CentOS but the bug is opened against RHEL :P
Is it worth recommending a fix in skb_can_coalesce() upstream so other
modules don't hit this?
On Fri, Jan 11, 2019 at 7:39 PM Ilya Dryomov wrote:
>
> On Fri, Jan 11, 2019 at 1:38 AM Brad Hubbard
same setup, you might be hitting the same
> bug.
Thanks for that Jason, I wasn't aware of that bug. I'm interested to
see the details.
>
> On Thu, Jan 10, 2019 at 6:46 PM Brad Hubbard wrote:
> >
> > On Fri, Jan 11, 2019 at 12:20 AM Rom Freiman wrote:
> > >
>
On Fri, Jan 11, 2019 at 12:20 AM Rom Freiman wrote:
>
> Hey,
> After upgrading to centos7.6, I started encountering the following kernel
> panic
>
> [17845.147263] XFS (rbd4): Unmounting Filesystem
> [17846.860221] rbd: rbd4: capacity 3221225472 features 0x1
> [17847.109887] XFS (rbd4): Mounting
Nautilus will make this easier.
https://github.com/ceph/ceph/pull/18096
On Thu, Jan 3, 2019 at 5:22 AM Bryan Stillwell wrote:
>
> Recently on one of our bigger clusters (~1,900 OSDs) running Luminous
> (12.2.8), we had a problem where OSDs would frequently get restarted while
>
Can you provide the complete OOM message from the dmesg log?
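One way to dig that out (the match string is the kernel's usual OOM-kill marker; paths vary by distro):

```shell
# pull the kernel's OOM report from the ring buffer, with context
dmesg -T | grep -i -B 5 -A 30 'invoked oom-killer'
# if the ring buffer has wrapped, check syslog instead
grep -i -A 30 'invoked oom-killer' /var/log/messages
```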
On Sat, Dec 22, 2018 at 7:53 AM Pardhiv Karri wrote:
>
>
> Thank You for the quick response Dyweni!
>
> We are using FileStore as this cluster is upgraded from
> Hammer-->Jewel-->Luminous 12.2.8. 16x2TB HDD per node for all nodes.
On Tue, Dec 18, 2018 at 10:23 AM Mike O'Connor wrote:
>
> Hi All
>
> I have a ceph cluster which has been working with out issues for about 2
> years now, it was upgrade about 6 month ago to 10.2.11
>
> root@blade3:/var/lib/ceph/mon# ceph status
> 2018-12-18 10:42:39.242217 7ff770471700 0 --
https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf
On Thu, Dec 6, 2018 at 8:11 PM Leon Robinson wrote:
>
> The most important thing to remember about CRUSH is that the H stands for
> hashing.
>
> If you hash the same object you're going to get the same result.
>
> e.g. cat
> Clearly, on osd.67, the “attrs” array is empty. The question is,
> how do I fix this?
>
> Many thanks in advance,
>
> -kc
>
> K.C. Wong
> kcw...@verseon.com
> M: +1 (408) 769-8235
>
> -
> Confidentiality Notice:
>> K.C. Wong
>> kcw...@verseon.com
>> M: +1 (408) 769-8235
>>
>> -
>> Confidentiality Notice:
>> This message contains confidential information. If you are not the
>> intended recipient and received this message
What does "rados list-inconsistent-obj <pgid>" say?
Note that you may have to do a deep scrub to populate the output.
On Mon, Nov 12, 2018 at 5:10 AM K.C. Wong wrote:
>
> Hi folks,
>
> I would appreciate any pointer as to how I can resolve a
> PG stuck in “active+clean+inconsistent” state. This has
>
What do you get if you send "help" (without quotes) to
majord...@vger.kernel.org ?
On Sun, Nov 11, 2018 at 10:15 AM Cranage, Steve <
scran...@deepspacestorage.com> wrote:
> Can anyone tell me the secret? A colleague tried and failed many times so
> I tried and got this:
>
>
>
>
>
> Steve
On Tue, Sep 25, 2018 at 11:31 PM Josh Haft wrote:
>
> Hi cephers,
>
> I have a cluster of 7 storage nodes with 12 drives each and the OSD
> processes are regularly crashing. All 84 have crashed at least once in
> the past two days. Cluster is Luminous 12.2.2 on CentOS 7.4.1708,
> kernel version
On Tue, Sep 25, 2018 at 7:50 PM Sergey Malinin wrote:
>
> # rados list-inconsistent-obj 1.92
> {"epoch":519,"inconsistents":[]}
It's likely the epoch has changed since the last scrub and you'll need
to run another scrub to repopulate this data.
>
> Septem
Are you using filestore or bluestore on the OSDs? If filestore what is
the underlying filesystem?
You could try setting debug_osd and debug_filestore to 20 and see if
that gives some more info?
On Wed, Sep 19, 2018 at 12:36 PM fatkun chan wrote:
>
>
> ceph version 12.2.5
On Tue, Aug 21, 2018 at 2:37 AM, Satish Patel wrote:
> Folks,
>
> Today i found ceph -s is really slow and just hanging for minute or 2
> minute to give me output also same with "ceph osd tree" output,
> command just hanging long time to give me output..
>
> This is what i am seeing output, one
Jewel is almost EOL.
It looks similar to several related issues, one of which is
http://tracker.ceph.com/issues/21826
On Mon, Aug 13, 2018 at 9:19 PM, Alexandru Cucu wrote:
> Hi,
>
> Already tried zapping the disk. Unfortunately the same segfaults keep
> me from adding the OSD back to the
.12.125.3:0/735946 22 osd_ping(ping e13589 stamp 2018-08-08
> 10:45:33.021217) v4 2004+0+0 (3639738084 0 0) 0x55bb63bb7200 con
> 0x55bb65e79800
>
> Regarding heartbeat messages, all i can see on the failing osd is "heartbeat
> map is healthy" before the timeout mess
Do you see "internal heartbeat not healthy" messages in the log of the
osd that suicides?
On Wed, Aug 8, 2018 at 5:45 PM, Brad Hubbard wrote:
> What is the load like on the osd host at the time and what does the
> disk utilization look like?
>
> Also, what does the transact
> 'OSD::peering_tp thread 0x7fe03f52f700' had suicide timed out after 150
> 0> 2018-08-08 09:14:00.970742 7fe03f52f700 -1 *** Caught signal
> (Aborted) **
>
>
> Could it be that the suiciding OSDs are rejecting the ping somehow? I'm
> quite confused as on what's really
Try to work out why the other osds are saying this one is down. Is it
because this osd is too busy to respond or something else.
debug_ms = 1 will show you some message debugging which may help.
On Tue, Aug 7, 2018 at 10:34 PM, Josef Zelenka
wrote:
> To follow up, I did some further digging
Looks like https://tracker.ceph.com/issues/21826 which is a dup of
https://tracker.ceph.com/issues/20557
On Wed, Aug 8, 2018 at 1:49 AM, Thomas White wrote:
> Hi all,
>
> We have recently begun switching over to Bluestore on our Ceph cluster,
> currently on 12.2.7. We first began encountering
If you don't already know why, you should investigate why your cluster
could not recover after the loss of a single osd.
Your solution seems valid given your description.
On Thu, Aug 2, 2018 at 12:15 PM, J David wrote:
> On Wed, Aug 1, 2018 at 9:53 PM, Brad Hubbard wrote:
>
What is the status of the cluster with this osd down and out?
On Thu, Aug 2, 2018 at 5:42 AM, J David wrote:
> Hello all,
>
> On Luminous 12.2.7, during the course of recovering from a failed OSD,
> one of the other OSDs started repeatedly crashing every few seconds
> with an assertion failure:
On Wed, Aug 1, 2018 at 10:38 PM, Marc Roos wrote:
>
>
> Today we pulled the wrong disk from a ceph node. And that made the whole
> node go down/be unresponsive. Even to a simple ping. I cannot find to
> much about this in the log files. But I expect that the
> /usr/bin/ceph-osd process caused a
"swift_versioning": "false",
> "swift_ver_location": "",
> "index_type": 0,
> "mdsearch_config": [],
> "reshard_status": 0,
> "new_bucket_instance_id"
Search the cluster log for 'Large omap object found' for more details.
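For example (default cluster log path shown; adjust for a non-default cluster name):

```shell
# the detail lands in the cluster log, not the per-daemon logs
grep 'Large omap object found' /var/log/ceph/ceph.log
# older occurrences may have been rotated away
zgrep 'Large omap object found' /var/log/ceph/ceph.log-*.gz 2>/dev/null
```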
On Wed, Aug 1, 2018 at 3:50 AM, Brent Kennedy wrote:
> Upgraded from 12.2.5 to 12.2.6, got a “1 large omap objects” warning
> message, then upgraded to 12.2.7 and the message went away. I just added
> four OSDs to balance
Ceph doesn't shut down systems as in kill or reboot the box if that's
what you're saying?
On Mon, Jul 23, 2018 at 5:04 PM, Nicolas Huillard wrote:
> Le lundi 23 juillet 2018 à 11:07 +0700, Konstantin Shalygin a écrit :
>> > I even have no fancy kernel or device, just real standard Debian.
>> >
I've updated the tracker.
On Thu, Jul 19, 2018 at 7:51 PM, Robert Sander
wrote:
> On 19.07.2018 11:15, Ronny Aasen wrote:
>
>> Did you upgrade from 12.2.5 or 12.2.6 ?
>
> Yes.
>
>> sounds like you hit the reason for the 12.2.7 release
>>
>> read :
Search the cluster log for 'Large omap object found' for more details.
On Fri, Jul 20, 2018 at 5:13 AM, Brent Kennedy wrote:
> I just upgraded our cluster to 12.2.6 and now I see this warning about 1
> large omap object. I looked and it seems this warning was just added in
> 12.2.6. I found a
On Thu, Jul 19, 2018 at 12:47 PM, Troy Ablan wrote:
>
>
> On 07/18/2018 06:37 PM, Brad Hubbard wrote:
>> On Thu, Jul 19, 2018 at 2:48 AM, Troy Ablan wrote:
>>>
>>>
>>> On 07/17/2018 11:14 PM, Brad Hubbard wrote:
>>>>
>>>>
On Thu, Jul 19, 2018 at 2:48 AM, Troy Ablan wrote:
>
>
> On 07/17/2018 11:14 PM, Brad Hubbard wrote:
>>
>> On Wed, Jul 18, 2018 at 2:57 AM, Troy Ablan wrote:
>>>
>>> I was on 12.2.5 for a couple weeks and started randomly seeing
>>> corruption, m
On Wed, Jul 18, 2018 at 2:57 AM, Troy Ablan wrote:
> I was on 12.2.5 for a couple weeks and started randomly seeing
> corruption, moved to 12.2.6 via yum update on Sunday, and all hell broke
> loose. I panicked and moved to Mimic, and when that didn't solve the
> problem, only then did I start
Your issue is different since not only do the omap digests of all
replicas not match the omap digest from the auth object info but they
are all different to each other.
What is min_size of pool 67 and what can you tell us about the events
leading up to this?
On Mon, Jul 16, 2018 at 7:06 PM,
rnel
exhibiting the problem.
>
> kind regards
>
> Ben
>
>> Brad Hubbard hat am 5. Juli 2018 um 01:16 geschrieben:
>>
>>
>> On Wed, Jul 4, 2018 at 6:26 PM, Benjamin Naber
>> wrote:
>> > Hi @all,
>> >
>> > im currently in testing for
On Wed, Jul 4, 2018 at 6:26 PM, Benjamin Naber wrote:
> Hi @all,
>
> im currently in testing for setup an production environment based on the
> following OSD Nodes:
>
> CEPH Version: luminous 12.2.5
>
> 5x OSD Nodes with following specs:
>
> - 8 Core Intel Xeon 2,0 GHZ
>
> - 96GB Ram
>
> - 10x
provide from the time leading up to when the issue was first seen?
>
> Cheers
>
> Andrei
> - Original Message -
>> From: "Brad Hubbard"
>> To: "Andrei Mikhailovsky"
>> Cc: "ceph-users"
>> Sent: Thursday, 28 June, 2018 01:
"key" : "",
>"oid" : ".dir.default.80018061.2",
>"namespace" : "",
>"snapid" : -2,
>"max" : 0
> },
> "truncate_size" : 0,
> &qu