Re: [ceph-users] OSD crash after change of osd_memory_target

2020-01-22 Thread Igor Fedotov
Hi Martin, looks like a bug to me. You might want to remove all custom settings from the config database and try to set osd-memory-target only. Would it help? Thanks, Igor On 1/22/2020 3:43 PM, Martin Mlynář wrote: On 21. 01. 20 at 21:12, Stefan Kooman wrote: Quoting Martin Mlynář (nex
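A minimal sketch of the sequence Igor suggests, assuming a release with the centralized config database (the thread already uses ceph config rm, so that seems safe); the placeholder setting name and the 4 GiB target are examples, not values taken from the thread:

# ceph config dump                                    # review which custom settings are currently in the config database
# ceph config rm osd <setting_name>                   # repeat for each custom OSD-level setting
# ceph config set osd osd_memory_target 4294967296    # then set only the memory target (4 GiB here, purely an example value)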

Re: [ceph-users] OSD crash after change of osd_memory_target

2020-01-22 Thread Martin Mlynář
On 21. 01. 20 at 21:12, Stefan Kooman wrote: > Quoting Martin Mlynář (nexus+c...@smoula.net): > >> Do you think this could help? The OSD does not even start; I'm getting a little >> lost as to how flushing caches could help. > I might have misunderstood. I thought the OSDs crashed when you set the > conf

Re: [ceph-users] OSD crash after change of osd_memory_target

2020-01-21 Thread Stefan Kooman
Quoting Martin Mlynář (nexus+c...@smoula.net): > Do you think this could help? The OSD does not even start; I'm getting a little > lost as to how flushing caches could help. I might have misunderstood. I thought the OSDs crashed when you set the config setting. > According to the trace I suspect something aro

Re: [ceph-users] OSD crash after change of osd_memory_target

2020-01-21 Thread Martin Mlynář
On Tue 21. 1. 2020 at 17:09, Stefan Kooman wrote: > Quoting Martin Mlynář (nexus+c...@smoula.net): > > > > > When I remove this option: > > # ceph config rm osd osd_memory_target > > > > the OSD starts without any trouble. I've seen the same behaviour when I wrote > > this parameter into /etc/ceph/

Re: [ceph-users] OSD crash after change of osd_memory_target

2020-01-21 Thread Stefan Kooman
Quoting Martin Mlynář (nexus+c...@smoula.net): > > When I remove this option: > # ceph config rm osd osd_memory_target > > the OSD starts without any trouble. I've seen the same behaviour when I wrote > this parameter into /etc/ceph/ceph.conf > > Is this a known bug? Am I doing something wrong? I wond

Re: [ceph-users] OSD Crash When Upgrading from Jewel to Luminous?

2018-08-22 Thread Gregory Farnum
Adjusting CRUSH weight shouldn't have caused this. Unfortunately the logs don't have a lot of hints — the thread that crashed doesn't have any output except for the Crashed state. If you can reproduce this with more debugging on we ought to be able to track it down; if not it seems we missed a stra

Re: [ceph-users] OSD Crash When Upgrading from Jewel to Luminous?

2018-08-21 Thread Kenneth Van Alstyne
After looking into this further, is it possible that adjusting CRUSH weight of the OSDs while running mis-matched versions of the ceph-osd daemon across the cluster can cause this issue? Under certain circumstances in our cluster, this may happen automatically on the backend. I can’t duplicate

Re: [ceph-users] OSD Crash When Upgrading from Jewel to Luminous?

2018-08-17 Thread Gregory Farnum
Do you have more logs that indicate what state machine event the crashing OSDs received? This obviously shouldn't have happened, but it's a plausible failure mode, especially if it's a relatively rare combination of events. -Greg On Fri, Aug 17, 2018 at 4:49 PM Kenneth Van Alstyne < kvanalst...@kn

Re: [ceph-users] OSD crash with segfault Luminous 12.2.4

2018-03-27 Thread Brad Hubbard
On Tue, Mar 27, 2018 at 9:04 PM, Dietmar Rieder wrote: > Thanks Brad! Hey Dietmar, yw. > > I added some information to the ticket. > Unfortunately I still could not grab a coredump, since there was no > segfault lately. OK. That may help to get us started. Getting late here for me so I'll take

Re: [ceph-users] OSD crash with segfault Luminous 12.2.4

2018-03-27 Thread Dietmar Rieder
Thanks Brad! I added some information to the ticket. Unfortunately I still could not grab a coredump, since there was no segfault lately. http://tracker.ceph.com/issues/23431 Maybe Oliver has something to add as well. Dietmar On 03/27/2018 11:37 AM, Brad Hubbard wrote: > "NOTE: a copy of th
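For the next occurrence, here is a sketch of one way to make sure a core file is written for ceph-osd on a systemd-based host such as the CentOS 7 systems in this thread; the drop-in path and the core location are assumptions, not something the posters describe:

# mkdir -p /etc/systemd/system/ceph-osd@.service.d
# printf '[Service]\nLimitCORE=infinity\n' > /etc/systemd/system/ceph-osd@.service.d/coredump.conf
# mkdir -p /var/crash && sysctl -w kernel.core_pattern=/var/crash/core.%e.%p
# systemctl daemon-reload && systemctl restart ceph-osd.target   # picks up the new limit; restart OSDs one at a time in production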

Re: [ceph-users] OSD crash with segfault Luminous 12.2.4

2018-03-27 Thread Brad Hubbard
"NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this." Have you ever wondered what this means and why it's there? :) This is at least something you can try. it may provide useful information, it may not. This stack looks like it is either corrupted, or possibly not in

Re: [ceph-users] OSD crash with segfault Luminous 12.2.4

2018-03-23 Thread Dietmar Rieder
Hi, I encountered one more two days ago, and I opened a ticket: http://tracker.ceph.com/issues/23431 In our case it is more like 1 every two weeks, for now... And it is affecting different OSDs on different hosts. Dietmar On 03/23/2018 11:50 AM, Oliver Freyermuth wrote: > Hi together, > > I

Re: [ceph-users] OSD crash with segfault Luminous 12.2.4

2018-03-23 Thread Oliver Freyermuth
Hi together, I notice exactly the same, also the same addresses, Luminous 12.2.4, CentOS 7. Sadly, logs are equally unhelpful. It happens randomly on an OSD about once per 2-3 days (of the 196 total OSDs we have). It's also not a container environment. Cheers, Oliver Am 08.03.2018 u

Re: [ceph-users] OSD crash with segfault Luminous 12.2.4

2018-03-09 Thread Dietmar Rieder
On 03/09/2018 12:49 AM, Brad Hubbard wrote: > On Fri, Mar 9, 2018 at 3:54 AM, Subhachandra Chandra > wrote: >> I noticed a similar crash too. Unfortunately, I did not get much info in the >> logs. >> >> *** Caught signal (Segmentation fault) ** >> >> Mar 07 17:58:26 data7 ceph-osd-run.sh[796380]:

Re: [ceph-users] OSD crash with segfault Luminous 12.2.4

2018-03-08 Thread Brad Hubbard
On Fri, Mar 9, 2018 at 3:54 AM, Subhachandra Chandra wrote: > I noticed a similar crash too. Unfortunately, I did not get much info in the > logs. > > *** Caught signal (Segmentation fault) ** > > Mar 07 17:58:26 data7 ceph-osd-run.sh[796380]: in thread 7f63a0a97700 > thread_name:safe_timer > >

Re: [ceph-users] OSD crash with segfault Luminous 12.2.4

2018-03-08 Thread Subhachandra Chandra
I noticed a similar crash too. Unfortunately, I did not get much info in the logs. *** Caught signal (Segmentation fault) ** Mar 07 17:58:26 data7 ceph-osd-run.sh[796380]: in thread 7f63a0a97700 thread_name:safe_timer Mar 07 17:58:28 data7 ceph-osd-run.sh[796380]: docker_exec.sh: line 56: 7971

Re: [ceph-users] OSD crash during pg repair - recovery_info.ss.clone_snaps.end and other problems

2018-03-07 Thread Jan Pekař - Imatic
On 6.3.2018 22:28, Gregory Farnum wrote: On Sat, Mar 3, 2018 at 2:28 AM Jan Pekař - Imatic wrote: Hi all, I have a few problems on my cluster that are maybe linked together and have now caused an OSD to go down during pg repair. First, a few notes about my cluster:

Re: [ceph-users] OSD crash during pg repair - recovery_info.ss.clone_snaps.end and other problems

2018-03-06 Thread Gregory Farnum
On Sat, Mar 3, 2018 at 2:28 AM Jan Pekař - Imatic wrote: > Hi all, > > I have a few problems on my cluster that are maybe linked together and > have now caused an OSD to go down during pg repair. > > First, a few notes about my cluster: > > 4 nodes, 15 OSDs installed on Luminous (no upgrade). > Replicated pools wi

Re: [ceph-users] osd crash because rocksdb report  ‘Compaction error: Corruption: block checksum mismatch’

2017-09-17 Thread wei.qiaomiao
om build? Are there any changes to the > > source code? Yes, we built the source code ourselves, but there are no changes to the source code. Original Mail Sender: To: WeiQiaoMiao00105316 CC: Date: 2017/09/17 01:56 Subject: Re: [ceph-users] osd crash because rocksdb report

Re: [ceph-users] osd crash because rocksdb report  ‘Compaction error: Corruption: block checksum mismatch’

2017-09-16 Thread Sage Weil
On Fri, 15 Sep 2017, wei.qiaom...@zte.com.cn wrote: > > Hi all, > > My cluster is running 12.2.0 with bluestore. We used the fio tool with the > librbd ioengine to run an I/O test yesterday, and several OSDs crashed one after > another. > > 3 * node, 30 OSD, 1TB SATA HDD for OSD data, 1GB SATA SSD parti

Re: [ceph-users] OSD crash loop - FAILED assert(recovery_info.oi.snaps.size())

2017-06-05 Thread Stephen M. Anthony ( Faculty/Staff - Ctr for Innovation in Teach & )
Using rbd ls -l poolname to list all images and their snapshots, then purging snapshots from each image with rbd snap purge poolname/imagename, then finally reweighting each flapping OSD to 0.0 resolved this issue. -Steve On 2017-06-02 14:15, Steve Anthony wrote: I'm seeing this again on two
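Condensed into commands, Steve's fix looks roughly like the following; poolname, imagename and the OSD id 12 are placeholders, and rbd snap purge is destructive, so it only makes sense once the snapshots are genuinely unwanted:

# rbd ls -l poolname                      # list images together with their snapshots
# rbd snap purge poolname/imagename       # delete all snapshots of one image; repeat per image
# ceph osd reweight 12 0.0                # drain each flapping OSD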

Re: [ceph-users] OSD crash loop - FAILED assert(recovery_info.oi.snaps.size())

2017-06-02 Thread Steve Anthony
I'm seeing this again on two OSDs after adding another 20 disks to my cluster. Is there someway I can maybe determine which snapshots the recovery process is looking for? Or maybe find and remove the objects it's trying to recover, since there's apparently a problem with them? Thanks! -Steve On 0

Re: [ceph-users] OSD crash loop - FAILED assert(recovery_info.oi.snaps.size())

2017-05-18 Thread Steve Anthony
Hmmm, after crashing for a few days every 30 seconds it's apparently running normally again. Weird. I was thinking since it's looking for a snapshot object, maybe re-enabling snaptrimming and removing all the snapshots in the pool would remove that object (and the problem)? Never got to that point

Re: [ceph-users] OSD crash loop - FAILED assert(recovery_info.oi.snaps.size())

2017-05-17 Thread Gregory Farnum
On Wed, May 17, 2017 at 10:51 AM Steve Anthony wrote: > Hello, > > After starting a backup (create snap, export and import into a second > cluster - one RBD image still exporting/importing as of this message) > the other day while recovery operations on the primary cluster were > ongoing I notice

Re: [ceph-users] osd crash - disk hangs

2016-12-01 Thread Warren Wang - ISD
You’ll need to upgrade your kernel. It’s a terrible div by zero bug that occurs while trying to calculate load. You can still use “top -b -n1” instead of ps, but ultimately the kernel update fixed it for us. You can’t kill procs that are in uninterruptible wait. Here’s the Ubuntu version: http
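Until the kernel is updated, Warren's workaround can be used like this (the grep is just an example of pulling the ceph processes out of the batch output):

# top -b -n1 | grep ceph-osd              # batch-mode top lists processes without hanging the way ps does under this bug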

Re: [ceph-users] osd crash

2016-12-01 Thread VELARTIS Philipp Dürhammer
I am using Proxmox, so I guess it's Debian. I will update the kernel; there are newer versions. But generally, if an OSD crashes like this - can it be hardware related? How do I unmount the disk? I can't even run ps ax or lsof - it hangs because my OSD is still mounted and blocks everything... I canno

Re: [ceph-users] osd crash

2016-12-01 Thread Nick Fisk
Are you using Ubuntu 16.04 (guessing from your kernel version)? There was a NUMA bug in early kernels; try updating to the latest in the 4.4 series. From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of VELARTIS Philipp Dürhammer Sent: 01 December 2016 12:04 To: 'ceph-us...
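If that guess is right, a minimal sketch of Nick's suggestion, assuming the stock linux-image-generic metapackage (which tracks the GA 4.4 series on 16.04):

# uname -r                                                # confirm the running 4.4.x kernel
# apt-get update && apt-get install linux-image-generic   # pull the latest 4.4-series kernel
# reboot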

Re: [ceph-users] OSD crash after conversion to bluestore

2016-03-31 Thread Adrian Saul
2:08 AM > To: ceph-users@lists.ceph.com > Subject: Re: [ceph-users] OSD crash after conversion to bluestore > > Hi, > > if I understand it correctly, bluestore won't use / is not a filesystem to be > mounted. > > So if an OSD is up and in, while we don't see it mounted into th

Re: [ceph-users] OSD crash after conversion to bluestore

2016-03-31 Thread Adrian Saul
Not sure about commands; however, if you look at the OSD mount point there is a “bluefs” file. From: German Anders [mailto:gand...@despegar.com] Sent: Thursday, 31 March 2016 11:48 PM To: Adrian Saul Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] OSD crash after conversion to bluestore

Re: [ceph-users] OSD crash after conversion to bluestore

2016-03-31 Thread Oliver Dzombic
Hi, if I understand it correctly, bluestore won't use / is not a filesystem to be mounted. So if an OSD is up and in, while we don't see it mounted into the filesystem and accessible, we could assume that it must be powered by bluestore... !??! -- Mit freundlichen Gruessen / Best regards Oliver D

Re: [ceph-users] OSD crash after conversion to bluestore

2016-03-31 Thread German Anders
Having jewel installed, is it possible to run a command in order to see whether the OSD is actually using bluestore? Thanks in advance, Best, *German* 2016-03-31 1:24 GMT-03:00 Adrian Saul : > > I upgraded my lab cluster to 10.1.0 specifically to test out bluestore and > see what latency difference i
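Putting Adrian's and Oliver's replies above into commands; the OSD id 0 and the path are placeholders, and whether the 10.1.0 dev release already exposes the osd_objectstore key is an assumption rather than something confirmed in the thread:

# ls /var/lib/ceph/osd/ceph-0/                 # a BlueStore OSD has a "bluefs" file (and a "block" link) instead of a populated FileStore tree
# ceph osd metadata 0 | grep osd_objectstore   # on releases that report it, this prints bluestore or filestore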

Re: [ceph-users] OSD Crash with scan_min and scan_max values reduced

2016-02-22 Thread M Ranga Swami Reddy
So basically the issue is http://tracker.ceph.com/issues/4698 - osd suicide timeout. On Mon, Feb 22, 2016 at 7:06 PM, M Ranga Swami Reddy wrote: > Hello, > I have reduced the scan_min and scan_max as below. After the below > change, during scrubbing, the op_tp_thread timed out after 15. > Aft
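The truncated snippet does not show the exact keys; assuming they are the backfill scan options that carry these names, the reduction would have been applied at runtime roughly like this (the values are illustrative only, and as the thread reports, the change was followed by the op_tp_thread timeouts):

# ceph tell osd.* injectargs '--osd-backfill-scan-min 16 --osd-backfill-scan-max 128'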

Re: [ceph-users] OSD crash, unable to restart

2015-12-02 Thread Major Csaba
Hi, On 12/02/2015 08:12 PM, Gregory Farnum wrote: On Wed, Dec 2, 2015 at 11:11 AM, Major Csaba wrote: Hi, [ sorry, I accidentaly left out the list address ] This is the content of the LOG file in the directory /var/lib/ceph/osd/ceph-7/current/omap: 2015/12/02-18:48:12.241386 7f805fc27900 Reco

Re: [ceph-users] OSD crash, unable to restart

2015-12-02 Thread Gregory Farnum
On Wed, Dec 2, 2015 at 11:11 AM, Major Csaba wrote: > Hi, > [ sorry, I accidentaly left out the list address ] > > This is the content of the LOG file in the directory > /var/lib/ceph/osd/ceph-7/current/omap: > 2015/12/02-18:48:12.241386 7f805fc27900 Recovering log #26281 > 2015/12/02-18:48:12.242

Re: [ceph-users] OSD crash, unable to restart

2015-12-02 Thread Gregory Farnum
On Wed, Dec 2, 2015 at 10:54 AM, Major Csaba wrote: > Hi, > > I have a small cluster(5 nodes, 20OSDs), where an OSD crashed. There is no > any other signal of problems. No kernel message, so the disks seem to be OK. > > I tried to restart the OSD but the process stops almost immediately with the >

Re: [ceph-users] osd crash and high server load - ceph-osd crashes with stacktrace

2015-10-25 Thread Brad Hubbard
- Original Message - > From: "Jacek Jarosiewicz" > To: ceph-users@lists.ceph.com > Sent: Sunday, 25 October, 2015 8:48:59 PM > Subject: Re: [ceph-users] osd crash and high server load - ceph-osd crashes > with stacktrace > > We've upgraded ceph to 0.

Re: [ceph-users] osd crash and high server load - ceph-osd crashes with stacktrace

2015-10-25 Thread Jacek Jarosiewicz
We've upgraded ceph to 0.94.4 and kernel to 3.16.0-51-generic but the problem still persists. Lately we see these crashes on a daily basis. I'm leaning toward the conclusion that this is a software problem - this hardware ran stable before and we're seeing all four nodes crash randomly with the

Re: [ceph-users] OSD crash

2015-09-22 Thread Alex Gorbachev
Hi Brad, This occurred on a system under moderate load - has not happened since and I do not know how to reproduce. Thank you, Alex On Tue, Sep 22, 2015 at 7:29 PM, Brad Hubbard wrote: > - Original Message - > > > From: "Alex Gorbachev" > > To: "ceph-users" > > Sent: Wednesday, 9 Sep

Re: [ceph-users] OSD crash

2015-09-22 Thread Brad Hubbard
- Original Message - > From: "Alex Gorbachev" > To: "ceph-users" > Sent: Wednesday, 9 September, 2015 6:38:50 AM > Subject: [ceph-users] OSD crash > Hello, > We have run into an OSD crash this weekend with the following dump. Please > advise what this could be. Hello Alex, As you kn

Re: [ceph-users] OSD Crash makes whole cluster unusable ?

2014-12-16 Thread Craig Lewis
So the problem started once remapping+backfilling started, and lasted until the cluster was healthy again? Have you adjusted any of the recovery tunables? Are you using SSD journals? I had a similar experience the first time my OSDs started backfilling. The average RadosGW operation latency wen
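For context, the recovery tunables Craig is asking about are usually throttled along these lines when backfill competes with client I/O; the values are common conservative choices rather than settings taken from this thread, applied at runtime with injectargs on releases of that era:

# ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'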

Re: [ceph-users] osd crash: trim_object could not find coid

2014-09-19 Thread Craig Lewis
On Fri, Sep 19, 2014 at 2:35 AM, Francois Deppierraz wrote: > Hi Craig, > > I'm planning to completely re-install this cluster with firefly because > I started to see other OSDs crashes with the same trim_object error... > I did lose data because of this, but it was unrelated to the XFS issues.

Re: [ceph-users] osd crash: trim_object could not find coid

2014-09-19 Thread Francois Deppierraz
Hi Craig, I'm planning to completely re-install this cluster with firefly because I started to see other OSDs crashes with the same trim_object error... So now, I'm more interested in figuring out exactly why data corruption happened in the first place than repairing the cluster. Comments in-lin

Re: [ceph-users] osd crash: trim_object could not find coid

2014-09-16 Thread Craig Lewis
On Mon, Sep 8, 2014 at 2:53 PM, Francois Deppierraz wrote: > > > XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250) > > All logs from before the disaster are still there, do you have any > advise on what would be relevant? > > This is a problem. It's not necessarily a deadlock.

Re: [ceph-users] osd crash: trim_object could not find coid

2014-09-12 Thread Gregory Farnum
On Fri, Sep 12, 2014 at 4:41 AM, Francois Deppierraz wrote: > Hi, > > Following-up this issue, I've identified that almost all unfound objects > belongs to a single RBD volume (with the help of the script below). > > Now what's the best way to try to recover the filesystem stored on this > RBD vol

Re: [ceph-users] osd crash: trim_object could not find coid

2014-09-12 Thread Francois Deppierraz
Hi, Following up on this issue, I've identified that almost all unfound objects belong to a single RBD volume (with the help of the script below). Now what's the best way to try to recover the filesystem stored on this RBD volume? 'mark_unfound_lost revert' or 'mark_unfound_lost lost' and then run
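A rough outline of the recovery path being weighed, with placeholder image names (3.3ef is the pg discussed elsewhere in this thread); revert rolls unfound objects back to a previous version while delete discards them, so either way a dry-run fsck of the filesystem inside the image is the sensible follow-up:

# ceph health detail | grep unfound        # locate the PGs that still report unfound objects
# ceph pg 3.3ef mark_unfound_lost revert   # or "delete"
# rbd map rbd/volume-xyz                   # map the affected image (hypothetical name) ...
# fsck -n /dev/rbd0                        # ... and dry-run fsck the filesystem stored on it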

Re: [ceph-users] osd crash: trim_object could not find coid

2014-09-11 Thread Francois Deppierraz
Hi Greg, An attempt to recover pg 3.3ef by copying it from broken osd.6 to working osd.32 resulted in one more broken osd :( Here's what was actually done: root@storage1:~# ceph pg 3.3ef list_missing | head { "offset": { "oid": "", "key": "", "snapid": 0, "hash": 0, "max"

Re: [ceph-users] osd crash: trim_object could not find coid

2014-09-08 Thread Gregory Farnum
On Mon, Sep 8, 2014 at 2:53 PM, Francois Deppierraz wrote: > Hi Greg, > > Thanks for your support! > > On 08. 09. 14 20:20, Gregory Farnum wrote: > >> The first one is not caused by the same thing as the ticket you >> reference (it was fixed well before emperor), so it appears to be some >> kind o

Re: [ceph-users] osd crash: trim_object could not find coid

2014-09-08 Thread Francois Deppierraz
Hi Greg, Thanks for your support! On 08. 09. 14 20:20, Gregory Farnum wrote: > The first one is not caused by the same thing as the ticket you > reference (it was fixed well before emperor), so it appears to be some > kind of disk corruption. > The second one is definitely corruption of some kin

Re: [ceph-users] osd crash: trim_object could not find coid

2014-09-08 Thread Gregory Farnum
On Mon, Sep 8, 2014 at 1:42 AM, Francois Deppierraz wrote: > Hi, > > This issue is on a small 2 servers (44 osds) ceph cluster running 0.72.2 > under Ubuntu 12.04. The cluster was filling up (a few osds near full) > and I tried to increase the number of pg per pool to 1024 for each of > the 14 poo

Re: [ceph-users] OSD crash during script, 0.56.4

2013-05-13 Thread Travis Rhoden
I'm afraid I don't. I don't think I looked when it happened, and searching for one just now came up empty. :/ If it happens again, I'll be sure to keep my eye out for one. FWIW, this particular server (1 out of 5) has 8GB *less* RAM than the others (one bad stick, it seems), and this has happen

Re: [ceph-users] OSD crash during script, 0.56.4

2013-05-13 Thread Gregory Farnum
On Tue, May 7, 2013 at 9:44 AM, Travis Rhoden wrote: > Hey folks, > > Saw this crash the other day: > > ceph version 0.56.4 (63b0f854d1cef490624de5d6cf9039735c7de5ca) > 1: /usr/bin/ceph-osd() [0x788fba] > 2: (()+0xfcb0) [0x7f19d1889cb0] > 3: (gsignal()+0x35) [0x7f19d0248425] > 4: (abort()+0x1