Hi Martin,
Looks like a bug to me.
You might want to remove all custom settings from the config database and
then try setting osd_memory_target only.
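Something along these lines, with the exact names taken from whatever
`ceph config dump` shows for you (illustrative):
# ceph config dump | grep osd
# ceph config rm osd <setting>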
Would it help?
Thanks,
Igor
On 1/22/2020 3:43 PM, Martin Mlynář wrote:
On 21. 01. 2020 at 21:12, Stefan Kooman wrote:
Quoting Martin Mlynář (nexus+c...@smoula.net):
> Do you think this could help? OSD does not even start; I'm getting a little
> lost as to how flushing caches could help.
I might have misunderstood. I thought the OSDs crashed when you set the
config setting.
> According to trace I suspect something
On Tue, 21 Jan 2020 at 17:09, Stefan Kooman wrote:
Quoting Martin Mlynář (nexus+c...@smoula.net):
>
> When I remove this option:
> # ceph config rm osd osd_memory_target
>
> OSD starts without any trouble. I've seen the same behaviour when I wrote
> this parameter into /etc/ceph/ceph.conf
>
> Is this a known bug? Am I doing something wrong?
Hi,
I'm having trouble changing osd_memory_target on my test cluster. I've
upgraded the whole cluster from Luminous to Nautilus; all OSDs are running
BlueStore. Because this testlab is short on RAM, I wanted to lower
osd_memory_target to save some memory.
# ceph version
ceph version 14.2.6
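For reference, I'm lowering it like this (the 2 GiB value is just an example):
# ceph config set osd osd_memory_target 2147483648
# ceph config get osd.0 osd_memory_target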
Adjusting CRUSH weight shouldn't have caused this. Unfortunately the logs
don't have a lot of hints — the thread that crashed doesn't have any output
except for the Crashed state. If you can reproduce this with more debugging
turned on, we ought to be able to track it down; if not, it seems we missed a
After looking into this further, is it possible that adjusting the CRUSH weight
of the OSDs while running mismatched versions of the ceph-osd daemon across the
cluster can cause this issue? Under certain circumstances in our cluster, this
may happen automatically on the backend. I can't
Do you have more logs that indicate what state machine event the crashing
OSDs received? This obviously shouldn't have happened, but it's a plausible
failure mode, especially if it's a relatively rare combination of events.
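If you can catch it again, it would help to bump the debug levels first so the
state machine events show up in the log, e.g. (levels illustrative):
# ceph tell osd.* injectargs '--debug_osd 20 --debug_ms 1'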
-Greg
On Fri, Aug 17, 2018 at 4:49 PM Kenneth Van Alstyne wrote:
Hello all:
I ran into an issue recently with one of my clusters when upgrading
from 10.2.10 to 12.2.7. I had previously tested the upgrade in a lab and
upgraded one of our five production clusters with no issues. On the second
cluster, however, I ran into an issue where all OSDs that
On Tue, Mar 27, 2018 at 9:04 PM, Dietmar Rieder wrote:
> Thanks Brad!
Hey Dietmar,
yw.
OK. That may help to get us started. Getting
Thanks Brad!
I added some information to the ticket.
Unfortunately I still could not grab a coredump, since there was no
segfault lately.
http://tracker.ceph.com/issues/23431
Maybe Oliver has something to add as well.
Dietmar
On 03/27/2018 11:37 AM, Brad Hubbard wrote:
> "NOTE: a copy of
"NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this."
Have you ever wondered what this means and why it's there? :)
This is at least something you can try. It may provide useful
information, it may not.
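For example, assuming the packaged binary location (paths and debuginfo
packages differ per distro):
# objdump -rdS /usr/bin/ceph-osd > /tmp/ceph-osd.dis
The addresses from the backtrace can then be looked up in the disassembly.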
This stack looks like it is either corrupted, or possibly not in
Hi,
I encountered another one two days ago, and I opened a ticket:
http://tracker.ceph.com/issues/23431
In our case it is more like 1 every two weeks, for now...
And it is affecting different OSDs on different hosts.
Dietmar
On 03/23/2018 11:50 AM, Oliver Freyermuth wrote:
Hi all,
I'm seeing exactly the same thing, with the same addresses, on Luminous 12.2.4,
CentOS 7. Sadly, the logs are equally unhelpful.
It happens randomly on an OSD about once every 2-3 days (out of the 196 total
OSDs we have). It's also not a container environment.
Cheers,
Oliver
I noticed a similar crash too. Unfortunately, I did not get much info in
the logs.
*** Caught signal (Segmentation fault) **
Mar 07 17:58:26 data7 ceph-osd-run.sh[796380]: in thread 7f63a0a97700
thread_name:safe_timer
Mar 07 17:58:28 data7 ceph-osd-run.sh[796380]: docker_exec.sh: line 56:
Hi,
I noticed in my client (using CephFS) logs that an OSD was unexpectedly
going down.
While checking the logs of the affected OSD, I found that it
was segfaulting:
[]
2018-03-07 06:01:28.873049 7fd9af370700 -1 *** Caught signal
(Segmentation fault) **
in thread 7fd9af370700
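To pull the full trace out of the OSD log, something like this works (OSD id
illustrative):
# grep -B5 -A30 'Caught signal' /var/log/ceph/ceph-osd.12.log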
On Sat, Mar 3, 2018 at 2:28 AM Jan Pekař - Imatic wrote:
Hi all,
I have a few problems on my cluster that are maybe linked together and
have now caused an OSD to go down during pg repair.
First, a few notes about my cluster:
4 nodes, 15 OSDs installed on Luminous (no upgrade).
Replicated pools with 1 pool (pool 6) cached by ssd disks.
I don't detect any hardware
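The repair itself was started the usual way, e.g. (pg id illustrative, pool 6
as above):
# ceph pg repair 6.2a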
> > Is this a custom build? Are there any changes to the
> > source code?
Yes, we built it from source ourselves, but there are no changes to the
source code.
On Fri, 15 Sep 2017, wei.qiaom...@zte.com.cn wrote:
Hi all,
My cluster is running 12.2.0 with BlueStore. Yesterday we ran an I/O test
with the fio tool and the librbd ioengine, and several OSDs crashed one
after another.
3 nodes, 30 OSDs; 1 TB SATA HDD for OSD data, 1 GB SATA SSD partition for DB,
576 MB SATA SSD partition for WAL.
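For reference, the fio job was of this general shape (pool/image names
illustrative, not our exact job file):
[rbd-test]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=testimg
rw=randwrite
bs=4k
iodepth=32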
Hi list,
A few days ago we had some problems with our ceph cluster, and now we have some
OSDs crashing on start with messages like this right before crashing:
2017-06-09 15:35:02.226430 7fb46d9e4700 -1 log_channel(cluster) log [ERR] :
trim_object Snap 4aae0 not in clones
I can start those OSDs
Using rbd ls -l poolname to list all images and their snapshots, then
purging snapshots from each image with rbd snap purge
poolname/imagename, and finally reweighting each flapping OSD to 0.0
resolved this issue.
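In command form, the sequence was roughly (pool/image names illustrative):
# rbd ls -l poolname
# rbd snap purge poolname/imagename
# ceph osd reweight 126 0.0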
-Steve
On 2017-06-02 14:15, Steve Anthony wrote:
I'm seeing this again on two OSDs after adding another 20 disks to my
cluster. Is there some way I can maybe determine which snapshots the
recovery process is looking for? Or maybe find and remove the objects
it's trying to recover, since there's apparently a problem with them?
Thanks!
-Steve
Hmmm, after crashing every 30 seconds for a few days, it's apparently
running normally again. Weird. I was thinking that since it's looking for a
snapshot object, maybe re-enabling snaptrimming and removing all the
snapshots in the pool would remove that object (and the problem)? Never
got to that point
On Wed, May 17, 2017 at 10:51 AM Steve Anthony wrote:
Hello,
After starting a backup the other day (create snap, then export and import
into a second cluster - one RBD image is still exporting/importing as of this
message) while recovery operations on the primary cluster were ongoing,
I noticed an OSD (osd.126) start to crash; I reweighted it to 0
to prepare
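For context, the backup flow is essentially (names illustrative):
# rbd snap create poolname/imagename@backup
# rbd export poolname/imagename@backup - | rbd --cluster backup import - poolname/imagename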
com'" <ceph-users@lists.ceph.com>
Subject: [ceph-users] osd crash - disk hangs
Hello!
Tonight i had a osd crash. See the dump below. Also this osd is still mounted.
Whats the cause? A bug? What to do next? I cant do a lsof or ps ax because it
hangs.
Thank You!
Dec 1 00:31:30 ceph2 kern
Try upgrading the kernel to the latest in the 4.4 series.
From: VELARTIS Philipp Dürhammer
Sent: 01 December 2016 12:04
To: ceph-us...@ceph.com
Subject: [ceph-users] osd crash
Hello!
Tonight I had an OSD crash. See the dump below. This OSD is also still
mounted. What's the cause? A bug? What should I do next? I can't run lsof or
ps ax because it hangs.
Thank you!
Dec 1 00:31:30 ceph2 kernel: [17314369.493029] divide error: [#1] SMP
Dec 1 00:31:30 ceph2 kernel: [17314369.493062] Modules linked in: act_police
Hi,
If I understand it correctly, BlueStore is not a filesystem that gets
mounted.
So if an OSD is up and in, but we don't see it mounted in the filesystem
and accessible, we can assume that it must be backed by BlueStore, right?
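A more direct check, if I'm not mistaken, is the OSD metadata (id illustrative):
# ceph osd metadata 0 | grep osd_objectstore
    "osd_objectstore": "bluestore",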
--
Mit freundlichen Gruessen / Best regards
Oliver
I upgraded my lab cluster to 10.1.0 specifically to test out BlueStore and see
what latency difference it makes.
I was able to zap and recreate my OSDs as BlueStore one by one and rebalance
the cluster (the change to having new OSDs start with low weight threw me at
first, but once I worked
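For anyone repeating this, the per-OSD conversion was roughly as follows
(device name illustrative; note BlueStore is still experimental in 10.1.0):
# ceph-disk zap /dev/sdb
# ceph-disk prepare --bluestore /dev/sdb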
So basically the issue - http://tracker.ceph.com/issues/4698
osd suicide timeout
On Mon, Feb 22, 2016 at 7:06 PM, M Ranga Swami Reddy wrote:
Hello,
I have reduced scan_min and scan_max as below. After this change, during
scrubbing, I got op_tp_thread timeouts after 15 seconds. After some time,
OSDs crashed as well... Any suggestions would be helpful... Thank you.
==
osd_backfill_scan_min = 64
osd_backfill_scan_max = 512
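For what it's worth, these can also be applied at runtime without restarting
the OSDs:
# ceph tell osd.* injectargs '--osd_backfill_scan_min 64 --osd_backfill_scan_max 512'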
On Wed, Dec 2, 2015 at 10:54 AM, Major Csaba wrote:
Hi,
I have a small cluster (5 nodes, 20 OSDs) where an OSD crashed. There is
no other sign of problems. No kernel messages, so the disks seem to
be OK.
I tried to restart the OSD, but the process stops almost immediately with
the same logs.
Version is 0.94.5
On Wed, Dec 2, 2015 at 11:11 AM, Major Csaba wrote:
> Hi,
> [ sorry, I accidentally left out the list address ]
>
> This is the content of the LOG file in the directory
> /var/lib/ceph/osd/ceph-7/current/omap:
> 2015/12/02-18:48:12.241386 7f805fc27900 Recovering log #26281
We've upgraded Ceph to 0.94.4 and the kernel to 3.16.0-51-generic,
but the problem still persists. Lately we see these crashes on a daily
basis. I'm leaning toward the conclusion that this is a software problem
- this hardware ran stable before and we're seeing all four nodes crash
randomly with
Hi,
We've noticed a problem with our cluster setup:
4 x OSD nodes:
E5-1630 CPU
32 GB RAM
Mellanox MT27520 56Gbps network cards
SATA controller LSI Logic SAS3008
Storage nodes are connected to two SuperMicro chassis: 847E1C-R1K28JBOD
Each node has 2-3 spinning OSDs (6TB drives) and 2 ssd drives
- Original Message -
> From: "Alex Gorbachev" <a...@iss-integration.com>
> To: "ceph-users" <ceph-users@lists.ceph.com>
> Sent: Wednesday, 9 September, 2015 6:38:50 AM
> Subject: [ceph-users] OSD crash
Hello,
We have run into an OSD crash this weekend with the following dump. Please
advise what this could be.
Best regards,
Alex
2015-09-07 14:55:01.345638 7fae6c158700 0 -- 10.80.4.25:6830/2003934 >>
10.80.4.15:6813/5003974 pipe(0x1dd73000 sd=257 :6830 s=2 pgs=14271 cs=251
l=0
Hi,
I built Ceph from the wip-newstore branch on RHEL7 and was running
performance tests to compare with FileStore. After a few hours of running
the tests, the OSD daemons started to crash. Here is the stack trace; the
OSD crashes immediately after a restart, so I could not get the OSD up and
running.
Hi there,
today I had an OSD crash with ceph 0.87/giant which made my whole cluster
unusable for 45 minutes.
It began with a disk error:
sd 0:1:2:0: [sdc] CDB: Read(10)Read(10):: 28 28 00 00 0d 15 fe d0 fd 7b e8 f8
00 00 00 00 b0 08 00 00
XFS (sdc1): xfs_imap_to_bp: xfs_trans_read_buf()
So the problem started once remapping+backfilling started, and lasted until
the cluster was healthy again? Have you adjusted any of the recovery
tunables? Are you using SSD journals?
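You can check what the OSDs are actually running with via the admin socket,
e.g. (OSD id illustrative):
# ceph daemon osd.0 config show | grep -E 'recovery|backfill'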
I had a similar experience the first time my OSDs started backfilling. The
average RadosGW operation latency
Hi Craig,
I'm planning to completely re-install this cluster with firefly because
I started to see other OSD crashes with the same trim_object error...
So now, I'm more interested in figuring out exactly why data corruption
happened in the first place than repairing the cluster.
Comments
On Fri, Sep 19, 2014 at 2:35 AM, Francois Deppierraz <franc...@ctrlaltdel.ch> wrote:
I did lose data because of this, but it was unrelated
On Mon, Sep 8, 2014 at 2:53 PM, Francois Deppierraz <franc...@ctrlaltdel.ch> wrote:
> XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
> All logs from before the disaster are still there, do you have any
> advice on what would be relevant?
This is a problem. It's not
Hi,
Following up on this issue, I've identified that almost all unfound objects
belong to a single RBD volume (with the help of the script below).
Now what's the best way to try to recover the filesystem stored on this
RBD volume?
'mark_unfound_lost revert' or 'mark_unfound_lost lost' and then
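I.e., something along these lines (pg id illustrative, from the affected pool):
# ceph pg 3.3ef mark_unfound_lost revert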
Hi Greg,
An attempt to recover pg 3.3ef by copying it from broken osd.6 to
working osd.32 resulted in one more broken osd :(
Here's what was actually done:
root@storage1:~# ceph pg 3.3ef list_missing | head
{ "offset": { "oid": "",
      "key": "",
      "snapid": 0,
      "hash": 0,
      "max": 0,
On Mon, Sep 8, 2014 at 1:42 AM, Francois Deppierraz <franc...@ctrlaltdel.ch> wrote:
Hi,
This issue is on a small 2-server (44 OSDs) Ceph cluster running 0.72.2
under Ubuntu 12.04. The cluster was filling up (a few OSDs near full)
and I tried to increase the number of pgs per pool to 1024 for
Hi Greg,
Thanks for your support!
On 08. 09. 14 20:20, Gregory Farnum wrote:
The first one is not caused by the same thing as the ticket you
reference (it was fixed well before emperor), so it appears to be some
kind of disk corruption.
The second one is definitely corruption of some kind
On two different occasions I've had an OSD crash and misplace objects when
rapid object deletion was triggered by discard/trim operations through the
qemu rbd driver. Has anybody else had this kind of trouble? The objects are
still on disk, just not in a place that the OSD considers valid.
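For reference, the guests hit this with discard enabled on the rbd drive,
configured along these lines (pool/image names illustrative):
qemu-system-x86_64 ... -drive file=rbd:rbd/vm-disk,format=raw,discard=unmap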
Hello,
Using db2bb270e93ed44f9252d65d1d4c9b36875d0ea5, I observed some
disaster-like behavior after a ``pool create'' command: every OSD
daemon in the cluster died at least once (some crashed several times in
a row after being brought back). Please take a look at the
backtraces (almost identical)
I'm afraid I don't. I don't think I looked when it happened, and
searching for one just now came up empty. :/ If it happens again,
I'll be sure to keep my eye out for one.
FWIW, this particular server (1 out of 5) has 8GB *less* RAM than the
others (one bad stick, it seems), and this has
Hey folks,
Saw this crash the other day:
ceph version 0.56.4 (63b0f854d1cef490624de5d6cf9039735c7de5ca)
1: /usr/bin/ceph-osd() [0x788fba]
2: (()+0xfcb0) [0x7f19d1889cb0]
3: (gsignal()+0x35) [0x7f19d0248425]
4: (abort()+0x17b) [0x7f19d024bb8b]
5:
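For what it's worth, frames like these can usually be resolved against the
binary once debug symbols are installed, e.g.:
# addr2line -Cfe /usr/bin/ceph-osd 0x788fba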