> 16: (ThreadPool::worker(ThreadPool::WorkThread*)+0x53d) [0x9e05dd]
> 17: (ThreadPool::WorkThread::entry()+0x10) [0x9e1760]
> 18: (()+0x7a51) [0x7f384b6b0a51]
> 19: (clone()+0x6d) [0x7f384a6409ad]
>
> ceph version is v0.80.9; manually executing `ceph pg deep-scrub 3.d70` would also
> cause an osd crash.
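A minimal way to reproduce this and capture the backtrace, assuming default log
paths and an illustrative primary osd id of 12 for pg 3.d70:

# raise osd/filestore logging on the pg's primary (osd.12 is hypothetical)
ceph tell osd.12 injectargs '--debug-osd 20 --debug-filestore 20'
# trigger the deep scrub that reportedly crashes the osd
ceph pg deep-scrub 3.d70
# watch for the FAILED assert and backtrace
tail -f /var/log/ceph/ceph-osd.12.log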
Any ideas? Or did I miss some logs necessary ...
Regards
Srikanth
On Mon, Jun 1, 2015 at 10:25 PM, Srikanth Madugundi
srikanth.madugu...@gmail.com wrote:
Hi Sage,
Unfortunately I purged the cluster yesterday and restarted the
backfill tool. I did not see the osd crash yet on the cluster. I am
monitoring the OSDs and will update you once I see the crash.
With the new backfill run I have reduced the rps by half, not sure if
this is the reason for not seeing the crash yet.
Regards
Srikanth
On Mon, Jun 1
I pushed a commit to wip-newstore-debuglist.. can you reproduce the crash
with that branch with 'debug newstore = 20' and send us the log?
(You can just do 'ceph-post-file filename'.)
Thanks!
sage
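A sketch of what that usually looks like, assuming the setting goes in the [osd]
section of ceph.conf and an illustrative osd id of 0 (ceph-post-file just takes
the file to upload):

# enable newstore debugging on the branch build, then restart the osd
cat >> /etc/ceph/ceph.conf <<'EOF'
[osd]
    debug newstore = 20
EOF
# after reproducing the crash, upload the log
ceph-post-file /var/log/ceph/ceph-osd.0.log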
On Mon, 1 Jun 2015, Srikanth Madugundi wrote:
Hi Sage,
The assertion failed at line 1639,
Hi Sage and all,
I built ceph from wip-newstore on RHEL7 and am running performance
tests to compare with filestore. After a few hours of running the tests
the osd daemons started to crash. Here is the stack trace; the osd
crashes immediately after the restart, so I could not get the osd up
and
Hi Sage,
The assertion failed at line 1639, here is the log message
2015-05-30 23:17:55.141388 7f0891be0700 -1 os/newstore/NewStore.cc: In
function 'virtual int NewStore::collection_list_partial(coll_t,
ghobject_t, int, int, snapid_t, std::vector<ghobject_t>*,
ghobject_t*)' thread 7f0891be0700
Hi Samuel, Sage,
In our current production environment there are osd crashes caused by
inconsistent data when reading the "_" xattr, as described in this
issue:
http://tracker.ceph.com/issues/10117.
And I also found a two-year-old issue which also describes
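For context on the "_" attribute: on a filestore OSD it is stored as a
filesystem xattr on the object file, prefixed user.ceph. A quick way to inspect
it on a suspect object (paths here are illustrative, not taken from the report):

# list ceph xattrs on an object file inside the osd data dir
getfattr -m 'user.ceph' -d /var/lib/ceph/osd/ceph-0/current/3.0_head/<object>
# dump the raw "_" attr (the encoded object_info_t) as hex
getfattr -n user.ceph._ --only-values /var/lib/ceph/osd/ceph-0/current/3.0_head/<object> | xxd | head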
On Fri, Sep 06, 2013 at 08:21:07AM -0700, Sage Weil wrote:
On Fri, 6 Sep 2013, Chris Dunlop wrote:
On Thu, Sep 05, 2013 at 07:55:52PM -0700, Sage Weil wrote:
Also, you should upgrade to dumpling. :)
I've been considering it. It was initially a little scary with
the various issues that were
On Fri, 6 Sep 2013, Chris Dunlop wrote:
On Fri, Sep 06, 2013 at 01:12:21PM +1000, Chris Dunlop wrote:
On Thu, Sep 05, 2013 at 07:55:52PM -0700, Sage Weil wrote:
On Fri, 6 Sep 2013, Chris Dunlop wrote:
Hi Sage,
Does this answer your question?
2013-09-06 09:30:19.813811 7f0ae8cbc700 0 log [INF] : applying
configuration change:
G'day,
I'm getting an OSD crash on 0.56.7-1~bpo70+1 whilst trying to repair an OSD:
http://tracker.ceph.com/issues/6233
ceph version 0.56.7 (14f23ab86b0058a8651895b3dc972a29459f3a33)
1: /usr/bin/ceph-osd() [0x8530a2]
2: (()+0xf030) [0x7f541ca39030]
3: (gsignal()+0x35) [0x7f541b132475
is.
Thanks!
sage
On Fri, 6 Sep 2013, Chris Dunlop wrote:
G'day,
I'm getting an OSD crash on 0.56.7-1~bpo70+1 whilst trying to repair an OSD:
http://tracker.ceph.com/issues/6233
ceph version 0.56.7 (14f23ab86b0058a8651895b3dc972a29459f3a33)
1: /usr/bin/ceph-osd() [0x8530a2
On Thu, Sep 05, 2013 at 07:55:52PM -0700, Sage Weil wrote:
On Fri, 6 Sep 2013, Chris Dunlop wrote:
Hi Sage,
Does this answer your question?
2013-09-06 09:30:19.813811 7f0ae8cbc700 0 log [INF] : applying
configuration change: internal_safe_to_start_threads = 'true'
2013-09-06
Hello,
Using db2bb270e93ed44f9252d65d1d4c9b36875d0ea5 I observed some
disaster-like behavior after a ``pool create'' command - every osd
daemon in the cluster died at least once (some crashed several times in
a row after being brought back). Please take a look at the
backtraces (almost identical)
I had one of my OSDs crash yesterday. I'm using ceph version 0.56.3
(6eb7e15a4783b122e9b0c85ea9ba064145958aa5).
The part of the log file where the crash happened is attached. Not really sure
what led up to it, but I did get an alert from my server monitor telling me my
swap space got really
On Wed, Jan 9, 2013 at 4:38 PM, Sage Weil s...@inktank.com wrote:
On Wed, 9 Jan 2013, Ian Pye wrote:
Hi,
Every time I try to bring up an OSD, it crashes and I get the
following: error (121) Remote I/O error not handled on operation 20
This error code (EREMOTEIO) is not used by Ceph. What
Hello list,
after a reboot of my node I see this on all OSDs of this node after the
reboot:
2012-12-14 09:03:20.393224 7f8e652f8780 -1 osd/OSD.cc: In function
'OSDMapRef OSDService::get_map(epoch_t)' thread 7f8e652f8780 time
2012-12-14 09:03:20.392528
osd/OSD.cc: 4385: FAILED
same log more verbose:
11 ec=10 les/c 3307/3307 3306/3306/3306) [] r=0 lpr=0 lcod 0'0 mlcod 0'0
inactive] read_log done
-11 2012-12-14 09:17:50.648572 7fb6e0d6b780 10 osd.3 pg_epoch: 3996
pg[3.44b( v 3988'3969 (1379'2968,3988'3969] local-les=3307 n=11 ec=10
les/c 3307/3307 3306/3306/3306)
On 12/14/2012 10:14 AM, Stefan Priebe wrote:
One more IMPORTANT note. This might happen due to the fact that a disk was
missing (disk failure) after the reboot.
fstab and mountpoint are working with UUIDs so they match, but the journal
block device:
osd journal = /dev/sde1
didn't match
Hello Dennis,
On 14.12.2012 15:52, Dennis Jacobfeuerborn wrote:
didn't match anymore - as the numbers got renumbered due to the failed disk.
Is there a way to use some kind of UUIDs here too for the journal?
You should be able to use /dev/disk/by-uuid/* instead. That should give you
a stable view
Hi Stefan,
Here's what I often do when I have a journal and data partition sharing
a disk:
sudo parted -s -a optimal /dev/$DEV mklabel gpt
sudo parted -s -a optimal /dev/$DEV mkpart osd-device-$i-journal 0% 10G
sudo parted -s -a optimal /dev/$DEV mkpart osd-device-$i-data 10G 100%
Mark
On
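A follow-up on Mark's recipe: because mkpart assigns GPT partition names, udev
also creates stable /dev/disk/by-partlabel symlinks, so the journal can be
referenced independently of sdX renumbering (a sketch; the label comes from the
mkpart name above):

# stable symlinks derived from the GPT partition names
ls -l /dev/disk/by-partlabel/
# ceph.conf can then reference the label instead of /dev/sde1, e.g.:
#   osd journal = /dev/disk/by-partlabel/osd-device-0-journal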
Hi Mark,
On 14.12.2012 16:20, Mark Nelson wrote:
sudo parted -s -a optimal /dev/$DEV mklabel gpt
sudo parted -s -a optimal /dev/$DEV mkpart osd-device-$i-journal 0% 10G
sudo parted -s -a optimal /dev/$DEV mkpart osd-device-$i-data 10G 100%
My disks are gpt too and I'm also using parted. But
Hello Mark,
On 14.12.2012 16:20, Mark Nelson wrote:
sudo parted -s -a optimal /dev/$DEV mklabel gpt
sudo parted -s -a optimal /dev/$DEV mkpart osd-device-$i-journal 0% 10G
sudo parted -s -a optimal /dev/$DEV mkpart osd-device-$i-data 10G 100%
Isn't that the part type you're using?
mkpart
Hi Sage,
this was just an idea, and I need to fix MY uuid problem. But then the
crash is still a problem of ceph. Have you looked into my log?
On 14.12.2012 20:42, Sage Weil wrote:
On Fri, 14 Dec 2012, Stefan Priebe wrote:
One more IMPORTANT note. This might happen due to the fact that a
Dear All:
I met this issue on one of the osd nodes. Is this a known issue? Thanks!
ceph version 0.48.2argonaut (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe)
1: /usr/bin/ceph-osd() [0x6edaba]
2: (()+0xfcb0) [0x7f08b112dcb0]
3: (gsignal()+0x35) [0x7f08afd09445]
4: (abort()+0x17b)
, 49,50,51... Data store is on XFS.
I'm currently in the process of growing my ceph from 6 nodes to 12
nodes. 11 nodes are currently in ceph, for a 130 TB total. Declaring new
osds was OK, and the data moved quite OK (in fact I had some OSD crashes
- not definitive, the osds restarted OK - maybe related to an error in my new
nodes' network configuration that I discovered afterwards. More on that later;
I can find the traces, but I'm not sure it's related.)
When ceph was finally stable again, with HEALTH_OK, I decided
Hi all,
after adding a new node into our ceph-cluster yesterday, we had a crash
of one OSD.
I found this kind of message in the bugtracker marked as solved (
http://tracker.newdream.net/issues/2075),
so I will update that one for my convenience and attach the corresponding log (
due to
Hi,
Almost always one or more osds die when doing overlapped recovery -
e.g. adding a new crushmap and removing some newly added osds from the
cluster a few minutes later during the remap, or injecting two slightly
different crushmaps within a short time (while surely preserving at least
one replica online). It seems that
On Thu, 23 Aug 2012, Andrey Korolyov wrote:
Hi,
today during a heavy test a pair of osds and one mon died, resulting in a
hard lockup of some kvm processes - they went unresponsive and were
killed, leaving zombie processes ([kvm] defunct). The entire cluster
contains sixteen osds on eight nodes and
The tcmalloc backtrace on the OSD suggests this may be unrelated, but
what's the fd limit on your monitor process? You may be approaching
that limit if you've got 500 OSDs and a similar number of clients.
On Wed, Aug 22, 2012 at 6:55 PM, Andrey Korolyov and...@xdel.ru wrote:
On Thu, Aug 23, 2012
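Checking that is quick; a sketch assuming a single ceph-mon process on the node:

# effective fd limit of the running monitor
grep 'Max open files' /proc/$(pidof ceph-mon)/limits
# number of fds it currently holds
ls /proc/$(pidof ceph-mon)/fd | wc -l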
On 09/07/2012 19:14, Samuel Just wrote:
Can you restart the node that failed to complete the upgrade with
Well, it's a little bit complicated; I now run those nodes with XFS,
and I've long-running jobs on them right now, so I can't stop the ceph
cluster at the moment.
As I've kept the
On Tue, Jul 10, 2012 at 2:46 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
As I've kept the original broken btrfs volumes, I tried this morning to
run the old osds in parallel, using the $cluster variable. I only have
partial success.
The cluster mechanism was never intended for moving
On 10/07/2012 17:56, Tommi Virtanen wrote:
On Tue, Jul 10, 2012 at 2:46 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
As I've kept the original broken btrfs volumes, I tried this morning to
run the old osds in parallel, using the $cluster variable. I only have
partial success.
The
On Tue, Jul 10, 2012 at 9:39 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
The cluster mechanism was never intended for moving existing osds to
other clusters. Trying that might not be a good idea.
Ok, good to know. I saw that the remaining maps could lead to problems, but
in two words, what
On 10/07/2012 19:11, Tommi Virtanen wrote:
On Tue, Jul 10, 2012 at 9:39 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
The cluster mechanism was never intended for moving existing osds to
other clusters. Trying that might not be a good idea.
Ok, good to know. I saw that the remaining
On Tue, Jul 10, 2012 at 10:36 AM, Yann Dupont
yann.dup...@univ-nantes.fr wrote:
Fundamentally, it comes down to this: the two clusters will still have
the same fsid, and you won't be isolated from configuration errors or
(CEPH-PROD is the old btrfs volume). /CEPH is the new xfs volume, completely
Can you restart the node that failed to complete the upgrade with
debug filestore = 20
debug osd = 20
and post the log after an hour or so of running? The upgrade process
might legitimately take a while.
-Sam
On Sat, Jul 7, 2012 at 1:19 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
Le
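A sketch of one way to apply Sam's suggestion, assuming the options belong in
the [osd] section of ceph.conf and a sysvinit-era restart (osd id N is
hypothetical):

# on the affected node
cat >> /etc/ceph/ceph.conf <<'EOF'
[osd]
    debug filestore = 20
    debug osd = 20
EOF
service ceph restart osd.N
# let it run for an hour or so, then post /var/log/ceph/ceph-osd.N.log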
to other
(sane) nodes, leading to ceph-osd crashes on those nodes.
The LevelDB binary contents are not transferred over to other nodes;
this kind of corruption would not spread over the Ceph clustering
mechanisms. It's more likely that you have 4 independently corrupted
LevelDBs. Something
that ultimate kernel oops, bad data has been transmitted to other
(sane) nodes, leading to ceph-osd crashes on those nodes.
The LevelDB binary contents are not transferred over to other nodes;
Ok, thanks for the clarification;
this kind of corruption would not spread over the Ceph clustering
On Mon, Jul 9, 2012 at 12:05 PM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
The information here isn't enough to say whether the cause of the
corruption is btrfs or LevelDB, but the recovery needs to be handled by
LevelDB -- and upstream is working on making it more robust:
On 06/07/2012 19:01, Gregory Farnum wrote:
On Fri, Jul 6, 2012 at 12:19 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
On 05/07/2012 23:32, Gregory Farnum wrote:
[...]
ok, so as all nodes were identical, I probably have hit a btrfs bug (like
an
erroneous out of space) in more or
On 05/07/2012 23:32, Gregory Farnum wrote:
[...]
ok, so as all nodes were identical, I probably have hit a btrfs bug (like an
erroneous out of space) in more or less the same time. And when 1 osd was
out,
OH, I didn't finish the sentence... When 1 osd was out, missing data
was copied on
oops. Before that ultimate kernel oops, bad data has been
transmitted to other (sane) nodes, leading to ceph-osd crashes on those
nodes.
I don't think that's actually possible — the OSDs all do quite a lot of
interpretation between what they get off the wire and what goes on disk.
What you've
to a
kernel oops. Before that ultimate kernel oops, bad data has been
transmitted to other (sane) nodes, leading to ceph-osd crashes on those
nodes.
If you think this scenario is highly improbable in real life (that is,
btrfs will probably be fixed for good, and then corruption can't
happen
) nodes, leading to ceph-osd crashes on those
nodes.
I don't think that's actually possible — the OSDs all do quite a lot of
interpretation between what they get off the wire and what goes on disk. What
you've got here are 4 corrupted LevelDB databases, and we pretty much can't do
that through the interfaces we
On Tue, Jul 3, 2012 at 1:40 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
Upgraded the kernel to 3.5.0-rc4 + some patches; btrfs seems OK right
now.
Tried to restart the osds with 0.47.3, then the next branch, and today with 0.48.
4 of 8 nodes fail with the same message:
ceph version
On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
In the case I could repair, do you think the crashed FS as it is right now is
valuable to you, for future reference, as I saw you can't reproduce the
problem? I can make an archive (or a btrfs dump?), but it will be
Hey guys,
Thanks for the problem report. I've created an issue to track it at
http://tracker.newdream.net/issues/2687.
It looks like we just assume that if you're using a file, you've got
enough space for it. It shouldn't be a big deal to at least do some
startup checks which will fail gracefully.
THANKS a lot. This fixes it. I've merged your branch into next and I
wasn't able to trigger the osd crash again. So please include this in 0.48.
Greets
Stefan
On 26.06.2012 20:01, Sam Just wrote:
Stefan,
Sorry for the delay, I think I've found the problem. Could you give
On Wed, 27 Jun 2012, Stefan Priebe - Profihost AG wrote:
THANKS a lot. This fixes it. I've merged your branch into next and I wasn't
able to trigger the osd crash again. So please include this in 0.48.
Excellent. Thanks for testing! This is now in next.
sage
Greets
Stefan
On
On Mon, Jun 25, 2012 at 10:48 PM, Stefan Priebe s.pri...@profihost.ag wrote:
Strange - I just copied /core.hostname and /usr/bin/ceph-osd; no idea how this
can happen. For building I use the provided Debian scripts.
Perhaps you upgraded the debs but did not restart the daemons? That
would make the
Stefan,
Sorry for the delay, I think I've found the problem. Could you give
wip_ms_handle_reset_race a try?
-Sam
On Tue, Jun 26, 2012 at 9:47 AM, Stefan Priebe s.pri...@profihost.ag wrote:
On 26.06.2012 18:05, Tommi Virtanen wrote:
On Mon, Jun 25, 2012 at 10:48 PM, Stefan Priebe
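Trying such a wip branch typically meant building from source; a sketch assuming
a git checkout of ceph.git and the autotools build used at the time:

git fetch origin
git checkout wip_ms_handle_reset_race
./autogen.sh && ./configure && make
# then reinstall the ceph-osd binary and restart the osd daemons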
I've yet to make the core match the binary.
On Jun 22, 2012, at 11:32 PM, Stefan Priebe s.pri...@profihost.ag wrote:
Thanks, did you find anything?
On 23.06.2012 at 01:59, Sam Just sam.j...@inktank.com wrote:
I am still looking into the logs.
-Sam
On Fri, Jun 22, 2012 at 3:56 PM,
Thanks, yes, it is from the next branch.
On 23.06.2012 at 02:26, Dan Mick dan.m...@inktank.com wrote:
The ceph-osd binary you sent claims to be version 0.47.2-521-g88c762, which
is not quite 0.47.3. You can get the version with binary -v, or (in my
case) examining strings in the binary.
I'm still able to crash the ceph cluster while doing a lot of random I/O
and then shutting down the KVM.
Stefan
On 21.06.2012 21:57, Stefan Priebe wrote:
OK, I discovered this time that all osds had the same disk usage before
the crash. After starting the osd again I got this one:
/dev/sdb1 224G 23G
Stefan, I'm looking at your logs and coredump now.
On 06/21/2012 11:43 PM, Stefan Priebe wrote:
Does anybody have an idea? This is right now a showstopper for me.
On 21.06.2012 at 14:55, Stefan Priebe - Profihost
AG s.pri...@profihost.ag wrote:
Hello list,
I'm able to reproducibly crash osd
The ceph-osd binary you sent claims to be version 0.47.2-521-g88c762,
which is not quite 0.47.3. You can get the version with binary -v, or
(in my case) examining strings in the binary. I'm retrieving that
version to analyze the core dump.
On 06/21/2012 11:43 PM, Stefan Priebe wrote:
Does
Hello list,
I'm able to reproducibly crash osd daemons.
How I can reproduce:
Kernel: 3.5.0-rc3
Ceph: 0.47.3
FS: btrfs
Journal: 2GB tmpfs per OSD
OSD: 3x servers with 4x Intel SSD OSDs each
10GBE Network
rbd_cache_max_age: 2.0
rbd_cache_size: 33554432
Disk is set to writeback.
Start a KVM VM
When I now start the OSD again, it seems to hang forever. Load goes
up to 200 and I/O waits rise from 0% to 20%.
On 21.06.2012 14:55, Stefan Priebe - Profihost AG wrote:
Hello list,
I'm able to reproducibly crash osd daemons.
How I can reproduce:
Kernel: 3.5.0-rc3
Ceph: 0.47.3
FS:
Another strange thing. Why does THIS OSD have 24GB and the others just
650MB?
/dev/sdb1 224G 654M 214G 1% /srv/osd.20
/dev/sdc1 224G 638M 214G 1% /srv/osd.21
/dev/sdd1 224G 24G 190G 12% /srv/osd.22
/dev/sde1 224G 607M 214G 1%
Mhm, is this normal (ceph health is NOW OK again)?
/dev/sdb1 224G 655M 214G 1% /srv/osd.20
/dev/sdc1 224G 640M 214G 1% /srv/osd.21
/dev/sdd1 224G 34G 181G 16% /srv/osd.22
/dev/sde1 224G 608M 214G 1% /srv/osd.23
Why does one OSD have
OK, I discovered this time that all osds had the same disk usage before the
crash. After starting the osd again I got this one:
/dev/sdb1 224G 23G 191G 11% /srv/osd.30
/dev/sdc1 224G 1,5G 213G 1% /srv/osd.31
/dev/sdd1 224G 1,5G 213G 1% /srv/osd.32
Not sure if this is a bug or not. It was definitely user error -- but
since the OSD process bailed, figured I would report it.
I had /tmpfs mounted with 2.5GB of space:
tmpfs on /tmpfs type tmpfs (rw,size=2560m)
Then I decided to increase my journal size to 5G, but forgot to
increase the limit
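The mismatch is easy to avoid; a sketch, assuming the journal lives on /tmpfs
and that 'osd journal size' is given in MB as in ceph.conf of that era:

# grow the tmpfs so it is at least as large as the journal it will hold
mount -o remount,size=5632m /tmpfs
grep 'osd journal size' /etc/ceph/ceph.conf   # e.g. osd journal size = 5120
df -h /tmpfs                                  # confirm capacity exceeds the journal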
I hit this a couple times and wondered the same thing. Why does the
OSD need to bail when it runs out of journal space?
On Wed, Jun 20, 2012 at 3:56 PM, Travis Rhoden trho...@gmail.com wrote:
Not sure if this is a bug or not. It was definitely user error -- but
since the OSD process bailed,
On 17.06.2012 23:16, Sage Weil wrote:
Hi Stefan,
I opened http://tracker.newdream.net/issues/2599 to track this, but the
dump strangely does not include the ceph version or commit sha1. What
version were you running?
Sorry, that was my build system; it accidentally removed the .git dir while
Hi,
today I got another osd crash ;-( Strangely the osd logs are all empty.
It seems the logrotate hasn't reloaded the daemons, but I still have the
core dump file? What's next?
Stefan
executable` is
needed to interpret this.
--- end dump of recent events ---
On 16.06.2012 14:57, Stefan Priebe wrote:
Hi,
today I got another osd crash ;-( Strangely the osd logs are all empty.
It seems the logrotate hasn't reloaded the daemons, but I still have the
core dump file? What's next
no clear stack message. I
suspect btrfs, but I have no proof.
This node (OSD.7) seems to have been the first one to crash; the
reconstruction it triggered between OSDs then led to the cascading osd
crashes. The other physical machines are still up, but with no osd running.
Here are some traces found in the osd log
On Mon, Jun 4, 2012 at 1:44 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
Results: worked like a charm for two days, apart from btrfs warning messages;
then the OSDs began to crash one after another, 'domino style'.
Sorry to hear that. Reading through your message, there seem to be
several problems; whether
Can you send the osd logs? The merge_log crashes are probably fixable
if I can see the logs.
The leveldb crash is almost certainly a result of memory corruption.
Thanks
-Sam
On Mon, Jun 4, 2012 at 9:16 AM, Tommi Virtanen t...@inktank.com wrote:
On Mon, Jun 4, 2012 at 1:44 AM, Yann Dupont
This is probably the same/similar to http://tracker.newdream.net/issues/2462,
no? There's a log there, though I've no idea how helpful it is.
On Monday, June 4, 2012 at 10:40 AM, Sam Just wrote:
Can you send the osd logs? The merge_log crashes are probably fixable
if I can see the logs.
+recovering+remapped+backfill; 1950 GB data, 3734 GB used, 26059 GB /
29794 GB avail; 272914/1349073 degraded (20.230%)
and sometimes the ceph-osd on node0 is crashing. At the moment of writing,
the degraded percentage continues to shrink, down below 20%.
How did ceph-osd crash? Is there a dump
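A quick way to answer that on the node; a sketch assuming default log locations
(the exact log file name varies across versions):

# look for an assert and backtrace in the osd logs
grep -B2 -A20 'FAILED assert' /var/log/ceph/*osd*.log*
# check for core dumps lying around
ls /core* /var/crash 2>/dev/null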
Hi Sage,
I uploaded the osd.0 log as well.
http://85.214.49.87/ceph/20120124/osd.0.log.bz2
-martin
On 25.01.2012 23:08, Sage Weil wrote:
Hi Martin,
On Tue, 24 Jan 2012, Martin Mailand wrote:
Hi,
today I tried the btrfs patch mentioned on the btrfs ml. Therefore I rebooted
osd.0 with a new
Hi Martin,
On Tue, 24 Jan 2012, Martin Mailand wrote:
Hi,
today I tried the btrfs patch mentioned on the btrfs ml. Therefore I rebooted
osd.0 with a new kernel and created a new btrfs on osd.0, then I took the
osd.0 into the cluster. During the resync of osd.0, osd.2 and osd.3
On Tue, Jan 24, 2012 at 10:48 AM, Martin Mailand mar...@tuxadero.com wrote:
Hi,
today I tried the btrfs patch mentioned on the btrfs ml. Therefore I
rebooted osd.0 with a new kernel and created a new btrfs on osd.0, then
I took the osd.0 into the cluster. During the resync of osd.0
Hi Greg,
ok, do you guys still need the core files, or could I delete them?
-martin
On 24.01.2012 22:13, Gregory Farnum wrote:
On Tue, Jan 24, 2012 at 10:48 AM, Martin Mailand mar...@tuxadero.com wrote:
Hi,
today I tried the btrfs patch mentioned on the btrfs ml. Therefore I
rebooted osd.0
On Tue, Jan 24, 2012 at 1:22 PM, Martin Mailand mar...@tuxadero.com wrote:
Hi Greg,
ok, do you guys still need the core files, or could I delete them?
Sam thinks probably not since we have the backtraces and the
logs...thanks for asking, though! :)
-Greg
This is an interesting one -- the invariant that assert is checking
isn't too complicated (that the object lives on the RecoveryWQ's
queue) and seems to hold everywhere the RecoveryWQ is called. And the
functions modifying the queue are always called under the workqueue
lock, and do maintenance if
On 05/27/2011 06:16 PM, Gregory Farnum wrote:
This is an interesting one -- the invariant that assert is checking
isn't too complicated (that the object lives on the RecoveryWQ's
queue) and seems to hold everywhere the RecoveryWQ is called. And the
functions modifying the queue are always called
On 05/27/2011 10:18 PM, Gregory Farnum wrote:
Can you check out the recoverywq_fix branch and see if that prevents
this issue? Or just apply the patch I've included below. :)
-Greg
Looks as though this patch has helped.
At least this osd has completed rebalancing.
Great! Thanks!
WBR,