Re: osd crash when deep-scrubbing

2015-10-19 Thread changtao381
dPool::WorkThread*)+0x53d) [0x9e05dd] > 17: (ThreadPool::WorkThread::entry()+0x10) [0x9e1760] > 18: (()+0x7a51) [0x7f384b6b0a51] > 19: (clone()+0x6d) [0x7f384a6409ad] > > ceph version is v0.80.9; manually executing `ceph pg deep-scrub 3.d70` would also > cause an osd crash.

osd crash when deep-scrubbing

2015-10-18 Thread Jiaying Ren
orkThread*)+0x53d) [0x9e05dd] 17: (ThreadPool::WorkThread::entry()+0x10) [0x9e1760] 18: (()+0x7a51) [0x7f384b6b0a51] 19: (clone()+0x6d) [0x7f384a6409ad] ceph version is v0.80.9; manually executing `ceph pg deep-scrub 3.d70` would also cause an osd crash. Any ideas? Or did I miss some logs necessa
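
A minimal way to reproduce this with more verbose logging, assuming the acting primary for pg 3.d70 is osd.N (the osd id and log path below are placeholders):

  ceph pg map 3.d70                                   # find the acting/primary osds for the pg
  ceph tell osd.N injectargs '--debug-osd 20 --debug-filestore 20'
  ceph pg deep-scrub 3.d70
  less /var/log/ceph/ceph-osd.N.log                   # the failed assert and backtrace should land here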

Re: osd crash with object store set to newstore

2015-06-05 Thread Srikanth Madugundi
. Regards Srikanth On Mon, Jun 1, 2015 at 10:25 PM, Srikanth Madugundi srikanth.madugu...@gmail.com wrote: Hi Sage, Unfortunately I purged the cluster yesterday and restarted the backfill tool. I did not see the osd crash yet on the cluster. I am monitoring the OSDs and will update you once I

Re: osd crash with object store set to newstore

2015-06-05 Thread Sage Weil
. Regards Srikanth On Mon, Jun 1, 2015 at 10:25 PM, Srikanth Madugundi srikanth.madugu...@gmail.com wrote: Hi Sage, Unfortunately I purged the cluster yesterday and restarted the backfill tool. I did not see the osd crash yet on the cluster. I am monitoring the OSDs and will update you

Re: osd crash with object store set to newstore

2015-06-03 Thread Srikanth Madugundi
and restarted the backfill tool. I did not see the osd crash yet on the cluster. I am monitoring the OSDs and will update you once I see the crash. With the new backfill run I have reduced the rps by half, not sure if this is the reason for not seeing the crash yet. Regards Srikanth On Mon, Jun 1

Re: osd crash with object store set to newstore

2015-06-01 Thread Sage Weil
I pushed a commit to wip-newstore-debuglist.. can you reproduce the crash with that branch with 'debug newstore = 20' and send us the log? (You can just do 'ceph-post-file filename'.) Thanks! sage On Mon, 1 Jun 2015, Srikanth Madugundi wrote: Hi Sage, The assertion failed at line 1639,
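As a sketch, assuming the wip-newstore-debuglist build is already installed on the affected node (osd id and log path are placeholders):

  # ceph.conf on the crashing osd's node
  [osd]
      debug newstore = 20
  # restart the osd, reproduce the crash, then upload the log
  ceph-post-file /var/log/ceph/ceph-osd.N.log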

osd crash with object store set to newstore

2015-06-01 Thread Srikanth Madugundi
Hi Sage and all, I built the ceph code from wip-newstore on RHEL7 and am running performance tests to compare with filestore. After a few hours of running the tests the osd daemons started to crash. Here is the stack trace; the osd crashes immediately after the restart, so I could not get the osd up and

Re: osd crash with object store set to newstore

2015-06-01 Thread Srikanth Madugundi
Hi Sage, The assertion failed at line 1639, here is the log message 2015-05-30 23:17:55.141388 7f0891be0700 -1 os/newstore/NewStore.cc: In function 'virtual int NewStore::collection_list_partial(coll_t, ghobject_t, int, int, snapid_t, std::vectorghobject_t*, ghobject_t*)' thread 7f0891be0700

Re: osd crash with object store set to newstore

2015-06-01 Thread Sage Weil
On Mon, 1 Jun 2015, Srikanth Madugundi wrote: Hi Sage and all, I built the ceph code from wip-newstore on RHEL7 and am running performance tests to compare with filestore. After a few hours of running the tests the osd daemons started to crash. Here is the stack trace; the osd crashes immediately

Re: osd crash with object store set to newstore

2015-06-01 Thread Srikanth Madugundi
Hi Sage, Unfortunately I purged the cluster yesterday and restarted the backfill tool. I did not see the osd crash yet on the cluster. I am monitoring the OSDs and will update you once I see the crash. With the new backfill run I have reduced the rps by half, not sure if this is the reason

OSD Crash for xattr _ absent issue.

2014-11-26 Thread Wenjunh
Hi Samuel, Sage, In our current production environment there are osd crashes because of data inconsistency when reading the “_” xattr, which is described in this issue: http://tracker.ceph.com/issues/10117. I also found a two-year-old issue, which also describes

Bobtail to dumpling (was: OSD crash during repair)

2013-09-10 Thread Chris Dunlop
On Fri, Sep 06, 2013 at 08:21:07AM -0700, Sage Weil wrote: On Fri, 6 Sep 2013, Chris Dunlop wrote: On Thu, Sep 05, 2013 at 07:55:52PM -0700, Sage Weil wrote: Also, you should upgrade to dumpling. :) I've been considering it. It was initially a little scary with the various issues that were

Re: Bobtail to dumpling (was: OSD crash during repair)

2013-09-10 Thread Sage Weil
On Wed, 11 Sep 2013, Chris Dunlop wrote: On Fri, Sep 06, 2013 at 08:21:07AM -0700, Sage Weil wrote: On Fri, 6 Sep 2013, Chris Dunlop wrote: On Thu, Sep 05, 2013 at 07:55:52PM -0700, Sage Weil wrote: Also, you should upgrade to dumpling. :) I've been considering it. It was initially a

Re: OSD crash during repair

2013-09-06 Thread Sage Weil
On Fri, 6 Sep 2013, Chris Dunlop wrote: On Fri, Sep 06, 2013 at 01:12:21PM +1000, Chris Dunlop wrote: On Thu, Sep 05, 2013 at 07:55:52PM -0700, Sage Weil wrote: On Fri, 6 Sep 2013, Chris Dunlop wrote: Hi Sage, Does this answer your question? 2013-09-06 09:30:19.813811 7f0ae8cbc700

Re: OSD crash during repair

2013-09-06 Thread Sage Weil
On Fri, 6 Sep 2013, Chris Dunlop wrote: On Thu, Sep 05, 2013 at 07:55:52PM -0700, Sage Weil wrote: On Fri, 6 Sep 2013, Chris Dunlop wrote: Hi Sage, Does this answer your question? 2013-09-06 09:30:19.813811 7f0ae8cbc700 0 log [INF] : applying configuration change:

OSD crash during repair

2013-09-05 Thread Chris Dunlop
G'day, I'm getting an OSD crash on 0.56.7-1~bpo70+1 whilst trying to repair an OSD: http://tracker.ceph.com/issues/6233 ceph version 0.56.7 (14f23ab86b0058a8651895b3dc972a29459f3a33) 1: /usr/bin/ceph-osd() [0x8530a2] 2: (()+0xf030) [0x7f541ca39030] 3: (gsignal()+0x35) [0x7f541b132475

Re: OSD crash during repair

2013-09-05 Thread Sage Weil
wrote: G'day, I'm getting an OSD crash on 0.56.7-1~bpo70+1 whilst trying to repair an OSD: http://tracker.ceph.com/issues/6233 ceph version 0.56.7 (14f23ab86b0058a8651895b3dc972a29459f3a33) 1: /usr/bin/ceph-osd() [0x8530a2] 2: (()+0xf030) [0x7f541ca39030] 3: (gsignal()+0x35

Re: OSD crash during repair

2013-09-05 Thread Chris Dunlop
is. Thanks! sage On Fri, 6 Sep 2013, Chris Dunlop wrote: G'day, I'm getting an OSD crash on 0.56.7-1~bpo70+1 whilst trying to repair an OSD: http://tracker.ceph.com/issues/6233 ceph version 0.56.7 (14f23ab86b0058a8651895b3dc972a29459f3a33) 1: /usr/bin/ceph-osd() [0x8530a2

Re: OSD crash during repair

2013-09-05 Thread Sage Weil
On Fri, 6 Sep 2013, Chris Dunlop wrote: G'day, I'm getting an OSD crash on 0.56.7-1~bpo70+1 whilst trying to repair an OSD: http://tracker.ceph.com/issues/6233 ceph version 0.56.7 (14f23ab86b0058a8651895b3dc972a29459f3a33) 1: /usr/bin/ceph-osd

Re: OSD crash during repair

2013-09-05 Thread Chris Dunlop
On Thu, Sep 05, 2013 at 07:55:52PM -0700, Sage Weil wrote: On Fri, 6 Sep 2013, Chris Dunlop wrote: Hi Sage, Does this answer your question? 2013-09-06 09:30:19.813811 7f0ae8cbc700 0 log [INF] : applying configuration change: internal_safe_to_start_threads = 'true' 2013-09-06

Re: OSD crash during repair

2013-09-05 Thread Chris Dunlop
On Fri, Sep 06, 2013 at 01:12:21PM +1000, Chris Dunlop wrote: On Thu, Sep 05, 2013 at 07:55:52PM -0700, Sage Weil wrote: On Fri, 6 Sep 2013, Chris Dunlop wrote: Hi Sage, Does this answer your question? 2013-09-06 09:30:19.813811 7f0ae8cbc700 0 log [INF] : applying configuration change:

OSD crash upon pool creation

2013-07-15 Thread Andrey Korolyov
Hello, Using db2bb270e93ed44f9252d65d1d4c9b36875d0ea5 I observed some disaster-like behavior after the ``pool create'' command - every osd daemon in the cluster will die at least once (some will crash several times in a row after being brought back). Please take a look at the backtraces (almost identical)

OSD Crash

2013-03-04 Thread Dave Spano
I had one of my OSDs crash yesterday. I'm using ceph version 0.56.3 (6eb7e15a4783b122e9b0c85ea9ba064145958aa5). The part of the log file where the crash happened is attached. Not really sure what led up to it, but I did get an alert from my server monitor telling me my swap space got really

Re: OSD crash, ceph version 0.56.1

2013-01-09 Thread Ian Pye
On Wed, Jan 9, 2013 at 4:38 PM, Sage Weil s...@inktank.com wrote: On Wed, 9 Jan 2013, Ian Pye wrote: Hi, Every time I try to bring up an OSD, it crashes and I get the following: error (121) Remote I/O error not handled on operation 20 This error code (EREMOTEIO) is not used by Ceph. What

osd crash after reboot

2012-12-14 Thread Stefan Priebe
Hello list, after a reboot of my node i see this on all OSDs of this node after the reboot: 2012-12-14 09:03:20.393224 7f8e652f8780 -1 osd/OSD.cc: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f8e652f8780 time 2012-12-14 09:03:20.392528 osd/OSD.cc: 4385: FAILED

Re: osd crash after reboot

2012-12-14 Thread Stefan Priebe
same log more verbose: 11 ec=10 les/c 3307/3307 3306/3306/3306) [] r=0 lpr=0 lcod 0'0 mlcod 0'0 inactive] read_log done -11 2012-12-14 09:17:50.648572 7fb6e0d6b780 10 osd.3 pg_epoch: 3996 pg[3.44b( v 3988'3969 (1379'2968,3988'3969] local-les=3307 n=11 ec=10 les/c 3307/3307 3306/3306/3306)

Re: osd crash after reboot

2012-12-14 Thread Dennis Jacobfeuerborn
On 12/14/2012 10:14 AM, Stefan Priebe wrote: One more IMPORTANT note. This might happen due to the fact that a disk was missing (disk failure) after the reboot. fstab and mountpoint are working with UUIDs so they match but the journal block device: osd journal = /dev/sde1 didn't match

Re: osd crash after reboot

2012-12-14 Thread Mark Nelson
On 12/14/2012 08:52 AM, Dennis Jacobfeuerborn wrote: On 12/14/2012 10:14 AM, Stefan Priebe wrote: One more IMPORTANT note. This might happen due to the fact that a disk was missing (disk failure) after the reboot. fstab and mountpoint are working with UUIDs so they match but the journal block

Re: osd crash after reboot

2012-12-14 Thread Stefan Priebe - Profihost AG
Hello Dennis, On 14.12.2012 15:52, Dennis Jacobfeuerborn wrote: didn't match anymore - as the numbers got renumbered due to the failed disk. Is there a way to use some kind of UUIDs here for the journal too? You should be able to use /dev/disk/by-uuid/* instead. That should give you a stable view
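
A sketch of what that change would look like in ceph.conf (section name and UUID are placeholders; a raw journal partition without a filesystem may only show up under /dev/disk/by-id or /dev/disk/by-partuuid rather than by-uuid):

  [osd.3]
      # fragile: device names can be renumbered after a disk failure
      # osd journal = /dev/sde1
      # stable symlink maintained by udev
      osd journal = /dev/disk/by-uuid/2f5c3a9e-example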

Re: osd crash after reboot

2012-12-14 Thread Mark Nelson
Hi Stefan, Here's what I often do when I have a journal and data partition sharing a disk: sudo parted -s -a optimal /dev/$DEV mklabel gpt sudo parted -s -a optimal /dev/$DEV mkpart osd-device-$i-journal 0% 10G sudo parted -s -a optimal /dev/$DEV mkpart osd-device-$i-data 10G 100% Mark On
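
A useful side effect (a sketch; availability of these symlinks depends on the udev version): because mkpart is given a GPT partition name, the partitions should also appear under /dev/disk/by-partlabel, which gives a stable journal path for ceph.conf regardless of how the kernel numbers the disks. With $i expanded to 0, for example:

  osd journal = /dev/disk/by-partlabel/osd-device-0-journal
  # the data partition can likewise be mounted via /dev/disk/by-partlabel/osd-device-0-data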

Re: osd crash after reboot

2012-12-14 Thread Stefan Priebe - Profihost AG
Hi Mark, On 14.12.2012 16:20, Mark Nelson wrote: sudo parted -s -a optimal /dev/$DEV mklabel gpt sudo parted -s -a optimal /dev/$DEV mkpart osd-device-$i-journal 0% 10G sudo parted -s -a optimal /dev/$DEV mkpart osd-device-$i-data 10G 100% My disks are gpt too and i'm also using parted. But

Re: osd crash after reboot

2012-12-14 Thread Stefan Priebe - Profihost AG
Hello Mark, On 14.12.2012 16:20, Mark Nelson wrote: sudo parted -s -a optimal /dev/$DEV mklabel gpt sudo parted -s -a optimal /dev/$DEV mkpart osd-device-$i-journal 0% 10G sudo parted -s -a optimal /dev/$DEV mkpart osd-device-$i-data 10G 100% Isn't that the part type you're using? mkpart

Re: osd crash after reboot

2012-12-14 Thread Sage Weil
On Fri, 14 Dec 2012, Stefan Priebe wrote: One more IMPORTANT note. This might happen due to the fact that a disk was missing (disk failure) after the reboot. fstab and mountpoint are working with UUIDs so they match but the journal block device: osd journal = /dev/sde1 didn't match

Re: osd crash after reboot

2012-12-14 Thread Stefan Priebe
Hi Sage, this was just an idea and i need to fix MY uuid problem. But the crash is still a ceph problem. Have you looked into my log? On 14.12.2012 20:42, Sage Weil wrote: On Fri, 14 Dec 2012, Stefan Priebe wrote: One more IMPORTANT note. This might happen due to the fact that a

Re: OSD crash on 0.48.2argonaut

2012-11-15 Thread Josh Durgin
On 11/14/2012 11:31 PM, eric_yh_c...@wiwynn.com wrote: Dear All: I met this issue on one of the osd nodes. Is this a known issue? Thanks! ceph version 0.48.2argonaut (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe) 1: /usr/bin/ceph-osd() [0x6edaba] 2: (()+0xfcb0) [0x7f08b112dcb0] 3:

OSD crash on 0.48.2argonaut

2012-11-14 Thread Eric_YH_Chen
Dear All: I met this issue on one of the osd nodes. Is this a known issue? Thanks! ceph version 0.48.2argonaut (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe) 1: /usr/bin/ceph-osd() [0x6edaba] 2: (()+0xfcb0) [0x7f08b112dcb0] 3: (gsignal()+0x35) [0x7f08afd09445] 4: (abort()+0x17b)

Re: osd crash in ReplicatedPG::add_object_context_to_pg_stat(ReplicatedPG::ObjectContext*, pg_stat_t*)

2012-10-15 Thread Samuel Just
(in fact I had some OSD crashes - not definitive, the osds restarted ok -, maybe related to an error in my new nodes' network configuration that I discovered afterwards. More on that later, I can find the traces, but I'm not sure it's related) When ceph was finally stable again, with HEALTH_OK, I decided

osd crash in ReplicatedPG::add_object_context_to_pg_stat(ReplicatedPG::ObjectContext*, pg_stat_t*)

2012-10-11 Thread Yann Dupont
, 49,50,51... Data store is on XFS. I'm currently in the process of growing my ceph from 6 nodes to 12 nodes. 11 nodes are currently in ceph, for a 130 TB total. Declaring the new osds was OK, the data moved quite ok (in fact I had some OSD crashes - not definitive, the osds restarted ok -, maybe

OSD-crash on 0.48.1argonout, error void ReplicatedPG::recover_got(hobject_t, eversion_t) not seen on list

2012-09-19 Thread Oliver Francke
Hi all, after adding a new node to our ceph cluster yesterday, we had a crash of one OSD. I found this kind of message in the bug tracker as being solved (http://tracker.newdream.net/issues/2075); I will update that one for my convenience and attach the corresponding log (due to

Re: OSD crash

2012-09-04 Thread Andrey Korolyov
Hi, Almost always one or more osds die when doing overlapped recovery - e.g. adding a new crushmap and removing some newly added osds from the cluster some minutes later during remap, or injecting two slightly different crushmaps after a short time (surely preserving at least one of the replicas online). Seems that

Re: OSD crash

2012-09-04 Thread Sage Weil
On Tue, 4 Sep 2012, Andrey Korolyov wrote: Hi, Almost always one or more osds die when doing overlapped recovery - e.g. adding a new crushmap and removing some newly added osds from the cluster some minutes later during remap, or injecting two slightly different crushmaps after a short time (surely

Re: OSD crash

2012-08-22 Thread Sage Weil
On Thu, 23 Aug 2012, Andrey Korolyov wrote: Hi, today during a heavy test a pair of osds and one mon died, resulting in a hard lockup of some kvm processes - they went unresponsive and were killed, leaving zombie processes ([kvm] defunct). The entire cluster contains sixteen osds on eight nodes and

Re: OSD crash

2012-08-22 Thread Andrey Korolyov
On Thu, Aug 23, 2012 at 2:33 AM, Sage Weil s...@inktank.com wrote: On Thu, 23 Aug 2012, Andrey Korolyov wrote: Hi, today during a heavy test a pair of osds and one mon died, resulting in a hard lockup of some kvm processes - they went unresponsive and were killed, leaving zombie processes ([kvm]

Re: OSD crash

2012-08-22 Thread Gregory Farnum
The tcmalloc backtrace on the OSD suggests this may be unrelated, but what's the fd limit on your monitor process? You may be approaching that limit if you've got 500 OSDs and a similar number of clients. On Wed, Aug 22, 2012 at 6:55 PM, Andrey Korolyov and...@xdel.ru wrote: On Thu, Aug 23, 2012
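
A quick way to check how close a daemon is to its descriptor limit (pid lookup and numbers are illustrative):

  cat /proc/$(pidof ceph-mon)/limits | grep 'open files'   # configured soft/hard limit
  ls /proc/$(pidof ceph-mon)/fd | wc -l                    # descriptors currently in use
  # the limit can be raised with ulimit -n in the init script, or via the
  # 'max open files' option in the [global] section of ceph.conf (applied by the init script)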

Re: domino-style OSD crash

2012-07-10 Thread Yann Dupont
On 09/07/2012 19:14, Samuel Just wrote: Can you restart the node that failed to complete the upgrade with Well, it's a little bit complicated; I now run those nodes with XFS, and I've long-running jobs on them right now, so I can't stop the ceph cluster at the moment. As I've kept the

Re: domino-style OSD crash

2012-07-10 Thread Tommi Virtanen
On Tue, Jul 10, 2012 at 2:46 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote: As I've kept the original broken btrfs volumes, I tried this morning to run the old osd in parallel, using the $cluster variable. I only have partial success. The cluster mechanism was never intended for moving

Re: domino-style OSD crash

2012-07-10 Thread Yann Dupont
On 10/07/2012 17:56, Tommi Virtanen wrote: On Tue, Jul 10, 2012 at 2:46 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote: As I've kept the original broken btrfs volumes, I tried this morning to run the old osd in parallel, using the $cluster variable. I only have partial success. The

Re: domino-style OSD crash

2012-07-10 Thread Tommi Virtanen
On Tue, Jul 10, 2012 at 9:39 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote: The cluster mechanism was never intended for moving existing osds to other clusters. Trying that might not be a good idea. Ok, good to know. I saw that the remaining maps could lead to problems, but in two words, what

Re: domino-style OSD crash

2012-07-10 Thread Yann Dupont
On 10/07/2012 19:11, Tommi Virtanen wrote: On Tue, Jul 10, 2012 at 9:39 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote: The cluster mechanism was never intended for moving existing osds to other clusters. Trying that might not be a good idea. Ok, good to know. I saw that the remaining

Re: domino-style OSD crash

2012-07-10 Thread Tommi Virtanen
On Tue, Jul 10, 2012 at 10:36 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote: Fundamentally, it comes down to this: the two clusters will still have the same fsid, and you won't be isolated from configuration errors or (CEPH-PROD is the old btrfs volume). /CEPH is the new xfs volume, completely

Re: domino-style OSD crash

2012-07-09 Thread Samuel Just
Can you restart the node that failed to complete the upgrade with debug filestore = 20 debug osd = 20 and post the log after an hour or so of running? The upgrade process might legitimately take a while. -Sam On Sat, Jul 7, 2012 at 1:19 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote: Le
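
Concretely, on the node that failed the upgrade that would look something like this (section placement and log path are placeholders):

  [osd]
      debug osd = 20
      debug filestore = 20
  # restart the osd, let it run for an hour or so, then collect
  # /var/log/ceph/ceph-osd.<id>.log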

Re: domino-style OSD crash

2012-07-09 Thread Tommi Virtanen
to other (sane) nodes, leading to ceph-osd crashes on those nodes. The LevelDB binary contents are not transferred over to other nodes; this kind of corruption would not spread over the Ceph clustering mechanisms. It's more likely that you have 4 independently corrupted LevelDBs. Something

Re: domino-style OSD crash

2012-07-09 Thread Yann Dupont
that ultimate kernel oops, bad data has been transmitted to other (sane) nodes, leading to ceph-osd crashes on those nodes. The LevelDB binary contents are not transferred over to other nodes; Ok thanks for the clarification; this kind of corruption would not spread over the Ceph clustering

Re: domino-style OSD crash

2012-07-09 Thread Tommi Virtanen
On Mon, Jul 9, 2012 at 12:05 PM, Yann Dupont yann.dup...@univ-nantes.fr wrote: The information here isn't enough to say whether the cause of the corruption is btrfs or LevelDB, but the recovery needs to handled by LevelDB -- and upstream is working on making it more robust:

Re: domino-style OSD crash

2012-07-07 Thread Yann Dupont
On 06/07/2012 19:01, Gregory Farnum wrote: On Fri, Jul 6, 2012 at 12:19 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote: On 05/07/2012 23:32, Gregory Farnum wrote: [...] ok, so as all nodes were identical, I have probably hit a btrfs bug (like an erroneous out of space) in more or

Re: domino-style OSD crash

2012-07-06 Thread Yann Dupont
On 05/07/2012 23:32, Gregory Farnum wrote: [...] ok, so as all nodes were identical, I have probably hit a btrfs bug (like an erroneous out of space) at more or less the same time. And when 1 osd was out, OH, I didn't finish the sentence... When 1 osd was out, missing data was copied on

Re: domino-style OSD crash

2012-07-06 Thread Gregory Farnum
On Fri, Jul 6, 2012 at 12:19 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote: On 05/07/2012 23:32, Gregory Farnum wrote: [...] ok, so as all nodes were identical, I have probably hit a btrfs bug (like an erroneous out of space) at more or less the same time. And when 1 osd was out,

Re: domino-style OSD crash

2012-07-05 Thread Gregory Farnum
oops. Before that ultimate kernel oops, bad data has been transmitted to other (sane) nodes, leading to ceph-osd crashes on those nodes. I don't think that's actually possible — the OSDs all do quite a lot of interpretation between what they get off the wire and what goes on disk. What you've

Re: domino-style OSD crash

2012-07-04 Thread Yann Dupont
to a kernel oops. Before that ultimate kernel oops, bad data has been transmitted to other (sane) nodes, leading to ceph-osd crashes on those nodes. If you think this scenario is highly improbable in real life (that is, btrfs will probably be fixed for good, and then, corruption can't happen

Re: domino-style OSD crash

2012-07-04 Thread Gregory Farnum
) nodes, leading to ceph-osd crashes on those nodes. I don't think that's actually possible — the OSDs all do quite a lot of interpretation between what they get off the wire and what goes on disk. What you've got here are 4 corrupted LevelDB databases, and we pretty much can't do that through

Re: domino-style OSD crash

2012-07-04 Thread Yann Dupont
to ceph-osd crashes on those nodes. I don't think that's actually possible — the OSDs all do quite a lot of interpretation between what they get off the wire and what goes on disk. What you've got here are 4 corrupted LevelDB databases, and we pretty much can't do that through the interfaces we

Re: domino-style OSD crash

2012-07-03 Thread Tommi Virtanen
On Tue, Jul 3, 2012 at 1:40 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote: Upgraded the kernel to 3.5.0-rc4 + some patches, seems btrfs is OK right now. Tried to restart osd with 0.47.3, then next branch, and today with 0.48. 4 of 8 nodes fail with the same message: ceph version

Re: domino-style OSD crash

2012-07-03 Thread Yann Dupont
On 03/07/2012 21:42, Tommi Virtanen wrote: On Tue, Jul 3, 2012 at 1:40 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote: Upgraded the kernel to 3.5.0-rc4 + some patches, seems btrfs is OK right now. Tried to restart osd with 0.47.3, then next branch, and today with 0.48. 4 of 8 nodes fail

Re: domino-style OSD crash

2012-07-03 Thread Tommi Virtanen
On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont yann.dup...@univ-nantes.fr wrote: In the case I could repair, do you think a crashed FS as it is right now is valuable for you, for future reference, as I saw you can't reproduce the problem? I can make an archive (or a btrfs dump?), but it will be

Re: Should an OSD crash when journal device is out of space?

2012-07-02 Thread Gregory Farnum
Hey guys, Thanks for the problem report. I've created an issue to track it at http://tracker.newdream.net/issues/2687. It looks like we just assume that if you're using a file, you've got enough space for it. It shouldn't be a big deal to at least do some startup checks which will fail gracefully.
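
Until such checks exist, a rough manual pre-check is to compare the configured journal size against the free space on the journal's filesystem (paths and the 2 GB figure are placeholders):

  JOURNAL=/tmpfs/osd.0.journal
  NEED_MB=2048      # matches 'osd journal size' in ceph.conf
  AVAIL_MB=$(df -Pm "$(dirname "$JOURNAL")" | awk 'NR==2 {print $4}')
  [ "$AVAIL_MB" -ge "$NEED_MB" ] || echo "journal filesystem too small: ${AVAIL_MB}MB < ${NEED_MB}MB"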

Re: reproducable osd crash

2012-06-27 Thread Stefan Priebe - Profihost AG
THANKS a lot. This fixes it. I've merged your branch into next and i wasn't able to trigger the osd crash again. So please include this in 0.48. Greets Stefan On 26.06.2012 20:01, Sam Just wrote: Stefan, Sorry for the delay, I think I've found the problem. Could you give

Re: reproducable osd crash

2012-06-27 Thread Sage Weil
On Wed, 27 Jun 2012, Stefan Priebe - Profihost AG wrote: THANKS a lot. This fixes it. I've merged your branch into next and i wasn't able to trigger the osd crash again. So please include this in 0.48. Excellent. Thanks for testing! This is now in next. sage Greets Stefan On

Re: reproducable osd crash

2012-06-26 Thread Tommi Virtanen
On Mon, Jun 25, 2012 at 10:48 PM, Stefan Priebe s.pri...@profihost.ag wrote: Strange, just copied /core.hostname and /usr/bin/ceph-osd, no idea how this can happen. For building I use the provided Debian scripts. Perhaps you upgraded the debs but did not restart the daemons? That would make the

Re: reproducable osd crash

2012-06-26 Thread Stefan Priebe
On 26.06.2012 18:05, Tommi Virtanen wrote: On Mon, Jun 25, 2012 at 10:48 PM, Stefan Priebe s.pri...@profihost.ag wrote: Strange, just copied /core.hostname and /usr/bin/ceph-osd, no idea how this can happen. For building I use the provided Debian scripts. Perhaps you upgraded the debs but did

Re: reproducable osd crash

2012-06-26 Thread Sam Just
Stefan, Sorry for the delay, I think I've found the problem. Could you give wip_ms_handle_reset_race a try? -Sam On Tue, Jun 26, 2012 at 9:47 AM, Stefan Priebe s.pri...@profihost.ag wrote: On 26.06.2012 18:05, Tommi Virtanen wrote: On Mon,

Re: reproducable osd crash

2012-06-25 Thread Dan Mick
I've yet to make the core match the binary. On Jun 22, 2012, at 11:32 PM, Stefan Priebe s.pri...@profihost.ag wrote: Thanks, did you find anything? On 23.06.2012 at 01:59, Sam Just sam.j...@inktank.com wrote: I am still looking into the logs. -Sam On Fri, Jun 22, 2012 at 3:56 PM,

Re: reproducable osd crash

2012-06-23 Thread Stefan Priebe
Thanks, yes, it is from the next branch. On 23.06.2012 at 02:26, Dan Mick dan.m...@inktank.com wrote: The ceph-osd binary you sent claims to be version 0.47.2-521-g88c762, which is not quite 0.47.3. You can get the version with binary -v, or (in my case) by examining strings in the binary.

Re: reproducable osd crash

2012-06-23 Thread Stefan Priebe
Thanks, did you find anything? On 23.06.2012 at 01:59, Sam Just sam.j...@inktank.com wrote: I am still looking into the logs. -Sam On Fri, Jun 22, 2012 at 3:56 PM, Dan Mick dan.m...@inktank.com wrote: Stefan, I'm looking at your logs and coredump now. On 06/21/2012 11:43 PM, Stefan

Re: reproducable osd crash

2012-06-22 Thread Stefan Priebe - Profihost AG
I'm still able to crash the ceph cluster while doing a lot of random I/O and then shutting down the KVM. Stefan On 21.06.2012 21:57, Stefan Priebe wrote: OK i discovered this time that all osds had the same disk usage before crash. After starting the osd again i got this one: /dev/sdb1 224G 23G

Re: reproducable osd crash

2012-06-22 Thread Dan Mick
Stefan, I'm looking at your logs and coredump now. On 06/21/2012 11:43 PM, Stefan Priebe wrote: Does anybody have an idea? This is a showstopper for me right now. On 21.06.2012 at 14:55, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Hello list, i'm able to reproducibly crash osd

Re: reproducable osd crash

2012-06-22 Thread Sam Just
I am still looking into the logs. -Sam On Fri, Jun 22, 2012 at 3:56 PM, Dan Mick dan.m...@inktank.com wrote: Stefan, I'm looking at your logs and coredump now. On 06/21/2012 11:43 PM, Stefan Priebe wrote: Does anybody have an idea? This is a showstopper for me right now. On 21.06.2012 at

Re: reproducable osd crash

2012-06-22 Thread Dan Mick
The ceph-osd binary you sent claims to be version 0.47.2-521-g88c762, which is not quite 0.47.3. You can get the version with binary -v, or (in my case) by examining strings in the binary. I'm retrieving that version to analyze the core dump. On 06/21/2012 11:43 PM, Stefan Priebe wrote: Does
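
For reference, one way to confirm which build produced a core (paths are placeholders):

  /usr/bin/ceph-osd -v                    # prints the version and commit sha1 of the installed binary
  gdb /usr/bin/ceph-osd /core.hostname    # 'bt' only gives a sane backtrace if binary and core match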

reproducable osd crash

2012-06-21 Thread Stefan Priebe - Profihost AG
Hello list, i'm able to reproducibly crash osd daemons. How I can reproduce: Kernel: 3.5.0-rc3 Ceph: 0.47.3 FS: btrfs Journal: 2GB tmpfs per OSD OSD: 3x servers with 4x Intel SSD OSDs each 10GBE Network rbd_cache_max_age: 2.0 rbd_cache_size: 33554432 Disk is set to writeback. Start a KVM VM
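
For reference, the reported setup expressed as a ceph.conf fragment (values and option spellings are taken from the report above; the tmpfs journal path is a placeholder):

  [client]
      rbd cache = true
      rbd cache size = 33554432
      rbd cache max age = 2.0        ; as listed in the report
  [osd]
      osd journal = /tmpfs/journal.$id
      osd journal size = 2048        ; 2GB tmpfs journal per osd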

Re: reproducable osd crash

2012-06-21 Thread Stefan Priebe - Profihost AG
When i now start the OSD again it seems to hang forever. Load goes up to 200 and I/O waits rise from 0% to 20%. On 21.06.2012 14:55, Stefan Priebe - Profihost AG wrote: Hello list, i'm able to reproducibly crash osd daemons. How I can reproduce: Kernel: 3.5.0-rc3 Ceph: 0.47.3 FS:

Re: reproducable osd crash

2012-06-21 Thread Stefan Priebe - Profihost AG
Another strange thing. Why does THIS OSD have 24GB and the others just 650MB? /dev/sdb1 224G 654M 214G 1% /srv/osd.20 /dev/sdc1 224G 638M 214G 1% /srv/osd.21 /dev/sdd1 224G 24G 190G 12% /srv/osd.22 /dev/sde1 224G 607M 214G 1%

Re: reproducable osd crash

2012-06-21 Thread Stefan Priebe - Profihost AG
Mhm is this normal (ceph health is NOW OK again) /dev/sdb1 224G 655M 214G 1% /srv/osd.20 /dev/sdc1 224G 640M 214G 1% /srv/osd.21 /dev/sdd1 224G 34G 181G 16% /srv/osd.22 /dev/sde1 224G 608M 214G 1% /srv/osd.23 Why does one OSD has

Re: reproducable osd crash

2012-06-21 Thread Stefan Priebe
OK i discovered this time that all osds had the same disk usage before crash. After starting the osd again i got this one: /dev/sdb1 224G 23G 191G 11% /srv/osd.30 /dev/sdc1 224G 1,5G 213G 1% /srv/osd.31 /dev/sdd1 224G 1,5G 213G 1% /srv/osd.32

Should an OSD crash when journal device is out of space?

2012-06-20 Thread Travis Rhoden
Not sure if this is a bug or not. It was definitely user error -- but since the OSD process bailed, figured I would report it. I had /tmpfs mounted with 2.5GB of space: tmpfs on /tmpfs type tmpfs (rw,size=2560m) Then I decided to increase my journal size to 5G, but forgot to increase the limit
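
Written out, the mismatch is easy to see (sizes from the report; the journal path is a placeholder):

  # /etc/fstab: journal backing store capped at 2.5 GB
  tmpfs  /tmpfs  tmpfs  rw,size=2560m  0  0
  # ceph.conf: journal preallocation now asks for 5 GB
  [osd]
      osd journal = /tmpfs/osd.$id.journal
      osd journal size = 5120
  # 5120 MB > 2560 MB, so preallocating the journal fails and the osd bails out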

Re: Should an OSD crash when journal device is out of space?

2012-06-20 Thread Matthew Roy
I hit this a couple of times and wondered the same thing. Why does the OSD need to bail when it runs out of journal space? On Wed, Jun 20, 2012 at 3:56 PM, Travis Rhoden trho...@gmail.com wrote: Not sure if this is a bug or not. It was definitely user error -- but since the OSD process bailed,

Re: OSD crash

2012-06-18 Thread Stefan Priebe - Profihost AG
On 17.06.2012 23:16, Sage Weil wrote: Hi Stefan, I opened http://tracker.newdream.net/issues/2599 to track this, but the dump strangely does not include the ceph version or commit sha1. What version were you running? Sorry, that was my build system; it accidentally removed the .git dir while

OSD crash

2012-06-16 Thread Stefan Priebe
Hi, today i got another osd crash ;-( Strangely the osd logs are all empty. It seems the logrotate hasn't reloaded the daemons but i still have the core dump file? What's next? Stefan

Re: OSD crash

2012-06-16 Thread Stefan Priebe
executable` is needed to interpret this. --- end dump of recent events --- On 16.06.2012 14:57, Stefan Priebe wrote: Hi, today i got another osd crash ;-( Strangely the osd logs are all empty. It seems the logrotate hasn't reloaded the daemons but i still have the core dump file? What's next

domino-style OSD crash

2012-06-04 Thread Yann Dupont
no clear stack message. I suspect btrfs, but I have no proof. This node (OSD.7) seems to have been the first one to crash, which generated reconstruction between OSDs and then led to the cascading osd crashes. The other physical machines are still up, but with no osd running. Here are some traces found in the osd log

Re: domino-style OSD crash

2012-06-04 Thread Tommi Virtanen
On Mon, Jun 4, 2012 at 1:44 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote: Results: Worked like a charm for two days, apart from btrfs warn messages; then OSDs began to crash one after another, 'domino style'. Sorry to hear that. Reading through your message, there seem to be several problems; whether

Re: domino-style OSD crash

2012-06-04 Thread Sam Just
Can you send the osd logs? The merge_log crashes are probably fixable if I can see the logs. The leveldb crash is almost certainly a result of memory corruption. Thanks -Sam On Mon, Jun 4, 2012 at 9:16 AM, Tommi Virtanen t...@inktank.com wrote: On Mon, Jun 4, 2012 at 1:44 AM, Yann Dupont

Re: domino-style OSD crash

2012-06-04 Thread Greg Farnum
This is probably the same/similar to http://tracker.newdream.net/issues/2462, no? There's a log there, though I've no idea how helpful it is. On Monday, June 4, 2012 at 10:40 AM, Sam Just wrote: Can you send the osd logs? The merge_log crashes are probably fixable if I can see the logs.

Re: Problem after ceph-osd crash

2012-02-20 Thread Sage Weil
+recovering+remapped+backfill; 1950 GB data, 3734 GB used, 26059 GB / 29794 GB avail; 272914/1349073 degraded (20.230%) and sometimes the ceph-osd on node0 is crashing. At the moment of writing, the degrading continues to shrink down below 20%. How did ceph-osd crash? Is there a dump

Re: osd crash during resync

2012-01-26 Thread Martin Mailand
Hi Sage, I uploaded the osd.0 log as well. http://85.214.49.87/ceph/20120124/osd.0.log.bz2 -martin On 25.01.2012 23:08, Sage Weil wrote: Hi Martin, On Tue, 24 Jan 2012, Martin Mailand wrote: Hi, today I tried the btrfs patch mentioned on the btrfs ml. Therefore I rebooted osd.0 with a new

Re: osd crash during resync

2012-01-25 Thread Sage Weil
Hi Martin, On Tue, 24 Jan 2012, Martin Mailand wrote: Hi, today I tried the btrfs patch mentioned on the btrfs ml. Therefore I rebooted osd.0 with a new kernel and created a new btrfs on the osd.0, then I took the osd.0 into the cluster. During the resync of osd.0, osd.2 and osd.3

Re: osd crash during resync

2012-01-24 Thread Gregory Farnum
On Tue, Jan 24, 2012 at 10:48 AM, Martin Mailand mar...@tuxadero.com wrote: Hi, today I tried the btrfs patch mentioned on the btrfs ml. Therefore I rebooted osd.0 with a new kernel and created a new btrfs on the osd.0, then I took the osd.0 into the cluster. During the resync of osd.0

Re: osd crash during resync

2012-01-24 Thread Martin Mailand
Hi Greg, ok, do you guys still need the core files, or could I delete them? -martin On 24.01.2012 22:13, Gregory Farnum wrote: On Tue, Jan 24, 2012 at 10:48 AM, Martin Mailand mar...@tuxadero.com wrote: Hi, today I tried the btrfs patch mentioned on the btrfs ml. Therefore I rebooted osd.0

Re: osd crash during resync

2012-01-24 Thread Gregory Farnum
On Tue, Jan 24, 2012 at 1:22 PM, Martin Mailand mar...@tuxadero.com wrote: Hi Greg, ok, do you guys still need the core files, or could I delete them? Sam thinks probably not since we have the backtraces and the logs...thanks for asking, though! :) -Greg

Re: OSD crash

2011-05-27 Thread Gregory Farnum
This is an interesting one -- the invariant that assert is checking isn't too complicated (that the object lives on the RecoveryWQ's queue) and seems to hold everywhere the RecoveryWQ is called. And the functions modifying the queue are always called under the workqueue lock, and do maintenance if

Re: OSD crash

2011-05-27 Thread Fyodor Ustinov
On 05/27/2011 06:16 PM, Gregory Farnum wrote: This is an interesting one -- the invariant that assert is checking isn't too complicated (that the object lives on the RecoveryWQ's queue) and seems to hold everywhere the RecoveryWQ is called. And the functions modifying the queue are always called

Re: OSD crash

2011-05-27 Thread Fyodor Ustinov
On 05/27/2011 10:18 PM, Gregory Farnum wrote: Can you check out the recoverywq_fix branch and see if that prevents this issue? Or just apply the patch I've included below. :) -Greg Looks as though this patch has helped. At least this osd has completed rebalancing. Great! Thanks! WBR,
