Hi,
I keep reading recommendations about disabling debug logging in Ceph
in order to improve performance. There are two things that are unclear
to me though:
a. what do we lose if we decrease the default debug logging, and where
is the sweet spot so that we do not lose critical messages?
I would say
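To make the question concrete, the kind of change I mean is something
along these lines (the subsystem names are the usual ones; the exact
levels are only an illustration, not a claim about where the sweet
spot is):

  # "0/5" keeps the in-memory log (still useful for post-crash dumps), "0/0" drops it entirely
  ceph tell osd.* injectargs '--debug-ms 0/0 --debug-osd 0/5 --debug-filestore 0/5 --debug-journal 0/5'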
Hello,
we are on Debian Jessie and Hammer 0.94.9 and recently we decided to
upgrade our kernel from 3.16 to 4.9 (jessie-backports). We experience
the same regression, but with some bright spots
-- ceph tell osd average across the cluster --
3.16.39-1: 204MB/s
4.9.0-0: 158MB/s
-- 1 rados bench
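(For clarity, the ceph tell numbers above are cluster-wide averages of
the per-OSD write benchmark, presumably invoked like this for each OSD:)

  # built-in 1GB write benchmark of a single OSD, reported in bytes/sec
  ceph tell osd.0 bench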
Python script, perhaps
> you could post it as an example?
>
> Thanks!
>
> -- Dan
>
>
>> On Oct 20, 2016, at 01:42, Kostis Fardelas <dante1...@gmail.com> wrote:
>>
>> We pulled leveldb from upstream and fired leveldb.RepairDB against the
>> OSD omap direc
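(Not the original script, but a minimal sketch of what such a repair
can look like with the py-leveldb bindings; the omap path is an
assumption for a FileStore OSD, and the OSD must be stopped first:)

  # repair the omap leveldb of a stopped OSD in place
  python -c "import leveldb; leveldb.RepairDB('/var/lib/ceph/osd/ceph-xx/current/omap')"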
> Cheers
> Goncalo
>
> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Kostis
> Fardelas [dante1...@gmail.com]
> Sent: 20 October 2016 09:09
> To: ceph-users
> Subject: [ceph-users] Surviving a ceph clus
Hello cephers,
this is the blog post on our Ceph cluster's outage we experienced some
weeks ago and about how we managed to revive the cluster and our
clients's data.
I hope it will prove useful for anyone who finds himself/herself
in a similar position. Thanks to everyone on the ceph-users
>
>
>
> On 18.09.2016 18:59, Kostis Fardelas wrote:
>>
>> If you are aware of the problematic PGs and they are exportable, then
>> ceph-objectstore-tool is a viable solution. If not, then running gdb
>> and/or higher debug osd level logs may prove useful (to un
If you are aware of the problematic PGs and they are exportable, then
ceph-objectstore-tool is a viable solution. If not, then running gdb
and/or higher debug osd level logs may prove useful (to understand
more about the problem or collect info to ask for more in ceph-devel).
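For example, exporting a PG from a stopped OSD looks roughly like this
(pgid and paths are placeholders):

  # run with the OSD stopped; the resulting file can later be imported into another OSD
  ceph-objectstore-tool --op export --pgid 3.5a9 \
    --data-path /var/lib/ceph/osd/ceph-xx --journal-path /var/lib/ceph/osd/ceph-xx/journal \
    --file 3.5a9.export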
On 13 September 2016
Hello Goncalo,
afaik the authoritative shard is determined based on deep-scrub object
checksums, which were introduced in Hammer. Is this in line with your
experience? If yes, is there any other method of determining the
auth shard besides object timestamps for ceph < jewel?
Kostis
On 13 September
to bump this. It looks like a leak (and
of course I could extend the leak by bumping pid_max) but this is not
the case, is it?
Kostis
On 15 September 2016 at 14:40, Wido den Hollander <w...@42on.com> wrote:
>
>> Op 15 september 2016 om 13:27 schreef Kostis Fardelas <d
Hello cephers,
being in a degraded cluster state with 6/162 OSDs down (Hammer
0.94.7, 162 OSDs, 27 "fat" nodes, 1000s of clients) like the below
ceph cluster log indicates:
2016-09-12 06:26:08.443152 mon.0 62.217.119.14:6789/0 217309 : cluster
[INF] pgmap v106027148: 28672 pgs: 2
Hello cephers,
last week we survived a 3-day outage on our ceph cluster (Hammer
0.94.7, 162 OSDs, 27 "fat" nodes, 1000s of clients) due to 6 out of
162 OSDs crashing in the SAME node. The outage unfolded along the
following timeline:
time 0: OSDs living in the same node (rd0-19) start heavily
(host-wise) pool is going to be limited to <
> 0.8TB usable space. (The two 0.3 hosts will fill up well before
> the two larger hosts are full).
>
>
> On Tue, Jul 26, 2016 at 1:55 PM, Kostis Fardelas <dante1...@gmail.com> wrote:
>> Hello Dan,
>> I increased
ere choose_total_tries is too low for
> your cluster configuration.
> Try increasing choose_total_tries from 50 to 75.
>
> -- Dan
>
>
>
> On Fri, Jul 22, 2016 at 4:17 PM, Kostis Fardelas <dante1...@gmail.com> wrote:
>> Hello,
>> being in latest Hammer, I thin
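For reference, applying that change means editing the decompiled
crushmap, something along these lines (filenames are just placeholders):

  ceph osd getcrushmap -o crush.bin
  crushtool -d crush.bin -o crush.txt
  # in crush.txt change "tunable choose_total_tries 50" to "tunable choose_total_tries 75"
  crushtool -c crush.txt -o crush.new
  ceph osd setcrushmap -i crush.new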
"op": "emit"
}
]
},
{
"rule_id": 2,
"rule_name": "rbd",
"ruleset": 2,
"type": 1,
"min_size": 1,
of profile tunables
changes and their impact on a production cluster.
Kostis
On 24 July 2016 at 14:29, Kostis Fardelas <dante1...@gmail.com> wrote:
> nice to hear from you Goncalo,
> what you propose sounds like an interesting theory, I will test it
> tomorrow and let you know. In t
> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Kostis
> Fardelas [dante1...@gmail.com]
> Sent: 23 July 2016 16:32
> To: Brad Hubbard
> Cc: ceph-users
> Subject: Re: [ceph-users] Recovery stuck after adjusting to recent tunables
>
> Hi
. That was not the case with argonaut tunables as I remember.
Regards
On 23 July 2016 at 06:16, Brad Hubbard <bhubb...@redhat.com> wrote:
> On Sat, Jul 23, 2016 at 12:17 AM, Kostis Fardelas <dante1...@gmail.com> wrote:
>> Hello,
>> being in latest Hammer, I think I hit
Hello,
being in latest Hammer, I think I hit a bug with tunables more recent
than the legacy ones.
Being in legacy tunables for a while, I decided to experiment with
"better" tunables. So first I went from argonaut profile to bobtail
and then to firefly. However, I decided to make the changes on
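(The profile switches themselves were presumably the standard commands,
i.e. along the lines of the following; each one triggers data movement,
so best done one step at a time:)

  ceph osd crush tunables bobtail
  ceph osd crush tunables firefly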
Hello,
I upgraded a staging ceph cluster from latest Firefly to latest Hammer
last week. Everything went fine overall and I would like to share my
observations so far:
a. every OSD upgrade lasts appr. 3 minutes. I doubt there is any way
to speed this up though
b. rados bench with different block
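(The bench runs I mean are the plain rados ones with varying block
sizes, e.g. the following; the pool name is just a placeholder:)

  # 60s write test with 4M objects, then a sequential read over what was written
  rados bench -p test 60 write -b 4194304 -t 16 --no-cleanup
  rados bench -p test 60 seq -t 16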
fsids from redeployed
OSDs, even after removing the old ones from crushmap. You need to rm
them
Regards,
Kostis
On 15 June 2016 at 17:14, Kostis Fardelas <dante1...@gmail.com> wrote:
> Hello,
> in the process of redeploying some OSDs in our cluster, after
> destroying one of them (do
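(Presumably the "rm them" part above is the usual full removal
sequence, roughly the following; osd.12 is only an example id:)

  ceph osd crush remove osd.12   # drop it from the crushmap
  ceph auth del osd.12           # drop its cephx key
  ceph osd rm 12                 # drop it from the osdmap, so the old id/fsid is really gone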
Hi Hauke,
you could increase the mon/osd full/near-full ratios, but at this level
of disk space scarcity things may need your constant attention,
especially in case of failure, given the risk of cluster IO being
blocked. Modifying crush weights may be of use too.
Regards,
Kostis
On 15 June 2016
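(On Hammer-era releases the ratio bump itself would look roughly like
the following; the values are placeholders, and crush reweight is per
OSD:)

  ceph pg set_nearfull_ratio 0.90
  ceph pg set_full_ratio 0.97
  # and/or shift data away from the fullest OSDs
  ceph osd crush reweight osd.12 0.9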
Hello Jacob, Gregory,
did you manage to start up those OSDs in the end? I came across a very
similar incident [1] (no flags preventing the OSDs from getting UP
in the cluster though, no hardware problems reported) and I wonder if
you found out what was the culprit in your case.
[1]
Hello,
in the process of redeploying some OSDs in our cluster, after
destroying one of them (down, out, remove from crushmap) and trying to
redeploy it (crush add, start), we reach a state where the OSD gets
stuck at booting state:
root@staging-rd0-02:~# ceph daemon osd.12 status
{ "cluster_fsid":
There is the "ceph pg {pgid} mark_unfound_lost revert|delete" command, but
you may also find it useful to utilize ceph-objectstore-tool for the job
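For the pg mentioned in this thread that would be roughly (revert rolls
unfound objects back to their previous version, delete gives them up
entirely):

  ceph pg 15.3b3 mark_unfound_lost revert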
On 15 May 2016 at 20:22, Michael Kuriger wrote:
> I would try:
>
> ceph pg repair 15.3b3
>
>
>
>
>
>
>
>
> Michael
r recovery (and then delete it when
> done).
>
> I've never done this or worked on the tooling though so that's about the
> extent of my knowledge.
> -Greg
>
>
> On Wednesday, February 17, 2016, Kostis Fardelas <dante1...@gmail.com>
> wrote:
>>
>> R
import --data-path
/var/lib/ceph/osd/ceph-xx/ --journal-path
/var/lib/ceph/osd/ceph-xx/journal --file 3.5a9..export
d. start the osd
Regards,
Kostis
On 18 February 2016 at 02:54, Gregory Farnum <gfar...@redhat.com> wrote:
> On Wed, Feb 17, 2016 at 4:44 PM, Kostis Fardelas <dante1...@gmai
d I achieve this with ceph_objectstore_tool?
Regards,
Kostis
On 18 February 2016 at 01:22, Gregory Farnum <gfar...@redhat.com> wrote:
> On Wed, Feb 17, 2016 at 3:05 PM, Kostis Fardelas <dante1...@gmail.com> wrote:
>> Hello cephers,
>> due to an unfortunate sequence of
Hello cephers,
due to an unfortunate sequence of events (disk crashes, network
problems), we are currently in a situation with one PG that reports
unfound objects. There is also an OSD which cannot start-up and
crashes with the following:
2016-02-17 18:40:01.919546 7fecb0692700 -1
Hello cephers,
after being on 0.80.10 for a while, we upgraded to 0.80.11 and we
noticed the following things:
a. ~13% paxos refresh latency increase (from about 0.015 to 0.017 on average)
b. ~15% paxos commit latency increase (from 0.019 to 0.022 on average)
c. osd commitcycle latencies were decreased
Hi Vickey,
under "Upgrade procedures", you will see that it is recommended to
upgrade clients after having upgraded your cluster [1]
[1] http://docs.ceph.com/docs/master/install/upgrading-ceph/#upgrading-a-client
Regards
On 13 January 2016 at 12:44, Vickey Singh
Hi cephers,
after one OSD node crash (6 OSDs in total), we experienced an increase
of approximately 230-260 threads for every other OSD node. We have 26
OSD nodes with 6 OSDs per node, so this is approximately 40 threads
per osd. The OSD node has joined the cluster after 15-20 minutes.
The only
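(For reference, one way to count is simply summing the threads of all
ceph-osd processes on a node; and if the kernel limit ever becomes the
concern, it can be raised, the value below being only an illustration:)

  # total ceph-osd threads on this node
  ps -eLf | grep [c]eph-osd | wc -l
  # raise the kernel-wide thread/pid limit
  sysctl -w kernel.pid_max=4194303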
t 6:59 AM, Kostis Fardelas <dante1...@gmail.com> wrote:
>> Hi cephers,
>> after one OSD node crash (6 OSDs in total), we experienced an increase
>> of approximately 230-260 threads for every other OSD node. We have 26
>> OSD nodes with 6 OSDs per node, so this is approximat
? Again the statistics fetched from the sockets seem
more reasonable; the commitcycle latency is substantially larger than
the apply latency, which seems normal to me.
Is this a bug or a misunderstanding on my part?
Regards,
Kostis
On 13 July 2015 at 13:27, Kostis Fardelas dante1...@gmail.com wrote
any real world (and the cluster was
not very happy about that).
Jan
On 15 Jul 2015, at 10:53, Kostis Fardelas dante1...@gmail.com wrote:
Hello,
after some trial and error we concluded that if we start the 6 stopped
OSD daemons with a delay of 1 minute, we do not experience slow
requests
more, the slow requests will vanish. The possibility
that we have not tuned our setup down to the finest detail cannot be
ruled out, but I wonder whether we are missing some ceph tuning in
terms of configuration.
We run firefly latest stable version.
Regards,
Kostis
On 13 July 2015 at 13:28, Kostis
Hello,
I noticed that commit/apply latency reported using:
ceph pg dump -f json-pretty
is very different from the values reported when querying the OSD sockets.
What is your opinion? What are the targets that I should fetch metrics
from in order to be as precise as possible?
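For reference, the two sources I am comparing are roughly these (osd.0
is only an example; the exact field names are an assumption on my part):

  # per-OSD commit/apply latency as aggregated by the monitors
  ceph pg dump -f json-pretty | grep -A 3 fs_perf_stat
  # the raw counters straight from one OSD's admin socket
  ceph daemon osd.0 perf dump | grep -A 3 -E '"apply_latency"|"commitcycle_latency"'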
Hello,
after rebooting a ceph node and the OSDs starting booting and joining
the cluster, we experience slow requests that get resolved immediately
after the cluster recovers. It is important to note that before the node
reboot, we set noout flag in order to prevent recovery - so there are
only
Hello,
it seems that new packages for firefly have been uploaded to the repo.
However, I can't find any details in the Ceph release notes. There is only
one thread in ceph-devel [1], but it is not clear what this new
version is about. Is it safe to upgrade from 0.80.9 to 0.80.10?
Regards,
Kostis
[1]
Hi Robert,
an improvement to your checks could be the addition of check
parameters (instead of using hard coded values for warn and crit) so
that someone can change their values in main.mk. Hope to find some
time soon and send you a PR about it. Nice job btw!
On 19 November 2014 18:23, Robert
.
-Greg
On Tuesday, July 8, 2014, Kostis Fardelas dante1...@gmail.com wrote:
Hi,
we maintain a cluster with 126 OSDs, replication 3 and appr. 148T raw
used space. We store data objects basically on two pools, the one
being appr. 300x larger than the other in terms of data stored and #
of objects
Hi,
we maintain a cluster with 126 OSDs, replication 3 and appr. 148T raw
used space. We store data objects basically on two pools, the one
being appr. 300x larger than the other in terms of data stored and #
of objects. Based on the formula provided here
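(Assuming the formula meant is the usual rule of thumb from the docs,
the arithmetic for this cluster would be roughly:

  total PGs ~= (126 OSDs x 100) / 3 replicas ~= 4200,
  rounded to a power of two -> 4096, or 8192 if rounding up)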
Hi,
from my experience both ceph osd crush reweight and ceph osd
reweight will lead to CRUSH map changes and PG remapping. So both
commands eventually redistribute data between OSDs. Is there any good
reason in terms of ceph performance best practices to choose the one
over the other?
On 26 June
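(For the record, the two commands I am comparing, with the semantics as
I understand them; osd.12 and the weights are only placeholders:)

  ceph osd crush reweight osd.12 1.0   # CRUSH weight stored in the crushmap (normally ~ disk size in TB)
  ceph osd reweight 12 0.8             # override weight in [0,1] kept in the osdmap, applied on top of the crush weight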
Hi,
during PGs remapping, the cluster recovery process sometimes gets
stuck on PGs with backfill_toofull state. The obvious solution is to
reweight the impacted OSD until we add new OSDs to the cluster. In
order to force the remapping process to complete asap we try to inject
a higher value on
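(The option I have in mind is presumably the backfill-full threshold;
the exact value below is only an illustration:)

  # assumed option: temporarily allow backfill onto fuller OSDs than the default 85%
  ceph tell osd.* injectargs '--osd-backfill-full-ratio 0.92'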