I hope you restarted those mons sequentially, waiting between each for the
quorum to return.
Is there any recovery or pg autoscaling going on?
Are all OSDs up/in, i.e. are the three numbers returned by `ceph osd stat` the
same?
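If you want to script that check, the counts can be compared mechanically. A minimal sketch, assuming `ceph osd stat` output of the "N osds: N up …, N in …" form (the exact wording varies by release; the sample line here is hypothetical, in real use you'd do `stat="$(ceph osd stat)"`):

```shell
# Hypothetical `ceph osd stat` line; in real use: stat="$(ceph osd stat)"
stat='24 osds: 24 up (since 3h), 24 in (since 5d)'

# First field is the total; the counts before "up" and "in" are up/in.
total=$(echo "$stat" | awk '{print $1}')
up=$(echo "$stat"    | awk '{for (i=2;i<=NF;i++) if ($i ~ /^up/) print $(i-1)}')
in_=$(echo "$stat"   | awk '{for (i=2;i<=NF;i++) if ($i ~ /^in/)  print $(i-1)}')

if [ "$total" = "$up" ] && [ "$total" = "$in_" ]; then
    echo "all $total OSDs up and in"
else
    echo "MISMATCH: $total total, $up up, $in_ in"
fi
```

Handy as a cron or monitoring check; anything other than three equal numbers means something is down or out.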
— aad
> On Oct 19, 2020, at 7:05 PM, Szabo, Istvan (Agoda)
>
>>
>> Very nice and useful document. One thing is not clear for me, the fio
>> parameters in appendix 5:
>> --numjobs=<1|4> --iodepths=<1|32>
>> it is not clear if/when the iodepth was set to 32, was it used with all
>> tests with numjobs=4 ? or was it:
>> --numjobs=<1|4> --iodepths=1
> We have
Poking through the source I *think* the doc should indeed refer to the “dup”
function, vs “copy”. That said, arguably we shouldn’t have a section in the
docs that says "there’s this thing you can do but we aren’t going to tell you
how”.
Looking at the history / blame info, which only seems
Thanks, Mark.
I’m interested as well, wanting to provide block service to baremetal hosts;
iSCSI seems to be the classic way to do that.
I know there’s some work on MS Windows RBD code, but I’m uncertain if it’s
production-worthy, and if RBD namespaces suffice for tenant isolation — and are
>> If you guys have any suggestions about used hardware that can be a good fit
>> considering mainly low noise, please let me know.
>
> So we didn’t get these requirements initially, there’s no way for us to help
> you when the requirements aren’t available for us to consider, even if we had
> thx for taking care. I read "works as designed, be sure to have disk
> space for the mon available”.
Well, yeah ;)
> It sounds a little odd that the growth
> from 50MB to ~15GB + compaction space happens within a couple of
> seconds, when two OSD rejoin the cluster.
I’m suspicious — even on
>> With today’s networking, _maybe_ a super-dense NVMe box needs 100Gb/s where
>> a less-dense probably is fine with 25Gb/s. And of course PCI lanes.
>>
>>
Apologies for not consolidating these replies. My MUA is not my friend today.
> With 10 NVMe drives per node, I'm guessing that a single EPYC 7451 is
> going to be CPU bound for small IO workloads (2.4c/4.8t per OSD), but
> will be network bound for large IO workloads unless you are sticking
>
> How they did it?
You can create partitions / LVs by hand and build OSDs on them, or you can use
ceph-volume lvm batch --osds-per-device
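For example (device paths are hypothetical; `--report` previews the layout without touching anything, so run it first):

```shell
# Dry run: show how two OSDs per device would be carved out of each NVMe
ceph-volume lvm batch --report --osds-per-device 2 /dev/nvme0n1 /dev/nvme1n1

# The same invocation without --report actually creates the OSDs
ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1 /dev/nvme1n1
```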
> I have an idea to create a new bucket type under host, and put two LV from
> each ceph osd VG into that new bucket. Rules are the same (different host),
Depending what you’re looking to accomplish, setting up a cluster in VMs
(VirtualBox, Fusion, cloud provider, etc) may meet your needs without having to
buy anything.
>
> - Don't think having a few 1Gbit can replace a >10Gbit. Ceph doesn't use
> such bonds optimal. I already asked about this
Is this a reply to Paul’s message from 11 months ago?
https://bit.ly/32oZGlR
The PM1725b is interesting in that it has explicitly configurable durability vs
capacity, which may be even more effective than user-level short-stroking /
underprovisioning.
>
> Hi. How do you say 883DCT is faster
If you have capacity to have both online at the same time, why not add the SSDs
to the existing pool, let the cluster converge, then remove the HDDs? Either
all at once or incrementally? With care you’d have zero service impact. If
you want to change the replication strategy at the same
Now that’s a *very* different question from numbers assigned during an install.
With recent releases instead of going down the full removal litany listed
below, you can instead down/out the OSD and `destroy` it. That preserves the
CRUSH bucket and OSD ID, then when you use ceph-disk,
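A hedged sketch of that abbreviated workflow (OSD id 12 and the device path are placeholders; the original used ceph-disk, current releases would use ceph-volume):

```shell
ceph osd out 12
systemctl stop ceph-osd@12                  # on the OSD's host
ceph osd destroy 12 --yes-i-really-mean-it  # keeps the CRUSH entry and ID 12

# Rebuild on the replacement device; the destroyed ID is reused
ceph-volume lvm create --osd-id 12 --data /dev/sdX
```

Because the CRUSH bucket and weight survive, data movement is limited to backfilling the rebuilt OSD rather than a full rebalance.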
FWIW a handful of years back there was a bug in at least some LSI firmware
where the setting “Disk Default” silently turned the volatile cache *on*
instead of the documented behavior, which was to leave alone.
> On Sep 3, 2020, at 8:13 AM, Reed Dier wrote:
>
> It looks like I ran into the
>
> huxia...@horebdata.cn
>
> From: Anthony D'Atri
> Date: 2020-09-05 20:00
> To: huxia...@horebdata.cn
> CC: ceph-users
> Subject: Re: [ceph-users] PG number per OSD
> One factor is RAM usage, that was IIRC the motivation for the lowering of the
> recommendation o
One factor is RAM usage, that was IIRC the motivation for the lowering of the
recommendation of the ratio from 200 to 100. Memory needs also increase during
recovery and backfill.
When calculating, be sure to consider replicas.
ratio = (pgp_num x replication) / num_osds
As HDDs grow the
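As a worked example of that formula (the numbers are illustrative, not a recommendation):

```shell
# ratio = (pgp_num x replication) / num_osds
# e.g. 1024 PGs at 3x replication across 30 OSDs:
ratio=$(awk 'BEGIN { printf "%.1f", (1024 * 3) / 30 }')
echo "$ratio PG replicas per OSD"   # ~102, right at the revised ~100 target
```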
>
> Looking for a bit of guidance / approach to upgrading from Nautilus to
> Octopus considering CentOS and Ceph-Ansible.
>
> We're presently running a Nautilus cluster (all nodes / daemons 14.2.11 as
> of this post).
> - There are 4 monitor-hosts with mon, mgr, and dashboard functions
>
Is your MUA wrapping lines, or is the list software?
As predicted. Look at the VAR column and the STDDEV of 37.27
> On Aug 27, 2020, at 9:02 AM, Dallas Jones wrote:
>
> 1 122.79410- 123 TiB 42 TiB 41 TiB 217 GiB 466 GiB 81
> TiB 33.86 1.00 -root default
> -3
Doubling the capacity in one shot was a big topology change, hence the 53%
misplaced.
OSD fullness will naturally reflect a bell curve; there will be a tail of
under-full and over-full OSDs. If you’d not said that your cluster was very
full before expansion I would have predicted it from the
> I wanna limit the traffic of specific buckets. Can haproxy, nginx or any
> other proxy software deal with it ?
Yes. I’ve seen it done.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
My understanding is that the existing mon_clock_drift_allowed value of 50 ms
(default) is so that PAXOS among the mon quorum can function. So OSDs (and
mgrs, and clients etc) are out of scope of that existing code.
Things like this are why I like to ensure that the OS does `ntpdate -b` or
One way this can happen is if you have the default setting
osd_scrub_during_recovery=false
If you’ve been doing a lot of [re]balancing, drive replacements, topology
changes, expansions, etc. scrubs can be starved especially if you’re doing EC
on HDDs.
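If the scrub starvation matters more than recovery interference in your case, the setting can be flipped at runtime. A sketch using the centralized config of Octopus-era releases (older clusters would use `injectargs` or ceph.conf instead):

```shell
# Allow scrubs to proceed while recovery is in flight
ceph config set osd osd_scrub_during_recovery true

# Verify what an OSD actually sees
ceph config get osd.0 osd_scrub_during_recovery
```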
HDD or SSD OSDs? Replication or
This is a natural condition of CRUSH. You don’t mention what release the
back-end or the clients are running so it’s difficult to give an exact answer.
Don’t mess with the CRUSH weights.
Either adjust the override / reweights with `ceph osd
test-reweight-by-utilization /
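The dry-run / apply pair looks roughly like this (the 120 oload threshold, 0.05 max change, and 10-OSD cap are illustrative, not a recommendation):

```shell
# Preview which OSDs would be reweighted and by how much (no changes made)
ceph osd test-reweight-by-utilization 120 0.05 10

# Apply the same adjustment for real
ceph osd reweight-by-utilization 120 0.05 10
```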
>> In the past, the minimum recommendation was 1GB RAM per HDD blue store OSD.
There was a rule of thumb of 1GB RAM *per TB* of HDD Filestore OSD, perhaps you
were influenced by that?
Bear in mind that ARM and x86 are architectures, not CPU models. Both are
available in a vast variety of core counts, clocks, and implementations.
E.g., an 80-core Ampere Altra will likely smoke an Intel Atom D410 in every way.
That said, what does “performance” mean? For object storage, it
min_compat is a different thing entirely.
You need to set the tunables as a group. This will cause data to move, so you
may wish to throttle recovery, model the PG movement ahead of time, use the
upmap trick to control movement etc.
> That is an interesting point. We are using 12 on 1 nvme journal for our
> Filestore nodes (which seems to work ok). The workload for wal + db is
> different so that could be a factor. However when I've looked at the IO
> metrics for the nvme it seems to be only lightly loaded, so does not
> Thanks for the thinking. By 'traffic' I mean: when a user space rbd
> write has as a destination three replica osds in the same chassis
eek.
> does the whole write get shipped out to the mon and then back
Mons are control-plane only.
> All the 'usual suspects' like lossy ethernets and
What does “traffic” mean? Reads? Writes will have to hit the net regardless
of any machinations.
> On Jun 29, 2020, at 7:31 PM, Harry G. Coin wrote:
>
> I need exactly what ceph is for a whole lot of work, that work just
> doesn't represent a large fraction of the total local traffic.
M=1 is never a good choice. Just use replication instead.
> On Jun 26, 2020, at 3:05 AM, Zhenshi Zhou wrote:
>
> Hi Janne,
>
> I use the default profile(2+1) and set failure-domain=host, is my best
> practice?
>
> Janne Johansson wrote on Friday, June 26, 2020 at 4:59 PM:
>
>> On Fri, 26 Jun 2020 at
The benefit of disabling on-drive cache may be at least partly dependent on the
HBA; I’ve done testing of one specific drive model and found no difference,
where someone else reported a measurable difference for the same model.
> Good to know that we're not alone :) I also looked for a newer
>> I can remember reading this before. I was hoping you maybe had some
>> setup with systemd scripts or maybe udev.
>
> Yeah, doing this on boot up would be ideal. I was looking really hard into
> tuned and other services that claimed can do it, but required plugins or
> other stuff did/does
>
> I am wondering if its not necessary to have to drain/fill OSD nodes at
> all and if this can be done with just a fresh install and not touch
> the OSD's
Absolutely. I’ve done this both with Trusty -> Bionic and Precise -> RHEL7.
> however I don't know how to perform a fresh
Exactly my thought too: the node in question has a most-full outlier, so when
that OSD is out, the most-full is … less full.
ceph osd df | sort -nk8 | tail
Or using
https://gitlab.cern.ch/ceph/ceph-scripts/blob/e7dbedba66fb585e38f539a41f9640f3ad8af96b/tools/histogram.py
ceph osd df | egrep
Having been through a number of repaves and expansions, along with Firefly-era
DB inflation and rack-weight-difference bugs, a few things pop out from these
accounts, ymmv.
o There’s been a bug where compaction didn’t happen if all OSDs weren’t up/in,
not sure of affected versions. One of the
>
> Why don't you do the crush map change together with that? All the data will
> be reshuffled then any w
Couldn’t the upmap trick be used to manage this? Compute a full set to pin the
before mappings, then incrementally remove them so that data moves in a
controlled fashion? I suspect
cbs.centos.org offers 14.2.7 packages for el8 eg
https://cbs.centos.org/koji/buildinfo?buildID=28564 but I don’t know anything
about their provenance or nature.
For sure a downloads.ceph.com package would be desirable.
> On Jun 3, 2020, at 4:04 PM, Victoria Martinez de la Cruz
> wrote:
>
>
>>>
>>>
>>> I would not call a ceph page, a random tuning tip. At
>>least I hope
>>> they
>>> are not. NVMe-only with 100Gbit is not really a standard
>> setup. I
>>> assume
>>>
Quick and easy depends on your network infrastructure. Sometimes it is
difficult or impossible to retrofit a live cluster without disruption.
> On May 25, 2020, at 1:03 AM, Marc Roos wrote:
>
>
> I am interested. I am always setting mtu to 9000. To be honest I cannot
> imagine there is
Historically I have often but not always found that removing / destroying the
affected OSD would clear the inconsistent PG. At one point the logged message
was clear about who reported and who was the perp, but then a later release
broke that. Not sure what recent releases say, since with
Understandable concern.
FWIW I’ve used rbd-mirror to move thousands of volumes between clusters with
zero clobbers.
—aad
> On May 14, 2020, at 9:46 AM, Kees Meijs | Nefos wrote:
>
> My main concern is pulling images into a non-empty pool. It would be
> (very) bad if rbd-mirror tries to be
What is a saturated network with modern switched technologies? Links to
individual hosts? Uplinks from ToRs (public)? Switch backplane (cluster)?
> That is correct. I didn't explain it clearly. I said that is because in
> some write only scenario the public network and cluster network will
>
When you set up the rbd-mirror daemons with each others’ configs, and initiate
mirroring of a volume, the destination will create the volume in the
destination cluster and pull over data.
Hopefully you’re creating unique volume names so there won’t be conflicts, but
that said if the
> …configured on a per-pool basis" (according to the documentation).
>
> On 14-05-2020 09:13, Anthony D'Atri wrote:
>> So?
So?
>
> Hi Anthony,
>
> Thanks as well.
>
> Well, it's a one-time job.
>
> K.
>
> On 14-05-2020 09:10, Anthony D'Atri wrote:
>> Why not use rbd-mirror to handle the volumes?
>
Why not use rbd-mirror to handle the volumes?
> On May 13, 2020, at 11:27 PM, Kees Meijs wrote:
>
> Hi Konstantin,
>
> Thank you very much. That's a good question.
>
> The implementations of OpenStack and Ceph and "the other" OpenStack and
> Ceph are, apart from networking, completely
cal physical networking).
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
>
> From: Stefan Kooman
> Sent: 13 May 2020 07:40
> To: ceph-users@ceph.io
> Subject: [ceph-us
> I think, however, that a disappearing back network has no real consequences
> as the heartbeats always go over both.
FWIW this has not been my experience, at least through Luminous.
What I’ve seen is that when the cluster/replication net is configured but
unavailable, OSD heartbeats fail
>> If your public network is saturated, that actually is a problem, last thing
>> you want is to add recovery traffic, or to slow down heartbeats. For most
>> people, it isn’t saturated.
>
>See Frank Schilder's post about a meltdown which he believes could have
>been caused by
> Hi,
>
> I deployed few clusters with two networks as well as only one network.
> There has little impact between them for my experience.
>
> I did a performance test on nautilus cluster with two networks last week.
> What I found is that the cluster network has low bandwidth usage
During
I’ve inherited a couple of clusters with non-default (i.e., not “ceph”) internal
names, and I want to rename them for the usual reasons.
I had previously developed a full list of steps - which I no longer have access
to.
Anyone done this recently? Want to be sure I’m not missing something.
*
ceph.conf still *works*, though, for those with existing systems built to
manage it?
While the centralized config is very handy, I’d love to find a way to tie it
into external revision control, so that settings could continue to be managed
in eg. git and applied by automation.
> On Apr 30,
Do I misunderstand this script, or does it not _quite_ do what’s desired here?
I fully get the scenario of applying a full-cluster map to allow incremental
topology changes.
To be clear, if this is run to effectively freeze backfill during / following a
traumatic event, it will freeze that
>
> 1. shoud i use a raid controller a create for example a raid 5 with all disks
> on each osd server? or should i passtrough all disks to ceph osd?
>
> If your OSD servers have HDDs, buy a good RAID Controller with a
> battery-backed write cache and configure it using multiple RAID-0
Check the output carefully, the OP was using a 5 year old version of
ceph-deploy. Updating brought success, presumably with Nautilus.
> On Apr 5, 2020, at 11:06 AM, Brent Kennedy wrote:
>
> If you are deploying the octopus release, it cant be installed on CentOS 7 (
> must use centOS 8
Linuxes don’t require swap, at least; maybe BSDs still do, but I haven’t run one
since the mid ’90s. Back in the day swap had to be at least the size of
physmem, but we’re talking SunOS 4.1.4 days.
Swap IMHO has been moot for years. It dates to a time when RAM was much more
expensive and
The strategy that Nghia described is inefficient for moving data more than
once, but safe since there are always N copies, vs a strategy of setting noout,
destroying the OSDs, and recreating them on the new server. That would be more
efficient, albeit with a period of reduced redundancy.
I’ve
You can adjust the Primary Affinity down on the larger drives so they’ll get
less read load. In one test I’ve seen this result in a 10-15% increase in read
throughput, but it depends on your situation.
Optimal settings would require calculations that make my head hurt, maybe
someone has a
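Mechanically it’s simple even if picking optimal values isn’t. A sketch (the osd id and the 0.5 value are arbitrary examples, not tuned numbers):

```shell
# Halve the chance that osd.12 (a larger drive) is chosen as primary
ceph osd primary-affinity osd.12 0.5

# Inspect current values in the PRI-AFF column
ceph osd tree
```

Note this only shifts where reads are served from; write load still lands on every replica regardless of affinity.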
> How to get rid of this logging??
>
> Mar 31 13:40:03 c01 ceph-mgr: 2020-03-31 13:40:03.521 7f554edc8700 0
> log_channel(cluster) log [DBG] : pgmap v672067: 384 pgs: 384
> active+clean;
Why?
>
> I already have the time logged, I do not need it a second time.
>
> Mar 31 13:39:59 c01
I suspect that while not default we still need to be able to run OSDs created
on older releases, or with leveldb explicitly?
>
>> - nothing provides libleveldb.so.1()(64bit) needed by
>> ceph-osd-2:15.2.0-0.el8.x86_64
>
> Does the OSD still use leveldb at all?
>
> - Ken
>
I suspect Ceph is configured in their case to send all logs off-node to a
central syslog server, ELK, etc.
With Jewel this seemed to result in daemons crashing, but probably it’s since
been fixed (I haven’t tried).
> that is much less than I experienced of allocated disk space in case
>
That has potential. Another, albeit suboptimal idea would be simply
Quid
as in
’S quid
as in “it’s squid”. cf. https://en.wikipedia.org/wiki/%27S_Wonderful
Alternately just skip to R, and when someone asks about Q, we say “The first
rule of Ceph is that we don’t talk about Q”.
— aad
>
This is an expensive operation. You want to slow it down, not burden the OSDs.
> On Mar 21, 2020, at 5:46 AM, Jan Pekař - Imatic wrote:
>
> Each node has 64GB RAM so it should be enough (12 OSD's = 48GB used).
>
>> On 21/03/2020 13.14, XuYun wrote:
>> Bluestore requires more than 4G memory
He means that if eg. you enforce 1 copy of a PG per rack, that any upmaps you
enter don’t result in 2 or 3 in the same rack. If your CRUSH policy is one
copy per *host* the danger is even higher that you could have data become
unavailable or even lost in case of a failure.
> On Mar 16, 2020,
Yeah the removal of that was annoying for sure. ISTR that one can gather the
information from the OSDs’ admin sockets.
Envision a Prometheus exporter that polls the admin sockets (in parallel) and
Grafana panes that graph slow requests by OSD and by node.
> On Mar 13, 2020, at 4:14 PM,
>
> An observation: We are using proxmox (5.4), and it displays the WEAR level as
> "N/A", which is unfortunate... :-( Tried upgrading to 6: the same, still no
> wearout in the GUI.
>
> Here is smartctl output:
>
>> smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.18-26-pve] (local build)
Custom cluster names are being incrementally deprecated. With ceph-deploy I
thought they were removed in 1.39.
You could probably achieve parallel clusters with containerized daemons if you
tried hard enough, but I have to ask what leads you to want to do this.
> On Mar 11, 2020, at 4:07
>>
>> FWIW when using rbd-mirror to migrate volumes between SATA SSD clusters, I
>> found that
>>
>>
>> rbd_mirror_journal_max_fetch_bytes:
>>section: "client"
>>value: "33554432"
>>
>> rbd_journal_max_payload_bytes:
>>section: "client"
>>value: "8388608"
>
> Indeed,
FWIW when using rbd-mirror to migrate volumes between SATA SSD clusters, I
found that
rbd_mirror_journal_max_fetch_bytes:
section: "client"
value: "33554432"
rbd_journal_max_payload_bytes:
section: "client"
value: "8388608"
Made a world of difference in expediting
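For reference, with the centralized config of recent releases the equivalent of those ceph.conf entries would be something like the following (values copied from above; the commands themselves are my phrasing, not from the original mail):

```shell
# Larger journal fetches and payloads for rbd-mirror clients
ceph config set client rbd_mirror_journal_max_fetch_bytes 33554432
ceph config set client rbd_journal_max_payload_bytes 8388608
```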
Yes. The drive treats some portion of cells as SLC, which having only two
charge states is a lot faster and is used as cache. As with any cache-enabled
drive, if that cache fills up either due to misaligned flush cycles or simply
data coming in faster than it can flush, you’ll see a
>
>>> Sage, do you think I can workaround by setting
>>> mon_sync_max_payload_size ridiculously small, like 1024 or something
>>> like that?
>>
>> Yeah... IIRC that is how the original user worked around the problem. I
>> think they use 64 or 128 KB.
>
> Nice... 64kB still triggered elections
That depends on how you define “decent”, and your use case.
Be careful that these are QLC drives. QLC is pretty new and longevity would
seem to vary quite a bit based on op mix. These might be fine for read-mostly
workloads, but high-turnover databases might burn them up fast, especially as
This is a natural condition of bonding; it has little to do with ceph-osd.
Make sure your hash policy is set appropriately, so that you even have a
chance of using both links.
https://support.packet.com/kb/articles/lacp-bonding
The larger the set of destinations, the more likely you are to
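Checking and changing the hash policy is quick. A sketch assuming a Linux LACP bond named bond0 (the bond must be down to change the option this way; your distro’s network config files are the persistent place for it):

```shell
# See what the bond is currently hashing on
grep "Transmit Hash Policy" /proc/net/bonding/bond0

# layer3+4 hashes on IP:port pairs, spreading flows across members
ip link set bond0 down
ip link set bond0 type bond xmit_hash_policy layer3+4
ip link set bond0 up
```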
I applied those packages for the same reason on a staging cluster and so far so
good.
> On Jan 24, 2020, at 9:15 AM, Atherion wrote:
>
>
> Hi Ceph Community.
> We currently have a luminous cluster running and some machines still on
> Ubuntu 14.04
> We are looking to upgrade these machines
I’ve had one or two situations where swap might have helped a memory
consumption problem, but others in which it would have *worsened* cluster
performance. Sometimes it’s better for the *cluster* for an OSD to die /
restart / get OOMkilled than for it to limp along sluggishly.
In the past
I agree wrt making the nodes weights uniform.
When mixing drive sizes, be careful that the larger ones don’t run afoul of the
pg max — they will receive more pgs than the smaller ones, and if you lose a
node that might be enough to send some over the max. `ceph osd df` and look
at the PG
I have clusters mixing with no problems. At one point we bought a SAS model,
when we don’t have a local exact spare we replace dead ones with SATA.
Some HBAs or backplanes though might not support mixing. Unlike crappy HBA
RoC vols, Ceph doesn’t care.
> On Oct 16, 2019, at 1:39 PM,
We’re running with 2000 FWIW.
> On Oct 11, 2019, at 2:02 PM, Paul Emmerich wrote:
>
> Which defaults to rgw_thread_pool_size, so yeah, you can adjust that option.
>
> To answer your actual question: we've run civetweb with 1024 threads
> with no problems related to the number of threads.
>
>
Very large omaps can take quite a while.
>
>> You meta data PGs *are* backfilling. It is the "61 keys/s" statement in the
>> ceph status output in the recovery I/O line. If this is too slow, increase
>> osd_max_backfills and osd_recovery_max_active.
>>
>> Or just have some coffee ...
>
>
>
Parallelism. The backfill/recovery tunables control how many recovery ops a
given OSD will perform. If you’re adding a new OSD, naturally it is the
bottleneck. For other forms of data movement, early on one has multiple OSDs
reading and writing independently. Toward the end, increasingly
>>> Do you have statistics on the size of the OSDMaps or count of them
>>> which were being maintained by the OSDs?
>> No, I don't think so. How can I find this information?
>
> Hmm I don't know if we directly expose the size of maps. There are
> perfcounters which expose the range of maps being
It’s not that the limit is *ignored*; sometimes the failure of the subtree
isn’t *detected*. Eg., I’ve seen this happen when a node experienced kernel
weirdness or OOM conditions such that the OSDs didn’t all get marked down at
the same time, so the PGs all started recovering. Admittedly it’s
This is in part a question of *how many* of those dense OSD nodes you have. If
you have a hundred of them, then most likely they’re spread across a decent
number of racks and the loss of one or two is a tolerable *fraction* of the
whole cluster.
If you have a cluster of just, say, 3-4 of
I’d always been led to believe that when a client attaches it tries the mons in
its conf file in order until it contacts one, which may then hand it off to
another. This was given as an explanation for a situation I once had where 2/5
mons were down but clients could run but couldn’t
Are you using Filestore? If so directory splitting can manifest this way.
Check your networking too, packet loss between OSD nodes or between OSD nodes
and the mons can also manifest this way, say if bonding isn’t working properly
or you have a bad link.
But as suggested below, check the
> I agree. The ceph-user list is so busy that one cannot manually check all the
> time if someone replied. I would like messages with me in cc to come to inbox
> as well as via the list.
Some time ago I sent a message to the ostensible list owner, which seemed to be
blackholed.
I suggested
In a past life I had a bunch of SAN gear dumped in my lap, it was spec’d by
someone else misintepreting vague specs. It was SAN gear with an AoE driver.
I wasn’t using Ceph, but sending it back and getting a proper solution wasn’t
an option. Ended up using SAN gear as a NAS with a single