[ceph-users] Re: Mon DB compaction MON_DISK_BIG

2020-10-19 Thread Anthony D'Atri
I hope you restarted those mons sequentially, waiting between each for the quorum to return. Is there any recovery or pg autoscaling going on? Are all OSDs up/in, i.e. are the three numbers returned by `ceph osd stat` the same? — aad > On Oct 19, 2020, at 7:05 PM, Szabo, Istvan (Agoda) >
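A hedged illustration of those checks (standard ceph CLI; output formats vary by release):

    # Wait for quorum to return after each mon restart before moving to the next
    ceph quorum_status | grep quorum_names
    # The three counts (total / up / in) should all match
    ceph osd stat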

[ceph-users] Re: Proxmox+Ceph Benchmark 2020

2020-10-14 Thread Anthony D'Atri
>> >> Very nice and useful document. One thing is not clear for me, the fio >> parameters in appendix 5: >> --numjobs=<1|4> --iodepths=<1|32> >> it is not clear if/when the iodepth was set to 32, was it used with all >> tests with numjobs=4 ? or was it: >> --numjobs=<1|4> --iodepths=1 > We have

[ceph-users] Re: Bluestore migration: per-osd device copy

2020-10-12 Thread Anthony D'Atri
Poking through the source I *think* the doc should indeed refer to the “dup” function, vs “copy”. That said, arguably we shouldn’t have a section in the docs that says “there’s this thing you can do but we aren’t going to tell you how”. Looking at the history / blame info, which only seems

[ceph-users] Re: Ceph iSCSI Performance

2020-10-05 Thread Anthony D'Atri
Thanks, Mark. I’m interested as well, wanting to provide block service to baremetal hosts; iSCSI seems to be the classic way to do that. I know there’s some work on MS Windows RBD code, but I’m uncertain if it’s production-worthy, and if RBD namespaces suffice for tenant isolation — and are

[ceph-users] Re: Feedback for proof of concept OSD Node

2020-10-04 Thread Anthony D'Atri
>> If you guys have any suggestions about used hardware that can be a good fit >> considering mainly low noise, please let me know. > > So we didn’t get these requirements initially, there’s no way for us to help > you when the requirements aren’t available for us to consider, even if we had

[ceph-users] Re: Massive Mon DB Size with noout on 14.2.11

2020-10-02 Thread Anthony D'Atri
> thx for taking care. I read "works as designed, be sure to have disk > space for the mon available". Well, yeah ;) > It sounds a little odd that the growth > from 50MB to ~15GB + compaction space happens within a couple of > seconds, when two OSD rejoin the cluster. I’m suspicious — even on

[ceph-users] Re: NVMe's

2020-09-23 Thread Anthony D'Atri
>> With today’s networking, _maybe_ a super-dense NVMe box needs 100Gb/s where >> a less-dense probably is fine with 25Gb/s. And of course PCI lanes. >> >>

[ceph-users] Re: NVMe's

2020-09-23 Thread Anthony D'Atri
Apologies for not consolidating these replies. My MUA is not my friend today. > With 10 NVMe drives per node, I'm guessing that a single EPYC 7451 is > going to be CPU bound for small IO workloads (2.4c/4.8t per OSD), but > will be network bound for large IO workloads unless you are sticking >

[ceph-users] Re: NVMe's

2020-09-23 Thread Anthony D'Atri
> How they did it? You can create partitions / LVs by hand and build OSDs on them, or you can use ceph-volume lvm batch --osds-per-device > I have an idea to create a new bucket type under host, and put two LV from > each ceph osd VG into that new bucket. Rules are the same (different host),
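A minimal sketch of the batch approach (device paths and the per-device count are placeholders, not from the original thread):

    # Build two OSDs per NVMe device; ceph-volume carves out the LVs itself
    ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1 /dev/nvme1n1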

[ceph-users] Re: Setting up a small experimental CEPH network

2020-09-21 Thread Anthony D'Atri
Depending what you’re looking to accomplish, setting up a cluster in VMs (VirtualBox, Fusion, cloud provider, etc) may meet your needs without having to buy anything. > > - Don't think having a few 1Gbit can replace a >10Gbit. Ceph doesn't use > such bonds optimal. I already asked about this

[ceph-users] Re: Choosing suitable SSD for Ceph cluster

2020-09-12 Thread Anthony D'Atri
Is this a reply to Paul’s message from 11 months ago? https://bit.ly/32oZGlR The PM1725b is interesting in that it has explicitly configurable durability vs capacity, which may be even more effective than user-level short-stroking / underprovisioning. > > Hi. How do you say 883DCT is faster

[ceph-users] Re: Change crush rule on pool

2020-09-12 Thread Anthony D'Atri
If you have capacity to have both online at the same time, why not add the SSDs to the existing pool, let the cluster converge, then remove the HDDs? Either all at once or incrementally? With care you’d have zero service impact. If you want to change the replication strategy at the same

[ceph-users] Re: Is it possible to assign osd id numbers?

2020-09-11 Thread Anthony D'Atri
Now that’s a *very* different question from numbers assigned during an install. With recent releases instead of going down the full removal litany listed below, you can instead down/out the OSD and `destroy` it. That preserves the CRUSH bucket and OSD ID, then when you use ceph-disk,

[ceph-users] Re: Can 16 server grade ssd's be slower then 60 hdds? (no extra journals)

2020-09-06 Thread Anthony D'Atri
FWIW a handful of years back there was a bug in at least some LSI firmware where the setting “Disk Default” silently turned the volatile cache *on* instead of the documented behavior, which was to leave it alone. > On Sep 3, 2020, at 8:13 AM, Reed Dier wrote: > > It looks like I ran into the

[ceph-users] Re: PG number per OSD

2020-09-06 Thread Anthony D'Atri
> > huxia...@horebdata.cn > > From: Anthony D'Atri > Date: 2020-09-05 20:00 > To: huxia...@horebdata.cn > CC: ceph-users > Subject: Re: [ceph-users] PG number per OSD > One factor is RAM usage, that was IIRC the motivation for the lowering of the > recommendation o

[ceph-users] Re: PG number per OSD

2020-09-05 Thread Anthony D'Atri
One factor is RAM usage, that was IIRC the motivation for the lowering of the recommendation of the ratio from 200 to 100. Memory needs also increase during recovery and backfill. When calculating, be sure to consider replicas. ratio = (pgp_num x replication) / num_osds As HDDs grow the
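A quick back-of-the-envelope check of that formula (the PG count, replication factor, and OSD count below are made-up example numbers):

    # 4096 PGs x 3 replicas spread over 96 OSDs -> 128 PGs per OSD
    echo $(( 4096 * 3 / 96 ))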

[ceph-users] Re: Fwd: Upgrade Path Advice Nautilus (CentOS 7) -> Octopus (new OS)

2020-08-27 Thread Anthony D'Atri
> > Looking for a bit of guidance / approach to upgrading from Nautilus to > Octopus considering CentOS and Ceph-Ansible. > > We're presently running a Nautilus cluster (all nodes / daemons 14.2.11 as > of this post). > - There are 4 monitor-hosts with mon, mgr, and dashboard functions >

[ceph-users] Re: Cluster degraded after adding OSDs to increase capacity

2020-08-27 Thread Anthony D'Atri
Is your MUA wrapping lines, or is the list software? As predicted. Look at the VAR column and the STDDEV of 37.27 > On Aug 27, 2020, at 9:02 AM, Dallas Jones wrote: > > 1 122.79410- 123 TiB 42 TiB 41 TiB 217 GiB 466 GiB 81 > TiB 33.86 1.00 -root default > -3

[ceph-users] Re: Cluster degraded after adding OSDs to increase capacity

2020-08-27 Thread Anthony D'Atri
Doubling the capacity in one shot was a big topology change, hence the 53% misplaced. OSD fullness will naturally reflect a bell curve; there will be a tail of under-full and over-full OSDs. If you’d not said that your cluster was very full before expansion I would have predicted it from the

[ceph-users] Re: does ceph rgw has any option to limit bandwidth

2020-08-19 Thread Anthony D'Atri
> I wanna limit the traffic of specific buckets. Can haproxy, nginx or any > other proxy software deal with it ? Yes. I’ve seen it done.

[ceph-users] Re: Ceph not warning about clock skew on an OSD-only host?

2020-08-12 Thread Anthony D'Atri
My understanding is that the existing mon_clock_drift_allowed value of 50 ms (default) is so that PAXOS among the mon quorum can function. So OSDs (and mgrs, and clients etc) are out of scope of that existing code. Things like this are why I like to ensure that the OS does `ntpdate -b` or

[ceph-users] Re: Problems with long taking deep-scrubbing processes causing PG_NOT_DEEP_SCRUBBED

2020-07-31 Thread Anthony D'Atri
One way this can happen is if you have the default setting osd_scrub_during_recovery=false If you’ve been doing a lot of [re]balancing, drive replacements, topology changes, expansions, etc. scrubs can be starved especially if you’re doing EC on HDDs. HDD or SSD OSDs? Replication or
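If that setting turns out to be the culprit, one hedged option (shown with the centralized config store; values and scope are illustrative):

    # Allow scrubs to proceed while recovery/backfill is running
    ceph config set osd osd_scrub_during_recovery true
    # Or simply check whether the deep-scrub interval needs more headroom
    ceph config get osd osd_deep_scrub_interval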

[ceph-users] Re: unbalanced pg/osd allocation

2020-07-30 Thread Anthony D'Atri
This is a natural condition of CRUSH. You don’t mention what release the back-end or the clients are running so it’s difficult to give an exact answer. Don’t mess with the CRUSH weights. Either adjust the override / reweights with `ceph osd test-reweight-by-utilization /
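For reference, a sketch of the dry-run-then-apply pattern mentioned above (120 is the usual default overload threshold, spelled out here explicitly):

    # Report what would change, without touching anything
    ceph osd test-reweight-by-utilization 120
    # Apply the same adjustment once the report looks sane
    ceph osd reweight-by-utilization 120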

[ceph-users] Re: OSD memory leak?

2020-07-14 Thread Anthony D'Atri
>> In the past, the minimum recommendation was 1GB RAM per HDD blue store OSD. There was a rule of thumb of 1GB RAM *per TB* of HDD Filestore OSD, perhaps you were influenced by that?

[ceph-users] Re: Questions on Ceph on ARM

2020-07-07 Thread Anthony D'Atri
Bear in mind that ARM and x86 are architectures, not CPU models. Both are available in a vast variety of core counts, clocks, and implementations. E.g., an 80-core Ampere Altra likely will smoke an Intel Atom D410 in every way. That said, what does “performance” mean? For object storage, it

[ceph-users] Re: Nautilus upgrade HEALTH_WARN legacy tunables

2020-07-04 Thread Anthony D'Atri
min_compat is a different thing entirely. You need to set the tunables as a group. This will cause data to move, so you may wish to throttle recovery, model the PG movement ahead of time, use the upmap trick to control movement etc.
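A hedged sketch of the order of operations being described (the throttle values are illustrative, not a recommendation):

    # Throttle recovery first so the rebalance doesn't swamp client I/O
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
    # Then switch the tunable profile as a group; expect data movement
    ceph osd crush tunables optimal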

[ceph-users] Re: Bluestore performance tuning for hdd with nvme db+wal

2020-06-30 Thread Anthony D'Atri
> That is an interesting point. We are using 12 on 1 nvme journal for our > Filestore nodes (which seems to work ok). The workload for wal + db is > different so that could be a factor. However when I've looked at the IO > metrics for the nvme it seems to be only lightly loaded, so does not

[ceph-users] Re: Re layout help: need chassis local io to minimize net links

2020-06-29 Thread Anthony D'Atri
> Thanks for the thinking. By 'traffic' I mean: when a user space rbd > write has as a destination three replica osds in the same chassis eek. > does the whole write get shipped out to the mon and then back Mons are control-plane only. > All the 'usual suspects' like lossy ethernets and

[ceph-users] Re: Re layout help: need chassis local io to minimize net links

2020-06-29 Thread Anthony D'Atri
What does “traffic” mean? Reads? Writes will have to hit the net regardless of any machinations. > On Jun 29, 2020, at 7:31 PM, Harry G. Coin wrote: > > I need exactly what ceph is for a whole lot of work, that work just > doesn't represent a large fraction of the total local traffic.

[ceph-users] Re: fault tolerant about erasure code pool

2020-06-26 Thread Anthony D'Atri
M=1 is never a good choice. Just use replication instead. > On Jun 26, 2020, at 3:05 AM, Zhenshi Zhou wrote: > > Hi Janne, > > I use the default profile(2+1) and set failure-domain=host, is my best > practice? > > Janne Johansson 于2020年6月26日周五 下午4:59写道: > >> Den fre 26 juni 2020 kl

[ceph-users] Re: High ceph_osd_commit_latency_ms on Toshiba MG07ACA14TE HDDs

2020-06-24 Thread Anthony D'Atri
The benefit of disabling on-drive cache may be at least partly dependent on the HBA; I’ve done testing of one specific drive model and found no difference, where someone else reported a measurable difference for the same model. > Good to know that we're not alone :) I also looked for a newer

[ceph-users] Re: High ceph_osd_commit_latency_ms on Toshiba MG07ACA14TE HDDs

2020-06-24 Thread Anthony D'Atri
>> I can remember reading this before. I was hoping you maybe had some >> setup with systemd scripts or maybe udev. > > Yeah, doing this on boot up would be ideal. I was looking really hard into > tuned and other services that claimed can do it, but required plugins or > other stuff did/does

[ceph-users] Re: OSD node OS upgrade strategy

2020-06-19 Thread Anthony D'Atri
> > I am wondering if its not necessary to have to drain/fill OSD nodes at > all and if this can be done with just a fresh install and not touch > the OSD's Absolutely. I’ve done this both with Trusty -> Bionic and Precise -> RHEL7. > however I don't know how to perform a fresh

[ceph-users] Re: MAX AVAIL goes up when I reboot an OSD node

2020-06-14 Thread Anthony D'Atri
Exactly my thought too: the node in question has a most-full outlier, so when that OSD is out, the most-full is … less full. ceph osd df | sort -nk8 | tail Or using https://gitlab.cern.ch/ceph/ceph-scripts/blob/e7dbedba66fb585e38f539a41f9640f3ad8af96b/tools/histogram.py ceph osd df | egrep

[ceph-users] Re: Sizing your MON storage with a large cluster

2020-06-14 Thread Anthony D'Atri
Having been through a number of repaves and expansions, along with Firefly-era DB inflation and rack-weight-difference bugs a few things pop out from these accounts, ymmv. o There’s been a bug where compaction didn’t happen if all OSDs weren’t up/in, not sure of affected versions. One of the

[ceph-users] Re: Best way to change bucket hierarchy

2020-06-05 Thread Anthony D'Atri
> > Why don't you do the crush map change together with that? All the data will > be reshuffled then any w Couldn’t the upmap trick be used to manage this? Compute a full set to pin the before mappings, then incrementally remove them so that data moves in a controlled fashion? I suspect

[ceph-users] Re: Nautilus latest builds for CentOS 8

2020-06-03 Thread Anthony D'Atri
cbs.centos.org offers 14.2.7 packages for el8 eg https://cbs.centos.org/koji/buildinfo?buildID=28564 but I don’t know anything about their provenance or nature. For sure a downloads.ceph.com package would be desirable. > On Jun 3, 2020, at 4:04 PM, Victoria Martinez de la Cruz > wrote: > >

[ceph-users] Re: [ceph-users]: Ceph Nautius not working after setting MTU 9000

2020-05-29 Thread Anthony D'Atri
>>> I would not call a ceph page a random tuning tip. At least I hope they are not. NVMe-only with 100Gbit is not really a standard setup. I assume

[ceph-users] Re: [External Email] Re: Ceph Nautius not working after setting MTU 9000

2020-05-25 Thread Anthony D'Atri
Quick and easy depends on your network infrastructure. Sometimes it is difficult or impossible to retrofit a live cluster without disruption. > On May 25, 2020, at 1:03 AM, Marc Roos wrote: > >  > I am interested. I am always setting mtu to 9000. To be honest I cannot > imagine there is

[ceph-users] Re: PGS INCONSISTENT - read_error - replace disk or pg repair then replace disk

2020-05-23 Thread Anthony D'Atri
Historically I have often but not always found that removing / destroying the affected OSD would clear the inconsistent PG. At one point the logged message was clear about who reported and who was the perp, but then a later release broke that. Not sure what recent releases say, since with

[ceph-users] Re: Using rbd-mirror in existing pools

2020-05-14 Thread Anthony D'Atri
Understandable concern. FWIW I’ve used rbd-mirror to move thousands of volumes between clusters with zero clobbers. —aad > On May 14, 2020, at 9:46 AM, Kees Meijs | Nefos wrote: > > My main concern is pulling images into a non-empty pool. It would be > (very) bad if rbd-mirror tries to be

[ceph-users] Re: Cluster network and public network

2020-05-14 Thread Anthony D'Atri
What is a saturated network with modern switched technologies? Links to individual hosts? Uplinks from TORS (public)? Switch backplane (cluster)? > That is correct. I didn't explain it clearly. I said that is because in > some write only scenario the public network and cluster network will >

[ceph-users] Re: Using rbd-mirror in existing pools

2020-05-14 Thread Anthony D'Atri
When you set up the rbd-mirror daemons with each others’ configs, and initiate mirroring of a volume, the destination will create the volume in the destination cluster and pull over data. Hopefully you’re creating unique volume names so there won’t be conflicts, but that said if the

[ceph-users] Re: Migrating clusters (and versions)

2020-05-14 Thread Anthony D'Atri
igured > on a per-pool basis" (according documentation). > > On 14-05-2020 09:13, Anthony D'Atri wrote: >> So?

[ceph-users] Re: Migrating clusters (and versions)

2020-05-14 Thread Anthony D'Atri
So? > > Hi Anthony, > > Thanks as well. > > Well, it's a one-time job. > > K. > > On 14-05-2020 09:10, Anthony D'Atri wrote: >> Why not use rbd-mirror to handle the volumes?

[ceph-users] Re: Migrating clusters (and versions)

2020-05-14 Thread Anthony D'Atri
Why not use rbd-mirror to handle the volumes? > On May 13, 2020, at 11:27 PM, Kees Meijs wrote: > > Hi Konstantin, > > Thank you very much. That's a good question. > > The implementations of OpenStack and Ceph and "the other" OpenStack and > Ceph are, apart from networking, completely

[ceph-users] Re: Cluster network and public network

2020-05-13 Thread Anthony D'Atri
cal physical networking). > > Best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > > From: Stefan Kooman > Sent: 13 May 2020 07:40 > To: ceph-users@ceph.io > Subject: [ceph-us

[ceph-users] Re: Cluster network and public network

2020-05-12 Thread Anthony D'Atri
> I think, however, that a disappearing back network has no real consequences > as the heartbeats always go over both. FWIW this has not been my experience, at least through Luminous. What I’ve seen is that when the cluster/replication net is configured but unavailable, OSD heartbeats fail

[ceph-users] Re: Cluster network and public network

2020-05-09 Thread Anthony D'Atri
>> If your public network is saturated, that actually is a problem, last thing >> you want is to add recovery traffic, or to slow down heartbeats. For most >> people, it isn’t saturated. > >See Frank Schilder's post about a meltdown which he believes could have >been caused by

[ceph-users] Re: Cluster network and public network

2020-05-09 Thread Anthony D'Atri
> Hi, > > I deployed few clusters with two networks as well as only one network. > There has little impact between them for my experience. > > I did a performance test on nautilus cluster with two networks last week. > What I found is that the cluster network has low bandwidth usage During

[ceph-users] Cluster rename procedure

2020-05-08 Thread Anthony D'Atri
I’ve inherited a couple of clusters with non-default (ie, not “ceph”) internal names, and I want to rename them for the usual reasons. I had previously developed a full list of steps - which I no longer have access to. Anyone done this recently? Want to be sure I’m not missing something. *

[ceph-users] Re: How to apply ceph.conf changes using new tool cephadm

2020-05-07 Thread Anthony D'Atri
ceph.conf still *works*, though, for those with existing systems built to manage it? While the centralized config is very handy, I’d love to find a way to tie it into external revision control, so that settings could continue to be managed in eg. git and applied by automation. > On Apr 30,
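One hedged way to tie the centralized store into revision control (these are the standard Nautilus+ config commands; the file names are placeholders):

    # Export the live mon config database so it can be committed to git
    ceph config dump > ceph-config-snapshot.txt
    # Pull a reviewed ceph.conf-style file from git back into the mon config store
    ceph config assimilate-conf -i reviewed-ceph.conf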

[ceph-users] Re: upmap balancer and consequences of osds briefly marked out

2020-05-03 Thread Anthony D'Atri
Do I misunderstand this script, or does it not _quite_ do what’s desired here? I fully get the scenario of applying a full-cluster map to allow incremental topology changes. To be clear, if this is run to effectively freeze backfill during / following a traumatic event, it will freeze that

[ceph-users] Re: some ceph general questions about the design

2020-04-21 Thread Anthony D'Atri
> > 1. shoud i use a raid controller a create for example a raid 5 with all disks > on each osd server? or should i passtrough all disks to ceph osd? > > If your OSD servers have HDDs, buy a good RAID Controller with a > battery-backed write cache and configure it using multiple RAID-0

[ceph-users] Re: Cann't create ceph cluster

2020-04-05 Thread Anthony D'Atri
Check the output carefully, the OP was using a 5 year old version of ceph-deploy. Updating brought success, presumably with Nautilus. > On Apr 5, 2020, at 11:06 AM, Brent Kennedy wrote: > > If you are deploying the octopus release, it cant be installed on CentOS 7 ( > must use centOS 8

[ceph-users] Re: Questions on Ceph cluster without OS disks

2020-04-04 Thread Anthony D'Atri
Linuxes don’t require swap at least, maybe BSDs still do but I haven’t run one since the mid 90s. Back in the day swap had to be at least the size of physmem, but we’re talking SunOS 4.1.4 days. Swap IMHO has been moot for years. It dates to a time when RAM was much more expensive and

[ceph-users] Re: Replace OSD node without remapping PGs

2020-04-01 Thread Anthony D'Atri
The strategy that Nghia described is inefficient for moving data more than once, but safe since there are always N copies, vs a strategy of setting noout, destroying the OSDs, and recreating them on the new server. That would be more efficient, albeit with a period of reduced redundancy. I’ve

[ceph-users] Re: Load on drives of different sizes in ceph

2020-03-31 Thread Anthony D'Atri
You can adjust the Primary Affinity down on the larger drives so they’ll get less read load. In one test I’ve seen this result in a 10-15% increase in read throughout but it depends on your situation. Optimal settings would require calculations that make my head hurt, maybe someone has a
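A minimal example of the knob being described (the OSD id and value are placeholders):

    # 0 = never primary, 1 = default; give larger drives a lower value to shed read load
    ceph osd primary-affinity osd.12 0.5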

[ceph-users] Re: Logging remove duplicate time

2020-03-31 Thread Anthony D'Atri
> How to get rid of this logging?? > > Mar 31 13:40:03 c01 ceph-mgr: 2020-03-31 13:40:03.521 7f554edc8700 0 > log_channel(cluster) log [DBG] : pgmap v672067: 384 pgs: 384 > active+clean; Why? > > I already have the time logged, I do not need it a second time. > > Mar 31 13:39:59 c01

[ceph-users] Re: v15.2.0 Octopus released

2020-03-26 Thread Anthony D'Atri
I suspect that while not default we still need to be able to run OSDs created on older releases, or with leveldb explicitly? > >> - nothing provides libleveldb.so.1()(64bit) needed by >> ceph-osd-2:15.2.0-0.el8.x86_64 > > Does the OSD still use leveldb at all? > > - Ken >

[ceph-users] Re: Questions on Ceph cluster without OS disks

2020-03-24 Thread Anthony D'Atri
I suspect Ceph is configured in their case to send all logs off-node to a central syslog server, ELK, etc. With Jewel this seemed to result in daemons crashing, but probably it’s since been fixed (I haven’t tried). > that is much less than I experienced of allocated disk space in case >

[ceph-users] Re: Q release name

2020-03-23 Thread Anthony D'Atri
That has potential. Another, albeit suboptimal idea would be simply Quid as in ’S quid as in “it’s squid”. cf. https://en.wikipedia.org/wiki/%27S_Wonderful Alternately just skip to R and when someone asks about Q, we say “The first rule of Ceph is that we don’t talk about Q”. — aad >

[ceph-users] Re: Problem with OSD::osd_op_tp thread had timed out and other connected issues

2020-03-21 Thread Anthony D'Atri
This is an expensive operation. You want to slow it down, not burden the OSDs. > On Mar 21, 2020, at 5:46 AM, Jan Pekař - Imatic wrote: > > Each node has 64GB RAM so it should be enough (12 OSD's = 48GB used). > >> On 21/03/2020 13.14, XuYun wrote: >> Bluestore requires more than 4G memory

[ceph-users] Re: Forcibly move PGs from full to empty OSD

2020-03-16 Thread Anthony D'Atri
He means that if eg. you enforce 1 copy of a PG per rack, that any upmaps you enter don’t result in 2 or 3 in the same rack. If your CRUSH poilicy is one copy per *host* the danger is even higher that you could have data become unavailable or even lost in case of a failure. > On Mar 16, 2020,
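For reference, a hypothetical upmap entry of the kind being discussed (the PG id and OSD ids are made up; only safe if the target OSD sits in a different failure domain):

    # Remap PG 1.2f's copy from osd.4 to osd.9
    ceph osd pg-upmap-items 1.2f 4 9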

[ceph-users] Re: How to get num ops blocked per OSD

2020-03-13 Thread Anthony D'Atri
Yeah the removal of that was annoying for sure. ISTR that one can gather the information from the OSDs’ admin sockets. Envision a Prometheus exporter that polls the admin sockets (in parallel) and Grafana panes that graph slow requests by OSD and by node. > On Mar 13, 2020, at 4:14 PM,

[ceph-users] Re: Ceph Performance of Micron 5210 SATA?

2020-03-11 Thread Anthony D'Atri
> > An observation: We are using proxmox (5.4), and it displays the WEAR level as > "N/A", which is unfortunate... :-( Tried upgrading to 6: the same, still no > wearout in the GUI. > > Here is smartctl output: > >> smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.18-26-pve] (local build)

[ceph-users] Re: Single machine / multiple monitors

2020-03-11 Thread Anthony D'Atri
Custom cluster names are being incrementally deprecated. With ceph-deploy I thought they were removed in 1.39. You could probably achieve parallel clusters with containerized daemons if you tried hard enough, but I have to ask what leads you to want to do this. > On Mar 11, 2020, at 4:07

[ceph-users] Re: rbd-mirror replay is very slow - but initial bootstrap is fast

2020-03-10 Thread Anthony D'Atri
>> >> FWIW when using rbd-mirror to migrate volumes between SATA SSD clusters, I >> found that >> >> >> rbd_mirror_journal_max_fetch_bytes: >>section: "client" >>value: "33554432" >> >> rbd_journal_max_payload_bytes: >>section: "client" >>value: “8388608" > > Indeed,

[ceph-users] Re: rbd-mirror replay is very slow - but initial bootstrap is fast

2020-03-10 Thread Anthony D'Atri
FWIW when using rbd-mirror to migrate volumes between SATA SSD clusters, I found that rbd_mirror_journal_max_fetch_bytes: section: "client" value: "33554432" rbd_journal_max_payload_bytes: section: "client" value: “8388608" Made a world of difference in expediting
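The same overrides expressed against the centralized config store, as a sketch (the original post used ceph.conf-style client sections; this assumes the clients pick up mon-stored config):

    ceph config set client rbd_mirror_journal_max_fetch_bytes 33554432
    ceph config set client rbd_journal_max_payload_bytes 8388608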

[ceph-users] Re: Ceph Performance of Micron 5210 SATA?

2020-03-06 Thread Anthony D'Atri
Yes. The drive treats some portion of cells as SLC, which having only two charge states is a lot faster and is used as cache. As with any cache-enabled drive, if that cache fills up either due to misaligned flush cycles or simply data coming in faster than it can flush, you’ll see a

[ceph-users] Re: Can't add a ceph-mon to existing large cluster

2020-03-05 Thread Anthony D'Atri
> >>> Sage, do you think I can workaround by setting >>> mon_sync_max_payload_size ridiculously small, like 1024 or something >>> like that? >> >> Yeah... IIRC that is how the original user worked around the problem. I >> think they use 64 or 128 KB. > > Nice... 64kB still triggered elections
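A sketch of the workaround under discussion (1024 is the “ridiculously small” value from the quoted question; older clusters would set this via ceph.conf or injectargs instead of the config store):

    ceph config set mon mon_sync_max_payload_size 1024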

[ceph-users] Re: Ceph Performance of Micron 5210 SATA?

2020-03-05 Thread Anthony D'Atri
That depends on how you define “decent”, and on your use case. Be careful that these are QLC drives. QLC is pretty new and longevity would seem to vary quite a bit based on op mix. These might be fine for read-mostly workloads, but high-turnover databases might burn them up fast, especially as

[ceph-users] Re: small cluster HW upgrade

2020-02-02 Thread Anthony D'Atri
This is a natural condition of bonding; it has little to do with ceph-osd. Make sure your hash policy is set appropriately, so that you even have a chance of using both links. https://support.packet.com/kb/articles/lacp-bonding The larger the set of destinations, the more likely you are to
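A hedged example of inspecting and changing the transmit hash policy on an existing bond (bond0 is a placeholder; the durable place to set this is your distro's bonding/netplan config):

    # Inspect the current policy
    grep -i hash /proc/net/bonding/bond0
    # layer3+4 hashes on IP and port, so different flows can land on different slaves
    echo layer3+4 > /sys/class/net/bond0/bonding/xmit_hash_policy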

[ceph-users] Re: Ubuntu 18.04.4 Ceph 12.2.12

2020-01-24 Thread Anthony D'Atri
I applied those packages for the same reason on a staging cluster and so far so good. > On Jan 24, 2020, at 9:15 AM, Atherion wrote: > >  > Hi Ceph Community. > We currently have a luminous cluster running and some machines still on > Ubuntu 14.04 > We are looking to upgrade these machines

[ceph-users] Re: High swap usage on one replication node

2019-12-09 Thread Anthony D'Atri
I’ve had one or two situations where swap might have helped a memory consumption problem, but others in which it would have *worsened* cluster performance. Sometimes it’s better for the *cluster* for an OSD to die / restart / get OOMkilled than for it to limp along sluggishly. In the past

[ceph-users] Re: Unbalanced data distribution

2019-10-22 Thread Anthony D'Atri
I agree wrt making the node weights uniform. When mixing drive sizes, be careful that the larger ones don’t run afoul of the pg max — they will receive more pgs than the smaller ones, and if you lose a node that might be enough to send some over the max. ‘ceph osd df’ and look at the PG

[ceph-users] Re: mix sata/sas same pool

2019-10-16 Thread Anthony D'Atri
I have clusters mixing them with no problems. At one point we bought a SAS model; when we don’t have an exact local spare we replace dead ones with SATA. Some HBAs or backplanes, though, might not support mixing. Unlike crappy HBA RoC vols, Ceph doesn’t care. > On Oct 16, 2019, at 1:39 PM,

[ceph-users] Re: RadosGW max worker threads

2019-10-11 Thread Anthony D'Atri
We’ve been running with 2000 FWIW. > On Oct 11, 2019, at 2:02 PM, Paul Emmerich wrote: > > Which defaults to rgw_thread_pool_size, so yeah, you can adjust that option. > > To answer your actual question: we've run civetweb with 1024 threads > with no problems related to the number of threads. > >
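For illustration, one way such a value might be set (client.rgw.gateway1 is a hypothetical daemon name; adjust to however your RGW instances are named):

    ceph config set client.rgw.gateway1 rgw_thread_pool_size 2000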

[ceph-users] Re: Nautilus: PGs stuck remapped+backfilling

2019-10-11 Thread Anthony D'Atri
Very large omaps can take quite a while. > >> You meta data PGs *are* backfilling. It is the "61 keys/s" statement in the >> ceph status output in the recovery I/O line. If this is too slow, increase >> osd_max_backfills and osd_recovery_max_active. >> >> Or just have some coffee ... > > >
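The two knobs mentioned in the quoted advice, as runtime overrides (values are illustrative; dial them back down once the backfill finishes):

    ceph config set osd osd_max_backfills 4
    ceph config set osd osd_recovery_max_active 8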

[ceph-users] Re: Nautilus: PGs stuck remapped+backfilling

2019-10-11 Thread Anthony D'Atri
Parallelism. The backfill/recovery tunables control how many recovery ops a given OSD will perform. If you’re adding a new OSD, naturally it is the bottleneck. For other forms of data movement, early on one has multiple OSDs reading and writing independently. Toward the end, increasingly

[ceph-users] Re: Unexpected increase in the memory usage of OSDs

2019-10-09 Thread Anthony D'Atri
>>> Do you have statistics on the size of the OSDMaps or count of them >>> which were being maintained by the OSDs? >> No, I don't think so. How can I find this information? > > Hmm I don't know if we directly expose the size of maps. There are > perfcounters which expose the range of maps being

[ceph-users] Re: RAM recommendation with large OSDs?

2019-10-03 Thread Anthony D'Atri
It’s not that the limit is *ignored*; sometimes the failure of the subtree isn’t *detected*. E.g., I’ve seen this happen when a node experienced kernel weirdness or OOM conditions such that the OSDs didn’t all get marked down at the same time, so the PGs all started recovering. Admittedly it’s

[ceph-users] Re: RAM recommendation with large OSDs?

2019-10-02 Thread Anthony D'Atri
This is in part a question of *how many* of those dense OSD nodes you have. If you have a hundred of them, then most likely they’re spread across a decent number of racks and the loss of one or two is a tolerable *fraction* of the whole cluster. If you have a cluster of just, say, 3-4 of

[ceph-users] Re: how many monitor should to deploy in a 1000+ osd cluster

2019-09-26 Thread Anthony D'Atri
I’d always been led to believe that when a client attaches it tries the mons in its conf file in order until it contacts one, which may then hand it off to another. This was given as an explanation for a situation I once had where 2/5 mons were down but clients could run but couldn’t

[ceph-users] Re: disk failure

2019-09-05 Thread Anthony D'Atri
Are you using Filestore? If so directory splitting can manifest this way. Check your networking too, packet loss between OSD nodes or between OSD nodes and the mons can also manifest this way, say if bonding isn’t working properly or you have a bad link. But as suggested below, check the

[ceph-users] Re: Proposal to disable "Avoid Duplicates" on all ceph.io lists

2019-09-04 Thread Anthony D'Atri
> I agree. The ceph-user list is so busy that one cannot manually check all the > time if someone replied. I would like messages with me in cc to come to inbox > as well as via the list. Some time ago I sent a message to the ostensible list owner, which seemed to be blackholed. I suggested

[ceph-users] Re: Strange Ceph architect with SAN storages

2019-08-22 Thread Anthony D'Atri
In a past life I had a bunch of SAN gear dumped in my lap; it was spec’d by someone else misinterpreting vague specs. It was SAN gear with an AoE driver. I wasn’t using Ceph, but sending it back and getting a proper solution wasn’t an option. Ended up using SAN gear as a NAS with a single
