Re: [ceph-users] Consumer-grade SSD in Ceph

2020-01-03 Thread Anthony D'Atri
>> SATA: Micron 5100-5200-5300, Seagate Nytro 1351/1551 (don't forget to 
>> disable their cache with hdparm -W 0)

We didn’t find a measurable difference doing this on 5100s, ymmv.
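For anyone trying that, a minimal sketch (device name is a placeholder; check 
whether the setting survives a reboot on your platform):

hdparm -W 0 /dev/sdb    # disable the drive's volatile write cache
hdparm -W /dev/sdb      # read the setting back to confirm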

Depending on your use-case, CRUSH rules (EC vs R), etc. sub-DWPD models may be 
fine for OSDs, but I suggest higher durability for mon DBs. 

I thought this list was dead and we were using the ceph.io list now?



Re: [ceph-users] ceph-users Digest, Vol 83, Issue 18

2019-12-20 Thread Anthony D'Atri
> Hi Sinan,
> 
> I would not recommend using 860 EVO or Crucial MX500 SSD's in a Ceph cluster, 
> as those are consumer grade solutions and not enterprise ones.

The OP knows that, but wants to know why.

> Performance and durability will be issues. If feasible, I would simply go 
> NVMe  as it sounds like you will be using this disk to store the journal or 
> db partition.

IF there’s an available PCIe slot.  And then you have N OSDs dependent on a 
single device, so when it flakes you lose the whole node.

One thing that consumer SSDs — and even some that pretend to be “enterprise” — 
exhibit is a performance cliff: they look okay under a light workload, but at 
some threshold they saturate and throughput plummets.
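One hedged way to look for that cliff before deploying is a sustained sync-write 
test with fio — destructive, so point it only at a blank drive; the device name 
is a placeholder:

fio --name=cliff-test --filename=/dev/sdX --direct=1 --sync=1 --rw=write \
    --bs=4k --iodepth=1 --numjobs=1 --runtime=600 --time_based
# watch whether IOPS hold steady or collapse after the first few minutes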


Re: [ceph-users] PG Balancer Upmap mode not working

2019-12-09 Thread Anthony D'Atri
> How is that possible? I dont know how much more proof I need to present that 
> there's a bug.

FWIW, your pastes are hard to read with all the ? in them.  Pasting 
non-7-bit-ASCII?

> |I increased PGs and see no difference.

From what pgp_num to what new value?  Numbers that are not a power of 2 can 
contribute to the sort of problem you describe.  Do you use a host-level CRUSH 
failure domain?

> Raising PGs to 100 is an old statement anyway, anything 60+ should be fine. 

Fine in what regard?  To be sure, Wido’s advice means a *ratio* of at least 
100.  ratio = (pgp_num * replication) / #osds
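A rough way to check where a cluster stands (the numbers below are only an 
example):

ceph osd pool ls detail   # note pg_num / pgp_num and replicated size per pool
ceph osd stat             # note the OSD count
# e.g. one pool with pg_num 1024 and size 3 across 24 OSDs:
#   ratio = (1024 * 3) / 24 = 128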

The target used to be 200, a commit around 12.2.1 retconned that to 100.  Best 
I can tell the rationale is memory usage at the expense of performance.

Is your original excerpt complete?  I.e., do you only have 24 OSDs?  Across how 
many nodes?

The old guidance for tiny clusters:

• Less than 5 OSDs set pg_num to 128

• Between 5 and 10 OSDs set pg_num to 512

• Between 10 and 50 OSDs set pg_num to 1024
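If you do end up raising it, a minimal sketch — the pool name and target are 
placeholders; pick a power of 2 and bump pgp_num to match:

ceph osd pool set rbd pg_num 1024
ceph osd pool set rbd pgp_num 1024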




Re: [ceph-users] WAL/DB size

2019-08-16 Thread Anthony D'Atri
Thanks — interesting reading.

Distilling the discussion there, below are my takeaways.

1) The spillover phenomenon — and thus the small number of discrete sizes that 
are effective without being wasteful — is recognized

2) "I don't think we should plan teh block.db size based on the rocksdb 
stairstep pattern. A better solution would be to tweak the rocksdb level sizes 
at mkfs time based on the block.db size!”

3) Neither 1) nor 2) was actually acted upon, so we got arbitrary guidance 
based on a calculation of the number of metadata objects, with no input from or 
action upon how the DB actually behaves?


Am I interpreting correctly?


> Btw, the original discussion leading to the 4% recommendation is here:
> https://github.com/ceph/ceph/pull/23210
> 
> 
> -- 
> Paul Emmerich
> 
> 
>> 30gb already includes WAL, see 
>> http://yourcmc.ru/wiki/Ceph_performance#About_block.db_sizing
>> 
>> 15 августа 2019 г. 1:15:58 GMT+03:00, Anthony D'Atri  
>> пишет:
>>> 
>>> Good points in both posts, but I think there’s still some unclarity.



Re: [ceph-users] WAL/DB size

2019-08-14 Thread Anthony D'Atri
Good points in both posts, but I think there’s still some unclarity.

Absolutely let’s talk about DB and WAL together.  By “bluestore goes on flash” 
I assume you mean WAL+DB?

“Simply allocate DB and WAL will appear there automatically”

Forgive me please if this is obvious, but I’d like to see a holistic 
explanation of WAL and DB sizing *together*, which I think would help folks put 
these concepts together and plan deployments with some sense of confidence.

We’ve seen good explanations on the list of why only specific DB sizes, say 
30GB, are actually used _for the DB_.
If the WAL goes along with the DB, shouldn’t we also explicitly determine an 
appropriate size N for the WAL, and make the partition (30+N) GB?
If so, how do we derive N?  Or is it a constant?
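A back-of-the-envelope sketch, assuming the default four 256 MB memtables (so N 
is small) and hypothetical VG/LV names:

#  N ~= 4 x 256 MB ~= 1 GiB; round up to 2 GiB for headroom
#  DB+WAL partition ~= 30 GiB + 2 GiB = 32 GiB
lvcreate -L 32G -n osd0-db nvme-vg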

Filestore was so much simpler, 10GB set+forget for the journal.  Not that I 
miss XFS, mind you.


>> Actually standalone WAL is required when you have either very small fast
>> device (and don't want db to use it) or three devices (different in
>> performance) behind OSD (e.g. hdd, ssd, nvme). So WAL is to be located
>> at the fastest one.
>> 
>> For the given use case you just have HDD and NVMe and DB and WAL can
>> safely collocate. Which means you don't need to allocate specific volume
>> for WAL. Hence no need to answer the question how many space is needed
>> for WAL. Simply allocate DB and WAL will appear there automatically.
>> 
>> 
> Yes, i'm surprised how often people talk about the DB and WAL separately
> for no good reason.  In common setups bluestore goes on flash and the
> storage goes on the HDDs, simple.
> 
> In the event flash is 100s of GB and would be wasted, is there anything
> that needs to be done to set rocksdb to use the highest level?  600 I
> believe





Re: [ceph-users] How to maximize the OSD effective queue depth in Ceph?

2019-08-06 Thread Anthony D'Atri
> However, I'm starting to think that the problem isn't with the number
> of threads that have work to do... the problem may just be that the
> OSD & PG code has enough thread locking happening that there is no
> possible way to have more than a few things happening on a single OSD
> (or perhaps a single placement group).
> 
> Has anyone thought about the problem from this angle?  This would help
> explain why multiple-OSDs-per-SSD is so effective (even though the
> thought of doing this in production is utterly terrifying).


When researching this topic a few months back, the below is what I found; HTH.  
We’re planning to break up NVMe drives into multiple OSDs.  I don’t find this 
terrifying so much as somewhat awkward: we’ll have to update deployment and 
troubleshooting/maintenance procedures to act accordingly.

Back in the day it was conventional Ceph wisdom to never put multiple OSDs on a 
single device, but my sense was that was an artifact of bottlenecked spinners.  
The resultant seek traffic I imagine could be ugly, but would it be worse than 
we already suffered with colo journals?  (*)  With a device that can handle 
lots of IO depth without seeks, IMHO it’s not so bad, especially as Ceph has 
evolved to cope better with larger numbers of OSDs.


"per-osd session lock", "all AIO completions are fired from a single thread – 
so even if you are pumping data to the OSDs using 8 threads, you are only 
getting serialized completions”

https://apawel.me/ceph-creating-multiple-osds-on-nvme-devices-luminous/

https://www.micron.com/-/media/client/global/documents/products/other-documents/micron_9200_max_ceph_12,-d-,2,-d-,8_luminous_bluestore_reference_architecture.pdf?la=en

https://www.spinics.net/lists/ceph-devel/msg41570.html

https://bugzilla.redhat.com/show_bug.cgi?id=1541415

http://tracker.ceph.com/projects/ceph/wiki/Tuning_for_All_Flash_Deployments#NVMe-SSD-partitioning
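For reference, a minimal sketch of the split itself, assuming a ceph-volume 
recent enough to support --osds-per-device (otherwise carve the LVs by hand; 
all names are placeholders):

ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1

# manual equivalent with explicit LVs:
#   vgcreate ceph-nvme0 /dev/nvme0n1
#   lvcreate -l 50%VG -n osd-a ceph-nvme0
#   lvcreate -l 100%FREE -n osd-b ceph-nvme0
#   ceph-volume lvm create --data ceph-nvme0/osd-a
#   ceph-volume lvm create --data ceph-nvme0/osd-b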


> With block sizes 64K and lower the avgqu-sz value never went above 1
> under any workload, and I never saw the iostat util% much above 50%.


I’ve been told that iostat %util isn’t as meaningful with SSDs as it was with 
HDDs, but I don’t recall the rationale.  ymmv.




*  And oh did we suffer from them :-x



Re: [ceph-users] How to add 100 new OSDs...

2019-08-04 Thread Anthony D'Atri
>>> We have been using:
>>> 
>>> osd op queue = wpq
>>> osd op queue cut off = high
>>> 
>>> It virtually eliminates the impact of backfills on our clusters. Our
> 
> It does better because it is a fair share queue and doesn't let recovery
> ops take priority over client ops at any point for any time. It allows
> clients to have a much more predictable latency to the storage.


Why aren’t these default settings then?  Those who set these:  do you run with 
them all the time, or only while expanding?  Is peering still impactful?
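For anyone who wants to try them, a minimal ceph.conf sketch; my understanding 
is that both options are only read at OSD startup, so plan on a rolling restart:

[osd]
osd op queue = wpq
osd op queue cut off = high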

— aad



Re: [ceph-users] How to add 100 new OSDs...

2019-07-28 Thread Anthony D'Atri
Paul Emmerich wrote:

> +1 on adding them all at the same time.
> 
> All these methods that gradually increase the weight aren't really
> necessary in newer releases of Ceph.

Because the default backfill/recovery values are lower than they were in, say, 
Dumpling?

Doubling (or more) the size of a cluster in one swoop still means a lot of 
peering and a lot of recovery I/O; I’ve seen a cluster’s data rate go to or 
near 0 for a brief but nonzero length of time.  If something goes wrong with 
the network (cough cough subtle jumbo frame lossage cough), or if one has 
fat-fingered something along the way, going in increments means that a ^C 
lets the cluster stabilize before very long.  Then you get to troubleshoot 
with HEALTH_OK instead of HEALTH_WARN or HEALTH_ERR.

Having experienced a cluster be DoS’d for hours when its size was tripled in 
one go, I’m once bitten twice shy.  Yes, that was Dumpling, but even with SSDs 
on Jewel and Luminous I’ve seen significant client performance impact from 
en-masse topology changes.
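One hedged way to keep an escape hatch when going big: pause rebalancing while 
the new OSDs are added and sanity-checked, then release it:

ceph osd set norebalance
# ... add/weight the new OSDs, confirm "ceph osd tree" looks sane ...
ceph osd unset norebalance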

— aad



Re: [ceph-users] New best practices for osds???

2019-07-26 Thread Anthony D'Atri
> This is worse than I feared, but very much in the realm of concerns I 
> had with using single-disk RAID0 setups.  Thank you very much for 
> posting your experience!  My money would still be on using *high write 
> endurance* NVMes for DB/WAL and whatever I could afford for block.


yw.  Of course there are all manner of use-cases and constraints, so others 
have different experiences.  Perhaps with the freedom to not use a certain HBA 
vendor things would be somewhat better but in said past life the practice cost 
hundreds of thousands of dollars.

I personally have a low tolerance for fuss, and management / mapping of WAL/DB 
devices still seems like a lot of fuss especially when drives fail or have to 
be replaced for other reasons.

For RBD clusters/pools at least I really enjoy not having to mess with multiple 
devices; I’d rather run colo with SATA SSDs than spinners with NVMe WAL+DB. 

- aad



Re: [ceph-users] New best practices for osds???

2019-07-25 Thread Anthony D'Atri
> We run few hundred HDD OSDs for our backup cluster, we set one RAID 0 per HDD 
> in order to be able
> to use -battery protected- write cache from the RAID controller. It really 
> improves performance, for both
> bluestore and filestore OSDs.

Having run something like 6000 HDD-based FileStore OSDs with colo journals on 
RAID HBAs I’d like to offer some contrasting thoughts.

TL;DR:  Never again!  False economy.  ymmv.

Details:

* The implementation predated me and was carved in dogfood^H^H^H^H^H^H^Hstone, 
try as I might I could not get it fixed.

* Single-drive RAID0 VDs were created to expose the underlying drives to the 
OS.  When the architecture was conceived, the HBAs in question didn’t have 
JBOD/passthrough, though a firmware update shortly thereafter did bring that 
ability.  That caching was a function of VDs wasn’t known at the time.

* My sense was that the FBWC did offer some throughput performance for at least 
some workloads, but at the cost of latency.

* Using a RAID-capable HBA in IR mode with FBWC meant having to monitor for the 
presence and status of the BBU/supercap

* The utility needed for that monitoring, when invoked with ostensibly 
innocuous parameters, would lock up the HBA for several seconds.

* Traditional BBUs are rated for lifespan of *only* one year.  FBWCs maybe for 
… three?  Significant cost to RMA or replace them:  time and karma wasted 
fighting with the system vendor CSO, engineer and remote hands time to take the 
system down and swap.  And then the connectors for the supercap were touchy; 
15% of the time the system would come up and not see it at all.

* The RAID-capable HBA itself + FBWC + supercap cost … a couple-three hundred 
dollars more than an IT / JBOD equivalent

* There was a little-known flaw in secondary firmware that caused FBWC / 
supercap modules to be falsely reported bad.  The system vendor acted like I 
was making this up and washed their hands of it, even when I provided them the 
HBA vendors’ artifacts and documents.

* There were two design flaws that could and did result in cache data loss when 
a system rebooted or lost power.  There was a field notice for this, which 
required harvesting serial numbers and checking each.  The affected range of 
serials was quite a bit larger than what the validation tool admitted.  I had 
to manage the replacement of 302+ of these in production use, each needing 
engineer time to manage Ceph, to do the hands-on work, and to hassle with RMA 
paperwork.

* There was a firmware / utility design flaw that caused the HDD’s onboard 
volatile write cache to be silently turned on, despite an HBA config dump 
showing a setting that should have left it off.  Again data was lost when a 
node crashed hard or lost power.

* There was another firmware flaw that prevented booting if there was pinned / 
preserved cache data after a reboot / power loss if a drive failed or was 
yanked.  The HBA’s option ROM utility would block booting and wait for input on 
the console.  One could get in and tell it to discard that cache, but it would 
not actually do so, instead looping back to the same screen.  The only way to 
get the system to boot again was to replace and RMA the HBA.

* The VD layer lessened the usefulness of iostat data.  It also complicated OSD 
deployment / removal / replacement.  A smartctl hack to access SMART attributes 
below the VD layer would work on some systems but not others.

* The HBA model in question would work normally with a certain CPU generation, 
but not with slightly newer servers with the next CPU generation.  They would 
randomly, on roughly one boot out of five, negotiate PCIe gen3 which they 
weren’t capable of handling properly, and would silently run at about 20% of 
normal speed.  Granted this isn’t necessarily specific to an IR HBA.



Add it all up, and my assertion is that the money, time, karma, and user impact 
you save from NOT dealing with a RAID HBA *more than pays for* using SSDs for 
OSDs instead.






Re: [ceph-users] Observation of bluestore db/wal performance

2019-07-21 Thread Anthony D'Atri
This may be somewhat controversial, so I’ll try to tread lightly.

Might we infer that your OSDs are on spinners?  And at 500 GB it would seem 
likely that they and the servers are old?  Please share hardware details and OS.

Having suffered an “enterprise” dogfood deployment in which I had to attempt to 
support thousands of RBD clients on spinners with colo journals (and a serious 
design flaw that some of you are familiar with), my knee-jerk thought is that 
they are antithetical to “heavy use of block storage”.  I understand though 
that in an education setting you may not have choices.

How highly utilized are your OSD drives?  Depending on your workload you 
*might* benefit with more PGs.  But since you describe your OSDs as being 500GB 
on average, I have to ask:  do their sizes vary considerably?  If so, larger 
OSDs are going to have more PGs (and thus receive more workload) than smaller.  
“ceph osd df” will show the number of PGs on each.  If you do have a 
significant disparity of drive sizes, careful enabling and tweaking of primary 
affinity can have measurable results in read performance.
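A minimal sketch of both checks — the OSD id and weight are placeholders, and 
very old releases may need "mon osd allow primary affinity = true" first:

ceph osd df tree                       # PGs and fillage per OSD
ceph osd primary-affinity osd.12 0.5   # de-prioritize a large/slow OSD for primary reads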

Is the number of PGs a power of 2?  If not, some of your PGs will be much 
larger than others.  Do you have OSD fillage reasonably well balanced?  If 
“ceph osd df” shows a wide variance, this can also hamper performance as the 
workload will not be spread evenly.

With all due respect to those who have tighter constraints than I enjoy in my 
current corporate setting, heavy RBD usage on spinners can be Sisyphean.  
Granted I’ve never run with a cache tier myself, or with separate WAL/DB 
devices.  In a corporate setting the additional cost of SSD OSDs can easily be 
balanced by reduced administrative hassle and user experience.  If that isn’t 
an option for you anytime soon, then by all means I’d stick with the cache 
tier, and maybe with Luminous indefinitely.  




Re: [ceph-users] clock skew

2019-04-26 Thread Anthony D'Atri
> @Janne: i will checkout/implement the peer config per your suggestion. 
> However what confuses us is that chrony thinks the clocks match, and 
> only ceph feels it doesn't. So we are not sure if the peer config will 
> actually help in this situation. But time will tell.

Ar ar.

Chrony thinks that the clocks match *what*, though?  That each system matches 
the public pool against which it’s synced?

Something I’ve noticed, especially when using the public pool in Asia, is that 
DNS rotation results in the pool FQDNs resolving differently multiple times a 
day.  And that the quality of those servers varies considerably.  Naturally the 
ones in that pool that I set up a few years ago are spot-on, but I digress ;)

Consider this scenario:

Ceph mon A resolves the pool FQDN to serverX, which is 10ms slow
Ceph mon B resolves the pool FQDN to serverY, which is 20ms fast with lots of 
jitter

That can get you a 30ms spread right there.  This is the benefit of having the 
mons peer with each other as well as with upstream servers of varying 
stratum/quality — worst case, they will select one of their own to sync with.

With iburst and modern client polling backoff, there usually isn’t much reason 
to not configure a bunch of peers.  Multiple Public/vendor pool FQDNs are 
reasonable to include, but I also like to hardcode in a few known-good public 
peers as well, even one in a different region if necessary.  Have your systems 
peer against each other too.  
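A minimal chrony.conf sketch along those lines (hostnames are placeholders):

pool 0.pool.ntp.org iburst maxsources 3
server ntp1.example.net iburst    # a known-good public server
peer mon1.example.net             # the other mons
peer mon2.example.net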

Depending on the size of your operation, consider your own timeserver to deploy 
on-prem, though antenna placement can be a hassle.

This is horribly non-enterprise, but I also suggest picking up one of these:

https://www.netburner.com/products/network-time-server/pk70-ex-ntp-network-time-server/

It’s cheap and it can’t handle tens of thousands of clients, but it doesn’t 
have to.  Stick it in an office window and add it to your peer lists.  If you 
have a larger number of clients, have your internal NTP servers configure it 
(as well as each other, K-complete).  If you don’t, include it in their local 
peer constellation.  Best case you have an excellent low-stratum source for 
your systems for cheap.  Worst case you are no worse off than you were before.

Now, whether this situation is what you’re seeing I can’t say without more 
info, but it is at least plausible.




Re: [ceph-users] rgw, nss: dropping the legacy PKI token support in RadosGW (removed in OpenStack Ocata)

2019-04-19 Thread Anthony D'Atri


I've been away from OpenStack for a couple of years now, so this may have 
changed.  But back around the Icehouse release, at least, upgrading between 
OpenStack releases was a major undertaking, so backing an older OpenStack with 
newer Ceph seems like it might be more common than one might think.

Which is not to argue for or against dropping PKI in Ceph, but if it's going to 
be done, please call that out early in the release notes to avoid rude 
awakenings.


> [Adding ceph-users for better usability]
> 
> On Fri, 19 Apr 2019, Radoslaw Zarzynski wrote:
>> Hello,
>> 
>> RadosGW can use OpenStack Keystone as one of its authentication
>> backends. Keystone in turn had been offering many token variants
>> over the time with PKI/PKIz being one of them. Unfortunately,
>> this specific type had many flaws (like explosion in size of HTTP
>> header) and has been dropped from Keystone in August 2016 [1].
>> By "dropping" I don't mean just "deprecating". PKI tokens have
>> been physically eradicated from Keystone's code base not leaving
>> documentation behind. This happened in OpenStack Ocata.
>> 
>> Intuitively I don't expect that brand new Ceph is deployed with
>> an ancient OpenStack release. Similarly, upgrading Ceph while
>> keeping very old OpenStack seems quite improbable.
> 
> This sounds reasonable to me.  If someone is running an old OpenStack, 
> they should be able to defer their Ceph upgrade until OpenStack is 
> upgraded... or at least transition off the old keystone variant?
> 
> sage
> 
>> If so, we may consider dropping PKI token support in further
>> releases. What makes me perceive this idea as attractive is:
>> 1) significant clean-up in RGW. We could remove a lot of
>> complexity including the entire revocation machinery with
>> its dedicated thread.
>> 2) Killing the NSS dependency. After moving the AWS-like
>> crypto services of RGW to OpenSSL, the CMS utilized by PKI
>> token support is the library sole's user.
>> I'm not saying it's a blocker for NSS removal. Likely we could
>> reimplement the stuff on top of OpenSSL as well.
>> All I'm worrying about is this can be futile effort bringing
>> more problems/confusion than benefits. For instance, instead
>> of just dropping the "nss_db_path" config option, we would
>> need to replace it with counterpart for OpenSSL or take care
>> of differences in certificate formats between the libraries.
>> 
>> I can see benefits of the removal. However, the actual cost
>> is mysterious to me. Is the feature useful?
>> 
>> Regards,
>> Radek
>> 
>> [1]: 
>> https://github.com/openstack/keystone/commit/8a66ef635400083fa426c0daf477038967785caf
>> 
>> 



Re: [ceph-users] Ceph block storage cluster limitations

2019-03-30 Thread Anthony D'Atri
> Hello,
> 
> I wanted to know if there are any max limitations on
> 
> - Max number of Ceph data nodes
> - Max number of OSDs per data node
> - Global max on number of OSDs
> - Any limitations on the size of each drive managed by OSD?
> - Any limitation on number of client nodes?
> - Any limitation on maximum number of RBD volumes that can be created?

I don’t think there any *architectural* limits, but there can be *practical* 
limits.  There are a lot of variables and everyone has a unique situation, but 
some thoughts:

> Max number of Ceph data nodes

May be limited at some extreme by networking.  Don’t cheap out on your switches.

> - Max number of OSDs per data node

People have run at least 72.  Consider RAM required for a given set of drives, 
and that a single host/chassis isn’t a big percentage of your cluster.  I.e., 
don’t have a huge fault domain that will bite you later.  For a production 
cluster at scale I would suggest at least 12 OSD nodes, but this depends on 
lots of variables.  Conventional wisdom is 1GB RAM per 1TB of OSD; in practice 
for a large cluster I would favor somewhat more.  A cluster with, say, 3 nodes 
of 72 OSDs each is going to be in bad way when one fails.

> - Global max on number of OSDs

A cluster with at least 10,800 OSDs has existed.

https://indico.cern.ch/event/542464/contributions/2202295/attachments/1289543/1921810/cephday-dan.pdf
https://indico.cern.ch/event/649159/contributions/2761965/attachments/1544385/2423339/hroussea-storage-at-CERN.pdf

The larger a cluster becomes, the more careful attention must be paid to 
topology and tuning.

> Also, any advise on using NVMes for OSD drives?

They rock.  Evaluate your servers carefully:
* Some may route PCI through a multi-mode SAS/SATA HBA
* Watch for PCI bridges or multiplexing
* Pinning, minimize data over QPI links
* Faster vs more cores can squeeze out more performance 

AMD Epyc single-socket systems may be very interesting for NVMe OSD nodes.
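A few quick, hedged ways to check those points on a candidate node:

lspci -tv              # look for bridges/expanders between the CPU and the NVMe devices
lstopo                 # hwloc: which socket/NUMA node each drive and NIC hangs off
numactl --hardware     # NUMA layout, for pinning decisions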

> What is the known maximum cluster size that Ceph RBD has been deployed to?

See above.


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-24 Thread Anthony D'Atri

> Date: Fri, 22 Feb 2019 16:26:34 -0800
> From: solarflow99 
> 
> 
> Aren't you undersized at only 30GB?  I thought you should have 4% of your
> OSDs

The 4% guidance is new.  Until relatively recently the oft-suggested and 
default value was 1%.

> From: "Vitaliy Filippov" 
> Numbers are easy to calculate from RocksDB parameters, however I also  
> don't understand why it's 3 -> 30 -> 300...
> 
> Default memtables are 256 MB, there are 4 of them, so L0 should be 1 GB,  
> L1 should be 10 GB, and L2 should be 100 GB?

I’m very curious as well; one would think that in practice the size and usage 
of the OSD would be factors, as the docs imply.

This is an area where we could really use more concrete guidance.  Clusters 
especially using HDDs are often doing so for $/TB reasons.  Economics and 
available slots are constraints on how much faster WAL+DB storage can be 
provisioned.
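A hedged back-of-the-envelope restatement of the stairstep, using the defaults 
quoted above:

# L0 ~ 4 x 256 MB ~ 1 GB, L1 ~ 10 GB, L2 ~ 100 GB (multiplier of 10)
# a level only helps if it fits entirely on the fast device, so the commonly
# cited useful block.db sizes land near ~3 GB, ~30 GB, ~300 GB (plus a GB or
# two for the WAL); capacity in between goes largely unused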



Re: [ceph-users] Ceph cluster stability

2019-02-22 Thread Anthony D'Atri

Did we start recommending that production mons run on a VM?  I'd be very 
hesitant to do that, though probably some folks do.

I can say for sure that in the past (Firefly) I experienced outages related to 
mons running on HDDs.  That was a cluster of 450 HDD OSDs with colo journals 
and hundreds of RBD clients.  Something obscure about running out of "global 
IDs" and not being able to create new ones fast enough.  We had to work around 
with a combo of lease settings on the mons and clients, though with Hammer and 
later I would not expect that exact situation to arise.  Still it left me 
paranoid about mon DBs and HDDs. 

-- aad


> 
> But ceph recommendation is to use VM (not even the  HW node
> recommended). will try to change the mon disk as SSD and HW node.
> 
> On Fri, Feb 22, 2019 at 5:25 PM Darius Kasparavičius  wrote:
>> 
>> If your using hdd for monitor servers. Check their load. It might be
>> the issue there.
>> 
>> On Fri, Feb 22, 2019 at 1:50 PM M Ranga Swami Reddy
>>  wrote:
>>> 
>>> ceph-mon disk with 500G with HDD (not journals/SSDs).  Yes, mon use
>>> folder on FS on a disk
>>> 
>>> On Fri, Feb 22, 2019 at 5:13 PM David Turner  wrote:
 
 Mon disks don't have journals, they're just a folder on a filesystem on a 
 disk.
 
 On Fri, Feb 22, 2019, 6:40 AM M Ranga Swami Reddy  
 wrote:
> 
> ceph mons looks fine during the recovery.  Using  HDD with SSD
> journals. with recommeded CPU and RAM numbers.
> 
> On Fri, Feb 22, 2019 at 4:40 PM David Turner  
> wrote:
>> 
>> What about the system stats on your mons during recovery? If they are 
>> having a hard time keeping up with requests during a recovery, I could 
>> see that impacting client io. What disks are they running on? CPU? Etc.
>> 
>> On Fri, Feb 22, 2019, 6:01 AM M Ranga Swami Reddy  
>> wrote:
>>> 
>>> Debug setting defaults are using..like 1/5 and 0/5 for almost..
>>> Shall I try with 0 for all debug settings?
>>> 
>>> On Wed, Feb 20, 2019 at 9:17 PM Darius Kasparavičius  
>>> wrote:
 
 Hello,
 
 
 Check your CPU usage when you are doing those kind of operations. We
 had a similar issue where our CPU monitoring was reporting fine < 40%
 usage, but our load on the nodes was high mid 60-80. If it's possible
 try disabling ht and see the actual cpu usage.
 If you are hitting CPU limits you can try disabling crc on messages.
 ms_nocrc
 ms_crc_data
 ms_crc_header
 
 And setting all your debug messages to 0.
 If you haven't done you can also lower your recovery settings a little.
 osd recovery max active
 osd max backfills
 
 You can also lower your file store threads.
 filestore op threads
 
 
 If you can also switch to bluestore from filestore. This will also
 lower your CPU usage. I'm not sure that this is bluestore that does
 it, but I'm seeing lower cpu usage when moving to bluestore + rocksdb
 compared to filestore + leveldb .
 
 
 On Wed, Feb 20, 2019 at 4:27 PM M Ranga Swami Reddy
  wrote:
> 
> Thats expected from Ceph by design. But in our case, we are using all
> recommendation like rack failure domain, replication n/w,etc, still
> face client IO performance issues during one OSD down..
> 
> On Tue, Feb 19, 2019 at 10:56 PM David Turner  
> wrote:
>> 
>> With a RACK failure domain, you should be able to have an entire 
>> rack powered down without noticing any major impact on the clients.  
>> I regularly take down OSDs and nodes for maintenance and upgrades 
>> without seeing any problems with client IO.
>> 
>> On Tue, Feb 12, 2019 at 5:01 AM M Ranga Swami Reddy 
>>  wrote:
>>> 
>>> Hello - I have a couple of questions on ceph cluster stability, even
>>> we follow all recommendations as below:
>>> - Having separate replication n/w and data n/w
>>> - RACK is the failure domain
>>> - Using SSDs for journals (1:4ratio)
>>> 
>>> Q1 - If one OSD down, cluster IO down drastically and customer Apps 
>>> impacted.
>>> Q2 - what is stability ratio, like with above, is ceph cluster
>>> workable condition, if one osd down or one node down,etc.
>>> 
>>> Thanks
>>> Swami
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mon_data_size_warn limits for large cluster

2019-02-18 Thread Anthony D'Atri
On older releases, at least, inflated DBs correlated with miserable recovery 
performance and lots of slow requests.  The  DB and OSDs were also on HDD FWIW. 
  A single drive failure would result in substantial RBD impact.  
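For reference, checking a mon store and (if needed) compacting it by hand looks 
roughly like this — though as Dan notes below, a rolling mon restart is usually 
all it takes.  The path is the default layout and the mon id is a placeholder:

du -sh /var/lib/ceph/mon/*/store.db   # current store size
ceph tell mon.a compact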

> On Feb 18, 2019, at 3:28 AM, Dan van der Ster  wrote:
> 
> Not really.
> 
> You should just restart your mons though -- if done one at a time it
> has zero impact on your clients.
> 
> -- dan
> 
> 
> On Mon, Feb 18, 2019 at 12:11 PM M Ranga Swami Reddy
>  wrote:
>> 
>> Hi Sage - If the mon data increases, is this impacts the ceph cluster
>> performance (ie on ceph osd bench, etc)?
>> 
>> On Fri, Feb 15, 2019 at 3:13 PM M Ranga Swami Reddy
>>  wrote:
>>> 
>>> today I again hit the warn with 30G also...
>>> 
 On Thu, Feb 14, 2019 at 7:39 PM Sage Weil  wrote:
 
> On Thu, 7 Feb 2019, Dan van der Ster wrote:
> On Thu, Feb 7, 2019 at 12:17 PM M Ranga Swami Reddy
>  wrote:
>> 
>> Hi Dan,
>>> During backfilling scenarios, the mons keep old maps and grow quite
>>> quickly. So if you have balancing, pg splitting, etc. ongoing for
>>> awhile, the mon stores will eventually trigger that 15GB alarm.
>>> But the intended behavior is that once the PGs are all active+clean,
>>> the old maps should be trimmed and the disk space freed.
>> 
>> old maps not trimmed after cluster reached to "all+clean" state for all 
>> PGs.
>> Is there (known) bug here?
>> As the size of dB showing > 15G, do I need to run the compact commands
>> to do the trimming?
> 
> Compaction isn't necessary -- you should only need to restart all
> peon's then the leader. A few minutes later the db's should start
> trimming.
 
 The next time someone sees this behavior, can you please
 
 - enable debug_mon = 20 on all mons (*before* restarting)
   ceph tell mon.* injectargs '--debug-mon 20'
 - wait for 10 minutes or so to generate some logs
 - add 'debug mon = 20' to ceph.conf (on mons only)
 - restart the monitors
 - wait for them to start trimming
 - remove 'debug mon = 20' from ceph.conf (on mons only)
 - tar up the log files, ceph-post-file them, and share them with ticket
 http://tracker.ceph.com/issues/38322
 
 Thanks!
 sage
 
 
 
 
> -- dan
> 
> 
>> 
>> Thanks
>> Swami
>> 
>>> On Wed, Feb 6, 2019 at 6:24 PM Dan van der Ster  
>>> wrote:
>>> 
>>> Hi,
>>> 
>>> With HEALTH_OK a mon data dir should be under 2GB for even such a large 
>>> cluster.
>>> 
>>> During backfilling scenarios, the mons keep old maps and grow quite
>>> quickly. So if you have balancing, pg splitting, etc. ongoing for
>>> awhile, the mon stores will eventually trigger that 15GB alarm.
>>> But the intended behavior is that once the PGs are all active+clean,
>>> the old maps should be trimmed and the disk space freed.
>>> 
>>> However, several people have noted that (at least in luminous
>>> releases) the old maps are not trimmed until after HEALTH_OK *and* all
>>> mons are restarted. This ticket seems related:
>>> http://tracker.ceph.com/issues/37875
>>> 
>>> (Over here we're restarting mons every ~2-3 weeks, resulting in the
>>> mon stores dropping from >15GB to ~700MB each time).
>>> 
>>> -- Dan
>>> 
>>> 
 On Wed, Feb 6, 2019 at 1:26 PM Sage Weil  wrote:
 
 Hi Swami
 
 The limit is somewhat arbitrary, based on cluster sizes we had seen 
 when
 we picked it.  In your case it should be perfectly safe to increase it.
 
 sage
 
 
> On Wed, 6 Feb 2019, M Ranga Swami Reddy wrote:
> 
> Hello -  Are the any limits for mon_data_size for cluster with 2PB
> (with 2000+ OSDs)?
> 
> Currently it set as 15G. What is logic behind this? Can we increase
> when we get the mon_data_size_warn messages?
> 
> I am getting the mon_data_size_warn message even though there a ample
> of free space on the disk (around 300G free disk)
> 
> Earlier thread on the same discusion:
> https://www.spinics.net/lists/ceph-users/msg42456.html
> 
> Thanks
> Swami
> 
> 
> 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 


Re: [ceph-users] Questions about using existing HW for PoC cluster

2019-01-27 Thread Anthony D'Atri
> Been reading "Learning Ceph - Second Edition”

An outstanding book, I must say ;)

> So can I get by with using a single SATA SSD (size?) per server for RocksDB / 
> WAL if I'm using Bluestore?

Depends on the rest of your setup and use-case, but I think this would be a 
bottleneck.  Some thoughts:

* You wrote that your servers have 1x 240GB SATA SSD that has the OS, and 8x 
2TB SATA OSD drives.

** Sharing the OS with journal/metadata could lead to contention between the two
** Since the OS has been doing who-knows-what with that drive, check the 
lifetime used/remaining with `smartctl -a` (see the sketch after this list).
** If they’ve been significantly consumed, their lifetime with the load Ceph 
will present will be limited.
** SSDs selected for OS/boot drives often have relatively low durability (DWPD) 
and may suffer performance cliffing when given a steady load.  Look up the 
specs on your model
** 8 OSDs sharing a single SSD for metadata is a very large failure domain.  
If/when you lose that SSD, you lose all 8 OSDs and the host itself.  You would 
want to set the subtree limit to “host”, and not fill the OSDs past, say, 60% 
so that you’d have room to backfill in case of a failure not caught by the 
subtree limit
** 8 HDD OSDs sharing a single SATA SSD for metadata will be bottlenecked 
unless your workload is substantially reads.

* Single SATA HDD on the mons

** When it fails, you lose the mon
** I have personally seen issues due to HDDs not handling peak demands, 
resulting in an outage
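A hedged sketch of the wear check mentioned in the list above — SMART attribute 
names vary by vendor:

smartctl -a /dev/sda | egrep -i 'wear|percent|written'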

The gear you list is fairly old and underpowered, but sure you could use it *as 
a PoC*.  For a production deployment you’d want different hardware.

> - Is putting the journal on a partition of the SATA drives a real I/O killer? 
> (this is how my Proxmox boxes are set up)

With Filestore and HDDs, absolutely.  Even worse if you were to use EC.  There 
may be some coalescing of ops, but you’re still going to get a *lot* of long 
seeks, and spinners can only do a certain number of IOPs.  I think in the book 
I described personal experience with such a setup that even tickled a design 
flaw on the part of a certain HDD vendor.  Eventually I was permitted to get 
journal devices (this was pre-BlueStore GA), which were PCIe NVMe.  Write 
performance doubled.  Then we hit a race condition / timing issue in nvme.ko, 
but I digress...

When using SATA *SSD*s for OSDs, you have no seeks of course, and colocating 
the journals/metadata is more viable.

> - If YES to the above, then is a SATA SSD acceptable for journal device, or 
> should I definitely consider PCIe SSD? (I'd have to limit to one per server, 
> which I know isn't optimal, but price prevents otherwise…)

Optanes for these systems would be overkill.  If you would plan to have the PoC 
cluster run any appreciable load for any length of time, I might suggest 
instead adding 2x SATA SSDs per, so you could map 4x OSDs to each.  These would 
not need to be large:  upstream party line would have you allocate 80GB on 
each, though depending on your use-case you might well do fine with less, 2x 
240GB class or even 2x 120GB class should suffice for PoC service.  For 
production I would advise “enterprise” class drives with at least 1 DWPD 
durability — recently we’ve seen a certain vendor weasel their durability by 
computing it incorrectly.

Depending on what you expect out of your PoC, and especially assuming you use 
BlueStore, you might get away with colocation, but do not expect performance 
that can be extrapolated for a production deployment.  

With the NICs you have, you could keep it simple and skip a replication/back 
end network altogether, or you could bond the LOM ports and split them.  
Whatever’s simplest with the network infrastructure you have.  For production 
you’d most likely want LACP bonded NICs, but if the network tech is modern, 
skipping the replication network may be very feasible,  But I’m ahead of your 
context …

HTH
— aad



Re: [ceph-users] Ceph Nautilus Release T-shirt Design

2019-01-17 Thread Anthony D'Atri
>> Lenz has provided this image that is currently being used for the 404
>> page of the dashboard:
>> 
>> https://github.com/ceph/ceph/blob/master/src/pybind/mgr/dashboard/frontend/src/assets/1280px-Nautilus_Octopus.jpg
> 
> Nautilus *shells* are somewhat iconic/well known/distinctive.  Maybe a
> variant of https://en.wikipedia.org/wiki/File:Nautilus_Section_cut.jpg
> would be interesting on a t-shirt?

I agree with Tim.  T shirts with photos can be tricky, it’s easy for them to 
look cheesy and they don’t age well.

In the same vein, something with a lower bit depth that isn’t a cross-section 
might be slightly more recognizable:

https://www.vectorstock.com/royalty-free-vector/nautilus-vector-2806848



Re: [ceph-users] Questions re mon_osd_cache_size increase

2019-01-07 Thread Anthony D'Atri
Thanks, Greg.  This is as I suspected. Ceph is full of subtleties and I wanted 
to be sure.

-- aad


> 
> The osd_map_cache_size controls the OSD’s cache of maps; the change in 13.2.3 
> is to the default for the monitors’.
> On Mon, Jan 7, 2019 at 8:24 AM Anthony D'Atri wrote:
> 
> 
> > * The default memory utilization for the mons has been increased
> >  somewhat.  Rocksdb now uses 512 MB of RAM by default, which should
> >  be sufficient for small to medium-sized clusters; large clusters
> >  should tune this up.  Also, the `mon_osd_cache_size` has been
> >  increase from 10 OSDMaps to 500, which will translate to an
> >  additional 500 MB to 1 GB of RAM for large clusters, and much less
> >  for small clusters.
> 
> 
> Just so I don't perseverate on this:   mon_osd_cache_size is a [mon] setting for 
> ceph-mon only?  Does it relate to osd_map_cache_size?  ISTR that in the past 
> the latter defaulted to 500; I had seen a presentation (I think from Dan) at 
> an OpenStack Summit advising its decrease and it defaults to 50 now.  
> 
> I like to be very clear about where additional memory is needed, especially 
> for dense systems.
> 
> -- Anthony
> 



[ceph-users] Questions re mon_osd_cache_size increase

2019-01-07 Thread Anthony D'Atri



> * The default memory utilization for the mons has been increased
>  somewhat.  Rocksdb now uses 512 MB of RAM by default, which should
>  be sufficient for small to medium-sized clusters; large clusters
>  should tune this up.  Also, the `mon_osd_cache_size` has been
>  increase from 10 OSDMaps to 500, which will translate to an
>  additional 500 MB to 1 GB of RAM for large clusters, and much less
>  for small clusters.


Just so I don't perseverate on this:   mon_osd_cache_size is a [mon] setting for 
ceph-mon only?  Does it relate to osd_map_cache_size?  ISTR that in the past 
the latter defaulted to 500; I had seen a presentation (I think from Dan) at an 
OpenStack Summit advising its decrease and it defaults to 50 now.  

I like to be very clear about where additional memory is needed, especially for 
dense systems.
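One way to pin down which daemon honors which option is to ask the daemons 
directly over the admin socket — a sketch, assuming the mon id matches the 
short hostname:

ceph daemon mon.$(hostname -s) config get mon_osd_cache_size
ceph daemon osd.0 config get osd_map_cache_size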

-- Anthony



Re: [ceph-users] Bluestore nvme DB/WAL size

2018-12-21 Thread Anthony D'Atri
> It'll cause problems if yours the only one NVMe drive will die - you'll lost 
> all the DB partitions and all the OSDs are going to be failed


The severity of this depends a lot on the size of the cluster.  If there are 
only, say, 4 nodes total, for sure the loss of a quarter of the OSDs will be 
somewhere between painful and fatal.  Especially if the subtree limit does not 
forestall rebalancing, and if EC is being used vs replication.  From a pain 
angle, though, this is no worse than if the server itself smokes.

It's easy to say "don't do that" but sometimes one doesn't have a choice:

* Unit economics can confound provisioning of larger/more external metadata 
devices.  I'm sure Vlad isn't using spinners because he hates SSDs.

* Devices have to go somewhere.  It's not uncommon to have a server using 2 
PCIe slots for NICs (1) and another for an HBA, leaving as few as 1 or 0 free.  
Sometimes the potential for a second PCI riser is canceled by the need to 
provision a rear drive cage for OS/boot drives to maximize front-panel bay 
availability.

* Cannibalizing one or more front drive bays for metadata drives can be 
problematic:
- Usable cluster capacity is decreased, along with unit economics
- Dogfood or draconian corporate policy (Herbert! Herbert!) can prohibit this.  
I've personally in the past been prohibited from the obvious choice to use a 
simple open-market LFF to SFF adapter because it wasn't officially "supported" 
and would use components without a corporate SKU.

The 4% guidance was 1% until not all that long ago.  Guidance on calculating 
adequate sizing based on application and workload would be NTH.  I've been told 
that an object storage (RGW) use case can readily get away with less because 
L2/L3/etc are both rarely accessed and the first to be overflowed onto slower 
storage.  And that block (RBD) workloads have different access patterns that 
are more impacted by overflow of higher levels.  As RBD pools increasingly are 
deployed on SSD/NVMe devices, the case for colocating their metadata is strong, 
and obviates having to worry about sizing before deployment.
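As an after-the-fact sanity check, the BlueFS counters show whether a given 
OSD's DB has already overflowed onto the slow device — a hedged sketch:

ceph daemon osd.0 perf dump | egrep 'db_total_bytes|db_used_bytes|slow_used_bytes'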


(1) One of many reasons to seriously consider not having a separate replication 
network







[ceph-users] warning: fast-diff map is invalid operation may be slow; object map invalid

2018-10-15 Thread Anthony D'Atri

We turned on all the RBD v2 features while running Jewel; since then all 
clusters have been updated to Luminous 12.2.2 and additional clusters added 
that have never run Jewel.

Today I find that a few percent of volumes in each cluster have issues, 
examples below.

I'm concerned that these issues may present problems when using rbd-mirror to 
move volumes between clusters.  Many instances involve heads or nodes of 
snapshot trees; it's possible but unverified that those not currently 
snap-related may have been in the past.

In the Jewel days we retroactively applied fast-diff, object-map to existing 
volumes but did not bother with tombstones.

Any thoughts on

1) How this happens?
2) Is "rbd object-map rebuild" always safe, especially on volumes that are in 
active use?
3) The disturbing messages spewed by `rbd ls` -- related or not?
4) Would this as I fear confound successful rbd-mirror migration?

I've found 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-August/012137.html 
 
that *seems* to indicate that a live rebuild is safe, but I'm still uncertain 
about the root cause, and if it's still happening.  I've never ventured into 
this dark corner before so I'm being careful.
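For reference, the rebuild itself is per image and per snapshot — a sketch with 
placeholder names:

rbd object-map rebuild rbd/myvolume
rbd snap ls rbd/myvolume                   # then repeat for each snapshot:
rbd object-map rebuild rbd/myvolume@snap1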

All clients are QEMU/libvirt; most are 12.2.2 but there are some lingering 
Jewel, most likely 10.2.6 or perhaps 10.2.3.  Eg:


# ceph features
{
"mon": {
"group": {
"features": "0x1ffddff8eea4fffb",
"release": "luminous",
"num": 5
}
},
"osd": {
"group": {
"features": "0x1ffddff8eea4fffb",
"release": "luminous",
"num": 983
}
},
"client": {
"group": {
"features": "0x7fddff8ee84bffb",
"release": "jewel",
"num": 15
},
"group": {
"features": "0x1ffddff8eea4fffb",
"release": "luminous",
"num": 3352
}
}
}


# rbd ls  -l |wc
2018-10-05 20:55:17.397288 7f976cff9700 -1 librbd::image::RefreshParentRequest: 
failed to locate snapshot: Snapshot with this id not found
2018-10-05 20:55:17.397334 7f976cff9700 -1 librbd::image::RefreshRequest: 
failed to refresh parent image: (2) No such file or directory
2018-10-05 20:55:17.397397 7f976cff9700 -1 librbd::image::OpenRequest: failed 
to refresh image: (2) No such file or directory
2018-10-05 20:55:17.398025 7f976cff9700 -1 librbd::io::AioCompletion: 
0x7f978667b570 fail: (2) No such file or directory
2018-10-05 20:55:17.398075 7f976cff9700 -1 librbd::image::RefreshParentRequest: 
failed to locate snapshot: Snapshot with this id not found
2018-10-05 20:55:17.398079 7f976cff9700 -1 librbd::image::RefreshRequest: 
failed to refresh parent image: (2) No such file or directory
2018-10-05 20:55:17.398096 7f976cff9700 -1 librbd::image::OpenRequest: failed 
to refresh image: (2) No such file or directory
2018-10-05 20:55:17.398659 7f976cff9700 -1 librbd::io::AioCompletion: 
0x7f978660c240 fail: (2) No such file or directory
2018-10-05 20:55:30.416174 7f976cff9700 -1 librbd::io::AioCompletion: 
0x7f9786cd5ee0 fail: (2) No such file or directory
2018-10-05 20:55:34.083188 7f976d7fa700 -1 librbd::object_map::RefreshRequest: 
failed to load object map: rbd_object_map.b18d634146825.2d8f
2018-10-05 20:55:34.084101 7f976cff9700 -1 
librbd::object_map::InvalidateRequest: 0x7f97544d11e0 should_complete: r=0
2018-10-05 20:55:38.597014 7f976d7fa700 -1 librbd::image::OpenRequest: failed 
to retreive immutable metadata: (2) No such file or directory
2018-10-05 20:55:38.597109 7f976cff9700 -1 librbd::io::AioCompletion: 
0x7f9786d3a7c0 fail: (2) No such file or directory
2018-10-05 20:55:51.584101 7f976d7fa700 -1 librbd::object_map::RefreshRequest: 
failed to load object map: rbd_object_map.c447c403109b2.6a04
2018-10-05 20:55:51.592616 7f976cff9700 -1 
librbd::object_map::InvalidateRequest: 0x7f975409fee0 should_complete: r=0
2018-10-05 20:55:59.414229 7f976d7fa700 -1 librbd::image::OpenRequest: failed 
to retreive immutable metadata: (2) No such file or directory
2018-10-05 20:55:59.414321 7f976cff9700 -1 librbd::io::AioCompletion: 
0x7f9786df0760 fail: (2) No such file or directory
2018-10-05 20:56:09.029179 7f976d7fa700 -1 librbd::object_map::RefreshRequest: 
failed to load object map: rbd_object_map.9b28e148b97af.6a09
2018-10-05 20:56:09.035212 7f976cff9700 -1 
librbd::object_map::InvalidateRequest: 0x7f9754644030 should_complete: r=0
2018-10-05 20:56:09.036087 7f976d7fa700 -1 librbd::object_map::RefreshRequest: 
failed to load object map: rbd_object_map.9b28e148b97af.6a0a
2018-10-05 20:56:09.042200 7f976cff9700 -1 
librbd::object_map::InvalidateRequest: 0x7f97541d2c10 should_complete: r=0
   6544   22993 1380784

# rbd du
warning: fast-diff map is invalid for 

Re: [ceph-users] Favorite SSD

2018-09-17 Thread Anthony D'Atri
> Micron 5200 line seems to not have a high endurance SKU like the 5100 line 
> sadly.


The 3.84TB 5200 PRO is rated at ~2.4 DWPD — you need higher than that?  I do 
find references to higher-durability ~5DWPD  5200 MAX models up to 1.9 TB.

Online resources on the 5200 product line don’t always agree — I think at first 
they had not intended to provide a 3.84TB model at all, but changed their minds 
along the way.

https://www.micron.com/~/media/documents/products/data-sheet/ssd/5200_ssd.pdf




Re: [ceph-users] Adding node efficient data move.

2018-09-11 Thread Anthony D'Atri
> When adding a node and I increment the crush weight like this. I have 
> the most efficient data transfer to the 4th node?
> 
> sudo -u ceph ceph osd crush reweight osd.23 1 
> sudo -u ceph ceph osd crush reweight osd.24 1 
> sudo -u ceph ceph osd crush reweight osd.25 1 
> sudo -u ceph ceph osd crush reweight osd.26 1 
> sudo -u ceph ceph osd crush reweight osd.27 1 
> sudo -u ceph ceph osd crush reweight osd.28 1 
> sudo -u ceph ceph osd crush reweight osd.29 1 
> 
> And then after recovery
> 
> sudo -u ceph ceph osd crush reweight osd.23 2

I'm not sure if you're asking for the most *efficient* way to add capacity, or 
the least *impactful*.

The most *efficient* would be to have the new OSDs start out at their full 
CRUSH weight.  This way data only has to move once.  However the overhead of 
that much movement can be quite significant, especially if I correctly read 
that you're expanding the size of the cluster by 33%.

What I prefer to do (on replicated clusters) is to use this script:

https://github.com/cernceph/ceph-scripts/blob/master/tools/ceph-gentle-reweight

I set the CRUSH weights to 0 then run the script like

ceph-gentle-reweight -o  -b 10 -d 0.01 -t 3.48169 -i 10 -r | tee 
-a /var/tmp/upweight.log

Note that I disable measure_latency() out of paranoia.  This is less 
*efficient* in that some data ends up being moved more than once, and the 
elapsed time to complete is longer, but has the advantage of less impact.  It 
also allows one to quickly stop data movement if a drive/HBA/server/network 
issue causes difficulties.  Small steps means that each completes quickly.

I also set

osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_op_priority = 1
osd_recovery_max_single_start = 1
osd_scrub_during_recovery = false

to additionally limit the impact of data movement on client operations.
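The same limits can be applied to running OSDs without a restart — a sketch 
using the Luminous-era syntax (newer releases have `ceph config set`):

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'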

YMMV. 




Re: [ceph-users] Exact scope of OSD heartbeating?

2018-07-18 Thread Anthony D'Atri
Thanks, Dan.  I thought so but wanted to verify.  I'll see if I can work up a 
doc PR to clarify this.

>> The documentation here:
>> 
>> http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/
>> 
>> says
>> 
>> "Each Ceph OSD Daemon checks the heartbeat of other Ceph OSD Daemons every 6 
>> seconds"
>> 
>> and
>> 
>> " If a neighboring Ceph OSD Daemon doesn’t show a heartbeat within a 20 
>> second grace period, the Ceph OSD Daemon may consider the neighboring Ceph 
>> OSD Daemon down and report it back to a Ceph Monitor,"
>> 
>> I've always thought that each OSD heartbeats with *every* other OSD, which 
>> of course means that total heartbeat traffic grows ~ quadratically.  However 
>> in extending test we've observed that the number of other OSDs that a 
>> subject heartbeat (heartbeated?) was < N, which has us wondering if perhaps 
>> only OSDs with which a given OSD shares are contacted -- or some other 
>> subset.
>> 
> 
> OSDs heartbeat with their peers, the set of osds with whom they share
> at least one PG.
> You can see the heartbeat peers (HB_PEERS) in ceph pg dump -- after
> the header "OSD_STAT USED  AVAIL TOTAL HB_PEERS..."
> 
> This is one of the nice features of the placement group concept --
> heartbeats and peering in general stays constant with the number of
> PGs per OSD, rather than scaling up with the total number of OSDs in a
> cluster.
> 
> Cheers, Dan
> 
> 
>> I plan to submit a doc fix for mon_osd_min_down_reporters and wanted to 
>> resolve this FUD first.
>> 
>> -- aad
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



[ceph-users] Exact scope of OSD heartbeating?

2018-07-17 Thread Anthony D'Atri
The documentation here:

http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/ 


says

"Each Ceph OSD Daemon checks the heartbeat of other Ceph OSD Daemons every 6 
seconds"

and

" If a neighboring Ceph OSD Daemon doesn’t show a heartbeat within a 20 second 
grace period, the Ceph OSD Daemon may consider the neighboring Ceph OSD Daemon 
down and report it back to a Ceph Monitor,"

I've always thought that each OSD heartbeats with *every* other OSD, which of 
course means that total heartbeat traffic grows ~ quadratically.  However, in 
extended testing we've observed that the number of other OSDs that a subject 
OSD heartbeats (heartbeated?) with was < N, which has us wondering if perhaps 
only OSDs with which a given OSD shares PGs are contacted -- or some other 
subset.

I plan to submit a doc fix for mon_osd_min_down_reporters and wanted to resolve 
this FUD first.

-- aad



Re: [ceph-users] ceph-deploy: recommended?

2018-04-06 Thread Anthony D'Atri
> I read a couple of versions ago that ceph-deploy was not recommended
> for production clusters. 

InkTank had sort of discouraged the use of ceph-deploy; in 2014 we used it only 
to deploy OSDs.

Some time later the message changed.




Re: [ceph-users] IO rate-limiting with Ceph RBD (and libvirt)

2018-03-22 Thread Anthony D'Atri
> FYI: I/O limiting in combination with OpenStack 10/12 + Ceph doesn’t work 
> properly. Bug: https://bugzilla.redhat.com/show_bug.cgi?id=1476830 
> 

That's an OpenStack bug, nothing to do with Ceph.  Nothing stops you from using 
virsh to throttle directly:

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_tuning_and_optimization_guide/sect-virtualization_tuning_optimization_guide-blockio-techniques
 


https://github.com/cernceph/ceph-scripts/blob/master/tools/virsh-throttle-rbd.py
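A minimal virsh sketch (domain and target device names are placeholders):

virsh blkdeviotune guest1 vdb --live --total-iops-sec 500 --total-bytes-sec 52428800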
 





Re: [ceph-users] Migrating to new pools

2018-02-22 Thread Anthony D'Atri
Cumulative followup to various insightful replies.

I wrote:

>>> No, it's not really possible currently and we have no plans to add
>>> such support since it would not be of any long-term value.
>> 
>> The long-term value would be the ability to migrate volumes from, say, a 
>> replicated pool to an EC pool without extended downtime.

Among the replies:

> That's why the Mimic release should offer a specific set of "rbd
> migration XYX" actions to perform minimal downtime migrations.

That'd be awesome, if it can properly handle client-side attachments.

> I was saying that hacking around rbd-mirror to add such a feature has
> limited long-term value given the release plan. Plus, in the immediate
> term you can use a kernel md RAID device or QEMU+librbd to perform the
> migration for you w/ minimal downtime (albeit with potentially more
> hands-on setup involved).

Great ideas if one controls both the back end and the guest OS.  In our case we 
can't go mucking around inside the guests.

> if you use qemu, it's also possible to use drive-mirror feature from qemu.
> (can mirror and migrate from 1 storage to another storage without downtime).


Interesting idea, and a qemu function I was not previously aware of.  Has some 
potential wrinkles though:

o Considerably more network traffic and I/O load to/from the hypervisor (or 
whatever you call the hosts where your guest VMs run)
o Scaling to thousands of volumes each potentially TBs in size
o Handling unattached volumes
o Co-ordination with / prevention of user attach/detach operations during the 
process

-- aad


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating to new pools

2018-02-21 Thread Anthony D'Atri
>> I was thinking we might be able to configure/hack rbd mirroring to mirror to
>> a pool on the same cluster but I gather from the OP and your post that this
>> is not really possible?
> 
> No, it's not really possible currently and we have no plans to add
> such support since it would not be of any long-term value.

The long-term value would be the ability to migrate volumes from, say, a 
replicated pool to an EC pool without extended downtime.





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sizing your MON storage with a large cluster

2018-02-09 Thread Anthony D'Atri
Thanks, Wido -- words to live by.

I had all kinds of problems with mon DBs not compacting under Firefly, really 
pointed out the benefit of having ample space on the mons -- and the necessity 
of having those DB's live on something faster than an LFF HDD.

I've had this happen when using ceph-gentle-reweight to slowly bring in a large 
population of new OSDs.  Breaking that into phases helps a bunch, or setting a 
large -i interval.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HW Raid vs. Multiple OSD

2017-11-13 Thread Anthony D'Atri
Oscar, a few thoughts:

o I think you might have some misunderstandings about how Ceph works.  Ceph is 
best deployed as a single cluster spanning multiple servers, generally at least 
3.  Is that your plan?  It sort of sounds as though you're thinking of Ceph 
managing only the drives local to each of your converged VDI hosts, like local 
RAID would.  Ceph doesn't work that way.  Well, technically it could but 
wouldn't be a great architecture.  You would want to have at least 3 servers, 
with all of the Ceph OSDs in a single cluster.

o Re RAID0:

> Then, may I understand that your advice is a RAID0 for each 4TB? For a
> balanced configuration...
> 
> 1 osd x 1 disk of 4TB
> 1 osd x 2 disks of 2TB
> 1 odd x 4 disks of 1 TB


For performance a greater number of smaller drives is generally going to be 
best.  VDI desktops are going to be fairly latency-sensitive and you'd really 
do best with SSDs.  All those desktops thrashing a small number of HDDs is not 
going to deliver tolerable performance.

Don't use RAID at all for the OSDs.  Even if you get hardware RAID HBAs, 
configure JBOD/passthrough mode so that OSDs are deployed directly on the 
drives.  This will minimize latency as well as manifold hassles that one adds 
when wrapping drives in HBA RAID volumes.

o Re CPU:

> The other question is considering having one OSDs vs 8 OSDs... 8 OSDs will
> consume more CPU than 1 OSD (RAID5) ?
> 
> As I want to share compute and osd in the same box, resources consumed by
> OSD can be a handicap.


If the CPU cycles used by Ceph are a problem, your architecture has IMHO bigger 
problems.  You need to design for a safety margin of RAM and CPU to accommodate 
spikes in usage, both by Ceph and by your desktops.  There is no way each of 
the systems you describe is going to have enough cycles for 100 desktops 
concurrently active.  You'd be allocating each of them only ~3GB of RAM -- I've 
not had to run MS Windows 10 but even with page sharing that seems awfully 
tight on RAM.

Since you mention ProLiant and 8 drives I'm going to assume you're targeting the 
DL360?  I suggest if possible considering the 10SFF models to get you more 
drive bays, ditching the optical drive.  If you can get rear bays to use to 
boot the OS from, that's better yet so you free up front panel drive bays for 
OSD use.  You want to maximize the number of drive bays available for OSD use, 
and if at all possible you want to avoid deploying the operating system's 
filesystems and OSDs on the same drives.

With the numbers you mention throughout the thread, it would seem as though you 
would end up with potentially as little as 80GB of usable space per virtual 
desktop - will that meet your needs?  One of the difficulties with converged 
architectures is that storage and compute don't necessarily scale at the same 
rate.  To that end I suggest considering 2U 25-drive-bay systems so that you 
have room to add more drives.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Anyone else having digest issues with Apple Mail?

2017-09-13 Thread Anthony D'Atri
For a couple of weeks now digests have been appearing to me off and on with a 
few sets of MIME headers and maybe 1-2 messages.  When I look at the raw text 
the whole digest is in there.

Screencap below.  Anyone else experiencing this?


https://www.evernote.com/l/AL2CMToOPiBIJYZgw9KzswqiBhHHoRIm6hA 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Vote re release cadence

2017-09-07 Thread Anthony D'Atri
One vote for:

* Drop the odd releases, and aim for a ~9 month cadence. This splits the 
difference between the current even/odd pattern we've been doing.

We've already been bit by gotchas with upgrades even between point releases, so 
I favor strategies that limit the number of upgrade paths in the hope that they 
will be more solid.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High iowait on OSD node

2017-07-27 Thread Anthony D'Atri
My first suspicion would be the HBA.  Are you using a RAID HBA?  If so I 
suggest checking the status of your BBU/FBWC and cache policy.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Crushmap from Rack aware to Node aware

2017-06-02 Thread Anthony D'Atri
All very true and worth considering, but I feel compelled to mention the 
strategy of setting mon_osd_down_out_subtree_limit carefully to prevent 
automatic rebalancing.
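Concretely, that's a ceph.conf knob; pick the subtree type that matches the
failure domain you care about ("host" below is just an example):

[mon]
# don't automatically mark OSDs out when an entire host (or anything larger)
# drops -- leave that call to a human
mon osd down out subtree limit = host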

*If* the loss of a failure domain is temporary, ie. something you can fix 
fairly quickly, it can be preferable to not start that avalanche of recovery, 
both to avoid contention with client workloads and also for the fillage factor 
that David describes.

If of course the loss of the failure domain can’t be corrected quickly, then 
one would still be in a quandary re whether to shift the capacity onto the 
surviving failure domains or take the risk of reduced redundancy while the 
problem is worked.

That said, I’ve seen situations where OSD’s in a failure domain weren’t 
reported down in close enough temporal proximity, and the subtree limit didn’t 
kick in.

In my current situation we’re already planning to exploit the half-rack 
strategy you describe for EC clusters; it improves the failure domain situation 
without monopolizing as many DC racks.

— aad 

> The problem with having 3 failure domains with replica 3 is that if you
> lose a complete failure domain, then you have nowhere for the 3rd replica
> to go.  If you have 4 failure domains with replica 3 and you lose an entire
> failure domain, then you over fill the remaining 3 failure domains and can
> only really use 55% of your cluster capacity.  If you have 5 failure
> domains, then you start normalizing and losing a failure domain doesn't
> impact as severely.  The more failure domains you get to, the less it
> affects you when you lose one.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Re-weight Entire Cluster?

2017-05-30 Thread Anthony D'Atri
OIC, thanks for providing the tree output.  From what you wrote originally it 
seemed plausible that you were mixing up the columns, which is not an uncommon 
thing to do.

If all of your OSD’s are the same size, and have a CRUSH weight of 1.0, then 
you have just the usual OSD fullness distribution problem.

If you have other OSD’s in the cluster that are the same size as these but have 
different CRUSH weights, then you do have a problem.  Is that the case?  Feel 
free to privately email me your entire ceph osd tree output if you like, to 
avoid spamming the list.

— aad

> Hi Anthony,
> 
> When the OSDs were added it appears they were added with a crush weight of 1 
> so I believe we need to change the weighting as we are getting a lot of very 
> full OSDs.
> 
> -21  20.0 host somehost
> 216   1.0 osd.216   up  1.0  1.0
> 217   1.0 osd.217   up  1.0  1.0
> 218   1.0 osd.218   up  1.0  1.0
> 219   1.0 osd.219   up  1.0  1.0
> 220   1.0 osd.220   up  1.0  1.0
> 221   1.0 osd.221   up  1.0  1.0
> 222   1.0 osd.222   up  1.0  1.0
> 223   1.0 osd.223   up  1.0  1.0
> 
> -Original Message-
> From: Anthony D'Atri <a...@dreamsnake.net>
> Date: Tuesday, May 30, 2017 at 1:10 PM
> To: ceph-users <ceph-users@lists.ceph.com>
> Cc: Cave Mike <mc...@uvic.ca>
> Subject: Re: [ceph-users] Re-weight Entire Cluster?
> 
> 
> 
>> It appears the current best practice is to weight each OSD according to its 
>> size (3.64 for 4TB drive, 7.45 for 8TB drive, etc).
> 
> OSD’s are created with those sorts of CRUSH weights by default, yes.  Which 
> is convenient, but it’s important to know that those weights are arbitrary, and 
> what really matters is how the weights of each OSD / host / rack compares to 
> its siblings.  They are relative weights, not absolute capacities.
> 
>> As it turns out, it was not configured this way at all; all of the OSDs are 
>> weighted at 1.
> 
> Are you perhaps confusing CRUSH weights with override weights?  In the below 
> example each OSD has a CRUSH weight of 3.48169, but the override reweight is 
> 1.000.  The override ranges from 0 to 1.  It is admittedly confusing to have 
> two different things called weight.  Ceph’s reweight-by-utilization eg. acts 
> by adjusting the override reweight and not touching the CRUSH weights. 
> 
> ID   WEIGHT    TYPE NAME               UP/DOWN  REWEIGHT  PRIMARY-AFFINITY
> -44  83.56055  host somehostname
> 936   3.48169      osd.936             up       1.0       1.0
> 937   3.48169      osd.937             up       1.0       1.0
> 938   3.48169      osd.938             up       1.0       1.0
> 939   3.48169      osd.939             up       1.0       1.0
> 940   3.48169      osd.940             up       1.0       1.0
> 941   3.48169      osd.941             up       1.0       1.0
> 
> If you see something similar, from “ceph osd tree”, then chances are that 
> there’s no point in changing anything since with CRUSH weights, all that 
> matters is how they compare across OSD’s/racks/hosts/etc..  So you could 
> double all of them just for grins, and nothing in how the cluster operates 
> would change.
> 
> — Anthony
> 
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Re-weight Entire Cluster?

2017-05-30 Thread Anthony D'Atri


> It appears the current best practice is to weight each OSD according to its 
> size (3.64 for 4TB drive, 7.45 for 8TB drive, etc).

OSD’s are created with those sorts of CRUSH weights by default, yes.  Which is 
convenient, but it’s important to know that those weights are arbitrary, and what 
really matters is how the weights of each OSD / host / rack compares to its 
siblings.  They are relative weights, not absolute capacities.

> As it turns out, it was not configured this way at all; all of the OSDs are 
> weighted at 1.

Are you perhaps confusing CRUSH weights with override weights?  In the below 
example each OSD has a CRUSH weight of 3.48169, but the override reweight is 
1.000.  The override ranges from 0 to 1.  It is admittedly confusing to have 
two different things called weight.  Ceph’s reweight-by-utilization eg. acts by 
adjusting the override reweight and not touching the CRUSH weights. 

ID   WEIGHT    TYPE NAME               UP/DOWN  REWEIGHT  PRIMARY-AFFINITY
-44  83.56055  host somehostname
936   3.48169      osd.936             up       1.0       1.0
937   3.48169      osd.937             up       1.0       1.0
938   3.48169      osd.938             up       1.0       1.0
939   3.48169      osd.939             up       1.0       1.0
940   3.48169      osd.940             up       1.0       1.0
941   3.48169      osd.941             up       1.0       1.0

If you see something similar, from “ceph osd tree”, then chances are that 
there’s no point in changing anything since with CRUSH weights, all that 
matters is how they compare across OSD’s/racks/hosts/etc..  So you could double 
all of them just for grins, and nothing in how the cluster operates would 
change.
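To make the distinction concrete (OSD id and values are made up):

ceph osd crush reweight osd.936 3.48169   # CRUSH weight: relative capacity
ceph osd reweight osd.936 0.95            # override reweight: the 0..1 knob that
                                          # reweight-by-utilization adjusts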

— Anthony

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sharing SSD journals and SSD drive choice

2017-04-26 Thread Anthony D'Atri
At a meeting with Intel folks a while back, they discussed the idea that future 
large devices — which we’re starting to now see — would achieve greater 
*effective* durability via a lower cost/GB that encourages the use of larger 
than needed devices.  Which is a sort of overprovisioning, just more visible to 
the end user than conventional behind-the-scenes overprovisioning.

With very dense OSD servers, larger but faster devices have a certain appeal 
for greater ratios, especially if PCI slots are limited.  The failure domain of 
course is still a consideration.


> How did you settle on the P3608 vs say the P3600 or P3700 for journals? And 
> also the 1.6T size? Seems overkill, unless its pulling double duty beyond OSD 
> journals.
> 
> Only improvement over the P3x00 is the move from x4 lanes to x8 lanes on the 
> PCIe bus, but the P3600/P3700 offer much more in terms of endurance, and at 
> lower prices compared to the P3608.
> How big are your journal sizes, or are you over provisioning to increase 
> endurance on the card?
> 
> It would seem the new P4800X will be a perfect journaling device with 
> >30DWPD, and even lower latency, even though it is “low” storage size, 375GB 
> would still hold 15 25GB journals, which seems excessively large.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Primary Affinity

2017-04-19 Thread Anthony D'Atri
Re ratio, I think you’re right.

Write performance depends for sure on what the journal devices are.  If the 
journals are colo’d on spinners, then for sure the affinity game isn’t going to 
help writes massively.

My understanding of write latency is that min_size journals have to be written 
before the op returns, so if journals aren’t on SSD’s that’s going to be a big 
bottleneck.




> Hi,
> 
>>> Assuming production level, we would keep a pretty close 1:2 SSD:HDD ratio,
>> 1:4-5 is common but depends on your needs and the devices in question, ie. 
>> assuming LFF drives and that you aren’t using crummy journals.
> 
> You might be speaking about different ratios here. I think that Anthony is 
> speaking about journal/OSD and Reed speaking about capacity ratio between and 
> HDD and SSD tier/root. 
> 
> I have been experimenting with hybrid setups (1 copy on SSD + 2 copies on 
> HDD), like Richard says you’ll get much better random read performance with 
> primary OSD on SSD but write performance won’t be amazing since you still 
> have 2 HDD copies to write before ACK. 
> 
> I know the doc suggests using primary affinity but since it’s a OSD level 
> setting it does not play well with other storage tiers so I searched for 
> other options. From what I have tested, a rule that selects the first/primary 
> OSD from the ssd-root then the rest of the copies from the hdd-root works. 
> Though I am not sure it is *guaranteed* that the first OSD selected will be 
> primary.
> 
> “rule hybrid {
>  ruleset 2
>  type replicated
>  min_size 1
>  max_size 10
>  step take ssd-root
>  step chooseleaf firstn 1 type host
>  step emit
>  step take hdd-root
>  step chooseleaf firstn -1 type host
>  step emit
> }”
> 
> Cheers,
> Maxime
> 
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Primary Affinity

2017-04-18 Thread Anthony D'Atri
I get digests, so please forgive me if this has been covered already.

> Assuming production level, we would keep a pretty close 1:2 SSD:HDD ratio,

1:4-5 is common but depends on your needs and the devices in question, ie. 
assuming LFF drives and that you aren’t using crummy journals.

> First of all, is this even a valid architecture decision? 

Inktank described it to me back in 2014/2015 so I don’t think it’s ultra outré. 
 It does sound like a lot of work to maintain, especially when components get 
replaced or added.

> it should boost performance levels considerably compared to spinning disks,

Performance in which sense?  I would expect it to boost read performance but 
not so much writes.

I haven’t used cache tiering so can’t comment on the relative merits.  Your 
local workload may be a factor.

— aad



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What's the actual justification for min_size?

2017-03-21 Thread Anthony D'Atri
I’m fairly sure I saw it as recently as Hammer, definitely Firefly. YMMV.


> On Mar 21, 2017, at 4:09 PM, Gregory Farnum <gfar...@redhat.com> wrote:
> 
> You shouldn't need to set min_size to 1 in order to heal any more. That was 
> the case a long time ago but it's been several major LTS releases now. :)
> So: just don't ever set min_size to 1.
> -Greg
> On Tue, Mar 21, 2017 at 6:04 PM Anthony D'Atri <a...@dreamsnake.net> wrote:
> >> a min_size of 1 is dangerous though because it means you are 1 hard disk 
> >> failure away from losing the objects within that placement group entirely. 
> >> a min_size of 2 is generally considered the minimum you want but many 
> >> people ignore that advice, some wish they hadn't.
> >
> > I admit I am having difficulty following why this is the case
> 
> I think we have a case of fervently agreeing.
> 
> Setting min_size on a specific pool to 1 to allow PG’s to heal is absolutely 
> a normal thing in certain circumstances, but it’s important to
> 
> 1) Know _exactly_ what you’re doing, to which pool, and why
> 2) Do it very carefully, changing ‘size’ instead of ‘min_size’ on a busy pool 
> with a bunch of PG’s and data can be quite the rude awakening.
> 3) Most importantly, _only_ set it for the minimum time needed, with eyes 
> watching the healing, and set it back immediately after all affected PG’s 
> have peered and healed.
> 
> The danger, which I think is what Wes was getting at, is in leaving it set to 
> 1 all the time, or forgetting to revert it.  THAT is, as we used to say, 
> begging to lose.
> 
> — aad
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] add multiple OSDs to cluster

2017-03-21 Thread Anthony D'Atri
Deploying or removing OSD’s in parallel for sure can save elapsed time and 
avoid moving data more than once.  There are certain pitfalls, though, and the 
strategy needs careful planning.

- Deploying a new OSD at full weight means a lot of write operations.  Running 
multiple whole-OSD backfills to a single host can — depending on your situation 
— saturate the HBA, resulting in slow requests. 
- Judicious setting of norebalance/norecover can help somewhat, to give the 
affected OSD’s/ PG’s time to peer and become ready before shoving data at them
- Deploying at 0 CRUSH weight and incrementally ratcheting up the weight as 
PG’s peer can spread that out
- I’ve recently seen the idea of temporarily setting primary-affinity to 0 on 
the affected OSD’s to deflect some competing traffic as well
- One workaround is that if you have OSD’s to deploy on more than one server, 
you could deploy them in batches of say 1-2 on each server, striping them if 
you will.  That diffuses the impact and results in faster elapsed recovery
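A rough sketch of the 0-weight / norebalance / primary-affinity ideas above, for
one new OSD -- the id, step sizes, and final weight are hypothetical (older
releases may also need mon_osd_allow_primary_affinity = true):

ceph osd set norebalance                # let new PGs peer before data starts moving
ceph osd crush reweight osd.123 0.5     # bring the new OSD in at a fraction of its weight
ceph osd primary-affinity osd.123 0     # optional: deflect client reads while it fills
# ... wait for peering to settle ...
ceph osd unset norebalance
# repeat in steps up to the final TB-based weight, e.g.
ceph osd crush reweight osd.123 3.64
ceph osd primary-affinity osd.123 1     # restore once backfill completes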

As for how many is safe to do in parallel, there are multiple variables there.  
HDD vs SSD, client workload.  And especially how many other OSD’s are in the 
same logical rack/host.  On a cluster of 450 OSD’s, with 150 in each logical 
rack, each OSD is less than 1% of a rack, so deploying 4 of them at once would 
not be a massive change.  However in a smaller cluster with say 45 OSD’s, 15 in 
each rack, that would tickle a much larger fraction of the cluster and be more 
disruptive.

If the numbers below are TOTALS, if you would be expanding your cluster from a 
total of 4 OSD’s to a total of 8, that would be something I wouldn’t do, having 
experienced under Dumpling what it was like to triple the size of a certain 
cluster in one swoop.  

So one approach is trial and error to see how many you can get away with before 
you get slow requests, then backing off.  In production of course this is 
playing with fire. Depending on which release you’re running, cranking down a 
common set of backfill/recovery tunable can help mitigate the thundering herd 
effect as well.

— aad

> This morning I tried the careful approach, and added one OSD to server1. 
> It all went fine, everything rebuilt and I have a HEALTH_OK again now. 
> It took around 7 hours.
> 
> But now I started thinking... (and that's when things go wrong, 
> therefore hoping for feedback here)
> 
> The question: was I being stupid to add only ONE osd to the server1? Is 
> it not smarter to add all four OSDs at the same time?
> 
> I mean: things will rebuild anyway...and I have the feeling that 
> rebuilding from 4 -> 8 OSDs is not going to be much heavier than 
> rebuilding from 4 -> 5 OSDs. Right?
> 
> So better add all new OSDs together on a specific server?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What's the actual justification for min_size?

2017-03-21 Thread Anthony D'Atri
>> a min_size of 1 is dangerous though because it means you are 1 hard disk 
>> failure away from losing the objects within that placement group entirely. a 
>> min_size of 2 is generally considered the minimum you want but many people 
>> ignore that advice, some wish they hadn't. 
> 
> I admit I am having difficulty following why this is the case

I think we have a case of fervently agreeing.

Setting min_size on a specific pool to 1 to allow PG’s to heal is absolutely a 
normal thing in certain circumstances, but it’s important to

1) Know _exactly_ what you’re doing, to which pool, and why
2) Do it very carefully, changing ‘size’ instead of ‘min_size’ on a busy pool 
with a bunch of PG’s and data can be quite the rude awakening.
3) Most importantly, _only_ set it for the minimum time needed, with eyes 
watching the healing, and set it back immediately after all affected PG’s have 
peered and healed.

The danger, which I think is what Wes was getting at, is in leaving it set to 1 
all the time, or forgetting to revert it.  THAT is, as we used to say, begging 
to lose.
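In practice the above looks something like this -- pool name hypothetical:

ceph osd pool set volumes min_size 1   # temporarily, so the affected PGs can peer/heal
ceph -w                                # watch until they're active+clean again
ceph osd pool set volumes min_size 2   # revert immediately afterwards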

— aad

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Directly addressing files on individual OSD

2017-03-15 Thread Anthony D'Atri
As I parse Youssef’s message, I believe there are some misconceptions.  It 
might help if you could give a bit more info on what your existing ‘cluster’ is 
running.  NFS? CIFS/SMB?  Something else?

1) Ceph regularly runs scrubs to ensure that all copies of data are consistent. 
 The checksumming that you describe would be both infeasible and redundant.

2) It sounds as though your current back-end stores user files as-is and is 
either a traditional file server setup or perhaps a virtual filesystem 
aggregating multiple filesystems.  Ceph is not a file storage solution in this 
sense.  The below sounds as though you want user files to not be sharded across 
multiple servers.  This is antithetical to how Ceph works and is counter to 
data durability and availability, unless there is some replication that you 
haven’t described.  Reference this diagram:

http://docs.ceph.com/docs/master/_images/stack.png

Beneath the hood Ceph operates internally on ‘objects’ that are not exposed to 
clients as such. There are several different client interfaces that are built 
on top of this object service:

- RBD volumes — think in terms of a virtual disk drive attached to a VM
- RGW — like Amazon S3 or Swift
- CephFS — provides a mountable filesystem interface, somewhat like NFS or even 
SMB but with important distinctions in behavior and use-case

I had not heard of iRODS before but just looked it up.  It is a very different 
thing than any of the common interfaces to Ceph.

If your users need to mount the storage as a share / volume, in the sense of 
SMB or NFS, then Ceph may not be your best option.  If they can cope with an S3 
/ Swift type REST object interface, a cluster with RGW interfaces might do the 
job, or perhaps Swift or Gluster.   It’s hard to say for sure based on 
assumptions of what you need.

— Anthony


> We currently run a commodity cluster that supports a few petabytes of data. 
> Each node in the cluster has 4 drives, currently mounted as /0 through /3. We 
> have been researching alternatives for managing the storage, Ceph being one 
> possibility, iRODS being another. For preservation purposes, we would like 
> each file to exist as one whole piece per drive (as opposed to being striped 
> across multiple drives). It appears this is the default in Ceph.
> 
> Now, it has always been convenient for us to run distributed jobs over SSH 
> to, for instance, compile a list of checksums of all files in the cluster:
> 
> dsh -Mca 'find /{0..3}/items -name \*.warc.gz | xargs md5sum 
> >/tmp/$HOSTNAME.md5sum'
> 
> And that nicely allows each node to process its own files using the local CPU.
> 
> Would this scenario still be possible where Ceph is managing the storage?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] apologies for the erroneous subject - should have been Re: Unable to boot OS on cluster node

2017-03-11 Thread Anthony D'Atri
A certain someone bumped my elbow as I typed, think in terms of this week’s 
family-bombed video going the rounds on FB.  My ignominy is boundless and my 
door now locked when replying.

— aad




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] http://www.dell.com/support/home/us/en/04/product-support/servicetag/JFGQY02/warranty#

2017-03-10 Thread Anthony D'Atri
> As long as you don’t nuke the OSDs or the journals, you should be OK.

This.  Most HBA failures I’ve experienced don’t corrupt data on the drives, but 
it can happen.

Assuming the data is okay, you should be able to just install the OS, install 
the *same version* of Ceph packages, reboot, and have them come up and in (and 
backfill / recover with a vengeance)


— aad

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mon.mon01 store is getting too big! 18119 MB >= 15360 MB -- 94% avail

2017-02-01 Thread Anthony D'Atri
> In particular, when using leveldb, stalls while reading or writing to 
> the store - typically, leveldb is compacting when this happens. This 
> leads to all sorts of timeouts to be triggered, but the really annoying 
> one would be the lease timeout, which tends to result in flapping quorum.
> 
> Also, being unable to sync monitors. Again, stalls on leveldb lead to 
> timeouts being triggered and the sync to restart.
> 
> Once upon a time, this *may* have also translated into large memory 
> consumption. A direct relation was never proved though, and behaviour 
> went away as ceph became smarter, and libs were updated by distros.

My team suffered no small amount of pain due to persistent DB inflation, not 
just during topology churn.  RHCS 1.3.2 addressed that for us.  Before we 
applied that update I saw mon DB’s grow as large as 54GB.

When measuring the size of /var/lib/ceph/mon/store.db, be careful to not 
blindly include *.log or *LOG* files that you may find there.  I set 
leveldb_log = /dev/null to suppress writing those, which were confusing our 
metrics.   I also set mon_compact_on_start = true to compact each mon’s leveldb 
at startup.  This was found anecdotally to be more effective than using ceph 
tell to compact during operation, as there was less contention.  It does mean 
however that one needs to be careful, when the set of DB’s across mons is 
inflated, to not restart them all in a short amount of time.  It seems that 
even after the mon log reports that compaction is complete, there is (as of 
Hammer) trimming still running silently in the background that impacts 
performance until complete.  This means that one will see additional shrinkage 
of the store.db directory over time.
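Roughly, the [mon] settings described above:

[mon]
leveldb_log = /dev/null        # don't write leveldb LOG files that skew store.db size metrics
mon compact on start = true    # compact each mon's store at daemon startup
# on-line alternative, which we found more contentious:  ceph tell mon.<id> compact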

In my clusters of 450+ OSD’s, 4GB is the arbitrary point above which I get 
worried.  Mind you most of our mon DB’s are still on (wince) LFF rotational 
drives, which doesn’t help.  Strongly advise faster storage for the DB’s.  I 
found that the larger the DB’s grow, the slower all mon operations become — 
which includes peering and especially backfill/recovery.  With a DB that large 
you may find that OSD loss / removal via attrition can cause significant client 
impact.

Inflating during recovery/backfill does still happen; it sounds as though the 
OP doubled the size of his/her cluster in one swoop, which is fairly drastic.  
Early on with Dumpling I trebled the size of a cluster in one operation, and 
still ache from the fallout.  A phased deployment will spread out the impact 
and allow the DB’s to preen in between phases.  One approach is to add only 1 
or a few drives per OSD server at a time, but in parallel.  So if you were 
adding 10 servers of 12 OSD’s each, say 6-12 steps of 10x1 or 10x2 OSD’s.  That 
way the write workload is spread across 10 servers instead of funneling into 
just one, avoiding HBA saturation and the blocked requests that can result from 
it.  Adding the OSD’s with 0 weight and using ceph osd crush reweight to bring 
them up in phases can also ease the process.  Early on we would allow each 
reweight to fully recover before the next step, but I’ve since found that 
peering is the biggest share of the impact, and that upweighting can proceed 
just as safely after peering clears up from the previous adjustment.  This 
avoids moving some fraction of data more than once.  With Jewel 
backfill/recovery is improved to not shuffle data that doesn’t really need to 
move, but with Hammer this decidedly helps avoid a bolus of slow requests as 
each new OSD comes up and peers.

— Anthony

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs ata1.00: status: { DRDY }

2017-01-06 Thread Anthony D'Atri
YMMV of course, but the first thing that struck me was the constraint of scrub 
times.  Constraining them to fewer hours can mean that more run in parallel.  
If you truly have off-hours for client ops (Graphite / Grafana are great for 
visualizing that) that might make sense, but in my 24x7 OpenStack world, there 
is little or no off-hour lull, so I let scrubs run all the time.

You might also up osd_deep_scrub_interval.  The default is one week; I raise 
that to four weeks as a compromise between aggressive protection and the 
realities of contention.
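For reference, the knobs involved as ceph.conf settings -- the hour values here
just express "run all the time", and the interval is in seconds:

[osd]
osd scrub begin hour = 0
osd scrub end hour = 24
osd deep scrub interval = 2419200   # 28 days rather than the default 7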

— Anthony, currently looking for a new Ceph opportunity.

>> In our ceph.conf we already have this settings active:
>> 
>> osd max scrubs = 1
>> osd scrub begin hour = 20
>> osd scrub end hour = 7
>> osd op threads = 16
>> osd client op priority = 63
>> osd recovery op priority = 1
>> osd op thread timeout = 5
>> 
>> osd disk thread ioprio class = idle
>> osd disk thread ioprio priority = 7
>> 
> You're missing the most powerful scrub dampener there is:
> osd_scrub_sleep = 0.1
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel P3700 SSD for journals

2016-11-22 Thread Anthony D'Atri
You wrote P3700 so that’s what I discussed ;)

If you want to connect to your HBA you’ll want a SATA device like the S3710 
series:

http://ark.intel.com/products/family/83425/Data-Center-SSDs#@Server

The P3700 is a PCI device, goes into an empty slot, and is not speed-limited by 
the SATA interface.  At perhaps higher cost.

With 7.2 I would think you’d be fine, driver-wise.  Either should be detected 
and work out of the box.

— Anthony


> 
> thx Alan and Anthony for sharing on these P3700 drives.
> 
> Anthony, just to follow up on your email: my OS is CentOS7.2.   Can
> you please elaborate on nvme on the CentOS7.2, I'm in no way expert on
> nvme, but I can here see that
> https://www.pcper.com/files/imagecache/article_max_width/news/2015-06-08/Demartek_SFF-8639.png
> the connectors are different for nvme. Does this mean I cannot connect
> to PERC 730 raid controller?
> 
> Is there anything particular required when installing the CentOS on
> these drives, or they will be automatically detected and work out of
> the box by default? Thx will
> 
> On Mon, Nov 21, 2016 at 12:16 PM, Anthony D'Atri <a...@dreamsnake.net> wrote:
>> The SATA S3700 series has been the de-facto for journals for some time.  And 
>> journals don’t need all that much space.
>> 
>> We’re using 400GB P3700’s.  I’ll say a couple of things:
>> 
>> o Update to the latest firmware available when you get your drives, qual it 
>> and stick with it for a while so you have a uniform experience
>> o Run a recent kernel with a recent nvme.ko, eg. the RHEL 7.1 3.10.0-229.4.2 
>> kernel’s bundled nvme.ko has a rare timing issue that causes us resets at 
>> times.  YMMV.
>> 
>> Which OS do you run?
>> 
>> 
>> 
>> Read through this document or a newer version thereof
>> 
>> https://www-ssl.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-p3700-spec.pdf
>> 
>> or for SATA drives
>> 
>> http://www.intel.com/content/www/us/en/solid-state-drives/ssd-dc-s3710-spec.html
>> 
>> 
>> It’s possible that your vendor is uninformed or lying, trying to upsell you. 
>>  At times larger units can perform better due to internal parallelism, ie. a 
>> 1.6TB unit may electrically be 4x 400GB parts in parallel.  For 7200RPM LFF 
>> drives, as Nick noted 12x journals per P3700 is probably as high as you want 
>> to go, otherwise you can bottleneck.
>> 
>> What *is* true is the distinction among series.  Check the graph halfway 
>> down this page:
>> 
>> http://www.anandtech.com/show/8104/intel-ssd-dc-p3700-review-the-pcie-ssd-transition-begins-with-nvme
>> 
>> Prima facie the P3500’s can seem like a relative bargain, but attend to the 
>> durability — that is where the P3600 and P3700 differ dramatically.  For 
>> some the P3600 may be durable enough, given certain workloads and expected 
>> years of service.  I tend to be paranoid and lobbied for us to err on the 
>> side of caution with the P3700.  YMMV.
>> 
>> — Anthony

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel P3700 SSD for journals

2016-11-20 Thread Anthony D'Atri
The SATA S3700 series has been the de-facto for journals for some time.  And 
journals don’t need all that much space.

We’re using 400GB P3700’s.  I’ll say a couple of things:

o Update to the latest firmware available when you get your drives, qual it and 
stick with it for a while so you have a uniform experience
o Run a recent kernel with a recent nvme.ko, eg. the RHEL 7.1 3.10.0-229.4.2 
kernel’s bundled nvme.ko has a rare timing issue that causes us resets at 
times.  YMMV.

Which OS do you run?



Read through this document or a newer version thereof

https://www-ssl.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-p3700-spec.pdf

or for SATA drives

http://www.intel.com/content/www/us/en/solid-state-drives/ssd-dc-s3710-spec.html


It’s possible that your vendor is uninformed or lying, trying to upsell you.  
At times larger units can perform better due to internal parallelism, ie. a 
1.6TB unit may electrically be 4x 400GB parts in parallel.  For 7200RPM LFF 
drives, as Nick noted 12x journals per P3700 is probably as high as you want to 
go, otherwise you can bottleneck.  

What *is* true is the distinction among series.  Check the graph halfway down 
this page:

http://www.anandtech.com/show/8104/intel-ssd-dc-p3700-review-the-pcie-ssd-transition-begins-with-nvme

Prima facie the P3500’s can seem like a relative bargain, but attend to the 
durability — that is where the P3600 and P3700 differ dramatically.  For some 
the P3600 may be durable enough, given certain workloads and expected years of 
service.  I tend to be paranoid and lobbied for us to err on the side of 
caution with the P3700.  YMMV.

— Anthony
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Monitoring Overhead

2016-10-26 Thread Anthony D'Atri
> Collectd and graphite look really nice.

Also look into Grafana, and of course RHSC.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrade from Hammer to Jewel

2016-10-25 Thread Anthony D'Atri

> We have an Openstack which use Ceph for Cinder and Glance. Ceph is in
> Hammer release and we need to upgrade to Jewel. My question is :
> are the Hammer clients compatible with the Jewel servers ? (upgrade of Mon
> then Ceph servers first)
> As the upgrade of the Ceph client need a reboot of all the instances on
> each openstack hypervisor, we will have to keep for a while our client in
> hammer version

Chances are that you don't strictly need to do that.

1) You can update the Ceph packages on the hypervisor to stage the update 
without affecting running instances.

2) Be careful using the term "reboot".  If the tenant just does "shutdown -r" 
or equivalent, that will not restart the qemu or whatever process.  An API stop 
/ start of the instance at the OpenStack layer would be needed to get a new 
process

3) Live migration accomplishes the same goal, so if your instances are 
migrate-able, you can shift them around until they all have a new process.

In any event, you might take the opportunity to apply other updates, eg. if 
your Nova config isn't properly set up to enable the Ceph client admin socket, 
now's the time.  With the socket operational one can query instances to ensure 
that they're all running the new code before updating the back end 
incompatibly, eg. setting straw2 when there are clients with Hammer packages 
installed but still running Firefly.
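For example, with the admin socket enabled you can ask each running librbd
client what it is actually running; the socket path below is hypothetical and
depends on how your Nova/libvirt setup names them:

ceph --admin-daemon /var/run/ceph/guests/ceph-client.cinder.12345.asok version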

-- Anthony

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Terrible RBD performance with Jewel

2016-07-22 Thread Anthony D'Atri
> FWIW, the xfs -n size=64k option is probably not a good idea. 

Agreed, moreover it’s a really bad idea.  You get memory allocation slowdowns 
as described in the linked post, and eventually the OSD dies.

It can be mitigated somewhat by periodically (say every 2 hours, ymmv) flushing 
the system’s buffer cache to effectively defragment memory, or possibly by 
increasing vm.min_free_kbytes to a large enough value based on your 
physmem.
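A sketch of that mitigation -- the interval and value are arbitrary and depend
on your physmem:

echo 1 > /proc/sys/vm/drop_caches      # e.g. from cron every ~2 hours
sysctl -w vm.min_free_kbytes=4194304   # e.g. reserve ~4GB on a large-RAM OSD node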

For sure the only real cure is rebuilding all OSD’s with defaults.

— Anthony

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help recovering failed cluster

2016-06-12 Thread Anthony D'Atri
> Current cluster health:
>cluster 537a3e12-95d8-48c3-9e82-91abbfdf62e0
> health HEALTH_WARN
>5 pgs degraded
>8 pgs down
>48 pgs incomplete
>3 pgs recovering
>1 pgs recovery_wait
>76 pgs stale
>5 pgs stuck degraded
>48 pgs stuck inactive
>76 pgs stuck stale
>53 pgs stuck unclean
>5 pgs stuck undersized
>5 pgs undersized



First I have to remark on you having 7 mons.  Your cluster is very small - many 
clusters with hundreds of OSD’s are happy with 5.  At the Vancouver OpenStack 
summit there was a discussion re the number of mons; there was consensus that 5 is 
generally plenty and that with 7+ the traffic among them really starts being 
excessive.  YMMV of course.

Assuming you have size on your pools set to 3 and min_size set to 2, this might 
be one of those times where temporarily setting min_size on the pools to 1 does 
the trick or at least helps.  I suspect in your case it wouldn’t completely 
heal the cluster but it might improve it and allow recovery to proceed.  Later 
you’d revert to the usual setting for obvious reasons.

— Anthony
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Journal partition owner's not change to ceph

2016-06-12 Thread Anthony D'Atri
> The GUID for a CEPH journal partition should be 
> "45B0969E-9B03-4F30-B4C6-B4B80CEFF106"
> I haven't been able to find this info in the documentation on the ceph site

The GUID typecodes are listed in the /usr/sbin/ceph-disk script.

I had an issue a couple years ago where a subset of OSD’s in one cluster would 
not start at boot, but if they were mounted and manually started they would 
run.  Turned out that some goof that predated me had messed with the typecodes 
; correcting them with sgdisk restored them to normal behavior.  That was 
Dumpling FWIW.
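Something along these lines, using the journal typecode quoted above -- device
and partition number hypothetical:

sgdisk --typecode=2:45B0969E-9B03-4F30-B4C6-B4B80CEFF106 /dev/sdb   # journal partition
partprobe /dev/sdb                                                  # re-read the partition table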

— Anthony



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Jewel ubuntu release is half cooked

2016-05-23 Thread Anthony D'Atri


Re:

> 2. Inefficient chown documentation - The documentation states that one should 
> "chown -R ceph:ceph /var/lib/ceph" if one is looking to have ceph-osd ran as 
> user ceph and not as root. Now, this command would run a chown process one 
> osd at a time. I am considering my cluster to be a fairly small cluster with 
> just 30 osds between 3 osd servers. It takes about 60 minutes to run the 
> chown command on each osd (3TB disks with about 60% usage). It would take 
> about 10 hours to complete this command on each osd server, which is just mad 
> in my opinion. I can't imagine this working well at all on servers with 20-30 
> osds! IMHO the docs should be adjusted to instruct users to run the chown in 
> _parallel_ on all osds instead of doing it one by one. 


I suspect the docs are playing it safe there, Ceph runs on servers of widely 
varying scale, capabilities, and robustness.  Running 30 chown -R processes in 
parallel could present noticeable impact on a production server.


— Anthony
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Do you see a data loss if a SSD hosting several OSD journals crashes

2016-05-20 Thread Anthony D'Atri
You should be protected against single component failures, yes, that's the 
point of journals.

It's important to ensure that on-disk volatile cache -- these days in the 
8-128MB range -- remains turned off, otherwise it usually presents an 
opportunity for data loss, especially when power drops.  Disk manufacturers tout 
cache size but are quiet about the fact that it's always turned off by default 
for just this reason.  Some LSI firmware and storcli versions have a bug that 
silently turns this on.
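On plain SATA that's a quick check -- device name hypothetical; drives behind a
RAID HBA may need the vendor tool instead:

hdparm -W /dev/sdb     # report the current volatile write cache setting
hdparm -W 0 /dev/sdb   # disable it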

RAID HBA's also introduce an opportunity for data loss, with their on-card 
caches.  HBA cache not protected by a BBU / supercap has a similar 
vulnerability, and in some cases are just plain flaky and rife with hassles.

I'm not immediately finding a definitive statement about the number of journal 
writes required for ack.  

> So, which is correct, all replicas must be written or only min_size before 
> ack?  
> 
> But for me the takeaway is that writes are protected - even if the journal 
> drive crashes, I am covered.
> 
> - epk
> 
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Anthony D'Atri
> Sent: Friday, May 20, 2016 1:32 PM
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Do you see a data loss if a SSD hosting several OSD 
> journals crashes
> 
> 
>> Ceph will not acknowledge a client write before all journals (replica 
>> size, 3 by default) have received the data, so losing one journal SSD 
>> will NEVER result in an actual data loss.
> 
> Some say that all replicas must be written; others say that only min_size, 2 
> by default, must be written before ack.
> 
> --aad
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Do you see a data loss if a SSD hosting several OSD journals crashes

2016-05-20 Thread Anthony D'Atri

> Ceph will not acknowledge a client write before all journals (replica
> size, 3 by default) have received the data, so losing one journal SSD
> will NEVER result in an actual data loss.

Some say that all replicas must be written; others say that only min_size, 2 by 
default, must be written before ack.

--aad

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] dense storage nodes

2016-05-20 Thread Anthony D'Atri
[ too much to quote ]

Dense nodes often work better for object-focused workloads than block-focused, 
the impact of delayed operations is simply speed vs. a tenant VM crashing.

Re RAID5 volumes to decrease the number of OSD’s:   This sort of approach is 
getting increasing attention in that it brings down the OSD count, reducing the 
resource demands of peering, especially during storms.  It also makes the OSD 
fillage bell curve narrower.   But one must also consider that the write speed 
of a RAID5 group is that of a single drive due to the parity recalc, and that 
if one does not adjust osd_op_threads and osd_disk_threads, throughput can 
suffer because fewer ops can run across the cluster at the same time.
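If you do wrap drives in RAID5 LUNs, the thread counts are ordinary ceph.conf
settings; the numbers below are hypothetical starting points, not gospel:

[osd]
osd op threads = 8     # default is 2
osd disk threads = 2   # default is 1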

Re Intel P3700 NVMe cards, has anyone out there experienced reset issues that 
may be related to workload, kernel version, driver version, firmware version, 
etc?  Or even Firefly vs Hammer?

There was an excellent presentation at the Austin OpenStack Summit re 
optimizing dense nodes — pinning OSD processes, HBA/NIC interrupts etc. to 
cores/sockets to limit data sent over QPI links on NUMA architectures.  It’s 
easy to believe that modern inter-die links are Fast Enough For You Old Man but 
there’s more too it.

— Anthony


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Journal

2016-01-29 Thread Anthony D'Atri
> Right now we run the journal as a partition on the data disk. I've build 
> drives without journals and the write performance seems okay but random io 
> performance is poor in comparison to what it should be. 


Co-located journals have multiple issues:

o The disks are presented with double the number of write ops -- and lots of 
long seeks -- which impacts even read performance due to contention

o The atypical seek pattern can collide with disk firmware and factory config 
in ways that result in a very elevated level of read errors.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Scrub Error / How does ceph pg repair work?

2015-05-12 Thread Anthony D'Atri
For me that's true about 1/3 the time, but often I do still have to repair the 
PG after removing the affected OSD.  YMMV.
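i.e. if the PG is still flagged inconsistent once backfill and the re-scrub
finish -- PG id hypothetical:

ceph health detail | grep inconsistent   # find the affected PGs
ceph pg repair 3.2a                      # then repair each one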

 
 
 
 Agree that 99+% of the inconsistent PG's I see correlate directly to disk 
 flern.
 
 Check /var/log/kern.log*, /var/log/messages*, etc. and I'll bet you find 
 errors correlating.
 
 
 More to this... In the case that an inconsistent PG is caused by a
 failed disk read, you don't need to run ceph pg repair at all.
 Instead, since your drive is bad, stop the osd process, mark that osd
 out. After backfilling has completed and the PG is re-scrubbed, you
 will find it is consistent again.
 
 Cheers, Dan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Scrub Error / How does ceph pg repair work?

2015-05-11 Thread Anthony D'Atri


Agree that 99+% of the inconsistent PG's I see correlate directly to disk flern.

Check /var/log/kern.log*, /var/log/messages*, etc. and I'll bet you find errors 
correlating.
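e.g. something along these lines, with a hypothetical drive name:

grep -iE 'medium error|i/o error|sector' /var/log/kern.log* /var/log/messages* | grep sdk
smartctl -a /dev/sdk | grep -iE 'realloc|pending|uncorrect'   # SMART counters tell a similar story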

-- Anthony

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sunday's Ceph based business model

2015-03-15 Thread Anthony D'Atri

Interesting idea.  I'm not sure, though, Ceph is designed with this sort of 
latency in mind.  

Crashplan does let you do something very similar for free, as I understand it, 
though it's more of a nearline thing.  


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] adding osd node best practice

2015-03-12 Thread Anthony D'Atri
 We have application cluster and ceph as storage solution, cluster consists of 
 six servers, so we've installed
 monitor on every one of them, to have ceph cluster sane (quorum) if server or 
 two of them goes down. 

You want an odd number for sure, to avoid the classic split-brain problem:

http://ceph.com/docs/master/rados/operations/add-or-rm-mons/

I think the bit re diminishing returns with 5 mons was told to me by a 
consultant, but I don’t have a reference.  The more you have the more traffic 
they have to exchange among themselves, I’m thinking that’s probably not a huge 
deal until N gets a lot bigger.

  or is it not necessary/recommended to have mon on node with osds?

I’ve read multiple documents recommending against an AIO config, IIRC, e.g. so 
that heavy backfilling or client operations to the OSD’s don’t starve the mons. 
Best to Google around a bit, the size/density/number/workload of your OSD’s is 
likely a significant factor.  On a small cluster I can see the appeal of an AIO 
strategy, unless you perhaps have hypervisors on the appropriate network and 
might consider running mons as VM’s with resource reservations.


—aad



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] adding osd node best practice

2015-03-07 Thread Anthony D'Atri
1) That's an awful lot of mons.  Are they VM's or something?  My sense is that 
mons > 5 have diminishing returns at best.  

2) Only two OSD nodes?  Assume you aren't running 3 copies of data or racks.  

3) The new nodes will have fewer OSD's?   Be careful with host / OSD weighting 
to avoid a gross imbalance in disk utilization.   

4) I've had experience tripling the size of a cluster in one shot, and in 
backfilling a whole rack of 100+ OSD's in one shot.  Cf.  BÖC's 'Veteran of the 
Psychic Wars'.  I do not recommend this approach esp.  if you don't have truly 
embarrassing amounts of RAM.  Suggest disabling scrubs / deep-scrubs, 
throttling the usual backfill / recovery values, including setting recovery op 
priority as low as 1 for the duration.   Deploy one OSD at a time.  Yes this 
will cause data to move more than once.  But it will also minimize your 
exposure to as-of-yet undiscovered problems with the new hardware, and the 
magnitude of peering storms.   And thus client impact.   One OSD on each new 
system, sequentially.   Check the weights in the CRUSH map.  Time backfill to 
HEALTH_OK.  Let them soak for a few days before serially deploying the rest.   
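A hedged example of that throttling for the duration, reverting afterwards --
values are illustrative only:

ceph osd set noscrub
ceph osd set nodeep-scrub
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'
# ... deploy / reweight one OSD, wait for HEALTH_OK, repeat ...
ceph osd unset noscrub
ceph osd unset nodeep-scrub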

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] No auto-mount of OSDs after server reboot

2015-01-30 Thread Anthony D'Atri

One thing than can cause this is messed-up partition ID's / typecodes.   Check 
out the ceph-disk script to see how they get applied.  I have a few systems 
that somehow got messed up -- at boot they don't get started, but if I mounted 
them manually on /mnt, checked out the whoami file and remounted accordingly, 
then started, they ran fine.

# for i in b c d e f g h i j k ; do sgdisk 
--typecode=1:4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D /dev/sd$i ; done

# for i in b c d e f g h i j k ; do sgdisk 
--typecode=2:45B0969E-9B03-4F30-B4C6-B4B80CEFF106 /dev/sd$i ; done

One system I botched and set all the GUID's to a constant; I went back and 
fixed that:

# for i in b c d e f g h i j k ; do sgdisk 
--typecode=2:45B0969E-9B03-4F30-B4C6-B4B80CEFF106 --partition-guid=$(uuidgen 
-r) /dev/sd$i ; done

Note that I have not yet rebooted these systems to validate this approach, so 
YMMV, proceed at your own risk, this advice is not FDIC-insured and may lose 
value.


# sgdisk -i 1 /dev/sdb
Partition GUID code: 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D (Unknown)
Partition unique GUID: 61397DDD-E203-4D9A-9256-24E0F5F97344
First sector: 20973568 (at 10.0 GiB)
Last sector: 5859373022 (at 2.7 TiB)
Partition size: 5838399455 sectors (2.7 TiB)
Attribute flags: 
Partition name: 'ceph data'

# sgdisk -i 2 /dev/sdb
Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown)
Partition unique GUID: EF292AB7-985E-40A2-B185-DD5911D17BD7
First sector: 2048 (at 1024.0 KiB)
Last sector: 20971520 (at 10.0 GiB)
Partition size: 20969473 sectors (10.0 GiB)
Attribute flags: 
Partition name: 'ceph journal'

--aad


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Survey re journals on SSD vs co-located on spinning rust

2015-01-28 Thread Anthony D'Atri

My apologies if this has been covered ad nauseam in the past; I wasn't finding a 
lot of relevant archived info.

I'm curious how may people are using

1) OSD's on spinning disks, with journals on SSD's -- how many journals per 
SSD?  4-5?

2) OSD's on spinning disks, with [10GB] journals co-located at the logical end 
of each disk.

LFF disks or SFF?  JBOD HBA's or RAID HBA's with cache?

For those running co-located journals, am curious if you're mounting XFS with 
the inode64 flag, and if you're providing block storage for VM's, if you have 
any issues with slow requests crashing VM's, or with OSD add/remove/backfill 
operations precipitating lots of slow requests.

Please reply to me, not the list, and I'll summarize.  Thanks.

-- me




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com