Re: [ceph-users] Questions about using existing HW for PoC cluster

2019-02-09 Thread ceph
Hi 

On 27 January 2019 18:20:24 CET, Will Dennis wrote:
>Been reading "Learning Ceph - Second Edition"
>(https://learning.oreilly.com/library/view/learning-ceph-/9781787127913/8f98bac7-44d4-45dc-b672-447d162ea604.xhtml)
>and in Ch. 4 I read this:
>
>"We've noted that Ceph OSDs built with the new BlueStore back end do
>not require journals. One might reason that additional cost savings can
>be had by not having to deploy journal devices, and this can be quite
>true. However, BlueStore does still benefit from provisioning certain
>data components on faster storage, especially when OSDs are deployed on
>relatively slow HDDs. Today's investment in fast FileStore journal
>devices for HDD OSDs is not wasted when migrating to BlueStore. When
>repaving OSDs as BlueStore devices the former journal devices can be
>readily repurposed for BlueStore's RocksDB and WAL data. When using
>SSD-based OSDs, this BlueStore accessory data can reasonably be
>colocated with the OSD data store. For even better performance they can
>employ faster yet NVMe or other technologies for WAL and RocksDB. This
>approach is not unknown for traditional FileStore journals as well,
>though it is not inexpensive. Ceph clusters that are fortunate to
>exploit SSDs as primary OSD drives
>usually do not require discrete journal devices, though use cases
>that require every last bit of performance may justify NVMe journals.
>SSD clusters with NVMe journals are as uncommon as they are expensive,
>but they are not unknown."
>
>So can I get by with using a single SATA SSD (size?) per server for
>RocksDB / WAL if I'm using Bluestore?

IIRC there is a rule of thumb that the size of the DB partition should be about
4% of the OSD size.

I.e. a 4TB OSD should have a DB partition of at least 160GB.
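
A quick sketch of that rule of thumb, treating the 4% figure as a starting
point rather than a hard requirement (metadata-heavy workloads such as RGW may
want more):

# Rough BlueStore block.db sizing helper based on the "DB ~= 4% of the OSD"
# rule of thumb. Treat the ratio as a guideline, not a hard requirement.

def suggested_db_size_gb(osd_size_tb: float, ratio: float = 0.04) -> float:
    """Suggested block.db partition size in GB for an OSD of the given size in TB."""
    return osd_size_tb * 1000 * ratio

if __name__ == "__main__":
    for osd_tb in (2, 4, 8):
        print(f"{osd_tb} TB OSD -> ~{suggested_db_size_gb(osd_tb):.0f} GB block.db")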
 
Hth
- Mehmet

>
>
>> - Is putting the journal on a partition of the SATA drives a real I/O
>killer? (this is how my Proxmox boxes are set up)
>> - If YES to the above, then is a SATA SSD acceptable for journal
>device, or should I definitely consider PCIe SSD? (I'd have to limit to
>one per server, which I know isn't optimal, but price prevents
>otherwise...)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Questions about using existing HW for PoC cluster

2019-01-28 Thread Will Dennis


The hope is to be able to provide scale-out storage that will be performant
enough to use as a primary fs-based data store for research data (right now we
mount via NFS on our cluster nodes; we may do that with Ceph, or perhaps do
native CephFS access from the cluster nodes.) Right now I’m still in the “I
don’t know what I don’t know” stage :)
From: Willem Jan Withagen <w...@digiware.nl>
Date: Monday, Jan 28, 2019, 8:11 AM

I'd carefully define the term: "all seems to work well".

I'm running several ZFS instances of equal or bigger size that are
specifically tuned (buses, SSDs, memory and ARC) to their usage. And
they usually do perform very well.

Now, if you define "work well" as performance close to what you get out of
your ZFS store, be careful not to compare pears to lemons. You might
need rather beefy HW to get the Ceph cluster's performance to the same
level as your ZFS.

So you'd better define your PoC targets with realistic expectations.

--WjW

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Questions about using existing HW for PoC cluster

2019-01-28 Thread Willem Jan Withagen

On 28-1-2019 02:56, Will Dennis wrote:

I mean to use CephFS on this PoC; the initial use would be to back up an 
existing ZFS server with ~43TB of data (I may have to limit the backed-up data 
depending on how much capacity I can get out of the OSD servers) and then share 
it out via NFS as a read-only copy; that would give me some I/O numbers for 
writes and reads, and allow me to test different aspects of Ceph before I go 
pitching it as a primary data storage technology (it will be our org's first 
foray into SDS, and I want it to succeed.)

No way I'd go primary production storage with this motley collection of 
"pre-loved" equipment :) If it all seems to work well, I think I could get a 
reasonable budget for new production-grade gear.


Perhaps superfluous, my 2ct anyways.

I'd carefully define the term: "all seems to work well".

I'm running several ZFS instances of equal or bigger size that are 
specifically tuned (buses, SSDs, memory and ARC) to their usage. And 
they usually do perform very well.


Now, if you define "work well" as performance close to what you get out of 
your ZFS store, be careful not to compare pears to lemons. You might 
need rather beefy HW to get the Ceph cluster's performance to the same 
level as your ZFS.


So you'd better define your PoC targets with realistic expectations.

--WjW




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Questions about using existing HW for PoC cluster

2019-01-27 Thread Will Dennis
Thanks for contributing your knowledge to that book, Anthony - really enjoying 
it :)

I didn't mean to use the OS SSD for Ceph use - I would buy a second SSD per 
server for that... I will take a look at SATA SSD prices; hopefully the smaller 
ones (<500GB) will be at an acceptable price so that I can buy 1 (or even 2) 
for each server. I'd love to run two for OS (md mirror) and then two more for 
Ceph use, but that's probably going to add up to more money than I'd want to 
ask for. I was going to check SMART on the existing SSD; since they are Intel 
SSDs, there's also an Intel tool ( 
https://www.intel.com/content/www/us/en/support/articles/06289/memory-and-storage.html
 ) that I was going to use. In any case, I will probably re-use the existing OS 
SSD for the new OS, and add in 1-2 new SSDs for Ceph per OSD server.
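
For the SMART check, a rough sketch along these lines might help (assuming
smartmontools is installed and the script can read SMART data; attribute names
vary by vendor, so the names below are examples, not a definitive list):

# Rough sketch: list SSD wear-related SMART attributes via smartctl.
# Intel SATA SSDs typically expose "Media_Wearout_Indicator"; other vendors
# use names like "Wear_Leveling_Count" or "Percent_Lifetime_Remain".
import subprocess

WEAR_ATTRS = ("Media_Wearout_Indicator", "Wear_Leveling_Count",
              "Percent_Lifetime_Remain")

def wear_lines(device: str):
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True, check=False).stdout
    return [line for line in out.splitlines()
            if any(attr in line for attr in WEAR_ATTRS)]

if __name__ == "__main__":
    for line in wear_lines("/dev/sda"):
        print(line)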

I also think a new SSD per mon would be doable; maybe 500GB - 1TB OK?

Usually for a storage system I'd be using some sort of Intel DC drives, but may 
go with Samsung 8xx Pro's for this to keep the price lower.

I mean to use CephFS on this PoC; the initial use would be to back up an 
existing ZFS server with ~43TB of data (I may have to limit the backed-up data 
depending on how much capacity I can get out of the OSD servers) and then share 
it out via NFS as a read-only copy; that would give me some I/O numbers for 
writes and reads, and allow me to test different aspects of Ceph before I go 
pitching it as a primary data storage technology (it will be our org's first 
foray into SDS, and I want it to succeed.)

No way I'd go primary production storage with this motley collection of 
"pre-loved" equipment :) If it all seems to work well, I think I could get a 
reasonable budget for new production-grade gear.

From what I've read so far in the book, and in the prior list posts, I'll 
probably do a 2x10G bond to the common 10G switch that serves the cluster this 
would be a part of. Do the mon servers need 10G NICs too? If so, I may have to 
scrounge some 10Gbase-T NICs from other servers to give to them (they only have 
dual 1G NICs on the mobo.)

Thanks again!
Will

-Original Message-
From: Anthony D'Atri [mailto:a...@dreamsnake.net] 
Sent: Sunday, January 27, 2019 6:32 PM
To: Will Dennis
Cc: ceph-users
Subject: Re: [ceph-users] Questions about using existing HW for PoC cluster

> Been reading "Learning Ceph - Second Edition”

An outstanding book, I must say ;)

> So can I get by with using a single SATA SSD (size?) per server for RocksDB / 
> WAL if I'm using Bluestore?

Depends on the rest of your setup and use-case, but I think this would be a 
bottleneck.  Some thoughts:

* You wrote that your servers have 1x 240GB SATA SSD that has the OS, and 8x 
2TB SATA OSD drives.

** Sharing the OS drive with journal/metadata could lead to contention between 
the two.
** Since the OS has been doing who-knows-what with that drive, check the 
lifetime used/remaining with `smartctl -a`.
** If they’ve been significantly consumed, their lifetime with the load Ceph 
will present will be limited.
** SSDs selected for OS/boot drives often have relatively low durability (DWPD) 
and may suffer performance cliffing when given a steady load.  Look up the 
specs on your model
** 8 OSDs sharing a single SSD for metadata is a very large failure domain.  
If/when you lose that SSD, you lose all 8 OSDs and the host itself.  You would 
want to set the subtree limit to “host”, and not fill the OSDs past, say, 60% 
so that you’d have room to backfill in case of a failure not caught by the 
subtree limit.
** 8 HDD OSDs sharing a single SATA SSD for metadata will be bottlenecked 
unless your workload is substantially reads.

* Single SATA HDD on the mons

** When it fails, you lose the mon
** I have personally seen issues due to HDDs not handling peak demands, 
resulting in an outage

The gear you list is fairly old and underpowered, but sure you could use it *as 
a PoC*.  For a production deployment you’d want different hardware.

> - Is putting the journal on a partition of the SATA drives a real I/O killer? 
> (this is how my Proxmox boxes are set up)

With Filestore and HDDs, absolutely.  Even worse if you were to use EC.  There 
may be some coalescing of ops, but you’re still going to get a *lot* of long 
seeks, and spinners can only do a certain number of IOPs.  I think in the book 
I described personal experience with such a setup that even tickled a design 
flaw on the part of a certain HDD vendor.  Eventually I was permitted to get 
journal devices (this was pre-BlueStore GA), which were PCIe NVMe.  Write 
performance doubled.  Then we hit a race condition / timing issue in nvme.ko, 
but I digress...

When using SATA *SSD*s for OSDs, you have no seeks of course, and colocating 
the journals/metadata is more viable.

> - If YES to the above, then is a SATA SSD acceptable for journal device, or 
> should I definitely consider PCIe SSD? (I'd have to limit to one per server, 
> which I know isn't optimal, but price prevents otherwise...)

Re: [ceph-users] Questions about using existing HW for PoC cluster

2019-01-27 Thread Anthony D'Atri
> Been reading "Learning Ceph - Second Edition”

An outstanding book, I must say ;)

> So can I get by with using a single SATA SSD (size?) per server for RocksDB / 
> WAL if I'm using Bluestore?

Depends on the rest of your setup and use-case, but I think this would be a 
bottleneck.  Some thoughts:

* You wrote that your servers have 1x 240GB SATA SSD that has the OS, and 8x 
2TB SATA OSD drives.

** Sharing the OS drive with journal/metadata could lead to contention between 
the two.
** Since the OS has been doing who-knows-what with that drive, check the 
lifetime used/remaining with `smartctl -a`.
** If they’ve been significantly consumed, their lifetime with the load Ceph 
will present will be limited.
** SSDs selected for OS/boot drives often have relatively low durability (DWPD) 
and may suffer performance cliffing when given a steady load.  Look up the 
specs on your model
** 8 OSDs sharing a single SSD for metadata is a very large failure domain.  
If/when you lose that SSD, you lose all 8 OSDs and the host itself.  You would 
want to set the subtree limit to “host”, and not fill the OSDs past, say, 60% 
so that you’d have room to backfill in case of a failure not caught by the 
subtree limit (see the fill-ratio sketch after these bullets).
** 8 HDD OSDs sharing a single SATA SSD for metadata will be bottlenecked 
unless your workload is substantially reads.

* Single SATA HDD on the mons

** When it fails, you lose the mon
** I have personally seen issues due to HDDs not handling peak demands, 
resulting in an outage
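
To make the 60% suggestion concrete, here is a back-of-the-envelope sketch. It
assumes data is spread evenly across hosts and uses Ceph's default 0.85
near-full ratio as the post-failure ceiling; the 60% figure above simply leaves
extra margin on a 5-host cluster.

# Back-of-the-envelope check of how full the OSDs can run while still being
# able to absorb a whole-host failure via backfill. Assumes data is spread
# evenly across hosts; 0.85 mirrors Ceph's default near-full warning ratio.

def max_safe_fill(num_hosts: int, post_failure_ceiling: float = 0.85) -> float:
    """Highest steady-state fill so that losing one host and backfilling its
    data onto the remaining hosts stays under post_failure_ceiling."""
    return post_failure_ceiling * (num_hosts - 1) / num_hosts

if __name__ == "__main__":
    print(f"5 hosts: keep OSDs under ~{max_safe_fill(5):.0%}")  # ~68%
    print(f"3 hosts: keep OSDs under ~{max_safe_fill(3):.0%}")  # ~57%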

The gear you list is fairly old and underpowered, but sure you could use it *as 
a PoC*.  For a production deployment you’d want different hardware.

> - Is putting the journal on a partition of the SATA drives a real I/O killer? 
> (this is how my Proxmox boxes are set up)

With Filestore and HDDs, absolutely.  Even worse if you were to use EC.  There 
may be some coalescing of ops, but you’re still going to get a *lot* of long 
seeks, and spinners can only do a certain number of IOPs.  I think in the book 
I described personal experience with such a setup that even tickled a design 
flaw on the part of a certain HDD vendor.  Eventually I was permitted to get 
journal devices (this was pre-BlueStore GA), which were PCIe NVMe.  Write 
performance doubled.  Then we hit a race condition / timing issue in nvme.ko, 
but I digress...

When using SATA *SSD*s for OSDs, you have no seeks of course, and colocating 
the journals/metadata is more viable.

> - If YES to the above, then is a SATA SSD acceptable for journal device, or 
> should I definitely consider PCIe SSD? (I'd have to limit to one per server, 
> which I know isn't optimal, but price prevents otherwise…)

Optanes for these systems would be overkill.  If you would plan to have the PoC 
cluster run any appreciable load for any length of time, I might suggest 
instead adding 2x SATA SSDs per, so you could map 4x OSDs to each.  These would 
not need to be large:  upstream party line would have you allocate 80GB on 
each, though depending on your use-case you might well do fine with less, 2x 
240GB class or even 2x 120GB class should suffice for PoC service.  For 
production I would advise “enterprise” class drives with at least 1 DWPD 
durability — recently we’ve seen a certain vendor weasel their durability by 
computing it incorrectly.
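
For reference, a small sketch of how TBW endurance ratings map to DWPD; the
numbers in it are illustrative, not the specs of any particular drive.

# Quick conversion from a drive's TBW (terabytes written) endurance rating to
# DWPD (drive writes per day), using the usual definition:
#   DWPD = TBW / (capacity_TB * 365 * warranty_years)
# The example figures are purely illustrative.

def dwpd(tbw_tb: float, capacity_tb: float, warranty_years: float = 5.0) -> float:
    return tbw_tb / (capacity_tb * 365 * warranty_years)

if __name__ == "__main__":
    # e.g. a hypothetical 240 GB drive rated for 150 TBW over a 5-year warranty:
    print(f"DWPD ~= {dwpd(150, 0.24):.2f}")  # ~0.34, under the 1 DWPD bar above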

Depending on what you expect out of your PoC, and especially assuming you use 
BlueStore, you might get away with colocation, but do not expect performance 
that can be extrapolated for a production deployment.  

With the NICs you have, you could keep it simple and skip a replication/back-end 
network altogether, or you could bond the LOM ports and split them.  
Whatever’s simplest with the network infrastructure you have.  For production 
you’d most likely want LACP-bonded NICs, but if the network tech is modern, 
skipping the replication network may be very feasible.  But I’m getting ahead 
of your context …

HTH
— aad

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Questions about using existing HW for PoC cluster

2019-01-27 Thread Will Dennis
Been reading "Learning Ceph - Second Edition" 
(https://learning.oreilly.com/library/view/learning-ceph-/9781787127913/8f98bac7-44d4-45dc-b672-447d162ea604.xhtml)
 and in Ch. 4 I read this:

"We've noted that Ceph OSDs built with the new BlueStore back end do not 
require journals. One might reason that additional cost savings can be had by 
not having to deploy journal devices, and this can be quite true. However, 
BlueStore does still benefit from provisioning certain data components on 
faster storage, especially when OSDs are deployed on relatively slow HDDs. 
Today's investment in fast FileStore journal devices for HDD OSDs is not wasted 
when migrating to BlueStore. When repaving OSDs as BlueStore devices the former 
journal devices can be readily repurposed for BlueStore's RocksDB and WAL 
data. When using SSD-based OSDs, this BlueStore accessory data can reasonably 
be colocated with the OSD data store. For even better performance they can 
employ faster yet NVMe or other technologies for WAL and RocksDB. This approach 
is not unknown for traditional FileStore journals as well, though it is not 
inexpensive. Ceph clusters that are fortunate to exploit SSDs as primary OSD 
drives usually do not require discrete journal devices, though use cases that 
require every last bit of performance may justify NVMe journals. SSD clusters 
with NVMe journals are as uncommon as they are expensive, but they are not 
unknown."

So can I get by with using a single SATA SSD (size?) per server for RocksDB / 
WAL if I'm using Bluestore?


> - Is putting the journal on a partition of the SATA drives a real I/O killer? 
> (this is how my Proxmox boxes are set up)
> - If YES to the above, then is a SATA SSD acceptable for journal device, or 
> should I definitely consider PCIe SSD? (I'd have to limit to one per server, 
> which I know isn't optimal, but price prevents otherwise...)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Questions about using existing HW for PoC cluster

2019-01-26 Thread Will Dennis
Hi all,

Kind of new to Ceph (have been using 10.2.11 on a 3-node Proxmox 4.x cluster 
[hyperconverged], works great!) and now I'm thinking of perhaps using it for a 
bigger data storage project at work, a PoC at first, but built as correctly as 
possible for performance and availability. I have the following server 
equipment available to use for the PoC; if it all goes well, I'd think new 
hardware for an actual production installation would be in order :)

For the OSD servers, I have:

(5) Intel R2312GL4GS 2U servers (c. 2013) with the following specs --
  - (2) Intel Xeon E5-2660 CPUs (8-core, dual-threaded)
  - 64GB memory
  - (1) dual-port 10Gbase-T NIC (Intel X540-AT2)
  - (1) dual-port Infiniband HBA (Mellanox MT27500 ConnectX-3) (probably won't 
use, and would remove)
  - (4) Intel 1Gbase-T NICs (on mobo)
  - (1) Intel 240GB SATA SSD (OS)
  - (8) Hitachi 2TB SATA drives

I am not bound to using the existing disk in these servers, but also want to 
keep the price down, as this is only a PoC. Was thinking of either putting an 
Intel Optane 900P PCIe SSD (480G) in for journal, or else some sort of SATA SSD 
in one of the available front bays (it's a 12 hotswap-bay machine, + two 
internal SSD mounts.) I also could get some higher capacity (and newer!) SATA 
drives, so as to keep the number of OSDs down for a given capacity (shooting 
for 25-50TB to start.) However, I'd love it if I didn't have to ask for any 
money ;)
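
For what it's worth, here is a rough usable-capacity sketch for the gear above.
It assumes 3-way replication and a fill ceiling left for backfill headroom;
both are assumptions to adjust.

# Rough usable-capacity estimate: raw capacity divided by the replication
# factor, scaled by how full you're willing to run the OSDs.

def usable_tb(hosts: int, drives_per_host: int, drive_tb: float,
              replicas: int = 3, fill_ratio: float = 0.6) -> float:
    raw_tb = hosts * drives_per_host * drive_tb
    return raw_tb / replicas * fill_ratio

if __name__ == "__main__":
    # 5 hosts x 8 drives x 2 TB = 80 TB raw
    print(f"~{usable_tb(5, 8, 2.0):.1f} TB usable at 60% fill")            # ~16.0 TB
    print(f"~{usable_tb(5, 8, 2.0, fill_ratio=0.85):.1f} TB at 85% fill")  # ~22.7 TB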

For monitor machines, I have available three Supermicro (c.2011) 1U servers 
with:
  - (2) Intel Xeon X5680 CPUs
  - 48GB memory
  - (2) 1Gbase-T NICs (on mobo)
  - (1) WD 2TB SATA drive

I am considering also the rack placement; the 5 servers I'd use for OSD all 
currently live in one rack, and the Mon servers in another. I could move them 
if necessary.

So, a few questions to start ;)

- Is the above an acceptable collection of useful equipment for a PoC of modern 
Ceph? (thinking of installing Mimic with Bluestore)
- Is putting the journal on a partition of the SATA drives a real I/O killer? 
(this is how my Proxmox boxes are set up)
- If YES to the above, then is a SATA SSD acceptable for journal device, or 
should I definitely consider PCIe SSD? (I'd have to limit to one per server, 
which I know isn't optimal, but price prevents otherwise...)
- Should I spread the servers out over racks, which would probably force me to 
use 3 out of the 5 avail OSD servers, and put bigger disks in them to get the 
desired capacity (I only have three racks to work with), or is it OK for a PoC 
to keep all OSD servers in one rack?
- Are the platforms I'm proposing to use for monitor servers acceptable as-is, 
or do they need more memory, SSD drives, or 10GbE NICs?

OK, enough q's for now - thanks for helping a new Ceph'r out :)

Best,
Will



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com