Re: [ceph-users] New Ceph cluster design

2018-03-13 Thread Christian Balzer

Hello,

On Sat, 10 Mar 2018 16:14:53 +0100 Vincent Godin wrote:

> Hi,
> 
> As i understand it, you'll have one RAID1 of two SSDs for 12 HDDs. A
> WAL is used for all writes on your host. 

This isn't filestore; AFAIK with Bluestore the WAL/DB is used only for small
writes, to keep latency at levels akin to filestore. Large writes go
directly to the HDDs.

However, each write will of course also necessitate a write to the DB, and
thus IOPS (much more so than bandwidth) are paramount here.
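
As a purely illustrative sketch of that split (not Ceph source code; the
deferred-write cutoff is governed by the bluestore_prefer_deferred_size
options, and the threshold value below is an assumption, not a quoted
default):
---
# Illustrative model only: small writes are deferred through the SSD-backed
# WAL/DB and flushed to the HDD later, large writes go straight to the HDD.
ASSUMED_DEFER_THRESHOLD = 64 * 1024   # bytes; assumed for illustration

def route_write(size_bytes):
    """Return where a write of this size lands in this simplified model."""
    if size_bytes <= ASSUMED_DEFER_THRESHOLD:
        return "WAL/DB on SSD first, flushed to HDD later"
    return "directly to the HDD data device"

for size in (4 * 1024, 64 * 1024, 4 * 1024 * 1024):
    print("%4d KiB -> %s" % (size // 1024, route_write(size)))
---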

> If you have good SSDs, they
> can handle 450-550 MBpsc. Your 12 HDDs SATA can handle 12 x 100 MBps
> that is to say 1200 GBps. 

Aside from what I wrote above, I'd like to repeat myself and others here
for the umpteenth time: focusing on bandwidth is a fallacy in nearly all
use cases; IOPS tend to become the bottleneck.

Also, that's 1.2 GB/s or 1200 MB/s, not 1200 GB/s.

The OP stated 10TB HDDs and many (but not exclusively?) small objects, so
if we're looking at lots of small writes the bandwidth of the SSDs becomes
a factor again, and with the sizes involved they appear too small as well
(going by the rough ratio of 10GB of DB space per TB of HDD).

Either a RAID1 of at least 1600GB NVMes, or 2x 800GB NVMes with a resulting
failure domain of 6 HDDs each, would be a better/safer fit.
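
Rough arithmetic behind that sizing, assuming the 10GB-per-TB rule of thumb
holds for this workload:
---
# Back-of-the-envelope DB/WAL sizing for the proposed 12 x 10TB node.
hdd_count = 12
hdd_size_tb = 10
db_per_tb_gb = 10                                       # rule of thumb

db_needed_gb = hdd_count * hdd_size_tb * db_per_tb_gb   # 1200 GB
raid1_usable_gb = 400                                   # 2x 400 GB SSD in RAID1

print("DB space wanted:         ~%d GB" % db_needed_gb)
print("Proposed RAID1 provides:  %d GB" % raid1_usable_gb)
print("Shortfall:               ~%d GB" % (db_needed_gb - raid1_usable_gb))
---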

> So your RAID 1 will be the bootleneck with
> this design. A good design would be to have one SSD for 4 or 5 HDD. In
> your case, the best option would be to start with 3 SSDs for 12 HDDs
> to have a balances node. Don't forget to choose SSD with a high WDPD
> ratio (>10)
> 
More SSDs/NVMes are of course better and DWPD is important, but probably
less so than with filestore journals.
A DWPD of >10 is overkill for anything I've ever encountered; for many
workloads 3 will be fine, especially if one knows what load to expect.

For example, a filestore cache-tier SSD with inline journal (800GB DC S3610,
3 DWPD) has a SMART media wearout indicator of 97 (i.e. 3% used) after 2
years under this constant and not insignificant load:
---
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.03    83.09    7.07  303.24   746.64  5084.99    37.59     0.05    0.15    0.71    0.13   0.06   2.00
---

300 write IOPS and 5MB/s for all that time.
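
For context, converting that load into drive writes per day for the 800GB
drive (simple arithmetic on the numbers above, ignoring write amplification):
---
# Convert the observed steady write load into drive writes per day (DWPD).
write_mb_per_s = 5.0        # ~5 MB/s sustained, from the iostat output above
drive_size_gb = 800         # DC S3610 800 GB

written_gb_per_day = write_mb_per_s * 86400 / 1024
dwpd_used = written_gb_per_day / drive_size_gb
print("~%.0f GB/day written -> ~%.2f DWPD" % (written_gb_per_day, dwpd_used))
# ~422 GB/day, roughly 0.5 DWPD -- well inside a 3 DWPD rating, which fits
# with only a few percent of wear after two years.
---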

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Rakuten Communications


Re: [ceph-users] New Ceph cluster design

2018-03-10 Thread Vincent Godin
Hi,

As I understand it, you'll have one RAID1 of two SSDs for 12 HDDs. A
WAL is used for all writes on your host. If you have good SSDs, they
can handle 450-550 MB/s. Your 12 SATA HDDs can handle 12 x 100 MB/s,
that is to say 1200 MB/s. So your RAID 1 will be the bottleneck with
this design. A good design would be to have one SSD for 4 or 5 HDDs. In
your case, the best option would be to start with 3 SSDs for 12 HDDs
to have a balanced node. Don't forget to choose SSDs with a high DWPD
rating (>10).
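
For illustration, the sequential-throughput arithmetic behind that advice
(rough per-device figures; as noted elsewhere in the thread, IOPS rather
than bandwidth is usually the real limit):
---
# Sequential-throughput sanity check behind the 1 SSD per 4-5 HDDs advice.
hdd_mb_s = 100      # per SATA HDD, rough figure
ssd_mb_s = 500      # per good SATA SSD (450-550 MB/s range)
hdd_count = 12

aggregate_hdd_mb_s = hdd_count * hdd_mb_s       # 1200 MB/s
ssds_needed = aggregate_hdd_mb_s / ssd_mb_s     # ~2.4

print("HDD aggregate: %d MB/s" % aggregate_hdd_mb_s)
print("SSDs needed to keep up: ~%.1f (about 1 per %d HDDs)"
      % (ssds_needed, round(hdd_count / ssds_needed)))
---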

The network needs of your node depend on the bandwidth of your disks.
As explained above, your 12 HDDs can handle 1200 MB/s, so you need a
public and a private network that can handle it. In your case, a
minimum of two 10 Gbps networks per node is needed. If you need
redundancy, just use two LACP bonds, each with two 10 Gbps links. The
scrub and deep-scrub operations will not have a significant impact on
your network, but they will on your disk utilisation, so you need to
schedule them during periods of low client usage.
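
If you want to pin scrubbing to quiet hours, a minimal ceph.conf sketch
along these lines is one option (the option names are standard OSD
settings; the hours and load threshold are example values to adapt to your
own low-usage window):
---
[osd]
# Restrict scheduled scrubs to a nightly window (example hours).
osd scrub begin hour = 22
osd scrub end hour = 6
# Skip scheduled scrubs while host load is above this value (example).
osd scrub load threshold = 0.5
# At most one scrub per OSD at a time.
osd max scrubs = 1
---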


Re: [ceph-users] New Ceph cluster design

2018-03-09 Thread Jonathan Proulx
On Fri, Mar 09, 2018 at 03:06:15PM +0100, Ján Senko wrote:
:We are looking at 100+ nodes.
:
:I know that the Ceph official recommendation is 1GB of RAM per 1TB of disk.
:Was this ever changed since 2015?
:CERN is definitely using less (source:
:https://cds.cern.ch/record/2015206/files/CephScaleTestMarch2015.pdf)

Looking at my recently (re)installed Luminous Bluestore nodes:

At the 24hr peak (5min average) of RAM utilization I'm seeing ~40G
committed and ~30G active RAM on nodes with 10x4T drives, and ~82G
committed / ~57G active on nodes with 24x2T drives (average 45.77% full).

  data:
pools:   19 pools, 10240 pgs
objects: 16820k objects, 77257 GB
usage:   228 TB used, 271 TB / 499 TB avail
pgs: 10240 active+clean

(12 storage nodes, 173 OSDs)

This is almost entirely RBD for OpenStack VMs; only a negligible amount is
radosgw-type object storage, and none of it is erasure coded.

I spec'ed a bit over the recommended RAM (for example 64G RAM for 40T of
storage), so I've not had memory issues with older filestore or newer
bluestore implementations, but I would still round up rather than down for
my use case anyway.
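
As a quick cross-check of those observations against the 1GB-per-TB rule
(figures taken from above):
---
# Compare observed committed RAM with the 1 GB RAM per 1 TB of disk rule.
nodes = {
    "10x4T node": {"raw_tb": 40, "committed_gb": 40},
    "24x2T node": {"raw_tb": 48, "committed_gb": 82},
}
for name, n in nodes.items():
    rule_gb = n["raw_tb"]   # 1 GB of RAM per TB of raw disk
    print("%s: rule says ~%d GB, observed ~%d GB committed"
          % (name, rule_gb, n["committed_gb"]))
# Committed RAM roughly tracks or exceeds the rule, which supports rounding
# up rather than down.
---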

:RedHat suggests using 16GB + 2GB/HDD as the latest requirements.
:
:BTW: Anyone has comments on SSD sizes for Bluestore or the other questions?

These systems are using a 10G:1T SSD:7.2K-SAS-disk ratio (i.e. 40GB of SSD
per 4T HDD); this seems sufficient (running with WAL and DB on spinners
really tanks IOPS capacity), but I don't know that it is optimal. It is
close enough to the RedHat recommendation that I would believe them.

Note that we've moved to more, smaller disks (the 2T drives are newer) as
we were running out of IOPS; maybe more SSD in front would help, or maybe
our usage pattern, being so heavy in active volume use as opposed to cold
object storage, is unusual. Obviously 10k or 15k drives would help, and my
next expansion probably will use them, as we're still at a higher
percentage of our IOPS capacity utilization than we are of our storage
capacity utilization...

-Jon


Re: [ceph-users] New Ceph cluster design

2018-03-09 Thread John Petrini
What you linked was only a 2-week test. When Ceph is healthy it does not
need a lot of RAM; it's during recovery that OOM appears, and that's when
you'll find yourself upgrading the RAM on your nodes just to stop the OOM
kills and allow the cluster to recover. Look through the mailing list and
you'll see that this is one of the most common mistakes made when spec'ing
hardware for Ceph.


Re: [ceph-users] New Ceph cluster design

2018-03-09 Thread Ján Senko
We are looking at 100+ nodes.

I know that the Ceph official recommendation is 1GB of RAM per 1TB of disk.
Has this changed since 2015?
CERN is definitely using less (source:
https://cds.cern.ch/record/2015206/files/CephScaleTestMarch2015.pdf)
RedHat suggests using 16GB + 2GB/HDD as the latest requirements.
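
Applied to the 12 x 10TB node being discussed, the two guidelines differ
quite a bit (a quick sketch; both are guidelines rather than hard limits):
---
# The two common RAM guidelines applied to a 12 x 10 TB OSD node.
hdd_count, hdd_tb = 12, 10

gb_per_tb_rule = hdd_count * hdd_tb      # 1 GB RAM per 1 TB of disk -> 120 GB
redhat_rule = 16 + 2 * hdd_count         # 16 GB base + 2 GB per HDD  ->  40 GB

print("1 GB/TB rule:     %3d GB per node" % gb_per_tb_rule)
print("16 GB + 2 GB/HDD: %3d GB per node" % redhat_rule)
print("Proposed:          64 GB per node")
---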

BTW: does anyone have comments on SSD sizes for Bluestore, or on the other questions?

Jan





2018-03-09 14:58 GMT+01:00 Brady Deetz :

> I'd increase ram. 1GB per 1TB of disk is the recommendation.
>
> Another thing you need to consider is your node density. 12x10TB is a lot
> of data to have to rebalance if you aren't going to have 20+ nodes. I have
> 17 nodes with 24x6TB disks each. Rebuilds can take what seems like an
> eternity. It may be worth looking at cheaper sockets and smaller disks in
> order to increase your node count.
>
> How many nodes will this cluster have?
>
>
> On Mar 9, 2018 4:16 AM, "Ján Senko"  wrote:
>
> I am planning a new Ceph deployement and I have few questions that I could
> not find good answers yet.
>
> Our nodes will be using Xeon-D machines with 12 HDDs each and 64GB each.
> Our target is to use 10TB drives for 120TB capacity per node.
>
> 1. We want to have small amount of SSDs in the machines. For OS and I
> guess for WAL/DB of Bluestore. I am thinking about having a RAID 1 with two
> 400GB 2.5" SSD drives. Will this fit WAL/DB? We plan to store many small
> objects.
> 2. While doing scrub/deep scrub, is there any significant network traffic?
> Assuming we are using Erasure coding pool, how do the nodes check the
> consistency of an object? Do they transfer the whole object chunks or do
> they only transfer the checksums?
> 3. We have to decide on which HDD to use, and there is a question of HGST
> vs Seagate, 512e vs 4kn sectors, SATA vs SAS. Do you have some tips for
> these decisions? We do not have very high IO, so we do not need performance
> at any cost. As for manufacturer and the sector size, I haven't found any
> guidelines/benchmarks that would steer me towards any.
>
> Thank you for your insight
> Jan
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>


-- 
Jan Senko, Skype janos-
Phone in Switzerland: +41 774 144 602
Phone in Czech Republic: +420 777 843 818


Re: [ceph-users] New Ceph cluster design

2018-03-09 Thread Brady Deetz
I'd increase RAM; 1GB per 1TB of disk is the recommendation.

Another thing you need to consider is your node density. 12x10TB is a lot
of data to have to rebalance if you aren't going to have 20+ nodes. I have
17 nodes with 24x6TB disks each. Rebuilds can take what seems like an
eternity. It may be worth looking at cheaper sockets and smaller disks in
order to increase your node count.
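
To put the density concern into rough numbers, a sketch of what losing one
such node means (the fill ratio and recovery bandwidth here are
assumptions; real behaviour depends on CRUSH rules, fullness and recovery
throttling):
---
# Back-of-the-envelope recovery estimate for losing one 12 x 10 TB node.
node_raw_tb = 120         # raw capacity per node
fill_ratio = 0.7          # assumed cluster fullness
recovery_mb_s = 1000      # assumed usable recovery bandwidth (~10 Gb/s effective)

data_to_move_tb = node_raw_tb * fill_ratio
hours = data_to_move_tb * 1024 * 1024 / recovery_mb_s / 3600
print("~%.0f TB to re-replicate, ~%.0f hours at %d MB/s"
      % (data_to_move_tb, hours, recovery_mb_s))
---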

How many nodes will this cluster have?


On Mar 9, 2018 4:16 AM, "Ján Senko"  wrote:

I am planning a new Ceph deployement and I have few questions that I could
not find good answers yet.

Our nodes will be using Xeon-D machines with 12 HDDs each and 64GB each.
Our target is to use 10TB drives for 120TB capacity per node.

1. We want to have small amount of SSDs in the machines. For OS and I guess
for WAL/DB of Bluestore. I am thinking about having a RAID 1 with two 400GB
2.5" SSD drives. Will this fit WAL/DB? We plan to store many small objects.
2. While doing scrub/deep scrub, is there any significant network traffic?
Assuming we are using Erasure coding pool, how do the nodes check the
consistency of an object? Do they transfer the whole object chunks or do
they only transfer the checksums?
3. We have to decide on which HDD to use, and there is a question of HGST
vs Seagate, 512e vs 4kn sectors, SATA vs SAS. Do you have some tips for
these decisions? We do not have very high IO, so we do not need performance
at any cost. As for manufacturer and the sector size, I haven't found any
guidelines/benchmarks that would steer me towards any.

Thank you for your insight
Jan





Re: [ceph-users] New Ceph cluster design

2018-03-09 Thread Tristan Le Toullec

Hi,
Same experience here: we had trouble with the OOM killer terminating OSD
processes on nodes with ten 8 TB disks. After an upgrade to 128 GB of RAM
these troubles disappeared.


Recommendations on memory aren't overestimated.

Regards,
Tristan


On 09/03/2018 11:31, Eino Tuominen wrote:

On 09/03/2018 12.16, Ján Senko wrote:

I am planning a new Ceph deployement and I have few questions that I 
could not find good answers yet.


Our nodes will be using Xeon-D machines with 12 HDDs each and 64GB each.
Our target is to use 10TB drives for 120TB capacity per node.
We ran into problems with 20 x 6 TB drives and 64 GB memory which we 
then increased to 128 GB. According to my experience the 
recommendation of 1 GB of memory per 1 TB of disk space has to be 
taken seriously.






Re: [ceph-users] New Ceph cluster design

2018-03-09 Thread Eino Tuominen

On 09/03/2018 12.16, Ján Senko wrote:

I am planning a new Ceph deployement and I have few questions that I 
could not find good answers yet.


Our nodes will be using Xeon-D machines with 12 HDDs each and 64GB each.
Our target is to use 10TB drives for 120TB capacity per node.

We ran into problems with 20 x 6 TB drives and 64 GB of memory, which we
then increased to 128 GB. In my experience, the recommendation of 1 GB of
memory per 1 TB of disk space has to be taken seriously.


--
  Eino Tuominen



[ceph-users] New Ceph cluster design

2018-03-09 Thread Ján Senko
I am planning a new Ceph deployment and I have a few questions that I could
not find good answers to yet.

Our nodes will be using Xeon-D machines with 12 HDDs and 64GB of RAM each.
Our target is to use 10TB drives for 120TB capacity per node.

1. We want to have a small number of SSDs in the machines, for the OS and I
guess for the WAL/DB of Bluestore. I am thinking about having a RAID 1 with
two 400GB 2.5" SSD drives. Will this fit the WAL/DB? We plan to store many
small objects.
2. While doing scrub/deep scrub, is there any significant network traffic?
Assuming we are using an erasure-coded pool, how do the nodes check the
consistency of an object? Do they transfer the whole object chunks, or do
they only transfer the checksums?
3. We have to decide on which HDD to use, and there is a question of HGST
vs Seagate, 512e vs 4kn sectors, SATA vs SAS. Do you have some tips for
these decisions? We do not have very high IO, so we do not need performance
at any cost. As for manufacturer and the sector size, I haven't found any
guidelines/benchmarks that would steer me towards any.

Thank you for your insight
Jan