Re: [ceph-users] New Ceph cluster design
Hello,

On Sat, 10 Mar 2018 16:14:53 +0100 Vincent Godin wrote:

> Hi,
>
> As I understand it, you'll have one RAID1 of two SSDs for 12 HDDs. A
> WAL is used for all writes on your host.

This isn't filestore; AFAIK with Bluestore the WAL/DB will be used for small writes only, to keep latency akin to filestore levels. Large writes will go directly to the HDDs. However, each write will of course necessitate a write to the DB as well, and thus IOPS (much more so than bandwidth) are paramount here.

> If you have good SSDs, they can handle 450-550 MB/s. Your 12 SATA HDDs
> can handle 12 x 100 MB/s, that is to say 1200 MB/s.

Aside from what I wrote above, I'd like to repeat myself and others here for the umpteenth time: focusing on bandwidth is a fallacy in nearly all use cases; IOPS tend to become the bottleneck. Also, that's 1.2 GB/s, or 1200 MB/s.

The OP stated 10TB HDDs and many (but not exclusively?) small objects, so if we're looking at lots of small writes, the bandwidth of the SSDs becomes a factor again, and with the sizes involved they appear too small as well (going with the rough ratio of 10GB of DB per TB of HDD). Either a RAID1 of at least 1600GB NVMes, or two 800GB NVMes with a resulting failure domain of 6 HDDs each, would be a better/safer fit.

> So your RAID 1 will be the bottleneck with this design. A good design
> would be to have one SSD for 4 or 5 HDDs. In your case, the best option
> would be to start with 3 SSDs for 12 HDDs to have a balanced node.
> Don't forget to choose SSDs with a high DWPD rating (>10).

More SSDs/NVMes are of course better, and DWPD is important, but probably less so than with filestore journals. A DWPD of >10 is overkill for anything I've ever encountered; for many things 3 will be fine, especially if one knows what load is expected.
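That DB sizing arithmetic can be sketched quickly (the 10GB-of-DB-per-TB figure is the rough rule of thumb used above, not a hard Bluestore requirement):

```python
# Back-of-the-envelope Bluestore DB sizing for the OP's node,
# assuming the rough ~10 GB of DB/WAL space per TB of HDD capacity.
def required_db_gb(hdd_count, hdd_tb, db_gb_per_tb=10):
    """Total DB/WAL space needed on flash for one node, in GB."""
    return hdd_count * hdd_tb * db_gb_per_tb

needed = required_db_gb(hdd_count=12, hdd_tb=10)  # OP's node: 12 x 10 TB
print(needed)        # 1200 GB needed, so a 400 GB RAID1 is far too small;
print(needed > 400)  # True -> 2x 800 GB NVMe or a 1600 GB RAID1 fits better
```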
For example, a filestore cache tier SSD with inline journal (800GB DC S3610, 3 DWPD) has a media wearout of 97 (3% used) after 2 years with this constant and not insignificant load:
---
Device:  rrqm/s  wrqm/s   r/s    w/s   rkB/s    wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdb        0.03   83.09  7.07 303.24  746.64  5084.99     37.59      0.05   0.15     0.71     0.13   0.06   2.00
---
300 write IOPS and 5 MB/s for all that time.

Christian
--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Rakuten Communications
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
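For what it's worth, those wearout numbers roughly check out; this sketch assumes the usual 5-year window for a DWPD endurance rating:

```python
# Endurance check for the 800 GB, 3 DWPD drive above, sustaining ~5 MB/s
# of writes for 2 years. The 5-year rating window is an assumption.
SECONDS_PER_YEAR = 365 * 24 * 3600

written_tb = 5 * SECONDS_PER_YEAR * 2 / 1e6  # 5 MB/s over 2 years -> TB
rated_tb = 800 * 3 * 365 * 5 / 1e3           # GB/day over 5 years -> TB
print(round(written_tb))                     # ~315 TB actually written
print(round(rated_tb))                       # 4380 TB rated endurance
print(round(100 * written_tb / rated_tb, 1)) # ~7% of rated endurance,
# the same ballpark as the 3% wearout the drive itself reports.
```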
Re: [ceph-users] New Ceph cluster design
Hi,

As I understand it, you'll have one RAID1 of two SSDs for 12 HDDs. A WAL is used for all writes on your host. If you have good SSDs, they can handle 450-550 MB/s. Your 12 SATA HDDs can handle 12 x 100 MB/s, that is to say 1200 MB/s. So your RAID 1 will be the bottleneck with this design. A good design would be to have one SSD for 4 or 5 HDDs. In your case, the best option would be to start with 3 SSDs for 12 HDDs to have a balanced node. Don't forget to choose SSDs with a high DWPD rating (>10).

The network needs of your node depend on the bandwidth of your disks. As explained above, your 12 HDDs can handle 1200 MB/s, so you need a public and a private network that can each handle it. In your case, a minimum of two 10 Gbps networks per node is needed. If you need redundancy, use two LACP bonds, each with two 10 Gbps links.

The scrub and deep-scrub operations will not have a significant impact on your network, but they will on your disk utilisation, so you should schedule them during periods of low client usage.
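The disk-to-network arithmetic above works out as follows (a sketch; 100 MB/s per HDD is the rough sequential figure used in the message):

```python
# Sizing the network from aggregate disk bandwidth, per the message above.
hdds, mb_per_s_each = 12, 100
disk_mb_s = hdds * mb_per_s_each  # 1200 MB/s aggregate disk bandwidth
disk_gbps = disk_mb_s * 8 / 1000  # = 9.6 Gbps
print(disk_gbps)  # 9.6 -> one 10 Gbps link each for the public and
                  # cluster networks, or 2x 10 Gbps LACP for redundancy
```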
Re: [ceph-users] New Ceph cluster design
On Fri, Mar 09, 2018 at 03:06:15PM +0100, Ján Senko wrote:
:We are looking at 100+ nodes.
:
:I know that the Ceph official recommendation is 1GB of RAM per 1TB of disk.
:Was this ever changed since 2015?
:CERN is definitely using less (source:
:https://cds.cern.ch/record/2015206/files/CephScaleTestMarch2015.pdf)

Looking at my recently (re)installed luminous bluestore nodes, at 24hr peak (5min average) RAM utilization I'm seeing 40G committed and ~30G active RAM on nodes with 10x4T drives, and ~82G committed, 57G active on nodes with 24x2T drives (on average 45.77% full).

data:
  pools:   19 pools, 10240 pgs
  objects: 16820k objects, 77257 GB
  usage:   228 TB used, 271 TB / 499 TB avail
  pgs:     10240 active+clean
(12 storage nodes, 173 osds)

This is almost entirely RBD for OpenStack VMs; only a negligible amount is radosgw-type object storage, and none is erasure coded. I spec'ed a bit over the recommended RAM (for example 64G of RAM for 40T of storage), so I've not had memory issues with the older filestore or newer bluestore implementations, but I would still round up rather than down for my use case anyway.

:RedHat suggests using 16GB + 2GB/HDD as the latest requirements.
:
:BTW: Anyone has comments on SSD sizes for Bluestore or the other questions?

These systems are using 10G:1T SSD:7.2K_SAS_DISK (i.e. 40GB of SSD for each 4T HDD). This seems sufficient (running with WAL and DB on spinners really tanks IOPS capacity), but I don't know that it is optimal. It is close enough to the RedHat recommendation that I would believe them.

Note that we've moved to more, smaller disks (the 2T are newer) as we were running out of IOPS; maybe more SSD in front would help, or maybe our usage pattern, being so heavy in active volume use as opposed to cold object storage, is unusual. Obviously 10k or 15k drives would help, and my next expansion probably will use them, as we're still at a higher % of our IOPS capacity utilization than we are of our storage capacity utilization...
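Turning the observed figures above into ratios (using active RAM; committed would be somewhat higher):

```python
# RAM ratios from the observed node figures above, using active RAM.
nodes = [
    {"osds": 10, "tb_per_osd": 4, "active_gb": 30},  # 10 x 4T node
    {"osds": 24, "tb_per_osd": 2, "active_gb": 57},  # 24 x 2T node
]
gb_per_osd = [n["active_gb"] / n["osds"] for n in nodes]
gb_per_tb = [n["active_gb"] / (n["osds"] * n["tb_per_osd"]) for n in nodes]
print(gb_per_osd)  # [3.0, 2.375]  GB per OSD: fairly stable
print(gb_per_tb)   # [0.75, 1.1875] GB per TB: varies with drive size
# Per-OSD usage being steadier than per-TB usage fits RedHat's
# "16GB + 2GB per HDD" style of rule better than a flat 1GB/TB rule.
```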
-Jon
Re: [ceph-users] New Ceph cluster design
What you linked was only a 2-week test. When Ceph is healthy it does not need a lot of RAM; it's during recovery that OOM appears, and that's when you'll find yourself upgrading the RAM on your nodes just to stop the OOM kills and allow the cluster to recover. Look through the mailing list and you'll see that this is one of the most common mistakes made when spec'ing hardware for Ceph.
Re: [ceph-users] New Ceph cluster design
We are looking at 100+ nodes.

I know that the Ceph official recommendation is 1GB of RAM per 1TB of disk. Was this ever changed since 2015? CERN is definitely using less (source: https://cds.cern.ch/record/2015206/files/CephScaleTestMarch2015.pdf)

RedHat suggests using 16GB + 2GB/HDD as the latest requirements.

BTW: Does anyone have comments on SSD sizes for Bluestore, or the other questions?

Jan

2018-03-09 14:58 GMT+01:00 Brady Deetz:
> I'd increase RAM. 1GB per 1TB of disk is the recommendation.
>
> Another thing you need to consider is your node density. 12x10TB is a lot
> of data to have to rebalance if you aren't going to have 20+ nodes. I have
> 17 nodes with 24x6TB disks each. Rebuilds can take what seems like an
> eternity. It may be worth looking at cheaper sockets and smaller disks in
> order to increase your node count.
>
> How many nodes will this cluster have?
>
> On Mar 9, 2018 4:16 AM, "Ján Senko" wrote:
>
> I am planning a new Ceph deployment and I have a few questions that I
> could not find good answers to yet.
>
> Our nodes will be using Xeon-D machines with 12 HDDs each and 64GB each.
> Our target is to use 10TB drives for 120TB capacity per node.
>
> 1. We want to have a small number of SSDs in the machines, for the OS and
> I guess for the WAL/DB of Bluestore. I am thinking about having a RAID 1
> with two 400GB 2.5" SSD drives. Will this fit the WAL/DB? We plan to store
> many small objects.
> 2. While doing scrub/deep scrub, is there any significant network traffic?
> Assuming we are using an erasure coding pool, how do the nodes check the
> consistency of an object? Do they transfer the whole object chunks or do
> they only transfer the checksums?
> 3. We have to decide which HDDs to use, and there is a question of HGST
> vs Seagate, 512e vs 4kn sectors, SATA vs SAS. Do you have some tips for
> these decisions? We do not have very high IO, so we do not need
> performance at any cost. As for manufacturer and sector size, I haven't
> found any guidelines/benchmarks that would steer me towards any.
>
> Thank you for your insight
> Jan

--
Jan Senko, Skype janos-
Phone in Switzerland: +41 774 144 602
Phone in Czech Republic: +420 777 843 818
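For the OP's 12 x 10TB nodes the two RAM rules of thumb mentioned above diverge quite a bit:

```python
# Comparing the two RAM rules mentioned above for a 12 x 10 TB node.
hdds, tb_each = 12, 10

per_tb_rule = hdds * tb_each * 1  # 1 GB RAM per TB of disk
redhat_rule = 16 + 2 * hdds       # 16 GB base + 2 GB per HDD
print(per_tb_rule, redhat_rule)   # 120 40 -> with 10 TB drives the old
                                  # per-TB rule asks for 3x the RAM
```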
Re: [ceph-users] New Ceph cluster design
I'd increase RAM. 1GB per 1TB of disk is the recommendation.

Another thing you need to consider is your node density. 12x10TB is a lot of data to have to rebalance if you aren't going to have 20+ nodes. I have 17 nodes with 24x6TB disks each. Rebuilds can take what seems like an eternity. It may be worth looking at cheaper sockets and smaller disks in order to increase your node count.

How many nodes will this cluster have?

On Mar 9, 2018 4:16 AM, "Ján Senko" wrote:

I am planning a new Ceph deployment and I have a few questions that I could not find good answers to yet.

Our nodes will be using Xeon-D machines with 12 HDDs each and 64GB each. Our target is to use 10TB drives for 120TB capacity per node.

1. We want to have a small number of SSDs in the machines, for the OS and I guess for the WAL/DB of Bluestore. I am thinking about having a RAID 1 with two 400GB 2.5" SSD drives. Will this fit the WAL/DB? We plan to store many small objects.

2. While doing scrub/deep scrub, is there any significant network traffic? Assuming we are using an erasure coding pool, how do the nodes check the consistency of an object? Do they transfer the whole object chunks or do they only transfer the checksums?

3. We have to decide which HDDs to use, and there is a question of HGST vs Seagate, 512e vs 4kn sectors, SATA vs SAS. Do you have some tips for these decisions? We do not have very high IO, so we do not need performance at any cost. As for manufacturer and sector size, I haven't found any guidelines/benchmarks that would steer me towards any.

Thank you for your insight
Jan
Re: [ceph-users] New Ceph cluster design
Hi,

Same experience here: we had trouble with OutOfMemory kills of the OSD process with ten 8 TB disks. After an upgrade to 128 GB these troubles disappeared. The recommendations on memory aren't overestimated.

Regards,
Tristan

On 09/03/2018 11:31, Eino Tuominen wrote:
> On 09/03/2018 12.16, Ján Senko wrote:
>> I am planning a new Ceph deployment and I have a few questions that I
>> could not find good answers to yet. Our nodes will be using Xeon-D
>> machines with 12 HDDs each and 64GB each. Our target is to use 10TB
>> drives for 120TB capacity per node.
>
> We ran into problems with 20 x 6 TB drives and 64 GB memory, which we
> then increased to 128 GB. In my experience the recommendation of 1 GB
> of memory per 1 TB of disk space has to be taken seriously.
Re: [ceph-users] New Ceph cluster design
On 09/03/2018 12.16, Ján Senko wrote:
> I am planning a new Ceph deployment and I have a few questions that I
> could not find good answers to yet. Our nodes will be using Xeon-D
> machines with 12 HDDs each and 64GB each. Our target is to use 10TB
> drives for 120TB capacity per node.

We ran into problems with 20 x 6 TB drives and 64 GB memory, which we then increased to 128 GB. In my experience, the recommendation of 1 GB of memory per 1 TB of disk space has to be taken seriously.

--
Eino Tuominen
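Those numbers line up with the 1 GB per TB rule:

```python
# Checking the experience above against the 1 GB RAM per 1 TB of disk rule.
capacity_tb = 20 * 6             # 20 drives x 6 TB
suggested_ram_gb = capacity_tb   # 1 GB of RAM per TB
print(suggested_ram_gb)          # 120 -> 64 GB was well short; 128 GB fits
```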
[ceph-users] New Ceph cluster design
I am planning a new Ceph deployment and I have a few questions that I could not find good answers to yet.

Our nodes will be using Xeon-D machines with 12 HDDs each and 64GB each. Our target is to use 10TB drives for 120TB capacity per node.

1. We want to have a small number of SSDs in the machines, for the OS and I guess for the WAL/DB of Bluestore. I am thinking about having a RAID 1 with two 400GB 2.5" SSD drives. Will this fit the WAL/DB? We plan to store many small objects.

2. While doing scrub/deep scrub, is there any significant network traffic? Assuming we are using an erasure coding pool, how do the nodes check the consistency of an object? Do they transfer the whole object chunks or do they only transfer the checksums?

3. We have to decide which HDDs to use, and there is a question of HGST vs Seagate, 512e vs 4kn sectors, SATA vs SAS. Do you have some tips for these decisions? We do not have very high IO, so we do not need performance at any cost. As for manufacturer and sector size, I haven't found any guidelines/benchmarks that would steer me towards any.

Thank you for your insight
Jan