Re: [ceph-users] CEPH cluster to meet 5 msec latency

2016-10-20 Thread Christian Balzer

Hello,

re-adding the ML, so everybody benefits from this.

On Thu, 20 Oct 2016 14:03:56 +0530 Subba Rao K wrote:

> Hi Christian,
> 
> I have seen one of your responses in CEPH user group and wanted some help
> from you.
> 
> Can you please share HW configuration of the CEPH cluster which can service
> within 5 msec.
> 

Make sure you re-read that mail and understand what I wrote there.
It's not the entire cluster, just the cache-tier.
And I have a very specific use case, ideally suited for cache-tiering.

There are 450 VMs, all running the same application, which tends to write
small logs and, more importantly, status and lock files.
It basically never does any reads after booting, and if it does, those come
from the in-VM pagecache.
A typical state of affairs from the view of Ceph is this:
---
client io 12948 kB/s wr, 2626 op/s
---

Again, these writes tend to be mostly to the same files (and thus Ceph
objects) over and over.

So the hot data and working set is rather small, significantly smaller
than the actual cache-pool: 4x DC S3610 800GB x 2 (nodes) / 2 (replication),
i.e. roughly 3.2TB usable.


> To meet the 5 msec latency, I was contemplating between All-SSD Ceph
> Cluster and Cache-tier Ceph Cluster with SAS Drives. With out test data I
> am unable to decide.
> 

Both can work, but without knowing your use case and working set it is
impossible to say which is the better fit.

Do read the current "RBD with SSD journals and SAS OSDs" thread, it has
lots of valuable information pertaining to this.

Based on that thread, the cache-tier of my test cluster can do this from
inside a VM:

fio --size=1G --ioengine=libaio --invalidate=1 --sync=1 --numjobs=1 --rw=write 
--name=fiojob --blocksize=4K --iodepth=1
---
  write: io=31972KB, bw=1183.7KB/s, iops=295, runt= 27012msec
slat (msec): min=1, max=12, avg= 3.38, stdev= 1.31
clat (usec): min=0, max=13, avg= 1.16, stdev= 0.44
 lat (msec): min=1, max=12, avg= 3.38, stdev= 1.31
---

The cache-tier HW is 2 nodes with 32GB RAM, one rather meek E5-2620 v3
(running at PERFORMANCE though), 2x DC S3610 400GB (split into 4 OSDs) and
QDR (40Gb/s) Infiniband (IPoIB).
Hammer, replication 2.

So obviously something beefier in the CPU and storage department (NVMe
comes to mind) should be even better; people have reached about 1ms for
4K sync writes.

So if you have a DB-type application with a well-known working set and can
fit that into an NVMe cache-tier you can afford, that would be perfect.
Settings like "readforward" on the cache-tier can also keep it from getting
"polluted" by reads and thus free for all your writes.

An ideal cache-tier node would have something like 2 NVMes (both Intel
and Samsung make decent ones, specifics depend on your needs like
endurance), a single CPU with FAST cores (6-8 cores over 3GHz) and the
lowest latency networking you can afford (40Gb/s better than 10, etc).
You may get away with a replication of 2 here IF the NVMes are well known,
trusted AND monitored, thus saving you a good deal of latency (0.5ms at
least I reckon).

I'd still go for SSD journals for any HDD OSDs, though.

The inherent (write) latency of (SATA/SAS) SSDs is larger than that of NVMe
drives, but if you were to go for a full SSD cluster you should still be able
to meet that 5ms easily and won't have to worry about the complexity and
risks of cache-tiering.
OTOH you will want a replication of 3, with the resulting latency penalty
(and costs).


Then at the top end of cost and performance, you'd have a SSD cluster with
NVMe journals.

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] effectively reducing scrub io impact

2016-10-20 Thread Christian Balzer

Hello,

On Thu, 20 Oct 2016 15:03:02 +0200 Oliver Dzombic wrote:

> Hi Christian,
> 
> thank you for your time.
> 
> The problem is deep scrub only.
> 
> Jewel 10.2.2 is used.
>
Hmm, I was under the impression that the unified queue in Jewel was
supposed to stop scrubs from eating all the I/O babies.
 
> Thank you for your hint with manual deep scrubs on specific OSDs. I
> didn't come up with that idea.
> 
It's been discussed here before, including the feature request to set the
"last scrubbed" values with an external tool, so that you DON'T
actually have to kick off a scrub to get the desired timing. 

> -
> 
> Where do you know
> 
> osd_scrub_sleep
> 
> from ?
> 
Reading more or less every article here for the last 3 years and
definitely reading release notes religiously. 

> I have seen here lately on the mailing list multiple mentions of "hidden"
> config options (where "hidden" means everything that is not mentioned in
> the docs at ceph.com).
> 
Lots of that going on, including massive behavior changes like the one in
cache-tiering from Hammer to Jewel that aren't pointed out, nor are the new
parameters that control them.

Christian

> ceph.com does not document the osd_scrub_sleep config option (except for
> mentions in past release notes)
> 
> The search engine finds it mainly in github or bugtracker.
> 
> Is there any source of a (complete) list of available config options,
> usable by normal admins?
> 
> Or is it really necessary to grep through source code and release
> notes to collect that kind of information on your own?
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size

2016-10-20 Thread Christian Balzer

Hello,

On Thu, 20 Oct 2016 15:45:34 + Jim Kilborn wrote:

Good to know.

You may be able to squeeze some more 4K write IOPS out of this by cranking
the CPUs to full speed, see the relevant recent threads about this.
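
For example (a sketch, assuming the cpupower tool is installed; governor names
can vary with the scaling driver):

  cpupower frequency-set -g performance   # force all cores to full speed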

As for the 120GB SSDs as journals (there is no 128GB SM863 model according to
Samsung), keep in mind that in your current cluster that limits you to about
1 TBW/day if you want them to survive 5 years.
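
If you go that route, it is worth keeping an eye on the wear attributes,
e.g. (device name is a placeholder, attribute names vary by vendor):

  smartctl -A /dev/sdX | grep -Ei 'wear|lbas_written'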

Christian
> The chart obviously didn’t go well. Here it is again
> 
> 
> 
> fio --direct=1 --sync=1 --rw={write,randwrite,read,randread} --bs={4M,4K} 
> --numjobs=1 --iodepth=1 --runtime=60 --size=5G --time_based --group_reporting 
> --name=journal-test
> 
> 
> 
> FIO Test         Local disk              SAN/NFS                 Ceph size=3/SSD journal
> 4M Writes        53 MB/sec    12 IOPS    62 MB/sec     15 IOPS   151 MB/sec    37 IOPS
> 4M Rand Writes   34 MB/sec     8 IOPS    63 MB/sec     15 IOPS   155 MB/sec    37 IOPS
> 4M Read          66 MB/sec    15 IOPS    102 MB/sec    25 IOPS   662 MB/sec   161 IOPS
> 4M Rand Read     73 MB/sec    17 IOPS    103 MB/sec    25 IOPS   670 MB/sec   163 IOPS
> 4K Writes        2.9 MB/sec  738 IOPS    3.8 MB/sec   952 IOPS   2.3 MB/sec   571 IOPS
> 4K Rand Writes   551 KB/sec  134 IOPS    3.6 MB/sec   911 IOPS   2.0 MB/sec   501 IOPS
> 4K Read          28 MB/sec  7001 IOPS    8 MB/sec    1945 IOPS   13 MB/sec   3256 IOPS
> 4K Rand Read     263 KB/sec              5 MB/sec    1246 IOPS   8 MB/sec    2015 IOPS
> 
> 
> 
> Sent from Mail for Windows 10
> 
> 
> 
> From: Jim Kilborn
> Sent: Thursday, October 20, 2016 10:20 AM
> To: Christian Balzer; 
> ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] New cephfs cluster performance issues- Jewel - 
> cache pressure, capability release, poor iostat await avg queue size
> 
> 
> 
> Thanks Christian for the additional information and comments.
> 
> 
> 
> · upgraded the kernels, but still had poor performance
> 
> · Removed all the pools and recreated with just a replication of 3, 
> with the two pool for the data and metadata. No cache tier pool
> 
> · Turned back on the write caching with hdparm. We do have a Large 
> UPS and dual power supplies in the ceph unit. If we get a long power outage, 
> everything will go down anyway.
> 
> 
> 
> I am no longer seeing the issue of the slow requests, ops blocked, etc.
> 
> 
> 
> I think I will push for the following design per ceph server
> 
> 
> 
> 8  4TB sata drives
> 
> 2 Samsung 128GB SM863 SSD each holding 4 osd journals
> 
> 
> 
> With 4 hosts, and a replication of 3 to start with
> 
> 
> 
> I did a quick test with 4 - 4TB spinners and 1 Samsung 128GB SM863 SSD 
> holding the  4 osd journals, with 4 hosts in the cluster over infiniband.
> 
> 
> 
> At the 4M read, watching iftop, the client is receiving between  4.5 GB/sec - 
> 5.5Gb/sec over infiniband
> 
> Which is around 600MB/sec and translates well to the FIO number.
> 
> 
> 
> fio --direct=1 --sync=1 --rw={write,randwrite,read,randread} --bs={4M,4K} 
> --numjobs=1 --iodepth=1 --runtime=60 --size=5G --time_based --group_reporting 
> --name=journal-test
> 
> 
> 
> FIO Test         Local disk              SAN/NFS                 Ceph w/Repl/SSD journal
> 4M Writes        53 MB/sec    12 IOPS    62 MB/sec     15 IOPS   151 MB/sec    37 IOPS
> 4M Rand Writes   34 MB/sec     8 IOPS    63 MB/sec     15 IOPS   155 MB/sec    37 IOPS
> 4M Read          66 MB/sec    15 IOPS    102 MB/sec    25 IOPS   662 MB/sec   161 IOPS
> 4M Rand Read     73 MB/sec    17 IOPS    103 MB/sec    25 IOPS   670 MB/sec   163 IOPS
> 4K Writes        2.9 MB/sec  738 IOPS    3.8 MB/sec   952 IOPS   2.3 MB/sec   571 IOPS
> 4K Rand Writes   551 KB/sec  134 IOPS    3.6 MB/sec   911 IOPS   2.0 MB/sec   501 IOPS
> 4K Read          28 MB/sec  7001 IOPS    8 MB/sec    1945 IOPS   13 MB/sec   3256 IOPS
> 4K Rand Read     263 KB/sec              5 MB/sec    1246 IOPS   8 MB/sec    2015 IOPS
> 
> 
> 
> 
> That performance is fine for our needs
> 
> Again, thanks for the help guys.
> 
> 
> 
> Regards,
> 
> Jim
> 
> 
> 
> From: Christian Balzer
> Sent: Wednesday, October 19, 2016 7:54 PM
> To: ceph-users@lists.ceph.com
> Cc: Jim Kilborn
> Subject: Re: [ceph-users] New cephfs cluster performance issues- Jewel - 
> cache pressure, capability release, poor iostat await avg queue size
> 
> 
> 
> Hello,
> 
> On Wed, 19 Oct 2016 12:28:28 + Jim Kilborn wrote:
> 
> > I have 

[ceph-users] Ceph recommendations for ALL SSD

2016-10-20 Thread Ramakrishna Nishtala (rnishtal)
Hi
Any suggestions/recommendations on all-SSD setups for Ceph?

I see SSDs freeze occasionally on SATA drives, creating spikes in latency at
times; they recover after a brief pause of 20-30 secs. Any best practices, such
as colocated journals or not, schedulers, hdparm settings, etc., are
appreciated. Working on 1.3.

Regards,

Rama
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Issue with Ceph padding files out to ceph.dir.layout.stripe_unit size

2016-10-20 Thread Kate Ward
All are relatively recent Ubuntu 16.04.1 kernels. I upgraded ka05 last
night, but still see the issue. I'm happy to upgrade the rest.
$ for h in ka00 ka01 ka02 ka03 ka04 ka05; do ssh $h uname -a; done
Linux ka00 4.4.0-36-generic #55-Ubuntu SMP Thu Aug 11 18:00:59 UTC 2016
i686 i686 i686 GNU/Linux
Linux ka01 4.4.0-36-generic #55-Ubuntu SMP Thu Aug 11 18:01:55 UTC 2016
x86_64 x86_64 x86_64 GNU/Linux
Linux ka02 4.4.0-36-generic #55-Ubuntu SMP Thu Aug 11 18:01:55 UTC 2016
x86_64 x86_64 x86_64 GNU/Linux
Linux ka03 4.4.0-36-generic #55-Ubuntu SMP Thu Aug 11 18:01:55 UTC 2016
x86_64 x86_64 x86_64 GNU/Linux
Linux ka04 4.4.0-36-generic #55-Ubuntu SMP Thu Aug 11 18:01:55 UTC 2016
x86_64 x86_64 x86_64 GNU/Linux
Linux ka05 4.4.0-38-generic #57-Ubuntu SMP Tue Sep 6 15:42:33 UTC 2016
x86_64 x86_64 x86_64 GNU/Linux

k8

On Thu, Oct 20, 2016 at 11:39 PM John Spray  wrote:

> On Thu, Oct 20, 2016 at 10:15 PM, Kate Ward 
> wrote:
> > I have a strange problem that began manifesting after I rebuilt my
> cluster a
> > month or so back. A tiny subset of my files on CephFS are being
> zero-padded
> > out to the length of ceph.dir.layout.stripe_unit when the files are later
> > *read* (not when they are written). Tonight I realized the padding
> matched
> > the stripe_unit value of 1048576, and changed it to 4194304, which
> resulted
> > in those files that get padded taking on the new stripe_unit value. I've
> > since changed it back. I've tried searching Google for answers, and Ceph
> > bugs, but have had no luck so far.
> >
> > Current ceph.dir.layout setting for the entire cluster.
> > $ getfattr -n ceph.dir.layout /ceph/ka
> > getfattr: Removing leading '/' from absolute path names
> > # file: ceph/ka
> > ceph.dir.layout="stripe_unit=1048576 stripe_count=2 object_size=8388608
> > pool=cephfs_data"
> >
> > Ceph is mounted on all machines using the kernel driver. The problem is
> not
> > isolated to a single machine.
> > $ grep /ceph/ka /etc/mtab
> > backupz/ceph/ka /backupz/ceph/ka zfs rw,noatime,xattr,noacl 0 0
> > 172.16.0.11:6789,172.16.0.19:6789:/ /ceph/ka ceph
> > rw,noatime,nodiratime,name=admin,secret=,acl 0 0
> >
> > Files from a Subversion repository, where the last one was padded after I
> > tried to check out the repo.
> > kward@ka02 2016-10-20T22:57:13
> > %0]/ceph/ka/data/repoz/forestent/forestent/db/revs/0
> > $ ls -lrt |tail -5
> > -rw-r--r-- 1 www-data www-data1079 Oct 20 08:53 877
> > -rw-r--r-- 1 www-data www-data1415 Oct 20 08:55 878
> > -rw-r--r-- 1 www-data www-data1059 Oct 20 09:01 879
> > -rw-r--r-- 1 www-data www-data1318 Oct 20 09:36 880
> > -rw-r--r-- 1 www-data www-data 4194304 Oct 20 19:18 881
>
> This issue isn't one I immediately recognise.  What kernel version is
> in use on the clients?
>
> I would guess an issue like this comes from the file size recovery
> that we do if/when a client loses contact while writing a file, which
> I don't know that we ever test with non-default layouts, and might not
> work very well with layouts that have stripe_unit != object_size.
>
> If you set "debug mds = 10" and "debug filer = 10" on the MDS, and
> then capture the log from the point in time where you write a file to
> the point in time where the file is statted and gives the incorrect
> size (if it's reproducible that readily), that should give us a
> better idea.
>
> John
>
> >
> > Files stored, then later accessed via WebDAV. Only those files accessed
> were
> > subsequently padded.
> > [kward@ka02 2016-10-20T23:01:42 %0]~/www/webdav/OmniFocus.ofocus
> > $ ls -l
> > total 16389
> > -rw-r--r-- 1 www-data www-data 4194304 Oct 20 20:08
> > 00=ay-_KSusSw8+jOtYClSC2kx.zip
> > -rw-r--r-- 1 www-data www-data1383 Oct 20 19:22
> > 20161020172209=pP4DpDOXAaA.client
> > -rw-r--r-- 1 www-data www-data 4194304 Oct 20 20:20
> > 20161020182047=pP4DpDOXAaA.client
> > -rw-r--r-- 1 www-data www-data 4194304 Oct 20 21:11
> > 20161020191117=pP4DpDOXAaA.client
> > -rw-r--r-- 1 www-data www-data1309 Oct 20 21:56
> > 20161020195647=jY9iwiPfUhB.client
> > -rw-r--r-- 1 www-data www-data 4194304 Oct 20 22:04
> > 20161020200427=pP4DpDOXAaA.client
> > -rw-r--r-- 1 www-data www-data1309 Oct 20 22:54
> > 20161020205415=jY9iwiPfUhB.client
> >
> > Cluster lists as healthy. (Yes, I'm aware one of the OSDs is currently
> down.
> > The issue was there two months before it went down.)
> > $ ceph status
> > cluster f13b6373-0cdc-4372-85a2-66bf2841e313
> >  health HEALTH_OK
> >  monmap e3: 3 mons at
> > {ka01=172.16.0.11:6789/0,ka03=172.16.0.15:6789/0,ka04=172.16.0.17:6789/0
> }
> > election epoch 36, quorum 0,1,2 ka01,ka03,ka04
> >   fsmap e1140219: 1/1/1 up {0=ka01=up:active}, 2 up:standby
> >  osdmap e1234338: 16 osds: 15 up, 15 in
> > flags sortbitwise
> >   pgmap v2296058: 1216 pgs, 3 pools, 7343 GB data, 1718 kobjects
> > 14801 GB used, 19360 GB / 34161 GB avail
> > 1216 

Re: [ceph-users] Issue with Ceph padding files out to ceph.dir.layout.stripe_unit size

2016-10-20 Thread John Spray
On Thu, Oct 20, 2016 at 10:15 PM, Kate Ward  wrote:
> I have a strange problem that began manifesting after I rebuilt my cluster a
> month or so back. A tiny subset of my files on CephFS are being zero-padded
> out to the length of ceph.dir.layout.stripe_unit when the files are later
> *read* (not when they are written). Tonight I realized the padding matched
> the stripe_unit value of 1048576, and changed it to 4194304, which resulted
> in those files that get padded taking on the new stripe_unit value. I've
> since changed it back. I've tried searching Google for answers, and Ceph
> bugs, but have had no luck so far.
>
> Current ceph.dir.layout setting for the entire cluster.
> $ getfattr -n ceph.dir.layout /ceph/ka
> getfattr: Removing leading '/' from absolute path names
> # file: ceph/ka
> ceph.dir.layout="stripe_unit=1048576 stripe_count=2 object_size=8388608
> pool=cephfs_data"
>
> Ceph is mounted on all machines using the kernel driver. The problem is not
> isolated to a single machine.
> $ grep /ceph/ka /etc/mtab
> backupz/ceph/ka /backupz/ceph/ka zfs rw,noatime,xattr,noacl 0 0
> 172.16.0.11:6789,172.16.0.19:6789:/ /ceph/ka ceph
> rw,noatime,nodiratime,name=admin,secret=,acl 0 0
>
> Files from a Subversion repository, where the last one was padded after I
> tried to check out the repo.
> kward@ka02 2016-10-20T22:57:13
> %0]/ceph/ka/data/repoz/forestent/forestent/db/revs/0
> $ ls -lrt |tail -5
> -rw-r--r-- 1 www-data www-data1079 Oct 20 08:53 877
> -rw-r--r-- 1 www-data www-data1415 Oct 20 08:55 878
> -rw-r--r-- 1 www-data www-data1059 Oct 20 09:01 879
> -rw-r--r-- 1 www-data www-data1318 Oct 20 09:36 880
> -rw-r--r-- 1 www-data www-data 4194304 Oct 20 19:18 881

This issue isn't one I immediately recognise.  What kernel version is
in use on the clients?

I would guess an issue like this comes from the file size recovery
that we do if/when a client loses contact while writing a file. I don't
know that we ever test that with non-default layouts, and it might not
work very well with layouts where stripe_unit != object_size.

If you set "debug mds = 10" and "debug filer = 10" on the MDS, and
then capture the log from the point in time where you write a file to
the point in time where the file is statted and gives the incorrect
size (if it's reproducible that readily), that should gives us a
better idea.
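
For example, something like this at runtime (the MDS name is a placeholder for
your active MDS):

  ceph daemon mds.<name> config set debug_mds 10
  ceph daemon mds.<name> config set debug_filer 10
  # or, from an admin node:
  ceph tell mds.<name> injectargs '--debug_mds 10 --debug_filer 10'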

John

>
> Files stored, then later accessed via WebDAV. Only those files accessed were
> subsequently padded.
> [kward@ka02 2016-10-20T23:01:42 %0]~/www/webdav/OmniFocus.ofocus
> $ ls -l
> total 16389
> -rw-r--r-- 1 www-data www-data 4194304 Oct 20 20:08
> 00=ay-_KSusSw8+jOtYClSC2kx.zip
> -rw-r--r-- 1 www-data www-data1383 Oct 20 19:22
> 20161020172209=pP4DpDOXAaA.client
> -rw-r--r-- 1 www-data www-data 4194304 Oct 20 20:20
> 20161020182047=pP4DpDOXAaA.client
> -rw-r--r-- 1 www-data www-data 4194304 Oct 20 21:11
> 20161020191117=pP4DpDOXAaA.client
> -rw-r--r-- 1 www-data www-data1309 Oct 20 21:56
> 20161020195647=jY9iwiPfUhB.client
> -rw-r--r-- 1 www-data www-data 4194304 Oct 20 22:04
> 20161020200427=pP4DpDOXAaA.client
> -rw-r--r-- 1 www-data www-data1309 Oct 20 22:54
> 20161020205415=jY9iwiPfUhB.client
>
> Cluster lists as healthy. (Yes, I'm aware one of the OSDs is currently down.
> The issue was there two months before it went down.)
> $ ceph status
> cluster f13b6373-0cdc-4372-85a2-66bf2841e313
>  health HEALTH_OK
>  monmap e3: 3 mons at
> {ka01=172.16.0.11:6789/0,ka03=172.16.0.15:6789/0,ka04=172.16.0.17:6789/0}
> election epoch 36, quorum 0,1,2 ka01,ka03,ka04
>   fsmap e1140219: 1/1/1 up {0=ka01=up:active}, 2 up:standby
>  osdmap e1234338: 16 osds: 15 up, 15 in
> flags sortbitwise
>   pgmap v2296058: 1216 pgs, 3 pools, 7343 GB data, 1718 kobjects
> 14801 GB used, 19360 GB / 34161 GB avail
> 1216 active+clean
>
> Details:
> - Ceph 10.2.2 (Ubuntu 16.04.1 packages)
> - 4x servers, each with 4x OSDs on HDDs (mixture of 2T and 3T drives);
> journals on SSD
> - 3x Mons, and 3x MDSs
> - Data is replicated 2x
> - The only usage of the cluster is via CephFS
>
> Kate
> https://ch.linkedin.com/in/kate-ward-1119b9
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Issue with Ceph padding files out to ceph.dir.layout.stripe_unit size

2016-10-20 Thread Kate Ward
I have a strange problem that began manifesting after I rebuilt my cluster
a month or so back. A tiny subset of my files on CephFS are being
zero-padded out to the length of ceph.dir.layout.stripe_unit when the files
are later *read* (not when they are written). Tonight I realized the
padding matched the stripe_unit value of 1048576, and changed it to
4194304, which resulted in those files that get padded taking on the new
stripe_unit value. I've since changed it back. I've tried searching Google
for answers, and Ceph bugs, but have had no luck so far.

Current ceph.dir.layout setting for the entire cluster.
$ getfattr -n ceph.dir.layout /ceph/ka
getfattr: Removing leading '/' from absolute path names
# file: ceph/ka
ceph.dir.layout="stripe_unit=1048576 stripe_count=2 object_size=8388608
pool=cephfs_data"

Ceph is mounted on all machines using the kernel driver. The problem is not
isolated to a single machine.
$ grep /ceph/ka /etc/mtab
backupz/ceph/ka /backupz/ceph/ka zfs rw,noatime,xattr,noacl 0 0
172.16.0.11:6789,172.16.0.19:6789:/ /ceph/ka ceph
rw,noatime,nodiratime,name=admin,secret=,acl 0 0

Files from a Subversion repository, where the last one was padded after I
tried to check out the repo.
kward@ka02 2016-10-20T22:57:13
%0]/ceph/ka/data/repoz/forestent/forestent/db/revs/0
$ ls -lrt |tail -5
-rw-r--r-- 1 www-data www-data1079 Oct 20 08:53 877
-rw-r--r-- 1 www-data www-data1415 Oct 20 08:55 878
-rw-r--r-- 1 www-data www-data1059 Oct 20 09:01 879
-rw-r--r-- 1 www-data www-data1318 Oct 20 09:36 880
-rw-r--r-- 1 www-data www-data 4194304 Oct 20 19:18 881

Files stored, then later accessed via WebDAV. Only those files accessed
were subsequently padded.
[kward@ka02 2016-10-20T23:01:42 %0]~/www/webdav/OmniFocus.ofocus
$ ls -l
total 16389
-rw-r--r-- 1 www-data www-data 4194304 Oct 20 20:08
00=ay-_KSusSw8+jOtYClSC2kx.zip
-rw-r--r-- 1 www-data www-data1383 Oct 20 19:22
20161020172209=pP4DpDOXAaA.client
-rw-r--r-- 1 www-data www-data 4194304 Oct 20 20:20
20161020182047=pP4DpDOXAaA.client
-rw-r--r-- 1 www-data www-data 4194304 Oct 20 21:11
20161020191117=pP4DpDOXAaA.client
-rw-r--r-- 1 www-data www-data1309 Oct 20 21:56
20161020195647=jY9iwiPfUhB.client
-rw-r--r-- 1 www-data www-data 4194304 Oct 20 22:04
20161020200427=pP4DpDOXAaA.client
-rw-r--r-- 1 www-data www-data1309 Oct 20 22:54
20161020205415=jY9iwiPfUhB.client

Cluster lists as healthy. (Yes, I'm aware one of the OSDs is currently
down. The issue was there two months before it went down.)
$ ceph status
cluster f13b6373-0cdc-4372-85a2-66bf2841e313
 health HEALTH_OK
 monmap e3: 3 mons at {ka01=
172.16.0.11:6789/0,ka03=172.16.0.15:6789/0,ka04=172.16.0.17:6789/0}
election epoch 36, quorum 0,1,2 ka01,ka03,ka04
  fsmap e1140219: 1/1/1 up {0=ka01=up:active}, 2 up:standby
 osdmap e1234338: 16 osds: 15 up, 15 in
flags sortbitwise
  pgmap v2296058: 1216 pgs, 3 pools, 7343 GB data, 1718 kobjects
14801 GB used, 19360 GB / 34161 GB avail
1216 active+clean

Details:
- Ceph 10.2.2 (Ubuntu 16.04.1 packages)
- 4x servers, each with 4x OSDs on HDDs (mixture of 2T and 3T drives);
journals on SSD
- 3x Mons, and 3x MDSs
- Data is replicated 2x
- The only usage of the cluster is via CephFS

Kate
https://ch.linkedin.com/in/kate-ward-1119b9
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Announcing the ceph-large mailing list

2016-10-20 Thread Stillwell, Bryan J
Do you run a large Ceph cluster?  Do you find that you run into issues
that you didn't have when your cluster was smaller?  If so we have a new
mailing list for you!

Announcing the new ceph-large mailing list.  This list is targeted at
experienced Ceph operators with cluster(s) over 500 OSDs to discuss
issues and experiences with going big.  If you're one of these people,
please join the list here:

http://lists.ceph.com/listinfo.cgi/ceph-large-ceph.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Memory leak in radosgw

2016-10-20 Thread Trey Palmer
I've been trying to test radosgw multisite and have a pretty bad memory
leak. It appears to be associated only with multisite sync.

Multisite works well for a small number of objects. However, it all fell
over when I wrote 8M (million) 64K objects into two buckets overnight for
testing (via cosbench).

The leak appears to happen on the multisite transfer source -- that is, the
node where the objects were written originally.   The radosgw process
eventually dies, I'm sure via the OOM killer, and systemd restarts it.
Then repeat, though multisite sync pretty much stops at that point.

I have tried 10.2.2, 10.2.3 and a combination of the two.   I'm running on
CentOS 7.2, using civetweb with SSL.   I saw that the memory profiler only
works on mon, osd and mds processes.

Anyone else seen anything like this?

   -- Trey
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph on two data centers far away

2016-10-20 Thread German Anders
Thanks, that's quite far actually, lol. And how are things going with rbd
mirroring?

*German*

2016-10-20 14:49 GMT-03:00 yan cui :

> The two data centers are actually cross US.  One is in the west, and the
> other in the east.
> We try to sync rdb images using RDB mirroring.
>
> 2016-10-20 9:54 GMT-07:00 German Anders :
>
>> from curiosity I wanted to ask you what kind of network topology are you
>> trying to use across the cluster? In this type of scenario you really need
>> an ultra low latency network, how far from each other?
>>
>> Best,
>>
>> *German*
>>
>> 2016-10-18 16:22 GMT-03:00 Sean Redmond :
>>
>>> Maybe this would be an option for you:
>>>
>>> http://docs.ceph.com/docs/jewel/rbd/rbd-mirroring/
>>>
>>>
>>> On Tue, Oct 18, 2016 at 8:18 PM, yan cui  wrote:
>>>
 Hi Guys,

Our company has a use case which needs the support of Ceph across
 two data centers (one data center is far away from the other). The
 experience of using one data center is good. We did some benchmarking on
 two data centers, and the performance is bad because of the synchronization
 feature in Ceph and the large latency between data centers. So, are there
 setups like data-center-aware features in Ceph, so that we have good
 locality? Usually, we use rbd to create volumes and snapshots. But we want
 the volume to be highly available with acceptable performance in case one data
 center is down. Our current setup does not consider data center
 differences. Any ideas?


 Thanks, Yan

 --
 Think big; Dream impossible; Make it happen.

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>
>
>
> --
> Think big; Dream impossible; Make it happen.
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] removing image of rbd mirroring

2016-10-20 Thread yan cui
Thanks Jason, I will try to use your method.

2016-10-19 17:23 GMT-07:00 Jason Dillaman :

> On Wed, Oct 19, 2016 at 6:52 PM, yan cui  wrote:
> > 2016-10-19 15:46:44.843053 7f35c9925d80 -1 librbd: cannot obtain
> exclusive
> > lock - not removing
>
> Are you attempting to delete the primary or non-primary image? I would
> expect any attempts to delete the non-primary image to fail since the
> non-primary image will automatically be deleted when mirroring is
> disabled on the primary side (or the primary image is deleted).
>
> There was an issue where the rbd-mirror daemon would not release the
> exclusive lock on the image after a forced promotion. The fix for that
> will be included in the forthcoming 10.2.4 release.
>
> If it is neither of these scenarios, can you re-run the "rbd rm"
> command with "--debug-rbd=20" option appended?
>
> Thanks,
>
> --
> Jason
>



-- 
Think big; Dream impossible; Make it happen.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph on two data centers far away

2016-10-20 Thread German Anders
Out of curiosity, I wanted to ask what kind of network topology you are
trying to use across the cluster. In this type of scenario you really need
an ultra-low-latency network; how far apart are the data centers?

Best,

*German*

2016-10-18 16:22 GMT-03:00 Sean Redmond :

> Maybe this would be an option for you:
>
> http://docs.ceph.com/docs/jewel/rbd/rbd-mirroring/
>
>
> On Tue, Oct 18, 2016 at 8:18 PM, yan cui  wrote:
>
>> Hi Guys,
>>
>>Our company has a use case which needs the support of Ceph across two
>> data centers (one data center is far away from the other). The experience
>> of using one data center is good. We did some benchmarking on two data
>> centers, and the performance is bad because of the synchronization feature
>> in Ceph and the large latency between data centers. So, are there setups
>> like data-center-aware features in Ceph, so that we have good locality?
>> Usually, we use rbd to create volumes and snapshots. But we want the volume
>> to be highly available with acceptable performance in case one data center is
>> down. Our current setup does not consider data center differences. Any
>> ideas?
>>
>>
>> Thanks, Yan
>>
>> --
>> Think big; Dream impossible; Make it happen.
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size

2016-10-20 Thread Jim Kilborn
The chart obviously didn’t go well. Here it is again



fio --direct=1 --sync=1 --rw={write,randwrite,read,randread} --bs={4M,4K} 
--numjobs=1 --iodepth=1 --runtime=60 --size=5G --time_based --group_reporting 
--name=journal-test



FIO Test         Local disk              SAN/NFS                 Ceph size=3/SSD journal
4M Writes        53 MB/sec    12 IOPS    62 MB/sec     15 IOPS   151 MB/sec    37 IOPS
4M Rand Writes   34 MB/sec     8 IOPS    63 MB/sec     15 IOPS   155 MB/sec    37 IOPS
4M Read          66 MB/sec    15 IOPS    102 MB/sec    25 IOPS   662 MB/sec   161 IOPS
4M Rand Read     73 MB/sec    17 IOPS    103 MB/sec    25 IOPS   670 MB/sec   163 IOPS
4K Writes        2.9 MB/sec  738 IOPS    3.8 MB/sec   952 IOPS   2.3 MB/sec   571 IOPS
4K Rand Writes   551 KB/sec  134 IOPS    3.6 MB/sec   911 IOPS   2.0 MB/sec   501 IOPS
4K Read          28 MB/sec  7001 IOPS    8 MB/sec    1945 IOPS   13 MB/sec   3256 IOPS
4K Rand Read     263 KB/sec              5 MB/sec    1246 IOPS   8 MB/sec    2015 IOPS



Sent from Mail for Windows 10



From: Jim Kilborn
Sent: Thursday, October 20, 2016 10:20 AM
To: Christian Balzer; 
ceph-users@lists.ceph.com
Subject: Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache 
pressure, capability release, poor iostat await avg queue size



Thanks Christian for the additional information and comments.



· Upgraded the kernels, but still had poor performance

· Removed all the pools and recreated them with just a replication of 3,
with the two pools for the data and metadata. No cache tier pool.

· Turned back on the write caching with hdparm. We do have a large UPS
and dual power supplies in the ceph unit. If we get a long power outage,
everything will go down anyway.



I am no longer seeing the issue of the slow requests, ops blocked, etc.



I think I will push for the following design per ceph server



8  4TB sata drives

2 Samsung 128GB SM863 SSD each holding 4 osd journals



With 4 hosts, and a replication of 3 to start with



I did a quick test with 4 - 4TB spinners and 1 Samsung 128GB SM863 SSD holding 
the  4 osd journals, with 4 hosts in the cluster over infiniband.



At the 4M read, watching iftop, the client is receiving between 4.5 Gb/sec and
5.5 Gb/sec over infiniband, which is around 600 MB/sec and translates well to
the FIO number.



fio --direct=1 --sync=1 --rw={write,randwrite,read,randread} --bs={4M,4K} 
--numjobs=1 --iodepth=1 --runtime=60 --size=5G --time_based --group_reporting 
--name=journal-test



FIO Test         Local disk              SAN/NFS                 Ceph w/Repl/SSD journal
4M Writes        53 MB/sec    12 IOPS    62 MB/sec     15 IOPS   151 MB/sec    37 IOPS
4M Rand Writes   34 MB/sec     8 IOPS    63 MB/sec     15 IOPS   155 MB/sec    37 IOPS
4M Read          66 MB/sec    15 IOPS    102 MB/sec    25 IOPS   662 MB/sec   161 IOPS
4M Rand Read     73 MB/sec    17 IOPS    103 MB/sec    25 IOPS   670 MB/sec   163 IOPS
4K Writes        2.9 MB/sec  738 IOPS    3.8 MB/sec   952 IOPS   2.3 MB/sec   571 IOPS
4K Rand Writes   551 KB/sec  134 IOPS    3.6 MB/sec   911 IOPS   2.0 MB/sec   501 IOPS
4K Read          28 MB/sec  7001 IOPS    8 MB/sec    1945 IOPS   13 MB/sec   3256 IOPS
4K Rand Read     263 KB/sec              5 MB/sec    1246 IOPS   8 MB/sec    2015 IOPS




That performance is fine for our needs

Again, thanks for the help guys.



Regards,

Jim



From: Christian Balzer
Sent: Wednesday, October 19, 2016 7:54 PM
To: ceph-users@lists.ceph.com
Cc: Jim Kilborn
Subject: Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache 
pressure, capability release, poor iostat await avg queue size



Hello,

On Wed, 19 Oct 2016 12:28:28 + Jim Kilborn wrote:

> I have setup a new linux cluster to allow migration from our old SAN based 
> cluster to a new cluster with ceph.
> All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.
As others mentioned, not a good choice, but also not the (main) cause of
your problems.

> I am basically running stock ceph settings, with just turning the write cache 
> off via hdparm on the drives, and temporarily turning of scrubbing.
>
The former is bound to kill performance; if you care that much for your
data but can't guarantee constant power (UPS, dual PSUs, etc.), consider
using a BBU caching controller.

The latter, I venture, you did because performance was abysmal with scrubbing
enabled, which is always a good indicator that your cluster needs tuning or
improvement.

> The 4 ceph servers are all Dell 730XD with 128GB memory, and dual xeon. So 
> Server performance should be good.
Memory is fine, CPU I can't tell from the model number and I'm not
inclined to look up or guess, 

Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size

2016-10-20 Thread Jim Kilborn
Thanks Christian for the additional information and comments.



· Upgraded the kernels, but still had poor performance

· Removed all the pools and recreated them with just a replication of 3,
with the two pools for the data and metadata. No cache tier pool.

· Turned back on the write caching with hdparm. We do have a large UPS
and dual power supplies in the ceph unit. If we get a long power outage,
everything will go down anyway.



I am no longer seeing the issue of the slow requests, ops blocked, etc.



I think I will push for the following design per ceph server



8  4TB sata drives

2 Samsung 128GB SM863 SSD each holding 4 osd journals



With 4 hosts, and a replication of 3 to start with



I did a quick test with 4 - 4TB spinners and 1 Samsung 128GB SM863 SSD holding 
the  4 osd journals, with 4 hosts in the cluster over infiniband.



At the 4M read, watching iftop, the client is receiving between 4.5 Gb/sec and
5.5 Gb/sec over infiniband, which is around 600 MB/sec and translates well to
the FIO number.



fio --direct=1 --sync=1 --rw={write,randwrite,read,randread} --bs={4M,4K} 
--numjobs=1 --iodepth=1 --runtime=60 --size=5G --time_based --group_reporting 
--name=journal-test



FIO Test         Local disk              SAN/NFS                 Ceph w/Repl/SSD journal
4M Writes        53 MB/sec    12 IOPS    62 MB/sec     15 IOPS   151 MB/sec    37 IOPS
4M Rand Writes   34 MB/sec     8 IOPS    63 MB/sec     15 IOPS   155 MB/sec    37 IOPS
4M Read          66 MB/sec    15 IOPS    102 MB/sec    25 IOPS   662 MB/sec   161 IOPS
4M Rand Read     73 MB/sec    17 IOPS    103 MB/sec    25 IOPS   670 MB/sec   163 IOPS
4K Writes        2.9 MB/sec  738 IOPS    3.8 MB/sec   952 IOPS   2.3 MB/sec   571 IOPS
4K Rand Writes   551 KB/sec  134 IOPS    3.6 MB/sec   911 IOPS   2.0 MB/sec   501 IOPS
4K Read          28 MB/sec  7001 IOPS    8 MB/sec    1945 IOPS   13 MB/sec   3256 IOPS
4K Rand Read     263 KB/sec              5 MB/sec    1246 IOPS   8 MB/sec    2015 IOPS




That performance is fine for our needs

Again, thanks for the help guys.



Regards,

Jim



From: Christian Balzer
Sent: Wednesday, October 19, 2016 7:54 PM
To: ceph-users@lists.ceph.com
Cc: Jim Kilborn
Subject: Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache 
pressure, capability release, poor iostat await avg queue size



Hello,

On Wed, 19 Oct 2016 12:28:28 + Jim Kilborn wrote:

> I have setup a new linux cluster to allow migration from our old SAN based 
> cluster to a new cluster with ceph.
> All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.
As others mentioned, not a good choice, but also not the (main) cause of
your problems.

> I am basically running stock ceph settings, with just turning the write cache
> off via hdparm on the drives, and temporarily turning off scrubbing.
>
The former is bound to kill performance; if you care that much for your
data but can't guarantee constant power (UPS, dual PSUs, etc.), consider
using a BBU caching controller.
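
For reference, the write cache toggle being discussed looks like this
(illustrative only, the device name is a placeholder):

  hdparm -W0 /dev/sdX   # disable the on-disk write cache
  hdparm -W1 /dev/sdX   # re-enable it
  hdparm -W  /dev/sdX   # query the current setting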

The latter, I venture, you did because performance was abysmal with scrubbing
enabled, which is always a good indicator that your cluster needs tuning or
improvement.

> The 4 ceph servers are all Dell 730XD with 128GB memory, and dual xeon. So 
> Server performance should be good.
Memory is fine; the CPU I can't tell from the model number and I'm not
inclined to look it up or guess, but that usually only becomes a bottleneck
when dealing with an all-SSD setup and things requiring the lowest latency
possible.


> Since I am running cephfs, I have tiering setup.
That should read "on top of EC pools", and as John said, not a good idea
at all, both EC pools and cache-tiering.

> Each server has 4 – 4TB drives for the erasure code pool, with K=3 and M=1. 
> So the idea is to ensure a single host failure.
> Each server also has a 1TB Seagate 850 Pro SSD for the cache drive, in a 
> replicated set with size=2

This isn't a Seagate, you mean Samsung. And that's a consumer model,
ill suited for this task, even with the DC level SSDs below as journals.

And as such a replication of 2 is also ill advised, I've seen these SSDs
die w/o ANY warning whatsoever and long before their (abysmal) endurance
was exhausted.

> The cache tier also has a 128GB SM863 SSD that is being used as a journal for 
> the cache SSD. It has power loss protection

Those are fine. If you re-do your cluster, don't put more than 4-5 journals
on them.

> My crush map is setup to ensure the cache pool uses only the 4 850 pro and 
> the erasure code uses only the 16 spinning 4TB drives.
>
> The problems that I am seeing is that I start copying data from our old san 
> to the ceph volume, and once the cache tier gets to my  target_max_bytes of 
> 1.4 TB, I start seeing:
>
> HEALTH_WARN 63 requests are blocked > 32 sec; 1 osds have slow requests; 
> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
> 26 ops are blocked > 65.536 sec on osd.0
> 37 ops are blocked > 32.768 sec on 

Re: [ceph-users] effectively reducing scrub io impact

2016-10-20 Thread Frédéric Nass
- On 20 Oct 16, at 15:03, Oliver Dzombic  wrote: 

> Hi Christian,

> thank you for your time.

> The problem is deep scrub only.

> Jewel 10.2.2 is used.

> Thank you for your hint with manual deep scrubs on specific OSD's. I
> didnt come up with that idea.

> -

> Where do you know

> osd_scrub_sleep

> from ?

> I am saw here lately on the mailinglist multiple times many "hidden"
> config options. ( while hidden is everything which is not mentioned in
> the doku @ ceph.com ).

> ceph.com does not know about osd_scrub_sleep config option ( except
> mentioned in (past) release notes )

> The search engine finds it mainly in github or bugtracker.

> Is there any source of a (complete) list of available config options,
> useable by normal admin's ?

Hi Oliver, 

This is probably what you're looking for: 
https://github.com/ceph/ceph/blob/master/src/common/config_opts.h 

You can change the Branch on the left to match the version of your cluster. 
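
For example, for a Jewel cluster that would be (URL shown for illustration): 

  https://github.com/ceph/ceph/blob/jewel/src/common/config_opts.h 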

Regards, 

Frederic. 

> Or is it really neccessary to grab through source codes and release
> notes to collect that kind information on your own ?

> --
> Mit freundlichen Gruessen / Best regards

> Oliver Dzombic
> IP-Interactive

> mailto:i...@ip-interactive.de

> Anschrift:

> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen

> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic

> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107

> Am 20.10.2016 um 14:39 schrieb Christian Balzer:

> > Hello,

> > On Thu, 20 Oct 2016 11:23:54 +0200 Oliver Dzombic wrote:

> >> Hi,

> >> we have here globally:

> >> osd_client_op_priority = 63
> >> osd_disk_thread_ioprio_class = idle
> >> osd_disk_thread_ioprio_priority = 7
> >> osd_max_scrubs = 1

> > If you google for osd_max_scrubs you will find plenty of threads, bug
> > reports, etc.

> > The most significant and benificial impact for client I/O can be achieved
> > by telling scrub to release its deadly grip on the OSDs with something like
> > osd_scrub_sleep = 0.1

> > Also which version, Hammer IIRC?
> > Jewel's unified queue should help as well, but no first hand experience
> > here.

> >> to influence the scrubbing performance and

> >> osd_scrub_begin_hour = 1
> >> osd_scrub_end_hour = 7

> >> to influence the scrubbing time frame


> >> Now, as it seems, this time frame is/was not enough, so ceph started
> >> scrubbing all the time, i assume because of the age of the objects.

> > You may want to line things up, so that OSDs/PGs are evenly spread out.
> > For example with 6 OSDs, manually initiate a deep scrub each day (at 01:00
> > in your case), so that only a specific subset is doing deep scrub conga.


> >> And it does it with:

> >> 4 active+clean+scrubbing+deep

> >> ( instead of the configured 1 )

> > That's per OSD, not global, see above, google.


> >> So now, we experience a situation, where the spinning drives are so
> >> busy, that the IO performance got too bad.

> >> The only reason that its not a catastrophy is, that we have a cache tier
> >> in front of it, which loweres the IO needs on the spnning drives.

> >> Unluckily we have also some pools going directly on the spinning drives.

> >> So these pools experience a very bad IO performance.

> >> So we had to disable scrubbing during business houres ( which is not
> >> really a solution ).

> > It is, unfortunately, for many people.
> > As mentioned many times, if your cluster is having issues with deep-scrubs
> > during peak hours, it will also be unhappy if you loose an OSD and
> > backfills happen.
> > If it is unhappy with normal scrubs, you need to upgrade/expand HW
> > immediately.

> >> So any idea why

> >> 1. 4-5 scrubs we can see, while osd_max_scrubs = 1 is set ?
> > See above.

> > With BlueStore in the wings and reduced (negated?) need for deep-scrubs, I
> > doubt this will see much coding effort.

> >> 2. Why the impact on the spinning drives is so hard, while we lowered
> >> the IO priority for it ?

> > That has only a small impact, deep-scrub by its very nature reads all
> > objects and thus kills I/Os by seeks and polluting caches.


> > Christian

> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] effectively reducing scrub io impact

2016-10-20 Thread Paweł Sadowski
You can inspect source code or do:

ceph --admin-daemon /var/run/ceph/ceph-osd.OSD_ID.asok config show |
grep scrub # or similar

And then check in source code :)

On 10/20/2016 03:03 PM, Oliver Dzombic wrote:
> Hi Christian,
>
> thank you for your time.
>
> The problem is deep scrub only.
>
> Jewel 10.2.2 is used.
>
> Thank you for your hint with manual deep scrubs on specific OSD's. I
> didnt come up with that idea.
>
> -
>
> Where do you know
>
> osd_scrub_sleep
>
> from ?
>
> I am saw here lately on the mailinglist multiple times many "hidden"
> config options. ( while hidden is everything which is not mentioned in
> the doku @ ceph.com ).
>
> ceph.com does not know about osd_scrub_sleep config option ( except
> mentioned in (past) release notes )
>
> The search engine finds it mainly in github or bugtracker.
>
> Is there any source of a (complete) list of available config options,
> useable by normal admin's ?
>
> Or is it really neccessary to grab through source codes and release
> notes to collect that kind information on your own ?
>

-- 
PS
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Kernel Versions for KVM Hypervisors

2016-10-20 Thread Ilya Dryomov
On Thu, Oct 20, 2016 at 2:45 PM, David Riedl  wrote:
> Hi cephers,
>
> I want to use the newest features of jewel on my cluster. I already updated
> all kernels on the OSD nodes to the following version:
> 4.8.2-1.el7.elrepo.x86_64.
>
> The KVM hypervisors are running the CentOS 7 stock kernel (
> 3.10.0-327.22.2.el7.x86_64 )
>
> If I understand it correctly, libvirt/qemu/librbd don't use the kernel for
> communication with ceph. At least there is no kernel layer shown in this
> diagram: http://docs.ceph.com/docs/master/rbd/libvirt/

Correct.

>
> The ceph packages are already updated on the hypervisors. On top on all this
> there is Openstack, if that's important at all.
>
>
> My question is: Do I need to update the kernels of the hypervisors if I only
> use libvirt as a base?

No - if all you are doing is qemu/librbd, the stock kernel is good
enough.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] qemu-rbd and ceph striping

2016-10-20 Thread Jason Dillaman
On Thu, Oct 20, 2016 at 1:51 AM, Ahmed Mostafa
 wrote:
> different OSDs

PGs -- but more or less correct since the OSDs will process requests
for a particular PG sequentially and not in parallel.

-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] effectively reducing scrub io impact

2016-10-20 Thread Oliver Dzombic
Hi Christian,

thank you for your time.

The problem is deep scrub only.

Jewel 10.2.2 is used.

Thank you for your hint with manual deep scrubs on specific OSDs. I
didn't come up with that idea.

-

Where do you know

osd_scrub_sleep

from ?

I have seen here lately on the mailing list multiple mentions of "hidden"
config options (where "hidden" means everything that is not mentioned in
the docs at ceph.com).

ceph.com does not document the osd_scrub_sleep config option (except for
mentions in past release notes)

The search engine finds it mainly in github or bugtracker.

Is there any source of a (complete) list of available config options,
usable by normal admins?

Or is it really necessary to grep through source code and release
notes to collect that kind of information on your own?

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 20.10.2016 um 14:39 schrieb Christian Balzer:
> 
> Hello,
> 
> On Thu, 20 Oct 2016 11:23:54 +0200 Oliver Dzombic wrote:
> 
>> Hi,
>>
>> we have here globally:
>>
>> osd_client_op_priority = 63
>> osd_disk_thread_ioprio_class = idle
>> osd_disk_thread_ioprio_priority = 7
>> osd_max_scrubs = 1
>>
> If you google for  osd_max_scrubs you will find plenty of threads, bug
> reports, etc.
> 
> The most significant and benificial impact for client I/O can be achieved
> by telling scrub to release its deadly grip on the OSDs with something like
> osd_scrub_sleep = 0.1
> 
> Also which version, Hammer IIRC?
> Jewel's unified queue should help as well, but no first hand experience
> here.
> 
>> to influence the scrubbing performance and
>>
>> osd_scrub_begin_hour = 1
>> osd_scrub_end_hour = 7
>>
>> to influence the scrubbing time frame
>>
>>
>> Now, as it seems, this time frame is/was not enough, so ceph started
>> scrubbing all the time, i assume because of the age of the objects.
>>
> You may want to line things up, so that OSDs/PGs are evenly spread out.
> For example with 6 OSDs, manually initiate a deep scrub each day (at 01:00
> in your case), so that only a specific subset is doing deep scrub conga. 
> 
> 
>> And it does it with:
>>
>> 4 active+clean+scrubbing+deep
>>
>> ( instead of the configured 1 )
>>
> That's per OSD, not global, see above, google.
> 
>>
>> So now, we experience a situation, where the spinning drives are so
>> busy, that the IO performance got too bad.
>>
>> The only reason that its not a catastrophy is, that we have a cache tier
>> in front of it, which loweres the IO needs on the spnning drives.
>>
>> Unluckily we have also some pools going directly on the spinning drives.
>>
>> So these pools experience a very bad IO performance.
>>
>> So we had to disable scrubbing during business houres ( which is not
>> really a solution ).
>>
> It is, unfortunately, for many people.
> As mentioned many times, if your cluster is having issues with deep-scrubs
> during peak hours, it will also be unhappy if you loose an OSD and
> backfills happen.
> If it is unhappy with normal scrubs, you need to upgrade/expand HW
> immediately.
> 
>> So any idea why
>>
>> 1. 4-5 scrubs we can see, while osd_max_scrubs = 1 is set ?
> See above.
> 
> With BlueStore in the wings and reduced (negated?) need for deep-scrubs, I
> doubt this will see much coding effort.
> 
>> 2. Why the impact on the spinning drives is so hard, while we lowered
>> the IO priority for it ?
>>
> That has only a small impact, deep-scrub by its very nature reads all
> objects and thus kills I/Os by seeks and polluting caches.
> 
> 
> Christian
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Kernel Versions for KVM Hypervisors

2016-10-20 Thread David Riedl

Hi cephers,

I want to use the newest features of jewel on my cluster. I already 
updated all kernels on the OSD nodes to the following version: 
4.8.2-1.el7.elrepo.x86_64.


The KVM hypervisors are running the CentOS 7 stock kernel ( 
3.10.0-327.22.2.el7.x86_64 )


If I understand it correctly, libvirt/qemu/librbd don't use the kernel 
for communication with ceph. At least there is no kernel layer shown in 
this diagram: http://docs.ceph.com/docs/master/rbd/libvirt/


The ceph packages are already updated on the hypervisors. On top on all 
this there is Openstack, if that's important at all.



My question is: Do I need to update the kernels of the hypervisors if I 
only use libvirt as a base?



Regards

David
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] effectively reducing scrub io impact

2016-10-20 Thread Christian Balzer

Hello,

On Thu, 20 Oct 2016 11:23:54 +0200 Oliver Dzombic wrote:

> Hi,
> 
> we have here globally:
> 
> osd_client_op_priority = 63
> osd_disk_thread_ioprio_class = idle
> osd_disk_thread_ioprio_priority = 7
> osd_max_scrubs = 1
>
If you google for  osd_max_scrubs you will find plenty of threads, bug
reports, etc.

The most significant and beneficial impact for client I/O can be achieved
by telling scrub to release its deadly grip on the OSDs with something like
osd_scrub_sleep = 0.1
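
For example, to try it on a running cluster without restarts (the value is only
a starting point, not a tuned recommendation):

  ceph tell osd.* injectargs '--osd_scrub_sleep 0.1'

and make it persistent by adding "osd_scrub_sleep = 0.1" to the [osd] section
of ceph.conf.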

Also which version, Hammer IIRC?
Jewel's unified queue should help as well, but no first hand experience
here.

> to influence the scrubbing performance and
> 
> osd_scrub_begin_hour = 1
> osd_scrub_end_hour = 7
> 
> to influence the scrubbing time frame
> 
> 
> Now, as it seems, this time frame is/was not enough, so ceph started
> scrubbing all the time, i assume because of the age of the objects.
> 
You may want to line things up, so that OSDs/PGs are evenly spread out.
For example with 6 OSDs, manually initiate a deep scrub on one of them each
day (at 01:00 in your case), so that only a specific subset is doing the
deep-scrub conga.
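
A minimal sketch of that, assuming 6 OSDs numbered 0-5 and a daily cron job at
01:00 running something like:

  #!/bin/sh
  # deep-scrub one OSD per day, round-robin over 6 OSDs (illustrative only)
  osd=$(( $(date +%j) % 6 ))
  ceph osd deep-scrub "$osd"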


> And it does it with:
> 
> 4 active+clean+scrubbing+deep
> 
> ( instead of the configured 1 )
> 
That's per OSD, not global, see above, google.

> 
> So now, we experience a situation, where the spinning drives are so
> busy, that the IO performance got too bad.
> 
> The only reason that it's not a catastrophe is that we have a cache tier
> in front of it, which lowers the IO needs on the spinning drives.
> 
> Unluckily we have also some pools going directly on the spinning drives.
> 
> So these pools experience a very bad IO performance.
> 
> So we had to disable scrubbing during business hours (which is not
> really a solution).
> 
It is, unfortunately, for many people.
As mentioned many times, if your cluster is having issues with deep-scrubs
during peak hours, it will also be unhappy if you lose an OSD and
backfills happen.
If it is unhappy with normal scrubs, you need to upgrade/expand HW
immediately.

> So any idea why
> 
> 1. 4-5 scrubs we can see, while osd_max_scrubs = 1 is set ?
See above.

With BlueStore in the wings and reduced (negated?) need for deep-scrubs, I
doubt this will see much coding effort.

> 2. Why the impact on the spinning drives is so hard, while we lowered
> the IO priority for it ?
> 
That has only a small impact, deep-scrub by its very nature reads all
objects and thus kills I/Os by seeks and polluting caches.


Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD with SSD journals and SAS OSDs

2016-10-20 Thread Nick Fisk


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> William Josefsson
> Sent: 20 October 2016 10:25
> To: Nick Fisk 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] RBD with SSD journals and SAS OSDs
> 
> On Mon, Oct 17, 2016 at 6:16 PM, Nick Fisk  wrote:
> > Did you also set /check the c-states, this can have a large impact as well?
> 
> Hi Nick. I did try intel_idle.max_cstate=0, and I've got quite a significant 
> improvement as attached below. Thanks for this
advice!
> This is still with DIRECT=1, SYNC=1, BS=4k, RW=WRITE.

Excellent, glad it worked for you. It's surprising what the power-saving
features can do to bursty performance.
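
For the archives, that boot parameter ends up on the kernel command line, e.g.
(illustrative; pairing it with processor.max_cstate is a common addition, not
something from this thread):

  # /etc/default/grub
  GRUB_CMDLINE_LINUX="... intel_idle.max_cstate=0 processor.max_cstate=1"
  # then rebuild the grub config and reboot, e.g. on CentOS 7 (BIOS boot):
  grub2-mkconfig -o /boot/grub2/grub.cfg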

> 
> I wanted also to ask you about Numa. Some argue it should be disabled for 
> high performance. My hosts are Dual Socket, 2x2630v4
> 2.2Ghz. Do you have any suggestions around whether enable or disable numa and 
> what would be the Impact? Thx will

I don't have much experience in this area other than knowing that NUMA can
impact performance, which is the reason all my recent OSD nodes have been
single socket. I took the easy option :-)

There are two things you need to be aware of, I think:

1. Storage and network controllers could be connected via different sockets,
causing data to be dragged over the interconnect bus.
There isn't much you can do about this, apart from careful placement of PCIe
cards, but one socket will always suffer.

2. OSD processes flipping between sockets. I think this has been discussed here 
in the past. I believe some gains could be achieved
by pinning the OSD process to certain cores, but I'm afraid your best bet would 
be to search the archives as I can't really offer
much advice.
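
One way to sketch such pinning with systemd (purely illustrative; the CPU list
is a placeholder and should match the cores of one socket on your boxes):

  # /etc/systemd/system/ceph-osd@.service.d/cpuaffinity.conf
  [Service]
  CPUAffinity=0 1 2 3 4 5 6 7

followed by a "systemctl daemon-reload" and restarting the OSDs.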


> 
> 
> 
> simple-write-62: (groupid=14, jobs=62): err= 0: pid=2133: Thu Oct 20
> 12:47:58 2016
>   write: io=1213.8MB, bw=41421KB/s, iops=10355, runt= 30006msec
> clat (msec): min=2, max=81, avg= 5.99, stdev= 2.72
>  lat (msec): min=2, max=81, avg= 5.99, stdev= 2.72
> clat percentiles (usec):
>  |  1.00th=[ 2864],  5.00th=[ 3184], 10.00th=[ 3376], 20.00th=[ 3696],
>  | 30.00th=[ 3984], 40.00th=[ 4576], 50.00th=[ 6048], 60.00th=[ 6688],
>  | 70.00th=[ 7264], 80.00th=[ 7712], 90.00th=[ 8640], 95.00th=[ 9920],
>  | 99.00th=[12480], 99.50th=[13248], 99.90th=[38656], 99.95th=[41728],
>  | 99.99th=[81408]
> bw (KB  /s): min=  343, max= 1051, per=1.62%, avg=669.64, stdev=160.04
> lat (msec) : 4=30.01%, 10=65.31%, 20=4.53%, 50=0.12%, 100=0.02%
>   cpu  : usr=0.04%, sys=0.54%, ctx=636287, majf=0, minf=1905
>   IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>  issued: total=r=0/w=310721/d=0, short=r=0/w=0/d=0
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Snapshot size and cluster usage

2016-10-20 Thread Stefan Heitmüller
We do have 2 Ceph (9.2.1) clusters, where one is sending snapshots of
pools to the other one for backup purposes.

Snapshots are fine, however the Ceph pool gets blown up by sizes not
matching the snapshots.

Here's the size of a snapshot and the resulting cluster usage
afterwards. The snapshot is ~2GB, but the cluster itself increases by
~300GB (every night)


# rbd diff --from-snap 20161017-010003 pool2/image@20161018-010005
--format plain | awk '{ SUM += $2 } END { print SUM/1024/1024/1024 " GB" }'
2.29738 GB

--- before snap ---

GLOBAL:
    SIZE      AVAIL     RAW USED     %RAW USED
    4947G     2953G     1993G        40.29

-- after snap ---

GLOBAL:
    SIZE      AVAIL     RAW USED     %RAW USED
    4947G     2627G     2319G        46.88

The originating one behaves correctly. (It increases a bit more due to other
pools and images.)

# rbd diff --from-snap 20161017-010003 pool1/image@20161018-010005
--format plain | awk '{ SUM += $2 } END { print SUM/1024/1024/1024 " GB" }'
2.29738 GB

--- before snap ---

GLOBAL:
    SIZE      AVAIL     RAW USED     %RAW USED
    2698G     1292G     1405G        52.10

-- after snap ---

GLOBAL:
    SIZE      AVAIL     RAW USED     %RAW USED
    2698G     1288G     1409G        52.24
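
A per-pool breakdown should show which pool the extra ~300GB actually lands
in; the stock commands for that (just pointers, no output included here):

ceph df detail    # per-pool USED / %USED / objects
rados df          # per-pool objects and space, plus rd/wr totals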

Any ideas where to have a look?

regards

Stefan



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Surviving a ceph cluster outage: the hard way

2016-10-20 Thread mj

Hi,

Interesting reading!

Any chance you could share some of the lessons (if any) you learned?

I can, for example, imagine your situation would have been much better
with a replication factor of three instead of two?


MJ

On 10/20/2016 12:09 AM, Kostis Fardelas wrote:

Hello cephers,
this is the blog post on the outage our Ceph cluster experienced some
weeks ago and how we managed to revive the cluster and our
clients' data.

I hope it will prove useful for anyone who finds himself/herself
in a similar position. Thanks to everyone on the ceph-users and
ceph-devel lists who contributed to our inquiries during
troubleshooting.

https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/

Regards,
Kostis
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD with SSD journals and SAS OSDs

2016-10-20 Thread William Josefsson
On Mon, Oct 17, 2016 at 6:16 PM, Nick Fisk  wrote:
> Did you also set /check the c-states, this can have a large impact as well?

Hi Nick. I did try intel_idle.max_cstate=0, and I've got quite a
significant improvement as attached below. Thanks for this advice!
This is still with DIRECT=1, SYNC=1, BS=4k, RW=WRITE.

I also wanted to ask you about NUMA. Some argue it should be disabled
for high performance. My hosts are dual socket, 2x 2630v4 2.2GHz. Do
you have any suggestions on whether to enable or disable NUMA, and
what would the impact be? Thx will



simple-write-62: (groupid=14, jobs=62): err= 0: pid=2133: Thu Oct 20
12:47:58 2016
  write: io=1213.8MB, bw=41421KB/s, iops=10355, runt= 30006msec
clat (msec): min=2, max=81, avg= 5.99, stdev= 2.72
 lat (msec): min=2, max=81, avg= 5.99, stdev= 2.72
clat percentiles (usec):
 |  1.00th=[ 2864],  5.00th=[ 3184], 10.00th=[ 3376], 20.00th=[ 3696],
 | 30.00th=[ 3984], 40.00th=[ 4576], 50.00th=[ 6048], 60.00th=[ 6688],
 | 70.00th=[ 7264], 80.00th=[ 7712], 90.00th=[ 8640], 95.00th=[ 9920],
 | 99.00th=[12480], 99.50th=[13248], 99.90th=[38656], 99.95th=[41728],
 | 99.99th=[81408]
bw (KB  /s): min=  343, max= 1051, per=1.62%, avg=669.64, stdev=160.04
lat (msec) : 4=30.01%, 10=65.31%, 20=4.53%, 50=0.12%, 100=0.02%
  cpu  : usr=0.04%, sys=0.54%, ctx=636287, majf=0, minf=1905
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued: total=r=0/w=310721/d=0, short=r=0/w=0/d=0
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] effectively reducing scrub io impact

2016-10-20 Thread Oliver Dzombic
Hi,

we have here globally:

osd_client_op_priority = 63
osd_disk_thread_ioprio_class = idle
osd_disk_thread_ioprio_priority = 7
osd_max_scrubs = 1

to influence the scrubbing performance and

osd_scrub_begin_hour = 1
osd_scrub_end_hour = 7

to influence the scrubbing time frame
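
To double-check what the running OSDs have actually applied, something like
this works (a sketch; the OSD id is an example and the admin socket path is
the default one):

ceph daemon osd.0 config show | grep -E 'scrub|ioprio'
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config get osd_max_scrubs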


Now, as it seems, this time frame is/was not enough, so Ceph started
scrubbing all the time, I assume because of the age of the objects.

And it does it with:

4 active+clean+scrubbing+deep

( instead of the configured 1 )


So now we experience a situation where the spinning drives are so
busy that the IO performance has become too bad.

The only reason it's not a catastrophe is that we have a cache tier
in front of it, which lowers the IO load on the spinning drives.

Unluckily, we also have some pools going directly to the spinning drives.

These pools experience very bad IO performance.

So we had to disable scrubbing during business hours (which is not
really a solution).

So, any idea why

1. we can see 4-5 scrubs, while osd_max_scrubs = 1 is set?
2. the impact on the spinning drives is so hard, while we lowered
the IO priority for it?


Thank you !



-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Address:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402, district court of Hanau
Managing director: Oliver Dzombic

Tax no.: 35 236 3622 1
VAT ID: DE274086107

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Yet another hardware planning question ...

2016-10-20 Thread Christian Balzer

Hello,

On Thu, 20 Oct 2016 07:56:55 + Patrik Martinsson wrote:

> Hi Christian, 
> 
> Thanks for your very detailed and thorough explanation, very much
> appreciated. 
> 
You're welcome.

> We have definitely thought of a design where we have dedicated nvme-
> pools for 'high-performance' as you say. 
>
Having the whole cluster blazingly fast is of course a nice goal, but how to
achieve this without breaking the bank is always the trickier part.

In my (very specific) use case I was able to go from a totally overloaded
cluster to a very bored one by just adding a small cache-tier. That works so
well because the clients here are well known and under our control: they're
all the same and are happy if they can scribble away tiny amounts of data
within 5ms, a perfect fit for this.
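
For reference, the mechanics of putting a cache tier in front of an existing
pool are only a handful of commands; sizing and cache-mode are where the real
thought goes. A sketch with placeholder pool names and values:

ceph osd tier add rbd cache                             # attach pool 'cache' to pool 'rbd'
ceph osd tier cache-mode cache writeback
ceph osd tier set-overlay rbd cache                     # client traffic now goes via the tier
ceph osd pool set cache hit_set_type bloom
ceph osd pool set cache target_max_bytes 1000000000000  # example: ~1TB cap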
 
> At the same time I *thought* that having the journal offloaded to
> another device *always* was the best solution 
>  - if you use mainly spinners, have the journals on ssd's
>  - if you mainly use ssd's, have journals on nvme's 
>
Quite so. If you have a large/unlimited budget, go for it.
 
> But that's not always the case I guess, and thanks for pointing that
> out. 
>
Again, matching things up in terms of speed (network vs journal vs OSD),
endurance and size is both involved and gets costly quickly.

Christian

> Best regards, 
> Patrik Martinsson 
> Sweden
> 
> 
> On fre, 2016-10-14 at 09:59 +0900, Christian Balzer wrote:
> > Hello,
> > 
> > On Thu, 13 Oct 2016 15:46:03 + Patrik Martinsson wrote:
> > 
> > > 
> > > On tor, 2016-10-13 at 10:29 -0500, Brady Deetz wrote:
> > > > 
> > > > 6 SSDs per NVMe journal might leave your journal in contention. Can you
> > > > provide the specific models you will be using?
> > > 
> > > Well, according to Dell, the card is called "Dell 1.6TB, NVMe, Mixed
> > > Use Express Flash, PM1725", but the specs for the card are listed here:
> > > http://i.dell.com/sites/doccontent/shared-content/data-sheets/en/Documents/Dell-PowerEdge-Express-Flash-NVMe-Mixed-Use-PCIe-SSD.pdf
> > > 
> > That's a re-branded (not much, same model number) Samsung.
> > Both that link and the equivalent Samsung link are not what I would
> > consider professional, with their "up to" speeds.
> > Because that usually is a function of the design and flash modules used,
> > typically resulting in smaller drives being slower (less parallelism).
> > 
> > Extrapolating from the 3.2 TB model we can assume that these cannot
> > write more than 2GB/s.
> > 
> > If your 40Gb/s network is single ported or active/standby (you didn't
> > mention), then this is fine, as 2 of these journal NVMes would be a
> > perfect match.
> > If it's dual-ported with MC-LAG, then you're wasting half of the
> > potential
> > bandwidth. 
> > 
> > Also these NVMes have a nice, feel good 5 DWPD, for future
> > reference. 
> > 
> > > 
> > > Forgive me for my poor English here, but when you say "leave your
> > > journal in contention", what exactly do you mean by that ?
> > > 
> > He means that the combined bandwidth of your SSDs will be larger than
> > those of your journal NVMe's, limiting the top bandwidth your nodes
> > can
> > write at to those of the journals.
> > 
> > In your case we're missing any pertinent details about the SSDs as
> > well.
> > 
> > An educated guess (size, 12Gbs link, Samsung) makes them these:
> > http://www.samsung.com/semiconductor/products/flash-storage/enterprise-ssd/MZILS1T9HCHP?ia=832
> > http://www.samsung.com/semiconductor/global/file/media/PM853T.pdf
> > 
> > So 750MB/s sequential writes, 3 of these can already handle more than
> > your
> > NVMe.
> > 
> > However the 1 DWPD (the PDF is more detailed and gives us a scary 0.3
> > DWPD
> > for small I/Os) of these SSDs would definitely stop me from
> > considering
> > them.
> > Unless you can quantify your write volume with certainty and it's
> > below
> > the level these SSDs can support, go for something safer, at least 3
> > DWPD.
> > 
> > Quick estimate:
> > 24 SSDs (replication of 3) * 1.92TB * 0.3 (worst case) = 13.8TB/day 
> > That's ignoring further overhead and write amplification by the FS
> > (journals) and Ceph itself.
> > So if your cluster sees less than 10TB writes/day, you may at least
> > assume
> > it won't kill those SSDs within months.
> > 
> > Your journal NVMes are incidentally a decent match endurance wise at
> > a
> > (much more predictable) 16TB/day.
> > 
> > 
> > The above is of course all about bandwidth (sequential writes), which
> > are
> > important in certain use cases and during backfill/recovery actions.
> > 
> > Since your use case suggests more of a DB, smallish data transactions
> > scenario, that "waste" of bandwidth may be totally acceptable.
> > All my clusters certainly favor lower latency over higher bandwidth
> > when
> > having to choose between either. 
> > 
> > It comes back to use case and write volume, those journal NVMes will
> > help
> > with keeping latency low (for 

Re: [ceph-users] Yet another hardware planning question ...

2016-10-20 Thread Patrik Martinsson
Hi Christian, 

Thanks for your very detailed and thorough explanation, very much
appreciated. 

We have definitely thought of a design where we have dedicated nvme-
pools for 'high-performance' as you say. 

At the same time I *thought* that having the journal offloaded to
another device *always* was the best solution 
 - if you use mainly spinners, have the journals on ssd's
 - if you mainly use ssd's, have journals on nvme's 

But that's not always the case I guess, and thanks for pointing that
out. 

Best regards, 
Patrik Martinsson 
Sweden


On fre, 2016-10-14 at 09:59 +0900, Christian Balzer wrote:
> Hello,
> 
> On Thu, 13 Oct 2016 15:46:03 + Patrik Martinsson wrote:
> 
> > 
> > On tor, 2016-10-13 at 10:29 -0500, Brady Deetz wrote:
> > > 
> > > 6 SSDs per NVMe journal might leave your journal in contention. Can you
> > > provide the specific models you will be using?
> > 
> > Well, according to Dell, the card is called "Dell 1.6TB, NVMe, Mixed
> > Use Express Flash, PM1725", but the specs for the card are listed here:
> > http://i.dell.com/sites/doccontent/shared-content/data-sheets/en/Documents/Dell-PowerEdge-Express-Flash-NVMe-Mixed-Use-PCIe-SSD.pdf
> > 
> That's a re-branded (not much, same model number) Samsung.
> Both that link and the equivalent Samsung link are not what I would
> consider professional, with their "up to" speeds.
> Because that usually is a function of the design and flash modules used,
> typically resulting in smaller drives being slower (less parallelism).
> 
> Extrapolating from the 3.2 TB model we can assume that these cannot
> write more than 2GB/s.
> 
> If your 40Gb/s network is single ported or active/standby (you didn't
> mention), then this is fine, as 2 of these journal NVMes would be a
> perfect match.
> If it's dual-ported with MC-LAG, then you're wasting half of the
> potential
> bandwidth. 
> 
> Also these NVMes have a nice, feel good 5 DWPD, for future
> reference. 
> 
> > 
> > Forgive me for my poor English here, but when you say "leave your
> > journal in contention", what exactly do you mean by that ?
> > 
> He means that the combined bandwidth of your SSDs will be larger than
> those of your journal NVMe's, limiting the top bandwidth your nodes
> can
> write at to those of the journals.
> 
> In your case we're missing any pertinent details about the SSDs as
> well.
> 
> An educated guess (size, 12Gbs link, Samsung) makes them these:
> http://www.samsung.com/semiconductor/products/flash-storage/enterprise-ssd/MZILS1T9HCHP?ia=832
> http://www.samsung.com/semiconductor/global/file/media/PM853T.pdf
> 
> So 750MB/s sequential writes, 3 of these can already handle more than
> your
> NVMe.
> 
> However the 1 DWPD (the PDF is more detailed and gives us a scary 0.3
> DWPD
> for small I/Os) of these SSDs would definitely stop me from
> considering
> them.
> Unless you can quantify your write volume with certainty and it's
> below
> the level these SSDs can support, go for something safer, at least 3
> DWPD.
> 
> Quick estimate:
> 24 SSDs (replication of 3) * 1.92TB * 0.3 (worst case) = 13.8TB/day 
> That's ignoring further overhead and write amplification by the FS
> (journals) and Ceph itself.
> So if your cluster sees less than 10TB writes/day, you may at least
> assume
> it won't kill those SSDs within months.
> 
> Your journal NVMes are incidentally a decent match endurance wise at
> a
> (much more predictable) 16TB/day.
> 
> 
> The above is of course all about bandwidth (sequential writes), which
> are
> important in certain use cases and during backfill/recovery actions.
> 
> Since your use case suggests more of a DB, smallish data transactions
> scenario, that "waste" of bandwidth may be totally acceptable.
> All my clusters certainly favor lower latency over higher bandwidth
> when
> having to choose between either. 
> 
> It comes back to use case and write volume, those journal NVMes will
> help
> with keeping latency low (for your DBs) so if that is paramount, go
> with
> that.
> 
> They do feel a bit wasted (1.6TB, of which you'll use 1-200MB at
> most),
> though.
> Consider alternative designs where you have special pools for high
> performance needs on NVMes and use 3+DWPD SSDs (journals inline) for
> the
> rest.
> 
> Also I'd use the E5-2697A v4 CPU instead with SSDs (faster baseline
> and
> Turbo).
> 
> Christian
> 
> > 
> > Best regards, 
> > Patrik Martinsson
> > Sweden
> > 
> > 
> > > 
> > > On Oct 13, 2016 10:23 AM, "Patrik Martinsson"
> > >  wrote:
> > > > 
> > > > Hello everyone, 
> > > > 
> > > > We are in the process of buying hardware for our first ceph-
> > > > cluster. We
> > > > will start with some testing and do some performance
> > > > measurements
> > > > to
> > > > see that we are on the right track, and once we are satisfied
> > > > with
> > > > our
> > > > setup we'll continue to grow in it as time comes along.
> > > > 
> > > > Now, I'm just seeking some thoughts on our 

Re: [ceph-users] Surviving a ceph cluster outage: the hard way

2016-10-20 Thread Kris Gillespie
Kostis,

Excellent article, mate. This is the kind of war story that can really help
people out. Learning through (others') adversity.

Kris


> On 20 Oct 2016, at 00:09, Kostis Fardelas  wrote:
> 
> Hello cephers,
> this is the blog post on the outage our Ceph cluster experienced some
> weeks ago and how we managed to revive the cluster and our
> clients' data.
> 
> I hope it will prove useful for anyone who finds himself/herself
> in a similar position. Thanks to everyone on the ceph-users and
> ceph-devel lists who contributed to our inquiries during
> troubleshooting.
> 
> https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/
> 
> Regards,
> Kostis
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Surviving a ceph cluster outage: the hard way

2016-10-20 Thread Kostis Fardelas
We pulled leveldb from upstream and fired leveldb.RepairDB against the
OSD omap directory using a simple python script. Ultimately, that
didn't move things forward. We resorted to checking every object's
timestamp/md5sum/attributes on the crashed OSD against the replicas in
the cluster and in the end discarded the journal, once we had concluded
with as much confidence as possible that we would not lose data.
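
The RepairDB call itself is a one-liner once the python leveldb bindings are
installed; roughly along these lines (a sketch, not our exact script - stop
the OSD and copy the omap directory away first, NN is the OSD id):

cp -a /var/lib/ceph/osd/ceph-NN/current/omap /root/omap.backup
python -c "import leveldb; leveldb.RepairDB('/var/lib/ceph/osd/ceph-NN/current/omap')"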

It would have been really useful at that point to have a tool to inspect
the contents of the crashed OSD's journal and limit the scope of the
verification process.

On 20 October 2016 at 08:15, Goncalo Borges
 wrote:
> Hi Kostis...
> That is a tale from the dark side. Glad you recovered it and that you were
> willing to doc it all up and share it. Thank you for that.
> Can I also ask which tool you used to recover the leveldb?
> Cheers
> Goncalo
> 
> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Kostis 
> Fardelas [dante1...@gmail.com]
> Sent: 20 October 2016 09:09
> To: ceph-users
> Subject: [ceph-users] Surviving a ceph cluster outage: the hard way
>
> Hello cephers,
> this is the blog post on the outage our Ceph cluster experienced some
> weeks ago and how we managed to revive the cluster and our
> clients' data.
>
> I hope it will prove useful for anyone who finds himself/herself
> in a similar position. Thanks to everyone on the ceph-users and
> ceph-devel lists who contributed to our inquiries during
> troubleshooting.
>
> https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/
>
> Regards,
> Kostis
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com