Re: [ceph-users] use ZFS for OSDs
Hi Michal,

Really nice work on the ZFS testing. I've been thinking about this myself from time to time, but I wasn't sure if ZoL was ready to use in production with Ceph. Rather than running multiple OSDs on ZFS under Ceph, I'd like to see something like a RAID-Z2 of, say, 8-12 3-4TB spinners, leveraging some nice SSDs (maybe a P3700 400GB) for the ZIL/L2ARC, with compression on, and going back to 2x replicas - that could give us some pretty fast/safe/efficient storage. Now to find that money tree.

Regards,
Quenten Grasso

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Michal Kozanecki
Sent: Friday, 10 April 2015 5:15 AM
To: Christian Balzer; ceph-users
Subject: Re: [ceph-users] use ZFS for OSDs

I had surgery and have been off for a while, and had to rebuild the test ceph+openstack cluster with whatever spare parts I had. I apologize for the delay to anyone who's been interested. Here are the results.

== Hardware/Software

3-node ceph cluster, 3 OSDs (one OSD per node):

CPU = 1x E5-2670 v1
RAM = 8GB
OS disk = 500GB SATA
OSD = 900GB 10k SAS (sdc - whole device)
Journal = shared Intel SSD DC S3500 80GB (sdb1 - 10GB partition)
ZFS log = shared Intel SSD DC S3500 80GB (sdb2 - 4GB partition)
ZFS L2ARC = Intel SSD 320 40GB (sdd - whole device)
ceph 0.87
ZoL 0.6.3
CentOS 7.0

2-node KVM/OpenStack cluster:

CPU = 2x Xeon X5650
RAM = 24GB
OS disk = 500GB SATA
Ubuntu 14.04
OpenStack Juno

Rough performance of this oddball-sized test ceph cluster: 1000-1500 IOPS at 8k.

== Compression (unneeded details cut out)

Data set: various Debian and CentOS images, with lots of test SVN and GIT data, under KVM/OpenStack.

[root@ceph03 ~]# zfs get all SAS1
NAME  PROPERTY          VALUE  SOURCE
SAS1  used              586G   -
SAS1  compressratio     1.50x  -
SAS1  recordsize        32K    local
SAS1  checksum          on     default
SAS1  compression       lz4    local
SAS1  refcompressratio  1.50x  -
SAS1  written           586G   -
SAS1  logicalused       877G   -

== Dedupe (dedupe is enabled at the dataset level, but the space savings can only be viewed at a
pool level - a bit odd, I know)

Data set: various Debian and CentOS images, with lots of test SVN and GIT data, under KVM/OpenStack.

[root@ceph01 ~]# zpool get all SAS1
NAME  PROPERTY    VALUE  SOURCE
SAS1  size        836G   -
SAS1  capacity    70%    -
SAS1  dedupratio  1.02x  -
SAS1  free        250G   -
SAS1  allocated   586G   -

== Bitrot/Corruption

Injected random data at random locations on sdc (changing seek to a random value) with:

dd if=/dev/urandom of=/dev/sdc seek=54356 bs=4k count=1

Results:

1. ZFS detects the on-disk error affecting the PG files. As this is a single vdev (no zraid or mirror), it cannot automatically fix it; instead it blocks all access (except delete) to the affected files, making them inaccessible. *Note: I ran this status after already repairing 2 PGs (5.15 and 5.25); zpool status no longer lists a filename once it has been repaired/deleted/cleared.*

[root@ceph01 ~]# zpool status -v
  pool: SAS1
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub in progress since Thu Apr 9 13:04:54 2015
        153G scanned out of 586G at 40.3M/s, 3h3m to go
        0 repaired, 26.05% done
config:

        NAME    STATE   READ WRITE CKSUM
        SAS1    ONLINE     0     0    35
          sdc   ONLINE     0     0    70
        logs
          sdb2  ONLINE     0     0     0
        cache
          sdd   ONLINE     0     0     0

errors: Permanent errors have been detected in the following files:

        /SAS1/current/5.e_head/DIR_E/DIR_0/DIR_6/rbd\udata.2ba762ae8944a.24cc__head_6153260E__5

2. ceph-osd cannot read the PG file and kicks off a scrub/deep-scrub.

/var/log/ceph/ceph-osd.2.log
2015-04-09 13:10:18.319312 7fcbb163a700 -1 log_channel(default) log [ERR] : 5.18 shard 1: soid cd635018/rbd_data.93d1f74b0dc51.18ee/head//5 candidate had a read error, digest 1835988768 != known digest 473354757
2015-04-09 13:11:38.587014 7fcbb1e3b700 -1 log_channel(default) log [ERR] : 5.18 deep-scrub 0 missing, 1
pgs inconsistent; 2 scrub errors; noout flag(s) set
     monmap e2: 3 mons at {ceph01=10.10.10.101:6789/0,ceph02=10.10.10.102:6789/0,ceph03=10.10.10.103:6789/0}, election epoch 146, quorum 0,1,2 ceph01,ceph02,ceph03
     osdmap e3178: 3 osds: 3 up, 3 in
            flags noout
      pgmap v890949: 392 pgs, 6 pools, 931 GB data, 249 kobjects
            1756 GB used, 704 GB / 2460 GB avail
                   2 active+clean+inconsistent
                 391 active+clean
  client io 0 B/s rd, 7920 B/s wr, 3 op/s

3. Repair must be kicked off manually.

[root@client01 ~]# ceph pg repair 5.18
instructing pg 5.18 on osd.0 to repair

[root@client01 ~]# ceph health detail
HEALTH_WARN 1 pgs repair; noout flag(s) set
pg 5.25 is active+clean+inconsistent, acting [1,0,2]
pg 5.18 is active+clean+scrubbing+deep+repair, acting [2,0,1]

/var/log/ceph/ceph-osd.2.log
2015-04-09 13:30:01.609756 7fcbb163a700 -1 log_channel(default) log [ERR] : 5.18 shard 1: soid cd635018/rbd_data.93d1f74b0dc51.18ee/head//5 candidate had a read error, digest 1835988768 != known digest 473354757
2015-04-09 13:30:41.834465 7fcbb1e3b700 -1 log_channel(default) log [ERR] : 5.18 repair 0 missing, 1 inconsistent objects
2015-04-09 13:30:41.834479 7fcbb1e3b700 -1 log_channel(default) log [ERR] : 5.18 repair 1 errors, 1 fixed

/var/log/ceph/ceph-osd.1.log
2015-04-09 13:30:47.952742 7fe10b3c5700 -1 log_channel(default) log [ERR] : 5.25 shard 1: soid 73eb0125/rbd_data.5315b2ae8944a.5348/head//5 candidate had a read error, digest 1522345897 != known digest 1180025616
2015-04-09 13:31:23.389095 7fe10b3c5700 -1 log_channel(default) log [ERR] : 5.25 repair 0 missing, 1 inconsistent objects
2015-04-09 13:31:23.389112 7fe10b3c5700 -1 log_channel(default) log [ERR] : 5.25 repair 1 errors, 1 fixed

Conclusion:

ZFS compression works GREAT - between 30-50% compression depending on the data (I was getting around 30-35% with only OS images; once I loaded real test data (SVN/GIT/etc.) this increased to 50%). ZFS dedupe doesn't seem to get you much, at least with how Ceph lays data out. Maybe due to my recordsize (32K)?
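The ratios in this post are easy to cross-check from the `zfs get`/`zpool get` numbers shown earlier (this is my arithmetic, in Python for convenience, not part of the original mail):

```python
# Cross-check the reported ZFS ratios from the raw numbers above:
# logicalused = 877G, used (physical) = 586G.
logicalused_gb = 877
used_gb = 586

compressratio = logicalused_gb / used_gb
print(f"compressratio ~ {compressratio:.2f}x")  # matches the reported 1.50x

# A 1.50x ratio can be read either as ~50% more logical data per
# physical byte (ratio - 1), or as ~33% of raw space saved:
extra_logical = compressratio - 1
space_saved = 1 - used_gb / logicalused_gb
print(f"{extra_logical:.0%} more data stored, {space_saved:.0%} space saved")
```

That is why "50% compression" and a 1.50x compressratio describe the same dataset.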
ZFS/Ceph bitrot/corruption protection isn't fully automated, but it's still pretty damn good in my opinion - an improvement over the silent bitrot, or the coin tossing if Ceph somehow detects an error, that you get with other filesystems. Ceph attempts to access the file, ZFS detects the error and essentially kills access to that file, and Ceph treats this as a read error and kicks off a scrub on the PG. PG repair does not seem to happen automatically, but when kicked off manually it succeeds.

Let me know if there's anything else, or any questions people have while I have this test cluster running.

Cheers,

Michal Kozanecki | Linux Administrator | mkozane...@evertz.com

-Original Message-
From: Christian Balzer [mailto:ch...@gol.com]
Sent: November-01-14 4:43 AM
To: ceph-users
Cc: Michal Kozanecki
Subject: Re: [ceph-users] use ZFS for OSDs

[snip]
On Fri, 31 Oct 2014 16:32:49, Michal Kozanecki wrote:

> I'll test this by manually inducing corrupted data on the ZFS
> filesystem and report back how ZFS+ceph interact during a detected
> file failure/corruption, how it recovers, and any manual steps
> required.

Looking forward to that.

> As for compression: using lz4, the CPU impact is around 5-20%
> depending on load, type of I/O and I/O size, with little-to-no I/O
> performance impact - in fact, in some cases the I/O performance
> actually increases. I'm currently looking at a compression ratio of
> around 30-35% on ZFS datasets holding rbd-backed OpenStack KVM VMs.

I'm looking at a similar deployment (VM images), and over 30% compression would at least offset ZFS's need for at least 20% free space (it suffers massive degradation otherwise). CPU usage looks acceptable, but in combination with SSD-backed OSDs it's another thing to consider. As in: is it worth spending X amount of money on faster CPUs for 10-20% space savings, or will another SSD be cheaper?

I'm trying to position Ceph against SolidFire, who claim 4-10x data reduction from a combination of compression, deduping and thin provisioning - without, of course, quantifying which step gives what reduction on what sample data.

> I have not tried any sort of dedupe, as it is memory intensive and I
> only had 24GB of RAM on each node. I'll grab some FIO benchmarks and
> report back.

I foresee a massive failure here, despite the huge potential of one use case here where all VMs are basically identical (KSM is very effective with those, too). Why the predicted failure? Several reasons:

1. Deduping is only local, per OSD. That will make a big dent, but with many nearly identical VM images we should still have quite a bit of identical data per OSD. However...

2. Data alignment. The default RADOS objects making up images are 4MB.
Which, given my limited knowledge of ZFS, I presume will be mapped to 128KB ZFS blocks that are then subject to the deduping process. However, even if one were to install the same OS on identically sized RBD images, I predict subtle differences in alignment within those objects and thus within the ZFS blocks. That becomes a near certainty once those images (OS installs) are customized, with files added or deleted, etc.

3. ZFS block size and VM FS metadata. Even if all the data were perfectly, identically aligned within the 4MB RADOS objects, the resulting 128KB ZFS blocks are likely to contain metadata such as inodes (creation time), making them subtly different and not eligible for deduping.

OTOH, SolidFire claims to do global deduplication; how they manage that efficiently is a bit beyond me, especially given the memory sizes of their appliances. My guess is they keep a map on disk (all SSDs) on each node instead of holding it in RAM. I suppose the updates (writes to the SSDs) of this map are still substantially smaller than the data that would otherwise be written without deduping. Thus I think Ceph will need a similar approach for any deduping to work, in combination with a much finer-grained block size. The latter is, I believe, already being discussed in the context of cache tier pools - having to promote/demote 4MB blobs for a single hot 4KB of data is hardly efficient.

Regards,

Christian

Cheers,

-Original Message-
From: Christian Balzer [mailto:ch...@gol.com]
Sent: October-30-14 4:12 AM
To: ceph-users
Cc: Michal Kozanecki
Subject: Re: [ceph-users] use ZFS for OSDs

On Wed, 29 Oct 2014 15:32:57, Michal Kozanecki wrote:

[snip]

With Ceph handling the redundancy at the OSD level I saw no need for ZFS mirroring or zraid; instead, if ZFS detects corruption, rather than self-healing it sends a read failure of the PG file to ceph, and ceph's scrub mechanisms should then repair/replace the PG file using a good replica elsewhere on the cluster.
ZFS + ceph are a beautiful bitrot-fighting match!

Could you elaborate on that? AFAIK Ceph currently has no way to determine which of the replicas is good; one such failed PG object will require a manual repair after the scrub, and you get to hope that the two surviving replicas (assuming a size of 3) are identical. If not, start tossing a coin. Ideally, Ceph would have a way to know what happened (as in: it's a checksum error, not a real I/O error) and rebuild that object itself.

On another note, have you done any tests using the ZFS compression? I'm wondering what the performance impact and efficiency are.

Christian
--
Christian Balzer    Network/Systems Engineer
ch...@gol.com       Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
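Christian's data-alignment point (reason 2 above) is easy to demonstrate with a toy sketch: hash identical data in fixed-size blocks, once block-aligned and once shifted by a few bytes. This is plain Python over made-up data with a scaled-down block size - an illustration only, not anything from ZFS or Ceph:

```python
import hashlib

def unique_blocks(data, blocksize):
    """Split data into fixed-size blocks and count distinct block hashes --
    roughly the unit a block-level dedupe table can collapse."""
    blocks = [data[i:i + blocksize] for i in range(0, len(data), blocksize)]
    return len({hashlib.sha256(b).digest() for b in blocks})

# 128KB of deterministic pseudo-random "VM image" data
payload = b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(4096))

bs = 4096                                   # stand-in for the ZFS recordsize
aligned = payload + payload                 # identical copy, block-aligned
shifted = payload + b"\x00" * 7 + payload   # identical copy, off by 7 bytes

print(unique_blocks(aligned, bs))  # half the blocks dedupe away (2x)
print(unique_blocks(shifted, bs))  # every block unique: nothing dedupes
```

With the copy aligned, half the block hashes collapse; shift the same data by just 7 bytes and every block hashes differently, which is exactly the alignment hazard described above.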
I'll test this by manually inducing corrupted data on the ZFS filesystem and report back how ZFS+ceph interact during a detected file failure/corruption, how it recovers, and any manual steps required.

As for compression: using lz4, the CPU impact is around 5-20% depending on load, type of I/O and I/O size, with little-to-no I/O performance impact - in fact, in some cases the I/O performance actually increases. I'm currently looking at a compression ratio of around 30-35% on ZFS datasets holding rbd-backed OpenStack KVM VMs.

I have not tried any sort of dedupe, as it is memory intensive and I only had 24GB of RAM on each node. I'll grab some FIO benchmarks and report back.

Cheers,

-Original Message-
From: Christian Balzer [mailto:ch...@gol.com]
Sent: October-30-14 4:12 AM
To: ceph-users
Cc: Michal Kozanecki
Subject: Re: [ceph-users] use ZFS for OSDs

[snip]
On Wed, 29 Oct 2014 15:32:57, Michal Kozanecki wrote:

[snip]

> With Ceph handling the redundancy at the OSD level I saw no need for
> ZFS mirroring or zraid; instead, if ZFS detects corruption, rather
> than self-healing it sends a read failure of the PG file to ceph, and
> ceph's scrub mechanisms should then repair/replace the PG file using
> a good replica elsewhere on the cluster. ZFS + ceph are a beautiful
> bitrot-fighting match!

Could you elaborate on that? AFAIK Ceph currently has no way to determine which of the replicas is good; one such failed PG object will require a manual repair after the scrub, and you get to hope that the two surviving replicas (assuming a size of 3) are identical. If not, start tossing a coin. Ideally, Ceph would have a way to know what happened (as in: it's a checksum error, not a real I/O error) and rebuild that object itself.

On another note, have you done any tests using the ZFS compression? I'm wondering what the performance impact and efficiency are.

Christian
--
Christian Balzer    Network/Systems Engineer
ch...@gol.com       Global OnLine Japan/Fusion Communications
http://www.gol.com/
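Christian's "coin toss" can be illustrated with a toy majority-vote sketch over replica checksums (plain Python with hypothetical CRC digests; Ceph's scrub on FileStore did not actually work this way - this only shows the voting logic): with size=3, two intact replicas can outvote a corrupted one, while with size=2 there is no majority to break the tie.

```python
import zlib
from collections import Counter

def pick_good_replicas(replicas):
    """Toy majority vote over replica checksums: the digest seen on the
    most replicas wins; returns (winning_digest, suspect_indices), or
    None when there is no majority (the "coin toss" case)."""
    digests = [zlib.crc32(data) for data in replicas]
    (digest, votes), *rest = Counter(digests).most_common()
    if rest and rest[0][1] == votes:
        return None  # tie -- no way to tell which replica is good
    suspects = [i for i, d in enumerate(digests) if d != digest]
    return digest, suspects

good = b"rbd object payload"
flipped = b"rbd object paylaod"  # simulated bitrot

# size=3: the two intact copies outvote the corrupted replica (index 1)
print(pick_good_replicas([good, flipped, good]))
# size=2: one good, one bad -- no majority, coin toss
print(pick_good_replicas([good, flipped]))
```

This is also why a ZFS-reported read error helps: it tells Ceph *which* replica is bad, so no vote (and no third replica) is needed to identify it.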
On Wed, 29 Oct 2014, Kenneth Waegeman wrote:

> Hi,
>
> We are looking to use ZFS for our OSD backend, but I have some
> questions. My main question is: does Ceph already support the
> writeparallel mode for ZFS? (as described here:
> http://www.sebastien-han.fr/blog/2013/12/02/ceph-performance-interesting-things-going-on/)
> I've found this, but I suppose it is outdated:
> https://wiki.ceph.com/Planning/Blueprints/Emperor/osd%3A_ceph_on_zfs

All of the code is there, but it is almost completely untested.

> Should Ceph be built with ZFS support? I found a --with-zfslib option
> somewhere, but can someone verify this, or better, has instructions
> for it? :-)
>
> What parameters should be tuned to use this? I found these:
>
> filestore zfs_snap = 1

Yes.

> journal_aio = 0
> journal_dio = 0

Maybe, if ZFS doesn't support directio or aio.

Curious to hear how it goes! Wouldn't recommend this for production, though, without significant testing.

sage

> Are there other things we need for it?
>
> Many thanks!!
> Kenneth
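For reference, the tunables Kenneth lists would go in ceph.conf; a minimal sketch, using the option spellings given in the thread (the [osd] section placement and the comments are my assumption, not from the thread):

```ini
[osd]
# snapshot-based writeparallel journal mode on ZFS (untested, per Sage)
filestore zfs_snap = 1
# per Sage: only needed if ZFS lacks directio / aio support
journal_aio = 0
journal_dio = 0
```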
Forgot to mention: when you create the ZFS/zpool datasets, make sure to set the xattr setting to "sa", e.g.

zpool create osd01 -O xattr=sa -O compression=lz4 sdb

or, if the zpool/zfs dataset is already created:

zfs set xattr=sa osd01

Cheers

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Michal Kozanecki
Sent: October-29-14 11:33 AM
To: Kenneth Waegeman; ceph-users
Subject: Re: [ceph-users] use ZFS for OSDs

Hi Kenneth,

I run a small ceph test cluster using ZoL (ZFS on Linux) on top of CentOS 7, so I'll try and answer any questions. :)

Yes, ZFS writeparallel support is there, but NOT compiled in by default. You'll need to compile it with --with-zlib, but that by itself will fail to compile the ZFS support, as I found out. You need to ensure you have ZoL installed and working, and then pass the location of libzfs to ceph at compile time. Personally, I just set my environment variables before compiling, like so:

ldconfig
export LIBZFS_LIBS=/usr/include/libzfs/
export LIBZFS_CFLAGS="-I/usr/include/libzfs -I/usr/include/libspl"

However, the writeparallel performance isn't all that great. Writeparallel mode makes heavy use of ZFS's (and BtrFS's, for that matter) snapshotting capability, and snap performance on ZoL, at least when I last tested it, is pretty terrible. You lose any performance benefits you gain with writeparallel to the poor snap performance.

If you decide that you don't need writeparallel mode, you can use the prebuilt packages (or compile with default options) without issue. Ceph (without ZFS support compiled in) will detect ZFS as a generic/ext4 filesystem and work accordingly.

As far as performance tweaking - ZIL, write journals, etc. - I found that the performance difference between using a ZIL and a ceph write journal is about the same. I also found that doing both (ZIL AND write journal) didn't give much of a performance benefit. In my small test cluster I decided, after testing, to forego the ZIL and only use an SSD-backed ceph write journal on each OSD, with each OSD being a single ZFS dataset/vdev (no zraid or mirroring).

With Ceph handling the redundancy at the OSD level I saw no need for ZFS mirroring or zraid; instead, if ZFS detects corruption, rather than self-healing it sends a read failure of the PG file to ceph, and ceph's scrub mechanisms should then repair/replace the PG file using a good replica elsewhere on the cluster. ZFS + ceph are a beautiful bitrot-fighting match!

Let me know if there's anything else I can answer.

Cheers

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Kenneth Waegeman
Sent: October-29-14 6:09 AM
To: ceph-users
Subject: [ceph-users] use ZFS for OSDs

[snip]
hi michal,

thanks for the info. we will certainly try it and see if we come to the same conclusions ;)

one small detail: since you were using CentOS 7, i'm assuming you were using ZoL 0.6.3?

stijn

On 10/29/2014 08:03 PM, Michal Kozanecki wrote:

[snip]
Hi Stijn,

Yes, on my cluster I am running CentOS 7, ZoL 0.6.3 and Ceph 0.80.5.

Cheers

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Stijn De Weirdt
Sent: October-29-14 3:49 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] use ZFS for OSDs

[snip]