Re: [ceph-users] use ZFS for OSDs

2015-04-14 Thread Quenten Grasso
Hi Michal,

Really nice work on the ZFS testing.

I've been thinking about this myself from time to time; however, I wasn't sure 
if ZoL was ready to use in production with Ceph.

Rather than running one OSD per disk, I'd like to see a single OSD sitting on, say, 
a RAID-Z2 of 8-12 3-4TB spinners, leveraging a nice SSD (maybe a P3700 400GB) for 
the ZIL/L2ARC, with compression, and going back to 2x replicas - that could give us 
some pretty fast/safe/efficient storage.
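
Roughly the layout I have in mind, sketched with made-up device names (one OSD on 
top of a single RAID-Z2 vdev of spinners, with the P3700 split between SLOG and 
L2ARC):

# sketch only - device names and disk count are hypothetical
zpool create -O xattr=sa -O compression=lz4 tank \
  raidz2 sda sdb sdc sdd sde sdf sdg sdh \
  log nvme0n1p1 \
  cache nvme0n1p2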

Now to find that money tree.

Regards,
Quenten Grasso

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Michal 
Kozanecki
Sent: Friday, 10 April 2015 5:15 AM
To: Christian Balzer; ceph-users
Subject: Re: [ceph-users] use ZFS for OSDs

I had surgery and have been off for a while, and I had to rebuild the test 
ceph+openstack cluster with whatever spare parts I had. I apologize for the delay 
to anyone who's been interested.

Here are the results;
==
Hardware/Software
3 node CEPH cluster, 3 OSDs (one OSD per node)
--
CPU = 1x E5-2670 v1
RAM = 8GB
OS Disk = 500GB SATA
OSD = 900GB 10k SAS (sdc - whole device)
Journal = Shared Intel SSD DC3500 80GB (sdb1 - 10GB partition)
ZFS log = Shared Intel SSD DC3500 80GB (sdb2 - 4GB partition)
ZFS L2ARC = Intel SSD 320 40GB (sdd - whole device)
-
ceph 0.87
ZoL 0.6.3
CentOS 7.0

2 node KVM/Openstack cluster

CPU = 2x Xeon X5650
RAM = 24 GB
OS Disk = 500GB SATA
-
Ubuntu 14.04
OpenStack Juno

The rough performance of this oddball-sized test ceph cluster is 1000-1500 IOPS 
at 8k.
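
For reference, the OSD pool above can be recreated roughly like this (a sketch; the 
device names, lz4, the 32K recordsize and xattr=sa all come from settings shown in 
this thread):

zpool create -O xattr=sa -O compression=lz4 -O recordsize=32K SAS1 sdc \
  log sdb2 \
  cache sdd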

==
Compression; (unneeded details cut out)
Data set: various Debian and CentOS KVM/OpenStack images, plus lots of test SVN 
and GIT data

[root@ceph03 ~]# zfs get all SAS1
NAME  PROPERTY          VALUE  SOURCE
SAS1  used              586G   -
SAS1  compressratio     1.50x  -
SAS1  recordsize        32K    local
SAS1  checksum          on     default
SAS1  compression       lz4    local
SAS1  refcompressratio  1.50x  -
SAS1  written           586G   -
SAS1  logicalused       877G   -

==
Dedupe; (dedupe is enabled at the dataset level, but dedupe space savings can only 
be viewed at the pool level - a bit odd, I know)
Data set: various Debian and CentOS KVM/OpenStack images, plus lots of test SVN 
and GIT data

[root@ceph01 ~]# zpool get all SAS1
NAME  PROPERTY    VALUE  SOURCE
SAS1  size        836G   -
SAS1  capacity    70%    -
SAS1  dedupratio  1.02x  -
SAS1  free        250G   -
SAS1  allocated   586G   -

==
Bitrot/Corruption;
Injected random data at random locations on sdc (changing the seek value to a 
random number each run) with;

dd if=/dev/urandom of=/dev/sdc seek=54356 bs=4k count=1

Results;

1. ZFS detects an error on disk affecting PG files; since this is a single vdev 
(no zraid or mirror) it cannot automatically fix it. It blocks all access (except 
delete) to the affected files, leaving them inaccessible. 
*note: I captured this status after already repairing 2 PGs (5.15 and 5.25); zpool 
status will no longer list a filename after it has been 
repaired/deleted/cleared*



[root@ceph01 ~]# zpool status -v
  pool: SAS1
 state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub in progress since Thu Apr  9 13:04:54 2015
153G scanned out of 586G at 40.3M/s, 3h3m to go
0 repaired, 26.05% done
config:

NAME      STATE     READ WRITE CKSUM
SAS1      ONLINE       0     0    35
  sdc     ONLINE       0     0    70
logs
  sdb2    ONLINE       0     0     0
cache
  sdd     ONLINE       0     0     0

errors: Permanent errors have been detected in the following files: 

/SAS1/current/5.e_head/DIR_E/DIR_0/DIR_6/rbd\udata.2ba762ae8944a.24cc__head_6153260E__5



2. CEPH-OSD cannot read PG file. Kicks off scrub/deep-scrub



/var/log/ceph/ceph-osd.2.log
2015-04-09 13:10:18.319312 7fcbb163a700 -1 log_channel(default) log [ERR] : 
5.18 shard 1: soid cd635018/rbd_data.93d1f74b0dc51.18ee/head//5 
candidate had a read error, digest 1835988768 != known digest 473354757
2015-04-09 13:11:38.587014 7fcbb1e3b700 -1 log_channel(default) log [ERR] : 
5.18 deep-scrub 0 missing, 1

Re: [ceph-users] use ZFS for OSDs

2015-04-09 Thread Michal Kozanecki
 pgs inconsistent; 2 scrub errors; noout flag(s) set
 monmap e2: 3 mons at 
{ceph01=10.10.10.101:6789/0,ceph02=10.10.10.102:6789/0,ceph03=10.10.10.103:6789/0},
 election epoch 146, quorum 0,1,2 ceph01,ceph02,ceph03
 osdmap e3178: 3 osds: 3 up, 3 in
flags noout
  pgmap v890949: 392 pgs, 6 pools, 931 GB data, 249 kobjects
1756 GB used, 704 GB / 2460 GB avail
   2 active+clean+inconsistent
 391 active+clean
  client io 0 B/s rd, 7920 B/s wr, 3 op/s



3. Repair must be manually kicked off



[root@client01 ~]# ceph pg repair 5.18
instructing pg 5.18 on osd.0 to repair

[root@client01 ~]# ceph health detail
HEALTH_WARN 1 pgs repair; noout flag(s) set
pg 5.25 is active+clean+inconsistent, acting [1,0,2]
pg 5.18 is active+clean+scrubbing+deep+repair, acting [2,0,1]

/var/log/ceph/ceph-osd.2.log
2015-04-09 13:30:01.609756 7fcbb163a700 -1 log_channel(default) log [ERR] : 
5.18 shard 1: soid cd635018/rbd_data.93d1f74b0dc51.18ee/head//5 
candidate had a read error, digest 1835988768 != known digest 473354757
2015-04-09 13:30:41.834465 7fcbb1e3b700 -1 log_channel(default) log [ERR] : 
5.18 repair 0 missing, 1 inconsistent objects
2015-04-09 13:30:41.834479 7fcbb1e3b700 -1 log_channel(default) log [ERR] : 
5.18 repair 1 errors, 1 fixed

/var/log/ceph/ceph-osd.1.log
2015-04-09 13:30:47.952742 7fe10b3c5700 -1 log_channel(default) log [ERR] : 
5.25 shard 1: soid 73eb0125/rbd_data.5315b2ae8944a.5348/head//5 
candidate had a read error, digest 1522345897 != known digest 1180025616
2015-04-09 13:31:23.389095 7fe10b3c5700 -1 log_channel(default) log [ERR] : 
5.25 repair 0 missing, 1 inconsistent objects
2015-04-09 13:31:23.389112 7fe10b3c5700 -1 log_channel(default) log [ERR] : 
5.25 repair 1 errors, 1 fixed



Conclusion;

ZFS compression works GREAT, between 30-50% compression depending on the data (I 
was getting around 30-35% with only OS images; once I loaded real test data 
(SVN/GIT/etc) this increased to ~50%).
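
If anyone wants to try the same thing, lz4 is a one-liner to enable and the savings 
can be read straight off the dataset (note compression only applies to data written 
after it is turned on):

zfs set compression=lz4 SAS1
zfs get compression,compressratio,used,logicalused SAS1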

ZFS dedupe doesn't seem to get you much, at least with how CEPH lays out its data. 
Maybe due to my recordsize (32K)?
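
For anyone curious about dedupe on their own pool: the ratio is a pool-wide figure, 
and zdb can estimate potential savings without actually enabling dedupe (it walks 
the pool and simulates the dedup table, so it takes a while):

zfs set dedup=on SAS1      # per-dataset switch; RAM hungry, see further down the thread
zpool get dedupratio SAS1  # pool-wide ratio
zdb -S SAS1                # simulate dedup and print an estimated ratio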

ZFS/CEPH bitrot/corruption protection isn't fully automated, but it's still pretty 
damn good in my opinion - an improvement over the silent bitrot and coin-toss 
repairs of other filesystems, where you only get a fix if CEPH somehow detects the 
error itself. Here, CEPH attempts to access the file, ZFS detects the error and 
blocks access to the file, and CEPH treats this as a read error and kicks off a 
scrub on the PG. PG repair does not happen automatically, however when manually 
kicked off it succeeds. 
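
To recap, the manual part of the recovery flow above boils down to a handful of 
commands (pool name and PG ID from this test):

zpool status -v SAS1   # ZFS names the PG file(s) it can no longer serve
ceph health detail     # shows which PGs are active+clean+inconsistent
ceph pg repair 5.18    # manually kick off repair for each inconsistent PG
zpool clear SAS1       # clear the ZFS error counters once ceph has rewritten
                       # the object (or let the next zpool scrub pick it up)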

Let me know if there's anything else or any questions people have while I have 
this test cluster running.

Cheers,
Michal Kozanecki | Linux Administrator | mkozane...@evertz.com


-Original Message-
From: Christian Balzer [mailto:ch...@gol.com] 
Sent: November-01-14 4:43 AM
To: ceph-users
Cc: Michal Kozanecki
Subject: Re: [ceph-users] use ZFS for OSDs

On Fri, 31 Oct 2014 16:32:49 + Michal Kozanecki wrote:

 I'll test this by manually inducing corrupted data to the ZFS 
 filesystem and report back how ZFS+ceph interact during a detected 
 file failure/corruption, how it recovers and any manual steps 
 required, and report back with the results.
 
Looking forward to that.

 As for compression, using lz4 the CPU impact is around 5-20% depending 
 on load, type of I/O and I/O size, with little-to-no I/O performance 
 impact, and in fact in some cases the I/O performance actually 
 increases. I'm currently looking at a compression ratio on the ZFS 
 datasets of around 30-35% for a data consisting of rbd backed 
 OpenStack KVM VMs.

I'm looking at a similar deployment (VM images), and over 30% compression would at 
least offset ZFS's need to keep at least 20% free space to avoid massive 
degradation.

CPU usage looks acceptable, however in combination with SSD-backed OSDs that's 
another thing to consider.
As in, is it worth spending X amount of money on faster CPUs for 10-20% space 
savings, or would another SSD be cheaper?

I'm trying to position Ceph against SolidFire, who are claiming 4-10 times data 
reduction by a combination of compression, deduping and thin provisioning. 
Without of course quantifying things, like what step gives which reduction 
based on what sample data.

 I have not tried any sort of dedupe as it is memory intensive and I 
 only had 24GB of ram on each node. I'll grab some FIO benchmarks and 
 report back.
 
I foresee a massive failure here, despite a huge potential given one use case 
here where all VMs are basically identical (KSM is very effective with those, 
too).
Why the predicted failure? Several reasons:

1. Deduping is only local, per OSD. 
That will make a big dent, but with many nearly identical VM images we should 
still have a quite a bit of identical data per OSD. However...

2. Data alignment.
The default RADOS objects making up images are 4MB. Which, given my limited

Re: [ceph-users] use ZFS for OSDs

2014-11-01 Thread Christian Balzer
On Fri, 31 Oct 2014 16:32:49 + Michal Kozanecki wrote:

 I'll test this by manually inducing corrupted data to the ZFS filesystem
 and report back how ZFS+ceph interact during a detected file
 failure/corruption, how it recovers and any manual steps required, and
 report back with the results. 
 
Looking forward to that.

 As for compression, using lz4 the CPU impact is around 5-20% depending
 on load, type of I/O and I/O size, with little-to-no I/O performance
 impact, and in fact in some cases the I/O performance actually
 increases. I'm currently looking at a compression ratio on the ZFS
 datasets of around 30-35% for a data consisting of rbd backed OpenStack
 KVM VMs. 

I'm looking at a similar deployment (VM images), and over 30% compression
would at least offset ZFS's need to keep at least 20% free space to avoid
massive degradation.

CPU usage looks acceptable, however in combination with SSD-backed OSDs
that's another thing to consider.
As in, is it worth spending X amount of money on faster CPUs for 10-20%
space savings, or would another SSD be cheaper?

I'm trying to position Ceph against SolidFire, who are claiming 4-10 times
data reduction by a combination of compression, deduping and thin
provisioning. 
Without of course quantifying things, like what step gives which reduction
based on what sample data.

 I have not tried any sort of dedupe as it is memory intensive
 and I only had 24GB of ram on each node. I'll grab some FIO benchmarks
 and report back.
 
I foresee a massive failure here, despite a huge potential given one use
case here where all VMs are basically identical (KSM is very effective
with those, too).
Why the predicted failure? Several reasons:

1. Deduping is only local, per OSD. 
That will make a big dent, but with many nearly identical VM images we
should still have a quite a bit of identical data per OSD. However...

2. Data alignment.
The default RADOS objects making up images are 4MB, which, given my limited
knowledge of ZFS, I presume will be mapped to 128KB ZFS blocks that are
then subject to the deduping process. 
However, even if one were to install the same OS on same-sized RBD
images, I predict subtle differences in alignment within those objects and
thus within the ZFS blocks.
That becomes a near certainty when those images (OS installs) are
customized, with files being added or deleted, etc.

3. ZFS block size and VM FS metadata.
Even if all the data were perfectly, identically aligned in the 4MB
RADOS objects, the resulting 128KB ZFS blocks are likely to contain
metadata like inodes (creation times), making them subtly
different and not eligible for deduping. 
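
A quick way to convince oneself of the alignment point: the same data shifted by 
even one sector produces a completely different set of fixed-size blocks, so 
block-level dedup finds nothing in common. A throwaway sketch (hypothetical file 
names, 128K standing in for the ZFS recordsize):

# 4MB of "image" data, and a copy shifted 512 bytes further into the object
dd if=/dev/urandom of=obj_A bs=1M count=4 2>/dev/null
( dd if=/dev/zero bs=512 count=1 2>/dev/null; cat obj_A ) > obj_B
# chop both into 128K records and count checksums shared between them
split -b 128K obj_A recA_
split -b 128K obj_B recB_
md5sum recA_* recB_* | awk '{print $1}' | sort | uniq -d | wc -l   # prints 0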

OTOH SolidFire claims to be doing global deduplication; how they do that
efficiently is a bit beyond me, especially given the memory sizes of their
appliances. My guess is they keep a map on disk (all SSDs) on each node
instead of keeping it in RAM. 
I suppose the updates (writes to the SSDs) of this map are still
substantially smaller than the data otherwise written without deduping.

Thus I think Ceph will need a similar approach for any deduping to work,
in combination with a much finer-grained block size. 
The latter, I believe, is already being discussed in the context of cache
tier pools; having to promote/demote 4MB blobs for a single hot 4KB of
data is hardly efficient.

Regards,

Christian

 Cheers,
 
 
 
 -Original Message-
 From: Christian Balzer [mailto:ch...@gol.com] 
 Sent: October-30-14 4:12 AM
 To: ceph-users
 Cc: Michal Kozanecki
 Subject: Re: [ceph-users] use ZFS for OSDs
 
 On Wed, 29 Oct 2014 15:32:57 + Michal Kozanecki wrote:
 
 [snip]
  With Ceph handling the
  redundancy at the OSD level I saw no need for using ZFS mirroring or 
  zraid, instead if ZFS detects corruption instead of self-healing it 
  sends a read failure of the pg file to ceph, and then ceph's scrub 
  mechanisms should then repair/replace the pg file using a good replica 
  elsewhere on the cluster. ZFS + ceph are a beautiful bitrot fighting 
  match!
  
 Could you elaborate on that? 
 AFAIK Ceph currently has no way to determine which of the replicas is
 good, one such failed PG object will require you to do a manual repair
 after the scrub and hope that two surviving replicas (assuming a size of
 3) are identical. If not, start tossing a coin. Ideally Ceph would have
 a way to know what happened (as in, it's a checksum and not a real I/O
 error) and do a rebuild of that object itself.
 
 On an other note, have you done any tests using the ZFS compression?
 I'm wondering what the performance impact and efficiency are.
 
 Christian


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] use ZFS for OSDs

2014-10-31 Thread Michal Kozanecki
I'll test this by manually inducing corrupted data on the ZFS filesystem and 
report back with the results: how ZFS+ceph interact during a detected file 
failure/corruption, how it recovers, and any manual steps required. 

As for compression, using lz4 the CPU impact is around 5-20% depending on load, 
type of I/O and I/O size, with little-to-no I/O performance impact, and in fact 
in some cases the I/O performance actually increases. I'm currently looking at 
a compression ratio on the ZFS datasets of around 30-35% for data consisting 
of rbd-backed OpenStack KVM VMs. I have not tried any sort of dedupe as it is 
memory intensive and I only had 24GB of ram on each node. I'll grab some FIO 
benchmarks and report back.

Cheers,



-Original Message-
From: Christian Balzer [mailto:ch...@gol.com] 
Sent: October-30-14 4:12 AM
To: ceph-users
Cc: Michal Kozanecki
Subject: Re: [ceph-users] use ZFS for OSDs

On Wed, 29 Oct 2014 15:32:57 + Michal Kozanecki wrote:

[snip]
 With Ceph handling the
 redundancy at the OSD level I saw no need for using ZFS mirroring or 
 zraid, instead if ZFS detects corruption instead of self-healing it 
 sends a read failure of the pg file to ceph, and then ceph's scrub 
 mechanisms should then repair/replace the pg file using a good replica 
 elsewhere on the cluster. ZFS + ceph are a beautiful bitrot fighting 
 match!
 
Could you elaborate on that? 
AFAIK Ceph currently has no way to determine which of the replicas is good, 
one such failed PG object will require you to do a manual repair after the 
scrub and hope that two surviving replicas (assuming a size of
3) are identical. If not, start tossing a coin.
Ideally Ceph would have a way to know what happened (as in, it's a checksum and 
not a real I/O error) and do a rebuild of that object itself.

On another note, have you done any tests using the ZFS compression?
I'm wondering what the performance impact and efficiency are.

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] use ZFS for OSDs

2014-10-30 Thread Christian Balzer
On Wed, 29 Oct 2014 15:32:57 + Michal Kozanecki wrote:

[snip]
 With Ceph handling the
 redundancy at the OSD level I saw no need for using ZFS mirroring or
 zraid, instead if ZFS detects corruption instead of self-healing it
 sends a read failure of the pg file to ceph, and then ceph's scrub
 mechanisms should then repair/replace the pg file using a good replica
 elsewhere on the cluster. ZFS + ceph are a beautiful bitrot fighting
 match!
 
Could you elaborate on that? 
AFAIK Ceph currently has no way to determine which of the replicas is
good, one such failed PG object will require you to do a manual repair
after the scrub and hope that two surviving replicas (assuming a size of
3) are identical. If not, start tossing a coin.
Ideally Ceph would have a way to know what happened (as in, it's a
checksum and not a real I/O error) and do a rebuild of that object itself.

On another note, have you done any tests using the ZFS compression?
I'm wondering what the performance impact and efficiency are.

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] use ZFS for OSDs

2014-10-29 Thread Sage Weil
On Wed, 29 Oct 2014, Kenneth Waegeman wrote:
 Hi,
 
 We are looking to use ZFS for our OSD backend, but I have some questions.
 
 My main question is: does Ceph already support the writeparallel mode for ZFS?
 (as described here:
 http://www.sebastien-han.fr/blog/2013/12/02/ceph-performance-interesting-things-going-on/)
 I've found this, but I suppose it is outdated:
 https://wiki.ceph.com/Planning/Blueprints/Emperor/osd%3A_ceph_on_zfs

All of the code is there, but it is almost completely untested.
 
 Should Ceph be built with ZFS support? I found a --with-zfslib option
 somewhere, but can someone verify this, or better yet, have instructions for it? :-)

 What parameters should be tuned to use this?
 I found these :
filestore zfs_snap = 1

Yes

journal_aio = 0
journal_dio = 0

Maybe, if ZFS doesn't support directio or aio.
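
Roughly how those would sit in ceph.conf on the OSD hosts (a sketch using the 
option names quoted above):

[osd]
    filestore zfs_snap = 1
    # only needed if ZFS on your platform lacks aio/dio support for the journal
    journal_aio = 0
    journal_dio = 0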

Curious to hear how it goes!  Wouldn't recommend this for production 
though without significant testing.

sage

 
 Are there other things we need for it?
 
 Many thanks!!
 Kenneth
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] use ZFS for OSDs

2014-10-29 Thread Michal Kozanecki
Forgot to mention, when you create the ZFS/ZPOOL datasets, make sure to set the 
xattr setting to sa

e.g.
  
zpool create osd01 -O xattr=sa -O compression=lz4 sdb

OR if zpool/zfs dataset already created

zfs set xattr=sa osd01

Cheers



-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Michal 
Kozanecki
Sent: October-29-14 11:33 AM
To: Kenneth Waegeman; ceph-users
Subject: Re: [ceph-users] use ZFS for OSDs

Hi Kenneth,

I run a small ceph test cluster using ZoL (ZFS on Linux) ontop of CentOS 7, so 
I'll try and answer any questions. :) 

Yes, ZFS writeparallel support is there, but NOT compiled in by default. You'll 
need to compile it with --with-zlib, but that by itself will fail to compile 
the ZFS support as I found out. You need to ensure you have ZoL installed and 
working, and then pass the location of libzfs to ceph at compile time. 
Personally I just set my environment variables before compiling like so;

ldconfig
export LIBZFS_LIBS=/usr/include/libzfs/
export LIBZFS_CFLAGS=-I/usr/include/libzfs -I/usr/include/libspl
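
After that it is the usual autotools build; something like the following (the exact 
configure flag name here is an assumption on my part - check ./configure --help on 
your release):

./autogen.sh
./configure --with-libzfs   # flag name may differ between releases
make -j$(nproc)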

However, the writeparallel performance isn't all that great. The writeparallel 
mode makes heavy use of ZFS's (and BtrFS's for that matter) snapshotting 
capability, and the snap performance on ZoL, at least when I last tested it, is 
pretty terrible. You lose any performance benefits you gain with writeparallel 
to the poor snap performance. 

If you decide that you don't need writeparallel mode, you can use the prebuilt 
packages (or compile with default options) without issue. Ceph (without zlib 
support compiled in) will detect ZFS as a generic/ext4 file system and work 
accordingly. 

As far as performance tweaking (ZIL, write journals, etc.) goes, I found that 
performance is about the same whether you use a ZIL or a ceph write journal, and 
doing both (ZIL AND write journal) didn't give me much of a performance benefit 
either. In my small test cluster I decided after testing to forego the ZIL and only 
use an SSD-backed ceph write journal on each OSD, with each OSD being a single ZFS 
dataset/vdev (no zraid or mirroring). With Ceph handling the redundancy at the OSD 
level I saw no need for ZFS mirroring or zraid; instead, when ZFS detects 
corruption, rather than self-healing it returns a read failure for the pg file to 
ceph, and ceph's scrub mechanisms can then repair/replace the pg file using a good 
replica elsewhere on the cluster. ZFS + ceph are a beautiful bitrot-fighting match!

Let me know if there's anything else I can answer. 

Cheers

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Kenneth Waegeman
Sent: October-29-14 6:09 AM
To: ceph-users
Subject: [ceph-users] use ZFS for OSDs

Hi,

We are looking to use ZFS for our OSD backend, but I have some questions.

My main question is: does Ceph already support the writeparallel mode for ZFS? 
(as described here:  
http://www.sebastien-han.fr/blog/2013/12/02/ceph-performance-interesting-things-going-on/)
I've found this, but I suppose it is outdated:  
https://wiki.ceph.com/Planning/Blueprints/Emperor/osd%3A_ceph_on_zfs

Should Ceph be built with ZFS support? I found a --with-zfslib option 
somewhere, but can someone verify this, or better yet, have instructions for 
it? :-)

What parameters should be tuned to use this?
I found these :
 filestore zfs_snap = 1
 journal_aio = 0
 journal_dio = 0

Are there other things we need for it?

Many thanks!!
Kenneth

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] use ZFS for OSDs

2014-10-29 Thread Stijn De Weirdt


hi michal,

thanks for the info. we will certainly try it and see if we come to the 
same conclusions ;)


one small detail: since you were using centos7, i'm assuming you were 
using ZoL 0.6.3?


stijn

On 10/29/2014 08:03 PM, Michal Kozanecki wrote:

Forgot to mention, when you create the ZFS/ZPOOL datasets, make sure to set the 
xattr setting to sa

e.g.

zpool create osd01 -O xattr=sa -O compression=lz4 sdb

OR if zpool/zfs dataset already created

zfs set xattr=sa osd01

Cheers



-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Michal 
Kozanecki
Sent: October-29-14 11:33 AM
To: Kenneth Waegeman; ceph-users
Subject: Re: [ceph-users] use ZFS for OSDs

Hi Kenneth,

I run a small ceph test cluster using ZoL (ZFS on Linux) ontop of CentOS 7, so 
I'll try and answer any questions. :)

Yes, ZFS writeparallel support is there, but NOT compiled in by default. You'll 
need to compile it with --with-zlib, but that by itself will fail to compile 
the ZFS support as I found out. You need to ensure you have ZoL installed and 
working, and then pass the location of libzfs to ceph at compile time. 
Personally I just set my environment variables before compiling like so;

ldconfig
export LIBZFS_LIBS=/usr/include/libzfs/
export LIBZFS_CFLAGS=-I/usr/include/libzfs -I/usr/include/libspl

However, the writeparallel performance isn't all that great. The writeparallel 
mode makes heavy use of ZFS's (and BtrFS's for that matter) snapshotting 
capability, and the snap performance on ZoL, at least when I last tested it, is 
pretty terrible. You lose any performance benefits you gain with writeparallel 
to the poor snap performance.

If you decide that you don't need writeparallel mode you, can use the prebuilt 
packages (or compile with default options) without issue. Ceph (without zlib 
support compiled in) will detect ZFS as a generic/ext4 file system and work 
accordingly.

As far as performance tweaking, ZIL, write journals and etc, I found that the 
performance difference between using a ZIL vs ceph write journal is about the 
same. I also found that doing both (ZIL AND writejournal) didn't give me much 
of a performance benefit. In my small test cluster I decided after testing to 
forego the ZIL and only use a SSD backed ceph write journal on each OSD, with 
each OSD being a single ZFS dataset/vdev(no zraid or mirroring). With Ceph 
handling the redundancy at the OSD level I saw no need for using ZFS mirroring 
or zraid, instead if ZFS detects corruption instead of self-healing it sends a 
read failure of the pg file to ceph, and then ceph's scrub mechanisms should 
then repair/replace the pg file using a good replica elsewhere on the cluster. 
ZFS + ceph are a beautiful bitrot fighting match!

Let me know if there's anything else I can answer.

Cheers

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Kenneth Waegeman
Sent: October-29-14 6:09 AM
To: ceph-users
Subject: [ceph-users] use ZFS for OSDs

Hi,

We are looking to use ZFS for our OSD backend, but I have some questions.

My main question is: Does Ceph already supports the writeparallel mode for ZFS 
? (as described here:
http://www.sebastien-han.fr/blog/2013/12/02/ceph-performance-interesting-things-going-on/)
I've found this, but I suppose it is outdated:
https://wiki.ceph.com/Planning/Blueprints/Emperor/osd%3A_ceph_on_zfs

Should Ceph be build with ZFS support? I found a --with-zfslib option 
somewhere, but can someone verify this, or better has instructions for
it?:-)

What parameters should be tuned to use this?
I found these :
  filestore zfs_snap = 1
  journal_aio = 0
  journal_dio = 0

Are there other things we need for it?

Many thanks!!
Kenneth



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] use ZFS for OSDs

2014-10-29 Thread Michal Kozanecki
Hi Stijn,

Yes, on my cluster I am running CentOS 7, ZoL 0.6.3, and Ceph 0.80.5.

Cheers


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Stijn 
De Weirdt
Sent: October-29-14 3:49 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] use ZFS for OSDs


hi michal,

thanks for the info. we will certainly try it and see if we come to the same 
conclusions ;)

one small detail: since you were using centos7, i'm assuming you were using ZoL 
0.6.3?

stijn

On 10/29/2014 08:03 PM, Michal Kozanecki wrote:
 Forgot to mention, when you create the ZFS/ZPOOL datasets, make sure 
 to set the xattr setting to sa

 e.g.

 zpool create osd01 -O xattr=sa -O compression=lz4 sdb

 OR if zpool/zfs dataset already created

 zfs set xattr=sa osd01

 Cheers



 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf 
 Of Michal Kozanecki
 Sent: October-29-14 11:33 AM
 To: Kenneth Waegeman; ceph-users
 Subject: Re: [ceph-users] use ZFS for OSDs

 Hi Kenneth,

 I run a small ceph test cluster using ZoL (ZFS on Linux) ontop of 
 CentOS 7, so I'll try and answer any questions. :)

 Yes, ZFS writeparallel support is there, but NOT compiled in by 
 default. You'll need to compile it with --with-zlib, but that by 
 itself will fail to compile the ZFS support as I found out. You need 
 to ensure you have ZoL installed and working, and then pass the 
 location of libzfs to ceph at compile time. Personally I just set my 
 environment variables before compiling like so;

 ldconfig
 export LIBZFS_LIBS=/usr/include/libzfs/
 export LIBZFS_CFLAGS=-I/usr/include/libzfs -I/usr/include/libspl

 However, the writeparallel performance isn't all that great. The 
 writeparallel mode makes heavy use of ZFS's (and BtrFS's for that matter) 
 snapshotting capability, and the snap performance on ZoL, at least when I 
 last tested it, is pretty terrible. You lose any performance benefits you 
 gain with writeparallel to the poor snap performance.

 If you decide that you don't need writeparallel mode you, can use the 
 prebuilt packages (or compile with default options) without issue. Ceph 
 (without zlib support compiled in) will detect ZFS as a generic/ext4 file 
 system and work accordingly.

 As far as performance tweaking, ZIL, write journals and etc, I found that the 
 performance difference between using a ZIL vs ceph write journal is about the 
 same. I also found that doing both (ZIL AND writejournal) didn't give me much 
 of a performance benefit. In my small test cluster I decided after testing to 
 forego the ZIL and only use a SSD backed ceph write journal on each OSD, with 
 each OSD being a single ZFS dataset/vdev(no zraid or mirroring). With Ceph 
 handling the redundancy at the OSD level I saw no need for using ZFS 
 mirroring or zraid, instead if ZFS detects corruption instead of self-healing 
 it sends a read failure of the pg file to ceph, and then ceph's scrub 
 mechanisms should then repair/replace the pg file using a good replica 
 elsewhere on the cluster. ZFS + ceph are a beautiful bitrot fighting match!

 Let me know if there's anything else I can answer.

 Cheers

 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf 
 Of Kenneth Waegeman
 Sent: October-29-14 6:09 AM
 To: ceph-users
 Subject: [ceph-users] use ZFS for OSDs

 Hi,

 We are looking to use ZFS for our OSD backend, but I have some questions.

 My main question is: Does Ceph already supports the writeparallel mode for 
 ZFS ? (as described here:
 http://www.sebastien-han.fr/blog/2013/12/02/ceph-performance-interesti
 ng-things-going-on/) I've found this, but I suppose it is outdated:
 https://wiki.ceph.com/Planning/Blueprints/Emperor/osd%3A_ceph_on_zfs

 Should Ceph be build with ZFS support? I found a --with-zfslib option 
 somewhere, but can someone verify this, or better has instructions for
 it?:-)

 What parameters should be tuned to use this?
 I found these :
   filestore zfs_snap = 1
   journal_aio = 0
   journal_dio = 0

 Are there other things we need for it?

 Many thanks!!
 Kenneth

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com