I recently built an OpenIndiana 151a7 system that is currently 1/2 PB
and will be expanded to 1 PB as we collect imaging data for the Human
Connectome Project at Washington University in St. Louis.  It is very much
like your use case, as this is an offsite backup system that will write once
and read rarely.

It displaced a BlueArc DR system because BlueArc's mechanisms for syncing
over distances could not keep up with our data generation rate.  The fact
that it cost 5x per TB compared to homebrew helped the decision as well.

It is currently 180 4TB SAS Seagate Constellations in 4 Supermicro JBODs.
The JBODs are currently in two branches, cascading only once.  When
expanded, 4 JBODs will be on each branch.  The pool is configured as 9 vdevs
of 19 drives each in raidz3.  The remaining disks are configured as hot
spares.  Metadata only is cached, in 128GB of RAM and in 2 480GB Intel 520
SSDs for L2ARC.  Sync (ZIL) is turned off, since the worst that could happen
is that we would need to rerun an rsync job.
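For reference, a pool of this shape could be created roughly as follows.
This is a minimal sketch, not the actual build script: the device names and
spare list are placeholders, and the cache/sync settings just restate the
trade-offs described above.

```shell
# Sketch: 9 raidz3 vdevs of 19 drives each, hot spares, metadata-only
# caching, and sync disabled.  Device names are placeholders and do not
# reflect the real JBOD topology.
vdevs=""
for shelf in 1 2 3 4 5 6 7 8 9; do
    disks=""
    for d in $(seq 0 18); do
        disks="$disks c${shelf}t${d}d0"
    done
    vdevs="$vdevs raidz3 $disks"
done

zpool create tank $vdevs \
    spare c10t0d0 c10t1d0 c10t2d0 \
    cache c11t0d0 c11t1d0             # 2x 480GB Intel 520 SSDs (L2ARC)

zfs set primarycache=metadata tank    # cache metadata only in RAM (ARC)
zfs set secondarycache=metadata tank  # and metadata only in L2ARC
zfs set sync=disabled tank            # rerunning an rsync job is acceptable
```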

Two identical servers were built for a cold standby configuration.  Since
it is a DR system, a hot standby was ruled out; even several hours of
downtime would not be an issue.  Each server is fitted with 2
LSI 9207-8e HBAs configured as redundant multipath to the JBODs.
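On illumos/OpenIndiana the redundant paths can be checked with mpathadm;
the device name below is a placeholder, not one of the actual drives.

```shell
# List all multipathed logical units, including path counts:
mpathadm list lu

# Show detailed path state for a single drive (placeholder device name):
mpathadm show lu /dev/rdsk/c0t5000C500057B9D63d0s2
```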

Before putting it into service I ran several iozone tests to benchmark the
pool.  Even with really fat vdevs the performance is impressive.  If
you're interested in that data, let me know.  The system has many hours of
idle time each day, so additional performance tests are not out of the
question.
Actually, I should say I designed and configured the system; it was
assembled by a colleague at UMINN.  If you would like more details on the
hardware, I have a very detailed assembly doc I wrote and would be happy
to share it.

The system receives daily rsyncs from our production BlueArc system.  The
rsyncs are split into 120 parallel rsync jobs.  This overcomes the
latency-induced slowdown that a single TCP stream suffers over distance,
and we see total throughput between 500-700Mb/s.  The BlueArc has 120TB of
15k SAS tiered to NL-SAS, with all metadata on the SAS pool.  The ZFS
system outpaces the BlueArc on metadata when rsync does its tree walk.
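The parallel fan-out can be sketched with xargs, one rsync per top-level
subdirectory.  The host, paths, and job count below are placeholders, and
the real jobs are almost certainly split differently; this just shows the
pattern.

```shell
#!/bin/sh
# Fan one large tree out into parallel rsync jobs, one per top-level
# subdirectory, to work around single-stream TCP throughput limits.
# SRC, DEST, and JOBS are placeholders -- adjust for your site.
SRC=/bluearc/projects
DEST=backuphost:/tank/projects
JOBS=120

# Run one rsync per subdirectory, at most $JOBS at a time:
ls "$SRC" | xargs -P "$JOBS" -I{} \
    rsync -a "$SRC/{}/" "$DEST/{}/"
```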

Given all the safeguards built into ZFS, I would not hesitate to build a
production system at the multi-petabyte scale.  If a channel to the disks
is no longer available, the pool simply stops writing and the data stays
safe.  Given the redundant paths, power supplies, etc., the odds of that
happening are very low.  The single points of failure left when running a
single server are at the motherboard, CPU, and RAM level.  Build a hot
standby server and human error becomes the most likely failure.


On Fri, Mar 15, 2013 at 8:09 PM, Marion Hakanson <hakan...@ohsu.edu> wrote:

> Greetings,
> Has anyone out there built a 1-petabyte pool?  I've been asked to look
> into this, and was told "low performance" is fine, workload is likely
> to be write-once, read-occasionally, archive storage of gene sequencing
> data.  Probably a single 10Gbit NIC for connectivity is sufficient.
> We've had decent success with the 45-slot, 4U SuperMicro SAS disk chassis,
> using 4TB "nearline SAS" drives, giving over 100TB usable space (raidz3).
> Back-of-the-envelope might suggest stacking up eight to ten of those,
> depending if you want a "raw marketing petabyte", or a proper "power-of-two
> usable petabyte".
> I get a little nervous at the thought of hooking all that up to a single
> server, and am a little vague on how much RAM would be advisable, other
> than "as much as will fit" (:-).  Then again, I've been waiting for
> something like pNFS/NFSv4.1 to be usable for gluing together multiple
> NFS servers into a single global namespace, without any sign of that
> happening anytime soon.
> So, has anyone done this?  Or come close to it?  Thoughts, even if you
> haven't done it yourself?
> Thanks and regards,
> Marion
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss