Thanks for sharing, Jeff!
On Mar 24, 2012, at 4:33 PM, Jeff Bacon wrote:
>> 2012-03-21 16:41, Paul Kraus wrote:
>>> I have been running ZFS in a mission critical application since
>>> zpool version 10 and have not seen any issues with some of the vdevs
>>> in a zpool full while others are virtually empty. We have been running
>>> commercial Solaris 10 releases. The configuration was that each
>> Thanks for sharing some real-life data from larger deployments,
>> as you often did. That's something I don't often have access
>> to nowadays, with a liberty to tell :)
> Here's another datapoint, then:
> I'm using sol10u9 and u10 on a number of supermicro boxes,
> mostly X8DTH boards with LSI 9211/9208 controllers and E5600 CPUs.
> Application is NFS file service to a bunch of clients, and
> we also have an in-house database application written in Java
> which implements a column-oriented db in files. Just about all
> of it is raidz2, much of it running gzip-compressed.
> Since I can't find anything saying not to other than some common
> wisdom about not putting your eggs all in one basket that I'm
> choosing to reject in some cases, I just keep adding vdevs to
> the pool. started with 2TB barracudas for dev/test/archive
> usage and constellations for prod, now 3TB drives, have just
> added some of the new Pipeline drives with nothing particularly
> of interest to note therefrom.
> You can create a startlingly large pool this way:
> ny-fs7(68)% zpool list
> NAME SIZE ALLOC FREE CAP HEALTH ALTROOT
> srv 177T 114T 63.3T 64% ONLINE -
> most pools are smaller. this is an archive box that's also
> the guinea pig, 12 vdevs of 7 drives raidz2. the largest prod
> one is 130TB in 11 vdevs of 8 drives raidz2. I won't guess
> at the mix of 2TB and 3TB. these are both sol10u9.
> Another box has 150TB in 6 pools, raidz2/gzip using 2TB
> constellations, dual X5690s with 144GB RAM running 20-30
> Java db workers. We do manage to break this box on the
> odd occasion - there's a race condition in the ZIO code
> where a buffer can be freed while the block buffer is in
> the process of being "loaned" out to the compression code.
> However, it takes 600 zpool threads plus another 600-900
> java threads running at the same time with a backlog of
> 80000 ZIOs in queue, so it's not the sort of thing that
> anyone's likely to run across much. :) It's fixed
> in sol11, I understand; however, our intended fix is
> to split the whole thing so that the workload (which
> for various reasons needs to be on one box) is moved
> to a 4-socket Westmere, and all of the data pools
> are served via NFS from other boxes.
> I did lose some data once, long ago, using LSI 1068-based
> controllers on older kit, but pretty much I can attribute
> that to something between me-being-stupid and the 1068s
> really not being especially friendly towards the LSI
> expander chips in the older 3Gb/s SMC backplanes when used
> for SATA-over-SAS tunneling. The current arrangements
> are pretty solid otherwise.
In general, mixing SATA and SAS directly behind expanders (e.g., without
SAS/SATA interposers) seems to be bad juju that an OS can't fix.
> The SATA-based boxes can be a little cranky when a drive
> toasts, of course - they sit and hang for a while until they
> finally decide to offline the drive. We take that as par
> for the course; for the application in question (basically,
> storing huge amounts of data on the odd occasion that someone
> has a need for it), it's not exactly a showstopper.
> I am curious as to whether there is any practical upper-limit
> on the number of vdevs, or how far one might push this kind of
> configuration in terms of pool size - assuming a sufficient
> quantity of RAM, of course.... I'm sure I will need to
> split this up someday but for the application there's just
> something hideously convenient about leaving it all in one
> filesystem in one pool.
I've run pools with > 100 top-level vdevs. It is not uncommon to see
40+ top-level vdevs.
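
For anyone sizing a similar layout, here is a rough back-of-the-envelope
sketch in Python of how the drive counts in the two raidz2 pools quoted
above break down. It is not taken from Jeff's setup, and the function name
raidz_layout is made up for illustration. It assumes every vdev in the pool
is identical and ignores raidz padding, metadata, and compression, so treat
the result as an upper bound on the data-drive share, not as what zfs list
would report.

```python
def raidz_layout(vdevs, drives_per_vdev, parity=2):
    """Return (total_drives, data_drives, parity_fraction) for a pool
    built from identical raidz vdevs. parity=2 corresponds to raidz2."""
    if drives_per_vdev <= parity:
        raise ValueError("a raidz vdev needs more drives than parity devices")
    total_drives = vdevs * drives_per_vdev
    # Each vdev gives up `parity` drives' worth of space to redundancy.
    data_drives = vdevs * (drives_per_vdev - parity)
    parity_fraction = parity / drives_per_vdev
    return total_drives, data_drives, parity_fraction


# Geometries taken from the quoted message; the labels are just for display.
for label, vdevs, width in [("archive", 12, 7), ("largest prod", 11, 8)]:
    total, data, frac = raidz_layout(vdevs, width)
    print(f"{label}: {total} drives, {data} data-drive equivalents, "
          f"{frac:.1%} of raw space used for parity")
```

The point is simply that the wider 8-drive vdevs spend a smaller share of
raw space on parity than the 7-drive ones, while the number of top-level
vdevs (12 and 11) stays well below the counts mentioned above.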
zfs-discuss mailing list