Hello Yaniv,

Wednesday, April 18, 2007, 3:44:57 PM, you wrote:
YA> Hello,
YA> I'd like to plan a storage solution for a system currently in production.
YA> The system's storage is based on code which writes many files to
YA> the file system, with overall storage needs currently around 40TB
YA> and expected to reach hundreds of TBs. The average file size of
YA> the system is ~100K, which translates to ~500 million files today,
YA> and billions of files in the future. This storage is accessed over
YA> NFS by a rack of 40 Linux blades, and is mostly read-only (99% of
YA> the activity is reads). While I realize calling this sub-optimal
YA> system design is probably an understatement, the design of the
YA> system is beyond my control and isn't likely to change in the near future.
YA> The system's current storage is based on 4 VxFS filesystems,
YA> created on SVM meta-devices each ~10TB in size. A 2-node Sun
YA> Cluster serves the filesystems, 2 filesystems per node. Each of
YA> the filesystems undergoes growfs as more storage is made
YA> available. We're looking for an alternative solution, in an
YA> attempt to improve performance and ability to recover from
YA> disasters (fsck on 2^42 files isn't practical, and I'm getting
YA> pretty worried due to this fact - even the smallest filesystem
YA> inconsistency will leave me lots of useless bits).
YA> Question is - does anyone here have experience with large ZFS
YA> filesystems with many small files? Is it practical to base such a
YA> solution on a few (8) zpools, each with a single large filesystem in it?
YA> Many thanks in advance for any advice,

I have "some" experience with a similar but bigger environment, with a lot of data already on ZFS (for years now), although I can't go into many details...

One of the problems is: how are you going to back up all this data? With so many small files the classical approach probably won't work, and if it does now, it won't in the (near) future. I would strongly suggest disk-to-disk backup plus snapshots for point-in-time backups (rough sketch at the end of this message).

With a lot of small files I have observed ZFS consuming about the same disk space as UFS. It also seems there's a problem with filesystem fragmentation after some time with a lot of files (zfs send | zfs recv helps for some time).

While I see no problem going with one file system (the pool's root file system itself?) in each zpool, with TBs of data I would consider splitting it into more file systems, mostly for "management" reasons like backup and snapshotting. Splitting into more file systems also helps when you have to migrate one of the file systems to other storage - it's easier to find 1TB of free storage than 20TB. I try to keep each production file system below 1TB, not that there are any problems with larger file systems (see the second sketch below).

When doing Sun Cluster, consider creating at least as many zpools as you have nodes in the cluster, so that if you have to, you can spread your workload across all the nodes (put each zpool in a different SC resource group with its own IP - third sketch below).

We did some tests with Linux (2.4 and 2.6) and it seems there's a problem if you have thousands of NFS file systems - they won't all be mounted automatically, and even when doing it manually (or in a script with a sleep between each mount) there seems to be a limit below 1000. We did not investigate further, as in that environment all NFS clients are Solaris servers (x86, SPARC) and we see no problems with thousands of file systems there.

If you switch an RG to a node which is already serving another NFS resource group, keep in mind that nfsd will actually restart, which means a service disruption for that other group as well. With ZFS, stopping nfsd can sometimes take minutes...
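To illustrate the disk-to-disk + snapshot idea, a rough sketch only - the pool/dataset names (pool01/data, backup01/data), the snapshot names and the backup host are all made up, and the very first transfer would of course be a full (non-incremental) send:

    # zfs snapshot pool01/data@2007-04-18
    # zfs send -i pool01/data@2007-04-17 pool01/data@2007-04-18 | \
          ssh backuphost zfs receive backup01/data

The same zfs send | zfs recv mechanism is what helps with the fragmentation issue mentioned above, since the receive writes the data out fresh on the target.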
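A minimal sketch of the "many smaller file systems per pool" layout - names, quota and share options are just examples, adjust them to your environment:

    # zpool create pool01 <your devices/LUNs>
    # zfs create pool01/fs001
    # zfs set quota=1T pool01/fs001
    # zfs set sharenfs=on pool01/fs001
    # zfs snapshot pool01/fs001@nightly

Each ~1TB file system can then be snapshotted, backed up or migrated on its own instead of as one huge file system.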
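And roughly how the "one zpool per cluster node" idea could look with Sun Cluster 3.2 style commands (3.1 uses scrgadm instead; the group/resource names and the logical hostname are made up, and you would repeat this per pool):

    # clresourcetype register SUNW.HAStoragePlus
    # clresourcegroup create rg-pool01
    # clreslogicalhostname create -g rg-pool01 -h nfs-vip01 lh-pool01
    # clresource create -g rg-pool01 -t SUNW.HAStoragePlus \
          -p Zpools=pool01 hasp-pool01
    # clresourcegroup online -M rg-pool01

With one such group per pool (and per node during normal operation) you can move an individual pool and its IP to another node without touching the rest.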
There are also more things to consider (storage layout, network config, etc.).

-- 
Best regards,
Robert Milkowski
mailto:[EMAIL PROTECTED]
http://milek.blogspot.com

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss