Hello Yaniv,

Wednesday, April 18, 2007, 3:44:57 PM, you wrote:

YA> Hello,

YA> I'd like to plan a storage solution for a system currently in production.

YA> The system's storage is based on code which writes many files to
YA> the file system, with overall storage needs currently around 40TB
YA> and expected to reach hundreds of TBs. The average file size of
YA> the system is ~100K, which translates to ~500 million files today,
YA> and billions of files in the future. This storage is accessed over
YA> NFS by a rack of 40 Linux blades, and is mostly read-only (99% of
YA> the activity is reads). While I realize calling this sub-optimal
YA> system design is probably an understatement, the design of the
YA> system is beyond my control and isn't likely to change in the near future.

YA> The system's current storage is based on 4 VxFS filesystems,
YA> created on SVM meta-devices each ~10TB in size. A 2-node Sun
YA> Cluster serves the filesystems, 2 filesystems per node. Each of
YA> the filesystems undergoes growfs as more storage is made
YA> available. We're looking for an alternative solution, in an
YA> attempt to improve performance and ability to recover from
YA> disasters (fsck on 2^42 files isn't practical, and I'm getting
YA> pretty worried due to this fact - even the smallest filesystem
YA> inconsistency will leave me lots of useless bits).

YA> Question is - does anyone here have experience with large ZFS
YA> filesystems with many small-files? Is it practical to base such a
YA> solution on a few (8) zpools, each with single large filesystem in it?

YA> Many thanks in advance for any advice,

I have "some" experience with similar but bigger environment and lot of
data already on zfs (for years now). Although I can't talk many
details...

One of the problems is: how are you going to back up all this data?
With so many small files the classical approach probably won't work,
and even if it does now, it won't in the (near) future. I would
strongly suggest disk-to-disk backup plus snapshots for point-in-time
backups.
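
For example, a point-in-time copy plus a disk-to-disk transfer could
look roughly like this (pool, file system and host names are made up,
adjust to your layout):

    # take a point-in-time snapshot on the production pool
    zfs snapshot prodpool/data@2007-04-18

    # first full copy to a backup pool (local or over ssh)
    zfs send prodpool/data@2007-04-18 | ssh backuphost zfs recv -d backuppool

    # later, send only the changes since the previous snapshot
    zfs snapshot prodpool/data@2007-04-19
    zfs send -i 2007-04-18 prodpool/data@2007-04-19 | \
        ssh backuphost zfs recv -d backuppool

The snapshots on the production pool also give you cheap point-in-time
recovery without touching the backup copy at all.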

With lots of small files I have observed ZFS consuming about the same
disk space as UFS.

It seems there is a problem with file system fragmentation after some
time with lots of files (zfs send | zfs recv helps for a while).
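
By "zfs send | zfs recv helps" I mean periodically rewriting a file
system into a fresh one, roughly like this (names made up; you would
also have to re-share and re-mount on the clients, so it's not free):

    zfs snapshot prodpool/data@rewrite
    zfs send prodpool/data@rewrite | zfs recv prodpool/data-new

    # after verifying the copy, swap the names
    zfs rename prodpool/data prodpool/data-old
    zfs rename prodpool/data-new prodpool/data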

While I see no problem going with one file system in each zpool (the
pool's root file system itself?), with TBs of data I would consider
splitting it into more file systems, mostly for "management" reasons
like backup and snapshotting. Splitting into more file systems also
helps when you have to migrate one of them to other storage - it's
easier to find 1TB of free space than 20TB.
I try to keep each production file system below 1TB, not that there
are any problems with larger file systems.
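
As a sketch, instead of one huge file system per pool you could do
something like this (names, sizes and devices are just examples):

    zpool create pool1 <your devices>
    zfs create pool1/data01
    zfs create pool1/data02
    zfs set quota=1T pool1/data01
    zfs set quota=1T pool1/data02
    zfs set sharenfs=on pool1/data01
    zfs set sharenfs=on pool1/data02

Then you can snapshot, send or migrate each ~1TB file system on its
own instead of dealing with the whole pool at once.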

When doing Sun Cluster, consider creating at least as many zpools as
you have nodes in the cluster, so that if you have to, you can spread
your workload across all the nodes (put each zpool in a different SC
resource group with its own IP).
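
With the Sun Cluster 3.2 CLI that is roughly the following (resource
and group names are made up, and I'm skipping the HA-NFS resource
itself):

    clresourcetype register SUNW.HAStoragePlus

    # one resource group per zpool, each with its own logical host
    clresourcegroup create nfs-rg1
    clreslogicalhostname create -g nfs-rg1 -h nfs-vip1 nfs-lh1
    clresource create -g nfs-rg1 -t SUNW.HAStoragePlus \
        -p Zpools=pool1 hasp-pool1
    clresourcegroup online -M nfs-rg1

    # repeat for pool2 -> nfs-rg2, pool3 -> nfs-rg3, ...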

We did some tests with Linux (2.4 and 2.6) and it seems there is a
problem if you have thousands of NFS file systems - they won't all be
mounted automatically, and even when doing it manually (or in a script
with a sleep between each mount) there seems to be a limit somewhere
below 1000 mounts. We did not investigate further, as in that
environment all NFS clients are Solaris servers (x86, SPARC) and we
see no problems there with thousands of file systems.
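
The kind of script I mean is nothing fancy, e.g. (server name, paths
and the file system list are made up):

    #!/bin/sh
    # mount each exported file system with a short pause in between
    while read fs; do
        mkdir -p /mnt/$fs
        mount -t nfs nfsserver:/pool1/$fs /mnt/$fs
        sleep 1
    done < /etc/fs.list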

If you switch a resource group to a node which is already serving
another NFS resource group, keep in mind that nfsd will actually
restart, which means a service disruption for that other group as
well. With ZFS, stopping nfsd can sometimes take even minutes...

There are more things to consider as well (storage layout, network
config, etc.).

-- 
Best regards,
 Robert Milkowski                      mailto:[EMAIL PROTECTED]
                                       http://milek.blogspot.com
