Re: [osol-discuss] inode numbers on ZFS
> ZFS is a 128-bit filesystem, isn't it? So I hope it uses 128-bit inode
> numbers too, but it should at least use 64 bits for inode numbers. Now,
> what happens when a 32-bit application calls stat(2) on a file whose
> inode number is outside the 32-bit range? Will stat(2) return an
> EOVERFLOW error in that case?

Depends on whether it's largefile aware or not, I'd say.
(The ino field in stat64 is 64 bits.)

Casper
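For reference, a minimal sketch of the distinction Casper is drawing, assuming a Solaris-style 32-bit compile where -D_LARGEFILE64_SOURCE exposes the transitional stat64 interface (the exact typedefs live in sys/types.h and sys/stat.h):

    /* In a plain 32-bit compile st_ino is a 32-bit ino_t; the transitional
     * largefile struct stat64 carries a 64-bit ino64_t. */
    #define _LARGEFILE64_SOURCE
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <stdio.h>

    int
    main(void)
    {
            struct stat   sb;       /* 32-bit st_ino in a non-largefile compile */
            struct stat64 sb64;     /* 64-bit st_ino (ino64_t) */

            printf("sizeof sb.st_ino   = %lu\n", (unsigned long)sizeof (sb.st_ino));
            printf("sizeof sb64.st_ino = %lu\n", (unsigned long)sizeof (sb64.st_ino));
            return (0);
    }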
Re: [osol-discuss] inode numbers on ZFS
[EMAIL PROTECTED] wrote:
> > ZFS is a 128-bit filesystem, isn't it?
>
> Depends on whether it's largefile aware or not, I'd say.
> (The ino field in stat64 is 64 bits.)

So to fully utilize a ZFS filesystem the average file size has to be 16 EB (2^128 bytes of capacity spread over at most 2^64 inode numbers)? People are already moaning today that on MTB UFS the average file size has to be 1 MB... I hope it is just an interface limitation and that ZFS's internals don't impose such a limit.

Daniel
Re: [osol-discuss] inode numbers on ZFS
Daniel,

To clear up a misconception about MTB UFS: the maximum density of inodes in an MTB UFS filesystem is 1 inode per megabyte of space. This does not mean that a megabyte of space is used for every file. It simply means you cannot have more than a million or so files per terabyte of storage. The reason for this is simple: otherwise it could take days or weeks to fsck the filesystem.

sarah

Daniel Rock wrote:
> So to fully utilize a ZFS filesystem the average file size has to be
> 16 EB? People are already moaning today that on MTB UFS the average
> file size has to be 1 MB... I hope it is just an interface limitation
> and that ZFS's internals don't impose such a limit.
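As a quick check of that density figure (illustrative arithmetic only, using the stated 1-inode-per-megabyte limit):

    /* One inode per megabyte of space => roughly a million inodes per TB. */
    #include <stdio.h>

    int
    main(void)
    {
            unsigned long long terabyte = 1ULL << 40;   /* bytes in 1 TB */
            unsigned long long nbpi     = 1ULL << 20;   /* 1 MB per inode (MTB UFS) */

            printf("max files per TB: %llu\n", terabyte / nbpi);   /* 1048576 */
            return (0);
    }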
Re: [osol-discuss] inode numbers on ZFS
ZFS inode numbers are 64 bits. The current implementation restricts this to a 48-bit usable range, but this is not an architectural restriction; future enhancements plan to extend this to the full 64 bits.

32-bit apps that attempt to stat() a file whose inode number is greater than 32 bits will get EOVERFLOW. 64-bit apps and largefile-aware apps will have no problems. The ZFS object allocation scheme always tries to allocate the lowest object number first, so you will never see files with greater-than-32-bit inode numbers until you have 2^32 files on the system[1].

There is little expectation that anyone will be able to fill a ZFS filesystem, ever[2]. There is reasonable expectation, however, that in the next 10-20 years we will pass the 64-bit limit for some use cases.

Hope that helps.

- Eric

[1] The actual algorithm allows for some fuzz factor, so this could theoretically occur at 75% of 2^32 files.

[2] For a complete discussion of these limits, see Jeff's blog:
http://blogs.sun.com/roller/page/bonwick?entry=128_bit_storage_are_you

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
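To make the failure mode concrete, here is a minimal sketch of how a 32-bit, non-largefile-aware program would observe this and how it could fall back to the transitional largefile interface. The path name is hypothetical, and the sketch assumes a Solaris-style environment where -D_LARGEFILE64_SOURCE exposes stat64() and struct stat64:

    #define _LARGEFILE64_SOURCE
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
            const char *path = "/tank/some/file";       /* hypothetical path */
            struct stat sb;

            if (stat(path, &sb) == -1) {
                    if (errno == EOVERFLOW) {
                            /* The inode number (or size, dates, ...) doesn't fit
                             * in the 32-bit struct stat; retry with the 64-bit
                             * interface instead. */
                            struct stat64 sb64;
                            if (stat64(path, &sb64) == 0) {
                                    printf("ino = %llu\n",
                                        (unsigned long long)sb64.st_ino);
                                    return (0);
                            }
                    }
                    fprintf(stderr, "stat failed: %s\n", strerror(errno));
                    return (1);
            }
            printf("ino = %lu\n", (unsigned long)sb.st_ino);
            return (0);
    }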
Re: [osol-discuss] inode numbers on ZFS
> Eric Schrock wrote:
> > ZFS inode numbers are 64 bits. The current implementation restricts
> > this to a 48-bit usable range, but this is not an architectural
> > restriction. Future enhancements plan to extend this to the full 64
> > bits. 32-bit apps that attempt to stat() a file whose inode number is
> > greater than 32 bits will get EOVERFLOW. 64-bit apps and largefile
> > aware apps will have no problems.
>
> So does this mean 32-bit apps that didn't need to be largefile aware in
> the past because they only touched small files now need to become
> largefile aware to avoid problems with ZFS if they call stat()?
> (Granted, they've already had problems with stat() with out-of-range
> dates from NFS servers and other places, but those aren't as common as
> ZFS will be.)

And XFS filesystems exported from SGI systems.

But as said, this only happens when the number of inodes exceeds 75% of 2^32, or about 3 billion, which for current UFS sizes would be a 24 TB filesystem. And since the typical filesystem only allocates around 25% of its inodes before it fills up, it would be more like a full 100 TB before you get to such huge inode numbers, with file sizes staying what they are.

Casper
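Working those figures backwards (a rough illustrative check; the one-inode-per-8-KB density is an assumption about typical UFS defaults, not something stated above):

    #include <stdio.h>

    int
    main(void)
    {
            double inodes = 0.75 * 4294967296.0;    /* 75% of 2^32, ~3.2 billion */
            double nbpi   = 8192.0;                 /* assumed bytes per inode */
            double tb     = 1099511627776.0;        /* bytes in 1 TB (2^40) */

            /* Filesystem size if every inode were allocated: roughly 24 TB. */
            printf("all inodes used:    ~%.0f TB\n", inodes * nbpi / tb);
            /* If only 25% of inodes get used before the space fills: ~96 TB. */
            printf("25%% of inodes used: ~%.0f TB\n", inodes * nbpi / 0.25 / tb);
            return (0);
    }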
Re: [osol-discuss] inode numbers on ZFS
Eric Schrock [EMAIL PROTECTED] wrote:
> There is little expectation that anyone will be able to fill a ZFS
> filesystem, ever[2]. There is reasonable expectation, however, that in
> the next 10-20 years we will pass the 64-bit limit for some use cases.

Do you believe that there are already systems with 2000 TB today? During the past 17 years, the capacity of a single 3.5" disk has increased by a factor of 2000 (a factor of 1.57 per year). At that rate, in 20 years the capacity of a single disk will increase by a factor of ~8000.

Jörg

--
EMail: [EMAIL PROTECTED] (home) Jörg Schilling D-13353 Berlin
       [EMAIL PROTECTED] (uni)
       [EMAIL PROTECTED] (work)
Blog:  http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily
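Checking those growth figures (illustrative arithmetic only):

    #include <stdio.h>
    #include <math.h>

    int
    main(void)
    {
            /* 2000x growth over 17 years => yearly factor of 2000^(1/17). */
            double per_year = pow(2000.0, 1.0 / 17.0);

            printf("yearly growth factor: ~%.2f\n", per_year);           /* ~1.56 */
            printf("growth over 20 years: ~%.0f\n", pow(per_year, 20));  /* ~7700 */
            return (0);
    }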
Re: [osol-discuss] inode numbers on ZFS
On Wed, Oct 12, 2005 at 10:34:49AM -0700, Alan Coopersmith wrote:
> So does this mean 32-bit apps that didn't need to be largefile aware in
> the past because they only touched small files now need to become
> largefile aware to avoid problems with ZFS if they call stat()?
> (Granted, they've already had problems with stat() with out-of-range
> dates from NFS servers and other places, but those aren't as common as
> ZFS will be.)

Yes, unfortunately this is the case. But it will only affect filesystems with more than 3 billion files on them. There's not much that can be done about this - if you want to have more than 2^32 files, you need more than 32 bits to uniquely identify them.

The lightweight ZFS filesystem model will also reduce this effect, since administrators will be encouraged to have many filesystems (i.e. one per user) instead of a single mammoth filesystem (all of /export/home).

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
[osol-discuss] Re: [zones-discuss] Re: Re: Unionfs for Zones
AFAIK, FiST (which underlies the referenced support for unionfs on Solaris) was never updated past Solaris 7/SunOS 5.7. That's a problem, because it depends on the (private, uncommitted, and all-too-undocumented!) vfs interface, which changed at least from 5.7 to 5.8 (to add support for umount -f, I think); I'm not sure what, if any, subsequent changes there have been. I've wanted to play with FiST, but didn't stumble across it 'til I was already running Solaris 8, so no go; and updating it myself would've been rough (see below for part of why).

Sure would be nice if there were a doc on the vfs interface. I understand it's one of the few places left for adding certain types of magic, so it may well never become committed (unless there's some way to achieve both extensibility and compatibility - maybe the last change provides a hook for that?). But there will always be non-Sun filesystem implementations (AFS, DCE/DFS, VxFS, ...), and a pseudo-NFS server isn't always a suitable approach.

Since it's not a committed interface (at least not yet), a history of version-to-version changes, and a heads-up on any upcoming changes (so filesystem maintainers could update their code), would be a big help too.
Re: [osol-discuss] inode numbers on ZFS
Yes, there are multi-petabyte systems out there. Though you may disagree, I personally don't think it's unreasonable to expect such filesystems to pass the 16 exabyte range within the next 20 years. Neither did the ZFS designers, hence the 128-bit capability.

Note that we are talking about filesystems, not individual disks. ZFS filesystems can span any number of disks, just as you could achieve by layering on top of a volume manager or through a distributed filesystem. Besides just being flat out larger, the growth rate of filesystem size is not directly proportional to the growth rate of disks.

- Eric

On Wed, Oct 12, 2005 at 07:50:49PM +0200, Joerg Schilling wrote:
> Eric Schrock [EMAIL PROTECTED] wrote:
> > There is little expectation that anyone will be able to fill a ZFS
> > filesystem, ever[2]. There is reasonable expectation, however, that
> > in the next 10-20 years we will pass the 64-bit limit for some use
> > cases.
>
> Do you believe that there are already systems with 2000 TB today?
> During the past 17 years, the capacity of a single 3.5" disk has
> increased by a factor of 2000 (a factor of 1.57 per year). In 20
> years, the capacity of a single disk will increase by a factor of
> ~8000.

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
Re: [osol-discuss] Re: [zones-discuss] Re: Re: Unionfs for Zones
> AFAIK, FiST (which underlies the referenced support for unionfs on
> Solaris) was never updated past Solaris 7/SunOS 5.7. That's a problem,
> because it depends on the (private, uncommitted, and
> all-too-undocumented!) vfs interface, which changed at least from 5.7
> to 5.8 (to add support for umount -f, I think); not sure what if any
> subsequent changes there have been.
[...]
> But there will always be non-Sun filesystem implementations (AFS,
> DCE/DFS, VxFS, ...), and a pseudo-NFS server isn't always a suitable
> approach.

Sun has always worked with 3rd-party FS vendors to give them sufficient understanding of the filesystem interfaces (VxFS, DFS/AFS). But this is often done, I think, on a source-code-access basis.

Casper
[osol-discuss] Re: Re: [zones-discuss] Re: Re: Unionfs for Zones
Right, but now with OpenSolaris, anybody could be a 3rd party FS vendor. :-) So by way of working with everybody, wouldn't it be easier to write a doc than to hold a lot of hands one-by-one?
Re: [osol-discuss] inode numbers on ZFS
On Wed, 2005-10-12 at 12:58, Eric Schrock wrote:
> There is little expectation that anyone will be able to fill a ZFS
> filesystem, ever[2]. There is reasonable expectation, however, that in
> the next 10-20 years we will pass the 64-bit limit for some use cases.

and, unless my math is off by a few orders of magnitude, a 2^128-block pool at current storage densities would require a data center of roughly the scale of Larry Niven's Ringworld...
Re: [osol-discuss] inode numbers on ZFS
On Wed, 12 Oct 2005, Bill Sommerfeld wrote:
> and, unless my math is off by a few orders of magnitude, a 2^128-block
> pool at current storage densities would require a data center of
> roughly the scale of Larry Niven's Ringworld...

Well at least it won't require a Dyson sphere sized data center! :-)

--
Rich Teer, SCNA, SCSA, OpenSolaris CAB member
President, Rite Online Inc.
Voice: +1 (250) 979-1638
URL: http://www.rite-group.com/rich
RE: [osol-discuss] inode numbers on ZFS
> Well at least it won't require a Dyson sphere sized data center! :-)

You do realize that when we go to quantum computing, the topology of the disk storage is no longer an issue, right?
Re: [osol-discuss] inode numbers on ZFS
On Wed, 2005-10-12 at 17:41, Sarah Jelinek wrote:
> Daniel,
>
> To clear up a misconception about MTB UFS: the maximum density of
> inodes in an MTB UFS filesystem is 1 inode per megabyte of space. This
> does not mean that a megabyte of space is used for every file. It
> simply means you cannot have more than a million or so files per
> terabyte of storage.

Which is nowhere near adequate in all cases:

Filesystem            kbytes      used     avail  capacity  Mounted on
/dev/dsk/c2t0d0s2  703246224 626354966  69858796       90%  /export/data

Filesystem             iused     ifree  %iused  Mounted on
/dev/dsk/c2t0d0s2   16688199  67187641     20%  /export/data

Note that this filesystem already has more files than a maximum-size multiterabyte UFS could hold (16 TB at 1 inode per MB is about 16 million inodes). My understanding is that the argument is that fsck could take indefinitely long. I know that fsck on this filesystem takes about 3 hours, although you would have to ask what goes wrong to need an fsck.

--
-Peter Tribble
L.I.S., University of Hertfordshire - http://www.herts.ac.uk/
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
Re: [osol-discuss] inode numbers on ZFS
On Wed, 2005-10-12 at 18:58, Eric Schrock wrote:
> Yes, unfortunately this is the case. But it will only affect
> filesystems with more than 3 billion files on them. There's not much
> that can be done about this - if you want to have more than 2^32
> files, you need more than 32 bits to uniquely identify them. The
> lightweight ZFS filesystem model will also reduce this effect, since
> administrators will be encouraged to have many filesystems (i.e. one
> per user) instead of a single mammoth filesystem (all of /export/home).

How far has this been tested? I know I tested it, just to see how well it worked, about 6 months ago, and on a fairly small machine 10,000 filesystems was starting to get interesting. I just wonder, seeing as we would need about 40,000 filesystems under this model.

--
-Peter Tribble
L.I.S., University of Hertfordshire - http://www.herts.ac.uk/
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
Re: [osol-discuss] inode numbers on ZFS
[EMAIL PROTECTED] wrote:
> But since the typical filesystem only allocates around 25% of inodes
> before it fills up, it would be more like a full 100 TB before you get
> to such huge inode numbers, with file sizes staying what they are.

Okay - I wasn't clear on whether inodes were allocated sequentially or from all over the available address space.

--
-Alan Coopersmith- [EMAIL PROTECTED]
Sun Microsystems, Inc. - X Window System Engineering
[osol-discuss] 16 slices for PPC VTOC due to the ATT SVR4 implementation ?
ALL :

RE:
http://svn.genunix.org/repos/opensolaris/trunk/usr/src/cmd/fmthard/fmthard.c
and
http://svn.genunix.org/repos/polaris/trunk/usr/src/uts/common/sys/dklabel.h

For a long long time now I have been using 16 slices on the x86 edition of Solaris. [1] This has worked fine, is quite stable, and seems to be based on the implementation from the AT&T SVR4 spec. Within the SPARC world we have always been limited to 8 slices, with the backup slice being an overlap of the whole disk by convention. This seems to date way back to the SunOS days and the BSD world; Jörg will probably be able to provide illumination on that.

At this stage I am looking at the boot issues for the PowerPC port and also spending time looking at the process of getting the kernel booted up, and the VTOC certainly comes into play here. My hope is to implement the 16-slice ( _SUNOS_VTOC_16 ? ) approach to the logical partitions. Are there any obvious pitfalls in this approach, or am I labouring under a misconception about the implementation?

Dennis Clarke

[1] this machine is running here with Solaris 2.5.1 for x86

# uname -a
SunOS tunafish 5.5.1 Generic_103641-42 i86pc i386 i86pc
# prtvtoc /dev/rdsk/c0t2d0s0
* /dev/rdsk/c0t2d0s0 partition map
*
* Dimensions:
*     512 bytes/sector
*     106 sectors/track
*      10 tracks/cylinder
*    1060 sectors/cylinder
*    3954 cylinders
*    3952 accessible cylinders
*
* Flags:
*   1: unmountable
*  10: read-only
*
*                          First     Sector    Last
* Partition  Tag  Flags    Sector     Count    Sector  Mount Directory
       0      2    00        3180    197160    200339
       1      7    00      200340    435660    635999
       2      5    01           0   4189120   4189119
       3      0    00      636000    524700   1160699
       4      0    00     1160700     33920   1194619
       5      0    00     1194620    410220   1604839
       6      4    00     1604840    449440   2054279
       7      0    00     2054280   1049400   3103679
       8      1    01           0      1060      1059
       9      9    00        1060      2120      3179
       a      3    00     3103680    855420   3959099
       f      0    00     3959100    230020   4189119   /usr/local

This machine has an external HP C3010 2GB disk attached .. still spinning. Who knows how old.
Re: [osol-discuss] inode numbers on ZFS
On Wed, Oct 12, 2005 at 09:28:52PM +0100, Peter Tribble wrote:
> How far has this been tested? I know I tested it, just to see how well
> it worked, about 6 months ago. On a fairly small machine, 10,000
> filesystems was starting to get interesting. I just wonder, seeing as
> we would need about 40,000 filesystems under this model.

If I remember correctly, most of your problems were related to performance in these situations (lots of filesystems). Much work has gone into improving performance; I don't know for a fact whether we've tried the 40,000-filesystem model. Right now the priority is getting ZFS out the door, and our performance efforts are focused around getting individual filesystems to perform well. I can say for a fact that there is a lot of low-hanging fruit in the administration tools to make various operations (listing, deletion, etc.) go faster. It's just not something we've been able to focus our efforts on.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
[osol-discuss] Re: [powerpc-discuss] 16 slices for PPC VTOC due to the ATT SVR4 implementation ?
On 10/12/05, Dennis Clarke [EMAIL PROTECTED] wrote:
> RE:
> http://svn.genunix.org/repos/opensolaris/trunk/usr/src/cmd/fmthard/fmthard.c
> and
> http://svn.genunix.org/repos/polaris/trunk/usr/src/uts/common/sys/dklabel.h
>
> For a long long time now I have been using 16 slices on the x86
> edition of Solaris. [1]

The number of partitions to support is an implementation choice:

http://svn.genunix.org/repos/polaris/trunk/usr/src/uts/common/sys/isa_defs.h

Look for _SUNOS_VTOC_8 or _SUNOS_VTOC_16. I put _SUNOS_VTOC_8 there for the PPC port, since I think both are quite bad nowadays and we should go with the EFI (aka GPT) label anyway. There are plans to make EFI-labeled disks bootable, and I hoped to just stick with that when it becomes available. The EFI label should give you an unlimited (in theory) number of partitions, if I understood it correctly; current Solaris implementations may limit it, however. BTW, the lack of certainty on that subject is another reason to postpone PPC disk support for the time being.

Regards,
Cyril
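For readers following along, the choice Cyril describes comes down to a pair of compile-time definitions; roughly along these lines (a simplified sketch of what sys/dklabel.h and sys/vtoc.h do, not the exact source):

    /* The VTOC flavour selected in isa_defs.h decides how many slices exist. */
    #if defined(_SUNOS_VTOC_16)
    #define NDKMAP          16              /* x86-style label: 16 slices  */
    #elif defined(_SUNOS_VTOC_8)
    #define NDKMAP          8               /* SPARC-style label: 8 slices */
    #else
    #error "No VTOC format defined"
    #endif

    #define V_NUMPAR        NDKMAP          /* partitions per VTOC */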
[osol-discuss] Re: [powerpc-discuss] 16 slices for PPC VTOC due to the ATT SVR4 implementation ?
On 10/12/05, Cyril Plisko [EMAIL PROTECTED] wrote:
> The number of partitions to support is an implementation choice:
>
> http://svn.genunix.org/repos/polaris/trunk/usr/src/uts/common/sys/isa_defs.h
>
> Look for _SUNOS_VTOC_8 or _SUNOS_VTOC_16.

Yes, I saw that in dklabel.h and was looking in various places for all references to _SUNOS_VTOC_8 or 16.

> I put _SUNOS_VTOC_8 there for the PPC port, since I think both are
> quite bad nowadays and we should go with the EFI (aka GPT) label
> anyway. There are plans to make EFI-labeled disks bootable, and I
> hoped to just stick with that when it becomes available. The EFI label
> should give you an unlimited (in theory) number of partitions, if I
> understood it correctly; current Solaris implementations may limit it,
> however.

I thought that this was kinda funny in fmthard.c:

    #if defined(_SUNOS_VTOC_16)
            /* make the vtoc look sane - ha ha */
            vtoc->v_version = V_VERSION;
            vtoc->v_sanity = VTOC_SANE;
            vtoc->v_nparts = V_NUMPAR;
    . . .

and I have been reading more on the issue of EFI labels.

> BTW, the lack of certainty on that subject is another reason to
> postpone PPC disk support for the time being.

and stick with netboot for the time being. check.

Dennis