Re: [zfs-discuss] ZFS + DB + fragments
When you have a striped storage device under a file system, then the database or file system's view of contiguous data is not contiguous on the media. Right. That's a good reason to use fairly large stripes. (The primary limiting factor for stripe size is efficient parallel access; using a 100 MB stripe size means that an average 100 MB file gets less than two disks' worth of throughput.) ZFS, of course, doesn't have this problem, since it's handling the layout on the media; it can store things as contiguously as it wants. There are many different ways to place the data on the media and we would typically strive for a diverse stochastic spread. Err ... why? A random distribution makes reasonable sense if you assume that future read requests are independent, or that they are dependent in unpredictable ways. Now, if you've got sufficient I/O streams, you could argue that requests *are* independent, but in many other cases they are not, and they're usually predictable (particularly after a startup period). Optimizing for the predicted access cases makes sense. (Optimizing for observed access may make sense in some cases as well.) -- Anton This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
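A rough way to quantify the stripe-size trade-off above, under the simplifying assumption that a file of size $F$ starts at a uniformly random offset within a stripe unit of size $S$:

    $$E[\text{stripe units touched}] \approx \frac{F}{S} + 1$$

but the first and last units are generally only partially covered, so the effective parallelism of a single sequential read is closer to $F/S$. With $F = S = 100\ \text{MB}$ that is the "less than two disks' worth of throughput" mentioned above; with $S = 1\ \text{MB}$ the same file could in principle stream from up to a hundred disks at once.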
Re: [zfs-discuss] ZFS + DB + fragments
We are all anxiously awaiting data... -- richard

Would it be worthwhile to build a test case:
- Build a postgresql database and import 1 000 000 (or more) lines of data.
- Run single and multiple large table-scan queries ... and watch the system then,
- Update a column of each row in the database, run the same queries and watch the system.
Continue updating more columns (to get more fragmentation) until you notice something.

I personally believe that since most people will have hardware LUNs (with underlying RAID) and cache, it will be difficult to notice anything, given that those hardware LUNs might be busy with their own wizardry ;) You will also have to minimize the effect of the database cache ... It will be a tough assignment ... maybe someone has already done this?

Thinking about this (very abstract) ... does it really matter? [8KB-a][8KB-b][8KB-c] So what if 8KB-b gets updated and moved somewhere else? If the DB gets a request to read 8KB-a, it needs to do an I/O (eliminate all caching). If it gets a request to read 8KB-b, it needs to do an I/O. Does it matter that b is somewhere else ... it still needs to go get it ... only in a very abstract world with read-ahead (both hardware and db) would 8KB-b be in cache after 8KB-a was read. Hmmm... the only way is to get some data :) *hehe* ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
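A rough sketch of the test proposed in the message above, assuming PostgreSQL on a dedicated ZFS dataset; the pool/dataset names, database name, and table name are made up for illustration, and it does nothing to defeat the database or array caches (the caveat above stands):

    # dedicated dataset so its layout and I/O are easy to observe
    zfs create -o recordsize=8k tank/pgtest
    initdb -D /tank/pgtest/data
    pg_ctl -D /tank/pgtest/data start
    createdb test

    # load ~1,000,000 rows, then time a full table scan as the baseline
    psql -d test -c "CREATE TABLE t (id int PRIMARY KEY, pad text);"
    psql -d test -c "INSERT INTO t SELECT i, repeat('x', 200) FROM generate_series(1, 1000000) i;"
    time psql -d test -c "SELECT count(*) FROM t;"

    # update a column of every row (forces row rewrites and, on ZFS, COW
    # relocation of the touched blocks), then rescan; repeat as desired
    psql -d test -c "UPDATE t SET pad = repeat('y', 200);"
    time psql -d test -c "SELECT count(*) FROM t;"

    # watch the disks while the scans run
    iostat -xn 5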
Re: [zfs-discuss] internal error: Bad file number
On Thu, 15 Nov 2007, Manoj Nayak wrote: I am getting the following error message when I run any zfs command. I have attached the script I use to create the ramdisk image for Thumper.

# zfs volinit
internal error: Bad file number
Abort - core dumped

This sounds as if you may have somehow lost the /dev/zfs link. Try linking /dev/zfs to ../devices/pseudo/[EMAIL PROTECTED]:zfs assuming the latter exists at all. If that doesn't do the trick, could you attach a truss -f output? Regards, markm ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
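For what it's worth, a minimal sketch of the suggested check; the exact pseudo-device node name was mangled by the list archiver above, so <zfs-node> below is a placeholder for whatever actually appears under /devices/pseudo:

    # see what /dev/zfs points at, if it exists at all
    ls -lL /dev/zfs
    ls -l /devices/pseudo | grep zfs

    # recreate the link using the node name shown by the previous command
    ln -s ../devices/pseudo/<zfs-node>:zfs /dev/zfs

    # if zfs commands still fail, capture a trace to post to the list
    truss -f -o /tmp/zfs-volinit.truss zfs volinit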
Re: [zfs-discuss] X4500 device disconnect problem persists
Speaking of error recovery due to bad blocks - does anyone know if the SATA disks that are delivered with the Thumper have enterprise or desktop firmware/settings by default? If I'm not mistaken, one of the differences is that the enterprise variant gives up on bad blocks more quickly and reports them to the operating system, compared to the desktop variant that will keep on retrying forever (or almost forever, at least)... This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Yager on ZFS
Hello can, Thursday, November 15, 2007, 2:54:21 AM, you wrote: cyg The major difference between ZFS and WAFL in this regard is that cyg ZFS batch-writes-back its data to disk without first aggregating cyg it in NVRAM (a subsidiary difference is that ZFS maintains a cyg small-update log which WAFL's use of NVRAM makes unnecessary). cyg Decoupling the implementation from NVRAM makes ZFS usable on cyg arbitrary rather than specialized platforms, and that without cyg doubt constitutes a significant advantage by increasing the cyg available options (in both platform and price) for those cyg installations that require the kind of protection (and ease of cyg management) that both WAFL and ZFS offer and that don't require cyg the level of performance that WAFL provides and ZFS often may not cyg (the latter hasn't gotten much air time here, and while it can be cyg discussed to some degree in the abstract a better approach would cyg be to have some impartial benchmarks to look at, because the cyg on-disk block layouts do differ significantly and sometimes cyg subtly even if the underlying approaches don't). Well, ZFS allows you to put its ZIL on a separate device which could be NVRAM. -- Best regards, Robertmailto:[EMAIL PROTECTED] http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
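For reference, a minimal sketch of doing exactly that (device names are placeholders; the pool has to be on a build/version that supports separate log devices):

    # create a pool with its intent log on a dedicated (NVRAM/SSD-backed) device
    zpool create tank mirror c1t0d0 c1t1d0 log c2t0d0

    # or add a log device to an existing pool
    zpool add tank log c2t0d0
    zpool status tank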
Re: [zfs-discuss] ZFS for consumers WAS:Yager on ZFS
On 11/15/07, Paul Kraus [EMAIL PROTECTED] wrote: Splitting this thread and changing the subject to reflect that... On 11/14/07, can you guess? [EMAIL PROTECTED] wrote: Another prominent debate in this thread revolves around the question of just how significant ZFS's unusual strengths are for *consumer* use. WAFL clearly plays no part in that debate, because it's available only on closed, server systems. I am both a large systems administrator and a 'home user' (I prefer that term to 'consumer'). I am also very slow to adopt new technologies in either environment. We have started using ZFS at work due to performance improvements (for our workload) over UFS (or any other FS we tested). At home the biggest reason I went with ZFS for my data is ease of management. I split my data up based on what it is ... media (photos, movies, etc.), vendor stuff (software, datasheets, etc.), home directories, and other misc. data. This gives me a good way to control backups based on the data type. I know, this is all more sophisticated than the typical home user. The biggest win for me is that I don't have to partition my storage in advance. I build one zpool and multiple datasets. I don't set quotas or reservations (although I could). So I suppose my argument for ZFS in home use is not data integrity, but much simpler management, both short and long term.

I am in the same situation as you and fully agree, except for data integrity: at work, a sophisticated backup system keeps many copies of my files, while at home it is much more rudimentary, so data integrity also becomes very important, certainly more so than speed.

Paul Paul Kraus Albacon 2008 Facilities ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Paul Bartholdi Chemin de la Barillette 11 CH-1260 NYON Suisse tel +41 22 361 0222 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS + DB + fragments
can you guess? wrote: For very read intensive and position sensitive applications, I guess this sort of capability might make a difference?

No question about it. And sequential table scans in databases are among the most significant examples, because (unlike things like streaming video files which just get laid down initially and non-synchronously in a manner that at least potentially allows ZFS to accumulate them in large, contiguous chunks - though ISTR some discussion about just how well ZFS managed this when it was accommodating multiple such write streams in parallel) the tables are also subject to fine-grained, often-random update activity. Background defragmentation can help, though it generates a boatload of additional space overhead in any applicable snapshot.

The reason that this is hard to characterize is that there are really two very different configurations used to address different performance requirements: cheap and fast. It seems that when most people first consider this problem, they do so from the cheap perspective: single disk view. Anyone who strives for database performance will choose the fast perspective: stripes. And anyone who *really* understands the situation will do both. Note: data redundancy isn't really an issue for this analysis, but consider it done in real life. When you have a striped storage device under a file system, then the database or file system's view of contiguous data is not contiguous on the media.

The best solution is to make the data piece-wise contiguous on the media at the appropriate granularity - which is largely determined by disk access characteristics (the following assumes that the database table is large enough to be spread across a lot of disks at moderately coarse granularity, since otherwise it's often small enough to cache in the generous amounts of RAM that are inexpensively available today).

A single chunk on an (S)ATA disk today (the analysis is similar for high-performance SCSI/FC/SAS disks) needn't exceed about 4 MB in size to yield over 80% of the disk's maximum possible (fully-contiguous layout) sequential streaming performance (after the overhead of an 'average' - 1/3 stroke - initial seek and partial rotation are figured in: the latter could be avoided by using a chunk size that's an integral multiple of the track size, but on today's zoned disks that's a bit awkward). A 1 MB chunk yields around 50% of the maximum streaming performance. ZFS's maximum 128 KB 'chunk size', if effectively used as the disk chunk size as you seem to be suggesting, yields only about 15% of the disk's maximum streaming performance (leaving aside an additional degradation to a small fraction of even that should you use RAID-Z). And if you match the ZFS block size to a 16 KB database block size and use that as the effective unit of distribution across the set of disks, you'll obtain a mighty 2% of the potential streaming performance (again, we'll be charitable and ignore the further degradation if RAID-Z is used).

Now, if your system is doing nothing else but sequentially scanning this one database table, this may not be so bad: you get truly awful disk utilization (2% of its potential in the last case, ignoring RAID-Z), but you can still read ahead through the entire disk set and obtain decent sequential scanning performance by reading from all the disks in parallel. 
But if your database table scan is only one small part of a workload which is (perhaps the worst case) performing many other such scans in parallel, your overall system throughput will be only around 4% of what it could be had you used 1 MB chunks (and the individual scan performances will also suck commensurately, of course). Using 1 MB chunks still spreads out your database admirably for parallel random-access throughput: even if the table is only 1 GB in size (eminently cachable in RAM, should that be preferable), that'll spread it out across 1,000 disks (2,000, if you mirror it and load-balance to spread out the accesses), and for much smaller database tables if they're accessed sufficiently heavily for throughput to be an issue they'll be wholly cache-resident. Or another way to look at it is in terms of how many disks you have in your system: if it's less than the number of MB in your table size, then the table will be spread across all of them regardless of what chunk size is used, so you might as well use one that's large enough to give you decent sequential scanning performance (and if your table is too small to spread across all the disks, then it may well all wind up in cache anyway). ZFS's problem (well, the one specific to this issue, anyway) is that it tries to use its 'block size' to cover two different needs: performance for moderately fine-grained updates (though its need to propagate those updates upward to the root of the applicable tree
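A back-of-the-envelope check of the chunk-size percentages quoted above (the post is truncated at this point in the archive), using assumed figures of roughly 8 ms average seek, 4 ms average rotational delay, and a 75 MB/s media rate - the exact numbers vary per drive, but the shape of the curve doesn't. For a chunk of size $C$ and media rate $r$:

    $$\text{efficiency}(C) \approx \frac{C/r}{t_{\text{seek}} + t_{\text{rot}} + C/r}$$

which gives roughly 82% at $C = 4$ MB, 53% at 1 MB, 12-13% at 128 KB, and about 2% at 16 KB - close to the 80% / 50% / 15% / 2% figures cited.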
Re: [zfs-discuss] Yager on ZFS
On 11/15/07 9:05 AM, Robert Milkowski [EMAIL PROTECTED] wrote: Hello can, Thursday, November 15, 2007, 2:54:21 AM, you wrote: cyg The major difference between ZFS and WAFL in this regard is that cyg ZFS batch-writes-back its data to disk without first aggregating cyg it in NVRAM (a subsidiary difference is that ZFS maintains a cyg small-update log which WAFL's use of NVRAM makes unnecessary). cyg Decoupling the implementation from NVRAM makes ZFS usable on cyg arbitrary rather than specialized platforms, and that without cyg doubt constitutes a significant advantage by increasing the cyg available options (in both platform and price) for those cyg installations that require the kind of protection (and ease of cyg management) that both WAFL and ZFS offer and that don't require cyg the level of performance that WAFL provides and ZFS often may not cyg (the latter hasn't gotten much air time here, and while it can be cyg discussed to some degree in the abstract a better approach would cyg be to have some impartial benchmarks to look at, because the cyg on-disk block layouts do differ significantly and sometimes cyg subtly even if the underlying approaches don't). Well, ZFS allows you to put its ZIL on a separate device which could be NVRAM. Like RAMSAN SSD http://www.superssd.com/products/ramsan-300/ It is the only FC attached, Battery-backed SSD that I know of, and we have dreams of clusterfication. Otherwise we would use one of those PCI-Express based NVRAM cards that are on the horizon. My initial results for lots of small files was very pleasing. I dream of a JBOD with lots of disks + something like this built into 3u. Too bad Sun's forthcoming JBODS probably wont have anything similar to this... -Andy ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS snapshot send/receive via intermediate device
Simple answer: yes. Slightly longer answer: zfs send just writes to stdout; where you put that is up to your needs. It can be a file in some filesystem, a raw disk, a tape, or a pipe to another program (such as ssh or compress or encrypt). zfs recv reads from stdin, so just do the reverse of what you did for send. -- Darren J Moffat ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
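A minimal sketch of both directions (pool, dataset, snapshot, and host names are made up):

    # snapshot, then send the stream to a file on the intermediate device
    zfs snapshot tank/home@backup1
    zfs send tank/home@backup1 > /mnt/usbdisk/home-backup1.zfs

    # or pipe it through ssh instead of a local file
    zfs send tank/home@backup1 | ssh otherhost "cat > /backups/home-backup1.zfs"

    # later, feed the stream back into zfs recv to restore it
    zfs recv tank/home_restored < /mnt/usbdisk/home-backup1.zfs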
Re: [zfs-discuss] Yager on ZFS
... Well, ZFS allows you to put its ZIL on a separate device which could be NVRAM. And that's a GOOD thing (especially because it's optional rather than requiring that special hardware be present). But if I understand the ZIL correctly, it's not as effective as using NVRAM as a more general kind of log for a wider range of data sizes and types, as WAFL does. - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Is ZFS stable in OpenSolaris?
On Thu, 2007-11-15 at 17:20 +, Darren J Moffat wrote: hex.cookie wrote: In a production environment, which platform should we use? Solaris 10 U4 or OpenSolaris 70+? How should we estimate a stable edition for production? Or is OpenSolaris stable in some build?

All depends on what you define by stable. Do you intend to pay Sun for a service contract? If so, S10u4 is likely your best route. Do you care about patching rather than upgrading? If patching, S10u4. If you can do an upgrade (highly recommended IMO) using live_upgrade(5), then a Solaris Express. For an OpenSolaris based distribution I think the realistic choices are from the following list:
Solaris Express Community Edition (SX:CE)
Solaris Express Developer Edition (SX:DE)
Belenix
Nexenta
OpenSolaris Developer Preview (Project Indiana)

or Martux if you want to run on Sparc. -Mark ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] cannot mount 'mypool': Input/output error
I appreciate the different responses that I have gotten. As some of you may have realized, I am not a guru in Linux / Solaris... I have been trying to figure out what file system my Solaris box was using... I got a comment from Paul that from the fdisk command he could see that most likely the partitions are Solaris UFS... I don't see that information anywhere, so I'm wondering if I missed something, or if you are assuming this, Paul?

I am sure I will not use ZFS to its fullest potential at all.. right now I'm trying to recover the dead disk, so if it works to mount a single disk/boot disk, that's all I need; I don't need it to be very functional. As I suggested, I will only be using this to change permissions and then return the disk to the appropriate server once I am able to log back into that server. I will try the zfs import just to give it a go. I have done modprobe fuse and have it loaded... but the fact that allow is not available in the latest version clears up why that wasn't working...

Sorry Darren, I was not sure what the CC forums really did and I just chose ones that I thought might be related to ZFS, not realizing that Crypto is probably another project...

I got another suggestion that the file system is UFS, which would make me think that mount -t ufs /dev/sda1 /mnt/mymount should work, but given that that fails with mount: wrong fs type, bad option, bad superblock on /dev/sda1, or too many mounted file systems something is not right... but that's probably more a Linux community discussion topic... thanks though.

Thanks for your suggestion Mark, I will look into the Linux FUSE, although I do have a feeling we downloaded the Solaris FUSE software and put it on a Linux box... I'll have to look into that some more. Thank you for your responses... This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Is ZFS stable in OpenSolaris?
hex.cookie wrote: In a production environment, which platform should we use? Solaris 10 U4 or OpenSolaris 70+? How should we estimate a stable edition for production? Or is OpenSolaris stable in some build?

All depends on what you define by stable. Do you intend to pay Sun for a service contract? If so, S10u4 is likely your best route. Do you care about patching rather than upgrading? If patching, S10u4. If you can do an upgrade (highly recommended IMO) using live_upgrade(5), then a Solaris Express. For an OpenSolaris based distribution I think the realistic choices are from the following list:
Solaris Express Community Edition (SX:CE)
Solaris Express Developer Edition (SX:DE)
Belenix
Nexenta
OpenSolaris Developer Preview (Project Indiana)

Another important consideration is what ZFS functionality you need, since not all features available in OpenSolaris releases were backported to Solaris 10u4 (because some of them were completed *after* S10u4 shipped). -- Darren J Moffat ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How to create ZFS pool ?
On Thu, 2007-11-15 at 05:25 -0800, Boris Derzhavets wrote: Thank you very much Mike for your feedback. Just one more question. I noticed five devices under /dev/rdsk:- c1t0d0p0 c1t0d0p1 c1t0d0p2 c1t0d0p3 c1t0d0p4 created by the system immediately after installation completed. I believe it's an x86 limitation (no more than 4 primary partitions). If I've got your point right, in the case where the Other OS partition gets number 3, I am supposed to run:- # zpool create pool c1t0d0p3

Yes. Just make sure it's the correct partition, i.e. partition 3 is actually where you want the zpool, otherwise you'll corrupt/lose whatever data is on that partition. You also need to make sure that partition 3 is defined and you can see it in fdisk, as Solaris creates these p? devices whether they exist or not.

So if I read your previous email correctly, you'll need to run format, select your first disk, then run fdisk again. Empty/unused space doesn't mean a partition has been created. From there, you'll want to create a new partition, and if you're not familiar with Solaris fdisk, it's a PITA until you get really used to it. You'll want to start one (1) cylinder past the end of your last partition so there's no overlap, then calculate the size of the partition. I usually use cylinders for this. So on one of my systems:

             Total disk size is 17849 cylinders
             Cylinder size is 16065 (512 byte) blocks

                                               Cylinders
      Partition   Status    Type          Start   End   Length    %
      =========   ======    ============  =====   ===   ======   ===
          1       Active    Solaris2          1   5224    5224    29

SELECT ONE OF THE FOLLOWING:
   1. Create a partition
   2. Specify the active partition
   3. Delete a partition
   4. Change between Solaris and Solaris2 Partition IDs
   5. Exit (update disk configuration and exit)
   6. Cancel (exit without updating disk configuration)
Enter Selection:

So the last cylinder is 5224, so we'll start on 5225, and to use the rest of the disk you'll want to take the max cylinders (17849 from the top line) and subtract 5225, which gives you 12624.

Select 1 to create a new partition:

Select the partition type to create:
   1=SOLARIS2   2=UNIX        3=PCIXOS     4=Other
   5=DOS12      6=DOS16       7=DOSEXT     8=DOSBIG
   9=DOS16LBA   A=x86 Boot    B=Diagnostic C=FAT32
   D=FAT32LBA   E=DOSEXTLBA   F=EFI        0=Exit?

Select 4 for Other OS.

Specify the percentage of disk to use for this partition (or type "c" to specify the size in cylinders). Now select c for cylinders (I've never been much one for trusting percentages ;)

Enter starting cylinder number: 5225
Enter partition size in cylinders: 12624

(It'll ask you about making it the active partition - say no here)

             Total disk size is 17849 cylinders
             Cylinder size is 16065 (512 byte) blocks

                                               Cylinders
      Partition   Status    Type          Start   End   Length    %
      =========   ======    ============  =====   ===   ======   ===
          1       Active    Solaris2          1   5224    5224    29
          2                 Other OS       5225  17848   12624    71

SELECT ONE OF THE FOLLOWING:
   1. Create a partition
   2. Specify the active partition
   3. Delete a partition
   4. Change between Solaris and Solaris2 Partition IDs
   5. Exit (update disk configuration and exit)
   6. Cancel (exit without updating disk configuration)

Double check you're not overlapping any of the partitions and select 5 to save the partition. In this case, the pool would be c1t0d0p2. Not the most technically accurate, but think of p0 as the entire disk and your first partition starts with p1 and so forth. Hope that helps. If you want, post your fdisk partition table if you want a second set of eyes.

Boris. 
This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Mike Dotson ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
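Putting the pieces above together, a short sketch of the finish; the device names follow the example, and zpool create will overwrite whatever is on the partition it's given, so double-check first:

    # confirm the fdisk table now shows the new "Other OS" partition
    fdisk -W - /dev/rdsk/c1t0d0p0

    # p0 is the whole disk; p1..p4 map to fdisk partitions 1..4, so the
    # second fdisk partition in the example above is c1t0d0p2
    zpool create pool c1t0d0p2
    zpool status pool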
Re: [zfs-discuss] zfs on a raid box
On Tue, Nov 13, 2007 at 12:25:24PM +0100, Paul Boven wrote: Hi everyone, We've been building a storage system that should have about 2TB of storage and good sequential write speed. The server side is a Sun X4200 running Solaris 10u4 (plus yesterday's recommended patch cluster), the array we bought is a Transtec Provigo 510 12-disk array. The disks are SATA, and it's connected to the Sun through U320-scsi.

We are doing basically the same thing with similar Western Scientific (wsm.com) raids, based on infortrend controllers. ZFS notices when we pull a disk and goes on and does the right thing. I wonder if you've got a scsi card/driver problem. We tried using an Adaptec card with solaris with poor results; switched to LSI, it just works. danno -- Dan Pritts, System Administrator Internet2 office: +1-734-352-4953 | mobile: +1-734-834-7224 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS for consumers WAS:Yager on ZFS
... At home the biggest reason I went with ZFS for my data is ease of management. I split my data up based on what it is ... media (photos, movies, etc.), vendor stuff (software, datasheets, etc.), home directories, and other misc. data. This gives me a good way to control backups based on the data type. It's not immediately clear why simply segregating the different data types into different directory sub-trees wouldn't allow you to do pretty much the same thing. - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [fuse-discuss] cannot mount 'mypool': Input/output error
On Thu, 2007-11-15 at 07:22 -0800, Nabeel Saad wrote: Hello, I have a question about using ZFS with Fuse. A little bit of background of what we've been doing first... We recently had an issue with a Solaris server where the permissions of the main system files in /etc and such were changed. On server restart, Solaris threw an error and it was not possible to log in, even as root. So, given that it's the only Solaris machine we have, we took out the drive and, after much trouble trying with different machines, we connected it to a Linux 2005 Limited Edition server using a USB to SATA connector. The Linux machine now sees the device in /dev/sda* and I can confirm this by doing the following:

[root]# fdisk sda
Command (m for help): p

Disk sda (Sun disk label): 16 heads, 149 sectors, 65533 cylinders
Units = cylinders of 2384 * 512 bytes

   Device Flag    Start       End    Blocks   Id  System
   sda1             1719     11169  11264400    2  SunOS root
   sda2     u          0      1719   2049048    3  SunOS swap
   sda3                0     65533  78115336    5  Whole disk
   sda5            16324     65533  58657128    8  SunOS home
   sda6            11169     16324   6144760    7  SunOS var

Given that Solaris uses ZFS,

Solaris *can* use ZFS. ZFS root isn't supported by any distro (other than perhaps Indiana). The filesystem you are trying to mount is probably UFS.

we figured to be able to change the permissions, we'll need to be able to mount the device. So, we found Fuse, downloaded and installed it along with ZFS. Everything went as expected until the creation of the pool, for some reason. We're interested in either sda1, sda3 or sda5; we'll know better once we can mount them... So, we do ./run.sh and then the zpool and zfs commands are available. My ZFS questions come here; once we run the create command, I get the error directly:

[root]# zpool create mypool sda

If you want to destroy the data on /dev/sda then this is a good start. IF it were ZFS (which it probably isn't) you'd want to be using zpool import.

fuse: mount failed: Invalid argument
cannot mount 'mypool': Input/output error

However, if I list the pools, clearly it's been created:

[root]# zpool list
NAME     SIZE   USED   AVAIL   CAP   HEALTH   ALTROOT
mypool  74.5G    88K   74.5G    0%   ONLINE   -

It seems the issue is with the mounting, and I can't understand why:

[root]# zfs mount mypool
fuse: mount failed: Invalid argument
cannot mount 'mypool': Input/output error
[root]# zfs mount

I had searched through the source code trying to figure out what argument was considered invalid and found the following:

477     if (res == -1) {
478             /*
479              * Maybe kernel doesn't support unprivileged mounts, in this
480              * case try falling back to fusermount
481              */
482             if (errno == EPERM) {
483                     res = -2;
484             } else {
485                     int errno_save = errno;
486                     if (mo->blkdev && errno == ENODEV && !fuse_mnt_check_fuseblk())
487                             fprintf(stderr, "fuse: 'fuseblk' support missing\n");
488                     else
489                             fprintf(stderr, "fuse: mount failed: %s\n",
490                                 strerror(errno_save));
491             }
492
493             goto out_close;
494     }

in the following file: http://cvs.opensolaris.org/source/xref/fuse/libfuse/mount.c

This is the OpenSolaris fuse code; you're using FUSE on Linux. You should check with the Linux FUSE community... -Mark ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
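If it does turn out to be UFS, a minimal sketch of mounting it from the Linux side, read-only to be safe; this assumes the Linux kernel's ufs driver is available, and ufstype has to match the Solaris flavour (sun for SPARC-labelled UFS, sunx86 for Solaris/x86):

    # Solaris/SPARC UFS
    mount -t ufs -o ro,ufstype=sun /dev/sda1 /mnt/solroot

    # Solaris/x86 UFS (different on-disk layout)
    mount -t ufs -o ro,ufstype=sunx86 /dev/sda1 /mnt/solroot

    # Linux ufs write support is limited; only remount read-write
    # (to fix the permissions) once the read-only mount looks sane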
[zfs-discuss] ZFS implementations
Hello, does anyone have some real-world examples of using a large ZFS cluster, i.e. somewhere with 40+ vdevs in the range of a few hundred or so terabytes? Thank you. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zfs mount -a intermittent
I have a slimmed down install on on_b61 and sometimes when the box is rebooted it fails to automatically remount the pool. In most cases, if I log in and run zfs mount -a, it will mount. In some cases I have to reboot again. Can someone provide some insight as to what may be going on here? truss captures the following when it fails:

412:    brk(0x0808D000) = 0
412:    brk(0x0809D000) = 0
412:    brk(0x080AD000) = 0
412:    brk(0x080BD000) = 0
412:    open(/dev/zfs, O_RDWR) = 3
412:    fstat64(3, 0x08047BA0) = 0
412:        d=0x0448 i=95420420 m=0020666 l=1 u=0 g=3 rdev=0x02D80000
412:        at = Nov 15 06:17:13 PST 2007 [ 1195136233 ]
412:        mt = Nov 15 06:17:13 PST 2007 [ 1195136233 ]
412:        ct = Nov 15 06:17:13 PST 2007 [ 1195136233 ]
412:        bsz=8192 blks=0 fs=devfs
412:    stat64(/dev/pts/0, 0x08047CB0) = 0
412:        d=0x044C i=447105886 m=0020620 l=1 u=0 g=0 rdev=0x00600000
412:        at = Nov 15 06:17:32 PST 2007 [ 1195136252 ]
412:        mt = Nov 15 06:17:32 PST 2007 [ 1195136252 ]
412:        ct = Nov 15 06:17:32 PST 2007 [ 1195136252 ]
412:        bsz=8192 blks=0 fs=dev
412:    open(/etc/mnttab, O_RDONLY) = 4
412:    fstat64(4, 0x08047B60) = 0
412:        d=0x04580001 i=2 m=0100444 l=2 u=0 g=0 sz=651
412:        at = Nov 15 06:17:38 PST 2007 [ 1195136258 ]
412:        mt = Nov 15 06:17:38 PST 2007 [ 1195136258 ]
412:        ct = Nov 15 06:17:04 PST 2007 [ 1195136224 ]
412:        bsz=512 blks=2 fs=mntfs
412:    open(/etc/dfs/sharetab, O_RDONLY) Err#2 ENOENT
412:    open(/etc/mnttab, O_RDONLY) = 5
412:    fstat64(5, 0x08047B80) = 0
412:        d=0x04580001 i=2 m=0100444 l=3 u=0 g=0 sz=651
412:        at = Nov 15 06:17:38 PST 2007 [ 1195136258 ]
412:        mt = Nov 15 06:17:38 PST 2007 [ 1195136258 ]
412:        ct = Nov 15 06:17:04 PST 2007 [ 1195136224 ]
412:        bsz=512 blks=2 fs=mntfs
412:    sysconfig(_CONFIG_PAGESIZE) = 4096
412:    ioctl(3, ZFS_IOC_POOL_CONFIGS, 0x08046DA4) = 0
412:    llseek(5, 0, SEEK_CUR) = 0
412:    close(5) = 0
412:    close(3) = 0
412:    llseek(4, 0, SEEK_CUR) = 0
412:    close(4) = 0
412:    _exit(0)

Looking at the ioctl call in libzfs_configs.c, I think

412:    ioctl(3, ZFS_IOC_POOL_CONFIGS, 0x08046DA4) = 0

is matching the section of code below.

245        for (;;) {
246                if (ioctl(zhp->zpool_hdl->libzfs_fd, ZFS_IOC_POOL_STATS,
247                    &zc) == 0) {
248                        /*
249                         * The real error is returned in the zc_cookie field.
250                         */
251                        error = zc.zc_cookie;
252                        break;
253                }

This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS + DB + fragments
Richard Elling wrote: ... there are really two very different configurations used to address different performance requirements: cheap and fast. It seems that when most people first consider this problem, they do so from the cheap perspective: single disk view. Anyone who strives for database performance will choose the fast perspective: stripes. And anyone who *really* understands the situation will do both. I'm not sure I follow. Many people who do high performance databases use hardware RAID arrays which often do not expose single disks. They don't have to expose single disks: they just have to use reasonable chunk sizes on each disk, as I explained later. Only very early (or very low-end) RAID used very small per-disk chunks (up to 64 KB max). Before the mid-'90s chunk sizes had grown to 128 - 256 KB per disk on mid-range arrays in order to improve disk utilization in the array. From talking with one of its architects years ago my impression is that HP's (now somewhat aging) EVA series uses 1 MB as its chunk size (the same size I used as an example, though today one could argue for as much as 4 MB and soon perhaps even more). The array chunk size is not the unit of update, just the unit of distribution across the array: RAID-5 will happily update a single 4 KB file block within a given array chunk and the associated 4 KB of parity within the parity chunk. But the larger chunk size does allow files to retain the option of using logical contiguity to attain better streaming sequential performance, rather than splintering that logical contiguity at fine grain across multiple disks. ... A single chunk on an (S)ATA disk today (the analysis is similar for high-performance SCSI/FC/SAS disks) needn't exceed about 4 MB in size to yield over 80% of the disk's maximum possible (fully-contiguous layout) sequential streaming performance (after the overhead of an 'average' - 1/3 stroke - initial seek and partial rotation are figured in: the latter could be avoided by using a chunk size that's an integral multiple of the track size, but on today's zoned disks that's a bit awkward). A 1 MB chunk yields around 50% of the maximum streaming performance. ZFS's maximum 128 KB 'chunk size' if effectively used as the disk chunk size as you seem to be suggesting yields only about 15% of the disk's maximum streaming performance (leaving aside an additional degradation to a small fraction of even that should you use RAID-Z). And if you match the ZFS block size to a 16 KB database block size and use that as the effective unit of distribution across the set of disks, you'll obtain a mighty 2% of the potential streaming performance (again, we'll be charitable and ignore the further degradation if RAID-Z is used). You do not seem to be considering the track cache, which for modern disks is 16-32 MBytes. If those disks are in a RAID array, then there is often larger read caches as well. Are you talking about hardware RAID in that last comment? I thought ZFS was supposed to eliminate the need for that. Expecting a seek and read for each iop is a bad assumption. The bad assumption is that the disks are otherwise idle and therefore have the luxury of filling up their track caches - especially when I explicitly assumed otherwise in the following paragraph in that post. 
If the system is heavily loaded the disks will usually have other requests queued up (even if the next request comes in immediately rather than being queued at the disk itself, an even half-smart disk will abort any current read-ahead activity so that it can satisfy the new request). Not that it would necessarily do much good for the case currently under discussion even if the disks weren't otherwise busy and they did fill up the track caches: ZFS's COW policies tend to encourage data that's updated randomly at fine grain (as a database table often is) to be splattered across the storage rather than neatly arranged such that the next data requested from a given disk will just happen to reside right after the previous data requested from that disk. Now, if your system is doing nothing else but sequentially scanning this one database table, this may not be so bad: you get truly awful disk utilization (2% of its potential in the last case, ignoring RAID-Z), but you can still read ahead through the entire disk set and obtain decent sequential scanning performance by reading from all the disks in parallel. But if your database table scan is only one small part of a workload which is (perhaps the worst case) performing many other such scans in parallel, your overall system throughput will be only around 4% of what it could be had you used 1 MB chunks (and the individual scan performances will also suck commensurately, of course). ... Real data would be greatly appreciated. In my tests, I see reasonable media bandwidth speeds
[zfs-discuss] read/write NFS block size and ZFS
Hello all... I'm migrating an NFS server from Linux to Solaris, and all clients (Linux) are using read/write block sizes of 8192. That gave the best performance I got, and it's working pretty well (NFSv3). I want to use all of ZFS's advantages, and I know I may take a performance loss, so I want to know if there is a recommendation for block size on NFS/ZFS, or what you think about it. Must I test, or is there no need to make such configurations with ZFS? Thanks very much for your time! Leal. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
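Not a recommendation so much as a starting point for testing; a sketch assuming an 8 KB dominant I/O size and a Linux NFSv3 client (pool, dataset, and host names are made up):

    # match the dataset recordsize to the dominant I/O size under test
    zfs create -o recordsize=8k tank/export
    zfs set sharenfs=on tank/export

    # on the Linux client, keep the rsize/wsize already tuned for the old server
    mount -t nfs -o vers=3,rsize=8192,wsize=8192 server:/tank/export /mnt/export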
Re: [zfs-discuss] Yager on ZFS
Adam Leventhal wrote: On Thu, Nov 08, 2007 at 07:28:47PM -0800, can you guess? wrote: How so? In my opinion, it seems like a cure for the brain damage of RAID-5.

Nope. A decent RAID-5 hardware implementation has no 'write hole' to worry about, and one can make a software implementation similarly robust with some effort (e.g., by using a transaction log to protect the data-plus-parity double-update or by using COW mechanisms like ZFS's in a more intelligent manner).

Can you reference a software RAID implementation which implements a solution to the write hole and performs well.

No, but I described how to use a transaction log to do so and later on in the post how ZFS could implement a different solution more consistent with its current behavior. In the case of the transaction log, the key is to use the log not only to protect the RAID update but to protect the associated higher-level file operation as well, such that a single log force satisfies both (otherwise, logging the RAID update separately would indeed slow things down - unless you had NVRAM to use for it, in which case you've effectively just reimplemented a low-end RAID controller - which is probably why no one has implemented that kind of solution in a stand-alone software RAID product).

... The part of RAID-Z that's brain-damaged is its concurrent-small-to-medium-sized-access performance (at least up to request sizes equal to the largest block size that ZFS supports, and arguably somewhat beyond that): while conventional RAID-5 can satisfy N+1 small-to-medium read accesses or (N+1)/2 small-to-medium write accesses in parallel (though the latter also take an extra rev to complete), RAID-Z can satisfy only one small-to-medium access request at a time (well, plus a smidge for read accesses if it doesn't verify the parity) - effectively providing RAID-3-style performance.

Brain damage seems a bit of an alarmist label.

I consider 'brain damage' to be if anything a charitable characterization.

While you're certainly right that for a given block we do need to access all disks in the given stripe, it seems like a rather quaint argument: aren't most environments that matter trying to avoid waiting for the disk at all?

Everyone tries to avoid waiting for the disk at all. Remarkably few succeed very well.

Intelligent prefetch and large caches -- I'd argue -- are far more important for performance these days.

Intelligent prefetch doesn't do squat if your problem is disk throughput (which in server environments it frequently is). And all caching does (if you're lucky and your workload benefits much at all from caching) is improve your system throughput at the point where you hit the disk throughput wall. Improving your disk utilization, by contrast, pushes back that wall. And as I just observed in another thread, not by 20% or 50% but potentially by around two decimal orders of magnitude if you compare the sequential scan performance of multiple randomly-updated database tables between a moderately coarsely-chunked conventional RAID and a fine-grained ZFS block size (e.g., the 16 KB used by the example database) with each block sprayed across several disks. Sure, that's a worst-case scenario. But two orders of magnitude is a hell of a lot, even if it doesn't happen often - and suggests that in more typical cases you're still likely leaving a considerable amount of performance on the table even if that amount is a lot less than a factor of 100. 
The easiest way to fix ZFS's deficiency in this area would probably be to map each group of N blocks in a file as a stripe with its own parity - which would have the added benefit of removing any need to handle parity groups at the disk level (this would, incidentally, not be a bad idea to use for mirroring as well, if my impression is correct that there's a remnant of LVM-style internal management there). While this wouldn't allow use of parity RAID for very small files, in most installations they really don't occupy much space compared to that used by large files so this should not constitute a significant drawback. I don't really think this would be feasible given how ZFS is stratified today, but go ahead and prove me wrong: here are the instructions for bringing over a copy of the source code: http://www.opensolaris.org/os/community/tools/scm Now you want me not only to design the fix but code it for you? I'm afraid that you vastly overestimate my commitment to ZFS: while I'm somewhat interested in discussing it and happy to provide what insights I can, I really don't personally care whether it succeeds or fails. But I sort of assumed that you might. - bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
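A toy model (assumed, not measured) of the concurrency claim made earlier in this message: if each of the $N+1$ drives in the group can service $r$ random reads per second, then, ignoring caching and parity verification,

    $$\text{RAID-5:}\ \approx (N+1)\,r \ \text{small reads/s} \qquad \text{RAID-Z:}\ \approx r \ \text{small reads/s}$$

since every RAID-Z block read touches the whole stripe; real arrays and real workloads will land somewhere in between.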
Re: [zfs-discuss] ZFS + DB + fragments
... For modern disks, media bandwidths are now getting to be 100 MBytes/s. If you need 500 MBytes/s of sequential read, you'll never get it from one disk.

And no one here even came remotely close to suggesting that you should try to.

You can get it from multiple disks, so the questions are:
1. How to avoid other bottlenecks, such as a shared fibre channel path? Diversity.
2. How to predict the data layout such that you can guarantee a wide spread?

You've missed at least one more significant question:

3. How to lay out the data such that this 500 MB/s drain doesn't cripple *other* concurrent activity going on in the system (that's what increasing the amount laid down on each drive to around 1 MB accomplishes - otherwise, you can easily wind up using all the system's disk resources to satisfy that one application, or even fall short if you have fewer than 50 disks available, since if you spread the data out relatively randomly in 128 KB chunks on a system with disks reasonably well-filled with data you'll only be obtaining around 10 MB/s from each disk, whereas with 1 MB chunks similarly spread about each disk can contribute more like 35 MB/s and you'll need only 14 - 15 disks to meet your requirement). Use smaller ZFS block sizes and/or RAID-Z and things get rapidly worse.

- bill This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Fwd: ZFS for consumers WAS:Yager on ZFS
Sent from the correct address... -- Forwarded message -- From: Paul Kraus [EMAIL PROTECTED] Date: Nov 15, 2007 12:57 PM Subject: Re: [zfs-discuss] ZFS for consumers WAS:Yager on ZFS To: zfs-discuss@opensolaris.org On 11/15/07, can you guess? [EMAIL PROTECTED] wrote: ... At home the biggest reason I went with ZFS for my data is ease of management. I split my data up based on what it is ... media (photos, movies, etc.), vendor stuff (software, datasheets, etc.), home directories, and other misc. data. This gives me a good way to control backups based on the data type. It's not immediately clear why simply segregating the different data types into different directory sub-trees wouldn't allow you to do pretty much the same thing. An old habit ... I think about backups along the lines of ufsdumps of entire filesystems, I know, an outdated model. I also like being able to see how much space I am using for each with a simple df rather than a du (that takes a while to run). I can also tune compression on a data type basis (no real point in trying to compress media files that are already compressed MPEG and JPEGs). -- Paul Kraus Albacon 2008 Facilities -- Paul Kraus Albacon 2008 Facilities ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
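As a concrete illustration of the per-data-type handling described above (pool, dataset, and device names are made up):

    # one pool, several datasets - no up-front partitioning
    zpool create tank mirror c1d0 c2d0
    zfs create tank/media
    zfs create tank/home
    zfs create tank/vendor

    # tune and report per data type
    zfs set compression=off tank/media      # already-compressed MPEGs/JPEGs
    zfs set compression=on tank/home
    zfs list -o name,used,avail,compressratio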
[zfs-discuss] Macs compatibility (was Re: Yager on ZFS)
This is clearly off-topic :-) but perhaps worth correcting -- Long-time MAC users must be getting used to having their entire world disrupted and having to re-buy all their software. This is at least the second complete flag-day (no forward or backwards compatibility) change they've been through. Actually, no; a fair number of Macintosh applications written in 1984, for the original Macintosh, still run on machines/OSes shipped in 2006. Apple provided processor compatibility by emulating the 68000 series on PowerPC, and the PowerPC on Intel; and OS compatibility by providing essentially a virtual machine running Mac OS 9 inside Mac OS X (up through 10.4). Sadly, Mac OS 9 applications no longer run on Mac OS 10.5, so it's true that the world is disrupted now for those with software written prior to 2000 or so. To make this vaguely Solaris-relevant, it's impressive that SunOS 4.x applications still generally run on Solaris 10, at least on SPARC systems, though Sun doesn't do processor emulation. Still not very ZFS-relevant. :-) This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Yager on ZFS
can you guess? billtodd at metrocast.net writes: You really ought to read a post before responding to it: the CERN study did encounter bad RAM (and my post mentioned that) - but ZFS usually can't do a damn thing about bad RAM, because errors tend to arise either before ZFS ever gets the data or after it has already returned and checked it (and in both cases, ZFS will think that everything's just fine).

According to the memtest86 author, corruption most often occurs at the moment memory cells are written to, by causing bitflips in adjacent cells. So when a disk DMAs data to RAM and corruption occurs as the DMA operation writes to the memory cells, and ZFS then verifies the checksum, it will detect the corruption. Therefore ZFS is perfectly capable (and even likely) to detect memory corruption during simple read operations from a ZFS pool. Of course there are other cases where neither ZFS nor any other checksumming filesystem is capable of detecting anything (e.g. the sequence of events: data is corrupted, checksummed, written to disk). -- Marc Bevand ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss