Re: [zfs-discuss] Poor relative performance of SAS over SATA drives
if you get rid of the HBA and log device, and run with ZIL disabled (if your workload is compatible with a disabled ZIL). By "get rid of the HBA" I assume you mean put in a battery-backed RAID card instead? -J ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
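For reference, the "disabled ZIL" experiment mentioned above was done on builds of this era with the zil_disable tunable (later replaced by the per-dataset sync property). A minimal sketch, assuming you accept that a power loss can silently drop the last few seconds of synchronous writes:

    # /etc/system -- takes effect at the next boot; benchmark use only
    set zfs:zil_disable = 1

    # verify the live value afterwards
    echo zil_disable/D | mdb -k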
Re: [zfs-discuss] [OpenIndiana-discuss] Question about WD drives with Super Micro systems
WD's drives have gotten better the last few years but their quality is still not very good. I doubt they test their drives extensively for heavy-duty server configs, particularly since you don't see them inside any of the major server manufacturers' boxes. Hitachi in particular does well in mass storage configs. -J Sent via iPhone Is your email Premiere? On Aug 6, 2011, at 10:45, Roy Sigurd Karlsbakk r...@karlsbakk.net wrote: Hi all We have a few servers with WD Black (and some Green) drives on Super Micro systems. We've seen both drive types work well with direct attach, but with LSI controllers and Super Micro's SAS expanders, well, that's another story. With those SAS expanders, we've seen numerous drives being kicked out and flagged as bad during high load (typically scrub/resilver). We have not seen this on the units we have with Hitachi or Seagate drives. After a drive is kicked out, we run a test on it using WD's tool, and in many (or most) cases we find the drive to be error-free. We've seen these issues on several machines, so hardware failure does not seem to be the cause. Has anyone here used WD drives with LSI controllers (3801/3081/9211) with Super Micro machines? Any success stories? Vennlige hilsener / Best regards roy -- Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/ -- I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av idiomer med fremmed opprinnelse. I de fleste tilfeller eksisterer adekvate og relevante synonymer på norsk. ___ OpenIndiana-discuss mailing list openindiana-disc...@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [OpenIndiana-discuss] Question about WD drives with Super Micro systems
This might be related to your issue: http://blog.mpecsinc.ca/2010/09/western-digital-re3-series-sata-drives.html On Saturday, August 6, 2011, Roy Sigurd Karlsbakk r...@karlsbakk.net wrote: In my experience, SATA drives behind SAS expanders just don't work. They fail in the manner you describe, sooner or later. Use SAS and be happy. Funny thing is Hitachi and Seagate drives work stably, whereas WD drives tend to fail rather quickly Vennlige hilsener / Best regards roy -- Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/ -- I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av idiomer med fremmed opprinnelse. I de fleste tilfeller eksisterer adekvate og relevante synonymer på norsk. ___ OpenIndiana-discuss mailing list openindiana-disc...@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Long import due to spares.
Just for history as to why Fishworks was running on this box...we were in the beta program and have upgraded along the way. This box is an X4240 with 16x 146GB disks running the Feb 2010 release of FW with de-dupe. We were getting ready to re-purpose the box and were getting our data off. We then deleted a filesystem that was using de-duplication, and the box suddenly went into a freeze and the pool had activity like crazy. After several failed attempts to recover the box to a usable state (days of importing failed), we reloaded the boot drives with Nexenta 3.0 (b134) (which was our goal anyway). When we tried to import this pool again, after 24 hours the pool finally imported, but with the error that the two spares were FAULTED with too many errors. The controller is an LSI 1068E-IR. Normally, I'd believe the drive was dead, except: both spares? Could this be related to the de-dupe FS being deleted? -J ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Unusual Resilver Result
Hi, I just replaced a drive (c12t5d0 in the listing below). For the first 6 hours of the resilver I saw no issues. However, sometime during the last hour of the resilver, the new drive and two others in the same RAID-Z2 stripe threw a couple of checksum errors. Also, two of the other drives in the stripe decided sometime in that last hour that they needed to resilver small amounts of data (128K and 64K respectively). The OS is snv126. My two questions are: Should I be worried about these checksum errors? What caused the small resilverings on c8t5d0 and c11t5d0, which were not replaced or otherwise touched? Thank you in advance. -J

  pool: zpool_db_css
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 7h0m with 0 errors on Thu Sep 30 04:59:49 2010
config:

        NAME           STATE     READ WRITE CKSUM
        zpool_db_css   ONLINE       0     0     0
          raidz2-0     ONLINE       0     0     0
            c7t5d0     ONLINE       0     0     0
            c8t5d0     ONLINE       0     0     4  128K resilvered
            c10t5d0    ONLINE       0     0     0
            c11t5d0    ONLINE       0     0     2  64K resilvered
            c12t5d0    ONLINE       0     0     3  61.0G resilvered
            c13t5d0    ONLINE       0     0     0
          raidz2-1     ONLINE       0     0     0
            c7t6d0     ONLINE       0     0     0
            c8t6d0     ONLINE       0     0     0
            c10t6d0    ONLINE       0     0     0
            c11t6d0    ONLINE       0     0     0
            c12t6d0    ONLINE       0     0     0
            c13t6d0    ONLINE       0     0     0
          raidz2-2     ONLINE       0     0     0
            c7t7d0     ONLINE       0     0     0
            c8t7d0     ONLINE       0     0     0
            c10t7d0    ONLINE       0     0     0
            c11t7d0    ONLINE       0     0     0
            c12t7d0    ONLINE       0     0     0
            c13t7d0    ONLINE       0     0     0
        spares
          c13t4d0      AVAIL
          c12t4d0      AVAIL

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Unusual Resilver Result
Thanks Tuomas. I'll run the scrub. It's an aging X4500. -J On Thu, Sep 30, 2010 at 3:25 AM, Tuomas Leikola tuomas.leik...@gmail.com wrote: On Thu, Sep 30, 2010 at 9:08 AM, Jason J. W. Williams jasonjwwilli...@gmail.com wrote: Should I be worried about these checksum errors? Maybe. Your disks, cabling or disk controller is probably having some issue which caused them. Or maybe sunspots are to blame. Run a scrub often and monitor if there are more, and if there is a pattern to them. Have backups. Maybe switch hardware one by one to see if that helps. What caused the small resilverings on c8t5d0 and c11t5d0 which were not replaced or otherwise touched? It was the checksum errors. ZFS automatically read the good data on other mirrors, and replaced the broken blocks with correct data. If you run zpool clear and zpool scrub you will notice these checksum errors have vanished. If they were caused by botched writes, no new errors should probably appear. If they are botched reads, you can see some new ones appearing :( So, not critical yet but something to keep an eye on. Tuomas ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
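For reference, the clear-and-scrub cycle Tuomas describes is just two commands; a sketch assuming the pool name zpool_db_css from the original post:

    # reset the error counters, then re-read and verify every block in the pool
    zpool clear zpool_db_css
    zpool scrub zpool_db_css

    # watch progress and whether the CKSUM column starts climbing again
    zpool status -v zpool_db_css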
Re: [zfs-discuss] Long resilver time
134 it is. This is an OpenSolaris rig that's going to be replaced within the next 60 days, so I just need to get it to something that won't throw false checksum errors like the 120-123 builds do and has decent rebuild times. Future boxes will be NexentaStor. Thank you guys. :) -J On Sun, Sep 26, 2010 at 2:21 PM, Richard Elling rich...@nexenta.com wrote: On Sep 26, 2010, at 1:16 PM, Roy Sigurd Karlsbakk wrote: Upgrading is definitely an option. What is the current snv favorite for ZFS stability? I apologize, with all the Oracle/Sun changes I haven't been paying as close attention to bug reports on zfs-discuss as I used to. OpenIndiana b147 is the latest binary release, but it also includes the fix for CR6494473, ZFS needs a way to slow down resilvering http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6494473 http://www.openindiana.org Are you sure upgrading to OI is safe at this point? 134 is stable unless you start fiddling with dedup, and OI is hardly tested. For a production setup, I'd recommend 134 For a production setup? For production I'd recommend something that is supported, preferably NexentaStor 3 (which is b134 + important ZFS fixes :-) -- richard -- OpenStorage Summit, October 25-27, Palo Alto, CA http://nexenta-summit2010.eventbrite.com Richard Elling rich...@nexenta.com +1-760-896-4422 Enterprise class storage for everyone www.nexenta.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Long resilver time
Err...I meant Nexenta Core. -J On Mon, Sep 27, 2010 at 12:02 PM, Jason J. W. Williams jasonjwwilli...@gmail.com wrote: 134 it is. This is an OpenSolaris rig that's going to be replaced within the next 60 days, so I just need to get it to something that won't throw false checksum errors like the 120-123 builds do and has decent rebuild times. Future boxes will be NexentaStor. Thank you guys. :) -J On Sun, Sep 26, 2010 at 2:21 PM, Richard Elling rich...@nexenta.com wrote: On Sep 26, 2010, at 1:16 PM, Roy Sigurd Karlsbakk wrote: Upgrading is definitely an option. What is the current snv favorite for ZFS stability? I apologize, with all the Oracle/Sun changes I haven't been paying as close attention to bug reports on zfs-discuss as I used to. OpenIndiana b147 is the latest binary release, but it also includes the fix for CR6494473, ZFS needs a way to slow down resilvering http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6494473 http://www.openindiana.org Are you sure upgrading to OI is safe at this point? 134 is stable unless you start fiddling with dedup, and OI is hardly tested. For a production setup, I'd recommend 134 For a production setup? For production I'd recommend something that is supported, preferably NexentaStor 3 (which is b134 + important ZFS fixes :-) -- richard -- OpenStorage Summit, October 25-27, Palo Alto, CA http://nexenta-summit2010.eventbrite.com Richard Elling rich...@nexenta.com +1-760-896-4422 Enterprise class storage for everyone www.nexenta.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Intermittent ZFS hang
If one was sticking with OpenSolaris for the short term, is something older than 134 more stable/less buggy? Not using de-dupe. -J On Thu, Sep 23, 2010 at 6:04 PM, Richard Elling richard.ell...@gmail.com wrote: Hi Charles, There are quite a few bugs in b134 that can lead to this. Alas, due to the new regime, there was a period of time where the distributions were not being delivered. If I were in your shoes, I would upgrade to OpenIndiana b147, which has 26 weeks of maturity and bug fixes over b134. http://www.openindiana.org -- richard On Sep 23, 2010, at 2:48 PM, Charles J. Knipe wrote: So, I'm still having problems with intermittent hangs on write with my ZFS pool. Details from my original post are below. Since posting that, I've gone back and forth with a number of you, and gotten a lot of useful advice, but I'm still trying to get to the root of the problem so I can correct it. Since the original post I have: -Gathered a great deal of information in the form of kernel thread dumps, zio_state dumps, and live crash dumps while the problem is happening. -Been advised that my ruling out of dedupe was probably premature, as I still likely have a good deal of deduplicated data on-disk. -Checked just about every log and counter that might indicate a hardware error, without finding one. I was wondering at this point if someone could give me some pointers on the following: 1. Given the dumps and diagnostic data I've gathered so far, is there a way I can determine for certain where in the ZFS driver I'm spending so much time hanging? At the very least I'd like to try to determine whether it is, in fact, a deduplication issue. 2. If it is, in fact, a deduplication issue, would my only recourse be a new pool and a send/receive operation? The data we're storing is VMFS volumes for ESX. We're tossing around the idea of creating new volumes in the same pool (now that dedupe is off) and migrating VMs over in small batches. The theory is that we would be writing non-deduped data this way, and when we were done we could remove the deduplicated volumes. Is this sound? Thanks again for all the help! -Charles Howdy, We're having a ZFS performance issue over here that I was hoping you guys could help me troubleshoot. We have a ZFS pool made up of 24 disks, arranged into 7 raid-z devices of 4 disks each. We're using it as an iSCSI back-end for VMWare and some Oracle RAC clusters. Under normal circumstances performance is very good both in benchmarks and under real-world use. Every couple of days, however, I/O seems to hang for anywhere between several seconds and several minutes. The hang seems to be a complete stop of all write I/O. The following zpool iostat illustrates:

pool0  2.47T  5.13T  120  0  293K  0
pool0  2.47T  5.13T  127  0  308K  0
pool0  2.47T  5.13T  131  0  322K  0
pool0  2.47T  5.13T  144  0  347K  0
pool0  2.47T  5.13T  135  0  331K  0
pool0  2.47T  5.13T  122  0  295K  0
pool0  2.47T  5.13T  135  0  330K  0

While this is going on our VMs all hang, as do any zfs create commands or attempts to touch/create files in the zfs pool from the local system. After several minutes the system un-hangs and we see very high write rates before things return to normal across the board. Some more information about our configuration: We're running OpenSolaris snv_134. ZFS is at version 22. Our disks are 15k RPM 300GB Seagate Cheetahs, mounted in Promise J610S Dual enclosures, hanging off a Dell SAS 5/e controller. We'd tried out most of this configuration previously on OpenSolaris 2009.06 without running into this problem. 
The only thing that's new, aside from the newer OpenSolaris/ZFS is a set of four SSDs configured as log disks. At first we blamed de-dupe, but we've disabled that. Next we suspected the SSD log disks, but we've seen the problem with those removed, as well. Has anyone seen anything like this before? Are there any tools we can use to gather information during the hang which might be useful in determining what's going wrong? Thanks for any insights you may have. -Charles -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- OpenStorage Summit, October 25-27, Palo Alto, CA http://nexenta-summit2010.eventbrite.com ZFS and performance consulting http://www.RichardElling.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org
[zfs-discuss] Long resilver time
I just witnessed a resilver that took 4h for 27 GB of data. Setup is 3x raid-z2 stripes with 6 disks per raid-z2. Disks are 500 GB in size. No checksum errors. It seems like an exorbitantly long time. The other 5 disks in the stripe with the replaced disk were at 90% busy and ~150 IO/s each during the resilver. Does this seem unusual to anyone else? Could it be due to heavy fragmentation, or do I have a disk in the stripe going bad? Post-resilver no disk is above 30% util or noticeably higher than any other disk. Thank you in advance. (kernel is snv123) -J Sent via iPhone Is your e-mail Premiere? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Long resilver time
Upgrading is definitely an option. What is the current snv favorite for ZFS stability? I apologize, with all the Oracle/Sun changes I haven't been paying as close attention to bug reports on zfs-discuss as I used to. -J Sent via iPhone Is your e-mail Premiere? On Sep 26, 2010, at 10:22, Roy Sigurd Karlsbakk r...@karlsbakk.net wrote: I just witnessed a resilver that took 4h for 27gb of data. Setup is 3x raid-z2 stripes with 6 disks per raid-z2. Disks are 500gb in size. No checksum errors. It seems like an exorbitantly long time. The other 5 disks in the stripe with the replaced disk were at 90% busy and ~150io/s each during the resilver. Does this seem unusual to anyone else? Could it be due to heavy fragmentation or do I have a disk in the stripe going bad? Post-resilver no disk is above 30% util or noticeably higher than any other disk. Thank you in advance. (kernel is snv123) It surely seems a long time for 27 gigs. Scrub takes its time, but for this 50TB setup with currently ~29TB used, on WD Green drives (yeah, I know they're bad, but I didn't know that at the time I installed the box, and they have worked flawlessly for a year or so), scrub takes a bit of time, but nothing comparable to what you're reporting: scrub: scrub completed after 47h57m with 0 errors on Fri Sep 3 16:57:26 2010 Also, snv123 is quite old, is upgrading to 134 an option? Vennlige hilsener / Best regards roy -- Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/ -- I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av idiomer med fremmed opprinnelse. I de fleste tilfeller eksisterer adekvate og relevante synonymer på norsk. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [storage-discuss] ZFS iscsi snapshot - VSS compatible?
Since iSCSI is block-level, I don't think the file-level intelligence you're asking for is feasible at the iSCSI layer. VSS is used at the file-system level on either NTFS partitions or over CIFS. -J On Wed, Jan 7, 2009 at 5:06 PM, Mr Stephen Yum sosu...@yahoo.com wrote: Hi all, If I want to make a snapshot of an iscsi volume while there's a transfer going on, is there a way to detect this and either 1) not include the file being transferred, or 2) wait until the transfer is finished before making the snapshot? If I understand correctly, this is what Microsoft's VSS is supposed to do. Am I right? Right now, when there is a transfer going on while making the snapshot, I always end up with a corrupt file (understandably so, since the file transfer is unfinished). S ___ storage-discuss mailing list storage-disc...@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/storage-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 3ware support
X4500 problems seconded. Still having issues with port resets due to the Marvell driver. Though they seem considerably more transient and less likely to lock up the entire system in the most recent (b72) OpenSolaris builds. -J On Feb 12, 2008 9:35 AM, Carson Gaspar [EMAIL PROTECTED] wrote: Tim wrote: A much cheaper (and probably the BEST supported) card is the Supermicro based on the Marvell chipset. This is the same chipset that is used in the thumper x4500 so you know that the folks at sun are doing their due diligence to make sure the drivers are solid. Except the drivers _aren't_ solid, at least in Solaris(tm). The OpenSolaris drivers may have been fixed (I know a lot of work is going into them, but I haven't tested them), but those fixes have not made it back into the supported realm. So if you need to run a supported OS, I'd skip the Marvell chips if possible, at least for now. -- Carson ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] LVM on ZFS
Hey Thiago, SVM is a direct replacement for LVM. Also, you'll notice about a 30% performance boost if you move from LVM to SVM. At least we did when we moved a couple of years ago. -J On Jan 21, 2008 8:09 AM, Thiago Sobral [EMAIL PROTECTED] wrote: Hi folks, I need to manage volumes like LVM does on Linux or AIX, and I think that ZFS can solve this issue. I read the SVM specification and it certainly won't be the solution that I'll adopt. I don't have Veritas here. I created a pool with the name black and a volume lv00, then created a filesystem with the 'newfs' command: # newfs /dev/zvol/rdsk/black/lv00 Is this the right way? What is the best way to manage volumes in Solaris? Do you have a URL or document describing this? cheers, TS ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
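For reference, both approaches discussed in this thread are sketched below using Thiago's pool name black; the disk names are placeholders. A UFS-on-zvol setup works, but a native ZFS file system is usually the simpler way to get LVM-style management:

    # option 1: an emulated volume (zvol) with UFS on top, as in the original post
    zpool create black c1t1d0 c1t2d0
    zfs create -V 10g black/lv00
    newfs /dev/zvol/rdsk/black/lv00

    # option 2: skip newfs entirely and let ZFS manage the file system
    zfs create black/data
    zfs set quota=10g black/data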
Re: [zfs-discuss] De-duplication in ZFS
It'd be a really nice feature. Combined with baked-in replication it would be a nice alternative to our DD appliances. -J On Jan 21, 2008 2:03 PM, John Martinez [EMAIL PROTECTED] wrote: Great question. I've been wondering this myself over the past few weeks, as de-dup is becoming more popular a term in our IT department. -john On Jan 20, 2008, at 5:40 PM, Narayan Venkat wrote: Hi, Is de-duplication in ZFS an active project? If so, can somebody share details about how it's going to be implemented? Thanks. NV ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] MySQL/ZFS backup program posted.
Hey Y'all, I've posted the program (SnapBack) my company developed internally for backing up production MySQL servers using ZFS snapshots: http://blogs.digitar.com/jjww/?itemid=56 Hopefully, it'll save other folks some time. We use it a lot for standing up new MySQL slaves as well. Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
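SnapBack itself lives at the URL above; the general flush-lock-snapshot pattern such tools use looks roughly like the sketch below, assuming the MySQL datadir sits on a dataset named tank/mysql (the dataset and host names are placeholders, and this is not the actual SnapBack code):

    # 1) in a mysql session: FLUSH TABLES WITH READ LOCK; record the binlog
    #    position; keep that session open while step 2 runs, then UNLOCK TABLES;
    # 2) from another shell, take an atomic snapshot of the datadir's dataset:
    zfs snapshot tank/mysql@backup-20070101
    # 3) to stand up a new slave, send the snapshot to the target host:
    zfs send tank/mysql@backup-20070101 | ssh newslave zfs recv -F tank/mysql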
[zfs-discuss] ZFS Not Offlining Disk on SCSI Sense Error (X4500)
Hello, There seems to be a persistent issue we have with ZFS where one of the SATA disks in a zpool on a Thumper starts throwing sense errors: ZFS does not offline the disk and instead hangs all zpools across the system. If it is not caught soon enough, application data ends up in an inconsistent state. We've had this issue with b54 through b77 (as of last night). Reading through the archives, we don't seem to be the only folks with this issue. Are there any plans to fix this behavior? It really makes ZFS less than desirable/reliable. Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Not Offlining Disk on SCSI Sense Error (X4500)
Hi Albert, Thank you for the link. ZFS isn't offlining the disk in b77. -J On Jan 3, 2008 3:07 PM, Albert Chin [EMAIL PROTECTED] wrote: On Thu, Jan 03, 2008 at 02:57:08PM -0700, Jason J. W. Williams wrote: There seems to be a persistent issue we have with ZFS where one of the SATA disk in a zpool on a Thumper starts throwing sense errors, ZFS does not offline the disk and instead hangs all zpools across the system. If it is not caught soon enough, application data ends up in an inconsistent state. We've had this issue with b54 through b77 (as of last night). We don't seem to be the only folks with this issue reading through the archives. Are there any plans to fix this behavior? It really makes ZFS less than desirable/reliable. http://blogs.sun.com/eschrock/entry/zfs_and_fma FMA For ZFS Phase 2 (PSARC/2007/283) was integrated in b68: http://www.opensolaris.org/os/community/arc/caselog/2007/283/ http://www.opensolaris.org/os/community/on/flag-days/all/ -- albert chin ([EMAIL PROTECTED]) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Not Offlining Disk on SCSI Sense Error (X4500)
Hi Eric, Hard to say. I'll use MDB next time it happens for more info. The applications using any zpool lock up. -J On Jan 3, 2008 3:33 PM, Eric Schrock [EMAIL PROTECTED] wrote: When you say starts throwing sense errors, does that mean every I/O to the drive will fail, or some arbitrary percentage of I/Os will fail? If it's the latter, ZFS is trying to do the right thing by recognizing these as transient errors, but eventually the ZFS diagnosis should kick in. What does '::spa -ve' in 'mdb -k' show in one of these situations? How about '::zio_state'? - Eric On Thu, Jan 03, 2008 at 03:11:39PM -0700, Jason J. W. Williams wrote: Hi Albert, Thank you for the link. ZFS isn't offlining the disk in b77. -J On Jan 3, 2008 3:07 PM, Albert Chin [EMAIL PROTECTED] wrote: On Thu, Jan 03, 2008 at 02:57:08PM -0700, Jason J. W. Williams wrote: There seems to be a persistent issue we have with ZFS where one of the SATA disk in a zpool on a Thumper starts throwing sense errors, ZFS does not offline the disk and instead hangs all zpools across the system. If it is not caught soon enough, application data ends up in an inconsistent state. We've had this issue with b54 through b77 (as of last night). We don't seem to be the only folks with this issue reading through the archives. Are there any plans to fix this behavior? It really makes ZFS less than desirable/reliable. http://blogs.sun.com/eschrock/entry/zfs_and_fma FMA For ZFS Phase 2 (PSARC/2007/283) was integrated in b68: http://www.opensolaris.org/os/community/arc/caselog/2007/283/ http://www.opensolaris.org/os/community/on/flag-days/all/ -- albert chin ([EMAIL PROTECTED]) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Eric Schrock, FishWorkshttp://blogs.sun.com/eschrock ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
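For anyone wanting to capture the same data Eric asks for here, the dcmds are run against the live kernel while the hang is in progress; a minimal sketch:

    # as root
    mdb -k
    ::spa -ve        # per-pool state with verbose vdev/error detail
    ::zio_state      # state of outstanding ZIOs
    $q               # quit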
Re: [zfs-discuss] ZFS performance with Oracle
Seconded. Redundant controllers mean you can get one controller that locks them both up as much as it means you've got a backup. Best Regards, Jason On Mar 21, 2007 4:03 PM, Richard Elling [EMAIL PROTECTED] wrote: JS wrote: I'd definitely prefer owning a sort of SAN solution that would basically just be trays of JBODs exported through redundant controllers, with enterprise-level service. The world is still playing catch-up to integrate with all the possibilities of zfs. It was called the A5000, later A5100 and A5200. I've still got the scars and Torrey looks like one of the X-men. If you think that a disk drive vendor can write better code than an OS/systems vendor, then you're due for a sad realization. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] X4500 ILOM thinks disk 20 is faulted, ZFS thinks not.
Hey Guys, Have any of y'all seen a condition where the ILOM considers a disk faulted (status is 3 instead of 1), but ZFS keeps writing to the disk and doesn't report any errors? I'm going to do a scrub tomorrow and see what comes back. I'm curious what caused the ILOM to fault the disk. Any advice is greatly appreciated. Best Regards, Jason P.S. The system is running OpenSolaris Build 54. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] X4500 ILOM thinks disk 20 is faulted, ZFS thinks not.
Hi Ralf, Thank you for the suggestion. About half of the disks are reporting 1968-1969 in the Soft Errors field. All disks are reporting 1968 in the Illegal Request field. There don't appear to be any other errors; all other counters are 0. The Illegal Request count seems a little fishy...like iostat -E doesn't like the X4500 for some reason. Thank you again for your help. Best Regards, Jason On Dec 4, 2007 2:54 AM, Ralf Ramge [EMAIL PROTECTED] wrote: Jason J. W. Williams wrote: Have any of y'all seen a condition where the ILOM considers a disk faulted (status is 3 instead of 1), but ZFS keeps writing to the disk and doesn't report any errors? I'm going to do a scrub tomorrow and see what comes back. I'm curious what caused the ILOM to fault the disk. Any advice is greatly appreciated. What does `iostat -E` tell you? I've experienced several times that ZFS is very fault tolerant - a bit too tolerant for my taste - when it comes to faulting a disk. I saw external FC drives with hundreds or even thousands of errors, even entire hanging loops or drives with hardware trouble, and neither ZFS nor /var/adm/messages reported a problem. So I prefer examining the iostat output over `zpool status` - but with the unattractive side effect that it's not possible to reset the error count which iostat reports without a reboot, so this method is not suitable for monitoring purposes. -- Ralf Ramge Senior Solaris Administrator, SCNA, SCSA Tel. +49-721-91374-3963 [EMAIL PROTECTED] - http://web.de/ 11 Internet AG Brauerstraße 48 76135 Karlsruhe Amtsgericht Montabaur HRB 6484 Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Andreas Gauger, Thomas Gottschlich, Matthias Greve, Robert Hoffmann, Norbert Lang, Achim Weiss Aufsichtsratsvorsitzender: Michael Scheeren ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
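For reference, the check Ralf suggests is a one-liner; -E prints per-device cumulative error counters since boot, and -n adds descriptive device names:

    # all disks
    iostat -En
    # or just a single suspect drive (device name is a placeholder)
    iostat -En c5t3d0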
Re: [zfs-discuss] Yager on ZFS
A quick Google of ext3 fsck did not yield obvious examples of why people needed to run fsck on ext3, though it did remind me that by default ext3 runs fsck just for the hell of it every N (20?) mounts - could that have been part of what you were seeing? I'm not sure if that's what Robert meant, but that's been my experience with ext3. In fact, that little behavior caused a rather lengthy bit of downtime at another company in our same colo facility this week as a result of a facility-required reboot. Frankly, ext3 is an abortion of a filesystem. I'm somewhat surprised it's being used as a counterexample of journaling filesystems being no less reliable than ZFS. XFS or ReiserFS are both better examples than ext3. The primary use case for end-to-end checksumming in our environment has been exonerating the storage path when data corruption occurs. It's been crucial in a couple of instances in proving to our DB vendor that the corruption was caused by their code and not the OS, drivers, HBA, FC network, array, etc. Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
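For reference, the every-N-mounts fsck described above is ext3's mount-count check and is tunable on the Linux side; a sketch with /dev/sda1 as a placeholder device:

    # show the current maximum mount count and check interval
    tune2fs -l /dev/sda1 | egrep -i 'mount count|check'
    # disable the forced periodic fsck (trade-off: no scheduled consistency checks)
    tune2fs -c 0 -i 0 /dev/sda1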
[zfs-discuss] Count objects/inodes
Hi Guys, Someone asked me how to count the number of inodes/objects in a ZFS filesystem and I wasn't exactly sure. zdb -dv filesystem seems like a likely candidate but I wanted to find out for sure. As to why you'd want to know this, I don't know their reasoning but I assume it has to do with the maximum number of files a ZFS filesystem can support (2^48 no?). Thank you in advance for your help. Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
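For reference, zdb does answer this directly; a sketch assuming a dataset named tank/home:

    # the dataset summary line includes the total object count
    zdb -d tank/home
    # per-object detail (can be extremely verbose on large filesystems)
    zdb -dv tank/home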
Re: [zfs-discuss] Fracture Clone Into FS
Hi Bill, You've got it 99%. I want to roll E back to say B, and keep G intact. I really don't care about C, D or F. Essentially, B is where I want to roll back to, but in case B's data copy doesn't improve what I'm trying to fix, I want to have a copy of G's data around so I can go back to how it was. My order of operations would be something like this: 1.) Snapshot filesystem to preserve current state (snapshot F). 2.) Create clone of F (clone G). 3.) Roll the filesystem back to snapshot B. 4.) Maintain clone G data even though filesystem is at B. My concerns are: 1.) If I roll back to B after creating the clone, it will erase F and thereby the dependent clone G. 2.) If I promote the clone G, G will be the active filesystem data copy, whereas I want B to be the active data copy; I just want to keep G around. I apologize that this is coming out so confusingly. Please let me know if this is clear at all. I guess in a simple way, you could say I'd like to be able to roll back to any particular snapshot without having to lose any newer snapshot. Thereby giving the ability to roll forward and backward. Thank you in advance very much! Best Regards, Jason On 10/18/07, Bill Moore [EMAIL PROTECTED] wrote: I may not be understanding your usage case correctly, so bear with me. Here is what I understand your request to be. Time is increasing from left to right. A -- B -- C -- D -- E \ - F -- G Where E and G are writable filesystems and the others are snapshots. I think you're saying that you want to, for example, keep G and roll E back to A, keeping A, B, F, and G. If that's correct, I think you can just clone A (getting H), promote H, then delete C, D, and E. That would leave you with: A -- H \ -- B -- F -- G Is that anything at all like what you're after? --Bill On Wed, Oct 17, 2007 at 10:00:03PM -0600, Jason J. W. Williams wrote: Hey Guys, It's not possible yet to fracture a snapshot or clone into a self-standing filesystem, is it? Basically, I'd like to fracture a snapshot/clone into its own FS so I can roll back past that snapshot in the original filesystem and still keep that data. Thank you in advance. Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Fracture Clone Into FS
Hi Bill, Thinking about this a little more, would this provide the ability to maintain B and G's data for a rollback followed by a possible roll forward? 1.) Create a clone of snapshot_B (clone_B). 2.) Create a new current snapshot (snapshot_F). 3.) Create a clone of snapshot_F (clone_F). 4.) Promote clone_B. 5.) If clone_B's data doesn't work out, promote clone_F to roll forward. Thank you in advance. Best Regards, Jason On 10/18/07, Jason J. W. Williams [EMAIL PROTECTED] wrote: Hi Bill, You've got it 99%. I want to roll E back to say B, and keep G intact. I really don't care about C, D or F. Essentially, B is where I want to roll back to, but in case B's data copy doesn't improve what I'm trying to fix, I want to have a copy of G's data around so I can go back to how it was. My order of operations would be something like this: 1.) Snapshot filesystem to preserve current state (snapshot F). 2.) Create clone of F (clone G). 3.) Roll the filesystem back to snapshot B. 4.) Maintain clone G data even though filesystem is at B. My concerns are: 1.) If I roll back to B after creating the clone, it will erase F and thereby the dependent clone G. 2.) If I promote the clone G, G will be the active filesystem data copy, whereas I want B to be the active data copy; I just want to keep G around. I apologize that this is coming out so confusingly. Please let me know if this is clear at all. I guess in a simple way, you could say I'd like to be able to roll back to any particular snapshot without having to lose any newer snapshot. Thereby giving the ability to roll forward and backward. Thank you in advance very much! Best Regards, Jason On 10/18/07, Bill Moore [EMAIL PROTECTED] wrote: I may not be understanding your usage case correctly, so bear with me. Here is what I understand your request to be. Time is increasing from left to right. A -- B -- C -- D -- E \ - F -- G Where E and G are writable filesystems and the others are snapshots. I think you're saying that you want to, for example, keep G and roll E back to A, keeping A, B, F, and G. If that's correct, I think you can just clone A (getting H), promote H, then delete C, D, and E. That would leave you with: A -- H \ -- B -- F -- G Is that anything at all like what you're after? --Bill On Wed, Oct 17, 2007 at 10:00:03PM -0600, Jason J. W. Williams wrote: Hey Guys, It's not possible yet to fracture a snapshot or clone into a self-standing filesystem, is it? Basically, I'd like to fracture a snapshot/clone into its own FS so I can roll back past that snapshot in the original filesystem and still keep that data. Thank you in advance. Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
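For reference, the sequence being discussed maps onto plain zfs commands roughly as follows, with pool/fs, @B and @F as placeholder names. The key property is that zfs promote only reverses the clone/origin dependency; nothing is destroyed, so the current data stays reachable while the B-based copy becomes the one applications use:

    # 1) clone the old snapshot whose data you want to return to
    zfs clone pool/fs@B pool/clone_B
    # 2) preserve and clone the current state as well
    zfs snapshot pool/fs@F
    zfs clone pool/fs@F pool/clone_F
    # 3) make the B-based copy independent of pool/fs (pool/fs becomes its clone)
    zfs promote pool/clone_B
    # 4) nothing was rolled back destructively: pool/fs and pool/clone_F still
    #    hold the newest data if B's copy turns out not to help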
[zfs-discuss] Fracture Clone Into FS
Hey Guys, It's not possible yet to fracture a snapshot or clone into a self-standing filesystem, is it? Basically, I'd like to fracture a snapshot/clone into its own FS so I can roll back past that snapshot in the original filesystem and still keep that data. Thank you in advance. Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Direct I/O ability with zfs?
Hi Dale, We're testing out the enhanced arc_max enforcement (tracking DNLC entries) using Build 72 right now. Hopefully, it will fix the memory creep, which is the only real downside to ZFS for DB work, it seems to me. Frankly, our DB loads have improved performance with ZFS. I suspect it's because we are write-heavy. -J On 10/3/07, Dale Ghent [EMAIL PROTECTED] wrote: On Oct 3, 2007, at 10:31 AM, Roch - PAE wrote: If the DB cache is made large enough to consume most of memory, the ZFS copy will quickly be evicted to stage other I/Os on their way to the DB cache. What problem does that pose ? Personally, I'm still not completely sold on the performance (performance as in ability, not speed) of ARC eviction. Oftentimes, especially during a resilver, a server with ~2GB of RAM free under normal circumstances will dive down to the minfree floor, causing processes to be swapped out. We've had to take to manually constraining ARC max size so this situation is avoided. This is on s10u2/3. I haven't tried anything heavy-duty with Nevada simply because I don't put Nevada in production situations. Anyhow, in the case of DBs, ARC indeed becomes a vestigial organ. I'm surprised that this is being met with skepticism considering that Oracle highly recommends direct IO be used, and, IIRC, Oracle performance was the main motivation for adding DIO to UFS back in Solaris 2.6. This isn't a problem with ZFS or any specific fs per se, it's the buffer caching they all employ. So I'm a big fan of seeing 6429855 come to fruition. /dale ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS ARC DNLC Limitation
Hello All, A while back (Feb '07) when we noticed ZFS was hogging all the memory on the system, y'all were kind enough to help us use the arc_max tunable to attempt to limit that usage to a hard value. Unfortunately, at the time a sticky problem was that the hard limit did not include DNLC entries generated by ZFS. I've been watching the list since then and trying to watch the Nevada commits. I haven't noticed that anything has been committed back so that arc_max truly enforces the max amount of memory ZFS is allowed to consume (including DNLC entries). Has this been corrected and I just missed it? Thank you in advance for any help. Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS Snapshot destroy to
Hey All, Is it possible (or even technically feasible) for zfs to have a 'destroy to' feature? Basically, destroy any snapshot older than a certain date? Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Snapshot destroy to
Hi Mark, Thank you very much. That's what I was kind of afraid of. It's fine to script it; it just would be nice to have a built-in function. :-) Thank you again. Best Regards, Jason On 5/11/07, Mark J Musante [EMAIL PROTECTED] wrote: On Fri, 11 May 2007, Jason J. W. Williams wrote: Is it possible (or even technically feasible) for zfs to have a 'destroy to' feature? Basically, destroy any snapshot older than a certain date? Sorta-kinda. You can use 'zfs get' to get the creation time of a snapshot. If you give it -p, it'll provide the seconds-since-epoch time so, with a little fancy footwork, this is scriptable. Regards, markm ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
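For reference, the "fancy footwork" Mark mentions is only a few lines of shell; a sketch assuming a dataset named tank/data (leave the echo in place until you trust the selection):

    #!/bin/sh
    # destroy snapshots of tank/data older than 30 days
    cutoff=`perl -e 'print time() - 30*24*60*60'`
    for snap in `zfs list -H -t snapshot -o name | grep '^tank/data@'`; do
        created=`zfs get -Hp -o value creation $snap`
        [ "$created" -lt "$cutoff" ] && echo zfs destroy $snap
    done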
Re: [zfs-discuss] C'mon ARC, stay small...
Hi Guys, Rather than starting a new thread I thought I'd continue this one. I've been running Build 54 on a Thumper since mid-January and wanted to ask a question about the zfs_arc_max setting. We set it to 0x100000000 (4GB), however it's creeping over that until our kernel memory usage is nearly 7GB (::memstat inserted below). This is a database server, so I was curious if the DNLC would have this effect over time, as it does quite quickly when dealing with small files? Would it be worth upgrading to Build 59? Thank you in advance! Best Regards, Jason

Page Summary            Pages       MB   %Tot
Kernel                1750044     6836    42%
Anon                  1211203     4731    29%
Exec and libs            7648       29     0%
Page cache             220434      861     5%
Free (cachelist)       318625     1244     8%
Free (freelist)        659607     2576    16%
Total                 4167561    16279
Physical              4078747    15932

On 3/23/07, Roch - PAE [EMAIL PROTECTED] wrote: With the latest Nevada, setting zfs_arc_max in /etc/system is sufficient. Playing with mdb on a live system is more tricky and is what caused the problem here. -r [EMAIL PROTECTED] writes: Jim Mauro wrote: All righty...I set c_max to 512MB, c to 512MB, and p to 256MB...

arc::print -tad
{
 ...
 c02e29e8 uint64_t size = 0t299008
 c02e29f0 uint64_t p = 0t16588228608
 c02e29f8 uint64_t c = 0t33176457216
 c02e2a00 uint64_t c_min = 0t1070318720
 c02e2a08 uint64_t c_max = 0t33176457216
 ...
}
c02e2a08 /Z 0x20000000
arc+0x48: 0x7b9789000 = 0x20000000
c02e29f8 /Z 0x20000000
arc+0x38: 0x7b9789000 = 0x20000000
c02e29f0 /Z 0x10000000
arc+0x30: 0x3dcbc4800 = 0x10000000
arc::print -tad
{
 ...
 c02e29e8 uint64_t size = 0t299008
 c02e29f0 uint64_t p = 0t268435456 -- p is 256MB
 c02e29f8 uint64_t c = 0t536870912 -- c is 512MB
 c02e2a00 uint64_t c_min = 0t1070318720
 c02e2a08 uint64_t c_max = 0t536870912 --- c_max is 512MB
 ...
}

After a few runs of the workload ...

arc::print -d size
size = 0t536788992

Ah - looks like we're out of the woods. The ARC remains clamped at 512MB. Is there a way to set these fields using /etc/system? Or does this require a new or modified init script to run and do the above with each boot? Darren ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
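For reference, on builds where the zfs:0:arcstats kstat is available you can watch whether the cap is holding without a debugger (0x100000000 is simply 4GB written in hex):

    # current ARC size, target, and configured ceiling, in bytes
    kstat -p zfs:0:arcstats:size zfs:0:arcstats:c zfs:0:arcstats:c_max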
Re: [zfs-discuss] Re: Re: Re: ZFS memory and swap usage
Hi Rainer, While I would recommend upgrading to Build 54 or newer to use the system tunable, it's not that big of a deal to set the ARC on boot up. We've done it on a T2000 for a while, until we could take it down for an extended period of time to upgrade it. Definitely WOULD NOT run a database on ZFS without it. You will run out of RAM, and depending on how your DB responds to being out of RAM, you could get some very undesirable results. Just my two cents. -J On 3/19/07, Rainer Heilke [EMAIL PROTECTED] wrote: The updated information states that the kernel setting is only for the current Nevada build. We are not going to use the kernel debugger method to change the setting on a live production system (and do this every time we need to reboot). We're back to trying to set their expectations more realistically, and using proper tools to measure memory usage. As I stated at the outset, they are trying to start up a 10GB SGA database within two minutes to simulate the start-up of five 2GB databases at boot-up. I sincerely doubt they are going to start all five databases simultaneously within two minutes on a regular boot-up. So, what is the best use of the OS tools (vmstat, etc.) to show them how this would really occur? Rainer This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] X2200-M2
Hi Brian, To my understanding the X2100 M2 and X2200 M2 are basically the same board OEM'd from Quanta...except the 2200 M2 has two sockets. As to ZFS and any weirdness, it would seem to me that fixing it would be more an issue for the SATA/SCSI driver. I may be wrong here. -J On 3/12/07, Brian Hechinger [EMAIL PROTECTED] wrote: After the interesting revelations about the X2100 and its hot-swap abilities, what are the abilities of the X2200-M2's disk subsystem, and is ZFS going to tickle any weirdness out of them? -brian -- The reason I don't use Gnome: every single other window manager I know of is very powerfully extensible, where you can switch actions to different mouse buttons. Guess which one is not, because it might confuse the poor users? Here's a hint: it's not the small and fast one. --Linus ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: How much do we really want zpool remove?
Hi Przemol, I think migration is a really important feature...think I said that... ;-) SAN/RAID is not awful...frankly there's not been a better solution (outside of NetApp's WAFL) till ZFS. SAN/RAID just has its own reliability issues you accept unless you don't have to... ZFS :-) -J On 2/27/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: On Thu, Feb 22, 2007 at 12:21:50PM -0700, Jason J. W. Williams wrote: Hi Przemol, I think Casper had a good point bringing up the data integrity features when using ZFS for RAID. Big companies do a lot of things just because that's the certified way that end up biting them in the rear. Trusting your SAN arrays is one of them. That all being said, the need to do migrations is a very valid concern. Jason, I don't claim that SAN/RAID solutions are the best and don't have any mistakes/failures/problems. But if SAN/RAID is so bad, why do companies using them survive? Imagine also that some company is using SAN/RAID for a few years and doesn't have any problems (or only one every few months). Also from time to time they need to migrate between arrays (for whatever reason). Now you come and say that they have unreliable SAN/RAID and you offer something new (ZFS) which is going to make it much more reliable, but migration to another array will be painful. What do you think they will choose? BTW: I am a fan of ZFS. :-) przemol -- Ustawiaj rekordy DNS dla swojej domeny http://link.interia.pl/f1a1a ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ARGHH. An other panic!!
Hi Gino, Was there more than one LUN in the RAID-Z using the port you disabled? -J On 2/26/07, Gino Ruopolo [EMAIL PROTECTED] wrote: Hi Jason, Saturday we ran some tests and found that disabling an FC port under heavy load (MPXio enabled) often leads to a panic. (using a RAID-Z!) No problems with UFS ... later, Gino This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] HELIOS and ZFS cache
Hi Eric, Everything Mark said. We as a customer ran into this running MySQL on a Thumper (and a T2000). We solved it on the Thumper by limiting the ARC to 4GB in /etc/system:

set zfs:zfs_arc_max = 0x100000000 # 4GB

This has worked marvelously over the past 50 days. The ARC stays around 5-6GB now, leaving 11GB for the DB. Best Regards, Jason On 2/22/07, Mark Maybee [EMAIL PROTECTED] wrote: This issue has been discussed a number of times in this forum. To summarize: ZFS (specifically, the ARC) will try to use *most* of the system's available memory to cache file system data. The default is to max out at physmem-1GB (i.e., use all of physical memory except for 1GB). In the face of memory pressure, the ARC will give up memory, however there are some situations where we are unable to free up memory fast enough for an application that needs it (see the example in the HELIOS note below). In these situations, it may be necessary to lower the ARC's maximum memory footprint, so that there is a larger amount of memory immediately available for applications. This is particularly relevant in situations where there is a known amount of memory that will always be required for use by some application (databases often fall into this category). The tradeoff here is that the ARC will not be able to cache as much file system data, and that could impact performance. For example, if you know that an application will need 5GB on a 36GB machine, you could set the arc maximum to 30GB (0x780000000). In ZFS on s10 prior to update 4, you can only change the arc max size via explicit actions with mdb(1):

# mdb -kw
arc::print -a c_max
<address> c_max = <current-max>
<address>/Z <new-max>

In the current opensolaris nevada bits, and in s10u4, you can use the system variable 'zfs_arc_max' to set the maximum arc size. Just set this in /etc/system. -Mark Erik Vanden Meersch wrote: Could someone please provide comments or a solution for this? Subject: Solaris 10 ZFS problems with database applications HELIOS TechInfo #106 Tue, 20 Feb 2007 Solaris 10 ZFS problems with database applications -- We have tested Solaris 10 release 11/06 with ZFS without any problems using all HELIOS UB based products, including very high load tests. However we learned from customers that some database solutions (known are Sybase and Oracle), when allocating a large amount of memory, may slow down or even freeze the system for up to a minute. This can result in RPC timeout messages and service interrupts for HELIOS processes. ZFS is basically using most memory for file caching. Freeing this ZFS memory for the database memory allocation can result in serious delays. This does not occur when using HELIOS products only. The HELIOS test system was using 4GB memory. The customer production machine was using 16GB memory. Contact your SUN representative about how to limit the ZFS cache and what else to consider when using ZFS in your workflow. Check also with your application vendor for recommendations on using ZFS with their applications. 
Best regards, HELIOS Support HELIOS Software GmbH Steinriede 3 30827 Garbsen (Hannover) Germany Phone: +49 5131 709320 FAX:+49 5131 709325 http://www.helios.de -- http://www.sun.com/solaris * Erik Vanden Meersch * Solution Architect *Sun Microsystems, Inc.* Phone x48835/+32-2-704 8835 Mobile 0479/95 05 98 Email [EMAIL PROTECTED] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: How much do we really want zpool remove?
Hi Przemol, I think Casper had a good point bringing up the data integrity features when using ZFS for RAID. Big companies do a lot of things just because it's the certified way, and those things end up biting them in the rear. Trusting your SAN arrays is one of them. That all being said, the need to do migrations is a very valid concern. Best Regards, Jason On 2/22/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: On Wed, Feb 21, 2007 at 04:43:34PM +0100, [EMAIL PROTECTED] wrote: I cannot let you say that. Here in my company we are very interested in ZFS, but we do not care about the RAID/mirror features, because we already have a SAN with RAID-5 disks, and dual fabric connection to the hosts. But you understand that these underlying RAID mechanisms give absolutely no guarantee about data integrity but only that some data was found where some (possibly other) data was written? (RAID5 never verifies the checksum is correct on reads; it only uses it to reconstruct data when reads fail) But you understand that he perhaps knows that, but so far nothing wrong has happened [*] and migration is still a very important feature for him? [*] almost every big company has its data center with SAN and FC connections with RAID-5 or RAID-10 in their storage arrays and they are treated as reliable przemol -- Ustawiaj rekordy DNS dla swojej domeny http://link.interia.pl/f1a1a ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Zfs best practice for 2U SATA iSCSI NAS
Hi Nicholas, Actually Virtual Iron, they have a nice system at the moment with live migration of Windows guests. Ah. We looked at them for some Windows DR. They do have a nice product. 3. Which leads to: coming from Debian, how easy are system updates? I remember with OpenBSD system updates used to be a pain. Not a pain, but coming from Debian/Gentoo not great either. Packaging is one of the last areas where Solaris really needs an upgrade. You might want to take a look at Nexenta, which is OpenSolaris with a GNU userland and apt-get. Works pretty well. Once installed you can update it to Build 56 to get the iSCSI target. -J ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Zfs best practice for 2U SATA iSCSI NAS
Hi Nicholas, ZFS itself is very stable and very effective as a fast FS in our experience. If you browse the archives of the list you'll see that NFS performance is pretty acceptable, with some performance/RAM quirks around small files: http://www.opensolaris.org/jive/message.jspa?threadID=19858 http://www.opensolaris.org/jive/thread.jspa?threadID=18394 To my understanding the iSCSI driver is undergoing significant performance improvements...maybe someone close to this can help? If by VI you are referring to VMware Infrastructure...you won't get any support from VMware if you're using the iSCSI target on Solaris as it's not approved by them. Not that this is really a problem in my experience, as VMware tech support is pretty terrible anyway. Some questions: 1. How stable is zfs? I'm tolerant of some sweat work to fix problems, but data loss is unacceptable We haven't experienced any data loss, and have had some pretty nasty things thrown at it (FC array rebooted unexpectedly). 2. If drives need to be pulled and put into a new chassis, does zfs handle them having new device names and being out of order? My understanding and experience here is yes. It'll read the ZFS labels off the drives/slices. 3. Is it possible to hot swap drives with raidz(2)? Depends on your underlying hardware. To my knowledge hot-swapping is not dependent on the RAID level at all. 4. How does performance compare with 'brand name' storage systems? No clue if you're referring to NetApp. Does anyone else know? -J ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
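For reference on question 2, the move-drives-to-a-new-chassis case is handled by export/import; pool membership is recorded in labels on the disks themselves, so the import works even if every disk comes up with a new cNtNdN name. A sketch with a placeholder pool named tank:

    # on the old chassis (clean shutdown of the pool)
    zpool export tank
    # on the new chassis, after moving the disks
    zpool import          # scans device labels and lists importable pools
    zpool import tank
    # if the pool could not be exported cleanly first
    zpool import -f tank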
Re: [zfs-discuss] Re: ZFS or UFS - what to do?
Hi Jeff, Maybe I misread this thread, but I don't think anyone was saying that using ZFS on top of an intelligent array risks more corruption. Given my experience, I wouldn't run ZFS without some level of redundancy, since it will panic your kernel in a RAID-0 scenario where it detects a LUN is missing and can't fix it. That being said, I wouldn't run anything but ZFS anymore. When we had some database corruption issues a while back, ZFS made it very simple to prove it was the DB. Just did a scrub and boom, verification that the data was laid down correctly. RAID-5 will have better random read performance than RAID-Z for reasons Robert had to beat into my head. ;-) But if you really need that performance, perhaps RAID-10 is what you should be looking at? Someone smarter than I can probably give a better idea. Regarding the failure detection, does anyone on the list have the ZFS/FMA traps fed into a network management app yet? I'm curious what the experience with it is. Best Regards, Jason On 1/29/07, Jeffery Malloch [EMAIL PROTECTED] wrote: Hi Guys, SO... From what I can tell from this thread ZFS is VERY fussy about managing writes, reads and failures. It wants to be bit perfect. So if you use the hardware that comes with a given solution (in my case an Engenio 6994) to manage failures you risk a) bad writes that don't get picked up due to corruption from write cache to disk b) failures due to data changes that ZFS is unaware of that the hardware imposes when it tries to fix itself. So now I have a $70K+ lump that's useless for what it was designed for. I should have spent $20K on a JBOD. But since I didn't do that, it sounds like a traditional model works best (i.e. UFS et al) for the type of hardware I have. No sense paying for something and not using it. And by using ZFS just as a method for ease of file system growth and management I risk much more corruption. The other thing I haven't heard is why NOT to use ZFS. Or people who don't like it for some reason or another. Comments? Thanks, Jeff PS - the responses so far have been great and are much appreciated! Keep 'em coming... This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
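For the RAID-10 alternative mentioned above, a hedged example of a striped-mirror layout versus a single raidz group (pool and device names are placeholders; the two commands are alternatives, not meant to be run together):

    # two-way mirrors striped together, roughly ZFS's equivalent of RAID-10
    zpool create dbpool mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0
    # the same four disks as one raidz group: more usable capacity, slower random reads
    zpool create dbpool raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0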
Re: [zfs-discuss] Project Proposal: Availability Suite
Thank you for the detailed explanation. It is very helpful to understand the issue. Is anyone successfully using SNDR with ZFS yet? Best Regards, Jason On 1/26/07, Jim Dunham [EMAIL PROTECTED] wrote: Jason J. W. Williams wrote: Could the replication engine eventually be integrated more tightly with ZFS? Not it in the present form. The architecture and implementation of Availability Suite is driven off block-based replication at the device level (/dev/rdsk/...), something that allows the product to replicate any Solaris file system, database, etc., without any knowledge of what it is actually replicating. To pursue ZFS replication in the manner of Availability Suite, one needs to see what replication looks like from an abstract point of view. So simplistically, remote replication is like the letter 'h', where the left side of the letter is the complete I/O path on the primary node, the horizontal part of the letter is the remote replication network link, and the right side of the letter is only the bottom half of the complete I/O path on the secondary node. Next ZFS would have to have its functional I/O path split into two halves, a top and bottom piece. Next we configure replication, the letter 'h', between two given nodes, running both a top and bottom piece of ZFS on the source node, and just the bottom half of ZFS on the secondary node. Today, the SNDR component of Availability Suite works like the letter 'h' today, where we split the Solaris I/O stack into a top and bottom half. The top half is that software (file system, database or application I/O) that directs its I/Os to the bottom half (raw device, volume manager or block device). So all that needs to be done is to design and build a new variant of the letter 'h', and find the place to separate ZFS into two pieces. - Jim Dunham That would be slick alternative to send/recv. Best Regards, Jason On 1/26/07, Jim Dunham [EMAIL PROTECTED] wrote: Project Overview: I propose the creation of a project on opensolaris.org, to bring to the community two Solaris host-based data services; namely volume snapshot and volume replication. These two data services exist today as the Sun StorageTek Availability Suite, a Solaris 8, 9 10, unbundled product set, consisting of Instant Image (II) and Network Data Replicator (SNDR). Project Description: Although Availability Suite is typically known as just two data services (II SNDR), there is an underlying Solaris I/O filter driver framework which supports these two data services. This framework provides the means to stack one or more block-based, pseudo device drivers on to any pre-provisioned cb_ops structure, [ http://www.opensolaris.org/os/article/2005-03-31_inside_opensolaris__solaris_driver_programming/#datastructs ], thereby shunting all cb_ops I/O into the top of a developed filter driver, (for driver specific processing), then out the bottom of this filter driver, back into the original cb_ops entry points. Availability Suite was developed to interpose itself on the I/O stack of a block device, providing a filter driver framework with the means to intercept any I/O originating from an upstream file system, database or application layer I/O. This framework provided the means for Availability Suite to support snapshot and remote replication data services for UFS, QFS, VxFS, and more recently the ZFS file system, plus various databases like Oracle, Sybase and PostgreSQL, and also application I/Os. 
By providing a filter driver at this point in the Solaris I/O stack, it allows for any number of data services to be implemented, without regard to the underlying block storage that they will be configured on. Today, as a snapshot and/or replication solution, the framework allows both the source and destination block storage device to not only differ in physical characteristics (DAS, Fibre Channel, iSCSI, etc.), but also logical characteristics such as in RAID type, volume managed storage (i.e., SVM, VxVM), lofi, zvols, even ram disks. Community Involvement: By providing this filter-driver framework, two working filter drivers (II SNDR), and an extensive collection of supporting software and utilities, it is envisioned that those individuals and companies that adopt OpenSolaris as a viable storage platform, will also utilize and enhance the existing II SNDR data services, plus have offered to them the means in which to develop their own block-based filter driver(s), further enhancing the use and adoption on OpenSolaris. A very timely example that is very applicable to Availability Suite and the OpenSolaris community, is the recent announcement of the Project Proposal: lofi [ compression encryption ] - http://www.opensolaris.org/jive/click.jspamessageID=26841. By leveraging both the Availability Suite and the lofi OpenSolaris projects, it would be highly probable to not only offer compression encryption to lofi devices (as already proposed
Re: [zfs-discuss] hot spares - in standby?
Hi Guys, I seem to remember the Massive Array of Independent Disk guys ran into a problem I think they called static friction, where idle drives would fail on spin up after being idle for a long time: http://www.eweek.com/article2/0,1895,1941205,00.asp Would that apply here? Best Regards, Jason On 1/29/07, Toby Thain [EMAIL PROTECTED] wrote: On 29-Jan-07, at 9:04 PM, Al Hopper wrote: On Mon, 29 Jan 2007, Toby Thain wrote: Hi, This is not exactly ZFS specific, but this still seems like a fruitful place to ask. It occurred to me today that hot spares could sit in standby (spun down) until needed (I know ATA can do this, I'm supposing SCSI does too, but I haven't looked at a spec recently). Does anybody do this? Or does everybody do this already? I don't work with enough disk storage systems to know what is the industry norm. But there are 3 broad categories of disk drive spares: a) Cold Spare. A spare where the power is not connected until it is required. [1] b) Warm Spare. A spare that is active but placed into a low power mode. ... c) Hot Spare. A spare that is spun up and ready to accept read/write/position (etc) requests. Hi Al, Thanks for reminding me of the distinction. It seems very few installations would actually require (c)? Does the tub curve (chance of early life failure) imply that hot spares should be burned in, instead of sitting there doing nothing from new? Just like a data disk, seems to me you'd want to know if a hot spare fails while waiting to be swapped in. Do they get tested periodically? The ideal scenario, as you already allude to, would be for the disk subsystem to initially configure the drive as a hot spare and send it periodic test events for, say, the first 48 hours. For some reason that's a little shorter than I had in mind, but I take your word that that's enough burn-in for semiconductors, motors, servos, etc. This would get it past the first segment of the bathtub reliability curve ... If saving power was the highest priority, then the ideal situation would be where the disk subsystem could apply/remove power to the spare and move it from warm to cold upon command. I am surmising that it would also considerably increase the spare's useful lifespan versus hot and spinning. One trick with disk subsystems, like ZFS that have yet to have the FMA type functionality added and which (today) provide for hot spares only, is to initially configure a pool with one (hot) spare, and then add a 2nd hot spare, based on installing a brand new device, say, 12 months later. And another spare 12 months later. What you are trying to achieve, with this strategy, is to avoid the scenario whereby mechanical systems, like disk drives, tend to wear out within the same general, relatively short, timeframe. One (obvious) issue with this strategy, is that it may be impossible to purchase the same disk drive 12 and 24 months later. However, it's always possible to purchase a larger disk drive ...which is not guaranteed to be compatible with your storage subsystem...! --Toby and simply commit to the fact that the extra space provided by the newer drive will be wasted. [1] The most common example is a disk drive mounted on a carrier but not seated within the disk drive enclosure. Simple push in when required. ... Al Hopper Logical Approach Inc, Plano, TX. 
[EMAIL PROTECTED] Voice: 972.379.2133 Fax: 972.379.2134 Timezone: US CDT OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005 OpenSolaris Governing Board (OGB) Member - Feb 2006 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
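A minimal sketch of the staggered hot-spare approach Al describes, assuming an existing pool named tank (pool and device names are placeholders):

    zpool add tank spare c2t0d0    # first hot spare at initial deployment
    # roughly 12 months later, add a second spare from a newer drive batch
    zpool add tank spare c2t1d0
    zpool status tank              # spares show up in their own section, marked AVAIL until used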
Re: [zfs-discuss] Project Proposal: Availability Suite
Hi Jim, Thank you very much for the heads up. Unfortunately, we need the write-cache enabled for the application I was thinking of combining this with. Sounds like SNDR and ZFS need some more soak time before you can use both to their full potential together? Best Regards, Jason On 1/29/07, Jim Dunham [EMAIL PROTECTED] wrote: Jason, Thank you for the detailed explanation. It is very helpful to understand the issue. Is anyone successfully using SNDR with ZFS yet? Of the opportunities I've been involved with the answer is yes, but so far I've not seen SNDR with ZFS in a production environment, but that does not mean they don't exist. It was not until late June '06 that AVS 4.0, Solaris 10 and ZFS were generally available, and to date AVS has not been made available for the Solaris Express Community Release, but it will be real soon. While I have your attention, there are two issues between ZFS and AVS that need mentioning. 1). When ZFS is given an entire LUN to place in a ZFS storage pool, ZFS detects this, enabling SCSI write-caching on the LUN, and also opens the LUN with exclusive access, preventing other data services (like AVS) from accessing this device. The work-around is to manually format the LUN, typically placing all the blocks into a single partition, then just place this partition into the ZFS storage pool. ZFS detects this, not owning the entire LUN, and doesn't enable write-caching, which means it also doesn't open the LUN with exclusive access, and therefore AVS and ZFS can share the same LUN. I thought about submitting an RFE to have ZFS provide a means to override this restriction, but I am not 100% certain that a ZFS filesystem directly accessing a write-cache enabled LUN is the same thing as a replicated ZFS filesystem accessing a write-cache enabled LUN. Even though AVS is write-order consistent, there are disaster recovery scenarios, when enacted, where block-order, versus write-order, I/Os are issued. 2). One has to be very cautious in using zpool import -f (forced import), especially on a LUN or LUNs into which SNDR is actively replicating. If ZFS complains that the storage pool was not cleanly exported when issuing a zpool import ..., and one attempts a zpool import -f, without checking the active replication state, they are sure to panic Solaris. Of course this failure scenario is no different than accessing a LUN or LUNs on dual-ported, or SAN based storage when another Solaris host is still accessing the ZFS filesystem, or controller based replication, as they are all just different operational scenarios of the same issue, data blocks changing out from underneath the ZFS filesystem, and its CRC checking mechanisms. Jim Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
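A sketch of the single-partition work-around Jim describes, assuming one shared LUN c3t0d0 (device and pool names are placeholders):

    format c3t0d0                # interactively put all usable blocks into one slice, e.g. slice 0
    zpool create tank c3t0d0s0   # handing ZFS a slice rather than the whole LUN, so it neither enables
                                 # the write cache nor takes exclusive access, and SNDR can share the LUN

And, per Jim's second caution, if zpool import on the secondary warns that the pool was not cleanly exported, verify that SNDR replication is quiesced before even considering zpool import -f.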
Re: [zfs-discuss] hot spares - in standby?
Hi Toby, You're right. The healthcheck would definitely find any issues. I misinterpreted your comment to that effect as a question and didn't quite latch on. A zpool MAID-mode with that healthcheck might also be interesting on something like a Thumper for pure-archival, D2D backup work. Would dramatically cut down on the power. What do y'all think? Best Regards, Jason On 1/29/07, Toby Thain [EMAIL PROTECTED] wrote: On 29-Jan-07, at 11:02 PM, Jason J. W. Williams wrote: Hi Guys, I seem to remember the Massive Array of Independent Disk guys ran into a problem I think they called static friction, where idle drives would fail on spin up after being idle for a long time: You'd think that probably wouldn't happen to a spare drive that was spun up from time to time. In fact this problem would be (mitigated and/or) caught by the periodic health check I suggested. --T http://www.eweek.com/article2/0,1895,1941205,00.asp Would that apply here? Best Regards, Jason On 1/29/07, Toby Thain [EMAIL PROTECTED] wrote: On 29-Jan-07, at 9:04 PM, Al Hopper wrote: On Mon, 29 Jan 2007, Toby Thain wrote: Hi, This is not exactly ZFS specific, but this still seems like a fruitful place to ask. It occurred to me today that hot spares could sit in standby (spun down) until needed (I know ATA can do this, I'm supposing SCSI does too, but I haven't looked at a spec recently). Does anybody do this? Or does everybody do this already? I don't work with enough disk storage systems to know what is the industry norm. But there are 3 broad categories of disk drive spares: a) Cold Spare. A spare where the power is not connected until it is required. [1] b) Warm Spare. A spare that is active but placed into a low power mode. ... c) Hot Spare. A spare that is spun up and ready to accept read/write/position (etc) requests. Hi Al, Thanks for reminding me of the distinction. It seems very few installations would actually require (c)? Does the tub curve (chance of early life failure) imply that hot spares should be burned in, instead of sitting there doing nothing from new? Just like a data disk, seems to me you'd want to know if a hot spare fails while waiting to be swapped in. Do they get tested periodically? The ideal scenario, as you already allude to, would be for the disk subsystem to initially configure the drive as a hot spare and send it periodic test events for, say, the first 48 hours. For some reason that's a little shorter than I had in mind, but I take your word that that's enough burn-in for semiconductors, motors, servos, etc. This would get it past the first segment of the bathtub reliability curve ... If saving power was the highest priority, then the ideal situation would be where the disk subsystem could apply/remove power to the spare and move it from warm to cold upon command. I am surmising that it would also considerably increase the spare's useful lifespan versus hot and spinning. One trick with disk subsystems, like ZFS that have yet to have the FMA type functionality added and which (today) provide for hot spares only, is to initially configure a pool with one (hot) spare, and then add a 2nd hot spare, based on installing a brand new device, say, 12 months later. And another spare 12 months later. What you are trying to achieve, with this strategy, is to avoid the scenario whereby mechanical systems, like disk drives, tend to wear out within the same general, relatively short, timeframe. 
One (obvious) issue with this strategy, is that it may be impossible to purchase the same disk drive 12 and 24 months later. However, it's always possible to purchase a larger disk drive ...which is not guaranteed to be compatible with your storage subsystem...! --Toby and simply commit to the fact that the extra space provided by the newer drive will be wasted. [1] The most common example is a disk drive mounted on a carrier but not seated within the disk drive enclosure. Simple push in when required. ... Al Hopper Logical Approach Inc, Plano, TX. [EMAIL PROTECTED] approach.com Voice: 972.379.2133 Fax: 972.379.2134 Timezone: US CDT OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005 OpenSolaris Governing Board (OGB) Member - Feb 2006 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS or UFS - what to do?
Hi Jeff, We're running a FLX210 which I believe is an Engenio 2884. In our case it also is attached to a T2000. ZFS has run VERY stably for us with data integrity issues at all. We did have a significant latency problem caused by ZFS flushing the write cache on the array after every write, but that can be fixed by configuring your array to ignore cache flushes. The instructions for Engenio products are here: http://blogs.digitar.com/jjww/?itemid=44 We use the config for a production database, so I can't speak to the NFS issues. All I would mention is to watch the RAM consumption by ZFS. Does anyone on the list have a recommendation for ARC sizing with NFS? Best Regards, Jason On 1/26/07, Jeffery Malloch [EMAIL PROTECTED] wrote: Hi Folks, I am currently in the midst of setting up a completely new file server using a pretty well loaded Sun T2000 (8x1GHz, 16GB RAM) connected to an Engenio 6994 product (I work for LSI Logic so Engenio is a no brainer). I have configured a couple of zpools from Volume groups on the Engenio box - 1x2.5TB and 1x3.75TB. I then created sub zfs systems below that and set quotas and sharenfs'd them so that it appears that these file systems are dynamically shrinkable and growable. It looks very good... I can see the correct file system sizes on all types of machines (Linux 32/64bit and of course Solaris boxes) and if I resize the quota it's picked up in NFS right away. But I would be the first in our organization to use this in an enterprise system so I definitely have some concerns that I'm hoping someone here can address. 1. How stable is ZFS? The Engenio box is completely configured for RAID5 with hot spares and write cache (8GB) has battery backup so I'm not too concerned from a hardware side. I'm looking for an idea of how stable ZFS itself is in terms of corruptability, uptime and OS stability. 2. Recommended config. Above, I have a fairly simple setup. In many of the examples the granularity is home directory level and when you have many many users that could get to be a bit of a nightmare administratively. I am really only looking for high level dynamic size adjustability and am not interested in its built in RAID features. But given that, any real world recommendations? 3. Caveats? Anything I'm missing that isn't in the docs that could turn into a BIG gotchya? 4. Since all data access is via NFS we are concerned that 32 bit systems (Mainly Linux and Windows via Samba) will not be able to access all the data areas of a 2TB+ zpool even if the zfs quota on a particular share is less then that. Can anyone comment? The bottom line is that with anything new there is cause for concern. Especially if it hasn't been tested within our organization. But the convenience/functionality factors are way too hard to ignore. Thanks, Jeff This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
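A rough sketch of the per-filesystem quota and sharenfs layout Jeff describes, using placeholder pool and filesystem names:

    zfs create bigpool/projects
    zfs set quota=500G bigpool/projects    # NFS clients see a 500G filesystem
    zfs set sharenfs=rw bigpool/projects
    zfs set quota=750G bigpool/projects    # growing it later is a single command, picked up by NFS clients right away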
Re: [zfs-discuss] ZFS or UFS - what to do?
Correction: ZFS has run VERY stably for us with data integrity issues at all. should read ZFS has run VERY stably for us with NO data integrity issues at all. On 1/26/07, Jason J. W. Williams [EMAIL PROTECTED] wrote: Hi Jeff, We're running a FLX210 which I believe is an Engenio 2884. In our case it also is attached to a T2000. ZFS has run VERY stably for us with data integrity issues at all. We did have a significant latency problem caused by ZFS flushing the write cache on the array after every write, but that can be fixed by configuring your array to ignore cache flushes. The instructions for Engenio products are here: http://blogs.digitar.com/jjww/?itemid=44 We use the config for a production database, so I can't speak to the NFS issues. All I would mention is to watch the RAM consumption by ZFS. Does anyone on the list have a recommendation for ARC sizing with NFS? Best Regards, Jason On 1/26/07, Jeffery Malloch [EMAIL PROTECTED] wrote: Hi Folks, I am currently in the midst of setting up a completely new file server using a pretty well loaded Sun T2000 (8x1GHz, 16GB RAM) connected to an Engenio 6994 product (I work for LSI Logic so Engenio is a no brainer). I have configured a couple of zpools from Volume groups on the Engenio box - 1x2.5TB and 1x3.75TB. I then created sub zfs systems below that and set quotas and sharenfs'd them so that it appears that these file systems are dynamically shrinkable and growable. It looks very good... I can see the correct file system sizes on all types of machines (Linux 32/64bit and of course Solaris boxes) and if I resize the quota it's picked up in NFS right away. But I would be the first in our organization to use this in an enterprise system so I definitely have some concerns that I'm hoping someone here can address. 1. How stable is ZFS? The Engenio box is completely configured for RAID5 with hot spares and write cache (8GB) has battery backup so I'm not too concerned from a hardware side. I'm looking for an idea of how stable ZFS itself is in terms of corruptability, uptime and OS stability. 2. Recommended config. Above, I have a fairly simple setup. In many of the examples the granularity is home directory level and when you have many many users that could get to be a bit of a nightmare administratively. I am really only looking for high level dynamic size adjustability and am not interested in its built in RAID features. But given that, any real world recommendations? 3. Caveats? Anything I'm missing that isn't in the docs that could turn into a BIG gotchya? 4. Since all data access is via NFS we are concerned that 32 bit systems (Mainly Linux and Windows via Samba) will not be able to access all the data areas of a 2TB+ zpool even if the zfs quota on a particular share is less then that. Can anyone comment? The bottom line is that with anything new there is cause for concern. Especially if it hasn't been tested within our organization. But the convenience/functionality factors are way too hard to ignore. Thanks, Jeff This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: How much do we really want zpool remove?
To be fair, you can replace vdevs with same-sized or larger vdevs online. The issue is that you cannot replace with smaller vdevs nor can you eliminate vdevs. In other words, I can migrate data around without downtime, I just can't shrink or eliminate vdevs without send/recv. This is where the philosophical disconnect lies. Everytime we descend into this rathole, we stir up more confusion :-( We did just this to move off RAID-5 LUNs that were the vdevs for a pool, to RAID-10 LUNs. Worked very well, and as Richard said was done all on-line. Doesn't really address the shrinking issue though. :-) Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
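A minimal sketch of the on-line vdev migration described above, moving a pool from old RAID-5 LUNs to new RAID-10 LUNs one vdev at a time (pool and device names are placeholders):

    zpool replace tank c2t0d0 c5t0d0    # resilvers the new LUN in place of the old one while the pool stays on-line
    zpool status tank                   # watch resilver progress; repeat for each remaining old LUN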
Re: [zfs-discuss] multihosted ZFS
You could use SAN zoning of the affected LUN's to keep multiple hosts from seeing the zpool. When failover time comes, you change the zoning to make the LUN's visible to the new host, then import. When the old host reboots, it won't find any zpool. Better safe than sorry Or change the LUN masking on the array. Depending on your switch that can be less disruptive, and depending on your storage array might be able to be scripted. Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
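A hedged sketch of the failover sequence, assuming the zoning or LUN-masking change has already made the LUNs visible only to the take-over host (the pool name is a placeholder):

    zpool import               # on the take-over host: confirm the pool's devices are now visible
    zpool import -f sanpool    # -f is only needed if the old host died without exporting;
                               # be certain the old host can no longer write before forcing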
Re: [zfs-discuss] Project Proposal: Availability Suite
Could the replication engine eventually be integrated more tightly with ZFS? That would be slick alternative to send/recv. Best Regards, Jason On 1/26/07, Jim Dunham [EMAIL PROTECTED] wrote: Project Overview: I propose the creation of a project on opensolaris.org, to bring to the community two Solaris host-based data services; namely volume snapshot and volume replication. These two data services exist today as the Sun StorageTek Availability Suite, a Solaris 8, 9 10, unbundled product set, consisting of Instant Image (II) and Network Data Replicator (SNDR). Project Description: Although Availability Suite is typically known as just two data services (II SNDR), there is an underlying Solaris I/O filter driver framework which supports these two data services. This framework provides the means to stack one or more block-based, pseudo device drivers on to any pre-provisioned cb_ops structure, [ http://www.opensolaris.org/os/article/2005-03-31_inside_opensolaris__solaris_driver_programming/#datastructs ], thereby shunting all cb_ops I/O into the top of a developed filter driver, (for driver specific processing), then out the bottom of this filter driver, back into the original cb_ops entry points. Availability Suite was developed to interpose itself on the I/O stack of a block device, providing a filter driver framework with the means to intercept any I/O originating from an upstream file system, database or application layer I/O. This framework provided the means for Availability Suite to support snapshot and remote replication data services for UFS, QFS, VxFS, and more recently the ZFS file system, plus various databases like Oracle, Sybase and PostgreSQL, and also application I/Os. By providing a filter driver at this point in the Solaris I/O stack, it allows for any number of data services to be implemented, without regard to the underlying block storage that they will be configured on. Today, as a snapshot and/or replication solution, the framework allows both the source and destination block storage device to not only differ in physical characteristics (DAS, Fibre Channel, iSCSI, etc.), but also logical characteristics such as in RAID type, volume managed storage (i.e., SVM, VxVM), lofi, zvols, even ram disks. Community Involvement: By providing this filter-driver framework, two working filter drivers (II SNDR), and an extensive collection of supporting software and utilities, it is envisioned that those individuals and companies that adopt OpenSolaris as a viable storage platform, will also utilize and enhance the existing II SNDR data services, plus have offered to them the means in which to develop their own block-based filter driver(s), further enhancing the use and adoption on OpenSolaris. A very timely example that is very applicable to Availability Suite and the OpenSolaris community, is the recent announcement of the Project Proposal: lofi [ compression encryption ] - http://www.opensolaris.org/jive/click.jspamessageID=26841. By leveraging both the Availability Suite and the lofi OpenSolaris projects, it would be highly probable to not only offer compression encryption to lofi devices (as already proposed), but by collectively leveraging these two project, creating the means to support file systems, databases and applications, across all block-based storage devices. 
Since Availability Suite has strong technical ties to storage, please look for email discussion for this project at: storage-discuss at opensolaris dot org A complete set of Availability Suite administration guides can be found at: http://docs.sun.com/app/docs?p=coll%2FAVS4.0 Project Lead: Jim Dunham http://www.opensolaris.org/viewProfile.jspa?username=jdunham Availability Suite - New Solaris Storage Group This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thumper Origins Q
Hi Wee, Having snapshots in the filesystem that work so well is really nice. How are y'all quiescing the DB? Best Regards, J On 1/24/07, Wee Yeh Tan [EMAIL PROTECTED] wrote: On 1/25/07, Bryan Cantrill [EMAIL PROTECTED] wrote: ... after all, what was ZFS going to do with that expensive but useless hardware RAID controller? ... I almost rolled over reading this. This is exactly what I went through when we moved our database server out from Vx** to ZFS. We had a 3510 and were thinking how best to configure the RAID. In the end, we ripped out the controller board and used the 3510 as a JBOD directly attached to the server. My DBA was so happy with this setup (especially with the snapshot capability) he is asking for another such setup. -- Just me, Wire ... ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] need advice: ZFS config ideas for X4500 Thumper?
Hi Neal, We've been getting pretty good performance out of RAID-Z2 with 3x 6-disk RAID-Z2 stripes. More stripes mean better performance all around...particularly on random reads. But as a file-server that's probably not a concern. With RAID-Z2 it seems to me 2 hot-spares is very sufficient, but I'll defer to others with more knowledge. Best Regards, Jason On 1/23/07, Neal Pollack [EMAIL PROTECTED] wrote: Hi: (Warning, new zfs user question) I am setting up an X4500 for our small engineering site file server. It's mostly for builds, images, doc archives, certain workspace archives, misc data. I'd like a trade off between space and safety of data. I have not set up a large ZFS system before, and have only played with simple raidz2 with 7 disks. After reading http://blogs.sun.com/relling/entry/raid_recommendations_space_vs_mttdl; I am leaning toward a RAID-Z2 config with spares, for approx 15 terabytes, but I do not yet understand the nomenclature and exact config details. For example, the graph/chart shows that 7+2 RAID-Z2 with spares would be a good balance in capacity and data safety, but I do not know what to do with that number, how it maps to an actual setup? Does that type of config also provide a balance between performance and data safety? Can someone provide an actual example of how the config should look? If I save two disks for the boot, how do the other 46 disks get configured between spares and zfs groups? Thanks, Neal ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
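A sketch of the layout mentioned above, three 6-disk raidz2 groups striped into one pool plus two hot spares (device names are placeholders and would ideally be spread across controllers):

    zpool create data \
      raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 \
      raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
      raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 \
      spare c3t0d0 c3t1d0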
Re: [zfs-discuss] Re: Re: Re: Re: External drive enclosures + Sun
I believe the SmartArray is an LSI like the Dell PERC isn't it? Best Regards, Jason On 1/23/07, Robert Suh [EMAIL PROTECTED] wrote: People trying to hack together systems might want to look at the HP DL320s http://h10010.www1.hp.com/wwpc/us/en/ss/WF05a/15351-241434-241475-241475 -f79-3232017.html 12 drive bays, Intel Woodcrest, SAS (and SATA) controller. If you snoop around, you might be able to find drive carriers on eBay or elsewhere (*cough* search HP drive sleds HP drive carriers) $3k for the chassis. A mini thumper. Though I'm not sure if Solaris supports the Smart Array controller. Rob -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of mike Sent: Monday, January 22, 2007 1:17 PM To: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] Re: Re: Re: Re: External drive enclosures + Sun I'm dying here - does anyone know when or even if they will support these? I had this whole setup planned out but it requires eSATA + port multipliers. I want to use ZFS, but currently cannot in that fashion. I'd still have to buy some [more expensive, noisier, bulky internal drive] solution for ZFS. Unless anyone has other ideas. I'm looking to run a 5-10 drive system (with easy ability to expand) in my home office; not in a datacenter. Even opening up to iSCSI seems to not get me much - there aren't any SOHO type NAS enclosures that act as iSCSI targets. There are however handfuls of eSATA based 4, 5, and 10 drive enclosures perfect for this... but all require the port multiplier support. On 1/22/07, Frank Cusack [EMAIL PROTECTED] wrote: Unfortunately, Solaris does not support SATA port multipliers (yet) so I think you're pretty limited in how many esata drives you can connect. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] need advice: ZFS config ideas for X4500 Thumper?
Hi Peter, Perhaps I'm a bit dense, but I've been befuddled by the x+y notation myself. Is it X stripes consisting of Y disks? Best Regards, Jason On 1/23/07, Peter Tribble [EMAIL PROTECTED] wrote: On 1/23/07, Neal Pollack [EMAIL PROTECTED] wrote: Hi: (Warning, new zfs user question) I am setting up an X4500 for our small engineering site file server. It's mostly for builds, images, doc archives, certain workspace archives, misc data. ... Can someone provide an actual example of how the config should look? If I save two disks for the boot, how do the other 46 disks get configured between spares and zfs groups? What I ended up with was working with 8+2 raidz2 vdevs. It could have been 4+2, but 8+2 gives you more space, and that was more important than performance. (The performance of the 8+2 is easily adequate for our needs.) And with 46 drives to play with I can have 4 lots of that. At the moment I have 6 hot-spares (I may take some of those out later, but at the moment I don't need them). So the config looks like: zpool create images \ raidz2 c{0,1,4,6,7}t0d0 c{1,4,5,6,7}t1d0 \ raidz2 c{0,4,5,6,7}t2d0 c{0,1,5,6,7}t3d0 \ raidz2 c{0,1,4,6,7}t4d0 c{0,1,4,6,7}t5d0 \ raidz2 c{0,1,4,5,7}t6d0 c{0,1,4,5,6}t7d0 \ spare c0t1d0 c1t2d0 c4t3d0 c5t5d0 c6t6d0 c7t7d0 this spreads everything across all the controllers, and with no more than 2 disks on each controller I could survive the rather unlikely event of a controller failure (unless it's the controller with the boot drives...). -- -Peter Tribble http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] need advice: ZFS config ideas for X4500 Thumper?
Hi Peter, Ah! That clears it up for me. Thank you. Best Regards, Jason On 1/23/07, Peter Tribble [EMAIL PROTECTED] wrote: On 1/23/07, Jason J. W. Williams [EMAIL PROTECTED] wrote: Hi Peter, Perhaps I'm a bit dense, but I've been befuddled by the x+y notation myself. Is it X stripes consisting of Y disks? Sorry. Took a short cut on that bit. It's x data disks + y parity. So in the case of raidz1, y=1; in the case of raidz2, y=2. And ideally x should be a power of 2. (So 8+2 is a raidz2 stripe of 10 disks in total.) I've always used this notation, but now I think about it I'm not sure how universal it is. -- -Peter Tribble http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Thumper Origins Q
Hi All, This is a bit off-topic...but since the Thumper is the poster child for ZFS I hope its not too off-topic. What are the actual origins of the Thumper? I've heard varying stories in word and print. It appears that the Thumper was the original server Bechtolsheim designed at Kealia as a massive video server. However, when we were first told about it a year ago through Sun contacts Thumper was described as a part of a scalabe iSCSI storage system, where Thumpers would be connected to a head (which looked a lot like a pair of X4200s) via iSCSI that would then present the storage over iSCSI and NFS. Recently, other sources mentioned they were told about the same time that Thumper was part of the Honeycomb project. So I was curious if anyone had any insights into the history/origins of the Thumper...or just wanted to throw more rumors on the fire. ;-) Thanks in advance for your indulgence. Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Synchronous Mount?
Hi Prashanth, My company did a lot of LVM+XFS vs. SVM+UFS testing in addition to ZFS. Overall, LVM's overhead is abysmal. We witnessed performance hits of 50%+. SVM only reduced performance by about 15%. ZFS was similar, though a tad higher. Also, my understanding is you can't write to a ZFS snapshot...unless you clone it. Perhaps, someone who knows more than I can clarify. Best Regards, Jason On 1/23/07, Prashanth Radhakrishnan [EMAIL PROTECTED] wrote: Is there someway to synchronously mount a ZFS filesystem? '-o sync' does not appear to be honoured. No there isn't. Why do you think it is necessary? Specifically, I was trying to compare ZFS snapshots with LVM snapshots on Linux. One of the tests does writes to an ext3FS (that's on top of an LVM snapshot) mounted synchronously, in order to measure the real Copy-on-write overhead. So, I was wondering if I could do the same with ZFS. Seems not. Thanks. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
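A minimal sketch of the snapshot versus clone distinction mentioned above (dataset names are placeholders):

    zfs snapshot pool/data@before              # read-only, point-in-time view
    zfs clone pool/data@before pool/scratch    # a writable filesystem backed by that snapshot
    zfs promote pool/scratch                   # optional: makes the clone the owner of the origin snapshot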
Re: [zfs-discuss] Thumper Origins Q
Wow. That's an incredibly cool story. Thank you for sharing it! Does the Thumper today pretty much resemble what you saw then? Best Regards, Jason On 1/23/07, Bryan Cantrill [EMAIL PROTECTED] wrote: This is a bit off-topic...but since the Thumper is the poster child for ZFS I hope its not too off-topic. What are the actual origins of the Thumper? I've heard varying stories in word and print. It appears that the Thumper was the original server Bechtolsheim designed at Kealia as a massive video server. That's correct -- it was originally called the StreamStor. Speaking personally, I first learned about it in the meeting with Andy that I described here: http://blogs.sun.com/bmc/entry/man_myth_legend I think it might be true that this was the first that anyone in Solaris had heard of it. Certainly, it was the first time that Andy had ever heard of ZFS. It was a very high bandwidth conversation, at any rate. ;) After the meeting, I returned post-haste to Menlo Park, where I excitedly described the box to Jeff Bonwick, Bill Moore and Bart Smaalders. Bill said something like I gotta see this thing and sometime later (perhaps the next week?) Bill, Bart and I went down to visit Andy. Andy gave us a much more detailed tour, with Bill asking all sorts of technical questions about the hardware (many of which were something like how did you get a supplier to build that for you?!). After the tour, Andy took the three of us to lunch, and it was one of those moments that I won't forget: Bart, Bill, Andy and I sitting in the late afternoon Palo Alto sun, with us very excited about his hardware, and Andy very excited about our software. Everyone realized that these two projects -- born independently -- were made for each other, that together they would change the market. It was one of those rare moments that reminds you why you got into this line of work -- and I feel lucky to have shared in it. - Bryan -- Bryan Cantrill, Solaris Kernel Development. http://blogs.sun.com/bmc ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Synchronous Mount?
Hi Prashanth, This was about a year ago. I believe I ran bonnie++ and IOzone tests. Tried also to simulate an OLTP load. The 15-20% overhead for ZFS was vs. UFS on a raw disk...UFS on SVM was almost exactly 15% lower performance than raw UFS. UFS and XFS on raw disk were pretty similar in terms of performance, until you got into small files...then XFS bogged down really badly. None of this was testing with snapshots, so I'm not sure of the effect there. I can attest we're running ZFS right now in production on a Thumper serving two MySQL instances, under an 80/20 write/read load. We use ZFS snapshots as our primary backup mechanism (flush/lock the tables, flush the logs, snap, release the locks). At the moment we have 60 ZFS snapshots across 4 filesystems (one FS per zpool). Our primary database zpool has 26 of those snapshots, and the primary DB log zpool has another 26 snapshots. Overall, we haven't noticed any performance degradation in our database serving performance. I don't have hard benchmark numbers for you on this, but anecdotally it works very well. There have been some folks complaining here of snapshot numbers in the 200+ range causing performance problems on a single FS. We don't plan to have more than about 40 snapshots on an FS right now. Hope this is somewhat helpful. Its been a long time (2+ years) since I've used Ext3 on a Linux system, so I couldn't give you a comparative benchmark. Good luck! :-) Best Regards, Jason On 1/23/07, Prashanth Radhakrishnan [EMAIL PROTECTED] wrote: Hi Jason, My company did a lot of LVM+XFS vs. SVM+UFS testing in addition to ZFS. Overall, LVM's overhead is abysmal. We witnessed performance hits of 50%+. SVM only reduced performance by about 15%. ZFS was similar, though a tad higher. Yes, LVM snapshots' overhead is high. But I've seen that as you start increasing the chunksize, they get better (though, with higher space usage). So, you saw performance reductions as much as 15% with ZFS clones/snapshots. I'm curious to know what tests and ZFS config (# of snapshots/clones) you ran on. I ran bonnie++ and din't notice any perceptible drops in the numbers. Though my config had only upto 3 clones and 3 snapshots for each of them. Also, my understanding is you can't write to a ZFS snapshot...unless you clone it. Perhaps, someone who knows more than I can clarify. Right. I wanted to check if creating snapshots affected the performance of the origin FS/clone. Thanks, Prashanth On 1/23/07, Prashanth Radhakrishnan [EMAIL PROTECTED] wrote: Is there someway to synchronously mount a ZFS filesystem? '-o sync' does not appear to be honoured. No there isn't. Why do you think it is necessary? Specifically, I was trying to compare ZFS snapshots with LVM snapshots on Linux. One of the tests does writes to an ext3FS (that's on top of an LVM snapshot) mounted synchronously, in order to measure the real Copy-on-write overhead. So, I was wondering if I could do the same with ZFS. Seems not. Thanks. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
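A rough sketch of the lock/snapshot/unlock sequence described above (dataset names and snapshot labels are placeholders, not our actual scripts):

    # 1. In a mysql session that stays connected (the lock is dropped if the session disconnects):
    #        FLUSH TABLES WITH READ LOCK; FLUSH LOGS;
    # 2. From a shell while that session remains open:
    zfs snapshot dbpool/data@nightly-20070123
    zfs snapshot dbpool/logs@nightly-20070123
    # 3. Back in the same mysql session:
    #        UNLOCK TABLES;

The snapshots themselves are nearly instantaneous, so the tables only stay locked for a moment.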
Re: [zfs-discuss] Re: External drive enclosures + Sun Server for massstorage
Hi Frank, I'm sure Richard will check it out. He's a very good guy and not trying to jerk you around. I'm sure the hostility isn't warranted. :-) Best Regards, Jason On 1/22/07, Frank Cusack [EMAIL PROTECTED] wrote: On January 22, 2007 10:03:14 AM -0800 Richard Elling [EMAIL PROTECTED] wrote: Toby Thain wrote: To be clear: the X2100 drives are neither hotswap nor hotplug under Solaris. Replacing a failed drive requires a reboot. I do not believe this is true, though I don't have one to test. Well if you won't accept multiple technically adept people's word on it, I highly suggest you get one to test instead of speculating. If this were true, then we would have had to rewrite the disk drivers to not allow us to open a device more than once, even if we also closed the device. I can't imagine anyone allowing such code to be written. Obviously you have not rewritten the disk drivers to do this, so this is the wrong line of reasoning. However, I don't believe this is the context of the issue. I believe that this release note deals with the use of NVRAID (NVidia's MCP RAID controller) which does not have a systems management interface under Solaris. The solution is to not use NVRAID for Solaris. Rather, use the proven techniques that we've been using for decades to manage hot plugging drives. No, the release note is not about NVRAID. In short, the release note is confusing, so ignore it. Use x2100 disks as hot pluggable like you've always used hot plug disks in Solaris. Again, NO these drives are not hot pluggable and the release note is accurate. PLEASE get a system to test. Or take our word for it. -frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: External drive enclosures + Sun Server for mass
Hi David, Depending on the I/O you're doing the X4100/X4200 are much better suited because of the dual HyperTransport buses. As a storage box with GigE outputs you've got a lot more I/O capacity with two HT buses than one. That plus the X4100 is just a more solid box. The X2100 M2 while a vast improvement over the X2100 in terms of reliability and features, is still an OEM'd whitebox. We use the X2100 M2s for application servers, but for anything that needs solid reliability or I/O we go Galaxy. Best Regards, Jason On 1/22/07, David J. Orman [EMAIL PROTECTED] wrote: Not to be picky, but the X2100 and X2200 series are NOT designed/targeted for disk serving (they don't even have redundant power supplies). They're compute-boxes. The X4100/X4200 are what you are looking for to get a flexible box more oriented towards disk i/o and expansion. I don't see those as being any better suited to external discs other than: #1 - They have the capacity for redundant PSUs, which is irrelevant to my needs. #2 - They only have PCI Express slots, and I can't find any good external SATA interface cards on PCI Express I can't wrap my head around the idea that I should buy a lot more than I need, which still doesn't serve my purposes. The 4 disks in an x4100 still aren't enough, and the machine is a fair amount more costly. I just need mirrored boot drives, and an external disk array. That said (if you're set on an X2200 M2), you are probably better off getting a PCI-E SCSI controller, and then attaching it to an external SCSI-SATA JBOD. There are plenty of external JBODs out there which use Ultra320/Ultra160 as a host interface and SATA as a drive interface. Sun will sell you a supported SCSI controller with the X2200 M2 (the Sun StorageTek PCI-E Dual Channel Ultra320 SCSI HBA). SCSI is far better for a host attachment mechanism than eSATA if you plan on doing more than a couple of drives, which it sounds like you are. While the SCSI HBA is going to cost quite a bit more than an eSATA HBA, the external JBODs run about the same, and the total difference is going to be $300 or so across the whole setup (which will cost you $5000 or more fully populated). So the cost to use SCSI vs eSATA as the host- attach is a rounding error. I understand your comments in some ways, in others I do not. It sounds like we're moving backwards in time. Exactly why is SCSI better than SAS/SATA for external devices? From my experience (with other OSs/hardware platforms) the opposite is true. A nice SAS/SATA controller with external ports (especially those that allow multiple SAS/SATA drives via one cable - whichever tech you use) works wonderfully for me, and I get a nice thin/clean cable which makes cable management much more enjoyable in higher density situations. I also don't agree with the logic just spend a mere $300 extra to use older technology! $300 may not be much to large business, but things like this nickle and dime small business owners. There's a lot of things I'd prefer to spend $300 on than an expensive SCSI HBA which offers no advantages over a SAS counterpart, in fact offers disadvantages instead. Your input is of course highly valued, and it's quite possible I'm missing an important piece to the puzzle somewhere here, but I am not convinced this is the ideal solution - simply a stick with the old stuff, it's easier solution, which I am very much against. 
Thanks, David This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: External drive enclosures + Sun Server for massstorage
Hi Guys, The original X2100 was a pile of doggie doo-doo. All of our problems with it go back to the atrocious quality of the nForce 4 Pro chipset. The NICs in particular are just crap. The M2s are better, but the MCP55 chipset has not resolved all of its flakiness issues. That being said Sun designed that case with hot-plug bays, if Solaris isn't going to support it, then those shouldn't be there in my opinion. Best Regards, Jason On 1/22/07, Frank Cusack [EMAIL PROTECTED] wrote: In short, the release note is confusing, so ignore it. Use x2100 disks as hot pluggable like you've always used hot plug disks in Solaris. Again, NO these drives are not hot pluggable and the release note is accurate. PLEASE get a system to test. Or take our word for it. hmm I think I may have just figured out the problem here. YES the x2100 is that bad. I too found it quite hard to believe that Sun would sell this without hot plug drives. It seems like a step backwards. (and of course I don't mean that the x2100 is awful, it's a great hardware and very well priced ... now if only hot plug worked!) My main issue is that the x2100 is advertised as hot plug working. You have to dig pretty deep -- deeper than would be expected of a typical buyer -- to find that Solaris does not support it. -frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: External drive enclosures + Sun Server for mass
Hi David, Glad to help! I don't want to bad-mouth the X2100 M2s that much, because they have been solid. I believe the M2s are made/designed just for Sun by Quanta Computer (http://www.quanta.com.tw/e_default.htm) whereas the mobos in the original X2100 was Tyan Tiger with some slight modifications. That all being said, the problem is that Nvidia chipset. The MCP55 in the X2100 M2 is an alright chipset, the nForce 4 Pro just had bugs. Best Regards, Jason On 1/22/07, David J. Orman [EMAIL PROTECTED] wrote: Hi David, Depending on the I/O you're doing the X4100/X4200 are much better suited because of the dual HyperTransport buses. As a storage box with GigE outputs you've got a lot more I/O capacity with two HT buses than one. That plus the X4100 is just a more solid box. That much makes sense, thanks for clearing that up. The X2100 M2 while a vast improvement over the X2100 in terms of reliability and features, is still an OEM'd whitebox. We use the X2100 M2s for application servers, but for anything that needs solid reliability or I/O we go Galaxy. Ahh. That explains a lot. Thank you once again! Sounds like the X2* is the red-headed stepchild of Sun's product line. They should slap disclaimers up on the product information pages so we know better than to purchase into something that doesn't fully function. Still unclear on the SAS/SATA solutions, but hopefully that'll progress further now in the thread. Cheers, David This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Understanding ::memstat in terms of the ARC
Hello all, I have a question. Below are two ::memstat outputs about 5 days apart. The interesting thing is the anonymous memory shows 2GB, though the two major hogs of that memory (two MySQL instances) claim to be consuming about 6.2GB (checked via pmap). Also, it seems like the ARC keeps creeping the kernel memory over the 4GB limit I set for the ARC (zfs_arc_max). What I was also curious about is whether ZFS affects the cachelist line, or if that is just for UFS. Thank you in advance! Best Regards, Jason

01/17/2007 02:28:50 GMT 2007

Page Summary            Pages        MB   %Tot
Kernel                1485925      5804    36%
Anon                   855812      3343    21%
Exec and libs            7438        29     0%
Page cache               3863        15     0%
Free (cachelist)       185235       723     4%
Free (freelist)       1629288      6364    39%
Total                 4167561     16279
Physical              4078747     15932

01/22/2007 01:17:32 GMT 2007

Page Summary            Pages        MB   %Tot
Kernel                1534184      5992    37%
Anon                   538054      2101    13%
Exec and libs            7497        29     0%
Page cache              18550        72     0%
Free (cachelist)      1384165      5406    33%
Free (freelist)        685111      2676    16%
Total                 4167561     16279
Physical              4078747     15932

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
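For reference, the 4GB ARC cap mentioned above is the sort of limit set in /etc/system (the value below is just the 4GB example; a reboot is required for it to take effect):

    * /etc/system entry limiting the ZFS ARC to 4 GB (0x100000000 bytes)
    set zfs:zfs_arc_max = 0x100000000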
Re: [zfs-discuss] External drive enclosures + Sun Server for mass storage
Hi Shannon, The markup is still pretty high on a per-drive basis. That being said, $1-2/GB is darn low for the capacity in a server. Plus, you're also paying for having enough HyperTransport I/O to feed the PCI-E I/O. Does anyone know what problems they had with the 250GB version of the Thumper that caused them to pull it? Best Regards, Jason On 1/20/07, Shannon Roddy [EMAIL PROTECTED] wrote: Frank Cusack wrote: thumper (x4500) seems pretty reasonable ($/GB). -frank I am always amazed that people consider thumper to be reasonable in price. 450% or more markup per drive from street price in July 2006 numbers doesn't seem reasonable to me, even after subtracting the cost of the system. I like the x4500, I wish I had one. But, I can't pay what Sun wants for it. So, instead, I am stuck buying lower end Sun systems and buying third party SCSI/SATA JBODs. I like Sun. I like their products, but I can't understand their storage pricing most of the time. -Shannon ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] External drive enclosures + Sun Server for mass storage
Hi David, I don't know if your company qualifies as a startup under Sun's regs but you can get an X4500/Thumper for $24,000 under this program: http://www.sun.com/emrkt/startupessentials/ Best Regards, Jason On 1/19/07, David J. Orman [EMAIL PROTECTED] wrote: Hi, I'm looking at Sun's 1U x64 server line, and at most they support two drives. This is fine for the root OS install, but obviously not sufficient for many users. Specifically, I am looking at the: http://www.sun.com/servers/x64/x2200/ X2200M2. It only has Riser card assembly with two internal 64-bit, 8-lane, low-profile, half length PCI-Express slots for expansion. What I'm looking for is a SAS/SATA card that would allow me to add an external SATA enclosure (or some such device) to add storage. The supported list on the HCL is pretty slim, and I see no PCI-E stuff. A card that supports SAS would be *ideal*, but I can settle for normal SATA too. So, anybody have any good suggestions for these two things: #1 - SAS/SATA PCI-E card that would work with the Sun X2200M2. #2 - Rack-mountable external enclosure for SAS/SATA drives, supporting hot swap of drives. Basically, I'm trying to get around using Sun's extremely expensive storage solutions while waiting on them to release something reasonable now that ZFS exists. Cheers, David This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: What SATA controllers are people using for ZFS?
Hi Frank, Sun doesn't support the X2100 SATA controller on Solaris 10? That's just bizarre. -J On 1/18/07, Frank Cusack [EMAIL PROTECTED] wrote: THANK YOU Naveen, Al Hopper, others, for sinking yourselves into the shit world of PC hardware and [in]compatibility and coming up with well qualified white box solutions for S10. I strongly prefer to buy Sun kit, but I am done waiting for Sun to support the SATA controller on the x2100. -frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Heavy writes freezing system
Hi Anantha, I was curious why segregating at the FS level would provide adequate I/O isolation? Since all FS are on the same pool, I assumed flogging a FS would flog the pool and negatively affect all the other FS on that pool? Best Regards, Jason On 1/17/07, Anantha N. Srirama [EMAIL PROTECTED] wrote: You're probably hitting the same wall/bug that I came across; ZFS in all versions up to and including Sol10U3 generates excessive I/O when it encounters 'fsync' or if any of the files were opened with 'O_DSYNC' option. I do believe Oracle (or any DB for that matter) opens the file with O_DSYNC option. During normal times it does result in excessive I/O but is probably well under your system capacity (it was in our case.) But when you are doing backups or clones (Oracle clones by using RMAN or copying of db files?) you are going to flood the I/O sub-system and that's when the whole ZFS excessive I/O starts to put a hurt on the DB performance. Here are a few suggestions that can give you interim relief: - Segregate your I/O at filesystem level; the bug is at the filesystem level not ZFS pool level. By this I mean ensure the online redo logs are in a ZFS FS that nobody else uses, same for control files. As long as the writes to control and online redo logs are met your system will be happy. - Ensure that your clone and RMAN (if you're going to disk) write to a separate ZFS FS that contains no production files. - If the above two items don't give you relief, then relocate the online redo log and control files to a UFS filesystem. No need to downgrade the entire ZFS to something else. - Consider Oracle ASM (DB version permitting); it works very well. Why deal with VxFS? Feel free to drop me a line, I've over 17 years of Oracle DB experience and love to troubleshoot problems like this. I've another vested interest; we're considering ZFS for widespread use in our environment and any experience is good for us. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
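(A minimal sketch of the segregation Anantha describes, assuming a pool named tank and Oracle-style usage; the dataset names are purely illustrative.)
    zfs create tank/oradata      # datafiles
    zfs create tank/oralog       # online redo logs and control files only
    zfs create tank/orabackup    # RMAN / clone target, no production files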
Re: Re[2]: [zfs-discuss] Re: Heavy writes freezing system
Hi Robert, I see. So it really doesn't get around the idea of putting DB files and logs on separate spindles? Best Regards, Jason On 1/17/07, Robert Milkowski [EMAIL PROTECTED] wrote: Hello Jason, Wednesday, January 17, 2007, 11:24:50 PM, you wrote: JJWW Hi Anantha, JJWW I was curious why segregating at the FS level would provide adequate JJWW I/O isolation? Since all FS are on the same pool, I assumed flogging a JJWW FS would flog the pool and negatively affect all the other FS on that JJWW pool? Because of the bug which forces all outstanding writes in a file system to commit to storage in case of one fsync to one file. Now when you separate data to different file systems the bug will affect only data in that file system which could greatly reduce impact on performance if it's done right. -- Best regards, Robertmailto:[EMAIL PROTECTED] http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Eliminating double path with ZFS's volume manager
Hi Philip, I'm not an expert, so I'm afraid I don't know what to tell you. I'd call Apple Support and see what they say. As horrid as they are at Enterprise support they may be the best ones to clarify if multipathing is available without Xsan. Best Regards, Jason On 1/16/07, Philip Mötteli [EMAIL PROTECTED] wrote: Looks like its got a half-way decent multipath design: http://docs.info.apple.com/article.html?path=Xsan/1.1/ en/c3xs12.html Great, but that is with Xsan. If I don't exchange our Hitachi with an Xsan, I don't have this 'cvadmin'. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Eliminating double path with ZFS's volume manager
Hi Torrey, I think it does if you buy Xsan. It's still a separate product, isn't it? I thought it's more like QFS + MPxIO. Best Regards, Jason On 1/15/07, Torrey McMahon [EMAIL PROTECTED] wrote: Robert Milkowski wrote: 2. I believe it's definitely possible to just correct your config under Mac OS without any need to use other fs or volume manager, however going to zfs could be a good idea anyway That implies that MacOS has some sort of native SCSI multipathing like Solaris Mpxio. Does such a beast exist? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS direct IO
Hi Roch, You mentioned improved ZFS performance in the latest Nevada build (60 right now?)...I was curious if one would notice much of a performance improvement between 54 and 60? Also, does anyone think the zfs_arc_max tunable-support will be made available as a patch to S10U3, or would that wait until U4? Thank you in advance! Best Regards, Jason On 1/15/07, Roch - PAE [EMAIL PROTECTED] wrote: Jonathan Edwards writes: On Jan 5, 2007, at 11:10, Anton B. Rang wrote: DIRECT IO is a set of performance optimisations to circumvent shortcomings of a given filesystem. Direct I/O as generally understood (i.e. not UFS-specific) is an optimization which allows data to be transferred directly between user data buffers and disk, without a memory-to-memory copy. This isn't related to a particular file system. true .. directio(3) is generally used in the context of *any* given filesystem to advise it that an application buffer to system buffer copy may get in the way or add additional overhead (particularly if the filesystem buffer is doing additional copies.) You can also look at it as a way of reducing more layers of indirection particularly if I want the application overhead to be higher than the subsystem overhead. Programmatically .. less is more. Direct IO makes good sense when the target disk sectors are set a priori. But in the context of ZFS, would you rather have 10 direct disk I/Os or 10 bcopies and 2 I/O (say that was possible). As for read, I can see that when the load is cached in the disk array and we're running 100% CPU, the extra copy might be noticeable. Is this the situation that longs for DIO ? What % of a system is spent in the copy ? What is the added latency that comes from the copy ? Is DIO the best way to reduce the CPU cost of ZFS ? The current Nevada code base has quite nice performance characteristics (and certainly quirks); there are many further efficiency gains to be reaped from ZFS. I just don't see DIO on top of that list for now. Or at least someone needs to spell out what is ZFS/DIO and how much better it is expected to be (back of the envelope calculation accepted). Reading RAID-Z subblocks on filesystems that have checksum disabled might be interesting. That would avoid some disk seeks.To served the subblocks directly or not is a separate matter; it's a small deal compared to the feature itself. How about disabling the DB checksum (it can't fix the block anyway) and do mirroring ? -r ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Eliminating double path with ZFS's volume manager
Hi Torrey, Looks like its got a half-way decent multipath design: http://docs.info.apple.com/article.html?path=Xsan/1.1/en/c3xs12.html Whether or not it works is another story I suppose. ;-) Best Regards, Jason On 1/15/07, Torrey McMahon [EMAIL PROTECTED] wrote: Got me. However, transport multipathing - Like Mpxio, DLM, VxDMP, etc. - is usually separated from the filesystem layers. Jason J. W. Williams wrote: Hi Torrey, I think it does if you buy Xsan. Its still a separate product isn't it? Thought its more like QFS + MPXIO. Best Regards, Jason On 1/15/07, Torrey McMahon [EMAIL PROTECTED] wrote: Robert Milkowski wrote: 2. I belive it's definitely possible to just correct your config under Mac OS without any need to use other fs or volume manager, however going to zfs could be a good idea anyway That implies that MacOS has some sort of native SCSI multipathing like Solaris Mpxio. Does such a beast exist? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: Re[2]: [zfs-discuss] Replacing a drive in a raidz2 group
Hi Robert, Will build 54 offline the drive? Best Regards, Jason On 1/13/07, Robert Milkowski [EMAIL PROTECTED] wrote: Hello Jason, Saturday, January 13, 2007, 12:06:57 AM, you wrote: JJWW Hi Robert, JJWW We've experienced luck with flaky SATA drives in our STK array by JJWW unseating and reseating the drive to cause a reset of the firmware. It JJWW may be a bad drive, or the firmware may just have hit a bug. Hope its JJWW the latter! :-D JJWW I'd be interested why the hot-spare didn't kick in. I thought the FMA JJWW integration would detect read errors. FMA did but ZFS+FMA we're not there in U3. -- Best regards, Robertmailto:[EMAIL PROTECTED] http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Replacing a drive in a raidz2 group
Hi Robert, We've experienced luck with flaky SATA drives in our STK array by unseating and reseating the drive to cause a reset of the firmware. It may be a bad drive, or the firmware may just have hit a bug. Hope it's the latter! :-D I'd be interested why the hot-spare didn't kick in. I thought the FMA integration would detect read errors. Best Regards, Jason On 1/12/07, Robert Milkowski [EMAIL PROTECTED] wrote: Hello zfs-discuss, One of our drives in the x4500 is failing - it periodically disconnects/connects. ZFS only reports READ errors and no hot spare automatically kicked in, which is expected currently. So I issued zpool replace with a hot-spare drive. Now it takes forever and it seems like ZFS is rebuilding the drive using checksums - wouldn't it be much faster if it just copied data from the drive being replaced (like attaching a mirror)? -- Best regards, Robert mailto:[EMAIL PROTECTED] http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Limit ZFS Memory Utilization
Hi Mark, That does help tremendously. How does ZFS decide which zio cache to use? I apologize if this has already been addressed somewhere. Best Regards, Jason On 1/11/07, Mark Maybee [EMAIL PROTECTED] wrote: Al Hopper wrote: On Wed, 10 Jan 2007, Mark Maybee wrote: Jason J. W. Williams wrote: Hi Robert, Thank you! Holy mackerel! That's a lot of memory. With that type of a calculation my 4GB arc_max setting is still in the danger zone on a Thumper. I wonder if any of the ZFS developers could shed some light on the calculation? In a worst-case scenario, Robert's calculations are accurate to a certain degree: If you have 1GB of dnode_phys data in your arc cache (that would be about 1,200,000 files referenced), then this will result in another 3GB of related data held in memory: vnodes/znodes/ dnodes/etc. This related data is the in-core data associated with an accessed file. It's not quite true that this data is not evictable, it *is* evictable, but the space is returned from these kmem caches only after the arc has cleared its blocks and triggered the free of the related data structures (and even then, the kernel will need to do a kmem_reap to reclaim the memory from the caches). The fragmentation that Robert mentions is an issue because, if we don't free everything, the kmem_reap may not be able to reclaim all the memory from these caches, as they are allocated in slabs. We are in the process of trying to improve this situation. snip . Understood (and many Thanks). In the meantime, is there a rule-of-thumb that you could share that would allow mere humans (like me) to calculate the best values of zfs:zfs_arc_max and ncsize, given that the machine has n GB of RAM and is used in the following broad workload scenarios: a) a busy NFS server b) a general multiuser development server c) a database server d) an Apache/Tomcat/FTP server e) a single user Gnome desktop running U3 with home dirs on a ZFS filesystem It would seem, from reading between the lines of previous emails, particularly the ones you've (Mark M) written, that there is a rule of thumb that would apply given a standard or modified ncsize tunable?? I'm primarily interested in a calculation that would allow settings that would reduce the possibility of the machine descending into swap hell. Ideally, there would be no need for any tunables; ZFS would always do the right thing. This is our grail. In the meantime, I can give some recommendations, but there is no rule of thumb that is going to work in all circumstances. ncsize: As I have mentioned previously, there are overheads associated with caching vnode data in ZFS. While the physical on-disk data for a znode is only 512 bytes, the related in-core cost is significantly higher. Roughly, you can expect that each ZFS vnode held in the DNLC will cost about 3K of kernel memory. So, you need to set ncsize appropriately for how much memory you are willing to devote to it. 500,000 entries is going to cost you 1.5GB of memory. zfs_arc_max: This is the maximum amount of memory you want the ARC to be able to use. Note that the ARC won't necessarily use this much memory: if other applications need memory, the ARC will shrink to accommodate. Although, also note that the ARC *can't* shrink if all of its memory is held. For example, data in the DNLC cannot be evicted from the ARC, so this data must first be evicted from the DNLC before the ARC can free up space (this is why it is dangerous to turn off the ARC's ability to evict vnodes from the DNLC).
Also keep in mind that the ARC size does not account for many in-core data structures used by ZFS (znodes/dnodes/ dbufs/etc). Roughly, for every 1MB of cached file pointers, you can expect another 3MB of memory used outside of the ARC. So, in the example above, where ncsize is 500,000, the ARC is only seeing about 400MB of the 1.5GB consumed. As I have stated previously, we consider this a bug in the current ARC accounting that we will soon fix. This is only an issue in environments where many files are being accessed. If the number of files accessed is relatively low, then the ARC size will be much closer to the actual memory consumed by ZFS. So, in general, you should not really need to tune zfs_arc_max. However, in environments where you have specific applications that consume known quantities of memory (e.g. database), it will likely
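(For anyone wanting to check these numbers on their own box, a rough sketch with purely illustrative values; per the discussion above, budget roughly 3K of kernel memory per ZFS vnode held in the DNLC.)
    # current DNLC size and activity
    echo "ncsize/D" | mdb -k
    kstat -n dnlcstats | egrep 'hits|misses'
    # /etc/system: e.g. 250,000 entries at ~3K each is roughly 0.75GB of kernel memory
    set ncsize = 250000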
Re: [zfs-discuss] Solid State Drives?
Hello all, Just my two cents on the issue. The Thumper is proving to be a terrific database server in all aspects except latency. While the latency is acceptable, being able to add some degree of battery-backed write cache that ZFS could use would be phenomenal. Best Regards, Jason On 1/11/07, Jonathan Edwards [EMAIL PROTECTED] wrote: On Jan 11, 2007, at 15:42, Erik Trimble wrote: On Thu, 2007-01-11 at 10:35 -0800, Richard Elling wrote: The product was called Sun PrestoServ. It was successful for benchmarking and such, but unsuccessful in the market because: + when there is a failure, your data is spread across multiple fault domains + it is not clusterable, which is often a requirement for data centers + it used a battery, so you had to deal with physical battery replacement and all of the associated battery problems + it had yet another device driver, so integration was a pain Google for it and you'll see all sorts of historical perspective. -- richard Yes, I remember (and used) PrestoServ. Back in the SPARCcenter 1000 days. :-) as do i .. (keep your batteries charged!! and don't panic!) And yes, local caching makes the system non-clusterable. not necessarily .. i like the javaspaces approach to coherency, and companies like gigaspaces have done some pretty impressive things with in memory SBA databases and distributed grid architectures .. intelligent coherency design with a good distribution balance for local, remote, and redundant can go a long way in improving your cache numbers. However, all the other issues are common to a typical HW raid controller, and many people use host-based HW controllers just fine and don't find their problems to be excessive. True given most workloads, but in general it's the coherency issues that drastically affect throughput on shared controllers particularly as you add and distribute the same luns or data across different control processors. Add too many and your cache hit rates might fall in the toilet. .je ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Limit ZFS Memory Utilization
Hi Guys, After reading through the discussion on this regarding ZFS memory fragmentation on snv_53 (and forward) and going through our ::kmastat...looks like ZFS is sucking down about 544 MB of RAM in the various caches. About 360MB of that is in the zio_buf_65536 cache. Next most notable is 55MB in zio_buf_32768, and 36MB in zio_buf_16384. I don't think that's too bad but worth keeping track of. At this point our kernel memory growth seems to have slowed, with it hovering around 5GB, and the anon column is mostly what's growing now (as expected...MySQL). Most of the problem in the discussion thread on this seemed to be related to a lot of DNLC entries due to the workload of a file server. How would this affect a database server with operations in only a couple very large files? Thank you in advance. Best Regards, Jason On 1/10/07, Jason J. W. Williams [EMAIL PROTECTED] wrote: Sanjeev Robert, Thanks guys. We put that in place last night and it seems to be doing a lot better job of consuming less RAM. We set it to 4GB and each of our 2 MySQL instances on the box to a max of 4GB. So hopefully a slush of 4GB on the Thumper is enough. I would be interested in what the other ZFS modules' memory behaviors are. I'll take a perusal through the archives. In general it seems to me that a max cap for ZFS whether set through a series of individual tunables or a single root tunable would be very helpful. Best Regards, Jason On 1/10/07, Sanjeev Bagewadi [EMAIL PROTECTED] wrote: Jason, Robert is right... The point is ARC is the caching module of ZFS and the majority of the memory is consumed through ARC. Hence by limiting the c_max of ARC we are limiting the amount ARC consumes. However, other modules of ZFS would consume more but that may not be as significant as ARC. Experts, please correct me if I am wrong here. Thanks and regards, Sanjeev. Robert Milkowski wrote: Hello Jason, Tuesday, January 9, 2007, 10:28:12 PM, you wrote: JJWW Hi Sanjeev, JJWW Thank you! I was not able to find anything as useful on the subject as JJWW that! We are running build 54 on an X4500, would I be correct in my JJWW reading of that article that if I put set zfs:zfs_arc_max = JJWW 0x100000000 # 4GB in my /etc/system, ZFS will consume no more than JJWW 4GB? Thank you in advance. That's the idea; however, it's not working that way now - under some circumstances ZFS could still consume much more memory - see other posts lately here. -- Solaris Revenue Products Engineering, India Engineering Center, Sun Microsystems India Pvt Ltd. Tel:x27521 +91 80 669 27521 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
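(For anyone searching the archives later, the /etc/system line being discussed looks like the following; the 4GB cap is only an example, and as noted in the thread the tunable is not fully honoured on every build.)
    * cap the ZFS ARC at 4GB (0x100000000 bytes); requires a reboot
    set zfs:zfs_arc_max = 0x100000000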
Re: [zfs-discuss] Re: Adding disk to a RAID-Z?
Hi Kyle, I think there was a lot of talk about this behavior on the RAIDZ2 vs. RAID-10 thread. My understanding from that discussion was that every write stripes the block across all disks on a RAIDZ/Z2 group, thereby making writing the group no faster than writing to a single disk. However reads are much faster, as all the disks are activated in the read process. The default config on the X4500 we received recently was RAIDZ-groups of 6 disks (across the 6 controllers) striped together into one large zpool. Best Regards, Jason On 1/10/07, Kyle McDonald [EMAIL PROTECTED] wrote: Robert Milkowski wrote: Hello Kyle, Wednesday, January 10, 2007, 5:33:12 PM, you wrote: KM Remember though that it's been mathematically figured that the KM disadvantages to RaidZ start to show up after 9 or 10 drives. (That's Well, nothing like this was proved and definitely not mathematically. It's just common-sense advice - for many users keeping raidz groups below 9 disks should give good enough performance. However if someone creates a raidz group of 48 disks he/she probably also expects performance and in general raid-z wouldn't offer it. It's very possible I misstated something. :) I thought I had read though, something like over 9 or so disks would mean that each FS block would be written to less than a single disk block on each disk? Or maybe it was that waiting to read from all drives for files less than a FS block would suffer? Ahhh... I can't remember what the effects were thought to be. I thought there was some theoretical math involved though. I do remember people advising against it though. Not just on a performance basis, but also on an increased risk of failure basis. I think it was just seen as a good balancing point. -Kyle ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: Re[2]: [zfs-discuss] Re: Adding disk to a RAID-Z?
Hi Robert, I read the following section from http://blogs.sun.com/roch/entry/when_to_and_not_to as indicating random writes to a RAID-Z had the performance of a single disk regardless of the group size: Effectively, as a first approximation, an N-disk RAID-Z group will behave as a single device in terms of delivered random input IOPS. Thus a 10-disk group of devices each capable of 200-IOPS, will globally act as a 200-IOPS capable RAID-Z group. Best Regards, Jason On 1/10/07, Robert Milkowski [EMAIL PROTECTED] wrote: Hello Jason, Wednesday, January 10, 2007, 10:54:29 PM, you wrote: JJWW Hi Kyle, JJWW I think there was a lot of talk about this behavior on the RAIDZ2 vs. JJWW RAID-10 thread. My understanding from that discussion was that every JJWW write stripes the block across all disks on a RAIDZ/Z2 group, thereby JJWW making writing the group no faster than writing to a single disk. JJWW However reads are much faster, as all the disks are activated in the JJWW read process. The opposite actually. Because of COW, writing (modifying as well) will give you up to N-1 disks' performance for raid-z1 and N-2 disks' performance for raid-z2. However reading can be slow in the case of many small random reads, as to read each fs block you've got to wait for all data disks in a group. JJWW The default config on the X4500 we received recently was RAIDZ-groups JJWW of 6 disks (across the 6 controllers) striped together into one large JJWW zpool. However the problem with that config is lack of a hot spare. Of course it depends what you want (and there was no hot spare support in U2, which is the OS installed at the factory so far). -- Best regards, Robertmailto:[EMAIL PROTECTED] http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: Re[4]: [zfs-discuss] Limit ZFS Memory Utilization
Hi Robert, We've got the default ncsize. I didn't see any advantage to increasing it outside of NFS serving...which this server is not. For speed the X4500 is proving to be a killer MySQL platform. Between the blazing fast procs and the sheer number of spindles, its performance is tremendous. If MySQL Cluster had full disk-based support, scale-out with X4500s a la Greenplum would be a terrific solution. At this point, the ZFS memory gobbling is the main roadblock to being a good database platform. Regarding the paging activity, we too saw tremendous paging of up to 24% of the X4500's CPU being used for that with the default arc_max. After changing it to 4GB, we haven't seen anything much over 5-10%. Best Regards, Jason On 1/10/07, Robert Milkowski [EMAIL PROTECTED] wrote: Hello Jason, Thursday, January 11, 2007, 12:36:46 AM, you wrote: JJWW Hi Robert, JJWW Thank you! Holy mackerel! That's a lot of memory. With that type of a JJWW calculation my 4GB arc_max setting is still in the danger zone on a JJWW Thumper. I wonder if any of the ZFS developers could shed some light JJWW on the calculation? JJWW That kind of memory loss makes ZFS almost unusable for a database system. If you leave ncsize at the default value then I believe it won't consume that much memory. JJWW I agree that a page cache similar to UFS would be much better. Linux JJWW works similarly to free pages, and it has been effective enough in the JJWW past. Though I'm equally unhappy about Linux's tendency to grab every JJWW bit of free RAM available for filesystem caching, and then cause JJWW massive memory thrashing as it frees it for applications. A page cache won't be better - just better memory control for ZFS caches is strongly desired. Unfortunately from time to time ZFS makes servers page enormously :( -- Best regards, Robertmailto:[EMAIL PROTECTED] http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Limit ZFS Memory Utilization
Sanjeev, Could you point me in the right direction as to how to convert the following GCC compile flags to Studio 11 compile flags? Any help is greatly appreciated. We're trying to recompile MySQL to give a stacktrace and core file to track down exactly why it's crashing...hopefully it will illuminate if memory truly is the issue. Thank you very much in advance! -felide-constructors -fno-exceptions -fno-rtti Best Regards, Jason On 1/7/07, Sanjeev Bagewadi [EMAIL PROTECTED] wrote: Jason, There is no documented way of limiting the memory consumption. The ARC section of ZFS tries to adapt to the memory pressure of the system. However, in your case probably it is not quick enough I guess. One way of limiting the memory consumption would be to limit arc.c_max. This (arc.c_max) is set to 3/4 of the memory available (or 1GB less than memory available). This is done when ZFS is loaded (arc_init()). You should be able to change the value of arc.c_max through mdb and set it to the value you want. Exercise caution while setting it. Make sure you don't have active zpools during this operation. Thanks and regards, Sanjeev. Jason J. W. Williams wrote: Hello, Is there a way to set a max memory utilization for ZFS? We're trying to debug an issue where ZFS is sucking all the RAM out of the box, and it's crashing MySQL as a result we think. Will ZFS reduce its cache size if it feels memory pressure? Any help is greatly appreciated. Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Solaris Revenue Products Engineering, India Engineering Center, Sun Microsystems India Pvt Ltd. Tel:x27521 +91 80 669 27521 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
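(In case it helps anyone searching the archives, a rough, untested mapping for Sun Studio 11's CC: -features=no%except and -features=no%rtti are the usual analogues of -fno-exceptions and -fno-rtti, while -felide-constructors has no direct Studio switch since the optimizer performs copy elision on its own. The configure invocation is illustrative only.)
    CXX=CC
    CXXFLAGS="-g -xO2 -features=no%except -features=no%rtti"
    export CXX CXXFLAGS
    ./configure --with-debug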
Re: [zfs-discuss] Limit ZFS Memory Utilization
We're not using the Enterprise release, but we are working with them. It looks like MySQL is crashing due to lack of memory. -J On 1/8/07, Toby Thain [EMAIL PROTECTED] wrote: On 8-Jan-07, at 11:54 AM, Jason J. W. Williams wrote: ...We're trying to recompile MySQL to give a stacktrace and core file to track down exactly why its crashing...hopefully it will illuminate if memory truly is the issue. If you're using the Enterprise release, can't you get MySQL's assistance with this? --Toby ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Limit ZFS Memory Utilization
Hello, Is there a way to set a max memory utilization for ZFS? We're trying to debug an issue where ZFS is sucking all the RAM out of the box, and it's crashing MySQL as a result, we think. Will ZFS reduce its cache size if it feels memory pressure? Any help is greatly appreciated. Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solid State Drives?
Could this ability (separate ZIL device) coupled with an SSD give something like a Thumper the write latency benefit of a battery-backed write cache? Best Regards, Jason On 1/5/07, Neil Perrin [EMAIL PROTECTED] wrote: Robert Milkowski wrote On 01/05/07 11:45,: Hello Neil, Friday, January 5, 2007, 4:36:05 PM, you wrote: NP I'm currently working on putting the ZFS intent log on separate devices NP which could include separate disks and nvram/solid state devices. NP This would help any application using fsync/O_DSYNC - in particular NP DB and NFS. From prototyping considerable performance improvements have NP been seen. Can you share any results from prototype testing? I'd prefer not to just yet as I don't want to raise expectations unduly. When testing I was using a simple local benchmark, whereas I'd prefer to run something more official such as TPC. I'm also missing a few required features in the prototype which may affect performance. Hopefully I can provide some results soon, but even those will be unofficial. Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
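(Follow-up note for readers of the archive: in the later builds where this work integrated, a dedicated log device is specified roughly as below; device names are illustrative and the syntax is not available in the builds discussed in this thread.)
    # create a pool with a dedicated slog device
    zpool create tank mirror c1t0d0 c1t1d0 log c2t0d0
    # or add one to an existing pool
    zpool add tank log c3t0d0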
[zfs-discuss] RAIDZ2 vs. ZFS RAID-10
Hello All, I was curious if anyone had run a benchmark on the IOPS performance of RAIDZ2 vs RAID-10? I'm getting ready to run one on a Thumper and was curious what others had seen. Thank you in advance. Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] RAIDZ2 vs. ZFS RAID-10
Hi Richard, Hmmthat's interesting. I wonder if its worth benchmarking RAIDZ2 if those are the results you're getting. The testing is to see the performance gain we might get for MySQL moving off the FLX210 to an active/passive pair of X4500s. Was hoping with that many SATA disks RAIDZ2 would provide a nice safety net. Best Regards, Jason On 1/3/07, Richard Elling [EMAIL PROTECTED] wrote: Jason J. W. Williams wrote: Hello All, I was curious if anyone had run a benchmark on the IOPS performance of RAIDZ2 vs RAID-10? I'm getting ready to run one on a Thumper and was curious what others had seen. Thank you in advance. I've been using a simple model for small, random reads. In that model, the performance of a raidz[12] set will be approximately equal to a single disk. For example, if you have 6 disks, then the performance for the 6-disk raidz2 set will be normalized to 1, and the performance of a 3-way dynamic stripe of 2-way mirrors will have a normalized performance of 6. I'd be very interested to see if your results concur. The models for writes or large reads are much more complicated because of the numerous caches of varying size and policy throughout the system. The small, random read workload will be largely unaffected by caches and you should see the performance as predicted by the disk rpm and seek time. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] RAIDZ2 vs. ZFS RAID-10
Just got an interesting benchmark. I made two zpools: RAID-10 (9x 2-way RAID-1 mirrors: 18 disks total) RAID-Z2 (3x 6-way RAIDZ2 group: 18 disks total) Copying 38.4GB of data from the RAID-Z2 to the RAID-10 took 307 seconds. Deleted the data from the RAID-Z2. Then copying the 38.4GB of data from the RAID-10 to the RAID-Z2 took 258 seconds. Would have expected the RAID-10 to write data more quickly. Its interesting to me that the RAID-10 pool registered the 38.4GB of data as 38.4GB, whereas the RAID-Z2 registered it as 56.4. Best Regards, Jason On 1/3/07, Jason J. W. Williams [EMAIL PROTECTED] wrote: Hi Richard, Hmmthat's interesting. I wonder if its worth benchmarking RAIDZ2 if those are the results you're getting. The testing is to see the performance gain we might get for MySQL moving off the FLX210 to an active/passive pair of X4500s. Was hoping with that many SATA disks RAIDZ2 would provide a nice safety net. Best Regards, Jason On 1/3/07, Richard Elling [EMAIL PROTECTED] wrote: Jason J. W. Williams wrote: Hello All, I was curious if anyone had run a benchmark on the IOPS performance of RAIDZ2 vs RAID-10? I'm getting ready to run one on a Thumper and was curious what others had seen. Thank you in advance. I've been using a simple model for small, random reads. In that model, the performance of a raidz[12] set will be approximately equal to a single disk. For example, if you have 6 disks, then the performance for the 6-disk raidz2 set will be normalized to 1, and the performance of a 3-way dynamic stripe of 2-way mirrors will have a normalized performance of 6. I'd be very interested to see if your results concur. The models for writes or large reads are much more complicated because of the numerous caches of varying size and policy throughout the system. The small, random read workload will be largely unaffected by caches and you should see the performance as predicted by the disk rpm and seek time. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
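(For anyone wanting to reproduce the comparison, the two layouts look roughly like this; the c#t#d# names are placeholders for 18 of the Thumper's disks, not the actual device paths used.)
    # ZFS RAID-10: nine 2-way mirrors
    zpool create r10pool \
        mirror c0t0d0 c1t0d0 mirror c0t1d0 c1t1d0 mirror c0t2d0 c1t2d0 \
        mirror c0t3d0 c1t3d0 mirror c0t4d0 c1t4d0 mirror c0t5d0 c1t5d0 \
        mirror c0t6d0 c1t6d0 mirror c0t7d0 c1t7d0 mirror c0t8d0 c1t8d0
    # RAID-Z2: three 6-disk raidz2 groups striped together
    zpool create rz2pool \
        raidz2 c0t0d0 c1t0d0 c2t0d0 c3t0d0 c4t0d0 c5t0d0 \
        raidz2 c0t1d0 c1t1d0 c2t1d0 c3t1d0 c4t1d0 c5t1d0 \
        raidz2 c0t2d0 c1t2d0 c2t2d0 c3t2d0 c4t2d0 c5t2d0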
Re: Re[2]: [zfs-discuss] RAIDZ2 vs. ZFS RAID-10
Hi Robert, Our X4500 configuration is multiple 6-way (across controllers) RAID-Z2 groups striped together. Currently, 3 RZ2 groups. I'm about to test write performance against ZFS RAID-10. I'm curious why RAID-Z2 performance should be good? I assumed it was an analog to RAID-6. In our recent experience RAID-5 due to the 2 reads, a XOR calc and a write op per write instruction is usually much slower than RAID-10 (two write ops). Any advice is greatly appreciated. Best Regards, Jason On 1/3/07, Robert Milkowski [EMAIL PROTECTED] wrote: Hello Jason, Wednesday, January 3, 2007, 11:11:31 PM, you wrote: JJWW Hi Richard, JJWW Hmmthat's interesting. I wonder if its worth benchmarking RAIDZ2 JJWW if those are the results you're getting. The testing is to see the JJWW performance gain we might get for MySQL moving off the FLX210 to an JJWW active/passive pair of X4500s. Was hoping with that many SATA disks JJWW RAIDZ2 would provide a nice safety net. Well, you weren't thinking about one big raidz2 group? To get more performance you can create one pool with many smaller raidz2 groups - that way your worst case read performance should increase approximately N times where N is number of raidz-2 groups. However keep in mind that write performance should be really good with raidz2. -- Best regards, Robertmailto:[EMAIL PROTECTED] http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: Re[2]: [zfs-discuss] RAIDZ2 vs. ZFS RAID-10
Hi Robert, That makes sense. Thank you. :-) Also, it was zpool I was looking at. zfs always showed the correct size. -J On 1/3/07, Robert Milkowski [EMAIL PROTECTED] wrote: Hello Jason, Wednesday, January 3, 2007, 11:40:38 PM, you wrote: JJWW Just got an interesting benchmark. I made two zpools: JJWW RAID-10 (9x 2-way RAID-1 mirrors: 18 disks total) JJWW RAID-Z2 (3x 6-way RAIDZ2 group: 18 disks total) JJWW Copying 38.4GB of data from the RAID-Z2 to the RAID-10 took 307 JJWW seconds. Deleted the data from the RAID-Z2. Then copying the 38.4GB of JJWW data from the RAID-10 to the RAID-Z2 took 258 seconds. Would have JJWW expected the RAID-10 to write data more quickly. Actually with 18 disks in raid-10 in theory you get write performance equal to stripe of 9 disks. With 18 disks in 3 raidz2 groups of 6 disks each you should expect something like (6-2)*3 = 12 disk, so equal to 12 disks in stripe. JJWW Its interesting to me that the RAID-10 pool registered the 38.4GB of JJWW data as 38.4GB, whereas the RAID-Z2 registered it as 56.4. If you checked with zpool - then it's ok - it reports disk usage also wit parity overhead. If zfs list showed you that numbers then either you're using old snv bits or s10U2 as it was corrected some time ago (in U3). -- Best regards, Robertmailto:[EMAIL PROTECTED] http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re[2]: RAIDZ2 vs. ZFS RAID-10
Hi Anton, Thank you for the information. That is exactly our scenario. We're 70% write heavy, and given the nature of the workload, our typical writes are 10-20K. Again the information is much appreciated. Best Regards, Jason On 1/3/07, Anton B. Rang [EMAIL PROTECTED] wrote: In our recent experience RAID-5 due to the 2 reads, a XOR calc and a write op per write instruction is usually much slower than RAID-10 (two write ops). Any advice is greatly appreciated. RAIDZ and RAIDZ2 does not suffer from this malady (the RAID5 write hole). 1. This isn't the write hole. 2. RAIDZ and RAIDZ2 suffer from read-modify-write overhead when updating a file in writes of less than 128K, but not when writing a new file or issuing large writes. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
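(Not suggested in the thread itself, but a common mitigation worth noting here as a sketch: matching the dataset recordsize to the typical write size before loading the data reduces the read-modify-write overhead Anton describes. The dataset name and the 16K value are illustrative.)
    zfs set recordsize=16K tank/mysql-data
    zfs get recordsize tank/mysql-data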
Re: [zfs-discuss] ZFS over NFS extra slow?
Hi Brad, I believe benr experienced the same/similar issue here: http://www.opensolaris.org/jive/message.jspa?messageID=77347 If it is the same, I believe its a known ZFS/NFS interaction bug, and has to do with small file creation. Best Regards, Jason On 1/2/07, Brad Plecs [EMAIL PROTECTED] wrote: I had a user report extreme slowness on a ZFS filesystem mounted over NFS over the weekend. After some extensive testing, the extreme slowness appears to only occur when a ZFS filesystem is mounted over NFS. One example is doing a 'gtar xzvf php-5.2.0.tar.gz'... over NFS onto a ZFS filesystem. this takes: real5m12.423s user0m0.936s sys 0m4.760s Locally on the server (to the same ZFS filesystem) takes: real0m4.415s user0m1.884s sys 0m3.395s The same job over NFS to a UFS filesystem takes real1m22.725s user0m0.901s sys 0m4.479s Same job locally on server to same UFS filesystem: real0m10.150s user0m2.121s sys 0m4.953s This is easily reproducible even with single large files, but the multiple small files seems to illustrate some awful sync latency between each file. Any idea why ZFS over NFS is so bad? I saw the threads that talk about an fsync penalty, but they don't seem relevant since the local ZFS performance is quite good. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: Re[2]: [zfs-discuss] Re: Difference between ZFS and UFS with one LUN froma SAN
Hi Robert, MPxIO had correctly moved the paths. More than one path to controller A was OK, and one path to controller A for each LUN was active when controller B was rebooted. I have a hunch that the array was at fault, because it also rebooted a Windows server with LUNs only on Controller A. In the case of the Windows server, Engenio's RDAC was handling multipathing. Overall, not a big deal, I just wouldn't trust the array to do a hitless commanded controller failover or firmware upgrade. -J On 12/22/06, Robert Milkowski [EMAIL PROTECTED] wrote: Hello Jason, Friday, December 22, 2006, 5:55:38 PM, you wrote: JJWW Just for what its worth, when we rebooted a controller in our array JJWW (we pre-moved all the LUNs to the other controller), despite using JJWW MPXIO ZFS kernel panicked. Verified that all the LUNs were on the JJWW correct controller when this occurred. Its not clear why ZFS thought JJWW it lost a LUN but it did. We have done cable pulling using ZFS/MPXIO JJWW before and that works very well. It may well be array-related in our JJWW case, but I hate anyone to have a false sense of security. Did you first check (with format for example) if LUNs were really accessible? If MPxIO worked ok and at least one path is ok then ZFS won't panic. -- Best regards, Robertmailto:[EMAIL PROTECTED] http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Difference between ZFS and UFS with one LUN froma SAN
Just for what its worth, when we rebooted a controller in our array (we pre-moved all the LUNs to the other controller), despite using MPXIO ZFS kernel panicked. Verified that all the LUNs were on the correct controller when this occurred. Its not clear why ZFS thought it lost a LUN but it did. We have done cable pulling using ZFS/MPXIO before and that works very well. It may well be array-related in our case, but I hate anyone to have a false sense of security. -J On 12/22/06, Tim Cook [EMAIL PROTECTED] wrote: This may not be the answer you're looking for, but I don't know if it's something you've thought of. If you're pulling a LUN from an expensive array, with multiple HBA's in the system, why not run mpxio? If you ARE running mpxio, there shouldn't be an issue with a path dropping. I have the setup above in my test lab and pull cables all the time and have yet to see a zfs kernel panic. Is this something you've considered? I haven't seen the bug in question, but I definitely have not run into it when running mpxio. --Tim -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Shawn Joy Sent: Friday, December 22, 2006 7:35 AM To: zfs-discuss@opensolaris.org Subject: [zfs-discuss] Re: Difference between ZFS and UFS with one LUN froma SAN OK, But lets get back to the original question. Does ZFS provide you with less features than UFS does on one LUN from a SAN (i.e is it less stable). ZFS on the contrary checks every block it reads and is able to find the mirror or reconstruct the data in a raidz config. Therefore ZFS uses only valid data and is able to repair the data blocks automatically. This is not possible in a traditional filesystem/volume manager configuration. The above is fine. If I have two LUNs. But my original question was if I only have one LUN. What about kernel panics from ZFS if for instance access to one controller goes away for a few seconds or minutes. Normally UFS would just sit there and warn I have lost access to the controller. Then when the controller returns, after a short period, the warnings go away and the LUN continues to operate. The admin can then research further into why the controller went away. With ZFS, the above will panic the system and possibly cause other coruption on other LUNs due to this panic? I believe this was discussed in other threads? I also believe there is a bug filed against this? If so when should we expect this bug to be fixed? My understanding of ZFS is that it functions better in an environment where we have JBODs attached to the hosts. This way ZFS takes care of all of the redundancy? But what about SAN enviroments where customers have spend big money to invest in storage. I know of one instance where a customer has a growing need for more storage space. There environemt uses many inodes. Due to the UFS inode limitation, when creating LUNs over one TB, they would have to quadrulpe the about of storage usesd in there SAN in order to hold all of the files. A possible solution to this inode issue would be ZFS. However they have experienced kernel panics in there environment when a controller dropped of line. Any body have a solution to this? 
Shawn This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Difference between ZFS and UFS with one LUN froma SAN
Hi Tim, One switch environment, two ports going to the host, 4 ports going to the storage. Switch is a Brocade SilkWorm 3850 and the HBA is a dual-port QLA2342. Solaris rev is S10 update 3. Array is a StorageTek FLX210 (Engenio 2884) The LUNs had moved to the other controller and MPXIO had shown the paths change as a result, so it was a bit bizarre. Rebooting the other controller shouldn't have done anything, but it did. Could have been the array. -J On 12/22/06, Tim Cook [EMAIL PROTECTED] wrote: Always good to hear others experiences J. Maybe I'll try firing up the Nexan today and downing a controller to see how that affects it vs. downing a switch port/pulling cable. My first intuition is time-out values. A cable pull will register differently than a blatant time-out depending on where it occurs. IE: Pulling the cable from the back of the server will register instantly, vs. the storage timing out 3 switches away. I'm sure you're aware of that, but just an FYI for others following the thread less familiar with SAN technology. To get a little more background: What kind of an array is it? How do you have the controllers setup? Active/active? Active/passive? In other words do you have array side failover occurring as well or is it in *dummy mode*? Do you have multiple physical paths? IE: each controller port and each server port hitting different switches? What HBA's are you using? What switches? What version of snv are you running, and which driver? Yey for slow Friday's before x-mas, I have a bit of time to play in the lab today. --Tim -Original Message- From: Jason J. W. Williams [mailto:[EMAIL PROTECTED] Sent: Friday, December 22, 2006 10:56 AM To: Tim Cook Cc: Shawn Joy; zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] Re: Difference between ZFS and UFS with one LUN froma SAN Just for what its worth, when we rebooted a controller in our array (we pre-moved all the LUNs to the other controller), despite using MPXIO ZFS kernel panicked. Verified that all the LUNs were on the correct controller when this occurred. Its not clear why ZFS thought it lost a LUN but it did. We have done cable pulling using ZFS/MPXIO before and that works very well. It may well be array-related in our case, but I hate anyone to have a false sense of security. -J On 12/22/06, Tim Cook [EMAIL PROTECTED] wrote: This may not be the answer you're looking for, but I don't know if it's something you've thought of. If you're pulling a LUN from an expensive array, with multiple HBA's in the system, why not run mpxio? If you ARE running mpxio, there shouldn't be an issue with a path dropping. I have the setup above in my test lab and pull cables all the time and have yet to see a zfs kernel panic. Is this something you've considered? I haven't seen the bug in question, but I definitely have not run into it when running mpxio. --Tim -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Shawn Joy Sent: Friday, December 22, 2006 7:35 AM To: zfs-discuss@opensolaris.org Subject: [zfs-discuss] Re: Difference between ZFS and UFS with one LUN froma SAN OK, But lets get back to the original question. Does ZFS provide you with less features than UFS does on one LUN from a SAN (i.e is it less stable). ZFS on the contrary checks every block it reads and is able to find the mirror or reconstruct the data in a raidz config. Therefore ZFS uses only valid data and is able to repair the data blocks automatically. This is not possible in a traditional filesystem/volume manager configuration. 
The above is fine. If I have two LUNs. But my original question was if I only have one LUN. What about kernel panics from ZFS if for instance access to one controller goes away for a few seconds or minutes. Normally UFS would just sit there and warn I have lost access to the controller. Then when the controller returns, after a short period, the warnings go away and the LUN continues to operate. The admin can then research further into why the controller went away. With ZFS, the above will panic the system and possibly cause other coruption on other LUNs due to this panic? I believe this was discussed in other threads? I also believe there is a bug filed against this? If so when should we expect this bug to be fixed? My understanding of ZFS is that it functions better in an environment where we have JBODs attached to the hosts. This way ZFS takes care of all of the redundancy? But what about SAN enviroments where customers have spend big money to invest in storage. I know of one instance where a customer has a growing need for more storage space. There environemt uses many inodes. Due to the UFS inode limitation, when creating LUNs over one TB, they would have to quadrulpe the about of storage usesd in there SAN in order to hold all of the files. A possible solution to this inode issue would be ZFS. However they have experienced
Re: [zfs-discuss] What SATA controllers are people using for ZFS?
Hi Naveen, I believe the newer LSI cards work pretty well with Solaris. Best Regards, Jason On 12/20/06, Naveen Nalam [EMAIL PROTECTED] wrote: Hi, This may not be the right place to post, but hoping someone here is running a reliably working system with 12 drives using ZFS that can tell me what hardware they are using. I have on order with my server vendor a pair of 12-drive servers that I want to use with ZFS for our company file stores. We're trying to use Supermicro PDSME motherboards, and each has two Supermicro MV8 sata cards. Solaris 10U3 he's found doesn't work on these systems. And I just read a post today (and an older post) on this group about how the Marvell based cards lock up. I can't afford lockups since this is very critical and expensive data that is being stored. My goal is a single cpu board that works with Solaris, and somehow get 12-drives plus 2 system boot drives plugged into it. I don't see any suitable sata cards on the Sun HCL. Are there any 4-port PCIe cards that people know reliably work? The Adaptec 1430SA looks nice, but no idea if it works. I could potentially get two 4-port PCIe cards, a 2 port PCI sata card (for boot), and 4-port motherboard - for 14 drives total. And cough up the extra cash for a supported dual-cpu motherboard (though i'm only using one cpu). any advice greatly appreciated.. Thanks! Naveen This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS and SE 3511
Hi Toby, My understanding on the subject of SATA firmware reliability vs. FC/SCSI is that it's mostly related to SATA firmware being a lot younger. The FC/SCSI firmware that's out there has been debugged for 10 years or so, so it has a lot fewer hiccoughs. Pillar Data Systems told us once that they found most of their failed SATA disks were just fine when examined, so their policy is to issue a RESET to the drive when a SATA error is detected, then retry the write/read and keep trucking. If they continue to get SATA errors, then they'll fail the drive. Looking at the latest Engenio SATA products, I believe they do the same thing. It's probably unfair to expect defect rates out of SATA firmware equivalent to firmware that's been around for a long time...particularly with the price pressures on SATA. SAS may suffer the same issue, though they seem to have 1,000,000-hour MTBF ratings like their traditional FC/SCSI counterparts. On a side note, we experienced a path failure to a drive in our SATA Engenio array (older model); simply popping the drive out and back in fixed the issue...haven't had any notifications since. A RESET and RETRY would have been nice behavior to have, since popping and reinserting triggered a rebuild of the drive. Best Regards, Jason On 12/19/06, Toby Thain [EMAIL PROTECTED] wrote: On 19-Dec-06, at 2:42 PM, Jason J. W. Williams wrote: I do see this note in the 3511 documentation: Note - Do not use a Sun StorEdge 3511 SATA array to store single instances of data. It is more suitable for use in configurations where the array has a backup or archival role. My understanding of this particular scare-tactic wording (it's also in the SANnet II OEM version manual almost verbatim) is that it has mostly to do with the relative unreliability of SATA firmware versus SCSI/FC firmware. That's such a sad sentence to have to read. Either prices are unrealistically low, or the revenues aren't being invested properly? --Toby It's possible that the disks are lower-quality SATA disks too, but that was not what was relayed to us when we looked at buying the 3511 from Sun or the DotHill version (SANnet II). Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS in a SAN environment
Not sure. I don't see an advantage to moving off UFS for boot pools. :-) -J On 12/20/06, James C. McPherson [EMAIL PROTECTED] wrote: Jason J. W. Williams wrote: I agree with others here that the kernel panic is undesired behavior. If ZFS would simply offline the zpool and not kernel panic, that would obviate my request for an informational message. It'd be pretty darn obvious what was going on. What about the root/boot pool? James C. McPherson -- Solaris kernel software engineer, system admin and troubleshooter http://www.jmcp.homeunix.com/blog Find me on LinkedIn @ http://www.linkedin.com/in/jamescmcpherson ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS and SE 3511
I do see this note in the 3511 documentation: Note - Do not use a Sun StorEdge 3511 SATA array to store single instances of data. It is more suitable for use in configurations where the array has a backup or archival role. My understanding of this particular scare-tactic wording (it's also in the SANnet II OEM version manual almost verbatim) is that it has mostly to do with the relative unreliability of SATA firmware versus SCSI/FC firmware. It's possible that the disks are lower-quality SATA disks too, but that was not what was relayed to us when we looked at buying the 3511 from Sun or the DotHill version (SANnet II). Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS in a SAN environment
Shouldn't there be a big warning when configuring a pool with no redundancy and/or should that not require a -f flag ? why? what if the redundancy is below the pool .. should we warn that ZFS isn't directly involved in redundancy decisions? Because if the host controller port goes flaky and starts introducing checksum errors at the block level (a lady a few weeks ago reported this) ZFS will kernel panic, and most users won't expect it. It seems to me users should be warned of the real possibility of a kernel panic if they don't implement redundancy at the zpool level. Just my 2 cents. Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
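(For completeness, a minimal sketch of pool-level redundancy over SAN LUNs, with illustrative device names; with a mirrored pool ZFS can repair checksum errors rather than merely detect them.)
    # two LUNs, ideally presented through separate controllers/HBAs
    zpool create tank mirror c2t0d0 c3t0d0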