Re: [zfs-discuss] New ZFS Intent Log (ZIL) device available - Beta program now open!
Thanks for the detailed response - further questions inline... Christopher George wrote:

> Excellent questions!
>
> > I see the PCI card has an external power connector - can you explain
> > how/why that's required, as opposed to using an on card battery or similar.
>
> DDRdrive X1 ZIL functionality is best served with an externally attached UPS; this allows the X1 to perform as a non-volatile storage device without specific user configuration or unique operation. An often overlooked aspect of batteries (irrespective of technology or internal/external) is their limited lifetime and the varying degrees of maintenance and oversight required. For example, a lithium (Li-Ion) battery supply, as used by older NVRAM products and not the X1, does have the minimum required energy density for an internal solution, but it has a fatal flaw for enterprise applications - an ignition mode failure possibility. Google "lithium battery fire". Such an instance, even if rare, would be catastrophic not only to the on-card data but to the host server and so on... Supercapacitors are another alternative which thankfully do not share the ignition mode failure mechanism of Li-Ion, but are hampered mainly by cost, with some longevity concerns which can be addressed. In the end, we selected data integrity, cost, and serviceability as our top three priorities. This led us to the industry standard external lead-acid battery as sold by APC.
>
> Key benefits of the DDRdrive X1 power solution:
>
> 1) Data Integrity - Supports multiple back-to-back power failures. A single DDRdrive X1 uses less than 5W when the host is powered down, so even a small UPS is over-provisioned and, unlike an internal solution, will not normally require a lengthy recharge time prior to the next power incident. Optionally a backup to NAND can be performed to remove the UPS duration as a factor.
>
> 2) Cost Effective / Flexible - The Smart-UPS SC 450VA (280 Watts) is an excellent choice for most installations and retails for approximately $150.00. Flexibility is in regard to UPS selection, as it can be right-sized (duration) for each individual application if needed.
>
> 3) Reliability / Maintenance - UPS front panel LED status for battery replacement and audible alarms when the battery is low or non-operational. Industry standard battery form factor backed by APC, the industry-leading manufacturer of enterprise-class backup solutions.

OK, I take your point about battery fires, however we've been using battery-backed cards (of various types) in servers for a while now, and I think you might have over-emphasized those risks, when compared to the operational complexity of maintaining a separate power circuit for my PCI cards! But then, I haven't actually done the research on battery reliability either. :-)

I'm not sure about others on the list, but I have a dislike of AC power bricks in my racks. Sometimes they're unavoidable, but they're also physically awkward - where do we put them? Using up space on a dedicated shelf? Cable-tied to the rack itself? Hidden under the floor?

Is the state of the power input exposed to software in some way? In other words, can I have a nagios check running on my server that triggers an alert if the power cable accidentally gets pulled out?

> > What happens if the *host* power to the card fails?
>
> Nothing, the DDRdrive X1's data integrity is guaranteed by the attached UPS.

OK, which means that the UPS must be separate to the UPS powering the server then.

> > The 155mb rate for sustained writes is low for DDR ram?
> The DRAM's value add is its extremely low latency (even compared to NAND) and other intrinsic properties such as longevity and reliability. The read/write sequential bandwidth is completely bound by the PCI Express interface.

Any plans on a PCIe multi-lane version then? All my servers are still Gig-E, and I'm not likely to see more than 100MB/sec of NFS traffic, however I'm sure there are plenty of NFS servers on 10G out there that will see quite a bit more than 155MB/sec for moderate amounts of time. I know we can put more than one of these cards in a server, but those slots are often taken up with other things!

I look forward to these being available in Australia :-)

Thanks, Tristan
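To illustrate the kind of check Tristan asks about: the X1 itself does not expose the UPS state to the host, but if the UPS's own monitoring cable (serial/USB) is attached to the server and apcupsd is running there - an assumption, not part of the DDRdrive product - a minimal Nagios-style plugin could look roughly like this:

    #!/bin/sh
    # Hypothetical check: queries apcupsd for the UPS line status.
    STATUS=`/usr/sbin/apcaccess status 2>/dev/null | awk '/^STATUS/ {print $3}'`
    case "$STATUS" in
      ONLINE) echo "UPS OK - on line power";            exit 0 ;;
      ONBATT) echo "UPS CRITICAL - running on battery"; exit 2 ;;
      "")     echo "UPS UNKNOWN - cannot reach apcupsd"; exit 3 ;;
      *)      echo "UPS WARNING - status: $STATUS";     exit 1 ;;
    esac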
Re: [zfs-discuss] New ZFS Intent Log (ZIL) device available - Beta program now open!
> "cg" == Christopher George writes: cg> Nothing, the DDRdrive X1's data integrity is guaranteed by the cg> attached UPS. I've found UPS power is less reliable than unprotected line power where I live, especially when using bargain UPS's like the ones you suggest. I've tracked it for five years, and that's simply the case. When devices have dual power inputs I do plug one into the UPS though. I've also found unplanned powerdowns usually occur during maintenance because of people tripping over cords (networking equipment likes to put A/B power on opposite sides of the chassis. Thanks for that, to those who do it.), dropping things, bumping power strip switches (which should not exist in the first place), provoking crappy devices (ex poweron surges causing overcurrent), mucking around with the batteries, or confusing highly-stupid UPS microcontrollers over their buggy web interfaces (``reset controller''), clumsy buttonpads (a single on/off/test button? are you *CRAZY*? and sometimes I have to _hold the button down_? What next, double-pressing? there's on, there's off, but what about the ``off-but-charging'' state: how's it requested and how's it confirmed? hazily? thanks, assholes.). Your decision to use UPS power is based on the imaginary scenario you walk us through: building loses line power for X minutes, UPS runs out. Obviously I'm familiar with the scenario but honestly I've not run into that one in practice as often as other ones, which is why I call it fantasy. cg> NAND only provides an optional (user configured) cg> backup/restore feature. so, it does not even attempt to query the UPS? How can it live up to the ideally-functioning-UPS protection scheme you describe, then? To do so it needs UPS communication: it'd need to NAND-backup before the battery ran out, so it needs to get advance warning of a low battery from the UPS. It'd also need a way to halt the computer, or at least to take itself offline and propogate the error up the driver stack, if the UPS has not enough charge to complete a NAND backup or of the UPS considers its batteries defective. Personally, I don't care if the card talks to the UPS, because I think realistically if you take the cases when power stops coming out of a UPS and overlap them with the cases when the UPS provided warning before the power stopped coming out, there's not much overlap. Spurious warnings and sudden shutdowns are *more* common over the life of the units I've had than this imaginary graceful powerdown scenario. Finally, data that's stored ``durably'' needs to survive yanked cables. IMHO most people who are certain cables will never be yanked or are willing to take the risk, would be better off just disabling the ZIL rather than using a slog. Then you don't have to worry about pools failing to import from missing slog if you do yank a cable, which is a better tradeoff. NAND storage therefore needs to be self-contained, like a disk drive, to be useful as a slog. The ANS-9010 comes closer to that than this card, though I don't know if it actually delivers, either. pgpEQSwbNLlEe.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] opensolaris-vmware
Haha, Yeah that's tomorrow, I have a test vm I will be testing on. I shall report back! Thank you all! On Wed, Jan 13, 2010 at 8:26 PM, Fajar A. Nugraha wrote: > On Thu, Jan 14, 2010 at 6:40 AM, Gregory Durham > wrote: > > Arnaud, > > The virtual machines coming up as if they were on is the least of my > > worries, my biggest worry is keeping the filesystems of the vms alive > i.e. > > not corrupt. > > As Tim said, The snapshot disk are in the same state they would be in > if you pulled the power plug. > This is also the same thing you got BTW if you use LVM snapshot (on > Linux) or SAN/NAS based snapshots (like NetApp) > > > In the case of exchange, I have exchange itself on a raw lun in physical > > compatibility mode, and I have 2 LUNs mounted with the Server 2008 iSCSI > > initiator for logs and the exchange DB. > > Most modern filesystem and database have journaling that can recover > from power failure scenarios, so they should be able to use the > snapshot and provide consistent, non-corrupt information. > > So the question now is, have you tried restoring from snapshot? > > -- > Fajar > ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New ZFS Intent Log (ZIL) device available - Beta program now open!
Excellent questions!

> I see the PCI card has an external power connector - can you explain
> how/why that's required, as opposed to using an on card battery or
> similar.

DDRdrive X1 ZIL functionality is best served with an externally attached UPS; this allows the X1 to perform as a non-volatile storage device without specific user configuration or unique operation. An often overlooked aspect of batteries (irrespective of technology or internal/external) is their limited lifetime and the varying degrees of maintenance and oversight required. For example, a lithium (Li-Ion) battery supply, as used by older NVRAM products and not the X1, does have the minimum required energy density for an internal solution, but it has a fatal flaw for enterprise applications - an ignition mode failure possibility. Google "lithium battery fire". Such an instance, even if rare, would be catastrophic not only to the on-card data but to the host server and so on... Supercapacitors are another alternative which thankfully do not share the ignition mode failure mechanism of Li-Ion, but are hampered mainly by cost, with some longevity concerns which can be addressed. In the end, we selected data integrity, cost, and serviceability as our top three priorities. This led us to the industry standard external lead-acid battery as sold by APC.

Key benefits of the DDRdrive X1 power solution:

1) Data Integrity - Supports multiple back-to-back power failures. A single DDRdrive X1 uses less than 5W when the host is powered down, so even a small UPS is over-provisioned and, unlike an internal solution, will not normally require a lengthy recharge time prior to the next power incident. Optionally a backup to NAND can be performed to remove the UPS duration as a factor.

2) Cost Effective / Flexible - The Smart-UPS SC 450VA (280 Watts) is an excellent choice for most installations and retails for approximately $150.00. Flexibility is in regard to UPS selection, as it can be right-sized (duration) for each individual application if needed.

3) Reliability / Maintenance - UPS front panel LED status for battery replacement and audible alarms when the battery is low or non-operational. Industry standard battery form factor backed by APC, the industry-leading manufacturer of enterprise-class backup solutions.

> What happens if the *host* power to the card fails?

Nothing, the DDRdrive X1's data integrity is guaranteed by the attached UPS.

> The 155mb rate for sustained writes is low for DDR ram?

The DRAM's value add is its extremely low latency (even compared to NAND) and other intrinsic properties such as longevity and reliability. The read/write sequential bandwidth is completely bound by the PCI Express interface.

> Is this because the backup to NAND is a constant thing, rather than only
> at power fail?

No, the backup to NAND is not continual. All Host IO is directed to DRAM for maximum performance while the NAND only provides an optional (user configured) backup/restore feature.

Christopher George Founder/CTO www.ddrdrive.com
Re: [zfs-discuss] opensolaris-vmware
On Thu, Jan 14, 2010 at 6:40 AM, Gregory Durham wrote:
> Arnaud,
> The virtual machines coming up as if they were on is the least of my
> worries, my biggest worry is keeping the filesystems of the vms alive i.e.
> not corrupt.

As Tim said, the snapshot disks are in the same state they would be in if you pulled the power plug. This is also the same thing you get, BTW, if you use LVM snapshots (on Linux) or SAN/NAS-based snapshots (like NetApp).

> In the case of exchange, I have exchange itself on a raw lun in physical
> compatibility mode, and I have 2 LUNs mounted with the Server 2008 iSCSI
> initiator for logs and the exchange DB.

Most modern filesystems and databases have journaling that can recover from power failure scenarios, so they should be able to use the snapshot and provide consistent, non-corrupt information. So the question now is, have you tried restoring from snapshot?

-- Fajar
Re: [zfs-discuss] opensolaris-vmware
Arnaud,

The virtual machines coming up as if they were on is the least of my worries; my biggest worry is keeping the filesystems of the vms alive, i.e. not corrupt. I have all of my virtual machines set up with raw LUNs in physical compatibility mode. This has increased performance but sadly at the cost of vmware snapshots. Is there anything within the virtual machine itself I can do to keep the filesystem intact?

In the case of exchange, I have exchange itself on a raw lun in physical compatibility mode, and I have 2 LUNs mounted with the Server 2008 iSCSI initiator for logs and the exchange DB. This setup is similar to several other *nix vms I have residing on this SAN, which I am also worrying about. Any other ideas?

Thanks, Greg

On Tue, Jan 12, 2010 at 1:11 AM, Arnaud Brand wrote:
> Your machines won't come up running, they'll start up from scratch (like
> if you had hit the reset button).
>
> If you want your machines to come up you have to make vmware snapshots,
> which capture the state of the running VM (memory, etc..). Typically this is
> automated with solutions like VCB (Vmware consolidated backup), but I've
> just found http://communities.vmware.com/docs/DOC-8760 (not tested though
> since we are running ESX and have bought VCB licenses).
>
> Bear in mind that vmware won't be able to take a consistent snapshot if
> some disks in the VM come from VMDK files while some other disks are raw
> LUNs (or otherwise mounted directly in the VM, I mean out of control from
> esx). You'll have to restart the machine from scratch in this case and have
> a strong potential for discrepancies between VMDK and raw luns.
>
> On the other hand, I understand that you want Exchange2007 logs and db to
> live their lives so that when you « revert to snapshot » you don't lose all
> the mail that was sent/delivered in between.
>
> So this can be a perfectly valid design depending on how you have set it
> up.
>
> I don't think snapshots (be they vmware or zfs) are a good tool for
> failover or redundancy here. Basically, if your storage is not accessible
> from your esxi hosts, your VMs are toasted and you have to restart them from
> scratch.
>
> Please note, I don't know about esxi iscsi retry policy specifics. For
> ESX we use an SVC cluster (2 node FC cluster), so our ESX hosts can always
> access the storage.
>
> You could try to set up an iscsi cluster like this
> http://docs.sun.com/app/docs/doc/820-7821/z4f557a?a=view (look for the
> figure at the bottom). You would obtain a mirrored pool where you could
> place the vmware zvols. Then you could iscsi-share these zvols.
>
> Though I'm not sure if/how OpenHA could/would fail over if one of your nodes
> fails (I always wanted to play with openHA but don't have the time nor the
> hardware at hand to try it).
>
> This setup of course doesn't prevent you from doing vmware snapshots and
> zfs snapshots, you'll just achieve some level of fault-tolerance.
>
> Please note I don't know anything about using NFS with esx/esxi. Maybe
> there are setups that are easier to achieve using NFS and provide the same
> (or a better) level of fault-tolerance.
> > > > Hope this helps, > > Arnaud > > > > *De :* zfs-discuss-boun...@opensolaris.org [mailto: > zfs-discuss-boun...@opensolaris.org] *De la part de* Tim Cook > *Envoyé :* mardi 12 janvier 2010 04:36 > *À :* Greg > *Cc :* zfs-discuss@opensolaris.org > *Objet :* Re: [zfs-discuss] opensolaris-vmware > > > > > > On Mon, Jan 11, 2010 at 6:17 PM, Greg wrote: > > Hello All, > I hope this makes sense, I have two opensolaris machines with a bunch of > hard disks, one acts as a iSCSI SAN, and the other is identical other than > the hard disk configuration. The only thing being served are VMWare esxi raw > disks, which hold either virtual machines or data that the particular > virtual machine uses, I.E. we have exchange 2007 virtualized and through its > iSCSI initiator we are mounting two LUNs one for the database and another > for the Logs, all on different arrays of course. Any how we are then > snapshotting this data across the SAN network to the other box using > snapshot send/recv. In the case the other box fails this box can immediatly > serve all of the iSCSI LUNs. The problem, I don't really know if its a > problem...Is when I snapshot a running vm will it come up alive in esxi or > do I have to accomplish this in a different way. These snapshots will then > be written to tape with bacula. I hope I am posting this in the correct > place. > > Thanks, > Greg > -- > > > What you've got are crash consistent snapshots. The disks are in the same > state they would be in if you pulled the power plug. They may come up just > fine, or they may be in a corrupt state. If you take snapshots frequently > enough, you should have at least one good snapshot. Your other option is > scripting. You can build custom scripts to leverage the VSS providers in > Windows... but it won't be easy. > > Any re
Re: [zfs-discuss] opensolaris-vmware
Tim,

iSCSI was a design decision at the time. Performance was key and I wanted to utilize being able to hand a LUN on the SAN to esxi and use it as a raw disk in physical compatibility mode... however, what this has done is that I can no longer take snapshots on the esxi server and must rely on zfs snapshots. Also, I have multiple *nix virtual machines I need to worry about backing up and making sure that, if all fails, the file systems are consistent...

Thanks, Greg

On Mon, Jan 11, 2010 at 7:36 PM, Tim Cook wrote:
>
> On Mon, Jan 11, 2010 at 6:17 PM, Greg wrote:
>
>> Hello All,
>> I hope this makes sense, I have two opensolaris machines with a bunch of
>> hard disks, one acts as a iSCSI SAN, and the other is identical other than
>> the hard disk configuration. The only thing being served are VMWare esxi raw
>> disks, which hold either virtual machines or data that the particular
>> virtual machine uses, I.E. we have exchange 2007 virtualized and through its
>> iSCSI initiator we are mounting two LUNs one for the database and another
>> for the Logs, all on different arrays of course. Any how we are then
>> snapshotting this data across the SAN network to the other box using
>> snapshot send/recv. In the case the other box fails this box can immediatly
>> serve all of the iSCSI LUNs. The problem, I don't really know if its a
>> problem...Is when I snapshot a running vm will it come up alive in esxi or
>> do I have to accomplish this in a different way. These snapshots will then
>> be written to tape with bacula. I hope I am posting this in the correct
>> place.
>>
>> Thanks,
>> Greg
>> --
>
> What you've got are crash consistent snapshots. The disks are in the same
> state they would be in if you pulled the power plug. They may come up just
> fine, or they may be in a corrupt state. If you take snapshots frequently
> enough, you should have at least one good snapshot. Your other option is
> scripting. You can build custom scripts to leverage the VSS providers in
> Windows... but it won't be easy.
>
> Any reason in particular you're using iSCSI? I've found NFS to be much
> more simple to manage, and performance to be equivalent if not better (in
> large clusters).
>
> --
> --Tim
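For anyone following along, the snapshot send/recv replication Greg describes looks roughly like this (dataset, host, and snapshot names are hypothetical):

    # On the primary box: recursive (atomic) snapshot of the zvols backing
    # the iSCSI LUNs, then a full replication stream to the standby box:
    zfs snapshot -r tank/vmware@2010-01-13
    zfs send -R tank/vmware@2010-01-13 | ssh standby zfs recv -dF backup

    # Later runs send only the blocks changed since the previous snapshot:
    zfs snapshot -r tank/vmware@2010-01-14
    zfs send -R -i @2010-01-13 tank/vmware@2010-01-14 | ssh standby zfs recv -dF backup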
Re: [zfs-discuss] How do separate ZFS filesystems affect performance?
On Wed, Jan 13, 2010 at 08:21:13AM -0600, Gary Mills wrote:
> Yes, I understand that, but do filesystems have separate queues of any
> sort within the ZIL?

I'm not sure. If you can experiment and measure a benefit, understanding the reasons is helpful but secondary. If you can't experiment so easily, you're stuck asking questions, as now, to see whether the effort of experimenting is potentially worthwhile.

Some other things to note (not necessarily arguments for or against):

* you can have multiple slog devices, in case you're creating so much ZIL traffic that ZIL queueing is a real problem, however shared or structured between filesystems.

* separate filesystems can have different properties which might help tuning and experiments (logbias, copies, compress, *cache), as well as the recordsize. Maybe you will find that compress on mailboxes helps, as long as you're not also compressing the db's?

* separate filesystems may have different recovery requirements (snapshot cycles). Note that taking snapshots is ~free, but keeping them and deleting them have costs over time. Perhaps you can save some of these costs if the db's are throwaway/rebuildable.

> If not, would it help to put the database
> filesystems into a separate zpool?

Maybe, if you have the extra devices - but you need to compare with the potential benefit of adding those devices (and their IOPS) to benefit all users of the existing pool. For example, if the databases are a distinctly different enough load, you could compare putting them on a dedicated pool on ssd, vs using those ssd's as additional slog/l2arc. Unless you can make quite categorical separations between the workloads, such that an unbalanced configuration matches an unbalanced workload, you may still be better off with consolidated IO capacity in the one pool.

Note, also, you can only take recursive atomic snapshots within the one pool - this might be important if the db's have to match the mailbox state exactly, for recovery.

-- Dan.
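As a concrete (and hedged) sketch of the per-filesystem tuning and snapshot points above - the dataset names are invented, and whether any of these settings actually helps is exactly what would need measuring:

    # Different properties per filesystem:
    zfs set compression=on tank/mailboxes
    zfs set logbias=throughput tank/db
    zfs set primarycache=metadata tank/db

    # Recursive snapshots are atomic, but only within a single pool:
    zfs snapshot -r tank@nightly-2010-01-13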
Re: [zfs-discuss] ZFS import hangs with over 66000 context switches shown in top
It seems there is more info on this issue here: http://opensolaris.org/jive/thread.jspa?threadID=121568&tstart=0
[zfs-discuss] NexentaStor 2.2.1 Developer Edition Released
Hi All, I'd like to announce the immediate availability of NexentaStor Developer Edition v2.2.1. Changes since v2.2 include many bug fixes. More information:

* This is a major stable release.
* Storage limit increased to 4TB.
* Built-in antivirus capability.
* Consistent snapshots of Oracle and MySQL databases.
* A Citrix StorageLink adapter.
* Asynchronous reverse replication support.
* Per-snapshot probabilistic search engine.
* Remote-access support.
* Japanese language support for the interface.

You can download the CD image at http://www.nexentastor.org/projects/site/wiki/DeveloperEdition

A summary of recent changes is on freshmeat at http://freshmeat.net/projects/nexentastor/

A complete list of projects (14 and growing) is at http://www.nexentastor.org/projects

Nightly images are available at http://ftp.nexentastor.org/nightly/

Regards -- Anil Gulecha Community Lead, NexentaStor.org
Re: [zfs-discuss] New ZFS Intent Log (ZIL) device available - Beta program now open!
That's very interesting tech you've got there... :-) I have a couple of questions, with apologies in advance if I missed them on the website.. I see the PCI card has an external power connector - can you explain how/why that's required, as opposed to using an on card battery or similar. What happens if the power to the card fails? The 155mb rate for sustained writes is low for DDR ram? Is this because the backup to NAND is a constant thing, rather than only at power fail? Regards Tristan Christopher George wrote: The DDRdrive X1 OpenSolaris device driver is now complete, please join us in our first-ever ZFS Intent Log (ZIL) beta test program. A select number of X1s are available for loan, preferred candidates would have a validation background and/or a true passion for torturing new hardware/driver :-) We are singularly focused on the ZIL device market, so a test environment bound by synchronous writes is required. The beta program will provide extensive technical support and a unique opportunity to have direct interaction with the product designers. Would you like to take part in the advancement of Open Storage and explore the far-reaching potential of ZFS based Hybrid Storage Pools? If so, please send an inquiry to "zfs at ddrdrive dot com". The drive for speed, Christopher George Founder/CTO www.ddrdrive.com *** Special thanks goes out to SUN employees Garrett D'Amore and James McPherson for their exemplary help and support. Well done! ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New ZFS Intent Log (ZIL) device available - Beta program now open!
Hi Adam, So was FW aware of this or in contact with these guys? Also are you requesting/ordering any of these cards to evaluate? The device seems kind of small at 4GB, and uses a double wide PCI Express slot. Neil. On 01/13/10 12:27, Adam Leventhal wrote: Hey Chris, The DDRdrive X1 OpenSolaris device driver is now complete, please join us in our first-ever ZFS Intent Log (ZIL) beta test program. A select number of X1s are available for loan, preferred candidates would have a validation background and/or a true passion for torturing new hardware/driver :-) We are singularly focused on the ZIL device market, so a test environment bound by synchronous writes is required. The beta program will provide extensive technical support and a unique opportunity to have direct interaction with the product designers. Congratulations! This is great news for ZFS. I'll be very interested to see the results members of the community can get with your device as part of their pool. COMSTAR iSCSI performance should be dramatically improved in particular. Adam -- Adam Leventhal, Fishworkshttp://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Recovering a broken mirror
No. Only slice 6 from what I understand. I didn't create this (the person who did has left the company) and all I know is that the pool was mounted on /oraprod before it faulted.
Re: [zfs-discuss] Recovering a broken mirror
Jim Sloey wrote: We have a production SunFireV240 that had a zfs mirror until this week. One of the drives (c1t3d0) in the mirror failed. The system was shutdown and the bad disk replaced without an export. I don't know what happened next but by the time I got involved there was no evidence that the remaining good disk (c1t2d0) had ever been part of a ZFS mirror. Using dd on the raw device I can see data on slice 6 of the good disk but can't import it. Did you use entire disks c1t2d0 and c1t3d0 to create your pool? If yes, then current labeling on c1t2d0 suggests that it got relabeled somehow in the process. Is there any way to recover from this or are they SOL? First I'd make a full copy of c1t2d0 content (so you can try recovery several times). Then first thing I'd try is to relabel c1t2d0 with EFI label and check if the pool is there. regards, victor Thanks in advance # zpool status no pools available # zpool import # ls /etc/zfs # ls /dev/dsk c0t0d0s0 c0t0d0s3 c0t0d0s6 c1t0d0s1 c1t0d0s4 c1t0d0s7 c1t1d0s2 c1t1d0s5 c1t2d0c1t2d0s2 c1t2d0s5 c1t3d0s0 c1t3d0s3 c1t3d0s6 c0t0d0s1 c0t0d0s4 c0t0d0s7 c1t0d0s2 c1t0d0s5 c1t1d0s0 c1t1d0s3 c1t1d0s6 c1t2d0s0 c1t2d0s3 c1t2d0s6 c1t3d0s1 c1t3d0s4 c0t0d0s2 c0t0d0s5 c1t0d0s0 c1t0d0s3 c1t0d0s6 c1t1d0s1 c1t1d0s4 c1t1d0s7 c1t2d0s1 c1t2d0s4 c1t3d0c1t3d0s2 c1t3d0s5 # format Searching for disks...done AVAILABLE DISK SELECTIONS: 0. c1t0d0 bootdisk /p...@1c,60/s...@2/s...@0,0 1. c1t1d0 bootmirr /p...@1c,60/s...@2/s...@1,0 2. c1t2d0 /p...@1c,60/s...@2/s...@2,0 3. c1t3d0 /p...@1c,60/s...@2/s...@3,0 Specify disk (enter its number): 2 selecting c1t2d0 format> inquiry Vendor: FUJITSU Product: MAT3147N SUN146G Revision: 1703 format> current Current Disk = c1t2d0 /p...@1c,60/s...@2/s...@2,0 format> verify Primary label contents: Volume name = <> ascii name = pcyl= 14089 ncyl= 14087 acyl=2 nhead = 24 nsect = 848 Part TagFlag Cylinders SizeBlocks 0 rootwm 0 -12 129.19MB(13/0/0) 264576 1 swapwu 13 -25 129.19MB(13/0/0) 264576 2 backupwu 0 - 14086 136.71GB(14087/0/0) 286698624 3 unassignedwm 00 (0/0/0) 0 4 unassignedwm 00 (0/0/0) 0 5 unassignedwm 00 (0/0/0) 0 6usrwm 26 - 14086 136.46GB(14061/0/0) 286169472 7 unassignedwm 00 (0/0/0) 0 format> quit # dd if=/dev/rdsk/c1t2d0s6 count=1 | od -x 1+0 records in 1+0 records out 000 7215 2b79 046c 8ddc 3e31 6966 caa4 6950 020 9c60 4514 7d4a 2a13 9b66 e69e d484 a327 040 4eb0 220e 9c7f 6604 6182 7b39 1310 9c5c 060 4584 c7c6 bd51 aba9 7b4d ec9a 99b2 6bc2 100 6cab 7a88 46d7 937d 5026 86cd 4cf9 ae83 120 20f3 44ec c22e d322 e6cc 2c09 f598 caf4 140 a9c5 85ad a695 8862 c6cc 124d bb72 d540 160 8886 2173 57cc 9759 a209 d78e 9a11 df4d 200 cdc4 5c99 259a 56e5 a301 d540 e691 182b 220 b354 93a9 bc33 085e 1fb6 0445 ac95 59aa 240 fb5a dd66 21de 2f18 24e7 d4c9 c464 99a5 260 9ae4 628a a434 7b96 d1a0 d761 3c21 3ed5 300 c417 5364 e5a3 837a dfd6 266c 50a6 4b10 320 95d5 2952 0f8f cb30 9ef0 23ab 6abc 6872 340 ed58 1977 79ff 9a89 0533 530e 6b83 95aa 360 630b f638 8508 02b1 6266 ca8a 6990 8ad4 400 47c2 7db3 9d9c 62cc ccb4 db3a 0803 ef35 420 0bd3 46b3 04bb d778 c471 9d65 de1b 1861 440 e0b9 ae27 d084 19da 716d b0ca 67be 07ea 460 5650 268e eb2c d7cc 083d c1a8 55ac 4c3c 500 d699 f558 d353 dc61 e25b 2bb8 7d8c 249c 520 c853 258a 01cd b366 bad3 2599 f8ac b3dc 540 6783 72eb 9029 926b 72e6 c84c 3cd7 59e1 560 f122 f20e f8d8 f32f 8226 ceeb acd0 ccf0 600 df3c f3f5 1e71 5d67 da75 1d84 b177 d21b 620 5fa8 a340 6404 2bec 2884 1d62 83cc 2498 640 4288 cf67 c6de 0970 75fe 9e05 8ed8 2173 660 fd30 4ec8 9ea0 63ee bd3f 7a07 b01a d04b 700 8045 29a6 6203 9ed3 9c16 740f 335e 53d8 720 
c70e 9c73 981a f0f1 3547 8b84 0651 b1fb 740 b5c8 4887 dafe 15ab 721b 60d2 c1d8 8441 760 eee2 1896 2311 76da 1bfb 4422 3439 07e5 0001000 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
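A sketch of the recovery sequence Victor outlines, using the device names from the original post (the image path is made up, and the format session is interactive):

    # 1. Save a raw image of the whole disk first (slice 2 spans the whole
    #    disk under the current VTOC label), so recovery can be retried:
    dd if=/dev/rdsk/c1t2d0s2 of=/backup/c1t2d0.img bs=1024k

    # 2. Relabel the disk with an EFI label (format's expert mode offers
    #    the SMI vs. EFI choice when you run "label"):
    format -e c1t2d0

    # 3. Check whether the pool is now visible, then import it:
    zpool import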
Re: [zfs-discuss] New ZFS Intent Log (ZIL) device available - Beta program now open!
Hey Chris, > The DDRdrive X1 OpenSolaris device driver is now complete, > please join us in our first-ever ZFS Intent Log (ZIL) beta test > program. A select number of X1s are available for loan, > preferred candidates would have a validation background > and/or a true passion for torturing new hardware/driver :-) > > We are singularly focused on the ZIL device market, so a test > environment bound by synchronous writes is required. The > beta program will provide extensive technical support and a > unique opportunity to have direct interaction with the product > designers. Congratulations! This is great news for ZFS. I'll be very interested to see the results members of the community can get with your device as part of their pool. COMSTAR iSCSI performance should be dramatically improved in particular. Adam -- Adam Leventhal, Fishworkshttp://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Recovering a broken mirror
We have a production SunFireV240 that had a zfs mirror until this week. One of the drives (c1t3d0) in the mirror failed. The system was shutdown and the bad disk replaced without an export. I don't know what happened next but by the time I got involved there was no evidence that the remaining good disk (c1t2d0) had ever been part of a ZFS mirror. Using dd on the raw device I can see data on slice 6 of the good disk but can't import it. Is there any way to recover from this or are they SOL? Thanks in advance # zpool status no pools available # zpool import # ls /etc/zfs # ls /dev/dsk c0t0d0s0 c0t0d0s3 c0t0d0s6 c1t0d0s1 c1t0d0s4 c1t0d0s7 c1t1d0s2 c1t1d0s5 c1t2d0c1t2d0s2 c1t2d0s5 c1t3d0s0 c1t3d0s3 c1t3d0s6 c0t0d0s1 c0t0d0s4 c0t0d0s7 c1t0d0s2 c1t0d0s5 c1t1d0s0 c1t1d0s3 c1t1d0s6 c1t2d0s0 c1t2d0s3 c1t2d0s6 c1t3d0s1 c1t3d0s4 c0t0d0s2 c0t0d0s5 c1t0d0s0 c1t0d0s3 c1t0d0s6 c1t1d0s1 c1t1d0s4 c1t1d0s7 c1t2d0s1 c1t2d0s4 c1t3d0c1t3d0s2 c1t3d0s5 # format Searching for disks...done AVAILABLE DISK SELECTIONS: 0. c1t0d0 bootdisk /p...@1c,60/s...@2/s...@0,0 1. c1t1d0 bootmirr /p...@1c,60/s...@2/s...@1,0 2. c1t2d0 /p...@1c,60/s...@2/s...@2,0 3. c1t3d0 /p...@1c,60/s...@2/s...@3,0 Specify disk (enter its number): 2 selecting c1t2d0 format> inquiry Vendor: FUJITSU Product: MAT3147N SUN146G Revision: 1703 format> current Current Disk = c1t2d0 /p...@1c,60/s...@2/s...@2,0 format> verify Primary label contents: Volume name = <> ascii name = pcyl= 14089 ncyl= 14087 acyl=2 nhead = 24 nsect = 848 Part TagFlag Cylinders SizeBlocks 0 rootwm 0 -12 129.19MB(13/0/0) 264576 1 swapwu 13 -25 129.19MB(13/0/0) 264576 2 backupwu 0 - 14086 136.71GB(14087/0/0) 286698624 3 unassignedwm 00 (0/0/0) 0 4 unassignedwm 00 (0/0/0) 0 5 unassignedwm 00 (0/0/0) 0 6usrwm 26 - 14086 136.46GB(14061/0/0) 286169472 7 unassignedwm 00 (0/0/0) 0 format> quit # dd if=/dev/rdsk/c1t2d0s6 count=1 | od -x 1+0 records in 1+0 records out 000 7215 2b79 046c 8ddc 3e31 6966 caa4 6950 020 9c60 4514 7d4a 2a13 9b66 e69e d484 a327 040 4eb0 220e 9c7f 6604 6182 7b39 1310 9c5c 060 4584 c7c6 bd51 aba9 7b4d ec9a 99b2 6bc2 100 6cab 7a88 46d7 937d 5026 86cd 4cf9 ae83 120 20f3 44ec c22e d322 e6cc 2c09 f598 caf4 140 a9c5 85ad a695 8862 c6cc 124d bb72 d540 160 8886 2173 57cc 9759 a209 d78e 9a11 df4d 200 cdc4 5c99 259a 56e5 a301 d540 e691 182b 220 b354 93a9 bc33 085e 1fb6 0445 ac95 59aa 240 fb5a dd66 21de 2f18 24e7 d4c9 c464 99a5 260 9ae4 628a a434 7b96 d1a0 d761 3c21 3ed5 300 c417 5364 e5a3 837a dfd6 266c 50a6 4b10 320 95d5 2952 0f8f cb30 9ef0 23ab 6abc 6872 340 ed58 1977 79ff 9a89 0533 530e 6b83 95aa 360 630b f638 8508 02b1 6266 ca8a 6990 8ad4 400 47c2 7db3 9d9c 62cc ccb4 db3a 0803 ef35 420 0bd3 46b3 04bb d778 c471 9d65 de1b 1861 440 e0b9 ae27 d084 19da 716d b0ca 67be 07ea 460 5650 268e eb2c d7cc 083d c1a8 55ac 4c3c 500 d699 f558 d353 dc61 e25b 2bb8 7d8c 249c 520 c853 258a 01cd b366 bad3 2599 f8ac b3dc 540 6783 72eb 9029 926b 72e6 c84c 3cd7 59e1 560 f122 f20e f8d8 f32f 8226 ceeb acd0 ccf0 600 df3c f3f5 1e71 5d67 da75 1d84 b177 d21b 620 5fa8 a340 6404 2bec 2884 1d62 83cc 2498 640 4288 cf67 c6de 0970 75fe 9e05 8ed8 2173 660 fd30 4ec8 9ea0 63ee bd3f 7a07 b01a d04b 700 8045 29a6 6203 9ed3 9c16 740f 335e 53d8 720 c70e 9c73 981a f0f1 3547 8b84 0651 b1fb 740 b5c8 4887 dafe 15ab 721b 60d2 c1d8 8441 760 eee2 1896 2311 76da 1bfb 4422 3439 07e5 0001000 -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thin device support in ZFS?
> "et" == Erik Trimble writes: et> Probably, the smart thing to push for is inclusion of some new et> command in the ATA standard (in a manner like TRIM). Likely et> something that would return both native Block and Page sizes et> upon query. that would be the *sane* thing to do. The *smart* thing to do would be write a quick test to determine the apparent page size by performance-testing write-flush-write-flush-write-flush with various write sizes and finding the knee that indicates the smallest size at which read-before-write has stopped. The test could happen in 'zpool create' and have its result written into the vdev label. Inventing ATA commands takes too long to propogate through the technosphere, and the EE's always implement them wrongly: for example, a device with SDRAM + supercap should probably report 512 byte sectors because the algorithm for copying from SDRAM to NAND is subject to change and none of your business, but EE's are not good with language and will try to apelike match up the paragraph in the spec with the disorganized thoughts in their head, fit pegs into holes, and will end up giving you the NAND page size without really understanding why you wanted it other than that some standard they can't control demands it. They may not even understand why their devices are faster and slower---they are probably just hurling shit against an NTFS and shipping whatever runs some testsuite fastest---so doing the empirical test is the only way to document what you really care about in a way that will make it across the language and cultural barriers between people who argue about javascript vs python and ones that argue about Agilent vs LeCroy. Within the proprietary wall of these flash filesystem companies the testsuites are probably worth as much as the filesystem code, and here without the wall an open-source statistical test is worth more than a haggled standard. Remember the ``removeable'' bit in USB sticks and the mess that both software and hardware made out of it. (hot-swappable SATA drives are ``non-removeable'' and don't need rmformat while USB/firewore do? yeah, sorry, u fail abstraction. and USB drives have the ``removable medium'' bit set when the medium and the controller are inseperable, it's the _controller_ that's removeable? ya sorry u fail reading English.) If you can get an answer by testing, DO IT, and evolve the test to match products on the market as necessary. This promises to be a lot more resilient than the track record with bullshit ATA commands and will work with old devices too. By the time you iron out your standard we will be using optonanocyberflash instead: that's what happened with the removeable bit and r/w optical storage. BTW let me know when read/write UDF 2.0 on dvd+r is ready---the standard was only announced twelve years ago, thanks. pgpOg9cjVknOA.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] New ZFS Intent Log (ZIL) device available - Beta program now open!
The DDRdrive X1 OpenSolaris device driver is now complete, please join us in our first-ever ZFS Intent Log (ZIL) beta test program. A select number of X1s are available for loan, preferred candidates would have a validation background and/or a true passion for torturing new hardware/driver :-) We are singularly focused on the ZIL device market, so a test environment bound by synchronous writes is required. The beta program will provide extensive technical support and a unique opportunity to have direct interaction with the product designers. Would you like to take part in the advancement of Open Storage and explore the far-reaching potential of ZFS based Hybrid Storage Pools? If so, please send an inquiry to "zfs at ddrdrive dot com". The drive for speed, Christopher George Founder/CTO www.ddrdrive.com *** Special thanks goes out to SUN employees Garrett D'Amore and James McPherson for their exemplary help and support. Well done! -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] x4500/x4540 does the internal controllers have a bbu?
On Jan 12, 2010, at 7:46 PM, Brad wrote: > Richard, > > "Yes, write cache is enabled by default, depending on the pool configuration." > Is it enabled for a striped (mirrored configuration) zpool? I'm asking > because of a concern I've read on this forum about a problem with SSDs (and > disks) where if a power outage occurs any data in cache would be lost if it > hasn't been flushed to disk. If the vdev is a whole disk (for Solaris == not a slice), then ZFS will attempt to set the write cache enable. By default, Solaris will not set write cache enable on disks, in part because it causes bad juju for UFS. This is independent of the data protection configuration. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
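If you want to check or change a drive's write cache setting by hand, format's expert mode exposes it on most disks. Roughly (menu names from memory, and they vary a bit by drive type):

    format -e
    # select the disk, then:
    format> cache
    cache> write_cache
    write_cache> display
    write_cache> enable      # or "disable"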
Re: [zfs-discuss] zfs fast mirror resync?
On Wed, Jan 13, 2010 at 4:35 PM, Max Levine wrote: > Veritas has this feature called fast mirror resync where they have a > DRL on each side of the mirror and, detaching/re-attaching a mirror > causes only the changed bits to be re-synced. Is anything similar > planned for ZFS? ZFS has that feature from moment zero. -- Regards, Cyril
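In ZFS terms the equivalent workflow uses offline/online rather than detach/attach: ZFS tracks what changed while a mirror side was out of service (its dirty time log) and resilvers only those blocks, whereas a detach discards the device's pool membership and re-attaching it means a full resilver. A sketch with hypothetical names:

    zpool offline tank c1t3d0
    # ...maintenance on c1t3d0...
    zpool online tank c1t3d0
    zpool status tank     # resilver covers only data written while offline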
[zfs-discuss] zfs fast mirror resync?
Veritas has this feature called fast mirror resync, where they have a DRL on each side of the mirror, and detaching/re-attaching a mirror causes only the changed bits to be re-synced. Is anything similar planned for ZFS?
Re: [zfs-discuss] How do separate ZFS filesystems affect performance?
On Tue, Jan 12, 2010 at 01:56:57PM -0800, Richard Elling wrote: > On Jan 12, 2010, at 12:37 PM, Gary Mills wrote: > > > On Tue, Jan 12, 2010 at 11:11:36AM -0600, Bob Friesenhahn wrote: > >> On Tue, 12 Jan 2010, Gary Mills wrote: > >>> > >>> Is moving the databases (IMAP metadata) to a separate ZFS filesystem > >>> likely to improve performance? I've heard that this is important, but > >>> I'm not clear why this is. > > > > I found a couple of references that suggest just putting the databases > > on their own ZFS filesystem has a great benefit. One is an e-mail > > message to a mailing list from Vincent Fox at UC Davis. They run a > > similar system to ours at that site. He says: > > > >Particularly the database is important to get it's own filesystem so > >that it's queue/cache are separated. > > Another policy you might consider is the recordsize for the > database vs the message store. In general, databases like the > recordsize to match. Of course, recordsize is a per-dataset > parameter. Unfortunately, it's not a single database. There are many of them, of different types. One is a Berkeley DB, others are something specific to the IMAP server (called skiplist), and some are small flat files that are just rewritten. All they have in common is activity and frequent locking. They can be relocated as a whole. > > The second one is from: > > > >http://blogs.sun.com/roch/entry/the_dynamics_of_zfs > > > > He says: > > > >For file modification that come with some immediate data integrity > >constraint (O_DSYNC, fsync etc.) ZFS manages a per-filesystem intent > >log or ZIL. > > > > This sounds like the ZIL queue mentioned above. Is I/O for each of > > those handled separately? > > ZIL is for the pool. Yes, I understand that, but do filesystems have separate queues of any sort within the ZIL? If not, would it help to put the database filesystems into a separate zpool? > We did some experiments with the messaging server and a RAID > array with separate logs. As expected, it didn't make much difference > because of the nice, large nonvolatile write cache on the array. This > reinforces the notion that Dan Carosone also recently noted: performance > gains for separate logs are possible when the latency of the separate > log device is much lower than the latency of the devices in the main pool, > and, of course, the workload uses sync writes. It certainly sounds as if latency is the key for synchronous writes. -- -Gary Mills--Unix Group--Computer and Network Services- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
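To make the recordsize suggestion concrete - a hedged sketch with invented dataset names and an assumed 8K database page size (recordsize only affects files written after it is set):

    # A dedicated dataset for the databases, matching their page size:
    zfs create -o recordsize=8k tank/imap/db

    # The message store stays at the default 128K recordsize:
    zfs create tank/imap/mail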
Re: [zfs-discuss] ZFS shareiscsi and Comstar
On Wed, Jan 13, 2010 at 1:03 PM, Matthew Hollick wrote: > > Is the intent to move away from iscsitgt? Yes. [1] > Can I change some configuration somewhere to specify which iscsi target > service to use? No. In fact the cited ARC case obsoletes the shareiscsi property altogether. [1] http://arc.opensolaris.org/caselog/PSARC/2010/006/ -- Regards, Cyril
[zfs-discuss] ZFS shareiscsi and Comstar
Good morning, I am new to the OpenSolaris game so forgive me if this is covered elsewhere. While reading the documentation for zfs and comstar I note that there are two methods for creating iscsi targets with zfs volumes. The first (which appears to be unsupported) is to use the shareiscsi=on attribute on the volume. This method does not appear to use comstar but the older, userspace, iscsitgt service. In fact, if iscsitgt is offline it gets started when attempting to create a volume with the shareiscsi=on attribute. Is the intent to move away from iscsitgt? Can I change some configuration somewhere to specify which iscsi target service to use? Regards, Matthew.
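For comparison, the two paths look roughly like this (pool/volume names are invented, and the GUID below is a placeholder for the one sbdadm prints):

    # Legacy path (userspace iscsitgt):
    zfs create -V 50g tank/luns/vol0
    zfs set shareiscsi=on tank/luns/vol0

    # COMSTAR path:
    svcadm enable stmf
    svcadm enable -r svc:/network/iscsi/target:default
    sbdadm create-lu /dev/zvol/rdsk/tank/luns/vol0
    stmfadm add-view 600144f0...        # LU GUID from "sbdadm list-lu"
    itadm create-target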