Re: [zfs-discuss] zfs questions wrt unused blocks
Richard, thanks for the heads-up. I found some material here that sheds a bit more light on it:

http://en.wikipedia.org/wiki/ZFS
http://all-unix.blogspot.com/2007/04/transaction-file-system-and-cow.html

Regards,
heinz

Richard Elling wrote:
> On Feb 15, 2010, at 8:43 PM, heinz zerbes wrote:
>> Gents,
>> We want to understand the mechanism of ZFS a bit better.
>> Q: what is the design/algorithm of ZFS in terms of reclaiming unused blocks?
>> Q: what criteria does ZFS use to start reclaiming blocks?
>
> The answer to these questions is too big for an email. Think of ZFS as a
> very dynamic system with many different factors influencing block
> allocation.
>
>> The issue at hand is an LDOM or zone running in a virtual
>> (thin-provisioned) disk on an NFS server, with a zpool inside that
>> vdisk. The vdisk tends to grow in size even if the user writes a file
>> and then deletes it again. The question is whether this reclaiming of
>> unused blocks can kick in earlier, so that the filesystem doesn't grow
>> much beyond what is actually allocated.
>
> ZFS is a COW file system, which partly explains what you are seeing.
> Snapshots, deduplication, and the ZIL complicate the picture.
> -- richard
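Richard's COW point can be observed directly. The following is only a hedged sketch of the reproduction Heinz describes; the image path, pool name, and file sizes are all invented for illustration:

# On the NFS server: apparent size vs. blocks actually allocated for
# the thin-provisioned vdisk image.
ls -lh /export/vdisks/guest.img
du -h  /export/vdisks/guest.img

# Inside the guest: write and then delete a file on the zpool that
# lives in the vdisk.
mkfile 1g /tank/scratch/testfile
rm /tank/scratch/testfile

# On the NFS server again: du typically shows growth even after the
# delete, because copy-on-write allocation touched previously unused
# regions of the image, and nothing tells the backing store that those
# blocks are free again.
du -h /export/vdisks/guest.img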
[zfs-discuss] zfs questions wrt unused blocks
Gents,

We want to understand the mechanism of ZFS a bit better.

Q: what is the design/algorithm of ZFS in terms of reclaiming unused blocks?
Q: what criteria does ZFS use to start reclaiming blocks?

The issue at hand is an LDOM or zone running in a virtual (thin-provisioned) disk on an NFS server, with a zpool inside that vdisk. The vdisk tends to grow in size even if the user writes a file and then deletes it again. The question is whether this reclaiming of unused blocks can kick in earlier, so that the filesystem doesn't grow much beyond what is actually allocated.

Thanks,
heinz

--
Heinz Zerbes
Security Consultant and Auditor
Sun Microsystems Australia
33 Berry St., North Sydney, NSW 2060 AU
Phone x59468/+61 2 9466 9468
Mobile +61 410 727 961
Fax +61 2 9466 9411
Email heinz.zer...@sun.com
Re: [zfs-discuss] zfs questions wrt unused blocks
On Feb 15, 2010, at 8:43 PM, heinz zerbes wrote:
> Gents,
> We want to understand the mechanism of ZFS a bit better.
> Q: what is the design/algorithm of ZFS in terms of reclaiming unused blocks?
> Q: what criteria does ZFS use to start reclaiming blocks?

The answer to these questions is too big for an email. Think of ZFS as a very dynamic system with many different factors influencing block allocation.

> The issue at hand is an LDOM or zone running in a virtual
> (thin-provisioned) disk on an NFS server, with a zpool inside that vdisk.
> The vdisk tends to grow in size even if the user writes a file and then
> deletes it again. The question is whether this reclaiming of unused
> blocks can kick in earlier, so that the filesystem doesn't grow much
> beyond what is actually allocated.

ZFS is a COW file system, which partly explains what you are seeing. Snapshots, deduplication, and the ZIL complicate the picture.
-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 15-17, 2010)
[zfs-discuss] zfs: questions on ARC membership based on type/ordering of Reads/Writes
I am trying to understand the ARC's behavior under different permutations of (a)sync reads and (a)sync writes. Thank you in advance.

o Does the data for a *sync-write* *ever* go into the ARC? E.g., my understanding is that the data goes to the ZIL (and the SLOG, if present), but how does it get from the ZIL to the ZIO layer? Does it go through the ARC on its way to the ZIO?

o If the sync-write data *does* go to the ARC, does it go to the ARC *after* it is written to the ZIL's backing store, or does the data go to the ZIL and the ARC in parallel?

o If a sync-write's data goes to the ARC and ZIL *in parallel*, does ZFS prevent an ARC hit until the data is confirmed to be on the ZIL's nonvolatile media (e.g., disk platter or SLOG)? Or could a read get an ARC hit on a block *before* it is written to the ZIL's backing store?

o Is the DMU where the serialization of transactions occurs?

o If an async write for block X hits the serializer before a read for block X hits the serializer, I am assuming the read can pass the async write; i.e., the read is *not* pended behind the async write. However, if a read hits the serializer after a *sync* write, then I'm assuming the read is pended until the sync write is written to the ZIL's nonvolatile media.

o If a read passes an async write, then I'm assuming the read can be satisfied by the ARC, the L2ARC, or disk.

o It's stated that the L2ARC is for random reads. However, there's nothing to prevent the L2ARC from containing blocks derived from *sequential* reads, right? Also, blocks from async writes can also live in the L2ARC, right? How about sync writes?

o Is the L2ARC literally just a *larger* ARC? I.e., does the L2ARC obey the normal cache property where everything in the L1$ (the ARC) is also in the L2$ (the L2ARC)? (I have a feeling that, for some reason, the set-theoretic intersection of the ARC and L2ARC is empty.)

o Does the L2ARC use the ARC algorithm (as the name suggests)?

thank you,
/andrew
Solaris RPE
Re: [zfs-discuss] zfs: questions on ARC membership based on type/ordering of Reads/Writes
On Nov 25, 2009, at 11:55 AM, andrew.r...@sun.com wrote:
> o Does the data for a *sync-write* *ever* go into the ARC?

Always.

> E.g., my understanding is that the data goes to the ZIL (and the SLOG,
> if present), but how does it get from the ZIL to the ZIO layer?

The ZIL is effectively write-only. It is only read when the pool is imported.

> Does it go through the ARC on its way to the ZIO?

The ARC is the cache for buffering data.

> o If the sync-write data *does* go to the ARC, does it go to the ARC
> *after* it is written to the ZIL's backing store, or does the data go to
> the ZIL and the ARC in parallel?

A sync write returns when the data is written to the ZIL. An async write returns when the data is in the ARC; later, the unwritten contents of the ARC are pushed to the pool when the transaction group is committed.

> o If a sync-write's data goes to the ARC and ZIL *in parallel*, does ZFS
> prevent an ARC hit until the data is confirmed to be on the ZIL's
> nonvolatile media? Or could a read get an ARC hit on a block *before* it
> is written to the ZIL's backing store?

In my mind, the ARC and ZIL are orthogonal.

> o Is the DMU where the serialization of transactions occurs?

Serialization?

> o If an async write for block X hits the serializer before a read for
> block X hits the serializer, I am assuming the read can pass the async
> write; i.e., the read is *not* pended behind the async write. However,
> if a read hits the serializer after a *sync* write, then I'm assuming
> the read is pended until the sync write is written to the ZIL's
> nonvolatile media.
>
> o If a read passes an async write, then I'm assuming the read can be
> satisfied by the ARC, the L2ARC, or disk.

I think you are asking whether write order is preserved. The answer is yes.

> o It's stated that the L2ARC is for random reads. However, there's
> nothing to prevent the L2ARC from containing blocks derived from
> *sequential* reads, right? Also, blocks from async writes can also live
> in the L2ARC, right? How about sync writes?

Blocks which are not yet committed to the pool are locked in the ARC so they can't be evicted. Once committed, the lock is removed.

> o Is the L2ARC literally just a *larger* ARC? I.e., does the L2ARC obey
> the normal cache property where everything in the L1$ (the ARC) is also
> in the L2$ (the L2ARC)?

No. The L2ARC is not in the datapath between the ARC and media. Further, data is not evicted from the ARC into the L2ARC. Rather, the L2ARC is filled from data near the eviction ends of the MRU and MFU lists. The movement of data to the L2ARC is throttled and grouped in sequence, improving efficiency for devices which like large writes, such as read-optimized flash.

Think of it this way: data which is in the ARC is fed into the L2ARC. If the data is later evicted from the ARC, it can still live in the L2ARC. When the L2ARC has lower read latency than the pool's media, it can improve performance because the data can be read from the L2ARC instead of the pool. This fits the general definition of a cache, but does not work the same way as multilevel CPU caches.

> o Does the L2ARC use the ARC algorithm (as the name suggests)?

Yes, but it really isn't separate from the ARC, from a management point of view.
To fully understand it, you need to know how the metadata for each buffer in the ARC is managed. This will introduce the concept of the ghosts, and the L2ARC is a simple extension. The comments in the source are nicely descriptive, and you might consider reading them through once, even if you don't dive into the code itself:

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c

-- richard
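For readers who would rather watch this behavior than reason about it, the ARC exposes its counters as kstats. A minimal sketch; the statistic names below match contemporary OpenSolaris builds but may vary by release:

# Dump all ARC statistics: "size" is the current ARC footprint,
# "c" the adaptive target size, and the l2_* counters cover the L2ARC.
kstat -m zfs -n arcstats

# Sample a few specific counters every 5 seconds.
kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses 5
kstat -p zfs:0:arcstats:l2_size zfs:0:arcstats:l2_hits 5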
[zfs-discuss] ZFS questions
Hello ZFS Community,

I am trying to find out whether ZFS has a tool comparable to Veritas's vxbench. Any ideas? I see a tool called vdbench that looks close, but it is not a Sun tool. Does Sun recommend something to customers moving from Veritas to ZFS who like vxbench and its capabilities?

Thanks,
Richard
Re: [zfs-discuss] ZFS questions
Richard Gilmore wrote:
> Hello ZFS Community,
> I am trying to find out whether ZFS has a tool comparable to Veritas's
> vxbench. Any ideas? I see a tool called vdbench that looks close, but it
> is not a Sun tool. Does Sun recommend something to customers moving from
> Veritas to ZFS who like vxbench and its capabilities?

filebench

http://sourceforge.net/projects/filebench/
http://www.solarisinternals.com/wiki/index.php/FileBench
http://blogs.sun.com/dom/entry/filebench:_a_zfs_v_vxfs
Re: [zfs-discuss] ZFS questions
Thommy M. wrote:
> Richard Gilmore wrote:
>> I am trying to find out whether ZFS has a tool comparable to Veritas's
>> vxbench. [...]
>
> filebench
>
> http://sourceforge.net/projects/filebench/
> http://www.solarisinternals.com/wiki/index.php/FileBench
> http://blogs.sun.com/dom/entry/filebench:_a_zfs_v_vxfs

Also, /usr/benchmarks/filebench for later Solaris releases. IIRC, vdbench is in the process of becoming open source, but I do not know the current status.

-- richard
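For reference, a filebench session looks roughly like the sketch below. This is a hedged illustration only: the go_filebench path assumes the bundled /usr/benchmarks layout mentioned above (the SourceForge build installs differently), and the workload name and target directory are invented:

# Launch the interactive filebench shell.
/usr/benchmarks/filebench/bin/go_filebench

# At the prompt: load a canned workload personality, point it at a
# ZFS dataset, and run for 60 seconds.
filebench> load varmail
filebench> set $dir=/tank/fbtest
filebench> run 60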
[zfs-discuss] ZFS questions
I will have a file system in a SAN using ZFS. Can someone answer my questions?

1. Can I create ZFS volumes on a ZFS file system from one server, attach the file system read-write to a different server (to load data), then detach the file system from that server and attach it read-only to multiple other servers?

2. Can I expand a ZFS volume on the fly while the file system is attached read-write to a server?

Thanks for your help,
Dave
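This question went unanswered in the thread, so what follows is only a hedged sketch of how the described workflow maps onto zpool export/import and the readonly dataset property (all names are invented). One caveat is certain: a ZFS pool can be imported on only one host at a time, so attaching it read-only to multiple servers simultaneously is outside what ZFS itself provides.

# Server A: create the pool and load data.
zpool create tank c2t0d0          # device name is illustrative
zfs create tank/data
# ... load data ...
zpool export tank                 # detach cleanly from server A

# Server B: take over the pool and mark the dataset read-only.
zpool import tank
zfs set readonly=on tank/data

# Question 2: a ZFS volume (zvol) can be grown on the fly while the
# pool is imported read-write; whether the consumer sees the new size
# without a remount depends on what sits on top of the volume.
zfs set volsize=20g tank/vol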
[zfs-discuss] ZFS questions with mirrors
IHAC that is asking the following; any thoughts would be appreciated.

Take two drives, zpool them to make a mirror. Remove a drive - and the server HANGS. Power off and reboot the server, and everything comes up cleanly.

Take the same two drives (still Solaris 10). Install Veritas Volume Manager (4.1). Mirror the two drives. Remove a drive - everything is still running. Replace the drive - everything is still working. No outage.

So the big questions to tech support:

1. Is this a known property of ZFS? That when a drive from a hot-swap system is removed, the server hangs? (We were attempting to simulate a drive failure.)

2. Or is this just because it was an E450? I.e., would removing a ZFS mirror disk (unexpected hardware removal, as opposed to using ZFS to remove the disk) on a V240 or V480 cause the same problem?

3. What could we expect if a drive mysteriously failed during operation of a server with a ZFS mirror? Would the server hang like it did during testing? How can we test this?

4. If it is a known property of ZFS, is there a date when it is expected to be fixed (if ever)?

Peter

PS: I may not be on this alias, so please respond to me directly.

--
Peter Wilk - OS/Security Support
Sun Microsystems
1 Network Drive, P.O. Box 4004
Burlington, Massachusetts 01803-0904
1-800-USA-4SUN, opt 1, opt 1, case number#
Email: [EMAIL PROTECTED]
Re: [zfs-discuss] ZFS questions with mirrors
The current behavior depends on the implementation of the driver and its support for hotplug events. When a drive is yanked, one of two things can happen:

- I/Os will fail, and any attempt to re-open the device will result in failure.

- I/Os will fail, but the device can continue to be opened by its existing path.

ZFS currently handles case #1 and will mark the device faulted, generating an FMA fault in the process. Future ZFS/FMA integration will address case #2, and is on the short list of features to address. In the meantime, you can 'zpool offline' the bad device to prevent ZFS from trying to access it.

That being said, the server should never hang - only proceed arbitrarily slowly. When you say 'hang', what does that mean?

- Eric

--
Eric Schrock, Solaris Kernel Development
http://blogs.sun.com/eschrock
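A minimal sketch of the 'zpool offline' workaround Eric mentions; the pool and device names are placeholders:

# Stop ZFS from issuing I/O to the suspect side of the mirror.
zpool offline tank c1t2d0

# The pool keeps serving data from the surviving side; status shows
# the mirror DEGRADED with the device OFFLINE.
zpool status tank

# If the same disk returns, bring it back online (a quick resilver
# catches it up); if the disk was physically swapped, use replace.
zpool online tank c1t2d0
zpool replace tank c1t2d0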
[zfs-discuss] ZFS questions
Can someone explain to me what the 'volinit' and 'volfini' options to zfs do? It's not obvious from the source code, and these options are undocumented.

Thanks,
John

--
John Cecere
Sun Microsystems
732-302-3922 / [EMAIL PROTECTED]
[zfs-discuss] zfs questions from Sun customer
Please reply to [EMAIL PROTECTED]

**** Background / configuration ****

zpool will not create a storage pool on fibre channel storage. I'm attached to an IBM SVC using the IBMsdd driver. I have no problem using SVM metadevices and UFS on these devices.

Steps to reproduce the problem (if applicable):

Build a Solaris 10 Update 2 server
Attach to an external storage array via IBM SVC
Load the lpfc driver (6.02h)
Load the IBMsdd software (1.6.1.0-2)
Attempt to use zpool create to make a storage pool:

# zpool create -f extdisk vpath1c
internal error: unexpected error 22 at line 446 of ../common/libzfs_pool.c

**** Reply to customer ****

It looks like you have an additional unwanted software layer between Solaris and the disk hardware. Currently ZFS needs to access the physical device to work correctly. Something like:

# zpool create -f extdisk c5t0d0 c5t1d0 ...

Let me know if this works for you.

**** Follow-up question from customer ****

Yes, using the c#t#d# disks works, but anyone using fibre-channel storage on something like IBM Shark or EMC CLARiiON will want multiple paths to disk using either IBMsdd, EMCpower or Solaris native MPxIO. Does ZFS work with any of these fibre channel multipathing drivers?

Thanks for any assistance you can provide.

--
David Curtis - TSE            Sun Microsystems
303-272-6628                  Enterprise Services
[EMAIL PROTECTED]             OS / Installation Support
Monday to Friday              9:00 AM to 6:00 PM Mountain
Re: [zfs-discuss] zfs questions from Sun customer
This suggests that there is some kind of bug in the layered storage software. ZFS doesn't do anything special to the underlying storage device; it merely relies on a few ldi_*() routines. I would try running the following dtrace script:

#!/usr/sbin/dtrace -s

vdev_disk_open:return,
ldi_open_by_name:return,
ldi_open_by_path:return,
ldi_get_size:return
{
        trace(arg1);
}

And then re-run your 'zpool create' command. That will at least get us pointed in the right direction.

- Eric

On Wed, Jul 26, 2006 at 09:47:03AM -0600, David Curtis wrote:
> zpool will not create a storage pool on fibre channel storage. I'm
> attached to an IBM SVC using the IBMsdd driver. I have no problem using
> SVM metadevices and UFS on these devices.
> [...]
> # zpool create -f extdisk vpath1c
> internal error: unexpected error 22 at line 446 of ../common/libzfs_pool.c
> [...]

--
Eric Schrock, Solaris Kernel Development
http://blogs.sun.com/eschrock
Re: [zfs-discuss] zfs questions from Sun customer
zfs should work fine with disks under the control of Solaris MPxIO. i don't know about any of the other multipathing solutions. if you're trying to use a device that's controlled by another multipathing solution, you might want to try specifying the full path to the device, ex:

zpool create -f extdisk /dev/foo2/vpath1c

ed

On Wed, Jul 26, 2006 at 09:47:03AM -0600, David Curtis wrote:
> [...]
> Yes, using the c#t#d# disks works, but anyone using fibre-channel storage
> on something like IBM Shark or EMC CLARiiON will want multiple paths to
> disk using either IBMsdd, EMCpower or Solaris native MPxIO. Does ZFS work
> with any of these fibre channel multipathing drivers?
Re: [zfs-discuss] zfs questions from Sun customer
Eric,

Here is what the customer gets trying to create the pool using the software alias (I added all the ldi_open's to the script):

# zpool create -f extdisk vpath1c

# ./dtrace.script
dtrace: script './dtrace.script' matched 6 probes
CPU     ID                    FUNCTION:NAME
  0   7233          ldi_open_by_vp:return         0
  0  15801         ldi_open_by_dev:return         0
  0   7233          ldi_open_by_vp:return         0
  0  15801         ldi_open_by_dev:return         0
  0   7233          ldi_open_by_vp:return         0
  0  15801         ldi_open_by_dev:return         0
  0   7233          ldi_open_by_vp:return         0
  0  15801         ldi_open_by_dev:return         0
  0   7233          ldi_open_by_vp:return         0
  0  15801         ldi_open_by_dev:return         0
  0   7233          ldi_open_by_vp:return         0
  0  15801         ldi_open_by_dev:return         0
  0   7233          ldi_open_by_vp:return         0
  0  15801         ldi_open_by_dev:return         0
  0   7233          ldi_open_by_vp:return         0
  0  15801         ldi_open_by_dev:return         0
  0   7233          ldi_open_by_vp:return         0
  0  17817        ldi_open_by_name:return         0
  0  16191            ldi_get_size:return        -1
  0  44942          vdev_disk_open:return        22

Thanks,
David

Eric Schrock wrote On 07/26/06 10:03 AM:
> This suggests that there is some kind of bug in the layered storage
> software. ZFS doesn't do anything special to the underlying storage
> device; it merely relies on a few ldi_*() routines. I would try running
> the following dtrace script:
> [...]
> And then re-run your 'zpool create' command. That will at least get us
> pointed in the right direction.
Re: [zfs-discuss] zfs questions from Sun customer
So it does look like something's messed up here. Before we pin this down as a driver bug, we should double check that we are indeed opening what we think we're opening, and try to track down why ldi_get_size is failing. Try this:

#!/usr/sbin/dtrace -s

ldi_open_by_name:entry
{
        trace(stringof(args[0]));
}

ldi_prop_exists:entry
{
        trace(stringof(args[2]));
}

ldi_prop_exists:return
{
        trace(arg1);
}

ldi_get_otyp:return
{
        trace(arg1);
}

- Eric

On Wed, Jul 26, 2006 at 12:49:35PM -0600, David Curtis wrote:
> Here is what the customer gets trying to create the pool using the
> software alias:
> [...]
>   0  17817        ldi_open_by_name:return         0
>   0  16191            ldi_get_size:return        -1
>   0  44942          vdev_disk_open:return        22

--
Eric Schrock, Solaris Kernel Development
http://blogs.sun.com/eschrock
Re: [zfs-discuss] zfs questions from Sun customer
Eric,

Here is the output:

# ./dtrace2.dtr
dtrace: script './dtrace2.dtr' matched 4 probes
CPU     ID                    FUNCTION:NAME
  0  17816        ldi_open_by_name:entry   /dev/dsk/vpath1c
  0  16197            ldi_get_otyp:return         0
  0  15546         ldi_prop_exists:entry    Nblocks
  0  15547        ldi_prop_exists:return         0
  0  15546         ldi_prop_exists:entry    nblocks
  0  15547        ldi_prop_exists:return         0
  0  15546         ldi_prop_exists:entry    Size
  0  15547        ldi_prop_exists:return         0
  0  15546         ldi_prop_exists:entry    size
  0  15547        ldi_prop_exists:return         0

Thanks,
David

Eric Schrock wrote On 07/26/06 01:01 PM:
> So it does look like something's messed up here. Before we pin this down
> as a driver bug, we should double check that we are indeed opening what
> we think we're opening, and try to track down why ldi_get_size is
> failing. [...]
Re: [zfs-discuss] zfs questions from Sun customer
On Wed, Jul 26, 2006 at 02:11:44PM -0600, David Curtis wrote:
> Here is the output:
>
> # ./dtrace2.dtr
> dtrace: script './dtrace2.dtr' matched 4 probes
> CPU     ID                    FUNCTION:NAME
>   0  17816        ldi_open_by_name:entry   /dev/dsk/vpath1c
>   0  16197            ldi_get_otyp:return         0
>   0  15546         ldi_prop_exists:entry    Nblocks
>   0  15547        ldi_prop_exists:return         0
>   0  15546         ldi_prop_exists:entry    nblocks
>   0  15547        ldi_prop_exists:return         0
>   0  15546         ldi_prop_exists:entry    Size
>   0  15547        ldi_prop_exists:return         0
>   0  15546         ldi_prop_exists:entry    size
>   0  15547        ldi_prop_exists:return         0

OK, this definitely seems to be a driver bug. I'm no driver expert, but it seems that exporting none of the above properties is a problem - ZFS has no idea how big this disk is! Perhaps someone more familiar with the DDI/LDI interfaces can explain the appropriate way to implement these on the driver end.

But at this point it's safe to say that ZFS isn't doing anything wrong. The layered driver is exporting a device in /dev/dsk, but not exporting basic information (such as the size or number of blocks) that ZFS (and potentially the rest of Solaris) needs to interact with the device.

- Eric

--
Eric Schrock, Solaris Kernel Development
http://blogs.sun.com/eschrock
Re: [zfs-discuss] zfs questions from Sun customer
Does format show these drives to be available and containing a non-zero size?

Eric Schrock wrote:
> OK, this definitely seems to be a driver bug. I'm no driver expert, but
> it seems that exporting none of the above properties is a problem - ZFS
> has no idea how big this disk is!
> [...]
Re: [zfs-discuss] zfs questions from Sun customer
zfs depends on ldi_get_size(), which depends on the device being accessed exporting one of the properties below. i guess the devices generated by IBMsdd and/or EMCpower don't generate these properties.

ed

On Wed, Jul 26, 2006 at 01:53:31PM -0700, Eric Schrock wrote:
> On Wed, Jul 26, 2006 at 02:11:44PM -0600, David Curtis wrote:
>>   0  15546         ldi_prop_exists:entry    Nblocks
>>   0  15547        ldi_prop_exists:return         0
>>   0  15546         ldi_prop_exists:entry    nblocks
>>   0  15547        ldi_prop_exists:return         0
>>   0  15546         ldi_prop_exists:entry    Size
>>   0  15547        ldi_prop_exists:return         0
>>   0  15546         ldi_prop_exists:entry    size
>>   0  15547        ldi_prop_exists:return         0
>
> OK, this definitely seems to be a driver bug. [...]
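One userland way to sanity-check Ed's diagnosis is to compare what the two device nodes export. A hedged sketch: the vpath name comes from the customer's report, the c5t0d0 path is illustrative, and the property names may appear in mixed case:

# A native disk (sd/ssd leaf driver) normally exports Nblocks/Size
# style properties in the device tree.
prtconf -v /dev/dsk/c5t0d0s2 | egrep -i 'nblocks|size'

# The same check against the IBMsdd pseudo-device. If nothing matches,
# ldi_get_size() has no property to read, consistent with the
# ldi_prop_exists() == 0 results in the dtrace output above.
prtconf -v /dev/dsk/vpath1c | egrep -i 'nblocks|size'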
Re: [zfs-discuss] ZFS Questions. (RAID-Z questions actually)
> I understand the copy-on-write thing. That was very well illustrated in
> "ZFS: The Last Word in File Systems" by Jeff Bonwick. But if every block
> is its own RAID-Z stripe, and the block is lost, how does ZFS recover the
> block?

You should perhaps not take "block" literally; the block is written as part of a single transaction on all disks of the RAID-Z group. Only when the block is stored on disk will the bits referencing it be written. For the whole block to be lost, all disks need to be lost or the transaction must not occur.

> Is the stripe parity (as opposed to the block checksum, which I
> understand) stored somewhere else or within the same block?

Parts of the block are written to each disk; the parity is written to the parity disk.

> But how exactly does "every RAID-Z write is a full-stripe write" work?
> More specifically, if in a 3-disk RAID-Z configuration one disk fails
> completely and is replaced, exactly how does the metadata-driven
> reconstruction recover the newly replaced disk?

The metadata-driven reconstruction will take the ueberblock, and from there it will re-read the other disks and reconstruct the parity while also verifying checksums. Not all data needs to be read and not all parity needs to be computed; only the bits of disks which are actually in use are verified and have their parity recomputed.

>> Well, the tricky bit here is RAID-Z reconstruction. Because the stripes
>> are all different sizes, there's no simple formula like "all the disks
>> XOR to zero." You have to traverse the filesystem metadata to determine
>> the RAID-Z geometry. Note that this would be impossible if the
>> filesystem and the RAID array were separate products, which is why
>> there's nothing like RAID-Z in the storage market today. You really need
>> an integrated view of the logical and physical structure of the data to
>> pull it off.
>
> Every stripe is a different size? Is this because ZFS adapts to the
> nature of the I/O coming to it?

It's because the blocks written are all of different sizes. So if you write a 128K block on a 3-way RAID-Z, this can be written as 2x64K of data + 1x64K of parity. (Though I must admit that in such a scheme the disks still XOR to zero, at least on the bits of disk used.)

Casper
Re: [zfs-discuss] ZFS questions
So, based on the below, there should be no reason why a flash-based ZFS filesystem should need to do anything special to avoid problems. That's a Good Thing.

I think that using flash as the system disk will be the way to go. Using flash as read-only with a disk or memory for read-write would result in a very fast system with fewer points of failure...

On Jun 20, 2006, at 6:23 PM, Nathan Kroenert wrote:
> And, this is a worst case, no? If the device itself also does some funky
> stuff under the covers, and ZFS only writes an update if there is
> *actually* something to write, then it could be much, much longer than 4
> years.
>
> Actually - that's an interesting point. I assume ZFS only writes
> something when there is actually data? :)
>
> Nathan.
>
> On Wed, 2006-06-21 at 06:25, Eric Schrock wrote:
>> On Tue, Jun 20, 2006 at 02:18:34PM -0600, Gregory Shaw wrote:
>>> Wouldn't that be:
>>> 5 seconds per write = 86400/5 = 17280 writes per day
>>> 256 rotated locations for 17280/256 = 67 writes per location per day
>>> Resulting in (10^5/67) ~1492 days or 4.08 years before failure?
>>> That's still a long time, but it's not 100 years.
>>
>> Yes, I goofed on the math. It's still (256*10^5*5) seconds, but somehow
>> I managed to goof up the math. I tried it again and came up with 1,481
>> days.
>>
>> - Eric

-
Gregory Shaw, IT Architect
Phone: (303) 673-8273        Fax: (303) 673-2773
ITCTO Group, Sun Microsystems Inc.
1 StorageTek Drive ULVL4-382      [EMAIL PROTECTED] (work)
Louisville, CO 80028-4382         [EMAIL PROTECTED] (home)
"When Microsoft writes an application for Linux, I've Won." - Linus Torvalds
Re: [zfs-discuss] ZFS questions
Erik Trimble wrote:
> That is, start out with adding the ability to differentiate between
> access policy in a vdev. Generally, we're talking only about mirror
> vdevs right now. Later on, we can consider the ability to migrate data
> based on performance, but a lot of this has to take into consideration
> snapshot capability and such, so is a bit less straightforward.

The policy is implemented on the read side, since you still need to commit writes to all mirrors. The implementation shouldn't be difficult; deciding on the administrative interface will be the hardest part.

> Oh, and the newest thing in the consumer market is called "hybrid"
> drives, which is a melding of a Flash drive with a Winchester drive.
> It's originally targetted at the laptop market - think a 1GB flash
> memory welded to a 40GB 2.5" hard drive in the same form-factor. You
> don't replace the DRAM cache on the HD - it's still there for fast-write
> response. But all the frequently used blocks get scheduled to be placed
> on the Flash part of the drive, while the mechanical part actually holds
> a copy of everything. The Flash portion is there for power efficiency as
> well as performance.

Flash is (can be) a bit more sophisticated. The problem is that flash has a limited write endurance -- typically spec'ed at 100k writes to any single bit. The good flash drives use block relocation, spares, and write spreading to avoid write hot spots. For many file systems, the place to worry is the block(s) containing your metadata. ZFS inherently spreads and mirrors its metadata, so it should be more appropriate for flash devices than FAT or UFS. Similarly, the disk drive manufacturers make extensive use of block sparing, so applying that technique to the hybrid drives is expected.

-- richard
Re: [zfs-discuss] ZFS questions
On Tue, Jun 20, 2006 at 09:32:58AM -0700, Richard Elling wrote:
> Flash is (can be) a bit more sophisticated. The problem is that flash
> has a limited write endurance -- typically spec'ed at 100k writes to any
> single bit. The good flash drives use block relocation, spares, and
> write spreading to avoid write hot spots. For many file systems, the
> place to worry is the block(s) containing your metadata. ZFS inherently
> spreads and mirrors its metadata, so it should be more appropriate for
> flash devices than FAT or UFS.

What about the uberblock? It's written each time a transaction group commits.

Cheers,
- jonathan

--
Jonathan Adams, Solaris Kernel Development
Re: [zfs-discuss] ZFS questions
On Tue, Jun 20, 2006 at 11:17:42AM -0700, Jonathan Adams wrote:
> What about the uberblock? It's written each time a transaction group
> commits.

Right. But we rotate the uberblock over 128 positions in the device label. This helps with write-leveling. Furthermore, a lot of flash devices are starting to incorporate write-leveling in HW, since a lot of software just doesn't deal with it.

--Bill
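For the curious, the rotation Bill describes can be observed with zdb. A hedged sketch only: the pool and device names are invented, and zdb output details vary between builds:

# Print the currently active uberblock for the pool (txg, timestamp).
zdb -u tank

# Dump a vdev label from one of the pool's disks. Re-running this
# after a few transaction groups shows the active uberblock advancing
# through the slots of the label's uberblock array.
zdb -l /dev/dsk/c1t0d0s0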
Re: [zfs-discuss] ZFS questions
Jonathan Adams wrote:
> On Tue, Jun 20, 2006 at 09:32:58AM -0700, Richard Elling wrote:
>> [...] ZFS inherently spreads and mirrors its metadata, so it should be
>> more appropriate for flash devices than FAT or UFS.
>
> What about the uberblock? It's written each time a transaction group
> commits.

Also, options such as -nomtime and -noctime have been introduced alongside -noatime in some free operating systems to limit the amount of metadata that gets written back to disk.

Darren
Re: [zfs-discuss] ZFS questions
> Also, options such as -nomtime and -noctime have been introduced
> alongside -noatime in some free operating systems to limit the amount of
> metadata that gets written back to disk.

Those seem rather pointless. (mtime and ctime generally imply other changes, often to the inode; atime does not.)

Casper
Re: [zfs-discuss] ZFS questions
On Tue, Jun 20, 2006 at 02:18:34PM -0600, Gregory Shaw wrote:
> Wouldn't that be:
> 5 seconds per write = 86400/5 = 17280 writes per day
> 256 rotated locations for 17280/256 = 67 writes per location per day
> Resulting in (10^5/67) ~1492 days or 4.08 years before failure?
> That's still a long time, but it's not 100 years.

Yes, I goofed on the math. It's still (256*10^5*5) seconds, but somehow I managed to goof up the math. I tried it again and came up with 1,481 days.

- Eric

--
Eric Schrock, Solaris Kernel Development
http://blogs.sun.com/eschrock
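Spelled out, the corrected arithmetic is just endurance x slots x write interval, using the thread's assumed 100k-write endurance and one uberblock write per 5 seconds:

# lifetime = cycles per slot * number of slots * seconds per write
echo '100000 * 256 * 5' | bc        # 128000000 seconds
echo '128000000 / 86400' | bc       # 1481 days
echo 'scale=2; 1481 / 365' | bc     # about 4.05 years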
Re: [zfs-discuss] ZFS questions
Richard Elling wrote:
> Erik Trimble wrote:
>> Oh, and the newest thing in the consumer market is called "hybrid"
>> drives, which is a melding of a Flash drive with a Winchester drive.
>> [...] The Flash portion is there for power efficiency as well as
>> performance.
>
> Flash is (can be) a bit more sophisticated. The problem is that flash
> has a limited write endurance -- typically spec'ed at 100k writes to any
> single bit. [...]

What I do not know yet is exactly how the flash portion of these hybrid drives is administered. I rather expect that a non-hybrid-aware OS may not actually exercise the flash storage on these drives by default; or should I say, the flash storage will only be available to a hybrid-aware OS. Has anyone reading this seen a command-set reference for one of these drives?

Dana
Re: [zfs-discuss] ZFS questions
Eric Schrock wrote:
> On Tue, Jun 20, 2006 at 11:17:42AM -0700, Jonathan Adams wrote:
>> What about the uberblock? It's written each time a transaction group
>> commits.
>
> Yes, but this is only written once every 5 seconds, and we store to 256
> different locations in a ring buffer. So you have (256*10^5*5) seconds,
> or about 100 years.

100k writes is the de-facto minimum. In looking at some SSD (yes, they are marketing them as solid state disks) drives with IDE or SATA interfaces, at least one vendor specs 5,000,000 writes, with sizes up to 128 GBytes. It will be a while before these are really inexpensive, though.

-- richard
Re: [zfs-discuss] ZFS questions
Dana H. Myers wrote:
> What I do not know yet is exactly how the flash portion of these hybrid
> drives is administered. I rather expect that a non-hybrid-aware OS may
> not actually exercise the flash storage on these drives by default; or
> should I say, the flash storage will only be available to a hybrid-aware
> OS.

Samsung describes their hybrid drives as using flash for the boot block and as a write cache.

-- richard
Re: [zfs-discuss] ZFS questions
And, this is a worst case, no? If the device itself also does some funky stuff under the covers, and ZFS only writes an update if there is *actually* something to write, then it could be much, much longer than 4 years.

Actually - that's an interesting point. I assume ZFS only writes something when there is actually data? :)

Nathan.

On Wed, 2006-06-21 at 06:25, Eric Schrock wrote:
> On Tue, Jun 20, 2006 at 02:18:34PM -0600, Gregory Shaw wrote:
>> Wouldn't that be:
>> 5 seconds per write = 86400/5 = 17280 writes per day
>> 256 rotated locations for 17280/256 = 67 writes per location per day
>> Resulting in (10^5/67) ~1492 days or 4.08 years before failure?
>> That's still a long time, but it's not 100 years.
>
> Yes, I goofed on the math. It's still (256*10^5*5) seconds, but somehow
> I managed to goof up the math. I tried it again and came up with 1,481
> days.
>
> - Eric
Re: [zfs-discuss] ZFS questions
> I assume ZFS only writes something when there is actually data?

Right.

Jeff
Re: [zfs-discuss] ZFS questions
[EMAIL PROTECTED] wrote:
>> Also, options such as -nomtime and -noctime have been introduced
>> alongside -noatime in some free operating systems to limit the amount
>> of metadata that gets written back to disk.
>
> Those seem rather pointless. (mtime and ctime generally imply other
> changes, often to the inode; atime does not.)

Well, operating systems that *do* get used to build devices *do* have these mount options for this purpose, so I imagine that someone who does this kind of thing thinks they're worthwhile.

Darren
Re: [zfs-discuss] ZFS questions
Mike Gerdts wrote:
> On 6/17/06, Dale Ghent [EMAIL PROTECTED] wrote:
>> The concept of shifting blocks in a zpool around in the background as
>> part of a scrubbing process and/or on the order of an explicit command
>> to populate newly added devices seems like it could be right up ZFS's
>> alley. Perhaps it could also be done with volume-level granularity. Off
>> the top of my head, an area where this would be useful is performance
>> management - e.g. relieving load on a particular FC interconnect or an
>> overburdened RAID array controller/cache, thus allowing total
>> no-downtime-to-cp-data-around flexibility when one is horizontally
>> scaling storage performance.
>
> Another good use would be to migrate blocks that are rarely accessed to
> slow storage (750 GB drives with RAID-Z) while very active blocks are
> kept on fast storage (solid state disk). Presumably writes would go to
> relatively fast storage and use idle IO cycles to migrate those that
> don't have a lot of reads to slower storage. Blocks that are very active
> and reside on slow storage could be migrated (mirrored?) to fast
> storage.

Solid state disk often has a higher failure rate than normal disk and a limited write cycle. Hence it is often desirable to try and redesign the filesystem to do fewer writes when it is on (for example) compact flash, so moving hot blocks to fast storage can have consequences.

But then there is also this new storage paradigm in the e-rags where a hard drive also has some amount of solid state storage to speed up the boot time. It'll be interesting to see how that plays out, but I suspect the idea is that in the relevant market (PCs), it'll be used for things like drivers and OS core image files that do not change very often.

Darren
Re: [zfs-discuss] ZFS questions
Darren Reed wrote:
> Solid state disk often has a higher failure rate than normal disk and a
> limited write cycle. Hence it is often desirable to try and redesign the
> filesystem to do fewer writes when it is on (for example) compact flash,
> so moving hot blocks to fast storage can have consequences.

Solid state storage does not necessarily mean flash. For example, I have recently performed some testing of Sun's Directory Server in conjunction with solid state disks from two different vendors. Both of these used standard DRAM, so there's no real limit to the number of writes that can be performed. They have lots of internal redundancy features (e.g., ECC memory with chipkill, redundant power supplies, internal UPSes, and internal hard drives to protect against extended power outages), but both vendors said that customers often use other forms of redundancy (e.g., mirror to traditional disk, or RAID across multiple solid-state devices).

One of the vendors mentioned that both SVM and VxVM have the ability to designate one disk in a mirror as write-only (unless the other has failed), which can be good for providing redundancy with cheaper, traditional storage. All reads would still come from the solid state storage, so they would be very fast, and as long as the write rate doesn't exceed that of the traditional disk, there wouldn't be much adverse performance impact from the slower disk in the mirror. I don't believe that ZFS has this capability, but it could be something worth looking into. The original suggestion provided in this thread would potentially work well in that kind of setup.

ZFS with compression can also provide a notable win because the compression can significantly reduce the amount of storage required, which can help cut down on the costs. Solid state disks like this are expensive (both of the 32GB disks that I tested list at around $60K), so controlling costs is important.

Neil
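For reference, the compression Neil mentions is a per-dataset property; a minimal sketch with invented pool and dataset names:

# Enable compression before loading the directory data.
zfs set compression=on tank/ds

# ... load data ...

# compressratio reports how much physical space compression saved.
zfs get used,compressratio tank/ds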
Re: [zfs-discuss] ZFS questions
On 6/17/06, Neil A. Wilson [EMAIL PROTECTED] wrote:
> Darren Reed wrote:
>> Solid state disk often has a higher failure rate than normal disk and a
>> limited write cycle. Hence it is often desirable to try and redesign
>> the filesystem to do fewer writes when it is on (for example) compact
>> flash, so moving hot blocks to fast storage can have consequences.

I mentioned solid state (assuming DRAM-based) and 750 GB drives as the two ends of the spectrum available. Most people will find their extremes are each closer to the middle of the spectrum. Possibly a multi-tier approach including 73 GB FC, 300 GB FC, and 500 GB SATA would be more likely in most shops.

> Solid state disks like this are expensive (both of the 32GB disks that I
> tested list at around $60K), so controlling costs is important.

If you remove "enterprise" from the solid state disk equation, consider this at $150 + the cost of 4 1 GB DDR DIMMs. I suppose you could mirror across a pair of them and still have a pretty fast small 4GB of space for less than $1k.

http://www.anandtech.com/storage/showdoc.aspx?i=2480

FWIW, google gives plenty of hits for "solid state disk" terabyte.

Mike

--
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] ZFS questions
Saying "Solid State disk" in the storage arena means battery-backed DRAM (or, rarely, NVRAM). It does NOT include the various forms of solid-state memory (compact flash, SD, MMC, etc.); "Flash disk" is reserved for those kinds of devices. This is historical, since Flash disk hasn't been functionally usable in the Enterprise Storage arena until the last year or so. Battery-backed DRAM as a disk has been around for a very long time, though. :-)

We've all talked about adding the ability to change read/write policy on a pool's vdevs for awhile. There are a lot of good reasons that this is desirable. However, I'd like to try to separate this request from HSM, and not immediately muddy the waters by trying to lump too many things together. That is, start out with adding the ability to differentiate between access policy in a vdev. Generally, we're talking only about mirror vdevs right now. Later on, we can consider the ability to migrate data based on performance, but a lot of this has to take into consideration snapshot capability and such, so is a bit less straightforward.

And, on a not completely tangential side note: WTF is up with the costs for Solid State disks? I mean, prices well over $1k per GB are typical, which is absolutely ludicrous. The DRAM itself is under $100/GB, and these devices are idiot-simple to make. In the minimalist case, it's simply DIMM slots, a NiCad battery and trickle charger, and a SCSI/SATA/FC interface chip. Even in the fancy case, where you provide a backup drive to copy the DRAM contents to in case of power failure, it's a trivial engineering exercise. I realize there is (currently) a small demand for these devices, but honestly, I'm pretty sure that if they reduced the price by a factor of 3, they'd see 10x or maybe even 100x the volume, 'cause these little buggers are just so damned useful.

Oh, and the newest thing in the consumer market is called "hybrid" drives, which is a melding of a Flash drive with a Winchester drive. It's originally targetted at the laptop market - think a 1GB flash memory welded to a 40GB 2.5" hard drive in the same form-factor. You don't replace the DRAM cache on the HD - it's still there for fast-write response. But all the frequently used blocks get scheduled to be placed on the Flash part of the drive, while the mechanical part actually holds a copy of everything. The Flash portion is there for power efficiency as well as performance.

-Erik
Re: [zfs-discuss] ZFS questions
On Jun 16, 2006, at 11:40 PM, Richard Elling wrote:
> Kimberly Chang wrote:
>> A couple of ZFS questions:
>> 1. ZFS dynamic striping will automatically use newly added devices when
>> there are write requests. Customer has a *mostly read-only* application
>> with an I/O bottleneck. They wonder if there is a ZFS command or
>> mechanism to enable the manual rebalancing of ZFS data when adding new
>> drives to an existing pool?
>
> cp :-)
> If you copy the file then the new writes will be spread across the newly
> added drives. It doesn't really matter how you do the copy, though.

She raises an interesting point, though. The concept of shifting blocks in a zpool around in the background as part of a scrubbing process and/or on the order of an explicit command to populate newly added devices seems like it could be right up ZFS's alley. Perhaps it could also be done with volume-level granularity.

Off the top of my head, an area where this would be useful is performance management - e.g. relieving load on a particular FC interconnect or an overburdened RAID array controller/cache, thus allowing total no-downtime-to-cp-data-around flexibility when one is horizontally scaling storage performance.

/dale
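To make Richard's "cp :-)" answer concrete: after a vdev is added, rewriting an existing file causes ZFS to reallocate its blocks, and dynamic striping then favors the emptier new vdev. A hedged sketch with invented names:

# Grow the pool; new writes now stripe across both vdevs.
zpool add tank mirror c3t0d0 c3t1d0

# Rewrite an existing file so its blocks are reallocated.
cp /tank/data/hotfile /tank/data/hotfile.new
mv /tank/data/hotfile.new /tank/data/hotfile

# Caveat: snapshots keep referencing the old blocks, so the old copies
# stay allocated until those snapshots are destroyed.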