Re: [zfs-discuss] zfs questions wrt unused blocks

2010-02-16 Thread heinz zerbes

Richard,

thanks for the heads-up. I found some material here that sheds a bit 
more light on it:



http://en.wikipedia.org/wiki/ZFS
http://all-unix.blogspot.com/2007/04/transaction-file-system-and-cow.html

Regards,
heinz


Richard Elling wrote:

On Feb 15, 2010, at 8:43 PM, heinz zerbes wrote:
  

Gents,

We want to understand the mechanism of zfs a bit better.

Q: what is the design/algorithm of zfs in terms of reclaiming unused blocks?
Q: what criteria are there for zfs to start reclaiming blocks?



The answer to these questions is too big for an email. Think of
ZFS as a very dynamic system with many different factors influencing
block allocation.

  

Issue at hand is an LDOM or zone running in a virtual (thin-provisioned) disk 
on an NFS server and a zpool inside that vdisk.
This vdisk tends to grow in size even when the user writes and then deletes a 
file. The question is whether this reclaiming of unused blocks can kick in 
earlier, so that the vdisk doesn't grow much beyond what is actually 
allocated.



ZFS is a COW file system, which partly explains what you are seeing.
Snapshots, deduplication, and the ZIL complicate the picture.
 -- richard


ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 15-17, 2010)



  


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zfs questions wrt unused blocks

2010-02-15 Thread heinz zerbes




Gents,

We want to understand the mechanism of zfs a bit better.

Q: what is the design/algorithm of zfs in terms of reclaiming unused
blocks?
Q: what criteria are there for zfs to start reclaiming blocks?

Issue at hand is an LDOM or zone running in a virtual
(thin-provisioned) disk on an NFS server and a zpool inside that vdisk.
This vdisk tends to grow in size even when the user writes and then deletes a
file. The question is whether this reclaiming of unused blocks can
kick in earlier, so that the vdisk doesn't grow much beyond
what is actually allocated.

Thanks,
heinz



-- 
Heinz Zerbes
Security Consultant and Auditor
Sun Microsystems Australia
33 Berry St., North Sydney, NSW 2060 AU
Phone x59468 / +61 2 9466 9468
Mobile +61 410 727 961
Fax +61 2 9466 9411
Email heinz.zer...@sun.com


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs questions wrt unused blocks

2010-02-15 Thread Richard Elling
On Feb 15, 2010, at 8:43 PM, heinz zerbes wrote:
 
 Gents,
 
 We want to understand the mechanism of zfs a bit better.
 
 Q: what is the design/algorithm of zfs in terms of reclaiming unused blocks?
 Q: what criteria are there for zfs to start reclaiming blocks?

The answer to these questions is too big for an email. Think of
ZFS as a very dynamic system with many different factors influencing
block allocation.

 Issue at hand is an LDOM or zone running in a virtual (thin-provisioned) disk 
 on an NFS server and a zpool inside that vdisk.
 This vdisk tends to grow in size even when the user writes and then deletes a 
 file. The question is whether this reclaiming of unused blocks can kick in 
 earlier, so that the vdisk doesn't grow much beyond what is actually 
 allocated.

ZFS is a COW file system, which partly explains what you are seeing.
Snapshots, deduplication, and the ZIL complicate the picture.
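A quick way to see the effect is to compare what the pool reports inside the
guest with what the backing file consumes on the NFS server. A rough sketch
(pool name and path below are made up):

# zfs list -r guestpool
(inside the LDOM/zone: the logical space the pool thinks is in use)
# du -h /export/vdisks/guest.img
(on the NFS server: the blocks the backing vdisk file actually occupies)

Because the guest's ZFS never tells the backing store that blocks have been
freed, and copy-on-write tends to touch previously unwritten offsets, the du
figure can keep growing even after files are deleted inside the guest.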
 -- richard


ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 15-17, 2010)



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zfs: questions on ARC membership based on type/ordering of Reads/Writes

2009-11-25 Thread Andrew . Rutz

I am trying to understand the ARC's behavior based on different
permutations of (a)sync Reads and (a)sync Writes.

thank you, in advance


o does the data for a *sync-write* *ever* go into the ARC?
  eg, my understanding is that the data goes to the ZIL (and
  the SLOG, if present), but how does it get from the ZIL to the ZIO layer?
  eg, does it go to the ARC on its way to the ZIO ?
  o if the sync-write-data *does* go to the ARC, does it go to
the ARC *after* it is written to the ZIL's backing-store,
or does the data go to the ZIL and the ARC in parallel ?
o if a sync-write's data goes to the ARC and ZIL *in parallel*,
  then does zfs prevent an ARC-hit until the data is confirmed
  to be on the ZIL's nonvolatile media (eg, disk-platter or SLOG) ?
  or could a Read get an ARC-hit on a block *before* it's written
  to zil's backing-store?


o is the DMU where the Serialization of transactions occurs?

o if an async-Write for block-X hits the Serializer before a Read
  for block-X hits the Serializer, i am assuming the Read can
  pass the async-Write; eg, the Read is *not* pended behind the
  async-write.  however, if a Read hits the Serializer after a
  *sync*-write, then i'm assuming the Read is pended until
  the sync-write is written to the ZIL's nonvolatile media.
  o if a Read passes an async-write, then i'm assuming the Read
can be satisfied by either the arc, l2arc, or disk.

o it's stated that the L2ARC is for random-reads.  however, there's
  nothing to prevent the L2ARC from containing blocks derived from
  *sequential*-reads, right ?   also, blocks from async-writes can
  also live in l2arc, right?  how about sync-writes ?

o is the l2arc literally simply a *larger* ARC?  eg, does the l2arc
  obey the normal cache property where everything that is in the L1$
  (eg, ARC) is also in the L2$ (eg, l2arc) ?  (I have a feeling that
  the set-theoretic intersection of ARC and L2ARC is empty, for some
  reason.)
  o does the l2arc use the ARC algorithm (as the name suggests) ?

thank you,

/andrew
Solaris RPE
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs: questions on ARC membership based on type/ordering of Reads/Writes

2009-11-25 Thread Richard Elling

On Nov 25, 2009, at 11:55 AM, andrew.r...@sun.com wrote:


I am trying to understand the ARC's behavior based on different
permutations of (a)sync Reads and (a)sync Writes.

thank you, in advance


o does the data for a *sync-write* *ever* go into the ARC?


always


 eg, my understanding is that the data goes to the ZIL (and
 the SLOG, if present), but how does it get from the ZIL to the ZIO  
layer?


ZIL is effectively write-only.  It is only read when the pool is  
imported.



 eg, does it go to the ARC on its way to the ZIO ?


ARC is the cache for buffering data.


 o if the sync-write-data *does* go to the ARC, does it go to
   the ARC *after* it is written to the ZIL's backing-store,
   or does the data go to the ZIL and the ARC in parallel ?


A sync write returns when the data is written to the ZIL.
An async write returns when the data is in the ARC, and later
the unwritten contents of the ARC are pushed to the pool when
the transaction group is committed.
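One way to watch the two paths side by side is to count calls into the ZIL
commit and pool sync code. A rough sketch (assumes the fbt probes for these
ZFS kernel functions are available on your build):

# dtrace -n 'fbt::zil_commit:entry { @["zil_commit (sync write path)"] = count(); }' \
         -n 'fbt::spa_sync:entry { @["spa_sync (txg commit)"] = count(); }'

Run a sync-heavy workload and then an async-heavy one and compare the two
counts when you interrupt the script.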


   o if a sync-write's data goes to the ARC and ZIL *in parallel*,
 then does zfs prevent an ARC-hit until the data is confirmed
 to be on the ZIL's nonvolatile media (eg, disk-platter or SLOG) ?
 or could a Read get an ARC-hit on a block *before* it's written
 to zil's backing-store?


In my mind, the ARC and ZIL are orthogonal.


o is the DMU where the Serialization of transactions occurs?


Serialization?


o if an async-Write for block-X hits the Serializer before a Read
 for block-X hits the Serializer, i am assuming the Read can
 pass the async-Write; eg, the Read is *not* pended behind the
 async-write.  however, if a Read hits the Serializer after a
 *sync*-write, then i'm assuming the Read is pended until
 the sync-write is written to the ZIL's nonvolatile media.
 o if a Read passes an async-write, then i'm assuming the Read
   can be satisfied by either the arc, l2arc, or disk.


I think you are asking if write order is preserved. The answer is yes.


o it's stated that the L2ARC is for random-reads.  however, there's
 nothing to prevent the L2ARC from containing blocks derived from
 *sequential*-reads, right ?   also, blocks from async-writes can
 also live in l2arc, right?  how about sync-writes ?


Blocks which are not yet committed to the pool are locked in the
ARC so they can't be evicted. Once committed, the lock is removed.


o is the l2arc literally simply a *larger* ARC?  eg, does the l2arc
 obey the normal cache property where everything that is in the L1$
 (eg, ARC) is also in the L2$ (eg, l2arc) ?  (I have a feeling that
 the set-theoretic intersection of ARC and L2ARC is empty, for some
 reason.)


No. The L2ARC is not in the datapath between the ARC and media.
Further, data is not evicted from the ARC into the L2ARC. Rather,
the L2ARC is filled from data near the eviction ends of the MRU and
MFU lists. The movement of data to the L2ARC is throttled and
grouped in sequence, improving efficiency for devices which like
large writes, such as read-optimized flash.

Think of it this way. Data which is in the ARC is fed into the L2ARC.
If the data is later evicted from the ARC, it can still live in the  
L2ARC.

When the L2ARC has lower read latency than the pool's media,
it can improve performance because the data can be read from
the L2ARC instead of the pool. This fits the general definition of a cache,
but it does not work the same way as multilevel CPU caches.
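The L2ARC feed and hit behaviour is visible in the ARC kstats. A sketch
(statistic names are from the arc.c kstat set and may vary by build):

# kstat -p zfs:0:arcstats | egrep 'l2_(hits|misses|size|write_bytes)'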


 o does the l2arc use the ARC algorithm (as the name suggests) ?


Yes, but it really isn't separate from the ARC, from a management point of
view. To fully understand it, you need to know about how the metadata
for each buffer in the ARC is managed.  This will introduce the concept
of the ghosts, and the L2ARC is a simple extension.  The comments
in the source are nicely descriptive, and you might consider reading them
through once, even if you don't dive into the code itself:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c

 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS questions

2008-07-23 Thread Richard Gilmore
Hello Zfs Community,

I am trying to find out whether zfs has a tool comparable to Veritas's 
vxbench.  Any ideas?  I see a tool called vdbench that looks close, but 
it is not a Sun tool.  Does Sun recommend something to customers moving 
from Veritas to ZFS who like vxbench and its capabilities?

Thanks,
Richard


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS questions

2008-07-23 Thread Thommy M.
Richard Gilmore wrote:
 Hello Zfs Community,
 
 I am trying to find out whether zfs has a tool comparable to Veritas's 
 vxbench.  Any ideas?  I see a tool called vdbench that looks close, but 
 it is not a Sun tool.  Does Sun recommend something to customers moving 
 from Veritas to ZFS who like vxbench and its capabilities?


filebench

http://sourceforge.net/projects/filebench/
http://www.solarisinternals.com/wiki/index.php/FileBench
http://blogs.sun.com/dom/entry/filebench:_a_zfs_v_vxfs

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS questions

2008-07-23 Thread Richard Elling
Thommy M. wrote:
 Richard Gilmore wrote:
   
 Hello Zfs Community,

 I am trying to find out whether zfs has a tool comparable to Veritas's 
 vxbench.  Any ideas?  I see a tool called vdbench that looks close, but 
 it is not a Sun tool.  Does Sun recommend something to customers moving 
 from Veritas to ZFS who like vxbench and its capabilities?
 


 filebench

 http://sourceforge.net/projects/filebench/
 http://www.solarisinternals.com/wiki/index.php/FileBench
 http://blogs.sun.com/dom/entry/filebench:_a_zfs_v_vxfs
   

Also, /usr/benchmarks/filebench for later Solaris releases.
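As a rough sketch of a typical interactive run (the varmail workload is just
an example, and the path/binary name are from memory, so adjust to your
install):

# /usr/benchmarks/filebench/bin/go_filebench
filebench> load varmail
filebench> run 60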

IIRC, vdbench is in the process of becoming open source, but I do
not know the current status.
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS questions

2006-12-14 Thread Dave Burleson


I will have a file system in a SAN using ZFS.  Can someone answer my
questions?

1. Can I create  ZFS volumes on a ZFS file system from one server,
attach the file system read-write to a different server (to load data),
then detach the file system from that server and attach the file system
read-only to multiple other servers?

2. Can I expand a ZFS volume within a file system on the fly while the
file system is attached to a server read-write?

Thanks for your help,

Dave


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS questions with mirrors

2006-08-21 Thread Peter Wilk
IHAC (I have a customer) who is asking the following; any thoughts would be appreciated.

Take two drives, zpool to make a mirror.
Remove a drive - and the server HANGS. Power off and reboot the server,
and everything comes up cleanly.

Take the same two drives (still Solaris 10). Install Veritas Volume
Manager (4.1). Mirror the two drives. Remove a drive - everything is
still running. Replace the drive, everything still working. No outage.

So the big questions to Tech support:
1. Is this a known property of ZFS ? That when a drive from a hot swap
system is removed the server hangs ? (We were attempting to simulate a
drive failure)
2. Or is this just because it was an E450 ? Ie, would removing a zfs
mirror disk (unexpected hardware removal as opposed to using zfs to
remove the disk) on a V240 or V480 cause the same problem ?
3. What could we expect if a drive mysteriously failed during
operation of a server with a zfs mirror ? Would the server hang like it
did during testing ? How can we test this ?
4. If it is a known property of zfs, is there a date when it is
expected to be fixed (if ever) ?



Peter

PS: I may not be on this alias so please respond to me directly
-- 
=
Peter Wilk - OS/Security Support
Sun Microsystems
1 Network Drive, P.O Box 4004
Burlington, Massachusetts 01803-0904
1-800-USA-4SUN, opt 1, opt 1, case number#
Email: [EMAIL PROTECTED]
=



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS questions with mirrors

2006-08-21 Thread Eric Schrock
The current behavior depends on the implementation of the driver and
support for hotplug events.  When a drive is yanked, one of two things
can happen:

- I/Os will fail, and any attempt to re-open the device will result in
  failure.

- I/Os will fail, but the device can continue to be opened by its
  existing path.

ZFS currently handles case #1 and will mark the device faulted,
generating an FMA fault in the process.  Future ZFS/FMA integration will
address problem #2, and is on the short list of features to address.  In
the meantime, you can 'zpool offline' the bad device to prevent ZFS from
trying to access it.
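For example (pool and device names below are placeholders):

# zpool offline tank c1t2d0
(take the pulled/suspect disk out of service so ZFS stops retrying it)
# zpool replace tank c1t2d0
(after the disk has been physically replaced at the same location)
# zpool online tank c1t2d0
(or, if the original disk simply reappears, bring it back and let it resilver)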

That being said, the server should never hang - only proceed arbitrarily
slowly.  When you say 'hang', what does that mean?

- Eric

--
Eric Schrock, Solaris Kernel Development   http://blogs.sun.com/eschrock
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS questions

2006-07-28 Thread John Cecere
Can someone explain to me what the 'volinit' and 'volfini' options to zfs do ? It's not obvious from the source code and these 
options are undocumented.


Thanks,
John


--
John Cecere
Sun Microsystems
732-302-3922 / [EMAIL PROTECTED]
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zfs questions from Sun customer

2006-07-26 Thread David Curtis
Please reply to [EMAIL PROTECTED]

 Background / configuration **

zpool will not create a storage pool on fibre channel storage.  I'm
attached to an IBM SVC using the IBMsdd driver.  I have no problem using
SVM metadevices and UFS on these devices.

  List steps to reproduce the problem(if applicable):
Build Solaris 10 Update 2 server
Attach to an external storage array via IBM SVC
Load lpfc driver (6.02h)
Load IBMsdd software (1.6.1.0-2)
Attempt to use zpool create to make a storage pool:
# zpool create -f extdisk vpath1c
internal error: unexpected error 22 at line 446 of ../common/libzfs_pool.c

* reply to customer

It looks like you have an additional unwanted software layer between
Solaris and the disk hardware. Currently ZFS needs to access the
physical device to work correctly. Something like:

# zpool create -f extdisk c5t0d0 c5t1d0 ..

Let me know if this works for you.

* follow-up question from customer 

Yes, using the c#t#d# disks works, but anyone using fibre-channel storage
on something like IBM Shark or EMC Clariion will want multiple paths to
disk using either IBMsdd, EMCpower or Solaris native MPIO.  Does ZFS
work with any of these fibre channel multipathing drivers?


Thanks for any assistance you can provide.
-- 

David Curtis - TSE          Sun Microsystems
303-272-6628                Enterprise Services
[EMAIL PROTECTED]           OS / Installation Support
Monday to Friday            9:00 AM to 6:00 PM Mountain

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs questions from Sun customer

2006-07-26 Thread Eric Schrock
This suggests that there is some kind of bug in the layered storage
software.  ZFS doesn't do anything special to the underlying storage
device; it merely relies on a few ldi_*() routines.  I would try running
the following dtrace script:

#!/usr/sbin/dtrace -s

vdev_disk_open:return,
ldi_open_by_name:return,
ldi_open_by_path:return,
ldi_get_size:return
{
trace(arg1);
}

And then re-run your 'zpool create' command.  That will at least get
us pointed in the right direction.

- Eric

On Wed, Jul 26, 2006 at 09:47:03AM -0600, David Curtis wrote:
 Please reply to [EMAIL PROTECTED]
 
  Background / configuration **
 
 zpool will not create a storage pool on fibre channel storage.  I'm
 attached to an IBM SVC using the IBMsdd driver.  I have no problem using
 SVM metadevices and UFS on these devices.
 
   List steps to reproduce the problem(if applicable):
 Build Solaris 10 Update 2 server
 Attach to an external storage array via IBM SVC
 Load lpfc driver (6.02h)
 Load IBMsdd software (1.6.1.0-2)
 Attempt to use zpool create to make a storage pool:
 # zpool create -f extdisk vpath1c
 internal error: unexpected error 22 at line 446 of ../common/libzfs_pool.c
 
 * reply to customer
 
 It looks like you have an additional unwanted software layer between
 Solaris and the disk hardware. Currently ZFS needs to access the
 physical device to work correctly. Something like:
 
 # zpool create -f extdisk c5t0d0 c5t1d0 ..
 
 Let me know if this works for you.
 
 * follow-up question from customer 
 
 Yes, using the c#t#d# disks work, but anyone using fibre-channel storage
 on somethink like IBM Shark or EMC Clariion will want multiple paths to
 disk using either IBMsdd, EMCpower or Solaris native MPIO.  Does ZFS
 work with any of these fibre channel multipathing drivers?
 
 
 Thanks for any assistance you can provide.
 -- 
 
 David Curtis - TSESun Microsystems
 303-272-6628  Enterprise Services
 [EMAIL PROTECTED] OS / Installation Support
 Monday to Friday  9:00 AM to 6:00 PM Mountain
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

--
Eric Schrock, Solaris Kernel Development   http://blogs.sun.com/eschrock
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs questions from Sun customer

2006-07-26 Thread Edward Pilatowicz
zfs should work fine with disks under the control of solaris mpxio.
i don't know about any of the other multipathing solutions.

if you're trying to use a device that's controlled by another
multipathing solution, you might want to try specifying the full
path to the device, ex:
zpool create -f extdisk /dev/foo2/vpath1c

ed

On Wed, Jul 26, 2006 at 09:47:03AM -0600, David Curtis wrote:
 Please reply to [EMAIL PROTECTED]

  Background / configuration **

 zpool will not create a storage pool on fibre channel storage.  I'm
 attached to an IBM SVC using the IBMsdd driver.  I have no problem using
 SVM metadevices and UFS on these devices.

   List steps to reproduce the problem(if applicable):
 Build Solaris 10 Update 2 server
 Attach to an external storage array via IBM SVC
 Load lpfc driver (6.02h)
 Load IBMsdd software (1.6.1.0-2)
 Attempt to use zpool create to make a storage pool:
 # zpool create -f extdisk vpath1c
 internal error: unexpected error 22 at line 446 of ../common/libzfs_pool.c

 * reply to customer

 It looks like you have an additional unwanted software layer between
 Solaris and the disk hardware. Currently ZFS needs to access the
 physical device to work correctly. Something like:

 # zpool create -f extdisk c5t0d0 c5t1d0 ..

 Let me know if this works for you.

 * follow-up question from customer 

 Yes, using the c#t#d# disks work, but anyone using fibre-channel storage
 on somethink like IBM Shark or EMC Clariion will want multiple paths to
 disk using either IBMsdd, EMCpower or Solaris native MPIO.  Does ZFS
 work with any of these fibre channel multipathing drivers?


 Thanks for any assistance you can provide.
 --

 David Curtis - TSESun Microsystems
 303-272-6628  Enterprise Services
 [EMAIL PROTECTED] OS / Installation Support
 Monday to Friday  9:00 AM to 6:00 PM Mountain

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs questions from Sun customer

2006-07-26 Thread David Curtis
Eric,

Here is what the customer gets trying to create the pool using the
software alias: (I added all the ldi_open's to the script)
# zpool create -f extdisk vpath1c

# ./dtrace.script
dtrace: script './dtrace.script' matched 6 probes
CPU IDFUNCTION:NAME
  0   7233ldi_open_by_vp:return 0
  0  15801   ldi_open_by_dev:return 0
  0   7233ldi_open_by_vp:return 0
  0  15801   ldi_open_by_dev:return 0
  0   7233ldi_open_by_vp:return 0
  0  15801   ldi_open_by_dev:return 0
  0   7233ldi_open_by_vp:return 0
  0  15801   ldi_open_by_dev:return 0
  0   7233ldi_open_by_vp:return 0
  0  15801   ldi_open_by_dev:return 0
  0   7233ldi_open_by_vp:return 0
  0  15801   ldi_open_by_dev:return 0
  0   7233ldi_open_by_vp:return 0
  0  15801   ldi_open_by_dev:return 0
  0   7233ldi_open_by_vp:return 0
  0  15801   ldi_open_by_dev:return 0
  0   7233ldi_open_by_vp:return 0
  0  17817  ldi_open_by_name:return 0
  0  16191  ldi_get_size:return-1
  0  44942vdev_disk_open:return22

Thanks,
David

Eric Schrock wrote On 07/26/06 10:03 AM,:
 This suggests that there is some kind of bug in the layered storage
 software.  ZFS doesn't do anything special to the underlying storage
 device; it merely relies on a few ldi_*() routines.  I would try running
 the following dtrace script:
 
 #!/usr/sbin/dtrace -s
 
 vdev_disk_open:return,
 ldi_open_by_name:return,
 ldi_open_by_path:return,
 ldi_get_size:return
 {
   trace(arg1);
 }
 
 And then re-run your 'zpool create' command.  That will at least get
 us pointed in the right direction.
 
 - Eric
 
 On Wed, Jul 26, 2006 at 09:47:03AM -0600, David Curtis wrote:
 
Please reply to [EMAIL PROTECTED]

 Background / configuration **

zpool will not create a storage pool on fibre channel storage.  I'm
attached to an IBM SVC using the IBMsdd driver.  I have no problem using
SVM metadevices and UFS on these devices.

  List steps to reproduce the problem(if applicable):
Build Solaris 10 Update 2 server
Attach to an external storage array via IBM SVC
Load lpfc driver (6.02h)
Load IBMsdd software (1.6.1.0-2)
Attempt to use zpool create to make a storage pool:
# zpool create -f extdisk vpath1c
internal error: unexpected error 22 at line 446 of ../common/libzfs_pool.c

* reply to customer

It looks like you have an additional unwanted software layer between
Solaris and the disk hardware. Currently ZFS needs to access the
physical device to work correctly. Something like:

# zpool create -f extdisk c5t0d0 c5t1d0 ..

Let me know if this works for you.

* follow-up question from customer 

Yes, using the c#t#d# disks work, but anyone using fibre-channel storage
on somethink like IBM Shark or EMC Clariion will want multiple paths to
disk using either IBMsdd, EMCpower or Solaris native MPIO.  Does ZFS
work with any of these fibre channel multipathing drivers?


Thanks for any assistance you can provide.
-- 

David Curtis - TSESun Microsystems
303-272-6628  Enterprise Services
[EMAIL PROTECTED] OS / Installation Support
Monday to Friday  9:00 AM to 6:00 PM Mountain

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 
 
 --
 Eric Schrock, Solaris Kernel Development   http://blogs.sun.com/eschrock

-- 

David Curtis - TSE          Sun Microsystems
303-272-6628                Enterprise Services
[EMAIL PROTECTED]           OS / Installation Support
Monday to Friday            9:00 AM to 6:00 PM Mountain

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs questions from Sun customer

2006-07-26 Thread Eric Schrock
So it does look like something's messed up here.  Before we pin this
down as a driver bug, we should double check that we are indeed opening
what we think we're opening, and try to track down why ldi_get_size is
failing.  Try this:

#!/usr/sbin/dtrace -s

ldi_open_by_name:entry
{
trace(stringof(args[0]));
}

ldi_prop_exists:entry
{
trace(stringof(args[2]));
}

ldi_prop_exists:return
{
trace(arg1);
}

ldi_get_otyp:return
{
trace(arg1);
}

- Eric


On Wed, Jul 26, 2006 at 12:49:35PM -0600, David Curtis wrote:
 Eric,
 
 Here is what the customer gets trying to create the pool using the
 software alias: (I added all the ldi_open's to the script)
 # zpool create -f extdisk vpath1c
 
 # ./dtrace.script
 dtrace: script './dtrace.script' matched 6 probes
 CPU IDFUNCTION:NAME
   0   7233ldi_open_by_vp:return 0
   0  15801   ldi_open_by_dev:return 0
   0   7233ldi_open_by_vp:return 0
   0  15801   ldi_open_by_dev:return 0
   0   7233ldi_open_by_vp:return 0
   0  15801   ldi_open_by_dev:return 0
   0   7233ldi_open_by_vp:return 0
   0  15801   ldi_open_by_dev:return 0
   0   7233ldi_open_by_vp:return 0
   0  15801   ldi_open_by_dev:return 0
   0   7233ldi_open_by_vp:return 0
   0  15801   ldi_open_by_dev:return 0
   0   7233ldi_open_by_vp:return 0
   0  15801   ldi_open_by_dev:return 0
   0   7233ldi_open_by_vp:return 0
   0  15801   ldi_open_by_dev:return 0
   0   7233ldi_open_by_vp:return 0
   0  17817  ldi_open_by_name:return 0
   0  16191  ldi_get_size:return-1
   0  44942vdev_disk_open:return22

--
Eric Schrock, Solaris Kernel Development   http://blogs.sun.com/eschrock
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs questions from Sun customer

2006-07-26 Thread David Curtis
Eric,

Here is the output:

# ./dtrace2.dtr
dtrace: script './dtrace2.dtr' matched 4 probes
CPU IDFUNCTION:NAME
  0  17816   ldi_open_by_name:entry   /dev/dsk/vpath1c
  0  16197  ldi_get_otyp:return 0
  0  15546ldi_prop_exists:entry   Nblocks
  0  15547   ldi_prop_exists:return 0
  0  15546ldi_prop_exists:entry   nblocks
  0  15547   ldi_prop_exists:return 0
  0  15546ldi_prop_exists:entry   Size
  0  15547   ldi_prop_exists:return 0
  0  15546ldi_prop_exists:entry   size
  0  15547   ldi_prop_exists:return 0

Thanks,
David

Eric Schrock wrote On 07/26/06 01:01 PM,:
 So it does look like something's messed up here.  Before we pin this
 down as a driver bug, we should double check that we are indeed opening
 what we think we're opening, and try to track down why ldi_get_size is
 failing.  Try this:
 
 #!/usr/sbin/dtrace -s
 
 ldi_open_by_name:entry
 {
   trace(stringof(args[0]));
 }
 
 ldi_prop_exists:entry
 {
   trace(stringof(args[2]));
 }
 
 ldi_prop_exists:return
 {
   trace(arg1);
 }
 
 ldi_get_otyp:return
 {
   trace(arg1);
 }
 
 - Eric
 
 
 On Wed, Jul 26, 2006 at 12:49:35PM -0600, David Curtis wrote:
 
Eric,

Here is what the customer gets trying to create the pool using the
software alias: (I added all the ldi_open's to the script)
# zpool create -f extdisk vpath1c

# ./dtrace.script
dtrace: script './dtrace.script' matched 6 probes
CPU IDFUNCTION:NAME
  0   7233ldi_open_by_vp:return 0
  0  15801   ldi_open_by_dev:return 0
  0   7233ldi_open_by_vp:return 0
  0  15801   ldi_open_by_dev:return 0
  0   7233ldi_open_by_vp:return 0
  0  15801   ldi_open_by_dev:return 0
  0   7233ldi_open_by_vp:return 0
  0  15801   ldi_open_by_dev:return 0
  0   7233ldi_open_by_vp:return 0
  0  15801   ldi_open_by_dev:return 0
  0   7233ldi_open_by_vp:return 0
  0  15801   ldi_open_by_dev:return 0
  0   7233ldi_open_by_vp:return 0
  0  15801   ldi_open_by_dev:return 0
  0   7233ldi_open_by_vp:return 0
  0  15801   ldi_open_by_dev:return 0
  0   7233ldi_open_by_vp:return 0
  0  17817  ldi_open_by_name:return 0
  0  16191  ldi_get_size:return-1
  0  44942vdev_disk_open:return22
 
 
 --
 Eric Schrock, Solaris Kernel Development   http://blogs.sun.com/eschrock

-- 

David Curtis - TSE          Sun Microsystems
303-272-6628                Enterprise Services
[EMAIL PROTECTED]           OS / Installation Support
Monday to Friday            9:00 AM to 6:00 PM Mountain

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs questions from Sun customer

2006-07-26 Thread Eric Schrock
On Wed, Jul 26, 2006 at 02:11:44PM -0600, David Curtis wrote:
 Eric,
 
 Here is the output:
 
 # ./dtrace2.dtr
 dtrace: script './dtrace2.dtr' matched 4 probes
 CPU IDFUNCTION:NAME
   0  17816   ldi_open_by_name:entry   /dev/dsk/vpath1c
   0  16197  ldi_get_otyp:return 0
   0  15546ldi_prop_exists:entry   Nblocks
   0  15547   ldi_prop_exists:return 0
   0  15546ldi_prop_exists:entry   nblocks
   0  15547   ldi_prop_exists:return 0
   0  15546ldi_prop_exists:entry   Size
   0  15547   ldi_prop_exists:return 0
   0  15546ldi_prop_exists:entry   size
   0  15547   ldi_prop_exists:return 0
 

OK, this definitely seems to be a driver bug.  I'm no driver expert, but
it seems that exporting none of the above properties is a problem - ZFS
has no idea how big this disk is!  Perhaps someone more familiar with
the DDI/LDI interfaces can explain the appropriate way to implement
these on the driver end.

But at this point it's safe to say that ZFS isn't doing anything wrong.
The layered driver is exporting a device in /dev/dsk, but not exporting
basic information (such as the size or number of blocks) that ZFS (and
potentially the rest of Solaris) needs to interact with the device.

- Eric

--
Eric Schrock, Solaris Kernel Development   http://blogs.sun.com/eschrock
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs questions from Sun customer

2006-07-26 Thread Torrey McMahon
Does format show these drives to be available and containing a non-zero 
size?


Eric Schrock wrote:

On Wed, Jul 26, 2006 at 02:11:44PM -0600, David Curtis wrote:
  

Eric,

Here is the output:

# ./dtrace2.dtr
dtrace: script './dtrace2.dtr' matched 4 probes
CPU IDFUNCTION:NAME
  0  17816   ldi_open_by_name:entry   /dev/dsk/vpath1c
  0  16197  ldi_get_otyp:return 0
  0  15546ldi_prop_exists:entry   Nblocks
  0  15547   ldi_prop_exists:return 0
  0  15546ldi_prop_exists:entry   nblocks
  0  15547   ldi_prop_exists:return 0
  0  15546ldi_prop_exists:entry   Size
  0  15547   ldi_prop_exists:return 0
  0  15546ldi_prop_exists:entry   size
  0  15547   ldi_prop_exists:return 0




OK, this definitely seems to be a driver bug.  I'm no driver expert, but
it seems that exporting none of the above properties is a problem - ZFS
has no idea how big this disk is!  Perhaps someone more familiar with
the DDI/LDI interfaces can explain the appropriate way to implement
these on the driver end.

But at this point its safe to say that ZFS isn't doing anything wrong.
The layered driver is exporting a device in /dev/dsk, but not exporting
basic information (such as the size or number of blocks) that ZFS (and
potentially the rest of Solaris) needs to interact with the device.

- Eric

--
Eric Schrock, Solaris Kernel Development   http://blogs.sun.com/eschrock
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
  


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs questions from Sun customer

2006-07-26 Thread Edward Pilatowicz
zfs depends on ldi_get_size(), which depends on the device being
accessed exporting one of the properties below.  i guess the
devices generated by IBMsdd and/or EMCpower don't
generate these properties.
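for comparison, running the same script while creating a pool on a plain
c#t#d# device should show at least one of the Nblocks/nblocks/Size/size
lookups succeeding, which is what ldi_get_size() needs.  sketch (device
name is just an example):

# ./dtrace2.dtr
(in one window)
# zpool create -f testpool c5t0d0
(in another window, against a known-good device)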

ed


On Wed, Jul 26, 2006 at 01:53:31PM -0700, Eric Schrock wrote:
 On Wed, Jul 26, 2006 at 02:11:44PM -0600, David Curtis wrote:
  Eric,
 
  Here is the output:
 
  # ./dtrace2.dtr
  dtrace: script './dtrace2.dtr' matched 4 probes
  CPU IDFUNCTION:NAME
0  17816   ldi_open_by_name:entry   /dev/dsk/vpath1c
0  16197  ldi_get_otyp:return 0
0  15546ldi_prop_exists:entry   Nblocks
0  15547   ldi_prop_exists:return 0
0  15546ldi_prop_exists:entry   nblocks
0  15547   ldi_prop_exists:return 0
0  15546ldi_prop_exists:entry   Size
0  15547   ldi_prop_exists:return 0
0  15546ldi_prop_exists:entry   size
0  15547   ldi_prop_exists:return 0
 

 OK, this definitely seems to be a driver bug.  I'm no driver expert, but
 it seems that exporting none of the above properties is a problem - ZFS
 has no idea how big this disk is!  Perhaps someone more familiar with
 the DDI/LDI interfaces can explain the appropriate way to implement
 these on the driver end.

 But at this point its safe to say that ZFS isn't doing anything wrong.
 The layered driver is exporting a device in /dev/dsk, but not exporting
 basic information (such as the size or number of blocks) that ZFS (and
 potentially the rest of Solaris) needs to interact with the device.

 - Eric

 --
 Eric Schrock, Solaris Kernel Development   http://blogs.sun.com/eschrock
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Questions. (RAID-Z questions actually)

2006-07-03 Thread Casper . Dik


I understand the copy-on-write thing. That was very well illustrated in 
ZFS The Last Word in File Systems by Jeff Bonwick.

But if every block is its own RAID-Z stripe and the block is lost, how 
does ZFS recover the block???

You should perhaps not take block literally; the block is written as
part of a single transaction on all disks of the RAID-Z group.

Only when the block is stored on disk, the bits referencing them will
be written.  For the whole block to be lost, all disks need to be lost
or the transaction must not occur.

Is the stripe parity (as opposed to the block checksum, which I understand) 
stored somewhere else or within the same block?

Parts of the block are written to each disk; the parity is written to
the parity disk.

But how exactly does every RAID-Z write is a full stripe write works? 
More specifically, if in a 3 disk RAID-Z configuration, if one disk 
fails completely and is replaced, exactly how does the metadata driven 
reconstruction recover the newly replaced disk?

The metadata driven reconstruction will take the ueberblock and from there
it will re-read the other disks and reconstruct the parity while also
verifying checksums.

Not all data needs to be read and not all parity needs to be computed;
only the bits of disks which are actually in use are verified and have
their parity recomputed.


Well, the tricky bit here is RAID-Z reconstruction. Because the 
stripes are all different sizes, there's no simple formula like all the 
disks XOR to zero. You have to traverse the filesystem metadata to 
determine the RAID-Z geometry. Note that this would be impossible if the 
filesystem and the RAID array were separate products, which is why 
there's nothing like RAID-Z in the storage market today. You really need 
an integrated view of the logical and physical structure of the data to 
pull it off.

Every stripe is a different size? Is this because ZFS adapts to the nature 
of the I/O coming to it?

It's because the blocks written are all of different sizes.



So if you write a 128K block on a 3 way RAID-Z, this can be written as
2x64K of data + 1x64K of parity.

(Though I must admit that in such a scheme the disks still XOR to zero, at 
least the bits of disk used)

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS questions

2006-06-22 Thread Gregory Shaw
So, based on the below, there should be no reason why a flash-based  
ZFS filesystem should need to do anything special to avoid problems.


That's a Good Thing.

I think that using flash as the system disk will be the way to go.
Using flash as read-only with a disk or memory for read-write would  
result in a very fast system with fewer points of failure...


On Jun 20, 2006, at 6:23 PM, Nathan Kroenert wrote:


And, this is a worst case, no?

If the device itself also does some funky stuff under the covers, and
ZFS only writes an update if there is *actually* something to write,
then it could be much much longer than 4 years.

Actually - That's an interesting point. I assume ZFS only writes something
when there is actually data?

:)

Nathan.

On Wed, 2006-06-21 at 06:25, Eric Schrock wrote:

On Tue, Jun 20, 2006 at 02:18:34PM -0600, Gregory Shaw wrote:

Wouldn't that be:

5 seconds per write = 86400/5 = 17280 writes per day
256 rotated locations for 17280/256 = 67 writes per location per day

Resulting in (100,000/67) ~1492 days or 4.08 years before failure?

That's still a long time, but it's not 100 years.


Yes, I goofed on the math.  It's still (256*100,000*5) seconds, but
somehow I managed to goof up the math.  I tried it again and came up
with 1,481 days.

- Eric

--
Eric Schrock, Solaris Kernel Development   http:// 
blogs.sun.com/eschrock

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

--

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


-
Gregory Shaw, IT Architect
Phone: (303) 673-8273    Fax: (303) 673-2773
ITCTO Group, Sun Microsystems Inc.
1 StorageTek Drive ULVL4-382  [EMAIL PROTECTED] (work)
Louisville, CO 80028-4382[EMAIL PROTECTED] (home)
When Microsoft writes an application for Linux, I've Won. - Linus  
Torvalds




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS questions

2006-06-20 Thread Richard Elling

Erik Trimble wrote:
That is, start out with adding the ability to differentiate between 
access policy in a vdev.  Generally, we're talking only about mirror 
vdevs right now.  Later on, we can consider the ability to migrate data 
based on performance, but a lot of this has to take into consideration 
snapshot capability and such, so is a bit less straightforward.


The policy is implemented on the read side, since you still need to
commit writes to all mirrors.  The implementation shouldn't be difficult;
deciding on the administrative interface will be the hardest part.

Oh, and the newest thing in the consumer market is called hybrid 
drives, which is a melding of a Flash drive with a Winchester drive.   
It's originally targetted at the laptop market - think a 1GB flash 
memory welded to a 40GB 2.5 hard drive in the same form-factor.  You 
don't replace the DRAM cache on the HD - it's still there for fast-write 
response. But all the frequently used blocks get scheduled to be 
placed on the Flash part of the drive, while the mechanical part 
actually holds a copy of everything.  The Flash portion is there for 
power efficiency as well as performance.


Flash is (can be) a bit more sophisticated.  The problem is that they
have a limited write endurance -- typically spec'ed at 100k writes to
any single bit.  The good flash drives use block relocation, spares, and
write spreading to avoid write hot spots.  For many file systems, the
place to worry is the block(s) containing your metadata.  ZFS inherently
spreads and mirrors its metadata, so it should be more appropriate for
flash devices than FAT or UFS.

Similarly, the disk drive manufacturers make extensive use of block sparing,
so applying that technique to the hybrid drives is expected.
 -- richard
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS questions

2006-06-20 Thread Jonathan Adams
On Tue, Jun 20, 2006 at 09:32:58AM -0700, Richard Elling wrote:
 Flash is (can be) a bit more sophisticated.  The problem is that they
 have a limited write endurance -- typically spec'ed at 100k writes to
 any single bit.  The good flash drives use block relocation, spares, and
 write spreading to avoid write hot spots.  For many file systems, the
 place to worry is the block(s) containing your metadata.  ZFS inherently
 spreads and mirrors its metadata, so it should be more appropriate for
 flash devices than FAT or UFS.

What about the UberBlock?  It's written each time a transaction group
commits.

Cheers,
- jonathan

-- 
Jonathan Adams, Solaris Kernel Development
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS questions

2006-06-20 Thread Bill Moore
On Tue, Jun 20, 2006 at 11:17:42AM -0700, Jonathan Adams wrote:
 On Tue, Jun 20, 2006 at 09:32:58AM -0700, Richard Elling wrote:
  Flash is (can be) a bit more sophisticated.  The problem is that they
  have a limited write endurance -- typically spec'ed at 100k writes to
  any single bit.  The good flash drives use block relocation, spares, and
  write spreading to avoid write hot spots.  For many file systems, the
  place to worry is the block(s) containing your metadata.  ZFS inherently
  spreads and mirrors its metadata, so it should be more appropriate for
  flash devices than FAT or UFS.
 
 What about the UberBlock?  It's written each time a transaction group
 commits.

Right.  But we rotate the uberblock over 128 positions in the device
label.  This helps with write-leveling.  Furthermore, a lot of flash
devices are starting to incorporate write-leveling in HW, since a lot of
software just doesn't deal with it.


--Bill
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS questions

2006-06-20 Thread Darren Reed

Jonathan Adams wrote:


On Tue, Jun 20, 2006 at 09:32:58AM -0700, Richard Elling wrote:


Flash is (can be) a bit more sophisticated.  The problem is that they
have a limited write endurance -- typically spec'ed at 100k writes to
any single bit.  The good flash drives use block relocation, spares, and
write spreading to avoid write hot spots.  For many file systems, the
place to worry is the block(s) containing your metadata.  ZFS inherently
spreads and mirrors its metadata, so it should be more appropriate for
flash devices than FAT or UFS.



What about the UberBlock?  It's written each time a transaction group
commits.



Also, options such as -nomtime and -noctime have been introduced
alongside -noatime in some free operating systems to limit the amount
of meta data that gets written back to disk.

Darren

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS questions

2006-06-20 Thread Casper . Dik

Also, options such as -nomtime and -noctime have been introduced
alongside -noatime in some free operating systems to limit the amount
of meta data that gets written back to disk.


Those seem rather pointless.  (mtime and ctime generally imply other
changes, often to the inode; atime does not)

Casper
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS questions

2006-06-20 Thread Eric Schrock
On Tue, Jun 20, 2006 at 02:18:34PM -0600, Gregory Shaw wrote:
 Wouldn't that be:
 
 5 seconds per write = 86400/5 = 17280 writes per day
 256 rotated locations for 17280/256 = 67 writes per location per day
 
 Resulting in (100,000/67) ~1492 days or 4.08 years before failure?
 
 That's still a long time, but it's not 100 years.

Yes, I goofed on the math.  It's still (256*100,000*5) seconds, but
somehow I managed to goof up the math.  I tried it again and came up
with 1,481 days.
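For anyone re-deriving the figure, a quick check with bc (assuming the
100,000-write endurance and 5-second txg interval discussed above):

# echo 'scale=2; 100000 / ((86400 / 5) / 256) / 365' | bc
4.05

i.e. a bit over four years, which matches the ~1,481 days above.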

- Eric

--
Eric Schrock, Solaris Kernel Development   http://blogs.sun.com/eschrock
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS questions

2006-06-20 Thread Dana H. Myers
Richard Elling wrote:
 Erik Trimble wrote:

 Oh, and the newest thing in the consumer market is called hybrid
 drives, which is a melding of a Flash drive with a Winchester
 drive.   It's originally targetted at the laptop market - think a 1GB
 flash memory welded to a 40GB 2.5 hard drive in the same
 form-factor.  You don't replace the DRAM cache on the HD - it's still
 there for fast-write response. But all the frequently used blocks
 get scheduled to be placed on the Flash part of the drive, while the
 mechanical part actually holds a copy of everything.  The Flash
 portion is there for power efficiency as well as performance.
 
 Flash is (can be) a bit more sophisticated.  The problem is that they
 have a limited write endurance -- typically spec'ed at 100k writes to
 any single bit.  The good flash drives use block relocation, spares, and
 write spreading to avoid write hot spots.  For many file systems, the
 place to worry is the block(s) containing your metadata.  ZFS inherently
 spreads and mirrors its metadata, so it should be more appropriate for
 flash devices than FAT or UFS.

What I do not know yet is exactly how the flash portion of these hybrid
drives is administered.  I rather expect that a non-hybrid-aware OS may
not actually exercise the flash storage on these drives by default; or
should I say, the flash storage will only be available to a hybrid-aware
OS.

Has anyone reading this seen a command-set reference for one of these
drives?

Dana
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS questions

2006-06-20 Thread Richard Elling

Eric Schrock wrote:

On Tue, Jun 20, 2006 at 11:17:42AM -0700, Jonathan Adams wrote:

On Tue, Jun 20, 2006 at 09:32:58AM -0700, Richard Elling wrote:

Flash is (can be) a bit more sophisticated.  The problem is that they
have a limited write endurance -- typically spec'ed at 100k writes to
any single bit.  The good flash drives use block relocation, spares, and
write spreading to avoid write hot spots.  For many file systems, the
place to worry is the block(s) containing your metadata.  ZFS inherently
spreads and mirrors its metadata, so it should be more appropriate for
flash devices than FAT or UFS.

What about the UberBlock?  It's written each time a transaction group
commits.


Yes, but this is only written once every 5 seconds, and we store to 256
different locations in a ring buffer.  So you have (256*100,000*5)
seconds, or about 100 years.


100k writes is the de-facto minimum.  In looking at some SSD (yes, they
are marketing them as solid state disks) drives with IDE or SATA interfaces,
at least one vendor specs 5,000,000 writes, sizes up to 128 GBytes.  It
will be a while before these are really inexpensive, though.
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS questions

2006-06-20 Thread Richard Elling

Dana H. Myers wrote:

What I do not know yet is exactly how the flash portion of these hybrid
drives is administered.  I rather expect that a non-hybrid-aware OS may
not actually exercise the flash storage on these drives by default; or
should I say, the flash storage will only be available to a hybrid-aware
OS.


Samsung describes their hybrid drives as using flash for the boot block
and as a write cache.
 -- richard
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS questions

2006-06-20 Thread Nathan Kroenert
And, this is a worst case, no?

If the device itself also does some funky stuff under the covers, and 
ZFS only writes an update if there is *actually* something to write, 
then it could be much much longer than 4 years.

Actually - That's an interesting point. I assume ZFS only writes something
when there is actually data?

:)

Nathan.

On Wed, 2006-06-21 at 06:25, Eric Schrock wrote:
 On Tue, Jun 20, 2006 at 02:18:34PM -0600, Gregory Shaw wrote:
  Wouldn't that be:
  
  5 seconds per write = 86400/5 = 17280 writes per day
  256 rotated locations for 17280/256 = 67 writes per location per day
  
  Resulting in (100,000/67) ~1492 days or 4.08 years before failure?
  
  That's still a long time, but it's not 100 years.
 
 Yes, I goofed on the math.  It's still (256*100,000*5) seconds, but
 somehow I managed to goof up the math.  I tried it again and came up
 with 1,481 days.
 
 - Eric
 
 --
 Eric Schrock, Solaris Kernel Development   http://blogs.sun.com/eschrock
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
-- 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS questions

2006-06-20 Thread Jeff Bonwick
 I assume ZFS only writes something when there is actually data?

Right.

Jeff

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS questions

2006-06-20 Thread Darren Reed

[EMAIL PROTECTED] wrote:


Also, options such as -nomtime and -noctime have been introduced
alongside -noatime in some free operating systems to limit the amount
of meta data that gets written back to disk.
   




Those seem rather pointless.  (mtime and ctime generally imply other
changes, often to the inode; atime does not)
 



Well operating systems that *do* get used to build devices *do*
have these mount options for this purpose, so I imagine that
someone who does this kind of thing thinks they're worthwhile.

Darren

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS questions

2006-06-17 Thread Darren Reed

Mike Gerdts wrote:


On 6/17/06, Dale Ghent [EMAIL PROTECTED] wrote:


The concept of shifting blocks in a zpool around in the background as
part of a scrubbing process and/or on the order of a explicit command
to populate newly added devices seems like it could be right up ZFS's
alley. Perhaps it could also be done with volume-level granularity.

Off the top of my head, an area where this would be useful is
performance management - e.g. relieving load on a particular FC
interconnect or an overburdened RAID array controller/cache thus
allowing total no-downtime-to-cp-data-around flexibility when one is
horizontally scaling storage performance.



Another good use would be to migrate blocks that are rarely accessed
to slow storage (750 GB drives with RAID-Z) while very active blocks
are kept on fast storage (solid state disk).  Presumably writes would
go to relatively fast storage and use idle IO cycles to migrate those
that don't have a lot of reads to slower storage.  Blocks that are
very active and reside on slow storage could be migrated (mirrored?)
to fast storage.



Solid state disk often has a higher failure rate than normal disk and a
limited write cycle.  Hence it is often desirable to try and redesign the
filesystem to do fewer writes when it is on (for example) compact flash,
so moving hot blocks to fast storage can have consequences.

But then there is also this new storage paradigm in the e-rags where
a hard drive also has some amount of solid state storage to speed up
the boot time.  It'll be interesting to see how that plays out, but I
suspect the idea is that in the relevant market (PCs), it'll be used for
things like drivers and OS core image files that do not change very
often.

Darren

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS questions

2006-06-17 Thread Neil A. Wilson

Darren Reed wrote:

Solid state disk often has a higher failure rate than normal disk and a
limited write cycle.  Hence it is often desirable to try and redesign the
filesystem to do fewer writes when it is on (for example) compact flash,
so moving hot blocks to fast storage can have consequences.


Solid state storage does not necessarily mean flash.  For example, I 
have recently performed some testing of Sun's Directory Server in 
conjunction with solid state disks from two different vendors.  Both of 
these used standard DRAM, so there's no real limit to the number of 
writes that can be performed.  They have lots of internal redundancy 
features (e.g., ECC memory with chipkill, redundant power supplies, 
internal UPSes, and internal hard drives to protect against extended 
power outages), but both vendors said that customers often use other 
forms of redundancy (e.g., mirror to traditional disk, or RAID across 
multiple solid-state devices).


One of the vendors mentioned that both SVM and VxVM have the ability to 
designate one disk in a mirror as write only (unless the other has 
failed) which can be good for providing redundancy with cheaper, 
traditional storage.  All reads would still come from the solid state 
storage so they would be very fast, and as long as the write rate 
doesn't exceed that of the traditional disk then there wouldn't be much 
adverse performance impact from the slower disk in the mirror.  I don't 
believe that ZFS has this capability, but it could be something worth 
looking into.  The original suggestion provided in this thread would 
potentially work well in that kind of setup.


ZFS with compression can also provide a notable win because the 
compression can significantly reduce the amount of storage required, 
which can help cut down on the costs.  Solid state disks like this are 
expensive (both of the 32GB disks that I tested list at around $60K), so 
controlling costs is important.



Neil

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS questions

2006-06-17 Thread Mike Gerdts

On 6/17/06, Neil A. Wilson [EMAIL PROTECTED] wrote:

Darren Reed wrote:
 Solid state disk often has a higher failure rate than normal disk and a
 limited write cycle.  Hence it is often desirable to try and redesign the
 filesystem to do fewer writes when it is on (for example) compact flash,
 so moving hot blocks to fast storage can have consequences.


I mentioned solid state (assuming DRAM-based) and 750 GB drives as the
two ends of the spectrum available.  Most people will find that their
extremes are each closer to the middle of the spectrum.  Possibly
a multi-tier approach including 73 GB FC, 300 GB FC, and 500 GB SATA
would be more likely in most shops.


  Solid state disks like this are
expensive (both of the 32GB disks that I tested list at around $60K), so
controlling costs is important.



If you remove enterprise from the solid state disk equation,
consider this at $150 plus the cost of four 1 GB DDR DIMMs.  I suppose you
could mirror across a pair of them and still have a pretty fast small
4GB of space for less than $1k.

http://www.anandtech.com/storage/showdoc.aspx?i=2480

FWIW, google gives plenty of hits for "solid state disk terabyte".

Mike

--
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS questions

2006-06-17 Thread Erik Trimble
Saying "Solid State disk" in the storage arena means battery-backed DRAM 
(or, rarely, NVRAM).  It does NOT include the various forms of 
solid-state memory (compact flash, SD, MMC, etc.); "Flash disk" is 
reserved for those kinds of devices.


This is historical, since Flash disk hasn't been functionally usable in 
the Enterprise Storage arena until the last year or so. Battery-backed 
DRAM as a disk has been around for a very long time, though. :-)



We've all talked about adding the ability to change read/write policy on 
a pool's vdevs for awhile. There are a lot of good reasons that this is 
desirable. However, I'd like to try to separate this request from HSM, 
and not immediately muddy the waters by trying to lump too many things 
together.


That is, start out with adding the ability to differentiate between 
access policy in a vdev.  Generally, we're talking only about mirror 
vdevs right now.  Later on, we can consider the ability to migrate data 
based on performance, but a lot of this has to take into consideration 
snapshot capability and such, so is a bit less straightforward.



And, on not a completely tangential side note:  WTF is up with the costs 
for Solid State disks?  I mean, prices well over $1k per GB are typical, 
which is absolutely ludicrous. The DRAM itself is under $100/GB, and 
these devices are idiot-simple to make.  In the minimalist case, it's 
simply  DIMM slots, a NiCad battery and trickle charger, and a 
SCSI/SATA/FC interface chip.  Even in the fancy case, were you provide a 
backup drive to copy the DRAM contents to in case of power failure, it's 
a trivial engineering exercise.   I realize there is (currently) a small 
demand for these devices, but honestly,  I'm pretty sure that if they 
reduced the price by a factor of 3, they'd see 10x or maybe even 100x 
the volume, cause these little buggers are just so damned useful.


Oh, and the newest thing in the consumer market is called hybrid 
drives, which is a melding of a Flash drive with a Winchester drive.   
It's originally targeted at the laptop market - think a 1GB flash 
memory welded to a 40GB 2.5" hard drive in the same form-factor.  You 
don't replace the DRAM cache on the HD - it's still there for fast-write 
response. But all the frequently used blocks get scheduled to be 
placed on the Flash part of the drive, while the mechanical part 
actually holds a copy of everything.  The Flash portion is there for 
power efficiency as well as performance.



-Erik


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS questions

2006-06-16 Thread Dale Ghent


On Jun 16, 2006, at 11:40 PM, Richard Elling wrote:


Kimberly Chang wrote:

A couple of ZFS questions:
1. ZFS dynamic striping will automatically use newly added devices  
when there are write requests. Customer has a *mostly read-only*  
application with an I/O bottleneck; they wonder if there is a ZFS  
command or mechanism to enable the manual rebalancing of ZFS data  
when adding new drives to an existing pool?


cp :-)
If you copy the file then the new writes will be spread across the newly
added drives.  It doesn't really matter how you do the copy, though.
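In practice that amounts to rewriting the file in place, something like
(paths are made up, and this assumes no snapshot is holding the old blocks):

# cp /tank/data/bigfile /tank/data/bigfile.tmp
# mv /tank/data/bigfile.tmp /tank/data/bigfile

The copy is freshly allocated, so dynamic striping spreads it across all
top-level vdevs, including the newly added ones.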


She raises an interesting point, though.

The concept of shifting blocks in a zpool around in the background as  
part of a scrubbing process and/or on the order of an explicit command  
to populate newly added devices seems like it could be right up ZFS's  
alley. Perhaps it could also be done with volume-level granularity.


Off the top of my head, an area where this would be useful is  
performance management - e.g. relieving load on a particular FC  
interconnect or an overburdened RAID array controller/cache thus  
allowing total no-downtime-to-cp-data-around flexibility when one is  
horizontally scaling storage performance.


/dale

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss