Re: [zfs-discuss] best way to configure raidz groups

2009-12-31 Thread Paul Armstrong
Rather than hacking something like that, he could use a Disk on Module 
(http://en.wikipedia.org/wiki/Disk_on_module) or something like 
http://www.tomshardware.com/news/nanoSSD-Drive-Elecom-Japan-SATA,8538.html 
(which I suspect may be a DOM but I've not poked around sufficiently to see).

Paul


Re: [zfs-discuss] Thin device support in ZFS?

2009-12-31 Thread Andras Spitzer
Let me sum up my thoughts on this topic.

To Richard [relling]: I agree with you that this topic gets even more confusing if we 
are not careful enough to specify exactly what we are talking about. Thin 
provisioning can be done at multiple layers, and though you said you prefer it 
closer to the app than to the dumb disks (if you were referring to SAN), 
my opinion is that each and every scenario has its own pros and cons. I learned a 
long time ago not to declare a technology good or bad; there are technologies 
which are used properly (usually declared good tech) and others which are 
not (usually declared bad).

--

Let me clarify my case, and why I mentioned thin devices on SAN specifically. 
Many people replied about the thin device support in ZFS (which is called sparse 
volumes, if I'm correct), but what I was talking about is something else: 
thin device awareness on the SAN.
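(For reference, a ZFS sparse volume is just a zvol created without a reservation; a rough sketch, pool and volume names made up:)

    # create a 100 GB thin ("sparse") volume - no space is reserved up front
    zfs create -s -V 100g tank/thinvol
    # compare the advertised size with what is actually allocated
    zfs get volsize,refreservation,usedbydataset tank/thinvol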

In this case you configure your LUN in the SAN as a thin device, a virtual LUN 
which is backed by a pool of physical disks in the SAN. From the OS it's 
transparent, and so it is from the Volume Manager/Filesystem point of view.

That is the basic definition of my scenario with thin devices on SAN. High-end 
SAN frames like the HDS USP-V (a feature called Hitachi Dynamic Provisioning) and the EMC 
Symmetrix V-Max (a feature called Virtual Provisioning) support this (and I'm 
sure many others do as well). Once you've discovered the LUN in the OS, you start to 
use it - put it under the Volume Manager, create a filesystem, copy files - but the 
SAN only allocates physical blocks (more precisely, groups of blocks called 
extents) as you write them, which means you'll use only as much (or a bit more, 
rounded up to the next extent) on the physical disk as you use in reality.

From this standpoint we can define two terms: thin-friendly and thin-hostile 
environments. Thin-friendly would be any environment where the OS/VM/FS doesn't 
write to blocks it doesn't really use (for example, during initialization it 
doesn't fill up the LUN with a pattern or zeros).

That's why Veritas' SmartMove is a nice feature: when you move from fat to 
thin devices (from the OS both LUNs look exactly the same), it will copy only 
the blocks which are actually used by VxFS files. 

That is still just the basics of having thin devices on the SAN, and hoping for a 
thin-friendly environment. The next level is the management of the thin 
devices and of the physical pool the thin devices allocate their extents from.

Even if you get migrated to thin device LUNs, your thin devices will become fat 
again: if you fill up your filesystem even once, the thin device on the SAN 
will remain fat, because no space reclamation happens by default. The reason is 
pretty simple: the SAN storage has no knowledge of the filesystem structure, so 
it can't decide whether a block should be released back to the pool or 
whether it's really still in use. Then Veritas came along with the brilliant idea of building 
a bridge between the FS and the SAN frame (this became the Thin Reclamation 
API), so they can communicate which blocks are indeed no longer in use.

I really would like you to read this Quick Note from Veritas about this 
feature; it explains the concept far better than I did: 
http://ftp.support.veritas.com/pub/support/products/Foundation_Suite/338546.pdf

Btw, in this concept VxVM can even detect (via ASL) whether a LUN is 
thin-device/thin-reclamation capable or not.

Honestly, I have mixed feelings about ZFS. I feel that this is obviously the 
VM/filesystem of the future, but at the same time I realize the roles of the 
individual parts in the big picture are getting mixed up. Am I the only one 
with the impression that ZFS will sooner or later evolve into a SAN OS, and the 
zfs and zpool commands will become just lightweight interfaces to control the 
SAN frame? :-) (like Solutions Enabler for EMC)

If you ask me, the pool concept always works more efficiently if (1) you have more 
capacity in the pool and (2) you have more systems sharing the pool; that's why 
I see a thin device pool as more rational in a SAN frame.

Anyway, I'm sorry if you were already aware of what I explained above; I also hope 
I didn't offend anyone with my views,

Regards,
sendai


Re: [zfs-discuss] Thin device support in ZFS?

2009-12-31 Thread Ragnar Sundblad

On 31 dec 2009, at 06.01, Richard Elling wrote:

 
 On Dec 30, 2009, at 2:24 PM, Ragnar Sundblad wrote:
 
 
 On 30 dec 2009, at 22.45, Richard Elling wrote:
 
 On Dec 30, 2009, at 12:25 PM, Andras Spitzer wrote:
 
 Richard,
 
 That's an interesting question, whether it's worth it or not. I guess the 
 question is always who the targets for ZFS are (I assume everyone, though 
 in reality priorities have to be set as the developer resources are 
 limited). For a home office, no doubt thin provisioning is not of much 
 use; for an enterprise company the numbers might really make a difference 
 if we look at the space used vs. the space allocated.
 
 There are some studies showing that thin provisioning can reduce the physical space 
 used by up to 30%, which is huge. (Even though I understand studies are not 
 real life, and thin provisioning is not viable in every environment.)
 
 Btw, I would like to discuss scenarios where, though we have an 
 over-subscribed pool in the SAN (meaning the overall space allocated to 
 the systems is more than the physical space in the pool), with proper 
 monitoring and proactive physical drive additions we won't let any 
 systems/applications attached to the SAN realize that they are on thin devices.
 
 Actually that's why I believe configuring thin devices without 
 periodically reclaiming space is just a timebomb, though if you have the 
 option to periodically reclaim space, you can maintain the pool in the SAN 
 in a really efficient way. That's why I consider Veritas' Thin Reclamation 
 API a milestone in the thin device field.
 
 Anyway, only the future can tell if thin provisioning will or won't be a major 
 feature in the storage world, though since I saw Veritas has already added this 
 feature, I was wondering if ZFS has it at least on its roadmap.
 
 Thin provisioning is absolutely, positively a wonderful, good thing!  The 
 question
 is, how does the industry handle the multitude of thin provisioning models, 
 each
 layered on top of another? For example, here at the ranch I use VMWare and 
 Xen,
 which thinly provision virtual disks. I do this over iSCSI to a server 
 running ZFS
 which thinly provisions the iSCSI target.  If I had a virtual RAID array, I 
 would
 probably use that, too. Personally, I think being thinner closer to the 
 application
 wins over being thinner closer to dumb storage devices (disk drives).
 
 I don't get it - why do we need anything more magic (or complicated)
 than support for TRIM from the filesystems and the storage systems?
 
 TRIM is just one part of the problem (or solution, depending on your point
 of view). The TRIM command is part of the ATA (T13) command set (SCSI has the
 analogous UNMAP/WRITE SAME in T10); it allows a
 host to tell a block device that the data in a set of blocks is no longer of
 any value, and the block device can destroy the data without adverse
 consequence.
 
 In a world with copy-on-write and without snapshots, it is obvious that
 there will be a lot of blocks running around that are no longer in use.
 Snapshots (and their clones) change that use case. So in a world of
 snapshots, there will be fewer blocks which are not used. Remember,
 the TRIM command is very important to OSes like Windows or OSX
 which do not have file systems that are copy-on-write or have decent
 snapshots. OTOH, ZFS does copy-on-write and lots of ZFS folks use
 snapshots.

I don't believe that there is such a big difference between those
cases. Sure, snapshots may keep more data on disk, but only as much
as the user chooses to keep. There have been other ways to keep old
data on disk before (RCS, Solaris patch backout blurbs, logs, caches,
what have you), so there is not really a brand new world there.
(BTW, once upon a time, real operating systems had (optional) file
versioning built into the operating system or file system itself.)

If there were a mechanism that always tended to keep all of the
disk full, that would be another case. Snapshots may do that
with the autosnapshot and warn-and-clean-when-getting-full
features of OpenSolaris, but servers especially will probably
not be managed that way; they will probably have a much more
controlled snapshot policy. (Especially if you want to save every
possible bit of disk space, as those guys with the big fantastic
and ridiculously expensive storage systems always want to do -
maybe that will change in the future though.)

 That said, adding TRIM support is not hard in ZFS. But it depends on
 lower level drivers to pass the TRIM commands down the stack. These
 ducks are lining up now.

Good.

 I don't see why TRIM would be hard to implement for ZFS either,
 except that you may want to keep data from a few txgs back just
 for safety, which would probably call for some two-stage freeing
 of data blocks (those free blocks that are to be TRIMmed, and
 those that already are).
 
 Once a block is freed in ZFS, ZFS no longer needs it. So the problem
 of TRIM in ZFS is not related to the recent txg commit history.

It may be that you want to save a few txgs back, so if you get
a failure where 

Re: [zfs-discuss] Thin device support in ZFS?

2009-12-31 Thread Ragnar Sundblad

On 31 dec 2009, at 00.31, Bob Friesenhahn wrote:

 On Wed, 30 Dec 2009, Mike Gerdts wrote:
 
 Should the block size be a tunable, so that it can match the page size of SSDs (typically
 4K, right?) and of upcoming hard disks that sport a sector size > 512
 bytes?
 
 Enterprise SSDs are still in their infancy.  The actual page size of an SSD 
 could be almost anything.  Due to lack of seek time concerns and the high 
 cost of erasing a page, a SSD could be designed with a level of indirection 
 so that multiple logical writes to disjoint offsets could be combined into a 
 single SSD physical page.  Likewise a large logical block could be subdivided 
into multiple SSD pages, which are allocated on demand.  Logic is cheap and 
 SSDs are full of logic so it seems reasonable that future SSDs will do this, 
 if not already, since similar logic enables wear-leveling.

I believe that almost all flash devices are already doing this,
and only the first generation SD cards or something like that are
not doing it, leaving it to the host instead.

But I could be wrong of course.
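(For what it's worth, you can at least see what allocation size ZFS assumes for a pool's vdevs by looking at the ashift value; a rough example, pool name made up:)

    # ashift is log2 of the sector size ZFS assumes for the vdev
    # (9 = 512 bytes, 12 would be 4K)
    zdb -C tank | grep ashift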

/ragge s



Re: [zfs-discuss] hard drive choice, TLER/ERC/CCTL

2009-12-31 Thread Willy
 Thanks, sounds like it should handle all but the
 worst faults OK then; I believe the maximum retry
 timeout is typically set to about 60 seconds in
 consumer drives.

Are you sure about this?  I thought these consumer level drives would try 
indefinitely to carry out their operations.  Even Samsung's white paper on CCTL 
RAID error recovery says it could take a minute or longer (see the "Desktop 
Unsuccessful Error Recovery" diagram): 
http://www.samsung.com/global/business/hdd/learningresource/whitepapers/LearningResource_CCTL.html


Re: [zfs-discuss] Changing ZFS drive pathing

2009-12-31 Thread James C. McPherson

Mike wrote:

Just thought I would let you all know that I followed what Alex suggested
along with what many of you pointed out and it worked! Here are the steps
I followed:

1. Break root drive mirror
2. zpool export filesystem
3. run the command to enable MPxIO and reboot the machine
4. zpool import filesystem
5. Check the system
6. Recreate the mirror.

Thank you all for the help!  I feel much better and it worked without a
single problem!  I'm very impressed with MPxIO and wish I had known about
it before spending thousands of dollars on PowerPath.



As somebody who's done a bunch of work on stmsboot[a], and
who has at least a passing knowledge of devids[b] (which are what
ZFS and MPxIO use to identify devices), I am disappointed that
you believe it was necessary to follow the above steps.

Assuming that your devices do not have devids which change, then
all that should have been required was

[setup your root mirror]

# /usr/sbin/stmsboot -e

[reboot when prompted]
[twiddle thumbs]
[ login ]

No ZFS export and import required.
No breaking and recreating of mirror required.
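(If you want to sanity-check the result afterwards, something along these lines should do, assuming the stock stmsboot on a recent build:)

    # after the reboot, show the non-STMS to STMS device name mapping
    stmsboot -L
    # the pool should simply follow its devids onto the new paths
    zpool status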


[a] http://blogs.sun.com/jmcp/entry/on_stmsboot_1m
[b] http://www.jmcp.homeunix.com/~jmcp/WhatIsAGuid.pdf


James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp   http://www.jmcp.homeunix.com/blog


Re: [zfs-discuss] best way to configure raidz groups

2009-12-31 Thread Thomas Burgess

 For the OS, I'd drop the adapter/compact-flash combo and use the
 stripped down Kingston version of the Intel x25m MLC SSD.  If you're
 not familiar with it, the basic scoop is that this drive contains half
 the flash memory (40GB) *and* half the controller channels (5 versus
 10) of the Intel drive - and so, performance is basically a little
 less than half, although read performance is still very good.  For more
 info, google for hardware reviews.  This product is still a little
 hard to find; froogle for the following part numbers:

 Desktop Bundle - SNV125-S2BD/40GB
 Bare drive - SNV125-S2/40GB

 Currently you can find the bare drive for under $100.  This is bound
 to give you better performance and guaranteed compatibility compared
 to adapters and compact flash.

 The problem with adapters is that, although the price is great,
 compatibility and build quality are all over the map and YMMV
 considerably.  You would not be happy if you saved $20 on the
 adapter/flash combo and ended up with nightmare reliability.

 The great thing about 2.5" SSDs is that mounting is simply a question
 of duct tape or velcro! [ well, almost ... but you can velcro them
 onto the sidewall of your case ]  So you can use all your available
 3.5" disk drive bays for ZFS disks.

I was able to find some of the 64 GB SNV125-S2 drives for a decent price.
 Do these also work well for L2ARC?

This brings up more questions, actually.

I know it's not recommended to use partitions for ZFS, but does this still
apply for SSDs and the root pool?

I was thinking about maybe using half of the SSD for the root pool
and putting the ZIL on the other half.

Or would I just be better off leaving the ZIL on the raidz drives?


[zfs-discuss] Help on Mailing List

2009-12-31 Thread Florian

Hello there,

is there any possibility to receive all the old mailings from
the list? I would like to search them for know-how so that I
don't double-post too often :-)

Thanks,
Florian


Re: [zfs-discuss] Help on Mailing List

2009-12-31 Thread Henrik Johansson
http://mail.opensolaris.org/pipermail/zfs-discuss/

Henrik
http://sparcv9.blogspot.com



Re: [zfs-discuss] ZFS pool unusable after attempting to destroy a dataset with dedup enabled

2009-12-31 Thread Jack Kielsmeier
 Yeah, still no joy on getting my pool back.  I think
 I might have to try grabbing another server with a
 lot more memory and slapping the HBA and the drives
 in that.  Can ZFS deal with a controller change?

Just some more info that 'may' help.
After I upgraded to 8GB of RAM, I did not limit the amount of RAM zfs can take. 
So if you are doing any kind of limiting in /etc/system, you may want to take 
that out.
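(For anyone wondering, the kind of limit being talked about is a line like the following in /etc/system; the value here is only an example, capping the ARC at 2 GB:)

    * example /etc/system entry - caps the ZFS ARC at 2 GB
    set zfs:zfs_arc_max = 0x80000000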


Re: [zfs-discuss] hard drive choice, TLER/ERC/CCTL

2009-12-31 Thread Eric D. Mudama

On Thu, Dec 31 at  2:14, Willy wrote:

Thanks, sounds like it should handle all but the
worst faults OK then; I believe the maximum retry
timeout is typically set to about 60 seconds in
consumer drives.


Are you sure about this?  I thought these consumer level drives
would try indefinitely to carry out its operation.  Even Samsung's
white paper on CCTL RAID error recovery says it could take a minute
or longer (see Desktop Unsuccessful Error Recovery diagram)
http://www.samsung.com/global/business/hdd/learningresource/whitepapers/LearningResource_CCTL.html


Depends very much on the firmware and the error type.  Each vendor
will have their own trade-secret approaches to solving this issue
based on their own failure rates and expected usages.

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Thin device support in ZFS?

2009-12-31 Thread Bob Friesenhahn

On Thu, 31 Dec 2009, Ragnar Sundblad wrote:


Also, currently, when the SSDs for some very strange reason is
constructed from flash chips designed for firmware and slowly
changing configuration data and can only erase in very large chunks,
TRIMing is good for the housekeeping in the SSD drive. A typical
use case for this would be a laptop.


I have heard quite a few times that TRIM is good for SSD drives but 
I don't see much actual use for it.  Every responsible SSD drive 
maintains a reserve of unused space (20-50%) since it is needed for 
wear leveling and to repair failing spots.  This means that even when 
a SSD is 100% full it still has considerable space remaining.  A very 
simple SSD design solution is that when a SSD block is overwritten 
it is replaced with an already-erased block from the free pool and the 
old block is submitted to the free pool for eventual erasure and 
re-use.  This approach avoids adding erase times to the write latency 
as long as the device can erase as fast as the average data write 
rate.


There are of course SSDs with hardly any (or no) reserve space, but 
while we might be willing to sacrifice an image or two to SSD block 
failure in our digital camera, that is just not acceptable for serious 
computer use.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] Thin device support in ZFS?

2009-12-31 Thread Andras Spitzer
Just an update :

Finally I found some technical details about this Thin Reclamation API :

(http://blogs.hds.com/claus/2009/12/i-love-it-when-a-plan-comes-together.html)

This week, (December 7th), Symantec announced their “completing the thin 
provisioning ecosystem” that includes the necessary API calls for the file 
system to “notify” the storage array when space is “deleted”. The interface is 
a previously disused and now revised/reused/repurposed SCSI command (called 
Write Same) which was jointly worked out with Symantec, Hitachi, and 3PAR. This 
command allows the file systems (in this case Veritas VxFS) to notify the 
storage systems that space is no longer occupied. How cool is that! There is 
also a subcommittee of INCITS T10 studying the standardization of this, and SNIA 
is also studying it. It won’t be long before most file systems, databases, 
and storage vendors adopt this technology.

So it's based on the SCSI WRITE SAME/UNMAP command (and if I understand 
correctly, SATA TRIM is similar to this from the FS point of view), a 
standard which is not ratified yet.

Also, happy new year to everyone!

Regards,
sendai


Re: [zfs-discuss] what happens to the deduptable (DDT) when you set dedup=off ???

2009-12-31 Thread Robert Milkowski

On 30/12/2009 22:57, ono wrote:

will I be able to see which files were affected by dedup, or can I do a
zfs send/receive to another filesystem to clean it up?


send|recv will be enough.
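(A rough sketch of what that looks like - dataset names made up, and make sure dedup is already off on the destination before the receive:)

    zfs set dedup=off tank
    zfs snapshot tank/data@nodedup
    zfs send tank/data@nodedup | zfs receive tank/data_new
    # once the copy is verified, retire the old dataset
    zfs destroy -r tank/data
    zfs rename tank/data_new tank/data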



Re: [zfs-discuss] ZFS extremely slow performance

2009-12-31 Thread Richard Elling

On Dec 31, 2009, at 2:49 AM, Robert Milkowski wrote:



judging by a *very* quick glance it looks like you have an issue  
with c3t0d0 device which is responding very slowly.


Yes, there is an I/O stuck on the device which is not getting serviced.
See below...



--
Robert Milkowski
http://milek.blogspot.com



On 31/12/2009 09:10, Emily Grettel wrote:


Hi,

I'm using OpenSolaris 127 from my previous posts to address CIFS  
problems. I have a few zpools but lately (with an uptime of 32  
days) we've started to get CIFS issues and really bad IO  
performance. I've been running scrubs on a nightly basis.


I'm not sure why it's happening either - I'm new to OpenSolaris.

I ran fsstat whilst trying to unrar an 8.4GB file with an ISO inside it:


fsstat zfs 1
  new  name   name  attr  attr lookup rddir  read  read write write
 file remov   chng   get   set    ops   ops   ops bytes   ops bytes
3.29K   466    367  633K 1.50K  1.66M 8.66K  964K 15.6G  314K 9.38G zfs
    0     0      0     4     0      8     0   135 5.13M    93 5.02M zfs
    0     0      0     7     0     18     0   205 5.63M   137 5.64M zfs
    0     0      0     4     0      8     0    90 3.92K    49 14.6K zfs
    0     0      0     4     0      8     0   115 16.4K    65 27.5K zfs
    0     0      0     8     0     13     0   153 8.36M   113 8.38M zfs
    0     0      0     4     0      8     0    94 3.96K    53 19.1K zfs
    0     0      0     7     0     18     0    80   800    42 1.13K zfs
    0     0      0     4     0      8     0    90 3.92K    48 7.62K zfs
    0     0      0     4     0      8     0    99  132K    53 7.14K zfs
    0     0      0     4     0      8     0   188 5.99K    96 5.62K zfs
    0     0      0     4     0      8     0    95  664K    52  420K zfs
    0     0      0     9     0     22     0   164 7.97K    92 12.2K zfs

  new  name   name  attr  attr lookup rddir  read  read write write
 file remov   chng   get   set    ops   ops   ops bytes   ops bytes
    0     0      0     4     0      8     0   111 2.63M    70 2.63M zfs
    0     0      0     4     0      8     0   262 6.63M   153 6.63M zfs
    0     0      0     4     0      8     0    80   800    44 1.70K zfs
    0     0      0     4     0      8     0   337 18.1M   247 18.1M zfs
    0     0      0     7     0     18     0   127 5.75M    89 5.63M zfs
    0     0      0     4     0      8     0    80   800    50 25.6K zfs

My iostat appears below this message (it's quite long, to give you  
an idea). I'm really not sure why the performance has  
dropped all of a sudden or how to diagnose it. CIFS shares  
occasionally drop out too.


It's a bit of a downer to be experiencing on the 31st of December. I  
hope everyone has a Safe & Happy New Year :-)


I'm unable to upgrade to the latest release because of an issue  
with python:


pfexec pkg image-update
Creating Plan /pkg: Cannot remove 'pkg://opensolaris.org/sunwipkg-gui-l...@0.5.11 
,5.11-0.127:2009T075414Z' due to the following packages that  
depend on it:
  pkg://opensolaris.org/SUNWipkg- 
g...@0.5.11,5.11-0.127:2009T075333Z


So I'm stuck on 127 until I can rebuild this machine :(

Cheers,
Em

                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    1.0    0.0    0.5  2.8  1.0 2815.9 1000.0 100 100 c7t3d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  2.0  1.0    0.0    0.0 100 100 c7t3d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    6.0    0.0   81.5  1.2  1.0  198.6  166.6  60 100 c7t3d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  1.0    0.0    0.0   0 100 c7t3d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    6.0    0.0    4.0  0.0  0.0    0.0    0.2   0   0 c7t1d0
    0.0    6.0    0.0    4.0  0.0  0.0    0.0    0.2   0   0 c7t2d0
    0.0    9.0    0.0   31.5  0.0  0.4    0.0   41.7   0  38 c7t3d0
    0.0    6.0    0.0    4.0  0.0  0.0    0.0    0.3   0   0 c7t4d0
    0.0    6.0    0.0    4.0  0.0  0.0    0.0    0.2   0   0 c7t5d0
    0.0    6.0    0.0    4.0  0.0  0.0    0.0    0.1   0   0 c0t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   56.0  122.0 5972.3 8592.2  0.0  0.5    0.0    2.9   0  25 c7t1d0
   55.0  136.0 5998.3 8590.2  0.0  0.7    0.0    3.8   0  29 c7t2d0
    0.0  111.0    0.0 4342.9  0.0  2.2    0.0   20.2   0  57 c7t3d0
  103.0  153.0 5868.3 8590.7  0.0  0.4    0.0    1.7   0  21 c7t4d0
   96.0  130.0 5946.8 8591.2  0.0  0.7    0.0    3.2   0

Re: [zfs-discuss] ZFS extremely slow performance

2009-12-31 Thread Bob Friesenhahn

On Thu, 31 Dec 2009, Emily Grettel wrote:

 
I'm using OpenSolaris 127 from my previous posts to address CIFS problems. I 
have a few zpools but
lately (with an uptime of 32 days) we've started to get CIFS issues and really 
bad IO performance.
I've been running scrubs on a nightly basis.
 
I'm not sure why its happenning either - I'm new to OpenSolaris.


Without knowing anything about your pool, your c7t3d0 device seems 
possibly suspect.  Notice that it often posts a very high asvc_t.


What is the output from 'zpool status' for this pool?

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/


Re: [zfs-discuss] Thin device support in ZFS?

2009-12-31 Thread Ragnar Sundblad

On 31 dec 2009, at 17.18, Bob Friesenhahn wrote:

 On Thu, 31 Dec 2009, Ragnar Sundblad wrote:
 
 Also, currently, when the SSDs for some very strange reason is
 constructed from flash chips designed for firmware and slowly
 changing configuration data and can only erase in very large chunks,
 TRIMing is good for the housekeeping in the SSD drive. A typical
 use case for this would be a laptop.
 
 I have heard quite a few times that TRIM is good for SSD drives but I don't 
 see much actual use for it.  Every responsible SSD drive maintains a reserve 
 of unused space (20-50%) since it is needed for wear leveling and to repair 
 failing spots.  This means that even when a SSD is 100% full it still has 
 considerable space remaining.

(At least as long as those blocks aren't used up in place of
bad/worn-out blocks...)

  A very simple SSD design solution is that when a SSD block is overwritten 
 it is replaced with an already-erased block from the free pool and the old 
 block is submitted to the free pool for eventual erasure and re-use.  This 
 approach avoids adding erase times to the write latency as long as the device 
 can erase as fast as the average date write rate.

This is what they do, as far as I have understood, but more
free space to play with makes the job easier and therefore
faster, and gives you a larger burst headroom before you hit
the erase-speed limit of the disk.

 There are of course SSDs with hardly any (or no) reserve space, but while we 
 might be willing to sacrifice an image or two to SSD block failure in our 
 digital camera, that is just not acceptable for serious computer use.

I think the idea is that with TRIM you can also use the file
system's unused space for wear leveling and flash block filling.
If your disk is completely full there is of course no gain.

/ragge s



Re: [zfs-discuss] best way to configure raidz groups

2009-12-31 Thread Erik Trimble

Thomas Burgess wrote:



For the OS, I'd drop the adapter/compact-flash combo and use the
stripped down Kingston version of the Intel x25m MLC SSD.  If you're
not familiar with it, the basic scoup is that this drive contains half
the flash memory (40Gb) *and* half the controller channels (5 versus
10) of the Intel drive - and so, performance is basically a little
less than half although read performance is still very good.  For more
info, google for  hardware reviews.  This product is still a little
hard to find, froogle for the following part numbers:

Desktop Bundle - SNV125-S2BD/40GB
Bare drive - SNV125-S2/40GB

Currently you can find the bare drive for under $100.  This is bound
to give you better performance and guaranteed compatibility compared
to adapters and compact flash.

The problem with adapters is that, although the price is great,
compatibility and build quality are all over the map and YMMV
considerably.  You would not be happy if you saved $20 on the
adapter/flash combo and ended up with nightmare reliability.

The great thing about 2.5 SSDs is that mounting is simply a question
of duct tape or velcro! [ well  almost ... but you can velcro them
onto the sidewall of you case ]  So you can use all your available
3.5 disk drive bays for ZFS disks.

I was able to find some of the 64 gb snv125-S2 drives for a decent 
price.  Do these also work well for L2ARC?


This brings more questions actually.

I know it's not recommended to use partitons for ZFS but does this 
still apply for SSD's and the root pool?


I was thinking about making maybe using half of the ssd for the root 
pool and putting the ZIL on the other half.


Or would i just be better off leaving the ZIL on the raidz drives?



It's OK to use partitions on SSDs, so long as you realize that using an 
SSD for multiple purposes splits the bandwidth to the SSD across 
multiple uses.   In your case, using an SSD as both an L2ARC and a root 
pool device is reasonable, as the rpool traffic should not be heavy.


I would NOT recommend using a X25-M or especially the snv125-S2 as a ZIL 
device.   Write performance isn't going to be very good at all - in 
fact, I think it should be not much different than using the bare 
drives.   As an L2ARC cache device, however, it's a good choice.


Oh, and there's plenty of bay adapters out there for cheap - use one.  
My favorite is a two-SSD-in-1-floppy drive bay like this:


http://www.startech.com/item/HSB220SAT25B-35-Tray-Less-Dual-25-SATA-HD-Hot-Swap-Bay.aspx

(I see them for under $40 at local stores)


20GB for a rpool is sufficient, so the rest can go to L2ARC.  I would 
disable any swap volume on the SSDs, however. If you need swap, put it 
somewhere else.
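(A rough sketch of that split - device and slice names made up; the installer puts the root pool on one slice, and the leftover slice goes to the data pool as cache:)

    # add the spare SSD slice to the data pool as an L2ARC device
    zpool add tank cache c5t0d0s3
    zpool status tank        # the slice shows up under "cache"
    # a dedicated log slice would be added the same way, e.g.:
    #   zpool add tank log c5t0d0s4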


--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)



Re: [zfs-discuss] Thin device support in ZFS?

2009-12-31 Thread Richard Elling

On Dec 31, 2009, at 1:43 AM, Andras Spitzer wrote:


Let me sum up my thoughts in this topic.

To Richard [relling] : I agree with you this topic is even more  
confusing if we are not careful enough to specify exactly what we  
are talking about. Thin provision can be done on multiple layers,  
and though you said you like it to be closer to the app than closer  
to the dumb disks (if you were referring to SAN), my opinion is that  
each and every scenario has it's own pros/cons. I learned long time  
ago not to declare a technology good/bad, there are technologies  
which are used properly (usually declared as good tech) and others  
which are not (usually declared as bad).


I hear you.  But you are trapped thinking about 20th century designs  
and ZFS is a 21st century design.  More below...

Let me clarify my case, and why I mentioned thin devices on SAN  
specifically. Many people replied with the thin device support of  
ZFS (which is called sparse volumes if I'm correct), but what I was  
talking about is something else. It's thin device awareness on the  
SAN.


In this case you configure your LUN in the SAN as thin device, a  
virtual LUN(s) which is backed by a pool of physical disks in the  
SAN. From the OS it's transparent, so it is from the Volume Manager/ 
Filesystem point of view.


That is the basic definition of my scenarion with thin devices on  
SAN. High-end SAN frames like HDS USP-V (feature called Hitachi  
Dynamic Provisioning), EMC Symmetrix V-Max (feature called Virtual  
provisioning) supports this (and I'm sure many others as well). As  
you discovered the LUN in the OS, you start to use it, like put  
under Volume Manager, create filesystem, copy files, but the SAN  
only allocates physical blocks (more precisely group of blocks  
called extents) as you write them, which means you'll use only as  
much (or a bit more rounded to the next extent) on the physical disk  
as you use in reality.


From this standpoint we can define two terms, thin-friendly and  
thin-hostile environments. Thin-friendly would be any environment  
where OS/VM/FS doesn't write to blocks it doesn't really use (for  
example during initialization it doesn't fills up the LUN with a  
pattern or 0s).


That's why Veritas' SmartMove is a nice feature, as when you move  
from fat to thin devices (from the OS both LUNs look exactly the  
same), it will copy the blocks only which are used by the VxFS files.


ZFS does this by design. There is no way in ZFS to not do this.
I suppose it could be touted as a feature :-)  Maybe we should brand
ZFS as THINbyDESIGN(TM)  Or perhaps we can rebrand
SMARTMOVE(TM) as TRYINGTOCATCHUPWITHZFS(TM) :-)

That is still the basics of having thin devices on SAN, and hope to  
have a thin-friendly environment. The next level of this is the  
management of the thin devices and the physical pool where thin  
devices allocates their extents from.


Even if you get migrated to thin device LUNs, your thin devices will  
become fat again, even if you fill up your filesystem once, the thin  
device on the SAN will remain fat, no space reclamation is happening  
by default. The reason is pretty simple, the SAN storage has no  
knowledge of the filesystem structure, as such it can't decide  
whether a block should be released back to the pool, or it's really  
not in use. Then came Veritas with this brilliant idea of building a  
bridge between the FS and the SAN frame (this became the Thin  
Reclamation API), so they can communicate which blocks are not in  
use indeed.


I really would like you to read this Quick Note from Veritas about  
this feature, it will explain way better the concept as I did : http://ftp.support.veritas.com/pub/support/products/Foundation_Suite/338546.pdf


Btw, in this concept VxVM can even detect (via ASL) whether a LUN is  
thin device/thin device reclamation capable or not.


Correct.  Since VxVM and VxFS are separate software, they have expanded
the interface between them.

Consider adding a mirror or replacing a drive.

Prior to SMARTMOVE, VxVM had no idea what part of the volume was data
and what was unused. So VxVM would silver the mirror by copying all of the  
blocks from one side to the other. Clearly this is uncool when your SAN
storage is virtualized.

With SMARTMOVE, VxFS has a method to tell VxVM that portions of the
volume are unused. Now when you silver the mirror, VxVM knows that
some bits are unused and it won't bother to copy them.  This is a bona
fide good thing for virtualized SAN arrays.

ZFS was designed with the knowledge that the limited interface between
file systems and volume managers was a severe limitation that leads to
all sorts of complexity and angst. So a different design is needed.  ZFS
has fully integrated RAID with the file system, so there is no need, by
design, to create a new interface between these layers. In other words,
the only way to silver a disk in ZFS is to silver the data. You can't  
silver unused space.
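(A rough illustration, device names made up: attaching a mirror side only resilvers blocks that are actually allocated in the pool.)

    zpool attach tank c1t2d0 c1t3d0
    zpool status tank    # resilver progress is reported against live
                         # data, not against the raw size of the device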

Re: [zfs-discuss] Thin device support in ZFS?

2009-12-31 Thread Richard Elling

[I TRIMmed the thread a bit ;-)]

On Dec 31, 2009, at 1:43 AM, Ragnar Sundblad wrote:

On 31 dec 2009, at 06.01, Richard Elling wrote:


In a world with copy-on-write and without snapshots, it is obvious  
that
there will be a lot of blocks running around that are no longer in  
use.

Snapshots (and their clones) changes that use case. So in a world of
snapshots, there will be fewer blocks which are not used. Remember,
the TRIM command is very important to OSes like Windows or OSX
which do not have file systems that are copy-on-write or have decent
snapshots. OTOH, ZFS does copy-on-write and lots of ZFS folks use
snapshots.


I don't believe that there is such a big difference between those
cases.


The reason you want TRIM for SSDs is to recover the write speed.
A freshly cleaned page can be written faster than a dirty page.
But in COW, you are writing to new pages and not rewriting old
pages. This is fundamentally different than FAT, NTFS, or HFS+,
but it is those markets which are driving TRIM adoption.

[TRIMmed]


Once a block is freed in ZFS, it no longer needs it. So the problem
of TRIM in ZFS is not related to the recent txg commit history.


It may be that you want to save a few txgs back, so if you get
a failure where parts of the last txg gets lost, you will still be
able to get an old (few seconds/minutes) version of your data back.


This is already implemented. Blocks freed in the past few txgs are
not returned to the freelist immediately. This was needed to enable
uberblock recovery in b128. So TRIMming from the freelist is safe.


This could happen if the sync commands aren't correctly implemented
all the way (as we have seen some stories about on this list).
Maybe someone disabled syncing somewhere to improve performance.

It could also happen if a non volatile caching device, such as
a storage controller, breaks in some bad way. Or maybe you just
had a bad/old battery/supercap in a device that implements
NV storage with batteries/supercaps.


The issue is that traversing the free block list has to be protected by
locks, so that the file system does not allocate a block when it is
also TRIMming the block. Not so difficult, as long as the TRIM
occurs relatively quickly.

I think that any TRIM implementation should be an administration
command, like scrub. It probably doesn't make sense to have it
running all of the time.  But on occasion, it might make sense.


I am not sure why it shouldn't run at all times, except for the
fact that it seems to be badly implemented in some SATA devices
with high latencies, so that it will interrupt any data streaming
to/from the disks.


I don't see how it would not have negative performance impacts.
 -- richard



Re: [zfs-discuss] Thin device support in ZFS?

2009-12-31 Thread Joerg Schilling
Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote:

 I have heard quite a few times that TRIM is good for SSD drives but 
 I don't see much actual use for it.  Every responsible SSD drive 
 maintains a reserve of unused space (20-50%) since it is needed for 
 wear leveling and to repair failing spots.  This means that even when 
 a SSD is 100% full it still has considerable space remaining.  A very 
 simple SSD design solution is that when a SSD block is overwritten 
 it is replaced with an already-erased block from the free pool and the 
 old block is submitted to the free pool for eventual erasure and 
 re-use.  This approach avoids adding erase times to the write latency 
 as long as the device can erase as fast as the average date write 
 rate.

The question in the case of SSDs is:

ZFS is COW, but does the SSD know which block is in use and which is not?

If the SSD did know whether a block is in use, it could erase unused blocks
in advance. But what is an unused block on a filesystem that supports
snapshots?


From the perspective of the SSD I see only the following difference between
a COW filesystem and a conventional filesystem. A conventional filesystem 
may write more often to the same block number than a COW filesystem does.
But even for the non-COW case, I would expect that the SSD frequently remaps
overwritten blocks to previously erased spares.

My conclusion is that ZFS on an SSD works fine as long as the primarily used
blocks plus all active snapshots use less space than the official size minus the 
spare reserve of the SSD. If, however, you fill up the medium, I expect
performance degradation.

Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily


Re: [zfs-discuss] Thin device support in ZFS?

2009-12-31 Thread Joerg Schilling
Richard Elling richard.ell...@gmail.com wrote:

 The reason you want TRIM for SSDs is to recover the write speed.
 A freshly cleaned page can be written faster than a dirty page.
 But in COW, you are writing to new pages and not rewriting old
 pages. This is fundamentally different than FAT, NTFS, or HFS+,
 but it is those markets which are driving TRIM adoption.

Your mistake is to assume a maiden SSD and not to think about what
happens after the SSD has been in use for a while. Even in the COW case,
blocks are reused after some time, and the disk has no way to
know in advance which blocks are still in use and which blocks are no
longer used and may be prepared for being overwritten.

Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily


Re: [zfs-discuss] Zpool creation best practices

2009-12-31 Thread Marion Hakanson
mijoh...@gmail.com said:
 I've never had a lun go bad but bad things do happen.  Does anyone else use
 ZFS in this way?  Is this an unrecommended setup?

We used ZFS like this on a Hitachi array for 3 years.  Worked fine, not
one bad block/checksum error detected.  Still using it on an old Sun 6120
array, too.


  It's too late to change my
 setup, but in the future when I'm planning new systems, should I consider the
 effort to allow zfs fully control all the disks? 

Well, you should certainly consider all the alternatives you can afford.
Our customers happen to like cheap bulk storage, so we have a Thumper,
and a few SAS-connected Sun J4000 SATA JBOD's.  But our grant-funded
researchers may not be a typical customer mix

Regards,

Marion




Re: [zfs-discuss] hard drive choice, TLER/ERC/CCTL

2009-12-31 Thread R.G. Keen
I'm in full overthink/overresearch mode on this issue, preparatory to ordering 
disks for my OS/zfs NAS build. So bear with me. I've been reading manuals and 
code, but it's hard for me to come up to speed on a new OS quickly. 

The question(s) underlying this thread seem to be:
(1) Does zfs raidz/raidz2/etc have the same issue with long recovery times as 
RAID5? That being dropping a drive from the array because it experiences an 
error and recovery that lasts longer than the controller (zfs/OS/device driver 
stack in this case) waits for an error message?
and 
(2) Can non raid edition drives be set to have shorter error recovery for 
raid use?

On (1), I pick out the following answers:
==
From Miles Nordin;
Does this happen in ZFS?

No. Any timeouts in ZFS are annoyingly based on the ``desktop''
storage stack underneath it which is unaware of redundancy and of the
possibility of reading data from elsewhere in a redundant stripe
rather than waiting 7, 30, or 180 seconds for it. ZFS will bang away
on a slow drive for hours, bringing the whole system down with it,
rather than read redundant data from elsewhere in the stripe, so you
don't have to worry about drives dropping out randomly. Every last
bit will be squeezed from the first place ZFS tried to read it, even
if this takes years. 
==
From Darren J Moffat;
A combination of ZFS and FMA on OpenSolaris means it will recover.
Many factors - not just the hard drive and its firmware -
determine how long the timeouts actually are.
==
From Erik Trimble;
The issue is excessive error recovery times INTERNAL to the hard drive.
So, worst case scenario is that ZFS marks the drive as bad during a
write, causing the zpool to be degraded. It's not going to lose your
data. It just may cause a premature marking of a drive as bad.

None of this kills a RAID (ZFS, traditional SW Raid, or HW Raid). It
doesn't cause data corruption. The issue is sub-optimal disk fault
determination.
==
From Richard Elling;
For the Solaris sd(7d) driver, the default timeout is 60 seconds with
3 or 5 retries, depending on the hardware. Whether you notice this at the
application level depends on other factors: reads vs writes, etc. You can tune
this, of course, and you have access to the source.
==
From dubslick;
Are you sure about this? I thought these consumer level drives would try 
indefinitely to carry out its operation. Even Samsung's white paper on CCTL 
RAID error recovery says it could take a minute or longer
==
From Bob Friesen;
 For a complete newbie, can someone simply answer the following: will
 using non-enterprise level drives affect ZFS like it affects
 hardware RAID?
Yes.
==
So from a group of knowledgeable people I get answers all the way from "no 
problem, it'll just work, may take a while though" to "using non-enterprise 
raid drives will affect zfs just like it does hardware raid", that being to 
unnecessarily drop out a disk, and thereby expose the array to failure from a 
second read/write fault on another disk.

Most of the votes seem to be in the "no problem" range. But beyond me trying to 
learn all the source code, is there any way to tell how it will really react? 

My issue is this: I *want* the attributes of consumer-level drives other than 
the infinite retries. I want slow spin speed for low vibration and low power 
consumption, am willing to deal with the slower transfer/access speeds to get 
it. I can pay for (but resent being forced to!) raid-rated drives, but I don't 
like the extra power consumption needed to get them to be very fast in access 
and transfers. I'm fine with whipping in a new drive when one of the existing 
ones gets flaky. I find that I may be in the curious position of being forced 
to pay twice the price and expend twice the power to get drives that have many 
features I don't want or need and don't have what I do need, except for the one 
issue which may (infrequently!) tear up whatever data I have built. ... maybe...

On question (2), I believe that my research has led to the following:
Drives which support the SMART Command Transport spec, which many newer 
disks do, appear to allow setting timeouts on read/write operation completion. 
However, this setting appears not to persist beyond a power cycle. 

Is there any good reason there can't be a driver (or service) added to the boot 
sequence that reads a file listing which drives need to be SCT-set, and sets their 
timeouts to something shorter than infinite (one of the issues from above) and also 
short enough that errors are returned in a timely manner, so that there 
is not a huge window for a second fault to corrupt a zfs array?
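(Purely as a sketch of what I have in mind - assuming a smartmontools build that exposes the drive's SCT error recovery control registers, and with device names made up - a boot-time script could be as simple as:)

    #!/bin/sh
    # set error recovery control to 7.0 seconds (70 deciseconds) for
    # reads and writes; must be re-run after every power cycle since
    # the setting does not persist
    for d in c1t0d0 c1t1d0 c1t2d0 c1t3d0; do
        smartctl -l scterc,70,70 /dev/rdsk/${d}s0
    done

The Solaris-side retry window Richard mentions is a separate knob (the sd:sd_io_time tunable in /etc/system, as I understand it).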

Forgive me if I'm being too literal here. Think 

Re: [zfs-discuss] Thin device support in ZFS?

2009-12-31 Thread Ragnar Sundblad

On 31 dec 2009, at 19.26, Richard Elling wrote:

 [I TRIMmed the thread a bit ;-)]
 
 On Dec 31, 2009, at 1:43 AM, Ragnar Sundblad wrote:
 On 31 dec 2009, at 06.01, Richard Elling wrote:
 
 In a world with copy-on-write and without snapshots, it is obvious that
 there will be a lot of blocks running around that are no longer in use.
 Snapshots (and their clones) changes that use case. So in a world of
 snapshots, there will be fewer blocks which are not used. Remember,
 the TRIM command is very important to OSes like Windows or OSX
 which do not have file systems that are copy-on-write or have decent
 snapshots. OTOH, ZFS does copy-on-write and lots of ZFS folks use
 snapshots.
 
 I don't believe that there is such a big difference between those
 cases.
 
 The reason you want TRIM for SSDs is to recover the write speed.
 A freshly cleaned page can be written faster than a dirty page.
 But in COW, you are writing to new pages and not rewriting old
 pages. This is fundamentally different than FAT, NTFS, or HFS+,
 but it is those markets which are driving TRIM adoption.

Flash SSDs actually always remap new writes into an
only-append-to-new-pages style, pretty much as ZFS does itself.
So for an SSD there is no big difference between ZFS and
filesystems such as UFS, NTFS, HFS+ et al; at the flash level they
all work the same.
The reason is that there is no way for it to rewrite single
disk blocks; it can only fill up already erased pages of
512K (for example). When the old blocks get mixed with unused
blocks (because of block rewrites, TRIM or Write Same/UNMAP),
it needs to compact the data by copying all active blocks from
those pages into previously erased pages, writing the
active data there compacted/contiguous. (When this happens, things tend
to get really slow.)

So TRIM is just as applicable to ZFS as any other file system
for flash SSD, there is no real difference.

 [TRIMmed]
 
 Once a block is freed in ZFS, it no longer needs it. So the problem
 of TRIM in ZFS is not related to the recent txg commit history.
 
 It may be that you want to save a few txgs back, so if you get
 a failure where parts of the last txg gets lost, you will still be
 able to get an old (few seconds/minutes) version of your data back.
 
 This is already implemented. Blocks freed in the past few txgs are
 not returned to the freelist immediately. This was needed to enable
 uberblock recovery in b128. So TRIMming from the freelist is safe.

I see, very good!

 This could happen if the sync commands aren't correctly implemented
 all the way (as we have seen some stories about on this list).
 Maybe someone disabled syncing somewhere to improve performance.
 
 It could also happen if a non volatile caching device, such as
 a storage controller, breaks in some bad way. Or maybe you just
 had a bad/old battery/supercap in a device that implements
 NV storage with batteries/supercaps.
 
 The
 issue is that traversing the free block list has to be protected by
 locks, so that the file system does not allocate a block when it is
 also TRIMming the block. Not so difficult, as long as the TRIM
 occurs relatively quickly.
 
 I think that any TRIM implementation should be an administration
 command, like scrub. It probably doesn't make sense to have it
 running all of the time.  But on occasion, it might make sense.
 
 I am not sure why it shouldn't run at all times, except for the
 fact that it seems to be badly implemented in some SATA devices
 with high latencies, so that it will interrupt any data streaming
 to/from the disks.
 
 I don't see how it would not have negative performance impacts.

It will, I am sure! But *if* the user for one reason or another
wants TRIM, it cannot be assumed that TRIMming big batches at
certain times is any better than trimming small amounts all the
time. Both behaviors may be useful, but I find it hard to see a really
good use case where you want batch trimming, and easy to see cases
where continuous trimming could be useful and hopefully hardly
noticeable thanks to the file system caching.
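(Either way, if a batch mode were added it would presumably be driven like a scrub; purely hypothetical syntax, nothing like this exists in current builds:)

    zpool trim tank      # hypothetical: walk the free list and TRIM it
    zpool status tank    # would report progress much like a scrub does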

/ragge s



Re: [zfs-discuss] hard drive choice, TLER/ERC/CCTL

2009-12-31 Thread Bob Friesenhahn

On Thu, 31 Dec 2009, R.G. Keen wrote:

I'm in full overthink/overresearch mode on this issue, preparatory 
to ordering disks for my OS/zfs NAS build. So bear with me. I've 
been reading manuals and code, but it's hard for me to come up to 
speed on a new OS quickly.


The question(s) underlying this thread seem to be:
(1) Does zfs raidz/raidz2/etc have the same issue with long recovery 
times as RAID5? That being dropping a drive from the array because 
it experiences an error and recovery that lasts longer than the 
controller (zfs/OS/device driver stack in this case) waits for an 
error message?

and
(2) Can non raid edition drives be set to have shorter error recovery for 
raid use?


I like the nice and short answer from this Bob Friesen fellow the 
best. :-)


I have heard that some vendors' drives can be re-flashed or set to use 
short timeouts.  Some vendors don't like this so they are trying to 
prohibit it or doing so may invalidate the warranty.


Unless things have changed (since a couple of years ago when I last 
looked), there are some vendors (e.g. Seagate) who offer enterprise 
SATA drives with only a small surcharge over astonishingly similar 
desktop SATA drives.  The only actual difference seems to be the 
firmware which is loaded on the drive.  Check out the Barracuda ES.2 
series.


It does not really matter what Solaris or ZFS does if the drive 
essentially locks up when it is trying to recover a bad sector.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] Thin device support in ZFS?

2009-12-31 Thread David Magda

On Dec 31, 2009, at 13:44, Joerg Schilling wrote:

ZFS is COW, but does the SSD know which block is in use and which  
is not?


If the SSD did know whether a block is in use, it could erase unused  
blocks
in advance. But what is an unused block on a filesystem that  
supports

snapshots?


Personally, I think that at some point in the future there will need  
to be a command telling SSDs that the file system will take care of  
handling blocks, as new FS designs will be COW. ZFS is the first  
mainstream one to do it, but Btrfs is there as well, and it looks  
like Apple will be making its own FS.


Just as the first 4096-byte block disks are silently emulating 
4096-to-512 blocks, SSDs are currently re-mapping LBAs behind the scenes.  
Perhaps in the future there will be a setting to say "no really, I'm  
talking about the /actual/ LBA 123456".




Re: [zfs-discuss] ZFS extremely slow performance

2009-12-31 Thread Emily Grettel

Hello!

 


 This could be a broken disk, or it could be some other
 hardware/software/firmware issue. Check the errors on the
 device with
 iostat -En


Here's the output:

 

c7t1d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: WDC WD10EADS-00L Revision: 1A01 Serial No:
Size: 1000.20GB 1000204886016 bytes
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 4 Predictive Failure Analysis: 0
c7t2d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: WDC WD10EADS-00P Revision: 0A01 Serial No:
Size: 1000.20GB 1000204886016 bytes
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 4 Predictive Failure Analysis: 0
c7t3d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: WDC WD10EADS-00P Revision: 0A01 Serial No:
Size: 1000.20GB 1000204886016 bytes
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 4 Predictive Failure Analysis: 0
c7t4d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: WDC WD10EADS-00P Revision: 0A01 Serial No:
Size: 1000.20GB 1000204886016 bytes
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 4 Predictive Failure Analysis: 0
c7t5d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: WDC WD10EADS-00P Revision: 0A01 Serial No:
Size: 1000.20GB 1000204886016 bytes
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 4 Predictive Failure Analysis: 0
c7t0d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: WDC WD740GD-00FL Revision: 8F33 Serial No:
Size: 74.36GB 74355769344 bytes
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 6 Predictive Failure Analysis: 0
c0t1d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: WDC WD10EADS-00P Revision: 0A01 Serial No:
Size: 1000.20GB 1000204886016 bytes
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 5 Predictive Failure Analysis: 0
c3t0d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: WDC WD7500AAKS-0 Revision: 4G30 Serial No:
Size: 750.16GB 750156374016 bytes
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 5 Predictive Failure Analysis: 0
c3t1d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA  Product: WDC WD7500AAKS-0 Revision: 4G30 Serial No:
Size: 750.16GB 750156374016 bytes
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 5 Predictive Failure Analysis: 0


 You should also check the fma logs:
 fmadm faulty

 

Empty


 fmdump -eV

This turned out to be huge. But they're mostly something like this:

Nov 13 2009 10:15:41.883716494 ereport.fs.zfs.checksum
nvlist version: 0
class = ereport.fs.zfs.checksum
ena = 0x7cfde552fd100401
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = zfs
pool = 0xda1d003c03abad23
vdev = 0x4389ee65271b9187
(end detector)

pool = tank
pool_guid = 0xda1d003c03abad23
pool_context = 0
pool_failmode = wait
vdev_guid = 0x4389ee65271b9187
vdev_type = replacing
parent_guid = 0x79c2f2cf0b81ae5a
parent_type = raidz
zio_err = 0
zio_offset = 0xae9b3fa00
zio_size = 0x6600
zio_objset = 0x24
zio_object = 0x1b2
zio_level = 0
zio_blkid = 0x635
__ttl = 0x1
__tod = 0x4afc971d 0x34ac718e

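(For a rough count by event class, something like this seems to work - 
assuming the class is the last column of plain "fmdump -e" output, as it 
appears to be on my build:)

   fmdump -e | awk 'NR > 1 {print $NF}' | sort | uniq -c | sort -rn
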
Thanks for helping and telling me about those commands :-)

The scrub I started last night is still running; it usually takes about 8 
hours. I'll post the results.

- Em

 From: richard.ell...@gmail.com
 To: mi...@task.gda.pl
 Date: Thu, 31 Dec 2009 08:37:03 -0800
 CC: zfs-discuss@opensolaris.org
 Subject: Re: [zfs-discuss] ZFS extremely slow performance
 
 On Dec 31, 2009, at 2:49 AM, Robert Milkowski wrote:
 
 
  judging by a *very* quick glance it looks like you have an issue 
  with c3t0d0 device which is responding very slowly.
 
 Yes, there is an I/O stuck on the device which is not getting serviced.
 See below...
 
 
  -- 
  Robert Milkowski
  http://milek.blogspot.com
 
 
  
_
If It Exists, You'll Find it on SEEK Australia's #1 job site
http://clk.atdmt.com/NMN/go/157639755/direct/01/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] raidz1 pool import failed with missing slog

2009-12-31 Thread Yuriy Vasylchenko
In osol 2009.06 the rpool vdev was dying, but I was able to do a clean export 
of the data pool. The data pool's ZIL (slog) was on a slice of the failed HDD, 
so the slog device and its GUID are gone as well. As a result I have 4 out of 
4 healthy raidz1 data drives but cannot import the zpool to access the data. 
This is obviously a disaster for me.

I found two discussions about similar issues:
http://opensolaris.org/jive/thread.jspa?messageID=233666 and
http://opensolaris.org/jive/thread.jspa?messageID=420073
But I don't think the recipes in those threads can help me import the 
inconsistent pool.

Is there any way to ignore missing ZIL devices during the import?
I expected this to be possible, since 
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide says:
[b]Hybrid Storage Pools (or Pools with SSDs)[/b] 
...
If a separate log device is not mirrored and the device that contains the log 
fails, storing log blocks reverts to the storage pool.

If not - can I somehow reassemble the pool using [b]zpool import -D[/b] option, 
or do anything else to get my data back?
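
(What I was hoping exists is something along these lines - I believe recent 
development builds grew a -m flag for zpool import that tolerates a missing 
log device, but I'm not sure the 2009.06 bits have it, so please correct me 
if this is wishful thinking; the pool name below is just a placeholder:)

   zpool import -m mypool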

Please help!
-- 
This message posted from opensolaris.org

zpool_import.log
Description: Binary data
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] hard drive choice, TLER/ERC/CCTL

2009-12-31 Thread R.G. Keen
 On Thu, 31 Dec 2009, Bob Friesenhahn wrote:
 I like the nice and short answer from this Bob
 Friesen fellow the 
 best. :-)
It was succinct, wasn't it?  8-)

Sorry - I pulled the attribution from the ID, not the 
signature which was waiting below. DOH!

When you say:
 It does not really matter what Solaris or ZFS does if the drive 
 essentially locks up when it is trying to recover a bad sector.
I'd have to say that it depends. If Solaris/zfs/etc. is restricted
to actions which consist of marking the disk semi-permanently
bad and continuing, yes, it amounts to the same thing: it opens
a yawning chasm of one more error and you're dead, until the
array can be serviced and un-degraded. At least I think it 
does, based on what I've read, anyway.

However, if OS/S/zfs/etc. performs an appropriate fire drill up
to and including logging the issues, quiescing the array, and 
annoying the operator then it closes up the sudden-death window. 
This gives the operator of the array a chance to do something 
about it, such as swapping in a spare and starting 
rebuilding/resilvering/etc. 

Given the largish aggregate monetary value to RAIDZ builders of 
sidestepping the doubled-cost of raid specialized drives, it occurs
to me that having a special set of actions for desktop-ish drives 
might be a good idea. Something like a fix-the-failed repair mode
which pulls all recoverable data off the purportedly failing drive
and onto a new spare to avoid a monster resilvering and the associated
vulnerable time to a second or third failure.

Viewed in that light, exactly what OS/S/zfs does on a long extended
reply from a disk and exactly what can be done to minimize the 
time when the array runs in a degraded mode where the next step
loses the data seems to be a really important issue. 

Well, OK, it does to me because my purpose here is getting to 
background scrubbing of errors in the disks. Other things might
be more important to others.  8-)

And the question might be moot if the SMART SCT architecture in
desktop drives lets you do a power-on hack to shorten the reply-failed
time for better raid operation. That's actually the solution I'd like
to see in a perfect world - I get back to a redundant array of INEXPENSIVE
disks, and I can pick those disks to be big and slow/low power instead
of fast/high power. 
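
(Roughly what I have in mind, if the drive and your smartctl build both 
support SCT ERC - the values are tenths of a second, the device path is only 
an example, and on many desktop drives the setting is volatile, so it would 
have to be reapplied at every power-on:)

   # ask the drive to give up on an unreadable sector after 7 seconds
   smartctl -d sat -l scterc,70,70 /dev/rdsk/c7t1d0s0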

I'd welcome any enlightened speculation on this. I do recognize that
I'm an idiot on these matters compared to people with actual 
experience. 8-)
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] $100 SSD = 5x faster dedupe

2009-12-31 Thread Michael Herf
I've written about my slow-to-dedupe RAIDZ.

After a week of... waiting... I finally bought a little $100 30G OCZ
Vertex and plugged it in as a cache.

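(For the record, hooking it up was just the usual cache-vdev step - the pool 
and device names below are placeholders, use whatever your SSD enumerates as:)

   # attach the SSD as an L2ARC (cache) device
   zpool add tank cache c9t0d0
   # then watch it warm up
   zpool iostat -v tank 10
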
After 2 hours of warmup, my zfs send/receive rate on the pool is
16MB/sec (reading and writing each at 16MB/sec, as measured by zpool
iostat).
That's up from 3MB/sec, with a RAM-only cache on a 6GB machine.

The SSD has about 8GB utilized right now, and the L2ARC benefit is amazing.
Quite an improvement for $100... I'd recommend you don't dedupe without one.

mike
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] hard drive choice, TLER/ERC/CCTL

2009-12-31 Thread Bob Friesenhahn

On Thu, 31 Dec 2009, R.G. Keen wrote:

Given the largish aggregate monetary value to RAIDZ builders of
sidestepping the doubled-cost of raid specialized drives, it occurs
to me that having a special set of actions for desktop-ish drives
might be a good idea. Something like a fix-the-failed repair mode
which pulls all recoverable data off the purportedly failing drive
and onto a new spare to avoid a monster resilvering and the associated
vulnerable time to a second or third failure.


The problem is that a desktop-ish drive may single-mindedly focus on 
reading the bad data while otherwise responding as if it is alive.  So 
everything just waits a long time while the OS sends new requests to 
the drive (which are received) but the OS does not get the requested 
data back.  To make matters worse, the OS might send another request 
for the same data, the drive gives up on the last request, and then 
proceeds with the new request for the same bad data.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] $100 SSD = 5x faster dedupe

2009-12-31 Thread Michael Herf
Make that 25MB/sec, and rising...
So it's 8x faster now.

mike
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] hard drive choice, TLER/ERC/CCTL

2009-12-31 Thread Richard Elling

On Dec 31, 2009, at 6:14 PM, R.G. Keen wrote:

On Thu, 31 Dec 2009, Bob Friesenhahn wrote:
I like the nice and short answer from this Bob
Friesen fellow the
best. :-)

It was succinct, wasn't it?  8-)

Sorry - I pulled the attribution from the ID, not the
signature which was waiting below. DOH!

When you say:

It does not really matter what Solaris or ZFS does if the drive
essentially locks up when it is trying to recover a bad sector.

I'd have to say that it depends. If Solaris/zfs/etc. is restricted
to actions which consist of marking the disk semi-permanently
bad and continuing, yes, it amounts to the same thing: it opens
a yawning chasm of one more error and you're dead, until the
array can be serviced and un-degraded. At least I think it
does, based on what I've read, anyway.


Some nits:
disks aren't marked as semi-bad, but if ZFS has trouble with a
block, it will try not to use the block again.  So there are two levels
of recovery at work: whole device and block.

The "one more and you're dead" case is really N errors in T time.
For disks which don't return when there is an error, you can
reasonably expect that T will be a long time (multiples of 60
seconds) and therefore the "N in T" threshold will not be triggered.

The term "degraded" does not have a consistent definition
across the industry. See the zpool man page for the definition
used by ZFS.  In particular, DEGRADED != FAULTED.


However, if OS/S/zfs/etc. performs an appropriate fire drill up
to and including logging the issues, quiescing the array, and
annoying the operator then it closes up the sudden-death window.
This gives the operator of the array a chance to do something
about it, such as swapping in a spare and starting
rebuilding/resilvering/etc.


Issues are logged, for sure.  If you want to monitor them proactively,
you need to configure SNMP traps for FMA.
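
(A quick sanity check that the trap-generating module is at least loaded - 
the grep pattern is from memory, so verify the exact module name on your 
build:)

   fmadm config | grep -i snmp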


Given the largish aggregate monetary value to RAIDZ builders of
sidestepping the doubled-cost of raid specialized drives, it occurs
to me that having a special set of actions for desktop-ish drives
might be a good idea. Something like a fix-the-failed repair mode
which pulls all recoverable data off the purportedly failing drive
and onto a new spare to avoid a monster resilvering and the associated
vulnerable time to a second or third failure.


It already does this, as long as there are N errors in T time.  There
is room for improvement here, but I'm not sure how one can set a
rule that would explicitly take care of the I/O never returning from
a disk while a different I/O to the same disk returns.  More research
required here...


Viewed in that light, exactly what OS/S/zfs does on a long extended
reply from a disk and exactly what can be done to minimize the
time when the array runs in a degraded mode where the next step
loses the data seems to be a really important issue.


Once the state changes to DEGRADED, the admin must "zpool clear"
the errors to return the state to normal. Make sure your definition of
degraded matches ZFS's.
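
For example (pool and device names here are only illustrative):

   # clear the errors on one device, or on the whole pool
   zpool clear tank c3t0d0
   zpool clear tank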


Well, OK, it does to me because my purpose here is getting to
background scrubbing of errors in the disks. Other things might
be more important to others.  8-)

And the question might be moot if the SMART SCT architecture in
desktop drives lets you do a power-on hack to shorten the reply-failed
time for better raid operation. That's actually the solution I'd like
to see in a perfect world - I get back to a redundant array of
INEXPENSIVE disks, and I can pick those disks to be big and slow/low
power instead of fast/high power.


In my experience, disk drive firmware quality and feature sets vary
widely.  I've got a bunch of scars from shaky firmware and I even
got a new one a few months ago. So perhaps one day the disk
vendors will perfect their firmware? :-)


I'd welcome any enlightened speculation on this. I do recognize that
I'm an idiot on these matters compared to people with actual
experience. 8-)


So you want some scars too? :-)
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss