Re: [zfs-discuss] Performance with Sun StorageTek 2540
On Thu, 14 Feb 2008, Tim wrote: If you're going for best single file write performance, why are you doing mirrors of the LUNs? Perhaps I'm misunderstanding why you went from one giant raid-0 to what is essentially a raid-10. That decision was made because I also need data reliability. As mentioned before, the write rate peaked at 200MB/second using RAID-0 across 12 disks exported as one big LUN. Other firmware-based methods I tried typically offered about 170MB/second. Even a four disk firmware-managed RAID-5 with ZFS on top offered about 165MB/second. Given that I would like to achieve 300MB/second, a few tens of MB don't make much difference. It may be that I bought the wrong product, but perhaps there is a configuration change which will help make up some of the difference without sacrificing data reliability. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance with Sun StorageTek 2540
On Fri, 15 Feb 2008, Will Murnane wrote: What is the workload for this system? Benchmarks are fine and good, but application performance is the determining factor of whether a system is performing acceptably. The system is primarily used for image processing where the image data is uncompressed and a typical file is 12MB. In some cases the files will be hundreds of MB or GB. The typical case is to read a file and output a new file. For some very large files, an uncompressed temporary file is edited in place with random access. I am the author of the application and need the filesystem to be fast enough that it will uncover any slowness in my code. :-) Perhaps iozone is behaving in a bad way; you might investigate That is always possible. Iozone (http://www.iozone.org/) has been around for a very long time and has seen a lot of improvement by many smart people so it does not seem very suspect. bonnie++: http://www.sunfreeware.com/programlistintel10.html I will check it out. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
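For readers who want to reproduce this kind of single-stream test, a minimal iozone invocation along these lines exercises sequential write and read with a 128K record size; the file name and sizes are placeholders, and the test file should be larger than RAM so the ARC cannot hide the disks:

    iozone -i 0 -i 1 -r 128k -s 64g -e -f /tank/iozone.tmp

The -e flag folds fsync() time into the result so that cached-but-unwritten data does not inflate the numbers.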
Re: [zfs-discuss] ZFS write throttling
On Fri, 15 Feb 2008, Roch Bourbonnais wrote: The latter appears to be bug 6429855. But the underlying behaviour doesn't really seem desirable; are there plans afoot to do any work on ZFS write throttling to address this kind of thing? Throttling is being addressed. http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6429205 I have observed similar behavior when using 'iozone' on a large file to benchmark ZFS on my StorageTek 2540 array. Fsstat shows gaps of up to 30 seconds of no I/O when run on a 10 second update cycle but when I go to look at the lights on the array, I see that it is actually fully busy. It seems that the application is stalled during this load. It also seems that simple operations like 'ls' get stalled under such heavy load. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
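For anyone who wants to watch for the same stalls, the observation above can be reproduced with fsstat and zpool iostat on a 10 second cycle (the pool name and mount point are placeholders); the gaps show up as intervals with essentially zero filesystem activity while the array lights stay busy:

    fsstat /tank 10
    zpool iostat tank 10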
Re: [zfs-discuss] Performance with Sun StorageTek 2540
On Fri, 15 Feb 2008, Roch Bourbonnais wrote: What was the interlace on the LUN? The question was about LUN interlace, not interface. 128K to 1M works better. The segment size is set to 128K. The max the 2540 allows is 512K. Unfortunately, the StorageTek 2540 and CAM documentation does not really define what segment size means. Any compression? Compression is disabled. Does turning off checksums help the numbers (that would point to CPU-limited throughput)? I have not tried that, but this system is loafing during the benchmark. It has four 3GHz Opteron cores. Does this output from 'iostat -xnz 20' help to understand the issues?

                        extended device statistics
    r/s    w/s   kr/s     kw/s  wait  actv wsvc_t asvc_t  %w  %b device
    3.0    0.7   26.4      3.5   0.0   0.0    0.0    4.2   0   2 c1t1d0
    0.0  154.2    0.0  19680.3   0.0  20.7    0.0  134.2   0  59 c4t600A0B80003A8A0B096147B451BEd0
    0.0  211.5    0.0  26940.5   1.1  33.9    5.0  160.5  99 100 c4t600A0B800039C9B50A9C47B4522Dd0
    0.0  211.5    0.0  26940.6   1.1  33.9    5.0  160.4  99 100 c4t600A0B800039C9B50AA047B4529Bd0
    0.0  154.0    0.0  19654.7   0.0  20.7    0.0  134.2   0  59 c4t600A0B80003A8A0B096647B453CEd0
    0.0  211.3    0.0  26915.0   1.1  33.9    5.0  160.5  99 100 c4t600A0B800039C9B50AA447B4544Fd0
    0.0  152.4    0.0  19447.0   0.0  20.5    0.0  134.5   0  59 c4t600A0B80003A8A0B096A47B4559Ed0
    0.0  213.2    0.0  27183.8   0.9  34.1    4.2  159.9  90 100 c4t600A0B800039C9B50AA847B45605d0
    0.0  152.5    0.0  19453.4   0.0  20.5    0.0  134.5   0  59 c4t600A0B80003A8A0B096E47B456DAd0
    0.0  213.2    0.0  27177.4   0.9  34.1    4.2  159.9  90 100 c4t600A0B800039C9B50AAC47B45739d0
    0.0  213.2    0.0  27195.3   0.9  34.1    4.2  159.9  90 100 c4t600A0B800039C9B50AB047B457ADd0
    0.0  154.4    0.0  19711.8   0.0  20.7    0.0  134.0   0  59 c4t600A0B80003A8A0B097347B457D4d0
    0.0  211.3    0.0  26958.6   1.1  33.9    5.0  160.6  99 100 c4t600A0B800039C9B50AB447B4595Fd0

Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance with Sun StorageTek 2540
On Fri, 15 Feb 2008, Peter Tribble wrote: Each LUN is accessed through only one of the controllers (I presume the 2540 works the same way as the 2530 and 61X0 arrays). The paths are active/passive (if the active fails it will relocate to the other path). When I set mine up the first time it allocated all the LUNs to controller B and performance was terrible. I then manually transferred half the LUNs to controller A and it started to fly. I assume that you either altered the Access State shown for the LUN in the output of 'mpathadm show lu DEVICE' or you noticed and observed the pattern:

    Target Port Groups:
            ID:  3
            Explicit Failover:  yes
            Access State:  active
            Target Ports:
                    Name:  200400a0b83a8a0c
                    Relative ID:  0
            ID:  2
            Explicit Failover:  yes
            Access State:  standby
            Target Ports:
                    Name:  200500a0b83a8a0c
                    Relative ID:  0

I find this all very interesting and illuminating:

    for dev in c4t600A0B80003A8A0B096A47B4559Ed0 \
               c4t600A0B80003A8A0B096E47B456DAd0 \
               c4t600A0B80003A8A0B096147B451BEd0 \
               c4t600A0B80003A8A0B096647B453CEd0 \
               c4t600A0B80003A8A0B097347B457D4d0 \
               c4t600A0B800039C9B50A9C47B4522Dd0 \
               c4t600A0B800039C9B50AA047B4529Bd0 \
               c4t600A0B800039C9B50AA447B4544Fd0 \
               c4t600A0B800039C9B50AA847B45605d0 \
               c4t600A0B800039C9B50AAC47B45739d0 \
               c4t600A0B800039C9B50AB047B457ADd0 \
               c4t600A0B800039C9B50AB447B4595Fd0
    do
      echo "=== $dev ==="
      mpathadm show lu /dev/rdsk/$dev | grep 'Access State'
    done

    === c4t600A0B80003A8A0B096A47B4559Ed0 ===
            Access State:  active
            Access State:  standby
    === c4t600A0B80003A8A0B096E47B456DAd0 ===
            Access State:  active
            Access State:  standby
    === c4t600A0B80003A8A0B096147B451BEd0 ===
            Access State:  active
            Access State:  standby
    === c4t600A0B80003A8A0B096647B453CEd0 ===
            Access State:  active
            Access State:  standby
    === c4t600A0B80003A8A0B097347B457D4d0 ===
            Access State:  active
            Access State:  standby
    === c4t600A0B800039C9B50A9C47B4522Dd0 ===
            Access State:  active
            Access State:  standby
    === c4t600A0B800039C9B50AA047B4529Bd0 ===
            Access State:  standby
            Access State:  active
    === c4t600A0B800039C9B50AA447B4544Fd0 ===
            Access State:  standby
            Access State:  active
    === c4t600A0B800039C9B50AA847B45605d0 ===
            Access State:  standby
            Access State:  active
    === c4t600A0B800039C9B50AAC47B45739d0 ===
            Access State:  standby
            Access State:  active
    === c4t600A0B800039C9B50AB047B457ADd0 ===
            Access State:  standby
            Access State:  active
    === c4t600A0B800039C9B50AB447B4595Fd0 ===
            Access State:  standby
            Access State:  active

Notice that the first six LUNs are active on one controller while the second six LUNs are active on the other controller. Based on this, I should rebuild my pool by splitting my mirrors across this boundary. I am really happy that ZFS makes such things easy to try out.

Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
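A sketch of what splitting the mirrors across that boundary might look like when building the pool, pairing one LUN from the first (controller A) group with one from the second (controller B) group; the pool name 'tank' is a placeholder and only two of the six pairs are shown:

    zpool create tank \
      mirror c4t600A0B80003A8A0B096147B451BEd0 c4t600A0B800039C9B50AA047B4529Bd0 \
      mirror c4t600A0B80003A8A0B096647B453CEd0 c4t600A0B800039C9B50AA447B4544Fd0
    # ...the remaining four mirror pairs follow the same pattern

With this layout each mirror half lives behind a different active controller, so a controller failure leaves every vdev with one working side.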
Re: [zfs-discuss] Performance with Sun StorageTek 2540
On Fri, 15 Feb 2008, Peter Tribble wrote: May not be relevant, but still worth checking - I have a 2530 (which ought to be the same, only SAS instead of FC), and got fairly poor performance at first. Things improved significantly when I got the LUNs properly balanced across the controllers. What do you mean by properly balanced across the controllers? Are you using the multipath support in Solaris 10 or are you relying on ZFS to balance the I/O load? Do some disks have more affinity for one controller than the other? With the 2540, there is a FC connection to each redundant controller. The Solaris 10 multipathing presumably load-shares the I/O to each controller. The controllers then perform some sort of magic to get the data to and from the SAS drives. The controller stats are below. I notice that controller B has seen a bit more activity than controller A, but the firmware does not provide a controller uptime value so it is possible that one controller was up longer than the other:

    Performance Statistics - A on Storage System Array-1
    Timestamp: Fri Feb 15 14:37:39 CST 2008
    Total IOPS:                1098.83
    Average IOPS:               355.83
    Read %:                      38.28
    Write %:                     61.71
    Total Data Transferred:  139284.41 KBps
    Read:                     53844.26 KBps
    Average Read:             17224.04 KBps
    Peak Read:               242232.70 KBps
    Written:                  85440.15 KBps
    Average Written:          26966.58 KBps
    Peak Written:            139918.90 KBps
    Average Read Size:          639.96 KB
    Average Write Size:         629.94 KB
    Cache Hit %:                 85.32

    Performance Statistics - B on Storage System Array-1
    Timestamp: Fri Feb 15 14:37:45 CST 2008
    Total IOPS:                1526.69
    Average IOPS:               497.32
    Read %:                      34.90
    Write %:                     65.09
    Total Data Transferred:  193594.58 KBps
    Read:                     68200.00 KBps
    Average Read:             24052.61 KBps
    Peak Read:               339693.55 KBps
    Written:                 125394.58 KBps
    Average Written:          37768.40 KBps
    Peak Written:            183534.66 KBps
    Average Read Size:          895.80 KB
    Average Write Size:         883.38 KB
    Cache Hit %:                 75.05

If I then go to the performance stats on an individual disk, I see

    Performance Statistics - Disk-08 on Storage System Array-1
    Timestamp: Fri Feb 15 14:43:36 CST 2008
    Total IOPS:                 196.33
    Average IOPS:                72.01
    Read %:                       9.65
    Write %:                     90.34
    Total Data Transferred:   25076.91 KBps
    Read:                      2414.11 KBps
    Average Read:              3521.44 KBps
    Peak Read:                48422.00 KBps
    Written:                  22662.79 KBps
    Average Written:           5423.78 KBps
    Peak Written:             28036.43 KBps
    Average Read Size:          127.29 KB
    Average Write Size:         127.77 KB
    Cache Hit %:                 89.30

Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance with Sun StorageTek 2540
On Fri, 15 Feb 2008, Luke Lonergan wrote: I only managed to get 200 MB/s write when I did RAID 0 across all drives using the 2540's RAID controller and with ZFS on top. Ridiculously bad. I agree. :-( While I agree that data is sent twice (actually up to 8X if striping across four mirrors) Still only twice the data that would otherwise be sent, in other words: the mirroring causes a duplicate set of data to be written. Right. But more little bits of data to be sent due to ZFS striping. Given that you're not even saturating the FC-AL links, the problem is in the hardware RAID. I suggest disabling read and write caching in the hardware RAID. Hardware RAID is not an issue in this case since each disk is exported as a LUN. Performance with ZFS is not much different than when hardware RAID was used. I previously tried disabling caching in the hardware and it did not make a difference in the results. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance with Sun StorageTek 2540
On Fri, 15 Feb 2008, Bob Friesenhahn wrote: Notice that the first six LUNs are active to one controller while the second six LUNs are active to the other controller. Based on this, I should rebuild my pool by splitting my mirrors across this boundary. I am really happy that ZFS makes such things easy to try out. Now that I have tried this out, I can unhappily say that it made no measurable difference to actual performance. However it seems like a better layout anyway. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance with Sun StorageTek 2540
On Fri, 15 Feb 2008, Albert Chin wrote: http://groups.google.com/group/comp.unix.solaris/browse_frm/thread/59b43034602a7b7f/0b500afc4d62d434?lnk=stq=#0b500afc4d62d434 This is really discouraging. Based on these newsgroup postings I am thinking that the Sun StorageTek 2540 was not a good investment for me, especially given that the $23K for it came right out of my own paycheck and it took me 6 months of frustration (the first shipment was damaged) to receive it. Regardless, this was the best I was able to afford unless I built the drive array myself. The page at http://www.sun.com/storagetek/disk_systems/workgroup/2540/benchmarks.jsp claims 546.22 MBPS for the large file processing benchmark. So I went to look at the actual SPC-2 full disclosure report and saw that for one stream, the average data rate is 105MB/second (compared with 102MB/second with RAID-5), rising to 284MB/second with 10 streams. The product obviously performs much better for reads than it does for writes and is better for multi-user performance than single-user. It seems like I am getting a good bit more performance from my own setup than what the official benchmark suggests (they used 72GB drives, 24 drives in total), so it seems that everything is working fine. This is a lesson for me, and I have certainly learned a fair amount about drive arrays, Fibre Channel, and ZFS in the process. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] 'du' is not accurate on zfs
I have a script which generates a file and then immediately uses 'du -h' to obtain its size. With Solaris 10 I notice that this often returns an incorrect value of '0' as if ZFS is lazy about reporting actual disk use. Meanwhile, 'ls -l' does report the correct size. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
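A small sequence that should reproduce the effect on a ZFS filesystem (the file name is arbitrary); du reports allocated blocks, which lag behind until the pending transaction group is committed, while ls -l reports the logical length immediately:

    dd if=/dev/zero of=testfile bs=1024k count=10
    ls -l testfile    # shows the full 10MB length right away
    du -h testfile    # may show 0 until the data is actually on disk
    sync
    du -h testfile    # after the transaction group commits, du catches up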
Re: [zfs-discuss] Performance with Sun StorageTek 2540
On Sat, 16 Feb 2008, Peter Tribble wrote: Agreed. My 2530 gives me about 450MB/s on writes and 800 on reads. That's zfs striped across 4 LUNs, each of which is hardware raid-5 (24 drives in total, so each raid-5 LUN is 5 data + 1 parity). Is this single-file bandwidth or multiple-file/thread bandwidth? According to Sun's own benchmark data, the 2530 was capable of 20MB/second more than the 2540 on writes for a single large file, and the difference went away after that. For multi-user activity the throughput clearly improves to be similar to what you describe. Most people are likely interested in maximizing multi-user performance, and particularly for reads. Visit http://www.storageperformance.org/results/benchmark_results_spc2/#sun_spc2 to see the various benchmark results. According to these results, for large-file writes the 2530/2540 compares well with other StorageTek products, including the more expensive 6140 and 6540 arrays. It also compares well with similarly-sized storage products from other vendors. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 'du' is not accurate on zfs
On Sat, 16 Feb 2008, Richard Elling wrote: ls -l shows the length. ls -s shows the size, which may be different than the length. You probably want size rather than du. That is true. Unfortunately 'ls -s' displays in units of disk blocks and does not also consider the 'h' option in order to provide a value suitable for humans. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
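As a workaround, the block count from 'ls -s' can be converted by hand; a rough sketch, assuming the Solaris default of 512-byte blocks (GNU ls reports 1024-byte units, so adjust accordingly):

    blocks=`ls -s testfile | awk '{print $1}'`
    echo "`expr $blocks \* 512` bytes allocated"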
Re: [zfs-discuss] Performance with Sun StorageTek 2540
On Sat, 16 Feb 2008, Joel Miller wrote: Here is how you can tell the array to ignore cache sync commands and the force unit access bits...(Sorry if it wraps..) Thanks to the kind advice of yourself and Mertol Ozyoney, there is a huge boost in write performance: Was: 154MB/second Now: 279MB/second The average service time for each disk LUN has dropped considerably. The numbers provided by 'zpool iostat' are very close to what is measured by 'iozone'. This is like night and day and gets me very close to my original target write speed of 300MB/second. Thank you very much! Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance with Sun StorageTek 2540
On Sat, 16 Feb 2008, Mertol Ozyoney wrote: Please try to distribute LUNs between controllers and try to benchmark by disabling cache mirroring (it's different than disabling the cache). By the term disabling cache mirroring are you talking about Write Cache With Replication Enabled in the Common Array Manager? Does this feature maintain a redundant cache (two data copies) between controllers? Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] filebench for Solaris 10?
Some of us are still using Solaris 10 since it is the version of Solaris released and supported by Sun. The 'filebench' software from SourceForge does not seem to install or work on Solaris 10. The 'pkgadd' command refuses to recognize the package, even when it is set to Solaris 2.4 mode. I was able to build the software, but observing what 'make install' does shows that it installs into the private home directory of some hard-coded user. The 'make package' command builds an unusable package similar to the one on SourceForge. Are the filebench maintainers aware of this problem? Will a package which works for Solaris 10 (which some of us are still using) be posted? Thanks, Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Recommendations for per-user NFS shared home directories?
I am attempting to create per-user ZFS filesystems under an exported /home ZFS filesystem. This would work fine except that the ownership/permissions settings applied to the mount point of those per-user filesystems on the server are not seen by NFS clients. Instead NFS clients see directory ownership of root:other (Solaris 9 clients), root:wheel (OS-X clients), and root:daemon (FreeBSD clients). Only Solaris 10 clients seem to preserve original ownership and permissions. Is there a way to resolve this problem? Thanks, Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
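For context, a minimal sketch of the kind of setup being described (pool and user names are placeholders); the sharenfs property is inherited by the per-user child filesystems, and the ownership is applied on the server side:

    zfs create -o mountpoint=/home tank/home
    zfs set sharenfs=rw tank/home
    zfs create tank/home/alice
    chown alice:staff /home/alice

The question above is why that final chown is not visible from the older NFS clients.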
Re: [zfs-discuss] Recommendations for per-user NFS shared home directories?
On Sun, 17 Feb 2008, Mattias Pantzare wrote: You should use automount for your mountings if you have many clients. Change the automount map and all clients will mount the new filesystem if needed. You can move some users to a new server with very little work, just change the mapping for that user. Yes, of course. This would be easy if I was running a homogeneous network, but instead I have to deal with several kinds of automounter, some of which seem to change between each major release. This seems like a good task for another day. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
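For reference, the usual wildcard map makes this mostly mechanical once the per-user filesystems exist; a sketch for a Solaris client, where the server name and export path are placeholders and other automounter implementations use their own map syntax:

    # /etc/auto_master already contains:   /home   auto_home
    # /etc/auto_home entry:
    *       fileserver:/home/&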
Re: [zfs-discuss] Performance with Sun StorageTek 2540
On Mon, 18 Feb 2008, Ralf Ramge wrote: I'm a bit disturbed because I think about switching to 2530/2540 shelves, but a maximum 250 MB/sec would disqualify them instantly, even Note that this is single-file/single-thread I/O performance. I suggest that you read the formal benchmark report for this equipment since it covers multi-thread I/O performance as well. The multi-user performance is considerably higher. Given ZFS's smarts, the JBOD approach seems like a good one as long as the hardware provides a non-volatile cache. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] filebench for Solaris 10?
On Tue, 19 Feb 2008, Marion Hakanson wrote: I've installed and run filebench (version 1.1.0) from the SourceForge packages on Solaris-10 here, both SPARC and x86_64, with no problems. Looks like I downloaded it 23-Jan-2008. This is what I get with the filebench-1.1.0_x86_pkg.tar.gz from SourceForge:

    # pkgadd -d .
    pkgadd: ERROR: no packages were found in /home/bfriesen/src/benchmark/filebench
    # ls
    install/   pkginfo    pkgmap     reloc/

My system has the latest package management patches applied. What am I missing? Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] five megabytes per second with Microsoft iSCSI initiator (2.06)
It would be useful if people here who have used iSCSI on top of ZFS could share their performance experiences. It is very easy to waste a lot of time trying to realize unrealistic expectations. Hopefully iSCSI on top of ZFS normally manages to transfer much more than 5MB/second! Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
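As a point of comparison, the usual way to put iSCSI on top of ZFS at the time is a zvol exported as a target; a minimal sketch, assuming a build where the shareiscsi property and the iscsitgt-based target are available (names and sizes are placeholders):

    zfs create -V 100g tank/vol0
    zfs set shareiscsi=on tank/vol0
    iscsitadm list target -v     # confirm that a target was created for the zvol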
Re: [zfs-discuss] filebench for Solaris 10?
On Tue, 19 Feb 2008, Marion Hakanson wrote:

    # pkgadd -d .
    pkgadd: ERROR: no packages were found in /home/bfriesen/src/benchmark/filebench
    # ls
    install/   pkginfo    pkgmap     reloc/
    . . .

Um, cd .. and pkgadd -d . again. The package is the actual directory that you unpacked. Note the instructions for unpacking confused me a bit as well. I had expected to pkgadd -d . filebench, but pkgadd is smart enough to scan the entire -d directory for packages. Very odd. That worked. Thank you very much! It seems that filebench is unconventional in almost every possible way. Installing it based on the available documentation was an exercise in frustration. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
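For anyone hitting the same error, a distilled sketch of the working sequence, with the unpack location taken from the messages above:

    cd /home/bfriesen/src/benchmark   # the directory that *contains* the unpacked filebench/ directory
    pkgadd -d .                       # pkgadd scans the directory and finds the filebench package inside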
Re: [zfs-discuss] Preferred backup s/w
On advice of Joerg Schilling and not knowing what 'star' was, I decided to install it for testing. Star uses a very unorthodox build and install approach so the person building it has very little control over what it does. Unfortunately I made the mistake of installing it under /usr/local where it decided to remove the GNU tar I had installed there. Star does not support traditional tar command line syntax so it can't be used with existing scripts. Performance testing showed that it was no more efficient than the 'gtar' which comes with Solaris. It seems that 'star' does not support an 'uninstall' target so now I am forced to manually remove it from my system. It seems that the best way to deal with star is to install it into its own directory so that it does not interfere with existing software. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
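One possible way to keep star out of /usr/local entirely is to point its build at a private prefix; this is only a sketch, and the INS_BASE variable is an assumption about the Schily makefiles rather than something confirmed in this thread:

    make INS_BASE=/opt/star install   # INS_BASE is assumed to be the Schily makefile install prefix
    PATH=/opt/star/bin:$PATH          # pick it up explicitly without disturbing the system tar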
Re: [zfs-discuss] Preferred backup s/w
On Fri, 22 Feb 2008, Bob Friesenhahn wrote: where it decided to remove the GNU tar I had installed there. Star does not support traditional tar command line syntax so it can't be used with existing scripts. Performance testing showed that it was no more efficient than the 'gtar' which comes with Solaris. It seems There is something I should clarify in the above. Star is a stickler for POSIX command line syntax so syntax like 'tar -cvf foo.tar' or 'tar cvf foo.tar' does not work, but 'tar -c -v -f foo.tar' does work. Testing with Star, GNU tar, and Solaris cpio showed that Star and GNU tar were able to archive the content of my home directory with no complaint whereas Solaris cpio required specification of the 'ustar' format in order to deal with long file and path names, as well as large inode numbers. Solaris cpio complained about many things with my files (e.g. unresolved passwd and group info), but managed to produce the highest throughput when archiving to a disk file. I can not attest to the ability of these tools to deal with ACLs since I don't use them. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Preferred backup s/w
On Sat, 23 Feb 2008, Joerg Schilling wrote: Star typically needs 1/4 .. 1/3 of the CPU time needed by GNU tar and it uses two processes to do the work in parallel. If you found a case where star is not faster than GNU tar and where the speed is not limited by the filesystem or the I/O devices, this is a bug that will be fixed if you provide the needed information to repeat it. I re-ran my little test today and do see that 'star' produces a somewhat reduced overall run time but does not consume less CPU than GNU tar. This is just a test of the time to archive the files in my home directory. My home directory is in a zfs filesystem. The output is written to a file in the same storage pool but a different filesystem. This time around I used default block sizes rather than 128K. Overall throughput seems on the order of 40MB/second.

    gtar -cf gtar.tar /home/bfriesen  6.42s user 128.27s system 12% cpu 17:19.66 total
    -rw-r--r--   1 bfriesen home  37G Feb 23 10:55 gtar.tar

    star -c -f star.tar /home/bfriesen  4.11s user 142.65s system 15% cpu 16:03.41 total
    -rw-r--r--   1 bfriesen home  37G Feb 23 11:15 star.tar

    find /home/bfriesen -depth -print  0.55s user 3.52s system 6% cpu 1:01.61 total
    cpio -o -H ustar -O cpio.tar  11.47s user 122.28s system 11% cpu 18:38.97 total
    -rwxr-xr-x   1 bfriesen home  37G Feb 23 11:40 cpio.tar*

Notice that Sun's cpio marks its output file as executable, which is clearly a bug. Clearly none of these tools are adequate to deal with the massive data storage made easy with zfs storage pools. Zfs requires similarly innovative backup solutions to deal with it.

Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] The old problem with tar, zfs, nfs and zil
On Mon, 25 Feb 2008, msl wrote: I mean, can you confirm that the zil_disable/zfs solaris nfs service, is a similar service like a standard xfs or ext3 linux/nfs solution (take into account the NFS service provided)? From what I have heard:

* Linux does not implement NFS writes correctly in that data is not flushed to disk before returning. Don't turn your Linux system off during application writes since user data will likely be lost when the system returns. Besides the applications losing data, running applications are likely to become confused.

* ZFS has had an issue in that requesting a fsync() of one file causes a sync of the entire filesystem. This is a huge performance glitch. Wikipedia says that it is fixed in Solaris Nevada. Someone should update this Wikipedia section: http://en.wikipedia.org/wiki/ZFS#Solaris_implementation_issues

Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance with Sun StorageTek 2540
On Sun, 17 Feb 2008, Mertol Ozyoney wrote: Hi Bob; When you have some spare time can you prepare a simple benchmark report in PDF that I can share with my customers to demonstrate the performance of 2540 ? While I do not claim that it is simple I have created a report on my configuration and experience. It should be useful for users of the Sun StorageTek 2540, ZFS, and Solaris 10 multipathing. See http://www.simplesystems.org/users/bfriesen/zfs-discuss/2540-zfs-performance.pdf or http://tinyurl.com/2djewn for the URL challenged. Feel free this share this document with anyone who is interested. Thanks Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance with Sun StorageTek 2540
On Wed, 27 Feb 2008, Cyril Plisko wrote: http://www.simplesystems.org/users/bfriesen/zfs-discuss/2540-zfs-performance.pdf Nov 26, 2008 ??? May I borrow your time machine ? ;-) Are there any stock prices you would like to know about? Perhaps you are interested in the outcome of the elections? There was a time inversion layer in Texas. Fixed now ... Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Can ZFS be event-driven or not?
On Wed, 27 Feb 2008, Nicolas Williams wrote: Maybe snapshot file whenever a write-filedescriptor is closed or somesuch? Again. Not enough. Some apps (many!) deal with multiple files. Or more significantly, with multiple pages. When using memory mapping the application may close its file descriptor, but then the underlying file is updated in a somewhat random fashion as dirty pages are written to disk. It seems that this hypothesis is without merit. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Can ZFS be event-driven or not?
On Wed, 27 Feb 2008, Uwe Dippel wrote: As much as ZFS is revolutionary, it is far away from being the 'ultimate file system', if it doesn't know how to handle event-driven snapshots UFS == Ultimate File System ZFS == Zettabyte File System Perhaps you have these two confused? ZFS does not lay claim to being the ultimate file system. You can provide great benefit to society if you invent and implement a filesystem with all that ZFS offers, plus your remarkable ideas, provided that the result still provides the performance that users expect and there is sufficient storage space available. Consider this to be your life's mission. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Can ZFS be event-driven or not?
On Thu, 28 Feb 2008, Uwe Dippel wrote: 1. The application (NFS - sftp) does not know about the state of writing? Sometimes applications know about the state of writing and sometimes they do not. Sometimes they don't even know they are writing. 2. Obviously nobody sees anything in having access to all versions of a file stored there? First it is necessary to determine what version means when it comes to a file. At the application level, the system presents a different view than what is actually stored on disk since the system uses several levels of write caching to improve performance. The only time that these should necessarily be the same is if the application uses a file descriptor to access the file (no memory mapping) and invokes fsync(). If memory mapping is used, the equivalent is msync() with the MS_SYNC option. Using fsync() or msync(MS_SYNC) blocks the application until the I/O is done. If a file is updated via memory mapping, then the data sent to the underlying file is based on the system's virtual memory system so the actually data sent to disk may not be coherent at all. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Patch 127729-07 not NFS patch!
The Sun Update Manager on my x86 Solaris 10 box describes this new patch as SunOS 5.10_x86 nfs fs patch (note use of nfs) but looking at the problem descriptions this is quite clearly a big ZFS patch that Solaris 10 users should pay attention to since it fixes a bunch of nasty bugs. Maybe someone can fix this fat-fingered patch description in Sun Update Manager? Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] periodic ZFS disk accesses
On Sat, 1 Mar 2008, Bill Shannon wrote: I think I've reached the limit of what I can do remotely. Now I have to repeat all these experiments when I'm sitting next to the disk and can actually hear it and see if the correlation remains. Then, it may be time to dig into the ksh93 code and figure out what it thinks it's doing. Fortunately, I've been there before... One thing that can make a shell periodically active is if it is checking for new mail. Check the ksh man page for descriptions of MAIL, MAILCHECK, MAILPATH. Perhaps whenever it checks for new mail, it also updates this file. Unsetting the MAIL environment variable may make the noise go away. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
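If the mail check turns out to be the culprit, a quick way to test the theory from the suspect shell (this only affects the current session):

    unset MAIL MAILPATH    # ksh stops polling a mailbox file when neither variable is set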
Re: [zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
On Mon, 3 Mar 2008, Darren J Moffat wrote: I'm not convinced that single bit flips are the common failure mode for disks. Most enterprise class disks already have enough ECC to correct at least 8 bytes per block. and for consumer rather than enterprise class disks ? You are assuming that the ECC used for consumer disks is substantially different than that used for enterprise disks. That is likely not the case since ECC is provided by a chip which costs a few dollars. The only reason to use a lesser grade algorithm would be to save a small bit of storage space. Consumer disks use essentially the same media as enterprise disks. Consumer disks store a higher bit density on similar media. Consumer disks have less precise/consistent head controllers than enterprise disks. Consumer disks are less well-specified than enterprise disks. Due to the higher bit density we can expect more wrong bits to be read since we are pushing the media harder. Due to less consistent head controllers we can expect more incidences of reading or writing the wrong track or writing something which can't be read. Consumer disks are often used in an environment where they may be physically disturbed while they are writing or reading the data. Enterprise disks are usually used in very stable environments. The upshot of this is that we can expect more unrecoverable errors, but it seems unlikely that there will be more single bit errors recoverable at the ZFS level. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
On Tue, 4 Mar 2008, Richard Elling wrote: Also note: the checksums don't have enough information to recreate the data for very many bit changes. Hashes might, but I don't know anyone using sha256. It is indeed important to recognize that the checksums are a way to detect that the data is incorrect rather than a way to tell that the data is correct. There may be several permutations of wrong data which can result in the same checksum, but the probability of encountering those permutations due to natural causes is quite small. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs send/recv question
On Fri, 7 Mar 2008, Rob Logan wrote:

    zfs send -i z/[EMAIL PROTECTED] z/[EMAIL PROTECTED] | bzip2 -c |\
      ssh host.com bzcat | zfs recv -v -F -d z
    zfs send -i z/[EMAIL PROTECTED] z/[EMAIL PROTECTED] | bzip2 -c |\
      ssh host.com bzcat | zfs recv -v -F -d z
    zfs send -i z/[EMAIL PROTECTED] z/[EMAIL PROTECTED] | bzip2 -c |\
      ssh host.com bzcat | zfs recv -v -F -d z

Since I see 'bzip2' mentioned here (a rather slow compressor), I should mention that based on a recommendation from a friend, I gave a compressor called 'lzop' (http://www.lzop.org/) a try due to its reputation for compression speed. Compressing the zfs send stream was causing the transfer to take much longer. Testing with 'lzop' showed that it was 2.5X faster than gzip on the Opteron CPU and that the compression was just a bit worse than gzip's default compression level. It seems that some assembly language is used for x86 and Opteron. I did not test the relative speed differences on SPARC. The benefit from a compressor depends on the speed of the pipe and the speed of the filesystem. If CPU and/or network is the bottleneck, then LZO compression may be the solution.

Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
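A sketch of the same pipeline with lzop substituted for bzip2; the dataset, snapshot, and host names are placeholders:

    zfs send -i tank/fs@snap1 tank/fs@snap2 | lzop -c | \
        ssh host.example.com "lzop -dc | zfs recv -F -d tank"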
Re: [zfs-discuss] Preserve creator across send/receive
On Tue, 11 Mar 2008, Haik Aftandilian wrote: Or is there a way to manually set the creator of a fileystem? Not knowing any better I used a simple 'chown owner:group' syntax. :-) You could also use 'cpio -p' to transfer directory ownership based on the original master. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
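A minimal sketch of the cpio pass-mode approach (paths are placeholders); run as root so ownership is preserved, with -d creating directories and -m keeping modification times:

    cd /export/home.master && find . -depth -print | cpio -pdm /export/home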
Re: [zfs-discuss] zfs backups to tape
On Fri, 14 Mar 2008, Bill Shannon wrote: What's the best way to backup a zfs filesystem to tape, where the size of the filesystem is larger than what can fit on a single tape? ufsdump handles this quite nicely. Is there a similar backup program for zfs? Or a general tape management program that can take data from Previously it was suggested on this list to use a special version of tar called 'star' (ftp://ftp.berlios.de/pub/star). Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS I/O algorithms
On Sat, 15 Mar 2008, Richard Elling wrote: My observation is that each metaslab is, by default, 1 MByte in size. Each top-level vdev is allocated by metaslabs. ZFS tries to allocate a top-level vdev's metaslab before moving onto another one. So you should see eight 128kByte allocs per top-level vdev before the next top-level vdev is allocated. That said, the actual iops are sent in parallel. So it is not unusual to see many, most, or all of the top-level vdevs concurrently busy. Does this match your experience? I do see that all the devices are quite evenly busy. There is no doubt that the load balancing is quite good. The main question is if there is any actual striping going on (breaking the data into smaller chunks), or if the algorithm is simply load balancing. Striping trades IOPS for bandwidth. Using my application, I did some tests today. The application was used to do balanced read/write of about 500GB of data in some tens of thousands of reasonably large files. The application sequentially reads a file, then sequentially writes a file. Several copies (2-6) of the application were run at once for concurrency. What I noticed is that with hardly any CPU being used, the read+write bandwidth seemed to be bottlenecked at about 280MB/second with 'zpool iostat' showing very balanced I/O between the reads and the writes. The system I set up is performing quite a bit differently than I anticipated. The I/O is bottlenecked and I find that my application can do significant processing of the data without significantly increasing the application run time. So CPU time is almost free. If I was to assign a smaller block size for the filesystem, would that provide more of the benefits of striping or would it be detrimental to performance due to the number of I/Os? Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
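For reference, the block size in question is the per-dataset recordsize property; a sketch with a placeholder dataset name, keeping in mind that the change only affects files written after the property is set:

    zfs set recordsize=32k tank/images
    zfs get recordsize tank/images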
Re: [zfs-discuss] ZFS I/O algorithms
On Sun, 16 Mar 2008, Richard Elling wrote: But where is the bottleneck? iostat will show bottlenecks in the physical disks and channels. vmstat or mpstat will show the bottlenecks in cpus. To see if the app is the bottleneck will require some analysis of the app itself. Is it spending its time blocked on I/O? The application is spending almost all the time blocked on I/O. I see that the number of device writes per second seems pretty high. The application is doing I/O in 128K blocks. How many IOPS does a modern 300GB 15K RPM SAS drive typically deliver? Of course the IOPS capacity depends on whether the access is random or sequential. At the application level, the access is completely sequential but ZFS is likely doing some extra seeks. iostat output (atime=off):

                     extended device statistics
    device    r/s    w/s  Mr/s  Mw/s wait actv  svc_t  %w  %b
    sd0       0.0    0.0   0.0   0.0  0.0  0.0    0.0   0   0
    sd1       0.0    0.0   0.0   0.0  0.0  0.0    2.8   0   0
    sd2       0.0    0.0   0.0   0.0  0.0  0.0    0.0   0   0
    sd10     80.4  170.7  10.0  19.9  0.0  9.2   36.5   0  54
    sd11     82.1  170.2  10.2  20.0  0.0 13.3   52.9   0  71
    sd12     79.3  168.3   9.9  20.0  0.0 13.1   53.1   0  69
    sd13     80.6  173.0  10.0  19.9  0.0  9.3   36.7   0  56
    sd14     80.9  167.8  10.1  20.0  0.0 13.4   53.8   0  70
    sd15     77.7  168.7   9.7  19.9  0.0  9.1   37.1   0  52
    sd16     77.3  170.6   9.6  20.0  0.0 13.3   53.7   0  70
    sd17     76.4  168.2   9.5  20.0  0.0  9.1   37.2   0  52
    sd18     76.7  172.2   9.5  19.9  0.0 13.5   54.2   0  70
    sd19     83.8  173.2  10.4  20.0  0.0 13.7   53.4   0  74
    sd20     73.3  174.3   9.1  20.0  0.0  9.1   36.9   0  56
    sd21     75.3  170.2   9.4  20.0  0.0 13.2   53.9   0  69
    nfs1      0.0    0.0   0.0   0.0  0.0  0.0    0.0   0   0

    % mpstat
    CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
    0 288 1 189 1018 413 815 26 102 880 30463 3 0 94
    1 185 1 180 6341 830 43 111 740 31173 2 0 94
    2 284 1 183 5216 617 27 98 670 49544 3 0 93
    3 176 1 239 748 353 555 25 76 620 39334 3 0 93

Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Microsoft WinFS for ZFS?
On Mon, 17 Mar 2008, Orvar Korvar wrote: My question is, because WinFS database is running on top of NTFS, could a similar thing be done for ZFS? Implement a database running on top of ZFS, that has similar functionality as WinFS? Object-oriented content management could be run on any sort of underlying file system. It is just a layer on top. (I never understood the advantages of having a database on top of NTFS, maybe it would be pointless for ZFS? Can someone knowledgeable give some input to my question?) ZFS just provides storage. It seems that the problem with object-oriented content management is that a user interface needs to be provided, which is not standardized in any way. This user interface needs to be used to put content into the system, to find content in the system, and to use content from the system. There also needs to be a way to back everything up. If the content management knows about the internal structure of the objects, then it might provide a way to access a document so that all of the objects (e.g. figures) used by that document are visible and may be updated. There are likely some mainframe environments which do this sort of thing, but mainframes are essentially closed systems so the mainframe vendor has more control. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Microsoft WinFS for ZFS?
On Tue, 18 Mar 2008, Orvar Korvar wrote: Just as ZFS makes NTFS look like crap, I would like SUN to make something that makes WinFS look like crap! :o) Would it be possible to utilize the unique functions ZFS has, to revolutionize again? What possible advantages could ZFS provide for the database thingy? Are there any advantages to use ZFS instead, at all? Speculations are welcome! :o) ZFS is cool because it is very clean, nicely documented, and is very simple for the user. It would be quite wrong for Sun to diverge from this. There are many other things that Sun should focus on before worrying about content management. It would be useful if ZFS helped make using the SAN as easy as it makes using a collection of already accessible disks. ZFS is pretty, but it is layered on top of some very ugly looking things (e.g. multipath is super-ugly), so lets attend to those ugly things before worrying about adding frosting on top. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS I/O algorithms
On Wed, 19 Mar 2008, Bill Moloney wrote: When application IO sizes get small, the overhead in ZFS goes up dramatically. Thanks for the feedback. However, from what I have observed, it is not the full story at all. On my own system, when a new file is written, the write block size does not make a significant difference to the write speed. Similarly, read block size does not make a significant difference to the sequential read speed. I do see a large difference in rates when an existing file is updated sequentially. There is a many orders of magnitude difference for random I/O type updates. I think that there are some rather obvious reasons for the difference between writing a new file and updating an existing file. When writing a new file, the system can buffer up to a disk block's worth of data prior to issuing a disk I/O, or it can immediately write what it has, and since the write is sequential, it does not need to re-read prior to writing (but there may be more metadata I/Os). For the case of updating part of a disk block, there needs to be a read prior to the write if the block is not cached in RAM. If the system is short on RAM, it may be that ZFS issues many more write I/Os than if it has a lot of RAM. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS I/O algorithms
On Thu, 20 Mar 2008, Mario Goebbels wrote: Similarly, read block size does not make a significant difference to the sequential read speed. Last time I did a simple bench using dd, supplying the record size as blocksize to it instead of no blocksize parameter bumped the mirror pool speed from 90MB/s to 130MB/s. Indeed. However, as an interesting twist to things, in my own benchmark runs I see two behaviors. When the file size is smaller than the amount of RAM the ARC can reasonably grow to, the write block size does make a clear difference. When the file size is larger than RAM, the write block size no longer makes much difference and sometimes larger block sizes actually go slower. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS I/O algorithms
On Thu, 20 Mar 2008, Jonathan Edwards wrote: in that case .. try fixing the ARC size .. the dynamic resizing on the ARC can be less than optimal IMHO Is a 16GB ARC size not considered to be enough? ;-) I was only describing the behavior that I observed. It seems to me that when large files are written very quickly, that when the file becomes bigger than the ARC, that what is contained in the ARC is mostly stale and does not help much any more. If the file is smaller than the ARC, then there is likely to be more useful caching. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
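For completeness, capping the ARC on Solaris 10 is done from /etc/system and takes effect at the next boot; a sketch, assuming a release that supports the tunable, with the 4 GByte value only as an example:

    * /etc/system fragment: cap the ZFS ARC at 4 GBytes (value in bytes)
    set zfs:zfs_arc_max = 0x100000000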
Re: [zfs-discuss] Best practices for ZFS plaiding
On Wed, 26 Mar 2008, Tim wrote: No raid at all. The system should just stripe across all of the LUN's automagically, and since you're already doing your raid on the thumper's, they're *protected*. You can keep growing the zpool indefinitely, I'm not aware of any maximum disk limitation. The data may be protected, but the uptime will be dependent on the uptime of all of those systems. Downtime of *any* of the systems in a load-share configuration means downtime for the entire pool. Of course this is the case with any storage system as more hardware is added but autonomously administered hardware is more likely to encounter a problem. Local disk is usually more reliable than remote disk. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Status of ZFS boot for sparc?
On Wed, 26 Mar 2008, Lori Alt wrote: zfs boot support for sparc (included in the overall delivery of zfs boot, which includes install support, support for swap and dump zvols, and various other improvements) is still planned for Update 6. Does zfs boot have any particular firmware dependencies? Will it work on old SPARC systems? Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Periodic flush
My application processes thousands of files sequentially, reading input files, and outputting new files. I am using Solaris 10U4. While running the application in a verbose mode, I see that it runs very fast but pauses about every 7 seconds for a second or two. This is while reading 50MB/second and writing 73MB/second (ARC cache miss rate of 87%). The pause does not occur if the application spends more time doing real work. However, it would be nice if the pause went away. I have tried turning down the ARC size (from 14GB to 10GB) but the behavior did not noticeably improve. The storage device is trained to ignore cache flush requests. According to the Evil Tuning Guide, the pause I am seeing is due to a cache flush after the uberblock updates. It does not seem like a wise choice to disable ZFS cache flushing entirely. Is there a better way other than adding a small delay into my application? Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
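For reference, the Evil Tuning Guide setting being weighed (and declined) above is a single /etc/system line; this is only a sketch, assumes a Solaris 10 8/07 or later kernel, and is generally considered safe only when every device in the pool has a non-volatile write cache:

    * /etc/system fragment: stop ZFS from issuing cache flush requests at all
    set zfs:zfs_nocacheflush = 1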
Re: [zfs-discuss] Periodic flush
On Wed, 26 Mar 2008, Neelakanth Nadgir wrote: When you experience the pause at the application level, do you see an increase in writes to disk? This might be the regular syncing of the transaction group to disk. If I use 'zpool iostat' with a one second interval what I see is two or three samples with no write I/O at all followed by a huge write of 100 to 312MB/second. Writes reported at a lower rate are simply split across two sample intervals. It seems that writes are being cached and then issued all at once. This behavior assumes that the file may be written multiple times so that a delayed write is more efficient. If I run a script like

    while true
    do
      sync
    done

then the write data rate is much more consistent (at about 66MB/second) and the program does not stall. Of course this is not very efficient. Are the 'zpool iostat' statistics accurate?

Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Periodic flush
On Thu, 27 Mar 2008, Neelakanth Nadgir wrote: This causes the sync to happen much faster, but as you say, suboptimal. Haven't had the time to go through the bug report, but probably CR 6429205 each zpool needs to monitor its throughput and throttle heavy writers will help. I hope that this feature is implemented soon, and works well. :-) I tested with my application outputting to a UFS filesystem on a single 15K RPM SAS disk and saw that it writes about 50MB/second and without the bursty behavior of ZFS. When writing to ZFS filesystem on a RAID array, zpool I/O stat reports an average (over 10 seconds) write rate of 54MB/second. Given that the throughput is not much higher on the RAID array, I assume that the bottleneck is in my application. Are the 'zpool iostat' statistics accurate? Yes. You could also look at regular iostat and correlate it. Iostat shows that my RAID array disks are loafing with only 9MB/second writes to each but with 82 writes/second. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] nfs and smb performance
On Fri, 28 Mar 2008, abs wrote: Sorry for being vague but I actually tried it with the cifs in zfs option, but I think I will try the samba option now that you mention it. Also is there a way to actually improve the nfs performance specifically? CIFS uses TCP. NFS uses either TCP or UDP, and usually UDP by default. In order to improve NFS client performance, it may be useful to increase the 'rsize' and 'wsize' client mount options to 32K. Solaris 10 defaults the buffer size to 32K but many other clients use 8K. Some clients support a '-a' option to specify the maximum read-ahead and tuning this value can help considerably for sequential access. Using gigabit ethernet with jumbo frames will improve performance even further. Notice that most of these tunings are for the client-side and not for the server. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
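As a concrete illustration of those client-side options, a mount invocation for a Linux or similar non-Solaris client might look roughly like this (a hedged sketch; the server name and paths are placeholders):

  mount -t nfs -o vers=3,proto=tcp,rsize=32768,wsize=32768 server:/export/images /mnt/images

The same rsize/wsize values can usually be placed in the client's fstab or automounter map instead of being given on the command line.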
Re: [zfs-discuss] nfs and smb performance
CIFS uses TCP. NFS uses either TCP or UDP, and usually UDP by default. For Sun systems, NFSv3 using 32kByte [rw]size over TCP has been the default configuration for 10+ years. Do you still see clients running NFSv2 over UDP? Yes, I see that TCP is the default in Solaris 9. Is it also the default in Solaris 8? I do know that tuning mount options made a considerable difference for FreeBSD 5.X and Apple's OS X Tiger. Apple's OS X Leopard does not seem to need tuning like previous versions did. OS X Tiger and earlier actually sent application writes directly to NFS so that performance was very dependent on application write size regardless of client NFS tunings. Unfortunately, not everyone is using Solaris. The Solaris 10 NFS client implementation really screams. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Problem importing pool from BSD 7.0 into Nexenta
On Mon, 31 Mar 2008, Tim wrote: Perhaps someone else can correct me if I'm wrong, but if you're using the whole disk, ZFS shouldn't be displaying a slice when listing your disks, should it? I've *NEVER* seen it do that on any of mine except when using partials/slices. I would expect:

  c1d1s8

to be:

  c1d1

Yes, this seems suspicious. It is also suspicious that some devices use 'p' (partition?) while others use 's' (slice?). The partitions may be FreeBSD partitions or some other type that Solaris is not expecting. FreeBSD can partition at a level visible to the BIOS and it can further sub-partition a FreeBSD partition for use in individual filesystems. Regardless, I am very interested to hear if ZFS pools can really be transferred back and forth between Solaris and FreeBSD. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] OpenSolaris ZFS NAS Setup
On Mon, 7 Apr 2008, Ross wrote: However that doesn't necessarily mean it's ready for production use. ZFS will hang for 3 mins (180 seconds) waiting for the iSCSI client to timeout. Now I don't know about you, but HA to me doesn't mean Highly Available, but with occasional 3 minute breaks. Most of the client applications we would want to run on ZFS would be broken with a 3 minute delay returning data, and this was enough for us to give up on ZFS over iSCSI for now. It seems to me that this is a problem with the iSCSI client timeout parameters rather than ZFS itself. Three minutes is sufficient for use over the internet but seems excessive on a LAN. Have you investigated to see if the iSCSI client timeout parameters can be adjusted? Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS volume export to USB-2 or Firewire?
Currently it is easy to share a ZFS volume as an iSCSI target. Has there been any thought toward adding the ability to share a ZFS volume via USB-2 or Firewire to a directly attached client? There is a substantial market for storage products which act like a USB-2 or Firewire drive. Some of these offer some form of RAID. It seems to me that ZFS with a server capability to appear as several USB-2 or Firewire drives (or eSATA) may be appealing for larger RAIDs of several terabytes. Is anyone aware of an application which can usefully share a ZFS volume (essentially a file) in this way? Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance of one single 'cp'
On my drive array (capable of 260MB/second single-process writes and 450MB/second single-process reads) 'zpool iostat' reports a read rate of about 59MB/second and a write rate of about 59MB/second when executing 'cp -r' on a directory containing thousands of 8MB files. This seems very similar to the performance you are seeing. The system indicators (other than disk I/O) are almost flatlined at zero while the copy is going on. It seems that a multi-threaded 'cp' could be much faster. With GNU xargs, find, and cpio, I think that it is possible to cobble together a much faster copy since GNU xargs supports --max-procs and --max-args arguments to allow executing commands concurrently with different sets of files. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
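As a rough, untested sketch of that idea (directory names are placeholders, GNU xargs is assumed, and file names containing whitespace are not handled):

  # run several cpio pass-mode copies in parallel, 64 files per invocation
  cd /source/dir
  find . -depth -print | \
    xargs --max-args=64 --max-procs=4 sh -c \
      'printf "%s\n" "$@" | cpio -pdum /dest/dir' sh

Parallel cpio processes may race while creating the same destination directories, so this is only a sketch of the approach rather than a polished tool.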
Re: [zfs-discuss] ls -lt for links slower than for regular files
On Tue, 8 Apr 2008, [EMAIL PROTECTED] wrote: a few seconds and the links list in, perhaps, 60 seconds. Is there a difference in what ls has to do when listing links versus listing regular files in ZFS that would cause a slowdown? Since you specified '-t' the links have to be dereferenced (find the file that is referred to) which results in opening the directory to see if the file exists, and what its properties are. With 50K+ files, opening the directory and finding the file will take tangible time. If there are multiple directories in the symbolic link path, then these directories need to be opened as well. Symbolic links are not free. More RAM may help if it results in keeping the directory data hot in the cache. If the links were hard links rather than symbolic links, then performance will be similar to a regular file (since it is then a regular file). Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS volume export to USB-2 or Firewire?
On Wed, 9 Apr 2008, Ross wrote: Well the first problem is that USB cables are directional, and you don't have the port you need on any standard motherboard. That Thanks for that info. I did not know that. Adding iSCSI support to ZFS is relatively easy since Solaris already supports TCP/IP and iSCSI. Adding USB support is much more difficult and isn't likely to happen since afaik the hardware to do it just doesn't exist. I don't believe that Firewire is directional but presumably the Firewire support in Solaris only expects to support certain types of devices. My workstation has Firewire but most systems won't have it. It seemed really cool to be able to put your laptop next to your Solaris workstation and just plug it in via USB or Firewire so it can be used as a removable storage device. Or Solaris could be used on appropriate hardware to create a more reliable portable storage device. Apparently this is not to be and it will be necessary to deal with iSCSI instead. I have never used iSCSI so I don't know how difficult it is to use as temporary removable storage under Windows or OS-X. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS volume export to USB-2 or Firewire?
On Wed, 9 Apr 2008, Richard Elling wrote: I just get my laptop within WiFi range and mount :-). I don't see any benefit to a wire which is slower than Ethernet, when an Ethernet port is readily available on almost all modern laptops. Under Windows or Mac, is this as convenient as plugging in a USB or Firewire disk or does it require system administrator type knowledge? If you go to Starbucks, does your laptop attempt to mount your iSCSI volume on a (presumably) unreachable network? Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] OpenSolaris ZFS NAS Setup
On Fri, 11 Apr 2008, Simon Breden wrote: Thanks myxiplx for the info on replacing a faulted drive. I think the X4500 has LEDs to show drive statuses so you can see which physical drive to pull and replace, but how does one know which physical disk to pull out when you just have a standard PC with drives directly plugged into on-motherboard SATA connectors -- i.e. with no status LEDs? This should be a wakeup call to make sure that this is all figured out in advance before the hardware fails. If you were to format the drive for a traditional filesystem you would need to know which one it was. Failure recovery should be no different except for the fact that the machine may be down, pressure is on, and the information you expected to use for recovery was on that machine. :-) This is a case where it is worthwhile maintaining a folder (in paper form) which contains important recovery information for your machines. Open up the machine in advance and put sticky labels on the drives with their device names. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
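For preparing such labels, the logical device names and serial numbers can be collected in advance with the usual Solaris tools (output and level of detail vary by controller and driver, so treat this as a starting point only):

  format </dev/null   # list the cXtYdZ names of all visible disks
  cfgadm -al          # show attachment points for hot-pluggable devices
  iostat -En          # show vendor, product, and serial number per device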
Re: [zfs-discuss] LZO compression?
On Sat, 12 Apr 2008, roland wrote: i'm really wondering that interest in alternative compression schemes is that low, especially due to the fact that lzo seems to compress better and be faster than lzjb. LZO seems to have a whole family of compressors. One reason why it is faster is that the author has worked really hard on a few CPU-specific optimizations. Is the license ok for Solaris? Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 24-port SATA controller options?
On Mon, 14 Apr 2008, Blake Irvin wrote: The only supported controller I've found is the Areca ARC-1280ML. I want to put it in one of the 24-disk Supermicro chassis that Silicon Mechanics builds. For obvious reasons (redundancy and throughput), it makes more sense to purchase two 12 port cards. I see that there is an option to populate more cache RAM. I would be interested to know what actual throughput that one card is capable of. The CDW site says 300MB/s. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance of one single 'cp'
On Mon, 14 Apr 2008, Jeff Bonwick wrote: disks=`format </dev/null | grep c.t.d | nawk '{print $2}'` I had to change the above line to

  disks=`format </dev/null | grep ' c.t' | nawk '{print $2}'`

in order to match my multipathed devices.

./diskqual.sh
c1t0d0 130 MB/sec
c1t1d0 13422 MB/sec
c4t600A0B80003A8A0B096A47B4559Ed0 190 MB/sec
c4t600A0B80003A8A0B096E47B456DAd0 202 MB/sec
c4t600A0B80003A8A0B096147B451BEd0 186 MB/sec
c4t600A0B80003A8A0B096647B453CEd0 176 MB/sec
c4t600A0B80003A8A0B097347B457D4d0 189 MB/sec
c4t600A0B800039C9B50A9C47B4522Dd0 174 MB/sec
c4t600A0B800039C9B50AA047B4529Bd0 197 MB/sec
c4t600A0B800039C9B50AA447B4544Fd0 223 MB/sec
c4t600A0B800039C9B50AA847B45605d0 224 MB/sec
c4t600A0B800039C9B50AAC47B45739d0 223 MB/sec
c4t600A0B800039C9B50AB047B457ADd0 219 MB/sec
c4t600A0B800039C9B50AB447B4595Fd0 223 MB/sec

My 'cp -r' performance is about the same as Henrik's. The 'cp -r' performance is much less than disk benchmark tools would suggest. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Confused by compressratio
On Tue, 15 Apr 2008, Luke Scharf wrote: AFAIK, ext3 supports sparse files just like it should -- but it doesn't dynamically figure out what to write based on the contents of the file. Since zfs inspects all data anyway in order to compute the block checksum, it can easily know if a block is all zeros. For ext3, inspecting all blocks for zeros would be viewed as unnecessary overhead. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 24-port SATA controller options?
On Tue, 15 Apr 2008, Keith Bierman wrote: Perhaps providing the computations rather than the conclusions would be more persuasive on a technical list ; No doubt. The computations depend considerably on the size of the disk drives involved. The odds of experiencing media failure on a single 1TB SATA disk are quite high. Consider that this media failure may occur while attempting to recover from a failed disk. There have been some good articles on this in USENIX Login magazine. ZFS raidz1 and raidz2 are NOT directly equivalent to RAID5 and RAID6 so the failure statistics would be different. Regardless, single disk failure in a raidz1 substantially increases the risk that something won't be recoverable if there is a media failure while rebuilding. Since ZFS duplicates its own metadata blocks, it is most likely that some user data would be lost but the pool would otherwise recover. If a second disk drive completely fails, then you are toast with raidz1. RAID5 and RAID6 rebuild the entire disk while raidz1 and raidz2 only rebuild existing data blocks so raidz1 and raidz2 are less likely to experience media failure if the pool is not full. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 24-port SATA controller options?
On Tue, 15 Apr 2008, Maurice Volaski wrote: 4 drive failures over 5 years. Of course, YMMV, especially if you drive drunk :-) Note that there is a difference between drive failure and media data loss. In a system which has been running fine for a while, the chance of a second drive failing during rebuild may be low, but the chance of block-level media failure is not. However, computers do not normally run in a vacuum. Many failures are caused by something like a power glitch, temperature cycle, or the flap of a butterfly's wings. Unless your environment is completely stable and the devices are not dependent on some of the same things (e.g. power supplies, chassis, SATA controller, air conditioning) then what caused one device to fail may very well cause another device to fail. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Will ZFS employ raid0 stripes in an ordinary storage pool?
On Tue, 15 Apr 2008, Brandon High wrote: I think RAID-Z is different, since the stripe needs to spread across all devices for protection. I'm not sure how it's done. My understanding is that RAID-Z is indeed different and does NOT have to spread across all devices for protection. It can use less than the total available devices and since parity is distributed the parity could be written to any drive. I am sure that someone will correct me if the above is wrong. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Periodic flush
On Tue, 15 Apr 2008, Mark Maybee wrote: going to take 12sec to get this data onto the disk. This impedance mis-match is going to manifest as pauses: the application fills the pipe, then waits for the pipe to empty, then starts writing again. Note that this won't be smooth, since we need to complete an entire sync phase before allowing things to progress. So you can end up with IO gaps. This is probably what the original submitter is seeing. Yes. With an application which also needs to make best use of available CPU, these I/O gaps cut into available CPU time (by blocking the process) unless the application uses multithreading and an intermediate write queue (more memory) to separate the CPU-centric parts from the I/O-centric parts. While the single-threaded application is waiting for data to be written, it is not able to read and process more data. Since reads take time to complete, being blocked on write stops new reads from being started so the data is ready when it is needed. There is one down side to this new model: if a write load is very bursty, e.g., a large 5GB write followed by 30secs of idle, the new code may be less efficient than the old. In the old code, all This is also a common scenario. :-) Presumably the special slow I/O code would not kick in unless the burst was large enough to fill quite a bit of the ARC. Real time throttling is quite a challenge to do in software. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 24-port SATA controller options?
On Wed, 16 Apr 2008, David Magda wrote: RAID5 and RAID6 rebuild the entire disk while raidz1 and raidz2 only rebuild existing data blocks so raidz1 and raidz2 are less likely to experience media failure if the pool is not full. While the failure statistics may be different, I think any comparison would be apples-to-apples. Note that if the pool is only 10% full, then it is 10X less likely to experience a media failure during rebuild than traditional RAID-5/6 with the same disks. In addition to this, zfs replicates metadata and writes the copies to different disks depending on the redundancy strategy. A traditional filesystem on traditional RAID does not have this same option (having no knowledge of the underlying disks) even though it does replicate some essential metadata (multiple super blocks). Since my time on this list, the vast majority of reports have been of the nature "my pool did not come back up after system crash" or "the pool stopped responding", and not that their properly redundant pool lost some user data. This indicates that the storage principles are quite sound but the implementation (being relatively new) still has a few rough edges. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS raidz write performance:what to expect from SATA drives on ICH9R (AHCI)
On Thu, 17 Apr 2008, Tim wrote: Along those lines, I'd *strongly* suggest running Jeff's script to pin down whether one drive is the culprit: But that script only tests read speed and Pascal's read performance seems fine. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Solaris 10U5 ZFS features?
Even though I am on a bunch of Sun propaganda lists, I have not yet spotted an announcement for Solaris 10U5 even though it is now available for download. Sun's formal web site is useless for comparing what is in different update releases since its notion of What's New is a comparison with Solaris 9 and Solaris 8, which are as old as dirt and it is not clear if and when this summary gets updated. Can someone please post a summary of any new ZFS features or significant fixes which are in Solaris 10U5? Is there value to upgrading a system to this release over and above what is provided by patches? Thanks, Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS raidz write performance:what to expect from SATA drives on ICH9R
On Fri, 18 Apr 2008, Pascal Vandeputte wrote: Thanks for all the replies! Some output from iostat -x 1 while doing a dd of /dev/zero to a file on a raidz of c1t0d0s3, c1t1d0 and c1t2d0 using bs=1048576: [ data removed ] It's all a little fishy, and kw/s doesn't differ much between the drives (but this could be explained as drive(s) with longer wait queues holding back the others I guess?). Your data does strongly support my hypothesis that using a slice on 'sd0' would slow down writes. It may also be that your boot drive is a different type and vintage from the other drives. Testing with output from /dev/zero is not very good since zfs treats blocks of zeros specially. I have found 'iozone' (http://www.iozone.org/) to be quite useful for basic filesystem throughput testing. Hmm, doesn't look like one drive holding back another one, all of them seem to be equally slow at writing. Note that if drives are paired, or raidz requires a write to all drives, then the write rate is necessarily limited to the speed of the slowest device. I suspect that your c1t1d0 and c1t2d0 drives are similar type and vintage whereas the boot drive was delivered with the computer and has different performance characteristics (double whammy). Usually drives delivered with computers are selected by the computer vendor based on lowest cost in order to decrease the cost of the entire computer. SATA drives are cheap these days so perhaps you can find a way to add a fourth drive which is at least as good as the drives you are using for c1t1d0 and c1t2d0. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
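A basic iozone run for this sort of sequential write/read test might look like the following (the parameters and file location are only illustrative; the file size should comfortably exceed RAM so the ARC does not hide the disks):

  iozone -i 0 -i 1 -s 8g -r 128k -f /tank/iozone.tmp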
Re: [zfs-discuss] ZFS raidz write performance:what to expect from SATA drives on ICH9R
On Fri, 18 Apr 2008, Pascal Vandeputte wrote: - does Solaris require a swap space on disk No, Solaris does not require a swap space. However you do not have a lot of memory so when there is not enough virtual memory available, programs will fail to allocate memory and quit running. There is an advantage to having a swap area since then Solaris can put rarely used pages in swap to improve overall performance. The memory can then be used for useful caching (e.g. ZFS ARC), or for your applications. In addition to using a dedicated partition, you can use a file on UFS for swap ('man swap') and ZFS itself is able to support a swap volume. I don't think that you can put a normal swap file on ZFS so you would want to use ZFS's built-in support for that. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
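A minimal sketch of the ZFS swap volume approach mentioned above (the pool name, volume name, and size are placeholders):

  zfs create -V 2g tank/swapvol
  swap -a /dev/zvol/dsk/tank/swapvol
  swap -l                              # verify the new swap device is listed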
Re: [zfs-discuss] ZFS raidz write performance:what to expect from SATA drives on ICH9R
On Sun, 20 Apr 2008, A Darren Dunham wrote: I think these paragraphs are referring to two different concepts with swap. Swapfiles or backing store in the first, and virtual memory space in the second. The swap area is mis-named since Solaris never swaps. Some older operating systems would put an entire program in the swap area when the system ran short on memory and would have to swap between programs. Solaris just pages (a virtual memory function) and it is very smart about how and when it does it. Only dirty pages which are not write-mapped to a file in the filesystem need to go in the swap area, and only when the system runs short on RAM. Solaris is a quite-intensely memory-mapped system. The memory mapping allows a huge amount of sharing of shared library files, program text images, and unmodified pages shared after fork(). The end result is a very memory-efficient OS. Now if we could just get ZFS ARC and Gnome Desktop to not use any memory, we would be in nirvana. :-) Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS raidz write performance:what to expect from SATA drives on ICH9R
On Sat, 19 Apr 2008, michael schuster wrote: that's true most of the time ... unless free memory gets *really* low, then Solaris *does* start to swap (ie page out pages by process). IIRC, the threshold for swapping is minfree (measured in pages), and the value that needs to fall below this threshold is freemem. Most people here are likely too young to know what swapping really is. Swapping is not the same as the paging that Solaris does. With swapping the kernel knows that this address region belongs to this process and we are short of RAM so block copy the process to the swap area, and only remember that it exists via the process table. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] backup for x4500?
On Sun, 20 Apr 2008, Peter Tribble wrote: Does anyone here have experience of this with multi-TB filesystems and any of these solutions that they'd be willing to share with me please? My experience so far is that anything past a terabyte and 10 million files, and any backup software struggles. What is the cause of the struggling? Does the backup host run short of RAM or CPU? If backups are incremental, is a large portion of time spent determining the changes to be backed up? What is the relative cost of many small files vs large files? How does 'zfs send' performance compare with a traditional incremental backup system? Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] backup for x4500?
On Sun, 20 Apr 2008, Peter Tribble wrote: What is the cause of the struggling? Does the backup host run short of RAM or CPU? If backups are incremental, is a large portion of time spent determining the changes to be backed up? What is the relative cost of many small files vs large files? It's just the fact that, while the backup completes, it can take over 24 hours. Clearly this takes you well over any backup window. It's not so much that the backup software is defective; it's an indication that traditional notions of backup need to be rethought. There is no doubt about that. However, there are organizations with hundreds of terabytes online and they manage to survive somehow. I receive bug reports from people with 600K files in a single subdirectory. Terabyte-sized USB drives are available now. When you say that the backup can take over 24 hours, are you talking only about the initial backup, or incrementals as well? I have one small (200G) filesystem that takes an hour to do an incremental with no changes. (After a while, it was obvious we don't need to do that every night.) That is pretty outrageous. It seems that your backup software is suspect since it must be severely assaulting the filesystem. I am using 'rsync' (version 3.0) to do disk-to-disk network backups (with differencing) to a large Firewire type drive and have not noticed any performance issues. I do not have 10 million files though (I have about half of that). Since zfs supports really efficient snapshots, a backup system which is aware of snapshots can take snapshots and then backup safely even if the initial dump takes several days. Really smart software could perform both initial dump and incremental dump simultaneously. The minimum useful incremental backup interval would still be limited to the time required to do one incremental backup. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
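Combining the two ideas above, rsync differencing against a ZFS snapshot as a stable source might look roughly like this (the filesystem and path names are placeholders, and this is only a sketch):

  zfs snapshot tank/home@nightly
  rsync -aH --delete /tank/home/.zfs/snapshot/nightly/ /backup/home/
  zfs destroy tank/home@nightly

Using the snapshot as the rsync source means the files cannot change underneath the backup while it runs.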
[zfs-discuss] ZFS for write-only media?
Are there any plans to support ZFS for write-only media such as optical storage? It seems that if mirroring or even raidz is used, ZFS would be a good basis for long term archival storage. Has this been considered? I expect that it is possible today by using files as the underlying media and then copying those individual files to optical storage. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS for write-only media?
On Mon, 21 Apr 2008, Dana H. Myers wrote: Bob Friesenhahn wrote: Are there any plans to support ZFS for write-only media such as optical storage? It seems that if mirroring or even zraid is used that ZFS would be a good basis for long term archival storage. I'm just going to assume that write-only here means write-once, read-many, since it's far too late for an April Fool's joke. Yes, of course. Such as to CD-R, DVD-RW, or more exotic technologies such as holographic drives (300GB drives are on the market). For example, with two CD-R drives it should be possible to build a ZFS mirror on two CDs, but the I/O to these devices may need to be done in a linear sequential fashion at a rate sufficient to keep the writer happy, so temporary files (or memory-based buffering) likely need to be used. No one wants to be faced with a situation in which two copies are made to CD but both copies are deemed to be bad when they are read. ZFS could make that situation much better. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS for write-only media?
On Mon, 21 Apr 2008, Mark A. Carlson wrote: Maybe what you want is to archive files off to optical media? Perhaps ADM - http://opensolaris.org/os/project/adm ? That looks interesting, but true archiving is needed. The level of archiving for this application is that copies would be kept thousands of feet underground in a stable salt mine on continents 'A' and 'B'. An alternative is special temperature, humidity, and pressure controlled above-ground bunkers. It is desired that the data be preserved for hundreds or a thousand years, which would of course require copying to more modern media every so often. The cost to create the original data is up to $200 million (today's cost) and it can not be recreated. The size of the originals to be archived ranges from 2TB to 400TB depending on how deep the archiving is. The existing archive approach is in analog form but it is found that there is noticeable degradation after 50 or 100 years which is not possible to fully correct. When I saw a discussion of these requirements today, ZFS immediately came to mind due to its many media-independent error detection and correction features, and the fact that it is open source. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS for write-only media?
On Mon, 21 Apr 2008, Mark A. Carlson wrote: Interesting problem. And yes you are right, there are a number of problems to solve here, see: http://blogs.sun.com/mac/en_US/entry/open_archive Standards and open source are clearly the way to go. Many open source applications have already been demonstrated to last far longer than their commercial counterparts. ZFS is open sourced but it is perhaps not mature and widespread enough yet to be seen as a stable long-term storage standard. The problem is a long term problem so there seems to be opportunity here for ZFS if it is adapted somewhat to address archiving. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS for write-only media?
On Tue, 22 Apr 2008, Ralf Bertling wrote: Hi Bob, If I was willing to do that I would simply build a pool from file-based storage consisting of n ISO images. It would involve the following steps:

1. create blank ISO images of the size of your media
2. zpool create wormyz raidz2 image1.iso image2.iso image3.iso ...
3. move your data to the pool
4. export the pool
5. burn the media

If you need to recover, copy the data from the device using dd conv=sync,noerror Yes, I know that this will work and it is what I thought of. But I was thinking that perhaps ZFS would be able to attach to the read-only pool. At the moment it is likely not willing to attach to read-only devices since part of its function depends on writing. The problem here is that by putting the data away from your machine, you lose the chance to scrub it on a regular basis, i.e. there is always the risk of silent corruption. Running a scrub is pointless since the media is not writeable. :-) I am not an expert, but the MTTDL is in thousands of years when using raidz2 with a hot-spare and regular scrubbing. A thousand years ago, knights were storming castle walls. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
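A condensed sketch of the file-backed approach described above, using a simple mirror instead of raidz2 (the sizes, file names, and burning step are placeholders and untested):

  mkfile 4g /export/stage/disc1.img /export/stage/disc2.img
  zpool create archive mirror /export/stage/disc1.img /export/stage/disc2.img
  cp -r /data/to/preserve /archive/
  zpool export archive
  # burn disc1.img and disc2.img to two separate discs with your preferred tool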
Re: [zfs-discuss] ZFS for write-only media?
On Tue, 22 Apr 2008, Jonathan Loran wrote: But that's the point. You can't correct silent errors on write once media because you can't write the repair. Yes, you can correct the error (at time of read) due to having both redundant media, and redundant blocks. That is a normal function of ZFS. It just not possible to correct the failed block on the media by re-writing it or moving its data to a new location. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS for write-only media?
On Tue, 22 Apr 2008, Jonathan Loran wrote: I suppose with ditto blocks, this has some merit. Someone needs to characterize how errors propagate on different types of WORM media. Perhaps this has already been done. In my experience, when DVD-Rs go south, they really go bad at once. Not a lot of small bit errors. But a full analysis would be good. Probably it would make the most sense to write mirrored WORM disks with different technology to hedge your bets. It does not really matter since ZFS supports various forms of RAID, including arbitrary mirroring. If possible, the media can be purchased from different vendors so there is less chance of similar bit-rot across the lot. With $40 to $200 million spent per project, a few extra copies is in the noise. :-) Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Diverse, Dispersed, Distributed, Unscheduled RAID volumes
On Fri, 25 Apr 2008, Richard Elling wrote: No. ZFS is not a distributed file system. While the results might not be pretty, if each PC exports a drive via iSCSI and mirroring is used with plenty of PCs in each mirror, it seems like it would work but with likely dismal performance if a PC was turned off (retries and 3+ minute iSCSI failure recovery logic). There would be additional dismal performance when the PC is turned back on due to cumulative resilvering. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs data corruption
On Sat, 26 Apr 2008, Carson Gaspar wrote: It's not safe to jump to this conclusion. Disk drivers that support FMA won't log error messages to /var/adm/messages. As more support for I/O FMA shows up, you won't see random spew in the messages file any more. mode=large financial institution paying support customer That is a Very Bad Idea. Please convey this to whoever thinks that they're helping by not sysloging I/O errors. If this shows up in Solaris 11, we will Not Be Amused. Lack of off-box error logging will directly cause loss of revenue. /mode I am glad to hear that your large financial institution (Bear Stearns?) is contributing to the OpenSolaris project. :-) Today's systems are very complex and may contain many tens of disks. Syslog is a bottleneck and often logs to local files, which grow very large, and hinder system performance while many log messages are being reported. If syslog is to a remote host, then the network is also impacted. If a device (or several inter-related devices) is/are experiencing problems, it seems best to isolate and diagnose it, with one intelligent notification rather than spewing hundreds of thousands of low-level error messages to a system logger. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS - Implementation Successes and Failures
On Mon, 28 Apr 2008, Dominic Kay wrote: I'm not looking to replace the Best Practices or Evil Tuning guides but to take a slightly different slant. If you have been involved in a ZFS implementation small or large and would like to discuss it either in confidence or as a referenceable case study that can be written up, I'd be grateful if you'd make contact. Back in February I set up ZFS on a 12-disk StorageTek 2540 array and documented my experience (at that time) in the white paper available at http://www.simplesystems.org/users/bfriesen/zfs-discuss/2540-zfs-performance.pdf. Since then I am still quite satisfied. ZFS has yet to report a bad block or cause me any trouble at all. The only complaint I would have is that 'cp -r' performance is less than would be expected given the raw bandwidth capacity. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs performance so bad on my system
On Tue, 29 Apr 2008, Krzys wrote: I am not sure; I had a very OK system when I originally built it and when I originally started to use zfs, but now it is so horribly slow. I do believe that the number of snaps that I have is causing it. This seems like a bold assumption without supportive evidence.

# zpool list
NAME      SIZE    USED    AVAIL   CAP   HEALTH   ALTROOT
mypool    278G    255G    23.0G   91%   ONLINE   -
mypool2   1.59T   1.54T   57.0G   96%   ONLINE   -

Very full! For example I am trying to copy a 1.4G file from my /var/mail to the /d/d1 directory, which is a zfs file system on the mypool2 pool. It takes 25 minutes to copy it, while copying it to the tmp directory only takes a few seconds. What's wrong with this? Why is it so slow to copy that file to my zfs file system? Not good. Some filesystems get slower when they are almost full since they have to work harder to find resources and verify quota limits. I don't know if that applies to ZFS. However, it may be that you have one or more disks which are experiencing many soft errors (several re-tries before success) and maybe you should look into that first. ZFS runs on top of a bunch of other subsystems and drivers so if those other subsystems and drivers are slow to respond then ZFS will be slow. With your raidz2 setup, all it takes is one slow disk to slow everything down. I suggest using 'iostat -e' to check for device errors, and 'iostat -x' (while doing the copy) to look for suspicious device behavior. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
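Concretely, a first look could be along these lines (run the second command in another window while the slow copy is in progress):

  iostat -e      # cumulative soft/hard/transport error counts per device
  iostat -x 5    # per-device service times and %b sampled every 5 seconds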
Re: [zfs-discuss] share zfs hierarchy over nfs
On Tue, 29 Apr 2008, Tim Wood wrote: but that makes it sound like this issue was resolved by changing the NFS client behavior in Solaris. Since my NFS client machines are going to be Linux machines that doesn't help me any. Yes, Solaris 10 does nice helpful things that other OSs don't do. I use per-user ZFS filesystems so I encountered the same problem. It is necessary to force the automounter to request the full mount path. On Solaris and OS-X Leopard client systems I use an /etc/auto_home like

  # Home directory map for automounter
  #
  *    freddy:/home/&

which also works for Solaris 9 without depending on the Solaris 10 feature. For FreeBSD (which uses the am-utils automounter) I figured out this horrific looking map incantation:

  * type:=nfs;rhost:=freddy;rfs:=/home/${key};fs:=${autodir}/${rhost}${rfs};opts:=rw,grpid,resvport,vers=3,proto=tcp,nosuid,nodev

So for Linux, I think that you will also need to figure out an indirect-map incantation which works for its own broken automounter. Make sure that you read all available documentation for the Linux automounter so you know which parts don't actually work. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] share zfs hierarchy over nfs
On Tue, 29 Apr 2008, Jonathan Loran wrote: Oh contraire Bob. I'm not going to boost Linux, but in this department, they've tried to do it right. If you use Linux autofs V4 or higher, you can use Sun style maps (except there's no direct maps in V4. Need V5 for direct maps). For our home directories, which use an indirect map, we just use the Solaris map, thus:

  auto_home:
  *    zfs-server:/home/&

Sorry to be so off (ZFS) topic. I am glad to hear that the Linux automounter has moved forward since my experience with it a couple of years ago, when indirect maps were documented but also documented not to actually work. :-) I don't think that this discussion is off-topic. Filesystems are so easy to create with ZFS that it has become popular to create per-user filesystems. It would be useful if the various automounter incantations to make everything work would appear in a ZFS-related Wiki somewhere. This can be an embarrassing situation for the system administrator who thinks that everything is working fine due to testing with Solaris 10 clients. So he switches all the home directories to ZFS per-user filesystems overnight. Imagine the frustration and embarrassment when that poor system administrator returns the next day and finds that many users can not access their home directories! Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS still crashing after patch
On Thu, 1 May 2008, Rustam wrote: Today my production server crashed 4 times. THIS IS NIGHTMARE! Self-healing file system?! For me ZFS is SELF-KILLING filesystem. I cannot fsck it, there's no such tool. I cannot scrub it, it crashes 30-40 minutes after scrub starts. I cannot use it, it crashes a number of times every day! And with every crash number of checksum failures is growing: Is your ZFS pool configured with redundancy (e.g. mirrors, raidz) or is it non-redundant? If non-redundant, then there is not much that ZFS can really do if a device begins to fail. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS still crashing after patch
On Thu, 1 May 2008, Rustam wrote: operating system: 5.10 Generic_127112-07 (i86pc) Seems kind of old. I am using Generic_127112-11 here. Probably many hundreds of nasty bugs have been eliminated since the version you are using. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS still crashing after patch
On Mon, 5 May 2008, Marcelo Leal wrote: Hello, If you believe that the problem can be related to ZIL code, you can try to disable it to debug (isolate) the problem. If it is not a fileserver (NFS), disabling the zil should not impact consistency. In what way is NFS special when it comes to ZFS consistency? If NFS consistency is lost by disabling the zil then local consistency is also lost. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
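For reference, the debugging-only ZIL switch being discussed was typically flipped with mdb on live systems of that era; the variable name may differ between releases, and this is only for isolating a problem, never for production use:

  echo zil_disable/W0t1 | mdb -kw    # disable the ZIL (debug only)
  echo zil_disable/W0t0 | mdb -kw    # restore normal behavior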
Re: [zfs-discuss] ZFS still crashing after patch
On Mon, 5 May 2008, eric kustarz wrote: That's not true: http://blogs.sun.com/erickustarz/entry/zil_disable Perhaps people are using consistency to mean different things here... Consistency means that fsync() assures that the data will be written to disk so no data is lost. It is not the same thing as no corruption. ZFS will happily lose some data in order to avoid some corruption if the system loses power. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS still crashing after patch
On Mon, 5 May 2008, Marcelo Leal wrote: I'm calling consistency, a coherent local view... I think that was one option to debug (if not a NFS server), without generate a corrupted filesystem. In other words your flight reservation will not be lost if the system crashes. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS and disk usage management?
On Mon, 5 May 2008, [EMAIL PROTECTED] wrote: The problem is the fact that NFS mounts cannot be done across filesystems as implemented with ZFS and Solaris 10. For example, we have client machines mounting to /groups/accounting... but we also have clients mounting to /groups directly. On my system I have a /home filesystem, and then I have additional logical-per user filesystems underneath. I know that I can mount /home directly but I currently automount the per-user filesystems since otherwise user permissions and filesystem quotas are not visible to the client for anything other than Solaris 10. I assume that ZFS quotas are enforced even if the current size and space free is not included in the user visible 'df'. Is that not true? Presumably applications get some unexpected error when the quota limit is hit since the client OS does not know the real amount of space free. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
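For context, the per-user layout referred to above is typically created along these lines (the pool name, filesystem names, and quota are placeholders):

  zfs create tank/home/username
  zfs set quota=10g tank/home/username
  zfs set sharenfs=rw tank/home/username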
Re: [zfs-discuss] ZFS and Linux
On Tue, 6 May 2008, Bill McGonigle wrote: That file says 'Copyright 2007 Sun Microsystems, Inc.', though, so Sun has the rights to do this. But being GPLv2 code, why do I have any patent rights to include/redistribute that grub code in my (theoretical) product (let's assume it does something that is covered By releasing this bit of code to Grub under the GPL v2 license, Sun has effectively transferred rights to use that scrap of code (in any context) regardless of any Sun patents which may apply. However, it seems that the useful ZFS patents would be for writing/updating the filesystem rather than reading from it. You can be sure that Sun put as little ZFS code in Grub as was possible (and not just for license reasons). Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sanity check -- x4500 storage server for enterprise file service
On Wed, 7 May 2008, Paul B. Henson wrote: I was thinking about allocating 2 drives for the OS (SVM mirroring, pending ZFS boot support), two hot spares, and allocating the other 44 drives as mirror pairs into a single pool. While this will result in lower available space than raidz, my understanding is that it should provide much better performance. Is there anything potentially problematic about this configuration? Low-level disk performance analysis is not really my field. It sounds quite solid. The load should be quite nicely distributed across the mirrors. It seems like kind of a waste to allocate 1TB to the operating system; would there be any issue in taking a slice of those boot disks and creating a zfs mirror with them to add to the pool? You don't want to go there. Keep in mind that there is currently no way to reclaim a device after it has been added to the pool other than substituting another device for it. Also, the write performance to these slices would be less than normal. If I were you, I would keep more disks spare in the beginning and see how the system is working. If everything is working great, then add more disks to the pool. Once disks are added to the pool, they are committed. An advantage of load-shared mirrors is that more pairs can be added at any time. You need enough disks in the system to satisfy current disk space and I/O rate requirements, but it is not necessary to start off with all the disks added to the pool. Disks added earlier will be initially more loaded up than disks added later. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
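As a sketch of that layout (device names are placeholders; on an x4500 the members of each pair would be spread across different controllers):

  zpool create tank \
    mirror c0t0d0 c1t0d0 \
    mirror c0t1d0 c1t1d0 \
    mirror c0t2d0 c1t2d0
  # ...continue for as many pairs as you want to commit now; grow later with:
  zpool add tank mirror c0t3d0 c1t3d0
  zpool add tank spare c5t7d0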
Re: [zfs-discuss] Image with DD from ZFS partition
On Wed, 7 May 2008, Hans wrote: Hello, can I create an image from ZFS with the dd command? When I work with Linux I use partimage to create an image from one partition and store it on another, so I can restore it after an error. partimage does not work with ZFS, so I must use the dd command. I think so: dd if=/dev/sda1 of=/backup/image. Can I create an image this way, and restore it the other way: dd if=/backup/image of=/dev/sda1? When I have two partitions with ZFS, can I boot from the live CD and mount one partition to use it as a backup target? Or is it possible to create an ext2 partition and use a Linux rescue CD to back up the ZFS partition with dd? While the methods you describe are not the zfs way of doing things, they should work. The zfs pool would need to be offlined (taken completely out of service, via zpool export) before backing it up via raw devices with dd. Every raw device in the pool would need to be backed up at that time in order to make a valid restore possible. Once the devices in the pool have been copied, the pool can be re-imported to activate it. This approach is quite a lot of work and the pool is not available during this time. It is much better to do things the zfs way since then the pool can still be completely active. Taking a snapshot takes less than a second. Then you can send the filesystems to be backed up to a file or to another system. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
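A minimal sketch of that zfs way (the pool, filesystem, and file names are placeholders):

  zfs snapshot mypool/data@backup1
  zfs send mypool/data@backup1 > /backup/mypool-data-backup1.zfs
  # restore later with:
  zfs receive mypool/restored < /backup/mypool-data-backup1.zfs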
Re: [zfs-discuss] Sanity check -- x4500 storage server for enterprise file service
On Thu, 8 May 2008, Ross wrote: protected even if a disk fails. I found this post quite an interesting read: http://blogs.sun.com/relling/entry/raid_recommendations_space_vs_mttdl Richard's blog entry does not tell the whole story. ZFS does not protect against memory corruption errors and CPU execution errors except for in the validated data path. It also does not protect you against kernel bugs, corrosion, meteorite strikes, or civil unrest. As a result, the MTTDL plots (which only consider media reliability and redundancy) become quite incorrect as they reach stratospheric levels. Note that Richard does include a critical disclaimer: The MTTDL calculation is one attribute of Reliability, Availability, and Serviceability (RAS) which we can also calculate relatively easily. Notice the operative word one. The law of diminishing returns still applies. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sanity check -- x4500 storage server for enterprise file service
On Thu, 8 May 2008, Ross Smith wrote: True, but I'm seeing more and more articles pointing out that the risk of a secondary failure is increasing as disks grow in size, and Quite true. While I'm not sure of the actual error rates (Western Digital list their unrecoverable rates as 1 in 10^15), I'm very conscious that if you have any one disk fail completely, you are then reliant on being able to read without error every single bit of data from every other disk in that raid set. I'd much rather have dual parity and know that single bit errors are still easily recoverable during the rebuild process. I understand the concern. However, the published unrecoverable rates are for the completely random write/read case. ZFS validates the data read for each read and performs a repair if a read is faulty. Doing a zfs scrub forces all of the data to be read and repaired if necessary. Assuming that the data is read (and repaired if necessary) on a periodic basis, the chance that an unrecoverable read will occur will surely be dramatically lower. This of course assumes that the system administrator pays attention and proactively replaces disks which are reporting unusually high and increasing read failure rates. It is a simple matter of statistics. If you have read a disk block successfully 1000 times, what is the probability that the next read from that block will spontaneously fail? How about if you have read from it successfully a million times? Assuming a reasonably designed storage system, the most likely cause of data loss is human error due to carelessness or confusion. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss