Re: [zfs-discuss] [osol-help] zfs destroy stalls, need to hard reboot

2009-12-29 Thread Brent Jones
On Sun, Dec 27, 2009 at 1:35 PM, Brent Jones br...@servuhome.net wrote:
 On Sun, Dec 27, 2009 at 12:55 AM, Stephan Budach stephan.bud...@jvm.de 
 wrote:
 Brent,

 I had known about that bug a couple of weeks ago, but that bug was filed
 against v111 and we're at v130. I have also searched the ZFS part of
 this forum and really couldn't find much about this issue.

 The other issue I noticed is that, contrary to the statements I read, once
 zfs is underway destroying a big dataset, other operations do not continue
 to work. When destroying the 3 TB dataset, the other zvol that had been
 exported via iSCSI stalled as well, and that's really bad.

 Cheers,
 budy
 --
 This message posted from opensolaris.org
 ___
 opensolaris-help mailing list
 opensolaris-h...@opensolaris.org


 I just tested your claim, and you appear to be correct.

 I created a couple dummy ZFS filesystems, loaded them with about 2TB,
 exported them via CIFS, and destroyed one of them.
 The destroy took the usual amount of time (about 2 hours), and
 actually, quite to my surprise, all I/O on the ENTIRE zpool stalled.
 I don't recall seeing this prior to 130; in fact, I know I would have
 noticed this, as we create and destroy large ZFS filesystems very
 frequently.

 So it seems the original issue I reported many months back has
 actually gained some new negative impacts  :(

 I'll try to escalate this with my Sun support contract, but Sun
 support still isn't very familiar/clued in about OpenSolaris, so I
 doubt I will get very far.

 Cross-posting to ZFS-discuss also, as others may have seen this and
 know of a solution/workaround.



 --
 Brent Jones
 br...@servuhome.net


I did some more testing, and it seems this is 100% reproducible ONLY
if the file system and/or entire pool had compression or de-dupe
enabled at one point.
It doesn't seem to matter whether de-dupe/compression was enabled for 5
minutes or for the entire life of the pool: once either is turned
on in snv_130, doing any type of mass change (like deleting a big file
system) will hang ALL I/O for a significant amount of time.

If I create a filesystem with neither enabled, fill it with a few TB
of data, and do a 'zfs destroy' on it, it goes pretty quickly - just a
couple of minutes - with no noticeable impact on system I/O.
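For anyone who wants to reproduce this, the test boils down to something like
the following sketch (pool and dataset names here are just examples):

# scratch dataset with dedup on; use compression=gzip for the other variant
zfs create -o dedup=on tank/scratch
# ...fill it with a couple of TB of data, then time the destroy...
zfs destroy tank/scratch
# in a second shell, watch whether the rest of the pool still does I/O
zpool iostat tank 5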

I'm curious about the 7000 series appliances, since those supposedly
ship now with de-dupe as a fully supported option. Is the core ZFS code
significantly different on the 7000 appliances from a recent build of
OpenSolaris?
My sales rep assures me there's very little overhead from enabling
de-dupe on the 7000 series (which he's trying to sell us, obviously),
but I can't see how that could be, when I have the same hardware the
7000s run on (a fully loaded X4540).

Any thoughts from anyone?

-- 
Brent Jones
br...@servuhome.net
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-29 Thread przemolicc
On Mon, Dec 28, 2009 at 01:40:03PM -0800, Brad wrote:
 This doesn't make sense to me. You've got 32 GB, why not use it?
 Artificially limiting the memory use to 20 GB seems like a waste of
 good money.
 
 I'm having a hard time convincing the DBAs to increase the size of the SGA to
 20GB because their philosophy is that, no matter what, eventually you'll have
 to hit disk to pick up data that's not stored in cache (arc or l2arc).  The
 typical database server in our environment holds over 3TB of data.

Brad,

are your DBAs aware that if you increase your SGA (currently 4 GB)
- to 8  GB - you get 100 % more memory for SGA
- to 16 GB - you get 300 % more memory for SGA
- to 20 GB - you get 400 % ...

If they are not aware, well ...

But try to be patient - I had a similar situation. It took quite a long time to
convince our DBAs to increase the SGA from 16 GB to 20 GB. Finally they did :-)

You can always use the stronger argument that not using memory you have
already bought is a waste of _money_.

Regards
Przemyslaw Bak (przemol)
--
http://przemol.blogspot.com/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [osol-help] zfs destroy stalls, need to hard reboot

2009-12-29 Thread Stephan Budach
Hi Brent,

what you have noticed makes sense, and that behaviour has been present since
v127, when dedupe was introduced in OpenSolaris. It also fits my
observations. I thought I had totally messed up one of my OpenSolaris boxes,
which I had used to take my first steps with ZFS/dedupe, but re-creating the
same zpool on another OpenSolaris box immediately returned the pool to
delivering high-performance I/O. Alas, after I enabled dedupe on one of the
zfs vols, that system started to show those issues again.

If I got that correctly, ZFS calculates a SHA-256 checksum anyway, so that
really shouldn't impact performance significantly. I have installed OpenSolaris
on a Dell R610 with 2 current Nehalem CPUs and 12 GB of RAM, and I couldn't
see a difference in I/O with or without dedupe configured.
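To double-check what a given volume is actually doing, something like this
shows the relevant properties (the dataset name is an example):

# data defaults to fletcher4; enabling dedup moves new blocks to sha256
zfs get checksum,dedup,compression tank/vol1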

Budy
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write bursts cause short app stalls

2009-12-29 Thread Robert Milkowski

I included networking-discuss@


On 28/12/2009 15:50, Saso Kiselkov wrote:

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Thank you for the advice. After trying flowadm the situation improved
somewhat, but I'm still getting occasional packet overflow (10-100
packets about every 10-15 minutes). This is somewhat unnerving, because
I don't know how to track it down.

Here are the flowadm settings I use:

# flowadm show-flow iptv
FLOW        LINK        IPADDR            PROTO  LPORT   RPORT   DSFLD
iptv        e1000g1     LCL:224.0.0.0/4   --     --      --      --

# flowadm show-flowprop iptv
FLOW     PROPERTY    VALUE    DEFAULT    POSSIBLE
iptv     maxbw       --       --         ?
iptv     priority    high     --         high

I also tuned udp_max_buf to 256MB. All recording processes are boosted
to the RT priority class, and I set zfs_txg_timeout=1 to force the system to
commit data to disk in smaller and more manageable chunks. Is there any
further tuning you could recommend?
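For reference, those tunings would typically be applied along these lines
(the pid and exact values are just examples):

# raise the UDP receive buffer ceiling to 256 MB
ndd -set /dev/udp udp_max_buf 268435456
# move a recorder process into the real-time scheduling class
priocntl -s -c RT -i pid 1234
# make the shorter txg timeout persistent across reboots
echo "set zfs:zfs_txg_timeout = 1" >> /etc/system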

Regards,
- --
Saso

I need all IP multicast input traffic on e1000g1 to get the highest
possible priority.

Markus Kovero wrote:
   

Hi, try adding a flow for the traffic you want prioritized. I noticed that
OpenSolaris tends to drop network connectivity without priority flows defined;
I believe this is behaviour introduced by Crossbow itself. flowadm is your
friend, that is.
I found this particularly annoying when you monitor servers with ICMP ping and
high load causes the checks to fail, triggering unnecessary alarms.

Yours
Markus Kovero

-Original Message-
From: zfs-discuss-boun...@opensolaris.org 
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Saso Kiselkov
Sent: 28. joulukuuta 2009 15:25
To: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] ZFS write bursts cause short app stalls

I progressed with testing a bit further and found that I was hitting
another scheduling bottleneck - the network. While the write burst was
running and ZFS was committing data to disk, the server was dropping
incoming UDP packets (netstat -s | grep udpInOverflows grew by about
1000-2000 packets during every write burst).

To work around that I had to boost the scheduling priority of recorder
processes to the real-time class and I also had to lower
zfs_txg_timeout=1 (there was still minor packet drop after just doing
priocntl on the processes) to even out the CPU load.

Any ideas on why ZFS should completely thrash the network layer and make
it drop incoming packets?

Regards,
 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAks406oACgkQRO8UcfzpOHBVFwCguUVlMhTt9PlcbcqUjJzJ8Oij
CiIAoJJFHu1wtLMbyOyhXbyDPTkSFSFc
=VLoO
-END PGP SIGNATURE-
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

   


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write bursts cause short app stalls

2009-12-29 Thread Saso Kiselkov
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

I tried removing the flow, and subjectively packet loss occurs a bit less
often, but it is still happening. Right now I'm trying to figure out whether
it's due to the load on the server or not - I've left only about 15
concurrent recording instances, producing about 8% load on the system. If
the packet loss still occurs, I guess I'll have to disregard the loss
measurements as irrelevant, since at such a load the server should not
be dropping packets at all... I guess.

Regards,
- --
Saso

Robert Milkowski wrote:
 I included networking-discuss@
 
 
 On 28/12/2009 15:50, Saso Kiselkov wrote:
 Thank you for the advice. After trying flowadm the situation improved
 somewhat, but I'm still getting occasional packet overflow (10-100
 packets about every 10-15 minutes). This is somewhat unnerving, because
 I don't know how to track it down.
 
 Here are the flowadm settings I use:
 
 # flowadm show-flow iptv
 FLOW        LINK        IPADDR            PROTO  LPORT   RPORT   DSFLD
 iptv        e1000g1     LCL:224.0.0.0/4   --     --      --      --
 
 # flowadm show-flowprop iptv
 FLOW     PROPERTY    VALUE    DEFAULT    POSSIBLE
 iptv     maxbw       --       --         ?
 iptv     priority    high     --         high
 
 I also tuned udp_max_buf to 256MB. All recording processes are boosted
 to the RT priority class and zfs_txg_timeout=1 to force the system to
 commit data to disk in smaller and more manageable chunks. Is there any
 further tuning you could recommend?
 
 Regards,
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAks58KIACgkQRO8UcfzpOHCSJQCePCPVhbbfdogNHL735qz3A3dI
4acAn2jofXsGsveDYCgkelwg1xXKFVId
=UPRk
-END PGP SIGNATURE-
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-29 Thread Brad
Thanks for the suggestion!

I have heard mirrored vdev configurations are preferred for Oracle, but what's
the difference between a mirrored raidz vdev and a raid10 setup?

We have tested a zfs stripe configuration before with 15 disks and our tester 
was extremely happy with the performance.  After talking to our tester, she 
doesn't feel comfortable with the current raidz setup.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] OpenSolaris to Ubuntu

2009-12-29 Thread Duane Walker
I tried running an OpenSolaris server so I could use ZFS, but SMB serving wasn't
reliable (it would only work for about 15 minutes). I also couldn't get Cacti
working (no PHP-SNMP support, and I tried building PHP with SNMP but it failed).

So now I am going to run Ubuntu with RAID1 drives.  I am trying to transfer the
files from my zpool (I have the drive in a USB-to-SATA chassis).

I want to mount the pool and then the volume without destroying the files, if
possible.

If I create a pool, will it destroy the contents of the pool?

From reading the doco and the forums it looks like 'zpool import rpool
/dev/sdc' may be what I want?

I did a zpool import but it didn't show the drive.  It was part of a mirror -
maybe zpool import -D?

I have built zfs-fuse and it seems to be working.
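Under zfs-fuse the usual sequence would look roughly like this (the pool name
and device paths from your mail are used as examples):

# list importable pools, searching the plain Linux device nodes
zpool import -d /dev
# import by name; -f forces it if the pool still looks active on the old host
zpool import -d /dev -f rpool
# -D only finds pools that were explicitly destroyed
zpool import -D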
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-29 Thread Ross Walker

On Dec 29, 2009, at 7:55 AM, Brad bene...@yahoo.com wrote:


Thanks for the suggestion!

I have heard mirrored vdevs configuration are preferred for Oracle  
but whats the difference between a raidz mirrored vdev vs a raid10  
setup?


A mirrored raidz provides redundancy at a steep cost to performance  
and might I add a high monetary cost.


Because each write of a raidz is striped across the disks the  
effective IOPS of the vdev is equal to that of a single disk. This can  
be improved by utilizing multiple (smaller) raidz vdevs which are  
striped, but not by mirroring them.


With raid10 each mirrored pair has the IOPS of a single drive. Since  
these mirrors are typically 2 disk vdevs, you can have a lot more of  
them and thus a lot more IOPS (some people talk about using 3 disk  
mirrors, but it's probably just as good as setting copies=2 on  
a regular pool of mirrors).


We have tested a zfs stripe configuration before with 15 disks and  
our tester was extremely happy with the performance.  After talking  
to our tester, she doesn't feel comfortable with the current raidz  
setup.


How many luns are you working with now? 15?

Is the storage direct attached or is it coming from a storage server  
that may have the physical disks in a raid configuration already?


If direct attached, create a pool of mirrors. If it's coming from a  
storage server where the disks are in a raid already, just create a  
striped pool and set copies=2.
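Concretely, those two layouts would be created along these lines (device names
are placeholders):

# direct attached: a pool of 2-way mirrors, ZFS's analogue of RAID10
zpool create tank mirror c1t1d0 c1t2d0 mirror c1t3d0 c1t4d0 mirror c1t5d0 c1t6d0
# already RAID-protected storage: a plain striped pool plus extra copies
zpool create tank c2t1d0 c2t2d0 c2t3d0
zfs set copies=2 tank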


-Ross



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-29 Thread Eric D. Mudama

On Tue, Dec 29 at  4:55, Brad wrote:

Thanks for the suggestion!

I have heard mirrored vdevs configuration are preferred for Oracle
but whats the difference between a raidz mirrored vdev vs a raid10
setup?

We have tested a zfs stripe configuration before with 15 disks and
our tester was extremely happy with the performance.  After talking
to our tester, she doesn't feel comfortable with the current raidz
setup.


As a general rule of thumb, each vdev has the random performance
roughly the same as a single member of that vdev.  Having six RAIDZ
vdevs in a pool should give roughly the performance as a stripe of six
bare drives, for random IO.

When you're in a workload that you expect to be bounded by random IO
performance, in ZFS you'd want to increase the number of VDEVs to be
as large as possible, which acts to distribute random work across all
of your disks.  Building a pool out of 2-disk mirrors, then, is the
preferred layout for random performance, since it's the highest ratio
of disks to vdevs you can achieve (short of non-fault-tolerant
configurations).

This winds up looking similar to RAID10 in layout, in that you're
striping across a lot of disks that each consists of a mirror, though
the checksumming rules are different.  Performance should also be
similar, though it's possible RAID10 may give slightly better random
read performance at the expense of some data quality guarantees, since
I don't believe RAID10 normally validates checksums on returned data
if the device didn't return an error.  In normal practice, RAID10 and
a pool of mirrored vdevs should benchmark against each other within
your margin of error.

--eric


--
Eric D. Mudama
edmud...@mail.bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-29 Thread Brad
@ross

Because each write of a raidz is striped across the disks the
effective IOPS of the vdev is equal to that of a single disk. This can
be improved by utilizing multiple (smaller) raidz vdevs which are
striped, but not by mirroring them.

So with random reads, would it perform better on a raid5 layout since the FS 
blocks are written to each disk instead of a stripe?

With zfs's implementation of raid10, would we still get data protection and 
checksumming?

How many luns are you working with now? 15?  
Is the storage direct attached or is it coming from a storage server
that may have the physical disks in a raid configuration already?
If direct attached, create a pool of mirrors. If it's coming from a
storage server where the disks are in a raid already, just create a
striped pool and set copies=2.

We're not using a SAN but a Sun X4270 with sixteen SAS drives (two dedicated to
the OS, two for SSD, and an 11+1 raidz).
There's a total of seven datasets from a single pool.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-29 Thread Brad
@eric

As a general rule of thumb, each vdev has the random performance
roughly the same as a single member of that vdev. Having six RAIDZ
vdevs in a pool should give roughly the performance as a stripe of six
bare drives, for random IO.

It sounds like we'll need 16 vdevs striped in a pool to at least get the 
performance of 15 drives plus another 16 mirrored for redundancy.

If we are bounded in iops by the vdev, would it make sense to go with the bare 
minimum of drives (3) per vdev?

This winds up looking similar to RAID10 in layout, in that you're
striping across a lot of disks that each consists of a mirror, though
the checksumming rules are different. Performance should also be
similar, though it's possible RAID10 may give slightly better random
read performance at the expense of some data quality guarantees, since
I don't believe RAID10 normally validates checksums on returned data
if the device didn't return an error. In normal practice, RAID10 and
a pool of mirrored vdevs should benchmark against each other within
your margin of error.

That's interesting to know that with ZFS's implementation of raid10 it doesn't 
have checksumming built-in.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-29 Thread Bob Friesenhahn

On Tue, 29 Dec 2009, Ross Walker wrote:


A mirrored raidz provides redundancy at a steep cost to performance and might 
I add a high monetary cost.


I am not sure what a mirrored raidz is.  I have never heard of such 
a thing before.


With raid10 each mirrored pair has the IOPS of a single drive. Since these 
mirrors are typically 2 disk vdevs, you can have a lot more of them and thus 
a lot more IOPS (some people talk about using 3 disk mirrors, but it's 
 probably just as good as setting copies=2 on a regular pool of 
mirrors).


This is another case where using a term like raid10 does not make 
sense when discussing zfs.  ZFS does not support raid10.  ZFS does 
not support RAID 0 or RAID 1 so it can't support RAID 1+0 (RAID 10).


Some important points to consider are that every write to a raidz vdev 
must be synchronous.  In other words, the write needs to complete on 
all the drives in the stripe before the write may return as complete. 
This is also true of RAID 1 (mirrors) which specifies that the 
drives are perfect duplicates of each other.  However, zfs does not 
implement RAID 1 either.  This is easily demonstrated since you can 
unplug one side of the mirror and the writes to the zfs mirror will 
still succeed, catching up the mirror which is behind as soon as it is 
plugged back in.  When using mirrors, zfs supports logic which will 
catch that mirror back up (only sending the missing updates) when 
connectivity improves.  With RAID 1 there is no way to recover a 
mirror other than a full copy from the other drive.
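A quick way to see that catch-up behaviour on a test pool (device names are
placeholders):

# take one half of a mirror away, keep writing, then bring it back
zpool offline tank c1t2d0
# ...writes continue against the remaining side...
zpool online tank c1t2d0
# the resilver copies only the transactions the offlined disk missed
zpool status -v tank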


Zfs load-shares across vdevs so it will load-share across mirror vdevs 
rather than striping (as RAID 10 would require).


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [osol-help] zfs destroy stalls, need to hard reboot

2009-12-29 Thread Richard Elling

On Dec 29, 2009, at 12:34 AM, Brent Jones wrote:
On Sun, Dec 27, 2009 at 1:35 PM, Brent Jones br...@servuhome.net  
wrote:
On Sun, Dec 27, 2009 at 12:55 AM, Stephan Budach stephan.bud...@jvm.de 
 wrote:

Brent,

I had known about that bug a couple of weeks ago, but that bug was
filed against v111 and we're at v130. I have also searched the
ZFS part of this forum and really couldn't find much about this
issue.


The other issue I noticed is that, contrary to the statements I
read, once zfs is underway destroying a big dataset, other
operations do not continue to work. When destroying the 3 TB
dataset, the other zvol that had been exported via iSCSI stalled as
well, and that's really bad.


Cheers,
budy
--
This message posted from opensolaris.org
___
opensolaris-help mailing list
opensolaris-h...@opensolaris.org



I just tested your claim, and you appear to be correct.

I created a couple dummy ZFS filesystems, loaded them with about 2TB,
exported them via CIFS, and destroyed one of them.
The destroy took the usual amount of time (about 2 hours), and
actually, quite to my surprise, all I/O on the ENTIRE zpool stalled.
I don't recall seeing this prior to 130; in fact, I know I would have
noticed this, as we create and destroy large ZFS filesystems very
frequently.



So it seems the original issue I reported many months back has
actually gained some new negative impacts  :(

I'll try to escalate this with my Sun support contract, but Sun
support still isn't very familiar/clued in about OpenSolaris, so I
doubt I will get very far.

Cross posting to ZFS-discuss also, as other may have seen this and
know of a solution/workaround.



--
Brent Jones
br...@servuhome.net



I did some more testing, and it seems this is 100% reproducible ONLY
if the file system and/or entire pool had compression or de-dupe
enabled at one point.
It doesn't seem to matter whether de-dupe/compression was enabled for 5
minutes or for the entire life of the pool: once either is turned
on in snv_130, doing any type of mass change (like deleting a big file
system) will hang ALL I/O for a significant amount of time.


I don't believe compression matters.  But dedup can really make a big
difference.  When you enable dedup, the deduplication table (DDT) is
created to keep track of the references to blocks. When you remove a
file, the reference counter needs to be decremented for each block
in the file. When a DDT entry has a reference count of zero, the block
can be freed. When you destroy a file system (or dataset) which has
dedup enabled, then all of the blocks written since dedup was enabled
will need to have their reference counters decremented. This workload
looks like a small, random read followed by a small write. With luck, the
small, random read will already be loaded in the ARC, but you can't
escape the small write (though they should be coalesced).

Bottom line, rm or destroy of deduplicated files or datasets will create
a flurry of small, random I/O to the pool. If you use devices in the pool
which are not optimized for lots of small, random I/O, then this activity
will take a long time.
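A rough way to gauge how big that flurry will be is to look at the DDT itself
(the pool name is an example):

# summary of DDT entries, their in-core/on-disk size, and the dedup ratio
zdb -DD tank
# -DDD adds a histogram of reference counts
zdb -DDD tank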

...which brings up a few interesting questions:

Does it make sense to remove deduplicated files?

How do we schedule automatic snapshot removal?

I filed an RFE on a method to address this problem.  I'll pass along
the CR if or when it is assigned.
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-29 Thread Mattias Pantzare
On Tue, Dec 29, 2009 at 18:16, Brad bene...@yahoo.com wrote:
 @eric

 As a general rule of thumb, each vdev has the random performance
 roughly the same as a single member of that vdev. Having six RAIDZ
 vdevs in a pool should give roughly the performance as a stripe of six
 bare drives, for random IO.

 It sounds like we'll need 16 vdevs striped in a pool to at least get the 
 performance of 15 drives plus another 16 mirrored for redundancy.

 If we are bounded in iops by the vdev, would it make sense to go with the 
 bare minimum of drives (3) per vdev?

Minimum is 1 drive per vdev. Minimum with redundancy is 2 if you use
mirroring. You should do mirroring to get the best performance.

 This winds up looking similar to RAID10 in layout, in that you're
 striping across a lot of disks that each consists of a mirror, though
 the checksumming rules are different. Performance should also be
 similar, though it's possible RAID10 may give slightly better random
 read performance at the expense of some data quality guarantees, since
 I don't believe RAID10 normally validates checksums on returned data
 if the device didn't return an error. In normal practice, RAID10 and
 a pool of mirrored vdevs should benchmark against each other within
 your margin of error.

 That's interesting to know that with ZFS's implementation of raid10 it 
 doesn't have checksumming built-in.

He was talking about RAID10, not mirroring in ZFS. ZFS will always use
checksums.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-29 Thread Eric D. Mudama

On Tue, Dec 29 at  9:16, Brad wrote:

@eric

As a general rule of thumb, each vdev has the random performance
roughly the same as a single member of that vdev. Having six RAIDZ
vdevs in a pool should give roughly the performance as a stripe of six
bare drives, for random IO.

It sounds like we'll need 16 vdevs striped in a pool to at least get
the performance of 15 drives plus another 16 mirrored for redundancy.


If you were striping across 16 devices before, you will achieve
similar random IO performance by striping across 16 vdevs, regardless
of their type.  Sequential throughput is more a function of the number
of devices, not the number of vdevs, in that a 3-disk RAIDZ will have
the sequential write throughput (roughly) of a pair of drives.

You still get checksumming, but if a device fails or you get a
corruption in your non-redundant stripe, zfs may not have enough
information to repair your data.  For a read-only data reference,
maybe a restore from backup in these situations is okay, but for most
installations that is unacceptable.

The disk cost of a raidz pool of mirrors is identical to the disk cost
of raid10.


If we are bounded in iops by the vdev, would it make sense to go
with the bare minimum of drives (3) per vdev?


ZFS supports non-redundant vdev layouts, but they're generally not
recommended.  The smallest mirror you can build is 2 devices, and the
smallest raidz is 3 devices per vdev.


This winds up looking similar to RAID10 in layout, in that you're
striping across a lot of disks that each consists of a mirror, though
the checksumming rules are different. Performance should also be
similar, though it's possible RAID10 may give slightly better random
read performance at the expense of some data quality guarantees, since
I don't believe RAID10 normally validates checksums on returned data
if the device didn't return an error. In normal practice, RAID10 and
a pool of mirrored vdevs should benchmark against each other within
your margin of error.

That's interesting to know that with ZFS's implementation of raid10
it doesn't have checksumming built-in.


I don't believe I said this.  I am reasonably certain that all
zpool/zfs layouts validate checksums, even if built with no
redundancy.  The RAID10-similar layout in ZFS is an array of
mirrors, such that you build a bunch of 2-device mirrored vdevs, and
add them all into a single pool.  You wind up with a layout like:

Pool0
  mirror-0
disk0
disk1
  mirror-1
disk2
disk3
  mirror-2
disk4
disk5
  ...
  mirror-N
disk-2N
disk-2N+1

This will give you the best random IO performance possible with ZFS,
independent of the type of disks used.  (Obviously some of the same
rules may not apply with ramdisks or SSDs, but those are special cases
for most.)
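Building and growing that layout is just (using the placeholder disk names
from above):

zpool create Pool0 mirror disk0 disk1 mirror disk2 disk3 mirror disk4 disk5
# additional mirrored pairs can be appended later
zpool add Pool0 mirror disk6 disk7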

--eric


--
Eric D. Mudama
edmud...@mail.bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [osol-help] zfs destroy stalls, need to hard reboot

2009-12-29 Thread Eric D. Mudama

On Tue, Dec 29 at  9:50, Richard Elling wrote:

I don't believe compression matters.  But dedup can really make a big
difference.  When you enable dedup, the deduplication table (DDT) is
created to keep track of the references to blocks. When you remove a


Are there any published notes on relative DDT size compared to file
count, dedup efficiency, pool size, etc. for admins to make server
capacity planning decisions?

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-29 Thread Richard Elling

On Dec 29, 2009, at 9:16 AM, Brad wrote:


@eric

As a general rule of thumb, each vdev has the random performance
roughly the same as a single member of that vdev. Having six RAIDZ
vdevs in a pool should give roughly the performance as a stripe of six
bare drives, for random IO.


This model begins to break down with raidz2 and further breaks down
with raidz3.  Since I wrote about this simple model here:
http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_performance
we've refined it a bit, to take into account the number of parity devices.


For small, random read IOPS the performance of a single, top-level vdev is

performance = performance of a disk * (N / (N - P))

where,
N = number of disks in the vdev
P = number of parity devices in the vdev

For example, using 5 disks @ 100 IOPS we get something like:
2-disk mirror: 200 IOPS
4+1 raidz: 125 IOPS
3+2 raidz2: 167 IOPS
2+3 raidz3:  250 IOPS

Once again, it is clear that mirroring will offer the best small, random read
IOPS.

It sounds like we'll need 16 vdevs striped in a pool to at least get  
the performance of 15 drives plus another 16 mirrored for redundancy.


If we are bounded in iops by the vdev, would it make sense to go  
with the bare minimum of drives (3) per vdev?


This winds up looking similar to RAID10 in layout, in that you're
striping across a lot of disks that each consists of a mirror, though
the checksumming rules are different. Performance should also be
similar, though it's possible RAID10 may give slightly better random
read performance at the expense of some data quality guarantees, since
I don't believe RAID10 normally validates checksums on returned data
if the device didn't return an error. In normal practice, RAID10 and
a pool of mirrored vdevs should benchmark against each other within
your margin of error.

That's interesting to know that with ZFS's implementation of raid10  
it doesn't have checksumming built-in.


ZFS always checksums everything unless you explicitly disable
checksumming for data. Metadata is always checksummed.
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [osol-help] zfs destroy stalls, need to hard reboot

2009-12-29 Thread Richard Elling

On Dec 29, 2009, at 10:03 AM, Eric D. Mudama wrote:


On Tue, Dec 29 at  9:50, Richard Elling wrote:

I don't believe compression matters.  But dedup can really make a big
difference.  When you enable dedup, the deduplication table (DDT) is
created to keep track of the references to blocks. When you remove a


Are there any published notes on relative DDT size compared to file
count, dedup efficiency, pool size, etc. for admins to make server
capacity planning decisions?


I think it is still too early to tell. The community will need to do more
experiments and share results :-)

Also, the DDT is not instrumented -- quite unlike the ARC, for instance.
I've been making some DTrace measurements, but am not yet ready
to share any results.
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-29 Thread Tim Cook
On Tue, Dec 29, 2009 at 12:07 PM, Richard Elling
richard.ell...@gmail.com wrote:

 On Dec 29, 2009, at 9:16 AM, Brad wrote:

  @eric

 As a general rule of thumb, each vdev has the random performance
 roughly the same as a single member of that vdev. Having six RAIDZ
 vdevs in a pool should give roughly the performance as a stripe of six
 bare drives, for random IO.


 This model begins to break down with raidz2 and further breaks down
 with raidz3.  Since I wrote about this simple model here:

 http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_performance
 we've refined it a bit, to take into account the number of parity devices.

 For small, random read IOPS the performance of a single, top-level vdev is
performance = performance of a disk * (N / (N - P))

 where,
N = number of disks in the vdev
P = number of parity devices in the vdev

 For example, using 5 disks @ 100 IOPS we get something like:
2-disk mirror: 200 IOPS
4+1 raidz: 125 IOPS
3+2 raidz2: 167 IOPS
2+3 raidz3:  250 IOPS

 Once again, it is clear that mirroring will offer the best small, random
 read
 IOPS.


  It sounds like we'll need 16 vdevs striped in a pool to at least get the
 performance of 15 drives plus another 16 mirrored for redundancy.

 If we are bounded in iops by the vdev, would it make sense to go with the
 bare minimum of drives (3) per vdev?

 This winds up looking similar to RAID10 in layout, in that you're
 striping across a lot of disks that each consists of a mirror, though
 the checksumming rules are different. Performance should also be
 similar, though it's possible RAID10 may give slightly better random
 read performance at the expense of some data quality guarantees, since
 I don't believe RAID10 normally validates checksums on returned data
 if the device didn't return an error. In normal practice, RAID10 and
 a pool of mirrored vdevs should benchmark against each other within
 your margin of error.

 That's interesting to know that with ZFS's implementation of raid10 it
 doesn't have checksumming built-in.


 ZFS always checksums everything unless you explicitly disable
 checksumming for data. Metadata is always checksummed.
  -- richard



I imagine he's referring to the fact that it cannot fix any checksum errors
it finds.  <flamesuit>Let me open the can of worms by saying this is nearly
as bad as not doing checksumming at all.  Knowing the data is bad when you
can't do anything to fix it doesn't really help if you have no way to
regenerate it.</flamesuit>


-- 
--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] OpenSolaris to Ubuntu

2009-12-29 Thread Tim Cook
On Tue, Dec 29, 2009 at 9:04 AM, Duane Walker du...@walker-family.org wrote:

 I tried running an OpenSolaris server so I could use ZFS but SMB Serving
 wasn't reliable (it would only work for about 15 minutes).


I've been running native cifs on Opensolaris for 3 years with about 15
minutes of downtime total which was for upgrades.  Solaris was not your
problem.


 I also couldn't get Cacti working (No PHP-SNMP support and I tried building
 PHP with SNMP but it failed).


Yes, there is php-snmp support.  I'm not sure why you'd build it from
scratch instead of using packages.
php5_snmp: http://www.blastwave.org/jir/pkgcontents.ftd?software=php5_snmp&style=brief&state=5&arch=i386
CSWphp5snmp  5.2.11  2009-10-14  i386  8
http://www.blastwave.org/jir/packages.fam




 So now I am going to run Ubuntu with RAID1 drives.  I am trying to transfer
 the files from my zpool (I have the drive in a USB - SATA chassis).

 I want to mount the pool and then volume without destroying the files if
 possible.

 If I create a pool will it destroy the contents of the pool?

 From reading the doco and the forums it looks like zpool import rpool
 /dev/sdc may be what I want?

 I did a zpool import but it didn't show the drive.  It was part of a
 mirror maybe zpool import -D?

 I have built zfs-fuse and it seems to be working.


Assuming this isn't just a troll (I don't recall seeing you ask for help
on PHP or CIFS), you'd need to ask on the zfs-fuse mailing list.  That
project has no relation to OpenSolaris or the devs here.  Is it even using
the same or a newer zfs version than the one you created your pool with on
OpenSolaris?  If not, it isn't going to import.

-- 
--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] OpenSolaris to Ubuntu

2009-12-29 Thread Eric D. Mudama

On Tue, Dec 29 at 12:40, Tim Cook wrote:

On Tue, Dec 29, 2009 at 9:04 AM, Duane Walker du...@walker-family.org wrote:


I tried running an OpenSolaris server so I could use ZFS but SMB Serving
wasn't reliable (it would only work for about 15 minutes).



I've been running native cifs on Opensolaris for 3 years with about 15
minutes of downtime total which was for upgrades.  Solaris was not your
problem.


If he tried using 2009.06 (the latest stable release) your statement
is false.

2009.06 is unusable for serious CIFS work due to the hangs fixed in
b114/b116, and being at the bleeding edge of the dev repository often
has risks if you're not familiar administering a Solaris/OpenSolaris
system.



--
Eric D. Mudama
edmud...@mail.bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] OpenSolaris to Ubuntu

2009-12-29 Thread Tim Cook
On Tue, Dec 29, 2009 at 12:48 PM, Eric D. Mudama
edmud...@bounceswoosh.org wrote:

 On Tue, Dec 29 at 12:40, Tim Cook wrote:

 On Tue, Dec 29, 2009 at 9:04 AM, Duane Walker du...@walker-family.org
 wrote:

  I tried running an OpenSolaris server so I could use ZFS but SMB Serving
 wasn't reliable (it would only work for about 15 minutes).



 I've been running native cifs on Opensolaris for 3 years with about 15
 minutes of downtime total which was for upgrades.  Solaris was not your
 problem.


 If he tried using 2009.06 (the latest stable release) your statement
 is false.

 2009.06 is unusable for serious CIFS work due to the hangs fixed in
 b114/b116, and being at the bleeding edge of the dev repository often
 has risks if you're not familiar administering a Solaris/OpenSolaris
 system.



Serious CIFS work meaning what?  I've got a system that's been running
2009.06 for 6 months in a small office setting and it hasn't been unusable
for anything I've needed.

-- 
--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-29 Thread Erik Trimble

Eric D. Mudama wrote:

On Tue, Dec 29 at  9:16, Brad wrote:
The disk cost of a raidz pool of mirrors is identical to the disk cost
of raid10.

ZFS can't do a raidz of mirrors or a mirror of raidz.  Members of a 
mirror or raidz[123] must be a fundamental device (i.e. file or drive)





This winds up looking similar to RAID10 in layout, in that you're
striping across a lot of disks that each consists of a mirror, though
the checksumming rules are different. Performance should also be
similar, though it's possible RAID10 may give slightly better random
read performance at the expense of some data quality guarantees, since
I don't believe RAID10 normally validates checksums on returned data
if the device didn't return an error. In normal practice, RAID10 and
a pool of mirrored vdevs should benchmark against each other within
your margin of error.

That's interesting to know that with ZFS's implementation of raid10
it doesn't have checksumming built-in.


I don't believe I said this.  I am reasonably certain that all
zpool/zfs layouts validate checksums, even if built with no
redundancy.  The RAID10-similar layout in ZFS is an array of
mirrors, such that you build a bunch of 2-device mirrored vdevs, and
add them all into a single pool.  You wind up with a layout like:



Yes. PLEASE be careful - checksumming and redundancy are DIFFERENT concepts.

In ZFS, EVERYTHING is checksummed - the data blocks, and the metadata.  
This is separate from redundancy.  Regardless of the zpool layout 
(mirrors, raidz, or no redundancy), ZFS stores a checksum of all objects 
- this checksum is used to determine if the object has been corrupted.  
This check is done on any /read/.


Should the checksum determine that the object is corrupt, then there are 
two things that can happen:  if your zpool has some form of redundancy 
for that object, ZFS will then reread the object from the redundant side 
of the mirror, or reconstruct the data using parity.  It will then 
re-write the object to another place in the zpool, and eliminate the 
bad object.  Else, if there is no redundancy, then it will fail to 
return the data, and log an error message to the syslog.


In the case of metadata, even in a non-redundant zpool, some of that 
metadata is stored multiple times, so there is the possibility that you 
will be able to recover/reconstruct some metadata which fails checksumming.


In short, Checksumming is how ZFS /determines/ data corruption, and 
Redundancy is how ZFS /fixes/ it.  Checksumming is /always/ present, 
while redundancy depends on the pool layout and options (cf. copies 
property).
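In practice the two interact most visibly during a scrub; something like the
following (pool name is an example) exercises both:

# read every block in the pool and verify it against its checksum
zpool scrub tank
# the CKSUM column and any list of permanent errors show what was detected
# and, where redundancy allowed, repaired
zpool status -v tank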




--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS pool unusable after attempting to destroy a dataset with dedup enabled

2009-12-29 Thread scottford
I booted the snv_130 live cd and ran zpool import -fFX and it took a day, but 
it imported my pool and rolled it back to a previous version.  I haven't looked 
to see what was missing, but I didn't need any of the changes over the last few 
weeks.

Scott
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-29 Thread Brad
@relling
For small, random read IOPS the performance of a single, top-level
vdev is
performance = performance of a disk * (N / (N - P))  
  133 * 12/(12-1)=
  133 * 12/11

where,
N = number of disks in the vdev
P = number of parity devices in the vdev

performance of a disk = Is this a rough estimate of the disk's IOPS?


For example, using 5 disks @ 100 IOPS we get something like:
2-disk mirror: 200 IOPS
4+1 raidz: 125 IOPS
3+2 raidz2: 167 IOPS
2+3 raidz3: 250 IOPS

So if the rated iops on our disks is @133 iops
133 * 12/(12-1) = 145

11+1 raidz: 145 IOPS?

If that's the rate for a 11+1 raidz vdev, then why is iostat showing
about 700 combined IOPS (reads/writes) per disk?

r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0t0d0
1402.2 7805.3 2.7 36.2 0.2 54.9 0.0 6.0 0 940 c1
10.8 1.0 0.1 0.0 0.0 0.1 0.0 7.0 0 7 c1t0d0
117.1 640.7 0.2 1.8 0.0 4.5 0.0 5.9 1 76 c1t1d0
116.9 638.2 0.2 1.7 0.0 4.6 0.0 6.1 1 78 c1t2d0
116.4 639.1 0.2 1.8 0.0 4.6 0.0 6.0 1 78 c1t3d0
116.6 638.1 0.2 1.7 0.0 4.6 0.0 6.1 1 77 c1t4d0
113.2 638.0 0.2 1.8 0.0 4.6 0.0 6.1 1 77 c1t5d0
116.6 635.3 0.2 1.7 0.0 4.5 0.0 6.0 1 76 c1t6d0
116.2 637.8 0.2 1.8 0.0 4.7 0.0 6.2 1 79 c1t7d0
115.3 636.7 0.2 1.8 0.0 4.4 0.0 5.8 1 77 c1t8d0
115.4 637.8 0.2 1.8 0.0 4.5 0.0 5.9 1 77 c1t9d0
114.8 635.0 0.2 1.8 0.0 4.3 0.0 5.7 1 76 c1t10d0
114.9 639.9 0.2 1.8 0.0 4.7 0.0 6.2 1 78 c1t11d0
115.1 638.7 0.2 1.8 0.0 4.4 0.0 5.9 1 77 c1t12d0
1.6 140.0 0.0 15.1 0.0 0.6 0.0 4.4 0 8 c1t13d0
1.3 9.1 0.0 0.1 0.0 0.0 0.0 1.0 0 0 c1t14d0
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Zfs upgrade freezes desktop

2009-12-29 Thread roland
I have a problem which is perhaps related.

I installed OpenSolaris snv_130.
After adding 4 additional disks and creating a raidz on them with
compression=gzip and dedup enabled, I got a reproducible system freeze (not
sure, but the desktop/mouse cursor froze) directly after login - without
actively accessing the disks at all.

After removing the disks, all is fine again - no freeze.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS pool unusable after attempting to destroy a dataset with dedup enabled

2009-12-29 Thread tom wagner
 I booted the snv_130 live cd and ran zpool import
 -fFX and it took a day, but it imported my pool and
 rolled it back to a previous version.  I haven't
 looked to see what was missing, but I didn't need any
 of the changes over the last few weeks.
 
 Scott

I'll give it a shot.  Hope this works, Will report back if it succeeds.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-29 Thread Richard Elling

On Dec 29, 2009, at 11:26 AM, Brad wrote:


@relling
For small, random read IOPS the performance of a single, top-level
vdev is
performance = performance of a disk * (N / (N - P))
 133 * 12/(12-1)=
 133 * 12/11

where,
N = number of disks in the vdev
P = number of parity devices in the vdev

performance of a disk = Is this a rough estimate of the disk's IOP?


For example, using 5 disks @ 100 IOPS we get something like:
2-disk mirror: 200 IOPS
4+1 raidz: 125 IOPS
3+2 raidz2: 167 IOPS
2+3 raidz3: 250 IOPS

So if the rated iops on our disks is @133 iops
133 * 12/(12-1) = 145

11+1 raidz: 145 IOPS?

If that's the rate for a 11+1 raidz vdev, then why is iostat showing
about 700 combined IOPS (reads/writes) per disk?


Because the model is for small, random read IOPS
over the full size of the disk. What you are seeing is
caching and seek optimization at work (a good thing).
But, AFAIK,  there are no decent performance models
which take caching into account. In most cases, storage
is sized based on empirical studies.
 -- richard


r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0t0d0
1402.2 7805.3 2.7 36.2 0.2 54.9 0.0 6.0 0 940 c1
10.8 1.0 0.1 0.0 0.0 0.1 0.0 7.0 0 7 c1t0d0
117.1 640.7 0.2 1.8 0.0 4.5 0.0 5.9 1 76 c1t1d0
116.9 638.2 0.2 1.7 0.0 4.6 0.0 6.1 1 78 c1t2d0
116.4 639.1 0.2 1.8 0.0 4.6 0.0 6.0 1 78 c1t3d0
116.6 638.1 0.2 1.7 0.0 4.6 0.0 6.1 1 77 c1t4d0
113.2 638.0 0.2 1.8 0.0 4.6 0.0 6.1 1 77 c1t5d0
116.6 635.3 0.2 1.7 0.0 4.5 0.0 6.0 1 76 c1t6d0
116.2 637.8 0.2 1.8 0.0 4.7 0.0 6.2 1 79 c1t7d0
115.3 636.7 0.2 1.8 0.0 4.4 0.0 5.8 1 77 c1t8d0
115.4 637.8 0.2 1.8 0.0 4.5 0.0 5.9 1 77 c1t9d0
114.8 635.0 0.2 1.8 0.0 4.3 0.0 5.7 1 76 c1t10d0
114.9 639.9 0.2 1.8 0.0 4.7 0.0 6.2 1 78 c1t11d0
115.1 638.7 0.2 1.8 0.0 4.4 0.0 5.9 1 77 c1t12d0
1.6 140.0 0.0 15.1 0.0 0.6 0.0 4.4 0 8 c1t13d0
1.3 9.1 0.0 0.1 0.0 0.0 0.0 1.0 0 0 c1t14d0
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Scrub slow (again) after dedupe

2009-12-29 Thread Michael Herf
I have a 4-disk RAIDZ, and I reduced the time to scrub it from 80
hours to about 14 by reducing the number of snapshots, adding RAM,
turning off atime, compression, and some other tweaks. This week
(after replaying a large volume with dedup=on) it's back up, way up.

I replayed a 700G filesystem to get the dedup benefits, and a scrub is
taking 100+ hours now.
dedupratio for the pool is now around 1.7, and it was about 1.1 when
my scrub took 14 hours (nothing else has really changed).

arcstat.pl is showing a lot of misses, and the filesystem is seeking a
lot - iostat reports 350k/sec transfer with 170 reads/sec, ouch.

I've ordered an SSD drive to see if L2ARC will help this situation, but
in general it seems like a bad trend. I agree with a previous poster
that tools to estimate DDT size are important, and perhaps there are
less random-access ways to scrub a deduped filesystem, if the DDT is
not in core?
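For the record, the SSD experiment amounts to adding a cache vdev, roughly
(the device name is a placeholder):

# attach the SSD as an L2ARC device
zpool add tank cache c2t0d0
# watch the ARC/L2ARC hit and miss counters
kstat -m zfs -n arcstats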

I have several zdb -DDD outputs from throughout the week if anyone
would like to see them.

Also, is there any way to instrument scrub to see which parts of the
filesystem it is traversing?

mike
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs zend is very slow

2009-12-29 Thread Brandon High
On Wed, Dec 16, 2009 at 8:19 AM, Brandon High bh...@freaks.com wrote:
 On Wed, Dec 16, 2009 at 8:05 AM, Bob Friesenhahn
 bfrie...@simple.dallas.tx.us wrote:
  In his case 'zfs send' to /dev/null was still quite fast and the network
 was also quite fast (when tested with benchmark software).  The implication
 is that ssh network transfer performace may have dropped with the update.

 zfs send appears to be fast still, but receive is slow.

 I tried a pipe from the send to the receive, as well as using mbuffer
 with a 100mb buffer, both wrote at ~ 12 MB/s.

I did a little bit of testing today. I'm sending from a snv_129
system, using a 2.31GB filesystem to test.

The sender has 8GB of DDR2-800 memory and a Athlon X2 4850e cpu. It's
using 8x WD Green 5400rpm 1TB drives on a PCI-X controller, in a
raidz2.
The receiver has 2GB of DDR2-533 memory and a Atom 330 cpu. It's using
2 Hitachi 7200rpm 1TB drives in a non-redundant zpool.

I destroyed and recreated the zpool on the receiver between tests.

Doing a send to /dev/null completes in under a second, since the
entire dataset can be cached.

Sending across the network to a snv_118 system via netcat, then to
/dev/null took 45.496s and 40.384s.
Sending across the network to a snv_118 system via netcat, then to
/tank/test took 45.496s and 40.384s.

Sending across the network via netcat and recv'ing on a snv_118 system
took 101s and 97s.

I rebooted the receiver to a snv_128a BE and did the same tests.

Sending across the network to a snv_128a system via netcat, then to
/dev/null took 43.067s.

Sending across the network via netcat and recv'ing on a snv_128a
system took 98s with dedup=off.
Sending across the network via netcat and recv'ing on a snv_128a
system took 121s with dedup=on.
Sending across the network via netcat and recv'ing on a snv_128a
system took 134s with dedup=verify.

It looks like the receive times didn't change much for a small
dataset. The change from fletcher4 to sha256 when enabling dedup is
probably responsible for the slowdown.

I suspect that the dataset is too small to run into the performance
problems I was seeing. I'll try later with a larger filesystem and see
what the numbers look like.
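For context, the baseline pipeline being timed looks roughly like this
(snapshot and host names are examples); the netcat runs simply replace ssh
with nc on both ends to take the cipher out of the path:

zfs snapshot tank/test@now
zfs send tank/test@now | ssh receiver zfs recv -F tank/test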

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] raidz vs raid5 clarity needed

2009-12-29 Thread Brad
Hi!  I'm attempting to understand the pros/cons between raid5 and raidz after
running into a performance issue with Oracle on zfs
(http://opensolaris.org/jive/thread.jspa?threadID=120703&tstart=0).

I would appreciate some feedback on what I've understood so far:

WRITES

raid5 - A FS block is written on a single disk (or multiple disks, depending on
the data size???)
raidz - A FS block is written in a dynamic stripe (depending on the size of the
data?) across n number of vdevs (minus parity).

READS

raid5 - IO count depends on how many disks the FS block was written to (data
crosses two disks = 2 IOs??)
raidz - A single read will span across n number of vdevs (minus parity)
(1 single IO??)

NEGATIVES

raid5 - Write hole penalty, where if the system crashes in the middle of a write
block update, before or after updating parity, data is corrupt.
 - Overhead (read previous block, read parity, update parity and write 
block)
- No checksumming of data!
- Slow read sequential performance.

raidz - Bound by x number of IOPS from slowest vdev since blocks are striped.
  Bad for small random reads

POSITIVES

raid5 - Good for random reads (between raid5 and raidz!) since blocks are not 
striped across sum of disks.
raidz - Good for sequential reads and writes since data is striped across sum 
of vdevs.
- No write hole penalty!
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raidz vs raid5 clarity needed

2009-12-29 Thread A Darren Dunham
On Tue, Dec 29, 2009 at 02:37:20PM -0800, Brad wrote:
 I would appreciate some feedback on what I've understood so far:
 
 WRITES
 
 raid5 - A FS block is written on a single disk (or multiple disks
depending on size data???)

There is no direct relationship between a filesystem and the RAID
structure.  RAID5 maps virtual sectors to columns in some width pattern.
How the FS uses those virtual sectors is up to it.  The admin may need
to know how it is to be used if there is a desire to tweak the stripe
width.  This makes some comparisons difficult because RAID5 is only a
presentation and management of a set of contiguous blocks, while raidz
is always associated with a particular filesystem.

Updates to RAID5 are in-place.

 raidz - A FS block is written in a dynamic stripe (depending on size of 
 data?)across n number of vdevs (minus parity).
 

The stripe may be written to as few as 1 disk for data plus other disks
for parity, or the stripe may cover all the disks.

 READS
 
 raid5 - IO count depends on  how many disks FS block written to. (data
 crosses two disks 2 IOs??)

Well, that's true for anything.  You can't read two disks without
issuing two reads.  The main issue is that RAID5 has no ability to
validate the data, so it doesn't need to read all columns.  It can just
read one sector if necessary and return the data.  How many disk sectors
must be retrieved may depend on which filesystem is in use.  But in most
cases (common filesystems, common stripe widths), a single FS block will
not be distributed over many disks.

 raidz - A single read will span across n number of vdevs (minus
  parity).  (1single IO??)

If not in cache, the ZFS block is read (usually only from the non-parity
components), and that block may be on many disks.  The entire ZFS block
is read so that it can be validated against the checksum.

 NEGATIVES
 
 raid5 - Write hole penalty, where if system crashes in the middle of a
 write block update before or after updating parity - data is corrupt.

Assuming no other structures are used to address it (like a log
device).  A log device is not really part of RAID5, but may be found in
implementations of RAID5.

  - Overhead (read previous block, read parity, update parity
 and write block)

True for non-full-stripe writes.  Full stripe writes need no read step
(something the raidz implementation leverages).

 - No checksumming of data!
 - Slow read sequential performance.

Not sure why sequential read performance would have a penalty under
RAID5.  

-- 
Darren
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can I destroy a Zpool without importing it?

2009-12-29 Thread A Darren Dunham
On Sun, Dec 27, 2009 at 06:02:18PM +0100, Colin Raven wrote:
 Are there any negative consequences as a result of a force import? I mean
 STUNT; Sudden Totally Unexpected and Nasty Things
 -Me

If the pool is not in use, no.  It's a safety check to avoid problems
that can easily crop up when storage can be seen by multiple machines.

If your pool is imported on machine A and you force import it on machine
B at the same time, you will corrupt the pool.
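The safe sequence for moving a pool between machines is an explicit
export/import, roughly:

# on machine A: release the pool cleanly
zpool export tank
# on machine B: take it over (-f should only be needed if A never exported it)
zpool import tank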

-- 
Darren
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] OpenSolaris to Ubuntu

2009-12-29 Thread Eric D. Mudama

On Tue, Dec 29 at 12:49, Tim Cook wrote:

Serious CIFS work meaning what?  I've got a system that's been running
2009.06 for 6 months in a small office setting and it hasn't been unusable
for anything I've needed.


Weird.  Win7-x64 clients crashed my 2009.06 installation within 30
seconds of beginning a search on the shared drive.  I could bork the
server at will.  It only required a single client.

Rolling back to 2008.11 was required for me.  A significant number of
people reported similar problems without win7 clients (service
crashed/stuck, system cannot be rebooted properly, etc.) so I am
certain it wasn't just me.

Maybe something about your client mix wasn't hitting the bugs I (and
others) ran into.

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-29 Thread Eric D. Mudama

On Tue, Dec 29 at 11:14, Erik Trimble wrote:

Eric D. Mudama wrote:

On Tue, Dec 29 at  9:16, Brad wrote:
The disk cost of a raidz pool of mirrors is identical to the disk cost
of raid10.

ZFS can't do a raidz of mirrors or a mirror of raidz.  Members of a 
mirror or raidz[123] must be a fundamental device (i.e. file or 
drive)


Sorry, typo/thinko ... I meant to say a zpool of mirrors, not a raidz
pool of mirrors.

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] OpenSolaris to Ubuntu

2009-12-29 Thread Duane Walker
I was trying to get Cacti running and it was all working except the PHP-SNMP.  
I installed it but the SNMP support wasn't recognised (in phpinfo()).

I was reading the posts for the Cacti package and they said they were planning 
to add the SNMP support.

I am running a combination of Win7-64 and 32 bit computers and someone else 
mentioned that win7 64 causes problems.  The server itself was very stable and 
SCP (WinSCP) worked fine, but SMB wouldn't stay up.  I tried restarting the 
services, but only a reboot would fix it.

I am more familiar with Ubuntu and Fedora.  We use Red Hat Enterprise and AIX 
at work.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Supermicro AOC-USAS-L8i

2009-12-29 Thread James Dickens
Not sure of your experience level, but did you try running devfsadm and then
checking in format for your new disks?
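
Something along these lines, for example (flags from memory, so check the man
pages):

pfexec devfsadm -Cv     # rebuild /dev and /devices links; -C also cleans up stale ones
echo | pfexec format    # non-interactive listing of the disks the OS now sees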

James Dickens
uadmin.blogspot.com


On Sun, Dec 27, 2009 at 3:59 AM, Muhammed Syyid opensola...@syyid.net wrote:

 Hi,
 I just picked up one of these cards and had a few questions.
 After installing it I can see it via scanpci, but any devices I've connected
 to it don't show up in iostat -En; is there anything specific I need to do
 to enable it?

 Do any of you experience the bug mentioned below? (I'm worried about using it
 and losing my data.)
 http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6894775
 http://opensolaris.org/jive/thread.jspa?threadID=117702&tstart=1
 --
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] OpenSolaris to Ubuntu

2009-12-29 Thread Eric D. Mudama

On Tue, Dec 29 at 17:00, Duane Walker wrote:

I am running a combination of Win7-64 and 32 bit computers and
someone else mentioned that win7 64 causes problems.  The server
itself was very stable and SCP (WinSCP) worked fine, but SMB wouldn't
stay up.  I tried restarting the services, but only a reboot would
fix it.


To me, this sounds like the same issues I was hitting in 2009.06.  We
are using b129 successfully at work to share CIFS to mixed XP, 2003,
Win7 and Win7-64 clients ... you'll need to either start from b129 (or
130?)  from genunix.org, or update to it from 2009.06 by switching to
the dev repository and updating your image.
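
Roughly like this, if memory serves (double-check the exact origin URL against
the current release notes):

pfexec pkg set-publisher -O http://pkg.opensolaris.org/dev opensolaris.org
pfexec pkg image-update   # builds a new boot environment with the dev bits
# then activate/boot the new BE once the update finishes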

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-29 Thread Ross Walker
On Dec 29, 2009, at 12:36 PM, Bob Friesenhahn bfrie...@simple.dallas.tx.us 
 wrote:



On Tue, 29 Dec 2009, Ross Walker wrote:


A mirrored raidz provides redundancy at a steep cost to performance  
and might I add a high monetary cost.


I am not sure what a mirrored raidz is.  I have never heard of  
such a thing before.


With raid10 each mirrored pair has the IOPS of a single drive.  
Since these mirrors are typically 2 disk vdevs, you can have a lot  
more of them and thus a lot more IOPS (some people talk about using  
3 disk mirrors, but it's probably just as good as setting
copies=2 on a regular pool of mirrors).
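
For what it's worth, copies is just a per-dataset property, so experimenting
with it is a one-liner (the dataset name here is made up):

zfs set copies=2 tank/important   # keep two ditto copies of each data block
zfs get copies tank/important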


This is another case where using a term like raid10 does not make  
sense when discussing zfs.  ZFS does not support raid10.  ZFS does  
not support RAID 0 or RAID 1 so it can't support RAID 1+0 (RAID 10).


Did it again... I understand the difference. I hope it didn't confuse  
the OP by throwing that out there. What I meant to say was a zpool of  
mirror vdevs.


Some important points to consider are that every write to a raidz  
vdev must be synchronous.  In other words, the write needs to  
complete on all the drives in the stripe before the write may return  
as complete. This is also true of RAID 1 (mirrors) which specifies  
that the drives are perfect duplicates of each other.


I believe mirrored vdevs can do this in parallel, though, while raidz
vdevs need to do it serially due to the ordered nature of the
transaction, which makes the sync writes faster on the mirrors.


 However, zfs does not implement RAID 1 either.  This is easily  
demonstrated since you can unplug one side of the mirror and the  
writes to the zfs mirror will still succeed, catching up the mirror  
which is behind as soon as it is plugged back in.  When using  
mirrors, zfs supports logic which will catch that mirror back up  
(only sending the missing updates) when connectivity improves.  With  
RAID 1 there is no way to recover a mirror other than a full copy
from the other drive.


That's not completely true these days, as a lot of RAID implementations
use bitmaps to track changed blocks and a RAID 1 continues to function
when the other side disappears. The real difference is that the mirror
implementation in ZFS is in the file system and not at an abstracted
block-I/O layer, so it is more intelligent in its use and layout.


Zfs load-shares across vdevs so it will load-share across mirror  
vdevs rather than striping (as RAID 10 would require).


Bob, an interesting question was put to me about how copies may
affect random read performance. I didn't know the answer, but if ZFS
knows there are additional copies, would it not also spread the load
across those to make sure the wait queues on each spindle are
as even as possible?


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] OpenSolaris to Ubuntu

2009-12-29 Thread Sriram Narayanan
Each of the problems that you faced can be solved. Please ask about
each of them in separate emails to osol-discuss and you'll
get help.

I say so because I'm moving my infrastructure to opensolaris for these
services, among others.
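
One quick pointer in the meantime - with zfs-fuse you usually have to tell
import where to look for device nodes. Untested sketch; the pool name is the
one from your mail, so adjust as needed:

sudo zpool import -d /dev            # scan /dev for importable pools and list them
sudo zpool import -d /dev -f rpool   # then import the one that shows up, by name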

-- Sriram

On 12/29/09, Duane Walker du...@walker-family.org wrote:
 I tried running an OpenSolaris server so I could use ZFS, but SMB serving
 wasn't reliable (it would only work for about 15 minutes). I also couldn't
 get Cacti working (no PHP-SNMP support, and I tried building PHP with SNMP
 but it failed).

 So now I am going to run Ubuntu with RAID1 drives.  I am trying to transfer
 the files from my zpool (I have the drive in a USB - SATA chassis).

 I want to mount the pool and then volume without destroying the files if
 possible.

 If I create a pool will it destroy the contents of the pool?

 From reading the doco and the forums it looks like zpool import rpool
 /dev/sdc may be what I want?

 I did a zpool import but it didn't show the drive.  It was part of a
 mirror - maybe zpool import -D?

 I have built zfs-fuse and it seems to be working.
 --
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


-- 
Sent from my mobile device
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Supermicro AOC-USAS-L8i

2009-12-29 Thread Muhammed Syyid
Thanks a bunch - that did the trick :)
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raidz vs raid5 clarity needed

2009-12-29 Thread Brad
@ross

If the write doesn't span the whole stripe width, then there is a read
of the parity chunk, a write of the block, and a write of the parity
chunk (this is the write hole penalty/vulnerability), which is 3
operations (if the data spans more than 1 chunk it is written in
parallel, so you can think of it as one operation; if the data doesn't
fill any given chunk then a read of the existing data chunk is
necessary to fill in the missing data, making it 4 operations). No
other operation on the array can execute while this is happening.

I thought that with raid5, for a new FS block write, the previous block is read in, 
then parity is read, parity is updated/written, and then the new block is written (2 
reads, 2 writes)??

Yes, reads are exactly like writes on the raidz vdev: no other
operation, read or write, can execute while this is happening. This is
where the problem lies, and it is felt hardest with random IOs.

Ah - so with a random read workload on raidz, a read IO cannot be
executed in multiple streams or simultaneously until the current IO has
completed.  Was the thought process behind this to mitigate the
write hole issue, or for performance (a write is a single IO instead of 3 or 4 
IOs with raid5)?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS pool unusable after attempting to destroy a dataset with dedup enabled

2009-12-29 Thread Jack Kielsmeier
I got my pool back

Did a rig upgrade (new motherboard, processor, and 8 GB of RAM), re-installed 
opensolaris 2009.06, did an upgrade to snv_130, and did the import!

The import only took about 4 hours!

I have a hunch that I was running into some sort of issue with not having 
enough RAM previously. Of course, that's just a guess.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS pool unusable after attempting to destroy a dataset with dedup enabled

2009-12-29 Thread Jack Kielsmeier
I should note that my import command was:

zpool import -f vault
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss