Re: [zfs-discuss] ZFS performance benchmarks in various configurations

2010-02-15 Thread Carson Gaspar

Richard Elling wrote:
...

As you can see, so much has changed, hopefully for the better, that running
performance benchmarks on old software just isn't very interesting.

NB. Oracle's Sun OpenStorage systems do not use Solaris 10 and if they did, they
would not be competitive in the market. The notion that OpenSolaris is worthless
and Solaris 10 rules is simply bull*


OpenSolaris isn't worthless, but no way in hell would I run it in 
production, based on my experiences running it at home from b111 to now. 
The mpt driver problems are just one of many show stoppers (is that 
resolved yet, or do we still need magic /etc/system voodoo?).


Of course, Solaris 10 couldn't properly drive the Marvell attached disks 
in an X4500 prior to U6 either, unless you ran an IDR (pretty 
inexcusable in a storage-centric server release).


--
Carson

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work

2010-02-15 Thread Orvar Korvar
Yes, if you value your data you should change from USB drives to normal drives. 
I have heard that USB can do some strange things; a normal connection such as 
SATA is more reliable.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS slowness under domU high load

2010-02-15 Thread Bogdan Ćulibrk



zfs ml wrote:


sorry, scratch the above - I didn't see this:
9. domUs have ext3 mounted with: noatime,commit=120

Is the write traffic because you are backing up to the same disks that the 
domUs live on?


Yes it is.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS slowness under domU high load

2010-02-15 Thread Bogdan Ćulibrk

Kjetil and Richard thanks for this.



Kjetil Torgrim Homme wrote:

Bogdan Ćulibrk b...@default.rs writes:


What are my options from here? To move onto a zvol with a greater
blocksize? 64k? 128k? Or will I get into other trouble going that
way when I have small reads coming from the domU (ext3 with a default
blocksize of 4k)?


yes, definitely.  have you considered using NFS rather than zvols for
the data filesystems?  (keep zvol for the domU software.)



That makes sense. Would it be useful to simply add a new drive to the domU, 
backed by a zvol with a greater blocksize, or maybe a vmdk file? Does it 
have to be an NFS backend?
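
For reference, a zvol's block size is fixed when it is created (the 
volblocksize property cannot be changed later), so moving to a larger 
blocksize means creating a new zvol and migrating onto it. A minimal sketch, 
with pool, name and sizes purely illustrative:

# create a 20 GB zvol with 64k blocks for the domU's data disk
zfs create -V 20g -o volblocksize=64k tank/domu1-data

Whether 64k actually helps depends on how well it matches the I/O the domU 
issues: small 4k reads against 64k blocks mean extra data is read per 
request, which is the trade-off raised in the question above.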




it's strange that you see so much write activity during backup -- I'd
expect that to do just reads...  what's going on at the domU?

generally, the best way to improve performance is to add RAM for ARC
(512 MiB is *very* little IMHO) and SSD for your ZIL, but it does seem
to be a poor match for your concept of many small low-cost dom0's.



The writes come from the backup being packed before it is transferred to the 
real backup location. Most likely this is the main reason for the whole problem.


One more thing regarding SSDs: would it be useful to throw in an additional 
SAS/SATA drive to serve as L2ARC? I know an SSD is the most logical 
thing to use as L2ARC, but would a conventional drive be of *any* help as 
L2ARC?



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD and ZFS

2010-02-15 Thread Tracey Bernath
For those following the saga:
With the prefetch problem fixed, and data coming off the L2ARC instead of
the disks, the system switched from IO-bound to CPU-bound. I opened up the
throttles with some explicit PARALLEL hints in the Oracle commands, and we
were finally able to max out the single SSD:


    r/s  w/s      kr/s   kw/s  wait  actv  wsvc_t  asvc_t  %w   %b  device
  826.0  3.2  104361.8   35.2   0.0   9.9     0.0    12.0   3  100  c0t0d4

So, when we maxed out the SSD cache, it was delivering 100+ MB/s and 830
IOPS, with 3.4 TB behind it in a 4-disk SATA RAIDZ1.

Still have to remap it to 8k blocks to get more efficiency, but for raw
numbers, it's exactly what I was looking for. Now to add the second SSD
ZIL/L2ARC for a mirror. I may even splurge for one more to get a three-way
mirror. That would completely saturate the SCSI channel. Now I need a
bigger server...

Did I mention it was $1000 for the whole setup? Bah-ha-ha-ha.

Tracey


On Sat, Feb 13, 2010 at 11:51 PM, Tracey Bernath tbern...@ix.netcom.comwrote:

 OK, that was the magic incantation I was looking for:
 - changing the noprefetch option opened the floodgates to the L2ARC
 - changing the max queue depth relieved the wait time on the drives,
 although I may undo this again in the benchmarking since these drives all
 have NCQ (a sketch of the corresponding /etc/system entries is below)
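
 The thread doesn't quote the exact /etc/system lines; the following is a
 sketch using the tunable names commonly associated with these two changes
 (values illustrative, treat them as assumptions):

 * /etc/system
 * allow prefetched data to be stored in the L2ARC
 set zfs:l2arc_noprefetch = 0
 * reduce the per-vdev I/O queue depth (the old default was 35)
 set zfs:zfs_vdev_max_pending = 10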

 I went from all four disks of the array at 100%, doing about 170 read
 IOPS/25MB/s,
 to all four disks of the array at 0%, once hitting nearly 500 IOPS/65MB/s
 off the cache drive (@ only 50% load).
 This bodes well for adding a second mirrored cache drive to push for the
 1K IOPS.

 Now I am ready to insert the mirror for the ZIL and the CACHE, and we will
 be ready
 for some production benchmarking.


 BEFORE:
  device    r/s   w/s    kr/s   kw/s  wait  actv  svc_t  %w   %b  us sy wt id
  sd0     170.0   0.4  7684.7    0.0   0.0  35.0  205.3   0  100  11  80  0 82
  sd1     168.4   0.4  7680.2    0.0   0.0  34.6  205.1   0  100
  sd2     172.0   0.4  7761.7    0.0   0.0  35.0  202.9   0  100
  sd4     170.0   0.4  7727.1    0.0   0.0  35.0  205.3   0  100
  sd5       1.6   2.6   182.4  104.8   0.0   0.5  117.8   0   31



 AFTER:
                     extended device statistics
     r/s   w/s     kr/s   kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
     0.0   0.0      0.0    0.0   0.0   0.0     0.0     0.0   0   0  c0t0d0
     0.0   0.0      0.0    0.0   0.0   0.0     0.0     0.0   0   0  c0t0d1
     0.0   0.0      0.0    0.0   0.0   0.0     0.0     0.0   0   0  c0t0d2
     0.0   0.0      0.0    0.0   0.0   0.0     0.0     0.0   0   0  c0t0d3
   285.2   0.8  36236.2   14.4   0.0   0.5     0.0     1.8   1  37  c0t0d4


 And, keep  in mind this was on less than $1000 of hardware.

 Thanks for the pointers guys,
 Tracey



 On Sat, Feb 13, 2010 at 9:22 AM, Richard Elling 
 richard.ell...@gmail.comwrote:

 comment below...

 On Feb 12, 2010, at 2:25 PM, TMB wrote:
  I have a similar question, I put together a cheapo RAID with four 1TB WD
 Black (7200) SATAs, in a 3TB RAIDZ1, and I added a 64GB OCZ Vertex SSD, with
 slice 0 (5GB) for ZIL and the rest of the SSD  for cache:
  # zpool status dpool
   pool: dpool
  state: ONLINE
  scrub: none requested
  config:
 
          NAME          STATE     READ WRITE CKSUM
          dpool         ONLINE       0     0     0
            raidz1      ONLINE       0     0     0
              c0t0d0    ONLINE       0     0     0
              c0t0d1    ONLINE       0     0     0
              c0t0d2    ONLINE       0     0     0
              c0t0d3    ONLINE       0     0     0
          [b]logs
            c0t0d4s0    ONLINE       0     0     0[/b]
          [b]cache
            c0t0d4s1    ONLINE       0     0     0[/b]
          spares
            c0t0d6      AVAIL
            c0t0d7      AVAIL
 
               capacity     operations    bandwidth
 pool         used  avail   read  write   read  write
 ----------  -----  -----  -----  -----  -----  -----
 dpool       72.1G  3.55T    237     12  29.7M   597K
   raidz1    72.1G  3.55T    237      9  29.7M   469K
     c0t0d0      -      -    166      3  7.39M   157K
     c0t0d1      -      -    166      3  7.44M   157K
     c0t0d2      -      -    166      3  7.39M   157K
     c0t0d3      -      -    167      3  7.45M   157K
   c0t0d4s0      20K  4.97G     0      3      0   127K
 cache            -      -     -      -      -      -
   c0t0d4s1   17.6G  36.4G     3      1   249K   119K
 ----------  -----  -----  -----  -----  -----  -----
  I just don't seem to be getting any bang for the buck I should be.  This
 was taken while rebuilding an Oracle index, all files stored in this pool.
  The WD disks are at 100%, and nothing is coming from the cache.  The cache
 does have the entire DB cached (17.6G used), but hardly reads anything from
 it.  I also am not seeing the spike of data flowing into the ZIL either,
 although iostat shows there is just write traffic hitting the SSD:
 
                  extended device statistics               cpu
  device    r/s   w/s   kr/s   kw/s  wait  actv  svc_t  %w  %b  

Re: [zfs-discuss] [networking-discuss] Help needed on big transfers failure with e1000g

2010-02-15 Thread Arnaud Brand

Hi,

Sending to zfs-discuss too since this seems to be related to the zfs 
receive operation.
The following only holds true when the replication stream (ie the delta 
between snap1 and snap2) is more than about 800GB.


If I proceed with this command, the transfer fails after some variable 
amount of time, usually at around 790GB.
pfexec /sbin/zfs send -DRp -I tank/t...@snap1 tank/t...@snap2 | ssh -c 
arcfour nc-tanktsm pfexec /sbin/zfs recv -F -d tank


If, however, I proceed as follows, I have no problems:
pfexec /sbin/zfs send -DRp -I tank/t...@snap1 tank/t...@snap2 | 
/usr/gnu/bin/dd of=/tank/stream.zfs bs=1M

manually copy the stream.zfs to the remote host through scp and then
/usr/gnu/bin/dd if=/tank/stream.zfs bs=1M | pfexec /sbin/zfs recv -F -d tank

An additional thing to note (perhaps) is that there were no snapshots 
between snap1 and snap2.
I reproduced the behaviour with two different replication streams (one 
weighing 1.63TB and the other one 1.48TB).


I tried to add some buffering to the first procedure by piping (on both 
ends) through dd or mbuffer.
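
For reference, a buffered pipeline of the kind described might look like 
this (the port and buffer sizes are illustrative; snapshot names are as in 
the commands above):

# on the receiver (nc-tanktsm), listen and feed the stream into zfs recv
mbuffer -s 128k -m 1G -I 9090 | pfexec /sbin/zfs recv -F -d tank

# on the sender, pipe zfs send through mbuffer to the receiver
pfexec /sbin/zfs send -DRp -I tank/t...@snap1 tank/t...@snap2 | \
    mbuffer -s 128k -m 1G -O nc-tanktsm:9090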

While the global throughput slightly improves, the transfer still fails.
I also suspected problems with ssh and used netcat, with and without dd, 
with or without mbuffer.

I had even more throughput but it still failed.

I guess asking to transfer a 1TB+ replication stream live might just 
be too much, and I might be the only crazy fool to try to do it.
I've set up zfs-auto-snap and launch my replication more frequently, so 
as long as zfs-auto-snap doesn't remove intermediate snaps before the 
script can send them, this has kind of solved it for me; but still, there 
might be a problem somewhere that others might encounter.

Just thought I should report it.

- Arnaud

Le 09/02/2010 17:41, Arnaud Brand a écrit :

Sorry for the double-post, I forgot to answer your questions.

Arnaud

Le 09/02/2010 17:31, Arnaud Brand a écrit :

Hi James,

Sorry to bother you again, I think I found the problem but need 
confirmation/advice.


It appears that A wants to send through SSH more data to B that B can 
accept.
In fact, B is in the process of committing the zfs recv of a big 
snapshot, but A still has some snaps to send (zfs send -R).
After exactly 6 minutes, B sends a RST to A and both close their 
connections.

Sadly the receive operation goes away with the ssh session.

What looks strange is that B doesn't reduce its window; it just 
keeps ACKing the last byte SSH could eat and leaves the window at 64436.
I guess it's related to SSH channel multiplexing features: it can't 
reduce the window or other channels won't make progress either. Am I 
right here?


I did some ndd lookups in /dev/tcp and /dev/ip to see if I can find a 
timeout to raise, but found nothing matching the 6 minutes.
The docs related to tcp/ip tunables I found on sun's website didn't 
bring me much further, I found nothing that seemed applicable.


I agree my case is perhaps a bit overstretched, and I'm going to generate 
the replication stream into a local file and send it over by hand.
Later I shouldn't have snapshots that are that big, and I wish zfs 
recv wouldn't block for so long either, but still I'm asking myself 
whether the behaviour I'm observing is correct or whether it's the sign 
of some misconfiguration on my part (or perhaps I've once more forgotten 
how things work).


Sorry for my bad English, I hope you understood what I meant.
Just in case I've attached the tcpdump output of the ssh session 
starting at the very last packet that is accepted and acked by B.

A is 192.0.2.2 and B is 192.0.2.1.

If you could shed some light on this case I'd be very grateful, but I 
don't want to bother you.


Thanks,
Arnaud


Le 09/02/2010 14:39, James Carlson a écrit :

Arnaud Brand wrote:

Le 08/02/10 23:18, James Carlson a écrit :

Causes for RST include:

- peer application is intentionally setting the linger time to zero
  and issuing close(2), which results in TCP RST generation.

Might be possible, but I can't see why the receiving end would do 
that.

No idea, but a debugger on that side might be able to detect something.


- bugs in one or both peers (often related to TCP keepalive; key
  signature of such a problem is an apparent two-hour time limit).


That could be it, but I doubt it, since disconnections appeared randomly 
anywhere in the range of 10 minutes to 13 hours.
It should be noted that the node sending the RST keeps the connection
open (netstat -a shows it's still established).
To be honest, that puzzles me.

That sounds horrible.  There's no way a node that still has state for
the connection should be sending RST.  Normal procedure is to generate
RST when you do _not_ have state for the connection or (if you're
intentionally aborting the connection) to discard the state at the same
time you send RST.

That points to either a bug in the peer's TCP/IP implementation or one
of the causes that you've dismissed (particularly either a duplicate IP
address or a 

Re: [zfs-discuss] available space

2010-02-15 Thread Cindy Swearingen

Hi Charles,

What kind of pool is this?

The SIZE and AVAIL amounts will vary depending on the ZFS redundancy and 
whether the deflated or inflated amounts are displayed.


I attempted to explain the differences in the zpool list/zfs list
display, here:

http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#HZFSAdministrationQuestions

Why doesn't the space that is reported by the zpool list command and the 
zfs list command match?


The zpool list command output in this FAQ is based on the OpenSolaris/
Nevada builds and differs in the AVAIL column, which is now displayed as
ALLOC and FREE.

Thanks,

Cindy



On 02/13/10 10:28, Charles Hedrick wrote:

I have the following pool:

NAME   SIZE   USED  AVAILCAP  HEALTH  ALTROOT
OIRT  6.31T  3.72T  2.59T    58%  ONLINE  /

zfs list shows the following for a typical file system:

NAME                    USED  AVAIL  REFER  MOUNTPOINT
OIRT/sakai/production  1.40T  1.77T  1.40T  /OIRT/sakai/production

Why is available lower when shown by zfs than zpool?

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] available space

2010-02-15 Thread Charles Hedrick
Thanks. That makes sense. This is raidz2.
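
To make the raidz2 arithmetic concrete (the disk count is an assumption, 
since it isn't given in the thread): zpool list counts raw space including 
parity, while zfs list deflates it to usable space, roughly

    usable ~= raw * (N - 2) / N    for an N-disk raidz2 vdev

So if OIRT were, say, a 6-disk raidz2, the 2.59T of raw free space works out 
to about 2.59T * 4/6 ~= 1.73T usable, which is in the same range as the 
1.77T AVAIL that zfs list reports (the exact figure depends on the real disk 
count and on metadata overhead).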
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS Volume Destroy Halts I/O

2010-02-15 Thread Nick
I've seen threads like this around this ZFS forum, so forgive me if I'm 
covering old ground.  I currently have a ZFS configuration where I have 
individual drives presented to my Opensolaris machine and I'm using ZFS to do a 
RAIDZ-1 on the drives.  I have several filesystems and volumes on this storage 
pool.  When I do a zfs destroy on a volume (and possibly a filesystem, though 
I haven't tried that, yet), I run into two issues.  The first is that the 
destroy command takes several hours to complete - for example, destroying a 10 
GB volume on Friday took 5 hours.  The second is that, while this command is 
running, all I/O on the storage pool appears to be halted, or at least paused.  
There are a few symptoms of this...first, NFS clients accessing volumes on this 
server just hang and do not respond to commands.  Some clients hang 
indefinitely while others time out and mark the volume as stale.  On iSCSI 
clients, the clients most often time out and disconnect from the iSCSI 
volume - which is bad for my clients that are booting over those iSCSI volumes.

I'm using the latest Opensolaris dev build (132) and I have my storage pools 
and volumes upgraded to the latest available versions.  I am using 
deduplication on my ZFS volumes, set at the highest volume level, so I'm not 
sure if this has an impact.  Can anyone provide any hints as to whether this is 
a bug or expected behavior, what's causing it, and how I can solve or work 
around it?

Thanks,
Nick
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Volume Destroy Halts I/O

2010-02-15 Thread Bob Friesenhahn

On Mon, 15 Feb 2010, Nick wrote:


I'm using the latest Opensolaris dev build (132) and I have my 
storage pools and volumes upgraded to the latest available versions. 
I am using deduplication on my ZFS volumes, set at the highest 
volume level, so I'm not sure if this has an impact.  Can anyone 
provide any hints as to whether this is a bug or expected behavior, 
what's causing it, and how I can solve or work around it?


There is no doubt that it is both a bug and expected behavior and is 
related to deduplication being enabled.


Others here have reported similar problems.  The problem seems to be 
due to insufficient caching in the zfs ARC due to not enough RAM or 
L2ARC not being installed.  Some people achieved rapid success after 
installing a SSD as a L2ARC device.  Other people have reported 
success after moving their pool to a system with a lot more RAM 
installed.  Others have relied on patience.  A few have given up and 
considered their pool totally lost.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Plan for upgrading a ZFS based SAN

2010-02-15 Thread Tiernan OToole
Good morning all.

I am in the process of building my V1 SAN for media storage in house, and I
am already thinking of the V2 build...

Currently, there are 8 250GB HDDs and 3 500GB disks. The 8 250s are in a
RAIDZ2 array, and the 3 500s will be in RAIDZ1...

At the moment, the current case is quite full. I am looking at a 20-drive
hotswap case, which I plan to order soon. When the time comes and I start
upgrading the drives to larger drives, say 1TB drives, would it be easy to
migrate the contents of the RAIDZ2 array to the new array? I see mentions of
zfs send and zfs receive, but I have no idea if they would do the job...

Thanks in advance.

-- 
Tiernan O'Toole
blog.lotas-smartman.net
www.tiernanotoolephotography.com
www.the-hairy-one.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Plan for upgrading a ZFS based SAN

2010-02-15 Thread James Dickens
Yes, send and receive will do the job; see the zfs manpage for details.

James Dickens
http://uadmin.blogspot.com
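
A minimal sketch of that kind of migration (pool and snapshot names are 
hypothetical, and this assumes the new pool already exists and may be 
overwritten):

# snapshot everything on the old pool recursively
zfs snapshot -r oldpool@migrate

# replicate the whole hierarchy, including properties and snapshots
zfs send -R oldpool@migrate | zfs recv -F -d newpool

An incremental follow-up (zfs send -R -i migrate oldpool@migrate2) can then 
catch up any changes made while the first copy was running.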


On Mon, Feb 15, 2010 at 11:56 AM, Tiernan OToole lsmart...@gmail.comwrote:

 Good morning all.

 I am in the process of building my V1 SAN for media storage in house, and i
 am already thinkg ov the V2 build...

 Currently, there are 8 250Gb hdds and 3 500Gb disks. the 8 250s are in a
 RAIDZ2 array, and the 3 500s will be in RAIDZ1...

 At the moment, the current case is quite full. i am looking at a 20 drive
 hotswap case, which i plan to order soon. when the time comes, and i start
 upgrading the drives to larger drives, say 1Tb drives, would it be easy to
 migrate the contents of the RAIDZ2 array to the new Array? I see mentions of
 ZFS Send and ZFS recieve, but i have no idea if they would do the job...

 Thanks in advance.

 --
 Tiernan O'Toole
 blog.lotas-smartman.net
 www.tiernanotoolephotography.com
 www.the-hairy-one.com

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Volume Destroy Halts I/O

2010-02-15 Thread Nick
 
 There is no doubt that it is both a bug and expected
 behavior and is 
 related to deduplication being enabled.

Is it expected because it's a bug, or is it a bug that is not going to be fixed 
and so I should expect it?  Is there a bug/defect I can keep an eye on in one 
of the Opensolaris bug/defect interfaces that will help me figure out what's 
going on with it and when a solution is expected?

 
 Others here have reported similar problems.  The
 problem seems to be 
 due to insufficient caching in the zfs ARC due to not
 enough RAM or 
 L2ARC not being installed.  Some people achieved
 rapid success after 
 installing a SSD as a L2ARC device.  Other people
 have reported 
 success after moving their pool to a system with a
 lot more RAM 
 installed.  Others have relied on patience.  A few
 have given up and 
 considered their pool totally lost.
 

I have 8 GB of RAM on my system, which I consider to be a fairly decent amount 
of RAM for a storage system - maybe I'm naive about that, though.  8GB should 
provide a pretty decent amount of RAM available for caching, so I would think 
that a 10 GB volume would be able to go through RAM pretty quickly.  Also, 
there isn't much except ZFS and COMSTAR running on this box, so there isn't 
really anything else using the RAM.

I've already considered going and buying some SSDs for the L2ARC stuff, so 
maybe I'll pursue this path.  I was certainly patient with it - I didn't reboot 
the box because I could see slow progress on the destroy.  However, the other 
guys in my group who had stuff hanging off of this ZFS storage that had to wait 
5 hours for the storage to respond to their requests were not quite so 
understanding or patient.  This is a pretty big roadblock, IMHO, to this being 
a workable storage solution.  I certainly do understand that I'm using the dev 
releases, so it is under development and I should expect bugs - this one just 
seems pretty significant, like I would need to schedule maintenance windows to 
do volume management.

Thanks!
-Nick
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Shrink the slice used for zpool?

2010-02-15 Thread Yi Zhang
Hi,

I recently installed OpenSolaris 2009.06 on a 10GB primary partition on
my laptop. I noticed there wasn't any option for customizing the
slices inside the solaris partition. After installation, there was
only a single slice (0) occupying the entire partition. Now the
problem is that I need to set up a UFS slice for my development. Is
there a way to shrink slice 0 (backing storage for the zpool) and make
room for a new slice to be used for UFS?

I also tried to create UFS on another primary DOS partition, but
apparently only one Solaris partition is allowed on one disk. So that
failed...

Thanks!

Yi Zhang
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Volume Destroy Halts I/O

2010-02-15 Thread Eric Schrock

On 02/15/10 10:26, Nick wrote:

There is no doubt that it is both a bug and expected
behavior and is 
related to deduplication being enabled.


Is it expected because it's a bug, or is it a bug that is not going to be fixed 
and so I should expect it?  Is there a bug/defect I can keep an eye on in one 
of the Opensolaris bug/defect interfaces that will help me figure out what's 
going on with it and when a solution is expected?


See:

6922161 zio_ddt_free is single threaded with performance impact
6924824 destroying a dedup-enabled dataset bricks system

Both issues stem from the fact that free operations used to be in-memory 
only but with dedup enabled can result in synchronous I/O to disks in 
syncing context.


- Eric

--
Eric Schrock, Fishworkshttp://blogs.sun.com/eschrock
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Shrink the slice used for zpool?

2010-02-15 Thread Casper . Dik

Hi,

I recently installed OpenSoalris 200906 on a 10GB primary partition on
my laptop. I noticed there wasn't any option for customizing the
slices inside the solaris partition. After installation, there was
only a single slice (0) occupying the entire partition. Now the
problem is that I need to set up a UFS slice for my development. Is
there a way to shrink slice 0 (backing storage for the zpool) and make
room for a new slice to be used for UFS?

I also tried to create UFS on another primary DOS partition, but
apparently only one Solaris partition is allowed on one disk. So that
failed...


Can you create a zvol and use that for ufs?  Slow, but ...

Casper
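
A minimal sketch of that approach (pool name, volume name and size are 
hypothetical):

# create a zvol and put UFS on it
zfs create -V 8g rpool/ufsvol
newfs /dev/zvol/rdsk/rpool/ufsvol

# mount with forcedirectio to get directio-style behaviour at the UFS layer
mount -F ufs -o forcedirectio /dev/zvol/dsk/rpool/ufsvol /mnt

Note that the zvol underneath is still cached by the ARC unless its 
primarycache property is changed, which is related to the concern raised in 
the follow-ups.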

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Volume Destroy Halts I/O

2010-02-15 Thread Nick
Thanks!
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Shrink the slice used for zpool?

2010-02-15 Thread Yi Zhang
On Mon, Feb 15, 2010 at 1:48 PM,  casper@sun.com wrote:

Hi,

I recently installed OpenSoalris 200906 on a 10GB primary partition on
my laptop. I noticed there wasn't any option for customizing the
slices inside the solaris partition. After installation, there was
only a single slice (0) occupying the entire partition. Now the
problem is that I need to set up a UFS slice for my development. Is
there a way to shrink slice 0 (backing storage for the zpool) and make
room for a new slice to be used for UFS?

I also tried to create UFS on another primary DOS partition, but
apparently only one Solaris partition is allowed on one disk. So that
failed...


 Can you create a zvol and use that for ufs?  Slow, but ...

 Casper



Casper, thanks for the tip! Actually I'm not sure if this would work
for me. I wanted to use directio to bypass the file system cache when
reading/writing files. That's why I chose UFS instead of ZFS. Now if I
create UFS on top of zvol, I'm not sure if a call to directio() would
actually do its work...


Yi
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Shrink the slice used for zpool?

2010-02-15 Thread Richard Elling

On Feb 15, 2010, at 11:15 AM, Yi Zhang wrote:

 On Mon, Feb 15, 2010 at 1:48 PM,  casper@sun.com wrote:
 
 Hi,
 
 I recently installed OpenSoalris 200906 on a 10GB primary partition on
 my laptop. I noticed there wasn't any option for customizing the
 slices inside the solaris partition. After installation, there was
 only a single slice (0) occupying the entire partition. Now the
 problem is that I need to set up a UFS slice for my development. Is
 there a way to shrink slice 0 (backing storage for the zpool) and make
 room for a new slice to be used for UFS?
 
 I also tried to create UFS on another primary DOS partition, but
 apparently only one Solaris partition is allowed on one disk. So that
 failed...
 
 
 Can you create a zvol and use that for ufs?  Slow, but ...
 
 Casper
 
 
 
 Casper, thanks for the tip! Actually I'm not sure if this would work
 for me. I wanted to use directio to bypass the file system cache when
 reading/writing files. That's why I chose UFS instead of ZFS. Now if I
 create UFS on top of zvol, I'm not sure if a call to directio() would
 actually do its work...

zfs set primarycache=metadata filesystem
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Shrink the slice used for zpool?

2010-02-15 Thread Darren J Moffat

On 15/02/2010 19:15, Yi Zhang wrote:

Can you create a zvol and use that for ufs?  Slow, but ...

Casper




Casper, thanks for the tip! Actually I'm not sure if this would work
for me. I wanted to use directio to bypass the file system cache when
reading/writing files. That's why I chose UFS instead of ZFS. Now if I
create UFS on top of zvol, I'm not sure if a call to directio() would
actually do its work...


Why not just use ZFS and set the similar options on the ZFS dataset:
zfs set primarycache=metadata datasetname

That is a close approximation to the UFS feature of directio() for 
bypassing storing the data in the filesystem cache.


--
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Shrink the slice used for zpool?

2010-02-15 Thread Yi Zhang
Thank you, Darren and Richard. I think this gives what I wanted.

Yi

On Mon, Feb 15, 2010 at 3:13 PM, Darren J Moffat darren.mof...@sun.com wrote:
 On 15/02/2010 19:15, Yi Zhang wrote:

 Can you create a zvol and use that for ufs?  Slow, but ...

 Casper



 Casper, thanks for the tip! Actually I'm not sure if this would work
 for me. I wanted to use directio to bypass the file system cache when
 reading/writing files. That's why I chose UFS instead of ZFS. Now if I
 create UFS on top of zvol, I'm not sure if a call to directio() would
 actually do its work...

 Why not just use ZFS and set the similar options on the ZFS dataset:
        zfs set primarycache=metadata datasetname

 That is a close approximation to the UFS feature of directio() for bypassing
 storing the data in the filesystem cache.

 --
 Darren J Moffat

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs promote

2010-02-15 Thread Cindy Swearingen

Hi--

From your pre-promotion output, both fs1-patch and snap1 are 
referencing the same 16.4 GB, which makes sense. I don't see how fs1 
could be a clone of fs1-patch because it should be REFER'ing 16.4 GB 
as well in your pre-promotion zfs list.

If you snapshot, clone, and promote, then the clone is promoted to be 
the origin file system and so is now charged the USED space, but the 
post-promotion snapshot space should remain in the REFER column.

Try it yourself with a test file system, create a 100m file, snapshot,
and clone. Then promote the clone. You will see that the 100MB in REFER
space of the original snapshot becomes 100MB USED space in the newly
promoted clone. The original snapshot remains 100MB in REFER.

If the test snap/clone/promote works then we need to figure out what's
going on in your rgd3/fs* environment. From your output, it almost looks
like the space accounting for snap1 and fs1 are reversed in your post-
promotion zfs list. I don't know why.

I can't reproduce this in the Solaris 10 10/09 release.

Thanks,

Cindy

# zfs create tank/fs1
# mkfile 100m /tank/fs1/file1
# zfs list -r tank
NAME   USED  AVAIL  REFER  MOUNTPOINT
tank       100M  66.8G    23K  /tank
tank/fs1   100M  66.8G   100M  /tank/fs1
# zfs snapshot tank/f...@snap1
# zfs clone tank/f...@snap1 tank/fs1-patch
# zfs list -r tank
NAME USED  AVAIL  REFER  MOUNTPOINT
tank     100M  66.8G    23K  /tank
tank/fs1 100M  66.8G   100M  /tank/fs1
tank/f...@snap1  0  -   100M  -
tank/fs1-patch  0  66.8G   100M  /tank/fs1-patch
# zfs promote tank/fs1-patch
# zfs list -r tank
NAME   USED  AVAIL  REFER  MOUNTPOINT
tank       100M  66.8G    24K  /tank
tank/fs1  0  66.8G   100M  /tank/fs1
tank/fs1-patch 100M  66.8G   100M  /tank/fs1-patch
tank/fs1-pa...@snap1  0  -   100M  -




On 02/12/10 19:37, tester wrote:

Hello,

# /usr/sbin/zfs list -r rgd3
NAME                   USED  AVAIL  REFER  MOUNTPOINT
rgd3                  16.5G  23.4G    20K  /rgd3
rgd3/fs1                19K  23.4G    21K  /app/fs1
rgd3/fs1-patch        16.4G  23.4G  16.4G  /app/fs1-patch
rgd3/fs1-pa...@snap1  34.8M      -  16.4G  -

# /usr/sbin/zfs promote rgd3/fs1

snap is 16.4G in USED.

# /usr/sbin/zfs list -r rgd3
NAME              USED  AVAIL  REFER  MOUNTPOINT
rgd3             16.5G  23.4G    20K  /rgd3
rgd3/fs1         16.4G  23.4G    21K  /app/fs1
rgd3/f...@snap1  16.4G      -  16.4G  -
rgd3/fs1-patch   33.9M  23.4G  16.4G  /app/fs1-patch

5.10 Generic_141414-10

I tried to line up the numbers, but it does not work. Sorry for the format.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance

2010-02-15 Thread Peter Tribble
On Wed, Feb 10, 2010 at 10:06 PM, Brian E. Imhoff beimh...@hotmail.com wrote:
 I am in the proof-of-concept phase of building a large ZFS/Solaris based SAN 
 box, and am experiencing absolutely poor / unusable performance.
...

 From here, I discover the iscsi target on our Windows server 2008 R2 File 
 server, and see the disk is attached in Disk Management.  I initialize the 
 10TB disk fine, and begin to quick format it.  Here is where I begin to see 
 the poor performance issue.   The Quick Format took about 45 minutes. And 
 once the disk is fully mounted, I get maybe 2-5 MB/s average to this disk.

Did you actually make any progress on this?

I've seen exactly the same thing. Basically, terrible transfer rates
with Windows
and the server sitting there completely idle. We had support cases open with
both Sun and Microsoft, which got nowhere.

This seems to me to be more a case of working out where the impedance
mismatch is rather than a straightforward performance issue. In my case
I could saturate the network from a Solaris client, but only maybe 2% from
a Windows box. Yes, tweaking nagle got us to almost 3%. Still nowhere
near enough to make replacing our FC SAN with X4540s an attractive
proposition.

(And I see that most of the other replies simply asserted that your zfs
configuration was bad, without either having experienced this scenario
or worked out that the actual delivered performance was an order of
magnitude or two short of what even an admittedly sub-optimal configuration
ought to have delivered.)

-- 
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance

2010-02-15 Thread Bob Beverage
 On Wed, Feb 10, 2010 at 10:06 PM, Brian E. Imhoff
 beimh...@hotmail.com wrote:
 I've seen exactly the same thing. Basically, terrible
 transfer rates
 with Windows
 and the server sitting there completely idle.

I am also seeing this behaviour.  It started somewhere around snv111 but I am 
not sure exactly when.  I used to get 30-40MB/s transfers over cifs but at some 
point that dropped to roughly 7.5MB/s.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS slowness under domU high load

2010-02-15 Thread Daniel Carosone
On Mon, Feb 15, 2010 at 01:45:57PM +0100, Bogdan Ćulibrk wrote:
 One more thing regarding SSD, will be useful to throw in additional  
 SAS/SATA drive in to serve as L2ARC? I know SSD is the most logical  
 thing to put as L2ARC, but will conventional drive be of *any* help in  
 L2ARC?

Only in very particular circumstances. L2ARC is a latency play; for it
to win, you need the l2arc device(s) to be lower latency than the
primary storage, at least for reads. 

This usually translates to ssd for lower latency than disk, but can
also work if your data pool has unusually high latency - remote iscsi,
usb, some other odd mostly channel-related configurations. 

If the reason your disks have high latency is simply high load, l2arc
on another disk might, maybe, just work to redistribute some of that
load, but it will be a precarious balance, and probably need several
additional disks, perhaps roughly as many as currently in the pool.
By that stage, you're better off just reshaping the pool to use the
extra disks to best effect; mirrors vs raidz, more vdevs, etc.
Managing all that l2arc will take memory, too.

In your case, though, a couple of extra disks dedicated to staging
whatever transform you're doing to the backup files might be
worthwhile, if it will fit.  Even if they make the backup transform
itself slower (unlikely if it's predominantly sequential), removing
the contention impact from the primary service could be a net win. 

--
Dan.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD and ZFS

2010-02-15 Thread Daniel Carosone
On Sun, Feb 14, 2010 at 11:08:52PM -0600, Tracey Bernath wrote:
 Now, to add the second SSD ZIL/L2ARC for a mirror. 

Just be clear: mirror ZIL by all means, but don't mirror l2arc, just
add more devices and let them load-balance.   This is especially true
if you're sharing ssd writes with ZIL, as slices on the same devices.
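
In zpool terms that works out to something like the following (the second 
SSD, c0t0d5, is hypothetical; in this thread the first SSD's slices are 
c0t0d4s0 for the ZIL and c0t0d4s1 for the cache):

# mirror the existing log device by attaching the new SSD's ZIL slice
zpool attach dpool c0t0d4s0 c0t0d5s0

# add the new SSD's cache slice as an extra, unmirrored cache device
zpool add dpool cache c0t0d5s1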

 I may even splurge for one more to get a three way mirror.

With more devices, questions about selecting different devices
appropriate for each purpose come into play.

 Now I need a bigger server

See? :)

--
Dan.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Reading ZFS config for an extended period

2010-02-15 Thread taemun
Just thought I'd chime in for anyone who had read this - the import
operation completed this time, after 60 hours of disk grinding.

:)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Reading ZFS config for an extended period

2010-02-15 Thread Khyron
The DDT is stored within the pool, IIRC, but there is an RFE open to allow
you to store it on a separate top-level VDEV, like a SLOG.

The other thing I've noticed with all of the "destroyed a large dataset with
dedup enabled and it's taking forever to import/destroy/insert function here"
questions is that the process runs so so so much faster with 8+ GiB of RAM.
Almost to a man, everyone who reports these 3, 4, or more day destroys has
less than 8 GiB of RAM on the storage server.

Just some observations/thoughts.

On Mon, Feb 15, 2010 at 23:14, taemun tae...@gmail.com wrote:

 Just thought I'd chime in for anyone who had read this - the import
 operation completed this time, after 60 hours of disk grinding.

 :)
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




-- 
You can choose your friends, you can choose the deals. - Equity Private

If Linux is faster, it's a Solaris bug. - Phil Harman

Blog - http://whatderass.blogspot.com/
Twitter - @khyron4eva
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zfs questions wrt unused blocks

2010-02-15 Thread heinz zerbes


Gents,

We want to understand the mechanism of zfs a bit better.

Q: what is the design/algorithm of zfs in terms of reclaiming unused blocks?
Q: what criteria is there for zfs to start reclaiming blocks

Issue at hand is an LDOM or zone running in a virtual (thin-provisioned) 
disk on a NFS server and a zpool inside that vdisk.
This vdisk tends to grow in size even if the user writes and deletes a 
file again. Question is, whether this reclaiming of unused blocks can 
kick in earlier, so that the filesystem doesn't grow much more than what 
is actually allocated?


Thanks,
heinz


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Reading ZFS config for an extended period

2010-02-15 Thread Rob Logan

  RFE open to allow you to store [DDT] on a separate top level VDEV

hmm, add to this spare, log and cache vdevs; it's to the point of making
another pool and thinly provisioning volumes to maintain partitioning
flexibility.

taemun: hey, thanks for closing the loop!

Rob
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Reading ZFS config for an extended period

2010-02-15 Thread taemun
The system in question has 8GB of RAM. It never paged during the
import (unless I was asleep at that point, but anyway).

It ran for 52 hours, then started doing 47% kernel CPU usage. At this
stage, dtrace stopped responding, and so iopattern died, as did
iostat. It was also increasing RAM usage rapidly (15MB/minute).
After an hour of that, the CPU went up to 76%. An hour later, CPU
usage stopped. Hard drives were churning throughout all of this
(albeit at a rate that looks like each vdev is being controlled by a
single-threaded operation).

I'm guessing that if you don't have enough RAM, it gets stuck on the
use-lots-of-CPU phase, and just dies from too much paging. Of course,
I have absolutely nothing to back that up.

Personally, I think that if L2ARC devices were persistent, we would
already have the mechanism in place for storing the DDT as a separate
vdev. The problem is, there is nothing you can run at boot time to
populate the L2ARC, so the dedup writes are ridiculously slow until the
cache is warm. If the cache stayed warm, or there was an option to
forcibly warm up the cache, this could be somewhat alleviated.

Cheers
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs questions wrt unused blocks

2010-02-15 Thread Richard Elling
On Feb 15, 2010, at 8:43 PM, heinz zerbes wrote:
 
 Gents,
 
 We want to understand the mechanism of zfs a bit better.
 
 Q: what is the design/algorithm of zfs in terms of reclaiming unused blocks?
 Q: what criteria is there for zfs to start reclaiming blocks

The answer to these questions is too big for an email. Think of
ZFS as a very dynamic system with many different factors influencing
block allocation.

 Issue at hand is an LDOM or zone running in a virtual (thin-provisioned) disk 
 on a NFS server and a zpool inside that vdisk.
 This vdisk tends to grow in size even if the user writes and deletes a file 
 again. Question is, whether this reclaiming of unused blocks can kick in 
 earlier, so that the filesystem doesn't grow much more than what is actually 
 allocated?

ZFS is a COW file system, which partly explains what you are seeing.
Snapshots, deduplication, and the ZIL complicate the picture.
 -- richard


ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 15-17, 2010)



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Reading ZFS config for an extended period

2010-02-15 Thread Markus Kovero
 The other thing I've noticed with all of the "destroyed a large dataset with
 dedup enabled and it's taking forever to import/destroy/insert function here"
 questions is that the process runs so so so much faster with 8+ GiB of RAM.
 Almost to a man, everyone who reports these 3, 4, or more day destroys has
 less than 8 GiB of RAM on the storage server.

I've witnessed destroys that take several days with 24GB+ systems (dataset over
30TB). I guess it's just a matter of how large the dataset is vs. how much RAM
you have.

Yours
Markus Kovero
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance

2010-02-15 Thread Ragnar Sundblad

On 15 feb 2010, at 23.33, Bob Beverage wrote:

 On Wed, Feb 10, 2010 at 10:06 PM, Brian E. Imhoff
 beimh...@hotmail.com wrote:
 I've seen exactly the same thing. Basically, terrible
 transfer rates
 with Windows
 and the server sitting there completely idle.
 
 I am also seeing this behaviour.  It started somewhere around snv111 but I am 
 not sure exactly when.  I used to get 30-40MB/s transfers over cifs but at 
 some point that dropped to roughly 7.5MB/s.

Wasn't zvol changed a while ago from asynchronous to
synchronous? Could that be it?

I don't understand that change at all - of course a zvol, with or
without iSCSI to access it, should behave exactly as a (not broken)
disk, strictly obeying the protocol for write cache, cache flush, etc.
Having it entirely synchronous is in many cases almost as useless
as having it asynchronous.

Just as ZFS itself demands this from its disks, as it does, I believe
it should provide this itself when used as storage for others. To me it
seems that the zvol+iscsi functionality is not ready for production and
needs more work. If anyone has a better explanation, please share it
with me!

I guess a good slog could help a bit, especially if you have a bursty
write load.

/ragge

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss