Re: [zfs-discuss] what have you been buying for slog and l2arc?

2012-08-06 Thread Brandon High
On Mon, Aug 6, 2012 at 2:15 PM, Stefan Ring stefan...@gmail.com wrote:
 So you're saying that SSDs don't generally flush data to stable medium
 when instructed to? So data written before an fsync is not guaranteed
 to be seen after a power-down?

It depends on the model. Consumer models are less likely to
immediately flush. My understanding is that this is done in part to do
some write coalescing and reduce the number of P/E cycles. Enterprise
models should either flush, or contain a super capacitor that provides
enough power for the drive to complete writing any data in its buffer.

 If that -- ignoring cache flush requests -- is the whole reason why
 SSDs are so fast, I'm glad I haven't got one yet.

They're fast for random reads and writes because they don't have seek
latency. They're fast for sequential IO because they aren't limited by
spindle speed.

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-07-30 Thread Brandon High
On Mon, Jul 30, 2012 at 7:11 AM, GREGG WONDERLY gregg...@gmail.com wrote:
 I thought I understood that copies would not be on the same disk, I guess I 
 need to go read up on this again.

ZFS attempts to put copies on separate devices, but there's no guarantee.
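For reference, it's just a per-dataset property; a minimal example with
made-up pool and dataset names:

# zfs set copies=2 tank/important
# zfs get copies tank/important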

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Persistent errors?

2012-06-22 Thread Brandon High
On Mon, Jun 18, 2012 at 3:55 PM, sol a...@yahoo.com wrote:
 It seems as though every time I scrub my mirror I get a few megabytes of
 checksum errors on one disk (luckily corrected by the other). Is there some
 way of tracking down a problem which might be persistent?

Check the output of 'fmdump -eV', it should have some (rather
extensive) information.
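If it helps, this is roughly what I'd look at (pool and device names
are made up):

# zpool status -v tank
# fmdump -eV | less
# iostat -En c7t2d0

The fmdump records should point at a specific vdev, and iostat -En will
show whether that drive is also logging transport or media errors.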

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Migration of a Thumper to bigger HDDs

2012-05-24 Thread Brandon High
On Thu, May 17, 2012 at 2:50 PM, Jim Klimov jimkli...@cos.ru wrote:
 New question: if the snv_117 does see the 3Tb disks well,
 the matter of upgrading the OS becomes not so urgent - we
 might prefer to delay that until the next stable release
 of OpenIndiana or so.

There were some pretty major fixes and new features added between
snv_117 and snv_134 (the last OpenSolaris release). It might be worth
updating to snv_134 at the very least.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] checking/fixing busy locks for zfs send/receive

2012-03-16 Thread Brandon High
On Fri, Mar 16, 2012 at 2:35 PM, Philip Brown p...@bolthole.com wrote:
 if there isn't a process visible doing this via ps, I'm wondering how
 one might check if a zfs filesystem or snapshot is rendered busy in
 this way, interfering with an unmount or destroy?

 I'm also wondering if this sort of thing can mean interference between
 some combination of multiple send/receives at the same time, on the
 same filesystem?

Look at 'zfs hold', 'zfs holds', and 'zfs release'. Sends and receives
will place holds on snapshots to prevent them from being changed.
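Roughly (snapshot and tag names here are made up):

# zfs holds tank/fs@snap
# zfs release <tag> tank/fs@snap

The first lists any holds and their tags; the second drops one so the
destroy or unmount can proceed.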

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Compatibility of Hitachi Deskstar 7K3000 HDS723030ALA640 with ZFS

2012-03-06 Thread Brandon High
On Tue, Mar 6, 2012 at 2:40 AM, Koopmann, Jan-Peter
jan-pe...@koopmann.eu wrote:
 Do you or anyone else have experience with the 3TB 5K3000 drives
 (namely HDS5C3030ALA630)? I am thinking of replacing my current 4*1TB drives
 with 4*3TB drives (home server). Any issues with TLER or alike?

I have been using 8 x 3TB 5k3000 in a raidz2 for about a year without issue.

The Deskstar 3TB comes off the same production line as the Ultrastar
5k3000. I would avoid the 2TB and smaller 5k3000 drives - they come off
a separate production line.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Compatibility of Hitachi Deskstar 7K3000 HDS723030ALA640 with ZFS

2012-03-05 Thread Brandon High
On Mon, Mar 5, 2012 at 9:52 AM, luis Johnstone l...@luisjohnstone.com wrote:
 As far as I can tell, the Hitachi Deskstar 7K3000 (HDS723030ALA640) uses
 512B sectors and so I presume does not suffer from such issues (because it
 doesn't lie about the physical layout of sectors on-platter)

Both the 7K3000 and 5K3000 drives have 512B physical sectors.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Server upgrade

2012-02-15 Thread Brandon High
On Wed, Feb 15, 2012 at 9:16 AM, David Dyer-Bennet d...@dd-b.net wrote:
 Is there an upgrade path from (I think I'm running Solaris Express) to
 something modern?  (That could be an Oracle distribution, or the free

There *was* an upgrade path from snv_134 to snv_151a (Solaris 11
Express) but I don't know if Oracle still supports it. There was an
intermediate step or two along the way (snv_134b I think?) to move
from OpenSolaris to Oracle Solaris.

As others mentioned, you could jump to OpenIndiana from your current
version. You may not be able to move between OI and S11 in the future,
so it's a somewhat important decision.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] grrr, How to get rid of mis-touched file named `-c'

2011-11-26 Thread Brandon High
On Wed, Nov 23, 2011 at 11:43 AM, Harry Putnam rea...@newsguy.com wrote:
 OK, I'm out of escapes.  or other tricks... other than using emacs but
 I haven't installed emacs as yet.

 I can just ignore them of course, until such time as I do get emacs
 installed, but by now I just want to know how it might be done from a
 shell prompt.

rm ./-c ./-O ./-k
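The usual end-of-options trick works too, if you'd rather not type the
./ prefix:

rm -- -c -O -k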

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Replacement for X25-E

2011-09-22 Thread Brandon High
On Tue, Sep 20, 2011 at 12:21 AM, Markus Kovero markus.kov...@nebula.fi wrote:
 Hi, I was wondering do you guys have any recommendations as replacement for
 Intel X25-E as it is being EOL’d? Mainly as for log device.

The Intel 311 seems like a good fit. It's a 20GB SLC device intended
to act as a cache device with the Z68 chipset.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Deskstars and CCTL (aka TLER)

2011-09-22 Thread Brandon High
On Wed, Sep 7, 2011 at 7:40 PM, Daniel Carosone d...@geek.com.au wrote:
 Looks like another positive for these drives over the competition.
 The same appears to be the case for the 5k3000's as well (page 96 in
 that document).

Be careful with the smaller 5k3000 drives. The 1TB and 2TB drives are
not manufactured on the same line as the Ultrastar and seem to have
lower reliability. Only the 3TB 5k3000 shares specs with the Ultrastar
5k3000.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Replacement for X25-E

2011-09-22 Thread Brandon High
On Thu, Sep 22, 2011 at 12:53 PM, Ray Van Dolson rvandol...@esri.com wrote:
 It seems to perform similarly to the X-25E as well (3300 IOPS for
 random writes).  Perhaps the drive can be overprovisioned as well?

 My impression was that Intel was classifying the 3xx series as
 non-Enterprise however.  Even with the SLC.

I don't think the 311 has any over-provisioning (other than the ~7%
from the GiB-to-GB conversion). I believe it is an X25-E with only 5
channels populated. The upcoming enterprise models are MLC-based and
have greater over-provisioning AFAIK.

The 20GB 311 only costs ~ $100 though. The 100GB Intel 710 costs ~ $650.

The 311 is a good choice for home or budget users, and it seems that
the 710 is much bigger than it needs to be for slog devices.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Deskstars and CCTL (aka TLER)

2011-09-07 Thread Brandon High
On Wed, Sep 7, 2011 at 2:20 AM, Roy Sigurd Karlsbakk r...@karlsbakk.net wrote:
 Does anyone know if this is possible from OI/Solaris, or if this needs to be 
 done on driver level?

You should be able to do it via smartctl. The setting does not persist
through power cycles, so you'll want to add it to a startup script.
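Something like this should work, assuming your smartctl build and
controller cooperate and the drive supports SCT ERC (device path is
made up; 70 means 7.0 seconds):

# smartctl -l scterc,70,70 /dev/rdsk/c5t0d0
# smartctl -l scterc /dev/rdsk/c5t0d0

The second command just reads back the current setting.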

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS raidz on top of hardware raid0

2011-08-26 Thread Brandon High
On Fri, Aug 12, 2011 at 6:34 PM, Tom Tang thomps...@supermicro.com wrote:
 Suppose I want to build a 100-drive storage system, wondering if there is any 
 disadvantages for me to setup 20 arrays of HW RAID0 (5 drives each), then 
 setup ZFS file system on these 20 virtual drives and configure them as RAIDZ?

A 20-device wide raidz is a bad idea. Making those devices from
stripes just compounds the issue.

The biggest problem is that resilvering would be a nightmare, and
you're practically guaranteed to have additional failures or read
errors while degraded.

You would achieve better performance, error detection and recovery by
using several top-level raidz. 20 x 5-disk raidz would give you very
good read and write performance with decent resilver times and 20%
overhead for redundancy. 10 x 10-disk raidz2 would give more
protection, but a little less performance, and higher resilver times.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Intel 320 as ZIL?

2011-08-15 Thread Brandon High
On Thu, Aug 11, 2011 at 1:00 PM, Ray Van Dolson rvandol...@esri.com wrote:
 Are any of you using the Intel 320 as ZIL?  It's MLC based, but I
 understand its wear and performance characteristics can be bumped up
 significantly by increasing the overprovisioning to 20% (dropping
 usable capacity to 80%).

Intel recently added the 311, a small SLC-based drive for use as a
temp cache with their Z68 platform. It's limited to 20GB, but it might
be a better fit for use as a ZIL than the 320.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Disk IDs and DD

2011-08-09 Thread Brandon High
On Tue, Aug 9, 2011 at 8:20 AM, Paul Kraus p...@kraus-haus.org wrote:
    Nothing to worry about here. Controller IDs (cn) are assigned
 based on the order the kernel probes the hardware. On the SPARC
 systems you can usually change this in the firmware (OBP), but they
 really don't _mean_ anything (other than the kernel found c8 before it
 found c9).

If you're really bothered by the device names, you can rebuild the
device map. There's no reason to do it unless you've had to replace
hardware, etc.

The steps are similar to these:
http://spiralbound.net/blog/2005/12/21/rebuilding-the-solaris-device-tree

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Exapnd ZFS storage.

2011-08-03 Thread Brandon High
On Wed, Aug 3, 2011 at 3:02 AM, Nix mithun.gaik...@gmail.com wrote:
 I have 4 disk with 1 TB of disk and I want to expand the zfs pool size.

 I have 2 more disk with 1 TB of size.

 Is it possible to expand the current RAIDz array with new disk?

You can't add the new drives to your current vdev. You can create
another vdev to add to your pool though.

If you're adding another vdev, it should have the same geometry as
your current (ie: 4 drives). The zpool command will complain if you
try to add a vdev with different geometry or redundancy, though you
can force it with -f.
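If you do add a matching vdev, the shape would be roughly this (device
names are made up):

# zpool add tank raidz c5t0d0 c5t1d0 c5t2d0 c5t3d0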

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Fragmentation issue - examining the ZIL

2011-08-03 Thread Brandon High
On Mon, Aug 1, 2011 at 4:27 PM, Daniel Carosone d...@geek.com.au wrote:
 The other thing that can cause a storm of tiny IOs is dedup, and this
 effect can last long after space has been freed and/or dedup turned
 off, until all the blocks corresponding to DDT entries are rewritten.
 I wonder if this was involved here.

Using dedup on a pool that houses an Oracle DB is Doing It Wrong in so
many ways...

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Fragmentation issue - examining the ZIL

2011-08-01 Thread Brandon High
On Mon, Aug 1, 2011 at 2:16 PM, Neil Perrin neil.per...@oracle.com wrote:
 In general the blog's conclusion is correct. When file systems get full
 there is fragmentation (happens to all file systems) and for ZFS the pool
 uses gang blocks of smaller blocks when there are insufficient large blocks.

The blog doesn't mention how full the pool was. It's pretty well
documented that performance takes a nosedive at a certain point.

A slow scrub is actually not related to the problems in the blog post,
since there aren't a lot of writes during (or at least caused by) a
scrub. Fragmentation is a real issue with pools that are (or have
been) very full. The data gets written out in fragments and has to be
read back in the same order.

If the mythical bp_rewrite code ever shows up, it will be possible to
defrag a pool. But not yet.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD vs hybrid drive - any advice?

2011-07-26 Thread Brandon High
On Tue, Jul 26, 2011 at 7:51 AM, David Dyer-Bennet d...@dd-b.net wrote:

 Processing the request just means flagging the blocks, though, right?
 And the actual benefits only acrue if the garbage collection / block
 reshuffling background tasks get a chance to run?


I think that's right. TRIM just gives hints to the garbage collector that
sectors are no longer in use. When the GC runs, it can more easily find
flash blocks that aren't in use, or combine several mostly-empty blocks
and erase or otherwise free them for reuse later.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] recover zpool with a new installation

2011-07-26 Thread Brandon High
On Tue, Jul 26, 2011 at 1:14 PM, Cindy Swearingen 
cindy.swearin...@oracle.com wrote:

 Yes, you can reinstall the OS on another disk and as long as the
 OS install doesn't touch the other pool's disks, your
 previous non-root pool should be intact. After the install
 is complete, just import the pool.


You can also use the Live CD or Live USB to access your pool or possibly fix
your existing installation.

You will have to force the zpool import with either a reinstall or a Live
boot.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale performance query

2011-07-25 Thread Brandon High
On Sun, Jul 24, 2011 at 11:34 PM, Phil Harrison philha...@gmail.com wrote:

 What kind of performance would you expect from this setup? I know we can
 multiply the base IOPS by 24 but what about max sequential read/write?


You should have a theoretical max close to 144x single-disk throughput. Each
raidz3 has 6 data drives which can be read from simultaneously, multiplied
by your 24 vdevs. Of course, you'll hit your controllers' limits well before
that.

Even with a controller per JBOD, you'll be limited by the SAS connection.
The 7k3000 has throughput from 115 - 150 MB/s, meaning each of your JBODs
will be capable of 5.2 GB/sec - 6.8 GB/sec, roughly 10 times the bandwidth
of a single SAS 6g connection. Use multipathing if you can to increase the
bandwidth to each JBOD.

Depending on the types of access that clients are performing, your cache
devices may not be any help. If the data is read multiple times by multiple
clients, then you'll see some benefit. If it's only being read infrequently
or by one client, it probably won't help much at all. That said, if your
access is mostly sequential then random access latency shouldn't affect you
too much, and you will still have more bandwidth from your main storage
pools than from the cache devices.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Replacing failed drive

2011-07-22 Thread Brandon High
On Fri, Jul 22, 2011 at 1:12 PM, Chris Dunbar - Earthside, LLC
cdun...@earthside.net wrote:
 I have physically replaced the drive, but I have not partitioned it yet. I
 know there is a command to copy the layout from one disk to another and that
 has worked well for me in the past. I just have to find the command again.
 Once that is done, do I need to detach the spare before I run the replace
 command or does running the replace command automatically bump the spare out
 of service and put it back to being just a spare?

Since it isn't the rpool, you shouldn't have to partition the replacement drive.

Since you've physically replaced the drive, you should just have to do:
# zpool replace tank c10t0d0

The pool should resilver, and I think the spare should automatically
detach. If not,
# zpool remove tank c10t6d0
should take care of it.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD vs hybrid drive - any advice?

2011-07-21 Thread Brandon High
On Thu, Jul 21, 2011 at 4:08 PM, Gordon Ross gordon.w.r...@gmail.com wrote:
 And then for about $400 one can get an 250GB SSD, such as:
  Crucial M4 CT256M4SSD2 2.5 256GB SATA III MLC Internal Solid State
 Drive (SSD)
  http://www.newegg.com/Product/Product.aspx?Item=N82E16820148443

 Anyone have experience with either one?  (good or bad)

The hybrid drive might accelerate some operations. No guarantees,
though. It's about as fast as a WD Velociraptor in some operations,
and the same as the regular Seagate 500gb in others. There is a decent
review of it at Anandtech.

The M4 is pretty decent, though the Vertex 3 and other Sandforce
2000-based drives beat it in benchmarks. Honestly though, you'll
probably be very happy with any recent SSD, eg: C300, M4, Intel 320,
Intel 510, Sandforce 1200-based (Vertex 2, Phoenix Pro, etc),
Sandforce 2200-based (Vertex 3, Corsair Force GT, Patriot Wildfire,
etc).

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] latest zpool version in solaris 11 express

2011-07-20 Thread Brandon High
On Mon, Jul 18, 2011 at 6:21 AM, Edward Ned Harvey
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
 Kidding aside, for anyone finding this thread at a later time, here's the
 answer.  It sounds unnecessarily complex at first, but then I went through
 it ... Only took like a minute or two.  It was exceptionally easy in fact.
        https://pkg-register.oracle.com

Do you need a support contract in order to access the certificate
application? I'm getting the following error when I try to get a cert:
"There has been a problem with contacting the entitlement server. You
will only be able to issue new certificates for public products.
Please try again later"

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Zil on multiple usb keys

2011-07-18 Thread Brandon High
On Sun, Jul 17, 2011 at 12:13 PM, Edward Ned Harvey
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
 Actually, you can't do that.  You can't make a vdev from other vdev's, and 
 when it comes to striping and mirroring your only choice is to do it the 
 right way.

 If you were REALLY trying to go out of your way to do it wrong somehow, I 
 suppose you could probably make a zvol from a stripe, and then export it to 
 yourself via iscsi, repeat with another zvol, and then mirror the two iscsi 
 targets.   ;-)  You might even be able to do the same crazy thing with simply 
 zvol's and no iscsi...  But either way you'd really be going out of your way 
 to create a problem.   ;-)

The right way to do it, um, incorrectly is to create a striped device
using SVM, and use that as a vdev for your pool.

So yes, you could create two 800GB stripes, and use them to create a
ZFS mirror. But it would be a really bad idea.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Replacement disks for Sun X4500

2011-07-15 Thread Brandon High
On Wed, Jul 6, 2011 at 10:12 PM, X4 User b7075...@klzlk.com wrote:
 I am bumping this thread because I too have the same question ... can I put 
 modern 3TB disks (hitachi deskstars) into an old x4500 ?

I have 8 x 3TB drives (Deskstar 5k3000) attached to a Supermicro
AOC-SAT2-MV8 and it works fine. This card uses the same Marvell
controller as the x4500.

Performance is fine if not slightly better than the WD10EADS drives
that I replaced. Of course, the pool was about 92% full with the
smaller drives ...

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pure SSD Pool

2011-07-12 Thread Brandon High
On Tue, Jul 12, 2011 at 7:41 AM, Eric Sproul espr...@omniti.com wrote:
 But that's exactly the problem-- ZFS being copy-on-write will
 eventually have written to all of the available LBA addresses on the
 drive, regardless of how much live data exists.  It's the rate of
 change, in other words, rather than the absolute amount that gets us
 into trouble with SSDs.  The SSD has no way of knowing what blocks

Most enterprise SSDs use something like 30% for spare area. So a
drive with 128GiB (base 2) of flash will have 100GB (base 10) of
available storage. A consumer-level drive will have ~ 6% spare, or
128GiB of flash and 128GB of available storage. Some drives have 120GB
available, but still have 128 GiB of flash and therefore slightly more
spare area. Controllers like the Sandforce that do some dedup can give
you even more effective spare area, depending on the type of data.

When the OS starts reusing LBAs, the drive will re-map them into new
flash blocks in the spare area and may perform garbage collection on
the now partially used blocks. The effectiveness of this depends on
how quickly the system is writing and how full the drive is.

I failed to mention earlier that ZFS's write aggregation is also
helpful when used with flash drives since it can help to ensure that a
whole flash block is written at once. Increasing the ashift value to
4k when the pool is created may also help.

 Now, others have hinted that certain controllers are better than
 others in the absence of TRIM, but I don't see how GC could know what
 blocks are available to be erased without information from the OS.

The changed LBAs are remapped rather than overwritten in place. The
drive knows which LBAs in a flash block have been re-mapped, and can
do garbage collection when the right criteria are met.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pure SSD Pool

2011-07-12 Thread Brandon High
On Tue, Jul 12, 2011 at 12:14 PM, Eric Sproul espr...@omniti.com wrote:
 I see, thanks for that explanation.  So finding drives that keep more
 space in reserve is key to getting consistent performance under ZFS.

More spare area might give you more performance, but the big
difference is the lifetime of the device. A device with more spare
area can handle more writes.

Within a given capacity class (eg: 50-64 GB drives, all built on 64 GiB
of flash), the drive with more spare area will last longer but may not
offer a performance benefit. Higher-capacity drives will offer better
performance because they have more flash channels to write to, and they
should last longer because while the spare area is the same percentage
of total capacity, it's numerically larger.

A consumer 240GB drive (256GiB flash) will have 27GiB spare area. An
enterprise 50GB (64GiB flash) drive will have 16 GiB spare area, or
about 25% of the total capacity. Even though the consumer drive only
sets aside ~ 10% for spare, it's so much larger that it will last
longer at any given rate of writing. If you were to completely fill
and re-fill each drive, the consumer drive will fail earlier, but
you'd have to write nearly 5x as much data to fill it even once.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pure SSD Pool

2011-07-11 Thread Brandon High
On Mon, Jul 11, 2011 at 7:03 AM, Eric Sproul espr...@omniti.com wrote:
 Interesting-- what is the suspected impact of not having TRIM support?

There shouldn't be much, since zfs isn't changing data in place. Any
drive with reasonable garbage collection (which is pretty much
everything these days) should be fine until the volume gets very full.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Cannot format 2.5TB ext disk (EFI)

2011-06-24 Thread Brandon High
On Thu, Jun 23, 2011 at 1:20 PM, Richard Elling
richard.ell...@gmail.com wrote:
 2TB limit for 32-bit Solaris. If you hit this, then you'll find a lot of 
 complaints at boot.
 By default, an Ultra-24 should boot 64-bit. Dunno about the HBA, though...

I think the limit is 1TB for 32-bit. I've tried to use 2TB drives on
an Atom N270-based board and they were not recognized, but they worked
fine under FreeBSD.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] JBOD recommendation for ZFS usage

2011-05-30 Thread Brandon High
On Mon, May 30, 2011 at 6:16 PM, Jim Klimov jimkli...@cos.ru wrote:
 Also some articles stated that at one time there were
 single-port SAS drives, so there are at least two SAS
 connectors after all ;)

Nope, only one mechanical connector. A dual port cable can be used
with single- or dual-ported SAS device, or with SATA drives. A single
port cable can be used with a single- or dual-ported SAS device
(although it will only use one port) or with a SATA drive. A SATA
cable can be used with a SATA device.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] offline dedup

2011-05-26 Thread Brandon High
On Thu, May 26, 2011 at 8:37 AM, Edward Ned Harvey
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
 Question:  Is it possible, or can it easily become possible, to periodically
 dedup a pool instead of keeping dedup running all the time?  It is easy to

I think it's been discussed before, and the conclusion is that it
would require bp_rewrite.

Offline (or deferred) dedup certainly seems more attractive given the
current real-time performance.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] optimal layout for 8x 1 TByte SATA (consumer)

2011-05-26 Thread Brandon High
On Thu, May 26, 2011 at 9:34 AM, Eugen Leitl eu...@leitl.org wrote:
 How bad would raidz2 do on mostly sequential writes and reads
 (Athlon64 single-core, 4 GByte RAM, FreeBSD 8.2)?

I was using a similar but slightly higher spec setup (quad-core CPU and
8 GB RAM) at home and didn't have any problems with an 8-drive raidz2,
though my usage is fairly light. The system is more than fast enough
to saturate gigabit ethernet for sequential reads and writes. My
drives were WD10EADS Green drives.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS, Oracle and Nexenta

2011-05-24 Thread Brandon High
On Tue, May 24, 2011 at 12:41 PM, Richard Elling
richard.ell...@gmail.com wrote:
 There are many ZFS implementations, each evolving as the contributors desire.
 Diversity and innovation is a good thing.

... unless Oracle's zpool v30 is different than Nexenta's v30.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS, Oracle and Nexenta

2011-05-24 Thread Brandon High
On Tue, May 24, 2011 at 3:17 PM, Peter Jeremy
peter.jer...@alcatel-lucent.com wrote:
 I believe the various OSS projects that use ZFS have formed a working
 group to co-ordinate ZFS amongst themselves.  I don't know if Oracle
 was invited to join (though given the way Oracle has behaved in all

Richard would probably know for certain.

There will probably be a fork at some point to an OSS ZFS and an
Oracle ZFS. Hopefully neither side will actively try to break
compatibility.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Monitoring disk seeks

2011-05-19 Thread Brandon High
On Thu, May 19, 2011 at 5:35 AM, Sašo Kiselkov skiselkov...@gmail.com wrote:
 I'd like to ask whether there is a way to monitor disk seeks. I have an
 application where many concurrent readers (50) sequentially read a
 large dataset (10T) at a fairly low speed (8-10 Mbit/s). I can monitor
 read/write ops using iostat, but that doesn't tell me how contiguous the
 data is, i.e. when iostat reports 500 read ops, does that translate to
 500 seeks + 1 read per seek, or 50 seeks + 10 reads, etc? Thanks!

You can sort of do this with a DTrace script.

Something like: (forgive my crappy script, I've only poked at DTrace a
few times)

#pragma D option quiet

io:::done
/ args[1]->dev_name == "sd" && args[1]->dev_instance < 11 /
{
  printf("%d.%03d,%s,%i,%s,%i\n",
 (timestamp / 1000000000),
 (timestamp / 1000000) % 1000,
 args[1]->dev_statname,
 args[0]->b_lblkno,
 (args[0]->b_flags & B_WRITE ? "W" : "R"),
 args[0]->b_bcount
);
}

For every completed IO, this should give you the timestamp, device
name, start LBA, Read or Write and length of the IO.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solaris vs FreeBSD question

2011-05-18 Thread Brandon High
On Wed, May 18, 2011 at 5:47 AM, Paul Kraus p...@kraus-haus.org wrote:
 P.S. If anyone here has a suggestion as to how to get Solaris to load
 I would love to hear it. I even tried disabling multi-cores (which
 makes the CPUs look like dual core instead of quad) with no change. I
 have not been able to get serial console redirect to work so I do not
 have a good log of the failures.

Have you checked your system in the HCL device tool at
http://www.sun.com/bigadmin/hcl/hcts/device_detect.jsp ? It should be
able to tell you which device is causing the problem. If I remember
correctly, you can feed it the output of 'lspci -vv -n'.

You may have to disable some on-board devices to get through the
installer, but I couldn't begin to guess which.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Reboots when importing old rpool

2011-05-17 Thread Brandon High
On Tue, May 17, 2011 at 11:10 AM, Hung-ShengTsao (Lao Tsao) Ph.D.
laot...@gmail.com wrote:

 may be do
 zpool import  -R /a rpool

'zpool import -N' may work as well.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Brandon High
On Sun, May 15, 2011 at 10:14 PM, Richard Elling
richard.ell...@gmail.com wrote:
 On May 15, 2011, at 10:18 AM, Jim Klimov jimkli...@cos.ru wrote:
 In case of RAIDZ2 this recommendation leads to vdevs sized 6 (4+2), 10 (8+2) 
 or 18 (16+2) disks - the latter being mentioned in the original post.

 A similar theory was disproved back in 2006 or 2007. I'd be very surprised if
 there was a reliable way to predict the actual use patterns in advance. 
 Features
 like compression and I/O coalescing improve performance, but make the old
 rules of thumb even more obsolete.

I thought that having data disks that were a power of two was still
recommended, due to the way that ZFS splits records/blocks in a raidz
vdev. Or are you responding to some other point?

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Brandon High
On Sat, May 14, 2011 at 11:20 PM, John Doe dav3...@gmail.com wrote:
 171   Hitachi 7K3000 3TB
 I'd go for the more environmentally friendly Ultrastar 5K3000 version - with 
 that many drives you wont mind the slower rotation but WILL notice a 
 difference in power and cooling cost

A word of caution - The Hitachi Deskstar 5K3000 drives in 1TB and 2TB
are different than the 3TB.

The 1TB and 2TB are manufactured in China, and have a very high
failure and DOA rate according to Newegg.

The 3TB drives come off the same production line as the Ultrastar
5K3000 in Thailand and may be more reliable.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Brandon High
On Mon, May 16, 2011 at 8:33 AM, Richard Elling
richard.ell...@gmail.com wrote:
 As a rule of thumb, the resilvering disk is expected to max out at around
 80 IOPS for 7,200 rpm disks. If you see less than 80 IOPS, then suspect
 the throttles or broken data path.

My system was doing far less than 80 IOPS during resilver when I
recently upgraded the drives. The older and newer drives were both 5k
RPM drives (WD10EADS and Hitachi 5K3000 3TB) so I don't expect it to
be super fast.

The worst resilver was 50 hours, the best was about 20 hours. This was
just my home server, which is lightly used. The clients (2-3 CIFS
clients, 3 mostly idle VBox instances using raw zvols, and 2-3 NFS
clients) are mostly idle and don't do a lot of writes.

Adjusting zfs_resilver_delay and zfs_resilver_min_time_ms sped things
up a bit, which suggests that the default values may be too
conservative for some environments.
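For anyone curious, those tunables can be poked at runtime with mdb;
the values below are just examples and the change doesn't survive a
reboot:

# echo zfs_resilver_delay/W0t0 | mdb -kw
# echo zfs_resilver_min_time_ms/W0t3000 | mdb -kw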

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Still no way to recover a corrupted pool

2011-05-16 Thread Brandon High
On Mon, May 16, 2011 at 1:55 PM, Freddie Cash fjwc...@gmail.com wrote:
 Would not import in Solaris 11 Express.  :(  Could not even find any
 pools to import.  Even when using zpool import -d /dev/dsk or any
 other import commands.  Most likely due to using a FreeBSD-specific
 method of labelling the disks.

I think someone solved this before by creating a directory and making
symlinks to the correct partition/slices on each disk. Then you can
use 'zpool import -d /tmp/foo' to do the import. eg:

# mkdir /tmp/fbsd # create a temp directory to point to the p0
partitions of the relevant disks
# ln -s /dev/dsk/c8t1d0p0 /tmp/fbsd/
# ln -s /dev/dsk/c8t2d0p0 /tmp/fbsd/
# ln -s /dev/dsk/c8t3d0p0 /tmp/fbsd/
# ln -s /dev/dsk/c8t4d0p0 /tmp/fbsd/
# zpool import -d /tmp/fbsd/ $POOLNAME

I've never used FreeBSD so I can't offer any advice about which device
name is correct or if this will work. Posts from February 2010 ("Import
zpool from FreeBSD in OpenSolaris") indicate that you want p0.

 It's just frustrating that it's still possible to corrupt a pool in
 such a way that nuke and pave is the only solution.  Especially when

I'm not sure it was the only solution, it's just the one you followed.

 What's most frustrating is that this is the third time I've built this
 pool due to corruption like this, within three months.  :(

You may have an underlying hardware problem, or there could be a bug
in the FreeBSD implementation that you're tripping over.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS on HP MDS 600

2011-05-10 Thread Brandon High
On Mon, May 9, 2011 at 8:33 AM, Darren Honeyball ml...@spod.net wrote:
 I'm just mulling over the best configuration for this system - our workload
 is mostly writing millions of small files (around 50k) with occasional reads
 and we need to keep as much space as possible.

If space is a priority, then raidz or raidz2 are probably the best
bets. If you're going to have a lot of random iops, then mirrors are
best.

You have some control over the performance : space ratio with raidz by
adjusting the width of the radiz vdevs. For instance, mirrors will
provide 34TB of space and best random iops. 24 x 3-disk raidz vdevs
will have 48TB of space but still have pretty strong random iops
performance. 13 x 5-disk raidz vdevs will give 52TB of space at the
lost of lower random iops.

Testing will help you find the best configuration for your environment.

 HP's recommendation for configuring the MDS 600 with ZFS is to let the P212
 do the raid functions (raid 1+0 is recommended here) by configuring each half
 of the MDS 600 as a single logical drive (35 drives) and then use a basic zfs
 pool on top to provide the zfs functionality - to me this would seem to lose
 a lot of the error checking functions of zfs?

If you configured the two logical drives as a mirror in ZFS, then
you'd still have full protection. Your overhead would be really high
though - 3/4 of your original capacity would be used for data
protection if I understand the recommendation correctly. (You'd use
1/2 of the original capacity for RAID1 in the MDS, then 1/2 of the
remaining for the ZFS mirror.) You could use non-redundant pool in ZFS
to reduce the overhead, but you sacrifice the self-healing properties
of ZFS when you do that.

 Another option is to use raidz and let zfs handle the smart stuff - as the
 P212 doesn't support a true dumb JBOD function I'd need to create each drive
 as a single raid 0 logical drive - are there any drawbacks to doing this? Or
 would it be better to create slightly larger logical drives using say 2
 physical drives per logical drive?

Single-device logical drives are required when you can't configure a
card or device as JBOD, and I believe its usually the recommended
solution. Once you have the LUNs created, you can use ZFS to create
mirrors or raidz vdevs.

 I'm planning on having 2 hot spares - one in each side of the MDS 600, is it 
 also worth using a dedicated ZIL spindle or 2?

It would depend on your workload. (How's that for helpful?)

If you're experiencing a lot of synchronous writes, then a ZIL will
help. If you aren't seeing a lot of sync writes, then a ZIL won't
help. The ZIL doesn't have to be very large, since it's flushed on a
regular basis. From the Best Practices guide:
"For a target throughput of X MB/sec and given that ZFS pushes
transaction groups every 5 seconds (and have 2 outstanding), we also
expect the ZIL to not grow beyond X MB/sec * 10 sec. So to service
100MB/sec of synchronous writes, 1 GB of log device should be
sufficient."

If the MDS has a non-volatile cache, there should be little or no need
to use a ZIL.

However, some reports have shown ZFS with a ZIL to be faster than
using non-volatile cache. You should test performance using your
workload.

 Is it worth tweaking zfs_nocacheflush or zfs_vdev_max_pending?

As I mentioned above, if the MDS has a non-volatile cache, then
setting zfs_nocacheflush might help performance.
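If you do test that, the usual place is /etc/system (reboot required),
and only on the assumption that the cache really is non-volatile:

set zfs:zfs_nocacheflush = 1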

If you're exporting one LUN per device then you shouldn't need to
adjust the max_pending. If you're exporting larger RAID10 luns from
the MDS, then increasing the value might help for read workloads.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] primarycache=metadata seems to force behaviour of secondarycache=metadata

2011-05-10 Thread Brandon High
On Mon, May 9, 2011 at 2:54 PM, Tomas Ögren st...@acc.umu.se wrote:
 Slightly off topic, but we had an IBM RS/6000 43P with a PowerPC 604e
 cpu, which had about 60MB/s memory bandwidth (which is kind of bad for a
 332MHz cpu) and its disks could do 70-80MB/s or so.. in some other
 machine..

It wasn't that long ago when 66MB/s ATA was considered a waste because
no drive could use that much bandwidth. These days a slow drive has
max throughput greater than 110MB/s.

(OK, looking at some online reviews, it was about 13 years ago. Maybe
I'm just old.)

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-06 Thread Brandon High
On Fri, May 6, 2011 at 9:15 AM, Ray Van Dolson rvandol...@esri.com wrote:

 We use dedupe on our VMware datastores and typically see 50% savings,
 often times more.  We do of course keep like VM's on the same volume

I think NetApp uses 4k blocks by default, so the block size and
alignment should match up for most filesystems and yield better
savings.

Your server's resource requirements for ZFS and dedup will be much
higher due to the large DDT, as you initially suspected.

If bp_rewrite is ever completed and released, this might change. It
should allow for offline dedup, which may make dedup usable in more
situations.

 Apologies for devolving the conversation too much in the NetApp
 direction -- simply was a point of reference for me to get a better
 understanding of things on the ZFS side. :)

It's good to compare the two, since they have a pretty large overlap
in functionality but sometimes very different implementations.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quick zfs send -i performance questions

2011-05-05 Thread Brandon High
On Thu, May 5, 2011 at 11:17 AM, Giovanni Tirloni gtirl...@sysdroid.com wrote:
 What I find it curious is that it only happens with incrementals. Full
 send's go as fast as possible (monitored with mbuffer). I was just wondering
 if other people have seen it, if there is a bug (b111 is quite old), etc.

I missed that you were using b111 earlier. That's probably a large
part of the problem. There were a lot of performance and reliability
improvements between b111 and b134, and there have been more between
b134 and b148 (OI) or b151 (S11 Express).

Updating the host you're receiving on to something more recent may fix
the performance problem you're seeing.

Fragmentation shouldn't be to great of an issue if the pool you're
writing to is relatively empty. There were changes made to zpool
metaslab allocation post-b111 that might improve performance for pools
between 70% and 96% full. This could also be why the full sends
perform better than incremental sends.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-05 Thread Brandon High
On Wed, May 4, 2011 at 8:23 PM, Edward Ned Harvey
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
 Generally speaking, dedup doesn't work on VM images.  (Same is true for ZFS
 or netapp or anything else.)  Because the VM images are all going to have
 their own filesystems internally with whatever blocksize is relevant to the
 guest OS.  If the virtual blocks in the VM don't align with the ZFS (or
 whatever FS) host blocks...  Then even when you write duplicated data inside
 the guest, the host won't see it as a duplicated block.

A zvol with 4k blocks should give you decent results with Windows
guests. Recent versions use 4k alignment by default and 4k blocks, so
there should be lots of duplicates for a base OS image.
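A minimal sketch, with made-up names and sizes:

# zfs create -o volblocksize=4k -V 40G tank/vm/win7-01

volblocksize can only be set at creation time, so plan it before
carving out the zvols.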

 There are some situations where dedup may help on VM images...  For example
 if you're not using sparse files and you have a zero-filed disk...  But in

compression=zle works even better for these cases, since it doesn't
require DDT resources.

 Or if you're intimately familiar with both the guest and host filesystems, and
 you choose blocksizes carefully to make them align.  But that seems
 complicated and likely to fail.

Using a 4k block size is a safe bet, since most OSs use a block size
that is a multiple of 4k. It's the same reason that the new Advanced
Format drives use 4k sectors.

Windows uses 4k alignment and 4k (or larger) clusters.
ext3/ext4 uses 1k, 2k, or 4k blocks. Drives over 512MB should use 4k
by default. The block alignment is determined by the partitioning, so
some care needs to be taken there.
zfs uses 'ashift' size blocks. I'm not sure what ashift works out to
be when using a zvol though, so it could be as small as 512b but may
be set to the same as the blocksize property.
ufs is 4k or 8k on x86 and 8k on sun4u. As with ext4, block alignment
is determined by partitioning and slices.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-05 Thread Brandon High
On Thu, May 5, 2011 at 8:50 PM, Edward Ned Harvey
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
 If you have to use the 4k recordsize, it is likely to consume 32x more
 memory than the default 128k recordsize of ZFS.  At this rate, it becomes
 increasingly difficult to get a justification to enable the dedup.  But it's
 certainly possible.

You're forgetting that zvols use an 8k volblocksize by default. If
you're currently exporting exporting volumes with iSCSI it's only a 2x
increase.

The tradeoff is that you should have more duplicate blocks, and reap
the rewards there. I'm fairly certain that it won't offset the large
increase in the size of the DDT however. Dedup with zvols is probably
never a good idea as a result.

Only if you're hosting your VM images in .vmdk files will you get 128k
blocks. Of course, your chance of getting many identical blocks gets
much, much smaller. You'll have to worry about the guests' block
alignment in the context of the image file, since two identical files
may not create identical blocks as seen from ZFS. This means you may
get only fractional savings and have an enormous DDT.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Brandon High
On Wed, May 4, 2011 at 12:29 PM, Erik Trimble erik.trim...@oracle.com wrote:
        I suspect that NetApp does the following to limit their resource
 usage:   they presume the presence of some sort of cache that can be
 dedicated to the DDT (and, since they also control the hardware, they can
 make sure there is always one present).  Thus, they can make their code

AFAIK, NetApp has more restrictive requirements about how much data
can be dedup'd on each type of hardware.

See page 29 of http://media.netapp.com/documents/tr-3505.pdf - Smaller
pieces of hardware can only dedup 1TB volumes, and even the big-daddy
filers will only dedup up to 16TB per volume, even if the volume size
is 32TB (the largest volume available for dedup).

NetApp solves the problem by putting rigid constraints around the
problem, whereas ZFS lets you enable dedup for any size dataset. Both
approaches have limitations, and it sucks when you hit them.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quick zfs send -i performance questions

2011-05-04 Thread Brandon High
On Wed, May 4, 2011 at 2:25 PM, Giovanni Tirloni gtirl...@sysdroid.com wrote:
   The problem we've started seeing is that a zfs send -i is taking hours to
 send a very small amount of data (eg. 20GB in 6 hours) while a zfs send full
 transfer everything faster than the incremental (40-70MB/s). Sometimes we
 just give up on sending the incremental and send a full altogether.

Does the send complete faster if you just pipe to /dev/null? I've
observed that if recv stalls, it'll pause the send, and the two go
back and forth stepping on each other's toes. Unfortunately, send and
recv tend to pause with each individual snapshot they are working on.

Putting something like mbuffer
(http://www.maier-komor.de/mbuffer.html) in the middle can help smooth
it out and speed things up tremendously. It prevents the send from
pausing when the recv stalls, and allows the recv to continue working
when the send is stalled. You will have to fiddle with the buffer size
and other options to tune it for your use.
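Something like this isolates the send side (snapshot and dataset names
are made up):

# zfs send -i tank/fs@snap1 tank/fs@snap2 > /dev/null

And with mbuffer on both ends; the buffer sizes are just a starting
point:

# zfs send -i tank/fs@snap1 tank/fs@snap2 | mbuffer -s 128k -m 1G | \
    ssh backuphost 'mbuffer -s 128k -m 1G | zfs recv -F backup/fs'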

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Brandon High
On Wed, May 4, 2011 at 4:36 PM, Erik Trimble erik.trim...@oracle.com wrote:
 If so, I'm almost certain NetApp is doing post-write dedup.  That way, the
 strictly controlled max FlexVol size helps with keeping the resource limits
 down, as it will be able to round-robin the post-write dedup to each FlexVol
 in turn.

They are; it's in their docs. A volume is dedup'd when 20% of
non-deduped data is added to it, or something similar. 8 volumes can
be processed at once though, I believe, and it could be that weaker
systems are not able to do as many in parallel.

 block usage has a significant 4k presence.  One way I reduced this initally
 was to have the VMdisk image stored on local disk, then copied the *entire*
 image to the ZFS server, so the server saw a single large file, which meant
 it tended to write full 128k blocks.  Do note, that my 30 images only takes

Wouldn't you have been better off cloning datasets that contain an
unconfigured install and customizing from there?

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Faster copy from UFS to ZFS

2011-05-03 Thread Brandon High
On Tue, May 3, 2011 at 5:47 AM, Joerg Schilling
joerg.schill...@fokus.fraunhofer.de wrote:
 But this is most likely slower than star and does rsync support sparse files?

'rsync -ASHXavP'

-A: ACLs
-S: Sparse files
-H: Hard links
-X: Xattrs
-a: archive mode; equals -rlptgoD (no -H,-A,-X)

You don't need to specify --whole-file, it's implied when copying on
the same system. --inplace can play badly with hard links and
shouldn't be used.

It probably will be slower than other options but it may be more
accurate, especially with -H

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Faster copy from UFS to ZFS

2011-05-03 Thread Brandon High
On Tue, May 3, 2011 at 12:36 PM, Erik Trimble erik.trim...@oracle.com wrote:
 rsync is indeed slower than star; so far as I can tell, this is due almost
 exclusively to the fact that rsync needs to build an in-memory table of all
 work being done *before* it starts to copy. After that, it copies at about

rsync 3.0+ will start copying almost immediately, so it's much better
in that respect than previous versions. It continues to scan and update
the list of files while sending data.

 network use pattern), which helps for ZFS copying.  The one thing I'm not
 sure of is whether rsync uses a socket, pipe, or semaphore method when doing
 same-host copying. I presume socket (which would slightly slow it down vs

It creates a socketpair() before clone()ing itself and uses the socket
for communications.

 That said, rsync is really the only solution if you have a partial or
 interrupted copy.  It's also really the best method to do verification.

For verification you should specify -c (checksums), otherwise it will
only look at the size, permissions, owner and date and if they all
match it will not look at the file contents. It can take as long (or
longer) to complete than the original copy, since files on both side
need to be read and checksummed.
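For example, a checksum-based dry run that only reports files whose
contents differ (paths are made up):

# rsync -nci -aSHX /ufs/export/ /tank/export/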

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ls reports incorrect file size

2011-05-02 Thread Brandon High
On Mon, May 2, 2011 at 1:56 PM, Eric D. Mudama
edmud...@bounceswoosh.org wrote:
 that the application would have done the seek+write combination, since
 on NTFS (which doesn't support sparse) these would have been real
 1.5GB files, and there would be hundreds or thousands of them in
 normal usage.

NTFS supports sparse files.
http://www.flexhex.com/docs/articles/sparse-files.phtml

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-30 Thread Brandon High
On Thu, Apr 28, 2011 at 6:48 PM, Edward Ned Harvey
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
 What does it mean / what should you do, if you run that command, and it
 starts spewing messages like this?
 leaked space: vdev 0, offset 0x3bd8096e00, size 7168

I'm not sure there's much you can do about it short of deleting
datasets and/or snapshots.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-29 Thread Brandon High
On Fri, Apr 29, 2011 at 7:10 AM, Roy Sigurd Karlsbakk r...@karlsbakk.net 
wrote:
 This was fletcher4 earlier, and still is in opensolaris/openindiana. Given a 
 combination with verify (which I would use anyway, since there are always 
 tiny chances of collisions), why would sha256 be a better choice?

fletcher4 was only an option for snv_128, which was quickly pulled and
replaced with snv_128b which removed fletcher4 as an option.

The official post is here:
http://www.opensolaris.org/jive/thread.jspa?threadID=118519&tstart=0#437431

It looks like fletcher4 is still an option in snv_151a for non-dedup
datasets, and is in fact the default.

As an aside: Erik, any idea when the 159 bits will make it to the public?

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Faster copy from UFS to ZFS

2011-04-29 Thread Brandon High
On Fri, Apr 29, 2011 at 10:53 AM, Dan Shelton dan.shel...@oracle.com wrote:
 Is anyone aware of any freeware program that can speed up copying tons of
 data (2 TB) from UFS to ZFS on same server?

Setting 'sync=disabled' for the initial copy will help, since it will
make all writes asynchronous.

You will probably want to set it back to default after you're done.
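
Something along these lines is enough (the dataset name is made up):

# zfs set sync=disabled tank/newdata
  ... run the copy ...
# zfs set sync=standard tank/newdata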

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Still no way to recover a corrupted pool

2011-04-29 Thread Brandon High
On Fri, Apr 29, 2011 at 1:23 PM, Freddie Cash fjwc...@gmail.com wrote:
 Running ZFSv28 on 64-bit FreeBSD 8-STABLE.

I'd suggest trying to import the pool into snv_151a (Solaris 11
Express), which is the reference and development platform for ZFS.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Finding where dedup'd files are

2011-04-28 Thread Brandon High
Is there an easy way to find out what datasets have dedup'd data in
them. Even better would be to discover which files in a particular
dataset are dedup'd.

I ran
# zdb -

which gave output like:

index 1055c9f21af63 refcnt 2 single DVA[0]=0:1e274ec3000:2ac00:STD:1
[L0 deduplicated block] sha256 uncompressed LE contiguous unique
unencrypted 1-copy size=2L/2P birth=236799L/236799P fill=1
cksum=55c9f21af6399be:11f9d4f5ff4cb109:2af8b798671e47ba:d19caf78da295df5

How can I translate this into datasets or files?

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-28 Thread Brandon High
On Wed, Apr 27, 2011 at 9:26 PM, Edward Ned Harvey
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
 Correct me if I'm wrong, but the dedup sha256 checksum happens in addition
 to (not instead of) the fletcher2 integrity checksum.  So after bootup,

My understanding is that enabling dedup forces sha256.

The default checksum used for deduplication is sha256 (subject to
change). When dedup is enabled, the dedup checksum algorithm overrides
the checksum property.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-28 Thread Brandon High
On Thu, Apr 28, 2011 at 3:05 PM, Erik Trimble erik.trim...@oracle.com wrote:
 A careful reading of the man page seems to imply that there's no way to
 change the dedup checksum algorithm from sha256, as the dedup property
 ignores the checksum property, and there's no provided way to explicitly
 set a checksum algorithm specific to dedup (i.e. there's no way to
 override the default for dedup).

That's my understanding as well. The initial release used fletcher4 or
sha256, but there was either a bug in the fletcher4 code or a hash
collision that required removing it as an option.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Finding where dedup'd files are

2011-04-28 Thread Brandon High
On Thu, Apr 28, 2011 at 3:48 PM, Ian Collins i...@ianshome.com wrote:
 Dedup is at the block, not file level.

Files are usually composed of blocks.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Finding where dedup'd files are

2011-04-28 Thread Brandon High
On Thu, Apr 28, 2011 at 4:06 PM, Erik Trimble erik.trim...@oracle.com wrote:
 Which means, that while I can get a list of blocks which are deduped, it
 may not be possible to generate a list of files from that list of
 blocks.

Is it possible to determine which datasets the blocks are referenced from?

Since I have some datasets with dedup'd data, I'm a little paranoid
about tanking the system if they are destroyed.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Spare drives sitting idle in raidz2 with failed drive

2011-04-27 Thread Brandon High
On Wed, Apr 27, 2011 at 12:51 PM, Lamp Zy lam...@gmail.com wrote:
 Any ideas how to identify which drive is the one that failed so I can
 replace it?

Try the following:
# fmdump -eV
# fmadm faulty

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Drive replacement speed

2011-04-26 Thread Brandon High
The last resilver finished after 50 hours. Ouch.

I'm onto the next device now, which seems to be progressing much, much better.

The current tunings that I'm using right now are:
echo zfs_resilver_delay/W0t0 | mdb -kw
echo zfs_resilver_min_time_ms/W0t2 | pfexec mdb -kw

Things could slow down, but at 13 hours in, the resilver has been
managing ~ 100M/s and is 70% done.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Drive replacement speed

2011-04-25 Thread Brandon High
              -      -    743     10  2.02M  59.4K
c2t3d0        -      -    771     11  2.01M  59.3K
c2t4d0        -      -    767     10  1.94M  59.1K
replacing     -      -      0  1.00K     17  1.48M
  c2t5d0/old  -      -      0      0      0      0
  c2t5d0      -      -      0    533     17  1.48M
c2t6d0        -      -    791     10  1.98M  59.2K
c2t7d0        -      -    796     10  1.99M  59.3K
----------  -----  -----  -----  -----  -----  -----

$ iostat -xn 60 3
extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  362.4   11.5 5693.9   71.6  0.7  0.7    2.0    2.0  14  30 c2t0d0
  365.3   11.5 5689.0   71.6  0.7  0.7    1.8    1.9  14  29 c2t1d0
  363.2   11.5 5693.2   71.6  0.7  0.7    1.9    2.0  14  30 c2t2d0
  364.0   11.5 5692.7   71.6  0.7  0.7    1.9    1.9  14  30 c2t3d0
  361.2   11.5 5672.8   71.6  0.7  0.7    1.9    1.9  14  30 c2t4d0
  202.4  163.1 2915.2 2475.3  0.3  1.1    0.8    2.9   7  26 c2t5d0
  170.4  190.4 2747.3 2757.6  0.5  1.3    1.5    3.6  11  31 c2t6d0
  386.4   11.2 5659.0   71.6  0.5  0.6    1.3    1.5  12  27 c2t7d0
   95.0    1.2   94.5   16.1  0.0  0.0    0.2    0.2   0   1 c0t0d0
    0.9    1.2    3.3   16.1  0.0  0.0    7.5    1.9   0   0 c0t1d0
extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  514.1   13.0 1937.7   65.7  0.2  0.8    0.3    1.5   5  27 c2t0d0
  510.1   13.2 1943.1   65.7  0.2  0.8    0.5    1.6   6  29 c2t1d0
  513.3   13.2 1926.3   65.8  0.2  0.8    0.3    1.5   5  28 c2t2d0
  505.9   13.3 1936.7   65.8  0.2  0.9    0.3    1.8   5  30 c2t3d0
  513.8   12.8 1890.1   65.8  0.2  0.8    0.3    1.5   5  26 c2t4d0
    0.1  488.6    0.1 1216.5  0.0  2.2    0.0    4.6   0  33 c2t5d0
  533.3   12.7 1875.3   65.9  0.1  0.7    0.2    1.3   4  24 c2t6d0
  541.6   12.9 1923.2   65.8  0.1  0.7    0.2    1.2   3  23 c2t7d0
    0.0    2.0    0.0    9.4  0.0  0.0    1.0    0.2   0   0 c0t0d0
    0.0    2.0    0.0    9.4  0.0  0.0    1.0    0.2   0   0 c0t1d0
extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  506.7    9.2 1906.9   50.2  0.6  0.2    1.2    0.5  20  23 c2t0d0
  509.8    9.3 1909.5   50.2  0.6  0.2    1.2    0.4  19  23 c2t1d0
  508.6    9.0 1900.4   50.2  0.7  0.3    1.4    0.5  21  25 c2t2d0
  506.8    9.4 1897.2   50.3  0.6  0.2    1.2    0.5  19  23 c2t3d0
  505.1    9.4 1852.4   50.4  0.6  0.2    1.2    0.5  19  23 c2t4d0
    0.0  487.6    0.0 1227.9  0.0  3.5    0.0    7.2   0  46 c2t5d0
  534.8    9.2 1855.6   50.2  0.6  0.2    1.0    0.4  18  22 c2t6d0
  540.5    9.3 1891.4   50.2  0.5  0.2    1.0    0.4  17  21 c2t7d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t1d0



-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-25 Thread Brandon High
On Mon, Apr 25, 2011 at 8:20 AM, Edward Ned Harvey
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
 and 128k assuming default recordsize.  (BTW, recordsize seems to be a zfs
 property, not a zpool property.  So how can you know or configure the
 blocksize for something like a zvol iscsi target?)

zvols use the 'volblocksize' property, which defaults to 8k. A 1TB
zvol is therefore 2^27 blocks and would require ~ 34 GB for the ddt
(assuming that a ddt entry is 270 bytes).
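
Spelling out that arithmetic: 2^40 bytes / 2^13 bytes per block = 2^27
(about 134 million) blocks, and 2^27 * 270 bytes is roughly 36 GB
(about 34 GiB).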

The zfs man page for the property reads:

volblocksize=blocksize

 For volumes, specifies the block size of the volume. The
 blocksize  cannot  be  changed  once the volume has been
 written, so it should be set at  volume  creation  time.
 The default blocksize for volumes is 8 Kbytes. Any power
 of 2 from 512 bytes to 128 Kbytes is valid.

 This property can also be referred to by  its  shortened
 column name, volblock.

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Drive replacement speed

2011-04-25 Thread Brandon High
On Mon, Apr 25, 2011 at 4:45 PM, Richard Elling
richard.ell...@gmail.com wrote:
 If there is other work going on, then you might be hitting the resilver
 throttle. By default, it will delay 2 clock ticks, if needed. It can be turned

There is some other access to the pool from nfs and cifs clients, but
not much, and mostly reads.

Setting zfs_resilver_delay seems to have helped some, based on the
iostat output. Are there other tunables?

 Probably won't work because it does not make the resilvering drive
 any faster.

It doesn't seem like the devices are the bottleneck, even with the
delay turned off.

$ iostat -xn 60 3
extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  369.2   11.5 5577.0   71.3  0.7  0.7    1.9    1.9  14  29 c2t0d0
  371.9   11.5 5570.3   71.3  0.7  0.7    1.7    1.8  13  29 c2t1d0
  369.9   11.5 5574.4   71.3  0.7  0.7    1.8    1.9  14  29 c2t2d0
  370.7   11.5 5573.9   71.3  0.7  0.7    1.8    1.9  14  29 c2t3d0
  368.0   11.5 5553.1   71.3  0.7  0.7    1.8    1.9  14  29 c2t4d0
  196.1  172.8 2825.5 2436.6  0.3  1.1    0.8    3.0   6  26 c2t5d0
  183.6  184.9 2717.6 2674.7  0.5  1.3    1.4    3.5  11  31 c2t6d0
  393.0   11.2 5540.7   71.3  0.5  0.6    1.3    1.5  12  26 c2t7d0
   95.8    1.2   95.6   16.2  0.0  0.0    0.2    0.2   0   1 c0t0d0
    0.9    1.2    3.6   16.2  0.0  0.0    7.5    1.9   0   0 c0t1d0
extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  891.2   11.8 2386.9   64.4  0.0  1.2    0.0    1.3   1  36 c2t0d0
  919.9   12.1 2351.8   64.6  0.0  1.1    0.0    1.2   0  35 c2t1d0
  906.9   12.1 2346.1   64.6  0.0  1.2    0.0    1.3   0  36 c2t2d0
  877.9   11.6 2351.0   64.5  0.7  0.5    0.8    0.6  23  35 c2t3d0
  883.4   12.0 2322.0   64.4  0.2  1.0    0.2    1.1   7  35 c2t4d0
    0.8  758.0    0.8 1910.4  0.2  5.0    0.2    6.6   3  72 c2t5d0
  882.7   11.4 2355.1   64.4  0.8  0.4    0.9    0.4  27  34 c2t6d0
  907.8   11.4 2373.1   64.5  0.7  0.3    0.8    0.4  23  30 c2t7d0
 1607.8    9.4 1568.2   83.0  0.1  0.2    0.1    0.1   3  18 c0t0d0
    7.3    9.1   23.5   83.0  0.1  0.0    6.0    1.4   2   2 c0t1d0
extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  960.3   12.7 2868.0   59.0  1.1  0.7    1.2    0.8  37  52 c2t0d0
  963.2   12.7 2877.5   59.1  1.1  0.8    1.1    0.8  36  51 c2t1d0
  960.3   12.6 2844.7   59.1  1.1  0.7    1.1    0.8  37  52 c2t2d0
 1000.1   12.8 2827.1   59.0  0.6  1.2    0.6    1.2  21  52 c2t3d0
  960.9   12.3 2811.1   59.0  1.3  0.6    1.3    0.6  42  51 c2t4d0
    0.5  962.2    0.4 2418.3  0.0  4.1    0.0    4.3   0  59 c2t5d0
 1014.2   12.3 2820.6   59.1  0.8  0.8    0.8    0.8  28  48 c2t6d0
 1031.2   12.5 2822.0   59.1  0.8  0.8    0.7    0.8  26  45 c2t7d0
 1836.4    0.0 1783.4    0.0  0.0  0.2    0.0    0.1   1  19 c0t0d0
    5.3    0.0    5.3    0.0  0.0  0.0    1.1    1.5   1   1 c0t1d0


-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Spare drives sitting idle in raidz2 with failed drive

2011-04-25 Thread Brandon High
On Mon, Apr 25, 2011 at 4:56 PM, Lamp Zy lam...@gmail.com wrote:
 I'd expect the spare drives to auto-replace the failed one but this is not
 happening.

 What am I missing?

Is the autoreplace property set to 'on'?
# zpool get autoreplace fwgpool0
# zpool set autoreplace=on fwgpool0

 I really would like to get the pool back in a healthy state using the spare
 drives before trying to identify which one is the failed drive in the
 storage array and trying to replace it. How do I do this?

Turning on autoreplace might start the replace. If not, the following
will replace the failed drive with the first spare. (I'd suggest
verifying the device names before running it.)
# zpool replace fwgpool0 c4t5000C5001128FE4Dd0 c4t5000C50014D70072d0

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How does ZFS dedup space accounting work with quota?

2011-04-25 Thread Brandon High
On Mon, Apr 25, 2011 at 4:53 PM, Fred Liu fred_...@issi.com wrote:
 So how can I set the quota size on a file system with dedup enabled?

I believe the quota applies to the non-dedup'd data size. If a user
stores 10G of data, it will use 10G of quota, regardless of whether it
dedups at 100:1 or 1:1.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Drive replacement speed

2011-04-25 Thread Brandon High
On Mon, Apr 25, 2011 at 5:26 PM, Brandon High bh...@freaks.com wrote:
 Setting zfs_resilver_delay seems to have helped some, based on the
 iostat output. Are there other tunables?

I found zfs_resilver_min_time_ms while looking. I've tried bumping it
up considerably, without much change.

'zpool status' is still showing:
 scan: resilver in progress since Sat Apr 23 17:03:13 2011
6.06T scanned out of 6.40T at 36.0M/s, 2h46m to go
769G resilvered, 94.64% done

'iostat -xn' shows asvc_t under 10ms still.

Increasing the per-device queue depth has increased the asvc_t but
hasn't done much to affect the throughput. I'm using:
echo zfs_vdev_max_pending/W0t35 | pfexec mdb -kw

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] just can't import

2011-04-11 Thread Brandon High
On Sun, Apr 10, 2011 at 10:01 PM, Matt Harrison
iwasinnamuk...@genestate.com wrote:
 The machine only has 4G RAM I believe.

There's your problem. 4G is not enough memory for dedup, especially
without a fast L2ARC device.

 It's time I should be heading to bed so I'll let it sit overnight, and if
 I'm still stuck with it I'll give Ian's recent suggestions a go and report
 back.

I'd suggest waiting for it to finish the destroy. It will, if you give it time.

Trying to force the import is only going to put you back in the same
situation - The system will attempt to complete the destroy and seem
to hang until it's completed.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] just can't import

2011-04-11 Thread Brandon High
On Mon, Apr 11, 2011 at 10:55 AM, Matt Harrison
iwasinnamuk...@genestate.com wrote:
 It did finish eventually, not sure how long it took in the end. Things are
 looking good again :)

If you want to continue using dedup, you should invest in (a lot) more
memory. The amount of memory required depends on the size of your pool
and the type of data that you're storing. Data that large blocks will
use less memory.

I suspect that the minimum memory for most moderately sized pools is
over 16GB. There has been a lot of discussion regarding how much
memory each dedup'd block requires, and I think it was about 250-270
bytes per block. 1TB of data (at max block size and no duplicate data)
will require about 2GB of memory to run effectively. (This seems high
to me, hopefully someone else can confirm.) This is memory that is
available to the ARC, above and beyond what is being used by the
system and applications. Of course, using all your ARC to hold dedup
data won't help much either, as either cacheable data or dedup info
will be evicted rather quickly. Forcing the system to read dedup
tables from the pool is slow, since it's a lot of random reads.

All I know is that I have 8GB in my home system, and it is not enough
to work with the 8TB pool that I have. Adding a fast SSD as L2ARC can
help reduce the memory requirements somewhat by keeping dedup data
more easily accessible. (And make sure that your L2ARC device is large
enough. I fried a 30GB OCZ Vertex in just a few months of use, I
suspect from the constant writes.)

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] just can't import

2011-04-10 Thread Brandon High
On Sun, Apr 10, 2011 at 9:01 PM, Matt Harrison
iwasinnamuk...@genestate.com wrote:
 I had a de-dup dataset and tried to destroy it. The command hung and so did
 anything else zfs related. I waited half and hour or so, the dataset was
 only 15G, and rebooted.

How much RAM does the system have? Dedup uses a LOT of memory, and it
can take a long time to destroy dedup'd datasets.

If you keep waiting, it'll eventually return. It could be a few hours or longer.

 The machine refused to boot, stuck at Reading ZFS Config. Asking around on

The system resumed the destroy that was in progress. If you let it
sit, it'll eventually complete.

 Well the livecd is also hanging on import, anything else zfs hangs. iostat
 shows some reads but they drop off to almost nothing after 2 mins or so.

Likewise, it's trying to complete the destroy. Be patient and it'll
complete. Newer versions of OpenSolaris or Solaris 11 Express may
complete it faster.

 Any tips greatly appreciated,

Just wait...

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Going forward after Oracle - Let's get organized, let's get started.

2011-04-09 Thread Brandon High
On Sat, Apr 9, 2011 at 10:41 AM, Chris Forgeron cforge...@acsi.ca wrote:
 I see your point, but you also have to understand that sometimes too many 
 helpers/opinions are a bad thing.  There is a set core of ZFS developers 
 who make a lot of this move forward, and they are the key right now. The rest 
 of us will just muddy the waters with conflicting/divergent opinions on 
 direction and goals.

It would be nice to have some communication from the devs about what
they're working on. A moderated list that only a limited set of people
normally post to would be excellent.

I'd be excited to hear that there's a new feature being worked on,
rather than the radio silence we've had.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to rename rpool. Is that recommended ?

2011-04-08 Thread Brandon High
On Fri, Apr 8, 2011 at 12:10 AM, Arjun YK arju...@gmail.com wrote:
 I have a situation where a host, which is booted off its 'rpool', need
 to temporarily import the 'rpool' of another host, edit some files in
 it, and export the pool back retaining its original name 'rpool'. Can
 this be done ?

Yes you can do it, no it is not recommended.

I had a need to do something similar to what you're attempting and
ended up using a Live CD (which doesn't have an rpool to have a naming
conflict) to do the manipulations.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS send/receive to Solaris/FBSD/OpenIndiana/Nexenta VM guest?

2011-04-07 Thread Brandon High
On Thu, Apr 7, 2011 at 4:01 PM, Joe Auty j...@netmusician.org wrote:
 My source computer is running Solaris 10 ZFS version 15. Does this mean that 
 I'd be asking for trouble doing a zfs send back to this machine from any 
 other ZFS machine running a version > 15? I just want to make sure I 
 understand all of this info...

There are two versions when it comes to ZFS - The zpool version and
the zfs version.

bhigh@basestar:~$ zpool list -o name,version
NAME   VERSION
rpool   31

bhigh@basestar:~$ zfs list -o name,version
NAME   VERSION
rpool5
rpool/ROOT   5
rpool/ROOT/snv_151   5
rpool/dump   -
rpool/rsrv   5
rpool/swap   -

I think that the version that matters (for your purposes) is the ZFS
version. It should be set when using 'send -R' and having 'zfs
receive' create the destination datasets. I recommend testing however.

 If this is the case, what are my strategies? Solaris 10 for my temporary 
 backup machine? Is it possible to run OpenIndiana or Nexenta or something and 
 somehow set up these machines with ZFS v15 or something?

You can set the zpool version when you create the pool, and you can
set the zfs version when you create the dataset. I'm not sure that
you'll need to set the pool version to anything lower if the dataset
version is correct though. You should test this, however.
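
As a rough sketch (pool/dataset names and version numbers are just
examples; match them to whatever the old host reports):

# zpool create -o version=15 backup c0t2d0 c0t3d0
# zfs create -o version=4 backup/data

'zpool upgrade -v' and 'zfs upgrade -v' will list the versions your
build supports.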

-B

--
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS send/receive to Solaris/FBSD/OpenIndiana/Nexenta VM guest?

2011-04-06 Thread Brandon High
On Tue, Apr 5, 2011 at 12:38 PM, Joe Auty j...@netmusician.org wrote:

 How about getting a little more crazy... What if this entire server
 temporarily hosting this data was a VM guest running ZFS? I don't foresee
 this being a problem either, but with so


The only thing to watch out for is to make sure that the receiving datasets
aren't a higher version than the zfs version that you'll be using on the
replacement server. Because you can't downgrade a dataset, using snv_151a
and planning to send to Nexenta as a final step will trip you up unless you
explicitly create them with a lower version.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS send/receive to Solaris/FBSD/OpenIndiana/Nexenta VM guest?

2011-04-06 Thread Brandon High
On Wed, Apr 6, 2011 at 10:42 AM, Paul Kraus pk1...@gmail.com wrote:
    I thought I saw that with zpool 10 (or was it 15) the zfs send
 format had been committed and you *could* send/recv between different
 version of zpool/zfs. From Solaris 10U9 with zpool 22 manpage for zfs:

There is still a problem if the dataset version is too high. I
*believe* that a 'zfs send -R' should send the zfs version, and that
zfs receive will create any new datasets using that version. (I have a
received dataset here that's zfs v 4, whereas everything else in the
pool is v5.) As long as you don't do a zfs upgrade after that point,
you should be fine.

It's probably a good idea to check that the received versions are the
same as the source before doing a destroy though. ;-)

One other thing that I forgot to mention in my last mail too: If
you're receiving into a VM, make sure that the VM can manage
redundancy on its zfs storage, and not just multiple vdsk on the same
host disk / lun. Either give it access to the raw devices, or use
iSCSI, or create your vdsk on different luns and raidz them, etc.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NTFS on NFS and iSCSI always generates small IO's

2011-03-10 Thread Brandon High
On Thu, Mar 10, 2011 at 12:15 AM, Matthew Anderson
matth...@ihostsolutions.com.au wrote:
 I have a feeling it's to do with ZFS's recordsize property but haven't been 
 able to find any solid testing done with NTFS. I'm going to do some testing 
 using smaller record sizes tonight to see if that helps the issue.
 At the moment I'm surviving on cache and am quickly running out of capacity.

 Can anyone suggest any further tests or have any idea about what's going on?

The default blocksize for a zfs volume is 8k, so 4k writes will
probably require a read as well. You can try creating a new volume
with volblocksize set to 4k and see if that helps. The value can't be
changed once set, so you'll have to make a new dataset.

Make sure the wcd property is set to false for the volume in
stmfadm in order to enable the write cache. It shouldn't make a huge
difference with the zil disabled, but it certainly won't hurt.
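
A rough sketch (names are made up):

# zfs create -V 100g -o volblocksize=4k tank/vol4k
# stmfadm create-lu -p wcd=false /dev/zvol/rdsk/tank/vol4k

or, for an existing LU:

# stmfadm modify-lu -p wcd=false <LU GUID>

then migrate the data over from the old volume.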

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NTFS on NFS and iSCSI always generates small IO's

2011-03-10 Thread Brandon High
On Thu, Mar 10, 2011 at 9:45 AM, Richard Elling
richard.ell...@gmail.com wrote:
 Default recordsize for NFS is 128K. For the VM case, you will want to match 
 the block size of
 the clients. However, once the file (on the NFS server) is created with 128K 
 records, it will remain
 at 128K forever. So you will need to create a new VM store after the 
 recordsize is tuned.

You can change the recordsize and copy the vmdk files on the nfs
server, which will re-write them with a smaller recordsize.
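
Roughly (names are made up), after something like:

# zfs set recordsize=8k tank/vmstore

a plain copy rewrites the file with the new recordsize:

# cp guest.vmdk guest.vmdk.new && mv guest.vmdk.new guest.vmdk

Only blocks written after the property change pick up the new size,
which is why the copy is needed.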

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Slices and reservations Was: Re: How long should an empty destroy take? snv_134

2011-03-07 Thread Brandon High
On Mon, Mar 7, 2011 at 1:50 PM, Yaverot yave...@computermail.net wrote:
 1. While performance isn't my top priority, doesn't using slices make a 
 significant difference?

Write caching will be disabled on devices that use slices. It can be
turned back on by using format -e
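
From memory (so treat the exact menu names as approximate), the expert
menu looks something like:

# format -e
(select the disk)
format> cache
cache> write_cache
write_cache> enable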

 2. Doesn't snv_134 that I'm running already account for variances in these 
 nominally-same disks?

It will allow some small differences. I'm not sure what the limit on
the difference size is.

 3. The market refuses to sell disks under $50, therefore I won't be able to 
 buy drives of 'matching' capacity anyway.

You can always use a larger drive. If you think you may want to go
back to smaller drives, make sure that the autoexpand zpool property
is disabled though.

 3. Assuming I want to do such an allocation, is this done with quota  
 reservation? Or is it snapshots as you suggest?

I think Edward misspoke when he said to use snapshots, and probably
meant reservation.

I've taken to creating a dataset called reserved and giving it a 10G
reservation. (10G isn't a special value, feel free to use 5% of your
pool size or whatever else you're comfortable with.) It's unmounted
and doesn't contain anything, but it ensures that there is a chunk of
space I can make available if needed. Because it doesn't contain
anything, there shouldn't be any concern for de-allocation of blocks
when it's destroyed. Alternately, the reservation can be reduced to
make space available.
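
A minimal sketch (pool name is made up):

# zfs create -o reservation=10G -o mountpoint=none tank/reserved

Shrinking or removing the reservation later frees up space immediately,
since nothing is ever actually written to the dataset.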

 Would it make more sense to make another filesystem in the pool, fill it 
 enough and keep it handy to delete? Or is there some advantage to zfs destroy 
 (snapshot) over zfs destroy (filesystem)? While I am thinking about the 
 system and have extra drives, like now, is the time to make plans for the 
 next system is full event.

If a dataset contains data, the blocks will have to be freed when it's
destroyed. If it's an empty dataset with a reservation, the only
change is to fiddle some accounting bits.

I seem to remember seeing a fix for 100% full pools a while ago so
this may not be as critical as it used to be, but it's a nice safety
net to have.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS send/recv horribly slow on system with 1800+ filesystems

2011-03-01 Thread Brandon High
On Mon, Feb 28, 2011 at 10:38 PM, Moazam Raja moa...@gmail.com wrote:
 We've noticed that on systems with just a handful of filesystems, ZFS
 send (recursive) is quite quick, but on our 1800+ fs box, it's
 horribly slow.

When doing an incremental send, the system has to identify what blocks
have changed, which can take some time. If not much data has changed,
the delay can take longer than the actual send.

I've noticed that there's a small delay when starting a send of a new
snapshot and when starting the receive of one. Putting something like
mbuffer in the path helps to smooth things out. It won't help in the
example you've cited below, but it will help in real world use.
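
Something along these lines (host name and buffer sizes are just
examples):

# zfs send -R tank/fs@snap | mbuffer -s 128k -m 1G | \
    ssh otherhost 'mbuffer -s 128k -m 1G | zfs receive -d backup'

The buffers absorb the bursty parts of the stream so neither end sits
idle waiting for the other.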

 The other odd thing I've noticed is that during the 'zfs send' to
 /dev/null, zpool iostat shows we're actually *writing* to the zpool at
 the rate of 4MB-8MB/s, but reading almost nothing. How can this be the
 case?

The writing seems odd, but the lack of reads doesn't. You might have
most or all of the data in the ARC or L2ARC, so your zpool doesn't
need to be read from.

 1.) Does ZFS get immensely slow once we have thousands of filesystems?

No. Incremental sends might take longer, as I mentioned above.

 2.) Why do we see 4MB-8MB/s of *writes* to the filesystem when we do a
 'zfs send' to /dev/null ?

Is anything else using the filesystems in the pool?

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Performance

2011-02-28 Thread Brandon High
On Sun, Feb 27, 2011 at 7:35 PM, Brandon High bh...@freaks.com wrote:
 It moves from best fit to any fit at a certain point, which is at
 ~ 95% (I think). Best fit looks for a large contiguous space to avoid
 fragmentation while any fit looks for any free space.

I got the terminology wrong, it's first-fit when there is space,
moving to best-fit at 96% full.

See 
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/metaslab.c
for details.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] External SATA drive enclosures + ZFS?

2011-02-27 Thread Brandon High
On Sun, Feb 27, 2011 at 4:15 PM, Rich Teer rich.t...@rite-group.com wrote:
 So the question is, what eSATA non-RAID HBA do people recommend?  Bear
 in mind that I'm looking for something with driver support out of the
 box with either the latest Solaris 10, or Solaris 11 Express.

The SiI3124 (PCI / PCI-X) and SiI3132 (PCIe) based cards can be picked
up for about $20-$30. They're supported, and support PMPs in Solaris.
I don't know about support on Sparc though.

http://www.newegg.com/Product/Product.aspx?Item=N82E16816132021
http://www.newegg.com/Product/Product.aspx?Item=N82E16816132027

 Assuming the use of eSATA enclosures do do people recommend?  I don't
 need huge amounts of space; two drives should be enough and four will
 be plenty and allow for expansion.  Again, I'm looking for a JBOD coz
 I want ZFS do all the work.

Something similar to the Sans Digital enclosures would probably work.
They use a PMP to make all the drives available via one eSATA port, which
may or may not work. It's supposed to, but there are hardware
blacklists in the drivers that may cause you trouble.

Another thought is to ditch the Sun boxes and use an HP ProLiant
Microserver. It's about $320 and holds 4 drives, with an expansion
slot for an additional controller. I think some people have reported
success with these on the list.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] External SATA drive enclosures + ZFS?

2011-02-27 Thread Brandon High
On Sun, Feb 27, 2011 at 7:48 AM, taemun tae...@gmail.com wrote:
 eSATA has no need for any interposer chips between a modern SATA chipset on
 the motherboard and a SATA hard drive. You can buy cables with appropriate

eSATA has different electrical specifications, namely higher minimum
transmit power and lower minimum receive power. An internal SATA port
might work with a SATA-to-eSATA cable or adapter, but it's not
guaranteed to.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Performance

2011-02-27 Thread Brandon High
On Sun, Feb 27, 2011 at 6:59 AM, Edward Ned Harvey
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
 But there is one specific thing, isn't there?  Where ZFS will choose to use
 a different algorithm for something, when pool usage exceeds some threshold.
 Right?  What is that?

It moves from best fit to any fit at a certain point, which is at
~ 95% (I think). Best fit looks for a large contiguous space to avoid
fragmentation while any fit looks for any free space.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] What drives?

2011-02-26 Thread Brandon High
On Thu, Feb 24, 2011 at 10:45 PM, Markus Kovero markus.kov...@nebula.fi wrote:
 Hi! I'd go for WD RE edition. Blacks and Greens are for desktop use and 
 therefore lack proper TLER settings and have useless power saving features 
 that could induce errors and mysterious slowness.

There has been a lot of discussion about TLER in the past, and I'm
less convinced that it's a requirement for zfs than I used to think.
I've been using WD Green (EADS) drives for two years without issue.
They are ones whose sleep and TLER settings could still be changed, though.

Many of the new WD Green drives (including some of the RE) use 4k
sectors, which will wreak havoc on zpool performance. Other
manufacturers are starting to use 4k sectors on their 5400 rpm drives
as well so shop carefully if you decide to go with a lower spindle
speed. I have not seen a 7200 rpm drive with 4k sectors, but I'm sure
they exist.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] External SATA drive enclosures + ZFS?

2011-02-25 Thread Brandon High
On Fri, Feb 25, 2011 at 4:34 PM, Rich Teer rich.t...@rite-group.com wrote:
 Space is starting to get a bit tight here, so I'm looking at adding
 a couple of TB to my home server.  I'm considering external USB or
 FireWire attached drive enclosures.  Cost is a real issue, but I also

I would avoid USB, since it can be less reliable than other connection
methods. That's the impression I get from older posts made by Sun
devs, at least. I'm not sure how well Firewire 400 is supported, let
alone Firewire 800.

You might want to consider eSATA. Port multipliers are supported in
recent builds (128+ I think), and will give better performance than
USB. I'm not sure if PMPs are supported on Sparc though, since it
requires support in both the controller and the PMP.

Consider enclosures from other manufacturers as well. I've heard good
things about Sans Digital, but I've never used them. The 2-drive
enclosure has the same components as the item you linked but 1/2 the
cost via Newegg.

 The intent would be put two 1TB or 2TB drives in the enclosure and use
 ZFS to create a mirrored pool out of them.  Assuming this enclosure is
 set to JBOD mode, would I be able to use this with ZFS?  The enclosure

Yes, but I think the enclosure has a SiI5744 inside it, so you'll
still have one connection from the computer to the enclosure. If that
goes, you'll lose both drives. If you're just using two drives, two
separate enclosures on separate buses may be better. Look at
http://www.sansdigital.com/towerstor/ts1ut.html for instance. There
are also larger enclosures with up to 8 drives.

 I can't think of a reason why it wouldn't work, but I also have exactly
 zero experience with this kind of set up!

Like I mentioned, USB is prone to some flakiness.

 Assuming this would work, given that I can't see to find a 4-drive
 version of it, would I be correct in thinking that I could buy two of

You might be better off using separate enclosures for reliability.
Make sure to split the mirrors across the two devices. Use separate
USB controllers if possible, so a bus reset doesn't affect both sides.

 Assuming my proposed enclosure would work, and assuming the use of
 reasonable quality 7200 RPM disks, how would you expect the performance
 to compare with the differential UltraSCSI set up I'm currently using?
 I think the DWIS is rated at either 20MB/sec or 40MB/sec, so on the
 surface, the USB attached drives would seem to be MUCH faster...

USB 2.0 is about 30-40MB/s under ideal conditions, but doesn't support
any of the command queuing that SCSI does. I'd expect performance to
be slightly lower, and to use slightly more CPU. Most USB controllers
don't support DMA, so all I/O requires CPU time.

What about an inexpensive SAS card (eg: Supermicro AOC-USAS-L4i) and
external SAS enclosure (eg: Sans Digital TowerRAID TR4X). It would
cost about $350 for the setup.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS/Drobo (Newbie) Question

2011-02-08 Thread Brandon High
On Tue, Feb 8, 2011 at 12:53 PM, David Dyer-Bennet d...@dd-b.net wrote:
 Wait, are you saying that the handling of errors in RAIDZ and mirrors is
 completely different?  That it dumps the mirror disk immediately, but
 keeps trying to get what it can from the RAIDZ disk?  Because otherwise,
 you assertion doesn't seem to hold up.

I think he meant that if one drive in a mirror dies completely, then
any single read error on the remaining drive is not recoverable.

With raidz2 (or a 3-way mirror for that matter), if one drive dies
completely, you still have redundancy.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS/Drobo (Newbie) Question

2011-02-07 Thread Brandon High
On Sat, Feb 5, 2011 at 9:54 AM, Gaikokujin Kyofusho
gaikokujinkyofu...@gmail.com wrote:
 Just to make sure I understand your example, if I say had a 4x2tb drives, 
 2x750gb, 2x1.5tb drives etc then i could make 3 groups (perhaps 1 raidz1 + 1 
 mirrored + 1 mirrored), in terms of accessing them would they just be mounted 
 like 3 partitions or could it all be accessed like one big partition?

You could add them to one pool, and then create multiple filesystems
inside the pool. Your total storage would be the sum of the drives'
capacity after redundancy, or 3x2tb + 750gb + 1.5tb.

It's not recommended to use different levels of redundancy in a pool,
so you may want to consider using mirrors for everything. This also
makes it easier to add or upgrade capacity later.
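
An all-mirrors layout would look something like this (device names are
made up; pair the equal-sized drives):

# zpool create tank mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0 \
    mirror c2t0d0 c2t1d0 mirror c2t2d0 c2t3d0

Filesystems created inside the pool all draw from the same free space,
so it behaves like one big partition.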

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS

2011-02-07 Thread Brandon High
On Mon, Feb 7, 2011 at 6:15 AM, Yi Zhang yizhan...@gmail.com wrote:
 On Mon, Feb 7, 2011 at 12:25 AM, Richard Elling
 richard.ell...@gmail.com wrote:
 Solaris UFS directio has three functions:
        1. improved async code path
        2. multiple concurrent writers
        3. no buffering

 Thanks for the comments, Richard. All I wanted is to achieve 3 on ZFS.
 But as I said, apprently 2.a) below didn't give me that. Do you have
 any suggestion?

Don't. Use a ZIL, which will meet the requirements for synchronous IO.
Set primarycache to metadata to prevent caching reads.
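
For example (dataset name is made up):

# zfs set primarycache=metadata tank/db

and if you want the log on a faster device, a dedicated slog can be
added with something like 'zpool add tank log c3t0d0'.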

ZFS is a very different beast than UFS and doesn't require the same tuning.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS

2011-02-07 Thread Brandon High
On Mon, Feb 7, 2011 at 10:29 AM, Yi Zhang yizhan...@gmail.com wrote:
 I already set primarycache to metadata, and I'm not concerned about
 caching reads, but caching writes. It appears writes are indeed cached
 judging from the time of 2.a) compared to UFS+directio. More
 specifically, 80MB/2s=40MB/s (UFS+directio) looks realistic while
 80MB/0.11s=800MB/s (ZFS+primarycache=metadata) doesn't.

You're trying to force a solution that isn't relevant for the
situation. ZFS is not UFS, and solutions that are required for UFS to
work correctly are not needed with ZFS.

Yes, writes are cached, but all the POSIX requirements for synchronous
IO are met by the ZIL. As long as your storage devices, be they SAN,
DAS or somewhere in between respect cache flushes, you're fine. If you
need more performance, use a slog device that respects cache flushes.
You don't need to worry about whether writes are being cached, because
any data that is written synchronously will be committed to stable
storage before the write returns.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and spindle speed (7.2k / 10k / 15k)

2011-02-06 Thread Brandon High
On Sat, Feb 5, 2011 at 3:34 PM, Roy Sigurd Karlsbakk r...@karlsbakk.net wrote:
 so as not to exceed the channel bandwidth. When they need to get higher disk
 capacity, they add more platters.

 May this mean those drives are more robust in terms of reliability, since the 
 leaks between sectors is less likely with the lower density?

More platters leads to more heat and higher power consumption. Most
drives are 3 or 4 platters, though Hitachi usually manufactures 5
platter drives as well.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and spindle speed (7.2k / 10k / 15k)

2011-02-02 Thread Brandon High
On Wed, Feb 2, 2011 at 6:10 AM, Edward Ned Harvey
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
 Don't know why you'd assume that.  I would assume a 2TB drive would be
 precisely double the sequential throughput of a 500G.  I think if you double

That's assuming that the drives have the same number of platters. 500G
drives are generally one platter, and 2T drives are generally 4
platters. Same size platters, same density. The 500G drive could be
expected to have slightly higher random iops due to lower mass in the
heads, but it's probably not statistically significant.

I think the current batch of 3TB drives are 7200 RPM with 5 platters
and 667GB per platter or 5400 RPM with 4 platters at 750GB/platter.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and TRIM

2011-01-29 Thread Brandon High
On Sat, Jan 29, 2011 at 8:31 AM, Edward Ned Harvey
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
 What is the status of ZFS support for TRIM?

I believe it's been supported for a while now.
http://www.c0t0d0s0.org/archives/6792-SATA-TRIM-support-in-Opensolaris.html

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] reliable, enterprise worthy JBODs?

2011-01-26 Thread Brandon High
On Wed, Jan 26, 2011 at 2:20 AM, Lasse Osterild lass...@unixzone.dk wrote:
 On 01/26/11 09:50 AM, Lasse Osterild wrote:
 That's an odd comment.  I've used a fair bit of SuperMicro kit over the 
 years and I wouldn't describe any of it as low-quality.

 Nothing odd about it - I've had three SC847E26-JBOD cases and they've all 
 been faulty in one way or another, looking closer at the circuit boards I see 
 bad soldering in a lot of places, components which have clearly been exposed 
 to too much heat during soldering.  And SuperMicro being less than helpful  
 competent in solving the issues.

I think it depends on what you're used to. SuperMicro is great
whitebox gear, and if you go through a VAR that assembles and tests it,
it can be as reliable as anything else. It can be a huge hassle to handle
RMA and parts if you're buying the gear from Provantage or Newegg
though.

For home use, I'd buy Supermicro, ASUS, etc. through Newegg, etc. and
assemble it myself.

For a small business, I'd buy HP or Dell, since they have great deals
for SMB and decent support.

For a small to medium business that can afford to have good sysadmins,
I'd be fine with SuperMicro systems purchased through a VAR. This is
probably the category that many people on this list fall into.
Purchasing through a VAR adds to the cost, but you can often get same-
or next-day turnaround. (Obligatory plug: A friend has had good luck
purchasing through Silicon Mechanics and recommends them, especially
if you're in the Seattle area.)

For large enterprise shops, there is usually sufficient volume and
negotiation that you're tied to a particular vendor. My current
employer uses Netapp and in-house systems. Previous employers have
been 100% HP, 100% Dell, or 100% Sun depending on the purchasing
agreements in place, and I've found them to all be about the same.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] reliable, enterprise worthy JBODs?

2011-01-25 Thread Brandon High
On Tue, Jan 25, 2011 at 10:04 AM, Philip Brown p...@bolthole.com wrote:
 So, another hardware question :)

 ZFS has been touted as taking maximal advantage of disk hardware, to the 
 point where it can be used efficiently and cost-effectively on JBODs, rather 
 than having to throw more expensive RAID arrays at it.

Off the top of my head, I can think of 3 sources: LSI, Dell and Supermicro.

LSI sells the 620J and 630J. I believe these are what Dell re-labels
as the M1000.

Supermicro makes server chassis and sells JBOD kits.

There are many more, if you take time to look.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Is my bottleneck RAM?

2011-01-20 Thread Brandon High
On Thu, Jan 20, 2011 at 8:18 AM, Eugen Leitl eu...@leitl.org wrote:
 Oh, and with 4x 3 TByte SATA mirrored pool is pretty much without
 alternative, right?

You can also use raidz2, which will have a little more resiliency.
With mirroring, you can lose one disk without data loss, but losing a
second disk might destroy your data.

With raidz2, you can lose any 2 disks, but you pay for it with
somewhat lower performance.
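
For example (device names are made up):

# zpool create tank raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0

With 4x3TB drives that's roughly 6TB usable, the same as two mirrored
pairs, but any two drives can fail without data loss.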

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

