Re: [zfs-discuss] what have you been buying for slog and l2arc?
On Mon, Aug 6, 2012 at 2:15 PM, Stefan Ring wrote:
> So you're saying that SSDs don't generally flush data to stable medium
> when instructed to? So data written before an fsync is not guaranteed
> to be seen after a power-down?

It depends on the model. Consumer models are less likely to immediately flush. My understanding is that this is done in part to allow some write coalescing and reduce the number of P/E cycles. Enterprise models should either flush, or contain a supercapacitor that provides enough power for the drive to finish writing any data in its buffer.

> If that -- ignoring cache flush requests -- is the whole reason why
> SSDs are so fast, I'm glad I haven't got one yet.

They're fast for random reads and writes because they don't have seek latency. They're fast for sequential IO because they aren't limited by spindle speed.

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
On Mon, Jul 30, 2012 at 7:11 AM, GREGG WONDERLY wrote:
> I thought I understood that copies would not be on the same disk, I guess I
> need to go read up on this again.

ZFS attempts to put copies on separate devices, but there's no guarantee.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Persistent errors?
On Mon, Jun 18, 2012 at 3:55 PM, sol wrote:
> It seems as though every time I scrub my mirror I get a few megabytes of
> checksum errors on one disk (luckily corrected by the other). Is there some
> way of tracking down a problem which might be persistent?

Check the output of 'fmdump -eV'; it should have some (rather extensive) information.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Migration of a Thumper to bigger HDDs
On Thu, May 17, 2012 at 2:50 PM, Jim Klimov wrote:
> New question: if the snv_117 does see the 3Tb disks well,
> the matter of upgrading the OS becomes not so urgent - we
> might prefer to delay that until the next stable release
> of OpenIndiana or so.

There were some pretty major fixes and new features added between snv_117 and snv_134 (the last OpenSolaris release). It might be worth updating to snv_134 at the very least.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] checking/fixing busy locks for zfs send/receive
On Fri, Mar 16, 2012 at 2:35 PM, Philip Brown wrote:
> if there isn't a process visible doing this via ps, I'm wondering how
> one might check if a zfs filesystem or snapshot is rendered "busy" in
> this way, interfering with an unmount or destroy?
>
> I'm also wondering if this sort of thing can mean interference between
> some combination of multiple send/receives at the same time, on the
> same filesystem?

Look at 'zfs hold', 'zfs holds', and 'zfs release'. Sends and receives will place holds on snapshots to prevent them from being changed.
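A minimal sketch of checking for and clearing a hold (the snapshot name and hold tag here are hypothetical; 'zfs holds' shows the actual tag to pass to 'zfs release'):

# zfs holds tank/fs@snap1                  # list user holds on the snapshot
# zfs release .send-1234-0 tank/fs@snap1   # release a hold by its tag

Only do this for holds left behind by an aborted operation; releasing a hold out from under an in-flight send or receive is asking for trouble.

-B

-- Brandon High : bh...@freaks.com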
Re: [zfs-discuss] Compatibility of Hitachi Deskstar 7K3000 HDS723030ALA640 with ZFS
On Tue, Mar 6, 2012 at 2:40 AM, Koopmann, Jan-Peter wrote:
> Do you or anyone else have experience with the 3TB 5K3000 drives
> (namely HDS5C3030ALA630)? I am thinking of replacing my current 4*1TB drives
> with 4*3TB drives (home server). Any issues with TLER or the like?

I have been using 8 x 3TB 5k3000 in a raidz2 for about a year without issue. The Deskstar 3TB comes off the same production line as the Ultrastar 5k3000. I would avoid the 2TB and smaller 5k3000; they come off a separate production line.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Compatibility of Hitachi Deskstar 7K3000 HDS723030ALA640 with ZFS
On Mon, Mar 5, 2012 at 9:52 AM, luis Johnstone wrote:
> As far as I can tell, the Hitachi Deskstar 7K3000 (HDS723030ALA640) uses
> 512B sectors and so I presume does not suffer from such issues (because it
> doesn't lie about the physical layout of sectors on-platter)

Both the 7K3000 and 5K3000 drives have 512B physical sectors.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Server upgrade
On Wed, Feb 15, 2012 at 9:16 AM, David Dyer-Bennet wrote:
> Is there an upgrade path from (I think I'm running Solaris Express) to
> something modern? (That could be an Oracle distribution, or the free

There *was* an upgrade path from snv_134 to snv_151a (Solaris 11 Express) but I don't know if Oracle still supports it. There was an intermediate step or two along the way (snv_134b I think?) to move from OpenSolaris to Oracle Solaris.

As others mentioned, you could jump to OpenIndiana from your current version. You may not be able to move between OI and S11 in the future, so it's a somewhat important decision.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] grrr, How to get rid of mis-touched file named `-c'
On Wed, Nov 23, 2011 at 11:43 AM, Harry Putnam wrote:
> OK, I'm out of escapes. or other tricks... other than using emacs but
> I haven't installed emacs as yet.
>
> I can just ignore them of course, until such time as I do get emacs
> installed, but by now I just want to know how it might be done from a
> shell prompt.

rm ./-c ./-O ./-k
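If your rm supports the POSIX end-of-options marker (Solaris /usr/bin/rm and GNU rm both should), this works as well:

rm -- -c -O -k

The '--' tells rm that everything after it is a filename, not an option.

-- Brandon High : bh...@freaks.com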
Re: [zfs-discuss] Replacement for X25-E
On Thu, Sep 22, 2011 at 12:53 PM, Ray Van Dolson wrote:
> It seems to perform similarly to the X-25E as well (3300 IOPS for
> random writes). Perhaps the drive can be overprovisioned as well?
>
> My impression was that Intel was classifying the 3xx series as
> non-Enterprise however. Even with the SLC.

I don't think the 311 has any over-provisioning (other than the 7% from GB -> GiB conversion). I believe it is an X25-E with only 5 channels populated. The upcoming enterprise models are MLC based and have greater over-provisioning AFAIK.

The 20GB 311 only costs ~ $100 though. The 100GB Intel 710 costs ~ $650. The 311 is a good choice for home or budget users, and it seems that the 710 is much bigger than it needs to be for slog devices.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Deskstars and CCTL (aka TLER)
On Wed, Sep 7, 2011 at 7:40 PM, Daniel Carosone wrote:
> Looks like another positive for these drives over the "competition".
> The same appears to be the case for the 5k3000's as well (page 96 in
> that document).

Be careful with the smaller 5k3000 drives. The 1TB and 2TB drives are not manufactured on the same line as the Ultrastar and seem to have lower reliability. Only the 3TB 5k3000 shares specs with the Ultrastar 5k3000.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Replacement for X25-E
On Tue, Sep 20, 2011 at 12:21 AM, Markus Kovero wrote:
> Hi, I was wondering do you guys have any recommendations as replacement for
> Intel X25-E as it is being EOL'd? Mainly as for log device.

The Intel 311 seems like a good fit. It's a 20GB SLC device intended to act as a cache device with the Z68 chipset.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Deskstars and CCTL (aka TLER)
On Wed, Sep 7, 2011 at 2:20 AM, Roy Sigurd Karlsbakk wrote:
> Does anyone know if this is possible from OI/Solaris, or if this needs to be
> done on driver level?

You should be able to do it via smartctl. The setting does not persist through power cycles, so you'll want to add it to a startup script.
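A sketch of what that startup script might run (the device path is hypothetical, and not every drive/firmware combination accepts the command; 70 means 7.0 seconds):

# smartctl -l scterc,70,70 /dev/rdsk/c0t0d0   # set read/write error recovery to 7s
# smartctl -l scterc /dev/rdsk/c0t0d0         # verify the current setting

-B

-- Brandon High : bh...@freaks.com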
Re: [zfs-discuss] ZFS raidz on top of hardware raid0
On Fri, Aug 12, 2011 at 6:34 PM, Tom Tang wrote:
> Suppose I want to build a 100-drive storage system, wondering if there is any
> disadvantages for me to setup 20 arrays of HW RAID0 (5 drives each), then
> setup ZFS file system on these 20 virtual drives and configure them as RAIDZ?

A 20-device wide raidz is a bad idea. Making those devices from stripes just compounds the issue. The biggest problem is that resilvering would be a nightmare, and you're practically guaranteed to have additional failures or read errors while degraded.

You would achieve better performance, error detection and recovery by using several top-level raidz. 20 x 5-disk raidz would give you very good read and write performance with decent resilver times and 20% overhead for redundancy. 10 x 10-disk raidz2 would give more protection, but a little less performance, and higher resilver times.
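A sketch of the 20 x 5-disk layout (device names are hypothetical; the raidz clause repeats once per top-level vdev, 20 in all):

# zpool create tank \
    raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 \
    raidz c1t5d0 c1t6d0 c1t7d0 c2t0d0 c2t1d0 \
    ...

Each 'raidz' keyword starts a new top-level vdev, and ZFS stripes across all of them.

-B

-- Brandon High : bh...@freaks.com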
Re: [zfs-discuss] Intel 320 as ZIL?
On Mon, Aug 15, 2011 at 2:07 PM, Ray Van Dolson wrote:
> Looks interesting... specs around the same as the old X-25E. We have
> heard however, that Intel will be announcing a true successor to their
> X-25E line shortly.

I think it's the 710 and 720 that you're referring to. The 710 is MLC-HET (high endurance) and will be in 100/200/300GB capacities. The 720 is SLC, but with a PCIe interface, and will be in 200/400GB capacities.

I don't imagine either will be very cheap.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Intel 320 as ZIL?
On Thu, Aug 11, 2011 at 1:00 PM, Ray Van Dolson wrote:
> Are any of you using the Intel 320 as ZIL? It's MLC based, but I
> understand its wear and performance characteristics can be bumped up
> significantly by increasing the overprovisioning to 20% (dropping
> usable capacity to 80%).

Intel recently added the 311, a small SLC-based drive for use as a temp cache with their Z68 platform. It's limited to 20GB, but it might be a better fit for use as a ZIL than the 320.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Disk IDs and DD
On Tue, Aug 9, 2011 at 8:20 AM, Paul Kraus wrote:
> Nothing to worry about here. Controller IDs (c) are assigned
> based on the order the kernel probes the hardware. On the SPARC
> systems you can usually change this in the firmware (OBP), but they
> really don't _mean_ anything (other than the kernel found c8 before it
> found c9).

If you're really bothered by the device names, you can rebuild the device map. There's no reason to do it unless you've had to replace hardware, etc. The steps are similar to these:
http://spiralbound.net/blog/2005/12/21/rebuilding-the-solaris-device-tree

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] ZFS Fragmentation issue - examining the ZIL
On Mon, Aug 1, 2011 at 4:27 PM, Daniel Carosone wrote:
> The other thing that can cause a storm of tiny IOs is dedup, and this
> effect can last long after space has been freed and/or dedup turned
> off, until all the blocks corresponding to DDT entries are rewritten.
> I wonder if this was involved here.

Using dedup on a pool that houses an Oracle DB is Doing It Wrong in so many ways...

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Exapnd ZFS storage.
On Wed, Aug 3, 2011 at 3:02 AM, Nix wrote:
> I have 4 disk with 1 TB of disk and I want to expand the zfs pool size.
>
> I have 2 more disk with 1 TB of size.
>
> Is it possible to expand the current RAIDz array with new disk?

You can't add the new drives to your current vdev. You can create another vdev to add to your pool though.

If you're adding another vdev, it should have the same geometry as your current one (ie: 4 drives). The zpool command will complain if you try to add a vdev with different geometry or redundancy, though you can force it with -f.
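A sketch of both cases (device names are hypothetical). Adding a matching second 4-drive raidz vdev:

# zpool add tank raidz c5t0d0 c5t1d0 c5t2d0 c5t3d0

With only your two new drives, the non-matching alternative would need the force flag, e.g.:

# zpool add -f tank mirror c5t0d0 c5t1d0

-B

-- Brandon High : bh...@freaks.com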
Re: [zfs-discuss] ZFS Fragmentation issue - examining the ZIL
On Mon, Aug 1, 2011 at 2:16 PM, Neil Perrin wrote:
> In general the blog's conclusion is correct. When file systems get full
> there is fragmentation (happens to all file systems) and for ZFS the pool
> uses gang blocks of smaller blocks when there are insufficient large blocks.

The blog doesn't mention how full the pool was. It's pretty well documented that performance takes a nosedive at a certain point.

A slow scrub is actually not related to the problems in the blog post, since there aren't a lot of writes during (or at least caused by) a scrub.

Fragmentation is a real issue with pools that are (or have been) very full. The data gets written out in fragments and has to be read back in the same order. If the mythical bp_rewrite code ever shows up, it will be possible to defrag a pool. But not yet.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] recover zpool with a new installation
On Tue, Jul 26, 2011 at 1:14 PM, Cindy Swearingen <cindy.swearin...@oracle.com> wrote:
> Yes, you can reinstall the OS on another disk and as long as the
> OS install doesn't touch the other pool's disks, your
> previous non-root pool should be intact. After the install
> is complete, just import the pool.

You can also use the Live CD or Live USB to access your pool or possibly fix your existing installation. You will have to force the zpool import with either a reinstall or a Live boot.
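A sketch of what that looks like from the new environment (the pool name is hypothetical; -f is needed because the pool is still marked as in use by the old install):

# zpool import         # list pools the system can see
# zpool import -f tank # force the import

-B

-- Brandon High : bh...@freaks.com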
Re: [zfs-discuss] SSD vs "hybrid" drive - any advice?
On Tue, Jul 26, 2011 at 7:51 AM, David Dyer-Bennet wrote:
> "Processing" the request just means flagging the blocks, though, right?
> And the actual benefits only accrue if the garbage collection / block
> reshuffling background tasks get a chance to run?

I think that's right. TRIM just gives hints to the garbage collector that sectors are no longer in use. When the GC runs, it can more easily find flash blocks that aren't used, or combine several mostly-empty blocks and erase or otherwise free them for reuse later.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] SSD vs "hybrid" drive - any advice?
On Tue, Jul 26, 2011 at 5:59 AM, Edward Ned Harvey <opensolarisisdeadlongliveopensola...@nedharvey.com> wrote:
> like 4%, and for some reason (I don't know why) there's a benefit to
> optimizing on 8k pages. Which means no. If you overwrite a sector of a

From what I've heard it's due in large part to the FAT file system, since it's used in a lot of embedded systems as well as on flash cards. The FAT cluster size is 32k, so any flash block that's a multiple of 32k works well. Page sizes are usually 2k with a 128k erase block, 4k with a 256k erase block, or 4k with a 512k erase block.

It's also due to ECC reasons, since a larger block size allows more efficient ECC over a larger block of data. This is similar to the move to 4k sectors in magnetic drives.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Large scale performance query
On Sun, Jul 24, 2011 at 11:34 PM, Phil Harrison wrote:
> What kind of performance would you expect from this setup? I know we can
> multiply the base IOPS by 24 but what about max sequential read/write?

You should have a theoretical max close to 144x single-disk throughput. Each raidz3 has 6 "data drives" which can be read from simultaneously, multiplied by your 24 vdevs.

Of course, you'll hit your controllers' limits well before that. Even with a controller per JBOD, you'll be limited by the SAS connection. The 7k3000 has throughput from 115 - 150 MB/s, meaning each of your JBODs will be capable of 5.2 GB/sec - 6.8 GB/sec, roughly 10 times the bandwidth of a single SAS 6g connection. Use multipathing if you can to increase the bandwidth to each JBOD.

Depending on the types of access that clients are performing, your cache devices may not be any help. If the data is read multiple times by multiple clients, then you'll see some benefit. If it's only being read infrequently or by one client, it probably won't help much at all. That said, if your access is mostly sequential then random access latency shouldn't affect you too much, and you will still have more bandwidth from your main storage pools than from the cache devices.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Replacing failed drive
On Fri, Jul 22, 2011 at 1:12 PM, Chris Dunbar - Earthside, LLC wrote:
> I have physically replaced the drive, but I have not partitioned it yet. I
> know there is a command to copy the layout from one disk to another and that
> has worked well for me in the past. I just have to find the command again.
> Once that is done, do I need to detach the spare before I run the replace
> command or does running the replace command automatically bump the spare out
> of service and put it back to being just a spare?

Since it isn't the rpool, you shouldn't have to partition the replacement drive. Since you've physically replaced the drive, you should just have to do:

# zpool replace tank c10t0d0

The pool should resilver, and I think the spare should automatically detach. If not,

# zpool remove tank c10t6d0

should take care of it.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] SSD vs "hybrid" drive - any advice?
On Thu, Jul 21, 2011 at 4:08 PM, Gordon Ross wrote:
> And then for about $400 one can get a 250GB SSD, such as:
> Crucial M4 CT256M4SSD2 2.5" 256GB SATA III MLC Internal Solid State
> Drive (SSD)
> http://www.newegg.com/Product/Product.aspx?Item=N82E16820148443
>
> Anyone have experience with either one? (good or bad)

The hybrid drive might accelerate some operations. No guarantees, though. It's about as fast as a WD Velociraptor in some operations, and the same as the regular Seagate 500GB in others. There is a decent review of it at Anandtech.

The M4 is pretty decent, though the Vertex 3 and other Sandforce 2000-based drives beat it in benchmarks. Honestly though, you'll probably be very happy with any recent SSD, eg: C300, M4, Intel 320, Intel 510, Sandforce 1200-based (Vertex 2, Phoenix Pro, etc), Sandforce 2200-based (Vertex 3, Corsair Force GT, Patriot Wildfire, etc).

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] latest zpool version in solaris 11 express
On Mon, Jul 18, 2011 at 6:21 AM, Edward Ned Harvey wrote:
> Kidding aside, for anyone finding this thread at a later time, here's the
> answer. It sounds unnecessarily complex at first, but then I went through
> it ... Only took like a minute or two. It was exceptionally easy in fact.
> https://pkg-register.oracle.com

Do you need a support contract in order to access the certificate application? I'm getting the following error when I try to get a cert:

"There has been a problem with contacting the entitlement server. You will only be able to issue new certificates for public products. Please try again later"

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Zil on multiple usb keys
On Sun, Jul 17, 2011 at 12:13 PM, Edward Ned Harvey wrote:
> Actually, you can't do that. You can't make a vdev from other vdev's, and
> when it comes to striping and mirroring your only choice is to do it the
> right way.
>
> If you were REALLY trying to go out of your way to do it wrong somehow, I
> suppose you could probably make a zvol from a stripe, and then export it to
> yourself via iscsi, repeat with another zvol, and then mirror the two iscsi
> targets. ;-) You might even be able to do the same crazy thing with simply
> zvols and no iscsi... But either way you'd really be going out of your way
> to create a problem. ;-)

The right way to do it, um, incorrectly is to create a striped device using SVM, and use that as a vdev for your pool. So yes, you could create two 800GB stripes, and use them to create a ZFS mirror. But it would be a really bad idea.
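A sketch of that bad idea (slice names are hypothetical; SVM metadevices show up as ordinary block devices that zpool will happily accept):

# metainit d10 1 2 c1t0d0s0 c1t1d0s0   # one stripe of two slices
# metainit d11 1 2 c2t0d0s0 c2t1d0s0
# zpool create tank mirror /dev/md/dsk/d10 /dev/md/dsk/d11

Again: don't actually do this.

-B

-- Brandon High : bh...@freaks.com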
Re: [zfs-discuss] Replacement disks for Sun X4500
On Wed, Jul 6, 2011 at 10:12 PM, X4 User wrote:
> I am bumping this thread because I too have the same question ... can I put
> modern 3TB disks (hitachi deskstars) into an old x4500 ?

I have 8 x 3TB drives (Deskstar 5k3000) attached to a Supermicro AOC-SAT2-MV8 and it works fine. This card uses the same Marvell controller as the x4500.

Performance is fine if not slightly better than the WD10EADS drives that I replaced. Of course, the pool was about 92% full with the smaller drives ...

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Pure SSD Pool
On Tue, Jul 12, 2011 at 12:14 PM, Eric Sproul wrote:
> I see, thanks for that explanation. So finding drives that keep more
> space in reserve is key to getting consistent performance under ZFS.

More spare area might give you more performance, but the big difference is the lifetime of the device. A device with more spare area can handle more writes. Within a given capacity range (eg: 50-64 GB drives built on 64 GiB of flash), the drive with more spare will last longer but may not offer a performance benefit.

Higher capacity drives will offer better performance because they have more flash channels to write to, and they should last longer because while the spare area is the same percentage of total capacity, it's numerically larger.

A "consumer" 240GB drive (256GiB flash) will have about 32GiB of spare area. An "enterprise" 50GB (64GiB flash) drive will have about 17GiB of spare area, or roughly 27% of the total capacity. Even though the consumer drive only sets aside ~ 13% for spare, it's so much larger that it will last longer at any given rate of writing. If you were to completely fill and re-fill each drive, the consumer drive will fail earlier, but you'd have to write nearly 5x as much data to fill it even once.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Pure SSD Pool
On Tue, Jul 12, 2011 at 7:41 AM, Eric Sproul wrote:
> But that's exactly the problem-- ZFS being copy-on-write will
> eventually have written to all of the available LBA addresses on the
> drive, regardless of how much live data exists. It's the rate of
> change, in other words, rather than the absolute amount that gets us
> into trouble with SSDs. The SSD has no way of knowing what blocks

Most "enterprise" SSDs use something like 30% for spare area. So a drive with 128GiB (base 2) of flash will have 100GB (base 10) of available storage. A consumer level drive will have ~ 6% spare, or 128GiB of flash and 128GB of available storage. Some drives have 120GB available, but still have 128GiB of flash and therefore slightly more spare area. Controllers like the Sandforce that do some dedup can give you even more effective spare area, depending on the type of data.

When the OS starts reusing LBAs, the drive will re-map them into new flash blocks in the spare area and may perform garbage collection on the now partially used blocks. The effectiveness of this depends on how quickly the system is writing and how full the drive is.

I failed to mention earlier that ZFS's write aggregation is also helpful when used with flash drives since it can help to ensure that a whole flash block is written at once. Increasing the ashift value to 4k when the pool is created may also help.

> Now, others have hinted that certain controllers are better than
> others in the absence of TRIM, but I don't see how GC could know what
> blocks are available to be erased without information from the OS.

The changed LBAs are remapped rather than overwritten in place. The drive knows which LBAs in a flash block have been re-mapped, and can do garbage collection when the right criteria are met.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Pure SSD Pool
On Mon, Jul 11, 2011 at 7:03 AM, Eric Sproul wrote:
> Interesting-- what is the suspected impact of not having TRIM support?

There shouldn't be much, since zfs isn't changing data in place. Any drive with reasonable garbage collection (which is pretty much everything these days) should be fine until the volume gets very full.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Cannot format 2.5TB ext disk (EFI)
On Thu, Jun 23, 2011 at 1:20 PM, Richard Elling wrote:
> 2TB limit for 32-bit Solaris. If you hit this, then you'll find a lot of
> complaints at boot.
> By default, an Ultra-24 should boot 64-bit. Dunno about the HBA, though...

I think the limit is 1TB for 32-bit. I've tried to use 2TB drives on an Atom N270-based board and they were not recognized, but they worked fine under FreeBSD.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] JBOD recommendation for ZFS usage
On Mon, May 30, 2011 at 6:16 PM, Jim Klimov wrote:
> Also some articles stated that at one time there were
> single-port SAS drives, so there are at least two SAS
> connectors after all ;)

Nope, only one mechanical connector.

A dual port cable can be used with a single- or dual-ported SAS device, or with SATA drives. A single port cable can be used with a single- or dual-ported SAS device (although it will only use one port) or with a SATA drive. A SATA cable can be used with a SATA device.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] optimal layout for 8x 1 TByte SATA (consumer)
On Thu, May 26, 2011 at 9:34 AM, Eugen Leitl wrote:
> How bad would raidz2 do on mostly sequential writes and reads
> (Athlon64 single-core, 4 GByte RAM, FreeBSD 8.2)?

I was using a similar but slightly higher spec setup (quad-core cpu & 8 GB RAM) at home and didn't have any problems with an 8-drive raidz2, though my usage is fairly light. The system is more than fast enough to saturate gigabit ethernet for sequential reads and writes. My drives were WD10EADS "Green" drives.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] offline dedup
On Thu, May 26, 2011 at 8:37 AM, Edward Ned Harvey wrote:
> Question: Is it possible, or can it easily become possible, to periodically
> dedup a pool instead of keeping dedup running all the time? It is easy to

I think it's been discussed before, and the conclusion is that it would require bp_rewrite. Offline (or deferred) dedup certainly seems more attractive given the current real-time performance.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] ZFS, Oracle and Nexenta
On Tue, May 24, 2011 at 3:17 PM, Peter Jeremy wrote:
> I believe the various OSS projects that use ZFS have formed a working
> group to co-ordinate ZFS amongst themselves. I don't know if Oracle
> was invited to join (though given the way Oracle has behaved in all

Richard would probably know for certain. There will probably be a fork at some point to an OSS ZFS and an Oracle ZFS. Hopefully neither side will actively try to break compatibility.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] ZFS, Oracle and Nexenta
On Tue, May 24, 2011 at 12:41 PM, Richard Elling wrote:
> There are many ZFS implementations, each evolving as the contributors desire.
> Diversity and innovation is a good thing.

... unless Oracle's zpool v30 is different than Nexenta's v30.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Monitoring disk seeks
On Thu, May 19, 2011 at 5:35 AM, Sašo Kiselkov wrote:
> I'd like to ask whether there is a way to monitor disk seeks. I have an
> application where many concurrent readers (>50) sequentially read a
> large dataset (>10T) at a fairly low speed (8-10 Mbit/s). I can monitor
> read/write ops using iostat, but that doesn't tell me how contiguous the
> data is, i.e. when iostat reports "500" read ops, does that translate to
> 500 seeks + 1 read per seek, or 50 seeks + 10 reads, etc? Thanks!

You can sort of do this with a DTrace script. Something like (forgive my crappy script, I've only poked at DTrace a few times):

#pragma D option quiet

io:::done
/ args[1]->dev_name == "sd" && args[1]->dev_instance < 11 /
{
        /* timestamp is in nanoseconds; print seconds.milliseconds */
        printf("%d.%03d,%s,%i,%s,%i\n",
            timestamp / 1000000000,
            (timestamp / 1000000) % 1000,
            args[1]->dev_statname,                     /* device, eg sd0 */
            args[0]->b_lblkno,                         /* starting LBA */
            (args[0]->b_flags & B_WRITE ? "W" : "R"),  /* direction */
            args[0]->b_bcount);                        /* I/O size in bytes */
}

For every completed IO, this should give you the timestamp, device name, start LBA, "R"ead or "W"rite and length of the IO. Large jumps in the LBA column between consecutive IOs on the same device are your seeks.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Solaris vs FreeBSD question
On Wed, May 18, 2011 at 5:47 AM, Paul Kraus wrote:
> P.S. If anyone here has a suggestion as to how to get Solaris to load
> I would love to hear it. I even tried disabling multi-cores (which
> makes the CPUs look like dual core instead of quad) with no change. I
> have not been able to get serial console redirect to work so I do not
> have a good log of the failures.

Have you checked your system in the HCL device tool at http://www.sun.com/bigadmin/hcl/hcts/device_detect.jsp ? It should be able to tell you which device is causing the problem. If I remember correctly, you can feed it the output of 'lspci -vv -n'.

You may have to disable some on-board devices to get through the installer, but I couldn't begin to guess which.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Reboots when importing old rpool
On Tue, May 17, 2011 at 11:10 AM, Hung-ShengTsao (Lao Tsao) Ph.D. wrote:
> may be do
> zpool import -R /a rpool

'zpool import -N' may work as well.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Still no way to recover a "corrupted" pool
On Mon, May 16, 2011 at 1:55 PM, Freddie Cash wrote:
> Would not import in Solaris 11 Express. :( Could not even find any
> pools to import. Even when using "zpool import -d /dev/dsk" or any
> other import commands. Most likely due to using a FreeBSD-specific
> method of labelling the disks.

I think someone solved this before by creating a directory and making symlinks to the correct partition/slices on each disk. Then you can use 'zpool import -d /tmp/foo' to do the import. eg:

# mkdir /tmp/fbsd   # create a temp directory to point to the p0 partitions of the relevant disks
# ln -s /dev/dsk/c8t1d0p0 /tmp/fbsd/
# ln -s /dev/dsk/c8t2d0p0 /tmp/fbsd/
# ln -s /dev/dsk/c8t3d0p0 /tmp/fbsd/
# ln -s /dev/dsk/c8t4d0p0 /tmp/fbsd/
# zpool import -d /tmp/fbsd/ $POOLNAME

I've never used FreeBSD so I can't offer any advice about which device name is correct or if this will work. Posts from February 2010 "Import zpool from FreeBSD in OpenSolaris" indicate that you want p0.

> It's just frustrating that it's still possible to corrupt a pool in
> such a way that "nuke and pave" is the only solution. Especially when

I'm not sure it was the only solution, it's just the one you followed.

> What's most frustrating is that this is the third time I've built this
> pool due to corruption like this, within three months. :(

You may have an underlying hardware problem, or there could be a bug in the FreeBSD implementation that you're tripping over.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] 350TB+ storage solution
On Mon, May 16, 2011 at 8:33 AM, Richard Elling wrote:
> As a rule of thumb, the resilvering disk is expected to max out at around
> 80 IOPS for 7,200 rpm disks. If you see less than 80 IOPS, then suspect
> the throttles or broken data path.

My system was doing far less than 80 IOPS during resilver when I recently upgraded the drives. The older and newer drives were both 5k RPM drives (WD10EADS and Hitachi 5K3000 3TB) so I don't expect it to be super fast. The worst resilver was 50 hours, the best was about 20 hours.

This was just my home server, which is lightly used. The clients (2-3 CIFS clients, 3 mostly idle VBox instances using raw zvols, and 2-3 NFS clients) are mostly idle and don't do a lot of writes.

Adjusting zfs_resilver_delay and zfs_resilver_min_time_ms sped things up a bit, which suggests that the default values may be too conservative for some environments.
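For reference, a sketch of tweaking those at runtime with mdb (the values are examples only; these are undocumented kernel tunables and can change between builds):

# echo zfs_resilver_delay/W0t0 | mdb -kw           # drop the resilver throttle
# echo zfs_resilver_min_time_ms/W0t5000 | mdb -kw  # more resilver time per txg

-B

-- Brandon High : bh...@freaks.com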
Re: [zfs-discuss] 350TB+ storage solution
On Sat, May 14, 2011 at 11:20 PM, John Doe wrote:
>> 171 Hitachi 7K3000 3TB
> I'd go for the more environmentally friendly Ultrastar 5K3000 version - with
> that many drives you won't mind the slower rotation but WILL notice a
> difference in power and cooling cost

A word of caution: The Hitachi Deskstar 5K3000 drives in 1TB and 2TB are different than the 3TB. The 1TB and 2TB are manufactured in China, and have a very high failure and DOA rate according to Newegg. The 3TB drives come off the same production line as the Ultrastar 5K3000 in Thailand and may be more reliable.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] 350TB+ storage solution
On Sun, May 15, 2011 at 10:14 PM, Richard Elling wrote:
> On May 15, 2011, at 10:18 AM, Jim Klimov wrote:
>> In case of RAIDZ2 this recommendation leads to vdevs sized 6 (4+2), 10 (8+2)
>> or 18 (16+2) disks - the latter being mentioned in the original post.
>
> A similar theory was disproved back in 2006 or 2007. I'd be very surprised if
> there was a reliable way to predict the actual use patterns in advance. Features
> like compression and I/O coalescing improve performance, but make the old
> "rules of thumb" even more obsolete.

I thought that having data disks that were a power of two was still recommended, due to the way that ZFS splits records/blocks in a raidz vdev. Or are you responding to some other point?

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Tuning disk failure detection?
On Tue, May 10, 2011 at 9:18 AM, Ray Van Dolson wrote:
> My question is -- is there a way to tune the MPT driver or even ZFS
> itself to be more/less aggressive on what it sees as a "failure"
> scenario?

You didn't mention what drives you had attached, but I'm guessing they were normal "desktop" drives. I suspect (but can't confirm) that using enterprise drives with TLER / ERC / CCTL would have reported the failure up the stack faster than a consumer drive. The drives will report an error after 7 seconds rather than retry for several minutes.

You may be able to enable the feature on your drives, depending on the manufacturer and firmware revision.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] primarycache=metadata seems to force behaviour of secondarycache=metadata
On Mon, May 9, 2011 at 2:54 PM, Tomas Ögren wrote:
> Slightly off topic, but we had an IBM RS/6000 43P with a PowerPC 604e
> cpu, which had about 60MB/s memory bandwidth (which is kind of bad for a
> 332MHz cpu) and its disks could do 70-80MB/s or so.. in some other
> machine..

It wasn't that long ago when 66MB/s ATA was considered a waste because no drive could use that much bandwidth. These days a "slow" drive has max throughput greater than 110MB/s.

(OK, looking at some online reviews, it was about 13 years ago. Maybe I'm just old.)

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] ZFS on HP MDS 600
On Mon, May 9, 2011 at 8:33 AM, Darren Honeyball wrote:
> I'm just mulling over the best configuration for this system - our work load
> is mostly writing millions of small files (around 50k) with occasional reads
> & we need to keep as much space as possible.

If space is a priority, then raidz or raidz2 are probably the best bets. If you're going to have a lot of random iops, then mirrors are best. You have some control over the performance : space ratio with raidz by adjusting the width of the raidz vdevs.

For instance, mirrors will provide 34TB of space and the best random iops. 24 x 3-disk raidz vdevs will have 48TB of space but still have pretty strong random iops performance. 13 x 5-disk raidz vdevs will give 52TB of space at the cost of lower random iops. Testing will help you find the best configuration for your environment.

> HP's recommendations for configuring the MDS 600 with ZFS is to let the P212
> do the raid functions (raid 1+0 is recommended here) by configuring each half
> of the MDS 600 as a single logical drive (35 drives) & then use a basic zfs
> pool on top to provide the zfs functionality - to me this would seem to lose
> a lot of the error checking functions of zfs?

If you configured the two logical drives as a mirror in ZFS, then you'd still have full protection. Your overhead would be really high though - 3/4 of your original capacity would be used for data protection if I understand the recommendation correctly. (You'd use 1/2 of the original capacity for RAID1 in the MDS, then 1/2 of the remaining for the ZFS mirror.) You could use a non-redundant pool in ZFS to reduce the overhead, but you sacrifice the self-healing properties of ZFS when you do that.

> Another option is to use raidz and let zfs handle the smart stuff - as the
> P212 doesn't support a true dumb JBOD function I'd need to create each drive
> as a single raid 0 logical drive - are there any drawbacks to doing this? Or
> would it be better to create slightly larger logical drives using say 2
> physical drives per logical drive?

Single-device logical drives are required when you can't configure a card or device as JBOD, and I believe it's usually the recommended solution. Once you have the LUNs created, you can use ZFS to create mirrors or raidz vdevs.

> I'm planning on having 2 hot spares - one in each side of the MDS 600, is it
> also worth using a dedicated ZIL spindle or 2?

It would depend on your workload. (How's that for helpful?) If you're experiencing a lot of synchronous writes, then a ZIL will help. If you aren't seeing a lot of sync writes, then a ZIL won't help.

The ZIL doesn't have to be very large, since it's flushed on a regular basis. From the Best Practices guide: "For a target throughput of X MB/sec and given that ZFS pushes transaction groups every 5 seconds (and have 2 outstanding), we also expect the ZIL to not grow beyond X MB/sec * 10 sec. So to service 100MB/sec of synchronous writes, 1 GB of log device should be sufficient."

If the MDS has a non-volatile cache, there should be little or no need to use a ZIL. However, some reports have shown ZFS with a ZIL to be faster than using non-volatile cache. You should test performance using your workload.

> Is it worth tweaking zfs_nocacheflush or zfs_vdev_max_pending?

As I mentioned above, if the MDS has a non-volatile cache, then setting zfs_nocacheflush might help performance. If you're exporting one LUN per device then you shouldn't need to adjust the max_pending. If you're exporting larger RAID10 luns from the MDS, then increasing the value might help for read workloads.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Deduplication Memory Requirements
On Fri, May 6, 2011 at 9:15 AM, Ray Van Dolson wrote:
> We use dedupe on our VMware datastores and typically see 50% savings,
> often times more. We do of course keep "like" VM's on the same volume

I think NetApp uses 4k blocks by default, so the block size and alignment should match up for most filesystems and yield better savings. Your server's resource requirements for ZFS and dedup will be much higher due to the large DDT, as you initially suspected.

If bp_rewrite is ever completed and released, this might change. It should allow for offline dedup, which may make dedup usable in more situations.

> Apologies for devolving the conversation too much in the NetApp
> direction -- simply was a point of reference for me to get a better
> understanding of things on the ZFS side. :)

It's good to compare the two, since they have a pretty large overlap in functionality but sometimes very different implementations.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Deduplication Memory Requirements
On Thu, May 5, 2011 at 8:50 PM, Edward Ned Harvey wrote:
> If you have to use the 4k recordsize, it is likely to consume 32x more
> memory than the default 128k recordsize of ZFS. At this rate, it becomes
> increasingly difficult to get a justification to enable the dedup. But it's
> certainly possible.

You're forgetting that zvols use an 8k volblocksize by default. If you're currently exporting volumes with iSCSI it's only a 2x increase. The tradeoff is that you should have more duplicate blocks, and reap the rewards there. I'm fairly certain that it won't offset the large increase in the size of the DDT however. Dedup with zvols is probably never a good idea as a result.

Only if you're hosting your VM images in .vmdk files will you get 128k blocks. Of course, your chance of getting many identical blocks gets much, much smaller. You'll have to worry about the guests' block alignment in the context of the image file, since two identical files may not create identical blocks as seen from ZFS. This means you may get only fractional savings and have an enormous DDT.
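To sketch the arithmetic (assuming the oft-quoted ballpark of roughly 320 bytes of in-core footprint per DDT entry; the real figure varies by build):

1 TiB of unique data at 8 KiB blocks:   2^40 / 2^13 = ~134M entries x ~320 B = ~43 GB of DDT
1 TiB of unique data at 128 KiB blocks: 2^40 / 2^17 = ~8.4M entries x ~320 B = ~2.7 GB of DDT

Same data, 16x the table, before counting any savings from duplicate blocks.

-B

-- Brandon High : bh...@freaks.com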
Re: [zfs-discuss] Deduplication Memory Requirements
On Wed, May 4, 2011 at 8:23 PM, Edward Ned Harvey wrote:
> Generally speaking, dedup doesn't work on VM images. (Same is true for ZFS
> or netapp or anything else.) Because the VM images are all going to have
> their own filesystems internally with whatever blocksize is relevant to the
> guest OS. If the virtual blocks in the VM don't align with the ZFS (or
> whatever FS) host blocks... Then even when you write duplicated data inside
> the guest, the host won't see it as a duplicated block.

A zvol with 4k blocks should give you decent results with Windows guests. Recent versions use 4k alignment by default and 4k blocks, so there should be lots of duplicates for a base OS image.

> There are some situations where dedup may help on VM images... For example
> if you're not using sparse files and you have a zero-filed disk... But in

compression=zle works even better for these cases, since it doesn't require DDT resources.

> Or if you're intimately familiar with both the guest & host filesystems, and
> you choose blocksizes carefully to make them align. But that seems
> complicated and likely to fail.

Using a 4k block size is a safe bet, since most OSs use a block size that is a multiple of 4k. It's the same reason that the new "Advanced Format" drives use 4k sectors.

Windows uses 4k alignment and 4k (or larger) clusters.

ext3/ext4 uses 1k, 2k, or 4k blocks. Filesystems over 512MB should use 4k by default. The block alignment is determined by the partitioning, so some care needs to be taken there.

zfs uses 'ashift' size blocks. I'm not sure what ashift works out to be when using a zvol though, so it could be as small as 512b but may be set to the same as the blocksize property.

ufs is 4k or 8k on x86 and 8k on sun4u. As with ext4, block alignment is determined by partitioning and slices.
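A sketch of creating such a zvol (pool and volume names are hypothetical; volblocksize can only be set at creation time):

# zfs create -V 40G -o volblocksize=4k tank/vm/win7-c

-B

-- Brandon High : bh...@freaks.com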
Re: [zfs-discuss] Quick zfs send -i performance questions
On Thu, May 5, 2011 at 11:17 AM, Giovanni Tirloni wrote:
> What I find curious is that it only happens with incrementals. Full
> send's go as fast as possible (monitored with mbuffer). I was just wondering
> if other people have seen it, if there is a bug (b111 is quite old), etc.

I missed that you were using b111 earlier. That's probably a large part of the problem. There were a lot of performance and reliability improvements between b111 and b134, and there have been more between b134 and b148 (OI) or b151 (S11 Express). Updating the host you're receiving on to something more recent may fix the performance problem you're seeing.

Fragmentation shouldn't be too great an issue if the pool you're writing to is relatively empty. There were changes made to zpool metaslab allocation post-b111 that might improve performance for pools between 70% and 96% full. This could also be why the full sends perform better than incremental sends.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Deduplication Memory Requirements
On Wed, May 4, 2011 at 4:36 PM, Erik Trimble wrote:
> If so, I'm almost certain NetApp is doing post-write dedup. That way, the
> strictly controlled max FlexVol size helps with keeping the resource limits
> down, as it will be able to round-robin the post-write dedup to each FlexVol
> in turn.

They are; it's in their docs. A volume is dedup'd when 20% of non-deduped data is added to it, or something similar. 8 volumes can be processed at once though, I believe, and it could be that weaker systems are not able to do as many in parallel.

> block usage has a significant 4k presence. One way I reduced this initially
> was to have the VMdisk image stored on local disk, then copied the *entire*
> image to the ZFS server, so the server saw a single large file, which meant
> it tended to write full 128k blocks. Do note, that my 30 images only takes

Wouldn't you have been better off cloning datasets that contain an unconfigured install and customizing from there?

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Quick zfs send -i performance questions
On Wed, May 4, 2011 at 2:25 PM, Giovanni Tirloni wrote:
> The problem we've started seeing is that a zfs send -i is taking hours to
> send a very small amount of data (eg. 20GB in 6 hours) while a zfs send full
> transfer everything faster than the incremental (40-70MB/s). Sometimes we
> just give up on sending the incremental and send a full altogether.

Does the send complete faster if you just pipe to /dev/null? I've observed that if recv stalls, it'll pause the send, and the two go back and forth stepping on each other's toes. Unfortunately, send and recv tend to pause with each individual snapshot they are working on.

Putting something like mbuffer (http://www.maier-komor.de/mbuffer.html) in the middle can help smooth it out and speed things up tremendously. It prevents the send from pausing when the recv stalls, and allows the recv to continue working when the send is stalled. You will have to fiddle with the buffer size and other options to tune it for your use.
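A sketch of using it over the network (hostname, port, and buffer sizes are illustrative; tune -s and -m for your data rate):

receiver# mbuffer -s 128k -m 1G -I 9090 | zfs recv tank/backup
sender# zfs send -i tank/fs@a tank/fs@b | mbuffer -s 128k -m 1G -O receiver:9090

-B

-- Brandon High : bh...@freaks.com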
Re: [zfs-discuss] Deduplication Memory Requirements
On Wed, May 4, 2011 at 12:29 PM, Erik Trimble wrote:
> I suspect that NetApp does the following to limit their resource
> usage: they presume the presence of some sort of cache that can be
> dedicated to the DDT (and, since they also control the hardware, they can
> make sure there is always one present). Thus, they can make their code

AFAIK, NetApp has more restrictive requirements about how much data can be dedup'd on each type of hardware. See page 29 of http://media.netapp.com/documents/tr-3505.pdf - Smaller pieces of hardware can only dedup 1TB volumes, and even the big-daddy filers will only dedup up to 16TB per volume, even if the volume size is 32TB (the largest volume available for dedup).

NetApp solves the problem by putting rigid constraints around the problem, whereas ZFS lets you enable dedup for any size dataset. Both approaches have limitations, and it sucks when you hit them.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Faster copy from UFS to ZFS
On Tue, May 3, 2011 at 12:36 PM, Erik Trimble wrote:
> rsync is indeed slower than star; so far as I can tell, this is due almost
> exclusively to the fact that rsync needs to build an in-memory table of all
> work being done *before* it starts to copy. After that, it copies at about

rsync 3.0+ will start copying almost immediately, so it's much better in that respect than previous versions. It continues to scan and update the list of files while sending data.

> network use pattern), which helps for ZFS copying. The one thing I'm not
> sure of is whether rsync uses a socket, pipe, or semaphore method when doing
> same-host copying. I presume socket (which would slightly slow it down vs

It creates a socketpair() before clone()ing itself and uses the socket for communications.

> That said, rsync is really the only solution if you have a partial or
> interrupted copy. It's also really the best method to do verification.

For verification you should specify -c (checksums), otherwise it will only look at the size, permissions, owner and date, and if they all match it will not look at the file contents. It can take as long (or longer) to complete than the original copy, since files on both sides need to be read and checksummed.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Faster copy from UFS to ZFS
On Tue, May 3, 2011 at 5:47 AM, Joerg Schilling wrote:
> But this is most likely slower than star and does rsync support sparse files?

'rsync -ASHXavP'

-A: ACLs
-S: Sparse files
-H: Hard links
-X: Xattrs
-a: archive mode; equals -rlptgoD (no -H,-A,-X)

You don't need to specify --whole-file, it's implied when copying on the same system. --inplace can play badly with hard links and shouldn't be used.

It probably will be slower than other options but it may be more accurate, especially with -H.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] ls reports incorrect file size
On Mon, May 2, 2011 at 1:56 PM, Eric D. Mudama wrote:
> that the application would have done the seek+write combination, since
> on NTFS (which doesn't support sparse) these would have been real
> 1.5GB files, and there would be hundreds or thousands of them in
> normal usage.

NTFS supports sparse files.
http://www.flexhex.com/docs/articles/sparse-files.phtml

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)
On Thu, Apr 28, 2011 at 6:48 PM, Edward Ned Harvey wrote:
> What does it mean / what should you do, if you run that command, and it
> starts spewing messages like this?
> leaked space: vdev 0, offset 0x3bd8096e00, size 7168

I'm not sure there's much you can do about it short of deleting datasets and/or snapshots.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Still no way to recover a "corrupted" pool
On Fri, Apr 29, 2011 at 1:23 PM, Freddie Cash wrote:
> Running ZFSv28 on 64-bit FreeBSD 8-STABLE.

I'd suggest trying to import the pool into snv_151a (Solaris 11 Express), which is the reference and development platform for ZFS.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Faster copy from UFS to ZFS
On Fri, Apr 29, 2011 at 10:53 AM, Dan Shelton wrote:
> Is anyone aware of any freeware program that can speed up copying tons of
> data (2 TB) from UFS to ZFS on same server?

Setting 'sync=disabled' for the initial copy will help, since it will make all writes asynchronous. You will probably want to set it back to default after you're done.
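A sketch (the dataset name is hypothetical; the 'sync' property requires a fairly recent build):

# zfs set sync=disabled tank/dest   # before the copy
# zfs inherit sync tank/dest        # afterwards, revert to the default

-B

-- Brandon High : bh...@freaks.com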
Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)
On Fri, Apr 29, 2011 at 7:10 AM, Roy Sigurd Karlsbakk wrote:
> This was fletcher4 earlier, and still is in opensolaris/openindiana. Given a
> combination with verify (which I would use anyway, since there are always
> tiny chances of collisions), why would sha256 be a better choice?

fletcher4 was only an option for snv_128, which was quickly pulled and replaced with snv_128b which removed fletcher4 as an option. The official post is here:
http://www.opensolaris.org/jive/thread.jspa?threadID=118519&tstart=0#437431

It looks like fletcher4 is still an option in snv_151a for non-dedup datasets, and is in fact the default.

As an aside: Erik, any idea when the 159 bits will make it to the public?

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Finding where dedup'd files are
On Thu, Apr 28, 2011 at 4:06 PM, Erik Trimble wrote: > Which means, that while I can get a list of blocks which are deduped, it > may not be possible to generate a list of files from that list of > blocks. Is it possible to determine which datasets the blocks are referenced from? Since I have some datasets with dedup'd data, I'm a little paranoid about tanking the system if they are destroyed. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)
On Thu, Apr 28, 2011 at 3:50 PM, Edward Ned Harvey wrote: > When a block is scheduled to be written, system performs checksum, and looks > for a matching entry in DDT in ARC/L2ARC. In the event of an ARC/L2ARC ... which, if it's on L2ARC, is another read too. While most people will be using a fast SSD, it's slower than RAM and still worth mentioning. > cache miss for a DDT entry which actually exists, the system will need to > perform a number of small disk reads in order to fetch the DDT entry from > disk. Correct? I figure at least one, probably more than one, read to > locate the entry on disk, and then another read to actually read the entry. I think it's safe to assume it'll usually be multiple reads from the pool devices. These are random iops. > After this, the system knows there is a checksum match between the block > waiting to be written, and another block that's already on disk, and it > could possibly have to do yet another read for verification, before it is > able to finally do the write. Right? If verify is on, it'll read the on-disk block and compare it to the to-be-written block. If they match, it will increment the refcount for the on-disk block. If the zpool property dedupditto is set and the refcount for the on-disk block exceeds the threshold, it will write another copy of the block to disk. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
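A sketch of setting the threshold mentioned above, assuming a pool named tank; as far as I know, 100 is the smallest nonzero value the property accepts:
# zpool set dedupditto=100 tank
With this set, a dedup'd block whose refcount climbs past 100 gets a second physical copy written to the pool.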
Re: [zfs-discuss] Finding where dedup'd files are
On Thu, Apr 28, 2011 at 3:48 PM, Ian Collins wrote: > Dedup is at the block, not file level. Files are usually composed of blocks. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)
On Thu, Apr 28, 2011 at 3:05 PM, Erik Trimble wrote: > A careful reading of the man page seems to imply that there's no way to > change the dedup checksum algorithm from sha256, as the dedup property > ignores the checksum property, and there's no provided way to explicitly > set a checksum algorithm specific to dedup (i.e. there's no way to > override the default for dedup). That's my understanding as well. The initial release used fletcher4 or sha256, but there was either a bug in the fletcher4 code or a hash collision that required removing it as an option. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)
On Wed, Apr 27, 2011 at 9:26 PM, Edward Ned Harvey wrote: > Correct me if I'm wrong, but the dedup sha256 checksum happens in addition > to (not instead of) the fletcher2 integrity checksum. So after bootup, My understanding is that enabling dedup forces sha256. "The default checksum used for deduplication is sha256 (subject to change). When dedup is enabled, the dedup checksum algorithm overrides the checksum property." -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
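To illustrate (the dataset name is a placeholder), enabling dedup and then inspecting the properties:
# zfs set dedup=on tank/fs
# zfs get checksum,dedup tank/fs
The checksum property may still show its inherited value; per the man page text quoted above, the dedup checksum (sha256) overrides it for writes to the dedup-enabled dataset.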
[zfs-discuss] Finding where dedup'd files are
Is there an easy way to find out what datasets have dedup'd data in them? Even better would be to discover which files in a particular dataset are dedup'd. I ran # zdb - which gave output like:
index 1055c9f21af63 refcnt 2 single DVA[0]=<0:1e274ec3000:2ac00:STD:1> [L0 deduplicated block]
sha256 uncompressed LE contiguous unique unencrypted 1-copy size=2L/2P birth=236799L/236799P fill=1
cksum=55c9f21af6399be:11f9d4f5ff4cb109:2af8b798671e47ba:d19caf78da295df5
How can I translate this into datasets or files? -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Spare drives sitting idle in raidz2 with failed drive
On Wed, Apr 27, 2011 at 12:51 PM, Lamp Zy wrote: > Any ideas how to identify which drive is the one that failed so I can > replace it? Try the following: # fmdump -eV # fmadm faulty -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Drive replacement speed
The last resilver finished after 50 hours. Ouch. I'm onto the next device now, which seems to be progressing much, much better. The current tunings that I'm using right now are:
echo zfs_resilver_delay/W0t0 | mdb -kw
echo zfs_resilver_min_time_ms/W0t2 | pfexec mdb -kw
Things could slow down, but at 13 hours in, the resilver has been managing ~ 100M/s and is 70% done. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
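As a side note, the current values of these tunables can be read back before changing them; a sketch using mdb's /D (decimal) output format:
# echo zfs_resilver_delay/D | mdb -k
# echo zfs_resilver_min_time_ms/D | mdb -k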
Re: [zfs-discuss] Drive replacement speed
On Mon, Apr 25, 2011 at 5:26 PM, Brandon High wrote: > Setting zfs_resilver_delay seems to have helped some, based on the > iostat output. Are there other tunables? I found zfs_resilver_min_time_ms while looking. I've tried bumping it up considerably, without much change. 'zpool status' is still showing:
scan: resilver in progress since Sat Apr 23 17:03:13 2011
    6.06T scanned out of 6.40T at 36.0M/s, 2h46m to go
    769G resilvered, 94.64% done
'iostat -xn' shows asvc_t under 10ms still. Increasing the per-device queue depth has increased the asvc_t but hasn't done much to affect the throughput. I'm using:
echo zfs_vdev_max_pending/W0t35 | pfexec mdb -kw
-B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How does ZFS dedup space accounting work with quota?
On Mon, Apr 25, 2011 at 4:53 PM, Fred Liu wrote: > So how can I set the quota size on a file system with dedup enabled? I believe the quota applies to the non-dedup'd data size. If a user stores 10G of data, it will use 10G of quota, regardless of whether it dedups at 100:1 or 1:1. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
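For instance (user and dataset names are hypothetical):
# zfs set quota=10G tank/home/fred
The user can then store 10G of logical data; even if dedup keeps the physical footprint far smaller, the quota accounting doesn't shrink.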
Re: [zfs-discuss] Spare drives sitting idle in raidz2 with failed drive
On Mon, Apr 25, 2011 at 4:56 PM, Lamp Zy wrote: > I'd expect the spare drives to auto-replace the failed one but this is not > happening. > > What am I missing? Is the autoreplace property set to 'on'? # zpool get autoreplace fwgpool0 # zpool set autoreplace=on fwgpool0 > I really would like to get the pool back in a healthy state using the spare > drives before trying to identify which one is the failed drive in the > storage array and trying to replace it. How do I do this? Turning on autoreplace might start the replace. If not, the following will replace the failed drive with the first spare. (I'd suggest verifying the device names before running it.) # zpool replace fwgpool0 c4t5000C5001128FE4Dd0 c4t5000C50014D70072d0 -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Drive replacement speed
On Mon, Apr 25, 2011 at 4:45 PM, Richard Elling wrote: > If there is other work going on, then you might be hitting the resilver > throttle. By default, it will delay 2 clock ticks, if needed. It can be turned off. There is some other access to the pool from nfs and cifs clients, but not much, and mostly reads. Setting zfs_resilver_delay seems to have helped some, based on the iostat output. Are there other tunables? > Probably won't work because it does not make the resilvering drive > any faster. It doesn't seem like the devices are the bottleneck, even with the delay turned off.
$ iostat -xn 60 3
                    extended device statistics
    r/s    w/s    kr/s    kw/s  wait  actv wsvc_t asvc_t  %w  %b device
  369.2   11.5  5577.0    71.3   0.7   0.7    1.9    1.9  14  29 c2t0d0
  371.9   11.5  5570.3    71.3   0.7   0.7    1.7    1.8  13  29 c2t1d0
  369.9   11.5  5574.4    71.3   0.7   0.7    1.8    1.9  14  29 c2t2d0
  370.7   11.5  5573.9    71.3   0.7   0.7    1.8    1.9  14  29 c2t3d0
  368.0   11.5  5553.1    71.3   0.7   0.7    1.8    1.9  14  29 c2t4d0
  196.1  172.8  2825.5  2436.6   0.3   1.1    0.8    3.0   6  26 c2t5d0
  183.6  184.9  2717.6  2674.7   0.5   1.3    1.4    3.5  11  31 c2t6d0
  393.0   11.2  5540.7    71.3   0.5   0.6    1.3    1.5  12  26 c2t7d0
   95.8    1.2    95.6    16.2   0.0   0.0    0.2    0.2   0   1 c0t0d0
    0.9    1.2     3.6    16.2   0.0   0.0    7.5    1.9   0   0 c0t1d0
                    extended device statistics
    r/s    w/s    kr/s    kw/s  wait  actv wsvc_t asvc_t  %w  %b device
  891.2   11.8  2386.9    64.4   0.0   1.2    0.0    1.3   1  36 c2t0d0
  919.9   12.1  2351.8    64.6   0.0   1.1    0.0    1.2   0  35 c2t1d0
  906.9   12.1  2346.1    64.6   0.0   1.2    0.0    1.3   0  36 c2t2d0
  877.9   11.6  2351.0    64.5   0.7   0.5    0.8    0.6  23  35 c2t3d0
  883.4   12.0  2322.0    64.4   0.2   1.0    0.2    1.1   7  35 c2t4d0
    0.8  758.0     0.8  1910.4   0.2   5.0    0.2    6.6   3  72 c2t5d0
  882.7   11.4  2355.1    64.4   0.8   0.4    0.9    0.4  27  34 c2t6d0
  907.8   11.4  2373.1    64.5   0.7   0.3    0.8    0.4  23  30 c2t7d0
 1607.8    9.4  1568.2    83.0   0.1   0.2    0.1    0.1   3  18 c0t0d0
    7.3    9.1    23.5    83.0   0.1   0.0    6.0    1.4   2   2 c0t1d0
                    extended device statistics
    r/s    w/s    kr/s    kw/s  wait  actv wsvc_t asvc_t  %w  %b device
  960.3   12.7  2868.0    59.0   1.1   0.7    1.2    0.8  37  52 c2t0d0
  963.2   12.7  2877.5    59.1   1.1   0.8    1.1    0.8  36  51 c2t1d0
  960.3   12.6  2844.7    59.1   1.1   0.7    1.1    0.8  37  52 c2t2d0
 1000.1   12.8  2827.1    59.0   0.6   1.2    0.6    1.2  21  52 c2t3d0
  960.9   12.3  2811.1    59.0   1.3   0.6    1.3    0.6  42  51 c2t4d0
    0.5  962.2     0.4  2418.3   0.0   4.1    0.0    4.3   0  59 c2t5d0
 1014.2   12.3  2820.6    59.1   0.8   0.8    0.8    0.8  28  48 c2t6d0
 1031.2   12.5  2822.0    59.1   0.8   0.8    0.7    0.8  26  45 c2t7d0
 1836.4    0.0  1783.4     0.0   0.0   0.2    0.0    0.1   1  19 c0t0d0
    5.3    0.0     5.3     0.0   0.0   0.0    1.1    1.5   1   1 c0t1d0
-- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)
On Mon, Apr 25, 2011 at 8:20 AM, Edward Ned Harvey wrote: > and 128k assuming default recordsize. (BTW, recordsize seems to be a zfs > property, not a zpool property. So how can you know or configure the > blocksize for something like a zvol iscsi target?) zvols use the 'volblocksize' property, which defaults to 8k. A 1TB zvol is therefore 2^27 blocks and would require ~ 34 GB for the ddt (assuming that a ddt entry is 270 bytes). The zfs man page for the property reads: volblocksize=blocksize For volumes, specifies the block size of the volume. The blocksize cannot be changed once the volume has been written, so it should be set at volume creation time. The default blocksize for volumes is 8 Kbytes. Any power of 2 from 512 bytes to 128 Kbytes is valid. This property can also be referred to by its shortened column name, volblock. -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
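If dedup is planned for a zvol, a larger volblocksize shrinks the DDT proportionally. A hypothetical creation (the name and sizes are placeholders, and the property cannot be changed afterward):
# zfs create -V 1T -o volblocksize=128k tank/vol
At 128 Kbyte blocks the same 1TB zvol is 2^23 blocks, needing roughly 2GB of DDT instead of ~34GB, at the cost of read-modify-write for small I/O.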
[zfs-discuss] Drive replacement speed
[...]
                 capacity     operations    bandwidth
pool           alloc   free   read  write   read  write
  ...0            -      -    771     10  1.99M  59.4K
  c2t2d0          -      -    743     10  2.02M  59.4K
  c2t3d0          -      -    771     11  2.01M  59.3K
  c2t4d0          -      -    767     10  1.94M  59.1K
  replacing       -      -      0  1.00K     17  1.48M
    c2t5d0/old    -      -      0      0      0      0
    c2t5d0        -      -      0    533     17  1.48M
  c2t6d0          -      -    791     10  1.98M  59.2K
  c2t7d0          -      -    796     10  1.99M  59.3K
----------     -----  -----  -----  -----  -----  -----
$ iostat -xn 60 3
                    extended device statistics
    r/s    w/s    kr/s    kw/s  wait  actv wsvc_t asvc_t  %w  %b device
  362.4   11.5  5693.9    71.6   0.7   0.7    2.0    2.0  14  30 c2t0d0
  365.3   11.5  5689.0    71.6   0.7   0.7    1.8    1.9  14  29 c2t1d0
  363.2   11.5  5693.2    71.6   0.7   0.7    1.9    2.0  14  30 c2t2d0
  364.0   11.5  5692.7    71.6   0.7   0.7    1.9    1.9  14  30 c2t3d0
  361.2   11.5  5672.8    71.6   0.7   0.7    1.9    1.9  14  30 c2t4d0
  202.4  163.1  2915.2  2475.3   0.3   1.1    0.8    2.9   7  26 c2t5d0
  170.4  190.4  2747.3  2757.6   0.5   1.3    1.5    3.6  11  31 c2t6d0
  386.4   11.2  5659.0    71.6   0.5   0.6    1.3    1.5  12  27 c2t7d0
   95.0    1.2    94.5    16.1   0.0   0.0    0.2    0.2   0   1 c0t0d0
    0.9    1.2     3.3    16.1   0.0   0.0    7.5    1.9   0   0 c0t1d0
                    extended device statistics
    r/s    w/s    kr/s    kw/s  wait  actv wsvc_t asvc_t  %w  %b device
  514.1   13.0  1937.7    65.7   0.2   0.8    0.3    1.5   5  27 c2t0d0
  510.1   13.2  1943.1    65.7   0.2   0.8    0.5    1.6   6  29 c2t1d0
  513.3   13.2  1926.3    65.8   0.2   0.8    0.3    1.5   5  28 c2t2d0
  505.9   13.3  1936.7    65.8   0.2   0.9    0.3    1.8   5  30 c2t3d0
  513.8   12.8  1890.1    65.8   0.2   0.8    0.3    1.5   5  26 c2t4d0
    0.1  488.6     0.1  1216.5   0.0   2.2    0.0    4.6   0  33 c2t5d0
  533.3   12.7  1875.3    65.9   0.1   0.7    0.2    1.3   4  24 c2t6d0
  541.6   12.9  1923.2    65.8   0.1   0.7    0.2    1.2   3  23 c2t7d0
    0.0    2.0     0.0     9.4   0.0   0.0    1.0    0.2   0   0 c0t0d0
    0.0    2.0     0.0     9.4   0.0   0.0    1.0    0.2   0   0 c0t1d0
                    extended device statistics
    r/s    w/s    kr/s    kw/s  wait  actv wsvc_t asvc_t  %w  %b device
  506.7    9.2  1906.9    50.2   0.6   0.2    1.2    0.5  20  23 c2t0d0
  509.8    9.3  1909.5    50.2   0.6   0.2    1.2    0.4  19  23 c2t1d0
  508.6    9.0  1900.4    50.2   0.7   0.3    1.4    0.5  21  25 c2t2d0
  506.8    9.4  1897.2    50.3   0.6   0.2    1.2    0.5  19  23 c2t3d0
  505.1    9.4  1852.4    50.4   0.6   0.2    1.2    0.5  19  23 c2t4d0
    0.0  487.6     0.0  1227.9   0.0   3.5    0.0    7.2   0  46 c2t5d0
  534.8    9.2  1855.6    50.2   0.6   0.2    1.0    0.4  18  22 c2t6d0
  540.5    9.3  1891.4    50.2   0.5   0.2    1.0    0.4  17  21 c2t7d0
    0.0    0.0     0.0     0.0   0.0   0.0    0.0    0.0   0   0 c0t0d0
    0.0    0.0     0.0     0.0   0.0   0.0    0.0    0.0   0   0 c0t1d0
-- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] just can't import
On Mon, Apr 11, 2011 at 10:55 AM, Matt Harrison wrote: > It did finish eventually, not sure how long it took in the end. Things are > looking good again :) If you want to continue using dedup, you should invest in (a lot) more memory. The amount of memory required depends on the size of your pool and the type of data that you're storing. Data stored in large blocks will use less memory. I suspect that the minimum memory for most moderately sized pools is over 16GB. There has been a lot of discussion regarding how much memory each dedup'd block requires, and I think it was about 250-270 bytes per block. 1TB of data (at max block size and no duplicate data) will require about 2GB of memory to run effectively. (This seems high to me, hopefully someone else can confirm.) This is memory that is available to the ARC, above and beyond what is being used by the system and applications. Of course, using all your ARC to hold dedup data won't help much either, as either cacheable data or dedup info will be evicted rather quickly. Forcing the system to read dedup tables from the pool is slow, since it's a lot of random reads. All I know is that I have 8GB in my home system, and it is not enough to work with the 8TB pool that I have. Adding a fast SSD as L2ARC can help reduce the memory requirements somewhat by keeping dedup data more easily accessible. (And make sure that your L2ARC device is large enough. I fried a 30GB OCZ Vertex in just a few months of use, I suspect from the constant writes.) -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
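To size this for a specific pool, zdb can report the DDT, or simulate one before dedup is ever enabled; a sketch, assuming a pool named tank:
# zdb -DD tank
# zdb -S tank
The first prints a histogram of the existing DDT on a dedup'd pool; the second simulates dedup on the existing data. Multiplying the total entry count by ~270 bytes gives a rough in-core DDT size.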
Re: [zfs-discuss] just can't import
On Sun, Apr 10, 2011 at 10:01 PM, Matt Harrison wrote: > The machine only has 4G RAM I believe. There's your problem. 4G is not enough memory for dedup, especially without a fast L2ARC device. > It's time I should be heading to bed so I'll let it sit overnight, and if > I'm still stuck with it I'll give Ian's recent suggestions a go and report > back. I'd suggest waiting for it to finish the destroy. It will, if you give it time. Trying to force the import is only going to put you back in the same situation - The system will attempt to complete the destroy and seem to hang until it's completed. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] just can't import
On Sun, Apr 10, 2011 at 9:01 PM, Matt Harrison wrote: > I had a de-dup dataset and tried to destroy it. The command hung and so did > anything else zfs related. I waited half an hour or so, the dataset was > only 15G, and rebooted. How much RAM does the system have? Dedup uses a LOT of memory, and it can take a long time to destroy dedup'd datasets. If you keep waiting, it'll eventually return. It could be a few hours or longer. > The machine refused to boot, stuck at Reading ZFS Config. Asking around on The system resumed the destroy that was in progress. If you let it sit, it'll eventually complete. > Well the livecd is also hanging on import, anything else zfs hangs. iostat > shows some reads but they drop off to almost nothing after 2 mins or so. Likewise, it's trying to complete the destroy. Be patient and it'll complete. Newer versions of OpenSolaris or Solaris 11 Express may complete it faster. > Any tips greatly appreciated, Just wait... -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Going forward after Oracle - Let's get organized, let's get started.
On Sat, Apr 9, 2011 at 10:41 AM, Chris Forgeron wrote: > I see your point, but you also have to understand that sometimes too many > helpers/opinions are a bad thing. There is a set "core" of ZFS developers > who make a lot of this move forward, and they are the key right now. The rest > of us will just muddy the waters with conflicting/divergent opinions on > direction and goals. It would be nice to have some communication from the devs about what they're working on. A moderated list that only a limited set of people normally post to would be excellent. I'd be excited to hear that there's a new feature being worked on, rather than the radio silence we've had. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How to rename rpool. Is that recommended ?
On Fri, Apr 8, 2011 at 12:10 AM, Arjun YK wrote: > I have a situation where a host, which is booted off its 'rpool', need > to temporarily import the 'rpool' of another host, edit some files in > it, and export the pool back retaining its original name 'rpool'. Can > this be done ? Yes, you can do it; no, it is not recommended. I had a need to do something similar to what you're attempting and ended up using a Live CD (which doesn't have an rpool to conflict with) to do the manipulations. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
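A sketch of the Live CD approach (the altroot /a is an assumption):
# zpool import -f -R /a rpool
  ... edit the files under /a ...
# zpool export rpool
Because the Live CD has no rpool of its own, the pool keeps its original name throughout, and the -R altroot keeps its mountpoints from colliding with the live environment.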
Re: [zfs-discuss] ZFS send/receive to Solaris/FBSD/OpenIndiana/Nexenta VM guest?
On Thu, Apr 7, 2011 at 4:01 PM, Joe Auty wrote: > My source computer is running Solaris 10 ZFS version 15. Does this mean that > I'd be asking for trouble doing a zfs send back to this machine from any > other ZFS machine running a version > 15? I just want to make sure I > understand all of this info... There are two versions when it comes to ZFS - The zpool version and the zfs version.
bhigh@basestar:~$ zpool list -o name,version
NAME   VERSION
rpool       31
bhigh@basestar:~$ zfs list -o name,version
NAME                VERSION
rpool                     5
rpool/ROOT                5
rpool/ROOT/snv_151        5
rpool/dump                -
rpool/rsrv                5
rpool/swap                -
I think that the version that matters (for your purposes) is the ZFS version. It should be set when using 'send -R' and having 'zfs receive' create the destination datasets. I recommend testing however. > If this is the case, what are my strategies? Solaris 10 for my temporary > backup machine? Is it possible to run OpenIndiana or Nexenta or something and > somehow set up these machines with ZFS v15 or something? You can set the zpool version when you create the pool, and you can set the zfs version when you create the dataset. I'm not sure that you'll need to set the pool version to anything lower if the dataset version is correct though. You should test this, however. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
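A hypothetical example of pinning versions at creation time (device, pool, and dataset names are placeholders, and as the thread advises, this should be tested before relying on it):
# zpool create -o version=22 tank mirror c0t0d0 c0t1d0
# zfs create -o version=4 tank/backup
The results can be checked afterward with 'zpool get version tank' and 'zfs get version tank/backup'.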
Re: [zfs-discuss] ZFS send/receive to Solaris/FBSD/OpenIndiana/Nexenta VM guest?
On Wed, Apr 6, 2011 at 10:42 AM, Paul Kraus wrote: > I thought I saw that with zpool 10 (or was it 15) the zfs send > format had been committed and you *could* send/recv between different > versions of zpool/zfs. From the Solaris 10U9 (zpool 22) man page for zfs: There is still a problem if the dataset version is too high. I *believe* that a 'zfs send -R' should send the zfs version, and that zfs receive will create any new datasets using that version. (I have a received dataset here that's zfs v4, whereas everything else in the pool is v5.) As long as you don't do a zfs upgrade after that point, you should be fine. It's probably a good idea to check that the received versions are the same as the source before doing a destroy though. ;-) One other thing that I forgot to mention in my last mail too: If you're receiving into a VM, make sure that the VM can manage redundancy on its zfs storage, and not just multiple vdsk on the same host disk / lun. Either give it access to the raw devices, or use iSCSI, or create your vdsk on different luns and raidz them, etc. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS send/receive to Solaris/FBSD/OpenIndiana/Nexenta VM guest?
On Tue, Apr 5, 2011 at 12:38 PM, Joe Auty wrote: > How about getting a little more crazy... What if this entire server > temporarily hosting this data was a VM guest running ZFS? I don't foresee > this being a problem either, but with so The only thing to watch out for is to make sure that the receiving datasets aren't a higher version than the zfs version that you'll be using on the replacement server. Because you can't downgrade a dataset, using snv_151a and planning to send to Nexenta as a final step will trip you up unless you explicitly create them with a lower version. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NTFS on NFS and iSCSI always generates small IO's
On Thu, Mar 10, 2011 at 9:45 AM, Richard Elling wrote: > Default recordsize for NFS is 128K. For the VM case, you will want to match > the block size of > the clients. However, once the file (on the NFS server) is created with 128K > records, it will remain > at 128K forever. So you will need to create a new VM store after the > recordsize is tuned. You can change the recordsize and copy the vmdk files on the nfs server, which will re-write them with a smaller recordsize. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
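For example (dataset and file names are hypothetical), rewriting an existing vmdk at the smaller recordsize:
# zfs set recordsize=8k tank/vmstore
# cp /tank/vmstore/guest.vmdk /tank/vmstore/guest.vmdk.new
# mv /tank/vmstore/guest.vmdk.new /tank/vmstore/guest.vmdk
Only data written after the property change picks up the new recordsize, which is why the copy is needed.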
Re: [zfs-discuss] NTFS on NFS and iSCSI always generates small IO's
On Thu, Mar 10, 2011 at 12:15 AM, Matthew Anderson wrote: > I have a feeling it's to do with ZFS's recordsize property but haven't been > able to find any solid testing done with NTFS. I'm going to do some testing > using smaller record sizes tonight to see if that helps the issue. > At the moment I'm surviving on cache and am quickly running out of capacity. > > Can anyone suggest any further tests or have any idea about what's going on? The default blocksize for a zfs volume is 8k, so 4k writes will probably require a read as well. You can try creating a new volume with volblocksize set to 4k and see if that helps. The value can't be changed once set, so you'll have to make a new dataset. Make sure the "wcd" property is set to "false" for the volume in stmfadm in order to enable the write cache. It shouldn't make a huge difference with the zil disabled, but it certainly won't hurt. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
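A sketch of both steps, with a hypothetical volume name and LU GUID:
# zfs create -V 100G -o volblocksize=4k tank/ntfsvol
# stmfadm list-lu
# stmfadm modify-lu -p wcd=false 600144F0XXXXXXXXXXXXXXXXXXXXXXXX
'wcd' is write-cache disable, so setting it to false enables the write cache.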
Re: [zfs-discuss] Slices and reservations Was: Re: How long should an empty destroy take? snv_134
On Mon, Mar 7, 2011 at 1:50 PM, Yaverot wrote: > 1. While performance isn't my top priority, doesn't using slices make a > significant difference? Write caching will be disabled on devices that use slices. It can be turned back on by using 'format -e'. > 2. Doesn't snv_134 that I'm running already account for variances in these > nominally-same disks? It will allow some small differences. I'm not sure what the limit on the difference size is. > 3. The market refuses to sell disks under $50, therefore I won't be able to > buy drives of 'matching' capacity anyway. You can always use a larger drive. If you think you may want to go back to smaller drives, make sure that the autoexpand zpool property is disabled though. > 3. Assuming I want to do such an allocation, is this done with quota & > reservation? Or is it snapshots as you suggest? I think Edward misspoke when he said to use snapshots, and probably meant reservation. I've taken to creating a dataset called "reserved" and giving it a 10G reservation, as sketched below. (10G isn't a special value, feel free to use 5% of your pool size or whatever else you're comfortable with.) It's unmounted and doesn't contain anything, but it ensures that there is a chunk of space I can make available if needed. Because it doesn't contain anything, there shouldn't be any concern about de-allocation of blocks when it's destroyed. Alternately, the reservation can be reduced to make space available. > Would it make more sense to make another filesystem in the pool, fill it > enough and keep it handy to delete? Or is there some advantage to zfs destroy > (snapshot) over zfs destroy (filesystem)? While I am thinking about the > system and have extra drives, like now, is the time to make plans for the > next "system is full" event. If a dataset contains data, the blocks will have to be freed when it's destroyed. If it's an empty dataset with a reservation, the only change is to fiddle some accounting bits. I seem to remember seeing a fix for 100% full pools a while ago so this may not be as critical as it used to be, but it's a nice safety net to have. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
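The reserved-dataset trick, as a minimal sketch (the pool name tank is a placeholder):
# zfs create -o reservation=10G -o mountpoint=none tank/reserved
If the pool ever fills, free the space instantly by shrinking or dropping the reservation:
# zfs set reservation=none tank/reserved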
Re: [zfs-discuss] Format returning bogus controller info
On Mon, Feb 28, 2011 at 9:39 PM, Dave Pooser wrote: > Is the same true of controllers? That is, will c12 remain c12 or > /pci@0,0/pci8086,340c@5 remain /pci@0,0/pci8086,340c@5 even if other > controllers are active? You can rebuild the device tree if it bothers you. There are some (outdated) instructions here: http://spiralbound.net/blog/2005/12/21/rebuilding-the-solaris-device-tree . I think you can do this all with a new boot environment, rather than boot from a CD. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS send/recv horribly slow on system with 1800+ filesystems
On Mon, Feb 28, 2011 at 10:38 PM, Moazam Raja wrote: > We've noticed that on systems with just a handful of filesystems, ZFS > send (recursive) is quite quick, but on our 1800+ fs box, it's > horribly slow. When doing an incremental send, the system has to identify what blocks have changed, which can take some time. If not much data has changed, the delay can take longer than the actual send. I've noticed that there's a small delay when starting a send of a new snapshot and when starting the receive of one. Putting something like mbuffer in the path helps to smooth things out. It won't help in the example you've cited below, but it will help in real world use. > The other odd thing I've noticed is that during the 'zfs send' to > /dev/null, zpool iostat shows we're actually *writing* to the zpool at > the rate of 4MB-8MB/s, but reading almost nothing. How can this be the > case? The writing seems odd, but the lack of reads doesn't. You might have most or all of the data in the ARC or L2ARC, so your zpool doesn't need to be read from. > 1.) Does ZFS get immensely slow once we have thousands of filesystems? No. Incremental sends might take longer, as I mentioned above. > 2.) Why do we see 4MB-8MB/s of *writes* to the filesystem when we do a > 'zfs send' to /dev/null ? Is anything else using the filesystems in the pool? -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
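A sketch of an mbuffer pipeline between hosts (host, pool, and snapshot names are placeholders; -s and -m set mbuffer's block size and buffer memory):
# zfs send -R tank/fs@snap | mbuffer -s 128k -m 1G | ssh otherhost 'zfs receive -d tank2'
Running a second mbuffer on the receiving end of the ssh pipe smooths the bursts out further.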
Re: [zfs-discuss] ZFS Performance
On Sun, Feb 27, 2011 at 7:35 PM, Brandon High wrote: > It moves from "best fit" to "any fit" at a certain point, which is at > ~ 95% (I think). Best fit looks for a large contiguous space to avoid > fragmentation while any fit looks for any free space. I got the terminology wrong, it's first-fit when there is space, moving to best-fit at 96% full. See http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/metaslab.c for details. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Performance
On Sun, Feb 27, 2011 at 6:59 AM, Edward Ned Harvey wrote: > But there is one specific thing, isn't there? Where ZFS will choose to use > a different algorithm for something, when pool usage exceeds some threshold. > Right? What is that? It moves from "best fit" to "any fit" at a certain point, which is at ~ 95% (I think). Best fit looks for a large contiguous space to avoid fragmentation while any fit looks for any free space. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] External SATA drive enclosures + ZFS?
On Sun, Feb 27, 2011 at 7:48 AM, taemun wrote: > eSATA has no need for any interposer chips between a modern SATA chipset on > the motherboard and a SATA hard drive. You can buy cables with appropriate eSATA has different electrical specifications, namely a higher minimum transmit power and a lower minimum receive power. An internal port might work with a SATA to eSATA cable or adapter, but it's not guaranteed to. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] External SATA drive enclosures + ZFS?
On Sun, Feb 27, 2011 at 4:15 PM, Rich Teer wrote: > So the question is, what eSATA non-RAID HBA do people recommend? Bear > in mind that I'm looking for something with driver support "out of the > box" with either the latest Solaris 10, or Solaris 11 Express. The SiI3124 (PCI / PCI-X) and SiI3132 (PCIe) based cards can be picked up for about $20-$30. They're supported, and support PMPs in Solaris. I don't know about support on Sparc though. http://www.newegg.com/Product/Product.aspx?Item=N82E16816132021 http://www.newegg.com/Product/Product.aspx?Item=N82E16816132027 > Assuming the use of eSATA, what enclosures do people recommend? I don't > need huge amounts of space; two drives should be enough and four will > be plenty and allow for expansion. Again, I'm looking for a JBOD coz > I want ZFS to do all the work. Something similar to the Sans Digital enclosures would probably work. They use a PMP to make all the drives available via one eSATA port, which may or may not work. It's supposed to, but there are hardware blacklists in the drivers that may cause you trouble. Another thought is to ditch the Sun boxes and use a HP ProLiant Microserver. It's about $320 and holds 4 drives, with an expansion slot for an additional controller. I think some people have reported success with these on the list. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] What drives?
On Thu, Feb 24, 2011 at 10:45 PM, Markus Kovero wrote: > Hi! I'd go for WD RE edition. Blacks and Greens are for desktop use and > therefore lack proper TLER settings and have useless power saving features > that could induce errors and mysterious slowness. There has been a lot of discussion about TLER in the past, and I'm less convinced that it's a requirement for zfs than I used to think. I've been using WD Green (EADS) drives for two years without issue. They are older models whose sleep and TLER settings could still be changed, though. Many of the new WD Green drives (including some of the RE) use 4k sectors, which will wreak havoc on zpool performance. Other manufacturers are starting to use 4k sectors on their 5400 rpm drives as well, so shop carefully if you decide to go with a lower spindle speed. I have not seen a 7200 rpm drive with 4k sectors, but I'm sure they exist. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] External SATA drive enclosures + ZFS?
On Fri, Feb 25, 2011 at 4:34 PM, Rich Teer wrote: > Space is starting to get a bit tight here, so I'm looking at adding > a couple of TB to my home server. I'm considering external USB or > FireWire attached drive enclosures. Cost is a real issue, but I also I would avoid USB, since it can be less reliable than other connection methods. That's the impression I get from older posts made by Sun devs, at least. I'm not sure how well Firewire 400 is supported, let alone Firewire 800. You might want to consider eSATA. Port multipliers are supported in recent builds (128+ I think), and will give better performance than USB. I'm not sure if PMPs are supported on Sparc though, since it requires support in both the controller and the PMP. Consider enclosures from other manufacturers as well. I've heard good things about Sans Digital, but I've never used them. The 2-drive enclosure has the same components as the item you linked but 1/2 the cost via Newegg. > The intent would be to put two 1TB or 2TB drives in the enclosure and use > ZFS to create a mirrored pool out of them. Assuming this enclosure is > set to JBOD mode, would I be able to use this with ZFS? The enclosure Yes, but I think the enclosure has a SiI5744 inside it, so you'll still have one connection from the computer to the enclosure. If that goes, you'll lose both drives. If you're just using two drives, two separate enclosures on separate buses may be better. Look at http://www.sansdigital.com/towerstor/ts1ut.html for instance. There are also larger enclosures with up to 8 drives. > I can't think of a reason why it wouldn't work, but I also have exactly > zero experience with this kind of set up! Like I mentioned, USB is prone to some flakiness. > Assuming this would work, given that I can't seem to find a 4-drive > version of it, would I be correct in thinking that I could buy two of You might be better off using separate enclosures for reliability. Make sure to split the mirrors across the two devices. Use separate USB controllers if possible, so a bus reset doesn't affect both sides. > Assuming my proposed enclosure would work, and assuming the use of > reasonable quality 7200 RPM disks, how would you expect the performance > to compare with the differential UltraSCSI set up I'm currently using? > I think the DWIS is rated at either 20MB/sec or 40MB/sec, so on the > surface, the USB attached drives would seem to be MUCH faster... USB 2.0 is about 30-40MB/s under ideal conditions, but doesn't support any of the command queuing that SCSI does. I'd expect performance to be slightly lower, and to use slightly more CPU. Most USB controllers don't support DMA, so all I/O requires CPU time. What about an inexpensive SAS card (eg: Supermicro AOC-USAS-L4i) and external SAS enclosure (eg: Sans Digital TowerRAID TR4X)? It would cost about $350 for the setup. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS/Drobo (Newbie) Question
On Tue, Feb 8, 2011 at 12:53 PM, David Dyer-Bennet wrote: > Wait, are you saying that the handling of errors in RAIDZ and mirrors is > completely different? That it dumps the mirror disk immediately, but > keeps trying to get what it can from the RAIDZ disk? Because otherwise, > you assertion doesn't seem to hold up. I think he meant that if one drive in a mirror dies completely, then any single read error on the remaining drive is not recoverable. With raidz2 (or a 3-way mirror for that matter), if one drive dies completely, you still have redundancy. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Mon, Feb 7, 2011 at 10:29 AM, Yi Zhang wrote: > I already set primarycache to metadata, and I'm not concerned about > caching reads, but caching writes. It appears writes are indeed cached > judging from the time of 2.a) compared to UFS+directio. More > specifically, 80MB/2s=40MB/s (UFS+directio) looks realistic while > 80MB/0.11s=800MB/s (ZFS+primarycache=metadata) doesn't. You're trying to force a solution that isn't relevant for the situation. ZFS is not UFS, and solutions that are required for UFS to work correctly are not needed with ZFS. Yes, writes are cached, but all the POSIX requirements for synchronous IO are met by the ZIL. As long as your storage devices, be they SAN, DAS or somewhere in between respect cache flushes, you're fine. If you need more performance, use a slog device that respects cache flushes. You don't need to worry about whether writes are being cached, because any data that is written synchronously will be committed to stable storage before the write returns. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Mon, Feb 7, 2011 at 6:15 AM, Yi Zhang wrote: > On Mon, Feb 7, 2011 at 12:25 AM, Richard Elling > wrote: >> Solaris UFS directio has three functions: >> 1. improved async code path >> 2. multiple concurrent writers >> 3. no buffering >> > Thanks for the comments, Richard. All I wanted is to achieve 3 on ZFS. > But as I said, apprently 2.a) below didn't give me that. Do you have > any suggestion? Don't. Use a ZIL, which will meet the requirements for synchronous IO. Set primarycache to metadata to prevent caching reads. ZFS is a very different beast than UFS and doesn't require the same tuning. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
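A minimal sketch of both settings (pool, device, and dataset names are hypothetical):
# zpool add tank log c4t1d0
# zfs set primarycache=metadata tank/db
The slog absorbs the synchronous writes, and the primarycache setting keeps file data out of the ARC, which approximates directio's no-buffering behavior for reads.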
Re: [zfs-discuss] ZFS/Drobo (Newbie) Question
On Sat, Feb 5, 2011 at 9:54 AM, Gaikokujin Kyofusho wrote: > Just to make sure I understand your example, if I say had 4x2tb drives, > 2x750gb, 2x1.5tb drives etc then i could make 3 groups (perhaps 1 raidz1 + 1 > mirrored + 1 mirrored), in terms of accessing them would they just be mounted > like 3 partitions or could it all be accessed like one big partition? You could add them to one pool, and then create multiple filesystems inside the pool. Your total storage would be the sum of the drives' capacity after redundancy, or 3x2tb + 750gb + 1.5tb. It's not recommended to use different levels of redundancy in a pool, so you may want to consider using mirrors for everything. This also makes it easier to add or upgrade capacity later. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS and spindle speed (7.2k / 10k / 15k)
On Sat, Feb 5, 2011 at 3:34 PM, Roy Sigurd Karlsbakk wrote: >> so as not to exceed the channel bandwidth. When they need to get higher disk >> capacity, they add more platters. > > May this mean those drives are more robust in terms of reliability, since the > leaks between sectors is less likely with the lower density? More platters leads to more heat and higher power consumption. Most drives are 3 or 4 platters, though Hitachi usually manufactures 5 platter drives as well. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss