Re: [zfs-discuss] Zpool with data errors

2011-06-21 Thread Marty Scholes
> didn't seem like we would need zfs to provide that redundancy also.

There was a time when I fell for this line of reasoning too.  The problem (if 
you want to call it that) with zfs is that it will show you, front and center, 
the corruption taking place in your stack.

> Since we're on SAN with Raid internally

Your situation would suggest that your RAID silently corrupted data and didn't 
even know about it.

Until you can trust the volumes behind zfs (and I don't trust any of them 
anymore, regardless of the brand name on the cabinet), give zfs at least some 
redundancy so that it can pick up the slack.
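
Even something minimal goes a long way.  A sketch of the two usual options 
(device names here are hypothetical):

# mirror two LUNs that do not share spindles or controllers on the SAN side
zpool create tank mirror c2t0d0 c2t1d0

# or, if you are stuck with a single LUN, at least keep two copies of user data
zfs set copies=2 tank/data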

By the way, I used to trust storage because I didn't believe it was corrupting 
data, but I had no proof one way or the other, so I gave it the benefit of the 
doubt.

Since I have been using zfs, my standards have gone up considerably.  Now I 
trust storage because I can *prove* it's correct.

If you can't prove that a volume is returning correct data, don't trust it.  
Let zfs manage it.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] # disks per vdev

2011-06-17 Thread marty scholes
Funny you say that. 

My Sun v40z, connected to a pair of Sun A5200 arrays and running OSol 128a, 
can't see the enclosures. The luxadm command comes up blank. 

Except for that annoyance (and other similar issues), the Sun gear works well 
with a Sun operating system. 

Sent from Yahoo! Mail on Android

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] # disks per vdev

2011-06-17 Thread Marty Scholes
> Lights.  Good.

Agreed.  In a fit of desperation and stupidity I once enumerated disks by 
pulling them one by one from the array to see which zfs device faulted.

On a busy array it is hard even to use the LEDs as indicators.

It makes me wonder how large shops with thousands of spindles handle this.
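
One trick that has worked for me when the lights are ambiguous (device name 
below is hypothetical): hammer the suspect device with raw reads and watch for 
the one LED that stays solid.

dd if=/dev/rdsk/c8t3d0s0 of=/dev/null bs=1024k count=10000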
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Server with 4 drives, how to configure ZFS?

2011-06-16 Thread Marty Scholes
> Has there been any change to the server hardware with
> respect to number of
> drives since ZFS has come out? Many of the servers
> around still have an even
> number of drives (2, 4) etc. and it seems far from
> optimal from a ZFS
> standpoint. All you can do is make one or two
> mirrors, or a 3 way mirror and
> a spare, right? 

With four drives you could also make a RAIDZ3 set, allowing you to have the 
lowest usable space, poorest performance and worst resilver times possible.

Sorry, couldn't resist.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] # disks per vdev

2011-06-15 Thread Marty Scholes
It sounds like you are getting a good plan together.

> The only thing though I seem to remember reading that adding vdevs to
> pools way after the creation of the pool and data had been written to it,
> that things aren't spread evenly - is that right? So it might actually make
> sense to buy all the disks now and start fresh with the final build.

In this scenario, imbalance would not hurt your performance.  You would start 
with the performance of a single vdev.  Adding the second vdev later will only 
increase performance, even if the pool is horribly imbalanced.  Over time it 
will start to balance itself.  If you want it balanced sooner, you can force 
zfs to rewrite data onto all vdevs by copying files and then deleting the 
originals.
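
A rough sketch of that last idea (path is hypothetical); every block you 
rewrite gets allocated across all current vdevs, including the new one, 
although snapshots will keep the old copies pinned until they expire:

cp -p /tank/media/movie.iso /tank/media/movie.iso.tmp
mv /tank/media/movie.iso.tmp /tank/media/movie.iso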

> Starting with only 6 disks would leave growth for another 6 disk
> raid-z2 (to keep matching geometry) leaving 3 disks spare which is
> not ideal. 

Maintaining identical geometry only matters if all of the disks are identical.  
If you later add 2TB disks, then pick whatever geometry works for you.  The 
most important thing is to maintain consistent vdev types, e.g. all RAIDZ2.

> I do like the idea of having a hot spare

I'm not sure I agree.  In my anecdotal experience, sometimes my array would go 
offline (for whatever reason) and zfs would try to replace as many disks as it 
could with the hot spares.  If there weren't enough hot spares for the whole 
array, the pool was left irreversibly damaged, with several disks in the 
middle of being replaced.  This has only happened once or twice, and in the 
panic I might have handled it incorrectly, but it has spooked me away from 
having hot spares.

> This is a bit OT, but can you have one vdev that is a duplicate of
> another vdev? By that I mean say you had 2x 7 disk raid-z2 vdevs, 
> instead of them both being used in one large pool could you have one
> that is a backup of the other, allowing you to destroy one of them
> and re-build without data loss? 

Absolutely.  I do this very thing with large, slow disks holding a backup for 
the main disks.  My home server has an SMF service which regularly synchronizes 
the time-slider snapshots from each main pool to the backup pool.  This has 
saved me when a whole pool disappeared (see above) and has allowed me to make 
changes to the layout of the main pools.
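
The guts of it are just an incremental send into the backup pool, something 
like this (pool, dataset and snapshot names are made up):

zfs send -i tank/home@hourly-041 tank/home@hourly-042 | zfs receive backup/home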
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS for Linux?

2011-06-14 Thread Marty Scholes
Just for completeness, there is also VirtualBox which runs Solaris nicely.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] # disks per vdev

2011-06-14 Thread Marty Scholes
I am assuming you will put all of the vdevs into a single pool, which is a 
good idea unless you have a specific reason for keeping them separate, e.g. you 
want to be able to destroy / rebuild a particular vdev while leaving the others 
intact.

Fewer disks per vdev implies more vdevs, providing better random performance, 
lower scrub and resilver times and the ability to expand a vdev by replacing 
only the few disks in it.

The downside of more vdevs is that each vdev needs its own parity, e.g. 
RAIDZ2 costs two parity disks per vdev.

> I'm in two minds with mirrors. I know they provide
> the best performance and protection, and if this was
> a business critical machine I wouldn't hesitate.
> 
> But as it for a home media server, which is mainly
> WORM access and will be storing (legal!) DVD/Bluray
> rips i'm not so sure I can sacrify the space.

For a home media server, all accesses are essentially sequential, so random 
performance should not be a deciding factor.

> 7x 2 way mirrors would give me 7TB usable with 1 hot
> spare, using 1TB disks, which is a big drop from
> 12TB! I could always jump to 2TB disks giving me 14TB
> usable but I already have 6x 1TB disks in my WHS
> build which i'd like to re-use.

I would be tempted to start with a 4+2 (six disk RAIDZ2) vdev using your 
current disks and plan from there.  There is no reason you should feel 
compelled to buy more 1TB disks just because you already have some.

> Am I right in saying that single disks cannot be
> added to a raid-z* vdev so a minimum of 3 would be
> required each time. However a mirror is just 2 disks
> so if adding disks over a period of time mirrors
> would be cheaper each time.

That is not correct.  You cannot add disks to an existing RAIDZ vdev.  You can 
attach additional disks to a mirror vdev, but otherwise, once you set the 
geometry, a vdev is stuck for life.

However, you can add any vdev you want to an existing pool.  You can take a 
pool with a single vdev set up as a 6 disk RAIDZ2 and add a single-disk vdev to 
that pool.  That would be a horrible idea, because it makes the entire pool 
dependent upon that single disk, but it illustrates that you can add any type 
of vdev to a pool.
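
To make the distinction concrete (device names hypothetical):

# attach grows an existing mirror vdev in place -- c4t0d0's mirror gains another side
zpool attach tank c4t0d0 c4t6d0

# add creates a brand new top-level vdev; zpool will complain (and want -f)
# if the new vdev's redundancy doesn't match the rest of the pool
zpool add tank raidz2 c5t0d0 c5t1d0 c5t2d0 c5t3d0 c5t4d0 c5t5d0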

Most agree it is best to make the pool from vdevs of identical geometry, but 
that is not enforced by zfs.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS receive checksum mismatch

2011-06-10 Thread Marty Scholes
> I stored a snapshot stream to a file

The tragic irony here is that the file was stored on a non-zfs filesystem.  You 
had undetected bitrot which silently corrupted the stream.  Other files might 
have been silently corrupted as well.

You may have just made one of the strongest cases yet for zfs and its 
assurances.
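
If a stream absolutely must sit on storage you don't fully trust, at least 
capture something you can verify later.  A sketch, with hypothetical paths; 
recent builds also ship a zstreamdump utility which, as I understand it, walks 
the stream's records and recomputes the embedded checksums:

zfs send tank/home@backup | tee /mnt/ext/home.zsend | digest -a sha256 > /mnt/ext/home.zsend.sha256
zstreamdump < /mnt/ext/home.zsend     # per-record summary; bad checksums should stand out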
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS receive checksum mismatch

2011-06-10 Thread Marty Scholes
> If it is true that unlike ZFS itself, the replication
> stream format has
> no redundancy (even of ECC/CRC sort), how can it be
> used for
> long-term retention "on tape"?

It can't.  I don't think it has been documented anywhere, but I believe that it 
has been well understood that if you don't trust your storage (tape, disk, 
floppies, punched cards, whatever), then you shouldn't trust your incremental 
streams on that storage.

It's as if the ZFS design assumed that all incremental streams would be either 
perfect or retryable.

This is a huge problem for tape retention, not so much for disk retention.

On a personal level I have handled this with a separate pool of fewer, larger 
and slower drives which serves solely as backup, taking incremental streams 
from the main pool every 20 minutes or so.

Unfortunately that approach breaks the legacy backup strategy of pretty much 
every company.

I think the message is that unless you can ensure the integrity of the stream, 
either backups should go to another pool or zfs send/receive should not be a 
critical part of the backup strategy.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] L2ARC and poor read performance

2011-06-08 Thread Marty Scholes
> This is not a true statement. If the primarycache
> policy is set to the default, all data will
> be cached in the ARC.

Richard, you know this stuff so well that I am hesitant to disagree with you.  
At the same time, I have seen this myself, trying to load video files into 
L2ARC without success.

> The ARC statistics are nicely documented in arc.c and
> available as kstats.

And I looked in the source.  My C is a little rusty, yet it appears that 
prefetch items are not stored in L2ARC by default.  Prefetches will satisfy a 
good portion of sequential reads but won't go to L2ARC.
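
If I am reading arc.c right, the knob is l2arc_noprefetch, which defaults to 
on.  For testing only, something like this should flip it on a live system -- 
the usual caveats about poking the kernel with mdb apply:

echo l2arc_noprefetch/W0t0 | mdb -kw
kstat -p zfs:0:arcstats:l2_size     # re-check after re-reading the files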
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] L2ARC and poor read performance

2011-06-08 Thread Marty Scholes
> > Are some of the reads sequential?  Sequential reads
> don't go to L2ARC.
> 
> That'll be it. I assume the L2ARC is just taking
> metadata. In situations 
> such as mine, I would quite like the option of
> routing sequential read 
> data to the L2ARC also.

The good news is that it is almost a certainty that actual iSCSI usage will be 
of a (more) random nature than your tests, suggesting higher L2ARC usage in 
real-world use.

I'm not sure how zfs makes the distinction between a random and sequential 
read, but the more you think about it, not caching sequential requests makes 
sense.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] L2ARC and poor read performance

2011-06-07 Thread Marty Scholes
I'll throw out some (possibly bad) ideas.

Is ARC satisfying the caching needs?  32 GB for ARC should almost cover the 
40GB of total reads, suggesting that the L2ARC doesn't add any value for this 
test.

Are the SSD devices saturated from an I/O standpoint?  Put another way, can ZFS 
put data to them fast enough?  If they aren't taking writes fast enough, then 
maybe they can't effectively load for caching.  Certainly if they are saturated 
for writes they can't do much for reads.

Are some of the reads sequential?  Sequential reads don't go to L2ARC.

What does iostat say for the SSD units?  What does arc_summary.pl (maybe 
spelled differently) say about the ARC / L2ARC usage?  How much of the SSD 
units are in use as reported in zpool iostat -v?
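
For reference, the commands I have in mind (pool name hypothetical):

iostat -xn 5                        # %b and asvc_t on the SSDs
kstat -p zfs:0:arcstats | grep l2   # l2_hits, l2_misses, l2_size
zpool iostat -v tank 5              # per-device traffic, cache devices included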
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to properly read "zpool iostat -v" ? ;)

2011-06-02 Thread Marty Scholes
While I am by no means an expert on this, I went through a similar mental 
exercise previously and came to the conclusion that in order to service a 
particular read request, zfs may need to read more from the disk.  For example, 
a 16KB request in a stripe might require retrieving the full 128KB stripe, if 
only to verify the checksum of the stripe prior to returning 16KB to the OS.

If I understand it correctly, the vdev numbers refer to the amount of 
data returned to the OS to satisfy requests, while the individual disk numbers 
refer to the amount of disk I/O required to satisfy those requests.

Does that make sense?

Standard disclaimers apply: I could be wrong, I often am wrong, etc.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] optimal layout for 8x 1 TByte SATA (consumer)

2011-05-27 Thread Marty Scholes
> 2011/5/26 Eugen Leitl :
> > How bad would raidz2 do on mostly sequential writes
> and reads
> > (Athlon64 single-core, 4 GByte RAM, FreeBSD 8.2)?
> >
> > The best way is to go is striping mirrored pools,
> right?
> > I'm worried about losing the two "wrong" drives out
> of 8.
> > These are all 7200.11 Seagates, refurbished. I'd
> scrub
> > once a week, that'd probably suck on raidz2, too?
> >
> > Thanks.
> 
> Sequential? Let's suppose no spares.
> 
> 4 mirrors of 2 = sustained bandwidth of 4 disks
> raidz2 with 8 disks = sustained bandwidth of 6 disks
> 
> So :)

Turn it around and discuss writes.  Reads may or may not give 8x throughput 
with mirrors.  In either setup, writes will require 8x storage bandwidth since 
all drives will be written to.  Mirrors will deliver 4x throughput and RAIDZ2 
will deliver 6x throughput.

For what it's worth, I ran a 22 disk home array as a single RAIDZ3 vdev 
(19+3) for several months and it was fine.  These days I run a 32 disk array 
laid out as four vdevs, each an 8 disk RAIDZ2, i.e. 4x 6+2.
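
For anyone curious, a reduced sketch of that idea (hypothetical device names, 
and only two of the four vdevs shown):

zpool create tank \
  raidz2 c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t6d0 c3t7d0 \
  raidz2 c4t0d0 c4t1d0 c4t2d0 c4t3d0 c4t4d0 c4t5d0 c4t6d0 c4t7d0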

The best advice is simply to test your workload against different 
configurations.  ZFS lets you pick what works for you.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Myth? 21 disk raidz3: "Don't put more than ___ disks in a vdev"

2010-10-20 Thread Marty Scholes
Richard wrote:
>
> Untrue. The performance of a 21-disk raidz3 will be nowhere near the
> performance of a 20 disk 2-way mirrror.

You know this stuff better than I do.  Assuming no bus/cpu bottlenecks, a 21 
disk raidz3 should provide sequential throughput of 18 disks and random 
throughput of 1 disk.

A 20 disk 2-way mirror should provide sequential read throughput of (at best) 
20 disks, sequential write throughput of (at best) 10 disks, random read 
throughput of between 2 and 20 disks and random write throughput of between 1 
and 10 disks.

At one extreme, mirrors are marginally better and at the other extreme mirrors 
are 10x the write and 20x the read performance.  That's a wide range.

> Taking this to a limit, would you say a 1,000 disk
> raidz3 set is a good thing?
> 10,000 disks?

I don't know, maybe.  Even if we accept that there is some magic X where 
stripes wider than X are bad, what is that X and how do we determine it?  
Likely, it depends on several factors, including r/w iops (both of which 
can be mitigated by L2ARC and SLOG) and resilver times.

If seek time were a non-issue (flash?) then there is no real case for mirrors.  
Mirrors can, if the data is laid out perfectly, provide sequential throughput 
which grows linearly with the vdev count.  RAIDZN will always provide 
sequential throughput which grows linearly with the stripe width.  Therefore, 
with low access time and low throughput storage (flash?), RAIDZN with very wide 
stripes makes an awful lot of sense. 

> ZFS is open source, feel free to modify and share your
> ideas for improvement.

And that's what we are doing here: sharing ideas.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RaidzN blocksize ... or blocksize in general ... and resilver

2010-10-18 Thread Marty Scholes
> Richard wrote:
> Yep, it depends entirely on how you use the pool.  As soon as you
> come up with a credible model to predict that, then we can optimize
> accordingly :-)

You say that somewhat tongue-in-cheek, but Edward's right.  If the resilver 
code progresses in slab/transaction-group/whatever-the-correct-term-is order, 
then a pool with any significant use will have the resilver code seeking all 
over the disk.

If instead, resilver blindly moved in block number order, then it would have 
very little seek activity and the effective throughput would be close to that 
of pure sequential i/o for both the new disk and the remaining disks in the 
vdev.

Would it make sense for scrub/resilver to be more aware of operating in disk 
order instead of zfs order?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Optimal raidz3 configuration

2010-10-15 Thread Marty Scholes
> On Fri, Oct 15, 2010 at 3:16 PM, Marty Scholes
>  wrote:
> > My home server's main storage is a 22 (19 + 3) disk
> RAIDZ3 pool backed up hourly to a 14 (11+3) RAIDZ3
> backup pool.
> 
> How long does it take to resilver a disk in that
> pool?  And how long
> does it take to run a scrub?
> 
> When I initially setup a 24-disk raidz2 vdev, it died
> trying to
> resilver a single 500 GB SATA disk.  I/O under 1
> MBps, all 24 drives
> thrashing like crazy, could barely even login to the
> system and type
> onscreen.  It was a nightmare.
> 
> That, and normal (no scrub, no resilver) disk I/O was
> abysmal.
> 
> Since then, I've avoided any vdev with more than 8
> drives in it.

My situation is kind of unique.  I picked up 120 15K 73GB FC disks early this 
year for $2 each.  As such, spindle count is a non-issue.  As a home server, it 
has very little need for write iops and I have 8 disks for L2ARC on the main 
pool.

Main pool is at 40% capacity and backup pool is at 65% capacity.  Both take 
about 70 minutes to scrub.  The last time I tested a resilver it took about 3 
hours.

The difference is that these are low capacity 15K FC spindles and the pool has 
very little sustained I/O; it only bursts now and again.  Resilvers would go 
mostly uncontested, and with RAIDZ3 + autoreplace=off, I can actually schedule 
a resilver.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Optimal raidz3 configuration

2010-10-15 Thread Marty Scholes
Sorry, I can't not respond...

Edward Ned Harvey wrote:
> whatever you do, *don't* configure one huge raidz3.

Peter, whatever you do, *don't* make a decision based on blanket 
generalizations.

> If you can afford mirrors, your risk is much lower.  Because although it's
> physically possible for 2 disks to fail simultaneously and ruin the pool,
> the probability of that happening is smaller than the probability of 3
> simultaneous disk failures on the raidz3.

Edward, I normally agree with most of what you have to say, but this has gone 
off the deep end.  I can think of counter-use-cases far faster than I can type.

>  Due to
> smaller resilver window.

Coupled with a smaller MTTDL, less usable capacity per cabinet, fewer GB per 
dollar, etc.

> I highly endorse mirrors for nearly all purposes.

Clearly.

Peter, go straight to the source.

http://blogs.sun.com/roch/entry/when_to_and_not_to

In short:
1. vdev_count = spindle_count / (stripe_width + parity_count)
2. IO/s is proportional to vdev_count
3. Usable capacity is proportional to stripe_width * vdev_count
4. A mirror can be approximated by a stripe of width one
5. Mean time to data loss increases exponentially with parity_count
6. Resilver time increases (super)linearly with stripe width

Balance capacity available, storage needed, performance needed and your own 
level of paranoia regarding data loss.

My home server's main storage is a 22 (19 + 3) disk RAIDZ3 pool backed up 
hourly to a 14 (11+3) RAIDZ3 backup pool.

Clearly this is not a production Oracle server.  Equally clear is that my 
paranoia index is rather high.

ZFS will let you choose the combination of stripe width and parity count which 
works for you.

There is no "one size fits all."
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Performance issues with iSCSI under Linux

2010-10-15 Thread Marty Scholes
> I've had a few people sending emails directly
> suggesting it might have something to do with the
> ZIL/SLOG.   I guess I should have said that the issue
> happen both ways, whether we copy TO or FROM the
> Nexenta box.

You mentioned a second Nexenta box earlier.  To rule out client-side issues, 
have you considered testing with Nexenta as the iSCSI/NFS client?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Performance issues with iSCSI under Linux

2010-10-13 Thread Marty Scholes
> Here are some more findings...
> 
> The Nexenta box has 3 pools:
> syspool: made of 2 mirrored (hardware RAID) local SAS
> disks
> pool_sas: made of 22 15K SAS disks in ZFS mirrors on
> 2 JBODs on 2 controllers
> pool_sata: made of 42 SATA disks in 6 RAIDZ2 vdevs on
> a single controller
> 
> When we copy data from any linux box to either the
> pool_sas or pool_sata, it is painfully slow.
> 
> When we copy data from any linux box directly to the
> syspool, it is plenty fast
> 
> When we copy data locally on the Nexenta box from the
> syspool to either the pool_sas or pool_sata, it is
> crazy fast.
> 
> We also see the same pattern whether we use iSCSI or
> NFS. We've also tested using different NICs (some at
> 1GbE, some at 10GbE) and even tried bypassing the
> switch by directly connecting the two boxes with a
> cable- and it didn't made any difference.  We've also
> tried not using the SSD for the ZIL.
> 
> So...  
> We've ruled out iSCSI, the networking, the ZIL
> device, even the HBAs as it is fast when it is done
> locally.
> 
> Where should we look next?
> 
> Thank you all for your help!
> Ian

Looking at the list suggested earlier:
1. Linux network stack
2. Linux iSCSI issues
3. Network cabling/switch between the devices
4. Nexenta CPU constraints (unlikely, I know, but let's be thorough)
5. Nexenta network stack
6. COMSTAR problems

It looks like you have ruled out everything.

The only thing that still stands out is that network operations (iSCSI and NFS) 
to external drives are slow, correct?

Just for completeness, what happens if you scp a file to the three different 
pools?  If the results are the same as NFS and iSCSI, then I think the network 
can be ruled out.
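
Something along these lines, run from one of the Linux boxes (mountpoints are 
guesses -- substitute wherever syspool, pool_sas and pool_sata are actually 
mounted):

dd if=/dev/urandom of=/tmp/testfile bs=1M count=2048
scp /tmp/testfile root@nexenta:/syspool/test/
scp /tmp/testfile root@nexenta:/volumes/pool_sas/test/
scp /tmp/testfile root@nexenta:/volumes/pool_sata/test/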

I would be leaning toward thinking there is some mismatch between the network 
protocols and the external controllers/cables/arrays.

Are the controllers the same hardware/firmware/driver for the internal vs. 
external drives?

Keep digging.  I think you are getting close.

Cheers,
Marty
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Ubuntu iSCSI install to COMSTAR zfs volume Howto

2010-10-11 Thread Marty Scholes
I apologize if this has been covered before.  I have not seen a blow-by-blow 
installation guide for Ubuntu onto an iSCSI target.

The install guides I have seen assume that you can make a target visible to 
all, which is a problem if you want multiple iSCSI installations on the same 
COMSTAR target.  During install Ubuntu generates three random initiators and 
you have to deal with them to get things working correctly.

I did this for a few reasons:
1. I have some PCs which already have another OS installed on them and want 
Ubuntu available without any changes to the local drive
2. I want each PC to netboot Ubuntu with no interaction from the user and some 
assurance that each machine will boot the correct image
3. It's cool
4. Because I can

I am confident that there are things here which can be done better.  Any and 
all feedback is appreciated.

Server is OpenSolaris build 128a at 192.168.223.147.
Client is Acer laptop with pxe boot enabled.
DHCP server is dd-wrt router with DHCP modifications

I have the following modifications made to the DHCP server.

dhcp-match=gpxe,175
dhcp-option=175,8:1:1
dhcp-boot=net:#gpxe,gpxe-1.0.1-undionly.kpxe,v40z,192.168.223.147
dhcp-boot=net:gpxe,menu.gpxe,v40z,192.168.223.147


I have added the following files to /tftpboot.

* /tftpboot/gpxe-1.0.1-undionly.kpxe
This is available from www.etherboot.org

* /tftpboot/menu.gpxe
This file is needed to get gpxe to do an iSCSI boot to a target using an 
initiator based on the client uuid.  The contents of my file follow.


#!gpxe

# initialize
dhcp net0

# keep our iSCSI mappings around even if the drive does not resolve
set keep-san 1

# set the initiator using our uuid
set initiator-iqn iqn.1993-08.org.debian:${uuid}

# set the target
set root-path iscsi:192.168.223.147iqn.1986-03.com.sun:02:41fb1720-66ce-c72a-81fb-bbf396db7849

# try to boot from the iSCSI device
echo "Attempting to boot from san ${root-path}"
sanboot ${root-path}

# if we made it here, then boot failed, probably a new disk, chainload
# ubuntu installer

chain pxelinux.0

# for some reason, the silly system stalls and doesn't bother to chainload


* The Ubuntu Lucid netboot files, found at
http://archive.ubuntu.com/ubuntu/dists/lucid/main/installer-amd64/current/images/netboot/netboot.tar.gz

Just follow the 8 steps below, and you have a fully installed Ubuntu client on 
iSCSI

STEP 1 -- Create sparse zfs volume on OpenSolaris


bash-4.0$ pfexec zfs create -s -V 320G tank/export/iscsi/acer-ubuntu
bash-4.0$ zfs get all tank/export/iscsi/acer-ubuntu
NAME                           PROPERTY               VALUE                  SOURCE
tank/export/iscsi/acer-ubuntu  type                   volume                 -
tank/export/iscsi/acer-ubuntu  creation               Mon Oct 11 13:30 2010  -
tank/export/iscsi/acer-ubuntu  used                   54.5K                  -
tank/export/iscsi/acer-ubuntu  available              709G                   -
tank/export/iscsi/acer-ubuntu  referenced             54.5K                  -
tank/export/iscsi/acer-ubuntu  compressratio          1.00x                  -
tank/export/iscsi/acer-ubuntu  reservation            none                   default
tank/export/iscsi/acer-ubuntu  volsize                320G                   -
tank/export/iscsi/acer-ubuntu  checksum               on                     default
tank/export/iscsi/acer-ubuntu  compression            on                     inherited from tank
tank/export/iscsi/acer-ubuntu  readonly               off                    default
tank/export/iscsi/acer-ubuntu  shareiscsi             off                    inherited from tank/export/iscsi
tank/export/iscsi/acer-ubuntu  copies                 1                      default
tank/export/iscsi/acer-ubuntu  refreservation         none                   default
tank/export/iscsi/acer-ubuntu  primarycache           all                    default
tank/export/iscsi/acer-ubuntu  secondarycache         all                    default
tank/export/iscsi/acer-ubuntu  usedbysnapshots        0                      -
tank/export/iscsi/acer-ubuntu  usedbydataset          54.5K                  -
tank/export/iscsi/acer-ubuntu  usedbychildren         0                      -
tank/export/iscsi/acer-ubuntu  usedbyrefreservation   0                      -
tank/export/iscsi/acer-ubuntu  logbias                latency                default
tank/export/iscsi/acer-ubuntu  dedup                  off                    default
tank/export/iscsi/acer-ubuntu  mlslabel               none                   default
tank/export/iscsi/acer-ubuntu  com.sun:auto-snapshot  true                   inherited from tank/export/iscsi
=

Re: [zfs-discuss] Performance issues with iSCSI under Linux

2010-10-09 Thread Marty Scholes
Ok,

Let's think about this for a minute.  The log drive is c1t11d0 and it appears 
to be almost completely unused, so we probably can rule out a ZIL bottleneck.  
I run Ubuntu booting iSCSI against OSol 128a and the writes do not appear to be 
synchronous.  So, writes aren't the issue.

From the Linux side, it appears the drive in question is either sdb or dm-3, 
and both appear to be the same drive.  Since switching to zfs, my Linux-disk-fu 
has become a bit rusty.  Is one an alias for the other?  The Linux disk appears 
to top out at around 50MB/s or so.  That looks suspiciously like it is running 
on a gigabit connection with some problems.

I agree that the zfs side looks like it has plenty of bandwidth and iops to 
spare.

From what I can see, you can narrow the search down to a few things:
1. Linux network stack
2. Linux iSCSI issues
3. Network cabling/switch between the devices
4. Nexenta CPU constraints (unlikely, I know, but let's be thorough)
5. Nexenta network stack
6. COMSTAR problems

As another poster pointed out, testing some NFS and ssh traffic can eliminate 
1, 3 and 5 above.

I recommend going down the list and testing every piece in isolation as much as 
possible to narrow it down.

Good luck and let us know what you learn.

Cheers,
Marty
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Bursty writes - why?

2010-10-06 Thread Marty Scholes
I think you are seeing ZFS store up the writes, coalesce them, then flush to 
disk every 30 seconds.

Unless the writes are synchronous, the ZIL won't be used, but the writes will 
be cached instead, then flushed.

If you think about it, this is far more sane than flushing to disk every time 
the write() system call is used.
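
It is easy to watch from the outside (pool name hypothetical); the writes show 
up in bursts roughly every transaction group interval, which was 30 seconds on 
builds of that vintage:

zpool iostat tank 1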
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] scrub doesn't finally finish?

2010-10-06 Thread Marty Scholes
Have you had a lot of activity since the scrub started?

I have noticed what appears to be extra I/O at the end of a scrub when activity 
took place during the scrub.  It's as if the scrub estimator does not take the 
extra activity into account.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] drive speeds etc

2010-09-28 Thread Marty Scholes
Roy Sigurd Karlsbakk wrote:
> device r/s w/s kr/s kw/s wait actv svc_t %w %b 
> cmdk0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 
> cmdk1 0.0 163.6 0.0 20603.7 1.6 0.5 12.9 24 24 
> fd0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 
> sd0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 
> sd1 0.5 140.3 0.3 2426.3 0.0 1.0 7.2 0 14 
> sd2 0.0 138.3 0.0 2476.3 0.0 1.5 10.6 0 18 
> sd3 0.0 303.9 0.0 2633.8 0.0 0.4 1.3 0 7 
> sd4 0.5 306.9 0.3 2555.8 0.0 0.4 1.2 0 7 
> sd5 1.0 308.5 0.5 2579.7 0.0 0.3 1.0 0 7 
> sd6 1.0 304.9 0.5 2352.1 0.0 0.3 1.1 1 7 
> sd7 1.0 298.9 0.5 2764.5 0.0 0.6 2.0 0 13 
> sd8 1.0 304.9 0.5 2400.8 0.0 0.3 0.9 0 6 

Something is going on with how these writes are ganged together.  The first two 
drives average 17KB per write and the other six 8.7KB per write.

The aggregate statistics listed show less of a disparity, but one still exists.

I have to wonder if there is some "max transfer length" type of setting on each 
drive which is different, allowing the Hitachi drives to allow larger 
transfers, resulting in fewer I/O operations, each having a longer service time.

Just to avoid confusion, the svc_t field is "service time" and not "seek time."  
The service time is the total time to service a request, including seek time, 
controller overhead, time for the data to transit the SATA bus and time to 
write the data.  If the requests are larger, all else being equal, the service 
time will ALWAYS be higher, but that does NOT imply the drive is slower.  On 
the contrary, it often implies a faster drive which can service more data per 
request.

At any rate, there is a reason that the Hitachi drives are handling larger 
requests than the WD drives.  I glanced at the code for a while but could not 
figure out where the max transfer size is determined or used.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] drive speeds etc

2010-09-27 Thread Marty Scholes
Is this a sector size issue?

I see two of the disks each doing the same amount of work in roughly half the 
I/O operations, with each operation taking about twice the time, compared to 
each of the remaining six drives.

I know nothing about either drive, but I wonder if one type of drive has twice 
the sector size of the other?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sliced iSCSI device for doing RAIDZ?

2010-09-24 Thread Marty Scholes
Alexander Skwar wrote:
> Okay. This contradicts the ZFS Best Practices Guide,
> which states:
> 
> # For production environments, configure ZFS so that
> # it can repair data inconsistencies. Use ZFS
> redundancy,
> # such as RAIDZ, RAIDZ-2, RAIDZ-3, mirror, or copies > 1,
> # regardless of the RAID level implemented on the
> # underlying storage device. With such redundancy,
> faults in the
> # underlying storage device or its connections to the
> host can
> # be discovered and repaired by ZFS.

> Anyway. Without redundancy, ZFS cannot do recovery,
> can
> it? As far as I understand, it could detect block
> level corruption,
> even if there's not redundancy. But it could not
> correct such a
> corruption.
> 
> Or is that a wrong understanding?
> 
> If I got the gist of what you wrote, it boils down to
> how reliable
> the SAN is? But also SANs could have "block level"
> corruption,
> no? I'm a bit confused, because of the (perceived?)
> contra-
> diction to the Best Practices Guide… :)

This comes down to how much you trust your "storage device" whatever that may 
be.  If you have full faith in your SAN (and I don't have full faith in it, no 
matter what its make/model), then ignore ZFS redundancy.

When I first deployed a hardware RAID solution around 1995, the vendor proudly 
stated that the device could scrub mirrors and correct errors.  I asked how, 
when it found a discrepancy, it knew which side of the mirror was correct.  He 
stammered for a while, but it basically came down to the device flipping a coin.

ZFS will ensure integrity, even when the underlying device fumbles.

When you mirror the iSCSI devices, be sure that they are configured in such a 
way that a failure on one iSCSI "device" does not imply a failure on the other 
iSCSI device.  As a simple example, if you sliced a disk into three partitions 
and then presented them as a three way mirror to ZFS, then a single disk 
failure will wipe out everything, even though you have the illusion of 
redundancy at the ZFS level.  I have seen some systems where the SAN has 
presented what appeared to be independent devices, but a failure on the 
underlying disk faulted both devices, rendering ZFS helpless.
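
In other words, you want something shaped like this (device names hypothetical), 
where the two LUNs are backed by different physical disks, ideally on different 
arrays and paths:

zpool create tank mirror c0t600144F04C9A7B4Bd0 c0t600144F04D1C2E91d0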

Good luck,
Marty
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] resilver = defrag?

2010-09-16 Thread Marty Scholes
David Dyer-Bennet wrote:
> Sure, if only a single thread is ever writing to the
> disk store at a time.
> 
> This situation doesn't exist with any kind of
> enterprise disk appliance,
> though; there are always multiple users doing stuff.

Ok, I'll bite.

Your assertion seems to be that "any kind of enterprise disk appliance" will 
always have enough simultaneous I/O requests queued that any sequential read 
from any application will be sufficiently broken up by requests from other 
applications, effectively rendering all read requests as random.  If I follow 
your logic, since all requests are essentially random anyway, then where they 
fall on the disk is irrelevant.

I might challenge a couple of those assumptions.

First, if the data is not fragmented, then ZFS would coalesce multiple 
contiguous read requests into a single large read request, increasing total 
throughput regardless of competing I/O requests (which also might benefit from 
the same effect).

Second, I am unaware of an enterprise requirement that disk I/O run at 100% 
busy, any more than I am aware of the same requirement for full network link 
utilization, CPU utilization or PCI bus utilization.

What appears to be missing from this discussion is any shred of scientific 
evidence that fragmentation is good or bad and by how much.  We also lack any 
detail on how much fragmentation does take place.

Let's see if some people in the community can get some real numbers behind this 
stuff in real world situations.

Cheers,
Marty
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] resilver = defrag?

2010-09-14 Thread Marty Scholes
Richard Elling wrote:
> Define "fragmentation"?

Maybe this is the wrong thread.  I have noticed that an old pool can take 4 
hours to scrub, with iostat showing the pool disks reading at 150+ MB/s for a 
large portion of the time while zpool iostat reports 2 MB/s of read throughput.  
My naive interpretation is that the data the scrub is looking for has become 
fragmented.

Should I refresh the pool by zfs sending it to another pool then zfs receiving 
the data back again, the same scrub can take less than an hour with zpool 
iostat reporting more sane throughput.

On an old pool which had lots of snapshots come and go, the scrub throughput is 
awful.  On that same data, refreshed via zfs send/receive, the throughput is 
much better.

It would appear to me that this is an artifact of fragmentation, although I 
have nothing scientific on which to base this.  Additional unscientific 
observations leads me to believe these same "refreshed" pools also perform 
better for non-scrub activities.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] resilver = defrag?

2010-09-09 Thread Marty Scholes
I am speaking from my own observations and nothing scientific such as reading 
the code or designing the process.

> A) Resilver = Defrag. True/false?

False

> B) If I buy larger drives and resilver, does defrag
> happen?

No.  The first X sectors of the bigger drive are identical to the smaller 
drive, fragments and all.

> C) Does zfs send zfs receive mean it will defrag?

Yes.  The data is laid out on the receiving side in a sane manner, until it 
later becomes fragmented.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Suggested RaidZ configuration...

2010-09-09 Thread Marty Scholes
Erik wrote:
> Actually, your biggest bottleneck will be the IOPS
> limits of the 
> drives.  A 7200RPM SATA drive tops out at 100 IOPS.
>  Yup. That's it.
> So, if you need to do 62.5e6 IOPS, and the rebuild
> drive can do just 100 
> IOPS, that means you will finish (best case) in
> 62.5e4 seconds.  Which 
> is over 173 hours. Or, about 7.25 WEEKS.

My OCD is coming out and I will split that hair with you.  173 hours is just 
over a week.

This is a fascinating and timely discussion.  My personal (biased and 
unhindered by facts) preference is wide-stripe RAIDZ3.  Ned is right that I 
kept reading that RAIDZx should not exceed _ devices and couldn't find real 
numbers behind those conclusions.

Discussions in this thread have opened my eyes a little and I am in the middle 
of deploying a second 22 disk fibre array on my home server, so I have been 
struggling with the best way to allocate pools.  Up until reading this thread, 
the biggest downside to wide stripes that I was aware of has been low iops.  
And let's be clear: while on paper the iops of a wide stripe is the same as a 
single disk, it actually is worse.  In truth, the service time for any request 
on a wide stripe is the service time of the SLOWEST disk for that request.  The 
slowest disk may vary from request to request, but will always delay the entire 
stripe operation.

Since all of the 44 spindles are 15K disks, I am about to convince myself to go 
with two pools of wide stripes and keep several spindles for L2ARC and SLOG.  
The thinking is that other background operations (scrub and resilver) can take 
place with little impact to application performance, since those will be using 
L2ARC and SLOG.

Of course, I could be wrong on any of the above.

Cheers,
Marty
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] shrink zpool

2010-08-26 Thread Marty Scholes
> Is it currently or near future possible to shrink a
> zpool "remove a disk"

As others have noted, no, not until the mythical bp_rewrite() function is 
introduced.

So far I have found no documentation on bp_rewrite(), other than it is the 
solution to evacuating a vdev, restriping a vdev, defragmenting a vdev, solving 
world hunger and bringing peace to the Middle East.

If you search the forums you will find all sorts of discussion around this 
elusive feature, but nothing concrete.  I think it's hiding behind the unicorn 
located at the end of the rainbow.

With Oracle withdrawing/inhousing/whatever development, it's a safe bet that 
bp_rewrite() now rests in the hands of the community, possibly to be born in 
Nexenta-land.

Maybe it's time for me to quit whining, dust off my K&R book and get to work on 
the weekends coming up with an honest implementation plan.

Anyone want to join a task force for getting bp_rewrite() implemented as a 
community effort?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] (preview) Whitepaper - ZFS Pools Explained - feedback welcome

2010-08-26 Thread Marty Scholes
This paper is exactly what is needed -- giving an overview to a wide audience 
of the ZFS fundamental components and benefits.

I found several grammar errors -- to be expected in a draft -- and I think at 
least one technical error.

The paper seems to imply that multiple vdevs will induce striping across the 
vdevs, a la RAIDx0.  Though I haven't looked at the code, my understanding is 
that each record is contained within a single vdev.

The clarification that each vdev gives iops roughly equivalent to a single disk 
is useful information not generally understood.  I was glad to see it there.

Overall, this is a terrific step forward for understanding ZFS and encouraging 
its adoption.

Now if only SRSS would work under Nexenta...
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Backup zpool

2010-08-12 Thread Marty Scholes
Script attached.

Cheers,
Marty
-- 
This message posted from opensolaris.org

zfs_sync
Description: Binary data
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Backup zpool

2010-08-12 Thread Marty Scholes
> Hello,
> 
> I would like to backup my main zpool (originally
> called "data") inside an equally originally named
> "backup"zpool, which will also holds other kinds of
> backups.
> 
> Basically I'd like to end up with 
> backup/data
> backup/data/dataset1
> backup/data/dataset2
> backup/otherthings/dataset1
> backup/otherthings/dataset2
> 
> this is quite simply doable by using zfs send / zfs
> receive.
> 
> the problem is with compression. I have default
> compression enabled on my data pool, but I'd like to
> use gzip-2 on backup/data.
> I am using b134 with zpool version 22, which I read
> had some new features regarding this use case
> (http://arc.opensolaris.org/caselog/PSARC/2009/510/20090924_tom.erickson).
> The problem is, I don't understand how to do this. I don't really care about
> mantaining former properties but of course that would
> be a plus. 

I have a similar situation where dedup is enabled on the backup, but not the 
main pool, for performance reasons.  Once the pools are set, I have a script 
which does exactly what you are looking for using the time-slider snaps.  It 
finds the latest snap common to the main and backup pool, rolls back the backup 
to that snap, then sends the incrementals in between.  It also handles the case 
of no destination file system and tries to send the first snap.
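
A stripped-down sketch of the core logic, with hypothetical dataset names and 
none of the error handling (the real script loops over datasets and copes with 
a missing destination or no common snapshot):

SRC=data/dataset1
DST=backup/data/dataset1

# newest snapshot that exists on both sides
COMMON=$(zfs list -H -t snapshot -o name -s creation -r $DST | sed "s|^$DST@||" | \
         while read s; do zfs list "$SRC@$s" >/dev/null 2>&1 && echo "$s"; done | tail -1)

# newest snapshot on the source
LATEST=$(zfs list -H -t snapshot -o name -s creation -r $SRC | tail -1 | cut -d@ -f2)

zfs rollback -r "$DST@$COMMON"
zfs send -i "@$COMMON" "$SRC@$LATEST" | zfs receive "$DST"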

At least in 128a, the auto snapshot seems to delete the old snaps from both 
pools, even though it is not configured to snap the backup pool, which keeps 
the snap count sane on the backup pool.

I would never claim the script is world-class, but I run it hourly from cron 
and it keeps the stuff in sync without me having to do anything.  Say the word 
and I'll send you a copy.

Good luck,
Marty
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Raidz - what is stored in parity?

2010-08-11 Thread Marty Scholes
Peter wrote:
> One question though. Marty mentioned that raidz
> parity is limited to 3. But in my experiment, it
> seems I can get parity to any level.
> 
> You create a raidz zpool as:
> 
> # zpool create mypool raidzx disk1 diskk2 
> 
> Here, x in raidzx is a numeric value indicating the
> desired parity.
> 
> In my experiment, the following command seems to
> work:
> 
> # zpool create mypool raidz10 disk1 disk2 ...
> 
> In my case, it gives an error that I need at least 11
> disks (which I don't) but the point is that raidz
> parity does not seem to be limited to 3. Is this not
> true?

You have piqued my curiosity.  I was asking for that feature in these forums 
last year.

What OS, version and ZFS version are you running?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Raidz - what is stored in parity?

2010-08-11 Thread Marty Scholes
Erik Trimble wrote:
> On 8/10/2010 9:57 PM, Peter Taps wrote:
> > Hi Eric,
> >
> > Thank you for your help. At least one part is clear
> now.
> >
> > I still am confused about how the system is still
> functional after one disk fails.
> >
> > Consider my earlier example of 3 disks zpool
> configured for raidz-1. To keep it simple let's not
> consider block sizes.
> >
> > Let's say I send a write value "abcdef" to the
> zpool.
> >
> > As the data gets striped, we will have 2 characters
> per disk.
> >
> > disk1 = "ab" + some parity info
> > disk2 = "cd" + some parity info
> > disk3 = "ef" + some parity info
> >
> > Now, if disk2 fails, I lost "cd." How will I ever
> recover this? The parity info may tell me that
> something is bad but I don't see how my data will get
> recovered.
> >
> > The only good thing is that any newer data will now
> be striped over two disks.
> >
> > Perhaps I am missing some fundamental concept about
> raidz.
> >
> > Regards,
> > Peter
> 
> Parity is not intended to tell you *if* something is
> bad (well, it's not 
> *designed* for that). It tells you how to RECONSTRUCT
> something should 
> it be bad.  ZFS uses Checksums of the data (which are
> stored as data 
> themselves) to tell if some data is bad, and thus
> needs to be re-written 

To follow up Erik's post, parity is used both to detect and to correct errors 
in a string of equal sized numbers; each parity value is the same size as each 
of the numbers.  In the old serial protocols, one bit was used to detect an 
error in a string of 7 bits, so each "number" in the string was a single bit.  
In the case of ZFS, each "number" in the string is a disk block.  The length of 
the string of numbers is completely arbitrary.

I am rusty on parity math, but Reed-Solomon is used (of which XOR is a 
degenerate case) such that each parity is independent of the other parities.  
RAIDZ can support up to three parities per stripe.

Generally, a single parity can either detect a single corrupt number in a 
string or, if it is known which number is corrupt, correct that number.  
Traditional RAID5 assumes that it knows which number (i.e. block) is bad 
because the disk failed and therefore can use the parity block to reconstruct 
it.  RAID5 cannot reconstruct a random bit-flip.
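
A toy illustration of that single-parity case, with made-up one-byte "blocks" 
(this is just the arithmetic; it says nothing about how ZFS lays blocks out on 
disk):

d1=$((0xab)); d2=$((0xcd))       # data blocks on disk1 and disk2
p=$(( d1 ^ d2 ))                 # parity block on disk3
d2_rebuilt=$(( p ^ d1 ))         # disk2 died; rebuild it from disk1 and parity
printf 'rebuilt d2 = %#x\n' "$d2_rebuilt"    # prints 0xcd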

RAIDZ takes a different approach where the checksum for the number string (i.e. 
stripe) exists in a different, already validated stripe.  With that checksum in 
hand, ZFS knows when a stripe is corrupt but not which block.  ZFS will then 
reconstruct each data block in the stripe using the parity block, one data 
block at a time until the checksum matches.  At that point ZFS knows which 
block is bad and can rebuild it and write it to disk.  A scrub does this for 
all stripes and all parities in each stripe.

Using the example above, the disk layout would look more like the following for 
a single stripe, and as Erik mentioned, the location of the data and parity 
blocks will change from stripe to stripe:
disk1 = "ab"
disk2 = "cd"
disk3 = parity info

Again using the example above, if disk 2 fails, or even stays online but 
produces bad data, the information can be reconstructed from disk 1 and the 
parity on disk 3.

The beauty of ZFS is that it does not depend on parity to detect errors; your 
stripes can be as wide as you want (up to 100-ish devices) and you can choose 
1, 2 or 3 parity devices.

Hope that makes sense,
Marty
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Disk space on Raidz1 configuration

2010-08-06 Thread Marty Scholes
> ahh that explains it all, god damn that base 1000
> standard , only usefull for sales people :)

As much as it all annoys me too, the SI prefixes are used correctly pretty much 
everywhere except in operating systems.

A kilometer is not 1024 meters and a megawatt is not 1048576 watts.

We, the IT community, grabbed a set of well defined prefixes used by the rest 
of creation, redefined them, and then became angry because the remainder of 
civilization uses the correct terms.

We have no one to blame but ourselves.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] slog/L2ARC on a hard drive and not SSD?

2010-07-21 Thread Marty Scholes
> Hi,
> Out of pure curiosity, I was wondering, what would
> happen if one tries to use a regular 7200RPM (or 10K)
> drive as slog or L2ARC (or both)?

I have done both with success.

At one point my backup pool was a collection of USB attached drives (please 
keep the laughter down) with dedup=verify.  Solaris' slow USB performance 
coupled with slow drives and dedup reads gave abysmal write speeds, so much so 
that at times it had trouble keeping the snapshots synchronized.  To help it 
along, I took an unused fast, small SCSI disk and made it L2ARC, which 
significantly improved write performance on the pool.

During testing of some iSCSI applications, I ran into a scenario where a client 
was performing many small, synchronous writes to a zvol in a wide RAIDZ3 
stripe.  Since synchronous writes can double the write activity (once for the 
zil and once for the actual pool), actual throughput from the client was below 
2MB/s, even though the pool would sustain 200MB/s on sequential writes.  As 
above, I added a mirrored slog which was two small, fast SCSI drives.  While I 
expected the throughput to double, it actually went up by a factor of 4, to 
8MB/s.  Even though 8MB/s wasn't mind-numbing, it was enough that it was close 
to saturating the client's 100Mb ethernet link, so it was ok.

I think the reason that the slog improved things so much is that the slog disks 
were not bothered with other i/o and were doing very little seeking.
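
For reference, the commands are the same whether the extra device is an SSD or 
a spare spindle (device names hypothetical):

zpool add tank cache c5t0d0                # L2ARC
zpool add tank log mirror c5t1d0 c5t2d0    # mirrored slog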
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Help identify failed drive

2010-07-21 Thread Marty Scholes
> If the format utility is not displaying the WD drives
> correctly,
> then ZFS won't see them correctly either. You need to
> find out why.
> 
> I would export this pool and recheck all of your
> device connections.

I didn't see it in the postings, but are the same serial numbers showing up 
multiple times?  Is accidental multipathing taking place here?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Help identify failed drive

2010-07-20 Thread marty scholes
Michael Shadle wrote:

> Actually I guess my real question is why iostat hasn't logged any
> errors in its counters even though the device has been bad in there
> for months?

One of my arrays had a drive in slot 4 fault -- lots of reset-something-or-other 
errors.  I cleared the pool errors and it did it again, even though the 
drive was showing ok in smartmontools and passed its internal self test.
drive was showing ok in smartmontools and passed its internal self test.

I replaced the drive with my cold spare and a week later the replacement drive 
in slot 4 had the same errors.

Clearly it was the chassis and not the drive.  I blew out the connector on slot 
4 and it did it again a week later.

Again I cleared the errors, cycled the power on the array, and haven't had the 
problem in the past 5 weeks.

Sometimes things just happen, I guess.



  
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Help identify failed drive

2010-07-19 Thread Marty Scholes
> > ' iostat -Eni ' indeed outputs Device ID on some of
> > the drives,but I still
> > can't understand how it helps me to identify model
> > of specific drive.

Get and install smartmontools.  Period.  I resisted it for a few weeks but it 
has been an amazing tool.  It will tell you more than you ever wanted to know 
about any disk drive in the /dev/rdsk/ tree, down to the serial number.
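
For example (device name hypothetical; depending on the controller you may need 
a -d option to tell it the device type):

smartctl -i /dev/rdsk/c22t4d0s0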

I have seen zfs remember original names in a pool after they have been renamed 
by the OS such that "zpool status" can list c22t4d0 as a drive in the pool when 
there exists no such drive on the system.

> Why has it been reported as bad (for probably 2
> months now, I haven't
> got around to figuring out which disk in the case it
> is etc.) but the
> iostat isn't showing me any errors.

Start a scrub or do an obscure find, e.g. "find /tank_mountpoint -name core", 
and watch the drive activity lights.  The drive in the pool which isn't 
blinking like crazy is the faulted/offlined drive.

Ugly and oh-so-hackerish, but it works.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Move Fedora or Windows disk image to ZFS (iScsi Boot)

2010-07-19 Thread Marty Scholes
> I've found plenty of documentation on how to create a
> ZFS volume, iscsi share it, and then do a fresh
> install of Fedora or Windows on the volume.

Really?  I have found just the opposite: how to move your functioning 
Windows/Linux install to iSCSI.

I am fumbling through this process for Ubuntu on a laptop using a Frankenstein 
mishmash of PXE -> gPXE -> menu.cfg -> sanboot -> grub -> initrd -> Ubuntu.

The initial install is through Ubuntu's netboot pxelinux.0 files which make 
iSCSI installs fairly painless as long as there are no initiator restrictions 
on the LUN.

I couldn't find the magic formula in dnsmasq (on my router) to set the target 
and initiators which is needed to allow multiple devices to see their own iSCSI 
volumes, so I used a ${uuid} suffix for both in a gPXE menu.cfg file.
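
In case it helps anyone, the gPXE side ended up looking roughly like this; the 
address and IQN prefixes below are made up and the exact layout of your menu.cfg 
will differ:

  #!gpxe
  set initiator-iqn iqn.2010-07.org.example:${uuid}
  sanboot iscsi:192.168.1.10::::iqn.2010-07.org.example.target:${uuid}

The ${uuid} expansion is what lets each machine land on its own volume without 
hardcoding anything per host.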

Stranger still, it seems that only one LUN can be allocated system-wide, so I 
can't map LUN0 to target iqn.foo and another LUN0 to target iqn.bar, which 
means each initiator gets a non-zero LUN.  It doesn't seem to bother the iSCSI 
stacks, but it bugs me.

The other poster is correct, all of this has to match in gPXE, initrd and 
Ubuntu.

Either I am more daft than I thought (always a safe choice), or the same thing 
is very difficult in Windows.  To be honest, I have not braved a raw Windows 
install to iSCSI yet, but will once I conquer Ubuntu.

The advantage of going straight to iSCSI is that the zvol can be arbritrarily 
large and you only allocate the blocks which have been touched.  If you install 
to a disk then do the dd if=localdisk of=iSCSIdisk approach, the zvol will be 
completely allocated.  Worse, the iSCSI volume is limited to the size of the 
original disk, which kind of misses the point of thin provisioning.
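
A minimal sketch of the straight-to-iSCSI side, with made-up names (the COMSTAR 
view/target plumbing via stmfadm/itadm is omitted):

  zfs create -s -V 100G tank/laptop0              # -s = sparse, only touched blocks allocate
  sbdadm create-lu /dev/zvol/rdsk/tank/laptop0    # expose the zvol as a SCSI LU

Do it this way and the 100G is a ceiling, not an up-front cost.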

Good luck.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] preparing for future drive additions

2010-07-15 Thread Marty Scholes
Cindy wrote:
> Mirrored pools are more flexible and generally
> provide good performance.
> 
> You can easily create a mirrored pool of two disks
> and then add two
> more disks later. You can also replace each disk with
> larger disks
> if needed. See the example below.

There is no dispute that multiple vdevs (mirrors or otherwise) allow changing 
the drives in a single vdev without requiring a change to the whole pool.

There also is no dispute that mirrors provide better read iops than any other 
vdev type.

On the other hand, situation after situation exists where 2+ drives go offline 
in a pool, leaving the RAIDZ1 and single-mirror vdevs in real trouble.  As I write 
this, the first thread in this forum is about an invalid pool because one drive 
died and another is offline, leaving the pool corrupted.  This stuff just 
happens in the real world with non-DMX-class gear.

One major point I read over and over about zfs was that it allowed the same 
level of protection without needing to spend $35 per GB of storage from an 
enterprise vendor.

The only way to make this happen is with significant redundancy.  I choose n+3 
redundancy and love it.  It's like having two prebuilt hot spares.

To achieve n+3 redundancy with mirrors would require quadrupling the costs and 
spindle count vs. unprotected storage.
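
To make the arithmetic concrete, here is the same n+3 protection both ways 
(device names are placeholders; these are alternatives, not one pool):

  zpool create tank raidz3 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0    # 5 disks for 2 disks' worth of space
  zpool create tank mirror c1t0d0 c1t1d0 c1t2d0 c1t3d0 \
                    mirror c1t4d0 c1t5d0 c1t6d0 c1t7d0           # 8 disks for the same 2 disks' worth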

It would seem that any vdev with n+1 protection is not adequate protection 
using sub-million-dollar storage equipment.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Remove non-redundant disk

2010-07-07 Thread Marty Scholes
> I think the request is to remove vdev's from a pool.
>  Not currently possible.  Is this in the works?

Actually, I think this is two requests, hashed over hundreds of times in this 
forum:
1. Remove a vdev from a pool
2. Nondisruptively change vdev geometry

#1 above has a stunningly obvious use case.  Suppose, despite your best 
efforts, QA, planning and walkthroughs, you accidentally fat finger a "zpool 
attach" and unintentionally "zpool add" a disk to a pool.  There is no way to 
reverse that operation without *significant* downtime.
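
The difference is literally one word (device names made up):

  zpool attach tank c1t0d0 c2t0d0   # intended: mirror the new disk onto an existing one
  zpool add tank c2t0d0             # the slip: c2t0d0 becomes a permanent top-level vdev

zpool will warn about mismatched replication on the second one, but with -f (or 
a plausible-looking vdev) it goes straight through, and there is no undo.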

I have discussed #2 above multiple times and it has at least one obvious use case. 
 Suppose, just for a minute, that over the years since you deployed a zfs pool 
with nearly constant uptime, that your business needs change and you need to 
add a disk to a RAIDZ vdev, or move from RAIDZ1 to RAIDZ2, or disks have grown 
so big that you wish to remove a disk from a vdev.

The responses from the community on the two requests seem to be:
1. Don't ever make this mistake and if you do, then tough luck
2. No business ever changes, or technology never changes, or zfs deployments 
have short lives, or businesses are perfectly ok with large downtimes to effect 
geometry changes.

Both responses seem antithetical to the zfs ethos of survivability in the face 
of errors and nondisruptive flexibility.

Honestly, I still don't understand the resistance to adding those features.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] optimal ZFS filesystem layout on JBOD

2010-07-01 Thread Marty Scholes
Joachim Worringen wrote:
> Greetings,
> 
> we are running a few databases of currently 200GB
> (growing) in total for data warehousing:
> - new data via INSERTs for (up to) millions of rows
> per day; sometimes with UPDATEs
> - most data in a single table (=> 10 to 100s of
> millions of rows)
> - queries SELECT subsets of this table via an index
> - for effective parallelisation, queries create
> (potentially large) non-temporary tables which are
> deleted at the end of the query => lots of simple
> INSERTs and SELECTs during queries
> - large transactions: they may contain millions of
> INSERTs/UPDATEs
> - running version PostgreSQL 8.4.2
> 
> We are moving all this to a larger system - the
> hardware is available, therefore fixed:
> - Sun X4600 (16 cores, 64GB)
> - external SAS JBOD with 24 2,5" slots: 
>   o 18x SAS 10k 146GB drives
> o  2x SAS 10k 73GB drives
>   o 4x Intel SLC 32GB SATA SSD
> JBOD connected to Adaptec SAS HBA with BBU
> - Internal storage via on-board RAID HBA:
>   o 2x 73GB SAS 10k for OS (RAID1)
> o 2x Intel SLC 32GB SATA SSD for ZIL (RAID1) (?)
> - OS will be Solaris 10 to have ZFS as filesystem
> (and dtrace)
> - 10GigE towards client tier (currently, another
> X4600 with 32cores and 64GB)
> 
> What would be the optimal storage/ZFS layout for
> this? I checked solarisinternals.com and some
> PostgreSQL resources and came to the following
> concept - asking for your comments:
> - run the JBOD without HW-RAID, but let all
> redundancy be done by ZFS for maximum flexibility 
> - create separate ZFS pools for tablespaces (data,
> index, temp) and WAL on separate devices (LUNs):
> - use the 4 SSDs in the JBOD as Level-2 ARC cache
> (can I use a single cache for all pools?) w/o
> redundancy
> - use the 2 SSDs connected to the on-board HBA as
> RAID1 for ZFS ZIL
> 
> Potential issues that I see:
> - the ZFS ZIL will not benefit from a BBU (as it is
> connected to the backplane, driven by the
> onboard-RAID), and might be too small (32GB for ~2TB
> of data with lots of writes)?
> - the pools on the JBOD might have the wrong size for
> the tablespaces - like: using the 2 73GB drives as
> RAID 1 for temp might become too small, but adding a
> 146GB drive might not be a good idea?
> - with 20 spindles, does it make sense at all to use
> dedicated devices for the tabelspaces, or will the
> load be distributed well enough across the spindles
> anyway?
> 
> thanks for any comments & suggestions,
>  
>  Joachim

I'll chime in based on some tuning experience I had under UFS with Pg 7.x 
coupled with some experience with ZFS, but no experience with later Pg on ZFS. 
 Take this with a grain of salt.

Pg loves to push everything to the WAL and then dribble the changes back to the 
datafiles when convenient.  At a checkpoint, all of the changes are flushed in 
bulk to the tablespace.  Since the changes to WAL and disk are synchronous, ZIL 
is used, which I believe translates to all data being written four times under 
ZFS: once to WAL ZIL, then to WAL, then to tablespace ZIL, then to tablespace.

For writes, I would break WAL into its own pool and then put an SSD ZIL mirror 
on that.  It would allow all writes to be nearly instant to WAL and would keep 
the ZIL needs to the size of the WAL, which probably won't exceed the size of 
your SSD.  The ZIL on WAL will especially help with large index updates which 
can cause cascading b-tree splits and result in large amounts of small 
synchronous I/O, bringing Pg to a crawl.  Checkpoints will still slow things 
down when the data is flushed to the tablespace pool, but that will happen with 
coalesced writes, so iops would be less of a concern.

For reads, I would either keep indexes and tables on the same pool and back 
them with as much L2ARC as needed for the working set, or if you lack 
sufficient L2ARC, break the indexes into their own pool and L2ARC those 
instead, because index reads generally are more random and heavily used, at 
least for well tuned queries.  Full table scans for well-vacuumed tables are 
generally sequential in nature, so table iops again are less of a concern.

If you have to break the indexes into their own pool for dedicated SSD L2ARC, 
you might consider adding some smaller or short-stroked 15K drives for L2ARC on 
the table pool.
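
Pulling that together, a rough sketch with placeholder device names might look 
like:

  zpool create wal  mirror c2t0d0 c2t1d0 log mirror c3t0d0 c3t1d0      # WAL pool with SSD slog
  zpool create data raidz2 c4t0d0 c4t1d0 c4t2d0 c4t3d0 c4t4d0 c4t5d0   # tables and indexes
  zpool add data cache c3t2d0 c3t3d0                                   # SSD L2ARC for the working set

Adjust the vdev types per the geometry comments below.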

For geometry, find the redundancy that you need, e.g. +1, +2 or +3, then decide 
which is more important, space or iops.  If L2ARC and ZIL reduce your need for 
iops, then go with RAIDZ[123].  If you still need the iops, pile a bunch of 
[123]-way mirrors together.

Yes, I would avoid HW raid and run pure JBOD and would be tempted to keep temp 
tables on the index or table pool.

Like I said above, take this with a grain of salt and feel free to throw out, 
disagree with or lampoon me for anything that does not resonate with you.

Whatever you do, make sure you stress-test the configuration with 
production-size data and workloads before you deploy it.

Good luck,
Marty
-- 
This message posted from opensolaris.org

Re: [zfs-discuss] Depth of Scrub

2010-06-04 Thread Marty Scholes
> I have a small question about the depth of scrub in a
> raidz/2/3 configuration.
> I'm quite sure scrub does not check spares or unused
> areas of the disks (it
> could check if the disks detects any errors there).
> But what about the parity?

From some informal performance testing of RAIDZ2/3 arrays, I am confident that 
scrub reads the parity blocks and normal reads do not.

You can see this for yourself with "iostat -x" or "zpool iostat -v"
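
For example (substitute your own pool name):

  zpool iostat -v tank 5

which prints per-vdev and per-device read/write operations every 5 seconds.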

Start monitoring and watch read I/O.  You will see regularly that a RAIDZ3 
array will read from all but three drives, which I presume is the unread parity.

Do the same monitoring while a scrub is underway and you will see all drives 
being read from equally.

My experience suggests something similar is taking place with mirrors.

If you think about it, having a scrub check everything but the parity would be 
a rather pointless operation.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] one more time: pool size changes

2010-06-04 Thread Marty Scholes
On Jun 3, 2010 7:35 PM, David Magda wrote:

> On Jun 3, 2010, at 13:36, Garrett D'Amore wrote:
> 
> > Perhaps you have been unlucky.  Certainly, there is
> a window with N 
> > +1 redundancy where a single failure leaves the
> system exposed in  
> > the face of a 2nd fault.  This is a statistics
> game...
> 
> It doesn't even have to be a drive failure, but an
> unrecoverable read  
> error.

Well said.

Also include a controller burp, a bit flip somewhere, a drive going offline 
briefly, fibre cable momentary interruption, etc.  The list goes on.

My experience is that these weirdo "once in a lifetime" issues tend to present 
in clumps which are not as evenly distributed as statistics would lead you to 
believe.  Rather, like my kids, they save up their fun into coordinated bursts.

When these bursts happen, you get to have the ensuing conversations with 
stakeholders about how all of this "redundancy" you tricked them into purchasing 
has left them exposed.  Not good times.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] one more time: pool size changes

2010-06-03 Thread Marty Scholes
David Dyer-Bennet wrote:
> My choice of mirrors rather than RAIDZ is based on
> the fact that I have
> only 8 hot-swap bays (I still think of this as LARGE
> for a home server;
> the competition, things like the Drobo, tends to have
> 4 or 5), that I
> don't need really large amounts of storage (after my
> latest upgrade I'm
> running with 1.2TB of available data space), and that
> I expected to need
> to expand storage over the life of the system.  With
> mirror vdevs, I can
> expand them without compromising redundancy even
> temporarily, by attaching
> the new drives before I detach the old drives; I
> couldn't do that with
> RAIDZ.  Also, the fact that disk is now so cheap
> means that 100%
> redundancy is affordable, I don't have to compromise
> on RAIDZ.

Maybe I have been unlucky too many times doing storage admin in the 90s, but 
simple mirroring still scares me.  Even with a hot spare (you do have one, 
right?) the rebuild window leaves the entire pool exposed to a single failure.

One of the nice things about zfs is that allows, "to each his own."  My home 
server's main pool is 22x 73GB disks in a Sun A5000 configured as RAIDZ3.  Even 
without a hot spare, it takes several failures to get the pool into trouble.

At the same time, there are several downsides to a wide stripe like that, 
including relatively poor iops and longer rebuild windows.  As noted above, 
until bp_rewrite arrives, I cannot change the geometry of a vdev, which kind of 
limits the flexibility.

As a side rant, I still find myself baffled that Oracle/Sun correctly touts the 
benefits of zfs in the enterprise, including tremendous flexibility and 
simplicity of filesystem provisioning and nondisruptive changes to filesystems 
via properties.

These forums are filled with people stating that the enterprise demands simple, 
flexible and nondisruptive filesystem changes, but no enterprise cares about 
simple, flexible and nondisruptive pool/vdev changes, e.g. changing a vdev 
geometry or evacuating a vdev.  I can't accept that zfs flexibility is critical 
and zpool flexibility is unwanted.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] creating a fast ZIL device for $200

2010-05-28 Thread Marty Scholes
I have a Sun A5000, 22x 73GB 15K disks in split-bus configuration, two dual 2Gb 
HBAs and four fibre cables from server to array, all for just under $200.

The array gives 4Gb/s of aggregate throughput in each direction across two 11-disk 
buses.

Right now it is the main array, but when we outgrow its storage it will become 
a multiple external ZIL / L2ARC array for a slow sata array.

Admittedly, it is rare for all of the pieces to come together at the right 
price like this and since it is unsupported no one would seriously consider it 
for production.

At the same time, it makes blistering main storage today and will provide for 
amazing iops against slow storage later.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposed idea for enhancement - damage control

2010-02-17 Thread Marty Scholes
I can't stop myself; I have to respond.  :-)

Richard wrote:
> The ideal pool has one inexpensive, fast, and reliable device :-)

My ideal pool has become one inexpensive, fast and reliable "device" built on 
whatever I choose.

> I'm not sure how to connect those into the system (USB 3?)

Me neither, but if I had to start guessing about host connections, I would 
probably think FC.

> but when you build it, let us know how it works out.

While it would be a fun project, a toy like that would certainly exceed my 
feeble hardware experience and even more feeble budget.

At the same time, I could make a compelling argument that this sort of 
arrangement (stripes of flash) is the future of tier-one storage.  We already 
are seeing SSD devices which internally are stripes of flash.  More and more 
disks farms are taking on the older roles of tape.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposed idea for enhancement - damage control

2010-02-17 Thread Marty Scholes
Bob Friesenhahn wrote:
> It is unreasonable to spend more than 24 hours to resilver a single
> drive. It is unreasonable to spend more than 6 days resilvering all 
> of the devices in a RAID group (the 7th day is reserved for the system 
> administrator). It is unreasonable to spend very much time at all on 
> resilvering (using current rotating media) since the resilvering 
> process kills performance.

Bob, the vast majority of your post I agree with.  At the same time, I might 
disagree with a couple of things.

I don't really care how long a resilver takes (hours, days, months) given a 
couple things:
* Sufficient protection exists on the degraded array during rebuild
** Put another way, the array is NEVER in danger
* Rebuild takes a back seat to production demands

Since I am on a rant, I suspect there is also room for improvement in the 
scrub.  Why would I rescrub a stripe that was read (and presumably validated) 
30 seconds ago by a production application?  Wouldn't it make more sense for 
scrub to "play nice" with production, moving a leisurely pace and only 
scrubbing stripes not read in the past X hours/days/weeks/whatever?

I also agree that an ideal pool would lower the device capacity and radically 
increase the device count.  In my perfect world, I would have a 
RAID set of 200+ cheap, low-latency, low-capacity flash drives backed by an 
additional N% parity, e.g. 40-ish flash drives.  A setup like this would give 
massive throughput: 200x each flash drive, amazing IOPS and incredible 
resiliency.  Rebuild times would be low due to lower capacity.  One could 
probably build such a beast in 1U using MicroSDHC cards or some such thing.

End rant.

Cheers,
Marty
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Cores vs. Speed?

2010-02-05 Thread Marty Scholes
>> Was my raidz2 performance comment above correct?
>>  That the write speed is that of the slowest disk?
>>  That is what I believe I have
>> read.

> You are
> sort-of-correct that its the write speed of the
> slowest disk.

My experience is not in line with that statement.  RAIDZ will write a complete 
stripe plus parity (RAIDZ2 -> two parities, etc.).  The write speed of the 
entire stripe will be brought down to that of the slowest disk, but only for 
its portion of the stripe.  In the case of a 5 spindle RAIDZ2, 1/3 of the 
stripe will be written to each of three disks and parity info on the other two 
disks.  The throughput would be 3x the slowest disk for read or write.
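
A worked example, assuming each spindle streams roughly 100 MB/s: the 5-spindle 
RAIDZ2 writes 3 data chunks plus 2 parity chunks per stripe, so sequential 
throughput is about 3 x 100 = 300 MB/s of application data, while IOPS stay near 
those of a single (slowest) disk.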

> Mirrored drives will be faster, especially for
> random I/O. But you sacrifice storage for that
> performance boost.

Is that really true?  Even after glancing at the code, I don't know if zfs 
overlaps mirror reads across devices.  Watching my rpool mirror leads me to 
believe that it does not.  If true, then mirror reads would be no faster than a 
single disk.  Mirror writes are no faster than the slowest disk.

As a somewhat related rant, there seems to be confusion about mirror IOPS vs. 
RAIDZ[123] IOPS.  Assuming mirror reads are not overlapped, then a mirror vdev 
will read and write at roughly the same throughput and IOPS as a single disk 
(ignoring bus and cpu constraints).

Also ignoring bus and cpu constraints, a RAIDZ[123] vdev will read and write at 
roughly the throughput of a single disk multiplied by the number of data 
drives: three in the config being discussed.  Also, a RAIDZ[123] vdev will have 
IOPS performance similar to that of a single disk.

A stack of mirror vdevs will, of course, perform much better than a single 
mirror vdev in terms of throughput and IOPS.

A stack of RAIDZ[123] vdevs will also perform much better than a single 
RAIDZ[123] vdev in terms of throughput and IOPS.

RAIDZ tends to have more CPU overhead and provides more flexibility in choosing 
the optimal data to redundancy ratio.

Many read IOPS problems can be mitigated by L2ARC, even a set of small, fast 
disk drives.  Many write IOPS problems can be mitigated by ZIL.

My anecdotal conclusions backed by zero science,
Marty
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] adpu320 scsi timeouts only with ZFS

2010-01-15 Thread Marty Scholes
> To fix it, I swapped out the Adaptec controller and
> put in LSI Logic  
> and all the problems went away.

I'm using Sun's built-in LSI controller with (I presume) the original internal 
cable shipped by Sun.

Still, no joy for me at U320 speeds.  To be precise, when the controller is set 
at U320, it runs amazingly fast until it freezes, at which point it is quite 
slow.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] adpu320 scsi timeouts only with ZFS

2010-01-14 Thread Marty Scholes
> Any news regarding this issue? I'm having the same
> problems.

Me too.  My v40z with U320 drives in the internal bay will lock up partway 
through a scrub.

I backed the whole SCSI chain down to U160, but it seems a shame that U320 
speeds can't be used.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] $100 SSD = >5x faster dedupe

2010-01-08 Thread marty scholes
--- On Thu, 1/7/10, Tiernan OToole  wrote:
> Sorry to hijack the thread, but can you
> explain your setup? Sounds interesting, but need more
> info...

This is just a home setup to amuse me and placate my three boys, each of whom 
has several Windows instances running under Virtualbox.

Server is a Sun v40z: quad 2.4 GHz Opteron with 16GB.  Internal bays hold a 
pair of 73GB drives as a mirrored rpool and a pair of 36GB drives for spares to 
the array plus a 146GB drive I use as cache to the usb pool (a single 320GB 
sata drive).

The array is an HP MSA30 with 14x36GB drives configured as RAIDZ3 using the 
spares listed above with auto snapshots as the tank pool. Tank is synchronized 
hourly to the usb pool.

It's all connected via four HP 4000M switches (one at the server and one at 
each workstation) which are meshed via gigabit fiber.

Two workstations are triple-head sunrays.

One station is a single sunray 150 integrated unit.

This is a work in progress with plenty of headroom to grow.  I started the 
build in November and have less than $1200 into it so far.

Thanks for letting me hijack the thread by sharing!

Cheers,
Marty


  
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] $100 SSD = >5x faster dedupe

2010-01-07 Thread Marty Scholes
Ian wrote:
> Why did you set dedup=verify on the USB pool?

Because that is my last-ditch copy of the data and MUST be correct.  At the 
same time, I want to cram as much data as possible into the pool.
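
For anyone curious, it is just one property on the pool's top-level dataset 
(pool name made up):

  zfs set dedup=verify usbpool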

If I ever go to the USB pool, something has already gone horribly wrong and I 
am desperate.  I can't comprehend the anxiety I would have if one or more 
stripes had a birthday collision giving me silent data corruption that I found 
out about months or years later.

It's probably paranoid, but a level of paranoia I can live with.

Good question, by the way.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] $100 SSD = >5x faster dedupe

2010-01-06 Thread Marty Scholes
Michael Herf wrote:
> I've written about my slow-to-dedupe RAIDZ.
> 
> After a week of... waiting... I finally bought a
> little $100 30G OCZ
> Vertex and plugged it in as a cache.
> 
> After <2 hours of warmup, my zfs send/receive rate on
> the pool is
> >16MB/sec (reading and writing each at 16MB as
> measured by zpool
> iostat).
> That's up from <3MB/sec, with a RAM-only cache on a
> 6GB machine.
> 
> The SSD has about 8GB utilized right now, and the
> L2ARC benefit is amazing.
> Quite an amazing improvement for $100...recommend you
> don't dedupe without one.

I did something similar, but with a SCSI drive.  I keep a large external USB 
drive as a "last ditch" recovery pool which is synchronized hourly from the 
main pool.  Kind of like a poor man's tape backup.

When I enabled dedup=verify on the USB pool, the sync performance went south, 
because the USB drive had to read stripes to verify that they were actual dups. 
 Since I had an unused 146GB SCSI drive plugged in, I made the SCSI drive L2ARC 
for the USB pool.  Write performance skyrocketed by a factor of 6 and is now 
faster than when there was no dedupe enabled.
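
The command was nothing exotic (names made up):

  zpool add usbpool cache c3t2d0    # the idle 146GB SCSI drive becomes L2ARC for the USB pool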

Marty
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raidz data loss stories?

2009-12-22 Thread Marty Scholes
risner wrote:
> If I understand correctly, raidz{1} is 1 drive
> protection and space is (drives - 1) available.
> Raidz2 is 2 drive protection and space is (drives -
> 2) etc.  Same for raidz3 being 3 drive protection.

Yes.

> Everything I've seen you should stay around 6-9
> drives for raidz, so don't do a raidz3 with 12
> drives.  Instead make two raidz3 with 6 drives each
>  (which is (6-3)*1.5 * 2 = 9 TB array.)

From what I can tell, this is purely a function of needed IOPS.  Wider stripe = 
better storage/bandwidth utilization = less IOPS.  For home usage I run a 14 
drive RAIDZ3 array.

> As for whether or not to do raidz, for me the
> issue is performance.  I can't handle the raidz
> write penalty.

If there is a RAIDZ write penalty over mirroring, I am unaware of it.  In fact, 
sequential writes are faster under RAIDZ.

> If I needed triple drive protection,
> a 3way mirror setup would be the only way I would
> go.

That will give high IOPS with 33% storage utilization and 33% bandwidth 
utilization.  In other words, for every MB of data read/written by an 
application, 3MB is read/written from/to the array and stored.

Multiply all storage and bandwidth needs by three.

>  I don't yet quite understand why a 3+ drive
> raidz2 vdev is better than a 3 drive mirror vdev?
> Other than a 5 drive setup is 3 drives of space
> when a 6 drive setup using 3 way mirror is only 2
>  drive space.

Part of the question you answered yourself.  The other part is that with a 6 
drive RAIDZ3, I can lose ANY three drives and still be running.  With three 
mirrors, I can lose the pool if the wrong two drives die.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raidz data loss stories?

2009-12-22 Thread Marty Scholes
Bob Friesenhahn wrote:
> On Tue, 22 Dec 2009, Marty Scholes wrote:
> >
> > That's not entirely true, is it?
> > * RAIDZ is RAID5 + checksum + COW
> > * RAIDZ2 is RAID6 + checksum + COW
> > * A stack of mirror vdevs is RAID10 + checksum +
> COW
> 
> These are layman's simplifications that no one here
> should be 
> comfortable with.

Well, ok.  They do seem to capture the essence of what the different flavors of 
ZFS protection do, but I'll take you at your word.

We do seem to be spinning off on a tangent, tho.

> Zfs borrows proven data recovery technologies from
> classic RAID but 
> the data layout on disk is not classic RAID, or even
> close to it. 
> Metadata and file data are handled differently.
>  Metadata is always 
> uplicated, with the most critical metadata being
> strewn across 
> multiple disks.  Even "mirror" disks are not really
> mirrors of each 
> other.

I am having a little trouble reconciling the above statements, but again, ok.  
I haven't read the official RAID spec, so again, I'll take you at your word.  
Honestly, those seem like important nuances, but nuances nonetheless.

> Earlier in this discussion thread someone claimed
> that if a raidz disk 
> was lost that the pool was then just one data error
> away from total disaster

That would be me.  Let me substitute the phrase "user data loss in some way, 
shape or form which disrupts availability" for the words "total disaster."

Honestly, I think we are splitting hairs here.  Everyone agrees that RAIDZ 
takes RAID5 to a new level.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raidz data loss stories?

2009-12-22 Thread Marty Scholes
Bob Friesenhahn wrote:
> Why are people talking about "RAID-5", RAID-6", and
> "RAID-10" on this 
> list?  This is the zfs-discuss list and zfs does not
> do "RAID-5", 
> "RAID-6", or "RAID-10".
> 
> Applying classic RAID terms to zfs is just plain
> wrong and misleading 
> since zfs does not directly implement these classic
> RAID approaches 
> even though it re-uses some of the algorithms for
> data recovery. 
> Details do matter.

That's not entirely true, is it?
* RAIDZ is RAID5 + checksum + COW
* RAIDZ2 is RAID6 + checksum + COW
* A stack of mirror vdevs is RAID10 + checksum + COW

While there isn't an actual one-to-one mapping, many traditional RAID concepts 
do seem to apply to ZFS discussions, don't they?

Marty
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raidz data loss stories?

2009-12-22 Thread Marty Scholes
> > Hi Ross,
> >
> > What about old good raid10? It's a pretty
> reasonable choice for  
> > heavy loaded storages, isn't it?
> >
> > I remember when I migrated raidz2 to 8xdrives
> raid10 the application  
> > administrators were just really happy with the new
> access speed. (we  
> > didn't use stripped raidz2 though as you are
> suggesting).
> 
> Raid10 provides excellent performance and if
> performance is a priority  
> then I recommend it, but I was under the impression
> that resiliency  
> was the priority, as raidz2/raidz3 provide greater
> resiliency for a  
> sacrifice in performance.

My experience is in line with Ross' comments.  There is no question that more 
independent vdevs will improve IOPS, e.g. RAID10 or even a pile of RAIDZ vdevs.

I have been burnt too many times to let an array get critical (no redundancy).  
Never, ever, ever again.

With RAID1 or RAID10, one disk loss puts the whole pool in a critical state, just 
one bad sector away from disaster.  One prays the hot spare can be resilvered in time.

With RAIDZ, the same is true.

I think of triple (or even quad) mirroring the same way as I think of RAIDZ3: 
it's like having prebuilt hot spares.

I suspect that the IOPS problems of wide stripes are becoming mitigated by 
L2ARC/ZIL and that the trend will be toward wide stripes with ever higher 
parity counts.

Sun's recent storage offerings tend to confirm this trend: slower, cheaper and 
bigger SATA drives fronted by SSD L2ARC and ZIL.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Stupid to have 2 disk raidz?

2009-10-21 Thread Marty Scholes
Erik Trimble wrote:
> As always, the devil is in the details. In this case,
> the primary 
> problem I'm having is maintaining two different block
> mapping schemes 
> (one for the old disk layout, and one for the new
> disk layout) and still 
> being able to interrupt the expansion process.  My
> primary problem is 
> that I have to keep both schemes in memory during the
> migration, and if 
> something should happen (i.e. reboot, panic, etc)
> then I lose the 
> current state of the zpool, and everything goes to
> hell in a handbasket.

It might not be that bad, if only zfs would allow mirroring a raidz pool.  Back 
when I did storage admin for a smaller company where availability was 
hyper-critical (but we couldn't afford EMC/Veritas), we had a hardware RAID5 
array.  After a few years of service, we ran into some problems:
* Need to restripe the array?  Screwed.
* Need to replace the array because current one is EOL?  Screwed.
* Array controller barfed for whatever reason?  Screwed.
* Need to flash the controller with latest firmware?  Screwed.
* Need to replace a component on the array, e.g. NIC, controller or power 
supply?  Screwed.
* Need to relocate the array?  Screwed.

If we could stomach downtime or short-lived storage solutions, none of this 
would have mattered.

To get around this, we took two hardware RAID arrays and mirrored them in 
software.  We could 
offline/restripe/replace/upgrade/relocate/whatever-we-wanted to an individual 
array since it was only a mirror which we could offline/online or detach/attach.

I suspect this could be simulated today with setting up a mirrored pool on top 
of a zvol of a raidz pool.  That involves a lot of overhead, doing 
parity/checksum calculations multiple times for the same data.  On the plus 
side, setting this up might make it possible to defrag a pool.
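
An untested sketch of that simulation, with made-up names and all of the 
double-parity/double-checksum overhead caveats above:

  zpool create inner1 raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0
  zpool create inner2 raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0
  zfs create -V 500G inner1/vol
  zfs create -V 500G inner2/vol
  zpool create outer mirror /dev/zvol/dsk/inner1/vol /dev/zvol/dsk/inner2/vol

Every block gets checksummed and parity-protected twice, which is exactly the 
overhead I mean.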

Should zfs simply allow mirroring one pool with another, then with a few spare 
disks laying around, altering the geometry of an existing pool could be done 
with zero downtime using steps similar to the following.
1. Create spare_pool as large as current_pool using spare disks
2. Attach spare_pool to current_pool
3. Wait for resilver to complete
4. Detach and destroy current_pool
5. Create new_pool the way you want it now
6. Attach new_pool to spare_pool
7. Wait for resilver to complete
8. Detach/destroy spare_pool
9. Chuckle at the fact that you completely remade your production pool while 
fully available

I did this dance several times over the course of many years back in the 
Disksuite days.

Thoughts?

Marty
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RAIDZ versus mirrroed

2009-09-16 Thread Marty Scholes
> Yes.  This is a mathematical way of saying
> "lose any P+1 of N disks."

I am hesitant to beat this dead horse, yet it is a nuance that either I have 
completely misunderstood or many people I've met have completely missed.

Whether a stripe of mirrors or a mirror of stripes, any single failure makes 
the array critical, i.e. one failure from disaster.

For example, suppose a stripe of four sets of mirrors.  That stripe has 8 disks 
total: four data and four mirrors.  If one disk fails, say on mirror set 3, 
then set 3 is running on a single disk.  Should that remaining disk in set 3 
fail, the whole stripe is lost.  Yes, the stripe is safe as long as the next 
failure is not from set 3.

Contrast that to RAIDZ3.  Suppose seven total disks with the same effective 
pool size: 4 data and 3 parity.  If any single disk is lost then the array is 
not critical and can still survive any other loss.  In fact, it can survive a 
total of any three disk failures before it becomes critical.

I just see it too often where someone states that a stripe of four mirror sets 
can sustain four disk failures.  Yes, that's true, as long as the correct four 
disks fail.  If we could control which disks fail, then none of this would even 
be necessary, so that argument seems rather silly.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RAIDZ versus mirrroed

2009-09-16 Thread Marty Scholes
> This line of reasoning doesn't get you very far.
>  It is much better to take a look at
> the mean time to data loss (MTTDL) for the various
> configurations.  I wrote a
> series of blogs to show how this is done.
> http://blogs.sun.com/relling/tags/mttdl

I will play the Devil's advocate here and point out that the chart shows MTTDL 
for RAIDZ2, both 6 and 8 disk, is much better than mirroring.

The chart does show that three way mirroring is better still and I would guess 
that RAIDZ3 surpasses that.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs send older version?

2009-09-16 Thread Marty Scholes
Lori Alt wrote:
> As for being able to read streams of a later format
> on an earlier 
> version of ZFS, I don't think that will ever be
> supported.  In that 
> case, we really would have to somehow convert the
> format of the objects 
> stored within the send stream and we have no plans to
> implement anything 
> like that. 

If that is true, then it at least makes sense to include a "zfs downgrade" and 
"zpool downgrade" option, does it not?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RAIDZ versus mirrroed

2009-09-16 Thread Marty Scholes
> Generally speaking, striping mirrors will be faster
> than raidz or raidz2,
> but it will require a higher number of disks and
> therefore higher cost to
> The main reason to use
> raidz or raidz2 instead
> of striping mirrors would be to keep the cost down,
> or to get higher usable
> space out of a fixed number of drives.

While it has been a while since I have done storage management for critical 
systems, the advantage I see with RAIDZ-N is better fault tolerance: any N 
drives may fail without losing data.

With straight mirroring, failure of the wrong two drives will invalidate the 
whole pool.

The advantage of striped mirrors is that it offers a better chance of higher 
iops (assuming the I/O is distributed correctly).  Also, it might be easier to 
expand a mirror by upgrading only two drives with larger drives.  With RAIDZ, 
the entire stripe of drives would need to be upgraded.
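
For the mirror case, the swap is just attach/detach with bigger disks (made-up 
device names):

  zpool attach tank c1t0d0 c3t0d0   # attach a larger disk, wait for resilver
  zpool detach tank c1t0d0          # then drop the old one; repeat for the other side

On builds with the autoexpand property the vdev grows once both sides are larger; 
otherwise an export/import picks up the new size.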
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs send older version?

2009-09-15 Thread Marty Scholes
> The zfs send stream is dependent on the version of
> the filesystem, so the 
> only way to create an older stream is to create a
> back-versioned 
> filesystem:
> 
>   zfs create -o version=N pool/filesystem
> You can see what versions your system supports by
> using the zfs upgrade 
> command:

Thanks for the feedback.  So if I have a version X pool/filesystem set, does 
that mean the way to move it back to an older version of TANK is to do 
something like:
* Create OLDTANK with version=N
* For each snapshot in TANK
** (cd tank_snapshot; tar cvf -) | (cd old_tank; tar xvf -)
** zfs snapshot oldtank@the_snapshot_name
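
Spelled out with placeholder names and an arbitrary target version, that would 
be roughly:

  zfs create -o version=3 oldtank/data
  (cd /tank/data/.zfs/snapshot/snap1; tar cf - .) | (cd /oldtank/data; tar xf -)
  zfs snapshot oldtank/data@snap1

repeated for each snapshot in order.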

This seems rather involved to get my current files/snaps into an older format.  
What did I miss?

Thanks again,
Marty
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zfs send older version?

2009-09-14 Thread Marty Scholes
After moving from SXCE to 2009.06, my ZFS pools/file systems were at too new of 
a version.  I upgraded to the latest dev and recently upgraded to 122, but am 
not too thrilled with the instability, especially zfs send / recv lockups 
(don't recall the bug number).

I keep a copy of all of my critical stuff along with the original auto 
snapshots on a USB drive.

I really want to move back to 2009.06 and keep all of my files / snapshots.  Is 
there a way somehow to zfs send an older stream that 2009.06 will read so that 
I can import that into 2009.06?

Can I even create an older pool/dataset using 122?  Ideally I would provision 
an older version of the data and simply reinstall 2009.06 and just import the 
pool created under 122.

It seems this would be a regular request.  If I understand it correctly, an 
older BE cannot read upgraded pools and file systems, so a boot image upgrade 
followed by a zfs and zpool upgrade would kill a shop's ability to fall back.  
Or am I mistaken?

Is there a way to send older streams?

Thanks,
Marty
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss