Re: [zfs-discuss] How to grow ZFS on growing pool?

2010-02-02 Thread David Champion
* On 02 Feb 2010, Darren J Moffat wrote: 
 
 zpool get autoexpand test

This seems to be a new property -- it's not in my Solaris 10 or
OpenSolaris 2009.06 systems, and they have always expanded immediately
upon replacement.  In what build number or official release does
autoexpand appear, and does it always default to off?  This will be
important to know for upgrades.

Thanks.

-- 
 -D.    d...@uchicago.edu    NSIT    University of Chicago
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced drives

2010-02-02 Thread David Champion
* On 02 Feb 2010, Orvar Korvar wrote:
 Ok, I see that the chassis contains a motherboard. So never mind that
 question.

 Another q:  Is it possible to have a large chassis with lots of drives,
 and the OpenSolaris box in another chassis? How do you connect the two?

The J4500 and most other storage products being discussed are not
servers: they are SATA concentrators with SAS uplinks.  You plug in
a bunch of cheap SATA disks, and you connect the chassis to a server
with SAS.  The logic board on the storage tray just converts the SAS
signalling to SATA.  It is not a computer in the usual sense.

In many cases such products also have SAS expander ports, so that you
can link multiple storage trays to a single SAS host bus adapter on your
server by daisy-chaining them.

So you need at least one SAS HBA on your OpenSolaris box, and SAS cables
to hook up the trays containing the SATA drives.
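
Once the tray is cabled, the SATA drives behind the expander show up to
Solaris as ordinary disks.  A rough sketch of checking for them and
pooling them (the device names here are invented for illustration):

# confirm the new disks are visible
format < /dev/null
cfgadm -al | grep disk

# build a raidz2 pool across them
zpool create tank raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0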


To the original question: you can purchase a J4x00 with a limited
number of drives (empty is generally not an option), but there is no
officially-sanctioned way to obtain the drive adapters except to buy Sun
disks.  You need either a SAS or a SATA drive bracket to adapt the drive
to the J4x00 backplane, but they are not sold separately: one ships with
each drive.

As mentioned there are companies that sell remanufactured or discarded
components, or machine their own substitutes.  (Re)marketing Sun or
compatible drive brackets has always been a lively business for a
few small outfits.  But Sun has no involvement with this, and may be
unwilling to support a frankenstein server.

Sun states that their OEM drives are of higher quality than off-the-shelf drives
from manufacturers or retailers, and that they have custom firmware that
improves their performance and reliability in Sun storage trays/arrays.
I see no reason to disbelieve that, but it is quite a steep price to pay
for that premium edge.  When cost is a bigger concern than performance
or reliability, I have generally bought the StorEdge product with the
smallest drives I can (250 GB or 500 GB) and upgraded them myself to the
size I really want.  It's cheaper to buy 20 drives from CDW than 10 from
Sun even when you account for the tiny throwaway drives, and you can keep
the 10 extra as cold spares.  At low enough scale the financial savings
are worth the time to replace them as they fail.
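
For what it's worth, swapping in one of those cold spares after a
failure is a one-liner once the new disk is physically in the failed
disk's slot (device name invented for illustration):

zpool replace tank c3t5d0
zpool status tank     # watch the resilver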

(I wish I could say the same of the StorEdge arrays themselves.  Fully
half of my 2540 controllers have failed, costing me huge amounts of
time in both direct and contractual service, and I've given up on them
completely as a product line.  I'll be thrilled to switch to JBOD.)

For larger systems, and for systems where faults are less tolerable,
I'm happy to pay Sun's premium when the money is available.

However, as others say, the other brands sometimes offer decent enough
products to use instead of Sun's enterprise line.  As always, it depends
on your site's requirements and budget.  I assume that a home NAS is
comparatively low on both: therefore I wouldn't even shop with Sun
unless you have a line on cheap castoffs from an enterprise shop.

-- 
 -D.    d...@uchicago.edu    NSIT    University of Chicago
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to grow ZFS on growing pool?

2010-02-02 Thread David Champion
* On 02 Feb 2010, Richard Elling wrote: 
 
 This behaviour has changed twice.  Long ago, the pools would autoexpand.
 This is a bad thing, by default, so it was changed such that the expansion
 would only occur on pool import (around 3-4 years ago). The autoexpand 
 property allows you to expand without an export/import (and arrived around
 18 months ago). It is not surprising that various Solaris 10 releases/patches
 would have one of the three behaviours.

Well well, I guess it's been a while since I actually tested this.
:)  Thanks, Richard.  I'll watch for autoexpand in next releases of
s10/osol.
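
For anyone else keeping notes, my understanding of the two routes,
sketched against a hypothetical pool named tank:

# builds with the property: grow in place after the last disk is replaced
zpool set autoexpand=on tank
zpool get autoexpand tank

# older releases: force the expansion with an export/import cycle
zpool export tank
zpool import tank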

-- 
 -D.    d...@uchicago.edu    NSIT    University of Chicago
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can zfs snapshot nfs mounts

2009-04-11 Thread David Champion
 guess the upshot is that if one were to rsync data daily to a zfs
 filesystem, the changes wrought there by rsync would be reflected in
 zfs snapshots, maybe timed to happen right after the rsync runs, with
 these new blocks covering only the deltas... I don't really know what
 deltas are... but I guess it would be only the changed parts.

I do this (roughly) for Linux backups.  My ZFS server exports a backup
dataset via NFS to a Linux machine.  Twice a day (4am and 4pm) Linux
rsyncs to the NFS mountpoint.  Once a day (at midnight) the ZFS server
snapshots the dataset.
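
Roughly, the moving parts are just two crontab entries; the paths and
dataset name below are placeholders, not my real ones:

# Linux client: rsync to the NFS-mounted backup dataset at 4am and 4pm
0 4,16 * * * rsync -a --delete /home/ /nfs/backup/home/

# ZFS server: snapshot the dataset at midnight (% is escaped for cron)
0 0 * * * zfs snapshot pool/backup@auto-d`date +\%Y\%m\%d`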

 And I'm guessing further that one would be able to recover each change
 from the snapshots somehow.

Yes.  My ZFS backup dataset has snapdir=hidden, but it's still available
over the NFS mount.  My Linux users can do this kind of thing:
cd /nfs/backup/.zfs/snapshot/auto-d20090312
more somefile

to read somefile from the 12 March 2009 backup.

 In my OP, I mentioned rsync and rsnapshot backup system on linux as
 being in some way comparable.  I do understand how rsnapshot works but
 still not seeing exactly how the zfs snapshots work.
 
 Maybe a concrete example would be a bit easier to understand if you
 can give one.  I'm still not really understanding COW.

Copy on write means that two objects (files) referring to identical data
get pointers to the data instead of duplicate copies.  As long as these
are only read, and not written, the pointer to the same data is fine.
When a write occurs, the data is copied and one of the referrers gets a
pointer to the new copy.  This prevents the write from affecting both
referring files.

The name "copy on write" comes from how the technique is used in
virtual memory.  For disk storage, "copy" isn't entirely accurate:
since the entire data block is rewritten anyway, the separate copy step
can be optimized away.

Here's a simple illustration of COW in action.  It's not necessarily
an accurate depiction of ZFS, but of the general concept in terms of a
filesystem.

  1. When a file (file A) is written to disk, blocks are allocated for
 the file and data is stored in those blocks.  The blocks each have
 a reference count, and ref counts are set to 1 because only one
 file refers to the blocks.

  2. I copy File A to File B.  The new file simply refers to all the
 same blocks.  The ref counts are raised to 2.

  3. I snapshot the filesystem.  This is essentially like copying every
 file in it, as in #2.  No blocks are copied because no new data was
 written, but ref counts are raised.

 I'm not sure about zfs's implementation, but in principle I guess
 an immutable snapshot should only need to raise ref ct by 1 in
 total, whereas a mutable snapshot (i.e., a clone) would increment
 once for every reference in the filesystem.

  4. I rsync to the file in step #1.  Let's suppose this leaves blocks
 1 and 2 alone, but updates block 3.  The new data for block 3 is
 written to a new block (call it 3bis), and block 3 is left on the
 disk as it is.  Block 3's ref count is decremented, and 3bis's ref
 count is set to 1.

 File A: blocks 1, 2, 3bis
 File B: blocks 1, 2, 3
 Block 1: ref ct 3 (file A, file B, snapshot)
 Block 2: ref ct 3 (file A, file B, snapshot)
 Block 3: ref ct 2 (file B, snapshot)
 Block 3bis: ref ct 1 (file A)

  5. I remove file B.  Ref counts for its blocks are decremented, but
 since all its blocks still have ref counts > 0, they persist.  No
 blocks are removed from the dataset.

 File A: blocks 1, 2, 3bis
 Block 1: ref ct 2 (file A, snapshot)
 Block 2: ref ct 2 (file A, snapshot)
 Block 3: ref ct 1 (snapshot)
 Block 3bis: ref ct 1 (file A)

  6. I remove file A.  Ref counts again decrement.

 Block 1: ref ct 1 (snapshot)
 Block 2: ref ct 1 (snapshot)
 Block 3: ref ct 1 (snapshot)
 Block 3bis: ref ct 0

 Since 3bis no longer has any referrers, it is deallocated.  Blocks
 1, 2, and 3 are still used by the snapshot, even though the original
 files A and B are no longer present.

This is a pretty simplistic view.  In practice, not only does the COW
methodology apply to the files' data blocks; it also applies to their
metadata, the filesystem's directories, and so on.  This ensures that
directory information as well as files persist in snapshots.  It also
explains why snapshots are virtually instantaneous: you only make a new
set of pointers to all the existing data, but you don't replace any of
the existing data.

 So if I wanted to find a specific change in a file... that would be
 somewhere in the zfs snapshots... say to retrieve a certain
 formulation in some kind of `rc' file that worked better than a later
 formulation. How would I do that?

Using the .zfs/snapshot directory (see above) you can diff two different
generations of a file at the same path.
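
For example, with the daily snapshots above and a made-up path to the
rc file in question:

diff /nfs/backup/.zfs/snapshot/auto-d20090311/home/user/.somerc \
     /nfs/backup/.zfs/snapshot/auto-d20090312/home/user/.somerc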

-- 
 -D.    d...@uchicago.edu    NSIT    University of Chicago

Re: [zfs-discuss] Can this be done?

2009-04-07 Thread David Champion
* On 07 Apr 2009, Michael Shadle wrote: 
 
 Now quick question - if I have a raidz2 named 'tank' already I can
 expand the pool by doing:
 
 zpool attach tank raidz2 device1 device2 device3 ... device7
 
 It will make 'tank' larger and each group of disks (vdev? or zdev?)
 will be dual parity. It won't create a mirror, will it?

That's right: the pool gets larger, and the new group of disks (a vdev)
carries its own dual parity; no mirror is created.  (One nit: the
command that adds a new top-level vdev is 'zpool add', not 'zpool
attach'; attach is for turning an existing device into a mirror.)

Anything you're unsure about, you can test.  Just create a zpool using
files instead of devices:

for i in 1 2 3 4; do
mkfile 256m /tmp/file$i
done
zpool create testpool raidz /tmp/file1 /tmp/file2 /tmp/file3 /tmp/file4

...and experiment on that.  No data risk this way.
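
For instance, you could rehearse the expansion in your question on that
throwaway pool (raidz here to match the test pool; the idea is the same
for raidz2):

# add a second raidz vdev made of four more files
mkfile 256m /tmp/file5 /tmp/file6 /tmp/file7 /tmp/file8
zpool add testpool raidz /tmp/file5 /tmp/file6 /tmp/file7 /tmp/file8

# the status output shows two separate raidz vdevs, not a mirror
zpool status testpool

# clean up when done
zpool destroy testpool
rm /tmp/file[1-8]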

-- 
 -D.    d...@uchicago.edu    NSIT    University of Chicago
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can this be done?

2009-03-28 Thread David Champion
* On 28 Mar 2009, Peter Tribble wrote: 
 The choice of raidz1 versus raidz2 is another matter. Given that
 you've already got raidz1, and you can't (yet) grow that or expand
 it to raidz2, then there doesn't seem to be much point to having the
 second half of your storage being more protected.

 If you were starting from scratch, then you have a choice between a
 single raidz2 vdev and a pair of raidz1 vdevs. (Lots of other choices
 too, but that is really what you're asking here I guess.)

I've had too many joint failures in my life to put much faith in raidz1,
especially with 7 disks that likely come from the same manufacturing
batch and might exhibit the same flaws.  A single-redundancy system of 7
disks (gross) has too short an MTTDL (mean time to data loss) for my taste.

If you can sell yourself on raidz2 and the loss of two more disks' worth
of capacity -- a loss which IMO is more than made up for by the gain in
security -- consider this technique:

1. build a new zpool of a single raidz2;
2. migrate your data from the old zpool to the new one;
3. destroy the old zpool, releasing its volumes;
4. use 'zpool add' to add those old volumes to the new zpool as a
   second raidz2 vdev (see Richard Elling's previous post).

Now you have a single zpool consisting of two raidz2 vdevs.

The migration in step 2 can be done either by 'zfs send'ing each
filesystem in the old zpool, or by creating analogous filesystems in the
new zpool and rsyncing the files across in one go.
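
A rough sketch of the whole sequence, with made-up pool and device
names, assuming your release supports recursive send/receive (-R);
otherwise send each filesystem individually:

# 1. new pool on the new disks, as a single raidz2
zpool create tank2 raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0

# 2. migrate the data
zfs snapshot -r tank@migrate
zfs send -R tank@migrate | zfs receive -Fd tank2

# 3. release the old disks
zpool destroy tank

# 4. add them back as a second raidz2 vdev
zpool add tank2 raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0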

-- 
 -D.    d...@uchicago.edu    NSIT    University of Chicago
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS: unreliable for professional usage?

2009-02-09 Thread David Champion
 too many words wasted, but not a single word on how to restore the data.

 I have read the man pages carefully. But again: nothing there says
 that 'zfs umount pool' is not allowed on USB drives.

You misunderstand.  This particular point has nothing to do with USB;
it's the same for any ZFS environment.  You're allowed to do a zfs
umount on a filesystem, there's no problem with that.  But remember
that ZFS is not just a filesystem, in the way that reiserfs and UFS are
filesystems.  It's an integrated storage pooling system and filesystem.
When you umount a filesystem, you're not taking any storage offline;
you're just removing the filesystem from the VFS hierarchy.

You umounted a zfs filesystem, not touching the pool, then removed
the device.  This is analogous to preparing an external hardware RAID
and creating one or more filesystems, using them a while, umounting
one of them, and powering down the RAID.  You did nothing to protect
other filesystems or the RAID's r/w cache.  Everything on the RAID
is now inconsistent and suspect.  But since your RAID was a single
striped volume, there's no mirror or parity information with which to
reconstruct the data.

ZFS is capable of detecting these problems, where other filesystems are
often not.  But no filesystem can tell what the data should have been
when the only copy of the data is damaged.

This is documented in ZFS.  It's not about USB, it's just that USB
devices can be more vulnerable to this kind of treatment than other
kinds of storage are.
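
In other words, the safe sequence before pulling a ZFS-backed USB device
is to take the whole pool offline, not just the filesystem; something
along these lines, with a made-up pool name:

zpool export mypool    # flushes and closes every dataset in the pool
# now it is safe to unplug the device
zpool import mypool    # after plugging it back in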

 And again: Why should a 2-week-old Seagate HDD suddenly be damaged,
 if there was no shock, impact, or any other event like that?

It happens all the time.  We just don't always know about it.

-- 
 -D.    d...@uchicago.edu    NSIT    University of Chicago
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] using USB memory keys for l2arc and zil

2009-02-05 Thread David Champion
  Would there be an advantage to using 4GB USB memory sticks on a home
  system for zil and l2arc?
 
 Probably not. Most USB devices are slower than SATA disks.

Moreover, all USB devices are slower than most SATA disks.

-- 
 -D.    d...@uchicago.edu    NSIT    University of Chicago
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] j4200 drive carriers

2009-01-31 Thread David Champion
 nevermind, i will just get a Promise array.

Don't.  I don't normally like to badmouth vendors, but my experience
with Promise was one of the worst in my career, for reasons that should
be relevant to other ZFS-oriented customers.

We ordered a Promise array because their tech sheet said Solaris was
supported.  We received it and set it up, and from the start got scsi
errors from the array when configuring devices.  (This is before even
touching ZFS; at this stage we just wanted to run fdisk.)  It took a
while to find someone at Promise, and when they did they wouldn't open
a case ticket because, they said, Solaris was unsupported.  When I went
back to their web site -- a horrible site, by the way -- the tech sheet
had been replaced with one that did NOT list Solaris among the supported
OSes, although the author and date of the PDF file were the same.

I wrote to my contact at Promise, but they stuck to their guns on the
non-support even after I sent them copies of both PDFs.  I cajoled my
Sun account manager into connecting us with someone who might be able to
figure it out, but no one could.

It took several months to get Promise to agree to refund our unit,
and only because our retailer (CDW) took the reins and held on tight.
Promise stopped returning my e-mail long before that.

Others may have different fortune with them; we were using the
dual-controller FC Vtrak, whatever the model number is, and maybe other
interfaces work better.  But after the support issue, I wouldn't dare
touch them again for use on Solaris.

-- 
 -D.    d...@uchicago.edu    NSIT    University of Chicago
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] after controller crash, 'one or more devices is currently unavailable'

2008-11-06 Thread David Champion
I have a feeling I pushed people away with a long message.  Let me
reduce my problem to one question.

 # zpool import -f z
 cannot import 'z': one or more devices is currently unavailable
 
 
 'zdb -l' shows four valid labels for each of these disks except for the
 new one.  Is this what unavailable means, in this case?

I have now faked up a label for the disk that didn't have one and
applied it with dd.

Can anyone say what unavailable means, given that all eight disks are
registered devices at the correct paths, are readable, and have labels?
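
For reference, the label check I'm running against each member is
simply the following (one of the pool's device paths shown); a healthy
member prints four identical labels:

zdb -l /dev/rdsk/c6t600A0B800049F9E1030548B3DF1Ed0s0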

-- 
 -D.    [EMAIL PROTECTED]    NSIT    University of Chicago
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] after controller crash, 'one or more devices is currently unavailable'

2008-10-30 Thread David Champion

There are a lot of hits for this error in google, but I've had trouble
identifying any that resemble my situation.  I apologize if you've
answered it before.  If it's better for me to open a case with Sun
Support, I can do that, but I'm hoping to cheat my way around the system
so that I don't have to send somebody Explorer output before they
escalate it.  Seems more efficient in the long run. :)

Most of my tale of woe is background:

I have a pool running under Solaris 10 5/08.  It's an 8-member raidz2
whose volumes are on a 2540 array with two controllers.  Volumes are
mapped 1:1 with physical disks.  I didn't really want a 2540, but I
couldn't get anyone to swear to me that any other fiber-channel product
would work with Solaris.  I'm using fiber multipathing.

I've had two disk failures in the past two weeks.  Last week I replaced
the first.  No problems with ZFS initially; a 'zfs replace' did the
right thing.  Yesterday I replaced the second.  But while investigating
the problem I noticed that two of my paths had gone down, so that 6
disks had both paths attached, and 2 disks had only one path.

At this time, 'zpool status' showed:
  pool: z
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-D3
 scrub: resilver completed with 0 errors on Fri Oct 24 20:04:51 2008
config:

        NAME                                       STATE     READ WRITE CKSUM
        z                                          DEGRADED     0     0     0
          raidz2                                   DEGRADED     0     0     0
            c6t600A0B800049F9E1030548B3DF1Ed0s0    ONLINE       0     0     0
            c6t600A0B800049F9E1030848B3DF52d0s0    ONLINE       0     0     0
            c6t600A0B800049F9E1030B48B3DF7Ed0s0    ONLINE       0     0     0
            c6t600A0B800049F9E1030E48B3DFA6d0s0    ONLINE       0     0     0
            c6t600A0B800049F9E1031148B3DFD2d0s0    ONLINE       0     0     0
            c6t600A0B800049F9E1031448B3DFFAd0s0    ONLINE       0     0     0
            c6t600A0B800049F9E1031748B3E020d0s0    UNAVAIL      0     0     0  cannot open
            c6t600A0B800049F9E1031A48B3E04Cd0s0    ONLINE       0     0     0


(At the time I hadn't figured it out, but I believe now that the one
disk was UNAVAIL because the disk had not been properly partitioned yet,
so s0 was undefined.)
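
A note for anyone who hits the same thing: one way to give a blank
replacement LUN the same slice layout as a healthy member is to copy
the label across, roughly as below, assuming SMI/VTOC labels (adjust
if the LUNs are EFI-labeled):

# copy the partition table from a healthy member to the new LUN
prtvtoc /dev/rdsk/c6t600A0B800049F9E1030548B3DF1Ed0s2 \
| fmthard -s - /dev/rdsk/c6t600A0B800049F9E1031748B3E020d0s2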

Solaris 10's mpath support seems so far to be fairly intolerant of
reconfiguration without a reboot, and I wasn't ready to reboot yet, but
I thought I'd try resetting the controller that wasn't attached to all
of the disks.  But it appears that for some reason the CAM software
reset both controllers simultaneously.  The whole pool went into an
error state, and all disks became unavailable.  Very annoying, but not a
problem for zfs-discuss.


At this time, 'zpool status' showed:
  pool: z
 state: FAULTED
status: One or more devices could not be opened.  There are insufficient
replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-D3
 scrub: none requested
config:

        NAME                                       STATE     READ WRITE CKSUM
        z                                          FAULTED      0     0     0  corrupted data
          raidz2                                   DEGRADED     0     0     0
            c6t600A0B800049F9E1030548B3DF1Ed0s0    UNAVAIL      0     0     0  corrupted data
            c6t600A0B800049F9E1030848B3DF52d0s0    UNAVAIL      0     0     0  corrupted data
            c6t600A0B800049F9E1030B48B3CF7Ed0s0    UNAVAIL      0     0     0  corrupted data
            c6t600A0B800049F9E1030E48B3DFA6d0s0    UNAVAIL      0     0     0  cannot open
            c6t600A0B800049F9E1031148B3DFD2d0s0    UNAVAIL      0     0     0  corrupted data
            c6t600A0B800049F9E1031448B3DFFAd0s0    UNAVAIL      0     0     0  corrupted data
            c6t600A0B800049F9E1031748B3E020d0s0    UNAVAIL      0     0     0  cannot open
            c6t600A0B800049F9E1031A48B3E04Cd0s0    UNAVAIL      0     0     0  corrupted data


I don't know whether there's any chance of recovering this, but I
wanted to try.  I reset the 2540 again, but still no communication with
Solaris.  I rebooted the server, and communications resumed.  I had to
do some further repair/reconfig on the 2540 for the two disks marked
'cannot open', but it was a minor issue and worked fine.  Solaris was
then able to see all my disks.


Now we come to the main point.

I still hadn't figured out the partitioning problem on E020d0s0
yet.  It didn't occur to me because I believed that to be a spare disk
which I had already