Re: [zfs-discuss] pool metadata has duplicate children
On 2013-Jan-08 21:30:57 -0800, John Giannandrea j...@meer.net wrote:
> Notice that in the absence of the faulted da2 the OS has assigned da3
> to da2 etc.  I suspect this was part of the original problem in
> creating a label with two da2s

The primary vdev identifier is the guid.  The path is of secondary
importance (ZFS should automatically recover from juggled disks without
an issue - and has for me).

Try running zdb -l on each of your pool disks and verify that each has
4 identical labels, and that the 5 guids (one on each disk) are unique
and match the vdev_tree you got from zdb.  My suspicion is that you've
somehow lost the disk with the guid 3419704811362497180.

> twa0: 3ware 9000 series Storage Controller
> twa0: INFO: (0x15: 0x1300): Controller details:: Model 9500S-8, 8 ports, Firmware FE9X 2.08.00.006
> da0 at twa0 bus 0 scbus0 target 0 lun 0
> da1 at twa0 bus 0 scbus0 target 1 lun 0
> da2 at twa0 bus 0 scbus0 target 2 lun 0
> da3 at twa0 bus 0 scbus0 target 3 lun 0
> da4 at twa0 bus 0 scbus0 target 4 lun 0

Are these all JBOD devices?

-- Peter Jeremy
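A loop like the following is one way to do that check (a rough sketch -
the da0..da4 device names are only illustrative, adjust for your
controller):

  for d in da0 da1 da2 da3 da4; do
          echo "== ${d} =="
          # each disk should carry 4 identical labels; compare the guids by eye
          zdb -l /dev/${d} | grep -E 'guid|path'
  done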
Re: [zfs-discuss] Repairing corrupted ZFS pool
On 2012-Nov-19 11:02:06 -0500, Ray Arachelian r...@arachelian.com wrote:
> Is the pool importing properly at least?  Maybe you can create another
> volume and transfer the data over for that volume, then destroy it?

The pool is imported and passes all tests except zfs diff.  Creating
another pool _is_ an option but I'm not sure how to transfer the data
across - using zfs send | zfs recv replicates the corruption and
tar -c | tar -x loses all the snapshots.

> There are special things you can do with import where you can roll back
> to a certain txg on the import if you know the damage is recent.

The damage exists in the oldest snapshot for that filesystem.

-- Peter Jeremy
Re: [zfs-discuss] Repairing corrupted ZFS pool
On 2012-Nov-19 13:47:01 -0500, Ray Arachelian r...@arachelian.com wrote:
> On 11/19/2012 12:03 PM, Peter Jeremy wrote:
>> The damage exists in the oldest snapshot for that filesystem.
> Are you able to delete that snapshot?

Yes, but it makes no difference - the corrupt object exists in the
current pool, so deleting an old snapshot has no effect.  What I was
hoping was that someone would have a suggestion on removing the
corruption in-place - using zdb, zhack or similar.

-- Peter Jeremy
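For anyone wanting to look at this sort of damage before touching it,
zdb can at least dump the suspect object.  Something like the following
(the dataset name and object number are only placeholders):

  # take the object number reported by zfs diff or a scrub, then:
  zdb -dddd tank/fs 12345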
Re: [zfs-discuss] Repairing corrupted ZFS pool
On 2012-Nov-19 21:10:56 +0100, Jim Klimov jimkli...@cos.ru wrote:
> On 2012-11-19 20:28, Peter Jeremy wrote:
>> Yep - that's the fallback solution.  With 1874 snapshots spread over
>> 54 filesystems (including a couple of clones), that's a major
>> undertaking.  (And it loses timestamp information).
> Well, as long as you have and know the base snapshots for the clones,
> you can recreate them at the same branching point on the new copy too.

Yes, it's just painful.

> Also, while you are at it, you can use different settings on the new
> pool, based on your achieved knowledge of your data

This pool has a rebuild in its future anyway so I have this planned.

> - perhaps using better compression (IMHO stale old data that became
> mostly read-only is a good candidate for gzip-9), setting proper block
> sizes for files of databases and disk images, maybe setting better
> checksums, and if your RAM vastness and data similarity permit -
> perhaps employing dedup

After reading the horror stories and reading up on how dedupe works,
this is definitely not on the list.

> (run zdb -S on source pool to simulate dedup and see if you get any
> better than 3x savings - then it may become worthwhile).

Not without lots more RAM - and that would mean a whole new box.

> Perhaps, if the zfs diff does perform reasonably for you, you can feed
> its output as the list of objects to replicate in rsync's input and
> save many cycles this way.

The starting point of this saga was that zfs diff failed, so that isn't
an option.

On 2012-Nov-19 21:24:19 +0100, Jim Klimov jimkli...@cos.ru wrote:
> fatally difficult scripting (I don't know if it is possible to fetch
> the older attribute values from snapshots - which were in force at that
> past moment of time; if somebody knows anything on this - plz write).

The best way to identify past attributes is probably to parse zfs
history, though that won't help for received attributes.

-- Peter Jeremy
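For reference, the dedup simulation Jim mentions is just (pool name
illustrative):

  zdb -S tank

It prints a simulated DDT histogram and an overall dedup ratio without
actually enabling dedup, so it's a cheap way to decide whether the RAM
cost could ever pay off.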
Re: [zfs-discuss] Repairing corrupted ZFS pool
On 2012-Nov-19 14:38:30 -0700, Mark Shellenbaum mark.shellenb...@oracle.com wrote:
> On 11/19/12 1:14 PM, Jim Klimov wrote:
>> On 2012-11-19 20:58, Mark Shellenbaum wrote:
>>> There is probably nothing wrong with the snapshots.  This is a bug in
>>> ZFS diff.  The ZPL parent pointer is only guaranteed to be correct
>>> for directory objects.  What you probably have is a file that was
>>> hard linked multiple times and the parent pointer (i.e. directory)
>>> was recycled and is now a file
>> Ah.  Thank you for that.  I knew about the parent pointer, I wasn't
>> aware that ZFS didn't manage it correctly.
>
> The parent pointer for hard linked files is always set to the last
> link to be created.
>
> $ mkdir dir.1
> $ mkdir dir.2
> $ touch dir.1/a
> $ ln dir.1/a dir.2/a.linked
> $ rm -rf dir.2
>
> Now the parent pointer for a will reference a removed directory.

I've done some experimenting and confirmed this behaviour.  I gather
zdb bypasses ARC because the change of parent pointer after the ln(1)
only becomes visible after a sync.

> The ZPL never uses the parent pointer internally.  It is only used by
> zfs diff and other utility code to translate object numbers to full
> pathnames.  The ZPL has always set the parent pointer, but it is more
> for debugging purposes.

I didn't realise that.  I agree that the above scenario can't be
tracked with a single parent pointer, but I assumed that ZFS reset the
parent to unknown rather than leaving it as a pointer to a random
no-longer-valid object.

This probably needs to be documented as a caveat on zfs diff -
especially since it can cause hangs and panics with older kernel code.

-- Peter Jeremy
[zfs-discuss] Repairing corrupted ZFS pool
send/recv (which happily and quietly replicates the corruption).

Note that I have never (intentionally) used extended attributes within
the pool but it has been exported to Windows XP via Samba and possibly
to OS-X via NFSv3.

Does anyone have any suggestions for fixing the corruption?  One
suggestion was tar c | tar x but that is a last resort (since there are
54 filesystems and ~1900 snapshots in the pool).

-- Peter Jeremy
Re: [zfs-discuss] ZFS best practice for FreeBSD?
On 2012-Oct-12 08:11:13 +0100, andy thomas a...@time-domain.co.uk wrote:
> This is apparently what had been done in this case:
>
> gpart add -b 34 -s 600 -t freebsd-swap da0
> gpart add -b 634 -s 1947525101 -t freebsd-zfs da1
> gpart show

Assuming that you can be sure that you'll keep 512B sector disks,
that's OK but I'd recommend that you align both the swap and ZFS
partitions on at least 4KiB boundaries for future-proofing (ie you can
safely stick the same partition table onto a 4KiB disk in future).

> Is this a good scheme?  The server has 12 G of memory (upped from 4 GB
> last year after it kept crashing with out of memory reports on the
> console screen) so I doubt the swap would actually be used very often.

Having enough swap to hold a crashdump is useful.  You might consider
using gmirror for swap redundancy (though 3-way is overkill).  (And I'd
strongly recommend against swapping to a zvol or ZFS - FreeBSD has
issues with that combination).

> The other issue with this server is it needs to be rebooted every 8-10
> weeks as disk I/O slows to a crawl over time and the server becomes
> unusable.  After a reboot, it's fine again.  I'm told ZFS 13 on FreeBSD
> 8.0 has a lot of problems

Yes, it does - and your symptoms match one of the problems.  Does
top(1) report lots of inactive and cache memory and very little free
memory, and a high kstat.zfs.misc.arcstats.memory_throttle_count, once
I/O starts slowing down?

> so I was planning to rebuild the server with FreeBSD 9.0 and ZFS 28 but
> I didn't want to make any basic design mistakes in doing this.

I'd suggest you test 9.1-RC2 (just released) with a view to using 9.1,
rather than installing 9.0.  Since your questions are FreeBSD specific,
you might prefer to ask on the freebsd-fs list.

-- Peter Jeremy
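As a rough sketch of 4KiB-aligned partitioning (sizes and device name
are only examples; recent gpart accepts -a for alignment, otherwise
pick -b start values that are multiples of 8 sectors):

  gpart add -a 4k -s 8g -t freebsd-swap da0
  gpart add -a 4k -t freebsd-zfs da0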
Re: [zfs-discuss] FreeBSD ZFS
On 2012-Aug-09 16:05:00 +0530, Jim Klimov jimkli...@cos.ru wrote:
> 2012-08-09 13:57, Karl Wagner wrote:
>> Firstly, I believe it currently stands at zpool v28.  Is this correct?

For FreeBSD 8.x and 9.x, yes.  FreeBSD-head includes feature flags and
com.delphix:async_destroy.

>> Will this be updated any time soon?

I expect 8-stable and 9-stable will be updated to match -head once
FreeBSD 9.1 is released (ie 9.1 won't support feature flags but 9.2 and
a potential 8.4 will).  In general, FreeBSD imports ZFS fixes and
enhancements, mostly from Illumos, as they become available.  The
Oracle v29 and later updates won't be available in FreeBSD unless they
are open-sourced by Oracle.

> New features in the works include modernized compression and checksum
> algorithms, among others.  Nominal zpool version is 5000 for pools
> which enabled feature flags, and that is currently supported by
> oi_151a5 prebuilt distro (I don't know of other builds with that -
> feature integrated into code this summer).

FreeBSD-head does.

-- Peter Jeremy
Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?
On 2012-Aug-02 18:30:01 +0530, opensolarisisdeadlongliveopensolaris
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
> Ok, so the point is, in some cases, somebody might want redundancy on a
> device that has no redundancy.  They're willing to pay for it by
> halving their performance.

This isn't quite true - write performance will be at least halved
(possibly worse due to additional seeking) but read performance could
potentially improve (more copies means, on average, there should be
less seeking to get a copy than if there was only one copy).  And
non-I/O performance is unaffected.

> The only situation I'll acknowledge is the laptop situation, and I'll
> say, present day very few people would be willing to pay *that* much
> for this limited use-case redundancy.

My guess is that, for most people, the overall performance impact would
be minimal because disk write performance isn't the limiting factor for
most laptop usage scenarios.

> The solution that I as an IT person would recommend and deploy would be
> to run without copies and instead cover you bum by doing backups.

You need backups in any case, but backups won't help you if you can't
conveniently access them.  Before giving a blanket recommendation, you
need to consider how the person uses their laptop.  Consider the
following scenario: You're in the middle of a week-long business trip
and your laptop develops a bad sector in an inconvenient spot.  Do you:
a) Let ZFS automagically repair the sector thanks to copies=2.
b) Attempt to rebuild your laptop and restore from backups (left
   securely at home) via the dodgy hotel wifi.

-- Peter Jeremy
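For anyone wanting to try it, enabling extra copies is a one-liner
(dataset name is only an example); note that it only applies to blocks
written after the property is set:

  zfs set copies=2 rpool/home
  zfs get copies rpool/home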
Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?
On 2012-Aug-01 21:00:46 +0530, Nigel W nige...@nosun.ca wrote:
> I think a fantastic idea for dealing with the DDT (and all other
> metadata for that matter) would be an option to put (a copy of)
> metadata exclusively on a SSD.

This is on my wishlist as well.  I believe ZEVO supports it so possibly
it'll be available in ZFS in the near future.

-- Peter Jeremy
Re: [zfs-discuss] Interaction between ZFS intent log and mmap'd files
On 2012-Jul-05 06:47:36 +1000, Nico Williams n...@cryptonector.com wrote:
> On Wed, Jul 4, 2012 at 11:14 AM, Bob Friesenhahn
> bfrie...@simple.dallas.tx.us wrote:
>> On Tue, 3 Jul 2012, James Litchfield wrote:
>>> Agreed - msync/munmap is the only guarantee.
>> I don't see that the munmap definition assures that anything is
>> written to disk.  The system is free to buffer the data in RAM as long
>> as it likes without writing anything at all.
> Oddly enough the manpages at the Open Group don't make this clear.

They don't specify the behaviour on write(2) or close(2) either.  All
this means is that there is no guarantee that munmap(2) (or write(2) or
close(2)) will immediately flush the data to stable storage.

> So I think it may well be advisable to use msync(3C) before munmap()
> on MAP_SHARED mappings.

If you want to be certain that your changes will be flushed to stable
storage by a particular point in your program execution then you must
call msync(MS_SYNC) before munmap(2).

> However, I think all implementors should, and probably all do (Linux
> even documents that it does) have an implied msync(2) when doing a
> munmap(2).

There's nothing in the standard requiring this behaviour and it would
adversely impact performance in the general case, so I would expect
that implementors _wouldn't_ force msync(2) on munmap(2).  FreeBSD
definitely doesn't.  As for Linux, I keep finding cases where, if a
standard doesn't mandate specific behaviour, Linux will implement (and
document) different behaviour to the way other OSs behave in the same
situation.

> It really makes no sense at all to have munmap(2) not imply msync(3C).

Actually, it makes no more sense for munmap(2) to imply msync(2) than
it does for close(2) [which is functionally equivalent] to imply
fsync(2) - ie none at all.

> (That's another thing, I don't see where the standard requires that
> munmap(2) be synchronous.

http://pubs.opengroup.org/onlinepubs/009695399/functions/munmap.html
states "Further references to these pages shall result in the
generation of a SIGSEGV signal to the process."  It's difficult to see
how to implement this behaviour unless munmap(2) is synchronous.

> Async munmap(2) - no need to mount cross-calls, instead allowing to
> mapping to be torn down over time.  Doing a synchronous msync(3C), then
> a munmap(2) is a recipe for going real slow, but if munmap(2) does not
> portably guarantee an implied msync(3C), then would it be safe to do an
> async msync(2) then munmap(2)??)

I don't understand what you are trying to achieve here.  munmap(2)
should be a relatively cheap operation so there is very little to be
gained by making it asynchronous.  Can you please explain a scenario
where munmap(2) would be slow (other than cases where implementors have
deliberately and unnecessarily made it slow)?  I agree that
msync(MS_SYNC) is slow, but if you want a guarantee that your data is
securely written to stable storage then you need to wait for that
stable storage.  msync(MS_ASYNC) should have no impact on a later
munmap(2) and it should always be safe to call msync(MS_ASYNC) before
munmap(2) (in fact, it's a good idea to maximise portability).

-- Peter Jeremy
Re: [zfs-discuss] Spare drive inherited cksum errors?
On 2012-May-29 22:04:39 +1000, Edward Ned Harvey
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
> If you have a drive (or two drives) with bad sectors, they will only be
> detected as long as the bad sectors get used.  Given that your pool is
> less than 100% full, it means you might still have bad hardware going
> undetected, if you pass your scrub.

One way around this is to 'dd' each drive to /dev/null (or do a long
test using smartmontools).  This ensures that the drive thinks all
sectors are readable.

> You might consider creating a big file (dd if=/dev/zero of=bigfile.junk
> bs=1024k) and then when you're out of disk space, scrub again.
> (Obviously, you would be unable to make new writes to pool as long as
> it's filled...)  I'm not sure how ZFS handles no large free blocks, so
> you might need to repeat this more than once to fill the disk.

This could leave your drive seriously fragmented.  If you do try this,
I'd recommend creating a snapshot first and then rolling back to it,
rather than just deleting the junk file.  Also, this (obviously) won't
work at all on a filesystem with compression enabled.

-- Peter Jeremy
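For example (device name is only illustrative - repeat for each member
of the pool):

  dd if=/dev/ada0 of=/dev/null bs=1m      # force every sector to be read
  smartctl -t long /dev/ada0              # or queue a SMART extended self-test
  smartctl -a /dev/ada0                   # and check the self-test log afterwards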
Re: [zfs-discuss] Drive upgrades
On 2012-Apr-17 17:25:36 +1000, Jim Klimov jimkli...@cos.ru wrote:
> For the sake of archives, can you please post a common troubleshooting
> techinque which users can try at home to see if their disks honour the
> request or not? ;)  I guess it would involve random-write bandwidths in
> two cases?

1) Issue disable write cache command to drive
2) Write several MB of data to drive
3) As soon as drive acknowledges completion, remove power to drive
   (this will require an electronic switch in the drive's power lead)
4) Wait until drive spins down.
5) Power up drive and wait until ready
6) Verify data written in (2) can be read.
7) Argue with drive vendor that drive doesn't meet specifications :-)

A similar approach can also be used to verify that NCQ cache flush
commands actually work.

-- Peter Jeremy
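Steps 1, 2 and 6 could be scripted along these lines on a Linux test
box (a sketch only - the device name is a placeholder, the target
disk's contents are destroyed, and the power-cut in step 3 still has to
be done by hand):

  hdparm -W 0 /dev/sdX                                                 # step 1: disable the write cache
  dd if=testdata of=/dev/sdX bs=1M count=64 oflag=direct conv=fsync    # step 2: write the test data
  # ...cut power here, then after power-up:
  dd if=/dev/sdX bs=1M count=64 | cmp - testdata                       # step 6: compare against the original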
Re: [zfs-discuss] Drive upgrades
On 2012-Apr-14 02:30:54 +1000, Tim Cook t...@cook.ms wrote:
> You will however have an issue replacing them if one should fail.  You
> need to have the same block count to replace a device, which is why I
> asked for a right-sizing years ago.

The traditional approach to this is to slice the disk yourself so you
have a slice with a known size and a dummy slice of a couple of GB in
case a replacement is a bit smaller.  Unfortunately, ZFS on Solaris
disables the drive cache if you don't give it a complete disk, so this
approach incurs a significant performance overhead there.  FreeBSD
leaves the drive cache enabled in either situation.  I'm not sure how
OI or Linux behave.

-- Peter Jeremy
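On FreeBSD, that under-sizing might look something like this (a sketch
only - device name and sector count are illustrative; the point is
simply to stop a few GB short of the disk's capacity):

  gpart create -s gpt da0
  gpart add -a 4k -s 1949000000 -t freebsd-zfs -l data0 da0   # ~1TB disk, leaves ~2GB unused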
Re: [zfs-discuss] Improving snapshot write performance
On 2012-Apr-11 18:34:42 +1000, Ian Collins i...@ianshome.com wrote:
> I use an application with a fairly large receive data buffer (256MB) to
> replicate data between sites.  I have noticed the buffer becoming
> completely full when receiving snapshots for some filesystems, even
> over a slow (~2MB/sec) WAN connection.  I assume this is due to the
> changes being widely scattered.

As Richard pointed out, the write side should be mostly contiguous.

> Is there any way to improve this situation?

Is the target pool nearly full (so ZFS is spending lots of time
searching for free space)?

Do you have dedupe enabled on the target pool?  This would force ZFS to
search the DDT to write blocks - this will be expensive, especially if
you don't have enough RAM.

Do you have a high compression level (gzip or gzip-N) on the target
filesystems, without enough CPU horsepower?

Do you have a dying (or dead) disk in the target pool?

-- Peter Jeremy
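A few quick checks along those lines (pool name is only an example; the
iostat form is the Solaris one):

  zpool list tank                 # how full is the target pool?
  zfs get -r dedup,compression tank
  zpool status -x                 # any sick devices?
  iostat -xn 5                    # look for one disk with much higher service times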
Re: [zfs-discuss] about btrfs and zfs
On 2011-Oct-18 23:18:02 +1100, Edward Ned Harvey
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
> I recently put my first btrfs system into production.  Here are the
> similarities/differences I noticed different between btrfs and zfs:

Thanks for that.

> * zfs has storage tiering.  (cache log devices, such as SSD's to
> accelerate performance.)  btrfs doesn't have this yet.

I'd call that multi-level caching and journalling.  To me, storage
tiering means something like HSM - something that lets me push rarely
used data to near-line storage (eg big green SATA drives that are spun
down most of the time) whilst retaining the ability to transparently
access it.

On 2011-Oct-19 03:46:30 +1100, Mark Sandrock mark.sandr...@oracle.com wrote:
> Doesn't a scrub do more than what 'fsck' does?

It does different things.  I'm not sure about "more".

fsck verifies the logical consistency of a filesystem.  For UFS, this
includes: used data blocks are allocated to exactly one file, directory
entries point to valid inodes, allocated inodes have at least one link,
the number of links in an inode exactly matches the number of directory
entries pointing to that inode, directories form a single tree without
loops, file sizes are consistent with the number of allocated blocks,
unallocated data/inode blocks are in the relevant free bitmaps, and
redundant superblock data is consistent.  It can't verify data.

scrub uses checksums to verify the contents of all blocks and attempts
to correct errors using redundant copies of blocks.  This implicitly
detects some types of logical errors.  I don't know if scrub includes
explicit logic to detect things like directory loops, missing free
blocks, unreachable allocated blocks, multiply allocated blocks, etc.

> IIRC, fsck was seldom needed at my former site once UFS journalling
> became available.  Sweet update.

Whilst Solaris very rarely insists we run fsck, we have had a number of
cases where we have found files corrupted following a crash - even with
UFS journalling enabled.  Unfortunately, this isn't the sort of thing
that fsck could detect.

-- Peter Jeremy
Re: [zfs-discuss] Large scale performance query
On 2011-Aug-08 17:12:15 +0800, Andrew Gabriel andrew.gabr...@oracle.com wrote:
> periodic scrubs to cater for this case.  I do a scrub via cron once a
> week on my home system.  Having almost completely filled the pool, this
> was taking about 24 hours.  However, now that I've replaced the disks
> and done a send/recv of the data across to a new larger pool which is
> only 1/3rd full, that's dropped down to 2 hours.

FWIW, scrub time is more related to how fragmented a pool is, rather
than how full it is.  My main pool is only at 61% (of 5.4TiB) and has
never been much above that but has lots of snapshots and a fair amount
of activity.  A scrub takes around 17 hours.

This is another area where the mythical block rewrite would help a lot.

-- Peter Jeremy
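A weekly cron-driven scrub like the one Andrew describes is just a
one-line crontab entry, e.g. (pool name and zpool path are only
examples):

  # run a scrub early every Sunday morning
  0 3 * * 0 /usr/sbin/zpool scrub tank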
Re: [zfs-discuss] SSD vs hybrid drive - any advice?
On 2011-Jul-26 17:24:05 +0800, Fajar A. Nugraha w...@fajar.net wrote:
> Shouldn't modern SSD controllers be smart enough already that they know:
> - if there's a request to overwrite a sector, then the old data on that
>   sector is no longer needed

ZFS never does update-in-place and UFS only does update-in-place for
metadata and where the application forces update-in-place.  This means
there will generally (always for ZFS) be a delay between when a
filesystem frees (is no longer interested in the contents of) a sector
and when it overwrites that sector.  Without TRIM support, an SSD can
only use overwrite to indicate that the contents of a sector are not
needed.  Which, in turn, means there is a pool of sectors that the FS
knows are unused but the SSD doesn't - and is therefore forced to
preserve.

Since an overwrite almost never matches the erase page, this increases
wear on the SSD because it is forced to rewrite unwanted data in order
to free up pages for erasure to support external write requests.  It
also reduces performance for several reasons:
- The SSD has to unnecessarily copy data - which takes time.
- The space recovered by each erasure is effectively reduced by the
  amount of rewritten data, so more time-consuming erasures are needed
  for a given external write load.
- The pools of unused-but-not-erased and erased (available) sectors are
  smaller, increasing the probability that an external write will
  require a synchronous erase cycle to complete.

> - allocate a clean sector from pool of available sectors (part of
>   wear-leveling mechanism)

As above, in the absence of TRIM, the pool will be smaller (and more
likely to be empty).

> - clear the old sector, and add it to the pool (possibly done in
>   background operation)

Otherwise a sector could never be rewritten.

> It seems to be the case with sandforce-based SSDs.  That would pretty
> much let the SSD work just fine even without TRIM (like when used under
> HW raid).

Better SSDs mitigate the problem by having more hidden space (keeping
the available pool larger to reduce the probability of a synchronous
erase being needed) and higher performance (masking the impact of the
additional internal writes and erasures).  If TRIM support was
available then the performance would still improve.  This means you
either get better system performance from the same SSD, or you can get
the same system performance from a lower-performance (cheaper) SSD.

-- Peter Jeremy
Re: [zfs-discuss] Changed to AHCI, can not access disk???
On 2011-Jul-05 21:03:50 +0800, Edward Ned Harvey
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Orvar Korvar
>> ...
>> I suspect the problem is because I changed to AHCI.
> This is normal, no matter what OS you have.  It's the hardware.

Switching to AHCI changes the device interface presented to the kernel
and you need a different device driver to access the data.  As long as
your OS supports AHCI (and that is true of any OS that supports ZFS)
then you will still be able to access the disks - though the actual
path to the disk or disk device name will change.

> If you start using a disk in non-AHCI mode, you must always continue to
> use it in non-AHCI mode.  If you switch, it will make the old data
> inaccessible.

Only if your OS is broken.  The data is equally accessible in either
mode.  ZFS makes it easier to switch modes because it doesn't care
about the actual device name - at worst, you will need an export and
import.

-- Peter Jeremy
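In other words, the worst case is something like (pool name is only an
example):

  zpool export tank
  # shut down, switch the controller to AHCI in the BIOS, boot again
  zpool import tank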
Re: [zfs-discuss] ZFS working group and feature flags proposal
On 2011-May-26 03:02:04 +0800, Matthew Ahrens mahr...@delphix.com wrote:
> The first product of the working group is the design for a ZFS on-disk
> versioning method that will allow for distributed development of ZFS
> on-disk format changes without further explicit coordination.  This
> method eliminates the problem of two developers both allocating version
> number 31 to mean their own feature.

Looks good.

> pool open (zpool import and implicit import from zpool.cache)
> If pool is at SPA_VERSION_FEATURES, we must check for feature
> compatibility.  First we will look through entries in the label
> nvlist's features_for_read.  If there is a feature listed there which
> we don't understand, and it has a nonzero value, then we can not open
> the pool.

Is it worth splitting the feature "used" value into "optional" and
"mandatory"?  (Possibly with the ability to have an optional read
feature be linked to a mandatory write feature.)  To use an existing
example: dedupe (AFAIK) does not affect read code and so could show up
as an optional read feature but a mandatory write feature (though I
suspect this could equally be handled by just listing it in
features_for_write).

As a more theoretical example, consider OS-X resource forks.  The
presence of a resource fork matters for both read and write on OS-X but
nowhere else.  A (hypothetical) ZFS port to OS-X would want to know
whether the pool contained resource forks even if opened R/O but this
should not stop a different ZFS port from reading (and maybe even
writing to) the pool.

-- Peter Jeremy
Re: [zfs-discuss] ZFS, Oracle and Nexenta
On 2011-May-25 03:49:43 +0800, Brandon High bh...@freaks.com wrote:
> ... unless Oracle's zpool v30 is different than Nexenta's v30.

This would be unfortunate but no worse than the current situation with
UFS - Solaris, *BSD and HP Tru64 all have native UFS filesystems, all
of which are incompatible.

I believe the various OSS projects that use ZFS have formed a working
group to co-ordinate ZFS amongst themselves.  I don't know if Oracle
was invited to join (though given the way Oracle has behaved in all the
other OSS working groups it was a member of, having Oracle onboard
might be a disadvantage).

-- Peter Jeremy
Re: [zfs-discuss] Backup complete rpool structure and data to tape
On 2011-May-12 00:20:28 +0800, Edward Ned Harvey
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
> Backup/restore of bootable rpool to tape with a 3rd party application
> like legato etc is kind of difficult.  Because if you need to do a bare
> metal restore, how are you going to do it?

This is a generic problem, not limited to ZFS.  The generic solutions
are either:
a) Customised boot disk that includes the 3rd party restore client
b) Separate backup of root+client in a format that's restorable using
   tools only on the generic boot disk (eg tar or ufsdump).
(Where "boot disk" could be network boot instead of a physical CD/DVD).

> I might suggest:  If you use zfs send to backup rpool to a file in the
> data pool...  And then use legato etc to backup the data pool...

As Edward pointed out later, this might be OK as a disaster-recovery
approach but isn't suitable for the situation where you want to restore
a subset of the files (eg you need to recover a file someone
accidentally deleted) and a zfs send stream isn't intended for storage.

Another potential downside is that the only way to read the stream is
using zfs recv into ZFS - this could present a problem if you wanted to
migrate the data into a different filesystem.  (All other restore
utilities I'm aware of use normal open/write/chmod/... interfaces so
you can restore your backup into any filesystem).

Finally, the send/recv protocol is not guaranteed to be compatible
between ZFS versions.  I'm not aware of any specific issues (though
someone reports that a zfs.v15 send | zfs.v22 recv caused pool
corruption in another recent thread) and would hope that zfs recv would
always maintain full compatibility with older zfs send.

> But I hope you can completely abandon the whole 3rd party backup
> software and tapes.  Some people can, and others cannot.  By far, the
> fastest best way to backup ZFS is to use zfs send | zfs receive on
> another system or a set of removable disks.

Unfortunately, this doesn't fit cleanly into the traditional enterprise
backup solution where Legato/NetBackup/TSM/... backs up into a SILO
with automatic tape replication and off-site rotation.

> Incidentally, when you do incremental zfs send, you have to specify the
> from and to snapshots.  So there must be at least one identical
> snapshot in the sending and receiving system (or else your only option
> is to do a complete full send.)

And (at least on v15) if you are using an incremental replication
stream and you create (or clone) a new descendent filesystem, you will
need to manually manage the initial replication of that filesystem.

BTW, if you do elect to build a bootable, removable drive for backups,
you should be aware that gzip compression isn't supported - at least in
v15, trying to make a gzip compressed filesystem bootable or trying to
set compression=gzip on a bootable filesystem gives a very
uninformative error message and it took a fair amount of trawling
through the source code to find the real cause.

-- Peter Jeremy
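For reference, a send/recv cycle to another pool of the kind discussed
above might look roughly like this (pool and snapshot names are only
placeholders):

  zfs snapshot -r rpool@backup-1
  zfs send -R rpool@backup-1 | zfs recv -Fdu backuppool        # initial full copy
  # ...later, send only the changes since the previous snapshot:
  zfs snapshot -r rpool@backup-2
  zfs send -R -i @backup-1 rpool@backup-2 | zfs recv -Fdu backuppool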
Re: [zfs-discuss] Quick zfs send -i performance questions
On 2011-May-04 08:39:39 +0800, Rich Teer rich.t...@rite-group.com wrote:
> Also related to this is a performance question.  My initial test
> involved copying a 50 MB zfs file system to a new disk, which took 2.5
> minutes to complete.  The strikes me as being a bit high for a mere
> 50 MB; are my expectation realistic or is it just because of my very
> budget concious set up?  If so, where's the bottleneck?

Possibilities I can think of:
- Do you have lots of snapshots?  There's an overhead of a second or so
  for each snapshot to be sent.
- Is the source pool heavily fragmented with lots of small files?

> The source pool is on a pair of 146 GB 10K RPM disks on separate busses
> in a D1000 (split bus arrangement) and the destination pool is on a
> IOMega 1 GB USB attached disk.  The machine to which both pools are
> connected is a Sun Blade 1000 with a pair of 900 MHz US-III CPUs and
> 2 GB of RAM.

Hopefully a silly question but does the SB1000 support USB2?  All of
the Sun hardware I've dealt with only has USB1 ports.  And, BTW, 2GB
RAM is very light on for ZFS (though I note you only have a very small
amount of data).

-- Peter Jeremy
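Counting the snapshots involved is a quick first check - at a second or
so each, a hundred snapshots would account for most of that 2.5 minutes
(pool name is only an example):

  zfs list -H -t snapshot -r tank | wc -l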
Re: [zfs-discuss] zfs incremental send?
On 2011-Mar-29 02:19:30 +0800, Roy Sigurd Karlsbakk r...@karlsbakk.net wrote:
> Is it (or will it) be possible to do a partial/resumable zfs
> send/receive?  If having 30TB of data and only a gigabit link, such
> transfers takes a while, and if interrupted, will require a re-transmit
> of all the data.

zfs send/receive works on snapshots: The smallest chunk of data that
can be sent/received is the delta between two snapshots.  There's no
way to do a partial delta - defining the endpoint of a partial transfer
or the starting point for resumption is effectively a snapshot.

For an initial replication of a large amount of data, the most feasible
approach is probably to temporarily co-locate the destination disk
array with the server to copy the data across.  You can reduce the size
of each incremental chunk by taking frequent snapshots (these can be
deleted once they have been replicated to the backup host).

-- Peter Jeremy
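The frequent-snapshot approach boils down to repeating something like
the following (a sketch - dataset, snapshot names and the ssh target
are placeholders); each iteration only re-sends the delta since the
last successfully received snapshot:

  # assumes tank/data@2011-03-28 already exists on both sides
  zfs snapshot tank/data@2011-03-29
  zfs send -i tank/data@2011-03-28 tank/data@2011-03-29 | \
          ssh backuphost zfs recv -d backup \
      && zfs destroy tank/data@2011-03-28      # old snapshot no longer needed locally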
Re: [zfs-discuss] Invisible snapshot/clone
On 2011-Mar-17 10:23:01 +0800, Edward Ned Harvey
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
> To find it, run zdb -d, and search for something with a %
> Something like:  zdb -d tank | grep %
> And then you can zfs destroy the thing.

Thanks, that worked.

> P.S.  Every time I did this, the zfs destroy would complete with some
> sort of error message, but then if you searched for the thing again,
> you would see that it actually completed successfully.

Likewise, I had 'zfs destroy' whinge but the offending clone was gone.

> P.S.  If your primary goal is to use ZFS, you would probably be better
> switching to nexenta or openindiana or solaris 11 express, because they
> all support ZFS much better than freebsd.

I'm primarily interested in running FreeBSD and will be upgrading to
ZFSv28 once it's been shaken out a bit longer.

-- Peter Jeremy
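For the archives, the sequence looks roughly like this (the '%recv'
name is just an example - an interrupted receive typically leaves a
hidden clone with a '%' in its name; destroy whatever name the grep
actually turns up):

  zdb -d zroot | grep %
  zfs destroy zroot/home/%recv     # may print an error but still removes the clone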
[zfs-discuss] Invisible snapshot/clone
I am in the process of upgrading from FreeBSD-8.1 with ZFSv14 to
FreeBSD-8.2 with ZFSv15 and, following a crash, have run into a problem
with ZFS claiming a snapshot or clone exists that I can't find.

I was transferring a set of snapshots from my primary desktop to a
backup host (both ZFSv14) using:

zfs send -I zroot/home@20110210bu -R zroot/home@20110317bu | \
    ssh backup_host zfs recv -vd zroot

and whilst that was in progress, I did a 'df -k' on backup_host.  At
this point, both the df and the zfs recv wedged unkillably.  zfs list
showed that the last snapshot on the destination system was
zroot/home@20110309, so I did a rollback to it (which reported no
error) and ran:

zfs send -I zroot/home@20110309 -R zroot/home@20110317bu | \
    ssh backup_host zfs recv -vd zroot

which reported:

receiving incremental stream of zroot/home@20110310 into zroot/home@20110310
cannot restore to zroot/home@20110310: destination already exists
warning: cannot send 'zroot/home@20110310': Broken pipe

I cannot find anything by that name (or any snapshots later than
zroot/home@20110309 or any clones) and cannot destroy
zroot/home@20110309:

# zfs rollback zroot/home@20110309
# zfs destroy zroot/home@20110309
cannot destroy 'zroot/home@20110309': dataset already exists
# zfs destroy -r zroot/home@20110309
cannot destroy 'zroot/home@20110309': snapshot is cloned
no snapshots destroyed
# zfs destroy -R zroot/home@20110309
cannot destroy 'zroot/home@20110309': snapshot is cloned
no snapshots destroyed
# zfs destroy -frR zroot/home@20110309
cannot destroy 'zroot/home@20110309': snapshot is cloned
no snapshots destroyed
# zfs list -t all | grep home@20110310
# zfs get all | grep origin
# zfs get all | grep home@20110310
#

I have tried rebooting, upgrading the pool from v14 to v15 and
export/import without success.  Does anyone have any other suggestions?

zpool history -i looks like:

2011-03-17.08:02:57 zfs rollback zroot/home@20110210bu
2011-03-17.08:02:59 zfs recv -vd zroot
2011-03-17.08:02:59 [internal replay_inc_sync txg:872817696] dataset = 973
2011-03-17.08:02:59 [internal reservation set txg:872817697] 0 dataset = 469
...
2011-03-17.08:09:41 [internal snapshot txg:872817974] dataset = 1203
2011-03-17.08:09:42 [internal replay_inc_sync txg:872817975] dataset = 1208
2011-03-17.08:09:42 [internal reservation set txg:872817976] 0 dataset = 469
2011-03-17.08:09:42 [internal property set txg:872817977] compression=10 dataset = 469
2011-03-17.08:09:42 [internal property set txg:872817977] mountpoint=/home dataset = 469
2011-03-17.08:09:50 [internal destroy_begin_sync txg:872817980] dataset = 1208
2011-03-17.08:09:51 [internal destroy txg:872817983] dataset = 1208
2011-03-17.08:09:51 [internal reservation set txg:872817983] 0 dataset = 0
2011-03-17.08:09:51 [internal snapshot txg:872817984] dataset = 1212
2011-03-17.08:09:52 [internal replay_inc_sync txg:872817985] dataset = 1217
2011-03-17.08:09:52 [internal reservation set txg:872817986] 0 dataset = 469
2011-03-17.08:09:52 [internal property set txg:872817987] compression=10 dataset = 469
2011-03-17.08:09:52 [internal property set txg:872817987] mountpoint=/home dataset = 469
 system wedged here 
2011-03-17.08:35:01 [internal rollback txg:872818038] dataset = 469
2011-03-17.08:35:01 zfs rollback zroot/home@20110309
2011-03-17.08:35:14 zfs recv -vd zroot
2011-03-17.08:36:37 [internal pool scrub txg:872818059] func=1 mintxg=0 maxtxg=872818059
2011-03-17.08:36:41 zpool scrub zroot
2011-03-17.09:17:27 [internal pool scrub done txg:872818513] complete=1
2011-03-17.09:19:44 [internal rollback txg:872818542] dataset = 469
2011-03-17.09:19:45 zfs rollback zroot/home@20110309
2011-03-17.10:51:38 [internal rollback txg:872819603] dataset = 469
2011-03-17.10:51:39 zfs rollback zroot/home@20110309
2011-03-17.10:54:11 zpool upgrade zroot
2011-03-17.10:59:12 [internal rollback txg:872819688] dataset = 469
2011-03-17.10:59:12 zfs rollback zroot/home@20110309
2011-03-17.11:16:38 [internal rollback txg:872819895] dataset = 469
2011-03-17.11:16:39 zfs rollback zroot/home@20110309
2011-03-17.11:16:54 zpool export zroot
2011-03-17.11:17:31 zpool import zroot
2011-03-17.11:30:13 [internal rollback txg:872819992] dataset = 469
2011-03-17.11:30:13 zfs rollback zroot/home@20110309
2011-03-17.12:01:02 zfs recv -vd zroot
2011-03-17.12:03:57 [internal rollback txg:872820399] dataset = 469
2011-03-17.12:03:57 zfs rollback zroot/home@20110309

-- Peter Jeremy
Re: [zfs-discuss] Free space on ZFS file system unexpectedly missing
On 2011-Mar-10 05:50:53 +0800, Tom Fanning m...@tomfanning.eu wrote:
> I have a FreeNAS 0.7.2 box, based on FreeBSD 7.3-RELEASE-p1, running
> ZFS with 4x1TB SATA drives in RAIDz1.  I appear to have lost 1TB of
> usable space after creating and deleting a 1TB sparse file.  This
> happened months ago.

AFAIR, ZFS on FreeBSD 7.x was always described as experimental.  This
is a known problem (OpenSolaris bug id 6792701) that was fixed in
OpenSolaris onnv revision 9950:78fc41aa9bc5, which was committed to
FreeBSD as r208775 in head and r208869 in 8-stable.  The fix was never
back-ported to 7.x and I am unable to locate any workaround.

> - Exported the pool from FreeBSD, imported it on OpenIndiana 148 - but
>   not upgraded - same problem, much newer ZFS implementation.  Can't
>   upgrade the pool to see if the issue goes away since for now I need a
>   route back to FreeBSD and I don't have spare storage.

I thought that just importing a pool on a system with the bugfix would
free the space.  If that doesn't work, your only options are to either
upgrade to FreeBSD 8.1-RELEASE or later (preferably 8.2, since there
are a number of other fairly important ZFS fixes since 8.1) and upgrade
your pool to v15, or rebuild your pool (via send/recv or similar).

-- Peter Jeremy
Re: [zfs-discuss] RAID Failure Calculator (for 8x 2TB RAIDZ)
On 2011-Feb-07 14:22:51 +0800, Matthew Angelo bang...@gmail.com wrote:
> I'm actually more leaning towards running a simple 7+1 RAIDZ1.  Running
> this with 1TB is not a problem but I just wanted to investigate at what
> TB size the scales would tip.

It's not that simple.  Whilst resilver time is proportional to device
size, it's far more impacted by the degree of fragmentation of the
pool.  And there's no 'tipping point' - it's a gradual slope so it's
really up to you to decide where you want to sit on the probability
curve.

> I understand RAIDZ2 protects against failures during a rebuild process.
> This would be its current primary purpose.  Currently, my RAIDZ1 takes
> 24 hours to rebuild a failed disk, so with 2TB disks and worse case
> assuming this is 2 days this is my 'exposure' time.

Unless this is a write-once pool, you can probably also assume that
your pool will get more fragmented over time, so by the time your pool
gets to twice its current capacity, it might well take 3 days to
rebuild due to the additional fragmentation.

One point I haven't seen mentioned elsewhere in this thread is that all
the calculations so far have assumed that drive failures are
independent.  In practice, this probably isn't true.  All HDD
manufacturers have their off days - where whole batches or models of
disks are cr*p and fail unexpectedly early.  The WD EARS is simply a
demonstration that it's WD's turn to turn out junk.  Your best
protection against this is to have disks from enough different batches
that a batch failure won't take out your pool.  PSU, fan and SATA
controller failures are likely to take out multiple disks but it's far
harder to include enough redundancy to handle this and your best
approach is probably to have good backups.

> I will be running hot (or maybe cold) spare.  So I don't need to factor
> in the time it takes for a manufacturer to replace the drive.

In which case, the question is more whether 8-way RAIDZ1 with a hot
spare (7+1+1) is better than 9-way RAIDZ2 (7+2).  In the latter case,
your hot spare is already part of the pool so you don't lose the
time-to-notice plus time-to-resilver before regaining redundancy.  The
downside is that actively using the hot spare may increase the
probability of it failing.

-- Peter Jeremy
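Expressed as pool layouts, the two alternatives are (device names are
only illustrative):

  # 8-way raidz1 plus a hot spare (7+1+1)
  zpool create tank raidz1 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0 c0t7d0 spare c0t8d0
  # 9-way raidz2 (7+2) - the "spare" is already resilvered into the pool
  zpool create tank raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0 c0t7d0 c0t8d0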
Re: [zfs-discuss] Best choice - file system for system
On 2011-Jan-28 21:37:50 +0800, Edward Ned Harvey
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
>> 2- When you want to restore, it's all or nothing.  If a single bit is
>> corrupt in the data stream, the whole stream is lost.
> Regarding point #2, I contend that zfs send is better than ufsdump.  I
> would prefer to discover corruption in the backup, rather than blindly
> restoring it undetected.

OTOH, it renders ZFS send useless for backup or archival purposes.
With ufsdump, I can probably recover most of the data off a backup even
if it has some errors.  Since I'm aware of that problem, I can
separately store a file of expected checksums etc to verify what I
restore.  If I lose a file from one backup, I can hopefully retrieve it
from another backup.

With ZFS send, a 1-bit error renders my multi-GB backup useless.  I
can't get ZFS to restore the rest of the backup and tell me what is
missing - which might let me recover it in other ways.

-- Peter Jeremy
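The "separate file of expected checksums" can be as simple as a
manifest generated at backup time, e.g. (paths are illustrative; use
sha256/sha256sum/digest as appropriate for your platform):

  find /home -type f -exec sha256 {} + | sort > /backup/home.sha256
  # after a restore, rerun the same command and diff against the saved manifest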
Re: [zfs-discuss] multiple disk failure
On 2011-Jan-30 13:39:22 +0800, Richard Elling richard.ell...@gmail.com wrote:
> I'm not sure of the way BSD enumerates devices.  Some clever person
> thought that hiding the partition or slice would be useful.

No, there's no hiding.  /dev/ada0 always refers to the entire physical
disk.  If it had PC-style fdisk slices, there would be a sN suffix.  If
it had GPT partitions, there would be a pN suffix.  If it had BSD
partitions, there would be an alpha suffix [a-h].

> On a Solaris system, ZFS can show a disk something like c0t1d0, but
> that doesn't exist.

If we're discussing brokenness in OS device names, I've always thought
that reporting device names that don't exist and not having any way to
access the complete physical disk in Solaris was silly.  Having a fake
's2' meaning "the whole disk if there's no label" is a bad kludge.

Mike might like to try "gpart list" - which will display FreeBSD's view
of the physical disks.  It might also be worthwhile looking at a
hexdump of the first and last few MB of the faulty disks - it's
possible that the controller has decided to just shift things by a few
sectors so the labels aren't where ZFS expects to find them.

-- Peter Jeremy
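ZFS keeps two labels at the front and two at the back of each device,
so the interesting areas can be eyeballed with something like (device
name and the skip count - the disk size in MB minus a few - are only
illustrative):

  dd if=/dev/ada2 bs=1m count=4 | hexdump -C | less
  dd if=/dev/ada2 bs=1m skip=953866 | hexdump -C | less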
Re: [zfs-discuss] stupid ZFS question - floating point operations
On 2010-Dec-23 04:48:19 +0800, Deano de...@rattie.demon.co.uk wrote:
> modern CPU are float monsters indeed its likely some things would be
> faster if converted to use the float ALU

_Some_ modern CPUs are good at FP, a lot aren't.  The SPARC T-1 was
particularly poor as it only had a single FPU.  Likewise, performance
in the x86 world is highly variable, depending on the vendor and core
you pick.  AFAIK, IA64 and PPC are consistently good - but neither is
commonly found in conjunction with ZFS.

You may also need to allow for software assist: Very few CPUs implement
all of the IEEE FP standard in hardware and most (including SPARC)
require software to implement parts of the standard.  If your algorithm
happens to make significant use of things other than normalised numbers
and zero, your performance may be severely affected by the resultant
traps and software assistance.

Any use of floating point within the kernel also means changes to when
FPU context is saved - and, unless this can be implemented lazily, it
will adversely impact the cost of all context switches and potentially
system calls.

-- Peter Jeremy
Re: [zfs-discuss] Growing the swap vol?
On 2010-Nov-14 07:53:05 +0800, Ian Collins i...@ianshome.com wrote:
>> -BEGIN PGP SIGNATURE-
> PGP signatures are a PITA on mail lists!

Only when the mailing list software is broken.  Signatures are probably
more relevant on mailing lists than elsewhere and this is the only
mailing list I'm subscribed to where signatures get mangled.

-- Peter Jeremy
Re: [zfs-discuss] hardware going bad
On 2010-Oct-28 04:45:16 +0800, Harry Putnam rea...@newsguy.com wrote:
> Short of doing such a test, I have evidence already that machine will
> predictably shutdown after 15 to 20 minutes of uptime.

My initial guess is thermal issues.  Check that the fans are running
correctly and there's no dust/fluff buildup on the CPU heatsink.  The
BIOS might be able to report actual fan speeds.  It's also possible
that you have RAM or PSU problems and I'd also recommend running some
sort of offline stress test (eg memtest86 or the mersenne prime
tester).

> It seems there ought to be something, some kind of evidence and clues
> if I only knew how to look for them, in the logs.

Serious hardware problems are unlikely to be in the logs because the
system will die before it can write the error to disk and sync the
disks.  You are more likely to see a problem on the console.

-- Peter Jeremy
Re: [zfs-discuss] Jumping ship.. what of the data
On 2010-Oct-28 04:54:00 +0800, Harry Putnam rea...@newsguy.com wrote:
> If I were to decide my current setup is too problem beset to continue
> using it, is there a guide or some good advice I might employ to scrap
> it out and build something newer and better in the old roomy midtower?

I'd scrap the existing PSU as well unless you are sure it is OK -
consumer grade PSUs don't have especially long lives.

> I'm a bit worried about whether with modern hardware the IDE drives
> will even have a hookup.  If it does, can I just hook the two rpool
> discs up to two of them and expect it to boot OK?

Most current motherboards still have one IDE channel, though they may
not be able to boot off it.  It's also still very easy to find PCIe
cards with IDE ports (some have SATA as well).  Again, you will need to
check the fine print to make sure that they support booting off IDE.

Assuming that you aren't currently using any hardware RAID, then there
should be no problems accessing any of your existing pools from a new
motherboard.  Booting off your IDE rpool just relies on BIOS support
for IDE booting (which you will need to verify).

> I expect to make sure I have a goodly number of sata connections even
> if it means extra cards, but again, can just hook the other mirrored
> discs up and expect them to just work.

Finding PCIe x1 cards with more than 2 SATA ports is difficult so you
might want to make sure that either your chosen motherboard has lots of
PCIe slots or has some wider slots.  If you plan on using on-board
video and re-using the x16 slot for something else, you should verify
that the BIOS will let you do that - I've got several (admittedly old)
systems where the x16 slot must either be empty or have a video card to
work.

If you are concerned about reliability, you might like to look at
motherboard and CPU combinations that support ECC RAM.  I believe all
Asus AMD boards now support ECC and some Gigabyte boards do (though
identifying them can be tricky).  See the archives for lots more
discussion on suggested systems for ZFS.

> Would I expect to need to reinstall for starters?

With care, nothing.

-- Peter Jeremy
Re: [zfs-discuss] Balancing LVOL fill?
On 2010-Oct-21 01:28:46 +0800, David Dyer-Bennet d...@dd-b.net wrote:
> On Wed, October 20, 2010 04:24, Tuomas Leikola wrote:
>> I wished for a more aggressive write balancer but that may be too much
>> to ask for.
> I don't think it can be too much to ask for.  Storage servers have long
> enough lives that adding disks to them is a routine operation; to the
> extent that that's a problem, that really needs to be fixed.

It will (should) arrive as part of the mythical block pointer rewrite
project.

-- Peter Jeremy
Re: [zfs-discuss] Newbie ZFS Question: RAM for Dedup
On 2010-Oct-20 08:36:30 +0800, Never Best qui...@hotmail.com wrote:
> Sorry I couldn't find this anywhere yet.  For deduping it is best to
> have the lookup table in RAM, but I wasn't too sure how much RAM is
> suggested?

*Lots*

> ::Assuming 128KB Block Sizes, and 100% unique data:
> 1TB*1024*1024*1024/128 = 8388608 Blocks
> ::Each Block needs 8 byte pointer?
> 8388608*8 = 67108864 bytes
> ::Ram suggest per TB
> 67108864/1024/1024 = 64MB
> So if I understand correctly we should have a min of 64MB RAM per TB
> for deduping?  *hopes my math wasn't way off*, or is there significant
> extra overhead stored per block for the lookup table?

The rule-of-thumb is 270 bytes per DDT entry - that means a minimum of
2.2GB of RAM (or fast L2ARC) per TB.  And note that 128KB is the
maximum blocksize - it's quite likely that you will have smaller blocks
(which implies more RAM).  I know my average blocksize is only a few
KB.

-- Peter Jeremy
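Redoing the arithmetic with that per-entry overhead (still assuming the
best case of 128KB blocks and 100% unique data):

  1TiB / 128KiB              = 8388608 blocks
  8388608 blocks * 270 bytes ~= 2.1GiB of DDT per TiB stored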
Re: [zfs-discuss] How to avoid striping ?
On 2010-Oct-18 17:45:34 +0800, casper@sun.com casper@sun.com wrote:
> Write-lock (wlock) the specified file-system.  wlock suspends writes
> that would modify the file system.  Access times are not kept while a
> file system is write-locked.
>
> All the applications trying to write will suspend.  What would be the
> risk of that?

At least some versions of Oracle rdbms have timeouts around I/O and
will abort if I/O operations don't complete within a short period.

-- Peter Jeremy
Re: [zfs-discuss] [RFC] Backup solution
On 2010-Oct-08 09:07:34 +0800, Edward Ned Harvey sh...@nedharvey.com wrote:
> If you're going raidz3, with 7 disks, then you might as well just make
> mirrors instead, and eliminate the slow resilver.

There is a difference in reliability: raidzN means _any_ N disks can
fail, whereas mirror means one disk in each mirror pair can fail.  With
a mirror, Murphy's Law says that the second disk to fail will be the
pair of the first disk :-).

-- Peter Jeremy
Re: [zfs-discuss] TLER and ZFS
On 2010-Oct-06 05:59:06 +0800, Michael DeMan sola...@deman.com wrote:
> Another annoying thing with the whole 4K sector size, is what happens
> when you need to replace drives next year, or the year after?

About the only mitigation needed is to ensure that any partitioning is
based on multiples of 4KB.

> Does anybody know if there any vendors that are shipping 4K sector
> drives that have a jumper option to make them 512 size?

This would require a low-level re-format and would significantly reduce
the available space if it was possible at all.

> WD has a jumper, but is there explicitly to work with WindowsXP, and is
> not a real way to dumb down the drive to 512.

All it does is offset the sector numbers by 1 so that sector 63 becomes
physical sector 64 (a multiple of 4KB).

> I would presume that any vendor that is shipping 4K sector size drives
> now, with a jumper to make it 'real' 512, would be supporting that over
> the long run?

I would be very surprised if any vendor shipped a drive that could be
jumpered to real 512 bytes.  The best you are going to get is jumpered
to logical 512 bytes and maybe a 1-sector offset (needed for WindozeXP
only).  These jumpers will probably last as long as the 8GB jumpers
that were needed by old BIOS code.  (Eg BIOS boots using simulated
512-byte sectors and then the OS tells the drive to switch to native
mode).

It's unfortunate that Sun didn't bite the bullet several decades ago
and provide support for block sizes other than 512 bytes instead of
getting custom firmware for their CD drives to make them provide
512-byte logical blocks for 2KB CD-ROMs.  It's even more idiotic of WD
to sell a drive with 4KB sectors but not provide any way for an OS to
identify those drives and perform 4KB-aligned I/O.

-- Peter Jeremy
Re: [zfs-discuss] non-ECC Systems and ZFS for home users (was: Please warn a home user against OpenSolaris under VirtualBox under WinXP ; ))
On 2010-Sep-24 00:58:47 +0800, R.G. Keen k...@geofex.com wrote: That may not be the best of all possible things to do on a number of levels. But for me, the likelihood of making a setup or operating mistake in a virtual machine setup server far outweighs the hardware cost to put another physical machine on the ground. The downsides are generally that it'll be slower and less power-efficient than a current-generation server and the I/O interfaces will also be last-generation (so you are more likely to be stuck with parallel SCSI and PCI or PCIx rather than SAS/SATA and PCIe). And when something fails (fan, PSU, ...), it's more likely to be customised in some way that makes it more difficult/expensive to repair/replace. In fact, the issue goes further. Processor chipsets from both Intel and AMD used to support ECC on an ad-hoc basis. It may have been there, but may or may not have been supported by the motherboard. Intel's recent chipsets emphatically do not support ECC. Not quite. When Intel moved the memory controllers from the northbridge into the CPU, they made a conscious decision to separate server and desktop CPUs and chipsets. The desktop CPUs do not support ECC whereas the server ones do - this way they can continue to charge a premium for server-grade parts and prevent the server manufacturers from using lower-margin desktop parts. This means that if you want an Intel-based solution, you need to look at a Xeon CPU. That said, the low-end Xeons aren't outrageously expensive and you generally wind up with support for registered RAM - and registered ECC RAM is often easier to find than unregistered ECC RAM. AMDs do, in general. AMD chose to leave ECC support in almost all their higher-end memory controllers, rather than use it as a market differentiator. AFAIK, all non-mobile Athlon, Phenom and Opteron CPUs support ECC, whereas the lower-end Sempron, Neo, Turion and Geode CPUs don't. Note that Athlon and Phenom CPUs normally need unbuffered RAM whereas Opteron CPUs normally want buffered/registered RAM. However, the motherboard must still support the ECC reporting in hardware and BIOS for ECC to actually work, and you have to buy the ECC memory. In the case of AMD motherboards, it's really just laziness on the manufacturer's part to not bother routing the additional tracks. The newer the Intel motherboard, the less likely and more expensive ECC is. Older Intel motherboards sometimes did support ECC, as a side note. On older Intel motherboards, it was a chipset issue rather than a CPU issue (and even if the chipset supported ECC, the motherboard manufacturer might have decided to not bother running the ECC tracks). There's about sixteen more pages of typing to cover the issue even modestly correctly. The bottom line is this: for current-generation hardware, buy an AMD AM3 socket CPU, ASUS motherboard, and ECC memory. DDR2 and DDR3 ECC memory is only moderately more expensive than non-ECC. Asus appears to have made a conscious decision to support ECC on all AMD motherboards whereas other vendors support it sporadically, and determining whether a particular motherboard supports ECC can be quite difficult since it's never one of the options in their motherboard selection tools. And when picking the RAM, make sure it's compatible with your motherboard - motherboards are virtually never compatible with both unbuffered and buffered RAM.
I also bought new, high-quality power supplies for $40-$60 per machine because the power supply is a single point of failure, and wears out - that's a fact that many people ignore until the machine doesn't come up one day. "Doesn't come up one day" is at least a clear failure. With a cheap (or under-dimensioned) PSU, things are more likely to go out of tolerance under heavy load, so you wind up with unrepeatable strange glitches. Think about what happens if you find a silent bit corruption in a file system that includes encrypted files. Or compressed files. -- Peter Jeremy pgp2gl67ZdR99.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Unwanted filesystem mounting when using send/recv
I am looking at backing up my fileserver by replicating the filesystems onto an external disk using send/recv with something similar to: zfs send ... myp...@snapshot | zfs recv -d backup but have run into a bit of a gotcha with the mountpoint property: - If I use zfs send -R ... then the mountpoint gets replicated and the backup gets mounted over the top of my real filesystems. - If I skip the '-R' then none of the properties get backed up. Is there some way to have zfs recv not automatically mount filesystems when it creates them? -- Peter Jeremy pgpOliK2tC1Vs.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
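One hedged answer to the above question (verify against the man page for your build; mypool and backup are placeholder names): newer zfs receive implementations accept -u, which skips mounting the filesystems it creates, and importing or creating the backup pool with an altroot keeps replicated mountpoints from landing on top of the live ones:

  # Option 1: ask zfs recv not to mount what it creates (requires a zfs
  # recv that supports -u).
  zfs send -R mypool@snapshot | zfs recv -d -u backup

  # Option 2: give the backup pool an altroot so received mountpoints are
  # re-rooted under /backup instead of shadowing the live filesystems.
  zpool import -R /backup backup
  zfs send -R mypool@snapshot | zfs recv -d backup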
Re: [zfs-discuss] 64-bit vs 32-bit applications
On 2010-Aug-18 04:40:21 +0800, Joerg Schilling joerg.schill...@fokus.fraunhofer.de wrote: Ian Collins i...@ianshome.com wrote: Some applications benefit from the extended register set and function call ABI, others suffer due to increased sizes impacting the cache. Well, please verify your claims as they do not meet my experience. I would agree with Ian that it varies. I have recently been evaluating a number of different SHA256 implementations and have just compared the 32-bit vs 64-bit performance on both x86 (P4 nocona using gcc 4.2.1) and SPARC (US-IVa using Studio12). Comparing the different implementations on each platform, the differences between best and worst varied from 10% to 27% depending on the platform (and the slowest algorithm on x86/64 was equal fastest on the other 3 platforms). Comparing the 32-bit vs 64-bit version of each implementation on each platform, the difference between 32-bit and 64-bit varied from -11% to +13% on SPARC and from no change to +68% on x86. My interpretation of those results is that you can't generalise: The only way to determine whether your application is faster in 32-bit or 64-bit mode is to test it. And your choice of algorithm is at least as important as whether it's 32-bit or 64-bit. -- Peter Jeremy pgpSec5hUa4mU.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
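In practice, "test it" can be as simple as building the same source both ways and timing it on representative input; sha256.c and testfile below are placeholders for whatever you are measuring, and recent gcc and Studio releases both accept -m32/-m64:

  # Build 32-bit and 64-bit versions of the same code and compare.
  cc -O2 -m32 -o sha256_32 sha256.c
  cc -O2 -m64 -o sha256_64 sha256.c
  time ./sha256_32 testfile
  time ./sha256_64 testfile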
Re: [zfs-discuss] Opensolaris is apparently dead
On 2010-Aug-16 08:17:10 +0800, Garrett D'Amore garr...@nexenta.com wrote: For either ZFS or BTRFS (or any other filesystem) to survive, there have to be sufficiently skilled developers with an interest in developing and maintaining it (whether the interest is commercial or recreational). Agreed. And this applies to OpenSolaris (or Illumos or any other fork) as well. Honestly, I think both ZFS and btrfs will continue to be invested in by Oracle. Given that both provide similar features, it's difficult to see why Oracle would continue to invest in both. Given that ZFS is the more mature product, it would seem more logical to transfer all the effort to ZFS and leave btrfs to die. Irrespective of the above, there is nothing requiring Oracle to release any future btrfs or ZFS improvements (or even bugfixes). They can't retrospectively change the license on already released code but they can put a different (non-OSS) license on any new code. -- Peter Jeremy pgpuCWzXnMlHq.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] FreeBSD 8.1 out, has zfs version 14 and can boot from zfs
On 2010-Jul-27 19:43:50 +0800, Andrey V. Elsukov bu7c...@yandex.ru wrote: On 27.07.2010 1:57, Peter Jeremy wrote: Note that ZFS v15 has been integrated into the development branches (-current and 8-stable) and will be in FreeBSD 8.2 (or you can run it ZFS v15 is not yet in 8-stable. Only in HEAD. Perhaps it will be merged into stable after 2 months. Oops, sorry. There are patches available for 8-stable (which I'm running). I misremembered the commit message. -- Peter Jeremy pgpHQlZ2UoRAA.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] FreeBSD 8.1 out, has zfs version 14 and can boot from zfs
On 2010-Jul-26 20:32:41 +0800, Eugen Leitl eu...@leitl.org wrote: FreeBSD 8.1 features version 14 of the ZFS subsystem, the addition of the ZFS Loader (zfsloader), allowing users to boot from ZFS, Only on i386 or amd64 systems at present, but you can boot RAIDZ1 and RAIDZ2 as well as mirrored roots. Note that ZFS v15 has been integrated into the development branches (-current and 8-stable) and will be in FreeBSD 8.2 (or you can run it now by compiling FreeBSD yourself - unlike OpenSolaris, the full build process is documented and everything necessary is on the release DVDs or can be downloaded). See http://www.freebsd.org/releases/8.1R/announce.html -- Peter Jeremy pgppFbh5U0Jj5.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
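For anyone who hasn't done it, the documented FreeBSD source build is roughly the following (the supfile path and GENERIC kernel config are the stock ones; adjust to taste):

  # Track the 8-stable source branch and rebuild world + kernel.
  csup -h cvsup.FreeBSD.org /usr/share/examples/cvsup/stable-supfile
  cd /usr/src
  make buildworld
  make buildkernel KERNCONF=GENERIC
  make installkernel KERNCONF=GENERIC
  # reboot (to single-user if possible), then:
  cd /usr/src && make installworld && mergemaster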
Re: [zfs-discuss] ZFS compression
On 2010-Jul-25 21:12:08 +0800, Ben ben.lav...@gmail.com wrote: I've read a small amount about compression, enough to find that it'll affect performance (not a problem for me) and that once you enable compression it only affects new files written to the file system. Is this still true of b134? And if it is, how can I compress all of the current data on the file system? Do I have to move it off then back on? Yes, changing things like compression, dedup etc only affects data written after the change. The only way to re-compress everything is to copy it off and back on again. Good news: There is an easy way to do this and preserve (whilst compressing) all your snapshots. All you need to do is set compression=gzip (or whatever you want) and then do a send/recv of that filesystem. The destination fileset will be completely created according to the source fileset parameters at the time of the send. If you have sufficient free space, you can even do a send|recv on the same system - but if the original fileset was mounted then this will result in the new fileset being mounted over the top of it, so you shouldn't do this on an active system. -- Peter Jeremy pgpBFqeTZn2jS.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
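Spelled out, the recompress-via-send/recv recipe looks something like the sketch below; tank/data, the snapshot name and the temporary _gz name are all placeholders, and -u (skip mounting on receive) may not exist on older builds:

  # Recompress an existing filesystem by replicating it within the same pool.
  zfs set compression=gzip tank/data
  zfs snapshot -r tank/data@recompress
  zfs send -R tank/data@recompress | zfs recv -u tank/data_gz
  # verify the copy, then swap the filesystems over:
  zfs rename tank/data tank/data_old
  zfs rename tank/data_gz tank/data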
Re: [zfs-discuss] Hashing files rapidly on ZFS
On 2010-Jul-09 06:46:54 +0800, Edward Ned Harvey solar...@nedharvey.com wrote: md5 is significantly slower (but surprisingly not much slower) and it's a cryptographic hash. Probably not necessary for your needs. As someone else has pointed out, MD5 is no longer considered secure (neither is SHA-1). If you want cryptographic hashing, you should probably use SHA-256 for now and be prepared to migrate to SHA-3 once it is announced. Unfortunately, SHA-256 is significantly slower than MD5 (about 4 times on a P-4, about 3 times on a SPARC-IV) and no cryptographic hash is amenable to multi-threading . The new crypto instructions on some of Intel's recent offerings may help performance (and it's likely that they will help more with SHA-3). And one more thing. No matter how strong your hash is, unless your hash is just as big as your file, collisions happen. Don't assume data is the same just because hash is the same, if you care about your data. Always byte-level verify every block or file whose hash matches some other hash. In theory, collisions happen. In practice, given a cryptographic hash, if you can find two different blocks or files that produce the same output, please publicise it widely as you have broken that hash function. -- Peter Jeremy pgpiebzGoklvU.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
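For rough relative throughput numbers on your own hardware, openssl's built-in benchmark is a quick way to compare digests (assuming the openssl utility is installed):

  # Higher numbers are better; compare md5 vs sha1 vs sha256 on this box.
  openssl speed md5 sha1 sha256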
Re: [zfs-discuss] Remove non-redundant disk
On 2010-Jul-08 02:39:05 +0800, Garrett D'Amore garr...@nexenta.com wrote: I believe that long term folks are working on solving this problem. I believe bp_rewrite is needed for this work. Accepted. Mid/short term, the solution to me at least seems to be to migrate your data to a new zpool on the newly configured array, etc. IMHO, this isn't an acceptable solution. Note that (eg) DEC/Compaq/HP AdvFS has supported vdev removal from day 1 and (until a couple of years ago), I had an AdvFS pool that had, over a decade, grown from a mirrored pair of 4.3GB disks to six pairs of mirrored 36GB disks - without needing any downtime for disk expansion. [Adding disks was done with mirror pairs because AdvFS didn't support any RAID5/6 style redundancy, the big win was being able to remove older vdevs so those disk slots could be reused]. Most enterprises don't incrementally upgrade an array (except perhaps to add more drives, etc.) This isn't true for me. It is not uncommon for me to replace an xGB disk with a (2x)GB disk to expand an existing filesystem - in many cases, it is not possible to add more drives because there are no physical slots available. And, one of the problems with ZFS is that, unless you don't bother with any data redundancy, it's not possible to add single drives - you can only add vdevs that are pre-configured with the desired level of redundancy. Disks are cheap enough that its usually not that hard to justify a full upgrade every few years. (Frankly, spinning rust MTBFs are still low enough that I think most sites wind up assuming that they are going to have to replace their storage on a 3-5 year cycle anyway. We've not yet seen what SSDs do that trend, I think.) Maybe in some environments. We tend to run equipment into the ground and I know other companies with similar policies. And getting approval for a couple of thousand dollars of new disks is very much easier than getting approval for a complete new SAN with (eg) twice the capacity of the existing one. -- Peter Jeremy ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Native ZFS for Linux
On 2010-Jun-11 17:41:38 +0800, Joerg Schilling joerg.schill...@fokus.fraunhofer.de wrote: PP.S.: Did you know that FreeBSD has _included_ the GPLd Reiserfs in the FreeBSD kernel for a while and that nobody has complained about this, see e.g.: http://svn.freebsd.org/base/stable/8/sys/gnu/fs/reiserfs/ That is completely irrelevant and somewhat misleading. FreeBSD has never prohibited non-BSD-licensed code in their kernel or userland; however, it has always been optional and, AFAIR, the GENERIC kernel has always defaulted to containing only BSD code. Non-BSD code (whether GPL or CDDL) is carefully segregated (note the 'gnu' in the above URI). -- Peter Jeremy pgpvmgKqx7nJf.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Pool revcovery from replaced disks.
On 2010-May-18 19:06:11 +0800, Demian Phillips demianphill...@gmail.com wrote: Is it possible to recover a pool (as it was) from a set of disks that were replaced during a capacity upgrade? If no other writes occurred during the capacity upgrade then I'd suspect it would be possible. The transaction numbers would still vary across the drives and the pool information would be inconsistent but I suspect a recent version of ZFS could manage to recover. It might be possible to test this by creating a small, file-backed RAIDZn zpool, simulating a capacity upgrade, exporting that pool and trying to import the original zpool from the detached files. -- Peter Jeremy pgp5OU8Gba0CI.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
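A hedged sketch of that experiment with a small file-backed pool (all names and sizes below are arbitrary): build a raidz on one set of files, "upgrade" it onto a second set, then see whether the pool will still import from the original files:

  mkdir /var/tmp/old /var/tmp/new
  mkfile 128m /var/tmp/old/d0 /var/tmp/old/d1 /var/tmp/old/d2
  zpool create testpool raidz /var/tmp/old/d0 /var/tmp/old/d1 /var/tmp/old/d2
  mkfile 256m /var/tmp/new/d0 /var/tmp/new/d1 /var/tmp/new/d2
  # "capacity upgrade": replace each backing file in turn, waiting for each
  # resilver to complete before starting the next.
  zpool replace testpool /var/tmp/old/d0 /var/tmp/new/d0
  zpool replace testpool /var/tmp/old/d1 /var/tmp/new/d1
  zpool replace testpool /var/tmp/old/d2 /var/tmp/new/d2
  zpool export testpool
  # now see whether the pool can be recovered from the *old* files only:
  zpool import -d /var/tmp/old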
Re: [zfs-discuss] Reverse lookup: inode to name lookup
On 2010-May-02 01:44:51 +0800, Edward Ned Harvey solar...@nedharvey.com wrote: Obviously, the kernel has the facility to open an inode by number. However, for security reasons (enforcing permissions of parent directories before the parent directories have been identified), the ability to open an arbitrary inode by number is not normally made available to user level applications, except perhaps when run by root. There is no provision in normal Unix to open a file by inode from userland. Some filesystems (eg HP Tru64) may provide a special pseudo-directory that exposes all the inodes. Note that opening a file by inode number is a completely different issue to mapping an inode number to a pathname. because: (a) every directory contains an entry .. which refers to its parent by number, and (b) every directory has precisely one parent, and no more. There is no such thing as a hardlink copy of a directory. Therefore, there is exactly one absolute path to any directory in any ZFS filesystem. s/is/should be/ - I haven't checked with ZFS but it may be possible to trick/corrupt the filesystem into allowing a second real name (though the filesystem is then inconsistent). If the kernel (or root) can open an arbitrary directory by inode number, then the kernel (or root) can find the inode number of its parent by looking at the '..' entry, which the kernel (or root) can then open, and identify both: (a) the name of the child subdir whose inode number is already known, and (b) yet another '..' entry. The kernel (or root) can repeat this process recursively, up to the root of the filesystem tree. At that time, the kernel (or root) has completely identified the absolute path of the inode that it started with. Any user can do this (subject to permissions) and this is how 'pwd' was traditionally implemented. Note that you need to check device and inode, not just inode, to correctly handle mountpoints. -- Peter Jeremy pgpsc9geRSx95.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
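The traditional pwd algorithm is easy to sketch in shell for anyone who hasn't seen it; this is a toy version that walks '..' upwards matching inode numbers, breaks on filenames containing whitespace, and ignores the device/mountpoint check mentioned above:

  # Naive pwd: at each level, find the current directory's name in '..' by
  # matching inode numbers, then move up until '.' and '..' are the same.
  walkup() (
      path=
      while [ "$(ls -di . | awk '{print $1}')" != "$(ls -di .. | awk '{print $1}')" ]
      do
          ino=$(ls -di . | awk '{print $1}')
          name=$(ls -ia .. | awk -v i="$ino" '$1 == i && $2 != "." && $2 != ".." {print $2}')
          path="/$name$path"
          cd .. || exit 1
      done
      printf '%s\n' "${path:-/}"
  )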
Re: [zfs-discuss] ZFS snapshot versus Netapp - Security and convenience
On 2010-Apr-30 21:56:46 +0800, Edward Ned Harvey solar...@nedharvey.com wrote: How many bytes long is an inode number? I couldn't find that easily by googling, so for the moment, I'll guess it's a fixed size, and I'll guess 64bits (8 bytes). Based on a rummage in some header files, it looks like it's 8 bytes. How many bytes is that? Would it be exceptionally difficult to extend and/or make variable? Extending inodes increases the amount of metadata associated with a file, which increases overheads for small files. It looks like a ZFS inode is currently 264 bytes, but is always stored with a dnode and currently has some free space. ZFS code assumes that the physical dnode (dnode+znode+some free space) is a fixed size and making it variable is likely to be quite difficult. One important consideration in that hypothetical scenario would be fragmentation. If every inode were fragmented in two, that would be a real drag for performance. Perhaps every inode could be extended (for example) 32 bytes to accommodate a list of up to 4 parent inodes, but whenever the number of parents exceeds 4, the inode itself gets fragmented to store a variable list of parents. ACLs already do something like this. And having parent information stored away from the rest of the inode would not impact the normal inode access time since the parent information is not normally needed. On 2010-Apr-30 23:08:58 +0800, Edward Ned Harvey solar...@nedharvey.com wrote: Therefore, it should be very easy to implement proof of concept, by writing a setuid root C program, similar to sudo which could then become root, identify the absolute path of a directory by its inode number, and then print that absolute path, only if the real UID has permission to ls that path. It doesn't need to be setuid. Check out http://minnie.tuhs.org/cgi-bin/utree.pl?file=V6/usr/source/s2/pwd.c http://minnie.tuhs.org/cgi-bin/utree.pl?file=V7/usr/src/cmd/pwd.c (The latter is somewhat more readable) While not trivial, it's certainly possible to extend inodes of files, to include parent pointers. This is a far more significant change and the utility is not clear. Also not trivial, it's certainly possible to make all this information available under proposed directories, .zfs/inodes or something similar. HP Tru64 already does something like this. -- Peter Jeremy pgp2nCFDIdxia.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Single-disk pool corrupted after controller failure
On 2010-May-03 23:59:17 +0800, Diogo Franco diogomfra...@gmail.com wrote: I managed to get a livefs cd that had zfs14, but it was unable to import the zpool (internal error: Illegal byte sequence). The zpool does appear if I try to run `zpool import` though, as tank FAULTED corrupted data, and ad6s1d is ONLINE. That's not promising. There is no -F option on bsd's zpool import. It was introduced around zfs20. I feared it might be needed. This is almost certainly the problem. ad6s1 may be the same as c5d0p1 but OpenSolaris isn't going to understand the FreeBSD partition label on that slice. All I can suggest is to (temporarily) change the disk slicing so that there is an fdisk slice that matches ad6s1d. How could I do just that? I know that my label has a 1G UFS, 1G swap, and the rest is ZFS; but I don't know how to calculate the correct offset to give to 'format'. I can just regenerate the UFS later after the ZFS is fixed since it was only used for its /boot. In FreeBSD, bsdlabel ad0s1 will report the size and offset of the 'd' partition in sectors. The offset is relative to the start of that slice - which would normally be absolute block 63 (fdisk ad0 will confirm that). Adding the offset of 's1' to the offset of 'd' will give you a sector offset for your ZFS data. I haven't tried using OpenSolaris on x86 so I'm not sure if format allows sector offsets (I know format on Solaris/SPARC insists on cylinder offsets). Since cylinders are a fiction anyway, you might be able to kludge a cylinder size to suit your offset if necessary. The FreeBSD fdisk(8) man page implies that slices start at a track boundary and end at a cylinder boundary but I'm not sure if this is a restriction on LBA disks. Note that if you keep a record of your existing c5d0 format and restore it later, this will recover your existing boot and swap so you shouldn't need to restore them. -- Peter Jeremy pgpHLIUCADaBM.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
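Putting example numbers on that calculation (these figures are only illustrative - use the ones your own bsdlabel/fdisk report): a 63-sector slice offset plus a 1G UFS 'a' partition and 1G swap 'b' partition would put 'd' at 4194304 sectors into the slice, so:

  # absolute starting sector of the ZFS data = slice offset + partition offset
  slice_start=63        # from: fdisk ad6
  d_offset=4194304      # from: bsdlabel ad6s1 ('d' partition offset)
  echo $(( slice_start + d_offset ))    # -> 4194367, the offset the Solaris slice must match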
Re: [zfs-discuss] Single-disk pool corrupted after controller failure
On 2010-May-02 04:06:41 +0800, Diogo Franco diogomfra...@gmail.com wrote: regular data corruption and then the box locked up. I had also converted the pool to v14 a few days before, so the freebsd v13 tools couldn't do anything to help. Note that ZFS v14 was imported to FreeBSD 8-stable in mid-January. I can't comment whether it would be able to recover your data. On 2010-May-02 05:07:17 +0800, Bill Sommerfeld bill.sommerf...@oracle.com wrote: 2) the labels are not at the start of what solaris sees as p1, and thus are somewhere else on the disk. I'd look more closely at how freebsd computes the start of the partition or slice '/dev/ad6s1d' that contains the pool. I think #2 is somewhat more likely. This is almost certainly the problem. ad6s1 may be the same as c5d0p1 but OpenSolaris isn't going to understand the FreeBSD partition label on that slice. All I can suggest is to (temporarily) change the disk slicing so that there is a fdisk slice that matches ad6s1d. -- Peter Jeremy pgpuiR7yDRv37.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS snapshot versus Netapp - Security and convenience
On 2010-Apr-30 10:24:14 +0800, Edward Ned Harvey solar...@nedharvey.com wrote: Each inode contains a link count. In most cases, each inode has a link count of 1, but of course that can't be assumed. It seems trivially simple to me, that along with the link count in each inode, the filesystem could also store a list of which inodes link to it. If link count is 2, then there's a list of 2 inodes, which are the parents of this inode. I'm not sure exactly what you are trying to say here but I don't think it will work. In a Unix FS (UFS or ZFS), a directory entry contains a filename and a pointer to an inode. The inode itself contains a count of the number of directory entries that point to it and pointers to the actual data. There is currently no provision for a reverse link back to the directory. I gather you are suggesting that the inode be extended to contain a list of the inode numbers of all directories that contain a filename referring to that inode. Whilst I agree that this would simplify inode to filename mapping and provide an alternate mechanism for checking file permissions, I think you are glossing over the issue of how/where to store these links. Whilst files can have a link count of 1 (I'm not sure if this is true in most cases), they can have up to 32767 links. Where is this list of (up to) 32767 parent inodes going to be stored? In which case, it would be trivially easy to walk back up the whole tree, almost instantly identifying every combination of paths that could possibly lead to this inode, while simultaneously correctly handling security concerns about bypassing security of parent directories and everything. Whilst it's trivially easy to get from the file to the list of directories containing that file, actually getting from one directory to its parent is less so: A directory containing N sub-directories has N+2 links. Whilst the '.' link is easy to identify (it points to its own inode), distinguishing between the name of this directory in its parent and the '..' entries in its subdirectories is rather messy (requiring directory scans) unless you mandate that the reference to the parent directory is in a fixed location (ie 1st or 2nd entry in the parent inode list). It seems too perfect and too simple. Instead of a one-directional directed graph, simply make it bidirectional. There's no significant additional overhead as far as I can tell. It seems like it would even be easy. Well, you need to find somewhere to store up to 32K inode numbers, whilst having minimal space overhead for small numbers of links. Then you will need to patch the vnode operations underlying creat(), link(), unlink(), rename(), mkdir() and rmdir() to manage the backlinks (taking into account transactional consistency). -- Peter Jeremy pgpLmGCkPtpSv.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
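The N+2 link-count rule is easy to see from the shell (the /tmp/demo names are arbitrary):

  # A directory has 2 links of its own ('.' plus its name in the parent)
  # and one more for each subdirectory's '..'.
  mkdir -p /tmp/demo/sub1 /tmp/demo/sub2 /tmp/demo/sub3
  ls -ld /tmp/demo      # the link count column shows 5 = 3 subdirs + 2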
Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced drives
On 2010-Feb-03 00:12:43 +0800, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote: On Tue, 2 Feb 2010, David Dyer-Bennet wrote: Now, I'm sure not ALL drives offered at Newegg could qualify; but the question is, how much do I give up by buying an enterprise-grade drive from a major manufacturer, compared to the Sun-certified drive? If you have a Sun service contract, you give up quite a lot. If a Sun drive fails every other day, then Sun will replace that Sun drive every other day, even if the system warranty has expired. But if it is a non-Sun drive, then you have to deal with a disinterested drive manufacturer, which could take weeks or months. OTOH, if I'm paying 10x the street drive price upfront, plus roughly the street price annually in support, I can save a fair amount of money by just buying a pile of spare drives - when one fails, just swap it for a spare and it doesn't matter if it takes weeks for the vendor to swap it. Hopefully Oracle will do better than Sun at explaining the benefits and services provided by a service contract. I know that trying to get Sun to renew a service contract is like pulling teeth but Oracle is far worse - as far as I can tell, Oracle contracts are deliberately designed so you can't be certain whether you are compliant or not. -- Peter Jeremy ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Is ZFS internal reservation excessive?
On 2010-Jan-19 00:26:27 +0800, Jesus Cea j...@jcea.es wrote: On 01/18/2010 05:11 PM, David Magda wrote: Ext2/3 uses 5% by default for root's usage; 8% under FreeBSD for FFS. Solaris (10) uses a bit more nuance for its UFS: That reservation is to preclude users from exhausting disk space in such a way that even root cannot log in and solve the problem. At least for UFS-derived filesystems (ie FreeBSD and Solaris), the primary reason for the 8-10% reserved space is to minimise FS fragmentation and improve space allocation performance: More total free space means it's quicker and easier to find the required contiguous (or any) free space whilst searching a free space bitmap. Allowing root to eat into that reserved space provided a neat solution to resource starvation issues but was not the justification. I agree that is a lot of space but only 2% of a modern disk. My point is that 32GB is a lot of space to reserve to be able, for instance, to delete a file when the pool is full (thanks to COW). And more when the minimum reserved is 32MB and ZFS can get away with it. I think it could be a good thing to put a cap on the maximum implicit reservation. AFAIK, it's also necessary to ensure reasonable ZFS performance - the 'find some free space' issue becomes much more time critical with a COW filesystem. I recently had a 2.7TB RAIDZ1 pool get to the point where zpool was reporting ~2% free space - and performance was absolutely abysmal (fsync() was taking over 16 seconds). When I freed up a few percent more space, the performance recovered. Maybe it would be useful if ZFS allowed the reserved space to be tuned lower but, at least for ZFS v13, the reserved space seems to actually be a bit less than is needed for ZFS to function reasonably. -- Peter Jeremy pgpaYK13eLyWU.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
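A simple way to keep an eye on this is the CAP column of zpool list; once it creeps much past ~90% used (or down to a few percent free, as above), expect allocation - and fsync - times to degrade badly. 'tank' below is a placeholder pool name:

  zpool list                                    # CAP = percentage of pool used
  zfs list -o name,used,avail,refer -r tank     # where the space is going, per dataset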
Re: [zfs-discuss] raid-z as a boot pool
On 2009-Dec-16 00:26:28 +0800, Luca Morettoni l...@morettoni.net wrote: As reported here: http://hub.opensolaris.org/bin/view/Community+Group+zfs/zfsbootFAQ we can't boot from a pool with raidz, any plan to have this feature? Note that FreeBSD currently supports booting from RAIDZ (at least on i386). It may be possible to reuse some of that code. -- Peter Jeremy pgp0WiQELKoEj.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Recovering FAULTED zpool
On 2009-Nov-18 11:50:44 +1100, I wrote: I have a zpool on a JBOD SE3320 that I was using for data with Solaris 10 (the root/usr/var filesystems were all UFS). Unfortunately, we had a bit of a mixup with SCSI cabling and I believe that we created a SCSI target clash. The system was unloaded and nothing happened until I ran zpool status at which point things broke. After correcting all the cabling, Solaris panic'd before reaching single user. I wound up installing OpenSolaris snv_128a on some spare disks and this enabled me to recover the data. Thanks to Tim Haley and Victor Latushkin for their assistance. As a first attempt, 'zpool import -F data' said Destroy and re-create the pool from a backup source.. 'zpool import -nFX data' initially ran the system out of swap (I hadn't attached any swap and it only has 8GB RAM): WARNING: /etc/svc/volatile: File system full, swap space limit exceeded INIT: Couldn't write persistent state file `/etc/svc/volatile/init.state'. After rebooting and adding some swap (which didn't seem to ever get used), it did work (though it took several hours - unfortunately, I didn't record exactly how long): # zpool import -nFX data Would be able to return data to its state as of Thu Jan 01 10:00:00 1970. Would discard approximately 369 minutes of transactions. # zpool import -FX data Pool data returned to its state as of Thu Jan 01 10:00:00 1970. Discarded approximately 369 minutes of transactions. cannot share 'data/backup': share(1M) failed cannot share 'data/JumpStart': share(1M) failed cannot share 'data/OS_images': share(1M) failed # I notice that the two times aren't consistent but the data appears to be present and a 'zpool scrub' reported no errors. I have reverted back to Solaris 10 and successfully copied all the data off. -- Peter Jeremy pgpC0sjEufK37.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Best practices for zpools on zfs
On 2009-Nov-24 14:07:06 -0600, Mike Gerdts mger...@gmail.com wrote: On Tue, Nov 24, 2009 at 1:39 PM, Richard Elling richard.ell...@gmail.com wrote: Also, the performance of /dev/*random is not very good. So prestaging lots of random data will be particularly challenging. This depends on the random number generation algorithm used in the kernel. I get 50MB/sec out of FreeBSD on 3.2GHz P4 (using Yarrow). In any case, you don't need crypto-grade random numbers, just data that is different and uncompressible - there are lots of relatively simple RNGs that can deliver this with far greater speed. I was thinking that a bignum library such as libgmp could be handy to allow easy bit shifting of large amounts of data. That is, fill a 128 KB buffer with random data then do bitwise rotations for each successive use of the buffer. Unless my math is wrong, it should allow 128 KB of random data to be write 128 GB of data with very little deduplication or compression. A much larger data set could be generated with the use of a 128 KB linear feedback shift register... This strikes me as much harder to use than just filling the buffer with 8/32/64-bit random numbers from a linear congruential generator, lagged fibonacci generator, mersenne twister or even random(3) http://en.wikipedia.org/wiki/List_of_random_number_generators -- Peter Jeremy pgpO9mAWzbb7x.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
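One hedged trick for generating unique, incompressible test data faster than /dev/random: encrypt /dev/zero with a throwaway key, which expands a tiny seed into a pseudorandom stream at cipher speed. The cipher, seed and output path below are all placeholders, and the invocation assumes a reasonably recent openssl:

  # ~1GB of incompressible, dedup-resistant test data; vary the seed per file.
  openssl enc -aes-128-cbc -nosalt -pass pass:change-this-seed < /dev/zero 2>/dev/null |
      dd of=/testpool/fs/random.dat bs=128k count=8192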
Re: [zfs-discuss] Recovering FAULTED zpool
On 2009-Nov-18 08:40:41 -0800, Orvar Korvar knatte_fnatte_tja...@yahoo.com wrote: There is a new PSARC in b126(?) that allows to rollback to latest functioning uber block. Maybe it can help you? It's in b128 and the feedback I've received suggests it will work. I've been trying to get the relevant ZFS bits for my b127 system but haven't managed to get them to work so far. -- Peter Jeremy ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Recovering FAULTED zpool
On 2009-Nov-19 02:57:31 +0300, Victor Latushkin victor.latush...@sun.com wrote: all the cabling, Solaris panic'd before reaching single user. Do you have a crash dump of this panic saved? Yes. It was provided to Sun Support. Option -F is a new one added with pool recovery support, so it'll be available in build 128 only. OK, thanks. I knew it was new but I wasn't certain exactly which build it had been imported into. I think it should be possible at least in readonly mode. I cannot tell if full recovery will be possible, but at least there's a good chance to get some data back. That's what I was hoping. You can try build 128 as soon as it becomes available, or you can try to build BFU archives from source and apply to your build 125 BE. I'm currently discussing this off-line with Tim Haley. Metadata replication helps to protect against failures localized in space, but as all copies of metadata are written at the same time, it cannot protect against failures localized in time. Thanks for that. I suspected it might be something like this. -- Peter Jeremy pgpTbho8x8cyp.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Recovering FAULTED zpool
I have a zpool on a JBOD SE3320 that I was using for data with Solaris 10 (the root/usr/var filesystems were all UFS). Unfortunately, we had a bit of a mixup with SCSI cabling and I believe that we created a SCSI target clash. The system was unloaded and nothing happened until I ran zpool status, at which point things broke. After correcting all the cabling, Solaris panic'd before reaching single user. Sun Support could only suggest restoring from backups - but unfortunately, we do not have backups of some of the data that we would like to recover. Since OpenSolaris has a much newer version of ZFS, I thought I would give OpenSolaris a try and it looks slightly more promising, though I still can't access the pool. The following is using snv125 on a T2000.

r...@als253:~# zpool import -F data
Nov 17 15:26:46 opensolaris zfs: WARNING: can't open objset for data/backup
r...@als253:~# zpool status -v data
  pool: data
 state: FAULTED
status: An intent log record could not be read. Waiting for adminstrator intervention to fix the faulted pool.
action: Either restore the affected device(s) and run 'zpool online', or ignore the intent log records by running 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-K4
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        data         FAULTED      0     0     3  bad intent log
          raidz2-0   DEGRADED     0     0    18
            c2t8d0   FAULTED      0     0     0  too many errors
            c2t9d0   ONLINE       0     0     0
            c2t10d0  ONLINE       0     0     0
            c2t11d0  ONLINE       0     0     0
            c2t12d0  ONLINE       0     0     0
            c2t13d0  ONLINE       0     0     0
            c3t8d0   ONLINE       0     0     0
            c3t9d0   ONLINE       0     0     0
            c3t10d0  ONLINE       0     0     0
            c3t11d0  ONLINE       0     0     0
            c3t12d0  DEGRADED     0     0     0  too many errors
            c3t13d0  ONLINE       0     0     0

r...@als253:~# zpool online data c2t8d0
Nov 17 15:28:42 opensolaris zfs: WARNING: can't open objset for data/backup
cannot open 'data': pool is unavailable
r...@als253:~# zpool clear data
cannot clear errors for data: one or more devices is currently unavailable
r...@als253:~# zpool clear -F data
cannot open '-F': name must begin with a letter
r...@als253:~# zpool status data
  pool: data
 state: FAULTED
status: One or more devices are faulted in response to persistent errors. There are insufficient replicas for the pool to continue functioning.
action: Destroy and re-create the pool from a backup source. Manually marking the device repaired using 'zpool clear' may allow some data to be recovered.
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        data         FAULTED      0     0     1  corrupted data
          raidz2-0   FAULTED      0     0     6  corrupted data
            c2t8d0   FAULTED      0     0     0  too many errors
            c2t9d0   ONLINE       0     0     0
            c2t10d0  ONLINE       0     0     0
            c2t11d0  ONLINE       0     0     0
            c2t12d0  ONLINE       0     0     0
            c2t13d0  ONLINE       0     0     0
            c3t8d0   ONLINE       0     0     0
            c3t9d0   ONLINE       0     0     0
            c3t10d0  ONLINE       0     0     0
            c3t11d0  ONLINE       0     0     0
            c3t12d0  DEGRADED     0     0     0  too many errors
            c3t13d0  ONLINE       0     0     0
r...@als253:~#

Annoyingly, data/backup is not a filesystem I'm especially worried about - I'd just like to get access to the other filesystems on it. Is it possible to hack the pool to make data/backup just disappear? For that matter: 1) Why is the whole pool faulted when n-2 vdevs are online? 2) Given that metadata is triplicated, where did the objset go? -- Peter Jeremy pgpcSxvFaLwUM.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss