Re: [zfs-discuss] Expanding raidz2

2006-07-13 Thread Darren Dunham
  But presumably it would be possible to use additional columns for future
  writes?
 
 I guess that could be made to work, but then the data on the disk becomes
 much (much much) more difficult to interpret because you have some rows which
 are effectively one width and others which are another (ad infinitum).

How do rows come into it?  I was just assuming that each (existing)
in-use disk block was pointed to by a FS block, which was tracked by
other structures.  I was guessing that adding space (effectively
extending the rows) wasn't going to be noticed for accessing old data.

Even after reading through the on-disk format document and trying to
comprehend some of the code, I have no idea how the free space is
tracked or examined when a raidz block is being allocated on disk.

 It also doesn't really address the issue since you assume that you
 want to add space because the disks are getting full, but this scheme,
 as you mention, only applies the new width to empty rows.

Correct.  It would certainly be somewhat of a limitation.  But a little
bit of block turnover (either through normal usage or explicitly driven)
would make it much less of an issue.

I'm not suggesting that this is something that should (or could) be
implemented.  Just that instead of it being technically difficult, it
appeared pretty straightforward to me given what little I know about how
the disk blocks are currently managed.  I'm just trying to understand if
this is true or what other bits I'm still misunderstanding.  :-)

-- 
Darren Dunham   [EMAIL PROTECTED]
Senior Technical Consultant         TAOS            http://www.taos.com/
Got some Dr Pepper?   San Francisco, CA bay area
  This line left intentionally blank to confuse you. 


RE: [zfs-discuss] Expanding raidz2

2006-07-13 Thread Bennett, Steve
 
  I guess that could be made to work, but then the data on 
  the disk becomes much (much much) more difficult to
  interpret because you have some rows which are effectively
  one width and others which are another (ad infinitum).
 
 How do rows come into it?  I was just assuming that each
 (existing) in-use disk block was pointed to by a FS block,
 which was tracked by other structures.  I was guessing that
 adding space (effectively extending the rows) wasn't
 going to be noticed for accessing old data.

That's what my assumption was too. I had the impression from
the initial information (I nearly said hype) about ZFS, that
the distinctions between RAID levels were to become less
clear, i.e. that you could have some files stored with higher
resilience than others.

Maybe this is a dumb question, but I've never written a
filesystem. Is there a fundamental reason why you cannot have
some files mirrored, with others as raidz, and others with no
resilience? This would allow a pool to initially exist on one
disk, then gracefully change between different resilience
strategies as you add disks and the requirements change.

Apologies if this is pie in the sky.

Steve.


RE: [zfs-discuss] Expanding raidz2

2006-07-13 Thread Jeff Bonwick
 Maybe this is a dumb question, but I've never written a
 filesystem. Is there a fundamental reason why you cannot have
 some files mirrored, with others as raidz, and others with no
 resilience? This would allow a pool to initially exist on one
 disk, then gracefully change between different resilience
 strategies as you add disks and the requirements change.

Actually, it's an excellent question.  And a deep one.
It goes to the very heart of why the traditional factoring
of storage into filesystems and volumes is such a bad idea.

In a typical filesystem, each block is represented by a small
integer -- typically 32 or 64 bits -- indicating its location
on disk.  To make a filesystem talk to multiple disks, you
either need to add another integer -- a device number -- to
each block pointer, or you need to generate virtual block
numbers.  Doing the former requires modifying the filesystem;
doing the latter does not, which is why volumes caught on
in the first place.  It was expedient.

The simplest example of block virtualization is a concatenation
of two disks.  For simplicity, assume all disks have 100 blocks.
To create a 200-block volume using disks A and B, we assign virtual
blocks 0-99 to A and 100-199 to B.  As far as the filesystem is
concerned, it's just looking at a 200-block logical device.
But when it issues a read for (say) logical block 137, the volume
manager will actually map that to physical block 37 of disk B.

A stripe (RAID-0) is similar, except that instead of putting
the low blocks on A and the high ones on B, you put the even
ones on A and the odd ones on B.  So disk A stores virtual
blocks 0, 2, 4, 6, ... on physical blocks 0, 1, 2, 3, etc.
The advantage of striping is that when you issue a read of
(say) 10 blocks, that maps into 5 blocks on each disk, and you
can read from those disks in parallel.  So you get up to double
the bandwidth (less for small I/O, because then the per-I/O
overhead dominates, but I digress).
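
To make those two mappings concrete, here is a minimal sketch in Python
(purely illustrative, not code from ZFS or any real volume manager), using
the same assumption of two 100-block disks A and B:

    DISK_BLOCKS = 100   # assume each disk holds 100 blocks, as above

    def concat_map(logical):
        # concatenation: logical blocks 0-99 live on A, 100-199 on B
        if logical < DISK_BLOCKS:
            return 'A', logical
        return 'B', logical - DISK_BLOCKS

    def stripe_map(logical):
        # RAID-0: even logical blocks on A, odd ones on B
        disk = 'A' if logical % 2 == 0 else 'B'
        return disk, logical // 2

    print(concat_map(137))   # ('B', 37) -- the example above
    print(stripe_map(6))     # ('A', 3)  -- A holds logical 0, 2, 4, 6 on physical 0, 1, 2, 3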

A mirror (RAID-1) is even simpler -- it's just a 1-1 mapping
of logical to physical block numbers on two or more disks.

RAID-4 is only slightly more complex.  The rule here is that all
disks XOR to zero (i.e., if you XOR the nth block of each disk
together, you get a block of zeroes), so you can lose any one disk
and still be able to reconstruct the data.  The block mapping is
just like a stripe, except that there's a parity disk as well.
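
(A tiny sketch of the XOR-to-zero rule, again purely for illustration: with
the parity block computed as the XOR of the corresponding data blocks, any
one missing block is just the XOR of the survivors.)

    from functools import reduce

    data = [0b1010, 0b0110, 0b1100]            # nth block of three data disks
    parity = reduce(lambda a, b: a ^ b, data)  # nth block of the parity disk

    # all four blocks XOR to zero, so a lost block can be rebuilt from the rest
    lost = data[1]
    rebuilt = data[0] ^ data[2] ^ parity
    assert rebuilt == lost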

RAID-5 is like RAID-4, but the parity rotates at some fixed
interval so that you don't have a single 'hot' parity disk.

RAID-6 is a variant on RAID-4/5 that (using slightly subtler
mathematics) can survive two disk failures, not just one.

Now here's the key limitation of this scheme, which is so obvious
that it's easy to miss:  the relationship between replicas of your
data is expressed in terms of the *devices*, not the *data*.

That's why a traditional filesystem can't offer different
RAID levels using the same devices -- because the RAID levels
are device-wide in nature.  In a mirror, all disks are identical.
In a RAID-4/5 group, all disks XOR to zero.  Mixing (say) mirroring
with RAID-5 doesn't work because in the event of disk failure, the
volume manager would have no idea how to reconstruct missing data.

RAID-Z takes a different approach.  We were designing a filesystem
as well, so we could make the block pointers as semantically rich
as we wanted.  To that end, the block pointers in ZFS contain data
layout information.  One nice side effect of this is that we don't
need fixed-width RAID stripes.  If you have 4+1 RAID-Z, we'll store
128k as 4x32k plus 32k of parity, just like any RAID system would.
But if you only need to store 3 sectors, we won't do a partial-stripe
update of an existing 5-wide stripe; instead, we'll just allocate
four sectors, and store the data and its parity.  The stripe width
is variable on a per-block basis.  And, although we don't support it
yet, so is the replication model.  The rule for how to reconstruct
a given block is described explicitly in the block pointer, not
implicitly by the device configuration.
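
As a rough back-of-the-envelope sketch of the variable-width idea (illustrative
only; the real RAID-Z allocator is more involved than this), the space a block
consumes under 4+1 RAID-Z is its data sectors plus one parity sector for every
row of up to four data sectors:

    import math

    SECTOR = 512      # assumed sector size, for illustration
    DATA_COLS = 4     # 4+1 RAID-Z: up to 4 data sectors per parity sector

    def raidz_sectors(data_bytes):
        data_sectors = math.ceil(data_bytes / SECTOR)
        parity_sectors = math.ceil(data_sectors / DATA_COLS)
        return data_sectors + parity_sectors

    print(raidz_sectors(128 * 1024))   # 256 data + 64 parity sectors (4x32k + 32k)
    print(raidz_sectors(3 * SECTOR))   # 3 data + 1 parity = 4 sectors, as above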

So to answer your question: no, it's not pie in the sky.  It's a
great idea.  Per-file or even per-block replication is something
we've thought about in depth, built into the on-disk format,
and plan to support in the future.

The main issues are administrative.  ZFS is all about ease of use
(when it's not busy being all about data integrity), so getting the
interface to be simple and intuitive is important -- and not as
simple as it sounds.  If your free disk space might be used for
single-copy data, or might be used for mirrored data, then
how much free space do you have?  Questions like that need
to be answered, and answered in ways that make sense.

(Note: would anyone ever really want per-block replication levels?
It's not as crazy as it sounds.  A couple of examples: replicating
only the first block, so that even if you lose data, you know the
file type and have some idea of what it contained; 

[zfs-discuss] Removing a device from a zfs pool

2006-07-13 Thread Yacov Ben-Moshe
How can I remove a device or a partition from a pool?
NOTE: The devices are not mirrored or raidz

Thanks
 
 


Re: [zfs-discuss] Removing a device from a zfs pool

2006-07-13 Thread Dick Davies

On 13/07/06, Yacov Ben-Moshe [EMAIL PROTECTED] wrote:

How can I remove a device or a partition from a pool.
NOTE: The devices are not mirrored or raidz


Then you can't - there isn't a 'zfs remove' command yet.

--
Rasputin :: Jack of All Trades - Master of Nuns
http://number9.hellooperator.net/


[zfs-discuss] howto reduce ?zfs introduced? noise

2006-07-13 Thread Thomas Maier-Komor
Hi,

after switching over to ZFS from UFS for my ~/ at home, I am a little bit 
disturbed by the noise the disks are making. To be more precise, I always have 
Thunderbird and Firefox running on my desktop, and either or both seem to be 
writing to my ~/ at short intervals; ZFS flushes these transactions to the 
disks at intervals of about 2-5 seconds. In contrast, UFS seems to do a 
little more aggressive caching, which reduces disk noise.

I didn't really track down who the offender is or what the precise reason is. 
I only know that the noise disappears as soon as I close Thunderbird and 
Firefox. So maybe there is an easy way to solve this problem at the application 
level. And in any case I want to move my $HOME to quieter disks. 

But I am curious whether I am the only one who has observed this behaviour. Maybe there 
is even an easy way to reduce this noise. Additionally, I'd guess that moving 
the heads of the disks all the time won't make the disks last any longer...

Cheers,
Tom
 
 


[zfs-discuss] Re: Expanding raidz2

2006-07-13 Thread Scott Roberts
If it were possible to implement raidz/raidz2 expansion, it would be a big 
feature in favor of ZFS. Most hardware RAID controllers have the ability to 
expand a raid pool - some have to take the raid array offline, but the ones I 
work with generally do it online, although you are forced to suffer through 
reduced performance until it is done.

I'd rather not use ZFS on top of a raid controller. I like the way ZFS works, 
and putting another layer between it and the disks seems like it would only 
reduce its effectiveness.

I'm not saying that expanding a raidz2 would be good for MTTDL, but having that 
ability there would be very handy. 


 I'm not suggesting that this is something that should
 (or could) be
 implemented.  Just that instead of it being
 technically difficult, it
 appeared pretty straightforward to me given what
 little I know about how
 the disk blocks are currently managed.  I'm just
 trying to understand if
 this is true or what other bits I'm still
 misunderstanding.  :-)

 
 


Re: [zfs-discuss] Enabling compression/encryption on a populated filesystem

2006-07-13 Thread Chad Mynhier

On 7/13/06, Darren Reed [EMAIL PROTECTED] wrote:

When ZFS compression is enabled, although the man page doesn't
explicitly say this, my guess is that only new data that gets
written out is compressed - in keeping with the COW policy.


[ ... ]


Hmmm, well, I suppose the same problem might apply to
encrypting data too...so maybe what I need is a zfs command
that will walk the filesystem's data tree, read in data and
write it back out according to the current data policy.



It seems this could be made a function of 'zfs scrub' -- instead of
simply verifying the data, it could rewrite the data as it goes.

This comes in handy in other situations.  For example, with the
current state of things, if you add disks to a pool that contains
mostly static data, you don't get the benefit of the additional
spindles when reading old data.  Rewriting the data would gain you
that benefit, plus it would avoid the new disks becoming the hot spot
for all new writes (assuming the old disks were very full).

Theoretically this could also be useful in a live data migration
situation, where you have both new and old storage connected to a
server.  But this assumes there would be some way to tell ZFS to treat
a subset of disks as read-only.

Chad Mynhier
http://cmynhier.blogspot.com/


Re: [zfs-discuss] Re: Expanding raidz2

2006-07-13 Thread Luke Scharf

David Abrahams wrote:

I've seen people wondering if ZFS was a scam because the claims just
seemed too good to be true.  Given that ZFS *is* really great, I don't
think it would hurt to prominently advertise limitations like this one;
it would probably benefit credibility considerably, and it's a real
consideration for anyone who's doing RAID-Z.

  
Very true.  I recently had someone imply to me that ZFS was a network 
protocol and everything else related to disks and file sharing -- 
instead of a volume manager integrated with a filesystem and an 
automounter.  There is hype and misinformation out there.


As for the claims, I don't buy that it's impossible to corrupt a ZFS 
volume.  I've replicated the demo where the guy dd's /dev/urandom over 
part of the disk, and I believe that works -- but there are a lot of 
other ways to corrupt a filesystem in the real world.  I'm spending this 
morning setting up a server to try ZFS in our environment -- which will 
put it under a heavy load with a lot of large files and heavy churn.   
We'll see what happens!


-Luke





Re: [zfs-discuss] Re: Expanding raidz2

2006-07-13 Thread David Dyer-Bennet
Luke Scharf [EMAIL PROTECTED] writes:

 As for the claims, I don't buy that it's impossible to corrupt a ZFS
 volume.  I've replicated the demo where the guy dd's /dev/urandom
 over part of the disk, and I believe that works -- but there are a
 lot of other ways to corrupt a filesystem in the real world.  I'm
 spending this morning setting up a server to try ZFS in our
 environment -- which will put it under a heavy load with a lot of
 large files and heavy churn.  We'll see what happens!

I've done that one too.  It's fun -- and caused me to learn the
difference between /dev/random and /dev/urandom :-).

It's easy to corrupt the volume, though -- just copy random data over
*two* disks of a RAIDZ volume.  Okay, you have to either do the whole
volume, or get a little lucky to hit both copies of some piece of
information before you get corruption.  Or pull two disks out of the
rack at once.  

With the transactional nature and rotating pool of top-level blocks, I
think it will be pretty darned hard to corrupt a structure *short of*
deliberate damage exceeding the redundancy of the vdev.  If you
succeed, you've found a bug, don't forget to report it!
-- 
David Dyer-Bennet, mailto:[EMAIL PROTECTED], http://www.dd-b.net/dd-b/
RKBA: http://www.dd-b.net/carry/
Pics: http://dd-b.lighthunters.net/ http://www.dd-b.net/dd-b/SnapshotAlbum/
Dragaera/Steven Brust: http://dragaera.info/


[zfs-discuss] Re: howto reduce ?zfs introduced? noise

2006-07-13 Thread Jason Holtzapple
I am seeing the same behavior on my SunBlade 2500 while running firefox. I 
think my disks are
quieter than yours though, because I don't really notice the difference that 
much.
 
 


Re: [zfs-discuss] Removing a device from a zfs pool

2006-07-13 Thread David Dyer-Bennet
Dick Davies [EMAIL PROTECTED] writes:

 On 13/07/06, Yacov Ben-Moshe [EMAIL PROTECTED] wrote:
  How can I remove a device or a partition from a pool.
  NOTE: The devices are not mirrored or raidz
 
 Then you can't - there isn't a 'zfs remove' command yet.

Yeah, I ran into that in my testing, too.  I suspect it's something
that will come up in testing a LOT more than in real production use.
Accidentally adding a device to the wrong thing is an unfixable error
at the moment, though, which is not good.
-- 
David Dyer-Bennet, mailto:[EMAIL PROTECTED], http://www.dd-b.net/dd-b/
RKBA: http://www.dd-b.net/carry/
Pics: http://dd-b.lighthunters.net/ http://www.dd-b.net/dd-b/SnapshotAlbum/
Dragaera/Steven Brust: http://dragaera.info/


Re: [zfs-discuss] Re: Expanding raidz2

2006-07-13 Thread Luke Scharf

David Dyer-Bennet wrote:

It's easy to corrupt the volume, though -- just copy random data over
*two* disks of a RAIDZ volume.  Okay, you have to either do the whole
volume, or get a little lucky to hit both copies of some piece of
information before you get corruption.  Or pull two disks out of the
rack at once.  
  
I tried that too - some of the files were borked, but I was impressed 
that other files on the volume were still recoverable.  Also, ZFS 
automatically started the scrub - which was handy.  Unfortunately, my 
test system only had one HDD (with 3 partitions simulating a RAID-Z), so 
the timing wasn't realistic.

With the transactional nature and rotating pool of top-level blocks, I
think it will be pretty darned hard to corrupt a structure *short of*
deliberate damage exceeding the redundancy of the vdev.  If you
succeed, you've found a bug, don't forget to report it!
  
I buy "very good, backed by good theory and good coding."  After a few 
months of testing, I might even buy "better than any other general 
purpose filesystem and volume manager."


But infallible?  If so, I shall name my storage server Titanic.

-Luke





Re: [zfs-discuss] Expanding raidz2

2006-07-13 Thread David Dyer-Bennet
On Thu, Jul 13, 2006 at 09:44:18AM -0500, Al Hopper wrote:
 On Thu, 13 Jul 2006, David Dyer-Bennet wrote:
 
  Adam Leventhal [EMAIL PROTECTED] writes:
 
   I'm not sure I even agree with the notion that this is a real
   problem (and if it is, I don't think it is easily solved). Stripe
   widths are a function of the expected failure rate and fault domains
   of the system which tend to be static in nature. A coarser solution
   would be to create a new pool where you zfs send/zfs recv the
   filesystems of the old pool.
 
  RAIDZ expansion is a big enough deal that I may end up buying an
  Infrant NAS box and using their X-RAID instead.  The ZFS should be
  more secure, and I *really* like the block checksumming -- but the
  ability to expand my existing pool by just adding a new disk is REALLY
  REALLY USEFUL in a small office or home configuration.
 
 The economics of what you're saying don't make sense to me.  Let me
 explain: I just did a quick grep for pricing on an Infrant X6 box at $700
 with one 250Gb drive installed.  And then you would still have to add at
 least one more disk drive to get data resilience.  I picked up two Seagate
 500Gb drives (in Frys, on special) for $189 each.  So with a $700 budget
 and ZFS .

My figures suggest that an Infrant NV box costs about $650 with no
disks.  A server box to run Solaris on with hot-swap racks and such
costs out to roughly $1600 with no disks, plus a huge learning curve
for me.  The fact that I'm willing to even *consider* that route says
something about how attractive other features of ZFS are to me.
-- 
David Dyer-Bennet, mailto:[EMAIL PROTECTED], http://www.dd-b.net/dd-b/
RKBA: http://www.dd-b.net/carry/
Pics: http://dd-b.lighthunters.net/ http://www.dd-b.net/dd-b/SnapshotAlbum/
Dragaera/Steven Brust: http://dragaera.info/


[zfs-discuss] Re: Expanding raidz2

2006-07-13 Thread David Abrahams
Jeff Bonwick [EMAIL PROTECTED] writes:

 The main issues are administrative.  ZFS is all about ease of use
 (when it's not busy being all about data integrity), so getting the
 interface to be simple and intuitive is important -- and not as
 simple as it sounds.  If your free disk space might be used for
 single-copy data, or might be used for mirrored data, then
 how much free space do you have?  Questions like that need
 to be answered, and answered in ways that make sense.

It seems, on the face of it, as though a *single* sensible answer
might be impossible.  But it also seems like it might be unnecessary.
How often is it useful for a program to ask about the amount of free
memory these days?  Not often; another process or thread can come
along and allocate (or free) memory before the information is used,
and make it totally useless.

So if you want to deliver one answer, maybe report the maximum amount
available under all allowed storage schemes, because it doesn't matter
all that much; allocations have to be able to fail dynamically
anyhow.  ZFS would need another interface that allows more
sophisticated queries.
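
For instance (a hypothetical back-of-the-envelope, not anything ZFS reports
today), the same raw free space maps to different usable capacities depending
on which replication scheme new data would use, and "report the maximum" would
amount to quoting the single-copy figure:

    RAW_FREE = 1_000_000_000   # hypothetical raw free bytes in the pool

    estimates = {
        'single copy':    RAW_FREE,           # no redundancy overhead
        'two-way mirror': RAW_FREE // 2,      # every byte stored twice
        '4+1 raidz':      RAW_FREE * 4 // 5,  # one parity sector per four data sectors
    }
    print(max(estimates.values()))   # the maximum available under all schemes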

-- 
Dave Abrahams
Boost Consulting
www.boost-consulting.com



[zfs-discuss] Re: Expanding raidz2

2006-07-13 Thread David Abrahams
David Dyer-Bennet [EMAIL PROTECTED] writes:

 Adam Leventhal [EMAIL PROTECTED] writes:

 I'm not sure I even agree with the notion that this is a real
 problem (and if it is, I don't think it is easily solved). Stripe
 widths are a function of the expected failure rate and fault domains
 of the system which tend to be static in nature. A coarser solution
 would be to create a new pool where you zfs send/zfs recv the
 filesystems of the old pool.

 RAIDZ expansion is a big enough deal that I may end up buying an
 Infrant NAS box and using their X-RAID instead.  The ZFS should be
 more secure, and I *really* like the block checksumming -- but the
 ability to expand my existing pool by just adding a new disk is REALLY
 REALLY USEFUL in a small office or home configuration.  

Yes, and while it's not an immediate showstopper for me, I'll want to
know that expansion is coming imminently before I adopt RAID-Z.

 I see phrases like "just add another 7-disk RAIDZ", and I laugh; the
 boxes I'm looking at mostly have *4* or *5* hot-swap bays.  If I
 could, I'd start with a 2-disk RAIDZ, planning to expand it twice
 before hitting the system config limit.  A *single* 7-disk RAIDZ is
 probably beyond my means; two of them is absurd to even consider. 

 Possibly this isn't the market ZFS will make money in, but it's the
 market *I'm* in. 

Ditto, ditto, ditto.

-- 
Dave Abrahams
Boost Consulting
www.boost-consulting.com



Re: [zfs-discuss] Expanding raidz2 [Infrant]

2006-07-13 Thread Rob Logan


Infrant NAS box and using their X-RAID instead.  

I've gone back to Solaris from an Infrant box.

1) while the Infrant CPU is SPARC, it's way, way slow.
  a) the web UI takes 3-5 seconds per page
  b) any local process (rsync, UPnP, SlimServer) is CPU-starved
2) like a NetApp, it's frustrating to not have shell access
3) NFSv3 is buggy (use NFSv2)
  a) http://www.infrant.com/forum/viewtopic.php?t=546
  b) NFSv2 works, but its max filesize is 2Gig.
4) 8MB/sec writes and 15MB/sec reads aren't that fast
5) local rsync writes are 2MB/sec (use NFS instead)

put Solaris on your old PPro box.  It will be faster (yes!), cheaper,
and you can do more than one snapshot (and it doesn't kill the system),
plus one gets shell access!


RE: [zfs-discuss] Expanding raidz2

2006-07-13 Thread Bennett, Steve
Jeff Bonwick said:

 RAID-Z takes a different approach.  We were designing a filesystem
 as well, so we could make the block pointers as semantically rich
 as we wanted.  To that end, the block pointers in ZFS contain data
 layout information.  One nice side effect of this is that we don't
 need fixed-width RAID stripes.  If you have 4+1 RAID-Z, we'll store
 128k as 4x32k plus 32k of parity, just like any RAID system would.
 But if you only need to store 3 sectors, we won't do a partial-stripe
 update of an existing 5-wide stripe; instead, we'll just allocate
 four sectors, and store the data and its parity.  The stripe width
 is variable on a per-block basis.  And, although we don't support it
 yet, so is the replication model.  The rule for how to reconstruct
 a given block is described explicitly in the block pointer, not
 implicitly by the device configuration.

Thanks for the explanation - a great help in understanding how all this
stuff fits together.

Unfortunately I'm now less sure about why you cannot 'just' add another
disk to a RAID-Z pool. Is this just a policy decision for the sake of
keeping it simple, rather than a technical restriction?

 If your free disk space might be used for single-copy data,
 or might be used for mirrored data, then how much free space
 do you have?  Questions like that need to be answered, and
 answered in ways that make sense.

They need to be answered, but as the storage is scaled up we don't need
any extra accuracy - knowing that a filesystem is somewhere around 80%
full is just fine - I really don't need to care precisely how many
blocks are free, and it actually hinders me if I get given the exact
information (I have to scale it into the number of GB, or the percentage
of space used).

The fact that we pretty much ignore exact block counts then leads
me to think that we don't actually need to care about exactly how many
blocks are free on a disk - so if I store N blocks of data it's
acceptable for the number of free blocks to change by something
different to N. And once data starts to be compressed the direct
correlation between the size of a file and the amount of disk space it
uses goes away in any case.

All pretty exciting - how long are we going to have to wait for this?

Steve.


Re: [zfs-discuss] Re: Expanding raidz2

2006-07-13 Thread Rob Logan



comfortable with having 2 parity drives for 12 disks,


the thread starting config of 4 disks per controller(?):
zpool create tank raidz2 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c2t1d0 c2t2d0

then later
zpool add tank raidz2 c2t3d0 c2t4d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0

as described, this doubles one's IOPs and usable space in tank, with the loss
of another two disks, splitting each cluster into four data (and two parity)
writes, one per disk.  perhaps use an 8-disk controller and start with

zpool create tank raidz c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0

then do a
zpool add tank raidz c1t6d0 c1t7d0 c1t8d0 c2t1d0 c2t2d0
zpool add tank raidz c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0
zpool add tank spare c2t8d0

gives one the same largeish cluster size div 4 per raidz disk, 3x the
IOPs, less parity math per write, and a hot spare for the same usable
space and loss of 4 disks.

splitting the max 128k cluster into 12 chunks (+2 parity) makes good MTTR
sense but not much performance sense.  if someone wants to do the MTTR
math between all three configs, I'd love to read it.

Rob

http://storageadvisors.adaptec.com/2005/11/02/actual-reliability-calculations-for-raid/
http://www.barringer1.com/ar.htm


Re: [zfs-discuss] Re: Expanding raidz2

2006-07-13 Thread Richard Elling



David Abrahams wrote:

David Dyer-Bennet [EMAIL PROTECTED] writes:


Adam Leventhal [EMAIL PROTECTED] writes:


I'm not sure I even agree with the notion that this is a real
problem (and if it is, I don't think it is easily solved). Stripe
widths are a function of the expected failure rate and fault domains
of the system which tend to be static in nature. A coarser solution
would be to create a new pool where you zfs send/zfs recv the
filesystems of the old pool.

RAIDZ expansion is a big enough deal that I may end up buying an
Infrant NAS box and using their X-RAID instead.  The ZFS should be
more secure, and I *really* like the block checksumming -- but the
ability to expand my existing pool by just adding a new disk is REALLY
REALLY USEFUL in a small office or home configuration.  


Yes, and while it's not an immediate showstopper for me, I'll want to
know that expansion is coming imminently before I adopt RAID-Z.


[in brainstorming mode, sans coffee so far this morning]

Better yet, buy two disks, say 500 GByte.  Need more space?  Replace
them with 750 GByte, because by then the price of the 750 GByte disks
will be as low as the 250 GByte disks today, and the 1.5 TByte disks
will be $400.  Over time, the cost of disks remains the same, but the
density increases.  This will continue to occur faster than the
development and qualification of complex software.  ZFS will already 
expand a mirror as you replace disks :-)  KISS

 -- richard


Re: [zfs-discuss] system unresponsive after issuing a zpool attach

2006-07-13 Thread Dennis Clarke

 Who hoo! It looks like the resilver completed sometime over night. The
 system appears to be running normally, (after one final reboot):

 [EMAIL PROTECTED]: zpool status
   pool: storage
  state: ONLINE
  scrub: none requested
 config:

 NAME  STATE READ WRITE CKSUM
 storage   ONLINE   0 0 0
   mirror  ONLINE   0 0 0
 c1t2d0s4  ONLINE   0 0 0
 c1t1d0s4  ONLINE   0 0 0

 errors: No known data errors

looks nice :-)

 I took a poke at the zfs bugs on SunSolve again, and found one that is
 the likely culprit:

 6355416 zpool scrubbing consumes all memory, system hung

 Appears that a fix is in Nevada 36, hopefully it'll be back ported to a
 patch for 10.


whoa whoa ... just one bloody second .. whoa ..

That looks like a real nasty bug description there.

What are the details on that?  Is this particular to a given system or
controller config or something like that, or are we talking global to Solaris
10 Update 2 everywhere ??  :-(

Bug ID: 6355416
Synopsis: zpool scrubbing consumes all memory, system hung
Category: kernel
Subcategory: zfs
State: 10-Fix Delivered   -- in a patch somewhere ?

Description:

On a 6800 domain with 8G of RAM I created a zpool using a single 18G drive
and on that pool created a file system and a zvol. The zvol was filled with
data.

# zfs list
NAME   USED  AVAIL  REFER  MOUNTPOINT
pool  11.0G  5.58G  9.00K  /pool
pool/fs  8K  5.58G 8K  /pool/fs
pool/[EMAIL PROTECTED]  0  - 8K  -
pool/root 11.0G  5.58G  11.0G  -
pool/[EMAIL PROTECTED]783K  -  11.0G  -
#

I then attached a second 18g drive to the pool and all seemed well. After a
few minutes however the system ground to a halt.  No response from the
keyboard.

Aborting the system, it failed to dump due to the dump device being too small.
 On rebooting it did not make it into multi-user.

Booting milestone=none and then bringing it up by hand, I could see it hung
doing zfs mount -a.

Booting milestone=none again I was able to export the pool and then the
system would come up into multiuser.  Any attempt to import the pool would
hang the system, with vmstat showing it consumed all available memory.

With the pool exported I reinstalled the system with a larger dump device
and then imported the pool.  The same hang occurred; however, this time I got
the crash dump.

Dumps can be found here:

/net/enospc.uk/export/esc/pts-crashdumps/zfs_nomemory

Dump 0 is from stock build 72a; dump 1 is from my workspace and had KMF_AUDIT
set.  The only change in my workspace is to the isp driver.

::kmausers gives:
365010944 bytes for 44557 allocations with data size 8192:
 kmem_cache_alloc+0x148
 segkmem_xalloc+0x40
 segkmem_alloc+0x9c
 vmem_xalloc+0x554
 vmem_alloc+0x214
 kmem_slab_create+0x44
 kmem_slab_alloc+0x3c
 kmem_cache_alloc+0x148
 kmem_zalloc+0x28
 zio_create+0x3c
 zio_vdev_child_io+0xc4
 vdev_mirror_io_start+0x1ac
 spa_scrub_cb+0xe4
 traverse_segment+0x2e8
 traverse_more+0x7c
362520576 bytes for 44253 allocations with data size 8192:
 kmem_cache_alloc+0x148
 segkmem_xalloc+0x40
 segkmem_alloc+0x9c
 vmem_xalloc+0x554
 vmem_alloc+0x214
 kmem_slab_create+0x44
 kmem_slab_alloc+0x3c
 kmem_cache_alloc+0x148
 kmem_zalloc+0x28
 zio_create+0x3c
 zio_read+0x54
 spa_scrub_io_start+0x88
 spa_scrub_cb+0xe4
 traverse_segment+0x2e8
 traverse_more+0x7c
241177600 bytes for 376840 allocations with data size 640:
 kmem_cache_alloc+0x88
 kmem_zalloc+0x28
 zio_create+0x3c
 zio_vdev_child_io+0xc4
 vdev_mirror_io_done+0x254
 taskq_thread+0x1a0
209665920 bytes for 327603 allocations with data size 640:
 kmem_cache_alloc+0x88
 kmem_zalloc+0x28
 zio_create+0x3c
 zio_read+0x54
 spa_scrub_io_start+0x88
 spa_scrub_cb+0xe4
 traverse_segment+0x2e8
 traverse_more+0x7c

I have attached the full output.

If I am quick I can detach the disk and then export the pool before the
system grinds to a halt.  Then reimporting the pool I can access the data. 
Attaching the disk again results in the system using all the memory again.

Date Modified: 2005-11-25 09:03:07 GMT+00:00


Work Around:
Suggested Fix:
Evaluation:
Fixed by patch:
Integrated in Build: snv_36
Duplicate of:
Related Change Request(s):6352306  6384439  6385428
Date Modified: 2006-03-23 23:58:15 GMT+00:00
Public Summary:




Re: [zfs-discuss] Re: Expanding raidz2

2006-07-13 Thread Erik Trimble
On Thu, 2006-07-13 at 11:42 -0700, Richard Elling wrote:
 [in brainstorming mode, sans coffee so far this morning]
 
 Better yet, buy two disks, say 500 GByte.  Need more space, replace
 them with 750 GByte, because by then the price of the 750 GByte disks
 will be as low as the 250 GByte disks today, and the 1.5 TByte disks
 will be $400.  Over time, the cost of disks remains the same, but the
 density increases.  This will continue to occur faster than the
 development and qualification of complex software.  ZFS will already 
 expand a mirror as you replace disks :-)  KISS
   -- richard

Looking at our (Sun's) product line now, we're not just going after the
Enterprise market anymore. Specifically, the Medium Business market is a
target (a few hundred people, a half-dozen IT staff, total). 

RAIDZ expansion for these folks is essentially a must-have to sell to
them.  Being able to expand a 2-drive array into a 5-drive RAIDZ by
simply pushing in new disks and typing a single command is a HUGE win.
Most hardware RAID (even the low-end, and both SCSI and SATA) controllers
can do this on-line nowadays.  It's something that is simply expected,
and not having it is a big black mark. 

A typical instance here is a small business server (2-4 CPUs) hooked to
a small JBOD. We're not going to sell them a fully populated JBOD to
start with, but selling them one 50% full is much more likely.  (look at
the price differential between a fully and a half-populated 3510FC).   In
the Small Business market, expandability is key, as their limited
budgets tend to make for Just-In-Time purchasing.  They are _much_ more
likely to buy from us things that can be had in a minimum configuration
at low cost, but have considerable future expansion, even if the
expansion costs them considerably more overall than getting the entire
thing in the first place. 

Also, mixing and matching inside a disk server is unlikely until you get
to places that have a highly trained staff. Yes, adding 4 250GB drives
is more expensive than adding 2 750GB ones, but the difference is nominal compared
to the extra effort of configuration and maintenance.  At the Medium
Business level, less stress on the Admin staff is usually the driving
factor after raw cost, since Admin staff tend to be extremely
overworked.



-- 
Erik Trimble
Java System Support
Mailstop:  usca14-102
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)



Re: [zfs-discuss] system unresponsive after issuing a zpool attach

2006-07-13 Thread Daniel Rock

Joseph Mocker wrote:
Today I attempted to upgrade to S10_U2 and migrate some mirrored UFS SVM 
partitions to ZFS.


I used Live Upgrade to migrate from U1 to U2 and that went without a 
hitch on my SunBlade 2000. And the initial conversion of one side of the 
UFS mirrors to a ZFS pool and subsequent data migration went fine. 
However, when I attempted to attach the second side mirrors as a mirror 
of the ZFS pool, all hell broke loose.



9. attach the partition to the pool as a mirror
  zpool attach storage cXtXdXs4 cYtYdYs4

A few minutes after issuing the command the system became unresponsive 
as described above.


Same here. I also upgraded to S10_U2 and converted my non-root md
similarly to yours. Everything went fine until the zpool attach. The system
seemed to be hanging for at least 2-3 minutes. Then I could type something
again. top then showed 98% system time.

This was on a SunBlade 1000 with 2 x 750MHz CPUs. The zpool/zfs was created 
with checksum=sha256.




Daniel


Re: [zfs-discuss] system unresponsive after issuing a zpool attach

2006-07-13 Thread Joseph Mocker



Dennis Clarke wrote:


whoa whoa ... just one bloody second .. whoa ..

That looks like a real nasty bug description there.

What are the details on that?  Is this particular to a given system or
controller config or something like that, or are we talking global to Solaris
10 Update 2 everywhere ??  :-(
  

That's a good question. Looking at the internal evaluation, it appears
scrubs can be a little too aggressive.
Perhaps one of the ZFS engineers can comment, Jeff?

I am curious about the Fix Delivered state as well. Looks like it's
been fixed in SNV 36, but I wonder if there will be a patch available.

 --joe



  





[zfs-discuss] How to monitor ZFS ?

2006-07-13 Thread martin
How could I monitor ZFS?
Or the zpool activity?

I want to know if anything is going wrong.
If I could receive those warnings by email, it would be great :)

Martin
 
 


[zfs-discuss] Re: Expanding raidz2

2006-07-13 Thread Tatjana S Heuser
 Of course when it's time to upgrade you can always
 just call sun and get a Thumper on a Try before you
 Buy - and use it as a temporary storage space for
 your files while you re-do your raidz/raidz2 virtual
 device from scratch with an additional disk. zfs
 send/zfs receive here I come.

For all my experience is worth, given enough space, 
and just some time to sift through all that and reorganize, 
data has some mysterious way of growing and populating 
unused space. Small SOHO sites willing to devote some 
tender loving care to their accumulated wealth are more 
prone to that than large sites with established plans and routines. 
(Though those may compensate for that by having the large 
device in production for a bit and letting their users do the rest 
- empty disk space has some magnetic attraction to data; 
it never fails to fill.)
So that scheme of yours may well end in another Thumper sold, 
because it's too populated to migrate and return. ;-)

Sorry, I just couldn't resist.
  Tatjana
 
 


Re: [zfs-discuss] Re: RE: Expanding raidz2

2006-07-13 Thread Marion Hakanson
[EMAIL PROTECTED] said:
 There's no reason at all why you can't do this. The only thing preventing
 most file systems from taking advantage of "adjustable" replication is that
 they don't have the integrated volume management capabilities that ZFS does. 

And in fact, Sun's own QFS can do this, on a per-file (or directory) basis,
if one has set up the underlying filesystem with the appropriate mix of
replication, striping, etc.

Regards,

Marion





Re: [zfs-discuss] Re: Expanding raidz2

2006-07-13 Thread grant beattie
On Thu, Jul 13, 2006 at 11:42:21AM -0700, Richard Elling wrote:

 Yes, and while it's not an immediate showstopper for me, I'll want to
 know that expansion is coming imminently before I adopt RAID-Z.
 
 [in brainstorming mode, sans coffee so far this morning]
 
 Better yet, buy two disks, say 500 GByte.  Need more space, replace
 them with 750 GByte, because by then the price of the 750 GByte disks
 will be as low as the 250 GByte disks today, and the 1.5 TByte disks
 will be $400.  Over time, the cost of disks remains the same, but the
 density increases.  This will continue to occur faster than the
 development and qualification of complex software.  ZFS will already 
 expand a mirror as you replace disks :-)  KISS

indeed, but this is not the same as expansion of an existing vdev
because you still have the same number of spindles, with potentially
more data on each, so it may in fact be a net performance loss.

I don't think the only driver for wanting to expand a raidz vdev is
to gain more space...

grant.
