Re: [zfs-discuss] File contents changed with no ZFS error

2011-10-22 Thread Mark Sandrock
Why don't you check which byte differs, and how it differs?
Maybe that would suggest the failure mode. Is it the
same byte value in all affected files, for instance?
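If you still have a known-good copy of one of the files (from backup, or from
a snapshot), something like this would show exactly which byte changed and how
(the file names here are only placeholders):

    # list every differing byte: offset, old value, new value (in octal)
    cmp -l /backup/copy/file.bin /tank/data/file.bin

    # or confirm the mismatch with checksums first
    digest -a sha1 /backup/copy/file.bin /tank/data/file.bin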

Mark

Sent from my iPhone

On Oct 22, 2011, at 2:08 PM, Robert Watzlavick rob...@watzlavick.com wrote:

 On Oct 22, 2011, at 13:14, Edward Ned Harvey 
 opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
 
 How can you rule out the possibility that something changed the file?
 Intentionally, not as a form of filesystem corruption.
 
 I suppose that's possible but seems unlikely. One byte in a file changed on
 the disk with no corresponding change in the mod time seems unlikely. I did
 access that file for read sometime in the past few months but again, if it
 had accidentally been written to, the time would have been updated.
 
 If you have snapshots on your ZFS filesystem, you can use zhist (or whatever
 technique you want) to see in which snapshot(s) it changed, and find all the
 unique versions of it.  'Course that will only give you any valuable
 information if you have different versions of the file in different
 snapshots.
 
 I only have one or two snapshots but I'll look. 
 
 Thanks,
 -Bob


Re: [zfs-discuss] about btrfs and zfs

2011-10-18 Thread Mark Sandrock

On Oct 18, 2011, at 11:09 AM, Nico Williams wrote:

 On Tue, Oct 18, 2011 at 9:35 AM, Brian Wilson wrote:
 I just wanted to add something on fsck on ZFS - because for me that used to
 make ZFS 'not ready for prime-time' in 24x7 5+ 9s uptime environments.
 Where ZFS doesn't have an fsck command - and that really used to bug me - it
 does now have a -F option on zpool import.  To me it's the same
 functionality for my environment - the ability to try to roll back to a
 'hopefully' good state and get the filesystem mounted up, leaving the
 corrupted data objects corrupted.  [...]
 
 Yes, that's exactly what it is.  There's no point calling it fsck
 because fsck fixes individual filesystems, while ZFS fixups need to
 happen at the volume level (at volume import time).
 
 It's true that this should have been in ZFS from the word go.  But
 it's there now, and that's what matters, IMO.

Doesn't a scrub do more than what
'fsck' does?
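For reference, a rough sketch of the two operations being compared (the pool
name is just a placeholder):

    # scrub: read every block in an imported pool, verify checksums,
    # and repair from redundancy where possible
    zpool scrub tank
    zpool status tank      # shows scrub progress and any errors found

    # import -F: discard the last few transactions and roll back to an
    # earlier consistent state, for a pool that won't import normally
    zpool import -F tank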

 
 It's also true that this was never necessary with hardware that
 doesn't lie, but it's good to have it anyways, and is critical for
 personal systems such as laptops.

IIRC, fsck was seldom needed at
my former site once UFS journalling
became available. Sweet update.

Mark


Re: [zfs-discuss] Large scale performance query

2011-08-06 Thread Mark Sandrock
Shouldn't the choice of RAID type also
be based on the i/o requirements?

Anyway, with RAID-10, even a second
failed disk is not catastrophic, so long
as it is not the mirror partner of the first
failed disk, no matter the no. of disks.
(With 2-way mirrors.)
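For instance, with a hypothetical pool of 2-way mirrors (device names made up):

    zpool create tank \
        mirror c0t0d0 c0t1d0 \
        mirror c0t2d0 c0t3d0 \
        mirror c0t4d0 c0t5d0

If c0t0d0 fails, the pool survives a second failure of any disk except its
partner c0t1d0.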

But that's why we do backups, right?

Mark

Sent from my iPhone

On Aug 6, 2011, at 7:01 AM, Orvar Korvar knatte_fnatte_tja...@yahoo.com wrote:

 Ok, so mirrors resilver faster.
 
 But, it is not uncommon that another disk shows problems during resilver (for
 instance r/w errors), and this scenario would mean your entire raid is gone,
 right? If you are using mirrors, and one disk crashes and you start a resilver,
 and then the other disk shows r/w errors because of the increased load - then you
 are screwed? Because large disks take a long time to resilver, possibly weeks?
 
 In that case, it would be preferable to use mirrors with 3 disks in each
 vdev (tri-mirrors). Each vdev should be one raidz3.
 -- 
 This message posted from opensolaris.org


Re: [zfs-discuss] X4540 no next-gen product?

2011-04-08 Thread Mark Sandrock

On Apr 8, 2011, at 2:37 AM, Ian Collins i...@ianshome.com wrote:

 On 04/ 8/11 06:30 PM, Erik Trimble wrote:
 On 4/7/2011 10:25 AM, Chris Banal wrote:
 While I understand everything at Oracle is top secret these days.
 
 Does anyone have any insight into a next-gen X4500 / X4540? Does some other 
 Oracle / Sun partner make a comparable system that is fully supported by 
 Oracle / Sun?
 
 http://www.oracle.com/us/products/servers-storage/servers/previous-products/index.html
  
 
 What do X4500 / X4540 owners use if they'd like more comparable zfs based 
 storage and full Oracle support?
 
 I'm aware of Nexenta and other cloned products but am specifically asking 
 about Oracle supported hardware. However, does anyone know if these type of 
 vendors will be at NAB this year? I'd like to talk to a few if they are...
 
 
 The move seems to be to the Unified Storage (aka ZFS Storage) line, which is 
 a successor to the 7000-series OpenStorage stuff.
 
 http://www.oracle.com/us/products/servers-storage/storage/unified-storage/index.html
  
 
 Which is not a lot of use to those of us who use X4540s for what they were 
 intended: storage appliances.

Can you elaborate briefly on what exactly the problem is?

I don't follow? What else would an X4540 or a 7xxx box
be used for, other than a storage appliance?

Guess I'm slow. :-)

Mark


Re: [zfs-discuss] X4540 no next-gen product?

2011-04-08 Thread Mark Sandrock

On Apr 8, 2011, at 3:29 AM, Ian Collins i...@ianshome.com wrote:

 On 04/ 8/11 08:08 PM, Mark Sandrock wrote:
 On Apr 8, 2011, at 2:37 AM, Ian Collins i...@ianshome.com wrote:
 
 On 04/ 8/11 06:30 PM, Erik Trimble wrote:
 On 4/7/2011 10:25 AM, Chris Banal wrote:
 While I understand everything at Oracle is top secret these days.
 
 Does anyone have any insight into a next-gen X4500 / X4540? Does some 
 other Oracle / Sun partner make a comparable system that is fully 
 supported by Oracle / Sun?
 
 http://www.oracle.com/us/products/servers-storage/servers/previous-products/index.html
 
 What do X4500 / X4540 owners use if they'd like more comparable zfs based 
 storage and full Oracle support?
 
 I'm aware of Nexenta and other cloned products but am specifically asking 
 about Oracle supported hardware. However, does anyone know if these type 
 of vendors will be at NAB this year? I'd like to talk to a few if they 
 are...
 
 The move seems to be to the Unified Storage (aka ZFS Storage) line, which 
 is a successor to the 7000-series OpenStorage stuff.
 
 http://www.oracle.com/us/products/servers-storage/storage/unified-storage/index.html
 
 Which is not a lot of use to those of us who use X4540s for what they were 
 intended: storage appliances.
 Can you elaborate briefly on what exactly the problem is?
 
 I don't follow? What else would an X4540 or a 7xxx box
 be used for, other than a storage appliance?
 
 Guess I'm slow. :-)
 
 No, I just wasn't clear - we use ours as storage/application servers.  They 
 run Samba, Apache and various other applications and P2V zones that access 
 the large pool of data.  Each also acts as a fail over box (both data and 
 applications) for the other.

You have built-in storage failover with an AR cluster;
and they do NFS, CIFS, iSCSI, HTTP and WebDAV
out of the box.

And you have fairly unlimited options for application servers,
once they are decoupled from the storage servers.

It doesn't seem like much of a drawback -- although it
may be for some smaller sites. I see AR clusters going in
at local high schools and small universities.

Anything's a fraction of the price of a SAN, isn't it? :-)

Mark
 
 They replaced several application servers backed by a SAN for a fraction of
 the price of a new SAN.
 
 -- 
 Ian.
 


Re: [zfs-discuss] X4540 no next-gen product?

2011-04-08 Thread Mark Sandrock

On Apr 8, 2011, at 7:50 AM, Evaldas Auryla evaldas.aur...@edqm.eu wrote:

 On 04/ 8/11 01:14 PM, Ian Collins wrote:
 You have built-in storage failover with an AR cluster;
 and they do NFS, CIFS, iSCSI, HTTP and WebDav
 out of the box.
 
 And you have fairly unlimited options for application servers,
 once they are decoupled from the storage servers.
 
 It doesn't seem like much of a drawback -- although it
 may be for some smaller sites. I see AR clusters going in
 in local high schools and small universities.
 
 Which is all fine and dandy if you have a green field, or are willing to
 re-architect your systems.  We just wanted to add a couple more x4540s!
 
 
 Hi, same here, it's sad news that Oracle decided to stop the x4540 production
 line. Before, ZFS geeks had a choice - buy the 7000 series if you want quick
 out-of-the-box storage with a nice GUI, or build your own storage with the x4540
 line, which by the way has a brilliant engineering design. That choice is gone now.

Okay, so what is the great advantage
of an X4540 versus an x86 server plus
disk array(s)?

Mark


Re: [zfs-discuss] X4540 no next-gen product?

2011-04-08 Thread Mark Sandrock

On Apr 8, 2011, at 9:39 PM, Ian Collins i...@ianshome.com wrote:

 On 04/ 9/11 03:20 AM, Mark Sandrock wrote:
 On Apr 8, 2011, at 7:50 AM, Evaldas Auryla evaldas.aur...@edqm.eu wrote:
 On 04/ 8/11 01:14 PM, Ian Collins wrote:
 You have built-in storage failover with an AR cluster;
 and they do NFS, CIFS, iSCSI, HTTP and WebDav
 out of the box.
 
 And you have fairly unlimited options for application servers,
 once they are decoupled from the storage servers.
 
 It doesn't seem like much of a drawback -- although it
 may be for some smaller sites. I see AR clusters going in
 in local high schools and small universities.
 
 Which is all fine and dandy if you have a green field, or are willing to
 re-architect your systems.  We just wanted to add a couple more x4540s!
 Hi, same here, it's a sad news that Oracle decided to stop x4540s 
 production line. Before, ZFS geeks had choice - buy 7000 series if you want 
 quick out of the box storage with nice GUI, or build your own storage 
 with x4540 line, which by the way has brilliant engineering design, the 
 choice is gone now.
 Okay, so what is the great advantage
 of an X4540 versus X86 server plus
 disk array(s)?
 
 One less x86 box (even more of an issue now we have to mortgage the children 
 for support), a lot less $.
 
 Not to mention an existing infrastructure built using X4540s, and me looking a
 fool explaining to the client that they can't get any more, so the systems we
 have spent two years building up are a dead end.
 
 One size does not fit all, choice is good for business.

I'm not arguing. If it were up to me,
we'd still be selling those boxes.

Mark
 
 -- 
 Ian.


Re: [zfs-discuss] X4540 no next-gen product?

2011-04-08 Thread Mark Sandrock

On Apr 8, 2011, at 11:19 PM, Ian Collins i...@ianshome.com wrote:

 On 04/ 9/11 03:53 PM, Mark Sandrock wrote:
 I'm not arguing. If it were up to me,
 we'd still be selling those boxes.
 
 Maybe you could whisper in the right ear?

I wish. I'd have a long list if I could do that.

Mark

 :)
 
 -- 
 Ian.


Re: [zfs-discuss] Any use for extra drives?

2011-03-25 Thread Mark Sandrock

On Mar 24, 2011, at 7:23 AM, Anonymous wrote:

 Generally, you choose your data pool config based on data size,
 redundancy, and performance requirements.  If those are all satisfied with
 your single mirror, the only thing left for you to do is think about
 splitting your data off onto a separate pool due to better performance
 etc.  (Because there are things you can't do with the root pool, such as
 striping and raidz) 
 
 That's all there is to it.  To split, or not to split.
 
 Thanks for the update. I guess there's not much to do for this box since
 it's a development machine and doesn't have much need for extra redundancy,
 although if I had had some extra 500s I would have liked to stripe
 the root pool. I see from your answer that's not possible anyway. Cheers.

If you plan to generate a lot of data, why use the root pool? You can put the
/home and /proj filesystems (/export/...) on a separate pool, thus off-loading
the root pool.
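A rough sketch of what I mean, with made-up pool and device names:

    # data pool on the extra drives
    zpool create datapool mirror c1t2d0 c1t3d0

    # user and project filesystems live there, not on the root pool
    zfs create -o mountpoint=/export/home datapool/home
    zfs create -o mountpoint=/export/proj datapool/proj

Then the root pool carries only the OS.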

My two cents,
Mark




Re: [zfs-discuss] Any use for extra drives?

2011-03-24 Thread Mark Sandrock

On Mar 24, 2011, at 5:42 AM, Edward Ned Harvey wrote:

 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Nomen Nescio
 
 Hi ladies and gents, I've got a new Solaris 10 development box with ZFS
 mirror root using 500G drives. I've got several extra 320G drives and I'm
 wondering if there's any way I can use these to good advantage in this
 box. I've got enough storage for my needs with the 500G pool. At this
 point
 I would be looking for a way to speed things up if possible or add
 redundancy if necessary but I understand I can't use these smaller drives
 to
 stripe the root pool, so what would you suggest? Thanks.
 
 Generally, you choose your data pool config based on data size, redundancy,
 and performance requirements.  If those are all satisfied with your single
 mirror, the only thing left for you to do is think about splitting your data
 off onto a separate pool due to better performance etc.  (Because there are
 things you can't do with the root pool, such as striping and raidz)
 
 That's all there is to it.  To split, or not to split.

I'd just put /export/home on this second set of drives, as a striped mirror.

Same as I would have done in the old days under SDS. :-)
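Something along these lines, assuming four of the spare 320G drives (device
names made up):

    # two mirrored pairs, striped together
    zpool create homepool \
        mirror c2t0d0 c2t1d0 \
        mirror c2t2d0 c2t3d0

    zfs create -o mountpoint=/export/home homepool/home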

Mark


Re: [zfs-discuss] ZFS and spindle speed (7.2k / 10k / 15k)

2011-02-02 Thread Mark Sandrock

On Feb 2, 2011, at 8:10 PM, Eric D. Mudama wrote:

  All other
 things being equal, the 15k and the 7200 drive, which share
 electronics, will have the same max transfer rate at the OD.

Is that true? So the only difference is in the access time?

Mark


Re: [zfs-discuss] Best choice - file system for system

2011-01-31 Thread Mark Sandrock
Why do you say fssnap has the same problem?

If it write locks the file system, it is only for a matter of seconds, as I 
recall.

Years ago, I used it on a daily basis to do ufsdumps of large fs'es.
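From memory, the nightly routine looked roughly like this (paths are examples
only):

    # create the snapshot; prints the snapshot device, e.g. /dev/fssnap/0
    fssnap -F ufs -o bs=/var/tmp/snapstore /export/home

    # back up the read-only snapshot instead of the live file system
    ufsdump 0f /dev/rmt/0 /dev/rfssnap/0

    # delete the snapshot once the dump is done
    fssnap -d /export/home

The write lock lasts only while the snapshot is being established, not for the
duration of the ufsdump.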

Mark

On Jan 30, 2011, at 5:41 PM, Torrey McMahon wrote:

 On 1/30/2011 5:26 PM, Joerg Schilling wrote:
 Richard Elling richard.ell...@gmail.com wrote:
 
 ufsdump is the problem, not ufsrestore. If you ufsdump an active
 file system, there is no guarantee you can ufsrestore it. The only way
 to guarantee this is to keep the file system quiesced during the entire
 ufsdump.  Needless to say, this renders ufsdump useless for backup
 when the file system also needs to accommodate writes.
 This is why there is a ufs snapshot utility.
 
 You'll have the same problem. fssnap_ufs(1M) write locks the file system when 
 you run the lock command. See the notes section of the man page.
 
 http://download.oracle.com/docs/cd/E19253-01/816-5166/6mbb1kq1p/index.html#Notes
 
 


Re: [zfs-discuss] Best choice - file system for system

2011-01-31 Thread Mark Sandrock
IIRC, we would notify the user community that the FS'es were going to hang
briefly.

Locking the FS'es is the best way to quiesce them, when users are worldwide, IMO.

Mark

On Jan 31, 2011, at 9:45 AM, Torrey McMahon wrote:

 A matter of seconds is a long time for a running Oracle database. The point 
 is that if you have to keep writing to a UFS filesystem - when the file 
 system also needs to accommodate writes - you're still out of luck. If you 
 can quiesce the apps, great, but if you can't then you're still stuck.  In 
 other words, fssnap_ufs doesn't solve the quiesce problem.
 
 On 1/31/2011 10:24 AM, Mark Sandrock wrote:
 Why do you say fssnap has the same problem?
 
 If it write locks the file system, it is only for a matter of seconds, as I 
 recall.
 
 Years ago, I used it on a daily basis to do ufsdumps of large fs'es.
 
 Mark
 
 On Jan 30, 2011, at 5:41 PM, Torrey McMahon wrote:
 
 On 1/30/2011 5:26 PM, Joerg Schilling wrote:
 Richard Elling richard.ell...@gmail.com wrote:
 
 ufsdump is the problem, not ufsrestore. If you ufsdump an active
 file system, there is no guarantee you can ufsrestore it. The only way
 to guarantee this is to keep the file system quiesced during the entire
 ufsdump.  Needless to say, this renders ufsdump useless for backup
 when the file system also needs to accommodate writes.
 This is why there is a ufs snapshot utility.
 You'll have the same problem. fssnap_ufs(1M) write locks the file system 
 when you run the lock command. See the notes section of the man page.
 
 http://download.oracle.com/docs/cd/E19253-01/816-5166/6mbb1kq1p/index.html#Notes



Re: [zfs-discuss] A few questions

2010-12-20 Thread Mark Sandrock

On Dec 18, 2010, at 12:23 PM, Lanky Doodle wrote:

 Now this is getting really complex, but can you have server failover in ZFS, 
 much like DFS-R in Windows - you point clients to a clustered ZFS namespace 
 so if a complete server failed nothing is interrupted.

This is the purpose of an Amber Road dual-head cluster (7310C, 7410C, etc.) --
not only does the storage pool fail over, but the server IP address fails over
as well, so that NFS and other shares remain active when one storage head goes down.

Amber Road uses ZFS, but the clustering and failover are not related to the
filesystem type.

Mark


Re: [zfs-discuss] A few questions

2010-12-20 Thread Mark Sandrock
Erik,

just a hypothetical what-if ...

In the case of resilvering on a mirrored disk, why not take a snapshot, and then
resilver by doing a pure block copy from the snapshot? It would be sequential,
so long as the original data was unmodified, with random access needed only for
the modified blocks, right?

After the original snapshot had been replicated, a second pass would be done,
in order to update the clone to 100% live data.

Not knowing enough about the inner workings of ZFS snapshots, I don't know why
this would not be doable. (I'm biased towards mirrors for busy filesystems.)

I'm supposing that a block-level snapshot is not doable -- or is it?

Mark

On Dec 20, 2010, at 1:27 PM, Erik Trimble wrote:

 On 12/20/2010 9:20 AM, Saxon, Will wrote:
 -Original Message-
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
 Sent: Monday, December 20, 2010 11:46 AM
 To: 'Lanky Doodle'; zfs-discuss@opensolaris.org
 Subject: Re: [zfs-discuss] A few questions
 
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Lanky Doodle
 
 I believe Oracle is aware of the problem, but most of
 the core ZFS team has left. And of course, a fix for
 Oracle Solaris no longer means a fix for the rest of
 us.
 OK, that is a bit concerning then. As good as ZFS may be, I'm not sure I want
 to commit to a file system that is 'broken' and may not be fully fixed,
 if at all.
 
 ZFS is not broken.  It is, however, a weak spot, that resilver is very
 inefficient.  For example:
 
 On my server, which is made up of 10krpm SATA drives, 1TB each...  My
 drives
 can each sustain 1Gbit/sec sequential read/write.  This means, if I needed
 to resilver the entire drive (in a mirror) sequentially, it would take ...
 8,000 sec = 133 minutes.  About 2 hours.  In reality, I have ZFS mirrors,
 and disks are around 70% full, and resilver takes 12-14 hours.
 
 So although resilver is broken by some standards, it is bounded, and you
 can limit it to something which is survivable, by using mirrors instead of
 raidz.  For most people, even using 5-disk or 7-disk raidzN will still be
 fine.  But you start getting unsustainable if you get up to 21-disk raidz3,
 for example.
 This argument keeps coming up on the list, but I don't see where anyone has 
 made a good suggestion about whether this can even be 'fixed' or how it 
 would be done.
 
 As I understand it, you have two basic types of array reconstruction: in a 
 mirror you can make a block-by-block copy and that's easy, but in a parity 
 array you have to perform a calculation on the existing data and/or existing 
 parity to reconstruct the missing piece. This is pretty easy when you can 
 guarantee that all your stripes are the same width, start/end on the same 
 sectors/boundaries/whatever and thus know a piece of them lives on all 
 drives in the set. I don't think this is possible with ZFS since we have 
 variable stripe width. A failed disk d may or may not contain data from 
 stripe s (or transaction t). This information has to be discovered by 
 looking at the transaction records. Right?
 
 Can someone speculate as to how you could rebuild a variable stripe width 
 array without replaying all the available transactions? I am no filesystem 
 engineer but I can't wrap my head around how this could be handled any 
 better than it already is. I've read that resilvering is throttled - 
 presumably to keep performance degradation to a minimum during the process - 
 maybe this could be a tunable (e.g. priority: low, normal, high)?
 
 Do we know if resilvers on a mirror are actually handled differently from 
 those on a raidz?
 
 Sorry if this has already been explained. I think this is an issue that 
 everyone who uses ZFS should understand completely before jumping in, 
 because the behavior (while not 'wrong') is clearly NOT the same as with 
 more conventional arrays.
 
 -Will
 the problem is NOT the checksum/error correction overhead. that's 
 relatively trivial.  The problem isn't really even variable width (i.e. 
 variable number of disks one crosses) slabs.
 
 The problem boils down to this:
 
 When ZFS does a resilver, it walks the METADATA tree to determine what order 
 to rebuild things from. That means, it resilvers the very first slab ever 
 written, then the next oldest, etc.   The problem here is that slab age has 
 nothing to do with where that data physically resides on the actual disks. If 
 you've used the zpool as a WORM device, then, sure, there should be a strict 
 correlation between increasing slab age and locality on the disk.  However, 
 in any reasonable case, files get deleted regularly. This means that there is a
 good chance that a slab B, written immediately after slab A, WON'T be
 physically near slab A.
 
 In the end, the problem is that using metadata order, while reducing the 
 total amount of work to do in the resilver (as 

Re: [zfs-discuss] A few questions

2010-12-20 Thread Mark Sandrock

On Dec 20, 2010, at 2:05 PM, Erik Trimble wrote:

 On 12/20/2010 11:56 AM, Mark Sandrock wrote:
 Erik,
 
  just a hypothetical what-if ...
 
 In the case of resilvering on a mirrored disk, why not take a snapshot, and 
 then
 resilver by doing a pure block copy from the snapshot? It would be 
 sequential,
 so long as the original data was unmodified; and random access in dealing 
 with
 the modified blocks only, right.
 
 After the original snapshot had been replicated, a second pass would be done,
 in order to update the clone to 100% live data.
 
 Not knowing enough about the inner workings of ZFS snapshots, I don't know 
 why
 this would not be doable. (I'm biased towards mirrors for busy filesystems.)
 
 I'm supposing that a block-level snapshot is not doable -- or is it?
 
 Mark
 Snapshots on ZFS are true snapshots - they take a picture of the current 
 state of the system. They DON'T copy any data around when created. So, a ZFS 
 snapshot would be just as fragmented as the ZFS filesystem was at the time.

But if one does a raw (block) copy, there isn't any fragmentation -- except for 
the COW updates.

If there were no updates to the snapshot, then it becomes a 100% sequential 
block copy operation.

But even with COW updates, presumably the large majority of the copy would 
still be sequential i/o.

Maybe for the 2nd pass, the filesystem would have to be locked, so that the
operation would actually complete; but if that pass is fairly short in relation
to the overall resilvering time, then it could still be a win in many cases.

I'm probably not explaining it well, and may be way off, but it seemed an 
interesting notion.

Mark

 
 
 The problem is this:
 
 Let's say I write block A, B, C, and D on a clean zpool (what kind, it 
 doesn't matter).  I now delete block C.  Later on, I write block E.   There 
 is a probability (increasing dramatically as times goes on), that the on-disk 
 layout will now look like:
 
 A, B, E, D
 
 rather than
 
 A, B, [space], D, E
 
 
 So, in the first case, I can do a sequential read to get A & B, but then must
 do a seek to get D, and a seek to get E.
 
 The fragmentation problem is mainly due to file deletion, NOT to file 
 re-writing.  (though, in ZFS, being a C-O-W filesystem, re-writing generally 
 looks like a delete-then-write process, rather than a modify process).
 
 
 -- 
 Erik Trimble
 Java System Support
 Mailstop:  usca22-123
 Phone:  x17195
 Santa Clara, CA
 Timezone: US/Pacific (GMT-0800)
 



Re: [zfs-discuss] A few questions

2010-12-20 Thread Mark Sandrock
It may well be that different methods are optimal for different use cases.

Mechanical disk vs. SSD; mirrored vs. raidz[123]; sparse vs. populated; etc.

It would be interesting to read more in this area, if papers are available.

I'll have to take a look. ... Or does someone have pointers?

Mark


On Dec 20, 2010, at 6:28 PM, Edward Ned Harvey wrote:

 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Erik Trimble
 
 In the case of resilvering on a mirrored disk, why not take a snapshot,
 and
 then
 resilver by doing a pure block copy from the snapshot? It would be
 sequential,
 
 So, a
 ZFS snapshot would be just as fragmented as the ZFS filesystem was at
 the time.
 
 I think Mark was suggesting something like dd copy device 1 onto device 2,
 in order to guarantee a first-pass sequential resilver.  And my response
 would be:  Creative thinking and suggestions are always a good thing.  In
 fact, the above suggestion is already faster than the present-day solution
 for what I'm calling typical usage, but there are an awful lot of use
 cases where the dd solution would be worse... Such as a pool which is
 largely sequential already, or largely empty, or made of high IOPS devices
 such as SSD.  However, there is a desire to avoid resilvering unused blocks,
 so I hope a better solution is possible... 
 
 The fundamental requirement for a better optimized solution would be a way
 to resilver according to disk ordering...  And it's just a question for
 somebody that actually knows the answer ... How terrible is the idea of
 figuring out the on-disk order?
 



Re: [zfs-discuss] Excruciatingly slow resilvering on X4540 (build 134)

2010-11-15 Thread Mark Sandrock

On Nov 2, 2010, at 12:10 AM, Ian Collins wrote:

 On 11/ 2/10 08:33 AM, Mark Sandrock wrote:
 
 
 I'm working with someone who replaced a failed 1TB drive (50% utilized),
 on an X4540 running OS build 134, and I think something must be wrong.
 
 Last Tuesday afternoon, zpool status reported:
 
 scrub: resilver in progress for 306h0m, 63.87% done, 173h7m to go
 
 and a week being 168 hours, that put completion at sometime tomorrow night.
 
 However, he just reported zpool status shows:
 
 scrub: resilver in progress for 447h26m, 65.07% done, 240h10m to go
 
 so it's looking more like 2011 now. That can't be right.

 
 
 How is the pool configured?

Both 10- and 12-disk RAIDZ-2. That, plus too much other I/O,
must be the problem. I'm thinking 5 x (7-2) would be better,
assuming he doesn't want to go RAID-10.
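If it helps, something like this (pool name is a placeholder) would show how
much competing traffic the pool sees while it resilvers:

    # watch per-vdev activity every 5 seconds during the resilver
    zpool iostat -v tank 5

If the non-resilver workload dominates, throttling or rescheduling it should
help considerably.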

Thanks much for all the helpful replies.

Mark
 
 I look after a very busy x5400 with 500G drives configured as 8 drive raidz2 
 and these take about 100 hours to resilver.  The workload on this box is 
 probably worst case for resivering, it receives a steady stream of snapshots.
 
 -- 
 Ian.
 


Re: [zfs-discuss] Faster than 1G Ether... ESX to ZFS

2010-11-15 Thread Mark Sandrock
Edward,

I recently installed a 7410 cluster, which had Fibre Channel HBAs added.

I know the site also has Blade 6000s running VMware, but I have no idea whether
they were planning to run fiber to those blades (or even had the option to do so).

But perhaps FC would be an option for you?

Mark

On Nov 12, 2010, at 9:03 AM, Edward Ned Harvey wrote:

 Since combining ZFS storage backend, via nfs or iscsi, with ESXi heads, I’m
 in love.  But for one thing.  The interconnect between the head & storage.
  
 1G Ether is so cheap, but not as fast as desired.  10G ether is fast enough, 
 but it’s overkill and why is it so bloody expensive?  Why is there nothing in 
 between?  Is there something in between?  Is there a better option?  I mean … 
 sata is cheap, and it’s 3g or 6g, but it’s not suitable for this purpose.  
 But the point remains, there isn’t a fundamental limitation that *requires* 
 10G to be expensive, or *requires* a leap directly from 1G to 10G.  I would 
 very much like to find a solution which is a good fit… to attach ZFS storage 
 to vmware.
  
 What are people using, as interconnect, to use ZFS storage on ESX(i)?
  
 Any suggestions?
  


[zfs-discuss] Excruciatingly slow resilvering on X4540 (build 134)

2010-11-01 Thread Mark Sandrock
Hello,

I'm working with someone who replaced a failed 1TB drive (50% utilized),
on an X4540 running OS build 134, and I think something must be wrong.

Last Tuesday afternoon, zpool status reported:

scrub: resilver in progress for 306h0m, 63.87% done, 173h7m to go

and a week being 168 hours, that put completion at sometime tomorrow night.

However, he just reported zpool status shows:

scrub: resilver in progress for 447h26m, 65.07% done, 240h10m to go

so it's looking more like 2011 now. That can't be right.

I'm hoping for a suggestion or two on this issue.

I'd search the archives, but they don't seem searchable. Or am I wrong about 
that?

Thanks.
Mark (subscription pending)

