Re: [zfs-discuss] ZFS + DB + fragments

2007-11-20 Thread Ross
In that case, this may be a much tougher nut to crack than I thought.

I'll be the first to admit that other than having seen a few presentations I 
don't have a clue about the details of how ZFS works under the hood, however...

You mention that moving the old block means updating all its ancestors.  I had 
naively assumed moving a block would be relatively simple, and wouldn't also 
require updating all the ancestors.

My understanding of ZFS (in short: an upside down tree) is that each block is 
referenced by its parent.  So regardless of how many snapshots you take, each 
block is only ever referenced by one other, and I'm guessing that the pointer 
and checksum are both stored there.

If that's the case, to move a block it's just a case of:
 - read the data
 - write to the new location
 - update the pointer in the parent block

Please let me know if I'm mis-understanding ZFS here.

The major problem with this is that I don't know if there's any easy way to 
identify the parent block from the child, or an efficient way to do this move.  
However, thinking about it, there must be.  ZFS intelligently moves data if it 
detects corruption, so there must already be tools in place to do exactly what 
we need here.

In which case, this is still relatively simple and much of the code already 
exists.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [desktop-discuss] ZFS snapshot GUI

2007-11-20 Thread Henry Zhang


Christian Kelly wrote:
 Hi Calum,
 
 heh, as it happens, I was tinkering with pygtk to see how difficult this 
 would be :)
 
 Supposing I have a ZFS on my machine called root/export/home which is 
 mounted on /export/home. Then I have my home dir as /export/home/chris. 
 Say I only want to snapshot and backup /export/home/chris/Documents. I 
 can't create a snapshot of /export/home/chris/Documents as it is a 
 directory, I have to create a snapshot of the parent ZFS, in this case 
 /export/home/. So there isn't really the granularity that the attached 
 spec implies. Someone correct me if I'm wrong, but I just tried it and 
 it didn't work.
 
 I've had a bit of a look at 'Time Machine' and I'd be more in favour of 
 that style of backup. Just back up everything so I don't have to think 
 about it. My feeling is that picking individual directories out just 
 causes confusion. Think of it this way: how much change is there on a 
 daily basis on your desktop/laptop? Those snapshots aren't going to grow 
 very quickly.
Time Machine stores everything in the system by default, but you can still 
deselect the items that you don't want it to store. And Time Machine doesn't 
use ZFS.
Here we will use ZFS snapshots, and what they work with is the file 
system. In Nevada the default file system is not ZFS, which means some 
directories are not ZFS, so it seems you have to select specific directories 
that are ZFS, and it's impossible for you to store everything (some are not ZFS)...
 
 -Christian
 
 
 
 Calum Benson wrote:
 Hi all,

 We've been thinking a little about a more integrated desktop presence  
 for Tim Foster's ZFS backup and snapshot services[1].  Here are some  
 initial ideas about what a Phase 0 (snapshot only, not backup) user  
 experience might look like... comments welcome.

 http://www.genunix.org/wiki/index.php/ZFS_Snapshot

 (I'm not subscribed to zfs-discuss, so please make sure either  
 desktop-discuss or I remain cc'ed on any replies if you want me to  
 see them...)

 Cheeri,
 Calum.

 [1] http://blogs.sun.com/timf/entry/zfs_automatic_for_the_people

   
 
 ___
 desktop-discuss mailing list
 [EMAIL PROTECTED]

-- 
Henry Zhang
JDS Software Development, OPG

Sun China Engineering & Research Institute
Sun Microsystems, Inc.
10/F Chuang Xin Plaza, Tsinghua Science Park
Beijing 100084, P.R. China
Tel: +86 10 62673866
Fax: +86 10 62780969
eMail: [EMAIL PROTECTED]
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [desktop-discuss] ZFS snapshot GUI

2007-11-20 Thread Christian Kelly

 Time Machine stores everything in the system by default, but you can still 
 deselect the items that you don't want it to store. And Time Machine doesn't 
 use ZFS.
 Here we will use ZFS snapshots, and what they work with is the file 
 system. In Nevada the default file system is not ZFS, which means some 
 directories are not ZFS, so it seems you have to select specific directories 
 that are ZFS, and it's impossible for you to store everything (some are not ZFS)...
   

What I'm suggesting is that the configuration presents a list of pools 
and their ZFSes and that you have a checkbox, backup/don't backup sort 
of an option. When you start having nested ZFSes it could get confusing 
as to what you are actually backing up if you start browsing down 
through the filesystem with the likes of nautilus.

Take the example I gave before, where you have a pool called, say, 
pool1. In the pool you have two ZFSes: pool1/export and 
pool1/export/home. So, suppose the user chooses /export in nautilus and 
adds this to the backup list. Will the user be aware, from browsing 
through nautilus, that /export/home may or may not be backed up, 
depending on whether the -r (?) option is used? I guess what I'm saying 
is, how aware of the behavior of ZFS must the user be in order to use 
this backup system?
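
For what it's worth, the difference being described can be seen directly from 
the command line; the dataset names below are just the ones from this example, 
and the snapshot names are made up:

    # Snapshot only pool1/export itself; pool1/export/home is NOT included.
    zfs snapshot pool1/export@nonrecursive

    # Snapshot pool1/export and every descendant dataset
    # (pool1/export/home included) in one go.
    zfs snapshot -r pool1/export@recursive

    # Compare which snapshots actually exist.
    zfs list -t snapshot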

-Christian
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS + DB + fragments

2007-11-20 Thread can you guess?
...

 My understanding of ZFS (in short: an upside down
 tree) is that each block is referenced by it's
 parent. So regardless of how many snapshots you take,
 each block is only ever referenced by one other, and
 I'm guessing that the pointer and checksum are both
 stored there.
 
 If that's the case, to move a block it's just a case
 of:
 - read the data
 - write to the new location
 - update the pointer in the parent block

Which changes the contents of the parent block (the change in the data checksum 
changes it as well), and thus requires that this parent also be rewritten 
(using COW), which changes the pointer to it (and of course its checksum as 
well) in *its* parent block, which thus also must be re-written... and finally 
a new copy of the superblock is written to reflect the new underlying tree 
structure - all this in a single batch-written 'transaction'.

The old version of each of these blocks need only be *saved* if a snapshot 
exists and it hasn't previously been updated since that snapshot was created.  
But all the blocks need to be COWed even if no snapshot exists (in which case 
the old versions are simply discarded).

...
 
 PS.
 
 1. You'd still need an initial defragmentation pass
 to ensure that the file was reasonably piece-wise
 contiguous to begin with.
 
 No, not necessarily.  If you were using a zpool
 configured like this I'd hope you were planning on
 creating the file as a contiguous block in the first
 place :)

I'm not certain that you could ensure this if other updates in the system were 
occurring concurrently.  Furthermore, the file may be extended dynamically as 
new data is inserted, and you'd like to have some mechanism that could restore 
reasonable contiguity to the result (which can be difficult to accomplish in 
the foreground if, for example, free space doesn't happen to exist on the disk 
right after the existing portion of the file).

...
 
 Any zpool with this option would probably be
 dedicated to the database file and nothing else.  In
 fact, even with multiple databases I think I'd have a
 single pool per database.

It's nice if you can afford such dedicated resources, but it seems a bit 
cavalier to ignore users who just want decent performance from a database that 
has to share its resources with other activity.

Your prompt response is probably what prevented me from editing my previous 
post after I re-read it and realized I had overlooked the fact that 
over-writing the old data complicates things.  So I'll just post the revised 
portion here:


3.  Now you must make the above transaction persistent, and then randomly 
over-write the old data block with the new data (since that data must be in 
place before you update the path to it below, and unfortunately since its 
location is not arbitrary you can't combine this update with either the 
transaction above or the transaction below).

4.  You can't just slide in the new version of the block using the old 
version's existing set of ancestors because a) you just deallocated that path 
above (introducing additional mechanism to preserve it temporarily almost 
certainly would not be wise), b) the data block checksum changed, and c) in any 
event this new path should be *newer* than the path to the old version's new 
location that you just had to establish (if a snapshot exists, that's the path 
that should be propagated to it by the COW mechanism).  However, this is just 
the normal situation whenever you update a data block (save for the fact that 
the block itself was already written above):  all the *additional* overhead 
occurred in the previous steps.

So instead of a single full-path update that fragments the file, you have two 
full-path updates, a random write, and possibly a random read initially to 
fetch the old data.  And you still need an initial defrag pass to establish 
initial contiguity.  Furthermore, these additional resources are consumed at 
normal rather than the reduced priority at which a background reorg can 
operate.  On the plus side, though, the file would be kept contiguous all the 
time rather than just returned to contiguity whenever there was time to do so.

...

 Taking it a stage further, I wonder if this would
 work well with the prioritized write feature request
 (caching writes to a solid state disk)?
  http://www.genunix.org/wiki/index.php/OpenSolaris_Storage_Developer_Wish_List
 
 That could potentially mean there's very little
 slowdown:
  - Read the original block
 - Save that to solid state disk
  - Write the new block in the original location
 - Periodically stream writes from the solid state
 disk to the main storage

I'm not sure this would confer much benefit if things in fact need to be 
handled as I described above.  In particular, if a snapshot exists you almost 
certainly must establish the old version in its new location in the snapshot 
rather than just capture it in the log; if no snapshot exists you could capture 
the old version in the log and 

Re: [zfs-discuss] [desktop-discuss] ZFS snapshot GUI

2007-11-20 Thread Tim Foster
On Tue, 2007-11-20 at 13:35 +, Christian Kelly wrote:
 What I'm suggesting is that the configuration presents a list of pools 
 and their ZFSes and that you have a checkbox, backup/don't backup sort 
 of an option.

That's basically the (hacked-up) zenity GUI I have at the moment on my
blog - download & install the packages and you'll see. I think getting
that into a proper tree structure would help? Right now, there's a bug in my
gui, such that with:

 [X] tank
 [ ] tank/timf
 [ ] tank/timf/Documents
 [ ] tank/timf/Music

Selecting tank implicitly marks the other filesystems for backup
because of the way zfs properties inherit. (load the above gui again
having just selected tank, and you'll see the other filesystems being
selected for you)
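
(For anyone wondering why that happens: ZFS property inheritance means a value 
set on a parent dataset applies to every descendant unless it's overridden 
locally. A quick way to see the effect - using the standard compression 
property purely as a stand-in for whatever property the snapshot service 
actually sets, and my example datasets from above - would be:)

    zfs set compression=on tank
    zfs get -r compression tank
    # NAME                 PROPERTY     VALUE  SOURCE
    # tank                 compression  on     local
    # tank/timf            compression  on     inherited from tank
    # tank/timf/Documents  compression  on     inherited from tank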


Having said that, I like Calum's ideas - and am happy to defer the
decision about the gui to someone a lot more qualified than I in this
area :-)

I think that when browsing directories in nautilus, it would be good to
have some sort of backup or snapshot icon (à la the little padlock in
secure web-browsing sessions) to let you know that this directory is
either being backed up and/or included in snapshots.

cheers,
tim

-- 
Tim Foster, Sun Microsystems Inc, Solaris Engineering Ops
http://blogs.sun.com/timf

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS + DB + fragments

2007-11-20 Thread Ross
Hmm... that's a pain if updating the parent also means updating the parent's 
checksum.  I guess the functionality is there for moving bad blocks, but 
since that's likely to be a rare occurrence, it wasn't something that would need 
to be particularly efficient.

With regard to sharing the disk resources with other programs, obviously it's 
down to the individual admins how they would configure this, but I would 
suggest that if you have a database with heavy enough requirements to be 
suffering noticeable read performance issues due to fragmentation, then that 
database really should have its own dedicated drives and shouldn't be 
competing with other programs.

I'm not saying defrag is bad (it may be the better solution here), just that if 
you're looking at performance in this kind of depth, you're probably 
experienced enough to have created the database in a contiguous chunk in the 
first place :-)

I do agree that doing these writes now sounds like a lot of work.  I'm guessing 
that needing two full-path updates to achieve this means you're talking about a 
much greater write penalty.  And that means you can probably expect significant 
read penalty if you have any significant volume of writes at all, which would 
rather defeat the point.  After all, if you have a low enough amount of writes 
to not suffer from this penalty, your database isn't going to be particularly 
fragmented.

However, I'm now in over my depth.  This needs somebody who knows the internal 
architecture of ZFS to decide whether it's feasible or desirable, and whether 
defrag is a good enough workaround.

It may be that ZFS is not a good fit for this kind of use, and that if you're 
really concerned about this kind of performance you should be looking at other 
file systems.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [desktop-discuss] ZFS snapshot GUI

2007-11-20 Thread Calum Benson

On 20 Nov 2007, at 12:56, Christian Kelly wrote:

 Hi Calum,

 heh, as it happens, I was tinkering with pygtk to see how difficult  
 this would be :)

 Supposing I have a ZFS on my machine called root/export/home which  
 is mounted on /export/home. Then I have my home dir as /export/home/ 
 chris. Say I only want to snapshot and backup /export/home/chris/ 
 Documents. I can't create a snapshot of /export/home/chris/ 
 Documents as it is a directory, I have to create a snapshot of the  
 parent ZFS, in this case /export/home/. So there isn't really the  
 granularity that the attached spec implies. Someone correct me if  
 I'm wrong, but I just tried it and it didn't work.

Right, for Phase 0 the thinking was that you'd really have to  
manually set up whatever pools and filesystems you required first.   
So in your example, you (or, perhaps, the Indiana installer) would  
have had to set up /export/home/chris/Documents as a ZFS filesystem  
in its own right before you could start taking snapshots of it.
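
For example (a sketch only, with a made-up snapshot name, and glossing over  
the step of moving the existing directory contents aside and back):

    # Make Documents a filesystem in its own right, creating the
    # intermediate "chris" dataset as well if it doesn't exist yet.
    zfs create -p root/export/home/chris/Documents

    # From then on it can be snapshotted independently of /export/home.
    zfs snapshot root/export/home/chris/Documents@phase0-test
    zfs list -t snapshot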

Were we to stick with this general design, in later phases, creating  
a new ZFS filesystem on the fly, and migrating the contents of the  
 existing folder into it, would hopefully happen behind the scenes  
when you selected that folder to be backed up.  (That could  
presumably be quite a long operation, though, for folders with large  
contents.)

 I've had a bit of a look at 'Time Machine' and I'd be more in  
 favour of that style of backup. Just back up everything so I don't  
 have to think about it. My feeling is that picking individual  
 directories out just causes confusion. Think of it this way: how  
 much change is there on a daily basis on your desktop/laptop? Those  
 snapshots aren't going to grow very quickly.

I have no problem looking at it from that angle if it turns out  
that's what people want-- much of the UI would be fairly similar.   
But at the same time, I don't necessarily always expect OSX users'  
requirements to be the same as Solaris users' requirements-- I'd  
especially like to hear from people who are already using Tim's  
snapshot and backup services, to find out how they use it and what  
their needs are.

Cheeri,
Calum.

-- 
CALUM BENSON, Usability Engineer   Sun Microsystems Ireland
mailto:[EMAIL PROTECTED]GNOME Desktop Team
http://blogs.sun.com/calum +353 1 819 9771

Any opinions are personal and not necessarily those of Sun Microsystems


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [desktop-discuss] ZFS snapshot GUI

2007-11-20 Thread Christian Kelly
Calum Benson wrote:
 Right, for Phase 0 the thinking was that you'd really have to manually 
 set up whatever pools and filesystems you required first.  So in your 
 example, you (or, perhaps, the Indiana installer) would have had to 
 set up /export/home/chris/Documents as a ZFS filesystem in its own 
 right before you could start taking snapshots of it.

 Were we to stick with this general design, in later phases, creating a 
 new ZFS filesystem on the fly, and migrating the contents of the 
 existing folder into it, would hopefully happen behind the scenes 
 when you selected that folder to be backed up.  (That could presumably 
 be quite a long operation, though, for folders with large contents.)


Ah, I see. So, for phase 0, the 'Enable Automatic Snapshots' option 
would only be available for/work for existing ZFSes. Then at some later 
stage, create them on the fly.

 I have no problem looking at it from that angle if it turns out that's 
 what people want-- much of the UI would be fairly similar.  But at the 
 same time, I don't necessarily always expect OSX users' requirements 
 to be the same as Solaris users' requirements-- I'd especially like to 
 hear from people who are already using Tim's snapshot and backup 
 services, to find out how they use it and what their needs are.

Yes, absolutely, OSX users' requirements probably vary wildly from those 
of Solaris users. I guess I fall into what we might call the 'lazy' 
category of user ;) I'm aware of Tim's tool, don't use it though.

-Christian
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raidz DEGRADED state

2007-11-20 Thread MC
 So there is no current way to specify the creation of
 a 3 disk raid-z
 array with a known missing disk?

Can someone answer that?  Or does the zpool command NOT accommodate the 
creation of a degraded raidz array?
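
The closest thing I've seen suggested is to stand in a sparse file for the 
missing disk and then offline it - something like the following, with made-up 
device names and sizes - but I'd like to hear whether that's considered sane:

    # Sparse file the same size as the real disks (uses no space up front).
    mkfile -n 500g /var/tmp/fakedisk

    # Create the 3-way raidz with the file standing in for the missing disk.
    zpool create tank raidz c0t0d0 c0t1d0 /var/tmp/fakedisk

    # Offline the stand-in; the pool runs DEGRADED until the real disk
    # shows up and is swapped in.
    zpool offline tank /var/tmp/fakedisk
    zpool replace tank /var/tmp/fakedisk c0t2d0    # later, when the disk arrives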
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS + DB + fragments

2007-11-20 Thread can you guess?
...

 With regards sharing the disk resources with other
 programs, obviously it's down to the individual
 admins how they would configure this,

Only if they have an unconstrained budget.

 but I would
 suggest that if you have a database with heavy enough
 requirements to be suffering noticable read
 performance issues due to fragmentation, then that
 database really should have it's own dedicated drives
 and shouldn't be competing with other programs.

You're not looking at it from a whole-system viewpoint (which if you're 
accustomed to having your own dedicated storage devices is understandable).

Even if your database performance is acceptable, if it's performing 50x as many 
disk seeks as it would otherwise need when scanning a table, that's affecting 
the performance of *other* applications.

 
 Also, I'm not saying defrag is bad (it may be the
 better solution here), just that if you're looking at
 performance in this kind of depth, you're probably
 experienced enough to have created the database in a
 contiguous chunk in the first place :-)

As I noted, ZFS may not allow you to ensure that, and in any event, if the 
database grows, that contiguity may need to be reestablished.  You could grow 
the db in separate files, each of which was preallocated in full (though again 
ZFS may not allow you to ensure that each is created contiguously on disk).  
But while databases may include such facilities as a matter of course, it would 
still (all other things being equal) be easier to manage everything if the 
database could just extend a single existing file (or one file per table, if 
they needed to be kept separate) as it needed additional space.
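
(By preallocation I mean something as crude as writing the whole file once up 
front - the file name and size below are arbitrary - with the caveat, again, 
that ZFS makes no promise about where those blocks land on disk:)

    # Stream 10 GB of zeros through the file so every block gets written
    # once, sequentially, before the database starts using it.
    dd if=/dev/zero of=/tank/db/datafile01 bs=1024k count=10240

    # Two caveats: a sparse file (e.g. mkfile -n) would not allocate
    # anything, and with compression enabled on the dataset the
    # zero-filled blocks may not be stored as real blocks either.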

 
 I do agree that doing these writes now sounds like a
 lot of work.  I'm guessing that needing two full-path
 updates to achieve this means you're talking about a
 much greater write penalty.

Not all that much.  Each full-path update is still only a single write request 
to the disk, since all the path blocks (again, possibly excepting the 
superblock) are batch-written together, thus mostly increasing only streaming 
bandwidth consumption.

...

 It may be that ZFS is not a good fit for this kind of
 use, and that if you're really concerned about this
 kind of performance you should be looking at other
 file systems.

I suspect that while it may not be a great fit now with relatively minor 
changes it could be at least an acceptable one.

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [desktop-discuss] ZFS snapshot GUI

2007-11-20 Thread Calum Benson

On 20 Nov 2007, at 13:35, Christian Kelly wrote:

 Take the example I gave before, where you have a pool called, say,  
 pool1. In the pool you have two ZFSes: pool1/export and pool1/ 
 export/home. So, suppose the user chooses /export in nautilus and  
 adds this to the backup list. Will the user be aware, from browsing  
 through nautilus, that /export/home may or may not be backed up -  
 depending on whether the -r (?) option is used.

I'd consider that to be a fairly strong requirement, but it's not  
something I particularly thought through for the mockups.

One solution might be to change the nautilus background for folders  
that are being backed up, another might be an indicator in the status  
bar, another might be emblems on the folder icons themselves.  Which  
approach works best would probably depend on whether we expect most  
of the folders people are browsing regularly to be backed up, or not  
backed up-- in general, you'd want any sort of indicator to show the  
less common state.

Cheeri,
Calum.

-- 
CALUM BENSON, Usability Engineer   Sun Microsystems Ireland
mailto:[EMAIL PROTECTED]GNOME Desktop Team
http://blogs.sun.com/calum +353 1 819 9771

Any opinions are personal and not necessarily those of Sun Microsystems


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS + DB + fragments

2007-11-20 Thread Moore, Joe
Louwtjie Burger wrote:
 Richard Elling wrote:
 
  - COW probably makes that conflict worse
  
  
 
  This needs to be proven with a reproducible, real-world 
 workload before it
  makes sense to try to solve it.  After all, if we cannot 
 measure where
  we are,
  how can we prove that we've improved?
 
 I agree, let's first find a reproducible example where updates
 negatively impacts large table scans ... one that is rather simple (if
 there is one) to reproduce and then work from there.

I'd say it would be possible to define a reproducible workload that
demonstrates this using the Filebench tool... I haven't worked with it
much (maybe over the holidays I'll be able to do this), but I think a
workload like:

1) create a large file (bigger than main memory) on an empty ZFS pool.
2) time a sequential scan of the file
3) random write i/o over say, 50% of the file (either with or without
matching blocksize)
4) time a sequential scan of the file

The difference between times 2 and 4 is the penalty that COW block
reordering (which may introduce seemingly-random seeks between
sequential blocks) imposes on the system.

It would be interesting to watch seeksize.d's output during this run
too.
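
In the meantime, a crude approximation of steps 1-4 can be thrown together
with dd alone (pool and file names are invented, the loop uses ksh/bash
arithmetic, and it obviously lacks Filebench's rigour):

    # 1) Create a large file on an empty pool (32 GB here, > main memory).
    dd if=/dev/zero of=/testpool/bigfile bs=1024k count=32768

    # 2) Time a sequential scan (export/import the pool first if you want
    #    to be sure the ARC isn't serving the reads).
    ptime dd if=/testpool/bigfile of=/dev/null bs=1024k

    # 3) Rewrite roughly 50% of the file in place: every other 1 MB block,
    #    without truncating the file, so COW relocates those blocks.
    i=0
    while [ $i -lt 32768 ]; do
        dd if=/dev/urandom of=/testpool/bigfile bs=1024k count=1 \
            seek=$i conv=notrunc 2>/dev/null
        i=$((i + 2))
    done

    # 4) Time the sequential scan again and compare with step 2.
    ptime dd if=/testpool/bigfile of=/dev/null bs=1024k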

--Joe

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS + DB + fragments

2007-11-20 Thread Ross
  doing these writes now sounds like a
  lot of work.  I'm guessing that needing two full-path
  updates to achieve this means you're talking about a
  much greater write penalty.
 
 Not all that much.  Each full-path update is still
 only a single write request to the disk, since all
 the path blocks (again, possibly excepting the
 superblock) are batch-written together, thus mostly
 increasing only streaming bandwidth consumption.

Ok, that took some thinking about.  I'm pretty new to ZFS, so I've only just 
gotten my head around how CoW works, and I'm not used to thinking about files 
at this kind of level.  I'd not considered that path blocks would be 
batch-written close together, but of course that makes sense.

What I'd been thinking was that ordinarily files would get fragmented as they 
age, which would make these updates slower as blocks would be scattered over 
the disk, so a full-path update would take some time.  I'd forgotten that the 
whole point of doing this is to prevent fragmentation...

So a nice side effect of this approach is that if you use it, it makes itself 
more efficient :D
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [storage-discuss] zpool io to 6140 is really slow

2007-11-20 Thread Asif Iqbal
On Nov 20, 2007 1:48 AM, Louwtjie Burger [EMAIL PROTECTED] wrote:
 
  That is still 256MB/s . I am getting about 194MB/s

 No, I don't think you can take 2Gbit / 8 bits per byte and say 256MB/s is
 what you should get...
 Someone with far more FC knowledge can comment here.  There must be
 some overhead in transporting data (as with regular SCSI) ... in the
 same way Ultra320 SCSI never yields close to 320 MB/s ... even
 though it might seem so.

  Adding a second loop by adding another non active port I may have to 
  rebuild the
  FS, no?

 No. Use MPXio to help you out here ... Solaris will see the same LUN's
 on each of the 2,3 or 4 ports on the primary controller ... but with
 multi-pathing switched on will only give you 1 vhci LUN to work with.

 What I would do is export the zpool(s). Hook up more links to the
 primary and enable scsi_vhci. Reboot and look for the new cX vhci
 devices.

 zpool import should rebuilt the pools from the multipath devices just fine.

 Interesting test though.

  I am getting 194MB/s.  Hmm, my 490 has 16G of memory.  I really wish I could 
  benefit some from OS and controller RAM, at least for Oracle IO

 Close to 200MB seems good from 1 x 2Gb.

Shouldn't I gain a lot of performance (I am not getting any) from the
2 x 2GB of NVRAM on my RAID controllers?


 Something else to try ... when creating hardware LUNs, one can assign
 the LUN to either controller A or B (as preferred or owner). By doing
 assignments one can use the secondary controller ... you are going to
 then stripe over controllers .. as one way of looking at it.

 PS: Is this a direct connection? Switched fabric?




-- 
Asif Iqbal
PGP Key: 0xE62693C5 KeyServer: pgp.mit.edu
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [perf-discuss] [storage-discuss] zpool io to 6140 is really slow

2007-11-20 Thread Asif Iqbal
On Nov 20, 2007 7:01 AM, Chad Mynhier [EMAIL PROTECTED] wrote:
 On 11/20/07, Asif Iqbal [EMAIL PROTECTED] wrote:
  On Nov 19, 2007 1:43 AM, Louwtjie Burger [EMAIL PROTECTED] wrote:
   On Nov 17, 2007 9:40 PM, Asif Iqbal [EMAIL PROTECTED] wrote:
(Including storage-discuss)
   
I have 6 6140s with 96 disks, 64 of which are Seagate
ST337FC (300GB, 10K RPM FC-AL)
  
   Those disks are 2Gb disks, so the tray will operate at 2Gb.
  
 
  That is still 256MB/s . I am getting about 194MB/s

 2Gb fibre channel is going to max out at a data transmission rate

But I am running 4Gb fibre channel with 4GB of NVRAM on 6 trays of
300GB FC 10K RPM (2Gb/s) disks

So I should get a lot more than ~ 200MB/s. Shouldn't I?


 around 200MB/s rather than the 256MB/s that you'd expect.  Fibre
 channel uses an 8-bit/10-bit encoding, so it transmits 8-bits of data
 in 10 bits on the wire.  So while 256MB/s is being transmitted on the
 connection itself, only 200MB/s of that is the data that you're
 transmitting.

 Chad Mynhier




-- 
Asif Iqbal
PGP Key: 0xE62693C5 KeyServer: pgp.mit.edu
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [desktop-discuss] ZFS snapshot GUI

2007-11-20 Thread Darren J Moffat
Calum Benson wrote:
 On 20 Nov 2007, at 13:35, Christian Kelly wrote:
 Take the example I gave before, where you have a pool called, say,  
 pool1. In the pool you have two ZFSes: pool1/export and pool1/ 
 export/home. So, suppose the user chooses /export in nautilus and  
 adds this to the backup list. Will the user be aware, from browsing  
 through nautilus, that /export/home may or may not be backed up -  
 depending on whether the -r (?) option is used.
 
 I'd consider that to be a fairly strong requirement, but it's not  
 something I particularly thought through for the mockups.
 
 One solution might be to change the nautilus background for folders  
 that are being backed up, another might be an indicator in the status  
 bar, another might be emblems on the folder icons themselves. 

I think changing the background is a non starter since users can change 
the background already anyway.

An emblem is good for the case where you are looking from above a 
dataset that is tagged for backup.

An indicator in the status bar is good for when you are in a dataset 
that is tagged for backup.

-- 
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [perf-discuss] [storage-discuss] zpool io to 6140 is really slow

2007-11-20 Thread Chad Mynhier
On 11/20/07, Asif Iqbal [EMAIL PROTECTED] wrote:
 On Nov 20, 2007 7:01 AM, Chad Mynhier [EMAIL PROTECTED] wrote:
  On 11/20/07, Asif Iqbal [EMAIL PROTECTED] wrote:
   On Nov 19, 2007 1:43 AM, Louwtjie Burger [EMAIL PROTECTED] wrote:
On Nov 17, 2007 9:40 PM, Asif Iqbal [EMAIL PROTECTED] wrote:
 (Including storage-discuss)

  I have 6 6140s with 96 disks, 64 of which are Seagate
  ST337FC (300GB, 10K RPM FC-AL)
   
Those disks are 2Gb disks, so the tray will operate at 2Gb.
   
  
   That is still 256MB/s . I am getting about 194MB/s
 
  2Gb fibre channel is going to max out at a data transmission rate
  around 200MB/s rather than the 256MB/s that you'd expect.  Fibre
  channel uses an 8-bit/10-bit encoding, so it transmits 8-bits of data
  in 10 bits on the wire.  So while 256MB/s is being transmitted on the
  connection itself, only 200MB/s of that is the data that you're
  transmitting.

  But I am running 4Gb fibre channel with 4GB of NVRAM on 6 trays of
  300GB FC 10K RPM (2Gb/s) disks

 So I should get a lot more than ~ 200MB/s. Shouldn't I?

Here, I'm relying on what Louwtjie said above, that the tray itself is
going to be limited to 2Gb/s because of the 2Gb/s FC disks.
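
(To put numbers on the encoding overhead mentioned above, assuming the nominal
2Gb/s signalling rate and ignoring framing and other protocol overhead:)

    # 2000 Mb/s on the wire * 8 data bits per 10 line bits / 8 bits per byte
    echo '2000 * 8 / 10 / 8' | bc     # prints 200, i.e. ~200 MB/s of data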

Chad Mynhier
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS + DB + fragments

2007-11-20 Thread can you guess?
Rats - I was right the first time:  there's a messy problem with snapshots.

The problem is that the parent of the child that you're about to update in 
place may *already* be in one or more snapshots because one or more of its 
*other* children was updated since each snapshot was created.  If so, then each 
snapshot copy of the parent is pointing to the location of the existing copy of 
the child you now want to update in place, and unless you change the snapshot 
copy of the parent (as well as the current copy of the parent) the snapshot 
will point to the *new* copy of the child you are now about to update (with an 
incorrect checksum to boot).

With enough snapshots, enough children, and bad enough luck, you might have to 
change the parent (and of course all its ancestors...) in every snapshot.

In other words, Nathan's approach is pretty much infeasible in the presence of 
snapshots.  Background defragmentation works as long as you move the entire 
region (which often has a single common parent) to a new location, which if the 
source region isn't excessively fragmented may not be all that expensive; it's 
probably not something you'd want to try at normal priority *during* an update 
to make Nathan's approach work, though, especially since you'd then wind up 
moving the entire region on every such update rather than in one batch in the 
background.

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why did resilvering restart?

2007-11-20 Thread Wade . Stuart
Resilver and scrub are broken and restart when a snapshot is created -- the
current workaround is to disable snaps while resilvering; the ZFS team is
working on the issue for a long-term fix.
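
(Concretely, that means temporarily turning off whatever is creating the
snapshots - for an auto-snapshot SMF service that would be something along
these lines; the exact FMRIs vary, so look them up first:)

    # See which automatic snapshot service instances exist / are online.
    svcs -a | grep auto-snapshot

    # Disable them for the duration of the resilver...
    svcadm disable <fmri-from-the-output-above>

    # ...and re-enable them once zpool status shows the resilver is done.
    svcadm enable <fmri-from-the-output-above>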

-Wade

[EMAIL PROTECTED] wrote on 11/20/2007 09:58:19 AM:

 On b66:
   # zpool replace tww c0t600A0B8000299966059E4668CBD3d0 \
   c0t600A0B8000299CCC06734741CD4Ed0
some hours later
   # zpool status tww
 pool: tww
state: DEGRADED
   status: One or more devices is currently being resilvered.  The pool will
           continue to function, possibly in a degraded state.
   action: Wait for the resilver to complete.
scrub: resilver in progress, 62.90% done, 4h26m to go
some hours later
   # zpool status tww
 pool: tww
state: DEGRADED
   status: One or more devices is currently being resilvered.  The pool will
           continue to function, possibly in a degraded state.
   action: Wait for the resilver to complete.
scrub: resilver in progress, 3.85% done, 18h49m to go

   # zpool history tww | tail -1
   2007-11-20.02:37:13 zpool replace tww
c0t600A0B8000299966059E4668CBD3d0
   c0t600A0B8000299CCC06734741CD4Ed0

 So, why did resilvering restart when no zfs operations occurred? I
 just ran zpool status again and now I get:
   # zpool status tww
 pool: tww
state: DEGRADED
   status: One or more devices is currently being resilvered.  The pool will
           continue to function, possibly in a degraded state.
   action: Wait for the resilver to complete.
scrub: resilver in progress, 0.00% done, 134h45m to go

 What's going on?

 --
 albert chin ([EMAIL PROTECTED])
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] NFS performance considerations (Linux vs Solaris)

2007-11-20 Thread msl
Hello all...
 I think we all agree that performance is a big topic with NFS.
 So, when we talk about NFS and ZFS we imagine a great combination/solution. 
But neither depends on the other; they are actually two quite distinct 
technologies. ZFS has a lot of features that we all know about, and that maybe 
all of us want in an NFS share (or maybe not). The point is: two technologies 
with different priorities.
 So, what I think is important is a document (here on the NFS/ZFS discussion 
lists) that lists and explains the ZFS features that have a real performance 
impact. I know that there is the solarisinternals wiki about ZFS/NFS 
integration, but what I think is really important is a comparison between 
Linux and Solaris/ZFS on the server side.
 It would be very useful to see, for example, what consistency I get with 
Linux (XFS, ext3, etc.) at a given level of performance, and how I can 
configure a similar NFS service on Solaris/ZFS.
 Here we have some information about it: 
http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine
 but there is no comparison with Linux, which I think is important.
 What I mean is that the people who know a lot about the NFS protocol and 
about the filesystem features should make such a comparison (to make adoption 
and users' comparisons easier). I think there are many users comparing oranges 
with apples.
 Another example (correct me if I am wrong): until kernel 2.4.20 (at least), 
the default Linux export option for sync/async was async (on Solaris I think 
it was always sync). Another point was the COMMIT operation in NFSv2, which 
was not implemented: the server just replied with an OK, but the data was not 
yet in stable storage (here the ZIL and the Roch blog entry are excellent).
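 To make that kind of comparison concrete, the knob in question looks roughly 
like this on each side (the paths are just examples, and this is only an 
illustration of the sync/async difference, not a tuning recommendation):

    # Linux, /etc/exports: sync vs async is chosen per export.
    # "sync" replies only after data reaches stable storage;
    # "async" is faster but can lose acknowledged writes on a crash.
    /export/home     *(rw,sync)
    /export/scratch  *(rw,async)

    # Solaris: share the filesystem; the server honours the protocol's
    # stable-storage semantics, and with ZFS it is the ZIL that makes
    # those synchronous semantics affordable.
    share -F nfs -o rw /export/home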
 That's it; I'm proposing the creation of a matrix/table of features and 
performance impact, as well as a comparison with other implementations and 
their implications.
 Thanks very much for your time, and sorry for the long post.

 Leal.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [perf-discuss] [storage-discuss] zpool io to 6140 is really slow

2007-11-20 Thread Andrew Wilson




And, just to add one more point, since pretty much everything the host
writes to the controller eventually has to make it out to the disk
drives, the long term average write rate cannot exceed the rate that
the backend disk subsystem can absorb the writes, regardless of the
workload. (An exception is if the controller can combine some
overlapping writes). Basically just like putting water into a reservoir
at twice the rate it is being withdrawn, the reservoir will eventually
overflow! At least in this case the controller can limit the input from
the host and avoid an actual data overflow situation.

Drew

Andrew Wilson wrote:

  What kind of workload are you running?  If you are doing these measurements 
with some sort of "write as fast as possible" microbenchmark, once the 4 GB of 
nvram is full, you will be limited by backend performance (FC disks and their 
interconnect) rather than the host / controller bus.

Since, best case, 4 gbit FC can transfer 4 GBytes of data in about 10 seconds, 
you will fill it up, even with the backend writing out data as fast as it can, 
in about 20 seconds. Once the nvram is full, you will only see the backend (e.g. 
2 Gbit) rate.

The reason these controller buffers are useful with real applications is that 
they smooth the bursts of writes that real applications tend to generate, thus 
reducing the latency of those writes and improving performance. They will then 
"catch up" during periods when few writes are being issued. But a typical 
microbenchmark that pumps out a steady stream of writes won't see this benefit.

Drew Wilson

Asif Iqbal wrote:

 On Nov 20, 2007 7:01 AM, Chad Mynhier [EMAIL PROTECTED] wrote:
  On 11/20/07, Asif Iqbal [EMAIL PROTECTED] wrote:
   On Nov 19, 2007 1:43 AM, Louwtjie Burger [EMAIL PROTECTED] wrote:
    On Nov 17, 2007 9:40 PM, Asif Iqbal [EMAIL PROTECTED] wrote:
     (Including storage-discuss)

     I have 6 6140s with 96 disks, 64 of which are Seagate
     ST337FC (300GB, 10K RPM FC-AL).

    Those disks are 2Gb disks, so the tray will operate at 2Gb.

   That is still 256MB/s.  I am getting about 194MB/s.

  2Gb fibre channel is going to max out at a data transmission rate

 But I am running 4Gb fibre channel with 4GB of NVRAM on 6 trays of
 300GB FC 10K RPM (2Gb/s) disks.

 So I should get "a lot" more than ~200MB/s.  Shouldn't I?

  around 200MB/s rather than the 256MB/s that you'd expect.  Fibre
  channel uses an 8-bit/10-bit encoding, so it transmits 8 bits of data
  in 10 bits on the wire.  So while 256MB/s is being transmitted on the
  connection itself, only 200MB/s of that is the data that you're
  transmitting.

  Chad Mynhier

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why did resilvering restart?

2007-11-20 Thread Albert Chin
On Tue, Nov 20, 2007 at 10:01:49AM -0600, [EMAIL PROTECTED] wrote:
 Resilver and scrub are broken and restart when a snapshot is created
 -- the current workaround is to disable snaps while resilvering,
 the ZFS team is working on the issue for a long term fix.

But no snapshot was taken. If one had been, zpool history would have
shown it. So, in short, _no_ ZFS operations are going on during the
resilvering. Yet it is restarting.

 -Wade
 
 [EMAIL PROTECTED] wrote on 11/20/2007 09:58:19 AM:
 
  On b66:
# zpool replace tww c0t600A0B8000299966059E4668CBD3d0 \
c0t600A0B8000299CCC06734741CD4Ed0
 some hours later
# zpool status tww
  pool: tww
 state: DEGRADED
 status: One or more devices is currently being resilvered.  The pool will
         continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress, 62.90% done, 4h26m to go
 some hours later
# zpool status tww
  pool: tww
 state: DEGRADED
 status: One or more devices is currently being resilvered.  The pool will
         continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress, 3.85% done, 18h49m to go
 
# zpool history tww | tail -1
2007-11-20.02:37:13 zpool replace tww
 c0t600A0B8000299966059E4668CBD3d0
c0t600A0B8000299CCC06734741CD4Ed0
 
  So, why did resilvering restart when no zfs operations occurred? I
  just ran zpool status again and now I get:
# zpool status tww
  pool: tww
 state: DEGRADED
 status: One or more devices is currently being resilvered.  The pool will
         continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress, 0.00% done, 134h45m to go
 
  What's going on?
 
  --
  albert chin ([EMAIL PROTECTED])
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 
 

-- 
albert chin ([EMAIL PROTECTED])
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [desktop-discuss] ZFS snapshot GUI

2007-11-20 Thread Calum Benson

On 20 Nov 2007, at 15:04, Darren J Moffat wrote:

 Calum Benson wrote:
 On 20 Nov 2007, at 13:35, Christian Kelly wrote:
 Take the example I gave before, where you have a pool called,  
 say,  pool1. In the pool you have two ZFSes: pool1/export and  
 pool1/ export/home. So, suppose the user chooses /export in  
 nautilus and  adds this to the backup list. Will the user be  
 aware, from browsing  through nautilus, that /export/home may or  
 may not be backed up -  depending on whether the -r (?) option is  
 used.
 I'd consider that to be a fairly strong requirement, but it's not   
 something I particularly thought through for the mockups.
 One solution might be to change the nautilus background for  
 folders  that are being backed up, another might be an indicator  
 in the status  bar, another might be emblems on the folder icons  
 themselves.

 I think changing the background is a non starter since users can  
 change the background already anyway.

You're right that they can, and while that probably does write it  
off, I wonder how many really do.  (And we could possibly do  
something clever like a semi-opaque overlay anyway, we may not have  
to replace the background entirely.)

All just brainstorming at this stage though, other ideas welcome :)

 An emblem is good for the case where you are looking from above a  
 dataset that is tagged for backup.

 An indicator in the status bar is good for when you are in a  
 dataset that is tagged for backup.

Yep, all true.  Also need to bear in mind that nowadays, with the  
(fairly) new nautilus treeview, you can potentially see both in and  
above at the same time, so any solution would have to work  
elegantly with that view too.

Cheeri,
Calum.

-- 
CALUM BENSON, Usability Engineer   Sun Microsystems Ireland
mailto:[EMAIL PROTECTED]GNOME Desktop Team
http://blogs.sun.com/calum +353 1 819 9771

Any opinions are personal and not necessarily those of Sun Microsystems


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [desktop-discuss] ZFS snapshot GUI

2007-11-20 Thread Calum Benson

On 20 Nov 2007, at 14:31, Christian Kelly wrote:

 Ah, I see. So, for phase 0, the 'Enable Automatic Snapshots' option
 would only be available for/work for existing ZFSes. Then at some  
 later
 stage, create them on the fly.

Yes, that's the scenario for the mockups I posted, anyway... if the  
requirements are bogus, then of course we'll have to change them :)

My original mockup did allow you to create a pool/filesystem on the  
fly if required, but it felt like the wrong place to be doing that--  
if you could understand the dialog to do that, you would probably  
know how to do it better on the command line anyway.  Longer term, I  
guess we might be wanting to ship some sort of ZFS management GUI  
that might be better suited to that sort of thing (maybe like the  
Nexenta app that Roman mentioned earlier, but I haven't really looked  
at that yet...)

Cheeri,
Calum.

-- 
CALUM BENSON, Usability Engineer   Sun Microsystems Ireland
mailto:[EMAIL PROTECTED]GNOME Desktop Team
http://blogs.sun.com/calum +353 1 819 9771

Any opinions are personal and not necessarily those of Sun Microsystems


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [desktop-discuss] ZFS snapshot GUI

2007-11-20 Thread Darren J Moffat
Calum Benson wrote:
 You're right that they can, and while that probably does write it off, 
 I wonder how many really do.  (And we could possibly do something clever 
 like a semi-opaque overlay anyway, we may not have to replace the 
 background entirely.)

Almost everyone I've seen using the filemanager other than myself has 
done this :-)

If you do a semi-opaque overlay, that's going to require lots of colour 
selection stuff - plus, what if the background is a complex image? (Why 
people do this I don't know, but I've seen it done.)

 An emblem is good for the case where you are looking from above a 
 dataset that is tagged for backup.

 An indicator in the status bar is good for when you are in a dataset 
 that is tagged for backup.
 
 Yep, all true.  Also need to bear in mind that nowadays, with the 
 (fairly) new nautilus treeview, you can potentially see both in and 
 above at the same time, so any solution would have to work elegantly 
 with that view too.

I would expect an emblem in the tree and a status bar indicator for the 
non-tree part.

-- 
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS + DB + fragments

2007-11-20 Thread Chris Csanady
On Nov 19, 2007 10:08 PM, Richard Elling [EMAIL PROTECTED] wrote:
 James Cone wrote:
  Hello All,
 
  Here's a possibly-silly proposal from a non-expert.
 
  Summarising the problem:
 - there's a conflict between small ZFS record size, for good random
  update performance, and large ZFS record size for good sequential read
  performance
 

 Poor sequential read performance has not been quantified.

I think this is a good point.  A lot of solutions are being thrown
around, and the problems are only theoretical at the moment.
Conventional solutions may not even be appropriate for something like
ZFS.

The point that makes me skeptical is this: blocks do not need to be
logically contiguous to be (nearly) physically contiguous.  As long as
you reallocate the blocks close to the originals, chances are that a
scan of the file will end up being mostly physically contiguous reads
anyway.  ZFS's intelligent prefetching along with the disk's track
cache should allow for good performance even in this case.

ZFS may or may not already do this; I haven't checked.  Obviously, you
won't want to keep a year's worth of snapshots, or run the pool near
capacity.  With a few minor tweaks, though, it should work quite well.
Talking about fundamental ZFS design flaws at this point seems
unnecessary to say the least.

Chris
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS + DB + fragments

2007-11-20 Thread can you guess?
 But the whole point of snapshots is that they don't
 take up extra space on the disk.  If a file (and
 hence a block) is in every snapshot it doesn't mean
 you've got multiple copies of it.  You only have one
 copy of that block, it's just referenced by many
 snapshots.

I used the wording "copies of a parent" loosely to mean previous states of the 
parent that also contain pointers to the current state of the child about to be 
updated in place.

 
 The thing is, the location of that block isn't saved
 separately in every snapshot either - the location is
 just stored in it's parent.

And in every earlier version of the parent that was updated for some *other* 
reason and still contains a pointer to the current child that someone using 
that snapshot must be able to follow correctly.

  So moving a block is
 just a case of updating one parent.

No:  every version of the parent that points to the current version of the 
child must be updated.

...

 If you think about it, that has to work for the old
 data since as I said before, ZFS already has this
 functionality.  If ZFS detects a bad block, it moves
 it to a new location on disk.  If it can already do
 that without affecting any of the existing snapshots,
 there's no reason to think we couldn't use the
 same code for a different purpose.

Only if it works the way you think it works, rather than, say, by using a 
look-aside list of moved blocks (there shouldn't be that many of them), or by 
just leaving the bad block in the snapshot (if it's mirrored or 
parity-protected, it'll still be usable there unless a second failure occurs; 
if not, then it was lost anyway).

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS + DB + fragments

2007-11-20 Thread Ross
But the whole point of snapshots is that they don't take up extra space on the 
disk.  If a file (and hence a block) is in every snapshot it doesn't mean 
you've got multiple copies of it.  You only have one copy of that block, it's 
just referenced by many snapshots.

The thing is, the location of that block isn't saved separately in every 
snapshot either - the location is just stored in its parent.  So moving a 
block is just a case of updating one parent.  So regardless of how many 
snapshots the parent is in, you only have to update one parent to point it at 
the new location for the *old* data.  Then you save the new data to the old 
location and ensure the current tree points to that.

If you think about it, that has to work for the old data since as I said 
before, ZFS already has this functionality.  If ZFS detects a bad block, it 
moves it to a new location on disk.  If it can already do that without 
affecting any of the existing snapshots, there's no reason to think we 
couldn't use the same code for a different purpose.

Ultimately, your old snapshots get fragmented, but the live data stays 
contiguous.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Unsubscribe

2007-11-20 Thread Hay, Mausul W


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why did resilvering restart?

2007-11-20 Thread Wade . Stuart

[EMAIL PROTECTED] wrote on 11/20/2007 10:11:50 AM:

 On Tue, Nov 20, 2007 at 10:01:49AM -0600, [EMAIL PROTECTED] wrote:
  Resilver and scrub are broken and restart when a snapshot is created
  -- the current workaround is to disable snaps while resilvering,
  the ZFS team is working on the issue for a long term fix.

 But, no snapshot was taken. If so, zpool history would have shown
 this. So, in short, _no_ ZFS operations are going on during the
 resilvering. Yet, it is restarting.


Does 2007-11-20.02:37:13 actually match the expected timestamp of the
original zpool replace command before the first zpool status output listed
below?  Is it possible that another zpool replace is further up on your
pool history (ie it was rerun by an admin or automatically from some
service)?

-Wade


 
  [EMAIL PROTECTED] wrote on 11/20/2007 09:58:19 AM:
 
   On b66:
 # zpool replace tww c0t600A0B8000299966059E4668CBD3d0 \
 c0t600A0B8000299CCC06734741CD4Ed0
  some hours later
 # zpool status tww
   pool: tww
  state: DEGRADED
  status: One or more devices is currently being resilvered.  The pool will
          continue to function, possibly in a degraded state.
 action: Wait for the resilver to complete.
  scrub: resilver in progress, 62.90% done, 4h26m to go
  some hours later
 # zpool status tww
   pool: tww
  state: DEGRADED
  status: One or more devices is currently being resilvered.  The pool will
          continue to function, possibly in a degraded state.
 action: Wait for the resilver to complete.
  scrub: resilver in progress, 3.85% done, 18h49m to go
  
 # zpool history tww | tail -1
 2007-11-20.02:37:13 zpool replace tww
  c0t600A0B8000299966059E4668CBD3d0
 c0t600A0B8000299CCC06734741CD4Ed0
  
   So, why did resilvering restart when no zfs operations occurred? I
   just ran zpool status again and now I get:
 # zpool status tww
   pool: tww
  state: DEGRADED
  status: One or more devices is currently being resilvered.  The pool will
          continue to function, possibly in a degraded state.
 action: Wait for the resilver to complete.
  scrub: resilver in progress, 0.00% done, 134h45m to go
  
   What's going on?
  
   --
   albert chin ([EMAIL PROTECTED])
   ___
   zfs-discuss mailing list
   zfs-discuss@opensolaris.org
   http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 
 

 --
 albert chin ([EMAIL PROTECTED])
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] raidz2 testing

2007-11-20 Thread Brian Lionberger
Is there a preferred method to test a raidz2?
I would like to see the disks recover on their own after simulating 
a disk failure.
I have a 4 disk configuration.

Brian.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS + DB + fragments

2007-11-20 Thread Will Murnane
On Nov 20, 2007 5:33 PM, can you guess? [EMAIL PROTECTED] wrote:
  But the whole point of snapshots is that they don't
  take up extra space on the disk.  If a file (and
  hence a block) is in every snapshot it doesn't mean
  you've got multiple copies of it.  You only have one
  copy of that block, it's just referenced by many
  snapshots.

 I used the wording "copies of a parent" loosely to mean previous
 states of the parent that also contain pointers to the current state of
 the child about to be updated in place.
But children are never updated in place.  When a new block is written
to a leaf, new blocks are used for all the ancestors back to the
superblock, and then the old ones are either freed or held on to by
the snapshot.

 And in every earlier version of the parent that was updated for some
 *other* reason and still contains a pointer to the current child that
 someone using that snapshot must be able to follow correctly.
The snapshot doesn't get the 'current' child - it gets the one that
was there when the snapshot was taken.

 No:  every version of the parent that points to the current version of
 the child must be updated.
Even with clones, the two ('parent' and 'clone') are allowed to
diverge - they contain different data.

Perhaps I'm missing something.  Excluding ditto blocks, when in ZFS
would two parents point to the same child, and need to both be updated
when the child is updated?

Will
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why did resilvering restart?

2007-11-20 Thread Albert Chin
On Tue, Nov 20, 2007 at 11:10:20AM -0600, [EMAIL PROTECTED] wrote:
 
 [EMAIL PROTECTED] wrote on 11/20/2007 10:11:50 AM:
 
  On Tue, Nov 20, 2007 at 10:01:49AM -0600, [EMAIL PROTECTED] wrote:
   Resilver and scrub are broken and restart when a snapshot is created
   -- the current workaround is to disable snaps while resilvering,
   the ZFS team is working on the issue for a long term fix.
 
  But, no snapshot was taken. If so, zpool history would have shown
  this. So, in short, _no_ ZFS operations are going on during the
  resilvering. Yet, it is restarting.
 
 
 Does 2007-11-20.02:37:13 actually match the expected timestamp of
 the original zpool replace command before the first zpool status
 output listed below?

No. We ran some 'zpool status' commands after the last 'zpool
replace'. The 'zpool status' output in the initial email is from this
morning. The only ZFS commands we've run since the last 'zpool replace'
are 'zfs list', 'zpool list tww', 'zpool status', and 'zpool status -v'.

Server is on GMT time.

 Is it possible that another zpool replace is further up on your
 pool history (ie it was rerun by an admin or automatically from some
 service)?

Yes, but a zpool replace for the same bad disk:
  2007-11-20.00:57:40 zpool replace tww c0t600A0B8000299966059E4668CBD3d0
  c0t600A0B800029996606584741C7C3d0
  2007-11-20.02:35:22 zpool detach tww c0t600A0B800029996606584741C7C3d0
  2007-11-20.02:37:13 zpool replace tww c0t600A0B8000299966059E4668CBD3d0
  c0t600A0B8000299CCC06734741CD4Ed0

We accidentally removed c0t600A0B800029996606584741C7C3d0 from the
array, hence the 'zpool detach'.

The last 'zpool replace' has been running for 15h now.

 -Wade
 
 
  
   [EMAIL PROTECTED] wrote on 11/20/2007 09:58:19 AM:
  
On b66:
  # zpool replace tww c0t600A0B8000299966059E4668CBD3d0 \
  c0t600A0B8000299CCC06734741CD4Ed0
   some hours later
  # zpool status tww
pool: tww
   state: DEGRADED
  status: One or more devices is currently being resilvered.  The pool will
  continue to function, possibly in a degraded state.
  action: Wait for the resilver to complete.
   scrub: resilver in progress, 62.90% done, 4h26m to go
   some hours later
  # zpool status tww
pool: tww
   state: DEGRADED
  status: One or more devices is currently being resilvered.  The pool will
  continue to function, possibly in a degraded state.
  action: Wait for the resilver to complete.
   scrub: resilver in progress, 3.85% done, 18h49m to go
   
  # zpool history tww | tail -1
  2007-11-20.02:37:13 zpool replace tww
   c0t600A0B8000299966059E4668CBD3d0
  c0t600A0B8000299CCC06734741CD4Ed0
   
So, why did resilvering restart when no zfs operations occurred? I
just ran zpool status again and now I get:
  # zpool status tww
pool: tww
   state: DEGRADED
  status: One or more devices is currently being resilvered.  The pool will
  continue to function, possibly in a degraded state.
  action: Wait for the resilver to complete.
   scrub: resilver in progress, 0.00% done, 134h45m to go
   
What's going on?
   
--
albert chin ([EMAIL PROTECTED])
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
  
   ___
   zfs-discuss mailing list
   zfs-discuss@opensolaris.org
   http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
  
  
 
  --
  albert chin ([EMAIL PROTECTED])
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 
 

-- 
albert chin ([EMAIL PROTECTED])
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raidz2

2007-11-20 Thread Eric Schrock
On Tue, Nov 20, 2007 at 11:02:55AM +0100, Paul Boven wrote:
 
 I seem to be having exactly the problems you are describing (see my
 postings with the subject 'zfs on a raid box'). So I would very much
 like to give b77 a try. I'm currently running b76, as that's the latest
 sxce that's available. Are the sources to anything beyond b76 already
 available? Would I need to build it, or bfu?

The sources, yes (you can pull them from the ON mercurial mirror).  It
looks like the latest SX:CE is still on build 76, so it doesn't seem
like you can get a binary distro, yet.
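
If you do want to try building it yourself, the usual starting point is
roughly the following (repository path from memory -- double-check it
against the ON build instructions on opensolaris.org):

$ hg clone ssh://anon@hg.opensolaris.org/hg/onnv/onnv-gate

You would then build with the ON tools (nightly/bldenv) and BFU the
resulting archives onto your b76 system.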

 
 I'm seeing zfs not making use of available hot-spares when I pull a
 disk, long and indeed painful SCSI retries and very poor write
 performance on a degraded zpool - I hope to be able to test if b77 fares
 any better with this.

What hardware/driver are you using?  Build 76 should have the ability to
recognize removed devices via DKIOCGETSTATE and immediately transition
to the REMOVED state instead of going through the SCSI retry logic (3x
60 seconds).  Build 77 added a 'probe' operation on I/O failure that
will try to read/write some basic data to the disk and if that fails
will immediately determine the disk as FAULTED without having to wait
for retries to fail and FMA diagnosis to offline the device.

- Eric

--
Eric Schrock, FishWorks                    http://blogs.sun.com/eschrock
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raidz2

2007-11-20 Thread Richard Elling
comment on retries below...

Paul Boven wrote:
 Hi Eric, everyone,
 
 Eric Schrock wrote:
 There have been many improvements in proactively detecting failure,
 culminating in build 77 of Nevada.  Earlier builds:

 - Were unable to distinguish device removal from devices misbehaving,
   depending on the driver and hardware.

 - Did not diagnose a series of I/O failures as disk failure.

 - Allowed several (painful) SCSI retries and continued to queue up I/O,
   even if the disk was fatally damaged.
 
 Most classes of hardware would behave reasonably well on device removal,
 but certain classes caused cascading failures in ZFS, all which should
 be resolved in build 77 or later.
 
 I seem to be having exactly the problems you are describing (see my
 postings with the subject 'zfs on a raid box'). So I would very much
 like to give b77 a try. I'm currently running b76, as that's the latest
 sxce that's available. Are the sources to anything beyond b76 already
 available? Would I need to build it, or bfu?
 
 I'm seeing zfs not making use of available hot-spares when I pull a
 disk, long and indeed painful SCSI retries and very poor write
 performance on a degraded zpool - I hope to be able to test if b77 fares
 any better with this.

The SCSI retries are implemented at the driver level (usually sd) below
ZFS.  By default, the timeout (60s) and retry (3 or 5) counters are
somewhat conservative and intended to apply to a wide variety of hardware,
including slow CD-ROMs and ancient processors.  Depending on your
situation and business requirements, these may be tuned.  There is a
pretty good article on BigAdmin which describes tuning the FC side of
the equation (ssd driver).
http://www.sun.com/bigadmin/features/hub_articles/tuning_sfs.jsp

Beware, making these tunables too small can lead to an unstable system.
The article does a good job of explaining how interdependent the tunables
are, so hopefully you can make wise choices.
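
As a rough illustration only -- the value below is a placeholder, not a
recommendation; derive real numbers from the article and your hardware --
an sd timeout override in /etc/system looks like:

* shorten the per-command timeout for the sd driver (default is 60s)
set sd:sd_io_time = 20

The retry counters and the ssd (FC) equivalents are tuned the same way;
the article covers the exact names.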
  -- richard
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] snv-76 panics on installation

2007-11-20 Thread Bill Moloney
I have an Intel based server running dual P3 Xeons (Intel A46044-609, 1.26GHz) 
with a BIOS from American Megatrends Inc (AMIBIOS, SCB2 production BIOS rev 
2.0, BIOS build 0039) with 2GB of RAM

when I attempt to install snv-76 the system panics during the initial boot from 
CD

I've been using this system for extensive testing with ZFS and have had no 
problems installing snv-68, 69 or 70, but I'm having this problem with snv-76

any information regarding this problem or a potential workaround would be 
appreciated

Thx ... bill moloney
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS + DB + fragments

2007-11-20 Thread Al Hopper
On Tue, 20 Nov 2007, Ross wrote:

 doing these writes now sounds like a
 lot of work.  I'm guessing that needing two full-path
 updates to achieve this means you're talking about a
 much greater write penalty.

 Not all that much.  Each full-path update is still
 only a single write request to the disk, since all
 the path blocks (again, possibly excepting the
 superblock) are batch-written together, thus mostly
 increasing only streaming bandwidth consumption.

... reformatted ...

 Ok, that took some thinking about.  I'm pretty new to ZFS, so I've 
 only just gotten my head around how CoW works, and I'm not used to 
 thinking about files at this kind of level.  I'd not considered that

Here's a couple of resources that'll help you get up to speed with ZFS 
internals:

a) From the London OpenSolaris User Group (LOSUG) session, presented 
by Jarod Nash, TSC Systems Engineer entitled: ZFS: Under The Hood:

ZFS-UTH_3_v1.1_LOSUG.pdf
zfs_data_structures_for_single_file.pdf

also referred to as ZFS Internals Lite.

and b) the ZFS on-disk Specification:

ondiskformat0822.pdf

 path blocks would be batch-written close together, but of course 
 that makes sense.

 What I'd been thinking was that ordinarily files would get 
 fragmented as they age, which would make these updates slower as 
 blocks would be scattered over the disk, so a full-path update would 
 take some time.  I'd forgotten that the whole point of doing this is 
 to prevent fragmentation...

 So a nice side effect of this approach is that if you use it, it 
 makes itself more efficient :D


Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
Graduate from sugar-coating school?  Sorry - I never attended! :)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] snv-76 panics on installation

2007-11-20 Thread Michael Schuster
Bill Moloney wrote:
 I have an Intel based server running dual P3 Xeons (Intel A46044-609,
 1.26GHz) with a BIOS from American Megatrends Inc (AMIBIOS, SCB2
 production BIOS rev 2.0, BIOS build 0039) with 2GB of RAM
 
 when I attempt to install snv-76 the system panics during the initial
 boot from CD

please post the panic stack (to the list, not to me alone), if possible, 
and as much other information as you have (ie. what step does the panic 
happen at, etc.)

where did you get the media from (is it really a CD, or a DVD?)?
Can you read/mount the CD when running an older build? if no, are there 
errors in the messages file? ...

HTH
Michael
-- 
Michael Schuster
Recursion, n.: see 'Recursion'
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Recommended many-port SATA controllers for budget ZFS

2007-11-20 Thread Al Hopper
On Mon, 19 Nov 2007, Brian Hechinger wrote:

 On Sun, Nov 18, 2007 at 02:18:21PM +0100, Peter Schuller wrote:
 Right now I have noticed that LSI has recently began offering some
 lower-budget stuff; specifically I am looking at the MegaRAID SAS
 8208ELP/XLP, which are very reasonably priced.

 I looked up the 8204XLP, which is really quite expensive compared to
 the Supermicro MV based card.

 That being said, for a small 1U box that is only going to have two SATA
 disks, the Supermicro card is way overkill/overpriced for my needs.

 Does anyone know if there are any PCI-X cards based on the MV88SX6041?

 I'm not having much luck finding any.

A couple of options:

a) the SuperMicro AOC-SAT2-MV8 is an 8-port SATA card available for 
around $110 IIRC.

b) There is also a PCI-X version of the older LSI 4-port (internal) 
PCI Express SAS3041E card which is still available for around $165 and 
works well with ZFS (SATA or SAS drives).

c) Any card based on the SiliconImage 3124/3132 chips will work.  But, 
ensure you're running an OS with the latest version of the si3124 
drivers - or - you can swap out the older drivers using the files 
from:

http://www.opensolaris.org/jive/servlet/JiveServlet/download/80-32437-138083-3390/si3124.tar.gz

Note: if these drives are your boot drives, you'll need to do this 
after booting from a CDROM/DVD disk, otherwise you can unload the 
driver and swap out the files.
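
Roughly, assuming the new si3124 binary from that tarball (exact file
layout may differ -- on 64-bit x86 the module lives under
/kernel/drv/amd64):

# modinfo | grep si3124        (note the module id)
# modunload -i <module-id>
# cp si3124 /kernel/drv/si3124
# update_drv si3124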

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
Graduate from sugar-coating school?  Sorry - I never attended! :)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [perf-discuss] [storage-discuss] zpool io to 6140 is really slow

2007-11-20 Thread Asif Iqbal
On Nov 20, 2007 10:40 AM, Andrew Wilson [EMAIL PROTECTED] wrote:

  What kind of workload are you running?  If you are doing these
 measurements with some sort of "write as fast as possible" microbenchmark,

Oracle database with blocksize 16K .. populating the database as fast as I can

 once the 4 GB of nvram is full, you will be limited by backend performance
 (FC disks and their interconnect) rather than the host / controller bus.

  Since, best case, 4 gbit FC can transfer 4 GBytes of data in about 10
 seconds, you will fill it up, even with the backend writing out data as fast
 as it can, in about 20 seconds. Once the nvram is full, you will only see
 the backend (e.g. 2 Gbit) rate.

  The reason these controller buffers are useful with real applications is
 that they smooth the bursts of writes that real applications tend to
 generate, thus reducing the latency of those writes and improving
 performance. They will then catch up during periods when few writes are
 being issued. But a typical microbenchmark that pumps out a steady stream of
 writes won't see this benefit.

  Drew Wilson



  Asif Iqbal wrote:
  On Nov 20, 2007 7:01 AM, Chad Mynhier [EMAIL PROTECTED] wrote:


  On 11/20/07, Asif Iqbal [EMAIL PROTECTED] wrote:


  On Nov 19, 2007 1:43 AM, Louwtjie Burger [EMAIL PROTECTED] wrote:


  On Nov 17, 2007 9:40 PM, Asif Iqbal [EMAIL PROTECTED] wrote:


  (Including storage-discuss)

 I have 6 6140s with 96 disks, 64 of which are Seagate
 ST337FC (300GB - 10K RPM FC-AL)

  Those disks are 2Gb disks, so the tray will operate at 2Gb.


  That is still 256MB/s. I am getting about 194MB/s

  2Gb fibre channel is going to max out at a data transmission rate

  But I am running 4Gb fibre channel with 4GB NVRAM on 6 trays of
 300GB FC 10K rpm (2Gb/s) disks

 So I should get a lot more than ~ 200MB/s. Shouldn't I?




  around 200MB/s rather than the 256MB/s that you'd expect. Fibre
 channel uses an 8-bit/10-bit encoding, so it transmits 8-bits of data
 in 10 bits on the wire. So while 256MB/s is being transmitted on the
 connection itself, only 200MB/s of that is the data that you're
 transmitting.

 Chad Mynhier










-- 
Asif Iqbal
PGP Key: 0xE62693C5 KeyServer: pgp.mit.edu
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Recommended many-port SATA controllers for budget ZFS

2007-11-20 Thread Brian Hechinger
On Tue, Nov 20, 2007 at 02:01:34PM -0600, Al Hopper wrote:
 
 a) the SuperMicro AOC-SAT2-MV8 is an 8-port SATA card available for 
 around $110 IIRC.

Yeah, I'd like to spend a lot less than that, especially as I only need
2 ports. :)

 b) There is also a PCI-X version of the older LSI 4-port (internal) 
 PCI Express SAS3041E card which is still available for around $165 and 
 works well with ZFS (SATA or SAS drives).

I actually just picked up a SAS3080X for my Ultra80 on ebay for $30. I
guess I can always scour ebay for something similar.

 c) Any card based on the SiliconImage 3124/3132 chips will work.  But, 
 ensure you're running an OS with the latest version of the si3124 
 drivers - or - you can swap out the older drivers using the files 
 from:

the 3124 looks perfect. The only problem is the only thing I found on ebay
was for the 3132, which is PCIe, which doesn't help me. :) I'm not finding
anything for 3124 other than the data on silicon image's site.  Do you know
of any cards I should be looking for that uses this chip?

These will be OS disks, and I'm willing to run whichever version is best
for this hardware and ZFS (I'm going to try the most recent SXCE once
I have all the hardware together).  Any recommendations as related to
this card?

-brian
-- 
Perl can be fast and elegant as much as J2EE can be fast and elegant.
In the hands of a skilled artisan, it can and does happen; it's just
that most of the shit out there is built by people who'd be better
suited to making sure that my burger is cooked thoroughly.  -- Jonathan 
Patschke
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Recommended many-port SATA controllers for budget ZFS

2007-11-20 Thread Jason P. Warr


the 3124 looks perfect. The only problem is the only thing I found on ebay
was for the 3132, which is PCIe, which doesn't help me. :) I'm not finding
anything for 3124 other than the data on silicon image's site.  Do you know
of any cards I should be looking for that uses this chip?

http://www.cooldrives.com/sata-cards.html

There are a couple on there for about $80.  Not quite where you want to be 
price-wise, I am sure, but it is an option.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Recommended many-port SATA controllers for budget ZFS

2007-11-20 Thread Al Hopper
On Tue, 20 Nov 2007, Jason P. Warr wrote:



 the 3124 looks perfect. The only problem is the only thing I found on ebay
 was for the 3132, which is PCIe, which doesn't help me. :) I'm not finding
 anything for 3124 other than the data on silicon image's site.  Do you know
 of any cards I should be looking for that uses this chip?

 http://www.cooldrives.com/sata-cards.html

 There are a couple on there for about $80.  Not quite where you want to get I 
 am sure but it is an option.

Yep - I see: http://www.cooldrives.com/saiiraco2esa.html for $60.

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
Graduate from sugar-coating school?  Sorry - I never attended! :)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool io to 6140 is really slow

2007-11-20 Thread Richard Elling
Asif Iqbal wrote:
 On Nov 19, 2007 11:47 PM, Richard Elling [EMAIL PROTECTED] wrote:
   
 Asif Iqbal wrote:
 
 I have the following layout

 A 490 with 8 1.8Ghz and 16G mem. 6 6140s with 2 FC controllers using
 A1 and B1 controller ports at 4Gbps speed.
 Each controller has 2G NVRAM

 On 6140s I setup raid0 lun per SAS disks with 16K segment size.

 On 490 I created a zpool with 8 4+1 raidz1s

 I am getting zpool IO of only 125MB/s with zfs:zfs_nocacheflush = 1 in
 /etc/system

 Is there a way I can improve the performance. I like to get 1GB/sec IO.

   
 I don't believe a V490 is capable of driving 1 GByte/s of I/O.
 

 Well I am getting ~190MB/s right now. I'm sure not hitting anywhere close
 to that ceiling

   
 The V490 has two schizos and the schizo is not a full speed
 bridge.  For more information see Section 1.2 of:
 http://www.sun.com/processors/manuals/External_Schizo_PRM.pdf
 

[err - see Section 1.3]

You will notice from Table 1-1, the read bandwidth limit for a schizo 
PCI leaf is
204 MBytes/s.  With two schizos, you can expect to max out at 816 
MBytes/s or
less, depending on resource contention.  It makes no difference that a 4 
Gbps FC
card could read 400 MBytes/s, the best you can do for the card is 204 
MBytes/s.
1 GBytes/s of read throughput will not be attainable with a V490.
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] which would be faster

2007-11-20 Thread Tim Cook
So I have 8 drives total. 

5x500GB seagate 7200.10
3x300GB seagate 7200.10

I'm trying to decide, would I be better off just creating two separate pools?

pool1 = 5x500gb raidz
pool2= 3x300gb raidz

or would I be better off creating one large pool, with two raid sets?  I'm 
trying to figure out if it would be faster this way since it should be striping 
across the two pools (from what I understand).  On the other hand, the pool of 
3 disks is obviously going to be much slower than the pool of 5.

In a perfect world I'd just benchmark both ways, but due to some constraints, 
that may not be possible.  Any insight?
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raidz2 testing

2007-11-20 Thread Richard Elling
Brian Lionberger wrote:
 Is there a preferred method to test a raidz2.
 I would like to see the the disks recover on there own after simulating 
 a disk failure.
 I'm have a 4 disk configuration.

It really depends on what failure mode you're interested in.  The
most common failure we see from disks in the field is an uncorrectable
read.  Pulling a disk will not simulate an uncorrectable read.

For such tests, there are really two different parts of the system
you are exercising: the fault detection and the recovery/reconfiguration.
When we do RAS benchmarking, we often find that the recovery/reconfiguration
code path is the interesting part and the fault detection less so.
In other words, there will be little difference in the recovery/
reconfiguration between initiating a zpool replace from the command line
vs fault injection.  Unless you are really interested in the maze of
fault detection code, you might want to stick with the command line
interfaces to stimulate a reconfiguration.

If you really do want to stimulate the fault detection code, then a
simple online test which requires no hands-on changes, is to change
the partition table to zero out the size of the partition or slice.
This will have the effect of causing an I/O to receive an ENXIO error
which should then kick off the recovery.

prtvtoc will show you a partition map which can be sent to fmthard -s
to populate the VTOC.  Be careful here, this is a place where mistakes
can be painful to overcome.
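
For example -- the device name below is a placeholder, and keep the saved
copy somewhere safe:

# prtvtoc /dev/rdsk/c1t2d0s2 > /tmp/vtoc.orig
# cp /tmp/vtoc.orig /tmp/vtoc.broken
(edit /tmp/vtoc.broken so the slice ZFS uses has a sector count of 0)
# fmthard -s /tmp/vtoc.broken /dev/rdsk/c1t2d0s2
... watch zpool status and FMA react ...
# fmthard -s /tmp/vtoc.orig /dev/rdsk/c1t2d0s2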

Dtrace can be used to perform all sorts of nasty fault injection,
but that may be more than you want to bite off at first.

b77 adds a zpool failmode property which will allow you to set the
mode to something other than panic -- options are: wait (default),
continue, and panic.  See zpool(1m) for more info.  You will want to
know the failmode if you are experimenting with fault injection.
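
For example (pool name is hypothetical):

# zpool set failmode=continue tank
# zpool get failmode tank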

Finally, you will want to be aware of the FMA commands for viewing
reports and diagnosis status.  See fmadm(1m), fmdump(1m), and fmstat(1m)
If you want to experiment with fault injection, you'll want to pay
particular attention to the SERD engines and reset them between runs.
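
A minimal set of commands for that (the diagnosis module name may vary
slightly between builds):

# fmdump -eV | tail         (raw error telemetry)
# fmadm faulty              (current diagnoses)
# fmstat                    (per-module statistics, including SERD engines)
# fmadm reset zfs-diagnosis (reset the ZFS diagnosis engine between runs)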
  -- richard
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] which would be faster

2007-11-20 Thread Al Hopper
On Tue, 20 Nov 2007, Tim Cook wrote:

 So I have 8 drives total.

 5x500GB seagate 7200.10
 3x300GB seagate 7200.10

 I'm trying to decide, would I be better off just creating two separate pools?

 pool1 = 5x500gb raidz
 pool2= 3x300gb raidz

... reformatted ...

 or would I be better off creating one large pool, with two raid 
 sets?  I'm trying to figure out if it would be faster this way since 
 it should be striping across the two pools (from what I understand). 
 On the other hand, the pool of 3 disks is obviously going to be much 
 slower than the pool of 5.

 In a perfect world I'd just benchmark both ways, but due to some 
 constraints, that may not be possible.  Any insight?


Hi Tim,

Let me give you a 3rd option for your consideration.  In general, 
there is no one-pool-fits-all-workloads solution.  On a 10 disk 
system here, we ended up with a:

5 disk raidz1 pool
2 disk mirror pool
3 disk mirror pool

Each have their strengths/weaknesses.  The raidz set is ideal for 
large file sequential access type workloads - but the IOPS are 
limited to the IOPS of a single drive.  The 3-way mirror is 
ideal for a workload with a high read to write ratio - which describes 
many real-world type workloads (e.g. software development) - since ZFS 
will load balance read ops amoung all members of the mirror set.  So 
read IOPS is 3x the IOPS rating of a single disk.

I would suggest/recommend you configure a 5 disk raidz1 pool (with the 
500Gb disks) and a 2nd pool using a 3-way mirror.  You can then match 
pool/filesystems to the best fit with your different workloads.
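
Something like this, with purely made-up device names:

# zpool create tank raidz  c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0   (5 x 500GB)
# zpool create fast mirror c3t0d0 c3t1d0 c3t2d0                 (3-way mirror)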

Remember the incredibly useful blogs at: http://blogs.sun.com/relling/ 
(Thank you Richard) to determine the relative reliability/failure 
rates of different ZFS configs.

PS: If we had to do it over, I'd probably go with a 6-disk raidz2, 
in place of the 5-disk raidz1 - due to the much higher reliability of 
that config.

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
Graduate from sugar-coating school?  Sorry - I never attended! :)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Recommended many-port SATA controllers for budget ZFS

2007-11-20 Thread Jaz
 the 3124 looks perfect. The only problem is the only thing I found on 
 ebay
 was for the 3132, which is PCIe, which doesn't help me. :) I'm not 
 finding
 anything for 3124 other than the data on silicon image's site.  Do you 
 know
 of any cards I should be looking for that uses this chip?

 http://www.cooldrives.com/sata-cards.html

 There are a couple on there for about $80.  Not quite where you want to 
 get I am sure but it is an option.

 Yep - I see: http://www.cooldrives.com/saiiraco2esa.html for $60.

I got a Sil3114 (4 internal ports) off ebay for $AU30 including postage. 
Didn't look at any PCIe stuff since I'm building up from old parts. 

 Regards,

 Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
 OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
 http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
 Graduate from sugar-coating school?  Sorry - I never attended! :)
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Modify fsid/guid of dataset for NFS failover

2007-11-20 Thread asa
Well then this is probably the wrong list to be hounding

I am looking for something like 
http://blog.wpkg.org/2007/10/26/stale-nfs-file-handle/
Where when fileserver A dies, fileserver B can come up, grab the same  
IP address via some mechanism (in this case I am using Sun Cluster) and  
keep on trucking without the lovely stale file handle errors I am  
encountering.

My clients are Linux, servers are Solaris 10u4.

It seems that it is impossible to change the fsid on Solaris. Can you  
point me towards the appropriate NFS client behavior option lingo if  
you have a minute? (Just the terminology would be great; there are a  
ton of confusing options in the land of NFS: client recovery,  
failover, replicas, etc.)
I am unable to use block-based replication (AVS) underneath the ZFS  
layer because I would like to run with different zpool schemes on each  
server (fast primary server; slower, larger failover server only to be  
used during downtime on the main server).

Worst case scenario here seems to be that I would have to forcibly  
unmount and remount all my client mounts.

Ill start bugging the nfs-discuss people.

Thank you.

Asa

On Nov 12, 2007, at 1:21 PM, Darren J Moffat wrote:

 asa wrote:
 I would like for all my NFS clients to hang during the failover,  
 then  pick up trucking on this new filesystem, perhaps obviously  
 failing  their writes back to the apps which are doing the  
 writing.  Naive?

 The OpenSolaris NFS client does this already - has done since IIRC  
 around Solaris 2.6.  The knowledge is in the NFS client code.

 For NFSv4 this functionality is part of the standard.

 -- 
 Darren J Moffat

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Modify fsid/guid of dataset for NFS failover

2007-11-20 Thread Richard Elling
asa wrote:
 Well then this is probably the wrong list to be hounding

 I am looking for something like 
 http://blog.wpkg.org/2007/10/26/stale-nfs-file-handle/
 Where when fileserver A dies, fileserver B can come up, grab the same  
 IP address via some mechanism(in this case I am using sun cluster) and  
 keep on trucking without the lovely stale file handle errors I am  
 encountering.
   

If you are getting stale file handles, then the Solaris cluster is 
misconfigured.
Please double check the NFS installation guide for Solaris Cluster and
verify that the paths are correct.
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] which would be faster

2007-11-20 Thread Rob Logan

  On the other hand, the pool of 3 disks is obviously
  going to be much slower than the pool of 5

While today that's true, someday I/O will be
balanced by the latency of vdevs rather than
their number... plus two vdevs are always going
to be faster than one vdev, even if one is slower
than the other.

So do 4+1 and 2+1 in the same pool rather than
separate pools. This will let ZFS balance
the load (always) between the two vdevs rather than
you trying to balance the load between pools.
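
e.g. (device names made up):

# zpool create tank raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 \
                    raidz c3t0d0 c3t1d0 c3t2d0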

Rob

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Modify fsid/guid of dataset for NFS failover

2007-11-20 Thread asa
I am rolling my own replication using zfs send|recv through the  
cluster agent framework and a custom HA shared local storage set of  
scripts (similar to http://www.posix.brte.com.br/blog/?p=75 but  
without AVS).  I am not using ZFS off of shared storage in the  
supported way, so this is a bit of a lonely area. =)

As these are two different ZFS volumes on different zpools with  
differing underlying vdev topology, it appears they are not sharing  
the same fsid and so are presumably presenting different file handles  
from each other.

I have the cluster parts out of the way (mostly =)); I now need to  
solve the NFS side of things at the point of failing over.

I have isolated ZFS out of the equation: I receive the same stale file  
handle errors if I try to share an arbitrary UFS folder to the client  
through the cluster interface.

Yeah I am a hack.

Asa

On Nov 20, 2007, at 7:27 PM, Richard Elling wrote:

 asa wrote:
 Well then this is probably the wrong list to be hounding

 I am looking for something like 
 http://blog.wpkg.org/2007/10/26/stale-nfs-file-handle/
 Where when fileserver A dies, fileserver B can come up, grab the  
 same  IP address via some mechanism(in this case I am using sun  
 cluster) and  keep on trucking without the lovely stale file handle  
 errors I am  encountering.


 If you are getting stale file handles, then the Solaris cluster is  
 misconfigured.
 Please double check the NFS installation guide for Solaris Cluster and
 verify that the paths are correct.
 -- richard


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raidz DEGRADED state

2007-11-20 Thread Joe Little
On Nov 20, 2007 6:34 AM, MC [EMAIL PROTECTED] wrote:
  So there is no current way to specify the creation of
  a 3 disk raid-z
  array with a known missing disk?

 Can someone answer that?  Or does the zpool command NOT accommodate the 
 creation of a degraded raidz array?


You can't start it degraded, but you can make it so...

If one can make a sparse file, then you'd be set. Just create the
file, make a zpool out of the two disks and the file, and then drop
the file from the pool _BEFORE_ copying over the data. I believe then
you can add the third disk as a replacement. The gotcha (and why the
sparse file may be needed) is that the pool will only use, per disk,
the size of the smallest disk.
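
A sketch of that, with made-up names (try it on scratch disks first):

# mkfile -n 500g /var/tmp/fake           (sparse, so it takes ~no space)
# zpool create tank raidz c2t0d0 c2t1d0 /var/tmp/fake
# zpool offline tank /var/tmp/fake       (pool runs DEGRADED from here on)
... copy the data in ...
# zpool replace tank /var/tmp/fake c2t2d0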



 This message posted from opensolaris.org
 ___

 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS + DB + fragments

2007-11-20 Thread can you guess?
...

 just rearrange your blocks sensibly -
 and to at least some degree you could do that while
 they're still cache-resident

Lots of discussion has passed under the bridge since that observation above, 
but it may have contained the core of a virtually free solution:  let your 
table become fragmented, but each time that a sequential scan is performed on 
it determine whether the region that you're currently scanning is 
*sufficiently* fragmented that you should retain the sequential blocks that 
you've just had to access anyway in cache until you've built up around 1 MB of 
them and then (in a background thread) flush the result contiguously back to a 
new location in a single bulk 'update' that changes only their location rather 
than their contents.

1.  You don't incur any extra reads, since you were reading sequentially anyway 
and already have the relevant blocks in cache.  Yes, if you had reorganized 
earlier in the background the current scan would have gone faster, but if scans 
occur sufficiently frequently for their performance to be a significant issue 
then the *previous* scan will probably not have left things *all* that 
fragmented.  This is why you choose a fragmentation threshold to trigger reorg 
rather than just do it whenever there's any fragmentation at all, since the 
latter would probably not be cost-effective in some circumstances; conversely, 
if you only perform sequential scans once in a blue moon, every one may be 
completely fragmented but it probably wouldn't have been worth defragmenting 
constantly in the background to avoid this, and the occasional reorg triggered 
by the rare scan won't constitute enough additional overhead to justify heroic 
efforts to avoid it.  Such a 'threshold' is a crude but possi
 bly adequate metric; a better but more complex one would perhaps nudge up the 
threshold value every time a sequential scan took place without an intervening 
update, such that rarely-updated but frequently-scanned files would eventually 
approach full contiguity, and an even finer-grained metric would maintain such 
information about each individual *region* in a file, but absent evidence that 
the single, crude, unchanging threshold (probably set to defragment moderately 
aggressively - e.g., whenever it takes more than 3 or 5 disk seeks to inhale a 
1 MB region) is inadequate, these sound a bit like over-kill.

2.  You don't defragment data that's never sequentially scanned, avoiding 
unnecessary system activity and snapshot space consumption.

3.  You still incur additional snapshot overhead for data that you do decide to 
defragment for each block that hadn't already been modified since the most 
recent snapshot, but performing the local reorg as a batch operation means that 
only a single copy of all affected ancestor blocks will wind up in the snapshot 
due to the reorg (rather than potentially multiple copies in multiple snapshots 
if snapshots were frequent and movement was performed one block at a time).

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS snapshot GUI

2007-11-20 Thread Anton B. Rang
How does the ability to set a snapshot schedule for a particular *file* or 
*folder* interact with the fact that ZFS snapshots are on a per-filesystem 
basis?  This seems a poor fit.  If I choose to snapshot my "Important 
Documents" folder every 5 minutes, that's implicitly creating snapshots of my 
"Giant Video Downloads" folder every 5 minutes too, if they're both in the same 
file system.  It seems unwise not to expose this to the user.
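
For instance (dataset name hypothetical):

# zfs snapshot tank/export/home@every5min

necessarily captures everything under tank/export/home -- the "Important 
Documents" folder and the "Giant Video Downloads" folder alike.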

One possibility would be for the "enable snapshots" menu item to implicitly 
apply to the root of the file system in which the selected item is.  So in the 
example shown, right-clicking on Documents would bring up a dialog labeled 
something like "Automatic snapshots for /home/cb114949".

==

I don't think it's a good idea to replace "Enable Automatic Snapshots" by 
"Restore from Snapshot" because there's no obvious way to "Disable Automatic 
Snapshots" (or change their properties). (It appears one could probably do that 
from the properties dialog, but that's certainly not obvious to a user who has 
turned this on using the menu and now wants to make a change -- if you can turn 
it on in the menu, you should be able to turn it off in the menu too.)

==

If "Roll back" affects the whole file system, it definitely should NOT be an 
option when right-clicking on a file or folder within the file system! This is 
a recipe for disaster. I would not present this as an option at all -- it's 
already in the "Restore Files" dialog.

Also, "All files will be restored" is not a good description for rollback.  
That really means "All changes since the selected snapshot will be lost."  I 
can readily imagine a user thinking, "I deleted three files, so if I choose to 
restore all files, I'll get those three back [without losing the other work 
I've done]."

==

Just a few random comments.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss