Re: [zfs-discuss] ZFS: unreliable for professional usage?

2009-02-13 Thread David Collier-Brown


Bob Friesenhahn wrote:
 On Fri, 13 Feb 2009, Ross wrote:

 Something like that will have people praising ZFS' ability to
 safeguard their data, and the way it recovers even after system
 crashes or when hardware has gone wrong.  You could even have a
 "common causes of this are..." message, or a link to an online help
 article if you wanted people to be really impressed.
 
 I see a career in politics for you.  Barring an operating system
 implementation bug, the type of problem you are talking about is due to
 improperly working hardware.  Irreversibly reverting to a previous
 checkpoint may or may not obtain the correct data.  Perhaps it will
 produce a bunch of checksum errors.

Actually that's a lot like FMA replies when it sees a problem,
telling the person what happened and pointing them to a web page
which can be updated with the newest information on the problem.

That's a good spot for "This pool was not unmounted cleanly due
to a hardware fault and data has been lost.  The <name of timestamp>
line contains the date which can be recovered to.  Use the command
  # zfs reframbulocate this that -t timestamp
to revert to <timestamp>."

--dave
-- 
David Collier-Brown| Always do right. This will gratify
Sun Microsystems, Toronto  | some people and astonish the rest
dav...@sun.com |  -- Mark Twain
cell: (647) 833-9377, bridge: (877) 385-4099 code: 506 9191#
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Does your device honor write barriers?

2009-02-10 Thread David Collier-Brown
Peter Schuller wrote:
 It would actually be nice in general I think, not just for ZFS, to
 have some standard "run this tool" that will give you a checklist of
 successes/failures that specifically target storage
 correctness. Though correctness cannot be proven, you can at least
 test for common cases of systematic incorrect behavior.

A tiny niggle: for an operation set of moderate size, you can
generate an exhaustive set of tests.  I've done so for APIs,
but unless you have infinite spare time, you want to
generate the test set with a tool (;-))
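
To make that concrete, here's a rough, untested sketch of what I mean by
generating the matrix with a tool -- the operations and parameters below
are invented, not anyone's real test plan:

  import itertools

  # Sketch: enumerate an exhaustive test matrix for a small operation set.
  # The operation names and parameters are hypothetical, for illustration only.
  operations = ["write", "fsync", "pull-power", "read-back"]
  block_sizes = [512, 4096, 131072]
  sync_modes = ["O_DSYNC", "none"]

  cases = list(itertools.product(operations, block_sizes, sync_modes))
  for n, (op, bs, sync) in enumerate(cases, 1):
      print("case %03d: op=%s block_size=%d sync=%s" % (n, op, bs, sync))
  print("%d cases generated" % len(cases))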

--dave (who hasn't even Copious Spare Time, much less Infinite) c-b
-- 
David Collier-Brown| Always do right. This will gratify
Sun Microsystems, Toronto  | some people and astonish the rest
dav...@sun.com |  -- Mark Twain
cell: (647) 833-9377, bridge: (877) 385-4099 code: 506 9191#
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS core contributor nominations

2009-02-03 Thread David Collier-Brown
+1 utterly!

Mark Shellenbaum wrote:
 Neelakanth Nadgir wrote:
 +1.

 I would like to nominate roch.bourbonn...@sun.com for his work on
 improving the performance of ZFS over the last few years.

 thanks,
 -neel


 +1 on Roch being a core contributor.
 
 
 On Feb 2, 2009, at 4:02 PM, Neil Perrin wrote:

 Looks reasonable
 +1

 Neil.

 On 02/02/09 08:55, Mark Shellenbaum wrote:
 The time has come to review the current Contributor and Core  
 contributor
 grants for ZFS.  Since all of the ZFS core contributors grants are  
 set
 to expire on 02-24-2009 we need to renew the members that are still
 contributing at core contributor levels.   We should also add some  
 new
 members to both Contributor and Core contributor levels.

 First the current list of Core contributors:

 Bill Moore (billm)
 Cindy Swearingen (cindys)
 Lori M. Alt (lalt)
 Mark Shellenbaum (marks)
 Mark Maybee (maybee)
 Matthew A. Ahrens (ahrens)
 Neil V. Perrin (perrin)
 Jeff Bonwick (bonwick)
 Eric Schrock (eschrock)
 Noel Dellofano (ndellofa)
 Eric Kustarz (goo)*
 Georgina A. Chua (chua)*
 Tabriz Holtz (tabriz)*
 Krister Johansen (johansen)*

 All of these should be renewed at Core contributor level, except for
 those with a *.  Those with a * are no longer involved with ZFS  
 and
 we should let their grants expire.

 I am nominating the following to be new Core Contributors of ZFS:

 Jonathan W. Adams (jwadams)
 Chris Kirby
 Lin Ling
 Eric C. Taylor (taylor)
 Mark Musante
 Rich Morris
 George Wilson
 Tim Haley
 Brendan Gregg
 Adam Leventhal
 Pawel Jakub Dawidek
 Ricardo Correia

 For Contributor I am nominating the following:
 Darren Moffat
 Richard Elling

 I am voting +1 for all of these (including myself)

 Feel free to nominate others for Contributor or Core Contributor.


-Mark



 

-- 
David Collier-Brown| Always do right. This will gratify
Sun Microsystems, Toronto  | some people and astonish the rest
dav...@sun.com |  -- Mark Twain
cell: (647) 833-9377, bridge: (877) 385-4099 code: 506 9191#
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why is st_size of a zfs directory equal to the

2009-01-14 Thread David Collier-Brown
Richard L. Hamilton rlha...@smart.net wrote:
 I did find the earlier discussion on the subject (someone e-mailed me that 
 there had been
 such).  It seemed to conclude that some apps are statically linked with old 
 scandir() code
 that (incorrectly) assumed that the number of directory entries could be 
 estimated as
 st_size/24; and worse, that some such apps might be seeing the small st_size 
 that zfs
 offers via NFS, so they might not even be something that could be fixed on 
 Solaris at all.
 But I didn't see anything in the discussion that suggested that this was 
 going to be changed.
 Nor did I see a compelling argument for leaving it the way it is, either.  
 In the face of
 undefined, all arguments end up as pragmatism rather than principle, IMO.
 
Joerg Schilling wrote:
 This is a problem I had to fix for some customers in 1992 when people started 
 to use NFS 
 servers based on the Novell OS.
 Jörg
 

  Oh bother, I should have noticed this back in 1999/2001 (;-))

  Joking aside, we were looking at the Solaris ABI (Application
Binary Interface) and working on ensuring binary stability.  The
size of a directory entry was supposed to be undefined and in
principle *variable*, but Novell et al. seem to have assumed that
the size they used was guaranteed to be the same for all time.

  And no machine needs more than 640 KB of memory, either...
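
To see how badly the heuristic breaks, here's a rough sketch (path invented,
untested) comparing the old st_size/24 estimate against an actual count of
entries; on a ZFS directory, where st_size is small and not a multiple of
any fixed record size, the two diverge wildly:

  import os

  # The old, incorrect heuristic: assume fixed 24-byte directory records.
  def entries_estimated(path):
      return os.stat(path).st_size // 24

  # The portable way: actually read the directory.
  def entries_actual(path):
      return len(os.listdir(path))

  path = "/tank/home/dave"    # hypothetical ZFS-backed directory
  print("st_size:     ", os.stat(path).st_size)
  print("st_size / 24:", entries_estimated(path))
  print("actual count:", entries_actual(path))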

  Ah well, at least the ZFS folks found it for us, so I can add
it to my database of porting problems.  What OSs did you folks
find it on?

--dave (an external consultant, these days) c-b
-- 
David Collier-Brown| Always do right. This will gratify
Sun Microsystems, Toronto  | some people and astonish the rest
dav...@sun.com |  -- Mark Twain
cell: (647) 833-9377, bridge: (877) 385-4099 code: 506 9191#
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Tuning for a file server, disabling data cache (almost)

2008-10-17 Thread David Collier-Brown
Marcelo Leal [EMAIL PROTECTED] wrote:
 Hello all,
  I think he's got a point here... maybe that would be an interesting 
 feature for that kind of workload. Caching all the metadata would make
 the rsync task faster (for many files). Trying to cache the data is really a
 waste of time, because the data will not be read again, and it will just push
 the good cached metadata out. That is what I understood when he said
 the 96k would be discarded soon. He wants to configure an area to 
 copy the data, and that's it. Leave my metadata cache alone. ;-)

That's a common enough behavior pattern that Per Brinch Hansen
defined a distinct filetype for it in, if memory serves, the RC 4000.
As soon as it's read, it's gone.

   We saw this behavior on NFS servers in the Markham ACE lab, and
absolutely with Samba almost everywhere.  My Smarter Colleagues[tm]
explained it as a normal pattern whenever you have front-end
caching, as backend caching is then rendered far less effective, and
sometimes directly disadvantageous.

   It sounded like, from the previous discussion, one could tune
for it with the level 1 and 2 caches, although if I understood
it properly, the particular machine also had to narrow a stripe
for the particular load being discussed...
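
If memory serves, newer ZFS builds expose exactly that knob as per-dataset
cache policy; roughly, it might look like this (the dataset name is invented,
and the primarycache/secondarycache properties are assumed to be available
on the build in question):

  import subprocess

  # Sketch only: restrict the ARC (level 1) and L2ARC (level 2) caches to
  # metadata for a bulk-copy dataset, so streaming data doesn't evict the
  # metadata we actually want to keep.
  dataset = "tank/rsync-staging"      # hypothetical dataset
  subprocess.check_call(["zfs", "set", "primarycache=metadata", dataset])
  subprocess.check_call(["zfs", "set", "secondarycache=metadata", dataset])
  subprocess.check_call(["zfs", "get", "primarycache,secondarycache", dataset])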

--dave
-- 
David Collier-Brown| Always do right. This will gratify
Sun Microsystems, Toronto  | some people and astonish the rest
[EMAIL PROTECTED] |  -- Mark Twain
cell: (647) 833-9377, bridge: (877) 385-4099 code: 506 9191#
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Sidebar re ABI stability (was Segmentation fault / core dump)

2008-10-01 Thread David Collier-Brown
[EMAIL PROTECTED] wrote:
 Linux does not implement stable kernel interfaces. It may be that there is 
 an intention to do so, but I've seen problems on Linux resulting from
 self-incompatibility on a regular basis.

To be precise, Linus tries hard to prevent ABI changes in the system
call interfaces exported from the kernel, but the glibc team has
defeated him in the past.  For example, they accidentally started
returning ENOTSUP from getgid when one had a library version
mismatch (!).

Sun stabilizes both library and system call interfaces: I used to 
work on that with David J. Brown's team, back when I was an
employee.

--dave (who's a contractor) c-b
-- 
David Collier-Brown| Always do right. This will gratify
Sun Microsystems, Toronto  | some people and astonish the rest
[EMAIL PROTECTED] |  -- Mark Twain
cell: (647) 833-9377, bridge: (877) 385-4099 code: 506 9191#
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sidebar to ZFS Availability discussion

2008-09-01 Thread David Collier-Brown
:
 
 Case 1. Fully redundant storage array with active/active controllers.
A failed controller should cause the system to recover on the
surviving controller.  I have some lab test data for this sort of 
 thing
and some popular arrays can take on the order of a minute to
complete the failure detection and reconfiguration.  You don't
want to degrade the vdev when this happens, you just want to
wait until the array is again ready for use (this works ok today.)
I would further argue that no disk failure prediction code would
be useful for this case.
 
 Case 2.  Power on test.  I had a bruise (no scar :-) once from an
integrated product we were designing
   http://docs.sun.com/app/docs/coll/cluster280-3
which had a server (or two) and raid array (or two).  If you build
such a system from scratch, then it will fail a power-on test.  
 If you
power on the rack containing these systems, then the time required
for the RAID array to boot was longer than the time required for
the server to boot *and* timeout probes of the array.  The result
    was that the volume manager would declare the disks bad and
    system administrator intervention was required to regain access to
the data in the array.  Since this was an integrated product, we
solved it by inducing a delay loop in the server boot cycle to
slow down the server.  Was it the best possible solution?  No, but
it was the only solution which met our other design constraints.
 
 In both of these cases, the solutions imply multi-minute timeouts are
 required to maintain a stable system.  For 101-level insight to this sort
 of problem see the Sun BluePrint article (an oldie, but goodie):
 http://www.sun.com/blueprints/1101/clstrcomplex.pdf
 

--dave
-- 
David Collier-Brown| Always do right. This will gratify
Sun Microsystems, Toronto  | some people and astonish the rest
[EMAIL PROTECTED] |  -- Mark Twain
cell: (647) 833-9377, bridge: (877) 385-4099 code: 506 9191#
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Sidebar to ZFS Availability discussion

2008-08-31 Thread David Collier-Brown
Re Availability: ZFS needs to handle disk removal / 
 driver failure better
 A better option would be to not use this to perform FMA diagnosis, but
 instead work into the mirror child selection code.  This has already
 been alluded to before, but it would be cool to keep track of latency
 over time, and use this to both a) prefer one drive over another when
 selecting the child and b) proactively timeout/ignore results from one
 child and select the other if it's taking longer than some historical
 standard deviation.  This keeps away from diagnosing drives as faulty,
 but does allow ZFS to make better choices and maintain response times.
 It shouldn't be hard to keep track of the average and/or standard
 deviation and use it for selection; proactively timing out the slow I/Os
 is much trickier.

  Interestingly, tracking latency has come under discussion in the
Linux world, too, as they start to deal with developing resource
management for disks as well as CPU.

  In fact, there are two cases where you can use a feedback loop to
adjust disk behavior, and a third to detect problems. The first 
loop is the one you identified, for dealing with near/far and
fast/slow mirrors.

  The second is for resource management, where one throttles
disk-hog projects when one discovers latency growing without
bound on disk saturation, and the third is in case of a fault
other than the above.

  For the latter to work well, I'd like to see the resource management
and fast/slow mirror adaptation be something one turns on explicitly,
because then, when FMA discovers that you in fact have a fast/slow
mirror or a Dr. Evil program saturating the array, the fix
could be to notify the sysadmins that they have a problem and
suggest built-in tools to ameliorate it. 
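
Roughly what I have in mind for the first loop -- keep a running mean and
standard deviation of latency per mirror child and prefer the child that
looks fastest.  Device names, numbers and thresholds below are all invented,
and this is an untested sketch, not anything in ZFS today:

  # Track per-child I/O latency (Welford's running mean/variance) and pick
  # the child whose recent behavior looks best.
  class ChildStats(object):
      def __init__(self, name):
          self.name = name
          self.n = 0
          self.mean = 0.0
          self.m2 = 0.0    # sum of squared deviations, for the variance

      def record(self, latency_ms):
          self.n += 1
          delta = latency_ms - self.mean
          self.mean += delta / self.n
          self.m2 += delta * (latency_ms - self.mean)

      def stddev(self):
          return (self.m2 / self.n) ** 0.5 if self.n > 1 else 0.0

  def pick_child(children):
      # Crude "expected worst case" score: mean plus one standard deviation.
      return min(children, key=lambda c: c.mean + c.stddev())

  local = ChildStats("c1t0d0")        # fast local side (hypothetical)
  remote = ChildStats("iscsi-lun0")   # slow remote side (hypothetical)
  for ms in (4.0, 5.0, 4.5):
      local.record(ms)
  for ms in (35.0, 60.0, 48.0):
      remote.record(ms)
  print("read from:", pick_child([local, remote]).name)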

 
Ian Collins writes: 
 One solution (again, to be used with a remote mirror) is the three way 
 mirror.  If two devices are local and one remote, data is safe once the 
 two local writes return.  I guess the issue then changes from "is my 
 data safe" to "how safe is my data".  I would be reluctant to deploy a 
 remote mirror device without local redundancy, so this probably won't be 
 an uncommon setup.  There would have to be an acceptable window of risk 
 when local data isn't replicated.

  And in this case too, I'd prefer the sysadmin provide the information
to ZFS about what she wants, and have the system adapt to it, and
report how big the risk window is.

  This would effectively change the FMA behavior, you understand, so as 
to have it report failures to complete the local writes in time t0 and 
remote in time t1, much as the resource management or fast/slow cases would
need to be visible to FMA.

--dave (at home) c-b

-- 
David Collier-Brown| Always do right. This will gratify
Sun Microsystems, Toronto  | some people and astonish the rest
[EMAIL PROTECTED] |  -- Mark Twain
cell: (647) 833-9377, bridge: (877) 385-4099 code: 506 9191#
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed

2008-07-29 Thread David Collier-Brown
 where you would store this 
information, but wherever that is, zpool status should be reporting 
the error and directing the admin to the log file.
 
I would probably say this could be safely stored on the system drive.  
Would it be possible to have a number of possible places to store this 
log?  What I'm thinking is that if the system drive is unavailable, 
ZFS could try each pool in turn and attempt to store the log there.
 
In fact e-mail alerts or external error logging would be a great 
addition to ZFS.  Surely it makes sense that filesystem errors would 
be better off being stored and handled externally?
 
Ross
 



Date: Mon, 28 Jul 2008 12:28:34 -0700
From: [EMAIL PROTECTED]
Subject: Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
To: [EMAIL PROTECTED]

I'm trying to reproduce and will let you know what I find.
-- richard




  
 
 

-- 
David Collier-Brown| Always do right. This will gratify
Sun Microsystems, Toronto  | some people and astonish the rest
[EMAIL PROTECTED] |  -- Mark Twain
cell: (647) 833-9377, bridge: (877) 385-4099 code: 506 9191#
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-25 Thread David Collier-Brown
And do you really have 4-deep RAID 1 mirrors, not 4-wide RAID 0 stripes?

--dave

Robert Milkowski wrote:
 Hello Tharindu,
 
 
 Thursday, July 24, 2008, 6:02:31 AM, you wrote:
 
  We do not use raidz*. Virtually, no raid or stripe through OS.
  We have 4 disk RAID1 volumes.  RAID1 was created from CAM on 2540.
  2540 does not have RAID 1+0 or 0+1.
 
 Of course it does 1+0. Just add more drives to RAID-1.
 
 -- 
 Best regards,
  Robert Milkowski   mailto:[EMAIL PROTECTED]
 http://milek.blogspot.com

-- 
David Collier-Brown| Always do right. This will gratify
Sun Microsystems, Toronto  | some people and astonish the rest
[EMAIL PROTECTED] |  -- Mark Twain
(905) 943-1983, cell: (647) 833-9377, (800) 555-9786 x56583
bridge: (877) 385-4099 code: 506 9191#
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [zfs-code] Peak every 4-5 second

2008-07-24 Thread David Collier-Brown
  Hmmn, that *sounds* as if you are saying you've a very-high-redundancy
RAID1 mirror, 4 disks deep, on an 'enterprise-class tier 2 storage' array
that doesn't support RAID 1+0 or 0+1. 

  That sounds weird: the 2540 supports RAID levels 0, 1, (1+0), 3 and 5,
and deep mirrors are normally only used on really fast equipment in
mission-critical tier 1 storage...

  Are you sure you don't mean you have raid 0 (stripes) 4 disks wide,
each stripe presented as a LUN?

  If you really have 4-deep RAID 1, you have a configuration that will
perform somewhat slower than any single disk, as the array launches
4 writes to 4 drives in parallel, and returns success when they
all complete.

  If you had 4-wide RAID 0, with mirroring done at the host, you would
have a configuration that would (probabilistically) perform better than 
a single drive when writing to each side of the mirror, and the write
would return success when the slowest side of the mirror completed.
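
A quick, invented-numbers simulation shows the difference -- the 4-deep
mirrored write tracks the slowest of its four members, while a single
drive (or one column of a stripe) does not.  The latency model below is
entirely made up, purely to illustrate the shape of the effect:

  import random

  # Toy model: a mirrored write completes when the *slowest* of its N drives
  # completes, so a 4-deep RAID 1 write is max(d1..d4); a single drive is d1.
  random.seed(1)

  def service_time_ms():
      # crude: ~5 ms seek/rotate plus an occasional slow outlier
      t = random.gauss(5.0, 1.5)
      if random.random() < 0.02:
          t += 30.0
      return max(t, 0.1)

  N = 50000
  single = [service_time_ms() for _ in range(N)]
  mirror4 = [max(service_time_ms() for _ in range(4)) for _ in range(N)]

  def avg(xs):
      return sum(xs) / len(xs)

  print("single drive, mean write latency: %.2f ms" % avg(single))
  print("4-deep mirror, mean write latency: %.2f ms" % avg(mirror4))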

 --dave (puzzled!) c-b

Tharindu Rukshan Bamunuarachchi wrote:
 We do not use raidz*. Virtually, no raid or stripe through OS.
 
 We have 4 disk RAID1 volumes.  RAID1 was created from CAM on 2540.
 
 2540 does not have RAID 1+0 or 0+1.
 
 cheers
 tharindu
 
 Brandon High wrote:
 
On Tue, Jul 22, 2008 at 10:35 PM, Tharindu Rukshan Bamunuarachchi
[EMAIL PROTECTED] wrote:
  

Dear Mark/All,

Our trading system is writing to local and/or array volume at 10k
messages per second.
Each message is about 700bytes in size.

Before ZFS, we used UFS.
Even with UFS, there was a peak every 5 seconds due to fsflush invocation.

However, each peak is about ~5ms.
Our application cannot recover from such high latency.



Is the pool using raidz, raidz2, or mirroring? How many drives are you using?

-B

  

 
 
 

-- 
David Collier-Brown| Always do right. This will gratify
Sun Microsystems, Toronto  | some people and astonish the rest
[EMAIL PROTECTED] |  -- Mark Twain
(905) 943-1983, cell: (647) 833-9377, (800) 555-9786 x56583
bridge: (877) 385-4099 code: 506 9191#
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] OT: Formatting Problem of ZFS Adm Guide (pdf)

2008-07-23 Thread David Collier-Brown
  One can carve furniture with an axe, especially if it's razor-sharp,
but that doesn't make it a spokeshave, plane and saw.

I love StarOffice, and use it every day, but my publisher uses
FrameMaker, so that's what I use for books.

--dave

W. Wayne Liauh wrote:
I doubt so. Star/OpenOffice are word processors...
and like Word they are not suitable for typesetting
documents.

SGML, FrameMaker and TeX/LaTeX are the only ones
capable of doing that.
 
 
 This was pretty much true about a year ago.  However, after version 2.3, 
 which adds the kerning feature, OpenOffice.org can produce very 
 professional-looking documents.
 
 All of the OOo User Guides, which are every bit as complex as if not more so 
 than our own user guides, are now self-generated.  Solveig Haugland, a 
 highly respected OpenOffice.org consultant, published her book 
 "OpenOffice.org 2 Guidebook" (a 527-page book complete with drawings, table 
 of contents, multi-column index, etc.) entirely on OOo.
 
 Another key consideration, in addition to perhaps a desire to support our 
 sister product, is that the documents so generated are guaranteed to be 
 displayable on the OS they are intended to serve.  This is a pretty important 
 consideration IMO.  :-)
  
  
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 

-- 
David Collier-Brown| Always do right. This will gratify
Sun Microsystems, Toronto  | some people and astonish the rest
[EMAIL PROTECTED] |  -- Mark Twain
(905) 943-1983, cell: (647) 833-9377, (800) 555-9786 x56583
bridge: (877) 385-4099 code: 506 9191#
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS deduplication

2008-07-08 Thread David Collier-Brown
  Hmmn, you might want to look at Andrew Tridgell's thesis (yes,
Andrew of Samba fame), as he had to solve this very question
to be able to select an algorithm to use inside rsync.

--dave

Darren J Moffat wrote:
 [EMAIL PROTECTED] wrote:
 
[EMAIL PROTECTED] wrote on 07/08/2008 03:08:26 AM:


Does anyone know a tool that can look over a dataset and give
duplication statistics? I'm not looking for something incredibly
efficient but I'd like to know how much it would actually benefit our

Check out the following blog..:

http://blogs.sun.com/erickustarz/entry/how_dedupalicious_is_your_pool

Just want to add,  while this is ok to give you a ballpark dedup number --
fletcher2 is notoriously collision prone on real data sets.  It is meant to
be fast at the expense of collisions.  This issue can show much more dedup
possible than really exists on large datasets.
 
 
 Doing this using sha256 as the checksum algorithm would be much more 
 interesting.  I'm going to try that now and see how it compares with 
 fletcher2 for a small contrived test.
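
For anyone who wants a rough sha256-based estimate of their own data,
something like this file-walking sketch would do.  It only approximates the
pool's real records (I just chop files into fixed 128K chunks), and the
path is invented:

  import hashlib, os

  # Crude dedup estimator: hash fixed-size chunks of every file under a
  # directory with sha256 and count how many chunks are duplicates.
  ROOT = "/tank/data"          # hypothetical dataset mountpoint
  CHUNK = 128 * 1024           # pretend recordsize=128K

  seen = {}
  total = 0
  for dirpath, dirnames, filenames in os.walk(ROOT):
      for fname in filenames:
          path = os.path.join(dirpath, fname)
          try:
              with open(path, "rb") as f:
                  while True:
                      block = f.read(CHUNK)
                      if not block:
                          break
                      digest = hashlib.sha256(block).digest()
                      seen[digest] = seen.get(digest, 0) + 1
                      total += 1
          except (IOError, OSError):
              continue

  unique = len(seen)
  if total:
      print("blocks scanned: %d, unique: %d, dedup ratio ~ %.2fx"
            % (total, unique, float(total) / unique))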
 

-- 
David Collier-Brown| Always do right. This will gratify
Sun Microsystems, Toronto  | some people and astonish the rest
[EMAIL PROTECTED] |  -- Mark Twain
(905) 943-1983, cell: (647) 833-9377, (800) 555-9786 x56583
bridge: (877) 385-4099 code: 506 9191#
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Some basic questions about getting the best performance for database usage

2008-06-30 Thread David Collier-Brown
  This is a bit of a sidebar to the discussion about getting the 
best performance for PostgreSQL from ZFS, but may affect
you if you're doing sequential scans through the 70GB table
or its segments.

  ZFS copy-on-write results in tables' contents being spread across
the full width of their stripe, which is arguably a good thing
for transaction processing performance (or at least can be), but
makes sequential table-scan speed degrade.
 
  If you're doing sequential scans over large amounts of data
which isn't changing very rapidly, such as older segments, you
may want to re-sequentialize that data.

 I was talking to one of the Slony developers back when this
first came up, and he suggested a process to do this in PostgreSQL.

  He suggested doing a CLUSTER operation, relative to a specific
index, then dropping and recreating the index.  This results in the
relation being rewritten in the order the index is sorted by, which
should defragment/linearize it.  Dropping and recreating
the index rewrites it sequentially, too.

  Neither he nor I know the cost if the relation has more than one
index: we speculate they should be dropped before the clustering
and recreated last.
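
Roughly, the recipe he described would look like this.  The table, index and
connection names are invented, the CLUSTER ... USING syntax assumes
PostgreSQL 8.3 or later, and psycopg2 is just one convenient way to drive it:

  import psycopg2

  # Sketch of the "cluster, then drop and recreate the index" recipe.
  conn = psycopg2.connect("dbname=trades")
  conn.autocommit = True      # run each statement in its own transaction
  cur = conn.cursor()

  # Rewrite the relation in index order (linearizes it on disk).
  cur.execute("CLUSTER ticks USING ticks_ts_idx")

  # Rewrite the index itself sequentially by dropping and recreating it.
  cur.execute("DROP INDEX ticks_ts_idx")
  cur.execute("CREATE INDEX ticks_ts_idx ON ticks (ts)")

  cur.close()
  conn.close()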

 --dave
-- 
David Collier-Brown| Always do right. This will gratify
Sun Microsystems, Toronto  | some people and astonish the rest
[EMAIL PROTECTED] |  -- Mark Twain
(905) 943-1983, cell: (647) 833-9377, (800) 555-9786 x56583
bridge: (877) 385-4099 code: 506 9191#
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Some basic questions about getting the best performance for database usage

2008-06-30 Thread David Collier-Brown
David Collier-Brown wrote:
   ZFS copy-on-write results in tables' contents being spread across
 the full width of their stripe, which is arguably a good thing
 for transaction processing performance (or at least can be), but
 makes sequential table-scan speed degrade.
  
   If you're doing sequential scans over large amounts of data
 which isn't changing very rapidly, such as older segments, you
 may want to re-sequentialize that data.

Richard Elling [EMAIL PROTECTED] wrote 
 There is a general feeling that COW, as used by ZFS, will cause
 all sorts of badness for database scans.  Alas, there is a dearth of
 real-world data on any impacts (I'm anxiously awaiting...)
 There are cases where this won't be a problem at all, but it will
 depend on how you use the data.

I quite agree: at some point, the experts on Oracle, MySQL and
PostgreSQL will get a clear understanding of how to get the
best performance for random database I/O and ZFS.  I'll be
interested to see what the behavior is for large, high-performance
systems. In the meantime...

 In this particular case, it would be cost effective to just buy a
 bunch of RAM and not worry too much about disk I/O during
 scans.  In the future, if you significantly outgrow the RAM, then
 there might be a case for a ZFS (L2ARC) cache LUN to smooth
 out the bumps.  You can probably defer that call until later.

... it's a Really Nice Thing that large memories only cost small 
dollars (;-))

--dave
-- 
David Collier-Brown| Always do right. This will gratify
Sun Microsystems, Toronto  | some people and astonish the rest
[EMAIL PROTECTED] |  -- Mark Twain
(905) 943-1983, cell: (647) 833-9377, (800) 555-9786 x56583
bridge: (877) 385-4099 code: 506 9191#
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Issue with simultaneous IO to lots of ZFS pools

2008-05-01 Thread David Collier-Brown
Darren J Moffat [EMAIL PROTECTED] wrote:
 Chris Siebenmann wrote:
| Still, I'm curious -- why lots of pools?  Administration would be
| simpler with a single pool containing many filesystems.

 The short answer is that it is politically and administratively easier
to use (at least) one pool per storage-buying group in our environment.
 
 
 I think the root cause of the issue is that multiple groups are buying 
 physical rather than virtual storage, yet it is all being attached to a 
 single system.  It will likely be a huge uphill battle, but: if all the 
 physical storage could be purchased by one group, and a combination of 
 ZFS reservations and quotas used on top-level datasets (i.e. one level down from 
 the pool) to allocate the virtual storage, and appropriate 
 amounts charged to the groups, you could technically use ZFS 
 as it was intended, with far fewer (hopefully 1 or 2) pools.

The scenario Chris describes is one I see repeatedly at customers
buying SAN storage (as late as last month!) and is considered
a best practice on the business side.

We may want to make this issue and its management visible, as
people moving from SAN to ZFS are likely to trip over it.

In particular, I'd like to see a blueprint or at least a 
wiki discussion by someone from the SAN world on how to 
map those kinds of purchases to ZFS pools, how few one 
wants to have, what happens when it goes wrong, and how 
to mitigate it (;-))
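
For reference, the one-pool allocation Darren describes above would look
roughly like this -- pool name, group names and sizes are all invented:

  import subprocess

  # Sketch: one shared pool, with each storage-buying group given a
  # top-level dataset whose reservation guarantees what they paid for
  # and whose quota caps it.
  def zfs(*args):
      subprocess.check_call(("zfs",) + args)

  groups = {"physics": "10T", "genomics": "25T", "admin": "2T"}
  for group, size in groups.items():
      dataset = "bigpool/" + group
      zfs("create", dataset)
      zfs("set", "reservation=" + size, dataset)
      zfs("set", "quota=" + size, dataset)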

--dave
ps: as always, having asked for something, I'm also volunteering to
help provide it: I'm not a storage or ZFS guy, but I am an author,
and will happily help my Smarter Colleagues[tm] to write it up.

-- 
David Collier-Brown| Always do right. This will gratify
Sun Microsystems, Toronto  | some people and astonish the rest
[EMAIL PROTECTED] |  -- Mark Twain
(905) 943-1983, cell: (647) 833-9377, (800) 555-9786 x56583
bridge: (877) 385-4099 code: 506 9191#



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Issue with simultaneous IO to lots of ZFS pools

2008-05-01 Thread David Collier-Brown
Chris Siebenmann [EMAIL PROTECTED] wrote:
|  Speaking as a sysadmin (and a Sun customer), why on earth would I have
| to provision 8 GB+ of RAM on my NFS fileservers? I would much rather
| have that memory in the NFS client machines, where it can actually be
| put to work by user programs.
|
| (If I have decently provisioned NFS client machines, I don't expect much
| from the NFS fileserver's cache. Given that the clients have caches too,
| I believe that the server's cache will mostly be hit for things that the
| clients cannot cache because of NFS semantics, like NFS GETATTR requests
| for revalidation and the like.)

That's certainly true for the NFS part of the NFS fileserver, but to get
the ZFS feature-set, you trade off cycles and memory.  If we investigate
this a bit, we should be able to figure out a rule of thumb for how
little memory we need for an NFS-home-directories workload without 
cutting into performance.

--dave
-- 
David Collier-Brown| Always do right. This will gratify
Sun Microsystems, Toronto  | some people and astonish the rest
[EMAIL PROTECTED] |  -- Mark Twain
(905) 943-1983, cell: (647) 833-9377, (800) 555-9786 x56583
bridge: (877) 385-4099 code: 506 9191#
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How many ZFS pools is it sensible to use on a single server?

2008-04-15 Thread David Collier-Brown
  We've discussed this in considerable detail, but the original
question remains unanswered:  if an organization *must* use
multiple pools, is there an upper bound to avoid or a rate
of degradation to be considered?

--dave
-- 
David Collier-Brown| Always do right. This will gratify
Sun Microsystems, Toronto  | some people and astonish the rest
[EMAIL PROTECTED] |  -- Mark Twain
(905) 943-1983, cell: (647) 833-9377, (800) 555-9786 x56583
bridge: (877) 385-4099 code: 506 9191#
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss