Re: [zfs-discuss] resilver = defrag?

2010-09-17 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of David Dyer-Bennet

  For example, if you start with an empty drive, and you write a large
  amount
  of data to it, you will have no fragmentation.  (At least, no
 significant
  fragmentation; you may get a little bit based on random factors.)  As
 life
  goes on, as long as you keep plenty of empty space on the drive,
 there's
  never any reason for anything to become significantly fragmented.
 
 Sure, if only a single thread is ever writing to the disk store at a
 time.

This has already been discussed in this thread.

The threading model doesn't affect whether files end up fragmented or
unfragmented on disk.  The OS is smart enough to know that these blocks written
by process A are all sequential, and those blocks all written by process B
are also sequential, but separate.



Re: [zfs-discuss] resilver = defrag?

2010-09-17 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Marty Scholes
  
 What appears to be missing from this discussion is any shred of
 scientific evidence that fragmentation is good or bad and by how much.
 We also lack any detail on how much fragmentation does take place.

Agreed.  I've been rather lazily asserting a few things here and there that
I expected to be challenged, so I've been thinking up tests to
verify/dispute my claims, but nobody challenged them.  Specifically, the
blocks on disk are not interleaved just because multiple threads were
writing at the same time.

So there's at least one thing which is testable, if anyone cares.
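
A minimal sketch of such a test (pool name and sizes are invented; the files
should exceed RAM so the reads aren't served from ARC, and /dev/zero is only
a fair payload while compression and dedup are off):

  # two processes writing large files concurrently
  dd if=/dev/zero of=/tank/fileA bs=1M count=8192 &
  dd if=/dev/zero of=/tank/fileB bs=1M count=8192 &
  wait
  # export/import (or reboot) to empty the cache, then time sequential reads;
  # heavy interleaving on disk should show up as poor read throughput
  time dd if=/tank/fileA of=/dev/null bs=1M
  time dd if=/tank/fileB of=/dev/null bs=1M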

But there's also no way that I know of to measure fragmentation in a real
system that's been in production for a year.



Re: [zfs-discuss] resilver = defrag?

2010-09-17 Thread Richard Elling
On Sep 16, 2010, at 12:33 PM, Marty Scholes wrote:

 David Dyer-Bennet wrote:
 Sure, if only a single thread is ever writing to the
 disk store at a time.
 
 This situation doesn't exist with any kind of
 enterprise disk appliance,
 though; there are always multiple users doing stuff.
 
 Ok, I'll bite.
 
 Your assertion seems to be that any kind of enterprise disk appliance will 
 always have enough simultaneous I/O requests queued that any sequential read 
 from any application will be sufficiently broken up by requests from other 
 applications, effectively rendering all read requests as random.  If I follow 
 your logic, since all requests are essentially random anyway, then where they 
 fall on the disk is irrelevant.

Allan and Neel did a study of this for MySQL.
http://www.youtube.com/watch?v=a31NhwzlAxs
 -- richard



Re: [zfs-discuss] resilver = defrag?

2010-09-16 Thread David Dyer-Bennet

On Wed, September 15, 2010 16:18, Edward Ned Harvey wrote:

 For example, if you start with an empty drive, and you write a large
 amount
 of data to it, you will have no fragmentation.  (At least, no significant
 fragmentation; you may get a little bit based on random factors.)  As life
 goes on, as long as you keep plenty of empty space on the drive, there's
 never any reason for anything to become significantly fragmented.

Sure, if only a single thread is ever writing to the disk store at a time.

This situation doesn't exist with any kind of enterprise disk appliance,
though; there are always multiple users doing stuff.
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info



Re: [zfs-discuss] resilver = defrag?

2010-09-16 Thread Miles Nordin
 dd == David Dyer-Bennet d...@dd-b.net writes:

dd Sure, if only a single thread is ever writing to the disk
dd store at a time.

video warehousing is a reasonable use case that will have small
numbers of sequential readers and writers to large files.  virtual
tape library is another obviously similar one.  basically, things
which used to be stored on tape.  which are not uncommon.  

AIUI ZFS does not have a fragmentation problem for these cases unless
you fill past 96%, though I've been trying to keep my pool below 80%
because of general FUD.

dd This situation doesn't exist with any kind of enterprise disk
dd appliance, though; there are always multiple users doing
dd stuff.

the point's relevant, but I'm starting to tune out every time I hear
the word ``enterprise.''  seems it often decodes to: 

 (1) ``fat sacks and no clue,'' or 

 (2) ``i can't hear you i can't hear you i have one big hammer in my
 toolchest and one quick answer to all questions, and everything's
 perfect! perfect, I say.  unless you're offering an even bigger
 hammer I can swap for this one, I don't want to hear it,'' or

 (3) ``However of course I agree that hammers come in different
 colors, and a wise and experienced craftsman will always choose
 the color of his hammer based on the color of the nail he's
 hitting, because the interface between hammers and nails doesn't
 work well otherwise.  We all know here how to match hammer and
 nail colors, but I don't want to discuss that at all because it's
 a private decision to make between you and your salesdroid.  

 ``However, in this forum here we talk about GREEN NAILS ONLY.  If
 you are hitting green nails with red hammers and finding they go
 into the wood anyway then you are being very unprofessional
 because that nail might have been a bank transaction. --posted
 from opensolaris.org''




Re: [zfs-discuss] resilver = defrag?

2010-09-16 Thread Marty Scholes
David Dyer-Bennet wrote:
 Sure, if only a single thread is ever writing to the
 disk store at a time.
 
 This situation doesn't exist with any kind of
 enterprise disk appliance,
 though; there are always multiple users doing stuff.

Ok, I'll bite.

Your assertion seems to be that any kind of enterprise disk appliance will 
always have enough simultaneous I/O requests queued that any sequential read 
from any application will be sufficiently broken up by requests from other 
applications, effectively rendering all read requests as random.  If I follow 
your logic, since all requests are essentially random anyway, then where they 
fall on the disk is irrelevant.

I might challenge a couple of those assumptions.

First, if the data is not fragmented, then ZFS would coalesce multiple 
contiguous read requests into a single large read request, increasing total 
throughput regardless of competing I/O requests (which also might benefit from 
the same effect).

Second, I am unaware of an enterprise requirement that disk I/O run at 100% 
busy, any more than I am aware of the same requirement for full network link 
utilization, CPU utilization or PCI bus utilization.

What appears to be missing from this discussion is any shred of scientific 
evidence that fragmentation is good or bad and by how much.  We also lack any 
detail on how much fragmentation does take place.

Let's see if some people in the community can get some real numbers behind this 
stuff in real world situations.

Cheers,
Marty


Re: [zfs-discuss] resilver = defrag?

2010-09-15 Thread Richard Elling
On Sep 14, 2010, at 4:58 AM, Edward Ned Harvey wrote:

 From: Haudy Kazemi [mailto:kaze0...@umn.edu]
 
 With regard to multiuser systems and how that negates the need to
 defragment, I think that is only partially true.  As long as the files
 are defragmented enough so that each particular read request only
 requires one seek before it is time to service the next read request,
 further defragmentation may offer only marginal benefit.  On the other
 
 Here's a great way to quantify how much fragmentation is acceptable:
 
 Suppose you want to ensure at least 99% efficiency of the drive.  At most 1%
 time wasted by seeking.

This is practically impossible on a HDD.  If you need this, use SSD.
This phenomenon is why short-stroking became popular, until SSDs
killed short-stroking.

 Suppose you're talking about 7200rpm sata drives, which sustain 500Mbit/s
 transfer, and have average seek time 8ms.
 
 8ms is 1% of 800ms.
 In 800ms, the drive could read 400 Mbit of sequential data.
 That's roughly 50 MB

In UFS we have cluster groups, which didn't survive the test of time.
In ZFS we have metaslabs, perhaps with a better chance of longevity.
The vdev is divided into a set of metaslabs and the allocator tries to
use space in one metaslab before moving to another.
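
As an aside, the metaslab layout of a live pool can be inspected with zdb
(a sketch; the pool name is invented and the output format varies by build):

  zdb -m tank    # prints each vdev's metaslabs with their free-space maps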

Several features work against HDD optimization.  Redundant copies
of the metadata are intentionally spread across the media, so that there
is some resilience to media errors.  Entries into the ZIL can also be of
varying size and are allocated in the pool -- solved by using a separate
log device.  COW can lead to disk fragmentation (in the wikipedia sense of
the term) for files which are larger than the recordsize.

Continuing to try to optimize for HDD performance is just a matter of
changing the lipstick on the pig.
 -- richard



Re: [zfs-discuss] resilver = defrag?

2010-09-15 Thread Edward Ned Harvey
 From: Richard Elling [mailto:rich...@nexenta.com]

  Suppose you want to ensure at least 99% efficiency of the drive.  At
 most 1%
  time wasted by seeking.
 
 This is practically impossible on a HDD.  If you need this, use SSD.

Lately, Richard, you're saying some of the craziest illogical things I've
ever heard, about fragmentation and/or raid.

It is absolutely not difficult to avoid fragmentation on a spindle drive, at
the level I described.  Just keep plenty of empty space in your drive, and
you won't have a fragmentation problem.  (Except as required by COW.)  How
on earth do you conclude this is practically impossible?

For example, if you start with an empty drive, and you write a large amount
of data to it, you will have no fragmentation.  (At least, no significant
fragmentation; you may get a little bit based on random factors.)  As life
goes on, as long as you keep plenty of empty space on the drive, there's
never any reason for anything to become significantly fragmented.

Again, except for COW.  It is known that COW will cause fragmentation if you
write randomly in the middle of a file that is protected by snapshots.



Re: [zfs-discuss] resilver = defrag?

2010-09-15 Thread Richard Elling

On Sep 15, 2010, at 2:18 PM, Edward Ned Harvey wrote:

 From: Richard Elling [mailto:rich...@nexenta.com]
 
 Suppose you want to ensure at least 99% efficiency of the drive.  At
 most 1%
 time wasted by seeking.
 
 This is practically impossible on a HDD.  If you need this, use SSD.
 
 Lately, Richard, you're saying some of the craziest illogical things I've
 ever heard, about fragmentation and/or raid.
 
 It is absolutely not difficult to avoid fragmentation on a spindle drive, at
 the level I described.  Just keep plenty of empty space in your drive, and
 you won't have a fragmentation problem.  (Except as required by COW.)  How
 on earth do you conclude this is practically impossible?

It is practically impossible to keep a drive from seeking.  It is also
practically impossible to keep from blowing a rev.  Cute little piggy, eh? :-)

 For example, if you start with an empty drive, and you write a large amount
 of data to it, you will have no fragmentation.  (At least, no significant
 fragmentation; you may get a little bit based on random factors.)  As life
 goes on, as long as you keep plenty of empty space on the drive, there's
 never any reason for anything to become significantly fragmented.
 
 Again, except for COW.  It is known that COW will cause fragmentation if you
 write randomly in the middle of a file that is protected by snapshots.

IFF the file is larger than recordsize.
 -- richard



Re: [zfs-discuss] resilver = defrag?

2010-09-15 Thread Ian Collins

On 09/16/10 09:18 AM, Edward Ned Harvey wrote:
 From: Richard Elling [mailto:rich...@nexenta.com]

 Suppose you want to ensure at least 99% efficiency of the drive.  At
 most 1% time wasted by seeking.

 This is practically impossible on a HDD.  If you need this, use SSD.

 Lately, Richard, you're saying some of the craziest illogical things I've
 ever heard, about fragmentation and/or raid.

 It is absolutely not difficult to avoid fragmentation on a spindle drive, at
 the level I described.  Just keep plenty of empty space in your drive, and
 you won't have a fragmentation problem.  (Except as required by COW.)  How
 on earth do you conclude this is practically impossible?

Drives seek, there isn't a lot you can do to stop that.

--
Ian.



Re: [zfs-discuss] resilver = defrag?

2010-09-15 Thread Nicolas Williams
On Wed, Sep 15, 2010 at 05:18:08PM -0400, Edward Ned Harvey wrote:
 It is absolutely not difficult to avoid fragmentation on a spindle drive, at
 the level I described.  Just keep plenty of empty space in your drive, and
 you won't have a fragmentation problem.  (Except as required by COW.)  How
 on earth do you conclude this is practically impossible?

That's expensive.  It's also approaching short-stroking (which is
expensive).  Which is what Richard said (in so many words, that it's
expensive).  Can you make HDDs perform awesome?  Yes, but you'll need
lots of them, and you'll need to use them very inefficiently.

Nico


Re: [zfs-discuss] resilver = defrag?

2010-09-15 Thread Edward Ned Harvey
 From: Richard Elling [mailto:rich...@nexenta.com]
 
 It is practically impossible to keep a drive from seeking.  It is also

The first time somebody (Richard) said you can't prevent a drive from
seeking, I just decided to ignore it.  But then it was said twice.  (Ian.)

I don't get why anybody is saying drives seek.  Did anybody say drives
don't seek?

I said you can quantify how much fragmentation is acceptable, given drive
speed characteristics, and a percentage of time you consider acceptable for
seeking.  I suggested acceptable was 99% efficiency and 1% time wasted
seeking.  Roughly calculated, I came up with 50 MB of sequential data per
random seek to yield 99% efficiency.

For some situations, that's entirely possible and likely to be the norm.
For other cases, it may be unrealistic, and you may suffer badly from
fragmentation.

Is there some point we're talking about here?  I don't get why the
conversation seems to have taken such a tangent.



Re: [zfs-discuss] resilver = defrag?

2010-09-14 Thread Edward Ned Harvey
 From: Haudy Kazemi [mailto:kaze0...@umn.edu]
 
 With regard to multiuser systems and how that negates the need to
 defragment, I think that is only partially true.  As long as the files
 are defragmented enough so that each particular read request only
 requires one seek before it is time to service the next read request,
 further defragmentation may offer only marginal benefit.  On the other

Here's a great way to quantify how much fragmentation is acceptable:

Suppose you want to ensure at least 99% efficiency of the drive.  At most 1%
time wasted by seeking.
Suppose you're talking about 7200rpm sata drives, which sustain 500Mbit/s
transfer, and have average seek time 8ms.

8ms is 1% of 800ms.
In 800ms, the drive could read 400 Mbit of sequential data.
That's roughly 50 MB.

So as long as the fragment size of your files is approx 50 MB or larger,
then fragmentation has a negligible effect on performance.  One seek per
every 50 MB read/written will yield less than 1% performance impact.

For the heck of it, let's see how that would have computed with 15krpm SAS
drives.
Sustained transfer 1Gbit/s, and average seek 3.5ms.
3.5ms is 1% of 350ms.
In 350ms, the drive could read 350 Mbit (call it 44 MB).

That's certainly in the same ballpark.
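
To rerun the arithmetic with other drive specs, a throwaway sketch (the
numbers are the assumptions above: Mbit/s sustained transfer and ms of seek):

  # window in which one seek costs 1%: seek_ms * 100; MB read in that window:
  echo 'scale=1; 500 * 800 / 1000 / 8' | bc    # 7200rpm SATA: 50.0 MB
  echo 'scale=1; 1000 * 350 / 1000 / 8' | bc   # 15krpm SAS:   43.7 MB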



Re: [zfs-discuss] resilver = defrag?

2010-09-14 Thread Edward Ned Harvey
 From: Richard Elling [mailto:rich...@nexenta.com]
  With appropriate write caching and grouping or re-ordering of writes
 algorithms, it should be possible to minimize the amount of file
 interleaving and fragmentation on write that takes place.
 
 To some degree, ZFS already does this.  The dynamic block sizing tries
 to ensure
 that a file is written into the largest block[1]

Yes, but the block sizes in question are typically up to 128K.
As computed in my email 1 minute ago ... the fragment size needs to be on
the order of 50 MB in order to effectively eliminate the performance loss
from fragmentation.


 Also, ZFS has an intelligent prefetch algorithm that can hide some of the
 performance impact of fragmentation on HDDs.

Unfortunately, prefetch can only hide fragmentation on systems that have
idle disk time.  Prefetch isn't going to help you if you actually need to
transfer a whole file as fast as possible.



Re: [zfs-discuss] resilver = defrag?

2010-09-14 Thread Marty Scholes
Richard Elling wrote:
 Define fragmentation?

Maybe this is the wrong thread.  I have noticed that an old pool can take 4
hours to scrub, with a large portion of the time spent reading from the pool
disks at 150+ MB/s while zpool iostat reports a 2 MB/s read speed.  My naive
interpretation is that the data the scrub is looking for has become fragmented.

If I refresh the pool by zfs sending it to another pool and then zfs receiving
the data back again, the same scrub can take less than an hour, with zpool
iostat reporting more sane throughput.

On an old pool which had lots of snapshots come and go, the scrub throughput is
awful.  On that same data, refreshed via zfs send/receive, the throughput is
much better.

It would appear to me that this is an artifact of fragmentation, although I
have nothing scientific on which to base this.  Additional unscientific
observations lead me to believe these same refreshed pools also perform
better for non-scrub activities.
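
For reference, the refresh described above boils down to something like this
(a sketch only; pool and snapshot names are invented, and you would verify the
copy before destroying anything):

  zfs snapshot -r tank@refresh
  zfs send -R tank@refresh | zfs receive -d scratch    # copy everything out
  # ... verify 'scratch', then destroy and recreate 'tank' ...
  zfs snapshot -r scratch@back
  zfs send -R scratch@back | zfs receive -F -d tank    # copy back, freshly laid out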


Re: [zfs-discuss] resilver = defrag?

2010-09-14 Thread David Dyer-Bennet
The difference between multi-user thinking and single-user thinking is
really quite dramatic in this area.  I came up the time-sharing side
(PDP-8, PDP-11, DECSYSTEM-20); TOPS-20 didn't have any sort of disk
defragmenter, and nobody thought one was particularly desirable, because
the normal access pattern of a busy system was spread all across the disk
packs anyway.

On a desktop workstation, it makes some sense to think about loading big
executable files fast -- that's something the user is sitting there
waiting for, and there's often nothing else going on at that exact moment.
 (There *could* be significant things happening in the background, but
quite often there aren't.)  Similarly, loading a big document
(single-file book manuscript, bitmap image, or whatever) happens at a
point where the user has requested it and is waiting for it right then,
and there's mostly nothing else going on.

But on really shared disk space (either on a timesharing system, or a
network file server serving a good-sized user base), the user is competing
for disk activity (either bandwidth or IOPs, depending on the access
pattern of the users).  Generally you don't get to load your big DLL in
one read -- and to the extent that you don't, it doesn't matter much how
it's spread around the disk, because the head won't be in the same spot
when you get your turn again.


Re: [zfs-discuss] resilver = defrag?

2010-09-13 Thread Edward Ned Harvey
 From: Richard Elling [mailto:rich...@nexenta.com]
 
 This operational definition of fragmentation comes from the single-
 user,
 single-tasking world (PeeCees). In that world, only one thread writes
 files
 from one application at one time. In those cases, there is a reasonable
 expectation that a single file's blocks might be contiguous on a single
 disk.
 That isn't the world we live in, where we have RAID, multi-user, or
 multi-threaded environments.

I don't know what you're saying, but I'm quite sure I disagree with it.

Regardless of multithreading, multiprocessing, it's absolutely possible to
have contiguous files, and/or file fragmentation.  That's not a
characteristic which depends on the threading model.

Also regardless of raid, it's possible to have contiguous or fragmented
files.  The same concept applies to multiple disks.



Re: [zfs-discuss] resilver = defrag?

2010-09-13 Thread Orvar Korvar
I was thinking to delete all zfs snapshots before zfs send | receive to another
new zpool.  Then everything would be defragmented, I thought.


(I assume snapshots work this way: I snapshot once and do some changes, say
delete file A and edit file B.  When I delete the snapshot, file A is
still deleted and file B is still edited.  In other words, deletion of a
snapshot does not revert the changes.  Therefore I just delete all
snapshots and make my filesystem up to date before zfs send | receive.)


Re: [zfs-discuss] resilver = defrag?

2010-09-13 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Orvar Korvar
 
 I was thinking to delete all zfs snapshots before zfs send | receive to
 another new zpool.  Then everything would be defragmented, I thought.

You don't need to delete snaps before zfs send, if your goal is to
defragment your filesystem.  Just perform a single zfs send, and don't do
any incrementals afterward.  The receiving filesystem will lay out the
filesystem as it wishes.
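
In other words, something like this (a sketch; dataset and pool names are
invented):

  zfs snapshot tank/fs@migrate
  zfs send tank/fs@migrate | zfs receive newpool/fs
  # no incremental 'zfs send -i' afterward -- incrementals would replay the
  # same scattered writes that fragmented the source in the first place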


 (I assume snapshots works this way: I snapshot once and do some
 changes, say delete file A and edit file B. When I delete the
 snapshot, the file A is still deleted and file B is still edited.
 In other words, deletion of snapshot does not revert back the changes.

You are correct.

A snapshot is a read-only image of the filesystem, as it was, at some time
in the past.  If you destroy the snapshot, you've only destroyed the
snapshot.  You haven't destroyed the most recent live version of the
filesystem.

If you wanted to, you could rollback, which destroys the live version of
the filesystem, and restores you back to some snapshot.  But that is a very
different operation.  Rollback is not at all similar to destroying a
snapshot.  These two operations are basically opposites of each other.
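
A sketch of the two operations side by side (dataset name invented):

  zfs snapshot tank/fs@s1
  # ... make changes to the live filesystem ...
  # option 1: destroy the snapshot -- the live fs keeps the changes
  zfs destroy tank/fs@s1
  # option 2 (instead of option 1): roll back -- the changes are discarded
  zfs rollback tank/fs@s1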

All of this is discussed in the man pages.  I suggest man zpool and man
zfs.

Everything you need to know is written there.



Re: [zfs-discuss] resilver = defrag?

2010-09-13 Thread Richard Elling
On Sep 13, 2010, at 5:14 AM, Edward Ned Harvey wrote:

 From: Richard Elling [mailto:rich...@nexenta.com]
 
 This operational definition of fragmentation comes from the single-
 user,
 single-tasking world (PeeCees). In that world, only one thread writes
 files
 from one application at one time. In those cases, there is a reasonable
 expectation that a single file's blocks might be contiguous on a single
 disk.
 That isn't the world we live in, where we have RAID, multi-user, or
 multi-threaded environments.
 
 I don't know what you're saying, but I'm quite sure I disagree with it.
 
 Regardless of multithreading, multiprocessing, it's absolutely possible to
 have contiguous files, and/or file fragmentation.  That's not a
 characteristic which depends on the threading model.

Possible, yes.  Probable, no.  Consider that a file system is allocating
space for multiple, concurrent file writers.

 Also regardless of raid, it's possible to have contiguous or fragmented
 files.  The same concept applies to multiple disks.

RAID works against the efforts to gain performance by contiguous access
because the access becomes non-contiguous.
 -- richard




Re: [zfs-discuss] resilver = defrag?

2010-09-13 Thread Edward Ned Harvey
 From: Richard Elling [mailto:rich...@nexenta.com]
 
  Regardless of multithreading, multiprocessing, it's absolutely
 possible to
  have contiguous files, and/or file fragmentation.  That's not a
  characteristic which depends on the threading model.
 
 Possible, yes.  Probable, no.  Consider that a file system is
 allocating
 space for multiple, concurrent file writers.

Process A is writing.  Suppose it starts writing at block 10,000 out of my
1,000,000 block device.
Process B is also writing.  Suppose it starts writing at block 50,000.

These two processes write simultaneously, and no fragmentation occurs,
unless Process A writes more than 40,000 blocks.  In that case, A's file
gets fragmented, and the 2nd fragment might begin at block 300,000.

The concept which causes fragmentation (not counting COW) is the size of the
span of unallocated blocks.  Most filesystems will allocate blocks from the
largest unallocated contiguous area of the physical device, so as to
minimize fragmentation.

I can't say how ZFS behaves authoritatively, but I'd be extremely surprised
if two processes writing different files as fast as possible resulted in all
their blocks interleaved with each other on physical disk.  I think
interleaving is possible if you have multiple processes lazily writing at
less-than full speed, because then ZFS might remap a bunch of small writes
into a single contiguous write.


  Also regardless of raid, it's possible to have contiguous or
 fragmented
  files.  The same concept applies to multiple disks.
 
 RAID works against the efforts to gain performance by contiguous access
 because the access becomes non-contiguous.

These might as well have been words randomly selected from the dictionary to
me - I recognize that it's a complete sentence, but you might have said
processors aren't needed in computers anymore, or something equally
illogical.

Suppose you have a 3-disk raid stripe set, using traditional simple
striping, because it's very easy to explain.  Suppose a process is writing
as fast as it can, and suppose it's going to write block 0 through block 99
of a virtual device.

virtual block 0 = block 0 of disk 0
virtual block 1 = block 0 of disk 1
virtual block 2 = block 0 of disk 2
virtual block 3 = block 1 of disk 0
virtual block 4 = block 1 of disk 1
virtual block 5 = block 1 of disk 2
virtual block 6 = block 2 of disk 0
virtual block 7 = block 2 of disk 1
virtual block 8 = block 2 of disk 2
virtual block 9 = block 3 of disk 0
...
virtual block 96 = block 32 of disk 0
virtual block 97 = block 32 of disk 1
virtual block 98 = block 32 of disk 2
virtual block 99 = block 33 of disk 0
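
The mapping is plain modulo arithmetic; a quick sketch that reproduces the
table above:

  i=0
  while [ $i -le 99 ]; do
    echo "virtual block $i = block $((i / 3)) of disk $((i % 3))"
    i=$((i + 1))
  done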

Thanks to buffering and command queueing, the OS tells the RAID controller
to write blocks 0-8, and the raid controller tells disk 0 to write blocks
0-2, tells disk 1 to write blocks 0-2, and tells disk 2 to write 0-2,
simultaneously.  So the total throughput is the sum of all 3 disks writing
continuously and contiguously to sequential blocks.

This accelerates performance for continuous sequential writes.  It does not
work against efforts to gain performance by contiguous access.

The same concept is true for raid-5 or raidz, but it's more complicated.
The filesystem or raid controller does in fact know how to write sequential
filesystem blocks to sequential physical blocks on the physical devices for
the sake of performance enhancement on contiguous read/write.

If you don't believe me, there's a very easy test to prove it:

Create a zpool with 1 disk in it.  Time writing 100G (or some amount of data
larger than RAM).
Create a zpool with several disks in a raidz set, and time writing 100G.
The speed scales up linearly with the number of disks, until you reach some
other hardware bottleneck, such as bus speed or something like that.
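
Spelled out, that test might look like this (device names are placeholders,
and /dev/zero is only a fair payload while compression is off):

  zpool create single c1t0d0
  time dd if=/dev/zero of=/single/blob bs=1M count=102400   # 100 GB
  zpool destroy single
  zpool create striped raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0
  time dd if=/dev/zero of=/striped/blob bs=1M count=102400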



Re: [zfs-discuss] resilver = defrag?

2010-09-13 Thread David Dyer-Bennet

On Mon, September 13, 2010 07:14, Edward Ned Harvey wrote:
 From: Richard Elling [mailto:rich...@nexenta.com]

 This operational definition of fragmentation comes from the single-
 user,
 single-tasking world (PeeCees). In that world, only one thread writes
 files
 from one application at one time. In those cases, there is a reasonable
 expectation that a single file's blocks might be contiguous on a single
 disk.
 That isn't the world we live in, where we have RAID, multi-user, or
 multi-threaded environments.

 I don't know what you're saying, but I'm quite sure I disagree with it.

 Regardless of multithreading, multiprocessing, it's absolutely possible to
 have contiguous files, and/or file fragmentation.  That's not a
 characteristic which depends on the threading model.

 Also regardless of raid, it's possible to have contiguous or fragmented
 files.  The same concept applies to multiple disks.

The attitude that it *matters* seems to me to have developed on, and to be
relevant only to, single-user computers.
Regardless of whether a file is contiguous or not, by the time you read
the next chunk of it, in the multi-user world some other user is going to
have moved the access arm of that drive.  Hence, it doesn't matter if the
file is contiguous or not.


Re: [zfs-discuss] resilver = defrag?

2010-09-13 Thread Orvar Korvar
To summarize,

A) resilver does not defrag.

B) zfs send | receive to a new zpool means it will be defragged.

Correctly understood?


Re: [zfs-discuss] resilver = defrag?

2010-09-13 Thread Richard Elling
On Sep 13, 2010, at 10:54 AM, Orvar Korvar wrote:

 To summarize, 
 
 A) resilver does not defrag.
 
 B) zfs send | receive to a new zpool means it will be defragged

Define fragmentation?

If you follow the wikipedia definition of defragmentation then the 
answer is no, zfs send/receive does not change the location of files.
Why? Because zfs sends objects, not files.  The objects can be 
allocated in a (more) contiguous form on the receiving side, or maybe
not, depending on the configuration and use of the receiving side. 

A file may be wholly contained in an object, or not, depending on how it
was created. For example, if a file is less than 128KB (by default) and
is created at one time, then it will be wholly contained in one object.
By contrast, UFS, with its 8KB max block size, will use up to 16 different
blocks to store the same file. These blocks may or may not be contiguous
in UFS.

http://en.wikipedia.org/wiki/Defragmentation

 Correctly understood?

Clear as mud.  I suggest deprecating the use of the term defragmentation.
 -- richard



Re: [zfs-discuss] resilver = defrag?

2010-09-13 Thread Haudy Kazemi

Richard Elling wrote:
 On Sep 13, 2010, at 5:14 AM, Edward Ned Harvey wrote:
 From: Richard Elling [mailto:rich...@nexenta.com]

 This operational definition of fragmentation comes from the single-user,
 single-tasking world (PeeCees). In that world, only one thread writes
 files from one application at one time. In those cases, there is a
 reasonable expectation that a single file's blocks might be contiguous
 on a single disk.  That isn't the world we live in, where we have RAID,
 multi-user, or multi-threaded environments.

 I don't know what you're saying, but I'm quite sure I disagree with it.

 Regardless of multithreading, multiprocessing, it's absolutely possible to
 have contiguous files, and/or file fragmentation.  That's not a
 characteristic which depends on the threading model.

 Possible, yes.  Probable, no.  Consider that a file system is allocating
 space for multiple, concurrent file writers.
With appropriate write caching and write grouping or re-ordering
algorithms, it should be possible to minimize the amount of file
interleaving and fragmentation on write that takes place.  (Or at least
optimize the amount of file interleaving.  Years ago, MFM hard drives had
configurable sector interleave factors to better optimize performance:
with no interleaving, the drive had spun the platter far enough to
be ready to give the next sector to the CPU before the CPU was ready,
with the result that the platter had to be spun a second time around to
wait for the CPU to catch up.)




 Also regardless of raid, it's possible to have contiguous or fragmented
 files.  The same concept applies to multiple disks.

 RAID works against the efforts to gain performance by contiguous access
 because the access becomes non-contiguous.

From what I've seen, defragmentation offers its greatest benefit when 
the tiniest reads are eliminated by grouping them into larger contiguous 
reads.  Once the contiguous areas reach a certain size (somewhere in the 
few Mbytes to a few hundred Mbytes range), further defragmentation 
offers little additional benefit.  Full defragmentation is a useful goal
when the option of using file-carving-based data recovery is desirable.
Also remember that defragmentation is not limited to space used by 
files.  It can also apply to free, unused space, which should also be 
defragmented to prevent future writes from being fragmented on write.


With regard to multiuser systems and how that negates the need to
defragment, I think that is only partially true.  As long as the files
are defragmented enough so that each particular read request only
requires one seek before it is time to service the next read request,
further defragmentation may offer only marginal benefit.  On the other
hand, if files have been fragmented down to the point where each sector is
stored separately on the drive, then each read request is going to take
that much longer to complete (or will be interrupted by another read
request because it has taken too long).


-hk


Re: [zfs-discuss] resilver = defrag?

2010-09-13 Thread Richard Elling
On Sep 13, 2010, at 9:41 PM, Haudy Kazemi wrote:
 Richard Elling wrote:
 On Sep 13, 2010, at 5:14 AM, Edward Ned Harvey wrote:
 From: Richard Elling [mailto:rich...@nexenta.com]
 
 This operational definition of fragmentation comes from the single-
 user,
 single-tasking world (PeeCees). In that world, only one thread writes
 files
 from one application at one time. In those cases, there is a reasonable
 expectation that a single file's blocks might be contiguous on a single
 disk.
 That isn't the world we live in, where we have RAID, multi-user, or
 multi-threaded environments.
 
 I don't know what you're saying, but I'm quite sure I disagree with it.
 
 Regardless of multithreading, multiprocessing, it's absolutely possible to
 have contiguous files, and/or file fragmentation.  That's not a
 characteristic which depends on the threading model.
 
 
 
 Possible, yes.  Probable, no.  Consider that a file system is allocating
 space for multiple, concurrent file writers.
 
 With appropriate write caching and write grouping or re-ordering
 algorithms, it should be possible to minimize the amount of file
 interleaving and fragmentation on write that takes place.

To some degree, ZFS already does this.  The dynamic block sizing tries to
ensure that a file is written into the largest block[1].

 (Or at least optimize the amount of file interleaving.  Years ago MFM hard
 drives had configurable sector interleave factors to better optimize
 performance: with no interleaving, the drive had spun the platter far
 enough to be ready to give the next sector to the CPU before the CPU was
 ready, with the result that the platter had to be spun a second time around
 to wait for the CPU to catch up.)

Reason #526 why SSDs kill HDDs on performance.

 Also regardless of raid, it's possible to have contiguous or fragmented
 files.  The same concept applies to multiple disks.
 
 
 
 RAID works against the efforts to gain performance by contiguous access
 because the access becomes non-contiguous.
 
 
 From what I've seen, defragmentation offers its greatest benefit when the 
 tiniest reads are eliminated by grouping them into larger contiguous reads.  
 Once the contiguous areas reach a certain size (somewhere in the few Mbytes 
 to a few hundred Mbytes range), further defragmentation offers little 
 additional benefit.

For the wikipedia definition of defragmentation, this can only occur when the 
files
themselves are hundreds of megabytes in size.  This is not the general case for 
which
I see defragmentation used.

Also, ZFS has an intelligent prefetch algorithm that can hide some of the
performance impact of fragmentation on HDDs.

  Full defragmentation is a useful goal when the option of using file carving 
 based data recovery is desirable.  Also remember that defragmentation is not 
 limited to space used by files.  It can also apply to free, unused space, 
 which should also be defragmented to prevent future writes from being 
 fragmented on write.

This is why ZFS uses a first-fit algorithm until free space becomes too low,
at which point it switches to a best-fit algorithm. As long as an available
free region is big enough for the block, it will be used.
 
 With regard to multiuser systems and how that negates the need to
 defragment, I think that is only partially true.  As long as the files are
 defragmented enough so that each particular read request only requires one
 seek before it is time to service the next read request, further
 defragmentation may offer only marginal benefit.  On the other hand, if
 files have been fragmented down to the point where each sector is stored
 separately, then each read request is going to take that much longer to
 complete (or will be interrupted by another read request because it has
 taken too long).

Yes, so try to avoid running your ZFS pool at more than 96% full.
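
A trivial way to keep an eye on that threshold (a sketch; the 90% cutoff is
arbitrary):

  zpool list -H -o name,capacity | awk '$2+0 > 90 { print $1 " is " $2 " full" }'
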
 -- richard



Re: [zfs-discuss] resilver = defrag?

2010-09-12 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Orvar Korvar
 
 I am not really worried about fragmentation. I was just wondering if I
 attach new drives and zfs send | receive to a new zpool, whether that would
 count as defrag. But apparently not.

Apparently not in all situations would be more appropriate.

The understanding I had was:  If you do a single zfs send | receive, then
it does effectively get defragmented, because the receiving filesystem is
going to re-lay out the received filesystem, and there is nothing
pre-existing to make the receiving filesystem dance around...  But if you're
sending an initial stream plus incrementals, then you're actually repeating
the same operations that probably caused the original filesystem to become
fragmented in the first place.  And in fact, it seems unavoidable...

Suppose you have a large file, which is all sequential on disk.  You make a
snapshot of it.  Which means all the individual blocks must not be
overwritten.  And then you overwrite a few bytes scattered randomly in the
middle of the file.  The nature of copy on write is such that of course, the
latest version of the file cannot remain contiguous.  Your
only choices are:  To read & write copies of the whole file, including
multiple copies of what didn't change, or you leave the existing data in
place where it is on disk, and you instead write your new random bytes to
other non-contiguous locations on disk.  Hence fragmentation.
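
To make that concrete, a small sketch (dataset and file names invented; dd's
conv=notrunc does the in-place overwrite):

  dd if=/dev/urandom of=/tank/fs/big bs=1M count=1024     # lay down a large file
  zfs snapshot tank/fs@pin                                # pin its current blocks
  # overwrite one record in the middle; COW must place the new block
  # elsewhere, so the live file is no longer contiguous on disk
  dd if=/dev/urandom of=/tank/fs/big bs=128k count=1 seek=4000 conv=notrunc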



Re: [zfs-discuss] resilver = defrag?

2010-09-12 Thread Richard Elling
On Sep 12, 2010, at 8:27 PM, Edward Ned Harvey wrote:

 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Orvar Korvar
 
 I am not really worried about fragmentation. I was just wondering if I
 attach new drives and zfs send | receive to a new zpool, whether that would
 count as defrag. But apparently not.
 
 Apparently not in all situations would be more appropriate.
 
 The understanding I had was:  If you send a single zfs send | receive, then
 it does effectively get defragmented, because the receiving filesystem is
 going to re-layout the received filesystem, and there is nothing
 pre-existing to make the receiving filesystem dance around...  But if you're
 sending some initial, plus incrementals, then you're actually repeating the
 same operations that probably caused the original filesystem to become
 fragmented in the first place.  And in fact, it seems unavoidable...
 
 Suppose you have a large file, which is all sequential on disk.  You make a
 snapshot of it.  Which means all the individual blocks must not be
 overwritten.  And then you overwrite a few bytes scattered randomly in the
 middle of the file.  The nature of copy on write is such that of course, the
  latest version of the file cannot remain contiguous.  Your
  only choices are:  To read & write copies of the whole file, including
 multiple copies of what didn't change, or you leave the existing data in
 place where it is on disk, and you instead write your new random bytes to
 other non-contiguous locations on disk.  Hence fragmentation.

This operational definition of fragmentation comes from the single-user,
single-tasking world (PeeCees). In that world, only one thread writes files
from one application at one time. In those cases, there is a reasonable
expectation that a single file's blocks might be contiguous on a single disk.
That isn't the world we live in, where we have RAID, multi-user, or
multi-threaded environments.
 -- richard



Re: [zfs-discuss] resilver = defrag?

2010-09-11 Thread Orvar Korvar
I am not really worried about fragmentation. I was just wondering whether, if
I attach new drives and zfs send | receive to a new zpool, that would count
as defrag. But apparently not.

Anyway thank you for your input!


Re: [zfs-discuss] resilver = defrag?

2010-09-11 Thread Richard Elling
It really depends on your definition of fragmentation. This term is used
differently for various file systems. The UFS notion of fragmentation is
closer to the ZFS notion of gangs.

 -- richard

On Sep 11, 2010, at 6:16 AM, Orvar Korvar knatte_fnatte_tja...@yahoo.com 
wrote:

 I am not really worried about fragmentation. I was just wondering if I
 attach new drives and zfs send | receive to a new zpool, whether that would
 count as defrag. But apparently not.
 


Re: [zfs-discuss] resilver = defrag?

2010-09-10 Thread Darren J Moffat

On 10/09/2010 04:24, Bill Sommerfeld wrote:

C) Does zfs send | zfs receive mean it will defrag?


Scores so far:
1 No
2 Yes


maybe. If there is sufficient contiguous freespace in the destination
pool, files may be less fragmented.

But if you do incremental sends of multiple snapshots, you may well
replicate some or all the fragmentation on the origin (because snapshots
only copy the blocks that change, and receiving an incremental send does
the same).

And if the destination pool is short on space you may end up more
fragmented than the source.


There is yet more "it depends".

It depends on what you mean by fragmentation.

ZFS has gang blocks, which are used when we need to store a block of
size N but can't find a free region of that size, yet can make up that
amount of storage from M smaller blocks that are available.


Because ZFS send|recv work at the DMU layer they know nothing about gang 
blocks, which are a ZIO layer concept.  As such if your filesystem is 
heavily fragmented on the source because it uses gang blocks, that 
doesn't necessarily mean it will be using gang blocks at all or of the 
same size on the destination.


I very strongly recommend the original poster take a step back and ask:
why are you even worried about fragmentation?  Do you know you have a
pool that is fragmented?  Is it actually causing you a performance
problem?


--
Darren J Moffat


Re: [zfs-discuss] resilver = defrag?

2010-09-09 Thread Freddie Cash
On Thu, Sep 9, 2010 at 1:04 PM, Orvar Korvar
knatte_fnatte_tja...@yahoo.com wrote:
 A) Resilver = Defrag. True/false?

False.  Resilver just rebuilds a drive in a vdev based on the
redundant data stored on the other drives in the vdev.  Similar to how
replacing a dead drive works in a hardware RAID array.

 B) If I buy larger drives and resilver, does defrag happen?

No.

 C) Does zfs send | zfs receive mean it will defrag?

No.

ZFS doesn't currently have a defragmenter.  That will come when the
legendary block pointer rewrite feature is committed.


-- 
Freddie Cash
fjwc...@gmail.com


Re: [zfs-discuss] resilver = defrag?

2010-09-09 Thread Marty Scholes
I am speaking from my own observations and not from anything scientific such
as reading the code or studying the design of the process.

 A) Resilver = Defrag. True/false?

False

 B) If I buy larger drives and resilver, does defrag
 happen?

No.  The first X sectors of the bigger drive are identical to the smaller 
drive, fragments and all.

 C) Does zfs send | zfs receive mean it will defrag?

Yes.  The data is laid out on the receiving side in a sane manner, until it 
later becomes fragmented.


Re: [zfs-discuss] resilver = defrag?

2010-09-09 Thread Freddie Cash
On Thu, Sep 9, 2010 at 1:26 PM, Freddie Cash fjwc...@gmail.com wrote:
 On Thu, Sep 9, 2010 at 1:04 PM, Orvar Korvar
 knatte_fnatte_tja...@yahoo.com wrote:
 A) Resilver = Defrag. True/false?

 False.  Resilver just rebuilds a drive in a vdev based on the
 redundant data stored on the other drives in the vdev.  Similar to how
 replacing a dead drive works in a hardware RAID array.

 B) If I buy larger drives and resilver, does defrag happen?

 No.

Actually, thinking about it ... since the resilver is writing new data
to an empty drive, in essence, the drive is defragmented.

 C) Does zfs send | zfs receive mean it will defrag?

 No.

Same here, but only if the receiving pool has never had any snapshots
deleted or files deleted, so that there are no holes in the pool.
Then the newly written data will be contiguous (not fragmented).

 ZFS doesn't currently have a defragmenter.  That will come when the
 legendary block pointer rewrite feature is committed.




Re: [zfs-discuss] resilver = defrag?

2010-09-09 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Orvar Korvar
 
 A) Resilver = Defrag. True/false?

I think everyone will agree false on this question.  However, more detail
may be appropriate.  See below.


 B) If I buy larger drives and resilver, does defrag happen?

Scores so far:
2 No
1 Yes


 C) Does zfs send | zfs receive mean it will defrag?

Scores so far:
1 No
2 Yes

 ...

Does anybody here know what they're talking about?  I'd feel good if perhaps
Erik ... or Neil ... perhaps ... answered the question with actual
knowledge.

Thanks...



Re: [zfs-discuss] resilver = defrag?

2010-09-09 Thread Bill Sommerfeld

On 09/09/10 20:08, Edward Ned Harvey wrote:

Scores so far:
2 No
1 Yes


No.  Resilver does not re-lay out your data or change what's in the block
pointers on disk.  If it was fragmented before, it will be fragmented after.



C) Does zfs send | zfs receive mean it will defrag?


Scores so far:
1 No
2 Yes


maybe.  If there is sufficient contiguous freespace in the destination 
pool, files may be less fragmented.


But if you do incremental sends of multiple snapshots, you may well 
replicate some or all the fragmentation on the origin (because snapshots 
only copy the blocks that change, and receiving an incremental send does 
the same).


And if the destination pool is short on space you may end up more 
fragmented than the source.


- Bill
