Re: [zfs-discuss] Re: ZFS and databases

2006-06-14 Thread Richard Elling

billtodd wrote:
I do want to comment on the observation that enough concurrent 128K I/O can 
saturate a disk - the apparent implication being that one could therefore do 
no better with larger accesses, an incorrect conclusion.  Current disks can 
stream out 128 KB in 1.5 - 3 ms., while taking 5.5 - 12.5 ms. for the 
average-seek-plus-partial-rotation required to get to that 128 KB in the first 
place.  Thus on a full drive serial random accesses to 128 KB chunks will yield 
only about 20% of the drive's streaming capability (by contrast, accessing 
data using serial random accesses in 4 MB contiguous chunks achieves around 
90% of a drive's streaming capability):  one can do better on disks that 
support queuing if one allows queues to form, but this trades significantly 
increased average operation latency for the increase in throughput (and said 
increase still falls far short of the 90% utilization one could achieve using 
4 MB chunking).
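A quick back-of-the-envelope check of those ratios (a sketch; the streaming
rate and positioning time below are assumptions taken from the figures quoted
above, not measurements) reproduces roughly the 20% and 90% numbers:

/*
 * Rough utilization model for random access in fixed-size chunks:
 *   efficiency = transfer_time / (positioning_time + transfer_time)
 * The numbers are assumptions for a circa-2006 disk: ~64 MB/s streaming,
 * ~9 ms average seek plus partial rotation.
 */
#include <stdio.h>

int
main(void)
{
	const double stream_mb_s = 64.0;	/* assumed streaming rate  */
	const double position_ms = 9.0;		/* assumed seek + rotation */
	const double chunk_kb[] = { 128.0, 4096.0 };

	for (int i = 0; i < 2; i++) {
		double xfer_ms = (chunk_kb[i] / 1024.0) / stream_mb_s * 1000.0;
		double eff = xfer_ms / (position_ms + xfer_ms);

		printf("%6.0f KB chunks: ~%.0f%% of streaming throughput\n",
		    chunk_kb[i], eff * 100.0);
	}
	return (0);
}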


Enough concurrent 0.5 KB I/O can also saturate a disk, after all - but this 
says little about effective utilization.


I think I can summarize where we are at on this.

This is the classic big-{packet|block|$-line|bikini} versus
small-{packet|block|$-line|bikini} argument.  One size won't fit all.

The jury is still out on what all of this means for any given application.
 -- richard


Re: [zfs-discuss] Re: ZFS and databases

2006-06-14 Thread Roch

For output ops, ZFS could set up a 10MB I/O transfer to disk
starting at sector X, or chunk that up into 128K operations while
still assigning the same range of disk blocks to them. Yes, there
will be more control information going around and a little more
CPU consumed, but the disk will be streaming all right, I would
guess.
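A minimal sketch of that chunking idea (illustration only, not ZFS code; the
128K record size is just the default recordsize): the same contiguous byte
range is written either way, so the device still sees a sequential stream.

/*
 * Illustration only, not ZFS code: write one large buffer either as a
 * single transfer or as 128K records covering the same contiguous
 * range.  Either way the disk sees back-to-back, in-order blocks.
 */
#include <sys/types.h>
#include <unistd.h>

#define	RECORD	(128 * 1024)

static ssize_t
write_in_records(int fd, const char *buf, size_t len, off_t start)
{
	size_t done = 0;

	while (done < len) {
		size_t chunk = (len - done < RECORD) ? len - done : RECORD;
		ssize_t n = pwrite(fd, buf + done, chunk, start + done);

		if (n <= 0)
			return (-1);
		done += (size_t)n;
	}
	return ((ssize_t)done);
}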

Most heavy output loads will behave this way with ZFS, random
or not. The throughput will depend more on the availability of
contiguous chunks of disk blocks than on the actual record size
in use.

As for random input, the issue is that ZFS does not get a say
as to what the application is requesting in terms of size and
location. Yes, doing a 4M input of contiguous disk blocks will be
faster than randomly reading 128K chunks spread out. But if the
application is manipulating 4M objects, those will stream out and
land on contiguous disk blocks (if available), and those should
stream in as well (if our read codepath is clever enough).

The fundamental question is really: is there something that ZFS
does that causes data representing an application's logical unit
of information (likely to be read as one chunk, and therefore data
we would like to keep contiguous on disk) to actually spread out
all over the platter?

-r






[zfs-discuss] Re: ZFS and databases

2006-06-13 Thread can you guess?
Sorry for resurrecting this interesting discussion so late:  I'm skimming 
backwards through the forum.

One comment about segregating database logs is that people who take their data 
seriously often want a 'belt plus suspenders' approach to recovery.  
Conventional RAID, even supplemented with ZFS's self-healing scrubbing, isn't 
sufficient (though RAID-6 might be):  they want at least the redo logs kept 
separate, so that in the extremely unlikely event that they lose something in 
the (already replicated) database, the failure is guaranteed not to have 
affected the redo logs as well, and the current database state can then be 
reconstructed from a backup plus those logs.

True, this will mean that you can't aggregate redo log activity with other 
transaction bulk-writes, but that's at least partly good as well:  databases 
are often extremely sensitive to redo log write latency and would not want such 
writes delayed by combination with other updates, let alone by up to a 5-second 
delay.

ZFS's synchronous write intent log could help here (if you replicate it:  
serious database people would consider even the very temporary exposure to a 
single failure inherent in an unmirrored log completely unacceptable), but that 
could also be slowed by other synch small write activity; conversely, databases 
often couldn't care less about the latency of many of their other writes, 
because their own (replicated) redo log has already established the persistence 
that they need.

As for direct I/O, it's not clear why ZFS couldn't support it.  It could verify 
each read in user memory against its internal checksum and perform its 
self-healing magic if necessary before returning completion status (the same 
status it would return if the same situation occurred during its normal mode 
of operation:  either unconditional success, or success-after-recovery if the 
application might care to know that).  It could handle each synchronous write 
analogously, and if direct I/O mechanisms support lazy writes then presumably 
they tie up the user buffer until the write completes, so you could use your 
normal mechanisms there as well (just operating on the user buffer instead of 
your cache).  In this I'm assuming that 'direct I/O' refers not to raw device 
access but to file-oriented access that simply avoids any internal cache use, 
such that you could still use your no-overwrite approach.
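A rough sketch of the read half of that (hypothetical, not the actual ZFS code
path; read_block_into(), block_checksum(), stored_checksum() and repair_block()
are invented placeholders): read straight into the caller's buffer, verify the
checksum there, and only fall back to recovery on a mismatch.

#include <sys/types.h>
#include <stdint.h>

/* Invented placeholders for this sketch -- not real ZFS interfaces. */
extern int	read_block_into(void *buf, uint64_t blkid, size_t len);
extern uint64_t	block_checksum(const void *buf, size_t len);
extern uint64_t	stored_checksum(uint64_t blkid);
extern int	repair_block(uint64_t blkid);

typedef enum { READ_OK, READ_OK_AFTER_REPAIR, READ_FAILED } read_status_t;

static read_status_t
direct_read(void *user_buf, uint64_t blkid, size_t len)
{
	if (read_block_into(user_buf, blkid, len) != 0)
		return (READ_FAILED);

	/* Common case: data is good; no cache copy was ever made. */
	if (block_checksum(user_buf, len) == stored_checksum(blkid))
		return (READ_OK);

	/*
	 * Mismatch: heal from a good replica, then retry into the same
	 * user buffer and report success-after-recovery.
	 */
	if (repair_block(blkid) == 0 &&
	    read_block_into(user_buf, blkid, len) == 0 &&
	    block_checksum(user_buf, len) == stored_checksum(blkid))
		return (READ_OK_AFTER_REPAIR);

	return (READ_FAILED);
}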

Of course, this also assumes that the direct I/O is always being performed in 
aligned, integral multiples of checksum units by the application.  If not, 
you'd either have to bag the checksum facility (not an entirely unreasonable 
option to offer, given that some sophisticated applications might want to use 
their own even higher-level integrity mechanisms, e.g., across geographically 
separated sites, and would not need yours) or run everything through cache as 
you normally do.  In suitably aligned cases where you do validate the data, 
you could avoid half the copy overhead (an issue of memory bandwidth as well 
as simple operation latency:  TPC-C submissions can be affected by this, 
though it may be rare in real-world use) by integrating the checksum 
calculation with the copy, but you would still have multiple copies of the 
data taking up memory in a situation (direct I/O) where the application *by 
definition* does not expect you to be caching the data (quite likely because 
it is doing any desirable caching itself).
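For the "integrate the checksum calculation with the copy" point, something
like the following (a sketch using a simple Fletcher-style running sum, the
general flavor of checksum ZFS uses, not its actual routine) folds validation
into the single pass the copy already makes over the data:

#include <sys/types.h>
#include <stdint.h>

/*
 * Copy len bytes (assumed to be a multiple of 4 and suitably aligned)
 * while accumulating a Fletcher-style checksum, so validation costs no
 * extra pass over the data.  Sketch only; not the ZFS routine.
 */
static void
copy_and_checksum(void *dst, const void *src, size_t len, uint64_t sum[4])
{
	const uint32_t *in = src;
	uint32_t *out = dst;
	uint64_t a = 0, b = 0, c = 0, d = 0;
	size_t i;

	for (i = 0; i < len / sizeof (uint32_t); i++) {
		out[i] = in[i];		/* the copy ...                 */
		a += in[i];		/* ... and the running checksum */
		b += a;
		c += b;
		d += c;
	}
	sum[0] = a; sum[1] = b; sum[2] = c; sum[3] = d;
}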

Tablespace contiguity may, however, be a deal-breaker for some users:  it is 
common for tablespaces to be scanned sequentially (when selection criteria 
don't mesh with existing indexes, perhaps especially in joins where the 
smaller tablespace - still too large to be retained in cache, though - is 
scanned repeatedly in an inner loop), and a DBMS often goes to some effort to 
keep them defragmented.  Until ZFS provides some effective continuous 
defragmenting mechanism of its own, its no-overwrite policy may do more harm 
than good in such cases (since the database's own logs keep persistence 
latency low, the backing tablespaces can then be updated at leisure anyway).

I do want to comment on the observation that enough concurrent 128K I/O can 
saturate a disk - the apparent implication being that one could therefore do 
no better with larger accesses, an incorrect conclusion.  Current disks can 
stream out 128 KB in 1.5 - 3 ms., while taking 5.5 - 12.5 ms. for the 
average-seek-plus-partial-rotation required to get to that 128 KB in the first 
place.  Thus on a full drive serial random accesses to 128 KB chunks will yield 
only about 20% of the drive's streaming capability (by contrast, accessing data 
using serial random accesses in 4 MB contiguous chunks achieves around 90% of a 
drive's streaming capability):  one can do better on disks that support queuing 
if one allows queues to form, but this trades significantly increased average 
operation latency for the increase in throughput (and said increase still 
falls far short of the 90% utilization one could achieve using 4 MB chunking).

Enough concurrent 0.5 KB I/O can also saturate a disk, after all - but this 
says little about effective utilization.

Re: [zfs-discuss] Re: ZFS and databases

2006-05-15 Thread Franz Haberhauer

Nicolas Williams wrote:


On Sat, May 13, 2006 at 08:23:55AM +0200, Franz Haberhauer wrote:
 

Given that ISV apps can only be changed by the ISV, who may or may not be
willing to use such a new interface, having a no-cache property for the
file - or, given that filesystems are now really cheap with ZFS, for the
filesystem - would be important as well, like the forcedirectio mount
option for UFS.
No caching at the filesystem level is always appropriate if the
application itself maintains a buffer of application data and does its own
application-specific buffer management, like DBMSes or large matrix
solvers. Double caching these typically huge amounts of data in the
filesystem is always a waste of RAM.
   



Yes, but remember, DB vendors have adopted new features before -- they
want to have the fastest DB.  Same with open source web servers.  So I'm
a bit optimistic.
 

Yes, but they usually adopt it only with their latest releases, and it
takes time until those are adopted by customers. And it's not just DB
vendors; there are other apps around which could benefit, and there are
always some who may not adopt a new feature in Solaris at all. Remember
when UFS directio was introduced - forcedirectio was in much wider use
than apps which used the API directly.


Also, an LD_PRELOAD library could be provided to enable direct I/O as
necessary.
 

This would work technically, but whether ISVs are willing to support such
usage is a different topic (there may be startup scripts involved, making
it a little tricky to pass a library path to the app).

So while having the app request no caching may be the architecturally
cleaner approach, having it as a property on a file or filesystem is a
pragmatic approach, with faster time-to-market and a potential for much
broader use.

- Franz







Re: [zfs-discuss] Re: ZFS and databases

2006-05-15 Thread Darren J Moffat

Franz Haberhauer wrote:
This would work technically, but whether ISVs are willing to support such
usage is a different topic (there may be startup scripts involved, making
it a little tricky to pass a library path to the app).


Yet another reason to start the applications from SMF.

--
Darren J Moffat


Re: [zfs-discuss] Re: ZFS and databases

2006-05-15 Thread Nicolas Williams
On Mon, May 15, 2006 at 07:16:38PM +0200, Franz Haberhauer wrote:
 Nicolas Williams wrote:
 Yes, but remember, DB vendors have adopted new features before -- they
 want to have the fastest DB.  Same with open source web servers.  So I'm
 a bit optimistic.
  
 
 Yes, but they usually adopt it only with their latest releases, and it
 takes time until those are adopted by customers. And it's not just DB
 vendors; there are other apps around which could benefit, and there are
 always some who may not adopt a new feature in Solaris at all. Remember
 when UFS directio was introduced - forcedirectio was in much wider use
 than apps which used the API directly.

I (but I'm not in the ZFS team) don't oppose a file attribute of some
sort to provide hints to the FS about the utility of direct I/O to
processes that open such files.

Ideally the OS could just figure it out every time with enough accuracy
that no interface should be necessary at all, but I'm not sure that this
is possible.

But really, the right interface is for the application to tell the OS.
I don't know what others here (marketing particularly -- you may well be
right about time to market) think of it, but if we could just stick to
proper interfaces, that would be best.

Nico
-- 


Re: [zfs-discuss] Re: ZFS and databases

2006-05-15 Thread Nicolas Williams
On Mon, May 15, 2006 at 11:17:17AM -0700, Bart Smaalders wrote:
 Perhaps an fadvise call is in order?

We already have directio(3C).

(That was a surprise for me also.)
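For reference, directio(3C) usage is just a per-file-descriptor advisory call
(sketch below; on UFS it enables direct I/O for that file, and what ZFS should
do with the same hint is exactly the open question in this thread):

#include <sys/types.h>
#include <sys/fcntl.h>		/* directio(3C), DIRECTIO_ON */
#include <fcntl.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
	int fd;

	if (argc < 2)
		return (2);
	if ((fd = open(argv[1], O_RDWR)) < 0) {
		perror("open");
		return (1);
	}
	if (directio(fd, DIRECTIO_ON) != 0)
		perror("directio");	/* advisory; may fail if unsupported */

	/* ... subsequent reads/writes on fd bypass the page cache on UFS ... */
	return (0);
}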


Re: [zfs-discuss] Re: ZFS and databases

2006-05-13 Thread Nicolas Williams
On Sat, May 13, 2006 at 08:23:55AM +0200, Franz Haberhauer wrote:
 Given that ISV apps can only be changed by the ISV, who may or may not be
 willing to use such a new interface, having a no-cache property for the
 file - or, given that filesystems are now really cheap with ZFS, for the
 filesystem - would be important as well, like the forcedirectio mount
 option for UFS.
 No caching at the filesystem level is always appropriate if the
 application itself maintains a buffer of application data and does its
 own application-specific buffer management, like DBMSes or large matrix
 solvers. Double caching these typically huge amounts of data in the
 filesystem is always a waste of RAM.

Yes, but remember, DB vendors have adopted new features before -- they
want to have the fastest DB.  Same with open source web servers.  So I'm
a bit optimistic.

Also, an LD_PRELOAD library could be provided to enable direct I/O as
necessary.
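Such a preload library could be as small as an open(2) interposer that turns
direct I/O on for whatever the application opens (a sketch only; a real one
would filter by path or an environment variable, and build details such as
RTLD_NEXT visibility and linking with -ldl vary by platform):

/*
 * Sketch of an LD_PRELOAD interposer: wrap open(2) and request direct
 * I/O on every file the unmodified application opens.
 */
#include <dlfcn.h>
#include <stdarg.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/fcntl.h>		/* directio(3C), DIRECTIO_ON */

int
open(const char *path, int oflag, ...)
{
	static int (*real_open)(const char *, int, ...);
	mode_t mode = 0;
	int fd;

	if (real_open == NULL)
		real_open = (int (*)(const char *, int, ...))
		    dlsym(RTLD_NEXT, "open");

	if (oflag & O_CREAT) {
		va_list ap;

		va_start(ap, oflag);
		mode = va_arg(ap, mode_t);
		va_end(ap);
	}
	fd = real_open(path, oflag, mode);
	if (fd >= 0)
		(void) directio(fd, DIRECTIO_ON);	/* best effort */
	return (fd);
}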

Nico
-- 


Re: [zfs-discuss] Re: ZFS and databases

2006-05-12 Thread Nicolas Williams
On Fri, May 12, 2006 at 05:23:53PM +0200, Roch Bourbonnais - Performance 
Engineering wrote:
 For reads it is an interesting concept. Since
 
 	Reading into cache
 	Then copying into user space
 	then keeping the data around but never using it
 
 is not optimal.
 So, 2 issues: there is the cost of the copy and there is the memory.
 
 Now, could we detect the pattern that makes holding on to the
 cached block suboptimal and do a quick freebehind after the
 copyout? Something like random access + very large file + poor cache-hit
 ratio?

An interface to request no caching on a per-file basis would be good
(madvise(2) should do for mmap'ed files, an fcntl(2) or open(2) flag
would be better).
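The mmap half of that already exists via madvise(3C) (sketch below); the
fcntl(2)/open(2) flag for the ordinary read/write path is the part that would
be new:

/*
 * Sketch: map a file and hint that its pages should not be kept
 * cached.  madvise(3C) covers the mmap case today; an equivalent
 * fcntl(2)/open(2) flag for read(2)/write(2) is the hypothetical part.
 */
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

static void *
map_for_one_pass(const char *path, size_t *lenp)
{
	struct stat st;
	void *p;
	int fd;

	if ((fd = open(path, O_RDONLY)) < 0)
		return (NULL);
	if (fstat(fd, &st) != 0) {
		(void) close(fd);
		return (NULL);
	}
	p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	(void) close(fd);
	if (p == MAP_FAILED)
		return (NULL);

	/* Read ahead aggressively and free pages behind the scan. */
	(void) madvise((caddr_t)p, st.st_size, MADV_SEQUENTIAL);
	/* Caller issues MADV_DONTNEED once it is finished with the data. */

	*lenp = (size_t)st.st_size;
	return (p);
}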

 Now about avoiding the copy; That would mean dma straight
 into user space ? But if the checksum does not validate the
 data, what do we do ?

Who cares?  You DMA into user-space, check the checksum and if there's a
problem return an error; so there's [corrupted] data in the user space
buffer... but the app knows it, so what's the problem (see below)?

   If storage is not raid-protected and we
 have to return EIO, I don't think we can do this _and_
 corrupt the user buffer also, not sure what POSIX says for
 this situation.

If POSIX compliance is an issue just add new interfaces (possibly as
simple as an open(2) flag).

 Now latency wise, the cost of copy is  small compared to the
 I/O;  right ? So it now  turns into an  issue of saving some
 CPU cycles.

Can you build a system where the cost of the copy adds significantly to
the latency numbers?  (Think RAM disks.)
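Rough numbers behind that question (assumptions, not measurements): a 128 KB
copy at a couple of GB/s of memory bandwidth costs tens of microseconds, which
is noise against a multi-millisecond disk read but a large fraction of a
RAM-disk or cache-hit read.

#include <stdio.h>

/* Back-of-the-envelope only; all three rates below are assumptions. */
int
main(void)
{
	double copy_us = (128.0 * 1024) / 2e9 * 1e6;	/* ~2 GB/s memcpy   */
	double disk_us = 8000.0;			/* random disk read */
	double ram_us = 100.0;				/* RAM-disk read    */

	printf("copy ~%.0f us: %.1f%% of a disk read, %.0f%% of a RAM-disk read\n",
	    copy_us, copy_us / disk_us * 100.0, copy_us / ram_us * 100.0);
	return (0);
}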

Nico
-- 


Re: [zfs-discuss] Re: ZFS and databases

2006-05-12 Thread Roch Bourbonnais - Performance Engineering


Finally I can agree with somebody today. 

Directio is non-POSIX anyway, and given that people have been trained
to inform the system when the cache won't be useful, and that it's a
hard problem to detect automatically, let's avoid the copy and save
memory all at once for the read path.

We could use the directio() call for that ...

-r



Re: [zfs-discuss] Re: ZFS and databases

2006-05-12 Thread Nicolas Williams
On Fri, May 12, 2006 at 06:33:00PM +0200, Roch Bourbonnais - Performance 
Engineering wrote:
 Directio is non-POSIX anyway, and given that people have been trained
 to inform the system when the cache won't be useful, and that it's a
 hard problem to detect automatically, let's avoid the copy and save
 memory all at once for the read path.
 
 We could use the directio() call for that ...

I had no idea about directio(3C)!

We might want an interface for the app to know what the natural block 
size of the file is, so it can read at proper file offsets.

Of course, if that block size is smaller than the ZFS filesystem record
size then ZFS may yet grow it.  How to deal with this?  (One option:
don't grow it as long as an app has turned direct I/O on for a fildes
and the fildes remains open.)
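One existing candidate for that natural-block-size hint is st_blksize from
fstat(2) (a sketch; I'm assuming the filesystem reports a useful value there,
and pinning it while direct I/O is enabled is exactly the open question above):

/*
 * Sketch: treat st_blksize as the file's natural I/O size and keep
 * reads aligned to it.
 */
#include <sys/types.h>
#include <sys/stat.h>
#include <stdlib.h>
#include <unistd.h>

static ssize_t
read_one_natural_block(int fd, off_t off, void **bufp)
{
	struct stat st;
	ssize_t n;
	char *buf;

	if (fstat(fd, &st) != 0)
		return (-1);

	off -= off % st.st_blksize;		/* align to a block boundary */
	if ((buf = malloc(st.st_blksize)) == NULL)
		return (-1);
	n = pread(fd, buf, st.st_blksize, off);
	if (n < 0) {
		free(buf);
		return (-1);
	}
	*bufp = buf;
	return (n);
}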

Nico
-- 


Re: [zfs-discuss] Re: ZFS and databases

2006-05-12 Thread Richard Elling
On Fri, 2006-05-12 at 10:42 -0500, Anton Rang wrote:
  Now latency wise, the cost of copy is  small compared to the
  I/O;  right ? So it now  turns into an  issue of saving some
  CPU cycles.
 
 CPU cycles and memory bandwidth (which both can be in short
 supply on a database server).

We can throw hardware at that :-)  Imagine a machine with lots
of extra CPU cycles and lots of parallel access to multiple
memory banks.  This is the strategy behind CMT.  In the future,
you will have many more CPU cycles and even better memory
bandwidth than you do now, perhaps by an order of magnitude
in the next few years.
 -- richard




Re: [zfs-discuss] Re: ZFS and databases

2006-05-12 Thread Nicolas Williams
On Fri, May 12, 2006 at 09:59:56AM -0700, Richard Elling wrote:
 On Fri, 2006-05-12 at 10:42 -0500, Anton Rang wrote:
   Now latency wise, the cost of copy is  small compared to the
   I/O;  right ? So it now  turns into an  issue of saving some
   CPU cycles.
  
  CPU cycles and memory bandwidth (which both can be in short
  supply on a database server).
 
 We can throw hardware at that :-)  Imagine a machine with lots
 of extra CPU cycles and lots of parallel access to multiple
 memory banks.  This is the strategy behind CMT.  In the future,
 you will have many more CPU cycles and even better memory
 bandwidth than you do now, perhaps by an order of magnitude
 in the next few years.

Well, yes, of course, but I think the arguments for direct I/O are
excellent.

Another thing that I see an argument for is limiting the size of various
caches, to avoid paging (even having no swap isn't enough as you don't
want memory pressure evicting hot text pages).

Nico
-- 


Re: [zfs-discuss] Re: ZFS and databases

2006-05-12 Thread Anton Rang

On May 12, 2006, at 11:59 AM, Richard Elling wrote:


CPU cycles and memory bandwidth (which both can be in short
supply on a database server).


We can throw hardware at that :-)  Imagine a machine with lots
of extra CPU cycles [ ... ]


Yes, I've heard this story before, and I won't believe it this  
time.  ;-)


Seriously, I believe a database can perform very well on a CMT system,
but there won't be any extra CPU cycles or memory bandwidth, because
the demand for transaction rates will always exceed what we can supply.

Anton

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss