[zfs-discuss] User-visible non-blocking / atomic ops in ZFS

2007-11-21 Thread James Cone
Hello All,

Is any of the following available in ZFS, or is there any plan to add it?

   - persistent atomic-inc/atomic-dec of a group of bytes in a file

   - LL/SC or compare-and-swap of a group of bytes in a file, or of a
     whole file

   - multiple renames, where all or none of them happen, with regard to:
       - readers of the files
       - panic or hard-stop of the machine
       - other people doing renames or multiple renames

Regards,
James.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] User-visible non-blocking / atomic ops in ZFS

2007-11-21 Thread can you guess?
I'm going to combine three posts here because they all involve jcone:

First, as to my message heading:

The 'search forum' mechanism can't find his posts under the 'jcone' name (I was 
curious, because they're interesting/strange, depending on how one looks at 
them).  I've also noticed (once in his case, once in Louwtjie's) that the 'last 
post' column of one thread may reflect a post made to a different thread.

Second, in response to your 'Indexing other than hash tables' post:

The only way you could get a file system like ZFS to perform indexed look-ups 
for you would be to make each of your 'records' an entire file with the 
appropriate look-up name, and ReiserFS may be the only current file system that 
could handle this reasonably well.

This is an outgrowth of the Unix mindset that files must only be byte-streams 
rather than anything more powerful (such as the single- and multi-key indexed 
files of traditional minicomputer and mainframe systems) - and that's 
especially unfortunate in ZFS's case, because system-managed COW mechanisms 
just happen to be a dynamite way to handle b-trees (you could do so at the 
application level on top of ZFS via use of a sparse file plus a facility to 
deallocate space in it explicitly, but you'd still need an entire separate 
level of in-file space-allocation/deallocation mechanism).  B-trees are the 
obvious solution to the kind of partial-key and/or key-range queries that you 
described.
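
To make the partial-key and key-range point concrete, here is a minimal Python sketch.  It assumes a sorted in-memory list of byte-string keys as a stand-in for the key order that a B-tree maintains; the names `next_prefix`, `prefix_range` and `key_range` are hypothetical and are not part of ZFS or of any file-system API.

```python
import bisect


def next_prefix(prefix: bytes):
    """Smallest byte string greater than every key that starts with prefix.

    Returns None when no such bound exists (empty or all-0xff prefix).
    """
    stripped = prefix.rstrip(b"\xff")
    if not stripped:
        return None
    return stripped[:-1] + bytes([stripped[-1] + 1])


def prefix_range(sorted_keys, prefix: bytes):
    """All keys that start with prefix, located with two binary searches."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    upper = next_prefix(prefix)
    if upper is None:
        hi = len(sorted_keys)
    else:
        hi = bisect.bisect_left(sorted_keys, upper)
    return sorted_keys[lo:hi]


def key_range(sorted_keys, low: bytes, high: bytes):
    """All keys k with low <= k <= high."""
    lo = bisect.bisect_left(sorted_keys, low)
    hi = bisect.bisect_right(sorted_keys, high)
    return sorted_keys[lo:hi]
```

Because the keys are kept in order, both queries touch only the matching slice; a hash table would have to examine every entry to answer the same question.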

Finally, in response to your current post (which sounds more as if it had come 
from a hardware engineer than from a database type):

All the facilities that you describe are traditionally handled by transactions 
of one form or another, and only read-only transactions can normally be 
non-blocking (because they simply capture a consistent point-in-time database 
state and operate upon that, ignoring any subsequent changes that may occur 
during their lifetimes).

Other less-popular but more general non-blocking approaches exist which simply 
abort upon detecting conflict rather than attempt to wait for the conflict to 
evaporate.  That tends not to scale very well because (unlike the case with 
non-blocking low-level hardware synchronization) restarting a transaction when 
you don't have to can very often result in a *lot* of redundant work being 
performed.  These approaches include some multi-version schemes that implement 
more general 'time domain addressing' than that just described for read-only 
transactions, and the rare implementations based upon 'optimistic' concurrency 
control that let conflicts occur and then decide whether to abort someone when 
they attempt to commit.
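
As an illustration of the 'optimistic' style only, here is a minimal Python sketch of a versioned in-memory store whose transactions never wait: they record the version of everything they read, buffer their writes, and are aborted at commit time if anything they read has since changed.  The `Store`, `Txn` and `Conflict` names are hypothetical, and nothing here reflects an interface that ZFS offers.

```python
import threading


class Conflict(Exception):
    """Raised at commit when a value read by the transaction has changed."""


class Store:
    def __init__(self):
        self.data = {}                 # key -> (version, value)
        self.lock = threading.Lock()   # held only during validate-and-apply

    def begin(self):
        return Txn(self)


class Txn:
    def __init__(self, store):
        self.store = store
        self.reads = {}    # key -> version first seen by this transaction
        self.writes = {}   # key -> buffered new value

    def read(self, key):
        if key in self.writes:
            return self.writes[key]
        version, value = self.store.data.get(key, (0, None))
        self.reads.setdefault(key, version)
        return value

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        with self.store.lock:
            # Validation: abort if anything we read was changed meanwhile.
            for key, seen in self.reads.items():
                current, _ = self.store.data.get(key, (0, None))
                if current != seen:
                    raise Conflict(key)
            # Apply the buffered writes, bumping each key's version.
            for key, value in self.writes.items():
                version, _ = self.store.data.get(key, (0, None))
                self.store.data[key] = (version + 1, value)
```

An aborted caller has to redo its reads and computation from scratch, which is exactly the redundant work that makes this approach scale poorly under contention.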

ZFS supports transactions only for its internal use, and cannot feasibly 
support arbitrarily complex transactions because its atomicity approach depends 
upon gathering all transaction updates in RAM before writing them back 
atomically to disk (yes, it could perhaps do so in stages, since the entire new 
tree structure doesn't become visible until its root has been made persistent, 
but that could arbitrarily delay other write activity in the system).  While I 
think that supporting user-level transactions is a useful file-system feature 
and a few file systems such as Transarc's Structured File System have actually 
done so, ZFS would have to change significantly to do so for anything other 
than *very* limited user-level transactions - hence I wouldn't hold my breath 
waiting for such support in ZFS.
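
The root-switch idea can be sketched in a few lines of Python.  This is a toy in-memory model under stated assumptions, not ZFS's actual data structures: `Node`, `cow_set`, `lookup` and `Pool` are hypothetical names, and the single assignment to `self.root` merely stands in for making a new root persistent.

```python
class Node:
    """Tree node that is never modified once it is reachable from a root."""

    def __init__(self, children=None, value=None):
        self.children = dict(children or {})
        self.value = value


def cow_set(node, path, value):
    """Return a new root with path set to value; the old tree is untouched."""
    if node is None:
        node = Node()
    if not path:
        return Node(node.children, value)
    head, rest = path[0], path[1:]
    # Copy only the nodes along the path; all other subtrees are shared.
    new = Node(node.children, node.value)
    new.children[head] = cow_set(node.children.get(head), rest, value)
    return new


def lookup(node, path):
    for part in path:
        if node is None:
            return None
        node = node.children.get(part)
    return None if node is None else node.value


class Pool:
    def __init__(self):
        self.root = Node()

    def commit(self, updates):
        """Apply every (path, value) update, then expose them all at once."""
        staged = self.root
        for path, value in updates:
            staged = cow_set(staged, path, value)   # built entirely in memory
        self.root = staged                          # the one visible switch
```

A reader holding the old root keeps seeing the old state, and every update in the batch has to be staged in memory before the switch, which is the limitation on transaction size described above.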

- bill
 
 
This message posted from opensolaris.org


Re: [zfs-discuss] User-visible non-blocking / atomic ops in ZFS

2007-11-21 Thread James Cone
Hi Bill,

Yes, that covers all of my selfish questions, thanks.

The B-trees I'm used to divide in arbitrary places across the whole 
key, so doing partial-key queries is painful.

I can't find "Structured File System Transarc" usefully in Google.  Do 
you have a link handy?  If not, never mind.

Regards,
James.



Re: [zfs-discuss] User-visible non-blocking / atomic ops in ZFS

2007-11-21 Thread can you guess?
> The B-trees I'm used to divide in arbitrary places across the whole
> key, so doing partial-key queries is painful.

While the b-trees in DEC's Record Management Services (RMS) allowed 
multi-segment keys, they treated the entire key as a byte-string as far as 
prefix searches went (i.e., the segmentation wasn't significant to that, and 
there's no obvious reason why it should have been in other implementations).
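
A small Python sketch of that behaviour follows.  It assumes fixed-width, space-padded segments concatenated into one byte string; `make_key`, `prefix_search` and the field widths are hypothetical illustrations and are not RMS's actual interface or on-disk format.

```python
def make_key(segments, widths):
    """Concatenate space-padded fixed-width segments into one byte-string key."""
    parts = []
    for seg, width in zip(segments, widths):
        if len(seg) > width:
            raise ValueError("segment longer than its field")
        parts.append(seg.ljust(width, b" "))
    return b"".join(parts)


def prefix_search(keys, prefix):
    """Match on the raw bytes of the key, ignoring segment boundaries.

    A linear scan keeps the sketch short; an index would instead walk the
    contiguous run of matching keys in sorted order.
    """
    return [k for k in sorted(keys) if k.startswith(prefix)]
```

With widths of (6, 6), the prefix `b"Cone  Ja"` covers all of the first segment and part of the second, and the search neither knows nor cares where one segment ends and the next begins.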

 
> I can't find "Structured File System Transarc" usefully in Google.  Do
> you have a link handy?  If not, never mind.

Well, transarc.com now leads to a porn site, so that's not much help.  And 
Wikipedia's entry for Transarc is regrettably sparse.

Transarc was a Pittsburgh R&D company formed by some *very* bright CMU people.  
It's probably best known for its 'Encina' distributed transaction environment 
(SFS was actually part of Encina, but IIRC a separable one), for having 
developed the distributed file system (DFS) component of the Open Group's 
Distributed Computing Environment (DCE), and for AFS, the productized (and now 
open source) version of CMU's distributed Andrew file system; my own 
acquaintance with Transarc became closer when I was helping develop a 
distributed transactional object system in the mid-'90s and we were using their 
book 'Camelot and Avalon' for high-level design inspiration.  They were always 
closely associated with IBM, which absorbed them as a wholly-owned subsidiary 
in 1994 (and I've heard relatively little about them since).

SFS was one of their lesser-known achievements:  a record-oriented 
transactional file system.  I've always felt that system-managed 
record-oriented files were useful, in part because a lot of the nitty-gritty 
space management that's required (e.g., to handle the structured pages that 
tend to be necessary to accommodate data that's allowed to change its size or 
is required to remain in some key order under insertion/update/deletion 
activity) duplicates similar space-management required of the system to manage 
conventional byte-stream files and in part because any kind of system-wide 
lock- and deadlock-management facilities tend to want to tie into such data at 
a higher-than-byte-stream level (e.g., because the locked entities may have to 
move around) - so SFS was interesting to me.  Unfortunately, it's been long 
enough that I can't remember too many details about it - e.g., it may or may 
not have supported interlocked access at the record field level - and at least 
after a quick search I can't find any papers about it that I may have 
downloaded (that era was before I really recognized how evanescent Web material 
often may be).

I actually did get a Google hit at position 19 with the search terms you used 
(after a plethora of hits on 'log structured file system', of course), but it 
wasn't very enlightening.  Nor were several later ones, until hit 42 at the 
University of Waterloo - a .pdf that contains at least a brief description 
starting on page 21 (including a thinly-disguised rip-off of a figure in 
Gray & Reuter's classic 'Transaction Processing' - but it's not quite 
*identical*...).

Aha - good old reliable IBM *does* still have some SFS documentation on line 
that hit 75 noticed; munging that URL a bit led to 
http://publib.boulder.ibm.com/infocenter/txformp/v5r1/index.jsp?noscript=1 
(expand 'Encina Books' in the left-hand frame and start digging...).

- bill
 
 