On 04/02/10 08:24, Edward Ned Harvey wrote:
The purpose of the ZIL is to act like a fast "log" for synchronous
writes.  It allows the system to quickly confirm a synchronous write
request with the minimum amount of work.

Bob and Casper and some others clearly know a lot here.  But I'm hearing
conflicting information, and don't know what to believe.  Does anyone here
work on ZFS as an actual ZFS developer for Sun/Oracle?  Can claim "I can
answer this question, I wrote that code, or at least have read it?"

I'm one of the ZFS developers. I wrote most of the zil code.
Still I don't have all the answers. There are a lot of knowledgeable people
on this alias. I usually monitor this alias and sometimes chime in
when there's some misinformation being spread, but sometimes the volume is just too high.
Since I started this reply there have been 20 new posts on this thread alone!

Questions to answer would be:

Is a ZIL log device used only by sync() and fsync() system calls?

- The intent log (separate device(s) or not) is only used by fsync, O_DSYNC, O_SYNC, O_RSYNC.
NFS commits are seen by ZFS as fsyncs.
Note sync(1m) and sync(2s) do not use the intent log. They force transaction group (txg) commits on all pools. So zfs goes beyond the requirement for sync(), which only requires
that it schedule but not necessarily complete the writing before returning.
The zfs interpretation is rather expensive but seemed broken so we fixed it.
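
For illustration only (this is not from the original message), here is a minimal Python sketch of the application-side calls mentioned above. The function name write_sync_paths and the file names are hypothetical. It assumes a POSIX-like system; O_DSYNC is looked up with getattr because not every platform exposes it. Which of these calls reach the intent log is a property of ZFS as described above, not something this code can observe.

```python
import os
import tempfile


def write_sync_paths(directory, call_sync=False):
    """Write two small files, each through a different synchronous path.

    Returns the byte counts reported by os.write for each path.
    """
    results = {}

    # Path 1: an ordinary write() followed by fsync() on that descriptor.
    # fsync() returns only after this file's data is reported stable.
    fsync_path = os.path.join(directory, "fsync.dat")
    fd = os.open(fsync_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        results["fsync"] = os.write(fd, b"record-1\n")
        os.fsync(fd)
    finally:
        os.close(fd)

    # Path 2: O_DSYNC makes every write() on the descriptor synchronous.
    # Assumption: fall back to 0 (no extra flag) where O_DSYNC is missing.
    dsync_flag = getattr(os, "O_DSYNC", 0)
    dsync_path = os.path.join(directory, "dsync.dat")
    fd = os.open(dsync_path,
                 os.O_WRONLY | os.O_CREAT | os.O_TRUNC | dsync_flag, 0o600)
    try:
        results["dsync"] = os.write(fd, b"record-2\n")
    finally:
        os.close(fd)

    # sync() is system-wide rather than per-file. Per the text above, on ZFS
    # it forces txg commits on all pools and does not use the intent log.
    # It is optional here because it can be expensive.
    if call_sync and hasattr(os, "sync"):
        os.sync()

    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(write_sync_paths(tmp))
```

The two paths differ only in where the synchronous request is made: once per fsync() call, or implicitly on every write() through the O_DSYNC descriptor.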

Is it ever used to accelerate async writes?


The zil is not used to accelerate async writes.

Suppose there is an application which sometimes does sync writes, and
sometimes async writes.  In fact, to make it easier, suppose two processes
open two files, one of which always writes asynchronously, and one of which
always writes synchronously.  Suppose the ZIL is disabled.  Is it possible
for writes to be committed to disk out-of-order?  Meaning, can a large block
async write be put into a TXG and committed to disk before a small sync
write to a different file is committed to disk, even though the small sync
write was issued by the application before the large async write?  Remember,
the point is:  ZIL is disabled.  Question is whether the async could
possibly be committed to disk before the sync.

Threads can be pre-empted in the OS at any time. So even though thread A issued W1 before thread B issued W2, the writes are not guaranteed to arrive at ZFS in the order W1, W2.
Multi-threaded applications have to handle this.

If this were a single thread issuing W1 then W2, then yes, the order is guaranteed
regardless of whether W1 or W2 are synchronous or asynchronous.
Of course if the system crashes then the async operations might not be there.
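
As a sketch of what "handling this" can look like in an application (an illustration, not part of the original reply): if thread B's write W2 must not reach the filesystem before thread A's write W1, the application has to order them itself. The example below uses a threading.Event for that; the function name ordered_writes and the W1/W2 payloads are hypothetical.

```python
import os
import tempfile
import threading


def ordered_writes(path):
    """Append W1 from thread A and W2 from thread B, forcing W1 first."""
    w1_done = threading.Event()

    def writer_a():
        try:
            with open(path, "ab") as f:
                f.write(b"W1\n")
                f.flush()
                os.fsync(f.fileno())  # W1 is stable before B is released
        finally:
            w1_done.set()  # always release B, even if the write failed

    def writer_b():
        w1_done.wait()  # without this wait, W2 could be issued before W1
        with open(path, "ab") as f:
            f.write(b"W2\n")

    thread_a = threading.Thread(target=writer_a)
    thread_b = threading.Thread(target=writer_b)
    thread_b.start()  # B is started first on purpose
    thread_a.start()
    thread_a.join()
    thread_b.join()

    with open(path, "rb") as f:
        return f.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(ordered_writes(os.path.join(tmp, "log.dat")))
```

Even though B is started first, the event forces the issue order W1, W2. Without it the scheduler decides, which is the situation described above.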

I make the assumption that an uberblock is the term for a TXG after it is
committed to disk.  Correct?

- Kind of. The uberblock contains the root of the txg.


At boot time, or "zpool import" time, what is taken to be "the current
filesystem?"  The latest uberblock?  Something else?

A txg is for the whole pool which can contain many filesystems.
The latest txg defines the current state of the pool and each individual fs.
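
A toy illustration of "the latest txg defines the current state" (not ZFS code, and not from the original reply): given several candidate uberblocks, take the one with the highest txg number. The ToyUberblock type, its fields, and pick_active are hypothetical names. The checksum_ok flag is an added assumption standing in for whatever validity check the real import code performs.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ToyUberblock:
    txg: int            # transaction group number this uberblock belongs to
    checksum_ok: bool   # assumption: stands in for a real validity check
    root: str           # placeholder for "the root of the txg"


def pick_active(candidates: Iterable[ToyUberblock]) -> Optional[ToyUberblock]:
    """Return the valid candidate with the highest txg, or None if none is valid."""
    valid = [u for u in candidates if u.checksum_ok]
    return max(valid, key=lambda u: u.txg, default=None)


if __name__ == "__main__":
    found = [
        ToyUberblock(txg=100, checksum_ok=True, root="root@100"),
        ToyUberblock(txg=102, checksum_ok=False, root="torn write"),
        ToyUberblock(txg=101, checksum_ok=True, root="root@101"),
    ]
    print(pick_active(found))
```

In this toy run the txg 102 entry is skipped because it is marked invalid, so the state rooted at txg 101 is the one that counts as current for the whole pool.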

My understanding is that enabling a dedicated ZIL device guarantees sync()
and fsync() system calls block until the write has been committed to
nonvolatile storage, and attempts to accelerate by using a physical device
which is faster or more idle than the main storage pool.

Correct (except replace sync() with O_DSYNC, etc).
This also assumes hardware that, for example, correctly handles the flushing of its caches.

  My understanding
is that this provides two implicit guarantees:  (1) sync writes are always
guaranteed to be committed to disk in order, relevant to other sync writes.
(2) In the event of OS halting or ungraceful shutdown, sync writes committed
to disk are guaranteed to be equal or greater than the async writes that
were taking place at the same time.  That is, if two processes both complete
a write operation at the same time, one in sync mode and the other in async
mode, then it is guaranteed the data on disk will never have the async data
committed before the sync data.

The ZIL doesn't make such guarantees. It's the DMU that handles transactions
and their grouping into txgs. It ensures that writes are committed in order
by its transactional nature.

The function of the zil is merely to ensure that synchronous operations are
stable and are replayed after a crash or power failure onto the latest txg.

Based on this understanding, if you disable ZIL, then there is no guarantee
about order of writes being committed to disk.  Neither of the above
guarantees is valid anymore.  Sync writes may be completed out of order.
Async writes that supposedly happened after sync writes may be committed to
disk before the sync writes.

No, disabling the ZIL does not disable the DMU.
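
The division of labour described above can be sketched with a deliberately tiny model. This is a toy, not how ZFS is implemented, and it is not from the original reply: ToyPool, its methods, and run_scenario are hypothetical names, operations are plain strings, and details of the real system are ignored. It only shows the shape of the argument: ordering comes from the txg mechanism, and the intent log only decides whether recent synchronous operations survive a crash.

```python
class ToyPool:
    """Toy model: ordered txg commits plus an optional intent log."""

    def __init__(self, zil_enabled=True):
        self.zil_enabled = zil_enabled
        self.on_disk = []     # operations in committed txgs, in issue order
        self.open_txg = []    # operations in the current in-memory txg
        self.intent_log = []  # sync operations logged since the last commit

    def write(self, op, sync=False):
        # Every write joins the open txg in issue order, sync or not.
        self.open_txg.append(op)
        # Only sync writes are also recorded in the intent log, if enabled.
        if sync and self.zil_enabled:
            self.intent_log.append(op)

    def commit_txg(self):
        # The whole txg becomes the new on-disk state in one step.
        self.on_disk.extend(self.open_txg)
        self.open_txg = []
        # Log records covered by the committed txg are no longer needed.
        self.intent_log = []

    def crash_and_recover(self):
        # In-memory state is lost; the last committed txg survives.
        self.open_txg = []
        # Replay logged sync operations onto the latest committed txg.
        self.on_disk.extend(self.intent_log)
        self.intent_log = []
        return list(self.on_disk)


def run_scenario(zil_enabled):
    """Async a1 (committed), then sync s1 and async a2, then a crash."""
    pool = ToyPool(zil_enabled=zil_enabled)
    pool.write("a1")
    pool.commit_txg()
    pool.write("s1", sync=True)
    pool.write("a2")
    return pool.crash_and_recover()


if __name__ == "__main__":
    print("zil enabled: ", run_scenario(True))
    print("zil disabled:", run_scenario(False))
```

With the log enabled the toy pool recovers a1 and s1; with it disabled it recovers only a1. In neither run does the later async write a2 appear without the earlier sync write s1, which is the point being made: disabling the log costs recent synchronous writes after a crash, but it does not reorder what was committed.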

Somebody, (Casper?) said it before, and now I'm starting to realize ... This
is also true of the snapshots.  If you disable your ZIL, then there is no
guarantee your snapshots are consistent either.  Rolling back doesn't
necessarily gain you anything.

No, a snapshot forces a txg, which is a consistent, up-to-date view of the pool and
its file systems. The zil is not involved.

See also http://blogs.sun.com/perrin/entry/the_lumberjack
- which is a bit dated and simplistic but still largely true.

Neil.
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
