On Wed, Apr 24, 2013 at 11:21 AM, Sašo Kiselkov <skiselkov...@gmail.com> wrote:
> ZFS has been the filesystem of choice for SunOS-based systems for about
> the last 5 years now, is becoming that for FreeBSD as we speak, and is

More like 8 years :)

> quickly gaining ground on Linux. The absence of support for
> posix_fallocate() on ZFS kind of makes sense, since copy-on-write
> filesystems cannot keep the posix_fallocate promise:

Agreed.

> http://pubs.opengroup.org/onlinepubs/009696799/functions/posix_fallocate.html
> "If posix_fallocate() returns successfully, subsequent writes to the
> specified file data shall not fail due to the lack of free space on the
> file system storage media."

Pre-allocation should be a per-filesystem feature, discoverable via pathconf(3).

What would it take to add such a pathconf?  (I should know this, but I don't.)
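
To make the idea concrete, here is a rough sketch of what the caller's
side might look like.  Note that _PC_FALLOCATE is a made-up name: as
far as I know no such pathconf variable exists in POSIX today, so the
sketch guards it with #ifdef and reports "unknown" everywhere else.
The function name fs_supports_prealloc() is likewise just for
illustration.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the filesystem holding `path` whether it can
 * honor pre-allocation.  Returns 1 if supported, 0 if not, -1 if unknown.
 * _PC_FALLOCATE is an ASSUMED name, not an existing pathconf variable.
 */
int fs_supports_prealloc(const char *path)
{
#ifdef _PC_FALLOCATE
    errno = 0;
    long v = pathconf(path, _PC_FALLOCATE);
    if (v > 0)
        return 1;               /* option is supported for this path */
    if (v == 0 || (v == -1 && errno == 0))
        return 0;               /* known name, option not supported */
    return -1;                  /* errno set: the query itself failed */
#else
    (void)path;                 /* no such variable on this system */
    return -1;
#endif
}
```

An application would fall back to its existing behaviour whenever this
returns -1, so nothing changes on systems without the new variable.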

In the meantime:

> As such, I would suggest one of:
>
>  1) Introduce a configure option which allows SQLite users to explicitly
>     disable posix_fallocate support, if they expect to be running on
>     file systems without support for it. Merely switching by OS may not
>     be reliable enough, since for instance UFS on SunOS implements it and
>     there is no simple way for libc to guess what file system a
>     particular file sits on.
>
>  2) Implement some sort of automatic fallback method which detects the
>     EINVAL condition and attempts to fall back to using the
>     truncate-and-write method.

EINVAL seems like a lousy error code to return here though.  ENOTSUP
seems much better.  EINVAL should be fatal here, but ENOTSUP should
cause SQLite3 to shrug and continue.

> If method #2 is acceptable for the SQLite project, I can attempt to

I would think that it should be, but I think the errno that triggers
fallback should be ENOTSUP.
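
Something along these lines, as a minimal sketch and not SQLite3's
actual code.  The helper names (should_fall_back, extend_with_writes,
preallocate) and the 4096-byte step are my own assumptions.  Keep in
mind that posix_fallocate() returns the error number directly rather
than setting errno.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Policy proposed above: only "not supported" triggers the fallback. */
static int should_fall_back(int rc)
{
    return rc == ENOTSUP || rc == EOPNOTSUPP;
}

/*
 * Truncate-and-write fallback: set the new size, then write one zero
 * byte at the end of each step-sized chunk past the old end of file.
 * Returns 0 on success or an errno value on failure.
 */
static int extend_with_writes(int fd, off_t old_size, off_t new_size)
{
    const off_t step = 4096;    /* assumed block size, illustration only */

    if (ftruncate(fd, new_size) != 0)
        return errno;
    /* First offset is always >= old_size, so no existing data is hit. */
    for (off_t off = (old_size / step) * step + step - 1;
         off < new_size; off += step) {
        ssize_t n = pwrite(fd, "", 1, off);
        if (n != 1)
            return n < 0 ? errno : EIO;
    }
    return 0;
}

/* Grow fd from old_size to new_size; returns 0 or an errno value. */
int preallocate(int fd, off_t old_size, off_t new_size)
{
    if (new_size <= old_size)
        return 0;               /* nothing to do */
    int rc = posix_fallocate(fd, old_size, new_size - old_size);
    if (rc == 0)
        return 0;
    if (should_fall_back(rc))
        return extend_with_writes(fd, old_size, new_size);
    return rc;                  /* EINVAL, ENOSPC, etc. stay fatal */
}
```

On a COW filesystem the fallback only sets the file size; it still
cannot keep the "writes shall not fail" promise quoted above.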

> implement it. I could also implement support for posix_fallocate into
> ZFS, but that will take a lot of time to get widely deployed (at least
> several years), and even then the best ZFS could do is lie to the
> applications (due to the aforementioned COW design).

Let's expand a bit on why.  ZFS could save the DVAs of fallocated
blocks in the file's dnode for later use, when the file is either
written to or deleted (last unlink).  Admittedly it'd be tricky: the
pre-allocated blocks would have to include blocks for writing metadata
all the way up to the root, and the block sizes would have to be just
right.  ZFS has variable block sizes, but for any given file the data
blocks are all the same size, and that size can change while the file
is one block long and growing (the same applies to a bunch of
metadata).  So in effect you'd have to pre-allocate the largest
possible block sizes, and that'd be rather painful.

For a SQLite3 DB/WAL in a dedicated ZFS dataset you could use
reservations to roughly equivalent effect to posix_fallocate().  But
that's not a solution.

Nico
--
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
