Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-15 Thread Nico Williams
On Sat, Sep 13, 2014 at 10:39 AM, Richard Hipp  wrote:
> I say that a filesystem is an eventually-consistent key/value database.

Yes!

> The keys are the filenames and the values are all big BLOBs, specifically
> the file content.  Filesystems also have a hierarchical keyspace, which is
> an extension from the usual key/value concept, but it is still key/value.

POSIX semantics are such that it's easiest for me to think of the
filesystem namespace as a set of {parent_inode_number, name,
inode_number} rows, with a typical hierarchical self-join relation.
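
A minimal sketch of that view, using SQLite itself (the table, column
names, and inode numbers below are purely illustrative):

#include <stdio.h>
#include <sqlite3.h>

int main(void) {
    sqlite3 *db;
    sqlite3_stmt *st;
    if (sqlite3_open(":memory:", &db) != SQLITE_OK) return 1;

    /* The namespace as {parent_ino, name, ino} rows; inode 1 is the root. */
    sqlite3_exec(db,
        "CREATE TABLE dirent(parent_ino INTEGER, name TEXT, ino INTEGER,"
        "                    PRIMARY KEY(parent_ino, name));"
        "INSERT INTO dirent VALUES (1,'etc',2), (2,'passwd',3);",
        0, 0, 0);

    /* Resolve /etc/passwd with the hierarchical self-join. */
    sqlite3_prepare_v2(db,
        "SELECT child.ino FROM dirent AS parent"
        " JOIN dirent AS child ON child.parent_ino = parent.ino"
        " WHERE parent.parent_ino = 1 AND parent.name = 'etc'"
        "   AND child.name = 'passwd'", -1, &st, 0);
    if (sqlite3_step(st) == SQLITE_ROW)
        printf("inode of /etc/passwd = %d\n", sqlite3_column_int(st, 0));
    sqlite3_finalize(st);
    sqlite3_close(db);
    return 0;
}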

Nico
--


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-15 Thread Nico Williams
On Fri, Sep 12, 2014 at 7:21 PM, Richard Hipp  wrote:
> On Fri, Sep 12, 2014 at 8:07 PM, Simon Slavin  wrote:
>>   one thing that annoys me about SQLite is that it needs to make a
>> journal file which isn't part of the database file.  Why ?  Why can't it
>> just write the journal to the database file it already has open ?  This
>> would reduce the problems where the OS prevents an application from
>> creating a new file because of permissions or sandboxing.
>>
>
> Where in the database does the journal information get stored?  At the
> end?  What happens then if the transaction is an INSERT and the size of the
> content has to grow?  Does that leave a big hole in the middle of the file
> when the journal is removed?  During recovery after a crash, where does the
> recovery process go to look for the journal information?   If the journal
> is at some arbitrary point in the file, where does it look.  Note that we
> cannot write the journal location in the file header because the header
> cannot be (safely) changed without first journaling it but we cannot
> journal the header without first writing the journal location into the
> header.

One answer is to use a COW pattern, with two or more uberblocks that
store the previous and current/next root of the DB; each uberblock
would also reference a free space map.  When you write you just
allocate currently unused space from the previous uberblock's free
space map (or grow the file if necessary), then when all writes for a
transaction reach stable storage you write a new uberblock, with a
new free space map.

That's a fairly standard COW model (e.g., ZFS does it).  I believe you
can find prior art going back a long time.  E.g., the 4.4BSD LFS was a
design sort of like this.  ZFS is quite similar to the 4.4BSD LFS, in
fact, mostly differing in a handful of very obvious ways: it
doesn't use fixed-sized log chunks, doesn't insist on writing
contiguous blocks, and doesn't need a cleaner (the trade-off is
fragmentation); other differences (snapshots, clones, ...) are just
icing on the cake that come from reifying uberblocks.

This way you have no log as such because the file is written in a
log-like manner anyway: the file *is* the log!
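
A rough sketch of that dual-uberblock commit, in C; the field names,
widths, and checksum handling are placeholders rather than any real
ZFS or LFS on-disk format:

#include <stdint.h>

#define UBER_MAGIC 0x53514C43u

struct uberblock {
    uint32_t magic;        /* identifies a valid uberblock             */
    uint64_t txn;          /* transaction number, monotonically rising */
    uint64_t root_page;    /* page number of the current tree root     */
    uint64_t freemap_page; /* page number of the free-space map        */
    uint64_t checksum;     /* would cover the fields above             */
};

/* Two fixed slots at the head of the file; a commit overwrites the older
 * slot only after every COW'd page has reached stable storage.  On open,
 * pick the newer slot that validates. */
int pick_current(const struct uberblock u[2])
{
    int a_ok = (u[0].magic == UBER_MAGIC);  /* plus checksum in real code */
    int b_ok = (u[1].magic == UBER_MAGIC);
    if (a_ok && b_ok) return (u[0].txn > u[1].txn) ? 0 : 1;
    if (a_ok) return 0;
    if (b_ok) return 1;
    return -1;                              /* empty or corrupt file */
}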

Nico
--


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-15 Thread Nico Williams
On Fri, Sep 12, 2014 at 6:47 PM, James K. Lowden
 wrote:
> On Fri, 12 Sep 2014 19:38:53 +0100
> Simon Slavin  wrote:
>
>> I don't think it can be done by trying to build it on top of an
>> existing file system.  I think we need a file system (volume format,
>> drivers, etc.) built from the ground up with
>> atomicity/ACID/transactions in mind.  Since the greatest of these is
>> transactions, I want a transactional file system.
>
> Funny you should mention that.  About 6 years ago Mike McKusick gave a
> presentation on then-recent updates to FFS in FreeBSD, including the
> birthdate.  Among other things, I remember he explored using a tree
> instead of an array for a directory file, but found that because the
> vast majority of directories hold a small number of names, the overall
> performance is better with a simple array.

ZFS uses a hash table for this.

> I asked your question: why not add transactions to FFS?
>
> His answer: that's the province of a database.

I agree, but the filesystem ought to provide a write barrier, and it
ought to provide an async fsync() with event completion notification.
That should be enough to implement high-performance ACID at the
application layer.

ZFS provides a write barrier: but it's fsync(), so if you want it to
go fast you must either disable sync writes (oof) or use a fast intent
log device (oof).

With a first-class write barrier we could have both: fast ordering and synchronous writes when we need them.

That's what the proposed osync() is all about, and I say godspeed to them!

Of course, the biggest problem with new filesystem interfaces is
adoption, but here a handful of apps (e.g., SQLite3) could adopt
osync() very quickly, vastly improving safety and performance for a
great many users.
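
To make the trade-off concrete, here is a minimal sketch of an
application-level commit as it must be written with plain POSIX today,
where fsync() doubles as the ordering barrier; a first-class barrier like
the proposed osync() would let the first blocking sync become a cheap
ordering request.  Function and parameter names are made up for
illustration:

#include <sys/types.h>
#include <unistd.h>

/* pages/rec are the transaction's data and its commit record; the
 * offsets and the division of labour are illustrative only. */
int commit(int fd, const void *pages, size_t n, off_t off,
           const void *rec, size_t reclen, off_t recoff)
{
    if (pwrite(fd, pages, n, off) != (ssize_t)n) return -1;

    /* Ordering barrier: the commit record below must not reach disk
     * before the pages above.  With plain POSIX, fsync() is the only
     * barrier available, and it also blocks until everything is durable. */
    if (fsync(fd) != 0) return -1;

    if (pwrite(fd, rec, reclen, recoff) != (ssize_t)reclen) return -1;

    /* Durability: once this returns, the transaction is committed. */
    return fsync(fd) != 0 ? -1 : 0;
}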

Nico
--


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-14 Thread James K. Lowden
On Sat, 13 Sep 2014 15:43:00 -0400
Richard Hipp  wrote:

> There were often restrictions on the permitted values for block
> sizes.  And you couldn't ask the operating system to tell you whether
> a file was text or binary or sequential or random-access or what its
> block-size was;  you just had to know.  

That brings to mind Faulkner's observation: 

"The past is never dead. It's not even past."  

VSAM has the properties you describe, and is the successor to ISAM,
which was the state of the art when Unix was being invented.  VSAM is still
alive and kicking; see VSAM Demystified:

http://www.redbooks.ibm.com/abstracts/sg246105.html?Open

Sample paragraph: 

"VSAM is used to organize records into four types of data sets:
Key-sequenced (KSDS), entry-sequenced (ESDS), linear (LDS), and
relative record (RRDS and VRRDS). The primary difference between the
types of VSAM data sets is the way that their records are stored and
accessed."

Whatever complaints one might have with Posix, ten minutes with that
document will reveal the blessings of mmap and fsync!  

--jkl


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-13 Thread Scott Robison
On Sat, Sep 13, 2014 at 2:53 PM, Howard Chu  wrote:

> Scott Robison wrote:
>
>> On Sat, Sep 13, 2014 at 2:24 PM, Howard Chu  wrote:
>>
>>> Scott Robison wrote:
>>>
>>>> A couple of academic thoughts.
>>>>
>>>> 1. If one wanted to embed the journal within the database, would it be
>>>> adequate to reserve a specific page as the "root" page of the journal,
>>>> then allocate the remaining pages as normal (either to the journal or
>>>> the main database)? This does leave the big hole problem so it may
>>>> still not be ideal, but it would give you a known location to find the
>>>> beginning of the journal without doubling the database size or
>>>> requiring an extra file.
>>>
>>> Starting with a known location is definitely a step in the right
>>> direction.
>>>
>>>> 2. Building on 1, could sparse files be used to accomplish this? Seek
>>>> to "really big constant offset" and do all journaling operations at
>>>> that point, allowing the operating system to manage actual disk
>>>> allocation? If
>>>
>>> We're talking about implementing a filesystem. "the operating system" is
>>> your own code, in this case, you don't get to foist the work off onto
>>> anyone else.
>>
>> No, Simon's original question was to the effect of why doesn't SQLite just
>> use the already open database file for journaling purposes as well.
>
> OK, maybe I missed that, but I thought that question itself arose from how
> to use SQLite to implement a filesystem, on a raw partition. And the answer
> to that question (operating inside a raw partition) could apply equally
> well to operating inside a single file.
>
> If you preassign a fixed maximum size to the file, you could e.g. reserve
> the tail of the file for the journal, growing backward toward the head of
> the file, while the main data grows the usual direction from the head of
> the file toward the tail. This would basically be your (2) above. On HDDs
> this approach would have horrible seek latencies but it could work OK on
> SSDs.
>
> The other point though - like the existing journaling filesystems, you
> should not limit yourself to using a single file/storage device. Allow the
> option of storing the journal somewhere else - the performance potential is
> worth it.
>
>> My point 1 was in response to the need to know where the journal file is,
>> so just pick a dedicated page in the file as the root page of the journal,
>> allowing the two files to be co-mingled. It doesn't address every possible
>> bad reason for co-mingling the data, but it would at least answer the
>> question "how do you find the journal".
>>
>> My second point was about existing SQLite database files that live in a
>> file system managed by some operating system. SQLite already foists that
>> work off on to someone else, this would be no different. It still may be a
>> bad idea, but that's not the reason why it wouldn't work. :)
>
To be fair, I may have read something out of context. As I said originally,
the questions were academic. I have not thought about these problems to
anywhere near the extent or depth the SQLite devs have. I was just thinking
out loud.

You are absolutely right: there are very good reasons to allow for
external journals, perhaps on different physical
devices (much as the test_multiplex vfs uses multiple files) to support
much higher throughput and/or larger database sizes. At the same time,
there certainly could be reasons that keeping everything self contained in
a single file could be useful, and a VFS could be written to accommodate
that (similar to the test_onefile vfs) without needing to modify SQLite.

I was thinking some time back about a VFS that communicates with a network
service. The network service would be a replacement for NFS / SMB / CIFS
that would get file locking semantics "right" through a custom protocol
instead of hoping (and being disappointed) that the system's standard file
handling APIs got it right. The network service would essentially be a page
server. It would likely not be as fast as a local file, but it could allow
distributed access to a shared database in a secure manner.

Anyway, my point with the last paragraph was that my thought was to use a
SQLite database privately owned by the network service as the container
that would hold the pages read & written by the remote clients. A VFS can
do almost anything it wants as long as it has a reliable means of locking
out access. Not that it would be *easy*, just possible.

Sorry for the ramble there...

-- 
Scott Robison


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-13 Thread Howard Chu

Scott Robison wrote:

On Sat, Sep 13, 2014 at 2:24 PM, Howard Chu  wrote:


Scott Robison wrote:


A couple of academic thoughts.




1. If one wanted to embed the journal within the database, would it be
adequate to reserve a specific page as the "root" page of the journal,
then
allocate the remaining pages as normal (either to the journal or the main
database)? This does leave the big hole problem so it may still not be
ideal, but it would give you a known location to find the beginning of the
journal without doubling the database size or requiring an extra file.



Starting with a known location is definitely a step in the right direction.

  2. Building on 1, could sparse files be used to accomplish this? Seek to

"really big constant offset" and do all journaling operations at that
point, allowing the operating system to manage actual disk allocation? If



We're talking about implementing a filesystem. "the operating system" is
your own code, in this case, you don't get to foist the work off onto
anyone else.



No, Simon's original question was to the effect of why doesn't SQLite just
use the already open database file for journaling purposes as well.


OK, maybe I missed that, but I thought that question itself arose from how to 
use SQLite to implement a filesystem, on a raw partition. And the answer to 
that question (operating inside a raw partition) could apply equally well to 
operating inside a single file.


If you preassign a fixed maximum size to the file, you could e.g. reserve the 
tail of the file for the journal, growing backward toward the head of the 
file, while the main data grows the usual direction from the head of the file 
toward the tail. This would basically be your (2) above. On HDDs this approach 
would have horrible seek latencies but it could work OK on SSDs.


The other point though - like the existing journaling filesystems, you should 
not limit yourself to using a single file/storage device. Allow the option of 
storing the journal somewhere else - the performance potential is worth it.



My
point 1 was in response to the need to know where the journal file is, so
just pick a dedicated page in the file as the root page of the journal,
allowing the two files to be co-mingled. It doesn't address every possible
bad reason for co-mingling the data, but it would at least answer the
question "how do you find the journal".

My second point was about existing SQLite database files that live in a
file system managed by some operating system. SQLite already foists that
work off on to someone else, this would be no different. It still may be a
bad idea, but that's not the reason why it wouldn't work. :)




--
  -- Howard Chu
  CTO, Symas Corp.   http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-13 Thread Scott Robison
On Sat, Sep 13, 2014 at 2:24 PM, Howard Chu  wrote:

> Scott Robison wrote:
>
>> A couple of academic thoughts.
>>
>
>> 1. If one wanted to embed the journal within the database, would it be
>> adequate to reserve a specific page as the "root" page of the journal,
>> then
>> allocate the remaining pages as normal (either to the journal or the main
>> database)? This does leave the big hole problem so it may still not be
>> ideal, but it would give you a known location to find the beginning of the
>> journal without doubling the database size or requiring an extra file.
>>
>
> Starting with a known location is definitely a step in the right direction.
>
>  2. Building on 1, could sparse files be used to accomplish this? Seek to
>> "really big constant offset" and do all journaling operations at that
>> point, allowing the operating system to manage actual disk allocation? If
>>
>
> We're talking about implementing a filesystem. "the operating system" is
> your own code, in this case, you don't get to foist the work off onto
> anyone else.


No, Simon's original question was to the effect of why doesn't SQLite just
use the already open database file for journaling purposes as well. My
point 1 was in response to the need to know where the journal file is, so
just pick a dedicated page in the file as the root page of the journal,
allowing the two files to be co-mingled. It doesn't address every possible
bad reason for co-mingling the data, but it would at least answer the
question "how do you find the journal".

My second point was about existing SQLite database files that live in a
file system managed by some operating system. SQLite already foists that
work off on to someone else, this would be no different. It still may be a
bad idea, but that's not the reason why it wouldn't work. :)

-- 
Scott Robison


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-13 Thread Howard Chu

Scott Robison wrote:

On Fri, Sep 12, 2014 at 6:21 PM, Richard Hipp  wrote:


On Fri, Sep 12, 2014 at 8:07 PM, Simon Slavin 
wrote:


   one thing that annoys me about SQLite is that it needs to make a
journal file which isn't part of the database file.  Why ?  Why can't it
just write the journal to the database file it already has open ?  This
would reduce the problems where the OS prevents an application from
creating a new file because of permissions or sandboxing.


Where in the database does the journal information get stored?  At the
end?  What happens then if the transaction is an INSERT and the size of the
content has to grow?  Does that leave a big hole in the middle of the file
when the journal is removed?  During recovery after a crash, where does the
recovery process go to look for the journal information?   If the journal
is at some arbitrary point in the file, where does it look.  Note that we
cannot write the journal location in the file header because the header
cannot be (safely) changed without first journaling it but we cannot
journal the header without first writing the journal location into the
header.


Journaling filesystems already have this problem. By default they just use a 
section of the partition, reserved at FS creation time. Which leads to the 
problem already described in the video that started this thread - perform a 
large enough write operation and you can exceed the fixed size of the journal, 
which requires the journal data to be split, so the journal update for the
operation is no longer atomic.


Of course, most journaling filesystems also allow you to optionally specify an 
external journal - i.e., instead of embedding the journal on the filesystem's 
partition, you can use some other block device instead. Naturally you can also 
choose a larger size when doing this. Putting the journal on a separate device 
can bring some major performance benefits, as well as accommodating larger 
transactions.


In the tests I did two years ago, JFS with an external journal was blazingly 
fast. http://symas.com/mdb/microbench/july/#sec11



One idea that might work is to interleave the journal information with the
content.  So for each page in the database, there is a corresponding page
of journal content.  The downside there is that you double the size of the
database file without increasing its storage capacity.


This is why LMDB is much better suited to this task - it uses no journal at 
all, nor does it require compaction/defragmentation/VACUUMing.



A couple of academic thoughts.

1. If one wanted to embed the journal within the database, would it be
adequate to reserve a specific page as the "root" page of the journal, then
allocate the remaining pages as normal (either to the journal or the main
database)? This does leave the big hole problem so it may still not be
ideal, but it would give you a known location to find the beginning of the
journal without doubling the database size or requiring an extra file.


Starting with a known location is definitely a step in the right direction.


2. Building on 1, could sparse files be used to accomplish this? Seek to
"really big constant offset" and do all journaling operations at that
point, allowing the operating system to manage actual disk allocation? If


We're talking about implementing a filesystem. "the operating system" is your 
own code, in this case, you don't get to foist the work off onto anyone else.



this were possible, deleting the journal would be a "fast" truncate
operation. A custom VFS might be able to provide a proof of concept... hmm.


--
  -- Howard Chu
  CTO, Symas Corp.   http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-13 Thread Simon Slavin

On 13 Sep 2014, at 9:16pm, Howard Chu  wrote:

> Scott Robison wrote:
> 
>> At the
>> very least (and I suspect more) Commodore 8-bit DOS (which was embedded
>> within a smart drive with its very own dedicated CPU & RAM) supported
>> (essentially) sequential byte stream files (no random seeking for these!)
>> and random access record oriented files (where the record size was set at
>> file creation time). Man were those a pain in the backside to use.
> 
> Now imagine writing an ftp client (or server) for one of these. I wrote both 
> for an IBM mainframe, way back when. 

You could have just used KERMIT.  Wait ... did KERMIT run on IBM mainframes ?



Hahahaha of course it did.

Simon.


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-13 Thread Howard Chu

Scott Robison wrote:

On Sat, Sep 13, 2014 at 1:43 PM, Richard Hipp  wrote:


Decades ago, files came in all kinds of varieties and permutations.
Details varied from one OS to the next.  But it was common to have a
distinction between text files and binary files (with different APIs for
accessing each.)  It was also common to have a difference between "ordinary
files" (that had to be read sequentially from beginning to end) and
"random-access files", which supported operations similar to lseek().
(Once again, completely incompatible APIs existed for accessing each file
type.)  With binary files, one often had to specify a "block size" which
was the increment in which the file was read and written.  The block size
was typically a property of the file and could not be changed after the
file had been created.  There were often restrictions on the permitted
values for block sizes.  And you couldn't ask the operating system to tell
you whether a file was text or binary or sequential or random-access or
what its block-size was;  you just had to know.  And bewildering problems
resulted if you got it wrong.



And this was not true just of big expensive business class machines. At the
very least (and I suspect more) Commodore 8-bit DOS (which was embedded
within a smart drive with its very own dedicated CPU & RAM) supported
(essentially) sequential byte stream files (no random seeking for these!)
and random access record oriented files (where the record size was set at
file creation time). Man were those a pain in the backside to use.

Now imagine writing an ftp client (or server) for one of these. I wrote both 
for an IBM mainframe, way back when. You had to trust the user to select the 
appropriate record mode for the ftp transfer to succeed without just getting 
garbage on the other end.


--
  -- Howard Chu
  CTO, Symas Corp.   http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-13 Thread Scott Robison
On Sat, Sep 13, 2014 at 1:43 PM, Richard Hipp  wrote:

> Decades ago, files came in all kinds of varieties and permutations.
> Details varied from one OS to the next.  But it was common to have a
> distinction between text files and binary files (with different APIs for
> accessing each.)  It was also common to have a difference between "ordinary
> files" (that had to be read sequentially from beginning to end) and
> "random-access files", which supported operations similar to lseek().
> (Once again, completely incompatible APIs existed for accessing each file
> type.)  With binary files, one often had to specify a "block size" which
> was the increment in which the file was read and written.  The block size
> was typically a property of the file and could not be changed after the
> file had been created.  There were often restrictions on the permitted
> values for block sizes.  And you couldn't ask the operating system to tell
> you whether a file was text or binary or sequential or random-access or
> what its block-size was;  you just had to know.  And bewildering problems
> resulted if you got it wrong.
>
> Then the boys at Bell Labs had the idea of "lets make every file be a
> sequence of bytes of arbitrary length".   Many observers scoffed and told
> them that you had to have all kinds of different types of files with
> different APIs for "efficiency".  But in the end, the Unix Filesystem was
> proven to be a Very Good Idea, and has become the norm ever sense.
>
> You youngsters really have concept of the chaos and nonsense we programmers
> had to put up with prior to the invention of the Unix Filesystem, do you?
>

And this was not true just of big expensive business class machines. At the
very least (and I suspect more) Commodore 8-bit DOS (which was embedded
within a smart drive with its very own dedicated CPU & RAM) supported
(essentially) sequential byte stream files (no random seeking for these!)
and random access record oriented files (where the record size was set at
file creation time). Man were those a pain in the backside to use.

-- 
Scott Robison


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-13 Thread Simon Slavin

On 13 Sep 2014, at 8:43pm, Richard Hipp  wrote:

> On Sat, Sep 13, 2014 at 2:46 PM, Simon Slavin  wrote:
> 
>> I would dispute that.  [snip]
> 
> Roger is correct.
> 
> [description]

Thanks for the correction.  Apologies to Roger for doubting him.

> You youngsters really have concept of the chaos and nonsense we programmers
> had to put up with prior to the invention of the Unix Filesystem, do you?

Well, I started programming DEC mainframes in the 1970s but it seems I never 
used any calls which didn't allow random access to file contents.  What you 
describe does sound horrible to me.

Simon.


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-13 Thread Richard Hipp
On Sat, Sep 13, 2014 at 2:46 PM, Simon Slavin  wrote:

>
>
> > Before Unix came along it was quite common for files to be structured,
> > managed by the operating system and to be record based with file apis
> > working that way.  Unix turned files (and similar) into unstructured
> bags of
> > bytes.
>
> I would dispute that.  Unix is not at fault here.  Nor is Minix.  Even the
> mainframes of the day did not support transactional file systems
> routinely.  I can't think of any programming environment that 'expects' all
> file activity to be transactional.  We haven't go there yet.
>
>
Roger is correct.

Decades ago, files came in all kinds of varieties and permutations.
Details varied from one OS to the next.  But it was common to have a
distinction between text files and binary files (with different APIs for
accessing each.)  It was also common to have a difference between "ordinary
files" (that had to be read sequentially from beginning to end) and
"random-access files", which supported operations similar to lseek().
(Once again, completely incompatible APIs existed for accessing each file
type.)  With binary files, one often had to specify a "block size" which
was the increment in which the file was read and written.  The block size
was typically a property of the file and could not be changed after the
file had been created.  There were often restrictions on the permitted
values for block sizes.  And you couldn't ask the operating system to tell
you whether a file was text or binary or sequential or random-access or
what its block-size was;  you just had to know.  And bewildering problems
resulted if you got it wrong.

Then the boys at Bell Labs had the idea of "let's make every file be a
sequence of bytes of arbitrary length".   Many observers scoffed and told
them that you had to have all kinds of different types of files with
different APIs for "efficiency".  But in the end, the Unix Filesystem was
proven to be a Very Good Idea, and has become the norm ever since.

You youngsters really have no concept of the chaos and nonsense we programmers
had to put up with prior to the invention of the Unix Filesystem, do you?
;-)
-- 
D. Richard Hipp
d...@sqlite.org


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-13 Thread Scott Robison
On Fri, Sep 12, 2014 at 6:21 PM, Richard Hipp  wrote:

> On Fri, Sep 12, 2014 at 8:07 PM, Simon Slavin 
> wrote:
> >
> >   one thing that annoys me about SQLite is that it needs to make a
> > journal file which isn't part of the database file.  Why ?  Why can't it
> > just write the journal to the database file it already has open ?  This
> > would reduce the problems where the OS prevents an application from
> > creating a new file because of permissions or sandboxing.
> >
> Where in the database does the journal information get stored?  At the
> end?  What happens then if the transaction is an INSERT and the size of the
> content has to grow?  Does that leave a big hole in the middle of the file
> when the journal is removed?  During recovery after a crash, where does the
> recovery process go to look for the journal information?   If the journal
> is at some arbitrary point in the file, where does it look.  Note that we
> cannot write the journal location in the file header because the header
> cannot be (safely) changed without first journaling it but we cannot
> journal the header without first writing the journal location into the
> header.
>
> One idea that might work is to interleave the journal information with the
> content.  So for each page in the database, there is a corresponding page
> of journal content.  The downside there is that you double the size of the
> database file without increasing its storage capacity.
>

A couple of academic thoughts.

1. If one wanted to embed the journal within the database, would it be
adequate to reserve a specific page as the "root" page of the journal, then
allocate the remaining pages as normal (either to the journal or the main
database)? This does leave the big hole problem so it may still not be
ideal, but it would give you a known location to find the beginning of the
journal without doubling the database size or requiring an extra file.
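
A sketch of what such a reserved root page might carry; the layout and
field names below are invented purely for illustration:

#include <stdint.h>

#define PAGE_SIZE    4096
#define JOURNAL_ROOT 1        /* fixed, well-known page number */

/* Contents of the reserved journal root page (rest of the page unused). */
struct journal_root {
    uint32_t magic;           /* nonzero means "a hot journal exists"     */
    uint32_t first_page;      /* first page of the journal's page chain   */
    uint32_t n_pages;         /* pages currently allocated to the chain   */
    uint32_t checksum;        /* guards against a torn write of this page */
};

/* Crash recovery always starts by reading page JOURNAL_ROOT; if magic and
 * checksum validate, the chain starting at first_page is replayed. */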

2. Building on 1, could sparse files be used to accomplish this? Seek to
"really big constant offset" and do all journaling operations at that
point, allowing the operating system to manage actual disk allocation? If
this were possible, deleting the journal would be a "fast" truncate
operation. A custom VFS might be able to provide a proof of concept... hmm.
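
A minimal sketch of that idea using plain POSIX calls; the file name,
offset, and record contents are made up, and whether the hole actually
stays sparse depends on the underlying filesystem:

#define _FILE_OFFSET_BITS 64        /* make off_t 64-bit on 32-bit Linux */
#include <fcntl.h>
#include <unistd.h>

#define JOURNAL_BASE ((off_t)1 << 40)   /* 1 TiB into the file */

int main(void)
{
    int fd = open("test.db", O_RDWR | O_CREAT, 0644);
    if (fd < 0) return 1;

    /* Journal records land far past the real database pages; the gap
     * in between is a hole that consumes no disk blocks (on filesystems
     * that support sparse files). */
    const char rec[] = "journal record";
    if (pwrite(fd, rec, sizeof rec, JOURNAL_BASE) < 0) return 1;
    if (fsync(fd) != 0) return 1;

    /* ... perform the transaction's writes to the low part of the file ... */

    /* "Deleting" the journal is a single cheap metadata operation. */
    if (ftruncate(fd, 4096 /* illustrative database size */) != 0) return 1;

    return close(fd);
}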

-- 
Scott Robison


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-13 Thread Simon Slavin

On 13 Sep 2014, at 7:01pm, Luuk  wrote:

> On 13-9-2014 19:45, Simon Slavin wrote:
> 
>> What I want is the entire volume as a database.  It would be something like
>> 
>> Key  Example value
>> ---  -
>> /robots.txt/path /
>> /robots.txt/filename robots.txt
>> /robots.txt/creationdate whatever
>> /robots.txt/filetype text
>> /robots.txt/creatorapp   /Applications/textwrangler.app
>> /robots.txt/content  BLOB
>> /index.html  /
>> /index.html/filename /index.html
>> /index.html/creationdate whatever
>> ...
>> 
>> The actual primary key would be a a two column index, of course.
> 
> My question will than be:
> Would the primary index not be the i-node number (in case of ext3/ext4 
> filesystem), or the sectornr in case of FAT

Sorry, that still incorporates a sub-database disk format, and that's what I 
want to avoid.  The volume itself should be the equivalent of a SQLite 
database.  Sectors of the volume would be the pages of the SQLite database.  
The i-node numbers (or sector numbers) would be the pointers inside the SQLite 
database which normally point to page numbers in the database file.  For 
example, the binary trees mentioned in section 2.3 of



would contain page numbers which are, in this case, sector/inode numbers.  They 
wouldn't be integers stored in the database itself.

One big difference between this and how SQLite works right now is that the 
database would always take up the complete space reserved for the volume, thus 
making VACUUM do nothing.  Presumably some equivalent of VACUUM could be used 
to defragment tables and indexes, and it would have to be disabled for SSDs.  
Though with today's storage hardware defragmentation isn't of much benefit.

I must admit I'm having trouble getting my head around how to do journaling 
with this.

Another way of arranging the storage would, of course, be with every file a row 
of the Files table which has one column for each property stored:

Columns for the 'Files' table:

Path   Name         CreationDate   CreatorApp   Type   Content
/      robots.txt   1/1/2012       TextEdit     text   BLOB
/      index.html   1/1/2014       Webmaker     HTML   BLOB

Other tables in the database could be used to store other things.  Although the 
freespace and bad block list might be best stored as pseudo-files.
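
Expressed as an actual SQLite schema, that single-table layout might look
something like this (column types are guesses for illustration):

#include <sqlite3.h>

/* One row per file; column names follow the sketch above. */
static const char *files_schema =
    "CREATE TABLE files("
    "  path         TEXT NOT NULL,"
    "  name         TEXT NOT NULL,"
    "  creationdate TEXT,"
    "  creatorapp   TEXT,"
    "  type         TEXT,"
    "  content      BLOB,"
    "  PRIMARY KEY(path, name)"
    ");";

int create_files_table(sqlite3 *db)
{
    return sqlite3_exec(db, files_schema, 0, 0, 0);
}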

Simon.


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-13 Thread Simon Slavin

On 13 Sep 2014, at 7:05pm, Roger Binns  wrote:

> Ever hear of Windows and Transactional NTFS :-)
> 
> http://msdn.microsoft.com/en-us/magazine/cc163388.aspx
> 
> It turns out that adding transactions indiscriminately doesn't magically
> make everything better and lots of thought does need to be applied to get
> everything right.  That leads to this:
> 
> http://msdn.microsoft.com/en-us/library/windows/desktop/hh802690(v=vs.85).aspx

I had heard of it but never encountered in the wild.  It was a good idea which 
I don't think can be implemented on any variation of NTFS.  For instance, a 
quote from your second reference:

"You shouldn’t use TxF if you have a system that will have long-running 
transactions. The definition of "long-running" here is relative, however. A 
long-running transaction is any transaction that has been alive longer than 
many other transactions within the same log. This could mean a few seconds if 
there are thousands of transactions happening and lasting for very brief 
periods of time. On the other hand, a day or two might not be a long time if 
there are very few transactions happening in the system."

So if I have a banking system that slowly does end-of-day processing, it can't 
be used on the same /volume/ that is being updated in real-time, even if 
they're working on completely different files.  That's not much use.

> Before Unix came along it was quite common for files to be structured,
> managed by the operating system and to be record based with file apis
> working that way.  Unix turned files (and similar) into unstructured bags of
> bytes.

I would dispute that.  Unix is not at fault here.  Nor is Minix.  Even the 
mainframes of the day did not support transactional file systems routinely.  I 
can't think of any programming environment that 'expects' all file activity to 
be transactional.  We haven't got there yet.

Simon.


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-13 Thread Howard Chu

Scott Robison wrote:

On Sat, Sep 13, 2014 at 9:39 AM, Richard Hipp  wrote:


I say that a filesystem is an eventually-consistent key/value database.
The keys are the filenames and the values are all big BLOBs, specifically
the file content.  Filesystems also have a hierarchical keyspace, which is
an extension from the usual key/value concept, but it is still key/value.



Dan Bernstein, author of qmail & djbdns (among others), used the file
system as a configuration database for those applications. Rather than
having a text configuration file, he used the directory and file names as
keys and their contents as values. I seem to recall him later regretting
this choice (in part, anyway) but I always thought there was a certain
elegance to that solution. It's not perfect, but what is?

OpenLDAP's config database currently uses the filesystem this way as well. 
It's no paragon of efficiency, but it doesn't need to be particularly 
performant in the first place, and it requires zero setup.


--
  -- Howard Chu
  CTO, Symas Corp.   http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-13 Thread Scott Robison
On Sat, Sep 13, 2014 at 9:39 AM, Richard Hipp  wrote:

> I say that a filesystem is an eventually-consistent key/value database.
> The keys are the filenames and the values are all big BLOBs, specifically
> the file content.  Filesystems also have a hierarchical keyspace, which is
> an extension from the usual key/value concept, but it is still key/value.
>

Dan Bernstein, author of qmail & djbdns (among others), used the file
system as a configuration database for those applications. Rather than
having a text configuration file, he used the directory and file names as
keys and their contents as values. I seem to recall him later regretting
this choice (in part, anyway) but I always thought there was a certain
elegance to that solution. It's not perfect, but what is?

-- 
Scott Robison


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-13 Thread Roger Binns
On 12/09/14 17:07, Simon Slavin wrote:
> Programmers don't expect file services to support transactions because file 
> services have never supported transactions. 

Ever hear of Windows and Transactional NTFS :-)

http://msdn.microsoft.com/en-us/magazine/cc163388.aspx

It turns out that adding transactions indiscriminately doesn't magically
make everything better and lots of thought does need to be applied to get
everything right.  That leads to this:

http://msdn.microsoft.com/en-us/library/windows/desktop/hh802690(v=vs.85).aspx

Before Unix came along it was quite common for files to be structured,
managed by the operating system, and to be record-based with file APIs
working that way.  Unix turned files (and similar) into unstructured bags of
bytes.

Roger


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-13 Thread Luuk

On 13-9-2014 19:45, Simon Slavin wrote:


On 13 Sep 2014, at 6:16pm, Tim Streater  wrote:


On 13 Sep 2014 at 16:39, Richard Hipp  wrote:


I say that a filesystem is an eventually-consistent key/value database.
The keys are the filenames and the values are all big BLOBs, specifically
the file content.  Filesystems also have a hierarchical keyspace, which is
an extension from the usual key/value concept, but it is still key/value.


Key            Example value
---            -------------
filename       somefile.txt
creation date
file type      text
opens with     textwrangler.app
content
...

and so on.


That's not what I meant.  That's the file as a database.  What I want is the 
entire volume as a database.  It would be something like

Key                        Example value
---                        -------------
/robots.txt/path           /
/robots.txt/filename       robots.txt
/robots.txt/creationdate   whatever
/robots.txt/filetype       text
/robots.txt/creatorapp     /Applications/textwrangler.app
/robots.txt/content        BLOB
/index.html                /
/index.html/filename       /index.html
/index.html/creationdate   whatever
...

The actual primary key would be a two-column index, of course.

Simon.


My question would then be:
Would the primary index not be the i-node number (in the case of an ext3/ext4
filesystem), or the sector number in the case of FAT?


Files on disk (did?) get fragmented, but for a database this should not 
have influence on the 'i-node' where a file is stored.





Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-13 Thread Howard Chu

Simon Slavin wrote:

That's not what I meant.  That's the file as a database.  What I want is
the entire volume as a database.


That's exactly what I pointed you to before. The thesis is pretty enlightening 
too.


http://www.fsl.cs.sunysb.edu/docs/kbdbfs-msthesis/

--
  -- Howard Chu
  CTO, Symas Corp.   http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-13 Thread Simon Slavin

On 13 Sep 2014, at 6:16pm, Tim Streater  wrote:

> On 13 Sep 2014 at 16:39, Richard Hipp  wrote: 
> 
>> I say that a filesystem is an eventually-consistent key/value database.
>> The keys are the filenames and the values are all big BLOBs, specifically
>> the file content.  Filesystems also have a hierarchical keyspace, which is
>> an extension from the usual key/value concept, but it is still key/value.
> 
> Key            Example value
> ---            -------------
> filename       somefile.txt
> creation date
> file type      text
> opens with     textwrangler.app
> content
> ...
> 
> and so on.

That's not what I meant.  That's the file as a database.  What I want is the 
entire volume as a database.  It would be something like

Key                        Example value
---                        -------------
/robots.txt/path           /
/robots.txt/filename       robots.txt
/robots.txt/creationdate   whatever
/robots.txt/filetype       text
/robots.txt/creatorapp     /Applications/textwrangler.app
/robots.txt/content        BLOB
/index.html                /
/index.html/filename       /index.html
/index.html/creationdate   whatever
...

The actual primary key would be a two-column index, of course.
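
As a sketch, that two-column-key layout maps onto a SQLite schema along
these lines (names and types are illustrative only):

#include <sqlite3.h>

/* One row per (path, property) pair, matching the listing above. */
static const char *volume_schema =
    "CREATE TABLE volume("
    "  path     TEXT NOT NULL,"   /* e.g. '/robots.txt'     */
    "  property TEXT NOT NULL,"   /* e.g. 'creationdate'    */
    "  value,"                    /* TEXT or BLOB as needed */
    "  PRIMARY KEY(path, property)"
    ");";

int create_volume_table(sqlite3 *db)
{
    return sqlite3_exec(db, volume_schema, 0, 0, 0);
}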

Simon.


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-13 Thread Tim Streater
On 13 Sep 2014 at 16:39, Richard Hipp  wrote: 

> I say that a filesystem is an eventually-consistent key/value database.
> The keys are the filenames and the values are all big BLOBs, specifically
> the file content.  Filesystems also have a hierarchical keyspace, which is
> an extension from the usual key/value concept, but it is still key/value.

Key            Example value
---            -------------
filename       somefile.txt
creation date
file type      text
opens with     textwrangler.app
content
...

and so on.

--
Cheers  --  Tim


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-13 Thread Richard Hipp
On Sat, Sep 13, 2014 at 11:01 AM, James K. Lowden 
wrote:

> I used "database" because that was his word.  Whether or not a
> filesystem is a database depends on the definition.  I would say that
> the files are data, but the filesystem is not a DBMS, partly for
> reasons you mention.  Do the files nonetheless constitute a database?
> I would say no, because they don't meet that treshhold.  Among other
> things, the filesystem offers no key: no way to identify the entities
> it stores.


I say that a filesystem is an eventually-consistent key/value database.
The keys are the filenames and the values are all big BLOBs, specifically
the file content.  Filesystems also have a hierarchical keyspace, which is
an extension from the usual key/value concept, but it is still key/value.

If your definition of "database" requires secondary keys, then lots of
things that go by the name "database" do not qualify.  Do you not consider
GDBM to be a database engine?  And yet it has only primary keys and values,
just like a filesystem.

This idea of viewing a filesystem as a key/value database is developed
further in http://www.sqlite.org/affcase1.html and
http://www.sqlite.org/appfileformat.html

"Git" is a classic example of using a filesystem as a database.  The
repository history is stored in a "database" called ".git".  Compare this
with Monotone or Fossil which store the repository history in a relational
database.
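
A toy illustration of that key/value view, where the key is a file name and
the value is the file's content (paths and error handling kept to a minimum):

#include <stdio.h>
#include <string.h>

/* Store "value" under "key" as an ordinary file inside dir; the file name
 * is the key and the file content is the value.  No durability guarantees
 * here, which is the "eventually-consistent" part. */
int kv_put(const char *dir, const char *key, const char *value)
{
    char path[4096];
    snprintf(path, sizeof path, "%s/%s", dir, key);

    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    size_t n = strlen(value);
    int ok = fwrite(value, 1, n, f) == n;
    return (fclose(f) == 0 && ok) ? 0 : -1;
}
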
-- 
D. Richard Hipp
d...@sqlite.org


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-13 Thread James K. Lowden
On Sat, 13 Sep 2014 01:07:59 +0100
Simon Slavin  wrote:

> > Implement a block-transaction store on the device
> > itself: no inodes, no directories, just writeable blocks managed in
> > transactions.  Build your DBMS on that.  
> 
> That would be ... erm ... perhaps a new disk volume format.  Where
> the blocks of the volume were pages of a SQLite database.  

Yes.  It can be done in userspace; Sybase calls it a "device".  Better,
surely, would be a kernel-side implementation, living alongside
FFS, ext4, etc., except *not* providing Posix filesystem services.  

As Richard pointed out, if the tranlog is in the "file" and the "file"
has observable size, then the "file" grows by the size of the tranlog.
In Sybase, the tranlog is a table in a "device", the size of which was
preallocated in a disk partition.  It's a different model.  As I said,
the filesystem affords conveniences to user and programmer both.  

> > I asked your question: why not add transactions to FFS?  
> > 
> > His answer: that's the province of a database.  
> 
> A file system is a database.  It accepts data from a user.  It stores
> it away for later retrieval.  It allows it to be searched in various
> ways more convenient than just reading it from beginning to end every
> time.

I used "database" because that was his word.  Whether or not a
filesystem is a database depends on the definition.  I would say that
the files are data, but the filesystem is not a DBMS, partly for
reasons you mention.  Do the files nonetheless constitute a database?
I would say no, because they don't meet that threshold.  Among other
things, the filesystem offers no key: no way to identify the entities
it stores.  If you say the offset into the file is the key, OK, but
then the "entity" is a byte, which is a pretty primitive database if
you ask me.  

I really think enriching the filesystem is not the way to go.  I saw a
presentation not long ago on running Postgres on ZFS.  ZFS, you may
know, has snapshots and transactions.  Now watch as the filesystem
journals writes to the transaction log, and takes snapshots of the
non-quiescent database files.  I didn't know whether to laugh or cry.  

--jkl


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-12 Thread Simon Slavin

On 13 Sep 2014, at 1:21am, Richard Hipp  wrote:

> On Fri, Sep 12, 2014 at 8:07 PM, Simon Slavin  wrote:
> 
>> one thing that annoys me about SQLite is that it needs to make a
>> journal file which isn't part of the database file.  Why ?  Why can't it
>> just write the journal to the database file it already has open ?  This
>> would reduce the problems where the OS prevents an application from
>> creating a new file because of permissions or sandboxing.
> 
> Where in the database does the journal information get stored?

Good points, all of them.  I don't doubt that the dev team has considered all 
these things carefully and chosen the best solution for the circumstances.

> There are also performance reasons for separating the temporary tables and
> indexes.  Because temporary tables do not have to be preserved across a
> system crash, SQLite is able to take lots of short-cuts when writing
> temporary tables (for example: omitting fsync() calls) which make them run
> must faster.

I never thought of either of those points.  Just goes to show.

Simon.


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-12 Thread Richard Hipp
On Fri, Sep 12, 2014 at 8:07 PM, Simon Slavin  wrote:

>
>   one thing that annoys me about SQLite is that it needs to make a
> journal file which isn't part of the database file.  Why ?  Why can't it
> just write the journal to the database file it already has open ?  This
> would reduce the problems where the OS prevents an application from
> creating a new file because of permissions or sandboxing.
>

Where in the database does the journal information get stored?  At the
end?  What happens then if the transaction is an INSERT and the size of the
content has to grow?  Does that leave a big hole in the middle of the file
when the journal is removed?  During recovery after a crash, where does the
recovery process go to look for the journal information?   If the journal
is at some arbitrary point in the file, where does it look.  Note that we
cannot write the journal location in the file header because the header
cannot be (safely) changed without first journaling it but we cannot
journal the header without first writing the journal location into the
header.

One idea that might work is to interleave the journal information with the
content.  So for each page in the database, there is a corresponding page
of journal content.  The downside there is that you double the size of the
database file without increasing its storage capacity.



>
> Similarly, temporary indexes and temporary tables (I think) also go in
> external files.  I don't see why, if they're part of 'main', they can't go
> in the main file.
>
>
Temporary tables and indexes are only supposed to be visible to the one
database connection that created them, not to all database connections.  So
they cannot be put into the main database file where they would be visible
to everybody.

There are also performance reasons for separating the temporary tables and
indexes.  Because temporary tables do not have to be preserved across a
system crash, SQLite is able to take lots of short-cuts when writing
temporary tables (for example: omitting fsync() calls) which make them run
much faster.
-- 
D. Richard Hipp
d...@sqlite.org


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-12 Thread Simon Slavin

On 13 Sep 2014, at 12:47am, James K. Lowden  wrote:

> I asked your question: why not add transactions to FFS?  
> 
> His answer: that's the province of a database.  

A file system is a database.  It accepts data from a user.  It stores it away 
for later retrieval.  It allows it to be searched in various ways more 
convenient than just reading it from beginning to end every time.

We're just not used to file systems being very clever.  New ones like ZFS and 
ext4 are getting very clever, but the creators still seem to have left out a 
feature that I consider should have been implemented long before the fancy 
things I see now.

Simon.


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-12 Thread Simon Slavin

On 13 Sep 2014, at 12:47am, James K. Lowden  wrote:

> The filesystem guys I've talked to all consider transactions to be a
> high-level construct hardly ever needed by most applications.  They're
> interested in raw speed and a consistent file structure (metadata)
> after a crash.  They feel that just making one disk look like another
> is hard enough.  
> 
> On the evidence, they're right.

Programmers don't expect file services to support transactions because file 
services have never supported transactions.  We're not used to specifying, when 
we write to one or many files, that a bunch of changes all go together.  But 
when it becomes possible everyone will rave about it.  I come from a banking 
background and I can imagine the transports of delight the financial techies 
will go through when they can assure their users that a crash in software will 
never lead to an inconsistent state: if one account was debited, the other will 
definitely be credited, and the transaction will definitely be in the ledger.  
No need to run consistency checks every time even the simplest application 
crashes.

> SQLite, for example, makes precious
> little use of the filesystem, insofar as the whole database is one
> file.

Except ... one thing that annoys me about SQLite is that it needs to make a 
journal file which isn't part of the database file.  Why ?  Why can't it just 
write the journal to the database file it already has open ?  This would reduce 
the problems where the OS prevents an application from creating a new file 
because of permissions or sandboxing.

Similarly, temporary indexes and temporary tables (I think) also go in external 
files.  I don't see why, if they're part of 'main', they can't go in the main 
file.

> Maybe we should go back to the future. You remember when DBMSs didn't
> use the filesystem, but acted on (the kernel's abstraction of) the
> device directly.  Implement a block-transaction store on the device
> itself: no inodes, no directories, just writeable blocks managed in
> transactions.  Build your DBMS on that.  Use the DBMS to build a
> user-space filesystem. With a little cleverness, they could be mutually
> intelligible, such that tables looked like files and find(1) would
> locate data in the database.  

That would be ... erm ... perhaps a new disk volume format.  Where the blocks 
of the volume were pages of a SQLite database.  One in which VACUUM could never 
shrink the file (but perhaps could be subverted to do something like 
defragmentation).

It would remove one level of abstraction between the OS and the bits on the 
disk surface.  And that's always good.  It would still rely on the disk driver 
correctly supporting flush and all other flush-like operations the OS 
implemented.  Most of them will when the jumpers on the drive are set correctly 
and the driver mounts the volume correctly.

Simon.


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-12 Thread James K. Lowden
On Fri, 12 Sep 2014 19:38:53 +0100
Simon Slavin  wrote:

> I don't think it can be done by trying to build it on top of an
> existing file system.  I think we need a file system (volume format,
> drivers, etc.) built from the ground up with
> atomicity/ACID/transactions in mind.  Since the greatest of these is
> transactions, I want a transactional file system.

Funny you should mention that.  About 6 years ago Mike McKusick gave a
presentation on then-recent updates to FFS in FreeBSD, including the
birthdate.  Among other things, I remember he explored using a tree
instead of an array for a directory file, but found that because the
vast majority of directories hold a small number of names, the overall
performance is better with a simple array.  

I asked your question: why not add transactions to FFS?  

His answer: that's the province of a database.  

The filesystem guys I've talked to all consider transactions to be a
high-level construct hardly ever needed by most applications.  They're
interested in raw speed and a consistent file structure (metadata)
after a crash.  They feel that just making one disk look like another
is hard enough.  

On the evidence, they're right.  SQLite, for example, makes precious
little use of the filesystem, insofar as the whole database is one
file.  Within the database, there are no directories, no filesystem
metadata, no inodes, nothing for the OS to virtualize or arbitrate between 
users.  Apart from ancillary files, about the only thing the FS 
supplies is extent management for blocks, and user convenience.  I
don't have to explain to you why even such extent management as it
provides is suboptimal from the point of view of the DBMS.  

Maybe we should go back to the future. You remember when DBMSs didn't
use the filesystem, but acted on (the kernel's abstraction of) the
device directly.  Implement a block-transaction store on the device
itself: no inodes, no directories, just writeable blocks managed in
transactions.  Build your DBMS on that.  Use the DBMS to build a
user-space filesystem. With a little cleverness, they could be mutually
intelligible, such that tables looked like files and find(1) would
locate data in the database.  
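
To make the shape of that concrete, here is a rough sketch of what the interface 
to such a block-transaction store might look like.  The names are invented for 
illustration; nothing below is an existing API:

    #include <stddef.h>
    #include <stdint.h>

    typedef struct bts     bts;      /* an open block-transaction store */
    typedef struct bts_txn bts_txn;  /* one in-flight transaction */

    /* Open the raw device (or a file standing in for one). */
    bts     *bts_open(const char *devpath, size_t block_size);
    void     bts_close(bts *store);

    /* Transactions: reads see the last committed state plus this
       transaction's own writes; commit is all-or-nothing after a crash. */
    bts_txn *bts_begin(bts *store);
    int      bts_read (bts_txn *t, uint64_t blockno, void *buf);
    int      bts_write(bts_txn *t, uint64_t blockno, const void *buf);
    int      bts_commit(bts_txn *t);
    void     bts_rollback(bts_txn *t);

A DBMS pager would sit directly on top of this, and a user-space filesystem could 
expose the same store as files and directories, which is the mutual 
intelligibility mentioned above.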

--jkl
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-12 Thread Remzi H. Arpaci-Dusseau
hi!

Thanks for the note, and the interest.

I'm cc'ing Thanu into this discussion -- he and a few other grad students
did all the heavy lifting and can share the details of what they've found.

As I recall, he has already spoken with some of the SQLite gang a bit
about the purported "problem" (and perhaps we even disagree as to
whether it is or not). It may be that our reading of what the documentation
guarantees does not match what you think it does. Thanu?

As for SQLite "coming out on top" -- well, we're trying very hard not
to view our results as a comparison of one system vs. another,
because frankly we are only looking at one type of issue and
not overall robustness.

That said, we've been very impressed with SQLite -- it seems like
one of the few systems that has been very carefully thought through,
and we've learned much from it as well as what we've seen of its
testing tools.

We do have an upcoming paper on said topic (at OSDI '14),
in case you'd like to read more.

Thanks again!
Remzi

On Thu, Sep 11, 2014 at 9:07 PM, Richard Hipp  wrote:

>
>
> On Thu, Sep 11, 2014 at 5:49 PM, Kees Nuyt  wrote:
>
>>
>> Hi all,
>>
>> Today I bumped into a presentation about ordering and atomicity
>> of filesystems that might interest you.
>>
>> https://www.youtube.com/watch?v=YvchhB1-Aws
>>
>> The Application/Storage Interface: After All These Years, We're
>> Still Doing It Wrong
>> Remzi Arpaci-Dusseau, University of Wisconsin--Madison
>>
>> Talk at usenix 2014 Published on Sep 4, 2014 by USENIX
>> Association Videos
>>
>> Somewhat related to the article drh recently wrote about using
>> sqlite as an application data store.
>>
>>
> Thanks for the link, Kees!
>
> I just finished watching the video.  Remzi Arpaci-Dusseau talks about
> research (done by him and his graduate students) into how well application
> data survives system crashes.  Remzi observes that filesystem developers
> have worked very hard for many years ensuring that filesystem metadata is
> preserved in a crash, but they seem less concerned about protecting
> application data.
>
> Remzi developed tools (BOB and ALICE) to study various workloads to see
> how vulnerable they were to system crashes.  He looked at various
> "applications".  His definition of "application" includes standalone
> programs like Git and Hg, and database servers like PostgreSQL, and
> libraries like SQLite and LevelDB.  At one point he shows a chart that
> counts the number of unwarranted assumptions that the applications make
> about filesystem behavior.  Such unwarranted assumptions can lead to
> corruption following a system crash (or power loss).
>
> SQLite and PostgreSQL came out on top, with just one vulnerability each.
> Hg and Git each had many vulnerabilities.  In fairness, Remzi points out
> that these vulnerabilities assume a "worst case" filesystem and that many
> of them might not exist on a modern filesystem like EXT4.
>
> Remzi:  I would very much like to learn more about that one unwarranted
> durability assumption that you contend SQLite is making.
>
> That SQLite does well in an analysis using ALICE and BOB is not really
> surprising.  It turns out that we SQLite developers have our own ALICE- and
> BOB-like tools that we have implemented using custom VFSes.  We have three
> of them, actually, implemented at different times, by both me and Dan.
> (Only two are BOB- and ALICE-like crash simulators - the third tool is an
> invariant checker that helps us to prove crashes are recoverable.)  We run
> many cycles of all three prior to every release, looking for crash
> vulnerabilities.  If SQLite really is making an unwarranted durability
> assumption, as Remzi contends, then that points to a deficiency in our
> three crash analyzers, which is something we would like to fix.
>
> Remzi also talks about the idea of a new system call that he refers to as
> "osync()" that causes I/O operations to be ordered.  I've been saying much
> the same thing, for years, to anybody who would listen, though I've been
> calling the system call a "write barrier".  The idea is that if you could
> replace fsync() with the write barrier, you would lose durability (which
> few people really care about) but gain a lot more performance.  Remzi shows
> a test case using SQLite where osync() instead of fsync() results in a
> ten-fold performance improvement.
>
> --
> D. Richard Hipp
> d...@sqlite.org
>
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-12 Thread Nico Williams
On Fri, Sep 12, 2014 at 1:18 PM, Howard Chu  wrote:
> Nico Williams wrote:
>> On ZFS datasets with sync disabled fsync() functions as osync(), as a
>> write
>> barrier without durability and without the associated penalty.  The
>> obvious
>> problem is that we really do need osync() and fsync(); just one or the other
>> is not a reasonable compromise.
>
> Write barriers have been debated in Linux ad nauseam. I agree that osync()
> would be great to have, but it's still a die roll - the OS can flush blocks
> to the storage device in order, but without waiting for the storage device's
> buffer to empty, it can't make any further ordering promises from there. You
> need device-level ordering support too - which prompted my suggestion here

For ZFS there's no problem if this happens: you might lose whole
transactions, but the filesystem will remain consistent.  That's the
no-durability-guarantee part of a write barrier.

On recovery ZFS wants you to note and approve of any transactions lost
beyond the last one that was in the process of being flushed.  If the last
one didn't complete that may be because of a power outage before the
cache flush could complete.  If more than one transaction didn't
completely reach stable storage ZFS figures it must have been that the
HW lied about cache flushes.
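
Roughly, the recovery-time reasoning goes like this (a paraphrase in C for 
clarity, not actual ZFS code, and the names are made up):

    #include <stdint.h>

    enum verdict {
        CLEAN,            /* every transaction reached stable storage */
        LOST_LAST_TXN,    /* power may have cut the final cache flush */
        SUSPECT_HW_LIED   /* more than one txn missing: flushes were lied about */
    };

    static enum verdict recovery_verdict(uint64_t last_committed_txg,
                                         uint64_t last_attempted_txg)
    {
        uint64_t lost = last_attempted_txg - last_committed_txg;
        if (lost == 0) return CLEAN;
        if (lost == 1) return LOST_LAST_TXN;
        return SUSPECT_HW_LIED;
    }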

Of course, one might actually want the filesystem to never issue cache
flushes except in response to sync()/fsync()/fdatasync() calls.  IIRC
there's no way to configure such behavior in ZFS -- one can only make
sync/fsync/fdatasync not wait for a cache flush.  For some uses one
might really want such behavior though: a few lost transactions not
closed because of a sync/fsync/fdatasync syscall may be tolerable.
But then, ZFS can keep transactions open for a fairly long time if
sync I/O is not requested by any apps...

Nico
--
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-12 Thread Simon Slavin

On 12 Sep 2014, at 7:08pm, Nico Williams  wrote:

> On ZFS datasets with sync disabled fsync() functions as osync(), as a write
> barrier without durability and without the associated penalty.  The obvious
> problem is that we really do need osync() and fsync(); just one or the other
> is not a reasonable compromise.

Yep.  However ...

If you do it right, and you don't have an extra layer of hardware doing your 
write barrier, you sacrifice speed.  No exceptions.  Today's computers -- the 
desktop or laptop every one of us has -- are stuffed to the eyebrows with 
time-optimization dodges.  Many of those defeat any opportunity for 
in-order writing.  I don't think it can be done by trying to build it on top of 
an existing file system.  I think we need a file system (volume format, 
drivers, etc.) built from the ground up with atomicity/ACID/transactions in 
mind.  Since the greatest of these is transactions, I want a transactional file 
system.

Some years ago I ordered a very fast server-class Wintel computer with all 
components specced for safety-critical server use.  Having set jumpers 
on the motherboard and hard disk for "do it properly, don't acknowledge a write 
until the data is on the disk" I installed a standard copy of Windows (I think) 
and Microsoft Office on it.  

It was unusably slow.  Something like 5 minutes to boot and another 5 minutes 
to get a blank document showing in Microsoft Word.   I'm not talking about the 
initial setup boot, I'm talking 5 minutes for every boot, and 5 minutes to wait 
every time you start Word.  Typing into Word got about 3 characters a second, 
which lags behind my standard typing speed quite a bit.  It made for a very 
convincing demonstration.

This is not a gripe at Microsoft.  It wasn't the software's fault.  It's what 
you get when you defeat all the speed optimizations we expect these days.  Or 
ten years ago when I did the experiment.

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-12 Thread Howard Chu

Nico Williams wrote:

On ZFS datasets with sync disabled fsync() functions as osync(), as a write
barrier without durability and without the associated penalty.  The obvious
problem is that we really do need osync() and fsync(); just one or the other
is not a reasonable compromise.


Write barriers have been debated in Linux ad nauseam. I agree that osync() 
would be great to have, but it's still a die roll - the OS can flush blocks to 
the storage device in order, but without waiting for the storage device's 
buffer to empty, it can't make any further ordering promises from there. You need 
device-level ordering support too - which prompted my suggestion here


http://www.spinics.net/lists/linux-fsdevel/msg70047.html
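
Until something like that lands, the only portable "barrier" an application has 
is a full flush, which drags durability and its cost along with it.  Roughly 
(a sketch with error handling trimmed, on a storage stack that honours flushes):

    #include <sys/types.h>
    #include <unistd.h>

    /* Write data, then a commit record, with fdatasync() standing in as
       the ordering barrier between them.  Overkill if ordering is all we
       wanted, but it's what the current interface offers. */
    int write_then_commit(int fd,
                          const void *data, size_t len, off_t data_off,
                          const void *rec,  size_t rlen, off_t rec_off)
    {
        if (pwrite(fd, data, len, data_off) != (ssize_t)len) return -1;
        if (fdatasync(fd) != 0) return -1;   /* barrier (and durability) */
        if (pwrite(fd, rec, rlen, rec_off) != (ssize_t)rlen) return -1;
        if (fdatasync(fd) != 0) return -1;   /* make the commit durable */
        return 0;
    }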

--
  -- Howard Chu
  CTO, Symas Corp.   http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-12 Thread Nico Williams
On ZFS datasets with sync disabled fsync() functions as osync(), as a write
barrier without durability and without the associated penalty.  The obvious
problem is that we really do need osync() and fsync(); just one or the other
is not a reasonable compromise.

Nico
--
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-12 Thread Howard Chu

Richard Hipp wrote:

On Thu, Sep 11, 2014 at 5:49 PM, Kees Nuyt  wrote:



Hi all,

Today I bumped into a presentation about ordering and atomicity
of filesystems that might interest you.

https://www.youtube.com/watch?v=YvchhB1-Aws

The Application/Storage Interface: After All These Years, We're
Still Doing It Wrong
Remzi Arpaci-Dusseau, University of Wisconsin—Madison

Talk at usenix 2014 Published on Sep 4, 2014 by USENIX
Association Videos

Somewhat related to the article drh recently wrote about using
sqlite as an application data store.



Thanks for the link, Kees!

I just finished watching the video.  Remzi Arpaci-Dusseau talks about
research (done by him and his graduate students) into how well application
data survives system crashes.  Remzi observes that filesystem developers
have worked very hard for many years ensuring that filesystem metadata is
preserved in a crash, but they seem less concerned about protecting
application data.

Remzi developed tools (BOB and ALICE) to study various workloads to see how
vulnerable they were to system crashes.  He looked at various
"applications".  His definition of "application" includes standalone
programs like Git and Hg, and database servers like PostgreSQL, and
libraries like SQLite and LevelDB.  At one point he shows a chart that
counts the number of unwarranted assumptions that the applications make
about filesystem behavior.  Such unwarranted assumptions can lead to
corruption following a system crash (or power loss).

SQLite and PostgreSQL came out on top, with just one vulnerability each.
Hg and Git each had many vulnerabilities.  In fairness, Remzi points out
that these vulnerabilities assume a "worst case" filesystem and that many
of them might not exist on a modern filesystem like EXT4.


Actually LMDB comes out on top with zero vulnerabilities. I spoke to the UWisc 
folks to find out what the one atomicity vulnerability they reported in 
LMDB was, and we confirmed that it was not in fact a valid vulnerability.


--
  -- Howard Chu
  CTO, Symas Corp.   http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-12 Thread Howard Chu

Simon Slavin wrote:


On 12 Sep 2014, at 3:18am, Scott Robison  wrote:


That was an interesting 90 minutes.


Indeed.  Thanks to Kees for posting it.  Though I was surprised he didn't 
mention the term 'ACID' explicitly.

I'm still of the opinion that we need an actual transactional file system
with equivalents to BEGIN, END and ROLLBACK.  It will have to support many
transactions at the same time, of course, since each process will be doing
its own thing.


There have been such projects. They don't seem to have continued though.
http://mile-outta-boyce.blogspot.ie/2007/05/kernel-berkeley-db-file-system-kbdbfs.html

https://github.com/dmeister/kbdb/tree/master/kbdbfs-1.0

I've got a project underway to retrace their steps, using LMDB instead of 
BerkeleyDB.


--
  -- Howard Chu
  CTO, Symas Corp.   http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-11 Thread Richard Hipp
On Thu, Sep 11, 2014 at 5:49 PM, Kees Nuyt  wrote:

>
> Hi all,
>
> Today I bumped into a presentation about ordering and atomicity
> of filesystems that might interest you.
>
> https://www.youtube.com/watch?v=YvchhB1-Aws
>
> The Application/Storage Interface: After All These Years, We're
> Still Doing It Wrong
> Remzi Arpaci-Dusseau, University of Wisconsin—Madison
>
> Talk at usenix 2014 Published on Sep 4, 2014 by USENIX
> Association Videos
>
> Somewhat related to the article drh recently wrote about using
> sqlite as an application data store.
>
>
Thanks for the link, Kees!

I just finished watching the video.  Remzi Arpaci-Dusseau talks about
research (done by him and his graduate students) into how well application
data survives system crashes.  Remzi observes that filesystem developers
have worked very hard for many years ensuring that filesystem metadata is
preserved in a crash, but they seem less concerned about protecting
application data.

Remzi developed tools (BOB and ALICE) to study various workloads to see how
vulnerable they were to system crashes.  He looked at various
"applications".  His definition of "application" includes standalone
programs like Git and Hg, and database servers like PostgreSQL, and
libraries like SQLite and LevelDB.  At one point he shows a chart that
counts the number of unwarranted assumptions that the applications make
about filesystem behavior.  Such unwarranted assumptions can lead to
corruption following a system crash (or power loss).

SQLite and PostgreSQL came out on top, with just one vulnerability each.
Hg and Git each had many vulnerabilities.  In fairness, Remzi points out
that these vulnerabilities assume a "worst case" filesystem and that many
of them might not exist on a modern filesystem like EXT4.

Remzi:  I would very much like to learn more about that one unwarranted
durability assumption that you contend SQLite is making.

That SQLite does well in an analysis using ALICE and BOB is not really
surprising.  It turns out that we SQLite developers have our own ALICE- and
BOB-like tools that we have implemented using custom VFSes.  We have three
of them, actually, implemented at different times, by both me and Dan.
(Only two are BOB- and ALICE-like crash simulators - the third tool is an
invariant checker that helps us to prove crashes are recoverable.)  We run
many cycles of all three prior to every release, looking for crash
vulnerabilities.  If SQLite really is making an unwarranted durability
assumption, as Remzi contends, then that points to a deficiency in our
three crash analyzers, which is something we would like to fix.

Remzi also talks about the idea of a new system call that he refers to as
"osync()" that causes I/O operations to be ordered.  I've been saying much
the same thing, for years, to anybody who would listen, though I've been
calling the system call a "write barrier".  The idea is that if you could
replace fsync() with the write barrier, you would lose durability (which
few people really care about) but gain a lot more performance.  Remzi shows
a test case using SQLite where osync() instead of fsync() results in a
ten-fold performance improvement.
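
To make the idea concrete, a commit sequence would look roughly like this.  
osync() is imaginary -- it does not exist anywhere today -- and the journal 
steps are elided; this only shows where a write barrier would slot in where 
fsync() sits now:

    /* Hypothetical syscall: writes issued to fd before this call reach the
       media before writes issued after it.  No durability promise. */
    extern int osync(int fd);

    static int commit_with_barrier(int journal_fd, int db_fd)
    {
        /* ... write the rollback-journal pages to journal_fd ... */
        if (osync(journal_fd) != 0) return -1;  /* barrier, not a flush */
        /* ... write the changed pages into db_fd ... */
        if (osync(db_fd) != 0) return -1;
        /* ... invalidate the journal header to mark the commit ... */
        return 0;
    }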

-- 
D. Richard Hipp
d...@sqlite.org
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-11 Thread Simon Slavin

On 12 Sep 2014, at 3:18am, Scott Robison  wrote:

> That was an interesting 90 minutes.

Indeed.  Thanks to Kees for posting it.  Though I was surprised he didn't 
mention the term 'ACID' explicitly.

I'm still of the opinion that we need an actual transactional file system with 
equivalents to BEGIN, END and ROLLBACK.  It will have to support many 
transactions at the same time, of course, since each process will be doing its 
own thing.
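
For the record, what I have in mind is a syscall surface along these lines.  
Everything below is invented for the sake of illustration; nothing like it 
exists in POSIX:

    #include <sys/types.h>

    typedef int fs_txn_t;  /* handle to one in-flight filesystem transaction */

    fs_txn_t fs_begin(void);         /* BEGIN: a private view of the tree */
    int fs_write_in(fs_txn_t t, int fd, const void *buf, size_t n, off_t off);
    int fs_rename_in(fs_txn_t t, const char *oldpath, const char *newpath);
    int fs_unlink_in(fs_txn_t t, const char *path);
    int fs_commit(fs_txn_t t);       /* END: all of it becomes visible, or none */
    int fs_rollback(fs_txn_t t);     /* ROLLBACK: discard the private view */

Each process would hold its own handles, so many transactions can be in flight 
at once.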

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-11 Thread Scott Robison
On Thu, Sep 11, 2014 at 4:09 PM, Kees Nuyt  wrote:

> On Thu, 11 Sep 2014 23:49:22 +0200, Kees Nuyt 
> wrote:
>
> > Today I bumped into a presentation about ordering and atomicity
> > of filesystems that might interest you.
> >
> > https://www.youtube.com/watch?v=YvchhB1-Aws
>
> Compliments for sqlite at 43:20 .. 43:59


That was an interesting 90 minutes. His chart of Atomicity/Order/Durability
vulnerabilities (and his comments) speaks highly of SQLite. Do the SQLite
devs agree with his assessment that there is (or was at the time) a
durability vulnerability? If you agree, can you state what it was (a
specific flaw, or a documented limitation, as he mentioned was the case for
the atomicity vulnerability in PostgreSQL)? Just curious, since he didn't
go into great detail given the amount of ground he had to cover.

-- 
Scott Robison
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] presentation about ordering and atomicity of filesystems

2014-09-11 Thread Kees Nuyt
On Thu, 11 Sep 2014 23:49:22 +0200, Kees Nuyt 
wrote:

> Today I bumped into a presentation about ordering and atomicity
> of filesystems that might interest you.
>
> https://www.youtube.com/watch?v=YvchhB1-Aws

Compliments for sqlite at 43:20 .. 43:59
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users