This is a long message that explains the changes in TDB2 around the way write transactions work.

TDB2 transactions are completely different from TDB1 transactions. The transaction coordinator is general purpose and works on a set of transaction components; each index is a separate component. In TDB1, the transaction manager works on the TDB1 database as a whole.

** TDB1

In TDB1, a write transaction creates a number of changes to be made to the database. These are stored in the journal. They consist of replacement blocks (i.e. overwrites) and new blocks for the indexes. All later transactions (those starting after the writer commits) use the in-memory cache of the journal together with the main database.

The Node changes are written ahead to the node storage, which is append-only, so they don't need recording in the journal. They are inaccessible to earlier transactions because nothing references them via the node table indexes yet.

The journal then needs to be written to the main database. TDB1 is update-in-place. TDB1 is also lock-free: writing to the main database requires that there are no other transactions using it. If there are other active transactions, the write-back is not done immediately but queued.

This queue is checked whenever a transaction, read or write, finishes. If, at that point, the finishing transaction is the only one active, TDB1 writes the journal to the main database and clears the journal. That transaction can be a reader - the cost of write-back is then incurred by the reader.

This is the delayed replay queue. ("Replay" because it is a write-ahead logging system, and writing back the journal is replaying the changes.) Write transaction changes are always delayed for efficiency, amortizing the cost of write-back across several transactions.
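The mechanism can be sketched in a few lines of Java. This is an illustrative model only - all names are invented and it is not TDB1's actual code - but it shows the key behaviour: a writer's committed changes go onto a queue, and whichever transaction finishes last (possibly a reader) pays the cost of writing them back to the main database.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch (names invented) of TDB1's check when a transaction
// ends: if the finishing transaction is the last one active, the queued
// journal changes are written back to the main database.
class DelayedReplay {
    private int activeTransactions = 0;
    private final Queue<String> replayQueue = new ArrayDeque<>();
    private final StringBuilder mainDatabase = new StringBuilder();

    synchronized void begin() { activeTransactions++; }

    synchronized void commitWriter(String changes) {
        // A writer's changes are queued, not applied to the main database yet.
        replayQueue.add(changes);
        finish();
    }

    synchronized void finishReader() { finish(); }

    private void finish() {
        activeTransactions--;
        // Write-back happens only when no other transaction is active,
        // so the cost may fall on a reader that happens to finish last.
        if (activeTransactions == 0) {
            while (!replayQueue.isEmpty())
                mainDatabase.append(replayQueue.poll());
        }
    }

    synchronized String databaseState() { return mainDatabase.toString(); }
    synchronized int queued() { return replayQueue.size(); }
}
```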

There can be layers: writers running with earlier changes to the database still in the delayed replay queue, and those earlier changes may still be in use by readers. A new layer is added for each new writer.

Under load, the delayed replay queue grows; there is never a quiet moment in which to write the changes back to the main database.

There are a couple of mechanisms to catch this: if the queue is over a certain length, or the total size of the journal is over a threshold, TDB1 holds back transactions as they begin, waits for the current ones to finish, then writes back the queue.
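That safety valve can be sketched as follows. Again this is an invented model, not TDB1's code, and the threshold values are purely illustrative: new transactions wait at begin() while either limit is exceeded, and are released once the last active transaction finishes and the queue is written back.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch (names and thresholds invented) of TDB1's back-pressure:
// if the delayed replay queue is too long, or the journal too large, new
// transactions are held at begin() until the current ones finish.
class ReplayBackPressure {
    static final int MAX_QUEUE_LENGTH = 8;         // illustrative threshold
    static final long MAX_JOURNAL_BYTES = 1 << 20; // illustrative threshold

    final Queue<byte[]> replayQueue = new ArrayDeque<>();
    long journalBytes = 0;
    int active = 0;

    synchronized void begin() throws InterruptedException {
        while (overThreshold())   // hold back new transactions while over limit
            wait();
        active++;
    }

    synchronized void queueChanges(byte[] changes) {
        replayQueue.add(changes);
        journalBytes += changes.length;
    }

    synchronized void finish() {
        active--;
        if (active == 0) {
            replayQueue.clear();  // stand-in for writing back to the database
            journalBytes = 0;
            notifyAll();          // release any held transactions
        }
    }

    synchronized boolean overThreshold() {
        return replayQueue.size() > MAX_QUEUE_LENGTH
            || journalBytes > MAX_JOURNAL_BYTES;
    }
}
```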

** TDB2

In TDB2, data structures are "append-only" in the sense that, once written and committed, they are never changed. New data is written to new blocks, and either the root of the tree changes (in the case of the B+Trees - copy-on-write, also called "persistent data structures", where 'persistent' is not about external storage; it is a different branch of computer science using the same word with a different meaning) or the visible length of the file changes (the append-only .dat files).
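The copy-on-write idea can be shown with a minimal persistent structure. This sketch uses a plain binary search tree rather than a B+Tree, and the names are illustrative: an insert copies only the path from root to leaf and returns a new root, while the old root still sees the old contents - which is exactly why TDB2 readers are undisturbed by writers.

```java
// Minimal sketch of the "persistent data structure" idea behind TDB2's
// copy-on-write B+Trees: nodes are immutable, an insert path-copies down
// to the leaf, and untouched subtrees are shared between versions.
final class CowTree {
    final int key;
    final CowTree left, right;

    CowTree(int key, CowTree left, CowTree right) {
        this.key = key; this.left = left; this.right = right;
    }

    static CowTree insert(CowTree root, int key) {
        if (root == null) return new CowTree(key, null, null);
        if (key < root.key)      // copy this node; share the untouched subtree
            return new CowTree(root.key, insert(root.left, key), root.right);
        else if (key > root.key)
            return new CowTree(root.key, root.left, insert(root.right, key));
        else
            return root;         // key already present: nothing to copy
    }

    static boolean contains(CowTree root, int key) {
        if (root == null) return false;
        if (key == root.key) return true;
        return contains(key < root.key ? root.left : root.right, key);
    }
}
```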

The only use of the journal is to transactionally manage small control data, such as the block id of the new tree root. A transaction's journal entry is smaller than a disk block.
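To see why the journal entry is so small: per transaction it only needs the new root block id for each index plus the new visible lengths of the append-only .dat files - a handful of longs. The encoding below is invented for illustration (it is not TDB2's journal format), but it makes the size argument concrete.

```java
import java.nio.ByteBuffer;

// Illustrative sketch (layout invented, not TDB2's actual journal format):
// a commit record is just a few block ids and file lengths, far smaller
// than one disk block.
class CommitRecord {
    static byte[] encode(long[] indexRoots, long[] datFileLengths) {
        ByteBuffer bb = ByteBuffer.allocate(
            8 * (indexRoots.length + datFileLengths.length) + 8);
        bb.putInt(indexRoots.length);
        for (long root : indexRoots) bb.putLong(root);
        bb.putInt(datFileLengths.length);
        for (long len : datFileLengths) bb.putLong(len);
        return bb.array();
    }
}
```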

Compared to TDB1, TDB2:

+ Writers write changes to the database as the writer proceeds.

Write efficiency: changes go directly to the database, so there is only one write, not two (once to the journal, once to the database), and they are write-buffered by the operating system, with all the usual efficiency the OS can provide in disk scheduling.

This improves bulk loading to the point where tdb2.tdbloader isn't doing low-level file manipulation but simply writing to the database. If low-level manipulation turns out to be an improvement, it can fit in there.

No variable-size heap cache: large inserts and deletes go to the live database and can be any size. There is no caching of the old-style journal that depends on the size of the changes. No more running out of heap with a large transaction.

+ Readers only read

A read transaction does not need to do anything about the delayed replay queue. Readers just read the database, never write.

Predictable read performance.

Of course, there is a downside.

The database grows faster and needs compaction from time to time.
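Why it grows, and what compaction buys back, can be shown with a toy append-only store. This is not TDB2's storage layout - all names are invented - but it captures the trade: every update appends a new record, superseded versions stay on "disk" until a compaction pass copies only the live data.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch (not TDB2's layout) of append-only growth and compaction:
// updates append; compaction rewrites only the latest version of each key.
class AppendOnlyStore {
    final List<String> file = new ArrayList<>();        // append-only "file"
    final Map<String, Integer> index = new HashMap<>(); // key -> latest offset

    void put(String key, String value) {
        file.add(key + "=" + value);       // never overwrite; always append
        index.put(key, file.size() - 1);
    }

    String get(String key) {
        Integer off = index.get(key);
        return off == null ? null : file.get(off).split("=", 2)[1];
    }

    void compact() {
        // Copy only the latest version of each key; drop superseded records.
        List<String> live = new ArrayList<>();
        Map<String, Integer> newIndex = new HashMap<>();
        for (Map.Entry<String, Integer> e : index.entrySet()) {
            newIndex.put(e.getKey(), live.size());
            live.add(file.get(e.getValue()));
        }
        file.clear();  file.addAll(live);
        index.clear(); index.putAll(newIndex);
    }

    int size() { return file.size(); }
}
```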

People will start asking why the database is so large. They already ask about TDB1, and TDB2 databases will be bigger.

Maintaining compact databases while the system runs has costs, depending on how it is done: it is slower, with some kind of incremental maintenance overhead (disk/SSD I/O); transaction performance is less predictable; locking schemes become (very) complicated, including system aborts when the DB detects a deadlock (and bugs, because it's complicated); and large writes impact concurrent readers much more.

TDB1 and TDB2 don't system-abort due to deadlock.

Other: the TDB2 transaction coordinator is general, not TDB2-specific, so it will be able to include text indexes in the future.

** TDB3

An experiment, not part of Jena. Currently it's working and not bad. Bulk loads are slower at the 100 million triple scale, but the promise is that large loads (in the billion triple range) are better. As an experiment, it may not be a good idea - and it will make slow progress. There are no releases and none planned.

TDB3 uses RocksDB -- http://rocksdb.org/.

That means using SSTables, not CoW B+Trees. At the moment, one single SSTable key space holds everything, because the storage data can be partitioned by key, so there is no need to have several RocksDB databases.
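The key-partitioning trick can be sketched with a sorted map standing in for the RocksDB key space (the names here are invented): each index gets a distinct key prefix, and a range scan over that prefix reads only that index, so several indexes share one database.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative sketch of partitioning one sorted key space by prefix, the
// way several indexes can share a single RocksDB database. TreeMap stands
// in for RocksDB; names are invented.
class PrefixedKeySpace {
    final TreeMap<String, String> store = new TreeMap<>();

    void put(String index, String key, String value) {
        store.put(index + "/" + key, value);  // the prefix partitions the space
    }

    SortedMap<String, String> scan(String index) {
        // All keys of one index: the range [prefix + "/", prefix + "0"),
        // since '0' is the character immediately after '/'.
        return store.subMap(index + "/", index + "0");
    }
}
```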

It still needs compaction. That's an innate feature of SSTable and LSM (Log-Structured Merge) systems.

It is also based on work (RocksDB PR#1298) by Adam Retter to expose the RocksDB transaction system to Java.

https://github.com/facebook/rocksdb/wiki/A-Tutorial-of-RocksDB-SST-formats

    Andy
