This is a long message that explains the changes in TDB2 around the way write transactions work.

TDB2 transactions are completely different from TDB1 transactions. The transaction coordinator is general purpose and works on a set of transaction components; each index is a separate component. In TDB1, the transaction manager works on the TDB1 database as a whole.

** TDB1

In TDB1, a write transaction creates a number of changes to be made to the database. These are stored in the journal. They consist of replacement blocks (i.e. overwrites) and new blocks for the indexes. All later transactions (those starting after the writer commits) use the in-memory cache of the journal together with the main database.

The Node changes are written ahead to the node storage, which is append-only, so they don't need recording in the journal. They are inaccessible to earlier transactions because nothing references them via the node table indexes yet.

The journal then needs to be written to the main database. TDB1 is update-in-place. TDB1 is also lock-free: writing to the main database requires that there are no other transactions using it. If there are other active transactions, the write-back is not done immediately but queued.

This queue is checked whenever a transaction, read or write, finishes. If, at that point, the finishing transaction is the only one active, TDB1 writes the journal to the main database and clears the journal. That transaction can be a reader - the cost of write-back is then incurred by the reader.

This is the delayed replay queue. ("Replay" because it is a write-ahead logging system, and writing back the journal is replaying the changes.) Write transaction changes are always delayed for efficiency, amortizing the cost of write-back across several transactions.
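The mechanism can be sketched in a few lines of Java. This is an illustrative model only - all names are invented and it is not TDB1's actual code - but it shows the key behaviour: a writer's committed changes go onto a queue, and whichever transaction finishes last (possibly a reader) pays the cost of writing them back to the main database.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch (names invented) of TDB1's check when a transaction
// ends: if the finishing transaction is the last one active, the queued
// journal changes are written back to the main database.
class DelayedReplay {
    private int activeTransactions = 0;
    private final Queue<String> replayQueue = new ArrayDeque<>();
    private final StringBuilder mainDatabase = new StringBuilder();

    synchronized void begin() { activeTransactions++; }

    synchronized void commitWriter(String changes) {
        // A writer's changes are queued, not applied to the main database yet.
        replayQueue.add(changes);
        finish();
    }

    synchronized void finishReader() { finish(); }

    private void finish() {
        activeTransactions--;
        // Write-back happens only when no other transaction is active,
        // so the cost may fall on a reader that happens to finish last.
        if (activeTransactions == 0) {
            while (!replayQueue.isEmpty())
                mainDatabase.append(replayQueue.poll());
        }
    }

    synchronized String databaseState() { return mainDatabase.toString(); }
    synchronized int queued() { return replayQueue.size(); }
}
```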

There can be layers: writers running with earlier changes to the database still in the delayed replay queue, and those earlier changes may still be in use by readers. A new layer is added for each new writer.

Under load, the delayed replay queue grows; there is never a quiet moment in which to write the changes back to the main database.

There are a couple of mechanisms to catch this: if the queue is over a certain length, or the total size of the journal is over a threshold, TDB1 holds back transactions as they begin, waits for the current ones to finish, then writes back the queue.
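That safety valve can be sketched as follows. Again this is an invented model, not TDB1's code, and the threshold values are purely illustrative: new transactions wait at begin() while either limit is exceeded, and are released once the last active transaction finishes and the queue is written back.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch (names and thresholds invented) of TDB1's back-pressure:
// if the delayed replay queue is too long, or the journal too large, new
// transactions are held at begin() until the current ones finish.
class ReplayBackPressure {
    static final int MAX_QUEUE_LENGTH = 8;         // illustrative threshold
    static final long MAX_JOURNAL_BYTES = 1 << 20; // illustrative threshold

    final Queue<byte[]> replayQueue = new ArrayDeque<>();
    long journalBytes = 0;
    int active = 0;

    synchronized void begin() throws InterruptedException {
        while (overThreshold())   // hold back new transactions while over limit
            wait();
        active++;
    }

    synchronized void queueChanges(byte[] changes) {
        replayQueue.add(changes);
        journalBytes += changes.length;
    }

    synchronized void finish() {
        active--;
        if (active == 0) {
            replayQueue.clear();  // stand-in for writing back to the database
            journalBytes = 0;
            notifyAll();          // release any held transactions
        }
    }

    synchronized boolean overThreshold() {
        return replayQueue.size() > MAX_QUEUE_LENGTH
            || journalBytes > MAX_JOURNAL_BYTES;
    }
}
```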

** TDB2

In TDB2, data structures are "append-only" in the sense that, once written and committed, they are never changed. New data is written to new blocks, and either the root of the tree changes (in the case of the B+Trees - copy-on-write, also called "persistent data structures", where 'persistent' is not about external storage; it is a different branch of computer science using the same word with a different meaning) or the visible length of the file changes (the append-only .dat files).
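The copy-on-write idea can be shown with a minimal persistent structure. This sketch uses a plain binary search tree rather than a B+Tree, and the names are illustrative: an insert copies only the path from root to leaf and returns a new root, while the old root still sees the old contents - which is exactly why TDB2 readers are undisturbed by writers.

```java
// Minimal sketch of the "persistent data structure" idea behind TDB2's
// copy-on-write B+Trees: nodes are immutable, an insert path-copies down
// to the leaf, and untouched subtrees are shared between versions.
final class CowTree {
    final int key;
    final CowTree left, right;

    CowTree(int key, CowTree left, CowTree right) {
        this.key = key; this.left = left; this.right = right;
    }

    static CowTree insert(CowTree root, int key) {
        if (root == null) return new CowTree(key, null, null);
        if (key < root.key)      // copy this node; share the untouched subtree
            return new CowTree(root.key, insert(root.left, key), root.right);
        else if (key > root.key)
            return new CowTree(root.key, root.left, insert(root.right, key));
        else
            return root;         // key already present: nothing to copy
    }

    static boolean contains(CowTree root, int key) {
        if (root == null) return false;
        if (key == root.key) return true;
        return contains(key < root.key ? root.left : root.right, key);
    }
}
```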

The only use of the journal is to transactionally manage small control data, such as the block id of the new tree root. A transaction's journal entry is smaller than a disk block.
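To see why the journal entry is so small: per transaction it only needs the new root block id for each index plus the new visible lengths of the append-only .dat files - a handful of longs. The encoding below is invented for illustration (it is not TDB2's journal format), but it makes the size argument concrete.

```java
import java.nio.ByteBuffer;

// Illustrative sketch (layout invented, not TDB2's actual journal format):
// a commit record is just a few block ids and file lengths, far smaller
// than one disk block.
class CommitRecord {
    static byte[] encode(long[] indexRoots, long[] datFileLengths) {
        ByteBuffer bb = ByteBuffer.allocate(
            8 * (indexRoots.length + datFileLengths.length) + 8);
        bb.putInt(indexRoots.length);
        for (long root : indexRoots) bb.putLong(root);
        bb.putInt(datFileLengths.length);
        for (long len : datFileLengths) bb.putLong(len);
        return bb.array();
    }
}
```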

Compared to TDB1, TDB2:

+ Writers write changes to the database as the writer proceeds.

Write efficiency: changes go directly to the database, so there is only one write, not two (once to the journal, once to the database), and they are write-buffered by the operating system, with all the usual efficiency the OS can provide in disk scheduling.

This improves bulk loading to the point where tdb2.tdbloader isn't doing low-level file manipulation but simply writing to the database. If low-level manipulation turns out to be an improvement, it can fit in there.

No variable-size heap cache: large inserts and deletes go to the live database and can be any size. There is no caching of the old-style journal that depends on the size of the changes. No more running out of heap with a large transaction.

+ Readers only read

A read transaction does not need to do anything about the delayed replay queue. Readers just read the database, never write.

Predictable read performance.

Of course, there is a downside.

The database grows faster and needs compaction from time to time.
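Why it grows, and what compaction buys back, can be shown with a toy append-only store. This is not TDB2's storage layout - all names are invented - but it captures the trade: every update appends a new record, superseded versions stay on "disk" until a compaction pass copies only the live data.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch (not TDB2's layout) of append-only growth and compaction:
// updates append; compaction rewrites only the latest version of each key.
class AppendOnlyStore {
    final List<String> file = new ArrayList<>();        // append-only "file"
    final Map<String, Integer> index = new HashMap<>(); // key -> latest offset

    void put(String key, String value) {
        file.add(key + "=" + value);       // never overwrite; always append
        index.put(key, file.size() - 1);
    }

    String get(String key) {
        Integer off = index.get(key);
        return off == null ? null : file.get(off).split("=", 2)[1];
    }

    void compact() {
        // Copy only the latest version of each key; drop superseded records.
        List<String> live = new ArrayList<>();
        Map<String, Integer> newIndex = new HashMap<>();
        for (Map.Entry<String, Integer> e : index.entrySet()) {
            newIndex.put(e.getKey(), live.size());
            live.add(file.get(e.getValue()));
        }
        file.clear();  file.addAll(live);
        index.clear(); index.putAll(newIndex);
    }

    int size() { return file.size(); }
}
```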

People will start asking why the database is so large. They already ask about TDB1, and TDB2 databases will be bigger.

Maintaining compact databases while the system runs has costs, depending on how it is done: it is slower, with some kind of incremental maintenance overhead (disk/SSD I/O); transaction performance is less predictable; locking schemes become (very) complicated, including system aborts when the DB detects a deadlock (and bugs, because it's complicated); and large writes impact concurrent readers much more.

TDB1 and TDB2 don't system-abort due to deadlock.

Other: the TDB2 transaction coordinator is general, not TDB2-specific, so it will be able to include text indexes in the future.

** TDB3

An experiment, not part of Jena. Currently it's working and not bad. Bulk loads are slower at the 100 million triple scale, but the promise is that large loads (in the billion triple range) are better. As an experiment, it may not be a good idea - and it will make slow progress. There are no releases and none planned.

TDB3 uses RocksDB -- http://rocksdb.org/.

That means using SSTables, not CoW B+Trees. At the moment, one single SSTable key space holds everything, because the storage data can be partitioned by key, so there is no need to have several RocksDB databases.
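The key-partitioning trick can be sketched with a sorted map standing in for the RocksDB key space (the names here are invented): each index gets a distinct key prefix, and a range scan over that prefix reads only that index, so several indexes share one database.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative sketch of partitioning one sorted key space by prefix, the
// way several indexes can share a single RocksDB database. TreeMap stands
// in for RocksDB; names are invented.
class PrefixedKeySpace {
    final TreeMap<String, String> store = new TreeMap<>();

    void put(String index, String key, String value) {
        store.put(index + "/" + key, value);  // the prefix partitions the space
    }

    SortedMap<String, String> scan(String index) {
        // All keys of one index: the range [prefix + "/", prefix + "0"),
        // since '0' is the character immediately after '/'.
        return store.subMap(index + "/", index + "0");
    }
}
```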

It still needs compaction. That's an innate feature of SSTable and LSM (Log-Structured Merge) systems.

It is also based on work (RocksDB PR#1298) by Adam Retter to expose the RocksDB transaction system to Java.

https://github.com/facebook/rocksdb/wiki/A-Tutorial-of-RocksDB-SST-formats

    Andy
