Does Jena have a way to compact TDB2 databases, maybe with some CLI
tool to run manually? Or do TDB2 databases just grow indefinitely?

----- Original Message -----
From: [email protected]
To: <[email protected]>
Cc:
Sent: Fri, 3 Nov 2017 11:47:01 +0000
Subject: TDB details - write transactions.

 This is a long message that explains the changes in TDB2 around the
 way write transactions work.

 TDB2 transactions are completely different to TDB1 transactions. The
 transaction coordinator is general purpose and works on a set of
 transaction components; each index is a separate component. In TDB1,
 the transaction manager works on the TDB1 database as a whole.

 ** TDB1

 In TDB1, a write transaction creates a number of changes to be made
 to the database. These are stored in the journal. They consist of
 replacement blocks (i.e. overwrite) and new blocks for the indexes.
 All later transactions (after the writer commits) use the in-memory
 cache of the journal and the main database.

 The node changes are written ahead to the node storage, which is
 append-only, so they don't need recording in the journal. They are
 inaccessible to earlier transactions because they are unreferenced by
 the node table indexes.

 The journal needs to be written to the main database. TDB1 is
 update-in-place. TDB1 is also lock-free. Writing to the main indexes
 requires that there are no other transactions using the database. If
 there are other active transactions, the work is not done but queued.

 This queue is checked whenever a transaction, read or write,
 finishes. If, at that point, the finishing transaction is the only
 one active, TDB1 writes the journal to the main database and clears
 the journal. That transaction can be a reader - the work of
 write-back is incurred by the reader.

 This is the delayed replay queue. ("Replay" because it is a
 write-ahead log system and writing back the journal is replaying the
 changes.) Write transaction changes are always delayed, to amortize
 the cost of write-back.

 There can be layers: a writer runs with earlier changes to the
 database still in the delayed replay queue, yet those changes may be
 in use by readers. A new layer is added for each new writer.

 Under load, the delayed replay queue grows. There isn't a moment to 
 write back the changes to the main database.

 There are a couple of mechanisms to catch this: if the queue is over
 a certain length, or the total size of the journal is over a
 threshold, TDB1 holds back transactions as they begin, waits for the
 current ones to finish, then writes back the queue.
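
 The TDB1 behaviour above can be sketched in a few lines of Java
 (hypothetical names and structure, not Jena's actual code): every
 finishing transaction, reader or writer, checks whether it was the
 last one active and, if so, replays the queued journal into the main
 database; begin() applies the hold-back when the queue passes a
 threshold.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of TDB1-style delayed journal write-back.
// Names and structure are illustrative, not Jena's actual code.
class DelayedReplay {
    private final Deque<String> replayQueue = new ArrayDeque<>(); // committed-but-unreplayed changes
    private int active = 0;                  // transactions currently running
    static final int QUEUE_LIMIT = 8;        // threshold that forces a write-back

    synchronized void begin() {
        // Under load: hold back new transactions until the current
        // ones finish, then write back before proceeding.
        while (replayQueue.size() >= QUEUE_LIMIT && active > 0) {
            try { wait(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        if (replayQueue.size() >= QUEUE_LIMIT) writeBack();
        active++;
    }

    synchronized void commitWrite(String change) {
        replayQueue.add(change);             // changes wait in the journal/queue
        finish();
    }

    synchronized void finishRead() { finish(); }

    private void finish() {
        active--;
        // Checked when ANY transaction (reader or writer) finishes:
        // if it was the last one active, replay the journal now.
        if (active == 0 && !replayQueue.isEmpty()) writeBack();
        notifyAll();
    }

    private void writeBack() {
        // In TDB1 this is where journal blocks are written into the
        // main database files; here we just drop the queued changes.
        replayQueue.clear();
    }

    synchronized int pending() { return replayQueue.size(); }
}
```

 Note that the write-back can be triggered by a finishing reader, which
 is why a reader can incur the write-back cost.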

 ** TDB2

 In TDB2, data structures are "append-only" in the sense that, once
 written and committed, they are never changed. New data is written to
 new blocks, and either the root of the tree changes (in the case of
 the B+Trees - copy-on-write, also called "persistent data
 structures", where 'persistent' is not related to external storage;
 it is a different branch of computer science using the same word with
 a different meaning) or the visible length of the file changes
 (append-only .dat files).

 The only use of the journal is to transactionally manage small
 control data, such as the block id of the new tree root. A
 transaction's journal record is less than a disk block.
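
 A toy copy-on-write tree illustrates the idea (illustrative code, not
 Jena's B+Tree implementation): an insert copies only the path from
 the root down, shares every untouched subtree with the old tree, and
 "commit" amounts to publishing the new root - the on-disk analogue of
 writing one root block id to the journal.

```java
// Illustrative copy-on-write structure, in the spirit of the TDB2
// design described above (not Jena's actual code).
final class CowTree {
    // Immutable node: once written, never changed (append-only blocks).
    static final class Node {
        final int key; final Node left, right;
        Node(int key, Node left, Node right) { this.key = key; this.left = left; this.right = right; }
    }

    // Insert copies only the path from the root to the new leaf;
    // untouched subtrees are shared with the old tree.
    static Node insert(Node root, int key) {
        if (root == null) return new Node(key, null, null);
        if (key < root.key) return new Node(root.key, insert(root.left, key), root.right);
        if (key > root.key) return new Node(root.key, root.left, insert(root.right, key));
        return root; // key already present
    }

    static boolean contains(Node root, int key) {
        while (root != null) {
            if (key == root.key) return true;
            root = key < root.key ? root.left : root.right;
        }
        return false;
    }
}
```

 A reader holding the old root continues to see the old tree,
 unchanged; the only state that has to change transactionally is the
 tiny "which root is current" record.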

 Compared to TDB1, TDB2:

 + Writers write their changes to the database as they proceed.

 Write efficiency: changes go directly to the database, so there is
 only one write, not two (once to the journal, once to the database),
 and they get write-buffered by the operating system with all the
 usual efficiency the OS can provide in disk scheduling.

 This improves bulk loading to the point where tdb2.tdbloader isn't
 doing low-level file manipulation; it is a simple write to the
 database. If low-level manipulation is an improvement, it can fit in
 there.

 No variable-size heap cache: large inserts and deletes go to the live
 database and can be any size. There is no caching of the old-style
 journal that depends on the size of the changes. No more running out
 of heap with a large transaction.

 + Readers only read

 A read transaction does not need to do anything about the delayed
 replay queue. Readers just read the database, never write.

 Predictable read performance.

 Of course, there is a downside.

 The database grows faster and needs compaction.

 People will start asking why the database is so large. They already
 ask about TDB1, and TDB2 databases will be bigger.

 Maintaining compact databases while the system runs has costs,
 depending on how it is done: it is slower, with some kind of
 incremental maintenance overhead (disk/SSD I/O); transaction
 performance is less predictable; there are (very) complicated locking
 schemes, including system aborts when the DB detects a deadlock (and
 bugs, because it's complicated); large writes impact concurrent
 readers much more.

 TDB1 and TDB2 don't system-abort due to deadlock.

 Other: the TDB2 transaction coordinator is general, not TDB2
 specific, so it will be able to include text indexes in the future.
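
 As a sketch of what a general-purpose coordinator looks like
 (hypothetical interfaces, in the spirit of the design described here,
 not Jena's API): components - each index, and potentially a text
 index - register with the coordinator, which drives them all through
 a two-phase commit.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a general-purpose transaction coordinator over pluggable
// components. Interface and class names are illustrative only.
interface TxnComponent {
    void begin();
    void commitPrepare();  // make changes durable, ready to commit
    void commitFinish();   // make changes visible
    void abort();
}

final class TxnCoordinator {
    private final List<TxnComponent> components = new ArrayList<>();

    void register(TxnComponent c) { components.add(c); }

    void begin() { for (TxnComponent c : components) c.begin(); }

    // Two-phase commit across all registered components:
    // all prepare, then all finish.
    void commit() {
        for (TxnComponent c : components) c.commitPrepare();
        for (TxnComponent c : components) c.commitFinish();
    }

    void abort() { for (TxnComponent c : components) c.abort(); }
}

// Minimal component used for illustration: records lifecycle events.
final class LogComponent implements TxnComponent {
    final List<String> events = new ArrayList<>();
    public void begin()         { events.add("begin"); }
    public void commitPrepare() { events.add("prepare"); }
    public void commitFinish()  { events.add("finish"); }
    public void abort()         { events.add("abort"); }
}
```

 The point is that the coordinator knows nothing about indexes or node
 tables; anything implementing the component lifecycle can join the
 transaction.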

 ** TDB3

 An experiment, not part of Jena. Currently, it's working and not bad.

 Bulk loads are slower at 100m but the promise is that large loads
 (billion triple range) are better. As an experiment, it may not be a
 good idea - and will make slow progress. There are no releases and
 none planned.

 TDB3 uses RocksDB -- http://rocksdb.org/.

 That means using SSTables, not CoW B+Trees. At the moment, there is
 one single SSTable for everything: because the storage data can be
 partitioned, there is no need to have several RocksDB databases.

 Still needs compaction. That's an innate feature of SSTable and LSM
 (Log Structured Merge) systems.
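
 A toy merge shows why compaction is innate to LSM designs
 (illustrative Java, not RocksDB code): newer runs shadow older
 entries and deletions are tombstone markers, so shadowed and deleted
 data keeps occupying space until the runs are merged.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy illustration of LSM/SSTable compaction: merge sorted runs,
// let the newest value for a key win, and drop tombstones.
// Not RocksDB code; TOMBSTONE is an invented marker.
final class LsmCompact {
    static final String TOMBSTONE = "\u0000DEL"; // marker for a deleted key

    // Runs are passed oldest first, so later puts overwrite earlier ones.
    @SafeVarargs
    static TreeMap<String, String> compact(Map<String, String>... runsOldestFirst) {
        TreeMap<String, String> merged = new TreeMap<>();
        for (Map<String, String> run : runsOldestFirst) merged.putAll(run);
        merged.values().removeIf(v -> v.equals(TOMBSTONE)); // reclaim deletions
        return merged;
    }
}
```

 Until such a merge runs, both the old value of an overwritten key and
 the tombstone for a deleted key remain on disk - hence compaction.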

 It is also based on work (RocksDB PR#1298) by Adam Retter to expose
 the RocksDB transaction system to Java.

 https://github.com/facebook/rocksdb/wiki/A-Tutorial-of-RocksDB-SST-formats

 Andy
