On 24/04/2024 21:42, Martynas Jusevičius wrote:
Andy,

Not directly related, but would a different storage backend address
issues like this?

It might sound a bit like the legacy SDB, but AFAIK oxigraph, Stardog
and another commercial triplestore use RocksDB for storage.
https://github.com/oxigraph/oxigraph
https://docs.stardog.com/operating-stardog/database-administration/storage-optimize

There is even a RocksDB backend for Jena:
https://github.com/zourzouvillys/triplerocks
And just now I found your own TDB3 repo: https://github.com/afs/TDB3

Can you shed some light on TDB3 and this approach in general?

TDB3 uses RocksDB as the storage layer, replacing the custom B+trees and also the node table. It's a naive use of RocksDB. It seems to work (it's functional), but it's untested both in code and in deployment.
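
For illustration only, here is a rough sketch of the general shape of that
approach (not TDB3's actual key layout): each quad goes in once per index
order as a key with an empty value, so RocksDB's sorted key space plays the
role of the B+tree indexes. It uses the RocksJava wrapper; names and paths
are made up.

    // Rough sketch only - not TDB3's real key layout.
    // Each quad is stored under one key per index order (here just SPOG),
    // with an empty value; RocksDB's sorted keys act as the index.
    import java.nio.charset.StandardCharsets;
    import org.rocksdb.Options;
    import org.rocksdb.RocksDB;
    import org.rocksdb.RocksDBException;
    import org.rocksdb.RocksIterator;

    public class QuadKeySketch {
        static { RocksDB.loadLibrary(); }

        public static void main(String[] args) throws RocksDBException {
            try (Options opts = new Options().setCreateIfMissing(true);
                 RocksDB db = RocksDB.open(opts, "/tmp/quad-sketch")) {

                // A real store would use fixed-width NodeIds from a node
                // table rather than strings; strings keep the sketch short.
                db.put(key("SPOG", ":s", ":p", ":o1", ":g"), new byte[0]);
                db.put(key("SPOG", ":s", ":p", ":o2", ":g"), new byte[0]);

                // Find all quads with subject :s by a prefix range scan.
                byte[] prefix = key("SPOG", ":s");
                try (RocksIterator it = db.newIterator()) {
                    for (it.seek(prefix);
                         it.isValid() && startsWith(it.key(), prefix);
                         it.next()) {
                        System.out.println(
                            new String(it.key(), StandardCharsets.UTF_8));
                    }
                }
            }
        }

        static byte[] key(String index, String... terms) {
            return (index + "|" + String.join("|", terms))
                    .getBytes(StandardCharsets.UTF_8);
        }

        static boolean startsWith(byte[] key, byte[] prefix) {
            if (key.length < prefix.length) return false;
            for (int i = 0; i < prefix.length; i++)
                if (key[i] != prefix[i]) return false;
            return true;
        }
    }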

It loads more slowly than the TDB2 bulk loaders (IIRC maybe 70K triples/s), but little work has been done to exploit RocksDB's capabilities.

The advantages of Rocks are that it is likely to be around for a long time (= it's a safe investment), it's transactional, it has compression [1] and compaction [2], and it has a Java wrapper (a separate project, but closely related and in contact with the Rocks team).
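
For example, through the Java wrapper the compression and compaction hooks
look roughly like this (the option values are illustrative, not what TDB3
uses):

    // Illustrative only: enabling block compression and requesting a
    // manual compaction through the RocksJava API.
    import org.rocksdb.CompressionType;
    import org.rocksdb.Options;
    import org.rocksdb.RocksDB;
    import org.rocksdb.RocksDBException;

    public class RocksFeaturesSketch {
        static { RocksDB.loadLibrary(); }

        public static void main(String[] args) throws RocksDBException {
            try (Options opts = new Options()
                    .setCreateIfMissing(true)
                    // [1] block-level compression of stored key/value data
                    .setCompressionType(CompressionType.LZ4_COMPRESSION);
                 RocksDB db = RocksDB.open(opts, "/tmp/rocks-sketch")) {

                // ... writes and deletes happen here ...

                // [2] merge SST files and drop deleted entries,
                // reclaiming space on disk.
                db.compactRange();
            }
        }
    }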

While there are many storage engines that claim to be faster than RocksDB, such claims often come with assumptions.

There are other storage layers to explore as well.

    Andy

[1] Better, or in addition, would probably be compression in the encoding of stored tuples.

[2] Compaction has two parts: finding the RDF terms that are currently in use in the database, and recovering space in the indexes. RocksDB compaction addresses the second case.



Martynas

On Wed, Apr 24, 2024 at 10:30 PM Andy Seaborne <a...@apache.org> wrote:

Hi Balduin,

Thanks for the detailed report. It's useful to hear of the use cases that
occur and also the behaviour of specific deployments.

On 22/04/2024 16:22, Balduin Landolt wrote:
Hello,

we're running Fuseki 5.0.0 (but previously the last 4.x versions behaved
essentially the same) with roughly 40 million triples (tendency: growing).
Not sure what configuration is relevant, but we have the default graph as
the union graph.

Sort of relevant.

There are more indexes on named graphs so there is more compaction work
to be done.

"union default graph" is a view at query time, not in the storage itself.

Also, we use Fuseki as our main database, not just as a "view on our data",
so we do quite a bit of updating on the data all the time.

Lately, we've been having more and more issues with servers running out of
disk space because Fuseki's database grew pretty rapidly.
This can be solved by compacting the DB, but with our data and hardware
this takes ca. 15 minutes, during which Fuseki does not accept any update
queries, so for the production system we can't really do this outside of
nighttime hours when (hopefully) no one uses the system anyways.

Is the database disk area on an SSD, on a hard disk, or a remote
filesystem (and then, is it SSD or hard disk)?

Some things we've noticed:
- A subset of our data (I think ~20 million triples) takes up 6 GB in
compacted state and is ca. 5 GB when dumped to a .trig file. But when
uploading the same .trig file to an empty DB, this grows to ca. 25 GB.
- Dropping graphs does not free up disk space

That's at the point the graph is dropped? It should reclaim space at
compaction.
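
For reference, a sketch of triggering compaction on an embedded TDB2
database from Java (with Fuseki, the equivalent is the admin operation
POST /$/compact/{name}; the database path below is made up):

    // Sketch: compacting an embedded TDB2 database.
    import org.apache.jena.query.Dataset;
    import org.apache.jena.tdb2.DatabaseMgr;
    import org.apache.jena.tdb2.TDB2Factory;

    public class CompactSketch {
        public static void main(String[] args) {
            Dataset ds = TDB2Factory.connectDataset("/data/tdb2-database");

            // Compaction writes the live data into a new "Data-NNNN"
            // generation; disk space comes back once the old generation
            // is deleted.
            DatabaseMgr.compact(ds.asDatasetGraph());
        }
    }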

- A sequence of e.g. 10k queries updating only a small number of triples
(maybe 1-10 or so) on the full dataset seems to grow the DB size a lot,
like 10s to 100s of GB (I don't have numbers on this one, but it was
substantial).

This might be a factor. There is a space overhead per transaction, not
solely due to the size of the update. It sounds like 10k updates is making
that overhead appreciable.

Are the updates all additions? Or a mix of additions and deletions?

My question is:

Would that kind of growth in disk usage be expected?

Given 10K updates, what you describe sounds possible.

> Are other people having similar issues?
> Are there strategies to mitigate this?
Batching the updates, although this does mean the updates don't
immediately appear in the database.

This can work reasonably well when the updates are additions. If there are
deletes, it's harder.
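
A sketch of what batching might look like from Java, sending many small
changes as one update request (endpoint and data are made up):

    // Sketch: collecting many small changes and sending them to Fuseki
    // as a single SPARQL Update request (one transaction server-side).
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.jena.rdfconnection.RDFConnection;

    public class BatchedUpdatesSketch {
        public static void main(String[] args) {
            List<String> pending = new ArrayList<>();
            for (int i = 0; i < 1000; i++) {
                pending.add("INSERT DATA { <http://example/s" + i + "> "
                            + "<http://example/p> " + i + " }");
            }

            try (RDFConnection conn =
                     RDFConnection.connect("http://localhost:3030/ds")) {
                // A SPARQL Update request may contain several operations
                // separated by ';' - the whole batch becomes one request.
                conn.update(String.join(" ;\n", pending));
            }
        }
    }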

Maybe some configuration that may be tweaked or so?

Sorry - there aren't any controls.


Best & thanks in advance,
Balduin


      Andy
