Andy,

Not directly related, but would a different storage backend address
issues like this?

It might sound a bit like the legacy SDB, but AFAIK Oxigraph, Stardog,
and another commercial triplestore use RocksDB for storage:
https://github.com/oxigraph/oxigraph
https://docs.stardog.com/operating-stardog/database-administration/storage-optimize

There is even a RocksDB backend for Jena:
https://github.com/zourzouvillys/triplerocks

And just now I found your own TDB3 repo:
https://github.com/afs/TDB3

Can you shed some light on TDB3 and this approach in general?
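To make the idea concrete, here is a minimal sketch against the plain
RocksDB Java API (rocksdbjni). The "SPO" key layout is purely
illustrative; I'm not claiming this is how TDB3 or triplerocks actually
encode triples:

    import org.rocksdb.Options;
    import org.rocksdb.RocksDB;
    import org.rocksdb.RocksDBException;

    import java.nio.charset.StandardCharsets;

    public class RocksTripleSketch {
        public static void main(String[] args) throws RocksDBException {
            RocksDB.loadLibrary();
            try (Options options = new Options().setCreateIfMissing(true);
                 RocksDB db = RocksDB.open(options, "/tmp/triples-demo")) {
                // Hypothetical SPO index entry: the key encodes the whole
                // triple, the value is empty, so lookups are prefix scans.
                byte[] key = "SPO|http://example/s|http://example/p|\"o\""
                        .getBytes(StandardCharsets.UTF_8);
                db.put(key, new byte[0]);
                // Deletes write tombstones; RocksDB's background compaction
                // reclaims the space incrementally, with no stop-the-world
                // copy of the whole database.
                db.delete(key);
            }
        }
    }

The attraction, as I understand it, is exactly that incremental
background compaction, as opposed to an offline copy of the whole
database.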
Martynas

On Wed, Apr 24, 2024 at 10:30 PM Andy Seaborne <a...@apache.org> wrote:
>
> Hi Balduin,
>
> Thanks for the detailed report. It's useful to hear of the use cases
> that occur and also the behaviour of specific deployments.
>
> On 22/04/2024 16:22, Balduin Landolt wrote:
> > Hello,
> >
> > we're running Fuseki 5.0.0 (but previously the last 4.x versions
> > behaved essentially the same) with roughly 40 million triples
> > (tendency: growing).
> > Not sure what configuration is relevant, but we have the default
> > graph as the union graph.
>
> Sort of relevant.
>
> There are more indexes on named graphs, so there is more compaction
> work to be done.
>
> "Union default graph" is a view at query time, not in the storage
> itself.
>
> > Also, we use Fuseki as our main database, not just as a "view on
> > our data", so we do quite a bit of updating on the data all the
> > time.
> >
> > Lately, we've been having more and more issues with servers running
> > out of disk space because Fuseki's database grew pretty rapidly.
> > This can be solved by compacting the DB, but with our data and
> > hardware this takes ca. 15 minutes, during which Fuseki does not
> > accept any update queries, so for the production system we can't
> > really do this outside of nighttime hours when (hopefully) no one
> > uses the system anyway.
>
> Is the database disk area on an SSD, on a hard disk, or a remote
> filesystem (and then, is it SSD or hard disk)?
>
> > Some things we've noticed:
> > - A subset of our data (I think ~20 million triples) takes up 6GB
> > in compacted state and is ca. 5GB when dumped to a .trig file. But
> > when uploading the same .trig file to an empty DB, this grows to
> > ca. 25GB.
> > - Dropping graphs does not free up disk space.
>
> That's at the point the graph is dropped? It should reclaim space at
> compaction.
>
> > - A sequence of e.g. 10k queries updating only a small number of
> > triples (maybe 1-10 or so) on the full dataset seems to grow the DB
> > size a lot, like 10s to 100s of GB (I don't have numbers on this
> > one, but it was substantial).
>
> This might be a factor. There is a space overhead per transaction,
> not solely due to the size of the update. It sounds like 10k updates
> makes that appreciable.
>
> Are the updates all additions? Or a mix of additions and deletions?
>
> > My question is:
> > Would that kind of growth in disk usage be expected?
>
> Given 10k updates, what you describe sounds possible.
>
> > Are other people having similar issues?
> > Are there strategies to mitigate this?
>
> Batching the updates, although this does mean the updates don't
> immediately appear in the database.
>
> This can work reasonably well when the updates are additions. If
> there are deletes, it's harder.
>
> > Maybe some configuration that may be tweaked or so?
>
> Sorry - there aren't any controls.
>
> > Best & thanks in advance,
> > Balduin
>
>      Andy
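For reference, the compaction Andy mentions can also be driven
programmatically. A minimal sketch using TDB2's DatabaseMgr; the
database path here is an assumption:

    import org.apache.jena.sparql.core.DatasetGraph;
    import org.apache.jena.tdb2.DatabaseMgr;

    public class CompactDemo {
        public static void main(String[] args) {
            // Connect to an on-disk TDB2 database directory.
            DatasetGraph dsg = DatabaseMgr.connectDatasetGraph("/data/tdb2");
            // Compact; "true" deletes the old storage generation (the
            // "Data-NNNN" directory) afterwards, which is what actually
            // returns the disk space to the OS.
            DatabaseMgr.compact(dsg, true);
        }
    }

Under Fuseki, the same operation is exposed through the admin
protocol, e.g. POST /$/compact/ds?deleteOld=true.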
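And a minimal sketch of the batching Andy suggests, using Jena's
RDFConnection against a hypothetical endpoint; accumulating many small
changes into one INSERT DATA means one server-side transaction instead
of thousands:

    import org.apache.jena.rdfconnection.RDFConnection;

    public class BatchedUpdates {
        public static void main(String[] args) {
            // The endpoint URL is illustrative.
            try (RDFConnection conn =
                     RDFConnection.connect("http://localhost:3030/ds")) {
                StringBuilder batch = new StringBuilder("INSERT DATA {\n");
                for (int i = 0; i < 1000; i++) {
                    batch.append("<http://example/s").append(i)
                         .append("> <http://example/p> ")
                         .append(i).append(" .\n");
                }
                batch.append("}");
                // One update request = one transaction, so the per-commit
                // space overhead is paid once rather than 1000 times.
                conn.update(batch.toString());
            }
        }
    }

As Andy notes, this is straightforward for additions; mixing in
deletes makes the batching logic harder.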