Andy,

Not directly related, but would a different storage backend address
issues like this?

It might sound a bit like the legacy SDB, but AFAIK Oxigraph, Stardog,
and another commercial triplestore use RocksDB for storage.
https://github.com/oxigraph/oxigraph
https://docs.stardog.com/operating-stardog/database-administration/storage-optimize

There is even a RocksDB backend for Jena:
https://github.com/zourzouvillys/triplerocks
And just now I found your own TDB3 repo: https://github.com/afs/TDB3

Can you shed some light on TDB3 and this approach in general?

Martynas

On Wed, Apr 24, 2024 at 10:30 PM Andy Seaborne <a...@apache.org> wrote:
>
> Hi Balduin,
>
> Thanks for the detailed report. It's useful to hear of the use cases that
> occur and also the behaviour of specific deployments.
>
> On 22/04/2024 16:22, Balduin Landolt wrote:
> > Hello,
> >
> > we're running Fuseki 5.0.0 (but previously the last 4.x versions behaved
> > essentially the same) with roughly 40 million triples (tendency: growing).
> > Not sure what configuration is relevant, but we have the default graph as
> > the union graph.
>
> Sort of relevant.
>
> There are more indexes on named graphs, so there is more compaction work
> to be done.
>
> "union default graph" is a view at query time, not in the storage itself.
>
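To illustrate that point with the embedded TDB2 API: the union view is
a context flag, so enabling it changes query evaluation only and writes
nothing to storage. A minimal sketch (the database path is made up; in
a Fuseki assembler config the equivalent is the tdb2:unionDefaultGraph
property on the dataset):

    import org.apache.jena.query.Dataset;
    import org.apache.jena.tdb2.TDB2;
    import org.apache.jena.tdb2.TDB2Factory;

    public class UnionDefaultGraphExample {
        public static void main(String[] args) {
            // Connect to an existing TDB2 database directory (hypothetical path).
            Dataset ds = TDB2Factory.connectDataset("/data/DB");
            // Queries over the default graph now see the union of all named
            // graphs; evaluated at query time, nothing on disk is duplicated.
            ds.getContext().set(TDB2.symUnionDefaultGraph, true);
        }
    }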
> > Also, we use Fuseki as our main database, not just as a "view on our data"
> > so we do quite a bit of updating on the data all the time.
> >
> > Lately, we've been having more and more issues with servers running out of
> > disk space because Fuseki's database grew pretty rapidly.
> > This can be solved by compacting the DB, but with our data and hardware
> > this takes ca. 15 minutes, during which Fuseki does not accept any update
> > queries, so for the production system we can't really do this outside of
> > nighttime hours when (hopefully) no one uses the system anyways.
>
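For scripting that nightly window, a sketch using Fuseki's admin API
(the dataset name "ds" and the URL are assumptions): compaction can be
triggered over HTTP, and recent releases accept deleteOld=true so the
old storage generation is removed once compaction finishes.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class CompactViaAdminApi {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            // POST /$/compact/{dataset}; deleteOld=true deletes the previous
            // storage generation after compaction, freeing the disk space.
            HttpRequest req = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:3030/$/compact/ds?deleteOld=true"))
                    .POST(HttpRequest.BodyPublishers.noBody())
                    .build();
            // Fuseki runs compaction as a background task; the response
            // reports the task, not its completion.
            HttpResponse<String> resp = client.send(req, HttpResponse.BodyHandlers.ofString());
            System.out.println(resp.statusCode() + " " + resp.body());
        }
    }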
> Is the database disk area on an SSD, on a hard disk, or a remote
> filesystem (and then, is it SSD or hard disk)?
>
> > Some things we've noticed:
> > - A subset of our data (I think ~20 million triples) takes up 6 GB in
> > compacted state, and dumped to a .trig file it is ca. 5 GB. But when the
> > same .trig file is uploaded to an empty DB, it grows to ca. 25 GB
> > - Dropping graphs does not free up disk space
>
> That's at the point the graph is dropped? It should reclaim space at
> compaction.
>
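If it's useful, that reclamation can also be forced from code when TDB2
is embedded rather than behind Fuseki; a minimal sketch (the directory
path is made up):

    import org.apache.jena.sparql.core.DatasetGraph;
    import org.apache.jena.tdb2.DatabaseMgr;

    public class CompactEmbedded {
        public static void main(String[] args) {
            // Connect to an existing TDB2 database directory (hypothetical path).
            DatasetGraph dsg = DatabaseMgr.connectDatasetGraph("/data/DB");
            // Compact into a new storage generation; "true" also deletes the
            // old generation, which is when disk space is actually reclaimed.
            DatabaseMgr.compact(dsg, true);
        }
    }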
> > - A sequence of e.g. 10k queries updating only a small number of triples
> > (maybe 1-10 or so) on the full dataset seems to grow the DB size a lot,
> > like 10s to 100s of GB (I don't have numbers on this one, but it was
> > substantial).
>
> This might be a factor. There is a space overhead per transaction, not
> solely due to the size of the update. It sounds like 10k updates is
> making that appreciable.
>
> Are the updates all additions? Or a mix of additions and deletions?
>
> > My question is:
>
> > Would that kind of growth in disk usage be expected?
>
> Given 10k updates, what you describe sounds possible.
>
> > Are other people having similar issues?
> > Are there strategies to mitigate this?
>
> Batching the updates, although this does mean the updates don't
> immediately appear in the database.
>
> This can work reasonably well when the updates are additions. If there are
> deletes, it's harder.
>
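A sketch of that batching with RDFConnection (the endpoint URL and the
toy operations are assumptions): many operations placed in a single
UpdateRequest go to Fuseki as one request and commit as one
transaction, so the per-transaction space overhead is paid once
instead of thousands of times.

    import java.util.List;

    import org.apache.jena.rdfconnection.RDFConnection;
    import org.apache.jena.update.UpdateFactory;
    import org.apache.jena.update.UpdateRequest;

    public class BatchedUpdates {
        public static void main(String[] args) {
            // Toy stand-ins for the many small updates.
            List<String> pending = List.of(
                    "INSERT DATA { <urn:ex:s1> <urn:ex:p> <urn:ex:o1> }",
                    "INSERT DATA { <urn:ex:s2> <urn:ex:p> <urn:ex:o2> }");

            try (RDFConnection conn = RDFConnection.connect("http://localhost:3030/ds")) {
                UpdateRequest batch = UpdateFactory.create(); // empty request
                pending.forEach(batch::add); // parse and append each operation
                conn.update(batch);          // one HTTP request, one transaction
            }
        }
    }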
> > Maybe some configuration that may be tweaked or so?
>
> Sorry - there aren't any controls.
>
> >
> > Best & thanks in advance,
> > Balduin
> >
>
>      Andy
