Hi Vinay,
On 27/01/2022 06:14, Vinay Mahamuni wrote:
Hello,
I am using Apache Jena v4.3.2 + Fuseki + TDB2 persistent disk storage. I
am using Jena RDFConnection to connect to the Fuseki server. I am
sending 50k triples in one update. This is mostly new data (only a few
triples will match existing data). These data are instances based
on an ontology. Please have a look at the attached file showing how
much disk space increases with each update. For 1.5 million triples, it
took around 1.2GB. We want to store around a few billion triples,
so the bytes/triple ratio won't be good for our use case.
When I used the tdb2.tdbcompact tool, the data volume shrank to 400MB.
But this extra step needs to be performed manually to optimise the storage.
It can be triggered by an admin process with e.g. "cron".
It doesn't have to be done very often unless your volume of 50k triple
transactions is very high - in which case I suggest batching them into
larger units.
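For example (an untested sketch, assuming the full Fuseki server with the
administration API enabled, a dataset published at /ds and the default
port 3030), a small Java program like this could be run from cron to
request a compaction over HTTP:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CompactJob {
    public static void main(String[] args) throws Exception {
        // POST to the Fuseki administration endpoint for TDB2 compaction.
        // "/ds" and port 3030 are examples - adjust for your deployment.
        HttpClient http = HttpClient.newHttpClient();
        HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:3030/$/compact/ds"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<String> resp = http.send(req, HttpResponse.BodyHandlers.ofString());
        // The compaction runs as a server-side task; progress shows up in the server log.
        System.out.println("Compact request status: " + resp.statusCode());
    }
}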
My questions are as follows:
1. Why do 30 update queries of 50k triples each take 3 times more disk
space than a single update query of 1500k triples? The data stored is
the same, but the space consumed is more in the first case.
TDB2 uses an MVCC/copy-on-write scheme for transaction isolation. It
gives a very strong isolation guarantee (serializable).
That means there is a per-transaction overhead here, which is recovered
by compact. In fact, the space can't be recovered at the time because the
old data may still be in use by read-transactions seeing the pre-write state.
Compact is similar (not identical) to PostgreSQL VACUUM.
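A minimal embedded sketch of that behaviour (the directory name and the
example triple are placeholders; Fuseki uses the same machinery
internally): a reader that started before a write keeps seeing the
pre-write state, which is why the old blocks can't be reclaimed immediately.

import org.apache.jena.query.Dataset;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.system.Txn;
import org.apache.jena.tdb2.TDB2Factory;

public class MvccDemo {
    public static void main(String[] args) throws InterruptedException {
        Dataset ds = TDB2Factory.connectDataset("DB2");    // example directory

        // A long-running reader: it keeps seeing the state as of its start.
        Thread reader = new Thread(() -> Txn.executeRead(ds, () -> {
            long before = ds.getDefaultModel().size();
            sleep(2000);                                   // hold the read transaction open
            long after = ds.getDefaultModel().size();
            System.out.println("Reader sees " + before + " then " + after + " (unchanged)");
        }));
        reader.start();

        sleep(500);
        // A concurrent write commits new data, but the pre-write blocks must
        // stay on disk until no read transaction needs them any more.
        Txn.executeWrite(ds, () -> {
            Model m = ds.getDefaultModel();
            m.add(m.createResource("http://example/s"),
                  m.createProperty("http://example/p"),
                  "o");
        });
        reader.join();
    }

    private static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}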
Note that all additional space is recovered by "compact". The active
directory is the highest number "Data-NNNN". You can delete the earlier
ones once the "compact" has finished as logged in the server log. Or zip
them and keep them as backups - Fuseki has released them and does not
touch them. Caution: on MS Windows, due to a long-standing (10+ year)
Java JDK issue, the server has to be stopped and restarted to properly
release the old files.
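If you want to automate the clean-up, something along these lines would
identify the old storage areas (a rough sketch: the database path is an
example, and it only reports candidates rather than deleting anything -
deleting or zipping them is left to your admin process):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class OldDataAreas {
    public static void main(String[] args) throws IOException {
        Path dbDir = Paths.get("run/databases/DB");        // example database location
        try (Stream<Path> entries = Files.list(dbDir)) {
            List<Path> dataDirs = entries
                .filter(p -> p.getFileName().toString().matches("Data-\\d+"))
                .sorted(Comparator.comparing((Path p) -> p.getFileName().toString()))
                .collect(Collectors.toList());
            if (dataDirs.isEmpty())
                return;
            // Highest-numbered Data-NNNN is the active storage; the rest are old.
            Path active = dataDirs.get(dataDirs.size() - 1);
            System.out.println("Active storage: " + active);
            dataDirs.subList(0, dataDirs.size() - 1)
                    .forEach(p -> System.out.println("Old, removable: " + p));
        }
    }
}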
It doesn't matter whether it was one large write transaction or 100
write transactions: the compacted database will be the same size. The
database will have grown more for 100 writes than for 1, but
correspondingly more space is recovered, and the new data storage is the
same size once you delete the now-unused storage areas.
2. Is there any other way to solve this memory problem?
Schedule "compact", delete the old data storage.
If the updates are just a stream of additions, with no reads of the
database in between, write them to one big file (N-Triples or Turtle:
simply concatenate everything into a single file) and load that in one step.
You can also consider, instead of loading into Fuseki, using the bulk
loader tdb2.tdbloader to build the database offline, then putting it in
place and starting Fuseki. The bulk loader is significantly faster once
sizes get into the hundreds of millions of triples.
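A sketch of the big-file pattern (the endpoint URL, file name and triple
source are placeholders for your own pipeline): stream the pending
triples into one N-Triples file, then do a single load - or feed the same
file to tdb2.tdbloader offline.

import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.List;

import org.apache.jena.graph.NodeFactory;
import org.apache.jena.graph.Triple;
import org.apache.jena.rdfconnection.RDFConnection;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.system.StreamRDF;
import org.apache.jena.riot.system.StreamRDFWriter;

public class BatchedLoad {
    public static void main(String[] args) throws Exception {
        // 1. Stream all pending triples into one N-Triples file.
        try (OutputStream out = new FileOutputStream("pending.nt")) {
            StreamRDF writer = StreamRDFWriter.getWriterStream(out, Lang.NTRIPLES);
            writer.start();
            for (Triple t : pendingTriples())              // placeholder for your data source
                writer.triple(t);
            writer.finish();
        }
        // 2. One load to the server - or build the database offline with
        //    tdb2.tdbloader from the same file and put it in place instead.
        try (RDFConnection conn = RDFConnection.connect("http://localhost:3030/ds")) {
            conn.load("pending.nt");
        }
    }

    // Placeholder for wherever the 50k-triple batches currently come from.
    private static List<Triple> pendingTriples() {
        return List.of(Triple.create(
                NodeFactory.createURI("http://example/s"),
                NodeFactory.createURI("http://example/p"),
                NodeFactory.createLiteral("o")));
    }
}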
3. What existing strategies can be used to optimise storage while
writing data?
4. Is there any new development going on to use less storage for
write/update queries?
Just plans that need resources!
It would be nice to have server-side transactions spanning several
updates (which is beyond what the SPARQL protocol can do).
--
I've tried TDB with other storage systems (e.g. RocksDB) but the ability
to directly write the on-disk format is useful - it makes the bulk
loader work.
--
There are other issues as well in your use case.
It also depends on the data. If many triples have unique literals/URIs,
the node table is proportionately large.
Andy
Thanks,
Vinay Mahamuni