Dear Apache Jena Community,

I am working as a software developer at the Berlin State Library. I was recently assigned a task to evaluate several open-source triple stores, and I have included Apache Jena in this set.
To get real experience with Apache Jena, I used the British National Bibliography (BNB) Books LOD as a starting point: around 25 GB of N-Triples data spread over 50 files, roughly 180 million triples in total. I loaded this dataset with Apache Jena Fuseki (deployed with Tomcat) into a persistent TDB2 database. Generally speaking, the operation went smoothly.

The only problem I had, and this is the core of my question, concerns the disk space the database consumes. I found the database on my hard disk measuring 195 GB for 25 GB of raw data. This caused problems while I was loading the dataset (I repeated the operation several times before becoming aware of it), and eventually I used a network partition instead of the space on my physical machine.

More specifically, I am wondering whether the underlying indexing strategy can be controlled in some respect. The Data-001 folder of my database contains:

OSP.dat -> 69 GB
POS.dat -> 50 GB
SPO.dat -> 26 GB
Nodes.dat -> 43 GB

I am wondering whether both OSP.dat and POS.dat could be avoided when a dataset is loaded. Is there a way to load the data with more control over the system, as we have, for example, in the RDF4J triple store?

Many thanks in advance!

Best,
Rodolfo Marraffa
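P.S. In case it helps to reproduce the situation outside Fuseki, a plain tdb2.tdbloader invocation along these lines should build an equivalent TDB2 database on disk (the target directory and file names are illustrative, not my actual paths; in my runs the data went in through Fuseki):

```shell
# Bulk-load all N-Triples files into a fresh TDB2 database directory.
# --loc is the target database location; bnb-*.nt stands in for the
# 50 actual BNB dump files.
tdb2.tdbloader --loc /data/tdb2-bnb bnb-*.nt
```

The resulting Data-0001 directory is where I then see the SPO/POS/OSP index files and Nodes.dat growing far beyond the raw input size.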
