Dear Apache Jena Community,

I am working as a software developer at the Berlin State Library. I was recently assigned a task to evaluate several open-source triple stores, and I have included Apache Jena in this set.
To get real experience with Apache Jena, I used the British National Bibliography (BNB) Books LOD as a starting point: around 25 GB of N-Triples data spread over 50 files, roughly 180 million triples in total. I loaded this dataset with Apache Jena Fuseki (deployed with Tomcat) into a persistent TDB2 database. Generally speaking, the operation went smoothly.

The only problem I had, and this is the core of my question, concerns the disk space the database consumes. I found the database on my hard disk measuring 195 GB for 25 GB of raw data. This caused problems while I was loading the dataset (I repeated the operation several times before becoming aware of it), and eventually I used a network partition instead of the space on my physical machine.

More specifically, I am wondering whether the underlying indexing strategy can be controlled in some respect. The Data-001 folder of my database contains:

OSP.dat -> 69 GB
POS.dat -> 50 GB
SPO.dat -> 26 GB
Nodes.dat -> 43 GB

I am wondering whether both OSP.dat and POS.dat could be avoided when a dataset is loaded. Is there a way to load the data with more control over the system, as we have, for example, in the RDF4J triple store?

Many thanks in advance!

Best,
Rodolfo Marraffa
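P.S. In case it helps to reproduce the situation outside Fuseki, a plain tdb2.tdbloader invocation along these lines should build an equivalent TDB2 database on disk (the target directory and file names are illustrative, not my actual paths; in my runs the data went in through Fuseki):

```shell
# Bulk-load all N-Triples files into a fresh TDB2 database directory.
# --loc is the target database location; bnb-*.nt stands in for the
# 50 actual BNB dump files.
tdb2.tdbloader --loc /data/tdb2-bnb bnb-*.nt
```

The resulting Data-0001 directory is where I then see the SPO/POS/OSP index files and Nodes.dat growing far beyond the raw input size.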
