On 17/12/2018 09:11, Jovanovska Sashka wrote:
Dear all,
We are a group of developers from Macedonia who are currently working
with Apache Jena. We are trying to use Jena in our project; after a lot
of research we decided that this may be the best approach for us. The
requirement for choosing a suitable database was to store data about
objects with their classes and attributes. The database needs to offer
the highest (fastest) performance for RDF data, so that objects can be
retrieved in the shortest time possible. Our analysis showed that Jena
TDB is the right approach for storing an RDF data model: as a triple
store it is well suited to RDF structures (subject, predicate, object).
We concluded that Jena TDB is capable of persisting all objects
according to the profile-defined classes.
1. We have three environments. Their configurations are:
1.1 CPU: 4 cores @ 2.2 GHz
1.2 RAM: 16 GB
1.3 Disk: 100 GB
2.1 CPU: 16 cores @ 2.0 GHz
2.2 RAM: 96 GB
2.3 Disk: 500 GB
3.1 CPU: 4 cores @ 3.2 GHz
3.2 RAM: 16 GB
3.3 Disk: 320 GB SSD
SSD is better, especially when adding data in many separate transactions.
Which setup was used to produce the figures?
We are using Jena version 3.6.
The current Jena release is 3.9.0.
Are you using Fuseki or running TDB in your application?
2. We are working on a system that will contain data for objects with a
unique object ID, the mRID (Master Resource ID). The main goal of the
system is the exchange of those objects between multiple systems.
Importing a model from a file is done with model.write, which creates a
dataset. We also create single objects with INSERT. Attached are
examples of our model, together with its namespaces.
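A minimal sketch of reading model data into Jena (the namespace and data are hypothetical; for a file on disk you would use RDFDataMgr.loadModel("model.rdf")):

```java
import java.io.StringReader;

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFDataMgr;

public class LoadExample {
    public static void main(String[] args) {
        // Hypothetical CIM-style object carrying an mRID attribute.
        String turtle = "@prefix ex: <http://example.org/cim#> . "
                      + "ex:obj1 ex:mRID \"1234\" .";
        Model model = ModelFactory.createDefaultModel();
        // Parse the Turtle data into the model; for a file on disk use
        // RDFDataMgr.loadModel("model.rdf") instead.
        RDFDataMgr.read(model, new StringReader(turtle), null, Lang.TURTLE);
        System.out.println(model.size());   // one triple
    }
}
```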
3. Our objectives in terms of speed are for create and get to take less
than 30 ms, and to fully load the database with 100M objects (~500M
triples). We also need to export the database with the correct tag
names from the objects' namespaces (currently we get all of them as
rdf:Description tags).
rdf:Description is not RDF data - it is part of the RDF/XML format. You
can write out the database in RDF/XML (use RDFFormat.RDFXML_PLAIN, not
the default "pretty" format) if you want, but other formats are more
efficient and more readable, and carry the same RDF data.
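For example (a sketch; the namespace and data are hypothetical), selecting the plain RDF/XML writer, and Turtle as an alternative:

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.riot.RDFFormat;

public class ExportExample {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        String ns = "http://example.org/cim#";   // hypothetical namespace
        model.setNsPrefix("ex", ns);
        model.createResource(ns + "obj1")
             .addProperty(model.createProperty(ns, "mRID"), "1234");

        // Plain RDF/XML writer: faster than the default "pretty" writer.
        RDFDataMgr.write(System.out, model, RDFFormat.RDFXML_PLAIN);

        // Turtle: more compact, more readable, and the same RDF data.
        RDFDataMgr.write(System.out, model, RDFFormat.TURTLE);
    }
}
```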
4. Currently we are seeing 200 ms or more to create one object, and we
think we have reached the load limit of the database, which currently
holds 80 million objects.
An object is on average 5 triples.
How are you using dataset transactions in your application? (Are you
using them at all?)
It is better to add objects in batches - many objects in a single
transaction. At the end of a transaction there is significant overhead,
such as safely writing the journal to disk and then, at some point,
updating the main database. The default (TDB run inside the
application) buffers about 10 transactions, but adding in units of 50
triples will still carry high overhead, especially on a rotating disk.
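A sketch of batched insertion into an embedded TDB1 dataset (the database location, namespace, and batch size are hypothetical; tune the batch size for your workload):

```java
import org.apache.jena.query.Dataset;
import org.apache.jena.query.ReadWrite;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.tdb.TDBFactory;

public class BatchInsert {
    static final int BATCH_SIZE = 10_000;   // hypothetical; tune for your workload

    public static void main(String[] args) {
        Dataset dataset = TDBFactory.createDataset("DB");  // on-disk TDB1 location
        String ns = "http://example.org/cim#";             // hypothetical namespace
        dataset.begin(ReadWrite.WRITE);
        try {
            Model m = dataset.getDefaultModel();
            // Many objects in one transaction: the journal is flushed to
            // disk once per commit, not once per object.
            for (int i = 0; i < BATCH_SIZE; i++) {
                m.createResource(ns + "obj" + i)
                 .addProperty(m.createProperty(ns, "mRID"), "id-" + i);
            }
            dataset.commit();
        } finally {
            dataset.end();
        }
    }
}
```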
The bulk loader is faster. For TDB1 it only works on an empty database;
for TDB2 it can bulk-update an existing database.
One approach is to write all the triples to a file and, for the
majority of your data, load that file once with the bulk loader.
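From the command line that looks like this (a sketch; the database locations and the data file name are hypothetical):

```shell
# TDB1: bulk load into an empty database directory
tdbloader --loc=DB data.nt

# TDB2: bulk load, including into an existing database
tdb2.tdbloader --loc=DB2 data.nt
```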
Andy
But during development we ran into some problems. Our goal was to fill
the database with 100M objects, and in doing so we encountered many
issues. Nothing works as it should, and the import process gets slower
and slower as the database grows, even though it has not yet reached
the highest limit.
Would you be able to answer a few questions of ours?
1. Is it possible to optimize the application (and the database as
well) so that it works faster and more reliably?
2. What is the correct way to use namespace prefixes so that data is
exported correctly?
3. Would it be possible to get your help in the form of a workshop or
training?
I would like to thank you in advance for your help.
Kind Regards,
Sashka