On 13.09.2018 at 11:59, Rob Vesse <[email protected]> wrote:
Markus
Jena in general should transparently recognize and handle files with a .gz
extension, provided they follow the standard approach of appending this after
the normal file extension, i.e. .ttl.gz. I checked the Fuseki code and GZipped
uploads should be supported.
From Jena 3.8.0, support is also provided for BZip2 files with a .bz2 extension
and Snappy compressed files with a .sz extension. However, looking at the
Fuseki code, I am not sure this is wired up into Fuseki currently.
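For reference, a minimal Python sketch of compressing a Turtle file so that it
keeps the .ttl.gz double extension described above. The file name
measurements.ttl is a made-up placeholder and nothing here is Jena-specific;
the command-line gzip tool produces the same naming by default.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gzip_keep_extension(src: Path) -> Path:
    """Compress src next to itself as <name>.gz (data.ttl -> data.ttl.gz)."""
    dst = src.with_name(src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        # Stream in chunks so multi-gigabyte files are never held in memory.
        shutil.copyfileobj(fin, fout)
    return dst


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical sample file, only for illustration.
        sample = Path(tmp) / "measurements.ttl"
        sample.write_text('<http://example.org/s> <http://example.org/p> "1" .\n')
        print(gzip_keep_extension(sample).name)  # measurements.ttl.gz
```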
Rob
On 13/09/2018, 10:49, "Markus Neumann" <[email protected]> wrote:
Hi Rob,
On 13.09.2018 at 11:41, Rob Vesse <[email protected]> wrote:
Markus
Comments inline:
On 12/09/2018, 16:09, "Markus Neumann" <[email protected]> wrote:
Hi,
We are running a Fuseki server that will hold about 2.2 * 10^9 triples of
meteorological data eventually.
I currently run it with "-Xmx80GB" on a 128GB server. The database is TDB2
on a 900GB SSD.
Now I face several performance issues:
1. Inserting data:
It takes more than one hour to upload the measurements of a month
(7.5GB .ttl file ~ 16 million triples) using the data-upload web interface of
Fuseki.
Is there a way to do this faster?
At a minimum, try GZipping the file and uploading it in GZipped form to reduce
the amount of data transferred over the network. It is possible that your
bottleneck here is actually network upload bandwidth rather than anything with
Jena itself. I would expect GZip to substantially reduce the file size and
hopefully improve your load times.
I am uploading on the server itself to localhost, so the network should not be
an issue (or am I missing something?).
Can Fuseki handle gzipped ttl files?
Secondly, TDB is typically reported to achieve load speeds of up to around 200k
triples/second, although that is for offline bulk loads with SSDs. Even if we
assume you could achieve only 25k triples/second, that would suggest a
theoretical load time of approximately 11 minutes. If you can set up your
system so the TDB database is written to an SSD, that will improve your
performance to some extent.
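The estimate above is plain division. As a quick check in Python, assuming the
16 million triples per monthly file mentioned earlier in this thread (the
function name is only illustrative):

```python
def load_minutes(triples: int, triples_per_second: float) -> float:
    """Theoretical load time in minutes at a constant ingest rate."""
    return triples / triples_per_second / 60


TRIPLES = 16_000_000  # one month of measurements, as stated in the thread

for rate in (25_000, 200_000):
    print(f"{rate} triples/s: {load_minutes(TRIPLES, rate):.1f} min")
# 25000 triples/s: 10.7 min
# 200000 triples/s: 1.3 min

# A one-hour upload corresponds to at most this many triples per second.
print(f"observed: about {TRIPLES / 3600:.0f} triples/s")
# observed: about 4444 triples/s
```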
We are on an SSD setup. 11 minutes would be something we could live with...
Thirdly, TDB uses multi-reader single-writer (MRSW) concurrency, so if you have
a lot of reads happening while trying to upload, which is a write operation, the
write operation will be forced to wait for active readers to finish before
proceeding, which may introduce some delays.
I'm aware of that, but as we are not in production mode yet, there are no
requests delaying the upload.
So yes, I think you should be able to get faster load times.
2. Updating data:
We get new model runs 5 times per day. This is data for the next 10
days, that needs to be updated every time.
My idea was to create a named graph "forecast" that holds the latest
version of this data.
Every time a new model run arrives, I create a new temporary graph to upload the
data to. Once this is finished, I move the temporary graph to "forecast".
This seems to do the work twice, as it takes 1 hour for the upload and 1
hour for the move.
Yes, this is exactly what happens. The database that backs Fuseki, TDB, is a
quad store, so it stores each triple as a quad of GSPO, where G is the graph
name. So when you move the temporary graph, it has to copy all the quads from
the source graph to the target graph and then delete the source graph.
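To illustrate why the move costs about as much as the upload, here is a toy
Python model of a quad store. This is only a sketch of the idea and not TDB's
actual implementation; the function and graph names are invented. Because the
graph name is part of every stored quad, moving a graph means rewriting each of
its quads rather than renaming a single label.

```python
from typing import Set, Tuple

Quad = Tuple[str, str, str, str]  # (graph, subject, predicate, object)


def move_graph(quads: Set[Quad], source: str, target: str) -> int:
    """Move source into target in place; return the number of quads touched."""
    if source == target:
        return 0
    old_target = {q for q in quads if q[0] == target}
    moved = {q for q in quads if q[0] == source}
    quads -= old_target  # the previous contents of the target are dropped
    quads -= moved       # every quad of the source graph is deleted ...
    # ... and written again under the new graph name.
    quads |= {(target, s, p, o) for (_, s, p, o) in moved}
    return len(old_target) + 2 * len(moved)


store = {
    ("tmp", "s1", "p", "o1"),
    ("tmp", "s2", "p", "o2"),
    ("forecast", "s0", "p", "o0"),
}
print(move_graph(store, "tmp", "forecast"))  # 5
```

The work grows with the size of the graphs involved, which matches the
observation that the move takes about as long as the original upload.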
Thanks for that input. I will have to figure something else here...
Rob
Markus