Hi Rob,

> On 13.09.2018 at 11:41, Rob Vesse <[email protected]> wrote:
> 
> Markus
> 
> Comments inline:
> 
> On 12/09/2018, 16:09, "Markus Neumann" <[email protected]> wrote:
> 
> Hi,
> 
> we are running a Fuseki server that will hold about 2.2 * 10^9 triples of
> meteorological data eventually.
> I currently run it with "-Xmx80GB" on a 128GB server. The database is TDB2
> on a 900GB SSD.
> 
> Now I face several performance issues:
> 1. Inserting data:
> It takes more than one hour to upload the measurements of a month
> (7.5GB .ttl file, ~16 million triples) using the data-upload web interface of
> Fuseki.
> Is there a way to do this faster?
> 
> At a minimum, try GZipping the file and uploading it in GZipped form to reduce
> the amount of data transferred over the network. It is possible that your
> bottleneck here is actually network upload bandwidth rather than anything with
> Jena itself. I would expect GZip to substantially reduce the file size and
> hopefully improve your load times.
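Just so I try the right thing: with plain curl against the Graph Store Protocol endpoint, I assume the gzipped upload would look roughly like this (the dataset name /ds, the graph name, and the file name are placeholders for our setup; whether Fuseki actually accepts a gzip Content-Encoding on the request body is exactly my question below):

    gzip -k measurements-2018-08.ttl
    curl -X POST --data-binary @measurements-2018-08.ttl.gz \
         -H 'Content-Type: text/turtle' \
         -H 'Content-Encoding: gzip' \
         'http://localhost:3030/ds/data?graph=http://example.org/graph/measurements'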
I am uploading on the server itself to localhost, so the network should not be an
issue (or am I missing something?). Can Fuseki handle gzipped .ttl files?

> Secondly, TDB is typically reported to achieve load speeds of up to around
> 200k triples/second, although that is for offline bulk loads with SSDs. Even
> if we assume you could achieve only 25k triples/second, that would suggest a
> theoretical load time of approximately 11 minutes. If you can set up your
> system so the TDB database is written to an SSD, that will improve your
> performance to some extent.

We are on an SSD setup. 11 minutes would be something we could live with...

> Thirdly, TDB is multi reader single writer (MRSW) concurrency, so if you have a
> lot of reads happening while trying to upload, which is a write operation,
> the write operation will be forced to wait for active readers to finish
> before proceeding, which may introduce some delays.

I'm aware of that, but as we are not in production mode yet, there are no
requests delaying the upload.

> So yes, I think you should be able to get faster load times.
> 
> 2. Updating data:
> We get new model runs 5 times per day. This is data for the next 10
> days that needs to be updated every time.
> My idea was to create a named graph "forecast" that holds the latest
> version of this data.
> Every time a new model run arrives, I create a new temporary graph to
> upload the data to. Once this is finished, I move the temporary graph to
> "forecast".
> This seems to do the work twice, as it takes 1 hour for the upload and 1
> hour for the move.
> 
> Yes, this is exactly what happens. The database that backs Fuseki, TDB, is a
> quad store, so it is storing each triple as a quad of GSPO where G is the
> graph name. So when you move the temporary graph it has to copy all the
> quads from the source graph to the target graph and then delete that source
> graph.

Thanks for that input. I will have to figure something else out here...

> Rob
> Markus
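P.S. Regarding the offline bulk loads you mention: if I understand correctly, the route for a big initial load would be something like the following, run with Fuseki stopped so nothing else has the TDB2 database open (the database directory and file name are placeholders for our setup):

    tdb2.tdbloader --loc /data/tdb2/meteo measurements-2018-08.ttl

and then start Fuseki against that location afterwards. Is that the recommended way to get closer to the bulk-load speeds?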

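Also, for completeness: the move I described is a single SPARQL Update request along these lines (graph names are placeholders for ours):

    MOVE GRAPH <http://example.org/graph/forecast-tmp>
      TO GRAPH <http://example.org/graph/forecast>

so, given your explanation of the GSPO quads, the second hour really is the copy-and-delete inside the store rather than anything unusual on our side.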