Markus

Jena in general should transparently recognize and handle files with a .gz 
extension, provided they follow the standard approach of appending this after 
the normal file extension, i.e. .ttl.gz. I checked the Fuseki code and GZipped 
uploads should be supported.
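As a sketch, assuming a Fuseki instance at http://localhost:3030 with a dataset named /ds (endpoint, dataset name, and file name are all illustrative), a GZipped upload from the command line might look like:

```shell
# Compress the Turtle file; the .ttl.gz name lets Jena detect both
# the syntax and the compression. -k keeps the original file.
gzip -k measurements.ttl

# POST the compressed file; the Content-Encoding header tells the
# server the request body is GZipped.
curl -X POST http://localhost:3030/ds/data \
     -H 'Content-Type: text/turtle' \
     -H 'Content-Encoding: gzip' \
     --data-binary @measurements.ttl.gz
```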

From Jena 3.8.0 support is also provided for BZip2 files with a .bz2 extension 
and Snappy compressed files with a .sz extension, although looking at the 
Fuseki code I am not sure this is wired up into Fuseki currently.

Rob

On 13/09/2018, 10:49, "Markus Neumann" <mneum...@meteomatics.com> wrote:

    Hi Rob,
    
    > Am 13.09.2018 um 11:41 schrieb Rob Vesse <rve...@dotnetrdf.org>:
    > 
    > Markus
    > 
    > Comments inline:
    > 
    > On 12/09/2018, 16:09, "Markus Neumann" <mneum...@meteomatics.com> wrote:
    > 
    >    Hi,
    > 
    >    we are running a Fuseki server that will hold about 2.2 * 10^9 triples 
of meteorological data eventually.
    >    I currently run it with "-Xmx80GB" on a 128GB server. The database is 
TDB2 on a 900GB SSD.
    > 
    >    Now I face several performance issues:
    >    1. Inserting data:
    >           It takes more than one hour to upload the measurements of a 
month (7.5GB .ttl file, ~16 million triples) using the data upload 
web interface of Fuseki.
    >           Is there a way to do this faster? 
    > 
    > At a minimum try GZipping the file and uploading it in GZipped form to 
reduce the amount of data transferred over the network.  It is possible that 
your bottleneck here is actually network upload bandwidth rather than anything 
with Jena itself.  I would expect GZip to substantially reduce the file size 
and hopefully improve your load times.
    
    I am uploading on the server itself to localhost, so the network should 
not be an issue (or am I missing something?).
    Can Fuseki handle gzipped .ttl files?
    
    > Secondly, TDB is typically reported to achieve load speeds of up to 
around 200k triples/second, although that is for offline bulk loads with SSDs.  
Even if we assume you could achieve only 25k triples/second, that would 
suggest a theoretical load time of approximately 11 minutes.  If you can set 
up your system so the TDB database is written to an SSD that will improve your 
performance to some extent.
    We are on an SSD setup. 11 minutes would be something we could live with...
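As a sketch of the offline bulk load mentioned above (the database location and file name are assumptions, and Fuseki must not be running against the database while the load runs):

```shell
# Bulk load directly into a TDB2 database with Jena's command-line
# loader; this bypasses the HTTP layer entirely and is typically much
# faster than uploading through Fuseki.
tdb2.tdbloader --loc /data/fuseki/databases/weather measurements.ttl.gz
```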
    > 
    > Thirdly TDB is multi reader single writer (MRSW) concurrency so if you 
have a lot of reads happening while trying to upload, which is a write 
operation, the write operation will be forced to wait for active readers to 
finish before proceeding which may introduce some delays.
    I'm aware of that, but as we are not in production mode yet, there are no 
requests delaying the upload.
    > 
    > So yes I think you should be able to get faster load times.
    > 
    >    2. Updating data:
    >           We get new model runs 5 times per day. This is data for the 
next 10 days, that needs to be updated every time.
    >           My idea was to create a named graph "forecast" that holds the 
latest version of this data.
    >           Every time a new model run arrives, I create a new temporary 
graph to upload the data to. Once this is finished, I move the temporary graph 
to "forecast".
    >           This seems to do the work twice, as it takes 1 hour for the 
upload and 1 hour for the move.
    > 
    > Yes, this is exactly what happens.  The database that backs Fuseki, TDB, 
is a quads store, so it stores each triple as a quad of GSPO where G is the 
graph name.  So when you move the temporary graph it has to copy all the quads 
from the source graph to the target graph and then delete the source graph.
    Thanks for that input. I will have to figure something else here...
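For reference, the graph move described above is a single SPARQL Update operation; sent over HTTP it might look like this (the endpoint and graph names are illustrative):

```shell
# MOVE copies all quads from the temporary graph into the target graph,
# replacing the target's previous contents, then drops the temporary graph.
curl -X POST http://localhost:3030/ds/update \
     --data-urlencode \
     'update=MOVE <urn:graph:forecast-tmp> TO <urn:graph:forecast>'
```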
    > 
    > Rob
    > 
    
    Markus



