Markus

Comments inline:

On 12/09/2018, 16:09, "Markus Neumann" <mneum...@meteomatics.com> wrote:

    Hi,
    
    we are running a Fuseki server that will hold about 2.2 * 10^9 triples of 
meteorological data eventually.
    I currently run it with "-Xmx80GB" on a 128GB server. The database is TDB2 on a 900GB SSD.
    
    Now I face several performance issues:
    1. Inserting data:
        It takes more than one hour to upload the measurements of a month (7.5GB .ttl file, ~16 Mio triples) using the data-upload web interface of Fuseki.
        Is there a way to do this faster? 

At a minimum, try GZipping the file and uploading it in GZipped form to reduce 
the amount of data transferred over the network.  It is possible that your 
bottleneck here is actually network upload bandwidth rather than anything in 
Jena itself.  I would expect GZip to substantially reduce the file size and 
hopefully improve your load times.
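
As a rough sketch of doing this outside the web interface: the host, dataset 
name ("ds"), graph URI and file name below are assumptions, and it presumes 
your Fuseki honours Content-Encoding: gzip on uploads, which is worth 
verifying on your version (otherwise uploading the .gz file through the 
upload page achieves much the same thing):

    # Sketch only: POST a pre-gzipped Turtle file to the Graph Store Protocol
    # endpoint.  Host, dataset name ("ds"), graph URI and file name are
    # assumptions; substitute your own.
    import requests

    with open("measurements-2018-09.ttl.gz", "rb") as f:
        resp = requests.post(
            "http://localhost:3030/ds/data",
            params={"graph": "http://example.org/graph/measurements"},
            data=f,  # streamed as-is, so only the compressed bytes cross the network
            headers={
                "Content-Type": "text/turtle",
                "Content-Encoding": "gzip",  # tells the server the body is gzip-compressed
            },
        )
    resp.raise_for_status()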

Secondly, TDB is typically reported to achieve load speeds of up to around 200k 
triples/second, although that is for offline bulk loads with SSDs.  Even if we 
assume you could achieve only 25k triples/second, that would suggest a 
theoretical load time of roughly 11 minutes for your 16 million triples 
(16,000,000 / 25,000 ≈ 640 seconds).  If you can set up your system so the TDB 
database is written to an SSD that will improve your performance to some extent.
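
If you can take the server offline for the initial load (or build the database 
directory first and point Fuseki at it afterwards), the tdb2.tdbloader command 
line tool from the Jena distribution is what those offline bulk load figures 
refer to.  A minimal sketch of driving it from Python; the database location 
and file names are assumptions:

    # Sketch only: run Jena's TDB2 offline bulk loader.  Assumes the Jena
    # command line tools are on the PATH and that Fuseki is NOT running
    # against this database directory while the load runs.
    import subprocess

    db_dir = "/data/tdb2/meteo"                    # assumed TDB2 database location
    files = ["2018-07.ttl.gz", "2018-08.ttl.gz"]   # the loader reads .gz files directly

    subprocess.run(
        ["tdb2.tdbloader", f"--loc={db_dir}", *files],
        check=True,  # raise CalledProcessError if the loader fails
    )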

Thirdly, TDB uses multiple-reader, single-writer (MRSW) concurrency, so if you 
have a lot of reads happening while trying to upload (which is a write 
operation), the write operation will be forced to wait for active readers to 
finish before proceeding, which may introduce some delays.

So yes I think you should be able to get faster load times.

    2. Updating data:
        We get new model runs 5 times per day. This is data for the next 10 days, which needs to be updated every time.
        My idea was to create a named graph "forecast" that holds the latest 
version of this data.
        Every time a new model run arrives, I create a new temporary graph to 
upload the data to. Once this is finished, I move the temporary graph to 
"forecast".
        This seems to do the work twice, as it takes 1 hour for the upload and 1 hour for the move.

Yes, this is exactly what happens.  The database that backs Fuseki, TDB, is a 
quad store, so it stores each triple as a GSPO quad where G is the graph name.  
So when you move the temporary graph it has to copy all the quads from the 
source graph to the target graph and then delete the source graph.
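
For reference, the move itself is just a SPARQL Update against the dataset's 
update endpoint.  A minimal sketch; the dataset name ("ds") and the graph URIs 
are assumptions, substitute your own:

    # Sketch only: issue the graph move as a SPARQL 1.1 Update.  Internally
    # TDB2 copies every quad from the temporary graph into "forecast" and then
    # drops the temporary graph, which is why the move takes about as long as
    # the upload did.
    import requests

    update = """
        MOVE GRAPH <http://example.org/graph/tmp-run>
          TO GRAPH <http://example.org/graph/forecast>
    """

    resp = requests.post(
        "http://localhost:3030/ds/update",   # standard Fuseki update endpoint
        data={"update": update},             # form-encoded per the SPARQL 1.1 Protocol
    )
    resp.raise_for_status()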

Rob
    
    Our data consists of the following:
    
    Locations (total 1607 -> 16070 triples):
    mm-locations:8500015 a mm:Location ;
        a geosparql:Geometry ;
        owl:sameAs <http://lod.opentransportdata.swiss/didok/8500015> ;
        geosparql:asWKT "POINT(7.61574425031 
47.5425915732)"^^geosparql:wktLiteral ;
        mm:station_name "Basel SBB GB Ost" ;
        mm:abbreviation "BSGO" ;
        mm:didok_id 8500015 ;
        geo:lat 47.54259 ;
        geo:long 7.61574 ;
        mm:elevation 273 .
    
    Parameters (total 14 -> 56 triples):
    mm-parameters:t_2m:C a mm:Parameter ;
        rdfs:label "t_2m:C" ;
        dcterms:description "Air temperature at 2m above ground in degree 
Celsius"@en ;
        mm:unit_symbol "˚C" .
    
    Measurements (this is the huge bulk; per day: 14 * 1607 * 48 ~ 1 Mio measurements -> 5 Mio triples per day):
    mm-measurements:8500015_2018-09-02T00:00:00Z_t_2m:C a mm:Measurement ;
        mm:location mm-locations:8500015 ;
        mm:validdate "2018-09-02T00:00:00Z"^^xsd:dateTime ;
        mm:value 15.1 ;
        mm:parameter mm-parameters:t_2m:C .
    
    I would really appreciate it if someone could give me some advice on how to 
handle these tasks or point out things I could do to optimize the organization 
of the data.
    
    Many thanks and kind regards
    Markus Neumann

    
    



