On 14/09/18 14:18, Markus Neumann wrote:
Hi Andy,

Thanks for pointing that out.
What would you recommend as a heap size?

From a pure storage point of view, 2-4G per dataset.

Query execution can take some workspace, as can HTTP handling and other Jena modules.

And Java itself.

And running at the bare minimum is to be avoided, as it can lead to excessive GCs.

Ultimately there is no choice but to try.

128GB server? Nothing else of note using RAM? Try 10-16G (as a guess).

    Andy


On 14/09/18 15:04, Andy Seaborne <[email protected]> wrote:



On 12/09/18 16:08, Markus Neumann wrote:
Hi,
we are running a Fuseki server that will eventually hold about 2.2 * 10^9 triples of meteorological data.
I currently run it with "-Xmx80GB" on a 128GB server. The database is TDB2 on a 900GB SSD.

Not sure if this is mentioned later in the thread (I'm in catch-up mode), but for TDB/TDB2 a lot of the workspace isn't in the heap, it's the OS file system cache, so a bigger Java heap can actually slow things down.

    Andy

Now I face several performance issues:
1. Inserting data:
        It takes more than one hour to upload one month of measurements (a 7.5GB .ttl file, ~16 million triples) using the data upload web interface of Fuseki.
        Is there a way to do this faster?
2. Updating data:
        We get new model runs 5 times per day. This is data for the next 10 days, which needs to be replaced each time.
        My idea was to create a named graph "forecast" that holds the latest version of this data.
        Every time a new model run arrives, I create a new temporary graph to upload the data to. Once this is finished, I move the temporary graph to "forecast".
        This seems to do the work twice, as it takes 1 hour for the upload and 1 hour for the move (see the sketch below).
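
A minimal sketch of this load-then-move workflow using Jena's RDFConnection (instead of the web form); the endpoint URL, graph IRIs and file path are placeholders, not our real setup:

import org.apache.jena.rdfconnection.RDFConnection;
import org.apache.jena.rdfconnection.RDFConnectionFactory;

public class ForecastLoader {
    public static void main(String[] args) {
        // Placeholder endpoint and graph names - adjust to the actual Fuseki service.
        String service       = "http://localhost:3030/meteo";
        String tempGraph     = "http://example.org/graph/forecast-incoming";
        String forecastGraph = "http://example.org/graph/forecast";

        try (RDFConnection conn = RDFConnectionFactory.connect(service)) {
            // Stream the new model run into a temporary named graph via the
            // SPARQL Graph Store protocol (no web form involved).
            conn.load(tempGraph, "/data/model-run.ttl");

            // Publish it by moving the temporary graph over "forecast".
            // MOVE still rewrites the data inside TDB2, so it is not free,
            // but readers only ever see the complete old or new forecast.
            conn.update("MOVE SILENT GRAPH <" + tempGraph + "> TO GRAPH <" + forecastGraph + ">");
        }
    }
}

(Depending on requirements, conn.put(forecastGraph, "/data/model-run.ttl") replaces the graph contents in a single request and skips the separate move; for the initial bulk load, the offline tdb2.tdbloader tool is generally much faster than any HTTP upload, though Fuseki must not be running against the database while it loads.)
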
Our data consists of the following:
Locations (total 1607 -> 16070 triples):
mm-locations:8500015 a mm:Location ;
     a geosparql:Geometry ;
     owl:sameAs <http://lod.opentransportdata.swiss/didok/8500015> ;
     geosparql:asWKT "POINT(7.61574425031 47.5425915732)"^^geosparql:wktLiteral ;
     mm:station_name "Basel SBB GB Ost" ;
     mm:abbreviation "BSGO" ;
     mm:didok_id 8500015 ;
     geo:lat 47.54259 ;
     geo:long 7.61574 ;
     mm:elevation 273 .
Parameters (total 14 -> 56 triples):
mm-parameters:t_2m:C a mm:Parameter ;
     rdfs:label "t_2m:C" ;
     dcterms:description "Air temperature at 2m above ground in degree Celsius"@en ;
     mm:unit_symbol "˚C" .
Measurements (this is the huge bunch; per day: 14 * 1607 * 48 ~ 1 million measurements -> ~5 million triples per day):
mm-measurements:8500015_2018-09-02T00:00:00Z_t_2m:C a mm:Measurement ;
     mm:location mm-locations:8500015 ;
     mm:validdate "2018-09-02T00:00:00Z"^^xsd:dateTime ;
     mm:value 15.1 ;
     mm:parameter mm-parameters:t_2m:C .
I would really appreciate it if someone could give me some advice on how to handle these tasks, or point out things I could do to optimize the organization of the data.
Many thanks and kind regards
Markus Neumann
        
