Hi Marco,

As this is a project for a customer, I'm afraid we can't make the data public.

1. I'm running Fuseki-3.8.0 with the following configuration:
@prefix :      <http://base/#> .
@prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix tdb2:  <http://jena.apache.org/2016/tdb#> .
@prefix ja:    <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix fuseki: <http://jena.apache.org/fuseki#> .
@prefix spatial: <http://jena.apache.org/spatial#> .
@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix geosparql: <http://www.opengis.net/ont/geosparql#> .

:service_tdb_all  a                   fuseki:Service ;
        rdfs:label                    "TDB2 mm" ;
        fuseki:dataset                :spatial_dataset ;
        fuseki:name                   "mm" ;
        fuseki:serviceQuery           "query" , "sparql" ;
        fuseki:serviceReadGraphStore  "get" ;
        fuseki:serviceReadWriteGraphStore
                "data" ;
        fuseki:serviceUpdate          "update" ;
        fuseki:serviceUpload          "upload" .

:spatial_dataset a spatial:SpatialDataset ;
    spatial:dataset   :tdb_dataset_readwrite ;
    spatial:index     <#indexLucene> ;
    .

<#indexLucene> a spatial:SpatialIndexLucene ;
    #spatial:directory <file:Lucene> ;
    spatial:directory "mem" ;
    spatial:definition <#definition> ;
    .

<#definition> a spatial:EntityDefinition ;
    spatial:entityField      "uri" ;
    spatial:geoField     "geo" ;
    # custom geo predicates for 1) Latitude/Longitude Format
    spatial:hasSpatialPredicatePairs (
         [ spatial:latitude geo:lat ; spatial:longitude geo:long ]
         ) ;
    # custom geo predicates for 2) Well Known Text (WKT) Literal
    spatial:hasWKTPredicates (geosparql:asWKT) ;
    # custom SpatialContextFactory for 2) Well Known Text (WKT) Literal
    spatial:spatialContextFactory
#         "com.spatial4j.core.context.jts.JtsSpatialContextFactory"
        "org.locationtech.spatial4j.context.jts.JtsSpatialContextFactory"
    .

:tdb_dataset_readwrite
        a              tdb2:DatasetTDB2 ;
        tdb2:location  "/srv/linked_data_store/fuseki-server/run/databases/mm" .

I've been through the Fuseki documentation several times, but I still find it a 
bit confusing. I would highly appreciate it if you could point me to other 
resources.

I have not found the tdbloader in the fuseki repo. For now I use a small shell 
script that wraps curl to upload the data:

ADD=""
if [ -n "$2" ]
then
    ADD="?graph=http://rdf.meteomatics.com/mm/graphs/$2"
fi
curl --basic -u user:password -X POST -F "filename=@$1" \
    "localhost:3030/mm/data${ADD}"
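In case it is useful context for others on the list: the TDB2 bulk loader ships with the apache-jena binary distribution rather than with Fuseki itself. A sketch of an offline load while the server is stopped (installation path and input filename here are placeholders, not my actual setup):

```shell
# Offline bulk load into the TDB2 database that Fuseki serves.
# Run only while Fuseki is stopped, since TDB2 allows one writer process.
# Paths below are assumptions for illustration.
apache-jena-3.8.0/bin/tdb2.tdbloader \
    --loc /srv/linked_data_store/fuseki-server/run/databases/mm \
    measurements-2018-09.ttl
```

This avoids the HTTP upload path entirely, which is usually much faster for initial bulk loads than POSTing through the Fuseki endpoint.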

2. Our customer has not specified a default use case yet, as the whole RDF 
concept is about as new to them as it is to me. I suppose it will be something 
like "Find all locations in a certain radius that have nice weather next 
Saturday".
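For what it's worth, here is a rough sketch of how such a query might look against the spatial index configured above. The spatial: prefix matches the configuration; the mm:/mm-parameters: prefixes are the ones from our data (declarations omitted), and the coordinates, radius, timestamp and threshold are placeholders, not a real requirement:

```sparql
PREFIX spatial: <http://jena.apache.org/spatial#>
PREFIX xsd:     <http://www.w3.org/2001/XMLSchema#>
# mm: and mm-parameters: prefix declarations omitted

# Stations within 50 km of a point with t_2m above 20 at a given time
SELECT ?loc ?value
WHERE {
  # property function backed by the Lucene index: (lat lon radius units)
  ?loc spatial:withinCircle (47.54 7.62 50.0 'km') .
  ?m   mm:location   ?loc ;
       mm:parameter  mm-parameters:t_2m:C ;
       mm:validdate  "2018-09-15T12:00:00Z"^^xsd:dateTime ;
       mm:value      ?value .
  FILTER (?value > 20)
}
```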

I just took a glance at the ha-fuseki page and will give it a try later.

Many thanks for your time

Best
Markus

> Am 13.09.2018 um 10:00 schrieb Marco Neumann <[email protected]>:
> 
> do you make the data endpoint publicly available?
> 
> 1. did you try the tdbloader, what version of tdb2 do you use?
> 
> 2. many ways to improve your response time here. what does a typical query
> look like? do you make use of the spatial indexer?
> 
> and Andy has a work in progress here for more granular updates that might
> be of interest to your effort as well: "High Availability Apache Jena Fuseki"
> 
> https://afs.github.io/rdf-delta/ha-fuseki.html
> 
> 
> On Wed, Sep 12, 2018 at 4:09 PM Markus Neumann <[email protected]>
> wrote:
> 
>> Hi,
>> 
>> we are running a Fuseki server that will hold about 2.2 * 10^9 triples of
>> meteorological data eventually.
>> I currently run it with "-Xmx80GB" on a 128GB Server. The database is TDB2
>> on a 900GB SSD.
>> 
>> Now I face several performance issues:
>> 1. Inserting data:
>>        It takes more than one hour to upload the measurements of a month
>> (7.5GB .ttl file ~ 16 Mio triples) (using the data-upload web-interface of
>> fuseki)
>>        Is there a way to do this faster?
>> 2. Updating data:
>>        We get new model runs 5 times per day. This is data for the next
>> 10 days, that needs to be updated every time.
>>        My idea was to create a named graph "forecast" that holds the
>> latest version of this data.
>>        Every time a new model run arrives, I create a new temporary graph
>> to upload the data to. Once this is finished, I move the temporary graph to
>> "forecast".
>> This seems to do the work twice as it takes 1 hour for the upload
>> and 1 hour for the move.
>> 
>> Our data consists of the following:
>> 
>> Locations (total 1607 -> 16070 triples):
>> mm-locations:8500015 a mm:Location ;
>>    a geosparql:Geometry ;
>>    owl:sameAs <http://lod.opentransportdata.swiss/didok/8500015> ;
>>    geosparql:asWKT "POINT(7.61574425031
>> 47.5425915732)"^^geosparql:wktLiteral ;
>>    mm:station_name "Basel SBB GB Ost" ;
>>    mm:abbreviation "BSGO" ;
>>    mm:didok_id 8500015 ;
>>    geo:lat 47.54259 ;
>>    geo:long 7.61574 ;
>>    mm:elevation 273 .
>> 
>> Parameters (total 14 -> 56 triples):
>> mm-parameters:t_2m:C a mm:Parameter ;
>>    rdfs:label "t_2m:C" ;
>>    dcterms:description "Air temperature at 2m above ground in degree
>> Celsius"@en ;
>>    mm:unit_symbol "˚C" .
>> 
>> Measurements (that is the huge bunch. Per day: 14 * 1607 * 48 ~ 1 Mio ->
>> 5Mio triples per day):
>> mm-measurements:8500015_2018-09-02T00:00:00Z_t_2m:C a mm:Measurement ;
>>    mm:location mm-locations:8500015 ;
>>    mm:validdate "2018-09-02T00:00:00Z"^^xsd:dateTime ;
>>    mm:value 15.1 ;
>>    mm:parameter mm-parameters:t_2m:C .
>> 
>> I would really appreciate if someone could give me some advice on how to
>> handle this tasks or point out things I could do to optimize the
>> organization of the data.
>> 
>> Many thanks and kind regards
>> Markus Neumann
>> 
>> 
>> 
> 
> -- 
> 
> 
> ---
> Marco Neumann
> KONA
