I am not sure the tdbloader/tdbloader2 scripts help in this case. From what
has been described, this is an online update of a running Fuseki instance
backed by TDB.

Since a TDB instance can only safely be used by a single JVM at a time, using
those scripts would not be a viable option here unless the OP was willing to
stop Fuseki during updates. Otherwise the load would either fail (because the
built-in TDB mechanisms would prevent it) or risk causing data corruption.
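
If stopping Fuseki during the load is acceptable, the offline bulk loader is
far faster than HTTP uploads. Note that tdbloader/tdbloader2 are the TDB1
tools; since the dataset in the configuration quoted below is TDB2, the
relevant command is tdb2.tdbloader. A rough sketch (the data file name and
the service management commands are placeholders; the database path is taken
from that configuration):

    # stop Fuseki so the loader has exclusive access to the database
    systemctl stop fuseki            # or however the instance is managed

    # bulk load into the existing TDB2 database
    tdb2.tdbloader \
        --loc=/srv/linked_data_store/fuseki-server/run/databases/mm \
        new-data.ttl

    systemctl start fuseki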

Rob

On 13/09/2018, 10:11, "Marco Neumann" <[email protected]> wrote:

    Markus, the tdbloader2 script is part of the apache-jena distribution.
    
    let me know how you get on and how this improves your data load process.
    
    Marco
    
    
    
    On Thu, Sep 13, 2018 at 9:58 AM Markus Neumann <[email protected]>
    wrote:
    
    > Hi Marco,
    >
    > as this is a project for a customer, I'm afraid we can't make the data
    > public.
    >
    > 1. I'm running Fuseki-3.8.0 with the following configuration:
    > @prefix :      <http://base/#> .
    > @prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    > @prefix tdb2:  <http://jena.apache.org/2016/tdb#> .
    > @prefix ja:    <http://jena.hpl.hp.com/2005/11/Assembler#> .
    > @prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
    > @prefix fuseki: <http://jena.apache.org/fuseki#> .
    > @prefix spatial: <http://jena.apache.org/spatial#> .
    > @prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
    > @prefix geosparql: <http://www.opengis.net/ont/geosparql#> .
    >
    > :service_tdb_all  a                   fuseki:Service ;
    >         rdfs:label                    "TDB2 mm" ;
    >         fuseki:dataset                :spatial_dataset ;
    >         fuseki:name                   "mm" ;
    >         fuseki:serviceQuery           "query" , "sparql" ;
    >         fuseki:serviceReadGraphStore  "get" ;
    >         fuseki:serviceReadWriteGraphStore
    >                 "data" ;
    >         fuseki:serviceUpdate          "update" ;
    >         fuseki:serviceUpload          "upload" .
    >
    > :spatial_dataset a spatial:SpatialDataset ;
    >     spatial:dataset   :tdb_dataset_readwrite ;
    >     spatial:index     <#indexLucene> ;
    >     .
    >
    > <#indexLucene> a spatial:SpatialIndexLucene ;
    >     #spatial:directory <file:Lucene> ;
    >     spatial:directory "mem" ;
    >     spatial:definition <#definition> ;
    >     .
    >
    > <#definition> a spatial:EntityDefinition ;
    >     spatial:entityField      "uri" ;
    >     spatial:geoField     "geo" ;
    >     # custom geo predicates for 1) Latitude/Longitude Format
    >     spatial:hasSpatialPredicatePairs (
    >          [ spatial:latitude geo:lat ; spatial:longitude geo:long ]
    >          ) ;
    >     # custom geo predicates for 2) Well Known Text (WKT) Literal
    >     spatial:hasWKTPredicates (geosparql:asWKT) ;
    >     # custom SpatialContextFactory for 2) Well Known Text (WKT) Literal
    >     spatial:spatialContextFactory
    > #         "com.spatial4j.core.context.jts.JtsSpatialContextFactory"
    >         "org.locationtech.spatial4j.context.jts.JtsSpatialContextFactory"
    >     .
    >
    > :tdb_dataset_readwrite
    >         a              tdb2:DatasetTDB2 ;
    >         tdb2:location
    > "/srv/linked_data_store/fuseki-server/run/databases/mm" .
    >
    > I've been through the Fuseki documentation several times, but I still
    > find it a bit confusing. I would highly appreciate it if you could
    > point me to other resources.
    >
    > I have not found the tdbloader in the fuseki repo. For now I use a small
    > shell script that wraps curl to upload the data:
    >
    > # upload.sh FILE [GRAPH] -- POST a file to Fuseki, optionally into a named graph
    > ADD=""
    > if [ -n "$2" ]
    > then
    >     ADD="?graph=http://rdf.meteomatics.com/mm/graphs/$2"
    > fi
    > curl --basic -u user:password -X POST -F "filename=@$1" \
    >     "localhost:3030/mm/data${ADD}"
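    >
    > An alternative sketch that streams the file straight to the graph store
    > endpoint instead of using the multipart form (untested on my side):
    >
    > curl --basic -u user:password -X POST \
    >     -H "Content-Type: text/turtle" \
    >     --data-binary "@$1" \
    >     "localhost:3030/mm/data${ADD}"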
    >
    > 2. Our customer has not specified a default use case yet, as the whole RDF
    > concept is about as new to them as it is to me. I suppose it will be
    > something like "Find all locations in a certain radius that have nice
    > weather next Saturday".
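    >
    > A sketch of what such a query could look like with the spatial index
    > (the coordinates and radius are made up; spatial:withinCircle is a
    > jena-spatial property function, and the result would then be joined to
    > the measurements via mm:location and mm:validdate):
    >
    > curl --basic -u user:password localhost:3030/mm/query \
    >     --data-urlencode 'query=
    >         PREFIX spatial: <http://jena.apache.org/spatial#>
    >         SELECT ?loc WHERE {
    >             ?loc spatial:withinCircle (47.5426 7.6157 50.0 "km") .
    >         }'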
    >
    > I just took a glance at the ha-fuseki page and will give it a try later.
    >
    > Many thanks for your time
    >
    > Best
    > Markus
    >
    > > Am 13.09.2018 um 10:00 schrieb Marco Neumann <[email protected]>:
    > >
    > > do you make the data endpoint publicly available?
    > >
    > > 1. did you try the tdbloader? what version of TDB2 do you use?
    > >
    > > 2. many ways to improve your response time here. what does a typical
    > > query look like? do you make use of the spatial indexer?
    > >
    > > and Andy has a work in progress here for more granular updates that
    > > might be of interest to your effort as well: "High Availability
    > > Apache Jena Fuseki"
    > >
    > > https://afs.github.io/rdf-delta/ha-fuseki.html
    > >
    > >
    > > On Wed, Sep 12, 2018 at 4:09 PM Markus Neumann <[email protected]>
    > > wrote:
    > >
    > >> Hi,
    > >>
    > >> we are running a Fuseki server that will hold about 2.2 * 10^9
    > >> triples of meteorological data eventually.
    > >> I currently run it with "-Xmx80GB" on a 128GB server. The database
    > >> is TDB2 on a 900GB SSD.
    > >>
    > >> Now I face several performance issues:
    > >> 1. Inserting data:
    > >>        It takes more than one hour to upload the measurements of a
    > >>        month (a 7.5GB .ttl file, ~16 million triples) using the
    > >>        data-upload web interface of Fuseki.
    > >>        Is there a way to do this faster?
    > >> 2. Updating data:
    > >>        We get new model runs 5 times per day. This is data for the
    > >>        next 10 days that needs to be updated every time.
    > >>        My idea was to create a named graph "forecast" that holds the
    > >>        latest version of this data.
    > >>        Every time a new model run arrives, I create a new temporary
    > >>        graph to upload the data to. Once this is finished, I move the
    > >>        temporary graph to "forecast".
    > >>        This seems to do the work twice, as it takes 1 hour for the
    > >>        upload and 1 hour for the move.
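    > >>
    > >> The move itself is a single SPARQL update against the /update
    > >> endpoint, roughly like this (the temporary graph URI here is
    > >> illustrative):
    > >>
    > >> curl --basic -u user:password localhost:3030/mm/update \
    > >>     --data-urlencode 'update=
    > >>         MOVE <http://rdf.meteomatics.com/mm/graphs/tmp>
    > >>           TO <http://rdf.meteomatics.com/mm/graphs/forecast>'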
    > >>
    > >> Our data consists of the following:
    > >>
    > >> Locations (total 1607 -> 16070 triples):
    > >> mm-locations:8500015 a mm:Location ;
    > >>    a geosparql:Geometry ;
    > >>    owl:sameAs <http://lod.opentransportdata.swiss/didok/8500015> ;
    > >>    geosparql:asWKT "POINT(7.61574425031 47.5425915732)"^^geosparql:wktLiteral ;
    > >>    mm:station_name "Basel SBB GB Ost" ;
    > >>    mm:abbreviation "BSGO" ;
    > >>    mm:didok_id 8500015 ;
    > >>    geo:lat 47.54259 ;
    > >>    geo:long 7.61574 ;
    > >>    mm:elevation 273 .
    > >>
    > >> Parameters (total 14 -> 56 triples):
    > >> mm-parameters:t_2m:C a mm:Parameter ;
    > >>    rdfs:label "t_2m:C" ;
    > >>    dcterms:description "Air temperature at 2m above ground in degree Celsius"@en ;
    > >>    mm:unit_symbol "˚C" .
    > >>
    > >> Measurements (this is the huge bunch; per day: 14 * 1607 * 48 ~ 1
    > >> million measurements -> 5 million triples per day):
    > >> mm-measurements:8500015_2018-09-02T00:00:00Z_t_2m:C a mm:Measurement ;
    > >>    mm:location mm-locations:8500015 ;
    > >>    mm:validdate "2018-09-02T00:00:00Z"^^xsd:dateTime ;
    > >>    mm:value 15.1 ;
    > >>    mm:parameter mm-parameters:t_2m:C .
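    > >>
    > >> For illustration, fetching one station's temperature series over
    > >> this model would look roughly like the following (prefix
    > >> declarations for mm:, mm-locations: and mm-parameters: as in our
    > >> data, omitted here):
    > >>
    > >> curl --basic -u user:password localhost:3030/mm/query \
    > >>     --data-urlencode 'query=
    > >>         SELECT ?time ?value WHERE {
    > >>             ?m mm:location  mm-locations:8500015 ;
    > >>                mm:parameter mm-parameters:t_2m:C ;
    > >>                mm:validdate ?time ;
    > >>                mm:value     ?value .
    > >>         } ORDER BY ?time'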
    > >>
    > >> I would really appreciate it if someone could give me some advice on
    > >> how to handle these tasks, or point out things I could do to
    > >> optimize the organization of the data.
    > >>
    > >> Many thanks and kind regards
    > >> Markus Neumann
    > >>
    > >>
    > >>
    > >
    > > --
    > >
    > >
    > > ---
    > > Marco Neumann
    > > KONA
    >
    >
    
    -- 
    
    
    ---
    Marco Neumann
    KONA
    



