Rob, keeping Fuseki live wasn't stated as a requirement for 1., so my advice stands. We are running similar updates with fresh data frequently.
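For 1., this is roughly what I have in mind -- only a sketch: the .ttl file name and the service commands are placeholders for whatever you use, and it assumes Fuseki is stopped while the loader runs, since (as Rob notes below) a TDB database must only be used by one JVM at a time:

# stop Fuseki first, e.g. via your service manager (placeholder command):
# systemctl stop fuseki

# bulk load straight into the TDB2 database named in the assembler config;
# this is much faster than uploading through the HTTP endpoint
DB=/srv/linked_data_store/fuseki-server/run/databases/mm
tdb2.tdbloader --loc="$DB" measurements-2018-09.ttl    # placeholder file name

# start Fuseki again
# systemctl start fuseki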
Markus, to keep Fuseki downtime to a minimum you can pre-populate TDB into a temporary directory as well and later switch between directories (rough sketch at the very bottom of this mail, below the quoted thread). Don't forget to run the TDB optimizer to generate the stats.opt file.

On Thu, Sep 13, 2018 at 10:33 AM Rob Vesse <[email protected]> wrote:

> I am not sure the tdbloader/tdbloader2 scripts help in this case. This is an
> online update of a running Fuseki instance backed by TDB from what has been
> described.
>
> Since a TDB instance can only be safely used by a single JVM at a time,
> using those scripts would not be a viable option here unless the OP was
> willing to stop Fuseki during updates, as otherwise it would either fail
> (because the built-in TDB mechanisms would prevent it) or it would risk
> causing data corruption.
>
> Rob
>
> On 13/09/2018, 10:11, "Marco Neumann" <[email protected]> wrote:
>
> Markus, the tdbloader2 script is part of the apache-jena distribution.
>
> let me know how you get on and how this improves your data load process.
>
> Marco
>
> On Thu, Sep 13, 2018 at 9:58 AM Markus Neumann <[email protected]> wrote:
>
> > Hi Marco,
> >
> > as this is a project for a customer, I'm afraid we can't make the data
> > public.
> >
> > 1. I'm running Fuseki-3.8.0 with the following configuration:
> >
> > @prefix :         <http://base/#> .
> > @prefix rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
> > @prefix tdb2:     <http://jena.apache.org/2016/tdb#> .
> > @prefix ja:       <http://jena.hpl.hp.com/2005/11/Assembler#> .
> > @prefix rdfs:     <http://www.w3.org/2000/01/rdf-schema#> .
> > @prefix fuseki:   <http://jena.apache.org/fuseki#> .
> > @prefix spatial:  <http://jena.apache.org/spatial#> .
> > @prefix geo:      <http://www.w3.org/2003/01/geo/wgs84_pos#> .
> > @prefix geosparql: <http://www.opengis.net/ont/geosparql#> .
> >
> > :service_tdb_all a fuseki:Service ;
> >     rdfs:label "TDB2 mm" ;
> >     fuseki:dataset :spatial_dataset ;
> >     fuseki:name "mm" ;
> >     fuseki:serviceQuery "query" , "sparql" ;
> >     fuseki:serviceReadGraphStore "get" ;
> >     fuseki:serviceReadWriteGraphStore "data" ;
> >     fuseki:serviceUpdate "update" ;
> >     fuseki:serviceUpload "upload" .
> >
> > :spatial_dataset a spatial:SpatialDataset ;
> >     spatial:dataset :tdb_dataset_readwrite ;
> >     spatial:index <#indexLucene> ;
> >     .
> >
> > <#indexLucene> a spatial:SpatialIndexLucene ;
> >     #spatial:directory <file:Lucene> ;
> >     spatial:directory "mem" ;
> >     spatial:definition <#definition> ;
> >     .
> >
> > <#definition> a spatial:EntityDefinition ;
> >     spatial:entityField "uri" ;
> >     spatial:geoField "geo" ;
> >     # custom geo predicates for 1) Latitude/Longitude Format
> >     spatial:hasSpatialPredicatePairs (
> >         [ spatial:latitude geo:lat ; spatial:longitude geo:long ]
> >     ) ;
> >     # custom geo predicates for 2) Well Known Text (WKT) Literal
> >     spatial:hasWKTPredicates (geosparql:asWKT) ;
> >     # custom SpatialContextFactory for 2) Well Known Text (WKT) Literal
> >     spatial:spatialContextFactory
> >         # "com.spatial4j.core.context.jts.JtsSpatialContextFactory"
> >         "org.locationtech.spatial4j.context.jts.JtsSpatialContextFactory"
> >     .
> >
> > :tdb_dataset_readwrite
> >     a tdb2:DatasetTDB2 ;
> >     tdb2:location "/srv/linked_data_store/fuseki-server/run/databases/mm" .
> >
> > I've been through the Fuseki documentation several times, but I still find
> > it a bit confusing. I would highly appreciate it if you could point me to
> > other resources.
> >
> > I have not found the tdbloader in the fuseki repo. For now I use a small
> > shell script that wraps curl to upload the data:
> >
> > if [ ! -z $2 ]
> > then
> >     ADD="?graph=http://rdf.meteomatics.com/mm/graphs/$2"
> > fi
> > curl --basic -u user:password -X POST -F "filename=@$1" localhost:3030/mm/data${ADD}
> >
> > 2. Our customer has not specified a default use case yet, as the whole RDF
> > concept is about as new to them as it is to me. I suppose it will be
> > something like "Find all locations in a certain radius that have nice
> > weather next Saturday".
> >
> > I just took a glance at the ha-fuseki page and will give it a try later.
> >
> > Many thanks for your time
> >
> > Best
> > Markus
> >
> > > On 13.09.2018 at 10:00, Marco Neumann <[email protected]> wrote:
> > >
> > > do you make the data endpoint publicly available?
> > >
> > > 1. did you try the tdbloader, what version of tdb2 do you use?
> > >
> > > 2. many ways to improve your response time here. what does a typical
> > > query look like? do you make use of the spatial indexer?
> > >
> > > and Andy has a work in progress here for more granular updates that
> > > might be of interest to your effort as well: "High Availability Apache
> > > Jena Fuseki"
> > >
> > > https://afs.github.io/rdf-delta/ha-fuseki.html
> > >
> > > On Wed, Sep 12, 2018 at 4:09 PM Markus Neumann <[email protected]> wrote:
> > >
> > >> Hi,
> > >>
> > >> we are running a Fuseki server that will hold about 2.2 * 10^9 triples
> > >> of meteorological data eventually.
> > >> I currently run it with "-Xmx80GB" on a 128GB server. The database is
> > >> TDB2 on a 900GB SSD.
> > >>
> > >> Now I face several performance issues:
> > >>
> > >> 1. Inserting data:
> > >>    It takes more than one hour to upload the measurements of a month
> > >>    (7.5GB .ttl file, ~16 million triples) using the data-upload
> > >>    web interface of Fuseki.
> > >>    Is there a way to do this faster?
> > >>
> > >> 2. Updating data:
> > >>    We get new model runs 5 times per day. This is data for the next
> > >>    10 days that needs to be updated every time.
> > >>    My idea was to create a named graph "forecast" that holds the
> > >>    latest version of this data.
> > >>    Every time a new model run arrives, I create a new temporary graph
> > >>    to upload the data to. Once this is finished, I move the temporary
> > >>    graph to "forecast".
> > >>    This seems to do the work twice, as it takes 1 hour for the upload
> > >>    and 1 hour for the move.
> > >>
> > >> Our data consists of the following:
> > >>
> > >> Locations (total 1607 -> 16070 triples):
> > >>
> > >> mm-locations:8500015 a mm:Location ;
> > >>     a geosparql:Geometry ;
> > >>     owl:sameAs <http://lod.opentransportdata.swiss/didok/8500015> ;
> > >>     geosparql:asWKT "POINT(7.61574425031 47.5425915732)"^^geosparql:wktLiteral ;
> > >>     mm:station_name "Basel SBB GB Ost" ;
> > >>     mm:abbreviation "BSGO" ;
> > >>     mm:didok_id 8500015 ;
> > >>     geo:lat 47.54259 ;
> > >>     geo:long 7.61574 ;
> > >>     mm:elevation 273 .
> > >>
> > >> Parameters (total 14 -> 56 triples):
> > >>
> > >> mm-parameters:t_2m:C a mm:Parameter ;
> > >>     rdfs:label "t_2m:C" ;
> > >>     dcterms:description "Air temperature at 2m above ground in degree Celsius"@en ;
> > >>     mm:unit_symbol "˚C" .
> > >>
> > >> Measurements (that is the huge bunch; per day 14 * 1607 * 48 ~ 1 million
> > >> measurements -> 5 million triples per day):
> > >>
> > >> mm-measurements:8500015_2018-09-02T00:00:00Z_t_2m:C a mm:Measurement ;
> > >>     mm:location mm-locations:8500015 ;
> > >>     mm:validdate "2018-09-02T00:00:00Z"^^xsd:dateTime ;
> > >>     mm:value 15.1 ;
> > >>     mm:parameter mm-parameters:t_2m:C .
> > >>
> > >> I would really appreciate it if someone could give me some advice on how
> > >> to handle these tasks or point out things I could do to optimize the
> > >> organization of the data.
> > >>
> > >> Many thanks and kind regards
> > >> Markus Neumann
> > >
> > > --
> > > ---
> > > Marco Neumann
> > > KONA
>
> --
> ---
> Marco Neumann
> KONA

--
---
Marco Neumann
KONA
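P.S. Markus, here is a sketch of the pre-populate-and-switch workflow I mean at the top. The directory and file names and the service commands are placeholders, and please double-check in the TDB optimizer documentation where stats.opt has to live for a TDB2 database (I believe it is the Data-0001 sub-directory) before relying on this:

DB_LIVE=/srv/linked_data_store/fuseki-server/run/databases/mm
DB_NEW=/srv/linked_data_store/fuseki-server/run/databases/mm-new

# 1. build the new database offline while Fuseki keeps serving the old one
tdb2.tdbloader --loc="$DB_NEW" forecast-run.ttl        # placeholder file name

# 2. generate the optimizer statistics
#    (tdb2.tdbstats in recent Jena releases; check the bin/ directory of your distribution)
tdb2.tdbstats --loc="$DB_NEW" > /tmp/stats.opt
mv /tmp/stats.opt "$DB_NEW"/Data-0001/stats.opt        # verify the location for TDB2

# 3. brief stop, swap directories, restart -- downtime is just the restart
# systemctl stop fuseki                                # placeholder command
mv "$DB_LIVE" "${DB_LIVE}-old"
mv "$DB_NEW" "$DB_LIVE"
# systemctl start fuseki

The only downtime left is the restart in step 3, and you can delete the -old directory once the new data looks good.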
