Rob, keeping Fuseki live wasn't stated as a requirement for 1., so my advice stands. We are running similar updates with fresh data frequently.
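For 1., this is roughly what I have in mind -- only a sketch: the .ttl file name and the service commands are placeholders for whatever you use, and it assumes Fuseki is stopped while the loader runs, since (as Rob notes below) a TDB database must only be used by one JVM at a time:

# stop Fuseki first, e.g. via your service manager (placeholder command):
# systemctl stop fuseki

# bulk load straight into the TDB2 database named in the assembler config;
# this is much faster than uploading through the HTTP endpoint
DB=/srv/linked_data_store/fuseki-server/run/databases/mm
tdb2.tdbloader --loc="$DB" measurements-2018-09.ttl    # placeholder file name

# start Fuseki again
# systemctl start fuseki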
Markus, to keep Fuseki downtime to a minimum you can pre-populate TDB into a temporary directory as well and later switch between directories (rough sketch at the very bottom of this mail, below the quoted thread). Don't forget to run the TDB optimizer to generate the stats.opt file.

On Thu, Sep 13, 2018 at 10:33 AM Rob Vesse <[email protected]> wrote:

> I am not sure the tdbloader/tdbloader2 scripts help in this case. This is an
> online update of a running Fuseki instance backed by TDB from what has been
> described.
>
> Since a TDB instance can only be safely used by a single JVM at a time,
> using those scripts would not be a viable option here unless the OP was
> willing to stop Fuseki during updates, as otherwise it would either fail
> (because the built-in TDB mechanisms would prevent it) or it would risk
> causing data corruption.
>
> Rob
>
> On 13/09/2018, 10:11, "Marco Neumann" <[email protected]> wrote:
>
> Markus, the tdbloader2 script is part of the apache-jena distribution.
>
> let me know how you get on and how this improves your data load process.
>
> Marco
>
> On Thu, Sep 13, 2018 at 9:58 AM Markus Neumann <[email protected]> wrote:
>
> > Hi Marco,
> >
> > as this is a project for a customer, I'm afraid we can't make the data
> > public.
> >
> > 1. I'm running Fuseki-3.8.0 with the following configuration:
> >
> > @prefix :         <http://base/#> .
> > @prefix rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
> > @prefix tdb2:     <http://jena.apache.org/2016/tdb#> .
> > @prefix ja:       <http://jena.hpl.hp.com/2005/11/Assembler#> .
> > @prefix rdfs:     <http://www.w3.org/2000/01/rdf-schema#> .
> > @prefix fuseki:   <http://jena.apache.org/fuseki#> .
> > @prefix spatial:  <http://jena.apache.org/spatial#> .
> > @prefix geo:      <http://www.w3.org/2003/01/geo/wgs84_pos#> .
> > @prefix geosparql: <http://www.opengis.net/ont/geosparql#> .
> >
> > :service_tdb_all a fuseki:Service ;
> >     rdfs:label "TDB2 mm" ;
> >     fuseki:dataset :spatial_dataset ;
> >     fuseki:name "mm" ;
> >     fuseki:serviceQuery "query" , "sparql" ;
> >     fuseki:serviceReadGraphStore "get" ;
> >     fuseki:serviceReadWriteGraphStore "data" ;
> >     fuseki:serviceUpdate "update" ;
> >     fuseki:serviceUpload "upload" .
> >
> > :spatial_dataset a spatial:SpatialDataset ;
> >     spatial:dataset :tdb_dataset_readwrite ;
> >     spatial:index <#indexLucene> ;
> >     .
> >
> > <#indexLucene> a spatial:SpatialIndexLucene ;
> >     #spatial:directory <file:Lucene> ;
> >     spatial:directory "mem" ;
> >     spatial:definition <#definition> ;
> >     .
> >
> > <#definition> a spatial:EntityDefinition ;
> >     spatial:entityField "uri" ;
> >     spatial:geoField "geo" ;
> >     # custom geo predicates for 1) Latitude/Longitude Format
> >     spatial:hasSpatialPredicatePairs (
> >         [ spatial:latitude geo:lat ; spatial:longitude geo:long ]
> >     ) ;
> >     # custom geo predicates for 2) Well Known Text (WKT) Literal
> >     spatial:hasWKTPredicates (geosparql:asWKT) ;
> >     # custom SpatialContextFactory for 2) Well Known Text (WKT) Literal
> >     spatial:spatialContextFactory
> >         # "com.spatial4j.core.context.jts.JtsSpatialContextFactory"
> >         "org.locationtech.spatial4j.context.jts.JtsSpatialContextFactory"
> >     .
> >
> > :tdb_dataset_readwrite
> >     a tdb2:DatasetTDB2 ;
> >     tdb2:location "/srv/linked_data_store/fuseki-server/run/databases/mm" .
> >
> > I've been through the Fuseki documentation several times, but I still find
> > it a bit confusing. I would highly appreciate it if you could point me to
> > other resources.
> >
> > I have not found the tdbloader in the fuseki repo. For now I use a small
> > shell script that wraps curl to upload the data:
> >
> > if [ ! -z $2 ]
> > then
> >     ADD="?graph=http://rdf.meteomatics.com/mm/graphs/$2"
> > fi
> > curl --basic -u user:password -X POST -F "filename=@$1" localhost:3030/mm/data${ADD}
> >
> > 2. Our customer has not specified a default use case yet, as the whole RDF
> > concept is about as new to them as it is to me. I suppose it will be
> > something like "Find all locations in a certain radius that have nice
> > weather next Saturday".
> >
> > I just took a glance at the ha-fuseki page and will give it a try later.
> >
> > Many thanks for your time
> >
> > Best
> > Markus
> >
> > > On 13.09.2018 at 10:00, Marco Neumann <[email protected]> wrote:
> > >
> > > do you make the data endpoint publicly available?
> > >
> > > 1. did you try the tdbloader, what version of tdb2 do you use?
> > >
> > > 2. many ways to improve your response time here. what does a typical
> > > query look like? do you make use of the spatial indexer?
> > >
> > > and Andy has a work in progress here for more granular updates that
> > > might be of interest to your effort as well: "High Availability Apache
> > > Jena Fuseki"
> > >
> > > https://afs.github.io/rdf-delta/ha-fuseki.html
> > >
> > > On Wed, Sep 12, 2018 at 4:09 PM Markus Neumann <[email protected]> wrote:
> > >
> > >> Hi,
> > >>
> > >> we are running a Fuseki server that will hold about 2.2 * 10^9 triples
> > >> of meteorological data eventually.
> > >> I currently run it with "-Xmx80GB" on a 128GB server. The database is
> > >> TDB2 on a 900GB SSD.
> > >>
> > >> Now I face several performance issues:
> > >>
> > >> 1. Inserting data:
> > >>    It takes more than one hour to upload the measurements of a month
> > >>    (7.5GB .ttl file, ~16 million triples) using the data-upload
> > >>    web interface of Fuseki.
> > >>    Is there a way to do this faster?
> > >>
> > >> 2. Updating data:
> > >>    We get new model runs 5 times per day. This is data for the next
> > >>    10 days that needs to be updated every time.
> > >>    My idea was to create a named graph "forecast" that holds the
> > >>    latest version of this data.
> > >>    Every time a new model run arrives, I create a new temporary graph
> > >>    to upload the data to. Once this is finished, I move the temporary
> > >>    graph to "forecast".
> > >>    This seems to do the work twice, as it takes 1 hour for the upload
> > >>    and 1 hour for the move.
> > >>
> > >> Our data consists of the following:
> > >>
> > >> Locations (total 1607 -> 16070 triples):
> > >>
> > >> mm-locations:8500015 a mm:Location ;
> > >>     a geosparql:Geometry ;
> > >>     owl:sameAs <http://lod.opentransportdata.swiss/didok/8500015> ;
> > >>     geosparql:asWKT "POINT(7.61574425031 47.5425915732)"^^geosparql:wktLiteral ;
> > >>     mm:station_name "Basel SBB GB Ost" ;
> > >>     mm:abbreviation "BSGO" ;
> > >>     mm:didok_id 8500015 ;
> > >>     geo:lat 47.54259 ;
> > >>     geo:long 7.61574 ;
> > >>     mm:elevation 273 .
> > >>
> > >> Parameters (total 14 -> 56 triples):
> > >>
> > >> mm-parameters:t_2m:C a mm:Parameter ;
> > >>     rdfs:label "t_2m:C" ;
> > >>     dcterms:description "Air temperature at 2m above ground in degree Celsius"@en ;
> > >>     mm:unit_symbol "˚C" .
> > >>
> > >> Measurements (that is the huge bunch; per day 14 * 1607 * 48 ~ 1 million
> > >> measurements -> 5 million triples per day):
> > >>
> > >> mm-measurements:8500015_2018-09-02T00:00:00Z_t_2m:C a mm:Measurement ;
> > >>     mm:location mm-locations:8500015 ;
> > >>     mm:validdate "2018-09-02T00:00:00Z"^^xsd:dateTime ;
> > >>     mm:value 15.1 ;
> > >>     mm:parameter mm-parameters:t_2m:C .
> > >>
> > >> I would really appreciate it if someone could give me some advice on how
> > >> to handle these tasks or point out things I could do to optimize the
> > >> organization of the data.
> > >>
> > >> Many thanks and kind regards
> > >> Markus Neumann
> > >
> > > --
> > > ---
> > > Marco Neumann
> > > KONA
>
> --
> ---
> Marco Neumann
> KONA

--
---
Marco Neumann
KONA
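P.S. Markus, here is a sketch of the pre-populate-and-switch workflow I mean at the top. The directory and file names and the service commands are placeholders, and please double-check in the TDB optimizer documentation where stats.opt has to live for a TDB2 database (I believe it is the Data-0001 sub-directory) before relying on this:

DB_LIVE=/srv/linked_data_store/fuseki-server/run/databases/mm
DB_NEW=/srv/linked_data_store/fuseki-server/run/databases/mm-new

# 1. build the new database offline while Fuseki keeps serving the old one
tdb2.tdbloader --loc="$DB_NEW" forecast-run.ttl        # placeholder file name

# 2. generate the optimizer statistics
#    (tdb2.tdbstats in recent Jena releases; check the bin/ directory of your distribution)
tdb2.tdbstats --loc="$DB_NEW" > /tmp/stats.opt
mv /tmp/stats.opt "$DB_NEW"/Data-0001/stats.opt        # verify the location for TDB2

# 3. brief stop, swap directories, restart -- downtime is just the restart
# systemctl stop fuseki                                # placeholder command
mv "$DB_LIVE" "${DB_LIVE}-old"
mv "$DB_NEW" "$DB_LIVE"
# systemctl start fuseki

The only downtime left is the restart in step 3, and you can delete the -old directory once the new data looks good.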
