Thanks for the links. How do I see whether the loader builds the spatial index? As far as I understood the documentation, my config should produce the spatial index in memory. I haven't figured that part out completely though: when I start the database from scratch, the spatial indexing works. After a restart I have to re-upload the stations file (which is no big deal, as it is only 593K in size) to regenerate the index. I couldn't get it to work with a persistent index file though.
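My understanding from the docs is that a persistent index means pointing spatial:directory at a file: URI instead of "mem", along the lines of the commented-out line in my config further down this thread. A sketch only, with an illustrative path:

<#indexLucene> a spatial:SpatialIndexLucene ;
    # "mem" keeps the Lucene index in memory only, so it is lost on restart;
    # a file: URI should keep the index on disk instead (path is illustrative).
    spatial:directory <file:/srv/linked_data_store/fuseki-server/run/Lucene> ;
    spatial:definition <#definition> .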
Right now I'm trying the tdb2.tdbloader (I didn't see that before) and it seems to go even faster:

12:49:11 INFO loader :: Add: 41,000,000 2017-01-01_1M_30min.ttl (Batch: 67,980 / Avg: 62,995)
12:49:11 INFO loader :: Elapsed: 650.84 seconds [2018/09/13 12:49:11 UTC]

Is there a way to tell the loader that it should build the spatial index? Yes, we have to use the spatial filter eventually, so I would highly appreciate some more information on the correct setup here.

Many thanks.

> On 13.09.2018 at 14:19, Marco Neumann <[email protected]> wrote:
>
> :-)
>
> this sounds much better Markus. now with regards to the optimizer please consult the online documentation here:
>
> https://jena.apache.org/documentation/tdb/optimizer.html
> (it's a very simple process to create the stats file and place it in the directory)
>
> also did the loader index the spatial data? do your queries make use of the spatial filter?
>
> On Thu, Sep 13, 2018 at 12:59 PM Markus Neumann <[email protected]> wrote:
>
>> Marco,
>>
>> I just tried the tdbloader2 script with 1 month of data:
>>
>> INFO Total: 167,385,120 tuples : 1,143.55 seconds : 146,373.23 tuples/sec [2018/09/13 11:29:31 UTC]
>> 11:41:44 INFO Index Building Phase Completed
>> 11:41:46 INFO -- TDB Bulk Loader Finish
>> 11:41:46 INFO -- 1880 seconds
>>
>> That's already a lot better. I'm working on a way to reduce the amount of data.
>> Can you give me a pointer on
>>> don't forget to run the tdb optimizer to generate the stats.opt file.
>> ? I haven't heard of that so far...
>>
>> A more general question:
>> Would there be a benefit in using the Jena stack over the Fuseki bundle as I'm doing now? (The documentation was not clear to me on that point.)
>> If so: is there a guide on how to set it up?
>>
> fuseki makes use of the jena stack. think of the jena distribution as a kind of toolbox you can use to work with your different projects in addition to your fuseki endpoint.
>
> just make sure to configure the class path correctly
>
> https://jena.apache.org/documentation/tools/index.html
>
> Also, further to the conversation with Rob, he has a valid point with regards to data corruption. please do not update a live tdb database instance directly with tdbloader while it's connected to a running fuseki endpoint.
>
> shut down the fuseki server first and then run the loader, or run the loader process in parallel into a different target directory and swap the data or the path later on. I don't know if there is a hot-swap option in fuseki to map to a new directory, but a quick restart should do the trick.
>
>> Thanks and kind regards
>> Markus
>>
>>> On 13.09.2018 at 11:56, Marco Neumann <[email protected]> wrote:
>>>
>>> Rob, keeping fuseki live wasn't stated as a requirement for 1., so my advice stands. we are running similar updates with fresh data frequently.
>>>
>>> Markus, to keep fuseki downtime at a minimum you can pre-populate tdb into a temporary directory as well and later switch between directories. don't forget to run the tdb optimizer to generate the stats.opt file.
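For concreteness, generating the stats file for my TDB2 database would presumably look like this. A sketch only: it assumes the tdb2.tdbstats tool from the apache-jena distribution is on the PATH, that Fuseki is stopped, and that for TDB2 the file belongs in the active Data-000X sub-directory rather than the database root:

# Generate statistics for the query optimizer (Fuseki must not be
# running, since a TDB database may only be open in one JVM at a time).
DB=/srv/linked_data_store/fuseki-server/run/databases/mm
tdb2.tdbstats --loc="$DB" > /tmp/stats.opt
# Assumption: for TDB2 the stats file goes into the current Data-000X
# sub-directory, here Data-0001.
mv /tmp/stats.opt "$DB/Data-0001/stats.opt"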
>>>
>>> On Thu, Sep 13, 2018 at 10:33 AM Rob Vesse <[email protected]> wrote:
>>>
>>>> I am not sure the tdbloader/tdbloader2 scripts help in this case. This is an online update of a running Fuseki instance backed by TDB, from what has been described.
>>>>
>>>> Since a TDB instance can only be safely used by a single JVM at a time, using those scripts would not be a viable option here unless the OP was willing to stop Fuseki during updates, as otherwise it would either fail (because the built-in TDB mechanisms would prevent it) or it would risk causing data corruption.
>>>>
>>>> Rob
>>>>
>>>> On 13/09/2018, 10:11, "Marco Neumann" <[email protected]> wrote:
>>>>
>>>> Markus, the tdbloader2 script is part of the apache-jena distribution.
>>>>
>>>> let me know how you get on and how this improves your data load process.
>>>>
>>>> Marco
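For reference, running the bulk loader offline looks roughly like this. A sketch: it assumes the apache-jena tools are on the PATH and uses the database location from my config below; per Rob's warning above, Fuseki has to be stopped first:

# Offline bulk load into the TDB2 database behind Fuseki
# (stop Fuseki first; one JVM per TDB database).
DB=/srv/linked_data_store/fuseki-server/run/databases/mm
tdb2.tdbloader --loc="$DB" 2017-01-01_1M_30min.ttl
# Alternatively, load into a fresh directory and switch directories
# before restarting Fuseki, as Marco suggests above.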
>>>>
>>>> On Thu, Sep 13, 2018 at 9:58 AM Markus Neumann <[email protected]> wrote:
>>>>
>>>>> Hi Marco,
>>>>>
>>>>> as this is a project for a customer, I'm afraid we can't make the data public.
>>>>>
>>>>> 1. I'm running Fuseki-3.8.0 with the following configuration:
>>>>>
>>>>> @prefix :          <http://base/#> .
>>>>> @prefix rdf:       <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>>>>> @prefix tdb2:      <http://jena.apache.org/2016/tdb#> .
>>>>> @prefix ja:        <http://jena.hpl.hp.com/2005/11/Assembler#> .
>>>>> @prefix rdfs:      <http://www.w3.org/2000/01/rdf-schema#> .
>>>>> @prefix fuseki:    <http://jena.apache.org/fuseki#> .
>>>>> @prefix spatial:   <http://jena.apache.org/spatial#> .
>>>>> @prefix geo:       <http://www.w3.org/2003/01/geo/wgs84_pos#> .
>>>>> @prefix geosparql: <http://www.opengis.net/ont/geosparql#> .
>>>>>
>>>>> :service_tdb_all a fuseki:Service ;
>>>>>     rdfs:label "TDB2 mm" ;
>>>>>     fuseki:dataset :spatial_dataset ;
>>>>>     fuseki:name "mm" ;
>>>>>     fuseki:serviceQuery "query" , "sparql" ;
>>>>>     fuseki:serviceReadGraphStore "get" ;
>>>>>     fuseki:serviceReadWriteGraphStore "data" ;
>>>>>     fuseki:serviceUpdate "update" ;
>>>>>     fuseki:serviceUpload "upload" .
>>>>>
>>>>> :spatial_dataset a spatial:SpatialDataset ;
>>>>>     spatial:dataset :tdb_dataset_readwrite ;
>>>>>     spatial:index <#indexLucene> .
>>>>>
>>>>> <#indexLucene> a spatial:SpatialIndexLucene ;
>>>>>     # spatial:directory <file:Lucene> ;
>>>>>     spatial:directory "mem" ;
>>>>>     spatial:definition <#definition> .
>>>>>
>>>>> <#definition> a spatial:EntityDefinition ;
>>>>>     spatial:entityField "uri" ;
>>>>>     spatial:geoField "geo" ;
>>>>>     # custom geo predicates for 1) latitude/longitude format
>>>>>     spatial:hasSpatialPredicatePairs (
>>>>>         [ spatial:latitude geo:lat ; spatial:longitude geo:long ]
>>>>>     ) ;
>>>>>     # custom geo predicates for 2) Well Known Text (WKT) literal
>>>>>     spatial:hasWKTPredicates (geosparql:asWKT) ;
>>>>>     # custom SpatialContextFactory for 2) Well Known Text (WKT) literal
>>>>>     # "com.spatial4j.core.context.jts.JtsSpatialContextFactory"
>>>>>     spatial:spatialContextFactory "org.locationtech.spatial4j.context.jts.JtsSpatialContextFactory" .
>>>>>
>>>>> :tdb_dataset_readwrite a tdb2:DatasetTDB2 ;
>>>>>     tdb2:location "/srv/linked_data_store/fuseki-server/run/databases/mm" .
>>>>>
>>>>> I've been through the Fuseki documentation several times, but I still find it a bit confusing. I would highly appreciate it if you could point me to other resources.
>>>>>
>>>>> I have not found the tdbloader in the fuseki repo. For now I use a small shell script that wraps curl to upload the data:
>>>>>
>>>>> if [ -n "$2" ]
>>>>> then
>>>>>     ADD="?graph=http://rdf.meteomatics.com/mm/graphs/$2"
>>>>> fi
>>>>> curl --basic -u user:password -X POST -F "filename=@$1" localhost:3030/mm/data${ADD}
>>>>>
>>>>> 2. Our customer has not specified a default use case yet, as the whole RDF concept is about as new to them as it is to me. I suppose it will be something like "find all locations in a certain radius that have nice weather next Saturday".
>>>>>
>>>>> I just took a glance at the ha-fuseki page and will give it a try later.
>>>>>
>>>>> Many thanks for your time
>>>>>
>>>>> Best
>>>>> Markus
>>>>>
>>>>>> On 13.09.2018 at 10:00, Marco Neumann <[email protected]> wrote:
>>>>>>
>>>>>> do you make the data endpoint publicly available?
>>>>>>
>>>>>> 1. did you try the tdbloader? what version of tdb2 do you use?
>>>>>>
>>>>>> 2. there are many ways to improve your response time here. what does a typical query look like? do you make use of the spatial indexer?
>>>>>>
>>>>>> and Andy has a work in progress here for more granular updates that might be of interest to your effort as well: "High Availability Apache Jena Fuseki"
>>>>>>
>>>>>> https://afs.github.io/rdf-delta/ha-fuseki.html
>>>>>>
>>>>>> On Wed, Sep 12, 2018 at 4:09 PM Markus Neumann <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> we are running a Fuseki server that will eventually hold about 2.2 * 10^9 triples of meteorological data.
>>>>>>> I currently run it with "-Xmx80GB" on a 128GB server. The database is TDB2 on a 900GB SSD.
>>>>>>>
>>>>>>> Now I face several performance issues:
>>>>>>>
>>>>>>> 1. Inserting data:
>>>>>>> It takes more than one hour to upload the measurements of a month (a 7.5GB .ttl file, ~16 million triples) using the data-upload web interface of Fuseki.
>>>>>>> Is there a way to do this faster?
>>>>>>>
>>>>>>> 2. Updating data:
>>>>>>> We get new model runs 5 times per day. This is data for the next 10 days that needs to be updated every time.
>>>>>>> My idea was to create a named graph "forecast" that holds the latest version of this data.
>>>>>>> Every time a new model run arrives, I create a new temporary graph to upload the data to. Once this is finished, I move the temporary graph to "forecast".
>>>>>>> This seems to do the work twice, as it takes 1 hour for the upload and 1 hour for the move.
>>>>>>>
>>>>>>> Our data consists of the following:
>>>>>>>
>>>>>>> Locations (total 1607 -> 16070 triples):
>>>>>>>
>>>>>>> mm-locations:8500015 a mm:Location ;
>>>>>>>     a geosparql:Geometry ;
>>>>>>>     owl:sameAs <http://lod.opentransportdata.swiss/didok/8500015> ;
>>>>>>>     geosparql:asWKT "POINT(7.61574425031 47.5425915732)"^^geosparql:wktLiteral ;
>>>>>>>     mm:station_name "Basel SBB GB Ost" ;
>>>>>>>     mm:abbreviation "BSGO" ;
>>>>>>>     mm:didok_id 8500015 ;
>>>>>>>     geo:lat 47.54259 ;
>>>>>>>     geo:long 7.61574 ;
>>>>>>>     mm:elevation 273 .
>>>>>>>
>>>>>>> Parameters (total 14 -> 56 triples):
>>>>>>>
>>>>>>> mm-parameters:t_2m:C a mm:Parameter ;
>>>>>>>     rdfs:label "t_2m:C" ;
>>>>>>>     dcterms:description "Air temperature at 2m above ground in degree Celsius"@en ;
>>>>>>>     mm:unit_symbol "˚C" .
>>>>>>>
>>>>>>> Measurements (this is the huge bunch; per day: 14 * 1607 * 48 ~ 1 million measurements -> ~5 million triples):
>>>>>>>
>>>>>>> mm-measurements:8500015_2018-09-02T00:00:00Z_t_2m:C a mm:Measurement ;
>>>>>>>     mm:location mm-locations:8500015 ;
>>>>>>>     mm:validdate "2018-09-02T00:00:00Z"^^xsd:dateTime ;
>>>>>>>     mm:value 15.1 ;
>>>>>>>     mm:parameter mm-parameters:t_2m:C .
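For the radius use case, a query against the spatial index might look like the sketch below. It uses the spatial:withinCircle property function from jena-spatial; the mm: and mm-parameters: namespace URIs are placeholders (the real ones aren't shown in this thread), and the coordinates, radius, and date are illustrative:

PREFIX spatial: <http://jena.apache.org/spatial#>
PREFIX xsd:     <http://www.w3.org/2001/XMLSchema#>
# Placeholder namespaces -- the real mm: URIs are not shown in the thread.
PREFIX mm:            <http://rdf.meteomatics.com/mm/ontology#>
PREFIX mm-parameters: <http://rdf.meteomatics.com/mm/parameters/>

SELECT ?station ?value
WHERE {
  # Stations within 10 km of a point (arguments: lat, lon, radius, units).
  ?station spatial:withinCircle (47.54 7.62 10.0 'km') .
  # Their 2m temperature for one valid date, per the data model above.
  ?m a mm:Measurement ;
     mm:location ?station ;
     mm:parameter mm-parameters:t_2m:C ;
     mm:validdate "2018-09-08T12:00:00Z"^^xsd:dateTime ;
     mm:value ?value .
}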
>>>>>>>
>>>>>>> I would really appreciate it if someone could give me some advice on how to handle these tasks, or point out things I could do to optimize the organization of the data.
>>>>>>>
>>>>>>> Many thanks and kind regards
>>>>>>> Markus Neumann
>
> --
>
> ---
> Marco Neumann
> KONA
