Hi,

spatial index creation fails. I tried to follow the documentation but got stuck: I can't find the jena.spatialindexer tool to build the index manually, and the index I specified in my config is not built when I load data with tdbloader.
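For reference, this is roughly the invocation I tried, modelled on jena.textindexer's flags (a sketch; the jar and assembler paths are from my setup, and the class does not seem to be on the Fuseki bundle's classpath, so presumably the jena-spatial jar would have to be added):

    # attempted manual spatial index build; --desc mirrors jena.textindexer
    java -cp "$FUSEKI_HOME/fuseki-server.jar:lib/jena-spatial.jar" \
        jena.spatialindexer --desc=run/config.ttl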
Any ideas?

> On 13.09.2018 at 19:48, Marco Neumann <[email protected]> wrote:
>
> to create the spatial index you can take a look at the "Building a Spatial
> Index" section in the "Spatial searches with SPARQL" documentation here
>
> https://jena.apache.org/documentation/query/spatial-query.html
>
> hint: if you don't get results for a spatial filter query that matches your
> data in the database, your data isn't spatially indexed correctly. there
> will be no error or the like in the result set though.
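>
> a quick smoke test from the command line (a sketch; service name,
> credentials and coordinates are taken from later in this thread, adjust
> to your own data):
>
>     curl --basic -u user:password localhost:3030/mm/query \
>         --data-urlencode 'query=
>             PREFIX spatial: <http://jena.apache.org/spatial#>
>             SELECT ?s WHERE { ?s spatial:nearby (47.54 7.61 10.0 "miles") } LIMIT 10'
>
> if that comes back empty for points you know are in the data, the spatial
> index is not being used.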
>
> On Thu, Sep 13, 2018 at 1:53 PM Markus Neumann <[email protected]> wrote:
>
>> Thanks for the links.
>>
>> How do I see whether the loader builds the spatial index? As far as I
>> understood the documentation, my config should produce the spatial index
>> in memory. I haven't figured that part out completely though:
>> When I start the database from scratch, the spatial indexing works. After
>> a restart I have to re-upload the stations file (which is no big deal, as
>> it is only 593K in size) to regenerate the index.
>> I couldn't get it to work with a persistent index file though.
>>
>> Right now I'm trying tdb2.tdbloader (didn't see that before) and it seems
>> to go even faster:
>>
>> 12:49:11 INFO loader :: Add: 41,000,000 2017-01-01_1M_30min.ttl (Batch: 67,980 / Avg: 62,995)
>> 12:49:11 INFO loader :: Elapsed: 650.84 seconds [2018/09/13 12:49:11 UTC]
>>
>> Is there a way to tell the loader that it should build the spatial index?
>>
>> Yes, we have to use the spatial filter eventually, so I would highly
>> appreciate some more information on the correct setup here.
>>
>> Many thanks.
>>
>>> On 13.09.2018 at 14:19, Marco Neumann <[email protected]> wrote:
>>>
>>> :-)
>>>
>>> this sounds much better Markus. now with regards to the optimizer please
>>> consult the online documentation here:
>>>
>>> https://jena.apache.org/documentation/tdb/optimizer.html
>>>
>>> (it's a very simple process to create the stats file and place it in the
>>> directory)
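>>>
>>> for example, with the command line tools from the apache-jena
>>> distribution (a sketch; note this is the TDB1 tooling, for a TDB2
>>> database check the docs for where the stats file lives):
>>>
>>>     # generate statistics into a temp file, then move it into the
>>>     # database directory so the optimizer picks it up
>>>     tdbstats --loc=/path/to/DB > /tmp/stats.opt
>>>     mv /tmp/stats.opt /path/to/DB/stats.opt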
>>>
>>> also did the loader index the spatial data? do your queries make use of
>>> the spatial filter?
>>>
>>> On Thu, Sep 13, 2018 at 12:59 PM Markus Neumann <[email protected]> wrote:
>>>
>>>> Marco,
>>>>
>>>> I just tried the tdbloader2 script with 1 month of data:
>>>>
>>>> INFO Total: 167,385,120 tuples : 1,143.55 seconds : 146,373.23 tuples/sec [2018/09/13 11:29:31 UTC]
>>>> 11:41:44 INFO Index Building Phase Completed
>>>> 11:41:46 INFO -- TDB Bulk Loader Finish
>>>> 11:41:46 INFO -- 1880 seconds
>>>>
>>>> That's already a lot better. I'm working on a way to reduce the amount
>>>> of data.
>>>> Can you give me a pointer on
>>>>> don't forget to run the tdb optimizer to generate the stats.opt file.
>>>> ? I haven't heard of that so far...
>>>>
>>>> A more general question:
>>>> Would there be a benefit in using the jena stack over the fuseki bundle,
>>>> as I'm doing now? (The documentation was not clear to me on that point.)
>>>> - If so: is there a guide on how to set it up?
>>>>
>>> fuseki makes use of the jena stack. think of the jena distribution as a
>>> kind of toolbox you can use to work with your different projects in
>>> addition to your fuseki endpoint.
>>>
>>> just make sure to configure the class path correctly
>>>
>>> https://jena.apache.org/documentation/tools/index.html
>>>
>>> also, further to the conversation with Rob, he has a valid point with
>>> regards to data corruption. please do not update a live tdb database
>>> instance directly with tdbloader while it's connected to a running fuseki
>>> endpoint.
>>>
>>> shut down the fuseki server first and then run the loader. or run the
>>> loader process in parallel into a different target directory and swap the
>>> data or the path again later on. I don't know if there is a hot swap
>>> option in fuseki to map to a new directory but a quick restart should do
>>> the trick.
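>>>
>>> sketched out (directory names illustrative, using the TDB2 loader since
>>> that is what you are on):
>>>
>>>     # bulk load into a fresh directory while the old database serves reads
>>>     tdb2.tdbloader --loc=databases/mm-new 2017-01-01_1M_30min.ttl
>>>     # stop fuseki, swap the directories, start fuseki again
>>>     mv databases/mm databases/mm-old
>>>     mv databases/mm-new databases/mm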
>>>>
>>>> Thanks and kind regards
>>>> Markus
>>>>
>>>>> On 13.09.2018 at 11:56, Marco Neumann <[email protected]> wrote:
>>>>>
>>>>> Rob, keeping fuseki live wasn't stated as a requirement for 1., so my
>>>>> advice stands. we are running similar updates with fresh data
>>>>> frequently.
>>>>>
>>>>> Markus, to keep fuseki downtime at a minimum you can pre-populate tdb
>>>>> into a temporary directory as well and later switch between
>>>>> directories. don't forget to run the tdb optimizer to generate the
>>>>> stats.opt file.
>>>>>
>>>>> On Thu, Sep 13, 2018 at 10:33 AM Rob Vesse <[email protected]> wrote:
>>>>>
>>>>>> I am not sure the tdbloader/tdbloader2 scripts help in this case. This
>>>>>> is an online update of a running Fuseki instance backed by TDB, from
>>>>>> what has been described.
>>>>>>
>>>>>> Since a TDB instance can only be safely used by a single JVM at a
>>>>>> time, using those scripts would not be a viable option here unless the
>>>>>> OP was willing to stop Fuseki during updates, as otherwise it would
>>>>>> either fail (because the built-in TDB mechanisms would prevent it) or
>>>>>> it would risk causing data corruption.
>>>>>>
>>>>>> Rob
>>>>>>
>>>>>> On 13/09/2018, 10:11, "Marco Neumann" <[email protected]> wrote:
>>>>>>
>>>>>> Markus, the tdbloader2 script is part of the apache-jena distribution.
>>>>>>
>>>>>> let me know how you get on and how this improves your data load
>>>>>> process.
>>>>>>
>>>>>> Marco
>>>>>>
>>>>>> On Thu, Sep 13, 2018 at 9:58 AM Markus Neumann <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Marco,
>>>>>>>
>>>>>>> as this is a project for a customer, I'm afraid we can't make the
>>>>>>> data public.
>>>>>>>
>>>>>>> 1. I'm running Fuseki-3.8.0 with the following configuration:
>>>>>>>
>>>>>>> @prefix : <http://base/#> .
>>>>>>> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>>>>>>> @prefix tdb2: <http://jena.apache.org/2016/tdb#> .
>>>>>>> @prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
>>>>>>> @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
>>>>>>> @prefix fuseki: <http://jena.apache.org/fuseki#> .
>>>>>>> @prefix spatial: <http://jena.apache.org/spatial#> .
>>>>>>> @prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
>>>>>>> @prefix geosparql: <http://www.opengis.net/ont/geosparql#> .
>>>>>>>
>>>>>>> :service_tdb_all a fuseki:Service ;
>>>>>>>     rdfs:label "TDB2 mm" ;
>>>>>>>     fuseki:dataset :spatial_dataset ;
>>>>>>>     fuseki:name "mm" ;
>>>>>>>     fuseki:serviceQuery "query" , "sparql" ;
>>>>>>>     fuseki:serviceReadGraphStore "get" ;
>>>>>>>     fuseki:serviceReadWriteGraphStore "data" ;
>>>>>>>     fuseki:serviceUpdate "update" ;
>>>>>>>     fuseki:serviceUpload "upload" .
>>>>>>>
>>>>>>> :spatial_dataset a spatial:SpatialDataset ;
>>>>>>>     spatial:dataset :tdb_dataset_readwrite ;
>>>>>>>     spatial:index <#indexLucene> ;
>>>>>>>     .
>>>>>>>
>>>>>>> <#indexLucene> a spatial:SpatialIndexLucene ;
>>>>>>>     #spatial:directory <file:Lucene> ;
>>>>>>>     spatial:directory "mem" ;
>>>>>>>     spatial:definition <#definition> ;
>>>>>>>     .
>>>>>>>
>>>>>>> <#definition> a spatial:EntityDefinition ;
>>>>>>>     spatial:entityField "uri" ;
>>>>>>>     spatial:geoField "geo" ;
>>>>>>>     # custom geo predicates for 1) Latitude/Longitude format
>>>>>>>     spatial:hasSpatialPredicatePairs (
>>>>>>>         [ spatial:latitude geo:lat ; spatial:longitude geo:long ]
>>>>>>>     ) ;
>>>>>>>     # custom geo predicates for 2) Well Known Text (WKT) literal
>>>>>>>     spatial:hasWKTPredicates (geosparql:asWKT) ;
>>>>>>>     # custom SpatialContextFactory for 2) Well Known Text (WKT) literal
>>>>>>>     spatial:spatialContextFactory
>>>>>>>         # "com.spatial4j.core.context.jts.JtsSpatialContextFactory"
>>>>>>>         "org.locationtech.spatial4j.context.jts.JtsSpatialContextFactory"
>>>>>>>     .
>>>>>>>
>>>>>>> :tdb_dataset_readwrite
>>>>>>>     a tdb2:DatasetTDB2 ;
>>>>>>>     tdb2:location "/srv/linked_data_store/fuseki-server/run/databases/mm" .
>>>>>>>
>>>>>>> I've been through the Fuseki documentation several times, but I find
>>>>>>> it still a bit confusing. I would highly appreciate it if you could
>>>>>>> point me to other resources.
>>>>>>>
>>>>>>> I have not found the tdbloader in the fuseki repo. For now I use a
>>>>>>> small shell script that wraps curl to upload the data:
>>>>>>>
>>>>>>> if [ -n "$2" ]
>>>>>>> then
>>>>>>>     ADD="?graph=http://rdf.meteomatics.com/mm/graphs/$2"
>>>>>>> fi
>>>>>>> curl --basic -u user:password -X POST -F "filename=@$1" \
>>>>>>>     "localhost:3030/mm/data${ADD}"
>>>>>>>
>>>>>>> 2. Our customer has not specified a default use case yet, as the
>>>>>>> whole RDF concept is about as new to them as it is to me. I suppose
>>>>>>> it will be something like "Find all locations in a certain radius
>>>>>>> that have nice weather next Saturday".
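>>>>>>>
>>>>>>> In SPARQL that would presumably look something like this (a sketch;
>>>>>>> the mm: prefix URI is a placeholder, and the weather condition would
>>>>>>> come from joining mm:Measurement via mm:location):
>>>>>>>
>>>>>>>     curl --basic -u user:password localhost:3030/mm/query \
>>>>>>>         --data-urlencode 'query=
>>>>>>>             PREFIX spatial: <http://jena.apache.org/spatial#>
>>>>>>>             # placeholder prefix, substitute the real schema URI
>>>>>>>             PREFIX mm: <http://rdf.meteomatics.com/mm/schema#>
>>>>>>>             SELECT ?loc ?name WHERE {
>>>>>>>                 ?loc spatial:nearby (47.54 7.61 10.0 "miles") .
>>>>>>>                 ?loc mm:station_name ?name .
>>>>>>>             } LIMIT 100'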
>>>>>>>
>>>>>>> I just took a glance at the ha-fuseki page and will give it a try
>>>>>>> later.
>>>>>>>
>>>>>>> Many thanks for your time
>>>>>>>
>>>>>>> Best
>>>>>>> Markus
>>>>>>>
>>>>>>>> On 13.09.2018 at 10:00, Marco Neumann <[email protected]> wrote:
>>>>>>>>
>>>>>>>> do you make the data endpoint publicly available?
>>>>>>>>
>>>>>>>> 1. did you try the tdbloader, and what version of tdb2 do you use?
>>>>>>>>
>>>>>>>> 2. many ways to improve your response time here. what does a typical
>>>>>>>> query look like? do you make use of the spatial indexer?
>>>>>>>>
>>>>>>>> and Andy has a work in progress here for more granular updates that
>>>>>>>> might be of interest to your effort as well: "High Availability
>>>>>>>> Apache Jena Fuseki"
>>>>>>>>
>>>>>>>> https://afs.github.io/rdf-delta/ha-fuseki.html
>>>>>>>>
>>>>>>>> On Wed, Sep 12, 2018 at 4:09 PM Markus Neumann <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> we are running a Fuseki server that will eventually hold about
>>>>>>>>> 2.2 * 10^9 triples of meteorological data.
>>>>>>>>> I currently run it with "-Xmx80GB" on a 128GB server. The database
>>>>>>>>> is TDB2 on a 900GB SSD.
>>>>>>>>>
>>>>>>>>> Now I face several performance issues:
>>>>>>>>>
>>>>>>>>> 1. Inserting data:
>>>>>>>>> It takes more than one hour to upload the measurements of a month
>>>>>>>>> (7.5GB .ttl file, ~16 million triples) using the data-upload web
>>>>>>>>> interface of fuseki.
>>>>>>>>> Is there a way to do this faster?
>>>>>>>>>
>>>>>>>>> 2. Updating data:
>>>>>>>>> We get new model runs 5 times per day. This is data for the next
>>>>>>>>> 10 days, which needs to be updated every time.
>>>>>>>>> My idea was to create a named graph "forecast" that holds the
>>>>>>>>> latest version of this data.
>>>>>>>>> Every time a new model run arrives, I create a new temporary graph
>>>>>>>>> to upload the data to. Once this is finished, I move the temporary
>>>>>>>>> graph to "forecast".
>>>>>>>>> This seems to do the work twice, as it takes 1 hour for the upload
>>>>>>>>> and 1 hour for the move.
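>>>>>>>>>
>>>>>>>>> Roughly what the update does today (a sketch; the graph names
>>>>>>>>> follow the pattern from my upload script, and the MOVE is
>>>>>>>>> presumably a full copy plus delete inside the store, which is why
>>>>>>>>> it costs another hour):
>>>>>>>>>
>>>>>>>>>     # upload the new model run into a temporary graph
>>>>>>>>>     curl --basic -u user:password -X POST -F "filename=@forecast.ttl" \
>>>>>>>>>         "localhost:3030/mm/data?graph=http://rdf.meteomatics.com/mm/graphs/tmp"
>>>>>>>>>     # then replace the old forecast graph with it
>>>>>>>>>     curl --basic -u user:password localhost:3030/mm/update \
>>>>>>>>>         --data-urlencode 'update=MOVE <http://rdf.meteomatics.com/mm/graphs/tmp>
>>>>>>>>>             TO <http://rdf.meteomatics.com/mm/graphs/forecast>'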
>>>>>>>>>
>>>>>>>>> Our data consists of the following:
>>>>>>>>>
>>>>>>>>> Locations (total 1607 -> 16070 triples):
>>>>>>>>>
>>>>>>>>> mm-locations:8500015 a mm:Location ;
>>>>>>>>>     a geosparql:Geometry ;
>>>>>>>>>     owl:sameAs <http://lod.opentransportdata.swiss/didok/8500015> ;
>>>>>>>>>     geosparql:asWKT "POINT(7.61574425031 47.5425915732)"^^geosparql:wktLiteral ;
>>>>>>>>>     mm:station_name "Basel SBB GB Ost" ;
>>>>>>>>>     mm:abbreviation "BSGO" ;
>>>>>>>>>     mm:didok_id 8500015 ;
>>>>>>>>>     geo:lat 47.54259 ;
>>>>>>>>>     geo:long 7.61574 ;
>>>>>>>>>     mm:elevation 273 .
>>>>>>>>>
>>>>>>>>> Parameters (total 14 -> 56 triples):
>>>>>>>>>
>>>>>>>>> mm-parameters:t_2m:C a mm:Parameter ;
>>>>>>>>>     rdfs:label "t_2m:C" ;
>>>>>>>>>     dcterms:description "Air temperature at 2m above ground in degree Celsius"@en ;
>>>>>>>>>     mm:unit_symbol "˚C" .
>>>>>>>>>
>>>>>>>>> Measurements (the huge bunch; per day: 14 * 1607 * 48 ~ 1 million
>>>>>>>>> measurements -> 5 million triples per day):
>>>>>>>>>
>>>>>>>>> mm-measurements:8500015_2018-09-02T00:00:00Z_t_2m:C a mm:Measurement ;
>>>>>>>>>     mm:location mm-locations:8500015 ;
>>>>>>>>>     mm:validdate "2018-09-02T00:00:00Z"^^xsd:dateTime ;
>>>>>>>>>     mm:value 15.1 ;
>>>>>>>>>     mm:parameter mm-parameters:t_2m:C .
>>>>>>>>>
>>>>>>>>> I would really appreciate it if someone could give me some advice
>>>>>>>>> on how to handle these tasks, or point out things I could do to
>>>>>>>>> optimize the organization of the data.
>>>>>>>>>
>>>>>>>>> Many thanks and kind regards
>>>>>>>>> Markus Neumann
>
> --
> ---
> Marco Neumann
> KONA