I got the jar from https://mvnrepository.com/artifact/org.apache.jena/jena-spatial/3.8.0 but the command from the documentation does not seem to work:
java -cp jena-spatial-3.8.0.jar jena.spatialindexer --loc /srv/linked_data_store/prod_dp_2018-09-13-1
Error: Could not find or load main class jena.spatialindexer

> On 13.09.2018 at 21:47, Marco Neumann <[email protected]> wrote:
>
> Set the classpath to include the spatialIndexer
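A note on the error above: "Could not find or load main class" typically means the JVM cannot load jena.spatialindexer or one of the classes it depends on, and the jena-spatial jar on its own does not carry the rest of the Jena stack. A minimal sketch of an invocation with a fuller classpath, assuming the apache-jena 3.8.0 distribution is unpacked under /opt/apache-jena-3.8.0 (the install path is an assumption; also check the tool's help output, since the documentation drives the indexer from an assembler description and --desc may be expected instead of --loc):

    # hypothetical install path; lib/* supplies the Jena dependencies
    java -cp '/opt/apache-jena-3.8.0/lib/*:jena-spatial-3.8.0.jar' \
        jena.spatialindexer --loc /srv/linked_data_store/prod_dp_2018-09-13-1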
> On Thu 13 Sep 2018 at 20:30, Markus Neumann <[email protected]> wrote:
>
>> Hi,
>>
>> spatial index creation fails.
>> I tried to follow the documentation but failed. I can't find the
>> jena.spatialindexer to build it manually, and the one I specified in my
>> config does not work when I use the tdbloader.
>>
>> Any ideas?
>>
>>> On 13.09.2018 at 19:48, Marco Neumann <[email protected]> wrote:
>>>
>>> to create the spatial index you can take a look at the "Building a Spatial
>>> Index" section in the "Spatial searches with SPARQL" documentation here:
>>>
>>> https://jena.apache.org/documentation/query/spatial-query.html
>>>
>>> hint: if you don't get results for a spatial filter query that matches
>>> your data in the database, your data isn't spatially indexed correctly.
>>> there will be no error or the like in the result set though.
>>>
>>> On Thu, Sep 13, 2018 at 1:53 PM Markus Neumann <[email protected]> wrote:
>>>
>>>> Thanks for the links.
>>>>
>>>> How do I see whether the loader builds the spatial index? As far as I
>>>> understood the documentation, my config should produce the spatial index
>>>> in memory. I haven't figured that part out completely though:
>>>> When I start the database from scratch, the spatial indexing works. After
>>>> a restart I have to re-upload the stations file (which is no big deal, as
>>>> it is only 593K in size) to regenerate the index.
>>>> I couldn't get it to work with a persistent index file though.
>>>>
>>>> Right now I'm trying the tdb2.tdbloader (didn't see that before) and it
>>>> seems to go even faster:
>>>>
>>>> 12:49:11 INFO loader :: Add: 41,000,000 2017-01-01_1M_30min.ttl (Batch: 67,980 / Avg: 62,995)
>>>> 12:49:11 INFO loader :: Elapsed: 650.84 seconds [2018/09/13 12:49:11 UTC]
>>>>
>>>> Is there a way to tell the loader that it should build the spatial index?
>>>>
>>>> Yes, we have to use the spatial filter eventually, so I would highly
>>>> appreciate some more information on the correct setup here.
>>>>
>>>> Many thanks.
>>>>
>>>>> On 13.09.2018 at 14:19, Marco Neumann <[email protected]> wrote:
>>>>>
>>>>> :-)
>>>>>
>>>>> this sounds much better Markus. now with regards to the optimizer please
>>>>> consult the online documentation here:
>>>>>
>>>>> https://jena.apache.org/documentation/tdb/optimizer.html
>>>>>
>>>>> (it's a very simple process to create the stats file and place it in the
>>>>> directory)
>>>>>
>>>>> also did the loader index the spatial data? do your queries make use of
>>>>> the spatial filter?
>>>>>
>>>>> On Thu, Sep 13, 2018 at 12:59 PM Markus Neumann <[email protected]> wrote:
>>>>>
>>>>>> Marco,
>>>>>>
>>>>>> I just tried the tdbloader2 script with 1 month of data:
>>>>>>
>>>>>> INFO Total: 167,385,120 tuples : 1,143.55 seconds : 146,373.23 tuples/sec [2018/09/13 11:29:31 UTC]
>>>>>> 11:41:44 INFO Index Building Phase Completed
>>>>>> 11:41:46 INFO -- TDB Bulk Loader Finish
>>>>>> 11:41:46 INFO -- 1880 seconds
>>>>>>
>>>>>> That's already a lot better. I'm working on a way to reduce the amount
>>>>>> of data.
>>>>>> Can you give me a pointer on
>>>>>>> don't forget to run the tdb optimizer to generate the stats.opt file.
>>>>>> ? I haven't heard of that so far...
>>>>>>
>>>>>> A more general question:
>>>>>> Would there be a benefit in using the jena stack over using the fuseki
>>>>>> bundle as I'm doing now? (Documentation was not clear to me on that point)
>>>>>> - If so: is there a guide on how to set it up?
>>>>>
>>>>> fuseki makes use of the jena stack. think of the jena distribution as a
>>>>> kind of toolbox you can use to work with your different projects in
>>>>> addition to your fuseki endpoint.
>>>>>
>>>>> just make sure to configure the class path correctly:
>>>>>
>>>>> https://jena.apache.org/documentation/tools/index.html
>>>>>
>>>>> also, further to the conversation with Rob, he has a valid point with
>>>>> regards to data corruption. please do not update a live tdb database
>>>>> instance directly with tdbloader while it's connected to a running
>>>>> fuseki endpoint.
>>>>>
>>>>> shut down the fuseki server first and then run the loader. or run the
>>>>> loader process in parallel into a different target directory and swap
>>>>> the data or the path later on. I don't know if there is a hot-swap
>>>>> option in fuseki to map to a new directory, but a quick restart should
>>>>> do the trick.
>>>>>
>>>>>> Thanks and kind regards
>>>>>> Markus
>>>>>>
>>>>>>> On 13.09.2018 at 11:56, Marco Neumann <[email protected]> wrote:
>>>>>>>
>>>>>>> Rob, keeping fuseki live wasn't stated as a requirement for 1., so my
>>>>>>> advice stands. we are running similar updates with fresh data
>>>>>>> frequently.
>>>>>>>
>>>>>>> Markus, to keep fuseki downtime at a minimum you can pre-populate tdb
>>>>>>> into a temporary directory as well and later switch between
>>>>>>> directories. don't forget to run the tdb optimizer to generate the
>>>>>>> stats.opt file.
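On the stats.opt question above: the optimizer documentation linked earlier generates the statistics file with a tdbstats tool and then moves it into the database directory. A sketch for the TDB2 store used in this thread, assuming the tdb2.tdbstats script that ships with the apache-jena distribution (if your version lacks it, the optimizer page documents the TDB1 tdbstats equivalent) and that Fuseki is stopped while the file is installed; Data-0001 is an assumption about where the current TDB2 data directory lives:

    # write to a temporary file first; never generate stats directly into a
    # live database directory
    tdb2.tdbstats --loc /srv/linked_data_store/fuseki-server/run/databases/mm > /tmp/stats.opt
    # inspect the file, then install it next to the indexes
    mv /tmp/stats.opt /srv/linked_data_store/fuseki-server/run/databases/mm/Data-0001/stats.opt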
>>>>>>> On Thu, Sep 13, 2018 at 10:33 AM Rob Vesse <[email protected]> wrote:
>>>>>>>
>>>>>>>> I am not sure the tdbloader/tdbloader2 scripts help in this case.
>>>>>>>> This is an online update of a running Fuseki instance backed by TDB,
>>>>>>>> from what has been described.
>>>>>>>>
>>>>>>>> Since a TDB instance can only be safely used by a single JVM at a
>>>>>>>> time, using those scripts would not be a viable option here unless
>>>>>>>> the OP was willing to stop Fuseki during updates, as otherwise it
>>>>>>>> would either fail (because the built-in TDB mechanisms would prevent
>>>>>>>> it) or it would risk causing data corruption.
>>>>>>>>
>>>>>>>> Rob
>>>>>>>>
>>>>>>>> On 13/09/2018, 10:11, "Marco Neumann" <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Markus, the tdbloader2 script is part of the apache-jena distribution.
>>>>>>>>
>>>>>>>> let me know how you get on and how this improves your data load
>>>>>>>> process.
>>>>>>>>
>>>>>>>> Marco
>>>>>>>>
>>>>>>>> On Thu, Sep 13, 2018 at 9:58 AM Markus Neumann <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Marco,
>>>>>>>>>
>>>>>>>>> as this is a project for a customer, I'm afraid we can't make the
>>>>>>>>> data public.
>>>>>>>>>
>>>>>>>>> 1. I'm running Fuseki-3.8.0 with the following configuration:
>>>>>>>>>
>>>>>>>>> @prefix :          <http://base/#> .
>>>>>>>>> @prefix rdf:       <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>>>>>>>>> @prefix tdb2:      <http://jena.apache.org/2016/tdb#> .
>>>>>>>>> @prefix ja:        <http://jena.hpl.hp.com/2005/11/Assembler#> .
>>>>>>>>> @prefix rdfs:      <http://www.w3.org/2000/01/rdf-schema#> .
>>>>>>>>> @prefix fuseki:    <http://jena.apache.org/fuseki#> .
>>>>>>>>> @prefix spatial:   <http://jena.apache.org/spatial#> .
>>>>>>>>> @prefix geo:       <http://www.w3.org/2003/01/geo/wgs84_pos#> .
>>>>>>>>> @prefix geosparql: <http://www.opengis.net/ont/geosparql#> .
>>>>>>>>>
>>>>>>>>> :service_tdb_all a fuseki:Service ;
>>>>>>>>>     rdfs:label "TDB2 mm" ;
>>>>>>>>>     fuseki:dataset :spatial_dataset ;
>>>>>>>>>     fuseki:name "mm" ;
>>>>>>>>>     fuseki:serviceQuery "query" , "sparql" ;
>>>>>>>>>     fuseki:serviceReadGraphStore "get" ;
>>>>>>>>>     fuseki:serviceReadWriteGraphStore "data" ;
>>>>>>>>>     fuseki:serviceUpdate "update" ;
>>>>>>>>>     fuseki:serviceUpload "upload" .
>>>>>>>>>
>>>>>>>>> :spatial_dataset a spatial:SpatialDataset ;
>>>>>>>>>     spatial:dataset :tdb_dataset_readwrite ;
>>>>>>>>>     spatial:index <#indexLucene> ;
>>>>>>>>>     .
>>>>>>>>>
>>>>>>>>> <#indexLucene> a spatial:SpatialIndexLucene ;
>>>>>>>>>     #spatial:directory <file:Lucene> ;
>>>>>>>>>     spatial:directory "mem" ;
>>>>>>>>>     spatial:definition <#definition> ;
>>>>>>>>>     .
>>>>>>>>>
>>>>>>>>> <#definition> a spatial:EntityDefinition ;
>>>>>>>>>     spatial:entityField "uri" ;
>>>>>>>>>     spatial:geoField "geo" ;
>>>>>>>>>     # custom geo predicates for 1) Latitude/Longitude Format
>>>>>>>>>     spatial:hasSpatialPredicatePairs (
>>>>>>>>>         [ spatial:latitude geo:lat ; spatial:longitude geo:long ]
>>>>>>>>>     ) ;
>>>>>>>>>     # custom geo predicates for 2) Well Known Text (WKT) Literal
>>>>>>>>>     spatial:hasWKTPredicates (geosparql:asWKT) ;
>>>>>>>>>     # custom SpatialContextFactory for 2) Well Known Text (WKT) Literal
>>>>>>>>>     spatial:spatialContextFactory
>>>>>>>>>         # "com.spatial4j.core.context.jts.JtsSpatialContextFactory"
>>>>>>>>>         "org.locationtech.spatial4j.context.jts.JtsSpatialContextFactory"
>>>>>>>>>     .
>>>>>>>>>
>>>>>>>>> :tdb_dataset_readwrite
>>>>>>>>>     a tdb2:DatasetTDB2 ;
>>>>>>>>>     tdb2:location "/srv/linked_data_store/fuseki-server/run/databases/mm" .
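An aside on the <#indexLucene> block above: spatial:directory "mem" holds the Lucene index in memory only, which matches the restart behaviour Markus described earlier (the index works until the server restarts and has to be rebuilt by re-uploading the stations file). A sketch of the persistent variant, following the commented-out file: form already in the config; the directory path is an assumption and must be writable by the Fuseki process:

    <#indexLucene> a spatial:SpatialIndexLucene ;
        # keep the Lucene index on disk so it survives server restarts
        spatial:directory <file:/srv/linked_data_store/spatial-lucene> ;
        spatial:definition <#definition> ;
        .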
>>>>>>>>> I've been through the Fuseki documentation several times, but I
>>>>>>>>> still find it a bit confusing. I would highly appreciate it if you
>>>>>>>>> could point me to other resources.
>>>>>>>>>
>>>>>>>>> I have not found the tdbloader in the fuseki repo. For now I use a
>>>>>>>>> small shell script that wraps curl to upload the data:
>>>>>>>>>
>>>>>>>>> if [ ! -z "$2" ]
>>>>>>>>> then
>>>>>>>>>     ADD="?graph=http://rdf.meteomatics.com/mm/graphs/$2"
>>>>>>>>> fi
>>>>>>>>> curl --basic -u user:password -X POST -F "filename=@$1" \
>>>>>>>>>     "localhost:3030/mm/data${ADD}"
>>>>>>>>>
>>>>>>>>> 2. Our customer has not specified a default use case yet, as the
>>>>>>>>> whole RDF concept is about as new to them as it is to me. I suppose
>>>>>>>>> it will be something like "Find all locations in a certain radius
>>>>>>>>> that have nice weather next Saturday".
>>>>>>>>>
>>>>>>>>> I just took a glance at the ha-fuseki page and will give it a try
>>>>>>>>> later.
>>>>>>>>>
>>>>>>>>> Many thanks for your time
>>>>>>>>>
>>>>>>>>> Best
>>>>>>>>> Markus
>>>>>>>>>
>>>>>>>>>> On 13.09.2018 at 10:00, Marco Neumann <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> do you make the data endpoint publicly available?
>>>>>>>>>>
>>>>>>>>>> 1. did you try the tdbloader? what version of tdb2 do you use?
>>>>>>>>>>
>>>>>>>>>> 2. many ways to improve your response time here. what does a
>>>>>>>>>> typical query look like? do you make use of the spatial indexer?
>>>>>>>>>>
>>>>>>>>>> and Andy has work in progress here for more granular updates that
>>>>>>>>>> might be of interest to your effort as well: "High Availability
>>>>>>>>>> Apache Jena Fuseki"
>>>>>>>>>>
>>>>>>>>>> https://afs.github.io/rdf-delta/ha-fuseki.html
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 12, 2018 at 4:09 PM Markus Neumann <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> we are running a Fuseki server that will eventually hold about
>>>>>>>>>>> 2.2 * 10^9 triples of meteorological data.
>>>>>>>>>>> I currently run it with "-Xmx80GB" on a 128GB server. The database
>>>>>>>>>>> is TDB2 on a 900GB SSD.
>>>>>>>>>>>
>>>>>>>>>>> Now I face several performance issues:
>>>>>>>>>>> 1. Inserting data:
>>>>>>>>>>>     It takes more than one hour to upload the measurements of a
>>>>>>>>>>>     month (7.5GB .ttl file, ~16 million triples) using the
>>>>>>>>>>>     data-upload web interface of fuseki.
>>>>>>>>>>>     Is there a way to do this faster?
>>>>>>>>>>> 2. Updating data:
>>>>>>>>>>>     We get new model runs 5 times per day. This is data for the
>>>>>>>>>>>     next 10 days that needs to be updated every time.
>>>>>>>>>>>     My idea was to create a named graph "forecast" that holds the
>>>>>>>>>>>     latest version of this data.
>>>>>>>>>>>     Every time a new model run arrives, I create a new temporary
>>>>>>>>>>>     graph to upload the data to. Once this is finished, I move the
>>>>>>>>>>>     temporary graph to "forecast".
>>>>>>>>>>>     This seems to do the work twice, as it takes 1 hour for the
>>>>>>>>>>>     upload and 1 hour for the move.
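Both points above are what Marco's load-and-swap suggestion later in the thread addresses: bulk-load offline into a fresh TDB2 directory, then switch Fuseki over to it with a brief restart. A sketch, with assumed paths, an assumed systemd service name, and the assumptions that the configured tdb2:location is a symlink to the active database directory and that forecast-run.ttl stands in for the new model run's file:

    # bulk-load offline into a fresh TDB2 directory; Fuseki must not have
    # this directory open while the loader runs
    NEW=/srv/linked_data_store/databases/mm-$(date +%F)
    tdb2.tdbloader --loc "$NEW" forecast-run.ttl

    # brief downtime: stop Fuseki, repoint the symlink, restart
    systemctl stop fuseki
    ln -sfn "$NEW" /srv/linked_data_store/fuseki-server/run/databases/mm
    systemctl start fuseki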
>>>>>>>>>>> Our data consists of the following:
>>>>>>>>>>>
>>>>>>>>>>> Locations (total 1607 -> 16070 triples):
>>>>>>>>>>> mm-locations:8500015 a mm:Location ;
>>>>>>>>>>>     a geosparql:Geometry ;
>>>>>>>>>>>     owl:sameAs <http://lod.opentransportdata.swiss/didok/8500015> ;
>>>>>>>>>>>     geosparql:asWKT "POINT(7.61574425031 47.5425915732)"^^geosparql:wktLiteral ;
>>>>>>>>>>>     mm:station_name "Basel SBB GB Ost" ;
>>>>>>>>>>>     mm:abbreviation "BSGO" ;
>>>>>>>>>>>     mm:didok_id 8500015 ;
>>>>>>>>>>>     geo:lat 47.54259 ;
>>>>>>>>>>>     geo:long 7.61574 ;
>>>>>>>>>>>     mm:elevation 273 .
>>>>>>>>>>>
>>>>>>>>>>> Parameters (total 14 -> 56 triples):
>>>>>>>>>>> mm-parameters:t_2m:C a mm:Parameter ;
>>>>>>>>>>>     rdfs:label "t_2m:C" ;
>>>>>>>>>>>     dcterms:description "Air temperature at 2m above ground in degree Celsius"@en ;
>>>>>>>>>>>     mm:unit_symbol "˚C" .
>>>>>>>>>>>
>>>>>>>>>>> Measurements (that is the huge bunch; per day: 14 * 1607 * 48 ~
>>>>>>>>>>> 1 million measurements -> ~5 million triples per day):
>>>>>>>>>>> mm-measurements:8500015_2018-09-02T00:00:00Z_t_2m:C a mm:Measurement ;
>>>>>>>>>>>     mm:location mm-locations:8500015 ;
>>>>>>>>>>>     mm:validdate "2018-09-02T00:00:00Z"^^xsd:dateTime ;
>>>>>>>>>>>     mm:value 15.1 ;
>>>>>>>>>>>     mm:parameter mm-parameters:t_2m:C .
>>>>>>>>>>>
>>>>>>>>>>> I would really appreciate it if someone could give me some advice
>>>>>>>>>>> on how to handle these tasks, or point out things I could do to
>>>>>>>>>>> optimize the organization of the data.
>>>>>>>>>>>
>>>>>>>>>>> Many thanks and kind regards
>>>>>>>>>>> Markus Neumann

> --
> ---
> Marco Neumann
> KONA
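Finally, a sketch of the kind of radius query this setup is meant to serve, tying Marco's spatial-filter hint to the data model above. It uses the spatial:withinCircle property function from jena-spatial; the mm* namespace URIs below are placeholders, since the thread never shows the actual prefix declarations:

    PREFIX spatial: <http://jena.apache.org/spatial#>
    PREFIX xsd:     <http://www.w3.org/2001/XMLSchema#>
    # placeholder namespaces; substitute the real ones from the dataset
    PREFIX mm:            <http://example.org/mm#>
    PREFIX mm-parameters: <http://example.org/mm/parameters#>

    SELECT ?station ?temp WHERE {
      # stations within 10 km of Basel SBB GB Ost, answered from the Lucene
      # spatial index (latitude, longitude, radius, units)
      ?station spatial:withinCircle (47.5426 7.6157 10.0 'km') .
      # join each matching station to one parameter at one timestamp
      ?m mm:location  ?station ;
         mm:parameter mm-parameters:t_2m:C ;
         mm:validdate "2018-09-15T12:00:00Z"^^xsd:dateTime ;
         mm:value     ?temp .
    }

If a query like this returns no rows for data that is known to be there, that is the symptom Marco describes above: the data was loaded without going through the spatial-index-wrapped dataset.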
