Set the classpath to include the spatialIndexer

On Thu 13 Sep 2018 at 20:30, Markus Neumann <[email protected]> wrote:
> Hi,
>
> spatial index creation fails.
> I tried to figure out the documentation but failed. I can't find the
> jena.spatialindexer to build it manually, and the one I specified in my
> config does not work when I use the tdbloader.
>
> Any ideas?
>
> > On 13.09.2018 at 19:48, Marco Neumann <[email protected]> wrote:
> >
> > to create the spatial index you can take a look at the "Building a
> > Spatial Index" section in the "Spatial searches with SPARQL"
> > documentation here
> >
> > https://jena.apache.org/documentation/query/spatial-query.html
> >
> > hint: if you don't get results for a spatial filter query that matches
> > your data in the database, your data isn't spatially indexed correctly.
> > there will be no error or the like in the result set though.
> >
> > On Thu, Sep 13, 2018 at 1:53 PM Markus Neumann <[email protected]>
> > wrote:
> >
> >> Thanks for the links.
> >>
> >> How do I see if the loader does the spatial index? As far as I understood
> >> the documentation, my config should produce the spatial index in memory.
> >> I haven't figured that part out completely though:
> >> When I start the database from scratch, the spatial indexing works. After
> >> a restart I have to re-upload the stations file (which is no big deal, as
> >> it is only 593K in size) to regenerate the index.
> >> I couldn't get it to work with a persistent index file though.
> >>
> >> Right now I'm trying the tdb2.tdbloader (didn't see that before) and it
> >> seems to go even faster:
> >>
> >>   12:49:11 INFO  loader :: Add: 41,000,000 2017-01-01_1M_30min.ttl (Batch: 67,980 / Avg: 62,995)
> >>   12:49:11 INFO  loader :: Elapsed: 650.84 seconds [2018/09/13 12:49:11 UTC]
> >>
> >> Is there a way to tell the loader that it should do the spatial index?
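Marco's one-line answer at the top of the thread ("set the classpath to include the spatialIndexer") can be sketched as follows. This is a dry run only: `FUSEKI_HOME` and the assembler file path are assumptions for illustration, and the `jena.spatialindexer` class is assumed to be reachable via the Fuseki server jar (otherwise use the jena-spatial jar-with-dependencies).

```shell
# Dry-run sketch, assuming hypothetical paths. jena.spatialindexer needs the
# spatial classes on the classpath; the Fuseki server jar is one candidate.
FUSEKI_HOME=${FUSEKI_HOME:-/opt/fuseki}
CMD="java -cp $FUSEKI_HOME/fuseki-server.jar jena.spatialindexer --desc=run/config.ttl"
echo "$CMD"   # printed, not executed; run the command itself to build the index
```

The `--desc` argument points at the same assembler file Fuseki uses, so the indexer writes into the index the server will read.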
> >>
> >> Yes, we have to use the spatial filter eventually, so I would highly
> >> appreciate some more information on the correct setup here.
> >>
> >> Many thanks.
> >>
> >>> On 13.09.2018 at 14:19, Marco Neumann <[email protected]> wrote:
> >>>
> >>> :-)
> >>>
> >>> this sounds much better Markus. now with regards to the optimizer please
> >>> consult the online documentation here:
> >>>
> >>> https://jena.apache.org/documentation/tdb/optimizer.html
> >>>
> >>> (it's a very simple process to create the stats file and place it in the
> >>> directory)
> >>>
> >>> also did the loader index the spatial data? do your queries make use of
> >>> the spatial filter?
> >>>
> >>> On Thu, Sep 13, 2018 at 12:59 PM Markus Neumann <[email protected]>
> >>> wrote:
> >>>
> >>>> Marco,
> >>>>
> >>>> I just tried the tdbloader2 script with 1 month of data:
> >>>>
> >>>>   INFO  Total: 167,385,120 tuples : 1,143.55 seconds : 146,373.23 tuples/sec [2018/09/13 11:29:31 UTC]
> >>>>   11:41:44 INFO  Index Building Phase Completed
> >>>>   11:41:46 INFO  -- TDB Bulk Loader Finish
> >>>>   11:41:46 INFO  -- 1880 seconds
> >>>>
> >>>> That's already a lot better. I'm working on a way to reduce the amount of
> >>>> data by
> >>>> Can you give me a pointer on
> >>>>> don't forget to run the tdb optimizer to generate the stats.opt file.
> >>>> ? I haven't heard of that so far...
> >>>>
> >>>> A more general question:
> >>>> Would there be a benefit in using the jena stack over using the fuseki
> >>>> bundle as I'm doing now? (Documentation was not clear to me on that point)
> >>>> - If so: is there a guide on how to set it up?
> >>>
> >>> fuseki makes use of the jena stack.
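The stats.opt step Marco refers to can be sketched like this, again as a dry run. The database location reuses this thread's TDB2 path and is an assumption; note that `tdbstats` is the TDB1 tool, and a TDB2 database may need the `tdb2.tdbstats` equivalent from a newer release. Per the TDB optimizer documentation, the file is generated to a temporary location first and then moved into the database directory, never written there directly.

```shell
# Dry-run sketch of generating stats.opt; DB path is an assumption.
DB=/srv/linked_data_store/fuseki-server/run/databases/mm
STEPS="tdbstats --loc=$DB > /tmp/stats.opt
mv /tmp/stats.opt $DB/stats.opt"
echo "$STEPS"   # printed, not executed; requires the apache-jena tools on PATH
```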
> >>> think of the jena distribution as a kind of toolbox you can use to work
> >>> with your different projects in addition to your fuseki endpoint.
> >>>
> >>> just make sure to configure the class path correctly
> >>>
> >>> https://jena.apache.org/documentation/tools/index.html
> >>>
> >>> Also, further to the conversation with Rob, he has a valid point with
> >>> regards to data corruption. please do not update a live tdb database
> >>> instance directly with tdbloader while it's connected to a running fuseki
> >>> endpoint.
> >>>
> >>> shut down the fuseki server first and then run the loader. or run the
> >>> loader process in parallel into a different target directory and swap the
> >>> data or the path again later on. I don't know if there is a hot swap
> >>> option in fuseki to map to a new directory but a quick restart should do
> >>> the trick.
> >>>
> >>>> Thanks and kind regards
> >>>> Markus
> >>>>
> >>>>> On 13.09.2018 at 11:56, Marco Neumann <[email protected]> wrote:
> >>>>>
> >>>>> Rob, keeping fuseki live wasn't stated as a requirement for 1. so my
> >>>>> advice stands. we are running similar updates with fresh data frequently.
> >>>>>
> >>>>> Markus, to keep fuseki downtime at a minimum you can pre-populate tdb
> >>>>> into a temporary directory as well and later switch between directories.
> >>>>> don't forget to run the tdb optimizer to generate the stats.opt file.
> >>>>>
> >>>>> On Thu, Sep 13, 2018 at 10:33 AM Rob Vesse <[email protected]> wrote:
> >>>>>
> >>>>>> I am not sure tdbloader/tdbloader2 scripts help in this case.
> >>>>>> This is an
> >>>>>> online update of a running Fuseki instance backed by TDB, from what has
> >>>>>> been described.
> >>>>>>
> >>>>>> Since a TDB instance can only be safely used by a single JVM at a time,
> >>>>>> using those scripts would not be a viable option here unless the OP was
> >>>>>> willing to stop Fuseki during updates, as otherwise it would either fail
> >>>>>> (because the built-in TDB mechanisms would prevent it) or it would risk
> >>>>>> causing data corruption.
> >>>>>>
> >>>>>> Rob
> >>>>>>
> >>>>>> On 13/09/2018, 10:11, "Marco Neumann" <[email protected]> wrote:
> >>>>>>
> >>>>>> Markus, the tdbloader2 script is part of the apache-jena distribution.
> >>>>>>
> >>>>>> let me know how you get on and how this improves your data load
> >>>>>> process.
> >>>>>>
> >>>>>> Marco
> >>>>>>
> >>>>>> On Thu, Sep 13, 2018 at 9:58 AM Markus Neumann <[email protected]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi Marco,
> >>>>>>>
> >>>>>>> as this is a project for a customer, I'm afraid we can't make the
> >>>>>>> data public.
> >>>>>>>
> >>>>>>> 1. I'm running Fuseki-3.8.0 with the following configuration:
> >>>>>>>
> >>>>>>> @prefix :          <http://base/#> .
> >>>>>>> @prefix rdf:       <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
> >>>>>>> @prefix tdb2:      <http://jena.apache.org/2016/tdb#> .
> >>>>>>> @prefix ja:        <http://jena.hpl.hp.com/2005/11/Assembler#> .
> >>>>>>> @prefix rdfs:      <http://www.w3.org/2000/01/rdf-schema#> .
> >>>>>>> @prefix fuseki:    <http://jena.apache.org/fuseki#> .
> >>>>>>> @prefix spatial:   <http://jena.apache.org/spatial#> .
> >>>>>>> @prefix geo:       <http://www.w3.org/2003/01/geo/wgs84_pos#> .
> >>>>>>> @prefix geosparql: <http://www.opengis.net/ont/geosparql#> .
> >>>>>>>
> >>>>>>> :service_tdb_all a fuseki:Service ;
> >>>>>>>     rdfs:label                        "TDB2 mm" ;
> >>>>>>>     fuseki:dataset                    :spatial_dataset ;
> >>>>>>>     fuseki:name                       "mm" ;
> >>>>>>>     fuseki:serviceQuery               "query" , "sparql" ;
> >>>>>>>     fuseki:serviceReadGraphStore      "get" ;
> >>>>>>>     fuseki:serviceReadWriteGraphStore "data" ;
> >>>>>>>     fuseki:serviceUpdate              "update" ;
> >>>>>>>     fuseki:serviceUpload              "upload" .
> >>>>>>>
> >>>>>>> :spatial_dataset a spatial:SpatialDataset ;
> >>>>>>>     spatial:dataset :tdb_dataset_readwrite ;
> >>>>>>>     spatial:index   <#indexLucene> ;
> >>>>>>>     .
> >>>>>>>
> >>>>>>> <#indexLucene> a spatial:SpatialIndexLucene ;
> >>>>>>>     #spatial:directory <file:Lucene> ;
> >>>>>>>     spatial:directory  "mem" ;
> >>>>>>>     spatial:definition <#definition> ;
> >>>>>>>     .
> >>>>>>>
> >>>>>>> <#definition> a spatial:EntityDefinition ;
> >>>>>>>     spatial:entityField "uri" ;
> >>>>>>>     spatial:geoField    "geo" ;
> >>>>>>>     # custom geo predicates for 1) Latitude/Longitude Format
> >>>>>>>     spatial:hasSpatialPredicatePairs (
> >>>>>>>         [ spatial:latitude geo:lat ; spatial:longitude geo:long ]
> >>>>>>>     ) ;
> >>>>>>>     # custom geo predicates for 2) Well Known Text (WKT) Literal
> >>>>>>>     spatial:hasWKTPredicates (geosparql:asWKT) ;
> >>>>>>>     # custom SpatialContextFactory for 2) Well Known Text (WKT) Literal
> >>>>>>>     spatial:spatialContextFactory
> >>>>>>>         # "com.spatial4j.core.context.jts.JtsSpatialContextFactory"
> >>>>>>>         "org.locationtech.spatial4j.context.jts.JtsSpatialContextFactory"
> >>>>>>>     .
> >>>>>>>
> >>>>>>> :tdb_dataset_readwrite
> >>>>>>>     a tdb2:DatasetTDB2 ;
> >>>>>>>     tdb2:location "/srv/linked_data_store/fuseki-server/run/databases/mm" .
> >>>>>>>
> >>>>>>> I've been through the Fuseki documentation several times, but I find
> >>>>>>> it still a bit confusing. I would highly appreciate it if you could
> >>>>>>> point me to other resources.
> >>>>>>>
> >>>>>>> I have not found the tdbloader in the fuseki repo.
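For reference, once the index behind the configuration above is populated, a query using the spatial filter (which Marco asks about in this thread) might look like the sketch below. The coordinates are those of the Basel SBB GB Ost station quoted elsewhere in the thread; the 10 km radius is arbitrary, and `spatial:nearby` is the property documented on the jena spatial-query page.

```
PREFIX spatial: <http://jena.apache.org/spatial#>

SELECT ?station
WHERE {
  # stations within 10 km of a given point: (latitude longitude radius units)
  ?station spatial:nearby (47.54259 7.61574 10.0 'km') .
}
```

If such a query returns no rows for data that is clearly in range, that is the symptom Marco describes of the spatial index not having been built.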
> >>>>>>> For now I use a small
> >>>>>>> shell script that wraps curl to upload the data:
> >>>>>>>
> >>>>>>>   if [ ! -z "$2" ]
> >>>>>>>   then
> >>>>>>>       ADD="?graph=http://rdf.meteomatics.com/mm/graphs/$2"
> >>>>>>>   fi
> >>>>>>>   curl --basic -u user:password -X POST -F "filename=@$1" \
> >>>>>>>       "localhost:3030/mm/data${ADD}"
> >>>>>>>
> >>>>>>> 2. Our customer has not specified a default use case yet, as the
> >>>>>>> whole RDF concept is about as new to them as it is to me. I suppose
> >>>>>>> it will be something like "Find all locations in a certain radius
> >>>>>>> that have nice weather next Saturday".
> >>>>>>>
> >>>>>>> I just took a glance at the ha-fuseki page and will give it a try
> >>>>>>> later.
> >>>>>>>
> >>>>>>> Many thanks for your time
> >>>>>>>
> >>>>>>> Best
> >>>>>>> Markus
> >>>>>>>
> >>>>>>>> On 13.09.2018 at 10:00, Marco Neumann <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> do you make the data endpoint publicly available?
> >>>>>>>>
> >>>>>>>> 1. did you try the tdbloader, what version of tdb2 do you use?
> >>>>>>>>
> >>>>>>>> 2. many ways to improve your response time here. what does a
> >>>>>>>> typical query look like? do you make use of the spatial indexer?
> >>>>>>>>
> >>>>>>>> and Andy has a work in progress here for more granular updates that
> >>>>>>>> might be of interest to your effort as well: "High Availability
> >>>>>>>> Apache Jena Fuseki"
> >>>>>>>>
> >>>>>>>> https://afs.github.io/rdf-delta/ha-fuseki.html
> >>>>>>>>
> >>>>>>>> On Wed, Sep 12, 2018 at 4:09 PM Markus Neumann <[email protected]>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> we are running a Fuseki server that will hold about 2.2 * 10^9
> >>>>>>>>> triples of meteorological data eventually.
> >>>>>>>>> I currently run it with "-Xmx80GB" on a 128GB server. The database
> >>>>>>>>> is TDB2 on a 900GB SSD.
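Marco's "load into a different target directory and swap the path" suggestion from upthread can be made concrete with a symlink that Fuseki's `tdb2:location` points at. This is a runnable sketch with hypothetical paths; the bulk-load step itself is only indicated by a comment.

```shell
# Sketch of the directory-swap approach; all paths are hypothetical.
BASE=$(mktemp -d)
mkdir -p "$BASE/db-v1" "$BASE/db-v2"

# Fuseki's tdb2:location would be configured as $BASE/current.
ln -s "$BASE/db-v1" "$BASE/current"

# ... bulk-load fresh data into $BASE/db-v2 with tdbloader2 here ...

# Repoint the symlink at the freshly loaded database, then restart Fuseki.
ln -sfn "$BASE/db-v2" "$BASE/current"
readlink "$BASE/current"
```

As Rob notes in this thread, the restart matters: a TDB instance is only safe in a single JVM, so the loader must never write into the directory a running Fuseki is using.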
> >>>>>>>>>
> >>>>>>>>> Now I face several performance issues:
> >>>>>>>>> 1. Inserting data:
> >>>>>>>>>    It takes more than one hour to upload the measurements of a month
> >>>>>>>>>    (7.5GB .ttl file ~ 16 Mio triples) (using the data-upload
> >>>>>>>>>    web-interface of fuseki)
> >>>>>>>>>    Is there a way to do this faster?
> >>>>>>>>> 2. Updating data:
> >>>>>>>>>    We get new model runs 5 times per day. This is data for the next
> >>>>>>>>>    10 days that needs to be updated every time.
> >>>>>>>>>    My idea was to create a named graph "forecast" that holds the
> >>>>>>>>>    latest version of this data.
> >>>>>>>>>    Every time a new model run arrives, I create a new temporary
> >>>>>>>>>    graph to upload the data to. Once this is finished, I move the
> >>>>>>>>>    temporary graph to "forecast".
> >>>>>>>>>    This seems to do the work twice, as it takes 1 hour for the
> >>>>>>>>>    upload and 1 hour for the move.
> >>>>>>>>>
> >>>>>>>>> Our data consists of the following:
> >>>>>>>>>
> >>>>>>>>> Locations (total 1607 -> 16070 triples):
> >>>>>>>>>
> >>>>>>>>>   mm-locations:8500015 a mm:Location ;
> >>>>>>>>>       a geosparql:Geometry ;
> >>>>>>>>>       owl:sameAs <http://lod.opentransportdata.swiss/didok/8500015> ;
> >>>>>>>>>       geosparql:asWKT "POINT(7.61574425031 47.5425915732)"^^geosparql:wktLiteral ;
> >>>>>>>>>       mm:station_name "Basel SBB GB Ost" ;
> >>>>>>>>>       mm:abbreviation "BSGO" ;
> >>>>>>>>>       mm:didok_id 8500015 ;
> >>>>>>>>>       geo:lat 47.54259 ;
> >>>>>>>>>       geo:long 7.61574 ;
> >>>>>>>>>       mm:elevation 273 .
> >>>>>>>>>
> >>>>>>>>> Parameters (total 14 -> 56 triples):
> >>>>>>>>>
> >>>>>>>>>   mm-parameters:t_2m:C a mm:Parameter ;
> >>>>>>>>>       rdfs:label "t_2m:C" ;
> >>>>>>>>>       dcterms:description "Air temperature at 2m above ground in degree Celsius"@en ;
> >>>>>>>>>       mm:unit_symbol "˚C" .
> >>>>>>>>>
> >>>>>>>>> Measurements (that is the huge bunch. Per day: 14 * 1607 * 48 ~ 1 Mio ->
> >>>>>>>>> 5 Mio triples per day):
> >>>>>>>>>
> >>>>>>>>>   mm-measurements:8500015_2018-09-02T00:00:00Z_t_2m:C a mm:Measurement ;
> >>>>>>>>>       mm:location mm-locations:8500015 ;
> >>>>>>>>>       mm:validdate "2018-09-02T00:00:00Z"^^xsd:dateTime ;
> >>>>>>>>>       mm:value 15.1 ;
> >>>>>>>>>       mm:parameter mm-parameters:t_2m:C .
> >>>>>>>>>
> >>>>>>>>> I would really appreciate it if someone could give me some advice on
> >>>>>>>>> how to handle these tasks or point out things I could do to optimize
> >>>>>>>>> the organization of the data.
> >>>>>>>>>
> >>>>>>>>> Many thanks and kind regards
> >>>>>>>>> Markus Neumann

--
---
Marco Neumann
KONA
