I got the jar from https://mvnrepository.com/artifact/org.apache.jena/jena-spatial/3.8.0 but the command from the documentation does not seem to work:
java -cp jena-spatial-3.8.0.jar jena.spatialindexer --loc /srv/linked_data_store/prod_dp_2018-09-13-1
Error: Could not find or load main class jena.spatialindexer

> On 13.09.2018 at 21:47, Marco Neumann <[email protected]> wrote:
>
> Set the classpath to include the spatialIndexer
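A note on the error above: "Could not find or load main class" typically means the JVM cannot load jena.spatialindexer or one of the classes it depends on, and the jena-spatial jar on its own does not carry the rest of the Jena stack. A minimal sketch of an invocation with a fuller classpath, assuming the apache-jena 3.8.0 distribution is unpacked under /opt/apache-jena-3.8.0 (the install path is an assumption; also check the tool's help output, since the documentation drives the indexer from an assembler description and --desc may be expected instead of --loc):

    # hypothetical install path; lib/* supplies the Jena dependencies
    java -cp '/opt/apache-jena-3.8.0/lib/*:jena-spatial-3.8.0.jar' \
        jena.spatialindexer --loc /srv/linked_data_store/prod_dp_2018-09-13-1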
> On Thu 13 Sep 2018 at 20:30, Markus Neumann <[email protected]> wrote:
>
>> Hi,
>>
>> spatial index creation fails.
>> I tried to follow the documentation but failed. I can't find the
>> jena.spatialindexer to build it manually, and the one I specified in my
>> config does not work when I use the tdbloader.
>>
>> Any ideas?
>>
>>> On 13.09.2018 at 19:48, Marco Neumann <[email protected]> wrote:
>>>
>>> to create the spatial index you can take a look at the "Building a Spatial
>>> Index" section in the "Spatial searches with SPARQL" documentation here:
>>>
>>> https://jena.apache.org/documentation/query/spatial-query.html
>>>
>>> hint: if you don't get results for a spatial filter query that matches
>>> your data in the database, your data isn't spatially indexed correctly.
>>> there will be no error or the like in the result set though.
>>>
>>> On Thu, Sep 13, 2018 at 1:53 PM Markus Neumann <[email protected]> wrote:
>>>
>>>> Thanks for the links.
>>>>
>>>> How do I see whether the loader builds the spatial index? As far as I
>>>> understood the documentation, my config should produce the spatial index
>>>> in memory. I haven't figured that part out completely though:
>>>> When I start the database from scratch, the spatial indexing works. After
>>>> a restart I have to re-upload the stations file (which is no big deal, as
>>>> it is only 593K in size) to regenerate the index.
>>>> I couldn't get it to work with a persistent index file though.
>>>>
>>>> Right now I'm trying the tdb2.tdbloader (didn't see that before) and it
>>>> seems to go even faster:
>>>>
>>>> 12:49:11 INFO loader :: Add: 41,000,000 2017-01-01_1M_30min.ttl (Batch: 67,980 / Avg: 62,995)
>>>> 12:49:11 INFO loader :: Elapsed: 650.84 seconds [2018/09/13 12:49:11 UTC]
>>>>
>>>> Is there a way to tell the loader that it should build the spatial index?
>>>>
>>>> Yes, we have to use the spatial filter eventually, so I would highly
>>>> appreciate some more information on the correct setup here.
>>>>
>>>> Many thanks.
>>>>
>>>>> On 13.09.2018 at 14:19, Marco Neumann <[email protected]> wrote:
>>>>>
>>>>> :-)
>>>>>
>>>>> this sounds much better Markus. now with regards to the optimizer please
>>>>> consult the online documentation here:
>>>>>
>>>>> https://jena.apache.org/documentation/tdb/optimizer.html
>>>>>
>>>>> (it's a very simple process to create the stats file and place it in the
>>>>> directory)
>>>>>
>>>>> also did the loader index the spatial data? do your queries make use of
>>>>> the spatial filter?
>>>>>
>>>>> On Thu, Sep 13, 2018 at 12:59 PM Markus Neumann <[email protected]> wrote:
>>>>>
>>>>>> Marco,
>>>>>>
>>>>>> I just tried the tdbloader2 script with 1 month of data:
>>>>>>
>>>>>> INFO Total: 167,385,120 tuples : 1,143.55 seconds : 146,373.23 tuples/sec [2018/09/13 11:29:31 UTC]
>>>>>> 11:41:44 INFO Index Building Phase Completed
>>>>>> 11:41:46 INFO -- TDB Bulk Loader Finish
>>>>>> 11:41:46 INFO -- 1880 seconds
>>>>>>
>>>>>> That's already a lot better. I'm working on a way to reduce the amount
>>>>>> of data.
>>>>>> Can you give me a pointer on
>>>>>>> don't forget to run the tdb optimizer to generate the stats.opt file.
>>>>>> ? I haven't heard of that so far...
>>>>>>
>>>>>> A more general question:
>>>>>> Would there be a benefit in using the jena stack over using the fuseki
>>>>>> bundle as I'm doing now? (Documentation was not clear to me on that point)
>>>>>> - If so: is there a guide on how to set it up?
>>>>>
>>>>> fuseki makes use of the jena stack. think of the jena distribution as a
>>>>> kind of toolbox you can use to work with your different projects in
>>>>> addition to your fuseki endpoint.
>>>>>
>>>>> just make sure to configure the class path correctly:
>>>>>
>>>>> https://jena.apache.org/documentation/tools/index.html
>>>>>
>>>>> also, further to the conversation with Rob, he has a valid point with
>>>>> regards to data corruption. please do not update a live tdb database
>>>>> instance directly with tdbloader while it's connected to a running
>>>>> fuseki endpoint.
>>>>>
>>>>> shut down the fuseki server first and then run the loader. or run the
>>>>> loader process in parallel into a different target directory and swap
>>>>> the data or the path later on. I don't know if there is a hot-swap
>>>>> option in fuseki to map to a new directory, but a quick restart should
>>>>> do the trick.
>>>>>
>>>>>> Thanks and kind regards
>>>>>> Markus
>>>>>>
>>>>>>> On 13.09.2018 at 11:56, Marco Neumann <[email protected]> wrote:
>>>>>>>
>>>>>>> Rob, keeping fuseki live wasn't stated as a requirement for 1., so my
>>>>>>> advice stands. we are running similar updates with fresh data
>>>>>>> frequently.
>>>>>>>
>>>>>>> Markus, to keep fuseki downtime at a minimum you can pre-populate tdb
>>>>>>> into a temporary directory as well and later switch between
>>>>>>> directories. don't forget to run the tdb optimizer to generate the
>>>>>>> stats.opt file.
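On the stats.opt question above: the optimizer documentation linked earlier generates the statistics file with a tdbstats tool and then moves it into the database directory. A sketch for the TDB2 store used in this thread, assuming the tdb2.tdbstats script that ships with the apache-jena distribution (if your version lacks it, the optimizer page documents the TDB1 tdbstats equivalent) and that Fuseki is stopped while the file is installed; Data-0001 is an assumption about where the current TDB2 data directory lives:

    # write to a temporary file first; never generate stats directly into a
    # live database directory
    tdb2.tdbstats --loc /srv/linked_data_store/fuseki-server/run/databases/mm > /tmp/stats.opt
    # inspect the file, then install it next to the indexes
    mv /tmp/stats.opt /srv/linked_data_store/fuseki-server/run/databases/mm/Data-0001/stats.opt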
>>>>>>> On Thu, Sep 13, 2018 at 10:33 AM Rob Vesse <[email protected]> wrote:
>>>>>>>
>>>>>>>> I am not sure the tdbloader/tdbloader2 scripts help in this case.
>>>>>>>> This is an online update of a running Fuseki instance backed by TDB,
>>>>>>>> from what has been described.
>>>>>>>>
>>>>>>>> Since a TDB instance can only be safely used by a single JVM at a
>>>>>>>> time, using those scripts would not be a viable option here unless
>>>>>>>> the OP was willing to stop Fuseki during updates, as otherwise it
>>>>>>>> would either fail (because the built-in TDB mechanisms would prevent
>>>>>>>> it) or it would risk causing data corruption.
>>>>>>>>
>>>>>>>> Rob
>>>>>>>>
>>>>>>>> On 13/09/2018, 10:11, "Marco Neumann" <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Markus, the tdbloader2 script is part of the apache-jena distribution.
>>>>>>>>
>>>>>>>> let me know how you get on and how this improves your data load
>>>>>>>> process.
>>>>>>>>
>>>>>>>> Marco
>>>>>>>>
>>>>>>>> On Thu, Sep 13, 2018 at 9:58 AM Markus Neumann <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Marco,
>>>>>>>>>
>>>>>>>>> as this is a project for a customer, I'm afraid we can't make the
>>>>>>>>> data public.
>>>>>>>>>
>>>>>>>>> 1. I'm running Fuseki-3.8.0 with the following configuration:
>>>>>>>>>
>>>>>>>>> @prefix :          <http://base/#> .
>>>>>>>>> @prefix rdf:       <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>>>>>>>>> @prefix tdb2:      <http://jena.apache.org/2016/tdb#> .
>>>>>>>>> @prefix ja:        <http://jena.hpl.hp.com/2005/11/Assembler#> .
>>>>>>>>> @prefix rdfs:      <http://www.w3.org/2000/01/rdf-schema#> .
>>>>>>>>> @prefix fuseki:    <http://jena.apache.org/fuseki#> .
>>>>>>>>> @prefix spatial:   <http://jena.apache.org/spatial#> .
>>>>>>>>> @prefix geo:       <http://www.w3.org/2003/01/geo/wgs84_pos#> .
>>>>>>>>> @prefix geosparql: <http://www.opengis.net/ont/geosparql#> .
>>>>>>>>>
>>>>>>>>> :service_tdb_all a fuseki:Service ;
>>>>>>>>>     rdfs:label "TDB2 mm" ;
>>>>>>>>>     fuseki:dataset :spatial_dataset ;
>>>>>>>>>     fuseki:name "mm" ;
>>>>>>>>>     fuseki:serviceQuery "query" , "sparql" ;
>>>>>>>>>     fuseki:serviceReadGraphStore "get" ;
>>>>>>>>>     fuseki:serviceReadWriteGraphStore "data" ;
>>>>>>>>>     fuseki:serviceUpdate "update" ;
>>>>>>>>>     fuseki:serviceUpload "upload" .
>>>>>>>>>
>>>>>>>>> :spatial_dataset a spatial:SpatialDataset ;
>>>>>>>>>     spatial:dataset :tdb_dataset_readwrite ;
>>>>>>>>>     spatial:index <#indexLucene> ;
>>>>>>>>>     .
>>>>>>>>>
>>>>>>>>> <#indexLucene> a spatial:SpatialIndexLucene ;
>>>>>>>>>     #spatial:directory <file:Lucene> ;
>>>>>>>>>     spatial:directory "mem" ;
>>>>>>>>>     spatial:definition <#definition> ;
>>>>>>>>>     .
>>>>>>>>>
>>>>>>>>> <#definition> a spatial:EntityDefinition ;
>>>>>>>>>     spatial:entityField "uri" ;
>>>>>>>>>     spatial:geoField "geo" ;
>>>>>>>>>     # custom geo predicates for 1) Latitude/Longitude Format
>>>>>>>>>     spatial:hasSpatialPredicatePairs (
>>>>>>>>>         [ spatial:latitude geo:lat ; spatial:longitude geo:long ]
>>>>>>>>>     ) ;
>>>>>>>>>     # custom geo predicates for 2) Well Known Text (WKT) Literal
>>>>>>>>>     spatial:hasWKTPredicates (geosparql:asWKT) ;
>>>>>>>>>     # custom SpatialContextFactory for 2) Well Known Text (WKT) Literal
>>>>>>>>>     spatial:spatialContextFactory
>>>>>>>>>         # "com.spatial4j.core.context.jts.JtsSpatialContextFactory"
>>>>>>>>>         "org.locationtech.spatial4j.context.jts.JtsSpatialContextFactory"
>>>>>>>>>     .
>>>>>>>>>
>>>>>>>>> :tdb_dataset_readwrite
>>>>>>>>>     a tdb2:DatasetTDB2 ;
>>>>>>>>>     tdb2:location "/srv/linked_data_store/fuseki-server/run/databases/mm" .
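An aside on the <#indexLucene> block above: spatial:directory "mem" holds the Lucene index in memory only, which matches the restart behaviour Markus described earlier (the index works until the server restarts and has to be rebuilt by re-uploading the stations file). A sketch of the persistent variant, following the commented-out file: form already in the config; the directory path is an assumption and must be writable by the Fuseki process:

    <#indexLucene> a spatial:SpatialIndexLucene ;
        # keep the Lucene index on disk so it survives server restarts
        spatial:directory <file:/srv/linked_data_store/spatial-lucene> ;
        spatial:definition <#definition> ;
        .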
>>>>>>>>> I've been through the Fuseki documentation several times, but I
>>>>>>>>> still find it a bit confusing. I would highly appreciate it if you
>>>>>>>>> could point me to other resources.
>>>>>>>>>
>>>>>>>>> I have not found the tdbloader in the fuseki repo. For now I use a
>>>>>>>>> small shell script that wraps curl to upload the data:
>>>>>>>>>
>>>>>>>>> if [ ! -z "$2" ]
>>>>>>>>> then
>>>>>>>>>     ADD="?graph=http://rdf.meteomatics.com/mm/graphs/$2"
>>>>>>>>> fi
>>>>>>>>> curl --basic -u user:password -X POST -F "filename=@$1" \
>>>>>>>>>     "localhost:3030/mm/data${ADD}"
>>>>>>>>>
>>>>>>>>> 2. Our customer has not specified a default use case yet, as the
>>>>>>>>> whole RDF concept is about as new to them as it is to me. I suppose
>>>>>>>>> it will be something like "Find all locations in a certain radius
>>>>>>>>> that have nice weather next Saturday".
>>>>>>>>>
>>>>>>>>> I just took a glance at the ha-fuseki page and will give it a try
>>>>>>>>> later.
>>>>>>>>>
>>>>>>>>> Many thanks for your time
>>>>>>>>>
>>>>>>>>> Best
>>>>>>>>> Markus
>>>>>>>>>
>>>>>>>>>> On 13.09.2018 at 10:00, Marco Neumann <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> do you make the data endpoint publicly available?
>>>>>>>>>>
>>>>>>>>>> 1. did you try the tdbloader? what version of tdb2 do you use?
>>>>>>>>>>
>>>>>>>>>> 2. many ways to improve your response time here. what does a
>>>>>>>>>> typical query look like? do you make use of the spatial indexer?
>>>>>>>>>>
>>>>>>>>>> and Andy has work in progress here for more granular updates that
>>>>>>>>>> might be of interest to your effort as well: "High Availability
>>>>>>>>>> Apache Jena Fuseki"
>>>>>>>>>>
>>>>>>>>>> https://afs.github.io/rdf-delta/ha-fuseki.html
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 12, 2018 at 4:09 PM Markus Neumann <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> we are running a Fuseki server that will eventually hold about
>>>>>>>>>>> 2.2 * 10^9 triples of meteorological data.
>>>>>>>>>>> I currently run it with "-Xmx80GB" on a 128GB server. The database
>>>>>>>>>>> is TDB2 on a 900GB SSD.
>>>>>>>>>>>
>>>>>>>>>>> Now I face several performance issues:
>>>>>>>>>>> 1. Inserting data:
>>>>>>>>>>>     It takes more than one hour to upload the measurements of a
>>>>>>>>>>>     month (7.5GB .ttl file, ~16 million triples) using the
>>>>>>>>>>>     data-upload web interface of fuseki.
>>>>>>>>>>>     Is there a way to do this faster?
>>>>>>>>>>> 2. Updating data:
>>>>>>>>>>>     We get new model runs 5 times per day. This is data for the
>>>>>>>>>>>     next 10 days that needs to be updated every time.
>>>>>>>>>>>     My idea was to create a named graph "forecast" that holds the
>>>>>>>>>>>     latest version of this data.
>>>>>>>>>>>     Every time a new model run arrives, I create a new temporary
>>>>>>>>>>>     graph to upload the data to. Once this is finished, I move the
>>>>>>>>>>>     temporary graph to "forecast".
>>>>>>>>>>>     This seems to do the work twice, as it takes 1 hour for the
>>>>>>>>>>>     upload and 1 hour for the move.
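Both points above are what Marco's load-and-swap suggestion later in the thread addresses: bulk-load offline into a fresh TDB2 directory, then switch Fuseki over to it with a brief restart. A sketch, with assumed paths, an assumed systemd service name, and the assumptions that the configured tdb2:location is a symlink to the active database directory and that forecast-run.ttl stands in for the new model run's file:

    # bulk-load offline into a fresh TDB2 directory; Fuseki must not have
    # this directory open while the loader runs
    NEW=/srv/linked_data_store/databases/mm-$(date +%F)
    tdb2.tdbloader --loc "$NEW" forecast-run.ttl

    # brief downtime: stop Fuseki, repoint the symlink, restart
    systemctl stop fuseki
    ln -sfn "$NEW" /srv/linked_data_store/fuseki-server/run/databases/mm
    systemctl start fuseki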
>>>>>>>>>>> Our data consists of the following:
>>>>>>>>>>>
>>>>>>>>>>> Locations (total 1607 -> 16070 triples):
>>>>>>>>>>> mm-locations:8500015 a mm:Location ;
>>>>>>>>>>>     a geosparql:Geometry ;
>>>>>>>>>>>     owl:sameAs <http://lod.opentransportdata.swiss/didok/8500015> ;
>>>>>>>>>>>     geosparql:asWKT "POINT(7.61574425031 47.5425915732)"^^geosparql:wktLiteral ;
>>>>>>>>>>>     mm:station_name "Basel SBB GB Ost" ;
>>>>>>>>>>>     mm:abbreviation "BSGO" ;
>>>>>>>>>>>     mm:didok_id 8500015 ;
>>>>>>>>>>>     geo:lat 47.54259 ;
>>>>>>>>>>>     geo:long 7.61574 ;
>>>>>>>>>>>     mm:elevation 273 .
>>>>>>>>>>>
>>>>>>>>>>> Parameters (total 14 -> 56 triples):
>>>>>>>>>>> mm-parameters:t_2m:C a mm:Parameter ;
>>>>>>>>>>>     rdfs:label "t_2m:C" ;
>>>>>>>>>>>     dcterms:description "Air temperature at 2m above ground in degree Celsius"@en ;
>>>>>>>>>>>     mm:unit_symbol "˚C" .
>>>>>>>>>>>
>>>>>>>>>>> Measurements (that is the huge bunch; per day: 14 * 1607 * 48 ~
>>>>>>>>>>> 1 million measurements -> ~5 million triples per day):
>>>>>>>>>>> mm-measurements:8500015_2018-09-02T00:00:00Z_t_2m:C a mm:Measurement ;
>>>>>>>>>>>     mm:location mm-locations:8500015 ;
>>>>>>>>>>>     mm:validdate "2018-09-02T00:00:00Z"^^xsd:dateTime ;
>>>>>>>>>>>     mm:value 15.1 ;
>>>>>>>>>>>     mm:parameter mm-parameters:t_2m:C .
>>>>>>>>>>>
>>>>>>>>>>> I would really appreciate it if someone could give me some advice
>>>>>>>>>>> on how to handle these tasks, or point out things I could do to
>>>>>>>>>>> optimize the organization of the data.
>>>>>>>>>>>
>>>>>>>>>>> Many thanks and kind regards
>>>>>>>>>>> Markus Neumann

> --
> ---
> Marco Neumann
> KONA
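Finally, a sketch of the kind of radius query this setup is meant to serve, tying Marco's spatial-filter hint to the data model above. It uses the spatial:withinCircle property function from jena-spatial; the mm* namespace URIs below are placeholders, since the thread never shows the actual prefix declarations:

    PREFIX spatial: <http://jena.apache.org/spatial#>
    PREFIX xsd:     <http://www.w3.org/2001/XMLSchema#>
    # placeholder namespaces; substitute the real ones from the dataset
    PREFIX mm:            <http://example.org/mm#>
    PREFIX mm-parameters: <http://example.org/mm/parameters#>

    SELECT ?station ?temp WHERE {
      # stations within 10 km of Basel SBB GB Ost, answered from the Lucene
      # spatial index (latitude, longitude, radius, units)
      ?station spatial:withinCircle (47.5426 7.6157 10.0 'km') .
      # join each matching station to one parameter at one timestamp
      ?m mm:location  ?station ;
         mm:parameter mm-parameters:t_2m:C ;
         mm:validdate "2018-09-15T12:00:00Z"^^xsd:dateTime ;
         mm:value     ?temp .
    }

If a query like this returns no rows for data that is known to be there, that is the symptom Marco describes above: the data was loaded without going through the spatial-index-wrapped dataset.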
