Hi,

Spatial index creation fails.
I tried to figure it out from the documentation but failed. I can't find the
jena.spatialindexer tool to build the index manually, and the index I
specified in my config does not get built when I use the tdbloader.

Any ideas?


> On 13 Sep 2018, at 19:48, Marco Neumann <[email protected]> wrote:
> 
> To create the spatial index you can take a look at the "Building a Spatial
> Index" section in the "Spatial searches with SPARQL" documentation here:
> 
> https://jena.apache.org/documentation/query/spatial-query.html
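> 
> A rough sketch of the manual build step that section describes (classpath
> and file names hypothetical; the indexer class ships with the jena-spatial
> module):
> 
>     java -cp fuseki-server.jar jena.spatialindexer --desc=spatial-config.ttl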
> 
> Hint: if you don't get results for a spatial filter query that should match
> data in the database, your data isn't spatially indexed correctly. There
> will be no error or the like in the result set though.
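> 
> A quick smoke test along these lines (coordinates and radius hypothetical)
> should return your stations if the index is live:
> 
>     PREFIX spatial: <http://jena.apache.org/spatial#>
>     SELECT ?s WHERE { ?s spatial:nearby (47.54 7.61 100 'km') . }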
> 
> 
> 
> On Thu, Sep 13, 2018 at 1:53 PM Markus Neumann <[email protected]>
> wrote:
> 
>> Thanks for the links.
>> 
>> How do I see whether the loader builds the spatial index? As far as I
>> understood the documentation, my config should produce the spatial index
>> in memory. I haven't figured that part out completely though:
>> When I start the database from scratch, the spatial indexing works. After
>> a restart I have to re-upload the stations file (which is no big deal as
>> it is only 593K in size) to regenerate the index.
>> I couldn't get it to work with a persistent index file though.
>> 
>> Right now I'm trying the tdb2.tdbloader (I hadn't seen that before) and it
>> seems to go even faster:
>> 12:49:11 INFO  loader               :: Add: 41,000,000
>> 2017-01-01_1M_30min.ttl (Batch: 67,980 / Avg: 62,995)
>> 12:49:11 INFO  loader               ::   Elapsed: 650.84 seconds
>> [2018/09/13 12:49:11 UTC]
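>> 
>> For reference, the invocation is along these lines (database location
>> hypothetical):
>> 
>>     tdb2.tdbloader --loc=/srv/databases/mm 2017-01-01_1M_30min.ttl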
>> 
>> Is there a way to tell the loader that it should build the spatial index?
>> 
>> Yes, we have to use the spatial filter eventually, so I would highly
>> appreciate some more information on the correct setup here.
>> 
>> Many thanks.
>> 
>>> On 13 Sep 2018, at 14:19, Marco Neumann <[email protected]> wrote:
>>> 
>>> :-)
>>> 
>>> This sounds much better, Markus. Now with regards to the optimizer please
>>> consult the online documentation here:
>>> 
>>> https://jena.apache.org/documentation/tdb/optimizer.html
>>> (it's a very simple process to create the stats file and place it in the
>>> directory)
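>>> 
>>> Roughly, per that page (database path hypothetical):
>>> 
>>>     tdbstats --loc=/path/to/DB > /tmp/stats.opt
>>>     mv /tmp/stats.opt /path/to/DB/stats.opt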
>>> 
>>> Also, did the loader index the spatial data? Do your queries make use of
>>> the spatial filter?
>>> 
>>> On Thu, Sep 13, 2018 at 12:59 PM Markus Neumann <[email protected]>
>>> wrote:
>>> 
>>>> Marco,
>>>> 
>>>> I just tried the tdbloader2 script with one month of data:
>>>> 
>>>> INFO  Total: 167,385,120 tuples : 1,143.55 seconds : 146,373.23 tuples/sec
>>>> [2018/09/13 11:29:31 UTC]
>>>> 11:41:44 INFO Index Building Phase Completed
>>>> 11:41:46 INFO -- TDB Bulk Loader Finish
>>>> 11:41:46 INFO -- 1880 seconds
>>>> 
>>>> That's already a lot better. I'm working on a way to reduce the amount of
>>>> data by
>>>> Can you give me a pointer on
>>>>> don't forget to run the tdb optimizer to generate the stats.opt file.
>>>> ? I haven't heard of that so far...
>>>> 
>>>> A more general question:
>>>> Would there be a benefit in using the jena stack over using the fuseki
>>>> bundle as I'm doing now? (Documentation was not clear to me on that
>>>> point)
>>>>       - If so: is there a guide on how to set it up?
>>>> 
>>>> 
>>> Fuseki makes use of the jena stack. Think of the jena distribution as a
>>> kind of toolbox you can use to work with your different projects in
>>> addition to your fuseki endpoint.
>>> 
>>> Just make sure to configure the class path correctly:
>>> 
>>> https://jena.apache.org/documentation/tools/index.html
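>>> 
>>> Typically something like this (install path hypothetical):
>>> 
>>>     export JENA_HOME=/opt/apache-jena-3.8.0
>>>     export PATH="$PATH:$JENA_HOME/bin"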
>>> 
>>> Also, further to the conversation with Rob, he has a valid point with
>>> regards to data corruption. Please do not update a live tdb database
>>> instance directly with tdbloader while it's connected to a running fuseki
>>> endpoint.
>>> 
>>> Shut down the fuseki server first and then run the loader, or run the
>>> loader process in parallel into a different target directory and swap the
>>> data or the path again later on. I don't know if there is a hot-swap
>>> option in fuseki to map to a new directory but a quick restart should do
>>> the trick.
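>>> 
>>> A sketch of that swap using a symlink (paths and service name
>>> hypothetical):
>>> 
>>>     tdb2.tdbloader --loc=/srv/dbs/mm-new data/*.ttl   # build offline
>>>     systemctl stop fuseki                             # stop the endpoint
>>>     ln -sfn /srv/dbs/mm-new /srv/dbs/mm               # repoint the database path
>>>     systemctl start fuseki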
>>> 
>>> 
>>> 
>>> 
>>> 
>>>> Thanks and kind regards
>>>> Markus
>>>> 
>>>>> On 13 Sep 2018, at 11:56, Marco Neumann <[email protected]> wrote:
>>>>> 
>>>>> Rob, keeping fuseki live wasn't stated as a requirement for 1., so my
>>>>> advice stands. We are running similar updates with fresh data frequently.
>>>>> 
>>>>> Markus, to keep fuseki downtime at a minimum you can pre-populate tdb
>>>>> into a temporary directory as well and later switch between directories.
>>>>> Don't forget to run the tdb optimizer to generate the stats.opt file.
>>>>> 
>>>>> 
>>>>> On Thu, Sep 13, 2018 at 10:33 AM Rob Vesse <[email protected]> wrote:
>>>>> 
>>>>>> I am not sure the tdbloader/tdbloader2 scripts help in this case.  This
>>>>>> is an online update of a running Fuseki instance backed by TDB from
>>>>>> what has been described.
>>>>>> 
>>>>>> Since a TDB instance can only be safely used by a single JVM at a time,
>>>>>> using those scripts would not be a viable option here unless the OP was
>>>>>> willing to stop Fuseki during updates, as otherwise it would either
>>>>>> fail (because the built-in TDB mechanisms would prevent it) or it would
>>>>>> risk causing data corruption.
>>>>>> 
>>>>>> Rob
>>>>>> 
>>>>>> On 13/09/2018, 10:11, "Marco Neumann" <[email protected]> wrote:
>>>>>> 
>>>>>>  Markus, the tdbloader2 script is part of the apache-jena distribution.
>>>>>> 
>>>>>>  Let me know how you get on and how this improves your data load
>>>>>>  process.
>>>>>> 
>>>>>>  Marco
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>  On Thu, Sep 13, 2018 at 9:58 AM Markus Neumann <[email protected]>
>>>>>>  wrote:
>>>>>> 
>>>>>>> Hi Marco,
>>>>>>> 
>>>>>>> As this is a project for a customer, I'm afraid we can't make the data
>>>>>>> public.
>>>>>>> 
>>>>>>> 1. I'm running Fuseki-3.8.0 with the following configuration:
>>>>>>> @prefix :      <http://base/#> .
>>>>>>> @prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>>>>>>> @prefix tdb2:  <http://jena.apache.org/2016/tdb#> .
>>>>>>> @prefix ja:    <http://jena.hpl.hp.com/2005/11/Assembler#> .
>>>>>>> @prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
>>>>>>> @prefix fuseki: <http://jena.apache.org/fuseki#> .
>>>>>>> @prefix spatial: <http://jena.apache.org/spatial#> .
>>>>>>> @prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
>>>>>>> @prefix geosparql: <http://www.opengis.net/ont/geosparql#> .
>>>>>>> 
>>>>>>> :service_tdb_all  a                   fuseki:Service ;
>>>>>>>      rdfs:label                    "TDB2 mm" ;
>>>>>>>      fuseki:dataset                :spatial_dataset ;
>>>>>>>      fuseki:name                   "mm" ;
>>>>>>>      fuseki:serviceQuery           "query" , "sparql" ;
>>>>>>>      fuseki:serviceReadGraphStore  "get" ;
>>>>>>>      fuseki:serviceReadWriteGraphStore
>>>>>>>              "data" ;
>>>>>>>      fuseki:serviceUpdate          "update" ;
>>>>>>>      fuseki:serviceUpload          "upload" .
>>>>>>> 
>>>>>>> :spatial_dataset a spatial:SpatialDataset ;
>>>>>>>  spatial:dataset   :tdb_dataset_readwrite ;
>>>>>>>  spatial:index     <#indexLucene> ;
>>>>>>>  .
>>>>>>> 
>>>>>>> <#indexLucene> a spatial:SpatialIndexLucene ;
>>>>>>>  #spatial:directory <file:Lucene> ;
>>>>>>>  spatial:directory "mem" ;
>>>>>>>  spatial:definition <#definition> ;
>>>>>>>  .
>>>>>>> 
>>>>>>> <#definition> a spatial:EntityDefinition ;
>>>>>>>  spatial:entityField      "uri" ;
>>>>>>>  spatial:geoField     "geo" ;
>>>>>>>  # custom geo predicates for 1) Latitude/Longitude Format
>>>>>>>  spatial:hasSpatialPredicatePairs (
>>>>>>>       [ spatial:latitude geo:lat ; spatial:longitude geo:long ]
>>>>>>>       ) ;
>>>>>>>  # custom geo predicates for 2) Well Known Text (WKT) Literal
>>>>>>>  spatial:hasWKTPredicates (geosparql:asWKT) ;
>>>>>>>  # custom SpatialContextFactory for 2) Well Known Text (WKT) Literal
>>>>>>>  spatial:spatialContextFactory
>>>>>>> #        "com.spatial4j.core.context.jts.JtsSpatialContextFactory"
>>>>>>>          "org.locationtech.spatial4j.context.jts.JtsSpatialContextFactory"
>>>>>>>  .
>>>>>>> 
>>>>>>> :tdb_dataset_readwrite
>>>>>>>      a              tdb2:DatasetTDB2 ;
>>>>>>>      tdb2:location  "/srv/linked_data_store/fuseki-server/run/databases/mm" .
>>>>>>> 
>>>>>>> I've been through the Fuseki documentation several times, but I find
>>>>>>> it still a bit confusing. I would highly appreciate it if you could
>>>>>>> point me to other resources.
>>>>>>> 
>>>>>>> I have not found the tdbloader in the fuseki repo. For now I use a
>>>>>>> small shell script that wraps curl to upload the data:
>>>>>>> 
>>>>>>> if [ -n "$2" ]; then
>>>>>>>   # upload into a named graph when a graph name is given
>>>>>>>   ADD="?graph=http://rdf.meteomatics.com/mm/graphs/$2"
>>>>>>> fi
>>>>>>> curl --basic -u user:password -X POST -F "filename=@$1" \
>>>>>>>   "localhost:3030/mm/data${ADD}"
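>>>>>>> 
>>>>>>> Called e.g. as (script and file names hypothetical):
>>>>>>> 
>>>>>>>   ./upload.sh 2017-01-01_1M_30min.ttl forecast_tmp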
>>>>>>> 
>>>>>>> 2. Our customer has not specified a default use case yet, as the whole
>>>>>>> RDF concept is about as new to them as it is to me. I suppose it will
>>>>>>> be something like "Find all locations in a certain radius that have
>>>>>>> nice weather next Saturday".
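>>>>>>> 
>>>>>>> Presumably something along these lines, with the prefixes from our
>>>>>>> data (coordinates, date and threshold hypothetical):
>>>>>>> 
>>>>>>>   SELECT ?loc ?value WHERE {
>>>>>>>     ?loc spatial:nearby (47.54 7.61 50 'km') .
>>>>>>>     ?m  mm:location  ?loc ;
>>>>>>>         mm:parameter mm-parameters:t_2m:C ;
>>>>>>>         mm:validdate "2018-09-15T12:00:00Z"^^xsd:dateTime ;
>>>>>>>         mm:value     ?value .
>>>>>>>     FILTER (?value > 20)
>>>>>>>   }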
>>>>>>> 
>>>>>>> I just took a glance at the ha-fuseki page and will give it a try
>>>>>>> later.
>>>>>>> 
>>>>>>> Many thanks for your time
>>>>>>> 
>>>>>>> Best
>>>>>>> Markus
>>>>>>> 
>>>>>>>> On 13 Sep 2018, at 10:00, Marco Neumann <[email protected]> wrote:
>>>>>>>> 
>>>>>>>> Do you make the data endpoint publicly available?
>>>>>>>> 
>>>>>>>> 1. Did you try the tdbloader? What version of tdb2 do you use?
>>>>>>>> 
>>>>>>>> 2. There are many ways to improve your response time here. What does
>>>>>>>> a typical query look like? Do you make use of the spatial indexer?
>>>>>>>> 
>>>>>>>> And Andy has a work in progress here for more granular updates that
>>>>>>>> might be of interest to your effort as well: "High Availability
>>>>>>>> Apache Jena Fuseki"
>>>>>>>> 
>>>>>>>> https://afs.github.io/rdf-delta/ha-fuseki.html
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Wed, Sep 12, 2018 at 4:09 PM Markus Neumann <[email protected]>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> we are running a Fuseki server that will hold about 2.2 * 10^9
>>>>>>>>> triples of meteorological data eventually.
>>>>>>>>> I currently run it with "-Xmx80GB" on a 128GB server. The database
>>>>>>>>> is TDB2 on a 900GB SSD.
>>>>>>>>> 
>>>>>>>>> Now I face several performance issues:
>>>>>>>>> 1. Inserting data:
>>>>>>>>>     It takes more than one hour to upload the measurements of a
>>>>>>>>> month (7.5GB .ttl file, ~16 million triples) using the data-upload
>>>>>>>>> web-interface of fuseki.
>>>>>>>>>     Is there a way to do this faster?
>>>>>>>>> 2. Updating data:
>>>>>>>>>     We get new model runs 5 times per day. This is data for the
>>>>>>>>> next 10 days that needs to be updated every time.
>>>>>>>>>     My idea was to create a named graph "forecast" that holds the
>>>>>>>>> latest version of this data.
>>>>>>>>>     Every time a new model run arrives, I create a new temporary
>>>>>>>>> graph to upload the data to. Once this is finished, I move the
>>>>>>>>> temporary graph to "forecast".
>>>>>>>>>     This seems to do the work twice, as it takes 1 hour for the
>>>>>>>>> upload and 1 hour for the move.
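>>>>>>>>> 
>>>>>>>>> For reference, the move is a single SPARQL Update along these lines
>>>>>>>>> (graph IRIs hypothetical):
>>>>>>>>> 
>>>>>>>>>   MOVE GRAPH <http://rdf.meteomatics.com/mm/graphs/forecast_tmp>
>>>>>>>>>     TO GRAPH <http://rdf.meteomatics.com/mm/graphs/forecast>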
>>>>>>>>> 
>>>>>>>>> Our data consists of the following:
>>>>>>>>> 
>>>>>>>>> Locations (total 1607 -> 16070 triples):
>>>>>>>>> mm-locations:8500015 a mm:Location ;
>>>>>>>>> a geosparql:Geometry ;
>>>>>>>>> owl:sameAs <http://lod.opentransportdata.swiss/didok/8500015> ;
>>>>>>>>> geosparql:asWKT "POINT(7.61574425031 47.5425915732)"^^geosparql:wktLiteral ;
>>>>>>>>> mm:station_name "Basel SBB GB Ost" ;
>>>>>>>>> mm:abbreviation "BSGO" ;
>>>>>>>>> mm:didok_id 8500015 ;
>>>>>>>>> geo:lat 47.54259 ;
>>>>>>>>> geo:long 7.61574 ;
>>>>>>>>> mm:elevation 273 .
>>>>>>>>> 
>>>>>>>>> Parameters (total 14 -> 56 triples):
>>>>>>>>> mm-parameters:t_2m:C a mm:Parameter ;
>>>>>>>>> rdfs:label "t_2m:C" ;
>>>>>>>>> dcterms:description "Air temperature at 2m above ground in degree Celsius"@en ;
>>>>>>>>> mm:unit_symbol "˚C" .
>>>>>>>>> 
>>>>>>>>> Measurements (that is the huge bunch. Per day: 14 * 1607 * 48 ~ 1
>>>>>>>>> million measurements -> 5 million triples per day):
>>>>>>>>> mm-measurements:8500015_2018-09-02T00:00:00Z_t_2m:C a mm:Measurement ;
>>>>>>>>> mm:location mm-locations:8500015 ;
>>>>>>>>> mm:validdate "2018-09-02T00:00:00Z"^^xsd:dateTime ;
>>>>>>>>> mm:value 15.1 ;
>>>>>>>>> mm:parameter mm-parameters:t_2m:C .
>>>>>>>>> 
>>>>>>>>> I would really appreciate it if someone could give me some advice
>>>>>>>>> on how to handle these tasks, or point out things I could do to
>>>>>>>>> optimize the organization of the data.
>>>>>>>>> 
>>>>>>>>> Many thanks and kind regards
>>>>>>>>> Markus Neumann
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> 
>>>>>>>> 
>>>>>>>> ---
>>>>>>>> Marco Neumann
>>>>>>>> KONA
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>>  --
>>>>>> 
>>>>>> 
>>>>>>  ---
>>>>>>  Marco Neumann
>>>>>>  KONA
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> --
>>>>> 
>>>>> 
>>>>> ---
>>>>> Marco Neumann
>>>>> KONA
>>>> 
>>>> 
>>> 
>>> --
>>> 
>>> 
>>> ---
>>> Marco Neumann
>>> KONA
>> 
>> 
> 
> -- 
> 
> 
> ---
> Marco Neumann
> KONA
