Thanks for the links. How do I see whether the loader builds the spatial index? As far as I understood the documentation, my config should produce the spatial index in memory. I haven't figured that part out completely though: when I start the database from scratch, the spatial indexing works. After a restart I have to re-upload the stations file (which is no big deal, as it is only 593K in size) to regenerate the index. I couldn't get it to work with a persistent index file though.
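My understanding from the docs is that a persistent index means pointing spatial:directory at a file: URI instead of "mem", along the lines of the commented-out line in my config further down this thread. A sketch only, with an illustrative path:

<#indexLucene> a spatial:SpatialIndexLucene ;
    # "mem" keeps the Lucene index in memory only, so it is lost on restart;
    # a file: URI should keep the index on disk instead (path is illustrative).
    spatial:directory <file:/srv/linked_data_store/fuseki-server/run/Lucene> ;
    spatial:definition <#definition> .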
Right now I'm trying the tdb2.tdbloader (I didn't see that before) and it seems to go even faster:

12:49:11 INFO loader :: Add: 41,000,000 2017-01-01_1M_30min.ttl (Batch: 67,980 / Avg: 62,995)
12:49:11 INFO loader :: Elapsed: 650.84 seconds [2018/09/13 12:49:11 UTC]

Is there a way to tell the loader that it should build the spatial index? Yes, we have to use the spatial filter eventually, so I would highly appreciate some more information on the correct setup here.

Many thanks.

> On 13.09.2018 at 14:19, Marco Neumann <[email protected]> wrote:
>
> :-)
>
> this sounds much better Markus. now with regards to the optimizer please consult the online documentation here:
>
> https://jena.apache.org/documentation/tdb/optimizer.html
> (it's a very simple process to create the stats file and place it in the directory)
>
> also did the loader index the spatial data? do your queries make use of the spatial filter?
>
> On Thu, Sep 13, 2018 at 12:59 PM Markus Neumann <[email protected]> wrote:
>
>> Marco,
>>
>> I just tried the tdbloader2 script with 1 month of data:
>>
>> INFO Total: 167,385,120 tuples : 1,143.55 seconds : 146,373.23 tuples/sec [2018/09/13 11:29:31 UTC]
>> 11:41:44 INFO Index Building Phase Completed
>> 11:41:46 INFO -- TDB Bulk Loader Finish
>> 11:41:46 INFO -- 1880 seconds
>>
>> That's already a lot better. I'm working on a way to reduce the amount of data.
>> Can you give me a pointer on
>>> don't forget to run the tdb optimizer to generate the stats.opt file.
>> ? I haven't heard of that so far...
>>
>> A more general question:
>> Would there be a benefit in using the Jena stack over the Fuseki bundle as I'm doing now? (The documentation was not clear to me on that point.)
>> If so: is there a guide on how to set it up?
>>
> fuseki makes use of the jena stack. think of the jena distribution as a kind of toolbox you can use to work with your different projects in addition to your fuseki endpoint.
>
> just make sure to configure the class path correctly
>
> https://jena.apache.org/documentation/tools/index.html
>
> Also, further to the conversation with Rob, he has a valid point with regards to data corruption. please do not update a live tdb database instance directly with tdbloader while it's connected to a running fuseki endpoint.
>
> shut down the fuseki server first and then run the loader, or run the loader process in parallel into a different target directory and swap the data or the path later on. I don't know if there is a hot-swap option in fuseki to map to a new directory, but a quick restart should do the trick.
>
>> Thanks and kind regards
>> Markus
>>
>>> On 13.09.2018 at 11:56, Marco Neumann <[email protected]> wrote:
>>>
>>> Rob, keeping fuseki live wasn't stated as a requirement for 1., so my advice stands. we are running similar updates with fresh data frequently.
>>>
>>> Markus, to keep fuseki downtime at a minimum you can pre-populate tdb into a temporary directory as well and later switch between directories. don't forget to run the tdb optimizer to generate the stats.opt file.
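For concreteness, generating the stats file for my TDB2 database would presumably look like this. A sketch only: it assumes the tdb2.tdbstats tool from the apache-jena distribution is on the PATH, that Fuseki is stopped, and that for TDB2 the file belongs in the active Data-000X sub-directory rather than the database root:

# Generate statistics for the query optimizer (Fuseki must not be
# running, since a TDB database may only be open in one JVM at a time).
DB=/srv/linked_data_store/fuseki-server/run/databases/mm
tdb2.tdbstats --loc="$DB" > /tmp/stats.opt
# Assumption: for TDB2 the stats file goes into the current Data-000X
# sub-directory, here Data-0001.
mv /tmp/stats.opt "$DB/Data-0001/stats.opt"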
>>>
>>> On Thu, Sep 13, 2018 at 10:33 AM Rob Vesse <[email protected]> wrote:
>>>
>>>> I am not sure the tdbloader/tdbloader2 scripts help in this case. This is an online update of a running Fuseki instance backed by TDB, from what has been described.
>>>>
>>>> Since a TDB instance can only be safely used by a single JVM at a time, using those scripts would not be a viable option here unless the OP was willing to stop Fuseki during updates, as otherwise it would either fail (because the built-in TDB mechanisms would prevent it) or it would risk causing data corruption.
>>>>
>>>> Rob
>>>>
>>>> On 13/09/2018, 10:11, "Marco Neumann" <[email protected]> wrote:
>>>>
>>>> Markus, the tdbloader2 script is part of the apache-jena distribution.
>>>>
>>>> let me know how you get on and how this improves your data load process.
>>>>
>>>> Marco
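For reference, running the bulk loader offline looks roughly like this. A sketch: it assumes the apache-jena tools are on the PATH and uses the database location from my config below; per Rob's warning above, Fuseki has to be stopped first:

# Offline bulk load into the TDB2 database behind Fuseki
# (stop Fuseki first; one JVM per TDB database).
DB=/srv/linked_data_store/fuseki-server/run/databases/mm
tdb2.tdbloader --loc="$DB" 2017-01-01_1M_30min.ttl
# Alternatively, load into a fresh directory and switch directories
# before restarting Fuseki, as Marco suggests above.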
>>>>
>>>> On Thu, Sep 13, 2018 at 9:58 AM Markus Neumann <[email protected]> wrote:
>>>>
>>>>> Hi Marco,
>>>>>
>>>>> as this is a project for a customer, I'm afraid we can't make the data public.
>>>>>
>>>>> 1. I'm running Fuseki-3.8.0 with the following configuration:
>>>>>
>>>>> @prefix :          <http://base/#> .
>>>>> @prefix rdf:       <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>>>>> @prefix tdb2:      <http://jena.apache.org/2016/tdb#> .
>>>>> @prefix ja:        <http://jena.hpl.hp.com/2005/11/Assembler#> .
>>>>> @prefix rdfs:      <http://www.w3.org/2000/01/rdf-schema#> .
>>>>> @prefix fuseki:    <http://jena.apache.org/fuseki#> .
>>>>> @prefix spatial:   <http://jena.apache.org/spatial#> .
>>>>> @prefix geo:       <http://www.w3.org/2003/01/geo/wgs84_pos#> .
>>>>> @prefix geosparql: <http://www.opengis.net/ont/geosparql#> .
>>>>>
>>>>> :service_tdb_all a fuseki:Service ;
>>>>>     rdfs:label "TDB2 mm" ;
>>>>>     fuseki:dataset :spatial_dataset ;
>>>>>     fuseki:name "mm" ;
>>>>>     fuseki:serviceQuery "query" , "sparql" ;
>>>>>     fuseki:serviceReadGraphStore "get" ;
>>>>>     fuseki:serviceReadWriteGraphStore "data" ;
>>>>>     fuseki:serviceUpdate "update" ;
>>>>>     fuseki:serviceUpload "upload" .
>>>>>
>>>>> :spatial_dataset a spatial:SpatialDataset ;
>>>>>     spatial:dataset :tdb_dataset_readwrite ;
>>>>>     spatial:index <#indexLucene> .
>>>>>
>>>>> <#indexLucene> a spatial:SpatialIndexLucene ;
>>>>>     # spatial:directory <file:Lucene> ;
>>>>>     spatial:directory "mem" ;
>>>>>     spatial:definition <#definition> .
>>>>>
>>>>> <#definition> a spatial:EntityDefinition ;
>>>>>     spatial:entityField "uri" ;
>>>>>     spatial:geoField "geo" ;
>>>>>     # custom geo predicates for 1) latitude/longitude format
>>>>>     spatial:hasSpatialPredicatePairs (
>>>>>         [ spatial:latitude geo:lat ; spatial:longitude geo:long ]
>>>>>     ) ;
>>>>>     # custom geo predicates for 2) Well Known Text (WKT) literal
>>>>>     spatial:hasWKTPredicates (geosparql:asWKT) ;
>>>>>     # custom SpatialContextFactory for 2) Well Known Text (WKT) literal
>>>>>     # "com.spatial4j.core.context.jts.JtsSpatialContextFactory"
>>>>>     spatial:spatialContextFactory "org.locationtech.spatial4j.context.jts.JtsSpatialContextFactory" .
>>>>>
>>>>> :tdb_dataset_readwrite a tdb2:DatasetTDB2 ;
>>>>>     tdb2:location "/srv/linked_data_store/fuseki-server/run/databases/mm" .
>>>>>
>>>>> I've been through the Fuseki documentation several times, but I still find it a bit confusing. I would highly appreciate it if you could point me to other resources.
>>>>>
>>>>> I have not found the tdbloader in the fuseki repo. For now I use a small shell script that wraps curl to upload the data:
>>>>>
>>>>> if [ -n "$2" ]
>>>>> then
>>>>>     ADD="?graph=http://rdf.meteomatics.com/mm/graphs/$2"
>>>>> fi
>>>>> curl --basic -u user:password -X POST -F "filename=@$1" localhost:3030/mm/data${ADD}
>>>>>
>>>>> 2. Our customer has not specified a default use case yet, as the whole RDF concept is about as new to them as it is to me. I suppose it will be something like "find all locations in a certain radius that have nice weather next Saturday".
>>>>>
>>>>> I just took a glance at the ha-fuseki page and will give it a try later.
>>>>>
>>>>> Many thanks for your time
>>>>>
>>>>> Best
>>>>> Markus
>>>>>
>>>>>> On 13.09.2018 at 10:00, Marco Neumann <[email protected]> wrote:
>>>>>>
>>>>>> do you make the data endpoint publicly available?
>>>>>>
>>>>>> 1. did you try the tdbloader? what version of tdb2 do you use?
>>>>>>
>>>>>> 2. there are many ways to improve your response time here. what does a typical query look like? do you make use of the spatial indexer?
>>>>>>
>>>>>> and Andy has a work in progress here for more granular updates that might be of interest to your effort as well: "High Availability Apache Jena Fuseki"
>>>>>>
>>>>>> https://afs.github.io/rdf-delta/ha-fuseki.html
>>>>>>
>>>>>> On Wed, Sep 12, 2018 at 4:09 PM Markus Neumann <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> we are running a Fuseki server that will eventually hold about 2.2 * 10^9 triples of meteorological data.
>>>>>>> I currently run it with "-Xmx80GB" on a 128GB server. The database is TDB2 on a 900GB SSD.
>>>>>>>
>>>>>>> Now I face several performance issues:
>>>>>>>
>>>>>>> 1. Inserting data:
>>>>>>> It takes more than one hour to upload the measurements of a month (a 7.5GB .ttl file, ~16 million triples) using the data-upload web interface of Fuseki.
>>>>>>> Is there a way to do this faster?
>>>>>>>
>>>>>>> 2. Updating data:
>>>>>>> We get new model runs 5 times per day. This is data for the next 10 days that needs to be updated every time.
>>>>>>> My idea was to create a named graph "forecast" that holds the latest version of this data.
>>>>>>> Every time a new model run arrives, I create a new temporary graph to upload the data to. Once this is finished, I move the temporary graph to "forecast".
>>>>>>> This seems to do the work twice, as it takes 1 hour for the upload and 1 hour for the move.
>>>>>>>
>>>>>>> Our data consists of the following:
>>>>>>>
>>>>>>> Locations (total 1607 -> 16070 triples):
>>>>>>>
>>>>>>> mm-locations:8500015 a mm:Location ;
>>>>>>>     a geosparql:Geometry ;
>>>>>>>     owl:sameAs <http://lod.opentransportdata.swiss/didok/8500015> ;
>>>>>>>     geosparql:asWKT "POINT(7.61574425031 47.5425915732)"^^geosparql:wktLiteral ;
>>>>>>>     mm:station_name "Basel SBB GB Ost" ;
>>>>>>>     mm:abbreviation "BSGO" ;
>>>>>>>     mm:didok_id 8500015 ;
>>>>>>>     geo:lat 47.54259 ;
>>>>>>>     geo:long 7.61574 ;
>>>>>>>     mm:elevation 273 .
>>>>>>>
>>>>>>> Parameters (total 14 -> 56 triples):
>>>>>>>
>>>>>>> mm-parameters:t_2m:C a mm:Parameter ;
>>>>>>>     rdfs:label "t_2m:C" ;
>>>>>>>     dcterms:description "Air temperature at 2m above ground in degree Celsius"@en ;
>>>>>>>     mm:unit_symbol "˚C" .
>>>>>>>
>>>>>>> Measurements (this is the huge bunch; per day: 14 * 1607 * 48 ~ 1 million measurements -> ~5 million triples):
>>>>>>>
>>>>>>> mm-measurements:8500015_2018-09-02T00:00:00Z_t_2m:C a mm:Measurement ;
>>>>>>>     mm:location mm-locations:8500015 ;
>>>>>>>     mm:validdate "2018-09-02T00:00:00Z"^^xsd:dateTime ;
>>>>>>>     mm:value 15.1 ;
>>>>>>>     mm:parameter mm-parameters:t_2m:C .
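For the radius use case, a query against the spatial index might look like the sketch below. It uses the spatial:withinCircle property function from jena-spatial; the mm: and mm-parameters: namespace URIs are placeholders (the real ones aren't shown in this thread), and the coordinates, radius, and date are illustrative:

PREFIX spatial: <http://jena.apache.org/spatial#>
PREFIX xsd:     <http://www.w3.org/2001/XMLSchema#>
# Placeholder namespaces -- the real mm: URIs are not shown in the thread.
PREFIX mm:            <http://rdf.meteomatics.com/mm/ontology#>
PREFIX mm-parameters: <http://rdf.meteomatics.com/mm/parameters/>

SELECT ?station ?value
WHERE {
  # Stations within 10 km of a point (arguments: lat, lon, radius, units).
  ?station spatial:withinCircle (47.54 7.62 10.0 'km') .
  # Their 2m temperature for one valid date, per the data model above.
  ?m a mm:Measurement ;
     mm:location ?station ;
     mm:parameter mm-parameters:t_2m:C ;
     mm:validdate "2018-09-08T12:00:00Z"^^xsd:dateTime ;
     mm:value ?value .
}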
>>>>>>>
>>>>>>> I would really appreciate it if someone could give me some advice on how to handle these tasks, or point out things I could do to optimize the organization of the data.
>>>>>>>
>>>>>>> Many thanks and kind regards
>>>>>>> Markus Neumann
>
> --
>
> ---
> Marco Neumann
> KONA
