Marco,

I just tried the tdbloader2 script with one month of data:

INFO  Total: 167,385,120 tuples : 1,143.55 seconds : 146,373.23 tuples/sec [2018/09/13 11:29:31 UTC]
 11:41:44 INFO Index Building Phase Completed
 11:41:46 INFO -- TDB Bulk Loader Finish
 11:41:46 INFO -- 1880 seconds
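
Roughly the invocation I used (the target directory and file name are
placeholders; note that tdbloader2 builds a TDB1-format database, while
tdb2.tdbloader is the TDB2 equivalent):

# offline bulk load into a fresh database directory
tdbloader2 --loc /srv/linked_data_store/tdb-tmp 2018-08.ttl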

That's already a lot better. I'm also working on a way to reduce the amount 
of data.
Can you give me a pointer on 
> Don't forget to run the TDB optimizer to generate the stats.opt file.
? I haven't heard of that so far...

A more general question:
Would there be a benefit in using the Jena stack over using the Fuseki bundle 
as I'm doing now? (The documentation was not clear to me on that point.)
        - If so: is there a guide on how to set it up?

Thanks and kind regards
Markus

> On 13.09.2018 at 11:56, Marco Neumann <[email protected]> wrote:
> 
> Rob, keeping Fuseki live wasn't stated as a requirement for 1., so my advice
> stands. We are running similar updates with fresh data frequently.
> 
> Markus, to keep Fuseki downtime to a minimum you can pre-populate TDB in
> a temporary directory as well and later switch between directories. Don't
> forget to run the TDB optimizer to generate the stats.opt file.
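> 
> A minimal sketch of that step (assuming a TDB1 database directory DB;
> writing to a temporary file first avoids clobbering a good stats file if
> the run fails):
> 
> tdbstats --loc DB > /tmp/stats.opt
> mv /tmp/stats.opt DB/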
> 
> 
> On Thu, Sep 13, 2018 at 10:33 AM Rob Vesse <[email protected]> wrote:
> 
>> I am not sure the tdbloader/tdbloader2 scripts help in this case. This is an
>> online update of a running Fuseki instance backed by TDB from what has been
>> described.
>> 
>> Since a TDB instance can only be safely used by a single JVM at a time,
>> using those scripts would not be a viable option here unless the OP was
>> willing to stop Fuseki during updates, as otherwise it would either fail
>> (because the built-in TDB mechanisms would prevent it) or it would risk
>> causing data corruption.
>> 
>> Rob
>> 
>> On 13/09/2018, 10:11, "Marco Neumann" <[email protected]> wrote:
>> 
>>    Markus, the tdbloader2 script is part of the apache-jena distribution.
>> 
>>    Let me know how you get on and how this improves your data load process.
>> 
>>    Marco
>> 
>> 
>> 
>>    On Thu, Sep 13, 2018 at 9:58 AM Markus Neumann <[email protected]> wrote:
>> 
>>> Hi Marco,
>>> 
>>> As this is a project for a customer, I'm afraid we can't make the data
>>> public.
>>> 
>>> 1. I'm running Fuseki-3.8.0 with the following configuration:
>>> @prefix :      <http://base/#> .
>>> @prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>>> @prefix tdb2:  <http://jena.apache.org/2016/tdb#> .
>>> @prefix ja:    <http://jena.hpl.hp.com/2005/11/Assembler#> .
>>> @prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
>>> @prefix fuseki: <http://jena.apache.org/fuseki#> .
>>> @prefix spatial: <http://jena.apache.org/spatial#> .
>>> @prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
>>> @prefix geosparql: <http://www.opengis.net/ont/geosparql#> .
>>> 
>>> :service_tdb_all  a                   fuseki:Service ;
>>>        rdfs:label                    "TDB2 mm" ;
>>>        fuseki:dataset                :spatial_dataset ;
>>>        fuseki:name                   "mm" ;
>>>        fuseki:serviceQuery           "query" , "sparql" ;
>>>        fuseki:serviceReadGraphStore  "get" ;
>>>        fuseki:serviceReadWriteGraphStore
>>>                "data" ;
>>>        fuseki:serviceUpdate          "update" ;
>>>        fuseki:serviceUpload          "upload" .
>>> 
>>> :spatial_dataset a spatial:SpatialDataset ;
>>>    spatial:dataset   :tdb_dataset_readwrite ;
>>>    spatial:index     <#indexLucene> ;
>>>    .
>>> 
>>> <#indexLucene> a spatial:SpatialIndexLucene ;
>>>    #spatial:directory <file:Lucene> ;
>>>    spatial:directory "mem" ;
>>>    spatial:definition <#definition> ;
>>>    .
>>> 
>>> <#definition> a spatial:EntityDefinition ;
>>>    spatial:entityField      "uri" ;
>>>    spatial:geoField     "geo" ;
>>>    # custom geo predicates for 1) Latitude/Longitude Format
>>>    spatial:hasSpatialPredicatePairs (
>>>         [ spatial:latitude geo:lat ; spatial:longitude geo:long ]
>>>         ) ;
>>>    # custom geo predicates for 2) Well Known Text (WKT) Literal
>>>    spatial:hasWKTPredicates (geosparql:asWKT) ;
>>>    # custom SpatialContextFactory for 2) Well Known Text (WKT) Literal
>>>    spatial:spatialContextFactory
>>> #         "com.spatial4j.core.context.jts.JtsSpatialContextFactory"
>>>          "org.locationtech.spatial4j.context.jts.JtsSpatialContextFactory"
>>>    .
>>> 
>>> :tdb_dataset_readwrite
>>>        a              tdb2:DatasetTDB2 ;
>>>        tdb2:location  "/srv/linked_data_store/fuseki-server/run/databases/mm" .
>>> 
>>> I've been through the Fuseki documentation several times, but I still
>>> find it a bit confusing. I would highly appreciate it if you could point
>>> me to other resources.
>>> 
>>> I have not found the tdbloader in the Fuseki repo. For now I use a small
>>> shell script that wraps curl to upload the data:
>>> 
>>> #!/bin/sh
>>> # $1: file to upload; $2: optional named-graph suffix
>>> if [ -n "$2" ]
>>> then
>>>    ADD="?graph=http://rdf.meteomatics.com/mm/graphs/$2"
>>> fi
>>> curl --basic -u user:password -X POST -F "filename=@$1" \
>>>    "localhost:3030/mm/data${ADD}"
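>>> 
>>> Invoked e.g. as follows (the script and file names here are made up; the
>>> second argument becomes the named-graph suffix):
>>> 
>>> ./upload.sh measurements-2018-08.ttl forecast_tmp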
>>> 
>>> 2. Our customer has not specified a default use case yet, as the whole
>>> RDF concept is about as new to them as it is to me. I suppose it will be
>>> something like "Find all locations in a certain radius that have nice
>>> weather next Saturday", along the lines of the sketch below.
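>>> 
>>> Roughly the shape of query I expect (an untested sketch; the coordinates,
>>> radius and units are placeholders, and spatial:withinCircle comes from
>>> the jena-spatial module configured above):
>>> 
>>> curl --basic -u user:password localhost:3030/mm/query \
>>>     --data-urlencode 'query=
>>>       PREFIX spatial: <http://jena.apache.org/spatial#>
>>>       SELECT ?loc WHERE {
>>>         # locations within 10 km of Basel (lat lon radius units)
>>>         ?loc spatial:withinCircle (47.54 7.62 10.0 "km") .
>>>       }'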
>>> 
>>> I just took a glance at the ha-fuseki page and will give it a try later.
>>> 
>>> Many thanks for your time
>>> 
>>> Best
>>> Markus
>>> 
>>>> On 13.09.2018 at 10:00, Marco Neumann <[email protected]> wrote:
>>>> 
>>>> Do you make the data endpoint publicly available?
>>>> 
>>>> 1. Did you try the tdbloader? What version of TDB2 do you use?
>>>> 
>>>> 2. There are many ways to improve your response time here. What does a
>>>> typical query look like? Do you make use of the spatial indexer?
>>>> 
>>>> And Andy has a work in progress for more granular updates that might be
>>>> of interest to your effort as well: "High Availability Apache Jena
>>>> Fuseki"
>>>> 
>>>> https://afs.github.io/rdf-delta/ha-fuseki.html
>>>> 
>>>> 
>>>> On Wed, Sep 12, 2018 at 4:09 PM Markus Neumann <[email protected]> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> We are running a Fuseki server that will hold about 2.2 * 10^9 triples
>>>>> of meteorological data eventually.
>>>>> I currently run it with "-Xmx80G" on a 128GB server. The database is
>>>>> TDB2 on a 900GB SSD.
>>>>> 
>>>>> Now I face several performance issues:
>>>>> 1. Inserting data:
>>>>>       It takes more than one hour to upload the measurements of a
>>>>> month (7.5GB .ttl file ~ 16 Mio triples), using the data-upload
>>>>> web interface of Fuseki.
>>>>>       Is there a way to do this faster?
>>>>> 2. Updating data:
>>>>>       We get new model runs 5 times per day. This is data for the
>>>>> next 10 days that needs to be updated every time.
>>>>>       My idea was to create a named graph "forecast" that holds the
>>>>> latest version of this data.
>>>>>       Every time a new model run arrives, I create a new temporary
>>>>> graph to upload the data to. Once this is finished, I move the
>>>>> temporary graph to "forecast" (sketched below).
>>>>>       This seems to do the work twice, as it takes 1 hour for the
>>>>> upload and 1 hour for the move.
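>>>>> 
>>>>> The move step, roughly (an untested sketch; the graph names are my
>>>>> assumption, and MOVE is the standard SPARQL Update operation sent to
>>>>> the update endpoint):
>>>>> 
>>>>> curl --basic -u user:password localhost:3030/mm/update \
>>>>>     --data-urlencode 'update=
>>>>>       MOVE <http://rdf.meteomatics.com/mm/graphs/forecast_tmp>
>>>>>         TO <http://rdf.meteomatics.com/mm/graphs/forecast>'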
>>>>> 
>>>>> Our data consists of the following:
>>>>> 
>>>>> Locations (total 1607 -> 16070 triples):
>>>>> mm-locations:8500015 a mm:Location ;
>>>>>   a geosparql:Geometry ;
>>>>>   owl:sameAs <http://lod.opentransportdata.swiss/didok/8500015> ;
>>>>>   geosparql:asWKT "POINT(7.61574425031 47.5425915732)"^^geosparql:wktLiteral ;
>>>>>   mm:station_name "Basel SBB GB Ost" ;
>>>>>   mm:abbreviation "BSGO" ;
>>>>>   mm:didok_id 8500015 ;
>>>>>   geo:lat 47.54259 ;
>>>>>   geo:long 7.61574 ;
>>>>>   mm:elevation 273 .
>>>>> 
>>>>> Parameters (total 14 -> 56 triples):
>>>>> mm-parameters:t_2m:C a mm:Parameter ;
>>>>>   rdfs:label "t_2m:C" ;
>>>>>   dcterms:description "Air temperature at 2m above ground in degree Celsius"@en ;
>>>>>   mm:unit_symbol "˚C" .
>>>>> 
>>>>> Measurements (that is the huge bunch. Per day: 14 * 1607 * 48 ~ 1 Mio
>>>>> -> 5 Mio triples per day):
>>>>> mm-measurements:8500015_2018-09-02T00:00:00Z_t_2m:C a mm:Measurement ;
>>>>>   mm:location mm-locations:8500015 ;
>>>>>   mm:validdate "2018-09-02T00:00:00Z"^^xsd:dateTime ;
>>>>>   mm:value 15.1 ;
>>>>>   mm:parameter mm-parameters:t_2m:C .
>>>>> 
>>>>> I would really appreciate it if someone could give me some advice on
>>>>> how to handle these tasks, or point out things I could do to optimize
>>>>> the organization of the data.
>>>>> 
>>>>> Many thanks and kind regards
>>>>> Markus Neumann
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> --
>>>> 
>>>> 
>>>> ---
>>>> Marco Neumann
>>>> KONA
>>> 
>>> 
>> 
>>    --
>> 
>> 
>>    ---
>>    Marco Neumann
>>    KONA
>> 
>> 
>> 
>> 
>> 
>> 
> 
> -- 
> 
> 
> ---
> Marco Neumann
> KONA
