:-) This sounds much better, Markus. Now, with regards to the optimizer, please consult the online documentation here:
https://jena.apache.org/documentation/tdb/optimizer.html
(it's a very simple process to create the stats file and place it in the directory)
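For reference, that process boils down to running the tdbstats tool that ships with the apache-jena distribution. A minimal sketch, assuming the database path from Markus's configuration later in the thread, and that Fuseki has been stopped first (a TDB database must only be used by one JVM at a time):

    DB=/srv/linked_data_store/fuseki-server/run/databases/mm

    # Write the statistics to a temporary file first, never directly
    # into the database directory, then move the finished file into place.
    tdbstats --loc "$DB" > /tmp/stats.opt
    mv /tmp/stats.opt "$DB/stats.opt"

(For a TDB2 store the stats file belongs in the current Data-000X subdirectory rather than the top-level directory, and recent releases ship a tdb2.tdbstats variant; check the bin directory of your distribution.)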
Also, did the loader index the spatial data? Do your queries make use of the spatial filter?

On Thu, Sep 13, 2018 at 12:59 PM Markus Neumann <[email protected]> wrote:

> Marco,
>
> I just tried the tdbloader2 script with 1 month of data:
>
> INFO  Total: 167,385,120 tuples : 1,143.55 seconds : 146,373.23 tuples/sec [2018/09/13 11:29:31 UTC]
> 11:41:44 INFO  Index Building Phase Completed
> 11:41:46 INFO  -- TDB Bulk Loader Finish
> 11:41:46 INFO  -- 1880 seconds
>
> That's already a lot better. I'm working on a way to reduce the amount of data by
> Can you give me a pointer on
>
>     don't forget to run the tdb optimizer to generate the stats.opt file.
>
> ? I haven't heard of that so far...
>
> A more general question:
> Would there be a benefit in using the Jena stack over using the Fuseki bundle as I'm doing now? (The documentation was not clear to me on that point.)
> - If so: is there a guide on how to set it up?

Fuseki makes use of the Jena stack. Think of the Jena distribution as a kind of toolbox you can use to work with your different projects in addition to your Fuseki endpoint. Just make sure to configure the class path correctly: https://jena.apache.org/documentation/tools/index.html

Also, further to the conversation with Rob, he has a valid point with regards to data corruption. Please do not update a live TDB database instance directly with tdbloader while it's connected to a running Fuseki endpoint. Shut down the Fuseki server first and then run the loader, or run the loader process in parallel into a different target directory and swap the data or the path later on. I don't know if there is a hot-swap option in Fuseki to map to a new directory, but a quick restart should do the trick.
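A sketch of that load-in-parallel-and-swap approach; the staging path, data file names, and service-manager commands below are illustrative, not from this thread:

    DB=/srv/linked_data_store/fuseki-server/run/databases/mm
    NEW=/srv/linked_data_store/staging/mm-$(date +%Y%m%d%H%M)

    # Bulk-load into a fresh directory while Fuseki keeps serving the old one.
    # Note: tdbloader2 builds a TDB1-layout database; for a TDB2 dataset the
    # tdb2.tdbloader script from the same distribution is the analogue.
    tdbloader2 --loc "$NEW" /data/rdf/month-*.ttl

    # Generate the optimizer stats before the new database goes live.
    tdbstats --loc "$NEW" > /tmp/stats.opt && mv /tmp/stats.opt "$NEW/stats.opt"

    # Brief downtime: stop Fuseki, swap the directories, restart.
    systemctl stop fuseki            # or however the server is managed
    mv "$DB" "$DB.old" && mv "$NEW" "$DB"
    systemctl start fuseki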
> Thanks and kind regards
> Markus
>
> On 13.09.2018 at 11:56, Marco Neumann <[email protected]> wrote:
>
>> Rob, keeping Fuseki live wasn't stated as a requirement for 1, so my advice stands. We are running similar updates with fresh data frequently.
>>
>> Markus, to keep Fuseki downtime at a minimum you can pre-populate TDB into a temporary directory as well and later switch between directories. Don't forget to run the TDB optimizer to generate the stats.opt file.
>>
>> On Thu, Sep 13, 2018 at 10:33 AM Rob Vesse <[email protected]> wrote:
>>
>>> I am not sure the tdbloader/tdbloader2 scripts help in this case. This is an online update of a running Fuseki instance backed by TDB, from what has been described.
>>>
>>> Since a TDB instance can only be safely used by a single JVM at a time, using those scripts would not be a viable option here unless the OP was willing to stop Fuseki during updates, as otherwise it would either fail (because the built-in TDB mechanisms would prevent it) or it would risk causing data corruption.
>>>
>>> Rob
>>>
>>> On 13/09/2018, 10:11, "Marco Neumann" <[email protected]> wrote:
>>>
>>> Markus, the tdbloader2 script is part of the apache-jena distribution.
>>>
>>> Let me know how you get on and how this improves your data load process.
>>>
>>> Marco
>>>
>>> On Thu, Sep 13, 2018 at 9:58 AM Markus Neumann <[email protected]> wrote:
>>>
>>>> Hi Marco,
>>>>
>>>> As this is a project for a customer, I'm afraid we can't make the data public.
>>>>
>>>> 1. I'm running Fuseki-3.8.0 with the following configuration:
>>>>
>>>> @prefix :          <http://base/#> .
>>>> @prefix rdf:       <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>>>> @prefix tdb2:      <http://jena.apache.org/2016/tdb#> .
>>>> @prefix ja:        <http://jena.hpl.hp.com/2005/11/Assembler#> .
>>>> @prefix rdfs:      <http://www.w3.org/2000/01/rdf-schema#> .
>>>> @prefix fuseki:    <http://jena.apache.org/fuseki#> .
>>>> @prefix spatial:   <http://jena.apache.org/spatial#> .
>>>> @prefix geo:       <http://www.w3.org/2003/01/geo/wgs84_pos#> .
>>>> @prefix geosparql: <http://www.opengis.net/ont/geosparql#> .
>>>>
>>>> :service_tdb_all a fuseki:Service ;
>>>>     rdfs:label "TDB2 mm" ;
>>>>     fuseki:dataset :spatial_dataset ;
>>>>     fuseki:name "mm" ;
>>>>     fuseki:serviceQuery "query" , "sparql" ;
>>>>     fuseki:serviceReadGraphStore "get" ;
>>>>     fuseki:serviceReadWriteGraphStore "data" ;
>>>>     fuseki:serviceUpdate "update" ;
>>>>     fuseki:serviceUpload "upload" .
>>>>
>>>> :spatial_dataset a spatial:SpatialDataset ;
>>>>     spatial:dataset :tdb_dataset_readwrite ;
>>>>     spatial:index <#indexLucene> ;
>>>>     .
>>>>
>>>> <#indexLucene> a spatial:SpatialIndexLucene ;
>>>>     # spatial:directory <file:Lucene> ;
>>>>     spatial:directory "mem" ;
>>>>     spatial:definition <#definition> ;
>>>>     .
>>>>
>>>> <#definition> a spatial:EntityDefinition ;
>>>>     spatial:entityField "uri" ;
>>>>     spatial:geoField "geo" ;
>>>>     # custom geo predicates for 1) Latitude/Longitude format
>>>>     spatial:hasSpatialPredicatePairs (
>>>>         [ spatial:latitude geo:lat ; spatial:longitude geo:long ]
>>>>     ) ;
>>>>     # custom geo predicates for 2) Well Known Text (WKT) literal
>>>>     spatial:hasWKTPredicates (geosparql:asWKT) ;
>>>>     # custom SpatialContextFactory for 2) Well Known Text (WKT) literal
>>>>     spatial:spatialContextFactory
>>>>         # "com.spatial4j.core.context.jts.JtsSpatialContextFactory"
>>>>         "org.locationtech.spatial4j.context.jts.JtsSpatialContextFactory"
>>>>     .
>>>>
>>>> :tdb_dataset_readwrite
>>>>     a tdb2:DatasetTDB2 ;
>>>>     tdb2:location "/srv/linked_data_store/fuseki-server/run/databases/mm" .
>>>>
>>>> I've been through the Fuseki documentation several times, but I still find it a bit confusing. I would highly appreciate it if you could point me to other resources.
>>>>
>>>> I have not found the tdbloader in the Fuseki repo. For now I use a small shell script that wraps curl to upload the data:
>>>>
>>>> if [ ! -z $2 ]
>>>> then
>>>>     ADD="?graph=http://rdf.meteomatics.com/mm/graphs/$2"
>>>> fi
>>>> curl --basic -u user:password -X POST -F "filename=@$1" localhost:3030/mm/data${ADD}
>>>>
>>>> 2. Our customer has not specified a default use case yet, as the whole RDF concept is about as new to them as it is to me. I suppose it will be something like "Find all locations in a certain radius that have nice weather next Saturday".
>>>>
>>>> I just took a glance at the ha-fuseki page and will give it a try later.
>>>>
>>>> Many thanks for your time
>>>>
>>>> Best
>>>> Markus
>>>>
>>>> On 13.09.2018 at 10:00, Marco Neumann <[email protected]> wrote:
>>>>
>>>>> Do you make the data endpoint publicly available?
>>>>>
>>>>> 1. Did you try the tdbloader? What version of TDB2 do you use?
>>>>>
>>>>> 2. There are many ways to improve your response time here. What does a typical query look like? Do you make use of the spatial indexer?
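As an illustration of that last question: with the jena-spatial setup from the configuration above, a query against the Lucene index might look like the following. The coordinates (roughly the Basel station from the sample data), radius, units, and limit are invented for this example; the endpoint and credentials are the ones from the upload script in this thread:

    curl --basic -u user:password localhost:3030/mm/query \
         --data-urlencode 'query=
    PREFIX spatial: <http://jena.apache.org/spatial#>
    SELECT ?station
    WHERE {
      # jena-spatial property function: (latitude longitude radius units)
      ?station spatial:withinCircle (47.54 7.61 10.0 "km") .
    }
    LIMIT 20'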
>>>>> And Andy has a work in progress here for more granular updates that might be of interest to your effort as well: "High Availability Apache Jena Fuseki"
>>>>>
>>>>> https://afs.github.io/rdf-delta/ha-fuseki.html
>>>>>
>>>>> On Wed, Sep 12, 2018 at 4:09 PM Markus Neumann <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> we are running a Fuseki server that will eventually hold about 2.2 * 10^9 triples of meteorological data.
>>>>>> I currently run it with "-Xmx80GB" on a 128GB server. The database is TDB2 on a 900GB SSD.
>>>>>>
>>>>>> Now I face several performance issues:
>>>>>>
>>>>>> 1. Inserting data:
>>>>>> It takes more than one hour to upload the measurements of a month (7.5GB .ttl file, ~16 million triples), using the data-upload web interface of Fuseki.
>>>>>> Is there a way to do this faster?
>>>>>>
>>>>>> 2. Updating data:
>>>>>> We get new model runs 5 times per day. This is data for the next 10 days that needs to be updated every time.
>>>>>> My idea was to create a named graph "forecast" that holds the latest version of this data.
>>>>>> Every time a new model run arrives, I create a new temporary graph to upload the data to. Once this is finished, I move the temporary graph to "forecast".
>>>>>> This seems to do the work twice, as it takes 1 hour for the upload and 1 hour for the move.
>>>>>>
>>>>>> Our data consists of the following:
>>>>>>
>>>>>> Locations (total 1607 -> 16070 triples):
>>>>>>
>>>>>> mm-locations:8500015 a mm:Location ;
>>>>>>     a geosparql:Geometry ;
>>>>>>     owl:sameAs <http://lod.opentransportdata.swiss/didok/8500015> ;
>>>>>>     geosparql:asWKT "POINT(7.61574425031 47.5425915732)"^^geosparql:wktLiteral ;
>>>>>>     mm:station_name "Basel SBB GB Ost" ;
>>>>>>     mm:abbreviation "BSGO" ;
>>>>>>     mm:didok_id 8500015 ;
>>>>>>     geo:lat 47.54259 ;
>>>>>>     geo:long 7.61574 ;
>>>>>>     mm:elevation 273 .
>>>>>>
>>>>>> Parameters (total 14 -> 56 triples):
>>>>>>
>>>>>> mm-parameters:t_2m:C a mm:Parameter ;
>>>>>>     rdfs:label "t_2m:C" ;
>>>>>>     dcterms:description "Air temperature at 2m above ground in degree Celsius"@en ;
>>>>>>     mm:unit_symbol "˚C" .
>>>>>>
>>>>>> Measurements (that is the huge bunch; per day: 14 * 1607 * 48 ~ 1 million measurements -> 5 million triples per day):
>>>>>>
>>>>>> mm-measurements:8500015_2018-09-02T00:00:00Z_t_2m:C a mm:Measurement ;
>>>>>>     mm:location mm-locations:8500015 ;
>>>>>>     mm:validdate "2018-09-02T00:00:00Z"^^xsd:dateTime ;
>>>>>>     mm:value 15.1 ;
>>>>>>     mm:parameter mm-parameters:t_2m:C .
>>>>>>
>>>>>> I would really appreciate it if someone could give me some advice on how to handle these tasks, or point out things I could do to optimize the organization of the data.
>>>>>>
>>>>>> Many thanks and kind regards
>>>>>> Markus Neumann
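On point 2 of that original message: the temporary-graph-then-promote step can be driven through the update service declared in the configuration above. A sketch, reusing the graph naming pattern and credentials from the curl wrapper earlier in the thread (the graph name forecast-tmp is hypothetical); note that SPARQL MOVE rewrites the target graph by copying and then deleting inside the store, which is why the move costs about as much as the upload itself:

    BASE=http://rdf.meteomatics.com/mm/graphs

    # Upload the new model run into a temporary graph first (as in the
    # wrapper script above), then promote it in one update request.
    curl --basic -u user:password \
         --data-urlencode "update=MOVE SILENT GRAPH <$BASE/forecast-tmp> TO GRAPH <$BASE/forecast>" \
         localhost:3030/mm/update

A cheaper alternative, along the lines of Marco's swap advice, would be to upload each run into a timestamped graph and have queries select the newest one, so nothing has to be copied at all.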
--
---
Marco Neumann
KONA