Set the classpath to include the spatialIndexer
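
something along these lines should do it (untested sketch; the jar and
config paths are placeholders, the jena-spatial classes ship in the fuseki
server jar):

  java -cp fuseki-server.jar jena.spatialindexer --desc=config.ttl

note that spatial:directory in the assembler has to point at a file
location (e.g. <file:Lucene>) rather than "mem", otherwise there is no
persistent index to build.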

On Thu 13 Sep 2018 at 20:30, Markus Neumann <[email protected]>
wrote:

> Hi,
>
> spatial index creation fails.
> I tried to follow the documentation but failed. I can't find the
> jena.spatialindexer tool to build the index manually, and the index I
> specified in my config does not work when I use the tdbloader.
>
> Any ideas?
>
>
> > > On 13.09.2018 at 19:48, Marco Neumann <[email protected]> wrote:
> >
> > to create the spatial index you can take a look at the "Building a
> > Spatial Index" section in the "Spatial searches with SPARQL" documentation here
> >
> > https://jena.apache.org/documentation/query/spatial-query.html
> >
> > hint: if a spatial filter query that should match your data in the
> > database returns no results, your data isn't spatially indexed correctly.
> > there will be no error or the like in the result set though.
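> >
> > a minimal sanity check (coordinates and radius are just an example):
> >
> >   PREFIX spatial: <http://jena.apache.org/spatial#>
> >   SELECT ?s WHERE { ?s spatial:nearby (47.54 7.61 10.0 'km') } LIMIT 10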
> >
> >
> >
> > On Thu, Sep 13, 2018 at 1:53 PM Markus Neumann <[email protected]>
> > wrote:
> >
> >> Thanks for the links.
> >>
> >> How do I see whether the loader builds the spatial index? As far as I
> >> understood the documentation, my config should produce the spatial index
> >> in memory. I haven't figured that part out completely though:
> >> When I start the database from scratch, the spatial indexing works.
> >> After a restart I have to re-upload the stations file (which is no big
> >> deal as it is only 593K in size) to regenerate the index.
> >> I couldn't get it to work with a persistent index file though.
> >>
> >> Right now I'm trying the tdb2.tdbloader (Didn't see that before) and it
> >> seems to go even faster:
> >> 12:49:11 INFO  loader               :: Add: 41,000,000
> >> 2017-01-01_1M_30min.ttl (Batch: 67,980 / Avg: 62,995)
> >> 12:49:11 INFO  loader               ::   Elapsed: 650.84 seconds
> >> [2018/09/13 12:49:11 UTC]
> >>
> >> Is there a way to tell the loader that it should build the spatial index?
> >>
> >> Yes, we have to use the spatial filter eventually, so I would highly
> >> appreciate some more information on the correct setup here.
> >>
> >> Many thanks.
> >>
> >>> On 13.09.2018 at 14:19, Marco Neumann <[email protected]> wrote:
> >>>
> >>> :-)
> >>>
> >>> this sounds much better Markus. now with regards to the optimizer
> >>> please consult the online documentation here:
> >>>
> >>> https://jena.apache.org/documentation/tdb/optimizer.html
> >>> (it's a very simple process to create the stats file and place it in
> >>> the directory)
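> >>>
> >>> roughly (untested; the path is a placeholder. write to a temp file
> >>> first so the optimizer never reads a half-written stats file):
> >>>
> >>>   tdbstats --loc=/path/to/DB > /tmp/stats.opt
> >>>   mv /tmp/stats.opt /path/to/DB/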
> >>>
> >>> also did the loader index the spatial data? do your queries make use
> >>> of the spatial filter?
> >>>
> >>> On Thu, Sep 13, 2018 at 12:59 PM Markus Neumann <[email protected]>
> >>> wrote:
> >>>
> >>>> Marco,
> >>>>
> >>>> I just tried the tdbloader2 script with 1 Month of data:
> >>>>
> >>>> INFO  Total: 167,385,120 tuples : 1,143.55 seconds : 146,373.23 tuples/sec
> >>>> [2018/09/13 11:29:31 UTC]
> >>>> 11:41:44 INFO Index Building Phase Completed
> >>>> 11:41:46 INFO -- TDB Bulk Loader Finish
> >>>> 11:41:46 INFO -- 1880 seconds
> >>>>
> >>>> That's already a lot better. I'm working on a way to reduce the
> >>>> amount of data by ...
> >>>> Can you give me a pointer on
> >>>>> don't forget to run the tdb optimizer to generate the stats.opt file.
> >>>> ? I haven't heard of that so far...
> >>>>
> >>>> A more general question:
> >>>> Would there be a benefit in using the jena stack over using the fuseki
> >>>> bundle as I'm doing now? (Documentation was not clear to me on that
> >>>> point)
> >>>>       - If so: is there a guide on how to set it up?
> >>>>
> >>>>
> >>> fuseki makes use of the jena stack. think of the jena distribution as a
> >>> kind of toolbox you can use to work with your different projects in
> >>> addition to your fuseki endpoint.
> >>>
> >>> just make sure to configure the class path correctly
> >>>
> >>> https://jena.apache.org/documentation/tools/index.html
> >>>
> >>> Also further to the conversation with Rob, he has a valid point with
> >>> regards to data corruption. please do not update a live tdb database
> >>> instance directly with tdbloader while it's connected to a running
> >>> fuseki endpoint.
> >>>
> >>> shut down the fuseki server first and then run the loader. or run the
> >>> loader process in parallel into a different target directory and swap
> >>> the data or the path again later on. I don't know if there is a hot
> >>> swap option in fuseki to map to a new directory but a quick restart
> >>> should do the trick.
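> >>>
> >>> for example (untested; paths and the symlink layout are placeholders):
> >>>
> >>>   # load into a fresh directory while the old one stays live
> >>>   tdbloader --loc /srv/databases/mm-new dump.ttl
> >>>   # then stop fuseki, swap the directory (or a symlink), and restart
> >>>   ln -sfn /srv/databases/mm-new /srv/databases/mm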
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>> Thanks and kind regards
> >>>> Markus
> >>>>
> >>>>> On 13.09.2018 at 11:56, Marco Neumann <[email protected]> wrote:
> >>>>>
> >>>>> Rob, keeping fuseki live wasn't stated as a requirement for 1. so my
> >>>>> advice stands. we are running similar updates with fresh data
> >>>>> frequently.
> >>>>>
> >>>>> Markus, to keep fuseki downtime at a minimum you can pre-populate
> >>>>> tdb into a temporary directory as well and later switch between
> >>>>> directories. don't forget to run the tdb optimizer to generate the
> >>>>> stats.opt file.
> >>>>>
> >>>>>
> >>>>> On Thu, Sep 13, 2018 at 10:33 AM Rob Vesse <[email protected]>
> >>>>> wrote:
> >>>>>
> >>>>>> I am not sure the tdbloader/tdbloader2 scripts help in this case.
> >>>>>> This is an online update of a running Fuseki instance backed by TDB
> >>>>>> from what has been described.
> >>>>>>
> >>>>>> Since a TDB instance can only be safely used by a single JVM at a
> >>>>>> time, using those scripts would not be a viable option here unless
> >>>>>> the OP was willing to stop Fuseki during updates, as otherwise it
> >>>>>> would either fail (because the built-in TDB mechanisms would prevent
> >>>>>> it) or it would risk causing data corruption.
> >>>>>>
> >>>>>> Rob
> >>>>>>
> >>>>>> On 13/09/2018, 10:11, "Marco Neumann" <[email protected]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>  Markus, the tdbloader2 script is part of the apache-jena
> >>>>>>  distribution.
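> >>>>>>
> >>>>>>  e.g. (untested sketch; paths are placeholders, and note it builds a
> >>>>>>  brand new database, so point it at an empty target directory):
> >>>>>>
> >>>>>>    export JENA_HOME=/opt/apache-jena
> >>>>>>    export PATH=$PATH:$JENA_HOME/bin
> >>>>>>    tdbloader2 --loc /srv/databases/mm-new data.ttl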
> >>>>>>
> >>>>>>  let me know how you get on and how this improves your data load
> >>>>>> process.
> >>>>>>
> >>>>>>  Marco
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>  On Thu, Sep 13, 2018 at 9:58 AM Markus Neumann
> >>>>>>  <[email protected]> wrote:
> >>>>>>
> >>>>>>> Hi Marco,
> >>>>>>>
> >>>>>>> as this is a project for a customer, I'm afraid we can't make the
> >>>>>>> data public.
> >>>>>>>
> >>>>>>> 1. I'm running Fuseki-3.8.0 with the following configuration:
> >>>>>>> @prefix :      <http://base/#> .
> >>>>>>> @prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
> >>>>>>> @prefix tdb2:  <http://jena.apache.org/2016/tdb#> .
> >>>>>>> @prefix ja:    <http://jena.hpl.hp.com/2005/11/Assembler#> .
> >>>>>>> @prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
> >>>>>>> @prefix fuseki: <http://jena.apache.org/fuseki#> .
> >>>>>>> @prefix spatial: <http://jena.apache.org/spatial#> .
> >>>>>>> @prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
> >>>>>>> @prefix geosparql: <http://www.opengis.net/ont/geosparql#> .
> >>>>>>>
> >>>>>>> :service_tdb_all  a                   fuseki:Service ;
> >>>>>>>      rdfs:label                    "TDB2 mm" ;
> >>>>>>>      fuseki:dataset                :spatial_dataset ;
> >>>>>>>      fuseki:name                   "mm" ;
> >>>>>>>      fuseki:serviceQuery           "query" , "sparql" ;
> >>>>>>>      fuseki:serviceReadGraphStore  "get" ;
> >>>>>>>      fuseki:serviceReadWriteGraphStore
> >>>>>>>              "data" ;
> >>>>>>>      fuseki:serviceUpdate          "update" ;
> >>>>>>>      fuseki:serviceUpload          "upload" .
> >>>>>>>
> >>>>>>> :spatial_dataset a spatial:SpatialDataset ;
> >>>>>>>  spatial:dataset   :tdb_dataset_readwrite ;
> >>>>>>>  spatial:index     <#indexLucene> ;
> >>>>>>>  .
> >>>>>>>
> >>>>>>> <#indexLucene> a spatial:SpatialIndexLucene ;
> >>>>>>>  #spatial:directory <file:Lucene> ;
> >>>>>>>  spatial:directory "mem" ;
> >>>>>>>  spatial:definition <#definition> ;
> >>>>>>>  .
> >>>>>>>
> >>>>>>> <#definition> a spatial:EntityDefinition ;
> >>>>>>>  spatial:entityField      "uri" ;
> >>>>>>>  spatial:geoField     "geo" ;
> >>>>>>>  # custom geo predicates for 1) Latitude/Longitude Format
> >>>>>>>  spatial:hasSpatialPredicatePairs (
> >>>>>>>       [ spatial:latitude geo:lat ; spatial:longitude geo:long ]
> >>>>>>>       ) ;
> >>>>>>>  # custom geo predicates for 2) Well Known Text (WKT) Literal
> >>>>>>>  spatial:hasWKTPredicates (geosparql:asWKT) ;
> >>>>>>>  # custom SpatialContextFactory for 2) Well Known Text (WKT) Literal
> >>>>>>>  spatial:spatialContextFactory
> >>>>>>>  #      "com.spatial4j.core.context.jts.JtsSpatialContextFactory"
> >>>>>>>         "org.locationtech.spatial4j.context.jts.JtsSpatialContextFactory"
> >>>>>>>  .
> >>>>>>>
> >>>>>>> :tdb_dataset_readwrite
> >>>>>>>      a              tdb2:DatasetTDB2 ;
> >>>>>>>      tdb2:location
> >>>>>>> "/srv/linked_data_store/fuseki-server/run/databases/mm" .
> >>>>>>>
> >>>>>>> I've been through the Fuseki documentation several times, but I
> >>>>>>> still find it a bit confusing. I would highly appreciate it if you
> >>>>>>> could point me to other resources.
> >>>>>>>
> >>>>>>> I have not found the tdbloader in the fuseki repo. For now I use a
> >>>>>>> small shell script that wraps curl to upload the data:
> >>>>>>>
> >>>>>>> # $1 = data file, $2 = optional named-graph suffix
> >>>>>>> if [ -n "$2" ]
> >>>>>>> then
> >>>>>>>   ADD="?graph=http://rdf.meteomatics.com/mm/graphs/$2"
> >>>>>>> fi
> >>>>>>> curl --basic -u user:password -X POST -F "filename=@$1" \
> >>>>>>>   "localhost:3030/mm/data${ADD}"
> >>>>>>>
> >>>>>>> 2. Our customer has not specified a default use case yet, as the
> >>>>>>> whole RDF concept is about as new to them as it is to me. I suppose
> >>>>>>> it will be something like "Find all locations in a certain radius
> >>>>>>> that have nice weather next Saturday".
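> >>>>>>>
> >>>>>>> Something like this is what I imagine (untested sketch; prefixes as
> >>>>>>> in our data, and the radius, date and threshold are made up):
> >>>>>>>
> >>>>>>> SELECT ?loc ?value WHERE {
> >>>>>>>   ?loc spatial:nearby (47.54 7.61 25.0 'km') .
> >>>>>>>   ?m mm:location ?loc ;
> >>>>>>>      mm:parameter ?p ;
> >>>>>>>      mm:validdate "2018-09-15T12:00:00Z"^^xsd:dateTime ;
> >>>>>>>      mm:value ?value .
> >>>>>>>   ?p rdfs:label "t_2m:C" .
> >>>>>>>   FILTER (?value > 20)
> >>>>>>> }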
> >>>>>>>
> >>>>>>> I just took a glance at the ha-fuseki page and will give it a try
> >>>>>>> later.
> >>>>>>>
> >>>>>>> Many thanks for your time
> >>>>>>>
> >>>>>>> Best
> >>>>>>> Markus
> >>>>>>>
> >>>>>>>> On 13.09.2018 at 10:00, Marco Neumann <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> do you make the data endpoint publicly available?
> >>>>>>>>
> >>>>>>>> 1. did you try the tdbloader? what version of tdb2 do you use?
> >>>>>>>>
> >>>>>>>> 2. many ways to improve your response time here. what does a
> >>>>>>>> typical query look like? do you make use of the spatial indexer?
> >>>>>>>>
> >>>>>>>> and Andy has a work in progress here for more granular updates
> >>>>>>>> that might be of interest to your effort as well: "High
> >>>>>>>> Availability Apache Jena Fuseki"
> >>>>>>>>
> >>>>>>>> https://afs.github.io/rdf-delta/ha-fuseki.html
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Wed, Sep 12, 2018 at 4:09 PM Markus Neumann
> >>>>>>>> <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> we are running a Fuseki server that will hold about 2.2 * 10^9
> >>>>>>>>> triples of meteorological data eventually.
> >>>>>>>>> I currently run it with "-Xmx80GB" on a 128GB server. The
> >>>>>>>>> database is TDB2 on a 900GB SSD.
> >>>>>>>>>
> >>>>>>>>> Now I face several performance issues:
> >>>>>>>>> 1. Inserting data:
> >>>>>>>>>     It takes more than one hour to upload the measurements of a
> >>>>>>>>>     month (7.5GB .ttl file, ~16 million triples) using the
> >>>>>>>>>     data-upload web-interface of fuseki.
> >>>>>>>>>     Is there a way to do this faster?
> >>>>>>>>> 2. Updating data:
> >>>>>>>>>     We get new model runs 5 times per day. This is data for the
> >>>>>>>>>     next 10 days that needs to be updated every time.
> >>>>>>>>>     My idea was to create a named graph "forecast" that holds the
> >>>>>>>>>     latest version of this data.
> >>>>>>>>>     Every time a new model run arrives, I create a new temporary
> >>>>>>>>>     graph to upload the data to. Once this is finished, I move
> >>>>>>>>>     the temporary graph to "forecast".
> >>>>>>>>>     This seems to do the work twice, as it takes 1 hour for the
> >>>>>>>>>     upload and 1 hour for the move.
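> >>>>>>>>>
> >>>>>>>>>     The move itself is a single SPARQL Update along these lines
> >>>>>>>>>     (the "staging" graph name is just an example):
> >>>>>>>>>
> >>>>>>>>>     MOVE <http://rdf.meteomatics.com/mm/graphs/staging>
> >>>>>>>>>       TO <http://rdf.meteomatics.com/mm/graphs/forecast>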
> >>>>>>>>>
> >>>>>>>>> Our data consists of the following:
> >>>>>>>>>
> >>>>>>>>> Locations (total 1607 -> 16070 triples):
> >>>>>>>>> mm-locations:8500015 a mm:Location ;
> >>>>>>>>> a geosparql:Geometry ;
> >>>>>>>>> owl:sameAs <http://lod.opentransportdata.swiss/didok/8500015> ;
> >>>>>>>>> geosparql:asWKT "POINT(7.61574425031 47.5425915732)"^^geosparql:wktLiteral ;
> >>>>>>>>> mm:station_name "Basel SBB GB Ost" ;
> >>>>>>>>> mm:abbreviation "BSGO" ;
> >>>>>>>>> mm:didok_id 8500015 ;
> >>>>>>>>> geo:lat 47.54259 ;
> >>>>>>>>> geo:long 7.61574 ;
> >>>>>>>>> mm:elevation 273 .
> >>>>>>>>>
> >>>>>>>>> Parameters (total 14 -> 56 triples):
> >>>>>>>>> mm-parameters:t_2m:C a mm:Parameter ;
> >>>>>>>>> rdfs:label "t_2m:C" ;
> >>>>>>>>> dcterms:description "Air temperature at 2m above ground in degree Celsius"@en ;
> >>>>>>>>> mm:unit_symbol "˚C" .
> >>>>>>>>>
> >>>>>>>>> Measurements (that is the huge bunch. Per day: 14 * 1607 * 48 ~
> >>>>>>>>> 1 million measurements -> 5 million triples per day):
> >>>>>>>>> mm-measurements:8500015_2018-09-02T00:00:00Z_t_2m:C a mm:Measurement ;
> >>>>>>>>> mm:location mm-locations:8500015 ;
> >>>>>>>>> mm:validdate "2018-09-02T00:00:00Z"^^xsd:dateTime ;
> >>>>>>>>> mm:value 15.1 ;
> >>>>>>>>> mm:parameter mm-parameters:t_2m:C .
> >>>>>>>>>
> >>>>>>>>> I would really appreciate it if someone could give me some advice
> >>>>>>>>> on how to handle these tasks or point out things I could do to
> >>>>>>>>> optimize the organization of the data.
> >>>>>>>>>
> >>>>>>>>> Many thanks and kind regards
> >>>>>>>>> Markus Neumann
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>


---
Marco Neumann
KONA
