I remember giving or reading advice on this here on the mailing list. If
you can't find it here please consult the old jena archive mailing list.

if you still can't find the answer to this question please open a new
thread and we will take it from there.


On Fri, Sep 14, 2018 at 6:36 AM Markus Neumann <mneum...@meteomatics.com>
wrote:

> I got the jar from
> https://mvnrepository.com/artifact/org.apache.jena/jena-spatial/3.8.0
> but the command from the documentation does not seem to work:
>
> java -cp jena-spatial-3.8.0.jar jena.spatialindexer --loc
> /srv/linked_data_store/prod_dp_2018-09-13-1
> Error: Could not find or load main class jena.spatialindexer
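For reference, a sketch of an invocation that puts the Jena runtime on the classpath as well — the jena-spatial jar does not bundle its dependencies, which is what produces the "Could not find or load main class" error above. The apache-jena-3.8.0 path here is an assumption:

```shell
# Sketch only: assumes an unpacked apache-jena-3.8.0 distribution next to
# the spatial jar; the spatial jar alone does not contain the Jena classes.
JENA_HOME=apache-jena-3.8.0
CMD="java -cp jena-spatial-3.8.0.jar:$JENA_HOME/lib/* jena.spatialindexer --loc /srv/linked_data_store/prod_dp_2018-09-13-1"
# Printed rather than run here, since the jars may not be present:
echo "$CMD"
```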
>
>
> > > On 13.09.2018 at 21:47, Marco Neumann <marco.neum...@gmail.com> wrote:
> >
> > Set the classpath to include the spatialIndexer
> >
> > On Thu 13 Sep 2018 at 20:30, Markus Neumann <mneum...@meteomatics.com>
> > wrote:
> >
> >> Hi,
> >>
> >> spatial index creation fails.
> >> I tried to figure it out from the documentation but failed. I can't find the
> >> jena.spatialindexer to build it manually and the one I specified in my
> >> config does not work when I use the tdbloader.
> >>
> >> Any ideas?
> >>
> >>
> >>> On 13.09.2018 at 19:48, Marco Neumann <marco.neum...@gmail.com> wrote:
> >>>
> >>> to create the spatial index you can take a look at the "Building a
> >>> Spatial Index" section in the "Spatial searches with SPARQL"
> >>> documentation here
> >>>
> >>> https://jena.apache.org/documentation/query/spatial-query.html
> >>>
> >>> hint: if you don't get results for a spatial filter query that matches
> >>> your data in the database, your data isn't spatially indexed correctly.
> >>> there will be no error or the like in the result set though.
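A minimal smoke test along those lines (a sketch: the spatial:nearby property function comes from the spatial-query documentation; the endpoint name "mm", the coordinates, and the radius are assumptions):

```shell
# Sketch: a spatial:nearby filter query; an empty result set over data
# that should match means the spatial index was not built.
QUERY='PREFIX spatial: <http://jena.apache.org/spatial#>
SELECT ?s WHERE { ?s spatial:nearby (47.54 7.61 10 "km") } LIMIT 10'
# Sent with curl (printed rather than run here, since no endpoint is live):
echo "curl -s --data-urlencode 'query=$QUERY' http://localhost:3030/mm/query"
```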
> >>>
> >>>
> >>>
> >>> On Thu, Sep 13, 2018 at 1:53 PM Markus Neumann <mneum...@meteomatics.com>
> >>> wrote:
> >>>
> >>>> Thanks for the links.
> >>>>
> >>>> How do I see if the loader does the spatial index? As far as I
> >>>> understood the documentation, my config should produce the spatial
> >>>> index in memory. I haven't figured that part out completely though:
> >>>> when I start the database from scratch, the spatial indexing works.
> >>>> After a restart I have to re-upload the stations file (which is no big
> >>>> deal as it is only 593K in size) to regenerate the index.
> >>>> I couldn't get it to work with a persistent index file though.
> >>>>
> >>>> Right now I'm trying the tdb2.tdbloader (Didn't see that before) and
> >>>> it seems to go even faster:
> >>>> 12:49:11 INFO  loader               :: Add: 41,000,000
> >>>> 2017-01-01_1M_30min.ttl (Batch: 67,980 / Avg: 62,995)
> >>>> 12:49:11 INFO  loader               ::   Elapsed: 650.84 seconds
> >>>> [2018/09/13 12:49:11 UTC]
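For the record, that output corresponds to an invocation roughly like this (a sketch; the target directory is an assumption, and it must not be a database that a running fuseki has open):

```shell
# Sketch: offline TDB2 bulk load; the --loc directory is an assumption
# and must not be in use by a live fuseki instance while loading.
LOAD_CMD="tdb2.tdbloader --loc /srv/linked_data_store/fuseki-server/run/databases/mm 2017-01-01_1M_30min.ttl"
# Printed rather than run here:
echo "$LOAD_CMD"
```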
> >>>>
> >>>> Is there a way to tell the loader that it should build the spatial
> >>>> index?
> >>>>
> >>>> Yes, we have to use the spatial filter eventually, so I would highly
> >>>> appreciate some more information on the correct setup here.
> >>>>
> >>>> Many thanks.
> >>>>
> >>>>> On 13.09.2018 at 14:19, Marco Neumann <marco.neum...@gmail.com> wrote:
> >>>>>
> >>>>> :-)
> >>>>>
> >>>>> this sounds much better Markus. now with regards to the optimizer
> >>>>> please consult the online documentation here:
> >>>>> https://jena.apache.org/documentation/tdb/optimizer.html
> >>>>> (it's a very simple process to create the stats file and place it in
> >>>>> the directory)
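For anyone following along, that process is roughly (a sketch: tdbstats ships in the apache-jena distribution's bin/ directory; the database path is taken from the config posted elsewhere in the thread; the docs advise generating the file outside the database directory and moving it in afterwards):

```shell
# Sketch: generate the optimizer stats outside the database directory,
# then move the file in; writing it in place while the tool reads the
# database is unsafe.
DB=/srv/linked_data_store/fuseki-server/run/databases/mm
STATS_CMD="tdbstats --loc=$DB > /tmp/stats.opt && mv /tmp/stats.opt $DB/stats.opt"
# Printed rather than run here:
echo "$STATS_CMD"
```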
> >>>>>
> >>>>> also did the loader index the spatial data? do your queries make use
> >>>>> of the spatial filter?
> >>>>>
> >>>>> On Thu, Sep 13, 2018 at 12:59 PM Markus Neumann <mneum...@meteomatics.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Marco,
> >>>>>>
> >>>>>> I just tried the tdbloader2 script with 1 Month of data:
> >>>>>>
> >>>>>> INFO  Total: 167,385,120 tuples : 1,143.55 seconds : 146,373.23
> >>>>>> tuples/sec [2018/09/13 11:29:31 UTC]
> >>>>>> 11:41:44 INFO Index Building Phase Completed
> >>>>>> 11:41:46 INFO -- TDB Bulk Loader Finish
> >>>>>> 11:41:46 INFO -- 1880 seconds
> >>>>>>
> >>>>>> That's already a lot better. I'm working on a way to reduce the
> >>>>>> amount of data by
> >>>>>> Can you give me a pointer on
> >>>>>>> don't forget to run the tdb optimizer to generate the stats.opt file.
> >>>>>> ? I haven't heard of that so far...
> >>>>>>
> >>>>>> A more general question:
> >>>>>> Would there be a benefit in using the jena stack over using the
> >>>>>> fuseki bundle as I'm doing now? (Documentation was not clear to me on
> >>>>>> that point)
> >>>>>>      - If so: is there a guide on how to set it up?
> >>>>>>
> >>>>>>
> >>>>> fuseki makes use of the jena stack. think of the jena distribution
> >>>>> as a kind of toolbox you can use to work with your different projects
> >>>>> in addition to your fuseki endpoint.
> >>>>>
> >>>>> just make sure to configure the class path correctly
> >>>>>
> >>>>> https://jena.apache.org/documentation/tools/index.html
> >>>>>
> >>>>> Also further to the conversation with Rob, he has a valid point with
> >>>>> regards to data corruption. please do not update a live tdb database
> >>>>> instance directly with tdbloader while it's connected to a running
> >>>>> fuseki endpoint.
> >>>>>
> >>>>> shut down the fuseki server first and then run the loader. or run the
> >>>>> loader process in parallel into a different target directory and swap
> >>>>> the data or the path again later on. I don't know if there is a hot
> >>>>> swap option in fuseki to map to a new directory but a quick restart
> >>>>> should do the trick.
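The path swap described here can be done with a symlink so the assembler config never has to change (a sketch; the paths are invented for illustration, and fuseki still needs a stop/start around the flip):

```shell
# Sketch: fuseki's tdb2:location points at the "current" symlink; bulk-load
# into the inactive directory, stop fuseki, flip the link, restart.
# Directory names here are made up.
mkdir -p /tmp/tdbswap/db-a /tmp/tdbswap/db-b
ln -sfn db-a /tmp/tdbswap/current   # the live database
# ... bulk-load into /tmp/tdbswap/db-b, stop fuseki, then flip:
ln -sfn db-b /tmp/tdbswap/current
readlink /tmp/tdbswap/current       # -> db-b
```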
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>> Thanks and kind regards
> >>>>>> Markus
> >>>>>>
> >>>>>>> On 13.09.2018 at 11:56, Marco Neumann <marco.neum...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Rob, keeping fuseki live wasn't stated as a requirement for 1. so
> >>>>>>> my advice stands. we are running similar updates with fresh data
> >>>>>>> frequently.
> >>>>>>>
> >>>>>>> Markus, to keep fuseki downtime at a minimum you can pre-populate
> >>>>>>> tdb into a temporary directory as well and later switch between
> >>>>>>> directories. don't forget to run the tdb optimizer to generate the
> >>>>>>> stats.opt file.
> >>>>>>>
> >>>>>>>
> >>>>>>> On Thu, Sep 13, 2018 at 10:33 AM Rob Vesse <rve...@dotnetrdf.org>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> I am not sure the tdbloader/tdbloader2 scripts help in this case.
> >>>>>>>> This is an online update of a running Fuseki instance backed by TDB
> >>>>>>>> from what has been described.
> >>>>>>>>
> >>>>>>>> Since a TDB instance can only be safely used by a single JVM at a
> >>>>>>>> time, using those scripts would not be a viable option here unless
> >>>>>>>> the OP was willing to stop Fuseki during updates, as otherwise it
> >>>>>>>> would either fail (because the built-in TDB mechanisms would prevent
> >>>>>>>> it) or it would risk causing data corruption
> >>>>>>>>
> >>>>>>>> Rob
> >>>>>>>>
> >>>>>>>> On 13/09/2018, 10:11, "Marco Neumann" <marco.neum...@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> Markus, the tdbloader2 script is part of the apache-jena
> >>>>>>>> distribution.
> >>>>>>>>
> >>>>>>>> let me know how you get on and how this improves your data load
> >>>>>>>> process.
> >>>>>>>>
> >>>>>>>> Marco
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Thu, Sep 13, 2018 at 9:58 AM Markus Neumann <mneum...@meteomatics.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Marco,
> >>>>>>>>>
> >>>>>>>>> as this is a project for a customer, I'm afraid we can't make the
> >>>>>>>>> data public.
> >>>>>>>>> 1. I'm running Fuseki-3.8.0 with the following configuration:
> >>>>>>>>> @prefix :      <http://base/#> .
> >>>>>>>>> @prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
> >>>>>>>>> @prefix tdb2:  <http://jena.apache.org/2016/tdb#> .
> >>>>>>>>> @prefix ja:    <http://jena.hpl.hp.com/2005/11/Assembler#> .
> >>>>>>>>> @prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
> >>>>>>>>> @prefix fuseki: <http://jena.apache.org/fuseki#> .
> >>>>>>>>> @prefix spatial: <http://jena.apache.org/spatial#> .
> >>>>>>>>> @prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
> >>>>>>>>> @prefix geosparql: <http://www.opengis.net/ont/geosparql#> .
> >>>>>>>>>
> >>>>>>>>> :service_tdb_all  a                   fuseki:Service ;
> >>>>>>>>>     rdfs:label                    "TDB2 mm" ;
> >>>>>>>>>     fuseki:dataset                :spatial_dataset ;
> >>>>>>>>>     fuseki:name                   "mm" ;
> >>>>>>>>>     fuseki:serviceQuery           "query" , "sparql" ;
> >>>>>>>>>     fuseki:serviceReadGraphStore  "get" ;
> >>>>>>>>>     fuseki:serviceReadWriteGraphStore
> >>>>>>>>>             "data" ;
> >>>>>>>>>     fuseki:serviceUpdate          "update" ;
> >>>>>>>>>     fuseki:serviceUpload          "upload" .
> >>>>>>>>>
> >>>>>>>>> :spatial_dataset a spatial:SpatialDataset ;
> >>>>>>>>> spatial:dataset   :tdb_dataset_readwrite ;
> >>>>>>>>> spatial:index     <#indexLucene> ;
> >>>>>>>>> .
> >>>>>>>>>
> >>>>>>>>> <#indexLucene> a spatial:SpatialIndexLucene ;
> >>>>>>>>> #spatial:directory <file:Lucene> ;
> >>>>>>>>> spatial:directory "mem" ;
> >>>>>>>>> spatial:definition <#definition> ;
> >>>>>>>>> .
> >>>>>>>>>
> >>>>>>>>> <#definition> a spatial:EntityDefinition ;
> >>>>>>>>> spatial:entityField      "uri" ;
> >>>>>>>>> spatial:geoField     "geo" ;
> >>>>>>>>> # custom geo predicates for 1) Latitude/Longitude Format
> >>>>>>>>> spatial:hasSpatialPredicatePairs (
> >>>>>>>>>      [ spatial:latitude geo:lat ; spatial:longitude geo:long ]
> >>>>>>>>>      ) ;
> >>>>>>>>> # custom geo predicates for 2) Well Known Text (WKT) Literal
> >>>>>>>>> spatial:hasWKTPredicates (geosparql:asWKT) ;
> >>>>>>>>> # custom SpatialContextFactory for 2) Well Known Text (WKT) Literal
> >>>>>>>>> spatial:spatialContextFactory
> >>>>>>>>> # "com.spatial4j.core.context.jts.JtsSpatialContextFactory"
> >>>>>>>>> "org.locationtech.spatial4j.context.jts.JtsSpatialContextFactory"
> >>>>>>>>> .
> >>>>>>>>>
> >>>>>>>>> :tdb_dataset_readwrite
> >>>>>>>>>     a              tdb2:DatasetTDB2 ;
> >>>>>>>>>     tdb2:location
> >>>>>>>>> "/srv/linked_data_store/fuseki-server/run/databases/mm" .
> >>>>>>>>>
> >>>>>>>>> I've been through the Fuseki documentation several times, but I
> >>>>>>>>> find it still a bit confusing. I would highly appreciate if you
> >>>>>>>>> could point me to other resources.
> >>>>>>>>>
> >>>>>>>>> I have not found the tdbloader in the fuseki repo. For now I use a
> >>>>>>>>> small shell script that wraps curl to upload the data:
> >>>>>>>>>
> >>>>>>>>> if [ -n "$2" ]
> >>>>>>>>> then
> >>>>>>>>>     ADD="?graph=http://rdf.meteomatics.com/mm/graphs/$2"
> >>>>>>>>> fi
> >>>>>>>>> curl --basic -u user:password -X POST -F "filename=@$1" \
> >>>>>>>>>     "localhost:3030/mm/data${ADD}"
> >>>>>>>>>
> >>>>>>>>> 2. Our customer has not specified a default use case yet, as the
> >>>>>>>>> whole RDF concept is about as new to them as it is to me. I suppose
> >>>>>>>>> it will be something like "Find all locations in a certain radius
> >>>>>>>>> that have nice weather next Saturday".
> >>>>>>>>>
> >>>>>>>>> I just took a glance at the ha-fuseki page and will give it a try
> >>>>>>>>> later.
> >>>>>>>>>
> >>>>>>>>> Many thanks for your time
> >>>>>>>>>
> >>>>>>>>> Best
> >>>>>>>>> Markus
> >>>>>>>>>
> >>>>>>>>>> On 13.09.2018 at 10:00, Marco Neumann <marco.neum...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> do you make the data endpoint publicly available?
> >>>>>>>>>>
> >>>>>>>>>> 1. did you try the tdbloader, what version of tdb2 do you use?
> >>>>>>>>>>
> >>>>>>>>>> 2. many ways to improve your response time here. what does a
> >>>>>>>>>> typical query look like? do you make use of the spatial indexer?
> >>>>>>>>>>
> >>>>>>>>>> and Andy has a work in progress here for more granular updates
> >>>>>>>>>> that might be of interest to your effort as well: "High
> >>>>>>>>>> Availability Apache Jena Fuseki"
> >>>>>>>>>>
> >>>>>>>>>> https://afs.github.io/rdf-delta/ha-fuseki.html
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Sep 12, 2018 at 4:09 PM Markus Neumann <mneum...@meteomatics.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi,
> >>>>>>>>>>>
> >>>>>>>>>>> we are running a Fuseki server that will hold about 2.2 * 10^9
> >>>>>>>>>>> triples of meteorological data eventually.
> >>>>>>>>>>> I currently run it with "-Xmx80GB" on a 128GB Server. The
> >>>>>>>>>>> database is TDB2 on a 900GB SSD.
> >>>>>>>>>>>
> >>>>>>>>>>> Now I face several performance issues:
> >>>>>>>>>>> 1. Inserting data:
> >>>>>>>>>>>    It takes more than one hour to upload the measurements of a
> >>>>>>>>>>> month (7.5GB .ttl file ~ 16 Mio triples) (using the data-upload
> >>>>>>>>>>> web-interface of fuseki)
> >>>>>>>>>>>    Is there a way to do this faster?
> >>>>>>>>>>> 2. Updating data:
> >>>>>>>>>>>    We get new model runs 5 times per day. This is data for the
> >>>>>>>>>>> next 10 days, that needs to be updated every time.
> >>>>>>>>>>>    My idea was to create a named graph "forecast" that holds the
> >>>>>>>>>>> latest version of this data.
> >>>>>>>>>>>    Every time a new model run arrives, I create a new temporary
> >>>>>>>>>>> graph to upload the data to. Once this is finished, I move the
> >>>>>>>>>>> temporary graph to "forecast".
> >>>>>>>>>>>    This seems to do the work twice as it takes 1 hour for the
> >>>>>>>>>>> upload and 1 hour for the move.
> >>>>>>>>>>>
> >>>>>>>>>>> Our data consists of the following:
> >>>>>>>>>>>
> >>>>>>>>>>> Locations (total 1607 -> 16070 triples):
> >>>>>>>>>>> mm-locations:8500015 a mm:Location ;
> >>>>>>>>>>> a geosparql:Geometry ;
> >>>>>>>>>>> owl:sameAs <http://lod.opentransportdata.swiss/didok/8500015> ;
> >>>>>>>>>>> geosparql:asWKT "POINT(7.61574425031
> >>>>>>>>>>> 47.5425915732)"^^geosparql:wktLiteral ;
> >>>>>>>>>>> mm:station_name "Basel SBB GB Ost" ;
> >>>>>>>>>>> mm:abbreviation "BSGO" ;
> >>>>>>>>>>> mm:didok_id 8500015 ;
> >>>>>>>>>>> geo:lat 47.54259 ;
> >>>>>>>>>>> geo:long 7.61574 ;
> >>>>>>>>>>> mm:elevation 273 .
> >>>>>>>>>>>
> >>>>>>>>>>> Parameters (total 14 -> 56 triples):
> >>>>>>>>>>> mm-parameters:t_2m:C a mm:Parameter ;
> >>>>>>>>>>> rdfs:label "t_2m:C" ;
> >>>>>>>>>>> dcterms:description "Air temperature at 2m above ground in
> >>>>>>>>>>> degree Celsius"@en ;
> >>>>>>>>>>> mm:unit_symbol "˚C" .
> >>>>>>>>>>>
> >>>>>>>>>>> Measurements (that is the huge bunch. Per day: 14 * 1607 * 48 ~
> >>>>>>>>>>> 1 Mio -> 5 Mio triples per day):
> >>>>>>>>>>> mm-measurements:8500015_2018-09-02T00:00:00Z_t_2m:C a
> >>>>>>>>>>> mm:Measurement ;
> >>>>>>>>>>> mm:location mm-locations:8500015 ;
> >>>>>>>>>>> mm:validdate "2018-09-02T00:00:00Z"^^xsd:dateTime ;
> >>>>>>>>>>> mm:value 15.1 ;
> >>>>>>>>>>> mm:parameter mm-parameters:t_2m:C .
> >>>>>>>>>>>
> >>>>>>>>>>> I would really appreciate it if someone could give me some
> >>>>>>>>>>> advice on how to handle these tasks or point out things I could
> >>>>>>>>>>> do to optimize the organization of the data.
> >>>>>>>>>>>
> >>>>>>>>>>> Many thanks and kind regards
> >>>>>>>>>>> Markus Neumann
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> ---
> >>>>>>>>>> Marco Neumann
> >>>>>>>>>> KONA
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> ---
> >>>>>>>> Marco Neumann
> >>>>>>>> KONA
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>>
> >>>>>>>
> >>>>>>> ---
> >>>>>>> Marco Neumann
> >>>>>>> KONA
> >>>>>>
> >>>>>>
> >>>>>
> >>>>> --
> >>>>>
> >>>>>
> >>>>> ---
> >>>>> Marco Neumann
> >>>>> KONA
> >>>>
> >>>>
> >>>
> >>> --
> >>>
> >>>
> >>> ---
> >>> Marco Neumann
> >>> KONA
> >>
> >> --
> >
> >
> > ---
> > Marco Neumann
> > KONA
>
>

-- 


---
Marco Neumann
KONA
