Hi Rob,

> On 13.09.2018 at 11:41, Rob Vesse <[email protected]> wrote:
> 
> Markus
> 
> Comments inline:
> 
> On 12/09/2018, 16:09, "Markus Neumann" <[email protected]> wrote:
> 
> Hi,
> 
> we are running a Fuseki server that will hold about 2.2 * 10^9 triples of
> meteorological data eventually.
> I currently run it with "-Xmx80GB" on a 128GB server. The database is TDB2
> on a 900GB SSD.
> 
> Now I face several performance issues:
> 1. Inserting data:
> It takes more than one hour to upload the measurements of a month
> (7.5GB .ttl file, ~16 million triples) using the data-upload web interface of
> Fuseki.
> Is there a way to do this faster?
> 
> At a minimum, try GZipping the file and uploading it in GZipped form to reduce
> the amount of data transferred over the network. It is possible that your
> bottleneck here is actually network upload bandwidth rather than anything with
> Jena itself. I would expect GZip to substantially reduce the file size and
> hopefully improve your load times.
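Just so I try the right thing: with plain curl against the Graph Store Protocol endpoint, I assume the gzipped upload would look roughly like this (the dataset name /ds, the graph name, and the file name are placeholders for our setup; whether Fuseki actually accepts a gzip Content-Encoding on the request body is exactly my question below):

    gzip -k measurements-2018-08.ttl
    curl -X POST --data-binary @measurements-2018-08.ttl.gz \
         -H 'Content-Type: text/turtle' \
         -H 'Content-Encoding: gzip' \
         'http://localhost:3030/ds/data?graph=http://example.org/graph/measurements'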
I am uploading on the server itself to localhost, so the network should not be an
issue (or am I missing something?). Can Fuseki handle gzipped .ttl files?

> Secondly, TDB is typically reported to achieve load speeds of up to around
> 200k triples/second, although that is for offline bulk loads with SSDs. Even
> if we assume you could achieve only 25k triples/second, that would suggest a
> theoretical load time of approximately 11 minutes. If you can set up your
> system so the TDB database is written to an SSD, that will improve your
> performance to some extent.

We are on an SSD setup. 11 minutes would be something we could live with...

> Thirdly, TDB is multi reader single writer (MRSW) concurrency, so if you have a
> lot of reads happening while trying to upload, which is a write operation,
> the write operation will be forced to wait for active readers to finish
> before proceeding, which may introduce some delays.

I'm aware of that, but as we are not in production mode yet, there are no
requests delaying the upload.

> So yes, I think you should be able to get faster load times.
> 
> 2. Updating data:
> We get new model runs 5 times per day. This is data for the next 10
> days that needs to be updated every time.
> My idea was to create a named graph "forecast" that holds the latest
> version of this data.
> Every time a new model run arrives, I create a new temporary graph to
> upload the data to. Once this is finished, I move the temporary graph to
> "forecast".
> This seems to do the work twice, as it takes 1 hour for the upload and 1
> hour for the move.
> 
> Yes, this is exactly what happens. The database that backs Fuseki, TDB, is a
> quad store, so it is storing each triple as a quad of GSPO where G is the
> graph name. So when you move the temporary graph it has to copy all the
> quads from the source graph to the target graph and then delete that source
> graph.

Thanks for that input. I will have to figure something else out here...

> Rob
> Markus
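P.S. Regarding the offline bulk loads you mention: if I understand correctly, the route for a big initial load would be something like the following, run with Fuseki stopped so nothing else has the TDB2 database open (the database directory and file name are placeholders for our setup):

    tdb2.tdbloader --loc /data/tdb2/meteo measurements-2018-08.ttl

and then start Fuseki against that location afterwards. Is that the recommended way to get closer to the bulk-load speeds?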

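Also, for completeness: the move I described is a single SPARQL Update request along these lines (graph names are placeholders for ours):

    MOVE GRAPH <http://example.org/graph/forecast-tmp>
      TO GRAPH <http://example.org/graph/forecast>

so, given your explanation of the GSPO quads, the second hour really is the copy-and-delete inside the store rather than anything unusual on our side.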