On 13/09/18 12:26, Markus Neumann wrote:
Hi Rob,

seems like Fuseki doesn't handle gzip. I created the file with `tar -cvzf 
tar_test.ttl.gz large_input.ttl` so it should be a standard gzip.

That will be a standard tar archive that has then been compressed with gzip.

Run gzip itself on the file:

gzip < large_input.ttl > large_input.ttl.gz


Uploading fails with the following log:

[2018-09-13 11:23:33] Fuseki     ERROR [line: 1, col: 1 ] Out of place: 
[KEYWORD:PaxHeader]
[2018-09-13 11:23:50] Fuseki     INFO  [9] 400 Parse error: [line: 1, col: 1 ] 
Out of place: [KEYWORD:PaxHeader] (16.674 s)

PaxHeader is tar archive metadata, but the "ttl.gz" extension makes Fuseki guess that the content is a gzip-compressed TTL file, not a tar archive.
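The difference is easy to see from the shell (filenames here are just illustrative stand-ins):

```shell
# Tiny stand-in for the real Turtle file.
printf '<urn:s> <urn:p> <urn:o> .\n' > large_input.ttl

# Wrong: wraps the file in a tar archive, then gzips the archive.
tar -czf tar_test.ttl.gz large_input.ttl

# Right: gzip the Turtle file directly.
gzip -c large_input.ttl > large_input.ttl.gz

# Decompressing the plain gzip gives back the original Turtle; decompressing
# the tar.gz gives tar headers (e.g. PaxHeader) before any Turtle appears.
gzip -dc large_input.ttl.gz | head -1
```

In the tar.gz case the parser sees tar header bytes at line 1, column 1, which matches the "Out of place" error in the log above.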

    Andy


Markus

Am 13.09.2018 um 11:59 schrieb Rob Vesse <[email protected]>:

Markus

Jena in general should transparently recognize and handle files with a .gz 
extension, provided they follow the standard approach of appending it after 
the normal file extension, i.e. .ttl.gz.  I checked the Fuseki code and GZipped 
uploads should be supported.

 From Jena 3.8.0, support is also provided for BZip2 files with a .bz2 extension 
and Snappy-compressed files with a .sz extension, although looking at the 
Fuseki code I am not sure this is wired up into Fuseki currently.

Rob

On 13/09/2018, 10:49, "Markus Neumann" <[email protected]> wrote:

    Hi Rob,

Am 13.09.2018 um 11:41 schrieb Rob Vesse <[email protected]>:

Markus

Comments inline:

On 12/09/2018, 16:09, "Markus Neumann" <[email protected]> wrote:

   Hi,

   we are running a Fuseki server that will hold about 2.2 * 10^9 triples of 
meteorological data eventually.
   I currently run it with "-Xmx80GB" on a 128GB Server. The database is TDB2 
on a 900GB SSD.

   Now I face several performance issues:
   1. Inserting data:
        It takes more than one hour to upload the measurements of a month 
(a 7.5GB .ttl file, ~16 million triples) using the data-upload web interface 
of Fuseki.
        Is there a way to do this faster?

At a minimum try GZipping the file and uploading it in GZipped form to reduce 
the amount of data transferred over the network.  It is possible that your 
bottleneck here is actually network upload bandwidth rather than anything with 
Jena itself.  I would expect GZip to substantially reduce the file size and 
hopefully improve your load times.

    I am uploading on the server itself to localhost, so the network should not 
be an issue (or am I missing something?).
    Can Fuseki handle gzipped TTL files?
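For the record, a sketch of the compressed upload from the command line; the dataset name /ds is a placeholder, and whether the server honours the Content-Encoding header depends on its configuration (the .ttl.gz filename-suffix route is the one discussed elsewhere in this thread):

```shell
# Stand-in Turtle file; in practice this is the 7.5GB monthly file.
printf '<urn:s> <urn:p> <urn:o> .\n' > large_input.ttl

# Compress with gzip itself, keeping the .ttl.gz suffix so the format
# can be guessed from the filename.
gzip -c large_input.ttl > large_input.ttl.gz
gzip -t large_input.ttl.gz    # sanity check: a valid gzip stream

# POST to the dataset's graph store endpoint (skipped if no server runs).
curl -sf -X POST --data-binary @large_input.ttl.gz \
     -H 'Content-Type: text/turtle' \
     -H 'Content-Encoding: gzip' \
     http://localhost:3030/ds/data \
  || echo 'no Fuseki server at localhost:3030; upload not sent'
```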

Secondly TDB is typically reported to achieve load speeds of up to around 200k 
triples/second, although that is for offline bulk loads with SSDs.  Even if we 
assume you could achieve only 25k triples/second, that would suggest a 
theoretical load time of approximately 11 minutes.  If you can set up your 
system so the TDB database is written to an SSD, that will improve your 
performance to some extent.
    We are on an SSD setup. 11 minutes would be something we could live with...

Thirdly TDB uses multi reader single writer (MRSW) concurrency, so if you have a 
lot of reads happening while trying to upload (a write operation), the 
write will be forced to wait for active readers to finish before 
proceeding, which may introduce some delays.
    I'm aware of that, but as we are not in production mode yet, there are no 
requests delaying the upload.

So yes I think you should be able to get faster load times.
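For comparison, the fastest route is usually the offline bulk loader run directly against the database directory while Fuseki is stopped; a sketch, assuming the Jena command-line tools are on PATH and using placeholder paths:

```shell
# Stand-in data file; in practice the 7.5GB monthly Turtle file.
printf '<urn:s> <urn:p> <urn:o> .\n' > month.ttl

# Offline bulk load into a TDB2 database.  Fuseki must not be running
# against this location at the same time (single-writer storage).
if command -v tdb2.tdbloader >/dev/null 2>&1; then
    tdb2.tdbloader --loc /data/tdb2/meteo month.ttl
else
    echo 'tdb2.tdbloader not on PATH; install Apache Jena and retry'
fi
```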

   2. Updating data:
        We get new model runs 5 times per day. This is data for the next 10 
days, that needs to be updated every time.
        My idea was to create a named graph "forecast" that holds the latest 
version of this data.
        Every time a new model run arrives, I create a new temporary graph to upload the 
data to. Once this is finished, I move the temporary graph to "forecast".
        This seems to do the work twice, as it takes 1 hour for the upload and 1 
hour for the move.

Yes, this is exactly what happens.  The database that backs Fuseki, TDB, is a 
quad store, so it stores each triple as a GSPO quad where G is the graph 
name.  So when you move the temporary graph, it has to copy all the quads from 
the source graph to the target graph and then delete the source graph.
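If the swap approach is kept, the rename itself can be issued as a single SPARQL Update; a sketch with placeholder graph names and the dataset name /ds assumed:

```shell
# MOVE drops the target graph, copies the source into it, then drops the
# source - so with TDB it still rewrites every quad, as described above.
UPDATE='MOVE GRAPH <urn:graph:tmp> TO GRAPH <urn:graph:forecast>'

# Send it to the update endpoint (skipped gracefully if no server runs).
curl -sf -X POST --data-urlencode "update=$UPDATE" \
     http://localhost:3030/ds/update \
  || echo 'no Fuseki server at localhost:3030; update not sent'
```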
    Thanks for that input. I will have to figure something else here...

Rob


    Markus




