Hi Rob,

It seems like Fuseki doesn't handle gzip. I created the file with 
`tar -cvzf tar_test.ttl.gz large_input.ttl`, which I assumed would produce a 
standard gzip.
Uploading fails with the following log:

[2018-09-13 11:23:33] Fuseki     ERROR [line: 1, col: 1 ] Out of place: [KEYWORD:PaxHeader]
[2018-09-13 11:23:50] Fuseki     INFO  [9] 400 Parse error: [line: 1, col: 1 ] Out of place: [KEYWORD:PaxHeader] (16.674 s)
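
Looking at the error again: `PaxHeader` is tar metadata, and `tar -czf` wraps 
the input in a tar archive before compressing, so the result is a gzipped tar 
archive rather than a plain gzipped Turtle file. That is probably what the 
parser is choking on. A plain gzip would be produced with, for example:

    # compress the file directly; produces large_input.ttl.gz
    # (-k keeps the original .ttl around as well)
    gzip -k large_input.ttl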

Markus

> On 13.09.2018, 11:59, Rob Vesse <[email protected]> wrote:
> 
> Markus
> 
> Jena in general should transparently recognize and handle files with a .gz 
> extension, provided they follow the standard approach of appending it after 
> the normal file extension, i.e. .ttl.gz. I checked the Fuseki code and 
> GZipped uploads should be supported.
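> 
> As a quick check (the file name here is just an example), the riot command 
> line tool will read the compressed file directly:
> 
>     # riot transparently decompresses .gz input while parsing
>     riot --validate data.ttl.gz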
> 
> From Jena 3.8.0, support is also provided for BZip2 files with a .bz2 
> extension and Snappy-compressed files with a .sz extension, although looking 
> at the Fuseki code I am not sure this is currently wired up into Fuseki.
> 
> Rob
> 
> On 13/09/2018, 10:49, "Markus Neumann" <[email protected]> wrote:
> 
>    Hi Rob,
> 
>> On 13.09.2018, 11:41, Rob Vesse <[email protected]> wrote:
>> 
>> Markus
>> 
>> Comments inline:
>> 
>> On 12/09/2018, 16:09, "Markus Neumann" <[email protected]> wrote:
>> 
>>   Hi,
>> 
>>   we are running a Fuseki server that will eventually hold about 
>> 2.2 * 10^9 triples of meteorological data.
>>   I currently run it with "-Xmx80GB" on a 128GB server. The database is 
>> TDB2 on a 900GB SSD.
>> 
>>   Now I face several performance issues:
>>   1. Inserting data:
>>      It takes more than one hour to upload the measurements of a month 
>> (7.5GB .ttl file, ~16 million triples) using the data-upload web interface 
>> of Fuseki.
>>      Is there a way to do this faster? 
>> 
>> At a minimum, try GZipping the file and uploading it in GZipped form to 
>> reduce the amount of data transferred over the network. It is possible that 
>> your bottleneck here is actually network upload bandwidth rather than 
>> anything in Jena itself. I would expect GZip to substantially reduce the 
>> file size and hopefully improve your load times.
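>> 
>> For example (a sketch only: the dataset name "ds" and Fuseki accepting 
>> Content-Encoding on the request body are assumptions on my part):
>> 
>>     # compress, then POST to the default graph of dataset "ds"
>>     gzip -k data.ttl
>>     curl -X POST --data-binary @data.ttl.gz \
>>          -H 'Content-Type: text/turtle' \
>>          -H 'Content-Encoding: gzip' \
>>          'http://localhost:3030/ds/data?default'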
> 
>    I am uploading on the server itself to localhost, so the network should 
> not be an issue (or am I missing something?).
>    Can Fuseki handle gzipped .ttl files?
> 
>> Secondly, TDB is typically reported to achieve load speeds of up to around 
>> 200k triples/second, although that is for offline bulk loads with SSDs. 
>> Even if we assume you could achieve only 25k triples/second, that would 
>> suggest a theoretical load time of approximately 11 minutes. If you can 
>> set up your system so the TDB database is written to an SSD, that will 
>> improve your performance to some extent.
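>> 
>> For comparison, an offline bulk load with the TDB2 loader (run with Fuseki 
>> stopped; the database location is a placeholder) would look like:
>> 
>>     # tdb2.tdbloader ships with the Jena distribution
>>     tdb2.tdbloader --loc /path/to/tdb2-database large_input.ttl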
>    We are on an SSD setup. 11 minutes would be something we could live with...
>> 
>> Thirdly, TDB uses multi-reader single-writer (MRSW) concurrency, so if you 
>> have a lot of reads happening while trying to upload, which is a write 
>> operation, the write operation will be forced to wait for active readers to 
>> finish before proceeding, which may introduce some delays.
>    I'm aware of that, but as we are not in production mode yet, there are no 
> requests delaying the upload.
>> 
>> So yes, I think you should be able to get faster load times.
>> 
>>   2. Updating data:
>>      We get new model runs 5 times per day. This is data for the next 10 
>> days that needs to be updated every time.
>>      My idea was to create a named graph "forecast" that holds the latest 
>> version of this data.
>>      Every time a new model run arrives, I create a new temporary graph to 
>> upload the data to. Once this is finished, I move the temporary graph to 
>> "forecast".
>>      This seems to do the work twice, as it takes 1 hour for the upload and 
>> 1 hour for the move.
>> 
>> Yes, this is exactly what happens. The database that backs Fuseki, TDB, is 
>> a quad store, so it stores each triple as a quad of GSPO, where G is the 
>> graph name. So when you move the temporary graph it has to copy all the 
>> quads from the source graph to the target graph and then delete the source 
>> graph.
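>> 
>> For reference, that move corresponds to a single SPARQL Update along these 
>> lines (graph names and dataset are placeholders), and the copy-then-delete 
>> is what accounts for the second hour:
>> 
>>     curl -X POST \
>>          --data-urlencode 'update=MOVE GRAPH <urn:g:tmp> TO GRAPH <urn:g:forecast>' \
>>          'http://localhost:3030/ds/update'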
>    Thanks for that input. I will have to figure out something else here...
>> 
>> Rob
>> 
> 
>    Markus
> 
