Hello.

I'm building a monitoring system for a file processing workflow, based on Jena 
Fuseki. The system is meant to be purely additive. Each event is a piece of 
knowledge about one of the jobs, and the events are eventually connected via 
various identifiers and references. I can then search for an event and, with 
basic knowledge of the schema, rebuild a graph representing the whole 
processing job and display it to the customer. It seems to work and I'm happy 
with the whole idea.

There is, however, one thing I'd like to solve: the incredible amount of disk 
space the triplestore consumes. Currently there are about 40M triples 
(approximately 2 months of traffic), and if left unmaintained the database 
takes up a huge amount of disk space. The maintenance process is: stop Fuseki 
-> dump the database -> back up -> delete the old database -> load the dump -> 
start Fuseki. After this, the database is at most 10 % of its size before the 
maintenance. Then it grows back, and grows even further as new data is added.

Note that approx. half of the triples in each insert might already be in the 
database. Example:

### First event

e:MyReceiveEvent
    a e:Event ;
    a e:Receive ;
    e:subjectMessage e:MyMessage .
e:MyMessage
    a e:Message ;
    e:guid "MyMessage"^^xsd:string .

### Second event

e:MySendEvent
    a e:Event ;
    a e:Send ;
    e:subjectMessage e:MyMessage .
e:MyMessage
    a e:Message ;
    e:guid "MyMessage"^^xsd:string .

This is because I can only use inserts, and I know nothing about the 
triplestore contents at the point where I emit an event. However, when I 
re-inserted already existing triples on purpose in my tests, the database did 
not grow.
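
To make the overlap in the example concrete, here is a minimal Python sketch. 
It assumes triples can be modelled as plain (subject, predicate, object) tuples 
held in a set, and that e:guid is meant as a predicate. This is an illustration 
of RDF set semantics only, not of how Fuseki or TDB stores data on disk.

```python
# Illustration only: an RDF graph is a set of triples, so a repeated
# triple does not add a new logical statement to the graph.
RDF_TYPE = "rdf:type"  # what the Turtle keyword `a` stands for

first_event = {
    ("e:MyReceiveEvent", RDF_TYPE, "e:Event"),
    ("e:MyReceiveEvent", RDF_TYPE, "e:Receive"),
    ("e:MyReceiveEvent", "e:subjectMessage", "e:MyMessage"),
    ("e:MyMessage", RDF_TYPE, "e:Message"),
    ("e:MyMessage", "e:guid", '"MyMessage"^^xsd:string'),
}

second_event = {
    ("e:MySendEvent", RDF_TYPE, "e:Event"),
    ("e:MySendEvent", RDF_TYPE, "e:Send"),
    ("e:MySendEvent", "e:subjectMessage", "e:MyMessage"),
    ("e:MyMessage", RDF_TYPE, "e:Message"),
    ("e:MyMessage", "e:guid", '"MyMessage"^^xsd:string'),
}

# Simulate two insert-only updates against an initially empty graph.
store = set()
store |= first_event
store |= second_event

# Triples of the second insert that were already present after the first.
repeated = first_event & second_event

print(len(first_event) + len(second_event))  # triples sent: 10
print(len(store))                            # distinct triples kept: 8
print(len(repeated))                         # repeated triples: 2
```

In this sketch, 2 of the 5 triples in the second insert repeat earlier ones, 
and the graph ends up with 8 distinct triples out of the 10 sent.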

How can I fight this? Is it caused by the large number of inserts and some 
related aspect of the triplestore design? What steps should I take to optimize 
the store for this scenario?

Thank you very much for your responses.

Jan

