Hi Andy!

On 01.10.2012 23:33, Andy Seaborne wrote:

> It's not a GC issue, at least not in the normal low-level sense.
>
> Write transactions are batched together for write-back to the main
> database after they are committed. They are in the journal on disk, but
> the in-memory structures are also retained, to provide access to a view
> of the database with the transactions applied. These take memory. (It's
> the indexes - the node data is written back in the prepare phase because
> it's an append-only file.)
>
> The batch size is set to 10 - after 10 writes, the system flushes the
> journal and drops the in-memory structures. So if you get past that
> point, it should go "forever".
>
> And every incoming request is parsed in memory to check the validity of
> the RDF. That is also a source of RAM usage.

Ah, thanks a lot! Now I understand what I was seeing. When I PUT several (but fewer than 10) datasets, Fuseki temporarily eats a lot of memory. My problem is that for my datasets, this is more than the available heap.
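
If I've understood correctly, the behaviour is roughly like the sketch below. This is just my mental model with made-up names, not the actual TDB code:

  import java.util.ArrayList;
  import java.util.List;

  // Illustrative sketch of TDB's write-back batching as described above;
  // the names are invented and this is not the real TransactionManager.
  class WriteBatcher {
      private static final int QUEUE_BATCH_SIZE = 10; // flush after 10 committed writes

      private final List<Object> committedViews = new ArrayList<>();

      // A write transaction commits: its changes go to the on-disk journal,
      // but the in-memory index structures are retained so readers see the
      // committed state. This is what holds on to heap between flushes.
      synchronized void commit(Object inMemoryIndexView) {
          appendToJournal(inMemoryIndexView);
          committedViews.add(inMemoryIndexView);
          if (committedViews.size() >= QUEUE_BATCH_SIZE)
              flush();
      }

      // Write the journal back to the main database and drop the retained
      // in-memory structures, releasing the heap.
      private void flush() {
          writeJournalToMainDatabase();
          committedViews.clear();
      }

      private void appendToJournal(Object view) { /* ... */ }
      private void writeJournalToMainDatabase() { /* ... */ }
  }

So if each of my large PUTs leaves such an in-memory view behind, a few of them together can exceed the heap before the tenth write triggers the flush.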

I understand that batching is done for performance reasons (I just read JENA-256). But in my scenario, writes (using PUT) are usually rather big and infrequent, so write performance is not important, or at least not helped much by batching. The exception is when I occasionally want to update every dataset in one go: then there are several large PUTs in a row, and Fuseki runs out of heap unless I restart it between the PUTs.
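
Concretely, each update is a single large PUT to the dataset's data endpoint (the SPARQL Graph Store Protocol). In Java it would look something like this; the dataset name "ds" and the file name are placeholders for my setup:

  import java.io.OutputStream;
  import java.net.HttpURLConnection;
  import java.net.URL;
  import java.nio.file.Files;
  import java.nio.file.Paths;

  public class PutDataset {
      public static void main(String[] args) throws Exception {
          // Fuseki's Graph Store Protocol endpoint; ?default targets the default graph.
          URL url = new URL("http://localhost:3030/ds/data?default");
          HttpURLConnection conn = (HttpURLConnection) url.openConnection();
          conn.setRequestMethod("PUT");
          conn.setDoOutput(true);
          conn.setRequestProperty("Content-Type", "text/turtle");
          // The whole dataset is sent in one request, so Fuseki parses it in
          // memory and then keeps the transaction's view until the batch flushes.
          byte[] body = Files.readAllBytes(Paths.get("dataset.ttl"));
          try (OutputStream out = conn.getOutputStream()) {
              out.write(body);
          }
          System.out.println("HTTP " + conn.getResponseCode());
          conn.disconnect();
      }
  }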

> What the system should do is:
> 1/ use a persistent-but-cached layer for completed transactions
> 2/ be tunable (*)
> 3/ notice a store is transactional and use that instead of parsing to an
> in-memory graph
>
> but it does not currently offer those features. Contributions welcome.
>
>         Andy
>
> (*) I have tended to avoid lots of configuration options, as I find that
> in other systems lots of knobs to tweak are unhelpful overall. Either
> people use the default, or it needs deep magic to control.

I understand; nothing is perfect and there are always improvements to be made. I also understand the aversion to knobs.

In my case, I would like Fuseki and/or TDB to offer a way to do one of the following:
1) reduce the batch size to something less than 10 (say, 2 or 5),
2) turn off batching completely,
3) make batching behavior dependent on the size (in triples or megabytes) of the accumulated queue, so that a queue of large writes is flushed sooner than a queue of small writes, or
4) make batching behavior dependent on time, so that if no further writes arrive within a certain period (say, 10 seconds or a minute), the queue is flushed regardless of its size.

I guess 1 or 2 would be in the tunable category, while 3 and 4 would maybe qualify as deep magic :)
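
To make options 3 and 4 concrete, here is a rough sketch of what such a flush policy could look like. This is entirely hypothetical code, just to illustrate the idea, not anything that exists in TDB:

  import java.util.concurrent.TimeUnit;

  // Hypothetical flush policy combining options 1-4 above; not actual TDB code.
  class FlushPolicy {
      private final int maxQueuedTxns;     // options 1/2: batch size (1 = no batching)
      private final long maxQueuedTriples; // option 3: size-based threshold
      private final long maxIdleMillis;    // option 4: time-based threshold

      private int queuedTxns = 0;
      private long queuedTriples = 0;
      private long lastWriteMillis = System.currentTimeMillis();

      FlushPolicy(int maxQueuedTxns, long maxQueuedTriples, long maxIdleSeconds) {
          this.maxQueuedTxns = maxQueuedTxns;
          this.maxQueuedTriples = maxQueuedTriples;
          this.maxIdleMillis = TimeUnit.SECONDS.toMillis(maxIdleSeconds);
      }

      // Record a committed write transaction of the given size.
      synchronized void onCommit(long triplesWritten) {
          queuedTxns++;
          queuedTriples += triplesWritten;
          lastWriteMillis = System.currentTimeMillis();
      }

      // True when the queued transactions should be written back to the
      // main database (a timer thread could poll this for option 4).
      synchronized boolean shouldFlush() {
          return queuedTxns >= maxQueuedTxns
              || queuedTriples >= maxQueuedTriples
              || (queuedTxns > 0
                  && System.currentTimeMillis() - lastWriteMillis >= maxIdleMillis);
      }

      // Called after the journal has been flushed and the in-memory views dropped.
      synchronized void onFlush() {
          queuedTxns = 0;
          queuedTriples = 0;
      }
  }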

But now that I understand what's happening, I can at least work around the problem.

-Osma


--
Osma Suominen | [email protected] | +358 40 5255 882
Aalto University, Department of Media Technology, Semantic Computing Research Group
Room 2541, Otaniementie 17, Espoo, Finland
P.O. Box 15500, FI-00076 Aalto, Finland
