Hi Andy!
01.10.2012 23:33, Andy Seaborne wrote:
It's not a GC issue, at least not in the normal low-level sense.
Write transactions are batched together for write-back to the main
database after they are committed. They are in the on-disk journal,
but the in-memory structures are also retained to provide a view of
the database with the transactions applied. These take memory. (It's
the indexes; the node data is written back in the prepare phase
because the node table is an append-only file.)
The batch size is set to 10: after 10 writes, the system flushes the
journal and drops the in-memory structures. So once you get past that
point, it should go "forever".
And every incoming request is parsed in-memory to check the validity
of the RDF. That is also a source of RAM usage.
Ah, thanks a lot! Now I understand what I was seeing. When I PUT several
(but <10) datasets, Fuseki will temporarily eat a lot of memory. And now
my problem is that for my datasets, this is more than the available heap.
I understand that batching is done for performance reasons (I just
read JENA-256), but in my scenario writes (using PUT) are usually
rather big and infrequent, so write performance is not important, or
at least not much helped by batching. The exception is when I
occasionally want to update every dataset in one go: then there are
several large PUTs, and Fuseki runs out of heap unless I restart it
between the PUTs.
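Just to illustrate where the heap goes on my side: as I understand it,
each big PUT is first materialised as an in-memory model before the
transaction commits. A rough sketch of that cost in plain Jena (this is
not the actual Fuseki code path, and the file name is just an example):

    import java.io.FileInputStream;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;

    public class ParseCost {
        public static void main(String[] args) throws Exception {
            // Sketch only, not the real Fuseki upload path: like a PUT,
            // the whole body is parsed into an in-memory Model first, so
            // heap use grows with the size of the upload itself, on top
            // of the batched post-commit structures.
            Model m = ModelFactory.createDefaultModel();
            m.read(new FileInputStream("big-dataset.ttl"), null, "TURTLE");
            System.out.println("Triples held in RAM: " + m.size());
        }
    }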
What the system should do is:
1/ use a persistent-but-cached layer for completed transactions
2/ be tunable (*)
3/ notice a store is transactional and use that instead of parsing
into an in-memory graph
but does not currently offer those features. Contributions welcome.
Andy
(*) I have tended to avoid adding lots of configuration options, as I
find that in other systems lots of knobs to tweak are unhelpful
overall. Either people use the default, or it takes deep magic to
control.
I understand; nothing is perfect, and there are always possible
improvements to be made. I also understand the aversion to knobs.
In my case, I would like to see in Fuseki and/or TDB a way to:
1) reduce the batch size to something less than 10 (say, 2 or 5),
2) turn off batching completely,
3) make batching behavior dependent on the size (in triples or
megabytes) of the accumulated queue, so a queue of large writes would be
flushed sooner than a queue of small writes, or
4) make batching behavior dependent on time, so that if no further
writes are performed within a certain time (say, 10 seconds or a
minute), the flush is done regardless of the size of the accumulated
write queue.
I guess 1 or 2 would be in the tunable category, while 3 and 4 would
maybe qualify as deep magic :)
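To make 3 and 4 a bit more concrete, here is the kind of policy I have
in mind. This is purely hypothetical: nothing like it exists in TDB
today, and all the names and thresholds below are invented:

    import java.util.Timer;
    import java.util.TimerTask;

    // Hypothetical sketch of options 3 and 4 above; all names and
    // thresholds are invented, nothing like this exists in TDB.
    public class FlushPolicy {
        private static final long MAX_QUEUED_TRIPLES = 100000; // option 3
        private static final long IDLE_FLUSH_MS = 10000;       // option 4

        private final Timer idleTimer = new Timer(true);
        private TimerTask pending = null;
        private long queuedTriples = 0;

        // Called after each committed write transaction.
        public synchronized void onCommit(long triplesWritten) {
            queuedTriples += triplesWritten;
            if (queuedTriples >= MAX_QUEUED_TRIPLES) {
                flush();                // large queue: flush immediately
                return;
            }
            if (pending != null)
                pending.cancel();
            pending = new TimerTask() { // otherwise flush after a quiet period
                public void run() { idleFlush(); }
            };
            idleTimer.schedule(pending, IDLE_FLUSH_MS);
        }

        private synchronized void idleFlush() {
            if (queuedTriples > 0)
                flush();
        }

        private void flush() {
            // ... write the journal back to the main database and drop
            // the in-memory views, as the batch-of-10 flush does now ...
            queuedTriples = 0;
        }
    }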
But now that I understand what's happening, I can at least work around
the problem.
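My current plan, since the flush kicks in after 10 writes: after a
round of big PUTs, push enough trivial write transactions through to
cross that threshold, so the journal gets flushed and the in-memory
structures are dropped. Something like this with ARQ's remote update
support (the service URL is just my local setup, and I'm assuming a
no-op write still counts toward the batch):

    import com.hp.hpl.jena.update.UpdateExecutionFactory;
    import com.hp.hpl.jena.update.UpdateFactory;
    import com.hp.hpl.jena.update.UpdateRequest;

    public class ForceFlush {
        public static void main(String[] args) {
            // The flush happens after 10 committed writes, so send 10
            // harmless writes to push the batch over the threshold.
            // Assumption: a write that matches nothing still counts.
            String service = "http://localhost:3030/ds/update";
            UpdateRequest nudge = UpdateFactory.create(
                    "DELETE WHERE { <urn:x-flush:dummy> ?p ?o }");
            for (int i = 0; i < 10; i++)
                UpdateExecutionFactory.createRemote(nudge, service).execute();
        }
    }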
-Osma
--
Osma Suominen | [email protected] | +358 40 5255 882
Aalto University, Department of Media Technology, Semantic Computing
Research Group
Room 2541, Otaniementie 17, Espoo, Finland; P.O. Box 15500, FI-00076
Aalto, Finland