On Tue, Oct 2, 2012 at 12:32 AM, Osma Suominen <[email protected]> wrote:
> Hi Andy!
>
> 01.10.2012 23:33, Andy Seaborne kirjoitti:
>
>
>> It's not a GC issue, at least not in the normal low level sense.
>>
>> Write transactions are batched together for write-back to the main
>> database after they are committed. They are in the journal on-disk but
>> also the in-memory structures are retained for access to a view of the
>> database with the transactions applied. These take memory. (It's the
>> indexes; the node data is written back in the prepare phase because
>> the node table is an append-only file.)
>>
>> The batching size is set to 10 - after 10 writes, the system flushes the
>> journal and drops the in-memory structures. So if you get past that
>> point, it should go "forever".
>>
>> Also, every incoming request is parsed in-memory to check the validity
>> of the RDF, which is another source of RAM usage.
>
>
> Ah, thanks a lot! Now I understand what I was seeing. When I PUT several
> (but fewer than 10) datasets, Fuseki temporarily uses a lot of memory, and
> for my datasets this is more than the available heap.
>
> I understand that batching is done for performance reasons (I just read
> JENA-256), but in my scenario writes (using PUT) are usually large and
> infrequent, so write performance is not important, or at least not much
> helped by batching. The exception is when I occasionally want to update
> every dataset in one go: then there are several large PUTs, and Fuseki
> runs out of heap unless I restart it between them.
>
>
>> What the system should do is:
>> 1/ use a persistent-but-cached layer for completed transactions
>> 2/ be tunable (*)
>> 3/ Notice a store is transactional and use that instead of parsing to an
>> in-memory graph
>>
>> but does not currently offer those features. Contributions welcome.
>>
>> Andy
>>
>> (*) I have tended to avoid adding lots of configuration options, as I
>> find that in other systems lots of knobs to tweak are unhelpful overall.
>> Either people use the default or it needs deep magic to control.
>
>
> I understand, nothing is perfect and there are always possible improvements
> to be made. I also understand the aversion to knobs.
>
> In my case, I would like to see in Fuseki and/or TDB a way to either
> 1) reduce the batch size to something less than 10 (say, 2 or 5),
> 2) turn off batching completely,
> 3) make batching behavior dependent on the size (in triples or megabytes) of
> the accumulated queue, so a queue of large writes would be flushed sooner
> than a queue of small writes, or
> 4) make batching behavior dependent on time, so that if no further writes
> are performed within a certain time (say, 10 seconds or a minute), the
> flush is done regardless of the size of the accumulated write queue.
>
> I guess 1 or 2 would be in the tunable category, while 3 and 4 would maybe
> qualify as deep magic :)
>
> But now that I understand what's happening I can at least work around the
> problem.
>
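To make the batching effect concrete, here is a rough client-side sketch of the write pattern under discussion, assuming the Jena 2.7-era TDB API (the dataset location and URIs are placeholders I made up). Each commit ends a write transaction whose in-memory structures TDB retains until the batch of 10 is flushed back to the main database:

```java
import com.hp.hpl.jena.query.Dataset ;
import com.hp.hpl.jena.query.ReadWrite ;
import com.hp.hpl.jena.rdf.model.Model ;
import com.hp.hpl.jena.rdf.model.ModelFactory ;
import com.hp.hpl.jena.tdb.TDBFactory ;

public class BatchingDemo {
    public static void main(String[] args) {
        Dataset ds = TDBFactory.createDataset("/tmp/tdb-demo") ;
        for (int i = 0; i < 12; i++) {
            ds.begin(ReadWrite.WRITE) ;
            try {
                Model m = ModelFactory.createDefaultModel() ;
                m.createResource("http://example.org/s" + i)
                 .addProperty(m.createProperty("http://example.org/p"), "o" + i) ;
                ds.getDefaultModel().add(m) ;
                // Commit ends the transaction, but its in-memory view is
                // kept until the journal is flushed (after 10 writes).
                ds.commit() ;
            } finally {
                ds.end() ;
            }
        }
    }
}
```

With large models, those retained per-transaction structures are what pushes the heap over its limit.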
A decent win would be to address what Andy mentioned as his number 3.
I've been working in this area lately on the SPARQL Update side (PUT
is part of the SPARQL 1.1 Graph Store HTTP Protocol), and I hope to
get to that in time.
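For reference, the protocol in question is plain HTTP: a Graph Store Protocol PUT replaces the target graph with the request body. A minimal JDK-only sketch (the localhost:3030/ds/data endpoint and the sample triple are assumptions, matching Fuseki's conventional service layout):

```java
import java.io.OutputStream ;
import java.net.HttpURLConnection ;
import java.net.URL ;

public class GraphStorePut {
    public static void main(String[] args) throws Exception {
        // "?default" targets the default graph of the dataset's GSP service.
        URL url = new URL("http://localhost:3030/ds/data?default") ;
        HttpURLConnection con = (HttpURLConnection) url.openConnection() ;
        con.setRequestMethod("PUT") ;                      // PUT replaces the graph
        con.setRequestProperty("Content-Type", "text/turtle") ;
        con.setDoOutput(true) ;
        String turtle = "<http://example.org/s> <http://example.org/p> \"o\" ." ;
        try (OutputStream out = con.getOutputStream()) {
            out.write(turtle.getBytes("UTF-8")) ;
        }
        System.out.println("HTTP " + con.getResponseCode()) ;
    }
}
```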
Meanwhile, if you really need to reduce memory, you can try the
following (untested) patch against the jena-fuseki project. Adjust
the 10000 constant to something lower if needed.
-Stephen
Index: jena-fuseki/src/main/java/org/apache/jena/fuseki/servlets/SPARQL_Upload.java
===================================================================
--- jena-fuseki/src/main/java/org/apache/jena/fuseki/servlets/SPARQL_Upload.java	(revision 1392600)
+++ jena-fuseki/src/main/java/org/apache/jena/fuseki/servlets/SPARQL_Upload.java	(working copy)
@@ -36,6 +36,8 @@
 import org.apache.jena.fuseki.http.HttpSC ;
 import org.apache.jena.fuseki.server.DatasetRef ;
 import org.apache.jena.iri.IRI ;
+import org.openjena.atlas.data.ThresholdPolicy ;
+import org.openjena.atlas.data.ThresholdPolicyFactory ;
 import org.openjena.atlas.lib.Sink ;
 import org.openjena.atlas.web.ContentType ;
 import org.openjena.riot.* ;
@@ -46,6 +48,7 @@
 import com.hp.hpl.jena.graph.Graph ;
 import com.hp.hpl.jena.graph.Node ;
 import com.hp.hpl.jena.graph.Triple ;
+import com.hp.hpl.jena.sparql.graph.GraphDefaultDataBag ;
 import com.hp.hpl.jena.sparql.graph.GraphFactory ;
 
 public class SPARQL_Upload extends SPARQL_ServletBase
@@ -95,7 +98,10 @@
         // Locking only needed over the insert into dataset
         try {
             String graphName = null ;
-            Graph graphTmp = GraphFactory.createGraphMem() ;
+            //Graph graphTmp = GraphFactory.createGraphMem() ;
+            ThresholdPolicy<Triple> policy = ThresholdPolicyFactory.count(10000) ; // Need to read the proper setting from a Context object
+            Graph graphTmp = new GraphDefaultDataBag(policy) ; // We don't care that dupes can appear in here
+
             Node gn = null ;
             String name = null ;
             ContentType ct = null ;
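The point of the patch is that the data-bag-backed graph spills triples out of the heap once the ThresholdPolicy's count is crossed, instead of holding everything in memory. The pattern can be illustrated with a self-contained toy sketch (my own classes, not the Jena ones, which spill to temp files rather than just counting):

```java
import java.util.ArrayList ;
import java.util.List ;

public class SpillBagSketch {
    // Minimal illustration of the threshold-spill pattern: keep items
    // in memory up to a count threshold, then divert the rest.
    final int threshold ;
    final List<String> memory = new ArrayList<>() ;
    long spilled = 0 ;

    SpillBagSketch(int threshold) { this.threshold = threshold ; }

    void add(String item) {
        if (memory.size() < threshold)
            memory.add(item) ;
        else
            spilled++ ;          // the real DataBag writes to a temp file here
    }

    public static void main(String[] args) {
        SpillBagSketch bag = new SpillBagSketch(3) ;
        for (int i = 0; i < 5; i++)
            bag.add("triple-" + i) ;
        System.out.println(bag.memory.size() + " in memory, " + bag.spilled + " spilled") ;
        // prints: 3 in memory, 2 spilled
    }
}
```

Lowering the 10000 constant trades parse-time I/O for a smaller heap footprint, which matches the large-but-infrequent PUT workload described above.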