Thanks for the prompt reply, Andy.
I am doing batch-type storage: after a certain number of resources has been
converted and stored, I issue a transaction. Based on your comment, the heap
size (80 GB) should be more than enough, so I guess the issue lies in how I am
(programmatically) using TDB2.
The relevant classes for storage are org.apache.jena.system.Txn,
org.apache.jena.tdb2.DatabaseMgr, and org.apache.jena.tdb2.TDB2Factory, yet
some TDB1 behavior still seems to show. Could that be the case?
Just as additional information, the following method is invoked to store Model
instances to TDB2:
public void storeLinkInstance(String namedGraph, Model model) {
    Txn.executeWrite(dataset, () -> {
        // Add to existing named graph
        if (dataset.containsNamedModel(namedGraph)) {
            /* Model tempModel = dataset.getNamedModel(namedGraph); // .add(model)
               dataset.addNamedModel(namedGraph, tempModel); */
            dataset.getNamedModel(namedGraph).add(model);
        } else {
            // Add the named graph for the first time
            dataset.addNamedModel(namedGraph, model);
        }
    });
}
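The batching around that method looks roughly like the sketch below. It is a
simplification, not my exact code: each batch goes into its own write
transaction. The names BatchStore and storeBatches are hypothetical, and the
sketch assumes that getNamedModel returns a view backed by the dataset, so the
containsNamedModel check would not be needed.

```java
import org.apache.jena.query.Dataset;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.system.Txn;

// Hypothetical helper: one write transaction per batch of converted resources.
public class BatchStore {

    /** Adds every batch to the given named graph, committing after each batch. */
    public static void storeBatches(Dataset dataset, String namedGraph,
                                    Iterable<Model> batches) {
        for (Model batch : batches) {
            // Assumption: the Model returned by getNamedModel is backed by the
            // dataset, so add() writes straight into the store.
            Txn.executeWrite(dataset, () ->
                dataset.getNamedModel(namedGraph).add(batch));
        }
    }
}
```

Here the dataset would come from TDB2Factory.connectDataset(location), and the
batches could be produced lazily so that only one Model is held at a time.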
Finally, when I use the TDB2 loader (from the command line) to load all these
smaller parts, that works just fine, and I am also able to run Jena Fuseki on
top of the resulting TDB2 database. I only face the issue when converting and
storing the resources programmatically.
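In case it helps others, here is a minimal sketch of how such parts could be
written to disk as N-Quads, one quad per line, so that nothing accumulates on
the heap before the loader runs. The class name NQuadsFileWriter is
hypothetical, and the sketch assumes that the IRIs are already valid and that
every object is a plain string literal; it is not the code I actually use.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical streaming writer: each quad is written as soon as it is produced.
public class NQuadsFileWriter implements AutoCloseable {
    private final BufferedWriter out;

    public NQuadsFileWriter(Path file) throws IOException {
        this.out = Files.newBufferedWriter(file, StandardCharsets.UTF_8);
    }

    /** Formats one quad; the object is treated as a plain string literal. */
    public static String formatQuad(String subject, String predicate,
                                    String literal, String graph) {
        return "<" + subject + "> <" + predicate + "> \""
                + escape(literal) + "\" <" + graph + "> .";
    }

    /** Escapes the characters that may not appear raw in an N-Quads literal. */
    static String escape(String value) {
        StringBuilder sb = new StringBuilder();
        for (char c : value.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '"':  sb.append("\\\""); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Writes one quad followed by a line break. */
    public void write(String subject, String predicate,
                      String literal, String graph) throws IOException {
        out.write(formatQuad(subject, predicate, literal, graph));
        out.write('\n');
    }

    @Override
    public void close() throws IOException {
        out.close();
    }
}
```

The resulting files can then be passed to tdb2.tdbloader in the same way as the
smaller parts mentioned above.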
Thanks
On 2020/10/27 11:30:22, Andy Seaborne <[email protected]> wrote:
> Hi Fidan,
>
> I guess that you are loading all the files inside a transaction?
>
> How much heap size are you using? (Don't allocate the whole of free RAM).
>
> TDB1 uses heap space for uncommitted transactions and it also buffers a
> few committed transactions because, usually, it is better to do the
> final work on them a few at a time.
>
> There is a control for the buffering:
> TransactionActionManager.QueueBatchSize
>
> TDB2 does not use heap space in this way and does not have limitations
> on the size of transactions. A heap of 2-4G is fine - the main work at
> scale happens in the indexes which are not in the heap.
>
> >> dataset.getNamedModel(namedGraph).add(model);
>
> So you seem to have the data in memory in "model" as well so both the
> TDB(1) space and model are taking up heap.
>
> You can stream the data in by having a transaction and calling Model.add
> (or DatasetGraph.add(Triple) if you end up working in triples, not
> models+statements. Your choice - it isn't a factor here).
>
> A different approach might be:
>
> Convert your resources to RDF and write these to disk, possibly
> adding the named graph (so TriG or N-Quads format), then use a bulk
> loader: for TDB1, tdbloader (TDB1 tdbloader2 is only useful for very large
> datasets), or for TDB2, tdb2.tdbloader.
>
> They are faster than loading into a "live" dataset - they work by
> manipulating the internal structures directly.
>
> For TDB1, they have to start with an empty database.
>
> For TDB2, it (there is one bulkloader, with options) works on partially
> loaded databases.
>
> As to which option to use for the "--loader" argument to tdb2.tdbloader, it
> depends. The default is good; if you have several hundred million and
> up, try --loader=parallel if it's a big server.
>
> Andy
>
>
>
>
> On 27/10/2020 08:26, Fidan Limani wrote:
> > Recently, I am dealing with a large collection of resources that need to be
> > converted to RDF. The original collection contains a set of files, each
> > containing > 4 M resources on average. In order to keep the provenance, I
> > thought having named graphs with the same name to organize the RDF
> > collection would be nice.
> >
> > However, after half of the collection is stored, even on a powerful server,
> > the memory does not seem to be enough for the store operation in the TDB.
> > Consider the following statement:
> >
> > dataset.getNamedModel(namedGraph).add(model);
> >
> > In it, we retrieve the current RDF Model of triples and add another
> > collection of triples to it. After a while, once the storage reaches a
> > certain point, the operation "hangs" due to a heap space exception.
> >
> > (Finally) The question, then, is: is there a (more streaming-like) way to
> > store larger collections via named graphs? My current workaround consists
> > of splitting the original collection into smaller, more manageable
> > collections that the server can handle and store in named graphs.
> >
>