Hi Fidan,

I guess that you are loading all the files inside a single transaction?

How much heap size are you using? (Don't allocate the whole of free RAM).

TDB1 uses heap space for uncommitted transactions, and it also buffers a few committed transactions because it is usually better to do the final work on them a few at a time.

There is a control for the buffering:
TransactionManager.QueueBatchSize

TDB2 does not use heap space in this way and does not have limitations on the size of transactions. A heap of 2-4G is fine - the main work at scale happens in the indexes which are not in the heap.

>>   dataset.getNamedModel(namedGraph).add(model);

So you seem to have the data in memory in "model" as well, so both the TDB1 transaction space and the model are taking up heap.

You can stream the data in by having a transaction and calling Model.add (or DatasetGraph.add(Triple) if you end up working in triples rather than models and statements; your choice, it isn't a factor here).
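A minimal sketch of the streaming idea, assuming a Jena 3.x classpath with TDB1: each file is parsed straight into its named graph inside its own write transaction, so no separate in-memory "model" is built first. The class name, the graph naming scheme and the command-line arguments are made up for illustration.

```java
import org.apache.jena.query.Dataset;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.system.Txn;
import org.apache.jena.tdb.TDBFactory;

public class StreamingGraphLoad {

    /** Parse one RDF file directly into a named graph, in its own write transaction. */
    static void loadFile(Dataset dataset, String graphName, String rdfFile) {
        Txn.executeWrite(dataset, () -> {
            Model target = dataset.getNamedModel(graphName);
            // The parser adds triples to the dataset-backed model as it reads them,
            // so the file is never held as a second, in-memory Model.
            RDFDataMgr.read(target, rdfFile);
        });
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("usage: StreamingGraphLoad <tdb-dir> <rdf-file>...");
            return;
        }
        // args[0] is the TDB1 database directory; the rest are RDF files.
        Dataset dataset = TDBFactory.createDataset(args[0]);
        try {
            for (int i = 1; i < args.length; i++) {
                // Hypothetical naming scheme: one graph per input file.
                loadFile(dataset, "http://example.org/graph/" + i, args[i]);
            }
        } finally {
            dataset.close();
        }
    }
}
```

One transaction per file keeps each TDB1 transaction bounded by the size of a single file; whether that is small enough depends on the files and the heap.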

A different approach might be:

Convert your resources to RDF and write these to disk, possibly adding the named graph (so TriG or N-Quads format), then use a bulk loader: tdbloader for TDB1 (tdbloader2 is only useful for very large datasets) or tdb2.tdbloader for TDB2.

They are faster than loading into a "live" dataset - they work by manipulating the internal structures directly.

For TDB1, they have to start with an empty database.

For TDB2, the bulk loader (there is one, with options) also works on databases that already contain data.

As to which option to give the "--loader" argument of tdb2.tdbloader, it depends. The default is good; if you have several hundred million triples or more, try --loader=parallel if it is a big server.
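For the write-to-disk approach, here is a rough sketch of producing an N-Quads file one line at a time using only the Java standard library. The class name, IRIs and output file name are invented, and it assumes the IRIs are already valid and the objects are plain string literals; Jena's own RIOT writers could be used instead.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class NQuadsWriterSketch {

    /** Escape backslash, double quote, and line breaks for an N-Quads string literal. */
    static String escapeLiteral(String s) {
        return s.replace("\\", "\\\\")
                .replace("\"", "\\\"")
                .replace("\n", "\\n")
                .replace("\r", "\\r");
    }

    /** Format one quad: IRI subject, IRI predicate, plain literal object, IRI graph. */
    static String quad(String s, String p, String literal, String g) {
        return "<" + s + "> <" + p + "> \"" + escapeLiteral(literal) + "\" <" + g + "> .";
    }

    public static void main(String[] args) throws IOException {
        Path out = Path.of("data.nq");                      // hypothetical output file
        String graph = "http://example.org/graph/file-1";   // hypothetical graph name
        try (BufferedWriter w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            // Stand-in for iterating over the real resources: each quad is written
            // as soon as it is produced, so memory use stays flat.
            for (int i = 0; i < 3; i++) {
                w.write(quad("http://example.org/resource/" + i,
                             "http://www.w3.org/2000/01/rdf-schema#label",
                             "Resource " + i,
                             graph));
                w.newLine();
            }
        }
    }
}
```

The resulting file could then be handed to the bulk loader, for example "tdb2.tdbloader --loc <db-dir> data.nq".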

    Andy




On 27/10/2020 08:26, Fidan Limani wrote:
Recently, I have been dealing with a large collection of resources that need to be 
converted to RDF. The original collection contains a set of files, each containing 
> 4 M resources on average. In order to keep the provenance, I thought having 
named graphs with the same name to organize the RDF collection would be nice.

However, after half of the collection is stored, even on a powerful server, the 
memory does not seem to be enough for the store operation in the TDB. Consider 
the following statement:

  dataset.getNamedModel(namedGraph).add(model);

In it, we retrieve the current RDF Model of triples and add another collection of triples 
to it. After a while, once the storage reaches a certain point, the operation 
"hangs" due to a heap space exception.

(Finally) The question, then, is: is there a more streaming-like way to 
store larger collections via named graphs? My current workaround consists in 
splitting the original collection into smaller, more manageable collections 
that the server can handle and store in named graphs.
