On 27/10/2020 12:27, Fidan Limani wrote:
Thanks for the prompt reply, Andy.

I am doing batch-type storage: after a certain number of resources has been 
converted and stored, I issue a transaction. Based on your comment, the heap 
size is quite enough - 80 GB

Do leave space for the OS file system cache. Otherwise the indexes have no cache space (that is not in the heap).

, but I guess the issue remains with (programmatically) using the TDB 2.


The relevant packages for storage are org.apache.jena.system.Txn; 
org.apache.jena.tdb2.DatabaseMgr; and org.apache.jena.tdb2.TDB2Factory, but yet 
some TDB 1 behavior seems to show, or?

"or?" - don't understand.

If you get an OOME or heap CPU death, then maybe the issue isn't in TDB1 or TDB2.

The code seems to read everything into memory, then add it to TDB.


Does your data contain, for example, many large literals?


Just as additional information, the following method is invoked to store Model 
instances to TDB2:


I'm guessing here, but is "model" in-memory, and do you read your data into it?

Can you read data straight into TDB instead?

public void storeLinkInstance(String namedGraph, Model model) {
         Txn.executeWrite(dataset, ()->{
             // Add to existing named graph
             if (dataset.containsNamedModel(namedGraph)){
                 /* Model tempModel = dataset.getNamedModel(namedGraph); // 
.add(model)

                   // This model "m" is a view of the database
                   // it does not store anything itself.
                   Model m = dataset.getNamedModel(namedGraph);
                   RDFDataMgr.read(m, "filename");


                 dataset.addNamedModel(namedGraph, tempModel); */
                 dataset.getNamedModel(namedGraph).add(model);
             } else {
                 // Add the named graph for the first time
                 dataset.addNamedModel(namedGraph, model);
             }
         });
     }


Finally, when I use TDB2 loader (from the command line) to load all these 
smaller parts, that works just fine, and I am also able to use Jena Fuseki on 
top of the resulting TDB, but I face the issue when programmatically converting 
and storing the resources.

Thanks


One more note:

Ideally, given an input collection, the implementation should convert it to RDF 
and generate data dumps, which the implementing parties then could use for 
their use cases.

The idea of converting to files and reading those files into TDB2 fits well with that requirement.
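
As a minimal sketch of "read the data straight into TDB" (not necessarily how the original application is structured), the helper below parses one dump file directly into a named graph inside a write transaction. The class name LoadDumpIntoTdb2 and the method loadFile are made up for illustration; Txn, RDFDataMgr and Dataset.getNamedModel are the same Jena calls already used in this thread.

```java
import org.apache.jena.query.Dataset;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.system.Txn;

// Hypothetical helper class; the name is an assumption for this sketch.
final class LoadDumpIntoTdb2 {

    /** Parse one RDF file directly into a named graph of a transactional dataset. */
    static void loadFile(Dataset dataset, String namedGraph, String file) {
        Txn.executeWrite(dataset, () -> {
            // A view onto the database: parsed triples go to the dataset's
            // storage rather than into a separate heap-resident Model.
            Model view = dataset.getNamedModel(namedGraph);
            // The syntax is chosen from the file extension (.nt, .ttl, ...).
            RDFDataMgr.read(view, file);
        });
    }
}
```

It could be called once per converted file, for example loadFile(TDB2Factory.connectDataset("DB2"), "http://example/graph1", "part-0001.nt"), where the database location, graph name and file name are placeholders.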




On 2020/10/27 11:30:22, Andy Seaborne wrote:
Hi Fidan,

I guess that you are loading all the files inside a single transaction?

How much heap size are you using? (Don't allocate the whole of free RAM).

TDB1 uses heap space for uncommitted transactions and it also buffers a
few committed transactions because, usually, it is better to do the
final work on them a few at a time.

There is a control for the buffering:
TransactionManager.QueueBatchSize

TDB2 does not use heap space in this way and does not have limitations
on the size of transactions. A heap of 2-4G is fine - the main work at
scale happens in the indexes which are not in the heap.

  >>   dataset.getNamedModel(namedGraph).add(model);

So you seem to have the data in memory in "model" as well so both the
TDB(1) space and model are taking up heap.

You can stream the data in by having a transaction and calling Model.add
(or DatasetGraph.add(Triple) if you end up working in triples, not
models+statements. Your choice - it isn't a factor here).
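
A rough sketch of that streaming idea, assuming the converter can hand over its output as an Iterator<Statement> instead of a fully built Model: statements are added to the database view in bounded batches, one write transaction per batch. BatchedStore, addInBatches and batchSize are illustrative names, not part of Jena.

```java
import java.util.Iterator;

import org.apache.jena.query.Dataset;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.Statement;
import org.apache.jena.system.Txn;

// Hypothetical helper class; the name is an assumption for this sketch.
final class BatchedStore {

    /**
     * Drain the iterator into the named graph, committing after every
     * batchSize statements. Returns the number of statements consumed.
     */
    static long addInBatches(Dataset dataset, String namedGraph,
                             Iterator<Statement> statements, int batchSize) {
        long total = 0;
        while (statements.hasNext()) {
            int added = Txn.calculateWrite(dataset, () -> {
                // View of the stored graph; nothing is accumulated in a second Model.
                Model view = dataset.getNamedModel(namedGraph);
                int n = 0;
                while (n < batchSize && statements.hasNext()) {
                    view.add(statements.next());
                    n++;
                }
                return n;
            });
            total += added;
        }
        return total;
    }
}
```

With TDB2 the batching is optional, since it has no transaction size limit; it is kept here only because the original code already commits after a certain number of resources.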

A different approach might be:

Convert your resources to RDF and write these to disk, possibly adding
the named graph (so TriG or N-Quads format), then use a bulk loader:
tdbloader for TDB1 (the TDB1 tdbloader2 is only useful for very large
datasets) or tdb2.tdbloader for TDB2.

They are faster than loading into a "live" dataset - they work by
manipulating the internal structures directly.

For TDB1, they have to start with an empty database.

For TDB2, it (there is one bulkloader, with options) works on partially
loaded databases.

As to which option to give the "--loader" argument of tdb2.tdbloader, it
depends. The default is good; if you have several hundred million triples
or more, try --loader=parallel if it is a big server.
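
For the "write to disk, then bulk load" route, here is a small sketch of producing an N-Quads dump in a streaming fashion, so each converted resource is written out as it is produced and never collected in a Model. NQuadsDumpWriter and writeLabel are invented names for this example, and the single-literal-per-call shape is only an assumption about what the converter emits.

```java
import java.io.OutputStream;

import org.apache.jena.graph.Node;
import org.apache.jena.graph.NodeFactory;
import org.apache.jena.rdf.model.ResourceFactory;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.system.StreamRDF;
import org.apache.jena.riot.system.StreamRDFWriter;
import org.apache.jena.sparql.core.Quad;

// Hypothetical helper class; the name is an assumption for this sketch.
final class NQuadsDumpWriter implements AutoCloseable {
    private final StreamRDF out;
    private final Node graph;

    NQuadsDumpWriter(OutputStream target, String namedGraph) {
        // Streaming writer: quads are serialized as they arrive.
        this.out = StreamRDFWriter.getWriterStream(target, Lang.NQUADS);
        this.graph = NodeFactory.createURI(namedGraph);
        out.start();
    }

    /** Write one quad with a plain string literal as the object. */
    void writeLabel(String subjectIri, String propertyIri, String text) {
        Node s = NodeFactory.createURI(subjectIri);
        Node p = NodeFactory.createURI(propertyIri);
        Node o = ResourceFactory.createPlainLiteral(text).asNode();
        out.quad(Quad.create(graph, s, p, o));
    }

    /** Finishes the RDF stream; the caller still owns and closes the OutputStream. */
    @Override
    public void close() {
        out.finish();
    }
}
```

The resulting .nq files are the data dumps, and they can then be passed to tdb2.tdbloader on the command line.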

      Andy




On 27/10/2020 08:26, Fidan Limani wrote:
Recently, I have been dealing with a large collection of resources that need to be 
converted to RDF. The original collection contains a set of files, each containing 
more than 4 million resources on average. In order to keep the provenance, I thought 
having named graphs with the same name to organize the RDF collection would be nice.

However, after half of the collection is stored, even on a powerful server, the 
memory does not seem to be enough for the store operation in the TDB. Consider 
the following statement:

dataset.getNamedModel(namedGraph).add(model);

In it, we retrieve the current RDF Model of triples and add another collection of triples 
to it. After a while, once the storage reaches a certain point, the operation 
"hangs" due to a heap space exception.

(Finally) The question, then, is: is there a (more streaming-like) way to 
store larger collections via named graphs? My current workaround consists in 
splitting the original collection into smaller, more manageable collections 
that the server can handle and store in named graphs.

