Re: Jena TDB: Limitations of organizing large collections via named graphs

Fidan Limani Tue, 27 Oct 2020 05:28:02 -0700

Thanks for the prompt reply, Andy.

I am doing batch-type storage: after a certain number of resources have been 
converted and stored, I issue a transaction. Based on your comment, the heap 
size should be more than enough (80 GB), so I guess the issue lies in how I 
use TDB2 programmatically.


The relevant classes for storage are org.apache.jena.system.Txn, 
org.apache.jena.tdb2.DatabaseMgr, and org.apache.jena.tdb2.TDB2Factory, and yet 
some TDB1 behavior seems to show. Could that be the case?

Just as additional information, the following method is invoked to store Model 
instances to TDB2:

public void storeLinkInstance(String namedGraph, Model model) {
        Txn.executeWrite(dataset, ()->{
            // Add to existing named graph
            if (dataset.containsNamedModel(namedGraph)){
                /* Model tempModel = dataset.getNamedModel(namedGraph); // .add(model)
                dataset.addNamedModel(namedGraph, tempModel); */
                dataset.getNamedModel(namedGraph).add(model);
            } else {
                // Add the named graph for the first time
                dataset.addNamedModel(namedGraph, model);
            }
        });
    }
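
As a hedged illustration of the batch idea (not the actual converter code), 
the sketch below hands items to a sink in fixed-size batches, so that only one 
batch has to sit in memory at a time. BatchFeeder and forEachBatch are 
hypothetical names, and the block uses only the JDK.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: feeds items to a sink in fixed-size batches. */
final class BatchFeeder {
    private BatchFeeder() {}

    /**
     * Drains the source, handing at most batchSize items at a time to the sink.
     * Returns the number of batches delivered.
     */
    static <T> int forEachBatch(Iterator<T> source, int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batches = 0;
        List<T> buffer = new ArrayList<>(batchSize);
        while (source.hasNext()) {
            buffer.add(source.next());
            if (buffer.size() == batchSize) {
                sink.accept(buffer);
                batches++;
                // Start a fresh list so the sink may keep a reference to the old one.
                buffer = new ArrayList<>(batchSize);
            }
        }
        if (!buffer.isEmpty()) {
            sink.accept(buffer); // final, possibly short, batch
            batches++;
        }
        return batches;
    }
}
```

With Jena, the sink could be something like

    batch -> Txn.executeWrite(dataset,
            () -> dataset.getNamedModel(namedGraph).add(batch))

where batch is a List<Statement> produced by the converter, so that no large 
intermediate Model is built. This assumes the converter can emit statements 
incrementally.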


Finally, when I use the TDB2 loader (from the command line) to load all these 
smaller parts, it works just fine, and I am also able to run Jena Fuseki on 
top of the resulting TDB2 database. I only face the issue when converting 
and storing the resources programmatically.
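
For the convert-then-bulk-load route suggested in the quoted message below, 
here is a minimal sketch of writing quads in N-Quads form with plain JDK 
classes. NQuadsSketch and its methods are hypothetical names; the sketch 
assumes the IRIs are already valid absolute IRIs and handles only plain string 
literals. Jena's own RIOT writers would be the more robust choice in practice.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;

/** Hypothetical sketch: serializes simple quads as N-Quads lines. */
final class NQuadsSketch {
    private NQuadsSketch() {}

    /** Escapes backslash, double quote, line feed and carriage return. */
    static String escapeLiteral(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '"':  sb.append("\\\""); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Builds one N-Quads line (without newline); IRIs are assumed valid. */
    static String literalQuad(String subjectIri, String predicateIri,
                              String lexical, String graphIri) {
        return "<" + subjectIri + "> <" + predicateIri + "> \""
                + escapeLiteral(lexical) + "\" <" + graphIri + "> .";
    }

    /** Writes one quad followed by a newline, so output can be streamed. */
    static void writeLiteralQuad(Writer out, String subjectIri, String predicateIri,
                                 String lexical, String graphIri) {
        try {
            out.write(literalQuad(subjectIri, predicateIri, lexical, graphIri));
            out.write('\n');
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Each converted file could be written this way through a buffered writer into 
its own .nq file, with the graph IRI naming the source file, and then handed 
to tdb2.tdbloader.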

Thanks

On 2020/10/27 11:30:22, Andy Seaborne wrote:
> Hi Fidan,
> 
> I guess that you are loading all the files inside one transaction?
> 
> How much heap size are you using? (Don't allocate the whole of free RAM).
> 
> TDB1 uses heap space for uncommitted transactions and it also buffers a 
> few committed transactions because, usually, it is better to do the 
> final work on them a few at a time.
> 
> There is a control for the buffering:
> TransactionActionManager.QueueBatchSize
> 
> TDB2 does not use heap space in this way and does not have limitations 
> on the size of transactions. A heap of 2-4G is fine - the main work at 
> scale happens in the indexes which are not in the heap.
> 
>  >>   dataset.getNamedModel(namedGraph).add(model);
> 
> So you seem to have the data in memory in "model" as well so both the 
> TDB(1) space and model are taking up heap.
> 
> You can stream the data in by having a transaction and calling Model.add 
> (or DatasetGraph.add(Triple) if you end up working in triples not 
> models+statements. Your choice - it isn't a factor here.).
> 
> A different approach might be:
> 
> Convert your resources to RDF and write these to disk, possibly adding 
> the named graph (so TriG or N-Quads format), then use a bulk loader: 
> tdbloader for TDB1 (TDB1 tdbloader2 is only useful for very large 
> datasets) or tdb2.tdbloader for TDB2.
> 
> They are faster than loading into a "live" dataset - they work by 
> manipulating the internal structures directly.
> 
> For TDB1, they have to start with an empty database.
> 
> For TDB2, it (there is one bulkloader, with options) works on partially 
> loaded databases.
> 
> As to which option to use for the "--loader" argument to tdb2.tdbloader, it 
> depends. The default is good; if you have several hundred million triples or 
> more, try --loader=parallel if it is a big server.
> 
>      Andy
> 
> 
> 
> 
> On 27/10/2020 08:26, Fidan Limani wrote:
> > I have recently been dealing with a large collection of resources that 
> > need to be converted to RDF. The original collection contains a set of 
> > files, each containing > 4 M resources on average. In order to keep the 
> > provenance, I thought it would be nice to organize the RDF collection into 
> > named graphs with the same names as the files.
> > 
> > However, after half of the collection is stored, even on a powerful server, 
> > the memory does not seem to be enough for the store operation in the TDB. 
> > Consider the following statement:
> >       dataset.getNamedModel(namedGraph).add(model);
> > 
> > In it, we retrieve the current RDF Model of triples and add another 
> > collection of triples to it. After a while, once the storage reaches a 
> > certain point, the operation "hangs" due to a heap space exception.
> > 
> > (Finally) The question, then, is: is there a (more streaming-like) way to 
> > store larger collections via named graphs? My current workaround consists 
> > in splitting the original collection into smaller, more manageable 
> > collections that the server can handle and store in named graphs.
> > 
>
