Hi Andy,

On 2015-04-25 05:59, Andy Seaborne wrote:
It may be the stats being collected which is per predicate and is
kept in-memory during loading.  Just how many unique predicates are
here?

There are 57,088,184 unique predicates in the file d3.nt. For each
predicate the file contains two triples of the form:

<http://www.wikidata.org/entity/Q99> <http://www.wikidata.org/entity/Q99Sfb55e939-4b54-1a88-3063-6ad722d1d569> <http://www.wikidata.org/entity/Q14615145> .
<http://www.wikidata.org/entity/Q99Sfb55e939-4b54-1a88-3063-6ad722d1d569> <http://www.w3.org/2002/07/owl#subPropertyOf> <http://www.wikidata.org/entity/P1151> .

As you can see, the triples are taken from the Wikidata project, but I'm
using some alternative schemes to check how they behave in the most
popular engines. The scheme that is giving me problems with Jena is this
one, in which each Wikidata statement is modeled as a singleton property
that inherits from a common property.
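
To make the scheme concrete, here is a minimal sketch of how one
statement becomes the two triples shown above. The function name
singleton_triples and its parameters are only illustrative, not part of
any tool I'm using; the IRIs are the ones from the example.

```python
ENTITY = "http://www.wikidata.org/entity/"
# Predicate IRI exactly as it appears in the example triples above.
SUB_PROPERTY_OF = "http://www.w3.org/2002/07/owl#subPropertyOf"


def singleton_triples(subject, prop, obj, statement_id):
    """Return the two N-Triples lines for one statement (hypothetical helper)."""
    singleton = f"<{ENTITY}{statement_id}>"
    return [
        # The singleton property is used exactly once in predicate position.
        f"<{ENTITY}{subject}> {singleton} <{ENTITY}{obj}> .",
        # The singleton property is then linked to the common property.
        f"{singleton} <{SUB_PROPERTY_OF}> <{ENTITY}{prop}> .",
    ]


for line in singleton_triples(
    "Q99", "P1151", "Q14615145",
    "Q99Sfb55e939-4b54-1a88-3063-6ad722d1d569",
):
    print(line)
```

Every statement therefore adds one new predicate to the data.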

The cache used behind NodeTableCache is a fixed number of slots LRU
cache so if it is causing the process to run out of memory then the
entries must be very large.

The cache is in the heap but all the index caching (which is much,
much larger) is not in the heap and is unaffected by -Xmx.  That's why
setting the heap large actually slows things down - it is taking space
away from the index caches which are in memory mapped files, so
managed by the OS in the file system cache.

A heap of 2G should be enough - 10G does not help much.  The loader
will use the rest of the machine (in fact, whether you want it to or
not!) for indexes and the indexing part of the node table.

Hence the unusual point is the predicate distribution.

Each predicate is used exactly once in the predicate position of a
triple. Every other occurrence is in the subject position.
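
A distribution like this can be checked with a small script. The
following is only a sketch, assuming plain N-Triples input where the
subject and predicate contain no whitespace; predicate_stats is a
hypothetical helper. It keeps every distinct predicate and subject in
memory, so for the full d3.nt it would itself need a lot of RAM.

```python
from collections import Counter


def predicate_stats(lines):
    """Count predicate usage and collect the subjects seen (illustrative only)."""
    predicate_counts = Counter()
    subjects = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Subject and predicate are the first two whitespace-separated terms.
        subject, predicate, _rest = line.split(None, 2)
        subjects.add(subject)
        predicate_counts[predicate] += 1
    return predicate_counts, subjects


sample = [
    "<http://www.wikidata.org/entity/Q99> "
    "<http://www.wikidata.org/entity/Q99Sfb55e939-4b54-1a88-3063-6ad722d1d569> "
    "<http://www.wikidata.org/entity/Q14615145> .",
    "<http://www.wikidata.org/entity/Q99Sfb55e939-4b54-1a88-3063-6ad722d1d569> "
    "<http://www.w3.org/2002/07/owl#subPropertyOf> "
    "<http://www.wikidata.org/entity/P1151> .",
]

counts, subjects = predicate_stats(sample)
print("unique predicates:", len(counts))
print("most frequent predicate count:", max(counts.values()))
print("predicates also used as subjects:", sum(1 for p in counts if p in subjects))
```

For a real file, the same function can be fed an open file object
instead of the sample list.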

What is the data?  This hasn't happened (IIRC) before even with some
of the older dbpedia dumps which had huge numbers of synthetic
predicates.

The source was the Wikidata dumps, but I have changed the schema to
study how the choice of schema affects working with the data.

You could try tdbloader (not 2) which does not collect statistics
during the loading.

It can be slower or even sometimes faster than tdbloader2 at your
scale but it should not be very slow.

I have run the tdbloader with the following command:

export JVM_ARGS=-Xmx2000M

bin/tdbloader \
   --loc=/home/daniel/wikidata/data/jena/tdb-03 \
   /home/daniel/wikidata/data/rawfiles/d3.nt \
   /home/daniel/wikidata/data/rawfiles/dc.nt \
   > ~/tdb-03-out.log \
   2> ~/tdb-03-error.log

However, I got the error:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
        at com.hp.hpl.jena.tdb.lib.TupleLib.record(TupleLib.java:183)
        at com.hp.hpl.jena.tdb.store.tupletable.TupleIndexRecord.performAdd(TupleIndexRecord.java:61)
        at com.hp.hpl.jena.tdb.store.tupletable.TupleIndexBase.add(TupleIndexBase.java:64)
        at com.hp.hpl.jena.tdb.store.tupletable.TupleTable.add(TupleTable.java:96)
        at com.hp.hpl.jena.tdb.store.nodetupletable.NodeTupleTableConcrete.addRow(NodeTupleTableConcrete.java:88)
        at com.hp.hpl.jena.tdb.store.bulkloader.LoaderNodeTupleTable.load(LoaderNodeTupleTable.java:107)
        at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader$DestinationDSG.process(BulkLoader.java:237)
        at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader$DestinationDSG.triple(BulkLoader.java:220)
        at org.apache.jena.riot.lang.LangNTriples.runParser(LangNTriples.java:61)
        at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:42)
        at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:185)
        at org.apache.jena.riot.RDFDataMgr.process(RDFDataMgr.java:906)
        at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:687)
        at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:666)
        at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:654)
        at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadQuads$(BulkLoader.java:148)
        at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadDataset(BulkLoader.java:114)
        at com.hp.hpl.jena.tdb.TDBLoader.loadDataset$(TDBLoader.java:261)
        at com.hp.hpl.jena.tdb.TDBLoader.loadDataset(TDBLoader.java:193)
        at com.hp.hpl.jena.tdb.TDBLoader.load(TDBLoader.java:74)
        at tdb.tdbloader.loadQuads(tdbloader.java:118)
        at tdb.tdbloader.exec(tdbloader.java:86)
        at arq.cmdline.CmdMain.mainMethod(CmdMain.java:102)
        at arq.cmdline.CmdMain.mainRun(CmdMain.java:63)
        at arq.cmdline.CmdMain.mainRun(CmdMain.java:50)
        at tdb.tdbloader.main(tdbloader.java:44)

I guess the problem is caused by having too many properties.
I will try loading less data.

Is it possible to load an N-Triples file into an existing TDB
directory? I have tried to do that, but tdbloader raises
an error saying that the directory is not empty. For
example, I would like tdbloader to have a --merge option to
indicate that new data must be merged with the data already in
the TDB directory.

I'd be interested in hearing how that goes and seeing a complete load
log to know how it does.

If you are interested I can publish the data that I'm trying to load.

Daniel
