Hi Daniel,

This is very strange.

It may be the statistics being collected: they are per predicate and are kept in memory during loading. Just how many unique predicates are there?
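One rough way to answer that from the shell is sketched below. It assumes the files are well-formed N-Triples with one triple per line, so the predicate is always the second whitespace-separated field. The function name count_predicates and the sample data are made up for illustration; they are not part of Jena.

```shell
# Hypothetical helper: count distinct predicates in one or more N-Triples files.
# Assumes one triple per line; subjects and predicates contain no whitespace.
count_predicates() {
    # Skip blank lines (NF == 0) and comment lines, print the predicate field,
    # then de-duplicate. sort spills to disk, so this also works on big files.
    awk 'NF && $1 !~ /^#/ { print $2 }' "$@" | LC_ALL=C sort -u | wc -l | tr -d ' '
}

# Tiny sample file standing in for the real data.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<http://example.org/s1> <http://example.org/p1> "a" .
<http://example.org/s2> <http://example.org/p1> "b" .
# a comment line
<http://example.org/s3> <http://example.org/p2> <http://example.org/o> .
EOF

predicate_count=$(count_predicates "$sample")
echo "distinct predicates: $predicate_count"
rm -f "$sample"
```

On the real data the call would be something like count_predicates d3.nt dc.nt, which may take a while at this scale.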

The cache used behind NodeTableCache is an LRU cache with a fixed number of slots, so if it is causing the process to run out of memory then the entries must be very large.

The cache is in the heap but all the index caching (which is much, much larger) is not in the heap and is unaffected by -Xmx. That's why setting the heap large actually slows things down - it is taking space away from the index caches, which are in memory-mapped files and so are managed by the OS in the file system cache.

A heap of 2G should be enough - 10G does not help much. The loader will use the rest of the machine (in fact, whether you want it to or not!) for indexes and the indexing part of the node table.

Hence the unusual point is the predicate distribution.

What is the data? This hasn't happened (IIRC) before even with some of the older dbpedia dumps which had huge numbers of synthetic predicates.

You could try tdbloader (not 2) which does not collect statistics during the loading.

It can be slower or even sometimes faster than tdbloader2 at your scale but it should not be very slow.

I'd be interested in hearing how that goes and seeing a complete load log to know how it does.

        Andy

On 24/04/15 17:07, Daniel Hernández wrote:
El 2015-04-23 18:12, Andy Seaborne escribió:
Hi there,

It's hard to be sure - what does the load log file say before the
exception occurs?

It was loading data when the error occurred. I tried again with
export JVM_ARGS=-Xmx10000M before the load execution and I got the
error:



INFO  Add: 289,750,000 Data (Batch: 120,481 / Avg: 67,007)
INFO  Add: 289,800,000 Data (Batch: 117,647 / Avg: 67,012)
INFO  Add: 289,850,000 Data (Batch: 155,279 / Avg: 67,018)
INFO  Add: 289,900,000 Data (Batch: 151,515 / Avg: 67,025)
INFO  Add: 289,950,000 Data (Batch: 156,250 / Avg: 67,031)
INFO  Add: 290,000,000 Data (Batch: 155,279 / Avg: 67,038)
INFO    Elapsed: 4,325.89 seconds [2015/04/24 12:23:55 UTC]
INFO  Add: 290,050,000 Data (Batch: 162,866 / Avg: 67,045)
INFO  Add: 290,100,000 Data (Batch: 50,968 / Avg: 67,041)
INFO  Add: 290,150,000 Data (Batch: 160,771 / Avg: 67,048)
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
     at java.util.LinkedHashMap.createEntry(LinkedHashMap.java:442)
     at java.util.HashMap.addEntry(HashMap.java:884)
     at java.util.LinkedHashMap.addEntry(LinkedHashMap.java:427)
     at java.util.HashMap.put(HashMap.java:505)
     at org.apache.jena.atlas.lib.cache.CacheLRU.put(CacheLRU.java:59)
     at com.hp.hpl.jena.tdb.store.nodetable.NodeTableCache.cacheUpdate(NodeTableCache.java:200)
     at com.hp.hpl.jena.tdb.store.nodetable.NodeTableCache._retrieveNodeByNodeId(NodeTableCache.java:127)
     at com.hp.hpl.jena.tdb.store.nodetable.NodeTableCache.getNodeForNodeId(NodeTableCache.java:85)
     at com.hp.hpl.jena.tdb.store.nodetable.NodeTableWrapper.getNodeForNodeId(NodeTableWrapper.java:55)
     at com.hp.hpl.jena.tdb.store.nodetable.NodeTableInline.getNodeForNodeId(NodeTableInline.java:67)
     at com.hp.hpl.jena.tdb.solver.stats.StatsCollectorNodeId.convert(StatsCollectorNodeId.java:51)
     at com.hp.hpl.jena.tdb.solver.stats.StatsCollectorBase.results(StatsCollectorBase.java:54)
     at com.hp.hpl.jena.tdb.solver.stats.StatsCollectorNodeId.results(StatsCollectorNodeId.java:30)
     at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.exec(CmdNodeTableBuilder.java:172)
     at arq.cmdline.CmdMain.mainMethod(CmdMain.java:102)
     at arq.cmdline.CmdMain.mainRun(CmdMain.java:63)
     at arq.cmdline.CmdMain.mainRun(CmdMain.java:50)
     at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.main(CmdNodeTableBuilder.java:80



On 23/04/15 20:53, Daniel Hernández wrote:
Hello,

I'm trying to load two files into a tdb with the command below:

bin/tdbloader2 --loc=tdb-03 d3.nt dc.nt

Do these files have a lot of literals? A lot of large literals?

I think that there is no problem with the literals, because I have
loaded the same data with another schema without problems. I guess
that the problem could be having many different predicates. The first
file has 50 million different predicates.

I have increased the memory used by Java by setting the line below in
the bin/tdbloader2worker file.

JVM_ARGS=${JVM_ARGS:--Xmx20000M}

JVM_ARGS is set further out in tdbloader2 as well, so this change
has no effect (JVM_ARGS is already set, so ${:-} returns the existing
value). It's merely a fallback at that point.

The right idiom is to set in the shell environment calling tdbloader2

e.g.

export JVM_ARGS=-Xmx5000M
tdbloader2 ...

or
env JVM_ARGS=-Xmx5000M tdbloader2 ...
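
The ${JVM_ARGS:--Xmx20000M} fallback discussed above can be checked in isolation with a small sketch like the one below. The function name resolve_jvm_args is made up for illustration and is not part of the Jena scripts; the -Xmx values are just the ones mentioned in this thread.

```shell
# Sketch of how ${VAR:-default} behaves; not taken from the Jena scripts.
resolve_jvm_args() {
    # Same expansion as the edited line: keep JVM_ARGS if it is set and
    # non-empty, otherwise fall back to the default after ":-".
    printf '%s\n' "${JVM_ARGS:--Xmx20000M}"
}

unset JVM_ARGS
unset_result=$(resolve_jvm_args)    # nothing set: the default applies

JVM_ARGS=-Xmx5000M
set_result=$(resolve_jvm_args)      # already set: the default is ignored

echo "unset: $unset_result"
echo "set:   $set_result"
```

The second case is what happens when an outer script has already assigned JVM_ARGS: the default written in the inner script is never used.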

Don't set it too large.  Much of the bulk space is not in the Java heap.

I used 10GB for the heap the last time, so there are 20GB extra to be used.
However, I got the error above.

Thanks,
Daniel
