Hi Daniel,
This is very strange.
It may be the statistics being collected, which are per predicate and
kept in memory during loading. Just how many unique predicates are here?
The cache behind NodeTableCache is an LRU cache with a fixed number of
slots, so if it is causing the process to run out of memory then the
entries must be very large.
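One rough way to answer the predicate question is to count distinct values in the second column of the N-Triples files. The sketch below is only an illustration: the helper name count_predicates is made up, and it assumes one triple per line with no whitespace inside the subject or predicate terms, which holds for well-formed N-Triples.

```shell
# Hypothetical helper: count distinct predicates in one or more N-Triples files.
# Assumes one triple per line; blank lines and '#' comment lines are skipped.
count_predicates() {
    awk '!/^[[:space:]]*(#|$)/ { print $2 }' "$@" \
        | LC_ALL=C sort -u \
        | wc -l \
        | tr -d ' '
}

# Example call, using the file names from this thread:
#   count_predicates d3.nt dc.nt
```

On files of this size the sort will spill to temporary files on disk, so it needs free space in the temporary directory rather than a lot of memory.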
The cache is in the heap but all the index caching (which is much, much
larger) is not in the heap and is unaffected by -Xmx. That's why setting
the heap large actually slows things down - it is taking space away from
the index caches which are in memory mapped files, so managed by the OS
in the file system cache.
A heap of 2G should be enough - 10G does not help much. The loader will
use the rest of the machine (in fact, whether you want it to or not!)
for indexes and the indexing part of the node table.
Hence the unusual point is the predicate distribution.
What is the data? This hasn't happened (IIRC) before even with some of
the older dbpedia dumps which had huge numbers of synthetic predicates.
You could try tdbloader (not 2) which does not collect statistics during
the loading.
It can be slower or even sometimes faster than tdbloader2 at your scale
but it should not be very slow.
I'd be interested in hearing how that goes and seeing a complete load
log to know how it does.
Andy
On 24/04/15 17:07, Daniel Hernández wrote:
On 2015-04-23 18:12, Andy Seaborne wrote:
Hi there,
It's hard to be sure - what does the load log file say before the
exception occurs?
It was loading data when the error occurred. I tried again with
export JVM_ARGS=-Xmx10000M before the load execution and I got the
error:
INFO Add: 289,750,000 Data (Batch: 120,481 / Avg: 67,007)
INFO Add: 289,800,000 Data (Batch: 117,647 / Avg: 67,012)
INFO Add: 289,850,000 Data (Batch: 155,279 / Avg: 67,018)
INFO Add: 289,900,000 Data (Batch: 151,515 / Avg: 67,025)
INFO Add: 289,950,000 Data (Batch: 156,250 / Avg: 67,031)
INFO Add: 290,000,000 Data (Batch: 155,279 / Avg: 67,038)
INFO Elapsed: 4,325.89 seconds [2015/04/24 12:23:55 UTC]
INFO Add: 290,050,000 Data (Batch: 162,866 / Avg: 67,045)
INFO Add: 290,100,000 Data (Batch: 50,968 / Avg: 67,041)
INFO Add: 290,150,000 Data (Batch: 160,771 / Avg: 67,048)
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.LinkedHashMap.createEntry(LinkedHashMap.java:442)
at java.util.HashMap.addEntry(HashMap.java:884)
at java.util.LinkedHashMap.addEntry(LinkedHashMap.java:427)
at java.util.HashMap.put(HashMap.java:505)
at org.apache.jena.atlas.lib.cache.CacheLRU.put(CacheLRU.java:59)
at com.hp.hpl.jena.tdb.store.nodetable.NodeTableCache.cacheUpdate(NodeTableCache.java:200)
at com.hp.hpl.jena.tdb.store.nodetable.NodeTableCache._retrieveNodeByNodeId(NodeTableCache.java:127)
at com.hp.hpl.jena.tdb.store.nodetable.NodeTableCache.getNodeForNodeId(NodeTableCache.java:85)
at com.hp.hpl.jena.tdb.store.nodetable.NodeTableWrapper.getNodeForNodeId(NodeTableWrapper.java:55)
at com.hp.hpl.jena.tdb.store.nodetable.NodeTableInline.getNodeForNodeId(NodeTableInline.java:67)
at com.hp.hpl.jena.tdb.solver.stats.StatsCollectorNodeId.convert(StatsCollectorNodeId.java:51)
at com.hp.hpl.jena.tdb.solver.stats.StatsCollectorBase.results(StatsCollectorBase.java:54)
at com.hp.hpl.jena.tdb.solver.stats.StatsCollectorNodeId.results(StatsCollectorNodeId.java:30)
at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.exec(CmdNodeTableBuilder.java:172)
at arq.cmdline.CmdMain.mainMethod(CmdMain.java:102)
at arq.cmdline.CmdMain.mainRun(CmdMain.java:63)
at arq.cmdline.CmdMain.mainRun(CmdMain.java:50)
at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.main(CmdNodeTableBuilder.java:80)
On 23/04/15 20:53, Daniel Hernández wrote:
Hello,
I'm trying to load two files into a tdb with the command below:
bin/tdbloader2 --loc=tdb-03 d3.nt dc.nt
Do these files have a lot of literals? A lot of large literals?
I don't think the literals are the problem, because I have loaded the
same data with another schema without problems. I guess the problem
could be the large number of different predicates. The first file has
50 million different predicates.
I have increased the memory used by Java by setting the line below in
the bin/tdbloader2worker file.
JVM_ARGS=${JVM_ARGS:--Xmx20000M}
JVM_ARGS is set further out in tdbloader2 as well, so this change
has no effect (JVM_ARGS is already set, so ${:-} returns the existing
value). It's merely a fallback at that point.
The right idiom is to set in the shell environment calling tdbloader2
e.g.
export JVM_ARGS=-Xmx5000M
tdbloader2 ...
or
env JVM_ARGS=-Xmx5000M tdbloader2 ...
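The fallback behaviour of ${VAR:-default} is easy to check in isolation. This is a minimal sketch, not taken from the tdbloader2 scripts: the value -Xmx1200M is a made-up stand-in for whatever the outer script sets, and the variable names fallback and kept are illustrative only.

```shell
# ${VAR:-default} expands to the default only when VAR is unset or empty.
unset JVM_ARGS
fallback=${JVM_ARGS:--Xmx20000M}   # JVM_ARGS is unset, so the default is used

JVM_ARGS=-Xmx1200M                 # hypothetical value set earlier by an outer script
kept=${JVM_ARGS:--Xmx20000M}       # JVM_ARGS is set, so the existing value wins

echo "unset case: $fallback"
echo "set case:   $kept"
```

This is why editing the default in the worker script changes nothing once the calling script has already assigned JVM_ARGS.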
Don't set it too large. Much of the bulk space is not in the Java heap.
I used 10GB for the heap the last time, so there are 20GB extra to be used.
However, I got the error above.
Thanks,
Daniel