Hello Andy,
[tdbloader2 performance for 1B+ triples]
On Mon, Jul 30, 2012 at 05:06:55PM +0100, Andy Seaborne wrote:
> >>How big are the node* files (node2id.dat, .idn, nodes.dat) in the
> >>resulting database in this case?
> >
> >node2id.dat 9470738432 bytes
>
> 9,470,738,432 => 9G
>
> >node2id.idn 50331648 bytes
>
> 50,331,648 => 50M
>
> Much less than RAM size.
>
> >nodes.dat 20182577027 bytes
>
> This file is written sequentially and isn't read during loading so
> should not be an issue.
>
> In 64 bit mode, the B+Tree node2id is a memory mapped file and the OS
> takes care of paging+caching the data.
>
> I think that use of
>
> JVM_ARGS="-Xmx32768M -server"
>
> is in fact making things worse: the heap grows to 32G, reducing the
> space available to the OS for mmap files. So it is squeezing out the OS
> managed mmap files and the result is that there is little real RAM
> devoted to caching the node table.
>
> 2G heap should be enough IIRC (caveat long literals).
The -Xmx32768M is there for a reason: I have had out-of-memory errors with
much higher values and earlier Jena versions. I tried JVM_ARGS="-Xmx2048M"
with tdbloader2 from apache-jena-2.7.3, and the error came after about 55 million triples:
INFO Add: 55,300,000 Data (Batch: 281 / Avg: 13,794)
INFO Add: 55,350,000 Data (Batch: 227 / Avg: 13,088)
INFO Add: 55,400,000 Data (Batch: 192 / Avg: 12,342)
INFO Add: 55,450,000 Data (Batch: 134 / Avg: 11,406)
INFO Add: 55,500,000 Data (Batch: 98 / Avg: 10,335)
INFO Elapsed: 5,369.59 seconds [2012/08/09 17:45:44 CEST]
INFO Add: 55,550,000 Data (Batch: 52 / Avg: 8,785)
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2882)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
    at java.lang.StringBuilder.append(StringBuilder.java:119)
    at com.hp.hpl.jena.tdb.lib.NodeLib.hash(NodeLib.java:160)
    at com.hp.hpl.jena.tdb.lib.NodeLib.setHash(NodeLib.java:116)
    at com.hp.hpl.jena.tdb.nodetable.NodeTableNative.accessIndex(NodeTableNative.java:124)
    at com.hp.hpl.jena.tdb.nodetable.NodeTableNative._idForNode(NodeTableNative.java:117)
    at com.hp.hpl.jena.tdb.nodetable.NodeTableNative.getAllocateNodeId(NodeTableNative.java:83)
    at com.hp.hpl.jena.tdb.nodetable.NodeTableCache._idForNode(NodeTableCache.java:123)
    at com.hp.hpl.jena.tdb.nodetable.NodeTableCache.getAllocateNodeId(NodeTableCache.java:83)
    at com.hp.hpl.jena.tdb.nodetable.NodeTableWrapper.getAllocateNodeId(NodeTableWrapper.java:43)
    at com.hp.hpl.jena.tdb.nodetable.NodeTableInline.getAllocateNodeId(NodeTableInline.java:51)
    at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder$NodeTableBuilder.send(CmdNodeTableBuilder.java:223)
    at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder$NodeTableBuilder.send(CmdNodeTableBuilder.java:190)
    at org.openjena.riot.lang.LangNTuple.runParser(LangNTuple.java:71)
    at org.openjena.riot.lang.LangBase.parse(LangBase.java:43)
    at org.openjena.riot.RiotLoader.readQuads(RiotLoader.java:206)
    at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.exec(CmdNodeTableBuilder.java:168)
    at arq.cmdline.CmdMain.mainMethod(CmdMain.java:101)
    at arq.cmdline.CmdMain.mainRun(CmdMain.java:63)
    at arq.cmdline.CmdMain.mainRun(CmdMain.java:50)
    at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.main(CmdNodeTableBuilder.java:79)
Any idea what a good value for -Xmx would be for 1B+ triples?
I will try -Xmx16384M now.
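
To put the heap versus mmap trade-off in numbers, here is a rough sketch in Python. It uses the node2id file sizes quoted above. The total RAM figure (ASSUMED_RAM) is a made-up placeholder and not the real size of this machine, and the helper name mmap_headroom is only for illustration. It also ignores JVM overhead outside the heap and any other processes, so treat the result as a back-of-the-envelope estimate only.

```python
# File sizes reported earlier in this thread, in bytes.
NODE2ID_DAT = 9_470_738_432
NODE2ID_IDN = 50_331_648

GIB = 1024 ** 3
MIB = 1024 ** 2


def mmap_headroom(total_ram_bytes, heap_mb):
    """RAM left over after a fully grown JVM heap and the node2id files.

    A positive result means the memory-mapped node2id index could, in
    principle, stay entirely in the OS page cache; a negative result
    means the heap is squeezing it out.
    """
    left_for_os = total_ram_bytes - heap_mb * MIB
    return left_for_os - (NODE2ID_DAT + NODE2ID_IDN)


# ASSUMPTION: hypothetical machine size, replace with the real value.
ASSUMED_RAM = 48 * GIB

for heap_mb in (2048, 16384, 32768):
    headroom = mmap_headroom(ASSUMED_RAM, heap_mb)
    print(f"-Xmx{heap_mb}M leaves {headroom / GIB:+.1f} GiB beyond the node2id files")
```

The estimate only covers node2id, since nodes.dat is written sequentially and not read during loading. It says nothing about how much heap the loader itself needs, which is the open question here.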
Regards,
Michael Brunnbauer