On 27/07/16 13:19, Dick Murray wrote:
;-) Yes I did. But then I switched to the actual files I need to import and
they produce ~3.5M triples...
Using stock Jena 3.1 (i.e. no special context symbols set), committing after every 100k triples lets me import the file 10 times, with the [B varying between ~2Mb and ~4Mb. I'm currently testing a 20-instance pass.
A batched commit works for this bulk load because, if it fails after a batch commit, I can simply remove the graph.
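For reference, the batched-commit pattern looks roughly like this (a sketch only; `BATCH`, the directory path, and the statement iterator are placeholders for the actual loader):

```java
import java.util.Iterator;

import org.apache.jena.query.Dataset;
import org.apache.jena.query.ReadWrite;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.Statement;
import org.apache.jena.tdb.TDBFactory;

public class BatchedLoad {
    static final int BATCH = 100_000; // commit every 100k triples

    public static void load(String tdbDir, String graphUri, Iterator<Statement> stmts) {
        Dataset ds = TDBFactory.createDataset(tdbDir);
        long inBatch = 0;
        ds.begin(ReadWrite.WRITE);
        try {
            Model graph = ds.getNamedModel(graphUri);
            while (stmts.hasNext()) {
                graph.add(stmts.next());
                if (++inBatch == BATCH) {
                    ds.commit();                       // bound the journal/heap held by the txn
                    ds.begin(ReadWrite.WRITE);         // start the next batch
                    graph = ds.getNamedModel(graphUri);
                    inBatch = 0;
                }
            }
            ds.commit();
        } finally {
            ds.end();
        }
        // If a load fails partway, the partially loaded named graph can be
        // dropped in a fresh write transaction with ds.removeNamedModel(graphUri).
    }
}
```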
For my understanding... TDB holds the triples/blocks/journal in heap until commit is called? But that doesn't account for the [B not being cleared after a commit of 3.5M triples. It takes another pass plus ~2M uncommitted triples before I get an OOME.
And the [B have a strange average size; a block is 8K.
Digging around, there are some references to DirectByteBuffers causing issues. This IBM article
https://www.ibm.com/developerworks/community/blogs/kevgrig/entry/excessive_native_memory_usage_by_directbytebuffers?lang=en
attributes the problem to:
Essentially the problem boils down to either:
1. There are too many DBBs being allocated (or they are too large),
and/or
2. The DBBs are not being cleared up quickly enough.
TDB does not use DirectByteBuffers unless you ask it to. They are not [B:
.hasArray() is false.
.array() throws UnsupportedOperationException.
(Grep the code for "allocateDirect" and trace back the single use of
BufferAllocatorDirect to the journal in "direct" mode.)
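This is easy to verify with plain java.nio, independently of Jena (a standalone check; nothing here is TDB-specific):

```java
import java.nio.ByteBuffer;

public class DirectBufferCheck {
    public static void main(String[] args) {
        // A heap buffer is backed by a byte[] -- this is what shows up as [B in a heap dump.
        ByteBuffer heap = ByteBuffer.allocate(8 * 1024);
        // A direct buffer lives in native memory, outside the Java heap.
        ByteBuffer direct = ByteBuffer.allocateDirect(8 * 1024);

        System.out.println("heap.hasArray()   = " + heap.hasArray());   // true
        System.out.println("direct.hasArray() = " + direct.hasArray()); // false

        try {
            direct.array();
        } catch (UnsupportedOperationException e) {
            // No backing byte[] to expose.
            System.out.println("direct.array() threw UnsupportedOperationException");
        }
    }
}
```

So if the [B instances in the heap dump are growing, they are ordinary heap byte arrays, not DirectByteBuffers.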
I can believe that, if activated, GC recycling of them would be slow.
The code ought to recycle them (because you can't explicitly free them
for some weird reason; they are little more than malloc).
But they are not being used unless you ask for them.
Journal entries are 8K unless they are commit records which are about 20
bytes (I think).
and recommends using -XX:MaxDirectMemorySize=1024m to poke the GC via
System.gc(). Not sure if G1GC helps because of its new heap model...
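For reference, the flags discussed in this thread combined on one command line (the values are just the ones mentioned above, not a recommendation, and `loader.jar` is a placeholder for the actual import program):

```
java -Xms8g -Xmx8g \
     -XX:MaxDirectMemorySize=1024m \
     -XX:+UseG1GC \
     -jar loader.jar
```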
Would it be possible to get Jena to write its uncommitted triples to disk
and then commit them to the TDB? OK, it's slower than RAM, but until they
are committed only one thread has visibility anyway? Could direct that at
a different disk as well...

Set TDB.transactionJournalWriteBlockMode to "mapped". That uses a disk file.
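If I've read the TDB code right, that context symbol can be set globally before any dataset is opened; a hedged sketch (check the symbol name against your Jena version):

```java
import org.apache.jena.tdb.TDB;

// Ask TDB to buffer journal write blocks through a memory-mapped disk
// file instead of heap (or direct) byte buffers.
TDB.getContext().set(TDB.transactionJournalWriteBlockMode, "mapped");
```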
Just before hitting send I'm at pass 13 and the [B maxed at just over 4Gb
before dropping back to 2Gb.
Or use TDB2 :-)
It has no problem loading 100m+ triples in a single transaction (the
space per transaction is fixed at about 80 bytes; disk writes happen
during the transaction, not to a roll-forward journal).
And it should be a bit faster because writes happen once.
Just need to find time to clean it up ...
Andy
Dick.
On 27 July 2016 at 11:47, Andy Seaborne <[email protected]> wrote:
On 27/07/16 11:22, Dick Murray wrote:
Hello.
Something doesn't add up here... I've run repeated tests with the following
MWE on a 16GB machine with -Xms8g -Xmx8g, and I always get an OOME.
What I don't understand is the size of [B increases with each pass until
the OOME is thrown. The exact same process is run 5 times with a new graph
for each set of triples.
There are ~3.5M triples added within the transaction, read in line pairs
from a "simple" text-based file (30Mb).
Err - you said 200k quads earlier!
Set
TransactionManager.QueueBatchSize=0 ;
and break the load into small units for now and see if that helps.
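That setting is a public static field on TDB's transaction manager; roughly (a sketch, for Jena 3.1's `org.apache.jena.tdb` packages):

```java
import org.apache.jena.tdb.transaction.TransactionManager;

// 0 = flush each committed write transaction straight to the main
// database instead of queueing it behind active readers.
TransactionManager.QueueBatchSize = 0;
```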
One experiment would be to write the output to disk and load from a
program that only does the TDB part.
Andy
I've tested repeated loads of the same text file (i.e. file x * 5) and
different text files loaded sequentially (i.e. file x, file y, file ...),
and the same result is exhibited.
If I reduce -Xmx to 6g it will fail earlier.
Changing the GC using -XX:+UseG1GC doesn't alter the outcome.
I'm running on Ubuntu 16.04 with Java 1.8 and I can replicate this on
Centos 7 with Java 1.8.
Any ideas?
Regards Dick.