The MWE in the previous email will work with any even-line text file and
will produce the odd [B values. I can't see anywhere obvious where the
non-Jena code is creating them; it's just odd that there are so many of them!

OK, that knocks the DBB idea on the head!

I'll set the mapped symbol and play with batch sizes. Can the map location
be configured, or will it go under the TDB location?
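For anyone following along, a minimal sketch of setting that symbol, assuming
it is the TDB.transactionJournalWriteBlockMode constant on org.apache.jena.tdb.TDB
that Andy names below (verify the field against your Jena version):

```java
// Sketch: switch the TDB1 transaction journal to a memory-mapped disk file.
// Assumes Jena 3.x's org.apache.jena.tdb.TDB constant - verify the symbol
// name against your Jena version before relying on it.
import org.apache.jena.tdb.TDB;

public class MappedJournalConfig {
    public static void main(String[] args) {
        // "direct" is the DirectByteBuffer mode; "mapped" writes journal
        // blocks to a disk-backed, memory-mapped file instead of the heap.
        TDB.getContext().set(TDB.transactionJournalWriteBlockMode, "mapped");
    }
}
```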

Is "TDB2" what we discussed some time back? I'm happy to provide some
testing on that, as I've ~2000 files to ETL via an automated process, each
producing 3-4M quads...
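As a sketch of the batched-commit ETL loop I have in mind - the Jena dataset
calls are stubbed out as comments, since only the commit-every-N bookkeeping
matters here, and 100k is just the batch size currently under test:

```java
// Sketch of a batched-commit bulk load: commit every BATCH quads so the
// journal's heap footprint (the [B arrays) stays bounded. The Jena calls
// are stubbed as comments; only the batching arithmetic is real.
public class BatchedLoad {
    static final long BATCH = 100_000;

    // Returns how many commits a load of n quads performs.
    static long load(long n) {
        long sinceCommit = 0, commits = 0;
        for (long i = 0; i < n; i++) {
            // dsg.add(quad);  // hypothetical per-quad insert
            if (++sinceCommit == BATCH) {
                // dataset.commit(); dataset.begin(ReadWrite.WRITE);  // stubbed
                commits++;
                sinceCommit = 0;
            }
        }
        if (sinceCommit > 0)
            commits++;  // commit the final partial batch
        return commits;
    }

    public static void main(String[] args) {
        System.out.println(load(3_500_000));  // 35 commits for ~3.5M quads
    }
}
```

If a batch fails partway, the graph can be removed and the file retried,
which is what makes the batched approach safe for this bulk load.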

Thanks Dick.

On 27 Jul 2016 20:10, "Andy Seaborne" <[email protected]> wrote:
>
> On 27/07/16 13:19, Dick Murray wrote:
>>
>> ;-) Yes I did. But then I switched to the actual files I need to import
>> and they produce ~3.5M triples...
>>
>> Using normal Jena 3.1 (i.e. no special context symbols set), the commit
>> after 100k triples works to import the file 10 times, with the [B varying
>> between ~2Mb and ~4Mb. I'm currently testing a 20-instance pass.
>>
>> A batched commit works for this bulk load because, if it fails after a
>> batch commit, I can remove the graph.
>>
>> For my understanding... TDB is holding the triples/block/journal in heap
>> until commit is called? But this doesn't account for the [B not being
>> cleared after a commit of 3.5M triples. It takes another pass plus ~2M
>> uncommitted triples before I get an OOME.
>
>
> And the [B have a strange average size.  A block is 8K.
>
>> Digging around, there are some references to the DirectByteBuffers
>> causing issues. This IBM article
>> https://www.ibm.com/developerworks/community/blogs/kevgrig/entry/excessive_native_memory_usage_by_directbytebuffers?lang=en
>> links the problem to:
>>
>> Essentially the problem boils down to either:
>>
>>    1. There are too many DBBs being allocated (or they are too large),
>>    and/or
>>    2. The DBBs are not being cleared up quickly enough.
>>
>
> TDB does not use DirectByteBuffers unless you ask it to.  They are not [B.
>
> .hasArray is false.
> .array() throws UnsupportedOperationException.
>
> (Grep the code for "allocateDirect" and trace the single use of
> BufferAllocatorDirect back to the journal in "direct" mode.)
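[A JDK-only illustration of that distinction - heap buffers are backed by a
byte[] (which is what shows up as [B in a heap dump), direct buffers are not:]

```java
import java.nio.ByteBuffer;

// Heap vs direct buffers: only heap buffers are backed by a byte[] ([B).
public class BufferKinds {
    public static void main(String[] args) {
        ByteBuffer heap = ByteBuffer.allocate(8 * 1024);         // backed by byte[]
        ByteBuffer direct = ByteBuffer.allocateDirect(8 * 1024); // native memory

        System.out.println(heap.hasArray());   // true
        System.out.println(direct.hasArray()); // false: .array() would throw
    }
}
```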
>
> I can believe that, if activated, the GC recycling would be slow. The
> code ought to recycle them (because you can't explicitly free them for
> some weird reason - they're little more than malloc).
>
> But they are not being used unless you ask for them.
>
> Journal entries are 8K unless they are commit records, which are about
> 20 bytes (I think).
>
>
>
>>
>> and recommends using -XX:MaxDirectMemorySize=1024m to poke the GC via
>> System.gc(). Not sure if G1GC helps because of its new heap model...
>>
>> Would it be possible to get Jena to write its uncommitted triples to
>> disk and then commit them to the TDB?
>
>
> Set TDB.transactionJournalWriteBlockMode to "mapped". That uses a disk
> file.
>
>
>> OK, it's slower than RAM, but until they are committed only one thread
>> has visibility anyway? Could direct that at a different disk as well...
>>
>> Just before hitting send I'm at pass 13 and the [B maxed at just over 4Gb
>> before dropping back to 2Gb.
>
>
> Or use TDB2 :-)
>
> It has no problem loading 100m+ triples in a single transaction (the
> space per transaction is fixed at about 80 bytes - disk writes happen
> during the transaction, not to a roll-forward journal). And it should be
> a bit faster because writes happen once.
>
> Just need to find time to clean it up ...
>
>         Andy
>
>
>>
>> Dick.
>>
>>
>>
>> On 27 July 2016 at 11:47, Andy Seaborne <[email protected]> wrote:
>>
>>> On 27/07/16 11:22, Dick Murray wrote:
>>>
>>>> Hello.
>>>>
>>>> Something doesn't add up here... I've run repeated tests with the
>>>> following MWE on a 16GB machine with -Xms8g -Xmx8g, and I always get
>>>> an OOME.
>>>>
>>>> What I don't understand is that the size of [B increases with each
>>>> pass until the OOME is thrown. The exact same process is run 5 times
>>>> with a new graph for each set of triples.
>>>>
>>>> There are ~3.5M triples added within the transaction, from a file
>>>> which is a "simple" text-based file (30Mb) read in line pairs.
>>>>
>>>
>>> Err - you said 200k quads earlier!
>>>
>>> Set
>>>
>>> TransactionManager.QueueBatchSize=0 ;
>>>
>>> and break the load into small units for now and see if that helps.
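>>>
>>> [For completeness, setting that looks like the following, assuming the
>>> public static field on org.apache.jena.tdb.transaction.TransactionManager
>>> as in Jena 3.x:]

```java
// Sketch: disable TDB1's write-behind commit queue so each transaction is
// flushed to the main database immediately. Field name assumed from Jena
// 3.x's org.apache.jena.tdb.transaction.TransactionManager.
import org.apache.jena.tdb.transaction.TransactionManager;

public class NoCommitQueue {
    public static void main(String[] args) {
        TransactionManager.QueueBatchSize = 0;  // 0 = flush on every commit
    }
}
```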
>>>
>>> One experiment would be to write the output to disk and load from a
>>> program that only does the TDB part.
>>>
>>>     Andy
>>>
>>>
>>>
>>>> I've tested sequential loads of other text files (i.e. file x *5) and
>>>> other text files loaded sequentially (i.e. file x, file y, file ...)
>>>> and the same result is exhibited.
>>>>
>>>> If I reduce -Xmx to 6g it will fail earlier.
>>>>
>>>> Changing the GC using -XX:+UseG1GC doesn't alter the outcome.
>>>>
>>>> I'm running on Ubuntu 16.04 with Java 1.8 and I can replicate this on
>>>> Centos 7 with Java 1.8.
>>>>
>>>> Any ideas?
>>>>
>>>> Regards Dick.
>>>>
>>>>
>>>
>>>
>>
>
