On 13/02/2021 17:17, Daniel Hernandez wrote:

Hi,

Andy Seaborne writes:
How much data are you loading?

I am loading a billion triples.

Heap is only used for the node table cache and not index work, which is
out of heap, in memory-mapped files mapped by the virtual memory of the
OS process, so caching is done by the OS filesystem cache machinery. It
can make the OS process look very large even if the heap is only 1.2G.

So is it better not to modify the Xms parameter?

Xms does not matter. Personally, I'd set -Xmx to 4G which is larger than normal and plenty.
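To make the heap advice concrete, here is a minimal CLI sketch, assuming the Jena command scripts are on PATH and that JVM_ARGS is the environment variable those wrapper scripts read; /data/DB and data.nt.gz are placeholder paths:

```shell
# Cap the loader's heap at 4G as suggested above.
# JVM_ARGS is picked up by Jena's command-line wrapper scripts.
# /data/DB (database directory) and data.nt.gz are placeholders.
export JVM_ARGS="-Xmx4G"
tdbloader --loc=/data/DB data.nt.gz
```

The index work stays outside this heap in memory-mapped files, so 4G is for the node table cache, not the bulk of the data.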

Don't set it too high - that can slow things down. If the heap grows, it takes space away from the OS and, as the data size grows, file I/O on the indexes becomes the dominant speed factor. So caching and I/O hardware matter. For example, on AWS, EBS SSD and local SSD have different speed characteristics.

tdbloader2 may not be the right choice. It is a bit niche, but if you
have much less RAM than total data it can be better than tdbloader, and
it is better if there is a rotating disk rather than an SSD. It has been
reported to be the right choice for several billion triples on SSD.

I have an SSD, a machine with 256 GB of RAM, and 32 cores. Do you
recommend using tdbloader in this setting?

The rate you were getting seems low even for tdbloader2 - is it all SSD, or could /tmp be on a disk? And is the SSD local or remote (e.g. EBS)?
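One quick check for the /tmp question - tdbloader2 does its external sorting through temporary files, so it is worth confirming which device /tmp sits on. A sketch: /ssd/tmp is a placeholder path, and redirecting the work area assumes the sort(1) calls in the script honour TMPDIR (GNU sort does):

```shell
# Show which filesystem /tmp lives on; tdbloader2 writes large
# intermediate sort files there, so a slow /tmp disk hurts load speed.
df -h /tmp

# Optionally point the sort work area at the SSD instead.
# Assumption: the sort(1) invocations in tdbloader2 honour TMPDIR.
export TMPDIR=/ssd/tmp
```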

As a general point, because the hardware matters, it is a case of trying a few configurations and seeing.

Does it have to be TDB1? "tdb2.tdbloader --loader=parallel" is the most aggressive loader. For TDB1, I'm not sure whether "tdbloader2" or "tdbloader" will be faster end-to-end.
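For the TDB2 route, the parallel loader invocation looks like this - a sketch, assuming the Jena command scripts are on PATH; /data/DB2 and data.nt.gz are placeholder paths:

```shell
# TDB2 bulk load with the parallel loader (the most aggressive one).
# --loc is the database directory; input can be any RDF syntax Jena
# can stream, such as N-Triples, gzipped or not.
tdb2.tdbloader --loader=parallel --loc=/data/DB2 data.nt.gz
```

With 256 GB of RAM, 32 cores, and local SSD, this setting is the one that exercises the hardware most.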


I'd be interested in what you found out. It's been a while since I had access to a large machine (which was on AWS ~240G RAM, local SSD). I used tdb2.tdbloader (i.e. TDB2).

    Andy


Best regards,
Daniel
