On 13/02/2021 17:17, Daniel Hernandez wrote:

Hi,

Andy Seaborne writes:
How much data are you loading?

I am loading a billion triples.

Heap is only used for the node table cache and not index work, which is
out of heap, in memory-mapped files mapped by the virtual memory of the
OS process, so caching is done by the OS filesystem cache machinery. It
can make the OS process look very large even if the heap is only 1.2G.

So is it better not to modify the Xms parameter?

Xms does not matter. Personally, I'd set -Xmx to 4G which is larger than normal and plenty.
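To make the heap advice concrete, here is a minimal CLI sketch, assuming the Jena command scripts are on PATH and that JVM_ARGS is the environment variable those wrapper scripts read; /data/DB and data.nt.gz are placeholder paths:

```shell
# Cap the loader's heap at 4G as suggested above.
# JVM_ARGS is picked up by Jena's command-line wrapper scripts.
# /data/DB (database directory) and data.nt.gz are placeholders.
export JVM_ARGS="-Xmx4G"
tdbloader --loc=/data/DB data.nt.gz
```

The index work stays outside this heap in memory-mapped files, so 4G is for the node table cache, not the bulk of the data.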

Don't set it too high - that can slow things down. If the heap grows, it takes space away from the OS and, as the data size grows, file I/O on the indexes becomes the dominant speed factor. So caching and I/O hardware matter. For example, on AWS, EBS SSD and local SSD have different speed characteristics.

tdbloader2 may not be the right choice. It is a bit niche, but if you
have much less RAM than total data it can be better than tdbloader, and
it is better if there is a rotating disk rather than an SSD. It has been
reported to be the right choice for several billion triples on SSD.

I have an SSD, a machine with 256 GB of RAM, and 32 cores. Do you
recommend using tdbloader in this setting?

The rate you were getting seems low even for tdbloader2 - is it all SSD, or could /tmp be on a disk? And is the SSD local or remote (e.g. EBS)?
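One quick check for the /tmp question - tdbloader2 does its external sorting through temporary files, so it is worth confirming which device /tmp sits on. A sketch: /ssd/tmp is a placeholder path, and redirecting the work area assumes the sort(1) calls in the script honour TMPDIR (GNU sort does):

```shell
# Show which filesystem /tmp lives on; tdbloader2 writes large
# intermediate sort files there, so a slow /tmp disk hurts load speed.
df -h /tmp

# Optionally point the sort work area at the SSD instead.
# Assumption: the sort(1) invocations in tdbloader2 honour TMPDIR.
export TMPDIR=/ssd/tmp
```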

As a general point, because the hardware matters, it is a case of trying a few configurations and seeing.

Does it have to be TDB1? "tdb2.tdbloader --loader=parallel" is the most aggressive loader. For TDB1, I'm not sure whether "tdbloader2" or "tdbloader" will be faster end-to-end.
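For the TDB2 route, the parallel loader invocation looks like this - a sketch, assuming the Jena command scripts are on PATH; /data/DB2 and data.nt.gz are placeholder paths:

```shell
# TDB2 bulk load with the parallel loader (the most aggressive one).
# --loc is the database directory; input can be any RDF syntax Jena
# can stream, such as N-Triples, gzipped or not.
tdb2.tdbloader --loader=parallel --loc=/data/DB2 data.nt.gz
```

With 256 GB of RAM, 32 cores, and local SSD, this setting is the one that exercises the hardware most.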


I'd be interested in what you found out. It's been a while since I had access to a large machine (which was on AWS ~240G RAM, local SSD). I used tdb2.tdbloader (i.e. TDB2).

    Andy


Best regards,
Daniel
