On 17/04/17 23:07, Laura Morales wrote:
tdbloader2 builds b+trees from bottom to top, given sorted input. As
such, blocks are streamed to disk in order, which is disk-efficient.
It is a series of Java programs scripted together by a shell script.
tdbloader is pure Java. It builds the b+trees by inserting, which for
some indexes is not optimal because it causes random inserts leading to
random I/O, which is bad for disk performance.
Andy
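
A minimal sketch of the bottom-up idea described above, for illustration only. It is not Jena's code: the function name build_bottom_up and the fanout parameter are made up here, and real TDB nodes are fixed-size disk blocks rather than Python lists. It assumes the keys are already sorted, and shows that every leaf and every parent node is produced once, in order, so nothing already written has to be revisited.

```python
def build_bottom_up(sorted_keys, fanout=4):
    """Pack already-sorted keys into leaves, then build parent levels on top.

    Hypothetical helper: `fanout` stands in for "entries per block".
    Returns a list of levels; levels[0] holds the leaves, levels[-1] the root.
    """
    if not sorted_keys:
        return [[]]
    # Leaf level: consecutive runs of `fanout` keys, emitted in input order.
    level = [sorted_keys[i:i + fanout]
             for i in range(0, len(sorted_keys), fanout)]
    levels = [level]
    # Each upper level stores the first key of every node in the level below.
    while len(level) > 1:
        firsts = [node[0] for node in level]
        level = [firsts[i:i + fanout]
                 for i in range(0, len(firsts), fanout)]
        levels.append(level)
    return levels


if __name__ == "__main__":
    for depth, nodes in enumerate(build_bottom_up(list(range(20)), fanout=4)):
        print(depth, nodes)
```

An insert-based build, by contrast, has to find and modify whichever existing node each new key belongs to, which is where the random I/O comes from.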
But why is tdbloader better for smaller datasets, whereas tdbloader2 is
better for very large datasets ("100M+ triples")? Wouldn't the approach
of tdbloader2 be superior in all cases?
Try them both and see!
tdbloader2 has high overhead.
On small datasets (fewer than 100M triples), an index fits in the OS disk
cache, so tdbloader I/O is effectively "in-memory" and the randomness is
not a problem. When the index spills out of the cache, loading slows down
quite markedly.
tdbloader2 is a slower algorithm but does not produce this "fall-off
effect" on index writing.
Andy
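
A rough toy model of the "fall-off effect" described above, for illustration only. It is not a measurement of either loader. It assumes an LRU cache as a stand-in for the OS disk cache, and the names cache_misses and random_insert_pages, the page counts, and the cache size are all invented for the example. Each cache miss stands for one page that has to go to disk.

```python
import random
from collections import OrderedDict


def cache_misses(page_accesses, cache_pages):
    """Count accesses that miss an LRU cache holding `cache_pages` pages."""
    cache = OrderedDict()
    misses = 0
    for page in page_accesses:
        if page in cache:
            cache.move_to_end(page)        # mark as most recently used
        else:
            misses += 1
            cache[page] = True
            if len(cache) > cache_pages:
                cache.popitem(last=False)  # evict least recently used page
    return misses


def random_insert_pages(n_inserts, index_pages, seed=0):
    """Pages touched by inserts that land uniformly at random in the index."""
    rng = random.Random(seed)
    return [rng.randrange(index_pages) for _ in range(n_inserts)]


if __name__ == "__main__":
    cache = 200  # hypothetical cache capacity, in pages
    for index_pages in (100, 2000):
        rand = cache_misses(random_insert_pages(10_000, index_pages), cache)
        # Sorted input touches each page in turn and never returns to it.
        per_page = 10_000 // index_pages
        seq = cache_misses([i // per_page for i in range(10_000)], cache)
        print(index_pages, "pages: random misses", rand, "sequential misses", seq)
```

In this model, random inserts cost almost nothing extra while the index is smaller than the cache, and miss on most accesses once it is much larger. The sequential pattern misses exactly once per page in both cases.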