On 16/02/2021 09:20, Daniel Hernandez wrote:
Hi,
tdbloader2 may not be the right choice. It is a bit niche, but if you
have much less RAM than total data it can be better than tdbloader, and
it is better suited to a rotating disk than to an SSD. It has, however,
been reported to be the right choice for several billion triples on SSD.
I have an SSD, a machine with 256 GB of RAM, and 32 cores. Do
you recommend using tdbloader in this setting?
The rate you were getting seems low even for tdbloader2 - is it all SSD
or could /tmp be on a disk? And is the SSD local or remote (e.g. EBS)?
As a general point, because the hardware matters, it is a case of trying
a few configurations and seeing what happens.
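One way to check the "/tmp on a disk?" question on Linux is to ask the kernel whether the block device behind a path is flagged as rotational. The following is a minimal sketch, not part of Jena; the helper name is_rotational is made up for illustration, and it assumes a Linux /sys layout. It cannot tell a local SSD from a remote volume, which may also report itself as non-rotational.

```python
import os

def is_rotational(path):
    """Return True/False if the kernel reports the device backing `path`
    as rotational/non-rotational, or None if it cannot be determined."""
    dev = os.stat(path).st_dev
    sys_entry = f"/sys/dev/block/{os.major(dev)}:{os.minor(dev)}"
    if not os.path.exists(sys_entry):
        # e.g. tmpfs, network filesystems, or a system without Linux sysfs
        return None
    node = os.path.realpath(sys_entry)
    # A partition (sda1) has no queue/ directory; its parent disk (sda) does.
    for candidate in (node, os.path.dirname(node)):
        flag = os.path.join(candidate, "queue", "rotational")
        if os.path.isfile(flag):
            with open(flag) as fh:
                return fh.read().strip() == "1"
    return None

# Check the temporary directory and the current (database) directory.
for p in ("/tmp", "."):
    print(p, is_rotational(p))
```

A result of None only means the check could not decide; it is worth confirming by hand in that case.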
Sorry, I was confused. The disk where I was loading the data was a
local 7200 rpm rotating disk. The machine also has an SSD, but it is too
small for the experiment.
tdbloader2 may be the right choice for that setup - it was written with
disks in mind. It uses Unix sort(1). What it needs is tuning of the
parameters passed to the runs of "sort".
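As a rough illustration of what tuning "sort" can look like, the sketch below assembles the GNU sort options that usually matter for large runs: the memory buffer, the number of threads, and the temporary directory. The helper name gnu_sort_args and the path /data/sort-tmp are hypothetical, and the values are examples rather than recommendations. How these options get handed to tdbloader2 depends on the Jena version, so check the script and its usage text.

```python
import os
import shlex

def gnu_sort_args(tmp_dir, buffer_size="50%", parallel=None):
    """Build a list of GNU sort options for a large external sort.

    tmp_dir     -- directory for sort's temporary files (ideally a fast disk)
    buffer_size -- main-memory buffer, e.g. "32G" or "50%" of RAM
    parallel    -- number of sort threads; defaults to the CPU count
    """
    if parallel is None:
        parallel = os.cpu_count() or 1
    return [
        f"--buffer-size={buffer_size}",
        f"--parallel={parallel}",
        f"--temporary-directory={tmp_dir}",
    ]

# Hypothetical values for illustration only.
args = gnu_sort_args("/data/sort-tmp", buffer_size="32G", parallel=8)
print(shlex.join(args))
# --buffer-size=32G --parallel=8 --temporary-directory=/data/sort-tmp
```

Pointing the temporary directory away from a small or slow /tmp is often the first thing to try.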
Wolfgang Fahl has loaded large datasets (several billion triples):
https://issues.apache.org/jira/browse/JENA-1909
and his notes are at:
http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData
Does it have to be TDB1? "tdb2.tdbloader --loader=parallel" is the
most aggressive loader. For TDB1, I'm not sure if "tdbloader2" or
"tdbloader" will be faster end-to-end.
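Since it is unclear which loader wins end-to-end, one option is to time each candidate on the same sample file and compare triples per second. This is only a sketch: the helper names are made up, the command run here is a stand-in that does nothing, and the database and file names in the comment are hypothetical.

```python
import subprocess
import sys
import time

def time_command(cmd):
    """Run `cmd` (a list of arguments) and return the wall-clock seconds it took."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

def triples_per_second(n_triples, seconds):
    """Average load rate for a run of `n_triples` that took `seconds`."""
    return n_triples / seconds

# Stand-in command so the sketch runs anywhere. For a real test, replace it
# with a loader invocation on a sample file, for example (hypothetical paths):
#   ["tdb2.tdbloader", "--loader=parallel", "--loc", "DB2", "sample.nt"]
elapsed = time_command([sys.executable, "-c", "pass"])
print(f"elapsed: {elapsed:.2f} s")
```

Load each candidate into a fresh, empty database directory so the runs do not affect one another.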
I have run some queries using TDB1 before, so I want to compare the
performance under similar conditions. Otherwise, I would have to run the
queries again for TDB2. So I have to evaluate which option is better.
I'd be interested in what you found out. It's been a while since I had
access to a large machine (which was on AWS ~240G RAM, local SSD). I
used tdb2.tdbloader (i.e. TDB2).
I am sorry that my machine was not so good
Mine neither :-)
And I don't have access to additional hardware at the moment.
because it has a rotating
disk. I have another machine, with a 1T local SSD, but with only 64
GB of RAM. I am going to test the loading speed on that machine (when that
machine finishes the jobs it is doing). I wonder if it is better to load
the data using a fast disk, a lot of RAM, or a lot of cores.
A few years ago, I ran load tests on two machines, one 32G + SATA SSD, one
16G + 1TB M.2 SSD. The machine with 16G but the faster SSD was quicker overall.
Database directories can be copied across machines after they have been
built.
Andy
Best,
Daniel