On 23/02/2021 17:55, Daniel Hernandez wrote:
Hi,
The disk where I was loading the data was a local 7200 rpm rotating
disk. The machine also has an SSD, but it is too small for the
experiment.
tdbloader2 may be the right choice for that setup - it was written
with disks in mind. It uses Unix sort(1). What it needs is tuning of
the parameters for the runs of "sort".
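As a sketch of that tuning (the load scripts later in this thread pick up the options from the SORT_ARGS environment variable; the values here are illustrative assumptions, not recommendations):

```shell
#!/bin/bash
# Illustrative sort(1) tuning via SORT_ARGS, the variable consumed by
# the tdbloader2 wrapper scripts shown further down this thread.
# --temporary-directory keeps sort's spill files on a disk with space;
# --buffer-size=50% is an assumed starting point, tune for the machine.
mkdir -p "$PWD/tmp"
export SORT_ARGS="--temporary-directory=$PWD/tmp --buffer-size=50%"
echo "SORT_ARGS set to: $SORT_ARGS"
```

Both options are standard GNU coreutils sort flags; the right buffer size depends on how much RAM the machine can spare during the load.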
Thanks, this information is very useful.
Wolfgang Fahl has loaded large datasets (several billion triples):
https://issues.apache.org/jira/browse/JENA-1909
and his notes are at:
http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData
I have also loaded Wikidata on a very small virtual machine with a
single core and a non-local rotating disk. I remember it took more
than a week. I did not save the log, because the machine was running
other jobs at the same time. The next time I load a big dataset I will
share the machine specification and the loading log.
I wonder if it is better to load the data using a fast disk, a lot of
RAM, or a lot of cores.
A few years ago, I ran load tests on two machines: one with 32G RAM and
a SATA SSD, one with 16G RAM and a 1TB M.2 SSD. The 16G machine with
the faster SSD was quicker overall.
That is interesting. I am considering a machine with an NVMe SSD for
the next load.
Database directories can be copied across machines after they have
been built.
tdbloader2 generates some files with the .tmp extension. The file
data-triples.tmp can be very big. The name suggests that it is a
temporary file. Can I delete that file after the loading ends?
Yes.
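For example (a minimal sketch; "db2-all" is the --loc directory used in the load script below, adjust to your own location, and only delete once the database is confirmed good):

```shell
#!/bin/bash
# Once the load has finished and the database has been checked, the
# *.tmp work files (data-triples.tmp being the largest) are no longer
# needed and can be deleted to reclaim disk space.
# DB_DIR is assumed to be the --loc directory of the load.
DB_DIR="db2-all"
rm -f "$DB_DIR"/*.tmp
```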
The files are the triple IDs from the parse/load-nodes stage.
Then comes the indexing, which makes multiple passes over the tmp
files, one per index: sort using an external sort (in both senses!
an external program, and an external-memory sort that spills to disk),
then build each index in a single pass.
This reuses the external sort's ability to sort data much larger than
RAM. sort(1) needs tuning for the machine it runs on.
I found a previous load script (from when Wikidata was 2.2 B triples, IIRC):
Setting SORT_ARGS
------------------
#!/bin/bash
echo "== $(date)"
export TOOL_DIR="$PWD"
export JENA_HOME="$HOME/jlib/apache-jena-3.5.0"
export JVM_ARGS=""
export GZIP="--fast"
#export SORT_ARGS="--parallel=2 --compress-program=/bin/gzip --temporary-directory=$PWD/tmp --buffer-size=75%"
export SORT_ARGS="--temporary-directory=$PWD/tmp"
## -k : keep work files.
# Logger:org.apache.jena.riot
PHASE="--phase index"
ARGS="--keep-work $PHASE --loc db2-all"
tdbloader2 $ARGS "$@"
echo "== $(date)"
------------------
IIRC not all sort(1) had "--parallel" back then.
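One way to probe for that (a sketch: GNU coreutils sort has --parallel, BSD sort does not, so test before adding it to SORT_ARGS):

```shell
#!/bin/bash
# Probe whether the local sort(1) accepts --parallel before using it.
# The probe sorts empty stdin; if the flag is unsupported, sort exits
# with an error and we fall back to SORT_ARGS without it.
if echo | sort --parallel=2 >/dev/null 2>&1; then
  SORT_ARGS="--parallel=2 --temporary-directory=$PWD/tmp"
else
  SORT_ARGS="--temporary-directory=$PWD/tmp"
fi
export SORT_ARGS
echo "SORT_ARGS=$SORT_ARGS"
```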
I also found a replacement script for the "sort" command in the scripts:
------------------
#!/bin/bash
# Special.
## mysort $KEYS "$DATA" "$WORK"
KEYS="$1"
DATA="$2"
WORK="$3"
SORT_ARGS="--compress-program=/bin/gzip --temporary-directory=$PWD/tmp --buffer-size=80%"
gzip -d < "$DATA.gz" | sort $SORT_ARGS -u $KEYS > "$WORK"
------------------
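The gzip-decompress | sort -u pipeline at the heart of that script can be exercised standalone; a minimal sketch with made-up input (file names and key are illustrative, not what tdbloader2 actually passes):

```shell
#!/bin/bash
# Stand-alone demonstration of the pipeline in the replacement script:
# decompress the run file, sort unique on the given key fields, and
# write the work file. Input data here is fabricated for illustration.
mkdir -p "$PWD/tmp"
printf '2 b\n1 a\n1 a\n' | gzip > data.gz          # stand-in for $DATA.gz
KEYS="-k 1,1"                                       # assumed key spec
SORT_ARGS="--temporary-directory=$PWD/tmp"
gzip -d < data.gz | sort $SORT_ARGS -u $KEYS > work # stand-in for $WORK
cat work
```

With -u, sort compares only the key fields, so the duplicate "1 a" line is dropped and the output is the two unique records in key order.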
HTH
Andy
Best,
Daniel