>> (processing batches of 25K)
The loaders work on empty databases.
tdbloader will load into an existing one but it does not do anything
special and you'll get RAM contention.
If you are splitting files, and doing partial loads, things are rather
different.
>> Right now resident memory is ~3.5GB and virtual ~5.5GB
Maybe swappiness is set to keep a %-age of RAM free.
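
The swappiness guess can be checked directly on Linux. This is a minimal sketch, assuming /proc is mounted; the function name read_swappiness is made up for illustration.

```shell
# Print the kernel's vm.swappiness value; higher values make the kernel
# more willing to swap process memory out.
# Assumes a Linux system that exposes /proc/sys/vm/swappiness.
read_swappiness() {
    if [ -r /proc/sys/vm/swappiness ]; then
        cat /proc/sys/vm/swappiness
    else
        echo "unknown"
    fi
}

read_swappiness
```

The setting is only one possible explanation for the memory figures, so a normal value here does not rule out other causes.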
Andy
On 14/12/17 20:38, dandh988 wrote:
Your IO doesn't know whether it's coming or going!
You're reading from a 250GB file whilst writing to two .tmp files and the id to
node files. Then you are reading the data-triple.tmp to sort it which will be
writing to tmp whilst chewing RAM because it's too big to sort in memory and
writing the sorted file to then read it whilst writing the index files. Repeat
three times.
Your HDD heads can only be in one place at a time and I suspect you've only got
maximum 128MB cache on the drive. The queues on the drive will go through the
roof and if the OS decides to page it'll be properly screwed!
SSD can service deep queues because it can be in more than one place at a time,
as an analogy.
Stick the 250GB file on a USB drive to get that read load off the internal IO
as a start.
The loader works on HDDs; you just need to be a little smart in understanding
the limits of the hardware you're using and laptops are not known for IO
chipsets. Even my Dell M3800 which is supposed to be a workstation grade laptop
has one drive and an external SATA connection to help out.
Dick
-------- Original message --------
From: Laura Morales <[email protected]>
Date: 14/12/2017 20:09 (GMT+00:00)
To: jena-users-ml <[email protected]>
Subject: Re: Report on loading wikidata (errata)
ERRATA:
I don't know why then. Maybe SSD is making all the difference. Try to load it (or
"latest-all") on a comparable machine using a single SATA disk instead of SSD.
s/SATA/HDD
----------------------------
I loaded 2.2B on a 16G machine which wasn't even server class (i.e. its
I/O path to SSD isn't very quick).
I don't know why then. Maybe SSD is making all the difference. Try to load it (or
"latest-all") on a comparable machine using a single SATA disk instead of SSD.
Around 100-150M my computer slows down significantly, and it only gets slower from there. All
I know is that it's either because of too little RAM, or because the disk can't keep up.
If RAM really is at 1G, even on your small 8G server, that suggests your
setup is configured in the OS to restrict the RAM for mapping. RAM per
process should be > real RAM (remember memory mapped files are used) or
the VM is set up in some odd way. Or 32bit java.
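
For reference, resident and virtual size can be read per process on Linux. This is a rough sketch, assuming /proc is available; proc_mem is a hypothetical helper name, and `$$` (the current shell) stands in for the PID of the java process running the loader.

```shell
# Show resident (VmRSS) and virtual (VmSize) memory for a given PID.
# Memory-mapped files count towards VmSize, so it can exceed physical RAM.
proc_mem() {
    status="/proc/$1/status"
    if [ -r "$status" ]; then
        grep -E '^(VmRSS|VmSize):' "$status"
    else
        echo "no readable $status"
    fi
}

# Replace $$ with the PID of the loader's java process.
proc_mem $$
```

Reading VmRSS rather than a shared-memory column avoids mixing up the different memory figures that tools like top report.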
Yeah sorry, I was looking at shared memory. Right now resident memory is ~3.5GB
and virtual ~5.5GB. The process started at 150K triples per second; now, after
250M triples processed, it is at 50K triples/second and slowing down (processing
batches of 25K). I don't know what to say, I think the conclusion is simply
that tdbloader (any version) just doesn't work with large graphs on HDDs. So
the only solution has to be to use an SSD, or find a way to split the graph
into smaller stores, or simply give up.
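
To put the slowdown in numbers, here is a back-of-the-envelope sketch of the remaining load time at the current rate. The helper name eta_seconds is made up, and the 2.2B total is only an assumed target (it is the size mentioned earlier in the thread), not the measured size of this dump.

```shell
# Remaining load time in whole seconds, assuming the rate stays constant.
# Arguments are plain integers: total triples, triples done, triples/second.
eta_seconds() {
    echo $(( ($1 - $2) / $3 ))
}

# 250M triples done at 50K triples/second, against an assumed 2.2B total.
eta_seconds 2200000000 250000000 50000
```

With these figures the result is 39000 seconds, a little under 11 hours. Since the rate is still falling, that should be read as a lower bound.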
$ java -version
openjdk version "1.8.0_151"
OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-1~deb9u1-b12)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)
$ ulimit -a
-t: cpu time (seconds) unlimited
-f: file size (blocks) unlimited
-d: data seg size (kbytes) unlimited
-s: stack size (kbytes) 8192
-c: core file size (blocks) 0
-m: resident set size (kbytes) unlimited
-u: processes 31370
-n: file descriptors 1024
-l: locked-in-memory size (kbytes) unlimited
-v: address space (kbytes) unlimited
-x: file locks unlimited
-i: pending signals 31370
-q: bytes in POSIX msg queues 819200
-e: max nice 0
-r: max rt priority 95
-N 15: unlimited