Re: Report on loading wikidata (errata)

dandh988 Thu, 14 Dec 2017 12:38:55 -0800

Your IO doesn't know whether it's coming or going! 
You're reading from a 250GB file whilst writing to two .tmp files and the id to 
node files. Then you're reading data-triple.tmp to sort it, which writes to tmp 
and chews RAM because the file is too big to sort in memory. The sorted file is 
written out, then read back whilst the index files are written. Repeat 
three times.
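
To illustrate the sort step above (a sketch only, not the loader's own code): when the input is bigger than the memory buffer, a disk-based sort spills sorted runs to a temporary directory and merges them afterwards, so that directory sees heavy write-then-read traffic. Assuming GNU sort, the spill location can be pointed at a different physical drive. The function name and its arguments are hypothetical.

```shell
# Sketch only: sort a large file with a capped in-memory buffer, spilling
# intermediate runs to a temp directory that can sit on another drive.
# Usage: sort_with_spill INPUT OUTPUT TMPDIR [BUFFER]   (BUFFER defaults to 1G)
sort_with_spill() {
    mkdir -p "$3" || return 1
    # -S caps the in-memory buffer, -T sets where spilled runs are written,
    # -o names the output file; LC_ALL=C gives plain byte-order comparison.
    LC_ALL=C sort -S "${4:-1G}" -T "$3" -o "$2" "$1"
}
```

With the temp directory on a second drive, the spill traffic no longer competes with the reads of the input file.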
Your HDD heads can only be in one place at a time and I suspect you've got at 
most 128MB of cache on the drive. The queues on the drive will go through the 
roof and if the OS decides to page it'll be properly screwed!
An SSD can service deep queues because, by analogy, it can be in more than one 
place at a time.
Stick the 250GB file on a USB drive to get that read load off the internal IO 
as a start.
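
A quick way to check whether two paths really sit on separate devices is to compare what df reports for each. This is a minimal sketch using POSIX `df -P`; the function names are hypothetical, and it assumes the device name in the first column contains no spaces.

```shell
# Print the filesystem source (device) that holds the given path.
device_of() {
    df -P "$1" | awk 'NR == 2 { print $1 }'
}

# Exit 0 when both paths are reported on the same filesystem source.
same_device() {
    [ "$(device_of "$1")" = "$(device_of "$2")" ]
}
```

For example, `same_device /path/to/input /path/to/store && echo "shared device"` (both paths are placeholders). Two partitions of one physical disk show up as different sources, so a "different" answer is not proof of separate spindles.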
The loader works on HDDs; you just need to be a little smart in understanding 
the limits of the hardware you're using, and laptops are not known for their IO 
chipsets. Even my Dell M3800, which is supposed to be a workstation-grade laptop, 
has one drive and an external SATA connection to help out.



Dick
-------- Original message --------
From: Laura Morales <[email protected]>
Date: 14/12/2017 20:09 (GMT+00:00)
To: jena-users-ml <[email protected]>
Subject: Re: Report on loading wikidata (errata)
ERRATA:

> I don't know why then. Maybe SSD is making all the difference. Try to load it 
> (or "latest-all") on a comparable machine using a single SATA disk instead of 
> SSD.

s/SATA/HDD



----------------------------

> I loaded 2.2B on a 16G machine which wasn't even server class (i.e. its
> I/O path to SSD isn't very quick).

I don't know why then. Maybe SSD is making all the difference. Try to load it 
(or "latest-all") on a comparable machine using a single SATA disk instead of 
SSD. Around 100-150M triples my computer slows down significantly, and it only 
gets slower from there. All I know is that it's either because of too little 
RAM, or because the disk can't keep up.

> If RAM really is at 1G, even on your small 8G server, that suggests your
> setup is configured in the OS to restrict the RAM for mapping. RAM per
> process should be > real RAM (remember memory mapped files are used) or
> the VM is set up in some odd way. Or 32-bit Java.

Yeah sorry I was looking at shared memory. Right now resident memory is ~3.5GB 
and virtual ~5.5GB. The process started at 150K triples per second; now, after 
250M triples processed, it is at 50K triples/second and slowing down (processing 
batches of 25K). I don't know what to say, I think the conclusion is simply 
that tdbloader (any version) just doesn't work with large graphs on HDDs. So 
the only solution has to be to use an SSD, or find a way to split the graph 
into smaller stores, or simply give up.
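
To put the slowdown in perspective, here is a rough estimate of the time left if the rate simply held at its current value. It is a sketch under stated assumptions: the 2.2B total is only the figure quoted earlier in this thread, used as an illustrative target, and the function name is hypothetical.

```shell
# remaining_hours TOTAL DONE RATE
# Whole hours left (rounded up) to process TOTAL - DONE triples at a
# constant RATE in triples per second.
remaining_hours() {
    secs=$(( ($1 - $2 + $3 - 1) / $3 ))   # ceiling of remaining / rate
    echo $(( (secs + 3599) / 3600 ))      # ceiling of seconds / 3600
}

remaining_hours 2200000000 250000000 50000   # prints 11
```

Since the rate is still falling, this is a best case rather than a forecast.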

$ java -version
openjdk version "1.8.0_151"
OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-1~deb9u1-b12)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)

$ ulimit -a
-t: cpu time (seconds) unlimited
-f: file size (blocks) unlimited
-d: data seg size (kbytes) unlimited
-s: stack size (kbytes) 8192
-c: core file size (blocks) 0
-m: resident set size (kbytes) unlimited
-u: processes 31370
-n: file descriptors 1024
-l: locked-in-memory size (kbytes) unlimited
-v: address space (kbytes) unlimited
-x: file locks unlimited
-i: pending signals 31370
-q: bytes in POSIX msg queues 819200
-e: max nice 0
-r: max rt priority 95
-N 15: unlimited
