Your IO doesn't know whether it's coming or going! You're reading from a 250GB file whilst writing to two .tmp files and the id-to-node files. Then you're reading data-triple.tmp to sort it, which writes to tmp whilst chewing RAM because the file is too big to sort in memory, and then writing the sorted file only to read it back whilst writing the index files. Repeat three times.

Your HDD heads can only be in one place at a time, and I suspect you've got at most 128MB of cache on the drive. The queues on the drive will go through the roof, and if the OS decides to page it'll be properly screwed! As an analogy, an SSD can service deep queues because it can be in more than one place at a time.

As a start, stick the 250GB file on a USB drive to get that read load off the internal IO. The loader works on HDDs; you just need to be a little smart about the limits of the hardware you're using, and laptops are not known for their IO chipsets. Even my Dell M3800, which is supposed to be a workstation-grade laptop, has one drive and an external SATA connection to help out.
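
If you want to see the queue build-up for yourself, here is a minimal sketch, assuming Linux and Python 3. It reads /proc/diskstats and reports the "I/Os currently in progress" counter for one device. The device name "sda" and the function names in_flight and watch are placeholders of mine, not anything from the loader.

```python
import time
from pathlib import Path

DISKSTATS = Path("/proc/diskstats")  # Linux only


def in_flight(diskstats_text, device):
    """Return the number of I/Os currently in progress for `device`, or None."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Layout: major, minor, device name, then the per-device counters.
        # The 9th counter (index 11) is "I/Os currently in progress".
        if len(fields) > 11 and fields[2] == device:
            return int(fields[11])
    return None  # device not listed


def watch(device="sda", interval=1.0, samples=10):
    """Print the in-flight count for `device` once per `interval` seconds."""
    for _ in range(samples):
        print(device, in_flight(DISKSTATS.read_text(), device))
        time.sleep(interval)
```

Calling watch("sda") in a second terminal while the load runs prints one sample per second. A count that stays high on a single HDD would be consistent with the drive being the bottleneck, though it does not prove it on its own.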
Dick

-------- Original message --------
From: Laura Morales <[email protected]>
Date: 14/12/2017 20:09 (GMT+00:00)
To: jena-users-ml <[email protected]>
Subject: Re: Report on loading wikidata (errata)

ERRATA:

> I don't know why then. Maybe SSD is making all the difference. Try to load it
> (or "latest-all") on a comparable machine using a single SATA disk instead of
> SSD.

s/SATA/HDD

----------------------------

> I loaded 2.2B on a 16G machine which wasn't even server class (i.e. it's
> I/O path to SSD isn't very quick).

I don't know why then. Maybe SSD is making all the difference. Try to load it (or "latest-all") on a comparable machine using a single SATA disk instead of SSD. Around 100-150M triples my computer slows down significantly, and it only gets slower from there. All I know is that it's either because of too little RAM, or because the disk can't keep up.

> If RAM really is at 1G , even on your small 8G server, suggests your
> setup is configured in the OS to restrict the RAM for mapping. RAM per
> process should be > real RAM (remember memory mapped files are used) or
> the VM is setup in some odd way. Or 32bit java.

Yeah, sorry, I was looking at shared memory. Right now resident memory is ~3.5GB and virtual ~5.5GB. The process started at 150K triples per second; now, after 250M triples processed, it is at 50K triples/second and slowing down (processing batches of 25K). I don't know what to say. I think the conclusion is simply that tdbloader (any version) just doesn't work with large graphs on HDDs. So the only solution has to be to use an SSD, or find a way to split the graph into smaller stores, or simply give up.

$ java -version
openjdk version "1.8.0_151"
OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-1~deb9u1-b12)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)

$ ulimit -a
-t: cpu time (seconds)              unlimited
-f: file size (blocks)              unlimited
-d: data seg size (kbytes)          unlimited
-s: stack size (kbytes)             8192
-c: core file size (blocks)         0
-m: resident set size (kbytes)      unlimited
-u: processes                       31370
-n: file descriptors                1024
-l: locked-in-memory size (kbytes)  unlimited
-v: address space (kbytes)          unlimited
-x: file locks                      unlimited
-i: pending signals                 31370
-q: bytes in POSIX msg queues       819200
-e: max nice                        0
-r: max rt priority                 95
-N 15:                              unlimited
