Yes, there are a lot of large literals. I am using tdbloader.
The OS is Red Hat 6.3. The hardware is definitely shared, but I am not sure
what the other applications are; I will have to ask the sys admin. If memory
serves, the hardware has something like 90-ish GB of RAM, so I currently have
the lion's share. All the RAM was being used by tdbloader at the time. I will
also have to inquire about a possible ulimit on memory-mapped files.

I don't think I mentioned this, but I gave 3200M each as the JVM_ARGS (Xms
and Xmx). Before that I tried giving it 60G, but it started swapping on the
GC. Maybe there is a happy medium?

Thanks again for the pointers and advice.

- Josh

On Mar 29, 2013, at 4:50 AM, Andy Seaborne wrote:

> On 27/03/13 20:46, David Jordan wrote:
>>
>> What indexes exist during the load?
>
> SPO, POS, OSP -- it's a fixed set for TDB (OK, it is changeable, but the
> facility is not exposed).
>
> More below ...
>
>>
>> On Mar 27, 2013, at 4:40 PM, Joshua Greben wrote:
>>
>>> Hello all,
>>>
>>> I just wanted to give an update on how my loading of 670M triples was
>>> going, or in this case not going.
>>>
>>> I ran the riot --time --sink script on my 6.3 GB nt.gz file and this
>>> was the result:
>>>   14,128.77 sec  676,740,132 triples  47,898.03 TPS
>
> 47K is slowish - are there lots of large literals?
>
> But it may reflect the fact that I/O on the VM is slow.
>
> (you may have answered this - the gap in the thread reflects a gap in my
> memory :-)
>
>>>
>>> According to this result it should finish in 4 hours, however I am
>>> still seeing the same slowdown as before, without any swapping. Here
>>> are the first and last few lines of the log file (with only the number
>>> of triples processed and the TPS displayed):
>>>
>
> Which loader are you trying?
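[Editor's note: a minimal sketch of the heap setting discussed above,
assuming the Jena distribution's tdbloader script honours a JVM_ARGS
environment variable (the path and file name below are placeholders). The
idea is a modest fixed heap, since TDB's indexes live in memory-mapped files
served by the OS page cache rather than the Java heap - which is why the 60G
heap made GC worse, not better.]

```shell
# Modest fixed heap for tdbloader; the rest of RAM is left to the OS page
# cache, which is what TDB's memory-mapped index files actually use.
# 3200M is the figure used in this thread; tune for your machine.
export JVM_ARGS="-Xms3200M -Xmx3200M"

# Hypothetical invocation (placeholder paths):
# tdbloader --loc /path/to/TDB data.nt.gz
```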
>
>>>      50,000   15,165
>>>     100,000   21,886
>>>     150,000   25,231
>>>     200,000   27,074
>>>     250,000   28,344
>>>     300,000   29,620
>>>     350,000   30,261
>>>     400,000   31,063
>>>     450,000   31,768
>>>     500,000   31,752
>
> That's all quite slow - for a common distribution of triples, it can be
> going at 70-80K at this point.
>
>>> ...
>>> 266,350,000    1,564
>>> 266,400,000    1,551
>>> 266,450,000    1,539
>>> 266,500,000    1,522
>>>
>>>
>>> My new machine (still a VM) has 64GB RAM and uses 4 2.3GHz processors.
>>> Any ideas why I can't get a decent processing time? It seems like I
>>> would be able to load ~225M triples into three stores over the course
>>> of a day or two, but I would rather have one massive store if that is
>>> even possible.
>
> That is surprisingly slow. I do loads of 200M-300M size on a smaller box.
>
> What's the OS? Is the hardware shared? (yes, it happens - I've seen
> setups of 8G VMs ... 4 on an 8G box). And is the VM on hardware with more
> than 64G?
>
> Is all the RAM getting used? Some Linux setups ulimit the space allowed
> for memory-mapped files, so you can give all the RAM you like but it does
> not get used.
>
>     Andy
>
>
>>>
>>> Any ideas appreciated.
>>>
>>> Thanks
>>>
>>> - Josh
>>>
>>> On Feb 26, 2013, at 2:02 PM, Andy Seaborne wrote:
>>>
>>>> Joshua,
>>>>
>>>> If you're in a VM you have another layer trying to help and, in my
>>>> experience, it does not. Sometimes, maliciously so. [*]
>>>>
>>>> And sharing a real machine can mean you are contending for real
>>>> resources like disk or memory bandwidth.
>>>>
>>>> In that setup, whatever you do, make sure the VM is not swapping on
>>>> the real machine. That will make memory-mapped files very bad (a sort
>>>> of double-swapping effect).
>>>>
>>>> Reading the N-triples file is unlikely to be the bottleneck. The VM
>>>> should not make too much difference - only large chunks of streamed
>>>> read I/O are being done (though I have encountered VMs that seem to
>>>> make that a cost).
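[Editor's note: two quick checks for the OS limits mentioned above, i.e.
things that can stop memory-mapped files from using all the RAM you give
the machine. A hedged sketch - the exact limits that apply depend on the
distribution and the sys admin's settings.]

```shell
# Cap on total virtual address space (mapped files count against it);
# "unlimited" is what you want for a TDB load.
ulimit -v

# Linux-only: per-process cap on the number of distinct memory mappings
# (commonly 65530 by default). Absent on non-Linux systems, hence the 2>.
cat /proc/sys/vm/max_map_count 2>/dev/null
```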
>>>>
>>>> You can test this with
>>>>
>>>>     riot --time --sink ... files ...
>>>>
>>>> On the machine I'm on currently, a 3y old desktop, 8G RAM, no SSD, I
>>>> get 100K-150K triples/s for BSBM from .nt.gz files. BSBM isn't the
>>>> fastest to parse as it has some moderately long literals (or "more
>>>> bytes to copy" as the parser sees it).
>>>>
>>>> A bit faster from .nt but not enough to make it worth decompressing.
>>>>
>>>> .. just tried ...
>>>>
>>>> ~ >> riot --time --sink ~/Datasets/BSBM/bsbm-25m.nt.gz
>>>> ... 179.50 sec  25,000,250 triples  139,280.26 TPS
>>>>
>>>> ~ >> gzip -d < ~/Datasets/BSBM/bsbm-25m.nt.gz > X.nt
>>>> ~ >> riot --time --sink X.nt
>>>> ... 168.74 sec  25,000,250 triples  148,157.53 TPS
>>>>
>>>> The parser streams ... only a ridiculous proportion of bNodes in a
>>>> large dataset should cause it to slow down.
>>>>
>>>> Please let us know how it goes - it's all useful for building up a
>>>> pool of experiences.
>>>>
>>>>     Andy
>>>>
>>>> [*] As you might guess, I've encountered various "issues" on VMs in
>>>> the past due to under- or mis-provisioning for a database server. AWS
>>>> is OK, subject to not getting one of the occasional duff machines that
>>>> people report; I've had one of these once.
>>>>
>>>>
>>>
>>
>
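[Editor's note: the TPS figures riot prints are just the triple count
divided by elapsed seconds. A tiny shell helper (hypothetical, not part of
riot) reproduces the arithmetic from the numbers in this thread; the small
differences from riot's own output come from riot using unrounded timings.]

```shell
# Triples-per-second from a triple count and elapsed seconds, as reported
# by "riot --time --sink". Inputs below are the figures from this thread.
tps() { awk -v n="$1" -v s="$2" 'BEGIN { printf "%.0f\n", n / s }'; }

tps 25000250 179.50     # -> 139277 (riot reported 139,280.26)
tps 676740132 14128.77  # -> 47898, i.e. roughly 4 hours for 670M triples
```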
