Hello all, I just wanted to give an update on how my loading of 670M triples was going, or in this case not going.
I ran riot --time --sink on my 6.3 GB nt.gz file and this was the result:

    14,128.77 sec  676,740,132 triples  47,898.03 TPS

According to this result it should finish in about 4 hours; however, I am still seeing the same slowdown as before, without any swapping. Here are the first and last few lines of the log file (showing only the number of triples processed and the average TPS):

    50,000       15,165
    100,000      21,886
    150,000      25,231
    200,000      27,074
    250,000      28,344
    300,000      29,620
    350,000      30,261
    400,000      31,063
    450,000      31,768
    500,000      31,752
    ...
    266,350,000   1,564
    266,400,000   1,551
    266,450,000   1,539
    266,500,000   1,522

My new machine (still a VM) has 64 GB RAM and four 2.3 GHz processors. Any ideas why I can't get a decent processing time? It seems I could load ~225M triples into each of three stores over the course of a day or two, but I would rather have one massive store, if that is even possible. Any ideas appreciated.

Thanks,
Josh

On Feb 26, 2013, at 2:02 PM, Andy Seaborne wrote:

> Joshua,
>
> If you're in a VM you have another layer trying to help and, in my
> experience, it does not. Sometimes, maliciously so. [*]
>
> And sharing a real machine can mean you are contending for real resources
> such as disk or memory bandwidth.
>
> In that setup, whatever you do, make sure the VM is not swapping on the
> real machine. That will make memory-mapped files perform very badly (a
> sort of double-swapping effect).
>
> Reading the N-triples file is unlikely to be the bottleneck. The VM should
> not make too much difference - only large chunks of streamed I/O are being
> done (though I have encountered VMs that seem to make even that costly).
>
> You can test this with:
>
>     riot --time --sink ... files ...
>
> On the machine I'm on currently, a 3-year-old desktop with 8 GB RAM and no
> SSD, I get 100K-150K triples/s for BSBM from .nt.gz files. BSBM isn't the
> fastest to parse, as it has some moderately long literals (or "more bytes
> to copy", as the parser sees it).
>
> A bit faster from plain .nt, but not enough to make it worth decompressing.
>
> ... just tried ...
>
> ~ >> riot --time --sink ~/Datasets/BSBM/bsbm-25m.nt.gz
> ... 179.50 sec  25,000,250 triples  139,280.26 TPS
>
> ~ >> gzip -d < ~/Datasets/BSBM/bsbm-25m.nt.gz > X.nt
> ~ >> riot --time --sink X.nt
> ... 168.74 sec  25,000,250 triples  148,157.53 TPS
>
> The parser streams ... only a ridiculous proportion of bNodes in a large
> dataset should cause it to slow down.
>
> Please let us know how it goes - it's all useful for building up a pool of
> experience.
>
> Andy
>
> [*] As you might guess, I've encountered various "issues" on VMs in the
> past due to under- or mis-provisioning for a database server. AWS is OK,
> subject to not getting one of the occasional duff machines that people
> report; I've had one of these once.
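
A side note on reading the log figures above: the loader prints the cumulative triple count and the running *average* TPS, which hides how far the instantaneous rate has fallen. As a rough sketch (assuming the two-column "triples  avg-TPS" format shown in the excerpt; `loader.log` is a hypothetical file name), the elapsed time can be reconstructed as triples/avgTPS and a per-interval rate derived from consecutive lines:

```shell
# Sketch: recover the instantaneous triples/sec from a loader log that
# prints "cumulative-triples  average-TPS" lines.
awk '{
    t = $1; gsub(/,/, "", t); t += 0   # cumulative triples so far
    r = $2; gsub(/,/, "", r); r += 0   # running average TPS
    e = t / r                          # elapsed seconds, reconstructed
    if (NR > 1)
        printf "%s  %.0f triples/s over the last interval\n", $1, (t - pt) / (e - pe)
    pt = t; pe = e
}' loader.log
```

Applied to the last two quoted checkpoints (266,450,000 at 1,539 avg TPS and 266,500,000 at 1,522), this gives roughly 25 triples/s for the final 50,000 triples - the running average greatly understates the slowdown at the tail.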
