What indexes exist during the load?
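
If you want to check while the load is running, listing the database directory shows which index files have been created and how quickly they are growing. A rough sketch, assuming a default TDB on-disk layout and using /data/tdb as a placeholder for your database location:

  # /data/tdb is a placeholder for the TDB database directory
  ls -lh /data/tdb

  # or re-run the listing every 60 seconds to watch the files grow
  watch -n 60 'ls -lh /data/tdb'
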
On Mar 27, 2013, at 4:40 PM, Joshua Greben wrote:

> Hello all,
>
> I just wanted to give an update on how my loading of 670M triples was
> going, or in this case not going.
>
> I ran riot --time --sink on my 6.3 GB .nt.gz file and this was the
> result:
>
>   14,128.77 sec  676,740,132 triples  47,898.03 TPS
>
> According to this result it should finish in about 4 hours; however, I am
> still seeing the same slowdown as before, without any swapping. Here are
> the first and last few lines of the log file (showing only the number of
> triples processed and the TPS):
>
>        50,000   15,165
>       100,000   21,886
>       150,000   25,231
>       200,000   27,074
>       250,000   28,344
>       300,000   29,620
>       350,000   30,261
>       400,000   31,063
>       450,000   31,768
>       500,000   31,752
>       ...
>   266,350,000    1,564
>   266,400,000    1,551
>   266,450,000    1,539
>   266,500,000    1,522
>
> My new machine (still a VM) has 64 GB RAM and four 2.3 GHz processors.
> Any ideas why I can't get a decent processing time? It seems like I would
> be able to load ~225M triples into each of three stores over the course
> of a day or two, but I would rather have one massive store if that is
> even possible.
>
> Any ideas appreciated.
>
> Thanks
>
> - Josh
>
> On Feb 26, 2013, at 2:02 PM, Andy Seaborne wrote:
>
>> Joshua,
>>
>> If you're in a VM you have another layer trying to help and, in my
>> experience, it does not. Sometimes, maliciously so. [*]
>>
>> And sharing a real machine can mean you are contending for real
>> resources like disk or memory bandwidth.
>>
>> In that setup, whatever you do, make sure the VM is not swapping on the
>> real machine. That will make memory-mapped files perform very badly (a
>> sort of double-swapping effect).
>>
>> Reading the N-triples file is unlikely to be the bottleneck. The VM
>> should not make too much difference - only large chunks of streamed read
>> I/O are being done (though I have encountered VMs that seem to make even
>> that a cost).
>>
>> You can test this with
>>
>>   riot --time --sink ... files ...
>>
>> On the machine I'm on currently, a 3-year-old desktop with 8 GB RAM and
>> no SSD, I get 100K-150K triples/s on BSBM from .nt.gz files. BSBM isn't
>> the fastest to parse as it has some moderately long literals (or "more
>> bytes to copy" as the parser sees it).
>>
>> A bit faster from .nt, but not enough to make it worth decompressing.
>>
>> .. just tried ...
>>
>>   riot --time --sink ~/Datasets/BSBM/bsbm-25m.nt.gz
>>   ... 179.50 sec  25,000,250 triples  139,280.26 TPS
>>
>>   gzip -d < ~/Datasets/BSBM/bsbm-25m.nt.gz > X.nt
>>   riot --time --sink X.nt
>>   ... 168.74 sec  25,000,250 triples  148,157.53 TPS
>>
>> The parser streams ... only a ridiculous proportion of bNodes in large
>> datasets should cause it to slow down.
>>
>> Please let us know how it goes - it's all useful to build up a pool of
>> experiences.
>>
>>     Andy
>>
>> [*] As you might guess, I've encountered various "issues" on VMs in the
>> past due to under- or mis-provisioning for a database server. AWS is OK,
>> subject to not getting one of the occasional duff machines that people
>> report; I've had one of these once.
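
A quick way to double-check that the VM really isn't swapping while the load runs (a generic check, nothing TDB-specific; the column names are the standard procps vmstat ones) is to watch the swap-in/swap-out columns:

  # si/so should stay at (or very near) 0 throughout the load
  vmstat 5

  # swap "used" should not be growing between samples
  free -m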
