What indexes exist during the load?
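(For a stock TDB setup the triple indexes are SPO, POS and OSP.  One quick,
informal way to see what is being built - assuming, purely as an example, that
the database directory is ./DB - is just to list it:

  ls -lh ./DB    # ./DB is a placeholder for your TDB database directory

and watch which index files appear and grow as the load runs.)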

On Mar 27, 2013, at 4:40 PM, Joshua Greben wrote:

> Hello all,
> 
> I just wanted to give an update on how my loading of 670M triples was going, 
> or in this case not going.
> 
> I ran the riot --time --sink command on my 6.3 GB .nt.gz file and this was the 
> result:
> 14,128.77 sec  676,740,132 triples  47,898.03 TPS
> 
> According to this result it should finish in about 4 hours; however, I am still 
> seeing the same slowdown as before, without any swapping. Here are the first 
> and last few lines of the log file (with only the number of triples processed 
> and the TPS displayed):
> 
> 50,000 15,165
> 100,000 21,886
> 150,000 25,231
> 200,000 27,074
> 250,000 28,344
> 300,000 29,620
> 350,000 30,261
> 400,000 31,063
> 450,000 31,768
> 500,000 31,752
> ...
> 266,350,000 1,564
> 266,400,000 1,551
> 266,450,000 1,539
> 266,500,000 1,522
> 
> 
> My new machine (still a VM) has 64 GB RAM and four 2.3 GHz processors. Any 
> ideas why I can't get a decent processing time? It seems like I would be able 
> to load ~225M triples into three stores over the course of a day or two, but 
> I would rather be able to have one massive store if that is even possible.
> 
> Any ideas appreciated.
> 
> Thanks
> 
> - Josh
> 
> On Feb 26, 2013, at 2:02 PM, Andy Seaborne wrote:
> 
>> Joshua,
>> 
>> If you're in a VM you have another layer trying to help and, in my 
>> experience, it does not.  Sometimes, maliciously so. [*]
>> 
>> And sharing a real machine can mean you are contending for real resources 
>> like disk or memory bandwidth.
>> 
>> In that setup, whatever you do, make sure the VM is not swapping on the real 
>> machine.  That will make memory-mapped files perform very badly (a sort of 
>> double-swapping effect).
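>> 
>> (A quick way to check is to watch the host with something like:
>> 
>>   vmstat 5
>> 
>> and make sure the si/so (swap-in/swap-out) columns stay at or near zero 
>> while the load is running.)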
>> 
>> Reading the N-triples file is unlikely to be the bottleneck.  The VM should 
>> not make too much difference - only large chunks of streaming read I/O are 
>> being done (though I have encountered VMs that seem to make that a cost).
>> 
>> You can test this with
>> 
>> riot --time --sink ... files ...
>> 
>> On the machine I'm on currently, a 3-year-old desktop with 8 GB RAM and no 
>> SSD, I get 100K-150K triples/s for BSBM from .nt.gz files.  BSBM isn't the 
>> fastest to parse as it has some moderately long literals (or "more bytes to 
>> copy" as the parser sees it).
>> 
>> A bit faster from .nt but not enough to make it worth decompressing.
>> 
>> .. just tried ...
>> 
>> ~ >> riot --time --sink ~/Datasets/BSBM/bsbm-25m.nt.gz
>> ... 179.50 sec  25,000,250 triples  139,280.26 TPS
>> 
>> ~ >> gzip -d < ~/Datasets/BSBM/bsbm-25m.nt.gz > X.nt
>> ~ >> riot --time --sink X.nt
>> ... 168.74 sec  25,000,250 triples  148,157.53 TPS
>> 
>> The parser streams ... only a ridiculous proportion of bNodes in large 
>> datasets should cause it to slow down.
>> 
>> Please let us know how it goes - it's all useful to build up a pool of 
>> experiences.
>> 
>>      Andy
>> 
>> [*] As you might guess, I've encountered various "issues" on VMs in the past 
>> due to under- or mis-provisioning for a database server.  AWS is OK, subject 
>> to not getting one of the occasional duff machines that people report; I've 
>> had one of these once.
>> 
>> 
> 
