Hello all, I just wanted to give an update on how my loading of 670M triples was going, or in this case not going.
I ran riot --time --sink on my 6.3 GB nt.gz file and this was the result:

    14,128.77 sec  676,740,132 triples  47,898.03 TPS

According to this result it should finish in about 4 hours; however, I am still seeing the same slowdown as before, without any swapping. Here are the first and last few lines of the log file (showing only the number of triples processed and the average TPS):

    50,000       15,165
    100,000      21,886
    150,000      25,231
    200,000      27,074
    250,000      28,344
    300,000      29,620
    350,000      30,261
    400,000      31,063
    450,000      31,768
    500,000      31,752
    ...
    266,350,000   1,564
    266,400,000   1,551
    266,450,000   1,539
    266,500,000   1,522

My new machine (still a VM) has 64 GB RAM and four 2.3 GHz processors. Any ideas why I can't get a decent processing time? It seems I could load ~225M triples into each of three stores over the course of a day or two, but I would rather have one massive store, if that is even possible. Any ideas appreciated.

Thanks,
Josh

On Feb 26, 2013, at 2:02 PM, Andy Seaborne wrote:

> Joshua,
>
> If you're in a VM you have another layer trying to help and, in my
> experience, it does not. Sometimes, maliciously so. [*]
>
> And sharing a real machine can mean you are contending for real resources
> such as disk or memory bandwidth.
>
> In that setup, whatever you do, make sure the VM is not swapping on the
> real machine. That will make memory-mapped files perform very badly (a
> sort of double-swapping effect).
>
> Reading the N-triples file is unlikely to be the bottleneck. The VM should
> not make too much difference - only large chunks of streamed I/O are being
> done (though I have encountered VMs that seem to make even that costly).
>
> You can test this with:
>
>     riot --time --sink ... files ...
>
> On the machine I'm on currently, a 3-year-old desktop with 8 GB RAM and no
> SSD, I get 100K-150K triples/s for BSBM from .nt.gz files. BSBM isn't the
> fastest to parse, as it has some moderately long literals (or "more bytes
> to copy", as the parser sees it).
>
> A bit faster from plain .nt, but not enough to make it worth decompressing.
>
> ... just tried ...
>
> ~ >> riot --time --sink ~/Datasets/BSBM/bsbm-25m.nt.gz
> ... 179.50 sec  25,000,250 triples  139,280.26 TPS
>
> ~ >> gzip -d < ~/Datasets/BSBM/bsbm-25m.nt.gz > X.nt
> ~ >> riot --time --sink X.nt
> ... 168.74 sec  25,000,250 triples  148,157.53 TPS
>
> The parser streams ... only a ridiculous proportion of bNodes in a large
> dataset should cause it to slow down.
>
> Please let us know how it goes - it's all useful for building up a pool of
> experience.
>
> Andy
>
> [*] As you might guess, I've encountered various "issues" on VMs in the
> past due to under- or mis-provisioning for a database server. AWS is OK,
> subject to not getting one of the occasional duff machines that people
> report; I've had one of these once.
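
A side note on reading the log figures above: the loader prints the cumulative triple count and the running *average* TPS, which hides how far the instantaneous rate has fallen. As a rough sketch (assuming the two-column "triples  avg-TPS" format shown in the excerpt; `loader.log` is a hypothetical file name), the elapsed time can be reconstructed as triples/avgTPS and a per-interval rate derived from consecutive lines:

```shell
# Sketch: recover the instantaneous triples/sec from a loader log that
# prints "cumulative-triples  average-TPS" lines.
awk '{
    t = $1; gsub(/,/, "", t); t += 0   # cumulative triples so far
    r = $2; gsub(/,/, "", r); r += 0   # running average TPS
    e = t / r                          # elapsed seconds, reconstructed
    if (NR > 1)
        printf "%s  %.0f triples/s over the last interval\n", $1, (t - pt) / (e - pe)
    pt = t; pe = e
}' loader.log
```

Applied to the last two quoted checkpoints (266,450,000 at 1,539 avg TPS and 266,500,000 at 1,522), this gives roughly 25 triples/s for the final 50,000 triples - the running average greatly understates the slowdown at the tail.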
