On 27/03/13 20:46, David Jordan wrote:

What indexes exist during the load?

SPO, POS, OSP -- it's a fixed set for TDB (OK, it is changeable, but the facility is not exposed).
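For reference, each index is stored as its own set of B+tree files in the database directory, so you can see which indexes exist on disk with something like the following (the /data/tdb path is just a placeholder):

    # list the per-index files; expect names along the lines of SPO.dat/SPO.idn, POS.dat, OSP.dat
    ls /data/tdb | grep -E '^(SPO|POS|OSP)'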

More below ...


On Mar 27, 2013, at 4:40 PM, Joshua Greben wrote:

Hello all,

I just wanted to give an update on how my loading of 670M triples was going, or 
in this case not going.

I ran the riot --time --sink script on my 6.3 GB nt.gz file and this was the 
result:
14,128.77 sec  676,740,132 triples  47,898.03 TPS

47K is slowish - are there lots of large literals?

But it may reflect the fact that I/O on the VM is slow.

(you may have answered this - the gap in the thread reflects a gap in my memory :-)


According to this result it should finish in about 4 hours; however, I am still 
seeing the same slowdown as before, without any swapping. Here are the first and last 
few lines of the log file (with only the number of triples processed and the 
TPS displayed):


Which loader are you trying?
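For a file this size the usual candidates are tdbloader and tdbloader2; the invocations look roughly like the following (the directory and file names here are illustrative only):

    # in-JVM bulk loader
    tdbloader --loc /data/tdb data.nt.gz

    # sort-based loader, Unix only; often better for very large inputs
    tdbloader2 --loc /data/tdb data.nt.gz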

50,000 15,165
100,000 21,886
150,000 25,231
200,000 27,074
250,000 28,344
300,000 29,620
350,000 30,261
400,000 31,063
450,000 31,768
500,000 31,752

That's all quite slow - for a common distribution of triples, it can be going at 70-80K at this point.

...
266,350,000 1,564
266,400,000 1,551
266,450,000 1,539
266,500,000 1,522


My new machine (still a VM) has 64GB RAM and four 2.3 GHz processors. Any 
ideas why I can't get a decent processing time? It seems like I would be able 
to load ~225M triples into three stores over the course of a day or two, but I 
would rather be able to have one massive store if that is even possible.

That is surprisingly slow.  I do loads of 200M-300M size on a smaller box.

What's the OS? Is the hardware shared? (Yes, it happens - I've seen setups of 8G VMs, 4 of them on an 8G box.) And is the VM on hardware with more than 64G?

Is all the RAM getting used? Some Linux setups ulimit the space allowed for memory-mapped files, so you can give it all the RAM you like but it does not get used.
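A quick way to check from inside the VM, using nothing TDB-specific, just the standard tools:

    # 'virtual memory' and 'max memory size' should be unlimited for mmap-heavy use
    ulimit -a

    # confirm the guest actually sees the 64G, and watch the cached figure grow during a load
    free -g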

        Andy



Any ideas appreciated.

Thanks

- Josh

On Feb 26, 2013, at 2:02 PM, Andy Seaborne wrote:

Joshua,

If you're in a VM you have another layer trying to help and, in my experience, 
it does not.  Sometimes, maliciously so. [*]

And sharing a real machine can mean you are contending for real resources like 
disk or memory bandwidth.

In that setup, whatever you do, make sure the VM is not swapping on the real 
machine.  That will make memory-mapped files perform very badly (a sort of 
double-swapping effect).
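A simple way to spot that is to watch the swap-in/swap-out columns while the load runs (run this on the host if you have access to it, otherwise in the guest):

    # si/so should stay at or near zero; sustained non-zero values mean swapping
    vmstat 5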

Reading the N-triples file is unlikely to be the bottleneck.  The VM should 
not make too much difference - only large chunks of streamed read I/O are being 
done (though I have encountered VMs that seem to make that a cost).

You can test this with

riot --time --sink ... files ...

On the machine I'm on currently, a 3-year-old desktop with 8G RAM and no SSD, I get 
100K-150K triples/s for BSBM from .nt.gz files.  BSBM isn't the fastest to parse as 
it has some moderately long literals (or "more bytes to copy" as the parser sees it).

It's a bit faster from plain .nt, but not enough to make it worth decompressing.

.. just tried ...

~ >> riot --time --sink ~/Datasets/BSBM/bsbm-25m.nt.gz
... 179.50 sec  25,000,250 triples  139,280.26 TPS

~ >> gzip -d < ~/Datasets/BSBM/bsbm-25m.nt.gz > X.nt
~ >> riot --time --sink X.nt
... 168.74 sec  25,000,250 triples  148,157.53 TPS

The parser streams; only a ridiculous proportion of bNodes in a large dataset 
should cause it to slow down.

Please let us know how it goes - it's all useful to build up a pool of 
experiences.

        Andy

[*] as you might guess, I've encountered various "issues" on VMs in the past 
due to under- or mis-provisioning for a database server. AWS is OK, subject to not 
getting one of the occasional duff machines that people report; I've had one of those once.




