Hi Olivier,
Which loader are you using? You're right: once the data starts spilling
to disk, things slow down. Do you have the log file?
On a smaller dataset, using tdbloader2 on an m2.2xlarge (34G RAM) [I
happen to have access to the machine rather than allocating one for a
special test], I recently got:
Finance (COINS): 417,908,490 triples in 12,406s (3 hours, 26 min)
=> 33KTPS
with the initial stage proceeding at 150KTPS.
It does not even use all of RAM.
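
For reference, the overall figure above can be re-derived from the triple
count and the elapsed time. This is only a small sketch of the arithmetic;
the function names are made up for illustration and are not part of any
TDB tool.

```python
def load_rate(triples: int, seconds: float) -> float:
    """Average load rate in triples per second."""
    return triples / seconds


def format_duration(seconds: int) -> str:
    """Render a number of seconds as hours, minutes and seconds."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}h {minutes}min {secs}s"


# Figures quoted above for the Finance (COINS) load.
triples = 417_908_490
elapsed = 12_406

rate = load_rate(triples, elapsed)
print(f"{rate / 1000:.1f} KTPS over {format_duration(elapsed)}")
# prints: 33.7 KTPS over 3h 26min 46s
```

Note this is the average over the whole run; the initial stage runs faster
and the index stage slower, so it should not be extrapolated linearly to a
larger dataset.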
ulimit -d and -m must be "unlimited"; some kernels also seem to limit
the amount of mapped memory a process is allowed.
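
A quick way to see what limits a process would inherit is to read them
programmatically. The sketch below uses Python's standard resource module
(Unix only) and reports the limits of the Python process itself, so it is
only meaningful if run from the same shell and user that will start the
loader. Including the address-space limit (ulimit -v) is my assumption
about what may bound mapped memory; it is not something confirmed above.

```python
import resource


def describe(limit: int) -> str:
    """Show a limit the way ulimit does: 'unlimited' or a number (bytes here)."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


def check_limits() -> dict:
    """Return {label: (soft, hard)} for the limits relevant to a large load."""
    wanted = {
        "data segment (ulimit -d)": "RLIMIT_DATA",
        "resident set (ulimit -m)": "RLIMIT_RSS",
        # Assumption: address space may also bound memory-mapped files.
        "address space (ulimit -v)": "RLIMIT_AS",
    }
    report = {}
    for label, name in wanted.items():
        const = getattr(resource, name, None)
        if const is None:
            continue  # this limit is not defined on the current platform
        soft, hard = resource.getrlimit(const)
        report[label] = (describe(soft), describe(hard))
    return report


for label, (soft, hard) in check_limits().items():
    print(f"{label}: soft={soft} hard={hard}")
```

Anything other than "unlimited" for the soft value of -d or -m would be
worth raising before starting a large load.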
We also have tdbloader3 - that needs bedding down but does do parallel
sorting. It requires tuning to use it at scale; the defaults are too
small.
It would be interesting to add more concurrency to the index creation
stage of tdbloader2 when running on an SSD. On a single plain HDD,
parallel writes can cause disk head thrashing as two or more processes
attempt to write to the disk (more spindles would help).
Paolo has in the past looked at MapReduce jobs for very large scale
loading. Paolo?
Andy
On 20/07/12 17:26, Olivier Rossel wrote:
Hi all.
Amazon cloud used to provide a high-end solution: 8 cores/64GB RAM/1TB HDD.
I tried to load DBPedia into TDB with this solution, but performance
is "bad" as soon as the 64GB of RAM
is not enough to hold the indexes. Swap on disk is then used, and HDD
performance is "bad".
So it takes several hours (days?) to load DBPedia.
(Honestly, I gave up.)
Now Amazon cloud has upgraded its high-end solution: 8 cores/64GB RAM/1TB SSD.
The SSD option seems to be EXTREMELY fast compared with the previous HDD option.
I am wondering whether this SSD option can bring the loading of DBPedia
below (let's say) 4h?
Did anyone try it?
Or maybe you know of a pay-per-hour cloud solution with a LOT of RAM
(let's say 256GB), so
TDB never has to swap to disk?
Any opinion or idea about all that?