On 26/11/17 08:05, Laura Morales wrote:
Experiment needed - on a normal commodity server or a laptop, the
limiting factor may not be the CPU, or not only the CPU. The system bus
(moving data around) and the persistent storage may also be limitations.

On my computer, creating a TDB2 store with tdb2.tdbloader from a .nt file 1.1G 
in size:

- read from disk, write to disk: 35K triples per second (as reported by 
tdb2.tdbloader AVG value)

TDB2 is, at the moment, much better with an SSD.

As I've said, the TDB2 loader is simplistic and crude.

It is not even at the level of the TDB1 tdbloader.

- read from disk, write to tmpfs: 43K AVG triples per second
- read from tmpfs, write to tmpfs: 45K AVG triples per second

Adding -Xmx3G to the "java" command in tdb2.tdbloader didn't seem to have any 
effect.

It won't. File caching is outside the heap.

On the other hand, I have a thread constantly at 100%.
Creating the store with the "--graph" argument seems significantly slower than 
without such argument. I've only tested this for the first case (read disk, write disk) 
and tdb2.tdbloader reports about 50K AVG triples per second

Yes - more indexes for named graphs. The default setup is skewed for general query use, not for load.

It's not a deep statistic but... if the reported AVG numbers are correct then I 
guess a slow CPU is the bottleneck for tdb2?

Or memory bandwidth.
And for a rotating disk, a better write order would help.

I guess the largest improvement would probably be adding multi-threading to 
tdb2.tdbloader, considering that servers can have more than 10 or 20 cores. 
Either this or map-reduce.

You need to factor in the NodeIds. Each RDF term must always map to the same NodeId.

At the moment, they are incrementally allocated and give the location in a file. This is not parallelizable.

They could be hashes (see TDB2 96-bit ids, or maybe longer), which is parallelizable, but you then need a hash-to-file-location index.
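
To make the difference between the two NodeId schemes concrete, here is a rough sketch in Python. It is not Jena code: the class and function names are made up for illustration, the node file is simulated by a running byte offset, and SHA-256 truncated to 96 bits is just an assumed hash choice.

```python
import hashlib


class SequentialNodeTable:
    """Hypothetical table: the id is the term's offset in an append-only node file."""

    def __init__(self):
        self._ids = {}          # term -> id (offset)
        self._next_offset = 0   # next free byte offset in the simulated file

    def get_or_allocate(self, term):
        node_id = self._ids.get(term)
        if node_id is None:
            # The id depends on everything allocated before it, so
            # allocation has to happen in one sequential stream.
            node_id = self._next_offset
            self._ids[term] = node_id
            self._next_offset += len(term.encode("utf-8"))
        return node_id


def hashed_node_id(term):
    """96-bit id computed from the term alone (assumed: truncated SHA-256)."""
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:12], "big")


class HashedNodeTable:
    """Hypothetical table: ids are hashes, so an id -> file offset index is needed."""

    def __init__(self):
        self._location = {}     # id -> offset in the simulated node file
        self._next_offset = 0

    def add(self, term):
        node_id = hashed_node_id(term)  # any worker can compute this independently
        if node_id not in self._location:
            self._location[node_id] = self._next_offset
            self._next_offset += len(term.encode("utf-8"))
        return node_id

    def location(self, node_id):
        return self._location[node_id]
```

With the sequential table, loading the same terms in a different order gives different ids, which is why the work cannot simply be split across threads. With hashed ids the id is order-independent, at the cost of the extra lookup from id to file location (and, in a real system, some handling of hash collisions, which this sketch ignores).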

tdbloader (loader1) has, or had, a parallel mode. When I last used it, the gain was small, suggesting that the system bus (or memory channel bandwidth) is a factor. Redoing that for modern server CPUs would be interesting.

It was parallel on building the secondary indexes - a single sequential pass to build the primary SPO (and have NodeId allocation) then create the secondary indexes.
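
The shape of that two-phase load can be sketched as below. This is only an illustration in Python, not the loader1 implementation: the function names are invented, the indexes are plain sorted lists rather than B+Trees, and POS and OSP are assumed as the secondary orderings.

```python
from concurrent.futures import ThreadPoolExecutor

# Column orders for the assumed secondary indexes, as positions in an SPO tuple.
POS = (1, 2, 0)
OSP = (2, 0, 1)


def build_primary(triples, node_ids):
    """Single sequential pass: allocate NodeIds and build the SPO index."""
    spo = []
    for triple in triples:
        # setdefault hands out the next id the first time a term is seen.
        spo.append(tuple(node_ids.setdefault(term, len(node_ids)) for term in triple))
    spo.sort()
    return spo


def build_secondary(spo, order):
    """Re-order every SPO tuple and sort; only reads the finished primary index."""
    return sorted(tuple(t[i] for i in order) for t in spo)


def load(triples):
    node_ids = {}
    spo = build_primary(triples, node_ids)      # phase 1: sequential
    with ThreadPoolExecutor(max_workers=2) as pool:
        pos_future = pool.submit(build_secondary, spo, POS)   # phase 2: one task
        osp_future = pool.submit(build_secondary, spo, OSP)   # per secondary index
        return node_ids, spo, pos_future.result(), osp_future.result()
```

Note that CPython threads will not give a real CPU speed-up here; the point is only the structure: NodeId allocation stays in one sequential pass, and the secondary indexes are independent of each other once SPO exists. Every secondary build still has to stream the whole primary index through memory, which fits the observation that bandwidth rather than core count can be the limit.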

map-reduce has high overhead and needs a cluster.

TDB3 ...

    Andy
