I don't know much about TDB's architecture, but I can still describe a
couple of general things.

One of the biggest factors in both indexing speed and the size of the
resulting store is the shape of the data. Some data sets have many
unique resources, meaning that there are lots of distinct URIs and
string literals. Other data sets can have many more triples, but each
URI and string is re-used a lot. The second kind is both faster to
index and results in a much smaller index.
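
To make "shape" a bit more concrete, here is a rough Python sketch. It
assumes a store that keeps one dictionary entry per distinct term (a
common design; I have not checked how TDB actually does it). The
function name shape_stats and the example.org data are made up purely
for illustration.

```python
def shape_stats(triples):
    """Return (triple_count, unique_term_count, dictionary_bytes)
    for an iterable of (subject, predicate, object) strings."""
    terms = set()
    count = 0
    for s, p, o in triples:
        count += 1
        terms.update((s, p, o))
    # Assumption: each distinct term is stored once, so the dictionary
    # grows with the number of unique terms, not the number of triples.
    dict_bytes = sum(len(t.encode("utf-8")) for t in terms)
    return count, len(terms), dict_bytes


# Hypothetical data set with many unique resources and literals.
unique_heavy = [
    (f"http://example.org/s{i}", "http://example.org/label", f"label number {i}")
    for i in range(1000)
]

# Hypothetical data set with five times as many triples, but heavy re-use.
reuse_heavy = [
    (f"http://example.org/s{i % 10}", "http://example.org/knows",
     f"http://example.org/s{(i * 7) % 10}")
    for i in range(5000)
]

for name, data in (("unique-heavy", unique_heavy), ("reuse-heavy", reuse_heavy)):
    n_triples, n_terms, n_bytes = shape_stats(data)
    print(f"{name}: {n_triples} triples, {n_terms} unique terms, "
          f"{n_bytes} dictionary bytes")
```

The second data set has more triples but only a handful of distinct
terms, so under this assumption its dictionary stays tiny.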

Some stores can also index strings for fast searching, which has its
own costs in time and space. I don't know if TDB does anything
interesting there, but this is another area where shape can have an
impact.

Finally, the type of work done during indexing can lead to files being
accessed with a totally different pattern, again depending on the
shape of the data. This can mean that operations which are fast under
some circumstances can slow right down in others (due to seeking,
write contention, and other vagaries of the disk system). I'm not
saying that this is what made loading so much slower in your second
run while indexing stayed the same, but it's a common enough
occurrence that I'm not shocked to see it.

Also, did you ensure that you had nothing else going on during either
load operation? It can be difficult to benchmark these things in
modern operating systems, due to the number of simultaneous tasks
which are necessarily running. My own desktop invariably starts
backing up the hard drive whenever I try to time something.  :-)
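
One way to reduce that noise is to repeat each measurement and compare
the spread rather than trusting a single run. Below is a minimal
sketch of that idea. The helper names time_runs and summarize are my
own, and the task passed in is a stand-in: for a real test you would
wrap your own loader invocation in it and start from a fresh copy of
the store each time.

```python
import statistics
import time


def time_runs(task, repeats=3):
    """Call task() `repeats` times and return wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Summarize a list of durations; a wide min/max gap hints at interference."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


if __name__ == "__main__":
    # Stand-in workload; replace with a function that runs your load.
    print(summarize(time_runs(lambda: sum(range(100_000)), repeats=5)))
```

If the minimum and maximum are far apart, something else on the
machine was probably competing for the disk or CPU.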

I look forward to hearing a response from the TDB developers with
their opinions.

Regards,
Paul

On Fri, Jun 22, 2012 at 9:05 AM, Stefan Scheffler
<[email protected]> wrote:
> Hello,
> At the moment I am doing some performance checks on TDB. The first thing I
> checked was the import with tdbloader2, and I got some weird results.
> Maybe someone can help me out. Here are my test setup and the results.
>
> The first test was to store 12 GB of triples into an empty store (I used the
> German DBpedia).
>
> Load time: 16 minutes
> average loading: ca. 81,000 triples / second
> index time: 40 minutes
> store size: 9.3 GB
>
>
> The second test was to store the same data into an already filled store.
> Before starting the import I created a store with 348,398,593 triples from
> DNB and HBZ (which are German libraries, store size: 33 GB).
> Then I started to load the German DBpedia in.
>
> Load time: 3 hours and 4 minutes
> average loading: ca. 7,200 triples / second
> index time: 38 minutes
> store size: 19 GB!!!!!
>
> Why does the loading time increase so immensely? My expectation was that
> the index time would increase, but it does not. There were no other big
> processes running at the same time. And why does the store size shrink to
> 19 GB? I am totally confused about that point.
>
> With friendly regards
> Stefan
>
> --
> Stefan Scheffler
> Avantgarde Labs GbR
> Löbauer Straße 19, 01099 Dresden
> Telefon: + 49 (0) 351 21590834
> Email: [email protected]
>
