Re: jena TDB scalability

Andy Seaborne Fri, 04 Oct 2013 02:27:52 -0700

On 03/10/13 15:01, Zhiyun Qian wrote:

Hi there,


I'm looking for some clues on the scalability of jena TDB. It looks like
our requirement would be at least 1B - 10B triples. From what I can find
online (which seems to be dated back in 2008), the max number ever put into
TDB is 1.7B [1]. I wonder if there's any more recent number on this.

I'm also curious about whether the scalability is primarily measured on the
union of all the graphs or individual graphs. In other words, whether a
"Dataset" (regardless of how many graphs/models in it) can only scale up to
a given number (let's say 1.7B) or an individual graph/model can scale to a
given number. Since our data naturally can be divided into different graphs
(with limited relationship across graphs), most queries can be performed on
a single graph at a time (we need some hacks to query the relationship
across graphs but I assume it is possible).

My understanding is that if we simply query one graph out of the many in a
dataset, it does not matter much how many triples there are in other
graphs. Is this correct?

[1]. http://www.w3.org/wiki/LargeTripleStores

Best,
-Zhiyun

There isn't a hard cutoff point whereby it works at X but not at X+1. There are no particular built-in assumptions like that (the nearest is that nodes have unique hashes - but the node hash is 128 bits so you can do some maths about that; things like undetected memory corruption are more likely).
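As a rough illustration of the maths on the 128-bit node hash, here is a minimal sketch using the standard birthday-bound approximation. It assumes the hash values are uniformly distributed, and the figure of 10 billion distinct nodes is a hypothetical input chosen for the example, not a number from TDB. The function name is made up for this sketch.

```python
import math


def collision_probability(num_nodes: int, hash_bits: int = 128) -> float:
    """Approximate chance that at least two of num_nodes distinct nodes
    share a hash, assuming uniformly distributed hash values.

    Birthday-bound approximation: p ~= 1 - exp(-n(n-1) / 2^(bits+1)).
    """
    exponent = -num_nodes * (num_nodes - 1) / 2.0 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


if __name__ == "__main__":
    # Hypothetical workload: 10 billion distinct nodes.
    p = collision_probability(10_000_000_000)
    print(f"collision probability: {p:.3e}")
```

Under these assumptions the probability comes out on the order of 1e-19, which is why hardware faults are the more realistic worry.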

10B triples is beyond the practical limits. 1B will need a big machine and not too complicated queries.

As the database gets larger, the practical queries that can be executed become more limited. Loading also becomes an issue.

If you are just doing URI->some properties and a bit of filtering on the retrieved values, then huge databases are possible.

But as soon as you use general patterns, group aggregates, or complicated combinations of patterns, OPTIONALs, UNIONs and NOT EXISTS, it will be impractically slow. ARQ/TDB uses an evaluation strategy [*] that uses temporary RAM only at a few points, so it does not run out of memory easily.

Loading takes a long time - more hardware, specifically, more RAM, makes a big difference.


        Andy

[*] currently, in the released code.
