Re: Jena with Large Data and single process

Andy Seaborne Sun, 21 Jul 2013 08:37:38 -0700

On 21/07/13 13:34, Marco Neumann wrote:

what are the most critical issues that prevent TDB from handling larger
data sets at the moment iyo? jvm? index?


Hi Marco,

Probably some degree of clustering.

Even improved, more compact indexing (simple run length encoding of the B+Tree leaves, for example) doesn't make the jump IMO. More RAM and more CPU do.

There is a limitation on system bus I/O to main RAM - there is only one bus on commodity hardware and the processor isn't actually running at 100%. Database code is doing a lot of data structure walking, and not a lot of CPU-intensive compute. [1]

So scaling on a single machine means one with special (=expensive) interconnect and RAM. Not so commodity any more.

Several machines, by which I mean a few, like 4-10, seem a more effective way to go.

There are some consequences of this - MVCC datastructures are better for transactions across a cluster. The demands of multi-machine transaction coordination are easier to meet.

(MVCC datastructures are ones where, when you write, you also copy all the tree nodes from the root to the active block, so the structure has one root per transaction. Transactions also become one-write, not two (once to the log, once to the main DB). See CouchDB or Mulgara for example.)
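
As a rough illustration of that copy-on-write idea, here is a minimal sketch in Java. It uses an in-memory binary search tree rather than a disk-based B+Tree, and the names (PathCopyTree, Node, insert, get) are made up for the example; this is not TDB, CouchDB or Mulgara code. Each write copies only the nodes on the path from the root to the changed node and returns a new root, leaving the old root untouched.

```java
// Hypothetical example class: path copying on a simple binary search tree.
class PathCopyTree {

    // Nodes are immutable: a write never modifies an existing node.
    static final class Node {
        final int key;
        final String value;
        final Node left;
        final Node right;

        Node(int key, String value, Node left, Node right) {
            this.key = key;
            this.value = value;
            this.left = left;
            this.right = right;
        }
    }

    // Returns the root of a new version. Only the nodes on the path from
    // the root to the inserted/updated key are copied; every other subtree
    // is shared with the previous version.
    static Node insert(Node root, int key, String value) {
        if (root == null) {
            return new Node(key, value, null, null);
        }
        if (key < root.key) {
            return new Node(root.key, root.value, insert(root.left, key, value), root.right);
        }
        if (key > root.key) {
            return new Node(root.key, root.value, root.left, insert(root.right, key, value));
        }
        // Same key: replace the value in a fresh copy of this node.
        return new Node(key, value, root.left, root.right);
    }

    // Lookup against one particular version, identified by its root.
    static String get(Node root, int key) {
        Node n = root;
        while (n != null) {
            if (key == n.key) {
                return n.value;
            }
            n = key < n.key ? n.left : n.right;
        }
        return null;
    }

    public static void main(String[] args) {
        // "Transaction 1" builds the first version.
        Node v1 = insert(insert(insert(null, 5, "a"), 2, "b"), 8, "c");
        // "Transaction 2" updates key 8 and gets its own root.
        Node v2 = insert(v1, 8, "C");

        System.out.println(get(v1, 8));         // old version still reads "c"
        System.out.println(get(v2, 8));         // new version reads "C"
        System.out.println(v1.left == v2.left); // untouched subtree is shared: true
    }
}
```

A reader holding the old root keeps seeing a consistent snapshot while a writer builds the next version, and "committing" amounts to publishing the new root.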

And "just doing it" so there is a working architecture that can be improved rather than trying to be perfect first time.


        Andy

[1]

See
http://highscalability.com/blog/2013/6/13/busting-4-modern-hardware-myths-are-memory-hdds-and-ssds-rea.html

and the 45min video

http://www.youtube.com/watch?v=MC1EKLQ2Wmg
