On 21/03/16 13:35, Alexandra Kokkinaki wrote:
Hi Andy, thanks for your answers.
On Fri, Mar 18, 2016 at 11:43 AM, Andy Seaborne <[email protected]> wrote:
Hi,
It will depend on usage patterns. 2 x 500 million isn't unreasonable, but
validating with your expected usage is essential.
The critical factors are the usage patterns and the hardware available.
Number of queries, query complexity, and number of updates all matter. RAM is
good (which is true for any database), as are SSDs if you do lots of updates
or need fast startup from cold.
What kinds of usage patterns are considered unsuitable for big triple stores?
We are planning to use our Fuseki server for machine-to-machine
communication and also to allow independent users to express mostly spatial
queries. We plan to do indexing and have a query timeout too. Is that
enough to address performance issues?
They are a good idea; a query timeout will protect the server.
It is possible to write SPARQL queries which are fundamentally expensive.
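If you also hit the TDB store directly through the Jena API, a per-query
timeout can be set programmatically as well. A rough sketch - the class name,
the 30-second value, the "DB" location and the query are just placeholders:

    import java.util.concurrent.TimeUnit;

    import org.apache.jena.query.*;
    import org.apache.jena.tdb.TDBFactory;

    public class TimeoutExample {
        public static void main(String[] args) {
            // "DB" is a placeholder for the TDB directory.
            Dataset dataset = TDBFactory.createDataset("DB");
            String queryString = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }";

            dataset.begin(ReadWrite.READ);
            try (QueryExecution qexec = QueryExecutionFactory.create(queryString, dataset)) {
                // Abort the query if it runs for more than 30 seconds.
                qexec.setTimeout(30, TimeUnit.SECONDS);
                ResultSet results = qexec.execSelect();
                while (results.hasNext()) {
                    System.out.println(results.next());
                }
            } finally {
                dataset.end();
            }
        }
    }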
The TDB will need to be updated daily, using the Jena API, since I suppose
deleting everything and re-inserting it would take a long time. I read in (
https://lists.w3.org/Archives/Public/public-sparql-dev/2008JulSep/0029.html
) that it takes 5370 seconds for 100M triples to be loaded into TDB, which is
good.
But here <https://www.w3.org/wiki/LargeTripleStores> it says that it
took 36 hours to load 1.7B triples into TDB
... in 2008 ... with a spinning disk.
12k triples/s would be a bit slow nowadays.
At large scale tdbloader2 can be faster than tdbloader. You have to try
with your data on your hardware - it isn't a simple yes/no question,
unfortunately.
tdbloader2 only loads from empty.
tdbloader does not do anything special when loading a partial database.
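For a load into a fresh, empty database it is invoked in the same way as
tdbloader, e.g. (location and data file are placeholders):

    tdbloader2 --loc=DB <the_data>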
... which drives me towards daily updates rather than a daily delete and
re-insert.
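If you go the incremental route, the daily changes can be applied through the
Jena API inside a TDB write transaction. A minimal sketch, assuming the
database lives in a directory called "DB" and the day's changes arrive as two
files of triples to remove and to add (all names here are illustrative):

    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.ReadWrite;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.riot.RDFDataMgr;
    import org.apache.jena.tdb.TDBFactory;

    public class DailyUpdate {
        public static void main(String[] args) {
            // "DB" is a placeholder for the TDB directory behind the service.
            Dataset dataset = TDBFactory.createDataset("DB");

            // Hypothetical change files produced by the daily export.
            Model toRemove = RDFDataMgr.loadModel("removals.ttl");
            Model toAdd    = RDFDataMgr.loadModel("additions.ttl");

            dataset.begin(ReadWrite.WRITE);
            try {
                Model model = dataset.getDefaultModel();
                model.remove(toRemove);   // delete the out-of-date triples
                model.add(toAdd);         // insert the new ones
                dataset.commit();
            } finally {
                dataset.end();
            }
        }
    }

Note that a TDB directory should only be opened by one JVM at a time, so in
practice the changes would either go to Fuseki as SPARQL Update requests or be
applied while the server is not using that database.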
How long would a 500 triple DB take to be loaded in an empty database?
500M?
Just run

    tdbloader --loc=DB <the_data>

and see what rate you get - I'd be interested in seeing the log. Every data
set, every hardware setup can be different. That's why it is hard to make any
accurate predictions - just try it.
The pattern of the data makes a difference - LUBM loads very fast as it
has a high triples-to-nodes ratio, so fewer bytes are being loaded. All
triple stores report better figures on that data - a factor of 2x faster
is common - but it's not typical data.
Andy
Multiple requests, whether to the same service or different services,
compete for the same machine resources. Fuseki runs requests
independently and in parallel. There are per-database transactions
supporting multiple, truly parallel readers.
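The same behaviour can be seen with an embedded TDB dataset: several threads
can each hold a read transaction at the same time. A rough sketch (the
location, query and thread count are placeholders):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.apache.jena.query.*;
    import org.apache.jena.tdb.TDBFactory;

    public class ParallelReaders {
        public static void main(String[] args) throws InterruptedException {
            Dataset dataset = TDBFactory.createDataset("DB");   // placeholder location
            String q = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }";

            ExecutorService pool = Executors.newFixedThreadPool(4);
            for (int i = 0; i < 4; i++) {
                pool.submit(() -> {
                    // Each thread holds its own read transaction; readers run in parallel.
                    dataset.begin(ReadWrite.READ);
                    try (QueryExecution qexec = QueryExecutionFactory.create(q, dataset)) {
                        ResultSet rs = qexec.execSelect();
                        while (rs.hasNext()) {
                            System.out.println(Thread.currentThread().getName() + ": " + rs.next());
                        }
                    } finally {
                        dataset.end();
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        }
    }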
Andy
Many thanks,
Alexandra
On 18/03/16 09:35, Alexandra Kokkinaki wrote:
Hi,
after researching TDB performance with big data, I would still like to
know:
We have one Fuseki server exposing 2 SPARQL endpoints (2 million triples
each) as data services. We are planning to add one more, but with big data
(500 million triples):
- For big data, is it better to use many installations of the Fuseki server,
  or
- many data services under the same Fuseki server?
Could Fuseki cope with two or more services with more than 500 million
triples each?
How does Fuseki cope when it has to serve concurrent queries to the
different data services?
Many thanks,
Alexandra