On 2012-06-30 12:48, Andy Seaborne wrote:
On 29/06/12 02:49, Sarven Capadisli wrote:
On 2012-06-28 20:25, Andy Seaborne wrote:
On 28/06/12 10:11, Sarven Capadisli wrote:
I was wondering if there is a way to rebuild the TDB index from the
command line and have it consequently update the stats file?

There isn't a way to rebuild just one of the indexes from another in the
TDB distribution.  Is that what you want to do?

tdbstats calculates the stats.

I want to optimize query response times.

I can't get a satisfactory solution with tdbstats because it doesn't let
me optimize for each named graph in the store.

What sort of queries are you asking the store?

For a store with 165 million triples, some real examples:

SELECT DISTINCT ?o WHERE { ?s a ?o }

Time: 159.359 sec (100 sec the second time round)


SELECT DISTINCT ?o WHERE { GRAPH <http://worldbank.270a.info/graph/meta> { ?s a ?o } }

Time: 0.394 sec

SELECT DISTINCT ?o WHERE { GRAPH <http://worldbank.270a.info/graph/world-bank-finances> { ?s a ?o } }

Time: 1.946 sec

SELECT DISTINCT ?o WHERE { GRAPH <http://worldbank.270a.info/graph/world-bank-climates> { ?s a ?o } }

Time: 46.967 sec

SELECT DISTINCT ?o WHERE { GRAPH <http://worldbank.270a.info/graph/world-development-indicators> { ?s a ?o } }

Time: 61.323 sec

SELECT DISTINCT ?o WHERE { GRAPH <http://worldbank.270a.info/graph/world-bank-projects-and-operations> { ?s a ?o } }

Time: 0.559 sec

A quick note on this: when I run the query where the default graph is the union of all graphs, it takes much longer than the combined time of the queries against the individual named graphs.
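
To put rough numbers on that, here is a small tally that uses only the first-run timings quoted above. The dictionary keys are just the graph names shortened for illustration, and it assumes the five graphs listed are the ones that matter for the comparison.

```python
# First-run timings in seconds, copied from the query results above.
union_time = 159.359  # SELECT DISTINCT ?o WHERE { ?s a ?o } over the union default graph

# Keys are shortened graph names, used here only as labels.
named_graph_times = {
    "meta": 0.394,
    "world-bank-finances": 1.946,
    "world-bank-climates": 46.967,
    "world-development-indicators": 61.323,
    "world-bank-projects-and-operations": 0.559,
}

named_total = sum(named_graph_times.values())
difference = union_time - named_total

print(f"named graphs total:  {named_total:.3f} sec")
print(f"union default graph: {union_time:.3f} sec")
print(f"difference:          {difference:.3f} sec")
```

By this tally the five named-graph queries add up to about 111.2 sec, roughly 48 sec less than the single query over the union.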

Other examples:

SELECT DISTINCT ?p ?o WHERE { GRAPH <g> { ?s ?p ?o } } --time=10m

SELECT DISTINCT ?g WHERE { GRAPH ?g { } } --time=60s (49s in second round, 55s on third..)


I thought rebuilding the indexes might create a new stats file. As I
understand it, the stats file is created after the initial import with
tdbloader, and subsequent imports don't update the stats file.

True - but you just need to rebuild the stats, not the index itself. The
stats file is separate from the index.

On a copy of the dataset, I ran:

java tdb.tdbstats -v --desc=/usr/lib/fuseki/tdb2.worldbank.bak.ttl --graph=urn:x-arq:UnionGraph

and replaced the content of stats.opt with the output. I've placed the output here: http://pastebin.com/2nt9jphE

I took the following query and compared the results in two stores:

SELECT DISTINCT ?o WHERE { ?s a ?o }

There was no notable difference.

Did I miss a step or make a wrong turn somewhere? Which type of queries would demonstrate the differences in stats?

I suppose at this point I need to compare the performance of the
stats created after importing incrementally (with the initial
import being the largest) against the stats based on the union of
graphs.

That's all in the context of having the data dumps in N-Triples format,
where each dump is assigned a named graph.

Alternatively, I have to switch to using a single dump file in N-Quads,
but I'm thinking that the stats for that would get me at best the same
results as in the union of graphs approach.

Does this line of thinking make sense: Which state should the TDB
indexes be in such that I get the most preferable stats? Is there even a
need to rebuild the indexes?

No point rebuilding the indexes.

You can write a stats file by hand.

But I think the first step is to understand the queries.

Fair enough. The queries above are the kind that anyone trying to get insight into the data in the store would run (in this case via Fuseki), as opposed to basic information retrieval. Do you think the results are returned in a reasonable amount of time? Is there a way to allocate more memory for TDB queries? And if it matters in the end, what would be the ideal way to import the data into the store?

Thanks Andy,

-Sarven
