On 25/05/12 09:40, "Dr. André Lanka" wrote:
Hello Jena-Users,
we are using Jena+TDB in production and are looking for an efficient
method to check the validity of the TDB files on disk.
Our situation is as follows.
With Jena 2.6.4 and TDB 0.8.10, each of our servers stores triples in up
to 4000 different TDB stores on its local hard drive. On average each
store holds 1 million triples (with high variance). To keep our system
running fluently, we need massively parallel write access to the
different stores, so one huge named graph is not an option. We also need
all stores open and accessible at the same time.
In order to get that large number of TDB stores opened in parallel, we
customised the TDB code for our needs. For instance, we introduced read
caches shared between all stores (to avoid memory problems), and we
added basic capabilities to roll back transactions (we took control of
all data read from or written to ObjectFile and BlockMgr).
Would you consider contributing your improvements back to Jena?
So, in our situation we can't switch to the new TDB version overnight.
OK - as you probably know, transactions in 0.9.0 do provide robust update.
Now, the problem is that we had some disk issues a few days ago and want
to check which stores have been corrupted (we know some of them are).
Our initial idea is to iterate over all statements in a store and
collect every S, P and O it uses. The second step would be to check that
each such URI is correctly mapped to a NodeId, and the other way round.
Unfortunately, we are not sure whether this will cover every possible
file problem. We also think there could be a more efficient way to check
the internal data structures.
So, any ideas (both high- and low-level) are highly appreciated.
Just iterating over S/P/O isn't enough, I'm afraid. That iterates over a
single index, so it does not check the other indexes.
A way to check the disk files would be to:
(for the default graph)
Graph.find(null, null, null)
which checks the SPO index and the node table in the id->string direction.
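As a sketch of that first pass: in real code the iterator would come from graph.find(Node.ANY, Node.ANY, Node.ANY) on the TDB-backed graph, and the Triple class below is only a stand-in for Jena's so the sketch runs on its own. The point is that touching every term of every triple forces the id->string lookup, so node table corruption surfaces as an exception (or a bad value) somewhere during the scan.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ScanCheck {
    // Stand-in for com.hp.hpl.jena.graph.Triple; in real code the iterator
    // comes from graph.find(Node.ANY, Node.ANY, Node.ANY) on the TDB graph.
    static final class Triple {
        final String s, p, o;
        Triple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
    }

    /** Walks every triple, forcing each of S, P and O to be materialized.
     *  Returns the number of triples read, or throws on the first bad one. */
    static long scan(Iterator<Triple> it) {
        long count = 0;
        while (it.hasNext()) {
            Triple t = it.next();
            // With TDB, materializing a term is where a corrupt node table
            // would show up; here we just model "bad lookup" as null.
            if (t.s == null || t.p == null || t.o == null)
                throw new IllegalStateException("Corrupt triple at position " + count);
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        List<Triple> data = Arrays.asList(
            new Triple("http://ex/s", "http://ex/p", "http://ex/o"),
            new Triple("http://ex/s", "http://ex/p", "\"42\""));
        System.out.println(scan(data.iterator()));  // 2
    }
}
```

Wrap the real scan in try/catch and record which store threw; a store that completes the scan has at least a readable SPO index and id->string table.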
Then dump each index (you'll need to write code for this) in tuple
order, one row per line of output. TupleIndexes reorder their entries
back to primary table order, so the POS index will consist of lines with
columns in the order S,P,O.
Sort each index dump and compare them. They should be the same.
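The sort-and-compare step could be as simple as the sketch below. The file names are placeholders for wherever your dump code writes its output; the demo in main just fabricates two tiny dumps containing the same tuples in different on-disk order.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class CompareDumps {
    /** Reads an index dump (one tuple per line, columns already rewritten
     *  to S P O order) and returns its lines sorted. */
    static List<String> sortedLines(Path dump) throws IOException {
        List<String> lines = Files.readAllLines(dump);
        Collections.sort(lines);
        return lines;
    }

    /** True iff the two dumps contain exactly the same set of tuples. */
    static boolean sameTuples(Path a, Path b) throws IOException {
        return sortedLines(a).equals(sortedLines(b));
    }

    public static void main(String[] args) throws IOException {
        // Placeholder dumps: same tuples, different index order.
        Path spo = Files.createTempFile("spo", ".dump");
        Path pos = Files.createTempFile("pos", ".dump");
        Files.write(spo, Arrays.asList("s1 p1 o1", "s2 p2 o2"));
        Files.write(pos, Arrays.asList("s2 p2 o2", "s1 p1 o1"));
        System.out.println(sameTuples(spo, pos));  // true
    }
}
```

For stores of a million triples the dumps fit in memory; for much larger stores an external sort (e.g. the Unix sort command) on the dump files does the same job.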
There is also the prefix index and the node to node id table.
The Node->NodeId mapping can be checked by taking apart the Node string
table and looking up each Node in the Node->NodeId table.
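A minimal sketch of that bidirectional check, with plain in-memory maps standing in for the two on-disk structures (the object file for id->Node and the Node->NodeId table):

```java
import java.util.HashMap;
import java.util.Map;

public class NodeTableCheck {
    /** Checks that id->Node and Node->NodeId are exact inverses:
     *  equal sizes, and every id maps to a Node that maps back to the
     *  same id. Together those two conditions also rule out Nodes in the
     *  Node->NodeId table that no id reaches. */
    static boolean consistent(Map<Long, String> idToNode, Map<String, Long> nodeToId) {
        if (idToNode.size() != nodeToId.size()) return false;
        for (Map.Entry<Long, String> e : idToNode.entrySet()) {
            Long back = nodeToId.get(e.getValue());
            if (back == null || !back.equals(e.getKey())) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Map<Long, String> idToNode = new HashMap<>();
        Map<String, Long> nodeToId = new HashMap<>();
        idToNode.put(1L, "http://ex/a"); nodeToId.put("http://ex/a", 1L);
        idToNode.put(2L, "http://ex/b"); nodeToId.put("http://ex/b", 2L);
        System.out.println(consistent(idToNode, nodeToId));  // true
    }
}
```

In practice you would stream both structures off disk rather than load them into maps, but the invariant being tested is the same.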
Andy