Andy,
As a reference, I have also used the TDBVerifier* code developed by
Paolo to check the validity of the indexes.
I got the following results:
---- Scanning node table ... ----
Found 890517 RDF nodes.
Nodes.dat file size is: 114295521
114295393 + 124 + 4 = 114295521
---- Scanning GSPO ... ----
Found 0 records.
---- Scanning GPOS ... ----
Found 0 records.
---- Scanning GOSP ... ----
Found 0 records.
---- Scanning POSG ... ----
Found 0 records.
---- Scanning OSPG ... ----
Found 0 records.
---- Scanning SPOG ... ----
Found 0 records.
---- Scanning SPO ... ----
00000000012b1b96000000000000005a00000000000001bf00000000012b1b96
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 3
at org.openjena.atlas.lib.ColumnMap.fetchSlotIdx(ColumnMap.java:137)
at com.hp.hpl.jena.tdb.lib.TupleLib.tuple(TupleLib.java:213)
at FunctionalTests.TDBVerifier.verifyIndex(TDBVerifier.java:111)
at FunctionalTests.TDBVerifier.main(TDBVerifier.java:57)
Java Result: 1
This corroborates the corruption in the SPO index.
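
As a cross-check on the output above, the record printed just before the exception can be split into fixed-width slots. This is only a minimal sketch: it assumes each NodeId slot is 8 bytes (16 hex digits), and the class and method names are hypothetical, not part of TDB or TDBVerifier. Under that assumption the record printed for SPO appears to be four slots wide where a triple index would have three, which may be why fetchSlotIdx fails on index 3. The same helper shows that the two records quoted in the tdbdump exception are out of order.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for reading hex-printed index records; not a TDB class. */
final class RecordSlots {
    // Assumption: one NodeId slot is 8 bytes, printed as 16 hex digits.
    static final int HEX_PER_SLOT = 16;

    /** Split a hex-printed record into its fixed-width slots. */
    static List<String> split(String hexRecord) {
        if (hexRecord.length() % HEX_PER_SLOT != 0) {
            throw new IllegalArgumentException(
                "Not a whole number of slots: " + hexRecord.length() + " hex digits");
        }
        List<String> slots = new ArrayList<>();
        for (int i = 0; i < hexRecord.length(); i += HEX_PER_SLOT) {
            slots.add(hexRecord.substring(i, i + HEX_PER_SLOT));
        }
        return slots;
    }

    /** True if record a sorts strictly before record b (equal width, byte order). */
    static boolean strictlyBefore(String a, String b) {
        return a.length() == b.length()
            && a.toLowerCase().compareTo(b.toLowerCase()) < 0;
    }

    public static void main(String[] args) {
        // Record printed by TDBVerifier while scanning SPO.
        String spo = "00000000012b1b96000000000000005a00000000000001bf00000000012b1b96";
        System.out.println(split(spo).size() + " slots: " + split(spo));

        // The two records from the tdbdump StorageException, in the order reported.
        String first  = "0000000006d00261000000000000021c0000000006cfff69";
        String second = "0000000006b861a3000000000005233d00000000015a78b5";
        System.out.println("strictly increasing: " + strictlyBefore(first, second));
    }
}
```
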
Regards,
Emilio
* https://github.com/castagna/tdbloader4/blob/f5363fa49d16a04a362898c1a5084ade620ee81b/src/test/java/dev/TDBVerifier.java
On 28 Jan 2013, at 11:23, Andy Seaborne wrote:
> On 28/01/13 10:21, Emilio Miguelanez wrote:
>>
>> On 27 Jan 2013, at 22:04, Andy Seaborne wrote:
>>
>>> If select * { ?agent
>>> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
>>> <file:///etc/recovery/models/flow/.node-server/model-turbine-e4.json/seed/core.owl#Agent>
>>> }
>>>
>>> works, it may be your lucky day. The SPO index is intact so
>>> tdbdump will work. Maybe.
>>>
>>> If you have the original data, then rebuilding is much safer.
>>> There may be other problems not yet encountered.
>>
>>
>> This query works .... what should I do now?
>>
>> If I run
>>
>> tdbdump --loc=tdb > tdb.dump (question: are tdbdump and tdbbackup
>> the same command?)
>
> Almost.
>
>> I get the same error.
>
> Not your lucky day I'm afraid. The SPO index is damaged. It does however
> look as if another index is intact.
>
>> Exception in thread "main" com.hp.hpl.jena.tdb.base.StorageException:
>> RecordRangeIterator: records not strictly increasing:
>> 0000000006d00261000000000000021c0000000006cfff69 //
>> 0000000006b861a3000000000005233d00000000015a78b5
>>
>> I would like to see whether the current TDB database can be fixed, as
>> rebuilding could take a long time. The database was created with minimal
>> data, and it is being populated (dynamically) with data over a long period
>> of time (> 1 year).
>
> SPO is the index used for iteration of the whole database. This can be
> changed.
>
> Is this a database of just triples? No named graphs? So far, the corruption
> looks to be in SPO (an index on the default graph).
>
> It will take some programming to fix this. No guarantees that it will work
> but I've experimented here.
>
> Take a backup of the database.
>
> A/ (the second way is better)
> If you know all the possible properties, then write code that loops on each
> of the properties and does
>
> defaultGraph.find(null, property, null)
>
> This will use the POS index.
>
> Print everything in N-Triples.
>
> B/ A different, better approach is to build a special version of TDB. The
> changes needed are small but you need to build Jena.
>
> These instructions apply to the code in SVN as it is now, today. Not the last
> release, not last week. It's just easier to set up and explain from the
> current code base, as a small recent change centralised the point you need to
> change and also introduced an easy-to-use testing feature.
>
> 1/ svn co the Jena code from trunk.
>
> 2/ Build Jena
> mvn clean install
>
> It is easier to build and install than just package.
>
> You must use the development releases of the other modules.
> I don't think you need to set up maven to use the snapshot builds on Apache
> but if you do:
>
> Set <repository> as described at
> http://jena.apache.org/download/maven.html
>
> 3/ mvn eclipse:eclipse to use Eclipse if you plan to use that to edit the
> code.
>
> 4/ Set up to use this build for tdbdump, e.g. via apache-jena or fuseki.
>
> For added ease - use the Fuseki server jar, which has everything in it
>
> java -cp fuseki-server.jar tdb.tdbdump --version
>
> Check timestamps/version numbers.
>
> 5/ Test: create a small text file of a few triples.
>
> --- D.ttl
> @prefix : <http://example/> .
>
> :s1 :p 1 .
> :s2 :p 2 .
> :s3 :q 3 .
> :s2 :q 4 .
> :s1 :q 5 .
>
> ---
>
> tdbdump --data D.ttl should dump the file with triples clustered by subject.
>
> (no - you do not need to load a database - --data is a recent feature for
> testing)
>
> 6/ Edit com.hp.hpl.jena.tdb.index.TupleTable, static method
> "chooseScanAllIndex"
>
> Change:
> -----
>     if ( tupleLen != 4 )
>         return indexes[0] ;
> ==>
>     if ( tupleLen != 4 )
>     {
>         if ( indexes.length == 3 )
>             return indexes[1] ;
>         else
>             return indexes[0] ;
>     }
> -----
>
> 7/ Rebuild.
>
> Yes - the tests for TDB should pass!
>
> 8/ check the new version
>
> tdbdump --version
>
> check the change
>
> tdbdump --data D.ttl
>
> and it should be N-Triples clustered by property, different from earlier on.
>
> 9/ Dump your database.
>
> Hope there is a good index.
>
> You can also try indexes[2] not indexes[1] to use the OSP index.
> Each dumps the entire database, but in different triple orders.
>
> 10/ Clean up maven to get rid of the temporary build.
>
> rm -r REPO/org/apache/jena/
>
> 11/ Rebuild the database with tdbloader/tdbloader2.
>
> (the load is slower than if dumped in SPO order)
>
> I tested the change here on that test file - I don't have a large corrupt
> database to try it on.
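
To illustrate what steps 5 to 8 above are checking, here is a minimal sketch of how the choice of scan index changes the order in which the D.ttl triples come out. It is not TDB code: the class, the INDEXES table and the `patched` flag are hypothetical stand-ins, and it sorts by plain strings whereas the real indexes sort by NodeId. It only mirrors the idea that indexes[0] is SPO, indexes[1] is POS and indexes[2] is OSP, as described in the instructions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of scan order per index; none of these names are TDB classes. */
final class ScanOrderSketch {
    // The five triples of D.ttl as {subject, predicate, object}.
    static final String[][] D = {
        {":s1", ":p", "1"}, {":s2", ":p", "2"}, {":s3", ":q", "3"},
        {":s2", ":q", "4"}, {":s1", ":q", "5"}
    };

    // Column orders standing in for the three triple indexes: SPO, POS, OSP.
    static final int[][] INDEXES = { {0, 1, 2}, {1, 2, 0}, {2, 0, 1} };

    /** Stand-in for the edited method: the patched build scans indexes[1] instead of indexes[0]. */
    static int[] chooseScanAllIndex(boolean patched) {
        return patched ? INDEXES[1] : INDEXES[0];
    }

    /** Return all triples, sorted by the given column order, printed as "s p o ." lines. */
    static List<String> scan(int[] order) {
        List<String[]> rows = new ArrayList<>(Arrays.asList(D));
        rows.sort((a, b) -> {
            for (int col : order) {
                int c = a[col].compareTo(b[col]);
                if (c != 0) return c;
            }
            return 0;
        });
        List<String> out = new ArrayList<>();
        for (String[] t : rows) {
            out.add(String.join(" ", t) + " .");
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println("Unpatched (SPO, clustered by subject):");
        scan(chooseScanAllIndex(false)).forEach(System.out::println);
        System.out.println("Patched (POS, clustered by property):");
        scan(chooseScanAllIndex(true)).forEach(System.out::println);
    }
}
```

Both scans return the same five triples; only the order differs, which is why a dump taken through POS or OSP should still contain the whole database if that index is intact.
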
>
>> Any ideas of how to get it fixed are more than welcome.
>
> Personally, I would adopt a 2 stream approach.
>
> Do the approach above and also collect all the data together and start a fresh
> load of the database on another machine.
>
> Good luck
> Andy
>
>>
>> Regards, Emilio
>>
>>
>> -- Emilio Migueláñez Martín [email protected]
>>
>>
>>
>
--
Emilio Migueláñez Martín
[email protected]