Andy,

As a further reference, I have also used the TDBVerifier* code developed by 
Paolo to check the validity of the indexes. 

I got the following results:

---- Scanning node table ... ----
Found 890517 RDF nodes.
Nodes.dat file size is: 114295521
114295393 + 124 + 4 = 114295521
---- Scanning GSPO ... ----
Found 0 records.
---- Scanning GPOS ... ----
Found 0 records.
---- Scanning GOSP ... ----
Found 0 records.
---- Scanning POSG ... ----
Found 0 records.
---- Scanning OSPG ... ----
Found 0 records.
---- Scanning SPOG ... ----
Found 0 records.
---- Scanning SPO ... ----
00000000012b1b96000000000000005a00000000000001bf00000000012b1b96
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 3
        at org.openjena.atlas.lib.ColumnMap.fetchSlotIdx(ColumnMap.java:137)
        at com.hp.hpl.jena.tdb.lib.TupleLib.tuple(TupleLib.java:213)
        at FunctionalTests.TDBVerifier.verifyIndex(TDBVerifier.java:111)
        at FunctionalTests.TDBVerifier.main(TDBVerifier.java:57)
Java Result: 1


This corroborates the corruption in the SPO index.
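
To make the record printed during the SPO scan easier to read, here is a minimal sketch that splits a hex-printed record into slots. It assumes each NodeId takes 8 bytes (16 hex digits); the class and method names are made up for illustration and are not part of TDBVerifier or TDB.

```java
import java.util.ArrayList;
import java.util.List;

public class RecordSlots {
    // Assumption: one NodeId occupies 8 bytes, i.e. 16 hex digits.
    static final int HEX_PER_SLOT = 16;

    /** Split a hex-printed index record into its NodeId slots. */
    static List<Long> slots(String hexRecord) {
        if (hexRecord.length() % HEX_PER_SLOT != 0)
            throw new IllegalArgumentException(
                "Not a whole number of slots: " + hexRecord.length() + " hex digits");
        List<Long> ids = new ArrayList<>();
        for (int i = 0; i < hexRecord.length(); i += HEX_PER_SLOT)
            ids.add(Long.parseUnsignedLong(hexRecord.substring(i, i + HEX_PER_SLOT), 16));
        return ids;
    }

    public static void main(String[] args) {
        // The record printed just before the exception in the SPO scan.
        String fromSpoScan =
            "00000000012b1b96000000000000005a00000000000001bf00000000012b1b96";
        List<Long> ids = slots(fromSpoScan);
        System.out.println(ids.size() + " slots: " + ids);
    }
}
```

Under that assumption the record holds four slots, where a triple index would be expected to hold three. That may be why the lookup fails with ArrayIndexOutOfBoundsException: 3, though it is only one possible reading of the output.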

Regards,
Emilio


* 
https://github.com/castagna/tdbloader4/blob/f5363fa49d16a04a362898c1a5084ade620ee81b/src/test/java/dev/TDBVerifier.java

On 28 Jan 2013, at 11:23, Andy Seaborne wrote:

> On 28/01/13 10:21, Emilio Miguelanez wrote:
>> 
>> On 27 Jan 2013, at 22:04, Andy Seaborne wrote:
>> 
>>> If select * { ?agent
>>> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
>>> <file:///etc/recovery/models/flow/.node-server/model-turbine-e4.json/seed/core.owl#Agent>
>>> }
>>> 
>>> works, it may be your lucky day.  The SPO index is intact so
>>> tdbdump will work.  Maybe.
>>> 
>>> If you have the original data, then rebuilding is much safer.
>>> There may be other problems not yet encountered.
>> 
>> 
>> This query works .... what should I do now?
>> 
>> If I run
>> 
>> tdbdump --loc=tdb > tdb.dump       (question: are tdbdump and tdbbackup
>> the same command?)
> 
> Almost.
> 
>> I get same error.
> 
> Not your lucky day I'm afraid.  The SPO index is damaged.  It does however 
> look as if another index is intact.
> 
>> Exception in thread "main" com.hp.hpl.jena.tdb.base.StorageException:
>> RecordRangeIterator: records not strictly increasing:
>> 0000000006d00261000000000000021c0000000006cfff69 //
>> 0000000006b861a3000000000005233d00000000015a78b5
>> 
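
The two records in that message can be compared directly. This is a minimal sketch, assuming the record before the "//" is the earlier one in the scan and the one after it is the next; the class and method names are hypothetical and not TDB code.

```java
public class RecordOrder {
    /**
     * True if record b sorts strictly after record a.
     * Equal-length lowercase hex strings compare in the same order
     * as the unsigned bytes they represent.
     */
    static boolean strictlyIncreasing(String a, String b) {
        if (a.length() != b.length())
            throw new IllegalArgumentException("Records differ in length");
        return a.compareTo(b) < 0;
    }

    public static void main(String[] args) {
        // Copied from the StorageException message.
        String previous = "0000000006d00261000000000000021c0000000006cfff69";
        String next     = "0000000006b861a3000000000005233d00000000015a78b5";
        System.out.println("strictly increasing: " + strictlyIncreasing(previous, next));
    }
}
```

This prints "strictly increasing: false": the first slot goes from 0x06d00261 down to 0x06b861a3, so the second record sorts before the first.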
>> I would like to find out whether the current TDB database can be fixed,
>> as rebuilding could take a long time. The database was created with
>> minimal data and has been populated dynamically over a long period
>> of time (> 1 year).
> 
> SPO is the index used for iteration of the whole database.  This can be 
> changed.
> 
> Is this a database of just triples? No named graphs?  So far, the corruption 
> looks to be in SPO (an index on the default graph).
> 
> It will take some programming to fix this.  No guarantees that it will work 
> but I've experimented here.
> 
> Take a backup of the database.
> 
> A/ (the second way is better)
> If you know all the possible properties, then write code that loops on each 
> of the properties and does
> 
>   defaultGraph.find(null, property, null)
> 
> This will use the POS index.
> 
> Print everything in N-Triples.
> 
> B/ A different, better approach is to build a special version of TDB. The 
> changes needed are small but you need to build Jena.
> 
> These instructions apply to code in SVN as it is now, today.  Not the last 
> release, not last week.  It's just easier to setup and explain from the 
> current code base as a small recent change centralised the point you need to 
> change and also introduced an easy to use testing feature.
> 
> 1/ svn co the Jena code from trunk.
> 
> 2/ Build Jena
>   mvn clean install
> 
> It is easier to build and install than just package.
> 
> You must use the development releases of the other modules.
> I don't think you need to set up maven to use the snapshot builds on Apache 
> but if you do:
> 
> Set <repository>
> http://jena.apache.org/download/maven.html
> 
> 3/ mvn eclipse:eclipse to use Eclipse if you plan to use that to edit the 
> code.
> 
> 4/ Set up to use this build for tdbdump, e.g. the apache-jena or fuseki build.
> 
> For added ease, use the Fuseki server jar, which has everything in it:
> 
> java -cp fuseki-server.jar tdb.tdbdump --version
> 
> Check timestamps/version numbers.
> 
> 5/ Test: create a small text file of a few triples.
> 
> --- D.ttl
> @prefix : <http://example/> .
> 
> :s1 :p 1 .
> :s2 :p 2 .
> :s3 :q 3 .
> :s2 :q 4 .
> :s1 :q 5 .
> 
> ---
> 
> tdbdump --data D.ttl should dump the file with triples clustered by subject.
> 
> (no - you do not need to load a database - --data is a recent feature for 
> testing)
> 
> 6/ Edit com.hp.hpl.jena.tdb.index.TupleTable, static method 
> "chooseScanAllIndex"
> 
> Change:
> -----
>        if ( tupleLen != 4 )
>            return indexes[0] ;
> ==>
>        if ( tupleLen != 4 )
>        {
>            if ( indexes.length == 3 )
>                return indexes[1] ;
>            else
>                return indexes[0] ;
>        }
> -----
> 
> 7/ Rebuild.
> 
> Yes - the tests for TDB should pass!
> 
> 8/ check the new version
> 
> tdbdump --version
> 
> check the change
> 
> tdbdump --data D.ttl
> 
> and it should be N-Triples clustered by property, different from earlier on.
> 
> 9/ Dump your database.
> 
> Hope there is a good index.
> 
> You can also try indexes[2] not indexes[1] to use the OSP index.
> Each dumps the entire database, but in different triple orders.
> 
> 10/ Clean up maven to get rid of the temporary build.
> 
> rm -r REPO/org/apache/jena/
> 
> 11/ Rebuild the database with tdbloader/tdbloader2.
> 
> (the load is slower than if dumped in SPO order)
> 
> I tested the change here on that test file - I don't have a large corrupt 
> database to try it on.
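
A toy illustration of why the patched build dumps in a different order. It does not use Jena at all: the triples from D.ttl are held as plain strings, the index order SPO, POS, OSP is assumed from the steps above, and sorting by string stands in for the real NodeId ordering. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ScanOrderDemo {
    /** A triple from D.ttl, kept as plain strings. */
    static final class Triple {
        final String s, p, o;
        Triple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
        @Override public String toString() { return s + " " + p + " " + o + " ."; }
    }

    // Stand-ins for the three triple indexes, in the order assumed here.
    static final String[] INDEX_NAMES = { "SPO", "POS", "OSP" };

    /** Pick the slot of a triple named by one letter of an index name. */
    static String slot(Triple t, char c) {
        switch (c) {
            case 'S': return t.s;
            case 'P': return t.p;
            case 'O': return t.o;
            default: throw new IllegalArgumentException("Unknown slot: " + c);
        }
    }

    /** Comparator that orders triples the way the named index would. */
    static Comparator<Triple> order(String indexName) {
        Comparator<Triple> cmp = null;
        for (char c : indexName.toCharArray()) {
            Comparator<Triple> next = Comparator.comparing(t -> slot(t, c));
            cmp = (cmp == null) ? next : cmp.thenComparing(next);
        }
        return cmp;
    }

    /** Full scan via the index at the given position, mimicking indexes[n]. */
    static List<Triple> scanAll(List<Triple> data, int indexPosition) {
        List<Triple> out = new ArrayList<>(data);
        out.sort(order(INDEX_NAMES[indexPosition]));
        return out;
    }

    /** The five triples of D.ttl. */
    static List<Triple> sample() {
        List<Triple> data = new ArrayList<>();
        data.add(new Triple(":s1", ":p", "1"));
        data.add(new Triple(":s2", ":p", "2"));
        data.add(new Triple(":s3", ":q", "3"));
        data.add(new Triple(":s2", ":q", "4"));
        data.add(new Triple(":s1", ":q", "5"));
        return data;
    }

    public static void main(String[] args) {
        for (int i = 0; i < INDEX_NAMES.length; i++) {
            System.out.println("-- scan via " + INDEX_NAMES[i]);
            for (Triple t : scanAll(sample(), i))
                System.out.println(t);
        }
    }
}
```

Position 0 groups the output by subject, position 1 by property, and position 2 by object. Every scan returns the same five triples, which is why any one intact index is enough to dump the whole database.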
> 
>> Any ideas of how to get it fixed are more than welcome.
> 
> Personally, I would adopt a two-stream approach.
> 
> Do the approach above, and also collect all the data together and start a 
> fresh load of the database on another machine.
> 
>       Good luck
>       Andy
> 
>> 
>> Regards, Emilio
>> 
>> 
>> -- Emilio Migueláñez Martín [email protected]
>> 
>> 
>> 
> 

--
Emilio Migueláñez Martín
[email protected]

