Re: TDB: records not strictly increasing

Emilio Miguelanez Mon, 28 Jan 2013 14:58:40 -0800

Hi Andy,

I have done some testing.


> On 28/01/13 10:21, Emilio Miguelanez wrote:
>> 
>> On 27 Jan 2013, at 22:04, Andy Seaborne wrote:
>> 
>>> If select * { ?agent
>>> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
>>> <file:///etc/recovery/models/flow/.node-server/model-turbine-e4.json/seed/core.owl#Agent>
>>> }
>>> 
>>> works, it may be your lucky day.  The SPO index is intact so
>>> tdbdump will work.  Maybe.
>>> 
>>> If you have the original data, then rebuilding is much safer.
>>> There may be other problems not yet encountered.
>> 
>> 
>> This query works .... what should I do now?
>> 
>> If I run
>> 
>> tdbdump --loc=tdb > tdb.dump       (question: are tdbdump and tdbbackup
>> the same command?)
> 
> Almost.
> 
>> I get same error.
> 
> Not your lucky day I'm afraid.  The SPO index is damaged.  It does however 
> look as if another index is intact.
> 
>> Exception in thread "main" com.hp.hpl.jena.tdb.base.StorageException:
>> RecordRangeIterator: records not strictly increasing:
>> 0000000006d00261000000000000021c0000000006cfff69 //
>> 0000000006b861a3000000000005233d00000000015a78b5
>> 
>> I would like to see whether the current tdb can be fixed, as rebuilding
>> could take a long time. The database was created with minimal data,
>> and it has been populated (dynamically) with data over a long period
>> of time (> 1 year).
> 
> SPO is the index used for iteration of the whole database.  This can be 
> changed.
> 
> Is this a database of just triples? No named graphs?  So far, the corruption 
> looks to be in SPO (an index on the default graph).

The database started with a named graph, and is being populated with triples 
over time. 

> It will take some programming to fix this.  No guarantees that it will work 
> but I've experimented here.
> 
> Take a backup of the database.

Done

> 
> A/ (the second way is better)

I haven't tested this approach.

> If you know all the possible properties, then write code that loops on each 
> of the properties and does
> 
>   defaultGraph.find(null, property, null)
> 
> This will use the POS index.
> 
> Print everything in N-Triples.
> 
> B/ A different, better approach is to build a special version of TDB. The 
> changes needed are small but you need to build Jena.
> 
> These instructions apply to code in SVN as it is now, today.  Not the last 
> release, not last week.  It's just easier to setup and explain from the 
> current code base as a small recent change centralised the point you need to 
> change and also introduced an easy to use testing feature.
> 
> 1/ svn co the Jena code from trunk.
> 
Done

> 2/ Build Jena
>   mvn clean install
> 
Done

> It is easier to build and install than just package.
> 
> You must use the development releases of the other modules.
> I don't think you need to set up maven to use the snapshot builds on Apache 
> but if you do:
> 
> Set <repository>
> http://jena.apache.org/download/maven.html
> 
> 3/ mvn eclipse:eclipse to use Eclipse if you plan to use that to edit the 
> code.

Didn't set up maven or use Eclipse.

> 4/ Setup to use this build for tdbdump.  e.g. the apache-jena or fuseki.
> 
> For added ease - use the Fuseki server jar which has everything in it
> 
> java -cp fuseki-server.jar tdb.tdbdump --version

java -cp jena-fuseki/target/jena-fuseki-0.2.6-SNAPSHOT-server.jar tdb.tdbdump --version

Jena:       VERSION: 2.10.0-SNAPSHOT
Jena:       BUILD_DATE: 2013-01-28T21:00:30+0000
ARQ:        VERSION: 2.10.0-SNAPSHOT
ARQ:        BUILD_DATE: 2013-01-28T21:00:30+0000
TDB:        VERSION: 0.10.0-SNAPSHOT
TDB:        BUILD_DATE: 2013-01-28T21:00:30+0000

> Check timestamps/version numbers.
> 
> 5/ Test: create a small text file of a few triples.
> 
> --- D.ttl
> @prefix : <http://example/> .
> 
> :s1 :p 1 .
> :s2 :p 2 .
> :s3 :q 3 .
> :s2 :q 4 .
> :s1 :q 5 .
> 
> ---
> 
> tdbdump --data D.ttl should dump the file with triples clustered by subject.
> 
> (no - you do not need to load a database - --data is a recent feature for 
> testing)

java -cp jena-fuseki/target/jena-fuseki-0.2.6-SNAPSHOT-server.jar tdb.tdbdump --data D.ttl
<http://example/s1> <http://example/p> "1"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://example/s1> <http://example/q> "5"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://example/s2> <http://example/p> "2"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://example/s2> <http://example/q> "4"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://example/s3> <http://example/q> "3"^^<http://www.w3.org/2001/XMLSchema#integer> .

> 6/ Edit com.hp.hpl.jena.tdb.index.TupleTable, static method 
> "chooseScanAllIndex"
> 
> Change:
> -----
>        if ( tupleLen != 4 )
>            return indexes[0] ;
> ==>
>        if ( tupleLen != 4 )
>        {
>            if ( indexes.length == 3 )
>                return indexes[1] ;
>            else
>                return indexes[0] ;
>        }
> -----
> 
> 7/ Rebuild.
> 
> Yes - the tests for TDB should pass!
> 
> 8/ check the new version
> 
> tdbdump --version
> 
> check the change
> 
> tdbdump --data D.ttl
> 
> and it should be n-triples clustered by property, different to earlier on.

java -cp jena-fuseki/target/jena-fuseki-0.2.6-SNAPSHOT-server.jar tdb.tdbdump --data D.ttl
<http://example/s1> <http://example/p> "1"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://example/s2> <http://example/p> "2"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://example/s3> <http://example/q> "3"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://example/s2> <http://example/q> "4"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://example/s1> <http://example/q> "5"^^<http://www.w3.org/2001/XMLSchema#integer> .

Is it what you expect?
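
For comparison, here is a minimal sketch of the ordering I would expect, assuming the unpatched dump walks the SPO index and the patched one walks POS (or OSP with indexes[2]). It sorts the five triples of D.ttl by each column order. Sorting the prefixed names as strings and the object as an int is only a stand-in for TDB's internal node order, and the class and method names are made up for illustration; this is not Jena code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustration only: models the five triples from D.ttl without using Jena.
final class IndexOrderSketch {

    // A triple with the object kept as an int, as in D.ttl.
    static final class T {
        final String s; final String p; final int o;
        T(String s, String p, int o) { this.s = s; this.p = p; this.o = o; }
        @Override public String toString() { return s + " " + p + " " + o; }
    }

    // The triples in the order they appear in D.ttl.
    static List<T> data() {
        return Arrays.asList(
            new T(":s1", ":p", 1), new T(":s2", ":p", 2), new T(":s3", ":q", 3),
            new T(":s2", ":q", 4), new T(":s1", ":q", 5));
    }

    // Return the triples sorted by the column order of the named index.
    static List<String> scan(String index) {
        Comparator<T> byS = Comparator.comparing((T t) -> t.s);
        Comparator<T> byP = Comparator.comparing((T t) -> t.p);
        Comparator<T> byO = Comparator.comparingInt((T t) -> t.o);
        Comparator<T> order;
        switch (index) {
            case "SPO": order = byS.thenComparing(byP).thenComparing(byO); break;
            case "POS": order = byP.thenComparing(byO).thenComparing(byS); break;
            case "OSP": order = byO.thenComparing(byS).thenComparing(byP); break;
            default: throw new IllegalArgumentException("unknown index: " + index);
        }
        List<T> sorted = new ArrayList<>(data());
        sorted.sort(order);
        List<String> out = new ArrayList<>();
        for (T t : sorted) out.add(t.toString());
        return out;
    }

    public static void main(String[] args) {
        for (String index : new String[] {"SPO", "POS", "OSP"}) {
            System.out.println(index + ": " + scan(index));
        }
    }
}
```

Under those assumptions the SPO order matches the first dump of D.ttl and the POS order matches the second one. For this tiny file OSP gives the same order as POS, because every object value is distinct.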

> 
> 9/ Dump your database.
> 
> Hope there is a good index.

It works and no errors were reported; however, the dump file is only 84MB,
which is considerably smaller than the actual tdb (~1GB).

> You can also try indexes[2] not indexes[1] to use the OSP index.
> Each dumps the entire database, but in different triple orders.

I also tried this change of index (indexes[2]), and it gave me the same error:

Exception in thread "main" com.hp.hpl.jena.tdb.base.StorageException: 
RecordRangeIterator: records not strictly increasing: 
00000000021aa0a20000000006cffe6b000000000005233d // 
00000000021a2c0a0000000006b85f9f000000000005233d
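
To read the two records in that message, here is a minimal sketch. It assumes each record is three 8-byte node ids written as hex, in the column order of the index being scanned; that is my assumption about the layout, not something confirmed in this thread. It splits each record into three unsigned 64-bit values and compares them field by field. The class and method names are hypothetical.

```java
// Illustration only: decodes the hex records printed by the StorageException.
final class RecordOrderSketch {

    // Split a 48-hex-digit record into three unsigned 64-bit ids.
    static long[] ids(String hex) {
        if (hex.length() != 48) {
            throw new IllegalArgumentException("expected 48 hex digits");
        }
        long[] out = new long[3];
        for (int i = 0; i < 3; i++) {
            out[i] = Long.parseUnsignedLong(hex.substring(16 * i, 16 * i + 16), 16);
        }
        return out;
    }

    // Negative, zero or positive as a sorts before, equal to, or after b.
    static int compare(String a, String b) {
        long[] x = ids(a);
        long[] y = ids(b);
        for (int i = 0; i < 3; i++) {
            int c = Long.compareUnsigned(x[i], y[i]);
            if (c != 0) return c;
        }
        return 0;
    }

    public static void main(String[] args) {
        String first  = "00000000021aa0a20000000006cffe6b000000000005233d";
        String second = "00000000021a2c0a0000000006b85f9f000000000005233d";
        System.out.println(Long.toHexString(ids(first)[0]) + " vs "
                + Long.toHexString(ids(second)[0]));
        System.out.println("strictly increasing: " + (compare(first, second) < 0));
    }
}
```

With that reading, the first id of the second record (0x21a2c0a) is smaller than the first id of the record before it (0x21aa0a2), so the pair is out of order, which is what the iterator is complaining about.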

> 10/ Clean up maven to get rid of the temporary build.
> 
> rm -r REPO/org/apache/jena/
> 
> 11/ Rebuild the database with tdbloader/tdbloader2.

java -cp jena-fuseki/target/jena-fuseki-0.2.6-SNAPSHOT-server.jar tdb.tdbloader --loc=tdb tdb.dump

but the resulting tdb is smaller than the original one.

> (the load is slower than if dumped in SPO order)
> 
> I tested the change here on that test file - I don't have a large corrupt 
> database to try it on.
> 
>> Any ideas of how to get it fixed are more than welcome.
> 
> Personally, I would adopt a 2 stream approach.
> 
> Do the approach above and also collect all the data together and start a
> fresh load of the database on another machine.

Doing it already.

Thanks,
Emilio

> 
>       Good luck
>       Andy
> 
>> 
>> Regards, Emilio
>> 
>> 
>> -- Emilio Migueláñez Martín
>> 
>> 
>> 
> 

--
Emilio Migueláñez Martín
