Re: TDB: records not strictly increasing

Andy Seaborne Tue, 29 Jan 2013 00:26:40 -0800

B/ A different, better approach is to build a special version of TDB. The 
changes needed are small but you need to build Jena.


These instructions apply to code in SVN as it is now, today.  Not the last 
release, not last week.  It's just easier to set up and explain from the current 
code base, as a small recent change centralised the point you need to change and 
also introduced an easy-to-use testing feature.

1/ svn co the Jena code from trunk.

Done

2/ Build Jena
   mvn clean install

Done

It is easier to build and install than just package.

You must use the development releases of the other modules.
I don't think you need to set up maven to use the snapshot builds on Apache but 
if you do:

Set <repository>
http://jena.apache.org/download/maven.html

3/ mvn eclipse:eclipse to use Eclipse if you plan to use that to edit the code.

Didn't set up maven or use Eclipse.

4/ Set up to use this build for tdbdump, e.g. via apache-jena or Fuseki.

For added ease, use the Fuseki server jar, which has everything in it:

java -cp fuseki-server.jar tdb.tdbdump --version


java -cp jena-fuseki/target/jena-fuseki-0.2.6-SNAPSHOT-server.jar tdb.tdbdump 
--version

Jena:       VERSION: 2.10.0-SNAPSHOT
Jena:       BUILD_DATE: 2013-01-28T21:00:30+0000
ARQ:        VERSION: 2.10.0-SNAPSHOT
ARQ:        BUILD_DATE: 2013-01-28T21:00:30+0000
TDB:        VERSION: 0.10.0-SNAPSHOT
TDB:        BUILD_DATE: 2013-01-28T21:00:30+0000

Check timestamps/version numbers.

5/ Test: create a small text file of a few triples.

--- D.ttl
@prefix : <http://example/> .

:s1 :p 1 .
:s2 :p 2 .
:s3 :q 3 .
:s2 :q 4 .
:s1 :q 5 .

---

tdbdump --data D.ttl should dump the file with triples clustered by subject.

(no - you do not need to load a database - --data is a recent feature for 
testing)


java -cp jena-fuseki/target/jena-fuseki-0.2.6-SNAPSHOT-server.jar tdb.tdbdump 
--data D.ttl
<http://example/s1> <http://example/p> 
"1"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://example/s1> <http://example/q> 
"5"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://example/s2> <http://example/p> 
"2"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://example/s2> <http://example/q> 
"4"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://example/s3> <http://example/q> 
"3"^^<http://www.w3.org/2001/XMLSchema#integer> .

6/ Edit com.hp.hpl.jena.tdb.index.TupleTable, static method "chooseScanAllIndex"

Change:
-----
        if ( tupleLen != 4 )
            return indexes[0] ;
==>
        if ( tupleLen != 4 )
        {
            if ( indexes.length == 3 )
                return indexes[1] ;
            else
                return indexes[0] ;
        }
-----
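
As a rough illustration of what the change above does, here is a minimal, 
self-contained sketch. It is not the TDB source: the class name is made up, 
index objects are replaced by plain strings, and the index order SPO, POS, OSP 
is taken from the description in this thread. The branch for tupleLen == 4 is 
not shown in the message, so the sketch simply falls back to the first index 
there.

```java
public class ScanIndexChoice {

    // Hypothetical stand-in for the edited method: each "index" is only its name.
    static String chooseScanAllIndex(int tupleLen, String[] indexes) {
        if (tupleLen != 4) {
            // Triples: when exactly three indexes exist, scan the second one
            // instead of the first (possibly damaged) one.
            if (indexes.length == 3)
                return indexes[1];
            else
                return indexes[0];
        }
        // Assumption: the quad case is outside the scope of the change,
        // so this sketch just returns the first index.
        return indexes[0];
    }

    public static void main(String[] args) {
        String[] tripleIndexes = { "SPO", "POS", "OSP" };
        // With the edit in place, a full scan of the triple table uses POS.
        System.out.println(chooseScanAllIndex(3, tripleIndexes));
    }
}
```

Changing indexes[1] to indexes[2] in the sketch selects the third entry, which 
corresponds to the OSP variant mentioned later in the thread.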

7/ Rebuild.

Yes - the tests for TDB should pass!

8/ check the new version

tdbdump --version

check the change

tdbdump --data D.ttl

and it should be n-triples clustered by property, different to earlier on.


java -cp jena-fuseki/target/jena-fuseki-0.2.6-SNAPSHOT-server.jar tdb.tdbdump 
--data D.ttl
<http://example/s1> <http://example/p> 
"1"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://example/s2> <http://example/p> 
"2"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://example/s3> <http://example/q> 
"3"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://example/s2> <http://example/q> 
"4"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://example/s1> <http://example/q> 
"5"^^<http://www.w3.org/2001/XMLSchema#integer> .
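
The two dump orders for D.ttl can be reproduced with a small sketch that sorts 
the five test triples by (subject, predicate, object) and by (predicate, 
object, subject). This is only an illustration: the class and method names are 
invented, and it sorts on the visible names and integer values, whereas TDB 
orders index entries by internal node ids. For this test file the resulting 
orders happen to match the two dumps shown in this message.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TripleOrderDemo {

    // Minimal triple holder for the test data; not a Jena class.
    static final class Triple {
        final String s;
        final String p;
        final int o;

        Triple(String s, String p, int o) {
            this.s = s;
            this.p = p;
            this.o = o;
        }

        @Override
        public String toString() {
            return ":" + s + " :" + p + " " + o;
        }
    }

    // The triples of D.ttl, in file order.
    static List<Triple> data() {
        List<Triple> triples = new ArrayList<>();
        triples.add(new Triple("s1", "p", 1));
        triples.add(new Triple("s2", "p", 2));
        triples.add(new Triple("s3", "q", 3));
        triples.add(new Triple("s2", "q", 4));
        triples.add(new Triple("s1", "q", 5));
        return triples;
    }

    // Subject first: triples end up clustered by subject.
    static List<Triple> spoOrder(List<Triple> in) {
        List<Triple> out = new ArrayList<>(in);
        out.sort(Comparator.comparing((Triple t) -> t.s)
                .thenComparing((Triple t) -> t.p)
                .thenComparingInt((Triple t) -> t.o));
        return out;
    }

    // Predicate first: triples end up clustered by property.
    static List<Triple> posOrder(List<Triple> in) {
        List<Triple> out = new ArrayList<>(in);
        out.sort(Comparator.comparing((Triple t) -> t.p)
                .thenComparingInt((Triple t) -> t.o)
                .thenComparing((Triple t) -> t.s));
        return out;
    }

    public static void main(String[] args) {
        System.out.println("SPO order:");
        spoOrder(data()).forEach(System.out::println);
        System.out.println("POS order:");
        posOrder(data()).forEach(System.out::println);
    }
}
```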

Is it what you expect?


Yes.


9/ Dump your database.

Hope there is a good index.


It works and no errors were reported, however the size of the dump file is just 
84MB, which is considerably smaller than the actual tdb (~1GB)

Quite possible - especially if you have also been deleting stuff in the 
database as well as adding.

You can also try indexes[2] not indexes[1] to use the OSP index.
Each dumps the entire database, but in different triple orders.


I did also try this change of indexes, and it gave me the same error

Exception in thread "main" com.hp.hpl.jena.tdb.base.StorageException: 
RecordRangeIterator: records not strictly increasing: 
00000000021aa0a20000000006cffe6b000000000005233d // 
00000000021a2c0a0000000006b85f9f000000000005233d


The OSP index is also broken.
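
To see why those two records trigger the error, here is a small sketch of a 
"strictly increasing" check applied to the hex strings from the exception. It 
is not the TDB implementation: the class and method names are invented, and it 
assumes the two records are printed in the order they were read. Because both 
strings are lowercase hex of equal length, a plain string comparison gives the 
same result as comparing the underlying bytes as unsigned values.

```java
public class StrictlyIncreasingCheck {

    // True when 'next' sorts strictly after 'prev'.
    // Assumes equal-length lowercase hex strings, so String.compareTo
    // agrees with an unsigned byte-by-byte comparison of the records.
    static boolean strictlyIncreasing(String prev, String next) {
        return prev.compareTo(next) < 0;
    }

    public static void main(String[] args) {
        String first  = "00000000021aa0a20000000006cffe6b000000000005233d";
        String second = "00000000021a2c0a0000000006b85f9f000000000005233d";

        // The records first differ at "...21aa..." versus "...21a2...",
        // so the second record sorts before the first one.
        System.out.println(strictlyIncreasing(first, second)); // prints false
    }
}
```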

10/ Clean up maven to get rid of the temporary build.

rm -r REPO/org/apache/jena/

11/ Rebuild the database with tdbloader/tdbloader2.


java -cp jena-fuseki/target/jena-fuseki-0.2.6-SNAPSHOT-server.jar tdb.tdbloader 
--loc=tdb tdb.dump

but the size of the tdb is smaller than the original tdb

The loader produces more compact indexes than if the data has been loaded 
incrementally. This is even more the case for tdbloader2.

Also, if you have been deleting and adding, for 0.8, then the database can 
grow. This is addressed, but not totally fixed, in 0.9.X.

(the load is slower than if dumped in SPO order)

I tested the change here on that test file - I don't have a large corrupt 
database to try it on.

Any ideas of how to get it fixed are more than welcome.


Personally, I would adopt a 2 stream approach.

Do the approach above and also collect all the data together and start a fresh 
load of the database on another machine.


Doing it already.


        Andy


Thanks,
Emilio


        Good luck
        Andy


Regards, Emilio


-- Emilio Migueláñez Martín [email protected]


