B/ A different, better approach is to build a special version of TDB. The
changes needed are small but you need to build Jena.
These instructions apply to the code in SVN as it is now, today. Not the last
release, not last week. It's just easier to set up and explain from the current
code base, as a small recent change centralised the point you need to change and
also introduced an easy-to-use testing feature.
1/ svn co the Jena code from trunk.
Done
2/ Build Jena
mvn clean install
Done
It is easier to build and install than just package.
You must use the development releases of the other modules.
I don't think you need to set up maven to use the snapshot builds on Apache but
if you do:
Set <repository>
http://jena.apache.org/download/maven.html
3/ mvn eclipse:eclipse to use Eclipse if you plan to use that to edit the code.
Didn't set up maven or use Eclipse.
4/ Set up to use this build for tdbdump, e.g. the apache-jena or fuseki build.
For added ease, use the Fuseki server jar, which has everything in it:
java -cp fuseki-server.jar tdb.tdbdump --version
java -cp jena-fuseki/target/jena-fuseki-0.2.6-SNAPSHOT-server.jar tdb.tdbdump --version
Jena: VERSION: 2.10.0-SNAPSHOT
Jena: BUILD_DATE: 2013-01-28T21:00:30+0000
ARQ: VERSION: 2.10.0-SNAPSHOT
ARQ: BUILD_DATE: 2013-01-28T21:00:30+0000
TDB: VERSION: 0.10.0-SNAPSHOT
TDB: BUILD_DATE: 2013-01-28T21:00:30+0000
Check timestamps/version numbers.
5/ Test: create a small text file of a few triples.
--- D.ttl
@prefix : <http://example/> .
:s1 :p 1 .
:s2 :p 2 .
:s3 :q 3 .
:s2 :q 4 .
:s1 :q 5 .
---
tdbdump --data D.ttl should dump the file with triples clustered by subject.
(no - you do not need to load a database - --data is a recent feature for
testing)
java -cp jena-fuseki/target/jena-fuseki-0.2.6-SNAPSHOT-server.jar tdb.tdbdump --data D.ttl
<http://example/s1> <http://example/p> "1"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://example/s1> <http://example/q> "5"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://example/s2> <http://example/p> "2"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://example/s2> <http://example/q> "4"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://example/s3> <http://example/q> "3"^^<http://www.w3.org/2001/XMLSchema#integer> .
6/ Edit com.hp.hpl.jena.tdb.index.TupleTable, static method "chooseScanAllIndex"
Change:
-----
    if ( tupleLen != 4 )
        return indexes[0] ;
==>
    if ( tupleLen != 4 )
    {
        if ( indexes.length == 3 )
            return indexes[1] ;
        else
            return indexes[0] ;
    }
-----
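To make the effect of that edit concrete, here is a minimal standalone sketch of the selection logic. It is not the TDB source: the class and method names are hypothetical, plain strings stand in for the real index objects, the index order SPO, POS, OSP is an assumption inferred from this thread, and the `preferred` parameter is an illustrative addition so the same sketch covers trying indexes[2]. Only the tupleLen != 4 branch is modelled.

```java
class ScanIndexChoice {
    // Assumed order of the triple indexes, inferred from this thread:
    // indexes[1] dumps clustered by property, indexes[2] is described as OSP.
    static final String[] TRIPLE_INDEXES = { "SPO", "POS", "OSP" };

    /**
     * Hypothetical stand-in for the edited branch of chooseScanAllIndex.
     * Index names replace the real index objects; quads are not modelled.
     */
    static String chooseForTriples(String[] indexes, int preferred) {
        // Only deviate from the primary index when all three indexes exist.
        if (indexes.length == 3 && preferred >= 0 && preferred < 3)
            return indexes[preferred];
        return indexes[0];
    }

    public static void main(String[] args) {
        System.out.println(chooseForTriples(TRIPLE_INDEXES, 1));          // POS
        System.out.println(chooseForTriples(TRIPLE_INDEXES, 2));          // OSP
        System.out.println(chooseForTriples(new String[] { "SPO" }, 1));  // SPO
    }
}
```

The fallback to indexes[0] mirrors the else branch in the change above: if the table does not have exactly three indexes, the scan uses the primary one as before.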
7/ Rebuild.
Yes - the tests for TDB should pass!
8/ check the new version
tdbdump --version
check the change
tdbdump --data D.ttl
and it should be N-Triples clustered by property, different from the earlier output.
java -cp jena-fuseki/target/jena-fuseki-0.2.6-SNAPSHOT-server.jar tdb.tdbdump --data D.ttl
<http://example/s1> <http://example/p> "1"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://example/s2> <http://example/p> "2"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://example/s3> <http://example/q> "3"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://example/s2> <http://example/q> "4"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://example/s1> <http://example/q> "5"^^<http://www.w3.org/2001/XMLSchema#integer> .
Is it what you expect?
Yes.
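For anyone following along, the two dump orders of D.ttl can be reproduced with a small standalone sketch that sorts the five test triples by (subject, predicate, object) and by (predicate, object, subject). This is only an illustration under simplifying assumptions: the class and field names are hypothetical, and it compares local names and integer values directly, whereas TDB orders its indexes by internal node ids.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class TripleOrderSketch {
    /** One triple from D.ttl: local names for subject/predicate, integer object. */
    static final class Triple {
        final String s; final String p; final int o;
        Triple(String s, String p, int o) { this.s = s; this.p = p; this.o = o; }
        @Override public String toString() { return ":" + s + " :" + p + " " + o; }
    }

    // The triples in the order they appear in D.ttl.
    static final List<Triple> DATA = Arrays.asList(
        new Triple("s1", "p", 1), new Triple("s2", "p", 2), new Triple("s3", "q", 3),
        new Triple("s2", "q", 4), new Triple("s1", "q", 5));

    // Subject-first order: clusters the dump by subject.
    static final Comparator<Triple> SPO =
        Comparator.comparing((Triple t) -> t.s).thenComparing(t -> t.p).thenComparingInt(t -> t.o);

    // Predicate-first order: clusters the dump by property.
    static final Comparator<Triple> POS =
        Comparator.comparing((Triple t) -> t.p).thenComparingInt(t -> t.o).thenComparing(t -> t.s);

    /** Returns the triples as strings, sorted by the given order. */
    static List<String> dump(List<Triple> data, Comparator<Triple> order) {
        List<Triple> sorted = new ArrayList<>(data);
        sorted.sort(order);
        List<String> lines = new ArrayList<>();
        for (Triple t : sorted) lines.add(t.toString());
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(dump(DATA, SPO)); // [:s1 :p 1, :s1 :q 5, :s2 :p 2, :s2 :q 4, :s3 :q 3]
        System.out.println(dump(DATA, POS)); // [:s1 :p 1, :s2 :p 2, :s3 :q 3, :s2 :q 4, :s1 :q 5]
    }
}
```

The first list matches the subject-clustered dump from step 5 and the second matches the property-clustered dump from step 8.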
9/ Dump your database.
Hope there is a good index.
It works and no errors were reported; however, the size of the dump file is just
84MB, which is considerably smaller than the actual TDB database (~1GB).
Quite possible - especially if you have also been deleting stuff in the
database as well as adding.
You can also try indexes[2] not indexes[1] to use the OSP index.
Each dumps the entire database, but in different triple orders.
I also tried this change of index, and it gave me the same error:
Exception in thread "main" com.hp.hpl.jena.tdb.base.StorageException:
RecordRangeIterator: records not strictly increasing:
00000000021aa0a20000000006cffe6b000000000005233d //
00000000021a2c0a0000000006b85f9f000000000005233d
The OSP index is also broken.
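To see why the scan stops at that point, compare the two records quoted in the exception. The sketch below is not TDB's RecordRangeIterator; it is a hypothetical standalone check, assuming that equal-length lowercase hex strings compare in the same order as the underlying bytes. It reports the position of the first record that is not strictly greater than its predecessor.

```java
import java.util.Arrays;
import java.util.List;

class RecordOrderCheck {
    /**
     * Returns the position of the first record that is not strictly greater
     * than the one before it, or -1 if the whole list is strictly increasing.
     * Records are equal-length lowercase hex strings (an assumption).
     */
    static int firstOutOfOrder(List<String> hexRecords) {
        for (int i = 1; i < hexRecords.size(); i++) {
            if (hexRecords.get(i - 1).compareTo(hexRecords.get(i)) >= 0)
                return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        // The two adjacent records from the exception message above.
        List<String> fromError = Arrays.asList(
            "00000000021aa0a20000000006cffe6b000000000005233d",
            "00000000021a2c0a0000000006b85f9f000000000005233d");
        System.out.println(firstOutOfOrder(fromError)); // 1
    }
}
```

The second record starts with 00000000021a2c0a, which sorts before 00000000021aa0a2, so the pair is out of order and the index cannot be scanned past it.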
10/ Clean up maven to get rid of the temporary build.
rm -r REPO/org/apache/jena/
11/ Rebuild the database with tdbloader/tdbloader2.
java -cp jena-fuseki/target/jena-fuseki-0.2.6-SNAPSHOT-server.jar tdb.tdbloader
--loc=tdb tdb.dump
but the size of the tdb is smaller than the original tdb
The loader produces more compact indexes than if the data has been
loaded incrementally. This is even more the case for tdbloader2.
Also, if you have been deleting and adding, for 0.8, then the database
can grow. This is addressed, but not totally fixed, in 0.9.X.
(the load is slower than if dumped in SPO order)
I tested the change here on that test file - I don't have a large corrupt
database to try it on.
Any ideas of how to get it fixed are more than welcome.
Personally, I would adopt a two-stream approach.
Do the approach above and also collect all the data together and start a fresh load
of the database on another machine.
Doing it already.
Andy
Thanks,
Emilio
Good luck
Andy
Regards, Emilio
-- Emilio Migueláñez Martín [email protected]