Thanks for the details. Good to add to the collective experience.
This is one reason to parse the file to /dev/null before trying to load it.
It doesn't look like there is much you can do. Reading the man page for
bzip2recover, it's going to lose some data, and if the lost data is not
aligned to N-Triples line boundaries, it will break the parser. Only by
finding and fixing up the block file that is damaged (in the N-Triples
sense) will it recover most of the data.
Andy
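A minimal sketch of the "check before you load" idea, assuming bash and the standard bzip2/gzip tools. check_dump is a hypothetical helper name, not part of Jena; it only wraps the tools' own -t (test) mode, which the bzcat message quoted below also points to.

```shell
#!/usr/bin/env bash
# check_dump (hypothetical helper): test a compressed dump for integrity
# before starting a long load. Returns the tool's exit status, or 2 if the
# file extension is not recognised.
check_dump() {
  local file="$1"
  case "$file" in
    *.bz2) bzip2 -t -- "$file" ;;   # -t: decompress and discard, report bad blocks
    *.gz)  gzip  -t -- "$file" ;;   # -t: same idea for gzip files
    *)     echo "check_dump: unknown compression: $file" >&2
           return 2 ;;
  esac
}

# Possible usage (paths are placeholders):
#   check_dump latest-truthy.nt.bz2 && tdb2.xloader --loc DB latest-truthy.nt.bz2
```

This only proves the compressed container is intact. A parse-only pass (for example riot --validate with the output discarded) would additionally catch RDF syntax errors, at the cost of reading the whole file once more.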
On 14/02/2022 13:19, Neubert, Joachim wrote:
The error was in the binary:
lbzcat: "/zbw/var/wikidata/2022-02-03/rdf/latest-truthy.nt.bz2": compressed data error: bad block header magic
That created non-RDF input:
[nbt@e6810f891672 ~]$ bzcat /zbw/var/wikidata/2022-02-03/rdf/latest-truthy.nt.bz2 | sed -n '4052914958,4052914960p;4052914961q'
<http://www.wikidata.org/entity/Q85112545> <http://schema.org/description>
"\u0646\u062C\u0645 \u0641\u064A \u0643\u0648\u0643\u0628\u0629
\u0627\u0644\u062B\u0648\u0631"@ar .
bzcat: Compressed file ends unexpectedly;
perhaps it is corrupted? *Possible* reason follows.
bzcat: Success
Input file = /zbw/var/wikidata/2022-02-03/rdf/latest-truthy.nt.bz2,
output file = (stdout)
It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.
You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.
<http://www.wikidata.org/entity/Q85112545> <http://schema.org/description> "star in
the constellation Taurus"@en .
<https://www.wikidata.org/wiki/Special:EntityData/Q85112563>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Dataset> .
which in turn produced:
03:02:18 INFO Nodes :: Add: 4,052,000,000 latest-truthy.nt (Batch: 108,189 / Avg: 102,550)
03:02:26 ERROR riot :: [line: 4052914959, col: 80] Bad input stream [java.io.IOException: Unexpected end of stream]
Exception in thread "AsyncParser" org.apache.jena.riot.RiotException: [line: 4052914959, col: 80] Bad input stream [java.io.IOException: Unexpected end of stream]
at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.fatal(ErrorHandlerFactory.java:163)
at org.apache.jena.riot.lang.LangEngine.raiseException(LangEngine.java:148)
at org.apache.jena.riot.lang.LangEngine.nextToken(LangEngine.java:105)
at org.apache.jena.riot.lang.LangNTuple.parseTriple(LangNTuple.java:95)
at org.apache.jena.riot.lang.LangNTriples.parseOne(LangNTriples.java:61)
at org.apache.jena.riot.lang.LangNTriples.runParser(LangNTriples.java:53)
at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:43)
at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:186)
at org.apache.jena.riot.RDFParser.read(RDFParser.java:366)
at org.apache.jena.riot.RDFParser.parseURI(RDFParser.java:335)
at org.apache.jena.riot.RDFParser.parse(RDFParser.java:310)
at org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:552)
at org.apache.jena.tdb2.xloader.ProcBuildNodeTableX.lambda$exec2$0(ProcBuildNodeTableX.java:198)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
at org.apache.jena.tdb2.xloader.ProcBuildNodeTableX.lambda$exec2$1(ProcBuildNodeTableX.java:194)
at java.base/java.lang.Thread.run(Thread.java:829)
Cheers, Joachim
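The sed range used above can be derived from the RIOT error message itself. The following is a small sketch under that assumption; riot_error_line and context_range are hypothetical helper names, and the message format is taken from the log lines quoted in this message.

```shell
#!/usr/bin/env bash
# riot_error_line (hypothetical): read log text on stdin and print the line
# number from the first "[line: N, col: M]" marker found.
riot_error_line() {
  sed -n 's/.*\[line: \([0-9][0-9]*\), col: [0-9][0-9]*\].*/\1/p' | head -n 1
}

# context_range (hypothetical): print a sed script that shows the line before
# and after line N, then quits so the rest of the stream is not read.
context_range() {
  local n="$1"
  local start=$(( n > 1 ? n - 1 : 1 ))   # sed has no line 0
  printf '%d,%dp;%dq\n' "$start" "$(( n + 1 ))" "$(( n + 2 ))"
}

# Possible usage (file names are placeholders):
#   n=$(riot_error_line < load.log)
#   bzcat latest-truthy.nt.bz2 | sed -n "$(context_range "$n")"
```

Quitting right after the range matters for a dump of this size, since sed would otherwise keep reading to the end of the stream.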
-----Original Message-----
From: Andy Seaborne <a...@apache.org>
Sent: Monday, 14 February 2022 13:46
To: users@jena.apache.org
Subject: Re: AW: AW: AW: AW: xloader "Can't find gzip program"
On 14/02/2022 08:01, Neubert, Joachim wrote:
Thanks, Andy, the TDB2 assembler fixed it, and all worked well.
I've tried to load wikidata-truthy then, but apparently the bzip file
was damaged at line 4052914959 - have to try again
How annoying.
Is it an RDF syntax error or bad binary or something else?
--
My experience is that gz is faster to load.
bz2 emphasises compactness over speed.
Andy
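If a dump has to be loaded more than once, one option is to recompress it from bz2 to gz a single time. This is only a sketch, assuming bash (for pipefail) and the standard bzip2 and gzip tools; recompress_bz2_to_gz is a hypothetical helper name. With pipefail set, a damaged bz2 file makes the whole step fail instead of leaving a silently truncated .gz behind.

```shell
#!/usr/bin/env bash
# recompress_bz2_to_gz (hypothetical): turn FILE.bz2 into FILE.gz.
# Writes to a temporary name first and only renames on success.
recompress_bz2_to_gz() {
  local src="$1"
  local dst="${src%.bz2}.gz"
  # pipefail (bash): the pipeline fails if bzip2 fails, not only if gzip fails.
  if ( set -o pipefail; bzip2 -dc -- "$src" | gzip > "$dst.tmp" ); then
    mv -- "$dst.tmp" "$dst"
  else
    rm -f -- "$dst.tmp"
    return 1
  fi
}

# Possible usage (path is a placeholder):
#   recompress_bz2_to_gz latest-truthy.nt.bz2
```

The trade-off is disk space: the gz file will normally be larger than the bz2 one.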
Cheers, Joachim
-----Original Message-----
From: Andy Seaborne <a...@apache.org>
Sent: Saturday, 12 February 2022 11:15
To: users@jena.apache.org
Subject: Re: AW: AW: AW: xloader "Can't find gzip program"
Hi Joachim,
Aside: I've realised why the timestamps are fixed at "2022-01-30 15:03".
The build setup is for repeatable builds of releases. Any build from
the X.Y.Z release source, with the same JDK, will generate the byte-wise
same jar files.
Each release build fixes the timestamp and uses that, and it gets into
the POM as the property <project.build.outputTimestamp>. It only gets
updated when a release happens; otherwise the POM file would be
modified several times a week.
Thankfully, we have --version on most commands as well.
That's timestamps explained.
----
You seem to have run the TDB2 xloader, then given the text index
builder an assembler description for TDB1.
Fuseki with --loc determines the database type by looking at the file
layout, but assemblers don't.
The version output can be changed to say "TDB1" without too much
disruption. A small tweak that might have helped show this up earlier.
Andy
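The mismatch described above can be spotted before running the text indexer. The sketch below is a crude heuristic, not what Fuseki does internally: it greps the assembler file for the TDB1 namespace (http://jena.hpl.hp.com/2008/tdb#, as in the error output below) or the TDB2 namespace (http://jena.apache.org/2016/tdb#), and guesses the on-disk type from the directory layout, assuming a TDB2 location contains a Data-NNNN subdirectory and a TDB1 location holds its .dat files directly. All three function names are hypothetical.

```shell
#!/usr/bin/env bash
# assembler_tdb_kind (hypothetical): which TDB generation does the assembler
# description mention? Plain text match, no RDF parsing.
assembler_tdb_kind() {
  local desc="$1"
  if grep -qF 'http://jena.apache.org/2016/tdb#' "$desc"; then
    echo TDB2
  elif grep -qF 'http://jena.hpl.hp.com/2008/tdb#' "$desc"; then
    echo TDB1
  else
    echo unknown
  fi
}

# db_tdb_kind (hypothetical): guess the database type from the file layout.
# Assumption: TDB2 keeps its files under Data-NNNN/, TDB1 directly in the
# location directory.
db_tdb_kind() {
  local loc="$1"
  if ls -d "$loc"/Data-* >/dev/null 2>&1; then
    echo TDB2
  elif ls "$loc"/*.dat >/dev/null 2>&1; then
    echo TDB1
  else
    echo unknown
  fi
}

# check_assembler_matches_db (hypothetical): fail if the two guesses differ.
check_assembler_matches_db() {
  local a d
  a=$(assembler_tdb_kind "$1")
  d=$(db_tdb_kind "$2")
  if [ "$a" != "$d" ]; then
    echo "mismatch: assembler says $a, directory looks like $d" >&2
    return 1
  fi
}

# Possible usage, with the paths from this thread:
#   check_assembler_matches_db /tmp/temp.ttl /var/lib/fuseki/databases/temp
```

A description that mentions both namespaces is reported as TDB2, so this is a quick sanity check rather than a reliable detector.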
On 11/02/2022 23:06, Neubert, Joachim wrote:
Sorry, my fault: I've actually had jena-4.4.0 active, not 4.5.0-SNAPSHOT.
Now the loading works smoothly:
22:50:10 INFO Load node table = 62 seconds
22:50:10 INFO Load ingest data = 37 seconds
22:50:10 INFO Build index SPO = 7 seconds
22:50:10 INFO Build index POS = 12 seconds
22:50:10 INFO Build index OSP = 9 seconds
22:50:10 INFO Overall 127 seconds
22:50:10 INFO Overall 00h 02m 07s
22:50:10 INFO Triples loaded = 10000000
22:50:10 INFO Quads loaded = 0
22:50:10 INFO Overall Rate 78740 tuples per second
That's output from tdb2.xloader.
For 10 million up to 500 million triples (laptop), or maybe 1 billion
(server), also try "tdb2.tdbloader --loader=parallel"
However, the text indexing crashes when called like this:
java -cp $FUSEKI_HOME/fuseki-server.jar jena.textindexer --debug --desc=/tmp/temp.ttl
org.apache.jena.assembler.exceptions.AssemblerException: caught: Unable to check TDB lock owner, the lock file contents appear to be for a TDB2 database. Please try loading this location as a TDB2 database. See https://jena.apache.org/documentation/tdb/faqs.html for more information.
doing:
root: file:///tmp/temp.ttl#dataset with type: http://jena.hpl.hp.com/2008/tdb#DatasetTDB assembler class: class org.apache.jena.tdb.assembler.DatasetAssemblerTDB1
But that is TDB1
root: http://localhost/jena_example/#text_dataset with type: http://jena.apache.org/text#TextDataset assembler class: class org.apache.jena.query.text.assembler.TextDatasetAssembler
...
Caused by: org.apache.jena.tdb.base.file.FileException: Unable to check TDB lock owner, the lock file contents appear to be for a TDB2 database. Please try loading this location as a TDB2 database. See https://jena.apache.org/documentation/tdb/faqs.html for more information.
at org.apache.jena.tdb.base.file.LocationLock.getOwner(LocationLock.java:110)
org.apache.jena.tdb == TDB1
at org.apache.jena.tdb.base.file.LocationLock.canObtain(LocationLock.java:139)
at org.apache.jena.tdb.StoreConnection._makeAndCache(StoreConnection.java:262)
at org.apache.jena.tdb.StoreConnection.make(StoreConnection.java:226)
at org.apache.jena.tdb.StoreConnection.make(StoreConnection.java:240)
at org.apache.jena.tdb.transaction.DatasetGraphTransaction.<init>(DatasetGraphTransaction.java:72)
at org.apache.jena.tdb.sys.TDBMaker.createDirect(TDBMaker.java:114)
...
... 23 more
2022-02-11 22:50:12 ABORTED
cat /var/lib/fuseki/databases/temp/tdb.lock
32907
Cheers, Joachim