The error was in the binary:

lbzcat: "/zbw/var/wikidata/2022-02-03/rdf/latest-truthy.nt.bz2": compressed data error: bad block header magic

That created non-RDF input:

[nbt@e6810f891672 ~]$ bzcat /zbw/var/wikidata/2022-02-03/rdf/latest-truthy.nt.bz2 | sed -n '4052914958,4052914960p;4052914961q'
<http://www.wikidata.org/entity/Q85112545> <http://schema.org/description> "\u0646\u062C\u0645 \u0641\u064A \u0643\u0648\u0643\u0628\u0629 \u0627\u0644\u062B\u0648\u0631"@ar .

bzcat: Compressed file ends unexpectedly;
        perhaps it is corrupted?  *Possible* reason follows.
bzcat: Success
        Input file = /zbw/var/wikidata/2022-02-03/rdf/latest-truthy.nt.bz2, output file = (stdout)

It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.

<http://www.wikidata.org/entity/Q85112545> <http://schema.org/description> "star in the constellation Taurus"@en .
<https://www.wikidata.org/wiki/Special:EntityData/Q85112563> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Dataset> .

which in turn produced:

03:02:18 INFO  Nodes :: Add: 4,052,000,000 latest-truthy.nt (Batch: 108,189 / Avg: 102,550)
03:02:26 ERROR riot :: [line: 4052914959, col: 80] Bad input stream [java.io.IOException: Unexpected end of stream]
Exception in thread "AsyncParser" org.apache.jena.riot.RiotException: [line: 4052914959, col: 80] Bad input stream [java.io.IOException: Unexpected end of stream]
        at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.fatal(ErrorHandlerFactory.java:163)
        at org.apache.jena.riot.lang.LangEngine.raiseException(LangEngine.java:148)
        at org.apache.jena.riot.lang.LangEngine.nextToken(LangEngine.java:105)
        at org.apache.jena.riot.lang.LangNTuple.parseTriple(LangNTuple.java:95)
        at org.apache.jena.riot.lang.LangNTriples.parseOne(LangNTriples.java:61)
        at org.apache.jena.riot.lang.LangNTriples.runParser(LangNTriples.java:53)
        at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:43)
        at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:186)
        at org.apache.jena.riot.RDFParser.read(RDFParser.java:366)
        at org.apache.jena.riot.RDFParser.parseURI(RDFParser.java:335)
        at org.apache.jena.riot.RDFParser.parse(RDFParser.java:310)
        at org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:552)
        at org.apache.jena.tdb2.xloader.ProcBuildNodeTableX.lambda$exec2$0(ProcBuildNodeTableX.java:198)
        at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
        at org.apache.jena.tdb2.xloader.ProcBuildNodeTableX.lambda$exec2$1(ProcBuildNodeTableX.java:194)
        at java.base/java.lang.Thread.run(Thread.java:829)

Cheers, Joachim

> -----Original Message-----
> From: Andy Seaborne <a...@apache.org>
> Sent: Monday, 14 February 2022 13:46
> To: users@jena.apache.org
> Subject: Re: AW: AW: AW: AW: xloader "Can't find gzip program"
>
>
> On 14/02/2022 08:01, Neubert, Joachim wrote:
> > Thanks, Andy, the TDB2 assembler fixed it, and all worked well.
> >
> > I've tried to load wikidata-truthy then, but apparently the bzip file
> > was damaged at line 4052914959 - have to try again
>
> How annoying.
>
> Is it an RDF syntax error or bad binary or something else?
>
> --
>
> My experience is that gz is faster to load.
>
> bz2 emphasises compactness over speed.
>
>      Andy
>
> >
> > Cheers, Joachim
> >
> >> -----Original Message-----
> >> From: Andy Seaborne <a...@apache.org>
> >> Sent: Saturday, 12 February 2022 11:15
> >> To: users@jena.apache.org
> >> Subject: Re: AW: AW: AW: xloader "Can't find gzip program"
> >>
> >> Hi Joachim,
> >>
> >> Aside: I've realised why the timestamps are fixed at "2022-01-30 15:03".
> >>
> >> The build setup is for repeatable builds of releases. Any build from
> >> the X.Y.Z release source, with the same JDK, will generate the byte-wise
> >> same jar files.
> >>
> >> Each release build fixes the timestamp and uses that, and it gets in
> >> the POM as property <project.build.outputTimestamp>. It only gets
> >> updated when a release happens, otherwise the POM file is going to get
> >> modified several times a week.
> >>
> >> Thankfully, we have --version on most commands as well.
> >>
> >> That's timestamps explained.
> >>
> >> ----
> >>
> >> You seem to have run the TDB2 xloader, then given the text index
> >> builder an assembler description for TDB1.
> >>
> >> Fuseki with --loc determines the database type by looking at the file
> >> layout, but assemblers don't.
> >>
> >> The version output can be changed to say "TDB1" without too much
> >> disruption. A small tweak that might have helped show this up earlier.
> >>
> >>      Andy
> >>
> >> On 11/02/2022 23:06, Neubert, Joachim wrote:
> >>> Sorry, my fault: I've actually had jena-4.4.0 active, not 4.5.0-SNAPSHOT.
> >>>
> >>> Now the loading works smoothly:
> >>>
> >>> 22:50:10 INFO Load node table = 62 seconds
> >>> 22:50:10 INFO Load ingest data = 37 seconds
> >>> 22:50:10 INFO Build index SPO = 7 seconds
> >>> 22:50:10 INFO Build index POS = 12 seconds
> >>> 22:50:10 INFO Build index OSP = 9 seconds
> >>> 22:50:10 INFO Overall 127 seconds
> >>> 22:50:10 INFO Overall 00h 02m 07s
> >>> 22:50:10 INFO Triples loaded = 10000000
> >>> 22:50:10 INFO Quads loaded = 0
> >>> 22:50:10 INFO Overall Rate 78740 tuples per second
> >>
> >> That's output from tdb2.xloader.
> >>
> >> At 10m up to 500m (laptop) or maybe 1B (server), triples, also try
> >> "tdb2.tdbloader --loader=parallel"
> >>
> >>> However, the text indexing crashes, when called like that:
> >>>
> >>> java -cp $FUSEKI_HOME/fuseki-server.jar jena.textindexer --debug --desc=/tmp/temp.ttl
> >>>
> >>> org.apache.jena.assembler.exceptions.AssemblerException: caught: Unable to check TDB lock owner, the lock file contents appear to be for a TDB2 database. Please try loading this location as a TDB2 database. See https://jena.apache.org/documentation/tdb/faqs.html for more information.
> >>> doing:
> >>> root: file:///tmp/temp.ttl#dataset with type: http://jena.hpl.hp.com/2008/tdb#DatasetTDB assembler class: class org.apache.jena.tdb.assembler.DatasetAssemblerTDB1
> >>
> >> But that is TDB1
> >>
> >>> root: http://localhost/jena_example/#text_dataset with type: http://jena.apache.org/text#TextDataset assembler class: class org.apache.jena.query.text.assembler.TextDatasetAssembler
> >>>
> >> ...
> >>> Caused by: org.apache.jena.tdb.base.file.FileException: Unable to check TDB lock owner, the lock file contents appear to be for a TDB2 database. Please try loading this location as a TDB2 database. See https://jena.apache.org/documentation/tdb/faqs.html for more information.
> >>>   at org.apache.jena.tdb.base.file.LocationLock.getOwner(LocationLock.java:110)
> >>
> >> org.apache.jena.tdb == TDB1
> >>
> >>>   at org.apache.jena.tdb.base.file.LocationLock.canObtain(LocationLock.java:139)
> >>>   at org.apache.jena.tdb.StoreConnection._makeAndCache(StoreConnection.java:262)
> >>>   at org.apache.jena.tdb.StoreConnection.make(StoreConnection.java:226)
> >>>   at org.apache.jena.tdb.StoreConnection.make(StoreConnection.java:240)
> >>>   at org.apache.jena.tdb.transaction.DatasetGraphTransaction.<init>(DatasetGraphTransaction.java:72)
> >>>   at org.apache.jena.tdb.sys.TDBMaker.createDirect(TDBMaker.java:114)
> >> ...
> >>
> >>>   ... 23 more
> >>> 2022-02-11 22:50:12 ABORTED
> >>>
> >>> cat /var/lib/fuseki/databases/temp/tdb.lock
> >>> 32907
> >>>
> >>> Cheers, Joachim
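
Note on the damaged bz2 file reported at the top of this thread: the archive can be scanned end to end before starting a long xloader run. bzip2's own message points to its -tvv option for this. The following is a minimal Python sketch of the same idea, assuming Python 3 with the standard bz2 module is available on the load host. scan_bz2 is a hypothetical helper name, not part of Jena or bzip2, and the line count it returns is only a lower bound, because a read that fails discards whatever it had decompressed so far.

```python
import bz2


def scan_bz2(path, chunk_size=1 << 20):
    """Stream-decompress `path` and report whether it is readable to the end.

    Returns (ok, lines_seen, error_message). `lines_seen` counts newlines in
    the chunks that were read successfully, so on failure it is a lower bound
    for the line at which a parser would stop.
    """
    lines_seen = 0
    try:
        with bz2.open(path, "rb") as stream:
            while True:
                chunk = stream.read(chunk_size)
                if not chunk:
                    break  # clean end of the compressed stream
                lines_seen += chunk.count(b"\n")
    except (OSError, EOFError) as exc:
        # OSError: invalid compressed data; EOFError: file ends too early.
        return False, lines_seen, f"{type(exc).__name__}: {exc}"
    return True, lines_seen, ""
```

Calling scan_bz2 on the downloaded file and checking the first element of the result before loading should catch a truncated or damaged archive, at the cost of one extra decompression pass.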