The error was in the binary:
lbzcat: "/zbw/var/wikidata/2022-02-03/rdf/latest-truthy.nt.bz2": compressed data error: bad block header magic

That created non-RDF input:

 [nbt@e6810f891672 ~]$ bzcat /zbw/var/wikidata/2022-02-03/rdf/latest-truthy.nt.bz2 | sed -n '4052914958,4052914960p;4052914961q'
<http://www.wikidata.org/entity/Q85112545> <http://schema.org/description> "\u0646\u062C\u0645 \u0641\u064A \u0643\u0648\u0643\u0628\u0629 \u0627\u0644\u062B\u0648\u0631"@ar .

bzcat: Compressed file ends unexpectedly;
        perhaps it is corrupted?  *Possible* reason follows.
bzcat: Success
        Input file = /zbw/var/wikidata/2022-02-03/rdf/latest-truthy.nt.bz2, output file = (stdout)

It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.

<http://www.wikidata.org/entity/Q85112545> <http://schema.org/description> "star in the constellation Taurus"@en .
<https://www.wikidata.org/wiki/Special:EntityData/Q85112563> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Dataset> .

which in turn produced:

03:02:18 INFO  Nodes           :: Add: 4,052,000,000 latest-truthy.nt (Batch: 108,189 / Avg: 102,550)
03:02:26 ERROR riot            :: [line: 4052914959, col: 80] Bad input stream [java.io.IOException: Unexpected end of stream]
Exception in thread "AsyncParser" org.apache.jena.riot.RiotException: [line: 4052914959, col: 80] Bad input stream [java.io.IOException: Unexpected end of stream]
        at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.fatal(ErrorHandlerFactory.java:163)
        at org.apache.jena.riot.lang.LangEngine.raiseException(LangEngine.java:148)
        at org.apache.jena.riot.lang.LangEngine.nextToken(LangEngine.java:105)
        at org.apache.jena.riot.lang.LangNTuple.parseTriple(LangNTuple.java:95)
        at org.apache.jena.riot.lang.LangNTriples.parseOne(LangNTriples.java:61)
        at org.apache.jena.riot.lang.LangNTriples.runParser(LangNTriples.java:53)
        at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:43)
        at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:186)
        at org.apache.jena.riot.RDFParser.read(RDFParser.java:366)
        at org.apache.jena.riot.RDFParser.parseURI(RDFParser.java:335)
        at org.apache.jena.riot.RDFParser.parse(RDFParser.java:310)
        at org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:552)
        at org.apache.jena.tdb2.xloader.ProcBuildNodeTableX.lambda$exec2$0(ProcBuildNodeTableX.java:198)
        at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
        at org.apache.jena.tdb2.xloader.ProcBuildNodeTableX.lambda$exec2$1(ProcBuildNodeTableX.java:194)
        at java.base/java.lang.Thread.run(Thread.java:829)
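
A check of the archive before starting the load would probably have caught the damage hours earlier. The following is only a rough sketch, assuming Python 3 is at hand; `check_bz2` is a made-up helper name and not part of Jena. It stream-decompresses the file and reports whether the end of the bzip2 stream was reached cleanly, counting newlines along the way.

```python
import bz2


def check_bz2(path, chunk_size=1 << 20):
    """Stream-decompress `path` and return (ok, newline_count, message).

    Hypothetical helper for illustration only. A truncated archive
    surfaces as EOFError, corrupted block data as OSError.
    """
    lines = 0
    try:
        with bz2.open(path, "rb") as stream:
            while True:
                chunk = stream.read(chunk_size)
                if not chunk:
                    break  # clean end of the compressed stream
                lines += chunk.count(b"\n")
    except (EOFError, OSError) as err:
        # `lines` holds the newlines seen before decompression failed
        return False, lines, f"{type(err).__name__}: {err}"
    return True, lines, "ok"


if __name__ == "__main__":
    import sys

    ok, lines, message = check_bz2(sys.argv[1])
    print(f"ok={ok} newlines={lines} message={message}")
    sys.exit(0 if ok else 1)
```

The `bzip2 -tvv` test that the bzcat message above suggests does the same job from the command line. The Python variant is only useful if the approximate line count at the point of failure is wanted as well.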

Cheers, Joachim

> -----Original Message-----
> From: Andy Seaborne <a...@apache.org>
> Sent: Monday, 14 February 2022 13:46
> To: users@jena.apache.org
> Subject: Re: AW: AW: AW: AW: xloader "Can't find gzip program"
> 
> 
> 
> On 14/02/2022 08:01, Neubert, Joachim wrote:
> > Thanks, Andy, the TDB2 assembler fixed it, and all worked well.
> >
> > I then tried to load wikidata-truthy, but apparently the bzip file
> > was damaged at line 4052914959, so I will have to try again.
> 
> How annoying.
> 
> Is it an RDF syntax error or bad binary or something else?
> 
> --
> 
> My experience is that gz is faster to load.
> 
> bz2 emphasises compactness over speed.
> 
>      Andy
> 
> >
> > Cheers, Joachim
> >
> >> -----Original Message-----
> >> From: Andy Seaborne <a...@apache.org>
> >> Sent: Saturday, 12 February 2022 11:15
> >> To: users@jena.apache.org
> >> Subject: Re: AW: AW: AW: xloader "Can't find gzip program"
> >>
> >> Hi Joachim,
> >>
> >> Aside: I've realised why the timestamps are fixed at "2022-01-30 15:03".
> >>
> >> The build setup is for repeatable builds of releases. Any build from
> >> the X.Y.Z release source, with the same JDK, will generate
> >> byte-for-byte identical jar files.
> >>
> >> Each release build fixes the timestamp and uses that, and it goes into
> >> the POM as property <project.build.outputTimestamp>. It only gets
> >> updated when a release happens; otherwise the POM file would be
> >> modified several times a week.
> >>
> >> Thankfully, we have --version on most commands as well.
> >>
> >> That's timestamps explained.
> >>
> >> ----
> >>
> >> You seem to have run the TDB2 xloader, then given the text index
> >> builder an assembler description for TDB1.
> >>
> >> Fuseki with --loc determines the database type by looking at the file
> >> layout, but assemblers don't.
> >>
> >> The version output can be changed to say "TDB1" without too much
> >> disruption. A small tweak that might have helped show this up earlier.
> >>
> >>       Andy
> >>
> >> On 11/02/2022 23:06, Neubert, Joachim wrote:
> >>> Sorry, my fault: I actually had jena-4.4.0 active, not 4.5.0-SNAPSHOT.
> >>>
> >>> Now the loading works smoothly:
> >>>
> >>> 22:50:10 INFO  Load node table  = 62 seconds
> >>> 22:50:10 INFO  Load ingest data = 37 seconds
> >>> 22:50:10 INFO  Build index SPO  = 7 seconds
> >>> 22:50:10 INFO  Build index POS  = 12 seconds
> >>> 22:50:10 INFO  Build index OSP  = 9 seconds
> >>> 22:50:10 INFO  Overall          127 seconds
> >>> 22:50:10 INFO  Overall          00h 02m 07s
> >>> 22:50:10 INFO  Triples loaded   = 10000000
> >>> 22:50:10 INFO  Quads loaded     = 0
> >>> 22:50:10 INFO  Overall Rate     78740 tuples per second
> >>
> >> That's output from tdb2.xloader.
> >>
> >> At 10m up to 500m triples (laptop), or maybe 1B (server), also try
> >> "tdb2.tdbloader --loader=parallel"
> >>
> >>> However, the text indexing crashes when called like this:
> >>>
> >>> java -cp $FUSEKI_HOME/fuseki-server.jar jena.textindexer --debug --desc=/tmp/temp.ttl
> >>>
> >>> org.apache.jena.assembler.exceptions.AssemblerException: caught: Unable to check TDB lock owner, the lock file contents appear to be for a TDB2 database.  Please try loading this location as a TDB2 database. See https://jena.apache.org/documentation/tdb/faqs.html for more information.
> >>>     doing:
> >>>       root: file:///tmp/temp.ttl#dataset with type: http://jena.hpl.hp.com/2008/tdb#DatasetTDB assembler class: class org.apache.jena.tdb.assembler.DatasetAssemblerTDB1
> >>
> >> But that is TDB1
> >>
> >>>       root: http://localhost/jena_example/#text_dataset with type: http://jena.apache.org/text#TextDataset assembler class: class org.apache.jena.query.text.assembler.TextDatasetAssembler
> >>>
> >> ...
> >>> Caused by: org.apache.jena.tdb.base.file.FileException: Unable to check TDB lock owner, the lock file contents appear to be for a TDB2 database. Please try loading this location as a TDB2 database. See https://jena.apache.org/documentation/tdb/faqs.html for more information.
> >>>           at org.apache.jena.tdb.base.file.LocationLock.getOwner(LocationLock.java:110)
> >>
> >> org.apache.jena.tdb == TDB1
> >>
> >>>           at org.apache.jena.tdb.base.file.LocationLock.canObtain(LocationLock.java:139)
> >>>           at org.apache.jena.tdb.StoreConnection._makeAndCache(StoreConnection.java:262)
> >>>           at org.apache.jena.tdb.StoreConnection.make(StoreConnection.java:226)
> >>>           at org.apache.jena.tdb.StoreConnection.make(StoreConnection.java:240)
> >>>           at org.apache.jena.tdb.transaction.DatasetGraphTransaction.<init>(DatasetGraphTransaction.java:72)
> >>>           at org.apache.jena.tdb.sys.TDBMaker.createDirect(TDBMaker.java:114)
> >> ...
> >>
> >>>           ... 23 more
> >>> 2022-02-11 22:50:12 ABORTED
> >>>
> >>> cat /var/lib/fuseki/databases/temp/tdb.lock
> >>> 32907
> >>>
> >>> Cheers, Joachim
