Thanks for the details.  Good to add to the collective experience.

This is one reason to parse the file to /dev/null before trying to load it.
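
As a rough illustration of that pre-check, at the compression level only: the Python sketch below decompresses the whole .bz2 file and throws the output away, so a truncated or corrupt stream shows up before a long load starts. The function name check_bz2_stream and the chunk size are made up for this example. It does not validate the RDF itself; that still needs a parser run.

```python
import bz2


def check_bz2_stream(path, chunk_size=1 << 20):
    """Decompress `path` completely, discarding the output.

    Returns (ok, lines_seen, error_text). `lines_seen` counts the newline
    bytes that were decompressed before the stream ended or failed.
    """
    lines_seen = 0
    try:
        with bz2.open(path, "rb") as stream:
            while True:
                chunk = stream.read(chunk_size)
                if not chunk:
                    break
                lines_seen += chunk.count(b"\n")
    except (OSError, EOFError) as exc:
        # OSError: invalid compressed data; EOFError: file ends mid-stream.
        return False, lines_seen, f"{type(exc).__name__}: {exc}"
    return True, lines_seen, None
```

Usage would be along the lines of `ok, lines, err = check_bz2_stream("latest-truthy.nt.bz2")`, where a false `ok` means the file is not worth handing to the loader.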

It doesn't look like there is much you can do. Going by the man page for bzip2recover, it is going to lose some data, and if the lost data is not aligned to N-Triples line boundaries, it will break the parser. Only by finding and fixing up the damaged (in the N-Triples sense) block file will most of the data be recovered.
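
To make the alignment problem concrete: where a block is lost, the end of one line and the start of another can end up spliced together, and a strict parser stops there. The Python sketch below is a line-shape heuristic, not an N-Triples parser, and it is not something bzip2recover or Jena provides. The names (partition_lines and the simplified regular expressions) are hypothetical. It separates lines that look like one complete triple from lines that do not, so the damaged ones can be inspected by hand.

```python
import re

# Simplified token shapes (assumption: good enough to spot cut or spliced
# lines; a real N-Triples parser is stricter than this).
_IRI = r"<[^<>\s]*>"
_BNODE = r"_:[^\s]+"
_LITERAL = r'"(?:[^"\\]|\\.)*"(?:@[A-Za-z]+(?:-[A-Za-z0-9]+)*|\^\^<[^<>\s]*>)?'

_TRIPLE = re.compile(
    rf"^\s*(?:{_IRI}|{_BNODE})\s+{_IRI}\s+(?:{_IRI}|{_BNODE}|{_LITERAL})\s*\.\s*$"
)


def partition_lines(lines):
    """Split lines into (kept, rejected) by whether they look like one triple.

    Blank lines and '#' comment lines are kept, since N-Triples allows them.
    """
    kept, rejected = [], []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#") or _TRIPLE.match(line):
            kept.append(line)
        else:
            rejected.append(line)
    return kept, rejected
```

The rejected list is the interesting part: it should be short, and it shows where the recovered data needs fixing up before a load is attempted.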

    Andy

On 14/02/2022 13:19, Neubert, Joachim wrote:
The error was in the binary:
lbzcat: "/zbw/var/wikidata/2022-02-03/rdf/latest-truthy.nt.bz2": compressed data error: bad block header magic

That created non-RDF input:

  [nbt@e6810f891672 ~]$ bzcat /zbw/var/wikidata/2022-02-03/rdf/latest-truthy.nt.bz2 | sed -n '4052914958,4052914960p;4052914961q'
<http://www.wikidata.org/entity/Q85112545> <http://schema.org/description> "\u0646\u062C\u0645 \u0641\u064A \u0643\u0648\u0643\u0628\u0629 \u0627\u0644\u062B\u0648\u0631"@ar .

bzcat: Compressed file ends unexpectedly;
         perhaps it is corrupted?  *Possible* reason follows.
bzcat: Success
         Input file = /zbw/var/wikidata/2022-02-03/rdf/latest-truthy.nt.bz2, output file = (stdout)

It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.

<http://www.wikidata.org/entity/Q85112545> <http://schema.org/description> "star in the constellation Taurus"@en .
<https://www.wikidata.org/wiki/Special:EntityData/Q85112563> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Dataset> .

which in turn produced:

03:02:18 INFO  Nodes           :: Add: 4,052,000,000 latest-truthy.nt (Batch: 108,189 / Avg: 102,550)
03:02:26 ERROR riot            :: [line: 4052914959, col: 80] Bad input stream [java.io.IOException: Unexpected end of stream]
Exception in thread "AsyncParser" org.apache.jena.riot.RiotException: [line: 4052914959, col: 80] Bad input stream [java.io.IOException: Unexpected end of stream]
         at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.fatal(ErrorHandlerFactory.java:163)
         at org.apache.jena.riot.lang.LangEngine.raiseException(LangEngine.java:148)
         at org.apache.jena.riot.lang.LangEngine.nextToken(LangEngine.java:105)
         at org.apache.jena.riot.lang.LangNTuple.parseTriple(LangNTuple.java:95)
         at org.apache.jena.riot.lang.LangNTriples.parseOne(LangNTriples.java:61)
         at org.apache.jena.riot.lang.LangNTriples.runParser(LangNTriples.java:53)
         at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:43)
         at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:186)
         at org.apache.jena.riot.RDFParser.read(RDFParser.java:366)
         at org.apache.jena.riot.RDFParser.parseURI(RDFParser.java:335)
         at org.apache.jena.riot.RDFParser.parse(RDFParser.java:310)
         at org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:552)
         at org.apache.jena.tdb2.xloader.ProcBuildNodeTableX.lambda$exec2$0(ProcBuildNodeTableX.java:198)
         at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
         at org.apache.jena.tdb2.xloader.ProcBuildNodeTableX.lambda$exec2$1(ProcBuildNodeTableX.java:194)
         at java.base/java.lang.Thread.run(Thread.java:829)

Cheers, Joachim

-----Original Message-----
From: Andy Seaborne <a...@apache.org>
Sent: Monday, 14 February 2022 13:46
To: users@jena.apache.org
Subject: Re: AW: AW: AW: AW: xloader "Can't find gzip program"



On 14/02/2022 08:01, Neubert, Joachim wrote:
Thanks, Andy, the TDB2 assembler fixed it, and all worked well.

I then tried to load wikidata-truthy, but apparently the bzip2 file
was damaged at line 4052914959, so I have to try again.

How annoying.

Is it an RDF syntax error or bad binary or something else?

--

My experience is that gz is faster to load.

bz2 emphasizes compactness over speed.

      Andy


Cheers, Joachim

-----Original Message-----
From: Andy Seaborne <a...@apache.org>
Sent: Saturday, 12 February 2022 11:15
To: users@jena.apache.org
Subject: Re: AW: AW: AW: xloader "Can't find gzip program"

Hi Joachim,

Aside: I've realised why the timestamps are fixed at "2022-01-30 15:03".

The build setup is for repeatable builds of releases. Any build from
the X.Y.Z release source, with the same JDK, will generate
byte-for-byte identical jar files.

Each release build fixes the timestamp and uses that, and it goes into
the POM as the property <project.build.outputTimestamp>. It only gets
updated when a release happens; otherwise the POM file would be
modified several times a week.

Thankfully, we have --version on most commands as well.

That's timestamps explained.

----

You seem to have run the TDB2 xloader, then given the text index
builder an assembler description for TDB1.

Fuseki with --loc determines the database type by looking at the file
layout, but assemblers don't.
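
For illustration only, here is roughly what "looking at the file layout" could mean, as a Python sketch. This is not Fuseki's code. The function name guess_tdb_layout is hypothetical, and the two layout rules are assumptions: that a TDB2 location holds subdirectories named Data-NNNN, and that a TDB1 location holds files such as nodes.dat directly.

```python
import re
from pathlib import Path


def guess_tdb_layout(location):
    """Guess "TDB1", "TDB2" or "unknown" from the files under `location`."""
    root = Path(location)
    if not root.is_dir():
        return "unknown"
    # Assumption: TDB2 stores its data in subdirectories named Data-NNNN.
    for entry in root.iterdir():
        if entry.is_dir() and re.fullmatch(r"Data-\d+", entry.name):
            return "TDB2"
    # Assumption: TDB1 keeps files such as nodes.dat directly in the directory.
    if (root / "nodes.dat").is_file():
        return "TDB1"
    return "unknown"
```

A check like this before running the text indexer, e.g. `guess_tdb_layout("/var/lib/fuseki/databases/temp")`, would flag a mismatch between the database on disk and the dataset type named in the assembler file.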

The version output can be changed to say "TDB1" without too much
disruption. A small tweak that might have helped show this up earlier.

       Andy

On 11/02/2022 23:06, Neubert, Joachim wrote:
Sorry, my fault: I actually had jena-4.4.0 active, not 4.5.0-SNAPSHOT.

Now the loading works smoothly:

22:50:10 INFO  Load node table  = 62 seconds
22:50:10 INFO  Load ingest data = 37 seconds
22:50:10 INFO  Build index SPO  = 7 seconds
22:50:10 INFO  Build index POS  = 12 seconds
22:50:10 INFO  Build index OSP  = 9 seconds
22:50:10 INFO  Overall          127 seconds
22:50:10 INFO  Overall          00h 02m 07s
22:50:10 INFO  Triples loaded   = 10000000
22:50:10 INFO  Quads loaded     = 0
22:50:10 INFO  Overall Rate     78740 tuples per second

That's output from tdb2.xloader.

For 10 million up to 500 million triples (laptop), or maybe 1 billion
(server), also try "tdb2.tdbloader --loader=parallel".

However, the text indexing crashes when called like this:

java -cp $FUSEKI_HOME/fuseki-server.jar jena.textindexer --debug --desc=/tmp/temp.ttl

org.apache.jena.assembler.exceptions.AssemblerException: caught: Unable to check TDB lock owner, the lock file contents appear to be for a TDB2 database.  Please try loading this location as a TDB2 database. See https://jena.apache.org/documentation/tdb/faqs.html for more information.
     doing:
       root: file:///tmp/temp.ttl#dataset with type: http://jena.hpl.hp.com/2008/tdb#DatasetTDB assembler class: class org.apache.jena.tdb.assembler.DatasetAssemblerTDB1

But that is TDB1

       root: http://localhost/jena_example/#text_dataset with type: http://jena.apache.org/text#TextDataset assembler class: class org.apache.jena.query.text.assembler.TextDatasetAssembler

...
Caused by: org.apache.jena.tdb.base.file.FileException: Unable to check TDB lock owner, the lock file contents appear to be for a TDB2 database. Please try loading this location as a TDB2 database. See https://jena.apache.org/documentation/tdb/faqs.html for more information.
           at org.apache.jena.tdb.base.file.LocationLock.getOwner(LocationLock.java:110)

org.apache.jena.tdb == TDB1

           at org.apache.jena.tdb.base.file.LocationLock.canObtain(LocationLock.java:139)
           at org.apache.jena.tdb.StoreConnection._makeAndCache(StoreConnection.java:262)
           at org.apache.jena.tdb.StoreConnection.make(StoreConnection.java:226)
           at org.apache.jena.tdb.StoreConnection.make(StoreConnection.java:240)
           at org.apache.jena.tdb.transaction.DatasetGraphTransaction.<init>(DatasetGraphTransaction.java:72)
           at org.apache.jena.tdb.sys.TDBMaker.createDirect(TDBMaker.java:114)
...

           ... 23 more
2022-02-11 22:50:12 ABORTED

cat /var/lib/fuseki/databases/temp/tdb.lock
32907

Cheers, Joachim
