On 21/02/2022 08:27, Neubert, Joachim wrote:
I've reloaded the GND dataset at http://zbw.eu/beta/sparql/gnd/query with
4.5.0-SNAPSHOT. The sources were a 133G .nt.gz file, plus several small .ttl
files with ontology etc. I loaded the large one with tdb2.xloader, and
immediately after that the smaller ones with tdb2.tdbloader (see protocol at
https://zbw.eu/beta/tmp/fuseki/create_tdb_20220220.log).
What's the URL for the data files?
Two things smelled fishy in this load:
1) The tdb2.tdbstats call after the loading looped at 100% CPU, and I had to
kill it after an hour or so (this is reproducible)
Unclear. If I can get the data, I can see if it happens here.
2) some files remained in the fuseki/databases/temp directory (1.3G
triples.tmp.gz, empty quads.tmp.gz, and a load.json with
You can delete the files after the xloader has finished.
{
"ingested" : "2022-02-20T13:15:45.528+00:00" ,
"data" : [ "../var/gnd/2021-11/src/GND.utf8.ttl.gz" ] ,
"triples" : 165639860 ,
"quads" : 0
}
Just give all the files to a single run of tdb2.tdbloader
--loader=parallel. At 165e6 triples, it should be significantly faster than
xloader - there isn't a benefit to xloader.
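As a rough illustration of the single-run suggestion, here is a minimal sketch that assembles one tdb2.tdbloader command line covering all the source files. The database directory and the file names are placeholders, not the actual paths from this load; --loc is the usual option for the database location. The helper name build_tdbloader_command is made up for this example.

```python
import shlex
from typing import List, Sequence


def build_tdbloader_command(db_dir: str, data_files: Sequence[str]) -> List[str]:
    """Build a single tdb2.tdbloader invocation that covers every source file."""
    if not data_files:
        raise ValueError("at least one data file is required")
    # One run, parallel loader, all files listed after the database location.
    return ["tdb2.tdbloader", "--loader=parallel", "--loc", db_dir, *data_files]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    cmd = build_tdbloader_command(
        "databases/gnd",
        ["GND.nt.gz", "ontology.ttl"],
    )
    # Print the command so it can be checked before anything is loaded.
    print(shlex.join(cmd))
    # To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Printing the command first makes it easy to confirm that the large file and the small ontology files all end up in the same run, rather than in separate xloader and tdbloader steps.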
Text indexing however worked, and also a few example queries. However, a basic query like
"?x gndo:DifferentiatedPerson ." does not work any more.
Any idea what could have gone wrong?
Cheers, Joachim
Andy