Yesterday's mail didn't get through, so here it is again.

The data from the broken load is temporarily available at
http://134.245.93.72/beta/tmp.

I've now invoked 

/opt/jena/bin/tdb2.tdbloader --loader=parallel \
  --loc=/zbw/var/lib/fuseki/databases/temp \
  ../var/gnd/2021-11/src/GND.utf8.ttl.gz \
  ../var/gnd/2021-11/src/gnd-sc.ttl \
  ../var/gnd/2021-11/src/gnd-sc_notation.ttl \
  ../var/gnd/2021-11/src/gndo.ttl

and saw a steadily decreasing average rate (see the log below). On the other
hand, the total load time of about 35 minutes is nice. tdbstats ran correctly
afterwards, and the query for gndo:DifferentiatedPerson works as expected.

Cheers - Joachim

Jena version: 4.5.0-SNAPSHOT
JAVA_OPTS: -d64 -Xmx12G
Loader: tdb2.tdbloader
LOADER_OPTS: --loader=parallel
16:26:29 INFO  loader          :: Loader = LoaderParallel
16:26:29 INFO  loader          :: Start: 4 files
16:26:34 INFO  loader          :: Add: 1,000,000 GND.utf8.ttl.gz (Batch: 202,634 / Avg: 202,634)
16:26:40 INFO  loader          :: Add: 2,000,000 GND.utf8.ttl.gz (Batch: 143,636 / Avg: 168,109)
16:26:52 INFO  loader          :: Add: 3,000,000 GND.utf8.ttl.gz (Batch: 84,588 / Avg: 126,480)
16:27:05 INFO  loader          :: Add: 4,000,000 GND.utf8.ttl.gz (Batch: 78,672 / Avg: 109,799)
16:27:17 INFO  loader          :: Add: 5,000,000 GND.utf8.ttl.gz (Batch: 82,413 / Avg: 102,956)
16:27:29 INFO  loader          :: Add: 6,000,000 GND.utf8.ttl.gz (Batch: 85,375 / Avg: 99,540)
16:27:41 INFO  loader          :: Add: 7,000,000 GND.utf8.ttl.gz (Batch: 80,716 / Avg: 96,331)
...
17:00:54 INFO  loader          :: Add: 164,000,000 GND.utf8.ttl.gz (Batch: 81,672 / Avg: 79,401)
17:01:04 WARN  riot            :: [line: 205782946, col: 17] Not advised IRI: <https://http://www.gordonkampe.de> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
17:01:05 INFO  loader          :: Add: 165,000,000 GND.utf8.ttl.gz (Batch: 90,106 / Avg: 79,458)
17:01:12 INFO  loader          ::   End file: GND.utf8.ttl.gz (triples/quads = 165,639,860)
17:01:12 INFO  loader          ::   End file: gnd-sc.ttl (triples/quads = 1,963)
17:01:12 INFO  loader          ::   End file: gnd-sc_notation.ttl (triples/quads = 463)
17:01:12 INFO  loader          ::   End file: gndo.ttl (triples/quads = 4,436)
17:01:12 INFO  loader          :: Finished: 4 files: 165,646,722 tuples in 2083.74s (Avg: 79,495)
17:01:27 INFO  loader          :: Finish - index SPO
17:01:28 INFO  loader          :: Finish - index POS
17:01:32 INFO  loader          :: Finish - index OSP
17:01:32 INFO  loader          :: Time = 2,103.562 seconds : Triples = 165,646,722 : Rate = 78,746 /s
2022-02-21 17:01:35 finished loading
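
In case anyone wants to look at the rate trend more closely, here is a minimal
sketch (not something that was run as part of this load) for extracting the
batch and average rates from loader progress lines. It assumes the "Add: ...
(Batch: ... / Avg: ...)" format seen in this log; the function name
parse_progress and the sample string are only illustrative.

```python
import re

# Assumed progress-line format, taken from the log in this mail:
#   ... :: Add: 1,000,000 GND.utf8.ttl.gz (Batch: 202,634 / Avg: 202,634)
PROGRESS = re.compile(
    r"Add:\s*([\d,]+)\s+\S+\s+\(Batch:\s*([\d,]+)\s*/\s*Avg:\s*([\d,]+)\)"
)


def parse_progress(log_text):
    """Return (total, batch_rate, avg_rate) tuples from loader progress lines."""
    # Collapse all whitespace first, so the pattern also matches lines
    # that a mail client has wrapped in the middle.
    flat = " ".join(log_text.split())
    return [
        tuple(int(group.replace(",", "")) for group in match.groups())
        for match in PROGRESS.finditer(flat)
    ]


# Three lines copied from the log; the first one is deliberately wrapped.
sample = """\
16:26:34 INFO  loader          :: Add: 1,000,000 GND.utf8.ttl.gz (Batch:
202,634 / Avg: 202,634)
16:26:40 INFO  loader          :: Add: 2,000,000 GND.utf8.ttl.gz (Batch: 143,636 / Avg: 168,109)
16:26:52 INFO  loader          :: Add: 3,000,000 GND.utf8.ttl.gz (Batch: 84,588 / Avg: 126,480)
"""

rows = parse_progress(sample)
for total, batch, avg in rows:
    print(f"{total:>12,} triples  batch={batch:>8,}/s  avg={avg:>8,}/s")
```

Feeding the whole log through parse_progress gives one tuple per million
triples, which can then be plotted or compared between runs.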

> -----Original Message-----
> From: Andy Seaborne <[email protected]>
> Sent: Monday, 21 February 2022 10:11
> To: [email protected]
> Subject: Re: Broken GND dataset after loading with
> tdb2.xloader+tdb2.tdbloader
> 
> On 21/02/2022 08:27, Neubert, Joachim wrote:
> > I've reloaded the GND dataset at http://zbw.eu/beta/sparql/gnd/query
> > with 4.5.0-SNAPSHOT. The sources were a 133G .nt.gz file, plus
> > several small .ttl files with ontology etc. I loaded the large one
> > with tdb2.xloader, and immediately after that the smaller ones with
> > tdb2.tdbloader (see protocol at
> > https://zbw.eu/beta/tmp/fuseki/create_tdb_20220220.log).
> 
> What's the URL for the data files?
> 
> > Two things smelled fishy in this load:
> >
> > 1) The tdb2.tdbstats call after the loading looped at 100% CPU, and 
> > I had to kill it after an hour or so (this is reproducible)
> 
> Unclear. If I can get the data, I can see if it happens here.
> 
> >
> > 2) some files remained in the fuseki/databases/temp directory (1.3G 
> > triples.tmp.gz, empty quads.tmp.gz, and a load.json with
> 
> You can delete the files after the xloader has finished.
> 
> >
> > {
> >    "ingested" : "2022-02-20T13:15:45.528+00:00" ,
> >    "data" : [ "../var/gnd/2021-11/src/GND.utf8.ttl.gz" ] ,
> >    "triples" : 165639860 ,
> >    "quads" : 0
> > }
> 
> Just give all the files to a single run of tdb2.tdbloader
> --loader=parallel. At 165e6, it should be significantly faster than
> xloader - there isn't a benefit to xloader.
> 
> >
> > Text indexing however worked, and also a few example queries.
> > However, a basic query like "?x gndo:DifferentiatedPerson ." does not
> > work any more.
> >
> > Any idea what could have gone wrong?
> >
> > Cheers, Joachim
> >
> >
> 
>      Andy
