Re: Text indexing Wikidata

Lorenz Buehmann Sat, 19 Feb 2022 00:01:21 -0800

Hi,

so far you can't do anything else: the whole indexing pipeline is single-threaded as far as I know. It simply iterates over all properties declared to be used for fetching the RDF triple values. Lucene indexing itself is thread-safe, so the easiest thing would be to use one writer thread per property. This clearly would not help here, where you have set rdfs:label as the only property. Thus, we would also have to split the dataset somehow for the given property, and would then be able to distribute each split to a separate writer thread.

The main loop is here and makes it rather easy to understand where we could introduce parallelism: https://github.com/apache/jena/blob/main/jena-text/src/main/java/org/apache/jena/query/text/cmd/textindexer.java#L125-L143

Multiple concurrent reads from a dataset are trivial; we just have to get appropriate splits. I'm not sure how easy this is, maybe a cursor/iterator on the subjects with different offsets or something?
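
To make the idea of offset-based splits a bit more concrete, here is a rough sketch using only the JDK. It is not the textindexer code and does not touch Jena or Lucene: IndexSink is a hypothetical stand-in for a thread-safe index writer, the in-memory list of (subject, label) pairs stands in for the values fetched for one property, and the names splits and indexInParallel are made up for illustration. Whether the real dataset can be cut into such ranges cheaply is exactly the open question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelIndexSketch {

    /** Hypothetical stand-in for a thread-safe index writer; not a Lucene or Jena class. */
    public static final class IndexSink {
        private final ConcurrentLinkedQueue<String> docs = new ConcurrentLinkedQueue<>();

        public void add(String subject, String label) {
            docs.add(subject + "\t" + label);
        }

        public int size() {
            return docs.size();
        }
    }

    /** Cuts [0, total) into at most `parts` contiguous [from, to) ranges; `parts` must be > 0. */
    public static List<int[]> splits(int total, int parts) {
        List<int[]> out = new ArrayList<>();
        int chunk = (total + parts - 1) / parts; // ceiling division
        for (int from = 0; from < total; from += chunk) {
            out.add(new int[] { from, Math.min(from + chunk, total) });
        }
        return out;
    }

    /** Hands each range to its own task; all tasks write into the one shared sink. */
    public static int indexInParallel(List<String[]> entries, int threads, IndexSink sink)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int[] range : splits(entries.size(), threads)) {
                results.add(pool.submit(() -> {
                    int n = 0;
                    // Each task only reads its own slice of the (unmodified) input list.
                    for (String[] e : entries.subList(range[0], range[1])) {
                        sink.add(e[0], e[1]);
                        n++;
                    }
                    return n;
                }));
            }
            int total = 0;
            for (Future<Integer> f : results) {
                total += f.get(); // waits for the task and rethrows its failure, if any
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample data: 1000 (subject, label) pairs.
        List<String[]> entries = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            entries.add(new String[] { "wd:Q" + i, "label " + i });
        }
        IndexSink sink = new IndexSink();
        int indexed = indexInParallel(entries, 4, sink);
        System.out.println(indexed + " entries indexed, sink holds " + sink.size());
    }
}
```

The sketch shares one sink across all tasks, relying on the writer being thread-safe, rather than building one index per split and merging afterwards; the latter would be another option but adds a merge step.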


@Andy what do you think?

On 18.02.22 09:59, Neubert, Joachim wrote:

Text indexing the truthy Wikidata dump took 13:10 h for 1.5b labels (in parts
using text:LowerCaseKeywordAnalyzer) on the massively parallel machine.

I observed a CPU usage of 100-250 %, and wonder if I could do something to 
speed up. My command line simply was

java -cp /opt/fuseki/fuseki-server.jar jena.textindexer --debug 
--desc=/tmp/temp.ttl

(apache-jena-fuseki-4.5.0-SNAPSHOT)

Cheers, Joachim

--
Joachim Neubert

ZBW - Leibniz Information Centre for Economics
Neuer Jungfernstieg 21
20354 Hamburg
Phone +49-40-42834-462
