Hi,

As far as I know, there is currently nothing else you can do: the whole indexing pipeline is single-threaded. It simply iterates over all properties declared for indexing and fetches the RDF triple values for each of them. Lucene indexing itself is thread-safe, so the easiest change would be to use one writer thread per property. That clearly would not help in your case, where rdfs:label is the only property. We would therefore also have to split the dataset for a given property and then distribute each split to a separate writer thread.
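To make the idea concrete, here is a rough sketch of "one worker thread per split, all feeding a shared thread-safe writer". It is not Jena code: `ParallelIndexSketch`, `DocSink` and `indexInParallel` are hypothetical names, and `DocSink` only stands in for a shared writer such as a Lucene IndexWriter. The entries are plain (subject, label) string pairs so that the example runs with the JDK alone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelIndexSketch {

    /** Hypothetical stand-in for a thread-safe index writer shared by all workers. */
    interface DocSink {
        void add(String subject, String label);
    }

    /** Splits the entries round-robin and indexes each split on its own thread. */
    static void indexInParallel(List<String[]> entries, int threads, DocSink sink) {
        // Assumption: the entries for one property are already materialised as a list.
        List<List<String[]>> splits = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            splits.add(new ArrayList<>());
        }
        for (int i = 0; i < entries.size(); i++) {
            splits.get(i % threads).add(entries.get(i));
        }

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (List<String[]> split : splits) {
                pending.add(pool.submit(() -> {
                    for (String[] entry : split) {
                        sink.add(entry[0], entry[1]);
                    }
                }));
            }
            // Wait for every worker so that a failure in any split is reported.
            for (Future<?> f : pending) {
                f.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("indexing interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("indexing failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String[]> entries = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            entries.add(new String[] {"urn:s" + i, "label " + i});
        }
        Queue<String> indexed = new ConcurrentLinkedQueue<>();
        indexInParallel(entries, 4, (subject, label) -> indexed.add(subject + " -> " + label));
        System.out.println("indexed " + indexed.size() + " entries");
    }
}
```

In a real implementation the splits would of course be read lazily from the dataset rather than copied into lists, and the order in which documents reach the writer would no longer be deterministic.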

The main loop is here and makes it rather easy to understand where we could introduce parallelism: https://github.com/apache/jena/blob/main/jena-text/src/main/java/org/apache/jena/query/text/cmd/textindexer.java#L125-L143

Multiple concurrent reads from a dataset are trivial; we just have to get appropriate splits. I'm not sure how easy that is, maybe a cursor/iterator over the subjects with different offsets or something?
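Just to illustrate what I mean by offsets, the following sketch computes contiguous ranges over an ordered subject listing, one range per worker. It assumes the total number of subjects is known and that the iteration order is stable between readers; whether the dataset lets us skip to an offset cheaply is exactly the open question. `SubjectSplits` and `Split` are made-up names, not existing Jena classes.

```java
import java.util.ArrayList;
import java.util.List;

class SubjectSplits {

    /** Half-open range [offset, offset + length) over an ordered subject listing. */
    static final class Split {
        final long offset;
        final long length;

        Split(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }

        @Override
        public String toString() {
            return "[" + offset + ", " + (offset + length) + ")";
        }
    }

    /** Divides {@code total} subjects into at most {@code parts} near-equal, non-empty ranges. */
    static List<Split> splits(long total, int parts) {
        if (total < 0 || parts <= 0) {
            throw new IllegalArgumentException("total must be >= 0 and parts must be > 0");
        }
        List<Split> result = new ArrayList<>();
        long base = total / parts;
        long remainder = total % parts;
        long offset = 0;
        for (int i = 0; i < parts; i++) {
            // The first 'remainder' ranges take one extra subject each.
            long length = base + (i < remainder ? 1 : 0);
            if (length > 0) {
                result.add(new Split(offset, length));
            }
            offset += length;
        }
        return result;
    }

    public static void main(String[] args) {
        for (Split s : splits(10, 4)) {
            System.out.println(s);
        }
    }
}
```

Each writer thread would then skip to its offset and consume `length` subjects. If skipping is itself a linear scan, the later splits pay for it, so a key-range or hash-based partitioning might turn out to be the better choice.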

@Andy what do you think?

On 18.02.22 09:59, Neubert, Joachim wrote:
Text indexing the truthy Wikidata dump took 13:10 h for 1.5b labels (in parts 
using text:LowerCaseKeywordAnalyzer) on the massive parallel machine.

I observed a CPU usage of 100-250 %, and wonder if I could do something to 
speed up. My command line simply was

java -cp /opt/fuseki/fuseki-server.jar jena.textindexer --debug --desc=/tmp/temp.ttl

(apache-jena-fuseki-4.5.0-SNAPSHOT)

Cheers, Joachim

--
Joachim Neubert

ZBW - Leibniz Information Centre for Economics
Neuer Jungfernstieg 21
20354 Hamburg
Phone +49-40-42834-462

