On 17/06/2020 17:11, Jean-Marc Vanel wrote:
Sorry for the late answer.
I'm aware of the bad side of autocommit, which I never use.
I did wrap the call to removeGraph in a transaction.
I'll make the measurements you asked for, to assess the respective CPU and elapsed
times for loading RDF and indexing the text.

But for the time being, I had to solve my issue of loading data without
stopping my SPARQL + HTML server.
So I wrote a client RDF uploader that talks to the SPARQL Graph Store
Protocol:
https://www.w3.org/TR/sparql11-http-rdf-update/
It splits the given RDF file into chunks of 10000 triples for sending:
https://github.com/jmvanel/semantic_forms/blob/master/scala/clients/src/main/scala/deductions/runtime/clients/RDFuploader.scala#L66
I used for the first time the RIOT parser with a callback
(org.apache.jena.riot.system.StreamRDFBase), which I'll also test for
performance. It is understandable that it can be slow, since the input was
a Turtle file, not N-Triples.
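The chunking step can be sketched without Jena on the classpath. This is a minimal illustration where plain N-Triples lines stand in for parsed triples (an assumption: the real uploader counts triples in a StreamRDF callback, and each chunk would be POSTed to the Graph Store Protocol endpoint):

```scala
// Minimal sketch of the uploader's chunking step. Assumption: one triple
// per line, as in N-Triples; the real code counts triples inside a
// StreamRDF callback instead of reading lines.
def chunkTriples(lines: Iterator[String], chunkSize: Int = 10000): Iterator[Seq[String]] =
  lines.filter(_.trim.nonEmpty).grouped(chunkSize).map(_.toSeq)

// Each chunk is then serialized into one request body, which would be sent
// as an HTTP POST to the Graph Store Protocol endpoint.
def chunkToPayload(chunk: Seq[String]): String = chunk.mkString("\n")
```

Sending many mid-sized requests this way keeps the server responsive, since each chunk is committed independently instead of holding one huge update open.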

On the server side, I modularized my code, so that now several TDB(1)
instances are created on the same directory, which is not a problem for TDB.
But apparently it is a problem for Lucene: there is a
LockObtainFailedException ("Lock held by this virtual machine:
../LUCENE/write.lock") when creating the second TDB instance connected to
Lucene.

Re: One lucene index shared across multiple databases.

The code isn't written to be used in this way. The locking issue could be made to work - I don't think there is a fundamental reason why text indexes can't be shared read-only across databases in the same JVM.

But update adds a complication. Having one index in multiple transaction controllers is not going to work.

DatasetGraphText does special things for TDB1 and TDB2.

TDB1 transaction management only works with one database and special TransactionLifecycle listeners.

TDB2 transaction management can, in theory, work across databases and extra TransactionalComponents but the code to build the compound transaction domain does not exist.

    Andy

So I'll ensure that only one TDB database is instantiated.
Or maybe I am using the API incorrectly (it's configured through the API, not through RDF configuration).

NOTES

    - I'm not sure whether LUCENE/write.lock is deleted in all cases when closing
    the TDB, although it has been specified at text index creation:
                TextDatasetFactory.create(... closeIndexOnDSGClose = true)
    - the Luke GUI in lucene-8.5.2 is useful for inspecting the Lucene index
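On the first note: if the JVM exits without closing the index, the lock file can be left behind. A hedged sketch of a stale-lock cleanup using only the JDK (the LUCENE directory name is taken from the exception message above; run this only when no other process or TDB instance has the index open):

```scala
import java.nio.file.{Files, Path}

// Hedged sketch: remove a leftover Lucene write.lock before re-opening the
// index. This is safe only if nothing else is currently using the index.
// Returns true if a stale lock file was found and deleted.
def clearStaleLock(indexDir: Path): Boolean =
  Files.deleteIfExists(indexDir.resolve("write.lock"))
```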


Jean-Marc Vanel
<http://semantic-forms.cc:9112/display?displayuri=http://jmvanel.free.fr/jmv.rdf%23me>
+33 (0)6 89 16 29 52


On Sat, 6 Jun 2020 at 11:45, Andy Seaborne <[email protected]> wrote:



On 04/06/2020 10:25, Jean-Marc Vanel wrote:
Hi

It took hours to load a TTL document with text indexing (in TDB 3.15.0).
The TTL document is Taxrefld_taxonomy_classes.ttl (size: 2_676_428 triples)
in the zip taxref12-core.zip:
https://github.com/frmichel/taxref-ld/blob/master/dataset/12.0/taxref12-core.zip

Have you tried with and without the text index to get an indication
of where the time is going?

This is a combined setup, so it is harder to say where the time goes
without an experiment.


This method in DatasetGraph is called :
      public void add(Node g, Node s, Node p, Node o) ;

With logging at debug level, it appeared that most of the elapsed time is
taken by removing the graph, one entity at a time.

In fact I explicitly call removeGraph() beforehand, because the data is
stored in provenance-specific graphs in this database.

The text index has to be updated as well, and I think there is nothing
special about removeGraph for a text index, so it undoes all the indexing.

Also - Lucene indexing may be slower than the TDB part.


Is there a way to accelerate things?
I wondered whether wrapping the removeGraph() operation in a transaction is
mandatory or merely useful.

useful - If you don't have a transaction, TDB1 is going to be less safe
for your data.

At runtime Jena does not protest about that ...

TDB1 does not ... but it is better to use a transaction, and it's
mandatory for TDB2.

Adding an autocommit mode is not as good as it may seem. Like in SQL,
autocommit is nothing more than an automatic transaction around each
step, and it very easily becomes extremely slow.

      Andy


A typical block in the data:

<http://taxref.mnhn.fr/lod/taxon/629656/12.0>
        a                            owl:Class ;
        rdfs:isDefinedBy             <http://taxref.mnhn.fr/lod/taxref-ld/12.0> ;
        rdfs:label                   "Eranthemum pulchellum" ;
        rdfs:subClassOf              <http://taxref.mnhn.fr/lod/taxon/452421/12.0> ;
        schema:mainEntityOfPage      <https://inpn.mnhn.fr/espece/cd_nom/629656?lg=en> ;
        taxrefprop:habitat           taxrefhab:FreshWater , taxrefhab:Terrestrial ;
        taxrefprop:hasRank           taxrefrk:Species ;
        taxrefprop:hasReferenceName  <http://taxref.mnhn.fr/lod/name/629656> ;
        taxrefprop:hasSynonym        <http://taxref.mnhn.fr/lod/name/633029> ,
                                     <http://taxref.mnhn.fr/lod/name/637984> ,
                                     <http://taxref.mnhn.fr/lod/name/634312> ;
        foaf:homepage                <https://inpn.mnhn.fr/espece/cd_nom/629656?lg=en> .

Jean-Marc Vanel
<http://semantic-forms.cc:9112/display?displayuri=http://jmvanel.free.fr/jmv.rdf%23me>
+33 (0)6 89 16 29 52
Twitter: @jmvanel , @jmvanel_fr ; chat: irc://irc.freenode.net#eulergui
Chroniques jardin
<http://semantic-forms.cc:1952/history?uri=http%3A%2F%2Fdbpedia.org%2Fresource%2FChronicle>


