On 17/06/2020 17:11, Jean-Marc Vanel wrote:
Sorry for the late answer.
I'm aware of the bad side of autocommit, which I never use.
I did wrap the call to removeGraph in a transaction.
I'll make the measurements you asked for, to assess the respective CPU and elapsed
times for loading RDF and indexing the text.

But for the time being, I had to solve my issue of loading data without
stopping my SPARQL + HTML server.
So I wrote a client RDF uploader that talks to the SPARQL Graph Store
Protocol:
https://www.w3.org/TR/sparql11-http-rdf-update/
It splits the given RDF file into chunks of 10000 triples for sending:
https://github.com/jmvanel/semantic_forms/blob/master/scala/clients/src/main/scala/deductions/runtime/clients/RDFuploader.scala#L66
I used for the first time the RIOT parser with a callback
(org.apache.jena.riot.system.StreamRDFBase), which I'll also test for
performance. It is understandable that it can be slow, since the input was
a Turtle file, not N-Triples.
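The chunking step can be sketched without Jena on the classpath. This is a minimal illustration where plain N-Triples lines stand in for parsed triples (an assumption: the real uploader counts triples in a StreamRDF callback, and each chunk would be POSTed to the Graph Store Protocol endpoint):

```scala
// Minimal sketch of the uploader's chunking step. Assumption: one triple
// per line, as in N-Triples; the real code counts triples inside a
// StreamRDF callback instead of reading lines.
def chunkTriples(lines: Iterator[String], chunkSize: Int = 10000): Iterator[Seq[String]] =
  lines.filter(_.trim.nonEmpty).grouped(chunkSize).map(_.toSeq)

// Each chunk is then serialized into one request body, which would be sent
// as an HTTP POST to the Graph Store Protocol endpoint.
def chunkToPayload(chunk: Seq[String]): String = chunk.mkString("\n")
```

Sending many mid-sized requests this way keeps the server responsive, since each chunk is committed independently instead of holding one huge update open.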

On the server side, I modularized my code, so that now several TDB(1)
instances are created on the same directory, which is not a problem for TDB.
But apparently it is a problem for Lucene: there is a
LockObtainFailedException ("Lock held by this virtual machine:
../LUCENE/write.lock") when creating the second TDB instance connected to
Lucene.

Re: One lucene index shared across multiple databases.

The code isn't written to be used in this way. The locking issue could be made to work - I don't think there is a fundamental reason why text indexes can't be shared read-only across databases in the same JVM.

But update adds a complication. Having one index in multiple transaction controllers is not going to work.

DatasetGraphText does special things for TDB1 and TDB2.

TDB1 transaction management only works with one database and special TransactionLifecycle listeners.

TDB2 transaction management can, in theory, work across databases and extra TransactionalComponents but the code to build the compound transaction domain does not exist.

    Andy

So I'll ensure that only one TDB database is instantiated.
Or maybe I am using the API incorrectly (it's configured through the API, not through RDF configuration).

NOTES

    - I'm not sure whether LUCENE/write.lock is deleted in all cases when closing
    the TDB, although it has been specified at text index creation:
                TextDatasetFactory.create(... closeIndexOnDSGClose = true)
    - the Luke GUI in lucene-8.5.2 is useful for inspecting the Lucene index
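On the first note: if the JVM exits without closing the index, the lock file can be left behind. A hedged sketch of a stale-lock cleanup using only the JDK (the LUCENE directory name is taken from the exception message above; run this only when no other process or TDB instance has the index open):

```scala
import java.nio.file.{Files, Path}

// Hedged sketch: remove a leftover Lucene write.lock before re-opening the
// index. This is safe only if nothing else is currently using the index.
// Returns true if a stale lock file was found and deleted.
def clearStaleLock(indexDir: Path): Boolean =
  Files.deleteIfExists(indexDir.resolve("write.lock"))
```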


Jean-Marc Vanel
<http://semantic-forms.cc:9112/display?displayuri=http://jmvanel.free.fr/jmv.rdf%23me>
+33 (0)6 89 16 29 52


On Sat, 6 Jun 2020 at 11:45, Andy Seaborne <[email protected]> wrote:



On 04/06/2020 10:25, Jean-Marc Vanel wrote:
Hi

It took hours to load a TTL document with text indexing (in TDB 3.15.0).
The TTL document is Taxrefld_taxonomy_classes.ttl (size: 2_676_428 triples)
in the zip taxref12-core.zip:
https://github.com/frmichel/taxref-ld/blob/master/dataset/12.0/taxref12-core.zip

Have you tried with and without the text index to get an indication
of where the time is going?

This is a combined setup, so it is harder to say where the time goes
without an experiment.


This method in DatasetGraph is called :
      public void add(Node g, Node s, Node p, Node o) ;

With logging at debug level, it appeared that most of the elapsed time is
taken by removing the graph, one entity at a time.

In fact I explicitly call removeGraph() beforehand, because the data is
stored in provenance-specific graphs in this database.

The text index has to be updated as well, and I think there is nothing
special about removeGraph for a text index, so it undoes all the indexing.

Also - Lucene indexing may be slower than the TDB part.


Is there a way to accelerate things?
I wondered whether wrapping the removeGraph() operation in a transaction is
mandatory or merely useful.

useful - If you don't have a transaction, TDB1 is going to be less safe
for your data.

At runtime Jena does not protest about that ...

TDB1 does not ... but it is better to use a transaction, and it's
mandatory for TDB2.

Adding an autocommit mode is not as good as it may seem. Like in SQL,
autocommit is nothing more than an automatic transaction around each
step, and it very easily becomes extremely slow.

      Andy


A typical block in the data:

<http://taxref.mnhn.fr/lod/taxon/629656/12.0>
        a                            owl:Class ;
        rdfs:isDefinedBy             <http://taxref.mnhn.fr/lod/taxref-ld/12.0> ;
        rdfs:label                   "Eranthemum pulchellum" ;
        rdfs:subClassOf              <http://taxref.mnhn.fr/lod/taxon/452421/12.0> ;
        schema:mainEntityOfPage      <https://inpn.mnhn.fr/espece/cd_nom/629656?lg=en> ;
        taxrefprop:habitat           taxrefhab:FreshWater , taxrefhab:Terrestrial ;
        taxrefprop:hasRank           taxrefrk:Species ;
        taxrefprop:hasReferenceName  <http://taxref.mnhn.fr/lod/name/629656> ;
        taxrefprop:hasSynonym        <http://taxref.mnhn.fr/lod/name/633029> ,
                                     <http://taxref.mnhn.fr/lod/name/637984> ,
                                     <http://taxref.mnhn.fr/lod/name/634312> ;
        foaf:homepage                <https://inpn.mnhn.fr/espece/cd_nom/629656?lg=en> .

Jean-Marc Vanel
<http://semantic-forms.cc:9112/display?displayuri=http://jmvanel.free.fr/jmv.rdf%23me>
+33 (0)6 89 16 29 52
Twitter: @jmvanel , @jmvanel_fr ; chat: irc://irc.freenode.net#eulergui
Chroniques jardin
<http://semantic-forms.cc:1952/history?uri=http%3A%2F%2Fdbpedia.org%2Fresource%2FChronicle>


