Hi Chris,

After doing more tests I have good news: the textindexer of Jena 3.10 is working fine. When a large RDF dataset is indexed, the textindexer starts with one field (per record), but later the other fields are indexed as well. This behaviour had confused me; I expected to see all fields indexed immediately. So I learned that I have to wait until the textindexer finishes its task before checking the results.

Thank you for your support so far! Shall I close the ticket?

Best regards,
Sorin


On 12.03.2019 at 15:39, Chris Tomlinson wrote:
Hi Sorin,

I have focused on the jena-text integration with Lucene local to jena/fuseki. Solr support was dropped over a year ago due to lack of support/interest, and given your information about ES 7.x, it is likely going to take someone who is a user of ES to help keep that integration up to date.

Anuj Kumar <[email protected]> did the ES integration about a year ago for jena 3.9.0 and, as I mentioned, I made /obvious/ changes to the ES integration to update to Lucene 7.4.0 for jena 3.10.0.

The upgrade to Lucene 7.4.0 <https://issues.apache.org/jira/browse/JENA-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673657#comment-16673657> was prompted by a user, [email protected], who was interested in Lucene 7.5; but the released version of ES was built against 7.4, so we upgraded to that version.

I’ve opened JENA-1681 <https://issues.apache.org/jira/browse/JENA-1681> for the issue you’ve reported. You can report your findings there and hopefully we can get to the bottom of the problem.

Regards,
Chris



On Mar 12, 2019, at 6:40 AM, Sorin Gheorghiu <[email protected]> wrote:

Hi Chris,

Thank you for your detailed answer. I will still try to find the root cause of this issue.

But I have a question for you: do you know whether Jena will support Elasticsearch in future versions?

I am asking because Elasticsearch 7.0 introduces breaking changes that affect the transport client [1]: /The TransportClient is deprecated in favour of the Java High Level REST Client and will be removed in Elasticsearch 8.0./

This requires changes in the client’s initialization code; the Migration Guide [2] explains how to do it.

[1] https://www.elastic.co/guide/en/elasticsearch/client/java-api/master/transport-client.html

[2] https://www.elastic.co/guide/en/elasticsearch/client/java-rest/master/java-rest-high-level-migration.html


Best regards,
Sorin


On 11.03.2019 at 18:38, Chris Tomlinson wrote:
Hi Sorin,

I haven’t had the time to try and delve further into your issue. Your pcap seems to clearly indicate that there is no data populating any field/property other than the first one in the entity map.

I’ve included the configuration file that we use. It has a great many fields defined, all of which are populated. We load jena/fuseki from a collection of git repos via a git-to-dbs tool <https://github.com/buda-base/git-to-dbs>, and we don’t see the sort of issue you’re reporting, where a single field out of all the defined fields is populated in the dataset and Lucene index - we don’t use Elasticsearch.

The point being that whatever is going wrong is apparently not in the parsing of the configuration and setting up of the internal tables that record information about which predicates are indexed via Lucene (or Elasticsearch) into what fields.

So it appears to me that the issue is something happening in the connection between the standalone textindexer.java and Elasticsearch via TextIndexES.java. The textindexer.java doesn’t have any post-3.8.0 changes that I can see, and the only change in TextIndexES.java is a rename of org.elasticsearch.common.transport.InetSocketTransportAddress to org.elasticsearch.common.transport.TransportAddress as part of the upgrade.

I’m really not able to go further at this time.

I’m sorry,
Chris


# Fuseki configuration for BDRC, configures two endpoints:
#   - /bdrc is read-only
#   - /bdrcrw is read-write
#
# This was painful to come up with but the web interface basically allows no option
# and there is no subclass inference by default, so such a configuration file is necessary.
#
# The main doc sources are:
#  - https://jena.apache.org/documentation/fuseki2/fuseki-configuration.html
#  - https://jena.apache.org/documentation/assembler/assembler-howto.html
#  - https://jena.apache.org/documentation/assembler/assembler.ttl
#
# See https://jena.apache.org/documentation/fuseki2/fuseki-layout.html for the destination of this file.

@prefix fuseki:  <http://jena.apache.org/fuseki#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
@prefix tdb2:    <http://jena.apache.org/2016/tdb#> .
@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix :        <http://base/#> .
@prefix text:    <http://jena.apache.org/text#> .
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
@prefix adm:     <http://purl.bdrc.io/ontology/admin/> .
@prefix bdd:     <http://purl.bdrc.io/data/> .
@prefix bdo:     <http://purl.bdrc.io/ontology/core/> .
@prefix bdr:     <http://purl.bdrc.io/resource/> .
@prefix f: <java:io.bdrc.ldspdi.sparql.functions.> .

# [] ja:loadClass "org.seaborne.tdb2.TDB2" .
# tdb2:DatasetTDB2  rdfs:subClassOf  ja:RDFDataset .
# tdb2:GraphTDB2    rdfs:subClassOf  ja:Model .

[] rdf:type fuseki:Server ;
   fuseki:services (
     :bdrcrw
   ) .

:bdrcrw rdf:type fuseki:Service ;
    fuseki:name   "bdrcrw" ;     # name of the dataset in the url
    fuseki:serviceQuery   "query" ;    # SPARQL query service
    fuseki:serviceUpdate  "update" ;   # SPARQL update service
    fuseki:serviceUpload  "upload" ;   # Non-SPARQL upload service
    fuseki:serviceReadWriteGraphStore "data" ;     # SPARQL Graph store protocol (read and write)
    fuseki:dataset  :bdrc_text_dataset ;
    .

# using TDB
:dataset_bdrc rdf:type  tdb:DatasetTDB ;
     tdb:location "/usr/local/fuseki/base/databases/bdrc" ;
     tdb:unionDefaultGraph true ;
     .

# using TDB2
# :dataset_bdrc rdf:type  tdb2:DatasetTDB2 ;
#      tdb2:location "/usr/local/fuseki/base/databases/bdrc" ;
#      tdb2:unionDefaultGraph true ;
#   .

:bdrc_text_dataset rdf:type text:TextDataset ;
    text:dataset   :dataset_bdrc ;
    text:index :bdrc_lucene_index ;
    .

# Text index description
:bdrc_lucene_index a text:TextIndexLucene ;
    text:directory <file:/usr/local/fuseki/base/lucene-bdrc> ;
    text:storeValues true ;
    text:multilingualSupport true ;
    text:entityMap :bdrc_entmap ;
    text:defineAnalyzers (
        [ text:defineAnalyzer :romanWordAnalyzer ;
          text:analyzer [
            a text:GenericAnalyzer ;
            text:class "io.bdrc.lucene.sa.SanskritAnalyzer" ;
            text:params (
                [ text:paramName "mode" ;
                  text:paramValue "word" ]
                [ text:paramName "inputEncoding" ;
                  text:paramValue "roman" ]
                [ text:paramName "mergePrepositions" ;
                  text:paramValue true ]
                [ text:paramName "filterGeminates" ;
                  text:paramValue true ]
                )
            ] ;
          ]
        [ text:defineAnalyzer :devaWordAnalyzer ;
          text:analyzer [
            a text:GenericAnalyzer ;
            text:class "io.bdrc.lucene.sa.SanskritAnalyzer" ;
            text:params (
                [ text:paramName "mode" ;
                  text:paramValue "word" ]
                [ text:paramName "inputEncoding" ;
                  text:paramValue "deva" ]
                [ text:paramName "mergePrepositions" ;
                  text:paramValue true ]
                [ text:paramName "filterGeminates" ;
                  text:paramValue true ]
                )
            ] ;
          ]
        [ text:defineAnalyzer :slpWordAnalyzer ;
          text:analyzer [
            a text:GenericAnalyzer ;
            text:class "io.bdrc.lucene.sa.SanskritAnalyzer" ;
            text:params (
                [ text:paramName "mode" ;
                  text:paramValue "word" ]
                [ text:paramName "inputEncoding" ;
                  text:paramValue "SLP" ]
                [ text:paramName "mergePrepositions" ;
                  text:paramValue true ]
                [ text:paramName "filterGeminates" ;
                  text:paramValue true ]
                )
            ] ;
          ]
        [ text:defineAnalyzer :romanLenientIndexAnalyzer ;
          text:analyzer [
            a text:GenericAnalyzer ;
            text:class "io.bdrc.lucene.sa.SanskritAnalyzer" ;
            text:params (
                [ text:paramName "mode" ;
                  text:paramValue "syl" ]
                [ text:paramName "inputEncoding" ;
                  text:paramValue "roman" ]
                [ text:paramName "mergePrepositions" ;
                  text:paramValue false ]
                [ text:paramName "filterGeminates" ;
                  text:paramValue true ]
                [ text:paramName "lenient" ;
                  text:paramValue "index" ]
                )
            ] ;
          ]
        [ text:defineAnalyzer :devaLenientIndexAnalyzer ;
          text:analyzer [
            a text:GenericAnalyzer ;
            text:class "io.bdrc.lucene.sa.SanskritAnalyzer" ;
            text:params (
                [ text:paramName "mode" ;
                  text:paramValue "syl" ]
                [ text:paramName "inputEncoding" ;
                  text:paramValue "deva" ]
                [ text:paramName "mergePrepositions" ;
                  text:paramValue false ]
                [ text:paramName "filterGeminates" ;
                  text:paramValue true ]
                [ text:paramName "lenient" ;
                  text:paramValue "index" ]
                )
            ] ;
          ]
        [ text:defineAnalyzer :slpLenientIndexAnalyzer ;
          text:analyzer [
            a text:GenericAnalyzer ;
            text:class "io.bdrc.lucene.sa.SanskritAnalyzer" ;
            text:params (
                [ text:paramName "mode" ;
                  text:paramValue "syl" ]
                [ text:paramName "inputEncoding" ;
                  text:paramValue "SLP" ]
                [ text:paramName "mergePrepositions" ;
                  text:paramValue false ]
                [ text:paramName "filterGeminates" ;
                  text:paramValue true ]
                [ text:paramName "lenient" ;
                  text:paramValue "index" ]
                )
            ] ;
          ]
        [ text:defineAnalyzer :romanLenientQueryAnalyzer ;
          text:analyzer [
            a text:GenericAnalyzer ;
            text:class "io.bdrc.lucene.sa.SanskritAnalyzer" ;
            text:params (
                [ text:paramName "mode" ;
                  text:paramValue "syl" ]
                [ text:paramName "inputEncoding" ;
                  text:paramValue "roman" ]
                [ text:paramName "mergePrepositions" ;
                  text:paramValue false ]
                [ text:paramName "filterGeminates" ;
                  text:paramValue false ]
                [ text:paramName "lenient" ;
                  text:paramValue "query" ]
                )
            ] ;
          ]
        [ text:defineAnalyzer :hanzAnalyzer ;
          text:analyzer [
            a text:GenericAnalyzer ;
            text:class "io.bdrc.lucene.zh.ChineseAnalyzer" ;
            text:params (
                [ text:paramName "profile" ;
                  text:paramValue "TC2SC" ]
                [ text:paramName "stopwords" ;
                  text:paramValue false ]
                [ text:paramName "filterChars" ;
                  text:paramValue 0 ]
                )
            ] ;
          ]
        [ text:defineAnalyzer :han2pinyin ;
          text:analyzer [
            a text:GenericAnalyzer ;
            text:class "io.bdrc.lucene.zh.ChineseAnalyzer" ;
            text:params (
                [ text:paramName "profile" ;
                  text:paramValue "TC2PYstrict" ]
                [ text:paramName "stopwords" ;
                  text:paramValue false ]
                [ text:paramName "filterChars" ;
                  text:paramValue 0 ]
                )
            ] ;
          ]
        [ text:defineAnalyzer :pinyin ;
          text:analyzer [
            a text:GenericAnalyzer ;
            text:class "io.bdrc.lucene.zh.ChineseAnalyzer" ;
            text:params (
                [ text:paramName "profile" ;
                  text:paramValue "PYstrict" ]
                )
            ] ;
          ]
        [ text:addLang "bo" ;
          text:searchFor ( "bo" "bo-x-ewts" "bo-alalc97" ) ;
          text:analyzer [
            a text:GenericAnalyzer ;
            text:class "io.bdrc.lucene.bo.TibetanAnalyzer" ;
            text:params (
                [ text:paramName "segmentInWords" ;
                  text:paramValue false ]
                [ text:paramName "lemmatize" ;
                  text:paramValue true ]
                [ text:paramName "filterChars" ;
                  text:paramValue false ]
                [ text:paramName "inputMode" ;
                  text:paramValue "unicode" ]
                [ text:paramName "stopFilename" ;
                  text:paramValue "" ]
                )
            ] ;
          ]
        [ text:addLang "bo-x-ewts" ;
          text:searchFor ( "bo" "bo-x-ewts" "bo-alalc97" ) ;
          text:analyzer [
            a text:GenericAnalyzer ;
            text:class "io.bdrc.lucene.bo.TibetanAnalyzer" ;
            text:params (
                [ text:paramName "segmentInWords" ;
                  text:paramValue false ]
                [ text:paramName "lemmatize" ;
                  text:paramValue true ]
                [ text:paramName "filterChars" ;
                  text:paramValue false ]
                [ text:paramName "inputMode" ;
                  text:paramValue "ewts" ]
                [ text:paramName "stopFilename" ;
                  text:paramValue "" ]
                )
            ] ;
          ]
        [ text:addLang "bo-alalc97" ;
          text:searchFor ( "bo" "bo-x-ewts" "bo-alalc97" ) ;
          text:analyzer [
            a text:GenericAnalyzer ;
            text:class "io.bdrc.lucene.bo.TibetanAnalyzer" ;
            text:params (
                [ text:paramName "segmentInWords" ;
                  text:paramValue false ]
                [ text:paramName "lemmatize" ;
                  text:paramValue true ]
                [ text:paramName "filterChars" ;
                  text:paramValue false ]
                [ text:paramName "inputMode" ;
                  text:paramValue "alalc" ]
                [ text:paramName "stopFilename" ;
                  text:paramValue "" ]
                )
            ] ;
          ]
        [ text:addLang "zh-hans" ;
          text:searchFor ( "zh-hans" "zh-hant" ) ;
          text:auxIndex ( "zh-aux-han2pinyin" ) ;
          text:analyzer [
            a text:DefinedAnalyzer ;
            text:useAnalyzer :hanzAnalyzer ] ;
          ]
        [ text:addLang "zh-hant" ;
          text:searchFor ( "zh-hans" "zh-hant" ) ;
          text:auxIndex ( "zh-aux-han2pinyin" ) ;
          text:analyzer [
            a text:DefinedAnalyzer ;
            text:useAnalyzer :hanzAnalyzer
            ] ;
          ]
        [ text:addLang "zh-latn-pinyin" ;
          text:searchFor ( "zh-latn-pinyin" "zh-aux-han2pinyin" ) ;
          text:analyzer [
            a text:DefinedAnalyzer ;
            text:useAnalyzer :pinyin
            ] ;
          ]
        [ text:addLang "zh-aux-han2pinyin" ;
          text:searchFor ( "zh-latn-pinyin" "zh-aux-han2pinyin" ) ;
          text:analyzer [
            a text:DefinedAnalyzer ;
            text:useAnalyzer :pinyin
            ] ;
          text:indexAnalyzer :han2pinyin ;
          ]
        [ text:addLang "sa-x-ndia" ;
          text:searchFor ( "sa-x-ndia" "sa-aux-deva2Ndia" "sa-aux-roman2Ndia" "sa-aux-slp2Ndia" ) ;
          text:analyzer [
            a text:DefinedAnalyzer ;
            text:useAnalyzer :romanLenientQueryAnalyzer
            ] ;
          ]
        [ text:addLang "sa-aux-deva2Ndia" ;
          text:searchFor ( "sa-x-ndia" "sa-aux-roman2Ndia" "sa-aux-slp2Ndia" ) ;
          text:analyzer [
            a text:DefinedAnalyzer ;
            text:useAnalyzer :romanLenientQueryAnalyzer
            ] ;
          text:indexAnalyzer :devaLenientIndexAnalyzer ;
          ]
        [ text:addLang "sa-aux-roman2Ndia" ;
          text:searchFor ( "sa-x-ndia" "sa-aux-deva2Ndia" "sa-aux-slp2Ndia" ) ;
          text:analyzer [
            a text:DefinedAnalyzer ;
            text:useAnalyzer :romanLenientQueryAnalyzer
            ] ;
          text:indexAnalyzer :romanLenientIndexAnalyzer ;
          ]
        [ text:addLang "sa-aux-slp2Ndia" ;
          text:searchFor ( "sa-x-ndia" "sa-aux-deva2Ndia" "sa-aux-roman2Ndia" ) ;
          text:analyzer [
            a text:DefinedAnalyzer ;
            text:useAnalyzer :romanLenientQueryAnalyzer
            ] ;
          text:indexAnalyzer :slpLenientIndexAnalyzer ;
          ]
        [ text:addLang "sa-deva" ;
          text:searchFor ( "sa-deva" "sa-x-iast" "sa-x-slp1" "sa-x-iso" "sa-alalc97" ) ;
          text:auxIndex ( "sa-aux-deva2Ndia" ) ;
          text:analyzer [
            a text:DefinedAnalyzer ;
            text:useAnalyzer :devaWordAnalyzer ] ;
          ]
        [ text:addLang "sa-x-iso" ;
          text:searchFor ( "sa-x-iso" "sa-x-iast" "sa-x-slp1" "sa-deva" "sa-alalc97" ) ;
          text:auxIndex ( "sa-aux-roman2Ndia" ) ;
          text:analyzer [
            a text:DefinedAnalyzer ;
            text:useAnalyzer :romanWordAnalyzer ] ;
          ]
        [ text:addLang "sa-x-slp1" ;
          text:searchFor ( "sa-x-slp1" "sa-x-iast" "sa-x-iso" "sa-deva" "sa-alalc97" ) ;
          text:auxIndex ( "sa-aux-slp2Ndia" ) ;
          text:analyzer [
            a text:DefinedAnalyzer ;
            text:useAnalyzer :slpWordAnalyzer ] ;
          ]
        [ text:addLang "sa-x-iast" ;
          text:searchFor ( "sa-x-iast" "sa-x-slp1" "sa-x-iso" "sa-deva" "sa-alalc97" ) ;
          text:auxIndex ( "sa-aux-roman2Ndia" ) ;
          text:analyzer [
            a text:DefinedAnalyzer ;
            text:useAnalyzer :romanWordAnalyzer ] ;
          ]
        [ text:addLang "sa-alalc97" ;
          text:searchFor ( "sa-alalc97" "sa-x-slp1" "sa-x-iso" "sa-deva" "sa-iast" ) ;
          text:auxIndex ( "sa-aux-roman2Ndia" ) ;
          text:analyzer [
            a text:DefinedAnalyzer ;
            text:useAnalyzer :romanWordAnalyzer ] ;
          ]
      ) ;
    .

# Index mappings
:bdrc_entmap a text:EntityMap ;
    text:entityField      "uri" ;
    text:uidField         "uid" ;
    text:defaultField     "label" ;
    text:langField        "lang" ;
    text:graphField       "graph" ; ## enable graph-specific indexing
    text:map (
         [ text:field "label" ;
           text:predicate skos:prefLabel ]
         [ text:field "altLabel" ;
           text:predicate skos:altLabel ; ]
         [ text:field "rdfsLabel" ;
           text:predicate rdfs:label ; ]
         [ text:field "chunkContents" ;
           text:predicate bdo:chunkContents ; ]
         [ text:field "eTextTitle" ;
           text:predicate bdo:eTextTitle ; ]
         [ text:field "logMessage" ;
           text:predicate adm:logMessage ; ]
         [ text:field "noteText" ;
           text:predicate bdo:noteText ; ]
         [ text:field "workAuthorshipStatement" ;
           text:predicate bdo:workAuthorshipStatement ; ]
         [ text:field "workColophon" ;
           text:predicate bdo:workColophon ; ]
         [ text:field "workEditionStatement" ;
           text:predicate bdo:workEditionStatement ; ]
         [ text:field "workPublisherLocation" ;
           text:predicate bdo:workPublisherLocation ; ]
         [ text:field "workPublisherName" ;
           text:predicate bdo:workPublisherName ; ]
         [ text:field "workSeriesName" ;
           text:predicate bdo:workSeriesName ; ]
         ) ;
    .
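
[Editorial note: for reference, fields defined in the entity map above are queried through the text:query property function. A minimal sketch of such a query against the "label" field (the search term is illustrative; text:storeValues is enabled above, so a score variable is available):]

```sparql
PREFIX text: <http://jena.apache.org/text#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

# Match the "label" field (indexed from skos:prefLabel) and
# fetch the actual label from the dataset.
SELECT ?s ?score ?label
WHERE {
  (?s ?score) text:query ( skos:prefLabel "vinaya" ) .
  ?s skos:prefLabel ?label .
}
LIMIT 10
```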


On Mar 11, 2019, at 11:42 AM, Sorin Gheorghiu <[email protected]> wrote:

Hi Chris,

have you had time to look in my results, by chance? Would this help to isolate the issue?
Let me know if you need any other data to collect, please.

Best regards,
Sorin

-------- Forwarded Message --------
Subject:        Re: Text Index build with empty fields
Date:   Mon, 4 Mar 2019 17:35:56 +0100
From:   Sorin Gheorghiu <[email protected]>
To:     [email protected]
CC:     Chris Tomlinson <[email protected]>



Hi Chris,

when I reduce the entity map to 3 fields:

         [ text:field "oldgndid" ;
           text:predicate gndo:oldAuthorityNumber
         ]
         [ text:field "prefName" ;
           text:predicate gndo:preferredNameForThePerson
         ]
         [ text:field "varName" ;
           text:predicate gndo:variantNameForThePerson
         ]

then only the *oldgndid* field contains data (see textindexer_3params_040319.pcap attached):

ES...|..........\*.......gnd_fts_es_131018_index.Y6BxYm-hT6qL0_NX10HrZQ..GndSubjectheadings.http://d-nb.info/gnd/4000002-3........
ES...B..........\*.....transport_client.indices:data/write/update..gnd_fts_es_131018_index.........GndSubjectheadings.http://d-nb.info/gnd/4000023-0......painless..if((ctx._source == null) || (ctx._source.oldgndid == null) || (ctx._source.oldgndid.empty == true)) {ctx._source.oldgndid=[params.fieldValue] } else {ctx._source.oldgndid.add(params.fieldValue)}..fieldValue..(DE-588c)4000023-0...............gnd_fts_es_131018_index....GndSubjectheadings..http://d-nb.info/gnd/4000023-0..>{"varName":[],"prefName":[],"oldgndid":["(DE-588c)4000023-0"]}.............
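
[Editorial note: the Painless upsert script embedded in that packet, reformatted with whitespace only for readability (field name oldgndid exactly as captured):]

```painless
if ((ctx._source == null)
    || (ctx._source.oldgndid == null)
    || (ctx._source.oldgndid.empty == true)) {
  ctx._source.oldgndid = [params.fieldValue]
} else {
  ctx._source.oldgndid.add(params.fieldValue)
}
```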

Similarly, with 2 fields:

         [ text:field "prefName" ;
           text:predicate gndo:preferredNameForThePerson
         ]
         [ text:field "varName" ;
           text:predicate gndo:variantNameForThePerson
         ]

then only the *prefName* field contains data (see textindexer_2params_040319.pcap attached):

ES...|..........\*.......gnd_fts_es_131018_index.Y6BxYm-hT6qL0_NX10HrZQ..GndSubjectheadings.http://d-nb.info/gnd/134316541........
ES...$..........\*.....transport_client.indices:data/write/update..gnd_fts_es_131018_index.........GndSubjectheadings.http://d-nb.info/gnd/1153446294......painless..if((ctx._source == null) || (ctx._source.prefName == null) || (ctx._source.prefName.empty == true)) {ctx._source.prefName=[params.fieldValue] } else {ctx._source.prefName.add(params.fieldValue)}..fieldValue. Pharmakon...............gnd_fts_es_131018_index....GndSubjectheadings..http://d-nb.info/gnd/1153446294..'{"varName":[],"prefName":["Pharmakon"]}.................

Regards,
Sorin

On 01.03.2019 at 18:06, Chris Tomlinson wrote:
Hi Sorin,

tcpdump -A -r works fine to view the pcap file; however, I don’t have the time 
to delve into the data. I’ll take your word for it that the whole setup worked 
in 3.8.0 and I encourage you to try simplifying the entity map perhaps by 
having a unique field per property to see if the problem appears related to 
prefName and varName fields mapping to multiple properties.

I do notice that the field oldgndid maps to only a single property; but, not 
knowing the data, I have no idea whether any of that data occurs in your tests.

Since you indicate that only the gndtype field has data (per the pcap file), 
then if there is oldgndid data (i.e., occurrences of gndo:oldAuthorityNumber), 
that suggests some rather generic issue with textindexer; however, if there is 
no oldgndid data, then there may be a problem that has crept in since 3.8.0 
with data for multiple properties assigned to a single field, which I would 
guess might be related to the com.google.common.collect.Multimap that holds 
the results of parsing the entity map.
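
[Editorial note: Chris's guess can be illustrated with a plain-Java sketch. This is not Jena's actual code - the class EntityMapSketch and its methods are hypothetical - but it shows why the multimap matters: one index field may be fed by several predicates, and a buggy put-style overwrite instead of an append would keep only one entry per field, matching the single-populated-field symptom.]

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the parsed entity map:
// index field name -> predicates indexed into that field.
public class EntityMapSketch {
    private final Map<String, List<String>> fieldToPredicates = new LinkedHashMap<>();

    public void add(String field, String predicate) {
        // Correct multimap behaviour: append to the existing list.
        // A bug that did fieldToPredicates.put(field, List.of(predicate))
        // instead would silently drop all but the last predicate per field.
        fieldToPredicates.computeIfAbsent(field, f -> new ArrayList<>()).add(predicate);
    }

    public List<String> predicatesFor(String field) {
        return fieldToPredicates.getOrDefault(field, List.of());
    }

    public static void main(String[] args) {
        EntityMapSketch map = new EntityMapSketch();
        map.add("prefName", "gndo:preferredNameForThePerson");
        map.add("prefName", "gndo:preferredNameForTheWork");
        map.add("oldgndid", "gndo:oldAuthorityNumber");
        // Both prefName predicates must survive parsing:
        System.out.println(map.predicatesFor("prefName").size()); // prints 2
    }
}
```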

I have no idea how to enable the debug when running the standalone textindexer, 
perhaps someone else can answer that.

Regards,
Chris


On Mar 1, 2019, at 2:57 AM, Sorin Gheorghiu <[email protected]> wrote:

Hi Chris,

1) As I said before, this entity map worked in 3.8.0.
The pcap file I sent you is proof that Jena delivers inconsistent data. You 
may open it with Wireshark

[screenshot attachment: jndbgnifbhkopbdd.png]

or read it with tcpick:
# tcpick -C -yP -r textindexer_280219.pcap | more

ES...}..........\*.......gnd_fts_es_131018_index.cp-dFuCVTg-dUwvfyREG2w..GndSubjectheadings.http://d-nb.info/gnd/102968225.........
ES..............\*.....transport_client.indices:data/write/update..gnd_fts_es_131018_index.........GndSubjectheadings.http://d-nb.info/gnd/102968438......painless..if((ctx._source
 == null) || (ctx._source.gndtype == null) || (ctx._source.gndtype.empty == 
true)) {ctx._source.gndtype=[params.fieldValue] } else 
{ctx._source.gndtype.add(params.fieldValue)}
..fieldValue..Person...............gnd_fts_es_131018_index....GndSubjectheadings..http://d-nb.info/gnd/102968438....{"varName":[],"varName":[],"varName":[],"varName":[],"varName":[],"varName":[],"varName":[],"prefName":[],"prefName":[],"prefName":[],"prefName":[],"prefName":[],"prefName":[],"prefName":[],"oldgndid":[],"gndtype":["Person"]}..................................
As a remark, Jena sends the whole text index data for one Elasticsearch 
document within a single TCP packet.

3) fuseki.log collects logs while the Fuseki server is running, but for the 
text indexer we have to run the Java command line, i.e.

        java -cp ./fuseki-server.jar:<other_jars> jena.textindexer --desc=run/config.ttl

The question is: how do we activate debug logging when running the text indexer?


Regards,
Sorin

On 28.02.2019 at 21:41, Chris Tomlinson wrote:
Hi Sorin,

1) I suggest trying to simplify the entity map. I assume there’s data for each 
of the properties other than skos:altLabel in the entity map:

          [ text:field "gndtype";
            text:predicate skos:altLabel
          ]
          [ text:field "oldgndid";
            text:predicate gndo:oldAuthorityNumber
          ]
          [ text:field "prefName";
            text:predicate gndo:preferredNameForTheSubjectHeading
          ]
          [ text:field "varName";
            text:predicate gndo:variantNameForTheSubjectHeading
          ]
          [ text:field "prefName";
            text:predicate gndo:preferredNameForThePlaceOrGeographicName
          ]
          [ text:field "varName";
            text:predicate gndo:variantNameForThePlaceOrGeographicName
          ]
          [ text:field "prefName";
            text:predicate gndo:preferredNameForTheWork
          ]
          [ text:field "varName";
            text:predicate gndo:variantNameForTheWork
          ]
          [ text:field "prefName";
            text:predicate gndo:preferredNameForTheConferenceOrEvent
          ]
          [ text:field "varName";
            text:predicate gndo:variantNameForTheConferenceOrEvent
          ]
          [ text:field "prefName";
            text:predicate gndo:preferredNameForTheCorporateBody
          ]
          [ text:field "varName";
            text:predicate gndo:variantNameForTheCorporateBody
          ]
          [ text:field "prefName";
            text:predicate gndo:preferredNameForThePerson
          ]
          [ text:field "varName";
            text:predicate gndo:variantNameForThePerson
          ]
          [ text:field "prefName";
            text:predicate gndo:preferredNameForTheFamily
          ]
          [ text:field "varName";
            text:predicate gndo:variantNameForTheFamily
          ]
2) You might try a TextIndexLucene.

3) Adding the line log4j.logger.org.apache.jena.query.text.es=DEBUG should 
work. I see no problem with it.

Sorry to be of little help,
Chris


On Feb 28, 2019, at 8:53 AM, Sorin Gheorghiu <[email protected]> wrote:

Hi Chris,
Thank you for answering. I am replying to you directly because users@jena 
doesn't accept messages larger than 1 MB.

Our previous successful text index attempt was with 3.8.0, not 3.9.0; sorry 
for the misinformation.
Attached is the assembler file for 3.10.0 as requested, as well as the packet 
capture file showing that only the 'gndtype' field has data.
I tried to enable debug logging in log4j.properties with 
log4j.logger.org.apache.jena.query.text.es=DEBUG, but there was no output in the log file.

Regards,
Sorin

Am 27.02.2019 um 20:01 schrieb Chris Tomlinson:
Hi Sorin,

Please provide the assembler file for Elasticsearch that has the problematic 
entity map definitions.

There haven’t been any changes to textindexer in over a year, since well 
before 3.9. I don’t see any relevant changes to the handling of entity maps 
either, so I can’t begin to pursue the issue further without perhaps seeing 
your current assembler file.

I don't have any experience with Elasticsearch or with using jena-text-es 
beyond a simple change to TextIndexES.java to change 
org.elasticsearch.common.transport.InetSocketTransportAddress to 
org.elasticsearch.common.transport.TransportAddress as part of the upgrade to 
Lucene 7.4.0 and Elasticsearch 6.4.2.

Regards,
Chris


On Feb 25, 2019, at 2:37 AM, Sorin Gheorghiu <[email protected]> wrote:

Correction: only the *last field* from the /text:map/ list contains a value.

To reformulate:

* if there are 3 fields in /text:map/, then during indexing the first
   two are empty (let's name them 'text1' and 'text2') and the last
   field contains data (let's name it 'text3')
* if on the next attempt the field 'text3' is commented out, then
   'text1' is empty and 'text2' contains data


On 22.02.2019 at 15:01, Sorin Gheorghiu wrote:
In addition:

  * if there are 3 fields in /text:map/, then during indexing one
    contains data (let's name it 'text1'), the others are empty (let's
    name them 'text2' and 'text3'),
  * if on the next attempt the field 'text1' is commented out, then
    'text2' contains data and 'text3' is empty



-------- Forwarded Message --------
Subject:        Text Index build with empty fields
Date:   Fri, 22 Feb 2019 14:01:18 +0100
From:   Sorin Gheorghiu <[email protected]>
Reply-To:       [email protected]
To:     [email protected]



Hi,

When building the text index with the /jena.textindexer/ tool in Jena 3.10 for 
an external full-text search engine (Elasticsearch, of course), with multiple 
fields under different names in /text:map/, just *one field is indexed* (more 
precisely, one field contains data and the others are empty). It doesn't look 
like an issue with Elasticsearch: in the logs generated during indexing, all 
fields but one are already missing their values. The same setup worked in 
Jena 3.9. Changing the Java version from 8 to 9 or 11 didn't change anything.

Could it be that changes in the new release have affected this tool and we 
are dealing with a bug?

--
Sorin Gheorghiu             Tel: +49 7531 88-3198
Universität Konstanz        Raum: B705
78464 Konstanz              [email protected]

- KIM: Abteilung Contentdienste -
<textindexer_2params_040319.pcap><textindexer_3params_040319.pcap>

