Re: Text Index build with empty fields

Chris Tomlinson Tue, 12 Mar 2019 07:39:45 -0700

Hi Sorin,

I have focussed on the jena text integration w/ Lucene local to jena/fuseki. 
The solr was dropped over a year ago due to lack of support/interest and w’ 
your information about ES 7.x it’s likely going to take someone who is a user 
of ES to help keep the integration up-to-date.


Anuj Kumar <[email protected]> did the ES integration about a year ago 
for jena 3.9.0 and as I mentioned I made obvious changes to the ES integration 
to update to Lucene 7.4.0 for jena 3.10.0.

The upgrade to Lucene 7.4.0  
<https://issues.apache.org/jira/browse/JENA-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673657#comment-16673657>was
 prompted by a user, [email protected] 
<mailto:[email protected]>, who was interested in Lucene 7.5, but the 
released version of ES was built against 7.4 so we upgraded to that version.

I’ve opened JENA-1681 <https://issues.apache.org/jira/browse/JENA-1681> for the 
issue you’ve reported. You can report your findings there and hopefully we can 
get to the bottom of the problem.

Regards,
Chris



> On Mar 12, 2019, at 6:40 AM, Sorin Gheorghiu 
> <[email protected]> wrote:
> 
> Hi Chris,
> 
> Thank you for your detailed answer. I will still try to find the root cause 
> of this issue.
> But I have a question to you, do you know if Jena will support Elasticsearch 
> in the further versions?
> 
> I am asking because in Elasticsearch 7.0 are breaking changes which will 
> affect the transport-client [1]: 
> The TransportClient is deprecated in favour of the Java High Level REST 
> Client and will be removed in Elasticsearch 8.0.
> This supposes changes in the client’s initialization code, the Migration 
> Guide [2] explains how to do it.
> 
> [1] 
> https://www.elastic.co/guide/en/elasticsearch/client/java-api/master/transport-client.html
>  
> <https://www.elastic.co/guide/en/elasticsearch/client/java-api/master/transport-client.html>
> [2] 
> https://www.elastic.co/guide/en/elasticsearch/client/java-rest/master/java-rest-high-level-migration.html
>  
> <https://www.elastic.co/guide/en/elasticsearch/client/java-rest/master/java-rest-high-level-migration.html>
> 
> Best regards,
> Sorin
> 
> Am 11.03.2019 um 18:38 schrieb Chris Tomlinson:
>> Hi Sorin,
>> 
>> I haven’t had the time to try and delve further into your issue. Your pcap 
>> seems to clearly indicate that there is no data populating any 
>> field/property other than the first one in the entity map.
>> 
>> I’ve included the configuration file that we use. It has many many fields 
>> defined that are all populated. We load jena/fuseki from a collection of git 
>> repos via a git-to-dbs tool <https://github.com/buda-base/git-to-dbs> and we 
>> don’t see the sort of issue you’re reporting where there is a single field 
>> out of all the defined fields that is populated in the dataset and Lucene 
>> index - we don’t use ElasticSearch. 
>> 
>> The point being that whatever is going wrong is apparently not in the 
>> parsing of the configuration and setting up of the internal tables that 
>> record information about which predicates are indexed via Lucene (or 
>> Elasticsearch) into what fields.
>> 
>> So it appears to me that the issue is something that is happening in the 
>> connection between the standalone textindexer.java and the Elasticsearch via 
>> the TextIndexES.java. The textindexer.java doesn’t have any post 3.8.0 
>> changes that I can see and the only change in the TextIndexES.java is a 
>> change in the name of 
>> org.elasticsearch.common.transport.InetSocketTransportAddress to 
>> org.elasticsearch.common.transport.TransportAddress as part of the upgrade.
>> 
>> I’m really not able to go further at this time.
>> 
>> I’m sorry,
>> Chris
>> 
>> 
>>> # Fuseki configuration for BDRC, configures two endpoints:
>>> #   - /bdrc is read-only
>>> #   - /bdrcrw is read-write
>>> #
>>> # This was painful to come up with but the web interface basically allows 
>>> no option
>>> # and there is no subclass inference by default so such a configuration 
>>> file is necessary.
>>> #
>>> # The main doc sources are:
>>> #  - 
>>> https://jena.apache.org/documentation/fuseki2/fuseki-configuration.html 
>>> <https://jena.apache.org/documentation/fuseki2/fuseki-configuration.html>
>>> #  - https://jena.apache.org/documentation/assembler/assembler-howto.html 
>>> <https://jena.apache.org/documentation/assembler/assembler-howto.html>
>>> #  - https://jena.apache.org/documentation/assembler/assembler.ttl 
>>> <https://jena.apache.org/documentation/assembler/assembler.ttl>
>>> #
>>> # See https://jena.apache.org/documentation/fuseki2/fuseki-layout.html 
>>> <https://jena.apache.org/documentation/fuseki2/fuseki-layout.html> for the 
>>> destination of this file.
>>> 
>>> @prefix fuseki:  <http://jena.apache.org/fuseki# 
>>> <http://jena.apache.org/fuseki#>> .
>>> @prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns# 
>>> <http://www.w3.org/1999/02/22-rdf-syntax-ns#>> .
>>> @prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema# 
>>> <http://www.w3.org/2000/01/rdf-schema#>> .
>>> @prefix tdb:     <http://jena.hpl.hp.com/2008/tdb# 
>>> <http://jena.hpl.hp.com/2008/tdb#>> .
>>> @prefix tdb2:    <http://jena.apache.org/2016/tdb# 
>>> <http://jena.apache.org/2016/tdb#>> .
>>> @prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler# 
>>> <http://jena.hpl.hp.com/2005/11/Assembler#>> .
>>> @prefix :        <http://base/# <http://base/#>> .
>>> @prefix text:    <http://jena.apache.org/text# 
>>> <http://jena.apache.org/text#>> .
>>> @prefix skos:    <http://www.w3.org/2004/02/skos/core# 
>>> <http://www.w3.org/2004/02/skos/core#>> .
>>> @prefix adm:     <http://purl.bdrc.io/ontology/admin/ 
>>> <http://purl.bdrc.io/ontology/admin/>> .
>>> @prefix bdd:     <http://purl.bdrc.io/data/ <http://purl.bdrc.io/data/>> .
>>> @prefix bdo:     <http://purl.bdrc.io/ontology/core/ 
>>> <http://purl.bdrc.io/ontology/core/>> .
>>> @prefix bdr:     <http://purl.bdrc.io/resource/ 
>>> <http://purl.bdrc.io/resource/>> .
>>> @prefix f:       <java:io.bdrc.ldspdi.sparql.functions.> .
>>> 
>>> # [] ja:loadClass "org.seaborne.tdb2.TDB2" .
>>> # tdb2:DatasetTDB2  rdfs:subClassOf  ja:RDFDataset .
>>> # tdb2:GraphTDB2    rdfs:subClassOf  ja:Model .
>>> 
>>> [] rdf:type fuseki:Server ;
>>>    fuseki:services (
>>>      :bdrcrw
>>>    ) .
>>> 
>>> :bdrcrw rdf:type fuseki:Service ;
>>>     fuseki:name                       "bdrcrw" ;     # name of the dataset 
>>> in the url
>>>     fuseki:serviceQuery               "query" ;    # SPARQL query service
>>>     fuseki:serviceUpdate              "update" ;   # SPARQL update service
>>>     fuseki:serviceUpload              "upload" ;   # Non-SPARQL upload 
>>> service
>>>     fuseki:serviceReadWriteGraphStore "data" ;     # SPARQL Graph store 
>>> protocol (read and write)
>>>     fuseki:dataset                    :bdrc_text_dataset ;
>>>     .
>>> 
>>> # using TDB
>>> :dataset_bdrc rdf:type      tdb:DatasetTDB ;
>>>      tdb:location "/usr/local/fuseki/base/databases/bdrc" ;
>>>      tdb:unionDefaultGraph true ;
>>>      .
>>> 
>>> # using TDB2
>>> # :dataset_bdrc rdf:type      tdb2:DatasetTDB2 ;
>>> #      tdb2:location "/usr/local/fuseki/base/databases/bdrc" ;
>>> #      tdb2:unionDefaultGraph true ;
>>> #   .
>>> 
>>> :bdrc_text_dataset rdf:type     text:TextDataset ;
>>>     text:dataset   :dataset_bdrc ;
>>>     text:index     :bdrc_lucene_index ;
>>>     .
>>> 
>>> # Text index description
>>> :bdrc_lucene_index a text:TextIndexLucene ;
>>>     text:directory <file:/usr/local/fuseki/base/lucene-bdrc> 
>>> <file:///usr/local/fuseki/base/lucene-bdrc> ;
>>>     text:storeValues true ;
>>>     text:multilingualSupport true ;
>>>     text:entityMap :bdrc_entmap ;
>>>     text:defineAnalyzers (
>>>         [ text:defineAnalyzer :romanWordAnalyzer ; 
>>>           text:analyzer [ 
>>>             a text:GenericAnalyzer ;
>>>             text:class "io.bdrc.lucene.sa.SanskritAnalyzer" ;
>>>             text:params (
>>>                 [ text:paramName "mode" ;
>>>                   text:paramValue "word" ]
>>>                 [ text:paramName "inputEncoding" ;
>>>                   text:paramValue "roman" ]
>>>                 [ text:paramName "mergePrepositions" ;
>>>                   text:paramValue true ]
>>>                 [ text:paramName "filterGeminates" ;
>>>                   text:paramValue true ]
>>>                 )
>>>             ] ; 
>>>           ]  
>>>         [ text:defineAnalyzer :devaWordAnalyzer ; 
>>>           text:analyzer [ 
>>>             a text:GenericAnalyzer ;
>>>             text:class "io.bdrc.lucene.sa.SanskritAnalyzer" ;
>>>             text:params (
>>>                 [ text:paramName "mode" ;
>>>                   text:paramValue "word" ]
>>>                 [ text:paramName "inputEncoding" ;
>>>                   text:paramValue "deva" ]
>>>                 [ text:paramName "mergePrepositions" ;
>>>                   text:paramValue true ]
>>>                 [ text:paramName "filterGeminates" ;
>>>                   text:paramValue true ]
>>>                 )
>>>             ] ; 
>>>           ]  
>>>         [ text:defineAnalyzer :slpWordAnalyzer ; 
>>>           text:analyzer [ 
>>>             a text:GenericAnalyzer ;
>>>             text:class "io.bdrc.lucene.sa.SanskritAnalyzer" ;
>>>             text:params (
>>>                 [ text:paramName "mode" ;
>>>                   text:paramValue "word" ]
>>>                 [ text:paramName "inputEncoding" ;
>>>                   text:paramValue "SLP" ]
>>>                 [ text:paramName "mergePrepositions" ;
>>>                   text:paramValue true ]
>>>                 [ text:paramName "filterGeminates" ;
>>>                   text:paramValue true ]
>>>                 )
>>>             ] ; 
>>>           ]  
>>>         [ text:defineAnalyzer :romanLenientIndexAnalyzer ; 
>>>           text:analyzer [ 
>>>             a text:GenericAnalyzer ;
>>>             text:class "io.bdrc.lucene.sa.SanskritAnalyzer" ;
>>>             text:params (
>>>                 [ text:paramName "mode" ;
>>>                   text:paramValue "syl" ]
>>>                 [ text:paramName "inputEncoding" ;
>>>                   text:paramValue "roman" ]
>>>                 [ text:paramName "mergePrepositions" ;
>>>                   text:paramValue false ]
>>>                 [ text:paramName "filterGeminates" ;
>>>                   text:paramValue true ]
>>>                 [ text:paramName "lenient" ;
>>>                   text:paramValue "index" ]
>>>                 )
>>>             ] ; 
>>>           ]  
>>>         [ text:defineAnalyzer :devaLenientIndexAnalyzer ; 
>>>           text:analyzer [ 
>>>             a text:GenericAnalyzer ;
>>>             text:class "io.bdrc.lucene.sa.SanskritAnalyzer" ;
>>>             text:params (
>>>                 [ text:paramName "mode" ;
>>>                   text:paramValue "syl" ]
>>>                 [ text:paramName "inputEncoding" ;
>>>                   text:paramValue "deva" ]
>>>                 [ text:paramName "mergePrepositions" ;
>>>                   text:paramValue false ]
>>>                 [ text:paramName "filterGeminates" ;
>>>                   text:paramValue true ]
>>>                 [ text:paramName "lenient" ;
>>>                   text:paramValue "index" ]
>>>                 )
>>>             ] ; 
>>>           ]  
>>>         [ text:defineAnalyzer :slpLenientIndexAnalyzer ; 
>>>           text:analyzer [ 
>>>             a text:GenericAnalyzer ;
>>>             text:class "io.bdrc.lucene.sa.SanskritAnalyzer" ;
>>>             text:params (
>>>                 [ text:paramName "mode" ;
>>>                   text:paramValue "syl" ]
>>>                 [ text:paramName "inputEncoding" ;
>>>                   text:paramValue "SLP" ]
>>>                 [ text:paramName "mergePrepositions" ;
>>>                   text:paramValue false ]
>>>                 [ text:paramName "filterGeminates" ;
>>>                   text:paramValue true ]
>>>                 [ text:paramName "lenient" ;
>>>                   text:paramValue "index" ]
>>>                 )
>>>             ] ; 
>>>           ]  
>>>         [ text:defineAnalyzer :romanLenientQueryAnalyzer ; 
>>>           text:analyzer [ 
>>>             a text:GenericAnalyzer ;
>>>             text:class "io.bdrc.lucene.sa.SanskritAnalyzer" ;
>>>             text:params (
>>>                 [ text:paramName "mode" ;
>>>                   text:paramValue "syl" ]
>>>                 [ text:paramName "inputEncoding" ;
>>>                   text:paramValue "roman" ]
>>>                 [ text:paramName "mergePrepositions" ;
>>>                   text:paramValue false ]
>>>                 [ text:paramName "filterGeminates" ;
>>>                   text:paramValue false ]
>>>                 [ text:paramName "lenient" ;
>>>                   text:paramValue "query" ]
>>>                 )
>>>             ] ; 
>>>           ]  
>>>         [ text:defineAnalyzer :hanzAnalyzer ; 
>>>           text:analyzer [ 
>>>             a text:GenericAnalyzer ;
>>>             text:class "io.bdrc.lucene.zh.ChineseAnalyzer" ;
>>>             text:params (
>>>                 [ text:paramName "profile" ;
>>>                   text:paramValue "TC2SC" ]
>>>                 [ text:paramName "stopwords" ;
>>>                   text:paramValue false ]
>>>                 [ text:paramName "filterChars" ;
>>>                   text:paramValue 0 ]
>>>                 )
>>>             ] ; 
>>>           ]  
>>>         [ text:defineAnalyzer :han2pinyin ; 
>>>           text:analyzer [ 
>>>             a text:GenericAnalyzer ;
>>>             text:class "io.bdrc.lucene.zh.ChineseAnalyzer" ;
>>>             text:params (
>>>                 [ text:paramName "profile" ;
>>>                   text:paramValue "TC2PYstrict" ]
>>>                 [ text:paramName "stopwords" ;
>>>                   text:paramValue false ]
>>>                 [ text:paramName "filterChars" ;
>>>                   text:paramValue 0 ]
>>>                 )
>>>             ] ; 
>>>           ]
>>>         [ text:defineAnalyzer :pinyin ; 
>>>           text:analyzer [ 
>>>             a text:GenericAnalyzer ;
>>>             text:class "io.bdrc.lucene.zh.ChineseAnalyzer" ;
>>>             text:params (
>>>                 [ text:paramName "profile" ;
>>>                   text:paramValue "PYstrict" ]
>>>                 )
>>>             ] ; 
>>>           ]
>>>         [ text:addLang "bo" ; 
>>>           text:searchFor ( "bo" "bo-x-ewts" "bo-alalc97" ) ;
>>>           text:analyzer [ 
>>>             a text:GenericAnalyzer ;
>>>             text:class "io.bdrc.lucene.bo.TibetanAnalyzer" ;
>>>             text:params (
>>>                 [ text:paramName "segmentInWords" ;
>>>                   text:paramValue false ]
>>>                 [ text:paramName "lemmatize" ;
>>>                   text:paramValue true ]
>>>                 [ text:paramName "filterChars" ;
>>>                   text:paramValue false ]
>>>                 [ text:paramName "inputMode" ;
>>>                   text:paramValue "unicode" ]
>>>                 [ text:paramName "stopFilename" ;
>>>                   text:paramValue "" ]
>>>                 )
>>>             ] ;
>>>           ]
>>>         [ text:addLang "bo-x-ewts" ;
>>>           text:searchFor ( "bo" "bo-x-ewts" "bo-alalc97" ) ;
>>>           text:analyzer [
>>>             a text:GenericAnalyzer ;
>>>             text:class "io.bdrc.lucene.bo.TibetanAnalyzer" ;
>>>             text:params (
>>>                 [ text:paramName "segmentInWords" ;
>>>                   text:paramValue false ]
>>>                 [ text:paramName "lemmatize" ;
>>>                   text:paramValue true ]
>>>                 [ text:paramName "filterChars" ;
>>>                   text:paramValue false ]
>>>                 [ text:paramName "inputMode" ;
>>>                   text:paramValue "ewts" ]
>>>                 [ text:paramName "stopFilename" ;
>>>                   text:paramValue "" ]
>>>                 )
>>>             ] ;
>>>           ]
>>>         [ text:addLang "bo-alalc97" ;
>>>           text:searchFor ( "bo" "bo-x-ewts" "bo-alalc97" ) ;
>>>           text:analyzer [ 
>>>             a text:GenericAnalyzer ;
>>>             text:class "io.bdrc.lucene.bo.TibetanAnalyzer" ;
>>>             text:params (
>>>                 [ text:paramName "segmentInWords" ;
>>>                   text:paramValue false ]
>>>                 [ text:paramName "lemmatize" ;
>>>                   text:paramValue true ]
>>>                 [ text:paramName "filterChars" ;
>>>                   text:paramValue false ]
>>>                 [ text:paramName "inputMode" ;
>>>                   text:paramValue "alalc" ]
>>>                 [ text:paramName "stopFilename" ;
>>>                   text:paramValue "" ]
>>>                 )
>>>             ] ;
>>>           ]
>>>         [ text:addLang "zh-hans" ;
>>>           text:searchFor ( "zh-hans" "zh-hant" ) ;
>>>           text:auxIndex ( "zh-aux-han2pinyin" ) ;
>>>           text:analyzer [
>>>             a text:DefinedAnalyzer ;
>>>             text:useAnalyzer :hanzAnalyzer ] ;
>>>           ]
>>>         [ text:addLang "zh-hant" ; 
>>>           text:searchFor ( "zh-hans" "zh-hant" ) ;
>>>           text:auxIndex ( "zh-aux-han2pinyin" ) ;
>>>           text:analyzer [
>>>             a text:DefinedAnalyzer ;
>>>             text:useAnalyzer :hanzAnalyzer
>>>             ] ;
>>>           ]
>>>         [ text:addLang "zh-latn-pinyin" ;
>>>           text:searchFor ( "zh-latn-pinyin" "zh-aux-han2pinyin" ) ;
>>>           text:analyzer [
>>>             a text:DefinedAnalyzer ;
>>>             text:useAnalyzer :pinyin
>>>             ] ;
>>>           ]
>>>         [ text:addLang "zh-aux-han2pinyin" ;
>>>           text:searchFor ( "zh-latn-pinyin" "zh-aux-han2pinyin" ) ;
>>>           text:analyzer [
>>>             a text:DefinedAnalyzer ;
>>>             text:useAnalyzer :pinyin
>>>             ] ;
>>>           text:indexAnalyzer :han2pinyin ;
>>>           ]
>>>         [ text:addLang "sa-x-ndia" ;
>>>           text:searchFor ( "sa-x-ndia" "sa-aux-deva2Ndia" 
>>> "sa-aux-roman2Ndia" "sa-aux-slp2Ndia" ) ;
>>>           text:analyzer [
>>>             a text:DefinedAnalyzer ;
>>>             text:useAnalyzer :romanLenientQueryAnalyzer
>>>             ] ;
>>>           ]
>>>         [ text:addLang "sa-aux-deva2Ndia" ;
>>>           text:searchFor ( "sa-x-ndia" "sa-aux-roman2Ndia" 
>>> "sa-aux-slp2Ndia" ) ;
>>>           text:analyzer [
>>>             a text:DefinedAnalyzer ;
>>>             text:useAnalyzer :romanLenientQueryAnalyzer
>>>             ] ;
>>>           text:indexAnalyzer :devaLenientIndexAnalyzer ;
>>>           ]
>>>         [ text:addLang "sa-aux-roman2Ndia" ;
>>>           text:searchFor ( "sa-x-ndia" "sa-aux-deva2Ndia" "sa-aux-slp2Ndia" 
>>> ) ;
>>>           text:analyzer [
>>>             a text:DefinedAnalyzer ;
>>>             text:useAnalyzer :romanLenientQueryAnalyzer 
>>>             ] ; 
>>>           text:indexAnalyzer :romanLenientIndexAnalyzer ;
>>>           ]
>>>         [ text:addLang "sa-aux-slp2Ndia" ;
>>>           text:searchFor ( "sa-x-ndia" "sa-aux-deva2Ndia" 
>>> "sa-aux-roman2Ndia" ) ;
>>>           text:analyzer [
>>>             a text:DefinedAnalyzer ;
>>>             text:useAnalyzer :romanLenientQueryAnalyzer
>>>             ] ;
>>>           text:indexAnalyzer :slpLenientIndexAnalyzer ;
>>>           ]
>>>         [ text:addLang "sa-deva" ;
>>>           text:searchFor ( "sa-deva" "sa-x-iast" "sa-x-slp1" "sa-x-iso" 
>>> "sa-alalc97" ) ;
>>>           text:auxIndex ( "sa-aux-deva2Ndia" ) ;
>>>           text:analyzer [
>>>             a text:DefinedAnalyzer ;
>>>             text:useAnalyzer :devaWordAnalyzer ] ; 
>>>           ]
>>>         [ text:addLang "sa-x-iso" ;
>>>           text:searchFor ( "sa-x-iso" "sa-x-iast" "sa-x-slp1" "sa-deva" 
>>> "sa-alalc97" ) ;
>>>           text:auxIndex ( "sa-aux-roman2Ndia" ) ;
>>>           text:analyzer [
>>>             a text:DefinedAnalyzer ;
>>>             text:useAnalyzer :romanWordAnalyzer ] ; 
>>>           ]
>>>         [ text:addLang "sa-x-slp1" ;
>>>           text:searchFor ( "sa-x-slp1" "sa-x-iast" "sa-x-iso" "sa-deva" 
>>> "sa-alalc97" ) ;
>>>           text:auxIndex ( "sa-aux-slp2Ndia" ) ;
>>>           text:analyzer [
>>>             a text:DefinedAnalyzer ;
>>>             text:useAnalyzer :slpWordAnalyzer ] ; 
>>>           ]
>>>         [ text:addLang "sa-x-iast" ;
>>>           text:searchFor ( "sa-x-iast" "sa-x-slp1" "sa-x-iso" "sa-deva" 
>>> "sa-alalc97" ) ;
>>>           text:auxIndex ( "sa-aux-roman2Ndia" ) ;
>>>           text:analyzer [
>>>             a text:DefinedAnalyzer ;
>>>             text:useAnalyzer :romanWordAnalyzer ] ; 
>>>           ]
>>>         [ text:addLang "sa-alalc97" ;
>>>           text:searchFor ( "sa-alalc97" "sa-x-slp1" "sa-x-iso" "sa-deva" 
>>> "sa-iast" ) ;
>>>           text:auxIndex ( "sa-aux-roman2Ndia" ) ;
>>>           text:analyzer [
>>>             a text:DefinedAnalyzer ;
>>>             text:useAnalyzer :romanWordAnalyzer ] ; 
>>>           ]
>>>       ) ;
>>>     .
>>> 
>>> # Index mappings
>>> :bdrc_entmap a text:EntityMap ;
>>>     text:entityField      "uri" ;
>>>     text:uidField         "uid" ;
>>>     text:defaultField     "label" ;
>>>     text:langField        "lang" ;
>>>     text:graphField       "graph" ; ## enable graph-specific indexing
>>>     text:map (
>>>          [ text:field "label" ; 
>>>            text:predicate skos:prefLabel ]
>>>          [ text:field "altLabel" ; 
>>>            text:predicate skos:altLabel ; ]
>>>          [ text:field "rdfsLabel" ;
>>>            text:predicate rdfs:label ; ]
>>>          [ text:field "chunkContents" ;
>>>            text:predicate bdo:chunkContents ; ]
>>>          [ text:field "eTextTitle" ;
>>>            text:predicate bdo:eTextTitle ; ]
>>>          [ text:field "logMessage" ;
>>>            text:predicate adm:logMessage ; ]
>>>          [ text:field "noteText" ;
>>>            text:predicate bdo:noteText ; ]
>>>          [ text:field "workAuthorshipStatement" ;
>>>            text:predicate bdo:workAuthorshipStatement ; ]
>>>          [ text:field "workColophon" ; 
>>>            text:predicate bdo:workColophon ; ]
>>>          [ text:field "workEditionStatement" ;
>>>            text:predicate bdo:workEditionStatement ; ]
>>>          [ text:field "workPublisherLocation" ;
>>>            text:predicate bdo:workPublisherLocation ; ]
>>>          [ text:field "workPublisherName" ;
>>>            text:predicate bdo:workPublisherName ; ]
>>>          [ text:field "workSeriesName" ;
>>>            text:predicate bdo:workSeriesName ; ]
>>>          ) ;
>>>     .
>> 
>> 
>>> On Mar 11, 2019, at 11:42 AM, Sorin Gheorghiu 
>>> <[email protected] <mailto:[email protected]>> 
>>> wrote:
>>> 
>>> Hi Chris,
>>> 
>>> have you had time to look in my results, by chance? Would this help to 
>>> isolate the issue?
>>> Let me know if you need any other data to collect, please.
>>> Best regards,
>>> Sorin
>>> 
>>> -------- Weitergeleitete Nachricht --------
>>> Betreff:    Re: Text Index build with empty fields
>>> Datum:      Mon, 4 Mar 2019 17:35:56 +0100
>>> Von:        Sorin Gheorghiu <[email protected]> 
>>> <mailto:[email protected]>
>>> An: [email protected] <mailto:[email protected]>
>>> Kopie (CC): Chris Tomlinson <[email protected]> 
>>> <mailto:[email protected]>
>>> 
>>> Hi Chris,
>>> 
>>> when I reduce the entity map to 3 fields:
>>> 
>>>          [ text:field "oldgndid";
>>>            text:predicate gndo:oldAuthorityNumber
>>>          ]
>>>          [ text:field "prefName";
>>>            text:predicate gndo:preferredNameForThePerson
>>>          ]
>>>          [ text:field "varName";
>>>            text:predicate gndo:variantNameForThePerson
>>>          ]
>>> then oldgndid field only contains data (see textindexer_3params_040319.pcap 
>>> attached):
>>> ES...|..........\*.......gnd_fts_es_131018_index.Y6BxYm-hT6qL0_NX10HrZQ..GndSubjectheadings.http://d-nb.info/gnd/4000002-3
>>>  <http://d-nb.info/gnd/4000002-3>........
>>> ES...B..........\*.....transport_client.indices:data/write/update..gnd_fts_es_131018_index.........GndSubjectheadings.http://d-nb.info/gnd/4000023-0......painless..if
>>>  <http://d-nb.info/gnd/4000023-0......painless..if>((ctx._source == null) 
>>> || (ctx._source.oldgndid == null) || (ctx._source.oldgndid.empty == true)) 
>>> {ctx._source.oldgndid=[params.fieldValue] } else 
>>> {ctx._source.oldgndid.add(params.fieldValue)}..fieldValue..(DE-588c)4000023-0...............gnd_fts_es_131018_index....GndSubjectheadings..http://d-nb.info/gnd/4000023-0
>>>  
>>> <http://d-nb.info/gnd/4000023-0>..>{"varName":[],"prefName":[],"oldgndid":["(DE-588c)4000023-0"]}.............
>>> moreover with 2 fields:
>>> 
>>>          [ text:field "prefName";
>>>            text:predicate gndo:preferredNameForThePerson
>>>          ]
>>>          [ text:field "varName";
>>>            text:predicate gndo:variantNameForThePerson
>>>          ]
>>> then prefName field only contains data (see textindexer_2params_040319.pcap 
>>> attached):
>>> 
>>> ES...|..........\*.......gnd_fts_es_131018_index.Y6BxYm-hT6qL0_NX10HrZQ..GndSubjectheadings.http://d-nb.info/gnd/134316541
>>>  <http://d-nb.info/gnd/134316541>........
>>> ES...$..........\*.....transport_client.indices:data/write/update..gnd_fts_es_131018_index.........GndSubjectheadings.http://d-nb.info/gnd/1153446294......painless..if
>>>  <http://d-nb.info/gnd/1153446294......painless..if>((ctx._source == null) 
>>> || (ctx._source.prefName == null) || (ctx._source.prefName.empty == true)) 
>>> {ctx._source.prefName=[params.fieldValue] } else 
>>> {ctx._source.prefName.add(params.fieldValue)}..fieldValue.     
>>> Pharmakon...............gnd_fts_es_131018_index....GndSubjectheadings..http://d-nb.info/gnd/1153446294
>>>  
>>> <http://d-nb.info/gnd/1153446294>..'{"varName":[],"prefName":["Pharmakon"]}.................
>>> 
>>> Regards,
>>> Sorin
>>> 
>>> Am 01.03.2019 um 18:06 schrieb Chris Tomlinson:
>>>> Hi Sorin,
>>>> 
>>>> tcpdump -A -r works fine to view the pcap file; however, I don’t have the 
>>>> time to delve into the data. I’ll take your word for it that the whole 
>>>> setup worked in 3.8.0 and I encourage you to try simplifying the entity 
>>>> map perhaps by having a unique field per property to see if the problem 
>>>> appears related to prefName and varName fields mapping to multiple 
>>>> properties. 
>>>> 
>>>> I do notice that the field oldgndid only maps to a single property but not 
>>>> knowing the data I have no idea whether there’s any of that data in your 
>>>> tests.
>>>> 
>>>> Since you indicate that only the field, gndtype, has data (per the pcap 
>>>> file) then if there is oldgndid data (i.e., occurrences of 
>>>> gndo:oldAuthorityNumber, then that suggests that there is some rather 
>>>> generic issue w/ textindexer; however if there is no oldgndid data then 
>>>> there may be a problem that has crept in since 3.8.0 that is leading to a 
>>>> problem with data for multiple properties assigned to a single field which 
>>>> I would guess might be related to google.common.collection.MultiMap that 
>>>> holds the results of parsing the entity map.
>>>> 
>>>> I have no idea how to enable the debug when running the standalone 
>>>> textindexer, perhaps someone else can answer that.
>>>> 
>>>> Regards,
>>>> Chris
>>>> 
>>>> 
>>>>> On Mar 1, 2019, at 2:57 AM, Sorin Gheorghiu 
>>>>> <[email protected]> 
>>>>> <mailto:[email protected]> wrote:
>>>>> 
>>>>> Hi Chris,
>>>>> 
>>>>> 1) As I said before, this entity map worked in 3.8.0. 
>>>>> The pcap file I sent you is the proof that Jena delivers inconsistent 
>>>>> data. You may open it with Wireshark
>>>>> 
>>>>> <jndbgnifbhkopbdd.png>
>>>>> 
>>>>> or read it with tcpick:
>>>>> # tcpick -C -yP -r textindexer_280219.pcap | more
>>>>> 
>>>>> ES...}..........\*.......gnd_fts_es_131018_index.cp-dFuCVTg-dUwvfyREG2w..GndSubjectheadings.http://d-nb.info/gnd/102968225
>>>>>  
>>>>> <dfucvtg-duwvfyreg2w..gndsubjectheadings.http://d-nb.info/gnd/102968225>.........
>>>>> ES..............\*.....transport_client.indices:data/write/update..gnd_fts_es_131018_index.........GndSubjectheadings.http://d-nb.info/gnd/102968438......painless..if
>>>>>  <http://d-nb.info/gnd/102968438......painless..if>((ctx._source == null) 
>>>>> || (ctx._source.gndtype == null) || (ctx._source.gndtype.empty == true)) 
>>>>> {ctx._source.gndtype=[params.fieldValue] } else 
>>>>> {ctx._source.gndtype.add(params.fieldValue)}
>>>>> ..fieldValue..Person...............gnd_fts_es_131018_index....GndSubjectheadings..http://d-nb.info/gnd/102968438
>>>>>  
>>>>> <http://d-nb.info/gnd/102968438>....{"varName":[],"varName":[],"varName":[],"varName":[],"varName":[],"varName":[],"varName":[],"prefName":[],"prefName":[],"prefName":[],"prefName":[],"prefName":[],"prefName":[],"prefName":[],"oldgndid":[],"gndtype":["Person"]}..................................
>>>>> As a remark, Jena sends whole text index data within one TCP packet for 
>>>>> one Elasticsearch document.
>>>>> 
>>>>> 3) fuseki.log collects logs when Fuseki server is running, but for text 
>>>>> indexer we have to run java command line, i.e.
>>>>> 
>>>>>   java -cp ./fuseki-server.jar:<other_jars> jena.textindexer 
>>>>> --desc=run/config.ttl
>>>>> The question is how to activate the debug logs during text indexer?
>>>>> 
>>>>> 
>>>>> Regards,
>>>>> Sorin
>>>>> 
>>>>> Am 28.02.2019 um 21:41 schrieb Chris Tomlinson:
>>>>>> Hi Sorin,
>>>>>> 
>>>>>> 1) I suggest trying to simplify the entity map. I assume there’s data 
>>>>>> for each of the properties other than skos:altLabel in the entity map:
>>>>>> 
>>>>>>>          [ text:field "gndtype";
>>>>>>>            text:predicate skos:altLabel
>>>>>>>          ]
>>>>>>>          [ text:field "oldgndid";
>>>>>>>            text:predicate gndo:oldAuthorityNumber
>>>>>>>          ]
>>>>>>>          [ text:field "prefName";
>>>>>>>            text:predicate gndo:preferredNameForTheSubjectHeading
>>>>>>>          ]
>>>>>>>          [ text:field "varName";
>>>>>>>            text:predicate gndo:variantNameForTheSubjectHeading
>>>>>>>          ]
>>>>>>>          [ text:field "prefName";
>>>>>>>            text:predicate gndo:preferredNameForThePlaceOrGeographicName
>>>>>>>          ]
>>>>>>>          [ text:field "varName";
>>>>>>>            text:predicate gndo:variantNameForThePlaceOrGeographicName
>>>>>>>          ]
>>>>>>>          [ text:field "prefName";
>>>>>>>            text:predicate gndo:preferredNameForTheWork
>>>>>>>          ]
>>>>>>>          [ text:field "varName";
>>>>>>>            text:predicate gndo:variantNameForTheWork
>>>>>>>          ]
>>>>>>>          [ text:field "prefName";
>>>>>>>            text:predicate gndo:preferredNameForTheConferenceOrEvent
>>>>>>>          ]
>>>>>>>          [ text:field "varName";
>>>>>>>            text:predicate gndo:variantNameForTheConferenceOrEvent
>>>>>>>          ]
>>>>>>>          [ text:field "prefName";
>>>>>>>            text:predicate gndo:preferredNameForTheCorporateBody
>>>>>>>          ]
>>>>>>>          [ text:field "varName";
>>>>>>>            text:predicate gndo:variantNameForTheCorporateBody
>>>>>>>          ]
>>>>>>>          [ text:field "prefName";
>>>>>>>            text:predicate gndo:preferredNameForThePerson
>>>>>>>          ]
>>>>>>>          [ text:field "varName";
>>>>>>>            text:predicate gndo:variantNameForThePerson
>>>>>>>          ]
>>>>>>>          [ text:field "prefName";
>>>>>>>            text:predicate gndo:preferredNameForTheFamily
>>>>>>>          ]
>>>>>>>          [ text:field "varName";
>>>>>>>            text:predicate gndo:variantNameForTheFamily
>>>>>>>          ]
>>>>>> 2) You might try a TextIndexLucene
>>>>>> 
>>>>>> 3) Adding the line log4j.logger.org.apache.jena.query.text.es=DEBUG 
>>>>>> should work. I see no problem with it.
>>>>>> 
>>>>>> Sorry to be of little help,
>>>>>> Chris
>>>>>> 
>>>>>> 
>>>>>>> On Feb 28, 2019, at 8:53 AM, Sorin Gheorghiu 
>>>>>>> <[email protected]> 
>>>>>>> <mailto:[email protected]> 
>>>>>>> <mailto:[email protected]> 
>>>>>>> <mailto:[email protected]> wrote:
>>>>>>> 
>>>>>>> Hi Chris,
>>>>>>> Thank you for answering, I reply you directly because users@jena 
>>>>>>> doesn't accept messages larger than 1Mb.
>>>>>>> 
>>>>>>> The previous text index successful attempt we did was with 3.8.0, not 
>>>>>>> 3.9.0, sorry for the misinformation.
>>>>>>> Attached is the assembler file for 3.10.0 as requested, as well as the 
>>>>>>> packet capture file to see that only the 'gndtype' field has data.
>>>>>>> I tried to enable the debug logs in log4j.properties with 
>>>>>>> log4j.logger.org.apache.jena.query.text.es=DEBUG but no output in the 
>>>>>>> log file.
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Sorin
>>>>>>> 
>>>>>>> Am 27.02.2019 um 20:01 schrieb Chris Tomlinson:
>>>>>>>> Hi Sorin,
>>>>>>>> 
>>>>>>>> Please provide the assembler file for Elasticsearch that has the 
>>>>>>>> problematic entity map definitions.
>>>>>>>> 
>>>>>>>> There haven’t been any changes in over a year to textindexer since 
>>>>>>>> well before 3.9. I don’t see any relevant changes to the handling of 
>>>>>>>> entity maps either so I can’t begin to pursue the issue further w/o 
>>>>>>>> perhaps seeing your current assembler file. 
>>>>>>>> 
>>>>>>>> I don't have any experience with Elasticsearch or with using 
>>>>>>>> jena-text-es beyond a simple change to TextIndexES.java to change 
>>>>>>>> org.elasticsearch.common.transport.InetSocketTransportAddress to 
>>>>>>>> org.elasticsearch.common.transport.TransportAddress as part of the 
>>>>>>>> upgrade to Lucene 7.4.0 and Elasticsearch 6.4.2.
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Chris
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Feb 25, 2019, at 2:37 AM, Sorin Gheorghiu 
>>>>>>>>> <[email protected]> 
>>>>>>>>> <mailto:[email protected]> 
>>>>>>>>> <mailto:[email protected]> 
>>>>>>>>> <mailto:[email protected]> 
>>>>>>>>> <mailto:[email protected]> 
>>>>>>>>> <mailto:[email protected]> 
>>>>>>>>> <mailto:[email protected]> 
>>>>>>>>> <mailto:[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>> Correction: only the *latest field *from the /text:map/ list contains 
>>>>>>>>> a value.
>>>>>>>>> 
>>>>>>>>> To reformulate:
>>>>>>>>> 
>>>>>>>>> * if there are 3 fields in /text:map/, then during indexing the first
>>>>>>>>>   two are empty (let's name them 'text1' and 'text2') and the latest
>>>>>>>>>   field contains data (let's name it 'text3')
>>>>>>>>> * if on the next attempt the field 'text3' is commented out, then
>>>>>>>>>   'text1' is empty and 'text2' contains data
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Am 22.02.2019 um 15:01 schrieb Sorin Gheorghiu:
>>>>>>>>>> In addition:
>>>>>>>>>> 
>>>>>>>>>>  * if there are 3 fields in /text:map/, then during indexing one
>>>>>>>>>>    contains data (let's name it 'text1'), the others are empty (let's
>>>>>>>>>>    name them 'text2' and 'text3'),
>>>>>>>>>>  * if on the next attempt the field 'text1' is commented out, then
>>>>>>>>>>    'text2' contains data and 'text3' is empty
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> -------- Weitergeleitete Nachricht --------
>>>>>>>>>> Betreff:     Text Index build with empty fields
>>>>>>>>>> Datum:       Fri, 22 Feb 2019 14:01:18 +0100
>>>>>>>>>> Von:         Sorin Gheorghiu <[email protected]> 
>>>>>>>>>> <mailto:[email protected]> 
>>>>>>>>>> <mailto:[email protected]> 
>>>>>>>>>> <mailto:[email protected]> 
>>>>>>>>>> <mailto:[email protected]> 
>>>>>>>>>> <mailto:[email protected]> 
>>>>>>>>>> <mailto:[email protected]> 
>>>>>>>>>> <mailto:[email protected]>
>>>>>>>>>> Antwort an:  [email protected] <mailto:[email protected]> 
>>>>>>>>>> <mailto:[email protected]> <mailto:[email protected]> 
>>>>>>>>>> <mailto:[email protected]> <mailto:[email protected]> 
>>>>>>>>>> <mailto:[email protected]> <mailto:[email protected]>
>>>>>>>>>> An:  [email protected] <mailto:[email protected]> 
>>>>>>>>>> <mailto:[email protected]> <mailto:[email protected]> 
>>>>>>>>>> <mailto:[email protected]> <mailto:[email protected]> 
>>>>>>>>>> <mailto:[email protected]> <mailto:[email protected]>
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> When building the text index with the /jena.textindexer/ tool in 
>>>>>>>>>> Jena 3.10 for an external full-text search engine (Elasticsearch of 
>>>>>>>>>> course) and having multiple fields with different names in 
>>>>>>>>>> /text:map/, just *one field is indexed* (more precisely one field 
>>>>>>>>>> contains data, the others are empty). It doesn't look to be an issue 
>>>>>>>>>> with Elasticsearch, in the logs generated during the indexing the 
>>>>>>>>>> fields are already missing the values, but one. The same setup 
>>>>>>>>>> worked in Jena 3.9. Changing the Java version from 8 to 9 or 11 
>>>>>>>>>> didn't change anything.
>>>>>>>>>> 
>>>>>>>>>> Could it be that changes of the new release have affected this tool 
>>>>>>>>>> and we deal with a bug?
>>>>>>>>>> 
>>>>>>> -- 
>>>>>>> Sorin Gheorghiu             Tel: +49 7531 88-3198
>>>>>>> Universität Konstanz        Raum: B705
>>>>>>> 78464 Konstanz              [email protected] 
>>>>>>> <mailto:[email protected]> 
>>>>>>> <mailto:[email protected]> 
>>>>>>> <mailto:[email protected]> 
>>>>>>> <mailto:[email protected]> 
>>>>>>> <mailto:[email protected]> 
>>>>>>> <mailto:[email protected]> 
>>>>>>> <mailto:[email protected]>
>>>>>>> 
>>>>>>> - KIM: Abteilung Contentdienste -
>>>>> -- 
>>>>> Sorin Gheorghiu             Tel: +49 7531 88-3198
>>>>> Universität Konstanz        Raum: B705
>>>>> 78464 Konstanz              [email protected] 
>>>>> <mailto:[email protected]> 
>>>>> <mailto:[email protected]> 
>>>>> <mailto:[email protected]>
>>>>> 
>>>>> - KIM: Abteilung Contentdienste -
>>> -- 
>>> Sorin Gheorghiu             Tel: +49 7531 88-3198
>>> Universität Konstanz        Raum: B705
>>> 78464 Konstanz              [email protected] 
>>> <mailto:[email protected]>
>>> 
>>> - KIM: Abteilung Contentdienste -
>>> <textindexer_2params_040319.pcap><textindexer_3params_040319.pcap>
>> 
> -- 
> Sorin Gheorghiu             Tel: +49 7531 88-3198
> Universität Konstanz        Raum: B705
> 78464 Konstanz              [email protected] 
> <mailto:[email protected]>
> 
> - KIM: Abteilung Contentdienste -

Re: Text Index build with empty fields

Reply via email to