Hi Sorin,

I have focused on the jena-text integration with Lucene local to jena/fuseki. The Solr integration was dropped over a year ago due to lack of support/interest, and with your information about ES 7.x it's likely going to take someone who is a user of ES to help keep the integration up-to-date.
Anuj Kumar <[email protected]> did the ES integration about a year ago for jena 3.9.0 and, as I mentioned, I made the obvious changes to the ES integration to update to Lucene 7.4.0 for jena 3.10.0. The upgrade to Lucene 7.4.0 <https://issues.apache.org/jira/browse/JENA-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673657#comment-16673657> was prompted by a user, [email protected], who was interested in Lucene 7.5; but the released version of ES was built against Lucene 7.4, so we upgraded to that version.

I've opened JENA-1681 <https://issues.apache.org/jira/browse/JENA-1681> for the issue you've reported. You can report your findings there and hopefully we can get to the bottom of the problem.

Regards,
Chris

> On Mar 12, 2019, at 6:40 AM, Sorin Gheorghiu <[email protected]> wrote:
> 
> Hi Chris,
> 
> Thank you for your detailed answer. I will still try to find the root cause of this issue.
> But I have a question for you: do you know if Jena will support Elasticsearch in future versions?
> 
> I am asking because Elasticsearch 7.0 contains breaking changes that will affect the transport client [1]:
> 
>     The TransportClient is deprecated in favour of the Java High Level REST Client and will be removed in Elasticsearch 8.0.
> 
> This requires changes to the client's initialization code; the Migration Guide [2] explains how to do it.
> [1] https://www.elastic.co/guide/en/elasticsearch/client/java-api/master/transport-client.html
> [2] https://www.elastic.co/guide/en/elasticsearch/client/java-rest/master/java-rest-high-level-migration.html
> 
> Best regards,
> Sorin
> 
> On 11.03.2019 at 18:38, Chris Tomlinson wrote:
>> Hi Sorin,
>> 
>> I haven't had the time to delve further into your issue. Your pcap seems to clearly indicate that there is no data populating any field/property other than the first one in the entity map.
>> 
>> I've included the configuration file that we use. It has many, many fields defined, and they are all populated. We load jena/fuseki from a collection of git repos via a git-to-dbs tool <https://github.com/buda-base/git-to-dbs>, and we don't see the sort of issue you're reporting, where only a single field out of all the defined fields is populated in the dataset and Lucene index - we don't use Elasticsearch.
>> 
>> The point being that whatever is going wrong is apparently not in the parsing of the configuration and the setting up of the internal tables that record which predicates are indexed via Lucene (or Elasticsearch) into which fields.
>> 
>> So it appears to me that the issue is something happening in the connection between the standalone textindexer.java and Elasticsearch via TextIndexES.java. The textindexer.java doesn't have any post-3.8.0 changes that I can see, and the only change in TextIndexES.java is the rename of org.elasticsearch.common.transport.InetSocketTransportAddress to org.elasticsearch.common.transport.TransportAddress as part of the upgrade.
>> 
>> I'm really not able to go further at this time.
>> 
>> I'm sorry,
>> Chris
>> 
>>> # Fuseki configuration for BDRC, configures two endpoints:
>>> # - /bdrc is read-only
>>> # - /bdrcrw is read-write
>>> #
>>> # This was painful to come up with but the web interface basically allows no option
>>> # and there is no subclass inference by default, so such a configuration file is necessary.
>>> #
>>> # The main doc sources are:
>>> # - https://jena.apache.org/documentation/fuseki2/fuseki-configuration.html
>>> # - https://jena.apache.org/documentation/assembler/assembler-howto.html
>>> # - https://jena.apache.org/documentation/assembler/assembler.ttl
>>> #
>>> # See https://jena.apache.org/documentation/fuseki2/fuseki-layout.html for the destination of this file.
>>> 
>>> @prefix fuseki: <http://jena.apache.org/fuseki#> .
>>> @prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>>> @prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
>>> @prefix tdb:    <http://jena.hpl.hp.com/2008/tdb#> .
>>> @prefix tdb2:   <http://jena.apache.org/2016/tdb#> .
>>> @prefix ja:     <http://jena.hpl.hp.com/2005/11/Assembler#> .
>>> @prefix :       <http://base/#> .
>>> @prefix text:   <http://jena.apache.org/text#> .
>>> @prefix skos:   <http://www.w3.org/2004/02/skos/core#> .
>>> @prefix adm:    <http://purl.bdrc.io/ontology/admin/> .
>>> @prefix bdd:    <http://purl.bdrc.io/data/> .
>>> @prefix bdo:    <http://purl.bdrc.io/ontology/core/> .
>>> @prefix bdr:    <http://purl.bdrc.io/resource/> .
>>> @prefix f:      <java:io.bdrc.ldspdi.sparql.functions.> .
>>> 
>>> # [] ja:loadClass "org.seaborne.tdb2.TDB2" .
>>> # tdb2:DatasetTDB2 rdfs:subClassOf ja:RDFDataset .
>>> # tdb2:GraphTDB2   rdfs:subClassOf ja:Model .
>>> 
>>> [] rdf:type fuseki:Server ;
>>>    fuseki:services (
>>>      :bdrcrw
>>>    ) .
>>> 
>>> :bdrcrw rdf:type fuseki:Service ;
>>>     fuseki:name "bdrcrw" ;                     # name of the dataset in the url
>>>     fuseki:serviceQuery "query" ;              # SPARQL query service
>>>     fuseki:serviceUpdate "update" ;            # SPARQL update service
>>>     fuseki:serviceUpload "upload" ;            # Non-SPARQL upload service
>>>     fuseki:serviceReadWriteGraphStore "data" ; # SPARQL Graph store protocol (read and write)
>>>     fuseki:dataset :bdrc_text_dataset ;
>>>     .
>>> 
>>> # using TDB
>>> :dataset_bdrc rdf:type tdb:DatasetTDB ;
>>>     tdb:location "/usr/local/fuseki/base/databases/bdrc" ;
>>>     tdb:unionDefaultGraph true ;
>>>     .
>>> 
>>> # using TDB2
>>> # :dataset_bdrc rdf:type tdb2:DatasetTDB2 ;
>>> #     tdb2:location "/usr/local/fuseki/base/databases/bdrc" ;
>>> #     tdb2:unionDefaultGraph true ;
>>> #     .
>>> 
>>> :bdrc_text_dataset rdf:type text:TextDataset ;
>>>     text:dataset :dataset_bdrc ;
>>>     text:index :bdrc_lucene_index ;
>>>     .
>>> 
>>> # Text index description
>>> :bdrc_lucene_index a text:TextIndexLucene ;
>>>     text:directory <file:/usr/local/fuseki/base/lucene-bdrc> ;
>>>     text:storeValues true ;
>>>     text:multilingualSupport true ;
>>>     text:entityMap :bdrc_entmap ;
>>>     text:defineAnalyzers (
>>>         [ text:defineAnalyzer :romanWordAnalyzer ;
>>>           text:analyzer [
>>>               a text:GenericAnalyzer ;
>>>               text:class "io.bdrc.lucene.sa.SanskritAnalyzer" ;
>>>               text:params (
>>>                   [ text:paramName "mode" ;              text:paramValue "word" ]
>>>                   [ text:paramName "inputEncoding" ;     text:paramValue "roman" ]
>>>                   [ text:paramName "mergePrepositions" ; text:paramValue true ]
>>>                   [ text:paramName "filterGeminates" ;   text:paramValue true ]
>>>               )
>>>           ] ;
>>>         ]
>>>         [ text:defineAnalyzer :devaWordAnalyzer ;
>>>           text:analyzer [
>>>               a text:GenericAnalyzer ;
>>>               text:class "io.bdrc.lucene.sa.SanskritAnalyzer" ;
>>>               text:params (
>>>                   [ text:paramName "mode" ;              text:paramValue "word" ]
>>>                   [ text:paramName "inputEncoding" ;     text:paramValue "deva" ]
>>>                   [ text:paramName "mergePrepositions" ; text:paramValue true ]
>>>                   [ text:paramName "filterGeminates" ;   text:paramValue true ]
>>>               )
>>>           ] ;
>>>         ]
>>>         [ text:defineAnalyzer :slpWordAnalyzer ;
>>>           text:analyzer [
>>>               a text:GenericAnalyzer ;
>>>               text:class "io.bdrc.lucene.sa.SanskritAnalyzer" ;
>>>               text:params (
>>>                   [ text:paramName "mode" ;              text:paramValue "word" ]
>>>                   [ text:paramName "inputEncoding" ;     text:paramValue "SLP" ]
>>>                   [ text:paramName "mergePrepositions" ; text:paramValue true ]
>>>                   [ text:paramName "filterGeminates" ;   text:paramValue true ]
>>>               )
>>>           ] ;
>>>         ]
>>>         [ text:defineAnalyzer :romanLenientIndexAnalyzer ;
>>>           text:analyzer [
>>>               a text:GenericAnalyzer ;
>>>               text:class "io.bdrc.lucene.sa.SanskritAnalyzer" ;
>>>               text:params (
>>>                   [ text:paramName "mode" ;              text:paramValue "syl" ]
>>>                   [ text:paramName "inputEncoding" ;     text:paramValue "roman" ]
>>>                   [ text:paramName "mergePrepositions" ; text:paramValue false ]
>>>                   [ text:paramName "filterGeminates" ;   text:paramValue true ]
>>>                   [ text:paramName "lenient" ;           text:paramValue "index" ]
>>>               )
>>>           ] ;
>>>         ]
>>>         [ text:defineAnalyzer :devaLenientIndexAnalyzer ;
>>>           text:analyzer [
>>>               a text:GenericAnalyzer ;
>>>               text:class "io.bdrc.lucene.sa.SanskritAnalyzer" ;
>>>               text:params (
>>>                   [ text:paramName "mode" ;              text:paramValue "syl" ]
>>>                   [ text:paramName "inputEncoding" ;     text:paramValue "deva" ]
>>>                   [ text:paramName "mergePrepositions" ; text:paramValue false ]
>>>                   [ text:paramName "filterGeminates" ;   text:paramValue true ]
>>>                   [ text:paramName "lenient" ;           text:paramValue "index" ]
>>>               )
>>>           ] ;
>>>         ]
>>>         [ text:defineAnalyzer :slpLenientIndexAnalyzer ;
>>>           text:analyzer [
>>>               a text:GenericAnalyzer ;
>>>               text:class "io.bdrc.lucene.sa.SanskritAnalyzer" ;
>>>               text:params (
>>>                   [ text:paramName "mode" ;              text:paramValue "syl" ]
>>>                   [ text:paramName "inputEncoding" ;     text:paramValue "SLP" ]
>>>                   [ text:paramName "mergePrepositions" ; text:paramValue false ]
>>>                   [ text:paramName "filterGeminates" ;   text:paramValue true ]
>>>                   [ text:paramName "lenient" ;           text:paramValue "index" ]
>>>               )
>>>           ] ;
>>>         ]
>>>         [ text:defineAnalyzer :romanLenientQueryAnalyzer ;
>>>           text:analyzer [
>>>               a text:GenericAnalyzer ;
>>>               text:class "io.bdrc.lucene.sa.SanskritAnalyzer" ;
>>>               text:params (
>>>                   [ text:paramName "mode" ;              text:paramValue "syl" ]
>>>                   [ text:paramName "inputEncoding" ;     text:paramValue "roman" ]
>>>                   [ text:paramName "mergePrepositions" ; text:paramValue false ]
>>>                   [ text:paramName "filterGeminates" ;   text:paramValue false ]
>>>                   [ text:paramName "lenient" ;           text:paramValue "query" ]
>>>               )
>>>           ] ;
>>>         ]
>>>         [ text:defineAnalyzer :hanzAnalyzer ;
>>>           text:analyzer [
>>>               a text:GenericAnalyzer ;
>>>               text:class "io.bdrc.lucene.zh.ChineseAnalyzer" ;
>>>               text:params (
>>>                   [ text:paramName "profile" ;     text:paramValue "TC2SC" ]
>>>                   [ text:paramName "stopwords" ;   text:paramValue false ]
>>>                   [ text:paramName "filterChars" ; text:paramValue 0 ]
>>>               )
>>>           ] ;
>>>         ]
>>>         [ text:defineAnalyzer :han2pinyin ;
>>>           text:analyzer [
>>>               a text:GenericAnalyzer ;
>>>               text:class "io.bdrc.lucene.zh.ChineseAnalyzer" ;
>>>               text:params (
>>>                   [ text:paramName "profile" ;     text:paramValue "TC2PYstrict" ]
>>>                   [ text:paramName "stopwords" ;   text:paramValue false ]
>>>                   [ text:paramName "filterChars" ; text:paramValue 0 ]
>>>               )
>>>           ] ;
>>>         ]
>>>         [ text:defineAnalyzer :pinyin ;
>>>           text:analyzer [
>>>               a text:GenericAnalyzer ;
>>>               text:class "io.bdrc.lucene.zh.ChineseAnalyzer" ;
>>>               text:params (
>>>                   [ text:paramName "profile" ;     text:paramValue "PYstrict" ]
>>>               )
>>>           ] ;
>>>         ]
>>>         [ text:addLang "bo" ;
>>>           text:searchFor ( "bo" "bo-x-ewts" "bo-alalc97" ) ;
>>>           text:analyzer [
>>>               a text:GenericAnalyzer ;
>>>               text:class "io.bdrc.lucene.bo.TibetanAnalyzer" ;
>>>               text:params (
>>>                   [ text:paramName "segmentInWords" ; text:paramValue false ]
>>>                   [ text:paramName "lemmatize" ;      text:paramValue true ]
>>>                   [ text:paramName "filterChars" ;    text:paramValue false ]
>>>                   [ text:paramName "inputMode" ;      text:paramValue "unicode" ]
>>>                   [ text:paramName "stopFilename" ;   text:paramValue "" ]
>>>               )
>>>           ] ;
>>>         ]
>>>         [ text:addLang "bo-x-ewts" ;
>>>           text:searchFor ( "bo" "bo-x-ewts" "bo-alalc97" ) ;
>>>           text:analyzer [
>>>               a text:GenericAnalyzer ;
>>>               text:class "io.bdrc.lucene.bo.TibetanAnalyzer" ;
>>>               text:params (
>>>                   [ text:paramName "segmentInWords" ; text:paramValue false ]
>>>                   [ text:paramName "lemmatize" ;      text:paramValue true ]
>>>                   [ text:paramName "filterChars" ;    text:paramValue false ]
>>>                   [ text:paramName "inputMode" ;      text:paramValue "ewts" ]
>>>                   [ text:paramName "stopFilename" ;   text:paramValue "" ]
>>>               )
>>>           ] ;
>>>         ]
>>>         [ text:addLang "bo-alalc97" ;
>>>           text:searchFor ( "bo" "bo-x-ewts" "bo-alalc97" ) ;
>>>           text:analyzer [
>>>               a text:GenericAnalyzer ;
>>>               text:class "io.bdrc.lucene.bo.TibetanAnalyzer" ;
>>>               text:params (
>>>                   [ text:paramName "segmentInWords" ; text:paramValue false ]
>>>                   [ text:paramName "lemmatize" ;      text:paramValue true ]
>>>                   [ text:paramName "filterChars" ;    text:paramValue false ]
>>>                   [ text:paramName "inputMode" ;      text:paramValue "alalc" ]
>>>                   [ text:paramName "stopFilename" ;   text:paramValue "" ]
>>>               )
>>>           ] ;
>>>         ]
>>>         [ text:addLang "zh-hans" ;
>>>           text:searchFor ( "zh-hans" "zh-hant" ) ;
>>>           text:auxIndex ( "zh-aux-han2pinyin" ) ;
>>>           text:analyzer [
>>>               a text:DefinedAnalyzer ;
>>>               text:useAnalyzer :hanzAnalyzer ] ;
>>>         ]
>>>         [ text:addLang "zh-hant" ;
>>>           text:searchFor ( "zh-hans" "zh-hant" ) ;
>>>           text:auxIndex ( "zh-aux-han2pinyin" ) ;
>>>           text:analyzer [
>>>               a text:DefinedAnalyzer ;
>>>               text:useAnalyzer :hanzAnalyzer ] ;
>>>         ]
>>>         [ text:addLang "zh-latn-pinyin" ;
>>>           text:searchFor ( "zh-latn-pinyin" "zh-aux-han2pinyin" ) ;
>>>           text:analyzer [
>>>               a text:DefinedAnalyzer ;
>>>               text:useAnalyzer :pinyin ] ;
>>>         ]
>>>         [ text:addLang "zh-aux-han2pinyin" ;
>>>           text:searchFor ( "zh-latn-pinyin" "zh-aux-han2pinyin" ) ;
>>>           text:analyzer [
>>>               a text:DefinedAnalyzer ;
>>>               text:useAnalyzer :pinyin ] ;
>>>           text:indexAnalyzer :han2pinyin ;
>>>         ]
>>>         [ text:addLang "sa-x-ndia" ;
>>>           text:searchFor ( "sa-x-ndia" "sa-aux-deva2Ndia" "sa-aux-roman2Ndia" "sa-aux-slp2Ndia" ) ;
>>>           text:analyzer [
>>>               a text:DefinedAnalyzer ;
>>>               text:useAnalyzer :romanLenientQueryAnalyzer ] ;
>>>         ]
>>>         [ text:addLang "sa-aux-deva2Ndia" ;
>>>           text:searchFor ( "sa-x-ndia" "sa-aux-roman2Ndia" "sa-aux-slp2Ndia" ) ;
>>>           text:analyzer [
>>>               a text:DefinedAnalyzer ;
>>>               text:useAnalyzer :romanLenientQueryAnalyzer ] ;
>>>           text:indexAnalyzer :devaLenientIndexAnalyzer ;
>>>         ]
>>>         [ text:addLang "sa-aux-roman2Ndia" ;
>>>           text:searchFor ( "sa-x-ndia" "sa-aux-deva2Ndia" "sa-aux-slp2Ndia" ) ;
>>>           text:analyzer [
>>>               a text:DefinedAnalyzer ;
>>>               text:useAnalyzer :romanLenientQueryAnalyzer ] ;
>>>           text:indexAnalyzer :romanLenientIndexAnalyzer ;
>>>         ]
>>>         [ text:addLang "sa-aux-slp2Ndia" ;
>>>           text:searchFor ( "sa-x-ndia" "sa-aux-deva2Ndia" "sa-aux-roman2Ndia" ) ;
>>>           text:analyzer [
>>>               a text:DefinedAnalyzer ;
>>>               text:useAnalyzer :romanLenientQueryAnalyzer ] ;
>>>           text:indexAnalyzer :slpLenientIndexAnalyzer ;
>>>         ]
>>>         [ text:addLang "sa-deva" ;
>>>           text:searchFor ( "sa-deva" "sa-x-iast" "sa-x-slp1" "sa-x-iso" "sa-alalc97" ) ;
>>>           text:auxIndex ( "sa-aux-deva2Ndia" ) ;
>>>           text:analyzer [
>>>               a text:DefinedAnalyzer ;
>>>               text:useAnalyzer :devaWordAnalyzer ] ;
>>>         ]
>>>         [ text:addLang "sa-x-iso" ;
>>>           text:searchFor ( "sa-x-iso" "sa-x-iast" "sa-x-slp1" "sa-deva" "sa-alalc97" ) ;
>>>           text:auxIndex ( "sa-aux-roman2Ndia" ) ;
>>>           text:analyzer [
>>>               a text:DefinedAnalyzer ;
>>>               text:useAnalyzer :romanWordAnalyzer ] ;
>>>         ]
>>>         [ text:addLang "sa-x-slp1" ;
>>>           text:searchFor ( "sa-x-slp1" "sa-x-iast" "sa-x-iso" "sa-deva" "sa-alalc97" ) ;
>>>           text:auxIndex ( "sa-aux-slp2Ndia" ) ;
>>>           text:analyzer [
>>>               a text:DefinedAnalyzer ;
>>>               text:useAnalyzer :slpWordAnalyzer ] ;
>>>         ]
>>>         [ text:addLang "sa-x-iast" ;
>>>           text:searchFor ( "sa-x-iast" "sa-x-slp1" "sa-x-iso" "sa-deva" "sa-alalc97" ) ;
>>>           text:auxIndex ( "sa-aux-roman2Ndia" ) ;
>>>           text:analyzer [
>>>               a text:DefinedAnalyzer ;
>>>               text:useAnalyzer :romanWordAnalyzer ] ;
>>>         ]
>>>         [ text:addLang "sa-alalc97" ;
>>>           text:searchFor ( "sa-alalc97" "sa-x-slp1" "sa-x-iso" "sa-deva" "sa-iast" ) ;
>>>           text:auxIndex ( "sa-aux-roman2Ndia" ) ;
>>>           text:analyzer [
>>>               a text:DefinedAnalyzer ;
>>>               text:useAnalyzer :romanWordAnalyzer ] ;
>>>         ]
>>>     ) ;
>>>     .
>>> 
>>> # Index mappings
>>> :bdrc_entmap a text:EntityMap ;
>>>     text:entityField "uri" ;
>>>     text:uidField "uid" ;
>>>     text:defaultField "label" ;
>>>     text:langField "lang" ;
>>>     text:graphField "graph" ;    ## enable graph-specific indexing
>>>     text:map (
>>>         [ text:field "label" ;                   text:predicate skos:prefLabel ]
>>>         [ text:field "altLabel" ;                text:predicate skos:altLabel ]
>>>         [ text:field "rdfsLabel" ;               text:predicate rdfs:label ]
>>>         [ text:field "chunkContents" ;           text:predicate bdo:chunkContents ]
>>>         [ text:field "eTextTitle" ;              text:predicate bdo:eTextTitle ]
>>>         [ text:field "logMessage" ;              text:predicate adm:logMessage ]
>>>         [ text:field "noteText" ;                text:predicate bdo:noteText ]
>>>         [ text:field "workAuthorshipStatement" ; text:predicate bdo:workAuthorshipStatement ]
>>>         [ text:field "workColophon" ;            text:predicate bdo:workColophon ]
>>>         [ text:field "workEditionStatement" ;    text:predicate bdo:workEditionStatement ]
>>>         [ text:field "workPublisherLocation" ;   text:predicate bdo:workPublisherLocation ]
>>>         [ text:field "workPublisherName" ;       text:predicate bdo:workPublisherName ]
>>>         [ text:field "workSeriesName" ;          text:predicate bdo:workSeriesName ]
>>>     ) ;
>>>     .
>> 
>>> On Mar 11, 2019, at 11:42 AM, Sorin Gheorghiu <[email protected]> wrote:
>>> 
>>> Hi Chris,
>>> 
>>> have you had time to look into my results, by chance? Would this help to isolate the issue?
>>> Let me know if you need any other data to collect, please.
>>> Best regards,
>>> Sorin
>>> 
>>> -------- Forwarded Message --------
>>> Subject: Re: Text Index build with empty fields
>>> Date: Mon, 4 Mar 2019 17:35:56 +0100
>>> From: Sorin Gheorghiu <[email protected]>
>>> To: [email protected]
>>> CC: Chris Tomlinson <[email protected]>
>>> 
>>> Hi Chris,
>>> 
>>> when I reduce the entity map to 3 fields:
>>> 
>>> [ text:field "oldgndid";
>>>   text:predicate gndo:oldAuthorityNumber
>>> ]
>>> [ text:field "prefName";
>>>   text:predicate gndo:preferredNameForThePerson
>>> ]
>>> [ text:field "varName";
>>>   text:predicate gndo:variantNameForThePerson
>>> ]
>>> 
>>> then only the oldgndid field contains data (see textindexer_3params_040319.pcap attached):
>>> 
>>> ES...|..........\*.......gnd_fts_es_131018_index.Y6BxYm-hT6qL0_NX10HrZQ..GndSubjectheadings.http://d-nb.info/gnd/4000002-3........
>>> ES...B..........\*.....transport_client.indices:data/write/update..gnd_fts_es_131018_index.........GndSubjectheadings.http://d-nb.info/gnd/4000023-0......painless..if((ctx._source == null) || (ctx._source.oldgndid == null) || (ctx._source.oldgndid.empty == true)) {ctx._source.oldgndid=[params.fieldValue] } else {ctx._source.oldgndid.add(params.fieldValue)}..fieldValue..(DE-588c)4000023-0...............gnd_fts_es_131018_index....GndSubjectheadings..http://d-nb.info/gnd/4000023-0..>{"varName":[],"prefName":[],"oldgndid":["(DE-588c)4000023-0"]}.............
>>> Moreover, with 2 fields:
>>> 
>>> [ text:field "prefName";
>>>   text:predicate gndo:preferredNameForThePerson
>>> ]
>>> [ text:field "varName";
>>>   text:predicate gndo:variantNameForThePerson
>>> ]
>>> 
>>> then only the prefName field contains data (see textindexer_2params_040319.pcap attached):
>>> 
>>> ES...|..........\*.......gnd_fts_es_131018_index.Y6BxYm-hT6qL0_NX10HrZQ..GndSubjectheadings.http://d-nb.info/gnd/134316541........
>>> ES...$..........\*.....transport_client.indices:data/write/update..gnd_fts_es_131018_index.........GndSubjectheadings.http://d-nb.info/gnd/1153446294......painless..if((ctx._source == null) || (ctx._source.prefName == null) || (ctx._source.prefName.empty == true)) {ctx._source.prefName=[params.fieldValue] } else {ctx._source.prefName.add(params.fieldValue)}..fieldValue.Pharmakon...............gnd_fts_es_131018_index....GndSubjectheadings..http://d-nb.info/gnd/1153446294..'{"varName":[],"prefName":["Pharmakon"]}.................
>>> 
>>> Regards,
>>> Sorin
>>> 
>>> On 01.03.2019 at 18:06, Chris Tomlinson wrote:
>>>> Hi Sorin,
>>>> 
>>>> tcpdump -A -r works fine to view the pcap file; however, I don't have the time to delve into the data. I'll take your word for it that the whole setup worked in 3.8.0, and I encourage you to try simplifying the entity map, perhaps by having a unique field per property, to see if the problem appears related to the prefName and varName fields mapping to multiple properties.
>>>> 
>>>> I do notice that the field oldgndid only maps to a single property, but not knowing the data I have no idea whether there's any of that data in your tests.
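The painless script visible in these captures does an append-or-create on a single field of the ES document. As a sanity check of the script's logic (not of Jena itself), here is a minimal Python simulation of the update; the function name and the plain-dict stand-in for the document `_source` are mine:

```python
def apply_update(source, field, value):
    """Mimic the painless upsert from the capture: create the field as a
    one-element list if the source/field is missing or empty, else append."""
    if source is None:
        source = {}
    if source.get(field) is None or source[field] == []:
        source[field] = [value]
    else:
        source[field].append(value)
    return source

# One update per (field, value) pair emitted by the indexer, as in the
# 3-field capture where only oldgndid ever receives a value.
doc = {"varName": [], "prefName": [], "oldgndid": []}
doc = apply_update(doc, "oldgndid", "(DE-588c)4000023-0")
print(doc)
```

The script behaves correctly for whatever (field, value) pairs it receives, which supports the suspicion that the missing values are lost before the update requests are built, on the indexer side.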
>>>> 
>>>> Since you indicate that only the gndtype field has data (per the pcap file): if there is oldgndid data (i.e., occurrences of gndo:oldAuthorityNumber), that suggests some rather generic issue with textindexer; however, if there is no oldgndid data, then a problem may have crept in since 3.8.0 that leads to trouble with data for multiple properties assigned to a single field, which I would guess might be related to the google.common.collection.MultiMap that holds the results of parsing the entity map.
>>>> 
>>>> I have no idea how to enable the debug when running the standalone textindexer; perhaps someone else can answer that.
>>>> 
>>>> Regards,
>>>> Chris
>>>> 
>>>>> On Mar 1, 2019, at 2:57 AM, Sorin Gheorghiu <[email protected]> wrote:
>>>>> 
>>>>> Hi Chris,
>>>>> 
>>>>> 1) As I said before, this entity map worked in 3.8.0.
>>>>> The pcap file I sent you is the proof that Jena delivers inconsistent data. You may open it with Wireshark or read it with tcpick:
>>>>> 
>>>>> # tcpick -C -yP -r textindexer_280219.pcap | more
>>>>> 
>>>>> ES...}..........\*.......gnd_fts_es_131018_index.cp-dFuCVTg-dUwvfyREG2w..GndSubjectheadings.http://d-nb.info/gnd/102968225.........
>>>>> ES..............\*.....transport_client.indices:data/write/update..gnd_fts_es_131018_index.........GndSubjectheadings.http://d-nb.info/gnd/102968438......painless..if((ctx._source == null) || (ctx._source.gndtype == null) || (ctx._source.gndtype.empty == true)) {ctx._source.gndtype=[params.fieldValue] } else {ctx._source.gndtype.add(params.fieldValue)}..fieldValue..Person...............gnd_fts_es_131018_index....GndSubjectheadings..http://d-nb.info/gnd/102968438....{"varName":[],"varName":[],"varName":[],"varName":[],"varName":[],"varName":[],"varName":[],"prefName":[],"prefName":[],"prefName":[],"prefName":[],"prefName":[],"prefName":[],"prefName":[],"oldgndid":[],"gndtype":["Person"]}..................................
>>>>> 
>>>>> As a remark: Jena sends the whole text-index data for one Elasticsearch document within a single TCP packet.
>>>>> 
>>>>> 3) fuseki.log collects logs while the Fuseki server is running, but for the text indexer we have to run the java command line, i.e.:
>>>>> 
>>>>> java -cp ./fuseki-server.jar:<other_jars> jena.textindexer --desc=run/config.ttl
>>>>> 
>>>>> The question is how to activate the debug logs when running the text indexer?
>>>>> 
>>>>> Regards,
>>>>> Sorin
>>>>> 
>>>>> On 28.02.2019 at 21:41, Chris Tomlinson wrote:
>>>>>> Hi Sorin,
>>>>>> 
>>>>>> 1) I suggest trying to simplify the entity map. I assume there's data for each of the properties other than skos:altLabel in the entity map:
>>>>>> 
>>>>>>> [ text:field "gndtype";
>>>>>>>   text:predicate skos:altLabel
>>>>>>> ]
>>>>>>> [ text:field "oldgndid";
>>>>>>>   text:predicate gndo:oldAuthorityNumber
>>>>>>> ]
>>>>>>> [ text:field "prefName";
>>>>>>>   text:predicate gndo:preferredNameForTheSubjectHeading
>>>>>>> ]
>>>>>>> [ text:field "varName";
>>>>>>>   text:predicate gndo:variantNameForTheSubjectHeading
>>>>>>> ]
>>>>>>> [ text:field "prefName";
>>>>>>>   text:predicate gndo:preferredNameForThePlaceOrGeographicName
>>>>>>> ]
>>>>>>> [ text:field "varName";
>>>>>>>   text:predicate gndo:variantNameForThePlaceOrGeographicName
>>>>>>> ]
>>>>>>> [ text:field "prefName";
>>>>>>>   text:predicate gndo:preferredNameForTheWork
>>>>>>> ]
>>>>>>> [ text:field "varName";
>>>>>>>   text:predicate gndo:variantNameForTheWork
>>>>>>> ]
>>>>>>> [ text:field "prefName";
>>>>>>>   text:predicate gndo:preferredNameForTheConferenceOrEvent
>>>>>>> ]
>>>>>>> [ text:field "varName";
>>>>>>>   text:predicate gndo:variantNameForTheConferenceOrEvent
>>>>>>> ]
>>>>>>> [ text:field "prefName";
>>>>>>>   text:predicate gndo:preferredNameForTheCorporateBody
>>>>>>> ]
>>>>>>> [ text:field "varName";
>>>>>>>   text:predicate gndo:variantNameForTheCorporateBody
>>>>>>> ]
>>>>>>> [ text:field "prefName";
>>>>>>>   text:predicate gndo:preferredNameForThePerson
>>>>>>> ]
>>>>>>> [ text:field "varName";
>>>>>>>   text:predicate gndo:variantNameForThePerson
>>>>>>> ]
>>>>>>> [ text:field "prefName";
>>>>>>>   text:predicate gndo:preferredNameForTheFamily
>>>>>>> ]
>>>>>>> [ text:field "varName";
>>>>>>>   text:predicate gndo:variantNameForTheFamily
>>>>>>> ]
>>>>>> 
>>>>>> 2) You might try a TextIndexLucene.
>>>>>> 
>>>>>> 3) Adding the line log4j.logger.org.apache.jena.query.text.es=DEBUG should work. I see no problem with it.
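For point 3), the standalone jena.textindexer takes its logging setup from the JVM invocation rather than from the Fuseki service scripts, so making the configuration explicit may help when no output appears. A sketch of a log4j.properties with the suggested line, plus the corresponding command line; the appender layout and the explicit -Dlog4j.configuration flag are assumptions, not something confirmed in the thread:

```
# log4j.properties (assumed to sit in the working directory)
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{HH:mm:ss} %-5p %c{1} - %m%n
# the suggested logger, plus its parent package for good measure
log4j.logger.org.apache.jena.query.text=DEBUG
log4j.logger.org.apache.jena.query.text.es=DEBUG
```

Then point log4j 1.x at the file explicitly when running the indexer:

```
java -cp ./fuseki-server.jar:<other_jars> \
     -Dlog4j.configuration=file:log4j.properties \
     jena.textindexer --desc=run/config.ttl
```

If the properties file was simply not on the classpath of the java invocation, that would explain seeing no debug output despite the correct logger line.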
>>>>>> 
>>>>>> Sorry to be of little help,
>>>>>> Chris
>>>>>> 
>>>>>>> On Feb 28, 2019, at 8:53 AM, Sorin Gheorghiu <[email protected]> wrote:
>>>>>>> 
>>>>>>> Hi Chris,
>>>>>>> Thank you for answering. I am replying to you directly because users@jena doesn't accept messages larger than 1 MB.
>>>>>>> 
>>>>>>> Our previous successful text-index attempt was with 3.8.0, not 3.9.0; sorry for the misinformation.
>>>>>>> Attached is the assembler file for 3.10.0 as requested, as well as the packet-capture file showing that only the 'gndtype' field has data.
>>>>>>> I tried to enable the debug logs in log4j.properties with log4j.logger.org.apache.jena.query.text.es=DEBUG, but there was no output in the log file.
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Sorin
>>>>>>> 
>>>>>>> On 27.02.2019 at 20:01, Chris Tomlinson wrote:
>>>>>>>> Hi Sorin,
>>>>>>>> 
>>>>>>>> Please provide the assembler file for Elasticsearch that has the problematic entity map definitions.
>>>>>>>> 
>>>>>>>> There haven't been any changes to textindexer in over a year, since well before 3.9. I don't see any relevant changes to the handling of entity maps either, so I can't begin to pursue the issue further without perhaps seeing your current assembler file.
>>>>>>>> 
>>>>>>>> I don't have any experience with Elasticsearch or with using jena-text-es beyond a simple change to TextIndexES.java to rename org.elasticsearch.common.transport.InetSocketTransportAddress to org.elasticsearch.common.transport.TransportAddress as part of the upgrade to Lucene 7.4.0 and Elasticsearch 6.4.2.
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Chris
>>>>>>>> 
>>>>>>>>> On Feb 25, 2019, at 2:37 AM, Sorin Gheorghiu <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>> Correction: only the *last field* from the /text:map/ list contains a value.
>>>>>>>>> 
>>>>>>>>> To reformulate:
>>>>>>>>> 
>>>>>>>>> * if there are 3 fields in /text:map/, then during indexing the first two are empty (let's name them 'text1' and 'text2') and the last field contains data (let's name it 'text3')
>>>>>>>>> * if on the next attempt the field 'text3' is commented out, then 'text1' is empty and 'text2' contains data
>>>>>>>>> 
>>>>>>>>> On 22.02.2019 at 15:01, Sorin Gheorghiu wrote:
>>>>>>>>>> In addition:
>>>>>>>>>> 
>>>>>>>>>> * if there are 3 fields in /text:map/, then during indexing one contains data (let's name it 'text1') and the others are empty (let's name them 'text2' and 'text3'),
>>>>>>>>>> * if on the next attempt the field 'text1' is commented out, then 'text2' contains data and 'text3' is empty
>>>>>>>>>> 
>>>>>>>>>> -------- Forwarded Message --------
>>>>>>>>>> Subject: Text Index build with empty fields
>>>>>>>>>> Date: Fri, 22 Feb 2019 14:01:18 +0100
>>>>>>>>>> From: Sorin Gheorghiu <[email protected]>
>>>>>>>>>> Reply-To: [email protected]
>>>>>>>>>> To: [email protected]
>>>>>>>>>> 
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> When building the text index with the /jena.textindexer/ tool in Jena 3.10 for an external full-text search engine (Elasticsearch, of course) and having multiple fields with different names in /text:map/, just *one field is indexed* (more precisely, one field contains data and the others are empty). It doesn't look to be an issue with Elasticsearch: in the logs generated during the indexing, all fields but one are already missing their values. The same setup worked in Jena 3.9. Changing the Java version from 8 to 9 or 11 didn't change anything.
>>>>>>>>>> 
>>>>>>>>>> Could it be that changes in the new release have affected this tool and we are dealing with a bug?
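One way to picture Chris's MultiMap guess above (that the routing tables built from the entity map lose entries somewhere after parsing) is a toy model: if only one field→predicate mapping survives, exactly the reported symptom appears. This is a hypothetical illustration in Python, not Jena's actual code; the gndo: names come from the thread, and the literal values are illustrative:

```python
# field -> predicate pairs, as in the reduced 3-field entity map.
entity_map = [
    ("oldgndid", "gndo:oldAuthorityNumber"),
    ("prefName", "gndo:preferredNameForThePerson"),
    ("varName",  "gndo:variantNameForThePerson"),
]

def index_doc(triples, routing):
    """Route each (predicate, literal) pair to its mapped field."""
    out = {field: [] for field, _ in entity_map}
    for predicate, literal in triples:
        for field, mapped in routing:
            if mapped == predicate:
                out[field].append(literal)
    return out

triples = [
    ("gndo:oldAuthorityNumber", "(DE-588c)4000023-0"),
    ("gndo:preferredNameForThePerson", "Pharmakon"),
    ("gndo:variantNameForThePerson", "Pharmaka"),
]

# Healthy routing table: every field receives its literal.
print(index_doc(triples, entity_map))

# If only one mapping survives parsing, every other field comes out
# empty -- the behaviour seen in the pcap dumps.
print(index_doc(triples, entity_map[:1]))
```

Comparing the two outputs against the capture (all fields empty except one) would show whether the loss happens at the routing-table stage or later, in the ES update path.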
>>>>>>> 
>>>>>>> --
>>>>>>> Sorin Gheorghiu        Tel: +49 7531 88-3198
>>>>>>> Universität Konstanz   Raum: B705
>>>>>>> 78464 Konstanz         [email protected]
>>>>>>> 
>>>>>>> - KIM: Abteilung Contentdienste -
>>> 
>>> <textindexer_2params_040319.pcap><textindexer_3params_040319.pcap>
