Re: fuseki text:query : strange results + Lucene configuration

Vincent Ventresque Wed, 12 Sep 2018 05:07:17 -0700

Hello Rob


Thank you for all these elements.

> there is a limit on the results returned from each text search so when these are *separately executed and joined together* you may only get a subset of the full results


Could you please explain what would be a 'non-separate' query? Do you mean :

?s text:query ( "givenName:\"*J*\" AND familyName:\"Roussea\"" ) ?

I wrote two separate triple patterns (the first for givenName, the second for familyName) because I had read that "when a query is to involve two or more properties then it expressed at the SPARQL level, as it were, versus in Lucene's query language" (https://jena.apache.org/documentation/query/text-query.html#queries-across-multiple-fields).


Vincent

On 12/09/2018 at 11:52, Rob Vesse wrote:

Well the order of triple patterns shouldn't matter too much when you have a 
pure BGP (albeit the optimiser might pick a bad order in some cases)

But we aren't talking about pure BGPs here, having the text:query triples 
results in the BGP being broken up into joins of several property functions 
with the regular triple patterns interspersed through those.  So if we take 
your query and run it through Jena's algebra compiler (you can do this online 
at http://sparql.org/validate/query) we get the following:

   1 (base <http://example/base/>
   2   (prefix ((rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>)
   3            (owl: <http://www.w3.org/2002/07/owl#>)
   4            (apf: <http://jena.hpl.hp.com/ARQ/property#>)
   5            (xsd: <http://www.w3.org/2001/XMLSchema#>)
   6            (fn: <http://www.w3.org/2005/xpath-functions#>)
   7            (rdfs: <http://www.w3.org/2000/01/rdf-schema#>)
   8            (text: <http://jena.apache.org/text#>)
   9            (foaf: <http://xmlns.com/foaf/0.1/>)
  10            (dc: <http://purl.org/dc/elements/1.1/>))
  11     (sequence
  12       (propfunc text:query
  13         ?uriBnF (foaf:givenName "$MY_STRING")
  14         (propfunc text:query
  15           ?uriBnF (foaf:familyName "roussea*")
  16           (table unit)))
  17       (bgp
  18         (triple ?uriBnF foaf:familyName ?nom)
  19         (triple ?uriBnF foaf:givenName ?prenom)
  20       ))))

So first it's doing the text search on your parameter (lines 12-13), then 
joining that to the text search on your surname (lines 14-15) by substituting 
the bindings from your first text search, and then finally joining that with 
the plain BGP (lines 17-19).

So in this case the ordering of your property functions in the query is going 
to make a difference to the evaluation.  As I think Osma already pointed out 
there is a limit on the results returned from each text search so when these 
are separately executed and joined together you may only get a subset of the 
full results that your text index holds.
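
To make the effect concrete, here is a minimal Python sketch of a toy model. It is not jena-text's actual implementation: the `text_search` and `joined` helpers, the data, and the assumption that hits come back in list order are all invented for illustration. Each search returns at most `limit` hits, and the join keeps only subjects found by both searches.

```python
from typing import Callable, List, Tuple

# (uri, givenName, familyName), already lowercased. Hypothetical toy data.
Person = Tuple[str, str, str]


def text_search(people: List[Person], field: int,
                matches: Callable[[str], bool], limit: int) -> List[str]:
    """Toy stand-in for one text search: return at most `limit` matching URIs."""
    hits = [p[0] for p in people if matches(p[field])]
    return hits[:limit]


def joined(people: List[Person], limit: int) -> List[str]:
    """Run both searches separately, then keep the URIs present in both."""
    given_hits = text_search(people, 1, lambda v: "j" in v, limit)
    family_hits = set(
        text_search(people, 2, lambda v: v.startswith("roussea"), limit))
    return [uri for uri in given_hits if uri in family_hits]


# 1000 unrelated people whose given name contains "j" come first,
# followed by the single entry we are actually looking for.
people = [(f"p{i}", "jean", "dupont") for i in range(1000)]
people.append(("target", "jean-jacques", "rousseau"))

small = joined(people, limit=100)      # "target" is cut off by the limit
large = joined(people, limit=10_000)   # the limit exceeds the number of hits
```

In this model `small` is empty because the wanted entry never makes it into the first, truncated hit list, while `large` contains it. Raising the limit on the broad search is what brings the missing rows back.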

Rob

On 12/09/2018, 09:55, "Vincent Ventresque" <vincent.ventres...@ens-lyon.fr> 
wrote:

     Hi Lorenz,

     Thanks for your reply.

     > for me it sounds more like you've found a bug

     I'm not able to tell, just beginning to use Fuseki + Lucene.

     > I'm just referring to "Order of triple patterns in a BGP" here

     Could you please give a raw text URL for "Order of triple patterns in a
     BGP" (seems that the 'here' in your mail had a formatted link but I
     didn't receive the url in my mailbox).

     > The order of triple patterns in a BGP shouldn't matter

     I thought that it was better (for performance/speed) to begin with 1)
     constants and 2) variables having few solutions in the dataset. I've
     read something about SPARQL optimization and algebra, but can't remember
     where. But maybe you're talking about the logic itself (A+B = B+A)?
     N.B. I find these questions very interesting, but I'm no SPARQL
     specialist (nor a logician).
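
The two views can both hold: the order does not change the answer of a plain join, but it can change the work done. A minimal Python sketch, under the simplifying assumption that the engine probes the second pattern once per solution of the first (the `join` helper and the data are hypothetical, not Jena's optimizer):

```python
from typing import List, Set, Tuple


def join(first: List[str], second: List[str]) -> Tuple[Set[str], int]:
    """Probe `second` once per solution of `first`.

    Returns the joined subjects and the number of probes, used here
    as a crude cost measure (building the lookup is ignored).
    """
    lookup = set(second)
    results = {s for s in first if s in lookup}
    return results, len(first)


few = ["a", "b"]                                      # a selective pattern
many = [f"s{i}" for i in range(10_000)] + ["a", "b"]  # a broad pattern

res_1, probes_1 = join(few, many)   # selective pattern first
res_2, probes_2 = join(many, few)   # broad pattern first
```

Both orders give the same set of results (the "A+B = B+A" part), but starting with the selective pattern needs far fewer probes. This only covers pure joins; once a per-search result limit is involved, the order can change the results as well.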

     Cheers,
     Vincent

     On 12/09/2018 at 10:32, Lorenz B. wrote:

     > Hi "VV",
     >
     > well, for me it sounds more like you've found a bug and are now doing a
     > workaround. Or at least something is strange and I'm just referring to
     > "Order of triple patterns in a BGP" here.
     >
     > The order of triple patterns in a BGP shouldn't matter - as far as I
     > know it's always a good old join on the intermediate result of the
     > evaluation of the triple patterns.
     >
     > Indeed, the limit of the text index lookup matters as the internal
     > ordering by Lucene is based on some Information Retrieval measure (close
     > to TF-IDF probably with default settings).
     >
     > But I guess, Osma and Andy will give you a better and more correct answer.
     >
     >
     > Cheers,
     > Lorenz
     >
     >> Hello Osma,
     >>
     >>
     >> Thank you very much for your reply, you solved the problem! I've made
     >> a few tests, both the order and the limit are important (see below).
     >>
     >> Just one more question: I thought that, since the "Roussea*" matches
     >> are fewer than the "*J*" matches, it would be more efficient to begin
     >> with "Roussea*". Can you explain why it's the opposite?
     >>
     >> Best,
     >>
     >> VV.
     >>
     >>
     >> 1) --------- changing only the order --------------------------
     >>
     >> ?uriBnF text:query ( foaf:givenName "*J*" ) .
     >> ?uriBnF text:query ( foaf:familyName "roussea*" ) .
     >> ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom
     >>
     >>  => 3 "Jean-Marie Rousseau" ... (even if I add a limit = 100 000 or 2
     >> 000 000)
     >>
     >> 2) --------- changing order + limit = 100 000 --------------------------
     >>
     >> ?uriBnF text:query ( foaf:givenName "*J*" 100000 ) .
     >>  ?uriBnF text:query ( foaf:familyName "roussea*" ) .
     >>  ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom
     >>
     >>  => 54 entries but not "Jean-Jacques" !
     >>
     >> 3) --------- changing order + limit = 1 000 000
     >> --------------------------
     >>
     >>  ?uriBnF text:query ( foaf:givenName "*J*" 1000000 ) .
     >>  ?uriBnF text:query ( foaf:familyName "roussea*" ) .
     >>  ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom
     >>
     >> => 135 entries, including the 4 "Jean-Jacques", in  1.7 second
     >>
     >> 4) --------- test using filters (strstarts + contains)
     >> --------------------------
     >>
     >> ?uriBnF foaf:familyName ?nom
     >> filter(strstarts(?nom, "Roussea"))
     >> ?uriBnF foaf:givenName ?prenom
     >> filter(contains(?prenom, "J"))
     >>
     >> => 129 entries, 27 seconds [fewer results than
     >> "text:query ( foaf:givenName "*J*" 1000000 )" because contains is
     >> case-sensitive?]
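
One plausible cause of the 129 vs. 135 difference can be sketched in Python. This assumes the index stores values after ASCII folding and lowercasing (as in the configuration quoted below) and that the query term is folded the same way. The `fold` helper only approximates those filters, and the names are made-up sample data.

```python
import unicodedata


def fold(s: str) -> str:
    """Approximate ASCIIFoldingFilter + LowerCaseFilter: strip accents, lowercase."""
    decomposed = unicodedata.normalize("NFKD", s)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower()


names = ["Jean-Jacques", "jean", "Benjamin", "Marie"]

# Like FILTER(contains(?prenom, "J")): an exact, case-sensitive substring test.
case_sensitive = [n for n in names if "J" in n]

# Like the "*J*" text search on a folded, lowercased index.
folded = [n for n in names if fold("J") in fold(n)]
```

Here the case-sensitive test misses "jean" and "Benjamin", which only contain a lowercase "j", whereas the folded comparison finds them.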
     >>
     >> -----------------------------------------------------
     >>
     >> More information about the dataset:
     >>
     >> # 3 fields are indexed ( foaf:name + foaf:givenName are in the same
     >> named graph )
     >>
     >> -- dcterms:title = +/- 9.45 M.
     >>
     >> -- foaf:givenName = +/- 1.71 M.
     >>
     >> -- foaf:familyName = +/- 1.78 M.
     >>
     >> # config file :
     >>
     >> ----------------
     >>
     >> text:storeValues true ;
     >>     text:queryParser text:AnalyzingQueryParser ;
     >>     text:map (
     >>         [ text:field "title" ; text:predicate dcterms:title ;
     >>         text:analyzer [ a text:ConfigurableAnalyzer ;
     >>          text:tokenizer text:KeywordTokenizer ;
     >>          text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
     >>          ] ]
     >>          [ text:field "familyName" ; text:predicate foaf:familyName ;
     >>         text:analyzer [ a text:ConfigurableAnalyzer ;
     >>          text:tokenizer text:KeywordTokenizer ;
     >>          text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
     >>          ] ]
     >>          [ text:field "givenName" ; text:predicate foaf:givenName ;
     >>         text:analyzer [ a text:ConfigurableAnalyzer ;
     >>          text:tokenizer text:KeywordTokenizer ;
     >>          text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
     >>          ] ]
     >>
     >>          ) .
     >>
     >>
     >>
     >>
     >>
     >>
     >> On 10/09/2018 at 18:58, Osma Suominen wrote:
     >>> Hello Vincent,
     >>>
     >>> The results you get don't seem quite right. As you say, with a
     >>> shorter query one would expect more results.
     >>>
     >>> One thing to do would be to check what results you get if you run the
     >>> queries individually. I think combining the two separate jena-text
     >>> queries (for foaf:familyName and foaf:givenName) may be part of the
     >>> problem here... So if you execute only the "roussea*" part of the
     >>> query, do you get the expected number of results? What about if you
     >>> only execute one of the givenName queries with no restriction on
     >>> familyName?
     >>>
     >>> Does it make a difference if you change the order of the familyName
     >>> and givenName clauses?
     >>>
     >>> One thing to consider is that Lucene queries always have a limit on
     >>> the number of results. With jena-text you can specify it as an
     >>> additional parameter, but if you leave it out, it will default to
     >>> 10000. My guess is that the givenName queries may generate more
     >>> results than 10000, and the results will then be cut off. This may
     >>> mean that you get many Jeans and Jacques's and Johns etc. but many of
     >>> the J. Rousseaus get cut off from the list. Try adding a large limit
     >>> parameter (say 100000 or more) to the text:query functions to see if
     >>> it helps. Like this:
     >>>
     >>>     ?uriBnF text:query ( foaf:givenName "*J*" 100000 )
     >>>
     >>> jena-text is not very good at combining multiple criteria. You can do
     >>> it with separate queries as you've done, but internally the queries
     >>> will run separately and the results will only be combined in Jena,
     >>> outside Lucene.
     >>>
     >>> -Osma
     >>>
     >>>
     >>>
     >>> Vincent Ventresque wrote on 10.09.2018 at 13:03:
     >>>> Hello,
     >>>>
     >>>>
     >>>> I've made new tests with a slightly different dataset and
     >>>> configuration, the problem is the same.
     >>>>
     >>>> --- Could you please tell me if these results are normal (I expected
     >>>> a bigger list with fewer letters)?
     >>>>
     >>>> ?uriBnF text:query ( foaf:givenName "*J*" ) => 3 entries
     >>>>
     >>>> ?uriBnF text:query ( foaf:givenName "*Ja*" ) => 1 entry
     >>>>
     >>>> ?uriBnF text:query ( foaf:givenName "*Je*" ) => 11 entries
     >>>>
     >>>> ?uriBnF text:query ( foaf:givenName "*-J*" ) => 11 entries
     >>>>
     >>>> ?uriBnF text:query ( foaf:givenName "*Jea*" ) => 12 entries
     >>>>
     >>>> ?uriBnF text:query ( foaf:givenName "*Jac*" ) => 13 entries
     >>>>
     >>>> Here is the complete query :
     >>>>
     >>>> SELECT * WHERE { ?uriBnF text:query ( foaf:familyName "roussea*" ) .
     >>>> ?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .
     >>>>
     >>>> ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom }
     >>>>
     >>>> N.B. : the dataset is quite large : 1,78 M family names indexed, and
     >>>> 1,71 M given names. I have 4 distinct "Jean-Jacques Rousseau" in the
     >>>> data, 713 family names containing "roussea", including 224 compound
     >>>> given names.
     >>>>
     >>>> --- Do you know where to find more documentation about Lucene
     >>>> configuration (I read jena.apache.org page + , and also found useful
     >>>> explanations on Skosmos wiki https://github.com/NatLibFi/Skosmos ),
     >>>> especially about tokenizers  ?
     >>>>
     >>>>
     >>>> Thanks in advance,
     >>>>
     >>>> VV
     >>>>
     >>>>
     >>>>
     >>>>
     >>>>
     >>>>
     >>>>
     >>>> On 19/07/2018 at 14:07, Vincent Ventresque wrote:
     >>>>> Hello,
     >>>>>
     >>>>> I've just subscribed to the users@jena.apache.org list, and I
     >>>>> apologize if this mail is not sent properly.
     >>>>>
     >>>>> I'm trying to use Fuseki text:query, and have encountered several
     >>>>> issues. Here are my questions
     >>>>>
     >>>>> 1) Does text:query require a minimum number of characters to be
     >>>>> efficient?
     >>>>>
     >>>>> 2) Is performance linked to the number of fields indexed?
     >>>>>
     >>>>> 3) In order to retrieve strings containing hyphens, should I use
     >>>>> KeywordTokenizer in config file?
     >>>>>
     >>>>> ~~~ 1) Does text:query require a minimum number of characters to be
     >>>>> efficient? ~~~~~~~~~~~~~
     >>>>>
     >>>>> I've noticed that a query on indexed predicates (foaf:familyName
     >>>>> and foaf:givenName) returns more results when there are more
     >>>>> characters in the string :
     >>>>>
     >>>>> SELECT * WHERE {
     >>>>>
     >>>>> ?uriBnF text:query ( foaf:familyName "roussea*" ) .
     >>>>>
     >>>>> ?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .
     >>>>>
     >>>>> ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom .
     >>>>>
     >>>>> optional {?uriBnF bio:birth ?dateNaissance }
     >>>>>
     >>>>> }
     >>>>>
     >>>>> I was expecting that "Rousseau" + "Jean-Jacques" would be in the
     >>>>> results.
     >>>>>
     >>>>> => if  $MY_STRING = "j*", I get 0 results
     >>>>>
     >>>>> => if  $MY_STRING = "je*", I get 17 results, including
     >>>>> "Jean-Claude" & "Jean-Baptiste" BUT not "Jean-Jacques"
     >>>>>
     >>>>> => if  $MY_STRING = "jea*", I get 27 results, including
     >>>>> "Jean-Jacques"
     >>>>>
     >>>>> I don't know anything about Lucene, but it looks very strange to
     >>>>> me: I expected the opposite (fewer letters = longer results list).
     >>>>>
     >>>>>
     >>>>> ~~~ 2) Is performance linked to the number of fields indexed?
     >>>>> ~~~~~~~~~~~~~~~~~~~~~~~
     >>>>>
     >>>>> If I change the configuration and index only foaf:givenName, and
     >>>>> provide a constant for foaf:familyName, the query returns more
     >>>>> results :
     >>>>>
     >>>>> SELECT * WHERE {
     >>>>>
     >>>>> ?uriBnF foaf:familyName "Rousseau" .
     >>>>>
     >>>>> ?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .
     >>>>>
     >>>>> ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom .
     >>>>>
     >>>>> optional {?uriBnF bio:birth ?dateNaissance }
     >>>>>
     >>>>> }
     >>>>>
     >>>>> => if  $MY_STRING = "j*", I get 7 results, whereas the first query
     >>>>> returned 0 results.
     >>>>>
     >>>>>
     >>>>> ~~~ 3) In order to retrieve strings containing hyphens, should I use
     >>>>> KeywordTokenizer in config file? ~~~~~~~~~~~~~
     >>>>>
     >>>>> With the same query, if $MY_STRING = "jean-ja*" :
     >>>>>
     >>>>> a) with simple configuration (cf. below), I get 0 results
     >>>>>
     >>>>> b) with KeywordTokenizer config (cf. below), I get "Jean-Jacques"
     >>>>>
     >>>>> Is it the right way to get "Jean-Jacques"?
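
The difference between the two configurations can be illustrated with a rough Python sketch. It assumes the simple configuration uses a word-splitting analyzer that breaks values on the hyphen and that a wildcard pattern is matched against individual tokens. The tokenizer functions are simplified stand-ins, not Lucene's classes.

```python
import re
from fnmatch import fnmatchcase
from typing import List


def word_tokens(value: str) -> List[str]:
    """Rough stand-in for a word tokenizer: lowercase, split on non-word chars."""
    return [t for t in re.split(r"\W+", value.lower()) if t]


def keyword_tokens(value: str) -> List[str]:
    """Stand-in for KeywordTokenizer + LowerCaseFilter: one token per value."""
    return [value.lower()]


def matches(tokens: List[str], pattern: str) -> bool:
    """A wildcard pattern must match one single token in full."""
    return any(fnmatchcase(t, pattern) for t in tokens)


name = "Jean-Jacques"
split_match = matches(word_tokens(name), "jean-ja*")     # tokens: jean, jacques
whole_match = matches(keyword_tokens(name), "jean-ja*")  # token: jean-jacques
```

With word splitting, no single token starts with "jean-ja", so the pattern finds nothing. With the whole value kept as one token, the pattern matches.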
     >>>>>
     >>>>>
     >>>>> Thanks in advance
     >>>>>
     >>>>> VV
     >>>>>
     >>>>>
     >>>>>
     >>>>> =============== SIMPLE CONFIGURATION ===================
     >>>>>
     >>>>> @prefix :        <#> .
     >>>>> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
     >>>>> @prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
     >>>>> @prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
     >>>>> @prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
     >>>>> @prefix text:    <http://jena.apache.org/text#> .
     >>>>> @prefix fuseki:  <http://jena.apache.org/fuseki#> .
     >>>>> @prefix foaf: <http://xmlns.com/foaf/0.1/> .
     >>>>> @prefix dcterms: <http://purl.org/dc/terms/> .
     >>>>>
     >>>>>
     >>>>>
     >>>>> [] rdf:type fuseki:Server ;
     >>>>>    .
     >>>>>
     >>>>>
     >>>>> ## Initialize TDB --------------------------------
     >>>>>
     >>>>> [] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
     >>>>> tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
     >>>>> tdb:GraphTDB    rdfs:subClassOf  ja:Model .
     >>>>>
     >>>>> ## Initialize text query -------------------------------------
     >>>>> [] ja:loadClass       "org.apache.jena.query.text.TextQuery" .
     >>>>> # A TextDataset is a regular dataset with a text index.
     >>>>> text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
     >>>>> # Lucene index
     >>>>> text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .
     >>>>>
     >>>>> ## ---------------------------------------------------------------
     >>>>> ## This URI must be fixed - it's used to assemble the text dataset.
     >>>>>
     >>>>> :text_dataset rdf:type     text:TextDataset ;
     >>>>> #    text:dataset   <#dataset> ;
     >>>>>     text:dataset :tdb_dataset_readwrite ;
     >>>>> #    text:index     <#indexLucene> ;
     >>>>>     text:index :My_Lucene_index ;
     >>>>>     .
     >>>>>
     >>>>> # A TDB dataset used for RDF storage ------------------------------
     >>>>> :tdb_dataset_readwrite
     >>>>>         a             tdb:DatasetTDB ;
     >>>>>         tdb:location  "$_BnF_text" ;
     >>>>> .
     >>>>>
     >>>>> # Text index description ------------------------------------------
     >>>>> #<#indexLucene> a text:TextIndexLucene ;
     >>>>> :My_Lucene_index a text:TextIndexLucene ;
     >>>>>     text:directory <file:$_Lucene> ;
     >>>>>     text:entityMap <#entMap> ;
     >>>>>     .
     >>>>>
     >>>>> # Mapping in the index ---------------------------------------------
     >>>>> # URI stored in field "uri"
     >>>>> <#entMap> a text:EntityMap ;
     >>>>>     text:entityField      "uri" ;
     >>>>>     text:defaultField     "familyName" ;
     >>>>>     text:map (
     >>>>>          [ text:field "familyName" ; text:predicate foaf:familyName ]
     >>>>>          [ text:field "givenName" ; text:predicate foaf:givenName ]
     >>>>>          ) .
     >>>>>
     >>>>> :service_tdb_all  a                   fuseki:Service ;
     >>>>>         rdfs:label                    "TDB BnF_text" ;
     >>>>>         fuseki:dataset               :text_dataset ;
     >>>>>         fuseki:name                   "BnF_text" ;
     >>>>>         fuseki:serviceQuery           "query" , "sparql" ;
     >>>>>         fuseki:serviceReadGraphStore  "get" ;
     >>>>>         fuseki:serviceReadWriteGraphStore "data" ;
     >>>>>         fuseki:serviceUpdate          "update" ;
     >>>>>         fuseki:serviceUpload          "upload" .
     >>>>>
     >>>>>
     >>>>> =========== KEYWORD TOKENIZER CONFIGURATION ================
     >>>>>
     >>>>> @prefix :        <#> .
     >>>>> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
     >>>>> @prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
     >>>>> @prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
     >>>>> @prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
     >>>>> @prefix text:    <http://jena.apache.org/text#> .
     >>>>> @prefix fuseki:  <http://jena.apache.org/fuseki#> .
     >>>>> @prefix foaf: <http://xmlns.com/foaf/0.1/> .
     >>>>> @prefix dcterms: <http://purl.org/dc/terms/> .
     >>>>>
     >>>>>
     >>>>>
     >>>>> [] rdf:type fuseki:Server ;
     >>>>>
     >>>>>    .
     >>>>>
     >>>>>
     >>>>> ## Initialize TDB --------------------------------
     >>>>>
     >>>>> [] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
     >>>>> tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
     >>>>> tdb:GraphTDB    rdfs:subClassOf  ja:Model .
     >>>>>
     >>>>> ## Initialize text query -------------------------------------
     >>>>> [] ja:loadClass       "org.apache.jena.query.text.TextQuery" .
     >>>>> # A TextDataset is a regular dataset with a text index.
     >>>>> text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
     >>>>> # Lucene index
     >>>>> text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .
     >>>>>
     >>>>> ## ---------------------------------------------------------------
     >>>>>
     >>>>>
     >>>>> :text_dataset rdf:type     text:TextDataset ;
     >>>>> #    text:dataset   <#dataset> ;
     >>>>>     text:dataset :tdb_dataset_readwrite ;
     >>>>> #    text:index     <#indexLucene> ;
     >>>>>     text:index :My_Lucene_index ;
     >>>>>     .
     >>>>>
     >>>>> # A TDB dataset used for RDF storage ------------------------------
     >>>>> :tdb_dataset_readwrite
     >>>>>         a             tdb:DatasetTDB ;
     >>>>>         tdb:location  "$_BnF_text" ;
     >>>>> .
     >>>>>
     >>>>> # Text index description ------------------------------------------
     >>>>> #<#indexLucene> a text:TextIndexLucene ;
     >>>>> :My_Lucene_index a text:TextIndexLucene ;
     >>>>>     text:directory <file:$_Lucene> ;
     >>>>>     text:entityMap <#entMap> ;
     >>>>>     .
     >>>>>
     >>>>> # Mapping in the index ---------------------------------------------
     >>>>> # URI stored in field "uri"
     >>>>> <#entMap> a text:EntityMap ;
     >>>>>     text:entityField      "uri" ;
     >>>>>     text:defaultField     "givenName" ;
     >>>>>     text:map (
     >>>>>
     >>>>>          [ text:field "familyName" ; text:predicate foaf:familyName ;
     >>>>>          text:analyzer [ a text:ConfigurableAnalyzer ;
     >>>>>                text:tokenizer text:KeywordTokenizer ;
     >>>>>                text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
     >>>>>              ] ]
     >>>>>          [ text:field "givenName" ; text:predicate foaf:givenName ;
     >>>>>         text:analyzer [ a text:ConfigurableAnalyzer ;
     >>>>>          text:tokenizer text:KeywordTokenizer ;
     >>>>>          text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
     >>>>>          ] ]
     >>>>>          ) .
     >>>>>
     >>>>> :service_tdb_all  a                   fuseki:Service ;
     >>>>>         rdfs:label                    "TDB BnF_text" ;
     >>>>>         fuseki:dataset               :text_dataset ; ### works for text index
     >>>>>         fuseki:name                   "BnF_text" ;
     >>>>>         fuseki:serviceQuery           "query" , "sparql" ;
     >>>>>         fuseki:serviceReadGraphStore  "get" ;
     >>>>>         fuseki:serviceReadWriteGraphStore "data" ;
     >>>>>         fuseki:serviceUpdate          "update" ;
     >>>>>         fuseki:serviceUpload          "upload" .
