Well, the order of triple patterns shouldn't matter too much when you
have a pure BGP (although the optimiser might pick a bad order in some
cases).
But we aren't talking about pure BGPs here. Having the text:query
triples results in the BGP being broken up into joins of several
property functions, with the regular triple patterns interspersed
among them. So if we take your query and run it through Jena's
algebra compiler (you can do this online at
http://sparql.org/validate/query) we get the following:
 1 (base <http://example/base/>
 2   (prefix ((rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>)
 3            (owl: <http://www.w3.org/2002/07/owl#>)
 4            (apf: <http://jena.hpl.hp.com/ARQ/property#>)
 5            (xsd: <http://www.w3.org/2001/XMLSchema#>)
 6            (fn: <http://www.w3.org/2005/xpath-functions#>)
 7            (rdfs: <http://www.w3.org/2000/01/rdf-schema#>)
 8            (text: <http://jena.apache.org/text#>)
 9            (foaf: <http://xmlns.com/foaf/0.1/>)
10            (dc: <http://purl.org/dc/elements/1.1/>))
11     (sequence
12       (propfunc text:query
13         ?uriBnF (foaf:givenName "$MY_STRING")
14         (propfunc text:query
15           ?uriBnF (foaf:familyName "roussea*")
16           (table unit)))
17       (bgp
18         (triple ?uriBnF foaf:familyName ?nom)
19         (triple ?uriBnF foaf:givenName ?prenom)
20       ))))
So first it's doing the text search on your parameter (lines 12-13),
then joining that to the text search on your surname (lines 14-15) by
substituting the bindings from your first text search, and then finally
joining that with the plain BGP (lines 17-19).
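To make those three steps concrete, here is a minimal Python sketch of
that evaluation shape on toy data. It is not Jena's code: the PEOPLE
table, text_search() and evaluate() are names made up for illustration,
and simple prefix matching stands in for the Lucene lookup. No result
limit is modelled here, which is why both orderings give the same rows.

```python
# Toy data (made up): subject -> (given name, family name).
PEOPLE = {
    "p1": ("Jean-Jacques", "Rousseau"),
    "p2": ("Jean-Marie", "Rousseau"),
    "p3": ("Jacques", "Martin"),
    "p4": ("Pierre", "Rousseau"),
}
FIELD_POS = {"givenName": 0, "familyName": 1}


def text_search(field, prefix):
    """Hypothetical stand-in for one text:query call (prefix match, no limit)."""
    pos = FIELD_POS[field]
    return [s for s, names in PEOPLE.items()
            if names[pos].lower().startswith(prefix.lower())]


def evaluate(first, second):
    # Step 1: the first text search produces candidate bindings for ?uriBnF.
    candidates = [{"uriBnF": s} for s in text_search(*first)]
    # Step 2: each binding is substituted into the second text search and
    # kept only if that search also matches the bound subject.
    kept = [b for b in candidates if b["uriBnF"] in text_search(*second)]
    # Step 3: join with the plain BGP, which adds ?prenom and ?nom.
    return [dict(b, prenom=PEOPLE[b["uriBnF"]][0], nom=PEOPLE[b["uriBnF"]][1])
            for b in kept]


rows_a = evaluate(("givenName", "j"), ("familyName", "roussea"))
rows_b = evaluate(("familyName", "roussea"), ("givenName", "j"))
print(rows_a)
print(rows_a == rows_b)
```

Without a limit on the individual searches, swapping the two text
searches only changes how much intermediate work is done, not the answer.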
So in this case the ordering of your property functions in the query
is going to make a difference to the evaluation. As I think Osma
already pointed out, there is a limit on the results returned from each
text search, so when these are executed separately and joined together
you may only get a subset of the full results that your text index holds.
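A rough way to see the effect of that limit is the Python sketch below.
It is only an illustration under assumptions: the hit lists, the
text_search() helper and the limits are invented, and the ranking is
simply list order rather than Lucene's scoring.

```python
def text_search(ranked_hits, limit):
    """Hypothetical stand-in for one text search: at most `limit` hits, best first."""
    return ranked_hits[:limit]


# Invented ranked hit lists for two separate text searches.
given_hits = [f"person{i}" for i in range(20)]      # broad pattern: 20 matches
family_hits = ["person3", "person12", "person17"]   # narrow pattern: 3 matches


def joined(limit):
    # Each search is cut off at `limit` on its own, then the results are joined.
    given = set(text_search(given_hits, limit))
    family = set(text_search(family_hits, limit))
    return given & family


small = joined(10)    # the broad search is truncated to person0..person9
large = joined(100)   # the limit is big enough to hold every hit

print(sorted(small))
print(sorted(large))
```

With the small limit, the two subjects ranked after position 10 in the
broad search never reach the join, even though the index holds them.
Raising the limit until it covers the broadest search brings them back.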
Rob
On 12/09/2018, 09:55, "Vincent Ventresque"
<vincent.ventres...@ens-lyon.fr> wrote:
Hi Lorenz,
Thanks for your reply.
> for me it sounds more like you've found a bug
I'm not able to tell, just beginning to use Fuseki + Lucene.
> I'm just referring to "Order of triple patterns in a BGP" here
Could you please give a raw text URL for "Order of triple patterns in a
BGP" (it seems that the 'here' in your mail had a formatted link, but I
didn't receive the URL in my mailbox).
> The order of triple patterns in a BGP shouldn't matter
I thought that it was better (for performance/speed) to begin with 1)
constants and 2) variables having few solutions in the dataset. I've
read something about SPARQL optimization and algebra, but can't remember
where. But maybe you're talking about the logic itself (A+B = B+A)?
N.B. I find these questions very interesting, but I'm no SPARQL
specialist (nor a logician).
Cheers,
Vincent
On 12/09/2018 at 10:32, Lorenz B. wrote:
> Hi "VV",
>
> well, for me it sounds more like you've found a bug and are now doing a
> workaround. Or at least something is strange and I'm just referring to
> "Order of triple patterns in a BGP" here.
>
> The order of triple patterns in a BGP shouldn't matter - as far as I
> know it's always a good old join on the intermediate result of the
> evaluation of the triple patterns.
>
> Indeed, the limit of the text index lookup matters as the internal
> ordering by Lucene is based on some Information Retrieval measure (close
> to TF-IDF probably with default settings).
>
> But I guess Osma and Andy will give you a better and more correct answer.
>
>
> Cheers,
> Lorenz
>
>> Hello Osma,
>>
>>
>> Thank you very much for your reply, you solved the problem! I've made
>> a few tests, both the order and the limit are important (see below).
>>
>> Just one more question: I thought that, the "Roussea*" matches being
>> less numerous than the "*J*" matches, it would be more efficient to
>> begin with the "Roussea*". Can you explain why it's the contrary?
>>
>> Best,
>>
>> VV.
>>
>>
>> 1) --------- changing only the order --------------------------
>>
>> ?uriBnF text:query ( foaf:givenName "*J*" ) .
>> ?uriBnF text:query ( foaf:familyName "roussea*" ) .
>> ?uriBnF foaf:familyName ?nom . ?uriBnF foaf:givenName ?prenom
>>
>> => 3 "Jean-Marie Rousseau" ... (even if I add a limit = 100 000 or
>> 2 000 000)
>>
>> 2) --------- changing order + limit = 100 000 --------------------------
>>
>> ?uriBnF text:query ( foaf:givenName "*J*" 100000 ) .
>> ?uriBnF text:query ( foaf:familyName "roussea*" ) .
>> ?uriBnF foaf:familyName ?nom . ?uriBnF foaf:givenName ?prenom
>>
>> => 54 entries but not "Jean-Jacques" !
>>
>> 3) --------- changing order + limit = 1 000 000 --------------------------
>>
>> ?uriBnF text:query ( foaf:givenName "*J*" 1000000 ) .
>> ?uriBnF text:query ( foaf:familyName "roussea*" ) .
>> ?uriBnF foaf:familyName ?nom . ?uriBnF foaf:givenName ?prenom
>>
>> => 135 entries, including the 4 "Jean-Jacques", in 1.7 second
>>
>> 4) --------- test using filters (strstarts + contains) --------------------------
>>
>> ?uriBnF foaf:familyName ?nom
>> filter(strstarts(?nom, "Roussea"))
>> ?uriBnF foaf:givenName ?prenom
>> filter(contains(?prenom, "J"))
>>
>> => 129 entries, 27 seconds [fewer results than
>> "text:query ( foaf:givenName "*J*" 1000000 )" because contains is
>> case-sensitive?]
>>
>> -----------------------------------------------------
>>
>> More info about the dataset:
>>
>> # 3 fields are indexed ( foaf:name + foaf:givenName are in the same
>> named graph )
>>
>> -- dcterms:title = +/- 9.45 M.
>>
>> -- foaf:givenName = +/- 1.71 M.
>>
>> -- foaf:familyName = +/- 1.78 M.
>>
>> # config file :
>>
>> ----------------
>>
>> text:storeValues true ;
>> text:queryParser text:AnalyzingQueryParser ;
>> text:map (
>>   [ text:field "title" ; text:predicate dcterms:title ;
>>     text:analyzer [ a text:ConfigurableAnalyzer ;
>>       text:tokenizer text:KeywordTokenizer ;
>>       text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
>>     ] ]
>>   [ text:field "familyName" ; text:predicate foaf:familyName ;
>>     text:analyzer [ a text:ConfigurableAnalyzer ;
>>       text:tokenizer text:KeywordTokenizer ;
>>       text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
>>     ] ]
>>   [ text:field "givenName" ; text:predicate foaf:givenName ;
>>     text:analyzer [ a text:ConfigurableAnalyzer ;
>>       text:tokenizer text:KeywordTokenizer ;
>>       text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
>>     ] ]
>>
>> ) .
>>
>> On 10/09/2018 at 18:58, Osma Suominen wrote:
>>> Hello Vincent,
>>>
>>> The results you get don't seem quite right. As you say, with a
>>> shorter query one would expect more results.
>>>
>>> One thing to do would be to check what results you get if you run the
>>> queries individually. I think combining the two separate jena-text
>>> queries (for foaf:familyName and foaf:givenName) may be part of the
>>> problem here... So if you execute only the "roussea*" part of the
>>> query, do you get the expected number of results? What about if you
>>> only execute one of the givenName queries with no restriction on
>>> familyName?
>>>
>>> Does it make a difference if you change the order of the familyName
>>> and givenName clauses?
>>>
>>> One thing to consider is that Lucene queries always have a limit on
>>> the number of results. With jena-text you can specify it as an
>>> additional parameter, but if you leave it out, it will default to
>>> 10000. My guess is that the givenName queries may generate more
>>> than 10000 results, and the results will then be cut off. This may
>>> mean that you get many Jeans and Jacques's and Johns etc. but many
>>> of the J. Rousseaus get cut off from the list. Try adding a large
>>> limit parameter (say 100000 or more) to the text:query functions to
>>> see if it helps. Like this:
>>>
>>> ?uriBnF text:query ( foaf:givenName "*J*" 100000 )
>>>
>>> jena-text is not very good at combining multiple criteria. You can do
>>> it with separate queries as you've done, but internally the queries
>>> will run separately and the results will only be combined in Jena,
>>> outside Lucene.
>>>
>>> -Osma
>>>
>>>
>>>
>>> Vincent Ventresque wrote on 10.09.2018 at 13:03:
>>>> Hello,
>>>>
>>>>
>>>> I've made new tests with a slightly different dataset and
>>>> configuration, the problem is the same.
>>>>
>>>> --- Could you please tell me if these results are normal (I expected
>>>> a bigger list with fewer letters)?
>>>>
>>>> ?uriBnF text:query ( foaf:givenName "*J*" ) => 3 entries
>>>>
>>>> ?uriBnF text:query ( foaf:givenName "*Ja*" ) => 1 entries
>>>>
>>>> ?uriBnF text:query ( foaf:givenName "*Je*" ) => 11 entries
>>>>
>>>> ?uriBnF text:query ( foaf:givenName "*-J*" ) => 11 entries
>>>>
>>>> ?uriBnF text:query ( foaf:givenName "*Jea*" ) => 12 entries
>>>>
>>>> ?uriBnF text:query ( foaf:givenName "*Jac*" ) => 13 entries
>>>>
>>>> Here is the complete query :
>>>>
>>>> SELECT * WHERE { ?uriBnF text:query ( foaf:familyName "roussea*" ) .
>>>> ?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .
>>>>
>>>> ?uriBnF foaf:familyName ?nom . ?uriBnF foaf:givenName ?prenom }
>>>>
>>>> N.B.: the dataset is quite large: 1.78 M family names indexed, and
>>>> 1.71 M given names. I have 4 distinct "Jean-Jacques Rousseau" in the
>>>> data, 713 family names containing "roussea", including 224 compound
>>>> given names.
>>>>
>>>> --- Do you know where to find more documentation about Lucene
>>>> configuration (I read jena.apache.org page + , and also found useful
>>>> explanations on Skosmos wiki https://github.com/NatLibFi/Skosmos ),
>>>> especially about tokenizers?
>>>>
>>>>
>>>> Thanks in advance,
>>>>
>>>> VV
>>>>
>>>> On 19/07/2018 at 14:07, Vincent Ventresque wrote:
>>>>> Hello,
>>>>>
>>>>> I've just subscribed to the users@jena.apache.org list, and I
>>>>> apologize if this mail is not sent properly.
>>>>>
>>>>> I'm trying to use Fuseki text:query, and have encountered several
>>>>> issues. Here are my questions:
>>>>>
>>>>> 1) Does text:query require a minimum number of characters to be
>>>>> efficient?
>>>>>
>>>>> 2) Is performance linked to the number of fields indexed?
>>>>>
>>>>> 3) In order to retrieve strings containing hyphens, should I use
>>>>> KeywordTokenizer in config file?
>>>>>
>>>>> ~~~ 1) Does text:query require a minimum number of characters to be
>>>>> efficient? ~~~~~~~~~~~~~
>>>>>
>>>>> I've noticed that a query on indexed predicates (foaf:familyName
>>>>> and foaf:givenName) returns more results when there are more
>>>>> characters in the string:
>>>>>
>>>>> SELECT * WHERE {
>>>>>
>>>>> ?uriBnF text:query ( foaf:familyName "roussea*" ) .
>>>>>
>>>>> ?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .
>>>>>
>>>>> ?uriBnF foaf:familyName ?nom . ?uriBnF foaf:givenName ?prenom .
>>>>>
>>>>> optional {?uriBnF bio:birth ?dateNaissance }
>>>>>
>>>>> }
>>>>>
>>>>> I was expecting that "Rousseau" + "Jean-Jacques" would be in the
>>>>> results.
>>>>>
>>>>> => if $MY_STRING = "j*", I get 0 results
>>>>>
>>>>> => if $MY_STRING = "je*", I get 17 results, including
>>>>> "Jean-Claude" & "Jean-Baptiste" BUT not "Jean-Jacques"
>>>>>
>>>>> => if $MY_STRING = "jea*", I get 27 results, including "Jean-Jacques"
>>>>>
>>>>> I don't know anything about Lucene, but it looks very strange to me:
>>>>> I expected the contrary (fewer letters = bigger results list).
>>>>>
>>>>>
>>>>> ~~~ 2) Is performance linked to the number of fields indexed?
>>>>> ~~~~~~~~~~~~~~~~~~~~~~~
>>>>>
>>>>> If I change the configuration and index only foaf:givenName, and
>>>>> provide a constant for foaf:familyName, the query returns more
>>>>> results:
>>>>>
>>>>> SELECT * WHERE {
>>>>>
>>>>> ?uriBnF foaf:familyName "Rousseau" .
>>>>>
>>>>> ?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .
>>>>>
>>>>> ?uriBnF foaf:familyName ?nom . ?uriBnF foaf:givenName ?prenom .
>>>>>
>>>>> optional {?uriBnF bio:birth ?dateNaissance }
>>>>>
>>>>> }
>>>>>
>>>>> => if $MY_STRING = "j*", I get 7 results, whereas the first query
>>>>> returned 0 results.
>>>>>
>>>>>
>>>>> ~~~ 3) In order to retrieve strings containing hyphens, should I use
>>>>> KeywordTokenizer in config file? ~~~~~~~~~~~~~
>>>>>
>>>>> With the same query, if $MY_STRING = "jean-ja*" :
>>>>>
>>>>> a) with simple configuration (cf. below), I get 0 results
>>>>>
>>>>> b) with KeywordTokenizer config (cf. below), I get "Jean-Jacques"
>>>>>
>>>>> Is it the right way to get "Jean-Jacques"?
>>>>>
>>>>>
>>>>> Thanks in advance
>>>>>
>>>>> VV
>>>>>
>>>>>
>>>>>
>>>>> =============== SIMPLE CONFIGURATION ===================
>>>>>
>>>>> @prefix : <#> .
>>>>> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>>>>> @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
>>>>> @prefix tdb: <http://jena.hpl.hp.com/2008/tdb#> .
>>>>> @prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
>>>>> @prefix text: <http://jena.apache.org/text#> .
>>>>> @prefix fuseki: <http://jena.apache.org/fuseki#> .
>>>>> @prefix foaf: <http://xmlns.com/foaf/0.1/> .
>>>>> @prefix dcterms: <http://purl.org/dc/terms/> .
>>>>>
>>>>>
>>>>>
>>>>> [] rdf:type fuseki:Server ;
>>>>> .
>>>>>
>>>>>
>>>>> ## Initialize TDB --------------------------------
>>>>>
>>>>> [] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
>>>>> tdb:DatasetTDB rdfs:subClassOf ja:RDFDataset .
>>>>> tdb:GraphTDB rdfs:subClassOf ja:Model .
>>>>>
>>>>> ## Initialize text query -------------------------------------
>>>>> [] ja:loadClass "org.apache.jena.query.text.TextQuery" .
>>>>> # A TextDataset is a regular dataset with a text index.
>>>>> text:TextDataset rdfs:subClassOf ja:RDFDataset .
>>>>> # Lucene index
>>>>> text:TextIndexLucene rdfs:subClassOf text:TextIndex .
>>>>>
>>>>> ## ---------------------------------------------------------------
>>>>> ## This URI must be fixed - it's used to assemble the text dataset.
>>>>>
>>>>> :text_dataset rdf:type text:TextDataset ;
>>>>> # text:dataset <#dataset> ;
>>>>> text:dataset :tdb_dataset_readwrite ;
>>>>> # text:index <#indexLucene> ;
>>>>> text:index :My_Lucene_index ;
>>>>> .
>>>>>
>>>>> # A TDB dataset used for RDF storage ------------------------------
>>>>> :tdb_dataset_readwrite
>>>>> a tdb:DatasetTDB ;
>>>>> tdb:location "$_BnF_text" ;
>>>>> .
>>>>>
>>>>> # Text index description ------------------------------------------
>>>>> #<#indexLucene> a text:TextIndexLucene ;
>>>>> :My_Lucene_index a text:TextIndexLucene ;
>>>>> text:directory <file:$_Lucene> ;
>>>>> text:entityMap <#entMap> ;
>>>>> .
>>>>>
>>>>> # Mapping in the index ---------------------------------------------
>>>>> # URI stored in field "uri"
>>>>> <#entMap> a text:EntityMap ;
>>>>> text:entityField "uri" ;
>>>>> text:defaultField "familyName" ;
>>>>> text:map (
>>>>> [ text:field "familyName" ; text:predicate foaf:familyName ]
>>>>> [ text:field "givenName" ; text:predicate foaf:givenName ]
>>>>> ) .
>>>>>
>>>>> :service_tdb_all a fuseki:Service ;
>>>>> rdfs:label "TDB BnF_text" ;
>>>>> fuseki:dataset :text_dataset ;
>>>>> fuseki:name "BnF_text" ;
>>>>> fuseki:serviceQuery "query" , "sparql" ;
>>>>> fuseki:serviceReadGraphStore "get" ;
>>>>> fuseki:serviceReadWriteGraphStore "data" ;
>>>>> fuseki:serviceUpdate "update" ;
>>>>> fuseki:serviceUpload "upload" .
>>>>>
>>>>>
>>>>> =========== KEYWORD TOKENIZER CONFIGURATION ================
>>>>>
>>>>> @prefix : <#> .
>>>>> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>>>>> @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
>>>>> @prefix tdb: <http://jena.hpl.hp.com/2008/tdb#> .
>>>>> @prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
>>>>> @prefix text: <http://jena.apache.org/text#> .
>>>>> @prefix fuseki: <http://jena.apache.org/fuseki#> .
>>>>> @prefix foaf: <http://xmlns.com/foaf/0.1/> .
>>>>> @prefix dcterms: <http://purl.org/dc/terms/> .
>>>>>
>>>>>
>>>>>
>>>>> [] rdf:type fuseki:Server ;
>>>>>
>>>>> .
>>>>>
>>>>>
>>>>> ## Initialize TDB --------------------------------
>>>>>
>>>>> [] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
>>>>> tdb:DatasetTDB rdfs:subClassOf ja:RDFDataset .
>>>>> tdb:GraphTDB rdfs:subClassOf ja:Model .
>>>>>
>>>>> ## Initialize text query -------------------------------------
>>>>> [] ja:loadClass "org.apache.jena.query.text.TextQuery" .
>>>>> # A TextDataset is a regular dataset with a text index.
>>>>> text:TextDataset rdfs:subClassOf ja:RDFDataset .
>>>>> # Lucene index
>>>>> text:TextIndexLucene rdfs:subClassOf text:TextIndex .
>>>>>
>>>>> ## ---------------------------------------------------------------
>>>>>
>>>>>
>>>>> :text_dataset rdf:type text:TextDataset ;
>>>>> # text:dataset <#dataset> ;
>>>>> text:dataset :tdb_dataset_readwrite ;
>>>>> # text:index <#indexLucene> ;
>>>>> text:index :My_Lucene_index ;
>>>>> .
>>>>>
>>>>> # A TDB dataset used for RDF storage ------------------------------
>>>>> :tdb_dataset_readwrite
>>>>> a tdb:DatasetTDB ;
>>>>> tdb:location "$_BnF_text" ;
>>>>> .
>>>>>
>>>>> # Text index description ------------------------------------------
>>>>> #<#indexLucene> a text:TextIndexLucene ;
>>>>> :My_Lucene_index a text:TextIndexLucene ;
>>>>> text:directory <file:$_Lucene> ;
>>>>> text:entityMap <#entMap> ;
>>>>> .
>>>>>
>>>>> # Mapping in the index ---------------------------------------------
>>>>> # URI stored in field "uri"
>>>>> <#entMap> a text:EntityMap ;
>>>>> text:entityField "uri" ;
>>>>> text:defaultField "givenName" ;
>>>>> text:map (
>>>>>
>>>>>   [ text:field "familyName" ; text:predicate foaf:familyName ;
>>>>>     text:analyzer [ a text:ConfigurableAnalyzer ;
>>>>>       text:tokenizer text:KeywordTokenizer ;
>>>>>       text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
>>>>>     ] ]
>>>>>   [ text:field "givenName" ; text:predicate foaf:givenName ;
>>>>>     text:analyzer [ a text:ConfigurableAnalyzer ;
>>>>>       text:tokenizer text:KeywordTokenizer ;
>>>>>       text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
>>>>>     ] ]
>>>>> ) .
>>>>>
>>>>> :service_tdb_all a fuseki:Service ;
>>>>> rdfs:label "TDB BnF_text" ;
>>>>> fuseki:dataset :text_dataset ; ### works for text index
>>>>> fuseki:name "BnF_text" ;
>>>>> fuseki:serviceQuery "query" , "sparql" ;
>>>>> fuseki:serviceReadGraphStore "get" ;
>>>>> fuseki:serviceReadWriteGraphStore "data" ;
>>>>> fuseki:serviceUpdate "update" ;
>>>>> fuseki:serviceUpload "upload" .
>>>>>