Re: fuseki text:query : strange results + Lucene configuration

2018-09-12 Thread Vincent Ventresque
> Just to be sure, you can try to execute some very generic queries 
(e.g. "*a*") and count the results.


Thanks, I'll do that when I have a moment

> The downside of using a high limit (and the reason the default is 
"only" 10000) is that jena-text/Lucene allocates an array of that size 
to hold the results before actually executing the query against the 
index. [...] All these resources (CPU time and memory) are wasted if the 
index then returns only a small number of results.


That's very important, thanks a lot. I think we're going to offer two 
options, the default being a combination like this:


?auth text:query ( foaf:familyName "roussea*" ) .
?edition dcterms:contributor ?auth ;
         text:query ( dcterms:title '*rêverie*' ) .

and a more "fuzzy" one to broaden the search, with text:query on 
givenName. Using named graphs, the default option gives good results 
(between 0.2 and 1.5 s depending on the retrieved editions).


Vincent

 


On 12/09/2018 at 15:23, Osma Suominen wrote:

Hi Vincent!

Vincent Ventresque wrote on 12.09.2018 at 15:53:

What do you think about this solution :

?uriBnF text:query ( foaf:givenName "*J*" 2000000 ) . ?uriBnF 
text:query ( foaf:familyName "roussea*" ) . ?uriBnF foaf:familyName 
?nom .  ?uriBnF foaf:givenName ?prenom


It returns all the expected results and takes only 1.7 seconds (with 
default configuration, RAM 2Gb).


Sounds good to me!

Knowing I have 1.71 M givenName values, it's reasonable to expect all the 
results with a limit of 2 000 000, isn't it? It is, I think, the most 
important question: am I sure to get all the results if I use a limit 
greater than the total number of indexed values?


Yes, I think this is the case. If you have 1.71M triples with the 
givenName property, the text index should never return more than 1.71M 
results on a givenName property. So a limit of 2M should be enough in 
your case.


Just to be sure, you can try to execute some very generic queries 
(e.g. "*a*") and count the results.
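Such a sanity check could be written as a count query, e.g. (a sketch; 
the 2000000 limit mirrors the value discussed above so the count is not 
capped by the text index):

SELECT (COUNT(*) AS ?hits) WHERE {
  ?s text:query ( foaf:givenName "*a*" 2000000 ) .
}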


The downside of using a high limit (and the reason the default is 
"only" 10000) is that jena-text/Lucene allocates an array of that size 
to hold the results before actually executing the query against the 
index. With a large limit value such as 2M, that takes some time - 
probably most of the 1.7 seconds. You can experiment with how the 
query execution time changes if you change only the limit value. Also 
the array will need some memory, maybe in the range of tens or perhaps 
even hundreds of MB for a limit of 2M. All these resources (CPU time 
and memory) are wasted if the index then returns only a small number 
of results. The memory will of course be freed soon after the query by 
the garbage collector.
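One way to see how much of the time is this allocation is to keep the 
query fixed and vary only the limit, e.g. (a sketch; the limit values 
in the comment are arbitrary examples):

SELECT (COUNT(*) AS ?hits) WHERE {
  # try 10000, 100000 and 2000000 here and compare execution times
  ?uriBnF text:query ( foaf:givenName "*J*" 100000 ) .
  ?uriBnF text:query ( foaf:familyName "roussea*" ) .
}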


N.B. : I like the idea of using only text:query because it's case 
insensitive AND allows fuzzy queries. It's particularly important for 
our use case (we want to find author + edition with incomplete 
information, such as "1 word in title + 1 word in familyName + 
givenName initial + one of these words is not fully legible"). But 
you're right, a combination of text:query + regex or contains is very 
fast (see example below).

Great that you tried this approach as well and it is fast.

-Osma



---

Hi Osma,


Thanks again, it's very helpful.

> Either you get fewer results than expected or the query will take a 
long time, or both


What do you think about this solution :

?uriBnF text:query ( foaf:givenName "*J*" 2000000 ) . ?uriBnF text:query 
( foaf:familyName "roussea*" ) . ?uriBnF foaf:familyName ?nom .  ?uriBnF 
foaf:givenName ?prenom


It returns all the expected results and takes only 1.7 seconds (with 
default configuration, RAM 2Gb).


Knowing I have 1.71 M givenName values, it's reasonable to expect all the 
results with a limit of 2 000 000, isn't it? It is, I think, the most 
important question: am I sure to get all the results if I use a limit 
greater than the total number of indexed values?


N.B. : I like the idea of using only text:query because it's case 
insensitive AND allows fuzzy queries. It's particularly important for 
our use case (we want to find author + edition with incomplete 
information, such as "1 word in title + 1 word in familyName + givenName 
initial + one of these words is not fully legible"). But you're right, a 
combination of text:query + regex or contains is very fast (see example 
below).


Vincent


-

?uriBnF text:query ( foaf:familyName "roussea*" ) ;
  foaf:givenName ?prenom
  filter(contains(?prenom, "J"))   # case sensitive
  ?uriBnF foaf:familyName ?nom  .

=> 37ms for 130 entries

-

?uriBnF text:query ( foaf:familyName "roussea*" ) ;
  foaf:givenName ?prenom
  filter(regex(?prenom, "j", "i"))  # case insensitive
  ?uriBnF foaf:familyName ?nom  .

=> 55ms for 133 entries






On 12/09/2018 at 14:12, Osma Suominen wrote:

Hi Vincent!

Jena-text with the Lucene backend indexes each triple as a separate 
Lucene document. This means that you cannot combine givenName and 
familyName in the same query [...]

Re: fuseki text:query : strange results + Lucene configuration

2018-09-12 Thread Osma Suominen

Hi Vincent!

Vincent Ventresque wrote on 12.09.2018 at 15:53:

What do you think about this solution :

?uriBnF text:query ( foaf:givenName "*J*" 2000000 ) . ?uriBnF text:query 
( foaf:familyName "roussea*" ) . ?uriBnF foaf:familyName ?nom .  ?uriBnF 
foaf:givenName ?prenom


It returns all the expected results and takes only 1.7 seconds (with 
default configuration, RAM 2Gb).


Sounds good to me!

Knowing I have 1.71 M givenName values, it's reasonable to expect all the 
results with a limit of 2 000 000, isn't it? It is, I think, the most 
important question: am I sure to get all the results if I use a limit 
greater than the total number of indexed values?


Yes, I think this is the case. If you have 1.71M triples with the 
givenName property, the text index should never return more than 1.71M 
results on a givenName property. So a limit of 2M should be enough in 
your case.


Just to be sure, you can try to execute some very generic queries (e.g. 
"*a*") and count the results.


The downside of using a high limit (and the reason the default is "only" 
10000) is that jena-text/Lucene allocates an array of that size to hold 
the results before actually executing the query against the index. With 
a large limit value such as 2M, that takes some time - probably most of 
the 1.7 seconds. You can experiment with how the query execution time 
changes if you change only the limit value. Also the array will need 
some memory, maybe in the range of tens or perhaps even hundreds of MB 
for a limit of 2M. All these resources (CPU time and memory) are wasted 
if the index then returns only a small number of results. The memory 
will of course be freed soon after the query by the garbage collector.


N.B. : I like the idea of using only text:query because it's case 
insensitive AND allows fuzzy queries. It's particularly important for 
our use case (we want to find author + edition with incomplete 
information, such as "1 word in title + 1 word in familyName + givenName 
initial + one of these words is not fully legible"). But you're right, a 
combination of text:query + regex or contains is very fast (see example 
below).

Great that you tried this approach as well and it is fast.

-Osma

--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suomi...@helsinki.fi
http://www.nationallibrary.fi


Re: fuseki text:query : strange results + Lucene configuration

2018-09-12 Thread Vincent Ventresque

Hi Osma,


Thanks again, it's very helpful.

> Either you get fewer results than expected or the query will take a 
long time, or both


What do you think about this solution :

?uriBnF text:query ( foaf:givenName "*J*" 2000000 ) . ?uriBnF text:query 
( foaf:familyName "roussea*" ) . ?uriBnF foaf:familyName ?nom .  ?uriBnF 
foaf:givenName ?prenom


It returns all the expected results and takes only 1.7 seconds (with 
default configuration, RAM 2Gb).


Knowing I have 1.71 M givenName values, it's reasonable to expect all the 
results with a limit of 2 000 000, isn't it? It is, I think, the most 
important question: am I sure to get all the results if I use a limit 
greater than the total number of indexed values?


N.B. : I like the idea of using only text:query because it's case 
insensitive AND allows fuzzy queries. It's particularly important for 
our use case (we want to find author + edition with incomplete 
information, such as "1 word in title + 1 word in familyName + givenName 
initial + one of these words is not fully legible"). But you're right, a 
combination of text:query + regex or contains is very fast (see example 
below).


Vincent


-

?uriBnF text:query ( foaf:familyName "roussea*" ) ;
  foaf:givenName ?prenom
  filter(contains(?prenom, "J"))   # case sensitive
  ?uriBnF foaf:familyName ?nom  .

=> 37ms for 130 entries

-

?uriBnF text:query ( foaf:familyName "roussea*" ) ;
  foaf:givenName ?prenom
  filter(regex(?prenom, "j", "i"))  # case insensitive
  ?uriBnF foaf:familyName ?nom  .

=> 55ms for 133 entries




 


On 12/09/2018 at 14:12, Osma Suominen wrote:

Hi Vincent!

Jena-text with the Lucene backend indexes each triple as a separate 
Lucene document. This means that you cannot combine givenName and 
familyName in the same query - from the Lucene perspective, the 
givenName appears in one document where familyName appears in another 
document, and querying for both (using AND) will just give you an 
empty result. So what you are doing is the correct way. The problem is 
just that some of the query patterns, such as "*J*", will return a 
very large number of results. This pushes the limits of jena-text as 
you've discovered. Either you get fewer results than expected or the 
query will take a long time, or both.


It might make sense to use only one text:query for the more 
restrictive part (familyName "roussea*" in this case) and then use a 
FILTER with some string matching (STRSTARTS or CONTAINS or REGEX) to 
further limit the results.
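A STRSTARTS variant of that idea could look like the sketch below, in 
the same spirit as the CONTAINS and REGEX examples quoted earlier in 
this thread (note that STRSTARTS, like CONTAINS, is case sensitive):

?uriBnF text:query ( foaf:familyName "roussea*" ) ;
        foaf:givenName ?prenom .
FILTER(strstarts(?prenom, "J"))
?uriBnF foaf:familyName ?nom .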


The Elasticsearch backend of jena-text is different though. It will 
combine different indexed properties of the same subject within the 
same Elasticsearch/Lucene document. So an AND query with both 
givenName and familyName is possible when using that backend.
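With that backend, a single multi-field query along the lines Vincent 
guesses at later in the thread might then look roughly like this (an 
untested sketch; whether this exact Lucene query-string syntax is 
accepted depends on the backend and its field mapping):

?uriBnF text:query ( "givenName:J* AND familyName:roussea*" ) .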


-Osma

Vincent Ventresque wrote on 12.09.2018 at 15:06:

Hello Rob


Thank you for all these elements.

 > there is a limit on the results returned from each text search so 
when these are *separately executed and joined together* you may only 
get a subset of the full results


Could you please explain what would be a 'non-separate' query? Do you 
mean :


?s text:query ( "givenName:\"*J*\" AND familyName:\"Roussea\"" ) ?

I made 2 separate triples (1st = givenName + 2nd = familyName) 
because I had read that "when a query is to involve two or more 
properties then it is expressed at the SPARQL level, as it were, versus 
in Lucene's query language" 
(https://jena.apache.org/documentation/query/text-query.html#queries-across-multiple-fields). 



Vincent




On 12/09/2018 at 11:52, Rob Vesse wrote:
Well the order of triple patterns shouldn't matter too much when you 
have a pure BGP (albeit the optimiser might pick a bad order in some 
cases)


But we aren't talking about pure BGPs here, having the text:query 
triples results in the BGP being broken up into joins of several 
property functions with the regular triple patterns interspersed 
through those.  So if we take your query and run it through Jena's 
algebra compiler (you can do this online at 
http://sparql.org/validate/query) we get the following:


   1 (base 
   2   (prefix ((rdf: )
   3    (owl: )
   4    (apf: )
   5    (xsd: )
   6    (fn: )
   7    (rdfs: )
   8    (text: )
   9    (foaf: )
  10    (dc: ))
  11 (sequence
  12   (propfunc text:query
  13 ?uriBnF (foaf:givenName "$MY_STRING")
  14 (propfunc text:query
  15   ?uriBnF (foaf:familyName "roussea*")
  16   (table unit)))
  17   (bgp
  18 (triple ?uriBnF foaf:familyName ?nom)
  19 (triple ?uriBnF foaf:givenName ?prenom)

Re: fuseki text:query : strange results + Lucene configuration

2018-09-12 Thread Osma Suominen

Hi Vincent!

Jena-text with the Lucene backend indexes each triple as a separate 
Lucene document. This means that you cannot combine givenName and 
familyName in the same query - from the Lucene perspective, the 
givenName appears in one document where familyName appears in another 
document, and querying for both (using AND) will just give you an empty 
result. So what you are doing is the correct way. The problem is just 
that some of the query patterns, such as "*J*", will return a very large 
number of results. This pushes the limits of jena-text as you've 
discovered. Either you get fewer results than expected or the query will 
take a long time, or both.


It might make sense to use only one text:query for the more restrictive 
part (familyName "roussea*" in this case) and then use a FILTER with 
some string matching (STRSTARTS or CONTAINS or REGEX) to further limit 
the results.


The Elasticsearch backend of jena-text is different though. It will 
combine different indexed properties of the same subject within the same 
Elasticsearch/Lucene document. So an AND query with both givenName and 
familyName is possible when using that backend.


-Osma

Vincent Ventresque wrote on 12.09.2018 at 15:06:

Hello Rob


Thank you for all these elements.

 > there is a limit on the results returned from each text search so 
when these are *separately executed and joined together* you may only 
get a subset of the full results


Could you please explain what would be a 'non-separate' query? Do you 
mean :


?s text:query ( "givenName:\"*J*\" AND familyName:\"Roussea\"" ) ?

I made 2 separate triples (1st = givenName + 2nd = familyName) because I 
had read that "when a query is to involve two or more properties then it 
is expressed at the SPARQL level, as it were, versus in Lucene's query 
language" 
(https://jena.apache.org/documentation/query/text-query.html#queries-across-multiple-fields). 



Vincent




On 12/09/2018 at 11:52, Rob Vesse wrote:
Well the order of triple patterns shouldn't matter too much when you 
have a pure BGP (albeit the optimiser might pick a bad order in some 
cases)


But we aren't talking about pure BGPs here, having the text:query 
triples results in the BGP being broken up into joins of several 
property functions with the regular triple patterns interspersed 
through those.  So if we take your query and run it through Jena's 
algebra compiler (you can do this online at 
http://sparql.org/validate/query) we get the following:


   1 (base 
   2   (prefix ((rdf: )
   3    (owl: )
   4    (apf: )
   5    (xsd: )
   6    (fn: )
   7    (rdfs: )
   8    (text: )
   9    (foaf: )
  10    (dc: ))
  11 (sequence
  12   (propfunc text:query
  13 ?uriBnF (foaf:givenName "$MY_STRING")
  14 (propfunc text:query
  15   ?uriBnF (foaf:familyName "roussea*")
  16   (table unit)))
  17   (bgp
  18 (triple ?uriBnF foaf:familyName ?nom)
  19 (triple ?uriBnF foaf:givenName ?prenom)
  20   

So first it's doing the text search on your parameter (lines 12-13), 
then joining that to text search on your surname (lines 14-15) via 
substituting binds from your first text search and then finally 
joining that with the plain BGP (lines 17-19).


So in this case the ordering of your property functions in the query 
is going to make a difference to the evaluation.  As I think Osma 
already pointed out there is a limit on the results returned from each 
text search so when these are separately executed and joined together 
you may only get a subset of the full results that your text index holds.
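The workaround the thread eventually converges on is to give the broad 
text:query an explicit limit large enough to cover the whole index, so 
the join does not lose rows, e.g. (a sketch using the limit value 
discussed elsewhere in this archive):

?uriBnF text:query ( foaf:givenName "*J*" 2000000 ) .
?uriBnF text:query ( foaf:familyName "roussea*" ) .
?uriBnF foaf:familyName ?nom .
?uriBnF foaf:givenName ?prenom .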


Rob

On 12/09/2018, 09:55, "Vincent Ventresque" 
 wrote:


 Hi Lorenz,
 Thanks for your reply.
 > for me it sounds more like you've found a bug
 I'm not able to tell, just beginning to use Fuseki + Lucene.
 > I'm just referring to "Order of triple patterns in a BGP" here
 Could you please give a raw text URL for "Order of triple 
patterns in a

 BGP" (seems that the 'here' in your mail had a formatted link but I
 didn't receive the url in my mailbox).
 > The order of triple patterns in a BGP shouldn't matter
 I thought that it was better (for performance/speed) to begin 
with 1)

 constants and 2) variables having few solutions in the dataset. I've
 read something about Sparql optimization and algebra, but can't 
remember

 where. But maybe you're talking about the logics itself (A+B = B+A)?
 N.B. I find these questions very interesting, but I'm no Sparql
 specialist (neither a logician).
   

Re: fuseki text:query : strange results + Lucene configuration

2018-09-12 Thread Vincent Ventresque

Hello Rob


Thank you for all these elements.

> there is a limit on the results returned from each text search so 
when these are *separately executed and joined together* you may only 
get a subset of the full results


Could you please explain what would be a 'non-separate' query? Do you mean :

?s text:query ( "givenName:\"*J*\" AND familyName:\"Roussea\"" ) ?

I made 2 separate triples (1st = givenName + 2nd = familyName) because I 
had read that "when a query is to involve two or more properties then it 
is expressed at the SPARQL level, as it were, versus in Lucene's query 
language" 
(https://jena.apache.org/documentation/query/text-query.html#queries-across-multiple-fields).


Vincent


 


On 12/09/2018 at 11:52, Rob Vesse wrote:

Well the order of triple patterns shouldn't matter too much when you have a 
pure BGP (albeit the optimiser might pick a bad order in some cases)

But we aren't talking about pure BGPs here, having the text:query triples 
results in the BGP being broken up into joins of several property functions 
with the regular triple patterns interspersed through those.  So if we take 
your query and run it through Jena's algebra compiler (you can do this online 
at http://sparql.org/validate/query) we get the following:

   1 (base 
   2   (prefix ((rdf: )
   3(owl: )
   4(apf: )
   5(xsd: )
   6(fn: )
   7(rdfs: )
   8(text: )
   9(foaf: )
  10(dc: ))
  11 (sequence
  12   (propfunc text:query
  13 ?uriBnF (foaf:givenName "$MY_STRING")
  14 (propfunc text:query
  15   ?uriBnF (foaf:familyName "roussea*")
  16   (table unit)))
  17   (bgp
  18 (triple ?uriBnF foaf:familyName ?nom)
  19 (triple ?uriBnF foaf:givenName ?prenom)
  20   

So first it's doing the text search on your parameter (lines 12-13), then 
joining that to text search on your surname (lines 14-15) via substituting 
binds from your first text search and then finally joining that with the plain 
BGP (lines 17-19).

So in this case the ordering of your property functions in the query is going 
to make a difference to the evaluation.  As I think Osma already pointed out 
there is a limit on the results returned from each text search so when these 
are separately executed and joined together you may only get a subset of the 
full results that your text index holds.

Rob

On 12/09/2018, 09:55, "Vincent Ventresque"  
wrote:

 Hi Lorenz,
 
 
 Thanks for your reply.
 
 > for me it sounds more like you've found a bug
 
 I'm not able to tell, just beginning to use Fuseki + Lucene.
 
 > I'm just referring to "Order of triple patterns in a BGP" here
 
 Could you please give a raw text URL for "Order of triple patterns in a

 BGP" (seems that the 'here' in your mail had a formatted link but I
 didn't receive the url in my mailbox).
 
 > The order of triple patterns in a BGP shouldn't matter
 
 I thought that it was better (for performance/speed) to begin with 1)

 constants and 2) variables having few solutions in the dataset. I've
 read something about Sparql optimization and algebra, but can't remember
 where. But maybe you're talking about the logics itself (A+B = B+A)?
 N.B. I find these questions very interesting, but I'm no Sparql
 specialist (neither a logician).
 
 Cheers,
 
 Vincent
 
 
 
 
On 12/09/2018 at 10:32, Lorenz B. wrote:

 > Hi "VV",
 >
 > well, for me it sounds more like you've found a bug and are now doing a
 > workaround. Or at least something is strange and I'm just referring to
 > "Order of triple patterns in a BGP" here.
 >
 > The order of triple patterns in a BGP shouldn't matter - as far as I
 > know it's always a good old join on the intermediate result of the
 > evaluation of the triple patterns.
 >
 > Indeed, the limit of the text index lookup matters as the internal
 > ordering by Lucene is based on some Information Retrieval measure (close
 > to TF-IDF probably with default settings).
 >
 > But I guess, Osma and Andy will give you a better and more correct 
answer.
 >
 >
 > Cheers,
 > Lorenz
 >
 >> Hello Osma,
 >>
 >>
 >> Thank you very much for your reply, you solved the problem! I've made
 >> a few tests, both the order and the limit are important (see below).
 >>
 >> Just one more question : I thought that the "Roussea*" being less
>> numerous than the "*J*", it would be more efficient to begin with the
>> "Roussea*". Can you explain why it's the contrary? [...]

Re: fuseki text:query : strange results + Lucene configuration

2018-09-12 Thread Rob Vesse
Well the order of triple patterns shouldn't matter too much when you have a 
pure BGP (albeit the optimiser might pick a bad order in some cases)

But we aren't talking about pure BGPs here, having the text:query triples 
results in the BGP being broken up into joins of several property functions 
with the regular triple patterns interspersed through those.  So if we take 
your query and run it through Jena's algebra compiler (you can do this online 
at http://sparql.org/validate/query) we get the following:

  1 (base 
  2   (prefix ((rdf: )
  3(owl: )
  4(apf: )
  5(xsd: )
  6(fn: )
  7(rdfs: )
  8(text: )
  9(foaf: )
 10(dc: ))
 11 (sequence
 12   (propfunc text:query
 13 ?uriBnF (foaf:givenName "$MY_STRING")
 14 (propfunc text:query
 15   ?uriBnF (foaf:familyName "roussea*")
 16   (table unit)))
 17   (bgp
 18 (triple ?uriBnF foaf:familyName ?nom)
 19 (triple ?uriBnF foaf:givenName ?prenom)
 20   

So first it's doing the text search on your parameter (lines 12-13), then 
joining that to text search on your surname (lines 14-15) via substituting 
binds from your first text search and then finally joining that with the plain 
BGP (lines 17-19).

So in this case the ordering of your property functions in the query is going 
to make a difference to the evaluation.  As I think Osma already pointed out 
there is a limit on the results returned from each text search so when these 
are separately executed and joined together you may only get a subset of the 
full results that your text index holds.  

Rob

On 12/09/2018, 09:55, "Vincent Ventresque"  
wrote:

Hi Lorenz,


Thanks for your reply.

> for me it sounds more like you've found a bug

I'm not able to tell, just beginning to use Fuseki + Lucene.

> I'm just referring to "Order of triple patterns in a BGP" here

Could you please give a raw text URL for "Order of triple patterns in a
BGP" (seems that the 'here' in your mail had a formatted link but I
didn't receive the url in my mailbox).

> The order of triple patterns in a BGP shouldn't matter

I thought that it was better (for performance/speed) to begin with 1)
constants and 2) variables having few solutions in the dataset. I've
read something about Sparql optimization and algebra, but can't remember
where. But maybe you're talking about the logics itself (A+B = B+A)?
N.B. I find these questions very interesting, but I'm no Sparql
specialist (neither a logician).

Cheers,

Vincent




On 12/09/2018 at 10:32, Lorenz B. wrote:
> Hi "VV",
>
> well, for me it sounds more like you've found a bug and are now doing a
> workaround. Or at least something is strange and I'm just referring to
> "Order of triple patterns in a BGP" here.
>
> The order of triple patterns in a BGP shouldn't matter - as far as I
> know it's always a good old join on the intermediate result of the
> evaluation of the triple patterns.
>
> Indeed, the limit of the text index lookup matters as the internal
> ordering by Lucene is based on some Information Retrieval measure (close
> to TF-IDF probably with default settings).
>
> But I guess, Osma and Andy will give you a better and more correct answer.
>
>
> Cheers,
> Lorenz
>
>> Hello Osma,
>>
>>
>> Thank you very much for your reply, you solved the problem! I've made
>> a few tests, both the order and the limit are important (see below).
>>
>> Just one more question : I thought that the "Roussea*" being less
>> numerous than the "*J*", it would be more efficient to begin with the
>> "Roussea*". Can you explain why it's the contrary?
>>
>> Best,
>>
>> VV.
>>
>>
>> 1) - changing only the order --
>>
>> ?uriBnF text:query ( foaf:givenName "*J*" ) .
>> ?uriBnF text:query ( foaf:familyName "roussea*" ) .
>> ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom
>>
>>  => 3 "Jean-Marie Rousseau" ... (even if I add a limit = 100 000 or 2
>> 000 000)
>>
>> 2) - changing order + limit = 100 000 --
>>
>> ?uriBnF text:query ( foaf:givenName "*J*" 100000 ) .
>>  ?uriBnF text:query ( foaf:familyName "roussea*" ) .
>>  ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom
>>
>>  => 54 entries but not "Jean-Jacques" ! [...]

Re: fuseki text:query : strange results + Lucene configuration

2018-09-12 Thread Vincent Ventresque
Hi Lorenz,


Thanks for your reply.

> for me it sounds more like you've found a bug

I'm not able to tell, just beginning to use Fuseki + Lucene.

> I'm just referring to "Order of triple patterns in a BGP" here

Could you please give a raw text URL for "Order of triple patterns in a
BGP" (seems that the 'here' in your mail had a formatted link but I
didn't receive the url in my mailbox).

> The order of triple patterns in a BGP shouldn't matter

I thought that it was better (for performance/speed) to begin with 1)
constants and 2) variables having few solutions in the dataset. I've
read something about Sparql optimization and algebra, but can't remember
where. But maybe you're talking about the logics itself (A+B = B+A)?
N.B. I find these questions very interesting, but I'm no Sparql
specialist (neither a logician).

Cheers,

Vincent




On 12/09/2018 at 10:32, Lorenz B. wrote:
> Hi "VV",
>
> well, for me it sounds more like you've found a bug and are now doing a
> workaround. Or at least something is strange and I'm just referring to
> "Order of triple patterns in a BGP" here.
>
> The order of triple patterns in a BGP shouldn't matter - as far as I
> know it's always a good old join on the intermediate result of the
> evaluation of the triple patterns.
>
> Indeed, the limit of the text index lookup matters as the internal
> ordering by Lucene is based on some Information Retrieval measure (close
> to TF-IDF probably with default settings).
>
> But I guess, Osma and Andy will give you a better and more correct answer.
>
>
> Cheers,
> Lorenz
>
>> Hello Osma,
>>
>>
>> Thank you very much for your reply, you solved the problem! I've made
>> a few tests, both the order and the limit are important (see below).
>>
>> Just one more question : I thought that the "Roussea*" being less
>> numerous than the "*J*", it would be more efficient to begin with the
>> "Roussea*". Can you explain why it's the contrary?
>>
>> Best,
>>
>> VV.
>>
>>
>> 1) - changing only the order --
>>
>> ?uriBnF text:query ( foaf:givenName "*J*" ) .
>> ?uriBnF text:query ( foaf:familyName "roussea*" ) .
>> ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom
>>
>>  => 3 "Jean-Marie Rousseau" ... (even if I add a limit = 100 000 or 2
>> 000 000)
>>
>> 2) - changing order + limit = 100 000 --
>>
>> ?uriBnF text:query ( foaf:givenName "*J*" 100000 ) .
>>  ?uriBnF text:query ( foaf:familyName "roussea*" ) .
>>  ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom
>>
>>  => 54 entries but not "Jean-Jacques" !
>>
>> 3) - changing order + limit = 1 000 000
>> --
>>
>>  ?uriBnF text:query ( foaf:givenName "*J*" 1000000 ) .
>>  ?uriBnF text:query ( foaf:familyName "roussea*" ) .
>>  ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom
>>
>> => 135 entries, including the 4 "Jean-Jacques", in 1.7 seconds
>>
>> 4) - test using filters (strstarts + contains)
>> --
>>
>> ?uriBnF foaf:familyName ?nom
>> filter(strstarts(?nom, "Roussea"))
>> ?uriBnF foaf:givenName ?prenom
>> filter(contains(?prenom, "J"))
>>
>> => 129 entries, 27 seconds [fewer results than
>> "text:query ( foaf:givenName "*J*" 1000000 )" because contains is case
>> sensitive ?]
>>
>> -
>>
>> More info about the dataset:
>>
>> # 3 fields are indexed ( foaf:name + foaf:givenName are in the same
>> named graph )
>>
>> -- dcterms:title = +/- 9.45 M.
>>
>> -- foaf:givenName = +/- 1.71 M.
>>
>> -- foaf:familyName = +/- 1.78 M.
>>
>> # config file :
>>
>> 
>>
>> text:storeValues true ;
>>     text:queryParser text:AnalyzingQueryParser ;
>>     text:map (
>>     [ text:field "title" ; text:predicate dcterms:title ;
>>     text:analyzer [ a text:ConfigurableAnalyzer ;
>>  text:tokenizer text:KeywordTokenizer ;
>>  text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
>>  ] ]
>>  [ text:field "familyName" ; text:predicate foaf:familyName ;
>>     text:analyzer [ a text:ConfigurableAnalyzer ;
>>  text:tokenizer text:KeywordTokenizer ;
>>  text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
>>  ] ]
>>  [ text:field "givenName" ; text:predicate foaf:givenName ;
>>     text:analyzer [ a text:ConfigurableAnalyzer ;
>>  text:tokenizer text:KeywordTokenizer ;
>>  text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
>>  ] ]
>>
>>  ) .
>>
>>
>>
>>
>>  
>>
On 10/09/2018 at 18:58, Osma Suominen wrote:
>>> Hello Vincent,
>>>
>>> The results you get don't seem quite right. As you say, with a
>>> shorter query one would expect more results.
>>>
>>> One thing to do would be to check what results you get if you run the
>>> queries individually. I think combining the two separate jena-text
>>> queries (for foaf:familyName and foaf:givenName) may be part of the
>>> problem here. [...]

Re: fuseki text:query : strange results + Lucene configuration

2018-09-12 Thread Lorenz B.
Hi "VV",

well, for me it sounds more like you've found a bug and are now doing a
workaround. Or at least something is strange and I'm just referring to
"Order of triple patterns in a BGP" here.

The order of triple patterns in a BGP shouldn't matter - as far as I
know it's always a good old join on the intermediate result of the
evaluation of the triple patterns.

Indeed, the limit of the text index lookup matters as the internal
ordering by Lucene is based on some Information Retrieval measure (close
to TF-IDF probably with default settings).

But I guess, Osma and Andy will give you a better and more correct answer.


Cheers,
Lorenz

> Hello Osma,
>
>
> Thank you very much for your reply, you solved the problem! I've made
> a few tests, both the order and the limit are important (see below).
>
> Just one more question : I thought that the "Roussea*" being less
> numerous than the "*J*", it would be more efficient to begin with the
> "Roussea*". Can you explain why it's the contrary?
>
> Best,
>
> VV.
>
>
> 1) - changing only the order --
>
> ?uriBnF text:query ( foaf:givenName "*J*" ) .
> ?uriBnF text:query ( foaf:familyName "roussea*" ) .
> ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom
>
>  => 3 "Jean-Marie Rousseau" ... (even if I add a limit = 100 000 or 2
> 000 000)
>
> 2) - changing order + limit = 100 000 --
>
> ?uriBnF text:query ( foaf:givenName "*J*" 100000 ) .
>  ?uriBnF text:query ( foaf:familyName "roussea*" ) .
>  ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom
>
>  => 54 entries but not "Jean-Jacques" !
>
> 3) - changing order + limit = 1 000 000
> --
>
>  ?uriBnF text:query ( foaf:givenName "*J*" 1000000 ) .
>  ?uriBnF text:query ( foaf:familyName "roussea*" ) .
>  ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom
>
> => 135 entries, including the 4 "Jean-Jacques", in 1.7 seconds
>
> 4) - test using filters (strstarts + contains)
> --
>
> ?uriBnF foaf:familyName ?nom
> filter(strstarts(?nom, "Roussea"))
> ?uriBnF foaf:givenName ?prenom
> filter(contains(?prenom, "J"))
>
> => 129 entries, 27 seconds [fewer results than
> "text:query ( foaf:givenName "*J*" 1000000 )" because contains is case
> sensitive ?]
>
> -
>
> More info about the dataset:
>
> # 3 fields are indexed ( foaf:name + foaf:givenName are in the same
> named graph )
>
> -- dcterms:title = +/- 9.45 M.
>
> -- foaf:givenName = +/- 1.71 M.
>
> -- foaf:familyName = +/- 1.78 M.
>
> # config file :
>
> 
>
> text:storeValues true ;
>     text:queryParser text:AnalyzingQueryParser ;
>     text:map (
>     [ text:field "title" ; text:predicate dcterms:title ;
>     text:analyzer [ a text:ConfigurableAnalyzer ;
>  text:tokenizer text:KeywordTokenizer ;
>  text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
>  ] ]
>  [ text:field "familyName" ; text:predicate foaf:familyName ;
>     text:analyzer [ a text:ConfigurableAnalyzer ;
>  text:tokenizer text:KeywordTokenizer ;
>  text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
>  ] ]
>  [ text:field "givenName" ; text:predicate foaf:givenName ;
>     text:analyzer [ a text:ConfigurableAnalyzer ;
>  text:tokenizer text:KeywordTokenizer ;
>  text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
>  ] ]
>
>  ) .
>
>
>
>
>  
>
On 10/09/2018 at 18:58, Osma Suominen wrote:
>> Hello Vincent,
>>
>> The results you get don't seem quite right. As you say, with a
>> shorter query one would expect more results.
>>
>> One thing to do would be to check what results you get if you run the
>> queries individually. I think combining the two separate jena-text
>> queries (for foaf:familyName and foaf:givenName) may be part of the
>> problem here... So if you execute only the "roussea*" part of the
>> query, do you get the expected number of results? What about if you
>> only execute one of the givenName queries with no restriction on
>> familyName?
>>
>> Does it make a difference if you change the order of the familyName 
>> and givenName clauses?
>>
>> One thing to consider is that Lucene queries always have a limit on
>> the number of results. With jena-text you can specify it as an
>> additional parameter, but if you leave it out, it will default to
>> 10000. My guess is that the givenName queries may generate more 
>> results than 10000, and the results will then be cut off. This may 
>> mean that you get many Jeans and Jacques's and Johns etc. but many of 
>> the J. Rousseaus get cut off from the list. Try adding a large limit 
>> parameter (say 100000 or more) to the text:query functions to see if 
>> it helps. Like this:
>>
>>     ?uriBnF text:query ( foaf:givenName "*J*" 100000 )
>>
>> jena-text is not very good at combining multiple criteria. [...]

Re: fuseki text:query : strange results + Lucene configuration

2018-09-11 Thread Vincent Ventresque

Hello Osma,


Thank you very much for your reply, you solved the problem! I've made a 
few tests, both the order and the limit are important (see below).


Just one more question : I thought that the "Roussea*" being less 
numerous than the "*J*", it would be more efficient to begin with the 
"Roussea*". Can you explain why it's the contrary?


Best,

VV.


1) - changing only the order --

?uriBnF text:query ( foaf:givenName "*J*" ) .
?uriBnF text:query ( foaf:familyName "roussea*" ) .
?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom

 => 3 "Jean-Marie Rousseau" ... (even if I add a limit = 100 000 or 2 
000 000)


2) - changing order + limit = 100 000 --

?uriBnF text:query ( foaf:givenName "*J*" 100000 ) .
 ?uriBnF text:query ( foaf:familyName "roussea*" ) .
 ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom

 => 54 entries but not "Jean-Jacques" !

3) - changing order + limit = 1 000 000 --

 ?uriBnF text:query ( foaf:givenName "*J*" 1000000 ) .
 ?uriBnF text:query ( foaf:familyName "roussea*" ) .
 ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom

=> 135 entries, including the 4 "Jean-Jacques", in 1.7 seconds

4) - test using filters (strstarts + contains) 
--


?uriBnF foaf:familyName ?nom
filter(strstarts(?nom, "Roussea"))
?uriBnF foaf:givenName ?prenom
filter(contains(?prenom, "J"))

=> 129 entries, 27 seconds [fewer results than
"text:query ( foaf:givenName "*J*" 1000000 )" because contains is case 
sensitive ?]


-

More info about the dataset:

# 3 fields are indexed ( foaf:name + foaf:givenName are in the same 
named graph )


-- dcterms:title = +/- 9.45 M.

-- foaf:givenName = +/- 1.71 M.

-- foaf:familyName = +/- 1.78 M.

# config file :



text:storeValues true ;
    text:queryParser text:AnalyzingQueryParser ;
    text:map (
    [ text:field "title" ; text:predicate dcterms:title ;
    text:analyzer [ a text:ConfigurableAnalyzer ;
 text:tokenizer text:KeywordTokenizer ;
 text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
 ] ]
 [ text:field "familyName" ; text:predicate foaf:familyName ;
    text:analyzer [ a text:ConfigurableAnalyzer ;
 text:tokenizer text:KeywordTokenizer ;
 text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
 ] ]
 [ text:field "givenName" ; text:predicate foaf:givenName ;
    text:analyzer [ a text:ConfigurableAnalyzer ;
 text:tokenizer text:KeywordTokenizer ;
 text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
 ] ]

 ) .




 


On 10/09/2018 at 18:58, Osma Suominen wrote:

Hello Vincent,

The results you get don't seem quite right. As you say, with a shorter 
query one would expect more results.


One thing to do would be to check what results you get if you run the 
queries individually. I think combining the two separate jena-text 
queries (for foaf:familyName and foaf:givenName) may be part of the 
problem here... So if you execute only the "roussea*" part of the 
query, do you get the expected number of results? What about if you 
only execute one of the givenName queries with no restriction on 
familyName?


Does it make a difference if you change the order of the familyName and 
givenName clauses?


One thing to consider is that Lucene queries always have a limit on 
the number of results. With jena-text you can specify it as an 
additional parameter, but if you leave it out, it will default to 
10000. My guess is that the givenName queries may generate more 
results than 10000, and the results will then be cut off. This may 
mean that you get many Jeans and Jacques's and Johns etc. but many of the 
J. Rousseaus get cut off from the list. Try adding a large limit 
parameter (say 100000 or more) to the text:query functions to see if 
it helps. Like this:


    ?uriBnF text:query ( foaf:givenName "*J*" 100000 )

jena-text is not very good at combining multiple criteria. You can do 
it with separate queries as you've done, but internally the queries 
will run separately and the results will only be combined in Jena, 
outside Lucene.


-Osma



Vincent Ventresque wrote on 10.09.2018 at 13:03:

Hello,


I've made new tests with a slightly different dataset and 
configuration, the problem is the same.


--- Could you please tell me if these results are normal (I expected 
a bigger list with fewer letters)?


?uriBnF text:query ( foaf:givenName "*J*" ) => 3 entries

?uriBnF text:query ( foaf:givenName "*Ja*" ) => 1 entries

?uriBnF text:query ( foaf:givenName "*Je*" ) => 11 entries

?uriBnF text:query ( foaf:givenName "*-J*" ) => 11 entries

?uriBnF text:query ( foaf:givenName "*Jea*" ) => 12 entries

?uriBnF text:query ( foaf:givenName "*Jac*" ) => 13 entries

Here is the complete query :

SELECT * WHERE { ?uriBnF text:query ( foaf:familyName "roussea*" ) . 
?uriBnF text:query ( foaf:givenName "$MY_STRING" ) . 
?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom } [...]

Re: fuseki text:query : strange results + Lucene configuration

2018-09-10 Thread Osma Suominen

Hello Vincent,

The results you get don't seem quite right. As you say, with a shorter 
query one would expect more results.


One thing to do would be to check what results you get if you run the 
queries individually. I think combining the two separate jena-text 
queries (for foaf:familyName and foaf:givenName) may be part of the 
problem here... So if you execute only the "roussea*" part of the query, 
do you get the expected number of results? What about if you only 
execute one of the givenName queries with no restriction on familyName?


Does it make a difference if you change the order of the familyName and 
givenName clauses?


One thing to consider is that Lucene queries always have a limit on the 
number of results. With jena-text you can specify it as an additional 
parameter, but if you leave it out, it will default to 10000. My guess 
is that the givenName queries may generate more results than 10000, and 
the results will then be cut off. This may mean that you get many Jeans 
and Jacques's and Johns etc. but many of the J. Rousseaus get cut off from 
the list. Try adding a large limit parameter (say 100000 or more) to the 
text:query functions to see if it helps. Like this:


?uriBnF text:query ( foaf:givenName "*J*" 100000 )

jena-text is not very good at combining multiple criteria. You can do it 
with separate queries as you've done, but internally the queries will 
run separately and the results will only be combined in Jena, outside 
Lucene.


-Osma



Vincent Ventresque wrote on 10.09.2018 at 13:03:

Hello,


I've made new tests with a slightly different dataset and configuration, 
the problem is the same.


--- Could you please tell me if these results are normal (I expected a 
bigger list with fewer letters)?


?uriBnF text:query ( foaf:givenName "*J*" ) => 3 entries

?uriBnF text:query ( foaf:givenName "*Ja*" ) => 1 entries

?uriBnF text:query ( foaf:givenName "*Je*" ) => 11 entries

?uriBnF text:query ( foaf:givenName "*-J*" ) => 11 entries

?uriBnF text:query ( foaf:givenName "*Jea*" ) => 12 entries

?uriBnF text:query ( foaf:givenName "*Jac*" ) => 13 entries

Here is the complete query :

SELECT * WHERE { ?uriBnF text:query ( foaf:familyName "roussea*" ) . 
?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .


?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom }

N.B. : the dataset is quite large : 1,78 M family names indexed, and 
1,71 M given names. I have 4 distinct "Jean-Jacques Rousseau" in the 
data, 713 family names containing "roussea", including 224 compound 
given names.


--- Do you know where to find more documentation about Lucene 
configuration (I read jena.apache.org page + , and also found useful 
explanations on Skosmos wiki https://github.com/NatLibFi/Skosmos ), 
especially about tokenizers  ?



Thanks in advance,

VV







On 19/07/2018 at 14:07, Vincent Ventresque wrote:

Hello,

I've just subscribed to the users@jena.apache.org list, and I 
apologize if this mail is not sent properly.


I'm trying to use Fuseki text:query, and have encountered several 
issues. Here are my questions


1) Does text:query require a minimum number of characters to be 
efficient?


2) Is performance linked to the number of fields indexed?

3) In order to retrieve strings containing hyphens, should I use 
KeywordTokenizer in config file?


~~~ 1) Does text:query require a minimum number of characters to be 
efficient? ~


I've noticed that a query on indexed predicates (foaf:familyName and 
foaf:givenName) returns more results when there are more characters in 
the string :


SELECT * WHERE {

?uriBnF text:query ( foaf:familyName "roussea*" ) .

?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .

?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom .

optional {?uriBnF bio:birth ?dateNaissance }

}

I was expecting that "Rousseau" + "Jean-Jacques" would be in the results.

=> if  $MY_STRING = "j*", I get  0 result

=> if  $MY_STRING = "je*", I get 17 results, including "Jean-Claude" & 
"Jean-Baptiste" BUT not "Jean-Jacques"


=> if  $MY_STRING = "jea*", I get 27 results, including "Jean-Jacques"

I don't know anything about Lucene, but it looks very strange to me : 
I expected the contrary (fewer letters = bigger results list).



~~~ 2) Is performance linked to the number of fields indexed? 
~~~


If I change the configuration and index only foaf:givenName, and 
provide a constant for foaf:familyName, the query returns more results :


SELECT * WHERE {

?uriBnF foaf:familyName "Rousseau" .

?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .

?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom .

optional {?uriBnF bio:birth ?dateNaissance }

}

=> if  $MY_STRING = "j*", I get  7 results, whereas the first query 
returned 0 result.



~~~ 3) In order to retrieve strings containing hyphens, should I use 
KeywordTokenizer in config file? ~


With the same query, if $MY_STRING = "jean-ja*" :

a) with simple configuration (cf. below), I get 0 result [...]

Re: fuseki text:query : strange results + Lucene configuration

2018-09-10 Thread Vincent Ventresque

Hello,


I've made new tests with a slightly different dataset and configuration, 
the problem is the same.


--- Could you please tell me if these results are normal (I expected a 
bigger list with fewer letters)?


?uriBnF text:query ( foaf:givenName "*J*" ) => 3 entries

?uriBnF text:query ( foaf:givenName "*Ja*" ) => 1 entries

?uriBnF text:query ( foaf:givenName "*Je*" ) => 11 entries

?uriBnF text:query ( foaf:givenName "*-J*" ) => 11 entries

?uriBnF text:query ( foaf:givenName "*Jea*" ) => 12 entries

?uriBnF text:query ( foaf:givenName "*Jac*" ) => 13 entries

Here is the complete query :

SELECT * WHERE { ?uriBnF text:query ( foaf:familyName "roussea*" ) . 
?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .


?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom }

N.B. : the dataset is quite large : 1,78 M family names indexed, and 
1,71 M given names. I have 4 distinct "Jean-Jacques Rousseau" in the 
data, 713 family names containing "roussea", including 224 compound 
given names.


--- Do you know where to find more documentation about Lucene 
configuration (I read jena.apache.org page + , and also found useful 
explanations on Skosmos wiki https://github.com/NatLibFi/Skosmos ), 
especially about tokenizers  ?



Thanks in advance,

VV





 


On 19/07/2018 at 14:07, Vincent Ventresque wrote:

Hello,

I've just subscribed to the users@jena.apache.org list, and I 
apologize if this mail is not sent properly.


I'm trying to use Fuseki text:query, and have encountered several 
issues. Here are my questions


1) Does text:query require a minimum number of characters to be 
efficient?


2) Is performance linked to the number of fields indexed?

3) In order to retrieve strings containing hyphens, should I use 
KeywordTokenizer in config file?


~~~ 1) Does text:query require a minimum number of characters to be 
efficient? ~


I've noticed that a query on indexed predicates (foaf:familyName and 
foaf:givenName) returns more results when there are more characters in 
the string :


SELECT * WHERE {

?uriBnF text:query ( foaf:familyName "roussea*" ) .

?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .

?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom .

optional {?uriBnF bio:birth ?dateNaissance }

}

I was expecting that "Rousseau" + "Jean-Jacques" would be in the results.

=> if  $MY_STRING = "j*", I get  0 result

=> if  $MY_STRING = "je*", I get 17 results, including "Jean-Claude" & 
"Jean-Baptiste" BUT not "Jean-Jacques"


=> if  $MY_STRING = "jea*", I get 27 results, including "Jean-Jacques"

I don't know anything about Lucene, but it looks very strange to me : 
I expected the contrary (fewer letters = bigger results list).



~~~ 2) Is performance linked to the number of fields indexed? 
~~~


If I change the configuration and index only foaf:givenName, and 
provide a constant for foaf:familyName, the query returns more results :


SELECT * WHERE {

?uriBnF foaf:familyName "Rousseau" .

?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .

?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom .

optional {?uriBnF bio:birth ?dateNaissance }

}

=> if  $MY_STRING = "j*", I get  7 results, whereas the first query 
returned 0 result.



~~~ 3) In order to retrieve strings containing hyphens, should I use 
KeywordTokenizer in config file? ~


With the same query, if $MY_STRING = "jean-ja*" :

a) with simple configuration (cf. below), I get 0 result

b) with KeywordTokenizer config (cf. below), I get "Jean-Jacques"

Is it the right way to get "Jean-Jacques"?
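For what it's worth, with the KeywordTokenizer + LowerCaseFilter 
analyzer configuration quoted elsewhere in this thread, the hyphenated 
name is indexed as a single lowercased token, so a lookup like the 
following should match "Jean-Jacques" (a sketch mirroring test b) above):

?uriBnF text:query ( foaf:givenName "jean-ja*" ) .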


Thanks in advance

VV



=== SIMPLE CONFIGURATION ===

@prefix :    <#> .
@prefix rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:     <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb:      <http://jena.hpl.hp.com/2008/tdb#> .
@prefix ja:       <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix text:     <http://jena.apache.org/text#> .
@prefix fuseki:   <http://jena.apache.org/fuseki#> .
@prefix foaf:     <http://xmlns.com/foaf/0.1/> .
@prefix dcterms:  <http://purl.org/dc/terms/> .



[] rdf:type fuseki:Server ;
   .


## Initialize TDB 

[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
tdb:GraphTDB    rdfs:subClassOf  ja:Model .

## Initialize text query -
[] ja:loadClass   "org.apache.jena.query.text.TextQuery" .
# A TextDataset is a regular dataset with a text index.
text:TextDataset  rdfs:subClassOf   ja:RDFDataset .
# Lucene index
text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .

## ---
## This URI must be fixed - it's used to assemble the text dataset.

:text_dataset rdf:type text:TextDataset ;
#    text:dataset   <#dataset> ;
    text:dataset :tdb_dataset_readwrite ;
#    text:index <#indexLucene> ;
    text:index