Fuseki + Lucene special characters.

Mark Wharton Sat, 19 Mar 2016 00:20:07 -0700

Hi Jena Users.

We've been experiencing some peculiar behaviour with Jena/Fuseki and
Lucene - particularly, but not entirely, around special characters.


We are currently running Fuseki 2.3.0, which, as far as we can tell,
bundles Lucene 4.9.1.

Using the query:

PREFIX text: <http://jena.apache.org/text#>
SELECT ?ent ?score
 { (?ent ?score) text:query (<TEXT> 'lang:en')  }

...and different values of <TEXT>, we see the following:

1) <TEXT> = ''
Server error: Cannot parse '() AND lang:en'

2) <TEXT> = '*' - 26 results

3) <TEXT> = '\\*' - 26 results

4) <TEXT> = '\\?' - 26 results

5) <TEXT> = 'will' - 26 results
("will" is one of the stop words that Lucene ignores; see e.g.
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.9.1/lucene/analysis/common/src/java/org/apache/lucene/analysis/core/StopAnalyzer.java#L51)

6) <TEXT> = '(?)' - 3 results
(presumably labels/comments containing single-character words?)

7) <TEXT> = '(\\?)' - 26 results

8) <TEXT> = '\\(\\?\\)' - 26 results

It looks to us as if, because Fuseki turns
"<TEXT>" into "(<TEXT>) AND lang:en",
any <TEXT> that Lucene reduces to an empty match (inside the
parentheses) results in ALL entries being matched.
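
To make the rewriting concrete, here is a minimal Python sketch of the string that would reach Lucene for each <TEXT> value above. It assumes the wrapping is exactly "(<TEXT>) AND lang:en", as the error message in case 1 suggests, and that the SPARQL parser has already turned '\\' into a single backslash. The function names are hypothetical and are not part of Jena.

```python
def sparql_unescape(body):
    # Undo the SPARQL string escapes relevant here: \\ , \' and \" .
    # Other escapes (\n, \t, ...) are deliberately left alone in this sketch.
    out = []
    i = 0
    while i < len(body):
        if body[i] == "\\" and i + 1 < len(body) and body[i + 1] in "\\'\"":
            out.append(body[i + 1])
            i += 2
        else:
            out.append(body[i])
            i += 1
    return "".join(out)


def lucene_query(text, lang="en"):
    # Assumed rewriting, inferred from "Cannot parse '() AND lang:en'".
    return "(" + text + ") AND lang:" + lang


# The eight <TEXT> values from the list above, exactly as written in SPARQL.
cases = ["", "*", r"\\*", r"\\?", "will", "(?)", r"(\\?)", r"\\(\\?\\)"]

for body in cases:
    print("'" + body + "'", "->", lucene_query(sparql_unescape(body)))
```

For case 1 this yields "() AND lang:en", the string quoted in the server error. For the escaped cases, Lucene receives a single backslash in front of each special character.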

Problem:
Unless we know the complete list of ignored words and characters that
Lucene reduces to an empty match, it is impossible to stop Fuseki
returning ALL results for certain queries!
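
As a stop-gap, a client could refuse to send any <TEXT> that looks likely to analyse down to nothing. The Python sketch below is only an approximation of what the server-side analyzer does, which is exactly the difficulty described above. The stop-word set is an assumption and should be checked against the StopAnalyzer source linked earlier. The function names are hypothetical. The check is deliberately conservative, so it also rejects wildcard-only input such as '*' or '(?)'.

```python
import re

# Assumption: the English stop words used by Lucene 4.9.1's StopAnalyzer.
# Verify this set against the linked source before relying on it.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these", "they", "this",
    "to", "was", "will", "with",
}


def surviving_terms(text):
    # Keep runs of letters/digits only, so escapes and query operators
    # ( * ? \ ( ) etc. ) never count as terms, then drop stop words.
    words = re.findall(r"[^\W_]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def is_safe(text):
    # True only if at least one real term would be left to match on.
    return bool(surviving_terms(text))


# Values as Lucene would see them, i.e. after SPARQL unescaping.
for text in ["", "*", r"\*", "will", "(?)", "temperature sensor"]:
    print("'" + text + "'", "->", "send" if is_safe(text) else "reject")
```

This does not fix the underlying behaviour. It only keeps the known problem inputs from reaching Fuseki, and it will miss anything the real analyzer drops that this approximation does not.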


Thanks in advance for any thoughts and help.

Mark


-- 
Technology Lead, Iotic Labs
+44 7973 674404
[email protected]
https://www.iotic-labs.com
