Hi Mark!

Thanks for pointing this out. This seems to be a bug/feature of the jena-text module. It is not directly related to Fuseki (which is really the web server module), but Fuseki includes jena-text. Your interpretation of what is happening seems correct to me.

Do you have a suggestion for how this could be resolved? What kind of query results would you expect for, say, text:query ('' 'lang:en') or text:query ('will' 'lang:en')? Do you happen to know a better way to construct the Lucene query than just ANDing the language restriction onto the keyword part, as is currently done?
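
To make the current behaviour concrete, here is a rough Python sketch of the string construction as I understand it from your report. It is not the actual jena-text code, and both function names are made up for illustration. The second function shows one possible guard (rejecting a blank keyword part), which is only an assumption about what a fix might look like.

```python
def build_query_string(text: str, lang: str) -> str:
    """Model of the reported behaviour: wrap the keyword part in
    parentheses and AND the language restriction onto it."""
    return f"({text}) AND lang:{lang}"


def build_query_string_guarded(text: str, lang: str) -> str:
    """Hypothetical variant: refuse a blank keyword part up front
    instead of handing '() AND lang:en' to the Lucene parser."""
    if not text.strip():
        raise ValueError("empty text query")
    return build_query_string(text, lang)


print(build_query_string("", "en"))      # () AND lang:en
print(build_query_string("will", "en"))  # (will) AND lang:en
```

A guard like this would only catch the literally empty case. It does nothing for 'will', where the keyword part becomes empty only after analysis.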

One thing that might help a bit is to use a different analyzer than the default StandardAnalyzer. StandardAnalyzer has a lot of smarts, including a built-in stop word list, but in your case this causes problems with stop words such as "will". If you used, for example, SimpleAnalyzer, then this would not be an issue. But I guess there would still be problems with the wildcard-type queries.
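
As a toy illustration of why the analyzer matters, the Python sketch below compares tokenization with and without a stop word list. It is a simplified model, not Lucene: the regex tokenizer is my own approximation (ASCII letters only), the function names are hypothetical, and the stop word set is just a small excerpt chosen for the example.

```python
import re
from typing import List

# Small excerpt of an English stop word list (assumed subset,
# for illustration only).
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "will", "with"}


def tokens_with_stop_words(text: str) -> List[str]:
    """Rough stand-in for an analyzer with a stop list:
    lowercase, split into letter runs, drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def tokens_without_stop_words(text: str) -> List[str]:
    """Rough stand-in for an analyzer without a stop list:
    lowercase, split into letter runs, keep everything."""
    return re.findall(r"[a-z]+", text.lower())


print(tokens_with_stop_words("will"))     # []
print(tokens_without_stop_words("will"))  # ['will']
print(tokens_without_stop_words("\\?"))   # []
```

In this model 'will' survives once the stop list is gone, but a query consisting only of punctuation still produces no tokens at all, which is in line with my guess about the wildcard-type queries.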

-Osma



17.03.2016, 14:23, Mark Wharton wrote:
Hi Jena Users.

We've been experiencing some peculiar behaviour with Jena/Fuseki and
Lucene - particularly, but not entirely, around special characters.

We are currently running Fuseki 2.3.0, which seems to include Lucene
4.9.1, as far as we can tell.

Using the query:

PREFIX text: <http://jena.apache.org/text#>
SELECT ?ent ?score
  { (?ent ?score) text:query (<TEXT> 'lang:en')  }

...and different values of <TEXT>, the following happens:

1) <TEXT> = ''
Get server error: Cannot parse '() AND lang:en'

2) <TEXT> = '*' - 26 results

3) <TEXT> = '\\*' - 26 results

4) <TEXT> = '\\?' - 26 results

5) <TEXT> = 'will' - 26 results
"will" is one of the words which is ignored by Lucene; see e.g.
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.9.1/lucene/analysis/common/src/java/org/apache/lucene/analysis/core/StopAnalyzer.java#L51

6) <TEXT> = '(?)' - 3 results
Labels/comments with single-character words in them?

7) <TEXT> = '(\\?)' - 26 results

8) <TEXT> = '\\(\\?\\)' - 26 results

It looks to us as if, since Fuseki turns
"<TEXT>" into "(<TEXT>) AND lang:en",
an empty match for TEXT (grouped in
parentheses) results in ALL entries being matched.

Problem:
Unless we know the complete list of ignored words & characters that Lucene
then goes on to turn into an empty match, it is impossible to stop Fuseki
returning ALL results with certain queries!


Thanks in advance for any thoughts and help

Mark




--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
[email protected]
http://www.nationallibrary.fi
