Re: Fuseki + Lucene special characters.

Osma Suominen Sat, 19 Mar 2016 16:03:15 -0700

Hi Mark!

Thanks for pointing this out. This seems to be a bug/feature of the jena-text module. It is not directly related to Fuseki (which is really the web server module), but Fuseki includes jena-text. Your interpretation of what is happening seems correct to me.

Do you have a suggestion of how this could be resolved? What kind of query results would you expect for, say, text:query ('' 'lang:en') or text:query ('will' 'lang:en')? Do you happen to know a better way to construct the Lucene query than just ANDing the language restriction to the keyword part, as is currently done?

One thing that might help a bit is to use a different analyzer than the default StandardAnalyzer. StandardAnalyzer has a lot of smarts including the built-in stop word list, but in your case this causes problems with stopwords such as "will". If you used for example SimpleAnalyzer, then this would not be an issue. But I guess there would still be problems with the wildcard-type queries.
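
To illustrate the analyzer point, here is a rough Python sketch. It is not Lucene or jena-text code: tokenize() and is_effectively_empty() are hypothetical helpers, the tokenizer only handles ASCII letters, and STOP_WORDS is a short excerpt standing in for the built-in English list. It only shows why a keyword like 'will' ends up as an empty query when stop words are filtered, and why strings made only of special characters end up empty either way.

```python
import re

# Assumption: a short excerpt of an English stop word list, for illustration only.
# "will" is the entry that matters for this thread.
STOP_WORDS = frozenset({"a", "an", "and", "of", "the", "to", "will", "with"})


def tokenize(text, stop_words=frozenset()):
    """Lowercase the text, keep runs of ASCII letters, drop stop words.

    This is a crude stand-in for an analyzer, not a reimplementation of one.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_words]


def is_effectively_empty(text, stop_words=frozenset()):
    """True if nothing searchable is left after tokenizing."""
    return not tokenize(text, stop_words)


# With a stop word list, 'will' disappears; without one it survives.
print(tokenize("will", STOP_WORDS))
print(tokenize("will"))
# Special characters leave no tokens regardless of the stop word list.
print(is_effectively_empty("\\?"))
```

A client could use a check like is_effectively_empty() to refuse such input before sending the SPARQL query, although that only works if the client-side rules match what the configured analyzer really does.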


-Osma



17.03.2016, 14:23, Mark Wharton wrote:

Hi Jena Users.

We've been experiencing some peculiar behaviour with Jena/Fuseki and
Lucene - particularly, but not entirely, around special characters.

We are currently running Fuseki 2.3.0, which seems to include Lucene
4.9.1, as far as we can tell.

Using the query:

PREFIX text: <http://jena.apache.org/text#>
SELECT ?ent ?score
  { (?ent ?score) text:query (<TEXT> 'lang:en')  }

...and different values of <TEXT>, the following happens

1) <TEXT> = ''
Get server error: Cannot parse '() AND lang:en'

2) <TEXT> = '*' - 26 results

3) <TEXT> = '\\*' - 26 results

4) <TEXT> = '\\?' - 26 results

5) <TEXT> = 'will' - 26 results
("will" is one of the words which is ignored by Lucene, see e.g.
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.9.1/lucene/analysis/common/src/java/org/apache/lucene/analysis/core/StopAnalyzer.java#L51 )

6) <TEXT> = '(?)' - 3 results
labels/comments with single character words in them?

7) <TEXT> = '(\\?)' - 26 results

8) <TEXT> = '\\(\\?\\)' - 26 results

It looks to us as if:
Since Fuseki turns
"<TEXT>" into "(<TEXT>) AND lang:en",
it would appear that empty matches for TEXT (grouped with
parentheses) result in ALL entries being matched.
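
The rewriting described above can be sketched in a few lines of Python. This is only an illustration of the string construction as we understand it from the error message in case 1; build_text_query() is a hypothetical name, not a jena-text function.

```python
def build_text_query(text, lang="en"):
    """Wrap the keyword part in parentheses and AND it with the language
    restriction, mirroring the "(<TEXT>) AND lang:en" pattern described above.

    Hypothetical helper for illustration; not part of jena-text.
    """
    return "({}) AND lang:{}".format(text, lang)


# An empty keyword part yields the string that appears in the case 1 error.
print(build_text_query(""))
# A stop word looks fine at this stage; the keyword part only becomes
# empty later, once the analyzer has discarded it.
print(build_text_query("will"))
```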

Problem:
Unless we know the complete list of ignored words & characters that Lucene
then goes on to turn into an empty match, it is impossible to stop Fuseki
returning ALL results with certain queries!


Thanks in advance for any thoughts and help

Mark



--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
[email protected]
http://www.nationallibrary.fi
