[
https://issues.apache.org/jira/browse/STANBOL-330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113249#comment-13113249
]
Rupert Westenthaler commented on STANBOL-330:
---------------------------------------------
In general, the search string is tokenized to provide behavior similar to full
text searches in Solr/Lucene, based on the capabilities described in [1].
Without this, queries on Virtuoso servers would behave like Solr queries on
untokenized fields.
Regarding the specific example:
If I make the request (*)
curl -v -X POST -d "name=Richard M. Daley" \
    http://localhost:8080/entityhub/site/dbpedia/find
I get "http://dbpedia.org/resource/Richard_M._Daley" as the result.
The SPARQL query sent to the dbpedia Virtuoso servers looks like this:
CONSTRUCT {
  ?id <http://www.w3.org/2000/01/rdf-schema#label> ?v_1 .
  <http://www.iks-project.eu/ontology/rick/query/QueryResultSet>
      <http://www.iks-project.eu/ontology/rick/query/queryResult> ?id .
} WHERE {
  {
    SELECT ?id
    WHERE {
      ?id <http://www.w3.org/2000/01/rdf-schema#label> ?v_1 .
      ?v_1 bif:contains '("Richard" AND "M." AND "Daley")' .
    }
    ORDER BY DESC ( <LONG::IRI_RANK> (?id) )
    LIMIT 10
  }
  OPTIONAL { ?id <http://www.w3.org/2000/01/rdf-schema#label> ?v_1 . }
}
ORDER BY DESC ( <LONG::IRI_RANK> (?id) )
LIMIT 10
Only if I change the search to "name=Richard M . Daley" do I get the reported
error. "name=Richard M Daley" also works, but executes much slower (3.5sec
instead of 1sec for the query using "M."). Tests with numbers (such as
"Richard Daley 5") also executed successfully.
Interestingly, "name=Richard .M Daley" also selects
"http://dbpedia.org/resource/Richard_M._Daley". From this I conclude that the
full text search engine simply ignores non-alphanumeric characters. This could
also explain the error, because a token such as "." would end up as an empty
token.
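
To illustrate where such a token could come from, here is a minimal Python
sketch of the client-side tokenization. It is not the actual Entityhub code:
the helper name build_contains_pattern is hypothetical, and splitting on
whitespace only is an assumption inferred from the pattern in the query above.

```python
def build_contains_pattern(search: str) -> str:
    """Build a bif:contains pattern from a search string (hypothetical helper)."""
    # Assumption: the searcher splits the search string on whitespace only.
    tokens = search.split()
    # Each token becomes a single-word phrase; the phrases are AND-ed together.
    quoted = ['"{}"'.format(token) for token in tokens]
    return "'(" + " AND ".join(quoted) + ")'"


print(build_contains_pattern("Richard M. Daley"))
# prints '("Richard" AND "M." AND "Daley")'
print(build_contains_pattern("Richard M . Daley"))
# prints '("Richard" AND "M" AND "." AND "Daley")'
```

Under these assumptions the second request contains a phrase consisting only
of ".", which would be empty once non-alphanumeric characters are dropped.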
Section "20.3.3. Text Expression Syntax" of the documentation [1] notes that "A
word is a sequence of word characters." but does not explicitly define what
word characters are. For now I think we should simply ignore tokens that do not
contain at least one alphanumeric character.
Based on that, I suggest modifying the Virtuoso searcher to ignore tokens that
do not contain at least one alphanumeric character. However, I would not strip
non-alphanumeric characters from the remaining tokens, because it looks like
Virtuoso filters them anyway.
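
The suggested filter could look roughly like the following Python sketch. The
function names are hypothetical and whitespace tokenization is assumed; the
real change would be made in the Java code of the Virtuoso searcher.

```python
def has_alphanumeric(token: str) -> bool:
    """Return True if the token contains at least one alphanumeric character."""
    # str.isalnum() is Unicode-aware, so accented letters also count.
    return any(ch.isalnum() for ch in token)


def filter_tokens(search: str) -> list:
    """Split on whitespace (assumption) and drop tokens like "." or "-"."""
    # Kept tokens are left unchanged: "M." and ".M" keep their punctuation.
    return [token for token in search.split() if has_alphanumeric(token)]


print(filter_tokens("Richard M . Daley"))   # prints ['Richard', 'M', 'Daley']
print(filter_tokens("Richard .M Daley"))    # prints ['Richard', '.M', 'Daley']
```

Only tokens without any alphanumeric character are removed; everything else is
passed on to Virtuoso as before.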
If you agree I will make the necessary changes and mark this Issue as resolved.
best
Rupert
(*) For these tests I used the stable launcher of the current version. I
stopped/uninstalled the defaultdata bundle for dbpedia and installed the
dbpediacached bundle (org.apache.stanbol.data.sites.dbpedia.cached).
[1] http://docs.openlinksw.com/virtuoso/queryingftcols.html#containspredicate
> EnityHub query API can generate incorrect fulltext clause with Virtuoso
> searcher
> --------------------------------------------------------------------------------
>
> Key: STANBOL-330
> URL: https://issues.apache.org/jira/browse/STANBOL-330
> Project: Stanbol
> Issue Type: Bug
> Reporter: Olivier Grisel
>
> At the moment the fulltext expression is tokenized on the client side and the
> bif:contains clause contains a conjunction of single word phrase queries
> which can be invalid if one of the words is a stop word.
> Instead of tokenizing on the client side one should probably just do a basic
> fulltext search without explicit conjunction and phrase query operators.
> Detailed error message to be added in the first comment.