[ 
https://issues.apache.org/jira/browse/STANBOL-330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113249#comment-13113249
 ] 

Rupert Westenthaler commented on STANBOL-330:
---------------------------------------------

In general, the search string is tokenized to provide behavior similar to
full-text searches in Solr/Lucene, based on the capabilities described in [1].
Without this, queries on Virtuoso servers would behave like Solr queries on
untokenized fields.

Regarding the specific example:

If I make the request (*)

    curl -v -X POST -d "name=Richard M. Daley" \
        http://localhost:8080/entityhub/site/dbpedia/find

I do get "http://dbpedia.org/resource/Richard_M._Daley" as the result.

The SPARQL query sent to the dbpedia Virtuoso servers looks like this:

    CONSTRUCT { 
      ?id <http://www.w3.org/2000/01/rdf-schema#label> ?v_1 .
      <http://www.iks-project.eu/ontology/rick/query/QueryResultSet>
          <http://www.iks-project.eu/ontology/rick/query/queryResult> ?id .
    } WHERE { 
      { 
        SELECT ?id 
        WHERE { 
          ?id <http://www.w3.org/2000/01/rdf-schema#label> ?v_1 . 
          ?v_1 bif:contains '("Richard" AND "M." AND "Daley")' .
        }
        ORDER BY DESC ( <LONG::IRI_RANK> (?id) )
        LIMIT 10
      }
      OPTIONAL { ?id <http://www.w3.org/2000/01/rdf-schema#label> ?v_1 . } 
    } 
    ORDER BY DESC ( <LONG::IRI_RANK> (?id) ) 
    LIMIT 10 

Only if I modify the search to "name=Richard M . Daley" do I get the reported
error. "name=Richard M Daley" also works but executes much more slowly (3.5 sec
instead of 1 sec for the query using "M."). Tests with numbers (such as
"Richard Daley 5") also executed successfully.

Interestingly, "name=Richard .M Daley" also selects
"http://dbpedia.org/resource/Richard_M._Daley". From this I conclude that the
full-text search engine in use simply ignores non-alphanumeric characters. This
could also explain the error: a token such as "." would end up empty once those
characters are ignored.

Section "20.3.3. Text Expression Syntax" of the documentation [1] notes that "A
word is a sequence of word characters." but does not explicitly define what
word characters are. For now I think we should simply ignore tokens that do not
contain at least one alphanumeric character.

Based on that, I suggest modifying the Virtuoso searcher to ignore tokens that
do not contain at least one alphanumeric character. However, I would not strip
non-alphanumeric characters from the remaining tokens, because it looks like
Virtuoso filters them anyway.
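
The proposed filtering could look roughly like the sketch below. It is written
in Python for brevity and is not the actual change to the Virtuoso searcher.
The function name build_contains_expression is hypothetical, and splitting on
whitespace is an assumption about how the search string is tokenized. Escaping
of quote characters inside tokens is not handled here.

```python
def build_contains_expression(text):
    """Build a bif:contains expression as a conjunction of quoted tokens.

    Tokens without at least one alphanumeric character (e.g. a lone ".")
    are dropped. Other tokens are passed through unchanged, so "M." keeps
    its trailing dot. Returns None if no usable token remains.
    """
    # Assumption: tokens are separated by whitespace.
    tokens = [t for t in text.split() if any(c.isalnum() for c in t)]
    if not tokens:
        # Nothing to search for; the caller should skip the full-text clause.
        return None
    return "(" + " AND ".join('"%s"' % t for t in tokens) + ")"


if __name__ == "__main__":
    print(build_contains_expression("Richard M. Daley"))
    print(build_contains_expression("Richard M . Daley"))
```

With this, "Richard M . Daley" no longer produces a "." token in the
bif:contains clause, while "Richard M. Daley" yields the same expression as in
the query shown above.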

If you agree, I will make the necessary changes and mark this issue as resolved.

best
Rupert

(*) For these tests I used the stable launcher of the current version. I
stopped/uninstalled the defaultdata bundle for dbpedia and installed the
dbpediacached bundle (org.apache.stanbol.data.sites.dbpedia.cached).

[1] http://docs.openlinksw.com/virtuoso/queryingftcols.html#containspredicate

> EnityHub query API can generate incorrect fulltext clause with Virtuoso 
> searcher
> --------------------------------------------------------------------------------
>
>                 Key: STANBOL-330
>                 URL: https://issues.apache.org/jira/browse/STANBOL-330
>             Project: Stanbol
>          Issue Type: Bug
>            Reporter: Olivier Grisel
>
> At the moment the fulltext expression is tokenized on the client side and the 
> bif:contains clause contains a conjunction of single word phrase queries 
> which can be invalid if one of the words is a stop word.
> Instead of tokenizing on the client side one should probably just do a basic 
> fulltext search without explicit conjunction and phrase query operators.
> Detailed error message to be added in the first comment.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira