[ 
https://issues.apache.org/jira/browse/STANBOL-89?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Christ reopened STANBOL-89:
----------------------------------


Should be set to 'Resolved' instead of 'Closed'

> SolrYard uses string field for natural text queries
> ---------------------------------------------------
>
>                 Key: STANBOL-89
>                 URL: https://issues.apache.org/jira/browse/STANBOL-89
>             Project: Stanbol
>          Issue Type: Bug
>          Components: Entity Hub
>            Reporter: Rupert Westenthaler
>            Assignee: Rupert Westenthaler
>            Priority: Minor
>
> This describes a change to the way the SolrYard indexes values with the
> data type xsd:string in order to improve the support for natural language
> text searches on such values. This change removes a wrong assumption
> present in the current implementation. Details below.
> Background:
> The Entityhub distinguishes "natural language text" from normal values such
> as integers, floats, dates and string values. This is mainly because one
> might want to process natural language differently from normal string
> values. For example, when processing natural language text one might want to
> use things like white space separators, stop word filters and/or stemming,
> but for ISBN numbers, article numbers or postal codes such algorithms will
> lead to unwanted effects.
> This distinction is not specific to the Entityhub; it is also present within
> RDF. RDF defines "PlainLiterals" (with an optional xml:lang attribute) used
> to represent natural language text and "TypedLiterals" (with an optional xsd
> data type) to represent other values (including xsd:string). This is also
> reflected in RDF APIs, including Clerezza's RDF model.
> Solr also provides a lot of functionality to improve the indexing and 
> searching for natural language texts. Therefore the correct declaration of 
> natural language texts and string values is of importance for getting the 
> expected search results.
> For natural language texts the Solr schema.xml used by the SolrYard defines a
> fieldType that uses the WhitespaceTokenizer, StopFilterFactory,
> WordDelimiterFilter and LowerCaseFilter. For English texts the
> SnowballPorterFilter (stemming) is used as well.
> In contrast, string fields do not use any tokenizer.
> The Problem:
> A lot of developers of applications that produce RDF data do not correctly 
> use the RDF APIs. It is often the case that TypedLiterals with the data type 
> xsd:string are used to create literals representing natural language texts. 
> This is often because RDF APIs typically provide some kind of LiteralFactory
> to create RDF Literals for Java Objects. So passing a Java String instance
> representing a natural language text will create a TypedLiteral with the data
> type xsd:string. Even the Stanbol Enhancer is no exception to that because it
> also creates TypedLiterals holding natural language texts! Developers usually
> only use PlainLiterals if there is a requirement to specify the language.
> The conclusion is that components MUST NOT assume that string values do not
> represent natural language texts. However, they also cannot assume that all
> string values are in fact natural language texts.
> The best solution is to let users define how to interpret the values when
> they interact with the data (at query time).
> Old Implementation:
> Previous to this change the SolrYard indexed "natural language text" and
> "string" values differently.
> String values for a field were stored with the prefix "str" without any
> processing.
> Natural language texts were stored with the prefix "@{lang}" (e.g. "@en" for
> English texts, "@" for texts without a language) and processed by several
> tokenizers and filters as described above. In addition, texts were also
> stored within a field with the prefix "_!@" that combined all natural text
> values of all languages.
> To include string values in the results of natural language text searches,
> queries for natural language texts were created to also search within the
> "str" field. Here is an example of a query for "Rupert" within the field
> "rdfs:label":
>    "(_\!@/rdfs\:label/:Rupert OR str/rdfs\:label/:Rupert)"
> However, this had one important shortcoming: the second term of the query
> searched within a field that is not suited for natural language text
> searches. To describe that in more detail, let's assume the value "Rupert
> Westenthaler" is defined in the following two ways:
> (1) defined as "rdfs:label" -> "Rupert Westenthaler" (PlainLiteral) would end
> up as two tokens "rupert" and "westenthaler" within the "@/rdfs:label/" and
> the "_!@/rdfs:label/" fields.
> (2) defined as "rdfs:label" -> "Rupert Westenthaler"^^xsd:string
> (TypedLiteral) would end up as one token "Rupert Westenthaler" within the
> "str/rdfs:label/" field.
> With (1) the above query would select the document; with (2) it would not.
> This is because the query assumes that it searches natural language values
> that are indexed as tokens, but the "str/rdfs:label/" field does not fulfill
> this requirement.
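The difference between cases (1) and (2) can be reproduced with a small
sketch in plain Java. It is not Solr code: the class and method names are
hypothetical, and the "analysis" is reduced to whitespace splitting plus
lower-casing (no stop words, word delimiters or stemming), which is enough to
show why a single-term query finds the tokenized value but not the string
value.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Simplified stand-in for the two indexing strategies; not actual Solr code.
class IndexingSketch {

    // Roughly what a whitespace tokenizer followed by a lower-case filter emits.
    static List<String> textTokens(String value) {
        List<String> tokens = new ArrayList<>();
        for (String part : value.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // A string field keeps the whole value as one unmodified token.
    static List<String> stringTokens(String value) {
        return Collections.singletonList(value);
    }

    // The query term is analyzed the same way as the text field.
    static boolean matchesText(String indexedValue, String term) {
        return textTokens(indexedValue).contains(term.toLowerCase(Locale.ROOT));
    }

    // The query term must equal the complete stored value.
    static boolean matchesString(String indexedValue, String term) {
        return stringTokens(indexedValue).contains(term);
    }

    public static void main(String[] args) {
        String label = "Rupert Westenthaler";
        System.out.println("case (1), text field:   " + matchesText(label, "Rupert"));
        System.out.println("case (2), string field: " + matchesString(label, "Rupert"));
    }
}
```

In this sketch the string field only matches when the query is exactly
"Rupert Westenthaler", which is the behavior one wants for ISBN-like values
but not for labels.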
> Solution:
> The solution is to change the indexing to index string values also within the
> "_!@"-field. This means that searches within that field assume that all
> string values do actually represent natural language texts. Searches for
> string values need to use the "str"-field. This ensures that string value
> searches (e.g. for an ISBN number) will still work as intended while searches
> for natural language texts also have access to string values.
> As a positive side effect, natural language searches will no longer need to
> search in two different fields (meaning the OR clause shown in the example
> above is no longer needed).
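To make the effect on the query concrete, here is a minimal sketch that
builds the old and the new query string for a natural language search. The
field-name layout (prefix, "/", property, "/") and the escaping of "!" and ":"
are taken from the example query quoted in this issue; the class, method and
constant names are hypothetical and are not the SolrYard API. The search term
itself is not escaped here, so this only covers simple single-word terms.

```java
// Hypothetical helper; illustrates the query shapes only, not SolrYard code.
class QuerySketch {

    static final String MERGED_TEXT_PREFIX = "_!@"; // all natural language values
    static final String STRING_PREFIX = "str";      // unprocessed string values

    // Escapes only the two characters escaped in the example query.
    static String escape(String fieldName) {
        return fieldName.replace("!", "\\!").replace(":", "\\:");
    }

    // Field name layout as seen in the example: <prefix>/<property>/
    static String fieldName(String prefix, String property) {
        return prefix + "/" + property + "/";
    }

    static String clause(String prefix, String property, String term) {
        return escape(fieldName(prefix, property)) + ":" + term;
    }

    // Old behavior: search the merged text field OR the string field.
    static String oldTextQuery(String property, String term) {
        return "(" + clause(MERGED_TEXT_PREFIX, property, term)
                + " OR " + clause(STRING_PREFIX, property, term) + ")";
    }

    // New behavior: string values are also indexed in the merged text field,
    // so a single clause is enough.
    static String newTextQuery(String property, String term) {
        return clause(MERGED_TEXT_PREFIX, property, term);
    }

    public static void main(String[] args) {
        System.out.println(oldTextQuery("rdfs:label", "Rupert"));
        System.out.println(newTextQuery("rdfs:label", "Rupert"));
    }
}
```

Exact-value searches (e.g. for an ISBN number) would still build a single
clause with the "str" prefix.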
> Additional Note:
> It would also be possible to index natural language text values without a
> defined language within the string field. This would remove the assumption
> that each natural language text value does in fact represent natural text and
> not a string. However, until someone can point to real world cases where
> datasets wrongly use PlainLiterals instead of TypedLiterals with the data
> type xsd:string, there is no practical advantage to that.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
