[
https://issues.apache.org/jira/browse/STANBOL-89?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Fabian Christ reopened STANBOL-89:
----------------------------------
Should be set to 'Resolved' instead of 'Closed'
> SolrYard uses string field for natural text queries
> ---------------------------------------------------
>
> Key: STANBOL-89
> URL: https://issues.apache.org/jira/browse/STANBOL-89
> Project: Stanbol
> Issue Type: Bug
> Components: Entity Hub
> Reporter: Rupert Westenthaler
> Assignee: Rupert Westenthaler
> Priority: Minor
>
> This describes a change to the way the SolrYard indexes values with the
> data type xsd:string in order to improve support for natural language
> text searches on such values. This change removes a wrong assumption
> present in the current implementation. Details below.
> Background:
> The Entityhub distinguishes "natural language text" from normal values such
> as integers, floats, dates and string values. This is mainly because one might
> want to process natural language differently than normal string values. For
> example, when processing natural language text one might want to use things
> like white space separators, stop word filters and/or stemming, but for ISBN
> numbers, article numbers or postal codes such algorithms lead to unwanted
> effects.
> This distinction is not specific to the Entityhub, but is also present within
> RDF. RDF defines "PlainLiterals" (with an optional xml:lang attribute) used
> to represent natural language text and "TypedLiterals" (with a data type,
> typically an xsd one) to represent other values (including xsd:string). This
> is also reflected in the RDF APIs, incl. Clerezza's RDF model.
> Solr also provides a lot of functionality to improve the indexing and
> searching for natural language texts. Therefore the correct declaration of
> natural language texts and string values is of importance for getting the
> expected search results.
> For natural language texts the Solr schema.xml used by the SolrYard defines a
> fieldType that uses the WhitespaceTokenizer, StopFilterFactory,
> WordDelimiterFilter and LowerCaseFilter. For English texts also the
> SnowballPorterFilter (stemming) is used.
> In contrast to that, string fields do not use any tokenizer.
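To make the difference between the two field types concrete, here is a minimal
sketch in Python. It is not Solr code: analyze_text() and analyze_string() are
hypothetical helpers that only imitate whitespace tokenizing and lower-casing;
stop words, word delimiters and stemming are left out on purpose.

```python
def analyze_text(value):
    """Rough stand-in for the natural language field type:
    split on whitespace, then lower-case every token."""
    return [token.lower() for token in value.split()]


def analyze_string(value):
    """Stand-in for a string field: the value is kept as one unmodified token."""
    return [value]


if __name__ == "__main__":
    print(analyze_text("Rupert Westenthaler"))
    print(analyze_string("Rupert Westenthaler"))
```

With this sketch a label is split into separately searchable tokens, while an
identifier such as an ISBN stays intact when treated as a string.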
> The Problem:
> A lot of developers of applications that produce RDF data do not correctly
> use the RDF APIs. It is often the case that TypedLiterals with the data type
> xsd:string are used to create literals representing natural language texts.
> This is often because RDF APIs typically provide some kind of LiteralFactory
> to create RDF Literals for Java Objects. So passing a Java String instance
> representing a natural language text will create a TypedLiteral with the data
> type xsd:string. Even the Stanbol Enhancer is no exception to that because it
> also creates TypedLiterals holding natural language texts! Developers usually
> only use PlainLiterals if there is a requirement to specify the language.
> The conclusion is that components MUST NOT assume that string values do not
> represent natural language texts. However, they also cannot assume that all
> string values are in fact natural language texts.
> The best solution is to let users define how to interpret the values when
> they interact with the data (at query time).
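The situation described above can be sketched as follows. This is an
illustration in Python, not the Clerezza or Entityhub API: the classes
PlainLiteral and TypedLiteral and the functions create_typed_literal() and
may_be_text() are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


@dataclass(frozen=True)
class PlainLiteral:
    text: str
    lang: Optional[str] = None  # optional xml:lang


@dataclass(frozen=True)
class TypedLiteral:
    lexical_form: str
    datatype: str


def create_typed_literal(value):
    """Hypothetical literal factory: maps native objects to typed literals.
    A str always becomes xsd:string, whether or not it is natural language."""
    if isinstance(value, str):
        return TypedLiteral(value, XSD_STRING)
    if isinstance(value, int):
        return TypedLiteral(str(value), XSD_INTEGER)
    raise TypeError(f"unsupported type: {type(value).__name__}")


def may_be_text(literal):
    """A consumer cannot rule out natural language text for xsd:string values."""
    if isinstance(literal, PlainLiteral):
        return True
    return literal.datatype == XSD_STRING


if __name__ == "__main__":
    label = create_typed_literal("Rupert Westenthaler")
    print(label, may_be_text(label))
```

The label ends up as an xsd:string TypedLiteral although it is natural
language text, which is why may_be_text() has to accept it.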
> Old Implementation:
> Prior to this change the SolrYard indexed "natural language text" and
> "string" values differently.
> String values for a field were stored with the prefix "str" without any
> processing.
> Natural language texts were stored with the prefix "@{lang}" (e.g. "@en" for
> English texts, "@" for texts without a language) and processed by the
> tokenizer and filters described above. In addition, texts were also stored
> within a field with the prefix "_!@" that combined all natural language text
> values of all languages.
> To include string values in search results, queries for natural language
> texts were created to also search within the "str" field. Here is an example
> of a query for "Rupert" within the field "rdfs:label":
> "(_\!@/rdfs\:label/:Rupert OR str/rdfs\:label/:Rupert)"
> However, this had one important shortcoming: the second term of the query
> searched within a field that is not suited for natural language text
> searches. To describe that in more detail, let's assume the value "Rupert
> Westenthaler" is defined in the following two ways:
> (1) defined as "rdfs:label" -> "Rupert Westenthaler" (PlainLiteral) would end
> up as two tokens "rupert" and "westenthaler" within the "@/rdfs:label/" and
> the "_!@/rdfs:label/" fields.
> (2) defined as "rdfs:label" -> "Rupert Westenthaler"^^xsd:string
> (TypedLiteral) would end up as one token "Rupert Westenthaler" within the
> "str/rdfs:label/" field.
> In case (1) the above query would select the document; in case (2) it
> would not. This is because the query assumes that it searches values
> indexed as natural language text, but the "str/rdfs:label/" field does not
> fulfill this requirement.
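The shortcoming of the old scheme can be reproduced with a small simulation.
This is a sketch in Python, not SolrYard code: old_index() and
old_text_query_matches() are hypothetical helpers, the analyzer is reduced to
whitespace splitting plus lower-casing, and a document is modelled as a dict
from Solr field name to its indexed tokens.

```python
def analyze_text(value):
    """Simplified natural language analysis: whitespace split + lower-case."""
    return [token.lower() for token in value.split()]


def old_index(field, value, kind, lang=""):
    """Old scheme: texts go to '@{lang}' and '_!@', strings only to 'str'."""
    if kind == "text":
        tokens = analyze_text(value)
        return {f"@{lang}/{field}/": tokens, f"_!@/{field}/": tokens}
    # String values are stored unprocessed as a single token.
    return {f"str/{field}/": [value]}


def old_text_query_matches(doc, field, term):
    """Models the query (_!@/field/:term OR str/field/:term)."""
    in_text_field = term.lower() in doc.get(f"_!@/{field}/", [])
    in_str_field = term in doc.get(f"str/{field}/", [])
    return in_text_field or in_str_field


if __name__ == "__main__":
    plain = old_index("rdfs:label", "Rupert Westenthaler", "text")
    typed = old_index("rdfs:label", "Rupert Westenthaler", "string")
    print(old_text_query_matches(plain, "rdfs:label", "Rupert"))
    print(old_text_query_matches(typed, "rdfs:label", "Rupert"))
```

The PlainLiteral document matches because "rupert" is one of its tokens; the
xsd:string document does not, because its only token is the complete value.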
> Solution:
> The solution is to change the indexing so that string values are also indexed
> within the "_!@"-field. This means that searches within that field assume
> that all string values do actually represent natural language texts. Searches
> for string values need to use the "str"-field. This ensures that string value
> searches (e.g. for an ISBN number) still work as intended, while searches
> for natural language texts also have access to string values.
> As a positive side effect, natural language searches no longer need to
> search in two different fields (meaning the OR clause shown in the example
> above is no longer needed).
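A minimal sketch of the changed indexing, under the same simplifications as
before: this is Python, not SolrYard code, new_index(), text_query_matches()
and string_query_matches() are hypothetical helpers, and the analyzer only
splits on whitespace and lower-cases.

```python
def analyze_text(value):
    """Simplified natural language analysis: whitespace split + lower-case."""
    return [token.lower() for token in value.split()]


def new_index(field, value, kind, lang=""):
    """New scheme: string values are additionally analyzed into '_!@'."""
    tokens = analyze_text(value)
    if kind == "text":
        return {f"@{lang}/{field}/": tokens, f"_!@/{field}/": tokens}
    # Exact copy for string searches, analyzed copy for text searches.
    return {f"str/{field}/": [value], f"_!@/{field}/": tokens}


def text_query_matches(doc, field, term):
    """Natural language search: a single field, no OR clause needed."""
    return term.lower() in doc.get(f"_!@/{field}/", [])


def string_query_matches(doc, field, value):
    """String search: exact match against the unprocessed value."""
    return value in doc.get(f"str/{field}/", [])


if __name__ == "__main__":
    typed = new_index("rdfs:label", "Rupert Westenthaler", "string")
    print(text_query_matches(typed, "rdfs:label", "Rupert"))
    print(string_query_matches(typed, "rdfs:label", "Rupert Westenthaler"))
```

Here the xsd:string label is found by a text search for "Rupert", and an exact
string search on the "str" field keeps working for values such as ISBN numbers.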
> Additional Note:
> It would also be possible to index natural language text values without a
> defined language within the string field. This would remove the assumption
> that each natural language text value does in fact represent natural text and
> not a string. However, until someone can point to real world cases where
> datasets wrongly use PlainLiterals instead of TypedLiterals with the data
> type xsd:string, there is no practical advantage to that.