SolrYard uses string field for natural text queries
---------------------------------------------------
Key: STANBOL-89
URL: https://issues.apache.org/jira/browse/STANBOL-89
Project: Stanbol
Issue Type: Bug
Components: Entity Hub
Reporter: Rupert Westenthaler
Assignee: Rupert Westenthaler
Priority: Minor
This describes a change to the way the SolrYard indexes values with the data
type xsd:string in order to improve support for natural language text
searches over such values. This change removes a wrong assumption present in
the current implementation. Details below!
Background:
The Entityhub distinguishes "natural language text" from normal values such as
integers, floats, dates and string values. This is mainly because one might want
to process natural language differently than normal string values. E.g. when
processing natural language text one might want to use things like white space
separators, stop word filters and/or stemming, but for ISBN numbers, article
numbers or postal codes such algorithms will lead to unwanted effects.
This distinction is not specific to the Entityhub, but is also present within
RDF. RDF defines "PlainLiterals" (with an optional xml:lang attribute) used to
represent natural language text and "TypedLiterals" (with an xsd data type) to
represent other values (including xsd:string). This is also reflected in the
RDF APIs, incl. Clerezza's RDF model.
Solr also provides a lot of functionality to improve the indexing and searching
for natural language texts. Therefore the correct declaration of natural
language texts and string values is of importance for getting the expected
search results.
For natural language texts the Solr schema.xml used by the SolrYard defines a
fieldType that uses the WhitespaceTokenizer, StopFilterFactory,
WordDelimiterFilter and LowerCaseFilter. For English texts also the
SnowballPorterFilter (stemming) is used.
In contrast to that, string fields do not use any Tokenizer.
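To illustrate the difference, here is a minimal sketch in Python. It is not the
Solr analyzer chain itself: it only imitates whitespace tokenization, lower
casing and stop word removal, and leaves out the WordDelimiterFilter and
stemming. The function names and the stop word list are made up for this
example.

```python
# Simplified stand-in for the two kinds of fields (NOT the real Solr analyzers).
STOP_WORDS = frozenset({"the", "a", "of"})  # hypothetical stop word list


def analyze_text(value):
    """Imitate the text fieldType: whitespace split, lower case, stop words."""
    tokens = [token.lower() for token in value.split()]
    return [token for token in tokens if token not in STOP_WORDS]


def analyze_string(value):
    """Imitate a string field: no Tokenizer, the whole value is one term."""
    return [value]


text_terms = analyze_text("Rupert Westenthaler")
string_terms = analyze_string("Rupert Westenthaler")
isbn_terms = analyze_string("978-3-16-148410-0")
```

The text field ends up with two lower case terms, while the string field keeps
the value (and e.g. an ISBN number) as one unmodified term.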
The Problem:
A lot of developers of applications that produce RDF data do not use the RDF
APIs correctly. It is often the case that TypedLiterals with the data type
xsd:string are used to create literals representing natural language texts.
This is often because RDF APIs typically provide some kind of LiteralFactory to
create RDF Literals for Java Objects. So passing a Java String instance
representing a natural language text will create a TypedLiteral with the data
type xsd:string. Even the Stanbol Enhancer is no exception to that, because it
also creates TypedLiterals holding natural language texts! Developers usually
only use PlainLiterals if there is a requirement to specify the language.
The conclusion is that components MUST NOT assume that string values do not
represent natural language texts. However, they also cannot assume that all
string values are in fact natural language texts.
The best solution to that is to let users define how to interpret the values
when they interact with the data (at query time).
Old Implementation:
Prior to this change the SolrYard indexed "natural language text"s and
"string" values differently.
String values for a field were stored with the prefix "str" without any
processing.
Natural language texts were stored with the prefix "@{lang}" (e.g. "@en" for
English texts, "@" for texts without a language) and processed by the
tokenizer and filters described above. In addition, texts were also stored
within a field with the prefix "_!@" that combined all natural text values of
all languages.
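The naming scheme can be sketched as follows (Python). The helper names are
invented for this example and are not SolrYard API; the "prefix/field/" layout
is taken from the field names used in the example further down, and
"@en/rdfs:label/" is an assumption derived from the "@en" prefix.

```python
# Hypothetical helpers that build the Solr field names described above.

def string_field(field):
    """Field holding unprocessed string values."""
    return "str/{}/".format(field)


def text_field(field, lang=None):
    """Field holding texts of one language ('@' if no language is given)."""
    return "@{}/{}/".format(lang or "", field)


def merged_text_field(field):
    """Field combining the natural text values of all languages."""
    return "_!@/{}/".format(field)


names = [
    string_field("rdfs:label"),
    text_field("rdfs:label", "en"),
    text_field("rdfs:label"),
    merged_text_field("rdfs:label"),
]
```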
To include string values in the results of natural language text searches,
queries for natural language texts were created so that they also search
within the "str" field.
Here is an example of a query for "Rupert" within the field "rdfs:label":
"(_\!@/rdfs\:label/:Rupert OR str/rdfs\:label/:Rupert)"
However, this had one important shortcoming. The second term of the query
searched within a field that is not suited for natural language text searches.
To describe that in more detail, let's assume the value "Rupert Westenthaler"
is defined in the following two ways:
(1) defined as "rdfs:label" -> "Rupert Westenthaler" (PlainLiteral) would end
up as two tokens "rupert" and "westenthaler" within the "@/rdfs:label/" and the
"_!@/rdfs:label/" fields.
(2) defined as "rdfs:label" -> "Rupert Westenthaler"^^xsd:string (TypedLiteral)
would end up as one token "Rupert Westenthaler" within the "str/rdfs:label/"
field.
In case (1) the above query would select the document; in case (2) it would
not. This is because the query assumes that it searches values that are
indexed as natural language text, but the "str/rdfs:label/" field does not
fulfill this requirement.
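The following Python sketch replays this example against a toy in-memory
"index". It is only a simulation under simplifying assumptions (whitespace
split plus lower casing instead of the real analyzer chain, exact term
matching instead of Solr); the function names are made up.

```python
# Toy simulation of the OLD indexing scheme for a single rdfs:label value.

def tokenize(value):
    """Very rough stand-in for the text analysis (split + lower case)."""
    return [token.lower() for token in value.split()]


def index_old(value, is_plain_literal):
    """Return {solr_field: terms} as the old implementation would store it."""
    if is_plain_literal:
        terms = tokenize(value)
        return {"@/rdfs:label/": terms, "_!@/rdfs:label/": terms}
    # TypedLiteral with xsd:string: one unprocessed term in the str field
    return {"str/rdfs:label/": [value]}


def matches_old_query(doc, word):
    """Imitate (_!@/rdfs:label/:word OR str/rdfs:label/:word)."""
    in_text = word.lower() in doc.get("_!@/rdfs:label/", [])
    in_string = word in doc.get("str/rdfs:label/", [])
    return in_text or in_string


plain_doc = index_old("Rupert Westenthaler", is_plain_literal=True)   # case (1)
typed_doc = index_old("Rupert Westenthaler", is_plain_literal=False)  # case (2)

found_plain = matches_old_query(plain_doc, "Rupert")
found_typed = matches_old_query(typed_doc, "Rupert")
```

In this simulation the typed document would only be found when querying for the
complete value "Rupert Westenthaler".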
Solution:
The solution is to change the indexing so that string values are also indexed
within the "_!@"-field. This means that searches within that field assume that
all string values actually represent natural language texts. Searches for
string values need to use the "str"-field. This ensures that string value
searches (e.g. for an ISBN number) will still work as intended, while searches
for natural language texts also have access to string values.
As a positive side effect, natural language searches will no longer need to
search in two different fields (meaning that the OR clause shown in the example
above is no longer needed).
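A toy Python simulation of the changed indexing might look like this. As
before, it is a sketch under simplifying assumptions (whitespace split plus
lower casing instead of the real analyzer chain) and the function names are
hypothetical, not SolrYard code.

```python
# Toy simulation of the NEW indexing scheme for a single rdfs:label value.

def tokenize(value):
    """Very rough stand-in for the text analysis (split + lower case)."""
    return [token.lower() for token in value.split()]


def index_new(value, is_plain_literal):
    """Return {solr_field: terms}; string values now also feed the _!@ field."""
    terms = tokenize(value)
    if is_plain_literal:
        return {"@/rdfs:label/": terms, "_!@/rdfs:label/": terms}
    return {"str/rdfs:label/": [value], "_!@/rdfs:label/": terms}


def text_search(doc, word):
    """Natural language search: only the merged _!@ field, no OR clause."""
    return word.lower() in doc.get("_!@/rdfs:label/", [])


def string_search(doc, value):
    """String search (e.g. for an ISBN number): exact match in the str field."""
    return value in doc.get("str/rdfs:label/", [])


typed_doc = index_new("Rupert Westenthaler", is_plain_literal=False)
plain_doc = index_new("Rupert Westenthaler", is_plain_literal=True)

text_hit_typed = text_search(typed_doc, "Rupert")
text_hit_plain = text_search(plain_doc, "Rupert")
string_hit = string_search(typed_doc, "Rupert Westenthaler")
```

Both ways of defining the label are now found by the natural language search,
and the exact string search on the "str"-field keeps working for the typed
value.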
Additional Note:
It would be also possible to index natural language text values without defined
language within the string field. This would remove the assumption that each
natural language text value does in fact represent natural text and not a
string. However, until someone can point to real world cases where datasets
wrongly use PlainLiterals instead of TypedLiterals with the data type
xsd:string, there is no practical advantage to that.