SolrYard uses string field for natural text queries
---------------------------------------------------

                 Key: STANBOL-89
                 URL: https://issues.apache.org/jira/browse/STANBOL-89
             Project: Stanbol
          Issue Type: Bug
          Components: Entity Hub
            Reporter: Rupert Westenthaler
            Assignee: Rupert Westenthaler
            Priority: Minor


This issue describes a change to the way the SolrYard indexes values with the 
data type xsd:string, in order to improve support for natural language text 
searches over such values. The change removes a wrong assumption present in 
the current implementation. Details below.

Background:

The Entityhub distinguishes "natural language text" from normal values such as 
integers, floats, dates and string values. This is mainly because one might 
want to process natural language differently from normal string values. For 
example, when processing natural language text one might want to use things 
like white space separators, stop word filters and/or stemming, but for ISBN 
numbers, article numbers or postal codes such algorithms will lead to unwanted 
effects.
This distinction is not specific to the Entityhub; it is also present within 
RDF. RDF defines "PlainLiterals" (with an optional xml:lang attribute) used to 
represent natural language text and "TypedLiterals" (with a data type, e.g. an 
xsd data type) to represent other values (including xsd:string). This is also 
reflected in the RDF APIs, including Clerezza's RDF model.

Solr also provides a lot of functionality to improve the indexing and searching 
of natural language texts. Therefore the correct declaration of natural 
language texts and string values is important for getting the expected search 
results.
For natural language texts the Solr schema.xml used by the SolrYard defines a 
fieldType that uses the WhitespaceTokenizer, StopFilterFactory, 
WordDelimiterFilter and LowerCaseFilter. For English texts the 
SnowballPorterFilter (stemming) is used as well.
In contrast, string fields do not use any tokenizer.
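
To make the difference between the two field types concrete, here is a rough 
sketch in Python. It is not Solr code and does not reproduce the actual 
schema.xml: the function names and the tiny stop word list are made up for 
illustration, and the WordDelimiterFilter and stemming steps are left out.

```python
# Simplified stand-in for the two analysis styles described above.
# Assumptions: hypothetical stop word list, no word delimiter handling,
# no stemming. Real Solr analysis chains do more than this.

STOP_WORDS = {"a", "an", "the", "of", "and"}  # hypothetical, tiny list


def analyze_text(value):
    """Natural language text: split on white space, drop stop words,
    lowercase the remaining tokens."""
    tokens = value.split()
    return [t.lower() for t in tokens if t.lower() not in STOP_WORDS]


def analyze_string(value):
    """String field: no tokenizer, the whole value is one token."""
    return [value]
```

With this sketch a label is broken into searchable lowercase words, while an 
ISBN or postal code indexed as a string stays exactly as it was written.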


The Problem:

A lot of developers of applications that produce RDF data do not use the RDF 
APIs correctly. It is often the case that TypedLiterals with the data type 
xsd:string are used to create literals representing natural language texts. 
This is often because RDF APIs typically provide some kind of LiteralFactory 
to create RDF Literals for Java Objects. So passing a Java String instance 
representing a natural language text will create a TypedLiteral with the data 
type xsd:string. Even the Stanbol Enhancer is no exception to that, because it 
also creates TypedLiterals holding natural language texts! Developers usually 
only use PlainLiterals if there is a requirement to specify the language.
The conclusion is that components MUST NOT assume that string values do not 
represent natural language texts. However, they also cannot assume that all 
string values are in fact natural language texts.
The best solution is to let users define how to interpret the values when they 
interact with the data (at query time).


Old Implementation:

Prior to this change the SolrYard indexed "natural language text" and "string" 
values differently.
String values for a field were stored with the prefix "str" without any 
processing.
Natural language texts were stored with the prefix "@{lang}" (e.g. "@en" for 
English texts, "@" for texts without a language) and processed by the 
tokenizer and filters described above. In addition, texts were also stored 
within a field with the prefix "_!@" that combined all natural text values of 
all languages.
To include string values in the results of natural language text searches, 
queries for natural language texts were created so that they also searched 
within the "str" field. 
Here is an example of a query for "Rupert" within the field "rdfs:label":
   "(_\!@/rdfs\:label/:Rupert OR str/rdfs\:label/:Rupert)"
However, this had one important shortcoming: the second term of the query 
searched within a field that is not suited for natural language text searches. 
To describe that in more detail, let's assume the value "Rupert Westenthaler" 
is defined in the following two ways:
(1) defined as "rdfs:label" -> "Rupert Westenthaler" (PlainLiteral) would end 
up as two tokens "rupert" and "westenthaler" within the "@/rdfs:label/" and the 
"_!@/rdfs:label/" fields.
(2) defined as "rdfs:label" -> "Rupert Westenthaler"^^xsd:string (TypedLiteral) 
would end up as one token "Rupert Westenthaler" within the "str/rdfs:label/" 
field.

In case (1) the above query would select the document; in case (2) it would 
not. This is because the query assumes that it searches values indexed as 
natural language text, but the "str/rdfs:label/" field does not fulfill this 
requirement.
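
The two cases can be replayed with a small Python simulation. This is only a 
sketch of the behaviour described above, not SolrYard code: the helper names 
(index_old, matches_old) are hypothetical, the text analysis is reduced to 
white space splitting plus lowercasing, and a document is modelled as a plain 
dict from Solr field name to tokens.

```python
def analyze_text(value):
    # Reduced text analysis: split on white space and lowercase.
    return [t.lower() for t in value.split()]


def index_old(field, value, kind, lang=""):
    """Old scheme: texts go to "@{lang}/field/" and "_!@/field/",
    strings go unprocessed to "str/field/" only."""
    if kind == "text":
        tokens = analyze_text(value)
        return {"@%s/%s/" % (lang, field): tokens,
                "_!@/%s/" % field: tokens}
    return {"str/%s/" % field: [value]}  # one token, no processing


def matches_old(doc, field, term):
    """Mimics (_!@/field/:term OR str/field/:term)."""
    text_tokens = doc.get("_!@/%s/" % field, [])
    text_hit = any(t in text_tokens for t in analyze_text(term))
    str_hit = term in doc.get("str/%s/" % field, [])
    return text_hit or str_hit


case1 = index_old("rdfs:label", "Rupert Westenthaler", "text")    # PlainLiteral
case2 = index_old("rdfs:label", "Rupert Westenthaler", "string")  # xsd:string
found1 = matches_old(case1, "rdfs:label", "Rupert")
found2 = matches_old(case2, "rdfs:label", "Rupert")
```

Here found1 is true and found2 is false: the string field holds the single 
token "Rupert Westenthaler", so a search for "Rupert" alone cannot match it.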


Solution:

The solution is to change the indexing so that string values are also indexed 
within the "_!@"-field. This means that searches within that field assume that 
all string values do actually represent natural language texts. Searches for 
string values need to use the "str"-field. This ensures that string value 
searches (e.g. for an ISBN number) will still work as intended, while searches 
for natural language texts also have access to string values.
As a positive side effect, natural language searches will no longer need to 
search in two different fields (meaning the OR clause shown in the example 
above is no longer needed).
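
A minimal Python sketch of the changed indexing, under the same simplifying 
assumptions as before (hypothetical helper names, text analysis reduced to 
white space splitting plus lowercasing, documents modelled as dicts from Solr 
field name to tokens). It is meant to illustrate the idea, not the actual 
SolrYard implementation.

```python
def analyze_text(value):
    # Reduced text analysis: split on white space and lowercase.
    return [t.lower() for t in value.split()]


def index_new(field, value, kind, lang=""):
    """New scheme: string values are kept unprocessed in "str/field/"
    and are additionally indexed as text in the merged "_!@/field/"."""
    merged = "_!@/%s/" % field
    if kind == "text":
        tokens = analyze_text(value)
        return {"@%s/%s/" % (lang, field): tokens, merged: tokens}
    return {"str/%s/" % field: [value], merged: analyze_text(value)}


def matches_text(doc, field, term):
    """Natural language search: only the merged field, no OR clause."""
    tokens = doc.get("_!@/%s/" % field, [])
    return any(t in tokens for t in analyze_text(term))


def matches_string(doc, field, value):
    """String search: exact match against the "str" field."""
    return value in doc.get("str/%s/" % field, [])


label = index_new("rdfs:label", "Rupert Westenthaler", "string")
isbn = index_new("isbn", "978-3-16-148410-0", "string")  # made-up field name
```

In this sketch matches_text(label, "rdfs:label", "Rupert") now succeeds even 
though the label was given as an xsd:string, and 
matches_string(isbn, "isbn", "978-3-16-148410-0") still matches the exact 
value.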

Additional Note:
It would also be possible to index natural language text values without a 
defined language within the string field. This would remove the assumption 
that each natural language text value does in fact represent natural text and 
not a string. However, until someone can point to real world cases where 
datasets wrongly use PlainLiterals instead of TypedLiterals with the data type 
xsd:string, there is no practical advantage to that.



        
