Hi Stanbol Community,

Sebastian, Jakob Frank and myself had a meeting about the next steps in the 
integration of the LMF and Apache Stanbol. This mail provides an overview of 
the discussed points and my opinion on how the integration with Stanbol could 
look.

We decided to start the integration with the following two assets: (1) the 
RdfPathLanguage [1] and (2) the LMF semantic search component [2]. (*)

In the following I will give a short overview of the necessary steps for 
integrating these two assets with Stanbol:

(1) Make the RdfPathLanguage [1] generic so that we can use it within Stanbol. 
The decision was to define a dedicated Java API for this specification and 
manage it as a separate library (outside of Stanbol). This is mainly because 
we think this specification is of general interest as a query/indexing 
language for Linked Data, and therefore it makes sense to keep the Java API 
definition independent of any specific project.

* Sebastian will take the lead in defining the API. Jakob Frank has also done 
a lot of work on this, so he will actively contribute to it as well.
* The current implementation - based on the KiWi triple store - will be 
adapted to the updated API.
* I will work on implementations for Clerezza and the Entityhub. 
* It was also discussed to provide an implementation based on SPARQL. This 
could be used to retrieve data referenced in paths from SPARQL endpoints.
* In addition I think it would also make sense to provide an implementation 
based on the CMS Adapter to allow retrieving data from a CMS. Suat, it would 
be great if you could check whether this would make sense.
* As a first usage of the RdfPathLanguage within Stanbol I will replace the 
current FieldMapping infrastructure used by the Entityhub with the 
RdfPathLanguage. This will also serve as the test case for the Clerezza and 
Entityhub based implementations.
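To make the intended decoupling a bit more concrete, here is a minimal sketch 
of how such a backend-independent path API could look. All names (MapBackend, 
PathEvaluator, listObjects, the "p1/p2" path syntax) are my own illustrations 
and not the actual API - that is exactly what Sebastian and Jakob will define:

```java
import java.util.*;

/** Minimal in-memory triple store standing in for a real backend
 *  (KiWi, Clerezza, Entityhub, SPARQL endpoint ...). */
class MapBackend {
    private final Map<String, Map<String, List<String>>> triples = new HashMap<>();

    void add(String s, String p, String o) {
        triples.computeIfAbsent(s, k -> new HashMap<>())
               .computeIfAbsent(p, k -> new ArrayList<>())
               .add(o);
    }

    /** All values reachable from the subject via the given property. */
    List<String> listObjects(String s, String p) {
        return triples.getOrDefault(s, Collections.emptyMap())
                      .getOrDefault(p, Collections.emptyList());
    }
}

/** Evaluates a simple "p1/p2/..." path expression against any backend. */
class PathEvaluator {
    static List<String> evaluate(String path, String context, MapBackend backend) {
        List<String> nodes = Collections.singletonList(context);
        for (String step : path.split("/")) {
            List<String> next = new ArrayList<>();
            for (String node : nodes) {
                next.addAll(backend.listObjects(node, step.trim()));
            }
            nodes = next;
        }
        return nodes;
    }
}
```

The point of the sketch is that the evaluator only talks to the backend 
through the data-access method, so Clerezza, Entityhub, SPARQL or CMS based 
implementations would only need to provide that layer.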


(2) Integrate the LMF semantic search component [2] with Stanbol: The LMF 
semantic search is based on semantic index configurations that use the 
RdfPathLanguage [1]. From such a configuration a Solr schema (schema.xml) is 
generated that is then used to store the data of the semantic index. The 
actual search interface is the normal Solr RESTful API - just the data stored 
within the index is smarter. This has two very big advantages: for users, a 
lot of programmers already know how to use Solr; and for us, we do not need 
to reinvent the rich query interface of Solr (think of features such as 
facets, /mlt, ranking functions …). (**)
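To illustrate the idea, a semantic index configuration could look roughly 
like the following. The syntax here is only a sketch - the actual 
configuration format is defined by the LMF code in [2]:

```
# Each Solr field is defined by an RdfPath expression plus a target type.
# From this, a schema.xml with the fields "title", "author" and "place"
# would be generated.
title  = dc:title :: xsd:string ;
author = dc:creator / foaf:name :: xsd:string ;
place  = fise:entity-reference / rdfs:label :: xsd:string ;
```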

The usage of the LMF semantic search component within Stanbol requires the 
following three features (which I have already started to work on):

* Enable the RESTful API for SolrCores managed by Stanbol (STANBOL-353): This 
is needed to allow using the Solr RESTful API to query the semantic index. 
(nearly finished)
* Improve the component that manages internal SolrIndexes to better support 
updating and removing SolrIndexes. This is needed to allow the LMF to update 
schema.xml files and to delete and re-index indexes after incompatible 
changes. However, this will then also allow "replacing" a small default index 
for DBpedia with a bigger version or a version with additional fields, 
languages … (I plan to start work on that next week)
* Implementation of the RdfPathLanguage based on Clerezza (to index 
EnhancementResults) and the Entityhub (to include additional information for 
suggested entities). Optional: an implementation based on the CMS Adapter to 
also include additional data from a connected CMS. This is required to allow 
updates to the semantic index when running within Stanbol.

The usage of the LMF semantic search component within Stanbol will also 
require integrating it with the "/contenthub". With the following points I 
try to highlight the main topics that need further investigation/discussion:

* The LMF semantic search component overlaps greatly with the 
"contenthub/search/engines/solr" component recently contributed by Anil. 
Related to this, it would be great if Anil could have a look at [2] and check 
for similarities/differences and possible integration paths.

* The Semantic Search Interface: The Contenthub currently defines its own 
query API (supporting keyword based search as well as "field -> value" like 
constraints, and facets). The LMF directly exposes the RESTful API of the 
semantic Solr index. I strongly prefer the approach of the LMF because of the 
two points already described above. But I am also of the opinion that a 
semantic search interface should provide at least the following three 
additional features:
    1. Query preprocessing: e.g. substitute "Paris" in the query with 
"http://dbpedia.org/resource/Paris";
    2. Entity Facets: if a keyword matches an entity (e.g. "Paris" -> 
"dbpedia:Paris", "dbpedia:Paris_Texas", "dbpedia:Paris_Hilton") then provide 
a facet to the user over such possible matches;
    3. Semantic Facets: if a user uses an instance of an ontology type (e.g. 
a Place, Person, Organization) in a query, then provide facets over semantic 
relations for such types (e.g. friends for Persons, products/services for 
Organizations, nearby points of interest for Places, participants for Events, 
…). To implement features like that we need components that provide query 
preprocessing capabilities based on data available in the Entityhub, Ontonet 
… To me it seems that the contenthub/search/engines/ontologyresource 
component already provides some functionality related to this, so it might be 
a good starting point.
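A query preprocessing step as described in point 1 could be sketched as 
follows. The static label-to-URI map here is only a stand-in for a real 
lookup against the Entityhub:

```java
import java.util.*;

/** Sketch of query preprocessing: known entity labels in a keyword query
 *  are replaced by their URIs before the query is sent to the semantic
 *  Solr index. A real implementation would look the labels up in the
 *  Entityhub instead of using a static map. */
class QueryPreprocessor {
    private final Map<String, String> labelToUri;

    QueryPreprocessor(Map<String, String> labelToUri) {
        this.labelToUri = labelToUri;
    }

    String preprocess(String query) {
        for (Map.Entry<String, String> e : labelToUri.entrySet()) {
            // quote the URI so Solr treats it as a single term
            query = query.replace(e.getKey(), '"' + e.getValue() + '"');
        }
        return query;
    }
}
```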

* What is the relation to the FactStore? As far as I understand it, the 
FactStore provides the possibility to first define what facts look like and 
then allows efficient CRUD operations over such facts. Related to this one 
can ask two questions: 
    1. Could one use the RdfPathLanguage to define fact schemas and/or to 
create fact instances? 
    2. Could it be possible to integrate the LMF semantic search component as 
a FactStore implementation into Stanbol?

Next Steps:

Work in the coming weeks needs to focus on making the IKS developer meeting 
at the end of November in Salzburg [3] as productive as possible. Let me also 
note that all interested people of the Stanbol community are very welcome to 
join.

To prepare this meeting it would be important to already start the discussion 
about the main topics - especially the /contenthub - on this mailing list 
before the meeting. Implementation-wise we (the LMF developer and I) will try 
to realize a first integration before the start of the meeting. This will not 
try to integrate the LMF semantic search component with the Contenthub, but 
rather concentrate on the RdfPathLanguage and the necessary improvements to 
the management of Solr indexes within Stanbol. This should allow us to run 
LMF demos like [4] within the Stanbol environment - but without the 
possibility to update the semantic index or to index additional documents, 
because that would require a real integration with the Contenthub.

These two things should then allow us to use the face-to-face meeting in 
Salzburg to really work on the integration and hopefully to show, at the end 
of the meeting, a demo that allows to

1. send a document to the Contenthub (optionally by using the CMS Adapter),
2. enhance it with the Enhancer,
3. store the enhanced document within the semantic index, and
4. use the faceted search interface to navigate through the documents.

best
Rupert Westenthaler

[1] http://code.google.com/p/kiwi/wiki/RdfPathLanguage
[2] 
http://code.google.com/p/kiwi/source/browse/#hg%2Flmf-search%2Fsrc%2Fmain%2Fjava%2Fat%2Fnewmedialab%2Flmf%2Fsearch
[3] http://wiki.iks-project.eu/index.php/IntegrationHackathonSalzburg
[4] http://labs.newmedialab.at/LMF/sn/search/suche.html

(*) Note that we excluded the topic of how to integrate the LMF rule engine 
here. Not because it is not important, but because I do not have the 
necessary knowledge about the rule implementation of Stanbol to discuss this.
(**) As a side note: the use of Solr by the semantic search component is very 
different from that of the Entityhub SolrYard. The former creates a dedicated 
schema based on a configuration - it can only index specific data, but 
produces a nice schema that is easy for users to understand and use - very 
important for the semantic search use case. The latter can store any kind of 
RDF data with a single schema - very important when managing very diverse 
entities - but the generic schema needs to use complex prefixes and suffixes 
in combination with dynamic fields, which makes it very hard for users to 
write Solr queries.
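To make that difference concrete, compare what a user-facing query might look 
like against both kinds of schemas. The field names below are invented for 
illustration and do not show the actual SolrYard field encoding:

```
# dedicated schema generated from a semantic index configuration:
q=birthPlace:"Paris"&facet=true&facet.field=birthPlace

# generic schema with prefix/suffix-encoded dynamic fields
# (invented encoding, for illustration only):
q=str_dbp-ont.birthPlace_en:"Paris"
```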
