Hi Stanbol Community,

Sebastian, Jakob Frank and myself had a meeting about the next steps in the 
integration of the LMF and Apache Stanbol. This mail provides an overview of 
the discussed points and my opinion on how the integration with Stanbol could 
look.

We decided to start the integration with the following two assets: (1) the 
RdfPathLanguage [1] and (2) the LMF semantic search component [2]. (*)

In the following I will give a short overview of the necessary steps for 
integrating these two assets with Stanbol:

(1) Make the RdfPathLanguage [1] generic so that we can use it within Stanbol. 
The decision was to define a dedicated Java API for this specification and 
manage it as a separate library (outside of Stanbol). This is mainly because 
we think this specification is of general interest as a query/indexing 
language for Linked Data, and therefore it makes sense to keep the Java API 
definition independent of any specific project.

* Sebastian will take the lead in defining the API. Jakob Frank has also done 
a lot of work on this, so he will actively contribute to it as well.
* The current implementation - based on the KiWi triple store - will be 
adapted to the updated API.
* I will work on implementations for Clerezza and the Entityhub. 
* It was also discussed to provide an implementation based on SPARQL. This 
could be used to retrieve data referenced in paths from SPARQL endpoints.
* In addition I think it would also make sense to provide an implementation 
based on the CMS Adapter to allow retrieving data from a CMS. Suat, it would 
be great if you could check whether this would make sense.
* As a first usage of the RdfPathLanguage within Stanbol I will replace the 
current FieldMapping infrastructure used by the Entityhub with the 
RdfPathLanguage. This will also serve as the test case for the Clerezza and 
Entityhub based implementations.
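To make the intended decoupling a bit more concrete, here is a minimal sketch 
of how such a backend-independent path API could look. All names (MapBackend, 
PathEvaluator, listObjects, the "p1/p2" path syntax) are my own illustrations 
and not the actual API - that is exactly what Sebastian and Jakob will define:

```java
import java.util.*;

/** Minimal in-memory triple store standing in for a real backend
 *  (KiWi, Clerezza, Entityhub, SPARQL endpoint ...). */
class MapBackend {
    private final Map<String, Map<String, List<String>>> triples = new HashMap<>();

    void add(String s, String p, String o) {
        triples.computeIfAbsent(s, k -> new HashMap<>())
               .computeIfAbsent(p, k -> new ArrayList<>())
               .add(o);
    }

    /** All values reachable from the subject via the given property. */
    List<String> listObjects(String s, String p) {
        return triples.getOrDefault(s, Collections.emptyMap())
                      .getOrDefault(p, Collections.emptyList());
    }
}

/** Evaluates a simple "p1/p2/..." path expression against any backend. */
class PathEvaluator {
    static List<String> evaluate(String path, String context, MapBackend backend) {
        List<String> nodes = Collections.singletonList(context);
        for (String step : path.split("/")) {
            List<String> next = new ArrayList<>();
            for (String node : nodes) {
                next.addAll(backend.listObjects(node, step.trim()));
            }
            nodes = next;
        }
        return nodes;
    }
}
```

The point of the sketch is that the evaluator only talks to the backend 
through the data-access method, so Clerezza, Entityhub, SPARQL or CMS based 
implementations would only need to provide that layer.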


(2) Integrate the LMF semantic search component [2] with Stanbol: The LMF 
semantic search is based on semantic index configurations that use the 
RdfPathLanguage [1]. From such a configuration a Solr schema (schema.xml) is 
generated that is then used to store the data of the semantic index. The 
actual search interface is the normal Solr RESTful API - just the data stored 
within the index is smarter. This has two very big advantages: for users, a 
lot of programmers already know how to use Solr; and for us, we do not need 
to reinvent the rich query interface of Solr (think of features such as 
facets, /mlt, ranking functions …). (**)
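To illustrate the idea, a semantic index configuration could look roughly 
like the following. The syntax here is only a sketch - the actual 
configuration format is defined by the LMF code in [2]:

```
# Each Solr field is defined by an RdfPath expression plus a target type.
# From this, a schema.xml with the fields "title", "author" and "place"
# would be generated.
title  = dc:title :: xsd:string ;
author = dc:creator / foaf:name :: xsd:string ;
place  = fise:entity-reference / rdfs:label :: xsd:string ;
```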

The usage of the LMF semantic search component within Stanbol requires the 
following three features (which I have already started to work on):

* Enable the RESTful API for SolrCores managed by Stanbol (STANBOL-353): This 
is needed to allow using the Solr RESTful API to query the semantic index. 
(nearly finished)
* Improve the component that manages internal SolrIndexes to better support 
updating and removing SolrIndexes. This is needed to allow the LMF to update 
schema.xml files and to delete and re-index indexes after incompatible 
changes. However, this will then also allow "replacing" a small default index 
for DBpedia with a bigger version or a version with additional fields, 
languages … (I plan to start work on that next week)
* Implementation of the RdfPathLanguage based on Clerezza (to index 
EnhancementResults) and the Entityhub (to include additional information for 
suggested entities). Optional: an implementation based on the CMS Adapter to 
also include additional data from a connected CMS. This is required to allow 
updates to the semantic index when running within Stanbol.

The usage of the LMF semantic search component within Stanbol will also 
require integrating it with the "/contenthub". With the following points I 
try to highlight the main topics that need further investigation/discussion:

* The LMF semantic search component overlaps greatly with the 
"contenthub/search/engines/solr" component recently contributed by Anil. 
Related to this, it would be great if Anil could have a look at [2] and check 
for similarities/differences and possible integration paths.

* The Semantic Search Interface: The Contenthub currently defines its own 
query API (supporting keyword based search as well as "field -> value" like 
constraints, and facets). The LMF directly exposes the RESTful API of the 
semantic Solr index. I strongly prefer the approach of the LMF because of the 
two points already described above. But I am also of the opinion that a 
semantic search interface should provide at least the following three 
additional features:
    1. Query preprocessing: e.g. substitute "Paris" in the query with 
"http://dbpedia.org/resource/Paris";
    2. Entity Facets: if a keyword matches an entity (e.g. "Paris" -> 
"dbpedia:Paris", "dbpedia:Paris_Texas", "dbpedia:Paris_Hilton") then provide 
a facet to the user over such possible matches;
    3. Semantic Facets: if a user uses an instance of an ontology type (e.g. 
a Place, Person, Organization) in a query, then provide facets over semantic 
relations for such types (e.g. friends for Persons, products/services for 
Organizations, nearby points of interest for Places, participants for Events, 
…). To implement features like that we need components that provide query 
preprocessing capabilities based on data available in the Entityhub, Ontonet 
… To me it seems that the contenthub/search/engines/ontologyresource 
component already provides some functionality related to this, so it might be 
a good starting point.
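A query preprocessing step as described in point 1 could be sketched as 
follows. The static label-to-URI map here is only a stand-in for a real 
lookup against the Entityhub:

```java
import java.util.*;

/** Sketch of query preprocessing: known entity labels in a keyword query
 *  are replaced by their URIs before the query is sent to the semantic
 *  Solr index. A real implementation would look the labels up in the
 *  Entityhub instead of using a static map. */
class QueryPreprocessor {
    private final Map<String, String> labelToUri;

    QueryPreprocessor(Map<String, String> labelToUri) {
        this.labelToUri = labelToUri;
    }

    String preprocess(String query) {
        for (Map.Entry<String, String> e : labelToUri.entrySet()) {
            // quote the URI so Solr treats it as a single term
            query = query.replace(e.getKey(), '"' + e.getValue() + '"');
        }
        return query;
    }
}
```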

* What is the relation to the FactStore? As far as I understand it, the 
FactStore provides the possibility to first define what facts look like and 
then allows efficient CRUD operations over such facts. Related to this one 
can ask two questions: 
    1. Could one use the RdfPathLanguage to define fact schemas and/or to 
create fact instances? 
    2. Could it be possible to integrate the LMF semantic search component as 
a FactStore implementation into Stanbol?

Next Steps:

Work in the coming weeks needs to focus on making the IKS developer meeting 
at the end of November in Salzburg [3] as productive as possible. Let me also 
note that all interested people of the Stanbol community are very welcome to 
join.

To prepare this meeting it would be important to already start the discussion 
about the main topics - especially the /contenthub - on this mailing list 
before the meeting. Implementation-wise we (the LMF developer and I) will try 
to realize a first integration before the start of the meeting. This will not 
try to integrate the LMF semantic search component with the Contenthub, but 
rather concentrate on the RdfPathLanguage and the necessary improvements to 
the management of Solr indexes within Stanbol. This should allow us to run 
LMF demos like [4] within the Stanbol environment - but without the 
possibility to update the semantic index or to index additional documents, 
because that would require a real integration with the Contenthub.

These two things should then allow us to use the face-to-face meeting in 
Salzburg to really work on the integration and hopefully to show, at the end 
of the meeting, a demo that allows to

1. send a document to the Contenthub (optionally by using the CMS Adapter),
2. enhance it with the Enhancer,
3. store the enhanced document within the semantic index, and
4. use the faceted search interface to navigate through the documents.

best
Rupert Westenthaler

[1] http://code.google.com/p/kiwi/wiki/RdfPathLanguage
[2] 
http://code.google.com/p/kiwi/source/browse/#hg%2Flmf-search%2Fsrc%2Fmain%2Fjava%2Fat%2Fnewmedialab%2Flmf%2Fsearch
[3] http://wiki.iks-project.eu/index.php/IntegrationHackathonSalzburg
[4] http://labs.newmedialab.at/LMF/sn/search/suche.html

(*) Note that we excluded the topic of how to integrate the LMF rule engine 
here. Not because it is not important, but because I do not have the 
necessary knowledge about the rule implementation of Stanbol to discuss this.
(**) As a side note: the use of Solr by the semantic search component is very 
different from that of the Entityhub SolrYard. The former creates a dedicated 
schema based on a configuration - it can only index specific data, but 
produces a nice schema that is easy for users to understand and use - very 
important for the semantic search use case. The latter can store any kind of 
RDF data with a single schema - very important when managing very diverse 
entities - but the generic schema needs to use complex prefixes and suffixes 
in combination with dynamic fields, which makes it very hard for users to 
write Solr queries.
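To make that difference concrete, compare what a user-facing query might look 
like against both kinds of schemas. The field names below are invented for 
illustration and do not show the actual SolrYard field encoding:

```
# dedicated schema generated from a semantic index configuration:
q=birthPlace:"Paris"&facet=true&facet.field=birthPlace

# generic schema with prefix/suffix-encoded dynamic fields
# (invented encoding, for illustration only):
q=str_dbp-ont.birthPlace_en:"Paris"
```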
