Re: Extend stanbol content hub for RDFa support

Rupert Westenthaler Mon, 02 Apr 2012 10:46:05 -0700

Hi Rüdiger, Olivier, all

Currently the typical Contenthub use case looks like this:

1. CMS sends/syncs content with Contenthub
2. Contenthub enhances the parsed content with the default chain of
the Stanbol Enhancer
3. Contenthub semantically indexes the Content.

This best describes the automatic content enhancement use case. But as
soon as users (or more complex content workflows) come into play, the
CMS might want or need to intervene between steps (2) and (3).

Here is a "user in the loop" version of a CMS that uses the Stanbol
Enhancer for content enhancements and the Stanbol Contenthub for
semantic indexing/search. This is also the workflow I discussed with
Rüdiger last week.

1. CMS with create.js/VIE (or similar): in this case an interface-level
integration of create.js/VIE with the Stanbol Enhancer *)
2. User accepts, rejects, or adds enhancements (i.e. manually adjusts them)
3. User stores the updated content (RDFa annotated in case of create.js/VIE)
4. CMS sends the updated document to the Contenthub for indexing: Content
+ RDF Metadata or Content with RDFa **)
5. Contenthub semantically indexes the Content based on the semantic
index definition.

@Olivier: "not the role of the Content Hub to do document enhancements":

It was not my intention to add RDFa support directly to the
Contenthub. My intention was to use

* The Apache Tika Engine to convert the HTML from the CMS to clean
XHTML: ***) We need to ensure that Tika does not touch the RDFa
* A new EnhancementEngine that uses the XHTML to extract the RDFa:
We need to decide if we add it to the metadata or to its own ContentPart
    * If this functionality is supported by Clerezza it would be really great.
    * I also thought that Metaxa supports this, but I was not able
to spot it in the source when I added content part support to this
engine. I will take another look ...

The Contenthub could then use an EnhancementChain with these two
EnhancementEngines to process parsed ContentItems.
The same chain could also be very useful for other scenarios where one
needs to extract metadata and knowledge from imported content (e.g.
RSS feeds).
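
To make the ordering constraint of such a chain concrete, here is a
minimal sketch in plain Python. It does not use the Stanbol API: the
ContentItem class, the two engine functions and run_chain are all
hypothetical stand-ins, and both engines are placeholders. It only shows
that the RDFa step depends on the XHTML part produced by the first step,
and it assumes the "own ContentPart" option for storing the result.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ContentItem:
    """Hypothetical stand-in for a ContentItem: raw content plus named parts."""
    content: str
    parts: Dict[str, object] = field(default_factory=dict)


Engine = Callable[[ContentItem], None]


def xhtml_engine(item: ContentItem) -> None:
    # Placeholder for the Tika step. A real engine would convert arbitrary
    # HTML to clean XHTML; here the content is only wrapped.
    item.parts["xhtml"] = "<html><body>" + item.content + "</body></html>"


def rdfa_engine(item: ContentItem) -> None:
    # Placeholder for the RDFa step. It reads the XHTML part, so it must
    # run after xhtml_engine.
    if "xhtml" not in item.parts:
        raise ValueError("rdfa_engine needs the 'xhtml' part")
    # Assumption: results go into their own part, not into the metadata.
    item.parts["rdfa"] = {"source_part": "xhtml", "triples": []}


def run_chain(item: ContentItem, engines: List[Engine]) -> ContentItem:
    # Engines are applied strictly in the given order.
    for engine in engines:
        engine(item)
    return item


item = run_chain(ContentItem("<p>Hello</p>"), [xhtml_engine, rdfa_engine])
print(sorted(item.parts))
```

Running the two engines in the reverse order raises the ValueError,
which is the dependency a real chain configuration would have to
respect.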

I plan to look into the Tika+RDFa as well as the Clerezza RDFa
extraction later this week.

best
Rupert

Notes:

*) This interface-level integration is only an example. The same could
also work for workflows such as automatic newsfeed annotation +
indexing with the feature that users can manually correct/enhance the
automatic annotations. Manually edited documents would follow the 2nd
workflow while the original fully automated annotation could still use
the first variant.

**) Why RDFa: Using Content+RDFa in (5) looks cool, because it would
even work with CMSs that know nothing about all the semantic
stuff mentioned above. But even if a CMS supports semantic
technologies, "Content+RDFa+RDF Metadata" might still be an interesting
option for (5).

***) I would like to pre-process parsed content with Tika, because it
would ensure that only fully valid XHTML reaches the RDFa parser. In
addition it should even allow us to extract RDFa from non-HTML but
RDFa-annotated XML files.
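
A small illustration of note ***), as a sketch only: the toy extractor
below uses Python's standard XML parser and handles nothing but the
"about", "property" and "content" attributes. It is not a conforming
RDFa processor and has nothing to do with the Tika, Metaxa or Clerezza
code; the function name and the sample documents are made up. The point
is that an XML-based extractor works on well-formed input and fails
outright on ordinary tag-soup HTML.

```python
import xml.etree.ElementTree as ET


def extract_rdfa(xhtml):
    """Toy extractor returning (subject, property, value) tuples.

    Ignores prefixes, typeof, rel/rev, datatypes and everything else
    a real RDFa processor has to handle.
    """
    root = ET.fromstring(xhtml)  # raises ET.ParseError if not well-formed
    triples = []

    def walk(elem, subject):
        # An 'about' attribute sets the subject for this element and below.
        subject = elem.get("about", subject)
        prop = elem.get("property")
        if prop is not None:
            # Prefer the 'content' attribute, else use the element text.
            value = elem.get("content")
            if value is None:
                value = "".join(elem.itertext()).strip()
            triples.append((subject, prop, value))
        for child in elem:
            walk(child, subject)

    walk(root, "")
    return triples


WELL_FORMED = (
    '<div about="http://example.org/doc1">'
    '<span property="dc:title">Stanbol and RDFa</span>'
    '<meta property="dc:creator" content="Rupert"/>'
    '</div>'
)

# Same idea, but with an unclosed <br> as it is common in plain HTML.
BROKEN = (
    '<div about="http://example.org/doc1"><br>'
    '<span property="dc:title">x</span></div>'
)

for triple in extract_rdfa(WELL_FORMED):
    print(triple)

try:
    extract_rdfa(BROKEN)
except ET.ParseError as err:
    print("not well-formed:", err)
```

The second document is rejected by the parser before any annotation is
seen, which is why a clean-up step in front of the RDFa extraction looks
useful.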

On Mon, Apr 2, 2012 at 6:12 PM, Rüdiger Kurz <[email protected]> wrote:
> Hi,
>
> On 02.04.2012 14:16, Olivier Grisel wrote:
>
>> On 2 April 2012 12:44, Rüdiger Kurz <[email protected]> wrote:
>>>
>>> Hi Staboler,
>>>
>>> during the last hackathon that took place in Saarbrücken next to the IKS
>>> review meeting I had the opportunity to play around with Stanbol content
>>> hub. At this point I want to suggest a new feature for the content hub:
>>>
>>> I have already annotated content and I want to find related content by
>>> using Stanbol. Therefore I want to suggest extending the Stanbol
>>> content hub with RDFa extraction support.
>>>
>>> Benefit:
>>> The semantic information that is already present will not be lost. RDFa
>>> generated by the CMS or created by annotate.js can be
>>> transferred to Stanbol and then be used to retrieve content.
>>>
>>> Procedure:
>>> 1. Send an RDF(a) annotated HTML document to Stanbol.
>>> 2. Stanbol's content hub extracts (e.g. using Clerezza as Reto mentioned)
>>> the RDFa annotations and stores the document together with its entities.
>>
>>
>> It is not the role of the Content Hub to do document enhancements.
>> The focus of the Content Hub is to store and query content and its
>> annotations. Document pre-processing should be handled by the Enhancer
>> (that can be called by the Content Hub when uploading a new document),
>> and actually it might already be the case: the Metaxa engine should be
>> able to extract the RDFa content of an HTML document. I don't know
>> how it works in practice though. Try it and if it does not work, have
>> a look at the source code.
>>
>> I think the Clerezza developers will also provide RDFa parsers and
>> maybe serializers too at some point.
>>
> Maybe someone could point me to the right starting point for using the
> Metaxa engine. Is there any documentation available?
>
> @Rupert, @Suat and @Reto: If you remember, we discussed such a feature at
> the review in Saarbrücken. To be more precise, it would be nice if
> you could let us know what you think about RDFa extraction in connection
> with the content hub.
>
> thanks
> Rüdiger
>
>
> --
> Kind Regards,
> Rüdiger.
>
> -------------------
>
> Rüdiger Kurz
>
> Alkacon Software GmbH  - The OpenCms Experts
> http://www.alkacon.com - http://www.opencms.org

-- 
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
