2011/3/25 Tommaso Teofili <[email protected]>:
> Hi all,
> recently I've been working with Solr to enable named entity recognition of
> indexed documents which I did with UIMA so I wonder if that could be an
> interesting use case for Stanbol as well.
>
> For the mentioned purpose I've developed a custom UpdateHandler[1] for Solr
> which enables enriching of documents being indexed with Apache UIMA on the
> basis of the following use case:
>
>   1. user sends documents to Solr
>   2. each document received by Solr is sent to a UIMA analysis pipeline
>   just before it gets indexed
>   3. the UIMA pipeline extracts enrichments, i.e. named entities
>   4. the enrichments are written to Solr fields on the basis of a mapping
>   configuration
>   5. the enriched Solr document is actually written inside the index
>
> In my opinion that could be done also with Stanbol Enhancer.
> Such an integration could run on top of the already developed contrib module
> [2][3] or with a separate one written from scratch; obviously such options
> have advantages and drawbacks we can discuss (later?).
> What do you think?
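
For reference, the five steps quoted above can be sketched roughly as
follows. This is only an illustration under loud assumptions: plain Java
collections stand in for Solr documents, the UIMA pipeline and the
index, and every name here (EnrichingIndexerSketch, typeToField, ...) is
made up for the example. It is not the UpdateHandler from [1].

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: no real Solr or UIMA classes are used here.
class EnrichingIndexerSketch {

    // Stand-in for the UIMA pipeline: text in, enrichments grouped by type out.
    private final Function<String, Map<String, List<String>>> analyzer;
    // Stand-in for the mapping configuration: enrichment type -> field name.
    private final Map<String, String> typeToField;
    // Name of the document field whose text is analyzed.
    private final String textField;
    // Stand-in for the index: the list of documents written so far.
    final List<Map<String, Object>> index = new ArrayList<>();

    EnrichingIndexerSketch(String textField,
                           Function<String, Map<String, List<String>>> analyzer,
                           Map<String, String> typeToField) {
        this.textField = textField;
        this.analyzer = analyzer;
        this.typeToField = typeToField;
    }

    // Steps 1-5: receive a document, analyze it, map enrichments to fields, index it.
    void add(Map<String, Object> doc) {
        Map<String, Object> enriched = new LinkedHashMap<>(doc);
        Object text = doc.get(textField);
        if (text != null) {
            Map<String, List<String>> enrichments = analyzer.apply(text.toString());
            for (Map.Entry<String, List<String>> e : enrichments.entrySet()) {
                String field = typeToField.get(e.getKey());
                if (field != null) {
                    // Only enrichment types named in the mapping become fields.
                    enriched.put(field, e.getValue());
                }
            }
        }
        index.add(enriched);
    }
}
```

Enrichment types that have no entry in the mapping are simply dropped,
which keeps the index schema under the control of the configuration.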

I think that at some point we should definitely make it possible to run
an arbitrary UIMA analysis chain inside a Stanbol Enhancer. We need to
write three pieces: a dummy collection reader that turns a ContentItem
into a CAS, a generic CAS consumer that converts the output into a
Clerezza Graph, and a UIMAEnhancer that takes a CPE configuration to
embed. The CAS to Clerezza Graph consumer could be contributed directly
to the Clerezza project, while the ContentItem to CAS collection reader
is Stanbol-specific.
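
To make the shape of those pieces concrete, here is a minimal sketch of
how they could fit together. All types below are hypothetical stand-ins
written in plain Java; they are not the real Stanbol, UIMA or Clerezza
classes, and the "type" and "selected-text" predicates are placeholders
rather than an actual vocabulary.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins only: none of these types are the real
// Stanbol, UIMA or Clerezza classes.
class UimaEnhancerSketch {

    /** Stand-in for a Stanbol ContentItem: an identifier plus plain text. */
    static final class ContentItem {
        final String id;
        final String text;
        ContentItem(String id, String text) { this.id = id; this.text = text; }
    }

    /** Stand-in for a UIMA annotation: a typed span over the document text. */
    static final class Annotation {
        final String type;
        final int begin;
        final int end;
        Annotation(String type, int begin, int end) {
            this.type = type; this.begin = begin; this.end = end;
        }
    }

    /** Stand-in for a UIMA CAS: the document text plus its annotations. */
    static final class Cas {
        final String documentText;
        final List<Annotation> annotations = new ArrayList<>();
        Cas(String documentText) { this.documentText = documentText; }
    }

    /** Stand-in for one statement of a Clerezza graph. */
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;
        Triple(String subject, String predicate, String object) {
            this.subject = subject; this.predicate = predicate; this.object = object;
        }
    }

    /** Stand-in for one analysis engine of the embedded chain. */
    interface AnalysisStep { void process(Cas cas); }

    // The "collection reader" part: ContentItem -> CAS.
    static Cas toCas(ContentItem item) { return new Cas(item.text); }

    // The "CAS consumer" part: CAS -> graph, two triples per annotation.
    static List<Triple> toGraph(ContentItem item, Cas cas) {
        List<Triple> graph = new ArrayList<>();
        for (Annotation a : cas.annotations) {
            String node = item.id + "#char=" + a.begin + "," + a.end;
            graph.add(new Triple(node, "type", a.type));
            graph.add(new Triple(node, "selected-text",
                    cas.documentText.substring(a.begin, a.end)));
        }
        return graph;
    }

    // The "enhancer" part: run the chain on the CAS, then convert the result.
    static List<Triple> enhance(ContentItem item, List<AnalysisStep> chain) {
        Cas cas = toCas(item);
        for (AnalysisStep step : chain) {
            step.process(cas);
        }
        return toGraph(item, cas);
    }

    // Toy analysis step: annotates every occurrence of a fixed word.
    static AnalysisStep annotateWord(String word, String type) {
        if (word.isEmpty()) {
            throw new IllegalArgumentException("word must not be empty");
        }
        return cas -> {
            int from = cas.documentText.indexOf(word);
            while (from >= 0) {
                cas.annotations.add(new Annotation(type, from, from + word.length()));
                from = cas.documentText.indexOf(word, from + word.length());
            }
        };
    }

    public static void main(String[] args) {
        ContentItem item = new ContentItem("urn:doc:1", "Paris is in France");
        for (Triple t : enhance(item, List.of(annotateWord("Paris", "Place")))) {
            System.out.println(t.subject + " " + t.predicate + " " + t.object);
        }
    }
}
```

Keeping toCas and toGraph as separate functions mirrors the split
suggested above: only toCas knows about the ContentItem, so the graph
conversion could live in another project.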

That would allow Stanbol users to reuse existing UIMA tools and turn
them into a more Linked Data-centric REST service.

As for the use case, it is indeed interesting. Please note that the
Solr engine embedded inside the entity hub is dedicated to fast local
indexing of Linked Data entities (DBpedia entries for instance), not
documents. Stanbol is not really meant to be a document management
system (at least not in the short term) but rather a knowledge base
management system that lives next to an existing CMS, which would
probably have its own instance of Solr to index its documents.

Extending Stanbol to build semantically enriched indices of documents
would still be in the scope of Stanbol, but I think we should first
focus on finishing the cleanup and refactoring of the existing code
base before implementing new services.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
