On Fri, Mar 25, 2011 at 12:42 PM, Olivier Grisel wrote:
> 2011/3/25 Tommaso Teofili:
>> Hi all,
>> recently I've been working with Solr to enable named entity recognition of
>> indexed documents which I did with UIMA so I wonder if that could be an
>> interesting use case for Stanbol as well.
>>
>> For the mentioned purpose I've developed a custom UpdateHandler[1] for Solr
>> which enables enriching of documents being indexed with Apache UIMA on the
>> basis of the following use case:
>>
>>   1. user sends documents to Solr
>>   2. each document received by Solr is sent to a UIMA analysis pipeline
>>      just before it gets indexed
>>   3. the UIMA pipeline extracts enrichments, i.e. named entities
>>   4. the enrichments are written to Solr fields on the basis of a mapping
>>      configuration
>>   5. the enriched Solr document is actually written inside the index
>>
>> In my opinion that could be done also with Stanbol Enhancer.
>> Such an integration could run on top of the already developed contrib module
>> [2][3] or with a separate one written from scratch; obviously such options
>> have advantages and drawbacks we can discuss (later?).
>> What do you think?
>
> I think that we should definitely work at some point to be able to run
> an arbitrary UIMA analysis chain inside a Stanbol Enhancer. We need to
> write a dummy collection reader that turns a ContentItem into a CAS
> and a generic cas consumer that converts the output into a Clerezza
> Graph + a UIMAEnhancer that takes a CPE configuration to embed. Also
> the CAS to Clerezza Graph consumer could be directly contributed to
> the clerezza project while the ContentItem to CAS collection reader is
> stanbol specific.
>
> That would allow Stanbol users to reuse existing UIMA tools and turn
> them into a more linked data centric REST service.
>
> As for the use case, this is indeed interesting.
> Please note that the Solr engine embedded inside the entity hub is
> dedicated to fast local indexing of Linked Data entities (dbpedia entries
> for instance) and not documents. Stanbol is not really meant to be a
> document management system (at least not in the short term) but more like
> a knowledge base management system that lives next to an existing CMS that
> would probably have its own instance of Solr to index its documents.
>
The suggested module would be interesting for CMSs that already use Solr
within their search infrastructure. It would allow them to incorporate the
semantic lifting capabilities of Stanbol very easily.

> Extending Stanbol to build semantically enriched indices of documents
> would still be in the scope of stanbol but I think we should first
> focus on finishing the cleaning / refactoring of the existing code
> base before implementing new services.
>

The /store and the /sparql endpoints do provide this functionality to some
degree and, as far as I know, they are used especially in portal
environments (where the documents provided by the portal are actually
managed by different CMSs).

Semantic search over the metadata extracted from documents by the Stanbol
Enhancer is an interesting feature and I think it could be implemented by
combining a triple store with an adapted version of the SolrYard (Solr
based storage component, part of the Entityhub). However I would define
this as an additional component (same level as Enhancer and Entityhub),
maybe Documenthub?

best
Rupert

> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
>

--
| Rupert Westenthaler
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen
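
For illustration only, the index-time enrichment flow from the quoted use
case (steps 1 to 5) could be sketched roughly as below. This is a minimal
sketch in plain Java and does not use the Solr, UIMA or Stanbol APIs: the
class EnrichmentSketch, the Annotation type, the toy analyzer and the field
names person_names / place_names are all hypothetical stand-ins. The point
is only to show the shape of the "analyze, then map annotation types to
index fields" step that either a UIMA pipeline or the Stanbol Enhancer would
have to fill.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch; none of these types are Solr, UIMA or Stanbol APIs. */
class EnrichmentSketch {

    /** One extracted enrichment: an annotation type (e.g. "Person") and the covered text. */
    static final class Annotation {
        final String type;
        final String text;

        Annotation(String type, String text) {
            this.type = type;
            this.text = text;
        }
    }

    /**
     * Steps 2-4: run the analyzer on the document text and copy each annotation
     * into the index field configured for its type. Unmapped types are skipped.
     * The input document is not modified; an enriched copy is returned.
     */
    static Map<String, List<String>> enrich(
            Map<String, List<String>> doc,
            String textField,
            Function<String, List<Annotation>> analyzer,
            Map<String, String> typeToField) {
        Map<String, List<String>> enriched = new LinkedHashMap<>();
        doc.forEach((name, values) -> enriched.put(name, new ArrayList<>(values)));

        String text = String.join(" ", doc.getOrDefault(textField, List.of()));
        for (Annotation a : analyzer.apply(text)) {
            String field = typeToField.get(a.type);
            if (field != null) {
                enriched.computeIfAbsent(field, f -> new ArrayList<>()).add(a.text);
            }
        }
        return enriched;
    }

    /** Toy stand-in for the analysis pipeline: tags two hard-coded names. */
    static List<Annotation> toyAnalyzer(String text) {
        List<Annotation> found = new ArrayList<>();
        if (text.contains("Marie Curie")) {
            found.add(new Annotation("Person", "Marie Curie"));
        }
        if (text.contains("Paris")) {
            found.add(new Annotation("Place", "Paris"));
        }
        return found;
    }

    public static void main(String[] args) {
        // Step 1: a document arrives (here just a map of field name to values).
        Map<String, List<String>> doc =
                Map.of("text", List.of("Marie Curie worked in Paris."));
        // Mapping configuration: annotation type -> index field (assumed names).
        Map<String, String> mapping =
                Map.of("Person", "person_names", "Place", "place_names");
        // Step 5 would hand the enriched document to the indexer; here it is printed.
        System.out.println(enrich(doc, "text", EnrichmentSketch::toyAnalyzer, mapping));
    }
}
```

Keeping the analyzer behind a plain function is a deliberate choice in this
sketch: the same mapping step would work unchanged whether the annotations
come from a UIMA analysis chain or from Stanbol enhancement results.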
