2011/3/25 Tommaso Teofili <[email protected]>:
> Hi all,
> recently I've been working with Solr to enable named entity recognition of
> indexed documents which I did with UIMA so I wonder if that could be an
> interesting use case for Stanbol as well.
>
> For the mentioned purpose I've developed a custom UpdateHandler[1] for Solr
> which enables enriching of documents being indexed with Apache UIMA on the
> basis of the following use case:
>
> 1. user sends documents to Solr
> 2. each document received by Solr is sent to a UIMA analysis pipeline
>    just before it gets indexed
> 3. the UIMA pipeline extracts enrichments, i.e. named entities
> 4. the enrichments are written to Solr fields on the basis of a mapping
>    configuration
> 5. the enriched Solr document is actually written inside the index
>
> In my opinion that could be done also with Stanbol Enhancer.
> Such an integration could run on top of the already developed contrib module
> [2][3] or with a separate one written from scratch; obviously such options
> have advantages and drawbacks we can discuss (later?).
> What do you think?

I think we should definitely work, at some point, on being able to run an
arbitrary UIMA analysis chain inside a Stanbol Enhancer. We need to write
three pieces: a dummy collection reader that turns a ContentItem into a CAS,
a generic CAS consumer that converts the output into a Clerezza Graph, and a
UIMAEnhancer that takes a CPE configuration to embed. The CAS to Clerezza
Graph consumer could also be contributed directly to the Clerezza project,
while the ContentItem to CAS collection reader is Stanbol specific.

That would allow Stanbol users to reuse existing UIMA tools and turn them
into a more linked-data-centric REST service.

As for the use case, this is indeed interesting. Please note that the Solr
engine embedded inside the entity hub is dedicated to fast local indexing of
Linked Data entities (DBpedia entries, for instance) and not documents.
Stanbol is not really meant to be a document management system (at least not
in the short term) but more like a knowledge base management system that
lives next to an existing CMS, which would probably have its own instance of
Solr to index its documents.

Extending Stanbol to build semantically enriched indices of documents would
still be in the scope of Stanbol, but I think we should first focus on
finishing the cleaning / refactoring of the existing code base before
implementing new services.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
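
For illustration only, the three pieces described above (ContentItem to CAS
reader, analysis chain, CAS to graph consumer) could be sketched roughly as
follows. This is a minimal sketch in plain Java with no dependencies: every
type and method name here (Content, Cas, Annotation, Triple, toCas,
annotateCapitalizedWords, toTriples, enhance) is a hypothetical stand-in, not
the actual Stanbol, UIMA or Clerezza API, and the "analysis" step is a toy
regular expression rather than a real UIMA pipeline.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-ins only: none of these are real Stanbol, UIMA or Clerezza types.
final class UimaBridgeSketch {

    /** Stand-in for a Stanbol ContentItem: an identifier plus its plain text. */
    static final class Content {
        final String id;
        final String text;
        Content(String id, String text) { this.id = id; this.text = text; }
    }

    /** Stand-in for a UIMA annotation: a typed span over the document text. */
    static final class Annotation {
        final String type;
        final int begin;
        final int end;
        Annotation(String type, int begin, int end) {
            this.type = type; this.begin = begin; this.end = end;
        }
    }

    /** Stand-in for a UIMA CAS: the document text plus the annotations found so far. */
    static final class Cas {
        final String documentText;
        final List<Annotation> annotations = new ArrayList<>();
        Cas(String documentText) { this.documentText = documentText; }
    }

    /** Stand-in for one statement of an RDF graph. */
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;
        Triple(String subject, String predicate, String object) {
            this.subject = subject; this.predicate = predicate; this.object = object;
        }
    }

    /** "Collection reader" role: wrap the content item's text in a CAS. */
    static Cas toCas(Content item) {
        return new Cas(item.text);
    }

    /** Toy analysis step (an assumption): treat each capitalized word as an entity. */
    static void annotateCapitalizedWords(Cas cas) {
        Matcher m = Pattern.compile("\\b[A-Z][a-z]+\\b").matcher(cas.documentText);
        while (m.find()) {
            cas.annotations.add(new Annotation("NamedEntity", m.start(), m.end()));
        }
    }

    /** "CAS consumer" role: turn each annotation into a statement about the item. */
    static List<Triple> toTriples(String subject, Cas cas) {
        List<Triple> graph = new ArrayList<>();
        for (Annotation a : cas.annotations) {
            String covered = cas.documentText.substring(a.begin, a.end);
            graph.add(new Triple(subject, "mentions" + a.type, covered));
        }
        return graph;
    }

    /** "Enhancer" role: run the reader, the analysis and the consumer in order. */
    static List<Triple> enhance(Content item) {
        Cas cas = toCas(item);
        annotateCapitalizedWords(cas);
        return toTriples(item.id, cas);
    }
}
```

Calling UimaBridgeSketch.enhance on a Content whose text is "Paris is in
France" yields two statements, one for "Paris" and one for "France". In a real
integration the toy analysis step would be replaced by the embedded UIMA
pipeline and the Triple list by a Clerezza graph.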
