Hi Mihály,

An integration between Stanbol and UIMA would indeed be very useful. I will try to provide some pointers in this mail, especially regarding the Stanbol Enhancer. However, my own experience with UIMA is limited to reading the documentation about two years ago, so I will not be able to provide much input on the UIMA side of the task.

On 04.06.2012, at 18:45, Mihály Héder wrote:
> Hello Everyone,
>
> I'm new to this list, my name is Mihály Héder; I am the lead
> developer of the Sztakipedia project:
> http://www.youtube.com/watch?v=8VW0TrvXpl4
>
> Most of Sztakipedia's suggestions are based on UIMA Annotation Chains,
> which are composed of UIMA Annotation Engines. These are similar
> to Enhancer Chains and Enhancement Engines, respectively. If you are
> curious, you can play around with one of Sztakipedia's chains:
> http://pedia.sztaki.hu:8080/tfidfengsb/?mode=form This is a
> Tokenizer + sentence boundary detector + lemmatizer + tf-idf calculator
> chain (tf-idf is calculated on enwiki in this case)
> [..]
>
> So right now I'm investigating how to integrate UIMA stuff into
> Stanbol. After having read some Stanbol docs and written a Hello World
> enhancement engine to get a grip on Stanbol, I think this is how it
> should be done:
> - An adapter-like interface is needed that glues together the two
> components. If you use UIMA, most of the time you just have a pear
> file from a third party that you can't or do not want to modify. It will
> have its own type system, chain definition, etc. Also, hopefully there
> will be many more Stanbol users than developers in the long run.
> - This means that the real use case is that the future user downloads a
> UIMA chain from somewhere, downloads Stanbol, and wants to glue the two
> together without coding in either project.
> - However, most of the time it will be non-trivial to turn UIMA Feature
> Structures into Stanbol Enhancements. In some cases I can imagine that
> you can just turn every FS into a triple by a simple rule or something,
> but making this flexible enough from some configuration files seems
> rather unrealistic to me.
>
> So what I have in mind now about UIMA->Enhancement conversion is:
> - defining a simple Java interface with one function, e.g.: Triple
> convertFStoTriple(org.apache.uima.cas.FeatureStructure fs).
> By implementing this one function the user could easily define how
> feature structs are to be turned into Triples. Most of the time this
> function would return null, as there are usually many more UIMA
> FeatureStructures generated (e.g. about two for every word) than the
> user wants to deal with.

Don't forget the possibility of storing the UIMA feature structure as a ContentPart in the Stanbol ContentItem. [1] I would suggest defining a fixed URI as the key so that all UIMA-related components know where to look for it. With the multipart ContentItem RESTful API, users could even request the UIMA feature structure via the Stanbol RESTful API.

> - creating an Enhancement Engine called UIMAAdapter. This would have a
> converterClass Service Property that could be configured to contain
> the name of the class the user just created. This would instantiate
> the user-written class, provided that it's on the classpath, and use it
> to create enhancements.

In OSGI one would rather define an interface and register converters as services. Services can be registered manually by using the BundleContext. An alternative is to use "@Component" annotations, as is done for EnhancementEngines. In this case the OSGI config admin will automatically create the component and register it as a service.

> - for more advanced cases we could provide an interface to map a
> List<FeatureStructure> to List<Triples>. For even more advanced cases
> we could provide a convert(List<FeatureStructure>, ContentItem ci)
> function with full access to the Stanbol ContentItem
> - naturally we could write some default converter that converts every
> FeatureStructure that comes out of UIMA to triples in some way, for
> testing purposes and as a basis for extension.

I would suggest separating two things:

1. an Engine that executes the UIMA Annotation Chain and stores the results as a ContentPart in the Stanbol ContentItem
2. one or more Engines that convert the UIMA results to Stanbol Enhancements

One possibility would be to use an EnhancementChain for chaining (1) and (2). I would also expect different implementations of (2):

* Fixed implementations for typical things contained in UIMA results
* Configurable implementations that require users to provide the mappings
* Generic implementations that mainly convert the UIMA results to RDF. This RDF might be further processed by another Stanbol Engine.
* Special implementations optimized for special use cases. Those would need to be created by Stanbol users or UIMA annotation chain providers.

However, as my knowledge about typical UIMA results is very limited, this might also turn out not to be feasible.

>
> The other question is how to communicate with the UIMA Engine. I think
> the feature of accessing a remotely deployed UIMA engine is a must and
> the REST interface you can try out on the link above (provided by
> UIMASimpleServlet) is good for starters. I'm much less sure that
> embedding everything that is needed to run a UIMA engine into a
> Stanbol Enhancement Engine is such a good idea, but I think it can be
> done.
>

There is already an integration of Apache Clerezza with UIMA. Maybe we can build upon this, and even if we cannot, it should provide valuable input on how to use UIMA from an OSGI-based framework.

> What do you think of all the above?
>
> p.s. Do you have a "How to write and deploy a Hello World Enhancement
> Engine" tutorial? I have found the description of the functions to
> implement, but still it took me a while to figure out how to deploy it
> to felix, etc. If not, I can write one for you based on my notes.

That would be a valuable addition to the documentation of the Stanbol Enhancer.
BTW: Contributing to the webpage is easy:

* svn co http://svn.apache.org/repos/asf/incubator/stanbol/site/trunk/ stanbol-website
* create new content using Markdown syntax
* open a JIRA issue and provide your contribution as a patch

best
Rupert

>
> Best,
> Mihály
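
To make the two-step split and the "converters as services" idea discussed above a bit more concrete, here is a minimal, self-contained Java sketch (Java 17, uses records). It deliberately does not use the real UIMA or Stanbol APIs: `FeatureStructure`, `Triple` and `ContentItem` are simplified stand-ins, the interface and class names are hypothetical, and the URI key is a made-up placeholder, since the thread only says that a fixed URI should be defined. A plain list stands in for the OSGI service registry.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Stand-in for org.apache.uima.cas.FeatureStructure (hypothetical, simplified).
record FeatureStructure(String type, String coveredText) {}

// Stand-in for an RDF triple (hypothetical, reduced to three strings).
record Triple(String subject, String predicate, String object) {}

// Stand-in for a Stanbol ContentItem: content parts are stored under URI keys.
class ContentItem {
    private final Map<String, Object> parts = new HashMap<>();
    final List<Triple> metadata = new ArrayList<>();

    void addPart(String uri, Object part) { parts.put(uri, part); }
    Object getPart(String uri) { return parts.get(uri); }
}

// Converter contract (hypothetical): an empty Optional means
// "this converter is not interested in the given feature structure".
interface FeatureStructureConverter {
    Optional<Triple> convert(FeatureStructure fs);
}

// Step (1): store the raw UIMA results as a content part under a fixed key.
class UimaResultStore {
    // Placeholder URI, an assumption made for this sketch only.
    static final String UIMA_PART_URI = "urn:example:uima-feature-structures";

    static void store(ContentItem ci, List<FeatureStructure> results) {
        ci.addPart(UIMA_PART_URI, List.copyOf(results));
    }
}

// Step (2): read the stored results and let every registered converter
// contribute triples to the content item's metadata.
class UimaConversionEngine {
    private final List<FeatureStructureConverter> converters = new ArrayList<>();

    // In OSGI this would be a service binding; here it is a plain list.
    void register(FeatureStructureConverter converter) {
        converters.add(converter);
    }

    void enhance(ContentItem ci) {
        Object part = ci.getPart(UimaResultStore.UIMA_PART_URI);
        if (!(part instanceof List<?> results)) {
            return; // step (1) has not run, nothing to convert
        }
        for (Object o : results) {
            if (o instanceof FeatureStructure fs) {
                for (FeatureStructureConverter c : converters) {
                    c.convert(fs).ifPresent(ci.metadata::add);
                }
            }
        }
    }
}
```

A user-provided converter is then just a small class or lambda that returns a triple for the feature structure types it cares about and `Optional.empty()` for everything else. Returning an `Optional` instead of null is a choice made for this sketch, not something proposed in the thread.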
