> That being said there are some additional tools that might help with
> putting together various pipelines. For instance, I've been working on
> (although not very hard) an Analysis Engine that would allow a
> different Analysis Engine to work on only part of the output of another
> Analysis Engine. Suppose I had an Analysis Engine that detected the
> different languages being used in a text. Then suppose I had a Person
> Annotation Extractor that only works on Japanese. I might want to be
> able to send the Japanese parts of my text to the Person Annotation
> Extractor without writing any code. I'm not at all sure what the best
> way to go about this would be. Such an Analysis Engine might be good to
> include in the UIMA package but it might not belong in the
> specification.
This touches on thoughts I've been having about combining arbitrary annotators. Ultimately, it would be good to have some level of standards for Type Systems that define a minimal set of fields for tokens, parts of speech, named entities, taxonomy classifications, etc. However, that's a long process that will involve lots of community organizing and vendor cooperation, and I don't think it will happen in the near future.

In the interim, I believe the only way to combine arbitrary annotators is to transform the data in the CAS from one type system to another. Sort of an ETL for UIMA. I can imagine something with a nice mapping GUI, similar to the GUIs in a database ETL product such as Informatica. The kind of sub-setting you describe above would be one of the things such a tool could do. Another example would be to take the parts of speech coming out of OpenNLP and transform them into the parts of speech required by a particular named entity annotator.

Maybe there's a business opportunity here for someone. Or maybe there are open-source tools that could be adapted to do this. It does seem like a project that probably exceeds the capacity of the current UIMA project.

Greg Holmberg
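
P.S. To make the part-of-speech example a bit more concrete, here is a rough sketch of what one such mapping step might look like. It is plain Java with no UIMA dependency, and everything in it is an assumption for illustration: the Token class stands in for whatever token type each type system defines, the source tags are assumed to be Penn Treebank style, and the coarse target tags are an invented tag set for a hypothetical downstream named entity annotator. A real tool would read and write CAS feature structures instead.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TypeSystemMapping {

    /** Hypothetical token annotation: character offsets plus a part-of-speech tag. */
    static final class Token {
        final int begin;
        final int end;
        final String pos;

        Token(int begin, int end, String pos) {
            this.begin = begin;
            this.end = end;
            this.pos = pos;
        }
    }

    /** Tag emitted when the source tag has no entry in the mapping table. */
    static final String UNKNOWN = "X";

    /** Assumed mapping from Penn Treebank style tags to an invented coarse tag set. */
    static final Map<String, String> PENN_TO_COARSE = new HashMap<>();
    static {
        PENN_TO_COARSE.put("NN", "NOUN");
        PENN_TO_COARSE.put("NNS", "NOUN");
        PENN_TO_COARSE.put("NNP", "PROPN");
        PENN_TO_COARSE.put("NNPS", "PROPN");
        PENN_TO_COARSE.put("VB", "VERB");
        PENN_TO_COARSE.put("VBD", "VERB");
        PENN_TO_COARSE.put("VBZ", "VERB");
        PENN_TO_COARSE.put("JJ", "ADJ");
        PENN_TO_COARSE.put("RB", "ADV");
    }

    /** Copies each token, keeping its offsets and translating its tag. */
    static List<Token> mapTokens(List<Token> source) {
        List<Token> target = new ArrayList<>();
        for (Token t : source) {
            String mapped = PENN_TO_COARSE.getOrDefault(t.pos, UNKNOWN);
            target.add(new Token(t.begin, t.end, mapped));
        }
        return target;
    }

    public static void main(String[] args) {
        List<Token> source = new ArrayList<>();
        source.add(new Token(0, 4, "NNP"));
        source.add(new Token(5, 10, "VBD"));
        for (Token t : mapTokens(source)) {
            System.out.println(t.begin + "-" + t.end + " " + t.pos);
        }
    }
}
```

The table is the part a mapping GUI would let a user edit. The offsets pass through unchanged, so the downstream annotator still sees the same spans of text.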
