Hi!
On Thu, Jul 09, 2015 at 03:51:26PM -0400, Marshall Schor wrote:
> The discussion of future directions for UIMA is spread over several pages in
> the wiki, but a good page to start is
> https://cwiki.apache.org/confluence/display/UIMA/Ideas+for+UIMAJ+v3
On Thu, Jul 09, 2015 at 04:17:44PM -0400, Marshall Schor wrote:
> I'll take a look. This kind of thing is "on the list" for uima v3; see
> https://cwiki.apache.org/confluence/display/UIMA/Ideas+for+UIMAJ+v3
I couldn't figure out how to edit that wiki page, but I have a mental
summary of the things I currently find irritating about UIMA and would
love to see changed, so I thought I could contribute it for
discussion.
* Neither UIMAfit nor UIMA-AS is part of core UIMA. It seems to me
that UIMA-AS approaches scaleout a bit differently from the
original UIMA idea of how to do it. The two
things don't play well together. I'd love a way to easily take
my plain UIMA pipeline and scale it out, ideally without any code
changes, *and* avoid the terrible XML config files.
* Speaking of avoiding the config files, it'd be nice if I could avoid
them for type systems as well. A radical idea: In the end, I treat
UIMA essentially as a storage for Java objects; I suspect many others
do the same. I'd love a way to turn JCasGen on its head and write
the Java classes (possibly with some restrictions) that I could
store in UIMA, with the backend figuring out the low-level UIMA
representation on its own. This would radically reduce some aspects
of the engineering overhead for me and maybe many other users.
* The JCas UIMA interface should be more transparent in other ways
too. Working with arrays (and the absence of lists) is a huge pain.
I just want to work with feature structures as if they were normal
Java objects, without major restrictions.
* Connected with the above - I'd love .addToIndexes() to just
disappear. Right now, the paradigm is that you build an annotation
in an annotator, and the moment it gets saved in a CAS, it becomes
basically read-only. But if I want e.g. to build up a set of
features across multiple annotators, things again become very
painful. Because arrays are also fixed-size, I need awful
boilerplate code like
    AnswerInfo ai = JCasUtil.selectSingle(jcas, AnswerInfo.class);
    AnswerFV fv = new AnswerFV(ai);
    fv.setFeature(f, 1.0);
    for (FeatureStructure af : ai.getFeatures().toArray()) {
        ((AnswerFeature) af).removeFromIndexes();
    }
    ai.removeFromIndexes();
    ai.setFeatures(fv.toFSArray(jcas));
    ai.addToIndexes();
simply to add a feature. (Note the AnswerFV class, which is the
actual thing I want to store in a JCas: a dynamic list of
(feature_label, feature_value) pairs. To make that work, it ends
up being a complex factory of JCas FSes instead, with a lot more
boilerplate code inside. Also note the typecast.)
* I wondered about storing (arbitrary) graphs in the CAS, but the
issues above make this really impractical. If you are also thinking
about integrating microformats, this is something that needs solving.
* Complex pipelines are a bit clumsy. I think the biggest obvious
problem is the lack of signalling to the CAS merger that its input
CASes have been exhausted. Having an "isLast" barrier sounds simple as long
as you have only a single CAS multiplier paired with the CAS merger,
but when this assumption breaks down, things start to deteriorate.
However, I realize complex pipelines are a niche area.
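To make the "isLast" barrier problem from the last point concrete, here
is a minimal sketch in plain Java. It is not UIMA API: BarrierMerger,
accept() and the source ids are hypothetical names invented for the
illustration, and plain strings stand in for CASes. The merger releases
its result only after every upstream multiplier it was told about has
sent an item flagged as last.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Hypothetical sketch (not part of UIMA): a merger that waits until every
 * known upstream multiplier has signalled that it is exhausted.
 */
final class BarrierMerger {
    // Ids of upstream multipliers that have not yet sent their last item.
    private final Set<String> pendingSources;
    // Items received so far; strings stand in for CASes.
    private final List<String> collected = new ArrayList<>();

    BarrierMerger(String... sourceIds) {
        this.pendingSources = new HashSet<>(Arrays.asList(sourceIds));
    }

    /**
     * Accepts one item from the given source. isLast marks the final item
     * of that source. Returns true once all sources are exhausted.
     */
    boolean accept(String sourceId, String item, boolean isLast) {
        collected.add(item);
        if (isLast) {
            pendingSources.remove(sourceId);
        }
        return pendingSources.isEmpty();
    }

    /** Returns a copy of everything received so far, in arrival order. */
    List<String> merged() {
        return new ArrayList<>(collected);
    }

    public static void main(String[] args) {
        BarrierMerger merger = new BarrierMerger("multiplierA", "multiplierB");
        // A single isLast flag is not enough: multiplierB is still pending.
        System.out.println(merger.accept("multiplierA", "a1", false)); // false
        System.out.println(merger.accept("multiplierA", "a2", true));  // false
        // Only when the second multiplier finishes does the barrier open.
        System.out.println(merger.accept("multiplierB", "b1", true));  // true
        System.out.println(merger.merged());                           // [a1, a2, b1]
    }
}
```

The sketch only works because the merger is handed the full set of
upstream multipliers up front. With a single multiplier that is trivial;
with several, or with nested ones, something has to supply and maintain
that set, which is where I find things start to deteriorate.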
I think these are my main concerns. I guess another way to phrase it:
I came to UIMA looking for a way to generate, store and organize
my own and third-party Java object annotations of various text-based
entities. It sort of delivers, but if I did this again, I'd seriously
question whether the steep learning curve and incredible engineering
overhead are worth it. I'd like UIMAj3 to be something that makes me
not hesitate, and that gets out of my way! :)
--
Petr Baudis
If you have good ideas, good data and fast computers,
you can do almost anything. -- Geoffrey Hinton