UIMAj3 ideas

Petr Baudis Thu, 09 Jul 2015 15:53:00 -0700

  Hi!

On Thu, Jul 09, 2015 at 03:51:26PM -0400, Marshall Schor wrote:
> The discussion of future directions for UIMA is spread over several pages in
> the wiki, but a good page to start is
> https://cwiki.apache.org/confluence/display/UIMA/Ideas+for+UIMAJ+v3
On Thu, Jul 09, 2015 at 04:17:44PM -0400, Marshall Schor wrote:
> I'll take a look.  This kind of thing is "on the list" for uima v3; see
> https://cwiki.apache.org/confluence/display/UIMA/Ideas+for+UIMAJ+v3


  I couldn't figure out how to edit that wiki page, but a summary of the
things I currently find irritating about UIMA and would love to see
changed has formed in my mind, so I thought I'd contribute it for
discussion.

  * uimaFIT is not part of core UIMA, and neither is UIMA-AS.  It seems
    to me that UIMA-AS does things a bit differently from the original
    UIMA idea of scale-out, and the two don't play well together.  I'd
    love a way to easily take my plain UIMA pipeline and scale it out,
    ideally without any code changes, *and* avoid the terrible XML
    config files.

  * Speaking of avoiding the config files, it'd be nice if I could avoid
    them for type systems as well.  A radical idea: In the end, I treat
    UIMA essentially as a storage for Java objects; I suspect many others
    do the same.  I'd love a way to turn JCasGen on its head and write
    the Java classes (possibly with some restrictions) that I could
    store in UIMA, with the backend figuring out the low-level UIMA
    representation on its own.  This would radically reduce some aspects
    of the engineering overhead for me and maybe many other users.

  * The JCas UIMA interface should be more transparent in other ways
    too.  Working with arrays (and absence of lists) is a huge pain.
    I just want to work with feature structures as if they were normal
    Java objects, without major restrictions.

  * Connected with the above - I'd love .addToIndexes() to just
    disappear.  Right now, the paradigm is that you build an annotation
    in an annotator, and the moment it gets saved in a CAS, it becomes
    basically read-only.  But if I want e.g. to build up a set of
    features across multiple annotators, things again become very
    painful.  Because arrays are also fixed-size, I need awful
    boilerplate code like

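                // Fetch the single AnswerInfo and wrap it in the AnswerFV helper.
                AnswerInfo ai = JCasUtil.selectSingle(jcas, AnswerInfo.class);
                AnswerFV fv = new AnswerFV(ai);
                fv.setFeature(f, 1.0);

                // Take the old feature FSes and the AnswerInfo out of the indexes.
                for (FeatureStructure af : ai.getFeatures().toArray())
                        ((AnswerFeature) af).removeFromIndexes();
                ai.removeFromIndexes();

                // Store a freshly built FSArray and index the AnswerInfo again.
                ai.setFeatures(fv.toFSArray(jcas));
                ai.addToIndexes();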

    simply to add a feature.  (Note the AnswerFV class, which is the
    actual thing I want to store in a JCas - a dynamic list of
    (feature_label, feature_value) pairs - but to do that it ends
    up being instead a complex factory of JCas FSes with a lot more
    boilerplate code inside.  Also note the typecast.)

  * I wondered about storing (arbitrary) graphs in the CAS, but the
    issues above make this really impractical.  If you also think about
    integrating microformats, you need to think about how to do this.

  * Complex pipelines are a bit clumsy.  I think the biggest obvious
    problem is the lack of signalling to the CAS merger that the input
    CASes have been exhausted.  Having an "isLast" barrier sounds simple
    as long as you have only a single CAS multiplier paired with the CAS
    merger, but when this assumption breaks down, things start to
    deteriorate.  However, I realize complex pipelines are a niche area.
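
  To make the barrier problem in the last point concrete, here is a
minimal sketch of the bookkeeping a merger would need once more than
one producer feeds it.  MergeBarrier, open() and arrive() are
hypothetical names made up for illustration, not UIMA API.  The sketch
also assumes every producer is opened before any other producer closes;
without that guarantee the barrier can fire too early, which is exactly
where things deteriorate.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper (not part of UIMA): tracks which producers are still open. */
final class MergeBarrier {
    private final Set<String> openSources = new HashSet<>();

    /** Register a producer (e.g. one CAS multiplier) before it emits anything. */
    void open(String sourceId) {
        openSources.add(sourceId);
    }

    /**
     * Record one CAS reaching the merger.
     * Returns true once every registered producer has delivered its last CAS.
     */
    boolean arrive(String sourceId, boolean isLast) {
        if (!openSources.contains(sourceId)) {
            throw new IllegalStateException("unknown or already closed source: " + sourceId);
        }
        if (isLast) {
            openSources.remove(sourceId);
        }
        return openSources.isEmpty();
    }
}
```

  With a single multiplier this collapses to one "isLast" flag; the
extra state only matters when several multipliers feed the same merger.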

  I think these are my main concerns.  I guess another way to phrase it:
I came to UIMA looking for a way to generate, store and organize
my+3rdparty Java object annotations of various text-based entities.
It sort of delivers, but if I did this again, I'd seriously question
whether the steep learning curve and incredible engineering overhead
are worth it.  I'd like UIMAj3 to make me not hesitate, and to get out
of my way! :)

-- 
                                Petr Baudis
        If you have good ideas, good data and fast computers,
        you can do almost anything. -- Geoffrey Hinton
