On Fri, Jul 10, 2015 at 01:37:27PM -0400, Marshall Schor wrote:
> On 7/9/2015 6:52 PM, Petr Baudis wrote:
> <snip...>
>
> https://cwiki.apache.org/confluence/display/UIMA/Ideas+for+UIMAJ+v3
>
> > I didn't figure out how to edit that wiki page,
> Due to spammers, we had to turn off public editing. However, I can add you
> to a list (to do this, you have to "register" for a user id on the wiki, and
> then send me offline what that id is), but even without being on the list,
> there's a comment button which (I think) lets you add comments at the bottom.
> > but a mental summary
> > of the things I find currently irritating about UIMA and would love to
> > see changed formed in my mind, so I thought I could contribute it for
> > discussion.
> Great!
> >
> > * UIMAfit is not part of core UIMA and UIMA-AS is not part of core
> > UIMA. It seems to me that UIMA-AS is doing things a bit differently
> > than what the original UIMA idea of doing scaleout was. The two
> > things don't play well together. I'd love a way to easily take
> > my plain UIMA pipeline and scale it out, ideally without any code
> > changes, *and* avoid the terrible XML config files.
> Any specifics of what to change here would be helpful. UIMA-AS was designed
> to enable scale-out without changing the core UIMA pipeline or its XML
> descriptor. The additional information for UIMA-AS scaleout was put into a
> separate XML descriptor which "embeds" the original plain UIMA one.
I'm sure Richard would be able to explain this better, but I think one
of the core issues is that UIMA-AS embeds the XML descriptor instead of
the AnalysisEngineDescription. So when I want to use it together with an
AnalysisEngineDescription built with UIMAfit instead, I have to resort to
crazy workarounds like
https://code.google.com/p/dkpro-lab/source/browse/de.tudarmstadt.ukp.dkpro.lab/de.tudarmstadt.ukp.dkpro.lab.uima.engine.uimaas/src/main/java/de/tudarmstadt/ukp/dkpro/lab/uima/engine/uimaas/component/SimpleService.java?name=14aeba50c8c1&r=14aeba50c8c18ea4d14c0d099f43c049f806d9db
> > * Connected with the above - I'd love .addToIndexes() to just
> > disappear. Right now, the paradigm is that you build an annotation
> > in an annotator, and the moment it gets saved in a CAS, it becomes
> > basically read-only.
> You certainly can modify any of an Annotation's features subsequently.
> I'm guessing you're referring to another idea - adding additional features
> that were not initially defined in the UIMA type system.
Sorry for the confusion, but that's not quite what I had in mind.
I literally believe that right now, in order to modify the value of
a feature, you first need to remove the annotation from the index, change
the value, then add it back. Is that a misconception?
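To illustrate the remove / change / add-back pattern I mean, here is a
minimal sketch in plain Java. It uses java.util.TreeSet as a stand-in for a
sorted index; Span, SortedIndexSketch and moveBegin are hypothetical names
made up for this example, not UIMA API, and I'm assuming the concern only
applies to features the index actually sorts on.

```java
import java.util.Comparator;
import java.util.TreeSet;

// Hypothetical stand-in for an annotation; NOT a UIMA class.
class Span {
    int begin;
    int end;

    Span(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }
}

class SortedIndexSketch {
    // Stand-in for a sorted index keyed on (begin, end).
    static TreeSet<Span> newIndex() {
        return new TreeSet<>(
            Comparator.comparingInt((Span s) -> s.begin)
                      .thenComparingInt(s -> s.end));
    }

    // Changing a key field while the element sits in the sorted set would
    // leave it at a stale position, so: remove, modify, add back.
    static void moveBegin(TreeSet<Span> index, Span s, int newBegin) {
        boolean wasIndexed = index.remove(s);
        s.begin = newBegin;
        if (wasIndexed) {
            index.add(s);
        }
    }

    public static void main(String[] args) {
        TreeSet<Span> index = newIndex();
        Span a = new Span(0, 5);
        Span b = new Span(10, 15);
        index.add(a);
        index.add(b);
        moveBegin(index, a, 20);
        System.out.println(index.first() == b);  // b now sorts first
        System.out.println(index.contains(a));   // a is still findable
    }
}
```

Fields that the comparator never looks at could be changed in place without
any of this, which is why I'm asking how far the requirement really goes.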
> UIMA sets up the types and
> features once at the start of the pipeline run (from a merge of all the
> components' type systems), and locks down the type system. Other frameworks
> sometimes allow an unlocked type system, where you could add (after a Feature
> Structure is created) additional features. This is usually done by keeping a
> list of feature-name <-> feature-value pairs (such as your code snippet does,
> below). We're thinking of including this capability in version 3, with a
> bit of a twist - the intent would be to keep the "compilable" aspect of
> "locked-down" type/features (for high performance), while adding (for those
> use cases that want it) the other style of dynamically added additional
> features (at some cost in performance).
Still, this would be awesome and I'd totally make use of it!
(I guess the code in my original email conflates two separate issues -
the addToIndexes() one and the lack of variable-sized lists, i.e. the Java
collection support issue. Even if you decide generic collection / map
support would be too tricky, at least supporting variable-sized lists
would help a lot...)
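Just to check that I understand the "twist" correctly, here is how I picture
it in plain Java: fixed features stay as ordinary fields, and dynamically
added ones live in a name -> value map on the side. FlexibleFS and its
methods are hypothetical names for this sketch only and say nothing about how
version 3 would actually implement it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical feature structure; NOT a UIMA class.
class FlexibleFS {
    // "Locked-down" features: plain fields, fast and checked at compile time.
    final int begin;
    final int end;

    // Dynamically added features, kept as feature-name -> feature-value pairs.
    private final Map<String, Object> extra = new HashMap<>();

    FlexibleFS(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }

    void setExtra(String name, Object value) {
        extra.put(name, value);
    }

    // Returns null when the feature was never set.
    Object getExtra(String name) {
        return extra.get(name);
    }

    boolean hasExtra(String name) {
        return extra.containsKey(name);
    }

    public static void main(String[] args) {
        FlexibleFS fs = new FlexibleFS(0, 4);
        fs.setExtra("confidence", 0.9);
        System.out.println(fs.begin + ".." + fs.end);
        System.out.println(fs.getExtra("confidence"));
        System.out.println(fs.hasExtra("lemma"));
    }
}
```

The map lookup and the untyped Object value are where the "some cost in
performance" (and in type safety) would come from in a layout like this.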
> > * I wondered about storing (arbitrary) graphs in the CAS, but the
> > issues above make this really impractical. If you also think about
> > integrating microformats, you need to think about how to do this.
> We have had users store arbitrary graphs in the CAS, but, yes, it is not so
> efficient. The main elements UIMA has for collections of references (to
> FeatureStructures) are the FSArray and FSList. As you point out, the FSArray
> is fixed length. The FSList supports dynamic adding/removing etc. using the
> standard linked-list technique. However, because UIMA data in the CAS
> (currently) is not garbage collected, you have to be careful when using this
> technique.
...oh, never mind. After using UIMA heavily for well over a year,
I managed not to learn that FSList exists at all! Thanks for this
pointer.
I think that's a bug for the UIMA Tutorial, which mentions FSArray but
not FSList. :-)
(Another pain point here - I always ache when I need to work with
FSArray or I guess FSList, since it does not carry the type information
that is in the typesystem - I need to manually typecast all the time
and hope I don't make a mistake.)
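What I end up wanting is something like the following helper, sketched here
in plain Java over an Object[] as a stand-in for an untyped array. TypedViews
and typedCopy are hypothetical names for this sketch, not existing UIMA API.
The idea is to do the cast in one place and fail early on a mismatch.

```java
import java.util.ArrayList;
import java.util.List;

class TypedViews {
    // Copies an untyped array into a typed list, checking every element.
    // Class.cast throws ClassCastException on the first element of the
    // wrong type, so a mistake surfaces here rather than somewhere later.
    static <T> List<T> typedCopy(Object[] untyped, Class<T> type) {
        List<T> out = new ArrayList<>(untyped.length);
        for (Object o : untyped) {
            out.add(type.cast(o));
        }
        return out;
    }

    public static void main(String[] args) {
        Object[] raw = {"token", "sentence"};
        List<String> typed = typedCopy(raw, String.class);
        for (String s : typed) {
            System.out.println(s.length());  // no cast needed at the use site
        }
    }
}
```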
> The above proposal to allow the common Java Collection objects (like
> ArrayList and Maps) as things in the CAS, plus garbage collection, should
> make it much more convenient to store and work with graphs in the CAS.
> >
> > * Complex pipelines are a bit clumsy. I think the biggest obvious
> > problem is lack of signalling to CAS merger that input CASes have
> > been exhausted. Having an "isLast" barrier sounds simple as long
> > as you have only a single CAS multiplier paired with the CAS merger,
> > but when this assumption breaks down, things start to deteriorate.
> > However, I realize complex pipelines are a niche area.
> It would be nice to hear some ideas here.
(After reading Eddie Epstein's email and coming back to some more of
his emails to me, I realize that the isLast hack I'm using would be
unnecessary if I instead used the "process-parent-last" flag of CASMultiplier.
I'm learning a lot from interacting here! I guess that shows we could
always make use of more good UIMA code examples...)
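As I understand the idea (this is an assumption on my part, not a description
of the actual UIMA implementation), if the parent is guaranteed to arrive
after all of its children, the merger needs no explicit isLast marker: it
buffers children per parent and flushes when the parent shows up. A plain
Java sketch, with ParentLastMerger, onChild and onParent as hypothetical
names and strings standing in for CASes:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ParentLastMerger {
    // Child results buffered per parent id until that parent arrives.
    private final Map<String, List<String>> pending = new HashMap<>();

    // Called for every child: just remember its result under its parent.
    void onChild(String parentId, String childResult) {
        pending.computeIfAbsent(parentId, k -> new ArrayList<>())
               .add(childResult);
    }

    // Called when the parent itself arrives. Under the parent-last
    // assumption all of its children have been seen by now, so the
    // buffered results can be emitted and forgotten.
    List<String> onParent(String parentId) {
        List<String> merged = pending.remove(parentId);
        return merged != null ? merged : new ArrayList<>();
    }

    public static void main(String[] args) {
        ParentLastMerger merger = new ParentLastMerger();
        merger.onChild("doc1", "part A");
        merger.onChild("doc2", "part X");
        merger.onChild("doc1", "part B");
        System.out.println(merger.onParent("doc1"));
        System.out.println(merger.onParent("doc2"));
    }
}
```

Because children are keyed by parent, this also keeps working when several
multipliers feed the same merger, which is where my single isLast flag
started to fall apart.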
--
Petr Baudis
If you have good ideas, good data and fast computers,
you can do almost anything. -- Geoffrey Hinton