Re: CAS Views and Sofas simplification

Marshall Schor Fri, 22 Dec 2006 11:14:28 -0800

Adam Lally wrote:

On 12/22/06, Marshall Schor <[EMAIL PROTECTED]> wrote:

0) CASes are the unit of work, the unit of remote data transfer, in
UIMA. They often
correspond to a "document" (but for big docs, may only have part of it).
1) FS's are created in the (one-and-only) CAS.
2) Annotations can be created.
   -  If there is more than one "Sofa", you must specify which Sofa they
are "over".
3) A magic method exists for tools to get all the FS's out of a CAS
(when serializing).
   - This magic method can be restricted to just those FS's that are
indexed in some index,
      or which are reachable via a chain of references starting in
another FS which is indexed.
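
A minimal sketch of the restricted form of the "magic method" in point 3: collect everything that is indexed, plus everything reachable from an indexed FS by following references. The `FsNode` and `ReachableFs` names are hypothetical stand-ins for illustration, not the UIMA API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a feature structure (not the UIMA FeatureStructure type).
final class FsNode {
    final String name;
    final List<FsNode> refs = new ArrayList<>();   // FS-valued features

    FsNode(String name) { this.name = name; }
}

final class ReachableFs {
    // Returns every FS that is indexed, or reachable through a chain of
    // references that starts at an indexed FS. Identity-based, so cycles
    // between FSs terminate.
    static Set<FsNode> collect(Collection<FsNode> indexed) {
        Set<FsNode> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        Deque<FsNode> pending = new ArrayDeque<>(indexed);
        while (!pending.isEmpty()) {
            FsNode fs = pending.pop();
            if (seen.add(fs)) {
                pending.addAll(fs.refs);   // follow references only on first visit
            }
        }
        return seen;
    }
}
```

An FS that is neither indexed nor referenced from an indexed FS is simply never visited, which is the restriction described above.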


Can we stop there (here)?  I think with these concepts we can build the
higher level
concepts we now have, efficiently, except for the concept of subsetting
the FS's by "index-set".


Hmmm.. don't you need the "FS can be indexed (but don't have to be)"
part in here?  You refer to FS being indexed but don't say how they
get there.

Right - I think I forgot to include at least one index-set.


Also why did you say "(when serializing)" - is it intended that this
operation not be used for other purposes such as by an annotator?

This was me thinking that the main use-case for this is serialization, and
remembering you wanted to hide this from users because they might abuse it?

Currently, we don't have a way to define an index which is a "filter" -
including some members
of a type, while excluding others.  An abstract example: "odd-token" and
"even-token" - both
being "token" types, but one only holding the "odd" ones, etc.  As Thilo
has pointed out -
the index could contain all token types, and a "filtered-iterator" could
be used at iteration time
to sort these out, as an alternative.  There are of course space/time
tradeoffs here.

  - If we did have a way to define an index which is a "filter", we
might be able to
     efficiently use this to do the same thing that index-sets enable,
perhaps in a more general
     way.
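
To make the two alternatives concrete, here is a rough sketch using the odd/even token example. It assumes a filter is just a predicate; `TokenFs`, `FilteredIndex`, and `FilteredIteration` are made-up names, and nothing here reflects how UIMA indexes are actually defined.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical token with a running position number.
final class TokenFs {
    final int position;

    TokenFs(int position) { this.position = position; }
}

// Alternative A: the filter runs once, at indexing time.
// Costs one index per filter; iteration needs no further checks.
final class FilteredIndex {
    private final Predicate<TokenFs> filter;
    private final List<TokenFs> members = new ArrayList<>();

    FilteredIndex(Predicate<TokenFs> filter) { this.filter = filter; }

    void add(TokenFs t) {
        if (filter.test(t)) {
            members.add(t);   // non-matching tokens never enter this index
        }
    }

    List<TokenFs> members() { return members; }
}

// Alternative B: one index holds all tokens; the filter runs on every iteration.
final class FilteredIteration {
    static List<TokenFs> select(List<TokenFs> allTokens, Predicate<TokenFs> filter) {
        return allTokens.stream().filter(filter).collect(Collectors.toList());
    }
}
```

Both yield the same members; A pays in space (one index per filter), B pays in time (one test per element per iteration).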


So let me see if I have this right - there would be just one
annotation index, sorted on begin, end.

What I was trying to say was that there might be many annotation
indexes. Each one might have a "filter" saying that it should have
annotations whose "sofa" was a particular sofa, for example.

All indexed annotations for
any Sofa in the CAS would exist in this one index.  If an annotator
wanted to do the usual operation of iterating over annotations
relating to one particular Sofa, this would be done using a filtered
iterator that would filter out any annotations not referring to the
specified Sofa.  Correct?

See above... Could be done this way, but I was thinking that the
filtering would be done at indexing time, not at iteration time.


One thing that comes to mind is that it may be more efficient to keep
the annotation index segregated by Sofa as we do today.  That's
because I presume no one will actually care about the relative
ordering of annotations from different Sofas, so we'd be wasting time
if we computed it.

Right - that's why I was thinking it would be done at indexing time.

And, we currently benefit from the fact that
annotations are usually created in order, but we'd lose that benefit
if we had an index that interleaved annotations across Sofas.

Also, we have some uses of non-annotation indexes that are segregated
by Sofa (say, a Lemma index that's particular to a Sofa, where there's
actually no explicit link from the Lemma to the Sofa).  A filtering
approach wouldn't work there,

It could be made to work by adding a feature to the Lemma type which was
a sofa reference.  But maybe that's asking too much of the user?

although perhaps we can argue that those
cases are poor design.

- Otherwise, we could use the concept of index-set to specify this filter:


4) FS's can be indexed (but don't have to be).
   -  If there is more than one "index set", you have to specify which
"index set" to use;
      the index operations (add/remove) update only the indexes in that
index-set.
      (Note that this doesn't fit with other ideas where a particular
"index" might be in
       multiple index sets.  In this proposal, the only way to put an
instance into multiple
       index sets is to do multiple adds, one per index-set.)
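
A small sketch of the add/remove rule in point 4, under the simplifying assumption that each index set can be modeled as one membership list keyed by name (a real index set would hold several indexes). `IndexSets` and its methods are hypothetical, not existing UIMA calls.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model: index-set name -> FSs currently indexed in that set.
final class IndexSets {
    private final Map<String, List<Object>> sets = new HashMap<>();

    // Updates only the named index set; no other set sees the FS.
    void add(String indexSet, Object fs) {
        sets.computeIfAbsent(indexSet, k -> new ArrayList<>()).add(fs);
    }

    // Removes from the named index set only; returns false if it was not there.
    boolean remove(String indexSet, Object fs) {
        List<Object> members = sets.get(indexSet);
        return members != null && members.remove(fs);
    }

    boolean contains(String indexSet, Object fs) {
        List<Object> members = sets.get(indexSet);
        return members != null && members.contains(fs);
    }
}
```

Putting one instance into two index sets takes two `add` calls, one per set, and removing it from one leaves the other untouched.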

This doesn't have the concept of "global indexes".  If you want that,
you can create another
"index set" and use it for that purpose.

This doesn't tie a Sofa to a View.  You could enforce some tie-in /
restriction here if it was wanted.


So basically, is this equivalent to taking our current implementation
of View and saying that the sofa is optional? (Which is more or less
what the UIMA spec says.)

Well, it allows 2 or more Sofas to be indexed using a single
index-set (i.e., in a single view), which
the current design doesn't.

This doesn't say that Annotations can only be indexed in a view which is
(somehow) tied to
the same Sofa the Annotation is over.


To clarify on the anchored view constraint that the UIMA Spec talks
about - it is not quite what you say here.  You can add any Annotation
to a non-anchored view (one that has no sofa).  You just cannot add an
Annotation to a view that's anchored to a _different_ sofa than the
Sofa that the Annotation points to.
Whether this constraint is checked or not is up to the implementation.
So I think this just comes down to how we feel about the performance
of the check.  But it's still important to understand the concept -
the intention of the anchored view is to segregate things by Sofa, and
it's valid for downstream annotators to rely on this.  A framework may
not check, but an annotator that violates the anchored view constraint
is a badly behaved annotator.
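
For illustration, the anchored view constraint could be checked roughly like this. This is a sketch with hypothetical `SofaRef`, `AnnotationFs`, and `ViewSketch` types; whether a real framework performs the check at all is, as noted, an implementation choice.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; identity of the SofaRef object is what matters.
final class SofaRef {
    final String id;

    SofaRef(String id) { this.id = id; }
}

final class AnnotationFs {
    final SofaRef sofa;   // the Sofa this annotation is "over"

    AnnotationFs(SofaRef sofa) { this.sofa = sofa; }
}

final class ViewSketch {
    private final SofaRef anchor;   // null means a non-anchored view
    private final List<AnnotationFs> indexed = new ArrayList<>();

    ViewSketch(SofaRef anchor) { this.anchor = anchor; }

    void addToIndexes(AnnotationFs a) {
        // Non-anchored views accept any annotation; anchored views reject
        // annotations that point to a different Sofa.
        if (anchor != null && a.sofa != anchor) {
            throw new IllegalArgumentException(
                "annotation is over Sofa " + a.sofa.id + ", view is anchored to " + anchor.id);
        }
        indexed.add(a);
    }

    int size() { return indexed.size(); }
}
```

The check is a single reference comparison per add, which gives a sense of the cost being weighed.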

-Marshall
