CAS Views and Sofas simplification

Marshall Schor Fri, 22 Dec 2006 09:06:54 -0800

Using the definitions Adam defined:

* "CAS" means the entire CAS.  It never means a specific view of the CAS.
* "Index Definition" means the declaration in the descriptor that
defines an index - giving it a label, kind of index, CAS type, and
sort keys.
* "Index" is an instance of an index definition - something that can
be retrieved by a getIndex() call and from which you can get an
iterator.
* "Physical Index" is an actual data structure holding references to
FeatureStructures.  This is transparent to the user but sometimes we
need to talk about it if we're concerned about performance.


To this, let me add:

* "Index Set" - a collection of Index definition instances (or
Indexes, for short), identified by a name (called the "view name").

and:

* "Sofa" - a particular Subject of Analysis.
  - A CAS can hold many Sofas.
  - Annotations (subtypes of AnnotationBase) are created having a ref
    to a particular Sofa.


We can approach simplicity by identifying a small number of primitive things
that can be combined to give useful interpretations.

Consider:

0) CASes are the unit of work, the unit of remote data transfer, in
UIMA. They often correspond to a "document" (but for big docs, may
only have part of it).
1) FS's are created in the (one-and-only) CAS.
2) Annotations can be created.
  - If there is more than one "Sofa", you must specify which Sofa
    they are "over".
3) A magic method exists for tools to get all the FS's out of a CAS
(when serializing).
  - This magic method can be restricted to just those FS's that are
    indexed in some index, or which are reachable from a chain of
    references starting in another FS which is indexed.
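The restriction described in item 3 amounts to a reachability walk that starts from the indexed FS's. Here is a minimal sketch of that idea in plain Java. The Fs and Reachability classes are hypothetical stand-ins invented for illustration; they are not UIMA classes, and this is not how the serializer is actually implemented.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a feature structure: a name plus references to other FS's.
final class Fs {
    final String name;
    final List<Fs> refs = new ArrayList<>();

    Fs(String name) {
        this.name = name;
    }
}

final class Reachability {
    // Returns every FS that is indexed, or reachable through a chain of
    // references starting at an indexed FS. Identity (not equals) decides
    // whether an FS has been seen, so cycles terminate.
    static Set<Fs> reachableFrom(Collection<Fs> indexed) {
        Set<Fs> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        Deque<Fs> work = new ArrayDeque<>();
        for (Fs fs : indexed) {
            if (seen.add(fs)) {
                work.push(fs);
            }
        }
        while (!work.isEmpty()) {
            Fs current = work.pop();
            for (Fs ref : current.refs) {
                if (seen.add(ref)) {
                    work.push(ref);
                }
            }
        }
        return seen;
    }
}
```

An FS that is neither indexed nor referenced from an indexed FS is simply never visited, so it would be left out of the serialized form.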

Can we stop there (here)? I think with these concepts we can build the
higher level concepts we now have, efficiently, except for the concept
of subsetting the FS's by "index-set".

Currently, we don't have a way to define an index which is a "filter" -
including some members of a type, while excluding others. An abstract
example: "odd-token" and "even-token" - both being "token" types, but
one only holding the "odd" ones, etc. As Thilo has pointed out, the
index could contain all token types, and a "filtered-iterator" could be
used at iteration time to sort these out, as an alternative. There are
of course space/time tradeoffs here.

  - If we did have a way to define an index which is a "filter", we
    might be able to efficiently use this to do the same thing that
    index-sets enable, perhaps in a more general way.
  - Otherwise, we could use the concept of index-set to specify this
    filter:

4) FS's can be indexed (but don't have to be).
  - If there is more than one "index set", you have to specify which
    "index set" to use; the index operations (add/remove) update only
    the indexes in that index-set. (Note that this doesn't fit with
    other ideas where a particular "index" might be in multiple index
    sets. In this proposal, the only way to put an instance into
    multiple index sets is to do multiple adds, one per index-set.)
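To make item 4 concrete, here is a rough sketch of what "add/remove update only the indexes in that index-set" could look like. SketchCas and its methods are hypothetical names made up for this illustration, not existing UIMA API, and each index set is collapsed to a single set of FS's to keep it short.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the proposal; none of these names are UIMA API.
final class SketchCas {
    // view name -> the FS's indexed in that index set
    private final Map<String, Set<Object>> indexSets = new HashMap<>();

    // Adds the FS to the named index set only; other index sets are untouched.
    void addToIndexes(String viewName, Object fs) {
        indexSets.computeIfAbsent(viewName, k -> new LinkedHashSet<>()).add(fs);
    }

    // Removes the FS from the named index set only. Returns true if it was there.
    boolean removeFromIndexes(String viewName, Object fs) {
        Set<Object> members = indexSets.get(viewName);
        return members != null && members.remove(fs);
    }

    boolean isIndexed(String viewName, Object fs) {
        Set<Object> members = indexSets.get(viewName);
        return members != null && members.contains(fs);
    }
}
```

Under this sketch, putting one FS into two index sets takes two addToIndexes() calls, one per view name, and removing it from one set leaves it in the other.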

This doesn't have the concept of "global indexes". If you want that,
you can create another "index set" and use it for that purpose.

This doesn't tie a Sofa to a View. You could enforce some tie-in /
restriction here if it was wanted.

This doesn't say that Annotations can only be indexed in a view which is
(somehow) tied to the same Sofa the Annotation is over.

The simplest solution in my mind (today :-) ) would scrap index-sets in
favor of index-filtering.
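For comparison, the two filtering options discussed above (filter when adding to the index, versus filter when iterating) might look roughly like this. Token, FilteredIndex and FilteredIterator are hypothetical classes written only for this sketch, not UIMA classes, with the "odd-token" case as the example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

// Hypothetical stand-in for an FS of type "token"; not a UIMA class.
final class Token {
    final int position;

    Token(int position) {
        this.position = position;
    }
}

// Option A: the index itself is the filter. Costs space per filtered index,
// but iteration touches only the members that passed.
final class FilteredIndex<T> implements Iterable<T> {
    private final Predicate<T> filter;
    private final List<T> members = new ArrayList<>();

    FilteredIndex(Predicate<T> filter) {
        this.filter = filter;
    }

    // Returns true if the item passed the filter and was indexed.
    boolean add(T item) {
        if (!filter.test(item)) {
            return false;
        }
        members.add(item);
        return true;
    }

    int size() {
        return members.size();
    }

    @Override
    public Iterator<T> iterator() {
        return members.iterator();
    }
}

// Option B: one unfiltered index, filtered at iteration time. No extra
// space, but every iteration scans all members of the underlying index.
final class FilteredIterator<T> implements Iterator<T> {
    private final Iterator<T> source;
    private final Predicate<T> filter;
    private T next;
    private boolean hasNext;

    FilteredIterator(Iterator<T> source, Predicate<T> filter) {
        this.source = source;
        this.filter = filter;
        advance();
    }

    // Moves to the next element of the source that passes the filter, if any.
    private void advance() {
        hasNext = false;
        next = null;
        while (source.hasNext()) {
            T candidate = source.next();
            if (filter.test(candidate)) {
                next = candidate;
                hasNext = true;
                return;
            }
        }
    }

    @Override
    public boolean hasNext() {
        return hasNext;
    }

    @Override
    public T next() {
        if (!hasNext) {
            throw new NoSuchElementException();
        }
        T result = next;
        advance();
        return result;
    }
}
```

With a predicate such as t -> t.position % 2 == 1, both options yield the same "odd" tokens; they differ only in where the space/time cost lands.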


-Marshall
