Re: CAS Views and Sofas simplification

Marshall Schor Wed, 27 Dec 2006 07:24:12 -0800

Some clarifications below:

Thilo Goetz wrote:

Marshall Schor wrote:
Using the definitions Adam defined:
* "CAS" means the entire CAS. It never means a specific view of the CAS.
* "Index Definition" means the declaration in the descriptor that
defines an index - giving it a label, kind of index, CAS type, and
sort keys.
* "Index" is an instance of an index definition - something that can
be retrieved by a getIndex() call and from which you can get an
iterator.
* "Physical Index" is an actual data structure holding references to
FeatureStructures. This is transparent to the user but sometimes we
need to talk about it if we're concerned about performance.

To this, let me add:
* "Index Set" - a collection of Index definition instances - or Indexes (for short) -
identified by a name (called the "view name").
I'm not sure this wasn't settled in your discussion with Adam, but to my current way of thinking, a non-anchored view is nothing but a named set of indexes. So this definition of an index set seems redundant.

OK.  I was trying to keep the concepts simpler, more circumscribed:
    CAS - a container
         -  can have a Sofa, can have more than 1 Sofa
         -  has an Index Set, can have more than 1 Index Set

   Special case: "Anchored View":  An Index set that has 1 associated Sofa

With this formulation, you can see that there are other potential combinations:

   non-Anchored view: An Index Set without an associated Sofa

   multi-Anchored view (?? I made up this name, not really suggesting it ??): An Index set with more than 1 associated Sofa.
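
To make the combinations above concrete, here is a minimal Java sketch of the terminology. It is only an illustration: the class and method names (CasModelSketch, Cas, Sofa, IndexSet, kind) are made up for this example and are not the UIMA API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the terms used in this thread; not the UIMA API.
final class CasModelSketch {

    /** A Sofa held by the CAS, identified by a name. */
    static final class Sofa {
        final String id;
        Sofa(String id) { this.id = id; }
    }

    /** A named Index Set with zero or more associated Sofas. */
    static final class IndexSet {
        final String viewName;
        final List<Sofa> sofas = new ArrayList<>();
        IndexSet(String viewName) { this.viewName = viewName; }

        /** Classifies the index set by the number of Sofas associated with it. */
        String kind() {
            if (sofas.isEmpty()) return "non-anchored view";
            if (sofas.size() == 1) return "anchored view";
            return "multi-anchored view";
        }
    }

    /** The CAS as a container: it can hold several Sofas and several Index Sets. */
    static final class Cas {
        final Map<String, Sofa> sofas = new LinkedHashMap<>();
        final Map<String, IndexSet> indexSets = new LinkedHashMap<>();

        Sofa addSofa(String id) {
            Sofa sofa = new Sofa(id);
            sofas.put(id, sofa);
            return sofa;
        }

        IndexSet addIndexSet(String viewName, Sofa... associated) {
            IndexSet set = new IndexSet(viewName);
            for (Sofa sofa : associated) set.sofas.add(sofa);
            indexSets.put(viewName, set);
            return set;
        }
    }

    public static void main(String[] args) {
        Cas cas = new Cas();
        Sofa text = cas.addSofa("text");
        Sofa audio = cas.addSofa("audio");
        System.out.println(cas.addIndexSet("textView", text).kind());
        System.out.println(cas.addIndexSet("scratch").kind());
        System.out.println(cas.addIndexSet("aligned", text, audio).kind());
    }
}
```

In this sketch the only thing that distinguishes the three kinds of view is how many Sofas the index set points at; nothing else about the container changes.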

<snip>
3) A magic method exists for tools to get all the FS's out of a CAS (when serializing).
   - This magic method can be restricted to just those FS's that are indexed in some index, or which are reachable from a chain of references starting in another FS which is indexed.
I'm not quite sure what you mean here, but if this implies that this magic method can also return FSs that are not indexed anywhere, I don't think so.

That's OK - which is why I said the 2nd sentence.

FSs that are not indexed are meant to be temporary and local to an annotator, so no need to serialize them or do anything else with them.

Data that is *local* to an annotator is most likely never put into the CAS, but would rather be held in "native" annotator data structures, because (a) it's usually more efficient, and (b) the space is reclaimed (of course, depending on the Annotator design, and assuming we aren't adding garbage-collection to the CAS).

It seems to me a more convincing use-case for this is data that was put into the CAS (to be shared) which some subsequent process effectively "deleted" (e.g., changed a reference that was serving to locate the FS).
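
As a rough illustration of the restricted "magic method" (indexed FSs plus anything reachable from them), here is a small Java sketch. The Fs class and collectSerializable are hypothetical stand-ins invented for this example, not existing CAS APIs; the traversal is an ordinary worklist walk over references.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; Fs and collectSerializable are not UIMA names.
final class ReachableFsSketch {

    /** Stand-in for a FeatureStructure: a name plus outgoing references. */
    static final class Fs {
        final String name;
        final List<Fs> refs = new ArrayList<>();
        Fs(String name) { this.name = name; }
    }

    /** Returns every FS that is indexed, or reachable by references from an indexed FS. */
    static Set<Fs> collectSerializable(Iterable<Fs> indexed) {
        // Identity-based set: two distinct FSs are never merged.
        Set<Fs> seen = Collections.newSetFromMap(new IdentityHashMap<Fs, Boolean>());
        Deque<Fs> work = new ArrayDeque<>();
        for (Fs fs : indexed) {
            if (seen.add(fs)) work.push(fs);
        }
        while (!work.isEmpty()) {
            Fs current = work.pop();
            for (Fs target : current.refs) {
                if (seen.add(target)) work.push(target);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        Fs token = new Fs("token");
        Fs lemma = new Fs("lemma");
        Fs orphan = new Fs("orphan");
        token.refs.add(orphan);
        token.refs.set(0, lemma);  // the reference that located "orphan" is overwritten

        List<Fs> indexed = new ArrayList<>();
        indexed.add(token);
        for (Fs fs : collectSerializable(indexed)) System.out.println(fs.name);
    }
}
```

Here "orphan" was put into the structure and then effectively "deleted" by changing the reference that located it, so the walk never visits it and it would not be serialized.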

Can we stop there (here)? I think with these concepts we can build the higher level concepts we now have, efficiently, except for the concept of subsetting the FS's by "index-set".
Currently, we don't have a way to define an index which is a "filter" - including some members of a type, while excluding others. An abstract example: "odd-token" and "even-token" - both being "token" types, but one only holding the "odd" ones, etc. As Thilo has pointed out - the index could contain all token types, and a "filtered-iterator" could be used at iteration time to sort these out, as an alternative. There are of course space/time tradeoffs here.
- If we did have a way to define an index which is a "filter", we might be able to efficiently use this to do the same thing that index-sets enable, perhaps in a more general way.
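
The two alternatives in the odd-token example can be sketched side by side in Java. Everything here is hypothetical (Token, FilteredIndex and filteredIteration are invented for the illustration, and "position" just stands in for whatever the filter would inspect); it is meant only to show where the filter cost is paid.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of "filter index" versus "filtered iterator"; not the UIMA API.
final class FilterIndexSketch {

    /** Stand-in token FS. */
    static final class Token {
        final int position;
        Token(int position) { this.position = position; }
    }

    /** Option A: an index that applies its filter when an FS is added. */
    static final class FilteredIndex {
        private final Predicate<Token> filter;
        private final List<Token> members = new ArrayList<>();
        FilteredIndex(Predicate<Token> filter) { this.filter = filter; }

        void add(Token t) {
            if (filter.test(t)) members.add(t);
        }

        List<Token> members() { return members; }
    }

    /** Option B: one unfiltered index, with the filter applied at iteration time. */
    static List<Token> filteredIteration(List<Token> allTokens, Predicate<Token> filter) {
        List<Token> out = new ArrayList<>();
        for (Token t : allTokens) {
            if (filter.test(t)) out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        Predicate<Token> odd = t -> t.position % 2 == 1;
        FilteredIndex oddIndex = new FilteredIndex(odd);
        List<Token> all = new ArrayList<>();
        for (int i = 1; i <= 6; i++) {
            Token t = new Token(i);
            all.add(t);       // the single "all tokens" index
            oddIndex.add(t);  // the extra filtered index
        }
        System.out.println(oddIndex.members().size());
        System.out.println(filteredIteration(all, odd).size());
    }
}
```

Option A spends extra space (a second index) and pays for the filter once per add; Option B keeps a single index and pays for the filter on every iteration.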
More general how?

Multiple Index sets are an additional mechanism (compared with not having multiple index sets). They provide a way to say a FS is a "member" of some index set. A more general approach (one which doesn't add any new mechanisms to having a single Index Set) would be to get rid of multiple index sets, and say if users want to make FSs "members" of some user-defined "sets" (called views), they can do that using normal indexes (assuming we have added the ability to define indexes which filter). Here's how it could work:
  - Simple case: User desires FSs to be members of 1 view:
      - User defines additional structures to support the way they want to refer to these.
          - e.g. an additional slot per FS, with a ref to a "view" object they define.
      - User defines an index over the type they want in the view, with the extra predicate that
            slot value == the view object

This may not be such a good idea because it requires an additional slot per FS, and it requires some "management" of view names to be able to specify the equal test.
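
For the simple case described above (one extra slot per FS, plus an index whose predicate is "slot value == the view object"), a minimal Java sketch might look like this. All names (ViewObject, Fs, viewSlot, ViewFilteredIndex) are hypothetical and chosen only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; this is not the UIMA API.
final class ViewSlotSketch {

    /** User-defined object that stands for one view. */
    static final class ViewObject {
        final String name;
        ViewObject(String name) { this.name = name; }
    }

    /** An FS carrying the extra slot: a reference to the view it belongs to. */
    static final class Fs {
        final String type;
        final ViewObject viewSlot;
        Fs(String type, ViewObject viewSlot) {
            this.type = type;
            this.viewSlot = viewSlot;
        }
    }

    /** An index over one type with the extra predicate "slot value == the view object". */
    static final class ViewFilteredIndex {
        private final String type;
        private final ViewObject view;
        private final List<Fs> members = new ArrayList<>();

        ViewFilteredIndex(String type, ViewObject view) {
            this.type = type;
            this.view = view;
        }

        /** Adds the FS only if it has the indexed type and its slot refers to this view. */
        boolean add(Fs fs) {
            if (fs.type.equals(type) && fs.viewSlot == view) {
                members.add(fs);
                return true;
            }
            return false;
        }

        int size() { return members.size(); }
    }

    public static void main(String[] args) {
        ViewObject english = new ViewObject("english");
        ViewObject german = new ViewObject("german");
        ViewFilteredIndex englishTokens = new ViewFilteredIndex("Token", english);
        System.out.println(englishTokens.add(new Fs("Token", english)));  // slot matches
        System.out.println(englishTokens.add(new Fs("Token", german)));   // different view
        System.out.println(englishTokens.size());
    }
}
```

The sketch also shows both costs mentioned above: every FS carries the extra slot, and whoever defines the index has to get hold of the right view object to write the equality test.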

So - I think I would come down in favor of having multiple named index sets as a better approach here. Just think of this thread as an intellectual exploration of possibilities, not as something I'm advocating :-)

- Otherwise, we could use the concept of index-set to specify this filter:
4) FS's can be indexed (but don't have to be).
- If there is more than one "index set", you have to specify which "index set" to use; the index operations (add/remove) update only the indexes in that index-set. (Note that this doesn't fit with other ideas where a particular "index" might be in multiple index sets. In this proposal, the only way to put an instance into multiple index sets is to do multiple adds, one per index-set.)
That's what we have views for, isn't it?

Right. I think these are equivalent. A view corresponds to a (named) index-set.
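
The add/remove behaviour in point 4 can be sketched in Java as follows. This is only an illustration under the assumption that each index set keeps one index per FS type; IndexSet, add, remove and contains are hypothetical names, not the existing index repository API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of per-index-set add/remove; not the UIMA API.
final class IndexSetAddSketch {

    /** Stand-in for a FeatureStructure. */
    static final class Fs {
        final String type;
        Fs(String type) { this.type = type; }
    }

    /** A named index set (a view); here it keeps one index per FS type. */
    static final class IndexSet {
        final String name;
        private final Map<String, List<Fs>> indexes = new HashMap<>();
        IndexSet(String name) { this.name = name; }

        /** Updates only the indexes of this index set. */
        void add(Fs fs) {
            indexes.computeIfAbsent(fs.type, k -> new ArrayList<>()).add(fs);
        }

        void remove(Fs fs) {
            List<Fs> index = indexes.get(fs.type);
            if (index != null) index.remove(fs);
        }

        boolean contains(Fs fs) {
            List<Fs> index = indexes.get(fs.type);
            return index != null && index.contains(fs);
        }
    }

    public static void main(String[] args) {
        IndexSet viewA = new IndexSet("viewA");
        IndexSet viewB = new IndexSet("viewB");
        Fs token = new Fs("Token");

        viewA.add(token);                           // indexed in viewA only
        System.out.println(viewB.contains(token));  // false

        viewB.add(token);                           // a second add, one per index set
        viewA.remove(token);                        // does not touch viewB
        System.out.println(viewA.contains(token) + " " + viewB.contains(token));  // false true
    }
}
```

Because the caller always names the index set, an FS ends up in two index sets only through two separate adds, and removing it from one leaves the other untouched.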

This doesn't have the concept of "global indexes". If you want that, you can create another
"index set" and use it for that purpose.
<snip>
The set of all indexes must be accessible to the user in the CAS, otherwise we violate the "all data must be accessible from the CAS without recourse to views" constraint.

I think this is satisfied by my so-called "magic method". This would likely be implemented as suggested above. Most *typical users* probably would not use APIs which iterate through all indexes, but framework & tooling would.

-Marshall
