Marshall Schor wrote:
Using the terms Adam defined:
* "CAS" means the entire CAS. It never means a specific view of the CAS.
* "Index Definition" means the declaration in the descriptor that
defines an index - giving it a label, kind of index, CAS type, and
sort keys.
* "Index" is an instance of an index definition - something that can
be retrieved by a getIndex() call and from which you can get an
iterator.
* "Physical Index" is an actual data structure holding references to
FeatureStructures. This is transparent to the user but sometimes we
need to talk about it if we're concerned about performance.
To this, let me add:
* "Index Set" - a collection of index definition instances (or indexes,
for short), identified by a name (called the "view name").
I'm not sure this wasn't settled in your discussion with Adam, but to my
current way of thinking, a non-anchored view is nothing but a named set
of indexes. So this definition of an index set seems redundant.
<snip>
3) A magic method exists for tools to get all the FS's out of a CAS
(when serializing).
- This magic method can be restricted to just those FS's that are
indexed in some index, or that are reachable through a chain of
references starting in another FS which is indexed.
I'm not quite sure what you mean here, but if this implies that this
magic method can also return FSs that are not indexed anywhere, I don't
think so. FSs that are not indexed are meant to be temporary and local
to an annotator, so no need to serialize them or do anything else with them.
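To make the restriction in point 3 concrete, here is a minimal Java sketch of one way to read it: start from the FSs that are in some index and follow references transitively. The Fs class and collectReachable() are hypothetical stand-ins made up for illustration; they are not UIMA classes or methods.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

final class ReachableFsSketch {

    /** Hypothetical stand-in for a feature structure; not a UIMA class. */
    static final class Fs {
        final String name;
        final List<Fs> refs = new ArrayList<>(); // FS-valued features
        Fs(String name) { this.name = name; }
    }

    /**
     * Returns every FS that is indexed or reachable from an indexed FS.
     * An identity-based set is used so that reference cycles terminate.
     */
    static Set<Fs> collectReachable(Collection<Fs> indexed) {
        Set<Fs> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        Deque<Fs> pending = new ArrayDeque<>(indexed);
        while (!pending.isEmpty()) {
            Fs fs = pending.pop();
            if (seen.add(fs)) {          // first visit: follow its references
                pending.addAll(fs.refs);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        Fs a = new Fs("a");
        Fs b = new Fs("b");
        Fs temp = new Fs("temp");        // never indexed, never referenced
        a.refs.add(b);
        b.refs.add(a);                   // cycle back to a
        for (Fs fs : collectReachable(Arrays.asList(a))) {
            System.out.println(fs.name);
        }
    }
}
```

Under this reading, an FS that is neither indexed nor referenced from an indexed FS (temp above) is never visited, so it would not be serialized.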
Can we stop there (here)? I think with these concepts we can build the
higher level
concepts we now have, efficiently, except for the concept of subsetting
the FS's by "index-set".
Currently, we don't have a way to define an index which is a "filter",
including some members of a type while excluding others. An abstract
example: "odd-token" and "even-token", both being "token" types, but one
holding only the "odd" ones, etc. As Thilo has pointed out, the index
could instead contain all token types, and a "filtered iterator" could
be used at iteration time to sort these out. There are of course
space/time tradeoffs here.
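For illustration, the filtered-iterator alternative could look roughly like the Java sketch below: one index holds all tokens, and a wrapper skips the unwanted ones at iteration time. The Token class, filter() and oddTokenTexts() are hypothetical names invented for this example, and a plain List stands in for the index; none of this is the UIMA API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

final class FilteredIteratorSketch {

    /** Hypothetical stand-in for a token FS; not a UIMA class. */
    static final class Token {
        final int position;
        final String text;
        Token(int position, String text) {
            this.position = position;
            this.text = text;
        }
    }

    /** Wraps an iterator and yields only the elements the predicate accepts. */
    static <T> Iterator<T> filter(Iterator<T> source, Predicate<? super T> keep) {
        return new Iterator<T>() {
            private T next;
            private boolean ready;

            @Override
            public boolean hasNext() {
                // Advance the underlying iterator until a match is buffered.
                while (!ready && source.hasNext()) {
                    T candidate = source.next();
                    if (keep.test(candidate)) {
                        next = candidate;
                        ready = true;
                    }
                }
                return ready;
            }

            @Override
            public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                ready = false;
                T result = next;
                next = null;
                return result;
            }
        };
    }

    /** Iterates the single token "index" and keeps only odd positions. */
    static List<String> oddTokenTexts(List<Token> tokenIndex) {
        List<String> out = new ArrayList<>();
        Iterator<Token> it = filter(tokenIndex.iterator(), t -> t.position % 2 == 1);
        while (it.hasNext()) {
            out.add(it.next().text);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> tokenIndex = Arrays.asList(
                new Token(1, "The"), new Token(2, "quick"), new Token(3, "fox"));
        System.out.println(oddTokenTexts(tokenIndex));
    }
}
```

This shows the tradeoff in miniature: no second index is stored, but every iteration over the "odd" tokens still visits all tokens.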
- If we did have a way to define an index which is a "filter", we might
be able to efficiently use this to do the same thing that index-sets
enable, perhaps in a more general way.
More general how?
- Otherwise, we could use the concept of index-set to specify this filter:
4) FS's can be indexed (but don't have to be).
- If there is more than one "index set", you have to specify which
"index set" to use; the index operations (add/remove) update only the
indexes in that index-set.
(Note that this doesn't fit with other ideas where a particular "index"
might be in multiple index sets. In this proposal, the only way to put
an instance into multiple index sets is to do multiple adds, one per
index-set.)
That's what we have views for, isn't it?
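As a rough Java sketch of the add/remove semantics in point 4, the class below treats each named index set as a single bag of FSs. That is a deliberate simplification, since a real index set would hold several indexes. IndexSetSketch and its methods are hypothetical names for illustration, not part of UIMA.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.Set;

final class IndexSetSketch {

    // Index-set name -> FSs indexed there (compared by identity).
    private final Map<String, Set<Object>> indexSets = new HashMap<>();

    /** Adds the FS to the named index set only; other index sets are untouched. */
    void addToIndexes(String indexSetName, Object fs) {
        indexSets
            .computeIfAbsent(indexSetName,
                k -> Collections.newSetFromMap(new IdentityHashMap<>()))
            .add(fs);
    }

    /** Removes the FS from the named index set only. */
    void removeFromIndexes(String indexSetName, Object fs) {
        Set<Object> members = indexSets.get(indexSetName);
        if (members != null) {
            members.remove(fs);
        }
    }

    boolean isIndexed(String indexSetName, Object fs) {
        Set<Object> members = indexSets.get(indexSetName);
        return members != null && members.contains(fs);
    }

    public static void main(String[] args) {
        IndexSetSketch cas = new IndexSetSketch();
        Object token = new Object();

        cas.addToIndexes("english", token);   // one add per index set
        cas.addToIndexes("german", token);
        cas.removeFromIndexes("english", token);

        System.out.println(cas.isIndexed("english", token));
        System.out.println(cas.isIndexed("german", token));
    }
}
```

Putting one FS into two index sets takes two explicit adds, and removing it from one leaves the other untouched.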
This doesn't have the concept of "global indexes". If you want that,
you can create another "index set" and use it for that purpose.
<snip>
The set of all indexes must be accessible to the user in the CAS,
otherwise we violate the "all data must be accessible from the CAS
without recourse to views" constraint.
--Thilo