What is the recommended way of storing document properties, such as "author",
"date created", "title", etc.?
I also need some data for internal use, such as the document size and URI.
One other requirement: this is not a closed vertical solution with a known set
of annotators designed to inter-operate. This is an application platform that
will use some known annotators but allow plugging in arbitrary unknown
annotators from other companies (that's why one uses UIMA, of course!). Also,
some of our annotators may be used in UIMA containers from other companies with
unknown annotators. So my code can't assume that the UIMA container provides
any data structure containing these properties, or that annotators other than
our own know about it.
I see a few possibilities:
1. Add features to DocumentAnnotation
2. Add features to SourceDocumentInformation
3. Create my own annotation or TOP FS.
The documentation recommends not adding features to DocumentAnnotation if you
are using JCas (I am). I agree--what if both my annotators and someone else's
annotator have added features to DocumentAnnotation? It just wouldn't work,
right? It's the same with SourceDocumentInformation (SDI): if two annotators
both add features to it, their definitions conflict and can't be merged.
SDI is useful however, since it has the document size and URI. Despite it
being in a package called "examples", in truth it's become a standard. All the
annotators that ship with UIMA use it. If you want to use the semantic search
(Juru) indexing CAS Consumer, you have to use SDI. I'm sure many annotators
in the world have used SDI.
I would like my annotators and UIMA container to be compatible with all those
annotators. Therefore, I think I have to use SDI for size and URI, but not
modify it.
Creating my own annotation (or is extending TOP FS better?) seems like the best
answer. My UIMA container and set of annotators would know about it, and
other people's annotators wouldn't be affected. My annotators would have to
gracefully degrade when running in a UIMA container that doesn't provide this
new annotation.
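To make the graceful degradation concrete, here is a minimal sketch in plain
Java. It uses no UIMA classes: CasStandIn is a hypothetical stand-in for the
CAS, and com.example.DocumentProperties is a made-up name for my own type. The
SDI type name and its "uri" feature are what I believe the Apache UIMA examples
use; adjust them for your release. In the real API the equivalent "is the type
there?" check would, as far as I know, be a null result from
TypeSystem.getType(name).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a CAS: one feature structure per type name,
// each modelled as a feature-name -> value map. Not a UIMA class.
final class CasStandIn {
    private final Map<String, Map<String, String>> byType = new HashMap<>();

    void put(String typeName, Map<String, String> features) {
        byType.put(typeName, features);
    }

    // Returns null when the container never supplied this type.
    Map<String, String> get(String typeName) {
        return byType.get(typeName);
    }
}

final class TitleLookup {
    // Hypothetical name for the platform's own properties type.
    static final String PROPS_TYPE = "com.example.DocumentProperties";
    // Assumed SDI type name; older releases may use a different package.
    static final String SDI_TYPE =
        "org.apache.uima.examples.SourceDocumentInformation";

    // Prefer our own title, fall back to the SDI URI, then to a default.
    static String displayName(CasStandIn cas) {
        Map<String, String> props = cas.get(PROPS_TYPE);
        if (props != null && props.get("title") != null) {
            return props.get("title");
        }
        Map<String, String> sdi = cas.get(SDI_TYPE);
        if (sdi != null && sdi.get("uri") != null) {
            return sdi.get("uri");
        }
        return "(unknown document)";
    }
}
```

The point of the ordering is that an annotator never fails just because it
landed in someone else's container; it simply works with less information.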
What are people's thoughts? 1, 2 or 3?
================
Longer term, I think we as a community need to define Type Systems that allow
inter-operability of annotators and CAS Consumers. For example, we could
create an official SourceDocumentInformation that allows arbitrary sets of
document properties as simple name-value pairs. In other words, add this
feature to SourceDocumentInformation:
    properties    uima.cas.FSArray    (elements of type PropertyFS)

where PropertyFS is a new type:

    uima.PropertyFS    (supertype uima.cas.TOP)
        name      uima.cas.String
        value     uima.cas.String
        scheme    uima.cas.String
And define that names, values, and schemes conform to the Dublin Core Metadata
Initiative standards.
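As a rough illustration of how an annotator might consume such name-value
pairs, here is a sketch in plain Java. DocProperty and DocProperties are
hypothetical stand-ins for the proposed PropertyFS and the "properties" array;
they are not UIMA or JCas classes. The names "title", "creator", and "date" are
Dublin Core elements, and "W3CDTF" is a DCMI date encoding scheme; the sample
values are invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Stand-in for the proposed uima.PropertyFS (hypothetical, not a UIMA class).
final class DocProperty {
    final String name;    // Dublin Core element name, e.g. "creator"
    final String value;   // the property value as a string
    final String scheme;  // encoding scheme of the value, or null if none

    DocProperty(String name, String value, String scheme) {
        this.name = name;
        this.value = value;
        this.scheme = scheme;
    }
}

// Stand-in for the "properties" array on SourceDocumentInformation.
final class DocProperties {
    private final List<DocProperty> properties = new ArrayList<>();

    DocProperties add(String name, String value, String scheme) {
        properties.add(new DocProperty(name, value, scheme));
        return this;
    }

    // First value stored under the given name, if any.
    Optional<String> first(String name) {
        for (DocProperty p : properties) {
            if (p.name.equals(name)) {
                return Optional.of(p.value);
            }
        }
        return Optional.empty();
    }

    // All values under the name; Dublin Core elements may repeat.
    List<String> all(String name) {
        List<String> values = new ArrayList<>();
        for (DocProperty p : properties) {
            if (p.name.equals(name)) {
                values.add(p.value);
            }
        }
        return values;
    }
}

final class DocPropertiesDemo {
    public static void main(String[] args) {
        DocProperties props = new DocProperties()
            .add("title", "Quarterly Report", null)
            .add("creator", "A. Author", null)
            .add("creator", "B. Author", null)
            .add("date", "2007-03-15", "W3CDTF");
        System.out.println(props.first("title").orElse("(untitled)"));
        System.out.println(props.all("creator"));
        System.out.println(props.first("publisher").isPresent());
    }
}
```

An annotator that has never heard of a given property just ignores it, which
is what makes an open name-value list friendlier to unknown annotators than
adding fixed features to a shared type.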
Similarly, I think we need to create Type System standards for representing
document structure. For example, how could HTML elements and attributes be
stored in the CAS such that all annotators could depend on them being there and
therefore make intelligent use of them?
And finally, we need some Type System standards for representing certain common
result annotations, such as lexical markup and named entities. How can we
combine two annotators from different companies if they don't have a shared
definition of the data flowing between them?
And isn't this the whole point of UIMA? It appears to me that the UIMA dream
won't come true until we create these standards for data exchange or data
transformation within the CAS.
In my opinion, the current situation really limits the usefulness of UIMA as a
platform for text processing (unless you control every piece of code in the
system, of course).
How do we start such a consortium?
Thanks for listening,
Greg Holmberg