[EMAIL PROTECTED] wrote:
What is the recommended way of storing document properties, such as "author", "date
created", "title", etc?
I also need some data for internal uses, such as the document size and URI.
One other requirement: this is not a closed vertical solution with a known set
of annotators designed to inter-operate. This is an application platform that
will use some known annotators but allow plugging in arbitrary unknown
annotators from other companies (that's why one uses UIMA, of course!). Also,
some of our annotators may be used in UIMA containers from other companies with
unknown annotators. So my code can't depend on either the UIMA container
providing, or all of the other annotators (but possibly our own) knowing about,
any data structure containing these properties.
I see a few possibilities:
1. Add features to DocumentAnnotation
2. Add features to SourceDocumentInformation
3. Create my own annotation or TOP FS.
The documentation recommends not adding features to DocumentAnnotation if you
are using JCas (I am). I agree--what if both my annotators and someone else's
annotator have added features to DA? It just wouldn't work, right?
It's the same with SDI, if two annotators both add features to it. They are in
conflict, and they can't be merged.
SDI is useful however, since it has the document size and URI. Despite it being in a
package called "examples", in truth it's become a standard. All the annotators
that ship with UIMA use it. If you want to use the semantic search (Juru) indexing CAS
Consumer, you have to use SDI. I'm sure many annotators in the world have used SDI.
I would like my annotators and UIMA container to be compatible with all those
annotators. Therefore, I think I have to use SDI for size and URI, but not
modify it.
Creating my own annotation (or is extending TOP FS better?) seems like the best
answer. My UIMA container and set of annotators would know about it, and
others' annotators wouldn't be affected. My annotators would have to
gracefully degrade when running in a UIMA container that doesn't provide this
new annotation.
What are people's thoughts? 1, 2 or 3?
If you use the JCas, as you say you do, definitely 3. There is no need
to use an annotation; extending TOP would be sufficient.
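
The "gracefully degrade" requirement for option 3 might look roughly like the sketch below. It is plain Java and does not use the UIMA API: DocumentMetadata and TitleAwareAnnotator are hypothetical names, and the Optional stands in for "the container did or did not put our custom feature structure in the CAS".

```java
import java.util.Optional;

// Hypothetical holder for the custom document properties (option 3).
// In UIMA this would be a feature structure type extending uima.cas.TOP.
final class DocumentMetadata {
    final String author;
    final String title;

    DocumentMetadata(String author, String title) {
        this.author = author;
        this.title = title;
    }
}

// Hypothetical annotator logic: uses the metadata when the container
// supplied it and falls back to a default when it did not.
final class TitleAwareAnnotator {
    String describe(Optional<DocumentMetadata> metadata, String uri) {
        return metadata
                .map(m -> m.title + " by " + m.author)
                .orElse("Untitled document at " + uri);
    }
}

class DegradeDemo {
    public static void main(String[] args) {
        TitleAwareAnnotator annotator = new TitleAwareAnnotator();
        DocumentMetadata known = new DocumentMetadata("Jane Doe", "Quarterly report");

        // Our own container: the metadata is present.
        System.out.println(annotator.describe(Optional.of(known), "file:/a.txt"));
        // A third-party container: the metadata is absent, so fall back to the URI.
        System.out.println(annotator.describe(Optional.empty(), "file:/a.txt"));
    }
}
```

The point of the sketch is only that the annotator never assumes the custom structure exists; the URI it falls back on would come from SDI in the scenario described above.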
================
Longer term, I think we as a community need to define Type Systems that allow
inter-operability of annotators and CAS Consumers. For example, we could
create an official SourceDocumentInformation that allows arbitrary sets of
document properties as simple name-value pairs. In other words, add this
feature to SourceDocumentInformation:
    properties          uima.cas.FSArray    PropertyFS

    uima.PropertyFS     uima.cas.TOP
        name            uima.cas.String
        value           uima.cas.String
        scheme          uima.cas.String
I'm personally not a big fan of arbitrary attribute-value schemes like
this. You need yet another place (outside the type system) where you
document what the properties are that you define and expect.
And define that names, values, and schemes conform to the Dublin Core Metadata
Initiative standards.
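
To make the name-value proposal concrete, here is a minimal plain-Java sketch of the same shape. It is not UIMA API: Property and DocumentProperties are hypothetical stand-ins for PropertyFS and for an SDI carrying a "properties" array. The example names ("title", "creator", "date") and the "W3CDTF" scheme are taken from Dublin Core as an assumption about what the convention would look like.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Plain-Java stand-in for the proposed PropertyFS (name, value, scheme).
// In UIMA these would be String features on a subtype of uima.cas.TOP.
final class Property {
    final String name;
    final String value;
    final String scheme; // null when no encoding scheme applies

    Property(String name, String value, String scheme) {
        this.name = name;
        this.value = value;
        this.scheme = scheme;
    }
}

// Stand-in for a SourceDocumentInformation carrying a "properties" array.
final class DocumentProperties {
    private final List<Property> properties = new ArrayList<>();

    void add(String name, String value, String scheme) {
        properties.add(new Property(name, value, scheme));
    }

    // Returns the first value stored under the given name, if any.
    Optional<String> first(String name) {
        for (Property p : properties) {
            if (p.name.equals(name)) {
                return Optional.of(p.value);
            }
        }
        return Optional.empty();
    }
}

class PropertyDemo {
    public static void main(String[] args) {
        DocumentProperties doc = new DocumentProperties();
        doc.add("title", "Quarterly report", null);
        doc.add("creator", "Jane Doe", null);
        doc.add("date", "2007-03-01", "W3CDTF");

        System.out.println(doc.first("title").orElse("(untitled)"));
        System.out.println(doc.first("publisher").orElse("(unknown)"));
    }
}
```

Note that the property names here are ordinary strings, so a misspelled name still compiles and simply finds nothing. The agreement on which names exist has to be documented outside the type system, which is the objection raised above.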
Similarly, I think we need to create Type System standards for representing
document structure. For example, how could HTML elements and attributes be
stored in the CAS such that all annotators could depend on them being there and
therefore make intelligent use of them?
And finally, we need some Type System standards for representing certain common
result annotations, such as lexical markup and named entities. How can we
combine two annotators from different companies if they don't have a shared
definition of the data flowing between them?
And isn't this the whole point of UIMA? It appears to me that the UIMA dream
won't come true until we create these standards for data exchange or data
transformation within the CAS.
In my opinion, the current situation really limits the usefulness of UIMA as a
platform for text processing (unless you control every piece of code in the
system, of course).
How do we start such a consortium?
This mailing list is a good start ;-). I know there are others who work
on similar things, but I'll let them speak for themselves.
One issue of course is that it is difficult to agree on any common type
system. It's hard enough to even agree on what an annotation is, let
alone specific types of annotations. We could try to define a certain
base set on Apache. I would hesitate to put more built-in types into
UIMA itself, though. I'd rather have a type system repository where we
modularly define certain kinds of type systems (such as html markup, for
example), and that people can use, or not.
--Thilo
Thanks for listening,
Greg Holmberg