Re: Document "properties" and SourceDocumentInformation

Adam Lally Wed, 28 Feb 2007 06:39:31 -0800

On 2/28/07, Thilo Goetz <[EMAIL PROTECTED]> wrote:

[EMAIL PROTECTED] wrote:
> 1. Add features to DocumentAnnotation
> 2. Add features to SourceDocumentInformation
> 3. Create my own annotation or TOP FS.
>
If you use the JCas, as you say you do, definitely 3.  There is no need
to use an annotation; extending TOP would be sufficient.


I agree.  Adding features to existing types should generally be avoided
if there's an acceptable alternative solution, especially if you're
using JCas.  BTW in 2.1 we've added the ability to index and retrieve
non-Annotation FeatureStructures without having to define a custom
index in your component descriptor, which should make it much more
convenient to use a document metadata Type that extends TOP.

> Longer term, I think we as a community need to define Type Systems
> that allow inter-operability of annotators and CAS Consumers.  For
> example, we could create an official SourceDocumentInformation that
> allows arbitrary sets of document properties as simple name-value
> pairs.  In other words, add this feature to SourceDocumentInformation:
>
>         properties    uima.cas.FSArray    PropertyFS
>
>     uima.PropertyFS   uima.cas.TOP
>         name          uima.cas.String
>         value         uima.cas.String
>         scheme        uima.cas.String

I'm personally not a big fan of arbitrary attribute-value schemes like
this.  You need yet another place (outside the type system) where you
document what the properties are that you define and expect.


Agreed.  Our hope is that the type system would be used for declaring
these things.  There can be more than one Type declared for holding
different kinds of document metadata (e.g. a DublinCoreMetadata  type,
in addition to other types with different properties).

Perhaps it would be useful if these all extended some base
DocumentMetadata type that did not define any features, just to make
it clear that they all represent some kind of document metadata?
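
To make the idea concrete, here is a minimal sketch in plain Java.  It
does not use the UIMA API at all: the classes below are hypothetical
stand-ins for types that would really be declared in a type system
descriptor and generated by JCasGen, and the field names (title,
creator, sourceUrl) and the select helper are assumptions made up for
illustration.  It only shows how a feature-less base type lets a
consumer ask for "all document metadata" or for one specific kind.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a base type that extends uima.cas.TOP
// and declares no features of its own.
abstract class DocumentMetadata {}

// Hypothetical typed metadata: each property is a declared field, so the
// type itself documents which properties exist.
final class DublinCoreMetadata extends DocumentMetadata {
    final String title;
    final String creator;

    DublinCoreMetadata(String title, String creator) {
        this.title = title;
        this.creator = creator;
    }
}

// A second, unrelated kind of metadata sharing the same base type.
final class CrawlMetadata extends DocumentMetadata {
    final String sourceUrl;

    CrawlMetadata(String sourceUrl) {
        this.sourceUrl = sourceUrl;
    }
}

final class MetadataExample {
    // Returns every metadata object that is an instance of the requested kind.
    // Passing DocumentMetadata.class returns all of them.
    static <T extends DocumentMetadata> List<T> select(
            List<DocumentMetadata> all, Class<T> kind) {
        List<T> result = new ArrayList<>();
        for (DocumentMetadata m : all) {
            if (kind.isInstance(m)) {
                result.add(kind.cast(m));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<DocumentMetadata> indexed = new ArrayList<>();
        indexed.add(new DublinCoreMetadata("Some title", "Some author"));
        indexed.add(new CrawlMetadata("http://example.org/doc.html"));

        // All metadata, regardless of kind.
        System.out.println(select(indexed, DocumentMetadata.class).size());
        // Only the Dublin Core metadata, with typed access to its fields.
        System.out.println(select(indexed, DublinCoreMetadata.class).get(0).title);
    }
}
```

The point of the base type is only the first call: a component that
knows nothing about Dublin Core can still find everything that claims
to be document metadata.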

> Similarly, I think we need to create Type System standards for
> representing document structure.  For example, how could HTML elements
> and attributes be stored in the CAS such that all annotators could
> depend on them being there and therefore make intelligent use of them?
>
>
> And finally, we need some Type System standards for representing
> certain common result annotations, such as lexical markup and named
> entities.  How can we combine two annotators from different companies
> if they don't have a shared definition of the data flowing between
> them?
>
>
> And isn't this the whole point of UIMA?  It appears to me that the
> UIMA dream won't come true until we create these standards for data
> exchange or data transformation within the CAS.
>
> In my opinion, the current situation really limits the usefulness of
> UIMA as a platform for text processing (unless you control every piece
> of code in the system, of course).
>
> How do we start such a consortium?

This mailing list is a good start ;-).  I know there are others who work
on similar things, but I'll let them speak for themselves.

One issue of course is that it is difficult to agree on any common type
system.  It's hard enough to even agree on what an annotation is, let
alone specific types of annotations.  We could try to define a certain
base set on Apache.  I would hesitate to put more built-in types into
UIMA itself, though.  I'd rather have a type system repository where we
modularly define certain kinds of type systems (such as html markup, for
example), and that people can use, or not.


Right.  I think it's largely the UIMA users, not the framework
developers, who would have to participate in forging agreements on
common type systems.  So uima-user seems like a good place to have
discussions like that, at least once this list has a larger number of
subscribers, which we hope will happen after the Apache release is out
and people have migrated to it.

Possibly, type systems that have gotten significant support on
uima-user might be included in the Apache UIMA release, initially as
part of the "sandbox".

-Adam
