Document "properties" and SourceDocumentInformation

[EMAIL PROTECTED] Tue, 27 Feb 2007 22:27:10 -0800

What is the recommended way of storing document properties, such as "author", 
"date created", "title", etc?


I also need some data for internal uses, such as the document size and URI.

One other requirement: this is not a closed vertical solution with a known set 
of annotators designed to inter-operate.  This is an application platform that 
will use some known annotators but allow plugging in arbitrary unknown 
annotators from other companies (that's why one uses UIMA, of course!).  Also, 
some of our annotators may be used in UIMA containers from other companies with 
unknown annotators.  So my code can't assume that the UIMA container provides 
any data structure containing these properties, or that any annotators other 
than our own know about it.

I see a few possibilities:

1. Add features to DocumentAnnotation
2. Add features to SourceDocumentInformation
3. Create my own annotation or TOP FS.

The documentation recommends not adding features to DocumentAnnotation if you 
are using JCas (I am).  I agree--what if both my annotators and someone else's 
annotator have added features to DA?  It just wouldn't work, right?

It's the same with SDI, if two annotators both add features to it.  They are in 
conflict, and they can't be merged.

SDI is useful however, since it has the document size and URI.  Despite it 
being in a package called "examples", in truth it's become a standard.  All the 
annotators that ship with UIMA use it.  If you want to use the semantic search 
(Juru) indexing CAS Consumer, you have to use SDI.   I'm sure many annotators 
in the world have used SDI.

I would like my annotators and UIMA container to be compatible with all those 
annotators.  Therefore, I think I have to use SDI for size and URI, but not 
modify it.

Creating my own annotation (or is extending TOP FS better?) seems like the best 
answer.  My UIMA container and set of annotators would know about it, and 
others' annotators wouldn't be affected.  My annotators would have to 
gracefully degrade when running in a UIMA container that doesn't provide this 
new annotation.

What are people's thoughts?  1, 2 or 3?

================

Longer term, I think we as a community need to define Type Systems that allow 
inter-operability of annotators and CAS Consumers.  For example, we could 
create an official SourceDocumentInformation that allows arbitrary sets of 
document properties as simple name-value pairs.  In other words, add this 
feature to SourceDocumentInformation:

        properties    uima.cas.FSArray    PropertyFS

    uima.PropertyFS    uima.cas.TOP
        name          uima.cas.String
        value         uima.cas.String
        scheme        uima.cas.String

And define that names, values, and schemes conform to the Dublin Core Metadata 
Initiative standards.
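
To make the name-value idea concrete, here is a rough sketch in plain Java.  It 
does not use the UIMA API at all; the Property class is only a hypothetical 
stand-in for the proposed PropertyFS, and the class name, the find() helper, 
and the sample values are my own assumptions for illustration.  It shows a 
consumer looking up a property by name and falling back to a default when the 
property isn't there.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class DocumentProperties {

    /** Hypothetical plain-Java stand-in for the proposed PropertyFS. */
    public static final class Property {
        public final String name;    // e.g. a Dublin Core element name
        public final String value;   // the property value, as a string
        public final String scheme;  // encoding scheme for the value, "" if none

        public Property(String name, String value, String scheme) {
            this.name = name;
            this.value = value;
            this.scheme = scheme;
        }
    }

    /** Returns the value of the first property with the given name, if any. */
    public static Optional<String> find(List<Property> properties, String name) {
        for (Property p : properties) {
            if (p.name.equals(name)) {
                return Optional.of(p.value);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Sample data only; names follow Dublin Core element names.
        List<Property> props = Arrays.asList(
                new Property("creator", "Greg Holmberg", ""),
                new Property("date", "2007-02-27", "W3CDTF"));

        // A consumer that knows about "creator" reads it directly.
        System.out.println(find(props, "creator").orElse("(unknown)"));

        // A consumer asking for a missing property degrades to a default.
        System.out.println(find(props, "title").orElse("(unknown)"));
    }
}
```

The point of the flat list is that an annotator needs to agree only on the 
property names it cares about, and can ignore the rest.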


Similarly, I think we need to create Type System standards for representing 
document structure.  For example, how could HTML elements and attributes be 
stored in the CAS such that all annotators could depend on them being there and 
therefore make intelligent use of them?


And finally, we need some Type System standards for representing certain common 
result annotations, such as lexical markup and named entities.  How can we 
combine two annotators from different companies if they don't have a shared 
definition of the data flowing between them?


And isn't this the whole point of UIMA?  It appears to me that the UIMA dream 
won't come true until we create these standards for data exchange or data 
transformation within the CAS.

In my opinion, the current situation really limits the usefulness of UIMA as a 
platform for text processing (unless you control every piece of code in the 
system, of course).

How do we start such a consortium?

Thanks for listening,


Greg Holmberg
