Re: Document "properties" and SourceDocumentInformation

Thilo Goetz Wed, 28 Feb 2007 03:57:21 -0800

[EMAIL PROTECTED] wrote:

What is the recommended way of storing document properties, such as "author", "date 
created", "title", etc?


I also need some data for internal uses, such as the document size and URI.

One other requirement: this is not a closed vertical solution with a known set 
of annotators designed to inter-operate.  This is an application platform that 
will use some known annotators but allow plugging in arbitrary unknown 
annotators from other companies (that's why one uses UIMA, of course!).  Also, 
some of our annotators may be used in UIMA containers from other companies with 
unknown annotators.  So my code can't depend on either the UIMA container 
providing, or all of the other annotators (but possibly our own) knowing about, 
any data structure containing these properties.

I see a few possibilities:

1. Add features to DocumentAnnotation
2. Add features to SourceDocumentInformation
3. Create my own annotation or TOP FS.

The documentation recommends not adding features to DocumentAnnotation if you 
are using JCas (I am).  I agree--what if both my annotators and someone else's 
annotator have added features to DA?  It just wouldn't work, right?

It's the same with SDI, if two annotators both add features to it.  They are in 
conflict, and they can't be merged.

SDI is useful, however, since it has the document size and URI.  Despite being in a 
package called "examples", in truth it has become a standard.  All the annotators 
that ship with UIMA use it.  If you want to use the semantic search (Juru) indexing CAS 
Consumer, you have to use SDI.  I'm sure many annotators in the world have used SDI.

I would like my annotators and UIMA container to be compatible with all those 
annotators.  Therefore, I think I have to use SDI for size and URI, but not 
modify it.

Creating my own annotation (or is extending TOP FS better?) seems like the best 
answer.  My UIMA container and set of annotators would know about it, and 
others' annotators wouldn't be affected.  My annotators would have to 
gracefully degrade when running in a UIMA container that doesn't provide this 
new annotation.

What are people's thoughts?  1, 2 or 3?

If you use the JCas, as you say you do, definitely 3. There is no need to use an annotation, extending TOP would be sufficient.
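
To make the "gracefully degrade" part of option 3 concrete, here is a rough plain-Java sketch. It deliberately does not use the UIMA API: ToyCas, DocumentProperties and the type name com.example.DocumentProperties are all made-up stand-ins (in a real setup DocumentProperties would be a JCas type whose supertype is uima.cas.TOP). It only shows the shape of the check an annotator would make before relying on the custom type.

```java
import java.util.HashMap;
import java.util.Map;

public class DegradeSketch {

    /** Hypothetical stand-in for a custom type that would extend uima.cas.TOP. */
    public static final class DocumentProperties {
        public final String author;
        public final String title;

        public DocumentProperties(String author, String title) {
            this.author = author;
            this.title = title;
        }
    }

    /** Toy container (NOT the UIMA CAS): one stored object per type name. */
    public static final class ToyCas {
        private final Map<String, Object> byType = new HashMap<>();

        public void put(String typeName, Object fs) {
            byType.put(typeName, fs);
        }

        public Object get(String typeName) {
            return byType.get(typeName);
        }
    }

    /** Assumed name of the custom type; purely illustrative. */
    public static final String TYPE_NAME = "com.example.DocumentProperties";

    /** Uses the custom type when the container provides it, otherwise falls back. */
    public static String authorOrDefault(ToyCas cas) {
        Object fs = cas.get(TYPE_NAME);
        if (fs instanceof DocumentProperties) {
            return ((DocumentProperties) fs).author;
        }
        return "unknown"; // container did not supply the type: degrade gracefully
    }

    public static void main(String[] args) {
        ToyCas foreignContainer = new ToyCas(); // knows nothing about our type
        ToyCas ownContainer = new ToyCas();
        ownContainer.put(TYPE_NAME, new DocumentProperties("Jane Doe", "Report"));

        System.out.println(authorOrDefault(foreignContainer)); // prints "unknown"
        System.out.println(authorOrDefault(ownContainer));     // prints "Jane Doe"
    }
}
```

Since the type here is a plain object rather than an annotation, it carries no begin/end offsets, which matches the point that extending TOP is enough for document-level properties.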


================

Longer term, I think we as a community need to define Type Systems that allow 
inter-operability of annotators and CAS Consumers.  For example, we could 
create an official SourceDocumentInformation that allows arbitrary sets of 
document properties as simple name-value pairs.  In other words, add this 
feature to SourceDocumentInformation:

        properties    uima.cas.FSArray    PropertyFS

    uima.PropertyFS    uima.cas.TOP
        name      uima.cas.String
        value     uima.cas.String
        scheme    uima.cas.String

I'm personally not a big fan of arbitrary attribute-value schemes like this. You need yet another place (outside the type system) where you document what the properties are that you define and expect.
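
A small plain-Java sketch of what that objection looks like in practice. The Property class only mirrors the name/value/scheme shape of the proposed PropertyFS; it is not UIMA code, and the scheme label "DC", the property names and the sample values are assumptions made up for the illustration. The point is that a consumer asking for "author" when the producer stored "creator" gets nothing back, and nothing in the type system can flag the mismatch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class PropertySketch {

    /** Plain-Java stand-in for the proposed PropertyFS (name, value, scheme). */
    public static final class Property {
        public final String name;
        public final String value;
        public final String scheme;

        public Property(String name, String value, String scheme) {
            this.name = name;
            this.value = value;
            this.scheme = scheme;
        }
    }

    /** Returns the value of the first property matching scheme and name, if any. */
    public static Optional<String> lookup(List<Property> properties,
                                          String scheme, String name) {
        for (Property p : properties) {
            if (p.scheme.equals(scheme) && p.name.equals(name)) {
                return Optional.of(p.value);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Producer side: names chosen by convention, not enforced anywhere.
        List<Property> props = Arrays.asList(
                new Property("creator", "Jane Doe", "DC"),
                new Property("title", "Quarterly Report", "DC"));

        // Consumer that follows the same convention finds the value.
        System.out.println(lookup(props, "DC", "creator").orElse("(unknown)")); // Jane Doe

        // Consumer that guessed a different name silently gets nothing.
        System.out.println(lookup(props, "DC", "author").orElse("(unknown)"));  // (unknown)
    }
}
```

With a dedicated feature per property, the same mistake would show up as a missing feature in the type system rather than as an empty lookup at run time.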


And define that names, values, and schemes conform to the Dublin Core Metadata 
Initiative standards.


Similarly, I think we need to create Type System standards for representing 
document structure.  For example, how could HTML elements and attributes be 
stored in the CAS such that all annotators could depend on them being there and 
therefore make intelligent use of them?


And finally, we need some Type System standards for representing certain common 
result annotations, such as lexical markup and named entities.  How can we 
combine two annotators from different companies if they don't have a shared 
definition of the data flowing between them?


And isn't this the whole point of UIMA?  It appears to me that the UIMA dream 
won't come true until we create these standards for data exchange or data 
transformation within the CAS.

In my opinion, the current situation really limits the usefulness of UIMA as a 
platform for text processing (unless you control every piece of code in the 
system, of course).

How do we start such a consortium?

This mailing list is a good start ;-). I know there are others who work on similar things, but I'll let them speak for themselves.

One issue of course is that it is difficult to agree on any common type system. It's hard enough to even agree on what an annotation is, let alone specific types of annotations. We could try to define a certain base set on Apache. I would hesitate to put more built-in types into UIMA itself, though. I'd rather have a type system repository where we modularly define certain kinds of type systems (such as HTML markup, for example), and that people can use, or not.


--Thilo


Thanks for listening,


Greg Holmberg
